Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

Statistical Approaches to testing for linearity in regression problems in Astrophysics with particular application to the Extra-Galactic distance scale Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016


Transcript of Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

Page 1: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

Statistical Approaches to testing for linearity in regression problems in Astrophysics with particular application to the Extra-Galactic distance scale

Shashi M. Kanbur, SUNY Oswego

Workshop on Best Practices in Astro-Statistics, IUCAA, January 2016

Page 2: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

Collaborators and Funding

H. P. Singh, R. Gupta, C. Ngeow, L. Macri, A. Bhardwaj, S. Das, R. Kundu, S. Deb, A. Nanthakumar

NSF, IUSSTF, IUCAA, Delhi University, SUNY Oswego

Website: http://www.oswego.edu/~kanbur/iucaa2016/ and http://www.oswego.edu/~kanbur/DU2014

Page 3: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

Linear Regression

A very common type of model in science. The data are $(x_i, y_i)$, $i = 1, \ldots, N$.

$y_i = a + b x_i + \varepsilon_i$, where $x_i$, $y_i$ are the independent and dependent variables, respectively, $a$, $b$ are the intercept and slope, respectively, and $\varepsilon_i$ is the error.

The error model is usually $\varepsilon_i \sim N(0, \sigma^2)$.

We are interested in testing hypotheses on the slope $b$.

Page 4: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

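Linear Regression

Least squares estimates of the intercept and slope are given by

$$\hat{b} = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{N}(x_i-\bar{x})^2}, \qquad \hat{a} = \bar{y} - \hat{b}\,\bar{x}$$

with standard errors given by

$$s_{\hat{b}} = \sqrt{\frac{\frac{1}{N-2}\sum_{i=1}^{N}\hat{\varepsilon}_i^{\,2}}{\sum_{i=1}^{N}(x_i-\bar{x})^2}}, \qquad s_{\hat{a}} = s_{\hat{b}}\sqrt{\frac{1}{N}\sum_{i=1}^{N}x_i^2}$$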
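As a minimal sketch of the estimates and standard errors on this slide, the following uses only the Python standard library. The function name `ols_fit` and the five data points are illustrative assumptions, not part of the talk.

```python
import math

def ols_fit(x, y):
    """Least squares intercept a, slope b and their standard errors."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx                      # slope estimate
    a = ybar - b * xbar                # intercept estimate
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_b = math.sqrt(rss / (n - 2) / sxx)                   # standard error of b
    s_a = s_b * math.sqrt(sum(xi ** 2 for xi in x) / n)    # standard error of a
    return a, b, s_a, s_b

# Made-up data, for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]
a, b, s_a, s_b = ols_fit(x, y)
print(a, b, s_a, s_b)
```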

Page 5: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

Linear Regression

Interested in testing whether the following model is better:

$H_0: b = b_0$ vs. $H_A: b = b_1$ for $x \le x_0$, $b = b_2$ for $x > x_0$.

That is, there is a change of slope at $x_0$, the break point.

Can fit regression lines to data on both sides of the break point, with slope estimates $\hat{b}_1$, $\hat{b}_2$.

Page 6: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

Linear Regression

The standard way to "check" this is by looking at the intervals

$$(\hat{b}_1 \pm m\, s_{\hat{b}_1}), \qquad (\hat{b}_2 \pm m\, s_{\hat{b}_2})$$

and seeing if they are mutually exclusive (that is, they do not overlap).

This essentially puts confidence intervals around the slope estimates. Depending on the choice of $m$, this says that the probability that the true slope is in the interval above is $1-\alpha$, or that the probability of an error is $\alpha$.

Page 7: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

Linear Regression

Let A = {"short period" slope is wrong} and B = {"long period" slope is wrong}, each with probability $\alpha$.

In comparing the long and short period slopes, the probability of at least one mistake is

$$P(A \cup B) = P(A) + P(B) - P(A \cap B) = 2\alpha - \alpha^2,$$

where $P(A \cap B) = \alpha^2$ assumes that the two errors are independent.

If $0 < \alpha < 1$, then $2\alpha - \alpha^2 > \alpha$.

A single statistical test carried out at significance level $\alpha$, such as those outlined in this talk, therefore has a smaller chance of making an error than the comparison of two intervals.
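To make the size of the effect concrete, here is a small sketch that evaluates $2\alpha - \alpha^2$ for a few common significance levels. It assumes, as the formula does, that the two errors are independent; the function name is hypothetical.

```python
def prob_at_least_one_error(alpha):
    """P(A or B) when A and B are independent and each has probability alpha."""
    return 2 * alpha - alpha ** 2

for alpha in (0.01, 0.05, 0.10):
    # The combined error probability is always larger than alpha itself.
    print(alpha, prob_at_least_one_error(alpha))
```

For $\alpha = 0.05$ the chance of at least one mistake is 0.0975, nearly double the nominal level.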

Page 8: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

F Test

Perhaps the simplest way to test for nonlinearity is to use the F test:

$$F = \left(\frac{RSS_R - RSS_F}{RSS_F}\right)\left(\frac{\nu_F}{\nu_R - \nu_F}\right)$$

Refer this statistic to the theoretical F distribution $F(\nu_R - \nu_F, \nu_F)$, where the subscripts R and F stand for the reduced and full models, respectively, $\nu$ stands for the degrees of freedom and RSS stands for the residual sum of squares.

The test assumes normally distributed errors, homoskedasticity and IID observations, so normality and heteroskedasticity need to be checked.
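A minimal sketch of the F statistic for a break at a known point `x0`. It assumes that the reduced model is a single straight line ($\nu_R = N - 2$) and that the full model is two unconstrained lines, one on each side of `x0` ($\nu_F = N - 4$); the slide does not spell out the full model, so this choice, the function names and the synthetic data are all illustrative.

```python
def rss_line(x, y):
    """Residual sum of squares of a least squares straight-line fit."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))

def f_statistic(x, y, x0):
    """F statistic comparing one line (reduced) with two lines split at x0 (full)."""
    lo = [(xi, yi) for xi, yi in zip(x, y) if xi <= x0]
    hi = [(xi, yi) for xi, yi in zip(x, y) if xi > x0]
    rss_r = rss_line(x, y)
    rss_f = rss_line(*zip(*lo)) + rss_line(*zip(*hi))
    nu_r, nu_f = len(x) - 2, len(x) - 4
    return ((rss_r - rss_f) / rss_f) * (nu_f / (nu_r - nu_f))

# Synthetic data: slope 1 below x = 10, slope 3 above, plus a small deterministic scatter.
x = [float(i) for i in range(20)]
noise = [0.1 if i % 2 == 0 else -0.1 for i in range(20)]
y = [(xi if xi < 10 else 10 + 3 * (xi - 10)) + e for xi, e in zip(x, noise)]
f_obs = f_statistic(x, y, x0=9.5)
print(f_obs)
```

The printed value would then be compared with the $F(2, N-4)$ distribution, for example from tables or a statistics library.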

Page 9: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

Normality/Heteroskedasticity

Data $(X_i, Y_i)$ with residuals $\varepsilon_i$: $Y_i = Y^f_i + \varepsilon_i$, where $Y^f_i$ is the fitted value.

Permute the residuals without replacement (the bootstrap resamples with replacement): $\varepsilon^n_i = \varepsilon_j$.

Form new data $Y^n_i = Y^f_i + \varepsilon^n_i$.

With $(X_i, Y^n_i)$ compute the F statistic. Repeat to obtain a set of values $F_i$.

Find the proportion of the $F_i$ that are greater than the observed value of F.

Heteroskedasticity: plot the residuals against the independent variable. Try a transformation, perhaps log.
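The permutation procedure above can be sketched as follows. This is an illustration under stated assumptions: the fitted values and residuals are taken from the single-line (null) model, the F statistic compares one line with two unconstrained lines split at a known `x0`, and the names, seed and synthetic data are hypothetical.

```python
import random

def fit_line(x, y):
    """Least squares intercept and slope."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b * xbar, b

def rss_line(x, y):
    a, b = fit_line(x, y)
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))

def f_statistic(x, y, x0):
    """One line (reduced) against two lines split at x0 (full)."""
    lo = [(xi, yi) for xi, yi in zip(x, y) if xi <= x0]
    hi = [(xi, yi) for xi, yi in zip(x, y) if xi > x0]
    rss_r = rss_line(x, y)
    rss_f = rss_line(*zip(*lo)) + rss_line(*zip(*hi))
    nu_r, nu_f = len(x) - 2, len(x) - 4
    return ((rss_r - rss_f) / rss_f) * (nu_f / (nu_r - nu_f))

def permutation_f_test(x, y, x0, n_perm=2000, seed=0):
    """Observed F and the fraction of permuted F values that exceed it."""
    a, b = fit_line(x, y)
    fitted = [a + b * xi for xi in x]             # Y^f_i under the null model
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    f_obs = f_statistic(x, y, x0)
    rng = random.Random(seed)
    n_greater = 0
    for _ in range(n_perm):
        perm = resid[:]
        rng.shuffle(perm)                         # permutation without replacement
        y_new = [fi + ei for fi, ei in zip(fitted, perm)]
        if f_statistic(x, y_new, x0) > f_obs:
            n_greater += 1
    return f_obs, n_greater / n_perm

# Synthetic data with a change of slope at x = 10.
x = [float(i) for i in range(20)]
noise = [0.1 if i % 2 == 0 else -0.1 for i in range(20)]
y = [(xi if xi < 10 else 10 + 3 * (xi - 10)) + e for xi, e in zip(x, noise)]
f_obs, p_value = permutation_f_test(x, y, x0=9.5)
print(f_obs, p_value)
```

Because the reference distribution is built from the data, this version of the test does not rely on the residuals being normally distributed.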

Page 10: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

Testing for Normality

Data $(X_i, Y_i)$, $i = 1, \ldots, N$.

Quantiles: compute the empirical distribution function $F_n(u) = (\#\, Y_i \le u)/N$ and compare with that expected from a normal distribution.

If the data are from a normal distribution, the q-q plot should be close to a straight line.

Page 11: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

Random Walk Methods

Order the independent variable: $x_1 < x_2 < \ldots < x_N$.

If $r_k$ is the kth residual from a linear regression, then define the partial sums

$$C(j) = \sum_{k=1}^{j} r_k$$

If the data are consistent with a single linear regression, then the $C(j)$ are a simple random walk.

Our test statistic, R, is the vertical range of the $C(j)$:

$$R = \max[C(j)] - \min[C(j)]$$

Page 12: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

Random Walk Methods

If the partial sums are a random walk, R will be small.

Permute the $r_k$ so that you randomize the residuals, then recompute R. Repeat this procedure for a large number (~10000) of permutations. The significance statistic is the fraction of the permuted R statistics that are greater than the observed value of R: this is the significance level under the null hypothesis of linearity.

This is a non-parametric test and does not depend on normality of the errors.
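A minimal sketch of the random walk test as described on these two slides: sort by the independent variable, form the partial sums of the residuals, take their range and compare it with the ranges from permuted residuals. Function names, the seed and the synthetic data are illustrative assumptions.

```python
import random

def sorted_residuals(x, y):
    """Residuals of a straight-line fit, ordered by increasing x."""
    pairs = sorted(zip(x, y))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((xi - xbar) ** 2 for xi in xs)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys)) / sxx
    a = ybar - b * xbar
    return [yi - a - b * xi for xi, yi in zip(xs, ys)]

def range_statistic(res):
    """R = max C(j) - min C(j), where C(j) is the jth partial sum of the residuals."""
    total, partial = 0.0, []
    for r in res:
        total += r
        partial.append(total)
    return max(partial) - min(partial)

def random_walk_test(x, y, n_perm=10000, seed=0):
    """Observed R and the fraction of permuted R values greater than it."""
    res = sorted_residuals(x, y)
    r_obs = range_statistic(res)
    rng = random.Random(seed)
    n_greater = 0
    for _ in range(n_perm):
        perm = res[:]
        rng.shuffle(perm)
        if range_statistic(perm) > r_obs:
            n_greater += 1
    return r_obs, n_greater / n_perm

# Synthetic data with a change of slope at x = 10.
x = [float(i) for i in range(20)]
noise = [0.1 if i % 2 == 0 else -0.1 for i in range(20)]
y = [(xi if xi < 10 else 10 + 3 * (xi - 10)) + e for xi, e in zip(x, noise)]
r_obs, p_value = random_walk_test(x, y)
print(r_obs, p_value)
```

A small fraction indicates that the ordered residuals drift systematically, which is what a change of slope produces.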

Page 13: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

Testimator (Test Estimator)

Sort the data in order of increasing independent variable.

Divide the sample into $N_1$ different non-overlapping and hence completely independent datasets. Each subset has n data points and the remaining data points are included in the last subset.

We fit a linear regression to the first subset and determine an initial slope estimate, β'.

Page 14: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

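Testimator

This initial estimate of the slope becomes $\beta_0$ in the next subset under the null hypothesis that the slope of the second subset is equal to the slope of the first subset. We calculate the t-statistic such that

$$t_{obs} = \frac{\beta' - \beta_0}{\sqrt{MSE/S_{xx}}}, \qquad MSE = \frac{1}{N-2}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2, \qquad S_{xx} = \sum_{i=1}^{N}(x_i - \bar{x})^2,$$

where β' here is the slope fitted to the subset being tested and the sums run over the points of that subset.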

Page 15: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

Testimator

Since there will be $n_g = n - 1$ hypothesis tests, the critical t value will be of Bonferroni type,

$$t_c = t_{\alpha/(2 n_g),\,\nu},$$

and ν is the number of data points in each subset. Once we know the observed and critical values of the t-statistic, we determine

$$k = \frac{|t_{obs}|}{t_c},$$

which is the probability that the initial testimator guess is true. If the value of k < 1, the null hypothesis is accepted and we derive the new testimator slope for the next subset using the previously determined β's such that

$$\beta_w = k\hat{\beta} + (1-k)\beta_0,$$

a weighted average of the slope $\hat{\beta}$ fitted to the current subset (written β' above) and $\beta_0$.

Page 16: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

Testimator

This value of the testimator is taken as $\beta_0$ for the next subset. This process of hypothesis testing is repeated $n_g$ times or until the value of k > 1, suggesting rejection of the null hypothesis, that is, the data are more consistent with a non-linear relation.

Page 17: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

The Extra-Galactic Distance Scale

$\mu = m - M$

$\mu = m - (a + b \log P)$

Calibrating galaxy: observe Cepheids and determine $M = a + b \log P$.

Target galaxy: observe Cepheids $m_i$, $i = 1, \ldots, N$, so $\mu_i = m_i - (a + b \log P_i)$.

$y = Lq$, where $y = (m_1, m_2, \ldots, m_N)$, $q = (a, b, \mu_1, \mu_2, \ldots, \mu_N)$ is the vector of unknowns and L is an $N \times (N+2)$ matrix containing 1's and $\log P_i$'s.

Page 18: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

The Extra-Galactic Distance Scale

This is a vector equation for the q's and is easily solvable using the General Linear Model interface in R.

Minimizing $\chi^2 = (y - Lq)^T C^{-1} (y - Lq)$ yields the MLE estimator for q, where C is the matrix of measurement errors.

This is the weighted least squares estimate when the errors are normally distributed.

$q' = (L^T C^{-1} L)^{-1} L^T C^{-1} y$, and the covariance matrix of the parameters in q' is $(L^T C^{-1} L)^{-1}$; the standard errors are the square roots of its diagonal elements.

If you formulate your statistical data analysis problems in this General Linear Model formalism, it's very easy to solve in R along with a full error analysis.
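As a small worked instance of $q' = (L^T C^{-1} L)^{-1} L^T C^{-1} y$, the sketch below fits only the calibrating relation $M = a + b \log P$, so L has the two columns $(1, \log P_i)$, and it assumes a diagonal C built from per-star uncertainties. The periods, magnitudes and uncertainties are made-up illustrative numbers, and the function name is hypothetical.

```python
import math

def weighted_line_fit(logp, mag, sigma):
    """Weighted least squares for mag = a + b*logp with C = diag(sigma^2)."""
    w = [1.0 / s ** 2 for s in sigma]                     # diagonal of C^-1
    # Entries of the 2x2 matrix L^T C^-1 L.
    s0 = sum(w)
    s1 = sum(wi * xi for wi, xi in zip(w, logp))
    s2 = sum(wi * xi * xi for wi, xi in zip(w, logp))
    # Entries of the vector L^T C^-1 y.
    t0 = sum(wi * yi for wi, yi in zip(w, mag))
    t1 = sum(wi * xi * yi for wi, xi, yi in zip(w, logp, mag))
    det = s0 * s2 - s1 * s1
    cov = [[s2 / det, -s1 / det],                         # (L^T C^-1 L)^-1
           [-s1 / det, s0 / det]]
    a = cov[0][0] * t0 + cov[0][1] * t1
    b = cov[1][0] * t0 + cov[1][1] * t1
    return (a, b), cov

# Illustrative values only: magnitudes lie exactly on M = -1.4 - 2.8 log P.
logp = [0.5, 0.8, 1.0, 1.3, 1.6]
mag = [-1.4 - 2.8 * lp for lp in logp]
sigma = [0.05, 0.10, 0.05, 0.10, 0.05]
(a, b), cov = weighted_line_fit(logp, mag, sigma)
s_a, s_b = math.sqrt(cov[0][0]), math.sqrt(cov[1][1])     # standard errors
print(a, b, s_a, s_b)
```

The same pattern extends to the full vector of unknowns once L and C are written out, at which point a linear algebra library or R becomes the more practical tool.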

Page 19: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

The Extra-Galactic Distance Scale and Bayes

Bayesian GLM formalism applied to the estimate of $H_0$.

Page 20: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

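Segmented Lines and the Davies Test

The model is

$$Y = a_s + b_s X + \psi(X)\,\Delta a\,(X - X_b),$$

where $\Delta a = a_L - a_S$, $\psi(X) = 0$ for $X < X_b$ and $\psi(X) = 1$ for $X \ge X_b$. Since $\Delta a$ multiplies $(X - X_b)$, it acts as the change in slope at the break point $X_b$.

This assumes a continuous transition between the two linear models. A more general situation, perhaps with a discontinuity, is

$$Y = a_s + b_s X + \psi(X)\,[\Delta a\,(X - X_b) - \gamma],$$

where γ represents the magnitude of the gap.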

Page 21: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

Segmented Lines

Choose an initial break point $X_b'$ and then fit the other parameters in the equation.

Estimate a new break point, $X_b'' = X_b' + \gamma/\Delta a$.

Repeat until γ ≈ 0.
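The iteration on this slide can be sketched as follows. It is an illustration, not the author's code: for a trial break point the four coefficients $(a_s, b_s, \Delta a, \gamma)$ are fitted by ordinary least squares through the normal equations, and the break point is then moved by $\gamma/\Delta a$. Function names, tolerances and the noise-free data are hypothetical, and convergence is not guaranteed for noisy data or a poor starting value.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def least_squares(rows, y):
    """Least squares coefficients from the normal equations."""
    p = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    aty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(ata, aty)

def find_break(x, y, xb, tol=1e-8, max_iter=50):
    """Iterate Xb'' = Xb' + gamma / delta_a until gamma is close to zero."""
    for _ in range(max_iter):
        rows = []
        for xi in x:
            psi = 1.0 if xi >= xb else 0.0
            # Regressors for a_s, b_s, delta_a and gamma, respectively.
            rows.append([1.0, xi, psi * (xi - xb), -psi])
        a_s, b_s, delta_a, gamma = least_squares(rows, y)
        xb = xb + gamma / delta_a
        if abs(gamma) < tol:
            break
    return xb

# Noise-free example: slope 1 below X = 4.5, slope 3 above.
x = [float(i) for i in range(10)]
y = [xi + 2.0 * max(xi - 4.5, 0.0) for xi in x]
xb_est = find_break(x, y, xb=3.5)
print(xb_est)
```

Each update places the new break point where the two fitted lines intersect, which is why γ shrinks to zero once the break point stops moving.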

Page 22: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

Cepheid PL Relations

Page 23: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

Cepheid PC Relations

Page 24: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

Multiphase PL Relations

Page 25: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

Multiwavelength PL Relations

Page 26: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

Galactic PL Relations

Page 27: Shashi M. Kanbur SUNY Oswego Workshop on Best Practices in Astro-Statistics IUCAA January 2016.

ExtraGalactic PL Relations