Biostatistics-Lecture 10 Regression

55
Biostatistics-Lecture 10 Regression Ruibin Xi Peking University School of Mathematical Sciences

Transcript of Biostatistics-Lecture 10 Regression

Page 1: Biostatistics-Lecture 10 Regression

Biostatistics-Lecture 10 Regression

Ruibin Xi

Peking University

School of Mathematical Sciences

Page 2: Biostatistics-Lecture 10 Regression

Analysis of Variance (ANOVA)

• Consider the Iris data again

• Want to see if the average sepal widths of the three species are the same

– μ1 , μ2, μ3 : the mean sepal width of Setosa, Versicolor, Virginica

– Hypothesis:

H0: μ1 = μ2= μ3

H1: at least one mean is different

Page 3: Biostatistics-Lecture 10 Regression

Analysis of Variance (ANOVA)

• Used to compare ≥ 2 means

• Definitions

– Response variable (dependent)—the outcome of interest, must be continuous

– Factors (independent)—variables by which the groups are formed and whose effect on response is of interest, must be categorical

– Factor levels—possible values the factors can take

Page 4: Biostatistics-Lecture 10 Regression

Sources of Variation in One-Way ANOVA

• Partition the total variability of the outcome into components—source of variation

– the sepal width of the jth plant from the ith species (group)

jiy , jnjki 1,1

)()( yyyyyy iiijij

Grand mean The ith group mean

Page 5: Biostatistics-Lecture 10 Regression

Sources of Variation in One-Way ANOVA

• SST: sum of squares total

• SSB: sum of squares between

• SSW (SSE): sum of squares within (error)

2

1 1 k

i

n

j ij

i

yySSWSSBSST

2

1 k

i ii yynSSB

2

1 1 k

i

n

j iij

j

yySSW

Page 6: Biostatistics-Lecture 10 Regression

F-test in one-way ANOVA

• The test statistic is called F-statistic

Under the null hypothesis, follows an F-distribution with

(df1,df2) = (k-1,n-k)

• For the Iris data – SSB=11.34, MSB = 5.67, SSE=16.96, MSE=0.12 – f = 49.16, df1=2,df2=147 – Critical value 3.06 at α=0.05, reject the null – Pvalue = P(F>f)=4.49e-17

)/(

)1/(

knSSE

kSSB

MSE

MSBF

Page 7: Biostatistics-Lecture 10 Regression

One-way ANOVA

• ANOVA table

Page 8: Biostatistics-Lecture 10 Regression

One-way ANOVA

• ANOVA table

Page 9: Biostatistics-Lecture 10 Regression

ANOVA model

• The statistical model

Yij = μ + αi + eij

The ith response in the jth group

grand mean

The effect of group j

error

Page 10: Biostatistics-Lecture 10 Regression

ANOVA assumptions

• Normality

• Homogeneity

• Independence

Page 11: Biostatistics-Lecture 10 Regression

Regression—an example

• Cystic fibrosis (囊胞性纤维症) lung function data

– PEmax (maximal static expiratory pressure) is the response variable

– Potential explanatory variables

• age, sex, height, weight,

• BMP (body mass as a percentage of the age‐specific median)

• FEV1 (forced expiratory volume in 1 second)

• RV (residual volume)

• FRC(funcAonal residual capacity)

• TLC (total lung capacity)

Page 12: Biostatistics-Lecture 10 Regression

Regression—an example

• Let’s first concentrate on the age variable

• The model

• Plot PEmax vs age

Page 13: Biostatistics-Lecture 10 Regression

Regression—an example

• Let’s first concentrate on the age variable

• The model

• Plot PEmax vs age

Page 14: Biostatistics-Lecture 10 Regression

Simple Linear regression

Page 15: Biostatistics-Lecture 10 Regression

Assumptions

• Normality

– Given x, the distribution of y is normal with mean α+βx with standard deviation σ

• Homogeneity

– σ does not depend on x

• Independence

Page 16: Biostatistics-Lecture 10 Regression

Residuals

Page 17: Biostatistics-Lecture 10 Regression

Fitting the model

Page 18: Biostatistics-Lecture 10 Regression

Fitting the model

Page 19: Biostatistics-Lecture 10 Regression

Goodness of Fit

Page 20: Biostatistics-Lecture 10 Regression

Inference about β

Page 21: Biostatistics-Lecture 10 Regression

Inference about β

Page 22: Biostatistics-Lecture 10 Regression

Inference about β: the CF data

Page 23: Biostatistics-Lecture 10 Regression

Plotting the regression line

Page 24: Biostatistics-Lecture 10 Regression

R2

Page 25: Biostatistics-Lecture 10 Regression

Residual plot

Page 26: Biostatistics-Lecture 10 Regression

Residual plot

Page 27: Biostatistics-Lecture 10 Regression

Residual plot

Page 28: Biostatistics-Lecture 10 Regression

Residual plot

• The CF patients data

Page 29: Biostatistics-Lecture 10 Regression

Linear Regression

Page 30: Biostatistics-Lecture 10 Regression

Summary: simple linear regression

Page 31: Biostatistics-Lecture 10 Regression

Multiple regression

• See blackboard

Page 32: Biostatistics-Lecture 10 Regression

GENERALIZED LINEAR MODELS Regression

Page 33: Biostatistics-Lecture 10 Regression

Generalized liner models

• GLM allow for response distributions other than normal

– Basic structure

• The distribution of Y is usually assumed to be independent and

Page 34: Biostatistics-Lecture 10 Regression

Generalized Linear model-an example

• An example

– A study investigated the roadkills of amphibian

• Response variable: the total number of amphibian fatalities per segment (500m)

• Explanatory variables

Page 35: Biostatistics-Lecture 10 Regression

Generalized Linear model-an example

• An example

– A study investigated the roadkills of amphibian

• Response variable: the total number of amphibian fatalities per segment (500m)

• Explanatory variables

Page 36: Biostatistics-Lecture 10 Regression

Generalized Linear model-an example

• An example

– A study investigated the roadkills of amphibian

• Response variable: the total number of amphibian fatalities per segment (500m)

• Explanatory variables

• For now, we are only interested in Distance to Natural Park

Page 37: Biostatistics-Lecture 10 Regression

Generalized Linear model-an example

• An example

– A study investigated the roadkills of amphibian

• Response variable: the total number of amphibian fatalities per segment (500m)

• Explanatory variables

• For now, we are only interested in Distance to Natural Park

Page 38: Biostatistics-Lecture 10 Regression

Generalized Linear model-an example

• An over-simplified model

• For Poisson we have

Page 39: Biostatistics-Lecture 10 Regression

GLM-exponential families

• Exponential families

– The density

– Normal distributions is an exponential family

Page 40: Biostatistics-Lecture 10 Regression

GLM-exponential families

• Consider the log-likelihood of a general exponential families

Since

Page 41: Biostatistics-Lecture 10 Regression

GLM-exponential families

• Differentiating the likelihood one more time

using the equation

We often assume

Define

Page 42: Biostatistics-Lecture 10 Regression

GLM-exponential families

Page 43: Biostatistics-Lecture 10 Regression

Fitting the GLM

• In a GLM

• The joint likelihood is

• The log likelihood

Page 44: Biostatistics-Lecture 10 Regression

Fitting the GLM

• Assuming ( is known)

• Differentiating the log likelihood and setting it to zero

Page 45: Biostatistics-Lecture 10 Regression

Fitting the GLM

• By the chain rule

• Since

• We have

)( ii b

Page 46: Biostatistics-Lecture 10 Regression

Canonical Link Function

• The Canonical Link Function is such that

• Remember that

• So

)( ii b

Page 47: Biostatistics-Lecture 10 Regression

Iteratively Reweighted Least Squares (IRLS)

• We first consider fitting the nonlinear model

by minimizing

where f is a nonlinear function

Page 48: Biostatistics-Lecture 10 Regression

Iteratively Reweighted Least Squares (IRLS)

• Given a good guess

by using the Taylor expansion

Define the pseudodata

Page 49: Biostatistics-Lecture 10 Regression

Iteratively Reweighted Least Squares (IRLS)

• Note that in GLM, we are trying to solve

• If are known, this is equivalent to minimizing

Page 50: Biostatistics-Lecture 10 Regression

Iteratively Reweighted Least Squares (IRLS)

• We are inspired to use the following algorithm

– At the kth iteration, define

• update as in the nonlinear model case

• set k to be k+1

– But the second step also involves iteration, we may perform one step iteration here to obtain

Page 51: Biostatistics-Lecture 10 Regression

Deviance

• The deviance is defined as

where is the log likelihood with the saturated model: the model with one parameter per data point

Also note that deviance is defined to be independent of

Page 52: Biostatistics-Lecture 10 Regression

Deviance

• GLM does not have R2

• The closest one is the explained deviance

• The over-dispersion parameter may be estimated by

or by the Pearson statistic

Page 53: Biostatistics-Lecture 10 Regression

Model comparison

• Scaled deviance

• For the hypothesis testing problem

under H0

Page 54: Biostatistics-Lecture 10 Regression

Residuals

• Pearson Residuals

– Approximately zero mean and variance

• Deviance Residuals

Page 55: Biostatistics-Lecture 10 Regression

Negative binomial

• Density