
Linear models

F. Farnir, E. Moyse

Biostatistics & Bioinformatics

Faculty of Vet. Medicine

University of Liege

Outline of the course

• Linear models

◦ Basic formulation

◦ The two tasks

  • Estimating parameters

  • Testing hypotheses about the parameters

• Examples of linear models

◦ Simple linear regression

◦ Simple ANOVA

◦ A more complicated example

◦ A more complex situation: repeated measures

Linear models

• A basic formulation

◦ A model is a linear model if the relationship between the parameters and the modelled variable is linear.

◦ Examples:

  • Linear regression: y(i) = β0 + β1*x(i) + e(i)

  • Quadratic regression: y(i) = β0 + β1*x(i) + β2*x(i)² + e(i)

  • One-way ANOVA: y(ij) = µ + a(i) + e(ij)

  • Multiple-way ANOVA, other regressions, mixtures of ANOVA and regression, …

Linear models

• Matrix formulation: y = X*β + e

◦ y' = (y(1),…,y(n)) = vector of observations

  • known (observed, or measured)

◦ β' = (β(1),…,β(m)) = vector of parameters

  • unknown, and to be estimated

◦ e' = (e(1),…,e(n)) = vector of residuals

  • unknown, but assumed N(0, σ²*I)

◦ X = design (or incidence) matrix, linking the parameters to the observations

  • known

Linear models

• Two main problems:

◦ How do we obtain estimators b for the parameters β of the model?

◦ How do we test hypotheses about the parameters of the model?

Linear models

• Estimation problem:

◦ The rationale we use to infer estimators is to choose parameter values that make the error « as small as possible »

  • To simultaneously reduce all components of e, we choose to minimize e'*e = Σ e(i)² = (y − X*β)'*(y − X*β)

  • Setting the derivative with respect to β to zero gives the normal equations X'*X*b = X'*y, which leads to the « least-squares (of the error) estimate » in the following form:

b = (X'*X)⁻¹*X'*y

Linear models

• Estimation problem: example 1

◦ Assume we have the following simple problem:

10 = 3*β + e1
12 = 2*β + e2

◦ 2 equations – 1 unknown => no exact solution

◦ We choose an estimator of β that minimizes the sum of the squared errors:

min (e1² + e2²) = min [(10 – 3*β)² + (12 – 2*β)²]

Linear models

• Estimation problem: example 1

◦ Minimizing this function of β can be achieved by differentiating it with respect to β and setting the derivative equal to 0

◦ => 2*(10 – 3*b)*(–3) + 2*(12 – 2*b)*(–2) = 0
   => b = 54/13

◦ No other value of β makes the sum of squared errors smaller.

Linear models

• Estimation problem: example 1

◦ Using the matrix notation:

y' = (10 12), X' = (3 2), b = (b), e' = (e1 e2)

◦ X'X = 13 => (X'X)⁻¹ = 1/13
   X'y = 54

◦ => b = (X'X)⁻¹ * X'y = 54/13
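This small computation can be checked directly in R; the following lines are an illustrative check (not part of the original slides), re-using the matrix formula given above:

> X <- matrix(c(3, 2), nrow = 2)     # X' = (3 2)
> y <- c(10, 12)
> solve(t(X) %*% X) %*% t(X) %*% y   # returns 4.153846 = 54/13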

Linear models

• Estimation problem: example 2

◦ The simplest linear model:

y(i) = µ + e(i)

◦ In matrix form: y = X*β + e

  • y' = (y1 y2 … yn)
  • X' = (1 1 … 1)
  • β = (µ)
  • e' = (e1 e2 … en)

◦ The estimator b of β is then: b = (X'*X)⁻¹*X'*y

Linear models

• Estimation problem: example 2

◦ Let's compute b:

  • X'*X = n => (X'*X)⁻¹ = 1/n
  • X'*y = y1 + y2 + … + yn = Σ yi
  • b = (X'*X)⁻¹ * X'*y = 1/n * Σ yi = m

◦ Let's compute (y − X*b)'*(y − X*b)/(n − r(X)):

  • (X*b)' = (m m … m)
  • (y − X*b)' = (y1 − m  y2 − m  …  yn − m)
  • (y − X*b)'*(y − X*b) = (y1 − m)² + (y2 − m)² + … + (yn − m)² = Σ (yi − m)²
  • r(X) = rank of X = 1

Linear models

• Estimation problem: example 2

◦ Consequently:

  • (y − X*b)'*(y − X*b)/(n − r(X)) = Σ (yi − m)²/(n − 1) = s²
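As a quick illustration (not in the original slides), the matrix formulas of this example reproduce R's mean() and var() for any data vector:

> y <- c(4.1, 4.6, 4.9, 5.2, 5.4)           # any data vector works here
> X <- matrix(1, nrow = length(y))          # X' = (1 1 ... 1)
> b <- solve(t(X) %*% X) %*% t(X) %*% y
> c(b, mean(y))                             # both give the sample mean
> s2 <- t(y - X %*% b) %*% (y - X %*% b) / (length(y) - 1)
> c(s2, var(y))                             # both give the sample variance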

Linear models

• Testing problem:

◦ Most of the (null) hypotheses we might want to test can be written as:

H0: L*β = c

where L and c are a known matrix and vector, respectively. This is known as the « general linear hypothesis ».

Linear models

• Testing problem (cont'd):

◦ A general test, based on an F statistic, can be devised for such hypotheses:

Fq,n-r(X) = [(L*b − c)'*(L*G*L')⁻¹*(L*b − c)/q] / [(y − X*b)'*(y − X*b)/(n − r(X))]

  • G is a generalized inverse of X'*X. If the hypothesis is « testable », all G provide the same F value
  • q = number of independent rows of L
  • The hypothesis is of course embedded in the statistic
  • The denominator is the estimator of σ²
  • Normality assumptions are necessary to obtain the F distribution

Examples of linear models

• A simple linear regression:

◦ Let's consider the following dataset:

Week   Weight
  1     4.1
  2     4.6
  3     4.9
  4     5.2
  5     5.4

◦ A question of interest: is there a significant relation between weights and weeks?

Examples of linear models

• A simple linear regression:

◦ A first answer is to look at a plot of weights versus weeks. This can be achieved using R:

> weeks <- 1:5
> weights <- c(4.1, 4.6, 4.9, 5.2, 5.4)
> plot(weeks, weights)

Examples of linear models

• A simple linear regression: [scatter plot of weights versus weeks]

Examples of linear models

• A simple linear regression: there is a clear, almost linear, increasing trend

◦ This could be modeled using a classical linear regression:

Y(i) = β0 + β1*X(i) + e(i)

or, using the matrix notation:

Y = X*β + e   where β' = (β0, β1)

Examples of linear models

• A simple linear regression:

◦ Computing the estimators b of β.

  • X and y are easy to obtain:

X' = [ 1  1  1  1  1 ]        y' = (4.1, 4.6, 4.9, 5.2, 5.4)
     [ 1  2  3  4  5 ]

  • b' = (b0, b1) is then given by b = (X'*X)⁻¹*X'*y

Examples of linear models

• A simple linear regression:

◦ Computing the estimators b of β.

> weeks <- 1:5
> weights <- c(4.1, 4.6, 4.9, 5.2, 5.4)
> Y <- weights
> X <- matrix(c(rep(1,5), weeks), byrow=FALSE, nrow=5)
> b <- solve(t(X) %*% X) %*% t(X) %*% Y
> b
     [,1]
[1,] 3.88
[2,] 0.32
> abline(b[1], b[2], col="red")

Examples of linear models

• A simple linear regression:

◦ Computing the estimators b of β. [plot of weights versus weeks, with the fitted regression line added in red]

Examples of linear models

• A simple linear regression:

◦ Testing a hypothesis: β1 = 0.

  • The hypothesis can be put in the form L*β = c as follows: L = (0, 1), c = 0

  • Next, we can use these elements in the formula:

Fq,n-r(X) = [(L*b − c)'*(L*G*L')⁻¹*(L*b − c)/q] / [(y − X*b)'*(y − X*b)/(n − r(X))]

  • Note that: q = 1 and r(X) = 2

Examples of linear models

• A simple linear regression:

◦ Testing a hypothesis: β1 = 0.

> weeks <- 1:5
> weights <- c(4.1, 4.6, 4.9, 5.2, 5.4)
> Y <- weights
> n <- length(Y)
> X <- matrix(c(rep(1,5), weeks), byrow=FALSE, nrow=5)
> G <- solve(t(X) %*% X)
> b <- G %*% t(X) %*% Y
> L <- matrix(c(0,1), nrow=1)
> c <- c(0)
> hypo <- L %*% b - c
> numer <- t(hypo) %*% solve(L %*% G %*% t(L)) %*% hypo
> denom <- t(Y - X %*% b) %*% (Y - X %*% b)
> F <- (numer/1)/(denom/(n-2))
> pf(F, 1, n-2, lower.tail=FALSE)
            [,1]
[1,] 0.001857831

Examples of linear models

• A simpler solution to linear regression:

◦ Testing a hypothesis: β1 = 0.

> weeks <- 1:5
> weights <- c(4.1, 4.6, 4.9, 5.2, 5.4)
# 'lm' stands for 'linear models'
> lr <- lm(weights ~ weeks)
> summary(lr)

Call:
lm(formula = weights ~ weeks)

Residuals:
    1     2     3     4     5
-0.10  0.08  0.06  0.04 -0.08

Examples of linear models

• A simpler solution to linear regression (cont'd):

◦ Testing a hypothesis: β1 = 0.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.88000    0.10132   38.29 3.92e-05 ***
weeks        0.32000    0.03055   10.47  0.00186 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.09661 on 3 degrees of freedom
Multiple R-squared: 0.9734, Adjusted R-squared: 0.9645
F-statistic: 109.7 on 1 and 3 DF, p-value: 0.001858

Examples of linear models

• The classical solution for linear regression:

◦ Testing a hypothesis: β1 = 0.

> weeks <- 1:5
> weights <- c(4.1, 4.6, 4.9, 5.2, 5.4)
# b = Σ(X-Xm)*(Y-Ym) / Σ(X-Xm)²
> xm <- mean(weeks)
> x <- weeks - xm
> ym <- mean(weights)
> y <- weights - ym
> b <- sum(x*y)/sum(x**2)
> b
[1] 0.32
> SCR <- b*sum(x*y)
> SCT <- sum(y*y)
> SCE <- SCT - SCR
> dfR <- 1
> dfE <- length(weights) - 2
> F <- (SCR/dfR)/(SCE/dfE)
> pf(F, dfR, dfE, lower.tail=FALSE)
[1] 0.001857831

Examples of linear models

• A simple analysis of variance:

◦ As a second example, consider these data, with horses' heart rates:

Ardennes   Warm    Half
 106.6     115.4   100.2
 100.8      97.8   102.1
 110.9     120.3    99.6
 114.5      98.2   103.8
 115.9     113.2   100.7
  91.9             107.6
                    95.0

Examples of linear models

• A simple analysis of variance:

◦ The question of interest here is: is there a link between heart rates and breed?

◦ This question can be addressed using an ANOVA, i.e. the following model:

y(ij) = µ + α(i) + e(ij), i = 1, …, 3

or, using the matrix notation:

y = X*β + e   where β' = (µ, αA, αW, αH)

Examples of linear models

• A simple analysis of variance

◦ Elements of the model:

X = [ 1  1  0  0 ]     (18 rows: 6 Ardennes, 5 Warm, 7 Half)
    [ 1  1  0  0 ]
    [ ⋮  ⋮  ⋮  ⋮ ]
    [ 1  0  0  1 ]

y' = (106.6, 100.8, …, 95.0)

X'*X = [ 18  6  5  7 ]        X'*y = [ 1894.5 ]
       [  6  6  0  0 ]               [  640.6 ]
       [  5  0  5  0 ]               [  544.9 ]
       [  7  0  0  7 ]               [  709.0 ]

Examples of linear models

• A simple analysis of variance

◦ Elements of the model (using R)

# Data
> X <- matrix(c(rep(c(1,1,0,0),6), rep(c(1,0,1,0),5),
+               rep(c(1,0,0,1),7)), byrow=TRUE, nrow=18)
> Y <- c(106.6,100.8,110.9,114.5,115.9,91.9,115.4,
+        97.8,120.3,98.2,113.2,100.2,102.1,99.6,103.8,
+        100.7,107.6,95.0)
# Parameter estimators
> XX <- t(X) %*% X
> XY <- t(X) %*% Y

Examples of linear models

• A simple analysis of variance

◦ Computing estimators of β

It is easy to see that the X'X matrix is singular (the first row is equal to the sum of the 3 following ones) => use a generalized inverse

> library(MASS)
> G1 <- ginv(XX)
> b1 <- G1 %*% XY
> b1
         [,1]
[1,] 79.25810
[2,] 27.50857
[3,] 29.72190
[4,] 22.02762

Examples of linear models

• A simple analysis of variance

◦ Computing estimators of β

Note that another generalized inverse could be obtained « by hand », by setting the estimator of µ to 0 (and inverting the remaining diagonal block):

X'*X = [ 18  6  5  7 ]        G = [ 0   0    0    0  ]
       [  6  6  0  0 ]            [ 0  1/6   0    0  ]
       [  5  0  5  0 ]            [ 0   0   1/5   0  ]
       [  7  0  0  7 ]            [ 0   0    0   1/7 ]

Examples of linear models

• A simple analysis of variance

◦ Computing estimators of β (using R)

# Another G
> G2 <- matrix(rep(0,16), nrow=4)
> G2[2,2] <- 1/6
> G2[3,3] <- 1/5
> G2[4,4] <- 1/7
# Check generalized inverse
> XX %*% G2 %*% XX
     [,1] [,2] [,3] [,4]
[1,]   18    6    5    7
[2,]    6    6    0    0
[3,]    5    0    5    0
[4,]    7    0    0    7

Examples of linear models

• A simple analysis of variance

◦ Computing estimators of β (using R) (cont'd)

# Solutions
> b2 <- G2 %*% XY
> b2
         [,1]
[1,]   0.0000
[2,] 106.7667
[3,] 108.9800
[4,] 101.2857
# i.e. the 3 breed averages
# Observe that:
> b1[1,1] + b1[2,1]
[1] 106.7667
> b2[1,1] + b2[2,1]
[1] 106.7667
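As a quick check (not in the original slides), the non-zero entries of b2 can be compared with the raw breed means computed directly from Y:

> tapply(Y, rep(c("A","W","H"), c(6,5,7)), mean)
# the breed means (A = 106.7667, W = 108.98, H = 101.2857) match the b2 solutions above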

Examples of linear models

• A simple analysis of variance

◦ Testing a first hypothesis:

  • H0: µA = µW = µH

which can be rewritten as:

H0: (µA = µW & µA = µH), i.e. (µA − µW = 0 & µA − µH = 0)

  • In terms of the general linear hypothesis, this can be written as:

[ 0  1  −1   0 ]   [ µ  ]   [ 0 ]
[ 0  1   0  −1 ] * [ αA ] = [ 0 ]
                   [ αW ]
                   [ αH ]

Examples of linear models

• A simple analysis of variance

◦ Testing the first hypothesis using R:

# Hypothesis 1
> L <- matrix(c(0,1,-1,0,0,1,0,-1), byrow=TRUE, nrow=2)
> q <- 2
> n <- 18
> rX <- 3
> num <- (t(L %*% b1) %*% solve(L %*% G1 %*% t(L)) %*% L %*% b1)/q
> den <- (t(Y - X %*% b1) %*% (Y - X %*% b1))/(n - rX)
# Test the hypothesis
> F <- num/den
> F
         [,1]
[1,] 1.549397
> pf(F, q, n-rX, lower.tail=FALSE)
          [,1]
[1,] 0.2445187

Examples of linear models

• A simple analysis of variance

◦ Easily testing the first hypothesis using R:

# Hypothesis 1
> breed <- factor(c(rep("A",6), rep("W",5), rep("H",7)))
> model <- lm(Y ~ breed)
> summary(model)

Call:
lm(formula = Y ~ breed)

Residuals:
     Min       1Q   Median       3Q      Max
-14.8667  -4.8964   0.3238   5.7907  11.3200

Examples of linear models

• A simple analysis of variance

◦ Easily testing the first hypothesis using R:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 106.767      3.225  33.106 1.94e-15 ***
breedW        2.213      4.783   0.463    0.650
breedH       -5.481      4.395  -1.247    0.231
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.9 on 15 degrees of freedom
Multiple R-squared: 0.1712, Adjusted R-squared: 0.06071
F-statistic: 1.549 on 2 and 15 DF, p-value: 0.2445
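The overall breed test can also be read from the analysis-of-variance table of the fitted model; this one-liner (not in the original slides) gives the same F (1.549) and p-value (0.2445) as the matrix computation above:

> anova(model)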

Examples of linear models

• A simple analysis of variance

◦ Testing a second hypothesis using R:

# Hypothesis 2: H0: µ(H) = 0.5*(µ(A) + µ(W))
> L <- matrix(c(0,-0.5,-0.5,1), byrow=TRUE, nrow=1)
> q <- 1
> n <- 18
> rX <- 3
> num <- (t(L %*% b1) %*% solve(L %*% G1 %*% t(L)) %*% L %*% b1)/q
> den <- (t(Y - X %*% b1) %*% (Y - X %*% b1))/(n - rX)
# Test the hypothesis
> F <- num/den
> F
         [,1]
[1,] 2.965257
> pf(F, q, n-rX, lower.tail=FALSE)
          [,1]
[1,] 0.1056209

Examples of linear models

• A more complex example:

◦ As a third example, consider this dataset:

H   St   Ge   Age   HR        H    St   Ge   Age   HR
1   T    M     88   68         8   NT   M     64   76
2   T    M     96   64         9   NT   M     77   75
3   T    F     90   76        10   NT   F    100   71
4   T    F     73   71        11   NT   F     75   85
5   T    M     85   63        12   NT   M     63   81
6   T    F     99   63        13   NT   M     73   80
7   T    F     60   67        14   NT   F     67   81
                              15   NT   F     76   83

Examples of linear models

• A more complex example:

◦ Possible questions are:

  • Is there any effect of training on the heart rate (HR)?
  • Is there an age effect and/or a gender effect on HR?
  • Is the (potential) training effect similar in males and in females?

◦ Answers:

  • All these questions can be « easily » addressed using a linear model…

Examples of linear models

• A more complex example:

◦ The model:

y(ijk) = µ + β*a(ijk) + τ(i) + γ(j) + (τγ)(ij) + e(ijk)

where a(ijk) is the age covariate, τ the training-status effect and γ the gender effect.

◦ Using the matrix notation:

y = X*β + e

where β' = (µ, β, τT, τNT, γF, γM, τγTF, τγTM, τγNTF, τγNTM)

Examples of linear models

• A more complex example:

◦ Obtaining X and y

y' = (68, 64, 76, 71, 63, 63, 67, 76, 75, 71, 85, 81, 80, 81, 83)

X has 15 rows (one per horse) and 10 columns: an intercept, the age covariate, the 2 training-status indicators (T, NT), the 2 gender indicators (F, M) and the 4 interaction indicators (TF, TM, NTF, NTM). For example, the first row (horse 1: trained male, age 88) is

(1  88  1  0  0  1  0  1  0  0)

Examples of linear models

• A more complex example:

◦ Performing computations (using R)

# Building matrices
> y <- c(68,64,76,71,63,63,67,76,75,71,85,81,80,81,83)
> X <- matrix(rep(0,150), nrow=15)
> X[,1] <- rep(1,15)
> X[,2] <- c(88,96,90,73,85,99,60,64,77,
+            100,75,63,73,67,76)
> X[,3] <- c(rep(1,7), rep(0,8))
> X[,4] <- 1 - X[,3]
> X[,5] <- c(0,0,1,1,0,1,1,0,0,1,1,0,0,1,1)
> X[,6] <- 1 - X[,5]
> X[,7] <- c(X[,5][1:7], rep(0,8))
> X[,8] <- c(X[,6][1:7], rep(0,8))
> X[,9] <- c(rep(0,7), X[,5][8:15])
> X[,10] <- c(rep(0,7), X[,6][8:15])

Examples of linear models

• A more complex example:

◦ Performing computations (using R) (cont'd)

# Computing estimators
> XX <- t(X) %*% X
> Xy <- t(X) %*% y
> G <- ginv(XX)
> b <- G %*% Xy
> b
            [,1]
 [1,] 38.1162191
 [2,] -0.1592766
 [3,] 15.6683053
 [4,] 22.4479138
 [5,] 20.1285345
 [6,] 17.9876846
 [7,]  8.1587098
 [8,]  7.5095955
 [9,] 11.9698247
[10,] 10.4780891

Examples of linear models

• A more complex example:

◦ Remarks:

  • Other solutions could be obtained, using other generalized inverses.

  • Since the regression parameters are « estimable », the other solutions would give the same values for these estimable functions of β (see notes for an example).

Examples of linear models

• A more complex example:

◦ Testing hypotheses (using R)

# test_H: a generic function to test hypotheses
test_H <- function(X, y, L, c) {
  library(MASS)
  XX <- t(X) %*% X
  G <- ginv(t(X) %*% X)
  b <- G %*% t(X) %*% y
  num <- t(L %*% b - c) %*% solve(L %*% G %*% t(L)) %*% (L %*% b - c)
  den <- t(y - X %*% b) %*% (y - X %*% b)
  q <- dim(L)[1]
  n <- length(y)
  rX <- qr(XX)$rank
  F <- (num/q)/(den/(n - rX))
  pF <- pf(F, q, n - rX, lower.tail=FALSE)
  c(F, pF)
}

Examples of linear models

• A more complex example:

◦ a) Testing the regression coefficient

# Test: beta = 0
> L <- matrix(c(0,1,rep(0,8)), nrow=1)
> c <- matrix(c(0))
> test_H(X, y, L, c)
[1] 2.1324558 0.1749022
# No significant regression

◦ b) Testing the training effect

# Test: tau(T) - tau(NT) = 0
> L <- matrix(c(0,0,1,-1,rep(0,6)), nrow=1)
> c <- matrix(c(0))
> test_H(X, y, L, c)
[1] 14.951074581  0.003126119
# Significant training effect

Examples of linear models

• A word of caution:

◦ When testing the training effect, we actually compare the means of the 2 groups (Trained – Not Trained)

◦ The raw means embed information on other effects of the model, which might not be desirable…

  • This can be shown by replacing each observation by the assumed model and averaging over each group (see next slide)

Examples of linear models

• A word of caution:

ȳ(T) = µ + β*ā(T) + τ(T) + [4*γ(F) + 3*γ(M)]/7 + [4*(τγ)(TF) + 3*(τγ)(TM)]/7 + ē(T)

ȳ(NT) = µ + β*ā(NT) + τ(NT) + [γ(F) + γ(M)]/2 + [(τγ)(NTF) + (τγ)(NTM)]/2 + ē(NT)

ȳ(T) − ȳ(NT) = β*[ā(T) − ā(NT)] + [τ(T) − τ(NT)] + [γ(F) − γ(M)]/14
               + [8*(τγ)(TF) + 6*(τγ)(TM) − 7*(τγ)(NTF) − 7*(τγ)(NTM)]/14
               + [ē(T) − ē(NT)]

Examples of linear models

• A word of caution:

◦ This complicated expression shows that:

  • Due to the non-balanced nature of the dataset, comparing training statuses involves the gender effect

  • The presence of a covariate might induce differences if both groups are not balanced with respect to age

  • The potential interactions between training status and gender might render the comparison of statuses meaningless.

Examples of linear models

• A possible solution:

◦ Use « Least Square Means » (LSM)

  • We first obtain averages (LSM) on subgroups
  • We average these means to obtain marginal LSM
  • Example: LSM(T,F) = ?, LSM(T) = ?

Heart rates   Trained           Not Trained
Females       76, 71, 63, 67    71, 85, 81, 83
Males         68, 64, 63        76, 75, 81, 80

LSM(T,F) = µ + β*ā + τ(T) + γ(F) + (τγ)(TF)

where ā is the age, conventionally averaged over the whole dataset.

Examples of linear models

• A possible solution:

◦ Use « Least Square Means » (LSM)

  • We first obtain averages on subgroups

# Compute L for the 4 subgroups
> L_TF  <- matrix(c(1, mean(X[,2]), 1,0, 1,0, 1,0,0,0), nrow=1)
> L_TM  <- matrix(c(1, mean(X[,2]), 1,0, 0,1, 0,1,0,0), nrow=1)
> L_NTF <- matrix(c(1, mean(X[,2]), 0,1, 1,0, 0,0,1,0), nrow=1)
> L_NTM <- matrix(c(1, mean(X[,2]), 0,1, 0,1, 0,0,0,1), nrow=1)
# Compute LSM for the 4 subgroups
> LSM_TF  <- L_TF %*% b
> LSM_TM  <- L_TM %*% b
> LSM_NTF <- L_NTF %*% b
> LSM_NTM <- L_NTM %*% b

Examples of linear models

• A possible solution:

◦ Use « Least Square Means » (LSM)

  • We then average to obtain marginal LSM

# Compute LSM for the main effects
> LSM_T  <- 0.5*(LSM_TF + LSM_TM)
> LSM_NT <- 0.5*(LSM_NTF + LSM_NTM)
> LSM_F  <- 0.5*(LSM_TF + LSM_NTF)
> LSM_M  <- 0.5*(LSM_TM + LSM_NTM)

Examples of linear models

• A possible solution:

◦ Use « Least Square Means » (LSM)

  • Finally, since they are linear combinations of the parameters, LSM (or differences of LSM) can be tested using the general linear hypothesis test given above!

  • Example: let's compare the T & NT groups

LSM_T − LSM_NT
= 0.5*(LSM_TM + LSM_TF − LSM_NTF − LSM_NTM)
= 0.5*(L_TM + L_TF − L_NTF − L_NTM)*b
= (0, 0, 1, −1, 0, 0, 0.5, 0.5, −0.5, −0.5)*b

Examples of linear models

• A possible solution:

◦ Use « Least Square Means » (LSM)

  • (0, 0, 1, −1, 0, 0, 0.5, 0.5, −0.5, −0.5)*b

# Compute the difference of LSM between T and NT
> L <- matrix(c(0,0,1,-1,0,0,0.5,0.5,-0.5,-0.5), nrow=1)
> c <- 0
> test_H(X, y, L, c)
[1] 14.951074581  0.003126119
# Same result as before, showing that the obtained
# solution is corrected for the other effects of the model!

An (even) more complex situation

• Imagine the following situation

◦ 2 groups of 2 individuals are followed longitudinally, and 3 measures are taken on each individual at 3 specific times (see figure)

A more complex situation

• Some questions are:

◦ Is there a significant difference in the measures between the groups?

◦ Is there a significant difference in the measures between the times?

  • If yes, for which times?

◦ [Do the 2 groups have different dynamic behaviours?]

A more complex situation

• These questions can easily be addressed using linear models, as done above.

◦ Omitting the interaction for simplicity:

y' = (89.4, 106.4, 116.3, 103.7, 113.7, 118.0, 91.5, 89.8, 110.6, 85.0, 88.5, 97.2)

X = [ 1  1  0  1  0  0 ]
    [ 1  1  0  0  1  0 ]
    [ 1  1  0  0  0  1 ]
    [ 1  1  0  1  0  0 ]
    [ 1  1  0  0  1  0 ]
    [ 1  1  0  0  0  1 ]
    [ 1  0  1  1  0  0 ]
    [ 1  0  1  0  1  0 ]
    [ 1  0  1  0  0  1 ]
    [ 1  0  1  1  0  0 ]
    [ 1  0  1  0  1  0 ]
    [ 1  0  1  0  0  1 ]

β' = (µ, γ(1), γ(2), τ(1), τ(2), τ(3))

(columns of X: intercept, the 2 group indicators, the 3 time indicators)

A more complex situation

• LM analysis, using R (1):

#
# Observations
#
> y <- c(89.4,106.4,116.3,103.7,113.7,118.0,91.5,89.8,
+        110.6,85.0,88.5,97.2)
#
# Design matrix
#
> X <- matrix(rep(0,72), nrow=12)
> X[,1] <- 1
> X[1:6,2] <- 1
> X[7:12,3] <- 1
> X[c(1,4,7,10),4] <- 1
> X[c(2,5,8,11),5] <- 1
> X[c(3,6,9,12),6] <- 1

A more complex situation

• LM analysis, using R (2):

#
# Compute solutions
#
> XX <- t(X) %*% X
> Xy <- t(X) %*% y
> library(MASS)
> b <- ginv(XX) %*% Xy
#
# Test of group effect
#
> L <- matrix(c(0,1,-1,0,0,0), nrow=1)
> c <- matrix(c(0), nrow=1)
> test_H(X, y, L, c)
[1] 14.89196727  0.00481566
#
# Significant group effect (p = 0.0048)

A more complex situation

• LM analysis, using R (3):

#
# Test of time effect
#
> L <- matrix(c(0,0,0,1,-1,0, 0,0,0,0,1,-1), nrow=2, byrow=TRUE)
> c <- matrix(c(0,0), nrow=2)
> test_H(X, y, L, c)
[1] 8.25934879 0.01133366
# Significant time effect (p = 0.0113)

# Or, equivalently:
> L <- matrix(c(0,0,0,1,-1,0, 0,0,0,1,0,-1), nrow=2, byrow=TRUE)
> c <- matrix(c(0,0), nrow=2)
> test_H(X, y, L, c)
[1] 8.25934879 0.01133366
# Significant time effect (p = 0.0113)

A more complex situation

• LM analysis, using R (4):

# Obtain LSM for groups
> L_G1T1 <- matrix(c(1,1,0,1,0,0), nrow=1)
> L_G1T2 <- matrix(c(1,1,0,0,1,0), nrow=1)
> L_G1T3 <- matrix(c(1,1,0,0,0,1), nrow=1)
> L_G2T1 <- matrix(c(1,0,1,1,0,0), nrow=1)
> L_G2T2 <- matrix(c(1,0,1,0,1,0), nrow=1)
> L_G2T3 <- matrix(c(1,0,1,0,0,1), nrow=1)
> LSM_G1T1 <- L_G1T1 %*% b
> LSM_G1T2 <- L_G1T2 %*% b
> LSM_G1T3 <- L_G1T3 %*% b
> LSM_G2T1 <- L_G2T1 %*% b
> LSM_G2T2 <- L_G2T2 %*% b
> LSM_G2T3 <- L_G2T3 %*% b
> LSM_G1 <- (LSM_G1T1 + LSM_G1T2 + LSM_G1T3)/3
> LSM_G2 <- (LSM_G2T1 + LSM_G2T2 + LSM_G2T3)/3

A more complex situation

• LM analysis, using R (5):

#
# Show LSM for groups, and their difference
#
> c(LSM_G1, LSM_G2, LSM_G1 - LSM_G2)
[1] 107.91667  93.76667  14.15000
#
# Test whether the true difference is 0
#
> L_delta <- (L_G1T1 + L_G1T2 + L_G1T3 - L_G2T1 - L_G2T2 - L_G2T3)/3
> c_delta <- matrix(c(0), nrow=1)
> LSM_delta <- L_delta %*% b
#
# Test the difference
#
> test_H(X, y, L_delta, c_delta)
[1] 14.89196727  0.00481566
# Of course identical to the previous group test

A more complex situation

• LM analysis, summary:

◦ Everything seems fine, but...

  • Independence assumptions have clearly been violated (measures taken on the same individual are likely to be correlated)

  • Erroneously assuming independence might:

    • Underestimate the random residual variation (σ²e)
    • Consequently, overestimate the effects...
    • And thus increase the false positive rate

◦ So, a question of interest is: how can we take these correlations into account?

A more complex situation

• Idea: use a more general family of linear models, named « mixed models », allowing for correlations

◦ « Mixed » refers to the simultaneous use of « fixed » and « random » effects

  • Fixed: this effect would be the same if we repeated the experiment
    • Example: groups, times

  • Random: this effect is randomly sampled from a population of possible levels
    • Example: animals

• Matrix formulation:

y = X*β + Z*u + e

◦ Z = design (or incidence) matrix, linking the random effects to the observations.

  • known

◦ u = vector of random effects

  • unknown, values to be predicted
  • assumed to be random samples from N(0, I*σ²u)
    • so, var(ui) = σ²u for all i
    • and cov(ui,uj) = 0 for all combinations of i and j ≠ i (i.e. individuals are assumed to be un(co)rrelated)
  • σ²u is an unknown parameter, to be estimated

A more complex situation

• Matrix formulation (cont'd):

y = X*β + Z*u + e

◦ e = vector of random residuals

  • unknown
  • assumed to be random samples from N(0, I*σ²e)
    • so, var(ei) = σ²e for all i
    • and cov(ei,ej) = 0 for all combinations of i and j ≠ i (i.e. residuals are assumed to be un(co)rrelated)
  • σ²e is an unknown parameter, to be estimated

◦ Furthermore, we will assume:

  • Cov(ui,ej) = 0 for all i,j

A more complex situation

• Variances and covariances

◦ u ~ N(0; G) with G = I*σ²u

◦ e ~ N(0; R) with R = I*σ²e

◦ V = V(y)
     = V(X*β + Z*u + e)
     = V(X*β) + V(Z*u) + V(e) + 2*Cov(X*β, Z*u) + 2*Cov(X*β, e) + 2*Cov(Z*u, e)
     = 0 + Z*V(u)*Z' + R + 0 + 0 + 2*Z*Cov(u,e)
     = Z*G*Z' + R + 0
     = Z*G*Z' + R

A more complex situation

• Variances and covariances: example

◦ Back to our problem...

#
# Random effect design matrix
#
> Z <- matrix(rep(0,48), nrow=12)
> Z[c(1:3),1] <- 1
> Z[c(4:6),2] <- 1
> Z[c(7:9),3] <- 1
> Z[c(10:12),4] <- 1
#
# Known (co)variance matrices
# Arbitrary values are used to start with
#
> sigma_2_a <- 10.0
> sigma_2_e <- 20.0
> G <- diag(4)*sigma_2_a    # No correlation between animals
> R <- diag(12)*sigma_2_e   # No correlation between residuals

A more complex situation

• Variances and covariances: example

◦ Back to our problem...

#
# Observations variance-covariance matrix
#
> V <- Z %*% G %*% t(Z) + R
> V
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
 [1,]   30   10   10    0    0    0    0    0    0     0     0     0
 [2,]   10   30   10    0    0    0    0    0    0     0     0     0
 [3,]   10   10   30    0    0    0    0    0    0     0     0     0
 [4,]    0    0    0   30   10   10    0    0    0     0     0     0
 [5,]    0    0    0   10   30   10    0    0    0     0     0     0
 [6,]    0    0    0   10   10   30    0    0    0     0     0     0
 [7,]    0    0    0    0    0    0   30   10   10     0     0     0
 [8,]    0    0    0    0    0    0   10   30   10     0     0     0
 [9,]    0    0    0    0    0    0   10   10   30     0     0     0
[10,]    0    0    0    0    0    0    0    0    0    30    10    10
[11,]    0    0    0    0    0    0    0    0    0    10    30    10
[12,]    0    0    0    0    0    0    0    0    0    10    10    30

• Variances and covariances: example

◦ More generally, in our problem...

A more complex situation

V is block-diagonal, with one 3×3 block per animal:

V = [ B  0  0  0 ]                 [ σ²u+σ²e    σ²u       σ²u     ]
    [ 0  B  0  0 ]    where B =    [   σ²u    σ²u+σ²e     σ²u     ]
    [ 0  0  B  0 ]                 [   σ²u      σ²u     σ²u+σ²e   ]
    [ 0  0  0  B ]

i.e. observations taken on the same animal have covariance σ²u, and observations taken on different animals have covariance 0.

• We can see that:

◦ Introducing a random individual effect in the model has introduced a correlation between observations taken on the same individual

◦ The price to be paid:

  • More parameters (u, σ²u)
  • A much more complicated resolution => use of specialized software (SAS, AIREML, ...)
  • Some details of an R solution are given in the appendix
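For orientation, the same random-intercept model can also be fitted with a dedicated mixed-model package; the following is a hedged sketch (not part of the original slides), assuming the lme4 package is installed and that y is the observation vector defined earlier:

> library(lme4)
> group  <- factor(rep(1:2, each = 6))    # 2 groups of 6 observations
> time   <- factor(rep(1:3, times = 4))   # 3 times per animal
> animal <- factor(rep(1:4, each = 3))    # 4 animals, 3 measures each
> fit <- lmer(y ~ group + time + (1 | animal))   # random intercept per animal
> summary(fit)   # REML variance components and fixed-effect estimates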

A more complex situation

• An alternative solution:

◦ Instead of introducing an individual effect in order to correlate the observations, correlations can be introduced directly in the R matrix

  • No random effect anymore (Z = 0)
  • V = Z*G*Z' + R = R
  • Two parameters (σ1² & σ2²) need to be estimated, with

    R(i,i) = σ1² + σ2² for all i
    R(i,j) = σ1² for all i ≠ j (same animal)

A more complex situation

• An alternative solution (cont'd):

  • Other equivalent coding: R = K*σ², with

    σ² = σu² + σe², ρ = σu²/(σu² + σe²)
    K(i,i) = 1 for all i
    K(i,j) = ρ for all i ≠ j (same animal)

  • This correlation structure is referred to as « compound symmetry » (CS). It involves only 2 parameters (σ² and ρ).
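A minimal sketch (not in the original slides) of what one CS within-animal block looks like in R, using the arbitrary starting values σ²u = 10 and σ²e = 20 from above, so that σ² = 30 and ρ = 1/3:

> sigma2 <- 30; rho <- 1/3
> K <- matrix(rho, nrow = 3, ncol = 3)   # K(i,j) = rho for i != j
> diag(K) <- 1                           # K(i,i) = 1
> sigma2 * K                             # reproduces the 30/10 block of V shown earlier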

A more complex situation

• An alternative solution (cont'd):

◦ The latter approach is more flexible, because R structures other than CS can be introduced

  • For example, it could be expected that measures taken at times 1 and 3 are less correlated than measures taken at times 1 and 2

  • A possible structure to model this is:

    σ² = σu² + σe², ρ = σu²/(σu² + σe²)
    K(i,i) = 1 for all i
    K(i,j) = ρ^|i−j| for all i,j (same animal)

  • This type of structure is named « first-order autoregressive » (AR(1)), and also involves only 2 parameters (σ² and ρ)
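An illustrative AR(1) block in R (assumed values, not from the original slides); the correlation decays with the time lag |i−j|:

> rho <- 1/3
> times <- 1:3
> rho ^ abs(outer(times, times, "-"))   # K(i,j) = rho^|i-j| for one animal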

A more complex situation

• Model selection refers to the procedures used to select the « best » model for a given dataset

• Various approaches have been proposed, and we'll show one as an example

• Although almost automated procedures exist and are implemented, nothing will replace the experimenter's knowledge of the problem and sound reasoning...

A word on model selection

• We can re-use a previous example to show the rationale:

A word on model selection

H   St   Ge   Age   HR        H    St   Ge   Age   HR
1   T    M     88   68         8   NT   M     64   76
2   T    M     96   64         9   NT   M     77   75
3   T    F     90   76        10   NT   F    100   71
4   T    F     73   71        11   NT   F     75   85
5   T    M     85   63        12   NT   M     63   81
6   T    F     99   63        13   NT   M     73   80
7   T    F     60   67        14   NT   F     67   81
                              15   NT   F     76   83

• Several candidate models could be used, among which the « best » one is to be selected (a short lm() sketch of some of these fits is given below):

(1) yi = m + ei
(2) yi = m + b*Ai + ei
(3) yi = m + b*Ai + Gi + ei
(4) yi = m + b*Ai + Gi + Si + ei
(5) yi = m + b*Ai + Gi + Si + Gi*Si + ei
(6) yi = m + Gi + Si + Gi*Si + ei
(7) yi = m + Gi + Si + ei
...

Possible models
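The sketch below (not in the original slides) shows how a few of these candidates could be fitted with lm(); the vectors hr, age, gender and status are assumed to hold the HR, Age, Ge and St columns of the table above:

> m2 <- lm(hr ~ age)                      # model (2)
> m3 <- lm(hr ~ age + gender)             # model (3)
> m5 <- lm(hr ~ age + gender * status)    # model (5), with the Ge x St interaction
> anova(m2, m3)                           # F test comparing the nested models (2) and (3)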

• A model is nested within another if it is built with a subset of the factors of that model:

(1) => (2) => (3) => (4) => (5)
(7) => (6) => (5)

but, for example:

(6) ≠> (4)

• Comparing nested (linear) models can be done using an F test (see next slide)

Nested models

• Idea: if a new factor does not contribute significantly to the model, the extra fit provided by this factor is just another estimator of the error variance.

• Remark: when models are not nested, other criteria (such as the Akaike Information Criterion, AIC) must be used.
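For instance, models (4) and (6) above are not nested in each other; a hedged sketch of an AIC comparison (not in the original slides, same assumed variable names as before, smaller AIC being preferred):

> m4 <- lm(hr ~ age + gender + status)    # model (4)
> m6 <- lm(hr ~ gender * status)          # model (6)
> AIC(m4, m6)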

F test for nested models selection

Fq,n-r(Xc) = [(Xc*bc − Xr*br)'*y/q] / [(y − Xc*bc)'*(y − Xc*bc)/(n − r(Xc))], with q = r(Xc) − r(Xr)

• Example: comparing (3) to (2)

F test for nested models selection

# Reduced model
> Xr <- matrix(rep(0,30), nrow=15)
> Xr[,1] <- rep(1,15)
> Xr[,2] <- age
# Complete model
> Xc <- matrix(rep(0,45), nrow=15)
> Xc[,1] <- rep(1,15)
> Xc[,2] <- age
> Xc[,3] <- gender
# Solutions
> library(MASS)
> br <- ginv(t(Xr)%*%Xr) %*% t(Xr) %*% hr
> bc <- ginv(t(Xc)%*%Xc) %*% t(Xc) %*% hr
# Test
> num <- t(Xc%*%bc - Xr%*%br) %*% hr/1
> den <- t(hr - Xc%*%bc) %*% (hr - Xc%*%bc)/(15-2)
> pf(num/den, 1, 15-2, lower.tail=FALSE)
          [,1]
[1,] 0.4151881

Appendix

• Procedure:

1. Obtain estimates of the variance components (σ², ρ, ...) using specific methods

  • The method preferred today is REML (REstricted Maximum Likelihood)

2. Obtain solutions using these estimates

  • A practical method is to use the so-called Henderson mixed model equations (MME)

3. Perform (approximate) testing

  • Use a modified version of the general linear hypothesis test described above

Appendix: solving mixed models

• Procedure – 1) REML: AI algorithm

◦ θ(k+1) = θ(k) + AI⁻¹*SC

  • k = iteration number
  • θ' = (σ²u, σ²e) and θ(0) ~ arbitrary
  • SC = « score vector »
    • SC(1) = 0.5*[y'*P*Z*G*Z'*P*y – trace(Z*G*Z'*P)]
    • SC(2) = 0.5*[y'*P*P*y – trace(P)]
  • P = « hat matrix » = V⁻¹ – V⁻¹*X*(X'*V⁻¹*X)⁻*X'*V⁻¹
  • AI = « average information matrix »

Appendix: solving mixed models

AI = 0.5 * [ y'*P*(Z*G*Z')*P*(Z*G*Z')*P*y     y'*P*(Z*G*Z')*P*P*y ]
           [ y'*P*P*(Z*G*Z')*P*y              y'*P*P*P*y          ]

(the same quantities as computed in the R code below)

• Procedure – 1) REML: R implementation

Appendix: solving mixed models

#
# REML estimators computation
#
library(MASS)
# Init computations
diff <- 1000.0
AI <- matrix(rep(0,4), nrow=2)
SC <- matrix(rep(0,2), nrow=2)
sigma_2_u <- 10.0
sigma_2_e <- 20.0
sigma_2 <- c(sigma_2_u, sigma_2_e)
# Loop while estimates differ
while (diff > 0.01) {
  # Loop body => see next slides
}

• Procedure – 1) REML: R implementation

Appendix: solving mixed models

# Loop body (1)
# Variance of the observations
ZGZ <- Z %*% G %*% t(Z)
V <- ZGZ*sigma_2_u + R*sigma_2_e
# P matrix
Vi <- solve(V)
XVi <- t(X) %*% Vi
P <- Vi - t(XVi) %*% (ginv(XVi %*% X) %*% XVi)
# Partial computations
Py <- P %*% y
ZGZP <- ZGZ %*% P
ZGZPy <- ZGZ %*% Py
# Continued on next slide...

• Procedure – 1) REML: R implementation

Appendix: solving mixed models

# Loop body (2)
# Traces
trP <- 0
trPZGZ <- 0
for (i in 1:dim(P)[1]) {
  trP <- trP + P[i,i]
  trPZGZ <- trPZGZ + ZGZP[i,i]
}
# AI matrix
AI[1,1] <- 0.5*t(ZGZPy) %*% (P %*% ZGZPy)
AI[1,2] <- 0.5*t(Py) %*% (P %*% ZGZPy)
AI[2,2] <- 0.5*t(Py) %*% (P %*% Py)
AI[2,1] <- AI[1,2]
# Score vector
SC[1] <- 0.5*(t(Py) %*% ZGZPy - trPZGZ)
SC[2] <- 0.5*(t(Py) %*% Py - trP)

• Procedure – 1) REML: R implementation

Appendix: solving mixed models

# Loop body (3)
# New estimators
new_sigma_2 <- sigma_2 + solve(AI) %*% SC
new_sigma_2_u <- new_sigma_2[1]
new_sigma_2_e <- new_sigma_2[2]
# Difference
diff <- (sigma_2[1] - new_sigma_2[1])**2
diff <- diff + (sigma_2[2] - new_sigma_2[2])**2
sigma_2 <- new_sigma_2
}
sigma_2
         [,1]
[1,] 18.82629
[2,] 26.21528

• Procedure – 2) MME: method

Appendix: solving mixed models

◦ BLUE (β̂) and BLUP (û) can be obtained by solving:

[ X'*R⁻¹*X    X'*R⁻¹*Z       ]   [ β̂ ]   [ X'*R⁻¹*y ]
[ Z'*R⁻¹*X    Z'*R⁻¹*Z + G⁻¹ ] * [ û ] = [ Z'*R⁻¹*y ]

◦ In our case (R = I*σ²e, G = I*σ²u), this can be written:

[ X'*X    X'*Z               ]   [ β̂ ]   [ X'*y ]
[ Z'*X    Z'*Z + I*(σ²e/σ²u) ] * [ û ] = [ Z'*y ]

• Procedure – 2) MME: R implementation

Appendix: solving mixed models

MMEl <- matrix(rep(0,100), nrow=10)
MMEr <- matrix(rep(0,10), nrow=10)
MMEl[1:6,1:6] <- t(X) %*% X
MMEl[1:6,7:10] <- t(X) %*% Z
MMEl[7:10,1:6] <- t(Z) %*% X
MMEl[7:10,7:10] <- t(Z) %*% Z + solve(G)*(sigma_2[2]/sigma_2[1])
MMEr[1:6] <- t(X) %*% y
MMEr[7:10] <- t(Z) %*% y
sol <- ginv(MMEl) %*% MMEr

• Procedure – 3) Approximate testing

◦ The approximation comes from the fact that only estimators of β and u are available

◦ The denominator degrees of freedom are estimated using various methods (see details in the literature) well beyond the scope of this text...!

◦ The statistic is a simple extension of the one used for fixed-effects-only models:

Appendix: solving mixed models

Fq,v = [(L*b̂ − c)'*(L*Ĉ*L')⁻¹*(L*b̂ − c)]/q

where b̂ contains the MME solutions and Ĉ = (MME coefficient matrix / σ̂²e)⁻ (cf. the R code below).

• Procedure – 3) Approximate testing

Appendix: solving mixed models

# Testing group effects
> L <- matrix(c(0,1,-1,rep(0,7)), nrow=1)
> Lsol <- L %*% sol
> C_hat <- ginv(MMEl/sigma_2[2])
> LCL <- L %*% (C_hat %*% t(L))
> LCLi <- ginv(LCL)
> Fg <- t(Lsol) %*% (LCLi %*% Lsol)/1
> dfg1 <- 1
> dfg2 <- 2   # Cf. Kenward-Roger...
> 1 - pf(Fg, dfg1, dfg2)
          [,1]
[1,] 0.1145035

• Procedure – 3) Approximate testing

Appendix: solving mixed models

# Testing time effects
> L <- matrix(c(0,0,0,1,-1,0,0,0,0,0,
+               0,0,0,1,0,-1,0,0,0,0), nrow=2, byrow=TRUE)
> Lsol <- L %*% sol
> C_hat <- ginv(MMEl/sigma_2[2])
> LCL <- L %*% (C_hat %*% t(L))
> LCLi <- ginv(LCL)
> Ft <- t(Lsol) %*% (LCLi %*% Lsol)/1
> dft1 <- 1
> dft2 <- 6   # Cf. Kenward-Roger...
> 1 - pf(Ft, dft1, dft2)
            [,1]
[1,] 0.006966431

• The same analyses, with SAS: program

Appendix: solving mixed models

options ls=80;
data phd;
  input groupe temps animal pheno @@;
  cards;
1 1 1  89.4   1 1 2 103.7   1 2 1 106.4   1 2 2 113.7
1 3 1 116.3   1 3 2 118.0   2 1 3  91.5   2 1 4  85.0
2 2 3  89.8   2 2 4  88.5   2 3 3 110.6   2 3 4  97.2
;
proc glm;
  class groupe temps;
  model pheno=groupe temps;
  lsmeans groupe / pdiff stderr;
proc mixed;
  class groupe temps animal;
  model pheno=groupe temps / solution;
  repeated / sub=animal type=cs;
run;

• The same analyses, with SAS: result (1)

Appendix: solving mixed models

[SAS output listing not reproduced in this transcript]