A Brief and Friendly Introduction to Mixed-Effects Models in Psycholinguistics


Transcript of "A Brief and Friendly Introduction to Mixed-Effects Models in Psycholinguistics" (slides: idiom.ucsd.edu/~rlevy/presentations/cuny2009/hierarchical_models_intro.pdf)


[Figure: graphical picture of a mixed-effects model. Shared parameters θ ("fixed effects") and parameters Σb governing inter-cluster variability sit at the top; cluster-specific parameters b1, b2, ..., bM ("random effects") are drawn according to Σb; each cluster m has predictors xm1, ..., xmnm and responses ym1, ..., ymnm.]

Roger Levy
UC San Diego, Department of Linguistics
25 March 2009


Goals of this talk

- Briefly review generalized linear models and how to use them
- Give a precise description of multi-level models
- Show how to draw inferences using a multi-level model (fitting the model)
- Discuss how to interpret model parameter estimates
  - Fixed effects
  - Random effects
- Briefly discuss multi-level logit models


Reviewing generalized linear models I

Goal: model the effects of predictors (independent variables) X on a response (dependent variable) Y.

The picture:

[Figure: graphical model of a single-level GLM. Model parameters θ sit at the top; for each observation i = 1, ..., n, the predictors xi together with θ determine the response yi. Labels: Predictors, Model parameters, Response.]


Reviewing GLMs II

Assumptions of the generalized linear model (GLM):

1. Predictors {Xi} influence Y through the mediation of a linear predictor η;
2. η is a linear combination of the {Xi}:
   η = α + β1X1 + · · · + βNXN   (linear predictor)
3. η determines the predicted mean µ of Y:
   η = l(µ)   (link function)
4. There is some noise distribution of Y around the predicted mean µ of Y:
   P(Y = y; µ)
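As a concrete illustration of these four ingredients, here is a minimal R sketch that simulates from a Poisson GLM with a log link and then recovers the coefficients with glm(). The predictor values and coefficients are invented for illustration only:

> set.seed(1)
> n <- 200
> x <- rnorm(n)                        # predictor X
> alpha <- 0.5; beta <- 0.8            # illustrative coefficient values
> eta <- alpha + beta * x              # (1)-(2): the linear predictor
> mu <- exp(eta)                       # (3): log link, eta = l(mu) = log(mu)
> y <- rpois(n, mu)                    # (4): Poisson noise around mu
> coef(glm(y ~ x, family="poisson"))   # recovers alpha and beta approximately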


Reviewing GLMs III

Linear regression, which underlies ANOVA, is a kind of generalized linear model.

- The predicted mean is just the linear predictor:
  η = l(µ) = µ
- Noise is normally (= Gaussian) distributed around 0 with standard deviation σ:
  ε ∼ N(0, σ)
- This gives us the traditional linear regression equation:
  Y = α + β1X1 + · · · + βnXn + ε
  where α + β1X1 + · · · + βnXn is the predicted mean µ = η, and ε ∼ N(0, σ) is the noise.
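Because the link is the identity, glm() with the Gaussian family reduces to ordinary linear regression; a quick sketch with made-up data:

> x <- rnorm(100)
> y <- 2 + 3 * x + rnorm(100)
> coef(lm(y ~ x))                      # ordinary linear regression
> coef(glm(y ~ x, family="gaussian"))  # identical estimates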


Reviewing GLMs IV

Y = α + β1X1 + · · · + βnXn + ε,   ε ∼ N(0, σ)

- How do we fit the parameters βi and σ (choose model coefficients)?
- There are two major approaches (deeply related, yet different) in widespread use:
  - The principle of maximum likelihood: pick parameter values that maximize the probability of your data Y; that is, choose {βi} and σ that make the likelihood P(Y | {βi}, σ) as large as possible
  - Bayesian inference: put a probability distribution on the model parameters and update it on the basis of what parameters best explain the data:

    P({βi}, σ | Y) = P(Y | {βi}, σ) × P({βi}, σ) / P(Y)

    where P(Y | {βi}, σ) is the likelihood and P({βi}, σ) is the prior.
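To make "maximize the likelihood" concrete, here is a minimal sketch that hand-rolls the Gaussian log-likelihood and hands it to base R's optim(); the data are simulated for self-containment, and in practice lm()/glm() compute these estimates directly:

> set.seed(1)
> x <- rpois(100, 3) + 1
> y <- 400 + 5 * x + rnorm(100, 0, 100)
> negloglik <- function(p)             # p = c(alpha, beta, log.sigma)
+   -sum(dnorm(y, p[1] + p[2] * x, exp(p[3]), log=TRUE))
> fit <- optim(c(mean(y), 0, log(sd(y))), negloglik)
> c(fit$par[1:2], sigma=exp(fit$par[3]))  # ML estimates of alpha, beta, sigma
> coef(lm(y ~ x))                         # the closed-form fit agrees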


Reviewing GLMs V: a simple example

- You are studying non-word RTs in a lexical-decision task:
  tpozt: Word or non-word?
  houze: Word or non-word?
- Non-words with different neighborhood densities* should have different average RTs  *(= number of neighbors of edit distance 1)
- A simple model: assume that neighborhood density has a linear effect on average RT, and that trial-level noise is normally distributed* *(n.b. wrong, since RTs are skewed, but not horrible)
- If xi is neighborhood density, our simple model is
  RTi = α + βxi + εi,   εi ∼ N(0, σ)
- We need to draw inferences about α, β, and σ
  - e.g., "Does neighborhood density affect RT?" → is β reliably non-zero?


Reviewing GLMs VI

- We'll use length-4 nonword data from Bicknell et al. (2008) (thanks!), such as:
  Few neighbors: gaty, peme, rixy
  Many neighbors: lish, pait, yine
- There's a wide range of neighborhood density:

[Figure: histogram of number of neighbors (x-axis, 0-12) against frequency (y-axis, 0-8); the high-density nonword "pait" is marked.]


Reviewing GLMs VII: maximum-likelihood model fitting

RTi = α + βXi + εi,   εi ∼ N(0, σ)

- Here's a translation of our simple model into R:
  RT ∼ 1 + x
- The noise is implicit in asking R to fit a linear model
- (We can omit the 1 and write RT ∼ x; R assumes an intercept unless otherwise directed)
- Example of fitting via maximum likelihood: one subject from Bicknell et al. (2008), with Gaussian noise and implicit intercept:

> m <- glm(RT ~ neighbors, d, family="gaussian")
> summary(m)
[...]
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  382.997     26.837  14.271   <2e-16 ***
neighbors      4.828      6.553   0.737    0.466
> sqrt(summary(m)[["dispersion"]])
[1] 107.2248

The intercept estimate is α̂, the neighbors coefficient is β̂, and the dispersion-based value 107.2248 is σ̂.
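If your R is recent enough (stats::sigma() was added for fitted models in R 3.3, an assumption about your setup), an equivalent extraction of σ̂ is:

> sigma(m)   # equals sqrt(summary(m)[["dispersion"]]) for the Gaussian family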


Reviewing GLMs: maximum-likelihood fitting VIII

Intercept   383.00
neighbors     4.83
σ̂          107.22

- Estimated coefficients are what underlie "best linear fit" plots

[Figure: scatterplot of RT (ms; y-axis, 300-700) against number of neighbors (x-axis, 0-12) for this subject, with the fitted line overlaid.]
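Such a plot can be drawn straight from the fitted model; a minimal sketch, assuming the model m and data frame d from the previous slide:

> plot(RT ~ neighbors, d, xlab="Neighbors", ylab="RT (ms)")
> abline(m)   # line with intercept 383.00 and slope 4.83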


Reviewing GLMs IX: Bayesian model fitting

P({βi}, σ | Y) = P(Y | {βi}, σ) × P({βi}, σ) / P(Y)   (likelihood × prior, normalized)

- Alternative to maximum likelihood: Bayesian model fitting
- Simple (uniform, non-informative) prior: all combinations of (α, β, σ) equally probable
- Multiply by likelihood → posterior probability distribution over (α, β, σ)
- Bound the region of highest posterior probability containing 95% of probability density → HPD confidence region
- pMCMC (Baayen et al., 2008) is 1 minus the largest possible symmetric confidence interval wholly on one side of 0

[Figure: marginal posterior P(β | Y) over β ∈ (−10, 25), annotated pMCMC = 0.46; scatter of joint posterior samples of (intercept, neighbors slope) around ⟨α̂, β̂⟩, with intercept in 200-600 and slope in −10 to 20; marginal posterior P(α | Y) over 200-600.]
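To make "multiply by likelihood and sample the posterior" concrete, here is a minimal random-walk Metropolis sketch in base R under a flat prior on (α, β, log σ). This is an illustration only, not the machinery behind the slide's figures (those rely on MCMC sampling as in Baayen et al., 2008), and the data are simulated so the sketch is self-contained:

> set.seed(1)
> x <- rpois(100, 3) + 1                     # fake neighborhood densities
> y <- 383 + 4.8 * x + rnorm(100, 0, 107)    # fake single-subject RTs
> logpost <- function(p)                     # flat prior, so log posterior
+   sum(dnorm(y, p[1] + p[2] * x, exp(p[3]), log=TRUE))  # = log likelihood + const
> samp <- matrix(NA, 20000, 3)               # samples of (alpha, beta, log sigma)
> cur <- c(mean(y), 0, log(sd(y))); cur.lp <- logpost(cur)
> for (t in 1:20000) {
+   prop <- cur + rnorm(3, 0, c(10, 1, 0.05))   # random-walk proposal
+   prop.lp <- logpost(prop)
+   if (log(runif(1)) < prop.lp - cur.lp) { cur <- prop; cur.lp <- prop.lp }
+   samp[t, ] <- cur
+ }
> beta.samp <- samp[-(1:2000), 2]            # discard burn-in
> quantile(beta.samp, c(0.025, 0.975))       # 95% posterior interval for beta
> 2 * min(mean(beta.samp > 0), mean(beta.samp < 0))   # pMCMC-style two-sided value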


Multi-level Models

- But of course experiments don't have just one participant
- Different participants may have different idiosyncratic behavior
- And items may have idiosyncratic properties too
- We'd like to take these into account, and perhaps investigate them directly too
- This is what multi-level (hierarchical, mixed-effects) models are for!


Multi-level Models II

- Recap of the graphical picture of a single-level model:

[Figure: the single-level graphical model from before: model parameters θ at the top; for each observation i, the predictors xi and θ jointly determine the response yi.]


Multi-level Models III: the new graphical picture

[Figure: graphical picture of a multi-level model. Shared parameters θ ("fixed effects") and Σb, the parameters governing inter-cluster variability, sit at the top; cluster-specific parameters b1, b2, ..., bM ("random effects") are drawn according to Σb; each cluster m contributes predictors xm1, ..., xmnm and responses ym1, ..., ymnm.]


Multi-level Models IV

An example of a multi-level model:

- Back to your lexical-decision experiment:
  tpozt: Word or non-word?
  houze: Word or non-word?
- Non-words with different neighborhood densities should have different average decision times
- Additionally, different participants in your study may also have:
  - different overall decision speeds
  - differing sensitivity to neighborhood density
- You want to draw inferences about all these things at the same time


Multi-level Models V: Model construction

- Once again we'll assume for simplicity that the number of word neighbors x has a linear effect on mean reading time, and that trial-level noise is normally distributed
- Random effects, starting simple: let each participant i have idiosyncratic differences in reading speed:

  RTij = α + βxij + bi + εij,   bi ∼ N(0, σb),   εij ∼ N(0, σε)

- In R, we'd write this relationship as
  RT ∼ 1 + x + (1 | participant)
- Once again we can leave off the 1 (RT ∼ x + (1 | participant)), and the noise term εij is implicit
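The same formula notation also covers the by-participant sensitivity to neighborhood density mentioned earlier; a sketch of that extension (not fitted in this talk) adds a participant-specific slope alongside the participant-specific intercept:

  RT ∼ x + (1 + x | participant)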


Multi-level Models VI: simulating data

RTij = α + βxij + bi + εij,   bi ∼ N(0, σb),   εij ∼ N(0, σε)

- One beauty of multi-level models is that you can simulate trial-level data
- This is invaluable for achieving deeper understanding of both your analysis and your data

## simulate some data
> sigma.b <- 125                     # inter-subject variation larger than
> sigma.e <- 40                      # intra-subject, inter-trial variation
> alpha <- 500
> beta <- 12
> M <- 6                             # number of participants
> n <- 50                            # trials per participant
> b <- rnorm(M, 0, sigma.b)          # individual differences
> nneighbors <- rpois(M*n, 3) + 1    # generate num. neighbors
> subj <- rep(1:M, n)
> RT <- alpha + beta * nneighbors +  # simulate RTs!
+       b[subj] + rnorm(M*n, 0, sigma.e)
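A useful follow-up check (a sketch, assuming the lme4 package that provides the lmer function introduced below) is that fitting the model to these simulated data approximately recovers the generating parameters:

> library(lme4)
> m.sim <- lmer(RT ~ nneighbors + (1 | subj))
> fixef(m.sim)     # should be near alpha = 500 and beta = 12
> VarCorr(m.sim)   # std. devs. near sigma.b = 125 and sigma.e = 40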


Multi-level Models VII: simulating data

[Figure: simulated RTs (400-1000 ms) plotted against number of neighbors (0-8), with one point cloud per participant and horizontal lines marking the subject intercepts, alpha + b[subj].]

- Participant-level clustering is easily visible
- This reflects the fact that inter-participant variation (125 ms) is larger than inter-trial variation (40 ms)
- And the effects of neighborhood density are also visible


Statistical inference with multi-level models

RTij = α + βxij + bi + εij,   bi ∼ N(0, σb),   εij ∼ N(0, σε)

- Thus far, we've just defined a model and used it to generate data
- We psycholinguists are usually in the opposite situation...
  - We have data and we need to infer a model
  - Specifically, the "fixed-effect" parameters α, β, and σε, plus the parameter governing inter-subject variation, σb
  - e.g., hypothesis tests about effects of neighborhood density: can we reliably infer that β is {non-zero, positive, ...}?
- Fortunately, we can use the same principles as before to do this:
  - The principle of maximum likelihood
  - Or Bayesian inference
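Under maximum likelihood, one standard way to ask "is β reliably non-zero?" is a likelihood-ratio test between nested models; a sketch, assuming the simulated data and lme4 from above (m0 and m1 are hypothetical names):

> m0 <- lmer(RT ~ 1 + (1 | subj), REML=FALSE)           # null model: no density effect
> m1 <- lmer(RT ~ nneighbors + (1 | subj), REML=FALSE)  # alternative model
> anova(m0, m1)   # likelihood-ratio (chi-squared) test for beta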


Fitting a multi-level model using maximum likelihood

RTij = α + βxij + bi + εij,   bi ∼ N(0, σb),   εij ∼ N(0, σε)

> m <- lmer(time ~ neighbors.centered +
+           (1 | participant), dat, REML=F)
> print(m, corr=F)
[...]
Random effects:
 Groups      Name        Variance Std.Dev.
 participant (Intercept)  4924.9   70.177
 Residual                19240.5  138.710
Number of obs: 1760, groups: participant, 44

Fixed effects:
                   Estimate Std. Error t value
(Intercept)         583.787     11.082   52.68
neighbors.centered    8.986      1.278    7.03

Here α̂ = 583.787, β̂ = 8.986, σ̂b = 70.177, and σ̂ε = 138.710.
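The predictor here is centered; its construction is not shown on the slide, but presumably it is just the raw neighbor count minus its mean (an assumption, including the column name dat$neighbors), which is what licenses reading the intercept as an "average" RT on the next slide:

> dat$neighbors.centered <- dat$neighbors - mean(dat$neighbors)  # assumed construction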


Interpreting parameter estimates

Intercept            583.79
neighbors.centered     8.99
σ̂b                    70.18
σ̂ε                   138.7

I The fixed effects are interpreted just as in a traditional single-level model:
I The "average" RT for a non-word in this study is 583.79 ms
I Every extra neighbor increases "average" RT by 8.99 ms
I Inter-trial variability σε also has the same interpretation:
I Inter-trial variability for a given participant is Gaussian, centered around the participant+word-specific mean with standard deviation 138.7 ms
I Inter-participant variability σb is what's new:
I Variability in average RT in the population from which the participants were drawn has standard deviation 70.18 ms
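These quantities can be read off the fitted object directly; a short sketch, assuming the model m fit above:

fixef(m)                   # alpha-hat and beta-hat, the fixed effects
as.data.frame(VarCorr(m))  # sdcor column: sigma_b-hat and sigma_eps-hat

## Predicted "average" RT for a non-word with 3 more neighbors than average:
fixef(m)["(Intercept)"] + 3 * fixef(m)["neighbors.centered"]
# on the slide's estimates: 583.79 + 3 * 8.99 = 610.76 ms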


Inferences about cluster-level parameters

RTij = α + β xij + bi + εij,   where bi ∼ N(0, σb) and εij ∼ N(0, σε) (noise)

I What about the participants' idiosyncrasies, the bi themselves?
I We can also draw inferences about these; you may have heard of them as BLUPs
I To understand these: committing to fixed-effect and random-effect parameter estimates determines a conditional probability distribution on the participant-specific effects, given the data:

P(bi | Y, α̂, β̂, σ̂b, σ̂ε)

I The BLUPs are the conditional modes of the bi: the choices that maximize the above probability
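In lme4 the conditional modes are available from the fitted object; a sketch, assuming the random-intercept model m from above:

blups <- ranef(m)$participant[, "(Intercept)"]   # one b_i per participant, in ms
head(blups)
## Participant-specific "average" RTs combine the shared and specific parts:
head(fixef(m)["(Intercept)"] + blups)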


Inferences about cluster-level parameters II

I The BLUP participant-specific "average" RTs for this dataset are black lines on the base of this graph

[Figure: density of participant-specific mean RTs; x-axis "Milliseconds" (400–800), y-axis "Probability density"]

I The solid line is a guess at their distribution
I The dotted line is the distribution predicted by the model for the population from which the participants are drawn
I Reasonably close correspondence
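A rough sketch of how such a figure can be reproduced from the fitted model (assuming the random-intercept model m from above; plot details are illustrative):

## Participant-specific "average" RTs: shared intercept plus each BLUP
part.means <- fixef(m)["(Intercept)"] + ranef(m)$participant[, "(Intercept)"]

plot(density(part.means), xlab = "Milliseconds",
     ylab = "Probability density", main = "")    # solid line: a guess at their distribution
curve(dnorm(x, mean = fixef(m)["(Intercept)"],
            sd = as.data.frame(VarCorr(m))$sdcor[1]),
      add = TRUE, lty = 3)                       # dotted line: the model's population prediction
rug(part.means)                                  # black lines at the base of the graph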


Inference about cluster-level parameters III

I Participants may also have idiosyncratic sensitivities to neighborhood density
I Incorporate by adding cluster-level slopes into the model:

RTij = α + β xij + b1i + b2i xij + εij,   where (b1i, b2i) ∼ N(0, Σb) and εij ∼ N(0, σε) (noise)

I In R (once again we can omit the 1's):

RT ∼ 1 + x + (1 + x | participant)

> lmer(RT ~ neighbors.centered +
       (neighbors.centered | participant), dat, REML=F)
[...]
Random effects:
 Groups      Name               Variance  Std.Dev. Corr
 participant (Intercept)         4928.625  70.2042
             neighbors.centered    19.421   4.4069 -0.307
 Residual                       19107.143 138.2286

The three numbers 70.2042, 4.4069, and -0.307 jointly characterize Σ̂b.
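To pull Σ̂b out of the fitted object, a sketch assuming the simulated dat from earlier (with time as the response rather than the slide's RT):

m2 <- lmer(time ~ neighbors.centered +
             (neighbors.centered | participant), dat, REML = FALSE)
VarCorr(m2)$participant   # 2x2 covariance matrix of (intercept, slope);
                          # its attributes hold the two SDs and the correlation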


Inference about cluster-level parameters IV

[Figure: scatterplot of participant-specific BLUPs, with the Intercept (450–750) on the x-axis and the neighbors slope (0–10) on the y-axis; one point per participant]

I Correlation visible in participant-specific BLUPs
I Participants who were faster overall also tend to be more affected by neighborhood density

Σ̂ = ( 70.20    −0.3097 )
    ( −0.3097    4.41  )
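A sketch of the corresponding plot, using the random-slope fit m2 from above; each point is one participant's pair of conditional modes:

re <- ranef(m2)$participant
plot(fixef(m2)["(Intercept)"] + re[, "(Intercept)"],
     fixef(m2)["neighbors.centered"] + re[, "neighbors.centered"],
     xlab = "Intercept", ylab = "neighbors")
## the negative intercept-slope correlation appears as a downward trend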


Bayesian inference for multilevel models

P({βi}, σb, σε | Y) = P(Y | {βi}, σb, σε) × P({βi}, σb, σε) / P(Y)

where P(Y | {βi}, σb, σε) is the likelihood and P({βi}, σb, σε) is the prior.

I We can also use Bayes' rule to draw inferences about fixed effects
I Computationally more challenging than with single-level regression; Markov chain Monte Carlo (MCMC) sampling techniques allow us to approximate it

[Figure: MCMC samples of (α, β), with marginal posterior densities P(α | Y) for the Intercept (560–620) and P(β | Y) for the neighbors slope (4–14)]
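The slides do not name a specific tool for this step; one option is simulation-based approximation with the arm package, sketched here under the assumption that it is applied to the model m from above (the n.sims value and plot details are illustrative):

library(arm)   # companion package to Gelman & Hill (2007)

## Approximate posterior samples for the fixed effects of model m
post <- sim(m, n.sims = 2000)
alpha.samples <- post@fixef[, 1]   # column 1: the intercept, alpha
beta.samples  <- post@fixef[, 2]   # column 2: the neighbors effect, beta

plot(density(alpha.samples), xlab = "alpha", main = "P(alpha | Y)")
quantile(beta.samples, c(0.025, 0.975))   # an approximate 95% interval for beta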


Why do you care???

I You may be asking yourself:

    Why did I come to this workshop? I could do everything you just did with an ANCOVA, treating participant as a random factor, or by looking at participant means.


Why do you care??? II

    Why did I come to this workshop? I could do everything you just did with an ANCOVA, treating participant as a random factor, or by looking at participant means.

I Yes, but there are several respects in which multi-level models go beyond AN(C)OVA:

1. They handle imbalanced datasets just as well as balanced datasets
2. You can use non-linear linking functions (e.g., logit models for binary-choice data)
3. You can cross cluster-level effects (see the sketch below):
I Every trial belongs to both a participant cluster and an item cluster
I You can build a single unified model for inferences from your data
I ANOVA requires separate by-participants and by-items analyses (quasi-F′ is quite conservative)
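An illustration of point 3, in the spirit of Baayen et al. (2008): crossed participant and item random effects in a single model. This sketch assumes dat also records which item each trial used; the item column is a hypothetical addition, not part of the original example.

## One unified model with crossed participant and item clusters
m.crossed <- lmer(time ~ neighbors.centered +
                    (1 | participant) + (1 | item),
                  dat, REML = FALSE)
summary(m.crossed)   # separate variance components for participants and items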


The logit link function for categorical data

I Much psycholinguistic data is categorical rather than continuous:
I Yes/no answers to alternation questions
I Speaker choice: (realized (that) her goals were unattainable)
I Cloze continuations, and so forth. . .
I Linear models are inappropriate; they predict continuous values
I We can stay within the multi-level generalized linear models framework but use different link functions and noise distributions to analyze categorical data
I e.g., the logit model (Agresti, 2002; Jaeger, 2008)

ηij = α + β Xij + bi                 (linear predictor)
ηij = log( μij / (1 − μij) )         (link function)
P(Y = 1 | μij) = μij                 (binomial noise distribution)
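A two-line sanity check of the link function using R's built-in logistic distribution functions (not part of the original slides):

qlogis(0.75)   # link: log-odds of mu = 0.75, i.e. log(0.75/0.25) = 1.099
plogis(1.099)  # inverse link: back from log-odds to a probability, about 0.75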


The logit link function for categorical data II

I We'll look at the effect of neighborhood density on correct identification as a non-word in Bicknell et al. (2008)
I Assuming that any effect is linear in log-odds space (see Jaeger (2008) for discussion)

> lmer(correct ~ neighbors.centered + (1 | participant),
       dat, family="binomial")
[...]
Random effects:
 Groups      Name        Variance Std.Dev.
 participant (Intercept) 0.9243   0.9614
Number of obs: 1760, groups: participant, 44

Fixed effects:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)         3.16310    0.18998  16.649  < 2e-16 ***
neighbors.centered -0.18207    0.03483  -5.228 1.72e-07 ***

α̂ = 3.163: participants are usually right. σ̂b = 0.961 (note there is no σ̂ε for logit models). β̂ = −0.182: the effect is small compared with inter-subject variation.
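The call above uses the 2009-era lme4 interface; in current lme4 the same model is fit with glmer(). A hedged sketch, assuming the same dat:

library(lme4)
m.logit <- glmer(correct ~ neighbors.centered + (1 | participant),
                 dat, family = binomial)
## Convert the estimated intercept from log-odds back to a probability:
plogis(fixef(m.logit)["(Intercept)"])
# for the reported alpha-hat of 3.163 this is about 0.96, i.e. participants
# identify the average-density non-word correctly about 96% of the time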


Summary

I Multi-level models may seem strange and foreign
I But all you really need to understand them is three basic things:
I Generalized linear models
I The principle of maximum likelihood
I Bayesian inference
I As you will see in the rest of the workshop (and the conference. . . ?), these models open up many new interesting doors!


References I

Agresti, A. (2002). Categorical Data Analysis. Wiley, second edition.

Baayen, R. H., Davidson, D. J., and Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4):390–412.

Bicknell, K., Elman, J. L., Hare, M., McRae, K., and Kutas, M. (2008). Online expectations for verbal arguments conditional on event knowledge. In Proceedings of the 30th Annual Conference of the Cognitive Science Society, pages 2220–2225.

Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models. Journal of Memory and Language, 59(4):434–446.