From L to N: Nonlinear Predictors in Generalized Models


Heather Turner

Independent Statistical/R Consultant

owing much to David Firth, University of Warwick

Heather Turner (Independent Consultant) From L to N GLM 40 Years On 2012 1 / 32


From L to N

In a GLM we have

$$g(\mu) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$$

and

$$\mathrm{Var}(Y) = \phi V(\mu)$$

A generalized nonlinear model (GNM) is the same as a GLM except that we have

$$g(\mu) = \eta(x; \beta)$$

where $\eta(x; \beta)$ is nonlinear in the parameters $\beta$.


Motivation

GNMs may be thought of as...

... an extension of Nonlinear Least Squares

- using a nonlinear function of a continuous variable to model a non-Gaussian response

... an extension of GLMs

- using nonlinear functions of parameters to produce a more parsimonious and interpretable model.


Example: Mental Health Status

The following contingency table cross-classifies a sample of 1660 residents of Manhattan by child’s mental impairment and parents’ socioeconomic status (Agresti, 2002)

##     MHS
## SES well mild moderate impaired
##   A   64   94       58       46
##   B   57   94       54       40
##   C   57  105       65       60
##   D   72  141       77       94
##   E   36   97       54       78
##   F   21   71       54       71


Independence

A simple analysis of these data might be to test for independence of MHS and SES using a chi-squared test.

This is equivalent to testing the goodness-of-fit of the independence model

$$\log(\mu_{rc}) = \alpha_r + \beta_c$$

Such a test compares the independence model to the saturated model

$$\log(\mu_{rc}) = \alpha_r + \beta_c + \gamma_{rc}$$

which may be over-complex.
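The goodness-of-fit test described here can be reproduced by hand. The following is a minimal sketch in plain Python (not the R code used for the talk): it computes the fitted means of the independence model from the table margins, then the Pearson statistic and the deviance. The function name `independence_fit` is made up for this illustration. The deviance should agree with the residual deviance of 47.4 on 15 degrees of freedom reported for `Freq ~ SES + MHS` in the analysis of deviance table later in the deck.

```python
import math

# Counts from the mental health table: rows are SES A-F,
# columns are MHS well, mild, moderate, impaired.
counts = [
    [64, 94, 58, 46],
    [57, 94, 54, 40],
    [57, 105, 65, 60],
    [72, 141, 77, 94],
    [36, 97, 54, 78],
    [21, 71, 54, 71],
]


def independence_fit(table):
    """Return fitted means, Pearson X^2, deviance G^2 and residual df."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    # Under independence the fitted mean is (row total * column total) / n.
    fitted = [[r * c / n for c in col_tot] for r in row_tot]
    pairs = [(o, e) for orow, erow in zip(table, fitted)
             for o, e in zip(orow, erow)]
    x2 = sum((o - e) ** 2 / e for o, e in pairs)
    g2 = 2 * sum(o * math.log(o / e) for o, e in pairs if o > 0)
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return fitted, x2, g2, df


fitted, x2, g2, df = independence_fit(counts)
print(f"X^2 = {x2:.2f}, G^2 = {g2:.2f}, df = {df}")
```

Because the fitted means reproduce the margins exactly, the Poisson deviance reduces to the familiar 2 Σ o log(o/e) form used in the sketch.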


Row-column Association

One intermediate model is the Row-Column association model:

$$\log(\mu_{rc}) = \alpha_r + \beta_c + \phi_r \psi_c$$

(Goodman, 1979), an example of a multiplicative interaction model.

For the Mental Health data:

## Analysis of Deviance Table
##
## Model 1: Freq ~ SES + MHS
## Model 2: Freq ~ SES + MHS + Mult(SES, MHS)
## Model 3: Freq ~ SES + MHS + SES:MHS
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        15       47.4
## 2         8        3.6  7     43.8  2.3e-07
## 3         0        0.0  8      3.6     0.89


Parameterisation

The independence model was defined earlier in an over-parameterised form:

$$
\begin{aligned}
\log(\mu_{rc}) &= \alpha_r + \beta_c \\
&= (\alpha_r + 1) + (\beta_c - 1) \\
&= \alpha_r^* + \beta_c^*
\end{aligned}
$$

Identifiability constraints may be imposed

- to fix a one-to-one mapping between parameter values and distributions

- to enable interpretation of parameters
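A small numerical illustration of the point, as a sketch in plain Python with made-up parameter values: shifting every row effect by +1 and every column effect by -1 leaves the fitted means unchanged, and imposing a constraint (here, first row effect set to zero, chosen only as an example) maps both versions to the same parameter values.

```python
import math

alpha = [0.3, -0.2, 0.5]   # hypothetical row effects
beta = [1.0, 0.4]          # hypothetical column effects


def fitted_means(alpha, beta):
    """Means of the independence model: mu_rc = exp(alpha_r + beta_c)."""
    return [[math.exp(a + b) for b in beta] for a in alpha]


def constrain_first_row_zero(alpha, beta):
    """Move the location so that the first row effect is exactly zero."""
    shift = alpha[0]
    return [a - shift for a in alpha], [b + shift for b in beta]


# An equivalent parameterisation of the same model
alpha_star = [a + 1 for a in alpha]
beta_star = [b - 1 for b in beta]

mu = fitted_means(alpha, beta)
mu_star = fitted_means(alpha_star, beta_star)

# After applying the constraint, both versions give the same parameters
alpha_c, beta_c = constrain_first_row_zero(alpha, beta)
alpha_c2, beta_c2 = constrain_first_row_zero(alpha_star, beta_star)
```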


Standard Implementation

The standard approach of all major statistical software packages is to apply the identifiability constraints in the construction of the model

$$g(\mu) = X\beta$$

so that $\mathrm{rank}(X)$ is equal to the number of parameters $p$.

Then the inverse in the score equations of the IWLS algorithm

$$\beta^{(r+1)} = \left(X^T W^{(r)} X\right)^{-1} X^T W^{(r)} z^{(r)}$$

exists.
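To make the update concrete, here is a minimal sketch of IWLS for one specific case, a Poisson GLM with log link and a full-rank design matrix, written in plain Python. It is an illustration under those assumptions, not the implementation used by any particular package; the helper names `solve` and `iwls_poisson` and the data are made up. For this family and link the working weights are $w = \mu$ and the working response is $z = \eta + (y - \mu)/\mu$.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def iwls_poisson(X, y, iterations=25):
    """IWLS for a Poisson GLM with log link; X must have full column rank."""
    p = len(X[0])
    eta = [math.log(yi + 0.5) for yi in y]      # crude starting values
    beta = [0.0] * p
    for _ in range(iterations):
        mu = [math.exp(e) for e in eta]
        w = mu                                   # working weights
        z = [e + (yi - mi) / mi for e, yi, mi in zip(eta, y, mu)]
        xtwx = [[sum(wi * xi[j] * xi[k] for wi, xi in zip(w, X))
                 for k in range(p)] for j in range(p)]
        xtwz = [sum(wi * xi[j] * zi for wi, xi, zi in zip(w, X, z))
                for j in range(p)]
        beta = solve(xtwx, xtwz)                 # (X'WX)^{-1} X'Wz
        eta = [sum(b * v for b, v in zip(beta, xi)) for xi in X]
    return beta


y = [2, 3, 6, 7, 8, 9, 10, 15]                   # made-up counts
X = [[1.0, float(i)] for i in range(len(y))]     # intercept and a trend
beta_hat = iwls_poisson(X, y)
print(beta_hat)
```

A fixed number of iterations keeps the sketch short; a real implementation would stop on a convergence criterion instead.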


Alternative Implementation

An alternative is to keep models in their over-parameterised form, so that $\mathrm{rank}(X) < p$, and use the generalised inverse in the IWLS updates:

$$\beta^{(r+1)} = \left(X^T W^{(r)} X\right)^{-} X^T W^{(r)} z^{(r)}$$

This approach is more useful for GNMs, since in this case it is much harder to define standard rules for specifying identifiability constraints.

Rather, identifiability constraints can be applied post-fitting for inference and interpretation.


Estimation of GNMs

GNMs present further technical difficulties compared with GLMs:

- automatic generation of starting values is hard

- the likelihood may have multiple optima

The default approach used in the gnm package for R is as follows:

- generate starting values randomly for nonlinear parameters and using a GLM fit for linear parameters

- use a one-parameter-at-a-time Newton method to update nonlinear parameters

- use the generalized IWLS to update all parameters

Consequently, the parameterisation returned is random.


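Parameterisation of RC Model

The RC model is invariant to changes in scale or location of the interaction parameters:

$$
\begin{aligned}
\log(\mu_{rc}) &= \alpha_r + \beta_c + \phi_r \psi_c \\
&= \alpha_r + \beta_c + (2\phi_r)(0.5\psi_c) \\
&= \alpha_r + (\beta_c - \psi_c) + (\phi_r + 1)(\psi_c)
\end{aligned}
$$

One way to constrain these parameters is as follows

$$
\phi_r^* = \frac{\phi_r - \dfrac{\sum_r w_r \phi_r}{\sum_r w_r}}
{\sqrt{\sum_r w_r \left(\phi_r - \dfrac{\sum_r w_r \phi_r}{\sum_r w_r}\right)^2}}
$$

where $w_r$ is the row probability, say, so that

$$\sum_r w_r \phi_r^* = 0, \qquad \sum_r w_r (\phi_r^*)^2 = 1$$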
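The constraint above amounts to a weighted centring and scaling of the scores. A minimal sketch in plain Python, assuming the weights are the SES row proportions of the mental health table and using made-up unconstrained scores (`normalise_scores` is a hypothetical helper, not a gnm function):

```python
import math


def normalise_scores(phi, w):
    """Centre and scale so that sum(w * phi*) = 0 and sum(w * phi*^2) = 1."""
    mean = sum(wi * pi for wi, pi in zip(w, phi)) / sum(w)
    scale = math.sqrt(sum(wi * (pi - mean) ** 2 for wi, pi in zip(w, phi)))
    return [(pi - mean) / scale for pi in phi]


# Row probabilities from the SES margins (A-F) of the mental health table
row_totals = [262, 245, 287, 384, 265, 217]
w = [t / sum(row_totals) for t in row_totals]

phi = [0.9, 0.8, 0.1, -0.2, -1.3, -2.4]   # hypothetical unconstrained scores
phi_star = normalise_scores(phi, w)

# A shifted and rescaled version of the same scores gives the same result
phi_alt = [2 * p + 1 for p in phi]
phi_alt_star = normalise_scores(phi_alt, w)
```

Note that rescaling by a negative constant would flip the sign of the normalised scores, so the sign still has to be fixed by convention.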


Row and Column Scores

The row and column scores for the RC model are

##                   Estimate Std. Error
## Mult(., MHS).SESA     1.11       0.30
## Mult(., MHS).SESB     1.12       0.31
## Mult(., MHS).SESC     0.37       0.32
## Mult(., MHS).SESD    -0.03       0.27
## Mult(., MHS).SESE    -1.01       0.31
## Mult(., MHS).SESF    -1.82       0.28

##                          Estimate Std. Error
## Mult(SES, .).MHSwell         1.68       0.19
## Mult(SES, .).MHSmild         0.14       0.20
## Mult(SES, .).MHSmoderate    -0.14       0.28
## Mult(SES, .).MHSimpaired    -1.41       0.17

As one might expect, the scores are ordered for both factors, suggesting the model for the dependence structure might be simplified further.


Biplot Model

Biplots are graphical displays of data arrays which represent the objects that index all dimensions of the array on the same plot.

So for a two-way table, a biplot represents both the rows and columns at the same time.

The biplot is constructed from a rank-2 representation of the data. Here we consider the generalized bilinear model

$$g(\mu_{ij}) = \alpha_{1i}\beta_{1j} + \alpha_{2i}\beta_{2j}$$


Example: Leaf Blotch Data

The proportion of leaf area affected by leaf blotch was recorded for 10 varieties of barley grown at nine sites (Gabriel, 1998).

Thus the response is a continuous variable in [0, 1].

Wedderburn (1974) suggested modelling these data using a logit link and a variance proportional to the square of that of the binomial, i.e. $V(\mu) = \mu^2(1 - \mu)^2$, a quasi-likelihood model.


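Geometrical Interpretation

Given the bilinear model

$$\mathrm{logit}(\mu_{ij}) = \alpha_{1i}\beta_{1j} + \alpha_{2i}\beta_{2j}$$

the effect of site $i$ can be represented by the point

$$(\alpha_{1i}, \alpha_{2i})$$

in the space spanned by the linearly independent basis vectors

$$a_1 = (\alpha_{11}, \alpha_{12}, \dots, \alpha_{19})^T$$

$$a_2 = (\alpha_{21}, \alpha_{22}, \dots, \alpha_{29})^T$$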


Visualising Sites and Varieties

Thus we can represent the sites and varieties separately as follows

[Figure: two scatter plots, "Site Effects" (sites A to I) and "Variety Effects" (varieties 1 to 9 and X), each showing Component 2 against Component 1 on axes running from −4 to 4.]


Obtaining Orthogonal Bases

Given the SVD of the matrix of predictors

$$\eta = U D V^T$$

matrices of orthogonal basis vectors on the same scale are given by

$$A = U D^{1/2}, \qquad B = D^{1/2} V^T$$

The model stays the same, but the parametrization changes.
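A short check of both claims, assuming the usual SVD conventions $U^T U = I$ and $V^T V = I$ with $D$ diagonal:

$$
\begin{aligned}
AB &= U D^{1/2} D^{1/2} V^T = U D V^T = \eta, \\
A^T A &= D^{1/2} U^T U D^{1/2} = D, \\
B B^T &= D^{1/2} V^T V D^{1/2} = D.
\end{aligned}
$$

The first line shows that the predictor, and hence the model, is unchanged. The other two show that the columns of $A$ are mutually orthogonal, as are the rows of $B$, and that the $k$-th column of $A$ and the $k$-th row of $B$ have the same squared length $d_k$, which is what puts the two sets of points on the same scale.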


Biplot

[Figure: "Biplot for barley data", showing sites A to I and varieties 1 to 9 and X together, Component 2 against Component 1 on axes running from −4 to 4. A second version of the same plot adds lines labelled h-axis and v-axis.]


Model Refinement

The biplot suggests that the sites could be represented by points along a line, with co-ordinates

$$(\gamma_i, \delta_0)$$

and the varieties by points on two lines perpendicular to the site line:

$$(\nu_0 + \nu_1 I(j \in \{2, 3, 6\}), \omega_j)$$

This corresponds to the following simplification of the bilinear model:

$$\alpha_{1i}\beta_{1j} + \alpha_{2i}\beta_{2j} \approx \gamma_i(\nu_0 + \nu_1 I(j \in \{2, 3, 6\})) + \delta_0 \omega_j$$

or equivalently, absorbing the constant $\delta_0$ into $\omega_j$,

$$\gamma_i(\nu_0 + \nu_1 I(j \in \{2, 3, 6\})) + \omega_j.$$


Double Additive Model

Gabriel (1998) described the model derived from the biplot as the double additive model.

An analysis of deviance confirms that this model is adequate for the leaf blotch data

## Analysis of Deviance Table
##
## Model 1: y ~ 0 + Mult(site, variety, inst = 1) + Mult(site,
##     variety, inst = 2)
## Model 2: y ~ variety + Mult(site, variety.binary) - 1
##   Resid. Df Resid. Dev  Df Deviance Pr(>Chi)
## 1        56         41
## 2        71         51 -15    -9.94      0.8


Stereotype Model

The stereotype model (Anderson, 1984) is suitable for ordered categorical data. It is a special case of the multinomial logistic model:

$$\mathrm{pr}(y_i = c \mid x_i) = \frac{\exp(\beta_{0c} + \beta_c^T x_i)}{\sum_r \exp(\beta_{0r} + \beta_r^T x_i)}$$

in which only the scale of the relationship with the covariates changes between categories:

$$\mathrm{pr}(y_i = c \mid x_i) = \frac{\exp(\beta_{0c} + \gamma_c \beta^T x_i)}{\sum_r \exp(\beta_{0r} + \gamma_r \beta^T x_i)}$$
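As an illustration of the stereotype form, the sketch below evaluates the category probabilities for one covariate vector in plain Python. The function name `stereotype_probs` and all parameter values are hypothetical. The key feature is that a single linear combination $\beta^T x$ is shared by all categories and only its multiplier $\gamma_c$ varies.

```python
import math


def stereotype_probs(x, beta0, gamma, beta):
    """Category probabilities pr(y = c | x) under the stereotype model."""
    lin = sum(b * xi for b, xi in zip(beta, x))           # shared beta^T x
    eta = [b0 + g * lin for b0, g in zip(beta0, gamma)]   # beta_0c + gamma_c * lin
    top = max(eta)                                        # guard against overflow
    expo = [math.exp(e - top) for e in eta]
    total = sum(expo)
    return [e / total for e in expo]


# Hypothetical parameters: 3 categories, 2 covariates
beta0 = [0.0, 0.5, -0.2]
gamma = [0.0, 0.6, 1.0]
beta = [1.2, -0.7]

probs = stereotype_probs([1.0, 2.0], beta0, gamma, beta)
print(probs)
```

With `gamma` fixed at 0 for the first category and 1 for the last, the location and scale of the multipliers are pinned down, which is one possible identifiability convention among several.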


Poisson Trick

The stereotype model can be fitted as a GNM by re-expressing the categorical data as category counts $Y_i = (Y_{i1}, \dots, Y_{ik})$.

Assuming a Poisson distribution for $Y_{ic}$, the joint distribution of $Y_i$ is Multinomial$(N_i, p_{i1}, \dots, p_{ik})$ conditional on the total count $N_i$.

The expected counts are then $\mu_{ic} = N_i p_{ic}$ and the parameters of the stereotype model can be estimated through fitting

$$
\begin{aligned}
\log \mu_{ic} &= \log(N_i) + \log(p_{ic}) \\
&= \alpha_i + \beta_{0c} + \gamma_c \sum_r \beta_r x_{ir}
\end{aligned}
$$

where the “nuisance” parameters $\alpha_i$ ensure that the multinomial denominators are reproduced exactly, as required.
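The step from the first line to the second can be spelled out by substituting the stereotype probabilities into $\log(p_{ic})$. In the worked equation below the index $s$ over categories is introduced only for this illustration, to avoid a clash with the covariate index $r$:

$$
\log \mu_{ic} = \underbrace{\log N_i - \log \sum_s \exp\Big(\beta_{0s} + \gamma_s \sum_r \beta_r x_{ir}\Big)}_{\alpha_i} + \beta_{0c} + \gamma_c \sum_r \beta_r x_{ir}
$$

Everything under the brace depends on the individual $i$ but not on the category $c$, so it can be replaced by a free parameter $\alpha_i$ for each individual.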


Augmented Least Squares

A disadvantage of using the Poisson trick is that the number of nuisance parameters can be large, making computation slow.

The algorithm can be adapted using augmented least squares.

For an ordinary least squares model,

$$
\left[(y \mid X)^T (y \mid X)\right]^{-1}
= \begin{pmatrix} y^T y & y^T X \\ X^T y & X^T X \end{pmatrix}^{-1}
= \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
$$

where $A_{11}$, $A_{12}$ and $A_{22}$ are functions of $y^T y$, $X^T y$ and $X^T X$.

Then it can be shown that

$$\hat{\beta} = (X^T X)^{-1} X^T y = -\frac{A_{21}}{A_{11}}$$

requiring only the first row (column) of the inverse to be found.
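The identity can be checked numerically. The sketch below, in plain Python with exact rational arithmetic, inverts the augmented cross-product matrix for a tiny made-up data set and compares $-A_{21}/A_{11}$ with the direct solution of the normal equations. The helpers `inverse` and `cross` are written for this illustration only.

```python
from fractions import Fraction


def inverse(a):
    """Invert a square matrix by Gauss-Jordan elimination, exactly."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(a)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        piv = m[col][col]
        m[col] = [v / piv for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]


def cross(a, b):
    """Return a^T b for matrices stored as lists of rows."""
    return [[sum(ra[i] * rb[j] for ra, rb in zip(a, b))
             for j in range(len(b[0]))] for i in range(len(a[0]))]


# Made-up data: intercept plus one covariate
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
y = [1, 3, 4, 8]

aug = [[yi] + xi for yi, xi in zip(y, X)]                  # (y | X)
A = inverse(cross(aug, aug))
beta_aug = [-A[j][0] / A[0][0] for j in range(1, len(A))]  # -A21 / A11

# Direct solution (X'X)^{-1} X'y for comparison
XtX_inv = inverse(cross(X, X))
Xty = cross(X, [[yi] for yi in y])
beta_direct = [sum(XtX_inv[j][k] * Xty[k][0] for k in range(len(Xty)))
               for j in range(len(XtX_inv))]
print(beta_aug, beta_direct)
```

Only the first column of `A` is used to obtain the estimates, which is the property exploited in the slides that follow.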


Application to Nuisance Parameters I

The same approach can be applied to the IWLS algorithm, letting

$$\tilde{X} = W^{1/2} (z \mid X)$$

Now let

$$\tilde{X} = (U \mid V)$$

where $V$ is the part of the design matrix corresponding to the nuisance factor.

$U$ is an $nk \times p$ matrix, where $n$ is the number of nuisance parameters, $k$ is the number of categories and $p$ is the number of model parameters, typically with $n \gg p$.

$V$ is an $nk \times n$ matrix of dummy variables identifying each individual.


Application to Nuisance Parameters II

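Then

$$
(\tilde{X}^T \tilde{X})^{-}
= \begin{pmatrix} U^T U & U^T V \\ V^T U & V^T V \end{pmatrix}^{-}
= \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}
$$

Again, only the first row (column) of this generalised inverse is required to estimate $\hat{\beta}$, so we are only interested in $B_{11}$ and $B_{12}$.

$$B_{11} = \left(U^T U - U^T V (V^T V)^{-1} V^T U\right)^{-}$$

$$B_{12}^T = -(V^T V)^{-1} V^T U B_{11}$$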


Elimination of the Nuisance Factor

$U^T U$ is $p \times p$, therefore not expensive to compute.

$V^T V$ and $V^T U$ can be computed without constructing the large $nk \times n$ matrix $V$, due to the structure of $V$:

- $V^T V$ is diagonal and the non-zero elements can be computed directly

- $V^T U$ is equivalent to aggregating the rows of $U$ by levels of the nuisance factor

Thus we only need to construct the $U$ matrix, saving memory and reducing the computational burden.
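The two shortcuts can be sketched in plain Python as follows. This is an illustration, not the gnm implementation: `aggregate_by_level` is a hypothetical helper, the data are made up, and `v_values` holds the single non-zero entry in each row of $V$ (1 for plain dummy variables, or the square root of the working weight if the weighting has been applied to the dummies, which is an assumption about how the weights enter).

```python
def aggregate_by_level(U, ids, v_values, n_levels):
    """Compute diag(V^T V) and V^T U from factor codes, without forming V."""
    p = len(U[0])
    vtv_diag = [0.0] * n_levels
    vtu = [[0.0] * p for _ in range(n_levels)]
    for row, level, v in zip(U, ids, v_values):
        vtv_diag[level] += v * v            # only one non-zero per row of V
        for j, value in enumerate(row):
            vtu[level][j] += v * value      # weighted sum of rows of U by level
    return vtv_diag, vtu


# Hypothetical example: 3 individuals x 2 categories, p = 2 columns in U
ids = [0, 0, 1, 1, 2, 2]
v_values = [1.0, 2.0, 0.5, 1.5, 1.0, 3.0]
U = [[1.0, 0.5], [0.0, 1.5], [1.0, 2.0], [0.0, 0.1], [1.0, 0.3], [0.0, 0.7]]

vtv_diag, vtu = aggregate_by_level(U, ids, v_values, 3)

# Explicit V, built here only to check the shortcut on a small example
V = [[v if level == j else 0.0 for j in range(3)]
     for level, v in zip(ids, v_values)]
```

The loop touches each row of $U$ once and stores only $n$ diagonal entries plus an $n \times p$ matrix, instead of the $nk \times n$ matrix $V$.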


Example: Back Pain Data

For 101 patients, 3 prognostic variables were recorded at baseline, then after 3 weeks the level of back pain was recorded (Anderson, 1984).

These data were converted to counts, for example for the first record:

##     x1 x2 x3                 pain count id
## 1    1  1  1                worse     0  1
## 1.1  1  1  1                 same     1  1
## 1.2  1  1  1   slight.improvement     0  1
## 1.3  1  1  1 moderate.improvement     0  1
## 1.4  1  1  1   marked.improvement     0  1
## 1.5  1  1  1      complete.relief     0  1
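The conversion to counts can be sketched as follows in plain Python. This is an illustration of the reshaping step only, not the function used to prepare the data for the talk; `expand_to_counts` is a hypothetical helper and the second record is made up.

```python
PAIN_LEVELS = ["worse", "same", "slight.improvement", "moderate.improvement",
               "marked.improvement", "complete.relief"]


def expand_to_counts(records, levels):
    """Turn one row per individual into one row per individual and category."""
    rows = []
    for ident, (covariates, outcome) in enumerate(records, start=1):
        for level in levels:
            rows.append({**covariates, "pain": level,
                         "count": int(level == outcome), "id": ident})
    return rows


# First record as in the table above; the second record is made up
records = [
    ({"x1": 1, "x2": 1, "x3": 1}, "same"),
    ({"x1": 2, "x2": 1, "x3": 0}, "marked.improvement"),
]

expanded = expand_to_counts(records, PAIN_LEVELS)
for row in expanded[:6]:
    print(row)
```

Each individual contributes one row per category with a single count of 1, and the `id` column is the nuisance factor whose parameters reproduce the multinomial totals.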


Back Pain Model

In this example, the expanded data is not that long (606 records) and the total number of parameters is only 115 (9 nonlinear), so the model does not take long to fit (< 1s!).

However, eliminating the linear parameters reduces the computation time by almost two-thirds, showing the potential of this technique.

Compare the stereotype model to the multinomial logistic model:

## Analysis of Deviance Table
##
## Model 1: count ~ pain + Mult(pain, x1 + x2 + x3) - 1
## Model 2: count ~ pain + pain:x1 + pain:x2 + pain:x3 - 1
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       493        303
## 2       485        299  8     4.08     0.85


Identifiability Constraints

In order to make the category-specific multipliers identifiable, we must constrain both the location and scale.

A simple way to do this is to set the first multiplier to zero and fix the coefficient of the first covariate to one.

##                      estimate    SE quasiSE quasiVar
## worse                   0.000 0.000  1.7797  3.16745
## same                   -3.710 1.826  0.4281  0.18330
## slight.improvement     -3.510 1.792  0.4025  0.16198
## moderate.improvement   -2.633 1.669  0.5519  0.30454
## marked.improvement     -4.612 1.895  0.3133  0.09817
## complete.relief        -5.372 2.000  0.4920  0.24202

Quasi standard errors (Firth and de Menezes, 2004) are invariant to the choice of reference class.
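Quasi-variances are constructed so that the variance of a difference between two categories is approximately the sum of their quasi-variances. The sketch below applies that approximation in plain Python to the quasiVar column of the table above; `approx_se_difference` is a hypothetical helper. Because this is an approximation, the values for contrasts with the reference class will be close to, but not identical to, the SE column.

```python
import math

# Quasi-variances copied from the quasiVar column of the table above
quasi_var = {
    "worse": 3.16745,
    "same": 0.18330,
    "slight.improvement": 0.16198,
    "moderate.improvement": 0.30454,
    "marked.improvement": 0.09817,
    "complete.relief": 0.24202,
}


def approx_se_difference(a, b, qv=quasi_var):
    """Approximate SE of (estimate[a] - estimate[b]) as sqrt(qv[a] + qv[b])."""
    return math.sqrt(qv[a] + qv[b])


# A contrast that cannot be read off the SE column directly
se_same_vs_marked = approx_se_difference("same", "marked.improvement")

# A contrast with the reference class, for comparison with the SE column
se_same_vs_worse = approx_se_difference("same", "worse")
print(se_same_vs_marked, se_same_vs_worse)
```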


Comparison Intervals

[Figure: "Intervals based on quasi standard errors", plotting the estimate (axis from −6 to 4) for each pain category: worse, same, slight improvement, moderate improvement, marked improvement, complete relief.]


Summary

Moving from GLMs to GNMs presents some technical difficulties, but provides a framework that covers several useful models.

Further examples can be found in the help files and manual accompanying the gnm package, which is available on CRAN.


References

Agresti, A. (2002). Categorical Data Analysis (2nd ed.). New York: Wiley.

Anderson, J. A. (1984). Regression and Ordered Categorical Variables. J. R. Statist. Soc. B 46(1), 1–30.

Firth, D. and R. X. de Menezes (2004). Quasi-variances. Biometrika 91, 65–80.

Gabriel, K. R. (1998). Generalised bilinear regression. Biometrika 85, 689–700.

Goodman, L. A. (1979). Simple models for the analysis of association in cross-classifications having ordered categories. J. Amer. Statist. Assoc. 74, 537–552.

Wedderburn, R. W. M. (1974). Quasi-likelihood Functions, Generalized Linear Models, and the Gauss-Newton Method. Biometrika 61, 439–447.
