From L to N: Nonlinear Predictors in Generalized Models


Heather Turner

Independent Statistical/R Consultant

owing much to David Firth, University of Warwick

Heather Turner (Independent Consultant) From L to N GLM 40 Years On 2012 1 / 32

From L to N

In a GLM we have

$$g(\mu) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$$

and

$$\mathrm{Var}(Y) = \phi V(\mu).$$

A generalized nonlinear model (GNM) is the same as a GLM except that we have

$$g(\mu) = \eta(x; \beta)$$

where $\eta(x; \beta)$ is nonlinear in the parameters $\beta$.


Motivation

GNMs may be thought of as...

... an extension of Nonlinear Least Squares

- using a nonlinear function of a continuous variable to model a non-Gaussian response

... an extension of GLMs

- using nonlinear functions of parameters to produce a more parsimonious and interpretable model.


Example: Mental Health Status

The following contingency table cross-classifies a sample of 1660 residents of Manhattan by child’s mental impairment and parents’ socioeconomic status (Agresti, 2002).

##    MHS
## SES well mild moderate impaired
##   A   64   94       58       46
##   B   57   94       54       40
##   C   57  105       65       60
##   D   72  141       77       94
##   E   36   97       54       78
##   F   21   71       54       71


Independence

A simple analysis of these data might be to test for independence of MHS and SES using a chi-squared test.

This is equivalent to testing the goodness-of-fit of the independence model

$$\log(\mu_{rc}) = \alpha_r + \beta_c$$

Such a test compares the independence model to the saturated model

$$\log(\mu_{rc}) = \alpha_r + \beta_c + \gamma_{rc}$$

which may be over-complex.
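As a rough illustration of the independence test, the sketch below computes the fitted means, the Pearson statistic and the deviance for the Mental Health table using NumPy. The function name `independence_fit` is an assumption made for this example, not something from the slides. The deviance it returns can be compared with the residual deviance reported for the independence model in the analysis of deviance table for these data.

```python
import numpy as np

# Counts from the Mental Health Status table
# (rows: SES A-F; columns: well, mild, moderate, impaired).
counts = np.array([
    [64, 94, 58, 46],
    [57, 94, 54, 40],
    [57, 105, 65, 60],
    [72, 141, 77, 94],
    [36, 97, 54, 78],
    [21, 71, 54, 71],
], dtype=float)


def independence_fit(table):
    """Return fitted means, Pearson X^2, deviance G^2 and residual df
    for the independence model (hypothetical helper for illustration)."""
    total = table.sum()
    fitted = np.outer(table.sum(axis=1), table.sum(axis=0)) / total
    pearson = ((table - fitted) ** 2 / fitted).sum()
    deviance = 2.0 * (table * np.log(table / fitted)).sum()
    df = (table.shape[0] - 1) * (table.shape[1] - 1)
    return fitted, pearson, deviance, df


fitted, pearson, deviance, df = independence_fit(counts)
print(f"X^2 = {pearson:.1f}, G^2 = {deviance:.1f}, df = {df}")
```

The saturated model reproduces the counts exactly, so its deviance is zero and the statistic above is the full deviance drop between the two models.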


Row-column Association

One intermediate model is the Row-Column association model:

$$\log(\mu_{rc}) = \alpha_r + \beta_c + \phi_r\psi_c$$

(Goodman, 1979), an example of a multiplicative interaction model.

For the Mental Health data:

## Analysis of Deviance Table
##
## Model 1: Freq ~ SES + MHS
## Model 2: Freq ~ SES + MHS + Mult(SES, MHS)
## Model 3: Freq ~ SES + MHS + SES:MHS
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        15       47.4
## 2         8        3.6  7     43.8  2.3e-07
## 3         0        0.0  8      3.6     0.89


Parameterisation

The independence model was defined earlier in an over-parameterised form:

$$\log(\mu_{rc}) = \alpha_r + \beta_c = (\alpha_r + 1) + (\beta_c - 1) = \alpha^*_r + \beta^*_c$$

Identifiability constraints may be imposed

- to fix a one-to-one mapping between parameter values and distributions

- to enable interpretation of parameters


Standard Implementation

The standard approach of all major statistical software packages is to apply the identifiability constraints in the construction of the model

$$g(\mu) = X\beta$$

so that rank(X) is equal to the number of parameters p.

Then the inverse in the score equations of the IWLS algorithm

$$\beta^{(r+1)} = \left(X^T W^{(r)} X\right)^{-1} X^T W^{(r)} z^{(r)}$$

exists.


Alternative Implementation

An alternative is to keep models in their over-parameterised form, so that rank(X) < p, and use the generalised inverse in the IWLS updates:

$$\beta^{(r+1)} = \left(X^T W^{(r)} X\right)^{-} X^T W^{(r)} z^{(r)}$$

This approach is more useful for GNMs, since in this case it is much harder to define standard rules for specifying identifiability constraints.

Rather, identifiability constraints can be applied post-fitting for inference and interpretation.
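To make the idea of fitting an over-parameterised model concrete, here is a minimal sketch of IWLS for a Poisson log-linear model in which the Moore-Penrose pseudo-inverse stands in for the generalised inverse. It is an illustration under stated assumptions, not the gnm implementation: the function name `iwls_poisson_pinv`, the starting values and the fixed number of iterations are all choices made for this example. The design matrix carries a dummy variable for every level of both factors, so it is rank deficient, yet the fitted means are still well defined.

```python
import numpy as np


def iwls_poisson_pinv(X, y, n_iter=25):
    """Poisson log-link IWLS using a pseudo-inverse, so X may be rank
    deficient (hypothetical helper for illustration)."""
    mu = y + 0.5                              # simple starting values for the means
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = np.log(mu)
        z = eta + (y - mu) / mu               # working response
        W = mu                                # working weights (Poisson, log link)
        XtWX = X.T @ (W[:, None] * X)
        beta = np.linalg.pinv(XtWX, rcond=1e-10) @ (X.T @ (W * z))
        mu = np.exp(X @ beta)
    return beta, mu


# Mental Health table (rows: SES A-F; columns: well, mild, moderate, impaired).
counts = np.array([
    [64, 94, 58, 46],
    [57, 94, 54, 40],
    [57, 105, 65, 60],
    [72, 141, 77, 94],
    [36, 97, 54, 78],
    [21, 71, 54, 71],
], dtype=float)

row_index, col_index = np.indices(counts.shape)
R = np.eye(counts.shape[0])[row_index.ravel()]   # one dummy per SES level
C = np.eye(counts.shape[1])[col_index.ravel()]   # one dummy per MHS level
X = np.hstack([R, C])                            # 10 columns, rank 9
y = counts.ravel()

beta, mu = iwls_poisson_pinv(X, y)

# Shifting row parameters by +1 and column parameters by -1 changes the
# parameter vector but not the linear predictor.
shift = np.concatenate([np.ones(counts.shape[0]), -np.ones(counts.shape[1])])
beta_shifted = beta + shift
```

The particular parameter vector returned depends on which generalised inverse is used; the fitted means do not.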


Estimation of GNMs

GNMs present further technical difficulties compared with GLMs:

- automatic generation of starting values is hard

- the likelihood may have multiple optima

The default approach used in the gnm package for R is as follows:

- generate starting values randomly for nonlinear parameters and using a GLM fit for linear parameters

- use a one-parameter-at-a-time Newton method to update nonlinear parameters

- use the generalized IWLS to update all parameters

Consequently, the parameterisation returned is random.


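Parameterisation of RC Model

The RC model is invariant to changes in scale or location of the interaction parameters:

$$\log(\mu_{rc}) = \alpha_r + \beta_c + \phi_r\psi_c$$

$$= \alpha_r + \beta_c + (2\phi_r)(0.5\psi_c)$$

$$= \alpha_r + (\beta_c - \psi_c) + (\phi_r + 1)\psi_c$$

One way to constrain these parameters is as follows:

$$\phi^*_r = \frac{\phi_r - \dfrac{\sum_s w_s\phi_s}{\sum_s w_s}}{\sqrt{\sum_t w_t\left(\phi_t - \dfrac{\sum_s w_s\phi_s}{\sum_s w_s}\right)^2}}$$

where $w_r$ is the row probability, say, so that

$$\sum_r w_r\phi^*_r = 0, \qquad \sum_r w_r(\phi^*_r)^2 = 1.$$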
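The invariance and the constraint can be checked numerically. The sketch below uses randomly generated (hypothetical) parameter values, since the slides do not list the fitted ones, and the helper names `rc_predictor` and `standardise_scores` are assumptions for this example. The weights are the row proportions of the Mental Health table.

```python
import numpy as np


def rc_predictor(alpha, beta, phi, psi):
    """Linear predictor of the RC model: alpha_r + beta_c + phi_r * psi_c."""
    return alpha[:, None] + beta[None, :] + np.outer(phi, psi)


def standardise_scores(phi, w):
    """Centre and scale scores so that sum(w * phi) = 0 and
    sum(w * phi**2) = 1 (weighted by w)."""
    phi = np.asarray(phi, dtype=float)
    w = np.asarray(w, dtype=float)
    centred = phi - np.sum(w * phi) / np.sum(w)
    return centred / np.sqrt(np.sum(w * centred ** 2))


rng = np.random.default_rng(1)
alpha, phi = rng.normal(size=6), rng.normal(size=6)   # hypothetical row parameters
beta, psi = rng.normal(size=4), rng.normal(size=4)    # hypothetical column parameters

eta = rc_predictor(alpha, beta, phi, psi)
eta_scaled = rc_predictor(alpha, beta, 2 * phi, 0.5 * psi)     # change of scale
eta_shifted = rc_predictor(alpha, beta - psi, phi + 1, psi)    # change of location

# Row proportions of the Mental Health table (row totals / 1660).
w = np.array([262, 245, 287, 384, 265, 217]) / 1660
phi_star = standardise_scores(phi, w)
```

All three predictors are identical, which is why a constraint such as the weighted centring and scaling is needed before the scores can be interpreted.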


Row and Column Scores

The row and column scores for the RC model are

##                   Estimate Std. Error
## Mult(., MHS).SESA     1.11       0.30
## Mult(., MHS).SESB     1.12       0.31
## Mult(., MHS).SESC     0.37       0.32
## Mult(., MHS).SESD    -0.03       0.27
## Mult(., MHS).SESE    -1.01       0.31
## Mult(., MHS).SESF    -1.82       0.28

##                           Estimate Std. Error
## Mult(SES, .).MHSwell          1.68       0.19
## Mult(SES, .).MHSmild          0.14       0.20
## Mult(SES, .).MHSmoderate     -0.14       0.28
## Mult(SES, .).MHSimpaired     -1.41       0.17

As one might expect, the scores are ordered for both factors, suggesting the model for the dependence structure might be simplified further.


Biplot Model

Biplots are graphical displays of data arrays which represent the objects that index all dimensions of the array on the same plot.

So for a two-way table, a biplot represents both the rows and columns at the same time.

The biplot is constructed from a rank-2 representation of the data. Here we consider the generalized bilinear model

$$g(\mu_{ij}) = \alpha_{1i}\beta_{1j} + \alpha_{2i}\beta_{2j}$$


Example: Leaf Blotch Data

The proportion of leaf area affected by leaf blotch was recorded for 10 varieties of barley grown at nine sites (Gabriel, 1998).

Thus the response is a continuous variable in [0, 1].

Wedderburn (1974) suggested modelling these data using a logit link and a variance proportional to the square of that of the binomial, i.e. $V(\mu) = \mu^2(1 - \mu)^2$, which gives a quasi-likelihood model.


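Geometrical Interpretation

Given the bilinear model

$$\mathrm{logit}(\mu_{ij}) = \alpha_{1i}\beta_{1j} + \alpha_{2i}\beta_{2j}$$

the effect of site i can be represented by the point

$$(\alpha_{1i}, \alpha_{2i})$$

in the space spanned by the linearly independent basis vectors

$$a_1 = (\alpha_{11}, \alpha_{12}, \dots, \alpha_{19})^T$$

$$a_2 = (\alpha_{21}, \alpha_{22}, \dots, \alpha_{29})^T$$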


Visualising Sites and Varieties

Thus we can represent the sites and varieties separately as follows

[Figure: two scatter plots, "Site Effects" (sites A to I) and "Variety Effects" (varieties 1 to 9 and X), each plotting Component 2 against Component 1 on axes running from −4 to 4.]


Obtaining Orthogonal Bases

Given the SVD of the matrix of predictors

$$\eta = UDV^T$$

matrices of orthogonal basis vectors on the same scale are given by

$$A = UD^{1/2}, \qquad B = D^{1/2}V^T$$

The model stays the same, but the parametrization changes.
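A small numerical check of this reparametrization, using a randomly generated rank-2 predictor matrix in place of the fitted one (the values are hypothetical; only the 9 sites by 10 varieties shape is taken from the leaf blotch example):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = rng.normal(size=(9, 2))     # hypothetical site effects (9 sites)
beta = rng.normal(size=(2, 10))     # hypothetical variety effects (10 varieties)
eta = alpha @ beta                  # rank-2 matrix of predictors

# Singular value decomposition; only the first two components are non-zero.
U, d, Vt = np.linalg.svd(eta, full_matrices=False)
U2, d2, Vt2 = U[:, :2], d[:2], Vt[:2, :]

A = U2 * np.sqrt(d2)                # A = U D^{1/2}: columns are orthogonal
B = np.sqrt(d2)[:, None] * Vt2      # B = D^{1/2} V^T: rows are orthogonal

print(np.round(A.T @ A, 6))         # diagonal matrix with the singular values
print(np.round(B @ B.T, 6))         # the same diagonal matrix
```

Both sets of basis vectors have squared lengths equal to the singular values, which is what puts sites and varieties on the same scale, while the product A B still reproduces the original predictors.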


Biplot

[Figure: "Biplot for barley data", plotting Component 2 against Component 1 on axes running from −4 to 4; sites: A to I, varieties: 1 to 9 and X. A second version of the same plot adds an h-axis and a v-axis.]


Model Refinement

The biplot suggests that the sites could be represented by points along a line, with co-ordinates

$$(\gamma_i, \delta_0),$$

and the varieties by points on two lines perpendicular to the site line:

$$(\nu_0 + \nu_1 I(j \in \{2, 3, 6\}), \omega_j).$$

This corresponds to the following simplification of the bilinear model:

$$\alpha_{1i}\beta_{1j} + \alpha_{2i}\beta_{2j} \approx \gamma_i(\nu_0 + \nu_1 I(j \in \{2, 3, 6\})) + \delta_0\omega_j$$

or equivalently

$$\gamma_i(\nu_0 + \nu_1 I(j \in \{2, 3, 6\})) + \omega_j.$$


Double Additive Model

Gabriel (1998) described the model derived from the biplot as the double additive model.

An analysis of deviance confirms that this model is adequate for the leaf blotch data:

##
## Model 1: y ~ 0 + Mult(site, variety, inst = 1) + Mult(site, variety, inst = 2)
## Model 2: y ~ variety + Mult(site, variety.binary) - 1
##
## 1 56 41
## 2 71 51 -15 -9.94 0.8


Stereotype Model

The stereotype model (Anderson, 1984) is suitable for ordered categorical data. It is a special case of the multinomial logistic model:

$$\mathrm{pr}(y_i = c \mid x_i) = \frac{\exp(\beta_{0c} + \beta_c^T x_i)}{\sum_r \exp(\beta_{0r} + \beta_r^T x_i)}$$

in which only the scale of the relationship with the covariates changes between categories:

$$\mathrm{pr}(y_i = c \mid x_i) = \frac{\exp(\beta_{0c} + \gamma_c\beta^T x_i)}{\sum_r \exp(\beta_{0r} + \gamma_r\beta^T x_i)}$$
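The relationship between the two models can be sketched in a few lines: the stereotype model is the multinomial logistic model with each category's coefficient vector restricted to a multiple of one shared vector. The function names and the randomly generated values below are assumptions for illustration only.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


def multinomial_logit_probs(X, beta0, B):
    """Category probabilities with one coefficient vector per category.
    X is n x p, beta0 has length k, B is k x p."""
    return softmax(beta0[None, :] + X @ B.T)


def stereotype_probs(X, beta0, gamma, beta):
    """Category probabilities with a single coefficient vector beta (length p)
    scaled by a category-specific multiplier gamma_c (length k)."""
    return softmax(beta0[None, :] + np.outer(X @ beta, gamma))


rng = np.random.default_rng(3)
n, p, k = 8, 3, 6                       # hypothetical sizes
X = rng.normal(size=(n, p))
beta0 = rng.normal(size=k)
gamma = rng.normal(size=k)
beta = rng.normal(size=p)

P_stereo = stereotype_probs(X, beta0, gamma, beta)
# The same probabilities from the multinomial model with beta_c = gamma_c * beta.
P_multi = multinomial_logit_probs(X, beta0, np.outer(gamma, beta))
```

In this parameterisation the covariate part uses k + p parameters rather than k × p, before any identifiability constraints are applied.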


Poisson Trick

The stereotype model can be fitted as a GNM by re-expressing the categorical data as category counts $Y_i = (Y_{i1}, \dots, Y_{ik})$.

Assuming a Poisson distribution for $Y_{ic}$, the joint distribution of $Y_i$ is Multinomial($N_i, p_{i1}, \dots, p_{ik}$) conditional on the total count $N_i$.

The expected counts are then $\mu_{ic} = N_i p_{ic}$ and the parameters of the stereotype model can be estimated through fitting

$$\log \mu_{ic} = \log(N_i) + \log(p_{ic}) = \alpha_i + \beta_{0c} + \gamma_c \sum_r \beta_r x_{ir}$$

where the “nuisance” parameters $\alpha_i$ ensure that the multinomial denominators are reproduced exactly, as required.


Augmented Least Squares

A disadvantage of using the Poisson trick is that the number of nuisance parameters can be large, making computation slow.

The algorithm can be adapted using augmented least squares.

For an ordinary least squares model,

$$\left[(y|X)^T (y|X)\right]^{-1} = \begin{pmatrix} y^Ty & y^TX \\ X^Ty & X^TX \end{pmatrix}^{-1} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$$

where $A_{11}$, $A_{12}$ and $A_{22}$ are functions of $y^Ty$, $X^Ty$ and $X^TX$.

Then it can be shown that

$$\hat\beta = (X^TX)^{-1}X^Ty = -\frac{A_{21}}{A_{11}}$$

requiring only the first row (column) of the inverse to be found.
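The identity is easy to verify numerically. The sketch below simulates a small regression (the data and coefficient values are hypothetical) and compares the estimate read off the first column of the inverse of the augmented cross-product matrix with the usual normal-equations solution.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 50, 3
X = rng.normal(size=(n, p))                       # simulated design matrix
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

aug = np.column_stack([y, X])                     # augmented matrix (y | X)
A = np.linalg.inv(aug.T @ aug)

A11 = A[0, 0]                                     # scalar top-left block
A21 = A[1:, 0]                                    # first column below the top-left block
beta_aug = -A21 / A11                             # estimate from the augmented inverse

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)      # ordinary least squares estimate
rss = np.sum((y - X @ beta_ols) ** 2)             # residual sum of squares

print(beta_aug)
print(beta_ols)
```

A by-product of the block inverse is that the top-left element is the reciprocal of the residual sum of squares, so the same column also delivers the quantity needed to estimate the error variance.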

Application to Nuisance Parameters I

The same approach can be applied to the IWLS algorithm, letting

$$\tilde X = W^{1/2}(z|X)$$

Now let

$$\tilde X = (U|V)$$

where V is the part of the design matrix corresponding to the nuisance factor.

U is an nk × p matrix, where n is the number of nuisance parameters, k is the number of categories and p is the number of model parameters, typically with n >> p.

V is an nk × n matrix of dummy variables identifying each individual.


Application to Nuisance Parameters II

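Then

$$(\tilde X^T\tilde X)^{-} = \begin{pmatrix} U^TU & U^TV \\ V^TU & V^TV \end{pmatrix}^{-} = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$$

Again, only the first row (column) of this generalised inverse is required to estimate $\hat\beta$, so we are only interested in $B_{11}$ and $B_{12}$.

$$B_{11} = \left(U^TU - U^TV(V^TV)^{-1}V^TU\right)^{-}$$

$$B_{12} = -(V^TV)^{-1}V^TUB_{11}$$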


Elimination of the Nuisance Factor

$U^TU$ is p × p, therefore not expensive to compute.

$V^TV$ and $V^TU$ can be computed without constructing the large nk × n matrix V, due to the structure of V:

- $V^TV$ is diagonal and the non-zero elements can be computed directly

- $V^TU$ is equivalent to aggregating the rows of U by levels of the nuisance factor

Thus we only need to construct the U matrix, saving memory and reducing the computational burden.
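The two shortcuts can be illustrated with a toy example. This is a sketch under simplifying assumptions, not the gnm code: the working weights are taken to be 1 so that V is a plain 0/1 dummy matrix, U is filled with random numbers, and V is built explicitly only so that the shortcuts can be checked against it.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k, p = 5, 3, 2                          # individuals, categories, columns of U
ids = np.repeat(np.arange(n), k)           # nuisance factor level for each of the n*k rows
U = rng.normal(size=(n * k, p))            # hypothetical U matrix

# Explicit dummy matrix, formed here only for checking.
V = np.eye(n)[ids]

# Shortcut 1: V^T V is diagonal, holding the number of rows at each level.
VtV_diag = np.bincount(ids, minlength=n).astype(float)

# Shortcut 2: V^T U sums the rows of U within each level of the nuisance factor.
VtU = np.zeros((n, p))
np.add.at(VtU, ids, U)

# B11 from the p x p matrix U^T U - U^T V (V^T V)^{-1} V^T U, without using V.
S = U.T @ U - VtU.T @ (VtU / VtV_diag[:, None])
B11 = np.linalg.pinv(S)

# Reference: top-left block of the inverse of the full cross-product matrix.
full = np.hstack([U, V])
B_full = np.linalg.inv(full.T @ full)
```

Only p × p and n × p arrays are formed in the shortcut route, whereas the explicit route needs the nk × n matrix V and an inverse of order p + n.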


Example: Back Pain Data

For 101 patients, 3 prognostic variables were recorded at baseline, then after 3 weeks the level of back pain was recorded (Anderson, 1984).

These data were converted to counts, for example for the first record:

##     x1 x2 x3                 pain count id
## 1    1  1  1                worse     0  1
## 1.1  1  1  1                 same     1  1
## 1.2  1  1  1   slight.improvement     0  1
## 1.3  1  1  1 moderate.improvement     0  1
## 1.4  1  1  1   marked.improvement     0  1
## 1.5  1  1  1      complete.relief     0  1


Back Pain Model

In this example, the expanded data is not that long (606 records) and the total number of parameters is only 115 (9 nonlinear), so the model does not take long to fit (< 1s!).

However, eliminating the linear parameters reduces the computation time by almost two-thirds, showing the potential of this technique.

Compare the stereotype model to the multinomial logistic model:

##
## Model 1: count ~ pain + Mult(pain, x1 + x2 + x3) - 1
## Model 2: count ~ pain + pain:x1 + pain:x2 + pain:x3 - 1
##
## 1 493 303
## 2 485 299 8 4.08 0.85


Identifiability Constraints

In order to make the category-specific multipliers identifiable, we must constrain both the location and scale.

A simple way to do this is to set the first multiplier to zero and fix the coefficient of the first covariate to one.

##                      estimate    SE quasiSE quasiVar
## worse                   0.000 0.000  1.7797  3.16745
## same                   -3.710 1.826  0.4281  0.18330
## slight.improvement     -3.510 1.792  0.4025  0.16198
## moderate.improvement   -2.633 1.669  0.5519  0.30454
## marked.improvement     -4.612 1.895  0.3133  0.09817
## complete.relief        -5.372 2.000  0.4920  0.24202

Quasi standard errors (Firth and de Menezes, 2004) are invariant to the choice of reference class.


Comparison Intervals

[Figure: "Intervals based on quasi standard errors", plotting the estimate (vertical axis from −6 to 4) with a comparison interval for each pain category: worse, same, slight improvement, moderate improvement, marked improvement, complete relief.]


Summary

Moving from GLMs to GNMs presents some technical difficulties, but provides a framework that covers several useful models.

Further examples can be found in the help files and manual accompanying the gnm package, which is available on CRAN.


References

Agresti, A. (2002). Categorical Data Analysis (2nd ed.). New York: Wiley.

Anderson, J. A. (1984). Regression and Ordered Categorical Variables. J. R. Statist. Soc. B 46(1), 1–30.

Firth, D. and R. X. de Menezes (2004). Quasi-variances. Biometrika 91, 65–80.

Gabriel, K. R. (1998). Generalised bilinear regression. Biometrika 85, 689–700.

Goodman, L. A. (1979). Simple models for the analysis of association in cross-classifications having ordered categories. J. Amer. Statist. Assoc. 74, 537–552.

Wedderburn, R. W. M. (1974). Quasi-likelihood Functions, Generalized Linear Models, and the Gauss-Newton Method. Biometrika 61, 439–447.

