Logistic Regression and Generalized Linear Models

Sridhar Mahadevan


University of Massachusetts

©Sridhar Mahadevan: CMPSCI 689 – p. 1/29

Topics

Generative vs. discriminative models

In many cases, it is difficult to model data using a parametric class-conditional density P(X|ω, θ)

Yet, in many problems, a linear decision boundary is adequate to separate the classes (also, Gaussian densities with a shared covariance matrix produce a linear decision boundary).

Logistic regression: discriminative model for classification that produces linear decision boundaries

Model fitting problem solved using maximum likelihood

Iterative gradient-based algorithm for solving nonlinear maximum likelihood equations

Recursive weighted least squares regression

Logistic regression is an instance of a generalized linear model (GLM), a family that covers a large variety of exponential-family models. GLMs can also be extended to generalized additive models (GAMs).


Discriminative vs. Generative Models

Both generative and discriminative approaches address the problem of modeling the discriminant function P(y|x) of output labels (or values) y conditioned on the input x.

In generative models, we estimate both the class prior P(y) and the class-conditional density P(x|y), and use Bayes rule to compute the discriminant.

$$P(y|x) \propto P(y)\,P(x|y)$$

Discriminative approaches model the conditional distribution P(y|x) directly, and ignore the marginal P(x).

We now turn to several instances of discriminative models, starting with logistic regression in this class and later covering other types, including support vector machines.


Generalized Linear Models

In linear regression, we model the output y as a linear function of the input variables, plus a noise term that is zero-mean Gaussian with constant variance.

y = g(x) + ε, where the conditional mean is E(y|x) = g(x) and ε is the noise term. Here g(x) = β^T x (where β_0 is an offset term).

We saw earlier that the maximum likelihood framework justifies the use of a squared error loss function, provided the errors are IID Gaussian (the value of the variance does not matter).

We want to generalize this idea of specifying a model family by specifying the type of error distribution:

When the output variable y is discrete (e.g., binary or multinomial), the noise term is not Gaussian, but binomial or multinomial.

A change in the mean is coupled with a change in the variance, and we want to be able to couple mean and variance in our model.

Generalized linear models provide a rich family of models based on specifying the error distribution.


Logit Function

Since the output variable y only takes on the values 0 and 1 (for binary classification), we need a different way of representing E(y|x) so that its range is the interval (0, 1).

One convenient form to use is the sigmoid or logistic function. Let us assume a vector-valued input variable x = (x_1, ..., x_p). The logistic function is S-shaped and approaches 0 (as β^T x → −∞) or 1 (as β^T x → ∞).

$$P(y = 1 \mid x, \beta) = \mu(x|\beta) = \frac{e^{\beta^T x}}{1 + e^{\beta^T x}} = \frac{1}{1 + e^{-\beta^T x}}$$

$$P(y = 0 \mid x, \beta) = 1 - \mu(x|\beta) = \frac{1}{1 + e^{\beta^T x}}$$

We assume an extra input x_0 = 1, so that β_0 is an offset. We can invert the above transformation to get the logit function

$$g(x|\beta) = \log \frac{\mu(x|\beta)}{1 - \mu(x|\beta)} = \beta^T x$$
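The relationship between the sigmoid and the logit can be checked numerically. The following is a minimal sketch; the coefficient vector and the input are made-up values chosen only for illustration, and the function names are not from the slides.

```python
import math

def sigmoid(eta):
    """Logistic function: maps a linear predictor eta to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def logit(mu):
    """Log-odds: the inverse of the sigmoid on (0, 1)."""
    return math.log(mu / (1.0 - mu))

# Hypothetical coefficients and input; x[0] = 1 carries the offset beta_0.
beta = [-1.0, 0.5, 2.0]
x = [1.0, 2.0, 0.25]
eta = sum(b * v for b, v in zip(beta, x))  # beta^T x

mu = sigmoid(eta)
print(mu, logit(mu), eta)  # logit(mu) recovers eta up to rounding
```

Applying logit to the output of sigmoid returns the linear predictor, which is the sense in which the logit "inverts" the transformation.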


Logistic Regression

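[Figure: diagram of the logistic regression model, with inputs X0, X1, X2 feeding the output y through weights β0, β1, β2.]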


Example Dataset for Logistic Regression

The data set we are analyzing is coronary heart disease in South Africa. The chd response (output) variable is binary (yes, no), and there are 9 predictor variables.

There are 462 instances, out of which 160 are cases (positive instances), and 302 are controls (negative instances).

The predictor variables are systolic blood pressure (sbp), tobacco, ldl, famhist, obesity, alcohol, age, adiposity, and typea.

Let's focus on a subset of the predictors: sbp, tobacco, ldl, famhist, obesity, alcohol, age.

We want to fit a model of the following form

$$P(\mathrm{chd} = 1 \mid x, \beta) = \frac{1}{1 + e^{-\beta^T x}}$$

where

$$\beta^T x = \beta_0 + \beta_1 x_{\mathrm{sbp}} + \beta_2 x_{\mathrm{tobacco}} + \beta_3 x_{\mathrm{ldl}} + \beta_4 x_{\mathrm{famhist}} + \beta_5 x_{\mathrm{age}} + \beta_6 x_{\mathrm{alcohol}} + \beta_7 x_{\mathrm{obesity}}$$


Noise Model for Logistic Regression

Let us try to represent the logistic regression model as y = µ(x|β) + ε and ask ourselves what sort of noise model is represented by ε.

Since y takes on the value 1 with probability µ(x|β), it follows that ε can also take on only two possible values, namely

If y = 1, then ε = 1 − µ(x|β) with probability µ(x|β).

Conversely, if y = 0, then ε = −µ(x|β) and this happens with probability (1 − µ(x|β)).

This analysis shows that the error term in logistic regression is a binomially distributed random variable (a shifted Bernoulli variable). Its moments can be computed readily:

E(ε) = µ(x|β)(1 − µ(x|β)) − (1 − µ(x|β))µ(x|β) = 0 (the error term has mean 0).

Var(ε) = E ε² − (E ε)² = E ε² = µ(x|β)(1 − µ(x|β)) (show this!)
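One way to carry out the "show this!" step, writing µ for µ(x|β) and using the two values that the error term can take (1 − µ with probability µ, and −µ with probability 1 − µ):

```latex
\begin{aligned}
E[\epsilon^2] &= \mu\,(1-\mu)^2 + (1-\mu)\,(-\mu)^2 \\
              &= \mu(1-\mu)\,\bigl[(1-\mu) + \mu\bigr] \\
              &= \mu(1-\mu), \\
\mathrm{Var}(\epsilon) &= E[\epsilon^2] - (E[\epsilon])^2 = \mu(1-\mu) - 0 = \mu(1-\mu).
\end{aligned}
```

The variance therefore changes with the mean, which is the coupling between mean and variance that generalized linear models are designed to capture.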


Maximum Likelihood for LR

Suppose we want to fit a logistic regression model to a dataset of n observations X = (x_1, y_1), ..., (x_n, y_n).

We can express the conditional likelihood of a single observation simply as

$$P(y_i \mid x_i, \beta) = \mu(x_i|\beta)^{y_i}\,(1 - \mu(x_i|\beta))^{1 - y_i}$$

Hence, the conditional likelihood of the entire dataset can be written as

$$P(Y \mid X, \beta) = \prod_{i=1}^{n} \mu(x_i|\beta)^{y_i}\,(1 - \mu(x_i|\beta))^{1 - y_i}$$

The conditional log-likelihood is then simply

$$l(\beta \mid X, Y) = \sum_{i=1}^{n} \Bigl[ y_i \log \mu(x_i|\beta) + (1 - y_i) \log(1 - \mu(x_i|\beta)) \Bigr]$$
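The conditional log-likelihood translates almost line for line into code. This is a small sketch on made-up toy data (four observations with x_0 = 1 as the offset input); the helper names `mu` and `log_likelihood` are chosen here for illustration.

```python
import math

def mu(x, beta):
    """Fitted probability P(y = 1 | x, beta) = sigmoid(beta^T x)."""
    eta = sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-eta))

def log_likelihood(beta, X, y):
    """Conditional log-likelihood l(beta | X, Y), summed over all observations."""
    total = 0.0
    for xi, yi in zip(X, y):
        m = mu(xi, beta)
        total += yi * math.log(m) + (1 - yi) * math.log(1.0 - m)
    return total

# Made-up toy data: each row is (x_0 = 1, x_1).
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]]
y = [0, 1, 0, 1]

print(log_likelihood([0.0, 0.0], X, y))   # every mu is 0.5 when beta = 0
print(log_likelihood([-2.0, 1.0], X, y))  # a beta that tracks the upward trend
```

With β = 0 every fitted probability is 0.5, so the log-likelihood is n log 0.5; a coefficient vector that follows the trend in the data scores higher.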


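Maximum Likelihood for LR

We solve the conditional log-likelihood equation by taking gradients

$$\frac{\partial l(\beta|X, Y)}{\partial \beta_k} = \sum_{i=1}^{n} \left[ y_i \frac{1}{\mu(x_i|\beta)} \frac{\partial \mu(x_i|\beta)}{\partial \beta_k} - (1 - y_i) \frac{1}{1 - \mu(x_i|\beta)} \frac{\partial \mu(x_i|\beta)}{\partial \beta_k} \right]$$

Using the fact that

$$\frac{\partial \mu(x_i|\beta)}{\partial \beta_k} = \frac{\partial}{\partial \beta_k} \left( \frac{1}{1 + e^{-\beta^T x_i}} \right) = \mu(x_i|\beta)(1 - \mu(x_i|\beta))\, x_{ik}$$

we get

$$\frac{\partial l(\beta|X, Y)}{\partial \beta_k} = \sum_{i=1}^{n} x_{ik}\,(y_i - \mu(x_i|\beta))$$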

Setting this to 0, since x_0 = 1 the first component of these equations reduces to

$$\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \mu(x_i|\beta)$$

The expected number of instances of each class must match the observed number.
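The closed-form gradient can be sanity-checked against finite differences of the log-likelihood. The sketch below uses made-up toy data and an arbitrary β; `numeric_gradient` is a helper introduced here for the comparison and is not part of the slides.

```python
import math

def mu(x, beta):
    eta = sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-eta))

def log_likelihood(beta, X, y):
    return sum(yi * math.log(mu(xi, beta)) + (1 - yi) * math.log(1.0 - mu(xi, beta))
               for xi, yi in zip(X, y))

def gradient(beta, X, y):
    """Score vector: component k is sum_i x_ik * (y_i - mu(x_i | beta))."""
    return [sum(xi[k] * (yi - mu(xi, beta)) for xi, yi in zip(X, y))
            for k in range(len(beta))]

def numeric_gradient(beta, X, y, h=1e-6):
    """Central finite differences of the log-likelihood, for comparison."""
    grad = []
    for k in range(len(beta)):
        up = beta[:]
        up[k] += h
        down = beta[:]
        down[k] -= h
        grad.append((log_likelihood(up, X, y) - log_likelihood(down, X, y)) / (2 * h))
    return grad

# Made-up toy data: each row is (x_0 = 1, x_1).
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]]
y = [0, 1, 0, 1]
beta = [-0.3, 0.2]

print(gradient(beta, X, y))
print(numeric_gradient(beta, X, y))
```

The two printed vectors should agree to several decimal places, which confirms that the score has the simple form x_ik (y_i − µ).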

Newton-Raphson Method

Newton's method is a general procedure for finding the roots of an equation f(θ) = 0. Newton's algorithm is based on the recursion

$$\theta_{t+1} = \theta_t - \frac{f(\theta_t)}{f'(\theta_t)}$$

In this form, Newton's method finds a zero of a function f, not its maximum. We want to find the maximum of the log-likelihood.

But the maximum of a function f(θ) occurs at a point where its derivative f′(θ) = 0. So, plugging in f′(θ) for f(θ) above, we get

$$\theta_{t+1} = \theta_t - \frac{f'(\theta_t)}{f''(\theta_t)}$$
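As a one-dimensional illustration of the recursion, the sketch below maximizes a concave function chosen purely for this example, f(θ) = θ − e^θ, whose maximum is at θ = 0. The function `newton_maximize` and the fixed iteration count are assumptions of this sketch.

```python
import math

def newton_maximize(f_prime, f_double_prime, theta, steps=20):
    """Iterate theta <- theta - f'(theta) / f''(theta) a fixed number of times."""
    for _ in range(steps):
        theta = theta - f_prime(theta) / f_double_prime(theta)
    return theta

# Illustrative concave function f(theta) = theta - exp(theta), maximized at theta = 0.
f_prime = lambda t: 1.0 - math.exp(t)
f_double_prime = lambda t: -math.exp(t)

theta_hat = newton_maximize(f_prime, f_double_prime, theta=1.0)
print(theta_hat)
```

Starting from θ = 1, the iterates shrink toward 0 very quickly, which is the fast local convergence that makes Newton's method attractive when second derivatives are cheap.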


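Fisher Scoring

In logistic regression, the parameter β is a vector, so we have to use the multivariate form of the Newton-Raphson algorithm

$$\beta_{t+1} = \beta_t - H^{-1}\,\nabla_\beta\, l(\beta_t|X, Y)$$

Here, ∇_β l(β_t|X, Y) is the vector of partial derivatives of the log-likelihood, and

$$H_{ij} = \frac{\partial^2 l(\beta|X, Y)}{\partial \beta_i\, \partial \beta_j}$$

is the Hessian matrix of second-order derivatives.

The use of Newton's method to find the solution to the conditional log-likelihood equation is called Fisher scoring. (Strictly, Fisher scoring replaces H by its expected value; for logistic regression the Hessian does not depend on y, so the two coincide.)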


Fisher Scoring for Maximum Likelihood

Taking the second derivative of the likelihood score equations gives us

$$\frac{\partial^2 l(\beta|X, Y)}{\partial \beta_k\, \partial \beta_m} = -\sum_{i=1}^{n} x_{ik}\, x_{im}\, \mu(x_i|\beta)(1 - \mu(x_i|\beta))$$

We can use matrix notation to write the Newton-Raphson algorithm for logistic regression. Define the n × n diagonal matrix

$$W = \begin{pmatrix} \mu(x_1|\beta)(1 - \mu(x_1|\beta)) & \cdots & 0 \\ 0 & \mu(x_2|\beta)(1 - \mu(x_2|\beta)) & \cdots \\ \vdots & & \vdots \\ 0 & \cdots & \mu(x_n|\beta)(1 - \mu(x_n|\beta)) \end{pmatrix}$$

Let Y be an n × 1 column vector of output values, X be the design matrix of size n × (p + 1) of input values, and P be the column vector of fitted probability values µ(x_i|β).


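Iterative Weighted Least Squares

The gradient of the log-likelihood can be written in matrix form as

$$\frac{\partial l(\beta|X, Y)}{\partial \beta} = \sum_{i=1}^{n} x_i\,(y_i - \mu(x_i|\beta)) = X^T (Y - P)$$

The Hessian can be written as

$$\frac{\partial^2 l(\beta|X, Y)}{\partial \beta\, \partial \beta^T} = -X^T W X$$

The Newton-Raphson algorithm then becomes

$$\begin{aligned}
\beta^{new} &= \beta^{old} + (X^T W X)^{-1} X^T (Y - P) \\
            &= (X^T W X)^{-1} X^T W \left( X \beta^{old} + W^{-1}(Y - P) \right) \\
            &= (X^T W X)^{-1} X^T W Z \quad \text{where } Z \equiv X \beta^{old} + W^{-1}(Y - P)
\end{aligned}$$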
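The Newton-Raphson update can be sketched in plain Python. The code uses the first form of the update, β_new = β_old + (X^T W X)^{-1} X^T (Y − P), and solves the linear system rather than forming the inverse. The toy data, the fixed iteration count, and the small Gaussian-elimination helper `solve` are all assumptions of this sketch, not part of the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def irls(X, y, iters=25):
    """Newton-Raphson for logistic regression, starting from beta = 0."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        mu = [sigmoid(sum(b * v for b, v in zip(beta, row))) for row in X]
        w = [m * (1.0 - m) for m in mu]  # diagonal entries of W
        XtWX = [[sum(w[i] * X[i][a] * X[i][c] for i in range(n)) for c in range(p)]
                for a in range(p)]
        score = [sum(X[i][a] * (y[i] - mu[i]) for i in range(n)) for a in range(p)]
        step = solve(XtWX, score)        # (X^T W X)^{-1} X^T (Y - P)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

# Made-up, non-separable toy data: each row is (x_0 = 1, x_1).
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0],
     [1.0, 2.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
y = [0, 0, 1, 0, 1, 1, 0, 1]

beta_hat = irls(X, y)
print(beta_hat)
```

At convergence the score X^T (Y − P) is zero, so in particular the fitted probabilities sum to the number of positive instances, as derived earlier.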


Weighted Least Squares Regression

Weighted least squares regression finds the best least-squares solution to the equation

$$W A x \approx W b$$

$$(W A)^T W A x = (W A)^T W b$$

$$x = (A^T C A)^{-1} A^T C b \quad \text{where } C = W^T W$$

Returning to logistic regression, we now see that β^new = (X^T W X)^{-1} X^T W Z is a weighted least squares regression, where X is the matrix A above, Z corresponds to the vector b above, and the diagonal weight matrix W with entries µ(x_i|β)(1 − µ(x_i|β)) plays the role of C above.

It is termed recursive weighted least squares (RWLS) because at each step the weight matrix W keeps changing (since the β's are changing). We can view each RWLS step as solving the following problem

$$\beta^{new} \leftarrow \arg\min_{\beta}\, (Z - X\beta)^T W (Z - X\beta)$$


Stochastic Gradient Ascent

Newton's method is often referred to as a second-order method, because it uses the Hessian. This can be expensive in large problems, because it involves matrix inversion.

One way to avoid this is to settle for slower convergence, but less work at each step. For each training instance (x, y) we can derive an incremental gradient update rule.

$$\frac{\partial l(\beta|x, y)}{\partial \beta_j} = x_j\,(y - \mu(x|\beta))$$

The stochastic gradient ascent rule can be written as (for instance (x_i, y_i))

$$\beta_j \leftarrow \beta_j + \alpha\,(y_i - \mu(x_i|\beta))\, x_{ij}$$

The convergence of this update rule can be unreliable, unlike Newton's method. It depends on tuning the learning rate α so that the steps are neither too small nor too large, and also on a cooling schedule for α.
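A minimal sketch of the stochastic gradient ascent rule follows. The toy data, the starting rate `alpha0`, the particular cooling schedule α = α0 / (1 + decay · epoch), and the fixed random seed are all assumptions made for illustration; other schedules are equally plausible.

```python
import math
import random

def mu(x, beta):
    eta = sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-eta))

def log_likelihood(beta, X, y):
    return sum(yi * math.log(mu(xi, beta)) + (1 - yi) * math.log(1.0 - mu(xi, beta))
               for xi, yi in zip(X, y))

def sgd_logistic(X, y, alpha0=0.1, decay=0.01, epochs=2000, seed=0):
    """Update beta one training instance at a time, shrinking the learning rate."""
    rng = random.Random(seed)
    beta = [0.0] * len(X[0])
    order = list(range(len(X)))
    for epoch in range(epochs):
        alpha = alpha0 / (1.0 + decay * epoch)  # cooling schedule (an assumed form)
        rng.shuffle(order)
        for i in order:
            err = y[i] - mu(X[i], beta)
            # beta_j <- beta_j + alpha * (y_i - mu(x_i | beta)) * x_ij
            beta = [b + alpha * err * xij for b, xij in zip(beta, X[i])]
    return beta

# Made-up, non-separable toy data: each row is (x_0 = 1, x_1).
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0],
     [1.0, 2.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
y = [0, 0, 1, 0, 1, 1, 0, 1]

beta_hat = sgd_logistic(X, y)
print(beta_hat, log_likelihood(beta_hat, X, y))
```

Each update touches a single instance and needs no matrix inversion, at the price of many more passes over the data than Newton's method requires.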


The LMS Algorithm

The LMS (least mean square) algorithm can be used to solve least squares regression problems incrementally, by taking the gradient of the loss function w.r.t. the parameters and adjusting the weights for each data instance.

$$h(x|\beta) = \beta_0 + \sum_{j=1}^{p} \beta_j x_j = \beta^T x$$

Given a data set D, we want to find the vector β that minimizes the (mean) squared error loss

$$L(h) = \frac{1}{2} \sum_{i} (y_i - h(x_i|\beta))^2$$

LMS algorithm: find the gradient on a particular instance (x, y)

$$\frac{\partial L(h)}{\partial \beta_j} = -(y - h(x|\beta))\, x_j$$

Adjust the weight in the direction of decreasing the error (negative gradient).

$$\beta_j \leftarrow \beta_j + \alpha\,(y - h(x|\beta))\, x_j$$
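The LMS update has the same shape as the stochastic gradient rule for logistic regression, with the linear prediction h(x|β) in place of the sigmoid. Here is a small sketch on noise-free toy data generated from y = 1 + 2 x_1; the data, the learning rate, and the epoch count are illustrative assumptions.

```python
def lms(X, y, alpha=0.05, epochs=2000):
    """Least mean squares: adjust beta after every single data instance."""
    beta = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = yi - sum(b * v for b, v in zip(beta, xi))  # y - h(x | beta)
            beta = [b + alpha * err * v for b, v in zip(beta, xi)]
    return beta

# Noise-free toy data generated from y = 1 + 2 * x_1 (x_0 = 1 is the offset input).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

beta_hat = lms(X, y)
print(beta_hat)
```

Because the data are exactly linear, the weights should settle close to the generating coefficients (1, 2); with noisy data a constant learning rate would leave them hovering around the least squares solution instead.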


Logistic Regression vs LDA

Recall from Bayes decision theory that when the class-conditional densities P(x|ω_i, µ_i, Σ) share the same underlying covariance matrix Σ, the decision boundary that separates the classes is a line (or hyperplane). Such Bayesian classifiers are called linear discriminant classifiers.

We see from above that logistic regression also produces a decision boundary that is a hyperplane, since its discriminant function was shown to be

$$g(x|\beta) = \log \frac{\mu(x|\beta)}{1 - \mu(x|\beta)} = \beta^T x$$

So, if both Gaussian LDA and logistic regression produce linear decision boundaries, which is preferable?

As a general rule, logistic regression works in a larger class of problems because it does not assume the underlying class-conditional densities are Gaussian!



A generalized linear model is specified using two functions

A link function that describes how the mean depends on the linear predictor: g(µ) = η, where η = β^T x.

For example, in logistic regression, the link function is the logit function g(µ) = log(µ / (1 − µ)) = β^T x

In linear regression, the link function is simply the identity g(µ) = µ

The inverse of the link function, g^{-1}(η) = µ, describes how the mean µ is recovered from the linear predictor. For logistic regression, the inverse link function is the sigmoid.

A variance function that specifies how the variance of the output y depends on the mean: Var(y) = φV(µ), where φ is constant.

For example, in logistic regression, the variance function is V(µ) = µ(1 − µ), because the output follows a binomial distribution.

In linear regression, the variance function is simply V(µ) = 1 because the error term is modeled as having constant variance.



Distribution    Link function        Variance function
Gaussian        µ                    1
Binomial        log(µ / (1 − µ))     µ(1 − µ)
Poisson         log µ                µ
Gamma           1/µ                  µ^2
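The table can be written down as a small lookup of functions. This is only a sketch: the dictionary name `GLM_FAMILIES` and its layout are invented here, and the inverse links are added alongside the links and variance functions listed in the table.

```python
import math

# Each entry: (link g, inverse link g^{-1}, variance function V).
GLM_FAMILIES = {
    "gaussian": (lambda m: m, lambda e: e, lambda m: 1.0),
    "binomial": (lambda m: math.log(m / (1.0 - m)),
                 lambda e: 1.0 / (1.0 + math.exp(-e)),
                 lambda m: m * (1.0 - m)),
    "poisson": (math.log, math.exp, lambda m: m),
    "gamma": (lambda m: 1.0 / m, lambda e: 1.0 / e, lambda m: m ** 2),
}

mean = 0.3  # an arbitrary mean that is valid for all four families
for name, (link, inverse_link, variance) in GLM_FAMILIES.items():
    eta = link(mean)
    print(name, eta, inverse_link(eta), variance(mean))
```

For every family, applying the inverse link to g(µ) returns the original mean, while the variance column shows how differently each distribution ties its spread to its mean.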


Multiway Classification

As one more example of a generalized linear model, we generalize the logistic regression model to a multinomial (e.g., sorting email into various categories)

$$\log \frac{P(Y = 1 \mid X = x, \beta)}{P(Y = K \mid X = x, \beta)} = \beta_1^T x$$

$$\log \frac{P(Y = 2 \mid X = x, \beta)}{P(Y = K \mid X = x, \beta)} = \beta_2^T x$$

$$\cdots$$

$$\log \frac{P(Y = K - 1 \mid X = x, \beta)}{P(Y = K \mid X = x, \beta)} = \beta_{K-1}^T x$$

It easily follows that

$$P(Y = k \mid X = x, \beta) = \frac{e^{\beta_k^T x}}{1 + \sum_{l=1}^{K-1} e^{\beta_l^T x}}, \quad k = 1, \ldots, K - 1$$

$$P(Y = K \mid X = x, \beta) = \frac{1}{1 + \sum_{l=1}^{K-1} e^{\beta_l^T x}}$$
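The class probabilities can be computed directly from the K − 1 coefficient vectors, with class K acting as the reference class. The coefficients and input below are hypothetical values for K = 3, and the function name is chosen for this sketch.

```python
import math

def multinomial_probs(betas, x):
    """betas holds K - 1 coefficient vectors; class K is the reference class."""
    scores = [math.exp(sum(b * v for b, v in zip(beta_k, x))) for beta_k in betas]
    denom = 1.0 + sum(scores)
    return [s / denom for s in scores] + [1.0 / denom]

# Hypothetical coefficients for K = 3 classes and a 2-dimensional input (x_0 = 1).
betas = [[0.5, -1.0], [-0.2, 0.3]]
x = [1.0, 2.0]

probs = multinomial_probs(betas, x)
print(probs, sum(probs))
```

The probabilities sum to one, and the log of the ratio between class k and class K recovers the linear predictor for class k, matching the defining equations.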


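Multinomial Link Function

The multinomial PDF can be written as a member of the exponential family

$$\begin{aligned}
P(y|\phi) &= \phi_1^{1\{y=1\}}\, \phi_2^{1\{y=2\}} \cdots \phi_K^{1\{y=K\}} \\
          &= \phi_1^{y_1}\, \phi_2^{y_2} \cdots \phi_K^{1 - \sum_{i=1}^{K-1} y_i} \\
          &= e^{\eta^T y - a(\eta)}
\end{aligned}$$

where y_i is the indicator 1{y = i}.

The last expression above is an instance of the generic form of the exponential family (check the Weber handout or earlier notes)

Instead of a single link function, we now have a vector

$$\eta = \begin{pmatrix} \log \frac{\phi_1}{\phi_K} \\ \log \frac{\phi_2}{\phi_K} \\ \vdots \\ \log \frac{\phi_{K-1}}{\phi_K} \end{pmatrix}$$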


Fitting GLM Models in R

The statistics package R has comprehensive built-in features for fitting generalized linear models (and many related models as well).

This is a very brief intro to model fitting in R. See the R documentation, as well as the excellent text Statistical Models in S by Chambers and Hastie.

The main function is called glm(formula=, family=,...).

Here, formula is a symbolic description of the model that is to be fit.

family is the description of the appropriate error distribution and link function to be used.


Heart Disease Dataset

The data set we are analyzing is coronary heart disease in South Africa. The chd response (output) variable is binary (yes, no), and there are 9 predictor variables: systolic blood pressure, tobacco, ldl, famhist, obesity, alcohol, age, adiposity, and typea.

The R command glm(chd ~ sbp + tobacco + ldl + famhist + age + alcohol + obesity, family = binomial, data = heart) will fit a logistic regression model to the heart disease data set (assuming the dataset is loaded into the system as heart).

The R command glm(chd ~ sbp + tobacco + ldl + famhist + age + alcohol + obesity, family = binomial(link = probit), data = heart) will fit the same data, now using the probit link (the inverse CDF of the standard normal distribution).


Heart Disease Data

The following model was found to be the best fit to the data, using maximum likelihood.

Coefficient Estimate

(Intercept) -4.1290787

sbp 0.0057608

tobacco 0.0795237

ldl 0.1847710

famhistPresent 0.9391330

age 0.0425344

alcohol 0.0006058

obesity -0.0345467
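To see how the fitted coefficients are used, the sketch below plugs them into the logistic model for a single patient. The coefficient values are copied from the table above; the patient's measurements are invented for illustration and are not a row of the data set.

```python
import math

# Coefficient estimates copied from the table above.
coef = {
    "(Intercept)": -4.1290787,
    "sbp": 0.0057608,
    "tobacco": 0.0795237,
    "ldl": 0.1847710,
    "famhistPresent": 0.9391330,
    "age": 0.0425344,
    "alcohol": 0.0006058,
    "obesity": -0.0345467,
}

# A made-up patient, for illustration only (not a row of the data set).
patient = {"sbp": 140.0, "tobacco": 5.0, "ldl": 5.0, "famhistPresent": 1.0,
           "age": 50.0, "alcohol": 10.0, "obesity": 26.0}

# Linear predictor beta^T x, then the sigmoid to get P(chd = 1 | x).
eta = coef["(Intercept)"] + sum(coef[name] * value for name, value in patient.items())
prob = 1.0 / (1.0 + math.exp(-eta))
print(eta, prob)

# exp(coefficient) is the multiplicative change in the odds per unit increase.
odds_ratio_famhist = math.exp(coef["famhistPresent"])
print(odds_ratio_famhist)
```

Exponentiating a coefficient gives an odds ratio, so under this model a family history of heart disease multiplies the odds of chd by exp(0.939), holding the other predictors fixed.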


Polynomial Regression

Load the data with data(Cars93), which contains information about 93 models of cars (the data set ships with the MASS package, so load it first with library(MASS)).

names(Cars93) tells you the attributes that make up the dataset.

[1] "Manufacturer" "Model" "Type"

[4] "Min.Price" "Price" "Max.Price"

[7] "MPG.city" "MPG.highway" "AirBags"

[10] "DriveTrain" "Cylinders" "EngineSize"

[13] "Horsepower" "RPM" "Rev.per.mile"

[16] "Man.trans.avail" "Fuel.tank.capacity" "Passengers"

[19] "Length" "Wheelbase" "Width"

[22] "Turn.circle" "Rear.seat.room" "Luggage.room"

[25] "Weight" "Origin" "Make"

Let's assume we want to fit a polynomial regression model, and we choose Price as the predictor variable, and Weight as the response variable.

We can graph these two variables using: plot(Weight ~ Price, data = Cars93)


Fitting GLM Models in R: Cars

[Figure: scatter plot of Weight (vertical axis, 2000 to 4000) against Price (horizontal axis, 10 to 60) for the Cars93 data.]


Fitting GLM Models in R

The R command to fit a dth degree polynomial is glm(Weight ~ poly(Price, d), family = gaussian, data = Cars93)

Notice how we specified the family (the error distribution) to be gaussian; its default link function is the identity.

[Figure: plot of Weight (vertical axis, 2000 to 4000) against Price (horizontal axis, 10 to 60) for the Cars93 data.]


Generalized Additive Models

In the regression setting, a generalized additive model has the form

$$E(Y \mid X_1, \ldots, X_p) = \alpha + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p)$$

Here, the f_i are unspecified "smooth" (nonparametric) functions.

In the classification setting, a generalized additive logistic regression model has the form

$$\log \frac{\mu(X|\beta)}{1 - \mu(X|\beta)} = \alpha + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p)$$

See Section 9.1.2 in the book by Hastie et al. (The Elements of Statistical Learning) for how to fit additive logistic regression models to an email-spam dataset.

