
Logistic Regression, Discriminant Analysis and K-Nearest Neighbour

Tarek Dib

June 11, 2015

1 Logistic Regression Model - Single Predictor

p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}    (1)

2 Odds

\frac{p}{1 - p} = e^{\beta_0 + \beta_1 X}    (2)

3 Logit (Log Odds)

\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X    (3)

4 Summary

In a linear regression model, β_1 gives the average change in Y for every one-unit increase in X. In a logistic regression model, by contrast, increasing X by one unit changes the log odds by β_1, or equivalently multiplies the odds by e^{β_1}. Moreover, the amount that p(X) changes due to a one-unit change in X depends on the current value of X. Regardless of the value of X, however, if β_1 is positive then increasing X is associated with increasing p(X), and if β_1 is negative then increasing X is associated with decreasing p(X).
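As a concrete illustration (not part of the original notes), the following Python sketch fits a single-predictor logistic regression with scikit-learn on synthetic data and prints e^{β_1}, the factor by which a one-unit increase in X multiplies the odds. The data-generating values β_0 = −0.5 and β_1 = 2 are assumptions chosen for the example.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 1))                    # single predictor
    p = 1 / (1 + np.exp(-(-0.5 + 2.0 * X[:, 0])))    # equation (1) with assumed betas
    y = rng.binomial(1, p)                           # Bernoulli responses

    model = LogisticRegression(C=1e6).fit(X, y)      # large C: essentially no penalty
    beta1 = model.coef_[0, 0]

    # A one-unit increase in X adds beta1 to the log odds,
    # i.e. multiplies the odds p/(1 - p) by exp(beta1).
    print(f"beta1 = {beta1:.3f}, odds multiplier = {np.exp(beta1):.3f}")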

5 Linear Discriminant Analysis

An alternative approach to the logistic regression model is linear discriminant analysis (LDA). There are several reasons to choose LDA over logistic regression:

1. When the classes are well separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.

2. If n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.


3. Linear discriminant analysis is popular when we have more than two response classes.

Let π_k be the prior probability that a randomly chosen observation comes from the kth class.

Let f_k(x) = P(X = x | Y = k) denote the density function of X for an observation that comes from the kth class.

The posterior probability is then

p_k(x) = P(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}    (4)

The posterior probability p_k(x) is the probability that an observation X = x belongs to the kth class; that is, the probability that the observation belongs to the kth class, given the predictor value for that observation.
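As a small numerical sketch of equation (4), the code below evaluates the posterior for two Gaussian classes at a single point x; the priors, class means, and common σ are assumed values chosen for illustration.

    import numpy as np
    from scipy.stats import norm

    priors = np.array([0.3, 0.7])            # pi_k (assumed)
    means = np.array([-1.0, 1.0])            # class means (assumed)
    sd = 1.0                                 # common sigma (assumed)

    x = 0.2
    densities = norm.pdf(x, loc=means, scale=sd)          # f_k(x), k = 1, 2
    posterior = priors * densities / np.sum(priors * densities)
    print(posterior)                         # P(Y = k | X = x); sums to 1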

5.1 Linear Discriminant Analysis for p = 1

Suppose we assume that f_k(x) is normal, or Gaussian. In the one-dimensional setting, the normal density takes the form

f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left(-\frac{1}{2\sigma_k^2}(x - \mu_k)^2\right)    (5)

where µ_k and σ_k² are the mean and variance parameters for the kth class. Assuming a constant variance σ² across all classes, the density becomes

f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}(x - \mu_k)^2\right)    (6)

The LDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean and a common variance σ², and plugging estimates for these parameters into the Bayes classifier.

The linear discriminant function for a single predictor is found to be

\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)    (7)

Example: for K = 2 with equal priors (π_1 = π_2), an observation is assigned to class 1 if δ_1(x) − δ_2(x) > 0, which simplifies to 2x(µ_1 − µ_2) > µ_1² − µ_2².
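The sketch below illustrates the plug-in rule behind equation (7): estimate µ_k, a pooled σ², and π_k from training samples, then assign a new x to the class with the largest δ_k(x). The two Gaussian training samples are assumptions for the example.

    import numpy as np

    rng = np.random.default_rng(1)
    x1 = rng.normal(-1.0, 1.0, size=50)   # class 1 training sample (assumed)
    x2 = rng.normal(1.5, 1.0, size=50)    # class 2 training sample (assumed)

    n1, n2 = len(x1), len(x2)
    mu = np.array([x1.mean(), x2.mean()])                # class means
    # pooled estimate of the common variance sigma^2
    sigma2 = (np.sum((x1 - mu[0])**2) + np.sum((x2 - mu[1])**2)) / (n1 + n2 - 2)
    pi = np.array([n1, n2]) / (n1 + n2)                  # prior estimates

    def delta(x):
        # delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)
        return x * mu / sigma2 - mu**2 / (2 * sigma2) + np.log(pi)

    print("assigned class:", np.argmax(delta(0.3)) + 1)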

5.2 Linear Discriminant Analysis for p > 1

The multivariate Gaussian density is defined as

f(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)    (8)

The discriminant function is

\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k    (9)
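A minimal sketch of the multivariate discriminant in equation (9); the mean vectors, common covariance matrix, and priors here are hypothetical values, which in practice would be replaced by training-set estimates.

    import numpy as np

    Sigma = np.array([[1.0, 0.3],
                      [0.3, 1.0]])                       # common covariance (assumed)
    Sigma_inv = np.linalg.inv(Sigma)
    mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]   # class means (assumed)
    pis = [0.5, 0.5]                                     # class priors (assumed)

    def delta_k(x, mu, pi):
        # x^T Sigma^{-1} mu_k - (1/2) mu_k^T Sigma^{-1} mu_k + log(pi_k)
        return x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)

    x = np.array([1.2, 0.4])
    scores = [delta_k(x, mu, pi) for mu, pi in zip(mus, pis)]
    print("assigned class:", int(np.argmax(scores)) + 1)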


5.3 Quadratic Discriminant Analysis

LDA assumes that the observations within each class are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix that is common to all K classes. Quadratic discriminant analysis (QDA) provides an alternative approach. Like LDA, the QDA classifier results from assuming that the observations from each class are drawn from a Gaussian distribution, and plugging estimates for the parameters into Bayes' theorem in order to perform prediction. However, unlike LDA, QDA assumes that each class has its own covariance matrix. That is, it assumes that an observation from the kth class is of the form X ∼ N(µ_k, Σ_k), where Σ_k is the covariance matrix for the kth class. The discriminant function is

\delta_k(x) = -\frac{1}{2} x^T \Sigma_k^{-1} x + x^T \Sigma_k^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma_k^{-1} \mu_k - \frac{1}{2} \log|\Sigma_k| + \log \pi_k    (10)

Note that, unlike in LDA, the −(1/2) log|Σ_k| term does not cancel across classes and must be retained, since each class has its own covariance matrix.
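The sketch below illustrates the class-specific covariance assumption using scikit-learn's QuadraticDiscriminantAnalysis; the two Gaussian classes with different covariance matrices are assumed synthetic data.

    import numpy as np
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

    rng = np.random.default_rng(2)
    X1 = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], size=100)
    X2 = rng.multivariate_normal([2, 2], [[2.0, 0.8], [0.8, 0.5]], size=100)
    X = np.vstack([X1, X2])
    y = np.array([0] * 100 + [1] * 100)

    qda = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X, y)
    print(qda.covariance_[0])        # estimated Sigma_k for class 0
    print(qda.predict([[1.0, 1.0]])) # class assignment for a new point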

6 Summary - Logistic vs. LDA vs. KNN vs. QDA

Since logistic regression and LDA differ only in their fitting procedures, one might expect the two approaches to give similar results. This is often, but not always, the case. LDA assumes that the observations are drawn from a Gaussian distribution with a common covariance matrix in each class, and so can provide some improvements over logistic regression when this assumption approximately holds. Conversely, logistic regression can outperform LDA if these Gaussian assumptions are not met.

KNN takes a completely different approach from the classifiers seen in this chapter. In order to make a prediction for an observation X = x, the K training observations that are closest to x are identified. Then X is assigned to the class to which the plurality of these observations belong. Hence KNN is a completely non-parametric approach: no assumptions are made about the shape of the decision boundary. Therefore, we can expect this approach to dominate LDA and logistic regression when the decision boundary is highly non-linear. On the other hand, KNN does not tell us which predictors are important; we do not get a table of coefficients.
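As a sketch of the voting rule just described, the code below fits scikit-learn's KNeighborsClassifier with K = 5 to assumed synthetic data whose class boundary is highly non-linear.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(3)
    X = rng.normal(size=(100, 2))
    y = (X[:, 0] * X[:, 1] > 0).astype(int)    # highly non-linear boundary

    knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
    print(knn.predict([[0.5, -0.5]]))          # plurality vote of the 5 nearest points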

Finally, QDA serves as a compromise between the non-parametric KNN method and the linear LDA and logistic regression approaches. Since QDA assumes a quadratic decision boundary, it can accurately model a wider range of problems than the linear methods can. Though not as flexible as KNN, QDA can perform better in the presence of a limited number of training observations because it does make some assumptions about the form of the decision boundary.
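To make this comparison concrete, the sketch below scores all four classifiers by 5-fold cross-validated accuracy on one assumed synthetic problem with a quadratic boundary; which method wins depends entirely on the true decision boundary, so the exercise is illustrative only.

    import numpy as np
    from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                               QuadraticDiscriminantAnalysis)
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(4)
    X = rng.normal(size=(300, 2))
    y = (X[:, 0]**2 + X[:, 1] > 1).astype(int)   # quadratic true boundary

    models = {
        "logistic":  LogisticRegression(),
        "LDA":       LinearDiscriminantAnalysis(),
        "QDA":       QuadraticDiscriminantAnalysis(),
        "KNN (K=5)": KNeighborsClassifier(n_neighbors=5),
    }
    for name, model in models.items():
        print(name, cross_val_score(model, X, y, cv=5).mean())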

7 References

James, Gareth; Witten, Daniela; Hastie, Trevor; Tibshirani, Robert (2013). An Introduction to Statistical Learning with Applications in R. Springer.
