Logistic Regression, Discriminant Analysis and K-Nearest Neighbour

Tarek Dib

June 11, 2015

1 Logistic Regression Model - Single Predictor

p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}    (1)

2 Odds

\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}    (2)

3 Logit, Log Odds

\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X    (3)

4 Summary

In the linear regression model, \beta_1 gives the average change in Y for every one-unit increase in X. In the logistic regression model, however, increasing X by one unit changes the log odds by \beta_1, or equivalently multiplies the odds by e^{\beta_1}. Moreover, the amount that p(X) changes due to a one-unit change in X depends on the current value of X. Regardless of the value of X, though, if \beta_1 is positive then increasing X is associated with an increase in p(X), and if \beta_1 is negative then increasing X is associated with a decrease in p(X).
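Below is a minimal sketch, not from the original notes, of fitting a single-predictor logistic regression and reading \beta_1 as a change in log odds, per equations (1)-(3). It uses Python with scikit-learn on synthetic data; all variable names and values are illustrative assumptions.

```python
# Sketch: single-predictor logistic regression and coefficient interpretation.
# Synthetic data; beta values chosen arbitrarily for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))                   # single predictor
p = 1 / (1 + np.exp(-(-0.5 + 2.0 * X[:, 0])))   # true beta_0 = -0.5, beta_1 = 2.0
y = rng.binomial(1, p)

# A very large C makes the default L2 penalty negligible, approximating the
# unpenalized maximum-likelihood fit of equation (1).
model = LogisticRegression(C=1e6).fit(X, y)
b0, b1 = model.intercept_[0], model.coef_[0, 0]

# A one-unit increase in X adds b1 to the log odds, i.e. multiplies the odds by e^b1.
print(f"beta_0 ~ {b0:.2f}, beta_1 ~ {b1:.2f}, odds multiplier e^beta_1 ~ {np.exp(b1):.2f}")
```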
5 Linear Discriminant Analysis

An alternative approach to the logistic regression model is LDA. There are several reasons to choose LDA over logistic regression:

1. When the classes are well separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.
2. If n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.
3. Linear discriminant analysis is popular when we have more than two response classes.

Let \pi_k be the prior probability that a randomly chosen observation comes from the kth class, and let f_k(x) = \Pr(X = x \mid Y = k) be the density function of an observation that comes from the kth class. The posterior probability is then

p_k(x) = \Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}    (4)

The posterior probability is the probability that an observation X = x belongs to the kth class; that is, the probability that the observation belongs to the kth class, given the predictor value for that observation.

5.1 Linear Discriminant Analysis for p = 1

Suppose we assume that f_k(x) is normal (Gaussian). In the one-dimensional setting, the normal density takes the form

f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\!\left(-\frac{1}{2\sigma_k^2}(x - \mu_k)^2\right)    (5)

where \mu_k and \sigma_k^2 are the mean and variance parameters for the kth class. Assuming a constant variance \sigma^2 across all classes,

f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu_k)^2\right)    (6)

The LDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean and a common variance \sigma^2, and plugging estimates of these parameters into the Bayes classifier. The linear discriminant function for a single predictor is

\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)    (7)

Example: for K = 2 with equal priors \pi_1 = \pi_2, an observation is assigned to class 1 if \delta_1(x) - \delta_2(x) > 0, which reduces to 2x(\mu_1 - \mu_2) > \mu_1^2 - \mu_2^2.

5.2 Linear Discriminant Analysis for p > 1

The multivariate Gaussian density is defined as

f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)    (8)

Discriminant function (a small plug-in sketch follows below):

\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log \pi_k    (9)
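The discriminant function in equation (9) can be evaluated directly with the usual plug-in estimates of the class means, pooled covariance, and priors. The following is a minimal NumPy sketch on assumed synthetic data, not code from the notes.

```python
# Sketch: evaluate delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log(pi_k)
# from equation (9) with sample estimates, then assign x to the class with the largest score.
import numpy as np

rng = np.random.default_rng(1)
# Two classes, p = 2 predictors, common covariance (the LDA assumption).
mu_true = {0: np.array([0.0, 0.0]), 1: np.array([2.0, 1.0])}
X = np.vstack([rng.multivariate_normal(mu_true[k], np.eye(2), size=100) for k in (0, 1)])
y = np.repeat([0, 1], 100)

classes = np.unique(y)
pi = np.array([np.mean(y == k) for k in classes])            # prior estimates
mu = np.array([X[y == k].mean(axis=0) for k in classes])     # class mean estimates
# Pooled (common) covariance estimate.
Sigma = sum((X[y == k] - mu[k]).T @ (X[y == k] - mu[k]) for k in classes) / (len(y) - len(classes))
Sigma_inv = np.linalg.inv(Sigma)

def lda_scores(x):
    # delta_k(x) for each class k.
    return np.array([x @ Sigma_inv @ mu[k] - 0.5 * mu[k] @ Sigma_inv @ mu[k] + np.log(pi[k])
                     for k in classes])

x_new = np.array([1.5, 0.5])
print("discriminant scores:", lda_scores(x_new), "-> class", np.argmax(lda_scores(x_new)))
```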
5.3 Quadratic Discriminant Analysis

LDA assumes that the observations within each class are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix that is common to all K classes. Quadratic discriminant analysis (QDA) provides an alternative approach. Like LDA, the QDA classifier results from assuming that the observations from each class are drawn from a Gaussian distribution and plugging estimates of the parameters into Bayes' theorem in order to perform prediction. Unlike LDA, however, QDA assumes that each class has its own covariance matrix; that is, it assumes that an observation from the kth class is of the form X \sim N(\mu_k, \Sigma_k), where \Sigma_k is the covariance matrix for the kth class. Discriminant function:

\delta_k(x) = -\frac{1}{2} x^T \Sigma_k^{-1} x + x^T \Sigma_k^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma_k^{-1} \mu_k - \frac{1}{2}\log|\Sigma_k| + \log \pi_k    (10)

6 Summary - Logistic Regression vs. LDA vs. KNN vs. QDA

Since logistic regression and LDA differ only in their fitting procedures, one might expect the two approaches to give similar results. This is often, but not always, the case. LDA assumes that the observations are drawn from a Gaussian distribution with a common covariance matrix in each class, and so can provide some improvement over logistic regression when this assumption approximately holds. Conversely, logistic regression can outperform LDA if these Gaussian assumptions are not met.

KNN takes a completely different approach from the classifiers discussed above. In order to make a prediction for an observation X = x, the K training observations closest to x are identified, and X is assigned to the class to which the plurality of these observations belong. KNN is therefore a completely non-parametric approach: no assumptions are made about the shape of the decision boundary. We can thus expect this approach to dominate LDA and logistic regression when the decision boundary is highly non-linear. On the other hand, KNN does not tell us which predictors are important; we do not get a table of coefficients.

Finally, QDA serves as a compromise between the non-parametric KNN method and the linear LDA and logistic regression approaches. Since QDA assumes a quadratic decision boundary, it can accurately model a wider range of problems than the linear methods can. Though not as flexible as KNN, QDA can perform better in the presence of a limited number of training observations because it does make some assumptions about the form of the decision boundary. (A small comparison sketch appears after the references.)

7 References

James, Gareth, Witten, Daniela, Hastie, Trevor & Tibshirani, Robert (2013), An Introduction to Statistical Learning with Applications in R.
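As a rough illustration of the comparison in Section 6, the sketch below fits all four classifiers on the same synthetic data and reports test accuracy. It uses Python with scikit-learn purely for illustration (the referenced text works in R); the dataset, parameters, and model settings are assumptions, not taken from the notes.

```python
# Sketch: logistic regression vs. LDA vs. QDA vs. KNN on one synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "logistic": LogisticRegression(),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KNN (K=5)": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    acc = model.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: test accuracy {acc:.3f}")
```

Which method wins depends on how well its assumptions match the data-generating process, as the summary above explains; the relative accuracies here will change with the shape of the true decision boundary and the number of training observations.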