Classification Algorithms I (Perceptron, SVMs, Logistic Regression)

CS57300 Data Mining, Spring 2016
Instructor: Bruno Ribeiro

2

Logistic Regression (part 2)

} Biased data happens when the probability of sampling an example differs from its prevalence in the population (πi ≠ αi),
where πi is the probability of seeing sample i and αi is the prevalence of i in the population

} Generally, a good idea to test training data for different weights
◦ The weights can be used to protect against sampling designs which could cause selection bias.
◦ The weights can be used to protect against misspecification of the model.

3

Training Logistic with Selection Bias

Log Likelihood = Σ_{i=1}^N (αi / πi) log P(Ci = k | φi)

} Consider K classes and N observations
} Let Ci be the class of the i-th example with feature vector φi

} Easier to assume 1-of-K coding of target variable ti

} Log-likelihood function (learned via iterative reweighted least squares, has unique maximum)
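A minimal sketch (not from the slides) of training under selection bias, assuming hypothetical arrays alpha and pi holding the population prevalences and sampling probabilities: scikit-learn's LogisticRegression accepts per-example weights through sample_weight, and weighting each example by αi/πi plays the role of the weighted log-likelihood above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data: 200 examples, 3 features, binary labels (assumed for illustration).
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Hypothetical selection-bias correction: alpha_i = prevalence in the population,
# pi_i = probability that example i was sampled into the training set.
alpha = np.where(y == 1, 0.5, 0.5)   # assume a balanced population
pi = np.where(y == 1, 0.8, 0.2)      # assume class 1 was oversampled

# Weighting each example by alpha_i / pi_i maximizes the weighted log-likelihood
# sum_i (alpha_i / pi_i) * log P(C_i | phi_i) from the slide.
clf = LogisticRegression()
clf.fit(X, y, sample_weight=alpha / pi)
print(clf.coef_, clf.intercept_)
```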

4

Multiclass Logistic Regression (MLR)

P(Ci = k | φi) = exp(wkᵀφi) / Σ_{h=1}^K exp(whᵀφi)

Log-likelihood:

Σ_{i=1}^N Σ_{k=1}^K ti,k log [ exp(wkᵀφi) / Σ_{h=1}^K exp(whᵀφi) ]
  = Σ_{i=1}^N Σ_{k=1}^K ti,k wkᵀφi − Σ_{i=1}^N log Σ_{h=1}^K exp(whᵀφi)

a.k.a. softmax
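A small numpy sketch (not from the slides) of the softmax probability and the expanded log-likelihood above, using a hypothetical weight matrix W of shape (K, d) and 1-of-K targets T.

```python
import numpy as np

def softmax_probs(W, Phi):
    """P(C_i = k | phi_i) = exp(w_k^T phi_i) / sum_h exp(w_h^T phi_i), one row per example."""
    scores = Phi @ W.T                            # (N, K) matrix of w_k^T phi_i
    scores -= scores.max(axis=1, keepdims=True)   # stabilize the exponentials
    expo = np.exp(scores)
    return expo / expo.sum(axis=1, keepdims=True)

def log_likelihood(W, Phi, T):
    """sum_i sum_k t_ik w_k^T phi_i  -  sum_i log sum_h exp(w_h^T phi_i)."""
    scores = Phi @ W.T
    return np.sum(T * scores) - np.sum(np.log(np.sum(np.exp(scores), axis=1)))

# Tiny example: N=4 examples, d=3 features, K=2 classes (values are hypothetical).
Phi = np.array([[1., 0., 2.], [0., 1., 1.], [2., 1., 0.], [1., 1., 1.]])
T = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])    # 1-of-K coding of the targets
W = np.zeros((2, 3))                              # one weight vector w_k per class
print(softmax_probs(W, Phi))                      # uniform 0.5 probabilities at W = 0
print(log_likelihood(W, Phi, T))                  # equals N * log(1/K) at W = 0
```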

} Example: word2vec (Mikolov et al., 2013)
} Words in a 5-word window: xt−2, xt−1, xt, xt+1, xt+2
} xt is the word of interest; the other words are nearby context words, modeled by the skip-gram probabilities below

} Learning is hard via plain gradient descent. Noise contrastive estimation is used instead (or MCMC, or hierarchical softmax).

5

What if Feature Vector is Latent in MLR?

P[xt+j | xt, W, φ] = exp(wxtᵀ φxt+j) / Σ_{k=1}^K exp(wxtᵀ φk)

P[xt−2, xt−1, xt+1, xt+2 | xt, W, φ] = Π_{j=−2,−1,1,2} P[xt+j | xt, W, φ]
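A toy numpy sketch of the skip-gram conditional probability and the product over a ±2 context window above; the vocabulary size K, embedding dimension, and the two embedding matrices W and Phi are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
K, d = 1000, 50                            # hypothetical vocabulary size and embedding dim
W = rng.normal(scale=0.1, size=(K, d))     # "input" vectors w_x for the center word
Phi = rng.normal(scale=0.1, size=(K, d))   # "output" feature vectors phi_x for context words

def p_context_given_center(center, context):
    """P[x_{t+j} = context | x_t = center]: softmax over all K output vectors."""
    scores = Phi @ W[center]               # w_{x_t}^T phi_k for every k
    scores -= scores.max()                 # numerical stabilization
    expo = np.exp(scores)
    return expo[context] / expo.sum()

def p_window(center, window):
    """Product over j in {-2, -1, 1, 2} of P[x_{t+j} | x_t], as on the slide."""
    p = 1.0
    for ctx in window:
        p *= p_context_given_center(center, ctx)
    return p

print(p_window(center=3, window=[10, 42, 7, 99]))
```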

} We go back to learning feature vectors in neural network representations

} For now we keep engineering our own features

6

Word2vec results

[Figure: "Country and Capital Vectors Projected by PCA" — countries (China, Japan, France, Russia, Germany, Italy, Spain, Greece, Turkey, Poland, Portugal) and their capitals (Beijing, Tokyo, Paris, Moscow, Berlin, Rome, Madrid, Athens, Ankara, Warsaw, Lisbon) plotted in two dimensions.]

Figure 2: Two-dimensional PCA projection of the 1000-dimensional Skip-gram vectors of countries and their capital cities. The figure illustrates the ability of the model to automatically organize concepts and learn implicitly the relationships between them, as during the training we did not provide any supervised information about what a capital city means.

which is used to replace every log P(wO|wI) term in the Skip-gram objective. Thus the task is to distinguish the target word wO from draws from the noise distribution Pn(w) using logistic regression, where there are k negative samples for each data sample. Our experiments indicate that values of k in the range 5–20 are useful for small training datasets, while for large datasets the k can be as small as 2–5. The main difference between the Negative sampling and NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples. And while NCE approximately maximizes the log probability of the softmax, this property is not important for our application.

Both NCE and NEG have the noise distributionPn(w) as a free parameter. We investigated a numberof choices for Pn(w) and found that the unigram distribution U(w) raised to the 3/4rd power (i.e.,U(w)3/4/Z) outperformed significantly the unigram and the uniform distributions, for both NCEand NEG on every task we tried including language modeling (not reported here).

2.3 Subsampling of Frequent Words

In very large corpora, the most frequent words can easily occur hundreds of millions of times (e.g., "in", "the", and "a"). Such words usually provide less information value than the rare words. For example, while the Skip-gram model benefits from observing the co-occurrences of "France" and "Paris", it benefits much less from observing the frequent co-occurrences of "France" and "the", as nearly every word co-occurs frequently within a sentence with "the". This idea can also be applied in the opposite direction; the vector representations of frequent words do not change significantly after training on several million examples.

To counter the imbalance between the rare and frequent words, we used a simple subsampling approach: each word wi in the training set is discarded with probability computed by the formula

P(wi) = 1 − √( t / f(wi) )    (5)
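A one-line sketch of the discard probability (5); the threshold t = 1e-5 and the word frequencies below are assumptions for illustration.

```python
import numpy as np

t = 1e-5                                                # assumed threshold
freqs = {"the": 0.05, "france": 1e-4, "paris": 5e-5}    # hypothetical relative frequencies f(w)

def discard_prob(f_w, t=t):
    """P(w) = 1 - sqrt(t / f(w)); clipped at 0 so rare words are never discarded."""
    return max(0.0, 1.0 - np.sqrt(t / f_w))

for w, f in freqs.items():
    print(w, round(discard_prob(f), 3))
```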


(Mikolov et al., 2013)

7

Classification without a Model


Least Squares Solution

8

Issues with Least Squares Classification

minimize Σ_j (y(xj) − hyperplane)²

LS cares too much about reducing the classification error of well-separable items

} Input Data
◦ Item features
◦ Item labels

} Output: given a new data point x′
◦ Find y(x′) that best approximates the possible value of t′

9

Input – Output of Our Problem

Dx = {x1, . . . , xN},  xn ∈ R^k,  k ≥ 1
Dt = {t1, . . . , tN},  tn ∈ {−1, 1}

} In what follows we will "ignore" all points that are far away from the border (separating hyperplane)

} Two algorithms:
◦ Perceptron
◦ SVMs

10

We Care About What is Close


Perceptron

Model:  y(x) = sign(wᵀx + b)

The dot product wᵀx is the product of: (i) the projection of x onto w, and (ii) the length of w

The dot product is 0 if x is perpendicular to w. Add b: if y(x) > 0 predict the positive class, if < 0 the negative class

Iterate over training examples until the error is below a pre-specified threshold Υ
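A minimal numpy sketch of the perceptron loop described above and in the Bishop figure that follows: whenever an example is misclassified, its signed feature vector is added to the weights. The learning rate eta and the toy data are assumptions for illustration, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)

# Linearly separable toy data with labels t_n in {-1, +1}.
X = np.vstack([rng.normal(loc=+2.0, size=(20, 2)),
               rng.normal(loc=-2.0, size=(20, 2))])
t = np.array([+1] * 20 + [-1] * 20)

w = np.zeros(2)
b = 0.0
eta = 1.0                                   # learning rate (assumed)

for epoch in range(100):
    errors = 0
    for x_n, t_n in zip(X, t):
        if t_n * (w @ x_n + b) <= 0:        # misclassified (or on the boundary)
            w += eta * t_n * x_n            # add the signed feature vector
            b += eta * t_n
            errors += 1
    if errors == 0:                         # converged: all points correctly classified
        break

print(epoch, w, b)
```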

Perceptron

Model:  y(x) = sign(wᵀx + b)

Learning: for each misclassified training example (xn, tn), update

w ← w + η tn xn   and   b ← b + η tn

13


Figure 4.7 Illustration of the convergence of the perceptron learning algorithm, showing data points from two classes (red and blue) in a two-dimensional feature space (φ1, φ2). The top left plot shows the initial parameter vector w shown as a black arrow together with the corresponding decision boundary (black line), in which the arrow points towards the decision region which is classified as belonging to the red class. The data point circled in green is misclassified and so its feature vector is added to the current weight vector, giving the new decision boundary shown in the top right plot. The bottom left plot shows the next misclassified point to be considered, indicated by the green circle, and its feature vector is again added to the weight vector, giving the decision boundary shown in the bottom right plot for which all data points are correctly classified.


C. Bishop 2006

} Model space
◦ Set of weights w and b (hyperplane boundary)

} Search algorithm
◦ Iterative refinement of weights w and b

} Score function
◦ Minimize misclassification rate

Perceptron components

} Perceptron convergence theorem:
◦ If an exact solution exists (i.e., if the data is linearly separable), then the perceptron algorithm will find an exact solution in a finite number of steps
◦ The "finite" number of steps can be very large; the smaller the gap, the longer the time to find it

} If the data is not linearly separable?
◦ The algorithm will not converge, and cycles develop. The cycles can be long and therefore hard to detect

16

Perceptron Algorithm Convergence

} Is there a better formulation for two classes?

} What about finding the hyperplane that is equidistant to the circled points?

} We arrived at Maximum Margin Classifiers (e.g., Support Vector Machines)

How Can We Extend the Perceptron Idea?


18

Linear Methods → Linear Boundaries?


FIGURE 4.1. The left plot shows some data from three classes, with linear decision boundaries found by linear discriminant analysis. The right plot shows quadratic decision boundaries. These were obtained by finding linear boundaries in the five-dimensional space X1, X2, X1X2, X1², X2². Linear inequalities in this space are quadratic inequalities in the original space.

Such transformations h(X), where h : ℝ^p → ℝ^q with q > p, will be explored in later chapters.

4.2 Linear Regression of an Indicator Matrix

Here each of the response categories is coded via an indicator variable. Thus if G has K classes, there will be K such indicators Yk, k = 1, . . . , K, with Yk = 1 if G = k else 0. These are collected together in a vector Y = (Y1, . . . , YK), and the N training instances of these form an N × K indicator response matrix Y. Y is a matrix of 0's and 1's, with each row having a single 1. We fit a linear regression model to each of the columns of Y simultaneously, and the fit is given by

Ŷ = X(XᵀX)⁻¹XᵀY.    (4.3)

Chapter 3 has more details on linear regression. Note that we have a coefficient vector for each response column yk, and hence a (p+1)×K coefficient matrix B̂ = (XᵀX)⁻¹XᵀY. Here X is the model matrix with p+1 columns corresponding to the p inputs, and a leading column of 1's for the intercept.

A new observation with input x is classified as follows:

• compute the fitted output f̂(x)ᵀ = (1, xᵀ)B̂, a K vector;

• identify the largest component and classify accordingly:

Ĝ(x) = argmax_{k∈G} f̂k(x).    (4.4)
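A compact numpy sketch of (4.3)–(4.4) on toy data (the dataset is an assumption): regress an N × K indicator matrix on the inputs with a leading column of 1's, then classify a new point by its largest fitted component.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 3-class data in 2D (hypothetical centers).
N, K, p = 150, 3, 2
X_raw = np.vstack([rng.normal(loc=c, size=(N // K, p)) for c in ((0, 0), (3, 3), (0, 4))])
g = np.repeat(np.arange(K), N // K)

Y = np.eye(K)[g]                           # N x K indicator response matrix
X = np.hstack([np.ones((N, 1)), X_raw])    # model matrix with leading column of 1's

B = np.linalg.solve(X.T @ X, X.T @ Y)      # (p+1) x K coefficient matrix, eq. (4.3)

def classify(x_new):
    f = np.concatenate(([1.0], x_new)) @ B     # fitted K-vector f(x)
    return int(np.argmax(f))                   # eq. (4.4): largest component wins

print(classify(np.array([2.8, 3.1])))      # likely class 1, the (3, 3) blob
```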

Data from three classes, with linear decision boundaries

Linear boundaries in the five-dimensional space ϕ(x) = (X1, X2, X1X2, X1², X2²)

φ ≡ φ(x)

ϕ is known as a "kernel"

} Back to our linear model (with non-linear features ϕ)

} For two classes, the class of item xn is tn ∈ {−1, +1}

} Then tn y(xn) > 0 means correctly classified

} Thus, the distance of xn from the decision hyperplane is tn y(xn)/∥w∥, and the maximum margin hyperplane maximizes the minimum such distance

19

How to Find The "Support" Points? (assuming linearly separable data)


encouraged to review the key concepts covered in Appendix E. Additional information on support vector machines can be found in Vapnik (1995), Burges (1998), Cristianini and Shawe-Taylor (2000), Muller et al. (2001), Scholkopf and Smola (2002), and Herbrich (2002).

The SVM is a decision machine and so does not provide posterior probabilities. We have already discussed some of the benefits of determining probabilities in Section 1.5.4. An alternative sparse kernel technique, known as the relevance vector machine (RVM), is based on a Bayesian formulation and provides posterior probabilistic outputs (Section 7.2), as well as having typically much sparser solutions than the SVM.

7.1. Maximum Margin Classifiers

We begin our discussion of support vector machines by returning to the two-classclassification problem using linear models of the form

y(x) = wTφ(x) + b (7.1)

where φ(x) denotes a fixed feature-space transformation, and we have made thebias parameter b explicit. Note that we shall shortly introduce a dual representationexpressed in terms of kernel functions, which avoids having to work explicitly infeature space. The training data set comprises N input vectors x1, . . . ,xN , withcorresponding target values t1, . . . , tN where tn ∈ {−1, 1}, and new data points xare classified according to the sign of y(x).

We shall assume for the moment that the training data set is linearly separable infeature space, so that by definition there exists at least one choice of the parametersw and b such that a function of the form (7.1) satisfies y(xn) > 0 for points havingtn = +1 and y(xn) < 0 for points having tn = −1, so that tny(xn) > 0 for alltraining data points.

There may of course exist many such solutions that separate the classes exactly.In Section 4.1.7, we described the perceptron algorithm that is guaranteed to finda solution in a finite number of steps. The solution that it finds, however, will bedependent on the (arbitrary) initial values chosen for w and b as well as on theorder in which the data points are presented. If there are multiple solutions all ofwhich classify the training data set exactly, then we should try to find the one thatwill give the smallest generalization error. The support vector machine approachesthis problem through the concept of the margin, which is defined to be the smallestdistance between the decision boundary and any of the samples, as illustrated inFigure 7.1.

In support vector machines the decision boundary is chosen to be the one for which the margin is maximized. The maximum margin solution can be motivated using computational learning theory, also known as statistical learning theory (Section 7.1.5). However, a simple insight into the origins of maximum margin has been given by Tong and Koller (2000), who consider a framework for classification based on a hybrid of generative and discriminative approaches. They first model the distribution over input vectors x for each class using a Parzen density estimator with Gaussian kernels


Figure 4.1 Illustration of the geometry of a linear discriminant function in two dimensions. The decision surface, shown in red, is perpendicular to w, and its displacement from the origin is controlled by the bias parameter w0. Also, the signed orthogonal distance of a general point x from the decision surface is given by y(x)/∥w∥.

[Figure 4.1 labels: axes x1, x2; vector w; point x and its projection x⊥; distances y(x)/∥w∥ and −w0/∥w∥; regions R1 (y > 0) and R2 (y < 0); boundary y = 0.]

an arbitrary point x and let x⊥ be its orthogonal projection onto the decision surface, so that

x = x⊥ + r w/∥w∥.    (4.6)

Multiplying both sides of this result by wᵀ and adding w0, and making use of y(x) = wᵀx + w0 and y(x⊥) = wᵀx⊥ + w0 = 0, we have

r = y(x)/∥w∥.    (4.7)

This result is illustrated in Figure 4.1. As with the linear regression models in Chapter 3, it is sometimes convenient to use a more compact notation in which we introduce an additional dummy 'input' value x0 = 1 and then define w̃ = (w0, w) and x̃ = (x0, x) so that

y(x) = w̃ᵀx̃.    (4.8)

In this case, the decision surfaces are D-dimensional hyperplanes passing throughthe origin of the D + 1-dimensional expanded input space.

4.1.2 Multiple classes

Now consider the extension of linear discriminants to K > 2 classes. We might

be tempted be to build a K-class discriminant by combining a number of two-classdiscriminant functions. However, this leads to some serious difficulties (Duda andHart, 1973) as we now show.

Consider the use of K−1 classifiers each of which solves a two-class problem ofseparating points in a particular class Ck from points not in that class. This is knownas a one-versus-the-rest classifier. The left-hand example in Figure 4.2 shows an

y(x)/||w|| determines distance to hyperplane



[Figure 7.1: decision boundary y = 0 with margin boundaries y = 1 and y = −1.]

Figure 7.1 The margin is defined as the perpendicular distance between the decision boundary and the closest of the data points, as shown on the left figure. Maximizing the margin leads to a particular choice of decision boundary, as shown on the right. The location of this boundary is determined by a subset of the data points, known as support vectors, which are indicated by the circles.

having a common parameter σ2. Together with the class priors, this defines an opti-mal misclassification-rate decision boundary. However, instead of using this optimalboundary, they determine the best hyperplane by minimizing the probability of errorrelative to the learned density model. In the limit σ2 → 0, the optimal hyperplaneis shown to be the one having maximum margin. The intuition behind this result isthat as σ2 is reduced, the hyperplane is increasingly dominated by nearby data pointsrelative to more distant ones. In the limit, the hyperplane becomes independent ofdata points that are not support vectors.

We shall see in Figure 10.13 that marginalization with respect to the prior distri-bution of the parameters in a Bayesian approach for a simple linearly separable dataset leads to a decision boundary that lies in the middle of the region separating thedata points. The large margin solution has similar behaviour.

Recall from Figure 4.1 that the perpendicular distance of a point x from a hyper-plane defined by y(x) = 0 where y(x) takes the form (7.1) is given by |y(x)|/∥w∥.Furthermore, we are only interested in solutions for which all data points are cor-rectly classified, so that tny(xn) > 0 for all n. Thus the distance of a point xn to thedecision surface is given by

tn y(xn) / ∥w∥ = tn (wᵀφ(xn) + b) / ∥w∥.    (7.2)

The margin is given by the perpendicular distance to the closest point xn from thedata set, and we wish to optimize the parameters w and b in order to maximize thisdistance. Thus the maximum margin solution is found by solving

arg max_{w,b} { (1/∥w∥) min_n [ tn (wᵀφ(xn) + b) ] }    (7.3)

where we have taken the factor 1/∥w∥ outside the optimization over n because w

Maximum margin solution


} Too complex to compute
} Brings hyperplane too close to point

} We can recast the problem into another optimization problem

} Data points for which the equality holds are said to be active
◦ All other data points are inactive
◦ There is always at least one active constraint
◦ When the margin is maximized there will be at least two active constraints

20

Modified Solution



does not depend on n. Direct solution of this optimization problem would be verycomplex, and so we shall convert it into an equivalent problem that is much easierto solve. To do this we note that if we make the rescaling w → κw and b → κb,then the distance from any point xn to the decision surface, given by tny(xn)/∥w∥,is unchanged. We can use this freedom to set

tn (wᵀφ(xn) + b) = 1    (7.4)

for the point that is closest to the surface. In this case, all data points will satisfy the constraints

tn (wᵀφ(xn) + b) ≥ 1,  n = 1, . . . , N.    (7.5)

This is known as the canonical representation of the decision hyperplane. In thecase of data points for which the equality holds, the constraints are said to be active,whereas for the remainder they are said to be inactive. By definition, there willalways be at least one active constraint, because there will always be a closest point,and once the margin has been maximized there will be at least two active constraints.The optimization problem then simply requires that we maximize ∥w∥−1, which isequivalent to minimizing ∥w∥2, and so we have to solve the optimization problem

arg min_{w,b} (1/2)∥w∥²    (7.6)

subject to the constraints given by (7.5). The factor of 1/2 in (7.6) is included forlater convenience. This is an example of a quadratic programming problem in whichwe are trying to minimize a quadratic function subject to a set of linear inequalityconstraints. It appears that the bias parameter b has disappeared from the optimiza-tion. However, it is determined implicitly via the constraints, because these requirethat changes to ∥w∥ be compensated by changes to b. We shall see how this worksshortly.

In order to solve this constrained optimization problem, we introduce Lagrange multipliers an ≥ 0, with one multiplier an for each of the constraints in (7.5), giving the Lagrangian function

L(w, b, a) = (1/2)∥w∥² − Σ_{n=1}^N an { tn(wᵀφ(xn) + b) − 1 }    (7.7)

where a = (a1, . . . , aN )T. Note the minus sign in front of the Lagrange multiplierterm, because we are minimizing with respect to w and b, and maximizing withrespect to a. Setting the derivatives of L(w, b,a) with respect to w and b equal tozero, we obtain the following two conditions

w = Σ_{n=1}^N an tn φ(xn)    (7.8)

0 = Σ_{n=1}^N an tn.    (7.9)

Constraint takes care of this part of the optimization

} The original problem becomes (as argmax ∥w∥⁻¹ = argmin ∥w∥²) a quadratic programming problem:
arg min_{w,b} (1/2)∥w∥²
s.t.  tn(wᵀφ(xn) + b) ≥ 1,  n = 1, . . . , N

21

Modified Solution 2/2


Quadratic programming

} Linear programming (LP) is a technique for the optimization of a linear objective function, subject to linear constraints on the variables

} Quadratic programming (QP) is a technique for the optimization of a quadratic function of several variables, subject to linear constraints on these variables

} The original problem becomes (as argmax ∥w∥⁻¹ = argmin ∥w∥²) a quadratic programming problem:
arg min_{w,b} (1/2)∥w∥²
s.t.  tn(wᵀφ(xn) + b) ≥ 1,  n = 1, . . . , N
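A sketch of solving the primal quadratic program above numerically; the toy data and the choice of scipy's SLSQP solver are assumptions, not something from the slides. It minimizes ½∥w∥² subject to the N linear constraints tn(wᵀxn + b) ≥ 1.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)

# Linearly separable toy data with t_n in {-1, +1} (hypothetical).
X = np.vstack([rng.normal(loc=+2.0, size=(15, 2)),
               rng.normal(loc=-2.0, size=(15, 2))])
t = np.array([+1] * 15 + [-1] * 15)

def objective(params):
    w = params[:2]
    return 0.5 * w @ w                      # (1/2) ||w||^2; b does not appear in the objective

constraints = [{"type": "ineq",
                "fun": lambda p, x=x, tn=tn: tn * (p[:2] @ x + p[2]) - 1.0}
               for x, tn in zip(X, t)]      # t_n (w^T x_n + b) - 1 >= 0

res = minimize(objective, x0=np.zeros(3), constraints=constraints, method="SLSQP")
w, b = res.x[:2], res.x[2]
print(w, b)
# Margins t_n (w^T x_n + b): the smallest ones should be ~1 (the active constraints).
print(np.round(np.sort(t * (X @ w + b))[:3], 3))
```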

23

Transforming Optimization Problem


} For a convex problem (no local minima) there is a dual problem that is equivalent to the primal problem (i.e., we can switch between them)

} Primal objective: minimize w.r.t. the M feature variables
◦ Quadratic programming problem with parameters w, b

} Dual objective: maximize w.r.t. the N instances
◦ The dual depends on inner products between feature vectors
◦ → simpler quadratic programming problem

} The original problem becomes (as argmax ∥w∥⁻¹ = argmin ∥w∥²) a quadratic programming problem:
arg min_{w,b} (1/2)∥w∥²
s.t.  tn(wᵀφ(xn) + b) ≥ 1,  n = 1, . . . , N

} We can solve the constrained optimization problem via Lagrange multipliers an ≥ 0

25

Modified Solution 2/2



} Setting the derivative of L(w, b, a) w.r.t. w and b to zero we get:
w = Σ_{n=1}^N an tn φ(xn)   and   0 = Σ_{n=1}^N an tn

} Eliminating w and b from the Lagrangian (7.7) using these conditions gives the dual:
maximize  L̃(a) = Σ_{n=1}^N an − (1/2) Σ_{n=1}^N Σ_{m=1}^N an am tn tm k(xn, xm)
s.t.  an ≥ 0,  Σ_{n=1}^N an tn = 0

26

Solution to Modified Problem



Eliminating w and b from L(w, b,a) using these conditions then gives the dualrepresentation of the maximum margin problem in which we maximize

L̃(a) = Σ_{n=1}^N an − (1/2) Σ_{n=1}^N Σ_{m=1}^N an am tn tm k(xn, xm)    (7.10)

with respect to a subject to the constraints

an ≥ 0,  n = 1, . . . , N,    (7.11)

Σ_{n=1}^N an tn = 0.    (7.12)

Here the kernel function is defined by k(x,x′) = φ(x)Tφ(x′). Again, this takes theform of a quadratic programming problem in which we optimize a quadratic functionof a subject to a set of inequality constraints. We shall discuss techniques for solvingsuch quadratic programming problems in Section 7.1.1.

The solution to a quadratic programming problem in M variables in general hascomputational complexity that is O(M3). In going to the dual formulation we haveturned the original optimization problem, which involved minimizing (7.6) over Mvariables, into the dual problem (7.10), which has N variables. For a fixed set ofbasis functions whose number M is smaller than the number N of data points, themove to the dual problem appears disadvantageous. However, it allows the model tobe reformulated using kernels, and so the maximum margin classifier can be appliedefficiently to feature spaces whose dimensionality exceeds the number of data points,including infinite feature spaces. The kernel formulation also makes clear the roleof the constraint that the kernel function k(x,x′) be positive definite, because thisensures that the Lagrangian function !L(a) is bounded below, giving rise to a well-defined optimization problem.

In order to classify new data points using the trained model, we evaluate the signof y(x) defined by (7.1). This can be expressed in terms of the parameters {an} andthe kernel function by substituting for w using (7.8) to give

y(x) = Σ_{n=1}^N an tn k(x, xn) + b.    (7.13)
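A sketch of solving the dual (7.10)–(7.12) directly on toy data; the linear kernel, the SLSQP solver, and the data itself are assumptions. After optimizing the multipliers a, it recovers w from (7.8), b from (7.18), and the support vectors as the points with an > 0.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=+2.0, size=(15, 2)),
               rng.normal(loc=-2.0, size=(15, 2))])
t = np.array([+1.0] * 15 + [-1.0] * 15)
N = len(t)

K = X @ X.T                                   # linear kernel: k(x_n, x_m) = x_n^T x_m

def neg_dual(a):                              # minimize the negative of L~(a), eq. (7.10)
    return -(a.sum() - 0.5 * (a * t) @ K @ (a * t))

res = minimize(neg_dual, x0=np.zeros(N), method="SLSQP",
               bounds=[(0, None)] * N,                               # a_n >= 0      (7.11)
               constraints=[{"type": "eq", "fun": lambda a: a @ t}]) # sum a_n t_n = 0 (7.12)

a = res.x
S = a > 1e-6                                  # support vectors: a_n > 0
w = ((a * t)[:, None] * X).sum(axis=0)        # eq. (7.8)
b = np.mean(t[S] - X[S] @ w)                  # eq. (7.18) specialized to the linear kernel
print(S.sum(), "support vectors;", "w =", np.round(w, 3), "b =", round(b, 3))
```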


k is the "kernel"

} Satisfies the Karush-Kuhn-Tucker (KKT) conditions, so a solution to the nonlinear program is optimal, as the following properties hold:

} Note that either an = 0 or tn y(xn) = 1
◦ What does it mean?
◦ The point is either at the margin or we don't care about it in the sum

27

Important Properties

In Appendix E, we show that a constrained optimization of this form satisfies the Karush-Kuhn-Tucker (KKT) conditions, which in this case require that the following three properties hold

an ≥ 0    (7.14)
tn y(xn) − 1 ≥ 0    (7.15)
an { tn y(xn) − 1 } = 0.    (7.16)

Thus for every data point, either an = 0 or tny(xn) = 1. Any data point forwhich an = 0 will not appear in the sum in (7.13) and hence plays no role in makingpredictions for new data points. The remaining data points are called support vectors,and because they satisfy tny(xn) = 1, they correspond to points that lie on themaximum margin hyperplanes in feature space, as illustrated in Figure 7.1. Thisproperty is central to the practical applicability of support vector machines. Oncethe model is trained, a significant proportion of the data points can be discarded andonly the support vectors retained.

Having solved the quadratic programming problem and found a value for a, wecan then determine the value of the threshold parameter b by noting that any supportvector xn satisfies tny(xn) = 1. Using (7.13) this gives

tn ( Σ_{m∈S} am tm k(xn, xm) + b ) = 1    (7.17)

where S denotes the set of indices of the support vectors. Although we can solvethis equation for b using an arbitrarily chosen support vector xn, a numerically morestable solution is obtained by first multiplying through by tn, making use of t2n = 1,and then averaging these equations over all support vectors and solving for b to give

b = (1/NS) Σ_{n∈S} ( tn − Σ_{m∈S} am tm k(xn, xm) )    (7.18)

where NS is the total number of support vectors. For later comparison with alternative models, we can express the maximum-margin classifier in terms of the minimization of an error function, with a simple quadratic regularizer, in the form

Σ_{n=1}^N E∞(y(xn) tn − 1) + λ∥w∥²    (7.19)

where E∞(z) is a function that is zero if z ≥ 0 and ∞ otherwise and ensures that the constraints (7.5) are satisfied. Note that as long as the regularization parameter satisfies λ > 0, its precise value plays no role.

Figure 7.2 shows an example of the classification resulting from training a sup-port vector machine on a simple synthetic data set using a Gaussian kernel of the


Modified equation:

C. Bishop Fig 7.2

28

Example with Gaussian Kernel

Figure 7.2 Example of synthetic data from two classes in two dimensions showing contours of constant y(x) obtained from a support vector machine having a Gaussian kernel function. Also shown are the decision boundary, the margin boundaries, and the support vectors.

form (6.23). Although the data set is not linearly separable in the two-dimensionaldata space x, it is linearly separable in the nonlinear feature space defined implicitlyby the nonlinear kernel function. Thus the training data points are perfectly separatedin the original data space.

This example also provides a geometrical insight into the origin of sparsity inthe SVM. The maximum margin hyperplane is defined by the location of the supportvectors. Other data points can be moved around freely (so long as they remain out-side the margin region) without changing the decision boundary, and so the solutionwill be independent of such data points.

7.1.1 Overlapping class distributions

So far, we have assumed that the training data points are linearly separable in the

feature space φ(x). The resulting support vector machine will give exact separationof the training data in the original input space x, although the corresponding decisionboundary will be nonlinear. In practice, however, the class-conditional distributionsmay overlap, in which case exact separation of the training data can lead to poorgeneralization.

We therefore need a way to modify the support vector machine so as to allowsome of the training points to be misclassified. From (7.19) we see that in the caseof separable classes, we implicitly used an error function that gave infinite errorif a data point was misclassified and zero error if it was classified correctly, andthen optimized the model parameters to maximize the margin. We now modify thisapproach so that data points are allowed to be on the ‘wrong side’ of the marginboundary, but with a penalty that increases with the distance from that boundary. Forthe subsequent optimization problem, it is convenient to make this penalty a linearfunction of this distance. To do this, we introduce slack variables, ξn ! 0 wheren = 1, . . . , N , with one slack variable for each training data point (Bennett, 1992;Cortes and Vapnik, 1995). These are defined by ξn = 0 for data points that are on orinside the correct margin boundary and ξn = |tn − y(xn)| for other points. Thus adata point that is on the decision boundary y(xn) = 0 will have ξn = 1, and points

[Slide figure annotations: maximum-margin hyperplane; active constraints (support points).]

} Linear classifiers cannot deal with:

◦ Non-linear concepts

◦ Noisy data

} Solutions:

◦ Soft margin (e.g., allow mistakes in training data [next])

◦ Network of simple linear classifiers (e.g., neural networks)

◦ Map data into richer feature space (e.g., non-linear features) and then use linear classifier

} Add slack variables ξn ≥ 0 to the optimization problem, defined as ξn = 0 if the point is correctly classified and lies outside the margin, and ξn = |tn − y(xn)| otherwise

} The new constraints are

$$t_n y(\mathbf{x}_n) \ge 1 - \xi_n, \qquad \xi_n \ge 0, \qquad n = 1, \dots, N$$

} And it is easy to show that the optimization is now to minimize

$$C \sum_{n=1}^{N} \xi_n + \frac{1}{2}\|\mathbf{w}\|^2$$

for C > 0 (a usage sketch follows below)
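Below is a minimal sketch of the soft-margin trade-off using scikit-learn's SVC, whose C parameter plays the role of the penalty above; the dataset and the particular C values are chosen here purely for illustration.

```python
# Sketch: smaller C tolerates more margin violations (larger slacks);
# a very large C approaches the hard-margin solution for separable data.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {len(clf.support_)} support vectors, "
          f"train acc={clf.score(X, y):.2f}")
```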

30

What if data is not linearly separable?

Figure 7.3 Illustration of the slack variables ξn ≥ 0. Data points with circles around them are support vectors.

[Figure 7.3 annotations: margin boundaries y = 1, y = 0, y = −1; regions with ξ = 0, ξ < 1, and ξ > 1.]

with ξn > 1 will be misclassified. The exact classification constraints (7.5) are then replaced with

$$t_n y(\mathbf{x}_n) \ge 1 - \xi_n, \qquad n = 1, \dots, N \qquad (7.20)$$

in which the slack variables are constrained to satisfy ξn ≥ 0. Data points for which ξn = 0 are correctly classified and are either on the margin or on the correct side of the margin. Points for which 0 < ξn ≤ 1 lie inside the margin, but on the correct side of the decision boundary, and those data points for which ξn > 1 lie on the wrong side of the decision boundary and are misclassified, as illustrated in Figure 7.3. This is sometimes described as relaxing the hard margin constraint to give a soft margin and allows some of the training set data points to be misclassified. Note that while slack variables allow for overlapping class distributions, this framework is still sensitive to outliers because the penalty for misclassification increases linearly with ξ.
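As a concrete illustration (numbers invented here, consistent with the definitions above): a point with tn = +1 and y(xn) = 0.4 lies inside the margin on the correct side, so ξn = 1 − 0.4 = 0.6; a point with tn = +1 and y(xn) = −0.5 is on the wrong side of the decision boundary, so ξn = 1.5 > 1 and it is misclassified; a point with tn = +1 and y(xn) = 1.3 satisfies the margin and has ξn = 0.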

Our goal is now to maximize the margin while softly penalizing points that lie on the wrong side of the margin boundary. We therefore minimize

$$C \sum_{n=1}^{N} \xi_n + \frac{1}{2}\|\mathbf{w}\|^2 \qquad (7.21)$$

where the parameter C > 0 controls the trade-off between the slack variable penalty and the margin. Because any point that is misclassified has ξn > 1, it follows that Σn ξn is an upper bound on the number of misclassified points. The parameter C is therefore analogous to (the inverse of) a regularization coefficient because it controls the trade-off between minimizing training errors and controlling model complexity. In the limit C → ∞, we will recover the earlier support vector machine for separable data.

We now wish to minimize (7.21) subject to the constraints (7.20) together with ξn ≥ 0. The corresponding Lagrangian is given by

$$L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{n=1}^{N}\xi_n - \sum_{n=1}^{N} a_n \{t_n y(\mathbf{x}_n) - 1 + \xi_n\} - \sum_{n=1}^{N} \mu_n \xi_n \qquad (7.22)$$


} Minimize

$$C \sum_{n=1}^{N} \xi_n + \frac{1}{2}\|\mathbf{w}\|^2$$

s.t.

$$t_n y(\mathbf{x}_n) \ge 1 - \xi_n, \qquad \xi_n \ge 0, \qquad n = 1, \dots, N$$

where C can be seen as a penalty for misclassification

} For C → ∞ we recover the linearly separable case

31

SVMs for Non-linearly Separable Data


where {an ≥ 0} and {µn ≥ 0} are Lagrange multipliers. The corresponding set of KKT conditions are given by (Appendix E)

$$a_n \ge 0 \qquad (7.23)$$
$$t_n y(\mathbf{x}_n) - 1 + \xi_n \ge 0 \qquad (7.24)$$
$$a_n\,(t_n y(\mathbf{x}_n) - 1 + \xi_n) = 0 \qquad (7.25)$$
$$\mu_n \ge 0 \qquad (7.26)$$
$$\xi_n \ge 0 \qquad (7.27)$$
$$\mu_n \xi_n = 0 \qquad (7.28)$$

where n = 1, . . . , N. We now optimize out w, b, and {ξn}, making use of the definition (7.1) of y(x), to give

$$\frac{\partial L}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{n=1}^{N} a_n t_n \phi(\mathbf{x}_n) \qquad (7.29)$$
$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{n=1}^{N} a_n t_n = 0 \qquad (7.30)$$
$$\frac{\partial L}{\partial \xi_n} = 0 \;\Rightarrow\; a_n = C - \mu_n. \qquad (7.31)$$

Using these results to eliminate w, b, and {ξn} from the Lagrangian, we obtain the dual Lagrangian in the form

"L(a) =N!

n=1

an − 12

N!

n=1

N!

m=1

anamtntmk(xn,xm) (7.32)

which is identical to the separable case, except that the constraints are somewhat different. To see what these constraints are, we note that an ≥ 0 is required because these are Lagrange multipliers. Furthermore, (7.31) together with µn ≥ 0 implies an ≤ C. We therefore have to minimize (7.32) with respect to the dual variables {an} subject to

0 " an " C (7.33)N!

n=1

antn = 0 (7.34)

for n = 1, . . . , N, where (7.33) are known as box constraints. This again represents a quadratic programming problem. If we substitute (7.29) into (7.1), we see that predictions for new data points are again made by using (7.13).
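As a hedged empirical check, the sketch below fits a soft-margin SVM with scikit-learn and verifies the box constraints (7.33) and the sum constraint (7.34). It relies on SVC exposing the signed dual coefficients a_n t_n of the support vectors via dual_coef_; the dataset and the value of C are choices made here.

```python
# Sketch: checking the box constraints (7.33) and sum constraint (7.34) on a fitted SVM.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=3.0, random_state=1)
C = 1.0
clf = SVC(kernel="rbf", C=C).fit(X, y)

signed_a = clf.dual_coef_.ravel()   # a_n * t_n, nonzero only for support vectors
a = np.abs(signed_a)                # recover a_n

print("0 <= a_n <= C for all n:", bool(np.all((a >= 0) & (a <= C + 1e-8))))
print("sum_n a_n t_n:", float(signed_a.sum()))
```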

We can now interpret the resulting solution. As before, a subset of the data points may have an = 0, in which case they do not contribute to the predictive model.


} State-of-the-art classifier (with good kernel)

} Solves computational problem of working in high dimensional space

} Non-parametric classifier (keeps all data around in kernel)

} Learning is O(N²), but O(N) approximations are available (see the sketch after this list)

◦ Hint: Focus computation on good support point candidates

} What is the model space?
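As a rough sketch of the O(N) route mentioned above (not the slides' specific method), one can either train a linear SVM with stochastic gradient descent on the hinge loss or combine a low-rank kernel approximation with a linear model. The dataset, gamma, and n_components below are illustrative choices.

```python
# Sketch: two common ways to avoid the O(N^2) kernel-SVM cost.
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Linear SVM via stochastic gradient descent on the hinge loss: O(N) per epoch.
linear_svm = SGDClassifier(loss="hinge", alpha=1e-4, random_state=0).fit(X, y)

# Approximate an RBF kernel with a 100-dimensional Nystroem feature map, then go linear.
approx_svm = make_pipeline(
    Nystroem(kernel="rbf", gamma=0.1, n_components=100, random_state=0),
    SGDClassifier(loss="hinge", random_state=0),
).fit(X, y)

print("linear SVM train acc:", linear_svm.score(X, y))
print("Nystroem + linear train acc:", approx_svm.score(X, y))
```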

33

} SVM can be described as optimizing

where E_SV(yn tn) = [1 − yn tn]+ is known as the hinge loss function and [·]+ denotes the positive part

34


Figure 7.5 Plot of the 'hinge' error function used in support vector machines, shown in blue, along with the error function for logistic regression, rescaled by a factor of 1/ln(2) so that it passes through the point (0, 1), shown in red. Also shown are the misclassification error in black and the squared error in green.

[Figure 7.5 axes: E(z) versus z, for z from −2 to 2.]

For the remaining points we have ξn = 1 − yn tn. Thus the objective function (7.21) can be written (up to an overall multiplicative constant) in the form

$$\sum_{n=1}^{N} E_{SV}(y_n t_n) + \lambda\|\mathbf{w}\|^2 \qquad (7.44)$$

where λ = (2C)⁻¹, and E_SV(·) is the hinge error function defined by

$$E_{SV}(y_n t_n) = [1 - y_n t_n]_+ \qquad (7.45)$$

where [ · ]+ denotes the positive part. The hinge error function, so-called because of its shape, is plotted in Figure 7.5. It can be viewed as an approximation to the misclassification error, i.e., the error function that ideally we would like to minimize, which is also shown in Figure 7.5.

When we considered the logistic regression model in Section 4.3.2, we found it convenient to work with target variable t ∈ {0, 1}. For comparison with the support vector machine, we first reformulate maximum likelihood logistic regression using the target variable t ∈ {−1, 1}. To do this, we note that p(t = 1|y) = σ(y) where y(x) is given by (7.1), and σ(y) is the logistic sigmoid function defined by (4.59). It follows that p(t = −1|y) = 1 − σ(y) = σ(−y), where we have used the properties of the logistic sigmoid function, and so we can write

p(t|y) = σ(yt). (7.46)

From this we can construct an error function by taking the negative logarithm of the likelihood function that, with a quadratic regularizer, takes the form (Exercise 7.6)

$$\sum_{n=1}^{N} E_{LR}(y_n t_n) + \lambda\|\mathbf{w}\|^2. \qquad (7.47)$$

where

$$E_{LR}(yt) = \ln\left(1 + \exp(-yt)\right). \qquad (7.48)$$


[Slide figure legend: hinge loss; logistic loss (scaled by 1/ln(2)); squared loss.]

$$\sum_{n=1}^{N} E_{SV}(y_n, t_n) + \frac{1}{2C}\|\mathbf{w}\|^2$$

Logistic regression is generally faster to train than a (kernel) SVM
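The following small sketch (values chosen here) evaluates the hinge loss (7.45), the rescaled logistic loss (7.48), and the 0–1 misclassification error at a few margin values z = y t, mirroring the comparison in Figure 7.5.

```python
import numpy as np

z = np.array([-2.0, -1.0, 0.0, 0.5, 1.0, 2.0])

hinge = np.maximum(0.0, 1.0 - z)                 # E_SV(z) = [1 - z]_+
logistic = np.log1p(np.exp(-z)) / np.log(2.0)    # E_LR(z)/ln(2), passes through (0, 1)
misclassification = (z <= 0).astype(float)       # 0-1 error on the wrong side

for zi, h, l, m in zip(z, hinge, logistic, misclassification):
    print(f"z={zi:+.1f}  hinge={h:.2f}  logistic/ln2={l:.2f}  0-1={m:.0f}")
```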

} Classification task (our current goal):
◦ Given (xi, yi), i = 1, 2, …
◦ Goal 1: Learn p(Yi | Xi = xi)
◦ Goal 2: Just predict yi

} Generative model task (interesting if you want to ask questions about x)
◦ Given (xi, yi), i = 1, 2, …
◦ Goal: Learn p(Yi, Xi)

} Contextual Bandits
◦ Given xt and after performing action a, get reward yt
◦ Problem is stateless: p(yt | a, xt, xt−1, …) = p(yt | a, xt)
◦ Goal: Learn action a that maximizes p(Y | a, X) (toy sketch below)

} Reinforcement Learning
◦ Given xt and after performing action a, get reward yt
◦ Problem is stateful: p(yt | a, xt, xt−1, …) ≠ p(yt | a, xt)
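To make the contextual-bandit setting concrete, here is the toy epsilon-greedy sketch referenced above; the reward model, epsilon, and all names are invented for illustration and are not part of the course material. The problem is stateless: the reward depends only on the current context xt and the chosen action a.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_rounds, eps = 2, 5000, 0.1

def pull(action, x):
    """Hidden Bernoulli reward: action 0 is better when x > 0, action 1 otherwise."""
    p = 0.8 if (x > 0) == (action == 0) else 0.2
    return float(rng.random() < p)

# Empirical mean reward per (context sign, action) pair as a crude value estimate.
counts = np.ones((2, n_actions))
wins = np.zeros((2, n_actions))

total = 0.0
for _ in range(n_rounds):
    x = rng.normal()
    ctx = int(x > 0)
    a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(wins[ctx] / counts[ctx]))
    r = pull(a, x)
    counts[ctx, a] += 1
    wins[ctx, a] += r
    total += r

print("average reward:", total / n_rounds)  # roughly 0.8*(1-eps) + 0.5*eps
```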

35

Classification, Contextual Bandits & Reinforcement Learning

} We will cover decision trees when we discuss ensembles

36