
Lecture Notes on Support Vector Machine

Feng Li ([email protected])

Shandong University, China

1 Hyperplane and Margin

In an n-dimensional space, a hyperplane is defined by

ω^T x + b = 0    (1)

where ω ∈ R^n is an outward-pointing normal vector and b is a bias term. The hyperplane separates the n-dimensional space into two half-spaces H_+ = {x ∈ R^n | ω^T x + b ≥ 0} and H_− = {x ∈ R^n | ω^T x + b < 0}, such that we can identify to which half-space a given point x ∈ R^n belongs according to sign(ω^T x + b). Specifically, given a point x_0 ∈ R^n, its label y_0 is defined as y_0 = sign(ω^T x_0 + b), i.e.,

y_0 =  1,   if ω^T x_0 + b ≥ 0
      −1,   otherwise    (2)

The (signed) distance from x_0 to the hyperplane (denoted by γ) is

γ = (ω^T x_0 + b)/‖ω‖ = (ω/‖ω‖)^T x_0 + b/‖ω‖

Note that, for now, γ is signed such that

sign(γ) = sign(ω^T x_0 + b) = y_0

Therefore, we remove the sign of γ by redefining it as

γ = y_0 (ω^T x_0 + b)/‖ω‖

If we scale both ω and b such that y_0 (ω^T x_0 + b) = 1,¹ we get

γ = 1/‖ω‖

γ is the so-called "margin" of x_0. Note that the scaling operation does not change the hyperplane ω^T x + b = 0.

¹We can scale both ω and b, e.g., by simultaneously multiplying them by a constant.

Now, given a set of training data {(x^(i), y^(i))}_{i=1,···,m}, we assume that they are linearly separable. Specifically, there exists a hyperplane (parameterized by ω and b) such that y^(i) = 1 if ω^T x^(i) + b ≥ 0, and y^(i) = −1 otherwise. We define the margin between each training sample and the hyperplane. In fact, for (x^(i), y^(i)), the margin γ^(i) is the signed distance between x^(i) and the hyperplane ω^T x + b = 0, i.e.,

ω^T (x^(i) − γ^(i) ω/‖ω‖) + b = 0

and we thus have

γ^(i) = (ω/‖ω‖)^T x^(i) + b/‖ω‖    (3)

as shown in Fig. 1. In particular, if y^(i) = 1, then γ^(i) ≥ 0; otherwise, γ^(i) < 0.

Figure 1: Margin and hyperplane.

We then remove the sign and introduce the concept of the geometric margin, which is unsigned, as follows:

γ^(i) = y^(i) ((ω/‖ω‖)^T x^(i) + b/‖ω‖)    (4)

With respect to the whole training set, the margin is written as

γ = min_i γ^(i)    (5)
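As a small illustration (a sketch added to these notes; the function and variable names are made up), the geometric margins in Eq. (4) and the training-set margin in Eq. (5) can be computed directly for a given (ω, b):

```python
# Compute the geometric margins gamma^(i) (Eq. (4)) and the training-set margin (Eq. (5)).
import numpy as np

def margins(X, y, w, b):
    """X: (m, n) samples, y: (m,) labels in {-1, +1}, w: (n,) normal vector, b: bias."""
    gamma_i = y * (X @ w + b) / np.linalg.norm(w)   # geometric margin of each sample
    return gamma_i, gamma_i.min()                   # per-sample margins and their minimum
```

For a hyperplane that correctly separates the data, all of the returned margins are positive.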

2 Support Vector Machine

2.1 Formulation

The hyperplane actually serves as a decision boundary for differentiating positive labels from negative labels, and we can make a more confident decision if a larger margin is given. By choosing different values of ω and b, we can construct an infinite number of separating hyperplanes, but which one is the best? The goal of the Support Vector Machine (SVM) is to maximize γ, and the SVM problem can be formulated as

max_{γ,ω,b}  γ
s.t.  y^(i)(ω^T x^(i) + b) ≥ γ‖ω‖, ∀i


Figure 2: Hard-margin SVM.

Note that scaling ω and b (e.g., multiplying both ω and b by a constant) does not change the hyperplane. Hence, we scale (ω, b) such that

min_i {y^(i)(ω^T x^(i) + b)} = 1.

In this case, the margin becomes γ = 1/‖ω‖ according to Eq. (5). Then, the problem formulation can be rewritten as

max_{ω,b}  1/‖ω‖
s.t.  y^(i)(ω^T x^(i) + b) ≥ 1, ∀i

Since maximizing 1/‖ω‖ is equivalent to minimizing ‖ω‖² = ω^T ω, we further rewrite the problem formulation as follows

min_{ω,b}  ω^T ω    (6)
s.t.  y^(i)(ω^T x^(i) + b) ≥ 1, ∀i    (7)

As shown in Fig. 2, the distance from the dashed lines (ω^T x + b = 1 and ω^T x + b = −1) to the hyperplane ω^T x + b = 0 is the margin (see Eq. (5)). The aim of the above optimization problem is to find a hyperplane (parameterized by ω and b) that maximizes the margin γ = 1/‖ω‖, while the resulting dashed lines satisfy the following condition: for each training sample (x^(i), y^(i)), ω^T x^(i) + b ≥ 1 if y^(i) = 1, and ω^T x^(i) + b ≤ −1 if y^(i) = −1.

This is a quadratic programming (QP) problem and can be solved by existing generic QP solvers, e.g., the interior-point method, the active-set method, or the gradient-projection method. Unfortunately, existing generic QP solvers are of low efficiency, especially in the face of a large training set.
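To make this concrete, here is a minimal sketch (added to these notes, not the author's code) of passing problem (6)-(7) to a generic QP solver, using the CVXOPT package over the stacked variable [ω; b]; the function name and the tiny ridge term are my own choices:

```python
# Solve the hard-margin primal QP (6)-(7) with a generic QP solver (CVXOPT).
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    """X: (m, n) linearly separable samples, y: (m,) labels in {-1, +1}. Returns (omega, b)."""
    m, n = X.shape
    # Objective (1/2) z^T P z over z = [omega; b], with no penalty on the bias b.
    P = np.zeros((n + 1, n + 1))
    P[:n, :n] = np.eye(n)
    P += 1e-8 * np.eye(n + 1)                          # tiny ridge so the QP is strictly convex
    q = np.zeros(n + 1)
    # Constraints y^(i)(omega^T x^(i) + b) >= 1, rewritten as G z <= h.
    G = -y[:, None] * np.hstack([X, np.ones((m, 1))])
    h = -np.ones(m)
    z = np.array(solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))['x']).ravel()
    return z[:n], z[n]
```

For linearly separable data, sign(X @ omega + b) then recovers the training labels.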

2.2 Preliminary Knowledge of Convex Optimization

Before diving into optimizing the problem formulation of SVM to improve the efficiency of solving the QP problem, we first introduce some preliminaries about convex optimization.

2.2.1 Optimization Problems and Lagrangian Duality

We now consider the following optimization problem

min_ω  f(ω)    (8)
s.t.  g_i(ω) ≤ 0, i = 1, ···, k    (9)
      h_j(ω) = 0, j = 1, ···, l    (10)

where ω ∈ D is the variable and D ⊆ R^n denotes the feasible domain defined by the constraints. Specifically, D = (⋂_{i=1}^k dom g_i) ∩ (⋂_{j=1}^l dom h_j). f(ω) is the so-called objective function. In the above optimization problem, we have k inequality constraints g_i(ω) ≤ 0 and l equality constraints h_j(ω) = 0.

We construct the Lagrangian of the above optimization problem, i.e., L : R^n × R^k × R^l → R with dom L = D × R^k × R^l,

L(ω, α, β) = f(ω) + Σ_{i=1}^k α_i g_i(ω) + Σ_{j=1}^l β_j h_j(ω)    (11)

In fact, L(ω, α, β) can be treated as a weighted sum of the objective and constraint functions. α_i is the Lagrange multiplier associated with g_i(ω) ≤ 0, while β_j is the Lagrange multiplier associated with h_j(ω) = 0.

We then define its Lagrange dual function G : R^k × R^l → R as the infimum² of L with respect to ω, i.e.,

G(α, β) = inf_{ω∈D} L(ω, α, β) = inf_{ω∈D} (f(ω) + Σ_{i=1}^k α_i g_i(ω) + Σ_{j=1}^l β_j h_j(ω))    (12)

We observe that: i) the infimum is unconstrained (as opposed to the original constrained minimization problem); ii) G is an infimum of a family of affine functions of (α, β) and thus is a concave function regardless of the original problem; iii) G can be −∞ for some α and β.

Theorem 1. Lower bound property: If α ⪰ 0, then G(α, β) ≤ p*, where p* is the optimal value of the original problem (8)-(10).

Proof. If ω̃ is feasible, then we have g_i(ω̃) ≤ 0 for ∀i = 1, ···, k and h_j(ω̃) = 0 for ∀j = 1, ···, l. Since α ⪰ 0, we have f(ω̃) ≥ L(ω̃, α, β) for all feasible ω̃'s. Because L(ω̃, α, β) ≥ inf_{ω∈D} L(ω, α, β), we have

f(ω̃) ≥ L(ω̃, α, β) ≥ inf_{ω∈D} L(ω, α, β) = G(α, β)

for all feasible ω̃. We now take the minimizer of f(ω̃) over all feasible ω̃'s to get p* ≥ G(α, β).

²In mathematics, the infimum (abbreviated inf; plural infima) of a subset S of a partially ordered set T is the greatest element in T that is less than or equal to all elements of S, if such an element exists. More details about the infimum and its counterpart, the supremum, can be found at https://en.wikipedia.org/wiki/Infimum_and_supremum.


We could try to find the best lower bound by maximizing G(α, β). This is in fact the dual problem. We formally define the Lagrange dual problem as follows

max_{α,β}  G(α, β)
s.t.  α_i ≥ 0, ∀i = 1, ···, k

We denote by d* the optimal value of the above Lagrange dual problem. Weak duality, d* ≤ p*, always holds for all optimization problems, and can be used to find non-trivial lower bounds for difficult problems. We say the duality is strong if the equality holds, i.e., d* = p*. In this case, we can optimize the original problem by optimizing its dual problem.
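As a toy illustration (an example added here for concreteness, not part of the original notes), consider the one-dimensional problem min_x x² subject to 1 − x ≤ 0, whose optimal value is p* = 1 at x = 1. Its Lagrangian is L(x, α) = x² + α(1 − x) with α ≥ 0, and minimizing over x (the minimum is attained at x = α/2) gives the dual function

G(α) = α − α²/4

Every value of G(α) with α ≥ 0 is indeed a lower bound on p* (e.g., G(1) = 3/4 ≤ 1), and maximizing G over α ≥ 0 yields d* = G(2) = 1 = p*, so strong duality holds for this problem.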

2.2.2 Complementary Slackness

Let ω∗ be a primal optimal point and (α∗, β∗) be a dual optimal point.

Theorem 2. Complementary slackness: If strong duality holds, then

α_i^* g_i(ω^*) = 0,  ∀i = 1, 2, ···, k

Proof.

f(ω^*) = G(α^*, β^*)
       = inf_ω (f(ω) + Σ_{i=1}^k α_i^* g_i(ω) + Σ_{j=1}^l β_j^* h_j(ω))
       ≤ f(ω^*) + Σ_{i=1}^k α_i^* g_i(ω^*) + Σ_{j=1}^l β_j^* h_j(ω^*)
       ≤ f(ω^*)

The first line is due to strong duality, and the second line is the definition of the dual function. The third line follows because the infimum of the Lagrangian over ω is less than or equal to its value at ω = ω^*. We have the fourth line since α_i^* ≥ 0 and g_i(ω^*) ≤ 0 hold for ∀i = 1, ···, k, and h_j(ω^*) = 0 for ∀j = 1, ···, l. We conclude that the last two inequalities hold with equality, such that we have Σ_{i=1}^k α_i^* g_i(ω^*) = 0. Since each term α_i^* g_i(ω^*) is nonpositive, we thus conclude α_i^* g_i(ω^*) = 0 for ∀i = 1, 2, ···, k.

Another observation is that, since the inequality in the third line holds with equality, ω^* actually minimizes L(ω, α^*, β^*) over ω.

2.2.3 Karush-Kuhn-Tucker Conditions

We now introduce the Karush-Kuhn-Tucker (KKT) conditions. We assume that the objective function and the inequality constraint functions are differentiable. Again, let ω^* and (α^*, β^*) be any primal and dual optimal points, and suppose that strong duality holds. Since ω^* is a minimizer of L(ω, α^*, β^*) over ω, it follows that its gradient must vanish at ω^*, i.e.,

∇f(ω^*) + Σ_{i=1}^k α_i^* ∇g_i(ω^*) + Σ_{j=1}^l β_j^* ∇h_j(ω^*) = 0    (13)


We thus summarize the KKT conditions as follows:

g_i(ω^*) ≤ 0, ∀i = 1, ···, k    (14)
h_j(ω^*) = 0, ∀j = 1, ···, l    (15)
α_i^* ≥ 0, ∀i = 1, ···, k    (16)
α_i^* g_i(ω^*) = 0, ∀i = 1, ···, k    (17)
∇f(ω^*) + Σ_{i=1}^k α_i^* ∇g_i(ω^*) + Σ_{j=1}^l β_j^* ∇h_j(ω^*) = 0    (18)

Remarks: For any optimization problem with differentiable objective and constraint functions for which strong duality obtains, any pair of primal and dual optimal points must satisfy the KKT conditions.
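Continuing the toy example from Section 2.2.1 (again an added illustration, not part of the original notes), the pair x^* = 1, α^* = 2 satisfies all of the conditions (14)-(18):

g(x^*) = 1 − x^* = 0 ≤ 0    (primal feasibility, (14))
α^* = 2 ≥ 0    (dual feasibility, (16))
α^* g(x^*) = 2 · 0 = 0    (complementary slackness, (17))
∇f(x^*) + α^* ∇g(x^*) = 2x^* − α^* = 0    (stationarity, (18))

There are no equality constraints, so condition (15) is vacuous.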

2.2.4 Convex Optimization Problems

An optimization problem is convex if both the objective function f(ω) and the inequality constraint functions g_i(ω) (i = 1, ···, k) are convex, and the equality constraints h_j(ω) are affine functions, denoted by Aω − b = 0 where A is an l × n matrix. Therefore, a convex optimization problem can be represented by

min_ω  f(ω)    (19)
s.t.  g_i(ω) ≤ 0, i = 1, ···, k    (20)
      Aω − b = 0    (21)

Although strong duality does not hold in general, we usually (but not always) have strong duality for convex optimization problems. There are many results that establish conditions on the problem, beyond convexity, under which strong duality holds. These conditions are called constraint qualifications. One simple constraint qualification is Slater's condition.

Theorem 3. Slater’s condition: Strong duality holds for a convex problem

min_ω  f(ω)
s.t.  g_i(ω) ≤ 0, i = 1, ···, k
      Aω − b = 0

if it is strictly feasible, i.e.,

∃ω ∈ relint D : g_i(ω) < 0, i = 1, ···, k, Aω = b

A detailed proof of the above theorem can be found in Prof. Boyd and Prof. Vandenberghe's Convex Optimization book (https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf, see Sec. 5.3.2, pp. 234-236).

For convex optimization problems, the KKT conditions are also sufficient for the points to be primal and dual optimal. In particular, suppose ω̃, α̃, and β̃ are any points satisfying the following KKT conditions

g_i(ω̃) ≤ 0, ∀i = 1, ···, k    (22)
h_j(ω̃) = 0, ∀j = 1, ···, l    (23)
α̃_i ≥ 0, ∀i = 1, ···, k    (24)
α̃_i g_i(ω̃) = 0, ∀i = 1, ···, k    (25)
∇f(ω̃) + Σ_{i=1}^k α̃_i ∇g_i(ω̃) + Σ_{j=1}^l β̃_j ∇h_j(ω̃) = 0    (26)

then they are primal and dual optimal with strong duality holding.

3 Duality of SVM

We now revisit our problem formulation of SVM. The (primal) SVM problem is given by

P :  min_{ω,b}  (1/2)‖ω‖²    (27)
     s.t.  y^(i)(ω^T x^(i) + b) ≥ 1, ∀i    (28)

where we introduce a constant 1/2 so as to simplify our later derivations.

Theorem 4. The dual optimization problem of the primal SVM problem P can be formulated as

D :  max_α  G(α) = Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m y^(i) y^(j) α_i α_j (x^(i))^T x^(j)    (29)
     s.t.  α_i ≥ 0, ∀i    (30)
           Σ_{i=1}^m α_i y^(i) = 0    (31)

Proof. We calculate the Lagrange dual function G(α) by taking the infimum of L(ω, b, α) over ω and b.

L(ω, b, α) = (1/2)‖ω‖² − Σ_{i=1}^m α_i (y^(i)(ω^T x^(i) + b) − 1)    (32)

where α_i ≥ 0 is the Lagrange multiplier for the i-th inequality constraint. We calculate the gradient of L(ω, b, α) with respect to ω and set it to zero,

∇_ω L(ω, b, α) = ω − Σ_{i=1}^m α_i y^(i) x^(i) = 0

and we thus have

ω = Σ_{i=1}^m α_i y^(i) x^(i)    (33)

Similarly,

∂L(ω, b, α)/∂b = −Σ_{i=1}^m α_i y^(i) = 0, i.e., Σ_{i=1}^m α_i y^(i) = 0    (34)

In other words, the above two equations are necessary conditions for the infimum inf_{ω,b} L(ω, b, α) over ω and b. Substituting (33) and (34) into (32) gives us

(1/2)‖ω‖² − Σ_{i=1}^m α_i [y^(i)(ω^T x^(i) + b) − 1]
= (1/2) ω^T ω − Σ_{i=1}^m α_i y^(i) ω^T x^(i) − Σ_{i=1}^m α_i y^(i) b + Σ_{i=1}^m α_i
= (1/2) ω^T Σ_{i=1}^m α_i y^(i) x^(i) − ω^T Σ_{i=1}^m α_i y^(i) x^(i) − Σ_{i=1}^m α_i y^(i) b + Σ_{i=1}^m α_i
= −(1/2) ω^T Σ_{i=1}^m α_i y^(i) x^(i) − Σ_{i=1}^m α_i y^(i) b + Σ_{i=1}^m α_i
= −(1/2) (Σ_{i=1}^m α_i y^(i) x^(i))^T Σ_{j=1}^m α_j y^(j) x^(j) − b Σ_{i=1}^m α_i y^(i) + Σ_{i=1}^m α_i
= −(1/2) Σ_{i=1}^m α_i y^(i) (x^(i))^T Σ_{j=1}^m α_j y^(j) x^(j) − b Σ_{i=1}^m α_i y^(i) + Σ_{i=1}^m α_i
= −(1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y^(i) y^(j) (x^(i))^T x^(j) − b Σ_{i=1}^m α_i y^(i) + Σ_{i=1}^m α_i
= Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y^(i) y^(j) (x^(i))^T x^(j) − b Σ_{i=1}^m α_i y^(i)
= Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y^(i) y^(j) (x^(i))^T x^(j)

where the last equality follows from Eq. (34). This is exactly the objective in (29), which completes our proof.

This is a convex optimization problem satisfying Slater's condition; therefore, strong duality (p* = d*) holds, and the optimal ω, b, and α satisfy the KKT conditions (including complementary slackness). Several off-the-shelf solvers (e.g., quadprog (MATLAB), CVXOPT, CPLEX, IPOPT, etc.) exist to solve such a QP problem.
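As an illustration of using one of these solvers (a sketch added here, not the notes' own code; the small ridge added to P is only for numerical stability), the dual problem D maps directly onto CVXOPT's qp interface:

```python
# Solve the dual SVM problem (29)-(31) with CVXOPT: minimize (1/2) a^T P a - 1^T a
# subject to a >= 0 and y^T a = 0, where P_ij = y^(i) y^(j) (x^(i))^T x^(j).
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y):
    """X: (m, n) samples, y: (m,) labels in {-1, +1}. Returns the dual variables alpha."""
    m = X.shape[0]
    K = X @ X.T                                         # Gram matrix of inner products
    P = matrix(np.outer(y, y) * K + 1e-8 * np.eye(m))   # ridge keeps the QP numerically PSD
    q = matrix(-np.ones(m))                             # maximizing sum(alpha) <=> minimizing -sum(alpha)
    G = matrix(-np.eye(m))                              # -alpha_i <= 0, i.e., alpha_i >= 0   (30)
    h = matrix(np.zeros(m))
    A = matrix(y.reshape(1, -1).astype(float))          # sum_i alpha_i y^(i) = 0             (31)
    b = matrix(0.0)
    return np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
```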

Let α^* be the optimal value of α for the dual SVM problem. We can use Eq. (33) to calculate the optimal value of ω, i.e., ω^*. The question is, knowing ω^*, how do we calculate the optimal value of b, i.e., b^*? Recall that, due to complementary slackness,

α_i^* (y^(i)(ω^{*T} x^(i) + b^*) − 1) = 0

for ∀i = 1, ···, m, we have

y^(i)(ω^{*T} x^(i) + b^*) = 1

for {i : α_i^* > 0}. As y^(i) ∈ {−1, 1}, we have

b^* = y^(i) − ω^{*T} x^(i)

for ∀i such that α_i^* > 0. For robustness, we calculate the optimal value of b by taking the average

b^* = Σ_{i: α_i^* > 0} (y^(i) − ω^{*T} x^(i)) / Σ_{i=1}^m 1(α_i^* > 0)

In fact, most α_i's in the solution are zero (a sparse solution). According to complementary slackness, for the optimal α_i^*'s,

α_i^* [1 − y^(i)(ω^{*T} x^(i) + b^*)] = 0

so α_i^* is non-zero only if x^(i) lies on one of the two margin boundaries, i.e., for which y^(i)(ω^{*T} x^(i) + b^*) = 1. These data samples are called support vectors, i.e., the support vectors "support" the margin boundaries, as shown in Fig. 2. We can thus rewrite ω as

ω = Σ_{s∈S} α_s y^(s) x^(s)

where S denotes the set of indices of the support vectors.
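Continuing the sketch above (again added, illustrative code), ω^*, b^*, and the support vectors can be recovered from the dual solution exactly as derived in this section:

```python
# Recover omega*, b*, and the support-vector indices from the dual solution alpha*.
import numpy as np

def recover_primal(X, y, alpha, tol=1e-6):
    sv = alpha > tol                           # support vectors: alpha_i* > 0 (up to tolerance)
    w = (alpha[sv] * y[sv]) @ X[sv]            # omega* = sum_{s in S} alpha_s y^(s) x^(s)
    b = np.mean(y[sv] - X[sv] @ w)             # average of y^(i) - omega*^T x^(i) over S
    return w, b, np.flatnonzero(sv)
```

New points are then classified with sign(w @ x_new + b).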

4 Kernel-based SVM

So far, one of our assumptions has been that the training data can be separated linearly. Nevertheless, linear models (e.g., linear regression, linear SVM, etc.) cannot capture nonlinear patterns in the data, as demonstrated in Fig. 3.

Figure 3: Non-linear data vs. linear classifier.

The basic idea of the kernel method is to make linear models work in nonlinear settings by introducing kernel functions. In particular, by mapping the data to a higher-dimensional space where it exhibits linear patterns, we can employ the linear model in the new input space. The mapping is equivalent to changing the feature representation.

We take the following binary classification problem as an example. As shown in Fig. 4 (a), each sample is represented by a single feature x, and no linear separator exists for this data. We map each data sample by x → {x, x²}, such that each sample now has two features ("derived" from the old representation).


Figure 4: Feature mapping for 1-dimensional feature space.

As shown in Fig. 4 (b), the data become linearly separable in the new representation.

Another example is given in Fig. 5. Each sample is defined by x = {x_1, x_2}, and no linear separator exists for this data. We apply the mapping x = {x_1, x_2} → z = {x_1², √2 x_1 x_2, x_2²}, such that the data become linearly separable in the resulting 3D space.


Figure 5: Feature mapping for 2-dimensional feature space.

We now consider the following quadratic feature mapping φ for a sample x = {x_1, ···, x_n}:

φ : x → {x_1², x_2², ···, x_n², x_1 x_2, x_1 x_3, ···, x_1 x_n, ···, x_{n−1} x_n}

where each new feature uses a pair of the original features. It can be observed that feature mapping usually makes the number of features blow up, such that i) computing the mapping itself can be inefficient, especially when the new space is very high dimensional; ii) storing and using these mappings in later computations can be expensive (e.g., we may have to compute inner products in a very high-dimensional space); iii) using the mapped representation can be inefficient too. Fortunately, kernels help us avoid all of these issues! With the help of kernels, the mapping does not have to be explicitly computed, and computations with the mapped features remain efficient.


Let's assume we are given a function K (a kernel) that takes as inputs x and z (here, x, z ∈ R²):

K(x, z) = (x^T z)²
        = (x_1 z_1 + x_2 z_2)²
        = x_1² z_1² + x_2² z_2² + 2 x_1 x_2 z_1 z_2
        = (x_1², √2 x_1 x_2, x_2²)^T (z_1², √2 z_1 z_2, z_2²)

The above function K implicitly defines a mapping φ to a higher-dimensional space,

φ(x) = {x_1², √2 x_1 x_2, x_2²}

Now, simply defining the kernel in a certain way gives a higher-dimensional mapping φ. The mapping does not have to be explicitly computed, while computations with the mapped features remain efficient. Moreover, the kernel K(x, z) computes the dot product φ(x)^T φ(z).
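A quick numerical sanity check of this identity (an added sketch; the helper names are illustrative, not from the notes):

```python
# Verify that the quadratic kernel K(x, z) = (x^T z)^2 equals the dot product of the mapped features.
import numpy as np

def quadratic_kernel(x, z):
    return float(x @ z) ** 2

def phi(x):
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(quadratic_kernel(x, z))      # 1.0, since (1*3 + 2*(-1))^2 = 1
print(phi(x) @ phi(z))             # 1.0, the same value via the explicit mapping
```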

Formally speaking, each kernel K has an associated feature mapping φ, which takes an input x ∈ X (the input space) and maps it to F (the feature space). F needs to be a vector space with a dot product defined on it, i.e., a so-called Hilbert space. In other words, the kernel

K : X × X → R,  K(x, z) = φ(x)^T φ(z)

takes two inputs and gives their similarity in the space F.

The problem is, can just any function be used as a kernel function? The answer is no; a kernel function must satisfy Mercer's condition. To introduce Mercer's condition, we need the concept of a quadratically integrable (or square integrable) function. A function q : R^n → R is square integrable if

∫_{−∞}^{∞} q²(x) dx < ∞

A function K(·, ·) : R^n × R^n → R satisfies Mercer's condition if, for any square integrable function q(x), the following inequality always holds:

∫∫ q(x) K(x, z) q(z) dx dz ≥ 0

Let K_1 and K_2 be two kernel functions; then the following are kernel functions as well:

• Direct sum: K(x, z) = K_1(x, z) + K_2(x, z)

• Scalar product: K(x, z) = αK_1(x, z)

• Direct product: K(x, z) = K_1(x, z) K_2(x, z)

• Kernels can also be constructed by composing these rules

In the context of SVM, Mercer's condition translates into another way to check whether K is a valid kernel (i.e., whether it meets Mercer's condition). The kernel function K also defines the kernel matrix over the data (also denoted by K). Given m samples {x^(1), x^(2), ···, x^(m)}, the (i, j)-th entry of K is defined as

K_{i,j} = K(x^(i), x^(j)) = φ(x^(i))^T φ(x^(j))

If the matrix K is positive semi-definite, the function K(·, ·) is a valid kernel function.
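As a numerical illustration (an added sketch, not from the notes), the following builds the kernel matrix of the Gaussian kernel listed below for a random sample and confirms that it is positive semi-definite up to round-off:

```python
# Build a Gaussian kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) and check PSD-ness.
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T   # pairwise squared distances
    return np.exp(-sq_dists / (2 * sigma**2))

X = np.random.randn(20, 3)                      # 20 random samples in R^3
K = gaussian_kernel_matrix(X, sigma=0.8)
print(np.linalg.eigvalsh(K).min() >= -1e-10)    # True: all eigenvalues are nonnegative up to round-off
```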

The following are some commonly used kernels:

• Linear (trivial) kernel: K(x, z) = x^T z

• Quadratic kernel: K(x, z) = (x^T z)² or (1 + x^T z)²

• Polynomial kernel (of degree d): K(x, z) = (x^T z)^d or (1 + x^T z)^d

• Gaussian kernel: K(x, z) = exp(−‖x − z‖² / (2σ²))

• Sigmoid kernel: K(x, z) = tanh(αx^T z + c)

Overall, a kernel K(x, z) represents a dot product in some high-dimensional feature space F, e.g., K(x, z) = (x^T z)² or (1 + x^T z)².

Any learning algorithm in which examples only appear as dot products (x^(i))^T x^(j) can be kernelized (i.e., non-linearized) by replacing the (x^(i))^T x^(j) terms with φ(x^(i))^T φ(x^(j)) = K(x^(i), x^(j)). Actually, most learning algorithms are like that, such as SVM, linear regression, etc. Many unsupervised learning algorithms can be kernelized too (e.g., K-means clustering, Principal Component Analysis, etc.).

Recall that the dual problem of SVM can be formulated as

max_α  Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m y^(i) y^(j) α_i α_j ⟨x^(i), x^(j)⟩
s.t.  α_i ≥ 0 (∀i),  Σ_{i=1}^m α_i y^(i) = 0

Replacing ⟨x^(i), x^(j)⟩ with φ(x^(i))^T φ(x^(j)) = K(x^(i), x^(j)) = K_{i,j} gives us

max_α  Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m y^(i) y^(j) α_i α_j K_{i,j}
s.t.  α_i ≥ 0 (∀i),  Σ_{i=1}^m α_i y^(i) = 0

The SVM now learns a linear separator in the kernel-defined feature space F, and this corresponds to a non-linear separator in the original space X.


Figure 6: Regularized (Soft-Margin) SVM

Predictions can be made by the SVM without a kernel as

y = sign(ω^T x) = sign((Σ_{s∈S} α_s y^(s) x^(s))^T x)

where we assume b = 0 (this can be done by shifting the hyperplane, as well as the test data, such that the plane passes through the origin, i.e., b = 0). Replacing each example with its feature-mapped representation (x → φ(x)) gives

y = sign(Σ_{s∈S} α_s y^(s) φ(x^(s))^T φ(x)) = sign(Σ_{s∈S} α_s y^(s) K(x^(s), x))

The kernelized SVM needs the support vectors at test time (except when you can write φ(x) as an explicit, reasonably-sized vector). In the unkernelized version, ω = Σ_{s∈S} α_s y^(s) x^(s) can be computed and stored as an n × 1 vector, so the support vectors need not be stored.
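A minimal sketch of kernelized prediction (added illustrative code; it assumes the support vectors' dual variables alpha_sv, labels y_sv, data X_sv, and a bias b have already been obtained, e.g., as in Section 3):

```python
# Kernelized prediction: y = sign( sum_{s in S} alpha_s y^(s) K(x^(s), x) + b ).
import numpy as np

def rbf_kernel(u, v, sigma=1.0):
    return np.exp(-np.linalg.norm(u - v) ** 2 / (2 * sigma ** 2))

def predict(x, alpha_sv, y_sv, X_sv, b=0.0, kernel=rbf_kernel):
    score = sum(a * ys * kernel(xs, x) for a, ys, xs in zip(alpha_sv, y_sv, X_sv)) + b
    return np.sign(score)
```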

5 Regularized SVM

We now introduce regularization to SVM. The regularized SVM is also called the soft-margin SVM. In the regularized SVM, we allow some training examples to be misclassified and some training examples to fall within the margin region, as shown in Fig. 6. For the linearly separable case, the constraints are

y^(i)(ω^T x^(i) + b) ≥ 1

for ∀i = 1, ···, m, while in the non-separable case, we relax the above constraints as

y^(i)(ω^T x^(i) + b) ≥ 1 − ξ_i


for ∀i = 1, ···, m, where ξ_i is called a slack variable. In the non-separable case, we allow misclassified training examples, but we would like their number to be minimized by minimizing the sum of the slack variables Σ_i ξ_i. We reformulate the SVM problem by introducing the slack variables ξ_i:

min_{ω,b,ξ}  (1/2)‖ω‖² + C Σ_{i=1}^m ξ_i    (35)
s.t.  y^(i)(ω^T x^(i) + b) ≥ 1 − ξ_i, ∀i = 1, ···, m    (36)
      ξ_i ≥ 0, ∀i = 1, ···, m    (37)

The parameter C controls the relative weighting between the following two goals: i) a small C implies that ‖ω‖²/2 dominates, such that large margins are preferred, but this allows a potentially large number of misclassified training examples; ii) a large C means that C Σ_{i=1}^m ξ_i dominates, such that the number of misclassified examples is decreased at the expense of having a smaller margin. The Lagrangian can be defined by

L(ω, b, ξ, α, r) = (1/2) ω^T ω + C Σ_{i=1}^m ξ_i − Σ_{i=1}^m α_i [y^(i)(ω^T x^(i) + b) − 1 + ξ_i] − Σ_{i=1}^m r_i ξ_i

and according to the KKT conditions, we have

∇_ω L(ω^*, b^*, ξ^*, α^*, r^*) = 0  ⇒  ω^* = Σ_{i=1}^m α_i^* y^(i) x^(i)    (38)
∇_b L(ω^*, b^*, ξ^*, α^*, r^*) = 0  ⇒  Σ_{i=1}^m α_i^* y^(i) = 0    (39)
∇_{ξ_i} L(ω^*, b^*, ξ^*, α^*, r^*) = 0  ⇒  α_i^* + r_i^* = C, for ∀i    (40)
α_i^*, r_i^*, ξ_i^* ≥ 0, for ∀i    (41)
y^(i)(ω^{*T} x^(i) + b^*) + ξ_i^* − 1 ≥ 0, for ∀i    (42)
α_i^* (y^(i)(ω^{*T} x^(i) + b^*) + ξ_i^* − 1) = 0, for ∀i    (43)
r_i^* ξ_i^* = 0, for ∀i    (44)

We then formulate the corresponding dual problem as

max_α  Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m y^(i) y^(j) α_i α_j ⟨x^(i), x^(j)⟩
s.t.  0 ≤ α_i ≤ C, ∀i = 1, ···, m
      Σ_{i=1}^m α_i y^(i) = 0
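Compared with the hard-margin dual sketched in Section 3, the only change needed in that illustrative CVXOPT formulation is the box constraint 0 ≤ α_i ≤ C (again a sketch, not the notes' own code):

```python
# Soft-margin dual: same QP as before, but with the box constraint 0 <= alpha_i <= C.
import numpy as np
from cvxopt import matrix, solvers

def soft_margin_svm_dual(X, y, C=1.0):
    m = X.shape[0]
    K = X @ X.T
    P = matrix(np.outer(y, y) * K + 1e-8 * np.eye(m))      # small ridge for numerical stability
    q = matrix(-np.ones(m))
    G = matrix(np.vstack([-np.eye(m), np.eye(m)]))          # -alpha_i <= 0  and  alpha_i <= C
    h = matrix(np.hstack([np.zeros(m), C * np.ones(m)]))
    A = matrix(y.reshape(1, -1).astype(float))              # sum_i alpha_i y^(i) = 0
    b = matrix(0.0)
    return np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
```

The bias b^* can then be averaged over the samples with 0 < α_i^* < C, as derived next.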

We can use existing QP solvers to address the above optimization problem. Let α_i^* be the optimal values of α. We now show how to calculate the optimal values of ω and b. According to the KKT conditions, ω^* can be calculated by

ω^* = Σ_{i=1}^m α_i^* y^(i) x^(i)


Since α_i^* + r_i^* = C for ∀i, we have

r_i^* = C − α_i^*, ∀i

Also, considering r_i^* ξ_i^* = 0 for ∀i, we have

(C − α_i^*) ξ_i^* = 0, ∀i

For ∀i such that α_i^* ≠ C, we have ξ_i^* = 0, and thus

α_i^* (y^(i)(ω^{*T} x^(i) + b^*) − 1) = 0

for those i's. For ∀i such that 0 < α_i^* < C, we have

y^(i)(ω^{*T} x^(i) + b^*) = 1

Hence,

ω^{*T} x^(i) + b^* = y^(i)

for {i : 0 < α_i^* < C}. We finally calculate b^* as

b^* = Σ_{i: 0 < α_i^* < C} (y^(i) − ω^{*T} x^(i)) / Σ_{i=1}^m 1(0 < α_i^* < C)
