Lecture 4: Support Vector Machines
Introduction to Learning and Analysis of Big Data
Kontorovich and Sabato (BGU) Lecture 4 1 / 14
Linear predictors: Intermediate summary
Linear predictors are very popular, because:
- If the sample size is Θ(d) (e.g. 10 times the dimension), the training error and the true error will probably be similar.
- For many natural problems, there are linear predictors with low error.
Computing the ERM for a linear predictor is NP-hard.
But in the realizable/separable case, there are efficient algorithms:
- Using Linear Programming;
- The Batch Perceptron algorithm.
The Batch Perceptron is faster if the margin is larger.
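The Batch Perceptron mentioned above can be sketched in a few lines of plain Python; the toy sample and the predictor h_w(x) = sign(⟨w, x⟩) through the origin are illustrative assumptions, not part of the slides:

```python
def batch_perceptron(samples):
    """Batch Perceptron sketch: sweep over the sample and add y*x to w
    whenever (x, y) is misclassified; terminates on separable data."""
    d = len(samples[0][0])
    w = [0.0] * d
    updated = True
    while updated:                       # repeat until one clean pass
        updated = False
        for x, y in samples:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                updated = True
    return w

# A made-up separable sample (labels in {-1, +1}):
S = [((1.0, 2.0), 1), ((2.0, 1.0), 1), ((-1.0, -1.5), -1)]
w = batch_perceptron(S)
```

The convergence theorem guarantees termination here, with the number of updates governed by the sample's margin.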
Linear predictors: Problems and Solutions
Two problems:
- In many cases the dimension is huge: we don't have Θ(d) examples.
- In many cases the problem is not separable.
One solution: Support Vector Machine (SVM).
Another challenge: non-linear predictors.
- Solution: Kernels.
Which linear predictors to choose?
ERM: Find a linear predictor which is best on the sample.
In the separable case: A predictor with zero training error.
Which one should the algorithm choose?
Intuitively: choose the one with the largest margin.
We expect this to work better for new unseen examples.
Large margin reduces sample complexity
Overfitting: When true error is much larger than training error.
Sample complexity: Training sample size to prevent overfitting.
Assume a separable problem.
Recall margin definition: For w such that err(h_w, S) = 0,

    γ(w) := (1/R) · min_{i≤m} |⟨w, x_i⟩| / ‖w‖,    R := max_{i≤m} ‖x_i‖.
Large-margin algorithm
Select the separating linear predictor with maximal margin on sample.
What is the sample complexity of this algorithm?
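As a sanity check, the sample margin γ(w) from the definition above can be computed directly; the two-point sample and the candidate w are made up for illustration:

```python
import math

def sample_margin(w, xs):
    """gamma(w) = (1/R) * min_i |<w, x_i>| / ||w||, with R = max_i ||x_i||.
    Assumes w already has zero training error on the sample."""
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    norm = lambda a: math.sqrt(dot(a, a))
    R = max(norm(x) for x in xs)
    return min(abs(dot(w, x)) for x in xs) / (R * norm(w))

xs = [(3.0, 0.0), (0.0, 4.0)]
print(sample_margin((1.0, 1.0), xs))   # 3 / (4 * sqrt(2))
```

Note the margin is scale-invariant in w: doubling w changes neither term of the ratio.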
Large margin reduces sample complexity

Recall margin definition: For w such that err(h_w, S) = 0,

    γ(w) := (1/R) · min_{i≤m} |⟨w, x_i⟩| / ‖w‖,    R := max_{i≤m} ‖x_i‖.
What is the sample complexity of this algorithm? For this we need to look at properties of the distribution. Generalize the definition of R to arbitrary distributions D:

    R_D := the smallest value such that P[‖X‖ ≤ R_D] = 1.

Define the distribution margin of w: For w such that err(h_w, D) = 0,

    γ_D(w) := the largest value such that P_{(X,Y)∼D}[ (1/R_D) · |⟨w, X⟩| / ‖w‖ ≥ γ_D(w) ] = 1.

Best possible margin on the distribution:

    γ* := max_{w : err(h_w, D) = 0} γ_D(w).
Large margin reduces sample complexity
Theorem
The sample complexity of the large-margin algorithm is O(1/γ*²).
Compare to ERM:
- Select any linear predictor with lowest training error.
- The sample complexity for this algorithm is O(d).

For a separable sample: There are many ERM solutions, but only one large-margin solution.
Large-margin is also an ERM algorithm!
So large-margin sample complexity is O(min(d, 1/γ*²)).
How do we implement the large-margin algorithm?
The Hard-SVM algorithm

Recall R := max_{i≤m} ‖x_i‖. The margin of a w that separates S:

    γ(w) := (1/R) · min_{i≤m} |⟨w, x_i⟩| / ‖w‖ = (1/R) · min_{i≤m} y_i⟨w, x_i⟩ / ‖w‖.
Hard-SVM: Find the separator with the largest margin.
Works on a separable sample.
Hard-SVM
input: A separable training sample S = {(x_1, y_1), . . . , (x_m, y_m)}
output: w ∈ R^d such that ∀i ≤ m, h_w(x_i) = y_i.
1: Find w that solves the following problem:

       Minimize ‖w‖²  s.t.  ∀i, y_i⟨w, x_i⟩ ≥ 1.

2: Return w.
The Hard-SVM algorithm
Hard-SVM minimization problem: Minimize ‖w‖² s.t. ∀i, y_i⟨w, x_i⟩ ≥ 1.
Claim: Hard-SVM returns a maximal-margin separator.
Proof:
- The margin of w is γ(w) = (1/R) min_i y_i⟨w, x_i⟩ / ‖w‖.
- So for all i, y_i⟨w, x_i⟩ ≥ R · γ(w) · ‖w‖.
- Define w̄ := w / (R · γ(w) · ‖w‖).
- w and w̄ define the same separator, with the same margin.
- w̄ satisfies the constraints of the minimization problem:

      y_i⟨w̄, x_i⟩ = y_i⟨w, x_i⟩ / (R · γ(w) · ‖w‖) ≥ 1.

- Also

      ‖w̄‖ = ‖ w / (R · γ(w) · ‖w‖) ‖ = ‖w‖ / (R · γ(w) · ‖w‖) = 1 / (R · γ(w)).

- So, the norm ‖w̄‖ is minimal when the margin γ(w) is maximal.
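The rescaling step in the proof can be checked numerically; the separator w and the two-point sample below are made up for illustration:

```python
import math

dot = lambda a, b: sum(p * q for p, q in zip(a, b))
norm = lambda a: math.sqrt(dot(a, a))

S = [((3.0, 0.0), 1), ((0.0, 4.0), 1)]    # toy separable sample
w = (1.0, 1.0)                             # some separator of S

R = max(norm(x) for x, _ in S)
gamma = min(y * dot(w, x) for x, y in S) / (R * norm(w))
c = R * gamma * norm(w)                    # rescaling constant from the proof
w_bar = tuple(wi / c for wi in w)

# w_bar is feasible for the Hard-SVM program ...
assert all(y * dot(w_bar, x) >= 1 - 1e-9 for x, y in S)
# ... and its norm equals 1 / (R * gamma), as derived above
assert abs(norm(w_bar) - 1 / (R * gamma)) < 1e-9
```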
The Hard-SVM algorithm
Hard-SVM minimization problem: Minimize ‖w‖² s.t. ∀i, y_i⟨w, x_i⟩ ≥ 1.

How should we solve this problem?

A quadratic program (QP) is a problem of the following form:

    minimize_{w ∈ R^d}  (1/2) wᵀ · H · w + ⟨u, w⟩
    subject to  Aw ≥ v.

- w ∈ R^d: a vector we wish to find.
- u ∈ R^d, H ∈ R^{d×d}, v ∈ R^m, A ∈ R^{m×d}.
- The values of u, H, v, A define the specific quadratic program.

QPs can be solved efficiently; many solvers are available.
In Matlab: w = quadprog(H,u,A,v).
Solving Hard-SVM with Quadratic Programming
Quadratic program ↔ Hard-SVM minimization problem:

    minimize_{w ∈ R^d}  (1/2) wᵀ · H · w + ⟨u, w⟩   subject to  Aw ≥ v.
    Minimize ‖w‖²  s.t.  ∀i, y_i⟨w, x_i⟩ ≥ 1.

Define I_d := the identity matrix (diagonal with 1's on the diagonal).

Expressing ‖w‖² as (1/2) wᵀ · H · w:

    ‖w‖² = Σ_{i=1}^d w(i)² = (w(1), w(2), . . . , w(d)) · I_d · (w(1), w(2), . . . , w(d))ᵀ.

So H = 2 · I_d and u = (0, . . . , 0).

A and v are as in the Perceptron: v = (1, . . . , 1), and row i of the matrix A is y_i x_i ≡ (y_i · x_i(1), . . . , y_i · x_i(d)).
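Putting the mapping together, the QP data (H, u, A, v) for a given sample can be assembled as follows (a plain-Python sketch; the two-point sample is hypothetical):

```python
def hard_svm_qp_data(samples):
    """Build (H, u, A, v) so that
       minimize (1/2) w^T H w + <u, w>  s.t.  A w >= v
    is exactly: minimize ||w||^2 s.t. for all i, y_i <w, x_i> >= 1."""
    d = len(samples[0][0])
    H = [[2.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # 2 * I_d
    u = [0.0] * d
    A = [[y * xi for xi in x] for x, y in samples]   # row i is y_i * x_i
    v = [1.0] * len(samples)
    return H, u, A, v

S = [((1.0, 2.0), 1), ((-2.0, 0.5), -1)]
H, u, A, v = hard_svm_qp_data(S)
print(A)   # [[1.0, 2.0], [2.0, -0.5]]
```

These matrices can then be handed to an off-the-shelf QP solver (e.g. quadprog in Matlab, as on the slide); note that some solvers expect constraints in the form Gw ≤ h, in which case one passes G = −A and h = −v.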
Hard-SVM: Recap
Hard-SVM works on separable problems.
It finds the linear predictor with the maximal margin on the training sample.
It requires only O(min(d, 1/γ*²)) training examples.
- Recall: A general ERM for linear predictors requires O(d) examples.
- Recall: The Perceptron requires O(1/γ_S²) updates (not a coincidence!)
Hard-SVM can be solved efficiently with a quadratic program.
Example: margin versus dimension
Hard-SVM improves the sample complexity over ERM if 1/γ*² ≪ d.
Example: Classifying documents — is the document about sports?
Representation: Bag of Words.
- X = R^d, each coordinate represents a word in English.
- x = a binary vector, x(i) = I[word i exists in the document].
- d = the number of words in the English language.
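A binary bag-of-words vector can be built as follows; the tiny vocabulary and document here are made-up stand-ins for the full English word list:

```python
def bag_of_words(document, vocabulary):
    """Binary bag-of-words: x(i) = 1 iff vocabulary word i occurs
    in the document."""
    words = set(document.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

vocab = ["ball", "election", "goal", "team"]
x = bag_of_words("The team scored a late goal", vocab)
print(x)   # [0, 0, 1, 1]
```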
We will show that under reasonable assumptions, the sample complexity of Hard-SVM is smaller than that of a general ERM.
Example: margin versus dimension

Recall: R_D = the smallest value such that P[‖X‖ ≤ R_D] = 1, and for w such that err(h_w, D) = 0,

    γ_D(w) := the largest value such that P_{(X,Y)∼D}[ (1/R_D) · |⟨w, X⟩| / ‖w‖ ≥ γ_D(w) ] = 1.

Best possible margin on the distribution: γ* := max_{w : err(h_w, D) = 0} γ_D(w).

There are d coordinates (possible words, say 100,000).

Suppose:
- There are at most n different words in each document (say 100);
- The best w* uses a small number k of words (say 10);
- The coordinates of w* are in {−1, 0, 1};
- And w* classifies all documents correctly.

Then:
- |⟨w*, X⟩| ≥ 1 (the inner product is an integer, and it is nonzero since w* classifies correctly);
- ‖w*‖ = √(Σ_{i=1}^d w*(i)²) ≤ √k;
- R_D ≤ √n;
- So γ_D(w*) ≥ 1/√(nk).

We conclude that 1/γ*² ≪ d if nk ≪ d.

So, large-margin learning will have a smaller estimation error for the same sample size. In other words, it will need fewer examples to avoid overfitting.
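Plugging in the illustrative numbers above confirms the gap between the margin-based bound and the dimension:

```python
import math

n, k, d = 100, 10, 100_000             # words/document, words used by w*, vocabulary size
gamma_star_lb = 1 / math.sqrt(n * k)   # gamma* >= 1/sqrt(nk)
print(1 / gamma_star_lb ** 2, d)       # 1/gamma*^2 <= nk = 1000, versus d = 100000
```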