Lecture 4: Support Vector Machines
Introduction to Learning and Analysis of Big Data
Kontorovich and Sabato (BGU) Lecture 4 1 / 14
Linear predictors: Intermediate summary
Linear predictors are very popular, because:
- If the sample size is Θ(d) (e.g. 10 times the dimension), the training error and the true error will probably be similar.
- For many natural problems, there are linear predictors with low error.
Computing the ERM for a linear predictor is NP-hard.
But in the realizable/separable case, there are efficient algorithms:
- Using Linear Programming;
- The Batch Perceptron algorithm.
The Batch Perceptron is faster if the margin is larger.
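The Batch Perceptron mentioned above can be sketched in a few lines of plain Python; the toy sample and the predictor h_w(x) = sign(⟨w, x⟩) through the origin are illustrative assumptions, not part of the slides:

```python
def batch_perceptron(samples):
    """Batch Perceptron sketch: sweep over the sample and add y*x to w
    whenever (x, y) is misclassified; terminates on separable data."""
    d = len(samples[0][0])
    w = [0.0] * d
    updated = True
    while updated:                       # repeat until one clean pass
        updated = False
        for x, y in samples:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                updated = True
    return w

# A made-up separable sample (labels in {-1, +1}):
S = [((1.0, 2.0), 1), ((2.0, 1.0), 1), ((-1.0, -1.5), -1)]
w = batch_perceptron(S)
```

The convergence theorem guarantees termination here, with the number of updates governed by the sample's margin.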
Linear predictors: Problems and Solutions
Two problems:
- In many cases the dimension is huge: we don't have Θ(d) examples.
- In many cases the problem is not separable.
One solution: Support Vector Machine (SVM).
Another challenge: non-linear predictors.
- Solution: Kernels.
Which linear predictors to choose?
ERM: Find a linear predictor which is best on the sample.
In the separable case: A predictor with zero training error.
Which one should the algorithm choose?
Intuitively: choose the one with the largest margin.
We expect this to work better for new unseen examples.
Large margin reduces sample complexity
Overfitting: When true error is much larger than training error.
Sample complexity: Training sample size to prevent overfitting.
Assume a separable problem.
Recall margin definition: For w such that err(h_w, S) = 0,

    γ(w) := (1/R) · min_{i≤m} |⟨w, x_i⟩| / ‖w‖,    R := max_{i≤m} ‖x_i‖.
Large-margin algorithm
Select the separating linear predictor with maximal margin on sample.
What is the sample complexity of this algorithm?
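As a sanity check, the sample margin γ(w) from the definition above can be computed directly; the two-point sample and the candidate w are made up for illustration:

```python
import math

def sample_margin(w, xs):
    """gamma(w) = (1/R) * min_i |<w, x_i>| / ||w||, with R = max_i ||x_i||.
    Assumes w already has zero training error on the sample."""
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    norm = lambda a: math.sqrt(dot(a, a))
    R = max(norm(x) for x in xs)
    return min(abs(dot(w, x)) for x in xs) / (R * norm(w))

xs = [(3.0, 0.0), (0.0, 4.0)]
print(sample_margin((1.0, 1.0), xs))   # 3 / (4 * sqrt(2))
```

Note the margin is scale-invariant in w: doubling w changes neither term of the ratio.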
Large margin reduces sample complexity

Recall margin definition: For w such that err(h_w, S) = 0,

    γ(w) := (1/R) · min_{i≤m} |⟨w, x_i⟩| / ‖w‖,    R := max_{i≤m} ‖x_i‖.
What is the sample complexity of this algorithm? For this we need to look at properties of the distribution. Generalize the definition of R to arbitrary distributions D:

    R_D := the smallest value such that P[‖X‖ ≤ R_D] = 1.

Define the distribution margin of w: For w such that err(h_w, D) = 0,

    γ_D(w) := the largest value such that P_{(X,Y)∼D}[ (1/R_D) · |⟨w, X⟩| / ‖w‖ ≥ γ_D(w) ] = 1.

Best possible margin on the distribution:

    γ* := max_{w : err(h_w, D) = 0} γ_D(w).
Large margin reduces sample complexity
Theorem
The sample complexity of the large-margin algorithm is O(1/γ*²).
Compare to ERM:
- Select any linear predictor with lowest training error.
- The sample complexity for this algorithm is O(d).

For a separable sample: There are many ERM solutions, but only one large-margin solution.
Large-margin is also an ERM algorithm!
So large-margin sample complexity is O(min(d, 1/γ*²)).
How do we implement the large-margin algorithm?
The Hard-SVM algorithm

Recall R := max_{i≤m} ‖x_i‖. The margin of a w that separates S:

    γ(w) := (1/R) · min_{i≤m} |⟨w, x_i⟩| / ‖w‖ = (1/R) · min_{i≤m} y_i⟨w, x_i⟩ / ‖w‖.
Hard-SVM: Find the separator with the largest margin.
Works on a separable sample.
Hard-SVM
input: A separable training sample S = {(x_1, y_1), . . . , (x_m, y_m)}
output: w ∈ R^d such that ∀i ≤ m, h_w(x_i) = y_i.
1: Find w that solves the following problem:

       Minimize ‖w‖²  s.t.  ∀i, y_i⟨w, x_i⟩ ≥ 1.

2: Return w.
The Hard-SVM algorithm
Hard-SVM minimization problem: Minimize ‖w‖² s.t. ∀i, y_i⟨w, x_i⟩ ≥ 1.
Claim: Hard-SVM returns a maximal-margin separator.
Proof:
- The margin of w is γ(w) = (1/R) min_i y_i⟨w, x_i⟩ / ‖w‖.
- So for all i, y_i⟨w, x_i⟩ ≥ R · γ(w) · ‖w‖.
- Define w̄ := w / (R · γ(w) · ‖w‖).
- w and w̄ define the same separator, with the same margin.
- w̄ satisfies the constraints of the minimization problem:

      y_i⟨w̄, x_i⟩ = y_i⟨w, x_i⟩ / (R · γ(w) · ‖w‖) ≥ 1.

- Also

      ‖w̄‖ = ‖ w / (R · γ(w) · ‖w‖) ‖ = ‖w‖ / (R · γ(w) · ‖w‖) = 1 / (R · γ(w)).

- So, the norm ‖w̄‖ is minimal when the margin γ(w) is maximal.
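The rescaling step in the proof can be checked numerically; the separator w and the two-point sample below are made up for illustration:

```python
import math

dot = lambda a, b: sum(p * q for p, q in zip(a, b))
norm = lambda a: math.sqrt(dot(a, a))

S = [((3.0, 0.0), 1), ((0.0, 4.0), 1)]    # toy separable sample
w = (1.0, 1.0)                             # some separator of S

R = max(norm(x) for x, _ in S)
gamma = min(y * dot(w, x) for x, y in S) / (R * norm(w))
c = R * gamma * norm(w)                    # rescaling constant from the proof
w_bar = tuple(wi / c for wi in w)

# w_bar is feasible for the Hard-SVM program ...
assert all(y * dot(w_bar, x) >= 1 - 1e-9 for x, y in S)
# ... and its norm equals 1 / (R * gamma), as derived above
assert abs(norm(w_bar) - 1 / (R * gamma)) < 1e-9
```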
The Hard-SVM algorithm
Hard-SVM minimization problem: Minimize ‖w‖² s.t. ∀i, y_i⟨w, x_i⟩ ≥ 1.

How should we solve this problem?

A quadratic program (QP) is a problem of the following form:

    minimize_{w ∈ R^d}  (1/2) wᵀ · H · w + ⟨u, w⟩
    subject to  Aw ≥ v.

- w ∈ R^d: a vector we wish to find.
- u ∈ R^d, H ∈ R^{d×d}, v ∈ R^m, A ∈ R^{m×d}.
- The values of u, H, v, A define the specific quadratic program.

QPs can be solved efficiently; many solvers are available.
In Matlab: w = quadprog(H,u,A,v).
Solving Hard-SVM with Quadratic Programming
Quadratic program ↔ Hard-SVM minimization problem:

    minimize_{w ∈ R^d}  (1/2) wᵀ · H · w + ⟨u, w⟩   subject to  Aw ≥ v.
    Minimize ‖w‖²  s.t.  ∀i, y_i⟨w, x_i⟩ ≥ 1.

Define I_d := the identity matrix (diagonal with 1's on the diagonal).

Expressing ‖w‖² as (1/2) wᵀ · H · w:

    ‖w‖² = Σ_{i=1}^d w(i)² = (w(1), w(2), . . . , w(d)) · I_d · (w(1), w(2), . . . , w(d))ᵀ.

So H = 2 · I_d and u = (0, . . . , 0).

A and v are as in the Perceptron: v = (1, . . . , 1), and row i of the matrix A is y_i x_i ≡ (y_i · x_i(1), . . . , y_i · x_i(d)).
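Putting the mapping together, the QP data (H, u, A, v) for a given sample can be assembled as follows (a plain-Python sketch; the two-point sample is hypothetical):

```python
def hard_svm_qp_data(samples):
    """Build (H, u, A, v) so that
       minimize (1/2) w^T H w + <u, w>  s.t.  A w >= v
    is exactly: minimize ||w||^2 s.t. for all i, y_i <w, x_i> >= 1."""
    d = len(samples[0][0])
    H = [[2.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # 2 * I_d
    u = [0.0] * d
    A = [[y * xi for xi in x] for x, y in samples]   # row i is y_i * x_i
    v = [1.0] * len(samples)
    return H, u, A, v

S = [((1.0, 2.0), 1), ((-2.0, 0.5), -1)]
H, u, A, v = hard_svm_qp_data(S)
print(A)   # [[1.0, 2.0], [2.0, -0.5]]
```

These matrices can then be handed to an off-the-shelf QP solver (e.g. quadprog in Matlab, as on the slide); note that some solvers expect constraints in the form Gw ≤ h, in which case one passes G = −A and h = −v.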
Hard-SVM: Recap
Hard-SVM works on separable problems.
It finds the linear predictor with the maximal margin on the training sample.
It requires only O(min(d, 1/γ*²)) training examples.
- Recall: A general ERM for linear predictors requires O(d) examples.
- Recall: The Perceptron requires O(1/γ_S²) updates (not a coincidence!)
Hard-SVM can be solved efficiently with a quadratic program.
Example: margin versus dimension
Hard-SVM improves the sample complexity over ERM if 1/γ*² ≪ d.
Example: Classifying documents — is the document about sports?
Representation: Bag of Words.
- X = R^d, each coordinate represents a word in English.
- x = a binary vector, x(i) = I[word i exists in the document].
- d = the number of words in the English language.
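A binary bag-of-words vector can be built as follows; the tiny vocabulary and document here are made-up stand-ins for the full English word list:

```python
def bag_of_words(document, vocabulary):
    """Binary bag-of-words: x(i) = 1 iff vocabulary word i occurs
    in the document."""
    words = set(document.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

vocab = ["ball", "election", "goal", "team"]
x = bag_of_words("The team scored a late goal", vocab)
print(x)   # [0, 0, 1, 1]
```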
We will show that under reasonable assumptions, the sample complexity of Hard-SVM is smaller than that of a general ERM.
Example: margin versus dimension

Recall: R_D = the smallest value such that P[‖X‖ ≤ R_D] = 1, and for w such that err(h_w, D) = 0,

    γ_D(w) := the largest value such that P_{(X,Y)∼D}[ (1/R_D) · |⟨w, X⟩| / ‖w‖ ≥ γ_D(w) ] = 1.

Best possible margin on the distribution: γ* := max_{w : err(h_w, D) = 0} γ_D(w).

There are d coordinates (possible words, say 100,000).

Suppose:
- There are at most n different words in each document (say 100);
- The best w* uses a small number k of words (say 10);
- The coordinates of w* are in {−1, 0, 1};
- And w* classifies all documents correctly.

Then:
- |⟨w*, X⟩| ≥ 1 (the inner product is an integer, and it is nonzero since w* classifies correctly);
- ‖w*‖ = √(Σ_{i=1}^d w*(i)²) ≤ √k;
- R_D ≤ √n;
- So γ_D(w*) ≥ 1/√(nk).

We conclude that 1/γ*² ≪ d if nk ≪ d.

So, large-margin learning will have a smaller estimation error for the same sample size. In other words, it will need fewer examples to avoid overfitting.
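Plugging in the illustrative numbers above confirms the gap between the margin-based bound and the dimension:

```python
import math

n, k, d = 100, 10, 100_000             # words/document, words used by w*, vocabulary size
gamma_star_lb = 1 / math.sqrt(n * k)   # gamma* >= 1/sqrt(nk)
print(1 / gamma_star_lb ** 2, d)       # 1/gamma*^2 <= nk = 1000, versus d = 100000
```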