Lecture 8: Kernels
Introduction to Learning and Analysis of Big Data
Kontorovich and Sabato (BGU) Lecture 8 1 / 27
Non-linear Classifiers
SVM tries to find a good linear classifier on the training set.
What if such a classifier does not exist?
There might be a good non-linear classifier.
Can we make this problem linearly separable?
Change representations: Instead of x ∈ R, use ψ(x) = (x, x²) ∈ R².
[Figure: the sample plotted in the (x, x²) plane.]
There is a linear separator in this new representation!
Feature maps
ψ(x) = (x, x²) maps examples from R to R².
A linear predictor in new space is a non-linear predictor in original space.
Many different feature mappings are possible.
The feature mapping procedure:
1. Given X and some learning task, choose some feature mapping ψ : X → F.
- F is the feature space.
- F is usually Rⁿ for some n. (Later: n can also be infinite!)
2. When given the training set S = {(x1, y1), . . . , (xm, ym)}, create an image of the set in the feature space:
Sψ = {(ψ(x1), y1), . . . , (ψ(xm), ym)}
3. Find a good linear predictor hw : F → Y on Sψ.
- w is a vector in the feature space Rⁿ.
4. For new examples x ∈ X , predict the label
hw (ψ(x)) = sign(〈w , ψ(x)〉).
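A minimal sketch of the four steps in Python (the sample, the labels, and the hand-picked w below are illustrative assumptions, not from the lecture; a real step 3 would run a learning algorithm rather than fix w by hand):

```python
# Sketch of the feature-mapping procedure for psi(x) = (1, x, x^2).
# The data, labels, and the hand-picked w are illustrative assumptions.

def psi(x):
    # Step 1: the chosen feature map R -> R^3 (with a constant-1 bias feature).
    return (1.0, x, x * x)

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

# A training set that is NOT linearly separable on the real line:
# positive far from 0, negative near 0.
S = [(-2.0, +1), (-1.0, -1), (0.5, -1), (2.0, +1)]

# Steps 2-3: map the sample and pick a linear predictor in feature space.
# Here w encodes the rule "x^2 >= 2.25", chosen by hand for this sketch.
w = (-2.25, 0.0, 1.0)

# Step 4: predict with h_w(psi(x)) = sign(<w, psi(x)>).
def predict(x):
    return 1 if inner(w, psi(x)) >= 0 else -1

assert all(predict(x) == y for x, y in S)
```

In the mapped space the separator is linear, even though the induced predictor on R is the non-linear rule x² ≥ 2.25.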
Feature maps
The feature mapping induces a new distribution.
- D is the original distribution over X × Y.
- Dψ := the new distribution over F × Y, induced by ψ and D.
For any h, err(h,Dψ) = err(h ◦ ψ,D).
So, a good classifier on Dψ induces a good classifier on D.
Which feature map should we use?
Similar to selecting a hypothesis class:
- Low approximation error: there is a good predictor for D in the mapped feature space.
- Low estimation error: a learning algorithm on Dψ will find a good predictor from the training set.
E.g., if there is a linear classifier with a large margin in the feature space.
General-purpose feature maps
The feature mapping is selected before observing the training sample!
- Otherwise, we would allow any function =⇒ overfitting.
How to choose a good feature mapping?
- Domain knowledge
- A general-purpose feature mapping
Polynomial feature maps: Allow polynomials as separators.
For x ∈ R (one-dimensional), define ψ : R → Rᵏ⁺¹ by
ψ(x) = (1, x, x², . . . , xᵏ).
For x ∈ R, we can now learn any separator that is a polynomial of degree ≤ k:
sign(∑_{i=0}^k wᵢxⁱ) = sign(〈w, ψ(x)〉).
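A quick numeric check (Python; the degree k, the coefficients w, and the test points are arbitrary choices for this sketch) that the polynomial ∑_{i=0}^k wᵢxⁱ equals 〈w, ψ(x)〉 for ψ(x) = (1, x, . . . , xᵏ):

```python
# Check that sum_i w_i * x^i == <w, psi(x)> for psi(x) = (1, x, ..., x^k).
# The particular k, w, and test points are arbitrary illustrative choices.
k = 3
w = [4.0, -1.17, 0.0, 3.0]          # coefficients w_0..w_k

def psi(x):
    return [x ** i for i in range(k + 1)]

def poly(x):
    return sum(w[i] * x ** i for i in range(k + 1))

def linear_in_feature_space(x):
    return sum(wi * fi for wi, fi in zip(w, psi(x)))

for x in (-2.0, 0.0, 0.5, 3.0):
    assert abs(poly(x) - linear_in_feature_space(x)) < 1e-9
```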
Polynomial feature maps in high dimensions
For x ∈ Rd : Allow multivariate polynomials.
- Multivariate polynomial example:
3·x(4)⁶·x(16)⁴ − 1.17·x(3)¹⁹·x(27)¹¹·x(15) + 4.
- k-degree multivariate polynomial: each monomial’s powers sum to at most k.
- Define the set of all integer sequences of length d which sum to at most k:
I_d^k := {(t1, t2, . . . , td) | ∑_{j=1}^d tj ≤ k}
- The set of all multivariate polynomials of degree at most k:
{ ∑_{z∈I_d^k} w(z) ∏_{i=1}^d x(i)^{z(i)} | w ∈ R^{|I_d^k|} }.
- w(z) are the coefficients of the polynomial.
Polynomial feature maps in high dimensions
Define a feature mapping that allows all multivariate polynomials up to degree k.
Every coordinate in ψ(x) corresponds to one z ∈ I_d^k.
I_d^k := {(t1, t2, . . . , td) | ∑_{j=1}^d tj ≤ k}
Examples:
- d = 2, k = 2: ψ(x) = (1, x(1), x(2), x(1)x(2), x(1)², x(2)²).
- d = 6, k = 7:
  - Coordinate (3, 0, 0, 2, 2, 0) of ψ(x) corresponds to x(1)³·x(4)²·x(5)².
  - Coordinate (0, 0, 5, 0, 0, 0) corresponds to x(3)⁵.
  - If w has 7 in coordinate (0, 0, 5, 0, 1, 0) and the rest is zero, the polynomial represented by w is 7·x(3)⁵·x(5).
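The index set I_d^k and the corresponding monomial coordinates can be enumerated directly; a small Python sketch (the example point and the small d, k are made-up choices):

```python
from itertools import product

# Enumerate I_d^k = {(t_1,...,t_d) : sum_j t_j <= k} and evaluate the
# corresponding monomial coordinates of psi(x).
def index_set(d, k):
    return [z for z in product(range(k + 1), repeat=d) if sum(z) <= k]

def monomial(x, z):
    # The coordinate for z is prod_i x(i)^{z(i)}.
    out = 1.0
    for xi, ti in zip(x, z):
        out *= xi ** ti
    return out

def psi(x, k):
    return {z: monomial(x, z) for z in index_set(len(x), k)}

# d = 2, k = 2 gives the 6 coordinates from the slide:
# 1, x(1), x(2), x(1)x(2), x(1)^2, x(2)^2.
coords = psi((3.0, 5.0), 2)
assert len(coords) == 6
assert coords[(0, 0)] == 1.0      # the constant coordinate
assert coords[(1, 1)] == 15.0     # x(1)*x(2)
assert coords[(2, 0)] == 9.0      # x(1)^2
```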
Learning with feature maps
Feature maps can be in a very high dimension.
- Polynomial map of degree k: if X = Rᵈ, the number of features in F = Rⁿ needs to be n ≈ dᵏ.
- E.g., text documents with the bag-of-words representation: d = 100,000 (English words), k = 10 =⇒ n is enormous.
Two problems:
- High dimension =⇒ need a lot of training examples for ERM.
  - Solution: use a large-margin algorithm such as SVM.
- High dimension =⇒ heavy computation and memory consumption.
  - Even representing examples and w in memory can be impossible!
  - Solution for the computational issues: kernels.
The Kernel Trick
If ψ maps to a very high dimension, it can be computationally expensive to use the representations ψ(xi) directly.
Define K (x , x ′) := 〈ψ(x), ψ(x ′)〉.
The function K is also called a kernel.
The Kernel Trick: Many learning algorithms for linear separators canbe implemented using only the function K (x , x ′), and never ψ(x).
Advantage: ψ(x) can be huge-dimensional or even infinite-dimensional, while K(x, x′) might be easy to compute.
The Kernel Trick
How can the algorithm find and output w when the dimension is huge or infinite?
We need an alternative representation for w .
Claim: The hard/soft SVM objectives on ψ can be rewritten as:
Minimize f(〈w, ψ(x1)〉, . . . , 〈w, ψ(xm)〉) + R(‖w‖),
where f : Rᵐ → R is a function, and R : R⁺ → R is a monotonic non-decreasing function.
SVM objectives:
- Soft SVM:
Minimize λ‖w‖² + ℓ^h(w, S) ≡ Minimize λ‖w‖² + (1/m)∑_{i=1}^m max{0, 1 − yi〈w, ψ(xi)〉}.
Define R(a) := λa² and f(a1, . . . , am) := (1/m)∑_{i=1}^m max{0, 1 − yiai}.
- Hard SVM: Minimize ‖w‖² s.t. ∀i, yi〈w, ψ(xi)〉 ≥ 1.
Define R(a) := a² and
f(a1, . . . , am) := 0 if ∀i, yiai ≥ 1; ∞ otherwise.
There are other SVM-like algorithms with the same form.
The representer theorem
Many popular objectives on ψ can be rewritten as:
Minimize f(〈w, ψ(x1)〉, . . . , 〈w, ψ(xm)〉) + R(‖w‖),
where f : Rᵐ → R is a function, and R : R⁺ → R is a monotonic non-decreasing function.
Theorem (The representer theorem)
If ψ maps the data to Rⁿ for some n, and the objective has the form above, then an optimal solution to the objective can be written as
w = ∑_{i=1}^m α(i)ψ(xi), where α = (α(1), . . . , α(m)) ∈ Rᵐ.
This allows kernel-SVM algorithms to represent w using α(1), . . . , α(m) instead of w(1), . . . , w(n).
If n ≫ m, this is a huge improvement!
The xi with α(i) ≠ 0 are the support vectors of w.
The representer theorem
Objective: Minimize f(〈w, ψ(x1)〉, . . . , 〈w, ψ(xm)〉) + R(‖w‖),
where f : Rᵐ → R is a function, and R : R⁺ → R is a monotonic non-decreasing function.
Need to show that there is an optimal solution of the form w = ∑_{i=1}^m α(i)ψ(xi).
Note: w ∈ Rn, the feature space.
Proof. Let w∗ be an optimal solution to the objective.
w∗ ∈ Rⁿ and so are the ψ(xi), so we can write w∗ = ∑_{i=1}^m α(i)ψ(xi) + u, where ∀i ≤ m, 〈u, ψ(xi)〉 = 0 (decompose w∗ into its projection onto span{ψ(x1), . . . , ψ(xm)} and an orthogonal remainder u).
Let w := w∗ − u. Note that w = ∑_{i=1}^m α(i)ψ(xi).
Then ‖w∗‖² = ‖w‖² + ‖u‖² + 2〈w, u〉, and 〈w, u〉 = 〈∑_{i=1}^m α(i)ψ(xi), u〉 = 0. So ‖w∗‖² = ‖w‖² + ‖u‖².
Therefore ‖w‖ ≤ ‖w∗‖, so R(‖w‖) ≤ R(‖w∗‖).
For all i, 〈w, ψ(xi)〉 = 〈w∗ − u, ψ(xi)〉 = 〈w∗, ψ(xi)〉. Therefore, f(〈w, ψ(x1)〉, . . . , 〈w, ψ(xm)〉) = f(〈w∗, ψ(x1)〉, . . . , 〈w∗, ψ(xm)〉).
So w is also an optimal solution, and it has the desired representation. ∎
Rewriting the objective
Objective: Minimize f (〈w , ψ(x1)〉, . . . , 〈w , ψ(xm)〉) + R(‖w‖)
The solution has the form w = ∑_{i=1}^m α(i)ψ(xi) for some vector α ∈ Rᵐ.
To avoid representing w directly, rewrite everything using α instead:
〈w, ψ(xi)〉 = 〈∑_{j=1}^m α(j)ψ(xj), ψ(xi)〉 = ∑_{j=1}^m α(j)〈ψ(xj), ψ(xi)〉
‖w‖² = 〈∑_{j=1}^m α(j)ψ(xj), ∑_{j=1}^m α(j)ψ(xj)〉 = ∑_{i,j=1}^m α(i)α(j)〈ψ(xi), ψ(xj)〉
Recall K(x , x ′) := 〈ψ(x), ψ(x ′)〉.
So, we can rewrite the objective as:
Minimize f(∑_{j=1}^m α(j)K(xj, x1), . . . , ∑_{j=1}^m α(j)K(xj, xm)) + R(√(∑_{i,j=1}^m α(i)α(j)K(xi, xj))).
Solving the objective using the kernel function
Objective: Minimize f(〈w, ψ(x1)〉, . . . , 〈w, ψ(xm)〉) + R(‖w‖)
Rewriting the objective using the kernel function:
Minimize f(∑_{j=1}^m α(j)K(xj, x1), . . . , ∑_{j=1}^m α(j)K(xj, xm)) + R(√(∑_{i,j=1}^m α(i)α(j)K(xi, xj))).
To solve this objective, the algorithm needs only the values K(xi, xj) for all i, j ≤ m.
The algorithm never needs to see the examples xi or to calculate ψ.
The kernel values form an m ×m matrix called the Gram matrix.
The Gram matrix is G ∈ R^{m×m}, where Gi,j = K(xi, xj).
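A sketch of building the Gram matrix in Python, here with the 2-degree polynomial kernel on made-up 2-D points (the kernel choice and data are illustrative assumptions):

```python
# Build the m x m Gram matrix G with G[i][j] = K(x_i, x_j).
# Kernel and points are made-up illustrative choices.
def K(x, xp, k=2):
    # 2-degree polynomial kernel (1 + <x, x'>)^k.
    return (1.0 + sum(a * b for a, b in zip(x, xp))) ** k

xs = [(0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
m = len(xs)
G = [[K(xs[i], xs[j]) for j in range(m)] for i in range(m)]

assert G[0][0] == 4.0                      # (1 + 1)^2
assert G[0][1] == 1.0                      # (1 + 0)^2
# The Gram matrix is symmetric, since K(x, x') = K(x', x).
assert all(G[i][j] == G[j][i] for i in range(m) for j in range(m))
```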
Example: Kernel soft-SVM
The soft-SVM objective for the feature mapping ψ:
Minimize λ‖w‖² + (1/m)∑_{i=1}^m ξi
s.t. ∀i, yi〈w, ψ(xi)〉 ≥ 1 − ξi, and ξi ≥ 0.
How is it implemented using a kernel?
Recall: 〈w, ψ(xi)〉 = ∑_{j=1}^m α(j)K(xj, xi) and ‖w‖² = ∑_{i,j=1}^m α(i)α(j)K(xi, xj).
Kernel Soft-SVM
input: the training-sample Gram matrix G ∈ R^{m×m}, the training labels y1, . . . , ym, a parameter λ > 0.
output: α ∈ Rᵐ
1: Find α that solves the following problem:
Minimize λαᵀGα + (1/m)∑_{i=1}^m ξi
s.t. ∀i, yi〈α, G[i]〉 ≥ 1 − ξi, and ξi ≥ 0.
(G[i] is row i of G.)
2: Return α.
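The problem above is a quadratic program and would normally go to a QP solver. As a minimal sketch, the equivalent unconstrained form λαᵀGα + (1/m)∑ max{0, 1 − yi〈α, G[i]〉} can be minimized by plain subgradient descent (pure Python; the toy data, linear kernel, step size, and iteration count are assumptions for illustration, not the lecture's algorithm of choice):

```python
# Kernel soft-SVM by subgradient descent on
#   lambda * a^T G a + (1/m) * sum_i max(0, 1 - y_i * <a, G[i]>).
# A real implementation would use a QP solver; everything concrete here
# (data, kernel, step size, iteration count) is an illustrative assumption.
def kernel_soft_svm(G, y, lam=0.1, step=0.05, iters=500):
    m = len(y)
    a = [0.0] * m
    for _ in range(iters):
        margins = [y[i] * sum(a[j] * G[i][j] for j in range(m)) for i in range(m)]
        grad = [0.0] * m
        for j in range(m):
            # Gradient of the regularizer lambda * a^T G a (G is symmetric).
            grad[j] += 2.0 * lam * sum(G[j][i] * a[i] for i in range(m))
            # Subgradient of the averaged hinge loss.
            for i in range(m):
                if margins[i] < 1.0:
                    grad[j] -= y[i] * G[i][j] / m
        a = [a[j] - step * grad[j] for j in range(m)]
    return a

# Toy 1-D data with the linear kernel K(x, x') = x * x'.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [-1, -1, 1, 1]
G = [[xi * xj for xj in xs] for xi in xs]
alpha = kernel_soft_svm(G, ys)

# Predictions on the training points: sign(<alpha, G[i]>).
preds = [1 if sum(alpha[j] * G[i][j] for j in range(4)) >= 0 else -1
         for i in range(4)]
assert preds == ys
```

Note the algorithm only ever touches G and y, never the points xs themselves, which is exactly the point of the kernelized form.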
The output of a kernel algorithm
The output of a kernel algorithm is α ∈ Rm instead of w .
How do we use this to predict new labels?
The solution is w = ∑_{i=1}^m α(i)ψ(xi).
For any example x ,
hw(ψ(x)) = sign(〈w, ψ(x)〉) = sign(∑_{j=1}^m α(j)K(xj, x)).
Calculate K(x1, x), . . . , K(xm, x), and use α to get the label hw(ψ(x)).
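A small Python sketch of this prediction rule (the training points, the α values, the kernel choice, and σ are made up; in practice α is the output of a kernel algorithm):

```python
import math

# Predicting a new label from the kernel output alpha:
#   h(x) = sign( sum_j alpha(j) * K(x_j, x) ).
# Training points, alpha, and sigma are made-up illustrative values.
def gaussian_K(x, xp, sigma=1.0):
    return math.exp(-((x - xp) ** 2) / (2.0 * sigma))

train_xs = [-1.0, 2.0]
alpha = [-1.0, 1.0]          # pretend output of a kernel algorithm

def predict(x):
    s = sum(a * gaussian_K(xj, x) for a, xj in zip(alpha, train_xs))
    return 1 if s >= 0 else -1

assert predict(2.1) == 1     # close to the positive support vector
assert predict(-0.9) == -1   # close to the negative one
```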
Calculating the kernel function
Kernel algorithms never need to represent x or w .
But to run them we need to provide K (xi , xj).
Also, to predict using α we need K (xi , x).
A simple option to calculate K : use K (x , x ′) = 〈ψ(x), ψ(x ′)〉.
Problem: ψ could be huge-dimensional, or even infinite-dimensional!
We want to avoid representing ψ(x) directly.
Solution: there are several useful functions K (kernels) that can be calculated without representing ψ.
Polynomial kernels
The k-degree polynomial kernel is K(x, x′) = (1 + 〈x, x′〉)ᵏ.
Claim: This is a kernel function.
In other words: There exists a ψ such that K(x , x ′) = 〈ψ(x), ψ(x ′)〉.
Proof (first part)
An example vector is x = (x(1), . . . , x(d)). Denote x(0) = 1.
Then 1 + 〈x, x′〉 = ∑_{j=0}^d x(j)x′(j).
The polynomial kernel is:
K(x, x′) = (1 + 〈x, x′〉)ᵏ = (∑_{j=0}^d x(j)x′(j))ᵏ.
Expanding the k-th power gives a sum over all k-tuples z ∈ {0, . . . , d}ᵏ:
K(x, x′) = ∑_{z∈{0,...,d}ᵏ} ∏_{i=1}^k x(z(i))·x′(z(i)) = ∑_{z∈{0,...,d}ᵏ} (∏_{i=1}^k x(z(i)))(∏_{i=1}^k x′(z(i))).
Polynomial kernels
Recall the k-degree polynomial kernel K(x, x′) = (1 + 〈x, x′〉)ᵏ.
Claim: there exists a ψ such that K(x, x′) = 〈ψ(x), ψ(x′)〉.
Proof (Second part).
The polynomial kernel is: K(x, x′) = ∑_{z∈{0,...,d}ᵏ} (∏_{i=1}^k x(z(i)))(∏_{i=1}^k x′(z(i))).
Define ψ : Rᵈ → R^{(d+1)ᵏ}:
- Map the k-tuples z ∈ {0, . . . , d}ᵏ to the coordinates of R^{(d+1)ᵏ}.
- Coordinate z of ψ(x) is ∏_{i=1}^k x(z(i)).
Then K(x, x′) = 〈ψ(x), ψ(x′)〉. ∎
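The equality can be checked numerically; a Python sketch (the points and the degree are arbitrary small choices for this check):

```python
from itertools import product

# Numeric check that (1 + <x, x'>)^k == <psi(x), psi(x')>, where psi has one
# coordinate per z in {0,...,d}^k with value prod_i x(z(i)), and x(0) := 1.
def psi(x, k):
    ext = (1.0,) + tuple(x)                  # prepend x(0) = 1
    coords = []
    for z in product(range(len(ext)), repeat=k):
        v = 1.0
        for i in z:
            v *= ext[i]
        coords.append(v)
    return coords

def poly_K(x, xp, k):
    return (1.0 + sum(a * b for a, b in zip(x, xp))) ** k

x, xp, k = (1.0, 2.0), (0.5, -1.0), 3
lhs = poly_K(x, xp, k)
rhs = sum(a * b for a, b in zip(psi(x, k), psi(xp, k)))
assert abs(lhs - rhs) < 1e-9
```

Note that ψ here has (d+1)ᵏ coordinates, while evaluating K directly costs only O(d), which is the point of the kernel trick.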
Polynomial kernels
We showed that the k-degree polynomial kernel K(x, x′) = (1 + 〈x, x′〉)ᵏ is a kernel function.
The coordinates of the feature map are ∏_{i=1}^k x(z(i)) for all z ∈ {0, . . . , d}ᵏ.
Every monomial of degree up to k has a coordinate in the feature map.
For degrees smaller than k, use tuples z with some zero entries (recall x(0) = 1).
This representation causes many duplicate coordinates.
E.g. (1, 2, 1) and (2, 1, 1) give the same value x(1)²·x(2).
The identical coordinates can be “collapsed” to a single coordinate (it willhave a coefficient which is the number of duplicates).
Polynomial Kernels
The polynomial kernel K (x , x ′) = (1 + 〈x , x ′〉)k :
Any multivariate polynomial in x of degree up to kis a linear function in the feature map that matches K .
So, kernel SVM with the polynomial kernel looks for a predictor on X which is defined by a polynomial.
Without the kernel trick, we would need to calculate ψ(x), which costs O(dᵏ).
Using the kernel trick, we only calculate K(x, x′), which costs O(d).
Gaussian kernels
The Gaussian kernel with a parameter σ > 0 is K(x, x′) = e^{−‖x−x′‖²/(2σ)}.
Claim: This is a kernel function.
Proof for X = R (one dimensional examples):
- Define ψ(x) such that the n’th coordinate of ψ(x) is (1/√n!)·e^{−x²/(2σ)}·(x/√σ)ⁿ.
- Allow an infinite dimension: all coordinates n ∈ {0, 1, 2, . . .}.
- A “nice” infinite-dimensional inner-product space is called a Hilbert space.
- Then
〈ψ(x), ψ(x′)〉 = ∑_{n=0}^∞ ((1/√n!)·e^{−x²/(2σ)}·(x/√σ)ⁿ)·((1/√n!)·e^{−x′²/(2σ)}·(x′/√σ)ⁿ)
= e^{−x²/(2σ)}·e^{−x′²/(2σ)}·∑_{n=0}^∞ (xx′)ⁿ/(σⁿn!)
= e^{−(x²+x′²)/(2σ)}·e^{xx′/σ} = e^{−(x−x′)²/(2σ)} = e^{−‖x−x′‖²/(2σ)}.
- If X = Rᵈ: ψ(x) has a coordinate for every sequence z of integers in {1, . . . , d}, of any length n.
- Coordinate z (for z of length n) is (1/√n!)·e^{−‖x‖²/(2σ)}·∏_{i=1}^n x(z(i))/√σ.
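The 1-D identity can be sanity-checked by truncating the infinite sum; a Python sketch (σ, the points, and the truncation level are arbitrary choices):

```python
import math

# Truncated check of the 1-D identity
#   sum_n psi_n(x) * psi_n(x') = e^{-(x - x')^2 / (2*sigma)},
# where psi_n(x) = (1/sqrt(n!)) * e^{-x^2/(2*sigma)} * (x/sqrt(sigma))^n.
# sigma, the points, and the truncation N are arbitrary choices.
def psi_n(x, n, sigma):
    return (math.exp(-x * x / (2.0 * sigma)) *
            (x / math.sqrt(sigma)) ** n / math.sqrt(math.factorial(n)))

def truncated_inner(x, xp, sigma, N=60):
    return sum(psi_n(x, n, sigma) * psi_n(xp, n, sigma) for n in range(N))

sigma, x, xp = 0.7, 1.3, -0.4
exact = math.exp(-((x - xp) ** 2) / (2.0 * sigma))
assert abs(truncated_inner(x, xp, sigma) - exact) < 1e-9
```

The series converges very quickly because of the n! in the denominator, so a modest truncation already matches the closed form to high precision.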
Gaussian kernels
The Gaussian kernel with a parameter σ > 0 is K(x, x′) = e^{−‖x−x′‖²/(2σ)}.
Coordinate z (of length n) is (1/√n!)·e^{−‖x‖²/(2σ)}·∏_{i=1}^n x(z(i))/√σ.
So, every multivariate polynomial (of any degree!) can be used as a linear predictor with this kernel.
What does the predictor look like in the original space?
hw(ψ(x)) = sign(〈w, ψ(x)〉) = sign(∑_{j=1}^m α(j)K(xj, x)) = sign(∑_{j=1}^m α(j)e^{−‖x−xj‖²/(2σ)}).
Predict the label of x based on the distance of x from each of the xj with α(j) ≠ 0.
If x is close enough to xj, the sign of α(j) determines the label of x.
σ ≈ the “radius of influence” of each xj.
The predictor can be illustrated as “circles” around some of the xj’s.
A trade-off in the size of σ:
- Small σ: can have a very fine-tuned boundary =⇒ larger estimation error.
- Large σ: the function is smoother =⇒ larger approximation error.
σ is usually selected using cross-validation.
The Gaussian kernel is also called the Radial Basis Function (RBF) kernel.
Using kernels to encode a hypothesis class
If we have prior knowledge on the problem, we might know which hypotheses make sense.
Use a kernel to convert hypotheses to linear predictors.
Example: Learn to identify executable files that contain a virus
Viruses are identified by a signature: a substring in the executable file.
X is the set of all possible files of length at most n: strings in Σ^{≤n}, where Σ is the alphabet.
We would like to find the most predictive substring in the file:
- For each substring v, hv(x) is positive iff v is a substring of x.
- The hypothesis class: H = {hv | v ∈ Σ^{≤k}}.
How to make H into a set of linear predictors?
Using kernels to encode a hypothesis class
hv(x) is positive iff v is a substring of x; H = {hv | v ∈ Σ∗, |v| ≤ k}.
The feature space is F = Rᵈ:
- A feature v for every possible string with |v| ≤ k.
- ψ(x)(v) = I[v is a substring of x].
- An extra feature 0 that is always 1 (for bias): ψ(x)(0) = 1.
wv ∈ Rd : A vector with wv (v) = 2,wv (0) = −1, other coordinates are zero.
hv (x) = sign(〈wv , ψ(x)〉).
d (the number of features) is exponential in k: d = Θ(|Σ|ᵏ).
But we have
K(x, x′) = 1 + ∑_{v∈Σ∗, |v|≤k} I[v is a substring of x]·I[v is a substring of x′]
= 1 + the number of common substrings of x and x′ of length ≤ k.
K(x, x′) can be calculated in time O(n²k²).
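A direct enumeration sketch in Python (the example strings are made up; this naive version simply intersects the sets of distinct substrings rather than optimizing toward the O(n²k²) bound):

```python
# Substring kernel: K(x, x') = 1 + |{v : |v| <= k, v is a substring of
# both x and x'}|. Naive enumeration; example strings are made up.
def substrings_up_to(x, k):
    # All distinct substrings of x of length 1..k.
    return {x[i:i + l] for i in range(len(x))
            for l in range(1, k + 1) if i + l <= len(x)}

def K(x, xp, k):
    return 1 + len(substrings_up_to(x, k) & substrings_up_to(xp, k))

# "abab" and "ba" share the substrings "a", "b", "ba" (length <= 2).
assert K("abab", "ba", 2) == 4
```

The kernel is computed from the strings alone: no |Σ|ᵏ-dimensional vector ψ(x) is ever materialized.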
Using kernels to encode a hypothesis class
We saw that this kernel can be efficiently calculated.
Computational complexity: Good.
What about the sample complexity?
- The sample complexity of SVM is O(min(d, 1/γ∗²)).
- The dimension is huge: d = Θ(|Σ|ᵏ).
- Let’s check the margin.
- Suppose there is a single substring v that separates the input:
w = (−1, 0, . . . , 0, 2, 0, . . . , 0)
γD(w) := the largest value such that P_{(X,Y)∼D}[(1/RD)·|〈w, ψ(X)〉|/‖w‖ ≥ γD(w)] = 1
RD := the smallest value such that P[‖ψ(X)‖ ≤ RD] = 1; here RD ≤ √(1 + nk).
∀x, |〈w, ψ(x)〉| = 1 and ‖w‖ = √5 =⇒ γ∗ ≥ γD(w) ≥ 1/√(5(1 + nk)).
Conclusion: the sample complexity is O(1/γ∗²) ≤ O(nk).
Note: the kernel-SVM uses a richer class than the original H.
Kernels: Summary
Feature maps allow using SVM to learn non-linear classifiers.
Kernel functions can be used to represent large and even infinite-dimensional feature maps.
The kernel trick: run SVM without representing ψ directly.
If the kernel can be calculated efficiently, this can improve the computational complexity.
The sample complexity will depend mainly on the margin.
There are general-purpose kernels, such as:
- Polynomial kernel
- Gaussian kernel
Special-purpose kernels can encode a specific hypothesis class.
This can work even if the examples are not vectors at all.