Transcript of Lecture 8: Kernels (inabd171/wiki.files/lecture8_handouts.pdf)

Page 1:

Lecture 8: Kernels

Introduction to Learning and Analysis of Big Data

Kontorovich and Sabato (BGU)

Page 2:

Non-linear Classifiers

SVM tries to find a good linear classifier on the training set.

What if such a classifier does not exist?

There might be a good non-linear classifier.

Can we make this problem linearly separable?

Change representations: Instead of x ∈ R, use ψ(x) = (x, x^2) ∈ R^2.

[Figure: the data plotted with axes x and x^2.]

There is a linear separator in this new representation!

Page 3:

Feature maps

ψ(x) = (x, x^2) maps examples from R to R^2.

A linear predictor in new space is a non-linear predictor in original space.

Many different feature mappings are possible.

The feature mapping procedure:

1. Given X and some learning task, choose some feature mapping ψ : X → F.
   - F is the feature space.
   - F is usually R^n for some n. (Later: n can also be infinite!)

2. When given the training set S = {(x1, y1), . . . , (xm, ym)}, create an image of the set in the feature space:

S = {(ψ(x1), y1), . . . , (ψ(xm), ym)}

3. Find a good linear predictor hw : F → Y on S.
   - w is a vector in the feature space R^n.

4. For new examples x ∈ X , predict the label

hw (ψ(x)) = sign(〈w , ψ(x)〉).
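To make the four steps concrete, here is a minimal numpy sketch on the toy one-dimensional example from the previous slide; the data, the perceptron-style learner, and all names are my own illustration (the slides do not fix a specific linear learner):

```python
import numpy as np

# Toy 1-D data: label is +1 iff |x| > 1, which is not linearly separable on the real line.
X = np.array([-3.0, -2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0, 3.0])
y = np.where(np.abs(X) > 1, 1, -1)

# Step 1: feature map psi(x) = (1, x, x^2); the constant feature acts as a bias.
def psi(x):
    return np.array([1.0, x, x ** 2])

# Step 2: image of the training set in the feature space.
Psi = np.array([psi(x) for x in X])

# Step 3: find a linear predictor on the mapped sample (a plain perceptron here).
w = np.zeros(3)
for _ in range(500):
    for phi, label in zip(Psi, y):
        if label * np.dot(w, phi) <= 0:
            w += label * phi

# Step 4: predict new examples with sign(<w, psi(x)>).
preds = np.array([np.sign(np.dot(w, psi(x))) for x in X])
print(np.mean(preds == y))   # 1.0: the mapped sample is linearly separable
```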

Page 4:

Feature maps

The feature mapping induces a new distribution.

- D is the original distribution over X × Y.
- Dψ := the new distribution on F × Y, induced by ψ and D.

For any h, err(h,Dψ) = err(h ◦ ψ,D).

So, a good classifier on Dψ induces a good classifier on D.

Which feature map should we use?

Similar to selecting a hypothesis class:

- Low approximation error: there is a good predictor for D in the mapped feature space.
- Low estimation error: a learning algorithm on Dψ will find a good predictor from the training set.

E.g., if there is a linear classifier with a large margin in the feature space.

Page 5:

General-purpose feature maps

The feature mapping is selected before observing the training sample!

- Otherwise we allow any function =⇒ overfitting.

How to choose a good feature mapping?

- Domain knowledge
- A general-purpose feature mapping

Polynomial feature maps: Allow polynomials as separators.

For x ∈ R (one-dimensional), define ψ : R → R^(k+1):

ψ(x) = (1, x, x^2, . . . , x^k).

For x ∈ R, we can now learn any separator that is a polynomial of degree ≤ k:

sign( ∑_{i=0}^{k} wi·x^i ) = sign(〈w, ψ(x)〉).
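As a small check (my own code, not from the slides), the identity above can be verified numerically for a specific polynomial:

```python
import numpy as np

def psi_poly(x, k):
    """1-D polynomial feature map: psi(x) = (1, x, x^2, ..., x^k)."""
    return np.array([x ** i for i in range(k + 1)])

# The degree-3 polynomial p(x) = 2 - x + 0.5*x^3 as a linear function <w, psi(x)>.
w = np.array([2.0, -1.0, 0.0, 0.5])
x = 1.7
print(np.dot(w, psi_poly(x, 3)))   # evaluates <w, psi(x)>
print(2 - x + 0.5 * x ** 3)        # evaluates p(x) directly -- same value
```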

Page 6:

Polynomial feature maps in high dimensions

For x ∈ R^d: allow multivariate polynomials.

- Multivariate polynomial example:

  3·x(4)^6·x(16)^4 − 1.17·x(3)^19·x(27)^11·x(15) + 4.

- k-degree multivariate polynomial: each monomial's powers sum to at most k.

- Define all integer sequences of length d which sum to at most k:

  I_d^k := {(t1, t2, . . . , td) | ∑_{j=1}^{d} tj ≤ k}

- The set of all multivariate polynomials of degree at most k:

  { ∑_{z ∈ I_d^k} w(z) ∏_{i=1}^{d} x(i)^{z(i)} | w ∈ R^(|I_d^k|) }.

- The w(z) are the coefficients of the polynomial.

Page 7:

Polynomial feature maps in high dimensions

Define a feature mapping that allows all multivariate polynomials up to degree k.

Every coordinate in ψ(x) corresponds to one z ∈ I_d^k.

I_d^k := {(t1, t2, . . . , td) | ∑_{j=1}^{d} tj ≤ k}

Examples:
- d = 2, k = 2: ψ(x) = (1, x(1), x(2), x(1)x(2), x(1)^2, x(2)^2).
- d = 6, k = 7:
  - Coordinate (3, 0, 0, 2, 2, 0) in ψ(x) corresponds to x(1)^3 · x(4)^2 · x(5)^2.
  - Coordinate (0, 0, 5, 0, 0, 0) corresponds to x(3)^5.
  - If w has 7 in coordinate (0, 0, 5, 0, 1, 0) and the rest is zero, the polynomial represented by w is 7·x(3)^5·x(5).
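A short sketch (my own code) that enumerates the index set I_d^k with itertools and builds the corresponding coordinates of ψ(x); note that the code indexes x from 0 while the slides index coordinates from 1:

```python
import itertools
import numpy as np

def index_set(d, k):
    """All length-d tuples of nonnegative integers whose entries sum to at most k (I_d^k)."""
    return [z for z in itertools.product(range(k + 1), repeat=d) if sum(z) <= k]

def psi(x, k):
    """Coordinate z of psi(x) is the monomial prod_i x[i]^z[i]."""
    return {z: np.prod([x[i] ** z[i] for i in range(len(x))]) for z in index_set(len(x), k)}

# d = 2, k = 2: six coordinates, matching (1, x(1), x(2), x(1)x(2), x(1)^2, x(2)^2).
print(sorted(index_set(2, 2)))            # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (2, 0)]
print(psi(np.array([2.0, 3.0]), 2))       # e.g. coordinate (1, 1) holds x(1)*x(2) = 6.0
```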

Page 8:

Learning with feature maps

Feature maps can be in a very high dimension.
- Polynomial map of degree k: if X = R^d, the number of features n in F = R^n needs to be n ≈ d^k.
- E.g. text documents with the bag-of-words representation: d = 100,000 (English words), k = 10, n = enormous.

Two problems:
- High dimension =⇒ need a lot of training examples for ERM.
  - Solution: use a large-margin algorithm such as SVM.
- High dimension =⇒ heavy computation and memory consumption.
  - Even representing examples and w in memory can be impossible!
- Solution for computational issues: kernels.

Page 9:

The Kernel Trick

If ψ maps to a very high dimension, it can be computationally expensive to use the representations ψ(xi) directly.

Define K (x , x ′) := 〈ψ(x), ψ(x ′)〉.

The function K is also called a kernel.

The Kernel Trick: Many learning algorithms for linear separators can be implemented using only the function K(x, x′), and never ψ(x).

Advantage: ψ(x) can be huge-dimensional or even infinite-dimensional, while K(x, x′) might be easy to compute.

Page 10:

The Kernel Trick

How can the algorithm find and output w when the dimension is huge or infinite?

We need an alternative representation for w .

Claim: The hard/soft SVM objectives on ψ can be rewritten as:

Minimize f (〈w , ψ(x1)〉, . . . , 〈w , ψ(xm)〉) + R(‖w‖),

where f : R^m → R is a function, and R : R^+ → R is a monotonic non-decreasing function.

SVM objectives:
- Soft SVM:

  Minimize λ‖w‖^2 + ℓh(w, S)  ≡  Minimize λ‖w‖^2 + (1/m) ∑_{i=1}^{m} max{0, 1 − yi〈w, ψ(xi)〉}.

  Define R(a) := λa^2 and f(a1, . . . , am) := (1/m) ∑_{i=1}^{m} max{0, 1 − yi·ai}.

- Hard SVM: Minimize ‖w‖^2 s.t. ∀i, yi〈w, ψ(xi)〉 ≥ 1.

  Define R(a) := a^2 and

  f(a1, . . . , am) := 0 if ∀i, yi·ai ≥ 1, and ∞ otherwise.

There are other SVM-like algorithms with the same form.

Page 11:

The representer theorem

Many popular objectives on ψ can be rewritten as:

Minimize f(〈w, ψ(x1)〉, . . . , 〈w, ψ(xm)〉) + R(‖w‖),
where f : R^m → R is a function, and R : R^+ → R is a monotonic non-decreasing function.

Theorem (The representer theorem)

If ψ maps the data to R^n for some n, and the objective has the form above, then there is an optimal solution to the objective that can be written as

w = ∑_{i=1}^{m} α(i)ψ(xi), where α = (α(1), . . . , α(m)) ∈ R^m.

This allows kernel-SVM algorithms to represent w using α(1), . . . , α(m) instead of w(1), . . . , w(n).

If n ≫ m, this is a huge improvement!

The xi with α(i) ≠ 0 are the support vectors of w.

Page 12:

The representer theorem

Objective: Minimize f(〈w, ψ(x1)〉, . . . , 〈w, ψ(xm)〉) + R(‖w‖),
where f : R^m → R is a function, and R : R^+ → R is a monotonic non-decreasing function.

Need to show that there is an optimal solution of the form w = ∑_{i=1}^{m} α(i)ψ(xi).

Note: w ∈ R^n, the feature space.

Proof. Let w∗ be the optimal solution to the objective.

w∗ ∈ R^n and so are the ψ(xi), so we can write w∗ = ∑_{i=1}^{m} α(i)ψ(xi) + u, where ∀i ≤ m, 〈u, ψ(xi)〉 = 0 (u is the component of w∗ orthogonal to span{ψ(x1), . . . , ψ(xm)}).

Let w := w∗ − u. Note that w = ∑_{i=1}^{m} α(i)ψ(xi).

Then ‖w∗‖^2 = ‖w‖^2 + ‖u‖^2 + 2〈w, u〉. Since 〈w, u〉 = 〈∑_{i=1}^{m} α(i)ψ(xi), u〉 = 0, we get ‖w∗‖^2 = ‖w‖^2 + ‖u‖^2.

Therefore ‖w‖ ≤ ‖w∗‖, so R(‖w‖) ≤ R(‖w∗‖).

For all i, 〈w, ψ(xi)〉 = 〈w∗ − u, ψ(xi)〉 = 〈w∗, ψ(xi)〉. Therefore, f(〈w, ψ(x1)〉, . . . , 〈w, ψ(xm)〉) = f(〈w∗, ψ(x1)〉, . . . , 〈w∗, ψ(xm)〉).

So w is also an optimal solution, and it has the desired representation.

Page 13:

Rewriting the objective

Objective: Minimize f (〈w , ψ(x1)〉, . . . , 〈w , ψ(xm)〉) + R(‖w‖)

Solution has the form w = ∑_{i=1}^{m} α(i)ψ(xi) for some vector α ∈ R^m.

To avoid representing w directly, rewrite everything using α instead:

〈w, ψ(xi)〉 = 〈∑_{j=1}^{m} α(j)ψ(xj), ψ(xi)〉 = ∑_{j=1}^{m} α(j)〈ψ(xj), ψ(xi)〉

‖w‖^2 = 〈∑_{j=1}^{m} α(j)ψ(xj), ∑_{j=1}^{m} α(j)ψ(xj)〉 = ∑_{i,j=1}^{m} α(i)α(j)〈ψ(xi), ψ(xj)〉

Recall K(x , x ′) := 〈ψ(x), ψ(x ′)〉.

So, we can rewrite the objective as:

Minimize f( ∑_{j=1}^{m} α(j)K(xj, x1), . . . , ∑_{j=1}^{m} α(j)K(xj, xm) ) + R( √( ∑_{i,j=1}^{m} α(i)α(j)K(xi, xj) ) ).

Page 14:

Solving the objective using the kernel function

Objective: Minimize f(〈w, ψ(x1)〉, . . . , 〈w, ψ(xm)〉) + R(‖w‖)

Rewriting the objective using the kernel function:

Minimize f( ∑_{j=1}^{m} α(j)K(xj, x1), . . . , ∑_{j=1}^{m} α(j)K(xj, xm) ) + R( √( ∑_{i,j=1}^{m} α(i)α(j)K(xi, xj) ) ).

To solve this objective, the algorithm needs only the values of K(xi, xj) for all i, j ≤ m.

The algorithm never needs to see the examples xi or to calculate ψ.

The kernel values form an m × m matrix called the Gram matrix.

The Gram matrix is G, where G(i,j) = K(xi, xj).
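A small sketch (my own code, with a degree-2 polynomial kernel chosen arbitrarily) of how the Gram matrix is built and how the two quantities in the rewritten objective are computed from G and α alone:

```python
import numpy as np

def K(x, xp):
    """An example kernel: the degree-2 polynomial kernel."""
    return (1.0 + np.dot(x, xp)) ** 2

# Toy training points and an arbitrary coefficient vector alpha.
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, -1.0]])
m = len(X)
alpha = np.array([0.5, -1.0, 0.25])

# Gram matrix: G[i, j] = K(x_i, x_j).
G = np.array([[K(X[i], X[j]) for j in range(m)] for i in range(m)])

# The quantities the objective needs, written with G and alpha only:
inner_products = G @ alpha           # entry i equals <w, psi(x_i)> = sum_j alpha(j) K(x_j, x_i)
squared_norm = alpha @ G @ alpha     # equals ||w||^2 = sum_{i,j} alpha(i) alpha(j) K(x_i, x_j)
print(inner_products, squared_norm)
```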

Page 15:

Example: Kernel soft-SVM

The soft-SVM objective for the feature mapping ψ:

Minimize λ‖w‖^2 + (1/m) ∑_{i=1}^{m} ξi
s.t. ∀i, yi〈w, ψ(xi)〉 ≥ 1 − ξi, and ξi ≥ 0.

How is it implemented using a kernel?

Recall: 〈w, ψ(xi)〉 = ∑_{j=1}^{m} α(j)K(xj, xi), and ‖w‖^2 = ∑_{i,j=1}^{m} α(i)α(j)K(xi, xj).

Kernel Soft-SVM
input: the training sample Gram matrix G ∈ R^(m×m), the training labels y1, . . . , ym, parameter λ > 0.
output: α ∈ R^m
1: Find α that solves the following problem:

   Minimize λ·α^T G α + (1/m) ∑_{i=1}^{m} ξi
   s.t. ∀i, yi〈α, G[i]〉 ≥ 1 − ξi, and ξi ≥ 0.
   (G[i] is row i of G.)

2: Return α.
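The slides leave the choice of quadratic-programming solver open. As a self-contained illustration only (my own stand-in, not the course's algorithm), here is a rough subgradient-descent sketch of the equivalent hinge-loss form λ·αᵀGα + (1/m) ∑ max{0, 1 − yi〈α, G[i]〉}:

```python
import numpy as np

def kernel_soft_svm(G, y, lam, steps=5000, lr=0.001):
    """Subgradient descent on  lam * a^T G a + (1/m) * sum_i max(0, 1 - y_i * <a, G[i]>)."""
    m = len(y)
    a = np.zeros(m)
    for _ in range(steps):
        margins = y * (G @ a)                        # y_i * <a, G[i]>
        active = margins < 1                         # examples with a positive hinge loss
        grad = 2 * lam * (G @ a) - (y * active) @ G / m
        a -= lr * grad
    return a

# Hypothetical usage, with G an m x m Gram matrix and y labels in {-1, +1}:
# alpha = kernel_soft_svm(G, y, lam=0.1)
```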

Page 16:

The output of a kernel algorithm

The output of a kernel algorithm is α ∈ Rm instead of w .

How do we use this to predict new labels?

The solution is w = ∑_{i=1}^{m} α(i)ψ(xi).

For any example x,

hw(ψ(x)) = sign(〈w, ψ(x)〉) = sign( ∑_{j=1}^{m} α(j)K(xj, x) ).

Calculate K(x1, x), . . . , K(xm, x), and use α to get the label hw(ψ(x)).
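A minimal prediction sketch (my own code), assuming a kernel function K and the α returned by a kernel algorithm:

```python
import numpy as np

def kernel_predict(x_new, X_train, alpha, K):
    """Predict sign( sum_j alpha(j) * K(x_j, x_new) ) without ever forming psi or w."""
    kernel_values = np.array([K(xj, x_new) for xj in X_train])
    return np.sign(np.dot(alpha, kernel_values))
```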

Page 17:

Calculating the kernel function

Kernel algorithms never need to represent x or w .

But to run them we need to provide K (xi , xj).

Also, to predict using α we need K (xi , x).

A simple option to calculate K : use K (x , x ′) = 〈ψ(x), ψ(x ′)〉.

Problem: ψ could be huge dimensional, or even infinite dimensional!

We want to avoid representing ψ(x) directly.

Solution: There are several useful functions K (kernels) which can be calculated without representing ψ.

Page 18:

Polynomial kernels

Polynomial kernels

The k-degree polynomial kernel is

K(x, x′) = (1 + 〈x, x′〉)^k.

Claim: This is a kernel function.

In other words: There exists a ψ such that K(x , x ′) = 〈ψ(x), ψ(x ′)〉.

Proof (first part)

An example vector is x = (x(1), . . . , x(d)). Denote x(0) = 1.

Then 1 + 〈x, x′〉 = ∑_{j=0}^{d} x(j)x′(j).

The polynomial kernel is:

K(x, x′) = (1 + 〈x, x′〉)^k = ( ∑_{j=0}^{d} x(j)x′(j) )^k

Expanding the k-th power gives a sum over k-tuples of indices; the set of possible k-tuples is {0, . . . , d}^k.

K(x, x′) = ∑_{z∈{0,...,d}^k} ∏_{i=1}^{k} x(z(i))·x′(z(i)) = ∑_{z∈{0,...,d}^k} ( ∏_{i=1}^{k} x(z(i)) ) ( ∏_{i=1}^{k} x′(z(i)) ).

Page 19:

Polynomial kernels

Polynomial kernels

The k-degree polynomial kernel is

K(x, x′) = (1 + 〈x, x′〉)^k.

Claim: This is a kernel function.
In other words: There exists a ψ such that K(x, x′) = 〈ψ(x), ψ(x′)〉.

Proof (Second part).

The polynomial kernel is: K(x, x′) = ∑_{z∈{0,...,d}^k} ( ∏_{i=1}^{k} x(z(i)) ) ( ∏_{i=1}^{k} x′(z(i)) ).

Define ψ : R^d → R^((d+1)^k):

- Map the k-tuples z ∈ {0, . . . , d}^k to coordinates in R^((d+1)^k).
- Coordinate z of ψ(x) is ∏_{i=1}^{k} x(z(i)).

Then K(x, x′) = 〈ψ(x), ψ(x′)〉.
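A quick numerical check of the construction (my own code) for small d and k: the explicit tuple-indexed feature map gives exactly the kernel value.

```python
import itertools
import numpy as np

def psi_tuples(x, k):
    """Coordinate z in {0,...,d}^k of psi(x) is prod_{i=1}^k x(z(i)), with x(0) = 1."""
    x0 = np.concatenate(([1.0], x))       # prepend the constant coordinate x(0) = 1
    d = len(x)
    return np.array([np.prod([x0[zi] for zi in z])
                     for z in itertools.product(range(d + 1), repeat=k)])

x, xp, k = np.array([1.0, 2.0]), np.array([-0.5, 3.0]), 3
print(np.dot(psi_tuples(x, k), psi_tuples(xp, k)))   # <psi(x), psi(x')> in R^((d+1)^k)
print((1.0 + np.dot(x, xp)) ** k)                    # (1 + <x, x'>)^k  -- same value
```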

Page 20:

Polynomial kernels

Polynomial kernels

The k-degree polynomial kernel is

K(x, x′) = (1 + 〈x, x′〉)^k.

We showed that this is a kernel function.

The coordinates in the feature map are ∏_{i=1}^{k} x(z(i)) for all z ∈ {0, . . . , d}^k.

Every monomial of degree up to k has a coordinate in the feature map.

For degrees smaller than k , use z with several 0’s.

This representation causes many duplicate coordinates.

E.g. (1, 2, 1) and (2, 1, 1) give the same value x(1)^2 · x(2).

The identical coordinates can be “collapsed” to a single coordinate (it will have a coefficient which is the number of duplicates).

Page 21:

Polynomial Kernels

The polynomial kernel K(x, x′) = (1 + 〈x, x′〉)^k:

Any multivariate polynomial in x of degree up to k is a linear function in the feature map that matches K.

So, kernel SVM with the polynomial kernel looks for a predictor on X which is defined by a polynomial.

Without the kernel trick, we would need to calculate ψ, which would cost O(d^k).

Using the kernel trick, only calculate K (x , x ′), which costs O(d).

Page 22:

Gaussian kernels

Gaussian kernels

The Gaussian kernel with a parameter σ > 0 is K(x, x′) = e^(−‖x−x′‖^2 / (2σ)).

Claim: This is a kernel function.

Proof for X = R (one dimensional examples):

- Define ψ(x) such that the n-th coordinate of ψ(x) is: (1/√(n!))·e^(−x^2/(2σ))·(x/√σ)^n.
- Allow an infinite dimension: all coordinates in {0, 1, 2, . . .}.
- A “nice” infinite-dimensional inner-product space is called a Hilbert space.
- Then

  〈ψ(x), ψ(x′)〉 = ∑_{n=0}^{∞} ( (1/√(n!))·e^(−x^2/(2σ))·(x/√σ)^n ) · ( (1/√(n!))·e^(−x′^2/(2σ))·(x′/√σ)^n )

  = e^(−x^2/(2σ))·e^(−x′^2/(2σ)) · ∑_{n=0}^{∞} (x·x′)^n / (σ^n·n!)

  = e^(−(x^2+x′^2)/(2σ)) · e^(x·x′/σ) = e^(−(x−x′)^2/(2σ)) = e^(−‖x−x′‖^2/(2σ)).

- If X = R^d: ψ(x) has a coordinate for every sequence z of integers in {1, . . . , d} of any length n.
- The coordinate z (for z of length n) is (1/√(n!))·e^(−‖x‖^2/(2σ))·∏_{i=1}^{n} x(z(i))/√σ.
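The infinite-dimensional ψ cannot be stored, but the series converges quickly, so truncating it gives a numerical sanity check of the 1-D calculation above (my own code):

```python
import numpy as np
from math import exp, factorial, sqrt

def psi_gauss_truncated(x, sigma, N=30):
    """First N coordinates of the feature map for the 1-D Gaussian kernel."""
    return np.array([exp(-x ** 2 / (2 * sigma)) * (x / sqrt(sigma)) ** n / sqrt(factorial(n))
                     for n in range(N)])

x, xp, sigma = 0.8, -0.3, 0.5
print(np.dot(psi_gauss_truncated(x, sigma), psi_gauss_truncated(xp, sigma)))  # truncated <psi(x), psi(x')>
print(exp(-(x - xp) ** 2 / (2 * sigma)))                                      # e^(-(x-x')^2 / (2*sigma))
```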

Page 23:

Gaussian kernels

Gaussian kernels

The Gaussian kernel with a parameter σ > 0 is K(x, x′) = e^(−‖x−x′‖^2 / (2σ)).

The coordinate z (of length n) is (1/√(n!))·e^(−‖x‖^2/(2σ))·∏_{i=1}^{n} x(z(i))/√σ.

So, every multivariate polynomial (of any degree!) can be used as a linear predictor with this kernel.

What does the predictor look like in the original space?

hw(ψ(x)) = sign(〈w, ψ(x)〉) = sign( ∑_{j=1}^{m} α(j)K(xj, x) ) = sign( ∑_{j=1}^{m} α(j)·e^(−‖x−xj‖^2/(2σ)) ).

Predict the label of x based on the distance of x from each of the xj with α(j) ≠ 0.
If x is close enough to xj, the sign of α(j) determines the label of x.
σ ≈ “radius of influence” of each xj.
Can illustrate the predictor as “circles” around some of the xj’s.

A trade-off in the size of σ:
- Small σ: can have a very fine-tuned boundary, but larger estimation error.
- Large σ: the function is smoother, but larger approximation error.

σ is usually selected using cross-validation.
The Gaussian kernel is also called the Radial Basis Function (RBF) kernel.
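A tiny illustration (my own toy numbers) of σ as a radius of influence: the decision score ∑_j α(j)·e^(−‖x−xj‖^2/(2σ)) at a fixed point, for two support vectors with opposite signs and several values of σ:

```python
import numpy as np

def gaussian_K(x, xp, sigma):
    return np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * sigma))

# Two "support vectors" with opposite coefficients, and a test point at distance 1 and 3.
support = [np.array([0.0]), np.array([4.0])]
alpha = [+1.0, -1.0]
x = np.array([1.0])

for sigma in (0.1, 1.0, 100.0):
    score = sum(a * gaussian_K(xj, x, sigma) for a, xj in zip(alpha, support))
    # Tiny sigma: x is outside both radii of influence (score near 0).
    # Moderate sigma: the nearby positive support vector dominates.
    # Huge sigma: both support vectors reach x and nearly cancel.
    print(sigma, score)
```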

Page 24:

Using kernels to encode a hypothesis class

If we have prior knowledge on the problem, we might know which hypotheses make sense.

Use a kernel to convert hypotheses to linear predictors.

Example: Learn to identify executable files that contain a virus

Viruses are identified by a signature: a substring in the executable file.

X is the set of all possible files of length at most n: strings in Σ^(≤n), where Σ is the alphabet.

We would like to find the most predictive substring in the file:

- For each substring v, hv(x) is positive iff v is a substring of x.
- The hypothesis class: H = {hv | v ∈ Σ^(≤k)}.

How to make H into a set of linear predictors?

Page 25:

Using kernels to encode a hypothesis class

hv(x) is positive iff v is a substring of x,  H = {hv | v ∈ Σ*, |v| ≤ k}.

The feature space F = R^d:

- A feature v for every possible string with |v| ≤ k.
- ψ(x)(v) = I[v is a substring of x].
- An extra feature 0 that is always 1 (for bias): ψ(x)(0) = 1.

wv ∈ R^d: a vector with wv(v) = 2, wv(0) = −1, and all other coordinates zero.

hv(x) = sign(〈wv, ψ(x)〉).

d (the number of features) is exponential in k: d = Θ(|Σ|^k).

But we have

K(x, x′) = 1 + ∑_{v∈Σ*, |v|≤k} I[v is a substring of x] · I[v is a substring of x′]

         = 1 + the number of common substrings of x and x′ of length ≤ k.

K(x, x′) can be calculated in time O(n^2·k^2).
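A direct sketch of this kernel (my own code, written for clarity rather than for the stated O(n^2·k^2) bound):

```python
def substrings_up_to_k(s, k):
    """All distinct substrings of s of length between 1 and k."""
    return {s[i:i + length] for length in range(1, k + 1) for i in range(len(s) - length + 1)}

def substring_kernel(x, xp, k):
    """K(x, x') = 1 + number of common substrings of x and x' of length <= k."""
    return 1 + len(substrings_up_to_k(x, k) & substrings_up_to_k(xp, k))

# Common substrings of length <= 3: a, b, c, ab, ca, cab  ->  K = 7.
print(substring_kernel("abcab", "cabd", 3))
```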

Page 26:

Using kernels to encode a hypothesis class

We saw that this kernel can be efficiently calculated.

Computational complexity: Good.

What about the sample complexity?
- The sample complexity of SVM is O(min(d, 1/γ∗^2)).
- The dimension is huge: d = Θ(|Σ|^k).
- Let’s check the margin.
- Suppose there is a single substring v that separates the input.

w = (−1, 0, . . . , 0, 2, 0, . . . , 0)

γD(w) := “the largest value such that P_{(X,Y)∼D}[ (1/RD)·|〈w, ψ(X)〉|/‖w‖ ≥ γD(w) ] = 1”

RD := “the smallest value such that P[‖ψ(X)‖ ≤ RD] = 1”  ≤  √(1 + nk).

∀x, |〈w, ψ(x)〉| = 1 and ‖w‖ = √5  =⇒  γ∗ ≥ γD(w) ≥ 1/√(5nk).

Conclusion: The sample complexity is O(1/γ∗^2) ≤ O(nk).

Note: the kernel-SVM uses a richer class than the original H.

Page 27:

Kernels: Summary

Feature maps allow using SVM to learn non-linear classifiers.

Kernel functions can be used to represent large and infinite-dimensional feature maps.

The kernel trick: run SVM without representing ψ directly.

If the kernel can be calculated efficiently, this can improve computational complexity.

The sample complexity will depend mainly on the margin.

There are general-purpose kernels, such as:
- Polynomial kernel
- Gaussian kernel

Special-purpose kernels can encode a specific hypothesis class.

This can work even if the examples are not vectors at all.
