CS 337: Artificial Intelligence & Machine Learning ...


Transcript of CS 337: Artificial Intelligence & Machine Learning ...

Page 1: CS 337: Artificial Intelligence & Machine Learning ...

CS 337: Artificial Intelligence & Machine Learning
Instructor: Prof. Ganesh Ramakrishnan

Lecture 12: Kernels: Perceptron, Logistic Regression and Ridge Regression

August 2019

Page 2: CS 337: Artificial Intelligence & Machine Learning ...

Kernel Perceptron
https://youtu.be/ql_GMONpmHM?list=PLyo3HAXSZD3zfv9O-y9DJhvrWQPscqATa

Page 3: CS 337: Artificial Intelligence & Machine Learning ...

Recap: Perceptron Update Rule

Perceptron works for two classes (y = ±1).

A point is misclassified if y w^T φ(x) < 0.

Perceptron Algorithm:

INITIALIZE: w = ones()
REPEAT: for each <x, y>
    if y w^T φ(x) < 0 then
        w = w + η y φ(x)
    endif
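A minimal sketch of this update rule in Python/NumPy (an illustration, not the course's reference code; it assumes φ(x) = x, labels y ∈ {−1, +1}, and rows of X as examples):

import numpy as np

def perceptron_train(X, y, eta=1.0, epochs=100):
    """Primal perceptron update: w = w + eta * y * phi(x) on mistakes."""
    m, n = X.shape
    w = np.ones(n)                           # INITIALIZE: w = ones()
    for _ in range(epochs):                  # REPEAT
        for i in range(m):                   # for each <x, y>
            if y[i] * np.dot(w, X[i]) < 0:   # misclassified point
                w = w + eta * y[i] * X[i]    # w = w + eta * y * phi(x)
    return w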


Page 10: CS 337: Artificial Intelligence & Machine Learning ...

Kernel (Non-linear) perceptron?

Kernelized perceptron¹: f(x) = sign( Σ_i α_i y_i K(x, x_i) + b )

INITIALIZE: α = zeros()
REPEAT: for each <x_i, y_i>
    if sign( Σ_j α_j y_j K(x_i, x_j) + b ) ≠ y_i then
        α_i = α_i + 1
    endif

Convergence is a matter of Tutorial 4 & 5

¹The first kernel classification learner was invented in 1964

This could help you avoid separately storing all the … in the solution to Problem 2 of Lab 3
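A minimal sketch of the kernelized update in Python/NumPy (an illustration under simplifying assumptions: the bias b is kept at 0, K is any kernel function such as those on the following slides, and the function names are made up for this note):

import numpy as np

def kernel_perceptron_train(X, y, K, epochs=100):
    """Dual perceptron: alpha[i] counts the updates made on example i (b fixed at 0)."""
    m = X.shape[0]
    alpha = np.zeros(m)                                   # INITIALIZE: alpha = zeros()
    G = np.array([[K(X[i], X[j]) for j in range(m)] for i in range(m)])  # Gram matrix
    for _ in range(epochs):                               # REPEAT
        for i in range(m):                                # for each <x_i, y_i>
            if np.sign(np.sum(alpha * y * G[i])) != y[i]: # misclassified under current alpha
                alpha[i] += 1                             # alpha_i = alpha_i + 1
    return alpha

def kernel_perceptron_predict(x, X, y, alpha, K):
    """f(x) = sign( sum_i alpha_i y_i K(x, x_i) )."""
    return np.sign(sum(alpha[i] * y[i] * K(x, X[i]) for i in range(len(y))))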

Page 11: CS 337: Artificial Intelligence & Machine Learning ...

An Example Kernel

We illustrate going from φ(x) to K (x, y) and back...

For example, for a 2-dimensional xi :

φ(x_i) = [ 1, √2·x_i1, √2·x_i2, √2·x_i1·x_i2, x_i1², x_i2² ]^T

φ(x_i) exists in a 6-dimensional space

But, to compute K(x_1, x_2), all we need is x_1^T x_2, without having to enumerate φ(x_i)
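For this φ, the inner product ⟨φ(x_1), φ(x_2)⟩ works out to (1 + x_1^T x_2)², i.e. the degree-2 polynomial kernel. A quick numeric sanity check (a sketch assuming NumPy; the vectors are arbitrary):

import numpy as np

def phi(x):
    """Explicit 6-dimensional feature map for a 2-dimensional x."""
    return np.array([1.0,
                     np.sqrt(2) * x[0],
                     np.sqrt(2) * x[1],
                     np.sqrt(2) * x[0] * x[1],
                     x[0] ** 2,
                     x[1] ** 2])

x1 = np.array([0.3, -1.2])
x2 = np.array([2.0, 0.5])

lhs = phi(x1) @ phi(x2)       # explicit inner product in the 6-dimensional space
rhs = (1 + x1 @ x2) ** 2      # kernel evaluation using only x1^T x2
print(np.isclose(lhs, rhs))   # True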

Page 12: CS 337: Artificial Intelligence & Machine Learning ...

Example Kernels

Linear kernel: K(x, y) = x^T y

Polynomial kernel: K(x, y) = (1 + x^T y)^d

RBF kernel: K(x, y) = exp( −||x − y||² / (2σ²) )
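The same three kernels written as small functions (a sketch assuming NumPy; d and sigma are illustrative defaults):

import numpy as np

def linear_kernel(x, y):
    return x @ y                                               # K(x, y) = x^T y

def polynomial_kernel(x, y, d=2):
    return (1 + x @ y) ** d                                    # K(x, y) = (1 + x^T y)^d

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))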

Page 13: CS 337: Artificial Intelligence & Machine Learning ...

Example: Classification using Kernels

Source: https://www.slideshare.net/ANITALOKITA/winnow-vs-perceptron

Page 14: CS 337: Artificial Intelligence & Machine Learning ...

More on the Kernel Trick

Kernels operate in a high-dimensional, implicit feature space without necessarily computing the coordinates of the data in that space, but rather by simply computing the kernel function

This operation is often computationally cheaper than the explicit computation of the coordinates


Page 16: CS 337: Artificial Intelligence & Machine Learning ...

The Gram (Kernel) Matrix

For any dataset {x_1, x_2, . . . , x_m} and for any m, the Gram matrix K is defined as

K =
[ K(x_1, x_1)   ...   K(x_1, x_m)
      ...     K(x_i, x_j)   ...
  K(x_m, x_1)   ...   K(x_m, x_m) ]

Claim: If K_ij = K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ are the entries of an m × m Gram matrix K, then K must be symmetric and ...

K must be positive semi-definite

Proof: For any b ∈ ℝ^m,

b^T K b = Σ_{i,j} b_i K_ij b_j = Σ_{i,j} b_i b_j ⟨φ(x_i), φ(x_j)⟩ = ⟨ Σ_i b_i φ(x_i), Σ_j b_j φ(x_j) ⟩ = || Σ_i b_i φ(x_i) ||₂² ≥ 0
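The claim is easy to check numerically on any dataset: build the Gram matrix for a kernel and confirm it is symmetric with non-negative eigenvalues (a sketch assuming NumPy; the random data and the RBF kernel are only for illustration):

import numpy as np

def gram_matrix(X, K):
    """Gram matrix with entries G[i, j] = K(x_i, x_j)."""
    m = X.shape[0]
    return np.array([[K(X[i], X[j]) for j in range(m)] for i in range(m)])

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = lambda x, y: np.exp(-np.sum((x - y) ** 2) / 2.0)    # RBF kernel with sigma = 1

G = gram_matrix(X, K)
print(np.allclose(G, G.T))                       # symmetric
print(np.min(np.linalg.eigvalsh(G)) >= -1e-10)   # all eigenvalues >= 0 (up to round-off)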

Page 17: CS 337: Artificial Intelligence & Machine Learning ...

Basis expansion φ for symmetric K?

Positive-definite kernel: For any dataset {x_1, x_2, . . . , x_m} and for any m, the Gram matrix K must be positive semi-definite, so that K = UΣU^T = (UΣ^(1/2))(UΣ^(1/2))^T = RR^T, where the rows of U are linearly independent and Σ is a positive diagonal matrix

We will illustrate through an example..
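The example can be carried out numerically: eigendecompose a Gram matrix and read off R = UΣ^(1/2), whose i-th row then behaves as a (data-dependent) feature vector for x_i (a sketch assuming NumPy; the polynomial kernel and data are only for illustration):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))
K = lambda x, y: (1 + x @ y) ** 2                        # degree-2 polynomial kernel
G = np.array([[K(xi, xj) for xj in X] for xi in X])      # Gram matrix

eigvals, U = np.linalg.eigh(G)                           # G = U Sigma U^T
eigvals = np.clip(eigvals, 0.0, None)                    # clip tiny negative round-off
R = U @ np.diag(np.sqrt(eigvals))                        # R = U Sigma^(1/2)

print(np.allclose(R @ R.T, G))                           # K = R R^T
# <R[i], R[j]> = K(x_i, x_j), so rows of R act like basis-expanded points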

Page 21: CS 337: Artificial Intelligence & Machine Learning ...

From matrix decomposition to function decomposition: Mercer Kernel K

Positive-definite kernel: For any dataset {x_1, x_2, . . . , x_m} and for any m, the Gram matrix K must be positive semi-definite, so that K = UΣU^T = (UΣ^(1/2))(UΣ^(1/2))^T = RR^T, where the rows of U are linearly independent and Σ is a positive diagonal matrix

Mercer kernel: Extending to eigenfunction decomposition²:

K(x_1, x_2) = Σ_{j=1}^{∞} α_j φ_j(x_1) φ_j(x_2), where α_j ≥ 0 and Σ_{j=1}^{∞} α_j² < ∞

Mercer kernel and Positive-definite kernel turn out to be equivalent if the input space {x} is compact³

² Eigen-decomposition wrt linear operators.

³ Equivalent of closed and bounded.

Page 22: CS 337: Artificial Intelligence & Machine Learning ...

Mercer and Positive Definite Kernels

Mercer's Theorem is NOT INCLUDED IN THE MID-SEMESTER EXAM SYLLABUS


Page 24: CS 337: Artificial Intelligence & Machine Learning ...

Mercer’s theorem (Not for Midsem)

Mercer kernel: K(x_1, x_2) is a Mercer kernel if ∫_{x_1} ∫_{x_2} K(x_1, x_2) g(x_1) g(x_2) dx_1 dx_2 ≥ 0 for all square integrable functions g(x)

(g(x) is square integrable iff ∫ (g(x))² dx is finite)

Mercer's theorem: For any Mercer kernel K(x_1, x_2), ∃ φ(x) : ℝ^n → H, s.t. K(x_1, x_2) = φ^T(x_1) φ(x_2)

where H is a Hilbert space, the infinite-dimensional version of the Euclidean space (ℝ^n, ⟨·, ·⟩), where ⟨·, ·⟩ is the standard dot product in ℝ^n

Advanced: Formally, a Hilbert space is an inner product space with an associated norm, in which every Cauchy sequence is convergent


Page 26: CS 337: Artificial Intelligence & Machine Learning ...

Prove that (x_1^T x_2)^d is a Mercer kernel (d ∈ ℤ+, d ≥ 1)

We want to prove that, for all square integrable functions g(x),

∫_{x_1} ∫_{x_2} (x_1^T x_2)^d g(x_1) g(x_2) dx_1 dx_2 ≥ 0

Here, x_1 and x_2 are vectors s.t. x_1, x_2 ∈ ℝ^t

Thus, expanding (x_1^T x_2)^d by the multinomial theorem,

∫_{x_1} ∫_{x_2} (x_1^T x_2)^d g(x_1) g(x_2) dx_1 dx_2

= ∫_{x_11} .. ∫_{x_1t} ∫_{x_21} .. ∫_{x_2t} [ Σ_{n_1..n_t} (d! / (n_1! .. n_t!)) Π_{j=1}^{t} (x_1j x_2j)^{n_j} ] g(x_1) g(x_2) dx_11 .. dx_1t dx_21 .. dx_2t

s.t. Σ_{i=1}^{t} n_i = d

By Fubini's theorem, under finiteness conditions, the integral of a sum is the sum of the integrals


Page 28: CS 337: Artificial Intelligence & Machine Learning ...

Prove that (x_1^T x_2)^d is a Mercer kernel (d ∈ ℤ+, d ≥ 1)

= Σ_{n_1...n_t} (d! / (n_1! . . . n_t!)) ∫_{x_1} ∫_{x_2} Π_{j=1}^{t} (x_1j x_2j)^{n_j} g(x_1) g(x_2) dx_1 dx_2

= Σ_{n_1...n_t} (d! / (n_1! . . . n_t!)) ∫_{x_1} ∫_{x_2} (x_11^{n_1} x_12^{n_2} . . . x_1t^{n_t}) g(x_1) (x_21^{n_1} x_22^{n_2} . . . x_2t^{n_t}) g(x_2) dx_1 dx_2

= Σ_{n_1...n_t} (d! / (n_1! . . . n_t!)) [ ∫_{x_1} (x_11^{n_1} . . . x_1t^{n_t}) g(x_1) dx_1 ] [ ∫_{x_2} (x_21^{n_1} . . . x_2t^{n_t}) g(x_2) dx_2 ]

(integral of a decomposable product as product of integrals)

s.t. Σ_i n_i = d

Page 29: CS 337: Artificial Intelligence & Machine Learning ...

Prove that (x_1^T x_2)^d is a Mercer kernel (d ∈ ℤ+, d ≥ 1)

Realize that both the integrals are basically the same, with different variable names

Thus, the quadratic (positive-definiteness) expression becomes:

Σ_{n_1...n_t} (d! / (n_1! . . . n_t!)) ( ∫_{x_1} (x_11^{n_1} . . . x_1t^{n_t}) g(x_1) dx_1 )² ≥ 0

(the square is non-negative for reals)

Thus, we have shown that (x_1^T x_2)^d is a Mercer kernel.


Page 32: CS 337: Artificial Intelligence & Machine Learning ...

What about Σ_{d=1}^{r} α_d (x_1^T x_2)^d s.t. α_d ≥ 0?

K(x_1, x_2) = Σ_{d=1}^{r} α_d (x_1^T x_2)^d

Is ∫_{x_1} ∫_{x_2} [ Σ_{d=1}^{r} α_d (x_1^T x_2)^d ] g(x_1) g(x_2) dx_1 dx_2 ≥ 0?

We have

∫_{x_1} ∫_{x_2} [ Σ_{d=1}^{r} α_d (x_1^T x_2)^d ] g(x_1) g(x_2) dx_1 dx_2 = Σ_{d=1}^{r} α_d ∫_{x_1} ∫_{x_2} (x_1^T x_2)^d g(x_1) g(x_2) dx_1 dx_2

Page 33: CS 337: Artificial Intelligence & Machine Learning ...

What about Σ_{d=1}^{r} α_d (x_1^T x_2)^d s.t. α_d ≥ 0?

Since α_d ≥ 0, ∀d, and since we have already proved that ∫_{x_1} ∫_{x_2} (x_1^T x_2)^d g(x_1) g(x_2) dx_1 dx_2 ≥ 0,

we must have

Σ_{d=1}^{r} α_d ∫_{x_1} ∫_{x_2} (x_1^T x_2)^d g(x_1) g(x_2) dx_1 dx_2 ≥ 0

By which, K(x_1, x_2) = Σ_{d=1}^{r} α_d (x_1^T x_2)^d is a Mercer kernel.

Examples of Mercer Kernels: Linear Kernel, Polynomial Kernel, Radial Basis Function Kernel


Page 35: CS 337: Artificial Intelligence & Machine Learning ...

Closure properties of Kernels (Part of Midsems)

Let K_1(x_1, x_2) and K_2(x_1, x_2) be positive definite (Mercer) kernels. Then the following are also kernels.

α_1 K_1(x_1, x_2) + α_2 K_2(x_1, x_2) for α_1, α_2 ≥ 0. Proof:

K_1(x_1, x_2) K_2(x_1, x_2). Proof:

For simplicity, we assume that the feature space is finite dimensional
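As a numeric illustration (not a proof) of both closure properties: the Gram matrix of α_1 K_1 + α_2 K_2 is the same non-negative combination of the two Gram matrices, and the Gram matrix of K_1·K_2 is their elementwise (Hadamard) product; both remain positive semi-definite (a sketch assuming NumPy, with arbitrary data and α_1, α_2):

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(15, 3))

K1 = lambda x, y: x @ y                                   # linear kernel
K2 = lambda x, y: np.exp(-np.sum((x - y) ** 2) / 2.0)     # RBF kernel

G1 = np.array([[K1(a, b) for b in X] for a in X])
G2 = np.array([[K2(a, b) for b in X] for a in X])

min_eig = lambda G: np.min(np.linalg.eigvalsh(G))
print(min_eig(2.0 * G1 + 0.5 * G2) >= -1e-10)   # alpha_1 K_1 + alpha_2 K_2, alphas >= 0
print(min_eig(G1 * G2) >= -1e-10)               # product kernel = elementwise product of Gram matrices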


Page 39: CS 337: Artificial Intelligence & Machine Learning ...

Are the following Mercer Kernels? (Part of Midsems)

Linear kernel: K(x, y) = x^T y

Polynomial kernel: K(x, y) = (1 + x^T y)^d

Exponential kernel: K(x, y) = exp(⟨x, y⟩)

RBF kernel: K(x, y) = exp( −||x − y||² / (2σ²) ) = k(x − y) = k_y(x)

The function k_y(x) is also called a smoothing kernel (as we will see soon).

Page 40: CS 337: Artificial Intelligence & Machine Learning ...

Some more Tutorial 4 + 5 Questions

Informally show that the Kernelized Logistic Regression form is equivalent to the original Logistic Regression when the regularized cross entropy is minimized

Show that Ridge Regression has an equivalent Kernelized form

Page 41: CS 337: Artificial Intelligence & Machine Learning ...

Kernelized Logistic Regression

(Part of Lab 3, Tutorial 4 + 5 and Midsems)
https://youtu.be/4PtaZVUbilI?list=PLyo3HAXSZD3zfv9O-y9DJhvrWQPscqATa&t=316

Page 42: CS 337: Artificial Intelligence & Machine Learning ...

Logistic Regression Kernelized

1 We have already seen (a) Cross Entropy loss and (b) Bayesian interpretation for regularization

2 The Regularized (Logistic) Cross-Entropy Loss function (minimized wrt w ∈ ℝ^p):

E(w) = −(1/m) Σ_{i=1}^{m} [ y^(i) log σ_w(x^(i)) + (1 − y^(i)) log(1 − σ_w(x^(i))) ] + (λ/2m) ||w||₂²   (1)

3 Equivalent dual kernelized objective?

Page 43: CS 337: Artificial Intelligence & Machine Learning ...

Logistic Regression Kernelized

2 The Regularized (Logistic) Cross-Entropy Loss function (minimized wrt w ∈ ℝ^p):

E(w) = −(1/m) Σ_{i=1}^{m} [ y^(i) log σ_w(x^(i)) + (1 − y^(i)) log(1 − σ_w(x^(i))) ] + (λ/2m) ||w||₂²   (2)

3 By substituting σ_w(x) = 1 / (1 + e^{−w^T φ(x)}) = e^{w^T φ(x)} / (1 + e^{w^T φ(x)}), we simplify (2) to...

E(w) = −(1/m) Σ_{i=1}^{m} [ y^(i) w^T φ(x^(i)) − log(1 + exp(w^T φ(x^(i)))) ] + (λ/2m) ||w||₂²   (3)
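The step from (2) to (3) can be sanity-checked numerically on a single term of the sum (a sketch assuming NumPy, with φ(x) = x and an arbitrary w, x and label y ∈ {0, 1}):

import numpy as np

rng = np.random.default_rng(3)
w, x = rng.normal(size=4), rng.normal(size=4)
y = 1.0                                                    # a label in {0, 1}

z = w @ x                                                  # w^T phi(x), with phi(x) = x
sigma = 1.0 / (1.0 + np.exp(-z))                           # sigma_w(x)

term2 = y * np.log(sigma) + (1 - y) * np.log(1 - sigma)    # term inside the sum in (2)
term3 = y * z - np.log(1 + np.exp(z))                      # term inside the sum in (3)
print(np.isclose(term2, term3))                            # True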


Page 45: CS 337: Artificial Intelligence & Machine Learning ...

Logistic Regression Kernelized

2 The rewritten (Logistic) Cross-Entropy Loss function (minimized wrt w ∈ ℝ^p):

E(w) = −(1/m) Σ_{i=1}^{m} [ y^(i) w^T φ(x^(i)) − log(1 + exp(w^T φ(x^(i)))) ] + (λ/2m) ||w||₂²   (4)

3 Equivalent dual kernelized objective⁴ (minimized wrt α ∈ ℝ^m):

E_D(α) = − Σ_{i=1}^{m} [ Σ_{j=1}^{m} ( y^(i) K(x^(i), x^(j)) α_j − (λ/2) α_i K(x^(i), x^(j)) α_j ) − log( 1 + exp( Σ_{j=1}^{m} α_j K(x^(i), x^(j)) ) ) ]   (5)

Decision function: σ_w(x) = 1 / ( 1 + exp( − Σ_{j=1}^{m} α_j K(x, x^(j)) ) )

⁴ Representer Theorem and http://perso.telecom-paristech.fr/~clemenco/Projets_ENPC_files/kernel-log-regression-svm-boosting.pdf
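A minimal gradient-descent sketch of the dual form, assuming NumPy: following the Representer Theorem parameterization w = Σ_j α_j φ(x^(j)), the objective can be written purely in terms of the Gram matrix G with G_ij = K(x^(i), x^(j)). The learning rate, iteration count and function names below are illustrative, not from the course material:

import numpy as np

def kernel_logistic_fit(G, y, lam=0.1, lr=0.1, iters=2000):
    """Gradient descent on alpha for the kernelized regularized cross-entropy (y in {0, 1})."""
    m = G.shape[0]
    alpha = np.zeros(m)
    for _ in range(iters):
        z = G @ alpha                              # z_i = sum_j alpha_j K(x_i, x_j)
        p = 1.0 / (1.0 + np.exp(-z))               # predicted probabilities
        # gradient of -(1/m) sum_i [y_i z_i - log(1 + e^{z_i})] + (lambda/2m) alpha^T G alpha
        grad = (G @ (p - y)) / m + (lam / m) * (G @ alpha)
        alpha -= lr * grad
    return alpha

def kernel_logistic_predict(k_x, alpha):
    """Decision function with k_x = [K(x, x_1), ..., K(x, x_m)]."""
    return 1.0 / (1.0 + np.exp(-(k_x @ alpha)))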

Page 46: CS 337: Artificial Intelligence & Machine Learning ...

CS 337: Artificial Intelligence & Machine Learning
The Kernel Trick: Illustrations on Ridge Regression

Page 47: CS 337: Artificial Intelligence & Machine Learning ...

Recall: Ridge Regression
Video link: https://youtu.be/UVopa_V7rgE?t=598


Page 49: CS 337: Artificial Intelligence & Machine Learning ...

Recall: Penalized Regularized Least Squares Regression

Φ =
[ φ_1(x_1)  φ_2(x_1)  ...  φ_n(x_1)
     ..
  φ_1(x_m)  φ_2(x_m)  ...  φ_n(x_m) ]

The Bayes and MAP estimates for Linear Regression y = w^T φ(x) + ε (ε ∼ N(0, σ²)) using a Gaussian prior w ∼ N(0, (1/λ) I) coincide with

Regularized Ridge Regression

w_Ridge = argmin_w ||Φw − y||₂² + λσ² ||w||₂²

Penalty: To account for noise and stop the coefficients of w from becoming too large in magnitude

We replace λσ² by a single λ for convenience.

Page 50: CS 337: Artificial Intelligence & Machine Learning ...

Recall: Closed-form solutions

Linear regression and Ridge regression both admit closed-form solutions

For linear regression, w* = (Φ^T Φ)^{−1} Φ^T y

For ridge regression, w* = (Φ^T Φ + λI)^{−1} Φ^T y

(for linear regression, λ = 0)
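Both closed forms, directly in code (a sketch assuming NumPy; Phi is the m × n design matrix of basis-function values and lam is the regularization coefficient):

import numpy as np

def linear_regression_fit(Phi, y):
    """w* = (Phi^T Phi)^{-1} Phi^T y, i.e. ridge with lambda = 0."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

def ridge_regression_fit(Phi, y, lam):
    """w* = (Phi^T Phi + lambda I)^{-1} Phi^T y."""
    n = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(n), Phi.T @ y)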

Page 51: CS 337: Artificial Intelligence & Machine Learning ...

Recall: Polynomial regression

Consider a degree 3 polynomial regression model as shown in the figure

Each bend in the curve corresponds to an increase in ||w||
The eigenvalues of (Φ^T Φ + λI) are indicative of curvature.
Increasing λ reduces the curvature (Tutorial 4 + 5)

Page 52: CS 337: Artificial Intelligence & Machine Learning ...

Ridge Regression in another form

If y = w^T φ(x), its ridge regression estimate is w = (Φ^T Φ + λI)^{−1} Φ^T y, where

Φ =
[ φ_1(x_1)  ...  φ_p(x_1)
     ...    ...     ...
  φ_1(x_m)  ...  φ_p(x_m) ]

Please note the difference between the matrix Φ and the vector φ(x)

φ(x_j) = [ φ_1(x_j), ..., φ_p(x_j) ]^T


Page 54: CS 337: Artificial Intelligence & Machine Learning ...

Ridge Regression in another form

The regression function will be

f(x) = w^T φ(x) = φ^T(x) w = φ^T(x) (Φ^T Φ + λI)^{−1} Φ^T y

Recall that φ^T(x_i) φ(x_j) = K(x_i, x_j). Note the following differences between Φ^T Φ and ΦΦ^T:

[Φ^T Φ]_ij = Σ_{k=1}^{m} φ_i(x_k) φ_j(x_k)

[ΦΦ^T]_ij = Σ_{k=1}^{p} φ_k(x_i) φ_k(x_j) = φ^T(x_i) φ(x_j) = K(x_i, x_j)


Page 57: CS 337: Artificial Intelligence & Machine Learning ...

Ridge Regression in another form

The regression function will be

f(x) = w^T φ(x) = φ^T(x) w = φ^T(x) (Φ^T Φ + λI)^{−1} Φ^T y

Consider the following matrix identity (verify for scalars):

(P^{−1} + B^T R^{−1} B)^{−1} B^T R^{−1} = P B^T (B P B^T + R)^{−1}

⇒ by setting R = I, P = (1/λ) I and B = Φ,

⇒ w = Φ^T (ΦΦ^T + λI)^{−1} y = Σ_{i=1}^{m} α_i φ(x_i), where α_i = [ (ΦΦ^T + λI)^{−1} y ]_i

⇒ the final decision function f(x) = φ^T(x) w = Σ_{i=1}^{m} α_i φ^T(x) φ(x_i)

Page 58: CS 337: Artificial Intelligence & Machine Learning ...

Ridge Regression in another form

Thus, given w = (Φ^T Φ + λI)^{−1} Φ^T y and the matrix identity (P^{−1} + B^T R^{−1} B)^{−1} B^T R^{−1} = P B^T (B P B^T + R)^{−1},

⇒ w = Φ^T (ΦΦ^T + λI)^{−1} y = Σ_{i=1}^{m} α_i φ(x_i), where α_i = [ (ΦΦ^T + λI)^{−1} y ]_i

⇒ the final decision function f(x) = φ^T(x) w = Σ_{i=1}^{m} α_i φ^T(x) φ(x_i)

We notice that the only way the decision function f(x) involves φ is through φ^T(x_i) φ(x_j), for some i, j

Page 59: CS 337: Artificial Intelligence & Machine Learning ...

Ridge Regression in another form

Given w = (Φ^T Φ + λI)^{−1} Φ^T y

⇒ w = Φ^T (ΦΦ^T + λI)^{−1} y = Σ_{i=1}^{m} α_i φ(x_i), where α_i = [ (ΦΦ^T + λI)^{−1} y ]_i

⇒ the final decision function f(x) = φ^T(x) w = Σ_{i=1}^{m} α_i φ^T(x) φ(x_i)

φ^T(x_i) φ(x_j) = K(x_i, x_j)

[Φ^T Φ]_ij = Σ_{k=1}^{m} φ_i(x_k) φ_j(x_k)

[ΦΦ^T]_ij = Σ_{k=1}^{p} φ_k(x_i) φ_k(x_j) = φ^T(x_i) φ(x_j) = K(x_i, x_j)
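The two routes can be compared numerically: the primal solution (Φ^T Φ + λI)^{−1} Φ^T y and the dual solution Φ^T (ΦΦ^T + λI)^{−1} y agree, and the dual route touches Φ only through the Gram matrix ΦΦ^T (a sketch assuming NumPy, with an arbitrary design matrix standing in for Φ):

import numpy as np

rng = np.random.default_rng(4)
m, p, lam = 30, 5, 0.5
Phi = rng.normal(size=(m, p))           # design matrix of basis-function values
y = rng.normal(size=m)

# Primal ridge regression: w = (Phi^T Phi + lam I)^{-1} Phi^T y
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)

# Dual (kernelized) ridge regression: alpha = (Phi Phi^T + lam I)^{-1} y
G = Phi @ Phi.T                         # Gram matrix, G[i, j] = phi^T(x_i) phi(x_j) = K(x_i, x_j)
alpha = np.linalg.solve(G + lam * np.eye(m), y)
w_dual = Phi.T @ alpha                  # w = sum_i alpha_i phi(x_i)

print(np.allclose(w_primal, w_dual))    # the matrix identity in action
# At a new point, only k(x) = [K(x, x_1), ..., K(x, x_m)] is needed: f(x) = k(x) @ alpha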

Page 60: CS 337: Artificial Intelligence & Machine Learning ...

Recap: Example Kernel

For a 2-dimensional xi :

φ(x_i) = [ 1, √2·x_i1, √2·x_i2, √2·x_i1·x_i2, x_i1², x_i2² ]^T

φ(x_i) exists in a 6-dimensional space

But, to compute K(x_1, x_2), all we need is x_1^T x_2, without having to enumerate φ(x_i)

Page 61: CS 337: Artificial Intelligence & Machine Learning ...

Example Kernels

Linear kernel: K(x, y) = x^T y

Polynomial kernel: K(x, y) = (1 + x^T y)^d

RBF kernel: K(x, y) = exp( −||x − y||² / (2σ²) )

Page 62: CS 337: Artificial Intelligence & Machine Learning ...

Example: Regression using Kernels