CS 337: Artificial Intelligence & Machine Learning ...
CS 337: Artificial Intelligence & Machine Learning
Instructor: Prof. Ganesh Ramakrishnan
Lecture 12: Kernels: Perceptron, Logistic Regression and Ridge Regression
August 2019
Kernel Perceptron
https://youtu.be/ql_GMONpmHM?list=PLyo3HAXSZD3zfv9O-y9DJhvrWQPscqATa
Recap: Perceptron Update Rule
Perceptron works for two classes (y = ±1).
A point is misclassified if y w^T φ(x) < 0.
Perceptron Algorithm:
INITIALIZE: w = ones()
REPEAT: for each ⟨x, y⟩:
    if y w^T φ(x) < 0 then w = w + η y φ(x)
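As a sketch, the update rule in NumPy (illustrative code, not the course's reference implementation; η and the epoch cap are our choices):

```python
import numpy as np

def perceptron(X, y, eta=1.0, epochs=100):
    """Perceptron on precomputed features: each row of X is phi(x), y in {-1, +1}."""
    w = np.ones(X.shape[1])            # INITIALIZE: w = ones()
    for _ in range(epochs):            # REPEAT
        for phi_x, yi in zip(X, y):    # for each <x, y>
            if yi * (w @ phi_x) < 0:   # misclassified: y w^T phi(x) < 0
                w = w + eta * yi * phi_x
    return w
```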
Kernel (Non-linear) perceptron?
Kernelized perceptron¹: f(x) = sign( Σ_i α_i y_i K(x, x_i) + b )
¹The first kernel classification learner was invented in 1964.
INITIALIZE: α = zeros()
REPEAT: for each ⟨x_i, y_i⟩:
    if sign( Σ_j α_j y_j K(x_i, x_j) + b ) ≠ y_i then α_i = α_i + 1
Convergence is the subject of Tutorials 4 & 5.
This could help you avoid separately storing all the … in the solution to Problem 2 of Lab 3.
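A minimal kernelized-perceptron sketch (the function names and the XOR-style demo data are ours; the mistake-count update follows the slide's algorithm):

```python
import numpy as np

def kernel_perceptron(X, y, K, epochs=100):
    """alpha_i counts the mistakes made on example i; b is kept fixed at 0."""
    m = len(y)
    alpha, b = np.zeros(m), 0.0
    for _ in range(epochs):
        for i in range(m):
            s = sum(alpha[j] * y[j] * K(X[i], X[j]) for j in range(m)) + b
            if np.sign(s) != y[i]:     # misclassified under current alpha
                alpha[i] += 1
    return alpha, b

def predict(x, X, y, alpha, b, K):
    s = sum(a * yj * K(x, xj) for a, yj, xj in zip(alpha, y, X)) + b
    return 1.0 if s >= 0 else -1.0

# usage: XOR-style data that no linear perceptron separates,
# handled via the degree-2 polynomial kernel
X = np.array([[1., 1.], [-1., -1.], [1., -1.], [-1., 1.]])
y = np.array([1., 1., -1., -1.])
poly2 = lambda u, v: (1.0 + u @ v) ** 2
alpha, b = kernel_perceptron(X, y, poly2)
```

Note that only the mistake counts α (and the training points) need to be stored, never a weight vector in the implicit feature space.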
An Example Kernel
We illustrate going from φ(x) to K(x, y) and back...
For example, for a 2-dimensional x_i:
φ(x_i) = [ 1, √2 x_{i1}, √2 x_{i2}, √2 x_{i1} x_{i2}, x_{i1}², x_{i2}² ]^T
φ(x_i) exists in a 6-dimensional space.
But to compute K(x_1, x_2) = (1 + x_1^T x_2)², all we need is x_1^T x_2, without having to enumerate φ(x_i).
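A numerical check of this correspondence (this φ realizes the degree-2 polynomial kernel K(x, z) = (1 + x^T z)²; the test points are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit 6-dimensional feature map for 2-d input x."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2)*x1, np.sqrt(2)*x2,
                     np.sqrt(2)*x1*x2, x1**2, x2**2])

def K(x, z):
    """Degree-2 polynomial kernel: the inner product in phi-space."""
    return (1.0 + x @ z) ** 2

x, z = np.array([0.5, -1.0]), np.array([2.0, 3.0])
assert np.isclose(phi(x) @ phi(z), K(x, z))   # same value, no 6-d enumeration needed
```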
Example Kernels
Linear kernel: K (x, y) = xTy
Polynomial kernel: K(x, y) = (1 + x^T y)^d
RBF kernel: K(x, y) = exp( −‖x − y‖² / (2σ²) )
Example: Classification using Kernels
Source: https://www.slideshare.net/ANITALOKITA/winnow-vs-perceptron
More on the Kernel Trick
Kernels operate in a high-dimensional, implicit feature space without necessarily computing the coordinates of the data in that space; instead, they simply evaluate the kernel function.
This is often computationally cheaper than the explicit computation of the coordinates.
The Gram (Kernel) Matrix
For any dataset {x_1, x_2, . . . , x_m} and for any m, the Gram matrix K is defined as

K = [ K(x_1, x_1)  ...  K(x_1, x_m)
      ...          K(x_i, x_j)  ...
      K(x_m, x_1)  ...  K(x_m, x_m) ]

Claim: If K_ij = K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ are the entries of an m × m Gram matrix K, then K must be symmetric and.....
K must be positive semi-definite.
Proof: For any b ∈ ℝ^m,
b^T K b = Σ_{i,j} b_i K_ij b_j = Σ_{i,j} b_i b_j ⟨φ(x_i), φ(x_j)⟩ = ⟨ Σ_i b_i φ(x_i), Σ_j b_j φ(x_j) ⟩ = ‖ Σ_i b_i φ(x_i) ‖_2² ≥ 0
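The claim can be checked numerically on a random dataset (the RBF kernel and the data below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                     # arbitrary dataset

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z)**2) / (2 * sigma**2))

K = np.array([[rbf(xi, xj) for xj in X] for xi in X])   # Gram matrix

assert np.allclose(K, K.T)                       # symmetric
assert np.linalg.eigvalsh(K).min() >= -1e-8      # PSD: all eigenvalues >= 0
# hence the quadratic form b^T K b is nonnegative for any b
b = rng.normal(size=20)
assert b @ K @ b >= -1e-8
```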
Basis expansion φ for symmetric K?
Positive-definite kernel: For any dataset {x_1, x_2, . . . , x_m} and for any m, the Gram matrix K must be positive semi-definite, so that K = UΣU^T = (UΣ^{1/2})(UΣ^{1/2})^T = RR^T, where the rows of U are linearly independent and Σ is a positive diagonal matrix.
We will illustrate through an example..
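A sketch of such an illustration (the polynomial-kernel Gram matrix below is our choice; `numpy.linalg.eigh` gives the decomposition K = UΣU^T):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))
K = (1.0 + X @ X.T) ** 2          # Gram matrix of the degree-2 polynomial kernel

# Eigendecomposition K = U Sigma U^T; clip tiny negatives from round-off
lam, U = np.linalg.eigh(K)
lam = np.clip(lam, 0.0, None)
R = U * np.sqrt(lam)              # R = U Sigma^{1/2}

assert np.allclose(R @ R.T, K)    # rows of R act as finite-dimensional phi(x_i)
```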
From matrix decomposition to function decomposition: Mercer Kernel K
Positive-definite kernel: For any dataset {x_1, x_2, . . . , x_m} and for any m, the Gram matrix K must be positive semi-definite, so that K = UΣU^T = (UΣ^{1/2})(UΣ^{1/2})^T = RR^T, where the rows of U are linearly independent and Σ is a positive diagonal matrix.
Mercer kernel: Extending to eigenfunction decomposition²:
K(x_1, x_2) = Σ_{j=1}^∞ α_j φ_j(x_1) φ_j(x_2), where α_j ≥ 0 and Σ_{j=1}^∞ α_j² < ∞
Mercer kernels and positive-definite kernels turn out to be equivalent if the input space {x} is compact³.
2Eigen-decomposition wrt linear operators.
3Equivalent of closed and bounded.
Mercer and Positive Definite Kernels
Mercer's Theorem is NOT INCLUDED IN THE
MIDSEMESTER EXAM SYLLABUS:
Mercer's theorem (Not for Midsem)
Mercer kernel: K(x_1, x_2) is a Mercer kernel if ∫_{x_1} ∫_{x_2} K(x_1, x_2) g(x_1) g(x_2) dx_1 dx_2 ≥ 0 for all square-integrable functions g(x) (g(x) is square integrable iff ∫ (g(x))² dx is finite).
Mercer's theorem: For any Mercer kernel K(x_1, x_2), ∃ φ(x): ℝ^n → H s.t. K(x_1, x_2) = φ^T(x_1) φ(x_2),
where H is a Hilbert space, the infinite-dimensional version of the Euclidean space (ℝ^n, ⟨·, ·⟩), where ⟨·, ·⟩ is the standard dot product in ℝ^n.
Advanced: Formally, a Hilbert space is an inner product space with an associated norm, in which every Cauchy sequence converges.
Prove that (x_1^T x_2)^d is a Mercer kernel (d ∈ ℤ⁺, d ≥ 1)
We want to prove that, for all square-integrable functions g(x),
∫_{x_1} ∫_{x_2} (x_1^T x_2)^d g(x_1) g(x_2) dx_1 dx_2 ≥ 0
Here, x_1 and x_2 are vectors s.t. x_1, x_2 ∈ ℝ^t.
Thus, expanding (x_1^T x_2)^d by the multinomial theorem,
∫_{x_1} ∫_{x_2} (x_1^T x_2)^d g(x_1) g(x_2) dx_1 dx_2
= ∫_{x_11} .. ∫_{x_1t} ∫_{x_21} .. ∫_{x_2t} [ Σ_{n_1..n_t} (d! / (n_1! .. n_t!)) Π_{j=1}^t (x_{1j} x_{2j})^{n_j} ] g(x_1) g(x_2) dx_{11} .. dx_{1t} dx_{21} .. dx_{2t},
s.t. Σ_{i=1}^t n_i = d
By Fubini's theorem, under finiteness conditions, the integral of a sum is the sum of the integrals.
Prove that (x�1 x2)d is Mercer kernel (d ∈ Z+, d ≥ 1)
=�
n1...nt
d !
n1! . . . nt!
�
x1
�
x2
t�
j=1
(x1jx2j)nj g(x1)g(x2) dx1dx2
=�
n1...nt
d !
n1! . . . nt!
�
x1
�
x2
(xn111xn212 . . . x
nt1t )g(x1) (x
n121x
n222 . . . x
nt2t )g(x2) dx1dx2
=�
n1...nt
d !
n1! . . . nt!
��
x1
(xn111 . . . xnt1t )g(x1) dx1
� ��
x2
(xn121 . . . xnt2t )g(x2) dx2
�
(integral of decomposable product as product of integrals)
s.t.t�
i
ni = d
Prove that (x_1^T x_2)^d is a Mercer kernel (d ∈ ℤ⁺, d ≥ 1)
Realize that both integrals are basically the same, with different variable names.
Thus, the quadratic (positive-definiteness) expression becomes:
Σ_{n_1..n_t} (d! / (n_1! .. n_t!)) ( ∫_{x_1} (x_{11}^{n_1} .. x_{1t}^{n_t}) g(x_1) dx_1 )² ≥ 0
(the square is non-negative for reals)
Thus, we have shown that (x_1^T x_2)^d is a Mercer kernel.
What about Σ_{d=1}^r α_d (x_1^T x_2)^d s.t. α_d ≥ 0?
K(x_1, x_2) = Σ_{d=1}^r α_d (x_1^T x_2)^d
Is ∫_{x_1} ∫_{x_2} [ Σ_{d=1}^r α_d (x_1^T x_2)^d ] g(x_1) g(x_2) dx_1 dx_2 ≥ 0?
We have
∫_{x_1} ∫_{x_2} [ Σ_{d=1}^r α_d (x_1^T x_2)^d ] g(x_1) g(x_2) dx_1 dx_2 = Σ_{d=1}^r α_d ∫_{x_1} ∫_{x_2} (x_1^T x_2)^d g(x_1) g(x_2) dx_1 dx_2
What about Σ_{d=1}^r α_d (x_1^T x_2)^d s.t. α_d ≥ 0?
Since α_d ≥ 0 ∀d, and since we have already proved that ∫_{x_1} ∫_{x_2} (x_1^T x_2)^d g(x_1) g(x_2) dx_1 dx_2 ≥ 0, we must have
Σ_{d=1}^r α_d ∫_{x_1} ∫_{x_2} (x_1^T x_2)^d g(x_1) g(x_2) dx_1 dx_2 ≥ 0
By which, K(x_1, x_2) = Σ_{d=1}^r α_d (x_1^T x_2)^d is a Mercer kernel.
Examples of Mercer Kernels: Linear Kernel, Polynomial Kernel, Radial Basis Function Kernel
Closure properties of Kernels (Part of Midsems)
Let K_1(x_1, x_2) and K_2(x_1, x_2) be positive definite (Mercer) kernels. Then the following are also kernels:
α_1 K_1(x_1, x_2) + α_2 K_2(x_1, x_2) for α_1, α_2 ≥ 0. Proof:
K_1(x_1, x_2) K_2(x_1, x_2). Proof: For simplicity, we assume that the feature space is finite dimensional.
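Both closure properties can be checked on Gram matrices (the kernels and dataset below are illustrative; the product kernel's Gram matrix is the elementwise, i.e. Schur, product of the two Gram matrices):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 2))
K1 = X @ X.T                                                  # linear kernel
K2 = np.exp(-((X[:, None, :] - X[None, :, :])**2).sum(-1) / 2.0)   # RBF, sigma=1

def is_psd(K):
    return np.linalg.eigvalsh((K + K.T) / 2).min() >= -1e-9

assert is_psd(K1) and is_psd(K2)
assert is_psd(2.0 * K1 + 3.0 * K2)   # nonnegative combination stays PSD
assert is_psd(K1 * K2)               # elementwise (Schur) product stays PSD
```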
Are the following Mercer Kernels? (Part of Midsems)
Linear kernel: K(x, y) = x^T y
Polynomial kernel: K(x, y) = (1 + x^T y)^d
Exponential kernel: K(x, y) = exp(⟨x, y⟩)
RBF kernel: K(x, y) = exp( −‖x − y‖² / (2σ²) ) = k(x − y) = k_y(x)
The function k_y(x) is also called a smoothing kernel (as we'll see soon).
Some more Tutorial 4 + 5 Questions
Informally show that the kernelized logistic regression form is equivalent to the original logistic regression when the regularized cross-entropy is minimized.
Show that ridge regression has an equivalent kernelized form.
Kernelized Logistic Regression
(Part of Lab 3, Tutorial 4 + 5 and Midsems)
https://youtu.be/4PtaZVUbilI?list=PLyo3HAXSZD3zfv9O-y9DJhvrWQPscqATa&t=316
Logistic Regression Kernelized
1. We have already seen (a) the Cross-Entropy loss and (b) the Bayesian interpretation of regularization.
2. The Regularized (Logistic) Cross-Entropy Loss function (minimized wrt w ∈ ℝ^p):
E(w) = −(1/m) Σ_{i=1}^m [ y^(i) log σ_w(x^(i)) + (1 − y^(i)) log(1 − σ_w(x^(i))) ] + (λ/2m) ‖w‖_2²   (1)
3. Equivalent dual kernelized objective?
Logistic Regression Kernelized
3. By substituting σ_w(x) = 1 / (1 + e^{−w^T φ(x)}) = e^{w^T φ(x)} / (1 + e^{w^T φ(x)}), we simplify (1) to
E(w) = −(1/m) Σ_{i=1}^m [ y^(i) w^T φ(x^(i)) − log( 1 + exp(w^T φ(x^(i))) ) ] + (λ/2m) ‖w‖_2²   (2)
Logistic Regression Kernelized
3. Equivalent dual kernelized objective⁴ (minimized wrt α ∈ ℝ^m):
E_D(α) = − Σ_{i=1}^m [ Σ_{j=1}^m ( y^(i) K(x^(i), x^(j)) α_j − (λ/2) α_i K(x^(i), x^(j)) α_j ) − log( 1 + exp( −Σ_{j=1}^m α_j K(x^(i), x^(j)) ) ) ]   (3)
Decision function: σ_w(x) = 1 / ( 1 + exp( −Σ_{j=1}^m α_j K(x, x^(j)) ) )
⁴Representer Theorem and http://perso.telecom-paristech.fr/~clemenco/Projets_ENPC_files/kernel-log-regression-svm-boosting.pdf
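One way to sketch this in code (our sketch, not the lecture's exact dual: we substitute w = Σ_j α_j φ(x^(j)), so that w^T φ(x^(i)) = (Kα)_i, and run gradient descent on the resulting objective in α; labels are in {0, 1} to match the cross-entropy form, and the toy data is illustrative):

```python
import numpy as np

def kernel_logreg(K, y, lam=0.1, lr=0.1, steps=2000):
    """Fit alpha by gradient descent on the kernelized cross-entropy.

    Substituting w = sum_j alpha_j phi(x_j) gives w^T phi(x_i) = (K alpha)_i,
    and the regularizer ||w||^2 becomes alpha^T K alpha.  y is in {0, 1}.
    """
    m = len(y)
    alpha = np.zeros(m)
    for _ in range(steps):
        s = K @ alpha                        # scores w^T phi(x_i)
        p = 1.0 / (1.0 + np.exp(-s))         # sigma_w(x_i)
        grad = K @ (p - y) / m + lam * (K @ alpha) / m
        alpha -= lr * grad
    return alpha

# usage: separable 1-d toy data with an RBF kernel
X = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])
K = np.exp(-(X[:, None] - X[None, :])**2 / 2.0)
alpha = kernel_logreg(K, y)
pred = (1.0 / (1.0 + np.exp(-(K @ alpha))) > 0.5).astype(int)
```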
CS 337: Artificial Intelligence & Machine Learning
The Kernel Trick: Illustrations on Ridge Regression
Recall: Ridge Regression
Video link: https://youtu.be/UVopa_V7rgE?t=598
Recall: Penalized Regularized Least Squares Regression

Φ = [ φ_1(x_1)  φ_2(x_1)  ...  φ_n(x_1)
      ...
      φ_1(x_m)  φ_2(x_m)  ...  φ_n(x_m) ]

The Bayes and MAP estimates for linear regression y = w^T φ(x) + ε (ε ∼ N(0, σ²)), using a Gaussian prior w ∼ N(0, (1/λ) I), coincide with regularized ridge regression:
w_Ridge = argmin_w ‖Φw − y‖_2² + λσ² ‖w‖_2²
Penalty: to account for noise and stop the coefficients of w from becoming too large in magnitude.
We replace λσ² by a single λ for convenience.
Recall: Closed-form solutions
Linear regression and ridge regression both admit closed-form solutions.
For linear regression, w* = (Φ^T Φ)^{−1} Φ^T y
For ridge regression, w* = (Φ^T Φ + λI)^{−1} Φ^T y
(linear regression is the case λ = 0)
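A quick numerical illustration of both closed forms (synthetic data; the choice λ = 0.5 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
Phi = rng.normal(size=(50, 4))                   # design matrix (m x n)
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = Phi @ w_true + 0.1 * rng.normal(size=50)

lam = 0.5
n = Phi.shape[1]
w_linear = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)                 # lambda = 0
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n), Phi.T @ y)

# the penalty shrinks the coefficients
assert np.linalg.norm(w_ridge) <= np.linalg.norm(w_linear) + 1e-12
```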
Recall: Polynomial regression
Consider a degree-3 polynomial regression model as shown in the figure.
Each bend in the curve corresponds to an increase in ‖w‖. The eigenvalues of (Φ^T Φ + λI) are indicative of curvature. Increasing λ reduces the curvature (Tutorial 4 + 5).
Ridge Regression in another form
If y = w^T φ(x), its ridge regression estimate is w = (Φ^T Φ + λI)^{−1} Φ^T y, where

Φ = [ φ_1(x_1)  ...  φ_p(x_1)
      ...       ...  ...
      φ_1(x_m)  ...  φ_p(x_m) ]

Please note the difference between the matrix Φ and the vector φ(x):
φ(x_j) = [ φ_1(x_j), ..., φ_p(x_j) ]^T
Ridge Regression in another form
The regression function will be
f(x) = w^T φ(x) = φ^T(x) w = φ^T(x) (Φ^T Φ + λI)^{−1} Φ^T y
Recall that φ^T(x_i) φ(x_j) = K(x_i, x_j). Note the following differences between Φ^T Φ and ΦΦ^T:
(Φ^T Φ)_{ij} = Σ_{k=1}^m φ_i(x_k) φ_j(x_k)
(ΦΦ^T)_{ij} = Σ_{k=1}^p φ_k(x_i) φ_k(x_j) = φ^T(x_i) φ(x_j) = K(x_i, x_j)
Ridge Regression in another form
The regression function will be
f(x) = w^T φ(x) = φ^T(x) w = φ^T(x) (Φ^T Φ + λI)^{−1} Φ^T y
Consider the following matrix identity (verify for scalars):
(P^{−1} + B^T R^{−1} B)^{−1} B^T R^{−1} = P B^T (B P B^T + R)^{−1}
⇒ by setting R = I, P = (1/λ) I and B = Φ,
⇒ w = Φ^T (ΦΦ^T + λI)^{−1} y = Σ_{i=1}^m α_i φ(x_i), where α_i = [ (ΦΦ^T + λI)^{−1} y ]_i
⇒ the final decision function f(x) = φ^T(x) w = Σ_{i=1}^m α_i φ^T(x) φ(x_i)
Ridge Regression in another form
Thus, given w = (Φ^T Φ + λI)^{−1} Φ^T y and the matrix identity (P^{−1} + B^T R^{−1} B)^{−1} B^T R^{−1} = P B^T (B P B^T + R)^{−1},
⇒ w = Φ^T (ΦΦ^T + λI)^{−1} y = Σ_{i=1}^m α_i φ(x_i), where α_i = [ (ΦΦ^T + λI)^{−1} y ]_i
⇒ the final decision function f(x) = φ^T(x) w = Σ_{i=1}^m α_i φ^T(x) φ(x_i)
We notice that the only way the decision function f(x) involves φ is through φ^T(x_i) φ(x_j), for some i, j.
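The primal and dual forms can be verified to agree numerically (random data; for illustration we take φ(x) = x, i.e. the linear kernel, so each row of Φ is φ(x_i)):

```python
import numpy as np

rng = np.random.default_rng(5)
m, p, lam = 8, 3, 0.7
Phi = rng.normal(size=(m, p))     # rows are phi(x_i)
y = rng.normal(size=m)

# primal solution: w = (Phi^T Phi + lam I)^{-1} Phi^T y
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)
# dual solution: alpha = (Phi Phi^T + lam I)^{-1} y, then w = Phi^T alpha
alpha = np.linalg.solve(Phi @ Phi.T + lam * np.eye(m), y)
w_dual = Phi.T @ alpha            # w = sum_i alpha_i phi(x_i)

assert np.allclose(w_primal, w_dual)

# predictions need only kernel evaluations phi(x)^T phi(x_i):
x_new = rng.normal(size=p)
f_primal = x_new @ w_primal
f_dual = sum(alpha[i] * (x_new @ Phi[i]) for i in range(m))
assert np.isclose(f_primal, f_dual)
```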
Recap: Example Kernel
For a 2-dimensional x_i:
φ(x_i) = [ 1, √2 x_{i1}, √2 x_{i2}, √2 x_{i1} x_{i2}, x_{i1}², x_{i2}² ]^T
φ(x_i) exists in a 6-dimensional space.
But to compute K(x_1, x_2), all we need is x_1^T x_2, without having to enumerate φ(x_i).
Example Kernels
Linear kernel: K (x, y) = xTy
Polynomial kernel: K(x, y) = (1 + x^T y)^d
RBF kernel: K(x, y) = exp( −‖x − y‖² / (2σ²) )
Example: Regression using Kernels