CS 337: Artificial Intelligence & Machine Learning ...
CS 337: Artificial Intelligence & Machine Learning
Instructor: Prof. Ganesh Ramakrishnan
Lecture 12: Kernels: Perceptron, Logistic Regression and Ridge Regression
August 2019
Kernel Perceptron
https://youtu.be/ql_GMONpmHM?list=PLyo3HAXSZD3zfv9O-y9DJhvrWQPscqATa
Recap: Perceptron Update Rule
Perceptron works for two classes (y = ±1).
A point is misclassified if y w^T φ(x) < 0.
Perceptron Algorithm:
INITIALIZE: w = ones()
REPEAT: for each ⟨x, y⟩:
    if y w^T φ(x) < 0 then w = w + η y φ(x)
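As a sketch, the update rule in NumPy (illustrative code, not the course's reference implementation; η and the epoch cap are our choices):

```python
import numpy as np

def perceptron(X, y, eta=1.0, epochs=100):
    """Perceptron on precomputed features: each row of X is phi(x), y in {-1, +1}."""
    w = np.ones(X.shape[1])            # INITIALIZE: w = ones()
    for _ in range(epochs):            # REPEAT
        for phi_x, yi in zip(X, y):    # for each <x, y>
            if yi * (w @ phi_x) < 0:   # misclassified: y w^T phi(x) < 0
                w = w + eta * yi * phi_x
    return w
```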
Kernel (Non-linear) perceptron?
Kernelized perceptron¹: f(x) = sign( Σ_i α_i y_i K(x, x_i) + b )
¹The first kernel classification learner was invented in 1964.
INITIALIZE: α = zeros()
REPEAT: for each ⟨x_i, y_i⟩:
    if sign( Σ_j α_j y_j K(x_i, x_j) + b ) ≠ y_i then α_i = α_i + 1
Convergence is the subject of Tutorials 4 & 5.
This could help you avoid separately storing all the … in the solution to Problem 2 of Lab 3.
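A minimal kernelized-perceptron sketch (the function names and the XOR-style demo data are ours; the mistake-count update follows the slide's algorithm):

```python
import numpy as np

def kernel_perceptron(X, y, K, epochs=100):
    """alpha_i counts the mistakes made on example i; b is kept fixed at 0."""
    m = len(y)
    alpha, b = np.zeros(m), 0.0
    for _ in range(epochs):
        for i in range(m):
            s = sum(alpha[j] * y[j] * K(X[i], X[j]) for j in range(m)) + b
            if np.sign(s) != y[i]:     # misclassified under current alpha
                alpha[i] += 1
    return alpha, b

def predict(x, X, y, alpha, b, K):
    s = sum(a * yj * K(x, xj) for a, yj, xj in zip(alpha, y, X)) + b
    return 1.0 if s >= 0 else -1.0

# usage: XOR-style data that no linear perceptron separates,
# handled via the degree-2 polynomial kernel
X = np.array([[1., 1.], [-1., -1.], [1., -1.], [-1., 1.]])
y = np.array([1., 1., -1., -1.])
poly2 = lambda u, v: (1.0 + u @ v) ** 2
alpha, b = kernel_perceptron(X, y, poly2)
```

Note that only the mistake counts α (and the training points) need to be stored, never a weight vector in the implicit feature space.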
An Example Kernel
We illustrate going from φ(x) to K(x, y) and back...
For example, for a 2-dimensional x_i:
φ(x_i) = [ 1, √2 x_{i1}, √2 x_{i2}, √2 x_{i1} x_{i2}, x_{i1}², x_{i2}² ]^T
φ(x_i) exists in a 6-dimensional space.
But to compute K(x_1, x_2) = (1 + x_1^T x_2)², all we need is x_1^T x_2, without having to enumerate φ(x_i).
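A numerical check of this correspondence (this φ realizes the degree-2 polynomial kernel K(x, z) = (1 + x^T z)²; the test points are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit 6-dimensional feature map for 2-d input x."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2)*x1, np.sqrt(2)*x2,
                     np.sqrt(2)*x1*x2, x1**2, x2**2])

def K(x, z):
    """Degree-2 polynomial kernel: the inner product in phi-space."""
    return (1.0 + x @ z) ** 2

x, z = np.array([0.5, -1.0]), np.array([2.0, 3.0])
assert np.isclose(phi(x) @ phi(z), K(x, z))   # same value, no 6-d enumeration needed
```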
Example Kernels
Linear kernel: K (x, y) = xTy
Polynomial kernel: K(x, y) = (1 + x^T y)^d
RBF kernel: K(x, y) = exp( −‖x − y‖² / (2σ²) )
Example: Classification using Kernels
Source: https://www.slideshare.net/ANITALOKITA/winnow-vs-perceptron
More on the Kernel Trick
Kernels operate in a high-dimensional, implicit feature space without necessarily computing the coordinates of the data in that space; instead, they simply evaluate the kernel function.
This is often computationally cheaper than the explicit computation of the coordinates.
The Gram (Kernel) Matrix
For any dataset {x_1, x_2, . . . , x_m} and for any m, the Gram matrix K is defined as

K = [ K(x_1, x_1)  ...  K(x_1, x_m)
      ...          K(x_i, x_j)  ...
      K(x_m, x_1)  ...  K(x_m, x_m) ]

Claim: If K_ij = K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ are the entries of an m × m Gram matrix K, then K must be symmetric and.....
K must be positive semi-definite.
Proof: For any b ∈ ℝ^m,
b^T K b = Σ_{i,j} b_i K_ij b_j = Σ_{i,j} b_i b_j ⟨φ(x_i), φ(x_j)⟩ = ⟨ Σ_i b_i φ(x_i), Σ_j b_j φ(x_j) ⟩ = ‖ Σ_i b_i φ(x_i) ‖_2² ≥ 0
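The claim can be checked numerically on a random dataset (the RBF kernel and the data below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                     # arbitrary dataset

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z)**2) / (2 * sigma**2))

K = np.array([[rbf(xi, xj) for xj in X] for xi in X])   # Gram matrix

assert np.allclose(K, K.T)                       # symmetric
assert np.linalg.eigvalsh(K).min() >= -1e-8      # PSD: all eigenvalues >= 0
# hence the quadratic form b^T K b is nonnegative for any b
b = rng.normal(size=20)
assert b @ K @ b >= -1e-8
```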
Basis expansion φ for symmetric K?
Positive-definite kernel: For any dataset {x_1, x_2, . . . , x_m} and for any m, the Gram matrix K must be positive semi-definite, so that K = UΣU^T = (UΣ^{1/2})(UΣ^{1/2})^T = RR^T, where the rows of U are linearly independent and Σ is a positive diagonal matrix.
We will illustrate through an example..
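A sketch of such an illustration (the polynomial-kernel Gram matrix below is our choice; `numpy.linalg.eigh` gives the decomposition K = UΣU^T):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))
K = (1.0 + X @ X.T) ** 2          # Gram matrix of the degree-2 polynomial kernel

# Eigendecomposition K = U Sigma U^T; clip tiny negatives from round-off
lam, U = np.linalg.eigh(K)
lam = np.clip(lam, 0.0, None)
R = U * np.sqrt(lam)              # R = U Sigma^{1/2}

assert np.allclose(R @ R.T, K)    # rows of R act as finite-dimensional phi(x_i)
```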
From matrix decomposition to function decomposition: Mercer Kernel K
Positive-definite kernel: For any dataset {x_1, x_2, . . . , x_m} and for any m, the Gram matrix K must be positive semi-definite, so that K = UΣU^T = (UΣ^{1/2})(UΣ^{1/2})^T = RR^T, where the rows of U are linearly independent and Σ is a positive diagonal matrix.
Mercer kernel: Extending to eigenfunction decomposition²:
K(x_1, x_2) = Σ_{j=1}^∞ α_j φ_j(x_1) φ_j(x_2), where α_j ≥ 0 and Σ_{j=1}^∞ α_j² < ∞
Mercer kernels and positive-definite kernels turn out to be equivalent if the input space {x} is compact³.
2Eigen-decomposition wrt linear operators.
3Equivalent of closed and bounded.
Mercer and Positive Definite Kernels
Mercer's Theorem is NOT INCLUDED IN THE
MIDSEMESTER EXAM SYLLABUS:
Mercer's theorem (Not for Midsem)
Mercer kernel: K(x_1, x_2) is a Mercer kernel if ∫_{x_1} ∫_{x_2} K(x_1, x_2) g(x_1) g(x_2) dx_1 dx_2 ≥ 0 for all square-integrable functions g(x) (g(x) is square integrable iff ∫ (g(x))² dx is finite).
Mercer's theorem: For any Mercer kernel K(x_1, x_2), ∃ φ(x): ℝ^n → H s.t. K(x_1, x_2) = φ^T(x_1) φ(x_2),
where H is a Hilbert space, the infinite-dimensional version of the Euclidean space (ℝ^n, ⟨·, ·⟩), where ⟨·, ·⟩ is the standard dot product in ℝ^n.
Advanced: Formally, a Hilbert space is an inner product space with an associated norm, in which every Cauchy sequence converges.
Prove that (x_1^T x_2)^d is a Mercer kernel (d ∈ ℤ⁺, d ≥ 1)
We want to prove that, for all square-integrable functions g(x),
∫_{x_1} ∫_{x_2} (x_1^T x_2)^d g(x_1) g(x_2) dx_1 dx_2 ≥ 0
Here, x_1 and x_2 are vectors s.t. x_1, x_2 ∈ ℝ^t.
Thus, expanding (x_1^T x_2)^d by the multinomial theorem,
∫_{x_1} ∫_{x_2} (x_1^T x_2)^d g(x_1) g(x_2) dx_1 dx_2
= ∫_{x_11} .. ∫_{x_1t} ∫_{x_21} .. ∫_{x_2t} [ Σ_{n_1..n_t} (d! / (n_1! .. n_t!)) Π_{j=1}^t (x_{1j} x_{2j})^{n_j} ] g(x_1) g(x_2) dx_{11} .. dx_{1t} dx_{21} .. dx_{2t},
s.t. Σ_{i=1}^t n_i = d
By Fubini's theorem, under finiteness conditions, the integral of a sum is the sum of the integrals.
Prove that (x�1 x2)d is Mercer kernel (d ∈ Z+, d ≥ 1)
=�
n1...nt
d !
n1! . . . nt!
�
x1
�
x2
t�
j=1
(x1jx2j)nj g(x1)g(x2) dx1dx2
=�
n1...nt
d !
n1! . . . nt!
�
x1
�
x2
(xn111xn212 . . . x
nt1t )g(x1) (x
n121x
n222 . . . x
nt2t )g(x2) dx1dx2
=�
n1...nt
d !
n1! . . . nt!
��
x1
(xn111 . . . xnt1t )g(x1) dx1
� ��
x2
(xn121 . . . xnt2t )g(x2) dx2
�
(integral of decomposable product as product of integrals)
s.t.t�
i
ni = d
Prove that (x_1^T x_2)^d is a Mercer kernel (d ∈ ℤ⁺, d ≥ 1)
Realize that both integrals are basically the same, with different variable names.
Thus, the quadratic (positive-definiteness) expression becomes:
Σ_{n_1..n_t} (d! / (n_1! .. n_t!)) ( ∫_{x_1} (x_{11}^{n_1} .. x_{1t}^{n_t}) g(x_1) dx_1 )² ≥ 0
(the square is non-negative for reals)
Thus, we have shown that (x_1^T x_2)^d is a Mercer kernel.
What about Σ_{d=1}^r α_d (x_1^T x_2)^d s.t. α_d ≥ 0?
K(x_1, x_2) = Σ_{d=1}^r α_d (x_1^T x_2)^d
Is ∫_{x_1} ∫_{x_2} [ Σ_{d=1}^r α_d (x_1^T x_2)^d ] g(x_1) g(x_2) dx_1 dx_2 ≥ 0?
We have
∫_{x_1} ∫_{x_2} [ Σ_{d=1}^r α_d (x_1^T x_2)^d ] g(x_1) g(x_2) dx_1 dx_2 = Σ_{d=1}^r α_d ∫_{x_1} ∫_{x_2} (x_1^T x_2)^d g(x_1) g(x_2) dx_1 dx_2
What about Σ_{d=1}^r α_d (x_1^T x_2)^d s.t. α_d ≥ 0?
Since α_d ≥ 0 ∀d, and since we have already proved that ∫_{x_1} ∫_{x_2} (x_1^T x_2)^d g(x_1) g(x_2) dx_1 dx_2 ≥ 0, we must have
Σ_{d=1}^r α_d ∫_{x_1} ∫_{x_2} (x_1^T x_2)^d g(x_1) g(x_2) dx_1 dx_2 ≥ 0
By which, K(x_1, x_2) = Σ_{d=1}^r α_d (x_1^T x_2)^d is a Mercer kernel.
Examples of Mercer Kernels: Linear Kernel, Polynomial Kernel, Radial Basis Function Kernel
Closure properties of Kernels (Part of Midsems)
Let K_1(x_1, x_2) and K_2(x_1, x_2) be positive definite (Mercer) kernels. Then the following are also kernels:
α_1 K_1(x_1, x_2) + α_2 K_2(x_1, x_2) for α_1, α_2 ≥ 0. Proof:
K_1(x_1, x_2) K_2(x_1, x_2). Proof: For simplicity, we assume that the feature space is finite dimensional.
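Both closure properties can be checked on Gram matrices (the kernels and dataset below are illustrative; the product kernel's Gram matrix is the elementwise, i.e. Schur, product of the two Gram matrices):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 2))
K1 = X @ X.T                                                  # linear kernel
K2 = np.exp(-((X[:, None, :] - X[None, :, :])**2).sum(-1) / 2.0)   # RBF, sigma=1

def is_psd(K):
    return np.linalg.eigvalsh((K + K.T) / 2).min() >= -1e-9

assert is_psd(K1) and is_psd(K2)
assert is_psd(2.0 * K1 + 3.0 * K2)   # nonnegative combination stays PSD
assert is_psd(K1 * K2)               # elementwise (Schur) product stays PSD
```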
Are the following Mercer Kernels? (Part of Midsems)
Linear kernel: K(x, y) = x^T y
Polynomial kernel: K(x, y) = (1 + x^T y)^d
Exponential kernel: K(x, y) = exp(⟨x, y⟩)
RBF kernel: K(x, y) = exp( −‖x − y‖² / (2σ²) ) = k(x − y) = k_y(x)
The function k_y(x) is also called a smoothing kernel (as we'll see soon).
Some more Tutorial 4 + 5 Questions
Informally show that the kernelized logistic regression form is equivalent to the original logistic regression when the regularized cross-entropy is minimized.
Show that ridge regression has an equivalent kernelized form.
Kernelized Logistic Regression
(Part of Lab 3, Tutorial 4 + 5 and Midsems)
https://youtu.be/4PtaZVUbilI?list=PLyo3HAXSZD3zfv9O-y9DJhvrWQPscqATa&t=316
Logistic Regression Kernelized
1. We have already seen (a) the Cross-Entropy loss and (b) the Bayesian interpretation of regularization.
2. The Regularized (Logistic) Cross-Entropy Loss function (minimized wrt w ∈ ℝ^p):
E(w) = −(1/m) Σ_{i=1}^m [ y^(i) log σ_w(x^(i)) + (1 − y^(i)) log(1 − σ_w(x^(i))) ] + (λ/2m) ‖w‖_2²   (1)
3. Equivalent dual kernelized objective?
Logistic Regression Kernelized
3. By substituting σ_w(x) = 1 / (1 + e^{−w^T φ(x)}) = e^{w^T φ(x)} / (1 + e^{w^T φ(x)}), we simplify (1) to
E(w) = −(1/m) Σ_{i=1}^m [ y^(i) w^T φ(x^(i)) − log( 1 + exp(w^T φ(x^(i))) ) ] + (λ/2m) ‖w‖_2²   (2)
Logistic Regression Kernelized
3. Equivalent dual kernelized objective⁴ (minimized wrt α ∈ ℝ^m):
E_D(α) = − Σ_{i=1}^m [ Σ_{j=1}^m ( y^(i) K(x^(i), x^(j)) α_j − (λ/2) α_i K(x^(i), x^(j)) α_j ) − log( 1 + exp( −Σ_{j=1}^m α_j K(x^(i), x^(j)) ) ) ]   (3)
Decision function: σ_w(x) = 1 / ( 1 + exp( −Σ_{j=1}^m α_j K(x, x^(j)) ) )
⁴Representer Theorem and http://perso.telecom-paristech.fr/~clemenco/Projets_ENPC_files/kernel-log-regression-svm-boosting.pdf
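One way to sketch this in code (our sketch, not the lecture's exact dual: we substitute w = Σ_j α_j φ(x^(j)), so that w^T φ(x^(i)) = (Kα)_i, and run gradient descent on the resulting objective in α; labels are in {0, 1} to match the cross-entropy form, and the toy data is illustrative):

```python
import numpy as np

def kernel_logreg(K, y, lam=0.1, lr=0.1, steps=2000):
    """Fit alpha by gradient descent on the kernelized cross-entropy.

    Substituting w = sum_j alpha_j phi(x_j) gives w^T phi(x_i) = (K alpha)_i,
    and the regularizer ||w||^2 becomes alpha^T K alpha.  y is in {0, 1}.
    """
    m = len(y)
    alpha = np.zeros(m)
    for _ in range(steps):
        s = K @ alpha                        # scores w^T phi(x_i)
        p = 1.0 / (1.0 + np.exp(-s))         # sigma_w(x_i)
        grad = K @ (p - y) / m + lam * (K @ alpha) / m
        alpha -= lr * grad
    return alpha

# usage: separable 1-d toy data with an RBF kernel
X = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])
K = np.exp(-(X[:, None] - X[None, :])**2 / 2.0)
alpha = kernel_logreg(K, y)
pred = (1.0 / (1.0 + np.exp(-(K @ alpha))) > 0.5).astype(int)
```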
CS 337: Artificial Intelligence & Machine Learning
The Kernel Trick: Illustrations on Ridge Regression
Recall: Ridge Regression
Video link: https://youtu.be/UVopa_V7rgE?t=598
Recall: Penalized Regularized Least Squares Regression

Φ = [ φ_1(x_1)  φ_2(x_1)  ...  φ_n(x_1)
      ...
      φ_1(x_m)  φ_2(x_m)  ...  φ_n(x_m) ]

The Bayes and MAP estimates for linear regression y = w^T φ(x) + ε (ε ∼ N(0, σ²)), using a Gaussian prior w ∼ N(0, (1/λ) I), coincide with regularized ridge regression:
w_Ridge = argmin_w ‖Φw − y‖_2² + λσ² ‖w‖_2²
Penalty: to account for noise and stop the coefficients of w from becoming too large in magnitude.
We replace λσ² by a single λ for convenience.
Recall: Closed-form solutions
Linear regression and ridge regression both admit closed-form solutions.
For linear regression, w* = (Φ^T Φ)^{−1} Φ^T y
For ridge regression, w* = (Φ^T Φ + λI)^{−1} Φ^T y
(linear regression is the case λ = 0)
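A quick numerical illustration of both closed forms (synthetic data; the choice λ = 0.5 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
Phi = rng.normal(size=(50, 4))                   # design matrix (m x n)
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = Phi @ w_true + 0.1 * rng.normal(size=50)

lam = 0.5
n = Phi.shape[1]
w_linear = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)                 # lambda = 0
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n), Phi.T @ y)

# the penalty shrinks the coefficients
assert np.linalg.norm(w_ridge) <= np.linalg.norm(w_linear) + 1e-12
```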
Recall: Polynomial regression
Consider a degree-3 polynomial regression model as shown in the figure.
Each bend in the curve corresponds to an increase in ‖w‖. The eigenvalues of (Φ^T Φ + λI) are indicative of curvature. Increasing λ reduces the curvature (Tutorial 4 + 5).
Ridge Regression in another form
If y = w^T φ(x), its ridge regression estimate is w = (Φ^T Φ + λI)^{−1} Φ^T y, where

Φ = [ φ_1(x_1)  ...  φ_p(x_1)
      ...       ...  ...
      φ_1(x_m)  ...  φ_p(x_m) ]

Please note the difference between the matrix Φ and the vector φ(x):
φ(x_j) = [ φ_1(x_j), ..., φ_p(x_j) ]^T
Ridge Regression in another form
The regression function will be
f(x) = w^T φ(x) = φ^T(x) w = φ^T(x) (Φ^T Φ + λI)^{−1} Φ^T y
Recall that φ^T(x_i) φ(x_j) = K(x_i, x_j). Note the following differences between Φ^T Φ and ΦΦ^T:
(Φ^T Φ)_{ij} = Σ_{k=1}^m φ_i(x_k) φ_j(x_k)
(ΦΦ^T)_{ij} = Σ_{k=1}^p φ_k(x_i) φ_k(x_j) = φ^T(x_i) φ(x_j) = K(x_i, x_j)
Ridge Regression in another form
The regression function will be
f(x) = w^T φ(x) = φ^T(x) w = φ^T(x) (Φ^T Φ + λI)^{−1} Φ^T y
Consider the following matrix identity (verify for scalars):
(P^{−1} + B^T R^{−1} B)^{−1} B^T R^{−1} = P B^T (B P B^T + R)^{−1}
⇒ by setting R = I, P = (1/λ) I and B = Φ,
⇒ w = Φ^T (ΦΦ^T + λI)^{−1} y = Σ_{i=1}^m α_i φ(x_i), where α_i = [ (ΦΦ^T + λI)^{−1} y ]_i
⇒ the final decision function f(x) = φ^T(x) w = Σ_{i=1}^m α_i φ^T(x) φ(x_i)
Ridge Regression in another form
Thus, given w = (Φ^T Φ + λI)^{−1} Φ^T y and the matrix identity (P^{−1} + B^T R^{−1} B)^{−1} B^T R^{−1} = P B^T (B P B^T + R)^{−1},
⇒ w = Φ^T (ΦΦ^T + λI)^{−1} y = Σ_{i=1}^m α_i φ(x_i), where α_i = [ (ΦΦ^T + λI)^{−1} y ]_i
⇒ the final decision function f(x) = φ^T(x) w = Σ_{i=1}^m α_i φ^T(x) φ(x_i)
We notice that the only way the decision function f(x) involves φ is through φ^T(x_i) φ(x_j), for some i, j.
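The primal and dual forms can be verified to agree numerically (random data; for illustration we take φ(x) = x, i.e. the linear kernel, so each row of Φ is φ(x_i)):

```python
import numpy as np

rng = np.random.default_rng(5)
m, p, lam = 8, 3, 0.7
Phi = rng.normal(size=(m, p))     # rows are phi(x_i)
y = rng.normal(size=m)

# primal solution: w = (Phi^T Phi + lam I)^{-1} Phi^T y
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)
# dual solution: alpha = (Phi Phi^T + lam I)^{-1} y, then w = Phi^T alpha
alpha = np.linalg.solve(Phi @ Phi.T + lam * np.eye(m), y)
w_dual = Phi.T @ alpha            # w = sum_i alpha_i phi(x_i)

assert np.allclose(w_primal, w_dual)

# predictions need only kernel evaluations phi(x)^T phi(x_i):
x_new = rng.normal(size=p)
f_primal = x_new @ w_primal
f_dual = sum(alpha[i] * (x_new @ Phi[i]) for i in range(m))
assert np.isclose(f_primal, f_dual)
```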
Recap: Example Kernel
For a 2-dimensional x_i:
φ(x_i) = [ 1, √2 x_{i1}, √2 x_{i2}, √2 x_{i1} x_{i2}, x_{i1}², x_{i2}² ]^T
φ(x_i) exists in a 6-dimensional space.
But to compute K(x_1, x_2), all we need is x_1^T x_2, without having to enumerate φ(x_i).
Example Kernels
Linear kernel: K (x, y) = xTy
Polynomial kernel: K(x, y) = (1 + x^T y)^d
RBF kernel: K(x, y) = exp( −‖x − y‖² / (2σ²) )
Example: Regression using Kernels