Post on 18-Mar-2022
Principal component analysis

Dimensionality reduction:
• Last time, we saw that when the data lives in a subspace, it is best to design our learning algorithms in this subspace.

[Figure: a 2D subspace in 3D, with principal components φ1, φ2 of principal lengths λ1, λ2.]

This can be done by computing the principal components of the data:
• the principal components φi are the eigenvectors of Σ
• the principal lengths λi are the eigenvalues of Σ
PCA by SVD

We next saw that PCA can be computed by the SVD of the data matrix directly. Given X with one example per column:
• 1) create the centered data matrix

$$X_c^T = \left(I - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right) X^T$$

• 2) compute its SVD

$$X_c^T = M \Pi N^T$$

• 3) the principal components are the columns of N, and the eigenvalues are

$$\lambda_i = \frac{1}{n}\pi_i^2$$
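The three steps above can be sketched in NumPy (a minimal illustration, not from the slides; the data and variable names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
# one example per column, with very different variances along each axis
X = rng.normal(size=(d, n)) * np.array([[3.0], [1.0], [0.1]])

# 1) center: X_c^T = (I - (1/n) 1 1^T) X^T
Xc = X - X.mean(axis=1, keepdims=True)

# 2) SVD of the centered data matrix (examples in rows)
M, Pi, Nt = np.linalg.svd(Xc.T, full_matrices=False)

# 3) principal components = columns of N, eigenvalues lambda_i = pi_i^2 / n
components = Nt.T
eigvals = Pi**2 / n

# sanity check against the eigenvalues of the sample covariance
cov_eigs = np.sort(np.linalg.eigvalsh(Xc @ Xc.T / n))[::-1]
assert np.allclose(eigvals, cov_eigs)
```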
Extensions

Today we will talk about kernels:
• it turns out that any algorithm which depends on the data through dot products only, i.e. through the matrix with elements

$$x_i^T x_j$$

can be kernelized
• this is usually beneficial; we will see why later
• for now we look at the question of whether PCA can be written in the form above

Recall that the data matrix is

$$X = \begin{bmatrix} | & & | \\ x_1 & \cdots & x_n \\ | & & | \end{bmatrix}$$
Extensions

We saw that the centered data matrix and the covariance can be written as

$$X_c = X\left(I - \frac{1}{n}\mathbf{1}\mathbf{1}^T\right), \qquad \Sigma = \frac{1}{n} X_c X_c^T$$

The eigenvector φi of eigenvalue λi is

$$\varphi_i = \frac{1}{n\lambda_i} X_c X_c^T \varphi_i = X_c \alpha_i, \qquad \alpha_i = \frac{1}{n\lambda_i} X_c^T \varphi_i$$

Hence, the eigenvector matrix is

$$\Phi = X_c \Gamma, \qquad \Gamma = \begin{bmatrix} | & & | \\ \alpha_1 & \cdots & \alpha_d \\ | & & | \end{bmatrix}$$
Extensions

We next note that, from the eigenvector decomposition,

$$\Sigma = \Phi \Lambda \Phi^T \;\Leftrightarrow\; \Lambda = \Phi^T \Sigma \Phi$$

and

$$\Lambda = \Gamma^T X_c^T \left(\frac{1}{n} X_c X_c^T\right) X_c \Gamma = \frac{1}{n}\, \Gamma^T \left(X_c^T X_c\right)\left(X_c^T X_c\right) \Gamma$$

i.e.

$$\frac{1}{n}\left(X_c^T X_c\right)\left(X_c^T X_c\right) = \Gamma \Lambda \Gamma^T$$
Extensions

In summary, we have

$$\Sigma = \Phi \Lambda \Phi^T, \qquad \Phi = X_c \Gamma, \qquad \frac{1}{n}\left(X_c^T X_c\right)\left(X_c^T X_c\right) = \Gamma \Lambda \Gamma^T$$

This means that we can obtain PCA by:
• 1) assembling $n^{-1}(X_c^T X_c)(X_c^T X_c)$
• 2) computing its eigendecomposition (Λ, Γ)

PCA:
• the principal components are then given by $X_c \Gamma$
• the eigenvalues are given by Λ
Extensions

What is interesting here is that we only need the matrix

$$K = X_c^T X_c = \begin{bmatrix} -\; x_{c1}^T \;- \\ \vdots \\ -\; x_{cn}^T \;- \end{bmatrix} \begin{bmatrix} | & & | \\ x_{c1} & \cdots & x_{cn} \\ | & & | \end{bmatrix} = \begin{bmatrix} x_{ci}^T x_{cj} \end{bmatrix}$$

This is the matrix of dot products of the centered data points. Notice that you don't need the points themselves, only their dot products (similarities).
Extensions

To compute PCA, we use the fact that

$$\frac{1}{n}\left(X_c^T X_c\right)\left(X_c^T X_c\right) = \frac{1}{n} K_c K_c$$

But if $K_c$ has eigendecomposition (Λ, Γ), then

$$\frac{1}{n} K_c K_c = \frac{1}{n}\, \Gamma \Lambda \Gamma^T \Gamma \Lambda \Gamma^T = \Gamma \left(\frac{1}{n}\Lambda^2\right) \Gamma^T$$

i.e. $n^{-1}(X_c^T X_c)(X_c^T X_c)$ has eigendecomposition (Λ²/n, Γ).
Extensions

In summary, to get PCA:
• 1) compute the dot-product matrix K
• 2) compute its eigendecomposition (Λ, Γ)

PCA:
• the principal components are then given by $\Phi = X_c \Gamma$
• the eigenvalues are given by Λ²
• the projection of the data points on the principal components is given by

$$X_c^T \Phi = X_c^T X_c \Gamma = K \Gamma$$

This allows the computation of the eigenvalues and PCA coefficients when we only have access to the dot-product matrix K.
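This recipe can be checked numerically (our own sketch, not from the slides; it verifies the related identity that the nonzero eigenvalues of K are n times the covariance eigenvalues, and forms the projections K Γ):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4
X = rng.normal(size=(d, n))                 # one example per column
Xc = X - X.mean(axis=1, keepdims=True)

# all we keep is the dot-product matrix of the centered points
K = Xc.T @ Xc                               # K_ij = x_ci^T x_cj

lam, Gamma = np.linalg.eigh(K)              # eigendecomposition of K
lam, Gamma = lam[::-1], Gamma[:, ::-1]      # sort descending

# the nonzero eigenvalues of K are n times the covariance eigenvalues
cov_eigs = np.sort(np.linalg.eigvalsh(Xc @ Xc.T / n))[::-1]
assert np.allclose(lam[:d], n * cov_eigs)

# projections of the data points on the principal components
proj = K @ Gamma[:, :d]
```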
The dot product form

This turns out to be the case for many learning algorithms: if you manipulate a little bit, you can write them in "dot product form".

Definition: a learning algorithm is in dot product form if, given a training set

D = {(x1, y1), ..., (xn, yn)},

it only depends on the points xi through their dot products $x_i^T x_j$.

For example, let's look at k-means.
Clustering

We saw that k-means iterates between:
• 1) classification:

$$i^*(x) = \arg\min_i \| x - \mu_i \|^2$$

• 2) re-estimation:

$$\mu_i^{new} = \frac{1}{n_i} \sum_j x_j^{(i)}$$

Note that

$$\| x - \mu_i \|^2 = (x - \mu_i)^T (x - \mu_i) = x^T x - 2\, x^T \mu_i + \mu_i^T \mu_i$$
Clustering

and

$$\mu_i = \frac{1}{n_i} \sum_j x_j^{(i)}$$

Combining the two, we can write the classification step as a function of the dot products $x_i^T x_j$:

$$\| x_k - \mu_i \|^2 = x_k^T x_k - \frac{2}{n_i} \sum_j x_k^T x_j^{(i)} + \frac{1}{n_i^2} \sum_{jl} x_j^{(i)T} x_l^{(i)}$$
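This identity is easy to verify numerically (a hypothetical example with made-up data; rows of `X` play the role of the cluster points $x_j^{(i)}$):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))      # points x_j of one cluster, one per row
x = rng.normal(size=3)           # query point x_k
n = len(X)
mu = X.mean(axis=0)

direct = np.sum((x - mu) ** 2)   # ||x_k - mu_i||^2 computed from the points

# dot-product-only form: x^T x - (2/n) sum_j x^T x_j + (1/n^2) sum_jl x_j^T x_l
G = X @ X.T                      # Gram matrix of the cluster points
dot_form = x @ x - (2.0 / n) * (X @ x).sum() + G.sum() / n**2

assert np.isclose(direct, dot_form)
```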
The kernel trick

Why is this interesting? Consider a transformation of the feature space:
• introduce a mapping

Φ: X → Z

such that dim(Z) > dim(X)

[Figure: data in X (coordinates x1, x2) that is not linearly separable becomes separable after the mapping Φ into Z (coordinates x1, x2, x3).]

If the algorithm only depends on the data through the dot products

$$x_i^T x_j$$

then, in the transformed space, it only depends on

$$\phi(x_i)^T \phi(x_j)$$
The dot product implementation

In the transformed space, the learning algorithm only requires the dot products

$$\Phi(x_j)^T \Phi(x_i)$$

Note that we no longer need to store the Φ(xj), only the n² dot-product matrix.

Interestingly, this holds even when Φ(x) is infinite dimensional: we get a reduction from infinity to n²!

There is, however, still one problem:
• when dim[Φ(xj)] is infinite, the computation of the dot products looks impossible
The "kernel trick"

"Instead of defining Φ(x), computing Φ(xi) for each i, and Φ(xi)^TΦ(xj) for each pair (i, j), simply define the function

$$K(x, z) = \Phi(x)^T \Phi(z)$$

and work with it directly."

K(x, z) is called a dot-product kernel. In fact, since we only use the kernel, why define Φ(x)? Just define the kernel K(x, z) directly! In this way we never have to deal with the complexity of Φ(x)...

This is usually called the "kernel trick".
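A minimal sketch of the trick (our own example, using the Gaussian kernel that appears on the next slide): the Gram matrix is computed from K(x, z) directly, and Φ never appears anywhere in the code.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / sigma); the feature map Phi is never formed
    return np.exp(-np.sum((x - z) ** 2) / sigma)

rng = np.random.default_rng(0)
pts = rng.normal(size=(6, 2))    # six made-up 2D points
n = len(pts)

# n^2 dot-product (Gram) matrix, computed without ever touching Phi
K = np.array([[gaussian_kernel(pts[i], pts[j]) for j in range(n)]
              for i in range(n)])

assert np.allclose(K, K.T)            # symmetric, as a dot-product matrix must be
assert np.allclose(np.diag(K), 1.0)   # K(x, x) = e^0 = 1
```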
Questions

I am confused!

How do I know that if I pick a function K(x, z), it is equivalent to Φ(x)^TΦ(z)?
• in general, it is not. We will talk about this later.

If it is, how do I know what Φ(x) is?
• you may never know. E.g. the Gaussian kernel

$$K(x, z) = e^{-\frac{\| x - z \|^2}{\sigma}}$$

is very popular. It is not obvious what Φ(x) is...
• on the positive side, we did not know how to choose Φ(x) anyway. Choosing instead K(x, z) makes no difference.

Why is it that using K(x, z) is easier/better?
• complexity. Let's look at an example.
Polynomial kernels

Still in ℝ^d, consider the square of the dot product between two vectors:

$$\left(x^T z\right)^2 = \left(\sum_{i=1}^d x_i z_i\right)\left(\sum_{j=1}^d x_j z_j\right) = \sum_{i=1}^d \sum_{j=1}^d x_i x_j z_i z_j$$

$$= \begin{array}{l} x_1 x_1 z_1 z_1 + x_1 x_2 z_1 z_2 + \cdots + x_1 x_d z_1 z_d \\ +\, x_2 x_1 z_2 z_1 + x_2 x_2 z_2 z_2 + \cdots + x_2 x_d z_2 z_d \\ \quad \vdots \\ +\, x_d x_1 z_d z_1 + x_d x_2 z_d z_2 + \cdots + x_d x_d z_d z_d \end{array}$$
Polynomial kernels

This can be written as

$$\left(x^T z\right)^2 = \underbrace{\left[x_1 x_1,\, x_1 x_2,\, \ldots,\, x_1 x_d,\, x_2 x_1,\, \ldots,\, x_d x_d\right]}_{\Phi(x)^T} \underbrace{\begin{bmatrix} z_1 z_1 \\ z_1 z_2 \\ \vdots \\ z_d z_d \end{bmatrix}}_{\Phi(z)}$$

Hence, we have

$$K(x, z) = \left(x^T z\right)^2 = \Phi(x)^T \Phi(z), \quad \text{with} \quad \Phi: \mathbb{R}^d \to \mathbb{R}^{d^2}$$

$$\begin{pmatrix} x_1 \\ \vdots \\ x_d \end{pmatrix} \to \left(x_1 x_1,\, x_1 x_2,\, \ldots,\, x_1 x_d,\, x_2 x_1,\, \ldots,\, x_d x_d\right)^T$$
Polynomial kernels

The point is that:
• while Φ(x)^TΦ(z) has complexity O(d²)
• direct computation of K(x, z) = (x^Tz)² has complexity O(d)

Direct evaluation is more efficient by a factor of d. As d goes to infinity, this is what makes the idea feasible.

BTW, you just met another kernel family:
• this implements polynomials of second order
• in general, the family of polynomial kernels is defined as

$$K(x, z) = \left(1 + x^T z\right)^k, \quad k \in \{1, 2, \ldots\}$$

• I don't even want to think about writing down Φ(x)!
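For the second-order case, the equivalence between the O(d) kernel evaluation and the O(d²) explicit feature map can be checked directly (our own sketch; `phi` is the map Φ(x) built from all products x_i x_j):

```python
import numpy as np

def phi(x):
    # explicit second-order feature map: all products x_i x_j, a vector in R^{d^2}
    return np.outer(x, x).ravel()

rng = np.random.default_rng(0)
d = 5
x, z = rng.normal(size=d), rng.normal(size=d)

# O(d) kernel evaluation vs O(d^2) explicit feature-space dot product
assert np.isclose((x @ z) ** 2, phi(x) @ phi(z))
```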
Kernel summary

1. D is not easy to deal with in X; apply a feature transformation Φ: X → Z, such that dim(Z) >> dim(X)
2. computing Φ(x) is too expensive:
• write your learning algorithm in dot-product form
• instead of Φ(xi), we only need Φ(xi)^TΦ(xj) ∀ i, j
3. instead of computing Φ(xi)^TΦ(xj) ∀ i, j, define the "dot-product kernel"

$$K(x, z) = \Phi(x)^T \Phi(z)$$

and compute K(xi, xj) ∀ i, j directly
• note: the matrix

$$K = \begin{bmatrix} & \vdots & \\ \cdots & K(x_i, x_j) & \cdots \\ & \vdots & \end{bmatrix}$$

is called the "kernel" or Gram matrix
4. forget about Φ(x) and use K(x, z) from the start!
Question

What is a good dot-product kernel?
• this is a difficult question (see Prof. Lanckriet's work)

In practice, the usual recipe is:
• pick a kernel from a library of known kernels
• we have already met
• the linear kernel K(x, z) = x^Tz
• the Gaussian family

$$K(x, z) = e^{-\frac{\| x - z \|^2}{\sigma}}$$

• the polynomial family

$$K(x, z) = \left(1 + x^T z\right)^k, \quad k \in \{1, 2, \ldots\}$$
Dot-product kernels

This may not be a bad idea:
• we reap the benefits of a high-dimensional space without a price in complexity
• the kernel simply adds a few parameters (σ, k); learning Φ instead would imply introducing many parameters (up to n²)

What if I need to check whether K(x, z) is a kernel?

Definition: a mapping

k: X × X → ℝ
(x, y) → k(x, y)

is a dot-product kernel if and only if

k(x, y) = <Φ(x), Φ(y)>

where Φ: X → H, H is a vector space, and <.,.> is a dot product in H.

[Figure: the map Φ takes the data from X to the dot-product space H.]
Positive definite matrices

Recall that (e.g. Linear Algebra and Its Applications, Strang):

Definition: each of the following is a necessary and sufficient condition for a real symmetric matrix A to be (semi)positive definite:
i) x^TAx ≥ 0, ∀ x ≠ 0
ii) all eigenvalues of A satisfy λi ≥ 0
iii) all upper-left submatrices Ak have non-negative determinant
iv) there is a matrix R with independent rows such that A = R^TR

Upper-left submatrices:

$$A_1 = a_{1,1}, \qquad A_2 = \begin{bmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2} \end{bmatrix}, \qquad A_3 = \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{bmatrix}$$
Positive definite matrices

Property iv) is particularly interesting:
• in ℝ^d, <x, y> = x^TAy is a dot-product kernel if and only if A is positive definite
• from iv), this holds if and only if there is R such that A = R^TR
• hence

<x, y> = x^TAy = (Rx)^T(Ry) = Φ(x)^TΦ(y)

with

Φ: ℝ^d → ℝ^d
x → Rx

i.e. the dot-product kernel

k(x, z) = x^TAz, (A positive definite)

is the standard dot product in the range space of the mapping Φ(x) = Rx.
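Conditions ii) and iv) and the resulting kernel can be verified numerically (a sketch with a randomly generated R, whose rows are independent almost surely, so A = R^TR is positive semidefinite):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
R = rng.normal(size=(d, d))      # rows independent (almost surely)
A = R.T @ R                      # condition iv): A = R^T R is (semi)positive definite

# condition ii): all eigenvalues are >= 0 (up to floating-point error)
assert np.all(np.linalg.eigvalsh(A) >= -1e-10)

# k(x, z) = x^T A z is the ordinary dot product after the map Phi(x) = R x
x, z = rng.normal(size=d), rng.normal(size=d)
assert np.isclose(x @ A @ z, (R @ x) @ (R @ z))
```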
Positive definite kernels

How do I extend this notion of positive definiteness to functions?

Definition: a function k(x, y) is a positive definite kernel on X × X if, ∀ l and ∀ {x1, ..., xl}, xi ∈ X, the Gram matrix

$$K = \begin{bmatrix} & \vdots & \\ \cdots & k(x_i, x_j) & \cdots \\ & \vdots & \end{bmatrix}$$

is positive definite.

Like in ℝ^d, this allows us to check that we have a positive definite kernel.
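As a sanity check (not a proof, which would require all point sets), we can test the definition on one random point set with the Gaussian kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.normal(size=(20, 3))   # one arbitrary point set {x_1, ..., x_l}
sigma = 2.0

# Gram matrix of the Gaussian kernel on this point set
D2 = np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=-1)
K = np.exp(-D2 / sigma)

# the definition requires this matrix to be positive (semi)definite
assert np.all(np.linalg.eigvalsh(K) >= -1e-8)
```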
Dot product kernels

Theorem: k(x, y), x, y ∈ X, is a dot-product kernel if and only if it is a positive definite kernel.

In summary, to check whether a kernel is a dot-product kernel:
• check if the Gram matrix is positive definite
• for all possible sequences {x1, ..., xl}, xi ∈ X

Does the kernel have to be a dot-product kernel? Not necessarily. For example, neural networks can be seen as implementing kernels that are not of this type.

However:
• you lose the parallelism: what you know about the learning machine may no longer hold after you kernelize
• dot-product kernels usually lead to convex learning problems. Usually you lose this guarantee for non-dot-product kernels.
Clustering

So far, this is mostly theoretical. How does it affect my algorithms?

Consider, for example, the k-means algorithm:
• 1) classification:

$$i^*(x) = \arg\min_i \| x - \mu_i \|^2$$

• 2) re-estimation:

$$\mu_i^{new} = \frac{1}{n_i} \sum_j x_j^{(i)}$$

Can we kernelize the classification step?
Clustering

Well, we saw that

$$\| x_k - \mu_i \|^2 = x_k^T x_k - \frac{2}{n_i} \sum_j x_k^T x_j^{(i)} + \frac{1}{n_i^2} \sum_{jl} x_j^{(i)T} x_l^{(i)}$$

This can then be kernelized into

$$\| \Phi(x_k) - \mu_i \|^2 = \Phi(x_k)^T \Phi(x_k) - \frac{2}{n_i} \sum_j \Phi(x_k)^T \Phi\!\left(x_j^{(i)}\right) + \frac{1}{n_i^2} \sum_{jl} \Phi\!\left(x_j^{(i)}\right)^T \Phi\!\left(x_l^{(i)}\right)$$
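The kernelized distance can be sketched as a function of the Gram matrix alone (the helper name `kernel_dist2` and the cluster indices are made up; with the linear kernel it must agree with the ordinary Euclidean distance to the cluster mean):

```python
import numpy as np

def kernel_dist2(K, l, members):
    """||Phi(x_l) - mu_i||^2, computed from the Gram matrix K alone."""
    n_i = len(members)
    return (K[l, l]
            - 2.0 / n_i * K[l, members].sum()
            + K[np.ix_(members, members)].sum() / n_i**2)

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 3))          # rows are points
members = np.array([0, 2, 4, 5])     # indices of one (made-up) cluster
K = X @ X.T                          # linear kernel: Phi is the identity

# with the linear kernel, this must equal the explicit distance to the mean
mu = X[members].mean(axis=0)
assert np.isclose(kernel_dist2(K, 6, members), np.sum((X[6] - mu) ** 2))
```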
Clustering

Furthermore, this can be done with relative efficiency:

$$\| \Phi(x_k) - \mu_i \|^2 = \Phi(x_k)^T \Phi(x_k) - \frac{2}{n_i} \sum_j \Phi(x_k)^T \Phi\!\left(x_j^{(i)}\right) + \frac{1}{n_i^2} \sum_{jl} \Phi\!\left(x_j^{(i)}\right)^T \Phi\!\left(x_l^{(i)}\right)$$

The assignment of the point only requires:
• the first term: the kth diagonal entry of the Gram matrix
• the last term: computed once per cluster, when all points are assigned; this is a sum of entries of the Gram matrix
• the middle term: computing $\frac{2}{n_i}\sum_j \Phi(x_k)^T \Phi\!\left(x_j^{(i)}\right)$ for each cluster
Clustering

Note, however, that we cannot explicitly compute

$$\mu_i = \frac{1}{n_i} \sum_j \Phi\!\left(x_j^{(i)}\right)$$

This is probably infinite dimensional...

In any case, if we define:
• a Gram matrix K^(i) for each cluster (dot products between points in the cluster)
• and S^(i), the scaled sum of the entries in this matrix

$$S^{(i)} = \frac{1}{n_i^2} \sum_{jl} \Phi\!\left(x_j^{(i)}\right)^T \Phi\!\left(x_l^{(i)}\right)$$
Clustering

we obtain the kernel k-means algorithm:
• 1) classification:

$$i^*(x_l) = \arg\min_i \left[ K_{ll} + S^{(i)} - \frac{2}{n_i} \sum_j \Phi(x_l)^T \Phi\!\left(x_j^{(i)}\right) \right]$$

• 2) re-estimation: update

$$S^{(i)} = \frac{1}{n_i^2} \sum_{jl} \Phi\!\left(x_j^{(i)}\right)^T \Phi\!\left(x_l^{(i)}\right)$$

But we no longer have access to the prototype for each cluster.
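The two steps above can be assembled into a minimal kernel k-means sketch (our own implementation outline, not from the slides; the function name, the initialization, and the convergence handling are simplistic choices):

```python
import numpy as np

def kernel_kmeans(K, labels, n_iters=20):
    """Kernel k-means operating on the Gram matrix K only.
    labels: initial cluster assignment, one integer per point."""
    n = K.shape[0]
    for _ in range(n_iters):
        ks = np.unique(labels)
        dist = np.empty((n, len(ks)))
        for idx, i in enumerate(ks):
            members = np.where(labels == i)[0]
            n_i = len(members)
            # S^(i): scaled sum of Gram entries within the cluster
            S_i = K[np.ix_(members, members)].sum() / n_i**2
            # ||Phi(x_l) - mu_i||^2 = K_ll - (2/n_i) sum_j K_lj + S^(i)
            dist[:, idx] = np.diag(K) - 2.0 / n_i * K[:, members].sum(axis=1) + S_i
        new_labels = ks[np.argmin(dist, axis=1)]
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# sanity check: with the linear kernel this reduces to ordinary k-means,
# so two well-separated blobs must be recovered from a slightly wrong init
rng = np.random.default_rng(0)
A = rng.normal(0.0, 0.5, (10, 2))
B = rng.normal(10.0, 0.5, (10, 2))
X = np.vstack([A, B])
K = X @ X.T                              # linear kernel
init = np.array([0] * 10 + [1] * 10)
init[0], init[10] = 1, 0                 # corrupt two labels
out = kernel_kmeans(K, init)
assert np.array_equal(out, np.array([0] * 10 + [1] * 10))
```

With the linear kernel the implicit prototypes are just the cluster means, which is what makes this sanity check possible; with a nonlinear kernel the prototypes live in feature space and can no longer be read off.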
Clustering

With the right kernel, this can work significantly better than regular k-means.

[Figure: a data set that k-means fails to cluster is correctly clustered by kernel k-means after the mapping Φ.]
Clustering

But for other applications, where the prototypes are important, this may be useless, e.g. compression.

We can try replacing the prototype by the closest vector, but this is not necessarily optimal.
PCA

We saw that, to get PCA:
• 1) compute the dot-product matrix K
• 2) compute its eigendecomposition (Λ, Γ)

PCA:
• the principal components are then given by $\Phi = X_c \Gamma$
• the eigenvalues are given by Λ²
• the projection of the data points on the principal components is given by

$$X_c^T \Phi = K \Gamma$$

Note that most of this holds when we kernelize; we only have to change the matrix K from $x_i^T x_j$ to $\phi(x_i)^T \phi(x_j)$.
• the only thing we can no longer access are the PCs $\Phi = X_c \Gamma$
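One practical detail worth making explicit: after replacing $x_i^T x_j$ by $\phi(x_i)^T \phi(x_j)$, the centering must also be done in feature space, i.e. directly on the Gram matrix. A sketch (our own, assuming the Gaussian kernel; H is the centering matrix from the earlier slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                       # rows are points
D2 = np.sum((X[:, None] - X[None, :]) ** 2, -1)
K = np.exp(-D2 / 1.0)                              # Gaussian kernel Gram matrix

# center in feature space: Kc = H K H, with H = I - (1/n) 1 1^T
n = len(X)
H = np.eye(n) - np.ones((n, n)) / n
Kc = H @ K @ H

lam, Gamma = np.linalg.eigh(Kc)
lam, Gamma = lam[::-1], Gamma[:, ::-1]             # sort descending
proj = Kc @ Gamma[:, :2]                           # kernel PCA coordinates (2 components)

# H 1 = 0, so the centered Gram matrix has zero row and column sums
assert np.allclose(Kc @ np.ones(n), 0.0)
```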
Kernel methods

Most learning algorithms can be kernelized:
• kernel PCA
• kernel LDA
• kernel ICA
• etc.

As in k-means, sometimes we lose some of the features of the original algorithm, but the performance is frequently better.

Next week we will look at the canonical application: the support vector machine.