Introduction to Machine Learning

1

Bernhard Schölkopf

Empirical Inference Department

Max Planck Institute for Intelligent Systems

Tübingen, Germany

http://www.tuebingen.mpg.de/bs

Introduction to Machine Learning

2

Empirical Inference

• Drawing conclusions from empirical data (observations, measurements)

• Example 1: scientific inference

•

x

x

x

x

x

x

y

x

x

x

Leibniz, Weyl, Chaitin

x

y = a * x

y = Σi ai k(x,xi) + b

3

Empirical Inference

• Drawing conclusions from empirical data (observations, measurements)

• Example 1: scientific inference

•

“If your experiment needs statistics [inference],

you ought to have done a better experiment.” (Rutherford)

4

Empirical Inference, II

• Example 2: perception

“The brain is nothing but a statistical decision organ”

(H. Barlow)

5

Hard Inference Problems

Sonnenburg, Rätsch, Schäfer,

Schölkopf, 2006, Journal of Machine

Learning Research

Task: classify human DNA

sequence locations into acceptor

splice site, decoy using 15

Million sequences of length 141,

and a Multiple-Kernel Support

Vector Machines.

PRC = Precision-Recall-Curve,

fraction of correct positive

predictions among all positively

predicted cases

• High dimensionality – consider many factors simultaneously to find the regularity

• Complex regularities – nonlinear, nonstationary, etc.

• Little prior knowledge – e.g., no mechanistic models for the data

• Need large data sets – processing requires computers and automatic inference methods

6

Hard Inference Problems, II

• We can solve scientific inference problems that humans can’t solve

• Even if it’s just because of data set size / dimensionality, this is a quantum leap

7

• observe 1, 2, 4, 7,..

• What’s next?

• 1,2,4,7,11,16,…: an+1=an+n (“lazy caterer’s sequence”)

• 1,2,4,7,12,20,…: an+2=an+1+an+1

• 1,2,4,7,13,24,…: “Tribonacci”-sequence

• 1,2,4,7,14,28: set of divisors of 28

• 1,2,4,7,1,1,5,…: decimal expansions of p=3,14159…

and e=2,718… interleaved

• The On-Line Encyclopedia of Integer Sequences: >600 hits…

+1 +2 +3

Generalization (thanks to O. Bousquet)

http://www.research.att.com/~njas/sequences/Seis.html



Generalization, II

8

• Question: which continuation is correct (“generalizes”)?

• Answer: there’s no way to tell (“induction problem”)

• Question of statistical learning theory: how to come up

with a law that is (probably) correct (“demarcation problem”) (more accurately: a law that is probably as correct on the test data as it is on the training data)

9

2-class classification

Learn based on m observations

generated from some

Goal: minimize expected error (“risk”)

Problem: P is unknown.

Induction principle: minimize training error (“empirical risk”)

V. Vapnik

over some class of functions. Q: is this “consistent”?

10

The law of large numbers

For all and

Does this imply “consistency” of empirical risk minimization

(optimality in the limit)?

No – need a uniform law of large numbers:

For all

Consistency and uniform convergence

-> LaTeX

Bernhard Schölkopf Empirical Inference Department Tübingen , 03 October 2011

12

Bernhard Schölkopf, 03

October 2011 13

• sparse expansion of solution in terms of SVs (Boser, Guyon, Vapnik 1992):

representer theorem (Kimeldorf & Wahba 1971, Schölkopf et al. 2000)

• unique solution found by convex QP

Support Vector Machines

+-

+

+-

-

-

-

-

++

+

F

k(x,x’) = <F(x),F(x’)>

class 1 class 2

Bernhard Schölkopf, 03

October 2011 14

• sparse expansion of solution in terms of SVs (Boser, Guyon, Vapnik 1992):

representer theorem (Kimeldorf & Wahba 1971, Schölkopf et al. 2000)

• unique solution found by convex QP

Support Vector Machines

+-

+

+-

-

-

-

-

++

+

F

k(x,x’) = <F(x),F(x’)>

class 1 class 2

15

Applications in Computational Geometry / Graphics

Steinke, Walder, Blanz et al.,

Eurographics’05, ’06, ‘08, ICML ’05,‘08,

NIPS ’07

Max-Planck-Institut für

biologische Kybernetik

Bernhard Schölkopf, Tübingen,

3. Oktober 2011 16 FIFA World Cup – Germany vs. England, June 27, 2010

Kernel Quiz

Bernhard Schölkopf Empirical Inference Department Tübingen , 03 October 2011

17

Kernel Methods

Bernhard Scholkopf

Max Planck Institute for Intelligent Systems

B. Scholkopf, MLSS France 2011

Statistical Learning Theory

1. started by Vapnik and Chervonenkis in the Sixties

2. model: we observe data generated by an unknown stochasticregularity

3. learning = extraction of the regularity from the data

4. the analysis of the learning problem leads to notions of capacityof the function classes that a learning machine can implement.

5. support vector machines use a particular type of function class:classifiers with large “margins” in a feature space induced by akernel.

[47, 48]


Example: Regression Estimation

y

x

•Data: input-output pairs (xi, yi) ∈ R× R

•Regularity: (x1, y1), . . . (xm, ym) drawn from P(x, y)

• Learning: choose a function f : R → R such that the error,averaged over P, is minimized.

• Problem: P is unknown, so the average cannot be computed— need an “induction principle”

Pattern Recognition

Learn f : X → ±1 from examples

(x1, y1), . . . , (xm, ym) ∈ X×±1, generated i.i.d. from P(x, y),

such that the expected misclassification error on a test set, alsodrawn from P(x, y),

R[f ] =

∫

1

2|f (x)− y)| dP(x, y),

is minimal (Risk Minimization (RM)).

Problem: P is unknown. −→ need an induction principle.

Empirical risk minimization (ERM): replace the average overP(x, y) by an average over the training sample, i.e. minimize thetraining error

Remp[f ] =1

m

∑m

i=1

1

2|f (xi)− yi|


Convergence of Means to Expectations

Law of large numbers:

Remp[f ] → R[f ]

as m→ ∞.

Does this imply that empirical risk minimization will give us theoptimal result in the limit of infinite sample size (“consistency”of empirical risk minimization)?

No.Need a uniform version of the law of large numbers. Uniform overall functions that the learning machine can implement.


Consistency and Uniform Convergence

Risk

Function class

R

Remp

f f fopt m

R[f]

R [f]emp


The Importance of the Set of Functions

What about allowing all functions from X to ±1?Training set (x1, y1), . . . , (xm, ym) ∈ X× ±1Test patterns x1, . . . , xm ∈ X,such that x1, . . . , xm ∩ x1, . . . ,xm = .

For any f there exists f∗ s.t.:1. f∗(xi) = f (xi) for all i2. f∗(xj) 6= f (xj) for all j.

Based on the training set alone, there is no means of choosingwhich one is better. On the test set, however, they give oppositeresults. There is ’no free lunch’ [24, 56].−→ a restriction must be placed on the functions that we allow


Restricting the Class of Functions

Two views:

1. Statistical Learning (VC) Theory: take into account the ca-pacity of the class of functions that the learning machine canimplement

2. The Bayesian Way: place Prior distributions P(f ) over theclass of functions


Detailed Analysis

• loss ξi := 12|f (xi)− yi| in 0, 1

• the ξi are independent Bernoulli trials• empirical mean 1

m

∑mi=1 ξi (by def: equals Remp[f ])

• expected value E [ξ] (equals R[f ])


Chernoff’s Bound

P

∣

∣

∣

∣

∣

∣

1

m

m∑

i=1

ξi − E [ξ]

∣

∣

∣

∣

∣

∣

≥ ǫ

≤ 2 exp(−2mǫ2)

• here, P refers to the probability of getting a sample ξ1, . . . , ξmwith the property

∣

∣

∣

1m

∑mi=1 ξi − E [ξ]

∣

∣

∣ ≥ ǫ (is a product mea-

sure)

Useful corollary: Given a 2m-sample of Bernoulli trials, we have

P

∣

∣

∣

∣

∣

∣

1

m

m∑

i=1

ξi −1

m

2m∑

i=m+1

ξi

∣

∣

∣

∣

∣

∣

≥ ǫ

≤ 4 exp

(

−mǫ2

2

)

.


Chernoff’s Bound, II

Translate this back into machine learning terminology: the prob-ability of obtaining anm-sample where the training error and testerror differ by more than ǫ > 0 is bounded by

P∣

∣Remp[f ]− R[f ]∣

∣ ≥ ǫ

≤ 2 exp(−2mǫ2).

• refers to one fixed f

• not allowed to look at the data before choosing f , hence notsuitable as a bound on the test error of a learning algorithmusing empirical risk minimization


Uniform Convergence (Vapnik & Chervonenkis)

Necessary and sufficient conditions for nontrivial consistency ofempirical risk minimization (ERM):One-sided convergence, uniformly over all functions that can beimplemented by the learning machine.

limm→∞P sup

f∈F(R[f ]−Remp[f ]) > ǫ = 0

for all ǫ > 0.

• note that this takes into account the whole set of functions thatcan be implemented by the learning machine

• this is hard to check for a learning machine

Are there properties of learning machines (≡ sets of functions)which ensure uniform convergence of risk?


How to Prove a VC Bound

Take a closer look at Psupf∈F(R[f ]− Remp[f ]) > ǫ.Plan:

• if the function class F contains only one function, then Cher-noff’s bound suffices:

P supf∈F

(R[f ]− Remp[f ]) > ǫ ≤ 2 exp(−2mǫ2).

• if there are finitely many functions, we use the ’union bound’

• even if there are infinitely many, then on any finite samplethere are effectively only finitely many (use symmetrizationand capacity concepts)


The Case of Two Functions

Suppose F = f1, f2. RewriteP sup

f∈F(R[f ]− Remp[f ]) > ǫ = P(C1

ǫ ∪ C2ǫ ),

where

Ciǫ := (x1, y1), . . . , (xm, ym) | (R[fi]− Remp[fi]) > ǫdenotes the event that the risks of fi differ by more than ǫ.The RHS equals

P(C1ǫ ∪ C2

ǫ ) = P(C1ǫ ) + P(C2

ǫ )− P(C1ǫ ∩ C2

ǫ )

≤ P(C1ǫ ) + P(C2

ǫ ).

Hence by Chernoff’s bound

P supf∈F

(R[f ]−Remp[f ]) > ǫ ≤ P(C1ǫ ) + P(C2

ǫ )

≤ 2 · 2 exp(−2mǫ2).

The Union Bound

Similarly, if F = f1, . . . , fn, we haveP sup

f∈F(R[f ]− Remp[f ]) > ǫ = P(C1

ǫ ∪ · · · ∪ Cnǫ ),

and

P(C1ǫ ∪ · · · ∪ Cnǫ ) ≤

n∑

i=1

P(Ciǫ).

Use Chernoff for each summand, to get an extra factor n in thebound.

Note: this becomes an equality if and only if all the events Ciǫinvolved are disjoint.


Infinite Function Classes

•Note: empirical risk only refers to m points. On these points,the functions of F can take at most 2m values

• for Remp, the function class thus “looks” finite

• how about R?

• need to use a trick


Symmetrization

Lemma 1 (Vapnik & Chervonenkis (e.g., [46, 12]))For mǫ2 ≥ 2 we have

P supf∈F

(R[f ]−Remp[f ]) > ǫ ≤ 2P supf∈F

(Remp[f ]−R′emp[f ]) > ǫ/2

Here, the first P refers to the distribution of iid samples ofsize m, while the second one refers to iid samples of size 2m.In the latter case, Remp measures the loss on the first half ofthe sample, and R′

emp on the second half.


Shattering Coefficient

•Hence, we only need to consider the maximum size of F on 2mpoints. Call it N(F, 2m).

•N(F, 2m) = max. number of different outputs (y1, . . . , y2m)that the function class can generate on 2m points — in otherwords, the max. number of different ways the function class canseparate 2m points into two classes.

•N(F, 2m) ≤ 22m

• if N(F, 2m) = 22m, then the function class is said to shatter2m points.


Putting Everything Together

We now use (1) symmetrization, (2) the shattering coefficient, and(3) the union bound, to get

Psupf∈F

(R[f ]− Remp[f ]) > ǫ

≤ 2Psupf∈F

(Remp[f ]−R′emp[f ]) > ǫ/2

= 2P(Remp[f1]−R′emp[f1]) > ǫ/2 ∨. . .∨ (Remp[fN(F,2m)]−R′

emp[fN(F,2m)]) > ǫ/2

≤N(F,2m)∑

n=1

2P(Remp[fn]−R′emp[fn]) > ǫ/2.


ctd.

Use Chernoff’s bound for each term:∗

P

1

m

m∑

i=1

ξi −1

m

2m∑

i=m+1

ξi ≥ ǫ

≤ 2 exp

(

−mǫ2

2

)

.

This yields

P supf∈F

(R[f ]−Remp[f ]) > ǫ ≤ 4N(F, 2m) exp

(

−mǫ2

8

)

.

• provided that N(F, 2m) does not grow exponentially in m, thisis nontrivial

• such bounds are called VC type inequalities

• two types of randomness: (1) the P refers to the drawing ofthe training examples, and (2) R[f ] is an expectation over thedrawing of test examples.

∗ Note that the fi depend on the 2m−sample. A rigorous treatment would need to use a second random-

ization over permutations of the 2m-sample, see [36].

Confidence Intervals

Rewrite the bound: specify the probability with which we want Rto be close to Remp, and solve for ǫ:

With a probability of at least 1− δ,

R[f ] ≤ Remp[f ] +

√

8

m

(

ln(N(F, 2m)) + ln4

δ

)

.

This bound holds independent of f ; in particular, it holds for thefunction fm minimizing the empirical risk.


Discussion

• tighter bounds are available (better constants etc.)• cannot minimize the bound over f

• other capacity concepts can be used


VC Entropy

On an example (x, y), f causes a loss

ξ(x, y, f (x)) =1

2|f (x)− y| ∈ 0, 1.

For a larger sample (x1, y1), . . . , (xm, ym), the different functionsf ∈ F lead to a set of loss vectors

ξf = (ξ(x1, y1, f (x1)), . . . , ξ(xm, ym, f (xm))),

whose cardinality we denote by

N (F, (x1, y1) . . . , (xm, ym)) .

The VC entropy is defined as

HF(m) = E [lnN (F, (x1, y1) . . . , (xm, ym))] ,

where the expectation is taken over the random generation of them-sample (x1, y1) . . . , (xm, ym) from P.

HF(m)/m → 0 ⇐⇒ uniform convergence of risks (hence consis-tency)

Further PR Capacity Concepts

• exchange ’E’ and ’ln’: annealed entropy .

HannF

(m)/m→ 0 ⇐⇒ exponentially fast uniform convergence

• take ’max’ instead of ’E’: growth function.Note that GF(m) = lnN(F,m).

GF(m)/m→ 0⇐⇒ exponential convergence for all underlyingdistributions P.

GF(m) = m · ln(2) for all m ⇐⇒ for any m, all loss vectorscan be generated, i.e., the m points can be chosen such that byusing functions of the learning machine, they can be separatedin all 2m possible ways (shattered).


Structure of the Growth Function

Either GF(m) = m · ln(2) for all m ∈ N

Or there exists some maximal m for which the above is possible.Call this number the VC-dimension, and denote it by h. Form > h,

GF(m) ≤ h(

lnm

h+ 1)

.

Nothing “in between” linear growth and logarithmic growth ispossible.


VC-Dimension: Example

Half-spaces in R2:

f (x, y) = sgn(a + bx + cy), with parameters a, b, c ∈ R

• Clearly, we can shatter three non-collinear points.

• But we can never shatter four points.

•Hence the VC dimension is h = 3 (in this case, equal to thenumber of parameters)

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

x

x

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

x

x

x

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

x

x

x

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

x

x

x

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

x

x

x

x

x

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

x

x

x

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

x

x

x


A Typical Bound for Pattern Recognition

For any f ∈ F and m > h, with a probability of at least 1− δ,

R[f ] ≤ Remp[f ] +

√

√

√

√

h(

log 2mh + 1

)

− log(δ/4)

m

holds.

• does this mean, that we can learn anything?

• The study of the consistency of ERM has thus led to conceptsand results which lets us formulate another induction principle(structural risk minimization)


SRM

R(f* )

h

training error

capacity term

error

Sn−1 Sn Sn+1

structure

bound on test error


Finding a Good Function Class

• recall: separating hyperplanes in R2 have a VC dimension of 3.

•more generally: separating hyperplanes in RN have a VC di-

mension of N + 1.

• hence: separating hyperplanes in high-dimensional featurespaces have extremely large VC dimension, and may not gener-alize well

• however, margin hyperplanes can still have a small VC dimen-sion


Kernels and Feature Spaces

Preprocess the data with

Φ : X → H

x 7→ Φ(x),

whereH is a dot product space, and learn the mapping from Φ(x)to y [6].

• usually, dim(X) ≪ dim(H)

• “Curse of Dimensionality”?

• crucial issue: capacity, not dimensionality


Example: All Degree 2 Monomials

Φ : R2 → R3

(x1, x2) 7→ (z1, z2, z3) := (x21,√2x1x2, x

22)

x1

x2

z1

z3

z2


General Product Feature Space

How about patterns x ∈ RN and product features of order d?

Here, dim(H) grows like Nd.

E.g. N = 16× 16, and d = 5 −→ dimension 1010


The Kernel Trick, N = d = 2

⟨

Φ(x),Φ(x′)⟩

= (x21,√2 x1x2, x

22)(x

′21,√2 x′1x

′2, x

′22 )

⊤

=⟨

x, x′⟩2

= : k(x, x′)

−→ the dot product in H can be computed in R2


The Kernel Trick, II

More generally: x, x′ ∈ RN , d ∈ N:

⟨

x, x′⟩d

=

N∑

j=1

xj · x′j

d

=

N∑

j1,...,jd=1

xj1 · · · · · xjd · x′j1· · · · · x′jd =

⟨

Φ(x),Φ(x′)⟩

,

where Φ maps into the space spanned by all ordered products ofd input directions


Mercer’s Theorem

If k is a continuous kernel of a positive definite integral oper-ator on L2(X) (where X is some compact space),

∫

X

k(x, x′)f (x)f (x′) dx dx′ ≥ 0,

it can be expanded as

k(x, x′) =∞∑

i=1

λiψi(x)ψi(x′)

using eigenfunctions ψi and eigenvalues λi ≥ 0 [30].


The Mercer Feature Map

In that case

Φ(x) :=

√λ1ψ1(x)√λ2ψ2(x)

...

satisfies⟨

Φ(x),Φ(x′)⟩

= k(x, x′).

Proof:

⟨

Φ(x),Φ(x′)⟩

=

⟨

√λ1ψ1(x)√λ2ψ2(x)

...

,

√λ1ψ1(x

′)√λ2ψ2(x

′)...

⟩

=

∞∑

i=1

λiψi(x)ψi(x′) = k(x, x′)


Positive Definite Kernels

It can be shown that the admissible class of kernels coincides withthe one of positive definite (pd) kernels: kernels which are sym-metric (i.e., k(x, x′) = k(x′, x)), and for

• any set of training points x1, . . . , xm ∈ X and

• any a1, . . . , am ∈ R

satisfy∑

i,j

aiajKij ≥ 0, where Kij := k(xi, xj).

K is called the Gram matrix or kernel matrix.

If for pairwise distinct points,∑

i,j aiajKij = 0 =⇒ a = 0, callit strictly positive definite.


The Kernel Trick — Summary

• any algorithm that only depends on dot products can benefitfrom the kernel trick

• this way, we can apply linear methods to vectorial as well asnon-vectorial data

• think of the kernel as a nonlinear similarity measure

• examples of common kernels:

Polynomial k(x, x′) = (⟨

x, x′⟩

+ c)d

Gaussian k(x, x′) = exp(−‖x− x′‖2/(2σ2))•Kernels are also known as covariance functions [54, 52, 55, 29]


Properties of PD Kernels, 1

Assumption: Φ maps X into a dot product space H; x, x′ ∈ X

Kernels from Feature Maps.k(x, x′) :=

⟨

Φ(x),Φ(x′)⟩

is a pd kernel on X× X.

Kernels from Feature Maps, IIK(A,B) :=

∑

x∈A,x′∈B k(x, x′),

where A,B are finite subsets of X, is also a pd kernel(Hint: use the feature map Φ(A) :=

∑

x∈AΦ(x))


Properties of PD Kernels, 2 [36, 39]

Assumption: k, k1, k2, . . . are pd; x, x′ ∈ X

k(x, x) ≥ 0 for all x (Positivity on the Diagonal)

k(x, x′)2 ≤ k(x, x)k(x′, x′) (Cauchy-Schwarz Inequality)

(Hint: compute the determinant of the Gram matrix)

k(x, x) = 0 for all x =⇒ k(x, x′) = 0 for all x, x′ (Vanishing Diagonals)

The following kernels are pd:

• αk, provided α ≥ 0• k1 + k2• k(x, x′) := limn→∞ kn(x, x

′), provided it exists• k1 · k2• tensor products, direct sums, convolutions [22]


The Feature Space for PD Kernels [4, 1, 35]

• define a feature map

Φ : X → RX

x 7→ k(., x).

E.g., for the Gaussian kernel: Φ

. .Φ(x) Φ(x')x x'

Next steps:

• turn Φ(X) into a linear space

• endow it with a dot product satisfying⟨

Φ(x),Φ(x′)⟩

= k(x, x′), i.e.,⟨

k(., x), k(., x′)⟩

= k(x, x′)• complete the space to get a reproducing kernel Hilbert space


Turn it Into a Linear Space

Form linear combinations

f (.) =m∑

i=1

αik(., xi),

g(.) =m′∑

j=1

βjk(., x′j)

(m,m′ ∈ N, αi, βj ∈ R, xi, x′j ∈ X).


Endow it With a Dot Product

〈f, g〉 :=

m∑

i=1

m′∑

j=1

αiβjk(xi, x′j)

=

m∑

i=1

αig(xi) =m′∑

j=1

βjf (x′j)

• This is well-defined, symmetric, and bilinear (more later).

• So far, it also works for non-pd kernels


The Reproducing Kernel Property

Two special cases:

•Assumef (.) = k(., x).

In this case, we have

〈k(., x), g〉 = g(x).

• If moreoverg(.) = k(., x′),

we have〈k(., x), k(., x′)〉 = k(x, x′).

k is called a reproducing kernel(up to here, have not used positive definiteness)


Endow it With a Dot Product, II

• It can be shown that 〈., .〉 is a p.d. kernel on the set of functionsf (.) =

∑mi=1αik(., xi)|αi ∈ R, xi ∈ X :

∑

ij

γiγj⟨

fi, fj⟩

=

⟨

∑

i

γifi,∑

j

γjfj

⟩

=: 〈f, f〉

=

⟨

∑

i

αik(., xi),∑

i

αik(., xi)

⟩

=∑

ij

αiαjk(xi, xj) ≥ 0

• furthermore, it is strictly positive definite:

f (x)2 = 〈f, k(., x)〉2 ≤ 〈f, f〉〈k(., x), k(., x)〉hence 〈f, f〉 = 0 implies f = 0.

• Complete the space in the corresponding norm to get a Hilbertspace Hk.


The Empirical Kernel Map

Recall the feature map

Φ : X → RX

x 7→ k(., x).

• each point is represented by its similarity to all other points

• how about representing it by its similarity to a sample of points?

Consider

Φm : X → Rm

x 7→ k(., x)|(x1,...,xm) = (k(x1, x), . . . , k(xm, x))⊤


ctd.

• Φm(x1), . . . ,Φm(xm) contain all necessary information aboutΦ(x1), . . . ,Φ(xm)

• the Gram matrix Gij :=⟨

Φm(xi),Φm(xj)⟩

satisfies G = K2

where Kij = k(xi, xj)

•modify Φm to

Φwm : X → Rm

x 7→ K−12(k(x1, x), . . . , k(xm, x))

⊤

• this “whitened” map (“kernel PCA map”) satifies⟨

Φwm(xi),Φwm(xj)

⟩

= k(xi, xj)

for all i, j = 1, . . . ,m.


An Example of a Kernel Algorithm

Idea: classify points x := Φ(x) in feature space according to whichof the two class means is closer.

c+ :=1

m+

∑

yi=1

Φ(xi), c− :=1

m−

∑

yi=−1

Φ(xi)

+o

o

o

o

++

c1

c2

x-c

w

x

c

.

Compute the sign of the dot product between w := c+− c− andx− c.


An Example of a Kernel Algorithm, ctd. [36]

f (x) = sgn

1

m+

∑

i:yi=+1〈Φ(x),Φ(xi)〉−

1

m−

∑

i:yi=−1〈Φ(x),Φ(xi)〉+b

= sgn

1

m+

∑

i:yi=+1k(x, xi)−

1

m−

∑

i:yi=−1k(x, xi) + b

where

b =1

2

1

m2−

∑

(i,j):yi=yj=−1k(xi, xj)−

1

m2+

∑

(i,j):yi=yj=+1k(xi, xj)

.

• provides a geometric interpretation of Parzen windows


An Example of a Kernel Algorithm, ctd.

•Demo

• Exercise: derive the Parzen windows classifier by computing thedistance criterion directly

• SVMs (ppt)


An example of a kernel algorithm, revisited

µ(X )w .

+

+

+

+

o

oo

µ(Y )

X compact subset of a separable metric space, m,n ∈ N.

Positive class X := x1, . . . , xm ⊂ X

Negative class Y := y1, . . . , yn ⊂ X

RKHS means µ(X) = 1m

∑mi=1 k(xi, ·), µ(Y ) = 1

n

∑ni=1 k(yi, ·).

Get a problem if µ(X) = µ(Y )!B. Scholkopf, MLSS France 2011

When do the means coincide?

k(x, x′) =⟨

x, x′⟩

: the means coincide

k(x, x′) = (⟨

x, x′⟩

+ 1)d: all empirical moments up to order d coincide

k strictly pd: X = Y .

The mean “remembers” each point that contributed to it.


Proposition 2Assume X,Y are defined as above, k isstrictly pd, and for all i, j, xi 6= xj, and yi 6= yj.If for some αi, βj ∈ R− 0, we have

m∑

i=1

αik(xi, .) =

n∑

j=1

βjk(yj, .), (1)

then X = Y .


Proof (by contradiction)

W.l.o.g., assume that x1 6∈ Y . Subtract∑nj=1 βjk(yj, .) from (1),

and make it a sum over pairwise distinct points, to get

0 =∑

i

γik(zi, .),

where z1 = x1, γ1 = α1 6= 0, andz2, · · · ∈ X ∪ Y − x1, γ2, · · · ∈ R.Take the RKHS dot product with

∑

j γjk(zj, .) to get

0 =∑

ij

γiγjk(zi, zj),

with γ 6= 0, hence k cannot be strictly pd.


The mean map

µ : X = (x1, . . . , xm) 7→1

m

m∑

i=1

k(xi, ·)

satisfies

〈µ(X), f〉 =⟨

1

m

m∑

i=1

k(xi, ·), f⟩

=1

m

m∑

i=1

f (xi)

and

‖µ(X)−µ(Y )‖ = sup‖f‖≤1

|〈µ(X)− µ(Y ), f〉| = sup‖f‖≤1

∣

∣

∣

∣

∣

1

m

m∑

i=1

f(xi)−1

n

n∑

i=1

f(yi)

∣

∣

∣

∣

∣

.

Note: Large distance = can find a function distinguishing thesamples


Witness function

f =µ(X)−µ(Y )‖µ(X)−µ(Y )‖, thus f (x) ∝ 〈µ(X)− µ(Y ), k(x, .)〉):

−6 −4 −2 0 2 4 6−0.4

−0.2

0

0.2

0.4

0.6

0.8

1Witness f for Gauss and Laplace data

X

Pro

b. d

ensi

ty a

nd f

fGaussLaplace

This function is in the RKHS of a Gaussian kernel, but not in theRKHS of the linear kernel.


The mean map for measures

p, q Borel probability measures,

Ex,x′∼p[k(x, x′)], Ex,x′∼q[k(x, x

′)] <∞ (‖k(x, .)‖ ≤M <∞ is sufficient)

Defineµ : p 7→ Ex∼p[k(x, ·)].

Note〈µ(p), f〉 = Ex∼p[f (x)]

and

‖µ(p)− µ(q)‖ = sup‖f‖≤1

∣

∣Ex∼p[f (x)]− Ex∼q[f (x)]∣

∣ .

Recall that in the finite sample case, for strictly p.d. kernels, µwas injective — how about now?[43, 17]


Theorem 3 [15, 13]

p = q ⇐⇒ supf∈C(X)

∣

∣Ex∼p(f (x))−Ex∼q(f (x))∣

∣ = 0,

where C(X) is the space of continuous bounded functions onX.

Combine this with

‖µ(p)− µ(q)‖ = sup‖f‖≤1

∣

∣Ex∼p[f (x)]− Ex∼q[f (x)]∣

∣ .

Replace C(X) by the unit ball in an RKHS that is dense in C(X)— universal kernel [45], e.g., Gaussian.

Theorem 4 [19] If k is universal, then

p = q ⇐⇒ ‖µ(p)− µ(q)‖ = 0.


• µ is invertible on its imageM = µ(p) | p is a probability distribution(the “marginal polytope”, [53])

• generalization of the moment generating function of a RV xwith distribution p:

Mp(.) = Ex∼p[

e〈x, · 〉]

.

This provides us with a convenient metric on probability distribu-tions, which can be used to check whether two distributions aredifferent — provided that µ is invertible.


Fourier Criterion

Assume we have densities, the kernel is shift invariant (k(x, y) =k(x− y)), and all Fourier transforms below exist.Note that µ is invertible iff

∫

k(x− y)p(y) dy =

∫

k(x− y)q(y) dy =⇒ p = q,

i.e.,k(p− q) = 0 =⇒ p = q

(Sriperumbudur et al., 2008)

E.g., µ is invertible if k has full support. Restricting the class ofdistributions, weaker conditions suffice (e.g., if k has non-empty in-terior, µ is invertible for all distributions with compact support).


Fourier Optics

Application: p source of incoherent light, I indicator of a finiteaperture. In Fraunhofer diffraction, the intensity image is∝ p∗I2.Set k = I2, then this equals µ(p).This k does not have full support, thus the imaging process is notinvertible for the class of all light sources (Abbe), but it is if werestrict the class (e.g., to compact support).


Application 1: Two-sample problem [19]

X, Y i.i.d. m-samples from p, q, respectively.

‖µ(p)− µ(q)‖2 =Ex,x′∼p [k(x, x′)]− 2Ex∼p,y∼q [k(x, y)] + Ey,y′∼q [k(y, y

′)]

=Ex,x′∼p,y,y′∼q [h((x, y), (x′, y′))]

withh((x, y), (x′, y′)) := k(x, x′)− k(x, y′)− k(y, x′) + k(y, y′).

Define

D(p, q)2 := Ex,x′∼p,y,y′∼qh((x, y), (x′, y′))

D(X, Y )2 := 1m(m−1)

∑

i 6=jh((xi, yi), (xj, yj)).

D(X,Y )2 is an unbiased estimator of D(p, q)2.It’s easy to compute, and works on structured data.


Theorem 5 Assume k is bounded.

D(X, Y )2 converges to D(p, q)2 in probability with rate O(m−12).

This could be used as a basis for a test, but uniform convergence bounds are often loose..

Theorem 6 We assume E(

h2)

< ∞. When p 6= q, then√m(D(X, Y )2 − D(p, q)2)

converges in distribution to a zero mean Gaussian with variance

σ2u = 4(

Ez

[

(Ez′h(z, z′))2]

−[

Ez,z′(h(z, z′))]2)

.

When p = q, then m(D(X, Y )2 −D(p, q)2) = mD(X, Y )2 converges in distribution to

∞∑

l=1

λl[

q2l − 2]

, (2)

where ql ∼ N(0, 2) i.i.d., λi are the solutions to the eigenvalue equation∫

X

k(x, x′)ψi(x)dp(x) = λiψi(x′),

and k(xi, xj) := k(xi, xj) − Exk(xi, x) − Exk(x, xj) + Ex,x′k(x, x′) is the centred RKHS

kernel.


Application 2: Dependence Measures

Assume that (x, y) are drawn from pxy, with marginals px, py.

Want to know whether pxy factorizes.[2, 16]: kernel generalized variance

[20, 21]: kernel constrained covariance, HSIC

Main idea [25, 34]:x and y independent ⇐⇒ ∀ bounded continuous functions f, g,we have Cov(f (x), g(y)) = 0.


k kernel on X× Y.

µ(pxy) := E(x,y)∼pxy [k((x, y), ·)]µ(px × py) := Ex∼px,y∼py [k((x, y), ·)] .

Use ∆ :=∥

∥µ(pxy)− µ(px × py)∥

∥ as a measure of dependence.

For k((x, y), (x′, y′)) = kx(x, x′)ky(y, y′):

∆2 equals the Hilbert-Schmidt norm of the covariance opera-tor between the two RKHSs (HSIC), with empirical estimatem−2 trHKxHKy, where H = I − 1/m [20, 44].


Witness function of the equivalent optimisation problem:

X

Y

Dependence witness and sample

−1.5 −1 −0.5 0 0.5 1 1.5−1.5

−1

−0.5

0

0.5

1

1.5

−0.04

−0.03

−0.02

−0.01

0

0.01

0.02

0.03

0.04

0.05

Application: learning causal structures (Sun et al., ICML 2007; Fuku-

mizu et al., NIPS 2007))B. Scholkopf, MLSS France 2011

Application 3: Covariate Shift Correction and LocalLearning

training set X = (x1, y1), . . . , (xm, ym) drawn from p,test set X ′ =

(x′1, y′1), . . . , (x

′n, y

′n)

from p′ 6= p.

Assume py|x = p′y|x.

[40]: reweight training set


Minimize∥

∥

∥

∥

∥

∥

m∑

i=1

βik(xi, ·)− µ(X ′)

∥

∥

∥

∥

∥

∥

2

+λ ‖β‖22 subject to βi ≥ 0,∑

i

βi = 1.

Equivalent QP:

minimizeβ

1

2β⊤ (K + λ1) β − β⊤l

subject to βi ≥ 0 and∑

i

βi = 1,

where Kij := k(xi, xj), li =⟨

k(xi, ·), µ(X ′)⟩

.

Experiments show that in underspecified situations (e.g., large ker-nel widths), this helps [23].

X ′ =

x′

leads to a local sample weighting scheme.B. Scholkopf, MLSS France 2011

The Representer Theorem

Theorem 7Given: a p.d. kernel k on X × X, a training set(x1, y1), . . . , (xm, ym) ∈ X×R, a strictly monotonic increasingreal-valued function Ω on [0,∞[, and an arbitrary cost functionc : (X× R

2)m → R ∪ ∞Any f ∈ Hk minimizing the regularized risk functional

c ((x1, y1, f (x1)), . . . , (xm, ym, f (xm))) + Ω (‖f‖) (3)

admits a representation of the form

f (.) =∑m

i=1αik(xi, .).


Remarks

• significance: many learning algorithms have solutions that canbe expressed as expansions in terms of the training examples

• original form, with mean squared loss

c((x1, y1, f (x1)), . . . , (xm, ym, f (xm))) =1

m

m∑

i=1

(yi − f (xi))2,

and Ω(‖f‖) = λ‖f‖2 (λ > 0): [27]

• generalization to non-quadratic cost functions: [10]

• present form: [36]


Proof

Decompose f ∈ H into a part in the span of the k(xi, .) and anorthogonal one:

f =∑

i

αik(xi, .) + f⊥,where for all j

〈f⊥, k(xj, .)〉 = 0.

Application of f to an arbitrary training point xj yields

f (xj) =⟨

f, k(xj, .)⟩

=

⟨

∑

i

αik(xi, .) + f⊥, k(xj, .)

⟩

=∑

i

αi〈k(xi, .), k(xj, .)〉,

independent of f⊥.B. Scholkopf, MLSS France 2011

Proof: second part of (3)

Since f⊥ is orthogonal to∑

iαik(xi, .), and Ω is strictly mono-tonic, we get

Ω(‖f‖) = Ω(

‖∑

iαik(xi, .) + f⊥‖

)

= Ω

(

√

‖∑

iαik(xi, .)‖2 + ‖f⊥‖2

)

≥ Ω(

‖∑

iαik(xi, .)‖

)

, (4)

with equality occuring if and only if f⊥ = 0.Hence, any minimizer must have f⊥ = 0. Consequently, anysolution takes the form

f =∑

iαik(xi, .).


Application: Support Vector Classification

Here, yi ∈ ±1. Use

c ((xi, yi, f (xi))i) =1

λ

∑

i

max (0, 1− yif (xi)) ,

and the regularizer Ω (‖f‖) = ‖f‖2.λ→ 0 leads to the hard margin SVM


Further Applications

Bayesian MAP Estimates. Identify (3) with the negative logposterior (cf. Kimeldorf & Wahba, 1970, Poggio & Girosi, 1990),i.e.

• exp(−c((xi, yi, f (xi))i)) — likelihood of the data

• exp(−Ω(‖f‖)) — prior over the set of functions; e.g., Ω(‖f‖) =λ‖f‖2 — Gaussian process prior [55] with covariance functionk

•minimizer of (3) = MAP estimate

Kernel PCA (see below) can be shown to correspond to the caseof

c((xi, yi, f (xi))i=1,...,m) =

0 if 1m

∑

i

(

f (xi)− 1m

∑

j f (xj))2

= 1

∞ otherwise

with g an arbitrary strictly monotonically increasing function.

Conclusion

• the kernel corresponds to– a similarity measure for the data, or

– a (linear) representation of the data, or

– a hypothesis space for learning,

• kernels allow the formulation of a multitude of geometrical algo-rithms (Parzen windows, 2-sample tests, SVMs, kernel PCA,...)


Kernel PCA [37]

R2

linear PCA

R2

H

Φ

kernel PCA

k

k(x,y) = (x.y)

k(x,y) = (x.y)d

x

xx xx

x

x

xxx

x

x

x

xx xx

x

x

xxx

x

x

x

x

xx

xx

x

x x

x

x

x


Kernel PCA, II

x1, . . . , xm ∈ X, Φ : X → H, C =1

m

m∑

j=1

Φ(xj)Φ(xj)⊤

Eigenvalue problem

λV = CV =1

m

m∑

j=1

⟨

Φ(xj),V⟩

Φ(xj).

For λ 6= 0, V ∈ spanΦ(x1), . . . ,Φ(xm), thus

V =

m∑

i=1

αiΦ(xi),

and the eigenvalue problem can be written as

λ 〈Φ(xn),V〉 = 〈Φ(xn), CV〉 for all n = 1, . . . ,m


Kernel PCA in Dual Variables

In term of the m×m Gram matrix

Kij :=⟨

Φ(xi),Φ(xj)⟩

= k(xi, xj),

this leads tomλKα = K2α

where α = (α1, . . . , αm)⊤.

Solvemλα = Kα

−→ (λn,αn)

〈Vn,Vn〉 = 1 ⇐⇒ λn 〈αn,αn〉 = 1

thus divide αn by√λn


Feature extraction

Compute projections on the Eigenvectors

Vn =

m∑

i=1

αni Φ(xi)

in H:

for a test point x with image Φ(x) in H we get the features

〈Vn,Φ(x)〉 =

m∑

i=1

αni 〈Φ(xi),Φ(x)〉

=

m∑

i=1

αni k(xi, x)


The Kernel PCA Map

Recall

Φwm : X → Rm

x 7→ K−12(k(x1, x), . . . , k(xm, x))

⊤

If K = UDU⊤ is K’s diagonalization, then K−1/2 =UD−1/2U⊤. Thus we have

Φwm(x) = UD−1/2U⊤(k(x1, x), . . . , k(xm, x))⊤.

We can drop the leading U (since it leaves the dot product invari-ant) to get a map

ΦwKPCA(x) = D−1/2U⊤(k(x1, x), . . . , k(xm, x))⊤.

The rows of U⊤ are the eigenvectors αn of K, and the entries of

the diagonal matrix D−1/2 equal λ−1/2i .


Toy Example with Gaussian Kernel

k(x, x′) = exp(

−‖x− x′‖2)


Super-Resolution (Kim, Franz, & Scholkopf, 2004)

a. original image of resolution528 × 396

b. low resolution image (264 ×

198) stretched to the originalscale

c. bicubic interpolation d. supervised example-basedlearning based on nearest neigh-bor classifier

f. unsupervised KPCA recon-struction

g. enlarged portions of a-d, and f (from left to right)

Comparison between different super-resolution methods.


Support Vector Classifiers

feature spaceinput space

Φ

[6]


Separating Hyperplane

w

x | <w x> + b = 0,

<w x> + b > 0,

<w x> + b < 0,


Optimal Separating Hyperplane [50]

.w

x | <w x> + b = 0,


Eliminating the Scaling Freedom [47]

Note: if c 6= 0, then

x| 〈w,x〉 + b = 0 = x| 〈cw,x〉 + cb = 0.Hence (cw, cb) describes the same hyperplane as (w, b).

Definition: The hyperplane is in canonical form w.r.t. X∗ =x1, . . . ,xr if minxi∈X | 〈w,xi〉 + b| = 1.


Canonical Optimal Hyperplane

,w

x | <w x> + b = 0,

x | <w x> + b = −1,x | <w x> + b = +1,

x2x1

Note:

<w x1> + b = +1<w x2> + b = −1

=> <w (x1−x2)> = 2

=> (x1−x2) =w

||w||< >

,,

,

, 2||w||

yi = −1

yi = +1


Canonical Hyperplanes [47]

Note: if c 6= 0, then

x| 〈w,x〉 + b = 0 = x| 〈cw,x〉 + cb = 0.Hence (cw, cb) describes the same hyperplane as (w, b).

Definition: The hyperplane is in canonical form w.r.t. X∗ =x1, . . . ,xr if minxi∈X | 〈w,xi〉 + b| = 1.

Note that for canonical hyperplanes, the distance of the closestpoint to the hyperplane (“margin”) is 1/‖w‖:minxi∈X

∣

∣

∣

⟨

w‖w‖,xi

⟩

+ b‖w‖

∣

∣

∣= 1

‖w‖.


Theorem 8 (Vapnik [46])Consider hyperplanes 〈w,x〉 = 0where w is normalized such that they are in canonical formw.r.t. a set of points X∗ = x1, . . . ,xr, i.e.,

mini=1,...,r

| 〈w,xi〉 | = 1.

The set of decision functions fw(x) = sgn 〈x,w〉 defined onX∗ and satisfying the constraint ‖w‖ ≤ Λ has a VC dimensionsatisfying

h ≤ R2Λ2.

Here, R is the radius of the smallest sphere around the origincontaining X∗.


x

x

x

γ1

γ2

xx R


Proof Strategy (Gurvits, 1997)

Assume that x1, . . . ,xr are shattered by canonical hyperplaneswith ‖w‖ ≤ Λ, i.e., for all y1, . . . , yr ∈ ±1,

yi 〈w,xi〉 ≥ 1 for all i = 1, . . . , r. (5)

Two steps:

• prove that the more points we want to shatter (5), the larger‖∑ri=1 yixi‖ must be

• upper bound the size of ‖∑ri=1 yixi‖ in terms of R

Combining the two tells us how many points we can at most shat-ter.


Part I

Summing (5) over i = 1, . . . , r yields⟨

w,

r∑

i=1

yixi

⟩

≥ r.

By the Cauchy-Schwarz inequality, on the other hand, we have⟨

w,

r∑

i=1

yixi

⟩

≤ ‖w‖

∥

∥

∥

∥

∥

∥

r∑

i=1

yixi

∥

∥

∥

∥

∥

∥

≤ Λ

∥

∥

∥

∥

∥

∥

r∑

i=1

yixi

∥

∥

∥

∥

∥

∥

.

Combine both:

r

Λ≤

∥

∥

∥

∥

∥

∥

r∑

i=1

yixi

∥

∥

∥

∥

∥

∥

. (6)


Part II

Consider independent random labels yi ∈ ±1, uniformly dis-tributed (Rademacher variables).

E

∥

∥

∥

∥

∥

∥

r∑

i=1

yixi

∥

∥

∥

∥

∥

∥

2

=

r∑

i=1

E

⟨

yixi,r∑

j=1

yjxj

⟩

=

r∑

i=1

E

⟨

yixi,

∑

j 6=iyjxj

+ yixi

⟩

=

r∑

i=1

∑

j 6=iE[⟨

yixi, yjxj⟩]

+E [〈yixi, yixi〉]

=

r∑

i=1

E[

‖yixi‖2]

=

r∑

i=1

‖xi‖2


Part II, ctd.

Since ‖xi‖ ≤ R, we get

E

∥

∥

∥

∥

∥

∥

r∑

i=1

yixi

∥

∥

∥

∥

∥

∥

2

≤ rR2.

• This holds for the expectation over the random choices of thelabels, hence there must be at least one set of labels for whichit also holds true. Use this set.

Hence∥

∥

∥

∥

∥

∥

r∑

i=1

yixi

∥

∥

∥

∥

∥

∥

2

≤ rR2.


Part I and II Combined

Part I:( rΛ

)2 ≤ ‖∑ri=1 yixi‖

2

Part II: ‖∑ri=1 yixi‖

2 ≤ rR2

Hencer2

Λ2≤ rR2,

i.e.,r ≤ R2Λ2,

completing the proof.


Pattern Noise as Maximum Margin Regularization

o

o

o

+

+

+

o

+

r

ρ


Maximum Margin vs. MDL — 2D Case

o

o

o

+

+

+

o

+γ+∆γ

γ

Rρ

Can perturb γ by ∆γ with |∆γ| < arcsin ρR and still correctly

separate the data.Hence only need to store γ with accuracy ∆γ [36, 51].


Formulation as an Optimization Problem

Hyperplane with maximum margin: minimize

‖w‖2

(recall: margin ∼ 1/‖w‖) subject toyi · [〈w,xi〉 + b] ≥ 1 for i = 1 . . . m

(i.e. the training data are separated correctly).


Lagrange Function (e.g., [5])

Introduce Lagrange multipliers αi ≥ 0 and a Lagrangian

L(w, b,α) =1

2‖w‖2 −

m∑

i=1

αi (yi · [〈w,xi〉 + b]− 1) .

L has to minimized w.r.t. the primal variables w and b andmaximized with respect to the dual variables αi

• if a constraint is violated, then yi · (〈w,xi〉 + b)− 1 < 0 −→· αi will grow to increase L — how far?

·w, b want to decrease L; i.e. they have to change such thatthe constraint is satisfied. If the problem is separable, thisensures that αi <∞.

• similarly: if yi · (〈w,xi〉 + b)− 1 > 0, then αi = 0: otherwise,L could be increased by decreasing αi (KKT conditions)


Derivation of the Dual Problem

At the extremum, we have

∂

∂bL(w, b,α) = 0,

∂

∂wL(w, b,α) = 0,

i.e.m∑

i=1

αiyi = 0

and

w =

m∑

i=1

αiyixi.

Substitute both into L to get the dual problem


The Support Vector Expansion

w =

m∑

i=1

αiyixi

where for all i = 1, . . . ,m either

yi · [〈w,xi〉 + b] > 1 =⇒ αi = 0 −→ xi irrelevantoryi · [〈w,xi〉 + b] = 1 (on the margin) −→ xi “Support Vector”

The solution is determined by the examples on the margin.

Thus

f (x) = sgn (〈x,w〉 + b)

= sgn(

∑m

i=1αiyi〈x,xi〉 + b

)

.


Why it is Good to Have Few SVs

Leave out an example that does not become SV−→ same solution.

Theorem [49]: Denote #SV(m) the number of SVs obtainedby training on m examples randomly drawn from P(x, y), and Ethe expectation. Then

E [Prob(test error)] ≤ E [#SV(m)]

mHere, Prob(test error) refers to the expected value of the risk,where the expectation is taken over training the SVM on samplesof size m− 1.


A Mechanical Interpretation [8]

Assume that each SV xi exerts a perpendicular force of size αiand sign yi on a solid plane sheet lying along the hyperplane.

Then the solution is mechanically stable:

m∑

i=1

αiyi = 0 implies that the forces sum to zero

w =

m∑

i=1

αiyixi implies that the torques sum to zero,

via∑

i

xi × yiαi ·w/‖w‖ = w×w/‖w‖ = 0.


Dual Problem

Dual: maximize

W (α) =m∑

i=1

αi −1

2

m∑

i,j=1

αiαjyiyj⟨

xi,xj⟩

subject to

αi ≥ 0, i = 1, . . . ,m, and

m∑

i=1

αiyi = 0.

Both the final decision function and the function to be maximizedare expressed in dot products −→ can use a kernel to compute

⟨

xi,xj⟩

=⟨

Φ(xi),Φ(xj)⟩

= k(xi, xj).


The SVM Architecture

Σ f(x)= sgn ( + b)

input vector x

support vectors x 1

... x 4

comparison: k(x,x i), e.g.

classification

weights

k(x,x i)=exp(−||x−x i||2 / c)

k(x,x i)=tanh(κ(x.x i)+θ)

k(x,x i)=(x.x i)d

f(x)= sgn ( Σ λi.k(x,x i) + b)

λ1 λ2 λ3 λ4

k k k k


Toy Example with Gaussian Kernel

k(x, x′) = exp(

−‖x− x′‖2)


Nonseparable Problems [3, 9]

If yi · (〈w,xi〉 + b) ≥ 1 cannot be satisfied, then αi → ∞.

Modify the constraint to

yi · (〈w,xi〉 + b) ≥ 1− ξi

withξi ≥ 0

(“soft margin”) and add

C ·m∑

i=1

ξi

in the objective function.


Soft Margin SVMs

C-SVM [9]: for C > 0, minimize

τ (w, ξ) =1

2‖w‖2 + C

m∑

i=1

ξi

subject to yi · (〈w,xi〉 + b) ≥ 1− ξi, ξi ≥ 0 (margin 1/‖w‖)

ν-SVM [38]: for 0 ≤ ν < 1, minimize

τ (w, ξ, ρ) =1

2‖w‖2 − νρ +

1

m

∑

i

ξi

subject to yi · (〈w,xi〉 + b) ≥ ρ− ξi, ξi ≥ 0 (margin ρ/‖w‖)


The ν-Property

SVs: αi > 0“margin errors:” ξi > 0

KKT-Conditions =⇒•All margin errors are SVs.

•Not all SVs need to be margin errors.

Those which are not lie exactly on the edge of the margin.

Proposition:1. fraction of Margin Errors ≤ ν ≤ fraction of SVs.2. asymptotically: ... = ν = ...


Duals, Using Kernels

C-SVM dual: maximize

W (α) =∑

iαi −

1

2

∑

i,jαiαjyiyjk(xi,xj)

subject to 0 ≤ αi ≤ C,∑

iαiyi = 0.

ν-SVM dual: maximize

W (α) = −1

2

∑

ijαiαjyiyjk(xi,xj)

subject to 0 ≤ αi ≤ 1m,

∑

iαiyi = 0,∑

iαi ≥ ν

In both cases: decision function:

f (x) = sgn(

∑m

i=1αiyik(x,xi) + b

)


Connection between ν-SVC and C-SVC

Proposition. If ν-SV classification leads to ρ > 0, then C-SVclassification, with C set a priori to 1/ρ, leads to the same decisionfunction.

Proof. Minimize the primal target, then fix ρ, and minimize only over theremaining variables: nothing will change. Hence the obtained solutionw0, b0, ξ0minimizes the primal problem of C-SVC, for C = 1, subject to

yi · (〈xi,w〉 + b) ≥ ρ− ξi.

To recover the constraint

yi · (〈xi,w〉 + b) ≥ 1− ξi,

rescale to the set of variables w′ = w/ρ, b′ = b/ρ, ξ′ = ξ/ρ. This leaves us, up

to a constant scaling factor ρ2, with the C-SV target with C = 1/ρ.


SVM Training

• naive approach: the complexity of maximizing

W (α) =∑m

i=1αi −

1

2

∑m

i,j=1αiαjyiyjk(xi,xj)

scales with the third power of the training set size m

• only SVs are relevant −→ only compute (k(xi,xj))ij for SVs.Extract them iteratively by cycling through the training set inchunks [46].

• in fact, one can use chunks which do not even contain all SVs[31]. Maximize over these sub-problems, using your favoriteoptimizer.

• the extreme case: by making the sub-problems very small (justtwo points), one can solve them analytically [32].

• http://www.kernel-machines.org/software.html


MNIST Benchmark

handwritten character benchmark (60000 training & 10000 testexamples, 28× 28)

5 0 4 1 9 2 1 3 1 4

3 5 3 6 1 7 2 8 6 9

4 0 9 1 1 2 4 3 2 7

3 8 6 9 0 5 6 0 7 6

1 8 7 9 3 9 8 5 9 3

3 0 7 4 9 8 0 9 4 1

4 4 6 0 4 5 6 1 0 0

1 7 1 6 3 0 2 1 1 7

9 0 2 6 7 8 3 9 0 4

6 7 4 6 8 0 7 8 3 1


MNIST Error Rates

Classifier test error referencelinear classifier 8.4% [7]3-nearest-neighbour 2.4% [7]SVM 1.4% [8]Tangent distance 1.1% [41]LeNet4 1.1% [28]Boosted LeNet4 0.7% [28]Translation invariant SVM 0.56% [11]

Note: the SVM used a polynomial kernel of degree 9, corresponding to a featurespace of dimension ≈ 3.2 · 1020.


References

[1] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.

[2] F. R. Bach and M. I. Jordan. Kernel independent component analysis. J. Mach. Learn. Res., 3:1–48, 2002.

[3] K. P. Bennett and O. L. Mangasarian. Robust linear programming discrimination of two linearly inseparable sets. Opti-mization Methods and Software, 1:23–34, 1992.

[4] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, New York, 1984.

[5] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.

[6] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor,Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144–152, Pittsburgh, PA, July1992. ACM Press.

[7] L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, L. D. Jackel, Y. LeCun, U. A. Muller, E. Sackinger, P. Simard,and V. Vapnik. Comparison of classifier methods: a case study in handwritten digit recognition. In Proceedings of the 12thInternational Conference on Pattern Recognition and Neural Networks, Jerusalem, pages 77–87. IEEE Computer SocietyPress, 1994.

[8] C. J. C. Burges and B. Scholkopf. Improving the accuracy and speed of support vector learning machines. In M. Mozer,M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 375–381, Cambridge,MA, 1997. MIT Press.

[9] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273–297, 1995.

[10] D. Cox and F. O’Sullivan. Asymptotic analysis of penalized likelihood and related estimators. Annals of Statistics,18:1676–1695, 1990.

[11] D. DeCoste and B. Scholkopf. Training invariant support vector machines. Machine Learning, 2001. Accepted for publi-cation. Also: Technical Report JPL-MLTR-00-1, Jet Propulsion Laboratory, Pasadena, CA, 2000.

[12] L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Number 31 in Applications ofmathematics. Springer, New York, 1996.

[13] R. M. Dudley. Real analysis and probability. Cambridge University Press, Cambridge, UK, 2002.

[14] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. In A. J. Smola, P. L.Bartlett, B. Scholkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 171–203, Cambridge, MA,2000. MIT Press.

[15] R. Fortet and E. Mourier. Convergence de la reparation empirique vers la reparation theorique. Ann. Scient. Ecole Norm.Sup., 70:266–285, 1953.

[16] K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernelhilbert spaces. J. Mach. Learn. Res., 5:73–99, 2004.

[17] K. Fukumizu, A. Gretton, X. Sun, and B. Scholkopf. Kernel measures of conditional dependence. In J. C. Platt, D. Koller,Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20, pages 489–496, Cam-bridge, MA, USA, 09 2008. MIT Press.

[18] F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6):1455–1480, 1998.

[19] A. Gretton, K. Borgwardt, M. Rasch, B. Scholkopf, and A. J. Smola. A kernel method for the two-sample-problem. InB. Scholkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19, volume 19. TheMIT Press, Cambridge, MA, 2007.

[20] A. Gretton, O. Bousquet, A.J. Smola, and B. Scholkopf. Measuring statistical dependence with Hilbert-Schmidt norms.In S. Jain, H. U. Simon, and E. Tomita, editors, Proceedings Algorithmic Learning Theory, pages 63–77, Berlin, Germany,2005. Springer-Verlag.

[21] A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Scholkopf. Kernel methods for measuring independence. J. Mach.Learn. Res., 6:2075–2129, 2005.

[22] D. Haussler. Convolutional kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer Science Depart-ment, University of California at Santa Cruz, 1999.

[23] J. Huang, A.J. Smola, A. Gretton, K. Borgwardt, and B. Scholkopf. Correcting sample selection bias by unlabeled data. InB. Scholkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19, volume 19. TheMIT Press, Cambridge, MA, 2007.

[24] I. A. Ibragimov and R. Z. Has’minskii. Statistical Estimation — Asymptotic Theory. Springer-Verlag, New York, 1981.

[25] J. Jacod and P. Protter. Probability Essentials. Springer, New York, 2000.

[26] G. S. Kimeldorf and G. Wahba. A correspondence between Bayesian estimation on stochastic processes and smoothing bysplines. Annals of Mathematical Statistics, 41:495–502, 1970.

[27] G. S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Applic., 33:82–95, 1971.

[28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings ofthe IEEE, 86:2278–2324, 1998.

[29] D. J. C. MacKay. Introduction to gaussian processes. In C. M. Bishop, editor, Neural Networks and Machine Learning,pages 133–165. Springer-Verlag, Berlin, 1998.

[30] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Phi-los. Trans. Roy. Soc. London, A 209:415–446, 1909.

[31] E. Osuna, R. Freund, and F. Girosi. Support vector machines: Training and applications. Technical Report AIM-1602,MIT A.I. Lab., 1996.

[32] J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Scholkopf, C. J. C. Burges,and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 185–208, Cambridge, MA, 1999.MIT Press.

[33] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9), September 1990.

[34] A. Renyi. On measures of dependence. Acta Math. Acad. Sci. Hungar., 10:441–451, 1959.

[35] S. Saitoh. Theory of Reproducing Kernels and its Applications. Longman Scientific & Technical, Harlow, England, 1988.

[36] B. Scholkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

[37] B. Scholkopf, A. Smola, and K.-R. Muller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Compu-tation, 10:1299–1319, 1998.

[38] B. Scholkopf, A. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation,12:1207–1245, 2000.

[39] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK,2004.

[40] H. Shimodaira. Improving predictive inference under convariance shift by weighting the log-likelihood function. Journal ofStatistical Planning and Inference, 90, 2000.

[41] P. Simard, Y. LeCun, and J. Denker. Efficient pattern recognition using a new transformation distance. In S. J. Hanson,J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5. Proceedings of the 1992Conference, pages 50–58, San Mateo, CA, 1993. Morgan Kaufmann.

[42] A. Smola, B. Scholkopf, and K.-R. Muller. The connection between regularization operators and support vector kernels.Neural Networks, 11:637–649, 1998.

[43] A. J. Smola, A. Gretton, L. Song, and B. Scholkopf. A Hilbert space embedding for distributions. In M. Hutter, R. A.Servedio, and E. Takimoto, editors, Algorithmic Learning Theory: 18th International Conference, pages 13–31, Berlin, 102007. Springer.

[44] A. J. Smola, A. Gretton, L. Song, and B. Scholkopf. A Hilbert space embedding for distributions. In Proc. Intl. Conf.Algorithmic Learning Theory, volume 4754 of LNAI. Springer, 2007.

[45] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. J. Mach. Learn. Res., 2:67–93,2001.

[46] V. Vapnik. Estimation of Dependences Based on Empirical Data [in Russian]. Nauka, Moscow, 1979. (English translation:Springer Verlag, New York, 1982).

[47] V. Vapnik. The Nature of Statistical Learning Theory. Springer, NY, 1995.

[48] V. Vapnik. Statistical Learning Theory. Wiley, NY, 1998.

[49] V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition [in Russian]. Nauka, Moscow, 1974. (German Translation:W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie–Verlag, Berlin, 1979).

[50] V. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24,1963.

[51] U. von Luxburg, O. Bousquet, and B. Scholkopf. A compression approach to support vector model selection. TechnicalReport 101, Max Planck Institute for Biological Cybernetics, Tubingen, Germany, 2002. see more detailed JMLR version.

[52] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Math-ematics. SIAM, Philadelphia, 1990.

[53] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Technical Report649, UC Berkeley, Department of Statistics, September 2003.

[54] H. L. Weinert. Reproducing Kernel Hilbert Spaces. Hutchinson Ross, Stroudsburg, PA, 1982.

[55] C. K. I. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In M. I.Jordan, editor, Learning and Inference in Graphical Models. Kluwer, 1998.

[56] D. H. Wolpert. The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7):1341–1390, 1996.


Regularization Interpretation of Kernel Machines

The norm inH can be interpreted as a regularization term (Girosi1998, Smola et al., 1998, Evgeniou et al., 2000): if P is a regular-ization operator (mapping into a dot product space D) such thatk is Green’s function of P ∗P , then

‖w‖ = ‖Pf‖,where

w =∑m

i=1αiΦ(xi)

andf (x) =

∑

iαik(xi, x).

Example: for the Gaussian kernel, P is a linear combination ofdifferential operators.


‖w‖2 =∑

i,j

αiαjk(xi, xj)

=∑

i,j

αiαj

⟨

k(xi, .), δxj(.)⟩

=∑

i,j

αiαj⟨

k(xi, .), (P∗Pk)(xj, .)

⟩

=∑

i,j

αiαj⟨

(Pk)(xi, .), (Pk)(xj, .)⟩

D

=

⟨

(P∑

i

αik)(xi, .), (P∑

j

αjk)(xj, .)

⟩

D

= ‖Pf‖2,using f (x) =

∑

iαik(xi, x).B. Scholkopf, MLSS France 2011

Introduction to Machine Learning

Technology

Transcript of Introduction to Machine Learning