Introduction to Machine Learning

Transcript of Introduction to Machine Learning
1
Bernhard Schölkopf
Empirical Inference Department
Max Planck Institute for Intelligent Systems
Tübingen, Germany
http://www.tuebingen.mpg.de/bs
Introduction to Machine Learning
2
Empirical Inference
• Drawing conclusions from empirical data (observations, measurements)
• Example 1: scientific inference
[Figure: observations (x, y) with two candidate regularities fitted to the data]
Leibniz, Weyl, Chaitin
y = a · x
y = ∑_i a_i k(x, x_i) + b
3
Empirical Inference
• Drawing conclusions from empirical data (observations, measurements)
• Example 1: scientific inference
“If your experiment needs statistics [inference],
you ought to have done a better experiment.” (Rutherford)
4
Empirical Inference, II
• Example 2: perception
“The brain is nothing but a statistical decision organ”
(H. Barlow)
5
Hard Inference Problems
Sonnenburg, Rätsch, Schäfer, Schölkopf, 2006, Journal of Machine Learning Research
Task: classify human DNA sequence locations into {acceptor splice site, decoy} using 15 million sequences of length 141 and a Multiple Kernel Support Vector Machine.
PRC = Precision-Recall Curve, fraction of correct positive predictions among all positively predicted cases
• High dimensionality – consider many factors simultaneously to find the regularity
• Complex regularities – nonlinear, nonstationary, etc.
• Little prior knowledge – e.g., no mechanistic models for the data
• Need large data sets – processing requires computers and automatic inference methods
6
Hard Inference Problems, II
• We can solve scientific inference problems that humans can’t solve
• Even if it’s just because of data set size / dimensionality, this is a quantum leap
7
Generalization (thanks to O. Bousquet)
• observe 1, 2, 4, 7, …
• What's next?
• 1, 2, 4, 7, 11, 16, …: a_{n+1} = a_n + n ("lazy caterer's sequence")
• 1, 2, 4, 7, 12, 20, …: a_{n+2} = a_{n+1} + a_n + 1
• 1, 2, 4, 7, 13, 24, …: "Tribonacci" sequence
• 1, 2, 4, 7, 14, 28: set of divisors of 28
• 1, 2, 4, 7, 1, 1, 5, …: decimal expansions of π = 3.14159… and e = 2.718… interleaved
• The On-Line Encyclopedia of Integer Sequences: >600 hits…
Generalization, II
8
• Question: which continuation is correct ("generalizes")?
• Answer: there's no way to tell ("induction problem")
• Question of statistical learning theory: how to come up with a law that is (probably) correct ("demarcation problem")
(more accurately: a law that is probably as correct on the test data as it is on the training data)
9
Two-class classification
Learn f based on m observations (x1, y1), …, (xm, ym) generated i.i.d. from some unknown distribution P(x, y).
Goal: minimize the expected error ("risk")
R[f] = ∫ ½|f(x) − y| dP(x, y).
Problem: P is unknown.
Induction principle: minimize the training error ("empirical risk")
Remp[f] = (1/m) ∑_{i=1}^m ½|f(xi) − yi|
over some class of functions. Q: is this "consistent"?
V. Vapnik
10
The law of large numbers
For all f and ε > 0,
P{ |Remp[f] − R[f]| > ε } → 0 as m → ∞.
Does this imply "consistency" of empirical risk minimization (optimality in the limit)?
No — need a uniform law of large numbers:
for all ε > 0,
P{ sup_{f∈F} (R[f] − Remp[f]) > ε } → 0 as m → ∞.
Consistency and uniform convergence
Bernhard Schölkopf Empirical Inference Department Tübingen , 03 October 2011
12
Support Vector Machines
• sparse expansion of the solution in terms of SVs (Boser, Guyon, Vapnik 1992): representer theorem (Kimeldorf & Wahba 1971, Schölkopf et al. 2000)
• unique solution found by convex QP
[Figure: margin hyperplane separating class 1 (+) from class 2 (−) in the feature space induced by Φ; k(x, x′) = ⟨Φ(x), Φ(x′)⟩]
15
Applications in Computational Geometry / Graphics
Steinke, Walder, Blanz et al., Eurographics '05, '06, '08; ICML '05, '08; NIPS '07
FIFA World Cup – Germany vs. England, June 27, 2010
Kernel Quiz
17
Kernel Methods
Bernhard Schölkopf
Max Planck Institute for Intelligent Systems
B. Schölkopf, MLSS France 2011
Statistical Learning Theory
1. started by Vapnik and Chervonenkis in the Sixties
2. model: we observe data generated by an unknown stochastic regularity
3. learning = extraction of the regularity from the data
4. the analysis of the learning problem leads to notions of capacity of the function classes that a learning machine can implement
5. support vector machines use a particular type of function class: classifiers with large "margins" in a feature space induced by a kernel
[47, 48]
Example: Regression Estimation
[Figure: data points (x, y) with a regression function]
• Data: input-output pairs (xi, yi) ∈ R × R
• Regularity: (x1, y1), …, (xm, ym) drawn from P(x, y)
• Learning: choose a function f : R → R such that the error, averaged over P, is minimized
• Problem: P is unknown, so the average cannot be computed — need an "induction principle"
Pattern Recognition
Learn f : X → {±1} from examples
(x1, y1), …, (xm, ym) ∈ X × {±1}, generated i.i.d. from P(x, y),
such that the expected misclassification error on a test set, also drawn from P(x, y),
R[f] = ∫ ½|f(x) − y| dP(x, y),
is minimal (Risk Minimization (RM)).
Problem: P is unknown. → need an induction principle.
Empirical risk minimization (ERM): replace the average over P(x, y) by an average over the training sample, i.e. minimize the training error
Remp[f] = (1/m) ∑_{i=1}^m ½|f(xi) − yi|
Convergence of Means to Expectations
Law of large numbers:
Remp[f ] → R[f ]
as m→ ∞.
Does this imply that empirical risk minimization will give us the optimal result in the limit of infinite sample size ("consistency" of empirical risk minimization)?
No. We need a uniform version of the law of large numbers, uniform over all functions that the learning machine can implement.
Consistency and Uniform Convergence
[Figure: risk R[f] and empirical risk Remp[f] plotted over the function class; the empirical minimizer fm vs. the optimal fopt]
The Importance of the Set of Functions
What about allowing all functions from X to {±1}?
Training set (x1, y1), …, (xm, ym) ∈ X × {±1}
Test patterns x̄1, …, x̄m ∈ X, such that {x̄1, …, x̄m} ∩ {x1, …, xm} = ∅.
For any f there exists f* s.t.:
1. f*(xi) = f(xi) for all i
2. f*(x̄j) ≠ f(x̄j) for all j.
Based on the training set alone, there is no means of choosing which one is better. On the test set, however, they give opposite results. There is 'no free lunch' [24, 56]. → a restriction must be placed on the functions that we allow
Restricting the Class of Functions
Two views:
1. Statistical Learning (VC) Theory: take into account the capacity of the class of functions that the learning machine can implement
2. The Bayesian Way: place prior distributions P(f) over the class of functions
Detailed Analysis
• loss ξi := ½|f(xi) − yi| ∈ {0, 1}
• the ξi are independent Bernoulli trials
• empirical mean (1/m) ∑_{i=1}^m ξi (by definition, equals Remp[f])
• expected value E[ξ] (equals R[f])
Chernoff’s Bound
P{ |(1/m) ∑_{i=1}^m ξi − E[ξ]| ≥ ε } ≤ 2 exp(−2mε²)
• here, P refers to the probability of getting a sample ξ1, …, ξm with the property |(1/m) ∑_{i=1}^m ξi − E[ξ]| ≥ ε (it is a product measure)
Useful corollary: Given a 2m-sample of Bernoulli trials, we have
P{ |(1/m) ∑_{i=1}^m ξi − (1/m) ∑_{i=m+1}^{2m} ξi| ≥ ε } ≤ 4 exp(−mε²/2).
Chernoff’s Bound, II
Translate this back into machine learning terminology: the probability of obtaining an m-sample where the training error and test error differ by more than ε > 0 is bounded by
P{ |Remp[f] − R[f]| ≥ ε } ≤ 2 exp(−2mε²).
• refers to one fixed f
• not allowed to look at the data before choosing f, hence not suitable as a bound on the test error of a learning algorithm using empirical risk minimization
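The bound is easy to probe numerically. A minimal sketch with numpy; the Bernoulli parameter p = 0.3, sample size m = 200, and ε = 0.1 are arbitrary illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
p, m, eps, trials = 0.3, 200, 0.1, 20000

# empirical means of m Bernoulli(p) trials, repeated many times
means = rng.binomial(m, p, size=trials) / m

# observed frequency of a deviation of at least eps vs. the Chernoff bound
deviation_freq = np.mean(np.abs(means - p) >= eps)
bound = 2 * np.exp(-2 * m * eps**2)
assert deviation_freq <= bound
```

For these parameters the bound equals 2e⁻⁴ ≈ 0.037; the observed deviation frequency is far smaller, which illustrates that the bound is loose but holds for any underlying distribution.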
Uniform Convergence (Vapnik & Chervonenkis)
Necessary and sufficient conditions for nontrivial consistency of empirical risk minimization (ERM): one-sided convergence, uniformly over all functions that can be implemented by the learning machine,
lim_{m→∞} P{ sup_{f∈F} (R[f] − Remp[f]) > ε } = 0
for all ε > 0.
• note that this takes into account the whole set of functions that can be implemented by the learning machine
• this is hard to check for a learning machine
Are there properties of learning machines (≡ sets of functions) which ensure uniform convergence of risk?
How to Prove a VC Bound
Take a closer look at P{ sup_{f∈F} (R[f] − Remp[f]) > ε }. Plan:
• if the function class F contains only one function, then Chernoff's bound suffices:
P{ sup_{f∈F} (R[f] − Remp[f]) > ε } ≤ 2 exp(−2mε²).
• if there are finitely many functions, we use the 'union bound'
• even if there are infinitely many, then on any finite sample there are effectively only finitely many (use symmetrization and capacity concepts)
The Case of Two Functions
Suppose F = {f1, f2}. Rewrite
P{ sup_{f∈F} (R[f] − Remp[f]) > ε } = P(C¹ε ∪ C²ε),
where
Cⁱε := { (x1, y1), …, (xm, ym) | (R[fi] − Remp[fi]) > ε }
denotes the event that the risks of fi differ by more than ε. The RHS equals
P(C¹ε ∪ C²ε) = P(C¹ε) + P(C²ε) − P(C¹ε ∩ C²ε) ≤ P(C¹ε) + P(C²ε).
Hence by Chernoff's bound
P{ sup_{f∈F} (R[f] − Remp[f]) > ε } ≤ P(C¹ε) + P(C²ε) ≤ 2 · 2 exp(−2mε²).
The Union Bound
Similarly, if F = {f1, …, fn}, we have
P{ sup_{f∈F} (R[f] − Remp[f]) > ε } = P(C¹ε ∪ ⋯ ∪ Cⁿε),
and
P(C¹ε ∪ ⋯ ∪ Cⁿε) ≤ ∑_{i=1}^n P(Cⁱε).
Use Chernoff for each summand to get an extra factor n in the bound.
Note: this becomes an equality if and only if all the events Cⁱε involved are disjoint.
Infinite Function Classes
• Note: empirical risk only refers to m points. On these points, the functions of F can take at most 2^m values
• for Remp, the function class thus "looks" finite
• how about R?
• need to use a trick
Symmetrization
Lemma 1 (Vapnik & Chervonenkis (e.g., [46, 12])). For mε² ≥ 2 we have
P{ sup_{f∈F} (R[f] − Remp[f]) > ε } ≤ 2 P{ sup_{f∈F} (Remp[f] − R′emp[f]) > ε/2 }
Here, the first P refers to the distribution of iid samples of size m, while the second one refers to iid samples of size 2m. In the latter case, Remp measures the loss on the first half of the sample, and R′emp on the second half.
Shattering Coefficient
• Hence, we only need to consider the maximum size of F on 2m points. Call it N(F, 2m).
• N(F, 2m) = max. number of different outputs (y1, …, y2m) that the function class can generate on 2m points — in other words, the max. number of different ways the function class can separate 2m points into two classes.
• N(F, 2m) ≤ 2^{2m}
• if N(F, 2m) = 2^{2m}, then the function class is said to shatter 2m points.
Putting Everything Together
We now use (1) symmetrization, (2) the shattering coefficient, and (3) the union bound, to get
P{ sup_{f∈F} (R[f] − Remp[f]) > ε }
≤ 2 P{ sup_{f∈F} (Remp[f] − R′emp[f]) > ε/2 }
= 2 P{ (Remp[f1] − R′emp[f1]) > ε/2 ∨ … ∨ (Remp[f_{N(F,2m)}] − R′emp[f_{N(F,2m)}]) > ε/2 }
≤ ∑_{n=1}^{N(F,2m)} 2 P{ (Remp[fn] − R′emp[fn]) > ε/2 }.
ctd.
Use Chernoff’s bound for each term:∗
P
1
m
m∑
i=1
ξi −1
m
2m∑
i=m+1
ξi ≥ ǫ
≤ 2 exp
(
−mǫ2
2
)
.
This yields
P supf∈F
(R[f ]−Remp[f ]) > ǫ ≤ 4N(F, 2m) exp
(
−mǫ2
8
)
.
• provided that N(F, 2m) does not grow exponentially in m, thisis nontrivial
• such bounds are called VC type inequalities
• two types of randomness: (1) the P refers to the drawing ofthe training examples, and (2) R[f ] is an expectation over thedrawing of test examples.
∗ Note that the fi depend on the 2m−sample. A rigorous treatment would need to use a second random
ization over permutations of the 2msample, see [36].
Confidence Intervals
Rewrite the bound: specify the probability with which we want R to be close to Remp, and solve for ε:
With a probability of at least 1 − δ,
R[f] ≤ Remp[f] + √( (8/m) ( ln N(F, 2m) + ln(4/δ) ) ).
This bound holds independent of f; in particular, it holds for the function fm minimizing the empirical risk.
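To get a feel for the rate, one can evaluate the capacity term for a hypothetical function class; here ln N(F, 2m) is bounded via the growth-function bound h(ln(2m/h) + 1) discussed on a later slide, with h = 3 and δ = 0.05 as arbitrary illustrative choices:

```python
import math

def capacity_term(m, ln_N2m, delta):
    """Capacity term sqrt((8/m) * (ln N(F, 2m) + ln(4/delta))) from the VC bound."""
    return math.sqrt((8.0 / m) * (ln_N2m + math.log(4.0 / delta)))

h, delta = 3, 0.05
for m in (100, 1000, 10000):
    ln_N2m = h * (math.log(2 * m / h) + 1)  # growth-function bound for VC dimension h
    print(m, round(capacity_term(m, ln_N2m, delta), 3))
```

The term shrinks roughly like √(ln m / m), so the guarantee tightens only slowly with the sample size.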
Discussion
• tighter bounds are available (better constants etc.)
• cannot minimize the bound over f
• other capacity concepts can be used
VC Entropy
On an example (x, y), f causes a loss
ξ(x, y, f(x)) = ½|f(x) − y| ∈ {0, 1}.
For a larger sample (x1, y1), …, (xm, ym), the different functions f ∈ F lead to a set of loss vectors
ξ_f = (ξ(x1, y1, f(x1)), …, ξ(xm, ym, f(xm))),
whose cardinality we denote by
N(F, (x1, y1), …, (xm, ym)).
The VC entropy is defined as
H_F(m) = E[ln N(F, (x1, y1), …, (xm, ym))],
where the expectation is taken over the random generation of the m-sample (x1, y1), …, (xm, ym) from P.
H_F(m)/m → 0 ⟺ uniform convergence of risks (hence consistency)
Further PR Capacity Concepts
• exchange 'E' and 'ln': annealed entropy.
H_F^ann(m)/m → 0 ⟺ exponentially fast uniform convergence
• take 'max' instead of 'E': growth function. Note that G_F(m) = ln N(F, m).
G_F(m)/m → 0 ⟺ exponential convergence for all underlying distributions P.
G_F(m) = m · ln(2) for all m ⟺ for any m, all loss vectors can be generated, i.e., the m points can be chosen such that, using functions of the learning machine, they can be separated in all 2^m possible ways (shattered).
Structure of the Growth Function
Either G_F(m) = m · ln(2) for all m ∈ N,
or there exists some maximal m for which the above is possible. Call this number the VC dimension, and denote it by h. For m > h,
G_F(m) ≤ h ( ln(m/h) + 1 ).
Nothing "in between" linear growth and logarithmic growth is possible.
VCDimension: Example
Halfspaces in R²:
f(x, y) = sgn(a + bx + cy), with parameters a, b, c ∈ R
• Clearly, we can shatter three non-collinear points.
• But we can never shatter four points.
• Hence the VC dimension is h = 3 (in this case, equal to the number of parameters)
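Both geometric claims can be verified by brute force: realizing a given labeling with a halfspace sgn(a + bx + cy) is a linear feasibility problem. A sketch using numpy and scipy; the specific point sets are illustrative choices:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def shatterable(points):
    """Check whether halfspaces sgn(a + b*x + c*y) realize every labeling of the points."""
    pts = np.asarray(points, dtype=float)
    X = np.hstack([np.ones((len(pts), 1)), pts])  # rows (1, x, y)
    for labels in itertools.product([-1.0, 1.0], repeat=len(pts)):
        y = np.array(labels)
        # feasibility of y_i * (a + b x_i + c y_i) >= 1 for all i
        res = linprog(c=[0, 0, 0], A_ub=-(y[:, None] * X), b_ub=-np.ones(len(pts)),
                      bounds=[(None, None)] * 3, method="highs")
        if res.status != 0:
            return False
    return True

print(shatterable([(0, 0), (1, 0), (0, 1)]))          # three non-collinear points: True
print(shatterable([(0, 0), (1, 1), (1, 0), (0, 1)]))  # XOR configuration: False
```

The four-point result depends on the configuration only insofar as no four points in the plane can be shattered; the XOR square is the classic witness.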
A Typical Bound for Pattern Recognition
For any f ∈ F and m > h, with a probability of at least 1 − δ,
R[f] ≤ Remp[f] + √( ( h(log(2m/h) + 1) − log(δ/4) ) / m )
holds.
• does this mean that we can learn anything?
• The study of the consistency of ERM has thus led to concepts and results which let us formulate another induction principle (structural risk minimization)
SRM
[Figure: error vs. capacity for a nested structure S_{n−1} ⊂ S_n ⊂ S_{n+1}: the training error decreases with capacity, the capacity term increases, and the bound on the test error R(f*) is minimized at an intermediate element of the structure]
Finding a Good Function Class
• recall: separating hyperplanes in R² have a VC dimension of 3.
• more generally: separating hyperplanes in R^N have a VC dimension of N + 1.
• hence: separating hyperplanes in high-dimensional feature spaces have extremely large VC dimension, and may not generalize well
• however, margin hyperplanes can still have a small VC dimension
Kernels and Feature Spaces
Preprocess the data with
Φ : X → H
x ↦ Φ(x),
where H is a dot product space, and learn the mapping from Φ(x) to y [6].
• usually, dim(X) ≪ dim(H)
• "Curse of Dimensionality"?
• crucial issue: capacity, not dimensionality
Example: All Degree 2 Monomials
Φ : R² → R³
(x1, x2) ↦ (z1, z2, z3) := (x1², √2 x1x2, x2²)
[Figure: data in input coordinates (x1, x2) and mapped to feature coordinates (z1, z2, z3)]
General Product Feature Space
How about patterns x ∈ R^N and product features of order d? Here, dim(H) grows like N^d.
E.g. N = 16 × 16 and d = 5 → dimension 10¹⁰
The Kernel Trick, N = d = 2
⟨Φ(x), Φ(x′)⟩ = (x1², √2 x1x2, x2²)(x′1², √2 x′1x′2, x′2²)ᵀ
= ⟨x, x′⟩² =: k(x, x′)
→ the dot product in H can be computed in R²
The Kernel Trick, II
More generally: x, x′ ∈ R^N, d ∈ N:
⟨x, x′⟩^d = ( ∑_{j=1}^N xj · x′j )^d = ∑_{j1,…,jd=1}^N xj1 ⋯ xjd · x′j1 ⋯ x′jd = ⟨Φ(x), Φ(x′)⟩,
where Φ maps into the space spanned by all ordered products of d input directions
Mercer’s Theorem
If k is a continuous kernel of a positive definite integral operator on L₂(X) (where X is some compact space), i.e.,
∫_X k(x, x′) f(x) f(x′) dx dx′ ≥ 0,
it can be expanded as
k(x, x′) = ∑_{i=1}^∞ λi ψi(x) ψi(x′)
using eigenfunctions ψi and eigenvalues λi ≥ 0 [30].
The Mercer Feature Map
In that case
Φ(x) := (√λ1 ψ1(x), √λ2 ψ2(x), …)ᵀ
satisfies ⟨Φ(x), Φ(x′)⟩ = k(x, x′).
Proof:
⟨Φ(x), Φ(x′)⟩ = ∑_{i=1}^∞ λi ψi(x) ψi(x′) = k(x, x′)
Positive Definite Kernels
It can be shown that the admissible class of kernels coincides with that of positive definite (pd) kernels: kernels which are symmetric (i.e., k(x, x′) = k(x′, x)) and, for
• any set of training points x1, …, xm ∈ X and
• any a1, …, am ∈ R,
satisfy
∑_{i,j} ai aj Kij ≥ 0, where Kij := k(xi, xj).
K is called the Gram matrix or kernel matrix.
If, for pairwise distinct points, ∑_{i,j} ai aj Kij = 0 ⟹ a = 0, we call it strictly positive definite.
The Kernel Trick — Summary
• any algorithm that only depends on dot products can benefit from the kernel trick
• this way, we can apply linear methods to vectorial as well as non-vectorial data
• think of the kernel as a nonlinear similarity measure
• examples of common kernels:
Polynomial: k(x, x′) = (⟨x, x′⟩ + c)^d
Gaussian: k(x, x′) = exp(−‖x − x′‖² / (2σ²))
• Kernels are also known as covariance functions [54, 52, 55, 29]
Properties of PD Kernels, 1
Assumption: Φ maps X into a dot product space H; x, x′ ∈ X
Kernels from Feature Maps.
k(x, x′) := ⟨Φ(x), Φ(x′)⟩ is a pd kernel on X × X.
Kernels from Feature Maps, II.
K(A, B) := ∑_{x∈A, x′∈B} k(x, x′),
where A, B are finite subsets of X, is also a pd kernel
(Hint: use the feature map Φ(A) := ∑_{x∈A} Φ(x))
Properties of PD Kernels, 2 [36, 39]
Assumption: k, k1, k2, … are pd; x, x′ ∈ X
k(x, x) ≥ 0 for all x (Positivity on the Diagonal)
k(x, x′)² ≤ k(x, x) k(x′, x′) (Cauchy-Schwarz Inequality)
(Hint: compute the determinant of the Gram matrix)
k(x, x) = 0 for all x ⟹ k(x, x′) = 0 for all x, x′ (Vanishing Diagonals)
The following kernels are pd:
• αk, provided α ≥ 0
• k1 + k2
• k(x, x′) := lim_{n→∞} kn(x, x′), provided it exists
• k1 · k2
• tensor products, direct sums, convolutions [22]
The Feature Space for PD Kernels [4, 1, 35]
• define a feature map
Φ : X → R^X
x ↦ k(·, x).
[Figure: for the Gaussian kernel, Φ(x) and Φ(x′) are bumps centred at x and x′]
Next steps:
• turn Φ(X) into a linear space
• endow it with a dot product satisfying ⟨Φ(x), Φ(x′)⟩ = k(x, x′), i.e., ⟨k(·, x), k(·, x′)⟩ = k(x, x′)
• complete the space to get a reproducing kernel Hilbert space
Turn it Into a Linear Space
Form linear combinations
f(·) = ∑_{i=1}^m αi k(·, xi),
g(·) = ∑_{j=1}^{m′} βj k(·, x′j)
(m, m′ ∈ N, αi, βj ∈ R, xi, x′j ∈ X).
Endow it With a Dot Product
⟨f, g⟩ := ∑_{i=1}^m ∑_{j=1}^{m′} αi βj k(xi, x′j) = ∑_{i=1}^m αi g(xi) = ∑_{j=1}^{m′} βj f(x′j)
• This is well-defined, symmetric, and bilinear (more later).
• So far, it also works for non-pd kernels
The Reproducing Kernel Property
Two special cases:
• Assume
f(·) = k(·, x).
In this case, we have
⟨k(·, x), g⟩ = g(x).
• If moreover
g(·) = k(·, x′),
we have
⟨k(·, x), k(·, x′)⟩ = k(x, x′).
k is called a reproducing kernel
(up to here, we have not used positive definiteness)
Endow it With a Dot Product, II
• It can be shown that ⟨·, ·⟩ is a p.d. kernel on the set of functions {f(·) = ∑_{i=1}^m αi k(·, xi) | αi ∈ R, xi ∈ X}:
∑_{ij} γi γj ⟨fi, fj⟩ = ⟨ ∑_i γi fi, ∑_j γj fj ⟩ =: ⟨f, f⟩ = ⟨ ∑_i αi k(·, xi), ∑_i αi k(·, xi) ⟩ = ∑_{ij} αi αj k(xi, xj) ≥ 0
• furthermore, it is strictly positive definite:
f(x)² = ⟨f, k(·, x)⟩² ≤ ⟨f, f⟩ ⟨k(·, x), k(·, x)⟩,
hence ⟨f, f⟩ = 0 implies f = 0.
• Complete the space in the corresponding norm to get a Hilbert space Hk.
The Empirical Kernel Map
Recall the feature map
Φ : X → R^X
x ↦ k(·, x).
• each point is represented by its similarity to all other points
• how about representing it by its similarity to a sample of points?
Consider
Φm : X → R^m
x ↦ (k(x1, x), …, k(xm, x))ᵀ
ctd.
• Φm(x1), …, Φm(xm) contain all necessary information about Φ(x1), …, Φ(xm)
• the Gram matrix Gij := ⟨Φm(xi), Φm(xj)⟩ satisfies G = K², where Kij = k(xi, xj)
• modify Φm to
Φʷm : X → R^m
x ↦ K^(−1/2) (k(x1, x), …, k(xm, x))ᵀ
• this "whitened" map ("kernel PCA map") satisfies
⟨Φʷm(xi), Φʷm(xj)⟩ = k(xi, xj)
for all i, j = 1, …, m.
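Both identities can be confirmed numerically; a sketch with numpy and a Gaussian kernel (bandwidth 1, ten random points — arbitrary choices), computing K^(−1/2) via an eigendecomposition:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))

# Gaussian kernel matrix K_ij = k(x_i, x_j)
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / 2)

Phi_m = K.copy()                       # row i is (k(x_1, x_i), ..., k(x_m, x_i))
assert np.allclose(Phi_m @ Phi_m.T, K @ K)   # Gram matrix G = K^2

w, V = np.linalg.eigh(K)               # K is symmetric, strictly pd for distinct points
K_inv_half = V @ np.diag(w ** -0.5) @ V.T
Phi_w = K_inv_half @ K                 # column i is the whitened map of x_i
assert np.allclose(Phi_w.T @ Phi_w, K)       # recovers k(x_i, x_j)
```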
An Example of a Kernel Algorithm
Idea: classify points x := Φ(x) in feature space according to which of the two class means is closer.
c+ := (1/m+) ∑_{yi=+1} Φ(xi),  c− := (1/m−) ∑_{yi=−1} Φ(xi)
[Figure: the two class means, the vector w := c+ − c−, and the midpoint c between the means]
Compute the sign of the dot product between w := c+ − c− and x − c.
An Example of a Kernel Algorithm, ctd. [36]
f(x) = sgn( (1/m+) ∑_{i:yi=+1} ⟨Φ(x), Φ(xi)⟩ − (1/m−) ∑_{i:yi=−1} ⟨Φ(x), Φ(xi)⟩ + b )
= sgn( (1/m+) ∑_{i:yi=+1} k(x, xi) − (1/m−) ∑_{i:yi=−1} k(x, xi) + b ),
where
b = ½ ( (1/m−²) ∑_{(i,j):yi=yj=−1} k(xi, xj) − (1/m+²) ∑_{(i,j):yi=yj=+1} k(xi, xj) ).
• provides a geometric interpretation of Parzen windows
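The classifier on this slide is straightforward to implement directly. A sketch with numpy (the Gaussian kernel, the bandwidth, and the synthetic two-blob data are arbitrary illustrative choices):

```python
import numpy as np

def gauss_kernel(A, B, sigma=1.0):
    """Pairwise Gaussian kernel values between rows of A and rows of B."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma**2))

def mean_classifier(X, y, X_test, sigma=1.0):
    """Classify by which kernelized class mean is closer (the slide's f(x))."""
    Kp = gauss_kernel(X_test, X[y == +1], sigma)
    Kn = gauss_kernel(X_test, X[y == -1], sigma)
    Kpp = gauss_kernel(X[y == +1], X[y == +1], sigma)
    Knn = gauss_kernel(X[y == -1], X[y == -1], sigma)
    b = 0.5 * (Knn.mean() - Kpp.mean())          # offset b from the slide
    return np.sign(Kp.mean(axis=1) - Kn.mean(axis=1) + b)

rng = np.random.default_rng(0)
Xp = rng.normal(loc=+2, size=(30, 2))
Xn = rng.normal(loc=-2, size=(30, 2))
X = np.vstack([Xp, Xn])
y = np.r_[np.ones(30), -np.ones(30)]
pred = mean_classifier(X, y, X, sigma=2.0)
print((pred == y).mean())   # training accuracy on two well-separated blobs
```

With a Gaussian kernel, the two kernel means are exactly Parzen window density estimates of the classes, which is the geometric interpretation mentioned above.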
An Example of a Kernel Algorithm, ctd.
• Demo
• Exercise: derive the Parzen windows classifier by computing the distance criterion directly
• SVMs (ppt)
An example of a kernel algorithm, revisited
[Figure: RKHS means µ(X) (positive class, +) and µ(Y) (negative class, o), with w pointing from µ(Y) to µ(X)]
X compact subset of a separable metric space, m, n ∈ N.
Positive class X := {x1, …, xm} ⊂ X
Negative class Y := {y1, …, yn} ⊂ X
RKHS means µ(X) = (1/m) ∑_{i=1}^m k(xi, ·), µ(Y) = (1/n) ∑_{i=1}^n k(yi, ·).
We get a problem if µ(X) = µ(Y)!
When do the means coincide?
k(x, x′) = ⟨x, x′⟩: the means coincide
k(x, x′) = (⟨x, x′⟩ + 1)^d: all empirical moments up to order d coincide
k strictly pd: X = Y.
The mean "remembers" each point that contributed to it.
Proposition 2. Assume X, Y are defined as above, k is strictly pd, and for all i, j, xi ≠ xj and yi ≠ yj. If for some αi, βj ∈ R ∖ {0} we have
∑_{i=1}^m αi k(xi, ·) = ∑_{j=1}^n βj k(yj, ·),   (1)
then X = Y.
Proof (by contradiction)
W.l.o.g., assume that x1 ∉ Y. Subtract ∑_{j=1}^n βj k(yj, ·) from (1), and make it a sum over pairwise distinct points, to get
0 = ∑_i γi k(zi, ·),
where z1 = x1, γ1 = α1 ≠ 0, and z2, … ∈ (X ∪ Y) ∖ {x1}, γ2, … ∈ R.
Take the RKHS dot product with ∑_j γj k(zj, ·) to get
0 = ∑_{ij} γi γj k(zi, zj),
with γ ≠ 0; hence k cannot be strictly pd.
The mean map
µ : X = (x1, …, xm) ↦ (1/m) ∑_{i=1}^m k(xi, ·)
satisfies
⟨µ(X), f⟩ = ⟨(1/m) ∑_{i=1}^m k(xi, ·), f⟩ = (1/m) ∑_{i=1}^m f(xi)
and
‖µ(X) − µ(Y)‖ = sup_{‖f‖≤1} ⟨µ(X) − µ(Y), f⟩ = sup_{‖f‖≤1} | (1/m) ∑_{i=1}^m f(xi) − (1/n) ∑_{i=1}^n f(yi) |.
Note: a large distance means we can find a function distinguishing the samples
Witness function
f = (µ(X) − µ(Y)) / ‖µ(X) − µ(Y)‖, thus f(x) ∝ ⟨µ(X) − µ(Y), k(x, ·)⟩:
[Figure: witness function f for Gauss and Laplace data — the two probability densities and f plotted over X]
This function is in the RKHS of a Gaussian kernel, but not in the RKHS of the linear kernel.
The mean map for measures
p, q Borel probability measures,
E_{x,x′∼p}[k(x, x′)], E_{x,x′∼q}[k(x, x′)] < ∞ (‖k(x, ·)‖ ≤ M < ∞ is sufficient)
Define
µ : p ↦ E_{x∼p}[k(x, ·)].
Note
⟨µ(p), f⟩ = E_{x∼p}[f(x)]
and
‖µ(p) − µ(q)‖ = sup_{‖f‖≤1} | E_{x∼p}[f(x)] − E_{x∼q}[f(x)] |.
Recall that in the finite sample case, for strictly p.d. kernels, µ was injective — how about now? [43, 17]
Theorem 3 [15, 13]
p = q ⟺ sup_{f∈C(X)} | E_{x∼p}(f(x)) − E_{x∼q}(f(x)) | = 0,
where C(X) is the space of continuous bounded functions on X.
Combine this with
‖µ(p) − µ(q)‖ = sup_{‖f‖≤1} | E_{x∼p}[f(x)] − E_{x∼q}[f(x)] |.
Replace C(X) by the unit ball in an RKHS that is dense in C(X) — universal kernel [45], e.g., Gaussian.
Theorem 4 [19] If k is universal, then
p = q ⟺ ‖µ(p) − µ(q)‖ = 0.
• µ is invertible on its image M = {µ(p) | p is a probability distribution} (the "marginal polytope", [53])
• generalization of the moment generating function of a RV x with distribution p:
M_p(·) = E_{x∼p}[ e^{⟨x, ·⟩} ].
This provides us with a convenient metric on probability distributions, which can be used to check whether two distributions are different — provided that µ is invertible.
Fourier Criterion
Assume we have densities, the kernel is shift invariant (k(x, y) =k(x− y)), and all Fourier transforms below exist.Note that µ is invertible iff
∫
k(x− y)p(y) dy =
∫
k(x− y)q(y) dy =⇒ p = q,
i.e.,k(p− q) = 0 =⇒ p = q
(Sriperumbudur et al., 2008)
E.g., µ is invertible if k has full support. Restricting the class ofdistributions, weaker conditions suffice (e.g., if k has nonempty interior, µ is invertible for all distributions with compact support).
Fourier Optics
Application: p source of incoherent light, I indicator function of a finite aperture. In Fraunhofer diffraction, the intensity image is ∝ p ∗ |Î|². Set k = |Î|²; then this image equals µ(p). This k̂ does not have full support, thus the imaging process is not invertible for the class of all light sources (Abbe), but it is if we restrict the class (e.g., to compact support).
Application 1: Two-sample problem [19]
X, Y i.i.d. m-samples from p, q, respectively.
‖µ(p) − µ(q)‖² = E_{x,x′∼p}[k(x, x′)] − 2 E_{x∼p, y∼q}[k(x, y)] + E_{y,y′∼q}[k(y, y′)]
= E_{x,x′∼p, y,y′∼q}[h((x, y), (x′, y′))]
with h((x, y), (x′, y′)) := k(x, x′) − k(x, y′) − k(y, x′) + k(y, y′).
Define
D(p, q)² := E_{x,x′∼p, y,y′∼q}[h((x, y), (x′, y′))]
D(X, Y)² := (1/(m(m−1))) Σ_{i≠j} h((xᵢ, yᵢ), (xⱼ, yⱼ)).
D(X, Y)² is an unbiased estimator of D(p, q)². It’s easy to compute, and works on structured data.
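A minimal numerical sketch of this estimator (the Gaussian kernel, the sample sizes, and the toy distributions below are illustrative choices, not part of the slides):

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2)).
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    # D(X, Y)^2 = (1 / (m (m - 1))) * sum_{i != j} h((x_i, y_i), (x_j, y_j)),
    # with h((x, y), (x', y')) = k(x, x') - k(x, y') - k(y, x') + k(y, y').
    m = X.shape[0]
    H = (gaussian_kernel(X, X, sigma) - gaussian_kernel(X, Y, sigma)
         - gaussian_kernel(X, Y, sigma).T + gaussian_kernel(Y, Y, sigma))
    return (H.sum() - np.trace(H)) / (m * (m - 1))  # drop the i = j terms

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 1))
Y = rng.normal(3.0, 1.0, size=(200, 1))  # different distribution: large D^2
Z = rng.normal(0.0, 1.0, size=(200, 1))  # same distribution as X: D^2 near 0
assert mmd2_unbiased(X, Y) > mmd2_unbiased(X, Z)
```

For two identical samples the statistic vanishes exactly, since the four terms of h cancel pairwise.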
Theorem 5 Assume k is bounded. D(X, Y)² converges to D(p, q)² in probability with rate O(m^{−1/2}).
This could be used as a basis for a test, but uniform convergence bounds are often loose.
Theorem 6 We assume E[h²] < ∞. When p ≠ q, then √m (D(X, Y)² − D(p, q)²) converges in distribution to a zero mean Gaussian with variance
σ²_u = 4 ( E_z[(E_{z′} h(z, z′))²] − [E_{z,z′} h(z, z′)]² ).
When p = q, then m (D(X, Y)² − D(p, q)²) = m D(X, Y)² converges in distribution to
Σ_{l=1}^∞ λ_l [q²_l − 2],   (2)
where q_l ∼ N(0, 2) i.i.d., the λᵢ are the solutions to the eigenvalue equation
∫_X k̃(x, x′) ψᵢ(x) dp(x) = λᵢ ψᵢ(x′),
and k̃(xᵢ, xⱼ) := k(xᵢ, xⱼ) − E_x k(xᵢ, x) − E_x k(x, xⱼ) + E_{x,x′} k(x, x′) is the centred RKHS kernel.
Application 2: Dependence Measures
Assume that (x, y) are drawn from p_{xy}, with marginals p_x, p_y.
Want to know whether p_{xy} factorizes.
[2, 16]: kernel generalized variance; [20, 21]: kernel constrained covariance, HSIC.
Main idea [25, 34]: x and y independent ⟺ for all bounded continuous functions f, g, we have Cov(f(x), g(y)) = 0.
k kernel on X × Y.
µ(p_{xy}) := E_{(x,y)∼p_{xy}}[k((x, y), ·)]
µ(p_x × p_y) := E_{x∼p_x, y∼p_y}[k((x, y), ·)].
Use ∆ := ‖µ(p_{xy}) − µ(p_x × p_y)‖ as a measure of dependence.
For k((x, y), (x′, y′)) = k_x(x, x′) k_y(y, y′): ∆² equals the Hilbert-Schmidt norm of the covariance operator between the two RKHSs (HSIC), with empirical estimate m⁻² tr(H K_x H K_y), where H = I − (1/m)11⊤ [20, 44].
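The empirical estimate m⁻² tr(H Kx H Ky) can be sketched directly; the Gaussian kernels and the toy data below are illustrative assumptions:

```python
import numpy as np

def hsic(Kx, Ky):
    # Empirical HSIC: m^{-2} tr(H Kx H Ky), with centering matrix H = I - (1/m) 1 1^T.
    m = Kx.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(H @ Kx @ H @ Ky) / m**2

def gauss_gram(x, sigma=1.0):
    # Gram matrix of a 1-d sample under a Gaussian kernel.
    d2 = (x[:, None] - x[None, :])**2
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y_dep = x + 0.1 * rng.normal(size=500)  # strongly dependent on x
y_ind = rng.normal(size=500)            # independent of x
assert hsic(gauss_gram(x), gauss_gram(y_dep)) > hsic(gauss_gram(x), gauss_gram(y_ind))
```

Under independence the statistic decays at rate O(1/m), while under dependence it stays bounded away from zero.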
Witness function of the equivalent optimisation problem:
[Figure: dependence witness and sample, plotted over X and Y.]
Application: learning causal structures (Sun et al., ICML 2007; Fukumizu et al., NIPS 2007).
Application 3: Covariate Shift Correction and Local Learning
Training set X = (x₁, y₁), . . . , (x_m, y_m) drawn from p; test set X′ = (x′₁, y′₁), . . . , (x′_n, y′_n) from p′ ≠ p.
Assume p_{y|x} = p′_{y|x}.
[40]: reweight training set.
Minimize
‖ Σ_{i=1}^m βᵢ k(xᵢ, ·) − µ(X′) ‖² + λ‖β‖²₂ subject to βᵢ ≥ 0, Σᵢ βᵢ = 1.
Equivalent QP:
minimize_β (1/2) β⊤(K + λI)β − β⊤l
subject to βᵢ ≥ 0 and Σᵢ βᵢ = 1,
where K_{ij} := k(xᵢ, xⱼ), lᵢ = 〈k(xᵢ, ·), µ(X′)〉.
Experiments show that in underspecified situations (e.g., large kernel widths), this helps [23].
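A small sketch of this QP using a generic constrained solver (scipy's SLSQP here; the Gaussian kernel, its width, λ, and the toy shifted data are illustrative assumptions, not the experimental setup of [23]):

```python
import numpy as np
from scipy.optimize import minimize

def kmm_weights(Xtr, Xte, sigma=1.0, lam=0.1):
    # minimize (1/2) b^T (K + lam I) b - b^T l  s.t.  b_i >= 0, sum_i b_i = 1,
    # where K_ij = k(x_i, x_j) and l_i = <k(x_i, .), mu(X')>.
    def k(A, B):
        d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
        return np.exp(-d2 / (2 * sigma**2))
    m = Xtr.shape[0]
    K = k(Xtr, Xtr)
    l = k(Xtr, Xte).mean(axis=1)  # empirical <k(x_i, .), mu(X')>
    obj = lambda b: 0.5 * b @ (K + lam * np.eye(m)) @ b - b @ l
    cons = ({'type': 'eq', 'fun': lambda b: b.sum() - 1},)
    res = minimize(obj, np.full(m, 1 / m), bounds=[(0, None)] * m, constraints=cons)
    return res.x

rng = np.random.default_rng(2)
Xtr = rng.normal(0.0, 1.0, (60, 1))   # training distribution p
Xte = rng.normal(1.5, 0.5, (40, 1))   # shifted test distribution p'
beta = kmm_weights(Xtr, Xte)
# the reweighted training mean moves toward the test region
assert abs((beta * Xtr[:, 0]).sum() - 1.5) < abs(Xtr[:, 0].mean() - 1.5)
```

Training points lying where the test distribution has mass receive large weights; points far from the test region are weighted down toward zero.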
X′ = {x′} leads to a local sample weighting scheme.
The Representer Theorem
Theorem 7 Given: a p.d. kernel k on X × X, a training set (x₁, y₁), . . . , (x_m, y_m) ∈ X × R, a strictly monotonically increasing real-valued function Ω on [0, ∞[, and an arbitrary cost function c : (X × R²)^m → R ∪ {∞}. Any f ∈ H_k minimizing the regularized risk functional
c((x₁, y₁, f(x₁)), . . . , (x_m, y_m, f(x_m))) + Ω(‖f‖)   (3)
admits a representation of the form
f(·) = Σ_{i=1}^m αᵢ k(xᵢ, ·).
Remarks
• significance: many learning algorithms have solutions that can be expressed as expansions in terms of the training examples
• original form, with mean squared loss
c((x₁, y₁, f(x₁)), . . . , (x_m, y_m, f(x_m))) = (1/m) Σ_{i=1}^m (yᵢ − f(xᵢ))²,
and Ω(‖f‖) = λ‖f‖² (λ > 0): [27]
• generalization to non-quadratic cost functions: [10]
• present form: [36]
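For the squared-loss case just cited, the expansion coefficients have a closed form: substituting f = Σᵢ αᵢ k(xᵢ, ·) gives α = (K + λmI)⁻¹ y. A sketch (the Gaussian kernel and toy data are illustrative assumptions):

```python
import numpy as np

def gram(A, B, sigma=1.0):
    # Gaussian Gram matrix K_ij = k(a_i, b_j).
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def krr_fit(X, y, lam=1e-3):
    # Minimizing (1/m) sum_i (y_i - f(x_i))^2 + lam ||f||^2 over f in H_k
    # gives f(.) = sum_i alpha_i k(x_i, .) with alpha = (K + lam*m*I)^{-1} y.
    m = len(y)
    return np.linalg.solve(gram(X, X) + lam * m * np.eye(m), y)

X = np.linspace(-3, 3, 40)[:, None]
y = np.sin(X[:, 0])
alpha = krr_fit(X, y)
yhat = gram(X, X) @ alpha  # f evaluated at the training points
assert np.max(np.abs(yhat - y)) < 0.2
```

New points are handled by the same expansion: f(x) = Σᵢ αᵢ k(xᵢ, x), i.e., gram(Xnew, X) @ alpha.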
Proof
Decompose f ∈ H into a part in the span of the k(xᵢ, ·) and an orthogonal one:
f = Σᵢ αᵢ k(xᵢ, ·) + f⊥, where for all j, 〈f⊥, k(xⱼ, ·)〉 = 0.
Application of f to an arbitrary training point xⱼ yields
f(xⱼ) = 〈f, k(xⱼ, ·)〉 = 〈Σᵢ αᵢ k(xᵢ, ·) + f⊥, k(xⱼ, ·)〉 = Σᵢ αᵢ 〈k(xᵢ, ·), k(xⱼ, ·)〉,
independent of f⊥.
Proof: second part of (3)
Since f⊥ is orthogonal to Σᵢ αᵢ k(xᵢ, ·), and Ω is strictly monotonic, we get
Ω(‖f‖) = Ω(‖Σᵢ αᵢ k(xᵢ, ·) + f⊥‖) = Ω(√(‖Σᵢ αᵢ k(xᵢ, ·)‖² + ‖f⊥‖²)) ≥ Ω(‖Σᵢ αᵢ k(xᵢ, ·)‖),   (4)
with equality occurring if and only if f⊥ = 0. Hence, any minimizer must have f⊥ = 0. Consequently, any solution takes the form
f = Σᵢ αᵢ k(xᵢ, ·).
Application: Support Vector Classification
Here, yᵢ ∈ {±1}. Use
c((xᵢ, yᵢ, f(xᵢ))ᵢ) = (1/λ) Σᵢ max(0, 1 − yᵢ f(xᵢ)),
and the regularizer Ω(‖f‖) = ‖f‖². λ → 0 leads to the hard margin SVM.
Further Applications
Bayesian MAP Estimates. Identify (3) with the negative log posterior (cf. Kimeldorf & Wahba, 1970, Poggio & Girosi, 1990), i.e.
• exp(−c((xᵢ, yᵢ, f(xᵢ))ᵢ)) — likelihood of the data
• exp(−Ω(‖f‖)) — prior over the set of functions; e.g., Ω(‖f‖) = λ‖f‖² — Gaussian process prior [55] with covariance function k
• minimizer of (3) = MAP estimate
Kernel PCA (see below) can be shown to correspond to the case of
c((xᵢ, yᵢ, f(xᵢ))_{i=1,...,m}) = 0 if (1/m) Σᵢ (f(xᵢ) − (1/m) Σⱼ f(xⱼ))² = 1, and ∞ otherwise,
with Ω an arbitrary strictly monotonically increasing function.
Conclusion
• the kernel corresponds to
– a similarity measure for the data, or
– a (linear) representation of the data, or
– a hypothesis space for learning
• kernels allow the formulation of a multitude of geometrical algorithms (Parzen windows, two-sample tests, SVMs, kernel PCA, . . . )
Kernel PCA [37]
[Figure: linear PCA in R² with kernel k(x, y) = 〈x, y〉, versus kernel PCA via a feature map Φ : R² → H with, e.g., k(x, y) = 〈x, y〉^d.]
Kernel PCA, II
x₁, . . . , x_m ∈ X, Φ : X → H, C = (1/m) Σ_{j=1}^m Φ(xⱼ)Φ(xⱼ)⊤
Eigenvalue problem
λV = CV = (1/m) Σ_{j=1}^m 〈Φ(xⱼ), V〉 Φ(xⱼ).
For λ ≠ 0, V ∈ span{Φ(x₁), . . . , Φ(x_m)}, thus
V = Σ_{i=1}^m αᵢ Φ(xᵢ),
and the eigenvalue problem can be written as
λ 〈Φ(x_n), V〉 = 〈Φ(x_n), CV〉 for all n = 1, . . . , m.
Kernel PCA in Dual Variables
In terms of the m × m Gram matrix
K_{ij} := 〈Φ(xᵢ), Φ(xⱼ)〉 = k(xᵢ, xⱼ),
this leads to
mλKα = K²α,
where α = (α₁, . . . , α_m)⊤.
Solve
mλα = Kα → (λ_n, αⁿ)
〈Vⁿ, Vⁿ〉 = 1 ⟺ λ_n 〈αⁿ, αⁿ〉 = 1,
thus divide αⁿ by √λ_n.
Feature extraction
Compute projections on the eigenvectors
Vⁿ = Σ_{i=1}^m αᵢⁿ Φ(xᵢ)
in H: for a test point x with image Φ(x) in H we get the features
〈Vⁿ, Φ(x)〉 = Σ_{i=1}^m αᵢⁿ 〈Φ(xᵢ), Φ(x)〉 = Σ_{i=1}^m αᵢⁿ k(xᵢ, x).
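The dual eigenvalue problem and this feature extraction step fit in a few lines (Gaussian kernel and the uncentred variant as on the slides; the data are an illustrative assumption):

```python
import numpy as np

def kernel_pca_features(X, n_comp=2, sigma=1.0):
    # Solve the dual eigenvalue problem K alpha = (m lambda) alpha, rescale each
    # eigenvector so that <V^n, V^n> = alpha^T K alpha = 1, and return the
    # features of the training points, <V^n, Phi(x_i)> = (K alpha^n)_i.
    d2 = (X**2).sum(1)[:, None] + (X**2).sum(1)[None, :] - 2 * X @ X.T
    K = np.exp(-d2 / (2 * sigma**2))
    evals, evecs = np.linalg.eigh(K)            # ascending eigenvalue order
    evals, evecs = evals[::-1], evecs[:, ::-1]  # largest eigenvalues first
    alpha = evecs[:, :n_comp] / np.sqrt(evals[:n_comp])
    return K @ alpha

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
F = kernel_pca_features(X)
assert F.shape == (100, 2)
```

For a new test point x, the same coefficients are reused: its features are (k(x₁, x), . . . , k(x_m, x)) @ alpha.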
The Kernel PCA Map
Recall
Φ_wm : X → R^m, x ↦ K^{−1/2} (k(x₁, x), . . . , k(x_m, x))⊤.
If K = UDU⊤ is K’s diagonalization, then K^{−1/2} = UD^{−1/2}U⊤. Thus we have
Φ_wm(x) = UD^{−1/2}U⊤ (k(x₁, x), . . . , k(x_m, x))⊤.
We can drop the leading U (since it leaves the dot product invariant) to get a map
Φ_wKPCA(x) = D^{−1/2}U⊤ (k(x₁, x), . . . , k(x_m, x))⊤.
The rows of U⊤ are the eigenvectors αⁿ of K, and the entries of the diagonal matrix D^{−1/2} equal λᵢ^{−1/2}.
Toy Example with Gaussian Kernel
k(x, x′) = exp(−‖x − x′‖²)
Super-Resolution (Kim, Franz, & Scholkopf, 2004)
[Figure: comparison between different super-resolution methods — a. original image of resolution 528 × 396; b. low resolution image (264 × 198) stretched to the original scale; c. bicubic interpolation; d. supervised example-based learning based on a nearest neighbor classifier; f. unsupervised KPCA reconstruction; g. enlarged portions of a–d and f (from left to right).]
Support Vector Classifiers
[Figure: the feature map Φ from input space to feature space. [6]]
Separating Hyperplane
[Figure: a hyperplane {x | 〈w, x〉 + b = 0} with normal vector w, separating the half-spaces {x | 〈w, x〉 + b > 0} and {x | 〈w, x〉 + b < 0}.]
Optimal Separating Hyperplane [50]
[Figure: the maximum margin hyperplane {x | 〈w, x〉 + b = 0} with normal vector w.]
Eliminating the Scaling Freedom [47]
Note: if c ≠ 0, then
{x | 〈w, x〉 + b = 0} = {x | 〈cw, x〉 + cb = 0}.
Hence (cw, cb) describes the same hyperplane as (w, b).
Definition: The hyperplane is in canonical form w.r.t. X* = {x₁, . . . , x_r} if min_{xᵢ∈X*} |〈w, xᵢ〉 + b| = 1.
Canonical Optimal Hyperplane
[Figure: the hyperplane {x | 〈w, x〉 + b = 0} together with {x | 〈w, x〉 + b = +1} (through x₁, on the yᵢ = +1 side) and {x | 〈w, x〉 + b = −1} (through x₂, on the yᵢ = −1 side).]
Note:
〈w, x₁〉 + b = +1 and 〈w, x₂〉 + b = −1
⟹ 〈w, (x₁ − x₂)〉 = 2
⟹ 〈w/‖w‖, (x₁ − x₂)〉 = 2/‖w‖.
Canonical Hyperplanes [47]
Note: if c ≠ 0, then
{x | 〈w, x〉 + b = 0} = {x | 〈cw, x〉 + cb = 0}.
Hence (cw, cb) describes the same hyperplane as (w, b).
Definition: The hyperplane is in canonical form w.r.t. X* = {x₁, . . . , x_r} if min_{xᵢ∈X*} |〈w, xᵢ〉 + b| = 1.
Note that for canonical hyperplanes, the distance of the closest point to the hyperplane (“margin”) is 1/‖w‖:
min_{xᵢ∈X*} |〈w/‖w‖, xᵢ〉 + b/‖w‖| = 1/‖w‖.
Theorem 8 (Vapnik [46]) Consider hyperplanes 〈w, x〉 = 0 where w is normalized such that they are in canonical form w.r.t. a set of points X* = {x₁, . . . , x_r}, i.e.,
min_{i=1,...,r} |〈w, xᵢ〉| = 1.
The set of decision functions f_w(x) = sgn〈x, w〉 defined on X* and satisfying the constraint ‖w‖ ≤ Λ has a VC dimension satisfying
h ≤ R²Λ².
Here, R is the radius of the smallest sphere around the origin containing X*.
[Figure: points inside a sphere of radius R, separated by hyperplanes with margins γ₁ and γ₂.]
Proof Strategy (Gurvits, 1997)
Assume that x₁, . . . , x_r are shattered by canonical hyperplanes with ‖w‖ ≤ Λ, i.e., for all y₁, . . . , y_r ∈ {±1},
yᵢ 〈w, xᵢ〉 ≥ 1 for all i = 1, . . . , r.   (5)
Two steps:
• prove that the more points we want to shatter (5), the larger ‖Σ_{i=1}^r yᵢxᵢ‖ must be
• upper bound the size of ‖Σ_{i=1}^r yᵢxᵢ‖ in terms of R
Combining the two tells us how many points we can at most shatter.
Part I
Summing (5) over i = 1, . . . , r yields
〈w, Σ_{i=1}^r yᵢxᵢ〉 ≥ r.
By the Cauchy-Schwarz inequality, on the other hand, we have
〈w, Σ_{i=1}^r yᵢxᵢ〉 ≤ ‖w‖ ‖Σ_{i=1}^r yᵢxᵢ‖ ≤ Λ ‖Σ_{i=1}^r yᵢxᵢ‖.
Combine both:
r/Λ ≤ ‖Σ_{i=1}^r yᵢxᵢ‖.   (6)
Part II
Consider independent random labels yᵢ ∈ {±1}, uniformly distributed (Rademacher variables).
E ‖Σ_{i=1}^r yᵢxᵢ‖² = Σ_{i=1}^r E 〈yᵢxᵢ, Σ_{j=1}^r yⱼxⱼ〉
= Σ_{i=1}^r E 〈yᵢxᵢ, Σ_{j≠i} yⱼxⱼ + yᵢxᵢ〉
= Σ_{i=1}^r ( Σ_{j≠i} E[〈yᵢxᵢ, yⱼxⱼ〉] + E[〈yᵢxᵢ, yᵢxᵢ〉] )
= Σ_{i=1}^r E[‖yᵢxᵢ‖²] = Σ_{i=1}^r ‖xᵢ‖²
Part II, ctd.
Since ‖xᵢ‖ ≤ R, we get
E ‖Σ_{i=1}^r yᵢxᵢ‖² ≤ rR².
• This holds for the expectation over the random choices of the labels, hence there must be at least one set of labels for which it also holds true. Use this set.
Hence
‖Σ_{i=1}^r yᵢxᵢ‖² ≤ rR².
Part I and II Combined
Part I: (r/Λ)² ≤ ‖Σ_{i=1}^r yᵢxᵢ‖²
Part II: ‖Σ_{i=1}^r yᵢxᵢ‖² ≤ rR²
Hence
r²/Λ² ≤ rR², i.e., r ≤ R²Λ²,
completing the proof.
Pattern Noise as Maximum Margin Regularization
[Figure: two classes (o and +) separated with margin ρ; each training pattern is surrounded by a noise ball of radius r.]
Maximum Margin vs. MDL — 2D Case
[Figure: two classes (o and +) inside a sphere of radius R, separated by a hyperplane with direction γ and margin ρ; the direction can be perturbed to γ + ∆γ.]
Can perturb γ by ∆γ with |∆γ| < arcsin(ρ/R) and still correctly separate the data. Hence we only need to store γ with accuracy ∆γ [36, 51].
Formulation as an Optimization Problem
Hyperplane with maximum margin: minimize
‖w‖²
(recall: margin ∼ 1/‖w‖) subject to
yᵢ · [〈w, xᵢ〉 + b] ≥ 1 for i = 1, . . . , m
(i.e., the training data are separated correctly).
Lagrange Function (e.g., [5])
Introduce Lagrange multipliers αᵢ ≥ 0 and a Lagrangian
L(w, b, α) = (1/2)‖w‖² − Σ_{i=1}^m αᵢ (yᵢ · [〈w, xᵢ〉 + b] − 1).
L has to be minimized w.r.t. the primal variables w and b and maximized with respect to the dual variables αᵢ:
• if a constraint is violated, then yᵢ · (〈w, xᵢ〉 + b) − 1 < 0, and
· αᵢ will grow to increase L — how far?
· w, b want to decrease L; i.e., they have to change such that the constraint is satisfied. If the problem is separable, this ensures that αᵢ < ∞.
• similarly: if yᵢ · (〈w, xᵢ〉 + b) − 1 > 0, then αᵢ = 0: otherwise, L could be increased by decreasing αᵢ (KKT conditions)
Derivation of the Dual Problem
At the extremum, we have
∂L/∂b (w, b, α) = 0 and ∂L/∂w (w, b, α) = 0,
i.e.,
Σ_{i=1}^m αᵢyᵢ = 0
and
w = Σ_{i=1}^m αᵢyᵢxᵢ.
Substitute both into L to get the dual problem.
The Support Vector Expansion
w = Σ_{i=1}^m αᵢyᵢxᵢ,
where for all i = 1, . . . , m either
yᵢ · [〈w, xᵢ〉 + b] > 1 ⟹ αᵢ = 0 → xᵢ irrelevant, or
yᵢ · [〈w, xᵢ〉 + b] = 1 (on the margin) → xᵢ “Support Vector”.
The solution is determined by the examples on the margin. Thus
f(x) = sgn(〈x, w〉 + b) = sgn(Σ_{i=1}^m αᵢyᵢ〈x, xᵢ〉 + b).
Why it is Good to Have Few SVs
Leaving out an example that does not become an SV yields the same solution.
Theorem [49]: Denote by #SV(m) the number of SVs obtained by training on m examples randomly drawn from P(x, y), and by E the expectation. Then
E[Prob(test error)] ≤ E[#SV(m)] / m.
Here, Prob(test error) refers to the expected value of the risk, where the expectation is taken over training the SVM on samples of size m − 1.
A Mechanical Interpretation [8]
Assume that each SV xᵢ exerts a perpendicular force of size αᵢ and sign yᵢ on a solid plane sheet lying along the hyperplane. Then the solution is mechanically stable:
Σ_{i=1}^m αᵢyᵢ = 0 implies that the forces sum to zero;
w = Σ_{i=1}^m αᵢyᵢxᵢ implies that the torques sum to zero, via
Σᵢ xᵢ × yᵢαᵢ · w/‖w‖ = w × w/‖w‖ = 0.
Dual Problem
Dual: maximize
W(α) = Σ_{i=1}^m αᵢ − (1/2) Σ_{i,j=1}^m αᵢαⱼyᵢyⱼ〈xᵢ, xⱼ〉
subject to
αᵢ ≥ 0 for i = 1, . . . , m, and Σ_{i=1}^m αᵢyᵢ = 0.
Both the final decision function and the function to be maximized are expressed in dot products → can use a kernel to compute
〈xᵢ, xⱼ〉 = 〈Φ(xᵢ), Φ(xⱼ)〉 = k(xᵢ, xⱼ).
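This dual is a small QP; a sketch with a linear kernel on a separable toy set, solved by a generic constrained optimizer (scipy's SLSQP; the data and the value of C are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual_fit(X, y, C=10.0):
    # maximize W(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>
    # subject to 0 <= alpha_i <= C and sum_i alpha_i y_i = 0.
    m = len(y)
    Q = (y[:, None] * y[None, :]) * (X @ X.T)
    obj = lambda a: 0.5 * a @ Q @ a - a.sum()  # minimize -W(alpha)
    cons = ({'type': 'eq', 'fun': lambda a: a @ y},)
    a = minimize(obj, np.zeros(m), bounds=[(0, C)] * m, constraints=cons).x
    w = (a * y) @ X                 # support vector expansion w = sum_i a_i y_i x_i
    sv = a > 1e-5                   # support vectors (alpha_i > 0)
    b = np.mean(y[sv] - X[sv] @ w)  # b from the points on the margin
    return w, b

X = np.array([[2., 2.], [3., 3.], [2., 3.], [-2., -2.], [-3., -3.], [-2., -3.]])
y = np.array([1., 1., 1., -1., -1., -1.])
w, b = svm_dual_fit(X, y)
assert np.all(np.sign(X @ w + b) == y)
```

Replacing X @ X.T by a kernel Gram matrix k(xᵢ, xⱼ) turns this into the kernelized dual of the next slide; the decision function then uses the kernel expansion instead of w.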
The SVM Architecture
[Figure: the SVM architecture. An input vector x is compared to the support vectors x₁, . . . , x₄ via kernel evaluations k(x, xᵢ), e.g.
k(x, xᵢ) = exp(−‖x − xᵢ‖²/c),
k(x, xᵢ) = tanh(κ〈x, xᵢ〉 + θ),
k(x, xᵢ) = 〈x, xᵢ〉^d;
the results are combined with weights λ₁, . . . , λ₄, giving the classification f(x) = sgn(Σᵢ λᵢ k(x, xᵢ) + b).]
Toy Example with Gaussian Kernel
k(x, x′) = exp(−‖x − x′‖²)
Nonseparable Problems [3, 9]
If yᵢ · (〈w, xᵢ〉 + b) ≥ 1 cannot be satisfied, then αᵢ → ∞.
Modify the constraint to
yᵢ · (〈w, xᵢ〉 + b) ≥ 1 − ξᵢ with ξᵢ ≥ 0
(“soft margin”) and add
C · Σ_{i=1}^m ξᵢ
to the objective function.
Soft Margin SVMs
C-SVM [9]: for C > 0, minimize
τ(w, ξ) = (1/2)‖w‖² + C Σ_{i=1}^m ξᵢ
subject to yᵢ · (〈w, xᵢ〉 + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0 (margin 1/‖w‖).
ν-SVM [38]: for 0 ≤ ν < 1, minimize
τ(w, ξ, ρ) = (1/2)‖w‖² − νρ + (1/m) Σᵢ ξᵢ
subject to yᵢ · (〈w, xᵢ〉 + b) ≥ ρ − ξᵢ, ξᵢ ≥ 0 (margin ρ/‖w‖).
The νProperty
SVs: αᵢ > 0; “margin errors”: ξᵢ > 0.
KKT conditions ⟹
• All margin errors are SVs.
• Not all SVs need to be margin errors. Those which are not lie exactly on the edge of the margin.
Proposition:
1. fraction of margin errors ≤ ν ≤ fraction of SVs.
2. asymptotically, both fractions converge to ν.
Duals, Using Kernels
C-SVM dual: maximize
W(α) = Σᵢ αᵢ − (1/2) Σ_{i,j} αᵢαⱼyᵢyⱼk(xᵢ, xⱼ)
subject to 0 ≤ αᵢ ≤ C, Σᵢ αᵢyᵢ = 0.
ν-SVM dual: maximize
W(α) = −(1/2) Σ_{i,j} αᵢαⱼyᵢyⱼk(xᵢ, xⱼ)
subject to 0 ≤ αᵢ ≤ 1/m, Σᵢ αᵢyᵢ = 0, Σᵢ αᵢ ≥ ν.
In both cases the decision function is
f(x) = sgn(Σ_{i=1}^m αᵢyᵢk(x, xᵢ) + b).
Connection between ν-SVC and C-SVC
Proposition. If ν-SV classification leads to ρ > 0, then C-SV classification, with C set a priori to 1/ρ, leads to the same decision function.
Proof. Minimize the primal target, then fix ρ, and minimize only over the remaining variables: nothing will change. Hence the obtained solution w⁰, b⁰, ξ⁰ minimizes the primal problem of C-SVC, for C = 1, subject to
yᵢ · (〈xᵢ, w〉 + b) ≥ ρ − ξᵢ.
To recover the constraint
yᵢ · (〈xᵢ, w〉 + b) ≥ 1 − ξᵢ,
rescale to the set of variables w′ = w/ρ, b′ = b/ρ, ξ′ = ξ/ρ. This leaves us, up to a constant scaling factor ρ², with the C-SV target with C = 1/ρ.
SVM Training
• naive approach: the complexity of maximizing
W(α) = Σ_{i=1}^m αᵢ − (1/2) Σ_{i,j=1}^m αᵢαⱼyᵢyⱼk(xᵢ, xⱼ)
scales with the third power of the training set size m
• only SVs are relevant → only compute (k(xᵢ, xⱼ))_{ij} for SVs. Extract them iteratively by cycling through the training set in chunks [46].
• in fact, one can use chunks which do not even contain all SVs [31]. Maximize over these subproblems, using your favorite optimizer.
• the extreme case: by making the subproblems very small (just two points), one can solve them analytically [32].
• http://www.kernelmachines.org/software.html
MNIST Benchmark
handwritten character benchmark (60,000 training & 10,000 test examples, 28 × 28)
[Figure: a grid of sample MNIST digits with their labels.]
MNIST Error Rates
Classifier                    test error   reference
linear classifier             8.4%         [7]
3-nearest-neighbour           2.4%         [7]
SVM                           1.4%         [8]
Tangent distance              1.1%         [41]
LeNet4                        1.1%         [28]
Boosted LeNet4                0.7%         [28]
Translation invariant SVM     0.56%        [11]
Note: the SVM used a polynomial kernel of degree 9, corresponding to a feature space of dimension ≈ 3.2 · 10²⁰.
References
[1] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.
[2] F. R. Bach and M. I. Jordan. Kernel independent component analysis. J. Mach. Learn. Res., 3:1–48, 2002.
[3] K. P. Bennett and O. L. Mangasarian. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1:23–34, 1992.
[4] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. SpringerVerlag, New York, 1984.
[5] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.
[6] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor,Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144–152, Pittsburgh, PA, July1992. ACM Press.
[7] L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, L. D. Jackel, Y. LeCun, U. A. Muller, E. Sackinger, P. Simard,and V. Vapnik. Comparison of classifier methods: a case study in handwritten digit recognition. In Proceedings of the 12thInternational Conference on Pattern Recognition and Neural Networks, Jerusalem, pages 77–87. IEEE Computer SocietyPress, 1994.
[8] C. J. C. Burges and B. Scholkopf. Improving the accuracy and speed of support vector learning machines. In M. Mozer,M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 375–381, Cambridge,MA, 1997. MIT Press.
[9] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273–297, 1995.
[10] D. Cox and F. O’Sullivan. Asymptotic analysis of penalized likelihood and related estimators. Annals of Statistics,18:1676–1695, 1990.
[11] D. DeCoste and B. Scholkopf. Training invariant support vector machines. Machine Learning, 2001. Accepted for publication. Also: Technical Report JPLMLTR001, Jet Propulsion Laboratory, Pasadena, CA, 2000.
[12] L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Number 31 in Applications ofmathematics. Springer, New York, 1996.
[13] R. M. Dudley. Real analysis and probability. Cambridge University Press, Cambridge, UK, 2002.
[14] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. In A. J. Smola, P. L.Bartlett, B. Scholkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 171–203, Cambridge, MA,2000. MIT Press.
[15] R. Fortet and E. Mourier. Convergence de la répartition empirique vers la répartition théorique. Ann. Scient. Ecole Norm. Sup., 70:266–285, 1953.
[16] K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res., 5:73–99, 2004.
[17] K. Fukumizu, A. Gretton, X. Sun, and B. Scholkopf. Kernel measures of conditional dependence. In J. C. Platt, D. Koller,Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20, pages 489–496, Cambridge, MA, USA, 09 2008. MIT Press.
[18] F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6):1455–1480, 1998.
[19] A. Gretton, K. Borgwardt, M. Rasch, B. Scholkopf, and A. J. Smola. A kernel method for the two-sample problem. In B. Scholkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19. The MIT Press, Cambridge, MA, 2007.
[20] A. Gretton, O. Bousquet, A.J. Smola, and B. Scholkopf. Measuring statistical dependence with HilbertSchmidt norms.In S. Jain, H. U. Simon, and E. Tomita, editors, Proceedings Algorithmic Learning Theory, pages 63–77, Berlin, Germany,2005. SpringerVerlag.
[21] A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Scholkopf. Kernel methods for measuring independence. J. Mach.Learn. Res., 6:2075–2129, 2005.
[22] D. Haussler. Convolutional kernels on discrete structures. Technical Report UCSCCRL9910, Computer Science Department, University of California at Santa Cruz, 1999.
[23] J. Huang, A.J. Smola, A. Gretton, K. Borgwardt, and B. Scholkopf. Correcting sample selection bias by unlabeled data. InB. Scholkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19, volume 19. TheMIT Press, Cambridge, MA, 2007.
[24] I. A. Ibragimov and R. Z. Has’minskii. Statistical Estimation — Asymptotic Theory. SpringerVerlag, New York, 1981.
[25] J. Jacod and P. Protter. Probability Essentials. Springer, New York, 2000.
[26] G. S. Kimeldorf and G. Wahba. A correspondence between Bayesian estimation on stochastic processes and smoothing bysplines. Annals of Mathematical Statistics, 41:495–502, 1970.
[27] G. S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Applic., 33:82–95, 1971.
[28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86:2278–2324, 1998.
[29] D. J. C. MacKay. Introduction to gaussian processes. In C. M. Bishop, editor, Neural Networks and Machine Learning,pages 133–165. SpringerVerlag, Berlin, 1998.
[30] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London, A 209:415–446, 1909.
[31] E. Osuna, R. Freund, and F. Girosi. Support vector machines: Training and applications. Technical Report AIM1602,MIT A.I. Lab., 1996.
[32] J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Scholkopf, C. J. C. Burges,and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 185–208, Cambridge, MA, 1999.MIT Press.
[33] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9), September 1990.
[34] A. Renyi. On measures of dependence. Acta Math. Acad. Sci. Hungar., 10:441–451, 1959.
[35] S. Saitoh. Theory of Reproducing Kernels and its Applications. Longman Scientific & Technical, Harlow, England, 1988.
[36] B. Scholkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[37] B. Scholkopf, A. Smola, and K.R. Muller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.
[38] B. Scholkopf, A. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation,12:1207–1245, 2000.
[39] J. ShaweTaylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK,2004.
[40] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90, 2000.
[41] P. Simard, Y. LeCun, and J. Denker. Efficient pattern recognition using a new transformation distance. In S. J. Hanson,J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5. Proceedings of the 1992Conference, pages 50–58, San Mateo, CA, 1993. Morgan Kaufmann.
[42] A. Smola, B. Scholkopf, and K.R. Muller. The connection between regularization operators and support vector kernels.Neural Networks, 11:637–649, 1998.
[43] A. J. Smola, A. Gretton, L. Song, and B. Scholkopf. A Hilbert space embedding for distributions. In M. Hutter, R. A.Servedio, and E. Takimoto, editors, Algorithmic Learning Theory: 18th International Conference, pages 13–31, Berlin, 102007. Springer.
[44] A. J. Smola, A. Gretton, L. Song, and B. Scholkopf. A Hilbert space embedding for distributions. In Proc. Intl. Conf.Algorithmic Learning Theory, volume 4754 of LNAI. Springer, 2007.
[45] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. J. Mach. Learn. Res., 2:67–93,2001.
[46] V. Vapnik. Estimation of Dependences Based on Empirical Data [in Russian]. Nauka, Moscow, 1979. (English translation:Springer Verlag, New York, 1982).
[47] V. Vapnik. The Nature of Statistical Learning Theory. Springer, NY, 1995.
[48] V. Vapnik. Statistical Learning Theory. Wiley, NY, 1998.
[49] V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition [in Russian]. Nauka, Moscow, 1974. (German Translation:W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie–Verlag, Berlin, 1979).
[50] V. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24,1963.
[51] U. von Luxburg, O. Bousquet, and B. Scholkopf. A compression approach to support vector model selection. TechnicalReport 101, Max Planck Institute for Biological Cybernetics, Tubingen, Germany, 2002. see more detailed JMLR version.
[52] G. Wahba. Spline Models for Observational Data, volume 59 of CBMSNSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.
[53] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Technical Report649, UC Berkeley, Department of Statistics, September 2003.
[54] H. L. Weinert. Reproducing Kernel Hilbert Spaces. Hutchinson Ross, Stroudsburg, PA, 1982.
[55] C. K. I. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In M. I.Jordan, editor, Learning and Inference in Graphical Models. Kluwer, 1998.
[56] D. H. Wolpert. The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7):1341–1390, 1996.
Regularization Interpretation of Kernel Machines
The norm in H can be interpreted as a regularization term (Girosi 1998, Smola et al., 1998, Evgeniou et al., 2000): if P is a regularization operator (mapping into a dot product space D) such that k is the Green’s function of P*P, then
‖w‖ = ‖Pf‖,
where
w = Σ_{i=1}^m αᵢΦ(xᵢ) and f(x) = Σᵢ αᵢk(xᵢ, x).
Example: for the Gaussian kernel, P is a linear combination of differential operators.
‖w‖² = Σ_{i,j} αᵢαⱼ k(xᵢ, xⱼ)
= Σ_{i,j} αᵢαⱼ 〈k(xᵢ, ·), δ_{xⱼ}(·)〉
= Σ_{i,j} αᵢαⱼ 〈k(xᵢ, ·), (P*P k)(xⱼ, ·)〉
= Σ_{i,j} αᵢαⱼ 〈(P k)(xᵢ, ·), (P k)(xⱼ, ·)〉_D
= 〈P Σᵢ αᵢk(xᵢ, ·), P Σⱼ αⱼk(xⱼ, ·)〉_D
= ‖Pf‖²,
using f(x) = Σᵢ αᵢk(xᵢ, x).