Introduction to Machine Learning

1. Introduction to Machine Learning
Bernhard Schölkopf, Empirical Inference Department, Max Planck Institute for Intelligent Systems, Tübingen, Germany
http://www.tuebingen.mpg.de/bs

2. Empirical Inference
Drawing conclusions from empirical data (observations, measurements).
Example 1: scientific inference. Fit a regularity to observed points, e.g. $y = \sum_i a_i k(x, x_i) + b$, or a law $y = a \cdot x$. (Leibniz, Weyl, Chaitin)

3. Empirical Inference, ctd.
"If your experiment needs statistics [inference], you ought to have done a better experiment." (Rutherford)

4. Empirical Inference, II
Example 2: perception. "The brain is nothing but a statistical decision organ." (H. Barlow)

5. Hard Inference Problems
Sonnenburg, Rätsch, Schäfer, Schölkopf, 2006, Journal of Machine Learning Research.
Task: classify human DNA sequence locations into {acceptor splice site, decoy} using 15 million sequences of length 141 and multiple-kernel Support Vector Machines. (PRC = precision-recall curve: fraction of correct positive predictions among all positively predicted cases.)
- High dimensionality: consider many factors simultaneously to find the regularity
- Complex regularities: nonlinear, nonstationary, etc.
- Little prior knowledge: e.g., no mechanistic models for the data
- Need large data sets: processing requires computers and automatic inference methods

6. Hard Inference Problems, II
We can solve scientific inference problems that humans cannot solve. Even if it is just because of data set size / dimensionality, this is a quantum leap.

7. Generalization (thanks to O. Bousquet)
Observe 1, 2, 4, 7, ... What's next?
- 1, 2, 4, 7, 11, 16, ...: $a_{n+1} = a_n + n$ (lazy caterer's sequence)
- 1, 2, 4, 7, 12, 20, ...: $a_{n+2} = a_{n+1} + a_n + 1$
- 1, 2, 4, 7, 13, 24, ...: Tribonacci sequence
- 1, 2, 4, 7, 14, 28: set of divisors of 28
- 1, 2, 4, 7, 1, 1, 5, ...: decimal expansions of $\pi = 3.14159\ldots$ and $e = 2.718\ldots$, interleaved
The On-Line Encyclopedia of Integer Sequences: >600 hits.

8. Generalization, II
Question: which continuation is correct ("generalizes")? Answer: there is no way to tell (induction problem).
Question of statistical learning theory: how to come up with a law that is (probably) correct (demarcation problem); more accurately, a law that is probably as correct on the test data as it is on the training data.

9. 2-Class Classification
Learn $f: X \to \{\pm 1\}$ based on $m$ observations generated from some $P(x, y)$. Goal: minimize the expected error (risk). (V. Vapnik)
Problem: $P$ is unknown. Induction principle: minimize the training error (empirical risk) over some class of functions. Question: is this consistent?

10. The Law of Large Numbers
For all $f$ and all $\epsilon > 0$: $\lim_{m \to \infty} P\{ |R_{\rm emp}[f] - R[f]| > \epsilon \} = 0$.
Does this imply consistency of empirical risk minimization (optimality in the limit)? No; we need a uniform law of large numbers: for all $\epsilon > 0$, $\lim_{m \to \infty} P\{ \sup_{f \in F} |R_{\rm emp}[f] - R[f]| > \epsilon \} = 0$.

11. Consistency and Uniform Convergence
(figure: risk and empirical risk plotted over the function class)

12. (switch to the LaTeX slides)

13./14. Support Vector Machines
(two identical slides; figure: the two classes mapped by $\Phi$ into a feature space where $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$) Sparse expansion of the solution in terms of SVs (Boser, Guyon, Vapnik 1992); representer theorem (Kimeldorf & Wahba 1971, Schölkopf et al. 2000); the unique solution is found by a convex QP.

15. Applications in Computational Geometry / Graphics
Steinke, Walder, Blanz et al., Eurographics '05, '06, '08; ICML '05, '08; NIPS '07.

16. (photo: FIFA World Cup, Germany vs. England, June 27, 2010)
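As a concrete illustration of the ERM induction principle of slides 9 and 10, here is a minimal Python sketch (not part of the original slides) that minimizes the training error over a hypothetical finite class of 1-D threshold classifiers; the data-generating distribution and the set of thresholds are arbitrary choices for the example.

```python
# Minimal sketch of empirical risk minimization (ERM), assuming a toy
# 1-D problem and a finite class of threshold classifiers sgn(x - t).
import numpy as np

rng = np.random.default_rng(0)
m = 200
x = rng.normal(size=m)
y = np.sign(x - 0.3 + 0.2 * rng.normal(size=m))   # noisy labels; "P(x, y)" is unknown to the learner

def empirical_risk(t, x, y):
    """R_emp[f] = (1/m) sum_i 1/2 |f(x_i) - y_i| for f(x) = sgn(x - t)."""
    return 0.5 * np.mean(np.abs(np.sign(x - t) - y))

thresholds = np.linspace(-2, 2, 401)              # the function class F
risks = [empirical_risk(t, x, y) for t in thresholds]
t_erm = thresholds[int(np.argmin(risks))]
print(f"ERM threshold: {t_erm:.2f}, training error: {min(risks):.3f}")
```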
17. Kernel Quiz

18. Kernel Methods
Bernhard Schölkopf, Max Planck Institute for Intelligent Systems (MLSS France 2011)

19. Statistical Learning Theory
1. started by Vapnik and Chervonenkis in the Sixties
2. model: we observe data generated by an unknown stochastic regularity
3. learning = extraction of the regularity from the data
4. the analysis of the learning problem leads to notions of capacity of the function classes that a learning machine can implement
5. support vector machines use a particular type of function class: classifiers with large margins in a feature space induced by a kernel [47, 48]

20. Example: Regression Estimation
Data: input-output pairs $(x_i, y_i) \in \mathbb{R} \times \mathbb{R}$. Regularity: $(x_1, y_1), \ldots, (x_m, y_m)$ drawn from $P(x, y)$. Learning: choose a function $f: \mathbb{R} \to \mathbb{R}$ such that the error, averaged over $P$, is minimized. Problem: $P$ is unknown, so the average cannot be computed; we need an induction principle.

21. Pattern Recognition
Learn $f: X \to \{\pm 1\}$ from examples $(x_1, y_1), \ldots, (x_m, y_m) \in X \times \{\pm 1\}$, generated i.i.d. from $P(x, y)$, such that the expected misclassification error on a test set, also drawn from $P(x, y)$,
$R[f] = \int \tfrac{1}{2} |f(x) - y| \, dP(x, y)$,
is minimal (Risk Minimization (RM)).
Problem: $P$ is unknown; we need an induction principle. Empirical risk minimization (ERM): replace the average over $P(x, y)$ by an average over the training sample, i.e. minimize the training error
$R_{\rm emp}[f] = \frac{1}{m} \sum_{i=1}^m \tfrac{1}{2} |f(x_i) - y_i|$.

22. Convergence of Means to Expectations
Law of large numbers: $R_{\rm emp}[f] \to R[f]$ as $m \to \infty$. Does this imply that empirical risk minimization will give us the optimal result in the limit of infinite sample size (consistency of empirical risk minimization)? No. We need a uniform version of the law of large numbers, uniform over all functions that the learning machine can implement.

23. Consistency and Uniform Convergence
(figure: risk $R$ and empirical risk $R_{\rm emp}$ over the function class; $f^{\rm opt}$ vs. $f^m$)

24. The Importance of the Set of Functions
What about allowing all functions from $X$ to $\{\pm 1\}$? Take a training set $(x_1, y_1), \ldots, (x_m, y_m) \in X \times \{\pm 1\}$ and test patterns $\bar{x}_1, \ldots, \bar{x}_{\bar m} \in X$ such that $\{\bar{x}_1, \ldots, \bar{x}_{\bar m}\} \cap \{x_1, \ldots, x_m\} = \emptyset$. For any $f$ there exists $f^*$ such that: 1. $f^*(x_i) = f(x_i)$ for all $i$; 2. $f^*(\bar{x}_j) = -f(\bar{x}_j)$ for all $j$. Based on the training set alone, there is no means of choosing which one is better; on the test set, however, they give opposite results. There is no free lunch [24, 56]. A restriction must be placed on the functions that we allow.

25. Restricting the Class of Functions
Two views: 1. Statistical Learning (VC) Theory: take into account the capacity of the class of functions that the learning machine can implement. 2. The Bayesian Way: place prior distributions $P(f)$ over the class of functions.

26. Detailed Analysis
Loss $\xi_i := \tfrac{1}{2} |f(x_i) - y_i| \in \{0, 1\}$; the $\xi_i$ are independent Bernoulli trials; the empirical mean $\frac{1}{m} \sum_{i=1}^m \xi_i$ (by definition equals $R_{\rm emp}[f]$); the expected value $E[\xi]$ (equals $R[f]$).

27. Chernoff's Bound
$P\left\{ \left| \frac{1}{m} \sum_{i=1}^m \xi_i - E[\xi] \right| \ge \epsilon \right\} \le 2 \exp(-2m\epsilon^2)$.
Here, $P$ refers to the probability of getting a sample $\xi_1, \ldots, \xi_m$ with the property $\left| \frac{1}{m} \sum_i \xi_i - E[\xi] \right| \ge \epsilon$ ($P$ is a product measure).
Useful corollary: given a $2m$-sample of Bernoulli trials, we have
$P\left\{ \left| \frac{1}{m} \sum_{i=1}^m \xi_i - \frac{1}{m} \sum_{i=m+1}^{2m} \xi_i \right| \ge \epsilon \right\} \le 4 \exp\left(-\frac{m\epsilon^2}{2}\right)$.
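Slide 27's bound can be checked numerically. The following sketch (an added illustration, not from the slides) simulates Bernoulli losses and compares the empirical deviation probability with the bound $2\exp(-2m\epsilon^2)$; the parameter values are arbitrary.

```python
# Monte-Carlo illustration of the Chernoff/Hoeffding bound on slide 27:
# P(|mean of m Bernoulli trials - p| >= eps) <= 2 exp(-2 m eps^2).
import numpy as np

rng = np.random.default_rng(0)
p, m, eps, trials = 0.3, 100, 0.1, 20000

xi = rng.random((trials, m)) < p                  # trials x m Bernoulli(p) losses
deviation = np.abs(xi.mean(axis=1) - p)
empirical = np.mean(deviation >= eps)
bound = 2 * np.exp(-2 * m * eps**2)
print(f"empirical probability {empirical:.4f} <= bound {bound:.4f}")
```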
28. Chernoff's Bound, II
Translated back into machine learning terminology: the probability of obtaining an $m$-sample where the training error and test error differ by more than $\epsilon > 0$ is bounded by
$P\left\{ |R_{\rm emp}[f] - R[f]| \ge \epsilon \right\} \le 2 \exp(-2m\epsilon^2)$.
This refers to one fixed $f$; we are not allowed to look at the data before choosing $f$. Hence it is not suitable as a bound on the test error of a learning algorithm using empirical risk minimization.

29. Uniform Convergence (Vapnik & Chervonenkis)
Necessary and sufficient condition for nontrivial consistency of empirical risk minimization (ERM): one-sided convergence, uniformly over all functions that can be implemented by the learning machine,
$\lim_{m \to \infty} P\{ \sup_{f \in F} (R[f] - R_{\rm emp}[f]) > \epsilon \} = 0$ for all $\epsilon > 0$.
Note that this takes into account the whole set of functions that can be implemented by the learning machine, and it is hard to check for a learning machine. Are there properties of learning machines (i.e. sets of functions) which ensure uniform convergence of risk?

30. How to Prove a VC Bound
Take a closer look at $P\{ \sup_{f \in F} (R[f] - R_{\rm emp}[f]) > \epsilon \}$. Plan:
- if the function class $F$ contains only one function, then Chernoff's bound suffices: $P\{ \sup_{f \in F} (R[f] - R_{\rm emp}[f]) > \epsilon \} \le 2 \exp(-2m\epsilon^2)$
- if there are finitely many functions, we use the union bound
- even if there are infinitely many, on any finite sample there are effectively only finitely many (use symmetrization and capacity concepts)

31. The Case of Two Functions
Suppose $F = \{f_1, f_2\}$. Rewrite $P\{ \sup_{f \in F} (R[f] - R_{\rm emp}[f]) > \epsilon \} = P(C^1 \cup C^2)$, where $C^i := \{ ((x_1, y_1), \ldots, (x_m, y_m)) \mid R[f_i] - R_{\rm emp}[f_i] > \epsilon \}$ denotes the event that the risks of $f_i$ differ by more than $\epsilon$. The RHS satisfies $P(C^1 \cup C^2) = P(C^1) + P(C^2) - P(C^1 \cap C^2) \le P(C^1) + P(C^2)$. Hence, by Chernoff's bound,
$P\{ \sup_{f \in F} (R[f] - R_{\rm emp}[f]) > \epsilon \} \le P(C^1) + P(C^2) \le 2 \cdot 2\exp(-2m\epsilon^2)$.

32. The Union Bound
Similarly, if $F = \{f_1, \ldots, f_n\}$, we have $P\{ \sup_{f \in F} (R[f] - R_{\rm emp}[f]) > \epsilon \} = P(C^1 \cup \cdots \cup C^n)$, and $P(C^1 \cup \cdots \cup C^n) \le \sum_{i=1}^n P(C^i)$. Use Chernoff for each summand to get an extra factor $n$ in the bound. Note: this becomes an equality if and only if all the events $C^i$ involved are disjoint.

33. Infinite Function Classes
Note: the empirical risk only refers to $m$ points. On these points, the functions of $F$ can take at most $2^m$ values. For $R_{\rm emp}$, the function class thus "looks" finite. How about $R$? We need a trick.

34. Symmetrization
Lemma 1 (Vapnik & Chervonenkis (e.g., [46, 12])): For $m\epsilon^2 \ge 2$ we have
$P\{ \sup_{f \in F} (R[f] - R_{\rm emp}[f]) > \epsilon \} \le 2 P\{ \sup_{f \in F} (R_{\rm emp}[f] - R'_{\rm emp}[f]) > \epsilon/2 \}$.
Here, the first $P$ refers to the distribution of iid samples of size $m$, while the second one refers to iid samples of size $2m$. In the latter case, $R_{\rm emp}$ measures the loss on the first half of the sample, and $R'_{\rm emp}$ on the second half.

35. Shattering Coefficient
Hence, we only need to consider the maximum size of $F$ on $2m$ points. Call it $N(F, 2m)$: the maximum number of different outputs $(y_1, \ldots, y_{2m})$ that the function class can generate on $2m$ points, in other words, the maximum number of different ways the function class can separate $2m$ points into two classes. $N(F, 2m) \le 2^{2m}$; if $N(F, 2m) = 2^{2m}$, the function class is said to shatter $2m$ points.

36. Putting Everything Together
We now use (1) symmetrization, (2) the shattering coefficient, and (3) the union bound, to get
$P\{ \sup_{f \in F} (R[f] - R_{\rm emp}[f]) > \epsilon \} \le 2 P\{ \sup_{f \in F} (R_{\rm emp}[f] - R'_{\rm emp}[f]) > \epsilon/2 \}$
$= 2 P\{ (R_{\rm emp}[f_1] - R'_{\rm emp}[f_1]) > \epsilon/2 \ \vee \ \cdots \ \vee \ (R_{\rm emp}[f_{N(F,2m)}] - R'_{\rm emp}[f_{N(F,2m)}]) > \epsilon/2 \}$
$\le 2 \sum_{n=1}^{N(F,2m)} P\{ (R_{\rm emp}[f_n] - R'_{\rm emp}[f_n]) > \epsilon/2 \}$.
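To make the shattering coefficient of slide 35 tangible, the sketch below (an added illustration, using a hypothetical class of 1-D threshold functions) enumerates the label vectors the class can realize on a sample; for this class $N(F, m) = m + 1$, far below $2^m$.

```python
# Counting the shattering coefficient N(F, m) of a simple class:
# 1-D threshold functions f_t(x) = sgn(x - t) (an illustrative choice).
import numpy as np

rng = np.random.default_rng(0)
m = 6
x = np.sort(rng.normal(size=m))

# Thresholds between (and beyond) the sample points suffice to realize
# every labelling this class can produce on the sample.
ts = np.concatenate(([x[0] - 1], (x[:-1] + x[1:]) / 2, [x[-1] + 1]))
labellings = {tuple(np.where(x > t, 1, -1)) for t in ts}
print(f"N(F, {m}) = {len(labellings)} out of 2^{m} = {2**m}")   # m + 1 here
```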
37. ctd.
Use Chernoff's bound for each term:
$P\left\{ \frac{1}{m} \sum_{i=1}^m \xi_i - \frac{1}{m} \sum_{i=m+1}^{2m} \xi_i > \epsilon \right\} \le 2 \exp\left(-\frac{m\epsilon^2}{2}\right)$.
This yields
$P\{ \sup_{f \in F} (R[f] - R_{\rm emp}[f]) > \epsilon \} \le 4\, N(F, 2m) \exp\left(-\frac{m\epsilon^2}{8}\right)$.
Provided that $N(F, 2m)$ does not grow exponentially in $m$, this is nontrivial; such bounds are called VC type inequalities. There are two types of randomness: (1) the $P$ refers to the drawing of the training examples, and (2) $R[f]$ is an expectation over the drawing of test examples. Note that the $f_i$ depend on the $2m$-sample; a rigorous treatment would need to use a second randomization over permutations of the $2m$-sample, see [36].

38. Confidence Intervals
Rewrite the bound: specify the probability with which we want $R$ to be close to $R_{\rm emp}$, and solve for $\epsilon$. With a probability of at least $1 - \delta$,
$R[f] \le R_{\rm emp}[f] + \sqrt{\frac{8}{m} \left( \ln N(F, 2m) + \ln\frac{4}{\delta} \right)}$.
This bound holds independently of $f$; in particular, it holds for the function $f^m$ minimizing the empirical risk.

39. Discussion
Tighter bounds are available (better constants etc.); we cannot minimize the bound over $f$; other capacity concepts can be used.

40. VC Entropy
On an example $(x, y)$, $f$ causes a loss $\xi(x, y, f(x)) = \tfrac{1}{2} |f(x) - y| \in \{0, 1\}$. For a larger sample $(x_1, y_1), \ldots, (x_m, y_m)$, the different functions $f \in F$ lead to a set of loss vectors
$\xi_f = (\xi(x_1, y_1, f(x_1)), \ldots, \xi(x_m, y_m, f(x_m)))$,
whose cardinality we denote by $N(F, (x_1, y_1), \ldots, (x_m, y_m))$. The VC entropy is defined as
$H_F(m) = E[\ln N(F, (x_1, y_1), \ldots, (x_m, y_m))]$,
where the expectation is taken over the random generation of the $m$-sample $(x_1, y_1), \ldots, (x_m, y_m)$ from $P$. $H_F(m)/m \to 0$ corresponds to uniform convergence of risks (hence consistency).

41. Further PR Capacity Concepts
Exchange $E$ and $\ln$: annealed entropy. $H^{\rm ann}_F(m)/m \to 0$ corresponds to exponentially fast uniform convergence.
Take the max instead of $E$: growth function. Note that $G_F(m) = \ln N(F, m)$. $G_F(m)/m \to 0$ corresponds to exponential convergence for all underlying distributions $P$. $G_F(m) = m \ln 2$ for all $m$ means that for any $m$, all loss vectors can be generated, i.e., the $m$ points can be chosen such that, using functions of the learning machine, they can be separated in all $2^m$ possible ways ("shattered").

42. Structure of the Growth Function
Either $G_F(m) = m \ln 2$ for all $m \in \mathbb{N}$, or there exists some maximal $m$ for which this is possible. Call this number the VC dimension and denote it by $h$. For $m > h$,
$G_F(m) \le h \left( \ln\frac{m}{h} + 1 \right)$.
Nothing in between linear growth and logarithmic growth is possible.

43. VC Dimension: Example
Half-spaces in $\mathbb{R}^2$: $f(x, y) = \mathrm{sgn}(a + bx + cy)$, with parameters $a, b, c \in \mathbb{R}$. Clearly, we can shatter three non-collinear points. But we can never shatter four points.
Hence the VC dimension is $h = 3$ (in this case, equal to the number of parameters).
(figure: half-planes realizing all labellings of three non-collinear points)
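The half-plane example of slide 43 can be verified numerically. The sketch below (an added illustration) tests whether every labelling of a point set can be realized by some $\mathrm{sgn}(a + bx + cy)$, posing each labelling as a linear feasibility problem solved with scipy.optimize.linprog; the specific point configurations are arbitrary choices.

```python
# Checking the half-plane example: sgn(a + b*x + c*y) shatters three
# non-collinear points but not, e.g., four points in the XOR configuration.
# Separability of a labelling is tested as an LP feasibility problem.
import itertools
import numpy as np
from scipy.optimize import linprog

def shatters(points):
    """True if some half-plane realizes every labelling of the points."""
    A = np.hstack([np.ones((len(points), 1)), np.asarray(points, float)])
    for ys in itertools.product([-1.0, 1.0], repeat=len(points)):
        y = np.array(ys)
        # Feasible (a, b, c) with y_i (a + b x_i + c y_i) >= 1 ?
        res = linprog(c=np.zeros(3), A_ub=-y[:, None] * A, b_ub=-np.ones(len(points)),
                      bounds=[(None, None)] * 3)
        if not res.success:
            return False
    return True

print(shatters([(0, 0), (1, 0), (0, 1)]))              # True: 3 points shattered
print(shatters([(0, 0), (1, 1), (1, 0), (0, 1)]))      # False: the XOR labelling fails
```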
44. A Typical Bound for Pattern Recognition
For any $f \in F$ and $m > h$, with a probability of at least $1 - \delta$,
$R[f] \le R_{\rm emp}[f] + \sqrt{\frac{h \left( \ln\frac{2m}{h} + 1 \right) - \ln(\delta/4)}{m}}$
holds. Does this mean that we can learn anything? The study of the consistency of ERM has thus led to concepts and results which let us formulate another induction principle (structural risk minimization).

45. SRM
(figure: bound on the test error = training error + capacity term, plotted against the capacity $h$ over a structure $S_{n-1} \subset S_n \subset S_{n+1}$; the minimum of the bound is attained at $R(f^*)$)

46. Finding a Good Function Class
Recall: separating hyperplanes in $\mathbb{R}^2$ have a VC dimension of 3. More generally, separating hyperplanes in $\mathbb{R}^N$ have a VC dimension of $N + 1$. Hence separating hyperplanes in high-dimensional feature spaces have extremely large VC dimension, and may not generalize well. However, margin hyperplanes can still have a small VC dimension.

47. Kernels and Feature Spaces
Preprocess the data with $\Phi: X \to H$, $x \mapsto \Phi(x)$, where $H$ is a dot product space, and learn the mapping from $\Phi(x)$ to $y$ [6]. Usually $\dim(X) \ll \dim(H)$. Curse of dimensionality? The crucial issue is capacity, not dimensionality.

48. Example: All Degree 2 Monomials
$\Phi: \mathbb{R}^2 \to \mathbb{R}^3$, $(x_1, x_2) \mapsto (z_1, z_2, z_3) := (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$.
(figure: data that are not linearly separable in the input coordinates $x_1, x_2$ become linearly separable in the feature coordinates $z_1, z_2, z_3$)

49. General Product Feature Space
How about patterns $x \in \mathbb{R}^N$ and product features of order $d$? Here, $\dim(H)$ grows like $N^d$. E.g. $N = 16 \times 16$ and $d = 5$ already give a dimension of $10^{10}$.

50. The Kernel Trick, $N = d = 2$
$\langle \Phi(x), \Phi(x') \rangle = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \, (x_1'^2, \sqrt{2}\, x_1' x_2', x_2'^2)^\top = \langle x, x' \rangle^2 =: k(x, x')$.
The dot product in $H$ can be computed in $\mathbb{R}^2$.

51. The Kernel Trick, II
More generally, for $x, x' \in \mathbb{R}^N$ and $d \in \mathbb{N}$:
$\langle x, x' \rangle^d = \left( \sum_{j=1}^N x_j x_j' \right)^d = \sum_{j_1, \ldots, j_d = 1}^N x_{j_1} \cdots x_{j_d} \, x_{j_1}' \cdots x_{j_d}' = \langle \Phi(x), \Phi(x') \rangle$,
where $\Phi$ maps into the space spanned by all ordered products of $d$ input directions.

52. Mercer's Theorem
If $k$ is a continuous kernel of a positive definite integral operator on $L_2(X)$ (where $X$ is some compact space),
$\int_X k(x, x') f(x) f(x') \, dx \, dx' \ge 0$,
it can be expanded as
$k(x, x') = \sum_{i=1}^\infty \lambda_i \psi_i(x) \psi_i(x')$
using eigenfunctions $\psi_i$ and eigenvalues $\lambda_i \ge 0$ [30].

53. The Mercer Feature Map
In that case
$\Phi(x) := \left( \sqrt{\lambda_1}\, \psi_1(x), \sqrt{\lambda_2}\, \psi_2(x), \ldots \right)$
satisfies $\langle \Phi(x), \Phi(x') \rangle = k(x, x')$.
Proof: $\langle \Phi(x), \Phi(x') \rangle = \sum_{i=1}^\infty \lambda_i \psi_i(x) \psi_i(x') = k(x, x')$.

54. Positive Definite Kernels
It can be shown that the admissible class of kernels coincides with the class of positive definite (pd) kernels: kernels which are symmetric (i.e., $k(x, x') = k(x', x)$) and which for any set of training points $x_1, \ldots, x_m \in X$ and any $a_1, \ldots, a_m \in \mathbb{R}$ satisfy
$\sum_{i,j} a_i a_j K_{ij} \ge 0$, where $K_{ij} := k(x_i, x_j)$.
$K$ is called the Gram matrix or kernel matrix. If, for pairwise distinct points, $\sum_{i,j} a_i a_j K_{ij} = 0 \Rightarrow a = 0$, the kernel is called strictly positive definite.
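The identity of slide 50 can be checked in a few lines. This sketch (an added illustration) compares the explicit degree-2 monomial feature map with the polynomial kernel $\langle x, x' \rangle^2$ on random inputs.

```python
# Verifying the kernel trick of slide 50: for the degree-2 monomial map
# Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), <Phi(x), Phi(x')> = <x, x'>^2.
import numpy as np

def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)
lhs = phi(x) @ phi(xp)
rhs = (x @ xp) ** 2
print(np.isclose(lhs, rhs))   # True: the dot product in H is computed in R^2
```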
55. The Kernel Trick: Summary
- any algorithm that only depends on dot products can benefit from the kernel trick
- this way, we can apply linear methods to vectorial as well as non-vectorial data
- think of the kernel as a nonlinear similarity measure
- examples of common kernels: polynomial $k(x, x') = (\langle x, x' \rangle + c)^d$, Gaussian $k(x, x') = \exp(-\|x - x'\|^2 / (2\sigma^2))$
- kernels are also known as covariance functions [54, 52, 55, 29]

56. Properties of PD Kernels, 1
Assumption: $\Phi$ maps $X$ into a dot product space $H$; $x, x' \in X$.
Kernels from feature maps: $k(x, x') := \langle \Phi(x), \Phi(x') \rangle$ is a pd kernel on $X \times X$.
Kernels from feature maps, II: $K(A, B) := \sum_{x \in A,\, x' \in B} k(x, x')$, where $A, B$ are finite subsets of $X$, is also a pd kernel. (Hint: use the feature map $\Phi(A) := \sum_{x \in A} \Phi(x)$.)

57. Properties of PD Kernels, 2 [36, 39]
Assumption: $k, k_1, k_2, \ldots$ are pd; $x, x' \in X$. Then:
- $k(x, x) \ge 0$ for all $x$ (positivity on the diagonal)
- $k(x, x')^2 \le k(x, x)\, k(x', x')$ (Cauchy-Schwarz inequality) (hint: compute the determinant of the Gram matrix)
- $k(x, x) = 0$ for all $x$ implies $k(x, x') = 0$ for all $x, x'$ (vanishing diagonals)
The following kernels are pd: $\alpha k$, provided $\alpha \ge 0$; $k_1 + k_2$; $k(x, x') := \lim_{n \to \infty} k_n(x, x')$, provided it exists; $k_1 \cdot k_2$; tensor products, direct sums, convolutions [22].

58. The Feature Space for PD Kernels
[4, 1, 35] Define a feature map $\Phi: X \to \mathbb{R}^X$, $x \mapsto k(\cdot, x)$. E.g., for the Gaussian kernel, each point is mapped to a bump function centred at that point.
Next steps: turn $\Phi(X)$ into a linear space; endow it with a dot product satisfying $\langle \Phi(x), \Phi(x') \rangle = k(x, x')$, i.e. $\langle k(\cdot, x), k(\cdot, x') \rangle = k(x, x')$; complete the space to get a reproducing kernel Hilbert space.

59. Turn it Into a Linear Space
Form linear combinations
$f(\cdot) = \sum_{i=1}^m \alpha_i k(\cdot, x_i)$, $g(\cdot) = \sum_{j=1}^{m'} \beta_j k(\cdot, x_j')$
($m, m' \in \mathbb{N}$, $\alpha_i, \beta_j \in \mathbb{R}$, $x_i, x_j' \in X$).

60. Endow it With a Dot Product
$\langle f, g \rangle := \sum_{i=1}^m \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x_j') = \sum_{i=1}^m \alpha_i g(x_i) = \sum_{j=1}^{m'} \beta_j f(x_j')$.
This is well-defined, symmetric, and bilinear (more later). So far, it also works for non-pd kernels.

61. The Reproducing Kernel Property
Two special cases: assume $f(\cdot) = k(\cdot, x)$. In this case we have $\langle k(\cdot, x), g \rangle = g(x)$. If moreover $g(\cdot) = k(\cdot, x')$, we have $\langle k(\cdot, x), k(\cdot, x') \rangle = k(x, x')$. $k$ is called a reproducing kernel. (Up to here, we have not used positive definiteness.)

62. Endow it With a Dot Product, II
It can be shown that $\langle \cdot, \cdot \rangle$ is a pd kernel on the set of functions $\{ f(\cdot) = \sum_{i=1}^m \alpha_i k(\cdot, x_i) \mid \alpha_i \in \mathbb{R},\, x_i \in X \}$:
$\sum_{i,j} \gamma_i \gamma_j \langle f_i, f_j \rangle = \left\langle \sum_i \gamma_i f_i, \sum_j \gamma_j f_j \right\rangle =: \langle f, f \rangle = \left\langle \sum_i \alpha_i k(\cdot, x_i), \sum_j \alpha_j k(\cdot, x_j) \right\rangle = \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) \ge 0$.
Furthermore, it is strictly positive definite:
$f(x)^2 = \langle f, k(\cdot, x) \rangle^2 \le \langle f, f \rangle \, \langle k(\cdot, x), k(\cdot, x) \rangle$,
hence $\langle f, f \rangle = 0$ implies $f = 0$. Complete the space in the corresponding norm to get a Hilbert space $H_k$.

63. The Empirical Kernel Map
Recall the feature map $\Phi: X \to \mathbb{R}^X$, $x \mapsto k(\cdot, x)$: each point is represented by its similarity to all other points. How about representing it by its similarity to a sample of points? Consider
$\Phi_m: X \to \mathbb{R}^m$, $x \mapsto k(\cdot, x)|_{(x_1, \ldots, x_m)} = (k(x_1, x), \ldots, k(x_m, x))$.

64. ctd.
$\Phi_m(x_1), \ldots, \Phi_m(x_m)$ contain all necessary information about $\Phi(x_1), \ldots, \Phi(x_m)$; the Gram matrix $G_{ij} := \langle \Phi_m(x_i), \Phi_m(x_j) \rangle$ satisfies $G = K^2$, where $K_{ij} = k(x_i, x_j)$. Modify $\Phi_m$ to
$\Phi_m^w: X \to \mathbb{R}^m$, $x \mapsto K^{-\frac{1}{2}} (k(x_1, x), \ldots, k(x_m, x))$.
This whitened map ("kernel PCA map") satisfies $\langle \Phi_m^w(x_i), \Phi_m^w(x_j) \rangle = k(x_i, x_j)$ for all $i, j = 1, \ldots, m$.
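A small numerical check of the whitened empirical kernel map of slide 64 (an added sketch, assuming a Gaussian kernel and an invertible Gram matrix):

```python
# Sketch of the whitened empirical kernel map (kernel PCA map) of slide 64:
# Phi_w(x) = K^{-1/2} (k(x_1, x), ..., k(x_m, x)); assumes K is invertible.
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 3))
K = gaussian_kernel(X, X)

lam, U = np.linalg.eigh(K)                 # K = U diag(lam) U^T, lam > 0 here
K_inv_sqrt = U @ np.diag(lam ** -0.5) @ U.T
Phi_w = K_inv_sqrt @ K                     # column i is Phi_w(x_i)

print(np.allclose(Phi_w.T @ Phi_w, K))     # <Phi_w(x_i), Phi_w(x_j)> = k(x_i, x_j)
```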
65. An Example of a Kernel Algorithm
Idea: classify points $\Phi(x)$ in feature space according to which of the two class means is closer:
$c_+ := \frac{1}{m_+} \sum_{y_i = +1} \Phi(x_i)$, $c_- := \frac{1}{m_-} \sum_{y_i = -1} \Phi(x_i)$.
(figure) Compute the sign of the dot product between $w := c_+ - c_-$ and $\Phi(x) - c$, where $c$ is the midpoint between the class means.

66. An Example of a Kernel Algorithm, ctd. [36]
$f(x) = \mathrm{sgn}\left( \frac{1}{m_+} \sum_{\{i:\, y_i = +1\}} \langle \Phi(x), \Phi(x_i) \rangle - \frac{1}{m_-} \sum_{\{i:\, y_i = -1\}} \langle \Phi(x), \Phi(x_i) \rangle + b \right)$
$= \mathrm{sgn}\left( \frac{1}{m_+} \sum_{\{i:\, y_i = +1\}} k(x, x_i) - \frac{1}{m_-} \sum_{\{i:\, y_i = -1\}} k(x, x_i) + b \right)$,
where
$b = \frac{1}{2} \left( \frac{1}{m_-^2} \sum_{\{(i,j):\, y_i = y_j = -1\}} k(x_i, x_j) - \frac{1}{m_+^2} \sum_{\{(i,j):\, y_i = y_j = +1\}} k(x_i, x_j) \right)$.
This provides a geometric interpretation of Parzen windows.

67. An Example of a Kernel Algorithm, ctd.
Demo. Exercise: derive the Parzen windows classifier by computing the distance criterion directly. SVMs follow.

68. An Example of a Kernel Algorithm, Revisited
$\mathcal{X}$ a compact subset of a separable metric space, $m, n \in \mathbb{N}$. Positive class $X := \{x_1, \ldots, x_m\} \subset \mathcal{X}$, negative class $Y := \{y_1, \ldots, y_n\} \subset \mathcal{X}$. RKHS means $\mu(X) = \frac{1}{m} \sum_{i=1}^m k(x_i, \cdot)$, $\mu(Y) = \frac{1}{n} \sum_{i=1}^n k(y_i, \cdot)$. We get a problem if $\mu(X) = \mu(Y)$!

69. When Do the Means Coincide?
$k(x, x') = \langle x, x' \rangle$: the means coincide.
$k(x, x') = (\langle x, x' \rangle + 1)^d$: all empirical moments up to order $d$ coincide.
$k$ strictly pd: $X = Y$. The mean "remembers" each point that contributed to it.

70. Proposition 2
Assume $X, Y$ are defined as above, $k$ is strictly pd, and for all $i, j$, $x_i \ne x_j$ and $y_i \ne y_j$. If for some $\alpha_i, \beta_j \in \mathbb{R} \setminus \{0\}$ we have
$\sum_{i=1}^m \alpha_i k(x_i, \cdot) = \sum_{j=1}^n \beta_j k(y_j, \cdot)$, (1)
then $X = Y$.

71. Proof (by contradiction)
W.l.o.g., assume that $x_1 \notin Y$. Subtract $\sum_{j=1}^n \beta_j k(y_j, \cdot)$ from (1), and make it a sum over pairwise distinct points, to get $0 = \sum_i \gamma_i k(z_i, \cdot)$, where $z_1 = x_1$, $\gamma_1 = \alpha_1 \ne 0$, and $z_2, \ldots \in X \cup Y \setminus \{x_1\}$, $\gamma_2, \ldots \in \mathbb{R}$. Taking the RKHS dot product with $\sum_j \gamma_j k(z_j, \cdot)$ gives $0 = \sum_{i,j} \gamma_i \gamma_j k(z_i, z_j)$ with $\gamma \ne 0$, hence $k$ cannot be strictly pd.

72. The Mean Map
$\mu: X = (x_1, \ldots, x_m) \mapsto \frac{1}{m} \sum_{i=1}^m k(x_i, \cdot)$
satisfies
$\langle \mu(X), f \rangle = \frac{1}{m} \sum_{i=1}^m \langle k(x_i, \cdot), f \rangle = \frac{1}{m} \sum_{i=1}^m f(x_i)$
and
$\|\mu(X) - \mu(Y)\| = \sup_{\|f\| \le 1} |\langle \mu(X) - \mu(Y), f \rangle| = \sup_{\|f\| \le 1} \left| \frac{1}{m} \sum_{i=1}^m f(x_i) - \frac{1}{n} \sum_{i=1}^n f(y_i) \right|$.
Note: a large distance means we can find a function distinguishing the samples.

73. Witness Function
$f = \frac{\mu(X) - \mu(Y)}{\|\mu(X) - \mu(Y)\|}$, thus $f(x) \propto \langle \mu(X) - \mu(Y), k(x, \cdot) \rangle$.
(figure: witness function $f$ for Gauss and Laplace data, plotted together with the two probability densities)
This function is in the RKHS of a Gaussian kernel, but not in the RKHS of the linear kernel.

74. The Mean Map for Measures
$p, q$ Borel probability measures with $E_{x,x' \sim p}[k(x, x')],\ E_{x,x' \sim q}[k(x, x')] < \infty$ ($\|k(x, \cdot)\| \le M < \infty$ is sufficient). Define $\mu: p \mapsto E_{x \sim p}[k(x, \cdot)]$. Note that
$\langle \mu(p), f \rangle = E_{x \sim p}[f(x)]$
and
$\|\mu(p) - \mu(q)\| = \sup_{\|f\| \le 1} \left| E_{x \sim p}[f(x)] - E_{x \sim q}[f(x)] \right|$.
Recall that in the finite sample case, for strictly pd kernels, $\mu$ was injective. How about now? [43, 17]

75. Theorem 3 [15, 13]
$p = q \iff \sup_{f \in C(\mathcal{X})} |E_{x \sim p}(f(x)) - E_{x \sim q}(f(x))| = 0$,
where $C(\mathcal{X})$ is the space of continuous bounded functions on $\mathcal{X}$. Combine this with $\|\mu(p) - \mu(q)\| = \sup_{\|f\| \le 1} |E_{x \sim p}[f(x)] - E_{x \sim q}[f(x)]|$. Replace $C(\mathcal{X})$ by the unit ball in an RKHS that is dense in $C(\mathcal{X})$: universal kernel [45], e.g., Gaussian.
Theorem 4 [19]: If $k$ is universal, then $p = q \iff \|\mu(p) - \mu(q)\| = 0$.

76. $\mu$ is Invertible on its Image
$M = \{\mu(p) \mid p \text{ is a probability distribution}\}$ (the "marginal polytope", [53]); a generalization of the moment generating function of a random variable $x$ with distribution $p$: $M_p(\cdot) = E_{x \sim p}\left[ e^{\langle x,\, \cdot \rangle} \right]$.
This provides us with a convenient metric on probability distributions, which can be used to check whether two distributions are different, provided that $\mu$ is invertible.
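The class-mean classifier of slides 65 and 66 fits in a few lines of NumPy. This is an added sketch with a Gaussian kernel and synthetic Gaussian class clusters; the kernel width and the data are arbitrary choices.

```python
# Minimal sketch of the kernel algorithm of slides 65-66: classify according
# to which class mean in feature space is closer (Parzen-windows flavour).
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
Xpos = rng.normal(loc=+1.0, size=(30, 2))
Xneg = rng.normal(loc=-1.0, size=(30, 2))

def predict(x):
    x = np.atleast_2d(x)
    kp = gaussian_kernel(x, Xpos).mean(axis=1)       # (1/m+) sum_i k(x, x_i)
    kn = gaussian_kernel(x, Xneg).mean(axis=1)       # (1/m-) sum_i k(x, x_i)
    b = 0.5 * (gaussian_kernel(Xneg, Xneg).mean() - gaussian_kernel(Xpos, Xpos).mean())
    return np.sign(kp - kn + b)

print(predict([[1.2, 0.8], [-1.5, -0.3]]))           # should typically print [ 1. -1.]
```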
77. Fourier Criterion
Assume we have densities, the kernel is shift invariant ($k(x, y) = k(x - y)$), and all Fourier transforms below exist. Note that $\mu$ is invertible iff
$\int k(x - y)\, p(y)\, dy = \int k(x - y)\, q(y)\, dy \implies p = q$,
i.e., $\hat{k} \cdot (\hat{p} - \hat{q}) = 0 \implies p = q$ (Sriperumbudur et al., 2008).
E.g., $\mu$ is invertible if $\hat{k}$ has full support. Restricting the class of distributions, weaker conditions suffice (e.g., if $\hat{k}$ has non-empty interior, $\mu$ is invertible for all distributions with compact support).

78. Fourier Optics
Application: $p$ a source of incoherent light, $I$ the indicator of a finite aperture. In Fraunhofer diffraction, the intensity image is $p * |\hat{I}|^2$. Set $\hat{k} = |\hat{I}|^2$; then this equals $\mu(p)$. This $k$ does not have full support, thus the imaging process is not invertible for the class of all light sources (Abbe), but it is if we restrict the class (e.g., to compact support).

79. Application 1: Two-Sample Problem [19]
$X, Y$ i.i.d. $m$-samples from $p, q$, respectively.
$\|\mu(p) - \mu(q)\|^2 = E_{x,x' \sim p}[k(x, x')] - 2 E_{x \sim p,\, y \sim q}[k(x, y)] + E_{y,y' \sim q}[k(y, y')] = E_{x,x' \sim p,\, y,y' \sim q}[h((x, y), (x', y'))]$
with $h((x, y), (x', y')) := k(x, x') - k(x, y') - k(y, x') + k(y, y')$. Define
$D(p, q)^2 := E_{x,x' \sim p,\, y,y' \sim q}[h((x, y), (x', y'))]$,
$D(X, Y)^2 := \frac{1}{m(m-1)} \sum_{i \ne j} h((x_i, y_i), (x_j, y_j))$.
$D(X, Y)^2$ is an unbiased estimator of $D(p, q)^2$. It is easy to compute, and works on structured data.

80. Theorem 5
Assume $k$ is bounded. $D(X, Y)^2$ converges to $D(p, q)^2$ in probability with rate $O(m^{-1/2})$. This could be used as a basis for a test, but uniform convergence bounds are often loose.
Theorem 6: We assume $E[h^2] < \infty$. When $p \ne q$, then $\sqrt{m}\,(D(X, Y)^2 - D(p, q)^2)$ converges in distribution to a zero mean Gaussian with variance $\sigma_u^2 = 4\left( E_z[(E_{z'} h(z, z'))^2] - (E_{z,z'}[h(z, z')])^2 \right)$. When $p = q$, then $m\,(D(X, Y)^2 - D(p, q)^2) = m\, D(X, Y)^2$ converges in distribution to
$\sum_{l=1}^\infty \lambda_l (q_l^2 - 2)$, (2)
where $q_l \sim N(0, 2)$ i.i.d., the $\lambda_i$ are the solutions to the eigenvalue equation
$\int_{\mathcal{X}} \tilde{k}(x, x')\, \psi_i(x)\, dp(x) = \lambda_i \psi_i(x')$,
and $\tilde{k}(x_i, x_j) := k(x_i, x_j) - E_x k(x_i, x) - E_x k(x, x_j) + E_{x,x'} k(x, x')$ is the centred RKHS kernel.

81. Application 2: Dependence Measures
Assume that $(x, y)$ are drawn from $p_{xy}$, with marginals $p_x, p_y$. We want to know whether $p_{xy}$ factorizes. [2, 16]: kernel generalized variance; [20, 21]: kernel constrained covariance, HSIC. Main idea [25, 34]: $x$ and $y$ are independent iff for all bounded continuous functions $f, g$ we have $\mathrm{Cov}(f(x), g(y)) = 0$.

82. ctd.
$k$ a kernel on $\mathcal{X} \times \mathcal{Y}$:
$\mu(p_{xy}) := E_{(x,y) \sim p_{xy}}[k((x, y), \cdot)]$, $\mu(p_x \times p_y) := E_{x \sim p_x,\, y \sim p_y}[k((x, y), \cdot)]$.
Use $\Delta := \|\mu(p_{xy}) - \mu(p_x \times p_y)\|$ as a measure of dependence. For $k((x, y), (x', y')) = k_x(x, x')\, k_y(y, y')$: $\Delta^2$ equals the Hilbert-Schmidt norm of the covariance operator between the two RKHSs (HSIC), with empirical estimate $m^{-2} \operatorname{tr}(H K_x H K_y)$, where $H = I - \frac{1}{m} \mathbf{1}\mathbf{1}^\top$ [20, 44].

83. Witness Function of the Equivalent Optimisation Problem
(figure: dependence witness and sample)
Application: learning causal structures (Sun et al., ICML 2007; Fukumizu et al., NIPS 2007).

84. Application 3: Covariate Shift Correction and Local Learning
Training set $X = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ drawn from $p$; test set $X' = \{(x_1', y_1'), \ldots, (x_n', y_n')\}$ from $p' \ne p$. Assume $p_{y|x} = p'_{y|x}$. [40]: reweight the training set.

85. ctd.
Minimize $\left\| \sum_{i=1}^m \beta_i k(x_i, \cdot) - \mu(X') \right\|^2 + \lambda \|\beta\|^2$ subject to $\beta_i \ge 0$, $\sum_i \beta_i = 1$.
Equivalent QP: minimize $\frac{1}{2} \beta^\top (K + \lambda \mathbf{1}) \beta - \beta^\top l$ subject to $\beta_i \ge 0$ and $\sum_i \beta_i = 1$, where $K_{ij} := k(x_i, x_j)$ and $l_i = \langle k(x_i, \cdot), \mu(X') \rangle$.
Experiments show that in underspecified situations (e.g., large kernel widths), this helps [23]. $X' = \{x\}$ leads to a local sample weighting scheme.
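The unbiased estimator $D(X, Y)^2$ of slide 79 is straightforward to compute. The following sketch (an added illustration, assuming equal sample sizes and a Gaussian kernel) evaluates it for two samples from the same and from different distributions.

```python
# Sketch of the unbiased two-sample statistic D(X, Y)^2 of slide 79
# (the MMD^2 estimator), with a Gaussian kernel and equal sample sizes m.
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    m = len(X)
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    # h((x_i, y_i), (x_j, y_j)) summed over i != j, divided by m (m - 1)
    h = Kxx + Kyy - Kxy - Kxy.T
    np.fill_diagonal(h, 0.0)
    return h.sum() / (m * (m - 1))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
print(mmd2_unbiased(X, rng.normal(size=(200, 1))))          # near 0: same p and q
print(mmd2_unbiased(X, rng.normal(loc=1.0, size=(200, 1)))) # clearly positive
```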
86. The Representer Theorem
Theorem 7: Given a pd kernel $k$ on $X \times X$, a training set $(x_1, y_1), \ldots, (x_m, y_m) \in X \times \mathbb{R}$, a strictly monotonically increasing real-valued function $\Omega$ on $[0, \infty[$, and an arbitrary cost function $c: (X \times \mathbb{R}^2)^m \to \mathbb{R} \cup \{\infty\}$, any $f \in H_k$ minimizing the regularized risk functional
$c((x_1, y_1, f(x_1)), \ldots, (x_m, y_m, f(x_m))) + \Omega(\|f\|)$ (3)
admits a representation of the form
$f(\cdot) = \sum_{i=1}^m \alpha_i k(x_i, \cdot)$.

87. Remarks
Significance: many learning algorithms have solutions that can be expressed as expansions in terms of the training examples. Original form, with mean squared loss
$c((x_1, y_1, f(x_1)), \ldots, (x_m, y_m, f(x_m))) = \frac{1}{m} \sum_{i=1}^m (y_i - f(x_i))^2$
and $\Omega(\|f\|) = \lambda \|f\|^2$ ($\lambda > 0$): [27]. Generalization to non-quadratic cost functions: [10]. Present form: [36].

88. Proof
Decompose $f \in H$ into a part in the span of the $k(x_i, \cdot)$ and an orthogonal one:
$f = \sum_i \alpha_i k(x_i, \cdot) + f_\perp$, where $\langle f_\perp, k(x_j, \cdot) \rangle = 0$ for all $j$.
Application of $f$ to an arbitrary training point $x_j$ yields
$f(x_j) = \langle f, k(x_j, \cdot) \rangle = \left\langle \sum_i \alpha_i k(x_i, \cdot) + f_\perp,\ k(x_j, \cdot) \right\rangle = \sum_i \alpha_i \langle k(x_i, \cdot), k(x_j, \cdot) \rangle$,
independent of $f_\perp$.

89. Proof: Second Part of (3)
Since $f_\perp$ is orthogonal to $\sum_i \alpha_i k(x_i, \cdot)$, and $\Omega$ is strictly monotonic, we get
$\Omega(\|f\|) = \Omega\left( \left\| \sum_i \alpha_i k(x_i, \cdot) + f_\perp \right\| \right) = \Omega\left( \sqrt{ \left\| \sum_i \alpha_i k(x_i, \cdot) \right\|^2 + \|f_\perp\|^2 } \right) \ge \Omega\left( \left\| \sum_i \alpha_i k(x_i, \cdot) \right\| \right)$, (4)
with equality occurring if and only if $f_\perp = 0$. Hence, any minimizer must have $f_\perp = 0$. Consequently, any solution takes the form $f = \sum_i \alpha_i k(x_i, \cdot)$.

90. Application: Support Vector Classification
Here, $y_i \in \{\pm 1\}$. Use
$c((x_i, y_i, f(x_i))_i) = \frac{1}{\lambda} \sum_i \max(0, 1 - y_i f(x_i))$
and the regularizer $\Omega(\|f\|) = \|f\|^2$. $\lambda \to 0$ leads to the hard margin SVM.

91. Further Applications
Bayesian MAP estimates: identify (3) with the negative log posterior (cf. Kimeldorf & Wahba, 1970; Poggio & Girosi, 1990), i.e. $\exp(-c((x_i, y_i, f(x_i))_i))$ is the likelihood of the data, and $\exp(-\Omega(\|f\|))$ is the prior over the set of functions; e.g., $\Omega(\|f\|) = \lambda \|f\|^2$ corresponds to a Gaussian process prior [55] with covariance function $k$; the minimizer of (3) is then the MAP estimate.
Kernel PCA (see below) can be shown to correspond to the case of
$c((x_i, y_i, f(x_i))_{i=1,\ldots,m}) = 0$ if $\frac{1}{m} \sum_i \left( f(x_i) - \frac{1}{m} \sum_j f(x_j) \right)^2 = 1$, and $\infty$ otherwise,
with $\Omega$ an arbitrary strictly monotonically increasing function.

92. Conclusion
The kernel corresponds to a similarity measure for the data, or a (linear) representation of the data, or a hypothesis space for learning. Kernels allow the formulation of a multitude of geometrical algorithms (Parzen windows, 2-sample tests, SVMs, kernel PCA, ...).

93. Kernel PCA [37]
(figure: linear PCA with $k(x, y) = \langle x, y \rangle$ vs. kernel PCA with $k(x, y) = \langle x, y \rangle^d$; contour lines of constant principal component value in input space, and the induced map into $H$)

94. Kernel PCA, II
$x_1, \ldots, x_m \in X$, $\Phi: X \to H$, $C = \frac{1}{m} \sum_{j=1}^m \Phi(x_j) \Phi(x_j)^\top$.
Eigenvalue problem:
$\lambda V = C V = \frac{1}{m} \sum_{j=1}^m \langle \Phi(x_j), V \rangle \Phi(x_j)$.
For $\lambda \ne 0$, $V \in \mathrm{span}\{\Phi(x_1), \ldots, \Phi(x_m)\}$, thus
$V = \sum_{i=1}^m \alpha_i \Phi(x_i)$,
and the eigenvalue problem can be written as
$\lambda \langle \Phi(x_n), V \rangle = \langle \Phi(x_n), C V \rangle$ for all $n = 1, \ldots, m$.

95. Kernel PCA in Dual Variables
In terms of the $m \times m$ Gram matrix $K_{ij} := \langle \Phi(x_i), \Phi(x_j) \rangle = k(x_i, x_j)$, this leads to
$m \lambda K \alpha = K^2 \alpha$, where $\alpha = (\alpha_1, \ldots, \alpha_m)^\top$.
Solve $m \lambda \alpha = K \alpha$ to obtain eigenpairs $(\lambda_n, \alpha^n)$. The normalization $\langle V^n, V^n \rangle = 1$ amounts to $m\lambda_n \langle \alpha^n, \alpha^n \rangle = \langle \alpha^n, K \alpha^n \rangle = 1$; thus divide each unit-norm $\alpha^n$ by the square root of the corresponding eigenvalue of $K$.
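A minimal sketch of kernel PCA in dual variables (slides 94 to 96), added for illustration; it uses a Gaussian kernel and, like the basic derivation above, omits centering in feature space.

```python
# Sketch of kernel PCA in dual variables: eigendecompose K, normalize the
# dual vectors so that <V^n, V^n> = alpha^T K alpha = 1, project test points.
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
K = gaussian_kernel(X, X)

eigvals, eigvecs = np.linalg.eigh(K)                 # K alpha = (m lambda) alpha
top = np.argsort(eigvals)[::-1][:3]                  # three leading components
alpha = eigvecs[:, top] / np.sqrt(eigvals[top])      # alpha^T K alpha = 1 per component

x_test = rng.normal(size=(5, 2))
k_test = gaussian_kernel(X, x_test)                  # rows: k(x_i, x) for each test point
features = alpha.T @ k_test                          # sum_i alpha_i^n k(x_i, x)
print(features.shape)                                # (3, 5)
```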
96. Feature Extraction
Compute projections onto the eigenvectors $V^n = \sum_{i=1}^m \alpha_i^n \Phi(x_i)$ in $H$: for a test point $x$ with image $\Phi(x)$ in $H$ we get the features
$\langle V^n, \Phi(x) \rangle = \sum_{i=1}^m \alpha_i^n \langle \Phi(x_i), \Phi(x) \rangle = \sum_{i=1}^m \alpha_i^n k(x_i, x)$.

97. The Kernel PCA Map
Recall $\Phi_m^w: X \to \mathbb{R}^m$, $x \mapsto K^{-\frac{1}{2}} (k(x_1, x), \ldots, k(x_m, x))$. If $K = U D U^\top$ is $K$'s diagonalization, then $K^{-1/2} = U D^{-1/2} U^\top$. Thus we have
$\Phi_m^w(x) = U D^{-1/2} U^\top (k(x_1, x), \ldots, k(x_m, x))$.
We can drop the leading $U$ (since it leaves the dot product invariant) to get a map
$\Phi^w_{\rm KPCA}(x) = D^{-1/2} U^\top (k(x_1, x), \ldots, k(x_m, x))$.
The rows of $U^\top$ are the eigenvectors $\alpha^n$ of $K$, and the entries of the diagonal matrix $D^{-1/2}$ equal $\lambda_i^{-1/2}$.

98. Toy Example with Gaussian Kernel
$k(x, x') = \exp(-\|x - x'\|^2)$ (figure: kernel PCA feature extractors on a toy data set)

99. Super-Resolution (Kim, Franz, & Schölkopf, 2004)
Comparison between different super-resolution methods: a. original image of resolution 528 x 396; b. low resolution image (264 x 198) stretched to the original scale; c. bicubic interpolation; d. supervised example-based learning based on a nearest neighbor classifier; f. unsupervised KPCA reconstruction; g. enlarged portions of a-d and f (from left to right).

100. Support Vector Classifiers
(figure: input space and feature space; the kernel-induced map makes the classes linearly separable) [6]

101. Separating Hyperplane
(figure: the hyperplane $\{x \mid \langle w, x \rangle + b = 0\}$, with $\langle w, x \rangle + b > 0$ on one side and $\langle w, x \rangle + b < 0$ on the other)

102. Optimal Separating Hyperplane [50]
(figure: among all separating hyperplanes, the one with maximal margin)

103. Eliminating the Scaling Freedom [47]
Note: if $c \ne 0$, then $\{x \mid \langle w, x \rangle + b = 0\} = \{x \mid \langle cw, x \rangle + cb = 0\}$. Hence $(cw, cb)$ describes the same hyperplane as $(w, b)$.
Definition: the hyperplane is in canonical form w.r.t. $X^* = \{x_1, \ldots, x_r\}$ if $\min_{x_i \in X^*} |\langle w, x_i \rangle + b| = 1$.

104. Canonical Optimal Hyperplane
(figure: the hyperplanes $\{x \mid \langle w, x \rangle + b = +1\}$ and $\{x \mid \langle w, x \rangle + b = -1\}$ on either side of $\{x \mid \langle w, x \rangle + b = 0\}$)
Note: for $x_1$ with $\langle w, x_1 \rangle + b = +1$ ($y_1 = +1$) and $x_2$ with $\langle w, x_2 \rangle + b = -1$ ($y_2 = -1$), we get $\langle w, x_1 - x_2 \rangle = 2$, hence $\left\langle \frac{w}{\|w\|}, x_1 - x_2 \right\rangle = \frac{2}{\|w\|}$.

105. Canonical Hyperplanes [47]
Note: if $c \ne 0$, then $\{x \mid \langle w, x \rangle + b = 0\} = \{x \mid \langle cw, x \rangle + cb = 0\}$; hence $(cw, cb)$ describes the same hyperplane as $(w, b)$.
Definition: the hyperplane is in canonical form w.r.t. $X^* = \{x_1, \ldots, x_r\}$ if $\min_{x_i \in X^*} |\langle w, x_i \rangle + b| = 1$.
Note that for canonical hyperplanes, the distance of the closest point to the hyperplane (the "margin") is $1/\|w\|$:
$\min_{x_i \in X^*} \left| \left\langle \frac{w}{\|w\|}, x_i \right\rangle + \frac{b}{\|w\|} \right| = \frac{1}{\|w\|}$.

106. Theorem 8 (Vapnik [46])
Consider hyperplanes $\langle w, x \rangle = 0$ where $w$ is normalized such that they are in canonical form w.r.t. a set of points $X^* = \{x_1, \ldots, x_r\}$, i.e., $\min_{i=1,\ldots,r} |\langle w, x_i \rangle| = 1$. The set of decision functions $f_w(x) = \mathrm{sgn}\langle x, w \rangle$ defined on $X^*$ and satisfying the constraint $\|w\| \le \Lambda$ has a VC dimension satisfying
$h \le R^2 \Lambda^2$.
Here, $R$ is the radius of the smallest sphere around the origin containing $X^*$.

107. (figure: points inside a sphere of radius $R$, separated with margin $1/\Lambda$)

108. Proof Strategy (Gurvits, 1997)
Assume that $x_1, \ldots, x_r$ are shattered by canonical hyperplanes with $\|w\| \le \Lambda$, i.e., for all $y_1, \ldots, y_r \in \{\pm 1\}$,
$y_i \langle w, x_i \rangle \ge 1$ for all $i = 1, \ldots, r$. (5)
Two steps: prove that the more points we want to shatter (5), the larger $\left\| \sum_{i=1}^r y_i x_i \right\|$ must be; then upper bound the size of $\left\| \sum_{i=1}^r y_i x_i \right\|$ in terms of $R$. Combining the two tells us how many points we can at most shatter.

109. Part I
Summing (5) over $i = 1, \ldots, r$ yields $\left\langle w, \sum_{i=1}^r y_i x_i \right\rangle \ge r$. By the Cauchy-Schwarz inequality, on the other hand, we have
$\left\langle w, \sum_{i=1}^r y_i x_i \right\rangle \le \|w\| \left\| \sum_{i=1}^r y_i x_i \right\| \le \Lambda \left\| \sum_{i=1}^r y_i x_i \right\|$.
Combining both:
$\frac{r}{\Lambda} \le \left\| \sum_{i=1}^r y_i x_i \right\|$. (6)
110. Part II
Consider independent random labels $y_i \in \{\pm 1\}$, uniformly distributed (Rademacher variables).
$E\left[ \left\| \sum_{i=1}^r y_i x_i \right\|^2 \right] = E\left[ \left\langle \sum_{i=1}^r y_i x_i, \sum_{j=1}^r y_j x_j \right\rangle \right] = \sum_{i=1}^r E\left[ \left\langle y_i x_i, \sum_{j \ne i} y_j x_j \right\rangle + \langle y_i x_i, y_i x_i \rangle \right]$
$= \sum_{i=1}^r \sum_{j \ne i} E[\langle y_i x_i, y_j x_j \rangle] + \sum_{i=1}^r E[\langle y_i x_i, y_i x_i \rangle] = \sum_{i=1}^r E\left[ \|y_i x_i\|^2 \right] = \sum_{i=1}^r \|x_i\|^2$.

111. Part II, ctd.
Since $\|x_i\| \le R$, we get
$E\left[ \left\| \sum_{i=1}^r y_i x_i \right\|^2 \right] \le r R^2$.
This holds for the expectation over the random choices of the labels, hence there must be at least one set of labels for which it also holds true. Use this set. Hence
$\left\| \sum_{i=1}^r y_i x_i \right\|^2 \le r R^2$.

112. Part I and II Combined
Part I: $\left( \frac{r}{\Lambda} \right)^2 \le \left\| \sum_{i=1}^r y_i x_i \right\|^2$. Part II: $\left\| \sum_{i=1}^r y_i x_i \right\|^2 \le r R^2$. Hence $\frac{r^2}{\Lambda^2} \le r R^2$, i.e., $r \le R^2 \Lambda^2$, completing the proof.

113. Pattern Noise as Maximum Margin Regularization
(figure: a large margin tolerates noise added to the patterns)

114. Maximum Margin vs. MDL, 2D Case
(figure) We can perturb the weight vector by an angle $\Delta\gamma$ with $|\Delta\gamma| < \arcsin\frac{\rho}{R}$ (where $\rho$ denotes the margin and $R$ the radius of the data) and still correctly separate the data. Hence we only need to store $w$ with accuracy $\Delta\gamma$ [36, 51].

115. Formulation as an Optimization Problem
Hyperplane with maximum margin: minimize $\|w\|^2$ (recall: margin $\sim 1/\|w\|$) subject to
$y_i [\langle w, x_i \rangle + b] \ge 1$ for $i = 1, \ldots, m$
(i.e., the training data are separated correctly).

116. Lagrange Function (e.g., [5])
Introduce Lagrange multipliers $\alpha_i \ge 0$ and a Lagrangian
$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^m \alpha_i \left( y_i [\langle w, x_i \rangle + b] - 1 \right)$.
$L$ has to be minimized w.r.t. the primal variables $w$ and $b$ and maximized with respect to the dual variables $\alpha_i$. If a constraint is violated, then $y_i (\langle w, x_i \rangle + b) - 1 < 0$, and $\alpha_i$ will grow to increase $L$. How far? $w, b$ want to decrease $L$, i.e. they have to change such that the constraint is satisfied; if the problem is separable, this ensures that $\alpha_i < \infty$. Similarly, if $y_i (\langle w, x_i \rangle + b) - 1 > 0$, then $\alpha_i = 0$: otherwise, $L$ could be increased by decreasing $\alpha_i$ (KKT conditions).

117. Derivation of the Dual Problem
At the extremum, we have
$\frac{\partial}{\partial b} L(w, b, \alpha) = 0$ and $\frac{\partial}{\partial w} L(w, b, \alpha) = 0$,
i.e.
$\sum_{i=1}^m \alpha_i y_i = 0$ and $w = \sum_{i=1}^m \alpha_i y_i x_i$.
Substitute both into $L$ to get the dual problem.

118. The Support Vector Expansion
$w = \sum_{i=1}^m \alpha_i y_i x_i$, where for all $i = 1, \ldots, m$ either
$y_i [\langle w, x_i \rangle + b] > 1$, which implies $\alpha_i = 0$ ($x_i$ irrelevant), or
$y_i [\langle w, x_i \rangle + b] = 1$ (on the margin): $x_i$ is a Support Vector.
The solution is determined by the examples on the margin. Thus
$f(x) = \mathrm{sgn}(\langle x, w \rangle + b) = \mathrm{sgn}\left( \sum_{i=1}^m \alpha_i y_i \langle x, x_i \rangle + b \right)$.

119. Why it is Good to Have Few SVs
Leave out an example that does not become an SV: same solution. Theorem [49]: Denote by $\#\mathrm{SV}(m)$ the number of SVs obtained by training on $m$ examples randomly drawn from $P(x, y)$, and by $E$ the expectation. Then
$E[\mathrm{Prob(test\ error)}] \le \frac{E[\#\mathrm{SV}(m)]}{m}$.
Here, $\mathrm{Prob(test\ error)}$ refers to the expected value of the risk, where the expectation is taken over training the SVM on samples of size $m - 1$.

120. A Mechanical Interpretation [8]
Assume that each SV $x_i$ exerts a perpendicular force of size $\alpha_i$ and sign $y_i$ on a solid plane sheet lying along the hyperplane. Then the solution is mechanically stable:
$\sum_{i=1}^m \alpha_i y_i = 0$ implies that the forces sum to zero;
$w = \sum_{i=1}^m \alpha_i y_i x_i$ implies that the torques sum to zero, via $\sum_i x_i \times y_i \alpha_i\, w/\|w\| = w \times w/\|w\| = 0$.

121. Dual Problem
Maximize
$W(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
subject to $\alpha_i \ge 0$, $i = 1, \ldots, m$, and $\sum_{i=1}^m \alpha_i y_i = 0$.
Both the final decision function and the function to be maximized are expressed in dot products: we can use a kernel to compute $\langle x_i, x_j \rangle = \langle \Phi(x_i), \Phi(x_j) \rangle = k(x_i, x_j)$.
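As a usage illustration of the kernelized dual (slides 121 and 127), the sketch below trains a soft-margin SVM with scikit-learn's SVC (an off-the-shelf solver, not the training procedure discussed on slide 129) and reconstructs its decision values from the support vector expansion; the data and hyperparameters are arbitrary choices.

```python
# Hedged illustration of the C-SVM with a Gaussian kernel: the decision
# function is rebuilt as sum_i y_i alpha_i k(x, x_i) + b from the fitted model.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.0, 1.0, (40, 2)), rng.normal(-1.0, 1.0, (40, 2))])
y = np.array([1] * 40 + [-1] * 40)

gamma = 0.5                                   # k(x, x') = exp(-gamma ||x - x'||^2)
clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

# Reconstruct the decision values from the support vector expansion.
sv, coef, b = clf.support_vectors_, clf.dual_coef_, clf.intercept_   # coef holds y_i alpha_i
d2 = ((X[:, None, :] - sv[None, :, :]) ** 2).sum(-1)
decision = (np.exp(-gamma * d2) @ coef.ravel()) + b
print(np.allclose(decision, clf.decision_function(X)))               # True
print("number of SVs:", len(sv), "of", len(X))
```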
122. The SVM Architecture
(figure) $f(x) = \mathrm{sgn}\left( \sum_i \alpha_i y_i\, k(x, x_i) + b \right)$: the input vector $x$ is compared to the support vectors $x_1, \ldots, x_4$ via $k(x, x_i)$, the results are combined with the learned weights, and the sign gives the classification. Example kernels: $k(x, x_i) = (\langle x, x_i \rangle)^d$, $k(x, x_i) = \exp(-\|x - x_i\|^2 / c)$, $k(x, x_i) = \tanh(\kappa \langle x, x_i \rangle + \theta)$.

123. Toy Example with Gaussian Kernel
$k(x, x') = \exp(-\|x - x'\|^2)$ (figure: decision boundary and margins on a toy data set)

124. Nonseparable Problems [3, 9]
If $y_i (\langle w, x_i \rangle + b) \ge 1$ cannot be satisfied, then $\alpha_i \to \infty$. Modify the constraint to
$y_i (\langle w, x_i \rangle + b) \ge 1 - \xi_i$ with $\xi_i \ge 0$
("soft margin") and add $C \sum_{i=1}^m \xi_i$ to the objective function.

125. Soft Margin SVMs
C-SVM [9]: for $C > 0$, minimize
$\tau(w, \xi) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^m \xi_i$
subject to $y_i (\langle w, x_i \rangle + b) \ge 1 - \xi_i$, $\xi_i \ge 0$ (margin $1/\|w\|$).
$\nu$-SVM [38]: for $0 < \nu \le 1$, minimize
$\tau(w, \xi, \rho) = \frac{1}{2} \|w\|^2 - \nu\rho + \frac{1}{m} \sum_i \xi_i$
subject to $y_i (\langle w, x_i \rangle + b) \ge \rho - \xi_i$, $\xi_i \ge 0$ (margin $\rho/\|w\|$).

126. The $\nu$-Property
SVs: $\alpha_i > 0$; margin errors: $\xi_i > 0$. The KKT conditions imply that all margin errors are SVs, but not all SVs need to be margin errors; those which are not lie exactly on the edge of the margin.
Proposition: 1. fraction of margin errors $\le \nu \le$ fraction of SVs; 2. asymptotically, both fractions converge to $\nu$.

127. Duals, Using Kernels
C-SVM dual: maximize
$W(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$
subject to $0 \le \alpha_i \le C$, $\sum_i \alpha_i y_i = 0$.
$\nu$-SVM dual: maximize
$W(\alpha) = -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$
subject to $0 \le \alpha_i \le \frac{1}{m}$, $\sum_i \alpha_i y_i = 0$, $\sum_i \alpha_i \ge \nu$.
In both cases, the decision function is
$f(x) = \mathrm{sgn}\left( \sum_{i=1}^m \alpha_i y_i k(x, x_i) + b \right)$.

128. Connection between $\nu$-SVC and C-SVC
Proposition: If $\nu$-SV classification leads to $\rho > 0$, then C-SV classification, with $C$ set a priori to $1/\rho$, leads to the same decision function.
Proof: Minimize the primal target, then fix $\rho$, and minimize only over the remaining variables: nothing will change. Hence the obtained solution $w_0, b_0, \xi_0$ minimizes the primal problem of C-SVC, for $C = 1$, subject to $y_i (\langle x_i, w \rangle + b) \ge \rho - \xi_i$. To recover the constraint $y_i (\langle x_i, w \rangle + b) \ge 1 - \xi_i$, rescale to the set of variables $w' = w/\rho$, $b' = b/\rho$, $\xi' = \xi/\rho$. This leaves us, up to a constant scaling factor $\rho^2$, with the C-SV target with $C = 1/\rho$.

129. SVM Training
Naive approach: the complexity of maximizing
$W(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j k(x_i, x_j)$
scales with the third power of the training set size $m$. Only SVs are relevant: only compute $(k(x_i, x_j))_{ij}$ for SVs, and extract them iteratively by cycling through the training set in chunks [46]. In fact, one can use chunks which do not even contain all SVs [31]: maximize over these sub-problems, using your favorite optimizer. The extreme case: by making the sub-problems very small (just two points), one can solve them analytically [32]. http://www.kernel-machines.org/software.html

130. MNIST Benchmark
Handwritten character benchmark: 60,000 training and 10,000 test examples, 28 x 28 pixel images. (figure: sample digits)

131. MNIST Error Rates

Classifier                   test error   reference
linear classifier            8.4%         [7]
3-nearest-neighbour          2.4%         [7]
SVM                          1.4%         [8]
tangent distance             1.1%         [41]
LeNet4                       1.1%         [28]
boosted LeNet4               0.7%         [28]
translation invariant SVM    0.56%        [11]

Note: the SVM used a polynomial kernel of degree 9, corresponding to a feature space of dimension $3.2 \times 10^{20}$.

132.-136. References
[1] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337-404, 1950.
[2] F. R. Bach and M. I. Jordan. Kernel independent component analysis. J. Mach. Learn. Res., 3:1-48, 2002.
[3] K. P. Bennett and O. L. Mangasarian. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1:23-34, 1992.
[4] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, New York, 1984.
[5] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.
[6] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, Pittsburgh, PA, July 1992. ACM Press.
[7] L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, L. D. Jackel, Y. LeCun, U. A. Müller, E. Säckinger, P. Simard, and V. Vapnik. Comparison of classifier methods: a case study in handwritten digit recognition. In Proceedings of the 12th International Conference on Pattern Recognition and Neural Networks, Jerusalem, pages 77-87. IEEE Computer Society Press, 1994.
[8] C. J. C. Burges and B. Schölkopf. Improving the accuracy and speed of support vector learning machines. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 375-381, Cambridge, MA, 1997. MIT Press.
[9] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.
[10] D. Cox and F. O'Sullivan. Asymptotic analysis of penalized likelihood and related estimators. Annals of Statistics, 18:1676-1695, 1990.
[11] D. DeCoste and B. Schölkopf. Training invariant support vector machines. Machine Learning, 2001. Accepted for publication. Also: Technical Report JPL-MLTR-00-1, Jet Propulsion Laboratory, Pasadena, CA, 2000.
[12] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Number 31 in Applications of Mathematics. Springer, New York, 1996.
[13] R. M. Dudley. Real Analysis and Probability. Cambridge University Press, Cambridge, UK, 2002.
[14] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 171-203, Cambridge, MA, 2000. MIT Press.
[15] R. Fortet and E. Mourier. Convergence de la répartition empirique vers la répartition théorique. Ann. Scient. École Norm. Sup., 70:266-285, 1953.
[16] K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res., 5:73-99, 2004.
[17] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20, pages 489-496, Cambridge, MA, USA, 2008. MIT Press.
[18] F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6):1455-1480, 1998.
[19] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample-problem. In B. Schölkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19. The MIT Press, Cambridge, MA, 2007.
[20] A. Gretton, O. Bousquet, A. J. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In S. Jain, H. U. Simon, and E. Tomita, editors, Proceedings Algorithmic Learning Theory, pages 63-77, Berlin, Germany, 2005. Springer-Verlag.
[21] A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Schölkopf. Kernel methods for measuring independence. J. Mach. Learn. Res., 6:2075-2129, 2005.
[22] D. Haussler. Convolutional kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer Science Department, University of California at Santa Cruz, 1999.
[23] J. Huang, A. J. Smola, A. Gretton, K. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In B. Schölkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19. The MIT Press, Cambridge, MA, 2007.
[24] I. A. Ibragimov and R. Z. Hasminskii. Statistical Estimation: Asymptotic Theory. Springer-Verlag, New York, 1981.
[25] J. Jacod and P. Protter. Probability Essentials. Springer, New York, 2000.
[26] G. S. Kimeldorf and G. Wahba. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Annals of Mathematical Statistics, 41:495-502, 1970.
[27] G. S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Applic., 33:82-95, 1971.
[28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86:2278-2324, 1998.
[29] D. J. C. MacKay. Introduction to Gaussian processes. In C. M. Bishop, editor, Neural Networks and Machine Learning, pages 133-165. Springer-Verlag, Berlin, 1998.
[30] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London, A 209:415-446, 1909.
[31] E. Osuna, R. Freund, and F. Girosi. Support vector machines: Training and applications. Technical Report AIM-1602, MIT A.I. Lab., 1996.
[32] J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 185-208, Cambridge, MA, 1999. MIT Press.
[33] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9), September 1990.
[34] A. Rényi. On measures of dependence. Acta Math. Acad. Sci. Hungar., 10:441-451, 1959.
[35] S. Saitoh. Theory of Reproducing Kernels and its Applications. Longman Scientific & Technical, Harlow, England, 1988.
[36] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[37] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.
[38] B. Schölkopf, A. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12:1207-1245, 2000.
[39] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004.
[40] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90, 2000.
[41] P. Simard, Y. LeCun, and J. Denker. Efficient pattern recognition using a new transformation distance. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 50-58, San Mateo, CA, 1993. Morgan Kaufmann.
[42] A. Smola, B. Schölkopf, and K.-R. Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11:637-649, 1998.
[43] A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In M. Hutter, R. A. Servedio, and E. Takimoto, editors, Algorithmic Learning Theory: 18th International Conference, pages 13-31, Berlin, 2007. Springer.
[44] A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proc. Intl. Conf. Algorithmic Learning Theory, volume 4754 of LNAI. Springer, 2007.
[45] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. J. Mach. Learn. Res., 2:67-93, 2001.
[46] V. Vapnik. Estimation of Dependences Based on Empirical Data [in Russian]. Nauka, Moscow, 1979. (English translation: Springer Verlag, New York, 1982.)
[47] V. Vapnik. The Nature of Statistical Learning Theory. Springer, NY, 1995.
[48] V. Vapnik. Statistical Learning Theory. Wiley, NY, 1998.
[49] V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition [in Russian]. Nauka, Moscow, 1974. (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)
[50] V. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24, 1963.
[51] U. von Luxburg, O. Bousquet, and B. Schölkopf. A compression approach to support vector model selection. Technical Report 101, Max Planck Institute for Biological Cybernetics, Tübingen, Germany, 2002. See the more detailed JMLR version.
[52] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.
[53] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Technical Report 649, UC Berkeley, Department of Statistics, September 2003.
[54] H. L. Weinert. Reproducing Kernel Hilbert Spaces. Hutchinson Ross, Stroudsburg, PA, 1982.
[55] C. K. I. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In M. I. Jordan, editor, Learning and Inference in Graphical Models. Kluwer, 1998.
[56] D. H. Wolpert. The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7):1341-1390, 1996.

137. Regularization Interpretation of Kernel Machines
The norm in $H$ can be interpreted as a regularization term (Girosi 1998, Smola et al. 1998, Evgeniou et al. 2000): if $P$ is a regularization operator (mapping into a dot product space $D$) such that $k$ is the Green's function of $P^* P$, then
$\|w\| = \|P f\|$,
where
$w = \sum_{i=1}^m \alpha_i \Phi(x_i)$ and $f(x) = \sum_i \alpha_i k(x_i, x)$.
Example: for the Gaussian kernel, $P$ is a linear combination of differential operators.

138. ctd.
$\|w\|^2 = \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) = \sum_{i,j} \alpha_i \alpha_j \langle k(x_i, \cdot), \delta_{x_j}(\cdot) \rangle = \sum_{i,j} \alpha_i \alpha_j \langle k(x_i, \cdot), (P^* P\, k)(x_j, \cdot) \rangle$
$= \sum_{i,j} \alpha_i \alpha_j \langle (P k)(x_i, \cdot), (P k)(x_j, \cdot) \rangle_D = \left\langle \sum_i \alpha_i (P k)(x_i, \cdot), \sum_j \alpha_j (P k)(x_j, \cdot) \right\rangle_D = \|P f\|^2$,
using $f(x) = \sum_i \alpha_i k(x_i, x)$.
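A numeric check of the first identity used on slide 138, added for illustration: with the explicit degree-2 monomial feature map, $\|w\|^2 = \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j)$ can be verified directly.

```python
# Check of ||w||^2 = sum_{i,j} alpha_i alpha_j k(x_i, x_j) for
# w = sum_i alpha_i Phi(x_i), using the explicit degree-2 monomial map.
import numpy as np

def phi(x):                                    # Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
alpha = rng.normal(size=10)

K = (X @ X.T) ** 2                             # k(x, x') = <x, x'>^2
w = sum(a * phi(x) for a, x in zip(alpha, X))  # w = sum_i alpha_i Phi(x_i)
print(np.isclose(w @ w, alpha @ K @ alpha))    # True
```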