Introduction to Machine Learning

Transcript of Introduction to Machine Learning

  • 1. Introduction to Machine Learning. Bernhard Schölkopf, Empirical Inference Department, Max Planck Institute for Intelligent Systems, Tübingen, Germany

2. Empirical Inference
  • Drawing conclusions from empirical data (observations, measurements)
  • Example 1: scientific inference
[Figure: data points in the (x, y) plane, with the fits $y = a \cdot x$ and $y = \sum_i a_i k(x, x_i) + b$]
Leibniz, Weyl, Chaitin

3. Empirical Inference
  • Drawing conclusions from empirical data (observations, measurements)
  • Example 1: scientific inference
"If your experiment needs statistics [inference], you ought to have done a better experiment." (Rutherford)

4. Empirical Inference, II
  • Example 2: perception
"The brain is nothing but a statistical decision organ" (H. Barlow)

5. Hard Inference Problems
Sonnenburg, Rätsch, Schäfer, Schölkopf, 2006, Journal of Machine Learning Research
Task: classify human DNA sequence locations into {acceptor splice site, decoy} using 15 million sequences of length 141 and a multiple-kernel support vector machine.
PRC = precision-recall curve; precision is the fraction of correct positive predictions among all positively predicted cases.
  • High dimensionality: consider many factors simultaneously to find the regularity
  • Complex regularities: nonlinear, nonstationary, etc.
  • Little prior knowledge: e.g., no mechanistic models for the data
  • Need large data sets: processing requires computers and automatic inference methods

6. Hard Inference Problems, II
  • We can solve scientific inference problems that humans can't solve
  • Even if it's just because of data set size / dimensionality, this is a quantum leap

7. Generalization (thanks to O. Bousquet)
Observe 1, 2, 4, 7, ... What's next? (Differences: +1, +2, +3.)
  • 1, 2, 4, 7, 11, 16, ...: $a_{n+1} = a_n + n$ (lazy caterer's sequence)
  • 1, 2, 4, 7, 12, 20, ...: $a_{n+2} = a_{n+1} + a_n + 1$
  • 1, 2, 4, 7, 13, 24, ...: Tribonacci sequence
  • 1, 2, 4, 7, 14, 28: set of divisors of 28
  • 1, 2, 4, 7, 1, 1, 5, ...: decimal expansions of $\pi = 3.14159\ldots$ and $e = 2.718\ldots$ interleaved
  • The On-Line Encyclopedia of Integer Sequences: >600 hits

8. Generalization, II
Question: which continuation is correct (generalizes)?
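The ambiguity can be made concrete with a short sketch. The Python below is an illustration only, not part of the original slides: the generator names, `RULES`, `OBSERVED`, and `continuations` are hypothetical helpers. It implements four of the rules listed on slide 7 and checks that each one reproduces the observed prefix 1, 2, 4, 7 while predicting a different continuation.

```python
from itertools import islice


def lazy_caterer():
    # a_{n+1} = a_n + n, starting from a_1 = 1
    a, n = 1, 1
    while True:
        yield a
        a, n = a + n, n + 1


def fibonacci_plus_one():
    # a_{n+2} = a_{n+1} + a_n + 1, starting from 1, 2
    a, b = 1, 2
    while True:
        yield a
        a, b = b, a + b + 1


def tribonacci():
    # a_{n+3} = a_{n+2} + a_{n+1} + a_n, starting from 1, 2, 4
    a, b, c = 1, 2, 4
    while True:
        yield a
        a, b, c = b, c, a + b + c


def divisors_of_28():
    # Finite sequence: all positive divisors of 28 in increasing order
    yield from (d for d in range(1, 29) if 28 % d == 0)


# Hypothetical registry of candidate "laws" (names chosen for this sketch)
RULES = {
    "lazy caterer": lazy_caterer,
    "a(n+2) = a(n+1) + a(n) + 1": fibonacci_plus_one,
    "Tribonacci": tribonacci,
    "divisors of 28": divisors_of_28,
}

OBSERVED = [1, 2, 4, 7]


def continuations(rules, observed, extra=2):
    """Return the next `extra` terms of every rule that matches `observed`."""
    result = {}
    for name, make_sequence in rules.items():
        terms = list(islice(make_sequence(), len(observed) + extra))
        if terms[:len(observed)] == observed:
            result[name] = terms[len(observed):]
    return result


if __name__ == "__main__":
    for name, nxt in continuations(RULES, OBSERVED).items():
        print(f"{name}: {OBSERVED} -> {nxt}")
```

All four rules agree on the observed data and disagree on every later term, so the observations alone cannot single one of them out.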
Answer: there's no way to tell (induction problem)
Question of statistical learning theory: how to come up with a law that is (probably) correct (demarcation problem) (more accurately: a law that is probably as correct on the test data as it is on the training data)

9. 2-class classification
Learn $f: X \to \{\pm 1\}$ based on m observations $(x_1, y_1), \ldots, (x_m, y_m)$ generated from some $P(x, y)$.
Goal: minimize expected error (risk) (V. Vapnik)
Problem: P is unknown.
Induction principle: minimize training error (empirical risk) over some class of functions.
Q: is this consistent?

10. The law of large numbers
For all ... and ...
Does this imply consistency of empirical risk minimization (optimality in the limit)?
No: need a uniform law of large numbers:
For all ...

11. Consistency and uniform convergence

12. -> LaTeX
Bernhard Schölkopf, Empirical Inference Department, Tübingen, 03 October 2011

13. Support Vector Machines
[Figure: points of class 1 (+) and class 2 (-) mapped by $\Phi$ into a feature space, with kernel $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$]
  • sparse expansion of solution in terms of SVs (Boser, Guyon, Vapnik 1992): representer theorem (Kimeldorf & Wahba 1971, Schölkopf et al. 2000)
  • unique solution found by convex QP

15. Applications in Computational Geometry / Graphics
Steinke, Walder, Blanz et al., Eurographics '05, '06, '08, ICML '05, '08, NIPS '07

16. Max-Planck-Institut für biologische Kybernetik, Bernhard Schölkopf, Tübingen
FIFA World Cup, Germany vs. England, June 27, 2010

17. Kernel Quiz

18. Kernel Methods
Bernhard Schölkopf, Max Planck Institute for Intelligent Systems
B. Schölkopf, MLSS France 2011

19. Statistical Learning Theory
  1. started by Vapnik and Chervonenkis in the Sixties
  2. model: we observe data generated by an unknown stochastic regularity
  3. learning = extraction of the regularity from the data
  4. the analysis of the learning problem leads to notions of capacity of the function classes that a learning machine can implement
  5. support vector machines use a particular type of function class: classifiers with large margins in a feature space induced by a kernel [47, 48]

20. Example: Regression Estimation
[Figure: data points in the (x, y) plane]
  • Data: input-output pairs $(x_i, y_i) \in \mathbb{R} \times \mathbb{R}$
  • Regularity: $(x_1, y_1), \ldots, (x_m, y_m)$ drawn from $P(x, y)$
  • Learning: choose a function $f: \mathbb{R} \to \mathbb{R}$ such that the error, averaged over P, is minimized
  • Problem: P is unknown, so the average cannot be computed; need an induction principle

21. Pattern Recognition
Learn $f: X \to \{\pm 1\}$ from examples $(x_1, y_1), \ldots, (x_m, y_m) \in X \times \{\pm 1\}$, generated i.i.d. from $P(x, y)$, such that the expected misclassification error on a test set, also drawn from $P(x, y)$,
$$R[f] = \int \frac{1}{2} |f(x) - y| \, dP(x, y),$$
is minimal (Risk Minimization (RM)).
Problem: P is unknown, so we need an induction principle.
Empirical risk minimization (ERM): replace the average over $P(x, y)$ by an average over the training sample, i.e. minimize the training error
$$R_{\mathrm{emp}}[f] = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} |f(x_i) - y_i|.$$

22. Convergence of Means to Expectations
Law of large numbers: $R_{\mathrm{emp}}[f] \to R[f]$ as $m \to \infty$.
Does this imply that empirical risk minimization will give us the optimal result in the limit of infinite sample size (consistency of empirical risk minimization)?
No. Need a uniform version of the law of large numbers. Uniform over all functions that the learning machine can implement.

23. Consistency and Uniform Convergence
[Figure: risk $R[f]$ and empirical risk $R_{\mathrm{emp}}[f]$ plotted over the function class, with $f^{\mathrm{opt}}$ and $f^m$ marked on the horizontal axis]

24. The Importance of the Set of Functions
What about allowing all functions from $X$ to $\{\pm 1\}$?
Training set $(x_1, y_1), \ldots, (x_m, y_m) \in X \times \{\pm 1\}$
Test patterns $\bar{x}_1, \ldots, \bar{x}_{\bar{m}} \in X$, such that $\{\bar{x}_1, \ldots, \bar{x}_{\bar{m}}\} \cap \{x_1, \ldots, x_m\} = \{\}$.
For any $f$ there exists $f^*$ s.t.:
  1. $f^*(x_i) = f(x_i)$ for all $i$
  2. $f^*(\bar{x}_j) \neq f(\bar{x}_j)$ for all $j$.
Based on the training set alone, there is no means of choosing which one is better. On the test set, however, they give opposite results. There is no free lunch [24, 56].
Hence a restriction must be placed on the functions that we allow.

25. Restricting the Class of Functions
Two views:
  1. Statistical Learning (VC) Theory: take into account the capacity of the class of functions that the learning machine can implement
  2. The Bayesian Way: place prior distributions $P(f)$ over the class of functions

26. Detailed Analysis
  • loss $\xi_i := \frac{1}{2} |f(x_i) - y_i|$ in $\{0, 1\}$
  • the $\xi_i$ are independent Bernoulli trials
  • empirical mean $\frac{1}{m} \sum_{i=1}^{m} \xi_i$ (by def: equals $R_{\mathrm{emp}}[f]$)
  • expected value $E[\xi]$ (equals $R[f]$)

27. Chernoff's Bound
$$P\left\{ \left| \frac{1}{m} \sum_{i=1}^{m} \xi_i - E[\xi] \right| \geq \epsilon \right\} \leq 2 \exp(-2 m \epsilon^2)$$
  • here, P refers to the probability of getting a sample $\xi_1, \ldots, \xi_m$ with the property $\left| \frac{1}{m} \sum_{i=1}^{m} \xi_i - E[\xi] \right| \geq \epsilon$ (P is a product measure)
Useful corollary: Given a 2m-sample of Bernoulli trials, we have
$$P\left\{ \left| \frac{1}{m} \sum_{i=1}^{m} \xi_i - \frac{1}{m} \sum_{i=m+1}^{2m} \xi_i \right| \geq \epsilon \right\} \leq 4 \exp\left( -\frac{m \epsilon^2}{2} \right).$$

28. Chernoff's Bound, II
Translate this back into machine learning terminology: the probability of obtaining an m-sample where the training error and test error differ by more than $\epsilon > 0$ is bounded by
$$P\left\{ \left| R_{\mathrm{emp}}[f] - R[f] \right| \geq \epsilon \right\} \leq 2 \exp(-2 m \epsilon^2).$$
  • refers to one fixed f
  • not allowed to look at the data before choosing f, hence not suitable as a bound on the test error of a learning algorithm using empirical risk minimization

29. Uniform Convergence (Vapnik & Chervonenkis)
Necessary and sufficient conditions for nontrivial consistency of empirical risk minimization (ERM):
One-sided convergence, uniformly over all functions that can be implemented by the learning machine:
$$\lim_{m \to \infty} P\left\{ \sup_{f \in F} (R[f] - R_{\mathrm{emp}}[f]) > \epsilon \right\} = 0$$
for all $\epsilon > 0$.
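The gap between the fixed-f statement of slide 28 and the supremum in the condition above can be illustrated with a small Monte Carlo sketch. This is an illustration under a strong simplifying assumption, not a method from the slides: every function in a finite class is given the same true risk, and the losses of different functions are modeled as independent Bernoulli draws, whereas real function classes evaluated on one shared sample are correlated. The names `chernoff_bound` and `deviation_probabilities` are hypothetical.

```python
import math
import random


def chernoff_bound(m, eps):
    """Bound 2 * exp(-2 * m * eps^2) for one fixed function (slide 28)."""
    return 2.0 * math.exp(-2.0 * m * eps ** 2)


def deviation_probabilities(m, n_functions, eps, trials=1000, risk=0.5, seed=0):
    """Estimate P{R - R_emp > eps} for one fixed f and for the sup over a class.

    Assumption (for illustration only): all functions have true risk `risk`,
    and their 0/1 losses are independent Bernoulli(risk) variables.
    """
    rng = random.Random(seed)
    single_hits = 0
    uniform_hits = 0
    for _ in range(trials):
        # Empirical risk of each function on a fresh m-sample
        emp_risks = [
            sum(rng.random() < risk for _ in range(m)) / m
            for _ in range(n_functions)
        ]
        # One function fixed in advance (index 0)
        single_hits += (risk - emp_risks[0]) > eps
        # sup_f (R[f] - R_emp[f]) equals risk - min_f R_emp[f] here
        uniform_hits += (risk - min(emp_risks)) > eps
    return single_hits / trials, uniform_hits / trials


if __name__ == "__main__":
    m, n, eps = 50, 20, 0.1
    single, uniform = deviation_probabilities(m, n, eps)
    print(f"fixed f:         {single:.3f}")
    print(f"sup over {n} f's: {uniform:.3f}")
    print(f"Chernoff bound:  {chernoff_bound(m, eps):.3f}")
```

In this setting the estimate for the fixed function stays within the Chernoff bound, while the estimate for the supremum is never smaller than it and is typically much larger. This is why a bound for one fixed f does not carry over to a function selected after looking at the data.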
  • Note that the uniform convergence condition takes into account the whole set of functions that can be implemented by the learning machine
  • this is hard to check for a learning machine
Are there properties of learning machines ($\equiv$ sets of functions) which ensure uniform convergence of risk?

30. How to Prove a VC Bound
Take a closer look at $P\{\sup_{f \in F} (R[f] - R_{\mathrm{emp}}[f]) > \epsilon\}$.
Plan:
  • if the function class F contains only one function, then Chernoff's bound suffices:
$$P\left\{ \sup_{f \in F} (R[f] - R_{\mathrm{emp}}[f]) > \epsilon \right\} \leq 2 \exp(-2 m \epsilon^2).$$
  • if there are finitely many functions, we use the union bound
  • even if there are infinitely many, then on any finite sample there are effectively only finitely many (use symmetrization and capacity concepts)

31. The Case of Two Functions
Suppose $F = \{f_1, f_2\}$. Rewrite
$$P\left\{ \sup_{f \in F} (R[f] - R_{\mathrm{emp}}[f]) > \epsilon \right\} = P(C_\epsilon^1 \cup C_\epsilon^2),$$
where
$$C_\epsilon^i := \{ (x_1, y_1), \ldots, (x_m, y_m) \mid (R[f_i] - R_{\mathrm{emp}}[f_i]) > \epsilon \}$$
denotes the event that the risks of $f_i$ differ by more than $\epsilon$.
The RHS equals
$$P(C_\epsilon^1 \cup C_\epsilon^2) = P(C_\epsilon^1) + P(C_\epsilon^2) - P(C_\epsilon^1 \cap C_\epsilon^2) \leq P(C_\epsilon^1) + P(C_\epsilon^2).$$
Hence by Chernoff's bound
$$P\left\{ \sup_{f \in F} (R[f] - R_{\mathrm{emp}}[f]) > \epsilon \right\} \leq P(C_\epsilon^1) + P(C_\epsilon^2) \leq 2 \cdot 2 \exp(-2 m \epsilon^2).$$

32. The Union Bound
Similarly, if $F = \{f_1, \ldots, f_n\}$, we have
$$P\left\{ \sup_{f \in F} (R[f] - R_{\mathrm{emp}}[f]) > \epsilon \right\} = P(C_\epsilon^1 \cup \cdots \cup C_\epsilon^n),$$
and
$$P(C_\epsilon^1 \cup \cdots \cup C_\epsilon^n) \leq \sum_{i=1}^{n} P(C_\epsilon^i).$$
Use Chernoff for each summand, to get an extra factor n in the bound.
Note: this becomes an equality if and only if all the events $C_\epsilon^i$ involved are disjoint.

33. Infinite Function Classes
  • Note: empirical risk only refers to m points. On these points, the functions of F can take at most $2^m$ values
  • for $R_{\mathrm{emp}}$, the function class thus looks finite
  • how about R?
  • need to use a trick

34. Symmetrization
Lemma 1 (Vapnik & Chervonenkis (e.g., [46, 12]))
For $m \epsilon^2 \geq 2$ we have
$$P\left\{ \sup_{f \in F} (R[f] - R_{\mathrm{emp}}[f]) > \epsilon \right\} \leq 2 P\left\{ \sup_{f \in F} (R_{\mathrm{emp}}[f] - R'_{\mathrm{emp}}[f]) > \epsilon / 2 \right\}$$
Here, the first P refers to the distribution of iid samples of size m, while the second one refers to iid samples of size 2m. In the latter case, $R_{\mathrm{emp}}$ measures t