Support Vector Machines (aarti/Class/10601/slides/svm_11_22_2011.pdf · 2011-11-22)

Support Vector Machines. Aarti Singh. Machine Learning 10-601, Nov 22, 2011.

Transcript of aarti/Class/10601/slides/svm_11_22_2011.pdf (2011-11-22)

Page 1:

Support  Vector  Machines  

Aarti Singh

Machine Learning 10-601, Nov 22, 2011


Page 2:

SVMs  reminder  


\[
\min_{w,b,\xi}\;\; w \cdot w \;+\; C \sum_j \xi_j
\quad \text{s.t.} \quad (w \cdot x_j + b)\, y_j \ge 1 - \xi_j \;\;\forall j, \qquad \xi_j \ge 0 \;\;\forall j
\]

Here w·w is the regularization term and C Σ_j ξ_j is the hinge-loss penalty: the soft-margin approach.

Essentially a constrained optimization problem!
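As a quick illustration of that point, the soft-margin primal above can be handed directly to a generic QP solver. A minimal sketch, assuming cvxpy is available and using made-up toy data (neither the solver nor the data comes from the slides):

```python
# Minimal sketch of the soft-margin primal QP (illustrative only, not from the slides).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(+2, 1, (20, 2))])  # toy 2-D data
y = np.hstack([-np.ones(20), np.ones(20)])
n, d = X.shape
C = 1.0

w = cp.Variable(d)
b = cp.Variable()
xi = cp.Variable(n)

# min  w.w + C * sum(xi)   s.t.  (w.x_j + b) y_j >= 1 - xi_j,  xi_j >= 0
objective = cp.Minimize(cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
```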

Page 3:

Constrained Optimization

Primal problem:

\[
f^* \;=\; \min_{x}\; x^2 \;\;\text{s.t.}\;\; x \ge b
\;\;\equiv\;\;
\min_{x}\; \max_{\alpha \ge 0}\; \underbrace{x^2 - \alpha(x - b)}_{L(x,\alpha)}
\]

α is the Lagrange multiplier; L(x, α) is the Lagrangian.

Dual problem:

\[
d^* \;=\; \max_{\alpha \ge 0}\; \min_{x}\; x^2 - \alpha(x - b)
\]

In general, d* ≤ f*. When is d* = f*?
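Filling in the inner minimization step for this toy problem (a standard calculation, not present in the extracted text):

```latex
% Inner minimization of the Lagrangian over x:
%   dL/dx = 2x - alpha = 0   =>   x = alpha/2
\[
g(\alpha) \;=\; \min_x \big[x^2 - \alpha(x - b)\big]
        \;=\; \alpha b - \tfrac{\alpha^2}{4},
\qquad
d^* \;=\; \max_{\alpha \ge 0} g(\alpha) \;\le\; f^* \quad\text{(weak duality).}
\]
```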

Page 4:

Constrained Optimization


Primal problem: f* = min_x x² s.t. x ≥ b, attained at x* (the primal solution).
Dual problem: d* = max_{α ≥ 0} min_x L(x, α), attained at α* (the dual solution).

d* = f* if f is convex and x*, α* satisfy the KKT conditions:

\[
\nabla L(x^*, \alpha^*) = 0 \qquad \text{(zero gradient)}
\]
\[
x^* \ge b \qquad \text{(primal feasibility)}
\]
\[
\alpha^* \ge 0 \qquad \text{(dual feasibility)}
\]
\[
\alpha^* (x^* - b) = 0 \qquad \text{(complementary slackness)}
\]

⟹ If α* > 0, then x* = b, i.e., the constraint is effective.

Page 5:

Constrained Optimization


α* = 0: constraint is ineffective.

α* = 1/2 > 0: constraint is effective.

Complementary slackness: α*(x* − 1) = 0.

Page 6:

Dual  SVM  –  linearly  separable  case  

•  Primal  problem:  

   •  Lagrangian:      


w – weights on features

α – weights on training pts

αj > 0  ⟹  constraint is effective  ⟹  (w·xj + b) yj = 1  ⟹  point j is a support vector!
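The primal and Lagrangian formulas referred to on this slide did not survive extraction; the standard hard-margin forms they point to (written with the conventional 1/2, which only rescales the αj relative to the w·w objective used earlier) are:

```latex
\[
\min_{w,b}\;\tfrac{1}{2}\, w \cdot w
\qquad \text{s.t.} \qquad (w \cdot x_j + b)\, y_j \ge 1 \;\;\forall j
\]
\[
L(w, b, \alpha) \;=\; \tfrac{1}{2}\, w \cdot w \;-\; \sum_j \alpha_j \big[(w \cdot x_j + b)\, y_j - 1\big],
\qquad \alpha_j \ge 0 .
\]
```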

Page 7:

Dual  SVM  –  linearly  separable  case  

• Dual problem derivation:

31  

If we can solve for the αs (dual problem), then we have a solution for w (primal problem).
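Concretely, setting the Lagrangian's gradients to zero gives the step this refers to (standard derivation, not present in the extracted text):

```latex
\[
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_j \alpha_j\, y_j\, x_j,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_j \alpha_j\, y_j = 0 .
\]
```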

Page 8:

Dual  SVM  –  linearly  separable  case  

• Dual problem derivation:

•  Dual  problem:      


Page 9:

Dual  SVM  –  linearly  separable  case  

The dual problem is also a QP; its solution gives the αjs.

Use the support vectors to compute b:  w·xk + b = yk, i.e., (w·xk + b) yk = 1 for any support vector k.

•  Dual  problem:      
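The dual QP itself (standard form, reconstructed since the slide's equation did not survive extraction):

```latex
\[
\max_{\alpha}\;\sum_j \alpha_j \;-\; \tfrac{1}{2}\sum_{j,k} \alpha_j \alpha_k\, y_j y_k\, (x_j \cdot x_k)
\qquad \text{s.t.} \quad \alpha_j \ge 0 \;\;\forall j, \quad \sum_j \alpha_j y_j = 0 .
\]
```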

Page 10:

Dual SVM Interpretation: Sparsity


w.x  +  b  =  0  

Only a few αjs can be non-zero: those where the constraint is tight, i.e., (w·xj + b) yj = 1.

Support vectors: the training points j whose αjs are non-zero.

[Figure: points on the margin are labeled αj > 0 (the support vectors); all other points are labeled αj = 0.]

Page 11:

So  why  solve  the  dual  SVM?  

• There are quadratic programming algorithms that can solve the dual faster than the primal, especially in high dimensions (d >> n).

Recall: (w1, w2, …, wd, b) are the d+1 primal variables; α1, α2, ..., αn are the n dual variables.

 


Page 12:

Dual SVM – non-separable case


•  Primal  problem:  

•  Dual  problem:      

Lagrange Multipliers

Page 13:

Dual SVM – non-separable case


The dual problem is also a QP; its solution gives the αjs.

The upper bound αi ≤ C comes from the slack-variable terms. Intuition: earlier (separable case), if a constraint was violated, αi → ∞; now, if a constraint is violated, αi is capped at C.
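For reference, the non-separable (soft-margin) dual is the same QP as before with the αj additionally capped at C (standard form; the slide's own equation did not survive extraction):

```latex
\[
\max_{\alpha}\;\sum_j \alpha_j \;-\; \tfrac{1}{2}\sum_{j,k} \alpha_j \alpha_k\, y_j y_k\, (x_j \cdot x_k)
\qquad \text{s.t.} \quad 0 \le \alpha_j \le C \;\;\forall j, \quad \sum_j \alpha_j y_j = 0 .
\]
```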

Page 14:

So  why  solve  the  dual  SVM?  

• There are quadratic programming algorithms that can solve the dual faster than the primal, especially in high dimensions (d >> n).

Recall: (w1, w2, …, wd, b) are the d+1 primal variables; α1, α2, ..., αn are the n dual variables.

 •  But,  more  importantly,  the  “kernel  trick”!!!  


Page 15:


What  if  data  is  not  linearly  separable?  

Using non-linear features to get linear separation.

[Figure: data in (x1, x2) space, re-expressed using the features radius r = √(x1² + x2²) and angle θ.]

Page 16:


What  if  data  is  not  linearly  separable?  

Using non-linear features to get linear separation.

[Figure: 1-D data x with labels y becomes linearly separable after adding the feature x².]

Page 17:

What  if  data  is  not  linearly  separable?  


Use  features  of  features    of  features  of  features….  

Feature  space  becomes  really  large  very  quickly!  

Can  we  get  by  without  having  to  write  out  the  features  explicitly?  

Φ(x) = (x1², x2², x1x2, …, exp(x1))

Page 18:

Higher  Order  Polynomials  


d = number of input features, m = degree of the polynomial.

Number of monomial terms:

\[
\binom{m+d-1}{m} \;=\; \frac{(m+d-1)!}{m!\,(d-1)!},
\]

which grows like d^m for fixed m. This grows fast! For m = 6, d = 100: about 1.6 billion terms.
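The 1.6 billion figure is easy to check (Python 3.8+ assumed for math.comb; this snippet is illustrative, not from the slides):

```python
# Count of degree-m monomials in d variables: C(m + d - 1, m)
from math import comb

m, d = 6, 100
print(comb(m + d - 1, m))  # 1609344100, i.e. about 1.6 billion terms
```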

Page 19:

Dual formulation only depends on dot-products, not on w!


Φ(x) – high-dimensional feature space, but we never need it explicitly as long as we can compute the dot product fast using some kernel K.

Page 20:

Dot  Product  of  Polynomials  


For the degree-m polynomial feature map (shown here for 2-D inputs x = (x1, x2), z = (z1, z2)):

m = 1:  Φ(x) = (x1, x2), so Φ(x)·Φ(z) = x1z1 + x2z2 = x·z

m = 2:  Φ(x) = (x1², √2 x1x2, x2²), so Φ(x)·Φ(z) = x1²z1² + 2 x1x2 z1z2 + x2²z2² = (x·z)²

General m:  Φ(x)·Φ(z) = (x·z)^m = K(x, z)

Don't store the high-dimensional features; only evaluate dot-products with kernels.
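A quick numeric check of the m = 2 case (the explicit feature map with the √2 cross term is the standard one; NumPy and the example vectors are assumptions, not from the slides):

```python
# Verify that the degree-2 feature map's dot product equals (x.z)^2
import numpy as np

def phi2(v):
    # explicit degree-2 features of a 2-D input, with sqrt(2) on the cross term
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 3.0])
z = np.array([2.0, -1.0])

print(phi2(x) @ phi2(z))  # explicit high-dimensional dot product -> 1.0
print((x @ z) ** 2)       # kernel evaluation K(x, z) = (x.z)^2     -> 1.0
```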

Page 21:

Finally:  The  Kernel  Trick!  


•  Never  represent  features  explicitly  –  Compute  dot  products  in  closed  form  

• Constant-time high-dimensional dot-products for many classes of features

Page 22:

Common  Kernels  


•  Polynomials  of  degree  d  

•  Polynomials  of  degree  up  to  d  

•  Gaussian/Radial  kernels  (polynomials  of  all  orders  –  recall  series  expansion)  

•  Sigmoid  
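The kernel formulas behind these bullets did not survive extraction; the standard forms are (σ, κ, θ are kernel parameters):

```latex
\[
K(x,z) = (x \cdot z)^d, \qquad
K(x,z) = (1 + x \cdot z)^d, \qquad
K(x,z) = \exp\!\Big(\!-\frac{\|x - z\|^2}{2\sigma^2}\Big), \qquad
K(x,z) = \tanh(\kappa\, x \cdot z + \theta).
\]
```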

Page 23:

Which Functions Can Be Kernels?

• Not all functions
• For some definitions of K(x1, x2) there is no corresponding projection Φ(x)

• Nice theory on this, including how to construct new kernels from existing ones

• Initially kernels were defined over data points in Euclidean space, but more recently over strings, over trees, over graphs, …

Page 24:

Overfitting


• Huge feature space with kernels, so what about overfitting?

– Maximizing the margin leads to a sparse set of support vectors (the decision boundary is not too complicated)

– Some interesting theory says that SVMs search for a simple hypothesis with a large margin

– Often robust to overfitting

Page 25:

What about classification time?


• For a new input x, if we need to represent Φ(x) explicitly, we are in trouble!
• Recall the classifier: sign(w·Φ(x) + b)
• Using kernels we are cool!
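Why kernels save us at classification time: substituting w = Σ_j αj yj Φ(xj) (a standard step, reconstructed here), the classifier only needs kernel evaluations against the support vectors:

```latex
\[
\operatorname{sign}\big(w \cdot \Phi(x) + b\big)
\;=\;
\operatorname{sign}\Big(\sum_{j \in \mathrm{SV}} \alpha_j\, y_j\, K(x_j, x) + b\Big).
\]
```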

Page 26:

SVMs  with  Kernels  


• Choose a set of features and a kernel function
• Solve the dual problem to obtain the support vectors and their αi
• At classification time, compute:

Classify as the sign of the computed value.
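A hedged end-to-end sketch of this recipe using scikit-learn's SVC (the library, toy data, and parameter values are illustrative assumptions, not from the slides):

```python
# Kernel SVM: pick a kernel, solve the dual, classify new points using only
# the support vectors. scikit-learn and the toy data are assumptions here.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # non-linearly separable labels

clf = SVC(kernel="rbf", C=1.0, gamma="scale")          # Gaussian/RBF kernel
clf.fit(X, y)

print(len(clf.support_))              # number of support vectors (non-zero alphas)
print(clf.decision_function(X[:3]))   # sum_j alpha_j y_j K(x_j, x) + b
print(clf.predict(X[:3]))             # sign of the decision function
```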

Page 27:

SVM  Decision  Surface  using  Gaussian  Kernel  


Bishop  Fig  7.2  

Circled points are the support vectors: training examples with non-zero αj. Points are plotted in the original 2-D space. Contour lines show constant values of the decision function.

Page 28:

SVM Soft Margin Decision Surface using Gaussian Kernel


Bishop  Fig  7.4  

Circled points are the support vectors: training examples with non-zero αj. Points are plotted in the original 2-D space. Contour lines show constant values of the decision function.


Page 29:


SVMs vs. Kernel Regression

Differences:
• SVMs:
  – Learn weights αi (and bandwidth)
  – Often sparse solution
• Kernel regression:
  – Fixed "weights", learn bandwidth
  – Solution may not be sparse
  – Much simpler to implement
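The two prediction formulas being contrasted (reconstructed, since the slide's own equations did not survive extraction) are the kernel SVM classifier and the Nadaraya-Watson kernel regression estimate:

```latex
\[
f_{\mathrm{SVM}}(x) \;=\; \operatorname{sign}\Big(\sum_j \alpha_j\, y_j\, K(x_j, x) + b\Big),
\qquad
f_{\mathrm{KR}}(x) \;=\; \frac{\sum_j y_j\, K(x_j, x)}{\sum_j K(x_j, x)} .
\]
```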

Page 30:

SVMs vs. Logistic Regression


                                          SVMs          Logistic Regression
Loss function                             Hinge loss    Log-loss
High-dimensional features with kernels    Yes!          Yes!
Solution sparse                           Often yes!    Almost always no!
Semantics of output                       "Margin"      Real probabilities

Page 31:

Kernels in Logistic Regression


•  Define  weights  in  terms  of  features:  

•  Derive  simple  gradient  descent  rule  on  αi  
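Spelled out (standard kernelized form; the slide's own equations did not survive extraction): writing the weights as a combination of training-point features gives

```latex
\[
w \;=\; \sum_i \alpha_i\, \Phi(x_i)
\quad\Longrightarrow\quad
P(y = 1 \mid x) \;=\; \frac{1}{1 + \exp\!\big(-\sum_i \alpha_i K(x_i, x) - b\big)},
\]
```

and gradient descent is then run on the n coefficients αi instead of on w.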

Page 32:

SVMs vs. Logistic Regression


                                          SVMs          Logistic Regression
Loss function                             Hinge loss    Log-loss
High-dimensional features with kernels    Yes!          Yes!
Solution sparse                           Often yes!    Almost always no!
Semantics of output                       "Margin"      Real probabilities

Page 33:

Kernels: Key Points

• Many learning tasks are framed as optimization problems

• Primal and dual formulations of optimization problems

•  Dual  version  framed  in  terms  of  dot  products  between  x’s  

• Kernel functions K(x, y) allow calculating dot products <Φ(x), Φ(y)> without bothering to project x into Φ(x)

•  Leads  to  major  efficiencies,  and  ability  to  use  very  high  dimensional  (virtual)  feature  spaces  

Page 34:

What  you  need  to  know…  


• Dual SVM formulation – how it's derived

• The kernel trick
• Common kernels
• Differences between SVMs and kernel regression
• Differences between SVMs and logistic regression
• Kernelized logistic regression

•  Also,  Kernel  PCA,  Kernel  ICA,  ...