
Support Vector Machines

Aarti Singh

Machine Learning 10-601, Nov 22, 2011


SVMs reminder

Soft margin approach:

$$\min_{w,b,\xi}\;\; w \cdot w \;+\; C \sum_j \xi_j$$

$$\text{s.t.}\quad (w \cdot x_j + b)\, y_j \;\ge\; 1 - \xi_j \;\;\forall j, \qquad \xi_j \ge 0 \;\;\forall j$$

The $C \sum_j \xi_j$ term is the hinge-loss penalty on margin violations; the $w \cdot w$ term is the regularization. Essentially a constrained optimization problem!
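As a concrete illustration, here is a minimal sketch of this soft-margin primal as a quadratic program; the cvxpy package and the tiny synthetic dataset are assumptions for the demo, not part of the slides:

```python
import numpy as np
import cvxpy as cp

# Tiny synthetic 2-class dataset: X is (n, d), y has labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (20, 2)), rng.normal(2.0, 1.0, (20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)
C = 1.0

n, d = X.shape
w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(n)

objective = cp.Minimize(cp.sum_squares(w) + C * cp.sum(xi))   # w.w + C * sum_j xi_j
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi,           # (w.x_j + b) y_j >= 1 - xi_j
               xi >= 0]                                       # xi_j >= 0
cp.Problem(objective, constraints).solve()
print(w.value, b.value)
```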

Constrained Optimization

Primal problem:

$$f^* \;=\; \min_x \max_{\alpha \ge 0}\; \underbrace{x^2 - \alpha(x - b)}_{L(x,\,\alpha)}$$

Here $\alpha \ge 0$ is the Lagrange multiplier and $L(x, \alpha)$ is the Lagrangian; this min-max form is equivalent ($\equiv$) to the constrained problem $\min_x x^2$ s.t. $x \ge b$.

Dual problem:

$$d^* \;=\; \max_{\alpha \ge 0} \min_x\; L(x, \alpha) \;=\; \max_{\alpha \ge 0} \min_x\; x^2 - \alpha(x - b)$$

In general, d* ≤ f* (weak duality). When is d* = f*?
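A worked sketch of the dual for this toy problem (standard Lagrangian duality):

$$g(\alpha) \;=\; \min_x \big[x^2 - \alpha(x - b)\big] \;=\; \alpha b - \frac{\alpha^2}{4} \qquad \text{(inner minimum at } x = \alpha/2\text{)}$$

so the dual problem is $d^* = \max_{\alpha \ge 0}\; \alpha b - \alpha^2/4$. Since $x^2$ is convex and the constraint $x \ge b$ is affine, strong duality holds for this problem and $d^* = f^*$.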

Constrained Optimization

Primal problem: $f^* = \min_x \max_{\alpha \ge 0} L(x, \alpha)$, with $x^*$ = primal solution.

Dual problem: $d^* = \max_{\alpha \ge 0} \min_x L(x, \alpha)$, with $\alpha^*$ = dual solution.

d* = f* if f is convex and x*, α* satisfy the KKT conditions:

$$\nabla L(x^*, \alpha^*) = 0 \qquad \text{Zero gradient}$$
$$x^* \ge b \qquad \text{Primal feasibility}$$
$$\alpha^* \ge 0 \qquad \text{Dual feasibility}$$
$$\alpha^*(x^* - b) = 0 \qquad \text{Complementary slackness}$$

⇒ If α* > 0, then x* = b, i.e. the constraint is effective.

Constrained Optimization

[Figure: the toy problem in two cases. When the constraint is ineffective, α* = 0; when the constraint is effective, α* = 1/2 > 0. Complementary slackness: α*(x* − 1) = 0.]

Dual SVM – linearly separable case

• Primal problem:
• Lagrangian:

w – weights on features
α – weights on training points

αj > 0 ⇒ the constraint is tight, (w·xj + b) yj = 1 ⇒ point j is a support vector!
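Written out, the standard hard-margin primal and its Lagrangian behind those bullets (using the conventional ½ w·w scaling, which is equivalent to minimizing w·w):

$$\min_{w,b}\;\; \tfrac{1}{2}\, w \cdot w \quad \text{s.t.}\quad (w \cdot x_j + b)\, y_j \;\ge\; 1 \;\;\forall j$$

$$L(w, b, \alpha) \;=\; \tfrac{1}{2}\, w \cdot w \;-\; \sum_j \alpha_j \big[(w \cdot x_j + b)\, y_j - 1\big], \qquad \alpha_j \ge 0$$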

Dual SVM – linearly separable case

• Dual problem derivation:

If we can solve for the αs (dual problem), then we have a solution for w (primal problem).

Dual SVM – linearly separable case

• Dual problem derivation:
• Dual problem:
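A sketch of the derivation those bullets refer to (standard hard-margin dual, continuing from the Lagrangian above):

$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_j \alpha_j y_j x_j, \qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_j \alpha_j y_j = 0$$

Substituting back into $L(w, b, \alpha)$ eliminates $w$ and $b$ and gives the dual problem:

$$\max_{\alpha \ge 0}\;\; \sum_j \alpha_j \;-\; \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k\, y_j y_k\, (x_j \cdot x_k) \quad \text{s.t.}\quad \sum_j \alpha_j y_j = 0$$

Note the data enter only through the dot products $x_j \cdot x_k$.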

Dual SVM – linearly separable case

• Dual problem:

The dual problem is also a QP; its solution gives the αjs.

Use the support vectors to compute b: for any support vector xk, (w·xk + b) yk = 1, and since yk = ±1 this gives w·xk + b = yk, i.e. b = yk − w·xk.
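A minimal numpy sketch of this step, assuming the dual QP has already been solved and `alphas` holds the αjs (with the scaling w = Σj αj yj xj used in the derivation above); the tolerance is an illustrative choice:

```python
import numpy as np

def primal_from_dual(alphas, X, y, tol=1e-6):
    """Recover (w, b) from the dual solution. X is (n, d), y has labels in {-1, +1}."""
    w = (alphas * y) @ X                  # w = sum_j alpha_j y_j x_j
    sv = alphas > tol                     # support vectors: alpha_j > 0
    b = np.mean(y[sv] - X[sv] @ w)        # from (w.x_k + b) y_k = 1  =>  b = y_k - w.x_k
    return w, b, sv
```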

Dual SVM Interpretation: Sparsity

Only a few αjs can be non-zero: those where the constraint is tight, i.e. (w·xj + b) yj = 1.

Support vectors – training points j whose αjs are non-zero.

[Figure: separating hyperplane w·x + b = 0 in the original space; the points on the margin have αj > 0 (these are the support vectors), all other points have αj = 0.]

So why solve the dual SVM?

• There are some quadratic programming algorithms that can solve the dual faster than the primal, especially in high dimensions (d ≫ n).

Recall: (w1, w2, …, wd, b) – d+1 primal variables; α1, α2, ..., αn – n dual variables.

Dual SVM – non-separable case

• Primal problem:
• Dual problem:

(Lagrange multipliers are introduced for each primal constraint.)

Dual SVM – non-separable case

The dual problem is also a QP; its solution gives the αjs.

The box constraint αi ≤ C comes from the C Σj ξj penalty in the primal. Intuition: earlier (hard margin), if a constraint is violated, αi → ∞; now (soft margin), αi is capped at C.
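For reference, the standard form of the soft-margin dual the slide describes; the multipliers for ξj ≥ 0 are eliminated (C − αj − μj = 0 with μj ≥ 0), leaving only the box constraint on the αjs:

$$\max_{\alpha}\;\; \sum_j \alpha_j \;-\; \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k\, y_j y_k\, (x_j \cdot x_k) \quad \text{s.t.}\quad 0 \le \alpha_j \le C, \;\; \sum_j \alpha_j y_j = 0$$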

So why solve the dual SVM?

• There are some quadratic programming algorithms that can solve the dual faster than the primal, especially in high dimensions (d ≫ n).

Recall: (w1, w2, …, wd, b) – d+1 primal variables; α1, α2, ..., αn – n dual variables.

• But, more importantly, the "kernel trick"!!!

What if data is not linearly separable?

Using non-linear features to get linear separation.

[Figure: data plotted in (x1, x2) coordinates is not linearly separable, but becomes separable using the features radius r = √(x1² + x2²) and angle θ.]

What if data is not linearly separable?

Using non-linear features to get linear separation.

[Figure: 1-D data on the x axis becomes linearly separable after adding the feature x², plotted on the y axis.]

What if data is not linearly separable?

Use features of features of features of features….

Feature space becomes really large very quickly!

Can we get by without having to write out the features explicitly?

Φ(x) = (x1², x2², x1x2, …, exp(x1))

Higher Order Polynomials

d – input features, m – degree of polynomial.

Number of degree-m monomials in d input features:

$$\binom{m+d-1}{m} \;=\; \frac{(m+d-1)!}{m!\,(d-1)!}$$

This grows fast! For m = 6, d = 100: about 1.6 billion terms.
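A quick sanity check of that count (Python's math.comb just evaluates the binomial coefficient above):

```python
from math import comb

m, d = 6, 100
print(comb(m + d - 1, m))   # 1609344100 -> about 1.6 billion terms
```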

Dual formulation only depends on dot-products, not on w!

Φ(x) – high-dimensional feature space, but we never need it explicitly, as long as we can compute the dot product fast using some kernel K.

Dot Product of Polynomials

m = 1: Φ(x) = (x1, x2), so Φ(x)·Φ(z) = x1z1 + x2z2 = x·z.

m = 2: Φ(x) = (x1², √2 x1x2, x2²), so Φ(x)·Φ(z) = x1²z1² + 2 x1x2 z1z2 + x2²z2² = (x·z)².

In general, for degree-m monomial features, Φ(x)·Φ(z) = (x·z)^m = K(x, z).

Don't store the high-dimensional features – only evaluate dot-products with kernels.
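A small numerical check of the m = 2 case, using numpy; the specific points are arbitrary:

```python
import numpy as np

def phi(v):
    """Explicit degree-2 monomial features of a 2-D point."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 3.0])
z = np.array([2.0, -1.0])

explicit = phi(x) @ phi(z)    # dot product in the feature space
kernel = (x @ z) ** 2         # K(x, z) = (x.z)^2, no features needed
print(explicit, kernel)       # both equal 1.0
```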

Finally: The Kernel Trick!

• Never represent features explicitly – compute dot products in closed form.
• Constant-time high-dimensional dot-products for many classes of features.

Common Kernels

• Polynomials of degree d
• Polynomials of degree up to d
• Gaussian/Radial kernels (polynomials of all orders – recall the series expansion)
• Sigmoid

(The usual forms of these kernels are sketched below.)
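A sketch of these kernels in their usual textbook forms; the parameter names (d, sigma, eta, nu) are illustrative choices:

```python
import numpy as np

def poly_kernel(x, z, d):
    """Polynomials of degree d: K(x, z) = (x . z)^d."""
    return (x @ z) ** d

def poly_kernel_up_to(x, z, d):
    """Polynomials of degree up to d: K(x, z) = (x . z + 1)^d."""
    return (x @ z + 1) ** d

def gaussian_kernel(x, z, sigma):
    """Gaussian / radial basis kernel: K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, z, eta, nu):
    """Sigmoid kernel: K(x, z) = tanh(eta * (x . z) + nu)."""
    return np.tanh(eta * (x @ z) + nu)
```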

Which Functions Can Be Kernels?

• Not all functions
• For some definitions of K(x1, x2) there is no corresponding projection ϕ(x)
• Nice theory on this, including how to construct new kernels from existing ones
• Initially kernels were defined over data points in Euclidean space, but more recently over strings, over trees, over graphs, …

Overfitting

• Huge feature space with kernels – what about overfitting???
  – Maximizing the margin leads to a sparse set of support vectors (the decision boundary is not too complicated)
  – Some interesting theory says that SVMs search for a simple hypothesis with large margin
  – Often robust to overfitting

What about classification time?

• For a new input x, if we need to represent Φ(x), we are in trouble!
• Recall the classifier: sign(w·Φ(x) + b)
• Using kernels we are cool!

SVMs with Kernels

• Choose a set of features and a kernel function
• Solve the dual problem to obtain the support vector weights αi
• At classification time, compute the kernel expansion over the support vectors and classify according to its sign (written out below)
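The standard kernelized decision rule computed at classification time:

$$\hat{y}(x) \;=\; \operatorname{sign}\!\Big(\sum_{i \in \mathrm{SV}} \alpha_i\, y_i\, K(x_i, x) \;+\; b\Big)$$

Only the support vectors (αi > 0) contribute, so Φ(x) is never formed explicitly.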

SVM Decision Surface using Gaussian Kernel

Bishop Fig 7.2. Circled points are the support vectors: training examples with non-zero αj. Points are plotted in the original 2-D space; contour lines show constant values of the decision function.

SVM Soft Margin Decision Surface using Gaussian Kernel

Bishop Fig 7.4. Circled points are the support vectors: training examples with non-zero αj. Points are plotted in the original 2-D space; contour lines show constant values of the decision function.

SVMs vs. Kernel Regression

Differences:
• SVMs:
  – Learn weights αi (and bandwidth)
  – Often sparse solution
• Kernel regression:
  – Fixed "weights", learn bandwidth
  – Solution may not be sparse
  – Much simpler to implement

SVMs vs. Logistic Regression

                                           SVMs           Logistic Regression
Loss function                              Hinge loss     Log-loss
High-dimensional features with kernels     Yes!           Yes!
Solution sparse                            Often yes!     Almost always no!
Semantics of output                        "Margin"       Real probabilities

Kernels in Logistic Regression

• Define the weights in terms of the features (see the sketch below)
• Derive a simple gradient descent rule on αi
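A minimal sketch of that idea, assuming the representation w = Σi αi Φ(xi), so that w·Φ(x) = Σi αi K(xi, x); labels in {−1, +1}, the step size, and the iteration count are illustrative choices:

```python
import numpy as np

def kernel_logreg(K, y, lr=0.1, n_iters=500):
    """Gradient descent on alpha for kernelized logistic regression.
    K is the n x n kernel matrix K[i, j] = K(x_i, x_j); y has labels in {-1, +1}."""
    alpha = np.zeros(len(y))
    for _ in range(n_iters):
        f = K @ alpha                        # f_i = sum_j alpha_j K(x_j, x_i)
        p = 1.0 / (1.0 + np.exp(y * f))      # p_i = sigma(-y_i f_i)
        grad = -K @ (y * p)                  # gradient of the log-loss w.r.t. alpha
        alpha -= lr * grad
    return alpha
```

Predictions for a new point x then use Σi αi K(xi, x) inside the sigmoid, so only kernel evaluations are needed.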


Kernels: Key Points

• Many learning tasks are framed as optimization problems
• Primal and dual formulations of optimization problems
• Dual version framed in terms of dot products between x's
• Kernel functions K(x, y) allow calculating dot products <Φ(x), Φ(y)> without bothering to project x into Φ(x)
• Leads to major efficiencies, and ability to use very high dimensional (virtual) feature spaces

What you need to know…

• Dual SVM formulation
  – How it's derived
• The kernel trick
• Common kernels
• Differences between SVMs and kernel regression
• Differences between SVMs and logistic regression
• Kernelized logistic regression
• Also: Kernel PCA, Kernel ICA, ...