
Support Vector Machines

Aarti Singh

Machine Learning 10-601, Nov 22, 2011


SVMs reminder

Soft margin approach:

$$\min_{w,b,\xi}\;\; w \cdot w \;+\; C \sum_j \xi_j$$

$$\text{s.t.}\quad (w \cdot x_j + b)\, y_j \;\ge\; 1 - \xi_j \;\;\forall j, \qquad \xi_j \ge 0 \;\;\forall j$$

The $C \sum_j \xi_j$ term is the hinge-loss penalty on margin violations; the $w \cdot w$ term is the regularization. Essentially a constrained optimization problem!
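As a concrete illustration, here is a minimal sketch of this soft-margin primal as a quadratic program; the cvxpy package and the tiny synthetic dataset are assumptions for the demo, not part of the slides:

```python
import numpy as np
import cvxpy as cp

# Tiny synthetic 2-class dataset: X is (n, d), y has labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (20, 2)), rng.normal(2.0, 1.0, (20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)
C = 1.0

n, d = X.shape
w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(n)

objective = cp.Minimize(cp.sum_squares(w) + C * cp.sum(xi))   # w.w + C * sum_j xi_j
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi,           # (w.x_j + b) y_j >= 1 - xi_j
               xi >= 0]                                       # xi_j >= 0
cp.Problem(objective, constraints).solve()
print(w.value, b.value)
```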

Constrained Optimization

Primal problem:

$$f^* \;=\; \min_x \max_{\alpha \ge 0}\; \underbrace{x^2 - \alpha(x - b)}_{L(x,\,\alpha)}$$

Here $\alpha \ge 0$ is the Lagrange multiplier and $L(x, \alpha)$ is the Lagrangian; this min-max form is equivalent ($\equiv$) to the constrained problem $\min_x x^2$ s.t. $x \ge b$.

Dual problem:

$$d^* \;=\; \max_{\alpha \ge 0} \min_x\; L(x, \alpha) \;=\; \max_{\alpha \ge 0} \min_x\; x^2 - \alpha(x - b)$$

In general, d* ≤ f* (weak duality). When is d* = f*?
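A worked sketch of the dual for this toy problem (standard Lagrangian duality):

$$g(\alpha) \;=\; \min_x \big[x^2 - \alpha(x - b)\big] \;=\; \alpha b - \frac{\alpha^2}{4} \qquad \text{(inner minimum at } x = \alpha/2\text{)}$$

so the dual problem is $d^* = \max_{\alpha \ge 0}\; \alpha b - \alpha^2/4$. Since $x^2$ is convex and the constraint $x \ge b$ is affine, strong duality holds for this problem and $d^* = f^*$.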

Constrained Optimization

Primal problem: $f^* = \min_x \max_{\alpha \ge 0} L(x, \alpha)$, with $x^*$ = primal solution.

Dual problem: $d^* = \max_{\alpha \ge 0} \min_x L(x, \alpha)$, with $\alpha^*$ = dual solution.

d* = f* if f is convex and x*, α* satisfy the KKT conditions:

$$\nabla L(x^*, \alpha^*) = 0 \qquad \text{Zero gradient}$$
$$x^* \ge b \qquad \text{Primal feasibility}$$
$$\alpha^* \ge 0 \qquad \text{Dual feasibility}$$
$$\alpha^*(x^* - b) = 0 \qquad \text{Complementary slackness}$$

⇒ If α* > 0, then x* = b, i.e. the constraint is effective.

Constrained Optimization

[Figure: the toy problem in two cases. When the constraint is ineffective, α* = 0; when the constraint is effective, α* = 1/2 > 0. Complementary slackness: α*(x* − 1) = 0.]

Dual SVM – linearly separable case

• Primal problem:
• Lagrangian:

w – weights on features
α – weights on training points

αj > 0 ⇒ the constraint is tight, (w·xj + b) yj = 1 ⇒ point j is a support vector!
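Written out, the standard hard-margin primal and its Lagrangian behind those bullets (using the conventional ½ w·w scaling, which is equivalent to minimizing w·w):

$$\min_{w,b}\;\; \tfrac{1}{2}\, w \cdot w \quad \text{s.t.}\quad (w \cdot x_j + b)\, y_j \;\ge\; 1 \;\;\forall j$$

$$L(w, b, \alpha) \;=\; \tfrac{1}{2}\, w \cdot w \;-\; \sum_j \alpha_j \big[(w \cdot x_j + b)\, y_j - 1\big], \qquad \alpha_j \ge 0$$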

Dual SVM – linearly separable case

• Dual problem derivation:

If we can solve for the αs (dual problem), then we have a solution for w (primal problem).

Dual SVM – linearly separable case

• Dual problem derivation:
• Dual problem:
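A sketch of the derivation those bullets refer to (standard hard-margin dual, continuing from the Lagrangian above):

$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_j \alpha_j y_j x_j, \qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_j \alpha_j y_j = 0$$

Substituting back into $L(w, b, \alpha)$ eliminates $w$ and $b$ and gives the dual problem:

$$\max_{\alpha \ge 0}\;\; \sum_j \alpha_j \;-\; \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k\, y_j y_k\, (x_j \cdot x_k) \quad \text{s.t.}\quad \sum_j \alpha_j y_j = 0$$

Note the data enter only through the dot products $x_j \cdot x_k$.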

Dual SVM – linearly separable case

• Dual problem:

The dual problem is also a QP; its solution gives the αjs.

Use the support vectors to compute b: for any support vector xk, (w·xk + b) yk = 1, and since yk = ±1 this gives w·xk + b = yk, i.e. b = yk − w·xk.
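A minimal numpy sketch of this step, assuming the dual QP has already been solved and `alphas` holds the αjs (with the scaling w = Σj αj yj xj used in the derivation above); the tolerance is an illustrative choice:

```python
import numpy as np

def primal_from_dual(alphas, X, y, tol=1e-6):
    """Recover (w, b) from the dual solution. X is (n, d), y has labels in {-1, +1}."""
    w = (alphas * y) @ X                  # w = sum_j alpha_j y_j x_j
    sv = alphas > tol                     # support vectors: alpha_j > 0
    b = np.mean(y[sv] - X[sv] @ w)        # from (w.x_k + b) y_k = 1  =>  b = y_k - w.x_k
    return w, b, sv
```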

Dual SVM Interpretation: Sparsity

Only a few αjs can be non-zero: those where the constraint is tight, i.e. (w·xj + b) yj = 1.

Support vectors – training points j whose αjs are non-zero.

[Figure: separating hyperplane w·x + b = 0 in the original space; the points on the margin have αj > 0 (these are the support vectors), all other points have αj = 0.]

So why solve the dual SVM?

• There are some quadratic programming algorithms that can solve the dual faster than the primal, especially in high dimensions (d ≫ n).

Recall: (w1, w2, …, wd, b) – d+1 primal variables; α1, α2, ..., αn – n dual variables.

Dual SVM – non-separable case

• Primal problem:
• Dual problem:

(Lagrange multipliers are introduced for each primal constraint.)

Dual SVM – non-separable case

The dual problem is also a QP; its solution gives the αjs.

The box constraint αi ≤ C comes from the C Σj ξj penalty in the primal. Intuition: earlier (hard margin), if a constraint is violated, αi → ∞; now (soft margin), αi is capped at C.
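For reference, the standard form of the soft-margin dual the slide describes; the multipliers for ξj ≥ 0 are eliminated (C − αj − μj = 0 with μj ≥ 0), leaving only the box constraint on the αjs:

$$\max_{\alpha}\;\; \sum_j \alpha_j \;-\; \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k\, y_j y_k\, (x_j \cdot x_k) \quad \text{s.t.}\quad 0 \le \alpha_j \le C, \;\; \sum_j \alpha_j y_j = 0$$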

So why solve the dual SVM?

• There are some quadratic programming algorithms that can solve the dual faster than the primal, especially in high dimensions (d ≫ n).

Recall: (w1, w2, …, wd, b) – d+1 primal variables; α1, α2, ..., αn – n dual variables.

• But, more importantly, the "kernel trick"!!!

What if data is not linearly separable?

Using non-linear features to get linear separation.

[Figure: data plotted in (x1, x2) coordinates is not linearly separable, but becomes separable using the features radius r = √(x1² + x2²) and angle θ.]

What if data is not linearly separable?

Using non-linear features to get linear separation.

[Figure: 1-D data on the x axis becomes linearly separable after adding the feature x², plotted on the y axis.]

What if data is not linearly separable?

Use features of features of features of features….

Feature space becomes really large very quickly!

Can we get by without having to write out the features explicitly?

Φ(x) = (x1², x2², x1x2, …, exp(x1))

Higher Order Polynomials

d – input features, m – degree of polynomial.

Number of degree-m monomials in d input features:

$$\binom{m+d-1}{m} \;=\; \frac{(m+d-1)!}{m!\,(d-1)!}$$

This grows fast! For m = 6, d = 100: about 1.6 billion terms.
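A quick sanity check of that count (Python's math.comb just evaluates the binomial coefficient above):

```python
from math import comb

m, d = 6, 100
print(comb(m + d - 1, m))   # 1609344100 -> about 1.6 billion terms
```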

Dual formulation only depends on dot-products, not on w!

Φ(x) – high-dimensional feature space, but we never need it explicitly, as long as we can compute the dot product fast using some kernel K.

Dot Product of Polynomials

m = 1: Φ(x) = (x1, x2), so Φ(x)·Φ(z) = x1z1 + x2z2 = x·z.

m = 2: Φ(x) = (x1², √2 x1x2, x2²), so Φ(x)·Φ(z) = x1²z1² + 2 x1x2 z1z2 + x2²z2² = (x·z)².

In general, for degree-m monomial features, Φ(x)·Φ(z) = (x·z)^m = K(x, z).

Don't store the high-dimensional features – only evaluate dot-products with kernels.
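A small numerical check of the m = 2 case, using numpy; the specific points are arbitrary:

```python
import numpy as np

def phi(v):
    """Explicit degree-2 monomial features of a 2-D point."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 3.0])
z = np.array([2.0, -1.0])

explicit = phi(x) @ phi(z)    # dot product in the feature space
kernel = (x @ z) ** 2         # K(x, z) = (x.z)^2, no features needed
print(explicit, kernel)       # both equal 1.0
```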

Finally: The Kernel Trick!

• Never represent features explicitly – compute dot products in closed form.
• Constant-time high-dimensional dot-products for many classes of features.

Common Kernels

• Polynomials of degree d
• Polynomials of degree up to d
• Gaussian/Radial kernels (polynomials of all orders – recall the series expansion)
• Sigmoid

(The usual forms of these kernels are sketched below.)
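A sketch of these kernels in their usual textbook forms; the parameter names (d, sigma, eta, nu) are illustrative choices:

```python
import numpy as np

def poly_kernel(x, z, d):
    """Polynomials of degree d: K(x, z) = (x . z)^d."""
    return (x @ z) ** d

def poly_kernel_up_to(x, z, d):
    """Polynomials of degree up to d: K(x, z) = (x . z + 1)^d."""
    return (x @ z + 1) ** d

def gaussian_kernel(x, z, sigma):
    """Gaussian / radial basis kernel: K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, z, eta, nu):
    """Sigmoid kernel: K(x, z) = tanh(eta * (x . z) + nu)."""
    return np.tanh(eta * (x @ z) + nu)
```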

Which Functions Can Be Kernels?

• Not all functions
• For some definitions of K(x1, x2) there is no corresponding projection ϕ(x)
• Nice theory on this, including how to construct new kernels from existing ones
• Initially kernels were defined over data points in Euclidean space, but more recently over strings, over trees, over graphs, …

Overfitting

• Huge feature space with kernels – what about overfitting???
  – Maximizing the margin leads to a sparse set of support vectors (the decision boundary is not too complicated)
  – Some interesting theory says that SVMs search for a simple hypothesis with large margin
  – Often robust to overfitting

What about classification time?

• For a new input x, if we need to represent Φ(x), we are in trouble!
• Recall the classifier: sign(w·Φ(x) + b)
• Using kernels we are cool!

SVMs with Kernels

• Choose a set of features and a kernel function
• Solve the dual problem to obtain the support vector weights αi
• At classification time, compute the kernel expansion over the support vectors and classify according to its sign (written out below)
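The standard kernelized decision rule computed at classification time:

$$\hat{y}(x) \;=\; \operatorname{sign}\!\Big(\sum_{i \in \mathrm{SV}} \alpha_i\, y_i\, K(x_i, x) \;+\; b\Big)$$

Only the support vectors (αi > 0) contribute, so Φ(x) is never formed explicitly.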

SVM Decision Surface using Gaussian Kernel

Bishop Fig 7.2. Circled points are the support vectors: training examples with non-zero αj. Points are plotted in the original 2-D space; contour lines show constant values of the decision function.

SVM Soft Margin Decision Surface using Gaussian Kernel

Bishop Fig 7.4. Circled points are the support vectors: training examples with non-zero αj. Points are plotted in the original 2-D space; contour lines show constant values of the decision function.

SVMs vs. Kernel Regression

Differences:
• SVMs:
  – Learn weights αi (and bandwidth)
  – Often sparse solution
• Kernel regression:
  – Fixed "weights", learn bandwidth
  – Solution may not be sparse
  – Much simpler to implement

SVMs vs. Logistic Regression

                                           SVMs           Logistic Regression
Loss function                              Hinge loss     Log-loss
High-dimensional features with kernels     Yes!           Yes!
Solution sparse                            Often yes!     Almost always no!
Semantics of output                        "Margin"       Real probabilities

Kernels in Logistic Regression

• Define the weights in terms of the features (see the sketch below)
• Derive a simple gradient descent rule on αi
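A minimal sketch of that idea, assuming the representation w = Σi αi Φ(xi), so that w·Φ(x) = Σi αi K(xi, x); labels in {−1, +1}, the step size, and the iteration count are illustrative choices:

```python
import numpy as np

def kernel_logreg(K, y, lr=0.1, n_iters=500):
    """Gradient descent on alpha for kernelized logistic regression.
    K is the n x n kernel matrix K[i, j] = K(x_i, x_j); y has labels in {-1, +1}."""
    alpha = np.zeros(len(y))
    for _ in range(n_iters):
        f = K @ alpha                        # f_i = sum_j alpha_j K(x_j, x_i)
        p = 1.0 / (1.0 + np.exp(y * f))      # p_i = sigma(-y_i f_i)
        grad = -K @ (y * p)                  # gradient of the log-loss w.r.t. alpha
        alpha -= lr * grad
    return alpha
```

Predictions for a new point x then use Σi αi K(xi, x) inside the sigmoid, so only kernel evaluations are needed.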


Kernels: Key Points

• Many learning tasks are framed as optimization problems
• Primal and dual formulations of optimization problems
• Dual version framed in terms of dot products between x's
• Kernel functions K(x, y) allow calculating dot products <Φ(x), Φ(y)> without bothering to project x into Φ(x)
• Leads to major efficiencies, and ability to use very high dimensional (virtual) feature spaces

What you need to know…

• Dual SVM formulation
  – How it's derived
• The kernel trick
• Common kernels
• Differences between SVMs and kernel regression
• Differences between SVMs and logistic regression
• Kernelized logistic regression
• Also: Kernel PCA, Kernel ICA, ...