
CSC 411: Lecture 16: Kernels

Richard Zemel, Raquel Urtasun and Sanja Fidler

University of Toronto


Today

Kernel trick


Summary of Linear SVM

Binary, linearly separable classification

Linear classifier with maximal margin

Training SVM by maximizing

\max_{\alpha_i \ge 0} \left\{ \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i,j=1}^N t^{(i)} t^{(j)} \alpha_i \alpha_j \left( x^{(i)T} \cdot x^{(j)} \right) \right\}

subject to \alpha_i \ge 0; \quad \sum_{i=1}^N \alpha_i t^{(i)} = 0

The weights are

w = \sum_{i=1}^N \alpha_i t^{(i)} x^{(i)}

Only a small subset of the α_i's will be nonzero, and the corresponding x^{(i)}'s are the support vectors S

Prediction on a new example:

y = \mathrm{sign}\left[ b + x \cdot \left( \sum_{i=1}^N \alpha_i t^{(i)} x^{(i)} \right) \right] = \mathrm{sign}\left[ b + x \cdot \left( \sum_{i \in S} \alpha_i t^{(i)} x^{(i)} \right) \right]
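For concreteness, here is a minimal NumPy sketch of this prediction rule, assuming the dual variables α_i, targets t^{(i)}, training inputs, and bias b have already been obtained from a QP solver (the variable names and the tiny data below are illustrative, not from the slides):

import numpy as np

def linear_svm_predict(x_new, alphas, targets, X_train, b):
    """Dual-form linear SVM prediction: sign(b + x_new . sum_i alpha_i t_i x_i)."""
    # Only support vectors (alpha_i > 0) actually contribute to w
    w = (alphas * targets) @ X_train
    return np.sign(b + x_new @ w)

# Made-up dual solution, just to show the call
X_train = np.array([[1.0, 1.0], [-1.0, -1.0]])
targets = np.array([1.0, -1.0])
alphas = np.array([0.5, 0.5])      # pretend these came from the dual QP
print(linear_svm_predict(np.array([2.0, 0.5]), alphas, targets, X_train, b=0.0))   # 1.0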


What if data is not linearly separable?

Introduce slack variables ξi

\min \; \frac{1}{2}\|w\|^2 + \lambda \sum_{i=1}^N \xi_i

\text{s.t. } \xi_i \ge 0; \quad \forall i \;\; t^{(i)}(w^T x^{(i)}) \ge 1 - \xi_i

If an example lies on the wrong side of the hyperplane, then ξ_i > 1

Therefore Σ_i ξ_i upper bounds the number of training errors

λ trades off training error vs model complexity

This is known as the soft-margin extension
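As a sanity check on this objective, the short NumPy sketch below evaluates the soft-margin cost for a fixed w, using the fact that the smallest feasible slack is the hinge loss ξ_i = max(0, 1 − t^{(i)} w^T x^{(i)}); the weights and data are made up for illustration:

import numpy as np

def soft_margin_objective(w, X, t, lam):
    """(1/2)||w||^2 + lam * sum_i xi_i, with each xi_i set to its smallest feasible value."""
    margins = t * (X @ w)
    xi = np.maximum(0.0, 1.0 - margins)   # slack variables (hinge loss)
    return 0.5 * (w @ w) + lam * xi.sum(), xi

# One point violates the margin, so only its slack is positive
X = np.array([[2.0, 0.0], [0.2, 0.0], [-2.0, 0.0]])
t = np.array([1.0, 1.0, -1.0])
obj, xi = soft_margin_objective(np.array([1.0, 0.0]), X, t, lam=1.0)
print(obj, xi)    # 1.3, [0.  0.8 0. ]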


Non-linear Decision Boundaries

Note that both the learning objective and the decision function depend only on dot products between patterns

\ell = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i,j=1}^N t^{(i)} t^{(j)} \alpha_i \alpha_j \left( x^{(i)T} \cdot x^{(j)} \right)

y = \mathrm{sign}\left[ b + x \cdot \left( \sum_{i=1}^N \alpha_i t^{(i)} x^{(i)} \right) \right]

How to form non-linear decision boundaries in input space?

1. Map data into feature space: x → φ(x)

2. Replace dot products between inputs with dot products between feature vectors: x^{(i)T} x^{(j)} → φ(x^{(i)})^T φ(x^{(j)})

3. Find a linear decision boundary in feature space (a small sketch of these steps follows below)
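The sketch below walks through these three steps with an assumed quadratic feature map φ(x) = (x_1^2, √2 x_1 x_2, x_2^2) on 2-D inputs (this particular φ is an illustration, not taken from the slides); a linear boundary on φ(x) corresponds to a quadratic boundary in the original space:

import numpy as np

def phi(x):
    """Assumed quadratic feature map for 2-D inputs (illustration only)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

# Step 2: dot products between inputs become dot products between feature vectors
a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])
print(phi(a) @ phi(b), (a @ b) ** 2)   # both equal (a^T b)^2 = 1.0

# Step 3 would then fit a linear classifier (e.g. an SVM) on phi(x) instead of x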

Problem: what is a good feature function φ(x)?


Input Transformation

Mapping to a feature space can produce problems:

- High computational burden due to high dimensionality
- Many more parameters

SVM solves these two issues simultaneously

- "Kernel trick" produces efficient classification
- Dual formulation only assigns parameters to samples, not features


Kernel Trick

Kernel trick: dot products in feature space can be computed as a kernel function

K(x^{(i)}, x^{(j)}) = \phi(x^{(i)})^T \phi(x^{(j)})

Idea: work directly on x, avoid having to compute φ(x)

Example:

K(a, b) = (a^T b)^3 = ((a_1, a_2)^T (b_1, b_2))^3
        = (a_1 b_1 + a_2 b_2)^3
        = a_1^3 b_1^3 + 3 a_1^2 b_1^2 a_2 b_2 + 3 a_1 b_1 a_2^2 b_2^2 + a_2^3 b_2^3
        = (a_1^3, \sqrt{3} a_1^2 a_2, \sqrt{3} a_1 a_2^2, a_2^3)^T (b_1^3, \sqrt{3} b_1^2 b_2, \sqrt{3} b_1 b_2^2, b_2^3)
        = \phi(a) \cdot \phi(b)
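The identity above is easy to check numerically; the snippet below evaluates both sides with the explicit 4-dimensional φ from the last line (the values of a and b are arbitrary):

import numpy as np

def phi_cubic(v):
    """Explicit feature map for the cubic kernel K(a, b) = (a^T b)^3 on 2-D inputs."""
    v1, v2 = v
    s3 = np.sqrt(3.0)
    return np.array([v1**3, s3 * v1**2 * v2, s3 * v1 * v2**2, v2**3])

a = np.array([0.5, -1.2])
b = np.array([2.0, 0.3])
lhs = (a @ b) ** 3                      # kernel computed directly in input space
rhs = phi_cubic(a) @ phi_cubic(b)       # dot product in the 4-D feature space
print(lhs, rhs, np.isclose(lhs, rhs))   # 0.262144 0.262144 True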


Kernels

Examples of kernels: kernels measure similarity

1. Polynomial

K(x^{(i)}, x^{(j)}) = (x^{(i)T} x^{(j)} + 1)^d

where d is the degree of the polynomial, e.g., d = 2 for quadratic

2. Gaussian

K(x^{(i)}, x^{(j)}) = \exp\left( -\frac{\|x^{(i)} - x^{(j)}\|^2}{2\sigma^2} \right)

3. Sigmoid

K(x^{(i)}, x^{(j)}) = \tanh\left( \beta \, (x^{(i)T} x^{(j)}) + a \right)

(Code sketches of these three kernels follow below.)
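Straightforward NumPy versions of the three kernels above (the function names and default hyperparameter values are illustrative, not part of the lecture):

import numpy as np

def polynomial_kernel(x, z, d=2):
    """K(x, z) = (x^T z + 1)^d"""
    return (x @ z + 1.0) ** d

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))"""
    diff = x - z
    return np.exp(-(diff @ diff) / (2.0 * sigma**2))

def sigmoid_kernel(x, z, beta=1.0, a=0.0):
    """K(x, z) = tanh(beta * (x^T z) + a)"""
    return np.tanh(beta * (x @ z) + a)

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(polynomial_kernel(x, z), gaussian_kernel(x, z), sigmoid_kernel(x, z))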

Each kernel computation corresponds to a dot product
- the calculation for a particular mapping φ(x) implicitly maps to a high-dimensional space

Why is this useful?

1. Rewrite training examples using more complex features

2. A dataset not linearly separable in the original space may be linearly separable in a higher-dimensional space


Kernel Functions

Mercer's Theorem (1909): any reasonable kernel corresponds to some feature space

Reasonable means that the Gram matrix is positive definite

K_{ij} = K(x^{(i)}, x^{(j)})

Feature space can be very large

- the polynomial kernel (1 + x^{(i)T} x^{(j)})^d corresponds to a feature space exponential in d
- the Gaussian kernel has infinite-dimensional features

Linear separators in these very high-dimensional spaces correspond to highly nonlinear decision boundaries in input space
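A quick numerical illustration of the Gram-matrix condition mentioned above: build K_ij for the Gaussian kernel on some arbitrary points and confirm the eigenvalues are (numerically) non-negative, i.e. the matrix is positive (semi-)definite. The data and σ below are arbitrary:

import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    diff = x - z
    return np.exp(-(diff @ diff) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))            # 20 arbitrary 3-D points

# Gram matrix K_ij = K(x^(i), x^(j))
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

eigvals = np.linalg.eigvalsh(K)
print(eigvals.min())                    # >= 0 up to floating-point error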


Classification with Non-linear SVMs

Non-linear SVM using a kernel function K(·):

\ell = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i,j=1}^N t^{(i)} t^{(j)} \alpha_i \alpha_j K(x^{(i)}, x^{(j)})

Maximize ℓ w.r.t. {α} under the constraints ∀i, α_i ≥ 0

Unlike the linear SVM, we cannot express w explicitly as a linear combination of the support vectors
- we must now retain the support vectors to classify new examples

Final decision function:

y = \mathrm{sign}\left[ b + \sum_{i=1}^N t^{(i)} \alpha_i K(x, x^{(i)}) \right]
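A sketch of this decision function in NumPy, assuming the dual variables, targets, bias, and retained support vectors are given and that a kernel function (e.g. the Gaussian kernel from the earlier slide) is passed in; all names and values here are illustrative:

import numpy as np

def kernel_svm_predict(x_new, alphas, targets, support_vectors, b, kernel):
    """y = sign(b + sum_i t_i alpha_i K(x_new, x_i)), summed over the support vectors."""
    k_vals = np.array([kernel(x_new, sv) for sv in support_vectors])
    return np.sign(b + np.sum(targets * alphas * k_vals))

def gaussian(x, z):
    return np.exp(-np.sum((x - z) ** 2) / 2.0)

# Made-up dual solution, just to show the call signature
support_vectors = np.array([[1.0, 1.0], [-1.0, -1.0]])
targets = np.array([1.0, -1.0])
alphas = np.array([0.7, 0.7])
print(kernel_svm_predict(np.array([0.8, 1.2]), alphas, targets, support_vectors, 0.0, gaussian))   # 1.0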


Summary

Advantages:

- Kernels allow very flexible hypotheses
- Poly-time exact optimization methods rather than approximate methods
- Soft-margin extension permits mis-classified examples
- Variable-sized hypothesis space
- Excellent results (1.1% error rate on handwritten digits vs. LeNet's 0.9%)

Disadvantages:

- Must choose kernel parameters
- Very large problems computationally intractable
- Batch algorithm


More Summary

Software:

- A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
- Some implementations (such as LIBSVM) can handle multi-class classification (see the short example after this list)
- SVMLight is among the earliest implementations
- Several Matlab toolboxes for SVM are also available
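As a concrete usage example (not from the slides): in Python, scikit-learn's SVC class wraps LIBSVM and exposes kernel SVMs directly; a minimal sketch on XOR-style toy data, which is not linearly separable in the original space:

import numpy as np
from sklearn.svm import SVC

# Toy XOR data: not linearly separable in the original 2-D space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# RBF (Gaussian) kernel SVM; C plays the role of the slack penalty
clf = SVC(kernel="rbf", C=10.0, gamma=2.0)
clf.fit(X, y)
print(clf.predict(X))           # expected to recover the XOR labels
print(clf.support_vectors_)     # the retained support vectors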

Key points:

- Difference between logistic regression and SVMs
- Maximum margin principle
- Target function for SVMs
- Slack variables for mis-classified points
- Kernel trick allows non-linear generalizations
