Math formulation
• Given training data $(x_i, y_i)$, $1 \le i \le n$, i.i.d. from distribution $D$
• Find $y = f(x) \in \mathcal{H}$ that minimizes the empirical loss $\hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f, x_i, y_i)$
• s.t. the expected loss is small: $L(f) = \mathbb{E}_{(x,y) \sim D}[\ell(f, x, y)]$
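To make the empirical loss concrete, here is a minimal numpy sketch of it as an average over the training pairs (the names `f`, `loss`, `X`, `Y` are hypothetical, not from the slides):

```python
import numpy as np

def empirical_loss(f, loss, X, Y):
    """Average of loss(f, x_i, y_i) over the n training pairs (x_i, y_i)."""
    return np.mean([loss(f, x, y) for x, y in zip(X, Y)])
```

The expected loss $L(f)$ is over the unseen distribution $D$, so it can only be estimated, e.g. by the same average on held-out data.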
Machine learning 1-2-3
• Collect data and extract features
• Build model: choose hypothesis class $\mathcal{H}$ and loss function $\ell$
• Optimization: minimize the empirical loss
Loss function
• $\ell_2$ loss: linear regression
• Cross-entropy: logistic regression
• Hinge loss: Perceptron
• General principle: maximum likelihood estimation (MLE)
  • $\ell_2$ loss: corresponds to the Normal distribution
  • Logistic regression: corresponds to a sigmoid conditional distribution
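As a rough illustration of these three losses, a minimal numpy sketch (the function names are hypothetical; label conventions are noted in the comments):

```python
import numpy as np

def l2_loss(y_pred, y):
    # l2 (squared) loss, as in linear regression
    return (y_pred - y) ** 2

def cross_entropy_loss(p_pred, y):
    # Cross-entropy for labels y in {0, 1}, p_pred = predicted P(y = 1)
    return -(y * np.log(p_pred) + (1 - y) * np.log(1 - p_pred))

def hinge_loss(s, y):
    # Hinge loss for labels y in {+1, -1} and score s = w^T x
    return np.maximum(0.0, 1.0 - y * s)
```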
Optimization
• Linear regression: closed-form solution
• Logistic regression: gradient descent
• Perceptron: stochastic gradient descent (SGD)
• General principle: local improvement
  • SGD: Perceptron; can also be applied to linear regression/logistic regression
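A minimal sketch of two of these strategies, assuming a design matrix `X` (bias absorbed as a constant feature) and labels `y`; the names are hypothetical:

```python
import numpy as np

def linear_regression_closed_form(X, y):
    # Closed form for min ||Xw - y||^2: w = (X^T X)^{-1} X^T y,
    # computed via least squares for numerical stability
    return np.linalg.lstsq(X, y, rcond=None)[0]

def perceptron_sgd(X, y, epochs=100):
    # Stochastic local improvement: update w on each misclassified point.
    # Labels y in {+1, -1}.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:  # mistake (or on the boundary)
                w += yi * xi
    return w
```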
Principle for hypothesis class?
• Yes, there exists a general principle (at least philosophically)
• Different names/faces/connections:
  • Occam's razor
  • VC dimension theory
  • Minimum description length
  • Tradeoff between bias and variance; uniform convergence
  • The curse of dimensionality
• Running example: Support Vector Machine (SVM)
Linear classification: $(w^*)^T x = 0$
[Figure: linearly separable points, class +1 where $(w^*)^T x > 0$ and class −1 where $(w^*)^T x < 0$; $w^*$ is normal to the separating hyperplane.]
Assume perfect separation between the two classes
Attempt
• Given training data $(x_i, y_i)$, $1 \le i \le n$, i.i.d. from distribution $D$
• Hypothesis $y = \operatorname{sign}(f_w(x)) = \operatorname{sign}(w^T x)$
  • $y = +1$ if $w^T x > 0$
  • $y = -1$ if $w^T x < 0$
• Let's assume that we can optimize to find $w$
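The hypothesis itself is one line of code; a sketch (hypothetical helper name):

```python
import numpy as np

def predict(w, x):
    # y = sign(w^T x): +1 if w^T x > 0, -1 if w^T x < 0
    return 1 if w @ x > 0 else -1
```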
Multiple optimal solutions?
[Figure: the same two classes separated by three different hyperplanes $w_1$, $w_2$, $w_3$.]
Same on empirical loss; different on test/expected loss
Margin
• Lemma 1: $x$ has distance $\frac{|f_w(x)|}{\|w\|}$ to the hyperplane $f_w(x) = w^T x = 0$
Proof:
• $w$ is orthogonal to the hyperplane
• The unit direction is $\frac{w}{\|w\|}$
• The projection of $x$ onto it is $\left(\frac{w}{\|w\|}\right)^T x = \frac{f_w(x)}{\|w\|}$
[Figure: $x$ and its projection onto the unit direction $\frac{w}{\|w\|}$; the hyperplane $w^T x = 0$ passes through $0$.]
Margin: with bias
• Claim 1: $w$ is orthogonal to the hyperplane $f_{w,b}(x) = w^T x + b = 0$
Proof:
• Pick any $x_1$ and $x_2$ on the hyperplane
• $w^T x_1 + b = 0$
• $w^T x_2 + b = 0$
• So $w^T (x_1 - x_2) = 0$, i.e., $w$ is orthogonal to every direction $x_1 - x_2$ lying in the hyperplane
Margin: with bias
• Claim 2: $0$ has distance $\frac{-b}{\|w\|}$ to the hyperplane $w^T x + b = 0$
Proof:
• Pick any $x_1$ on the hyperplane
• Project $x_1$ onto the unit direction $\frac{w}{\|w\|}$ to get the distance
• $\left(\frac{w}{\|w\|}\right)^T x_1 = \frac{-b}{\|w\|}$ since $w^T x_1 + b = 0$
Margin: with bias
• Lemma 2: $x$ has distance $\frac{|f_{w,b}(x)|}{\|w\|}$ to the hyperplane $f_{w,b}(x) = w^T x + b = 0$
Proof:
• Let $x = x_\perp + r \frac{w}{\|w\|}$ with $x_\perp$ on the hyperplane; then $|r|$ is the distance
• Multiply both sides by $w^T$ and add $b$
• Left-hand side: $w^T x + b = f_{w,b}(x)$
• Right-hand side: $w^T x_\perp + r \frac{w^T w}{\|w\|} + b = 0 + r\|w\|$
• So $r = \frac{f_{w,b}(x)}{\|w\|}$
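Lemma 2 is easy to sanity-check numerically. A small sketch with arbitrary values (not from the slides), comparing the formula against the explicit decomposition used in the proof:

```python
import numpy as np

w = np.array([3.0, 4.0])   # ||w|| = 5
b = -2.0
x = np.array([1.0, 2.0])

# Distance by Lemma 2: |f_{w,b}(x)| / ||w|| = |w^T x + b| / ||w||
dist = abs(w @ x + b) / np.linalg.norm(w)

# Decomposition x = x_perp + r * w/||w|| from the proof
r = (w @ x + b) / np.linalg.norm(w)
x_perp = x - r * w / np.linalg.norm(w)

print(np.isclose(w @ x_perp + b, 0.0))  # x_perp lies on the hyperplane
print(np.isclose(abs(r), dist))         # |r| matches the Lemma 2 distance
```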
[Figure: geometry of the decision boundary $y(x) = w^T x + w_0$; the notation here uses $w_0$ for the bias. Figure from Pattern Recognition and Machine Learning, Bishop.]
SVM: objective
• Margin over all training data points:
$\gamma = \min_i \frac{|f_{w,b}(x_i)|}{\|w\|}$
• Since we only want $f_{w,b}$ to be correct, and recalling $y_i \in \{+1, -1\}$, we have
$\gamma = \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|}$
• If $f_{w,b}$ is incorrect on some $x_i$, the margin is negative
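In code, this margin is a one-liner; a minimal numpy sketch (hypothetical names; the rows of `X` are the $x_i$):

```python
import numpy as np

def margin(w, b, X, y):
    # gamma = min_i y_i (w^T x_i + b) / ||w||; negative iff some x_i is misclassified
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)
```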
SVM: objective
• Maximize the margin over all training data points:
$\max_{w,b} \gamma = \max_{w,b} \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|} = \max_{w,b} \min_i \frac{y_i (w^T x_i + b)}{\|w\|}$
• A bit complicated …
SVM: simplified objective
• Observation: when $(w, b)$ is scaled by a factor $c$, the margin is unchanged:
$\frac{y_i (c\,w^T x_i + c\,b)}{\|c\,w\|} = \frac{y_i (w^T x_i + b)}{\|w\|}$
• Let's consider a fixed scale such that
$y_{i^*} (w^T x_{i^*} + b) = 1$
where $x_{i^*}$ is the point closest to the hyperplane
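A quick numeric check of the scale invariance, with arbitrary toy data (not from the slides):

```python
import numpy as np

X = np.array([[1.0, 1.0], [-1.0, 2.0], [-2.0, -1.0]])
y = np.array([1.0, -1.0, -1.0])
w, b = np.array([2.0, -1.0]), 0.5

def margin(w, b):
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

c = 7.3  # any positive scale factor
print(np.isclose(margin(w, b), margin(c * w, c * b)))  # True
```

Note the factor must be positive: a negative $c$ flips the predicted labels.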
SVM: simplified objective
• Let's consider a fixed scale such that
$y_{i^*} (w^T x_{i^*} + b) = 1$
where $x_{i^*}$ is the point closest to the hyperplane
• Now we have for all data points
$y_i (w^T x_i + b) \ge 1$
and for at least one $i$ the equality holds
• Then the margin is $\frac{1}{\|w\|}$
SVM: simplified objective
• Optimization simplified to
$\min_{w,b} \frac{1}{2}\|w\|^2$
$\text{s.t. } y_i (w^T x_i + b) \ge 1, \; \forall i$
• How to find the optimum $\hat{w}^*$?
Thought experiment
• Suppose we pick an $M$, and suppose we can decide whether there exists a $w$ satisfying
$\frac{1}{2}\|w\|^2 \le M$
$y_i (w^T x_i + b) \ge 1, \; \forall i$
• Decrease $M$ until we cannot find a $w$ satisfying the inequalities
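The thought experiment is essentially a bisection over $M$. A sketch assuming a hypothetical feasibility oracle `exists_w(M)` (e.g. a QP feasibility solver) that reports whether some $w$ satisfies both conditions:

```python
def smallest_M(exists_w, lo=0.0, hi=1e6, tol=1e-6):
    # Binary search for the smallest M with a feasible w.
    # Assumes feasibility is monotone in M and that exists_w(hi) is True.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if exists_w(mid):
            hi = mid  # feasible: shrink the ball
        else:
            lo = mid  # infeasible: grow the ball
    return hi
```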
Thought experiment
• $\hat{w}^*$ is the best weight (i.e., the one satisfying the smallest $M$)
[Figure: the optimum $\hat{w}^*$.]
Thought experiment
• To handle the difference between empirical and expected losses:
  • Choose a large-margin hypothesis (high confidence)
  • Choose a small hypothesis class
[Figure: $\hat{w}^*$; the ball $\frac{1}{2}\|w\|^2 \le M$ corresponds to the hypothesis class.]
Thought experiment
• Principle: use the smallest hypothesis class that still contains a correct/good hypothesis
  • Also true beyond SVM
  • Also true for the case without perfect separation between the two classes
  • Math formulation: VC-dimension theory, etc.
Thought experiment
• Principle: use the smallest hypothesis class that still contains a correct/good hypothesis
  • Whatever you know about the ground truth, add it as a constraint/regularizer
SVM: optimization
• Optimization (Quadratic Programming):
$\min_{w,b} \frac{1}{2}\|w\|^2$
$\text{s.t. } y_i (w^T x_i + b) \ge 1, \; \forall i$
• Solved by the Lagrange multiplier method:
$\mathcal{L}(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i (w^T x_i + b) - 1 \right]$
where $\alpha = (\alpha_1, \ldots, \alpha_n)$ are the Lagrange multipliers
• Details in next lecture
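Pending that derivation, the primal QP above can already be handed to an off-the-shelf solver. A minimal sketch using cvxpy (assumed installed; the toy data are hypothetical and linearly separable):

```python
import cvxpy as cp
import numpy as np

# Toy separable data (hypothetical)
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
problem = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(w)),  # (1/2)||w||^2
    [cp.multiply(y, X @ w + b) >= 1],      # margin constraints
)
problem.solve()
print(w.value, b.value)
```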