Online Learning via Stochastic Optimization, Perceptron, and Intro to SVMs
Piyush Rai
Machine Learning (CS771A)
Aug 20, 2016
Machine Learning (CS771A) Online Learning via Stochastic Optimization, Perceptron, and Intro to SVMs 1
Stochastic Gradient Descent for Logistic Regression

Recall the gradient descent (GD) update rule for (unregularized) logistic regression:

    w^{(t+1)} = w^{(t)} - \eta \sum_{n=1}^{N} (\mu_n^{(t)} - y_n) x_n

where the predicted probability of y_n being 1 is \mu_n^{(t)} = \frac{1}{1 + \exp(-w^{(t)\top} x_n)}.

Stochastic GD (SGD): approximate the gradient using a single randomly chosen example (x_n, y_n):

    w^{(t+1)} = w^{(t)} - \eta_t (\mu_n^{(t)} - y_n) x_n

where \eta_t is the learning rate at update t (typically decreasing with t).

Let's replace the predicted label probability \mu_n^{(t)} by the predicted binary label y_n^{(t)}:

    w^{(t+1)} = w^{(t)} - \eta_t (y_n^{(t)} - y_n) x_n

where

    y_n^{(t)} = \begin{cases} 1 & \text{if } \mu_n^{(t)} \ge 0.5 \text{ (equivalently, } w^{(t)\top} x_n \ge 0\text{)} \\ 0 & \text{if } \mu_n^{(t)} < 0.5 \text{ (equivalently, } w^{(t)\top} x_n < 0\text{)} \end{cases}

Thus w^{(t)} gets updated only when y_n^{(t)} \ne y_n (i.e., when w^{(t)} mispredicts).
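The plain SGD update and its mistake-driven variant can be sketched in code; a minimal NumPy illustration (function names are ours, labels in {0, 1}):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_step(w, x_n, y_n, lr):
    """One SGD update for logistic regression (labels in {0, 1})."""
    mu_n = sigmoid(w @ x_n)                 # predicted P(y_n = 1)
    return w - lr * (mu_n - y_n) * x_n

def mistake_driven_step(w, x_n, y_n, lr):
    """Same update, but with mu_n replaced by the hard label y_hat."""
    y_hat = 1.0 if w @ x_n >= 0 else 0.0    # equivalent to mu_n >= 0.5
    return w - lr * (y_hat - y_n) * x_n     # zero update when y_hat == y_n
```

Note that `mistake_driven_step` leaves w unchanged whenever the hard prediction is already correct, which is exactly the mistake-driven behaviour described above.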
Mistake-Driven Learning

Consider the "mistake-driven" SGD update rule

    w^{(t+1)} = w^{(t)} - \eta_t (y_n^{(t)} - y_n) x_n

Let's, from now on, assume the binary labels to be {-1, +1}, not {0, 1}. Then

    y_n^{(t)} - y_n = \begin{cases} -2 y_n & \text{if } y_n^{(t)} \ne y_n \\ 0 & \text{if } y_n^{(t)} = y_n \end{cases}

Thus whenever the model mispredicts (i.e., y_n^{(t)} \ne y_n), we update w^{(t)} as

    w^{(t+1)} = w^{(t)} + 2 \eta_t y_n x_n

For \eta_t = 0.5, this is akin to the Perceptron (a hyperplane-based learning algorithm):

    w^{(t+1)} = w^{(t)} + y_n x_n

Note: There are other ways of deriving the Perceptron rule (will see shortly)
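The case analysis for y_n^{(t)} - y_n with {-1, +1} labels, and the resulting update, can be checked mechanically; a small sketch (names are illustrative):

```python
import numpy as np

# Exhaustively verify: y_hat - y equals -2*y on a mistake, 0 otherwise.
for y in (-1, +1):
    for y_hat in (-1, +1):
        assert (y_hat - y) == (-2 * y if y_hat != y else 0)

def mistake_driven_update(w, x, y, lr=0.5):
    """One update with labels in {-1, +1}.
    For lr = 0.5 this reduces to the Perceptron rule w <- w + y*x."""
    y_hat = 1 if w @ x >= 0 else -1
    if y_hat != y:                  # update only on a mispredicted example
        w = w + 2 * lr * y * x      # uses y_hat - y = -2*y
    return w
```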
The Perceptron Algorithm

One of the earliest algorithms for binary classification (Rosenblatt, 1958)

Learns a linear hyperplane to separate the two classes

A very simple, mistake-driven online learning algorithm. Learns using one example at a time. Also highly scalable for large training data sets (like SGD-based logistic regression and other SGD-based learning algorithms)

Guaranteed to find a separating hyperplane if the data is linearly separable

If the data is not linearly separable:

First make it linearly separable (more when we discuss Kernel Methods)

... or use multi-layer Perceptrons (more when we discuss Deep Learning)
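A full training loop for the algorithm described above might look as follows (a sketch; the bias is absorbed by appending a constant feature, and `max_epochs` caps the loop in case the data is not separable):

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """Perceptron for labels in {-1, +1}.
    Converges to a separating hyperplane iff the data is linearly separable."""
    X = np.hstack([X, np.ones((X.shape[0], 1))])   # absorb bias b into w
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_n, y_n in zip(X, y):
            if y_n * (w @ x_n) <= 0:    # functional margin <= 0: a mistake
                w += y_n * x_n          # Perceptron update
                mistakes += 1
        if mistakes == 0:               # a full pass with no mistakes: done
            break
    return w[:-1], w[-1]                # (w, b)
```

Each pass touches one example at a time, which is what makes the algorithm online and cheap per update.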
Hyperplanes and Margins
Hyperplanes

Separates a D-dimensional space into two half-spaces (positive and negative)

Defined by a normal vector w ∈ R^D (pointing towards the positive half-space)

w is orthogonal to any vector x lying on the hyperplane:

    w^\top x = 0 (equation of the hyperplane)

Assumption: the hyperplane passes through the origin. If not, add a bias b ∈ R:

    w^\top x + b = 0

b > 0 means moving the hyperplane parallel to w (b < 0 means in the opposite direction)
Hyperplane-based Linear Classification

Represents the decision boundary by a hyperplane w ∈ R^D

For binary classification, w is assumed to point towards the positive class

Classification rule: y = sign(w^\top x + b)

    w^\top x + b > 0 ⇒ y = +1
    w^\top x + b < 0 ⇒ y = -1

Note: y(w^\top x + b) < 0 means a mistake on training example (x, y)

Note: Some algorithms that we have already seen (e.g., "distance from means", logistic regression, etc.) can also be viewed as learning hyperplanes
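The classification rule and the mistake condition translate directly into code; a small sketch with illustrative names:

```python
import numpy as np

def predict(w, b, x):
    """Hyperplane classification rule: y = sign(w^T x + b)."""
    return 1 if w @ x + b > 0 else -1

def is_mistake(w, b, x, y):
    """y * (w^T x + b) < 0 means (x, y) lies on the wrong side."""
    return y * (w @ x + b) < 0
```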
Notion of Margins

Geometric margin γ_n of an example x_n is its signed distance from the hyperplane:

    \gamma_n = \frac{w^\top x_n + b}{\|w\|}

The geometric margin may be positive or negative, depending on which side of the hyperplane x_n lies.

Margin of a set {x_1, ..., x_N} w.r.t. w is the minimum absolute geometric margin:

    \gamma = \min_{1 \le n \le N} |\gamma_n| = \min_{1 \le n \le N} \frac{|w^\top x_n + b|}{\|w\|}

Functional margin of w on a training example (x_n, y_n): y_n(w^\top x_n + b)

Positive if w predicts y_n correctly

Negative if w predicts y_n incorrectly
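These three margin definitions can be written down directly; a short NumPy sketch (function names are ours):

```python
import numpy as np

def geometric_margin(w, b, x):
    """Signed distance of x from the hyperplane w^T x + b = 0."""
    return (w @ x + b) / np.linalg.norm(w)

def dataset_margin(w, b, X):
    """Margin of a set: the minimum absolute geometric margin."""
    return min(abs(geometric_margin(w, b, x)) for x in X)

def functional_margin(w, b, x, y):
    """y * (w^T x + b): positive iff (x, y) is classified correctly."""
    return y * (w @ x + b)
```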
Machine Learning (CS771A) Online Learning via Stochastic Optimization, Perceptron, and Intro to SVMs 8
A Loss Function for Hyperplane based Classification

For a hyperplane based model, let's consider the following loss function:

ℓ(w, b) = Σ_{n=1}^N ℓ_n(w, b) = Σ_{n=1}^N max{0, −y_n(w^T x_n + b)}

Seems natural: if y_n(w^T x_n + b) ≥ 0, the loss on (x_n, y_n) is 0; otherwise the model incurs the positive loss −y_n(w^T x_n + b)

Let's perform stochastic optimization on this loss. The stochastic (sub-)gradients are

∂ℓ_n(w, b)/∂w = −y_n x_n and ∂ℓ_n(w, b)/∂b = −y_n

(when w makes a mistake on (x_n, y_n); both are zero otherwise)

Upon every mistake, the update rule for w and b (assuming learning rate = 1) is

w = w + y_n x_n
b = b + y_n

These updates define the Perceptron algorithm
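The loss and its (sub-)gradients can be written directly from the formulas above; a short NumPy sketch with made-up toy values:

```python
import numpy as np

def perceptron_loss(w, b, x, y):
    # max{0, -y (w^T x + b)}
    return max(0.0, -y * (w @ x + b))

def subgradients(w, b, x, y):
    # (d loss / dw, d loss / db); nonzero only on a mistake
    if y * (w @ x + b) <= 0:
        return -y * x, -y
    return np.zeros_like(w), 0.0

w, b = np.array([1.0, 0.0]), 0.0
x, y = np.array([2.0, 1.0]), -1        # w^T x + b = 2, so y(w^T x + b) = -2 < 0
print(perceptron_loss(w, b, x, y))     # 2.0
g_w, g_b = subgradients(w, b, x, y)
print(g_w, g_b)                        # [2. 1.] 1
```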
The Perceptron Algorithm

Given: Training data D = {(x_1, y_1), . . . , (x_N, y_N)}
Initialize: w_old = [0, . . . , 0], b_old = 0

Repeat until convergence:
For a random (x_n, y_n) ∈ D:
if sign(w^T x_n + b) ≠ y_n, i.e., y_n(w^T x_n + b) ≤ 0 (a mistake is made)
w_new = w_old + y_n x_n
b_new = b_old + y_n

Stopping condition: stop when either

All training examples are classified correctly
(May overfit, so less common in practice)

A fixed number of iterations has been completed, or some convergence criterion is met

One pass over the data has been completed (each example seen once)
(E.g., when examples arrive in a streaming fashion and can't be stored in memory)
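The algorithm above can be sketched in a few lines of NumPy. This version sweeps over the data cyclically rather than sampling at random (a common practical variant) and stops after a mistake-free pass; the toy data are made up:

```python
import numpy as np

def perceptron_train(X, y, max_passes=100):
    """Perceptron: sweep over the data, updating on every mistake.
    Stops after a mistake-free pass or max_passes sweeps."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_passes):
        mistakes = 0
        for x_n, y_n in zip(X, y):
            if y_n * (x_n @ w + b) <= 0:   # mistake (ties count as mistakes)
                w += y_n * x_n             # w = w + y_n x_n
                b += y_n                   # b = b + y_n
                mistakes += 1
        if mistakes == 0:                  # all examples classified correctly
            break
    return w, b

# Linearly separable toy data: label is the sign of the first coordinate
X = np.array([[2.0, 1.0], [1.0, -1.0], [-2.0, 0.5], [-1.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
assert all(y_n * (x_n @ w + b) > 0 for x_n, y_n in zip(X, y))
```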
Why Perceptron Updates Work?

Let's look at a misclassified positive example (y_n = +1)

The Perceptron (wrongly) predicts that w_old^T x_n + b_old < 0

The updates would be

w_new = w_old + y_n x_n = w_old + x_n (since y_n = +1)
b_new = b_old + y_n = b_old + 1

Then

w_new^T x_n + b_new = (w_old + x_n)^T x_n + b_old + 1
                    = (w_old^T x_n + b_old) + x_n^T x_n + 1

Since x_n^T x_n = ||x_n||^2 ≥ 0, w_new^T x_n + b_new is less negative than w_old^T x_n + b_old

So we are making ourselves more correct on this example!
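The derivation is easy to check numerically; a small NumPy sketch with a made-up misclassified positive example:

```python
import numpy as np

# A misclassified positive example: w_old^T x + b_old < 0 but y = +1
w_old, b_old = np.array([-1.0, 0.0]), 0.0
x, y = np.array([2.0, 1.0]), 1
score_old = w_old @ x + b_old          # -2.0: wrongly negative

w_new = w_old + y * x                  # perceptron update on a mistake
b_new = b_old + y
score_new = w_new @ x + b_new          # 4.0: now positive

print(score_old, score_new)            # -2.0 4.0
# The score increased by exactly x^T x + 1, as in the derivation
assert np.isclose(score_new - score_old, x @ x + 1)
```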
Why Perceptron Updates Work (Pictorially)?

w_new = w_old + x
Why Perceptron Updates Work?

Now let's look at a misclassified negative example (y_n = −1)

The Perceptron (wrongly) predicts that w_old^T x_n + b_old > 0

The updates would be

w_new = w_old + y_n x_n = w_old − x_n (since y_n = −1)
b_new = b_old + y_n = b_old − 1

Then

w_new^T x_n + b_new = (w_old − x_n)^T x_n + b_old − 1
                    = (w_old^T x_n + b_old) − x_n^T x_n − 1

Since x_n^T x_n = ||x_n||^2 ≥ 0, w_new^T x_n + b_new is less positive than w_old^T x_n + b_old

So we are making ourselves more correct on this example!
Why Perceptron Updates Work (Pictorially)?

w_new = w_old − x
Convergence of Perceptron

Theorem (Block & Novikoff): If the training data is linearly separable with margin γ by a unit-norm hyperplane w_* (||w_*|| = 1) with b = 0, then the Perceptron converges after at most R^2/γ^2 mistakes during training (assuming ||x|| ≤ R for all x).

Proof:

Margin of w_* on any example (x_n, y_n): y_n w_*^T x_n / ||w_*|| = y_n w_*^T x_n ≥ γ

Consider the (k+1)-th mistake: y_n w_k^T x_n ≤ 0, and the update w_{k+1} = w_k + y_n x_n

w_{k+1}^T w_* = w_k^T w_* + y_n w_*^T x_n ≥ w_k^T w_* + γ (each mistake grows the alignment with w_* by at least γ, which is why this is nice)

Repeating over k mistakes (starting from w_1 = 0), we get w_{k+1}^T w_* ≥ kγ (1)

||w_{k+1}||^2 = ||w_k||^2 + 2 y_n w_k^T x_n + ||x_n||^2 ≤ ||w_k||^2 + R^2 (since y_n w_k^T x_n ≤ 0)

Repeating over k mistakes, we get ||w_{k+1}||^2 ≤ kR^2 (2)

Using (1), (2), the Cauchy–Schwarz inequality, and ||w_*|| = 1, we get kγ ≤ w_{k+1}^T w_* ≤ ||w_{k+1}|| ||w_*|| ≤ R√k, hence

k ≤ R^2/γ^2

Nice thing: the convergence rate does not depend on the number of training examples N or on the data dimensionality D. It depends only on the margin!
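The mistake bound can be checked empirically on a small dataset. The sketch below runs a bias-free Perceptron (matching the theorem's b = 0 setting) on made-up data that is separable by the unit-norm w_* = [1, 0], and verifies that the total mistake count stays within R^2/γ^2:

```python
import numpy as np

def perceptron_mistakes(X, y, max_passes=1000):
    # Perceptron without bias (b = 0), matching the theorem's setting;
    # returns the total number of mistakes until a mistake-free pass
    w, k = np.zeros(X.shape[1]), 0
    for _ in range(max_passes):
        m = 0
        for x_n, y_n in zip(X, y):
            if y_n * (w @ x_n) <= 0:
                w += y_n * x_n
                m += 1
        k += m
        if m == 0:
            return k
    return k

# Separable by w_* = [1, 0]: the label is the sign of the first coordinate
X = np.array([[1.0, 3.0], [2.0, -1.0], [-1.0, 2.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])
w_star = np.array([1.0, 0.0])
gamma = min(y_n * (w_star @ x_n) for x_n, y_n in zip(X, y))  # 1.0
R = max(np.linalg.norm(x_n) for x_n in X)                    # sqrt(13)
k = perceptron_mistakes(X, y)
print(k, R**2 / gamma**2)   # k is well within the bound R^2 / gamma^2 = 13
assert k <= R**2 / gamma**2
```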
![Page 68: Online Learning via Stochastic Optimization, …...Aug 20, 2016 Machine Learning (CS771A) Online Learning via Stochastic Optimization, Perceptron, and Intro to SVMs 1 Stochastic Gradient](https://reader035.fdocument.org/reader035/viewer/2022070917/5fb7e845657bd355481903a8/html5/thumbnails/68.jpg)
Convergence of Perceptron
Theorem (Block & Novikoff): If the training data is linearly separable with margin γ by a unit normhyperplane w∗ (||w∗|| = 1) with b = 0, then perceptron converges after R2/γ2 mistakes during training(assuming ||x || < R for all x).Proof:
Margin of w∗ on any arbitrary example (xn, yn):ynwT∗ xn
||w∗|| = ynwT∗ xn ≥ γ
Consider the (k + 1)thmistake: ynwTk xn ≤ 0, and update w k+1 = w k + ynxn
wTk+1w∗ = wT
k w∗ + ynwT∗ xn ≥ wT
k w∗ + γ (why is this nice?)
Repeating iteratively k times, we get wTk+1w∗ > kγ (1)
||w k+1||2 = ||w k ||2 + 2ynwTk xn + ||x ||2 ≤ ||w k ||2 + R2 (since ynwT
k xn ≤ 0)
Repeating iteratively k times, we get ||w k+1||2 ≤ kR2 (2)
Using (1), (2), and ||w∗|| = 1 , we get kγ < wTk+1w∗ ≤ ||w k+1|| ≤ R
√k
k ≤ R2/γ2
Nice Thing: Convergence rate does not depend on the number of training examples N or the datadimensionality D. Depends only on the margin!!!
Machine Learning (CS771A) Online Learning via Stochastic Optimization, Perceptron, and Intro to SVMs 15
![Page 69: Online Learning via Stochastic Optimization, …...Aug 20, 2016 Machine Learning (CS771A) Online Learning via Stochastic Optimization, Perceptron, and Intro to SVMs 1 Stochastic Gradient](https://reader035.fdocument.org/reader035/viewer/2022070917/5fb7e845657bd355481903a8/html5/thumbnails/69.jpg)
Convergence of Perceptron
Theorem (Block & Novikoff): If the training data is linearly separable with margin γ by a unit normhyperplane w∗ (||w∗|| = 1) with b = 0, then perceptron converges after R2/γ2 mistakes during training(assuming ||x || < R for all x).Proof:
Margin of w∗ on any arbitrary example (xn, yn):ynwT∗ xn
||w∗|| = ynwT∗ xn ≥ γ
Consider the (k + 1)thmistake: ynwTk xn ≤ 0, and update w k+1 = w k + ynxn
wTk+1w∗ = wT
k w∗ + ynwT∗ xn ≥ wT
k w∗ + γ (why is this nice?)
Repeating iteratively k times, we get wTk+1w∗ > kγ (1)
||w k+1||2 = ||w k ||2 + 2ynwTk xn + ||x ||2 ≤ ||w k ||2 + R2 (since ynwT
k xn ≤ 0)
Repeating iteratively k times, we get ||w k+1||2 ≤ kR2 (2)
Using (1), (2), and ||w∗|| = 1 , we get kγ < wTk+1w∗ ≤ ||w k+1|| ≤ R
√k
k ≤ R2/γ2
Nice Thing: Convergence rate does not depend on the number of training examples N or the datadimensionality D. Depends only on the margin!!!
Machine Learning (CS771A) Online Learning via Stochastic Optimization, Perceptron, and Intro to SVMs 15
The Best Hyperplane Separator?

The Perceptron finds one of the many possible hyperplanes separating the data... if one exists

Of the many possible choices, which one is the best?

Intuitively, we want the hyperplane having the maximum margin

A large margin leads to good generalization on the test data
Support Vector Machine (SVM)

Probably the most popular/influential classification algorithm

Backed by solid theoretical grounding (Cortes and Vapnik, 1995)

A hyperplane-based classifier (like the Perceptron)

Additionally uses the Maximum Margin Principle: it finds the hyperplane with the maximum separation margin on the training data
Support Vector Machine

A hyperplane-based linear classifier defined by w and b

Prediction rule: y = sign(w^T x + b)

Given: training data {(x_1, y_1), ..., (x_N, y_N)}

Goal: learn w and b that achieve the maximum margin

For now, assume the entire training data is correctly classified by (w, b)

Zero loss on the training examples (the non-zero loss case comes later)

Assume the hyperplane is scaled such that

w^T x_n + b ≥ +1 for y_n = +1
w^T x_n + b ≤ −1 for y_n = −1

Equivalently, y_n(w^T x_n + b) ≥ 1 for all n, with equality for the closest point(s), i.e., min_{1≤n≤N} |w^T x_n + b| = 1

The hyperplane's margin: γ = min_{1≤n≤N} |w^T x_n + b| / ||w|| = 1/||w||
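The canonical scaling and the resulting margin formula can be sanity-checked numerically. A minimal sketch (the hyperplane (w, b) and toy points below are hypothetical, not from the lecture) that rescales (w, b) so that min_n |w^T x_n + b| = 1 and then verifies γ = 1/||w||:

```python
import numpy as np

# Hypothetical separating hyperplane and toy points (illustrative values)
w = np.array([2.0, 0.0])
b = -1.0
X = np.array([[1.5, 0.0], [2.0, 1.0], [-0.5, 0.0], [-1.0, -1.0]])
y = np.array([+1, +1, -1, -1])

# Rescale (w, b) so the closest point satisfies |w^T x_n + b| = 1;
# this is the canonical form assumed on the slide (scaling both w and b
# by the same constant leaves the hyperplane itself unchanged)
scale = np.min(np.abs(X @ w + b))
w_c, b_c = w / scale, b / scale

# Margin via the definition, and via the closed form 1/||w||
margin = np.min(np.abs(X @ w_c + b_c)) / np.linalg.norm(w_c)
assert np.isclose(margin, 1.0 / np.linalg.norm(w_c))
```

For these values the canonical form is w_c = [1, 0], b_c = −0.5, so the margin works out to 1.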
Support Vector Machine: The Optimization Problem

We want to maximize the margin γ = 1/||w||

Maximizing the margin γ ⇔ minimizing ||w|| (the norm)

Our optimization problem:

Minimize f(w, b) = ||w||²/2

subject to y_n(w^T x_n + b) ≥ 1, n = 1, ..., N

This is a Quadratic Program (QP): a quadratic objective with N linear inequality constraints
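This QP is small enough to hand to a generic constrained solver. A sketch (not the method developed next lecture; the toy dataset is illustrative, and a general-purpose SLSQP solver stands in for a proper QP solver) that minimizes ||w||²/2 subject to the N margin constraints:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data, labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([+1, +1, -1, -1])

def objective(theta):
    w = theta[:-1]          # theta packs (w, b); b = theta[-1]
    return 0.5 * (w @ w)    # f(w, b) = ||w||^2 / 2

# One inequality constraint per example: y_n (w^T x_n + b) - 1 >= 0
constraints = [
    {"type": "ineq",
     "fun": lambda theta, x_n=x_n, y_n=y_n:
         y_n * (theta[:-1] @ x_n + theta[-1]) - 1.0}
    for x_n, y_n in zip(X, y)
]

res = minimize(objective, x0=np.zeros(X.shape[1] + 1),
               method="SLSQP", constraints=constraints)
w, b = res.x[:-1], res.x[-1]

# At the solution, every training point lies on or outside the margin
assert np.all(y * (X @ w + b) >= 1.0 - 1e-4)
```

Since the objective is convex and the constraints are linear, any local solution the solver returns is the global maximum-margin hyperplane.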
Large Margin = Good Generalization

Large margins intuitively mean good generalization

We can give a slightly more formal justification for this:

Recall: margin γ = 1/||w||

Large margin ⇒ small ||w||

Small ||w|| ⇒ regularized/simple solutions (the individual w_i's don't become too large)

Simple solutions ⇒ good generalization on test data
Next class:

Solving the SVM optimization problem

Introduction to kernel methods (nonlinear SVMs)