Machine Learning (CSE 446): Learning as Minimizing Loss (continued)

Noah Smith © 2017
University of Washington
[email protected]
October 25, 2017
Gradient Descent

Algorithm 1: GradientDescent
  Data: function F : R^d → R, number of iterations K, step sizes ⟨η^(1), ..., η^(K)⟩
  Result: z ∈ R^d
  initialize: z^(0) = 0;
  for k ∈ {1, ..., K} do
      g^(k) = ∇_z F(z^(k−1));
      z^(k) = z^(k−1) − η^(k) · g^(k);
  end
  return z^(K);
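Algorithm 1 can be sketched directly in NumPy. This is a minimal illustration, not part of the lecture; the quadratic test function and the constant step-size schedule are assumptions chosen for the example.

```python
import numpy as np

def gradient_descent(grad_F, d, K, etas):
    """Algorithm 1 (GradientDescent): minimize F given its gradient.

    grad_F: function returning the gradient of F at a point z
    d:      dimensionality of z
    K:      number of iterations
    etas:   sequence of step sizes eta^(1), ..., eta^(K)
    """
    z = np.zeros(d)              # initialize: z^(0) = 0
    for k in range(K):
        g = grad_F(z)            # g^(k) = grad_z F(z^(k-1))
        z = z - etas[k] * g      # z^(k) = z^(k-1) - eta^(k) * g^(k)
    return z                     # return z^(K)

# Example: minimize F(z) = ||z - c||^2, whose gradient is 2(z - c).
c = np.array([3.0, -1.0])
z_star = gradient_descent(lambda z: 2 * (z - c), d=2, K=100, etas=[0.1] * 100)
```

With a constant step size of 0.1 on this strongly convex function, the iterates contract toward the minimizer c geometrically.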
Gradient Descent

[Figure: illustration of gradient descent.]
Classification

[Figure: a neuron-inspired linear classifier. Each input x[1], x[2], x[3], ..., x[d] is multiplied (×) by its weight parameter w[1], w[2], w[3], ..., w[d]; the products are summed (∑) together with the bias parameter b, and the "activation" output ŷ answers "fire, or not?"]
Training (one example)

[Figure: the same neuron during training. The inputs x[1], ..., x[d], weight parameters w[1], ..., w[d], and bias parameter b produce the "activation" output ŷ as before; ŷ and the true label y now feed into a loss L, the output of the diagram.]
Training (one example)

[Figure: compact view of the same computation: x_n and w feed the weighted sum (∑) with bias b; together with the label y_n this produces the loss L_n.]
Training (D)

[Figure: the whole training set. Each example x_1, ..., x_n, ..., x_N passes through the same weights w and bias b; paired with its label y_1, ..., y_n, ..., y_N it produces a per-example loss L_1, ..., L_n, ..., L_N, and these are summed (∑) into the total loss L.]
Log Loss

The log loss is continuous, convex, and differentiable. It is closely related to the perceptron loss.

  L_n(w) = log (1 + exp (−y_n · (w · x_n + b)))

[Figure: plot of the log loss against y_n · (w · x_n + b), for values from −10 to 10 on the horizontal axis and 0 to 10 on the vertical axis.]
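The formula above translates directly into code. This small sketch (not from the lecture) uses NumPy's `logaddexp` to evaluate log(1 + exp(·)) without overflow; the particular w, b, x values are made up for illustration.

```python
import numpy as np

def log_loss(w, b, x, y):
    """L_n(w) = log(1 + exp(-y_n * (w . x_n + b))) for one example."""
    margin = y * (np.dot(w, x) + b)
    # np.logaddexp(0, -m) computes log(exp(0) + exp(-m)) = log(1 + exp(-m)) stably.
    return np.logaddexp(0.0, -margin)

w = np.array([1.0, -2.0]); b = 0.5
x = np.array([0.3, 0.1])
# The margin here is y * 0.6: a correct label (+1) gives a lower loss
# than the flipped label (-1), but the loss is never exactly zero.
loss_correct = log_loss(w, b, x, +1)
loss_wrong   = log_loss(w, b, x, -1)
```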
First Derivative of the Log Loss for One Example

∂L_n/∂w[j] = ∂/∂w[j] log (1 + exp (−y_n · (w · x_n + b)))

           = 1 / (1 + exp (−y_n · (w · x_n + b))) · ∂/∂w[j] [1 + exp (−y_n · (w · x_n + b))]

(Remember: ∂/∂x log f(x) = (∂/∂x f(x)) / f(x).)

           = 1 / (1 + exp (−y_n · (w · x_n + b))) · ∂/∂w[j] exp (−y_n · (w · x_n + b))

           = exp (−y_n · (w · x_n + b)) / (1 + exp (−y_n · (w · x_n + b))) · ∂/∂w[j] [−y_n · (w · x_n + b)]

(Remember: ∂/∂x exp f(x) = exp f(x) · ∂/∂x f(x).)

           = exp (−y_n · (w · x_n + b)) / (1 + exp (−y_n · (w · x_n + b))) · −y_n · x_n[j]

           = 1 / (1 + exp (y_n · (w · x_n + b))) · −y_n · x_n[j]

(The last step uses exp x / (1 + exp x) = (exp x / (1 + exp x)) · (exp (−x) / exp (−x)) = 1 / (1 + exp (−x)).)
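A good habit when deriving gradients by hand is to check the closed form against a finite-difference approximation. This sketch (my addition, with made-up w, b, x, y values) verifies the final line of the derivation:

```python
import numpy as np

def Ln(w, b, x, y):
    """Log loss for one example."""
    return np.log(1.0 + np.exp(-y * (np.dot(w, x) + b)))

def dLn_dw(w, b, x, y, j):
    """Final line of the derivation: 1/(1+exp(y*(w.x+b))) * (-y * x[j])."""
    return (1.0 / (1.0 + np.exp(y * (np.dot(w, x) + b)))) * (-y * x[j])

w = np.array([0.2, -0.7]); b = 0.1
x = np.array([1.5, 0.4]); y = -1
eps = 1e-6
for j in range(2):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[j] += eps; w_minus[j] -= eps
    # Central difference: (Ln(w + eps e_j) - Ln(w - eps e_j)) / (2 eps)
    numeric = (Ln(w_plus, b, x, y) - Ln(w_minus, b, x, y)) / (2 * eps)
    assert abs(numeric - dLn_dw(w, b, x, y, j)) < 1e-6
```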
Gradient of the Log Loss, All Examples

One example:

  1 / (1 + exp (y_n · (w · x_n + b))) · −y_n · x_n[j]

Sum over all examples:

  ∑_{n=1}^{N} 1 / (1 + exp (y_n · (w · x_n + b))) · −y_n · x_n[j]

The gradient stacks all first derivatives into a vector:

  ∑_{n=1}^{N} 1 / (1 + exp (y_n · (w · x_n + b))) · −y_n · x_n
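The stacked gradient can be computed for the whole training set in one vectorized expression. A NumPy sketch (my addition; the random data is for illustration only):

```python
import numpy as np

def log_loss_gradient(w, b, X, y):
    """Gradient of the summed log loss with respect to w.

    X: (N, d) matrix of inputs; y: (N,) vector of labels in {-1, +1}.
    Returns sum_n 1/(1 + exp(y_n * (w . x_n + b))) * (-y_n * x_n).
    """
    margins = y * (X @ w + b)                 # (N,) values y_n * (w . x_n + b)
    coeffs = 1.0 / (1.0 + np.exp(margins))    # per-example scalar coefficients
    return (coeffs * -y) @ X                  # (d,) weighted sum of the x_n

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
y = np.array([1, -1, 1, 1, -1])
g = log_loss_gradient(np.zeros(3), 0.0, X, y)
# At w = 0, b = 0 every coefficient is exactly 1/2.
```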
Perceptron Learning vs. Gradient Descent on Log Loss

Updating rule for the perceptron (single example n):

  w ← w + ⟦ŷ_n ≠ y_n⟧ · y_n · x_n

Gradient descent on log loss:

  w ← w − η · ∑_{n=1}^{N} 1 / (1 + exp (y_n · (w · x_n + b))) · −y_n · x_n

    = w + η · ∑_{n=1}^{N} 1 / (1 + exp (y_n · (w · x_n + b))) · y_n · x_n
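Placed side by side in code, the two updates differ only in the coefficient in front of y_n · x_n: a 0/1 indicator for the perceptron, a real number for gradient descent. A sketch (my addition, not from the lecture):

```python
import numpy as np

def perceptron_update(w, x, y, y_hat):
    """w <- w + [[y_hat != y]] * y * x  (Iverson bracket: 1 if wrong, else 0)."""
    return w + (1 if y_hat != y else 0) * y * x

def log_loss_update(w, b, X, y, eta):
    """w <- w + eta * sum_n 1/(1 + exp(y_n*(w . x_n + b))) * y_n * x_n."""
    coeffs = 1.0 / (1.0 + np.exp(y * (X @ w + b)))
    return w + eta * (coeffs * y) @ X

# The perceptron moves only on a mistake; log-loss descent always moves a little.
w_new = perceptron_update(np.zeros(2), np.array([1.0, 2.0]), y=1, y_hat=-1)
```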
An Interesting Function

  σ(x) = 1 / (1 + exp x)

[Figure: plot of σ(x) for x from −10 to 10; the curve decreases from near 1 to near 0, passing through 0.5 at x = 0.]

What we plug in to this function is y_n · (w · x_n + b).

If example n is getting classified correctly, σ(y_n · (w · x_n + b)) goes to zero.

If example n is getting classified incorrectly, σ(y_n · (w · x_n + b)) goes to one.

Not "is it wrong?" but instead "how wrong is it?"
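The limiting behavior described above is easy to confirm numerically (a small sketch of my own, using the slides' definition σ(x) = 1/(1 + exp x)):

```python
import numpy as np

def sigma(x):
    """sigma(x) = 1 / (1 + exp(x)): the "soft" error test from the slides."""
    return 1.0 / (1.0 + np.exp(x))

# Input is the margin y_n * (w . x_n + b):
#   large positive margin (confidently correct)   -> sigma near 0
#   zero margin (on the decision boundary)        -> sigma = 1/2
#   large negative margin (confidently incorrect) -> sigma near 1
margins = np.array([10.0, 0.0, -10.0])
values = sigma(margins)
```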
Perceptron Learning vs. Gradient Descent on Log Loss

Updating rule for the perceptron (single example n):

  w ← w + ⟦ŷ_n ≠ y_n⟧ · y_n · x_n

Gradient descent on log loss:

  w ← w − η · ∑_{n=1}^{N} 1 / (1 + exp (y_n · (w · x_n + b))) · −y_n · x_n

    = w + η · ∑_{n=1}^{N} 1 / (1 + exp (y_n · (w · x_n + b))) · y_n · x_n

    = w + ∑_{n=1}^{N} [η / (1 + exp (y_n · (w · x_n + b)))] · y_n · x_n

where the bracketed coefficient is a "soft" error test.
Logistic Regression

Classifier:

  f(x) = sign (w · x + b)

(Identical to the neuron-inspired classifier we trained using the perceptron!)

To learn, solve:

  min_{w,b} (1/N) ∑_{n=1}^{N} log (1 + exp (−y_n · (w · x_n + b))) + R(w, b)

where the summand is the log loss.
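Putting the pieces together, logistic regression can be trained with the gradient descent of Algorithm 1. This is a sketch of my own, assuming an L2 penalty (λ/2)·‖w‖² for R(w, b) and made-up hyperparameters and toy data:

```python
import numpy as np

def train_logistic_regression(X, y, K=500, eta=0.5, lam=0.01):
    """Gradient descent on (1/N) sum log-loss + (lam/2)||w||^2 (L2 as R(w, b))."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(K):
        coeffs = 1.0 / (1.0 + np.exp(y * (X @ w + b)))   # "soft" error tests
        grad_w = (coeffs * -y) @ X / N + lam * w
        grad_b = np.sum(coeffs * -y) / N
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b

def predict(w, b, X):
    """f(x) = sign(w . x + b)"""
    return np.sign(X @ w + b)

# Linearly separable toy data: label = sign of the first coordinate.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 2))
y = np.sign(X[:, 0])
w, b = train_logistic_regression(X, y)
```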
Linear Regression

For continuous output y.

Predictor:

  f(x) = w · x + b

To learn, solve:

  min_{w,b} (1/N) ∑_{n=1}^{N} (y_n − (w · x_n + b))² + R(w, b)

where the summand is the squared loss.
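The squared loss admits the same treatment: differentiate, then descend. A sketch of my own (dropping R(w, b) for simplicity, with a toy one-dimensional line-fitting problem):

```python
import numpy as np

def squared_loss_gradient(w, b, X, y):
    """Gradient of (1/N) sum (y_n - (w . x_n + b))^2 with respect to (w, b)."""
    N = len(y)
    residuals = y - (X @ w + b)
    return (-2.0 / N) * (residuals @ X), (-2.0 / N) * np.sum(residuals)

# Recover a known line y = 2x - 1 by gradient descent (no regularization).
X = np.linspace(-1, 1, 50).reshape(-1, 1)
y = 2.0 * X[:, 0] - 1.0
w, b = np.zeros(1), 0.0
for _ in range(2000):
    gw, gb = squared_loss_gradient(w, b, X, y)
    w -= 0.1 * gw
    b -= 0.1 * gb
```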
Coming Soon

- Gradient descent is usually too slow; we can do better.
- Is there a deeper connection between perceptron learning and gradient descent?
- More loss functions!
- More activation functions!
- More regularization functions!
- A probabilistic interpretation of logistic regression and linear regression.