Machine Learning (CSE 446): Learning as Minimizing Loss (continued)
Noah Smith
© 2017 University of Washington
[email protected]
October 25, 2017
Gradient Descent

Data: function F : Rd → R, number of iterations K, step sizes η(1), . . . , η(K)
Result: z ∈ Rd
initialize: z(0) = 0;
for k ∈ {1, . . . , K} do
    g(k) = ∇z F(z(k−1));
    z(k) = z(k−1) − η(k) · g(k);
end
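A minimal sketch of this pseudocode in Python with NumPy, assuming the gradient of F is supplied as a function `grad_F` (the names `gradient_descent`, `grad_F`, and `step_sizes` are illustrative, not from the slides):

```python
import numpy as np

def gradient_descent(grad_F, d, K, step_sizes):
    """Minimize F : R^d -> R given its gradient, following the pseudocode above."""
    z = np.zeros(d)                      # initialize: z(0) = 0
    for k in range(K):
        g = grad_F(z)                    # g(k) = grad_z F(z(k-1))
        z = z - step_sizes[k] * g        # z(k) = z(k-1) - eta(k) * g(k)
    return z

# Example: F(z) = ||z - a||^2 has gradient 2(z - a) and minimizer z = a.
a = np.array([1.0, -2.0])
z_star = gradient_descent(lambda z: 2 * (z - a), d=2, K=100, step_sizes=[0.1] * 100)
```

With a constant step size of 0.1 the iterates contract toward `a` geometrically, so 100 iterations land essentially on the minimizer.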
Gradient Descent

[Figures (slides 4–8): illustration of gradient descent, and the training data D plotted against feature x[1].]
Log Loss

The log loss is continuous, convex, and differentiable. It is closely related to the perceptron loss.

[Figure: plot of the log loss; horizontal axis −10 to 10, vertical axis 0 to 10.]

Ln(w) = log (1 + exp (−yn · (w · xn + b)))
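The log loss for one example can be computed in a numerically stable way; this sketch (function name `log_loss` is mine, not from the slides) uses `np.logaddexp`, which evaluates log(1 + exp(·)) without overflowing for very negative margins:

```python
import numpy as np

def log_loss(w, b, x_n, y_n):
    """Log loss for one example (x_n, y_n), with label y_n in {-1, +1}."""
    margin = y_n * (np.dot(w, x_n) + b)
    # np.logaddexp(0, -margin) computes log(exp(0) + exp(-margin)) = log(1 + exp(-margin)).
    return np.logaddexp(0.0, -margin)
```

At margin 0 the loss is log 2, and it decreases toward 0 as the margin grows.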
First Derivative of the Log Loss for One Example

∂Ln/∂w[j] = ∂/∂w[j] log [1 + exp (−yn · (w · xn + b))]

  = 1/(1 + exp (−yn · (w · xn + b))) · ∂/∂w[j] [1 + exp (−yn · (w · xn + b))]

  Remember: ∂/∂x log f(x) = (1/f(x)) · ∂f(x)/∂x.

  = 1/(1 + exp (−yn · (w · xn + b))) · ∂/∂w[j] exp (−yn · (w · xn + b))

  = exp (−yn · (w · xn + b))/(1 + exp (−yn · (w · xn + b))) · ∂/∂w[j] [−yn · (w · xn + b)]

  Remember: ∂/∂x exp f(x) = exp f(x) · ∂f(x)/∂x.

  = exp (−yn · (w · xn + b))/(1 + exp (−yn · (w · xn + b))) · (−yn · xn[j])

  = −yn · xn[j]/(1 + exp (yn · (w · xn + b)))

  Remember: exp x/(1 + exp x) = 1/(1 + exp (−x)).
One example:

∂Ln/∂w[j] = −yn · xn[j]/(1 + exp (yn · (w · xn + b)))

Sum over all examples:

∂/∂w[j] Σ_{n=1}^N Ln = Σ_{n=1}^N −yn · xn[j]/(1 + exp (yn · (w · xn + b)))

Gradient stacks all first derivatives into a vector:

∇w Σ_{n=1}^N Ln = Σ_{n=1}^N −yn · xn/(1 + exp (yn · (w · xn + b)))
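The stacked gradient above vectorizes naturally; here is a sketch (function name `log_loss_gradient` is mine) computing it over the whole training set at once:

```python
import numpy as np

def log_loss_gradient(w, b, X, y):
    """Gradient of the summed log loss; rows of X are examples, y has labels in {-1, +1}."""
    margins = y * (X @ w + b)                 # y_n * (w . x_n + b), one entry per example
    s = 1.0 / (1.0 + np.exp(margins))         # 1 / (1 + exp(y_n * (w . x_n + b)))
    grad_w = -(s * y) @ X                     # sum_n  -s_n * y_n * x_n
    grad_b = -np.sum(s * y)                   # same coefficient, times d/db (w . x_n + b) = 1
    return grad_w, grad_b
```

A quick finite-difference check against the summed loss confirms the formula.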
Updating rule for the perceptron (single example n):

w ← w + ⟦ŷn ≠ yn⟧ · yn · xn

Gradient descent on log loss:

w ← w − η · Σ_{n=1}^N −yn · xn/(1 + exp (yn · (w · xn + b)))

  = w + η · Σ_{n=1}^N yn · xn/(1 + exp (yn · (w · xn + b)))
[Figure (slides 19–23): plot of σ(x), vertical axis 0.0 to 0.8.]

σ(x) = 1/(1 + exp x)

What we plug in to this function is yn · (w · xn + b). If example n is getting classified correctly, σ(yn · (w · xn + b)) goes to zero. If example n is getting classified incorrectly, σ(yn · (w · xn + b)) goes to one. Not “is it wrong?” but instead “how wrong is it?”
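Note the sign convention: this σ is 1/(1 + exp x), not the usual logistic 1/(1 + exp(−x)), so it decreases from 1 to 0. A tiny numeric check of the limiting behavior described above:

```python
import numpy as np

def sigma(x):
    """sigma(x) = 1 / (1 + exp(x)); note the +x, so large positive margins map near 0."""
    return 1.0 / (1.0 + np.exp(x))

# A large positive margin (confidently correct) gives a value near 0;
# a large negative margin (confidently wrong) gives a value near 1.
```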
Perceptron Learning vs. Gradient Descent on Log Loss

Updating rule for the perceptron (single example n):

w ← w + ⟦ŷn ≠ yn⟧ · yn · xn

Gradient descent on log loss:

w ← w − η · Σ_{n=1}^N −σ(yn · (w · xn + b)) · yn · xn

  = w + η · Σ_{n=1}^N σ(yn · (w · xn + b)) · yn · xn

  = w + Σ_{n=1}^N [η · σ(yn · (w · xn + b))] · yn · xn
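The two update rules differ only in the coefficient multiplying yn · xn: a hard 0/1 indicator for the perceptron versus the soft weight η · σ(·) for gradient descent on log loss. A sketch of both (function names are mine, not from the slides):

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(x))

def perceptron_update(w, x_n, y_n, b=0.0):
    """Mistake-driven: coefficient is the 0/1 indicator of a misclassification."""
    y_hat = np.sign(np.dot(w, x_n) + b)
    coeff = 1.0 if y_hat != y_n else 0.0
    return w + coeff * y_n * x_n

def log_loss_update(w, X, y, eta, b=0.0):
    """Gradient step: coefficient eta * sigma(margin) measures *how* wrong each example is."""
    coeffs = eta * sigma(y * (X @ w + b))
    return w + (coeffs * y) @ X
```

The perceptron update fires only on mistakes; the log-loss update nudges w on every example, with magnitude shrinking as an example becomes confidently correct.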
Logistic Regression

Predictor:

ŷ = sign (w · x + b)

(Identical to the neuron-inspired classifier we trained using the perceptron!)

To learn, solve:

min_{w,b} Σ_{n=1}^N log (1 + exp (−yn · (w · xn + b))) + R(w, b)
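Putting the pieces together, here is a minimal trainer for this objective. The slides leave R(w, b) unspecified; this sketch assumes the common L2 choice R(w, b) = (λ/2)‖w‖² purely for illustration, and all function names are mine:

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(x))

def train_logistic_regression(X, y, eta=0.1, K=500, lam=0.01):
    """Gradient descent on the L2-regularized log loss; labels y in {-1, +1}."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(K):
        s = sigma(y * (X @ w + b))           # per-example "how wrong" weights
        w += eta * ((s * y) @ X - lam * w)   # descend the log loss + L2 penalty
        b += eta * np.sum(s * y)             # bias gets no regularization here
    return w, b

def predict(w, b, X):
    return np.sign(X @ w + b)
```

On a small linearly separable toy set this reaches perfect training accuracy.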
Coming Soon

▶ Gradient descent is usually too slow; we can do better.
▶ Is there a deeper connection between perceptron learning and gradient descent?
▶ More loss functions!
▶ More activation functions!
▶ More regularization functions!
▶ A probabilistic interpretation of logistic regression and linear regression.