27
Machine Learning (CSE 446): Learning as Minimizing Loss (continued) Noah Smith c 2017 University of Washington October 25, 2017 1 / 27

others
• Category

Documents

• view

5

0

Transcript of Machine Learning (CSE 446): Learning as Minimizing Loss ...

Machine Learning (CSE 446):Learning as Minimizing Loss (continued)

University of [email protected]

October 25, 2017

1 / 27

warning

2 / 27

Data: function F : Rd → R, number of iterations K, step sizes 〈η(1), . . . , η(K)〉Result: z ∈ Rd

initialize: z(0) = 0;for k ∈ {1, . . . ,K} do

g(k) = ∇zF (z(k−1));

z(k) = z(k−1) − η(k) · g(k);

end

3 / 27

4 / 27

x[1]

w[1] ×

x[2] w[2] ×

x[3] w[3] ×

x[d] w[d] ×

b

!

fire, or not?ŷ

bias parameter

weight parameters

“activation” output

input

Classification

5 / 27

x[1]

w[1] ×

x[2] w[2] ×

x[3] w[3] ×

x[d] w[d] ×

b

! ŷ

bias parameter

weight parameters

“activation” output

input y

L

output

loss

Training (one example)

6 / 27

xn w × ∑

b

Ln

yn

Training (one example)

7 / 27

x1

×

xn w ×

xN ×

b

y1

Ln L

L1

LN

yn

yN

Training (D)

8 / 27

Log LossThe log loss is continuous, convex, and differentiable. It is closely related to theperceptron loss.

-10 -5 0 5 10

02

46

810

Ln(w) = log (1 + exp (−yn · (w · xn + b)))9 / 27

First Derivative of the Log Loss for One Example

∂Ln

∂w[j]=

∂w[j]log (1 + exp (−yn · (w · xn + b)))

10 / 27

First Derivative of the Log Loss for One Example

∂Ln

∂w[j]=

∂w[j]log (1 + exp (−yn · (w · xn + b)))

=1

1 + exp (−yn · (w · xn + b))· ∂

∂w[j][1 + exp (−yn · (w · xn + b))]

Remember: ∂∂x log f(x) =

∂∂x

f(x)

f(x) .

11 / 27

First Derivative of the Log Loss for One Example

∂Ln

∂w[j]=

∂w[j]log (1 + exp (−yn · (w · xn + b)))

=1

1 + exp (−yn · (w · xn + b))· ∂

∂w[j][1 + exp (−yn · (w · xn + b))]

=1

1 + exp (−yn · (w · xn + b))· ∂

∂w[j]exp (−yn · (w · xn + b))

12 / 27

First Derivative of the Log Loss for One Example

∂Ln

∂w[j]=

∂w[j]log (1 + exp (−yn · (w · xn + b)))

=1

1 + exp (−yn · (w · xn + b))· ∂

∂w[j][1 + exp (−yn · (w · xn + b))]

=1

1 + exp (−yn · (w · xn + b))· ∂

∂w[j]exp (−yn · (w · xn + b))

=exp (−yn · (w · xn + b))

1 + exp (−yn · (w · xn + b))· ∂

∂w[j][−yn · (w · xn + b)]

Remember: ∂∂x exp f(x) = exp f(x) · ∂

∂xf(x).

13 / 27

First Derivative of the Log Loss for One Example

∂Ln

∂w[j]=

∂w[j]log (1 + exp (−yn · (w · xn + b)))

=1

1 + exp (−yn · (w · xn + b))· ∂

∂w[j][1 + exp (−yn · (w · xn + b))]

=1

1 + exp (−yn · (w · xn + b))· ∂

∂w[j]exp (−yn · (w · xn + b))

=exp (−yn · (w · xn + b))

1 + exp (−yn · (w · xn + b))· ∂

∂w[j][−yn · (w · xn + b)]

=exp (−yn · (w · xn + b))

1 + exp (−yn · (w · xn + b))· −yn · xn[j]

14 / 27

First Derivative of the Log Loss for One Example

∂Ln

∂w[j]=

∂w[j]log (1 + exp (−yn · (w · xn + b)))

=1

1 + exp (−yn · (w · xn + b))· ∂

∂w[j][1 + exp (−yn · (w · xn + b))]

=1

1 + exp (−yn · (w · xn + b))· ∂

∂w[j]exp (−yn · (w · xn + b))

=exp (−yn · (w · xn + b))

1 + exp (−yn · (w · xn + b))· ∂

∂w[j][−yn · (w · xn + b)]

=exp (−yn · (w · xn + b))

1 + exp (−yn · (w · xn + b))· −yn · xn[j]

=1

1 + exp (yn · (w · xn + b))· −yn · xn[j]

expx1+expx = expx

1+expx ·exp−xexp−x = 1

1+exp−x .

15 / 27

Gradient of the Log Loss, All Examples

One example:

1

1 + exp (yn · (w · xn + b))· −yn · xn[j]

Sum over all examples:

N∑n=1

1

1 + exp (yn · (w · xn + b))· −yn · xn[j]

Gradient stacks all first derivatives into a vector:

N∑n=1

1

1 + exp (yn · (w · xn + b))· −yn · xn

16 / 27

Perceptron Learning vs. Gradient Descent on Log Loss

Updating rule for the perceptron (single example n):

w← w + Jyn 6= ynK · yn · xn

17 / 27

Perceptron Learning vs. Gradient Descent on Log Loss

Updating rule for the perceptron (single example n):

w← w + Jyn 6= ynK · yn · xn

w← w − η ·N∑

n=1

1

1 + exp (yn · (w · xn + b))· −yn · xn

= w + η ·N∑

n=1

1

1 + exp (yn · (w · xn + b))· yn · xn

18 / 27

An Interesting Function

-10 -5 0 5 10

0.0

0.4

0.8

σ(x) =1

1 + expx

19 / 27

An Interesting Function

-10 -5 0 5 10

0.0

0.4

0.8

σ(x) =1

1 + expx

What we plug in to this function isyn · (w · xn + b).

20 / 27

An Interesting Function

-10 -5 0 5 10

0.0

0.4

0.8

σ(x) =1

1 + expx

What we plug in to this function isyn · (w · xn + b).If example n is getting classified correctly,σ(yn · (w · xn + b)) goes to zero.

21 / 27

An Interesting Function

-10 -5 0 5 10

0.0

0.4

0.8

σ(x) =1

1 + expx

What we plug in to this function isyn · (w · xn + b).If example n is getting classified correctly,σ(yn · (w · xn + b)) goes to zero.If example n is getting classified incorrectly,σ(yn · (w · xn + b)) goes to one.

22 / 27

An Interesting Function

-10 -5 0 5 10

0.0

0.4

0.8

σ(x) =1

1 + expx

What we plug in to this function isyn · (w · xn + b).If example n is getting classified correctly,σ(yn · (w · xn + b)) goes to zero.If example n is getting classified incorrectly,σ(yn · (w · xn + b)) goes to one.Not “is it wrong?” but instead “how wrongis it?”

23 / 27

Perceptron Learning vs. Gradient Descent on Log LossUpdating rule for the perceptron (single example n):

w← w + Jyn 6= ynK · yn · xn

w← w − η ·N∑

n=1

1

1 + exp (yn · (w · xn + b))· −yn · xn

= w + η ·N∑

n=1

1

1 + exp (yn · (w · xn + b))· yn · xn

= w +

N∑n=1

η

1 + exp (yn · (w · xn + b))︸ ︷︷ ︸“soft” error test

·yn · xn

24 / 27

Logistic Regression

Classifier:

f(x) = sign (w · x+ b)

(Identical to the neuron-inspired classifier we trained using the perceptron!)

To learn, solve:

minw,b

1

N

N∑n=1

log (1 + exp (−yn · (w · xn + b)))︸ ︷︷ ︸log loss

+R(w, b)

25 / 27

Linear RegressionFor continuous output y.

Predictor:

f(x) = w · x+ b

To learn, solve:

minw,b

1

N

N∑n=1

(yn − (w · xn + b))2︸ ︷︷ ︸squared loss

+R(w, b)

26 / 27

Coming Soon

I Gradient descent is usually too slow; we can do better.

I Is there a deeper connection between perceptron learning and gradient descent?

I More loss functions!

I More activation functions!

I More regularization functions!

I A probabilistic interpretation of logistic regression and linear regression.

27 / 27