
• Introduction to Machine Learning

Machine Learning: Jordan Boyd-Graber University of Maryland LOGISTIC REGRESSION FROM TEXT

Machine Learning: Jordan Boyd-Graber | UMD Introduction to Machine Learning | 1 / 18

• Reminder: Logistic Regression

\[ P(Y = 0 \,|\, X) = \frac{1}{1 + \exp\left(\beta_0 + \sum_i \beta_i X_i\right)} \tag{1} \]

\[ P(Y = 1 \,|\, X) = \frac{\exp\left(\beta_0 + \sum_i \beta_i X_i\right)}{1 + \exp\left(\beta_0 + \sum_i \beta_i X_i\right)} \tag{2} \]

Discriminative prediction: p(y|x). Classification uses include ad placement and spam detection.

What we didn't talk about is how to learn β from data.
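Equation (2) is easy to sketch in code. Below is a minimal Python illustration; the weights and feature values are invented for the example:

```python
import math

def predict_prob(beta0, beta, x):
    """Equation (2): P(Y = 1 | x) = exp(z) / (1 + exp(z)), z = beta0 + sum_i beta_i x_i."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return math.exp(z) / (1.0 + math.exp(z))

# Invented weights and features, just to exercise the formula.
p1 = predict_prob(-1.0, [2.0, -0.5], [1.0, 3.0])   # P(Y = 1 | x)
p0 = 1.0 - p1                                      # Equation (1): the two probabilities sum to one
```

For large positive z the exp can overflow; a numerically safer equivalent is 1 / (1 + exp(-z)).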


• Logistic Regression: Objective Function

\[ \mathcal{L} \equiv \ln p(Y \,|\, X, \beta) = \sum_j \ln p(y^{(j)} \,|\, x^{(j)}, \beta) \tag{3} \]

\[ = \sum_j \left[ y^{(j)} \left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) - \ln \left( 1 + \exp \left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) \right) \right] \tag{4} \]

Training data (y, x) are fixed. The objective function is a function of β . . . what values of β give a good value?
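As a sanity check, equation (4) can be computed directly on toy data. A minimal sketch, with invented labels, features, and weights:

```python
import math

def log_likelihood(beta0, beta, data):
    """Equation (4): sum over examples of y*z - ln(1 + exp(z)), z = beta0 + sum_i beta_i x_i."""
    total = 0.0
    for y, x in data:
        z = beta0 + sum(b * xi for b, xi in zip(beta, x))
        total += y * z - math.log(1.0 + math.exp(z))
    return total

# Two invented training examples: (label, feature vector).
data = [(1, [1.0, 0.0]), (0, [0.0, 1.0])]
ll = log_likelihood(0.0, [1.0, -1.0], data)   # higher (closer to 0) is better
```

Each summand is a log probability, so the objective is never positive; maximizing it pushes each example's predicted probability toward its label.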


• Convexity

Convex function

It doesn't matter where you start: if you keep "walking up" the objective, you reach the maximum.


Goal

Optimize log likelihood with respect to variables β

[Figure: gradient ascent on a plot of Objective vs. Parameter. Starting from an initial guess, numbered steps 0, 1, 2, 3 climb toward the maximum (labeled the "Undiscovered Country"); each step applies the update \( \beta_{ij}^{(l+1)} = \beta_{ij}^{(l)} + \alpha \frac{\partial J}{\partial \beta} \)]

Luckily, (vanilla) logistic regression is convex

Machine Learning: Jordan Boyd-Graber | UMD Introduction to Machine Learning | 5 / 18

To ease notation, let’s define

\[ \pi_i = \frac{\exp\left(\beta^T x_i\right)}{1 + \exp\left(\beta^T x_i\right)} \tag{5} \]

Our objective function is

\[ \mathcal{L} = \sum_i \log p(y_i \,|\, x_i) = \sum_i \mathcal{L}_i = \sum_i \begin{cases} \log \pi_i & \text{if } y_i = 1 \\ \log(1 - \pi_i) & \text{if } y_i = 0 \end{cases} \tag{6} \]
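A quick numerical check (with an invented weight vector) that the piecewise form in equation (6) agrees with the expanded per-example form y z − ln(1 + exp z) from equation (4):

```python
import math

def pi(beta, x):
    """Equation (5): pi_i = exp(beta^T x) / (1 + exp(beta^T x))."""
    z = sum(b * xi for b, xi in zip(beta, x))
    return math.exp(z) / (1.0 + math.exp(z))

def loss_piecewise(beta, x, y):
    """Equation (6): log pi if y = 1, log(1 - pi) if y = 0."""
    p = pi(beta, x)
    return math.log(p) if y == 1 else math.log(1.0 - p)

def loss_expanded(beta, x, y):
    """Per-example term of equation (4): y*z - ln(1 + exp(z))."""
    z = sum(b * xi for b, xi in zip(beta, x))
    return y * z - math.log(1.0 + math.exp(z))

beta = [0.5, -0.25]   # invented weights; bias can be folded into x as a constant feature
for y in (0, 1):
    assert abs(loss_piecewise(beta, [1.0, 2.0], y)
               - loss_expanded(beta, [1.0, 2.0], y)) < 1e-12
```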


• Taking the Derivative

Apply chain rule:

\[ \frac{\partial \mathcal{L}}{\partial \beta_j} = \sum_i \frac{\partial \mathcal{L}_i(\vec{\beta})}{\partial \beta_j} = \sum_i \begin{cases} \frac{1}{\pi_i} \frac{\partial \pi_i}{\partial \beta_j} & \text{if } y_i = 1 \\ -\frac{1}{1 - \pi_i} \frac{\partial \pi_i}{\partial \beta_j} & \text{if } y_i = 0 \end{cases} \tag{7} \]

If we plug in the derivative

\[ \frac{\partial \pi_i}{\partial \beta_j} = \pi_i (1 - \pi_i) x_j, \tag{8} \]

we can merge these two cases:

\[ \frac{\partial \mathcal{L}_i}{\partial \beta_j} = (y_i - \pi_i) x_j. \tag{9} \]
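Equation (9) can be checked against a finite-difference estimate of the per-example derivative (the weights and features below are arbitrary):

```python
import math

def pi(beta, x):
    """Equation (5): logistic function of beta^T x."""
    z = sum(b * xi for b, xi in zip(beta, x))
    return math.exp(z) / (1.0 + math.exp(z))

def loss(beta, x, y):
    """Equation (6): per-example log likelihood."""
    p = pi(beta, x)
    return math.log(p) if y == 1 else math.log(1.0 - p)

def grad_j(beta, x, y, j):
    """Equation (9): (y_i - pi_i) * x_j."""
    return (y - pi(beta, x)) * x[j]

beta, x, y, j, eps = [0.3, -0.7], [1.0, 2.0], 1, 0, 1e-6
bumped = list(beta)
bumped[j] += eps
numeric = (loss(bumped, x, y) - loss(beta, x, y)) / eps
assert abs(numeric - grad_j(beta, x, y, j)) < 1e-4
```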


\[ \nabla_\beta \mathcal{L}(\vec{\beta}) = \left[ \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_0}, \ldots, \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_n} \right] \tag{10} \]

Update

\[ \Delta \beta \equiv \eta \, \nabla_\beta \mathcal{L}(\vec{\beta}) \tag{11} \]

\[ \beta_i' \leftarrow \beta_i + \eta \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_i} \tag{12} \]


Why are we adding? What would we do if we wanted to do descent?


η: step size, must be greater than zero


NB: Conjugate gradient is usually better, but harder to implement
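Putting the gradient (9) and update (12) together gives batch gradient ascent. A minimal sketch; the step size, iteration count, and toy data below are arbitrary choices for illustration:

```python
import math

def pi(beta, x):
    """Equation (5), with the bias handled by a constant first feature."""
    z = sum(b * xi for b, xi in zip(beta, x))
    return math.exp(z) / (1.0 + math.exp(z))

def gradient_ascent(data, n_features, eta=0.1, iterations=1000):
    """Repeat update (12): beta_j <- beta_j + eta * dL/dbeta_j."""
    beta = [0.0] * n_features
    for _ in range(iterations):
        grad = [0.0] * n_features
        for y, x in data:
            err = y - pi(beta, x)            # equation (9): (y_i - pi_i)
            for j in range(n_features):
                grad[j] += err * x[j]
        beta = [b + eta * g for b, g in zip(beta, grad)]
    return beta

# Toy data with both labels present at each x, so the maximum is finite.
# The first feature is a constant 1 acting as the bias beta_0.
data = [(0, [1.0, 0.0]), (0, [1.0, 0.0]), (1, [1.0, 0.0]),
        (0, [1.0, 1.0]), (1, [1.0, 1.0]), (1, [1.0, 1.0])]
beta = gradient_ascent(data, 2)   # pi approaches 1/3 at x=0 and 2/3 at x=1
```

If the data were perfectly separable, β would grow without bound on every iteration, which is exactly the "What if β keeps getting bigger?" issue below.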


• Choosing Step Size

[Figure: a sequence of plots of Objective vs. Parameter illustrating how the choice of step size affects the gradient steps]

Machine Learning: Jordan Boyd-Graber | UMD Introduction to Machine Learning | 9 / 18

• Remaining issues

When to stop?

What if β keeps getting bigger?


• Regularized Conditional Log Likelihood

Unregularized

\[ \beta^* = \arg\max_\beta \sum_j \ln p(y^{(j)} \,|\, x^{(j)}, \beta) \tag{13} \]

Regularized

\[ \beta^* = \arg\max_\beta \sum_j \ln p(y^{(j)} \,|\, x^{(j)}, \beta) - \mu \]