
### Transcript of Logistic Regression Optimization (jbg/teaching/CMSC_470/04a_sgd.pdf)

• Logistic Regression Optimization

Natural Language Processing: Jordan Boyd-Graber, University of Maryland (Derivation)

Slides adapted from Emily Fox


• Reminder: Logistic Regression

$$P(Y=0|X) = \frac{1}{1+\exp\left(\beta_0 + \sum_i \beta_i X_i\right)} \qquad (1)$$

$$P(Y=1|X) = \frac{\exp\left(\beta_0 + \sum_i \beta_i X_i\right)}{1+\exp\left(\beta_0 + \sum_i \beta_i X_i\right)} \qquad (2)$$

Discriminative prediction: p(y|x). Classification uses: ad placement, spam detection.

What we didn't talk about is how to learn β from data.
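As a concrete illustration of equations (1)-(2), here is a minimal NumPy sketch; the weights and features are invented for the example:

```python
import numpy as np

def predict_proba(beta0, beta, x):
    """P(Y = 1 | X = x), equation (2); P(Y = 0 | x) is 1 minus this, equation (1)."""
    z = beta0 + np.dot(beta, x)          # beta_0 + sum_i beta_i X_i
    return np.exp(z) / (1.0 + np.exp(z))

# Toy example with made-up weights and features
beta0, beta = -1.0, np.array([2.0, -0.5])
x = np.array([1.0, 3.0])
print(predict_proba(beta0, beta, x))     # ~0.38, so predict Y = 0
```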


• Logistic Regression: Objective Function

$$\mathcal{L} \equiv \ln p(Y|X,\beta) = \sum_j \ln p(y^{(j)}\,|\,x^{(j)},\beta) \qquad (3)$$

$$= \sum_j \left[\, y^{(j)} \left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) - \ln\left( 1+\exp\left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) \right) \right] \qquad (4)$$

Training data (y, x) are fixed. The objective is a function of β: what values of β give a good value?
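To make the objective concrete, here is a small sketch of equation (4); the tiny data set is invented for illustration:

```python
import numpy as np

def log_likelihood(beta0, beta, X, y):
    """Equation (4): sum over examples of y * z - log(1 + exp(z))."""
    z = beta0 + X @ beta                 # z_j = beta_0 + sum_i beta_i x_i^(j)
    return np.sum(y * z - np.log1p(np.exp(z)))

# With the data fixed, the objective is a function of beta alone:
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(log_likelihood(-3.0, np.array([2.0]), X, y))  # ~ -0.72, a good beta
print(log_likelihood(0.0, np.array([0.0]), X, y))   # ~ -2.77, an uninformative beta
```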


• Convexity

Convex function: it doesn't matter where you start; if you "walk up" the objective, you reach the global optimum.

Which way is up? The gradient!


• Gradient Descent (non-convex)

Goal: Optimize the log likelihood with respect to the variables β

[Figure: the objective plotted against the parameter. Successive iterates 0, 1, 2, 3 climb from a start point until they stop at a local optimum, leaving a better "undiscovered country" region of the objective unexplored.]


Luckily, (vanilla) logistic regression is convex


• Gradient for Logistic Regression

To ease notation, let's define

$$\pi_i = \frac{\exp \beta^T x_i}{1+\exp \beta^T x_i} \qquad (5)$$

Our objective function is

$$\mathcal{L} = \sum_i \log p(y_i|x_i) = \sum_i \mathcal{L}_i = \sum_i \begin{cases} \log \pi_i & \text{if } y_i = 1 \\ \log(1-\pi_i) & \text{if } y_i = 0 \end{cases} \qquad (6)$$
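A quick numerical sanity check (with invented data) that the two cases of equation (6) agree with the familiar one-line form y log π + (1 − y) log(1 − π):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # equals exp(z) / (1 + exp(z)), equation (5)

beta = np.array([0.5, -1.0])             # made-up weights; bias folded into x
X = np.array([[1.0, 0.2], [1.0, 2.0], [1.0, -1.0]])
y = np.array([1, 0, 1])

pi = sigmoid(X @ beta)                                 # pi_i, equation (5)
cases  = np.where(y == 1, np.log(pi), np.log(1 - pi))  # equation (6)
merged = y * np.log(pi) + (1 - y) * np.log(1 - pi)     # single-expression form
print(np.allclose(cases, merged))                      # True
```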


• Taking the Derivative

Apply the chain rule:

$$\frac{\partial \mathcal{L}}{\partial \beta_j} = \sum_i \frac{\partial \mathcal{L}_i(\vec{\beta})}{\partial \beta_j} = \sum_i \begin{cases} \dfrac{1}{\pi_i} \dfrac{\partial \pi_i}{\partial \beta_j} & \text{if } y_i = 1 \\ \dfrac{1}{1-\pi_i} \left( -\dfrac{\partial \pi_i}{\partial \beta_j} \right) & \text{if } y_i = 0 \end{cases} \qquad (7)$$

If we plug in the derivative

$$\frac{\partial \pi_i}{\partial \beta_j} = \pi_i (1-\pi_i)\, x_{i,j}, \qquad (8)$$

we can merge these two cases:

$$\frac{\partial \mathcal{L}_i}{\partial \beta_j} = (y_i - \pi_i)\, x_{i,j}. \qquad (9)$$
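Equation (9) is easy to check numerically. Below is a sketch comparing it against a finite-difference estimate of the gradient (random data, invented for the check; the bias is folded into X):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loglik(beta, X, y):
    z = X @ beta
    return np.sum(y * z - np.log1p(np.exp(z)))   # equation (4)

def grad(beta, X, y):
    return X.T @ (y - sigmoid(X @ beta))         # equation (9), summed over i

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
beta = rng.normal(size=3)

eps = 1e-6
numeric = np.array([(loglik(beta + eps * e, X, y) - loglik(beta - eps * e, X, y))
                    / (2 * eps) for e in np.eye(3)])
print(np.allclose(grad(beta, X, y), numeric, atol=1e-4))   # True
```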


• Gradient for Logistic Regression

Gradient

$$\nabla_\beta \mathcal{L}(\vec{\beta}) = \left[ \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_0}, \ldots, \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_n} \right] \qquad (10)$$

Update

$$\Delta\beta \equiv \eta\, \nabla_\beta \mathcal{L}(\vec{\beta}) \qquad (11)$$

$$\beta_i' \leftarrow \beta_i + \eta\, \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_i} \qquad (12)$$

Why are we adding? What would we do if we wanted to do descent?

η: step size, must be greater than zero

NB: Conjugate gradient is usually better, but harder to implement
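Putting equations (9) and (12) together gives the whole learner. A minimal sketch on toy data, using plain gradient ascent rather than conjugate gradient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent(X, y, eta=0.1, n_steps=500):
    """Repeat update (12): beta <- beta + eta * gradient of the log likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        beta += eta * X.T @ (y - sigmoid(X @ beta))   # note the +: ascent
    return beta

# Toy data; the first column is a constant 1, so beta[0] plays the role of beta_0
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
beta = gradient_ascent(X, y)
print(sigmoid(X @ beta).round(2))   # predicted probabilities track the labels
```

Flipping the + to a − would descend; for minimizing the negative log likelihood, that is exactly what you would do.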

• Choosing Step Size

[Figure: a sequence of frames plotting the objective against the parameter, illustrating how the choice of step size affects the walk toward the optimum.]
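To see the effect numerically, here is a small experiment (invented, deliberately noisy data so the optimum is finite) running the same update with three step sizes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loglik(beta, X, y):
    z = X @ beta
    return np.sum(y * z - np.log1p(np.exp(z)))

# Noisy labels: no perfect linear separator, so the optimum is finite
X = np.array([[1.0, -2.0], [1.0, -0.5], [1.0, 0.5], [1.0, 2.0]])
y = np.array([0.0, 1.0, 0.0, 1.0])

for eta in (0.01, 0.5, 10.0):        # too small, reasonable, too large
    beta = np.zeros(2)
    for _ in range(50):
        beta += eta * X.T @ (y - sigmoid(X @ beta))
    print(f"eta={eta:5.2f}  log likelihood={loglik(beta, X, y):8.3f}")
```

With η too small the objective barely moves in 50 steps; with η too large the iterates overshoot the optimum and oscillate.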

• Remaining Issues

When to stop?

What if β keeps getting bigger?
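The deck leaves these questions open. A standard practical answer (an assumption here, not taken from the slides) is to stop when the gradient norm falls below a tolerance, and to keep β from growing without bound by subtracting an L2 penalty from the objective:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(X, y, eta=0.1, tol=1e-5, l2=0.1, max_steps=10_000):
    """Ascent on L(beta) - (l2 / 2) * ||beta||^2 (assumed remedy, not from the deck)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_steps):
        g = X.T @ (y - sigmoid(X @ beta)) - l2 * beta  # penalty shrinks beta toward 0
        if np.linalg.norm(g) < tol:                    # stop: gradient (nearly) zero
            return beta
        beta += eta * g
    return beta                                        # fallback: step budget exhausted
```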
