Logistic Regression Optimization - jbg/teaching/CMSC_470/04a_sgd.pdf

  • Logistic Regression Optimization

    Natural Language Processing: Jordan Boyd-Graber, University of Maryland

    DERIVATION

    Slides adapted from Emily Fox


  • Reminder: Logistic Regression

    P(Y = 0 \mid X) = \frac{1}{1 + \exp\left(\beta_0 + \sum_i \beta_i X_i\right)} \qquad (1)

    P(Y = 1 \mid X) = \frac{\exp\left(\beta_0 + \sum_i \beta_i X_i\right)}{1 + \exp\left(\beta_0 + \sum_i \beta_i X_i\right)} \qquad (2)

    Discriminative prediction: p(y | x)

    Classification uses: ad placement, spam detection

    What we didn’t talk about is how to learn β from data

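A minimal NumPy sketch of Equations (1)-(2); the function name and the convention that beta[0] holds the bias β0 are assumptions for illustration, not from the slides:

```python
import numpy as np

def class_probabilities(beta, x):
    """P(Y = 0 | x) and P(Y = 1 | x) for logistic regression, Eqs. (1)-(2).

    Assumed layout: beta[0] is the bias beta_0, beta[1:] are the feature
    weights, and x is the raw feature vector (no constant column).
    """
    z = beta[0] + np.dot(beta[1:], x)   # beta_0 + sum_i beta_i x_i
    p1 = np.exp(z) / (1.0 + np.exp(z))  # Eq. (2): P(Y = 1 | x)
    p0 = 1.0 - p1                       # Eq. (1): P(Y = 0 | x)
    return p0, p1
```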

  • Logistic Regression: Objective Function

    \mathcal{L} \equiv \ln p(Y \mid X, \beta) = \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) \qquad (3)

    = \sum_j \left[ y^{(j)} \Bigl( \beta_0 + \sum_i \beta_i x_i^{(j)} \Bigr) - \ln \Bigl( 1 + \exp \bigl( \beta_0 + \sum_i \beta_i x_i^{(j)} \bigr) \Bigr) \right] \qquad (4)

    Training data (y, x) are fixed. The objective is a function of β . . . which values of β give a good value?
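The objective in Equation (4) is easy to compute directly; a small sketch under the same assumed parameter layout as above (illustrative only):

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Eq. (4): sum_j [ y^(j) z_j - ln(1 + exp(z_j)) ], with z_j = beta_0 + sum_i beta_i x_i^(j).

    X: (n_examples, n_features) array, y: array of 0/1 labels,
    beta: weight vector, with beta[0] assumed to be the bias.
    """
    z = beta[0] + X @ beta[1:]
    return np.sum(y * z - np.log1p(np.exp(z)))
```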


  • Convexity

    Convex function

    Doesn’t matter where you start, if you “walk up” the objective

    Gradient!
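To make “walk up the objective” concrete, here is a toy one-dimensional sketch; the objective -(β - 3)² and the starting points are invented for illustration, but because it has a single peak, gradient ascent reaches the same maximum from any start:

```python
def gradient_ascent_1d(start, step=0.1, iters=100):
    """Climb f(beta) = -(beta - 3)**2 using its gradient f'(beta) = -2 * (beta - 3)."""
    beta = start
    for _ in range(iters):
        beta += step * (-2.0 * (beta - 3.0))  # beta <- beta + eta * f'(beta)
    return beta

print(gradient_ascent_1d(-10.0))  # ~3.0
print(gradient_ascent_1d(50.0))   # ~3.0: same optimum regardless of starting point
```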


  • Gradient Descent (non-convex)

    Goal

    Optimize log likelihood with respect to variables β

    [Figure: objective plotted against the parameter, with gradient steps 0, 1, 2, 3 climbing from a start point to a stopping point at a local optimum; a better region beyond it is labeled “undiscovered country”.]

    Luckily, (vanilla) logistic regression is convex

  • Gradient for Logistic Regression

    To ease notation, let’s define

    \pi_i = \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)} \qquad (5)

    Our objective function is

    \mathcal{L} = \sum_i \log p(y_i \mid x_i) = \sum_i \mathcal{L}_i = \sum_i \begin{cases} \log \pi_i & \text{if } y_i = 1 \\ \log(1 - \pi_i) & \text{if } y_i = 0 \end{cases} \qquad (6)

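A sketch of the per-example pieces in Equations (5)-(6); here the bias is assumed to be folded into x_i as a constant 1 feature, matching the β^T x_i notation (function names are made up for illustration):

```python
import numpy as np

def pi_i(beta, x_i):
    """Eq. (5): pi_i = exp(beta^T x_i) / (1 + exp(beta^T x_i))."""
    z = np.dot(beta, x_i)
    return np.exp(z) / (1.0 + np.exp(z))

def loss_i(beta, x_i, y_i):
    """Eq. (6): log(pi_i) if y_i = 1, log(1 - pi_i) if y_i = 0."""
    p = pi_i(beta, x_i)
    return np.log(p) if y_i == 1 else np.log(1.0 - p)
```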

  • Taking the Derivative

    Apply chain rule:

    \frac{\partial \mathcal{L}}{\partial \beta_j} = \sum_i \frac{\partial \mathcal{L}_i(\vec{\beta})}{\partial \beta_j} = \sum_i \begin{cases} \dfrac{1}{\pi_i} \dfrac{\partial \pi_i}{\partial \beta_j} & \text{if } y_i = 1 \\ -\dfrac{1}{1 - \pi_i} \dfrac{\partial \pi_i}{\partial \beta_j} & \text{if } y_i = 0 \end{cases} \qquad (7)

    If we plug in the derivative,

    \frac{\partial \pi_i}{\partial \beta_j} = \pi_i (1 - \pi_i) x_j, \qquad (8)

    we can merge these two cases:

    \frac{\partial \mathcal{L}_i}{\partial \beta_j} = (y_i - \pi_i) x_j. \qquad (9)

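Equation (9) makes the gradient cheap to compute for the whole training set at once; a vectorized sketch (again assuming a constant feature for the bias):

```python
import numpy as np

def gradient(beta, X, y):
    """Gradient of the log likelihood via Eq. (9): component j is sum_i (y_i - pi_i) x_ij."""
    z = X @ beta                        # beta^T x_i for every example
    pi = np.exp(z) / (1.0 + np.exp(z))  # Eq. (5)
    return X.T @ (y - pi)
```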

  • Gradient for Logistic Regression

    Gradient

    \nabla_{\beta} \mathcal{L}(\vec{\beta}) = \left[ \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_0}, \ldots, \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_n} \right] \qquad (10)

    Update

    \Delta \beta \equiv \eta \nabla_{\beta} \mathcal{L}(\vec{\beta}) \qquad (11)

    \beta_i' \leftarrow \beta_i + \eta \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_i} \qquad (12)

    Why are we adding? What would we do if we wanted to do descent?

    η: step size, must be greater than zero

    NB: Conjugate gradient is usually better, but harder to implement
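Putting Equations (10)-(12) together gives a plain batch gradient ascent loop; the step size η and iteration count below are arbitrary placeholders, not tuned values from the slides:

```python
import numpy as np

def fit_logistic_regression(X, y, eta=0.1, iters=1000):
    """Batch gradient ascent on the log likelihood, Eqs. (10)-(12).

    X is assumed to already contain a constant column for the bias.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        z = X @ beta
        pi = np.exp(z) / (1.0 + np.exp(z))
        grad = X.T @ (y - pi)       # gradient vector, Eq. (10)
        beta = beta + eta * grad    # Eq. (12): we add because we are maximizing
    return beta
```

Subtracting η times the gradient instead would perform descent on the negative log likelihood, which answers the “why are we adding?” question above.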

  • Choosing Step Size

    [Figure: objective plotted against the parameter, illustrating gradient steps of different sizes.]
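The effect the figure illustrates can also be seen numerically; a toy sketch on the same invented one-dimensional objective as before, where a tiny η barely moves, a moderate η converges, and a too-large η overshoots and diverges:

```python
def climb(eta, start=0.0, iters=20):
    """Gradient ascent on f(beta) = -(beta - 3)**2 with step size eta."""
    beta = start
    for _ in range(iters):
        beta += eta * (-2.0 * (beta - 3.0))
    return beta

print(climb(eta=0.01))  # ~1.0: still far from the optimum at 3, steps too small
print(climb(eta=0.4))   # ~3.0: converges
print(climb(eta=1.1))   # far from 3 and growing: each step overshoots more than the last
```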

  • Remaining issues

    When to stop?

    What if β keeps getting bigger?

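On “when to stop”, one common criterion (an assumption here, not something the slides prescribe) is to quit once an update barely changes β; a sketch of that check added to the loop above:

```python
import numpy as np

def fit_until_converged(X, y, eta=0.1, tol=1e-6, max_iters=10000):
    """Gradient ascent with a simple stopping rule: stop when the step is tiny."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iters):
        z = X @ beta
        pi = np.exp(z) / (1.0 + np.exp(z))
        grad = X.T @ (y - pi)
        beta = beta + eta * grad
        if np.linalg.norm(eta * grad) < tol:  # the last update barely moved beta
            break
    return beta
```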