Introduction to Machine Learning - UMIACS (jbg/teaching/CMSC_726/03a.pdf)

Transcript
Page 1

Introduction to Machine Learning

Machine Learning: Jordan Boyd-Graber
University of Maryland
LOGISTIC REGRESSION FROM TEXT

Slides adapted from Emily Fox

Page 2

Reminder: Logistic Regression

P(Y = 0 | X) = \frac{1}{1 + \exp(\beta_0 + \sum_i \beta_i X_i)}   (1)

P(Y = 1 | X) = \frac{\exp(\beta_0 + \sum_i \beta_i X_i)}{1 + \exp(\beta_0 + \sum_i \beta_i X_i)}   (2)

- Discriminative prediction: p(y | x)
- Classification uses: ad placement, spam detection
- What we didn't talk about is how to learn β from data
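As a concrete illustration (not from the slides), here is a minimal Python sketch of equations (1)-(2), assuming a convention where beta[0] is the bias and the remaining weights line up with the features of x:

    import math

    def predict_prob(beta, x):
        """Return P(Y = 1 | x) under logistic regression.

        beta: weights [beta_0, beta_1, ..., beta_n] (beta_0 is the bias)
        x:    features [x_1, ..., x_n]
        """
        z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
        # Equation (2), rewritten in the numerically stabler sigmoid form;
        # equation (1) is simply 1 minus this value.
        return 1.0 / (1.0 + math.exp(-z))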

Pages 3-4

Logistic Regression: Objective Function

\mathcal{L} \equiv \ln p(Y | X, \beta) = \sum_j \ln p(y^{(j)} | x^{(j)}, \beta)   (3)

= \sum_j \left[ y^{(j)} \Big( \beta_0 + \sum_i \beta_i x_i^{(j)} \Big) - \ln \Big( 1 + \exp \Big( \beta_0 + \sum_i \beta_i x_i^{(j)} \Big) \Big) \right]   (4)

Training data (y, x) are fixed. The objective function is a function of β . . . what values of β give a good value?
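A hedged sketch of equation (4), following the same bias-in-beta[0] convention as the hypothetical predict_prob above; math.log1p keeps the ln(1 + exp(z)) term accurate for small z:

    import math

    def log_likelihood(beta, X, y):
        """Equation (4): sum over examples of y*z - ln(1 + exp(z))."""
        total = 0.0
        for x_j, y_j in zip(X, y):
            z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x_j))
            # Fine for moderate z; a production version would guard
            # against overflow in exp(z).
            total += y_j * z - math.log1p(math.exp(z))
        return total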

Pages 5-6

Convexity

- Convex function
- Doesn't matter where you start, if you "walk up" the objective
- Gradient!

Pages 7-17

Gradient Descent (non-convex)

Goal

Optimize log likelihood with respect to variables β

[Figure: objective plotted against parameter; starting from an initial point 0, successive gradient steps 1, 2, 3 climb the curve, while the global optimum, labeled "Undiscovered Country", lies beyond a dip the walk cannot see]

The step annotated in the figure is \beta_j^{(l+1)} = \beta_j^{(l)} + \alpha \frac{\partial J}{\partial \beta_j}, where J is the objective.

Luckily, (vanilla) logistic regression is convex
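To make the "walk up the objective" concrete, a small illustrative sketch (not from the slides) on a one-dimensional concave toy objective, assuming a fixed step size alpha:

    def gradient_ascent_1d(grad, beta0=0.0, alpha=0.1, steps=100):
        """Repeatedly apply beta <- beta + alpha * grad(beta)."""
        beta = beta0
        for _ in range(steps):
            beta += alpha * grad(beta)
        return beta

    # Toy objective J(beta) = -(beta - 3)^2 has gradient -2 * (beta - 3);
    # the walk converges to the maximizer beta = 3.
    print(gradient_ascent_1d(lambda b: -2.0 * (b - 3.0)))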

Page 18

Gradient for Logistic Regression

To ease notation, let's define

\pi_i = \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)}   (5)

Our objective function is

\mathcal{L} = \sum_i \log p(y_i | x_i) = \sum_i \mathcal{L}_i = \sum_i \begin{cases} \log \pi_i & \text{if } y_i = 1 \\ \log(1 - \pi_i) & \text{if } y_i = 0 \end{cases}   (6)
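A minimal sketch of equations (5)-(6), assuming beta and x_i are plain lists of floats with no separate bias term (fold β_0 in as a feature that is always 1):

    import math

    def pi_i(beta, x_i):
        """Equation (5): pi_i = exp(beta^T x_i) / (1 + exp(beta^T x_i))."""
        z = sum(b * x for b, x in zip(beta, x_i))
        return 1.0 / (1.0 + math.exp(-z))  # algebraically equal to (5)

    def L_i(beta, x_i, y_i):
        """Equation (6), one example: log pi_i if y_i = 1, else log(1 - pi_i)."""
        p = pi_i(beta, x_i)
        return math.log(p) if y_i == 1 else math.log(1.0 - p)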

Page 19

Taking the Derivative

Apply chain rule:

\frac{\partial \mathcal{L}}{\partial \beta_j} = \sum_i \frac{\partial \mathcal{L}_i(\vec{\beta})}{\partial \beta_j} = \sum_i \begin{cases} \frac{1}{\pi_i} \frac{\partial \pi_i}{\partial \beta_j} & \text{if } y_i = 1 \\ -\frac{1}{1 - \pi_i} \frac{\partial \pi_i}{\partial \beta_j} & \text{if } y_i = 0 \end{cases}   (7)

If we plug in the derivative,

\frac{\partial \pi_i}{\partial \beta_j} = \pi_i (1 - \pi_i) x_j,   (8)

we can merge these two cases:

\frac{\partial \mathcal{L}_i}{\partial \beta_j} = (y_i - \pi_i) x_j.   (9)
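Equation (9) is easy to sanity-check numerically; a hedged sketch comparing it against a centered finite difference, reusing the hypothetical pi_i and L_i helpers defined above:

    def grad_check(beta, x_i, y_i, j, eps=1e-6):
        """Compare (y_i - pi_i) * x_ij against a finite difference in beta_j."""
        analytic = (y_i - pi_i(beta, x_i)) * x_i[j]   # equation (9)
        hi = list(beta); hi[j] += eps
        lo = list(beta); lo[j] -= eps
        numeric = (L_i(hi, x_i, y_i) - L_i(lo, x_i, y_i)) / (2 * eps)
        return analytic, numeric  # the two should agree to roughly 1e-6

    print(grad_check([0.5, -1.0], [2.0, 1.0], 1, j=0))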

Pages 20-23

Gradient for Logistic Regression

Gradient

\nabla_\beta \mathcal{L}(\vec{\beta}) = \left[ \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_0}, \ldots, \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_n} \right]   (10)

Update

\Delta \beta \equiv \eta \nabla_\beta \mathcal{L}(\vec{\beta})   (11)

\beta_i' \leftarrow \beta_i + \eta \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_i}   (12)

Why are we adding? What would we do if we wanted to do descent?

η: step size, must be greater than zero

NB: Conjugate gradient is usually better, but harder to implement
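Putting equations (9)-(12) together, a minimal batch gradient ascent sketch (illustrative only, assuming the hypothetical pi_i helper above, a bias folded in as a constant feature, and a fixed eta):

    def gradient_ascent(X, y, eta=0.1, iters=1000):
        """Maximize the log likelihood via beta_i <- beta_i + eta * dL/dbeta_i."""
        beta = [0.0] * len(X[0])
        for _ in range(iters):
            # Equation (10): accumulate the full gradient over all examples.
            grad = [0.0] * len(beta)
            for x_i, y_i in zip(X, y):
                err = y_i - pi_i(beta, x_i)          # equation (9)
                for j, x_ij in enumerate(x_i):
                    grad[j] += err * x_ij
            # Equation (12): step uphill (subtract instead to do descent).
            beta = [b + eta * g for b, g in zip(beta, grad)]
        return beta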

Pages 24-28

Choosing Step Size

[Figure: objective plotted against parameter; the slide builds illustrate the effect of different step sizes]

Page 29

Remaining issues

- When to stop?
- What if β keeps getting bigger?

Pages 30-31

Regularized Conditional Log Likelihood

Unregularized

\beta^* = \arg\max_\beta \sum_j \ln p(y^{(j)} | x^{(j)}, \beta)   (13)

Regularized

\beta^* = \arg\max_\beta \sum_j \ln p(y^{(j)} | x^{(j)}, \beta) - \mu \sum_i \beta_i^2   (14)

µ is a "regularization" parameter that trades off between likelihood and having small parameters
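A hedged sketch of the objective in equation (14), reusing the hypothetical log_likelihood helper from above:

    def regularized_objective(beta, X, y, mu=0.1):
        """Equation (14): log likelihood minus mu * sum of squared weights."""
        # A common refinement leaves the bias beta_0 unpenalized; the slide's
        # sum over i is ambiguous on that point, so we penalize everything.
        penalty = mu * sum(b * b for b in beta)
        return log_likelihood(beta, X, y) - penalty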

Pages 32-34

Approximating the Gradient

- Our datasets are big (too big to fit into memory)
- . . . or data are changing / streaming
- Hard to compute true gradient

\nabla \mathcal{L}(\beta) \equiv \mathbb{E}_x [\nabla \mathcal{L}(\beta, x)]   (15)

- Average over all observations
- What if we compute an update just from one observation?

Page 35

Getting to Union Station

Pretend it's a pre-smartphone world and you want to get to Union Station

Pages 36-37

Stochastic Gradient for Logistic Regression

Given a single observation x_i chosen at random from the dataset,

\beta_j \leftarrow \beta_j' + \eta \left( -\mu \beta_j' + x_{ij} [y_i - \pi_i] \right)   (16)

Examples in class.
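A minimal sketch of the single-observation update in equation (16), assuming the hypothetical pi_i helper above (beta zipped directly with the features of x_i):

    def sgd_step(beta, x_i, y_i, eta=0.1, mu=0.01):
        """Equation (16): one stochastic update with L2 regularization."""
        err = y_i - pi_i(beta, x_i)
        return [b + eta * (-mu * b + x_ij * err)
                for b, x_ij in zip(beta, x_i)]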

Pages 38-39

Stochastic Gradient for Regularized Regression

\mathcal{L} = \log p(y | x; \beta) - \mu \sum_j \beta_j^2   (17)

Taking the derivative (with respect to example x_i):

\frac{\partial \mathcal{L}}{\partial \beta_j} = (y_i - \pi_i) x_j - 2 \mu \beta_j   (18)

Page 40

Algorithm

1. Initialize a vector β to be all zeros
2. For t = 1, . . . , T
   - For each example (x_i, y_i) and feature j:
     - Compute π_i ≡ Pr(y_i = 1 | x_i)
     - Set β[j] = β[j]' + λ(y_i − π_i) x_ij
3. Output the parameters β_1, . . . , β_d.
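A hedged end-to-end sketch of this algorithm (with the regularization term from equation (16) left out, matching the pseudocode above); pi_i is the hypothetical helper defined earlier, and the bias is folded in as a constant feature:

    def sgd_train(X, y, lam=0.1, T=10):
        """Run T passes of stochastic gradient over the dataset."""
        beta = [0.0] * len(X[0])               # step 1: all zeros
        for _ in range(T):                      # step 2: t = 1, ..., T
            for x_i, y_i in zip(X, y):
                p = pi_i(beta, x_i)             # Pr(y_i = 1 | x_i)
                for j, x_ij in enumerate(x_i):
                    beta[j] += lam * (y_i - p) * x_ij
        return beta                             # step 3: output parameters

    # Tiny usage example: learn OR-ish labels from two features plus a bias of 1.
    X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
    y = [0, 1, 1, 1]
    print(sgd_train(X, y))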

Page 41

Proofs about Stochastic Gradient

- Depend on the convexity of the objective and how close (ε) you want to get to the actual answer
- The best bounds depend on changing η over time and per dimension (not all features are created equal)
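As one illustration of a time-varying step size (the specific schedule here is an assumption for illustration, not from the slides):

    def eta_schedule(eta0=1.0, decay=0.01):
        """Yield eta_t = eta0 / (1 + decay * t), a common decaying schedule."""
        t = 0
        while True:
            yield eta0 / (1.0 + decay * t)
            t += 1

    # Usage: draw a fresh step size before each stochastic update.
    etas = eta_schedule()
    print([next(etas) for _ in range(3)])  # [1.0, 0.990..., 0.980...]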

Page 42

In class

- Your questions!
- Working through a simple example
- Preparing for the logistic regression homework