Introduction to Machine Learning
Logistic Regression from Text
Machine Learning: Jordan Boyd-Graber, University of Maryland
Slides adapted from Emily Fox
Reminder: Logistic Regression
P(Y = 0 \mid X) = \frac{1}{1 + \exp\left(\beta_0 + \sum_i \beta_i X_i\right)}    (1)

P(Y = 1 \mid X) = \frac{\exp\left(\beta_0 + \sum_i \beta_i X_i\right)}{1 + \exp\left(\beta_0 + \sum_i \beta_i X_i\right)}    (2)
Discriminative prediction: p(y | x)
Classification uses: ad placement, spam detection
What we didn’t talk about is how to learn β from data
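A minimal sketch of the prediction in eqs. (1) and (2) in Python; the weights and feature vector below are made-up illustrative values, not from the slides:

```python
import numpy as np

def predict_proba(beta0, beta, x):
    """Return (P(Y=0|x), P(Y=1|x)) for logistic regression, eqs. (1)-(2)."""
    z = beta0 + np.dot(beta, x)          # linear score beta_0 + sum_i beta_i x_i
    p1 = np.exp(z) / (1.0 + np.exp(z))   # eq. (2); for large |z| a numerically stable sigmoid is safer
    return 1.0 - p1, p1

# toy example with two features
beta0, beta = -1.0, np.array([2.0, -0.5])
x = np.array([1.0, 3.0])
print(predict_proba(beta0, beta, x))
```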
Logistic Regression: Objective Function
\mathcal{L} \equiv \ln p(Y \mid X, \beta) = \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta)    (3)

= \sum_j \left[ y^{(j)} \left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) - \ln\left( 1 + \exp\left( \beta_0 + \sum_i \beta_i x_i^{(j)} \right) \right) \right]    (4)
Training data (y, x) are fixed. The objective function is a function of β . . . what values of β give a good value?
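A sketch of the log likelihood in eq. (4) as a function of β over a toy data set (the arrays X and y below are illustrative, not from the slides):

```python
import numpy as np

def log_likelihood(beta0, beta, X, y):
    """Conditional log likelihood of eq. (4): sum_j [ y_j z_j - log(1 + exp(z_j)) ]."""
    z = beta0 + X @ beta                        # score for every example
    return np.sum(y * z - np.log1p(np.exp(z)))  # log1p(exp(z)) = log(1 + exp(z))

# toy data: 4 examples, 2 features
X = np.array([[0.5, 1.0], [1.5, -0.2], [-1.0, 0.3], [0.2, 2.0]])
y = np.array([1, 1, 0, 1])
print(log_likelihood(0.0, np.array([0.1, -0.3]), X, y))
```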
Convexity
Convex function
It doesn't matter where you start: if you "walk up" the objective, you reach the maximum
Gradient!
Gradient Descent (non-convex)

Goal: Optimize the log likelihood with respect to the variables β.

[Figure: objective plotted against the parameter; successive iterates 0, 1, 2, 3 climb the curve toward the unexplored region ("undiscovered country"), each update moving from β^(l) to β^(l+1) with a gradient step α ∂L/∂β evaluated at β^(l)]

Luckily, (vanilla) logistic regression is convex.
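The picture above corresponds to repeatedly applying β^(l+1) = β^(l) + α ∂L/∂β. A minimal gradient-ascent sketch of that loop, not from the slides; the toy objective, step size alpha, and iteration count are illustrative assumptions:

```python
import numpy as np

def gradient_ascent(grad, beta_init, alpha=0.1, n_iters=100):
    """Repeat beta^(l+1) = beta^(l) + alpha * dL/dbeta: walk uphill on the objective."""
    beta = np.array(beta_init, dtype=float)
    for _ in range(n_iters):
        beta = beta + alpha * grad(beta)
    return beta

# toy concave objective L(beta) = -(beta - 3)^2, so dL/dbeta = -2 (beta - 3)
print(gradient_ascent(lambda b: -2 * (b - 3.0), beta_init=[0.0]))  # approaches 3
```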
Gradient for Logistic Regression
To ease notation, let’s define
\pi_i = \frac{\exp(\beta^T x_i)}{1 + \exp(\beta^T x_i)}    (5)

Our objective function is

\mathcal{L} = \sum_i \log p(y_i \mid x_i) = \sum_i \mathcal{L}_i = \sum_i \begin{cases} \log \pi_i & \text{if } y_i = 1 \\ \log(1 - \pi_i) & \text{if } y_i = 0 \end{cases}    (6)
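The two cases in eq. (6) can be folded into the single expression y_i log π_i + (1 − y_i) log(1 − π_i); a quick numerical check of that equivalence (the π and y values below are made up):

```python
import numpy as np

pi = np.array([0.9, 0.2, 0.7])   # illustrative pi_i values
y = np.array([1, 0, 0])

piecewise = np.where(y == 1, np.log(pi), np.log(1 - pi))   # eq. (6), case by case
compact = y * np.log(pi) + (1 - y) * np.log(1 - pi)        # single-expression form
print(np.allclose(piecewise, compact))                     # True
```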
Taking the Derivative
Apply chain rule:
\frac{\partial \mathcal{L}}{\partial \beta_j} = \sum_i \frac{\partial \mathcal{L}_i(\vec{\beta})}{\partial \beta_j} = \sum_i \begin{cases} \frac{1}{\pi_i} \frac{\partial \pi_i}{\partial \beta_j} & \text{if } y_i = 1 \\ \frac{1}{1 - \pi_i} \left( -\frac{\partial \pi_i}{\partial \beta_j} \right) & \text{if } y_i = 0 \end{cases}    (7)

If we plug in the derivative,

\frac{\partial \pi_i}{\partial \beta_j} = \pi_i (1 - \pi_i) x_{i,j},    (8)

we can merge these two cases: for y_i = 1 the term becomes (1 - \pi_i) x_{i,j}, and for y_i = 0 it becomes -\pi_i x_{i,j}, so in both cases

\frac{\partial \mathcal{L}_i}{\partial \beta_j} = (y_i - \pi_i) x_{i,j}.    (9)
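A finite-difference check, not from the slides, that the closed form in eq. (9) matches the numerical derivative of the per-example log likelihood; the β, x, and y values are arbitrary illustrative choices:

```python
import numpy as np

def pi(beta, x):
    z = beta @ x
    return np.exp(z) / (1.0 + np.exp(z))

def L_i(beta, x, y):
    p = pi(beta, x)
    return np.log(p) if y == 1 else np.log(1 - p)

beta = np.array([0.4, -0.7, 0.1])
x, y = np.array([1.0, 2.0, -0.5]), 1

analytic = (y - pi(beta, x)) * x                  # eq. (9): (y_i - pi_i) x_{i,j}
eps = 1e-6
numeric = np.array([(L_i(beta + eps * e, x, y) - L_i(beta - eps * e, x, y)) / (2 * eps)
                    for e in np.eye(len(beta))])  # central differences, one coordinate at a time
print(np.allclose(analytic, numeric, atol=1e-6))  # True
```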
Gradient for Logistic Regression
Gradient

\nabla_{\beta} \mathcal{L}(\vec{\beta}) = \left[ \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_0}, \ldots, \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_n} \right]    (10)

Update

\Delta \beta \equiv \eta \nabla_{\beta} \mathcal{L}(\vec{\beta})    (11)

\beta_i' \leftarrow \beta_i + \eta \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_i}    (12)
Why are we adding? What would we do if we wanted to do descent?
η: step size, must be greater than zero
NB: Conjugate gradient is usually better, but harder to implement
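Stacking eq. (12) over all coefficients and summing eq. (9) over examples, one ascent step is β ← β + η Xᵀ(y − π). A sketch of that batch update in Python; the toy data, η, and iteration count are illustrative assumptions, not values from the slides:

```python
import numpy as np

def train_logreg(X, y, eta=0.1, n_iters=1000):
    """Batch gradient ascent on the conditional log likelihood (eqs. 10-12)."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a column of 1s for beta_0
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iters):
        pi = 1.0 / (1.0 + np.exp(-(X1 @ beta)))     # predicted P(Y=1|x) for every example
        beta = beta + eta * X1.T @ (y - pi)         # eq. (12), applied to every beta_i at once
    return beta

# toy one-feature data set
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
print(train_logreg(X, y))
```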
Choosing Step Size

[Figure: the objective plotted against the parameter, showing how gradient updates of different step sizes move along the curve]
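A toy illustration, not from the slides, of why the step size matters: gradient ascent on the concave objective L(β) = −β² (maximum at β = 0) with three made-up η values:

```python
def ascend(eta, n_iters=20):
    """Gradient ascent on L(beta) = -beta^2; the gradient is -2 * beta."""
    beta = 5.0
    for _ in range(n_iters):
        beta = beta + eta * (-2.0 * beta)
    return beta

print(ascend(eta=0.01))  # too small: barely moves toward the optimum in 20 steps
print(ascend(eta=0.4))   # reasonable: essentially reaches 0
print(ascend(eta=1.1))   # too large: overshoots every step and diverges
```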
Remaining issues
When to stop?
What if β keeps getting bigger?
Regularized Conditional Log Likelihood
Unregularized
\beta^* = \operatorname{argmax}_{\beta} \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta)    (13)
Regularized
\beta^* = \operatorname{argmax}_{\beta} \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) - \mu \sum_i \beta_i^2    (14)
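A sketch of the regularized objective and its gradient, assuming the penalty is the squared L2 norm µ Σ_i β_i²; the original slide is cut off here, so this regularizer is one common choice rather than necessarily the slide's:

```python
import numpy as np

def regularized_log_likelihood(beta, X1, y, mu):
    """Conditional log likelihood minus an L2 penalty (illustrative regularizer)."""
    z = X1 @ beta
    return np.sum(y * z - np.log1p(np.exp(z))) - mu * np.sum(beta ** 2)

def regularized_gradient(beta, X1, y, mu):
    """Gradient of the regularized objective: X^T (y - pi) - 2 * mu * beta."""
    # In practice the bias beta_0 is often left unpenalized; penalized here for brevity.
    pi = 1.0 / (1.0 + np.exp(-(X1 @ beta)))
    return X1.T @ (y - pi) - 2.0 * mu * beta
```

These drop into the training loop sketched earlier by swapping in regularized_gradient for the plain gradient step.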