
Stochastic Gradient Descent

William Cohen

February 15, 2012

1 A Warmup

If you have a continuous mathematical expression for a probability, you can often learn the parameter values by simply maximizing the probability of the data with respect to the parameter values.

For example, suppose we have a binomial—a coin with some unknown probability θ of heads (i.e., x_i = 1)—and n draws from it, x_1, . . . , x_n, of which k came up heads. We can write the probability of a single x_i as

P(X = x_i | θ) = { θ        if x_i = 1
                 { 1 − θ    if x_i = 0

or better, since we’d like something we can differentiate (to find the max),

P(X = x_i | θ) = θ^{x_i} (1 − θ)^{1 − x_i}

Let D be the dataset, D = {x_1, . . . , x_n}. Then

P(D | θ) = ∏_i P(x_i | θ) = ∏_i θ^{x_i} (1 − θ)^{1 − x_i} = θ^k (1 − θ)^{n − k}

So we can differentiate, using (fg)′ = f′g + fg′ and the chain rule for g:

∂/∂θ P(D) = kθ^{k−1}(1 − θ)^{n−k} + θ^k (n − k)(1 − θ)^{n−k−1}(−1)

Now simplify by pulling out the common factors of θ^{k−1} and (1 − θ)^{n−k−1} to get

∂/∂θ P(D) = θ^{k−1}(1 − θ)^{n−k−1} (k(1 − θ) − θ(n − k))

Setting this to zero, we need one of the factors to be zero, so either θ^{k−1} = 0 at θ = 0, or (1 − θ)^{n−k−1} = 0 at θ = 1, or k(1 − θ) − θ(n − k) = 0, which we can solve to give θ = k/n, the maximum.
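To see this concretely, here is a minimal Python check (the sample numbers and the grid search are my own illustration, not from the notes): it evaluates θ^k (1 − θ)^{n−k} over a grid of θ values and confirms the maximizer is k/n.

```python
# Maximize P(D | theta) = theta^k (1 - theta)^(n - k) over a grid of theta values.
n, k = 20, 7                      # 20 flips, 7 heads -- illustrative numbers
best = max((t / 1000 for t in range(1, 1000)),
           key=lambda t: t**k * (1 - t)**(n - k))
print(best)                       # 0.35, i.e. k/n
```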


2 Logistic regression

The model we’d like to optimize next is for an example (x, y), where y = 0 or y = 1, and x is a feature vector. We will let the predicted value of y be

P(Y = 1 | X = x, w) = 1/(1 + e^{−x·w})

So then the probability of a single example (x, y) is

P(Y = y | X = x, w) = { 1/(1 + e^{−x·w})        if y = 1
                      { 1 − 1/(1 + e^{−x·w})    if y = 0

To motivate this formula, note that this is similar in form to the other linear classifiers we’ve looked at, like Naive Bayes, Rocchio, and the perceptron—it’s just that instead of using an argmax or a sign function on the inner product x·w, we’re using the easy-to-differentiate logistic function f(z) = 1/(1 + e^{−z}), which approximates a step function between 0 and 1.

What we’ll do is look at the gradient of log P_w(y|x) for a single example (x, y). The algorithm we will use will be to pick a random example, compute this gradient, and adjust the parameters to take a small step in that direction, so we’re using a randomized approximation to the true gradient over all the data. This turns out to be fast and surprisingly effective—and algorithmically a lot like the perceptron method.
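To get a feel for the logistic function just mentioned (the sample points here are mine, not from the notes), note that it behaves like a smooth 0/1 step:

```python
import math

def logistic(z):
    # f(z) = 1 / (1 + e^{-z}), the logistic function
    return 1.0 / (1.0 + math.exp(-z))

for z in (-5, -1, 0, 1, 5):
    print(z, round(logistic(z), 3))
# -5 0.007, -1 0.269, 0 0.5, 1 0.731, 5 0.993 -- near 0 or 1 away from z = 0
```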

As notation, let p be

p ≡ 1/(1 + e^{−x·w}) = 1/(1 + exp(−∑_j x_j w_j))

and consider the log probability (which, since it’s monotonic with the probability, will have the same maximal values):

log P(Y = y | X = x, w) = { log p        if y = 1
                          { log(1 − p)   if y = 0

Let’s take the gradient with respect to some parameter value w_j. We’ll do the two cases separately for now: using (log f)′ = (1/f) f′, we have

∂/∂w_j log P(Y = y | X = x, w) = { (1/p) ∂p/∂w_j             if y = 1
                                  { (1/(1 − p)) (−∂p/∂w_j)    if y = 0


Now we can churn through and get the gradient for p, using (e^f)′ = e^f f′ and the chain rule:

∂p/∂w_j = ∂/∂w_j (1 + exp(−∑_j x_j w_j))^{−1}
        = (−1)(1 + exp(−∑_j x_j w_j))^{−2} · ∂/∂w_j exp(−∑_j x_j w_j)
        = (−1)(1 + exp(−∑_j x_j w_j))^{−2} · exp(−∑_j x_j w_j) · (−x_j)
        = [1/(1 + exp(−∑_j x_j w_j))] · [exp(−∑_j x_j w_j)/(1 + exp(−∑_j x_j w_j))] · x_j

Note that

1 − p = (1 + exp(−∑_j x_j w_j))/(1 + exp(−∑_j x_j w_j)) − 1/(1 + exp(−∑_j x_j w_j))
      = exp(−∑_j x_j w_j)/(1 + exp(−∑_j x_j w_j))

so the final line can be rewritten to give

∂p/∂w_j = p(1 − p) x_j

which we can plug into the cases above and get

∂/∂w_j log P(Y = y | X = x, w) = { (1/p) p(1 − p) x_j = (1 − p) x_j          if y = 1
                                  { (1/(1 − p)) (−1) p(1 − p) x_j = −p x_j   if y = 0

Finally, these cases can be combined to give the very simple gradient

∂/∂w_j log P(Y = y | X = x, w) = (y − p) x_j

So, taking a small step in this direction would be to increment w as follows:

w^{(t+1)} = w^{(t)} + λ(y − p) x
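As a sanity check on the derivation, here is a minimal finite-difference comparison in Python (the example vector, label, and weights are made up for illustration):

```python
import math

def log_prob(x, y, w):
    # log P(Y = y | X = x, w) for the logistic model above
    p = 1.0 / (1.0 + math.exp(-sum(xj * wj for xj, wj in zip(x, w))))
    return math.log(p) if y == 1 else math.log(1.0 - p)

x, y, w = [1.0, 2.0, 0.5], 1, [0.1, -0.3, 0.2]
p = 1.0 / (1.0 + math.exp(-sum(xj * wj for xj, wj in zip(x, w))))

eps = 1e-6
for j in range(len(w)):
    w_plus = list(w)
    w_plus[j] += eps
    numeric = (log_prob(x, y, w_plus) - log_prob(x, y, w)) / eps
    analytic = (y - p) * x[j]          # the gradient (y - p) x_j derived above
    print(j, round(numeric, 4), round(analytic, 4))   # the two columns agree
```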

Compare this to the perceptron update rule: it’s not very different. So this leads to the following algorithm, which is very fast (assuming you have enough memory to hash all the parameter values).

1. Initialize a hashtable W


2. For t = 1, . . . , T

   • For each example (x_i, y_i):

     – Compute the prediction for x_i:

         p_i = 1/(1 + exp(−∑_{j : x_ji > 0} x_ji w_j))

     – For each non-zero feature of x_i with index j and value x_j:

       ∗ If j is not in W, set W[j] = 0.

       ∗ Set W[j] = W[j] + λ(y_i − p_i) x_j

3. Output the hash table W.

The time to run this is O(nT), where n is the total number of non-zero features summed over all examples and T is the number of iterations.
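In Python, one way this algorithm might look is the following sketch (the dict-of-nonzero-features input format and the fixed learning rate λ are assumptions, not part of the notes):

```python
import math

def sgd_logistic(examples, T, lam):
    """examples: list of (features, y), where features is a dict {j: xj}
    holding only the non-zero feature values and y is 0 or 1."""
    W = {}                                    # the parameter hashtable
    for t in range(T):
        for features, y in examples:
            # prediction p_i = 1 / (1 + exp(-sum_j x_ji w_j)), non-zero features only
            z = sum(xj * W.get(j, 0.0) for j, xj in features.items())
            p = 1.0 / (1.0 + math.exp(-z))
            # update only the weights with a non-zero feature value
            for j, xj in features.items():
                W[j] = W.get(j, 0.0) + lam * (y - p) * xj
    return W
```

Each pass touches only the non-zero features of each example, which is where the O(nT) bound comes from.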

3 An observation

Consider averaging the gradient

∂/∂w_j log P(Y = y | X = x, w) = (y − p) x_j

over many examples D = {(x_1, y_1), . . . , (x_n, y_n)}—which is the usual non-stochastic gradient procedure—and assume that the x_j’s are binary. If the examples are independent, then

∂/∂w_j log P(D | w) = (1/n) ∑_i (y_i − p_i) x_ji = (1/n) ∑_{i : x_ji = 1} y_i − (1/n) ∑_{i : x_ji = 1} p_i

This can be viewed as the difference between the actual fraction of labelsy = 1 in the subset of the data for which xj = 1, and the predicted fractionof labels y = 1 in the subset of the data for which xj = 1.
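This identity is easy to confirm numerically; here is a quick check on randomly generated binary data (all values are illustrative):

```python
import math, random

random.seed(0)
n, d = 1000, 5
X = [[random.randint(0, 1) for _ in range(d)] for _ in range(n)]
w = [0.2 * j for j in range(d)]
ps = [1.0 / (1.0 + math.exp(-sum(x[j] * w[j] for j in range(d)))) for x in X]
ys = [random.randint(0, 1) for _ in range(n)]

j = 2
avg_grad = sum((ys[i] - ps[i]) * X[i][j] for i in range(n)) / n
fractions = (sum(ys[i] for i in range(n) if X[i][j] == 1) / n
             - sum(ps[i] for i in range(n) if X[i][j] == 1) / n)
print(round(avg_grad, 6), round(fractions, 6))   # identical
```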


4 Efficient regularized SGD

Logistic regression tends to overfit when there are many rare features. One fix is to penalize large values of w_j, by optimizing, instead of LCL (log-conditional likelihood), some function such as

LCL − µ ∑_{j=1}^{d} (w_j)^2

Here µ controls how much weight to give to the penalty term. Then instead of

∂/∂w_j log P(Y = y | X = x, w) = (y − p) x_j

we have

∂/∂w_j [ log P(Y = y | X = x, w) − µ ∑_{j′=1}^{d} (w_{j′})^2 ] = (y − p) x_j − 2µ w_j

and the update for wj becomes

w_j = w_j + λ((y − p) x_j − 2µ w_j)

or equivalently

w_j = w_j + λ(y − p) x_j − 2λµ w_j

Experimentally this greatly reduces overfitting, but unfortunately it makes the computation much more expensive, because now every w_j needs to be updated, not only the ones that are associated with non-zero x_j’s.

The trick to making this efficient is to break the update into two parts. One is the usual update of adding λ(y − p) x_j. Let’s call this the “LCL” part of the update. The second is the “regularization” part of the update, which is to replace w_j by

w_j = w_j − 2λµ w_j = w_j · (1 − 2λµ)

So we could perform our update of w_j as follows:

• Set w_j = w_j · (1 − 2λµ)

• If x_j ≠ 0, set w_j = w_j + λ(y − p) x_j


Following this up, we note that we can perform m successive “regularization” updates at once by letting w_j = w_j · (1 − 2λµ)^m. The basic idea of the new algorithm is to not perform regularization updates for zero-valued x_j’s, but instead to simply keep track of how many such updates would need to be performed to update w_j, and perform them only when we would normally perform “LCL” updates (or when we output the parameters at the end of the day).

Here’s the final algorithm (for more detail, see “Lazy sparse stochastic gradient descent for regularized multinomial logistic regression”, Bob Carpenter, http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf).

1. Let k = 0, and let A and W be empty hashtables. A will record the value of k the last time W[j] was updated.

2. For t = 1, . . . , T

   • For each example (x_i, y_i):

     – Let k = k + 1

     – Compute the prediction p_i for x_i as in the previous algorithm.

     – For each non-zero feature of x_i with index j and value x_j:

       ∗ If j is not in W, set W[j] = 0.

       ∗ If j is not in A, set A[j] = 0.

       ∗ Simulate the “regularization” updates that would have been performed for the k − A[j] examples since the last time a non-zero x_j was encountered by setting

           W[j] = W[j] · (1 − 2λµ)^{k − A[j]}

       ∗ Set W[j] = W[j] + λ(y_i − p_i) x_j

       ∗ Set A[j] = k

3. For each parameter w_1, . . . , w_j, . . . , w_d, set

       W[j] = W[j] · (1 − 2λµ)^{k − A[j]}

4. Output the hash table W.


5 Other hints

The learning rate λ is often decreased over time. On the t-th sweep through the data, set λ = η/t^2. I used η = 0.5. (Sometimes λ is also scaled by 1/n_e, where n_e is the number of examples.) I also used a value of 2µ = 0.1.

When running stochastic gradient descent, it is usual to randomize the order of examples, and to scale the feature values so that they are comparable (if they are not already binary). However, randomization is not trivial to do for a large dataset. I recommend implementing SGD as a process that streams once through a data stream, with the number of examples n_e being passed in separately as a command-line argument so that the algorithm is aware of what the current value of t is. Then write a separate module that will input a file of examples and then stream the individual examples out in approximately random order.
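One simple way to implement such a module (a sketch under my own design assumptions, not from the notes): scatter lines across temporary bucket files at random, then shuffle each bucket in memory and concatenate. Each bucket is roughly 1/n_buckets of the data, so it fits in memory even when the full dataset does not.

```python
import os, random, tempfile

def stream_shuffle(in_path, out_path, n_buckets=100):
    """Stream a file of examples out in approximately random order."""
    buckets = [tempfile.NamedTemporaryFile("w+", delete=False) for _ in range(n_buckets)]
    with open(in_path) as f:
        for line in f:
            random.choice(buckets).write(line)     # scatter pass
    with open(out_path, "w") as out:
        for b in buckets:
            b.seek(0)
            lines = b.readlines()                  # one bucket fits in memory
            random.shuffle(lines)
            out.writelines(lines)
            b.close()
            os.unlink(b.name)
```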
