Variance reduction for stochastic gradient methods

Page 1: Variance reduction for stochastic gradient methods

ELE 522: Large-Scale Optimization for Data Science

Yuxin Chen, Princeton University, Fall 2019

Page 2: Variance reduction for stochastic gradient methods

Outline

• Stochastic variance reduced gradient (SVRG)
  ◦ Convergence analysis for strongly convex problems

• Stochastic recursive gradient algorithm (SARAH)
  ◦ Convergence analysis for nonconvex problems

• Other variance reduced stochastic methods
  ◦ Stochastic dual coordinate ascent (SDCA)


Page 3: Variance reduction for stochastic gradient methods

Finite-sum optimization

$$\underset{x \in \mathbb{R}^d}{\text{minimize}} \quad F(x) := \frac{1}{n} \sum_{i=1}^n \underbrace{f_i(x)}_{\text{loss for the } i\text{th sample } (a_i,\, y_i)} \; + \; \underbrace{\psi(x)}_{\text{regularizer}}$$

a common task in machine learning:

• linear regression: f_i(x) = ½(a_i⊤x − y_i)², ψ(x) = 0

• logistic regression: f_i(x) = log(1 + exp(−y_i a_i⊤x)), ψ(x) = 0

• Lasso: f_i(x) = ½(a_i⊤x − y_i)², ψ(x) = λ‖x‖₁

• SVM: f_i(x) = max{0, 1 − y_i a_i⊤x}, ψ(x) = (λ/2)‖x‖₂²

• ...
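To make the finite-sum structure concrete, here is a minimal NumPy sketch for the logistic-regression instance above (the array names `A`, `y` and the helper names are our own, not from the lecture):

```python
import numpy as np

def f_i(x, a_i, y_i):
    """Logistic loss of the ith sample: log(1 + exp(-y_i * a_i^T x))."""
    return np.log1p(np.exp(-y_i * (a_i @ x)))

def grad_f_i(x, a_i, y_i):
    """Gradient of the ith logistic loss w.r.t. x."""
    return -y_i * a_i / (1.0 + np.exp(y_i * (a_i @ x)))

def F(x, A, y):
    """Full objective F(x) = (1/n) * sum_i f_i(x), with psi = 0."""
    return np.mean([f_i(x, a, yi) for a, yi in zip(A, y)])
```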


Page 4: Variance reduction for stochastic gradient methods

Stochastic gradient descent (SGD)

Algorithm 12.1 Stochastic gradient descent (SGD)
1: for t = 1, 2, . . . do
2:   pick i_t ∼ Unif(1, . . . , n)
3:   x^{t+1} = x^t − η_t ∇f_{i_t}(x^t)
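In code, Algorithm 12.1 is a few lines. The sketch below is ours, assuming a callable `grad_f_i(x, i)` that returns ∇f_i(x) and a user-supplied `stepsize(t)` schedule:

```python
import numpy as np

def sgd(grad_f_i, x0, n, stepsize, num_iters, seed=0):
    """Vanilla SGD (Algorithm 12.1): one sampled component gradient per step."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    for t in range(num_iters):
        i = rng.integers(n)               # i_t ~ Unif(1, ..., n)
        x -= stepsize(t) * grad_f_i(x, i)
    return x
```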

As we have shown in the last lecture:

• large stepsizes poorly suppress the variability of stochastic gradients
  =⇒ SGD with η_t ≍ 1 tends to oscillate around global minimizers

• choosing η_t ≍ 1/t mitigates the oscillation, but is too conservative


Page 5: Variance reduction for stochastic gradient methods

Recall: SGD theory with fixed stepsizes

$$x^{t+1} = x^t - \eta\, g^t$$

• g^t: an unbiased estimate of ∇F(x^t)

• E[‖g^t‖₂²] ≤ σ_g² + c_g ‖∇F(x^t)‖₂²

• F(·): µ-strongly convex and L-smooth

From the last lecture, we know

$$\mathbb{E}\big[F(x^t) - F(x^*)\big] \;\le\; \frac{\eta L \sigma_g^2}{2\mu} + (1 - \eta\mu)^t \big(F(x^0) - F(x^*)\big)$$


Page 6: Variance reduction for stochastic gradient methods

Recall: SGD theory with fixed stepsizes

$$\mathbb{E}\big[F(x^t) - F(x^*)\big] \;\le\; \frac{\eta L \sigma_g^2}{2\mu} + (1 - \eta\mu)^t \big(F(x^0) - F(x^*)\big)$$

• vanilla SGD: g^t = ∇f_{i_t}(x^t)
  ◦ issue: σ_g² is non-negligible even when x^t = x^*

• question: is it possible to design g^t with reduced variability σ_g²?


Page 7: Variance reduction for stochastic gradient methods

A simple idea

Imagine we take some v^t with E[v^t] = 0 and set

$$g^t = \nabla f_{i_t}(x^t) - v^t$$

— so g^t is still an unbiased estimate of ∇F(x^t)

question: how can we reduce the variability (i.e. ensure E[‖g^t‖₂²] < E[‖∇f_{i_t}(x^t)‖₂²])?

answer: find some zero-mean v^t that is positively correlated with ∇f_{i_t}(x^t) (i.e. ⟨v^t, ∇f_{i_t}(x^t)⟩ > 0). Why? Expanding the square,

$$\mathbb{E}\big[\|g^t\|_2^2\big] = \mathbb{E}\big[\|\nabla f_{i_t}(x^t)\|_2^2\big] - 2\, \mathbb{E}\big[\langle v^t, \nabla f_{i_t}(x^t)\rangle\big] + \mathbb{E}\big[\|v^t\|_2^2\big],$$

so the variability drops whenever the (positive) cross term outweighs E[‖v^t‖₂²]
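A toy numerical check of this control-variate idea (the numbers and the particular choice of v below are our own illustration, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(loc=2.0, scale=1.0, size=100_000)  # per-sample gradients (scalar toy model)

# a zero-mean variate positively correlated with g: a damped centering of g
v = 0.9 * (g - g.mean())

reduced = g - v                    # control-variate estimator, same mean as g

print(g.mean(), reduced.mean())    # ~2.0 and ~2.0: still unbiased
print(g.var(), reduced.var())      # variance shrinks by a factor (1 - 0.9)^2 = 0.01
```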


Page 8: Variance reduction for stochastic gradient methods

Reducing variance via gradient aggregation

If the current iterate is not too far away from previous iterates, then historical gradient info might be useful in producing such a v^t to reduce variance

main idea of this lecture: aggregate previous gradient info to help improve the convergence rate


Page 9: Variance reduction for stochastic gradient methods

Stochastic variance reduced gradient (SVRG)

Page 10: Variance reduction for stochastic gradient methods

Strongly convex and smooth problems (no regularization)

$$\underset{x \in \mathbb{R}^d}{\text{minimize}} \quad F(x) = \frac{1}{n} \sum_{i=1}^n f_i(x)$$

• fi: convex and L-smooth

• F : µ-strongly convex

• κ := L/µ: condition number


Page 11: Variance reduction for stochastic gradient methods

Stochastic variance reduced gradient (SVRG)

— Johnson, Zhang ’13

key idea: if we have access to a history point x^old and ∇F(x^old), then

$$\underbrace{\nabla f_{i_t}(x^t) - \nabla f_{i_t}(x^{\mathrm{old}})}_{\to\, 0 \ \text{if}\ x^t \approx x^{\mathrm{old}}} \;+\; \underbrace{\nabla F(x^{\mathrm{old}})}_{\to\, 0 \ \text{if}\ x^{\mathrm{old}} \approx x^*} \qquad \text{with } i_t \sim \mathrm{Unif}(1, \cdots, n)$$

• is an unbiased estimate of ∇F(x^t)

• converges to 0 (so the variability is reduced!) if x^t ≈ x^old ≈ x^*


Page 12: Variance reduction for stochastic gradient methods

Stochastic variance reduced gradient (SVRG)

• operate in epochs

• in the sth epoch
  ◦ very beginning: take a snapshot x_s^old of the current iterate, and compute the batch gradient ∇F(x_s^old)
  ◦ inner loop: use the snapshot point to help reduce variance

$$x_s^{t+1} = x_s^t - \eta \left\{ \nabla f_{i_t}(x_s^t) - \nabla f_{i_t}(x_s^{\mathrm{old}}) + \nabla F(x_s^{\mathrm{old}}) \right\}$$

a hybrid approach: the batch gradient is computed only once per epoch


Page 13: Variance reduction for stochastic gradient methods

SVRG algorithm (Johnson, Zhang ’13)

Algorithm 12.2 SVRG for finite-sum optimization
1: for s = 1, 2, . . . do
2:   x_s^old ← x_{s−1}^m, and compute the batch gradient ∇F(x_s^old)   // update snapshot
3:   initialize x_s^0 ← x_s^old
4:   for t = 0, . . . , m − 1 do   // each epoch contains m iterations
5:     choose i_t uniformly from {1, . . . , n}, and
         x_s^{t+1} = x_s^t − η {∇f_{i_t}(x_s^t) − ∇f_{i_t}(x_s^old) + ∇F(x_s^old)}
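A minimal NumPy sketch of Algorithm 12.2, again assuming a callable `grad_f_i(x, i)` for the component gradients (the interface and parameter names are ours):

```python
import numpy as np

def svrg(grad_f_i, x0, n, eta, m, num_epochs, seed=0):
    """SVRG (Algorithm 12.2): one batch gradient per epoch plus m cheap inner steps."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    for s in range(num_epochs):
        x_old = x.copy()                                            # snapshot
        batch_grad = np.mean([grad_f_i(x_old, i) for i in range(n)], axis=0)
        for t in range(m):
            i = rng.integers(n)
            g = grad_f_i(x, i) - grad_f_i(x_old, i) + batch_grad    # variance-reduced estimate
            x -= eta * g
    return x
```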


Page 14: Variance reduction for stochastic gradient methods

Remark

• constant stepsize η

• each epoch contains 2m + n gradient computations
  ◦ the batch gradient is computed only once every m iterations
  ◦ the average per-iteration cost of SVRG is comparable to that of SGD if m ≳ n


Page 15: Variance reduction for stochastic gradient methods

Convergence analysis of SVRG

Theorem 12.1

Assume each f_i is convex and L-smooth, and F is µ-strongly convex. Choose m large enough s.t.

$$\rho \,=\, \frac{1}{\mu\eta(1 - 2L\eta)m} + \frac{2L\eta}{1 - 2L\eta} \,<\, 1;$$

then

$$\mathbb{E}\big[F(x_s^{\mathrm{old}}) - F(x^*)\big] \;\le\; \rho^s \big[F(x_0^{\mathrm{old}}) - F(x^*)\big]$$

• linear convergence: choosing m ≳ L/µ = κ and constant stepsize η ≍ 1/L yields 0 < ρ < 1/2
  =⇒ O(log(1/ε)) epochs to attain ε accuracy

• total computational cost:

$$\underbrace{(m + n)}_{\#\text{ grad computations per epoch}} \cdot\, \log\frac{1}{\varepsilon} \;\asymp\; (n + \kappa)\, \log\frac{1}{\varepsilon} \qquad \text{if } m \asymp \max\{n, \kappa\}$$
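As a sanity check with hypothetical numbers (ours, not the lecture's): take η = 0.1/L and m = 50κ. Then, using κ = L/µ,

$$\rho = \frac{1}{\mu \cdot \frac{0.1}{L} \cdot (1 - 0.2) \cdot 50\kappa} + \frac{0.2}{1 - 0.2} = \frac{1}{4} + \frac{1}{4} = \frac{1}{2},$$

so each epoch halves the expected suboptimality gap, and log₂(1/ε) epochs suffice for ε accuracy.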



Page 17: Variance reduction for stochastic gradient methods

Proof of Theorem 12.1

Here, we provide the proof for an alternative version, where in each epoch

$$x_{s+1}^{\mathrm{old}} = x_s^j \quad \text{with } j \sim \mathrm{Unif}(0, \cdots, m-1) \ \ (\text{rather than } j = m) \qquad (12.1)$$

The interested reader is referred to Tan et al. ’16 for the proof of the original version


Page 18: Variance reduction for stochastic gradient methods

Proof of Theorem 12.1

Let g_s^t := ∇f_{i_t}(x_s^t) − ∇f_{i_t}(x_s^old) + ∇F(x_s^old) for simplicity. As usual, conditional on everything prior to x_s^{t+1}, one has

$$\begin{aligned}
\mathbb{E}\big[\|x_s^{t+1} - x^*\|_2^2\big] &= \mathbb{E}\big[\|x_s^t - \eta g_s^t - x^*\|_2^2\big] \\
&= \|x_s^t - x^*\|_2^2 - 2\eta\, (x_s^t - x^*)^\top \mathbb{E}\big[g_s^t\big] + \eta^2\, \mathbb{E}\big[\|g_s^t\|_2^2\big] \\
&= \|x_s^t - x^*\|_2^2 - 2\eta\, (x_s^t - x^*)^\top \nabla F(x_s^t) + \eta^2\, \mathbb{E}\big[\|g_s^t\|_2^2\big] && (g_s^t \text{ is an unbiased estimate of } \nabla F(x_s^t)) \\
&\le \|x_s^t - x^*\|_2^2 - 2\eta\, \big(F(x_s^t) - F(x^*)\big) + \eta^2\, \mathbb{E}\big[\|g_s^t\|_2^2\big] && (\text{by convexity}) \qquad (12.2)
\end{aligned}$$

• key step: control E[‖g_s^t‖₂²] — we’d like to upper bound it via the (relative) objective value


Page 19: Variance reduction for stochastic gradient methods

Proof of Theorem 12.1

main pillar: control E[‖g_s^t‖₂²] via . . .

Lemma 12.2

$$\mathbb{E}\big[\|g_s^t\|_2^2\big] \;\le\; 4L\, \big[F(x_s^t) - F(x^*) + F(x_s^{\mathrm{old}}) - F(x^*)\big]$$

this means if x_s^t ≈ x_s^old ≈ x^*, then E[‖g_s^t‖₂²] ≈ 0 (reduced variance)

this allows one to obtain: conditional on everything prior to x_s^{t+1},

$$\begin{aligned}
\mathbb{E}\big[\|x_s^{t+1} - x^*\|_2^2\big] &\overset{(12.2)}{\le} \|x_s^t - x^*\|_2^2 - 2\eta\big[F(x_s^t) - F(x^*)\big] + 4L\eta^2\big[F(x_s^t) - F(x^*) + F(x_s^{\mathrm{old}}) - F(x^*)\big] \\
&= \|x_s^t - x^*\|_2^2 - 2\eta(1 - 2L\eta)\big[F(x_s^t) - F(x^*)\big] + 4L\eta^2\big[F(x_s^{\mathrm{old}}) - F(x^*)\big] \qquad (12.3)
\end{aligned}$$



Page 21: Variance reduction for stochastic gradient methods

Proof of Theorem 12.1 (cont.)

Taking expectation w.r.t. all the history, we have

$$\begin{aligned}
2\eta(1 - 2L\eta)\, m\, \mathbb{E}\big[F(x_{s+1}^{\mathrm{old}}) - F(x^*)\big]
&= 2\eta(1 - 2L\eta) \sum_{t=0}^{m-1} \mathbb{E}\big[F(x_s^t) - F(x^*)\big] && \text{by } (12.1) \\
&\le \underbrace{\mathbb{E}\big[\|x_s^m - x^*\|_2^2\big]}_{\ge 0} + 2\eta(1 - 2L\eta) \sum_{t=0}^{m-1} \mathbb{E}\big[F(x_s^t) - F(x^*)\big] \\
&\le \mathbb{E}\big[\|x_s^0 - x^*\|_2^2\big] + 4Lm\eta^2\, \mathbb{E}\big[F(x_s^{\mathrm{old}}) - F(x^*)\big] && (\text{apply } (12.3) \text{ recursively}) \\
&= \mathbb{E}\big[\|x_s^{\mathrm{old}} - x^*\|_2^2\big] + 4Lm\eta^2\, \mathbb{E}\big[F(x_s^{\mathrm{old}}) - F(x^*)\big] \\
&\le \frac{2}{\mu}\, \mathbb{E}\big[F(x_s^{\mathrm{old}}) - F(x^*)\big] + 4Lm\eta^2\, \mathbb{E}\big[F(x_s^{\mathrm{old}}) - F(x^*)\big] && (\text{strong convexity}) \\
&= \Big(\frac{2}{\mu} + 4Lm\eta^2\Big)\, \mathbb{E}\big[F(x_s^{\mathrm{old}}) - F(x^*)\big]
\end{aligned}$$


Page 22: Variance reduction for stochastic gradient methods

Proof of Theorem 12.1 (cont.)

Consequently,

$$\mathbb{E}\big[F(x_{s+1}^{\mathrm{old}}) - F(x^*)\big] \;\le\; \frac{\frac{2}{\mu} + 4Lm\eta^2}{2\eta(1 - 2L\eta)m}\, \mathbb{E}\big[F(x_s^{\mathrm{old}}) - F(x^*)\big] \;=\; \underbrace{\Big(\frac{1}{\mu\eta(1 - 2L\eta)m} + \frac{2L\eta}{1 - 2L\eta}\Big)}_{=\,\rho}\, \mathbb{E}\big[F(x_s^{\mathrm{old}}) - F(x^*)\big]$$

Applying this bound recursively establishes the theorem.


Page 23: Variance reduction for stochastic gradient methods

Proof of Lemma 12.2

$$\begin{aligned}
&\mathbb{E}\big[\|\nabla f_{i_t}(x_s^t) - \nabla f_{i_t}(x_s^{\mathrm{old}}) + \nabla F(x_s^{\mathrm{old}})\|_2^2\big] \\
&= \mathbb{E}\big[\|\nabla f_{i_t}(x_s^t) - \nabla f_{i_t}(x^*) - \big(\nabla f_{i_t}(x_s^{\mathrm{old}}) - \nabla f_{i_t}(x^*) - \nabla F(x_s^{\mathrm{old}})\big)\|_2^2\big] \\
&\le 2\,\mathbb{E}\big[\|\nabla f_{i_t}(x_s^t) - \nabla f_{i_t}(x^*)\|_2^2\big] + 2\,\mathbb{E}\big[\|\nabla f_{i_t}(x_s^{\mathrm{old}}) - \nabla f_{i_t}(x^*) - \nabla F(x_s^{\mathrm{old}})\|_2^2\big] \\
&= 2\,\mathbb{E}\big[\|\nabla f_{i_t}(x_s^t) - \nabla f_{i_t}(x^*)\|_2^2\big] + 2\,\mathbb{E}\big[\big\|\nabla f_{i_t}(x_s^{\mathrm{old}}) - \nabla f_{i_t}(x^*) - \mathbb{E}\big[\nabla f_{i_t}(x_s^{\mathrm{old}}) - \nabla f_{i_t}(x^*)\big]\big\|_2^2\big] \\
&\qquad (\text{since } \mathbb{E}[\nabla f_{i_t}(x^*)] = \nabla F(x^*) = 0) \\
&\le 2\,\mathbb{E}\big[\|\nabla f_{i_t}(x_s^t) - \nabla f_{i_t}(x^*)\|_2^2\big] + 2\,\mathbb{E}\big[\|\nabla f_{i_t}(x_s^{\mathrm{old}}) - \nabla f_{i_t}(x^*)\|_2^2\big] \qquad (\mathbb{E}\|z - \mathbb{E}z\|_2^2 \le \mathbb{E}\|z\|_2^2) \\
&\le 4L\, \big[F(x_s^t) - F(x^*) + F(x_s^{\mathrm{old}}) - F(x^*)\big]
\end{aligned}$$

where the first inequality uses ‖a − b‖₂² ≤ 2‖a‖₂² + 2‖b‖₂², and the last inequality would hold if we could justify

$$\frac{1}{n} \sum_{i=1}^n \big\|\nabla f_i(x) - \nabla f_i(x^*)\big\|_2^2 \;\le\; 2L\, \big[F(x) - F(x^*)\big] \qquad (12.4)$$

(which relies on both the smoothness and the convexity of f_i)


Page 24: Variance reduction for stochastic gradient methods

Proof of Lemma 12.2 (cont.)

To establish (12.4), observe from the smoothness and convexity of f_i that

$$\frac{1}{2L}\big\|\nabla f_i(x) - \nabla f_i(x^*)\big\|_2^2 \;\le\; f_i(x) - f_i(x^*) - \nabla f_i(x^*)^\top (x - x^*)$$

(an equivalent characterization of L-smoothness for convex f_i). Summing over all i and recognizing that ∇F(x^*) = 0 yield

$$\frac{1}{2L} \sum_{i=1}^n \big\|\nabla f_i(x) - \nabla f_i(x^*)\big\|_2^2 \;\le\; nF(x) - nF(x^*) - n\big(\nabla F(x^*)\big)^\top (x - x^*) \;=\; nF(x) - nF(x^*)$$

as claimed


Page 25: Variance reduction for stochastic gradient methods

Numerical example: logistic regression

— Johnson, Zhang ’13

[Figure (Johnson & Zhang ’13, Figures 3–4): training loss residual P(w) − P(w*) and test error rate vs. #grad/n, comparing SGD-best, SDCA, and SVRG on rcv1, protein, covtype, and CIFAR-10 (ℓ2-regularized logistic regression; convex case), plus training loss and gradient-variance curves for nonconvex neural nets on MNIST and CIFAR-10. SVRG reduces the variance and converges faster than the best-tuned SGD.]

`2-regularized logistic regression on CIFAR-10


Page 26: Variance reduction for stochastic gradient methods

Comparisons with GD and SGD

             SVRG               GD              SGD
comp. cost   (n + κ) log(1/ε)   nκ log(1/ε)     κ²/ε (practically often κ/ε)


Page 27: Variance reduction for stochastic gradient methods

Proximal extension

$$\underset{x \in \mathbb{R}^d}{\text{minimize}} \quad F(x) = \frac{1}{n} \sum_{i=1}^n f_i(x) + \psi(x)$$

• fi: convex and L-smooth

• F : µ-strongly convex

• κ := L/µ: condition number

• ψ: potentially non-smooth


Page 28: Variance reduction for stochastic gradient methods

Proximal extension (Xiao, Zhang ’14)

Algorithm 12.3 Prox-SVRG for finite-sum optimization
1: for s = 1, 2, . . . do
2:   x_s^old ← x_{s−1}^m, and compute the batch gradient ∇F(x_s^old)   // update snapshot
3:   initialize x_s^0 ← x_s^old
4:   for t = 0, . . . , m − 1 do   // each epoch contains m iterations
5:     choose i_t uniformly from {1, . . . , n}, and
         x_s^{t+1} = prox_{ηψ}( x_s^t − η {∇f_{i_t}(x_s^t) − ∇f_{i_t}(x_s^old) + ∇F(x_s^old)} )
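A sketch of Prox-SVRG specialized to the Lasso-style case ψ(x) = λ‖x‖₁, where the proximal map is soft-thresholding (the choice of ψ and all names are our assumptions):

```python
import numpy as np

def prox_l1(z, tau):
    """prox_{tau * ||.||_1}(z): elementwise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def prox_svrg(grad_f_i, x0, n, eta, lam, m, num_epochs, seed=0):
    """Prox-SVRG (Algorithm 12.3) with psi(x) = lam * ||x||_1."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    for s in range(num_epochs):
        x_old = x.copy()
        batch_grad = np.mean([grad_f_i(x_old, i) for i in range(n)], axis=0)
        for t in range(m):
            i = rng.integers(n)
            g = grad_f_i(x, i) - grad_f_i(x_old, i) + batch_grad
            x = prox_l1(x - eta * g, eta * lam)   # proximal step replaces the plain update
    return x
```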


Page 29: Variance reduction for stochastic gradient methods

Stochastic recursive gradient algorithm (SARAH)

Page 30: Variance reduction for stochastic gradient methods

Nonconvex and smooth problems

$$\underset{x \in \mathbb{R}^d}{\text{minimize}} \quad F(x) = \frac{1}{n} \sum_{i=1}^n f_i(x)$$

• fi: L-smooth, potentially nonconvex


Page 31: Variance reduction for stochastic gradient methods

Recursive stochastic gradient estimates

— Nguyen, Liu, Scheinberg, Takac ’17

key idea: recursive / adaptive updates of the (stochastic) gradient estimates:

$$g^t = \nabla f_{i_t}(x^t) - \nabla f_{i_t}(x^{t-1}) + g^{t-1} \qquad (12.5)$$
$$x^{t+1} = x^t - \eta\, g^t$$

comparison to SVRG (which uses a fixed snapshot point for the entire epoch):

$$(\text{SVRG}) \qquad g^t = \nabla f_{i_t}(x^t) - \nabla f_{i_t}(x^{\mathrm{old}}) + \nabla F(x^{\mathrm{old}})$$


Page 32: Variance reduction for stochastic gradient methods

Restarting gradient estimate every epoch

For many (e.g. strongly convex) problems, the recursive gradient estimate g^t may decay fast (variance ↓; bias relative to ∇F(x^t) ↑)

• g^t may quickly deviate from the target gradient ∇F(x^t)

• progress stalls, as g^t cannot guarantee sufficient descent

solution: reset g^t every few iterations to calibrate it with the true batch gradient


Page 33: Variance reduction for stochastic gradient methods

Bias of gradient estimates

Unlike SVRG, g^t is NOT an unbiased estimate of ∇F(x^t):

$$\mathbb{E}\big[g^t \mid \text{everything prior to } x^t\big] = \nabla F(x^t) \underbrace{-\, \nabla F(x^{t-1}) + g^{t-1}}_{\neq\, 0}$$

But if we average out all the randomness, we have (exercise!)

$$\mathbb{E}\big[g^t\big] = \mathbb{E}\big[\nabla F(x^t)\big]$$


Page 34: Variance reduction for stochastic gradient methods

StochAstic Recursive grAdient algoritHm

Algorithm 12.4 SARAH (Nguyen et al. ’17)
1: for s = 1, 2, . . . , S do
2:   x_s^0 ← x_{s−1}^{m+1}, and compute the batch gradient g_s^0 = ∇F(x_s^0)   // restart g anew
3:   x_s^1 = x_s^0 − η g_s^0
4:   for t = 1, . . . , m do
5:     choose i_t uniformly from {1, . . . , n}
6:     g_s^t = ∇f_{i_t}(x_s^t) − ∇f_{i_t}(x_s^{t−1}) + g_s^{t−1}   // recursive stochastic gradient
7:     x_s^{t+1} = x_s^t − η g_s^t
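A NumPy sketch of Algorithm 12.4, once more assuming a user-supplied `grad_f_i(x, i)` (our interface, not the paper's):

```python
import numpy as np

def sarah(grad_f_i, x0, n, eta, m, num_epochs, seed=0):
    """SARAH (Algorithm 12.4): recursive gradient estimate, restarted every epoch."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    for s in range(num_epochs):
        g = np.mean([grad_f_i(x, i) for i in range(n)], axis=0)  # batch restart: g_s^0
        x_prev, x = x, x - eta * g                               # x_s^1 = x_s^0 - eta * g_s^0
        for t in range(1, m + 1):
            i = rng.integers(n)
            g = grad_f_i(x, i) - grad_f_i(x_prev, i) + g         # recursion (12.5)
            x_prev, x = x, x - eta * g
    return x
```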


Page 35: Variance reduction for stochastic gradient methods

Convergence analysis of SARAH (nonconvex)

Theorem 12.3 (Nguyen et al. ’19)

Suppose each f_i is L-smooth. Then SARAH with η ≲ 1/(L√m) obeys

$$\frac{1}{(m+1)S} \sum_{s=1}^{S} \sum_{t=0}^{m} \mathbb{E}\big[\|\nabla F(x_s^t)\|_2^2\big] \;\le\; \frac{2}{\eta\,(m+1)S}\, \big[F(x_0^0) - F(x^*)\big]$$

• iteration complexity for finding an ε-approximate stationary point (i.e. ‖∇F(x)‖₂ ≤ ε):

$$O\Big(n + \frac{L\sqrt{n}}{\varepsilon^2}\Big) \qquad \Big(\text{setting } m \asymp n,\ \eta \asymp \frac{1}{L\sqrt{m}}\Big)$$

• also derived by Fang et al. ’18 (for a SARAH-like algorithm “Spider”) and improved by Wang et al. ’19 (for “SpiderBoost”)



Page 37: Variance reduction for stochastic gradient methods

Proof of Theorem 12.3

Theorem 12.3 follows immediately from the following claim on the total objective improvement in one epoch (why?):

$$\mathbb{E}\big[F(x_s^{m+1})\big] \;\le\; \mathbb{E}\big[F(x_s^0)\big] - \frac{\eta}{2} \sum_{t=0}^{m} \mathbb{E}\big[\|\nabla F(x_s^t)\|_2^2\big] \qquad (12.6)$$

We will then focus on establishing (12.6)


Page 38: Variance reduction for stochastic gradient methods

Proof of Theorem 12.3 (cont.)

To establish (12.6), recall that the smoothness assumption gives

$$\mathbb{E}\big[F(x_s^{t+1})\big] \;\le\; \mathbb{E}\big[F(x_s^t)\big] - \eta\, \mathbb{E}\big[\nabla F(x_s^t)^\top g_s^t\big] + \frac{L\eta^2}{2}\, \mathbb{E}\big[\|g_s^t\|_2^2\big] \qquad (12.7)$$

Since g_s^t is not an unbiased estimate of ∇F(x_s^t), we first decouple

$$2\, \mathbb{E}\big[\nabla F(x_s^t)^\top g_s^t\big] = \underbrace{\mathbb{E}\big[\|\nabla F(x_s^t)\|_2^2\big]}_{\text{desired gradient estimate}} + \underbrace{\mathbb{E}\big[\|g_s^t\|_2^2\big]}_{\text{variance}} - \underbrace{\mathbb{E}\big[\|\nabla F(x_s^t) - g_s^t\|_2^2\big]}_{\text{squared bias of gradient estimate}}$$

Substitution into (12.7) with straightforward algebra gives

$$\mathbb{E}\big[F(x_s^{t+1})\big] \;\le\; \mathbb{E}\big[F(x_s^t)\big] - \frac{\eta}{2}\, \mathbb{E}\big[\|\nabla F(x_s^t)\|_2^2\big] + \frac{\eta}{2}\, \mathbb{E}\big[\|\nabla F(x_s^t) - g_s^t\|_2^2\big] - \Big(\frac{\eta}{2} - \frac{L\eta^2}{2}\Big)\, \mathbb{E}\big[\|g_s^t\|_2^2\big]$$


Page 39: Variance reduction for stochastic gradient methods

Proof of Theorem 12.3 (cont.)

Sum over t = 0, . . . , m to arrive at

$$\mathbb{E}\big[F(x_s^{m+1})\big] \;\le\; \mathbb{E}\big[F(x_s^0)\big] - \frac{\eta}{2} \sum_{t=0}^{m} \mathbb{E}\big[\|\nabla F(x_s^t)\|_2^2\big] + \frac{\eta}{2} \Big\{ \sum_{t=0}^{m} \mathbb{E}\big[\|\nabla F(x_s^t) - g_s^t\|_2^2\big] - \underbrace{(1 - L\eta)}_{\ge 1/2} \sum_{t=0}^{m} \mathbb{E}\big[\|g_s^t\|_2^2\big] \Big\}$$

The proof of (12.6) is thus complete if we can justify

Lemma 12.4

If η ≤ 1/(L√m), then (for a fixed η, the epoch length m cannot be too large)

$$\underbrace{\sum_{t=0}^{m} \mathbb{E}\big[\|\nabla F(x_s^t) - g_s^t\|_2^2\big]}_{\text{accumulated squared bias of gradient estimates}} \;\le\; \frac{1}{2} \underbrace{\sum_{t=0}^{m} \mathbb{E}\big[\|g_s^t\|_2^2\big]}_{\text{accumulated variance}}$$

• informally, this says the accumulated squared bias of the gradient estimates (w.r.t. the batch gradients) can be controlled by the accumulated variance


Page 40: Variance reduction for stochastic gradient methods

Proof of Lemma 12.4

Key step:

Lemma 12.5

$$\mathbb{E}\big[\|\nabla F(x_s^t) - g_s^t\|_2^2\big] \;\le\; \sum_{k=1}^{t} \mathbb{E}\big[\|g_s^k - g_s^{k-1}\|_2^2\big]$$

• converts the bias of the gradient estimates into the differences of consecutive gradient estimates (a consequence of the smoothness and the recursive formula (12.5) for g_s^t)


Page 41: Variance reduction for stochastic gradient methods

Proof of Lemma 12.4 (cont.)

From Lemma 12.5, it suffices to connect {‖g_s^t − g_s^{t−1}‖₂} with {‖g_s^t‖₂}:

$$\big\|g_s^t - g_s^{t-1}\big\|_2^2 \;\overset{(12.5)}{=}\; \big\|\nabla f_{i_t}(x_s^t) - \nabla f_{i_t}(x_s^{t-1})\big\|_2^2 \;\overset{\text{smoothness}}{\le}\; L^2 \big\|x_s^t - x_s^{t-1}\big\|_2^2 \;=\; \eta^2 L^2 \big\|g_s^{t-1}\big\|_2^2$$

Invoking Lemma 12.5 then gives

$$\mathbb{E}\big[\|\nabla F(x_s^t) - g_s^t\|_2^2\big] \;\le\; \sum_{k=1}^{t} \mathbb{E}\big[\|g_s^k - g_s^{k-1}\|_2^2\big] \;\le\; \eta^2 L^2 \sum_{k=1}^{t} \mathbb{E}\big[\|g_s^{k-1}\|_2^2\big]$$

Summing over t = 0, · · · , m, we obtain

$$\sum_{t=0}^{m} \mathbb{E}\big[\|\nabla F(x_s^t) - g_s^t\|_2^2\big] \;\le\; \eta^2 L^2 m \sum_{t=0}^{m-1} \mathbb{E}\big[\|g_s^t\|_2^2\big]$$

which establishes Lemma 12.4 if η ≲ 1/(L√m)


Page 42: Variance reduction for stochastic gradient methods

Proof of Lemma 12.5

Since this lemma only concerns a single epoch, we shall drop the dependency on s for simplicity. Let F_k contain all info up to x^k and g^{k−1}. Then

$$\begin{aligned}
&\mathbb{E}\big[\|\nabla F(x^k) - g^k\|_2^2 \mid \mathcal{F}_k\big] \\
&= \mathbb{E}\big[\big\|\nabla F(x^{k-1}) - g^{k-1} + \big(\nabla F(x^k) - \nabla F(x^{k-1})\big) - \big(g^k - g^{k-1}\big)\big\|_2^2 \mid \mathcal{F}_k\big] \\
&= \|\nabla F(x^{k-1}) - g^{k-1}\|_2^2 + \|\nabla F(x^k) - \nabla F(x^{k-1})\|_2^2 + \mathbb{E}\big[\|g^k - g^{k-1}\|_2^2 \mid \mathcal{F}_k\big] \\
&\quad + 2\big\langle \nabla F(x^{k-1}) - g^{k-1},\, \nabla F(x^k) - \nabla F(x^{k-1}) \big\rangle \\
&\quad - 2\big\langle \nabla F(x^{k-1}) - g^{k-1},\, \mathbb{E}\big[g^k - g^{k-1} \mid \mathcal{F}_k\big] \big\rangle - 2\big\langle \nabla F(x^k) - \nabla F(x^{k-1}),\, \mathbb{E}\big[g^k - g^{k-1} \mid \mathcal{F}_k\big] \big\rangle \\
&\overset{\text{(exercise)}}{=} \|\nabla F(x^{k-1}) - g^{k-1}\|_2^2 - \|\nabla F(x^k) - \nabla F(x^{k-1})\|_2^2 + \mathbb{E}\big[\|g^k - g^{k-1}\|_2^2 \mid \mathcal{F}_k\big]
\end{aligned}$$

Since ∇F(x^0) = g^0, summing over k = 1, . . . , t yields

$$\mathbb{E}\big[\|\nabla F(x^t) - g^t\|_2^2\big] = \sum_{k=1}^{t} \mathbb{E}\big[\|g^k - g^{k-1}\|_2^2\big] \underbrace{-\, \sum_{k=1}^{t} \mathbb{E}\big[\|\nabla F(x^k) - \nabla F(x^{k-1})\|_2^2\big]}_{\le\, 0;\ \text{done!}}$$


Page 43: Variance reduction for stochastic gradient methods

Stochastic dual coordinate ascent (SDCA) — a dual perspective

Page 44: Variance reduction for stochastic gradient methods

A class of finite-sum optimization

$$\underset{x \in \mathbb{R}^d}{\text{minimize}} \quad F(x) = \frac{1}{n} \sum_{i=1}^n f_i(x) + \frac{\mu}{2}\|x\|_2^2 \qquad (12.8)$$

• fi: convex and L-smooth


Page 45: Variance reduction for stochastic gradient methods

Dual formulation

The dual problem of (12.8):

$$\underset{\nu}{\text{maximize}} \quad D(\nu) = \frac{1}{n} \sum_{i=1}^n -f_i^*(-\nu_i) - \frac{\mu}{2} \bigg\| \frac{1}{\mu n} \sum_{i=1}^n \nu_i \bigg\|_2^2 \qquad (12.9)$$

• a primal-dual relation:

$$x(\nu) = \frac{1}{\mu n} \sum_{i=1}^n \nu_i \qquad (12.10)$$


Page 46: Variance reduction for stochastic gradient methods

Derivation of the dual formulation

$$\begin{aligned}
&\min_x\ \frac{1}{n} \sum_{i=1}^n f_i(x) + \frac{\mu}{2}\|x\|_2^2 \\
\iff\ &\min_{x,\{z_i\}}\ \frac{1}{n} \sum_{i=1}^n f_i(z_i) + \frac{\mu}{2}\|x\|_2^2 \quad \text{s.t. } z_i = x \\
\iff\ &\max_{\{\nu_i\}}\, \min_{x,\{z_i\}}\ \underbrace{\frac{1}{n} \sum_{i=1}^n f_i(z_i) + \frac{\mu}{2}\|x\|_2^2 + \frac{1}{n} \sum_{i=1}^n \langle \nu_i, z_i - x \rangle}_{\text{Lagrangian}} \\
\iff\ &\max_{\{\nu_i\}}\, \min_{x}\ \frac{1}{n} \sum_{i=1}^n \underbrace{-f_i^*(-\nu_i)}_{\text{conjugate: } f_i^*(\nu)\, :=\, \max_z \{\langle \nu, z \rangle - f_i(z)\}} + \frac{\mu}{2}\|x\|_2^2 - \frac{1}{n} \sum_{i=1}^n \langle \nu_i, x \rangle \\
\iff\ &\max_{\{\nu_i\}}\ \frac{1}{n} \sum_{i=1}^n -f_i^*(-\nu_i) - \frac{\mu}{2} \bigg\| \frac{1}{\mu n} \sum_{i=1}^n \nu_i \bigg\|_2^2 \qquad \Big(\text{the optimal } x = \frac{1}{\mu n} \sum_i \nu_i\Big)
\end{aligned}$$


Page 47: Variance reduction for stochastic gradient methods

Randomized coordinate ascent on dual problem

— Shalev-Shwartz, Zhang ’13

• randomized coordinate ascent: at each iteration, randomly pick one dual (block) coordinate ν_{i_t} of (12.9) to optimize

• maintain the primal-dual relation (12.10):

$$x^t = \frac{1}{\mu n} \sum_{i=1}^n \nu_i^t \qquad (12.11)$$


Page 48: Variance reduction for stochastic gradient methods

Stochastic dual coordinate ascent (SDCA)

Algorithm 12.5 SDCA for finite-sum optimization
1: initialize x^0 = (1/(µn)) ∑_{i=1}^n ν_i^0
2: for t = 0, 1, . . . do
3:   // choose a random coordinate to optimize
4:   choose i_t uniformly from {1, . . . , n}
5:   ∆^t ← argmax_∆ −(1/n) f*_{i_t}(−ν_{i_t}^t − ∆) − (µ/2) ‖x^t + (1/(µn)) ∆‖₂²   // the optimal step with all {ν_i^t}_{i ≠ i_t} fixed
6:   ν_i^{t+1} ← ν_i^t + ∆^t 1{i = i_t}  (1 ≤ i ≤ n)   // update only the i_t-th coordinate
7:   x^{t+1} ← x^t + (1/(µn)) ∆^t   // based on (12.11)


Page 49: Variance reduction for stochastic gradient methods

A variant of SDCA without duality

SDCA might not be applicable if the conjugate functions are difficult to evaluate

This calls for a dual-free version of SDCA


Page 50: Variance reduction for stochastic gradient methods

A variant of SDCA without duality

— S. Shalev-Shwartz ’16

Algorithm 12.6 SDCA without duality
1: initialize x^0 = (1/(µn)) ∑_{i=1}^n ν_i^0
2: for t = 0, 1, . . . do
3:   // choose a random coordinate to optimize
4:   choose i_t uniformly from {1, . . . , n}
5:   ∆^t ← −ηµn (∇f_{i_t}(x^t) + ν_{i_t}^t)
6:   ν_i^{t+1} ← ν_i^t + ∆^t 1{i = i_t}  (1 ≤ i ≤ n)   // update only the i_t-th coordinate
7:   x^{t+1} ← x^t + (1/(µn)) ∆^t   // based on (12.11)
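A NumPy sketch of Algorithm 12.6; the initialization ν_i^0 = µx^0 and the interface are our own choices (any ν^0 works, as long as x^0 matches (12.11)):

```python
import numpy as np

def sdca_dual_free(grad_f_i, x0, n, mu, eta, num_iters, seed=0):
    """Dual-free SDCA (Algorithm 12.6); keeps x^t = (1/(mu*n)) * sum_i nu_i^t."""
    rng = np.random.default_rng(seed)
    nu = np.tile(mu * x0, (n, 1))      # nu_i^0 = mu * x0, so (12.11) gives x^0 = x0
    x = x0.astype(float).copy()
    for t in range(num_iters):
        i = rng.integers(n)
        delta = -eta * mu * n * (grad_f_i(x, i) + nu[i])
        nu[i] += delta                 # update only the i_t-th dual coordinate
        x += delta / (mu * n)          # maintain the primal-dual relation (12.11)
    return x
```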


Page 51: Variance reduction for stochastic gradient methods

A variant of SDCA without duality

A little intuition

• the optimality condition requires (check!)

$$\nu_i^* = -\nabla f_i(x^*), \quad \forall i \qquad (12.12)$$

• the update rule in Algorithm 12.6 can be rewritten as

$$\nu_{i_t}^{t+1} \leftarrow (1 - \eta\mu n)\, \nu_{i_t}^t + \eta\mu n\, \big(-\nabla f_{i_t}(x^t)\big)$$

  — a convex combination of the current dual iterate and the (negative) gradient component; when the iteration converges, the fixed point satisfies (12.12)



Page 53: Variance reduction for stochastic gradient methods

SDCA as variance-reduced SGD

The SDCA (without duality) update rule reads

$$x^{t+1} = x^t - \eta\, \underbrace{\big( \nabla f_{i_t}(x^t) + \nu_{i_t}^t \big)}_{:=\, g^t}$$

It is straightforward to verify that g^t is an unbiased gradient estimate:

$$\mathbb{E}\big[g^t\big] = \mathbb{E}\big[\nabla f_{i_t}(x^t)\big] + \mathbb{E}\big[\nu_{i_t}^t\big] = \frac{1}{n} \sum_{i=1}^n \nabla f_i(x^t) + \underbrace{\frac{1}{n} \sum_{i=1}^n \nu_i^t}_{=\, \mu x^t \text{ by } (12.11)} = \nabla F(x^t)$$

Moreover, the variability of g^t goes to 0 as we converge to the optimizer:

$$\mathbb{E}\big[\|g^t\|_2^2\big] = \mathbb{E}\big[\|\nu_{i_t}^t - \nu_{i_t}^* + \nu_{i_t}^* + \nabla f_{i_t}(x^t)\|_2^2\big] \;\le\; 2\, \underbrace{\mathbb{E}\big[\|\nu_{i_t}^t - \nu_{i_t}^*\|_2^2\big]}_{\to\, 0 \text{ as } t \to \infty} + 2\, \underbrace{\mathbb{E}\big[\|\nu_{i_t}^* + \nabla f_{i_t}(x^t)\|_2^2\big]}_{\lesssim\, \|x^t - x^*\|_2^2\ \text{(Shalev-Shwartz ’16)}}$$


Page 54: Variance reduction for stochastic gradient methods

Convergence guarantees of SDCA

Theorem 12.6 (informal, Shalev-Shwartz ’16)

Assume each f_i is convex and L-smooth, and set η = 1/(L + µn). Then it takes SDCA (without duality) O((n + L/µ) log(1/ε)) iterations to yield ε accuracy

• the same computational complexity as SVRG

• storage complexity: O(nd) (needs to store {ν_i}_{1 ≤ i ≤ n})


Page 55: Variance reduction for stochastic gradient methods

Reference

[1] “Recent advances in stochastic convex and non-convex optimization,” Z. Allen-Zhu, ICML Tutorial, 2017.

[2] “Accelerating stochastic gradient descent using predictive variance reduction,” R. Johnson, T. Zhang, NIPS, 2013.

[3] “Barzilai-Borwein step size for stochastic gradient descent,” C. Tan, S. Ma, Y. H. Dai, Y. Qian, NIPS, 2016.

[4] “A proximal stochastic gradient method with progressive variance reduction,” L. Xiao, T. Zhang, SIAM Journal on Optimization, 2014.

[5] “Minimizing finite sums with the stochastic average gradient,” M. Schmidt, N. Le Roux, F. Bach, Mathematical Programming, 2013.

[6] “SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives,” A. Defazio, F. Bach, S. Lacoste-Julien, NIPS, 2014.


Page 56: Variance reduction for stochastic gradient methods

Reference

[7] “Variance reduction for faster non-convex optimization,” Z. Allen-Zhu, E. Hazan, ICML, 2016.

[8] “Katyusha: The first direct acceleration of stochastic gradient methods,” Z. Allen-Zhu, STOC, 2017.

[9] “SARAH: A novel method for machine learning problems using stochastic recursive gradient,” L. Nguyen, J. Liu, K. Scheinberg, M. Takac, ICML, 2017.

[10] “Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator,” C. Fang, C. Li, Z. Lin, T. Zhang, NIPS, 2018.

[11] “SpiderBoost and momentum: Faster variance reduction algorithms,” Z. Wang, K. Ji, Y. Zhou, Y. Liang, V. Tarokh, NIPS, 2019.


Page 57: Variance reduction for stochastic gradient methods

Reference

[12] “Optimal finite-sum smooth non-convex optimization with SARAH,” L. Nguyen, M. van Dijk, D. Phan, P. Nguyen, T. Weng, J. Kalagnanam, arXiv:1901.07648, 2019.

[13] “Stochastic dual coordinate ascent methods for regularized loss minimization,” S. Shalev-Shwartz, T. Zhang, Journal of Machine Learning Research, 2013.

[14] “SDCA without duality, regularization, and individual convexity,” S. Shalev-Shwartz, ICML, 2016.

[15] “Optimization methods for large-scale machine learning,” L. Bottou, F. Curtis, J. Nocedal, 2016.
