
Stochastic subgradient methods
Based on material by Mark Schmidt

Julieta Martinez

University of British Columbia

October 06, 2015

Outline: Introduction · Stochastic gradient · Convergence rates · Subgradients & subdifferentials · Stochastic subgradient method · Recap

Introduction

- We are interested in a typical machine learning problem:

  \min_{x \in \mathbb{R}^D} \frac{1}{N} \sum_{i=1}^{N} L(x, a_i, b_i) + \lambda \cdot r(x),

  a data-fitting term plus a regularizer.

- Last time, we talked about gradient methods, which work when D is large.
- Today we will talk about stochastic subgradient methods, which work when N is large.

Julieta Martinez Subgradient methods

- We want to minimize a function f(x) = \frac{1}{N} \sum_{i=1}^{N} f_i(x).

- A deterministic gradient method computes the gradient exactly:

  x_{t+1} = x_t - \alpha_t \nabla f(x_t) = x_t - \alpha_t \cdot \frac{1}{N} \sum_{i=1}^{N} \nabla f_i(x_t)

  - Computing the exact gradient is O(N).
  - We can get convergence with constant \alpha_t or using line search.

- A stochastic gradient method [Robbins and Monro, 1951] estimates the gradient from a random sample i_t \sim \{1, 2, \ldots, N\}:

  x_{t+1} = x_t - \alpha_t \nabla f_{i_t}(x_t)

  - Note that this gives an unbiased estimate of the gradient: \mathbb{E}[\nabla f_{i_t}(x)] = \frac{1}{N} \sum_{i=1}^{N} \nabla f_i(x) = \nabla f(x).
  - The iteration cost no longer depends on N.
  - Convergence requires \alpha_t \to 0.
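The two updates above can be sketched in a few lines of numpy. The least-squares instance, data, and step size below are invented purely for illustration:

```python
import numpy as np

# Minimal sketch of both updates on a least-squares instance
# f(x) = (1/N) * sum_i 0.5 * (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
N, D = 1000, 10
A = rng.standard_normal((N, D))   # rows are the a_i
b = rng.standard_normal(N)

def full_gradient(x):
    # Exact gradient: O(N), touches every example
    return A.T @ (A @ x - b) / N

def stochastic_gradient(x):
    # Unbiased estimate from one random example: cost independent of N
    i = rng.integers(N)
    return A[i] * (A[i] @ x - b[i])

alpha = 0.1
x = np.zeros(D)
x_det = x - alpha * full_gradient(x)        # deterministic step
x_sto = x - alpha * stochastic_gradient(x)  # stochastic step
```

Averaging `stochastic_gradient` over all i recovers `full_gradient`, which is exactly the unbiasedness identity above.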

Convergence

Stochastic methods are N times faster per iteration, but what about convergence?

  Assumption        Deterministic                Stochastic
  Convex            O(1/t^2)                     O(1/\sqrt{t})
  Strongly convex   O((1 - \sqrt{\mu/L})^t)      O(1/t)

- Stochastic methods have a lower iteration cost, but a slower convergence rate.
- The rate is sublinear even under strong convexity.
- These bounds are unimprovable if only unbiased gradient estimates are available.
- Momentum/acceleration does not improve the convergence rate.
- For convergence, the momentum term must go to zero [Tseng, 1998].
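A toy simulation (not a proof) of why the step size must decay: with a constant step, the stochastic iterates hover in a noise ball around the optimum, while \alpha_t = 1/t lets them settle. The quadratic problem, data, and step sizes are invented for the demo:

```python
import numpy as np

# SGD on f(x) = (1/N) * sum_i 0.5 * (x - c_i)^2, whose minimizer is c.mean().
rng = np.random.default_rng(1)
N = 500
c = rng.standard_normal(N) + 3.0   # hypothetical data; optimum is c.mean()

def run_sgd(step, T=20000):
    x = 0.0
    for t in range(1, T + 1):
        i = rng.integers(N)
        x -= step(t) * (x - c[i])  # stochastic gradient of 0.5 * (x - c_i)^2
    return x

x_const = run_sgd(lambda t: 0.5)      # constant step: keeps bouncing
x_decay = run_sgd(lambda t: 1.0 / t)  # alpha_t -> 0: converges to c.mean()
```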

Figure: Convergence rates in the strongly convex case (plot omitted).

- Stochastic methods are better in low-accuracy / limited-time settings.
- It can be hard to know when the crossing between the two rates will happen.

- The convergence rates look quite different when the function is non-smooth.
- E.g., consider the binary support vector machine:

  f(x) = \sum_{i=1}^{N} \max\{0, 1 - b_i (x^\top a_i)\} + \lambda \|x\|^2

- Rates for subgradient methods on non-smooth objectives:

  Assumption        Deterministic    Stochastic
  Convex            O(1/\sqrt{t})    O(1/\sqrt{t})
  Strongly convex   O(1/t)           O(1/t)

- Other black-box methods, such as cutting-plane methods, are not faster.
- Take-away point: for non-smooth problems,
  - deterministic methods are not faster than stochastic methods;
  - stochastic methods are a free, N-times-faster lunch.
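A subgradient of one hinge-loss term can be sketched as below. The function names are mine, and the step uses a plain (unaveraged) sampled term plus the regularizer's gradient, as one simple choice:

```python
import numpy as np

# A subgradient of one hinge term f_i(x) = max{0, 1 - b_i * a_i^T x}.
# At the kink (margin exactly 1) any point of the segment between
# -b_i * a_i and 0 is valid; returning 0 there is one valid choice.
def hinge_subgradient(x, a_i, b_i):
    if b_i * (a_i @ x) < 1.0:
        return -b_i * a_i        # hinge active: gradient of 1 - b_i * a_i^T x
    return np.zeros_like(x)      # hinge inactive (or at the kink)

# One stochastic subgradient step for the regularized SVM objective
def svm_step(x, a_i, b_i, lam, alpha):
    d = hinge_subgradient(x, a_i, b_i) + 2.0 * lam * x  # + grad of lam*||x||^2
    return x - alpha * d
```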

- For differentiable convex functions, we have

  f(y) \ge f(x) + \nabla f(x)^\top (y - x), \quad \forall x, y.

- A vector d is a subgradient of a convex function f at x if

  f(y) \ge f(x) + d^\top (y - x), \quad \forall y.

- At a differentiable point x, the only subgradient is \nabla f(x).
- At a non-differentiable point x, we have a set of subgradients, called the subdifferential, \partial f(x).
- Notice that if \vec{0} \in \partial f(x), then x is a global minimizer.

Example

A vector d is a subgradient of a convex function f at x if

  f(y) \ge f(x) + d^\top (y - x), \quad \forall y.

(Figures omitted: a convex f(x); the supporting line f(x) + \nabla f(x)^\top (y - x) at a differentiable point; and the set of supporting lines, \partial f(x), at a kink.)

Another example

Consider the absolute value function |x|. Its subdifferential is

  \partial |x| = \begin{cases} \{1\} & x > 0 \\ \{-1\} & x < 0 \\ [-1, 1] & x = 0 \end{cases}

(Figure omitted: |x| with supporting lines of every slope in [-1, 1] at x = 0.)
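The case analysis above fits in a few lines, together with a numeric check of the subgradient inequality (function names are mine):

```python
# Subdifferential of |x|, returned as a closed interval [lo, hi] of slopes.
def abs_subdifferential(x):
    if x > 0:
        return (1.0, 1.0)
    if x < 0:
        return (-1.0, -1.0)
    return (-1.0, 1.0)   # at the kink, every slope in [-1, 1] supports |x|

# Any d in the interval satisfies |y| >= |x| + d * (y - x) for all y.
def is_subgradient_of_abs(d, x, ys):
    return all(abs(y) >= abs(x) + d * (y - x) - 1e-12 for y in ys)
```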

Subdifferential of max function

- |x| is a special case of the max function.
- Given two convex functions f_1(x) and f_2(x), the subdifferential of \max(f_1(x), f_2(x)) is given by

  \partial \max(f_1(x), f_2(x)) = \begin{cases} \nabla f_1(x) & f_1(x) > f_2(x) \\ \nabla f_2(x) & f_1(x) < f_2(x) \\ \theta \nabla f_1(x) + (1 - \theta) \nabla f_2(x), \ \theta \in [0, 1] & f_1(x) = f_2(x) \end{cases}

- I.e., at a tie, any convex combination of the gradients of the argmax.
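The rule above can be sketched directly; the callables and names below are my own, and |x| = max(x, -x) recovers the previous example:

```python
# Subgradient of max(f1, f2) at x, given the two functions and their
# gradients; theta only matters on the tie set f1(x) == f2(x).
def max_subgradient(f1, g1, f2, g2, x, theta=0.5):
    v1, v2 = f1(x), f2(x)
    if v1 > v2:
        return g1(x)
    if v2 > v1:
        return g2(x)
    # Tie: any convex combination (theta in [0, 1]) is a valid subgradient
    return theta * g1(x) + (1.0 - theta) * g2(x)

# |x| = max(x, -x): slope 1 for x > 0, slope 0 at the kink with theta = 0.5
d_pos = max_subgradient(lambda x: x, lambda x: 1.0,
                        lambda x: -x, lambda x: -1.0, 2.0)
d_tie = max_subgradient(lambda x: x, lambda x: 1.0,
                        lambda x: -x, lambda x: -1.0, 0.0)
```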

The subgradient method

- The basic subgradient method:

  x_{t+1} = x_t - \alpha_t d_t, \quad \text{for some } d_t \in \partial f(x_t).

- The steepest-descent choice of d_t is \arg\min_{d \in \partial f(x_t)} \|d\|.
  - Easy to see in the 1d case. (Figure omitted: a kink with \partial f(x) = [0.4, 1.2], where the minimum-norm subgradient is 0.4.)
  - Easy to find for \ell_1 regularization, but hard in general.
- If d_t \ne \arg\min_{d \in \partial f(x_t)} \|d\|, the objective may increase.
- But \|x_{t+1} - x^*\| \le \|x_t - x^*\| for small enough \alpha_t.
- Again, for convergence, we require \alpha_t \to 0.

- The basic stochastic subgradient method:

  x_{t+1} = x_t - \alpha_t d_{i_t}, \quad \text{for some } d_{i_t} \in \partial f_{i_t}(x_t), \; i_t \sim \{1, 2, \ldots, N\}.
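The stochastic subgradient loop can be sketched as below; `subgrad(x, i)` should return any element of \partial f_{i}(x), and \alpha_0 / t supplies the decreasing steps the method requires. The function name, signature, and worked example are mine:

```python
import numpy as np

# Minimal sketch of the basic stochastic subgradient method.
def stochastic_subgradient_method(subgrad, x0, N, T, alpha0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for t in range(1, T + 1):
        i = int(rng.integers(N))          # i_t ~ uniform over {0, ..., N-1}
        x = x - (alpha0 / t) * subgrad(x, i)
    return x

# Example: minimize (1/N) * sum_i |x - c_i| with every c_i = 2, using the
# sign subgradient; the iterate settles at the kink x = 2.
x_star = stochastic_subgradient_method(
    lambda x, i: np.sign(x - 2.0), x0=[0.0], N=5, T=20000)
```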

Julieta Martinez Subgradient methods

Introduction Stochastic gradient Convergence rates Subgradients & subdifferentials Stochastic subgradient method Recap

The subgradient method

I The basic subgradient method:

x t+1 = x t − αtdt ,

for some dt ∈ ∂f (x t)I The steepest descent dt is argmind∈∂f (x){‖d‖}

I Easy to see in the 1d caseI Easy to find for `1 regularization, but hard in generalI If dt 6= argmin{d ∈ ∂f (x)}‖d‖, the objective may increaseI But ‖x t+1 − x∗‖ ≤ ‖x t − x∗‖ for small enough αI Again, for convergence, we require α→ 0

I The basic stochastic subgradient method

x t+1 = x t − αtdi t

for some di t ∈ ∂fi t(x t), it ∼ {1, 2, . . . ,N}

Julieta Martinez Subgradient methods

Introduction Stochastic gradient Convergence rates Subgradients & subdifferentials Stochastic subgradient method Recap

The subgradient method

I The basic subgradient method:

x t+1 = x t − αtdt ,

for some dt ∈ ∂f (x t)I The steepest descent dt is argmind∈∂f (x){‖d‖}

I Easy to see in the 1d caseI Easy to find for `1 regularization, but hard in generalI If dt 6= argmin{d ∈ ∂f (x)}‖d‖, the objective may increaseI But ‖x t+1 − x∗‖ ≤ ‖x t − x∗‖ for small enough αI Again, for convergence, we require α→ 0

I The basic stochastic subgradient method

x t+1 = x t − αtdi t

for some di t ∈ ∂fi t(x t), it ∼ {1, 2, . . . ,N}

Julieta Martinez Subgradient methods

Introduction Stochastic gradient Convergence rates Subgradients & subdifferentials Stochastic subgradient method Recap

The subgradient method

I The basic subgradient method:

x t+1 = x t − αtdt ,

for some dt ∈ ∂f (x t)I The steepest descent dt is argmind∈∂f (x){‖d‖}

I Easy to see in the 1d caseI Easy to find for `1 regularization, but hard in generalI If dt 6= argmin{d ∈ ∂f (x)}‖d‖, the objective may increaseI But ‖x t+1 − x∗‖ ≤ ‖x t − x∗‖ for small enough αI Again, for convergence, we require α→ 0

I The basic stochastic subgradient method

x t+1 = x t − αtdi t

for some di t ∈ ∂fi t(x t), it ∼ {1, 2, . . . ,N}

Julieta Martinez Subgradient methods
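To make the stochastic update above concrete, here is a minimal sketch (not from the slides) on an ℓ1-regularized least-squares problem; the synthetic data, the step size α_t = 0.05/√t, and the particular subgradient choice sign(x) ∈ ∂‖x‖₁ are all assumptions made for this example.

```python
import numpy as np

def subgrad_l1(x):
    # One valid element of ∂||x||_1: sign(x), taking 0 at kinks (0 ∈ [-1, 1]).
    return np.sign(x)

def stochastic_subgradient(A, b, lam=0.01, iters=5000, seed=0):
    """Basic stochastic subgradient method for
       f(x) = (1/N) * sum_i (a_i' x - b_i)^2 + lam * ||x||_1."""
    rng = np.random.default_rng(seed)
    N, D = A.shape
    x = np.zeros(D)
    for t in range(1, iters + 1):
        i = rng.integers(N)                                  # i_t ~ {1, ..., N}
        a, y = A[i], b[i]
        d = 2.0 * (a @ x - y) * a + lam * subgrad_l1(x)      # d_{i_t} ∈ ∂f_{i_t}(x^t)
        x -= 0.05 / np.sqrt(t) * d                           # decreasing step size α_t
    return x

A = np.random.default_rng(1).normal(size=(200, 5))
x_true = np.array([1.0, -2.0, 0.0, 0.0, 3.0])
b = A @ x_true
x_hat = stochastic_subgradient(A, b, lam=0.01)
print(np.round(x_hat, 1))
```

Any element of the subdifferential works in the update; sign(x) picks 0 at the kink, which is the minimum-norm choice for the absolute value.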

Introduction Stochastic gradient Convergence rates Subgradients & subdifferentials Stochastic subgradient method Recap

The stochastic subgradient method in practice

I Theory says we should do

i_t ∼ {1, 2, . . . , N}, α_t = 1/(µt),

x^{t+1} = x^t − α_t ∇f_{i_t}(x^t).

I O(1/t) for smooth objectives
I O(log(t)/t) for non-smooth objectives
I Do not do this! Why?
I Initial steps will be huge (µ is often of order 1/N or 1/√N, so α_1 = 1/µ is enormous)
I Later steps are tiny (1/t gets small very quickly)
I The convergence rate is not robust to mis-specification of µ
I Non-adaptive (very worst-case behaviour)
I What people do in practice:
I Use smaller initial steps, then go to zero more slowly
I Take a weighted average of the iterates or gradients:

x̄^t = Σ_{i=1}^t w_i x^i, d̄^t = Σ_{i=1}^t δ_i d^i.

Julieta Martinez Subgradient methods
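The practical advice above (slowly decaying steps plus iterate averaging) can be sketched on a toy non-smooth problem: minimizing (1/N) Σ_i |x − z_i|, whose minimizer is the median of the z_i. The schedule α_t = α₀/t^β and the running uniform average are assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(loc=2.0, size=1000)      # f(x) = (1/N) Σ |x - z_i|, minimized at the median

def sgd_median(alpha0, beta, iters=20000):
    """Stochastic subgradient with α_t = alpha0 / t**beta;
       returns the last iterate and the running average of all iterates."""
    x, xbar = 0.0, 0.0
    for t in range(1, iters + 1):
        i = rng.integers(len(z))
        d = np.sign(x - z[i])           # d ∈ ∂|x - z_i|
        x -= alpha0 / t**beta * d
        xbar += (x - xbar) / t          # running uniform average x̄_t = (1/t) Σ_{i≤t} x^i
    return x, xbar

last, avg = sgd_median(alpha0=1.0, beta=0.5)
med = np.median(z)
print(abs(last - med), abs(avg - med))
```

The last iterate keeps jittering at the scale of the current step size, while the averaged iterate smooths that jitter out, which is exactly why averaging is recommended for non-smooth problems.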

Introduction Stochastic gradient Convergence rates Subgradients & subdifferentials Stochastic subgradient method Recap

There is work that supports using large steps and averaging

I [Moulines and Bach, 2011], [Lacoste-Julien et al., 2012]
I Averaging the later iterations achieves O(1/t) in the non-smooth case
I Averaging weighted by iteration number achieves the same
I [Nesterov, 2009], [Xiao, 2009]
I Gradient averaging improves constants (‘dual averaging’)
I Finds the non-zero variables with sparse regularizers
I [Moulines and Bach, 2011]
I α_t = O(1/t^β) for β ∈ (0.5, 1) is more robust than α_t = O(1/t)
I [Nedić and Bertsekas, 2001]
I A constant step size (α_t = α) achieves the rate

E[f(x^t)] − f(x*) ≤ (1 − 2µα)^t (f(x^0) − f(x*)) + O(α)

I [Polyak and Juditsky, 1992]
I In the smooth case, iterate averaging is asymptotically optimal
I Achieves the same rate as an optimal stochastic Newton method

Julieta Martinez Subgradient methods
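The constant-step-size behaviour quoted above (geometric progress down to an O(α) noise ball) can be seen numerically on a toy smooth strongly convex problem; the two step sizes and the measurement window below are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)             # f_i(x) = 0.5*(x - z_i)^2, minimizer x* = mean(z)

def constant_step_sgd(alpha, iters, x0=5.0):
    """SGD with a fixed step size α; returns the full trajectory of iterates."""
    xs, x = [], x0
    for _ in range(iters):
        i = rng.integers(len(z))
        x -= alpha * (x - z[i])          # ∇f_i(x) = x - z_i
        xs.append(x)
    return np.array(xs)

big = constant_step_sgd(alpha=0.5, iters=2000)
small = constant_step_sgd(alpha=0.02, iters=2000)
xstar = z.mean()
# Large steps: rapid initial progress, then a wide stationary noise ball.
# Small steps: slow initial progress, but a much tighter final ball.
print(abs(big[20] - xstar), abs(small[20] - xstar))
print(((big[-500:] - xstar) ** 2).mean(), ((small[-500:] - xstar) ** 2).mean())
```

This matches the bound: the (1 − 2µα)^t term dies faster for large α, but the O(α) term it stalls at is larger.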

Introduction Stochastic gradient Convergence rates Subgradients & subdifferentials Stochastic subgradient method Recap

I What about accelerated/Newton-like stochastic methods?
I Stochasticity in these methods does not improve the convergence rate
I But, it has been shown that:
I [Ghadimi and Lan, 2010]
I Acceleration can improve the dependence on L and µ
I It improves performance at the start if the noise is small
I Newton-like AdaGrad method [Duchi et al., 2011]:

x^{t+1} = x^t − α D^{−1} ∇f_{i_t}(x^t), with D_jj = √( Σ_{k=1}^t ‖∇_j f_{i_k}(x^k)‖² )

I Improves regret bounds, but not the optimization error
I A Newton-like method [Bach and Moulines, 2013] achieves O(1/t) without strong convexity (but with an extra self-concordance assumption)

Julieta Martinez Subgradient methods
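A minimal sketch of the diagonal AdaGrad update discussed above, with per-coordinate steps α/√(Σ_k g_{k,j}²); the badly scaled toy quadratic, the value α = 0.5, and the small ε added for numerical safety are assumptions made for this example.

```python
import numpy as np

def adagrad(grad, x0, alpha=0.5, iters=500, eps=1e-8, seed=0):
    """Diagonal AdaGrad: scale each coordinate by the root of its
       accumulated squared gradients, i.e. D_jj = sqrt(Σ_k g_{k,j}^2)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    G = np.zeros_like(x)                 # running per-coordinate sum of squared gradients
    for _ in range(iters):
        g = grad(x, rng)
        G += g * g
        x -= alpha * g / (np.sqrt(G) + eps)   # x ← x − α D^{-1} g
    return x

# Toy problem: badly scaled quadratic f(x) = 0.5*(100*x_0^2 + x_1^2), noisy gradients.
def grad(x, rng):
    return np.array([100.0 * x[0], x[1]]) + 0.01 * rng.normal(size=2)

x_hat = adagrad(grad, x0=[1.0, 1.0])
print(x_hat)
```

Because each coordinate is normalized by its own gradient history, the badly scaled coordinate makes progress at the same relative rate as the well scaled one, without tuning a per-coordinate step size by hand.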

Introduction Stochastic gradient Convergence rates Subgradients & subdifferentials Stochastic subgradient method Recap

Recap

I We want to solve problems with BIG data X ∈ R^{D×N}
I When D is large, we use gradient methods
I When N is large, we use stochastic gradient methods
I If the function is non-smooth, the stochastic subgradient method has great convergence rates
I Stochastic methods:
I Have iterations that are N times cheaper than those of deterministic methods
I Make a lot of progress quickly, then stall
I In practice:
I Choose smaller step sizes at the beginning
I Averaging the iterates / gradients helps
I Sampling a permutation of the data (no longer an unbiased gradient) works well too
I Next week Mohammed will talk about finite-sum methods

Julieta Martinez Subgradient methods

References

Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, pages 773–781.

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159.

Ghadimi, S. and Lan, G. (2010). Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization. Optimization Online, July.

Lacoste-Julien, S., Schmidt, M., and Bach, F. (2012). A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002.

Moulines, E. and Bach, F. R. (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459.

Nedić, A. and Bertsekas, D. P. (2001). Incremental subgradient methods for nondifferentiable optimization. SIAM Journal on Optimization, 12(1):109–138.

Nesterov, Y. (2009). Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259.

Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407.

Tseng, P. (1998). An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization, 8(2):506–531.

Xiao, L. (2009). Dual averaging method for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems, pages 2116–2124.

Julieta Martinez Subgradient methods