
Stochastic subgradient methods
Based on material by Mark Schmidt

Julieta Martinez

University of British Columbia

October 06, 2015

Outline: Introduction · Stochastic gradient · Convergence rates · Subgradients & subdifferentials · Stochastic subgradient method · Recap

Introduction

- We are interested in a typical machine learning problem:

  \min_{x \in \mathbb{R}^D} \frac{1}{N} \sum_{i=1}^{N} L(x, a_i, b_i) + \lambda \cdot r(x),

  a data-fitting term plus a regularizer.

- Last time, we talked about gradient methods, which work when D is large.
- Today we will talk about stochastic subgradient methods, which work when N is large.

Julieta Martinez Subgradient methods

- We want to minimize a function f(x) = \frac{1}{N} \sum_{i=1}^{N} f_i(x).

- A deterministic gradient method computes the gradient exactly:

  x_{t+1} = x_t - \alpha_t \nabla f(x_t) = x_t - \alpha_t \cdot \frac{1}{N} \sum_{i=1}^{N} \nabla f_i(x_t)

  - Computing the exact gradient is O(N).
  - We can get convergence with constant \alpha_t or using line search.

- A stochastic gradient method [Robbins and Monro, 1951] estimates the gradient from a random sample i_t \sim \{1, 2, \ldots, N\}:

  x_{t+1} = x_t - \alpha_t \nabla f_{i_t}(x_t)

  - Note that this gives an unbiased estimate of the gradient: \mathbb{E}[\nabla f_{i_t}(x)] = \frac{1}{N} \sum_{i=1}^{N} \nabla f_i(x) = \nabla f(x).
  - The iteration cost no longer depends on N.
  - Convergence requires \alpha_t \to 0.
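The two updates above can be sketched in a few lines of numpy. The least-squares instance, data, and step size below are invented purely for illustration:

```python
import numpy as np

# Minimal sketch of both updates on a least-squares instance
# f(x) = (1/N) * sum_i 0.5 * (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
N, D = 1000, 10
A = rng.standard_normal((N, D))   # rows are the a_i
b = rng.standard_normal(N)

def full_gradient(x):
    # Exact gradient: O(N), touches every example
    return A.T @ (A @ x - b) / N

def stochastic_gradient(x):
    # Unbiased estimate from one random example: cost independent of N
    i = rng.integers(N)
    return A[i] * (A[i] @ x - b[i])

alpha = 0.1
x = np.zeros(D)
x_det = x - alpha * full_gradient(x)        # deterministic step
x_sto = x - alpha * stochastic_gradient(x)  # stochastic step
```

Averaging `stochastic_gradient` over all i recovers `full_gradient`, which is exactly the unbiasedness identity above.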

Convergence

Stochastic methods are N times faster per iteration, but what about convergence?

  Assumption        Deterministic                Stochastic
  Convex            O(1/t^2)                     O(1/\sqrt{t})
  Strongly convex   O((1 - \sqrt{\mu/L})^t)      O(1/t)

- Stochastic methods have a lower iteration cost, but a slower convergence rate.
- The rate is sublinear even under strong convexity.
- These bounds are unimprovable if only unbiased gradient estimates are available.
- Momentum/acceleration does not improve the convergence rate.
- For convergence, the momentum term must go to zero [Tseng, 1998].
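A toy simulation (not a proof) of why the step size must decay: with a constant step, the stochastic iterates hover in a noise ball around the optimum, while \alpha_t = 1/t lets them settle. The quadratic problem, data, and step sizes are invented for the demo:

```python
import numpy as np

# SGD on f(x) = (1/N) * sum_i 0.5 * (x - c_i)^2, whose minimizer is c.mean().
rng = np.random.default_rng(1)
N = 500
c = rng.standard_normal(N) + 3.0   # hypothetical data; optimum is c.mean()

def run_sgd(step, T=20000):
    x = 0.0
    for t in range(1, T + 1):
        i = rng.integers(N)
        x -= step(t) * (x - c[i])  # stochastic gradient of 0.5 * (x - c_i)^2
    return x

x_const = run_sgd(lambda t: 0.5)      # constant step: keeps bouncing
x_decay = run_sgd(lambda t: 1.0 / t)  # alpha_t -> 0: converges to c.mean()
```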

Figure: Convergence rates in the strongly convex case (plot omitted).

- Stochastic methods are better in low-accuracy / limited-time settings.
- It can be hard to know when the crossing between the two rates will happen.

- The convergence rates look quite different when the function is non-smooth.
- E.g., consider the binary support vector machine:

  f(x) = \sum_{i=1}^{N} \max\{0, 1 - b_i (x^\top a_i)\} + \lambda \|x\|^2

- Rates for subgradient methods on non-smooth objectives:

  Assumption        Deterministic    Stochastic
  Convex            O(1/\sqrt{t})    O(1/\sqrt{t})
  Strongly convex   O(1/t)           O(1/t)

- Other black-box methods, such as cutting-plane methods, are not faster.
- Take-away point: for non-smooth problems,
  - deterministic methods are not faster than stochastic methods;
  - stochastic methods are a free, N-times-faster lunch.
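A subgradient of one hinge-loss term can be sketched as below. The function names are mine, and the step uses a plain (unaveraged) sampled term plus the regularizer's gradient, as one simple choice:

```python
import numpy as np

# A subgradient of one hinge term f_i(x) = max{0, 1 - b_i * a_i^T x}.
# At the kink (margin exactly 1) any point of the segment between
# -b_i * a_i and 0 is valid; returning 0 there is one valid choice.
def hinge_subgradient(x, a_i, b_i):
    if b_i * (a_i @ x) < 1.0:
        return -b_i * a_i        # hinge active: gradient of 1 - b_i * a_i^T x
    return np.zeros_like(x)      # hinge inactive (or at the kink)

# One stochastic subgradient step for the regularized SVM objective
def svm_step(x, a_i, b_i, lam, alpha):
    d = hinge_subgradient(x, a_i, b_i) + 2.0 * lam * x  # + grad of lam*||x||^2
    return x - alpha * d
```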

- For differentiable convex functions, we have

  f(y) \ge f(x) + \nabla f(x)^\top (y - x), \quad \forall x, y.

- A vector d is a subgradient of a convex function f at x if

  f(y) \ge f(x) + d^\top (y - x), \quad \forall y.

- At a differentiable point x, the only subgradient is \nabla f(x).
- At a non-differentiable point x, we have a set of subgradients, called the subdifferential, \partial f(x).
- Notice that if \vec{0} \in \partial f(x), then x is a global minimizer.

Example

A vector d is a subgradient of a convex function f at x if

  f(y) \ge f(x) + d^\top (y - x), \quad \forall y.

(Figures omitted: a convex f(x); the supporting line f(x) + \nabla f(x)^\top (y - x) at a differentiable point; and the set of supporting lines, \partial f(x), at a kink.)

Another example

Consider the absolute value function |x|. Its subdifferential is

  \partial |x| = \begin{cases} \{1\} & x > 0 \\ \{-1\} & x < 0 \\ [-1, 1] & x = 0 \end{cases}

(Figure omitted: |x| with supporting lines of every slope in [-1, 1] at x = 0.)
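The case analysis above fits in a few lines, together with a numeric check of the subgradient inequality (function names are mine):

```python
# Subdifferential of |x|, returned as a closed interval [lo, hi] of slopes.
def abs_subdifferential(x):
    if x > 0:
        return (1.0, 1.0)
    if x < 0:
        return (-1.0, -1.0)
    return (-1.0, 1.0)   # at the kink, every slope in [-1, 1] supports |x|

# Any d in the interval satisfies |y| >= |x| + d * (y - x) for all y.
def is_subgradient_of_abs(d, x, ys):
    return all(abs(y) >= abs(x) + d * (y - x) - 1e-12 for y in ys)
```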

Subdifferential of max function

- |x| is a special case of the max function.
- Given two convex functions f_1(x) and f_2(x), the subdifferential of \max(f_1(x), f_2(x)) is given by

  \partial \max(f_1(x), f_2(x)) = \begin{cases} \nabla f_1(x) & f_1(x) > f_2(x) \\ \nabla f_2(x) & f_1(x) < f_2(x) \\ \theta \nabla f_1(x) + (1 - \theta) \nabla f_2(x), \ \theta \in [0, 1] & f_1(x) = f_2(x) \end{cases}

- I.e., at a tie, any convex combination of the gradients of the argmax.
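The rule above can be sketched directly; the callables and names below are my own, and |x| = max(x, -x) recovers the previous example:

```python
# Subgradient of max(f1, f2) at x, given the two functions and their
# gradients; theta only matters on the tie set f1(x) == f2(x).
def max_subgradient(f1, g1, f2, g2, x, theta=0.5):
    v1, v2 = f1(x), f2(x)
    if v1 > v2:
        return g1(x)
    if v2 > v1:
        return g2(x)
    # Tie: any convex combination (theta in [0, 1]) is a valid subgradient
    return theta * g1(x) + (1.0 - theta) * g2(x)

# |x| = max(x, -x): slope 1 for x > 0, slope 0 at the kink with theta = 0.5
d_pos = max_subgradient(lambda x: x, lambda x: 1.0,
                        lambda x: -x, lambda x: -1.0, 2.0)
d_tie = max_subgradient(lambda x: x, lambda x: 1.0,
                        lambda x: -x, lambda x: -1.0, 0.0)
```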

The subgradient method

- The basic subgradient method:

  x_{t+1} = x_t - \alpha_t d_t, \quad \text{for some } d_t \in \partial f(x_t).

- The steepest-descent choice of d_t is \arg\min_{d \in \partial f(x_t)} \|d\|.
  - Easy to see in the 1d case. (Figure omitted: a kink with \partial f(x) = [0.4, 1.2], where the minimum-norm subgradient is 0.4.)
  - Easy to find for \ell_1 regularization, but hard in general.
- If d_t \ne \arg\min_{d \in \partial f(x_t)} \|d\|, the objective may increase.
- But \|x_{t+1} - x^*\| \le \|x_t - x^*\| for small enough \alpha_t.
- Again, for convergence, we require \alpha_t \to 0.

- The basic stochastic subgradient method:

  x_{t+1} = x_t - \alpha_t d_{i_t}, \quad \text{for some } d_{i_t} \in \partial f_{i_t}(x_t), \; i_t \sim \{1, 2, \ldots, N\}.
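The stochastic subgradient loop can be sketched as below; `subgrad(x, i)` should return any element of \partial f_{i}(x), and \alpha_0 / t supplies the decreasing steps the method requires. The function name, signature, and worked example are mine:

```python
import numpy as np

# Minimal sketch of the basic stochastic subgradient method.
def stochastic_subgradient_method(subgrad, x0, N, T, alpha0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for t in range(1, T + 1):
        i = int(rng.integers(N))          # i_t ~ uniform over {0, ..., N-1}
        x = x - (alpha0 / t) * subgrad(x, i)
    return x

# Example: minimize (1/N) * sum_i |x - c_i| with every c_i = 2, using the
# sign subgradient; the iterate settles at the kink x = 2.
x_star = stochastic_subgradient_method(
    lambda x, i: np.sign(x - 2.0), x0=[0.0], N=5, T=20000)
```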

Julieta Martinez Subgradient methods

Introduction Stochastic gradient Convergence rates Subgradients & subdifferentials Stochastic subgradient method Recap

The subgradient method

I The basic subgradient method:

x t+1 = x t − αtdt ,

for some dt ∈ ∂f (x t)I The steepest descent dt is argmind∈∂f (x){‖d‖}

I Easy to see in the 1d caseI Easy to find for `1 regularization, but hard in generalI If dt 6= argmin{d ∈ ∂f (x)}‖d‖, the objective may increaseI But ‖x t+1 − x∗‖ ≤ ‖x t − x∗‖ for small enough αI Again, for convergence, we require α→ 0

I The basic stochastic subgradient method

x t+1 = x t − αtdi t

for some di t ∈ ∂fi t(x t), it ∼ {1, 2, . . . ,N}

Julieta Martinez Subgradient methods

Introduction Stochastic gradient Convergence rates Subgradients & subdifferentials Stochastic subgradient method Recap

The subgradient method

I The basic subgradient method:

x t+1 = x t − αtdt ,

for some dt ∈ ∂f (x t)I The steepest descent dt is argmind∈∂f (x){‖d‖}

I Easy to see in the 1d caseI Easy to find for `1 regularization, but hard in generalI If dt 6= argmin{d ∈ ∂f (x)}‖d‖, the objective may increaseI But ‖x t+1 − x∗‖ ≤ ‖x t − x∗‖ for small enough αI Again, for convergence, we require α→ 0

I The basic stochastic subgradient method

x t+1 = x t − αtdi t

for some di t ∈ ∂fi t(x t), it ∼ {1, 2, . . . ,N}

Julieta Martinez Subgradient methods

Introduction Stochastic gradient Convergence rates Subgradients & subdifferentials Stochastic subgradient method Recap

The subgradient method

I The basic subgradient method:

x t+1 = x t − αtdt ,

for some dt ∈ ∂f (x t)I The steepest descent dt is argmind∈∂f (x){‖d‖}

I Easy to see in the 1d caseI Easy to find for `1 regularization, but hard in generalI If dt 6= argmin{d ∈ ∂f (x)}‖d‖, the objective may increaseI But ‖x t+1 − x∗‖ ≤ ‖x t − x∗‖ for small enough αI Again, for convergence, we require α→ 0

I The basic stochastic subgradient method

x t+1 = x t − αtdi t

for some di t ∈ ∂fi t(x t), it ∼ {1, 2, . . . ,N}

Julieta Martinez Subgradient methods
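To make the stochastic update above concrete, here is a minimal sketch (not from the slides) on an ℓ1-regularized least-squares problem; the synthetic data, the step size α_t = 0.05/√t, and the particular subgradient choice sign(x) ∈ ∂‖x‖₁ are all assumptions made for this example.

```python
import numpy as np

def subgrad_l1(x):
    # One valid element of ∂||x||_1: sign(x), taking 0 at kinks (0 ∈ [-1, 1]).
    return np.sign(x)

def stochastic_subgradient(A, b, lam=0.01, iters=5000, seed=0):
    """Basic stochastic subgradient method for
       f(x) = (1/N) * sum_i (a_i' x - b_i)^2 + lam * ||x||_1."""
    rng = np.random.default_rng(seed)
    N, D = A.shape
    x = np.zeros(D)
    for t in range(1, iters + 1):
        i = rng.integers(N)                                  # i_t ~ {1, ..., N}
        a, y = A[i], b[i]
        d = 2.0 * (a @ x - y) * a + lam * subgrad_l1(x)      # d_{i_t} ∈ ∂f_{i_t}(x^t)
        x -= 0.05 / np.sqrt(t) * d                           # decreasing step size α_t
    return x

A = np.random.default_rng(1).normal(size=(200, 5))
x_true = np.array([1.0, -2.0, 0.0, 0.0, 3.0])
b = A @ x_true
x_hat = stochastic_subgradient(A, b, lam=0.01)
print(np.round(x_hat, 1))
```

Any element of the subdifferential works in the update; sign(x) picks 0 at the kink, which is the minimum-norm choice for the absolute value.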

Introduction Stochastic gradient Convergence rates Subgradients & subdifferentials Stochastic subgradient method Recap

The stochastic subgradient method in practice

I Theory says we should do

i_t ∼ {1, 2, . . . , N}, α_t = 1/(µt),

x^{t+1} = x^t − α_t ∇f_{i_t}(x^t).

I O(1/t) for smooth objectives
I O(log(t)/t) for non-smooth objectives
I Do not do this! Why?
I Initial steps will be huge (µ is often of order 1/N or 1/√N, so α_1 = 1/µ is enormous)
I Later steps are tiny (1/t gets small very quickly)
I The convergence rate is not robust to mis-specification of µ
I Non-adaptive (very worst-case behaviour)
I What people do in practice:
I Use smaller initial steps, then go to zero more slowly
I Take a weighted average of the iterates or gradients:

x̄^t = Σ_{i=1}^t w_i x^i, d̄^t = Σ_{i=1}^t δ_i d^i.

Julieta Martinez Subgradient methods
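The practical advice above (slowly decaying steps plus iterate averaging) can be sketched on a toy non-smooth problem: minimizing (1/N) Σ_i |x − z_i|, whose minimizer is the median of the z_i. The schedule α_t = α₀/t^β and the running uniform average are assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(loc=2.0, size=1000)      # f(x) = (1/N) Σ |x - z_i|, minimized at the median

def sgd_median(alpha0, beta, iters=20000):
    """Stochastic subgradient with α_t = alpha0 / t**beta;
       returns the last iterate and the running average of all iterates."""
    x, xbar = 0.0, 0.0
    for t in range(1, iters + 1):
        i = rng.integers(len(z))
        d = np.sign(x - z[i])           # d ∈ ∂|x - z_i|
        x -= alpha0 / t**beta * d
        xbar += (x - xbar) / t          # running uniform average x̄_t = (1/t) Σ_{i≤t} x^i
    return x, xbar

last, avg = sgd_median(alpha0=1.0, beta=0.5)
med = np.median(z)
print(abs(last - med), abs(avg - med))
```

The last iterate keeps jittering at the scale of the current step size, while the averaged iterate smooths that jitter out, which is exactly why averaging is recommended for non-smooth problems.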

Introduction Stochastic gradient Convergence rates Subgradients & subdifferentials Stochastic subgradient method Recap

There is work that supports using large steps and averaging

I [Moulines and Bach, 2011], [Lacoste-Julien et al., 2012]
I Averaging the later iterations achieves O(1/t) in the non-smooth case
I Averaging weighted by iteration number achieves the same
I [Nesterov, 2009], [Xiao, 2009]
I Gradient averaging improves constants (‘dual averaging’)
I Finds the non-zero variables with sparse regularizers
I [Moulines and Bach, 2011]
I α_t = O(1/t^β) for β ∈ (0.5, 1) is more robust than α_t = O(1/t)
I [Nedić and Bertsekas, 2001]
I A constant step size (α_t = α) achieves the rate

E[f(x^t)] − f(x*) ≤ (1 − 2µα)^t (f(x^0) − f(x*)) + O(α)

I [Polyak and Juditsky, 1992]
I In the smooth case, iterate averaging is asymptotically optimal
I Achieves the same rate as an optimal stochastic Newton method

Julieta Martinez Subgradient methods
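The constant-step-size behaviour quoted above (geometric progress down to an O(α) noise ball) can be seen numerically on a toy smooth strongly convex problem; the two step sizes and the measurement window below are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)             # f_i(x) = 0.5*(x - z_i)^2, minimizer x* = mean(z)

def constant_step_sgd(alpha, iters, x0=5.0):
    """SGD with a fixed step size α; returns the full trajectory of iterates."""
    xs, x = [], x0
    for _ in range(iters):
        i = rng.integers(len(z))
        x -= alpha * (x - z[i])          # ∇f_i(x) = x - z_i
        xs.append(x)
    return np.array(xs)

big = constant_step_sgd(alpha=0.5, iters=2000)
small = constant_step_sgd(alpha=0.02, iters=2000)
xstar = z.mean()
# Large steps: rapid initial progress, then a wide stationary noise ball.
# Small steps: slow initial progress, but a much tighter final ball.
print(abs(big[20] - xstar), abs(small[20] - xstar))
print(((big[-500:] - xstar) ** 2).mean(), ((small[-500:] - xstar) ** 2).mean())
```

This matches the bound: the (1 − 2µα)^t term dies faster for large α, but the O(α) term it stalls at is larger.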

Introduction Stochastic gradient Convergence rates Subgradients & subdifferentials Stochastic subgradient method Recap

I What about accelerated/Newton-like stochastic methods?
I Stochasticity in these methods does not improve the convergence rate
I But, it has been shown that:
I [Ghadimi and Lan, 2010]
I Acceleration can improve the dependence on L and µ
I It improves performance at the start if the noise is small
I Newton-like AdaGrad method [Duchi et al., 2011]:

x^{t+1} = x^t − α D^{−1} ∇f_{i_t}(x^t), with D_jj = √( Σ_{k=1}^t ‖∇_j f_{i_k}(x^k)‖² )

I Improves regret bounds, but not the optimization error
I A Newton-like method [Bach and Moulines, 2013] achieves O(1/t) without strong convexity (but with an extra self-concordance assumption)

Julieta Martinez Subgradient methods
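A minimal sketch of the diagonal AdaGrad update discussed above, with per-coordinate steps α/√(Σ_k g_{k,j}²); the badly scaled toy quadratic, the value α = 0.5, and the small ε added for numerical safety are assumptions made for this example.

```python
import numpy as np

def adagrad(grad, x0, alpha=0.5, iters=500, eps=1e-8, seed=0):
    """Diagonal AdaGrad: scale each coordinate by the root of its
       accumulated squared gradients, i.e. D_jj = sqrt(Σ_k g_{k,j}^2)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    G = np.zeros_like(x)                 # running per-coordinate sum of squared gradients
    for _ in range(iters):
        g = grad(x, rng)
        G += g * g
        x -= alpha * g / (np.sqrt(G) + eps)   # x ← x − α D^{-1} g
    return x

# Toy problem: badly scaled quadratic f(x) = 0.5*(100*x_0^2 + x_1^2), noisy gradients.
def grad(x, rng):
    return np.array([100.0 * x[0], x[1]]) + 0.01 * rng.normal(size=2)

x_hat = adagrad(grad, x0=[1.0, 1.0])
print(x_hat)
```

Because each coordinate is normalized by its own gradient history, the badly scaled coordinate makes progress at the same relative rate as the well scaled one, without tuning a per-coordinate step size by hand.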

Introduction Stochastic gradient Convergence rates Subgradients & subdifferentials Stochastic subgradient method Recap

Recap

I We want to solve problems with BIG data X ∈ R^{D×N}
I When D is large, we use gradient methods
I When N is large, we use stochastic gradient methods
I If the function is non-smooth, the stochastic subgradient method has great convergence rates
I Stochastic methods:
I Have iterations that are N times cheaper than those of deterministic methods
I Make a lot of progress quickly, then stall
I In practice:
I Choose smaller step sizes at the beginning
I Averaging the iterates / gradients helps
I Sampling a permutation of the data (no longer an unbiased gradient) works well too
I Next week Mohammed will talk about finite-sum methods

Julieta Martinez Subgradient methods

References

Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, pages 773–781.

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159.

Ghadimi, S. and Lan, G. (2010). Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization. Optimization Online, July.

Lacoste-Julien, S., Schmidt, M., and Bach, F. (2012). A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002.

Moulines, E. and Bach, F. R. (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459.

Nedić, A. and Bertsekas, D. P. (2001). Incremental subgradient methods for nondifferentiable optimization. SIAM Journal on Optimization, 12(1):109–138.

Nesterov, Y. (2009). Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259.

Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407.

Tseng, P. (1998). An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization, 8(2):506–531.

Xiao, L. (2009). Dual averaging method for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems, pages 2116–2124.

Julieta Martinez Subgradient methods