Chapter 6-7: Stochastic algorithms for large-scale problems


Edouard Pauwels

Statistics and optimization in high dimensions, M2RI, Toulouse 3 Paul Sabatier


Different estimators

X ∈ R^{n×d}, Y ∈ R^n (random).

θ ∈ arg min_{θ∈R^d} ‖Xθ − Y‖₂²

θ ∈ arg min_{θ∈R^d} ‖Xθ − Y‖₂², s.t. ‖θ‖₁ ≤ 1

θ ∈ arg min_{θ∈R^d} ‖Xθ − Y‖₂², s.t. ‖θ‖₀ ≤ k

θ ∈ arg min_{θ∈R^d} ‖Xθ − Y‖₂² + λ‖θ‖₀

θ ∈ arg min_{θ∈R^d} ‖Xθ − Y‖₂² + λ‖θ‖₁

θ ∈ arg min_{θ∈R^d} ‖θ‖₁, s.t. Xθ = Y.

How to deal with large values of n (possibly infinite)?

Can we reduce the cost of treating large d?


Where have we been so far?

‖ · ‖0: hard to handle computationally.

ℓ₁-norm estimators are solutions to conic programs.

General-purpose solvers (interior point methods) hardly apply to large instances.

Dedicated first order methods have cheap iterations.

Plan for today: stochastic algorithms to treat large n or large d.

Stochastic approximation and Robbins-Monro algorithm.

Prototype algorithm, ODE method, convergence rate analysis.

Block coordinate methods, convergence rate analysis.

General conclusion; I look forward to your feedback.

Sources are diverse; see the lecture notes.


Plan

1. Introduction to stochastic approximation

2. Robbins-Monro algorithm

3. Convergence analysis

4. Block coordinate algorithms


Motivation for large n

The Lasso estimator is given as follows:

θ_{ℓ₁} ∈ arg min_{θ∈R^d} (1/(2n)) ‖Xθ − Y‖² + λ‖θ‖₁

θ_{ℓ₁} ∈ arg min_{θ∈R^d} (1/n) Σ_{i=1}^n (1/2)(x_iᵀθ − y_i)² + λ‖θ‖₁,

General model

min_{x∈R^p} F(x) := (1/n) Σ_{i=1}^n f_i(x) + g(x),    (1)

where the f_i and g are convex, lower semicontinuous functions.

Sum rule: ∂F = (1/n) Σ_{i=1}^n ∂f_i + ∂g.

Redundancy: if f_i = f for all i = 1, . . . , n, the sum is not needed; a single term carries all the information.


Redundancy and estimation of the mean

min_{x∈R} F(x) := (1/n) Σ_{i=1}^n (x − x_i)²

[Figure: sample points x_1, . . . , x_n scattered along the x-axis.]
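To make the redundancy explicit, here is a short derivation (added for clarity, not on the original slide): every term pulls x toward its own data point, the full objective is minimized at the sample mean, and a single sampled term already provides an unbiased gradient estimate.

```latex
\nabla F(x) = \frac{2}{n}\sum_{i=1}^{n} (x - x_i) = 0
\;\Longleftrightarrow\;
x^\star = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\mathbb{E}_I\big[\nabla f_I(x)\big] = \frac{1}{n}\sum_{i=1}^{n} 2(x - x_i) = \nabla F(x),
\quad f_i(x) := (x - x_i)^2 .
```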


Intuition for stochastic approximation

Let I be uniform over {1, . . . , n}

F : x ↦ E[f_I(x)] + g(x).

Stochastic approximation: main algorithmic step, for any x ∈ R^d:

Sample i uniformly at random in {1, . . . , n}.

Perform an algorithmic step using only the values f_i(x) and ∇f_i(x), or possibly a subgradient v ∈ ∂f_i(x).

Unbiased estimates of the (sub)gradient.

If, for each value of I, f_I is C¹, we have for any x ∈ R^d

E[∇f_I(x)] = ∇E[f_I(x)]  (= ∇F(x) when g ≡ 0).

Let v_I be a random variable such that v_I ∈ ∂f_I(x) almost surely; since F is convex,

E[v_I] ∈ ∂E[f_I(x)]  (= ∂F(x) when g ≡ 0).
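A quick numerical sanity check of this unbiasedness, as a minimal sketch on a synthetic least-squares sum (the data and the choice f_i(θ) = ½(x_iᵀθ − y_i)² are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def grad_fi(theta, i):
    """Gradient of f_i(theta) = 0.5 * (x_i^T theta - y_i)^2."""
    return (X[i] @ theta - y[i]) * X[i]

def grad_smooth_part(theta):
    """Gradient of (1/n) * sum_i f_i(theta)."""
    return X.T @ (X @ theta - y) / n

theta = rng.standard_normal(d)

# Averaging the single-term gradients over all indices reproduces the full
# gradient of the smooth part, so a uniformly sampled index gives an
# unbiased estimate at the cost of a single data point.
avg = np.mean([grad_fi(theta, i) for i in range(n)], axis=0)
print(np.allclose(avg, grad_smooth_part(theta)))  # True
```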


Plan

1. Introduction to stochastic approximation

2. Robbins-Monro algorithm

3. Convergence analysis

4. Block coordinate algorithms


Robbins-Monro algorithm

Let h : R^p → R^p be Lipschitz; we seek a zero of h from noisy unbiased estimates of h.

Robbins-Monro: (X_k)_{k∈N} is a sequence of random variables such that for any k ∈ N

X_{k+1} = X_k + α_k (h(X_k) + M_{k+1})    (2)

where

(α_k)_{k∈N} is a sequence of positive step sizes satisfying

Σ_{k=0}^{+∞} α_k = +∞,    Σ_{k=0}^{+∞} α_k² < +∞.

(M_k)_{k∈N} is a martingale difference sequence with respect to the increasing σ-fields

F_k = σ(X_m, M_m, m ≤ k) = σ(X_0, M_1, . . . , M_k),

that is, E[M_{k+1} | F_k] = 0 for all k ∈ N.

In addition, we assume that there exists a positive constant C such that

sup_{k∈N} E[‖M_{k+1}‖₂² | F_k] ≤ C.
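A minimal numerical sketch of recursion (2) (added for illustration; the map h(x) = 2 − x, the Gaussian noise, and the step sizes α_k = 1/(k+1) are all assumptions chosen for the example):

```python
import numpy as np

rng = np.random.default_rng(1)

def h(x):
    """Toy Lipschitz map whose unique zero is x* = 2."""
    return 2.0 - x

x = 0.0
for k in range(100_000):
    alpha = 1.0 / (k + 1)          # sum alpha_k = +inf, sum alpha_k^2 < +inf
    noise = rng.standard_normal()  # martingale difference with bounded variance
    x = x + alpha * (h(x) + noise) # Robbins-Monro update (2)

print(x)  # close to 2.0
```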


More intuition

Martingale convergence theorem: Σ_{k=0}^{+∞} E[α_k² ‖M_{k+1}‖² | F_k] is finite. Hence

Σ_{k=0}^{K} α_k M_{k+1}

is a zero-mean martingale with square-summable increments. It converges to a square-integrable random variable M in R^p, almost surely and in L² (Durrett, Section 5.4).

Vanishing step size: in addition to washing out the noise, we obtain trajectories close to the ODE

ẋ = h(x).


Plan

1. Introduction to stochastic approximation

2. Robbins-Monro algorithm

3. Convergence analysis

4. Block coordinate algorithms


The ODE method

Choose h = −∇F, assuming that F has Lipschitz gradient. The following result is due to Michel Benaïm.

Theorem

On the event that {X_k}_{k∈N} remains bounded, almost surely, the (random) set of accumulation points of the sequence is compact, connected, and invariant under the flow generated by the continuous-time limit

ẋ = h(x).

Consequence: let x̄ be an accumulation point; the unique solution x : R → R^p of ẋ = −∇F(x), x(0) = x̄, remains bounded for all t ∈ R.

Corollary

If F is convex, C¹ with Lipschitz gradient, and attains its minimum, then setting h = −∇F and conditioning on the event that sup_{k∈N} ‖X_k‖ is finite, almost surely all the accumulation points of (X_k) are critical points (hence minimizers) of F.


Non-asymptotic rates for stochastic subgradient

Proposition

Consider the problem

min_{x∈R^d} F(x) := (1/n) Σ_{i=1}^n f_i(x),

where each f_i is convex and L-Lipschitz. Choose x_0 ∈ R^d, a sequence of random variables (i_k)_{k∈N} independent and identically distributed, uniform on {1, . . . , n}, and a sequence of positive step sizes (α_k)_{k∈N}. Consider the recursion

x_{k+1} = x_k − α_k v_k    (3)

v_k ∈ ∂f_{i_k}(x_k)    (4)

Then for all K ∈ N, K ≥ 1,

E[F(x̄_K) − F*] ≤ (‖x_0 − x*‖₂² + L² Σ_{k=0}^K α_k²) / (2 Σ_{k=0}^K α_k),

where x̄_K = (Σ_{k=0}^K α_k x_k) / (Σ_{k=0}^K α_k).
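A minimal sketch of recursion (3)-(4) with the weighted averaged iterate x̄_K, on a synthetic least-absolute-deviations problem (the data, the choice f_i(x) = |a_iᵀx − b_i|, and the step sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def subgrad_fi(x, i):
    """A subgradient of f_i(x) = |a_i^T x - b_i|."""
    return np.sign(A[i] @ x - b[i]) * A[i]

F = lambda z: np.mean(np.abs(A @ z - b))  # nonsmooth finite-sum objective

x = np.zeros(d)
num, den = np.zeros(d), 0.0               # running sums for the averaged iterate
for k in range(20_000):
    alpha = 1.0 / np.sqrt(k + 1)
    num += alpha * x
    den += alpha
    i = rng.integers(n)                   # sample i_k uniformly in {0, ..., n-1}
    x = x - alpha * subgrad_fi(x, i)      # recursion (3)-(4)

x_bar = num / den                         # weighted average, as in the proposition
print(F(x_bar), F(np.zeros(d)))           # averaged iterate vs. starting point
```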


Consequences

Corollary

Under the same hypotheses, we have the following

If αk = α is constant, we have

E [F (xk)− F ∗] ≤ ‖x0 − x∗‖2

2(k + 1)α+

L2α

2.

In particular, choosing αi = ‖x0−x∗‖/L√k+1

, we have

E [F (xk)− F ∗] ≤ ‖x0 − x∗‖L√k + 1

.

Choosing αk = ‖x0 − x∗‖/(L√k) for all k, we obtain for all k

E [F (xk)− F ∗] = O

(‖x0 − x∗‖2L(1 + log(k))√

k

).
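The horizon-dependent constant step can be recovered by optimizing the first bound over α; a short derivation added for clarity:

```latex
\min_{\alpha > 0}\; \frac{\|x_0 - x^*\|^2}{2(k+1)\alpha} + \frac{L^2 \alpha}{2}
\quad\text{is attained at}\quad
\alpha^* = \frac{\|x_0 - x^*\|}{L\sqrt{k+1}},
\quad\text{with value}\quad
\frac{\|x_0 - x^*\|\, L}{\sqrt{k+1}} .
```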


Exercise

For the last point, what can you say if F is strongly convex?


Non-asymptotic rates for stochastic proximal gradient

Proposition

Consider the problem

min_{x∈R^d} F(x) := (1/n) Σ_{i=1}^n f_i(x) + g(x),

where each f_i is convex with L-Lipschitz gradient and g is convex. Choose x_0 ∈ R^d, a sequence of random variables (i_k)_{k∈N} independent and identically distributed, uniform on {1, . . . , n}, and a sequence of positive step sizes (α_k)_{k∈N}. Consider the recursion

x_{k+1} = prox_{α_k g / L} ( x_k − (α_k / L) ∇f_{i_k}(x_k) ).    (5)

Assume the following:

0 < α_k ≤ 1, for all k ∈ N;

f_i and g are G-Lipschitz for all i = 1, . . . , n.

Then for all K ∈ N, K ≥ 1, setting x̄_K = (Σ_{k=0}^K α_k x_k) / (Σ_{k=0}^K α_k),

E[F(x̄_K) − F*] ≤ (L ‖x_0 − x*‖₂² + (2G²/L) Σ_{k=0}^K α_k²) / (2 Σ_{k=0}^K α_k).
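A minimal sketch of recursion (5) on a synthetic Lasso instance (the data, λ, and the step sizes are illustrative assumptions; on this example the Lipschitz conditions only hold on bounded sets, so the sketch illustrates the recursion rather than the exact hypotheses):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 400, 50
X = rng.standard_normal((n, d))
theta_true = np.zeros(d)
theta_true[:5] = 1.0
y = X @ theta_true + 0.1 * rng.standard_normal(n)
lam = 0.1

def soft_threshold(z, t):
    """prox of t * ||.||_1: coordinate-wise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

L = np.max(np.sum(X**2, axis=1))     # Lipschitz constant of each grad f_i
theta = np.zeros(d)
num, den = np.zeros(d), 0.0          # running sums for the averaged iterate
for k in range(20_000):
    alpha = 1.0 / np.sqrt(k + 1)     # step sizes in (0, 1]
    num += alpha * theta
    den += alpha
    i = rng.integers(n)
    grad = (X[i] @ theta - y[i]) * X[i]                    # grad f_i(theta)
    theta = soft_threshold(theta - (alpha / L) * grad,     # recursion (5) with
                           alpha * lam / L)                # g = lam * ||.||_1

theta_bar = num / den
print(np.round(theta_bar[:8], 2))    # averaged iterate, first 8 coordinates
```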


Consequences

Corollary

If α_k = α is constant, we have for all k ≥ 1

E[F(x̄_k) − F*] ≤ L ‖x_0 − x*‖² / (2(k+1)α) + G²α / L.

In particular, choosing α_i = 1/√(2k+2) for i = 0, . . . , k, for some k ∈ N, we have

E[F(x̄_k) − F*] ≤ (L ‖x_0 − x*‖₂² + G²/L) / √(2k+2).

Choosing α_k = 1/√(2k+2) for all k, we obtain for all k

E[F(x̄_k) − F*] = O( (L ‖x_0 − x*‖₂² + G²/L) log(k) / √(2k+2) ).


Optimality of these rates

O(1/√k) is the optimal rate for optimization based on stochastic oracles.

Smoothness does not improve this rate.

Strong convexity leads to O(1/k).

Linear rates can be achieved using variance reduction techniques for finite sums under strong convexity.


Minimizing the population risk

Minimize functions of the form

x ↦ E_Z[f(x, Z)]

where x are the model parameters and Z is a population random variable.

Example: input-output pair (X, Y) of a regression problem; minimize over a parametric regression function class F the risk

R(f) = E[(f(X) − Y)²] = ∫_{X×Y} (f(x) − y)² P(dx, dy).

Single pass: given (x_i, y_i)_{i=1}^n, one pass of a stochastic algorithm amounts to performing n steps of the same algorithm on the population risk.
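A minimal single-pass sketch (the linear model, the noise level, and the step sizes are illustrative assumptions): each fresh draw gives an unbiased gradient of the population risk, so never reusing a sample is exactly SGD on R.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 5
w_true = rng.standard_normal(d)

def sample_pair():
    """Draw one fresh (X, Y) pair from the population: Y = X^T w_true + noise."""
    x = rng.standard_normal(d)
    return x, x @ w_true + 0.1 * rng.standard_normal()

w = np.zeros(d)
for k in range(50_000):
    x, y = sample_pair()              # each sample is used once (single pass)
    alpha = 1.0 / np.sqrt(k + 1)
    w -= alpha * (x @ w - y) * x      # unbiased gradient of the population risk

print(np.round(np.abs(w - w_true).max(), 3))  # distance to the true parameters
```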


Plan

1. Introduction to stochastic approximation

2. Robbins-Monro algorithm

3. Convergence analysis

4. Block coordinate algorithms


Motivation for large d

θ_{ℓ₁} ∈ arg min_{θ∈R^d} (1/(2n)) ‖Xθ − Y‖² + λ‖θ‖₁.

The cost of one proximal gradient step is of the order of d².

Idea: Update only subsets of the coordinates to reduce the cost.


Does this work?

For smooth convex functions?

For nonsmooth convex functions?

For the Lasso problem?


Block proximal gradient algorithm

We consider optimization problems of the form

min_{x∈R^p} F(x) = f(x) + Σ_{i=1}^p g_i(x_i),

where f : R^p → R has L-Lipschitz gradient and the g_i : R → R are convex, lower semicontinuous, univariate functions.

Let e_1, . . . , e_p be the elements of the canonical basis. Given a sequence of coordinate indices (i_k)_{k∈N}, starting at x_0 ∈ R^p,

x_{k+1} = arg min_{y = x_k + t e_{i_k}, t ∈ R} f(x_k) + 〈∇f(x_k), y − x_k〉 + (L/2) ‖y − x_k‖₂² + g_{i_k}(y_{i_k}).

Assumption (Coercivity)

The sublevel set {y ∈ R^p : F(y) ≤ F(x_0)} is compact; for any y ∈ R^p such that F(y) ≤ F(x_0), ‖y − x*‖₂ ≤ R.


Technical Lemma

Lemma

Let (A_k)_{k∈N} be a sequence of positive real numbers and γ > 0 be such that

A_k − A_{k+1} ≥ γ A_k²;

then for all k ∈ N, A_k ≤ (1/A_0 + γk)^{-1}.
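Proof sketch (added for clarity): the inequality forces A_{k+1} ≤ A_k, and dividing by A_k A_{k+1} > 0 gives a telescoping bound on the inverses.

```latex
\frac{1}{A_{k+1}} - \frac{1}{A_k} \;=\; \frac{A_k - A_{k+1}}{A_k A_{k+1}} \;\ge\; \gamma\,\frac{A_k}{A_{k+1}} \;\ge\; \gamma,
\qquad\text{hence}\qquad
\frac{1}{A_k} \;\ge\; \frac{1}{A_0} + \gamma k
\;\Longrightarrow\;
A_k \le \Big(\tfrac{1}{A_0} + \gamma k\Big)^{-1}.
```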


Random block gradient descent

Proposition (Nesterov 2012)

Consider the problem

min_{x∈R^p} f(x)

where f : R^p → R is convex, differentiable, with L-Lipschitz gradient. Choose x_0 ∈ R^p and a sequence of random variables (i_k)_{k∈N} independent and identically distributed, uniform on {1, . . . , p}. Consider the recursion

x_{k+1} = x_k − (1/L) ∇_{i_k} f(x_k)    (6)

Then for all k ∈ N, k ≥ 1,

E[f(x_k) − f*] ≤ 2pLR² / k.


Random block proximal gradient descent

Proposition (Richtárik, Takáč 2014)

Consider the problem

min_{x∈R^p} F(x) := f(x) + Σ_{i=1}^p g_i(x)

where f : R^p → R is convex, differentiable, with L-Lipschitz gradient, and each g_i : R^p → R is convex, lower semicontinuous, and only depends on coordinate i. Choose x_0 ∈ R^p and a sequence of random variables (i_k)_{k∈N} independent and identically distributed, uniform on {1, . . . , p}. Consider the recursion

x_{k+1} = arg min_y f(x_k) + 〈∇_{i_k} f(x_k), y − x_k〉 + (L/2) ‖y − x_k‖₂² + g_{i_k}(y)    (7)

        = prox_{g_{i_k}/L} ( x_k − (1/L) ∇_{i_k} f(x_k) ).    (8)

Set C = max{ LR², F(x_0) − F* }; we have, for all k ≥ 1,

E[F(x_k) − F*] ≤ 2pC / k.


Deterministic block gradient descent

Proposition

Consider the problem

min_{x∈R^p} f(x)

where f : R^p → R is convex, differentiable, with L-Lipschitz gradient. Choose x_0 ∈ R^p and consider the recursion

x_{k+1} = x_k − (1/L) ∇_{i_k} f(x_k)    (9)

where i_k indexes the largest block of ∇f(x_k) in Euclidean norm. Then for all k ∈ N, k ≥ 1,

f(x_k) − f* ≤ 2pLR² / k.

A similar result holds for the proximal variant.


Exercise

In the same setting as the block gradient descent method, consider the update

x_{k+1} ∈ arg min f(y), s.t. y = x_k + t e_{i_k}, t ∈ R.

Can you prove a convergence rate for this method?


Comment on complexity for quadratic losses

Lasso estimator

θ_{ℓ₁} ∈ arg min_{θ∈R^d} (1/(2n)) ‖Xθ − Y‖² + λ‖θ‖₁.

A full proximal gradient step costs O(d²).

Given θ ∈ R^d and β = Xᵀ(Xθ − Y), for θ′ such that ‖θ′ − θ‖₀ ≤ 1 (a single coordinate changes), computing Xᵀ(Xθ′ − Y) from β costs only O(d).

Consequences:

d steps of the random block method have roughly the same cost as one step of the full method.

The computational overhead for deterministic rules is affordable.
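A minimal sketch of the bookkeeping behind these costs, assuming the Gram matrix XᵀX and XᵀY have been precomputed once (an assumption consistent with counting a full step as O(d²)):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 500, 100
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)

G = X.T @ X          # Gram matrix, precomputed once: O(n d^2)
XtY = X.T @ Y        # precomputed once: O(n d)

theta = rng.standard_normal(d)
beta = G @ theta - XtY              # beta = X^T (X theta - Y), O(d^2) per full step

# Change a single coordinate: theta' = theta + t * e_i.
i, t = 3, 0.5
theta_new = theta.copy()
theta_new[i] += t

# Updating beta costs only O(d): add t times column i of the Gram matrix.
beta_new = beta + t * G[:, i]

print(np.allclose(beta_new, G @ theta_new - XtY))  # True
```

Each single-coordinate update therefore touches only one column of XᵀX, which is the O(d) count used above.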


Conclusion for randomized methods

Stochastic Gradient Descent (SGD) is at the heart of machine learning methods beyond convex optimization (deep learning, . . . ).

Block decomposition methods can be beneficial even for small d.

Most state-of-the-art methods use randomized algorithms.


Conclusion

The sparse least squares problem served as a running example to illustrate:

Statistical efficiency issues in high dimension and their resolution

Computational complexity barriers in high dimensional estimation.

All-purpose solvers for conic programming

First order methods and composite optimization

Randomized methods to treat high dimensionality issues from a computational viewpoint.

Lecture notes: available at

https://www.math.univ-toulouse.fr/~epauwels/M2RI/index.html

Feedback form: please rate the course.
