Transcript of "Towards optimal stochastic ADMM" (Samaneh Azadi, Suvrit Sra; UC Berkeley and Max Planck Institute, Tübingen), 24 slides.

  • Towards optimal stochastic ADMM

    Samaneh Azadi, Suvrit Sra

    UC Berkeley, Max Planck Institute, Tübingen

    Thanks: Aaditya Ramdas (CMU)

  • Problem

    Linearly constrained stochastic convex optimization

    min E_ξ[F(x, ξ)] + h(y)   s.t.   Ax + By = b,   x ∈ X, y ∈ Y

    ∀ξ, F(·, ξ): closed and convex; h(y): closed and convex; X and Y: compact convex sets.

    - f ≡ E[F(x, ξ)]: loss function
    - h: regularizer (generalization, structure)


  • Comparison of convergence rates

    Previous methods:

    - ADMM: not a stochastic optimization method.
    - SADMM: by Ouyang et al. (2013) and Suzuki (2013), leading to suboptimal convergence rates.

                                   SADMM          Optimal SADMM
    strongly convex                O(log k / k)   O(1/k)
    Lipschitz cont. gradients      O(1/k)         O(1/k²)


  • ADMM

    min_{x ∈ X, y ∈ Y} f(x) + h(y)   s.t.   Ax + By − b = 0

    Introducing an augmented Lagrangian:

    L_β(x, y, λ) := f(x) + h(y) − ⟨λ, Ax + By − b⟩ + (β/2)‖Ax + By − b‖²₂

    λ: dual variable, β : penalty parameter.

    Algorithm 1:
    Initialize: x_0, y_0, and λ_0.
    for k ≥ 0 do
        x_{k+1} ← argmin_{x ∈ X} L_β(x, y_k, λ_k)
        y_{k+1} ← argmin_{y ∈ Y} L_β(x_{k+1}, y, λ_k)
        λ_{k+1} ← λ_k − β(Ax_{k+1} + By_{k+1} − b)
    end
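
    To make the update order concrete, here is a minimal Python sketch of an ADMM loop of this form (my own illustration, not from the slides). It assumes X and Y are unconstrained and solves both subproblems numerically; a practical implementation would use closed-form or projected updates.

```python
import numpy as np
from scipy.optimize import minimize

def admm(f, h, A, B, b, beta=1.0, iters=100):
    """Minimal ADMM sketch for min f(x) + h(y) s.t. Ax + By = b
    (assumes X and Y are the whole space)."""
    x = np.zeros(A.shape[1])
    y = np.zeros(B.shape[1])
    lam = np.zeros(A.shape[0])

    def L_beta(x, y, lam):
        r = A @ x + B @ y - b                      # constraint residual
        return f(x) + h(y) - lam @ r + 0.5 * beta * r @ r

    for _ in range(iters):
        x = minimize(lambda v: L_beta(v, y, lam), x).x   # x-update
        y = minimize(lambda v: L_beta(x, v, lam), y).x   # y-update
        lam = lam - beta * (A @ x + B @ y - b)           # dual update
    return x, y, lam
```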

  • SADMM

    - For stochastic problems over a potentially unknown distribution,
    - Modified augmented Lagrangian:

      L_β^k(x, y, λ) := f(x_k) + ⟨g_k, x⟩ + h(y) − ⟨λ, Ax + By − b⟩ + (β/2)‖Ax + By − b‖²₂ + (1/(2η_k))‖x − x_k‖²₂,    (1)

    ⇒ Linearize f(x); g_k: a stochastic (sub)gradient of f.


  • SADMM

    - For stochastic problems over a potentially unknown distribution,
    - Modified augmented Lagrangian:

      L_β^k(x, y, λ) := f(x_k) + ⟨g_k, x⟩ + h(y) − ⟨λ, Ax + By − b⟩ + (β/2)‖Ax + By − b‖²₂ + (1/(2η_k))‖x − x_k‖²₂,    (2)

    The ‖x − x_k‖²₂ prox-term ensures that (2) has a unique solution and aids the convergence analysis.
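
    For intuition, a small sketch of what the x-update of (2) looks like when X = R^n: because f is linearized, the argmin is just a linear system (this closed form is my own illustration, not stated on the slide; a constrained X would additionally require a projection).

```python
import numpy as np

def sadmm_x_update(x_k, y_k, lam, g_k, A, B, b, beta, eta_k):
    """Sketch of argmin_x <g_k, x> - <lam, A x> + (beta/2)||A x + B y_k - b||^2
    + (1/(2 eta_k)) ||x - x_k||^2, assuming X = R^n."""
    n = A.shape[1]
    H = beta * A.T @ A + np.eye(n) / eta_k                 # quadratic term
    rhs = x_k / eta_k - g_k + A.T @ lam - beta * A.T @ (B @ y_k - b)
    return np.linalg.solve(H, rhs)
```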


  • SADMM for strongly convex f

    Assumptions:

    Bounded subgradients; compact X, Y; bounded dual variables.

    Algorithm:

    Similar to the SADMM algorithm.


  • SADMM for strongly convex f

    Modification:

    x̄_k := (2/(k(k+1))) Σ_{j=0}^{k−1} (j+1) x_j,    ȳ_k := (2/(k(k+1))) Σ_{j=1}^{k} j y_j    (3)

    Using nonuniform averaging of the iterates

    Aim:

    Giving higher weight to more recent iterates.
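
    The average in (3) can be maintained incrementally with weight 2/(k+2) on the newest iterate, so no iterate history needs to be stored; a small sketch (my own illustration):

```python
import numpy as np

def weighted_average(iterates):
    """Maintain x̄_k = 2/(k(k+1)) * Σ_{j<k} (j+1) x_j from formula (3) incrementally."""
    x_bar = 0.0
    for k, x in enumerate(iterates):        # x is the k-th iterate x_k
        w = 2.0 / (k + 2)                   # weight on the newest iterate
        x_bar = (1 - w) * x_bar + w * x     # x̄_{k+1} = (1 - w) x̄_k + w x_k
    return x_bar

# e.g. weighted_average([np.array([0.0, 1.0]), np.array([2.0, 3.0])])
```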


  • SADMM for strongly convex f

    Result:

    For a specific stepsize η_k:

    E[f(x̄_k) − f(x*) + h(ȳ_k) − h(y*) + ρ‖Ax̄_k + Bȳ_k − b‖₂]
        ≤ 2G²/(µ(k+1)) + βD²_Y/(2(k+1)) + 2ρ²/(β(k+1)).

    =⇒ Convergence rate: O(1/k)


  • SADMM for smooth f

    Assumptions:

    Bounded noise variance; compact X, Y; bounded dual variables.

    Algorithm 2:
    Input: sequence (γ_k) of interpolation parameters; stepsizes η_k = (L + α_k)^{−1}.
    Initialize: x_0 = z_0, y_0.
    for k ≥ 0 do
        p_k ← (1 − γ_k) x_k + γ_k z_k
        draw g_k with E[g_k] = ∇f(p_k)
        ...


  • SADMM for smooth f

        z_{k+1} ← argmin_{x ∈ X} L̂_β^k(x, y_k, λ_k)
            (interpolatory sequences (p_k) and (z_k), and "stepsizes" γ_k, based on fast-gradient methods)

        x_{k+1} ← (1 − γ_k) x_k + γ_k z_{k+1}
            (update x by first computing z_{k+1}, using a weighted prox-term enforcing proximity to z_k)

        y_{k+1} ← argmin_{y ∈ Y} L̂_β^k(z_{k+1}, y, λ_k)
            (update y using an AL term that depends on z_{k+1} instead of x_{k+1}, for simplification)

        λ_{k+1} ← λ_k − β(Az_{k+1} + By_{k+1} − b)
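
    Putting the pieces of Algorithm 2 together, here is a rough Python sketch of the loop as I read it from the slides. The modified augmented Lagrangian L_hat below is a simplified stand-in (f linearized at p_k plus a prox term on z), the subproblems are solved numerically, and the exact (α_k, γ_k, θ_k) schedules and constraint handling are left to the paper.

```python
import numpy as np
from scipy.optimize import minimize

def optimal_sadmm(grad_f_sample, h, A, B, b, L, alphas, gammas, x0, y0,
                  beta=1.0, iters=100):
    """Sketch of the accelerated SADMM loop for smooth f (unconstrained X, Y)."""
    x, z, y = x0.copy(), x0.copy(), y0.copy()
    lam = np.zeros(A.shape[0])
    for k in range(iters):
        gamma, eta = gammas[k], 1.0 / (L + alphas[k])
        p = (1 - gamma) * x + gamma * z               # interpolation point p_k
        g = grad_f_sample(p)                          # stochastic gradient, E[g] = grad f(p_k)

        def L_hat(zz, yy, z_old=z):                   # simplified modified AL
            r = A @ zz + B @ yy - b
            return (g @ zz + h(yy) - lam @ r + 0.5 * beta * r @ r
                    + 0.5 / eta * np.sum((zz - z_old) ** 2))

        z = minimize(lambda v: L_hat(v, y), z).x      # z-update (prox toward old z)
        x = (1 - gamma) * x + gamma * z               # x-update by interpolation
        y = minimize(lambda v: L_hat(z, v), y).x      # y-update uses z_{k+1}
        lam = lam - beta * (A @ z + B @ y - b)        # dual update
    return x, y
```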


  • SADMM for smooth f

    Other modifications:

    - A modified augmented Lagrangian term based on suitable parameters (θ_k, γ_k).
    - Averaging the iterates generated by Algorithm 2 non-uniformly.
    - Smooth f(x) ⇒ no need to average over x.


  • SADMM for smooth f

    Results:

    - For specific α_j and γ_j parameters, smooth f, and non-smooth h:
    - For σ = 0,

      E[f(x̄_k) − f(x*) + h(ȳ_k) − h(y*) + ρ‖Az̄_k + Bȳ_k − b‖₂]
          ≤ 2LR²/(k+1)² + 2βD²_Y/(k+1) + 2ρ²/(β(k+1)).

    =⇒ Convergence rate of the smooth part: O(1/k2)


  • GFLasso with smooth loss

    - Graph-guided fused lasso (GFlasso):
      - Uses a graph-based regularizer,
      - Variables: vertices of the graph (x_i),
      - Penalizes the difference between adjacent variables according to the edge weight (w_{i,j}: the weight of the edge between x_i and x_j).

    Ouyang et al., ICML 2013

  • GFLasso with smooth loss

    - Graph-guided fused lasso (GFlasso): problem formulation

      min E[L(x, ξ)] + (λ/2)‖x‖²₂ + ν‖y‖₁,   s.t.   Fx − y = 0.

    - L(x, ξ) = ½(l − x^T s)² for the feature-label pair (s, l) in the training sample ξ.
    - F_{ij} = w_{ij}, F_{ji} = −w_{ij} for all edges {i, j} ∈ E.
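
    To make the constraint concrete, a short sketch (my own, under the assumption that F has one row per edge, which is how I read the slide's F_{ij} = w_{ij}, F_{ji} = −w_{ij} convention) of assembling F from a weighted edge list:

```python
import numpy as np

def gflasso_edge_matrix(n_vars, edges):
    """Build F so that (Fx)_e = w_ij * (x_i - x_j) for each weighted edge.
    `edges` is a hypothetical list of (i, j, weight) triples."""
    F = np.zeros((len(edges), n_vars))
    for e, (i, j, w) in enumerate(edges):
        F[e, i] = w     # entry for vertex i:  w_ij
        F[e, j] = -w    # entry for vertex j: -w_ij
    return F

# Example: a 3-node path graph with unit edge weights.
F = gflasso_edge_matrix(3, [(0, 1, 1.0), (1, 2, 1.0)])
```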


  • GFLasso with smooth loss

    - Comparing the following methods on the 20newsgroups dataset:
    - Purpose: classifying papers into 4 categories based on the words they include.

    [Figure: two panels over # of iterations: "Smooth f: Accuracy %" (classification accuracy, %) and "Smooth f: Objective Function" ((1/N_train) Σ_i ½(1 − l_i x^T s_i)² + (γ/2)‖x‖²₂ + ν‖Fx‖₁), comparing SGD, Proximal SGD, Online-RDA, RDA-Admm, SADMM, and optimal-SADMM.]


  • Overlapped group lasso

    - Formulation:

      f(x, ξ) = 0.1 Σ_{j=1}^{10} L(x, ξ_j),
      h(y) = C(‖x^(1)‖₁ + (1/√123)‖x^(2)‖_block).

    - L(x, ξ_j) = log(1 + e^(−l_j s_j^T x)) → logistic loss,
    - h(y): overlapping group lasso regularizer,
    - y = Ax: a concatenation of m repetitions of x,
    - ‖x‖_block = Σ_i ‖X_{i,·}‖₂ + Σ_j ‖X_{·,j}‖₂, where X denotes a reshaped version of x as a square matrix (sketched below),
    - x^(1): the first 123 elements of x,
    - x^(2): the rest of x.
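
    A minimal sketch (my own illustration) of the ‖x‖_block term, assuming the reshaped matrix X is square as the slide states:

```python
import numpy as np

def block_norm(x):
    """Reshape x into a square matrix X and sum the 2-norms of its rows and columns."""
    n = int(round(np.sqrt(len(x))))
    X = np.asarray(x).reshape(n, n)             # square reshape of x
    row_part = np.linalg.norm(X, axis=1).sum()  # Σ_i ‖X_{i,·}‖₂
    col_part = np.linalg.norm(X, axis=0).sum()  # Σ_j ‖X_{·,j}‖₂
    return row_part + col_part
```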


  • Overlapped group lasso

    - Using the "adult" dataset
    - Purpose: binary classification

    [Figure: two panels over CPU time (s): "Smooth f: Accuracy %" (classification accuracy, %) and "Smooth f: Objective Function", comparing RDA-Admm, SADMM, and Optimal SADMM.]


  • Strongly convex loss functions

    - Using the hinge loss (L(x, ξ) = max{0, 1 − l s^T x}) in the two previous examples (subgradient sketch below),
    - For the "20newsgroups" dataset,
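
    For reference, a small sketch (my own illustration) of a stochastic subgradient of this hinge loss for a single sample (s, l), which is the role g_k plays in the strongly convex SADMM updates:

```python
import numpy as np

def hinge_subgradient(x, s, l):
    """Subgradient of L(x, ξ) = max{0, 1 - l * s^T x} for one feature-label pair (s, l)."""
    if l * (s @ x) < 1.0:
        return -l * s        # hinge active: subgradient is -l s
    return np.zeros_like(x)  # hinge inactive: 0 is a valid subgradient
```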

    [Figure: two panels over # of iterations: "Strongly Convex f: Accuracy %" (classification accuracy, %) and "Strongly Convex f: Objective Function" ((1/N_train) Σ_i ½(l_i − x^T s_i)² + (γ/2)‖x‖²₂ + ν‖Fx‖₁), comparing SGD, Proximal SGD, SADMM, and optimal-SADMM.]


  • Strongly convex loss functions

    - Using the hinge loss (L(x, ξ) = max{0, 1 − l s^T x}) in the two previous examples,
    - For the "adult" dataset,

    [Figure: two panels over # of iterations (×10⁴): "Strongly Convex f: Accuracy %" (classification accuracy, %) and "Strongly Convex f: Objective Function" ((1/N_train) Σ_i f(x, ξ) + h(y)), comparing SGD, Proximal SGD, SADMM, and optimal-SADMM.]


  • Accuracy improvement vs. number of features

    - Generating synthetic data,
    - Running GFlasso with smooth loss,
    - Reporting the percentage improvement of Optimal-SADMM over SADMM in classification accuracy, as a function of the number of features.

    [Figure: "Accuracy improvement using Accelerated SADMM": % improvement vs. number of features (10 to 500).]


  • Conclusion

    - Presented two new accelerated versions of stochastic ADMM:
    - A variant attaining the theoretically optimal O(1/k) convergence rate for strongly convex stochastic problems,
    - A variant for smooth stochastic problems with an optimal O(1/k²) dependence on the smooth part.
    - Notable performance gains of the accelerated variants over their non-accelerated counterparts.


  • Future work

    - Transferring the O(log k/k) convergence rate of the last iterate, as done for SGD, to the SADMM setting.
    - Obtaining high-probability bounds under light-tailed assumptions on the stochastic error.
    - Incorporating the impact of sampling multiple stochastic gradients to decrease the variance of the gradient estimates.
    - Deriving a mirror-descent version.
    - Improving the rate dependence of the augmented Lagrangian part to O(1/k²) for smooth problems.


  • The End

    Questions?
