Transcript of "Towards optimal stochastic ADMM" (Samaneh Azadi, Suvrit Sra; UC Berkeley and Max Planck Institute, Tübingen), 24 slides.

  • Towards optimal stochastic ADMM

    Samaneh Azadi, Suvrit Sra

    UC Berkeley, Max Planck Institute, Tübingen

    Thanks: Aaditya Ramdas (CMU)

  • Problem

    Linearly constrained stochastic convex optimization

    min E_ξ[F(x, ξ)] + h(y)   s.t.   Ax + By = b,   x ∈ X, y ∈ Y

    ∀ξ, F(·, ξ): closed and convex; h(y): closed and convex; X and Y: compact convex sets.

    - f ≡ E[F(x, ξ)]: loss function
    - h: regularizer (generalization, structure)


  • Comparison of convergence rates

    Previous methods:

    - ADMM: not a stochastic optimization method.
    - SADMM: by Ouyang et al. (2013) and Suzuki (2013), leading to suboptimal convergence rates.

                                   SADMM          Optimal SADMM
    strongly convex                O(log k / k)   O(1/k)
    Lipschitz cont. gradients      O(1/k)         O(1/k²)


  • ADMM

    min_{x ∈ X, y ∈ Y} f(x) + h(y)   s.t.   Ax + By − b = 0

    Introducing an augmented Lagrangian:

    L_β(x, y, λ) := f(x) + h(y) − ⟨λ, Ax + By − b⟩ + (β/2)‖Ax + By − b‖²₂

    λ: dual variable, β : penalty parameter.

    Algorithm 1:
    Initialize: x_0, y_0, and λ_0.
    for k ≥ 0 do
        x_{k+1} ← argmin_{x ∈ X} L_β(x, y_k, λ_k)
        y_{k+1} ← argmin_{y ∈ Y} L_β(x_{k+1}, y, λ_k)
        λ_{k+1} ← λ_k − β(Ax_{k+1} + By_{k+1} − b)
    end
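
    To make the update order concrete, here is a minimal Python sketch of an ADMM loop of this form (my own illustration, not from the slides). It assumes X and Y are unconstrained and solves both subproblems numerically; a practical implementation would use closed-form or projected updates.

```python
import numpy as np
from scipy.optimize import minimize

def admm(f, h, A, B, b, beta=1.0, iters=100):
    """Minimal ADMM sketch for min f(x) + h(y) s.t. Ax + By = b
    (assumes X and Y are the whole space)."""
    x = np.zeros(A.shape[1])
    y = np.zeros(B.shape[1])
    lam = np.zeros(A.shape[0])

    def L_beta(x, y, lam):
        r = A @ x + B @ y - b                      # constraint residual
        return f(x) + h(y) - lam @ r + 0.5 * beta * r @ r

    for _ in range(iters):
        x = minimize(lambda v: L_beta(v, y, lam), x).x   # x-update
        y = minimize(lambda v: L_beta(x, v, lam), y).x   # y-update
        lam = lam - beta * (A @ x + B @ y - b)           # dual update
    return x, y, lam
```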

  • SADMM

    - For stochastic problems over a potentially unknown distribution,
    - Modified augmented Lagrangian:

      L_β^k(x, y, λ) := f(x_k) + ⟨g_k, x⟩ + h(y) − ⟨λ, Ax + By − b⟩ + (β/2)‖Ax + By − b‖²₂ + (1/(2η_k))‖x − x_k‖²₂,    (1)

    ⇒ Linearize f(x); g_k: a stochastic (sub)gradient of f.


  • SADMM

    - For stochastic problems over a potentially unknown distribution,
    - Modified augmented Lagrangian:

      L_β^k(x, y, λ) := f(x_k) + ⟨g_k, x⟩ + h(y) − ⟨λ, Ax + By − b⟩ + (β/2)‖Ax + By − b‖²₂ + (1/(2η_k))‖x − x_k‖²₂,    (2)

    The ‖x − x_k‖²₂ prox-term ensures that (2) has a unique solution and aids the convergence analysis.
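
    For intuition, a small sketch of what the x-update of (2) looks like when X = R^n: because f is linearized, the argmin is just a linear system (this closed form is my own illustration, not stated on the slide; a constrained X would additionally require a projection).

```python
import numpy as np

def sadmm_x_update(x_k, y_k, lam, g_k, A, B, b, beta, eta_k):
    """Sketch of argmin_x <g_k, x> - <lam, A x> + (beta/2)||A x + B y_k - b||^2
    + (1/(2 eta_k)) ||x - x_k||^2, assuming X = R^n."""
    n = A.shape[1]
    H = beta * A.T @ A + np.eye(n) / eta_k                 # quadratic term
    rhs = x_k / eta_k - g_k + A.T @ lam - beta * A.T @ (B @ y_k - b)
    return np.linalg.solve(H, rhs)
```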


  • SADMM for strongly convex f

    Assumptions:

    Bounded subgradients; compact X, Y; bounded dual variables.

    Algorithm:

    Similar to the SADMM algorithm.


  • SADMM for strongly convex f

    Modification:

    x̄_k := (2/(k(k+1))) Σ_{j=0}^{k−1} (j+1) x_j,    ȳ_k := (2/(k(k+1))) Σ_{j=1}^{k} j y_j    (3)

    Using nonuniform averaging of the iterates

    Aim:

    Giving higher weight to more recent iterates.
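
    The average in (3) can be maintained incrementally with weight 2/(k+2) on the newest iterate, so no iterate history needs to be stored; a small sketch (my own illustration):

```python
import numpy as np

def weighted_average(iterates):
    """Maintain x̄_k = 2/(k(k+1)) * Σ_{j<k} (j+1) x_j from formula (3) incrementally."""
    x_bar = 0.0
    for k, x in enumerate(iterates):        # x is the k-th iterate x_k
        w = 2.0 / (k + 2)                   # weight on the newest iterate
        x_bar = (1 - w) * x_bar + w * x     # x̄_{k+1} = (1 - w) x̄_k + w x_k
    return x_bar

# e.g. weighted_average([np.array([0.0, 1.0]), np.array([2.0, 3.0])])
```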


  • SADMM for strongly convex f

    Result:

    For a specific stepsize η_k:

    E[f(x̄_k) − f(x*) + h(ȳ_k) − h(y*) + ρ‖Ax̄_k + Bȳ_k − b‖₂]
        ≤ 2G²/(µ(k+1)) + βD²_Y/(2(k+1)) + 2ρ²/(β(k+1)).

    =⇒ Convergence rate: O(1/k)


  • SADMM for smooth f

    Assumptions:

    Bounded noise variance; compact X, Y; bounded dual variables.

    Algorithm 2:
    Input: sequence (γ_k) of interpolation parameters; stepsizes η_k = (L + α_k)^{−1}.
    Initialize: x_0 = z_0, y_0.
    for k ≥ 0 do
        p_k ← (1 − γ_k) x_k + γ_k z_k
        draw g_k with E[g_k] = ∇f(p_k)
        ...


  • SADMM for smooth f

        z_{k+1} ← argmin_{x ∈ X} L̂_β^k(x, y_k, λ_k)
            (interpolatory sequences (p_k) and (z_k), and "stepsizes" γ_k, based on fast-gradient methods)

        x_{k+1} ← (1 − γ_k) x_k + γ_k z_{k+1}
            (update x by first computing z_{k+1}, using a weighted prox-term enforcing proximity to z_k)

        y_{k+1} ← argmin_{y ∈ Y} L̂_β^k(z_{k+1}, y, λ_k)
            (update y using an AL term that depends on z_{k+1} instead of x_{k+1}, for simplification)

        λ_{k+1} ← λ_k − β(Az_{k+1} + By_{k+1} − b)
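
    Putting the pieces of Algorithm 2 together, here is a rough Python sketch of the loop as I read it from the slides. The modified augmented Lagrangian L_hat below is a simplified stand-in (f linearized at p_k plus a prox term on z), the subproblems are solved numerically, and the exact (α_k, γ_k, θ_k) schedules and constraint handling are left to the paper.

```python
import numpy as np
from scipy.optimize import minimize

def optimal_sadmm(grad_f_sample, h, A, B, b, L, alphas, gammas, x0, y0,
                  beta=1.0, iters=100):
    """Sketch of the accelerated SADMM loop for smooth f (unconstrained X, Y)."""
    x, z, y = x0.copy(), x0.copy(), y0.copy()
    lam = np.zeros(A.shape[0])
    for k in range(iters):
        gamma, eta = gammas[k], 1.0 / (L + alphas[k])
        p = (1 - gamma) * x + gamma * z               # interpolation point p_k
        g = grad_f_sample(p)                          # stochastic gradient, E[g] = grad f(p_k)

        def L_hat(zz, yy, z_old=z):                   # simplified modified AL
            r = A @ zz + B @ yy - b
            return (g @ zz + h(yy) - lam @ r + 0.5 * beta * r @ r
                    + 0.5 / eta * np.sum((zz - z_old) ** 2))

        z = minimize(lambda v: L_hat(v, y), z).x      # z-update (prox toward old z)
        x = (1 - gamma) * x + gamma * z               # x-update by interpolation
        y = minimize(lambda v: L_hat(z, v), y).x      # y-update uses z_{k+1}
        lam = lam - beta * (A @ z + B @ y - b)        # dual update
    return x, y
```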


  • SADMM for smooth f

    Other modifications:

    - A modified augmented Lagrangian term based on suitable parameters (θ_k, γ_k).
    - Averaging the iterates generated by Algorithm 2 non-uniformly.
    - Smooth f(x) ⇒ no need to average over x.


  • SADMM for smooth f

    Results:

    - For specific α_j and γ_j parameters, smooth f, and non-smooth h:
    - For σ = 0,

      E[f(x̄_k) − f(x*) + h(ȳ_k) − h(y*) + ρ‖Az̄_k + Bȳ_k − b‖₂]
          ≤ 2LR²/(k+1)² + 2βD²_Y/(k+1) + 2ρ²/(β(k+1)).

    =⇒ Convergence rate of the smooth part: O(1/k2)


  • GFLasso with smooth loss

    - Graph-guided fused lasso (GFlasso):
      - Uses a graph-based regularizer,
      - Variables: vertices of the graph (x_i),
      - Penalizes the difference between adjacent variables according to the edge weight (w_{i,j}: the weight of the edge between x_i and x_j).

    Ouyang et al., ICML 2013

  • GFLasso with smooth loss

    - Graph-guided fused lasso (GFlasso): problem formulation

      min E[L(x, ξ)] + (λ/2)‖x‖²₂ + ν‖y‖₁,   s.t.   Fx − y = 0.

    - L(x, ξ) = ½(l − x^T s)² for the feature-label pair (s, l) in the training sample ξ.
    - F_{ij} = w_{ij}, F_{ji} = −w_{ij} for all edges {i, j} ∈ E.
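
    To make the constraint concrete, a short sketch (my own, under the assumption that F has one row per edge, which is how I read the slide's F_{ij} = w_{ij}, F_{ji} = −w_{ij} convention) of assembling F from a weighted edge list:

```python
import numpy as np

def gflasso_edge_matrix(n_vars, edges):
    """Build F so that (Fx)_e = w_ij * (x_i - x_j) for each weighted edge.
    `edges` is a hypothetical list of (i, j, weight) triples."""
    F = np.zeros((len(edges), n_vars))
    for e, (i, j, w) in enumerate(edges):
        F[e, i] = w     # entry for vertex i:  w_ij
        F[e, j] = -w    # entry for vertex j: -w_ij
    return F

# Example: a 3-node path graph with unit edge weights.
F = gflasso_edge_matrix(3, [(0, 1, 1.0), (1, 2, 1.0)])
```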


  • GFLasso with smooth loss

    - Comparing the following methods on the 20newsgroups dataset:
    - Purpose: classifying papers into 4 categories based on the words they include.

    [Figure: two panels over # of iterations: "Smooth f: Accuracy %" (classification accuracy, %) and "Smooth f: Objective Function" ((1/N_train) Σ_i ½(1 − l_i x^T s_i)² + (γ/2)‖x‖²₂ + ν‖Fx‖₁), comparing SGD, Proximal SGD, Online-RDA, RDA-Admm, SADMM, and optimal-SADMM.]


  • Overlapped group lasso

    - Formulation:

      f(x, ξ) = 0.1 Σ_{j=1}^{10} L(x, ξ_j),
      h(y) = C(‖x^(1)‖₁ + (1/√123)‖x^(2)‖_block).

    - L(x, ξ_j) = log(1 + e^(−l_j s_j^T x)) → logistic loss,
    - h(y): overlapping group lasso regularizer,
    - y = Ax: a concatenation of m repetitions of x,
    - ‖x‖_block = Σ_i ‖X_{i,·}‖₂ + Σ_j ‖X_{·,j}‖₂, where X denotes a reshaped version of x as a square matrix (sketched below),
    - x^(1): the first 123 elements of x,
    - x^(2): the rest of x.
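
    A minimal sketch (my own illustration) of the ‖x‖_block term, assuming the reshaped matrix X is square as the slide states:

```python
import numpy as np

def block_norm(x):
    """Reshape x into a square matrix X and sum the 2-norms of its rows and columns."""
    n = int(round(np.sqrt(len(x))))
    X = np.asarray(x).reshape(n, n)             # square reshape of x
    row_part = np.linalg.norm(X, axis=1).sum()  # Σ_i ‖X_{i,·}‖₂
    col_part = np.linalg.norm(X, axis=0).sum()  # Σ_j ‖X_{·,j}‖₂
    return row_part + col_part
```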


  • Overlapped group lasso

    - Using the "adult" dataset
    - Purpose: binary classification

    [Figure: two panels over CPU time (s): "Smooth f: Accuracy %" (classification accuracy, %) and "Smooth f: Objective Function", comparing RDA-Admm, SADMM, and Optimal SADMM.]


  • Strongly convex loss functions

    - Using the hinge loss (L(x, ξ) = max{0, 1 − l s^T x}) in the two previous examples (subgradient sketch below),
    - For the "20newsgroups" dataset,
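
    For reference, a small sketch (my own illustration) of a stochastic subgradient of this hinge loss for a single sample (s, l), which is the role g_k plays in the strongly convex SADMM updates:

```python
import numpy as np

def hinge_subgradient(x, s, l):
    """Subgradient of L(x, ξ) = max{0, 1 - l * s^T x} for one feature-label pair (s, l)."""
    if l * (s @ x) < 1.0:
        return -l * s        # hinge active: subgradient is -l s
    return np.zeros_like(x)  # hinge inactive: 0 is a valid subgradient
```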

    [Figure: two panels over # of iterations: "Strongly Convex f: Accuracy %" (classification accuracy, %) and "Strongly Convex f: Objective Function" ((1/N_train) Σ_i ½(l_i − x^T s_i)² + (γ/2)‖x‖²₂ + ν‖Fx‖₁), comparing SGD, Proximal SGD, SADMM, and optimal-SADMM.]


  • Strongly convex loss functions

    - Using the hinge loss (L(x, ξ) = max{0, 1 − l s^T x}) in the two previous examples,
    - For the "adult" dataset,

    [Figure: two panels over # of iterations (×10⁴): "Strongly Convex f: Accuracy %" (classification accuracy, %) and "Strongly Convex f: Objective Function" ((1/N_train) Σ_i f(x, ξ) + h(y)), comparing SGD, Proximal SGD, SADMM, and optimal-SADMM.]


  • Accuracy improvement vs. number of features

    - Generating synthetic data,
    - Running GFlasso with smooth loss,
    - Reporting the percentage improvement of Optimal-SADMM over SADMM in classification accuracy, as a function of the number of features.

    [Figure: "Accuracy improvement using Accelerated SADMM": % improvement vs. number of features (10 to 500).]


  • Conclusion

    - Presented two new accelerated versions of stochastic ADMM:
    - A variant attaining the theoretically optimal O(1/k) convergence rate for strongly convex stochastic problems,
    - A variant for smooth stochastic problems with an optimal O(1/k²) dependence on the smooth part.
    - Notable performance gains of the accelerated variants over their non-accelerated counterparts.


  • Future work

    - Transferring the O(log k/k) convergence rate of the last iterate, as done for SGD, to the SADMM setting.
    - Obtaining high-probability bounds under light-tailed assumptions on the stochastic error.
    - Incorporating the impact of sampling multiple stochastic gradients to decrease the variance of the gradient estimates.
    - Deriving a mirror-descent version.
    - Improving the rate dependence of the augmented Lagrangian part to O(1/k²) for smooth problems.


  • The End

    Questions?
