Variable Splitting Methods
Eric Chi
November 20, 2017
Two Typical Problems

- Regularized estimation to get sparse solutions

  \hat{\theta} = \operatorname{argmin}_{\theta} \; \frac{1}{2}\|y - X\theta\|_2^2 + \gamma\|\theta\|_1

- Arises in biomedical problems: genome-wide association studies

[Figure: y = X\theta + \varepsilon]
Two Typical Problems

- Regularized estimation to get low-rank solutions

  \hat{Z} = \operatorname{argmin}_{Z} \; \frac{1}{2}\|X - Z\|_F^2 + \gamma\|Z\|_*

- Arises in collaborative filtering: Netflix

[Figure: X = Z + \varepsilon]
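The nuclear-norm problem above is one of the few regularized problems with a closed-form solution: soft-threshold the singular values of X (singular value thresholding). A minimal numpy sketch; the function name `svt` and the toy data are mine:

```python
import numpy as np

def svt(X, gam):
    """Singular value thresholding: solves argmin_Z 0.5*||X - Z||_F^2 + gam*||Z||_*
    by soft-thresholding the singular values of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - gam, 0.0)) @ Vt

# A noisy rank-1 matrix is shrunk back to low rank.
rng = np.random.default_rng(0)
u, v = rng.standard_normal(20), rng.standard_normal(15)
X = np.outer(u, v) + 0.1 * rng.standard_normal((20, 15))
Z = svt(X, gam=1.0)
print(np.linalg.matrix_rank(Z))  # the small noise directions are thresholded away
```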
The Generic Problem

\hat{\theta} = \operatorname{argmin}_{\theta} \; \underbrace{L(\theta)}_{\text{Lack of fit}} + \underbrace{J(\theta)}_{\text{Complexity}}

Reasons for success:

- Theory: consistency and convergence rates when n, p \to \infty
- Computation: fast and scalable algorithms for computing \hat{\theta}
The Generic Problem

\hat{\theta} = \operatorname{argmin}_{\theta} \; \underbrace{L(\theta)}_{\text{Lack of fit}} + \underbrace{J(D\theta)}_{\text{Complexity}}

Reasons for success:

- Theory: consistency and convergence rates when n, p \to \infty
- Computation: fast and scalable algorithms for computing \hat{\theta}
What Variable Splitting Can Do For You

\hat{\theta} = \operatorname{argmin}_{\theta} \; L(\theta) + J(D\theta)

Variable splitting is
- helpful when J(\theta) is easy to work with but J(D\theta) is not,
- typically easy to derive and code
  - e.g. a Lasso solver in less than 10 lines of code,
- able to deliver modestly accurate solutions in 10s to 100s of iterations.
Agenda

- Case Study: Convex Clustering I
- Variable Splitting
  - ADMM
  - AMA
- Case Study: Convex Clustering II
- Case Study: ADMM for Lasso
The Clustering Problem

Task:
- Given p points in q dimensions, X \in \mathbb{R}^{q \times p},
- group similar points together.
[Figure: scatter plot of the data points (x from 0.0 to 0.9, y from 0.00 to 1.00)]
The Clustering Problem

Many approaches:
- k-means, mixture models
- Hierarchical clustering
- Spectral clustering, ...
The Clustering Problem

Computational issues:
- Nonconvex formulations
- Local minimizers
- Instability (initializations, tuning parameters, or data)
Convex Clustering

- Pelckmans et al. 2005, Lindsten et al. 2011, Hocking et al. 2011

\operatorname{minimize}_{u} \; \frac{1}{2}\sum_{i=1}^{n} \|x_i - u_i\|_2^2 + \gamma \sum_{i<j} w_{ij} \|u_i - u_j\|_2

- Assign a centroid u_i to each data point x_i.
- Convex fusion penalty:
  - shrinks cluster centroids together
  - induces sparsity in pairwise differences of centroids

  u_i - u_j = 0 \iff x_i and x_j belong to the same cluster

- \gamma: tunes the overall amount of regularization
- w_{ij}: fine-tunes pairwise shrinkage
- Generalizes the fused lasso
Too many degrees of freedom!
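The convex clustering objective above is easy to evaluate directly, which is useful for sanity-checking any solver. A small numpy sketch; the function name and the pair-indexed weight dictionary are my conventions:

```python
import numpy as np
from itertools import combinations

def convex_clustering_objective(X, U, gam, w):
    """0.5 * sum_i ||x_i - u_i||_2^2 + gam * sum_{i<j} w_ij * ||u_i - u_j||_2.
    X and U are (q, p): one point / centroid per column; w maps pairs (i, j) to weights."""
    fit = 0.5 * np.sum((X - U) ** 2)
    penalty = sum(w[i, j] * np.linalg.norm(U[:, i] - U[:, j])
                  for i, j in combinations(range(X.shape[1]), 2))
    return fit + gam * penalty

# With gam = 0 the minimizer is U = X and the objective is 0.
X = np.array([[0.0, 1.0, 4.0], [0.0, 0.0, 3.0]])
w = {(i, j): 1.0 for i, j in combinations(range(3), 2)}
print(convex_clustering_objective(X, X.copy(), 0.0, w))  # -> 0.0
```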
The Solution Path

\operatorname{minimize} \; \frac{1}{2}\sum_{i=1}^{n} \|x_i - u_i\|_2^2 + \gamma \sum_{i<j} w_{ij} \|u_i - u_j\|_2

E. Chi Convex Biclustering 11

[Figure: repeated frames of the scatter plot; as \gamma increases, the fitted centroids u_i shrink together and trace out the solution path]
Two Interlocking Half-Moons

Senate Voting

[Figure: panels A and B plot senators (Sessions, Shelby, Stevens, McCain, Hutchinson, Lincoln, Boxer, Dodd, Lieberman, Cleland, Miller, Lugar, Breaux, Landrieu, Collins, Snowe, Carnahan, Baucus, Hagel, Nelson, Conrad, Voinovich, Nickles, Specter, Chafee, Hollings, Johnson, Jeffords, Byrd) against Principal Component 1 and Principal Component 2, colored by Party: (D), (I), (R)]
Apparently Non-Trivial Optimization Problem

Why is this hard to solve?

\operatorname{minimize} \; \frac{1}{2}\sum_{i=1}^{n} \|x_i - u_i\|_2^2 + \gamma \sum_{i<j} w_{ij} \|u_i - u_j\|_2

General Recipe:
1. Introduce a dummy variable: unconstrained \to equality constrained
2. Use an iterative method to solve the equality-constrained version
Nonsmooth? Not the issue. The difficulty is the nonsmooth penalty composed with an affine transformation of u.
Convex Clustering: Variable Split Version

\operatorname{minimize} \; \frac{1}{2}\sum_{i=1}^{p} \|x_i - u_i\|_2^2 + \gamma \sum_{l} w_l \|v_l\|
subject to u_{l_1} - u_{l_2} - v_l = 0,

where l = (l_1, l_2) with l_1 < l_2.

Equality constrained optimization...
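The constraint above says the dummy variables v_l are an affine transformation of u: stacking the pairwise differences u_{l_1} - u_{l_2} is multiplication by a sparse difference matrix. A small dense numpy sketch; the function name and index convention are mine:

```python
import numpy as np
from itertools import combinations

def difference_matrix(p):
    """Row l = (l1, l2), l1 < l2, of D maps the stacked centroids to u_{l1} - u_{l2}."""
    pairs = list(combinations(range(p), 2))
    D = np.zeros((len(pairs), p))
    for row, (l1, l2) in enumerate(pairs):
        D[row, l1], D[row, l2] = 1.0, -1.0
    return D, pairs

D, pairs = difference_matrix(3)
U = np.array([[0.0, 1.0, 4.0]])  # one-dimensional centroids, one per column
V = U @ D.T                      # column l holds v_l = u_{l1} - u_{l2}
print(pairs)  # [(0, 1), (0, 2), (1, 2)]
print(V)      # [[-1. -4. -3.]]
```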
Lagrange Multipliers

minimize f(u) + g(v)
subject to Au + Bv = c.   (5)

L(u, v, \lambda) = f(u) + g(v) + \langle \lambda, c - Au - Bv \rangle

(u^\star, v^\star) is a solution if there is a \lambda^\star such that (u^\star, v^\star, \lambda^\star) is a stationary point of the Lagrangian,

\nabla L(u^\star, v^\star, \lambda^\star) = 0,

equivalently (u^\star, v^\star) = \operatorname{argmin}_{u,v} L(u, v, \lambda^\star).

[Image credit: By Nexcis - Own work, Public Domain, https://commons.wikimedia.org/w/index.php?curid=3831272]

The Dual & AMA

D(\lambda) = \min_{u,v} L(u, v, \lambda)

For any feasible (u, v), c - Au - Bv = 0, so

D(\lambda) = \min_{u,v} L(u, v, \lambda) \le L(u, v, \lambda) = f(u) + g(v),

and therefore

D(\lambda^\star) = \max_{\lambda} D(\lambda) \le \min_{u,v \text{ feasible}} f(u) + g(v) = f(u^\star) + g(v^\star).

In fact D(\lambda^\star) = f(u^\star) + g(v^\star).
ALM: Augmented Lagrangian Method

minimize f(u) + g(v)
subject to Au + Bv = c.   (5)

ALM solves the equivalent problem

minimize f(u) + g(v) + \frac{\nu}{2}\|c - Au - Bv\|_2^2
subject to Au + Bv = c.   (6)

Typically one needs to solve this iteratively.

The augmented Lagrangian:

L_\nu(u, v, \lambda) = f(u) + g(v) + \langle \lambda, c - Au - Bv \rangle + \frac{\nu}{2}\|c - Au - Bv\|_2^2
ALM Updates

(u^{m+1}, v^{m+1}) = \operatorname{argmin}_{u,v} \; L_\nu(u, v, \lambda^m)
\lambda^{m+1} = \lambda^m + \nu(c - Au^{m+1} - Bv^{m+1}).   (8)
The joint (u, v) minimization in the ALM updates is often hard. Two strategies yield simpler algorithms:

1. Alternating Direction Method of Multipliers (ADMM) (Gabay & Mercier 1976, Glowinski & Marrocco 1975)
2. Alternating Minimization Algorithm (AMA) (Tseng 1991)
ADMM: Alternating Direction Method of Multipliers

Unfortunately, minimizing the augmented Lagrangian over u and v jointly is often difficult. ADMM and AMA adopt different strategies for simplifying the minimization subproblem in the ALM updates. ADMM minimizes the augmented Lagrangian one block of variables at a time. This yields the algorithm

u^{m+1} = \operatorname{argmin}_{u} \; L_\nu(u, v^m, \lambda^m)
v^{m+1} = \operatorname{argmin}_{v} \; L_\nu(u^{m+1}, v, \lambda^m)
\lambda^{m+1} = \lambda^m + \nu(c - Au^{m+1} - Bv^{m+1}).   (9)

Goal: simpler algorithms than the often-hard joint ALM update.
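The three ADMM updates can be packaged as a generic loop once the two block minimizers are supplied. A sketch on a toy equality-constrained problem; the solver signature and the closed-form block updates (derived for this toy instance) are mine:

```python
import numpy as np

def admm(argmin_u, argmin_v, A, B, c, nu=1.0, iters=200):
    """Generic ADMM for: minimize f(u) + g(v) subject to Au + Bv = c.
    argmin_u(v, lam) and argmin_v(u, lam) solve the two block minimizations
    of the augmented Lagrangian L_nu; lam is the multiplier."""
    u, v, lam = np.zeros(A.shape[1]), np.zeros(B.shape[1]), np.zeros(len(c))
    for _ in range(iters):
        u = argmin_u(v, lam)                  # min_u L_nu(u, v, lam)
        v = argmin_v(u, lam)                  # min_v L_nu(u, v, lam)
        lam = lam + nu * (c - A @ u - B @ v)  # multiplier ascent step
    return u, v

# Toy problem: minimize 0.5*(u-3)^2 + 0.5*v^2 subject to u - v = 1,
# i.e. A = [1], B = [-1], c = [1]; solution u* = 2, v* = 1.
nu = 1.0
A, B, c = np.array([[1.0]]), np.array([[-1.0]]), np.array([1.0])
au = lambda v, lam: (np.array([3.0]) + lam + nu * (c + v)) / (1.0 + nu)
av = lambda u, lam: (-lam + nu * (u - c)) / (1.0 + nu)
u, v = admm(au, av, A, B, c, nu=nu)
print(u, v)  # converges to u = 2, v = 1
```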
AMA: Alternating Minimization Algorithm

AMA takes a slightly different tack and updates the first block u without augmentation, assuming f(u) is strongly convex. This change is accomplished by setting the positive tuning constant \nu to 0 in the u-update. The overall algorithm iterates according to

u^{m+1} = \operatorname{argmin}_{u} \; L_0(u, v^m, \lambda^m)
v^{m+1} = \operatorname{argmin}_{v} \; L_\nu(u^{m+1}, v, \lambda^m)
\lambda^{m+1} = \lambda^m + \nu(c - Au^{m+1} - Bv^{m+1}).   (10)
ADMM Updates (convex clustering)

u_i = \frac{1}{1 + p\nu} y_i + \frac{p\nu}{1 + p\nu} \bar{x}, \quad \text{where } y_i = x_i + \sum_{l: l_1 = i} [\lambda_l + \nu v_l] - \sum_{l: l_2 = i} [\lambda_l + \nu v_l]

v_l = \operatorname{argmin}_{v} \; \frac{1}{2}\|v - (u_{l_1} - u_{l_2} - \nu^{-1}\lambda_l)\|_2^2 + \frac{\sigma_l}{\nu}\|v\| = \operatorname{prox}_{\sigma_l \|\cdot\|/\nu}(u_{l_1} - u_{l_2} - \nu^{-1}\lambda_l),

where \sigma_l = \gamma w_l.

\lambda_l = \lambda_l + \nu(v_l - u_{l_1} + u_{l_2}).
AMA Updates (convex clustering)

u_i = x_i + \sum_{l: l_1 = i} \lambda_l - \sum_{l: l_2 = i} \lambda_l

v_l = \operatorname{argmin}_{v} \; \frac{1}{2}\|v - (u_{l_1} - u_{l_2} - \nu^{-1}\lambda_l)\|_2^2 + \frac{\sigma_l}{\nu}\|v\| = \operatorname{prox}_{\sigma_l \|\cdot\|/\nu}(u_{l_1} - u_{l_2} - \nu^{-1}\lambda_l),

where \sigma_l = \gamma w_l.

\lambda_l = \lambda_l + \nu(v_l - u_{l_1} + u_{l_2}).
Proximal Map

For \sigma > 0 the function

\operatorname{prox}_{\sigma\Omega}(v) = \operatorname{argmin}_{\tilde{v}} \left[ \sigma\Omega(\tilde{v}) + \frac{1}{2}\|\tilde{v} - v\|_2^2 \right]

is the proximal map of the function \Omega. The minimizer always exists and is unique for norms.
Proximal maps for common norms

Table: Proximal maps for common norms.

Norm           \Omega(v)                          \operatorname{prox}_{\sigma\Omega}(v)
\ell_1         \|v\|_1                            [1 - \sigma/|v_l|]_+ v_l  (elementwise)
\ell_2         \|v\|_2                            [1 - \sigma/\|v\|_2]_+ v
\ell_\infty    \|v\|_\infty                       v - P_{\sigma S}(v),  S the unit \ell_1 ball
\ell_{1,2}     \sum_{g \in G} \|v_g\|_2           [1 - \sigma/\|v_g\|_2]_+ v_g  (groupwise)
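The \ell_1 and \ell_2 rows of the table are one-liners in numpy (function names are mine):

```python
import numpy as np

def prox_l1(v, sigma):
    """Elementwise soft thresholding: prox of sigma * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - sigma, 0.0)

def prox_l2(v, sigma):
    """Block soft thresholding: prox of sigma * ||.||_2."""
    n = np.linalg.norm(v)
    return np.zeros_like(v) if n <= sigma else (1.0 - sigma / n) * v

print(prox_l1(np.array([3.0, 0.5, -2.0]), 1.0))  # [ 2.  0. -1.]
print(prox_l2(np.array([3.0, 4.0]), 1.0))        # [2.4 3.2]
```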
What's the Difference?

[Figure: square root of run time (sec) versus p, from 100 to 500, for Subgradient, AMA, and ADMM. At the largest problem size the annotated run times are 3.5 hrs, 2.9 hrs, and 16 min, with AMA the fastest.]
Remarks

- Both AMA and ADMM converge
- Both AMA and ADMM can be accelerated
  - Beck and Teboulle (2009)
  - Goldstein, O'Donoghue, and Setzer (2012)
- AMA and ADMM look very similar but...
- Convergence speed: AMA is clearly faster
- Convergence: ADMM converges when \nu > 0; AMA converges when \nu \le 1/p
- AMA requires stronger assumptions: the smooth part of the objective needs to be strongly convex
ADMM solver for Lasso

\operatorname{minimize}_{\theta} \; \frac{1}{2}\|y - X\theta\|_2^2 + \gamma\|\theta\|_1

Split version:

\operatorname{minimize} \; \frac{1}{2}\|y - X\theta\|_2^2 + \gamma\|v\|_1 subject to \theta = v.

Augmented Lagrangian (scaled form):

L(\theta, v, \lambda) = \frac{1}{2}\|y - X\theta\|_2^2 + \gamma\|v\|_1 + \frac{\nu}{2}\|\theta - v + \lambda\|_2^2.

ADMM Updates:

\theta^k = \operatorname{argmin}_{\theta} \; \frac{1}{2}\|y - X\theta\|_2^2 + \frac{\nu}{2}\|\theta - v^{k-1} + \lambda^{k-1}\|_2^2
v^k = \operatorname{argmin}_{v} \; \gamma\|v\|_1 + \frac{\nu}{2}\|v - \theta^k - \lambda^{k-1}\|_2^2
\lambda^k = \lambda^{k-1} + \theta^k - v^k.
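The updates above assemble into the Lasso solver in "less than 10 lines" promised earlier. A numpy sketch; the function name and the choice \nu = 1 are mine, and the \theta-update inverts X^T X + \nu I directly (in practice one would factor it once):

```python
import numpy as np

def lasso_admm(X, y, gam, nu=1.0, iters=500):
    """ADMM for: minimize 0.5*||y - X theta||_2^2 + gam*||theta||_1, split theta = v."""
    n, p = X.shape
    theta, v, lam = np.zeros(p), np.zeros(p), np.zeros(p)
    H = np.linalg.inv(X.T @ X + nu * np.eye(p))  # theta-update system, inverted once
    Xty = X.T @ y
    for _ in range(iters):
        theta = H @ (Xty + nu * (v - lam))                    # ridge-like update
        z = theta + lam
        v = np.sign(z) * np.maximum(np.abs(z) - gam / nu, 0)  # soft threshold
        lam = lam + theta - v                                 # scaled dual ascent
    return v  # v carries the exact zeros

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
beta = np.zeros(10); beta[:2] = [3.0, -2.0]
y = X @ beta
theta_hat = lasso_admm(X, y, gam=0.1)
print(np.nonzero(theta_hat)[0])  # support should concentrate on the first two coordinates
```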
Getting started

- Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011), "Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers," Foundations and Trends in Machine Learning, 3, 1-122.
- Tseng, P. (1991), "Applications of a Splitting Algorithm to Decomposition in Convex Programming and Variational Inequalities," SIAM Journal on Control and Optimization, 29, 119-138.