Proximal methods for minimizing convex compositions
Courtney Paquette, joint work with D. Drusvyatskiy
Department of Mathematics, University of Washington (Seattle)
WCOM Fall 2016, October 1, 2016
1 / 14
A function f is α-convex and β-smooth if

    qx ≤ f ≤ Qx

where

    qx(y) = f(x) + 〈∇f(x), y − x〉 + (α/2)‖y − x‖²
    Qx(y) = f(x) + 〈∇f(x), y − x〉 + (β/2)‖y − x‖²

Condition number: κ = β/α

2 / 14
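As a quick numerical sanity check of the two quadratic bounds, the sketch below uses an illustrative function not from the talk: f(x) = x² + cos(x), whose second derivative 2 − cos(x) lies in [1, 3], so α = 1 and β = 3.

```python
import numpy as np

# f(x) = x^2 + cos(x): f''(x) = 2 - cos(x) lies in [1, 3], so alpha = 1 and
# beta = 3 (an illustrative choice, not from the slides).
f = lambda x: x**2 + np.cos(x)
df = lambda x: 2*x - np.sin(x)
alpha, beta = 1.0, 3.0

def q(x, y, m):
    # quadratic model f(x) + <f'(x), y - x> + (m/2)*|y - x|^2
    return f(x) + df(x)*(y - x) + 0.5*m*(y - x)**2

xs = np.linspace(-2, 2, 41)
ys = np.linspace(-2, 2, 41)
for x in xs:
    assert np.all(q(x, ys, alpha) <= f(ys) + 1e-9)   # q_x <= f everywhere
    assert np.all(f(ys) <= q(x, ys, beta) + 1e-9)    # f <= Q_x everywhere
print("condition number kappa =", beta/alpha)
```

The assertions verify the sandwich qx ≤ f ≤ Qx on a grid of base points x and query points y.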
Complexity of first-order methods

Gradient descent: xk+1 = xk − (1/β)∇f(xk)
Majorization view: xk+1 = argmin_y Qxk(y)

                   | β-smooth  | β-smooth and α-convex
Gradient descent   | β/ε       | κ · log(1/ε)
Optimal methods    | √(β/ε)    | √κ · log(1/ε)

Table: Iterations until f(xk) − f* < ε
(Nesterov ’83, Yudin-Nemirovsky ’83)

3 / 14
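The κ·log(1/ε) entry can be seen on a small assumed instance: gradient descent on a strongly convex quadratic f(x) = ½·xᵀAx with α = 1, β = 10 (so κ = 10), where the iteration count grows like κ·log(1/ε).

```python
import numpy as np

# Gradient descent xk+1 = xk - (1/beta)*grad f(xk) on an illustrative strongly
# convex quadratic f(x) = 0.5*x'Ax (not from the talk): alpha = 1, beta = 10,
# so kappa = 10 and the iteration count scales like kappa*log(1/eps).
A = np.diag([1.0, 10.0])
alpha, beta = 1.0, 10.0
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
eps, k = 1e-8, 0
while f(x) > eps:            # f* = 0, attained at x = 0
    x = x - grad(x) / beta
    k += 1
print(k)                     # finite, on the order of kappa*log(1/eps)
```

Halving ε adds roughly κ·log 2 more iterations, which is the linear-convergence behavior in the table.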
General set-up

The convex-composite problem is

    min_x F(x) := h(c(x)) + g(x)

where
    c : Rⁿ → Rᵐ is C¹-smooth with β-Lipschitz Jacobian,
    h : Rᵐ → R is closed, convex, and L-Lipschitz,
    g : Rⁿ → R ∪ {∞} is convex.

For convenience, set μ := Lβ.

Examples:
    Additive composite minimization (take h to be the identity):
        min_x c(x) + g(x)
    Nonlinear least squares:
        min_x { ‖c(x)‖ : ℓi ≤ xi ≤ ui, i = 1, …, n }
    Exact penalty subproblem:
        min_x g(x) + dist_K(c(x))

4 / 14
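To make the decomposition concrete, here are the three ingredients of F = h∘c + g instantiated for the nonlinear-least-squares example; the particular map c below is an assumed toy instance, while h and g follow the slide's pattern (h = ‖·‖, g = indicator of the box constraints).

```python
import numpy as np

# F(x) = h(c(x)) + g(x) for nonlinear least squares over a box.
# The map c is an assumed toy instance; h and g follow the slide.
def c(x):                       # smooth map with Lipschitz Jacobian
    return np.array([x[0]**2 - x[1], np.sin(x[0]) + x[1]])

def h(z):                       # closed, convex, 1-Lipschitz
    return np.linalg.norm(z)

lo, hi = -1.0, 1.0

def g(x):                       # convex indicator of the box [lo, hi]^n
    return 0.0 if np.all((lo <= x) & (x <= hi)) else np.inf

def F(x):
    return h(c(x)) + g(x)

print(F(np.array([0.5, 0.25])))   # finite: point is inside the box
print(F(np.array([2.0, 0.0])))    # inf: box constraint violated
```

Note that F is nonsmooth (through h) and nonconvex (through c), yet each piece separately has the structure the algorithm exploits.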
Prox-Linear algorithm - Base Case

Seek points x which are first-order stationary: F′(x; v) ≥ 0 for all v. Equivalently,

    0 ∈ ∂g(x) + ∇c(x)*∂h(c(x))

Idea: Majorization. For all y,

    F(y) ≤ h(c(x) + ∇c(x)(y − x)) + (μ/2)‖y − x‖² + g(y)

Prox-linear mapping:

    x⁺ := argmin_y { h(c(x) + ∇c(x)(y − x)) + (μ/2)‖y − x‖² + g(y) }

Prox-linear method: xk+1 = xk⁺

(Burke ’85, ’91, Fletcher ’82, Powell ’84, Wright ’90, Yuan ’83)
E.g.: proximal gradient, Levenberg-Marquardt.

The prox-gradient: G(x) = μ(x − x⁺)

5 / 14
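For the special case h = ½‖·‖² and g = 0 mentioned on the slide, the prox-linear mapping has a closed form: it is exactly a Levenberg-Marquardt step. The sketch below iterates it on an assumed toy map c (not from the talk).

```python
import numpy as np

# Prox-linear step for h = 0.5*||.||^2, g = 0: the subproblem
#   min_y 0.5*||c(x) + J(x)(y - x)||^2 + (mu/2)*||y - x||^2
# is solved in closed form by the Levenberg-Marquardt update below.
def c(x):                       # assumed toy smooth map
    return np.array([x[0]**2 - x[1], x[0] + x[1] - 2.0])

def jac(x):
    return np.array([[2*x[0], -1.0], [1.0, 1.0]])

def prox_linear_step(x, mu):
    J, r = jac(x), c(x)
    d = np.linalg.solve(J.T @ J + mu*np.eye(2), -J.T @ r)
    return x + d

x, mu = np.array([2.0, 0.0]), 5.0
for _ in range(50):
    x = prox_linear_step(x, mu)
print(x, np.linalg.norm(c(x)))  # residual ||c(x)|| is driven toward 0
```

The prox-gradient G(x) = μ(x − x⁺) is then μ times the step just taken, so a small step certifies near-stationarity.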
Prox-Linear algorithm

Convergence rate: ‖G(xk)‖² < ε after O(μ²/ε) iterations.

What does ‖G(xk)‖² < ε mean? There exist points uk with

    dist(0, ∂F(uk)) ≤ 5‖G(xk)‖   and   ‖uk − xk‖ ≈ ‖G(xk)‖.

Pf: Ekeland’s variational principle (Lewis-Drusvyatskiy ’16).

For nonconvex problems, the rate "‖G(xk)‖² < ε after O(μ²/ε) iterations" is “essentially” the best possible.

6 / 14
Solving the Sub-problem: Inexact Prox-Linear

The prox-linear method requires solving

    min_y φ(y, x) := h(c(x) + ∇c(x)(y − x)) + (μ/2)‖y − x‖² + g(y)

Suppose we cannot solve it exactly, but can find an ε-approximate optimal solution x⁺:

    φ(x⁺, x) ≤ min_y φ(y, x) + ε.

Question: How accurately do we need to solve the subproblem to guarantee the same overall rate for the prox-linear method?

7 / 14
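The ε-approximate condition is easy to check on a subproblem whose exact minimizer is available. The sketch below takes the smooth instance h = ½‖·‖², g = 0 (an assumed choice), runs a few gradient steps on φ, and measures the resulting gap ε.

```python
import numpy as np

# phi(y) = 0.5*||r + J(y - x)||^2 + (mu/2)*||y - x||^2 for h = 0.5*||.||^2,
# g = 0 (assumed instance). Writing d = y - x, the exact minimizer is in
# closed form; a few gradient steps give an eps-approximate solution.
J = np.array([[4.0, -1.0], [1.0, 1.0]])
r, mu = np.array([4.0, 0.0]), 5.0
phi = lambda d: 0.5*np.sum((r + J @ d)**2) + 0.5*mu*np.sum(d**2)

d_star = np.linalg.solve(J.T @ J + mu*np.eye(2), -J.T @ r)   # exact minimizer
L = np.linalg.eigvalsh(J.T @ J + mu*np.eye(2)).max()         # smoothness of phi

d = np.zeros(2)
for _ in range(50):                       # inexact solve: gradient descent
    d -= (J.T @ (r + J @ d) + mu*d) / L
eps = phi(d) - phi(d_star)
print(eps)  # d is an eps-approximate solution for this (tiny) eps >= 0
```

Since φ(·, x) is μ-strongly convex, any first-order inner solver shrinks this gap linearly, which is what makes the inexact analysis on the next slide workable.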
Inexact Prox-Linear Algorithm

Want to bound G(xk) = μ(xk − x*_{k+1}), where x*_{k+1} is the true optimal point of the sub-problem.

Thm (Drusvyatskiy-P ’16): Suppose each xi+1 is an εi+1-approximate optimal solution. Then

    min_{i=1,…,k} ‖G(xi)‖² ≤ O( (μ + ∑_{i=1}^k εi) / k ).

Generalizes (Schmidt-Le Roux-Bach ’11).

Question: Design an acceleration scheme that
 1. achieves the optimal rate for convex problems,
 2. has a rate no worse than prox-gradient for nonconvex problems,
 3. detects convexity of the function.

8 / 14
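To see what the theorem demands of the tolerances, note the bound (μ + Σεi)/k tends to 0 as long as the errors decay on average; even the non-summable choice εi = 1/i suffices. A quick numeric check:

```python
# The inexact bound (mu + sum_{i<=k} eps_i)/k still vanishes for eps_i = 1/i:
# the harmonic sum grows only like log(k), so the bound behaves like log(k)/k.
mu = 2.0
for k in (10, 100, 1000, 10000):
    bound = (mu + sum(1.0/i for i in range(1, k + 1))) / k
    print(k, bound)   # decreasing toward 0
```

Summable tolerances (e.g. εi = 1/i²) recover the exact-method rate O(μ/k) up to a constant.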
Acceleration

Measuring non-convexity: write

    h ∘ c(x) = sup_w { 〈w, c(x)〉 − h*(w) }.

Fact 1: h ∘ c is convex if x ↦ 〈w, c(x)〉 is convex for all w ∈ dom h*.
Fact 2: x ↦ 〈w, c(x)〉 + (μ/2)‖x‖² is convex for all w ∈ dom h*.

Defn: ρ ∈ [0, 1] is a parameter such that

    x ↦ 〈w, c(x)〉 + ρ · (μ/2)‖x‖²  is convex for all w ∈ dom h*.

9 / 14
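The parameter ρ can be estimated numerically on a toy instance. Below, every problem-data choice is assumed for illustration: c(x) = (x², cos x), h(z) = max(z₁, z₂) (so dom h* is the unit simplex and L = 1), and μ = 3. The worst curvature of x ↦ 〈w, c(x)〉 over the simplex determines the smallest ρ.

```python
import numpy as np

# Assumed toy instance: c(x) = (x^2, cos(x)), h = max => dom h* = simplex.
# d^2/dx^2 <w, c(x)> = 2*w1 - w2*cos(x), which is at worst -1 (w = (0,1),
# x = 0), so adding (rho*mu/2)*x^2 with rho*mu >= 1 restores convexity.
mu = 3.0                                  # assumed value of mu = L*beta
ws = np.linspace(0, 1, 101)               # w = (w1, 1 - w1) on the simplex
xs = np.linspace(-5, 5, 1001)
W, X = np.meshgrid(ws, xs)
worst = (2*W - (1 - W)*np.cos(X)).min()   # most negative curvature
rho = max(0.0, -worst/mu)
print(worst, rho)                         # worst = -1.0, rho = 1/3 here
```

Here ρ = 1/3 < 1 quantifies that the composition is "only mildly" nonconvex, which is exactly what the accelerated rate on the next slide exploits.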
Acceleration

Algorithm 1: Accelerated prox-linear method
Initialize: fix two points x0, v0 ∈ dom g.
 1  while ‖G(yk−1)‖ > ε do
 2      ak ← 2/(k + 1)
 3      yk ← ak·vk−1 + (1 − ak)·xk−1
 4      xk ← yk⁺
 5      vk ← argmin_z g(z) + (1/ak)·h(c(yk) + ak∇c(yk)(z − vk−1)) + (ak/2t)‖z − vk−1‖²
 6      k ← k + 1
 7  end

Thm (Drusvyatskiy-P ’16):

    min_{i=1,…,k} ‖G(xi)‖² ≤ O(μ²/k³) + ρ · O(μ²R²/k),   where R = diam(dom g).

Generalizes (Ghadimi-Lan ’16) for additive composite.

10 / 14
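Algorithm 1 can be sketched end-to-end on a convex instance where both subproblems are available in closed form: take c(x) = Ax − b affine, h = ½‖·‖², g = 0 (so ρ = 0 and F(x) = ½‖Ax − b‖²). All problem data below are assumed for illustration; for affine c the linearization is exact, so each step is a linear solve.

```python
import numpy as np

# Algorithm 1 on a convex instance: c(x) = Ax - b, h = 0.5*||.||^2, g = 0,
# with t = 1/mu. Both the x-step and the v-step reduce to linear solves.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
F = lambda x: 0.5*np.sum((A @ x - b)**2)
H = A.T @ A
mu = np.linalg.eigvalsh(H).max()          # smoothness constant of F

x = np.zeros(5)
v = np.zeros(5)
for k in range(1, 201):
    a = 2.0/(k + 1)
    y = a*v + (1 - a)*x
    # x-step (line 4): argmin_z 0.5*||Az - b||^2 + (mu/2)*||z - y||^2
    x = np.linalg.solve(H + mu*np.eye(5), A.T @ b + mu*y)
    # v-step (line 5): argmin_z (1/a)*0.5*||c(y) + a*A(z - v)||^2
    #                           + (a*mu/2)*||z - v||^2
    v = v - np.linalg.solve(H + mu*np.eye(5), A.T @ (A @ y - b))/a

F_star = F(np.linalg.lstsq(A, b, rcond=None)[0])
print(F(x) - F_star)   # small gap: the convex case enjoys the fast rate
```

Since ρ = 0 here, only the O(μ²/k³) term of the theorem is active, which is the optimal-rate behavior acceleration is designed to recover.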
Inexact Accelerated Prox-Linear

Two sub-problems to solve:

xk is an εk-approximate optimal solution of

    min_z g(z) + h(c(yk) + ∇c(yk)(z − yk)) + (1/2t)‖z − yk‖²

vk is a δk-approximate optimal solution of

    min_z g(z) + (1/ak)·h(c(yk) + ak∇c(yk)(z − vk−1)) + (ak/2t)‖z − vk−1‖²

Near-stationarity: there exist points uk with

    dist(0, ∂F(uk)) ≤ C(‖xk − yk‖ + √εk),   ‖uk − xk‖ ≈ ‖xk − yk‖ + √εk.

Thm (Drusvyatskiy-P ’16):

    min_{i=1,…,k} { ‖xi − yi‖² + εi } ≤ ρ · O(μ²R²/k) + O(μ²/k³)
        + (1/k³) ∑_{i=1}^k [ O(i²εi) + O(i²δi) + O(i√δi) ],

where R = diam(dom g).

Need: εi ∼ 1/i^{3+r} and δi ∼ 1/i^{4+r}.

11 / 14
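The stated tolerance schedules are exactly what keeps the error sum in the theorem bounded: with εi ∼ i^{−(3+r)} and δi ∼ i^{−(4+r)}, each of the weighted series i²εi, i²δi, i√δi has exponent greater than 1, so their partial sums converge and the k^{−3} rate survives. A quick numeric check:

```python
# Tolerance schedules from the slide: eps_i ~ i^-(3+r), delta_i ~ i^-(4+r).
# The weighted error terms then form convergent series (exponents 1.5, 2.5,
# 1.25 for r = 0.5), so the sum in the theorem stays bounded as k grows.
r = 0.5
eps = lambda i: i**-(3 + r)
delta = lambda i: i**-(4 + r)
for k in (10, 100, 1000):
    s = sum(i**2*eps(i) + i**2*delta(i) + i*delta(i)**0.5
            for i in range(1, k + 1))
    print(k, s)   # partial sums level off instead of growing
```

Any r > 0 works; smaller r means looser inner tolerances at the price of a larger bounded constant.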
Thank you!
12 / 14
References

Drusvyatskiy, D. and Kempton, C. (2016). An accelerated algorithm for minimizing convex compositions. Preprint arXiv:1605.00125.
Drusvyatskiy, D. and Lewis, A. (2016). Error bounds, quadratic growth, and linear convergence of proximal methods. Preprint arXiv:1602.06661.
Ghadimi, S. and Lan, G. (2016). Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program., 156(1-2, Ser. A):59–99.
Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Springer.
13 / 14

References (cont.)

Schmidt, M., Le Roux, N., and Bach, F. (2011). Convergence rates of inexact proximal-gradient methods for convex optimization. Advances in Neural Information Processing Systems.
14 / 14