Proximal methods for minimizing convex compositions

Courtney Paquette (joint work with D. Drusvyatskiy)
Department of Mathematics, University of Washington (Seattle)
WCOM Fall 2016, October 1, 2016


Page 2

A function f is α-convex and β-smooth if

  q_x ≤ f ≤ Q_x,

where

  q_x(y) = f(x) + ⟨∇f(x), y − x⟩ + (α/2)‖y − x‖²
  Q_x(y) = f(x) + ⟨∇f(x), y − x⟩ + (β/2)‖y − x‖²

Condition number: κ = β/α


Page 5

Complexity of first-order methods

Gradient descent: x_{k+1} = x_k − (1/β)∇f(x_k)

Majorization view: x_{k+1} = argmin_y Q_{x_k}(y)

                     β-smooth    α-convex
  Gradient descent   β/ε         κ · log(1/ε)
  Optimal methods    √(β/ε)      √κ · log(1/ε)

Table: Iterations until f(x_k) − f* < ε

(Nesterov ’83, Yudin-Nemirovsky ’83)
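As a quick numerical illustration (not from the slides), the fixed-step scheme x_{k+1} = x_k − (1/β)∇f(x_k) can be run on a least-squares objective; the matrix A, vector b, and iteration count below are arbitrary choices:

```python
import numpy as np

# Gradient descent with the fixed step 1/beta on the least-squares
# objective f(x) = 0.5*||Ax - b||^2, a stand-in for a beta-smooth
# convex f; here beta is the largest eigenvalue of A^T A.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)

def f(x):
    return 0.5 * np.linalg.norm(A @ x - b) ** 2

def grad_f(x):
    return A.T @ (A @ x - b)

beta = np.linalg.eigvalsh(A.T @ A).max()  # smoothness constant
x = np.zeros(10)
for _ in range(2000):
    x = x - grad_f(x) / beta              # x_{k+1} = x_k - (1/beta) grad f(x_k)

x_star = np.linalg.lstsq(A, b, rcond=None)[0]  # exact minimizer for comparison
gap = f(x) - f(x_star)
print(gap)  # small
```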

Page 6

General set-up

The convex-composite problem is

  min_x F(x) := h(c(x)) + g(x)

where
- c : Rⁿ → Rᵐ is C¹-smooth with β-Lipschitz Jacobian,
- h : Rᵐ → R is closed, convex, and L-Lipschitz,
- g : Rⁿ → R ∪ {∞} is convex.

For convenience, set µ := Lβ.

Examples:
- Additive composite minimization: min_x c(x) + g(x)
- Nonlinear least squares: min_x {‖c(x)‖ : ℓ_i ≤ x_i ≤ u_i, i = 1, …, n}
- Exact penalty subproblem: min_x g(x) + dist_K(c(x))
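A minimal sketch of the composite objective F(x) = h(c(x)) + g(x), instantiating the nonlinear-least-squares example with h the Euclidean norm and g the indicator of a box; the map c and the bounds below are hypothetical choices, not from the talk:

```python
import numpy as np

# F(x) = h(c(x)) + g(x) with h = Euclidean norm and g the indicator
# of the box [l, u]^n; c is an arbitrary smooth example map.
l, u = -1.0, 1.0

def c(x):                       # smooth map R^2 -> R^2
    return np.array([x[0] ** 2 - x[1], np.sin(x[1])])

def h(r):                       # nonsmooth convex outer function
    return np.linalg.norm(r)

def g(x):                       # indicator of the box constraint
    return 0.0 if np.all((l <= x) & (x <= u)) else np.inf

def F(x):
    return h(c(x)) + g(x)

print(F(np.array([0.5, 0.25])))   # finite: the point is feasible
print(F(np.array([2.0, 0.0])))    # inf: the point violates the box
```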

Page 8

Prox-Linear algorithm: base case

Seek points x which are first-order stationary: F′(x; v) ≥ 0 for all v. Equivalently,

  0 ∈ ∂g(x) + ∇c(x)* ∂h(c(x))

Idea: Majorization

  F(y) ≤ h(c(x) + ∇c(x)(y − x)) + (µ/2)‖y − x‖² + g(y)   for all y

Prox-linear mapping:

  x⁺ := argmin_y {h(c(x) + ∇c(x)(y − x)) + (µ/2)‖y − x‖² + g(y)}

Prox-linear method: x_{k+1} = x_k⁺

(Burke ’85, ’91, Fletcher ’82, Powell ’84, Wright ’90, Yuan ’83)
E.g.: proximal gradient, Levenberg-Marquardt

The prox-gradient: G(x) = µ(x − x⁺)
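The slide names Levenberg-Marquardt as an instance; here is a sketch of the prox-linear iteration in that special case (h = ½‖·‖², g = 0), where the subproblem has a closed-form solution. The residual map c, the starting point, and µ = 1 are illustrative assumptions:

```python
import numpy as np

# Prox-linear step specialized to h = 0.5*||.||^2 and g = 0, i.e. the
# Levenberg-Marquardt instance. c and mu are illustrative choices.
mu = 1.0

def c(x):                     # smooth residual map R^2 -> R^2
    return np.array([x[0] ** 2 - x[1], x[0] + x[1] - 2.0])

def jac(x):                   # its Jacobian
    return np.array([[2 * x[0], -1.0], [1.0, 1.0]])

def prox_linear_step(x):
    # x+ = argmin_y 0.5*||c(x) + J(x)(y - x)||^2 + (mu/2)*||y - x||^2,
    # which has the closed form d = -(J^T J + mu I)^{-1} J^T c(x).
    J = jac(x)
    d = -np.linalg.solve(J.T @ J + mu * np.eye(2), J.T @ c(x))
    return x + d

x = np.array([0.0, 0.0])
for _ in range(100):
    x_plus = prox_linear_step(x)
    G = mu * (x - x_plus)     # prox-gradient G(x) = mu*(x - x+)
    x = x_plus
print(np.linalg.norm(c(x)))   # residual near 0: iterates approach (1, 1)
```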

Page 11

Prox-Linear algorithm

Convergence rate: ‖G(x_k)‖² < ε after O(µ²/ε) iterations.

What does ‖G(x_k)‖² < ε mean?

  dist(0, ∂F(u_k)) ≤ 5‖G(x_k)‖, with ‖u_k − x_k‖ ≈ ‖G(x_k)‖
  Pf: Ekeland’s variational principle (Lewis-Drusvyatskiy ’16)

For nonconvex problems, the rate of O(µ²/ε) iterations until ‖G(x_k)‖² < ε is “essentially” the best possible.

Page 14

Solving the sub-problem: inexact Prox-Linear

The prox-linear method requires solving

  min_y ϕ(y, x) := h(c(x) + ∇c(x)(y − x)) + (µ/2)‖y − x‖² + g(y)

Suppose we cannot solve it exactly:

  ϕ(x⁺, x) ≤ min_y ϕ(y, x) + ε,

where x⁺ is an ε-approximate optimal solution.

Question: How accurately do we need to solve the subproblem to guarantee the same overall rate for the prox-linear method?
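To make the ε-approximate notion concrete, here is a toy instance (h = ½‖·‖², g = 0, so ϕ(·, x) is a strongly convex quadratic) solved inexactly by a few inner gradient steps; the data cx, J, and the inner iteration count are arbitrary choices:

```python
import numpy as np

# The subproblem phi(., x) is strongly convex. Here we minimize an
# illustrative quadratic instance inexactly by inner gradient steps
# and measure the suboptimality eps that the inexact analysis uses.
mu = 1.0
x = np.array([0.0, 0.0])
cx = np.array([1.0, -2.0])               # c(x)
J = np.array([[2.0, -1.0], [1.0, 1.0]])  # Jacobian of c at x

def phi(y):                              # phi(y, x) for h = 0.5*||.||^2, g = 0
    r = cx + J @ (y - x)
    return 0.5 * r @ r + 0.5 * mu * (y - x) @ (y - x)

H = J.T @ J + mu * np.eye(2)             # Hessian of phi(., x)
y_star = x - np.linalg.solve(H, J.T @ cx)  # exact minimizer, for reference
step = 1.0 / np.linalg.eigvalsh(H).max()

y = x.copy()
for _ in range(10):                      # a few inner gradient steps
    grad = J.T @ (cx + J @ (y - x)) + mu * (y - x)
    y = y - step * grad

eps = phi(y) - phi(y_star)               # achieved suboptimality
print(eps)
```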

Page 16

Inexact Prox-Linear algorithm

Want to bound G(x_k) = µ(x_k − x*_{k+1}), where x*_{k+1} is the true optimal point of the sub-problem.

Thm (Drusvyatskiy-P ’16): Suppose x_{i+1} is an ε_{i+1}-approximate optimal solution. Then

  min_{i=1,…,k} ‖G(x_i)‖² ≤ O((µ + Σ_{i=1}^k ε_i) / k).

Generalizes (Schmidt-Le Roux-Bach ’11).

Question: Design an acceleration scheme that
1. achieves the optimal rate for convex problems,
2. has a rate no worse than the prox-gradient for nonconvex problems,
3. detects convexity of the function.

Page 19

Acceleration

Measuring non-convexity:

  h ∘ c(x) = sup_w {⟨w, c(x)⟩ − h*(w)}

Fact 1: h ∘ c is convex if x ↦ ⟨w, c(x)⟩ is convex for all w ∈ dom h*.
Fact 2: x ↦ ⟨w, c(x)⟩ + (µ/2)‖x‖² is convex for all w ∈ dom h*.

Defn: Let ρ ∈ [0, 1] be a parameter such that x ↦ ⟨w, c(x)⟩ + ρ·(µ/2)‖x‖² is convex for all w ∈ dom h*.
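The dual representation can be sanity-checked numerically for h = ‖·‖₂, whose conjugate h* is the indicator of the unit ball, so h(c(x)) = sup_{‖w‖ ≤ 1} ⟨w, c(x)⟩; the map c, the point x, and the sampling scheme below are illustrative:

```python
import numpy as np

# Numerical check of h(c(x)) = sup_w {<w, c(x)> - h*(w)} for
# h = ||.||_2: h* is the indicator of the unit ball, so the sup runs
# over ||w|| <= 1 and is approximated by sampling many such w.
rng = np.random.default_rng(1)

def c(x):
    return np.array([x[0] ** 2, x[0] * x[1], x[1]])

x = np.array([0.7, -0.3])
v = c(x)

# Sample directions, project any w with ||w|| > 1 back to the sphere.
w = rng.standard_normal((20000, 3))
w /= np.maximum(1.0, np.linalg.norm(w, axis=1, keepdims=True))
approx_sup = (w @ v).max()

print(np.linalg.norm(v), approx_sup)  # the two values nearly agree
```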

Page 21

Acceleration

Algorithm 1: Accelerated prox-linear method
Initialize: fix two points x_0, v_0 ∈ dom g.
1  while ‖G(y_{k−1})‖ > ε do
2      a_k ← 2/(k+1)
3      y_k ← a_k v_{k−1} + (1 − a_k) x_{k−1}
4      x_k ← y_k⁺
5      v_k ← argmin_z g(z) + (1/a_k)·h(c(y_k) + a_k ∇c(y_k)(z − v_{k−1})) + (a_k/2t)‖z − v_{k−1}‖²
6      k ← k + 1
7  end

Thm (Drusvyatskiy-P ’16):

  min_{i=1,…,k} ‖G(x_i)‖² ≤ O(µ²/k³) + ρ · O(µ²R²/k),   where R = diam(dom g)

Generalizes (Ghadimi-Lan ’16) for additive composite problems.
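A sketch of Algorithm 1 in the additive-composite special case (h the identity and g = 0, so c is just a smooth convex function and steps 4-5 have closed forms); taking t = 1/β and the quadratic test function below are assumptions, not part of the talk:

```python
import numpy as np

# Algorithm 1 with h = identity, g = 0: step 4 becomes a gradient step
# x_k = y_k - (1/beta) grad f(y_k), and step 5 reduces to
# v_k = v_{k-1} - (t/a_k) grad f(y_k). We take t = 1/beta.
rng = np.random.default_rng(2)
M = rng.standard_normal((20, 5))
A = M.T @ M + np.eye(5)              # positive definite test Hessian
b = rng.standard_normal(5)

def f(z):                            # f(z) = 0.5*z^T A z - b^T z
    return 0.5 * z @ A @ z - b @ z

def grad_f(z):
    return A @ z - b

beta = np.linalg.eigvalsh(A).max()
t = 1.0 / beta

x = np.zeros(5)
v = np.zeros(5)
for k in range(1, 2001):
    a = 2.0 / (k + 1)
    y = a * v + (1 - a) * x          # step 3
    gy = grad_f(y)
    x = y - gy / beta                # step 4: x_k = y_k^+
    v = v - (t / a) * gy             # step 5, closed form since g = 0

x_star = np.linalg.solve(A, b)
gap = f(x) - f(x_star)
print(gap)  # small
```

With these substitutions the scheme is a standard accelerated gradient variant, so the function gap on this convex instance decays at the O(1/k²) rate.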

Page 22

Inexact Accelerated Prox-Linear

Two sub-problems to solve:

x_k is an ε_k-approximate optimal solution of

  min_z g(z) + h(c(y_k) + ∇c(y_k)(z − y_k)) + (1/2t)‖z − y_k‖²

v_k is a δ_k-approximate optimal solution of

  min_z g(z) + (1/a_k)·h(c(y_k) + a_k ∇c(y_k)(z − v_{k−1})) + (a_k/2t)‖z − v_{k−1}‖²

Then

  dist(0, ∂F(u_k)) ≤ C(‖x_k − y_k‖ + √ε_k),   with ‖u_k − x_k‖ ≈ ‖x_k − y_k‖ + √ε_k

Thm (Drusvyatskiy-P ’16):

  min_{i=1,…,k} {‖x_i − y_i‖² + ε_i} ≤ ρ · O(µ²R²/k) + O(µ²/k³) + (1/k³) Σ_{i=1}^k [O(i²ε_i) + O(i²δ_i) + O(i√δ_i)],

where R = diam(dom g). Need: ε_i ~ 1/i^{3+r} and δ_i ~ 1/i^{4+r}.

Page 25

Thank you!

Page 26

References

Drusvyatskiy, D. and Kempton, C. (2016). An accelerated algorithm for minimizing convex compositions. Preprint, arXiv:1605.00125.

Drusvyatskiy, D. and Lewis, A. (2016). Error bounds, quadratic growth, and linear convergence of proximal methods. Preprint, arXiv:1602.06661.

Ghadimi, S. and Lan, G. (2016). Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program., 156(1-2, Ser. A):59–99.

Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Springer.

Page 27

References (cont.)

Schmidt, M., Le Roux, N., and Bach, F. (2011). Convergence rates of inexact proximal-gradient methods for convex optimization. Advances in Neural Information Processing Systems.
