
Proximal methods for minimizing convex compositions

Courtney Paquette, joint work with D. Drusvyatskiy

Department of Mathematics, University of Washington (Seattle)

WCOM Fall 2016, October 1, 2016

A function f is α-convex and β-smooth if

q_x ≤ f ≤ Q_x for every point x,

where

q_x(y) = f(x) + ⟨∇f(x), y − x⟩ + (α/2)‖y − x‖²
Q_x(y) = f(x) + ⟨∇f(x), y − x⟩ + (β/2)‖y − x‖²

Condition number: κ = β/α
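
As a quick check of the definition (a minimal sketch; the particular quadratic f and the numpy-based computation are illustrative, not from the slides), for f(x) = ½xᵀAx with A symmetric positive definite one can take α = λ_min(A) and β = λ_max(A), so κ = λ_max(A)/λ_min(A):

```python
import numpy as np

# Illustrative quadratic f(x) = 0.5 * x^T A x with A symmetric positive definite.
A = np.array([[4.0, 1.0],
              [1.0, 2.0]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

eigs = np.linalg.eigvalsh(A)
alpha, beta = eigs.min(), eigs.max()   # strong-convexity and smoothness constants
kappa = beta / alpha                   # condition number

# Verify q_x(y) <= f(y) <= Q_x(y) at a random pair of points.
rng = np.random.default_rng(0)
x, y = rng.standard_normal(2), rng.standard_normal(2)
lin = f(x) + grad(x) @ (y - x)
q = lin + 0.5 * alpha * np.linalg.norm(y - x) ** 2
Q = lin + 0.5 * beta * np.linalg.norm(y - x) ** 2
assert q <= f(y) <= Q
print(f"alpha={alpha:.3f}, beta={beta:.3f}, kappa={kappa:.3f}")
```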

Complexity of first-order methods

Gradient descent: x_{k+1} = x_k − (1/β)∇f(x_k)

Majorization view: x_{k+1} = argmin_y Q_{x_k}(y)

Table: Iterations until f(x_k) − f* < ε

                     β-smooth      α-convex
Gradient descent     β/ε           κ · log(1/ε)
Optimal methods      √(β/ε)        √κ · log(1/ε)

(Nesterov '83, Yudin-Nemirovsky '83)
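
A minimal gradient-descent sketch for the β-smooth case (the quadratic objective and iteration count below are illustrative assumptions, not from the slides); each step x_{k+1} = x_k − (1/β)∇f(x_k) is exactly the minimizer of the upper model Q_{x_k}:

```python
import numpy as np

def gradient_descent(grad, x0, beta, n_iters=100):
    """Gradient descent with the 1/beta step implied by the majorization view:
    x_{k+1} = argmin_y Q_{x_k}(y) = x_k - (1/beta) * grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - grad(x) / beta
    return x

# Illustrative use on the quadratic f(x) = 0.5 * x^T A x from above.
A = np.array([[4.0, 1.0], [1.0, 2.0]])
beta = np.linalg.eigvalsh(A).max()          # smoothness constant
x_star = gradient_descent(lambda x: A @ x, x0=np.ones(2), beta=beta)
print(np.linalg.norm(x_star))               # close to 0, the minimizer of f
```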

General set-up

The convex-composite problem is

min_x F(x) := h(c(x)) + g(x)

where
- c : R^n → R^m is C^1-smooth with β-Lipschitz Jacobian,
- h : R^m → R is closed, convex, and L-Lipschitz,
- g : R^n → R ∪ {∞} is convex.

For convenience, set µ := Lβ.

Examples (a small sketch of the nonlinear least-squares instance follows below):
- Additive composite minimization: min_x c(x) + g(x)
- Nonlinear least squares: min_x { ‖c(x)‖ : ℓ_i ≤ x_i ≤ u_i, i = 1, ..., n }
- Exact penalty subproblem: min_x g(x) + dist_K(c(x))
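
As one concrete instance (a minimal sketch; the specific residual map c, the box bounds, and the helper names are illustrative assumptions), the box-constrained nonlinear least-squares example fits the template with h = ‖·‖ (which is 1-Lipschitz), c the smooth residual map, and g the indicator of the box [ℓ, u]:

```python
import numpy as np

# Smooth residual map c: R^2 -> R^2 (illustrative choice) and its Jacobian,
# which is what the prox-linear steps later linearize with.
def c(x):
    return np.array([x[0] ** 2 + x[1] - 1.0, np.sin(x[0]) - x[1]])

def jac_c(x):
    return np.array([[2.0 * x[0], 1.0],
                     [np.cos(x[0]), -1.0]])

# Outer convex function h = Euclidean norm (1-Lipschitz) and the box indicator g.
h = np.linalg.norm
lo, hi = np.array([-1.0, -1.0]), np.array([1.0, 1.0])

def g(x):
    return 0.0 if np.all((lo <= x) & (x <= hi)) else np.inf

def F(x):
    # Convex-composite objective F(x) = h(c(x)) + g(x).
    return h(c(x)) + g(x)

print(F(np.array([0.5, 0.5])), F(np.array([2.0, 0.0])))  # finite value, then inf
```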

Prox-Linear algorithm: base case

Seek points x which are first-order stationary: F′(x; v) ≥ 0 for all v. Equivalently,

0 ∈ ∂g(x) + ∇c(x)*∂h(c(x)).

Idea: majorization. For all y,

F(y) ≤ h(c(x) + ∇c(x)(y − x)) + (µ/2)‖y − x‖² + g(y).

Prox-linear mapping:

x⁺ := argmin_y { h(c(x) + ∇c(x)(y − x)) + (µ/2)‖y − x‖² + g(y) }

Prox-linear method: x_{k+1} = x_k⁺
(Burke '85, '91, Fletcher '82, Powell '84, Wright '90, Yuan '83)
E.g.: proximal gradient, Levenberg-Marquardt.

The prox-gradient: G(x) = µ(x − x⁺)
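
A minimal sketch of the prox-linear iteration, assuming the outer function is h = ‖·‖₁ and that cvxpy is available to solve each convex subproblem (the function name, the choice of h and g = 0, and the stopping rule based on the prox-gradient are illustrative, not prescribed by the slides):

```python
import numpy as np
import cvxpy as cp

def prox_linear(c, jac_c, x0, mu, n_iters=50, tol=1e-6):
    """Prox-linear method for min_x h(c(x)) with h = || . ||_1 (g = 0 here).

    Each step minimizes the convex model
        h(c(x) + Jc(x)(y - x)) + (mu/2) * ||y - x||^2
    and stops once the prox-gradient G(x) = mu * (x - x+) is small.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        cx, Jx = c(x), jac_c(x)
        y = cp.Variable(x.size)
        model = cp.norm1(cx + Jx @ (y - x)) + (mu / 2) * cp.sum_squares(y - x)
        cp.Problem(cp.Minimize(model)).solve()
        x_plus = np.asarray(y.value).ravel()
        if mu * np.linalg.norm(x - x_plus) <= tol:   # ||G(x)|| small => near-stationary
            return x_plus
        x = x_plus
    return x
```

Adding a convex regularizer or constraint term g(y) to `model` is a one-line change, which would cover the proximal-gradient and exact-penalty examples mentioned earlier.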

Prox-Linear algorithm

Convergence rate: ‖G(x_k)‖² < ε after O(µ²/ε) iterations.

What does ‖G(x_k)‖² < ε mean? There is a nearby point u_k with

dist(0, ∂F(u_k)) ≤ 5‖G(x_k)‖   and   ‖u_k − x_k‖ ≈ ‖G(x_k)‖.

Proof: Ekeland's variational principle (Lewis-Drusvyatskiy '16).

For nonconvex problems, the rate of O(µ²/ε) iterations to reach ‖G(x_k)‖² < ε is "essentially" the best possible.

Solving the sub-problem: inexact prox-linear

The prox-linear method requires solving

min_y φ(y, x) := h(c(x) + ∇c(x)(y − x)) + (µ/2)‖y − x‖² + g(y).

Suppose we cannot solve this exactly, and instead compute an ε-approximate optimal solution x⁺:

φ(x⁺, x) ≤ min_y φ(y, x) + ε.

Question: How accurately do we need to solve the subproblem to guarantee the same overall rate for the prox-linear method?
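
Before answering, here is a minimal sketch (under the same illustrative assumptions as the earlier cvxpy code: h = ‖·‖₁, g = 0) of what an ε-approximate solution means concretely: any candidate point whose model value exceeds the optimal value of φ(·, x) by at most ε qualifies.

```python
import numpy as np
import cvxpy as cp

def phi(cx, Jx, x, mu, y_val):
    # phi(y, x) for h = || . ||_1 and g = 0, evaluated at a candidate point y.
    return np.abs(cx + Jx @ (y_val - x)).sum() + (mu / 2) * np.linalg.norm(y_val - x) ** 2

# Illustrative data for one subproblem.
rng = np.random.default_rng(1)
x = rng.standard_normal(3)
cx, Jx, mu = rng.standard_normal(4), rng.standard_normal((4, 3)), 2.0

# A high-accuracy solve gives (approximately) min_y phi(y, x).
y = cp.Variable(3)
prob = cp.Problem(cp.Minimize(cp.norm1(cx + Jx @ (y - x)) + (mu / 2) * cp.sum_squares(y - x)))
prob.solve()

# Any candidate x_plus is an eps-approximate solution with eps = phi(x_plus, x) - min phi.
x_plus = np.asarray(y.value).ravel() + 1e-3 * rng.standard_normal(3)  # stand-in for an inexact inner solve
eps = phi(cx, Jx, x, mu, x_plus) - prob.value
print(f"eps-approximate with eps ~ {max(eps, 0.0):.2e}")
```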

Inexact Prox-Linear Algorithm

Want to bound G(x_k) = µ(x_k − x*_{k+1}), where x*_{k+1} is the true optimal point of the sub-problem.

Thm (Drusvyatskiy-P. '16): Suppose each x_{i+1} is an ε_{i+1}-approximate optimal solution. Then

min_{i=1,...,k} ‖G(x_i)‖² ≤ O( (µ + Σ_{i=1}^k ε_i) / k ).

This generalizes (Schmidt-Le Roux-Bach '11); a small numeric check of what it asks of the tolerances ε_i appears after this slide.

Question: Design an acceleration scheme that
1. achieves the optimal rate for convex problems,
2. has a rate no worse than the prox-gradient method for nonconvex problems,
3. detects convexity of the function.
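
The bound above keeps the exact-method rate as long as the tolerances ε_i are summable. A minimal numeric check (the schedules ε_i = 1/i² versus ε_i = 1/i are illustrative assumptions) shows why a summable schedule preserves the O(1/k) decay while a non-summable one does not:

```python
import numpy as np

k = np.arange(1, 100001)

# Summable schedule: eps_i = 1/i^2 keeps the bound (mu + sum eps_i)/k of order 1/k.
eps_summable = 1.0 / k**2
# Non-summable schedule: eps_i = 1/i makes sum eps_i grow like log k, spoiling the rate.
eps_slow = 1.0 / k

mu = 1.0
bound_good = (mu + np.cumsum(eps_summable)) / k
bound_bad = (mu + np.cumsum(eps_slow)) / k

print(f"k=1e5: summable schedule bound = {bound_good[-1]:.2e}")   # ~ 2.6e-05, i.e. O(1/k)
print(f"k=1e5: 1/i schedule bound      = {bound_bad[-1]:.2e}")    # ~ 1.3e-04, worse by a log factor
```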

Acceleration

Measuring non-convexity: write

h ∘ c (x) = sup_w { ⟨w, c(x)⟩ − h*(w) }.

Fact 1: h ∘ c is convex if x ↦ ⟨w, c(x)⟩ is convex for all w ∈ dom h*.

Fact 2: x ↦ ⟨w, c(x)⟩ + (µ/2)‖x‖² is convex for all w ∈ dom h*.

Defn: ρ ∈ [0, 1] is a parameter such that

x ↦ ⟨w, c(x)⟩ + ρ · (µ/2)‖x‖² is convex for all w ∈ dom h*.

Acceleration

Algorithm 1: Accelerated prox-linear method
Initialize: fix two points x_0, v_0 ∈ dom g; set k = 1.
1  while ‖G(y_{k−1})‖ > ε do
2      a_k ← 2/(k + 1)
3      y_k ← a_k v_{k−1} + (1 − a_k) x_{k−1}
4      x_k ← y_k⁺
5      v_k ← argmin_z  g(z) + (1/a_k) h( c(y_k) + a_k ∇c(y_k)(z − v_{k−1}) ) + (a_k/2t) ‖z − v_{k−1}‖²
6      k ← k + 1
7  end
(A Python sketch of this loop appears after this slide.)

Thm (Drusvyatskiy-P. '16):

min_{i=1,...,k} ‖G(x_i)‖² ≤ O(µ²/k³) + ρ · O(µ²R²/k),   where R = diam(dom g).

This generalizes (Ghadimi-Lan '16) for additive composite minimization.
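
A minimal sketch of Algorithm 1 under the same illustrative assumptions as before (h = ‖·‖₁, g = 0, t = 1/µ, cvxpy for both subproblems; the function name is made up); it is meant to show the bookkeeping of steps 2-5, not to be a tuned implementation:

```python
import numpy as np
import cvxpy as cp

def accel_prox_linear(c, jac_c, x0, mu, n_iters=30):
    """Sketch of Algorithm 1 (accelerated prox-linear) for min_x h(c(x)) with
    h = || . ||_1 and g = 0, taking t = 1/mu. A fixed iteration count stands in
    for the ||G(y_{k-1})|| > eps test."""
    x = v = np.asarray(x0, dtype=float)      # start both sequences at x0 for simplicity
    t = 1.0 / mu
    for k in range(1, n_iters + 1):
        a = 2.0 / (k + 1)
        y = a * v + (1 - a) * x
        cy, Jy = c(y), jac_c(y)

        # Step 4: x_k <- y_k^+, the prox-linear step at y_k.
        z = cp.Variable(y.size)
        xk_model = cp.norm1(cy + Jy @ (z - y)) + (1 / (2 * t)) * cp.sum_squares(z - y)
        cp.Problem(cp.Minimize(xk_model)).solve()
        x = np.asarray(z.value).ravel()

        # Step 5: v_k update with the rescaled linearization around v_{k-1}.
        w = cp.Variable(y.size)
        vk_model = (1 / a) * cp.norm1(cy + a * (Jy @ (w - v))) \
                   + (a / (2 * t)) * cp.sum_squares(w - v)
        cp.Problem(cp.Minimize(vk_model)).solve()
        v = np.asarray(w.value).ravel()
    return x
```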

Inexact Accelerated Prox-Linear

Two sub-problems to solve:

x_k is an ε_k-approximate optimal solution of

min_z  g(z) + h( c(y_k) + ∇c(y_k)(z − y_k) ) + (1/2t) ‖z − y_k‖²

v_k is a δ_k-approximate optimal solution of

min_z  g(z) + (1/a_k) h( c(y_k) + a_k ∇c(y_k)(z − v_{k−1}) ) + (a_k/2t) ‖z − v_{k−1}‖²

Near-stationarity: dist(0, ∂F(u_k)) ≤ C (‖x_k − y_k‖ + √ε_k), with ‖u_k − x_k‖ ≈ ‖x_k − y_k‖ + √ε_k.

Thm (Drusvyatskiy-P. '16):

min_{i=1,...,k} { ‖x_i − y_i‖² + ε_i } ≤ ρ · O(µ²R²/k) + O(µ²/k³) + (1/k³) Σ_{i=1}^k [ O(i²ε_i) + O(i²δ_i) + O(i√δ_i) ],

where R = diam(dom g).

Need: ε_i ∼ 1/i^{3+r} and δ_i ∼ 1/i^{4+r}. (A quick numeric check of these schedules follows.)
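
A quick numeric sanity check (the exponent r = 0.1 and the horizon are illustrative assumptions) that the schedules ε_i ∼ 1/i^{3+r} and δ_i ∼ 1/i^{4+r} keep the error terms in the theorem at the O(1/k³) rate: the partial sums Σ i²ε_i, Σ i²δ_i, and Σ i√δ_i all stay bounded.

```python
import numpy as np

r = 0.1
i = np.arange(1, 10**6 + 1)
eps = 1.0 / i ** (3 + r)      # eps_i ~ 1/i^(3+r)
delta = 1.0 / i ** (4 + r)    # delta_i ~ 1/i^(4+r)

# Each partial sum converges, so (1/k^3) times it stays O(1/k^3).
print(np.sum(i**2 * eps))          # sum of i^2 * eps_i       -> finite
print(np.sum(i**2 * delta))        # sum of i^2 * delta_i     -> finite
print(np.sum(i * np.sqrt(delta)))  # sum of i * sqrt(delta_i) -> finite
```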

Thank you!


References

Drusvyatskiy, D. and Kempton, C. (2016). An accelerated algorithm for minimizing convex compositions. Preprint, arXiv:1605.00125.

Drusvyatskiy, D. and Lewis, A. (2016). Error bounds, quadratic growth, and linear convergence of proximal methods. Preprint, arXiv:1602.06661.

Ghadimi, S. and Lan, G. (2016). Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program., 156(1-2, Ser. A):59-99.

Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Springer.

Schmidt, M., Le Roux, N., and Bach, F. (2011). Convergence rates of inexact proximal-gradient methods for convex optimization. Advances in Neural Information Processing Systems.