
Proximal methods for minimizing convex compositions

Courtney Paquette, joint work with D. Drusvyatskiy

Department of Mathematics, University of Washington (Seattle)

WCOM Fall 2016, October 1, 2016

A function f is α-convex and β-smooth if

q_x ≤ f ≤ Q_x for every point x,

where

q_x(y) = f(x) + ⟨∇f(x), y − x⟩ + (α/2)‖y − x‖²
Q_x(y) = f(x) + ⟨∇f(x), y − x⟩ + (β/2)‖y − x‖²

Condition number: κ = β/α
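
As a quick check of the definition (a minimal sketch; the particular quadratic f and the numpy-based computation are illustrative, not from the slides), for f(x) = ½xᵀAx with A symmetric positive definite one can take α = λ_min(A) and β = λ_max(A), so κ = λ_max(A)/λ_min(A):

```python
import numpy as np

# Illustrative quadratic f(x) = 0.5 * x^T A x with A symmetric positive definite.
A = np.array([[4.0, 1.0],
              [1.0, 2.0]])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

eigs = np.linalg.eigvalsh(A)
alpha, beta = eigs.min(), eigs.max()   # strong-convexity and smoothness constants
kappa = beta / alpha                   # condition number

# Verify q_x(y) <= f(y) <= Q_x(y) at a random pair of points.
rng = np.random.default_rng(0)
x, y = rng.standard_normal(2), rng.standard_normal(2)
lin = f(x) + grad(x) @ (y - x)
q = lin + 0.5 * alpha * np.linalg.norm(y - x) ** 2
Q = lin + 0.5 * beta * np.linalg.norm(y - x) ** 2
assert q <= f(y) <= Q
print(f"alpha={alpha:.3f}, beta={beta:.3f}, kappa={kappa:.3f}")
```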

Complexity of first-order methods

Gradient descent: x_{k+1} = x_k − (1/β)∇f(x_k)

Majorization view: x_{k+1} = argmin_y Q_{x_k}(y)

Table: Iterations until f(x_k) − f* < ε

                     β-smooth      α-convex
Gradient descent     β/ε           κ · log(1/ε)
Optimal methods      √(β/ε)        √κ · log(1/ε)

(Nesterov '83, Yudin-Nemirovsky '83)
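
A minimal gradient-descent sketch for the β-smooth case (the quadratic objective and iteration count below are illustrative assumptions, not from the slides); each step x_{k+1} = x_k − (1/β)∇f(x_k) is exactly the minimizer of the upper model Q_{x_k}:

```python
import numpy as np

def gradient_descent(grad, x0, beta, n_iters=100):
    """Gradient descent with the 1/beta step implied by the majorization view:
    x_{k+1} = argmin_y Q_{x_k}(y) = x_k - (1/beta) * grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - grad(x) / beta
    return x

# Illustrative use on the quadratic f(x) = 0.5 * x^T A x from above.
A = np.array([[4.0, 1.0], [1.0, 2.0]])
beta = np.linalg.eigvalsh(A).max()          # smoothness constant
x_star = gradient_descent(lambda x: A @ x, x0=np.ones(2), beta=beta)
print(np.linalg.norm(x_star))               # close to 0, the minimizer of f
```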

General set-up

The convex-composite problem is

min_x F(x) := h(c(x)) + g(x)

where
- c : R^n → R^m is C^1-smooth with β-Lipschitz Jacobian,
- h : R^m → R is closed, convex, and L-Lipschitz,
- g : R^n → R ∪ {∞} is convex.

For convenience, set µ := Lβ.

Examples (a small sketch of the nonlinear least-squares instance follows below):
- Additive composite minimization: min_x c(x) + g(x)
- Nonlinear least squares: min_x { ‖c(x)‖ : ℓ_i ≤ x_i ≤ u_i, i = 1, ..., n }
- Exact penalty subproblem: min_x g(x) + dist_K(c(x))
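
As one concrete instance (a minimal sketch; the specific residual map c, the box bounds, and the helper names are illustrative assumptions), the box-constrained nonlinear least-squares example fits the template with h = ‖·‖ (which is 1-Lipschitz), c the smooth residual map, and g the indicator of the box [ℓ, u]:

```python
import numpy as np

# Smooth residual map c: R^2 -> R^2 (illustrative choice) and its Jacobian,
# which is what the prox-linear steps later linearize with.
def c(x):
    return np.array([x[0] ** 2 + x[1] - 1.0, np.sin(x[0]) - x[1]])

def jac_c(x):
    return np.array([[2.0 * x[0], 1.0],
                     [np.cos(x[0]), -1.0]])

# Outer convex function h = Euclidean norm (1-Lipschitz) and the box indicator g.
h = np.linalg.norm
lo, hi = np.array([-1.0, -1.0]), np.array([1.0, 1.0])

def g(x):
    return 0.0 if np.all((lo <= x) & (x <= hi)) else np.inf

def F(x):
    # Convex-composite objective F(x) = h(c(x)) + g(x).
    return h(c(x)) + g(x)

print(F(np.array([0.5, 0.5])), F(np.array([2.0, 0.0])))  # finite value, then inf
```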

Prox-Linear algorithm: base case

Seek points x which are first-order stationary: F′(x; v) ≥ 0 for all v. Equivalently,

0 ∈ ∂g(x) + ∇c(x)*∂h(c(x)).

Idea: majorization. For all y,

F(y) ≤ h(c(x) + ∇c(x)(y − x)) + (µ/2)‖y − x‖² + g(y).

Prox-linear mapping:

x⁺ := argmin_y { h(c(x) + ∇c(x)(y − x)) + (µ/2)‖y − x‖² + g(y) }

Prox-linear method: x_{k+1} = x_k⁺
(Burke '85, '91, Fletcher '82, Powell '84, Wright '90, Yuan '83)
E.g.: proximal gradient, Levenberg-Marquardt.

The prox-gradient: G(x) = µ(x − x⁺)
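
A minimal sketch of the prox-linear iteration, assuming the outer function is h = ‖·‖₁ and that cvxpy is available to solve each convex subproblem (the function name, the choice of h and g = 0, and the stopping rule based on the prox-gradient are illustrative, not prescribed by the slides):

```python
import numpy as np
import cvxpy as cp

def prox_linear(c, jac_c, x0, mu, n_iters=50, tol=1e-6):
    """Prox-linear method for min_x h(c(x)) with h = || . ||_1 (g = 0 here).

    Each step minimizes the convex model
        h(c(x) + Jc(x)(y - x)) + (mu/2) * ||y - x||^2
    and stops once the prox-gradient G(x) = mu * (x - x+) is small.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        cx, Jx = c(x), jac_c(x)
        y = cp.Variable(x.size)
        model = cp.norm1(cx + Jx @ (y - x)) + (mu / 2) * cp.sum_squares(y - x)
        cp.Problem(cp.Minimize(model)).solve()
        x_plus = np.asarray(y.value).ravel()
        if mu * np.linalg.norm(x - x_plus) <= tol:   # ||G(x)|| small => near-stationary
            return x_plus
        x = x_plus
    return x
```

Adding a convex regularizer or constraint term g(y) to `model` is a one-line change, which would cover the proximal-gradient and exact-penalty examples mentioned earlier.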

Prox-Linear algorithm

Convergence rate: ‖G(x_k)‖² < ε after O(µ²/ε) iterations.

What does ‖G(x_k)‖² < ε mean? There is a nearby point u_k with

dist(0, ∂F(u_k)) ≤ 5‖G(x_k)‖   and   ‖u_k − x_k‖ ≈ ‖G(x_k)‖.

Proof: Ekeland's variational principle (Lewis-Drusvyatskiy '16).

For nonconvex problems, the rate of O(µ²/ε) iterations to reach ‖G(x_k)‖² < ε is "essentially" the best possible.

Solving the sub-problem: inexact prox-linear

The prox-linear method requires solving

min_y φ(y, x) := h(c(x) + ∇c(x)(y − x)) + (µ/2)‖y − x‖² + g(y).

Suppose we cannot solve this exactly, and instead compute an ε-approximate optimal solution x⁺:

φ(x⁺, x) ≤ min_y φ(y, x) + ε.

Question: How accurately do we need to solve the subproblem to guarantee the same overall rate for the prox-linear method?
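
Before answering, here is a minimal sketch (under the same illustrative assumptions as the earlier cvxpy code: h = ‖·‖₁, g = 0) of what an ε-approximate solution means concretely: any candidate point whose model value exceeds the optimal value of φ(·, x) by at most ε qualifies.

```python
import numpy as np
import cvxpy as cp

def phi(cx, Jx, x, mu, y_val):
    # phi(y, x) for h = || . ||_1 and g = 0, evaluated at a candidate point y.
    return np.abs(cx + Jx @ (y_val - x)).sum() + (mu / 2) * np.linalg.norm(y_val - x) ** 2

# Illustrative data for one subproblem.
rng = np.random.default_rng(1)
x = rng.standard_normal(3)
cx, Jx, mu = rng.standard_normal(4), rng.standard_normal((4, 3)), 2.0

# A high-accuracy solve gives (approximately) min_y phi(y, x).
y = cp.Variable(3)
prob = cp.Problem(cp.Minimize(cp.norm1(cx + Jx @ (y - x)) + (mu / 2) * cp.sum_squares(y - x)))
prob.solve()

# Any candidate x_plus is an eps-approximate solution with eps = phi(x_plus, x) - min phi.
x_plus = np.asarray(y.value).ravel() + 1e-3 * rng.standard_normal(3)  # stand-in for an inexact inner solve
eps = phi(cx, Jx, x, mu, x_plus) - prob.value
print(f"eps-approximate with eps ~ {max(eps, 0.0):.2e}")
```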

Inexact Prox-Linear Algorithm

Want to bound G(x_k) = µ(x_k − x*_{k+1}), where x*_{k+1} is the true optimal point of the sub-problem.

Thm (Drusvyatskiy-P. '16): Suppose each x_{i+1} is an ε_{i+1}-approximate optimal solution. Then

min_{i=1,...,k} ‖G(x_i)‖² ≤ O( (µ + Σ_{i=1}^k ε_i) / k ).

This generalizes (Schmidt-Le Roux-Bach '11); a small numeric check of what it asks of the tolerances ε_i appears after this slide.

Question: Design an acceleration scheme that
1. achieves the optimal rate for convex problems,
2. has a rate no worse than the prox-gradient method for nonconvex problems,
3. detects convexity of the function.
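
The bound above keeps the exact-method rate as long as the tolerances ε_i are summable. A minimal numeric check (the schedules ε_i = 1/i² versus ε_i = 1/i are illustrative assumptions) shows why a summable schedule preserves the O(1/k) decay while a non-summable one does not:

```python
import numpy as np

k = np.arange(1, 100001)

# Summable schedule: eps_i = 1/i^2 keeps the bound (mu + sum eps_i)/k of order 1/k.
eps_summable = 1.0 / k**2
# Non-summable schedule: eps_i = 1/i makes sum eps_i grow like log k, spoiling the rate.
eps_slow = 1.0 / k

mu = 1.0
bound_good = (mu + np.cumsum(eps_summable)) / k
bound_bad = (mu + np.cumsum(eps_slow)) / k

print(f"k=1e5: summable schedule bound = {bound_good[-1]:.2e}")   # ~ 2.6e-05, i.e. O(1/k)
print(f"k=1e5: 1/i schedule bound      = {bound_bad[-1]:.2e}")    # ~ 1.3e-04, worse by a log factor
```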

Acceleration

Measuring non-convexity: write

h ∘ c (x) = sup_w { ⟨w, c(x)⟩ − h*(w) }.

Fact 1: h ∘ c is convex if x ↦ ⟨w, c(x)⟩ is convex for all w ∈ dom h*.

Fact 2: x ↦ ⟨w, c(x)⟩ + (µ/2)‖x‖² is convex for all w ∈ dom h*.

Defn: ρ ∈ [0, 1] is a parameter such that

x ↦ ⟨w, c(x)⟩ + ρ · (µ/2)‖x‖² is convex for all w ∈ dom h*.

Acceleration

Algorithm 1: Accelerated prox-linear method
Initialize: fix two points x_0, v_0 ∈ dom g; set k = 1.
1  while ‖G(y_{k−1})‖ > ε do
2      a_k ← 2/(k + 1)
3      y_k ← a_k v_{k−1} + (1 − a_k) x_{k−1}
4      x_k ← y_k⁺
5      v_k ← argmin_z  g(z) + (1/a_k) h( c(y_k) + a_k ∇c(y_k)(z − v_{k−1}) ) + (a_k/2t) ‖z − v_{k−1}‖²
6      k ← k + 1
7  end
(A Python sketch of this loop appears after this slide.)

Thm (Drusvyatskiy-P. '16):

min_{i=1,...,k} ‖G(x_i)‖² ≤ O(µ²/k³) + ρ · O(µ²R²/k),   where R = diam(dom g).

This generalizes (Ghadimi-Lan '16) for additive composite minimization.
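
A minimal sketch of Algorithm 1 under the same illustrative assumptions as before (h = ‖·‖₁, g = 0, t = 1/µ, cvxpy for both subproblems; the function name is made up); it is meant to show the bookkeeping of steps 2-5, not to be a tuned implementation:

```python
import numpy as np
import cvxpy as cp

def accel_prox_linear(c, jac_c, x0, mu, n_iters=30):
    """Sketch of Algorithm 1 (accelerated prox-linear) for min_x h(c(x)) with
    h = || . ||_1 and g = 0, taking t = 1/mu. A fixed iteration count stands in
    for the ||G(y_{k-1})|| > eps test."""
    x = v = np.asarray(x0, dtype=float)      # start both sequences at x0 for simplicity
    t = 1.0 / mu
    for k in range(1, n_iters + 1):
        a = 2.0 / (k + 1)
        y = a * v + (1 - a) * x
        cy, Jy = c(y), jac_c(y)

        # Step 4: x_k <- y_k^+, the prox-linear step at y_k.
        z = cp.Variable(y.size)
        xk_model = cp.norm1(cy + Jy @ (z - y)) + (1 / (2 * t)) * cp.sum_squares(z - y)
        cp.Problem(cp.Minimize(xk_model)).solve()
        x = np.asarray(z.value).ravel()

        # Step 5: v_k update with the rescaled linearization around v_{k-1}.
        w = cp.Variable(y.size)
        vk_model = (1 / a) * cp.norm1(cy + a * (Jy @ (w - v))) \
                   + (a / (2 * t)) * cp.sum_squares(w - v)
        cp.Problem(cp.Minimize(vk_model)).solve()
        v = np.asarray(w.value).ravel()
    return x
```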

Inexact Accelerated Prox-Linear

Two sub-problems to solve:

x_k is an ε_k-approximate optimal solution of

min_z  g(z) + h( c(y_k) + ∇c(y_k)(z − y_k) ) + (1/2t) ‖z − y_k‖²

v_k is a δ_k-approximate optimal solution of

min_z  g(z) + (1/a_k) h( c(y_k) + a_k ∇c(y_k)(z − v_{k−1}) ) + (a_k/2t) ‖z − v_{k−1}‖²

Near-stationarity: dist(0, ∂F(u_k)) ≤ C (‖x_k − y_k‖ + √ε_k), with ‖u_k − x_k‖ ≈ ‖x_k − y_k‖ + √ε_k.

Thm (Drusvyatskiy-P. '16):

min_{i=1,...,k} { ‖x_i − y_i‖² + ε_i } ≤ ρ · O(µ²R²/k) + O(µ²/k³) + (1/k³) Σ_{i=1}^k [ O(i²ε_i) + O(i²δ_i) + O(i√δ_i) ],

where R = diam(dom g).

Need: ε_i ∼ 1/i^{3+r} and δ_i ∼ 1/i^{4+r}. (A quick numeric check of these schedules follows.)
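
A quick numeric sanity check (the exponent r = 0.1 and the horizon are illustrative assumptions) that the schedules ε_i ∼ 1/i^{3+r} and δ_i ∼ 1/i^{4+r} keep the error terms in the theorem at the O(1/k³) rate: the partial sums Σ i²ε_i, Σ i²δ_i, and Σ i√δ_i all stay bounded.

```python
import numpy as np

r = 0.1
i = np.arange(1, 10**6 + 1)
eps = 1.0 / i ** (3 + r)      # eps_i ~ 1/i^(3+r)
delta = 1.0 / i ** (4 + r)    # delta_i ~ 1/i^(4+r)

# Each partial sum converges, so (1/k^3) times it stays O(1/k^3).
print(np.sum(i**2 * eps))          # sum of i^2 * eps_i       -> finite
print(np.sum(i**2 * delta))        # sum of i^2 * delta_i     -> finite
print(np.sum(i * np.sqrt(delta)))  # sum of i * sqrt(delta_i) -> finite
```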

Thank you!


References

Drusvyatskiy, D. and Kempton, C. (2016). An accelerated algorithm for minimizing convex compositions. Preprint, arXiv:1605.00125.

Drusvyatskiy, D. and Lewis, A. (2016). Error bounds, quadratic growth, and linear convergence of proximal methods. Preprint, arXiv:1602.06661.

Ghadimi, S. and Lan, G. (2016). Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program., 156(1-2, Ser. A):59-99.

Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Springer.

Schmidt, M., Le Roux, N., and Bach, F. (2011). Convergence rates of inexact proximal-gradient methods for convex optimization. Advances in Neural Information Processing Systems.