
Complexity Analysis beyond Convex Optimization

Yinyu Ye

K. T. Li Professor of Engineering, Department of Management Science and Engineering

Stanford University

http://www.stanford.edu/~yyye

August 1, 2013

Yinyu Ye ICCOPT 2013

Outline

• Applications arising from Non-Convex Regularization

• Theory of the Lp-norm Regularization

• Selected Complexity Results for Non-Convex Optimization

• High-Level Complexity Analyses for a Few Cases

• Open Questions



Unconstrained L2+Lp Minimization

Consider the problem:

Minimize_{x ∈ R^n}  f_2p(x) := ‖Ax − b‖_2^2 + λ‖x‖_p^p   (1)

where data A ∈ R^{m×n}, b ∈ R^m, parameter 0 ≤ p ≤ 1, and

‖x‖_p^p = Σ_{j=1}^n |x_j|^p.

‖x‖_0 := |{j : x_j ≠ 0}|, that is, the number of nonzero entries in x.

A more general model: for q ≥ 1,

Minimize_{x ∈ R^n}  f_qp(x) := ‖Ax − b‖_q^q + λ‖x‖_p^p.
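As a concrete illustration of model (1) and its generalization, here is a minimal numerical sketch of the objective (NumPy assumed; the function name `f_qp` is my own):

```python
import numpy as np

def f_qp(x, A, b, lam, q=2.0, p=0.5):
    """Evaluate ||Ax - b||_q^q + lam * ||x||_p^p; q = 2 recovers f_2p."""
    fit = np.sum(np.abs(A @ x - b) ** q)
    reg = lam * np.sum(np.abs(x) ** p)
    return fit + reg

# A sparse vector pays a smaller Lp penalty than a dense one of equal L2 norm.
A, b = np.eye(3), np.zeros(3)
x_sparse = np.array([1.0, 0.0, 0.0])
x_dense = np.full(3, 1.0 / np.sqrt(3.0))
assert f_qp(x_sparse, A, b, lam=1.0) < f_qp(x_dense, A, b, lam=1.0)
```

The assertion illustrates why the Lp penalty with p < 1 promotes sparsity: it charges dense vectors more than sparse vectors of the same Euclidean length.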


Constrained Lp Minimization

Consider another problem:

Minimize  Σ_{1≤j≤n} x_j^p
Subject to  Ax = b, x ≥ 0,   (2)

or

Minimize  Σ_{1≤j≤n} |x_j|^p
Subject to  Ax = b.   (3)

Application and Motivation

The original goal is to minimize ‖x‖_0 = |{j : x_j ≠ 0}|, the size of the support set of x, such that Ax = b, for

• Sparse image reconstruction

• Sparse signal recovery

• Sensor network localization

which is known to be an NP-hard problem.



Approximation of ‖x‖0

• ‖x‖_1 has been used to approximate ‖x‖_0, and the regularization can be exact under certain strong conditions (Donoho 2004, Candes and Tao 2005, etc.). This regularization model is actually a linear program.

• Theoretical and empirical computational results indicate that ‖x‖_p regularization, say p = 0.5, has better performance under weaker conditions, and it is solvable equally efficiently in practice (Chartrand 2009, Xu et al. 2009, etc.).



The Hardness of Lp (Ge et al. 2011, Chen et al. 2012)

Question: is L2+Lp minimization easier than L2+L0 minimization?

Theorem. Deciding the global minimum of the Lq+Lp optimization problem is strongly NP-hard for any given 0 ≤ p < 1, q ≥ 1 and λ > 0.

Nevertheless, practitioners solve these problems using non-linear solvers to compute a KKT solution...


Recover Result: L0.5-Norm vs. L1-Norm

Figure: Successful sparse recovery rates of L0.5 and L1 solutions with Gaussian random matrices (x-axis: sparsity; y-axis: frequency of success).

Sensor Network Localization

Given a graph G = (V, E) and a set of non-negative weights {d_ij : (i, j) ∈ E}, the goal is to compute a realization of G in the Euclidean space R^d for a given low dimension d, i.e.,

• to place the vertexes of G in R^d such that

• the Euclidean distance between each pair of adjacent vertexes (i, j) ∈ E equals (or is bounded by) the prescribed weight d_ij.



Figure: 50-node 2-D Sensor Localization



Application to SNL

SNL-SDP:

minimize  Σ_{(i,j)∈E} α_ij^2
subject to  (e_i − e_j)(e_i − e_j)^T • Y = d_ij^2 + α_ij, ∀ (i, j) ∈ E,
            Y ⪰ 0.

Regularized SNL-SDP:

minimize  Σ_{(i,j)∈E} α_ij^2 + λ‖Y‖_p
subject to  (e_i − e_j)(e_i − e_j)^T • Y = d_ij^2 + α_ij, ∀ (i, j) ∈ E,
            Y ⪰ 0.



Schatten p-norm (Ji et al. 2013)

For any given symmetric matrix Y ∈ S^n,

‖Y‖_p = ( Σ_j |λ_j(Y)|^p )^{1/p},  0 < p ≤ 1,

is known as the Schatten p-quasi-norm of Y.

When p = 1, it is called the nuclear norm.

The Schatten p-quasi-norm has several nice analytical properties that make it a natural candidate for a regularizer.
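The definition above can be computed directly from the eigenvalues of Y; a minimal sketch (NumPy assumed, the name `schatten_p` is mine):

```python
import numpy as np

def schatten_p(Y, p):
    """Schatten p-quasi-norm (sum_j |lambda_j(Y)|^p)^(1/p) of a symmetric Y."""
    eigvals = np.linalg.eigvalsh(Y)  # real eigenvalues of the symmetric matrix
    return np.sum(np.abs(eigvals) ** p) ** (1.0 / p)

Y = np.diag([4.0, 1.0, 0.0])
assert np.isclose(schatten_p(Y, 1.0), 5.0)  # p = 1: the nuclear norm, 4 + 1
assert np.isclose(schatten_p(Y, 0.5), 9.0)  # (sqrt(4) + sqrt(1))^2
```

Note how p = 0.5 penalizes two moderate eigenvalues more than one large eigenvalue of the same total trace, which is the low-rank-promoting effect used in the regularized SNL-SDP.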


Recover Result: Schatten 0.5-Norm vs. Nuclear Norm

Figure: Recovered sensor positions vs. exact sensor positions (with anchors), comparing the Schatten 0.5-norm and nuclear-norm regularizations.

Figure: Exactly recovered cases (50 sensors) vs. number of extra edges added, for the Trace(Y) and Trace(Y^p) regularizations.


Theory of L2+Lp Minimization I

Theorem (The first-order bound). Let x* be any first-order KKT point and let

L_i = ( λp / (2‖a_i‖ √f_2p(x*)) )^{1/(1−p)}.

Then, for any i ∈ N, x*_i ∈ (−L_i, L_i) ⇒ x*_i = 0.

"Lower Bound Theory of Nonzero Entries in Solutions of L2-Lp Minimization" (Chen, Xu and Y), SIAM J. Scientific Computing 32:5 (2010), 2832-2852.
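One way to use the theorem numerically is to compute each L_i at a computed KKT point and set to exact zero any entry strictly inside (−L_i, L_i). A hedged sketch (NumPy assumed; `prune_kkt_point` is my own name, and a_i is taken to be the ith column of A):

```python
import numpy as np

def prune_kkt_point(x_star, A, b, lam, p):
    """Apply the first-order lower bound: any KKT-point entry with
    |x*_i| < L_i, L_i = (lam*p / (2*||a_i||*sqrt(f_2p(x*))))^(1/(1-p)),
    must be exactly zero, so numerically tiny entries can be cleaned up."""
    f_val = np.sum((A @ x_star - b) ** 2) + lam * np.sum(np.abs(x_star) ** p)
    col_norms = np.linalg.norm(A, axis=0)          # ||a_i|| per column
    L = (lam * p / (2.0 * col_norms * np.sqrt(f_val))) ** (1.0 / (1.0 - p))
    x = x_star.copy()
    x[np.abs(x) < L] = 0.0
    return x

# A numerically tiny entry falls below its bound and is set to exact zero.
A, b = np.eye(2), np.array([1.0, 0.0])
x = prune_kkt_point(np.array([1.0, 1e-12]), A, b, lam=1.0, p=0.5)
assert x[1] == 0.0 and x[0] == 1.0
```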


Theory of L2+Lp Minimization II

Theorem (The second-order bound). Let x* be any second-order KKT point and let

L_i = ( λp(1−p) / (2‖a_i‖^2) )^{1/(2−p)},  i ∈ N.

Then

(1) for any i ∈ N, x*_i ∈ (−L_i, L_i) ⇒ x*_i = 0;

(2) the columns of A corresponding to the support of x* are linearly independent.



More Theoretical Developments ...

Markowitz Portfolio Model:

minimize  (1/2) x^T Q x + c^T x
subject to  e^T x = 1, x ≥ 0

Regularized Markowitz Portfolio Model:

minimize  (1/2) x^T Q x + c^T x + λ‖x‖_p^p
subject to  e^T x = 1, x ≥ 0



Linearly Constrained Optimization Problem

(LCOP)  minimize  f(x)
        subject to  Ax = b, x ≥ 0.

The first-order KKT conditions:

x_j (∇f(x) − A^T y)_j = 0, ∀j,
Ax = b,
∇f(x) − A^T y ≥ 0, x ≥ 0.

First-order ε-KKT solution: |x_j (∇f(x) − A^T y)_j| ≤ ε for all j. Second-order ε-KKT solution: additionally, the Hessian in the null space of the active constraints is ε-positive-semidefinite.
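The first-order ε-KKT test is easy to check numerically; a minimal sketch (NumPy assumed, function name mine):

```python
import numpy as np

def eps_kkt_residual(x, y, grad_f, A):
    """Largest complementarity violation max_j |x_j * (grad f(x) - A^T y)_j|."""
    reduced = grad_f - A.T @ y   # reduced (Lagrangian) gradient in x
    return np.max(np.abs(x * reduced))

# If grad f(x) = A^T y exactly, complementarity holds and the residual is 0.
A = np.array([[1.0, 1.0]])
y = np.array([2.0])
x = np.array([0.5, 0.5])
grad_f = np.array([2.0, 2.0])
assert eps_kkt_residual(x, y, grad_f, A) == 0.0
```

A point (x, y) is then a first-order ε-KKT solution when this residual is at most ε (together with feasibility and ∇f(x) − A^T y ≥ 0).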


Iteration Bound for an ε-KKT Solution

(Columns in the original table classify the objective as Smooth, Lipschitz, or Non-Lipschitz.)

Iteration bound   References
log log(ε⁻¹)      Ball-IQP [Y 1992]
ε⁻¹ log(ε⁻¹)      IQP [Y 1998]; Constrained Lp [Ge et al. 2011]
ε⁻³ᐟ²             [Nesterov et al. 2006]; [Bian et al. 2012]; [Cartis et al. 2011]
ε⁻²               [Vavasis 1991]; [Nesterov 2004]; [Gratton et al. 2008]; [Cartis et al. 2011]; [Bian et al. 2012]
ε⁻³ log(ε⁻¹)      [Garmanjani et al. 2012]

Table: Selected worst-case complexity results for nonconvex optimization



Ball or Sphere-Constrained Indefinite QP

(BQP)  minimize  (1/2) x^T Q x + c^T x
       subject to  ‖x‖² = (≤) 1.

The solution x of problem (BQP) satisfies the following necessary and sufficient conditions (S-Lemma):

(Q + μI) x = −c,  (Q + μI) ⪰ 0,  and ‖x‖ = 1.

This is an SDP problem, and the simplest trust-region sub-problem (Moré, Sorensen, Dennis and Schnabel, etc., 1980).



The Bisection Method

For any μ > −λ(Q), where λ(Q) is the minimal eigenvalue of Q, denote by x(μ) the solution to

(Q + μI) x = −c.

Assume ‖x(−λ(Q))‖ > 1 (the other case can be handled easily) and note that

μ ≤ −λ(Q) + ‖c‖.

Thus, one can apply the bisection method to find the right μ and solve the problem in polynomial time, in log(ε⁻¹) steps.

We can do it in log-polynomial time using Steve Smale's 1986 work on Newton's method ...
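The bisection step can be sketched as follows (NumPy assumed; `ball_qp_bisection` is my own name; the sketch covers the sphere-constrained case under the slide's assumption that ‖x(μ)‖ > 1 near μ = −λ(Q)):

```python
import numpy as np

def ball_qp_bisection(Q, c, tol=1e-10):
    """Sphere-constrained indefinite QP by bisection on mu:
    find mu in (-lambda_min(Q), -lambda_min(Q) + ||c||] with ||x(mu)|| = 1,
    where (Q + mu I) x(mu) = -c and ||x(mu)|| decreases as mu grows."""
    n = Q.shape[0]
    lam_min = np.linalg.eigvalsh(Q)[0]
    lo, hi = -lam_min, -lam_min + np.linalg.norm(c)
    while hi - lo > tol:
        mu = 0.5 * (lo + hi)
        x = np.linalg.solve(Q + mu * np.eye(n), -c)
        if np.linalg.norm(x) > 1.0:
            lo = mu            # x(mu) still outside the sphere: increase mu
        else:
            hi = mu            # x(mu) inside: decrease mu
    return np.linalg.solve(Q + hi * np.eye(n), -c), hi

# Q indefinite: the global minimizer lies on the sphere, here at mu = 5.
Q = np.diag([-2.0, 1.0])
c = np.array([-3.0, 0.0])
x, mu = ball_qp_bisection(Q, c)
assert abs(mu - 5.0) < 1e-6 and abs(np.linalg.norm(x) - 1.0) < 1e-6
```

Each bisection step halves the interval of length at most ‖c‖, which is the source of the log(ε⁻¹) iteration count quoted above.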


Combined Bisection and Newton’s Method



Potential Reduction Algorithm for LCOP

Consider the (concave+convex) Karmarkar potential function

φ(x) = ρ log(f(x)) − Σ_{j=1}^n log x_j,

where we assume that f(x) is nonnegative in the feasible region. We start from the analytic center x⁰ of the feasible region F_p, so that if

φ(x^k) − φ(x⁰) ≤ ρ log ε,   (4)

then

f(x^k)/f(x⁰) ≤ ε,

which implies that x^k is an ε-global minimizer.
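A minimal numerical sketch of the potential (NumPy assumed, function name mine):

```python
import numpy as np

def karmarkar_potential(x, f_val, rho):
    """phi(x) = rho * log(f(x)) - sum_j log(x_j); defined for x > 0, f(x) > 0."""
    return rho * np.log(f_val) - np.sum(np.log(x))

# At x = (1, 1) the barrier term vanishes, so phi = rho * log(f(x)).
assert np.isclose(karmarkar_potential(np.array([1.0, 1.0]), np.e, rho=2.0), 2.0)
```

Since the analytic center x⁰ maximizes Σ_j log x_j, the barrier part of φ(x^k) − φ(x⁰) is nonnegative, so a drop of ρ log ε in φ forces ρ log(f(x^k)/f(x⁰)) ≤ ρ log ε, which is exactly implication (4).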



Quadratic Over-Estimate of Potential Function I

Consider

f(x) = q(x) = (1/2) x^T Q x + c^T x.

Given 0 < x ∈ F_p, let Δ = q(x) and let d_x, with A d_x = 0, be a vector such that x⁺ := x + d_x > 0. Then the non-convex part satisfies

ρ log(q(x⁺)) − ρ log(q(x))
  = ρ log(Δ + (1/2) d_x^T Q d_x + (Qx + c)^T d_x) − ρ log Δ
  = ρ log(1 + ((1/2) d_x^T Q d_x + (Qx + c)^T d_x)/Δ)
  ≤ (ρ/Δ) ((1/2) d_x^T Q d_x + (Qx + c)^T d_x),

using log(1 + t) ≤ t.



Quadratic Over-Estimate of Potential Function II

On the other hand, if ‖X⁻¹ d_x‖ ≤ β < 1, the convex part satisfies

− Σ_{j=1}^n log(x⁺_j) + Σ_{j=1}^n log(x_j) ≤ −e^T X⁻¹ d_x + β² / (2(1−β)).

Thus, if ‖X⁻¹ d_x‖ ≤ β < 1, then x⁺ = x + d_x > 0 and

φ(x⁺) − φ(x) ≤ (ρ/Δ) ((1/2) d_x^T Q d_x + (Qx + c − (Δ/ρ) X⁻¹ e)^T d_x) + β² / (2(1−β)).   (5)



A Ball-Constrained Quadratic Subproblem I

We solve the following problem at the kth iteration:

minimize  (1/2) d_x^T Q d_x + (Q x^k + c − (Δ^k/ρ) (X^k)⁻¹ e)^T d_x
subject to  A d_x = 0,
            ‖(X^k)⁻¹ d_x‖² ≤ β².

Using the affine scaling, this problem can be reduced to the ball-constrained quadratic program, where the radius of the ball is β.



Complexity Analysis

Each iteration either makes a constant reduction of the potential, or it does not.

In the latter case, the new iterate x⁺ becomes a second-order ε-KKT solution with a suitable choice of ρ.

Theorem. Let β = 1/3 and ρ = 3q(x⁰)/ε. Then the potential reduction algorithm returns a second-order ε-KKT solution or global minimizer in no more than O((q(x⁰)/ε) log(q(x⁰)/ε)) iterations.

This type of algorithm is called a fully polynomial-time approximation scheme.



The PRA for Concave Minimization

The case where f(x) is a concave function is even easier.

Let x⁺ = x + d_x. Then we have a linear over-estimate of the potential function:

φ(x⁺) − φ(x) ≤ ((ρ/f(x)) ∇f(x)^T − e^T X⁻¹) d_x + β² / (2(1−β)),

as long as ‖X⁻¹ d_x‖ ≤ β.

Let the affine-scaling direction be d′ = X⁻¹ d_x. Then one can solve:

z(d′) := Minimize  ((ρ/f(x)) ∇f(x)^T X − e^T) d′
         Subject to  A X d′ = 0, ‖d′‖² ≤ β².



Affine Scaling Direction

The optimal direction of the affine-scaling sub-problem is given by

d′ = (β/‖p(x)‖) · p(x),

where

p(x) = −(I − X A^T (A X² A^T)⁻¹ A X) ((ρ/f(x)) X ∇f(x) − e)
     = e − (ρ/f(x)) X (∇f(x) − A^T y).

And the minimal value of the sub-problem is

z(d′) = −β · ‖p(x)‖.
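A hedged sketch of the projection formula for p(x) and the resulting step (NumPy assumed; the function name is mine). The step has scaled length exactly β and stays in the null space of AX:

```python
import numpy as np

def affine_scaling_direction(x, grad_f, f_val, A, rho, beta):
    """Compute p(x) = -(I - X A^T (A X^2 A^T)^{-1} A X)((rho/f(x)) X grad_f - e)
    and the optimal step d' = (beta / ||p(x)||) p(x) of the sub-problem."""
    X = np.diag(x)
    g = (rho / f_val) * (X @ grad_f) - np.ones_like(x)
    AX = A @ X
    # Orthogonal projector onto the null space of A X.
    P = np.eye(len(x)) - AX.T @ np.linalg.solve(AX @ AX.T, AX)
    p = -P @ g
    d_prime = (beta / np.linalg.norm(p)) * p
    return p, d_prime

# The step is feasible (A X d' = 0) and has scaled length exactly beta.
A = np.array([[1.0, 1.0]])
x = np.array([0.5, 0.5])
p, d = affine_scaling_direction(x, grad_f=np.array([1.0, 3.0]),
                                f_val=1.0, A=A, rho=2.0, beta=0.3)
assert np.isclose(np.linalg.norm(d), 0.3)
assert np.allclose(A @ np.diag(x) @ d, 0.0)
```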



Complexity Analysis I

If ‖p(x)‖ ≥ 1, then the minimal objective value of the affine-scaling sub-problem is at most −β, so that

φ(x⁺) − φ(x) ≤ −β + β² / (2(1−β)).

Thus, the potential value is reduced by a constant for a suitable choice of β.

If this case held for O(ρ log(f(x⁰)/ε)) iterations, we would have produced an ε-global minimizer of LCOP.


Complexity Analysis II

On the other hand, if ‖p(x)‖ < 1, then from

p(x) = e − (ρ/f(x)) X (∇f(x) − A^T y),

we must have

(ρ/f(x)) X (∇f(x) − A^T y) ≥ 0

and

(ρ/f(x)) X (∇f(x) − A^T y) ≤ 2e.



Complexity Analysis III

In other words,

∇f(x) − A^T y ≥ 0

and

x_j (∇f(x) − A^T y)_j < 2f(x)/ρ, ∀j.

The first condition indicates that the Lagrange multiplier y is valid, and the second inequality implies that the complementarity condition is approximately satisfied when ρ is chosen sufficiently large.



Complexity Analyses IV

In particular, if we choose ρ ≥ 2f(x⁰)/ε, then

‖X(∇f(x) − A^T y)‖_∞ ≤ ε,

which implies that x is a first-order ε-KKT solution.

Theorem. If the objective function is concave, the algorithm provably returns a first-order ε-KKT solution of LCOP in no more than O((f(x⁰)/ε) log(f(x⁰)/ε)) iterations for any given ε < 1.



More Applications and Questions

• Could the time bound be further improved for QP or concave minimization?

• Would the O(ε⁻¹ log ε⁻¹) bound be applicable to more general non-convex optimization problems?

• We are not even able to prove the O(ε⁻¹ log ε⁻¹) bound when f(x) = q(x) + ‖x‖_p^p.

• More structural properties of the final KKT solution.

• Applications to general sparse-solution optimization, such as cardinality-constrained portfolio selection, sparse pricing reduction for revenue management, etc.
