Complexity Analysis beyond Convex Optimization - Stanford University
Complexity Analysis beyond Convex Optimization
Yinyu Ye
K. T. Li Professor of Engineering, Department of Management Science and Engineering
Stanford University
http://www.stanford.edu/~yyye
August 1, 2013
Yinyu Ye ICCOPT 2013
Outline
• Applications arising from Non-Convex Regularization
• Theory of the Lp-norm Regularization
• Selected Complexity Results for Non-Convex Optimization
• High-Level Complexity Analyses for a Few Cases
• Open Questions
Unconstrained L2+Lp Minimization

Consider the problem:

  minimize_{x∈R^n} f_2p(x) := ‖Ax − b‖_2^2 + λ‖x‖_p^p   (1)

where data A ∈ R^{m×n}, b ∈ R^m, parameter 0 ≤ p ≤ 1, and

  ‖x‖_p^p = Σ_{j=1}^n |x_j|^p.

Here ‖x‖_0 := |{j : x_j ≠ 0}|, that is, the number of nonzero entries in x.

A more general model: for q ≥ 1,

  minimize_{x∈R^n} f_qp(x) := ‖Ax − b‖_q^q + λ‖x‖_p^p.
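As a quick sanity check, the objective in (1) can be evaluated directly; a minimal numpy sketch (the helper name f2p is ours, not from the talk):

```python
import numpy as np

def f2p(x, A, b, lam, p):
    """Objective of model (1): ||Ax - b||_2^2 + lam * sum_j |x_j|^p."""
    r = A @ x - b
    return r @ r + lam * (np.abs(x) ** p).sum()
```

For p < 1 the regularizer is non-convex (and non-Lipschitz at 0), which is the source of the hardness discussed below.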
Constrained Lp Minimization

Consider another problem:

  minimize Σ_{1≤j≤n} x_j^p
  subject to Ax = b, x ≥ 0,   (2)

or

  minimize Σ_{1≤j≤n} |x_j|^p
  subject to Ax = b.   (3)
Application and Motivation
The original goal is to minimize ‖x‖_0 = |{j : x_j ≠ 0}|, the size of the support set of x, such that Ax = b, for
• Sparse image reconstruction
• Sparse signal recovery
• Sensor network localization
which is known to be an NP-hard problem.
Approximation of ‖x‖0

• ‖x‖_1 has been used to approximate ‖x‖_0, and the regularization can be exact under certain strong conditions (Donoho 2004, Candes and Tao 2005, etc.). This regularization model is actually a linear program.
• Theoretical and empirical computational results indicate that ‖x‖_p regularization, say p = 0.5, has better performance under weaker conditions, and it is solvable equally efficiently in practice (Chartrand 2009, Xu et al. 2009, etc.).
The Hardness of Lp (Ge et al. 2011, Chen et al. 2012)

Question: is L2 + Lp minimization easier than L2 + L0 minimization?

Theorem. Deciding the global minimum of the optimization problem Lq + Lp is strongly NP-hard for any given 0 ≤ p < 1, q ≥ 1 and λ > 0.

Nevertheless, practitioners solve them using non-linear solvers to compute a KKT solution...
Recover Result: L0.5-Norm vs. L1 Norm
Figure: Successful sparse recovery rates of L0.5 and L1 solutions on Gaussian random matrices (x-axis: sparsity; y-axis: frequency of success).
Sensor Network Localization
Given a graph G = (V, E) and a set of non-negative weights, say {d_ij : (i, j) ∈ E}, the goal is to compute a realization of G in the Euclidean space R^d for a given low dimension d, i.e.,
• to place the vertexes of G in R^d such that
• the Euclidean distance between each pair of adjacent vertexes (i, j) equals (or is bounded by) the prescribed weight d_ij.
Figure: 50-node 2-D Sensor Localization
Application to SNL

SNL-SDP:

  minimize Σ_{(i,j)∈E} α_ij^2
  subject to (e_i − e_j)(e_i − e_j)^T • Y = d_ij^2 + α_ij, ∀ (i,j) ∈ E,
             Y ⪰ 0.

Regularized SNL-SDP:

  minimize Σ_{(i,j)∈E} α_ij^2 + λ‖Y‖_p
  subject to (e_i − e_j)(e_i − e_j)^T • Y = d_ij^2 + α_ij, ∀ (i,j) ∈ E,
             Y ⪰ 0.
Schatten p-norm (Ji et al. 2013)

For any given symmetric matrix Y ∈ S^n,

  ‖Y‖_p = ( Σ_j |λ(Y)_j|^p )^{1/p},  0 < p ≤ 1,

is known as the Schatten p-quasi-norm of Y.

When p = 1, it is called the nuclear norm.

The Schatten p-quasi-norm has several nice analytical properties that make it a natural candidate for a regularizer.
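The definition above translates directly into a few lines of numpy; a minimal sketch (the helper name schatten_p is ours):

```python
import numpy as np

def schatten_p(Y, p):
    """Schatten p-quasi-norm: (sum_j |lambda_j(Y)|^p)^(1/p), 0 < p <= 1."""
    eigs = np.linalg.eigvalsh(Y)        # eigenvalues of the symmetric Y
    return float((np.abs(eigs) ** p).sum() ** (1.0 / p))
```

For p = 1 on a positive semidefinite Y this reduces to the nuclear norm (the trace), consistent with the remark above.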
Recover Result: Schatten 0.5-Norm vs. Nuclear Norm
Figure: Two panels of recovered vs. exact sensor positions (with anchors), comparing Schatten 0.5-norm and nuclear-norm regularization.
Figure: Exactly recovered cases (50 sensors) vs. number of extra edges added, comparing Trace(Y) and Trace(Yp) regularization.
Theory of L2+Lp Minimization I

Theorem (The first-order bound). Let x* be any first-order KKT point and let

  L_i = ( λp / (2‖a_i‖ √(f_2p(x*))) )^{1/(1−p)}.

Then for any i ∈ N, x*_i ∈ (−L_i, L_i) ⇒ x*_i = 0.

"Lower Bound Theory of Nonzero Entries in Solutions of L2-Lp Minimization" (Chen, Xu and Ye), SIAM J. Scientific Computing 32:5 (2010) 2832-2852.
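The first-order bound is easy to compute for a given point; a minimal sketch, assuming a_i denotes the i-th column of A and using the f_2p objective of model (1) (the helper name is ours):

```python
import numpy as np

def first_order_threshold(A, b, x_star, lam, p):
    """Per-component bound L_i = (lam*p / (2*||a_i||*sqrt(f2p(x*))))^(1/(1-p)).
    The theorem says any first-order KKT point with x*_i in (-L_i, L_i)
    must in fact have x*_i = 0."""
    f_val = np.linalg.norm(A @ x_star - b) ** 2 + lam * (np.abs(x_star) ** p).sum()
    col_norms = np.linalg.norm(A, axis=0)      # ||a_i|| for each column a_i
    return (lam * p / (2.0 * col_norms * np.sqrt(f_val))) ** (1.0 / (1.0 - p))
```

In practice such thresholds justify hard-zeroing small entries of a computed KKT point.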
Theory of L2+Lp Minimization II

Theorem (The second-order bound). Let x* be any second-order KKT point and let

  L_i = ( λp(1−p) / (2‖a_i‖^2) )^{1/(2−p)}, i ∈ N.

Then
(1) for any i ∈ N, x*_i ∈ (−L_i, L_i) ⇒ x*_i = 0;
(2) the support columns of x* are linearly independent.
More Theoretical Developments ...

Markowitz Portfolio Model:

  minimize (1/2) x^T Q x + c^T x
  subject to e^T x = 1, x ≥ 0

Regularized Markowitz Portfolio Model:

  minimize (1/2) x^T Q x + c^T x + λ‖x‖_p^p
  subject to e^T x = 1, x ≥ 0
Linearly Constrained Optimization Problem

  (LCOP)  minimize f(x)
          subject to Ax = b, x ≥ 0.

The first-order KKT conditions:

  x_j (∇f(x) − A^T y)_j = 0, ∀ j,
  Ax = b,
  ∇f(x) − A^T y ≥ 0, x ≥ 0.

First-order ε-KKT solution: |x_j (∇f(x) − A^T y)_j| ≤ ε for all j.
Second-order ε-KKT solution: additionally, the Hessian in the null space of the active constraints is ε-positive-semidefinite.
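The first-order ε-KKT conditions can be checked mechanically; a minimal sketch (a hypothetical helper, not from the talk, with feasibility also tested approximately):

```python
import numpy as np

def is_first_order_eps_kkt(x, y, grad, A, b, eps):
    """Check the first-order eps-KKT conditions for LCOP:
    Ax = b, x >= 0, grad - A^T y >= 0 (up to eps), and
    |x_j (grad - A^T y)_j| <= eps for all j."""
    r = grad - A.T @ y                       # reduced gradient
    primal_ok = np.allclose(A @ x, b) and bool(np.all(x >= -eps))
    dual_ok = bool(np.all(r >= -eps))
    comp_ok = bool(np.all(np.abs(x * r) <= eps))
    return primal_ok and dual_ok and comp_ok
```

Here grad stands for ∇f(x) at the candidate point and y for the Lagrange multiplier.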
Iteration Bound for an ε-KKT Solution

  Bound           Smooth                           Lipschitz                Non-Lipschitz
  log log(ε−1)    Ball-IQP [Y 1992]
  ε−1 log(ε−1)    IQP [Y 1998]                                              Constrained Lp [Ge et al 2011]
  ε−3/2           [Nesterov et al 2006]            [Cartis et al 2011]      [Bian et al 2012]
  ε−2             [Vavasis 1991], [Nesterov 2004]; [Cartis et al 2011];     [Bian et al 2012]
                  [Gratton et al 2008]             [Vavasis 1991]
  ε−3 log(ε−1)                                     [Garmanjani et al 2012]

Table: Selected worst-case complexity results for nonconvex optimization
Ball or Sphere-Constrained Indefinite QP

  (BQP)  minimize (1/2) x^T Q x + c^T x
         subject to ‖x‖^2 = (≤) 1.

The solution x of problem (BQP) satisfies the following necessary and sufficient conditions (S-Lemma):

  (Q + μI) x = −c,
  Q + μI ⪰ 0,

and ‖x‖ = 1.

This is an SDP problem, and the simplest trust-region sub-problem (Moré, Sorensen, Dennis and Schnabel, etc., 1980).
The Bisection Method

For any μ > −λ(Q), where λ(Q) is the minimal eigenvalue of Q, denote by x(μ) the solution to

  (Q + μI) x = −c.

Assume ‖x(−λ(Q))‖ > 1 (the other case can be handled easily) and note that

  μ ≤ −λ(Q) + ‖c‖.

Thus, one can apply the bisection method to find the right μ and solve the problem to ε accuracy in log(ε−1) steps, i.e., in polynomial time.

We can do it in log-polynomial time using Steve Smale's 1986 work on Newton's method ...
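A minimal numpy sketch of this bisection (our own helper name bqp_bisection; the hard case ‖x(−λ(Q))‖ ≤ 1 excluded above is not handled here):

```python
import numpy as np

def bqp_bisection(Q, c, tol=1e-10, max_iter=200):
    """Bisection on mu for the sphere-constrained QP: find mu > -lambda_min(Q)
    with ||(Q + mu I)^{-1}(-c)|| = 1, assuming ||x(-lambda_min(Q))|| > 1.
    Since ||x(mu)|| is decreasing in mu, and ||x(mu)|| <= 1 at the
    right endpoint mu = -lambda_min(Q) + ||c||, bisection applies."""
    lam_min = np.linalg.eigvalsh(Q)[0]
    lo, hi = -lam_min, -lam_min + np.linalg.norm(c)
    n = Q.shape[0]
    for _ in range(max_iter):
        mu = 0.5 * (lo + hi)                 # midpoint stays > -lambda_min
        x = np.linalg.solve(Q + mu * np.eye(n), -c)
        if np.linalg.norm(x) > 1.0:
            lo = mu                          # norm too large: increase mu
        else:
            hi = mu
        if hi - lo < tol:
            break
    return x, mu
```

At termination (Q + μI) is positive semidefinite and ‖x‖ ≈ 1, matching the S-Lemma conditions of the previous slide.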
Combined Bisection and Newton’s Method
Potential Reduction Algorithm for LCOP

Consider the (concave+convex) Karmarkar potential function

  φ(x) = ρ log(f(x)) − Σ_{j=1}^n log x_j,

where we assume that f(x) is nonnegative in the feasible region. We start from the analytic center x^0 of the feasible region F_p, so that if

  φ(x^k) − φ(x^0) ≤ ρ log ε,   (4)

then

  f(x^k)/f(x^0) ≤ ε,

which implies that x^k is an ε-global minimizer.
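The potential itself is a one-liner; a minimal sketch (helper name ours), useful for monitoring the per-iteration reduction argued below:

```python
import numpy as np

def karmarkar_potential(f_val, x, rho):
    """phi(x) = rho * log(f(x)) - sum_j log(x_j), for x > 0 and f(x) > 0."""
    return rho * np.log(f_val) - np.log(x).sum()
```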
Quadratic Over-Estimate of Potential Function I

Consider

  f(x) = q(x) = (1/2) x^T Q x + c^T x.

Given 0 < x ∈ F_p, let Δ = q(x) and let d_x, with A d_x = 0, be a vector such that x+ := x + d_x > 0. Then the non-convex part satisfies

  ρ log(q(x+)) − ρ log(q(x))
    = ρ log(Δ + (1/2) d_x^T Q d_x + (Qx + c)^T d_x) − ρ log Δ
    = ρ log(1 + ((1/2) d_x^T Q d_x + (Qx + c)^T d_x)/Δ)
    ≤ (ρ/Δ)((1/2) d_x^T Q d_x + (Qx + c)^T d_x).
Quadratic Over-Estimate of Potential Function II

On the other hand, if ‖X^{−1} d_x‖ ≤ β < 1, the convex part satisfies

  −Σ_{j=1}^n log(x+_j) + Σ_{j=1}^n log(x_j) ≤ −e^T X^{−1} d_x + β^2/(2(1−β)).

Thus, if ‖X^{−1} d_x‖ ≤ β < 1, then x+ = x + d_x > 0 and

  φ(x+) − φ(x) ≤ (ρ/Δ)((1/2) d_x^T Q d_x + (Qx + c − (Δ/ρ) X^{−1} e)^T d_x) + β^2/(2(1−β)).   (5)
A Ball-Constrained Quadratic Subproblem I

We solve the following problem at the kth iteration:

  minimize (1/2) d_x^T Q d_x + (Q x^k + c − (Δ^k/ρ)(X^k)^{−1} e)^T d_x
  subject to A d_x = 0,
             ‖(X^k)^{−1} d_x‖^2 ≤ β^2.

Using the affine scaling, this problem can be reduced to the ball-constrained quadratic program, where the radius of the ball is β.
Complexity Analysis

Each iteration either makes a constant reduction of the potential, or not. In the latter case, the new iterate x+ becomes a second-order ε-KKT solution with a suitable choice of ρ.

Theorem. Let β = 1/3 and ρ = 3 q(x^0)/ε. Then the potential reduction algorithm returns a second-order ε-KKT solution or global minimizer in no more than O((q(x^0)/ε) log(q(x^0)/ε)) iterations.

This type of algorithm is called a fully polynomial-time approximation scheme.
The PRA for Concave Minimization

The case when f(x) is a concave function is even easier.

Let x+ = x + d_x. Then we have a linear over-estimate of the potential function:

  φ(x+) − φ(x) ≤ ((ρ/f(x)) ∇f(x)^T − e^T X^{−1}) d_x + β^2/(2(1−β)),

as long as ‖X^{−1} d_x‖ ≤ β.

Let the affine scaling be d' = X^{−1} d_x. Then one can solve:

  z(d') := minimize ((ρ/f(x)) ∇f(x)^T X − e^T) d'
           subject to A X d' = 0, ‖d'‖^2 ≤ β^2.
Affine Scaling Direction

The optimal direction of the affine scaling sub-problem is given by

  d' = (β/‖p(x)‖) · p(x),

where

  p(x) = −(I − XA^T(AX^2A^T)^{−1}AX)((ρ/f(x)) X∇f(x) − e)
       = e − (ρ/f(x)) X(∇f(x) − A^T y).

And the minimal value of the sub-problem is

  z(d') = −β · ‖p(x)‖.
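The closed-form direction above can be computed with one linear solve; a minimal sketch (helper name ours; solving for the multiplier instead of forming the projection matrix explicitly):

```python
import numpy as np

def affine_scaling_direction(A, x, grad, f_val, rho, beta):
    """Compute d' = beta * p(x)/||p(x)|| for the affine-scaled subproblem,
    where p(x) = -(I - XA^T (A X^2 A^T)^{-1} A X)((rho/f(x)) X grad - e),
    and the subproblem value z(d') = -beta * ||p(x)||."""
    X = np.diag(x)
    AX = A @ X
    g = (rho / f_val) * (X @ grad) - np.ones_like(x)
    lam = np.linalg.solve(AX @ AX.T, AX @ g)   # multiplier for the projection
    p = -(g - AX.T @ lam)                      # negated projection onto null(AX)
    norm_p = np.linalg.norm(p)
    return beta * p / norm_p, -beta * norm_p
```

By construction AX d' = 0 and ‖d'‖ = β, so d_x = X d' is feasible for the trust-region subproblem.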
Complexity Analysis I

If ‖p(x)‖ ≥ 1, then the minimal objective value of the affine scaling sub-problem is at most −β, so that

  φ(x+) − φ(x) ≤ −β + β^2/(2(1−β)).

Thus, the potential value is reduced by a constant for a suitable choice of β.

If this case held for O(ρ log(f(x^0)/ε)) iterations, we would have produced an ε-global minimizer of LCOP.
Complexity Analysis II

On the other hand, if ‖p(x)‖ < 1, then from

  p(x) = e − (ρ/f(x)) X(∇f(x) − A^T y),

we must have

  (ρ/f(x)) X(∇f(x) − A^T y) ≥ 0

and

  (ρ/f(x)) X(∇f(x) − A^T y) ≤ 2e.
Complexity Analysis III

In other words,

  ∇f(x) − A^T y ≥ 0

and

  x_j (∇f(x) − A^T y)_j < 2 f(x)/ρ, ∀ j.

The first condition indicates that the Lagrange multiplier y is (dual) feasible, and the second inequality implies that the complementarity condition is approximately satisfied when ρ is chosen sufficiently large.
Complexity Analyses IV

In particular, if we choose ρ ≥ 2 f(x^0)/ε, then

  ‖X(∇f(x) − A^T y)‖_∞ ≤ ε,

which implies that x is a first-order ε-KKT solution.

Theorem. The algorithm will provably return a first-order ε-KKT solution of LCOP in no more than O((f(x^0)/ε) log(f(x^0)/ε)) iterations for any given ε < 1, if the objective function is concave.
More Applications and Questions

• Could the time bound be further improved for QP or concave minimization?
• Would the O(ε−1 log ε−1) bound be applicable to more general non-convex optimization problems?
• We are not even able to prove the O(ε−1 log ε−1) bound when f(x) = q(x) + ‖x‖_p^p.
• More structural properties of the final KKT solution.
• Applications to general sparse-solution optimization, such as cardinality-constrained portfolio selection, sparse pricing reduction for revenue management, etc.