Convex Optimization - New York University

Convex Optimization

DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis
http://www.cims.nyu.edu/~cfgranda/pages/OBDA_fall17/index.html

Carlos Fernandez-Granda

Convexity

Differentiable convex functions

Minimizing differentiable convex functions

Convex functions

A function f : Rn → R is convex if for any ~x , ~y ∈ Rn and any θ ∈ (0, 1)

θf (~x) + (1− θ) f (~y) ≥ f (θ~x + (1− θ) ~y)

A function f is concave if −f is convex

Convex functions

[Figure: a convex function; for two points ~x and ~y , the chord value θf (~x) + (1 − θ) f (~y) lies above the function value f (θ~x + (1 − θ) ~y)]

Linear functions are convex

If f is linear

f (θ~x + (1− θ) ~y) = θf (~x) + (1− θ) f (~y)

Strictly convex functions

A function f : Rn → R is strictly convex if for any ~x , ~y ∈ Rn and any θ ∈ (0, 1)

θf (~x) + (1− θ) f (~y) > f (θ~x + (1− θ) ~y)

Local minima are global

Any local minimum of a convex function is also a global minimum

Proof

Let ~xloc be a local minimum: there is a γ > 0 such that for all ~x ∈ Rn with ||~x − ~xloc||2 ≤ γ

f (~xloc) ≤ f (~x)

Assume there is a global minimum ~xglob such that

f (~xglob) < f (~xloc)

Proof

Choose θ so that ~xθ := θ~xloc + (1− θ) ~xglob satisfies

||~xθ − ~xloc||2 ≤ γ

then

f (~xloc) ≤ f (~xθ)
         = f (θ~xloc + (1− θ) ~xglob)
         ≤ θf (~xloc) + (1− θ) f (~xglob)    by convexity of f
         < f (~xloc)                         because f (~xglob) < f (~xloc)

This is a contradiction, so ~xloc must also be a global minimum

Norm

Let V be a vector space. A norm is a function ||·|| from V to R with the following properties

I It is homogeneous. For any scalar α and any ~x ∈ V

||α~x || = |α| ||~x || .

I It satisfies the triangle inequality

||~x + ~y || ≤ ||~x ||+ ||~y || .

In particular, ||~x || ≥ 0

I ||~x || = 0 implies ~x = ~0

Norms are convex

For any ~x , ~y ∈ Rn and any θ ∈ (0, 1)

||θ~x + (1− θ) ~y || ≤ ||θ~x || + ||(1− θ) ~y ||
                     = θ ||~x || + (1− θ) ||~y ||

Composition of convex and affine function

If f : Rn → R is convex, then for any A ∈ Rn×m and ~b ∈ Rn

h (~x) := f (A~x + ~b) is convex

Consequence:

f (~x) := ||A~x + ~b|| is convex for any A and ~b

Composition of convex and affine function

h (θ~x + (1− θ) ~y) = f (θ (A~x + ~b) + (1− θ) (A~y + ~b))
                    ≤ θf (A~x + ~b) + (1− θ) f (A~y + ~b)
                    = θ h (~x) + (1− θ) h (~y)

ℓ0 “norm”

Number of nonzero entries in a vector

Not a norm!

||2~x ||0 = ||~x ||0 ≠ 2 ||~x ||0

Not convex

Let ~x := (1, 0) and ~y := (0, 1); for any θ ∈ (0, 1)

||θ~x + (1− θ) ~y ||0 = 2

θ ||~x ||0 + (1− θ) ||~y ||0 = 1

Promoting sparsity

Finding sparse vectors consistent with data is often very useful

Toy problem: Find t such that

~vt := ( t , t − 1 , t − 1 )

is sparse

Strategy: Minimize

f (t) := ||~vt ||

Promoting sparsity

[Figure: ||~vt ||0, ||~vt ||1, ||~vt ||2, and ||~vt ||∞ plotted as functions of t]
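To illustrate the toy problem numerically, here is a minimal Python sketch (ours, not from the slides); the grid and all names are our own choices:

```python
import numpy as np

# Evaluate f(t) = ||v_t|| on a grid for several norms, with v_t := (t, t - 1, t - 1)
ts = np.linspace(-0.5, 1.5, 2001)
v = lambda t: np.array([t, t - 1.0, t - 1.0])

for name, norm in [("l1", lambda x: np.abs(x).sum()),
                   ("l2", np.linalg.norm),
                   ("linf", lambda x: np.abs(x).max())]:
    vals = np.array([norm(v(t)) for t in ts])
    t_star = ts[vals.argmin()]
    print(name, round(t_star, 3), np.round(v(t_star), 3))
# The l1 minimizer lands at t = 1, where v_t = (1, 0, 0) is sparse;
# the l2 and linf minimizers do not yield sparse vectors.
```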

The rank is not convex

The rank of matrices in Rn×n, interpreted as a function from Rn×n to R, is not convex

X := [ 1 0 ; 0 0 ]        Y := [ 0 0 ; 0 1 ]

For any θ ∈ (0, 1)

rank (θX + (1− θ)Y ) = 2

θ rank (X ) + (1− θ) rank (Y ) = 1
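A quick numerical check of this counterexample (our own sketch):

```python
import numpy as np

# The convex combination of two rank-1 matrices can have rank 2
X = np.array([[1.0, 0.0], [0.0, 0.0]])
Y = np.array([[0.0, 0.0], [0.0, 1.0]])
theta = 0.3
print(np.linalg.matrix_rank(theta * X + (1 - theta) * Y))                          # 2
print(theta * np.linalg.matrix_rank(X) + (1 - theta) * np.linalg.matrix_rank(Y))   # 1.0
```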

Matrix norms

Frobenius norm

||A||F := √( ∑i=1..m ∑j=1..n Aij² ) = √( ∑i=1..min{m,n} σi² )

Operator norm

||A|| := max{||~x ||2=1 | ~x∈Rn} ||A~x ||2 = σ1

Nuclear norm

||A||∗ := ∑i=1..min{m,n} σi

Promoting low-rank structure

Finding low-rank matrices consistent with data is often very useful

Toy problem: Find t such that

M (t) := [ 0.5 + t   1   1 ;   0.5   0.5   t ;   0.5   1 − t   0.5 ]

is low rank

Strategy: Minimize

f (t) := ||M (t)||

Promoting low-rank structure

[Figure: rank (M (t)), the operator norm, the Frobenius norm, and the nuclear norm of M (t) plotted as functions of t]
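As a rough numerical companion to the figure (a sketch of ours, not from the slides), one can evaluate the matrix norms of M (t) on a grid and compare the minimizers:

```python
import numpy as np

# Compare the minimizers of the matrix norms of M(t) on a grid
def M(t):
    return np.array([[0.5 + t, 1.0, 1.0],
                     [0.5, 0.5, t],
                     [0.5, 1.0 - t, 0.5]])

ts = np.linspace(-1.0, 1.5, 2501)
for name, norm in [("operator", lambda A: np.linalg.norm(A, 2)),
                   ("Frobenius", lambda A: np.linalg.norm(A, "fro")),
                   ("nuclear", lambda A: np.linalg.norm(A, "nuc"))]:
    vals = np.array([norm(M(t)) for t in ts])
    print(name, "minimizer t ~", round(ts[vals.argmin()], 3))
print("rank of M(0.5):", np.linalg.matrix_rank(M(0.5)))
# The nuclear-norm minimizer should land at (or very near) t = 0.5, the value
# at which M(t) has rank 1; the other norms' minimizers need not.
```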

Convexity

Differentiable convex functions

Minimizing differentiable convex functions

Gradient

∇f (~x) = ( ∂f (~x)/∂~x [1] , ∂f (~x)/∂~x [2] , . . . , ∂f (~x)/∂~x [n] )

If the gradient exists at every point, the function is said to be differentiable

Directional derivative

Encodes first-order rate of change in a particular direction

f ′~u (~x) := limh→0 ( f (~x + h~u) − f (~x) ) / h = 〈∇f (~x) , ~u〉

where ||~u||2 = 1

Direction of maximum variation

∇f is the direction of maximum increase

−∇f is the direction of maximum decrease

|f ′~u (~x)| = |∇f (~x)T ~u|
           ≤ ||∇f (~x)||2 ||~u||2    Cauchy-Schwarz inequality
           = ||∇f (~x)||2

Equality holds if and only if ~u = ± ∇f (~x) / ||∇f (~x)||2

First-order approximation

The first-order or linear approximation of f : Rn → R at ~x is

f 1~x (~y) := f (~x) + ∇f (~x)T (~y − ~x)

If f is continuously differentiable at ~x

lim~y→~x ( f (~y) − f 1~x (~y) ) / ||~y − ~x ||2 = 0

First-order approximation

[Figure: a function f (~y) and its first-order approximation f 1x (~y) at a point x]

Convexity

A differentiable function f : Rn → R is convex if and only if for every ~x , ~y ∈ Rn

f (~y) ≥ f (~x) + ∇f (~x)T (~y − ~x)

It is strictly convex if and only if for every ~y ≠ ~x

f (~y) > f (~x) + ∇f (~x)T (~y − ~x)

Optimality condition

If f is convex and ∇f (~x) = 0, then for any ~y ∈ Rn

f (~y) ≥ f (~x)

If f is strictly convex then for any ~y 6= ~x

f (~y) > f (~x)

Epigraph

The epigraph of f : Rn → R is

epi (f ) := { ~x ∈ Rn+1 | f (~x [1], . . . , ~x [n]) ≤ ~x [n + 1] }

Epigraph

[Figure: a function f and its epigraph epi (f ), the region above the graph]

Supporting hyperplane

A hyperplane H is a supporting hyperplane of a set S at ~x if

I H and S intersect at ~x
I S is contained in one of the half-spaces bounded by H

Geometric intuition

Geometrically, f is convex if and only if for every ~x the plane

Hf ,~x := { ~y | ~y [n + 1] = f 1~x (~y [1], . . . , ~y [n]) }

is a supporting hyperplane of the epigraph at ~x

If ∇f (~x) = 0 the hyperplane is horizontal

Convexity

[Figure: a convex function f (~y) and its first-order approximation f 1x (~y), which lies below the graph]

Hessian matrix

If f has a Hessian matrix at every point, it is twice differentiable

∇2f (~x) =

[ ∂²f (~x)/∂~x [1]²          ∂²f (~x)/∂~x [1]∂~x [2]    · · ·    ∂²f (~x)/∂~x [1]∂~x [n]
  ∂²f (~x)/∂~x [1]∂~x [2]    ∂²f (~x)/∂~x [2]²          · · ·    ∂²f (~x)/∂~x [2]∂~x [n]
  · · ·
  ∂²f (~x)/∂~x [1]∂~x [n]    ∂²f (~x)/∂~x [2]∂~x [n]    · · ·    ∂²f (~x)/∂~x [n]²      ]

Curvature

The second directional derivative f ′′~u of f at ~x equals

f ′′~u (~x) = ~uT∇2f (~x) ~u

for any unit-norm vector ~u ∈ Rn

Second-order approximation

The second-order or quadratic approximation of f at ~x is

f 2~x (~y) := f (~x) + ∇f (~x)T (~y − ~x) + (1/2) (~y − ~x)T ∇2f (~x) (~y − ~x)

Second-order approximation

[Figure: a function f (~y) and its second-order approximation f 2x (~y) at a point x]

Quadratic form

Second order polynomial in several dimensions

q (~x) := ~xTA~x + ~bT~x + c

parametrized by a symmetric matrix A ∈ Rn×n, a vector ~b ∈ Rn and a constant c

Quadratic approximation

The quadratic approximation f 2~x : Rn → R at ~x ∈ Rn of a twice-continuously differentiable function f : Rn → R satisfies

lim~y→~x ( f (~y) − f 2~x (~y) ) / ||~y − ~x ||2² = 0

Eigendecomposition of symmetric matrices

Let A = UΛUT be the eigendecomposition of a symmetric matrix A

Eigenvalues: λ1 ≥ · · · ≥ λn (which can be negative or 0)

Eigenvectors: ~u1, . . . , ~un, orthonormal basis

λ1 = max{||~x ||2=1 | ~x∈Rn} ~xTA~x        ~u1 = arg max{||~x ||2=1 | ~x∈Rn} ~xTA~x

λn = min{||~x ||2=1 | ~x∈Rn} ~xTA~x        ~un = arg min{||~x ||2=1 | ~x∈Rn} ~xTA~x

Maximum and minimum curvature

Let ∇2f (~x) = UΛUT be the eigendecomposition of the Hessian at ~x

Direction of maximum curvature: ~u1

Direction of minimum curvature (or maximum negative curvature): ~un

Positive semidefinite matrices

For any ~x

~xTA~x = ~xTUΛUT~x = ∑i=1..n λi 〈~ui , ~x〉²

All eigenvalues are nonnegative if and only if

~xTA~x ≥ 0

for all ~x

The matrix is positive semidefinite
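A small numerical check of this equivalence for one particular symmetric matrix (our own sketch):

```python
import numpy as np

# A symmetric matrix is positive semidefinite iff all eigenvalues are nonnegative,
# iff x^T A x >= 0 for every x
A = np.array([[2.0, -1.0], [-1.0, 2.0]])
print(np.linalg.eigvalsh(A))                       # eigenvalues: [1., 3.], both >= 0
x = np.random.default_rng(0).standard_normal(2)
print(x @ A @ x >= 0)                              # True for this A and any x
```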

Positive (negative) (semi)definite matrices

Positive (semi)definite: all eigenvalues are positive (nonnegative), equivalently for all ~x

~xTA~x > (≥) 0

Quadratic form: All directions have positive curvature

Negative (semi)definite: all eigenvalues are negative (nonpositive), equivalently for all ~x

~xTA~x < (≤) 0

Quadratic form: All directions have negative curvature

Convexity

A twice-differentiable function g : R→ R is convex if and only if

g ′′ (x) ≥ 0

for all x ∈ R

A twice-differentiable function in Rn is convex if and only if its Hessian is positive semidefinite at every point

If the Hessian is positive definite, the function is strictly convex

Second-order approximation

[Figure: second-order approximations f 2x (~y) at points where the function is convex, concave, and neither]

Convexity

Differentiable convex functions

Minimizing differentiable convex functions

Problem

Challenge: Minimizing differentiable convex functions

min~x∈Rn f (~x)

Gradient descent

Intuition: Make local progress in the steepest direction −∇f (~x)

Set the initial point ~x (0) to an arbitrary value

Update by setting

~x (k+1) := ~x (k) − αk ∇f (~x (k))

where αk > 0 is the step size, until a stopping criterion is met
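A minimal Python sketch of these updates (function and variable names are ours; a fixed step size and a gradient-norm stopping criterion are assumed):

```python
import numpy as np

def gradient_descent(grad_f, x0, step_size=0.1, max_iters=1000, tol=1e-8):
    """Minimize a differentiable convex function given its gradient grad_f."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:    # stopping criterion: small gradient
            break
        x = x - step_size * g          # step in the steepest-descent direction
    return x

# Example: minimize f(x) = ||x - c||_2^2, whose gradient is 2 (x - c)
c = np.array([1.0, -2.0])
print(gradient_descent(lambda x: 2 * (x - c), x0=np.zeros(2)))   # close to c
```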

Gradient descent

[Figure: gradient-descent iterates on the contour lines of a cost function, for a small step size and for a large step size]

Line search

Idea: Find the minimum of f along the descent direction

αk := arg minα∈R h (α) = arg minα∈R f (~x (k) − α∇f (~x (k)))

Backtracking line search with Armijo rule

Given α0 ≥ 0 and β, η ∈ (0, 1), set αk := α0 βⁱ for the smallest i such that

~x (k+1) := ~x (k) − αk∇f (~x (k))

satisfies

f (~x (k+1)) ≤ f (~x (k)) − (1/2) αk ||∇f (~x (k))||2²

a condition known as the Armijo rule

Backtracking line search with Armijo rule

[Figure: gradient-descent iterates with backtracking line search on the contour lines of the same function]
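A minimal sketch of gradient descent with this backtracking rule (our names; the quadratic example at the end is only for illustration):

```python
import numpy as np

def backtracking_gradient_descent(f, grad_f, x0, alpha0=1.0, beta=0.5,
                                  max_iters=500, tol=1e-8):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        alpha = alpha0
        # Shrink alpha until the Armijo condition from the slide holds
        while f(x - alpha * g) > f(x) - 0.5 * alpha * np.dot(g, g):
            alpha *= beta
        x = x - alpha * g
    return x

# Example: a simple ill-conditioned quadratic
f = lambda x: x[0] ** 2 + 10 * x[1] ** 2
grad_f = lambda x: np.array([2 * x[0], 20 * x[1]])
print(backtracking_gradient_descent(f, grad_f, x0=np.array([3.0, 1.0])))
```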

Gradient descent for least squares

Aim: Use n examples (y (1), ~x (1)), (y (2), ~x (2)), . . . , (y (n), ~x (n)) to fit a linear model by minimizing the least-squares cost function

minimize~β∈Rp ||~y − X ~β||2²

Gradient descent for least squares

The gradient of the quadratic function

f (~β) := ||~y − X ~β||2² = ~βTXTX ~β − 2~βTXT~y + ~yT~y

equals

∇f (~β) = 2XTX ~β − 2XT~y

Gradient descent updates are

~β(k+1) = ~β(k) + 2αk XT (~y − X ~β(k))
        = ~β(k) + 2αk ∑i=1..n ( y (i) − 〈~x (i), ~β(k)〉 ) ~x (i)
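A minimal sketch of these least-squares updates on synthetic data (our names; the step size 1/L with L = 2 ||X||2², the Lipschitz constant of the gradient, is our choice):

```python
import numpy as np

def least_squares_gd(X, y, step_size, n_iters=1000):
    # Gradient descent update from the slide: beta + 2 alpha X^T (y - X beta)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        beta = beta + 2 * step_size * X.T @ (y - X @ beta)
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(100)
L = 2 * np.linalg.norm(X, 2) ** 2              # Lipschitz constant of the gradient
print(least_squares_gd(X, y, step_size=1.0 / L))   # close to beta_true
```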

Gradient ascent for logistic regression

Aim: Use n examples (y (1), ~x (1)), (y (2), ~x (2)), . . . , (y (n), ~x (n)) to fit a logistic-regression model by maximizing the log-likelihood cost function

f (~β) := ∑i=1..n y (i) log g (〈~x (i), ~β〉) + (1− y (i)) log (1− g (〈~x (i), ~β〉))

where

g (t) = 1 / (1 + exp (−t))

Gradient ascent for logistic regression

g ′ (t) = g (t) (1− g (t))

(1− g (t))′ = −g (t) (1− g (t))

The gradient of the cost function equals

∇f (~β) = ∑i=1..n y (i) (1− g (〈~x (i), ~β〉)) ~x (i) − (1− y (i)) g (〈~x (i), ~β〉) ~x (i)

The gradient ascent updates are

~β(k+1) := ~β(k) + αk ∑i=1..n y (i) (1− g (〈~x (i), ~β(k)〉)) ~x (i) − (1− y (i)) g (〈~x (i), ~β(k)〉) ~x (i)
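A minimal sketch of these gradient-ascent updates (our names; note that the summand simplifies to ( y (i) − g (〈~x (i), ~β(k)〉) ) ~x (i), which is what the code uses):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_regression_ga(X, y, step_size, n_iters=5000):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ beta)                     # g(<x_i, beta>) for every example
        beta = beta + step_size * X.T @ (y - p)   # ascent step along the gradient
    return beta

# Synthetic example; step size 1/L with L = ||X||_2^2 / 4 (a bound on the curvature)
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = (rng.uniform(size=200) < sigmoid(X @ beta_true)).astype(float)
alpha = 4.0 / np.linalg.norm(X, 2) ** 2
print(logistic_regression_ga(X, y, step_size=alpha))   # roughly recovers beta_true
```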

Convergence of gradient descent

Does the method converge?

How fast (slow)?

For what step sizes?

Depends on function

Lipschitz continuity

A function f : Rn → Rm is Lipschitz continuous if for any ~x , ~y ∈ Rn

||f (~y)− f (~x)||2 ≤ L ||~y − ~x ||2 .

L is the Lipschitz constant

Lipschitz-continuous gradients

If ∇f is Lipschitz continuous with Lipschitz constant L

||∇f (~y)−∇f (~x)||2 ≤ L ||~y − ~x ||2

then for any ~x , ~y ∈ Rn we have a quadratic upper bound

f (~y) ≤ f (~x) + ∇f (~x)T (~y − ~x) + (L/2) ||~y − ~x ||2²

Local progress of gradient descent

~x (k+1) := ~x (k) − αk∇f (~x (k))

f (~x (k+1)) ≤ f (~x (k)) + ∇f (~x (k))T (~x (k+1) − ~x (k)) + (L/2) ||~x (k+1) − ~x (k)||2²
            = f (~x (k)) − αk (1 − αkL/2) ||∇f (~x (k))||2²

If αk ≤ 1/L

f (~x (k+1)) ≤ f (~x (k)) − (αk/2) ||∇f (~x (k))||2²

Convergence of gradient descent

I f is convex

I ∇f is L-Lipschitz continuous

I There exists a point ~x ∗ at which f achieves a finite minimum

I The step size is set to αk := α ≤ 1/L

f (~x (k)) − f (~x ∗) ≤ ||~x (0) − ~x ∗||2² / (2α k)
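Before going through the proof, a quick numerical sanity check of this bound on a small least-squares problem (our own sketch; ~x (0) = 0 and α = 1/L with L = 2 ||X||2²):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = rng.standard_normal(50)
f = lambda b: np.sum((y - X @ b) ** 2)
grad = lambda b: 2 * X.T @ (X @ b - y)
L = 2 * np.linalg.norm(X, 2) ** 2                 # Lipschitz constant of the gradient
alpha = 1.0 / L
b_star = np.linalg.lstsq(X, y, rcond=None)[0]     # minimizer x*
b = np.zeros(5)                                   # x(0) = 0
for k in range(1, 201):
    b = b - alpha * grad(b)
    bound = np.sum(b_star ** 2) / (2 * alpha * k)   # ||x(0) - x*||^2 / (2 alpha k)
    assert f(b) - f(b_star) <= bound + 1e-9
print("bound holds for the first 200 iterations")
```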

Convergence of gradient descent

f (~x (k)) ≤ f (~x (k−1)) − (α/2) ||∇f (~x (k−1))||2²

f (~x (k−1)) + ∇f (~x (k−1))T (~x ∗ − ~x (k−1)) ≤ f (~x ∗)

f (~x (k)) − f (~x ∗) ≤ f (~x (k−1)) − f (~x ∗) − (α/2) ||∇f (~x (k−1))||2²
                     ≤ ∇f (~x (k−1))T (~x (k−1) − ~x ∗) − (α/2) ||∇f (~x (k−1))||2²
                     = (1/(2α)) ( ||~x (k−1) − ~x ∗||2² − ||~x (k−1) − ~x ∗ − α∇f (~x (k−1))||2² )
                     = (1/(2α)) ( ||~x (k−1) − ~x ∗||2² − ||~x (k) − ~x ∗||2² )

Convergence of gradient descent

f (~x (k)) − f (~x ∗) ≤ (1/k) ∑i=1..k ( f (~x (i)) − f (~x ∗) )        because f (~x (i)) − f (~x ∗) never increases
                     = (1/(2α k)) ∑i=1..k ( ||~x (i−1) − ~x ∗||2² − ||~x (i) − ~x ∗||2² )
                     = (1/(2α k)) ( ||~x (0) − ~x ∗||2² − ||~x (k) − ~x ∗||2² )
                     ≤ ||~x (0) − ~x ∗||2² / (2α k)

Accelerated gradient descent

I Gradient descent takes O (1/ε) to achieve an error of ε

I The optimal rate is O (1/√ε)

I Gradient descent can be accelerated by adding a momentum term

Accelerated gradient descent

Set the initial point ~x (0) to an arbitrary value

Update by setting

~y (k+1) = ~x (k) − αk∇f (~x (k))

~x (k+1) = βk ~y (k+1) + γk ~y (k)

where αk is the step size and βk > 0 and γk > 0 are parameters
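A minimal sketch of an accelerated scheme of this form (our names; the momentum coefficient (k − 1)/(k + 2) is one standard choice and is not taken from the slides):

```python
import numpy as np

def accelerated_gd(grad_f, x0, step_size, n_iters=500):
    x = np.asarray(x0, dtype=float)
    y_prev = x.copy()
    for k in range(1, n_iters + 1):
        y = x - step_size * grad_f(x)                    # gradient step
        x = y + (k - 1.0) / (k + 2.0) * (y - y_prev)     # momentum extrapolation
        y_prev = y
    return y_prev

# Example: minimize ||x - c||_2^2
c = np.array([1.0, -2.0, 3.0])
print(accelerated_gd(lambda x: 2 * (x - c), np.zeros(3), step_size=0.1))   # approaches c
```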

Digit classification

MNIST data

Aim: Determine whether a digit is a 5 or not

~xi is an image

~yi = 1 or ~yi = 0 if image i is a 5 or not, respectively

We fit a logistic-regression model

Digit classification

[Figure: running time (s) of gradient descent and accelerated gradient descent as a function of the problem size]

Stochastic gradient descent

Cost functions to fit models are often additive

f (~x) = (1/m) ∑i=1..m fi (~x)

I Linear regression

∑i=1..n ( y (i) − ~x (i)T ~β )² = ||~y − X ~β||2²

I Logistic regression

∑i=1..n y (i) log g (〈~x (i), ~β〉) + (1− y (i)) log (1− g (〈~x (i), ~β〉))

Stochastic gradient descent

In big data regime (very large n), gradient descent is too slow

In some cases, data is acquired sequentially (online setting)

Stochastic gradient descent: update solution using a subset of the data

Stochastic gradient descent

Set the initial point ~x (0) to an arbitrary value

Update by

1. Choosing a random subset of b indices B (b ≪ m is the batch size)
2. Setting

~x (k+1) := ~x (k) − αk ∑i∈B ∇fi (~x (k))

where αk is the step size
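A minimal sketch of these updates (our names; grad_fi(x, i) is an assumed interface returning ∇fi (x)):

```python
import numpy as np

def sgd(grad_fi, m, x0, step_size, batch_size=10, n_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        batch = rng.choice(m, size=batch_size, replace=False)   # random subset B
        g = sum(grad_fi(x, i) for i in batch)                   # sum_{i in B} grad f_i(x)
        x = x - step_size * g
    return x

# Example: least-squares terms f_i(x) = (b_i - <a_i, x>)^2
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 4))
b = A @ np.array([1.0, 2.0, -1.0, 0.5])
grad_fi = lambda x, i: -2.0 * (b[i] - A[i] @ x) * A[i]
print(sgd(grad_fi, m=500, x0=np.zeros(4), step_size=0.001, batch_size=10, n_iters=5000))
```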

Stochastic gradient descent

We replace ∇f by

∑i∈B ∇fi (~x (k)) = ∑i=1..m 1i∈B ∇fi (~x (k))

Noisy estimate of ∇f

Unbiased if every example is in the batch with probability p

E ( ∑i=1..m 1i∈B ∇fi (~x (k)) ) = ∑i=1..m E (1i∈B) ∇fi (~x (k))
                               = ∑i=1..m P (i ∈ B) ∇fi (~x (k))
                               = p ∇f (~x (k))

Stochastic gradient descent

I Linear regression

~β(k+1) := ~β(k) + 2αk ∑i∈B ( y (i) − 〈~x (i), ~β(k)〉 ) ~x (i)

I Logistic regression

~β(k+1) := ~β(k) + αk ∑i∈B y (i) (1− g (〈~x (i), ~β(k)〉)) ~x (i) − (1− y (i)) g (〈~x (i), ~β(k)〉) ~x (i)

Digit classification

MNIST data

Aim: Determine whether a digit is a 5 or not

~xi is an image

~yi = 1 or ~yi = 0 if image i is a 5 or not, respectively

We fit a logistic-regression model

Digit classification

[Figure: training loss of gradient descent and of SGD with batch sizes 1, 10, 100, 1000, and 10000, as a function of the number of steps]

Newton’s method

Motivation: Convex functions are often almost quadratic f ≈ f 2~x

Idea: Iteratively minimize quadratic approximation

f 2~x (~y) := f (~x) + ∇f (~x)T (~y − ~x) + (1/2) (~y − ~x)T ∇2f (~x) (~y − ~x)

Minimum has closed form

arg min~y∈Rn f 2~x (~y) = ~x − ∇2f (~x)−1∇f (~x)

Proof

We have

∇f 2~x (~y) = ∇f (~x) + ∇2f (~x) (~y − ~x)

It is equal to zero if

∇2f (~x) (~y − ~x) = −∇f (~x)

If the Hessian is positive definite, the only minimum of f 2~x is at

~x −∇2f (~x)−1∇f (~x)

Newton’s method

Set the initial point ~x (0) to an arbitrary value

Update by setting

~x (k+1) := ~x (k) − ∇2f (~x (k))−1 ∇f (~x (k))

until a stopping criterion is met
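A minimal sketch of these updates (our names; the linear system is solved instead of forming the inverse Hessian, which gives the same update):

```python
import numpy as np

def newtons_method(grad_f, hess_f, x0, n_iters=50, tol=1e-10):
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess_f(x), g)   # Newton step: x - (Hessian)^{-1} gradient
    return x

# Example: f(x1, x2) = exp(x1 + x2) + x1^2 + x2^2, smooth and strictly convex
grad_f = lambda x: np.exp(x[0] + x[1]) * np.ones(2) + 2 * x
hess_f = lambda x: np.exp(x[0] + x[1]) * np.ones((2, 2)) + 2 * np.eye(2)
print(newtons_method(grad_f, hess_f, np.zeros(2)))
```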

Newton’s method

[Figure: quadratic approximations used by Newton's method, for a quadratic function and for a general convex function]

Logistic regression

∂²f (~β) / ∂~β [j ] ∂~β [l ] = − ∑i=1..n g (〈~x (i), ~β〉) (1− g (〈~x (i), ~β〉)) ~x (i)[j ] ~x (i)[l ]

∇2f (~β) = −XTG (~β)X

The rows of X ∈ Rn×p contain ~x (1), . . . ~x (n)

G is a diagonal matrix such that

G (~β)ii := g (〈~x (i), ~β〉) (1− g (〈~x (i), ~β〉)) ,    1 ≤ i ≤ n

Logistic regression

Newton updates are

~β(k+1) := ~β(k) − ∇2f (~β(k))−1∇f (~β(k)) = ~β(k) + (XTG (~β(k))X)−1∇f (~β(k))

Sanity check: The cost function is concave; for any ~β, ~v ∈ Rp

~vT∇2f (~β)~v = − ∑i=1..n G (~β)ii (X~v)[i ]² ≤ 0
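A minimal sketch of these Newton updates for logistic regression, following the Hessian formula above (our names; no regularization or step-size safeguards; the plus sign comes from ∇2f (~β) = −XTG (~β)X):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_regression_newton(X, y, n_iters=20):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p)                  # gradient of the log-likelihood
        G = np.diag(p * (1.0 - p))            # G(beta)_ii = g(<x_i, beta>) (1 - g(<x_i, beta>))
        beta = beta + np.linalg.solve(X.T @ G @ X, grad)   # Newton step with Hessian -X^T G X
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 3))
y = (rng.uniform(size=300) < sigmoid(X @ np.array([1.0, -2.0, 0.5]))).astype(float)
print(logistic_regression_newton(X, y))   # roughly recovers (1, -2, 0.5)
```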