Convex Optimization - New York University


Transcript of Convex Optimization - New York University

Page 1: Convex Optimization - New York University

Convex Optimization

DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis
http://www.cims.nyu.edu/~cfgranda/pages/OBDA_fall17/index.html

Carlos Fernandez-Granda

Page 2: Convex Optimization - New York University

Convexity

Differentiable convex functions

Minimizing differentiable convex functions

Page 3: Convex Optimization - New York University

Convex functions

A function f : R^n → R is convex if for any ~x, ~y ∈ R^n and any θ ∈ (0, 1)

θ f(~x) + (1 − θ) f(~y) ≥ f(θ~x + (1 − θ) ~y)

A function f is concave if −f is convex

Page 4: Convex Optimization - New York University

Convex functions

[Figure: on the graph of a convex function, the chord value θ f(~x) + (1 − θ) f(~y) lies above f(θ~x + (1 − θ) ~y) between the points f(~x) and f(~y)]

Pages 5-6: Convex Optimization - New York University

Linear functions are convex

If f is linear

f(θ~x + (1 − θ) ~y) = θ f(~x) + (1 − θ) f(~y)

Page 7: Convex Optimization - New York University

Strictly convex functions

A function f : R^n → R is strictly convex if for any ~x ≠ ~y ∈ R^n and any θ ∈ (0, 1)

θ f(~x) + (1 − θ) f(~y) > f(θ~x + (1 − θ) ~y)

Page 8: Convex Optimization - New York University

Local minima are global

Any local minimum of a convex function is also a global minimum

Page 9: Convex Optimization - New York University

Proof

Let ~xloc be a local minimum: for all ~x ∈ R^n such that ||~x − ~xloc||_2 ≤ γ

f(~xloc) ≤ f(~x)

Suppose, seeking a contradiction, that there is a global minimum ~xglob with

f(~xglob) < f(~xloc)

Pages 10-13: Convex Optimization - New York University

Proof

Choose θ so that ~xθ := θ~xloc + (1 − θ) ~xglob satisfies

||~xθ − ~xloc||_2 ≤ γ

then

f(~xloc) ≤ f(~xθ)
         = f(θ~xloc + (1 − θ) ~xglob)
         ≤ θ f(~xloc) + (1 − θ) f(~xglob)    by convexity of f
         < f(~xloc)                          because f(~xglob) < f(~xloc)

This is a contradiction, so the local minimum ~xloc is also a global minimum

Page 14: Convex Optimization - New York University

Norm

Let V be a vector space. A norm is a function ||·|| from V to R with the following properties:

- It is homogeneous: for any scalar α and any ~x ∈ V

  ||α~x|| = |α| ||~x||

- It satisfies the triangle inequality

  ||~x + ~y|| ≤ ||~x|| + ||~y||

  In particular, ||~x|| ≥ 0

- ||~x|| = 0 implies ~x = ~0

Pages 15-17: Convex Optimization - New York University

Norms are convex

For any ~x, ~y ∈ R^n and any θ ∈ (0, 1)

||θ~x + (1 − θ) ~y|| ≤ ||θ~x|| + ||(1 − θ) ~y||    (triangle inequality)
                     = θ ||~x|| + (1 − θ) ||~y||   (homogeneity)

Page 18: Convex Optimization - New York University

Composition of convex and affine function

If f : R^n → R is convex, then for any A ∈ R^{n×m} and ~b ∈ R^n

h(~x) := f(A~x + ~b)    is convex

Consequence: f(~x) := ||A~x + ~b|| is convex for any A and ~b

Pages 19-22: Convex Optimization - New York University

Composition of convex and affine function

h(θ~x + (1 − θ) ~y) = f(θ(A~x + ~b) + (1 − θ)(A~y + ~b))
                    ≤ θ f(A~x + ~b) + (1 − θ) f(A~y + ~b)
                    = θ h(~x) + (1 − θ) h(~y)

Pages 23-29: Convex Optimization - New York University

ℓ0 "norm"

Number of nonzero entries in a vector

Not a norm!

||2~x||_0 = ||~x||_0 ≠ 2 ||~x||_0

Not convex

Let ~x := (1, 0)^T and ~y := (0, 1)^T, for any θ ∈ (0, 1)

||θ~x + (1 − θ) ~y||_0 = 2

θ ||~x||_0 + (1 − θ) ||~y||_0 = 1

Page 30: Convex Optimization - New York University

Promoting sparsity

Finding sparse vectors consistent with data is often very useful

Toy problem: Find t such that

~v_t := (t, t − 1, t − 1)^T

is sparse

Strategy: Minimize

f(t) := ||~v_t||

Page 31: Convex Optimization - New York University

Promoting sparsity

[Figure: ||~v_t||_0, ||~v_t||_1, ||~v_t||_2 and ||~v_t||_∞ plotted as functions of t]
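The comparison in the figure is easy to reproduce. Below is a small numerical sketch (not part of the original slides; it assumes numpy is available) that evaluates the toy cost f(t) = ||~v_t|| for several norms on a grid of t and reports where each one is minimized: the ℓ1 cost is minimized at the sparse solution t = 1, whereas the ℓ2 and ℓ∞ costs are not minimized at a sparse vector.

```python
# Sketch: evaluate f(t) = ||v_t|| for v_t = (t, t-1, t-1) under different norms
import numpy as np

def v(t):
    return np.array([t, t - 1.0, t - 1.0])

ts = np.linspace(-0.5, 1.5, 401)
costs = {
    "l1":   [np.linalg.norm(v(t), 1) for t in ts],
    "l2":   [np.linalg.norm(v(t), 2) for t in ts],
    "linf": [np.linalg.norm(v(t), np.inf) for t in ts],
}
for name, vals in costs.items():
    print(name, "minimized near t =", ts[np.argmin(vals)])
# l1 is minimized at t = 1, where v_t = (1, 0, 0) is sparse;
# l2 is minimized near t = 2/3 and linf near t = 1/2, which are not sparse.
```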

Pages 32-35: Convex Optimization - New York University

The rank is not convex

The rank of matrices in R^{n×n}, interpreted as a function from R^{n×n} to R, is not convex

X := [1 0; 0 0]    Y := [0 0; 0 1]

For any θ ∈ (0, 1)

rank(θX + (1 − θ)Y) = 2

θ rank(X) + (1 − θ) rank(Y) = 1

Page 36: Convex Optimization - New York University

Matrix norms

Frobenius norm

||A||_F := sqrt( Σ_{i=1}^m Σ_{j=1}^n A_{ij}^2 ) = sqrt( Σ_{i=1}^{min{m,n}} σ_i^2 )

Operator norm

||A|| := max_{||~x||_2 = 1, ~x ∈ R^n} ||A~x||_2 = σ_1

Nuclear norm

||A||_* := Σ_{i=1}^{min{m,n}} σ_i
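All three norms can be read off the singular values of the matrix. A minimal numpy sketch (the example matrix is arbitrary, not taken from the slides):

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])

sigma = np.linalg.svd(A, compute_uv=False)   # singular values, in decreasing order

frobenius = np.sqrt(np.sum(sigma ** 2))      # same as np.linalg.norm(A, 'fro')
operator  = sigma[0]                         # same as np.linalg.norm(A, 2)
nuclear   = np.sum(sigma)                    # same as np.linalg.norm(A, 'nuc')

print(frobenius, operator, nuclear)
```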

Page 37: Convex Optimization - New York University

Promoting low-rank structure

Finding low-rank matrices consistent with data is often very useful

Toy problem: Find t such that

M(t) := [ 0.5 + t   1       1
          0.5       0.5     t
          0.5       1 − t   0.5 ]

is low rank

Strategy: Minimize

f(t) := ||M(t)||

Page 38: Convex Optimization - New York University

Promoting low-rank structure

[Figure: rank, operator norm, Frobenius norm and nuclear norm of M(t) plotted as functions of t]

Page 39: Convex Optimization - New York University

Convexity

Differentiable convex functions

Minimizing differentiable convex functions

Page 40: Convex Optimization - New York University

Gradient

∇f(~x) = ( ∂f(~x)/∂~x[1], ∂f(~x)/∂~x[2], …, ∂f(~x)/∂~x[n] )^T

If the gradient exists at every point, the function is said to be differentiable

Page 41: Convex Optimization - New York University

Directional derivative

Encodes first-order rate of change in a particular direction

f'_~u(~x) := lim_{h→0} ( f(~x + h~u) − f(~x) ) / h = 〈∇f(~x), ~u〉

where ||~u||_2 = 1

Pages 42-45: Convex Optimization - New York University

Direction of maximum variation

∇f is the direction of maximum increase

−∇f is the direction of maximum decrease

|f'_~u(~x)| = |∇f(~x)^T ~u|
            ≤ ||∇f(~x)||_2 ||~u||_2    Cauchy-Schwarz inequality
            = ||∇f(~x)||_2

equality holds if and only if ~u = ± ∇f(~x) / ||∇f(~x)||_2

Page 46: Convex Optimization - New York University

Gradient

Page 47: Convex Optimization - New York University

First-order approximation

The first-order or linear approximation of f : R^n → R at ~x is

f^1_~x(~y) := f(~x) + ∇f(~x)^T (~y − ~x)

If f is continuously differentiable at ~x

lim_{~y→~x} ( f(~y) − f^1_~x(~y) ) / ||~y − ~x||_2 = 0

Page 48: Convex Optimization - New York University

First-order approximation

[Figure: f(~y) and its first-order approximation f^1_~x(~y) near x]

Page 49: Convex Optimization - New York University

Convexity

A differentiable function f : R^n → R is convex if and only if for every ~x, ~y ∈ R^n

f(~y) ≥ f(~x) + ∇f(~x)^T (~y − ~x)

It is strictly convex if and only if for every ~y ≠ ~x

f(~y) > f(~x) + ∇f(~x)^T (~y − ~x)

Page 50: Convex Optimization - New York University

Optimality condition

If f is convex and ∇f(~x) = 0, then for any ~y ∈ R^n

f(~y) ≥ f(~x)

If f is strictly convex, then for any ~y ≠ ~x

f(~y) > f(~x)

Page 51: Convex Optimization - New York University

Epigraph

The epigraph of f : R^n → R is

epi(f) := { ~x ∈ R^{n+1} | f(~x[1], …, ~x[n]) ≤ ~x[n+1] }

Page 52: Convex Optimization - New York University

Epigraph

[Figure: the epigraph epi(f) is the region above the graph of f]

Page 53: Convex Optimization - New York University

Supporting hyperplane

A hyperplane H is a supporting hyperplane of a set S at ~x if

- H and S intersect at ~x
- S is contained in one of the half-spaces bounded by H

Page 54: Convex Optimization - New York University

Geometric intuition

Geometrically, f is convex if and only if for every ~x the hyperplane

H_{f,~x} := { ~y ∈ R^{n+1} | ~y[n+1] = f^1_~x(~y[1], …, ~y[n]) }

is a supporting hyperplane of the epigraph at ~x

If ∇f(~x) = 0 the hyperplane is horizontal

Page 55: Convex Optimization - New York University

Convexity

[Figure: the first-order approximation f^1_~x(~y) supports the graph of f(~y) from below at x]

Page 56: Convex Optimization - New York University

Hessian matrix

If f has a Hessian matrix at every point, it is twice differentiable

∇²f(~x) = [ ∂²f(~x)/∂~x[1]²        ∂²f(~x)/∂~x[1]∂~x[2]   ···   ∂²f(~x)/∂~x[1]∂~x[n]
            ∂²f(~x)/∂~x[1]∂~x[2]   ∂²f(~x)/∂~x[2]²        ···   ∂²f(~x)/∂~x[2]∂~x[n]
            ···
            ∂²f(~x)/∂~x[1]∂~x[n]   ∂²f(~x)/∂~x[2]∂~x[n]   ···   ∂²f(~x)/∂~x[n]² ]

Page 57: Convex Optimization - New York University

Curvature

The second directional derivative f''_~u of f at ~x equals

f''_~u(~x) = ~u^T ∇²f(~x) ~u

for any unit-norm vector ~u ∈ R^n

Page 58: Convex Optimization - New York University

Second-order approximation

The second-order or quadratic approximation of f at ~x is

f^2_~x(~y) := f(~x) + ∇f(~x)^T (~y − ~x) + (1/2) (~y − ~x)^T ∇²f(~x) (~y − ~x)

Page 59: Convex Optimization - New York University

Second-order approximation

[Figure: f(~y) and its second-order approximation f^2_~x(~y) near x]

Page 60: Convex Optimization - New York University

Quadratic form

Second-order polynomial in several dimensions

q(~x) := ~x^T A ~x + ~b^T ~x + c

parametrized by a symmetric matrix A ∈ R^{n×n}, a vector ~b ∈ R^n and a constant c

Page 61: Convex Optimization - New York University

Quadratic approximation

The quadratic approximation f^2_~x : R^n → R at ~x ∈ R^n of a twice-continuously differentiable function f : R^n → R satisfies

lim_{~y→~x} ( f(~y) − f^2_~x(~y) ) / ||~y − ~x||_2^2 = 0

Page 62: Convex Optimization - New York University

Eigendecomposition of symmetric matrices

Let A = UΛU^T be the eigendecomposition of a symmetric matrix A

Eigenvalues: λ_1 ≥ ··· ≥ λ_n (which can be negative or 0)

Eigenvectors: ~u_1, …, ~u_n, an orthonormal basis

λ_1 = max_{||~x||_2 = 1, ~x ∈ R^n} ~x^T A ~x        ~u_1 = argmax_{||~x||_2 = 1, ~x ∈ R^n} ~x^T A ~x

λ_n = min_{||~x||_2 = 1, ~x ∈ R^n} ~x^T A ~x        ~u_n = argmin_{||~x||_2 = 1, ~x ∈ R^n} ~x^T A ~x

Page 63: Convex Optimization - New York University

Maximum and minimum curvature

Let ∇2f (~x) = UΛUT be the eigendecomposition of the Hessian at ~x

Direction of maximum curvature: ~u1

Direction of minimum curvature (or maximum negative curvature): ~un
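Numerically, the two directions are obtained from an eigendecomposition of the Hessian. A short sketch (the function below is an arbitrary example, not from the slides; numpy is assumed):

```python
import numpy as np

# Hessian of f(x) = x[0]**2 + 3*x[1]**2 + x[0]*x[1], which happens to be constant in x
def hessian(x):
    return np.array([[2.0, 1.0],
                     [1.0, 6.0]])

eigvals, eigvecs = np.linalg.eigh(hessian(np.zeros(2)))   # eigenvalues in ascending order

u_max = eigvecs[:, -1]   # direction of maximum curvature (largest eigenvalue)
u_min = eigvecs[:, 0]    # direction of minimum curvature (smallest eigenvalue)
print(eigvals, u_max, u_min)
```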

Page 64: Convex Optimization - New York University

Positive semidefinite matrices

For any ~x

~x^T A ~x = ~x^T U Λ U^T ~x = Σ_{i=1}^n λ_i 〈~u_i, ~x〉²

All eigenvalues are nonnegative if and only if

~x^T A ~x ≥ 0    for all ~x

In that case the matrix is positive semidefinite

Page 65: Convex Optimization - New York University

Positive (negative) (semi)definite matrices

Positive (semi)definite: all eigenvalues are positive (nonnegative); equivalently, for all ~x ≠ 0

~x^T A ~x > (≥) 0

Quadratic form: all directions have positive curvature

Negative (semi)definite: all eigenvalues are negative (nonpositive); equivalently, for all ~x ≠ 0

~x^T A ~x < (≤) 0

Quadratic form: all directions have negative curvature

Page 66: Convex Optimization - New York University

Convexity

A twice-differentiable function g : R → R is convex if and only if g''(x) ≥ 0 for all x ∈ R

A twice-differentiable function in R^n is convex if and only if its Hessian is positive semidefinite at every point

If the Hessian is positive definite at every point, the function is strictly convex

Page 67: Convex Optimization - New York University

Second-order approximation

[Figure: f(~y) and its second-order approximation f^2_~x(~y) near x]

Page 68: Convex Optimization - New York University

Convex

Page 69: Convex Optimization - New York University

Concave

Page 70: Convex Optimization - New York University

Neither

Page 71: Convex Optimization - New York University

Convexity

Differentiable convex functions

Minimizing differentiable convex functions

Page 72: Convex Optimization - New York University

Problem

Challenge: Minimizing differentiable convex functions

min_{~x ∈ R^n} f(~x)

Page 73: Convex Optimization - New York University

Gradient descent

Intuition: Make local progress in the steepest direction −∇f(~x)

Set the initial point ~x^(0) to an arbitrary value

Update by setting

~x^(k+1) := ~x^(k) − α_k ∇f(~x^(k))

where α_k > 0 is the step size, until a stopping criterion is met
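As a reference, here is a minimal sketch of the update (a generic implementation, not code distributed with the course; numpy is assumed):

```python
import numpy as np

def gradient_descent(grad_f, x0, step_size, max_iters=1000, tol=1e-8):
    """Gradient descent with a constant step size and a small-gradient stopping criterion."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:
            break
        x = x - step_size * g            # x^(k+1) = x^(k) - alpha * grad f(x^(k))
    return x

# Usage on f(x) = ||x - c||_2^2 / 2, whose gradient is x - c
c = np.array([1.0, -2.0, 3.0])
print(gradient_descent(lambda x: x - c, x0=np.zeros(3), step_size=0.5))   # close to c
```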

Page 74: Convex Optimization - New York University

Gradient descent

Page 75: Convex Optimization - New York University

Gradient descent

[Figure: gradient-descent iterates superimposed on the contour lines of f]

Page 76: Convex Optimization - New York University

Small step size

[Figure: gradient-descent iterates with a small step size]

Page 77: Convex Optimization - New York University

Large step size

[Figure: gradient-descent iterates with a large step size]

Page 78: Convex Optimization - New York University

Line search

Idea: Find the minimum of f along the descent direction

α_k := argmin_{α ∈ R} h(α) = argmin_{α ∈ R} f(~x^(k) − α ∇f(~x^(k)))

Page 79: Convex Optimization - New York University

Backtracking line search with Armijo rule

Given α_0 ≥ 0 and β, η ∈ (0, 1), set α_k := α_0 β^i for the smallest i such that

~x^(k+1) := ~x^(k) − α_k ∇f(~x^(k))

satisfies

f(~x^(k+1)) ≤ f(~x^(k)) − (α_k / 2) ||∇f(~x^(k))||_2^2

a condition known as the Armijo rule

Page 80: Convex Optimization - New York University

Backtracking line search with Armijo rule

[Figure: gradient-descent iterates with backtracking line search]
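A minimal sketch of one backtracking step (generic code, with the Armijo parameter fixed to 1/2 as in the condition above; numpy is assumed):

```python
import numpy as np

def backtracking_step(f, grad_f, x, alpha0=1.0, beta=0.5, max_backtracks=50):
    """Take one gradient step whose size satisfies the Armijo rule."""
    g = grad_f(x)
    g_norm_sq = np.dot(g, g)
    alpha = alpha0
    for _ in range(max_backtracks):
        x_next = x - alpha * g
        if f(x_next) <= f(x) - 0.5 * alpha * g_norm_sq:   # Armijo condition
            return x_next
        alpha *= beta                                     # shrink the step size and retry
    return x - alpha * g

# Usage on f(x) = ||x||_2^2 / 2
print(backtracking_step(lambda x: 0.5 * np.dot(x, x), lambda x: x, np.array([4.0, -3.0])))
```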

Page 81: Convex Optimization - New York University

Gradient descent for least squares

Aim: Use n examples (y^(1), ~x^(1)), (y^(2), ~x^(2)), …, (y^(n), ~x^(n)) to fit a linear model by minimizing the least-squares cost function

minimize_{~β ∈ R^p}  ||~y − X~β||_2^2

Pages 82-85: Convex Optimization - New York University

Gradient descent for least squares

The gradient of the quadratic function

f(~β) := ||~y − X~β||_2^2 = ~β^T X^T X ~β − 2 ~β^T X^T ~y + ~y^T ~y

equals

∇f(~β) = 2 X^T X ~β − 2 X^T ~y

Gradient descent updates are

~β^(k+1) = ~β^(k) + 2 α_k X^T (~y − X~β^(k))
         = ~β^(k) + 2 α_k Σ_{i=1}^n ( y^(i) − 〈~x^(i), ~β^(k)〉 ) ~x^(i)
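A direct numpy translation of this update (a sketch with synthetic data, not course code; the step size is set to one over the Lipschitz constant of the gradient, 2 σ_1(X)^2):

```python
import numpy as np

def least_squares_gd(X, y, alpha, iters=500):
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        beta = beta + 2 * alpha * X.T @ (y - X @ beta)   # beta - alpha * grad f(beta)
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.01 * rng.standard_normal(100)

alpha = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)   # 1 / L with L = 2 * sigma_1(X)^2
print(least_squares_gd(X, y, alpha))            # close to beta_true
```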

Page 86: Convex Optimization - New York University

Gradient ascent for logistic regression

Aim: Use n examples (y^(1), ~x^(1)), (y^(2), ~x^(2)), …, (y^(n), ~x^(n)) to fit a logistic-regression model by maximizing the log-likelihood cost function

f(~β) := Σ_{i=1}^n  y^(i) log g(〈~x^(i), ~β〉) + (1 − y^(i)) log(1 − g(〈~x^(i), ~β〉))

where

g(t) = 1 / (1 + exp(−t))

Pages 87-90: Convex Optimization - New York University

Gradient ascent for logistic regression

g'(t) = g(t) (1 − g(t))

(1 − g(t))' = −g(t) (1 − g(t))

The gradient of the cost function equals

∇f(~β) = Σ_{i=1}^n  y^(i) (1 − g(〈~x^(i), ~β〉)) ~x^(i) − (1 − y^(i)) g(〈~x^(i), ~β〉) ~x^(i)

The gradient ascent updates are

~β^(k+1) := ~β^(k) + α_k Σ_{i=1}^n  y^(i) (1 − g(〈~x^(i), ~β^(k)〉)) ~x^(i) − (1 − y^(i)) g(〈~x^(i), ~β^(k)〉) ~x^(i)
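The two terms in the gradient collapse to (y^(i) − g(〈~x^(i), ~β〉)) ~x^(i), which makes the update a one-liner. A sketch with synthetic data (generic code, not from the course; numpy is assumed):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_gradient_ascent(X, y, alpha, iters=2000):
    """Maximize the log-likelihood via beta <- beta + alpha * X^T (y - g(X beta))."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        beta = beta + alpha * X.T @ (y - sigmoid(X @ beta))
    return beta

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
y = (X @ np.array([2.0, -1.0]) + 0.3 * rng.standard_normal(200) > 0).astype(float)
print(logistic_gradient_ascent(X, y, alpha=0.01))
```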

Page 91: Convex Optimization - New York University

Convergence of gradient descent

Does the method converge?

How fast (slow)?

For what step sizes?

Depends on function


Page 93: Convex Optimization - New York University

Lipschitz continuity

A function f : Rn → Rm is Lipschitz continuous if for any ~x , ~y ∈ Rn

||f (~y)− f (~x)||2 ≤ L ||~y − ~x ||2 .

L is the Lipschitz constant

Page 94: Convex Optimization - New York University

Lipschitz-continuous gradients

If ∇f is Lipschitz continuous with Lipschitz constant L

||∇f(~y) − ∇f(~x)||_2 ≤ L ||~y − ~x||_2

then for any ~x, ~y ∈ R^n we have a quadratic upper bound

f(~y) ≤ f(~x) + ∇f(~x)^T (~y − ~x) + (L/2) ||~y − ~x||_2^2

Pages 95-98: Convex Optimization - New York University

Local progress of gradient descent

~x^(k+1) := ~x^(k) − α_k ∇f(~x^(k))

f(~x^(k+1)) ≤ f(~x^(k)) + ∇f(~x^(k))^T (~x^(k+1) − ~x^(k)) + (L/2) ||~x^(k+1) − ~x^(k)||_2^2
           = f(~x^(k)) − α_k (1 − α_k L / 2) ||∇f(~x^(k))||_2^2

If α_k ≤ 1/L

f(~x^(k+1)) ≤ f(~x^(k)) − (α_k / 2) ||∇f(~x^(k))||_2^2

Page 99: Convex Optimization - New York University

Convergence of gradient descent

Assume

- f is convex
- ∇f is L-Lipschitz continuous
- There exists a point ~x* at which f achieves a finite minimum
- The step size is set to α_k := α ≤ 1/L

Then

f(~x^(k)) − f(~x*) ≤ ||~x^(0) − ~x*||_2^2 / (2 α k)

Pages 100-104: Convex Optimization - New York University

Convergence of gradient descent

By the local-progress bound,

f(~x^(k)) ≤ f(~x^(k−1)) − (α/2) ||∇f(~x^(k−1))||_2^2

By the first-order characterization of convexity,

f(~x^(k−1)) + ∇f(~x^(k−1))^T (~x* − ~x^(k−1)) ≤ f(~x*)

Combining the two,

f(~x^(k)) − f(~x*) ≤ f(~x^(k−1)) − f(~x*) − (α/2) ||∇f(~x^(k−1))||_2^2
                   ≤ ∇f(~x^(k−1))^T (~x^(k−1) − ~x*) − (α/2) ||∇f(~x^(k−1))||_2^2
                   = (1/(2α)) ( ||~x^(k−1) − ~x*||_2^2 − ||~x^(k−1) − ~x* − α ∇f(~x^(k−1))||_2^2 )    completing the square
                   = (1/(2α)) ( ||~x^(k−1) − ~x*||_2^2 − ||~x^(k) − ~x*||_2^2 )

Pages 105-110: Convex Optimization - New York University

Convergence of gradient descent

f(~x^(k)) − f(~x*) ≤ (1/k) Σ_{i=1}^k ( f(~x^(i)) − f(~x*) )    because f(~x^(i)) − f(~x*) never increases
                   ≤ (1/(2 α k)) Σ_{i=1}^k ( ||~x^(i−1) − ~x*||_2^2 − ||~x^(i) − ~x*||_2^2 )    by the bound from the previous slide
                   = (1/(2 α k)) ( ||~x^(0) − ~x*||_2^2 − ||~x^(k) − ~x*||_2^2 )    because the sum telescopes
                   ≤ ||~x^(0) − ~x*||_2^2 / (2 α k)

Page 111: Convex Optimization - New York University

Accelerated gradient descent

- Gradient descent takes O(1/ε) iterations to achieve an error of ε
- The optimal rate is O(1/√ε)
- Gradient descent can be accelerated by adding a momentum term

Page 112: Convex Optimization - New York University

Accelerated gradient descent

Set the initial point ~x^(0) to an arbitrary value

Update by setting

~y^(k+1) = ~x^(k) − α_k ∇f(~x^(k))

~x^(k+1) = β_k ~y^(k+1) + γ_k ~y^(k)

where α_k is the step size and β_k > 0 and γ_k > 0 are parameters
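One common way to instantiate the parameters is Nesterov's sequence, which corresponds to writing the second update as ~x^(k+1) = ~y^(k+1) + μ_k (~y^(k+1) − ~y^(k)) with μ_k = (k − 1)/(k + 2), i.e. β_k = 1 + μ_k and γ_k = −μ_k. The slides do not fix a particular choice, so the sketch below is only one standard instance (numpy assumed):

```python
import numpy as np

def accelerated_gradient_descent(grad_f, x0, step_size, iters=500):
    x = np.asarray(x0, dtype=float)
    y_prev = x.copy()
    for k in range(1, iters + 1):
        y = x - step_size * grad_f(x)     # plain gradient step
        mu = (k - 1) / (k + 2)            # momentum weight
        x = y + mu * (y - y_prev)         # extrapolate using the previous iterate
        y_prev = y
    return y_prev

# Usage on f(x) = ||x - c||_2^2 / 2
c = np.array([3.0, -1.0])
print(accelerated_gradient_descent(lambda x: x - c, np.zeros(2), step_size=0.5))
```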

Page 113: Convex Optimization - New York University

Digit classification

MNIST data

Aim: Determine whether a digit is a 5 or not

~xi is an image

~yi = 1 or ~yi = 0 if image i is a 5 or not, respectively

We fit a logistic-regression model

Page 114: Convex Optimization - New York University

Digit classification

[Figure: running time in seconds versus problem size, on log-log axes, for gradient descent and accelerated gradient descent]

Page 115: Convex Optimization - New York University

Stochastic gradient descent

Cost functions used to fit models are often additive

f(~x) = (1/m) Σ_{i=1}^m f_i(~x)

- Linear regression

  Σ_{i=1}^n ( y^(i) − ~x^(i)T ~β )^2 = ||~y − X~β||_2^2

- Logistic regression

  Σ_{i=1}^n  y^(i) log g(〈~x^(i), ~β〉) + (1 − y^(i)) log(1 − g(〈~x^(i), ~β〉))

Page 116: Convex Optimization - New York University

Stochastic gradient descent

In big data regime (very large n), gradient descent is too slow

In some cases, data is acquired sequentially (online setting)

Stochastic gradient descent: update solution using a subset of the data

Page 117: Convex Optimization - New York University

Stochastic gradient descent

Set the initial point ~x^(0) to an arbitrary value

Update by

1. Choosing a random subset B of b indices (b ≪ m is the batch size)
2. Setting

   ~x^(k+1) := ~x^(k) − α_k Σ_{i ∈ B} ∇f_i(~x^(k))

where α_k is the step size
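Combining this update rule with the per-example gradients from the least-squares slide gives a minibatch SGD loop. A sketch with synthetic data (generic code, not from the course; numpy assumed):

```python
import numpy as np

def sgd_least_squares(X, y, alpha, batch_size=10, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(epochs):
        for _ in range(n // batch_size):
            batch = rng.choice(n, size=batch_size, replace=False)   # random subset B
            Xb, yb = X[batch], y[batch]
            beta = beta + 2 * alpha * Xb.T @ (yb - Xb @ beta)       # minibatch gradient step
    return beta

rng = np.random.default_rng(2)
X = rng.standard_normal((1000, 3))
y = X @ np.array([1.0, 0.0, -1.0]) + 0.1 * rng.standard_normal(1000)
print(sgd_least_squares(X, y, alpha=0.01))
```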

Pages 118-121: Convex Optimization - New York University

Stochastic gradient descent

We replace ∇f by

Σ_{i ∈ B} ∇f_i(~x^(k)) = Σ_{i=1}^m 1_{i ∈ B} ∇f_i(~x^(k))

a noisy estimate of ∇f

It is unbiased, up to a constant factor, if every example is in the batch with probability p: since f = (1/m) Σ_{i=1}^m f_i,

E( Σ_{i=1}^m 1_{i ∈ B} ∇f_i(~x^(k)) ) = Σ_{i=1}^m E(1_{i ∈ B}) ∇f_i(~x^(k))
                                      = Σ_{i=1}^m P(i ∈ B) ∇f_i(~x^(k))
                                      = p m ∇f(~x^(k))

Page 122: Convex Optimization - New York University

Stochastic gradient descent

- Linear regression

  ~β^(k+1) := ~β^(k) + 2 α_k Σ_{i ∈ B} ( y^(i) − 〈~x^(i), ~β^(k)〉 ) ~x^(i)

- Logistic regression

  ~β^(k+1) := ~β^(k) + α_k Σ_{i ∈ B}  y^(i) (1 − g(〈~x^(i), ~β^(k)〉)) ~x^(i) − (1 − y^(i)) g(〈~x^(i), ~β^(k)〉) ~x^(i)

Page 123: Convex Optimization - New York University

Digit classification

MNIST data

Aim: Determine whether a digit is a 5 or not

~xi is an image

~yi = 1 or ~yi = 0 if image i is a 5 or not, respectively

We fit a logistic-regression model

Page 124: Convex Optimization - New York University

Digit classification

[Figure: training loss versus number of iterations for gradient descent and for SGD with batch sizes 1, 10, 100, 1000 and 10000]

Page 125: Convex Optimization - New York University

Newton’s method

Motivation: Convex functions are often almost quadratic, f ≈ f^2_~x

Idea: Iteratively minimize the quadratic approximation

f^2_~x(~y) := f(~x) + ∇f(~x)^T (~y − ~x) + (1/2) (~y − ~x)^T ∇²f(~x) (~y − ~x)

The minimum has a closed form

argmin_{~y ∈ R^n} f^2_~x(~y) = ~x − ∇²f(~x)^{-1} ∇f(~x)

Page 126: Convex Optimization - New York University

Proof

We have

∇f^2_~x(~y) = ∇f(~x) + ∇²f(~x) (~y − ~x)

It is equal to zero if

∇²f(~x) (~y − ~x) = −∇f(~x)

If the Hessian is positive definite, the only minimum of f^2_~x is at

~x − ∇²f(~x)^{-1} ∇f(~x)

Page 127: Convex Optimization - New York University

Newton’s method

Set the initial point ~x^(0) to an arbitrary value

Update by setting

~x^(k+1) := ~x^(k) − ∇²f(~x^(k))^{-1} ∇f(~x^(k))

until a stopping criterion is met
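A minimal sketch of the iteration (generic code, not from the course; in practice the Newton step is computed by solving a linear system rather than forming the inverse Hessian):

```python
import numpy as np

def newtons_method(grad_f, hess_f, x0, iters=20, tol=1e-10):
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:
            break
        x = x - np.linalg.solve(hess_f(x), g)   # solve hess * step = grad instead of inverting
    return x

# Usage on the smooth strictly convex function f(x) = sum(exp(x_i) + x_i^2)
grad = lambda x: np.exp(x) + 2 * x
hess = lambda x: np.diag(np.exp(x) + 2)
print(newtons_method(grad, hess, np.array([2.0, -1.0])))
```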

Page 128: Convex Optimization - New York University

Newton’s method

Quadratic approximation

Page 129: Convex Optimization - New York University

Quadratic function

Page 130: Convex Optimization - New York University

Quadratic function

Page 131: Convex Optimization - New York University

Convex function

Page 132: Convex Optimization - New York University

Logistic regression

∂²f(~β) / ∂~β[j] ∂~β[l] = − Σ_{i=1}^n g(〈~x^(i), ~β〉) (1 − g(〈~x^(i), ~β〉)) ~x^(i)[j] ~x^(i)[l]

∇²f(~β) = −X^T G(~β) X

The rows of X ∈ R^{n×p} contain ~x^(1), …, ~x^(n)

G(~β) is a diagonal matrix such that

G(~β)_{ii} := g(〈~x^(i), ~β〉) (1 − g(〈~x^(i), ~β〉)),    1 ≤ i ≤ n

Page 133: Convex Optimization - New York University

Logistic regression

Newton updates are

~β^(k+1) := ~β^(k) − ∇²f(~β^(k))^{-1} ∇f(~β^(k)) = ~β^(k) + ( X^T G(~β^(k)) X )^{-1} ∇f(~β^(k))

Sanity check: the cost function is concave, since for any ~β, ~v ∈ R^p

~v^T ∇²f(~β) ~v = − Σ_{i=1}^n G(~β)_{ii} (X~v)[i]^2 ≤ 0
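Putting the gradient and Hessian together gives the classical iteratively reweighted least-squares flavor of Newton's method for logistic regression. A sketch with synthetic data (generic code, without safeguards such as damping; numpy assumed):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_newton(X, y, iters=10):
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = sigmoid(X @ beta)
        grad = X.T @ (y - mu)                               # gradient of the log-likelihood
        G = np.diag(mu * (1 - mu))                          # diagonal weights g (1 - g)
        beta = beta + np.linalg.solve(X.T @ G @ X, grad)    # Newton step
    return beta

rng = np.random.default_rng(3)
X = rng.standard_normal((300, 2))
y = (sigmoid(X @ np.array([1.5, -1.0])) > rng.random(300)).astype(float)
print(logistic_newton(X, y))
```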