
Transcript of Quasi-Newton methods - Princeton...


ELE 522: Large-Scale Optimization for Data Science

Quasi-Newton methods

Yuxin Chen

Princeton University, Fall 2019


Newton’s method

minimize_{x∈R^n}  f(x)

x^{t+1} = x^t − (∇²f(x^t))^{−1} ∇f(x^t)

Examples

example in R² (page 10–9)

[figure: f(x^{(k)}) − p* versus iteration k, showing the iterates x^{(0)}, x^{(1)}]

• backtracking parameters α = 0.1, β = 0.7
• converges in only 5 steps
• quadratic local convergence

Unconstrained minimization 10–21


• quadratic convergence: attains ε accuracy within O(log log(1/ε)) iterations
• typically requires storing and inverting the Hessian ∇²f(x) ∈ R^{n×n}
• a single iteration may last forever; prohibitive storage requirement

Quasi-Newton methods 13-2
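To make the per-iteration cost concrete, here is a minimal NumPy sketch of a single damped Newton step; f_grad and f_hess are placeholder names for user-supplied gradient and Hessian oracles (not from the lecture), and the O(n³) cost comes from solving an n × n linear system:

import numpy as np

def newton_step(x, f_grad, f_hess, eta=1.0):
    # One damped Newton step: x_plus = x - eta * (hessian f(x))^{-1} grad f(x)
    g = f_grad(x)              # gradient: O(n) storage
    H = f_hess(x)              # full Hessian: O(n^2) storage
    d = np.linalg.solve(H, g)  # Newton direction via a linear solve: O(n^3) time
    return x - eta * d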


Quasi-Newton methods

key idea: approximate the Hessian matrix using only gradient information

x^{t+1} = x^t − η_t H_t ∇f(x^t),   where H_t serves as a surrogate of (∇²f(x^t))^{−1}

challenges: how to find a good approximation H_t ≻ 0 of (∇²f(x^t))^{−1}

• using only gradient information
• using limited memory
• achieving super-linear convergence

Quasi-Newton methods 13-3


Criterion for choosing H_t

Consider the following approximate quadratic model of f(·):

f_t(x) := f(x^{t+1}) + ⟨∇f(x^{t+1}), x − x^{t+1}⟩ + (1/2) (x − x^{t+1})^T H_{t+1}^{−1} (x − x^{t+1})

which satisfies

∇f_t(x) = ∇f(x^{t+1}) + H_{t+1}^{−1} (x − x^{t+1})

One reasonable criterion: gradient matching for the latest two iterates:

∇f_t(x^t) = ∇f(x^t)            (13.1a)
∇f_t(x^{t+1}) = ∇f(x^{t+1})    (13.1b)

Quasi-Newton methods 13-4


Secant equation


(13.1b) holds automatically. To satisfy (13.1a), one requires

∇f(x^{t+1}) + H_{t+1}^{−1} (x^t − x^{t+1}) = ∇f(x^t)

⟺  H_{t+1}^{−1} (x^{t+1} − x^t) = ∇f(x^{t+1}) − ∇f(x^t)        (secant equation)

• the secant equation requires that H_{t+1}^{−1} maps the displacement x^{t+1} − x^t into the change of gradients ∇f(x^{t+1}) − ∇f(x^t)

Quasi-Newton methods 13-5


Secant equation

H_{t+1} (∇f(x^{t+1}) − ∇f(x^t)) = x^{t+1} − x^t        (13.2)

Writing y_t := ∇f(x^{t+1}) − ∇f(x^t) and s_t := x^{t+1} − x^t, this is H_{t+1} y_t = s_t.

• only possible when s_t^T y_t > 0, since s_t^T y_t = y_t^T H_{t+1} y_t > 0
• admits an infinite number of solutions, since the O(n²) degrees of freedom in choosing H_{t+1}^{−1} far exceed the n constraints in (13.2)
• which H_{t+1}^{−1} shall we choose?

Quasi-Newton methods 13-6
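A quick sanity check (an illustration, not on the slide): for a quadratic f(x) = (1/2) x^T A x − b^T x with A ≻ 0, the change of gradients is exactly linear, y_t = ∇f(x^{t+1}) − ∇f(x^t) = A s_t, so (13.2) reads H_{t+1} A s_t = s_t. In other words, the surrogate must act like A^{−1} = (∇²f)^{−1} along the direction s_t, and the curvature condition s_t^T y_t = s_t^T A s_t > 0 holds automatically.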


Broyden-Fletcher-Goldfarb-Shanno (BFGS) method

Quasi-Newton methods 13-7


Closeness to H_t

In addition to the secant equation, choose H_{t+1} sufficiently close to H_t:

minimize_H    ‖H − H_t‖
subject to    H = H^T
              H y_t = s_t

Hyt = st

for some norm ‖ · ‖

• exploit past information regarding H_t

• choosing different norms ‖·‖ results in different quasi-Newton methods

Quasi-Newton methods 13-8


Choice of norm in BFGS

Choosing ‖M‖ := ‖W^{1/2} M W^{1/2}‖_F for any weight matrix W obeying W s_t = y_t, we get

minimize_H    ‖W^{1/2} (H − H_t) W^{1/2}‖_F
subject to    H = H^T
              H y_t = s_t

This admits a closed-form expression

H_{t+1} = (I − ρ_t s_t y_t^T) H_t (I − ρ_t y_t s_t^T) + ρ_t s_t s_t^T        (13.3)
           (BFGS update rule; H_{t+1} ≻ 0 if H_t ≻ 0)

with ρ_t = 1/(y_t^T s_t)

Quasi-Newton methods 13-9
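The update (13.3) is a one-line computation given the pair (s_t, y_t); below is a NumPy sketch (the function and variable names are mine, not from the slides), costing O(n²) per call:

import numpy as np

def bfgs_update(H, s, y):
    # BFGS update (13.3): H_next = (I - rho s y^T) H (I - rho y s^T) + rho s s^T
    rho = 1.0 / (y @ s)                            # requires the curvature condition s^T y > 0
    V = np.eye(H.shape[0]) - rho * np.outer(y, s)  # V = I - rho y s^T, so V^T = I - rho s y^T
    return V.T @ H @ V + rho * np.outer(s, s)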


An alternative interpretation

H_{t+1} is also the solution to

minimize_H    ⟨H_t, H^{−1}⟩ − log det(H_t H^{−1}) − n        (KL divergence between N(0, H^{−1}) and N(0, H_t^{−1}))
subject to    H y_t = s_t

• minimizing some sort of KL divergence subject to the secant equation constraints

Quasi-Newton methods 13-10
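To see why this objective is a (scaled) Gaussian KL divergence, recall the standard closed form for centered Gaussians: KL(N(0, Σ₁) ‖ N(0, Σ₀)) = (1/2) [⟨Σ₀^{−1}, Σ₁⟩ − n + log det Σ₀ − log det Σ₁]. Plugging in Σ₁ = H^{−1} and Σ₀ = H_t^{−1} gives

2 KL(N(0, H^{−1}) ‖ N(0, H_t^{−1})) = ⟨H_t, H^{−1}⟩ − n + log det(H_t^{−1}) − log det(H^{−1}) = ⟨H_t, H^{−1}⟩ − log det(H_t H^{−1}) − n,

which matches the objective above up to the factor 2, irrelevant for the minimization.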


BFGS methods

Algorithm 13.1 BFGS
1: for t = 0, 1, · · · do
2:    x^{t+1} = x^t − η_t H_t ∇f(x^t)    (line search to determine η_t)
3:    H_{t+1} = (I − ρ_t s_t y_t^T) H_t (I − ρ_t y_t s_t^T) + ρ_t s_t s_t^T, where s_t = x^{t+1} − x^t, y_t = ∇f(x^{t+1}) − ∇f(x^t), and ρ_t = 1/(y_t^T s_t)

• each iteration costs O(n²) (in addition to computing gradients)
• no need to solve linear systems or invert matrices
• no magic formula for initialization; possible choices: approximate inverse Hessian at x^0, or identity matrix

Quasi-Newton methods 13-11
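Below is a minimal runnable sketch of Algorithm 13.1 with a simple backtracking (Armijo) line search; the initialization H_0 = I, the line-search constants, and the stopping rule are illustrative choices, not prescribed by the slides:

import numpy as np

def bfgs(f, grad, x0, max_iter=200, tol=1e-8):
    x = x0.astype(float)
    H = np.eye(len(x0))                        # H_0 = identity (one possible choice)
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        d = -H @ g                             # quasi-Newton search direction
        eta = 1.0
        while f(x + eta * d) > f(x) + 1e-4 * eta * (g @ d):
            eta *= 0.5                         # backtrack until sufficient decrease
        x_new = x + eta * d
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        if s @ y > 1e-12:                      # skip the update if curvature is not positive
            rho = 1.0 / (y @ s)
            V = np.eye(len(x0)) - rho * np.outer(y, s)
            H = V.T @ H @ V + rho * np.outer(s, s)   # update (13.3)
        x, g = x_new, g_new
    return x

For instance, bfgs(lambda x: x @ x, lambda x: 2 * x, np.ones(5)) drives the iterate to the origin in a handful of iterations.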


Rank-2 update on H_t^{−1}

From the Sherman-Morrison-Woodbury formula
(A + UV^T)^{−1} = A^{−1} − A^{−1} U (I + V^T A^{−1} U)^{−1} V^T A^{−1},
we can show that the BFGS rule is equivalent to

H_{t+1}^{−1} = H_t^{−1} − (H_t^{−1} s_t s_t^T H_t^{−1}) / (s_t^T H_t^{−1} s_t) + ρ_t y_t y_t^T        (rank-2 update)

Quasi-Newton methods 13-12
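The equivalence is easy to check numerically; the sketch below (not from the lecture) builds a random H_t ≻ 0 and a pair (s_t, y_t) with s_t^T y_t > 0, applies the BFGS update (13.3), and compares the inverse of the result with the rank-2 formula:

import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)                  # a random positive definite H_t
s = rng.standard_normal(n)
y = rng.standard_normal(n)
if s @ y <= 0:                               # enforce the curvature condition s^T y > 0
    y = -y
rho = 1.0 / (y @ s)

V = np.eye(n) - rho * np.outer(y, s)
H_next = V.T @ H @ V + rho * np.outer(s, s)  # BFGS update (13.3)

H_inv = np.linalg.inv(H)
H_inv_next = H_inv - np.outer(H_inv @ s, H_inv @ s) / (s @ H_inv @ s) + rho * np.outer(y, y)

print(np.allclose(np.linalg.inv(H_next), H_inv_next))   # prints True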


Local superlinear convergence

Theorem 13.1 (informal)

Suppose f is strongly convex and has a Lipschitz-continuous Hessian. Under mild conditions, BFGS achieves

lim_{t→∞}  ‖x^{t+1} − x^*‖_2 / ‖x^t − x^*‖_2 = 0

• iteration complexity: larger than Newton methods but smaller than gradient methods
• asymptotic result: holds when t → ∞

Quasi-Newton methods 13-13


Key observation

The BFGS update rule achieves

lim_{t→∞}  ‖(H_t^{−1} − ∇²f(x^*)) (x^{t+1} − x^t)‖_2 / ‖x^{t+1} − x^t‖_2 = 0

Implications

• even though H_t^{−1} may not converge to ∇²f(x^*), it becomes an increasingly more accurate approximation of ∇²f(x^*) along the search direction x^{t+1} − x^t
• asymptotically, x^{t+1} − x^t ≈ −(∇²f(x^t))^{−1} ∇f(x^t), the Newton search direction

Quasi-Newton methods 13-14


Numerical example (from the EE236C lecture notes [2])

minimize_{x∈R^n}  c^T x − Σ_{i=1}^N log(b_i − a_i^T x)

[figure: f(x^{(k)}) − f* versus iteration k, one panel for Newton and one for BFGS; Newton converges in far fewer iterations]

• cost per Newton iteration: O(n³) plus computing ∇²f(x)
• cost per BFGS iteration: O(n²)


n = 100, N = 500

Quasi-Newton methods 13-15
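For reference, the objective and its gradient ∇f(x) = c + A^T (1/(b − Ax)) are straightforward to code. The sketch below generates a random strictly feasible instance (the data-generating choices are mine, for illustration); it can be fed to the bfgs sketch given after Algorithm 13.1, since returning np.inf at infeasible points makes the backtracking line search reject them:

import numpy as np

rng = np.random.default_rng(0)
n, N = 100, 500
A = rng.standard_normal((N, n))             # rows are a_i^T
c = rng.standard_normal(n)
x0 = np.zeros(n)
b = A @ x0 + rng.uniform(0.1, 1.0, size=N)  # b - A x0 > 0, so x0 is strictly feasible

def f(x):
    slack = b - A @ x
    if np.any(slack <= 0):
        return np.inf                       # outside the domain of the log barrier
    return c @ x - np.sum(np.log(slack))

def grad(x):
    return c + A.T @ (1.0 / (b - A @ x))

x_star = bfgs(f, grad, x0)                  # reuses the BFGS sketch from slide 13-11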


Limited-memory quasi-Newton methods

Hessian matrices are usually dense. For large-scale problems, even storing the (inverse) Hessian matrices is prohibitive

Instead of storing full Hessian approximations, one may want to maintain a more parsimonious approximation of the Hessians, using only a few vectors

Quasi-Newton methods 13-16


Limited-memory BFGS (L-BFGS)

H_{t+1} = V_t^T H_t V_t + ρ_t s_t s_t^T        (BFGS update rule)

with V_t = I − ρ_t y_t s_t^T

key idea: maintain a modified version of H_t implicitly by storing the m (e.g., 20) most recent vector pairs (s_t, y_t)

Quasi-Newton methods 13-17


Limited-memory BFGS (L-BFGS)

L-BFGS maintains

H_t^L = V_{t−1}^T · · · V_{t−m}^T H_{t,0}^L V_{t−m} · · · V_{t−1}
        + ρ_{t−m} V_{t−1}^T · · · V_{t−m+1}^T s_{t−m} s_{t−m}^T V_{t−m+1} · · · V_{t−1}
        + ρ_{t−m+1} V_{t−1}^T · · · V_{t−m+2}^T s_{t−m+1} s_{t−m+1}^T V_{t−m+2} · · · V_{t−1}
        + · · · + ρ_{t−1} s_{t−1} s_{t−1}^T

• can be computed recursively (see the two-loop recursion sketch below)
• initialization H_{t,0}^L may vary from iteration to iteration
• only needs to store {(s_i, y_i)}_{t−m ≤ i < t}

Quasi-Newton methods 13-18
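The product H_t^L ∇f(x^t) is all the update needs, and it can be formed from the stored pairs alone via the standard two-loop recursion (see Nocedal & Wright [1]). Below is a NumPy sketch; the scaled-identity initialization H_{t,0}^L = γ_t I with γ_t = s_{t−1}^T y_{t−1} / (y_{t−1}^T y_{t−1}) is one common choice, not the only one:

import numpy as np

def lbfgs_direction(grad_x, s_list, y_list):
    # Compute H_t^L @ grad_x from the m most recent pairs (s_i, y_i), stored oldest to newest
    q = grad_x.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # first loop: newest to oldest
        a = (s @ q) / (y @ s)
        alphas.append(a)
        q = q - a * y
    if s_list:
        s_last, y_last = s_list[-1], y_list[-1]
        gamma = (s_last @ y_last) / (y_last @ y_last)      # H_{t,0}^L = gamma * I
    else:
        gamma = 1.0
    r = gamma * q
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):   # second loop: oldest to newest
        beta = (y @ r) / (y @ s)
        r = r + (a - beta) * s
    return r                                               # equals H_t^L @ grad_x

The L-BFGS step is then x^{t+1} = x^t − η_t · lbfgs_direction(∇f(x^t), s_list, y_list), at O(mn) cost and O(mn) storage per iteration.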


Reference

[1] "Numerical optimization," J. Nocedal, S. Wright, 2000.

[2] "Optimization methods for large-scale systems, EE236C lecture notes," L. Vandenberghe, UCLA.

[3] "Optimization methods for large-scale machine learning," L. Bottou et al., arXiv, 2016.

[4] "Convex optimization, EE364B lecture notes," S. Boyd, Stanford.

Quasi-Newton methods 13-19