ELE 522: Large-Scale Optimization for Data Science

Quasi-Newton methods

Yuxin Chen, Princeton University, Fall 2019


• Newton's method

minimize_{x∈R^n}  f(x)

x^{t+1} = x^t − (∇²f(x^t))^{−1} ∇f(x^t)

Example in R² (page 10–9):

[figure: f(x^{(k)}) − p* versus iteration k, dropping from about 10⁵ to below 10^{−15} within 5 iterations]

• backtracking parameters α = 0.1, β = 0.7
• converges in only 5 steps
• quadratic local convergence

Unconstrained minimization 10–21
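The damped Newton iteration with backtracking shown above can be sketched as follows; the test function, starting point, and tolerance below are hypothetical choices for illustration, not from the slides:

```python
import numpy as np

def newton(f, grad, hess, x0, alpha=0.1, beta=0.7, tol=1e-12, max_iter=50):
    """Damped Newton's method with Armijo backtracking (alpha, beta as on the slide)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        dx = -np.linalg.solve(H, g)          # Newton direction: -(∇²f(x))⁻¹ ∇f(x)
        if -(g @ dx) / 2 <= tol:             # half the squared Newton decrement
            break
        t = 1.0
        while f(x + t * dx) > f(x) + alpha * t * (g @ dx):   # backtracking
            t *= beta
        x = x + t * dx
    return x

# hypothetical smooth, strongly convex test function with minimizer at the origin
f    = lambda x: x[0]**4 + x[0]**2 + x[1]**2
grad = lambda x: np.array([4*x[0]**3 + 2*x[0], 2*x[1]])
hess = lambda x: np.array([[12*x[0]**2 + 2.0, 0.0], [0.0, 2.0]])

x_star = newton(f, grad, hess, np.array([3.0, -2.0]))
```

Near the solution the full step t = 1 is accepted, which is where the quadratic local convergence kicks in.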

minimize_x  max{dist_{C₁}(x), dist_{C₂}(x)}   where dist_C(x) := min_{z∈C} ‖x − z‖₂

find x ∈ C₁ ∩ C₂

∇dist_{C_i}(x) = (x − P_{C_i}(x)) / dist_{C_i}(x)

x^{t+1} = x^t − η_t g^t = x^t − (dist_{C_i}(x^t)/‖g^t‖₂²) · (x^t − P_{C_i}(x^t))/dist_{C_i}(x^t) = P_{C_i}(x^t)

‖x^{t+1} − x*‖₂² ≤ ‖x^t − x*‖₂² − (f(x^t) − f(x*))² / ‖g^t‖₂²

When f is µ-strongly convex, we can improve the preceding bound to (check)

‖x^{t+1} − x*‖₂² ≤ (1 − µη_t)‖x^t − x*‖₂² − 2η_t (f(x^t) − f_opt) + η_t² ‖g^t‖₂²

⟹ f(x^t) − f_opt ≤ ((1 − µη_t)/(2η_t)) ‖x^t − x*‖₂² − (1/(2η_t)) ‖x^{t+1} − x*‖₂² + (η_t/2) ‖g^t‖₂²

Since η_t = 2/(µ(t+1)), we have

f(x^t) − f_opt ≤ (µ(t−1)/4) ‖x^t − x*‖₂² − (µ(t+1)/4) ‖x^{t+1} − x*‖₂² + (1/(µ(t+1))) ‖g^t‖₂²

and hence

t (f(x^t) − f_opt) ≤ (µt(t−1)/4) ‖x^t − x*‖₂² − (µt(t+1)/4) ‖x^{t+1} − x*‖₂² + (1/µ) ‖g^t‖₂²

Summing over all iterations before t, we get

Σ_{k=0}^t k (f(x^k) − f_opt) ≤ 0 − (µt(t+1)/4) ‖x^{t+1} − x*‖₂² + (1/µ) Σ_{k=0}^t ‖g^k‖₂² ≤ tL_f²/µ

⟹ f_best,t − f_opt ≤ (tL_f²/µ) / (Σ_{k=0}^t k) ≤ 2L_f² / (µ(t+1))
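The computation above shows that a Polyak-stepped subgradient iteration on max{dist_{C₁}, dist_{C₂}} reduces to projecting onto the farther set. A minimal numerical sketch, using two hypothetical sets (a halfspace and the unit ball):

```python
import numpy as np

# two hypothetical convex sets: a halfspace C1 and the unit ball C2
proj_halfspace = lambda x: np.array([min(x[0], 0.0), x[1]])   # C1 = {x : x1 ≤ 0}
proj_ball      = lambda x: x / max(1.0, np.linalg.norm(x))    # C2 = {x : ‖x‖2 ≤ 1}

def subgrad_step(x, proj):
    d = np.linalg.norm(x - proj(x))       # dist_C(x)
    g = (x - proj(x)) / d                 # ∇dist_C(x), a unit vector
    return x - (d / (g @ g)) * g          # Polyak-type step: η = dist_C(x)/‖g‖²

x = np.array([2.0, 3.0])
for _ in range(100):
    d1 = np.linalg.norm(x - proj_halfspace(x))
    d2 = np.linalg.norm(x - proj_ball(x))
    if max(d1, d2) < 1e-12:               # x ∈ C1 ∩ C2: feasible point found
        break
    proj = proj_halfspace if d1 >= d2 else proj_ball    # subgradient of the max
    assert np.allclose(subgrad_step(x, proj), proj(x))  # the step IS a projection
    x = proj(x)
```

Each iteration alternately projects onto whichever set is currently farther, which is exactly the classical alternating-projection scheme.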


• quadratic convergence: attains ε accuracy within O(log log(1/ε)) iterations
• typically requires storing and inverting the Hessian ∇²f(x) ∈ R^{n×n}
• a single iteration may last forever; prohibitive storage requirement

Quasi-Newton methods 13-2

• Quasi-Newton methods

key idea: approximate the Hessian matrix using only gradient information

x^{t+1} = x^t − η_t H_t ∇f(x^t),   where H_t is a surrogate of (∇²f(x^t))^{−1}

challenge: how to find a good approximation H_t ≻ 0 of (∇²f(x^t))^{−1}
• using only gradient information
• using limited memory
• achieving super-linear convergence

Quasi-Newton methods 13-3

• Criterion for choosing Ht

Consider the following approximate quadratic model of f(·):

f_t(x) := f(x^{t+1}) + ⟨∇f(x^{t+1}), x − x^{t+1}⟩ + (1/2)(x − x^{t+1})^⊤ H_{t+1}^{−1} (x − x^{t+1})

which satisfies

∇f_t(x) = ∇f(x^{t+1}) + H_{t+1}^{−1}(x − x^{t+1})

One reasonable criterion: gradient matching for the latest two iterates:

∇f_t(x^t) = ∇f(x^t)   (13.1a)
∇f_t(x^{t+1}) = ∇f(x^{t+1})   (13.1b)

Quasi-Newton methods 13-4

Monotonicity

Lemma 2.5. Let f be convex and L-smooth. If η_t ≡ η = 1/L, then

‖x^{t+1} − x*‖₂ ≤ ‖x^t − x*‖₂

where x* is any minimizer with optimal value f(x*).

Proof of Lemma 2.5. It follows that

‖x^{t+1} − x*‖₂² = ‖x^t − x* − η(∇f(x^t) − ∇f(x*))‖₂²       (using ∇f(x*) = 0)
  = ‖x^t − x*‖₂² − 2η ⟨x^t − x*, ∇f(x^t) − ∇f(x*)⟩ + η² ‖∇f(x^t) − ∇f(x*)‖₂²
  ≤ ‖x^t − x*‖₂² − (2η/L) ‖∇f(x^t) − ∇f(x*)‖₂² + η² ‖∇f(x^t) − ∇f(x*)‖₂²       (smoothness)
  = ‖x^t − x*‖₂² − η² ‖∇f(x^t) − ∇f(x*)‖₂²       (since η = 1/L)
  ≤ ‖x^t − x*‖₂²

• Secant equation


(13.1b) holds automatically. To satisfy (13.1a), one requires

∇f(x^{t+1}) + H_{t+1}^{−1}(x^t − x^{t+1}) = ∇f(x^t)

⟺ H_{t+1}^{−1}(x^{t+1} − x^t) = ∇f(x^{t+1}) − ∇f(x^t)   (secant equation)

• the secant equation requires that H_{t+1}^{−1} map the displacement x^{t+1} − x^t into the change of gradients ∇f(x^{t+1}) − ∇f(x^t)

Quasi-Newton methods 13-5

• Secant equation

Ht+1(∇f(xt+1)−∇f(xt))︸ ︷︷ ︸

=:yt

= xt+1 − xt︸ ︷︷ ︸=:st

(13.2)

• only possible when s>t yt > 0, since

s>t yt = y>t Ht+1yt > 0

• admit an infinite number of solutions, since the degrees offreedom O(n2) in choosing H−1t+1 far exceeds the number ofconstraints n in (13.2)• which H−1t+1 shall we choose?

Quasi-Newton methods 13-6
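For a quadratic f the secant equation (13.2) is exact: y_t = ∇²f · s_t, so H = (∇²f)^{−1} satisfies it, and strong convexity gives the curvature condition s_t^⊤y_t > 0. A small sketch on hypothetical random data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Q = A @ A.T + 5 * np.eye(5)          # Hessian of a strongly convex quadratic
grad = lambda x: Q @ x               # ∇f for f(x) = ½ xᵀQx

x_t  = rng.standard_normal(5)
x_t1 = rng.standard_normal(5)
s = x_t1 - x_t                        # displacement s_t
y = grad(x_t1) - grad(x_t)            # gradient change y_t = Q s_t

curvature = s @ y                     # = sᵀQs > 0 by strong convexity
secant_ok = np.allclose(np.linalg.solve(Q, y), s)   # H = Q⁻¹ solves (13.2) exactly
```

For non-quadratic f the secant equation only matches curvature along the single direction s_t, which is why infinitely many H_{t+1} remain admissible.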

• Broyden-Fletcher-Goldfarb-Shanno (BFGS) method

Quasi-Newton methods 13-7

• Closeness to Ht

In addition to the secant equation, choose H_{t+1} sufficiently close to H_t:

minimize_H   ‖H − H_t‖
subject to   H = H^⊤
             H y_t = s_t

for some norm ‖·‖

• exploit past information regarding H_t
• choosing different norms ‖·‖ results in different quasi-Newton methods

Quasi-Newton methods 13-8

• Choice of norm in BFGS

Choosing ‖M‖ := ‖W^{1/2} M W^{1/2}‖_F for any weight matrix W obeying W s_t = y_t, we get

minimize_H   ‖W^{1/2}(H − H_t)W^{1/2}‖_F
subject to   H = H^⊤
             H y_t = s_t

whose solution is the BFGS update rule

H_{t+1} = (I − ρ_t s_t y_t^⊤) H_t (I − ρ_t y_t s_t^⊤) + ρ_t s_t s_t^⊤   (13.3)

with ρ_t = 1/(y_t^⊤ s_t); note that H_{t+1} ≻ 0 if H_t ≻ 0

Quasi-Newton methods 13-9
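The update (13.3) can be checked numerically: it satisfies the secant equation H_{t+1} y_t = s_t by construction and preserves positive definiteness whenever s_t^⊤y_t > 0. A sketch with hypothetical random data:

```python
import numpy as np

def bfgs_update(H, s, y):
    """One BFGS update (13.3): H⁺ = (I − ρ s yᵀ) H (I − ρ y sᵀ) + ρ s sᵀ."""
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(y, s)
    return V.T @ H @ V + rho * np.outer(s, s)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Q = A @ A.T + np.eye(4)               # hypothetical positive definite Hessian
s = rng.standard_normal(4)
y = Q @ s                             # gradient change of a quadratic: sᵀy = sᵀQs > 0

H_new = bfgs_update(np.eye(4), s, y)  # start from any H_t ≻ 0
```

The secant property follows because (I − ρ_t y_t s_t^⊤) y_t = 0, so only the ρ_t s_t s_t^⊤ term survives when the update acts on y_t.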

• An alternative interpretation

Ht+1 is also the solution to

minimize_H   ⟨H_t, H^{−1}⟩ − log det(H_t H^{−1}) − n   (KL divergence between N(0, H^{−1}) and N(0, H_t^{−1}))
subject to   H y_t = s_t

• minimizing some sort of KL divergence subject to the secant equation constraint

Quasi-Newton methods 13-10

• BFGS methods

Algorithm 13.1 BFGS
1: for t = 0, 1, · · · do
2:    x^{t+1} = x^t − η_t H_t ∇f(x^t)   (line search to determine η_t)
3:    H_{t+1} = (I − ρ_t s_t y_t^⊤) H_t (I − ρ_t y_t s_t^⊤) + ρ_t s_t s_t^⊤, where s_t = x^{t+1} − x^t, y_t = ∇f(x^{t+1}) − ∇f(x^t), and ρ_t = 1/(y_t^⊤ s_t)

• each iteration costs O(n²) (in addition to computing gradients)
• no need to solve linear systems or invert matrices
• no magic formula for initialization; possible choices: approximate inverse Hessian at x⁰, or identity matrix

Quasi-Newton methods 13-11
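Algorithm 13.1 can be sketched in a few lines. The Armijo backtracking line search and the quadratic test problem below are illustrative choices; a production implementation would enforce the Wolfe conditions to guarantee y_t^⊤s_t > 0, whereas here strict convexity of the test function guarantees it:

```python
import numpy as np

def bfgs(f, grad, x0, alpha=1e-4, beta=0.5, tol=1e-8, max_iter=500):
    """Minimal BFGS sketch (Algorithm 13.1): Armijo backtracking plus update (13.3),
    initialized with H_0 = I."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    H = np.eye(n)                         # initialization: identity matrix
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        d = -H @ g                        # quasi-Newton search direction
        eta = 1.0
        while f(x + eta * d) > f(x) + alpha * eta * (g @ d):   # backtracking
            eta *= beta
        x_new = x + eta * d
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        rho = 1.0 / (y @ s)
        V = np.eye(n) - rho * np.outer(y, s)
        H = V.T @ H @ V + rho * np.outer(s, s)   # BFGS update (13.3)
        x, g = x_new, g_new
    return x

# hypothetical strongly convex quadratic test problem
Q = np.diag([1.0, 10.0, 100.0])
b = np.array([1.0, 1.0, 1.0])
f    = lambda x: 0.5 * x @ Q @ x - b @ x
grad = lambda x: Q @ x - b

x_star = bfgs(f, grad, np.zeros(3))
```

Note that no linear system is ever solved: the O(n²) matrix-vector product H∇f(x^t) replaces Newton's O(n³) solve.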

• Rank-2 update on H−1t

From the Sherman-Morrison-Woodbury formula

(A + UV^⊤)^{−1} = A^{−1} − A^{−1}U(I + V^⊤A^{−1}U)^{−1}V^⊤A^{−1},

we can show that the BFGS rule is equivalent to the rank-2 update

H_{t+1}^{−1} = H_t^{−1} − (H_t^{−1} s_t s_t^⊤ H_t^{−1}) / (s_t^⊤ H_t^{−1} s_t) + ρ_t y_t y_t^⊤

Quasi-Newton methods 13-12
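The equivalence between (13.3) and the rank-2 update of H_t^{−1} can be verified numerically on hypothetical random data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)           # some H_t ≻ 0
s, y = rng.standard_normal(n), rng.standard_normal(n)
if s @ y <= 0:                        # enforce the curvature condition
    y = -y
rho = 1.0 / (y @ s)

# BFGS update (13.3) applied to H_t directly
V = np.eye(n) - rho * np.outer(y, s)
H_next = V.T @ H @ V + rho * np.outer(s, s)

# equivalent rank-2 update applied to H_t⁻¹
Hinv = np.linalg.inv(H)
Hs = Hinv @ s
Hinv_next = Hinv - np.outer(Hs, Hs) / (s @ Hs) + rho * np.outer(y, y)
```

In the B_t := H_t^{−1} notation this is the familiar Hessian-approximation form of BFGS: B_{t+1} = B_t − B_t s_t s_t^⊤ B_t / (s_t^⊤ B_t s_t) + y_t y_t^⊤ / (y_t^⊤ s_t).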

• Local superlinear convergence

Theorem 13.1 (informal). Suppose f is strongly convex and has a Lipschitz-continuous Hessian. Under mild conditions, BFGS achieves

lim_{t→∞} ‖x^{t+1} − x*‖₂ / ‖x^t − x*‖₂ = 0

• iteration complexity: larger than Newton methods but smaller than gradient methods
• asymptotic result: holds as t → ∞

Quasi-Newton methods 13-13

• Key observation

The BFGS update rule achieves

lim_{t→∞} ‖(H_t^{−1} − ∇²f(x*))(x^{t+1} − x^t)‖₂ / ‖x^{t+1} − x^t‖₂ = 0

Implications:
• even though H_t^{−1} may not converge to ∇²f(x*), it becomes an increasingly accurate approximation of ∇²f(x*) along the search direction x^{t+1} − x^t
• asymptotically, x^{t+1} − x^t ≈ −(∇²f(x^t))^{−1}∇f(x^t), the Newton search direction

Quasi-Newton methods 13-14

• Numerical example (EE236C lecture notes)

minimize_{x∈R^n}  c^⊤x − Σ_{i=1}^N log(b_i − a_i^⊤x)

n = 100, N = 500

[figure: f(x^k) − f* versus iteration k; Newton reaches 10^{−12} accuracy within about 10 iterations, BFGS within about 150]

• cost per Newton iteration: O(n³) plus computing ∇²f(x)
• cost per BFGS iteration: O(n²)

Quasi-Newton methods 13-15

Convergence analysis

Using Lemma 5.4, we immediately arrive at

Theorem 5.3. Suppose f is convex and Lipschitz continuous (i.e. ‖g^t‖_* ≤ L_f) on C, and suppose φ is ρ-strongly convex w.r.t. ‖·‖. Then

f_best,t − f_opt ≤ [ sup_{x∈C} D_φ(x, x⁰) + (L_f²/(2ρ)) Σ_{k=0}^t η_k² ] / Σ_{k=0}^t η_k

• If η_t = (√(2ρR)/L_f) · (1/√t) with R := sup_{x∈C} D_φ(x, x⁰), then

f_best,t − f_opt ≤ O( (L_f √R/√ρ) · (log t)/√t )

◦ one can further remove the log t factor

Mirror descent 5-37


Optimality of Nesterov's method

Interestingly, no first-order methods can improve upon Nesterov's result in general.

More precisely, there exists a convex and L-smooth function f s.t.

f(x^t) − f_opt ≥ 3L‖x⁰ − x*‖₂² / (32(t+1)²)

as long as

x^k ∈ x⁰ + span{∇f(x⁰), · · · , ∇f(x^{k−1})}   (definition of first-order methods)

for all 1 ≤ k ≤ t

Accelerated GD 7-35




• Limited-memory quasi-Newton methods

Hessian matrices are usually dense. For large-scale problems, even storing the (inverse) Hessian matrices is prohibitive.

Instead of storing full Hessian approximations, one may want to maintain a more parsimonious approximation of the Hessians, using only a few vectors.

Quasi-Newton methods 13-16

• Limited-memory BFGS (L-BFGS)

H_{t+1} = V_t^⊤ H_t V_t + ρ_t s_t s_t^⊤   (BFGS update rule)

with V_t = I − ρ_t y_t s_t^⊤

key idea: maintain a modified version of H_t implicitly by storing the m (e.g. 20) most recent vector pairs (s_t, y_t)

Quasi-Newton methods 13-17

• Limited-memory BFGS (L-BFGS)

L-BFGS maintains

H_t^L = V_{t−1}^⊤ · · · V_{t−m}^⊤ H_{t,0}^L V_{t−m} · · · V_{t−1}
   + ρ_{t−m} V_{t−1}^⊤ · · · V_{t−m+1}^⊤ s_{t−m} s_{t−m}^⊤ V_{t−m+1} · · · V_{t−1}
   + ρ_{t−m+1} V_{t−1}^⊤ · · · V_{t−m+2}^⊤ s_{t−m+1} s_{t−m+1}^⊤ V_{t−m+2} · · · V_{t−1}
   + · · · + ρ_{t−1} s_{t−1} s_{t−1}^⊤

• can be computed recursively
• the initialization H_{t,0}^L may vary from iteration to iteration
• only needs to store the m most recent pairs {(s_i, y_i)}_{t−m ≤ i < t}
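The matrix H_t^L is never formed explicitly; the standard two-loop recursion applies it to a vector using only the stored pairs. A sketch (the m = 1 sanity check against the explicit product form is an illustrative choice):

```python
import numpy as np

def lbfgs_direction(g, pairs, gamma=1.0):
    """L-BFGS two-loop recursion: applies H_t^L to g using only the stored
    pairs (s_i, y_i), oldest first, with initialization H_{t,0}^L = gamma * I."""
    q = g.astype(float).copy()
    alphas = []
    for s, y in reversed(pairs):                 # first loop: newest pair first
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        q -= a * y
        alphas.append((rho, a))
    r = gamma * q                                # apply H_{t,0}^L
    for (s, y), (rho, a) in zip(pairs, reversed(alphas)):  # second loop: oldest first
        b = rho * (y @ r)
        r += (a - b) * s
    return r                                     # = H_t^L g

# sanity check against the explicit product form for m = 1, H_{t,0}^L = I
rng = np.random.default_rng(0)
s, y, g = (rng.standard_normal(4) for _ in range(3))
if s @ y <= 0:                                   # enforce the curvature condition
    y = -y
rho = 1.0 / (y @ s)
V = np.eye(4) - rho * np.outer(y, s)
H_explicit = V.T @ V + rho * np.outer(s, s)
```

Each call costs O(mn) time and memory, versus O(n²) for full BFGS.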

• Reference

[1] "Numerical optimization," J. Nocedal and S. Wright, 2000.

[2] "Optimization methods for large-scale systems, EE236C lecture notes," L. Vandenberghe, UCLA.

[3] "Optimization methods for large-scale machine learning," L. Bottou et al., arXiv, 2016.

[4] "Convex optimization, EE364B lecture notes," S. Boyd, Stanford.

Quasi-Newton methods 13-19