ELE 522: Large-Scale Optimization for Data Science
Quasi-Newton methods
Yuxin Chen
Princeton University, Fall 2019
Newton’s method
minimize_{x∈R^n} f(x)

x^{t+1} = x^t − (∇²f(x^t))^{−1} ∇f(x^t)
Examples

example in R² (page 10-9):

[Figure from "Unconstrained minimization," slide 10-21: f(x^(k)) − p* versus iteration k; the error drops from roughly 10^5 to below 10^{−15} within 5 iterations]

• backtracking parameters α = 0.1, β = 0.7
• converges in only 5 steps
• quadratic local convergence
• quadratic convergence: attains ε accuracy within O(log log(1/ε)) iterations
• typically requires storing and inverting the Hessian ∇²f(x) ∈ R^{n×n}
• a single iteration may last forever; prohibitive storage requirement

Quasi-Newton methods 13-2
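To make the update and the line search concrete, here is a minimal Python sketch of damped Newton with backtracking. The test function is the familiar R² example from the Boyd–Vandenberghe slides cited above (assumed here to be the page 10-9 example); the starting point, tolerance, and iteration cap are illustrative choices:

import numpy as np

def newton(f, grad, hess, x0, alpha=0.1, beta=0.7, tol=1e-12, max_iter=50):
    # damped Newton's method with backtracking line search
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        dx = -np.linalg.solve(H, g)          # Newton direction -(∇²f(x))⁻¹ ∇f(x)
        if -(g @ dx) / 2.0 <= tol:           # stop via the Newton decrement λ²/2
            break
        t, fx = 1.0, f(x)
        while f(x + t * dx) > fx + alpha * t * (g @ dx):   # backtracking
            t *= beta
        x = x + t * dx
    return x

# f(x) = exp(x1 + 3x2 - 0.1) + exp(x1 - 3x2 - 0.1) + exp(-x1 - 0.1)
def _exps(x):
    return (np.exp(x[0] + 3 * x[1] - 0.1),
            np.exp(x[0] - 3 * x[1] - 0.1),
            np.exp(-x[0] - 0.1))

def f(x):
    ea, eb, ec = _exps(x)
    return ea + eb + ec

def grad(x):
    ea, eb, ec = _exps(x)
    return np.array([ea + eb - ec, 3 * ea - 3 * eb])

def hess(x):
    ea, eb, ec = _exps(x)
    return np.array([[ea + eb + ec, 3 * (ea - eb)],
                     [3 * (ea - eb), 9 * (ea + eb)]])

x_opt = newton(f, grad, hess, np.array([-1.0, 1.0]))  # converges in a handful of steps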
Quasi-Newton methods
key idea: approximate the Hessian matrix using only gradient information

x^{t+1} = x^t − η_t H_t ∇f(x^t),  where H_t is a surrogate of (∇²f(x^t))^{−1}

challenges: how to find a good approximation H_t ≻ 0 of (∇²f(x^t))^{−1}
• using only gradient information
• using limited memory
• achieving super-linear convergence
Quasi-Newton methods 13-3
Criterion for choosing H_t
Consider the following approximate quadratic model of f(·):

f_t(x) := f(x^{t+1}) + ⟨∇f(x^{t+1}), x − x^{t+1}⟩ + ½ (x − x^{t+1})^⊤ H_{t+1}^{−1} (x − x^{t+1})

which satisfies

∇f_t(x) = ∇f(x^{t+1}) + H_{t+1}^{−1} (x − x^{t+1})

One reasonable criterion: gradient matching for the latest two iterates:

∇f_t(x^t) = ∇f(x^t)   (13.1a)
∇f_t(x^{t+1}) = ∇f(x^{t+1})   (13.1b)
Quasi-Newton methods 13-4
Secant equation
(13.1b) holds automatically. To satisfy (13.1a), one requires

∇f(x^{t+1}) + H_{t+1}^{−1} (x^t − x^{t+1}) = ∇f(x^t)
⟺ H_{t+1}^{−1} (x^{t+1} − x^t) = ∇f(x^{t+1}) − ∇f(x^t)   (the secant equation)

• the secant equation requires that H_{t+1}^{−1} map the displacement x^{t+1} − x^t into the change of gradients ∇f(x^{t+1}) − ∇f(x^t)

Quasi-Newton methods 13-5
Secant equation
H_{t+1} (∇f(x^{t+1}) − ∇f(x^t)) = x^{t+1} − x^t,  i.e.  H_{t+1} y_t = s_t   (13.2)

where y_t := ∇f(x^{t+1}) − ∇f(x^t) and s_t := x^{t+1} − x^t

• only possible when s_t^⊤ y_t > 0, since s_t^⊤ y_t = y_t^⊤ H_{t+1} y_t > 0
• admits an infinite number of solutions, since the O(n²) degrees of freedom in choosing H_{t+1}^{−1} far exceed the n constraints in (13.2)
• which H_{t+1}^{−1} shall we choose?
Quasi-Newton methods 13-6
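To see (13.2) in the simplest case: for a strongly convex quadratic f(x) = ½ x^⊤ A x with A ≻ 0, the gradient change is exactly y_t = A s_t, so H_{t+1} = A^{−1} solves the secant equation and s_t^⊤ y_t = s_t^⊤ A s_t > 0 automatically. A minimal numerical check, where the matrix and the iterates are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)       # A ≻ 0, so f(x) = ½ xᵀAx is strongly convex
grad = lambda x: A @ x            # ∇f(x) = A x

x_t = rng.standard_normal(5)      # two arbitrary consecutive iterates
x_t1 = rng.standard_normal(5)
s = x_t1 - x_t                    # displacement s_t
y = grad(x_t1) - grad(x_t)        # gradient change y_t = A s_t

assert s @ y > 0                                # curvature condition s_tᵀ y_t > 0
assert np.allclose(np.linalg.solve(A, y), s)    # H = A⁻¹ satisfies H y_t = s_t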
Broyden-Fletcher-Goldfarb-Shanno (BFGS) method
Quasi-Newton methods 13-7
Closeness to H_t

In addition to the secant equation, choose H_{t+1} sufficiently close to H_t:

minimize_H  ‖H − H_t‖
subject to  H = H^⊤, H y_t = s_t

for some norm ‖·‖

• exploit past information regarding H_t
• choosing different norms ‖·‖ results in different quasi-Newton methods
Quasi-Newton methods 13-8
Choice of norm in BFGS
Choosing ‖M‖ := ‖W^{1/2} M W^{1/2}‖_F for any weight matrix W obeying W s_t = y_t, we get

minimize_H  ‖W^{1/2} (H − H_t) W^{1/2}‖_F
subject to  H = H^⊤, H y_t = s_t

This admits a closed-form expression:

H_{t+1} = (I − ρ_t s_t y_t^⊤) H_t (I − ρ_t y_t s_t^⊤) + ρ_t s_t s_t^⊤   (13.3)

with ρ_t = 1/(y_t^⊤ s_t). This is the BFGS update rule; it guarantees H_{t+1} ≻ 0 whenever H_t ≻ 0.
Quasi-Newton methods 13-9
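The closed form (13.3) is a few lines of code, and its two key properties, the secant equation (13.2) and preservation of positive definiteness, can be checked numerically. A sketch with illustrative test data:

import numpy as np

def bfgs_update(H, s, y):
    # BFGS update (13.3): H⁺ = (I − ρ s yᵀ) H (I − ρ y sᵀ) + ρ s sᵀ, ρ = 1/(yᵀs)
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(y, s)
    return V.T @ H @ V + rho * np.outer(s, s)

rng = np.random.default_rng(1)
H = np.eye(4)
s, y = rng.standard_normal(4), rng.standard_normal(4)
if s @ y < 0:
    y = -y                                   # enforce the curvature condition sᵀy > 0
H_new = bfgs_update(H, s, y)
assert np.allclose(H_new @ y, s)             # secant equation (13.2) holds
assert np.all(np.linalg.eigvalsh(H_new) > 0) # H⁺ ≻ 0 since H ≻ 0 and sᵀy > 0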
An alternative interpretation
H_{t+1} is also the solution to

minimize_H  ⟨H_t, H^{−1}⟩ − log det(H_t H^{−1}) − n
subject to  H y_t = s_t

where the objective is, up to a factor of 2, the KL divergence between N(0, H^{−1}) and N(0, H_t^{−1}) (see the identity below)

• minimizing some sort of KL divergence subject to the secant equation constraint
Quasi-Newton methods 13-10
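For completeness, the annotation can be checked against the standard formula for the KL divergence between zero-mean Gaussians, a well-known identity quoted here without proof:

KL(N(0, Σ₀) ‖ N(0, Σ₁)) = ½ (tr(Σ₁^{−1} Σ₀) − log det(Σ₁^{−1} Σ₀) − n)

Setting Σ₀ = H^{−1} and Σ₁ = H_t^{−1} gives

2 KL = tr(H_t H^{−1}) − log det(H_t H^{−1}) − n = ⟨H_t, H^{−1}⟩ − log det(H_t H^{−1}) − n,

which is exactly the objective on slide 13-10, up to a harmless factor of 2.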
BFGS methods
Algorithm 13.1 BFGS
1: for t = 0, 1, · · · do
2:    x^{t+1} = x^t − η_t H_t ∇f(x^t)   (line search to determine η_t)
3:    H_{t+1} = (I − ρ_t s_t y_t^⊤) H_t (I − ρ_t y_t s_t^⊤) + ρ_t s_t s_t^⊤, where s_t = x^{t+1} − x^t, y_t = ∇f(x^{t+1}) − ∇f(x^t), and ρ_t = 1/(y_t^⊤ s_t)

• each iteration costs O(n²) (in addition to computing gradients)
• no need to solve linear systems or invert matrices
• no magic formula for initialization; possible choices: approximate inverse Hessian at x^0, or the identity matrix
Quasi-Newton methods 13-11
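A compact sketch of Algorithm 13.1 in Python; the backtracking parameters, the initialization H_0 = I, and the guard that skips the update when y_t^⊤ s_t ≤ 0 are illustrative choices (production implementations typically use a Wolfe line search, which guarantees y_t^⊤ s_t > 0):

import numpy as np

def bfgs(f, grad, x0, alpha=1e-4, beta=0.5, tol=1e-8, max_iter=500):
    # Algorithm 13.1 with a simple backtracking line search and H₀ = I
    n = len(x0)
    x = np.asarray(x0, dtype=float)
    H = np.eye(n)
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        d = -H @ g                           # search direction −H_t ∇f(x^t)
        eta, fx = 1.0, f(x)
        while f(x + eta * d) > fx + alpha * eta * (g @ d):   # backtracking for η_t
            eta *= beta
        x_new = x + eta * d
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g          # s_t and y_t
        if s @ y > 1e-12:                    # only update when the curvature condition holds
            rho = 1.0 / (y @ s)
            V = np.eye(n) - rho * np.outer(y, s)
            H = V.T @ H @ V + rho * np.outer(s, s)   # rank-2 BFGS update (13.3)
        x, g = x_new, g_new
    return x

For convex f the curvature condition y_t^⊤ s_t ≥ 0 holds by monotonicity of the gradient, with strict inequality for strictly convex f, so the guard above is mainly a numerical safeguard.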
Rank-2 update on H_t^{−1}

From the Sherman-Morrison-Woodbury formula

(A + UV^⊤)^{−1} = A^{−1} − A^{−1} U (I + V^⊤ A^{−1} U)^{−1} V^⊤ A^{−1},

we can show that the BFGS rule is equivalent to the rank-2 update

H_{t+1}^{−1} = H_t^{−1} − H_t^{−1} s_t s_t^⊤ H_t^{−1} / (s_t^⊤ H_t^{−1} s_t) + ρ_t y_t y_t^⊤
Quasi-Newton methods 13-12
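The equivalence is easy to confirm numerically: update H_t by (13.3), update H_t^{−1} by the rank-2 formula, and compare. The test data below is arbitrary:

import numpy as np

rng = np.random.default_rng(2)
n = 4
M = rng.standard_normal((n, n))
H = M @ M.T + n * np.eye(n)                 # an arbitrary H_t ≻ 0
s, y = rng.standard_normal(n), rng.standard_normal(n)
if s @ y < 0:
    y = -y                                  # ensure sᵀy > 0
rho = 1.0 / (y @ s)

V = np.eye(n) - rho * np.outer(y, s)
H_new = V.T @ H @ V + rho * np.outer(s, s)  # BFGS rule (13.3) on H_t

Hinv = np.linalg.inv(H)
Hinv_new = (Hinv
            - np.outer(Hinv @ s, Hinv @ s) / (s @ Hinv @ s)   # first rank-1 term
            + rho * np.outer(y, y))                           # second rank-1 term

assert np.allclose(np.linalg.inv(H_new), Hinv_new)            # the two updates agree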
Local superlinear convergence
Theorem 13.1 (informal)
Suppose f is strongly convex and has a Lipschitz-continuous Hessian. Under mild conditions, BFGS achieves

lim_{t→∞} ‖x^{t+1} − x*‖₂ / ‖x^t − x*‖₂ = 0

• iteration complexity: larger than for Newton's method, but smaller than for gradient methods
• asymptotic result: holds as t → ∞
Quasi-Newton methods 13-13
Key observation
The BFGS update rule achieves

lim_{t→∞} ‖(H_t^{−1} − ∇²f(x*)) (x^{t+1} − x^t)‖₂ / ‖x^{t+1} − x^t‖₂ = 0

Implications
• even though H_t^{−1} may not converge to ∇²f(x*), it becomes an increasingly accurate approximation of ∇²f(x*) along the search direction x^{t+1} − x^t
• asymptotically, x^{t+1} − x^t ≈ −(∇²f(x^t))^{−1} ∇f(x^t), the Newton search direction
Quasi-Newton methods 13-14
Numerical example (EE236C lecture notes)

minimize_{x∈R^n}  c^⊤ x − Σ_{i=1}^N log(b_i − a_i^⊤ x)

with n = 100, N = 500

[Figure: f(x^k) − f* versus iteration k; Newton reaches ~10^{−12} accuracy within about 12 iterations, BFGS within about 150 iterations]

• cost per Newton iteration: O(n³) plus computing ∇²f(x)
• cost per BFGS iteration: O(n²)

Quasi-Newton methods 13-15
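One could reproduce this experiment along the following lines, reusing the newton and bfgs sketches given earlier in this document. The random problem data below (chosen so that x = 0 is strictly feasible) is an assumption, as the slide does not specify how (a_i, b_i, c) were generated:

import numpy as np

rng = np.random.default_rng(3)
n, N = 100, 500
A = rng.standard_normal((N, n))         # rows a_iᵀ
b = rng.random(N) + 1.0                 # b_i > 0, so x = 0 is strictly feasible
c = rng.standard_normal(n)

def f(x):
    r = b - A @ x
    return np.inf if np.any(r <= 0) else c @ x - np.log(r).sum()

def grad(x):
    return c + A.T @ (1.0 / (b - A @ x))

def hess(x):
    d = 1.0 / (b - A @ x)
    return (A * (d ** 2)[:, None]).T @ A   # Σ_i a_i a_iᵀ / (b_i − a_iᵀx)²

x0 = np.zeros(n)
x_newton = newton(f, grad, hess, x0)    # O(n³) per iteration (solve + Hessian)
x_bfgs = bfgs(f, grad, x0)              # O(n²) per iteration, gradients only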
Limited-memory quasi-Newton methods
Hessian matrices are usually dense. For large-scale problems, even storing the (inverse) Hessian matrices is prohibitive.

Instead of storing full Hessian approximations, one may want to maintain a more parsimonious approximation of the Hessian, using only a few vectors.
Quasi-Newton methods 13-16
Limited-memory BFGS (L-BFGS)
H_{t+1} = V_t^⊤ H_t V_t + ρ_t s_t s_t^⊤   (BFGS update rule)

with V_t = I − ρ_t y_t s_t^⊤

key idea: maintain a modified version of H_t implicitly, by storing the m (e.g. m = 20) most recent vector pairs (s_t, y_t)
Quasi-Newton methods 13-17
Limited-memory BFGS (L-BFGS)
L-BFGS maintains
H_t^L = V_{t−1}^⊤ · · · V_{t−m}^⊤ H_{t,0}^L V_{t−m} · · · V_{t−1}
    + ρ_{t−m} V_{t−1}^⊤ · · · V_{t−m+1}^⊤ s_{t−m} s_{t−m}^⊤ V_{t−m+1} · · · V_{t−1}
    + ρ_{t−m+1} V_{t−1}^⊤ · · · V_{t−m+2}^⊤ s_{t−m+1} s_{t−m+1}^⊤ V_{t−m+2} · · · V_{t−1}
    + · · · + ρ_{t−1} s_{t−1} s_{t−1}^⊤

• can be computed recursively
• the initialization H_{t,0}^L may vary from iteration to iteration
• only needs to store {(s_i, y_i)}_{t−m ≤ i < t}
Quasi-Newton methods 13-18
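In practice one never forms H_t^L explicitly: the product H_t^L ∇f(x^t) is computed in O(mn) time by the standard two-loop recursion of Nocedal and Wright [1]. A sketch, where the scaling H_{t,0}^L = γ_t I with γ_t = s_{t−1}^⊤ y_{t−1} / (y_{t−1}^⊤ y_{t−1}) is one common initialization choice:

import numpy as np

def lbfgs_direction(g, s_list, y_list):
    # two-loop recursion: returns H_t^L g using only the stored pairs {(s_i, y_i)}
    q = np.asarray(g, dtype=float).copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest → oldest
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    if s_list:                                             # H_{t,0}^L = γ_t I scaling
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):  # oldest → newest
        rho = 1.0 / (y @ s)
        b = rho * (y @ q)
        q += (a - b) * s
    return q

# usage inside an optimization loop:  d = -lbfgs_direction(grad(x), S, Y)
# after each step, append (s_t, y_t) to S, Y and drop the oldest pair once len(S) > m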
References

[1] "Numerical optimization," J. Nocedal and S. Wright, 2000.

[2] "Optimization methods for large-scale systems, EE236C lecture notes," L. Vandenberghe, UCLA.

[3] "Optimization methods for large-scale machine learning," L. Bottou et al., arXiv, 2016.

[4] "Convex optimization, EE364B lecture notes," S. Boyd, Stanford.
Quasi-Newton methods 13-19