
  • ELE 522: Large-Scale Optimization for Data Science

    Quasi-Newton methods

    Yuxin Chen

    Princeton University, Fall 2019

  • Newton’s method

    minimize_{x ∈ R^n}  f(x)

    x^{t+1} = x^t − (∇²f(x^t))^{−1} ∇f(x^t)

    [Figure: example in R² ("Unconstrained minimization", pages 10-9 and 10-21): contours with iterates x^(0), x^(1), and a plot of f(x^(k)) − p* versus k (log scale, 10^{−15} to 10^5) for k = 0, …, 5, with backtracking parameters α = 0.1, β = 0.7; converges in only 5 steps; quadratic local convergence]


    • quadratic convergence: attains ε accuracy within O(log log(1/ε)) iterations
    • typically requires storing and inverting the Hessian ∇²f(x) ∈ R^{n×n}
    • a single iteration may last forever; prohibitive storage requirement

    (a minimal damped-Newton sketch with backtracking is included below)

    Quasi-Newton methods 13-2
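    The following is a minimal, hedged sketch of a damped Newton iteration using the same backtracking parameters (α = 0.1, β = 0.7) as the example above. The ridge-regularized log-sum-exp test problem and all other constants are illustrative choices, not taken from the lecture.

    # Minimal sketch (Python/NumPy) of a damped Newton iteration with backtracking.
    import numpy as np

    def newton(f, grad, hess, x0, alpha=0.1, beta=0.7, tol=1e-12, max_iter=50):
        x = x0.astype(float)
        for _ in range(max_iter):
            g, H = grad(x), hess(x)
            d = np.linalg.solve(H, -g)      # Newton direction; avoids forming H^{-1}
            if -(g @ d) / 2 <= tol:         # stop when the Newton decrement is tiny
                break
            t = 1.0
            while f(x + t * d) > f(x) + alpha * t * (g @ d):   # backtracking line search
                t *= beta
            x = x + t * d
        return x

    # illustrative test problem: f(x) = log(sum_i exp(a_i^T x + b_i)) + (mu/2)*||x||^2
    rng = np.random.default_rng(0)
    A, b, mu = rng.standard_normal((30, 5)), rng.standard_normal(30), 1e-2

    def f(x):
        return np.log(np.exp(A @ x + b).sum()) + 0.5 * mu * (x @ x)

    def grad(x):
        p = np.exp(A @ x + b); p /= p.sum()
        return A.T @ p + mu * x

    def hess(x):
        p = np.exp(A @ x + b); p /= p.sum()
        return A.T @ (np.diag(p) - np.outer(p, p)) @ A + mu * np.eye(len(x))

    x_hat = newton(f, grad, hess, np.zeros(5))
    print(np.linalg.norm(grad(x_hat)))      # gradient norm at the returned point (tiny)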

  • Quasi-Newton methods

    key idea: approximate the Hessian matrix using only gradient information

    x^{t+1} = x^t − η_t H_t ∇f(x^t),   where H_t is a surrogate of (∇²f(x^t))^{−1}

    challenges: how to find a good approximation H_t ≻ 0 of (∇²f(x^t))^{−1}

    • using only gradient information
    • using limited memory
    • achieving super-linear convergence

    Quasi-Newton methods 13-3

  • Criterion for choosing H_t

    Consider the following approximate quadratic model of f(·):

    f_t(x) := f(x^{t+1}) + 〈∇f(x^{t+1}), x − x^{t+1}〉 + (1/2) (x − x^{t+1})^⊤ H_{t+1}^{−1} (x − x^{t+1})

    which satisfies

    ∇f_t(x) = ∇f(x^{t+1}) + H_{t+1}^{−1} (x − x^{t+1})

    One reasonable criterion: gradient matching for the latest two iterates:

    ∇f_t(x^t) = ∇f(x^t)            (13.1a)
    ∇f_t(x^{t+1}) = ∇f(x^{t+1})    (13.1b)

    Quasi-Newton methods 13-4

  • Secant equation


    (13.1b) holds automatically. To satisfy (13.1a), one requires

    ∇f(x^{t+1}) + H_{t+1}^{−1} (x^t − x^{t+1}) = ∇f(x^t)

    ⇐⇒  H_{t+1}^{−1} (x^{t+1} − x^t) = ∇f(x^{t+1}) − ∇f(x^t)    (secant equation)

    • the secant equation requires that H_{t+1}^{−1} map the displacement x^{t+1} − x^t into the change of gradients ∇f(x^{t+1}) − ∇f(x^t)

    (a quick numerical check on a quadratic follows)

    Quasi-Newton methods 13-5
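    As a sanity check (an illustrative sketch with random data, not from the slides): for a quadratic f(x) = ½ x^⊤ A x + b^⊤ x, the exact inverse Hessian A^{−1} satisfies the secant equation, since ∇f(x^{t+1}) − ∇f(x^t) = A (x^{t+1} − x^t).

    # For f(x) = 0.5 x^T A x + b^T x, grad f(x) = A x + b, so A^{-1} y = s exactly.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 5
    M = rng.standard_normal((n, n))
    A = M @ M.T + np.eye(n)                 # positive definite Hessian
    b = rng.standard_normal(n)
    grad = lambda x: A @ x + b

    x0, x1 = rng.standard_normal(n), rng.standard_normal(n)   # two arbitrary iterates
    s = x1 - x0                             # displacement s_t
    y = grad(x1) - grad(x0)                 # change of gradients y_t
    print(np.allclose(np.linalg.inv(A) @ y, s))   # expect: True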

  • Secant equation

    H_{t+1} (∇f(x^{t+1}) − ∇f(x^t)) = x^{t+1} − x^t,   i.e.  H_{t+1} y_t = s_t    (13.2)

    where y_t := ∇f(x^{t+1}) − ∇f(x^t) and s_t := x^{t+1} − x^t

    • only possible when s_t^⊤ y_t > 0, since s_t^⊤ y_t = y_t^⊤ H_{t+1} y_t > 0 whenever H_{t+1} ≻ 0
    • admits an infinite number of solutions, since the O(n²) degrees of freedom in choosing H_{t+1}^{−1} far exceed the n constraints in (13.2)
    • which H_{t+1}^{−1} shall we choose?

    Quasi-Newton methods 13-6

  • Broyden-Fletcher-Goldfarb-Shanno (BFGS) method

    Quasi-Newton methods 13-7

  • Closeness to H_t

    In addition to the secant equation, choose H_{t+1} sufficiently close to H_t:

    minimize_H   ‖H − H_t‖
    subject to   H = H^⊤
                 H y_t = s_t

    for some norm ‖ · ‖

    • exploits past information regarding H_t
    • choosing different norms ‖ · ‖ results in different quasi-Newton methods

    Quasi-Newton methods 13-8

  • Choice of norm in BFGS

    Choosing ‖M‖ := ‖W^{1/2} M W^{1/2}‖_F for any weight matrix W obeying W s_t = y_t, we get

    minimize_H   ‖W^{1/2} (H − H_t) W^{1/2}‖_F
    subject to   H = H^⊤
                 H y_t = s_t

    This admits a closed-form expression

    H_{t+1} = (I − ρ_t s_t y_t^⊤) H_t (I − ρ_t y_t s_t^⊤) + ρ_t s_t s_t^⊤    (13.3)

    with ρ_t = 1 / (y_t^⊤ s_t)   (the BFGS update rule; H_{t+1} ≻ 0 if H_t ≻ 0)

    (a short code sketch of this update follows)

    Quasi-Newton methods 13-9
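    Below is a minimal sketch of the update (13.3) together with quick checks that it satisfies the secant equation, preserves symmetry, and preserves positive definiteness when s_t^⊤ y_t > 0. The random (s, y) pair is purely illustrative.

    # Minimal sketch of the BFGS update (13.3); assumes s^T y > 0.
    import numpy as np

    def bfgs_update(H, s, y):
        """Return (I - rho*s*y^T) H (I - rho*y*s^T) + rho*s*s^T with rho = 1/(y^T s)."""
        rho = 1.0 / (y @ s)
        V = np.eye(len(s)) - rho * np.outer(y, s)
        return V.T @ H @ V + rho * np.outer(s, s)

    rng = np.random.default_rng(0)
    n = 5
    H = np.eye(n)                            # some H_t > 0
    s = rng.standard_normal(n)
    y = s + 0.1 * rng.standard_normal(n)     # generic pair with s^T y > 0
    H_new = bfgs_update(H, s, y)
    print(np.allclose(H_new @ y, s))              # secant equation (13.2): True
    print(np.allclose(H_new, H_new.T))            # symmetry: True
    print(np.all(np.linalg.eigvalsh(H_new) > 0))  # positive definiteness: True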

  • An alternative interpretation

    H_{t+1} is also the solution to

    minimize_H   〈H_t, H^{−1}〉 − log det(H_t H^{−1}) − n
    subject to   H y_t = s_t

    where the objective equals (twice) the KL divergence between N(0, H^{−1}) and N(0, H_t^{−1})

    • minimizing some sort of KL divergence subject to the secant equation constraint

    Quasi-Newton methods 13-10

  • BFGS methods

    Algorithm 13.1 BFGS
    1: for t = 0, 1, · · · do
    2:    x^{t+1} = x^t − η_t H_t ∇f(x^t)   (line search to determine η_t)
    3:    H_{t+1} = (I − ρ_t s_t y_t^⊤) H_t (I − ρ_t y_t s_t^⊤) + ρ_t s_t s_t^⊤, where s_t = x^{t+1} − x^t, y_t = ∇f(x^{t+1}) − ∇f(x^t), and ρ_t = 1 / (y_t^⊤ s_t)

    • each iteration costs O(n²) (in addition to computing gradients)
    • no need to solve linear systems or invert matrices
    • no magic formula for initialization; possible choices: approximate inverse Hessian at x^0, or identity matrix

    (a runnable sketch of this loop is given below)

    Quasi-Newton methods 13-11
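    A minimal runnable sketch of Algorithm 13.1, with an identity initialization and a simple backtracking line search. The strongly convex quadratic test problem and all constants are illustrative choices, not part of the lecture.

    # Minimal sketch of Algorithm 13.1 (BFGS) with a backtracking line search.
    import numpy as np

    def bfgs(f, grad, x0, tol=1e-8, max_iter=500):
        x = x0.astype(float)
        n = len(x)
        H = np.eye(n)                       # H_0 = identity (one suggested choice)
        g = grad(x)
        for _ in range(max_iter):
            if np.linalg.norm(g) <= tol:
                break
            d = -H @ g                      # search direction -H_t grad f(x^t)
            eta = 1.0                       # backtracking line search for eta_t
            while f(x + eta * d) > f(x) + 1e-4 * eta * (g @ d):
                eta *= 0.5
            x_new = x + eta * d
            g_new = grad(x_new)
            s, y = x_new - x, g_new - g
            if s @ y > 1e-12:               # keep H_t positive definite
                rho = 1.0 / (y @ s)
                V = np.eye(n) - rho * np.outer(y, s)
                H = V.T @ H @ V + rho * np.outer(s, s)   # BFGS update (13.3)
            x, g = x_new, g_new
        return x

    # illustrative test: strongly convex quadratic f(x) = 0.5 x^T A x - b^T x
    rng = np.random.default_rng(1)
    M = rng.standard_normal((20, 20))
    A = M @ M.T + np.eye(20)
    b = rng.standard_normal(20)
    x_hat = bfgs(lambda x: 0.5 * x @ A @ x - b @ x, lambda x: A @ x - b, np.zeros(20))
    print(np.linalg.norm(x_hat - np.linalg.solve(A, b)))   # distance to the minimizer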

  • Rank-2 update on H_t^{−1}

    From the Sherman-Morrison-Woodbury formula
    (A + U V^⊤)^{−1} = A^{−1} − A^{−1} U (I + V^⊤ A^{−1} U)^{−1} V^⊤ A^{−1},
    we can show that the BFGS rule is equivalent to the rank-2 update

    H_{t+1}^{−1} = H_t^{−1} − (1 / (s_t^⊤ H_t^{−1} s_t)) H_t^{−1} s_t s_t^⊤ H_t^{−1} + ρ_t y_t y_t^⊤

    (this equivalence is verified numerically below)

    Quasi-Newton methods 13-12
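    A quick numerical check, on random illustrative data, that the rank-2 update of H_t^{−1} above agrees with inverting the BFGS update (13.3) of H_t.

    # Check: inv(BFGS-updated H) == direct rank-2 update of inv(H).
    import numpy as np

    rng = np.random.default_rng(2)
    n = 6
    M = rng.standard_normal((n, n))
    H = M @ M.T + np.eye(n)                 # some H_t > 0
    s = rng.standard_normal(n)
    y = s + 0.2 * rng.standard_normal(n)    # generic pair with s^T y > 0
    rho = 1.0 / (y @ s)

    V = np.eye(n) - rho * np.outer(y, s)    # BFGS update (13.3) of H_t, then invert
    H_new = V.T @ H @ V + rho * np.outer(s, s)

    Hinv = np.linalg.inv(H)                 # direct rank-2 update of H_t^{-1}
    Hinv_new = Hinv - np.outer(Hinv @ s, Hinv @ s) / (s @ Hinv @ s) + rho * np.outer(y, y)

    print(np.allclose(np.linalg.inv(H_new), Hinv_new))   # expect: True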

  • Local superlinear convergence

    Theorem 13.1 (informal)
    Suppose f is strongly convex and has a Lipschitz-continuous Hessian. Under mild conditions, BFGS achieves

    lim_{t→∞}  ‖x^{t+1} − x*‖_2 / ‖x^t − x*‖_2 = 0

    • iteration complexity: larger than Newton's method but smaller than gradient methods
    • asymptotic result: holds as t → ∞

    Quasi-Newton methods 13-13

  • Key observation

    The BFGS update rule achieves

    lim_{t→∞}  ‖(H_t^{−1} − ∇²f(x*)) (x^{t+1} − x^t)‖_2 / ‖x^{t+1} − x^t‖_2 = 0

    Implications
    • even though H_t^{−1} may not converge to ∇²f(x*), it becomes an increasingly accurate approximation of ∇²f(x*) along the search direction x^{t+1} − x^t
    • asymptotically, x^{t+1} − x^t ≈ −(∇²f(x^t))^{−1} ∇f(x^t), the Newton search direction

    Quasi-Newton methods 13-14

  • Numerical example (EE236C lecture notes)

    minimize_{x ∈ R^n}   c^⊤ x − Σ_{i=1}^N log(b_i − a_i^⊤ x),    with n = 100, N = 500

    [Figure: f(x^k) − f* versus iteration k (log scale, 10^{−12} to 10^3), for Newton (k up to about 12) and BFGS (k up to about 150)]

    • cost per Newton iteration: O(n³) plus computing ∇²f(x)
    • cost per BFGS iteration: O(n²)


    (a small script reproducing the flavor of this comparison on random data follows)

    Quasi-Newton methods 13-15
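    The sketch below reproduces the flavor of this example with a simple BFGS loop; the random feasible instance with smaller n, N, the treatment of infeasible trial points as f = +∞, and all constants are illustrative assumptions rather than the setup behind the figure above.

    # Analytic-centering-style example: minimize c^T x - sum_i log(b_i - a_i^T x).
    import numpy as np

    rng = np.random.default_rng(0)
    n, N = 20, 100
    A = rng.standard_normal((N, n))
    c = rng.standard_normal(n)
    b = rng.uniform(1.0, 2.0, N)            # b > 0, so x = 0 is strictly feasible

    def f(x):
        slack = b - A @ x
        return np.inf if np.any(slack <= 0) else c @ x - np.log(slack).sum()

    def grad(x):
        return c + A.T @ (1.0 / (b - A @ x))

    # BFGS (Algorithm 13.1) with backtracking; infeasible points rejected via f = inf
    x, H, g = np.zeros(n), np.eye(n), grad(np.zeros(n))
    for t in range(500):
        if np.linalg.norm(g) <= 1e-8:
            break
        d = -H @ g
        eta = 1.0
        while f(x + eta * d) > f(x) + 1e-4 * eta * (g @ d):
            eta *= 0.5
        x_new = x + eta * d
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        if s @ y > 1e-12:
            rho = 1.0 / (y @ s)
            V = np.eye(n) - rho * np.outer(y, s)
            H = V.T @ H @ V + rho * np.outer(s, s)   # O(n^2) BFGS update
        x, g = x_new, g_new

    print(t, np.linalg.norm(grad(x)))       # iterations used and final gradient norm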

  • Limited-memory quasi-Newton methods

    Hessian matrices are usually dense. For large-scale problems, even storing the (inverse) Hessian matrices is prohibitive

    Instead of storing full Hessian approximations, one may want to maintain a more parsimonious approximation of the Hessians, using only a few vectors

    Quasi-Newton methods 13-16

  • Limited-memory BFGS (L-BFGS)

    H_{t+1} = V_t^⊤ H_t V_t + ρ_t s_t s_t^⊤    (BFGS update rule)

    with V_t = I − ρ_t y_t s_t^⊤

    key idea: maintain a modified version of H_t implicitly by storing the m (e.g. 20) most recent vector pairs (s_t, y_t)

    Quasi-Newton methods 13-17

  • Limited-memory BFGS (L-BFGS)

    L-BFGS maintains

    H_t^L = V_{t−1}^⊤ · · · V_{t−m}^⊤ H_{t,0}^L V_{t−m} · · · V_{t−1}
            + ρ_{t−m} V_{t−1}^⊤ · · · V_{t−m+1}^⊤ s_{t−m} s_{t−m}^⊤ V_{t−m+1} · · · V_{t−1}
            + ρ_{t−m+1} V_{t−1}^⊤ · · · V_{t−m+2}^⊤ s_{t−m+1} s_{t−m+1}^⊤ V_{t−m+2} · · · V_{t−1}
            + · · · + ρ_{t−1} s_{t−1} s_{t−1}^⊤

    • can be computed recursively, e.g. via the standard two-loop recursion [1] (sketched below)
    • initialization H_{t,0}^L may vary from iteration to iteration
    • only needs to store {(s_i, y_i)}_{t−m ≤ i ≤ t−1}

    Quasi-Newton methods 13-18
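    The product form above is usually evaluated via the two-loop recursion of Nocedal & Wright [1], which computes H_t^L g directly from the stored pairs without ever forming H_t^L. A minimal sketch follows; the diagonal scaling chosen for H_{t,0}^L and the toy data are illustrative assumptions.

    # Two-loop recursion: compute H g from m stored pairs (s_i, y_i), oldest first.
    import numpy as np

    def lbfgs_direction(g, s_list, y_list):
        q = g.copy()
        alphas = []
        for s, y in zip(reversed(s_list), reversed(y_list)):   # newest to oldest
            rho = 1.0 / (y @ s)
            alpha = rho * (s @ q)
            q -= alpha * y
            alphas.append(alpha)
        # initialization H_{t,0}^L = gamma * I (a common heuristic scaling)
        gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1]) if s_list else 1.0
        r = gamma * q
        for (s, y), alpha in zip(zip(s_list, y_list), reversed(alphas)):  # oldest to newest
            rho = 1.0 / (y @ s)
            r += (alpha - rho * (y @ r)) * s
        return r                              # = H_t^L g

    # sanity check on toy data: compare against the explicit product form above
    rng = np.random.default_rng(3)
    n, m = 8, 3
    s_list = [rng.standard_normal(n) for _ in range(m)]
    y_list = [s + 0.1 * rng.standard_normal(n) for s in s_list]   # s^T y > 0
    g = rng.standard_normal(n)

    gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    H = gamma * np.eye(n)
    for s, y in zip(s_list, y_list):          # apply BFGS updates oldest to newest
        rho = 1.0 / (y @ s)
        V = np.eye(n) - rho * np.outer(y, s)
        H = V.T @ H @ V + rho * np.outer(s, s)
    print(np.allclose(lbfgs_direction(g, s_list, y_list), H @ g))   # expect: True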

  • Reference

    [1] "Numerical optimization," J. Nocedal, S. Wright, 2000.

    [2] "Optimization methods for large-scale systems, EE236C lecture notes," L. Vandenberghe, UCLA.

    [3] "Optimization methods for large-scale machine learning," L. Bottou et al., arXiv, 2016.

    [4] "Convex optimization, EE364B lecture notes," S. Boyd, Stanford.

    Quasi-Newton methods 13-19