Quasi-Newton methods - Princeton yc5/ele522_optimization/lectures/quasi_

  • ELE 522: Large-Scale Optimization for Data Science

    Quasi-Newton methods

    Yuxin Chen

    Princeton University, Fall 2019

  • Newton’s method

    minimizex∈Rn f(x)

    xt+1 = xt − (∇2f(xt))−1∇f(xt)

    Example in R2 (page 10–9):

    (figure: f(x(k)) − p⋆ vs. iteration k, on a log scale from 10−15 to 105, showing the iterates x(0), x(1), . . . reaching machine precision within 5 steps)

    • backtracking parameters α = 0.1, β = 0.7
    • converges in only 5 steps
    • quadratic local convergence

    Unconstrained minimization 10–21
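
    As a concrete illustration of the update above, here is a minimal sketch of damped Newton's method with backtracking line search, using the slide's parameters α = 0.1, β = 0.7; the toy function f and its derivatives below are illustrative stand-ins, not code from the lecture:

```python
import numpy as np

def newton(f, grad, hess, x0, alpha=0.1, beta=0.7, tol=1e-10, max_iter=50):
    """Damped Newton's method with backtracking line search.

    f, grad, hess are user-supplied callables (illustrative names);
    alpha, beta are the backtracking parameters from the slide.
    """
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad(x)
        dx = -np.linalg.solve(hess(x), g)   # Newton direction: -(∇²f)⁻¹ ∇f
        lam2 = -g @ dx                      # Newton decrement squared
        if lam2 / 2 <= tol:                 # standard stopping criterion
            break
        t = 1.0
        while f(x + t * dx) > f(x) + alpha * t * (g @ dx):
            t *= beta                       # backtrack until sufficient decrease
        x = x + t * dx
    return x

# usage on a toy smooth convex function f(x) = x1^4 + x2^2
f = lambda x: x[0]**4 + x[1]**2
grad = lambda x: np.array([4 * x[0]**3, 2 * x[1]])
hess = lambda x: np.diag([12 * x[0]**2 + 1e-8, 2.0])  # tiny ridge keeps it invertible at 0
print(newton(f, grad, hess, np.array([2.0, 3.0])))
```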

    Example: finding a point in the intersection of two convex sets

    find x ∈ C1 ∩ C2

    minimizex max {distC1(x), distC2(x)}   where distC(x) := minz∈C ‖x − z‖2

    ∇distCi(x) = (x − PCi(x)) / distCi(x)

    With Polyak’s stepsize, the subgradient step becomes a projection:

    xt+1 = xt − ηt gt = xt − (distCi(xt) / ‖gt‖2²) · (xt − PCi(xt)) / distCi(xt) = PCi(xt)

    ‖xt+1 − x⋆‖2² ≤ ‖xt − x⋆‖2² − (f(xt) − f(x⋆))² / ‖gt‖2²

    When f is µ-strongly convex, we can improve Lemma @@@ to (check)

    ‖xt+1 − x⋆‖2² ≤ (1 − µηt) ‖xt − x⋆‖2² − 2ηt (f(xt) − fopt) + ηt² ‖gt‖2²

    =⇒ f(xt) − fopt ≤ ((1 − µηt) / (2ηt)) ‖xt − x⋆‖2² − (1 / (2ηt)) ‖xt+1 − x⋆‖2² + (ηt / 2) ‖gt‖2²

    Since ηt = 2/(µ(t + 1)), we have

    f(xt) − fopt ≤ (µ(t − 1) / 4) ‖xt − x⋆‖2² − (µ(t + 1) / 4) ‖xt+1 − x⋆‖2² + (1 / (µ(t + 1))) ‖gt‖2²

    and hence

    t (f(xt) − fopt) ≤ (µt(t − 1) / 4) ‖xt − x⋆‖2² − (µt(t + 1) / 4) ‖xt+1 − x⋆‖2² + (1/µ) ‖gt‖2²

    Summing over all iterations before t (the distance terms telescope), we get

    ∑k=0..t k (f(xk) − fopt) ≤ 0 − (µt(t + 1) / 4) ‖xt+1 − x⋆‖2² + (1/µ) ∑k=0..t ‖gk‖2² ≤ t Lf² / µ

    =⇒ fbest,t − fopt ≤ (Lf² / µ) · t / (∑k=0..t k) ≤ 2Lf² / (µ(t + 1))
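
    Since Polyak's stepsize turns each subgradient step into an exact projection, the scheme above is alternating projections. A minimal sketch, where the two halfspaces C1, C2 and their projection formulas are illustrative assumptions, not from the slides:

```python
import numpy as np

# project onto the halfspace {x : a·x <= b}
def proj_halfspace(x, a, b):
    viol = a @ x - b
    return x if viol <= 0 else x - (viol / (a @ a)) * a

# illustrative choices: C1 = {x : x1 + x2 <= 1}, C2 = {x : x1 - x2 <= -1}
P1 = lambda x: proj_halfspace(x, np.array([1.0, 1.0]), 1.0)
P2 = lambda x: proj_halfspace(x, np.array([1.0, -1.0]), -1.0)

x = np.array([5.0, -3.0])
for t in range(100):
    # Polyak-stepsize subgradient step on max{dist_C1, dist_C2}:
    # equivalent to projecting onto the currently farther set
    d1, d2 = np.linalg.norm(x - P1(x)), np.linalg.norm(x - P2(x))
    if max(d1, d2) < 1e-12:      # x ∈ C1 ∩ C2: done
        break
    x = P1(x) if d1 >= d2 else P2(x)
print(x)  # a point (approximately) in the intersection
```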


    • quadratic convergence: attains ε accuracy within O(log log 1/ε) iterations
    • typically requires storing and inverting the Hessian ∇2f(x) ∈ Rn×n
    • a single iteration can be prohibitively expensive in high dimension; prohibitive storage requirement

    Quasi-Newton methods 13-2
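
    The per-iteration cost gap is easy to see numerically: solving the n×n Newton system is cubic in n, whereas a quasi-Newton step only needs a matrix-vector product. An illustrative timing sketch (numbers will vary by machine):

```python
import numpy as np, time

n = 2000
H = np.random.standard_normal((n, n))
H = H @ H.T + n * np.eye(n)           # stand-in for a dense n×n Hessian
g = np.random.standard_normal(n)      # stand-in for a gradient

t0 = time.perf_counter(); np.linalg.solve(H, g); t1 = time.perf_counter()
t2 = time.perf_counter(); H @ g;                 t3 = time.perf_counter()
print(f"linear solve (Newton-like step): {t1 - t0:.3f}s")
print(f"matrix-vector product (quasi-Newton-like step): {t3 - t2:.4f}s")
```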

  • Quasi-Newton methods

    key idea: approximate the Hessian matrix using only gradient information

    xt+1 = xt − ηt Ht ∇f(xt),   where Ht is a surrogate of (∇2f(xt))−1

    challenges: how to find a good approximation Ht ≻ 0 of (∇2f(xt))−1

    • using only gradient information
    • using limited memory
    • achieving super-linear convergence

    Quasi-Newton methods 13-3
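
    A generic quasi-Newton iteration then looks like the following; a minimal sketch in which update_H stands in for whichever approximation rule is chosen later (e.g. BFGS), and grad is a user-supplied gradient (both names are illustrative):

```python
import numpy as np

def quasi_newton(grad, x0, update_H, eta=1.0, tol=1e-8, max_iter=500):
    """Generic quasi-Newton template: x_{t+1} = x_t - eta * H_t * grad(x_t).

    update_H(H, s, y) returns the next surrogate of the inverse Hessian
    from the displacement s = x_{t+1} - x_t and the gradient change
    y = grad(x_{t+1}) - grad(x_t); its choice defines the method.
    In practice eta would come from a line search; a constant is used here.
    """
    x = x0.astype(float)
    g = grad(x)
    H = np.eye(len(x))                 # initial surrogate of (∇²f)⁻¹
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        x_new = x - eta * H @ g        # quasi-Newton step
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g    # the quantities in the secant equation
        H = update_H(H, s, y)
        x, g = x_new, g_new
    return x
```

    With update_H = lambda H, s, y: H this template degenerates to plain gradient descent; the classical BFGS choice named on slide 13-7 is sketched at the end of this section.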

  • Criterion for choosing Ht

    Consider the following approximate quadratic model of f(·):

    ft(x) := f(xt+1) + 〈∇f(xt+1), x − xt+1〉 + (1/2) (x − xt+1)⊤ H−1t+1 (x − xt+1)

    which satisfies

    ∇ft(x) = ∇f(xt+1) + H−1t+1 (x − xt+1)

    One reasonable criterion: gradient matching for the latest two iterates:

    ∇ft(xt) = ∇f(xt) (13.1a)
    ∇ft(xt+1) = ∇f(xt+1) (13.1b)

    Quasi-Newton methods 13-4
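
    A quick numerical check of this model: (13.1b) holds by construction, while (13.1a) pins down how H−1t+1 must act on xt+1 − xt. For a quadratic f, choosing Ht+1 = A−1 satisfies both exactly; a small sketch with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)            # positive definite: f(x) = ½ xᵀA x + bᵀx
b = rng.standard_normal(n)
grad = lambda x: A @ x + b             # exact gradient of the quadratic

xt, xt1 = rng.standard_normal(n), rng.standard_normal(n)
H = np.linalg.inv(A)                   # for a quadratic, Ht+1 = A⁻¹ fits exactly

# model gradient ∇ft(x) = ∇f(xt+1) + H⁻¹(x − xt+1)
model_grad = lambda x: grad(xt1) + np.linalg.solve(H, x - xt1)

print(np.allclose(model_grad(xt1), grad(xt1)))  # (13.1b): True by construction
print(np.allclose(model_grad(xt), grad(xt)))    # (13.1a): True since H⁻¹ = A
```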

  • Monotonicity

    We start with a monotonicity result:

    Lemma 2.5

    Let f be convex and L-smooth. If ηt ≡ η = 1/L, then

    ‖xt+1 − x⋆‖2 ≤ ‖xt − x⋆‖2

    where x⋆ is any minimizer with optimal f(x⋆)

    Gradient methods 2-35

  • Proof of Lemma 2.5

    It follows that

    ‖xt+1 − x⋆‖2² = ‖xt − x⋆ − η(∇f(xt) − ∇f(x⋆))‖2²   (using ∇f(x⋆) = 0)

    = ‖xt − x⋆‖2² − 2η 〈xt − x⋆, ∇f(xt) − ∇f(x⋆)〉 + η² ‖∇f(xt) − ∇f(x⋆)‖2²

    ≤ ‖xt − x⋆‖2² − η² ‖∇f(xt) − ∇f(x⋆)‖2²   (smoothness: 〈xt − x⋆, ∇f(xt) − ∇f(x⋆)〉 ≥ (1/L) ‖∇f(xt) − ∇f(x⋆)‖2²)

    ≤ ‖xt − x⋆‖2²

    Gradient methods 2-36
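
    A quick numerical check of Lemma 2.5 on an illustrative quadratic, where L is the largest eigenvalue of A:

```python
import numpy as np

# check: gradient descent with η = 1/L never increases ‖xt − x⋆‖
rng = np.random.default_rng(2)
M = rng.standard_normal((5, 5))
A = M @ M.T + np.eye(5)                 # f(x) = ½ xᵀA x, L-smooth with L = λmax(A)
L = np.linalg.eigvalsh(A).max()
x_star = np.zeros(5)                    # the minimizer of f
x = rng.standard_normal(5)
dists = []
for _ in range(20):
    dists.append(np.linalg.norm(x - x_star))
    x = x - (1.0 / L) * (A @ x)         # gradient step with η = 1/L
print(all(d1 <= d0 + 1e-12 for d0, d1 in zip(dists, dists[1:])))  # True
```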

  • Secant equation

    (figure: ∇f(x) plotted against x, with the secant line through (xt, ∇f(xt)) and (xt+1, ∇f(xt+1)); its slope is H−1t+1)

    (13.1b) holds automatically. To satisfy (13.1a), one requires

    ∇f(xt+1) + H−1t+1 (xt − xt+1) = ∇f(xt)

    ⇐⇒ H−1t+1 (xt+1 − xt) = ∇f(xt+1) − ∇f(xt)   (secant equation)

    • the secant equation requires that H−1t+1 maps the displacement xt+1 − xt into the change of gradients ∇f(xt+1)−∇f(xt)

    Quasi-Newton methods 13-5
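
    In one dimension the secant equation determines H−1t+1 uniquely: it is the slope of the chord of ∇f between the two iterates, as the figure indicates. A tiny sketch with illustrative numbers:

```python
# 1-D illustration: the secant equation forces H⁻¹ to be the slope of the
# chord of ∇f between the two iterates (function and points are illustrative)
f_prime = lambda x: x**3 + x           # gradient of f(x) = x⁴/4 + x²/2
xt, xt1 = 0.5, 1.2
H_inv = (f_prime(xt1) - f_prime(xt)) / (xt1 - xt)   # secant slope ≈ 3.29
print(H_inv)   # compare with the true f''(x) = 3x² + 1, which lies in [1.75, 5.32]
```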

  • Secant equation

    Ht+1 (∇f(xt+1) − ∇f(xt)) = xt+1 − xt,   where yt := ∇f(xt+1) − ∇f(xt) and st := xt+1 − xt (13.2)

    • only possible when st⊤yt > 0, since st⊤yt = yt⊤Ht+1yt > 0 whenever Ht+1 ≻ 0
    • admits an infinite number of solutions, since the O(n2) degrees of freedom in choosing H−1t+1 far exceed the n constraints in (13.2)
    • which H−1t+1 shall we choose?

    Quasi-Newton methods 13-6
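
    The curvature condition st⊤yt > 0 holds automatically whenever f is strictly convex, by monotonicity of the gradient: 〈x′ − x, ∇f(x′) − ∇f(x)〉 > 0 for x′ ≠ x. A quick check on an illustrative strictly convex f:

```python
import numpy as np

# for strictly convex f, sᵀy = ⟨xt+1 − xt, ∇f(xt+1) − ∇f(xt)⟩ > 0
grad = lambda x: np.exp(x) + x          # gradient of f(x) = Σ (e^{x_i} + x_i²/2)
rng = np.random.default_rng(1)
for _ in range(5):
    xt, xt1 = rng.standard_normal(3), rng.standard_normal(3)
    s, y = xt1 - xt, grad(xt1) - grad(xt)
    print(s @ y > 0)                    # always True for distinct iterates
```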

  • Broyden-Fletcher-Goldfarb-Shanno (BFGS) method

    Quasi-Newton methods 13-7
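
    The slide above only carries the title; the classical BFGS answer to "which H−1t+1 shall we choose?" is the standard rank-two update from the literature, sketched here. It satisfies the secant equation (13.2) by construction and preserves positive definiteness whenever st⊤yt > 0:

```python
import numpy as np

def bfgs_update(H, s, y):
    """Classical BFGS update of the inverse-Hessian surrogate (the standard
    textbook formula; the slide itself only names the method).

    Satisfies the secant equation H_new @ y = s, and keeps H ≻ 0
    provided sᵀy > 0.
    """
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    return ((I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s))
            + rho * np.outer(s, s))

# plugs into the quasi-Newton template sketched after slide 13-3:
# quasi_newton(grad, x0, update_H=bfgs_update)
```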