Mathematics in Quasi-Newton Method - hod/papers/Unsorted/slids_qn.pdfMathematics in Quasi-Newton...

download Mathematics in Quasi-Newton Method - hod/papers/Unsorted/slids_qn.pdfMathematics in Quasi-Newton Method Peng Du and Xijun Wang Department of Computing and Software ... n−1 n

of 43

  • date post

    10-Apr-2018
  • Category

    Documents

  • view

    219
  • download

    5

Embed Size (px)

Transcript of Mathematics in Quasi-Newton Method - hod/papers/Unsorted/slids_qn.pdfMathematics in Quasi-Newton...

  • Mathematics in Quasi-Newton Method

    Peng Du and Xijun Wang

    Department of Computing and Software

    McMaster University

    April 2, 2003

  • Outline

    Elementary Quasi-Newton Method

    Update for Cholesky Factorization

    Update for Limited Memory

    Update for Sparse Hessians

    Update for Reduced Hessians

    References

  • Elementary Quasi-Newton Method

    Unconstrained Optimization: min f(x)

    Quasi-Newton Iteration: xk+1 = xk tkHkgk, Hk+1 = Hk + Ek

    Notation: B, B, H, H, x, x, g, g, (p) = x x, y = g gDesired Update Properties:

    Quasi-Newton Property: B = y(Hy = )

    Symmetric

    Positive Definite

    SR1: B = B + T

    T , = y B

    DFP: B = B ByT+yT BT y

    +(1 +

    T BT y

    )yyT

    T y

    BFGS: B = B + yyT

    yT BT B

    T B

    PSB: B = B + T+T

    T T

    T T

    T

  • Cholesky Factor Modification I Rank-one

    Cholesky Factor Modification (symmetric rank one P.D.) update

    Using Classical Cholesky Factorization

    A = LDLT A = A + zzT =?LDLT

    A = L(D + ppT )LT , Lp = z

    Factorize D + ppT = LDLT giving A = LDLT , L = LL, D = D

    L is of the special form

    L(p, , ) =

    1p21 2p31 p32 3

    ... ... ... . . .n1

    pn1 pn2 pn3... pnn1 n

    Assertion 1: Need only to compute di, i, in O(n) time.

    Assertion 2: Product of L and L can be done in O(n2) time.

    Drawback instable

  • Cholesky Factor Modification II Rank-one

    Using Givens Matrices (I)

    2-dimensional case(c ss c

    ) (z1z2

    )=

    (+0

    )

    where 2 = z21 + z22, = sign(z1)(

    2)1/2, c = z1/, s = z2/.

    n-dimensional case

    P ijz =

    (i)th (j)th1

    .. .1

    c s1

    .. .1

    s c1

    .. .1

    z1...

    zi

    ...

    zj...

    zn

    =

    z1...

    zi

    ...

    0...

    zn

  • Sequential reduction

    P12 P23 Pn1n z = e1 or P12 P13 P1n z = e1

    Products of Pn1n Pn2n1 P23 P12 is of the special form

    HL(p, , ) =

    p11 1p21 p22 2p31 p32 p33

    . . .... ... ... n2

    pn11 pn12 pn13 pn1n1 n1pn1 pn2 pn3 pnn1 pnn

    Denote HU(, p, ) HTL(p, , )

    Factorization A = RTR A = A + zzT =?RT RA = RT (I + ppT )R, RTp = z

    Choose a sequence of Givens matrices such that

    Pp = P12 P23 Pn1n p = e1

  • A can be written as

    A = RTPTP (I + ppT )PTPR = RTPT (I + 2e1eT1)PR = H

    TJTJH

    where H = PR, J is identity matrix except J11 = (1 + pTp)1/2.

    Assertion Product of P and R can be done in O(n2) time.

    Let H = JH, then A = HT H.

    Choose a second sequence of Givens matrices such that

    P H = P n1n P n2n1 P23 P12 H = R

    Then A = RT R.

  • Cholesky Factor Modification III Rank-one

    Using Givens Matrices (II) A = RTR A = A + zzT =?RT RA = RT (I + ppT )R = RT (I + ppT )(I + ppT )R

    where RTp = z and = /(1 + (1 + pTp)1/2).

    Choose a sequence of Givens matrices such that

    Pp = P12 P23 Pn1n p = e1

    A can be written as

    A = RT (I + ppT )PTP (I + ppT )R = RTHTHR

    where H = P (I + ppT ) = P + e1pT .

    Note that P = HU(, p, ) for some , . Then

    H = HU(, p, ) + e1pT = HU(, p, )

    where differs from only in the first element: = + e1.

    Choose a second sequence of Givens matrices such that

    PH = P n1n P n2n1 P23 P12 H = R

  • Assertion 1: R is of the form R = R(, p, ) L(p, , )

    Then A = RTHT PT PHR = RTRT RR.

    Assertion 2: Product of R and R can be done in O(n2) time.

  • Cholesky Factor Modification IV Rank-two

    Product Form of BFGS

    B+ = (I + vuT )B(I + uvT ), u = p + B1y, v = 1Bp + 2yfor some , 1, 2.

    Triangularization of I + zwT : (I + zwT )Q = L

    First choose a sequence of Givens matrices such that

    Pw = P12 P23 Pn1n w = e1

    Then H = (I + zwT )PT = PT + zeT1 is lower Hessenberg matrix.

    Choose a second sequence of Givens matrix to reduce the superdiagonal

    elements of H such that

    HP = L

  • Assertion: L is of the special form

    L(w, , z, , ) =

    11w2 + 1z2 21w3 + 1z3 2w3 + 2z3

    . . .... ... n1

    1wn + 1zn 2wn + 2zn n1wn + n1zn n

    Factorization B = L1DLT1 B+ = (I + vuT )B(I + uvT ) =?L1DL1T

    B+ = L(I + zwT )(I + wzT )LT , L = L1D1/2, Lz = v, Lw = LLTu.

    Triangularize I + zwT = LQT giving

    B+ = LLQTQLTLT = L+L+T , L+ = LL = L1D1/2L

    Assertion: D1/2L(w, , z, , ) = L(w, , z, , ) = L(w, , z, , e)D1/2 where

    w = D1/2w, z = D1/2z, = D1/2, e = (1, ,1)T ,D1/2 = diag(1, , n), = D1/2, = D1/2.

    Assertion: Product of L1L(w, , z, , e) could be done in O(n2) time.

  • Cholesky Factor Modification V Rank-two

    BFGS B+ = B + uuT + vvT

    Using Classical Factorization

    B = L1DLT1 B+ = B + uuT + vvT =?L1DL1T

    B+ = B + uuT + vvT = L1(D + wwT + zzT )LT1

    where L1w = u, L1z = v.

    Factorize D + wwT + zzT = L1DLT1

    Similar to the rank-one case, L1 is of a special form.

    Finally

    L+1 = L1L(w, , z, , e), D+ = D

    Assertion: Product of L1L(w, , z, , e) could be done in O(n2) time.

  • Implementing Quasi-Newton Method

    Approach-1: Cholesky Approach-2: Inverse Hessian

    Bksk = f(xk) Bksk = f(xk)

    RTk Rk = Bk Hk = B1k

    back and forward substitution solve for sk is straight forward

    find Rk+1 s.t. update Hk directly, i.e.

    RTk+1Rk+1 = Bk + Bk Hk+1 = f(Hk, k, yk) = B1k+1

  • Update Inverse Hessian Approximation: PropertiesDesired

    Hessian Approximate Inverse Hessian Approximate

    Bk+1k = yk Hk+1yk = k

    Bk+1 is positive definite Hk+1 is positive definite

    Bk+1 is symmetric Hk+1 is symmetric

    Requirements are in dual format. As a consequence, Update formula should be in dual format too

  • Updating Hessian Approximate

    Broyden Family on Bk update

    Bk+1 = Bk Bkk(Bkk)

    T

    TBkk+

    ykyTk

    yTk k+ k(

    Tk Bkk)vkv

    Tk ;

    where k [0,1] andvk =

    (yk

    yTk k Bkk

    TBkk

    )

    Broyden Family on Hk update

    Hk+1 = Hk Hkyk(Hkyk)

    T

    yTHkyk+

    kTk

    Tk yk+ k(y

    Tk Hkyk)wkw

    Tk ;

    where k [0,1] andwk =

    (k

    Tk yk Hkyk

    yTHkyk

    )

  • BFGS update of Hk when k = 1

    Hk+1 = Hk Hkyk(Hkyk)T

    yT Hkyk+

    kTk

    Tk yk+ (yTk Hkyk)wkw

    Tk ;

    rewritten as:

    Hk+1 = VTk HkVk + kk

    Tk

    where k =1

    yTk k, and Vk = I ykTk .

    Difficulties:

    Hk is dense usually

    Hk can be ill-conditioned

  • Solving Memory Problem L-BFGS

    Hk+1 = VTk HkVk + kk

    Tk

    expand Hk = VTk1Hk1Vk1 + k1k1Tk1

    Hk+1 = (VTk V

    Tk1)Hk1(Vk1Vk)

    +k1V Tk1k1Tk1Vk1 + kTk

    expand m times:Hk+1 = (V

    Tk V Tkm)Hkm(Vkm Vk)

    +km(V Tk V Tkm+1)kmTkm(Vkm+1 Vk)+km+1(V Tk V Tkm+2)km+1Tkm+1(Vkm+2 Vk)

    ...+kk

    Tk .

  • Ideas of L-BFGS

    Maintain m most recent pairs of and y,P = {(k, yk), (k1, yk1), . . . , (km+1, ykm+1)}

    At k-th iteration:If (k m)

    P = P {(k, yk)}else

    P = (P {(km+1, ykm+1)}) {(k, yk)}end

  • Performance of L-BFGS

    the two numbers above: iterations/function-evaluations

    there numbers below: iteration-time/function-time/total-time

  • Choosing Appropriate m

    the two numbers above: iterations/function-evaluations

    there numbers below: iteration-time/function-time/total-time

  • Dealing With Hk

    Choose of Hk influences the overall performance.

    Options found in literature:

    Identity matrix Hk = H0 = I

    Identity matrix scaled at first step only Hk =yT0 0y02I

    Identity matrix scaled at each step Hk =yTk kyk2I

    Diagonal matrix minimize

    Hk = argminDdiagDYk1 Sk1F ,where Yk1 = [yk1, . . . , ykm], Sk1 = [k1, . . . , km]

  • Scaling Schemes Comparisons

    the two numbers above: iterations/function-evaluations

    there numbers below: iteration-time/function-time/total-time

  • Update for Spare Hessian I

    Suppose 2f(x) has a known sparse pattern K:(i, j) K [2f(x)]ij = 0(i, j) K (j, i) K, (for i, j = 1,2, , n)(i, i) K, (for i = 1,2, , n)

    Additional desired update property: (i, j) K Bij = 0

    N = {X Rnn|Xp = y, X = XT}S = {X Rnn|(i, j) K Xij = 0}V = S N

    Row Decomposition B = B + ni=1 eieTi BB = B + ni=1 eieTi

    (yyT

    pT y BppT B

    pT Bp

    )= B +

    ni=1 ei

    (eTi y

    pT yyT e

    Ti Bp

    pT BppTB

    )

  • Update for Spare Hessian II

    Symmetrization

    B = B +(y Bp)zT

    zTp

    B = B + (y Bp)zT + z(y Bp)TzTp

    zzT

    where is chosen to assure QN condition:

    =(y Bp)Tp

    (zTp)2

    z = y

    pTBp/pTy + Bp = BFGSz = y = DFPz = y Bp = SR1z = p = PSB

  • Update for Spare Hessian III

    Update with desired sparse pattern

    B = B +n

    i=1

    eieTi(y Bp)zTi

    zTi p, zi = Diz

    where Di is the diagonal matrix with the jth component 1 if Bij = 0,0 if Bij = 0.

    Symmetrization

    B = B +n

    i=1

    i(eizTi + zie

    Ti )

    where i is chosen to assure QN condition, satisfying

    S = y Bpwhere S =

    ni=1 z

    Ti peie

    Ti + e

    Ti pzie

    Ti .

    When z = y

    pTBp/pTy + Bp, y, y Bp, p, we get a sparse version ofBFGS,DFP,SR1,PSB.

    Drawback of symmetrization method

  • Update for Spare Hessian IV

    Frobenius Norm Projection X2F n

    i,j=1 X2ij = Tr(X

    TX)

    Define B = B + E, and consider problem

    min 12E2F = 12Tr(ETE)s.t. Ep = 0

    Eij = Bij, (i, j) KE = ET

    We get

    E =1

    2

    n

    i=1

    eieTi (p

    Ti + p