
  • Mathematics in Quasi-Newton Method

    Peng Du and Xijun Wang

    Department of Computing and Software

    McMaster University

    April 2, 2003

  • Outline

    Elementary Quasi-Newton Method

    Update for Cholesky Factorization

    Update for Limited Memory

    Update for Sparse Hessians

    Update for Reduced Hessians

    References

  • Elementary Quasi-Newton Method

    Unconstrained Optimization: min f(x)

    Quasi-Newton Iteration: $x_{k+1} = x_k - t_k H_k g_k$, $H_{k+1} = H_k + E_k$

    Notation: $B, \bar B, H, \bar H, x, \bar x, g, \bar g$; $\sigma$ (also denoted $p$) $= \bar x - x$, $y = \bar g - g$

    Desired Update Properties:

    Quasi-Newton Property: $\bar B \sigma = y$ (equivalently $\bar H y = \sigma$)

    Symmetric

    Positive Definite

    SR1: $\bar B = B + \dfrac{\omega \omega^T}{\omega^T \sigma}$, where $\omega = y - B\sigma$

    DFP: $\bar B = B - \dfrac{B \sigma y^T + y \sigma^T B}{\sigma^T y} + \left(1 + \dfrac{\sigma^T B \sigma}{\sigma^T y}\right) \dfrac{y y^T}{\sigma^T y}$

    BFGS: $\bar B = B + \dfrac{y y^T}{y^T \sigma} - \dfrac{B \sigma \sigma^T B}{\sigma^T B \sigma}$

    PSB: $\bar B = B + \dfrac{\omega \sigma^T + \sigma \omega^T}{\sigma^T \sigma} - \dfrac{\omega^T \sigma}{(\sigma^T \sigma)^2} \sigma \sigma^T$, where $\omega = y - B\sigma$
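The update formulas above can be checked numerically. The sketch below (pure Python, with a made-up 2x2 example; `bfgs_update`, `matvec`, and `dot` are helper names introduced here) verifies that the BFGS update satisfies the quasi-Newton property $\bar B \sigma = y$ and preserves symmetry.

```python
# Minimal numerical check of the BFGS formula above (pure Python, 2x2;
# the matrix and vectors are made up for illustration).

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def bfgs_update(B, sigma, y):
    """B_bar = B + y y^T / (y^T sigma) - B sigma sigma^T B / (sigma^T B sigma)."""
    Bs = matvec(B, sigma)
    n = len(B)
    return [[B[i][j] + y[i] * y[j] / dot(y, sigma)
             - Bs[i] * Bs[j] / dot(sigma, Bs)
             for j in range(n)] for i in range(n)]

B = [[2.0, 0.0], [0.0, 3.0]]   # current (P.D.) approximation
sigma = [1.0, 0.5]             # step: x_bar - x
y = [2.5, 1.0]                 # gradient change, with y^T sigma > 0

B_bar = bfgs_update(B, sigma, y)
print(matvec(B_bar, sigma))    # quasi-Newton property: equals y up to rounding
```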

  • Cholesky Factor Modification I Rank-one

    Cholesky Factor Modification (symmetric rank-one P.D.) update

    Using Classical Cholesky Factorization

    $A = LDL^T$, $\bar A = A + \alpha z z^T \stackrel{?}{=} \bar L \bar D \bar L^T$

    $\bar A = L(D + \alpha p p^T)L^T$, where $Lp = z$

    Factorize $D + \alpha p p^T = \tilde L \tilde D \tilde L^T$, giving $\bar A = \bar L \bar D \bar L^T$ with $\bar L = L \tilde L$, $\bar D = \tilde D$

    $\tilde L$ is of the special form (diagonal entries $\rho_i$, entry $\beta_j p_i$ in position $(i,j)$ for $i > j$):

    $\tilde L(p, \beta, \rho) = \begin{pmatrix} \rho_1 & & & \\ \beta_1 p_2 & \rho_2 & & \\ \vdots & \ddots & \ddots & \\ \beta_1 p_n & \cdots & \beta_{n-1} p_n & \rho_n \end{pmatrix}$

    Assertion 1: The $\tilde d_i$, $\beta_i$, $\rho_i$ need only $O(n)$ time to compute.

    Assertion 2: The product of $L$ and $\tilde L$ can be formed in $O(n^2)$ time.

    Drawback: numerically unstable

  • Cholesky Factor Modification II Rank-one

    Using Givens Matrices (I)

    2-dimensional case:

    $\begin{pmatrix} c & s \\ -s & c \end{pmatrix} \begin{pmatrix} z_1 \\ z_2 \end{pmatrix} = \begin{pmatrix} \gamma \\ 0 \end{pmatrix}$

    where $\gamma^2 = z_1^2 + z_2^2$, $\gamma = \mathrm{sign}(z_1)(z_1^2 + z_2^2)^{1/2}$, $c = z_1/\gamma$, $s = z_2/\gamma$.

    n-dimensional case: $P_{ij}$ is the identity except for the rotation entries $c$ and $\pm s$ in rows and columns $i$ and $j$; applied to $z$ it leaves every component unchanged except that $z_i$ becomes $\bar z_i$ and $z_j$ becomes $0$:

    $P_{ij}\, z = (z_1, \ldots, \bar z_i, \ldots, 0, \ldots, z_n)^T$
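The 2-dimensional construction above is easy to exercise in code. This sketch (pure Python; `givens` is a helper name introduced here) builds $c$, $s$, $\gamma$ from $(z_1, z_2)$ and confirms that the rotation annihilates the second component.

```python
# Build the 2-D Givens rotation from the slide: maps (z1, z2) to (gamma, 0)
# with gamma = sign(z1) * sqrt(z1^2 + z2^2).

import math

def givens(z1, z2):
    gamma = math.copysign(math.hypot(z1, z2), z1)
    return z1 / gamma, z2 / gamma, gamma   # c, s, gamma

c, s, gamma = givens(3.0, 4.0)
rotated = (c * 3.0 + s * 4.0, -s * 3.0 + c * 4.0)
print(rotated)   # approximately (5.0, 0.0): second component annihilated
```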

  • Sequential reduction

    $P_{12} P_{23} \cdots P_{n-1,n}\, z = \gamma e_1$ or $P_{12} P_{13} \cdots P_{1n}\, z = \gamma e_1$

    The product $P_{n-1,n} P_{n-2,n-1} \cdots P_{23} P_{12}$ is of the special lower-Hessenberg form

    $H_L(p, \beta, \rho) = \begin{pmatrix} p_{11} & \rho_1 & & \\ p_{21} & p_{22} & \rho_2 & \\ \vdots & & \ddots & \rho_{n-1} \\ p_{n1} & p_{n2} & \cdots & p_{nn} \end{pmatrix}$

    Denote $H_U(\beta, p, \rho) \equiv H_L(p, \beta, \rho)^T$

    Factorization: $A = R^T R$, $\bar A = A + \alpha z z^T \stackrel{?}{=} \bar R^T \bar R$; $\bar A = R^T(I + \alpha p p^T)R$, where $R^T p = z$

    Choose a sequence of Givens matrices such that

    $Pp = P_{12} P_{23} \cdots P_{n-1,n}\, p = \gamma e_1$

  • $\bar A$ can be written as

    $\bar A = R^T P^T P (I + \alpha p p^T) P^T P R = R^T P^T (I + \alpha \gamma^2 e_1 e_1^T) P R = H^T J^T J H$

    where $H = PR$ and $J$ is the identity matrix except $J_{11} = (1 + \alpha p^T p)^{1/2}$.

    Assertion: The product of $P$ and $R$ can be formed in $O(n^2)$ time.

    Let $\tilde H = JH$; then $\bar A = \tilde H^T \tilde H$.

    Choose a second sequence of Givens matrices such that

    $\bar P \tilde H = \bar P_{n-1,n} \bar P_{n-2,n-1} \cdots \bar P_{23} \bar P_{12} \tilde H = \bar R$

    Then $\bar A = \bar R^T \bar R$.

  • Cholesky Factor Modification III Rank-one

    Using Givens Matrices (II): $A = R^T R$, $\bar A = A + \alpha z z^T \stackrel{?}{=} \bar R^T \bar R$

    $\bar A = R^T (I + \alpha p p^T) R = R^T (I + \beta p p^T)(I + \beta p p^T) R$

    where $R^T p = z$ and $\beta = \alpha / \bigl(1 + (1 + \alpha p^T p)^{1/2}\bigr)$.

    Choose a sequence of Givens matrices such that

    $Pp = P_{12} P_{23} \cdots P_{n-1,n}\, p = \gamma e_1$

    $\bar A$ can be written as

    $\bar A = R^T (I + \beta p p^T) P^T P (I + \beta p p^T) R = R^T H^T H R$

    where $H = P(I + \beta p p^T) = P + \beta \gamma e_1 p^T$.

    Note that $P = H_U(\beta', p, \rho)$ for some $\beta'$, $\rho$. Then

    $H = H_U(\beta', p, \rho) + \beta \gamma e_1 p^T = H_U(\bar\beta, p, \rho)$

    where $\bar\beta$ differs from $\beta'$ only in the first element: $\bar\beta = \beta' + \beta\gamma e_1$.

    Choose a second sequence of Givens matrices such that

    $\bar P H = \bar P_{n-1,n} \bar P_{n-2,n-1} \cdots \bar P_{23} \bar P_{12} H = \tilde R$

  • Assertion 1: $\tilde R$ is of the special form $\tilde R(\rho, p, \beta) \equiv \tilde L(p, \beta, \rho)^T$

    Then $\bar A = R^T H^T \bar P^T \bar P H R = R^T \tilde R^T \tilde R R$, i.e. $\bar R = \tilde R R$.

    Assertion 2: The product of $\tilde R$ and $R$ can be formed in $O(n^2)$ time.

  • Cholesky Factor Modification IV Rank-two

    Product Form of BFGS

    $B^+ = (I + v u^T) B (I + u v^T)$, with $u = p + \theta B^{-1} y$, $v = \gamma_1 B p + \gamma_2 y$ for some scalars $\theta, \gamma_1, \gamma_2$.

    Triangularization of $I + z w^T$: $(I + z w^T) Q = L$

    First choose a sequence of Givens matrices such that

    $Pw = P_{12} P_{23} \cdots P_{n-1,n}\, w = \gamma e_1$

    Then $H = (I + z w^T) P^T = P^T + \gamma z e_1^T$ is a lower Hessenberg matrix.

    Choose a second sequence of Givens matrices to reduce the superdiagonal elements of $H$ such that

    $H \bar P = L$

  • Assertion: $L$ is of the special form

    $L(w, \lambda, z, \mu, \rho) = \begin{pmatrix} \rho_1 & & & \\ \lambda_1 w_2 + \mu_1 z_2 & \rho_2 & & \\ \vdots & \ddots & \ddots & \\ \lambda_1 w_n + \mu_1 z_n & \cdots & \lambda_{n-1} w_n + \mu_{n-1} z_n & \rho_n \end{pmatrix}$

    Factorization: $B = L_1 D L_1^T$, $B^+ = (I + v u^T) B (I + u v^T) \stackrel{?}{=} \bar L_1 \bar D \bar L_1^T$

    $B^+ = L (I + z w^T)(I + w z^T) L^T$, where $L = L_1 D^{1/2}$, $Lz = v$, $Lw = L L^T u$.

    Triangularize $I + z w^T = \tilde L Q^T$, giving

    $B^+ = L \tilde L Q^T Q \tilde L^T L^T = L^+ (L^+)^T$, $L^+ = L \tilde L = L_1 D^{1/2} \tilde L$

    Assertion: $D^{1/2} L(w, \lambda, z, \mu, \rho) = L(\bar w, \bar\lambda, \bar z, \bar\mu, e)\, \bar D^{1/2}$, where $\bar w = D^{1/2} w$, $\bar z = D^{1/2} z$, $e = (1, \ldots, 1)^T$, $\bar D^{1/2} = \mathrm{diag}(\bar\delta_1, \ldots, \bar\delta_n)$ collects the diagonal of $D^{1/2} L$, and $\bar\lambda$, $\bar\mu$ are the correspondingly rescaled $\lambda$, $\mu$.

    Assertion: The product $L_1 L(\bar w, \bar\lambda, \bar z, \bar\mu, e)$ can be formed in $O(n^2)$ time.

  • Cholesky Factor Modification V Rank-two

    BFGS: $B^+ = B + \alpha_1 u u^T + \alpha_2 v v^T$ (for BFGS: $u = y$, $\alpha_1 = 1/y^T p$, $v = Bp$, $\alpha_2 = -1/p^T B p$)

    Using Classical Factorization

    $B = L_1 D L_1^T$, $B^+ = B + \alpha_1 u u^T + \alpha_2 v v^T \stackrel{?}{=} \bar L_1 \bar D \bar L_1^T$

    $B^+ = L_1 (D + \alpha_1 w w^T + \alpha_2 z z^T) L_1^T$

    where $L_1 w = u$, $L_1 z = v$.

    Factorize $D + \alpha_1 w w^T + \alpha_2 z z^T = \tilde L_1 \bar D \tilde L_1^T$

    Similar to the rank-one case, $\tilde L_1$ is of a special form.

    Finally

    $L_1^+ = L_1 L(w, \lambda, z, \mu, e)$, $D^+ = \bar D$

    Assertion: The product $L_1 L(w, \lambda, z, \mu, e)$ can be formed in $O(n^2)$ time.

  • Implementing Quasi-Newton Method

    Approach 1 (Cholesky): solve $B_k s_k = -\nabla f(x_k)$ using $R_k^T R_k = B_k$ by forward and back substitution; then find $R_{k+1}$ such that $R_{k+1}^T R_{k+1} = B_k + \Delta B_k$.

    Approach 2 (Inverse Hessian): keep $H_k = B_k^{-1}$, so solving for $s_k = -H_k \nabla f(x_k)$ is straightforward; then update $H_k$ directly, i.e. $H_{k+1} = f(H_k, \sigma_k, y_k) = B_{k+1}^{-1}$.
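The solve in Approach 1 can be sketched directly: with the factor $R_k$ in hand, $B_k s_k = -g_k$ costs one forward and one back substitution. The code below (pure Python; `forward_sub`, `back_sub`, and the 2x2 factor are made up for illustration) demonstrates this.

```python
# Approach 1 in code: with B = R^T R available, solve B s = -g by
# forward substitution (R^T u = -g) then back substitution (R s = u).

def forward_sub(L, b):
    """Solve L x = b for lower-triangular L."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        x[i] = (b[i] - sum(L[i][j] * x[j] for j in range(i))) / L[i][i]
    return x

def back_sub(U, b):
    """Solve U x = b for upper-triangular U."""
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (b[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

R = [[2.0, 1.0], [0.0, 1.0]]             # upper-triangular Cholesky factor
RT = [[R[j][i] for j in range(2)] for i in range(2)]
g = [4.0, 3.0]

u = forward_sub(RT, [-gi for gi in g])   # R^T u = -g
s = back_sub(R, u)                       # R s = u, so (R^T R) s = -g
print(s)                                 # the quasi-Newton step
```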

  • Update Inverse Hessian Approximation: Desired Properties

    Hessian Approximation: $B_{k+1}\sigma_k = y_k$; $B_{k+1}$ is positive definite; $B_{k+1}$ is symmetric.

    Inverse Hessian Approximation: $H_{k+1}y_k = \sigma_k$; $H_{k+1}$ is positive definite; $H_{k+1}$ is symmetric.

    The requirements come in dual pairs; as a consequence, the update formulas should be in dual form too.

  • Updating the Hessian Approximation

    Broyden Family on the $B_k$ update:

    $B_{k+1} = B_k - \dfrac{B_k \sigma_k (B_k \sigma_k)^T}{\sigma_k^T B_k \sigma_k} + \dfrac{y_k y_k^T}{y_k^T \sigma_k} + \phi_k\, (\sigma_k^T B_k \sigma_k)\, v_k v_k^T$

    where $\phi_k \in [0,1]$ and $v_k = \dfrac{y_k}{y_k^T \sigma_k} - \dfrac{B_k \sigma_k}{\sigma_k^T B_k \sigma_k}$

    Broyden Family on the $H_k$ update:

    $H_{k+1} = H_k - \dfrac{H_k y_k (H_k y_k)^T}{y_k^T H_k y_k} + \dfrac{\sigma_k \sigma_k^T}{\sigma_k^T y_k} + \phi_k\, (y_k^T H_k y_k)\, w_k w_k^T$

    where $\phi_k \in [0,1]$ and $w_k = \dfrac{\sigma_k}{\sigma_k^T y_k} - \dfrac{H_k y_k}{y_k^T H_k y_k}$

  • BFGS update of $H_k$ when $\phi_k = 1$:

    $H_{k+1} = H_k - \dfrac{H_k y_k (H_k y_k)^T}{y_k^T H_k y_k} + \dfrac{\sigma_k \sigma_k^T}{\sigma_k^T y_k} + (y_k^T H_k y_k)\, w_k w_k^T$

    rewritten as:

    $H_{k+1} = V_k^T H_k V_k + \rho_k \sigma_k \sigma_k^T$

    where $\rho_k = \dfrac{1}{y_k^T \sigma_k}$ and $V_k = I - \rho_k y_k \sigma_k^T$.

    Difficulties:

    $H_k$ is usually dense

    $H_k$ can be ill-conditioned

  • Solving the Memory Problem: L-BFGS

    $H_{k+1} = V_k^T H_k V_k + \rho_k \sigma_k \sigma_k^T$

    Expand $H_k = V_{k-1}^T H_{k-1} V_{k-1} + \rho_{k-1} \sigma_{k-1} \sigma_{k-1}^T$:

    $H_{k+1} = (V_k^T V_{k-1}^T) H_{k-1} (V_{k-1} V_k) + \rho_{k-1} V_k^T \sigma_{k-1} \sigma_{k-1}^T V_k + \rho_k \sigma_k \sigma_k^T$

    Expand $m$ times:

    $H_{k+1} = (V_k^T \cdots V_{k-m}^T) H_{k-m} (V_{k-m} \cdots V_k)$
    $\quad + \rho_{k-m} (V_k^T \cdots V_{k-m+1}^T)\, \sigma_{k-m} \sigma_{k-m}^T\, (V_{k-m+1} \cdots V_k)$
    $\quad + \rho_{k-m+1} (V_k^T \cdots V_{k-m+2}^T)\, \sigma_{k-m+1} \sigma_{k-m+1}^T\, (V_{k-m+2} \cdots V_k)$
    $\quad + \cdots + \rho_k \sigma_k \sigma_k^T$

  • Ideas of L-BFGS

    Maintain the $m$ most recent pairs of $\sigma$ and $y$:
    $P = \{(\sigma_k, y_k), (\sigma_{k-1}, y_{k-1}), \ldots, (\sigma_{k-m+1}, y_{k-m+1})\}$

    At the $k$-th iteration:

    If (k <= m)
        P = P ∪ {(σ_k, y_k)}
    else
        P = (P \ {(σ_{k-m}, y_{k-m})}) ∪ {(σ_k, y_k)}
    end
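The stored pairs are applied through the standard two-loop recursion of Liu and Nocedal (cited in the references), which computes $H_{k+1} g$ without ever forming $H_{k+1}$. The pure-Python sketch below (names `lbfgs_direction` and `dot` are introduced here; data is made up) uses the "scaled at each step" initial matrix $\gamma I$ with $\gamma = y^T \sigma / \|y\|^2$.

```python
# Standard L-BFGS two-loop recursion on the stored (sigma, y) pairs.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lbfgs_direction(g, pairs):
    """pairs: list of (sigma, y), oldest first. Returns H g."""
    q = list(g)
    alphas = []
    for sigma, y in reversed(pairs):        # first loop: newest to oldest
        rho = 1.0 / dot(y, sigma)
        a = rho * dot(sigma, q)
        q = [qi - a * yi for qi, yi in zip(q, y)]
        alphas.append((rho, a))
    sigma, y = pairs[-1]                    # newest pair sets the scaling
    gamma = dot(y, sigma) / dot(y, y)       # initial matrix H0 = gamma I
    r = [gamma * qi for qi in q]
    for (sigma, y), (rho, a) in zip(pairs, reversed(alphas)):
        b = rho * dot(y, r)
        r = [ri + (a - b) * si for ri, si in zip(r, sigma)]
    return r

# One stored pair with y = sigma (as on f(x) = x^T x / 2): then H = I exactly,
# so the result equals g itself.
pairs = [([1.0, 2.0], [1.0, 2.0])]
print(lbfgs_direction([3.0, 1.0], pairs))   # [3.0, 1.0]
```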

  • Performance of L-BFGS

    (Table of numerical results omitted in this transcript. For each problem, the two numbers above are iterations/function-evaluations; the three numbers below are iteration-time/function-time/total-time.)

  • Choosing Appropriate m

    (Table of numerical results omitted in this transcript. For each problem, the two numbers above are iterations/function-evaluations; the three numbers below are iteration-time/function-time/total-time.)

  • Dealing With $H_k^0$

    The choice of the initial matrix $H_k^0$ influences the overall performance.

    Options found in the literature:

    Identity matrix: $H_k^0 = H^0 = I$

    Identity matrix scaled at the first step only: $H_k^0 = \dfrac{y_0^T \sigma_0}{\|y_0\|^2}\, I$

    Identity matrix scaled at each step: $H_k^0 = \dfrac{y_k^T \sigma_k}{\|y_k\|^2}\, I$

    Diagonal matrix minimizing a Frobenius-norm fit:

    $H_k^0 = \arg\min_{D\ \mathrm{diagonal}} \|D Y_{k-1} - S_{k-1}\|_F$, where $Y_{k-1} = [y_{k-1}, \ldots, y_{k-m}]$, $S_{k-1} = [\sigma_{k-1}, \ldots, \sigma_{k-m}]$

  • Scaling Schemes Comparisons

    (Table of numerical results omitted in this transcript. For each problem, the two numbers above are iterations/function-evaluations; the three numbers below are iteration-time/function-time/total-time.)

  • Update for Sparse Hessians I

    Suppose $\nabla^2 f(x)$ has a known sparsity pattern $K$:
    $(i,j) \notin K \Rightarrow [\nabla^2 f(x)]_{ij} = 0$;
    $(i,j) \in K \Rightarrow (j,i) \in K$ (for $i,j = 1,2,\ldots,n$);
    $(i,i) \in K$ (for $i = 1,2,\ldots,n$)

    Additional desired update property: $(i,j) \notin K \Rightarrow \bar B_{ij} = 0$

    $N = \{X \in \mathbb{R}^{n \times n} \mid Xp = y,\ X = X^T\}$
    $S = \{X \in \mathbb{R}^{n \times n} \mid (i,j) \notin K \Rightarrow X_{ij} = 0\}$
    $V = S \cap N$

    Row decomposition of the BFGS update:

    $\bar B = B + \sum_{i=1}^n e_i e_i^T\, \Delta B = B + \sum_{i=1}^n e_i e_i^T \left( \dfrac{y y^T}{p^T y} - \dfrac{B p p^T B}{p^T B p} \right) = B + \sum_{i=1}^n e_i \left( \dfrac{e_i^T y}{p^T y}\, y^T - \dfrac{e_i^T B p}{p^T B p}\, p^T B \right)$

  • Update for Sparse Hessians II

    Symmetrization

    $\tilde B = B + \dfrac{(y - Bp)\, z^T}{z^T p}$

    $\bar B = B + \dfrac{(y - Bp)\, z^T + z\, (y - Bp)^T}{z^T p} - \theta\, z z^T$

    where $\theta$ is chosen to assure the QN condition:

    $\theta = \dfrac{(y - Bp)^T p}{(z^T p)^2}$

    Choices of $z$:

    $z = (p^T B p / p^T y)^{1/2}\, y + Bp \Rightarrow$ BFGS
    $z = y \Rightarrow$ DFP
    $z = y - Bp \Rightarrow$ SR1
    $z = p \Rightarrow$ PSB
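The symmetrized update can be sanity-checked numerically: for any admissible $z$ it must satisfy $\bar B p = y$ and stay symmetric. The sketch below (pure Python; `symmetrized_update` and the 2x2 data are made up for illustration) checks the DFP, SR1, and PSB choices of $z$.

```python
# Check that the symmetrized update satisfies the QN condition B_bar p = y.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def symmetrized_update(B, p, y, z):
    w = [yi - bi for yi, bi in zip(y, matvec(B, p))]   # w = y - B p
    zp = dot(z, p)
    theta = dot(w, p) / zp ** 2
    n = len(B)
    return [[B[i][j] + (w[i] * z[j] + z[i] * w[j]) / zp - theta * z[i] * z[j]
             for j in range(n)] for i in range(n)]

B = [[2.0, 1.0], [1.0, 3.0]]
p = [1.0, -1.0]
y = [4.0, 2.0]
w = [yi - bi for yi, bi in zip(y, matvec(B, p))]

for z in (y, w, p):                  # the DFP, SR1, and PSB choices of z
    B_bar = symmetrized_update(B, p, y, z)
    print(matvec(B_bar, p))          # each line equals y, up to rounding
```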

  • Update for Sparse Hessians III

    Update with the desired sparsity pattern:

    $\bar B = B + \sum_{i=1}^n e_i e_i^T\, \dfrac{(y - Bp)\, z_i^T}{z_i^T p}$, where $z_i = D_i z$

    and $D_i$ is the diagonal matrix whose $j$th diagonal entry is $1$ if $(i,j) \in K$ and $0$ otherwise.

    Symmetrization:

    $\bar B = B + \sum_{i=1}^n \theta_i\, (e_i z_i^T + z_i e_i^T)$

    where $\theta = (\theta_1, \ldots, \theta_n)^T$ is chosen to assure the QN condition, satisfying

    $S\theta = y - Bp$, where $S = \sum_{i=1}^n \left( (z_i^T p)\, e_i e_i^T + (e_i^T p)\, z_i e_i^T \right)$.

    When $z = (p^T B p / p^T y)^{1/2}\, y + Bp$, $y$, $y - Bp$, $p$, we get a sparse version of BFGS, DFP, SR1, PSB respectively.

    Drawback of the symmetrization method

  • Update for Sparse Hessians IV

    Frobenius Norm Projection: $\|X\|_F^2 = \sum_{i,j=1}^n X_{ij}^2 = \mathrm{Tr}(X^T X)$

    Define $\bar B = B + E$ and consider the problem

    $\min\ \tfrac{1}{2}\|E\|_F^2 = \tfrac{1}{2}\mathrm{Tr}(E^T E)$
    s.t. $Ep = y - Bp$
    $E_{ij} = -B_{ij}$, $(i,j) \notin K$
    $E = E^T$

    We get

    $E = \dfrac{1}{2} \sum_{i=1}^n e_i \left( \lambda(i)\, p_i^T + p(i)\, \lambda_i^T \right) - B_{\bar K}$

    where $(B_{\bar K})_{ij} = B_{ij}$ if $(i,j) \notin K$ and $0$ otherwise, $p_i = D_i p$, $\lambda_i = D_i \lambda$, and $\lambda$ is chosen to assure the QN condition.

    Store $B$ and $r = 2 B_{\bar K}\, p$; compute the whole $\bar B$.

    Drawbacks (in fact, of all general sparse updates):

    Cannot minimize an $n$-dimensional quadratic in at most $n$ updates with exact line search.

    Cannot guarantee that $\bar B$ is P.D.

  • Consider the diagonal pattern $K = \{(i,j) \mid i = j,\ i,j = 1,\ldots,n\}$. Then

    $\bar B_{ii} = \dfrac{y(i)}{p(i)}$, $i = 1,\ldots,n$

    In general, it is not possible to derive a diagonal and always-P.D. update.

    Question: What are the maximum sparsity requirements we may impose so that there always exists a P.D. QN update?
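The diagonal case is easy to see concretely: each $\bar B_{ii} = y(i)/p(i)$ is forced entry by entry, and nothing prevents a negative diagonal entry even when $y^T p > 0$. A tiny made-up example:

```python
# The forced diagonal update B_ii = y(i)/p(i) can be indefinite
# even though the curvature condition y^T p > 0 holds.

p = [1.0, -2.0]
y = [3.0, 1.0]          # y^T p = 1 > 0

B_diag = [yi / pi for yi, pi in zip(y, p)]
print(B_diag)           # [3.0, -0.5]: second entry negative, not P.D.
```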

  • Update for Sparse Hessians V

    Irreducible (primitive) patterns: there is no symmetric permutation of the rows and columns of the considered sparsity pattern such that the permuted pattern is block diagonal with at least two blocks.

    Define the perturbation matrix $E$ by

    $E_{ij} = \delta_{ij}\, \|\bar p_i\|^2 - \bar p_i(j)\, \bar p_j(i)$, where $\bar p_i = (I - D_i)\, p$.

    For $(i,j) \notin K$ and $i < j$, define the matrix $R_{ij}$ (P.S.D.), zero except for the $2 \times 2$ block in rows and columns $i$, $j$:

    $\begin{pmatrix} p(j)^2 & -p(i)\,p(j) \\ -p(i)\,p(j) & p(i)^2 \end{pmatrix}$

    Then

    $E = \sum_{(i,j) \notin K,\ i < j} R_{ij}$

  • Properties of the perturbation matrix $E$: Suppose the given sparsity pattern is irreducible and $p(i) \neq 0$, $i = 1,\ldots,n$. Then $E$ satisfies:

    (i) $E$ is symmetric.

    (ii) $E$ has the designated sparsity pattern.

    (iii) $Ep = 0$.

    (iv) For all $z \neq 0$ with $z \neq \theta p$, $\theta \in \mathbb{R}$: $z^T E z > 0$.

    Existence Theorem of P.D. Sparse Updates: Suppose the given sparsity pattern is irreducible and $p(i) \neq 0$, $i = 1,\ldots,n$. If $B + C$ is any sparse update, then there exists $\bar\alpha \geq 0$ such that for all $\alpha \geq \bar\alpha$, the matrix $\bar B = B + C + \alpha E$ is also a sparse update and P.D.

  • .......

    .......

    Unit Surface

    p

    p

    Note 1: p(j) = 0 for all j, sparse pattern is reducible.Note 2: p(j) = 0 for some j, sparse pattern is irreducible.

  • I Wish I Were a Magician

    $B_k s_k = -\nabla f(x_k)$

    $B_k s_k = -\nabla f(x_k)$

  • Magic Show step (1)

    Define:

    $\mathcal{G}_k = \mathrm{span}\{g_0, g_1, \ldots, g_k\}$

    $\mathcal{G}_k^\perp$ denotes the orthogonal complement of $\mathcal{G}_k$ in $\mathbb{R}^n$

    Lemma: Consider a Broyden-family method applied to a general nonlinear function. If $B_0 = \xi I$ with $\xi > 0$, then $s_k \in \mathcal{G}_k$ for all $k$. Moreover, if $z \in \mathcal{G}_k$ and $w \in \mathcal{G}_k^\perp$, then $B_k z \in \mathcal{G}_k$ and $B_k w = \xi w$.

  • Magic Show step (2)

    Notation:

    $r_k = \dim(\mathcal{G}_k)$; $G_k \in \mathbb{R}^{n \times r_k}$ is a matrix whose columns form a basis for $\mathcal{G}_k$

    QR factorization:

    $G_k = Z_k T_k$

    where:

    $Z_k \in \mathbb{R}^{n \times r_k}$: an orthonormal basis matrix
    $T_k \in \mathbb{R}^{r_k \times r_k}$: a nonsingular upper triangular matrix

  • Magic Show step (3)

    More Notation:

    $W_k \in \mathbb{R}^{n \times (n - r_k)}$: an orthonormal basis matrix for $\mathcal{G}_k^\perp$

    Define:

    $Q_k = (Z_k\ W_k)$

    The transformed Hessian approximation:

    $Q_k^T B_k Q_k = \begin{pmatrix} Z_k^T B_k Z_k & (W_k^T B_k Z_k)^T \\ W_k^T B_k Z_k & W_k^T B_k W_k \end{pmatrix} = \begin{pmatrix} Z_k^T B_k Z_k & 0 \\ 0 & \xi I_{n - r_k} \end{pmatrix}$

    The transformed gradient:

    $Q_k^T g_k = \begin{pmatrix} Z_k^T g_k \\ 0 \end{pmatrix}$

  • Magic Show step (4)

    $B_k s_k = -g_k$

    $(Q_k^T B_k Q_k)\, Q_k^T s_k = -Q_k^T g_k$

    $\begin{pmatrix} Z_k^T B_k Z_k & 0 \\ 0 & \xi I_{n - r_k} \end{pmatrix} \begin{pmatrix} Z_k^T s_k \\ W_k^T s_k \end{pmatrix} = -\begin{pmatrix} Z_k^T g_k \\ 0 \end{pmatrix}$

    $\Rightarrow\ Z_k^T B_k Z_k\, q_k = -Z_k^T g_k$, $s_k = Z_k q_k$ (with $q_k = Z_k^T s_k$ and $W_k^T s_k = 0$)

  • Magic Show step (5)

    Fact: $Z_k^T B_k Z_k$ is positive definite if $B_k$ is positive definite.

    Cholesky Factorization:

    $Z_k^T B_k Z_k = R_k^T R_k$

    $R_k^T R_k\, q_k = -Z_k^T g_k$

    Benefits:

    $\mathrm{Cond}(Z_k^T B_k Z_k) \leq \mathrm{Cond}(B_k)$

    $Z_k$, $R_k$ require less memory

    Computing $s_k$ requires less time

  • Update Zk the orthonormal basis

    Step 1): Calculate $\rho_{k+1} = \|(I - Z_k Z_k^T)\, g_{k+1}\|$

    Step 2): If ($\rho_{k+1} = 0$)

        $Z_{k+1} = Z_k$, $r_{k+1} = r_k$

    else

        $z_{k+1} = \dfrac{(I - Z_k Z_k^T)\, g_{k+1}}{\rho_{k+1}}$

        $Z_{k+1} = [Z_k\ z_{k+1}]$, $r_{k+1} = r_k + 1$
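The two steps above are one Gram-Schmidt step: project $g_{k+1}$ off the range of $Z_k$, and if the residual is nonzero, normalize it and append it as a new column. A pure-Python sketch (`expand_Z` and the small data are made up for illustration):

```python
# One Gram-Schmidt expansion step for the orthonormal basis Z_k.

import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def expand_Z(Z, g):
    """Z: list of orthonormal columns (each a list). Returns the updated Z."""
    resid = list(g)
    for col in Z:
        c = dot(col, g)
        resid = [r - c * zi for r, zi in zip(resid, col)]
    rho = math.sqrt(dot(resid, resid))
    if rho == 0.0:                       # g already lies in span(Z): no growth
        return Z
    return Z + [[r / rho for r in resid]]

Z = [[1.0, 0.0, 0.0]]                    # current orthonormal basis for G_k
g_next = [2.0, 3.0, 0.0]
Z_new = expand_Z(Z, g_next)
print(Z_new[-1])                         # new unit column, orthogonal to Z
```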

  • Update Rk the Cholesky factor

    Want: $Z_{k+1}^T B_{k+1} Z_{k+1} = R_{k+1}^T R_{k+1}$

    Two-Step Approach:

    Expand step: find $\tilde R_k$ such that

    $\tilde R_k^T \tilde R_k = Z_{k+1}^T B_k Z_{k+1}$

    Broyden-update step: find $R_{k+1}$ from $\tilde R_k$ such that

    $Z_{k+1}^T B_{k+1} Z_{k+1} = R_{k+1}^T R_{k+1}$

  • Expand step: $\tilde R_k^T \tilde R_k = Z_{k+1}^T B_k Z_{k+1}$

    We know $R_k$: $R_k^T R_k = Z_k^T B_k Z_k$.

    $Z_{k+1}^T B_k Z_{k+1} = \begin{pmatrix} Z_k^T B_k Z_k & Z_k^T B_k z_{k+1} \\ z_{k+1}^T B_k Z_k & z_{k+1}^T B_k z_{k+1} \end{pmatrix} = \begin{pmatrix} Z_k^T B_k Z_k & 0 \\ 0 & \xi \end{pmatrix} = \begin{pmatrix} R_k^T R_k & 0 \\ 0 & \xi \end{pmatrix} = \begin{pmatrix} R_k & 0 \\ 0 & \xi^{1/2} \end{pmatrix}^T \begin{pmatrix} R_k & 0 \\ 0 & \xi^{1/2} \end{pmatrix} = \tilde R_k^T \tilde R_k$

    (using $z_{k+1} \in \mathcal{G}_k^\perp$, so that $B_k z_{k+1} = \xi z_{k+1}$)

  • Expand Step Continued

    For convenience, define:

    $v_k = Z_k^T g_k$, $u_k = Z_k^T g_{k+1}$, $q_k = Z_k^T s_k$

    Update them due to the expansion of $Z_k$:

    $\tilde v_k = Z_{k+1}^T g_k = \begin{bmatrix} v_k \\ 0 \end{bmatrix}$, $\tilde u_k = Z_{k+1}^T g_{k+1} = \begin{bmatrix} u_k \\ \rho_{k+1} \end{bmatrix}$, $\tilde q_k = Z_{k+1}^T s_k = \begin{bmatrix} q_k \\ 0 \end{bmatrix}$

  • Broyden-Update Step

    1. Define: $\tilde\sigma_k = Z_{k+1}^T (x_{k+1} - x_k) = \alpha_k \tilde q_k$, $\tilde y_k = Z_{k+1}^T (g_{k+1} - g_k) = \tilde u_k - \tilde v_k$

    2. Apply a QR factorization to

    $\tilde R_k + w_1 w_2^T$

    where:

    $w_1 = \dfrac{1}{\|\tilde R_k \tilde\sigma_k\|}\, \tilde R_k \tilde\sigma_k$

    $w_2 = \dfrac{1}{(\tilde y_k^T \tilde\sigma_k)^{1/2}}\, \tilde y_k - \dfrac{1}{\|\tilde R_k \tilde\sigma_k\|}\, \tilde R_k^T \tilde R_k \tilde\sigma_k$

  • Reduced Hessian Method (Complete Version)

    Choose $x_0$ and $\xi > 0$;
    $k = 0$; $r_k = 1$; $g_k = \nabla f(x_k)$; $Z_k = g_k / \|g_k\|$; $R_k = \xi^{1/2}$; $v_k = \|g_k\|$;
    while not converged do
        Solve $R_k^T d_k = -v_k$; $R_k q_k = d_k$; $s_k = Z_k q_k$;
        Find $\alpha_k$ satisfying the Wolfe conditions;
        $x_{k+1} = x_k + \alpha_k s_k$; $g_{k+1} = \nabla f(x_{k+1})$; $u_k = Z_k^T g_{k+1}$;
        $(Z_{k+1}, r_{k+1}, \tilde R_k, \tilde u_k, \tilde v_k, \tilde q_k) = \mathrm{expand}(Z_k, r_k, u_k, v_k, q_k, g_{k+1}, \xi)$;
        $\tilde\sigma_k = \alpha_k \tilde q_k$; $\tilde y_k = \tilde u_k - \tilde v_k$;
        $R_{k+1} = \mathrm{BroydenUpdate}(\tilde R_k, \tilde\sigma_k, \tilde y_k)$;
        $v_{k+1} = \tilde u_k$; $k = k + 1$;
    end do

  • References

    Gill, P. E., Golub, G. H., Murray, W. and Saunders, M. A. (1974). Methods for modifying matrix factorizations, Mathematics of Computation 28(126): 505-535.

    Goldfarb, D. (1976). Factorized variable metric methods for unconstrained optimization, Mathematics of Computation 30(136): 796-811.

    Liu, D. C. and Nocedal, J. (1989). On the limited memory BFGS method for large scale optimization, Mathematical Programming 45: 503-528.

    Shanno, D. F. (1980). On variable-metric methods for sparse Hessians, Mathematics of Computation 34(150): 499-514.

    Toint, P. (1981). A note about sparsity exploiting quasi-Newton updates, Mathematical Programming 21: 172-181.

    Gill, P. E. and Leonard, M. W. (2001). Reduced-Hessian quasi-Newton methods for unconstrained optimization, SIAM J. Optim. 12(1): 209-237.