of 43/43
Mathematics in Quasi-Newton Method Peng Du and Xijun Wang Department of Computing and Software McMaster University April 2, 2003
• date post

10-Apr-2018
• Category

## Documents

• view

227

6

Embed Size (px)

### Transcript of Mathematics in Quasi-Newton Method - Greeley.Orghod/papers/Unsorted/slids_qn.pdf · Mathematics in...

• Mathematics in Quasi-Newton Method

Peng Du and Xijun Wang

Department of Computing and Software

McMaster University

April 2, 2003

• Outline

Elementary Quasi-Newton Method

Update for Cholesky Factorization

Update for Limited Memory

Update for Sparse Hessians

Update for Reduced Hessians

References

• Elementary Quasi-Newton Method

Unconstrained Optimization: min f(x)

Quasi-Newton Iteration: xk+1 = xk tkHkgk, Hk+1 = Hk + Ek

Notation: B, B, H, H, x, x, g, g, (p) = x x, y = g gDesired Update Properties:

Quasi-Newton Property: B = y(Hy = )

Symmetric

Positive Definite

SR1: B = B + T

T , = y B

DFP: B = B ByT+yT BT y

+(1 +

T BT y

)yyT

T y

BFGS: B = B + yyT

yT BT B

T B

PSB: B = B + T+T

T T

T T

T

• Cholesky Factor Modification I Rank-one

Cholesky Factor Modification (symmetric rank one P.D.) update

Using Classical Cholesky Factorization

A = LDLT A = A + zzT =?LDLT

A = L(D + ppT )LT , Lp = z

Factorize D + ppT = LDLT giving A = LDLT , L = LL, D = D

L is of the special form

L(p, , ) =

1p21 2p31 p32 3

... ... ... . . .n1

pn1 pn2 pn3... pnn1 n

Assertion 1: Need only to compute di, i, in O(n) time.

Assertion 2: Product of L and L can be done in O(n2) time.

Drawback instable

• Cholesky Factor Modification II Rank-one

Using Givens Matrices (I)

2-dimensional case(c ss c

) (z1z2

)=

(+0

)

where 2 = z21 + z22, = sign(z1)(

2)1/2, c = z1/, s = z2/.

n-dimensional case

P ijz =

(i)th (j)th1

.. .1

c s1

.. .1

s c1

.. .1

z1...

zi

...

zj...

zn

=

z1...

zi

...

0...

zn

• Sequential reduction

P12 P23 Pn1n z = e1 or P12 P13 P1n z = e1

Products of Pn1n Pn2n1 P23 P12 is of the special form

HL(p, , ) =

p11 1p21 p22 2p31 p32 p33

. . .... ... ... n2

pn11 pn12 pn13 pn1n1 n1pn1 pn2 pn3 pnn1 pnn

Denote HU(, p, ) HTL(p, , )

Factorization A = RTR A = A + zzT =?RT RA = RT (I + ppT )R, RTp = z

Choose a sequence of Givens matrices such that

Pp = P12 P23 Pn1n p = e1

• A can be written as

A = RTPTP (I + ppT )PTPR = RTPT (I + 2e1eT1)PR = H

TJTJH

where H = PR, J is identity matrix except J11 = (1 + pTp)1/2.

Assertion Product of P and R can be done in O(n2) time.

Let H = JH, then A = HT H.

Choose a second sequence of Givens matrices such that

P H = P n1n P n2n1 P23 P12 H = R

Then A = RT R.

• Cholesky Factor Modification III Rank-one

Using Givens Matrices (II) A = RTR A = A + zzT =?RT RA = RT (I + ppT )R = RT (I + ppT )(I + ppT )R

where RTp = z and = /(1 + (1 + pTp)1/2).

Choose a sequence of Givens matrices such that

Pp = P12 P23 Pn1n p = e1

A can be written as

A = RT (I + ppT )PTP (I + ppT )R = RTHTHR

where H = P (I + ppT ) = P + e1pT .

Note that P = HU(, p, ) for some , . Then

H = HU(, p, ) + e1pT = HU(, p, )

where differs from only in the first element: = + e1.

Choose a second sequence of Givens matrices such that

PH = P n1n P n2n1 P23 P12 H = R

• Assertion 1: R is of the form R = R(, p, ) L(p, , )

Then A = RTHT PT PHR = RTRT RR.

Assertion 2: Product of R and R can be done in O(n2) time.

• Cholesky Factor Modification IV Rank-two

Product Form of BFGS

B+ = (I + vuT )B(I + uvT ), u = p + B1y, v = 1Bp + 2yfor some , 1, 2.

Triangularization of I + zwT : (I + zwT )Q = L

First choose a sequence of Givens matrices such that

Pw = P12 P23 Pn1n w = e1

Then H = (I + zwT )PT = PT + zeT1 is lower Hessenberg matrix.

Choose a second sequence of Givens matrix to reduce the superdiagonal

elements of H such that

HP = L

• Assertion: L is of the special form

L(w, , z, , ) =

11w2 + 1z2 21w3 + 1z3 2w3 + 2z3

. . .... ... n1

1wn + 1zn 2wn + 2zn n1wn + n1zn n

Factorization B = L1DLT1 B+ = (I + vuT )B(I + uvT ) =?L1DL1T

B+ = L(I + zwT )(I + wzT )LT , L = L1D1/2, Lz = v, Lw = LLTu.

Triangularize I + zwT = LQT giving

B+ = LLQTQLTLT = L+L+T , L+ = LL = L1D1/2L

Assertion: D1/2L(w, , z, , ) = L(w, , z, , ) = L(w, , z, , e)D1/2 where

w = D1/2w, z = D1/2z, = D1/2, e = (1, ,1)T ,D1/2 = diag(1, , n), = D1/2, = D1/2.

Assertion: Product of L1L(w, , z, , e) could be done in O(n2) time.

• Cholesky Factor Modification V Rank-two

BFGS B+ = B + uuT + vvT

Using Classical Factorization

B = L1DLT1 B+ = B + uuT + vvT =?L1DL1T

B+ = B + uuT + vvT = L1(D + wwT + zzT )LT1

where L1w = u, L1z = v.

Factorize D + wwT + zzT = L1DLT1

Similar to the rank-one case, L1 is of a special form.

Finally

L+1 = L1L(w, , z, , e), D+ = D

Assertion: Product of L1L(w, , z, , e) could be done in O(n2) time.

• Implementing Quasi-Newton Method

Approach-1: Cholesky Approach-2: Inverse Hessian

Bksk = f(xk) Bksk = f(xk)

RTk Rk = Bk Hk = B1k

back and forward substitution solve for sk is straight forward

find Rk+1 s.t. update Hk directly, i.e.

RTk+1Rk+1 = Bk + Bk Hk+1 = f(Hk, k, yk) = B1k+1

• Update Inverse Hessian Approximation: PropertiesDesired

Hessian Approximate Inverse Hessian Approximate

Bk+1k = yk Hk+1yk = k

Bk+1 is positive definite Hk+1 is positive definite

Bk+1 is symmetric Hk+1 is symmetric

Requirements are in dual format. As a consequence, Update formula should be in dual format too

• Updating Hessian Approximate

Broyden Family on Bk update

Bk+1 = Bk Bkk(Bkk)

T

TBkk+

ykyTk

yTk k+ k(

Tk Bkk)vkv

Tk ;

where k [0,1] andvk =

(yk

yTk k Bkk

TBkk

)

Broyden Family on Hk update

Hk+1 = Hk Hkyk(Hkyk)

T

yTHkyk+

kTk

Tk yk+ k(y

Tk Hkyk)wkw

Tk ;

where k [0,1] andwk =

(k

Tk yk Hkyk

yTHkyk

)

• BFGS update of Hk when k = 1

Hk+1 = Hk Hkyk(Hkyk)T

yT Hkyk+

kTk

Tk yk+ (yTk Hkyk)wkw

Tk ;

rewritten as:

Hk+1 = VTk HkVk + kk

Tk

where k =1

yTk k, and Vk = I ykTk .

Difficulties:

Hk is dense usually

Hk can be ill-conditioned

• Solving Memory Problem L-BFGS

Hk+1 = VTk HkVk + kk

Tk

expand Hk = VTk1Hk1Vk1 + k1k1Tk1

Hk+1 = (VTk V

Tk1)Hk1(Vk1Vk)

+k1V Tk1k1Tk1Vk1 + kTk

expand m times:Hk+1 = (V

Tk V Tkm)Hkm(Vkm Vk)

+km(V Tk V Tkm+1)kmTkm(Vkm+1 Vk)+km+1(V Tk V Tkm+2)km+1Tkm+1(Vkm+2 Vk)

...+kk

Tk .

• Ideas of L-BFGS

Maintain m most recent pairs of and y,P = {(k, yk), (k1, yk1), . . . , (km+1, ykm+1)}

At k-th iteration:If (k m)

P = P {(k, yk)}else

P = (P {(km+1, ykm+1)}) {(k, yk)}end

• Performance of L-BFGS

the two numbers above: iterations/function-evaluations

there numbers below: iteration-time/function-time/total-time

• Choosing Appropriate m

the two numbers above: iterations/function-evaluations

there numbers below: iteration-time/function-time/total-time

• Dealing With Hk

Choose of Hk influences the overall performance.

Options found in literature:

Identity matrix Hk = H0 = I

Identity matrix scaled at first step only Hk =yT0 0y02I

Identity matrix scaled at each step Hk =yTk kyk2I

Diagonal matrix minimize

Hk = argminDdiagDYk1 Sk1F ,where Yk1 = [yk1, . . . , ykm], Sk1 = [k1, . . . , km]

• Scaling Schemes Comparisons

the two numbers above: iterations/function-evaluations

there numbers below: iteration-time/function-time/total-time

• Update for Spare Hessian I

Suppose 2f(x) has a known sparse pattern K:(i, j) K [2f(x)]ij = 0(i, j) K (j, i) K, (for i, j = 1,2, , n)(i, i) K, (for i = 1,2, , n)

Additional desired update property: (i, j) K Bij = 0

N = {X Rnn|Xp = y, X = XT}S = {X Rnn|(i, j) K Xij = 0}V = S N

Row Decomposition B = B + ni=1 eieTi BB = B + ni=1 eieTi

(yyT

pT y BppT B

pT Bp

)= B +

ni=1 ei

(eTi y

pT yyT e

Ti Bp

pT BppTB

)

• Update for Spare Hessian II

Symmetrization

B = B +(y Bp)zT

zTp

B = B + (y Bp)zT + z(y Bp)TzTp

zzT

where is chosen to assure QN condition:

=(y Bp)Tp

(zTp)2

z = y

pTBp/pTy + Bp = BFGSz = y = DFPz = y Bp = SR1z = p = PSB

• Update for Spare Hessian III

Update with desired sparse pattern

B = B +n

i=1

eieTi(y Bp)zTi

zTi p, zi = Diz

where Di is the diagonal matrix with the jth component 1 if Bij = 0,0 if Bij = 0.

Symmetrization

B = B +n

i=1

i(eizTi + zie

Ti )

where i is chosen to assure QN condition, satisfying

S = y Bpwhere S =

ni=1 z

Ti peie

Ti + e

Ti pzie

Ti .

When z = y

pTBp/pTy + Bp, y, y Bp, p, we get a sparse version ofBFGS,DFP,SR1,PSB.

Drawback of symmetrization method

• Update for Spare Hessian IV

Frobenius Norm Projection X2F n

i,j=1 X2ij = Tr(X

TX)

Define B = B + E, and consider problem

min 12E2F = 12Tr(ETE)s.t. Ep = 0

Eij = Bij, (i, j) KE = ET

We get

E =1

2

n

i=1

eieTi (p

Ti + p

Ti ) 2BK

where BK = B if (i, j) K, BK = 0 otherwise, pi = Dip, i = Di,and is chosen to assure QN condition

Store B and r = 2BKp. Compute whole B.

Drawback (actually of all general sparse update) Cant minimize an

Cant guaranteed B to be P.D.

• Consider K = {(i, j)|i = j, i, j = 1, , n}. Then

Bii =y(i)

p(i), i = 1, , n

In general, it is not possible to derive a diagonal and always P.D. update.

Question What are the maximum sparsity requirements we may impose

so that there always exists a P.D. QN update?

• Update for Spare Hessian V

Irreducible(Primitive) Matrices: There is no symmetric permutation of

the rows and columns of the considered spare pattern such that the

permuted pattern is block diagonal with at least two blocks.

Define perturbation matrix E as

Eij = ijpi2 pi(j)pj(i)

For (i, j) K and i < j, define matrix Rij as (P.S.D.)

Rij =

... ... p2(j) p(i)p(j)

... ... p(i)p(j) p2(i)

... ...

Then

E =

(i,j) K,i

• Properties of perturbation matrix E Suppose the given sparse pattern

is irreducible and p(i) = 0, i = 1, , n. Then E satisfies

(i) E is symmetric.

(ii) E has the designated spare pattern.

(iii) Ep = 0.

(iv) For all z = 0 and z = p where R, zTEz > 0.

Existence Theorem of P.D. Sparse Update Suppose the given sparse

pattern is irreducible and p(i) = 0, i = 1, , n. If B+C is any sparseupdate, then there exists a >= 0 such that for all , matrixB = B + C + E is also a sparse update and P.D.

• .......

.......

Unit Surface

p

p

Note 1: p(j) = 0 for all j, sparse pattern is reducible.Note 2: p(j) = 0 for some j, sparse pattern is irreducible.

• I wish I am a Magician

Bksk = f(xk)

Bksk = f(xk)

• Magic Show step (1)

Define:

Gk = span{g0, g1, . . . , gk}

Gk denotes the orthogonal complement of Gk in Rn

Lemma:

Consider the Broyden method applied to a general nonlinear function. If

B0 = I with > 0, then sk Gk for all k. Moreover, if z Gk and w Gk ,then Bkz Gk and Bkw = w.

• Magic Show step (2)

Notation:

rk = dim(Gk)Bk Rnrk forms a basis for Gk

QR factorization:

Bk = ZkTk

where:

Zk Rnrk : an orthonormal basis matrixTk Rrkrk : a nonsingular upper triangular matrix

• Magic Show step (3)

More Notation:

Wk Rn(nrk): an orthonormal basis matrix for Gk

Define:

Qk = (ZkWk)

A transformed Hessian Approximate:

QTk BkQk =

(ZTk BkZk (Z

Tk BkWk)

T

ZTk BkWk WTk BkWk

)=

(ZTk BkZk 0

0 Inrk

)

QTk gk =

(ZTk gk

0

)

• Magic Show step (4)

Bksk = gk

(QTk BkQk)QTk sk = QTk gk

(

ZTk BkZk 00 Inrk

) (ZTkWTk

) (qkqk

)=

(ZTk0

)

ZTk BkZkqk = ZTk gk, sk = Zkqk

• Magic Show step (5)

Fact: ZTk BkZk is positive definite if Bk is positive definite

Cholesky Factorization:

ZTk HkZk = RTk Rk

RTk Rkqk = ZTk gk

Benefits:

Cond(ZTk HkZk) Cond(Hk)

Zk, Rk require less memory

Computing sk requires less time

• Update Zk the orthonormal basis

Step 1): Calculate k+1 = (I ZkZTk )gk+1Step 2): If (k+1 = 0)

Zk+1 = Zk, rk+1 = rk

else

zk+1 =(IZkZTk )gk+1

k+1

Zk+1 = [Zk zk+1]

• Update Rk the Cholesky factor

Want: ZTk+1Bk+1Zk+1 = RTk+1Rk+1

Two-Step Approach:

Expand Step: find Rk such that

(Rk)TRk = ZTk+1BkZk+1

Broyden-Update step: find Rk+1 from Rk such that

ZTk+1Bk+1Zk+1 = RTk+1Rk+1

• Expand step: (Rk)TRk = Zk+1BkZk+1

We know Rk : RTk Rk = Z

Tk BkZk.

ZK+1BkZk+1 =

ZTk BkZk ZTk Bkzk+1

zTk+1BkZk

=

(ZTk BkZk 0

0

)

=

(RTk Rk 0

0

)

=

(Rk 00 1/2

)T (Rk 00 1/2

)

= (Rk)TRk

• Expand Step Continued

For Convenience, we define:

vk = ZTk gk uk = ZTk gk+1 qk = ZTk sk

Update them due to the expansion of Zk

vk = ZTk+1gk =[

vk0

]

uk = ZTk+1gk+1 =[

ukk+1

]

qk = ZTk+1sk =[

qk0

]

• Broyden-Update Step

1. define:k = ZTk+1(xk+1 xk) = qkyk = ZTk+1(gk+1 gk) = uk vk

2. Apply QR factorization to

Rk + w1wT2where:

w1 =1

RkkRkk

w2 =1

((yk)T k)

1/2yk 1Rkk(R

k)

TRkk

• Reduced Hessian Method (Complete Version)

Choose x0 and > 0;

k = 0; rk = 1; gk = (xk);Zk =

gkgk;Rk =

1/2; vk = gkwhile not converged do

Solve RTk dk = vk;Rkqk = dk;sk = Zkqk;

Find k satisfying the Wolfe conditions;

xk+1 = xk + sk; gk+1 = f(xk+1);uk = ZTk gk+1;(Zk+1, rk+1, R

k, u

k, v

k, q

k) = expand(Zk, rk, uk, vk, qk, gk+1, );

(sigmak = kqk; yk = uk vk;Rk+1 = BroydenUpdate(R

k,

k, y

k);

vk+1 = uk; k = k + 1;

end do

• References

Gill, P. E., Golub, G. H., Murray, W. and Saunders, M. A. (1974). Methods formodifying matrix factorizations, Mathematics of Computation 28(126): 505535.

Goldfarb, D. (1976). Factorized variable metric methods for unconstrained opti-mization, Mathematics of Computation 30(136): 796811.

Liu, D. C. and Nocedal, J. (1989). On the limited memory BFGS method for largescale optimization, Mathematical Programming 45: 503 528.

Shanno, D. F. (1980). On variable-metric methods for sparse Hessians, Mathemat-ics of Computation 34(150): 499514.

Toint, P. (1981). A note about sparsity exploiting Quasi-Newton, MathematicalProgramming 21: 172181.

Gill, P. E. and Leonard, M. W. (2001). Reduced-Hessian Quasi-Newton methodsfor unconstrained optimization, SIAM J. OPTIM. 12(1): 209237.