
Lecture 7: Gradient Methods

April 8 - 10, 2020


1 Line search methods

Given a search direction $d^k$ and a step length $\alpha_k$:
$$x^{k+1} = x^k + \alpha_k d^k.$$

Search directions:

Negative gradient: $d^k = -\nabla f(x^k)$

Newton: $d^k = -\nabla^2 f(x^k)^{-1}\nabla f(x^k)$

Quasi-Newton: $d^k = -B_k^{-1}\nabla f(x^k)$

Conjugate gradient ($\beta_k$ ensures $d^{k-1}$ and $d^k$ are conjugate): $d^k = -\nabla f(x^k) + \beta_k d^{k-1}$

Step length: see Numerical Optimization, Section 3.1 (exact or inexact).
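As a concrete illustration of this framework (not from the slides), here is a minimal Python sketch of one line-search step using the negative-gradient direction and a backtracking (Armijo) rule, one standard inexact step-length choice; the names `f`, `grad_f` and the parameter values are illustrative assumptions.

```python
import numpy as np

def line_search_step(f, grad_f, x, c1=1e-4, rho=0.5, alpha0=1.0):
    """One iteration x_{k+1} = x_k + alpha_k d_k with d_k = -grad f(x_k)
    and a backtracking (Armijo) step length."""
    g = grad_f(x)
    d = -g                          # negative-gradient search direction
    alpha = alpha0
    # Backtrack until the sufficient-decrease (Armijo) condition holds.
    while f(x + alpha * d) > f(x) + c1 * alpha * (g @ d):
        alpha *= rho
    return x + alpha * d

# Illustrative usage on f(x) = 0.5*||x||^2.
f = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x = np.array([3.0, -4.0])
for _ in range(5):
    x = line_search_step(f, grad_f, x)
print(x)  # approaches the minimizer 0
```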


2 Gradient Descent (GD) method

Set $d^k = -\nabla f(x^k)$ and $\alpha_k = \frac{1}{L}$:
$$x^{k+1} = x^k - \frac{1}{L}\nabla f(x^k), \qquad k = 0, 1, 2, \ldots$$

If $f$ is Lipschitz continuously differentiable with constant $L$, then (see Lecture 6, Page 11)
$$f(x + \alpha d) \le f(x) + \alpha \nabla f(x)^T d + \frac{\alpha^2 L}{2}\|d\|^2.$$

This yields
$$f(x^{k+1}) = f\Big(x^k - \tfrac{1}{L}\nabla f(x^k)\Big) \le f(x^k) - \frac{1}{2L}\|\nabla f(x^k)\|^2,$$
which implies $\{f(x^k)\}$ is a nonincreasing sequence.
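The following short Python sketch (an illustration, not part of the slides) implements this fixed-step iteration; `grad_f` and the Lipschitz constant `L` are assumed to be supplied by the user.

```python
import numpy as np

def gradient_descent(grad_f, x0, L, num_iters=100):
    """Gradient descent with the constant step length alpha_k = 1/L."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - (1.0 / L) * grad_f(x)   # x^{k+1} = x^k - (1/L) grad f(x^k)
    return x

# Example: f(x) = 0.5 x^T A x - b^T x, so grad f(x) = A x - b and L = ||A||_2.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
L = np.linalg.norm(A, 2)
x = gradient_descent(lambda x: A @ x - b, np.zeros(2), L, num_iters=200)
print(np.linalg.norm(x - np.linalg.solve(A, b)))  # close to 0
```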


Theorem 1 (Sublinear convergence of GD, $O(1/\sqrt{k})$)

Suppose $f$ is Lipschitz continuously differentiable with constant $L$ and there exists a constant $\bar f$ satisfying $f(x) \ge \bar f$ for all $x$. Gradient descent with $\alpha_k \equiv 1/L$ generates a sequence $\{x^k\}_{k=0}^\infty$ that satisfies
$$\min_{0 \le k \le K-1} \|\nabla f(x^k)\| \le \sqrt{\frac{2L\big(f(x^0) - f(x^K)\big)}{K}} \le \sqrt{\frac{2L\big(f(x^0) - \bar f\big)}{K}}.$$

Proof. The statement is a direct result of
$$\sum_{k=0}^{K-1} \|\nabla f(x^k)\|^2 \le 2L \sum_{k=0}^{K-1} \big(f(x^k) - f(x^{k+1})\big) = 2L\big(f(x^0) - f(x^K)\big)$$
and
$$\min_{0 \le k \le K-1} \|\nabla f(x^k)\| = \sqrt{\min_{0 \le k \le K-1} \|\nabla f(x^k)\|^2} \le \sqrt{\frac{1}{K}\sum_{k=0}^{K-1} \|\nabla f(x^k)\|^2}.$$


Corollary 2

Any accumulation point of $\{x^k\}_{k=0}^\infty$ in Theorem 1 is stationary.

Theorem 3 (Sublinear convergence of GD, $O(1/k)$)

Suppose $f$ is convex and Lipschitz continuously differentiable with constant $L$, and that $\min_{x\in\mathbb{R}^n} f(x)$ has a solution $x_\star$. Gradient descent with $\alpha_k \equiv 1/L$ generates a sequence $\{x^k\}_{k=0}^\infty$ that satisfies
$$f(x^k) - f(x_\star) \le \frac{L\|x^0 - x_\star\|^2}{2k}.$$

Theorem 4 (Linear convergence of GD, $O((1-\gamma/L)^k)$)

Suppose $f$ is Lipschitz continuously differentiable with constant $L$ and strongly convex with modulus of convexity $\gamma$. Then $f$ has a unique minimizer $x_\star$, and gradient descent with $\alpha_k \equiv 1/L$ generates a sequence $\{x^k\}_{k=0}^\infty$ that satisfies
$$f(x^{k+1}) - f(x_\star) \le (1 - \gamma/L)\,\big(f(x^k) - f(x_\star)\big), \qquad k = 0, 1, 2, \ldots$$
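As a quick numerical illustration of Theorem 4 (my own check, not part of the slides), the ratio of successive optimality gaps for GD on a strongly convex quadratic should stay below $1 - \gamma/L$, where $\gamma$ and $L$ are the extreme eigenvalues of the Hessian.

```python
import numpy as np

# Strongly convex quadratic: f(x) = 0.5 x^T A x, minimizer x_* = 0.
A = np.diag([1.0, 10.0])          # eigenvalues: gamma = 1, L = 10
gamma, L = 1.0, 10.0
f = lambda x: 0.5 * x @ A @ x

x = np.array([1.0, 1.0])
gap_prev = f(x)
for k in range(5):
    x = x - (1.0 / L) * (A @ x)   # GD step with alpha = 1/L
    gap = f(x)
    print(gap / gap_prev, "<=", 1 - gamma / L)   # observed ratio vs. the bound
    gap_prev = gap
```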


3 Descent Direction (DD) method

Choose $d^k$ to be a descent direction, that is,
$$\nabla f(x^k)^T d^k < 0,$$
which ensures that
$$f(x^k + \alpha d^k) < f(x^k)$$
for sufficiently small $\alpha > 0$.

The line search method
$$x^{k+1} = x^k + \alpha_k d^k$$
with a descent direction $d^k$ and sufficiently small $\alpha_k > 0$ works. Gradient descent is a special case.


Theorem 5 (Sublinear convergence of DD, $O(1/\sqrt{k})$)

Suppose $f$ is Lipschitz continuously differentiable with constant $L$ and $f$ is bounded below by a constant $\bar f$, i.e., $f(x) \ge \bar f$. Consider the line search method where $d^k$ satisfies
$$\nabla f(x^k)^T d^k \le -\eta\,\|\nabla f(x^k)\|\,\|d^k\|$$
for some $\eta > 0$, and $\alpha_k > 0$ satisfies the weak Wolfe conditions
$$f(x^k + \alpha_k d^k) \le f(x^k) + c_1 \alpha_k \nabla f(x^k)^T d^k,$$
$$\nabla f(x^k + \alpha_k d^k)^T d^k \ge c_2 \nabla f(x^k)^T d^k,$$
at all $k$, for some constants $c_1$ and $c_2$ satisfying $0 < c_1 < c_2 < 1$. Then for any integer $K \ge 1$ we have
$$\min_{0 \le k \le K-1} \|\nabla f(x^k)\| \le \sqrt{\frac{L}{\eta^2 c_1 (1 - c_2)}}\;\sqrt{\frac{f(x^0) - \bar f}{K}}.$$
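To make the weak Wolfe conditions concrete, here is a Python sketch of a standard bisection procedure that returns a step length satisfying both conditions (a common choice, not necessarily the one intended by the slides); `f`, `grad_f`, the direction `d`, and the constants are illustrative assumptions.

```python
import numpy as np

def weak_wolfe_step(f, grad_f, x, d, c1=1e-4, c2=0.9, max_iter=50):
    """Bisection search for alpha satisfying the weak Wolfe conditions:
       f(x + a d) <= f(x) + c1 a g^T d    (sufficient decrease)
       grad f(x + a d)^T d >= c2 g^T d    (curvature)."""
    g0_d = grad_f(x) @ d
    f0 = f(x)
    lo, hi, alpha = 0.0, np.inf, 1.0
    for _ in range(max_iter):
        if f(x + alpha * d) > f0 + c1 * alpha * g0_d:
            hi = alpha                     # sufficient decrease fails: shrink
            alpha = 0.5 * (lo + hi)
        elif grad_f(x + alpha * d) @ d < c2 * g0_d:
            lo = alpha                     # curvature fails: grow
            alpha = 2 * lo if np.isinf(hi) else 0.5 * (lo + hi)
        else:
            return alpha                   # both weak Wolfe conditions hold
    return alpha

# Illustrative usage on f(x) = 0.5*||x||^2 with the steepest-descent direction.
f = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x = np.array([2.0, 0.0])
print(weak_wolfe_step(f, grad_f, x, -grad_f(x)))
```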


[Figure: step lengths satisfying the weak Wolfe conditions]


Corollary 6

Any accumulation point of $\{x^k\}_{k=0}^\infty$ in Theorem 5 is stationary.

Remark 1

Let $\{x^k\}_{k=0}^\infty$ be the sequence in Theorem 5.

If $f$ is convex, then any accumulation point of $\{x^k\}_{k=0}^\infty$ is a solution of $\min_{x\in\mathbb{R}^n} f(x)$.

If $f$ is nonconvex, then accumulation points of $\{x^k\}_{k=0}^\infty$ may be local minimizers, saddle points, or local maximizers.

4 Frank-Wolfe (FW) method (conditional gradient)

Consider the optimization problem
$$\min_{x\in\Omega} f(x),$$
where $\Omega$ is compact and convex, and $f$ is convex and differentiable.


The conditional gradient method is as follows:
$$v^k = \arg\min_{v\in\Omega} v^T \nabla f(x^k),$$
$$x^{k+1} = x^k + \alpha_k (v^k - x^k), \qquad \alpha_k = \frac{2}{k+2}.$$
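Below is a minimal Python sketch of this method for the illustrative case $\Omega = \{x : \|x\|_1 \le \tau\}$ (an $\ell_1$ ball, my own choice of feasible set), where the linear subproblem is solved in closed form by a signed coordinate vector.

```python
import numpy as np

def frank_wolfe_l1(grad_f, x0, tau, num_iters=100):
    """Frank-Wolfe (conditional gradient) on the l1 ball {x : ||x||_1 <= tau}."""
    x = np.asarray(x0, dtype=float)
    for k in range(num_iters):
        g = grad_f(x)
        # Linear oracle: v = argmin_{||v||_1 <= tau} v^T g  (a signed vertex)
        i = np.argmax(np.abs(g))
        v = np.zeros_like(x)
        v[i] = -tau * np.sign(g[i])
        alpha = 2.0 / (k + 2)           # the step length from the slides
        x = x + alpha * (v - x)
    return x

# Example: minimize 0.5*||x - c||^2 over the l1 ball of radius 1.
c = np.array([2.0, 0.5])
x = frank_wolfe_l1(lambda x: x - c, np.zeros(2), tau=1.0, num_iters=500)
print(x)  # approaches the l1-ball projection of c, i.e. approximately [1, 0]
```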

Theorem 7 (Sublinear convergence of FW, $O(1/k)$)

Suppose $\Omega$ is a compact convex set with diameter $D$, i.e.,
$$\|x - y\| \le D \quad \text{for all } x, y \in \Omega.$$
Suppose that $f$ is convex and Lipschitz continuously differentiable in a neighborhood of $\Omega$ with Lipschitz constant $L$. FW generates a sequence $\{x^k\}_{k=0}^\infty$ that satisfies
$$f(x^k) - f(x_\star) \le \frac{2LD^2}{k+2}, \qquad k = 1, 2, \ldots,$$
where $x_\star$ is any solution of $\min_{x\in\Omega} f(x)$.


5 The key idea of acceleration is momentum

Consider an iterate that uses all previously encountered gradient information:
$$x^{k+1} = x^k - \alpha_k \nabla f(\hat x^k) - \sum_{i=0}^{k-1} \mu_{ki} \nabla f(\hat x^i).$$
Due to this extra flexibility, it may yield better convergence.

This is the foundation of momentum methods:
$$x^{k+1} = x^k - \alpha_k \nabla f(\hat x^k) + \beta_k \cdot \text{Momentum}.$$


5.1 Heavy-Ball (HB) method

Each iteration of this method has the form
$$x^{k+1} = x^k - \alpha_k \nabla f(x^k) + \beta_k (x^k - x^{k-1}),$$
where $\alpha_k > 0$ and $\beta_k > 0$ (a two-step method).

The HB method is not a descent method: usually $f(x^{k+1}) > f(x^k)$ for many $k$. This property is shared by other momentum methods.

Example. Consider the strongly convex quadratic function
$$\min_{x\in\mathbb{R}^n} \left\{ f(x) = \frac{1}{2} x^T A x - b^T x \right\},$$
where the (constant) Hessian $A$ has eigenvalues in the range $[\gamma, L]$ with $0 < \gamma \le L$. Let
$$\alpha_k = \alpha = \frac{4}{(\sqrt{L} + \sqrt{\gamma})^2}, \qquad \beta_k = \beta = \frac{\sqrt{L} - \sqrt{\gamma}}{\sqrt{L} + \sqrt{\gamma}}.$$
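A small Python sketch of the heavy-ball iteration with exactly these $\alpha$ and $\beta$ on such a quadratic (an illustration, under the assumption that $\gamma$ and $L$ are the extreme eigenvalues of $A$):

```python
import numpy as np

def heavy_ball_quadratic(A, b, x0, num_iters=200):
    """Heavy-ball: x^{k+1} = x^k - alpha*grad f(x^k) + beta*(x^k - x^{k-1})
    for f(x) = 0.5 x^T A x - b^T x, with the alpha, beta from the slides."""
    eigvals = np.linalg.eigvalsh(A)
    gamma, L = eigvals[0], eigvals[-1]           # extreme eigenvalues of the Hessian
    alpha = 4.0 / (np.sqrt(L) + np.sqrt(gamma)) ** 2
    beta = (np.sqrt(L) - np.sqrt(gamma)) / (np.sqrt(L) + np.sqrt(gamma))
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        grad = A @ x - b
        x, x_prev = x - alpha * grad + beta * (x - x_prev), x
    return x

# Illustrative usage.
A = np.diag([1.0, 100.0])
b = np.array([1.0, 1.0])
x = heavy_ball_quadratic(A, b, np.zeros(2))
print(np.linalg.norm(x - np.linalg.solve(A, b)))  # small residual
```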


It can be shown that
$$\|x^k - x_\star\| \le C\beta^k,$$
which further yields
$$f(x^k) - f(x_\star) \le \frac{LC^2}{2}\,\beta^{2k}.$$

If $L \gg \gamma$, we have
$$\beta \approx 1 - 2\sqrt{\gamma/L}.$$
Therefore the complexity is
$$O\big(\sqrt{L/\gamma}\,\log(1/\varepsilon)\big).$$

Convergence of GD:
$$f(x^k) - f(x_\star) \le (1 - \gamma/L)^k\,\big(f(x^0) - f(x_\star)\big).$$
Complexity of GD:
$$O\big((L/\gamma)\,\log(1/\varepsilon)\big).$$


5.2 Conjugate Gradient (CG) method

Given SPD $A \in \mathbb{R}^{n\times n}$:
$$Ax = b \quad \text{``}\Longleftrightarrow\text{''} \quad \min_{x\in\mathbb{R}^n} \frac{1}{2} x^T A x - b^T x.$$

Steepest Descent iteration:
$$x^{k+1} = x^k - \frac{(Ax^k - b)^T (Ax^k - b)}{(Ax^k - b)^T A (Ax^k - b)}\,(Ax^k - b).$$

CG iteration:
$$x^{k+1} = x^k + \alpha_k p^k, \quad \text{where } p^k = -\nabla f(x^k) + \xi_k p^{k-1}.$$
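For completeness, a standard textbook implementation of CG for an SPD system in Python (a generic sketch using the conventional residual-based formulas for $\alpha_k$ and for the conjugacy coefficient denoted $\xi_k$ above):

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Conjugate Gradient for Ax = b with A symmetric positive definite."""
    n = len(b)
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    r = b - A @ x                      # residual = -grad f(x)
    p = r.copy()                       # first search direction
    rs_old = r @ r
    for _ in range(max_iter or n):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)      # exact step length along p
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p  # new direction, A-conjugate to the previous one
        rs_old = rs_new
    return x

# Illustrative usage.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))        # approximately [0.0909, 0.6364]
```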


5.3 Nesterov's Accelerated Gradient (NAG)

NAG iteration:
$$x^{k+1} = x^k - \alpha_k \nabla f\big(x^k + \beta_k(x^k - x^{k-1})\big) + \beta_k(x^k - x^{k-1}).$$

Theorem 8 (Sublinear convergence of NAG, $O(1/k^2)$)

Suppose $f$ is convex and Lipschitz continuously differentiable with constant $L$. Suppose the minimum of $f$ is attained at $x_\star$. NAG with $x^0$, $x^1 = x^0 - \nabla f(x^0)/L$, $\alpha_k = 1/L$, and $\beta_k$ defined as follows,
$$\lambda_0 = 0, \qquad \lambda_{k+1} = \frac{1}{2}\Big(1 + \sqrt{1 + 4\lambda_k^2}\Big), \qquad \beta_k = \frac{\lambda_k - 1}{\lambda_{k+1}},$$
yields $\{x^k\}_{k=0}^\infty$ that satisfies
$$f(x^k) - f(x_\star) \le \frac{2L\|x^0 - x_\star\|^2}{(k+1)^2}, \qquad k = 1, 2, \ldots$$
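A minimal Python sketch of this NAG variant (my own transcription of the iteration and the $\lambda$-sequence above; `grad_f` and `L` are assumed given):

```python
import numpy as np

def nag_convex(grad_f, x0, L, num_iters=100):
    """Nesterov's accelerated gradient for convex f with alpha_k = 1/L and
    beta_k = (lambda_k - 1)/lambda_{k+1}, starting from lambda_0 = 0."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev - grad_f(x_prev) / L            # x^1 = x^0 - grad f(x^0)/L
    lam = 1.0                                  # lambda_1 = (1 + sqrt(1 + 4*0^2))/2
    for _ in range(1, num_iters):
        lam_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * lam ** 2))
        beta = (lam - 1.0) / lam_next
        y = x + beta * (x - x_prev)            # extrapolated point
        x, x_prev = y - grad_f(y) / L, x       # x^{k+1}
        lam = lam_next
    return x

# Illustrative usage on a quadratic.
A = np.diag([1.0, 50.0]); b = np.array([1.0, 1.0])
L = 50.0
x = nag_convex(lambda x: A @ x - b, np.zeros(2), L, num_iters=300)
print(np.linalg.norm(x - np.linalg.solve(A, b)))  # small
```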


Theorem 9 (Linear convergence of NAG, $O((1-\sqrt{\gamma/L})^k)$)

Suppose $f$ is Lipschitz continuously differentiable with constant $L$ and strongly convex with modulus of convexity $\gamma$. Suppose the unique minimizer of $f$ is $x_\star$. NAG with $x^0$,
$$x^1 = x^0 - \frac{1}{L}\nabla f(x^0),$$
and
$$\alpha_k = \frac{1}{L}, \qquad \beta_k = \frac{\sqrt{L} - \sqrt{\gamma}}{\sqrt{L} + \sqrt{\gamma}},$$
yields $\{x^k\}_{k=0}^\infty$ that satisfies
$$f(x^k) - f(x_\star) \le \frac{L + \gamma}{2}\,\|x^0 - x_\star\|^2 \left(1 - \sqrt{\frac{\gamma}{L}}\right)^k, \qquad k = 1, 2, \ldots$$


5.4 "Optimality" of NAG

NAG is "optimal" because the convergence rate of NAG is the best possible (possibly up to a constant) among algorithms that make use of all encountered gradient information.

Example. Consider the problem, with A = toeplitz([2, -1, 0, ..., 0]),
$$\min_{x\in\mathbb{R}^n} f(x) = \frac{1}{2} x^T A x - e_1^T x.$$
The iteration
$$x^0 = 0, \qquad x^{k+1} = x^k + \sum_{j=0}^{k} \xi_j \nabla f(x^j)$$
yields
$$f(x^k) - f(x_\star) \ge \frac{3\|A\|_2}{32(k+1)^2}\,\|x^0 - x_\star\|^2, \qquad k = 1, 2, \ldots, \frac{n}{2} - 1.$$
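The Python snippet below (assuming scipy is available) just constructs this worst-case example and evaluates the right-hand side of the lower bound as reconstructed above; it is meant only to make the MATLAB-style `toeplitz` notation concrete.

```python
import numpy as np
from scipy.linalg import toeplitz

n = 20
col = np.zeros(n)
col[0], col[1] = 2.0, -1.0
A = toeplitz(col)                    # tridiagonal: 2 on the diagonal, -1 off it
e1 = np.zeros(n); e1[0] = 1.0

f = lambda x: 0.5 * x @ A @ x - e1 @ x
x_star = np.linalg.solve(A, e1)      # minimizer of f
x0 = np.zeros(n)

L2 = np.linalg.norm(A, 2)            # spectral norm ||A||_2
for k in [1, 2, 5, n // 2 - 1]:
    lower = 3 * L2 / (32 * (k + 1) ** 2) * np.linalg.norm(x0 - x_star) ** 2
    print(k, lower)                  # lower bound on f(x^k) - f(x_star)
```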


6 Prox-Gradient (PG) method

Consider the regularized optimization problem
$$\min_{x\in\mathbb{R}^n} \phi(x) = f(x) + \lambda\psi(x),$$
where $f$ is smooth and convex, $\psi$ is convex, and $\lambda \ge 0$.

Each step of the prox-gradient method is defined as follows:
$$x^{k+1} = \operatorname{prox}_{\alpha_k\lambda\psi}\big(x^k - \alpha_k\nabla f(x^k)\big)$$
for some step length $\alpha_k > 0$, where
$$\operatorname{prox}_{\alpha_k\lambda\psi}(x) = \arg\min_{u}\left\{ \alpha_k\lambda\psi(u) + \frac{1}{2}\|u - x\|^2 \right\}.$$

$x^{k+1}$ is the solution of a quadratic approximation to $\phi(x)$:
$$x^{k+1} = \arg\min_{u}\; \nabla f(x^k)^T(u - x^k) + \frac{1}{2\alpha_k}\|u - x^k\|^2 + \lambda\psi(u).$$
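For the common special case $\psi(x) = \|x\|_1$, the prox operator has the closed-form soft-thresholding solution and the method becomes ISTA; here is a minimal Python sketch under that assumption (the slides do not fix a particular $\psi$).

```python
import numpy as np

def soft_threshold(z, t):
    """prox of t*||.||_1: componentwise sign(z)*max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_gradient_l1(grad_f, x0, L, lam, num_iters=200):
    """Prox-gradient (ISTA) for min f(x) + lam*||x||_1 with alpha_k = 1/L."""
    x = np.asarray(x0, dtype=float)
    alpha = 1.0 / L
    for _ in range(num_iters):
        x = soft_threshold(x - alpha * grad_f(x), alpha * lam)
    return x

# Illustrative usage: f(x) = 0.5*||Bx - c||^2, so grad f(x) = B^T(Bx - c), L = ||B||_2^2.
rng = np.random.default_rng(0)
B = rng.standard_normal((30, 10)); c = rng.standard_normal(30)
L = np.linalg.norm(B, 2) ** 2
x = prox_gradient_l1(lambda x: B.T @ (B @ x - c), np.zeros(10), L, lam=0.5)
print(x)  # a (typically sparse) solution
```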


Define
$$G_\alpha(x) = \frac{1}{\alpha}\big(x - \operatorname{prox}_{\alpha\lambda\psi}(x - \alpha\nabla f(x))\big), \qquad \alpha > 0.$$
Then at step $k$ of the prox-gradient method,
$$x^{k+1} = x^k - \alpha_k G_{\alpha_k}(x^k).$$

Lemma 10

Suppose that $\psi$ is a closed convex function and that $f$ is convex and Lipschitz continuously differentiable with constant $L$. Then

(a) $G_\alpha(x) \in \nabla f(x) + \lambda\,\partial\psi\big(x - \alpha G_\alpha(x)\big)$;

(b) For any $z$ and any $\alpha \in (0, 1/L]$, we have that
$$\phi\big(x - \alpha G_\alpha(x)\big) \le \phi(z) + G_\alpha(x)^T(x - z) - \frac{\alpha}{2}\|G_\alpha(x)\|^2.$$


Theorem 11 (Sublinear convergence of PG, $O(1/k)$)

Suppose that $\psi$ is a closed convex function and that $f$ is convex and Lipschitz continuously differentiable with constant $L$. Suppose that
$$\min_{x\in\mathbb{R}^n} \phi(x) = f(x) + \lambda\psi(x), \qquad \lambda \ge 0,$$
attains a minimizer $x_\star$ with optimal objective value $\phi_\star$. Then if
$$\alpha_k = \frac{1}{L}$$
for all $k$ in the prox-gradient method, we have
$$\phi(x^k) - \phi_\star \le \frac{L\|x^0 - x_\star\|^2}{2k}, \qquad k = 1, 2, \ldots$$


Page 2: Lecture 7: Gradient Methodsmath.xmu.edu.cn › group › nona › damc › Lecture07.pdf · Lecture 7: Gradient Methods April 8 - 10, 2020 Gradient Methods Lecture 7 April 8 - 10,

1 Line search methods

Given search direction dk and step length αk

xk+1 = xk + αkdk

Search directions

Negative gradientdk = minusnablaf(xk)

Newtondk = minusnabla2f(xk)minus1nablaf(xk)

Quasi-Newtondk = minusBminus1

k nablaf(xk)

Conjugate gradient (βk ensures dkminus1 and dk are conjugate)

dk = minusnablaf(xk) + βkdkminus1

Step length Numerical Optimization sect 31 (exact inexact)

Gradient Methods Lecture 7 April 8 - 10 2020 2 20

2 Gradient Descent (GD) method

Set dk = minusnablaf(xk) and αk =1

L

xk+1 = xk minus 1

Lnablaf(xk) k = 0 1 2

If f is Lipschitz continuously differentiable with constant L then(see Lecture 6 Page 11)

f(x+ αd) le f(x) + αnablaf(x)Td+ α2L

2983042d9830422

This yields

f(xk+1) = f

983061xk minus 1

Lnablaf(xk)

983062le f(xk)minus 1

2L983042nablaf(xk)9830422

which implies f(xk) is a nonincreasing sequence

Gradient Methods Lecture 7 April 8 - 10 2020 3 20

Theorem 1 (Sublinear convergence of GD O(1radick))

Suppose f is Lipschitz continuously differentiable with constant L andthere exists a constant f satisfying f(x) ge f Gradient descent with

αk equiv 1L generates a sequence xkinfin0 that satisfies

min0lekleKminus1

983042nablaf(xk)983042 le983157

2L(f(x0)minus f(xK))

Kle

9831582L(f(x0)minus f)

K

Proof The statement is a direct result of

Kminus1983131

k=0

983042nablaf(xk)9830422 le 2L

Kminus1983131

k=0

(f(xk)minus f(xk+1)) = 2L(f(x0)minus f(xK))

and

min0lekleKminus1

983042nablaf(xk)983042 =983157

min0lekleKminus1

983042nablaf(xk)9830422 le

983161983160983160983159 1

K

Kminus1983131

k=0

983042nablaf(xk)9830422

Gradient Methods Lecture 7 April 8 - 10 2020 4 20

Corollary 2

Any accumulation point of xkinfin0 in Theorem 1 is stationary

Theorem 3 (Sublinear convergence of GD O(1k))

Suppose f is convex and Lipschitz continuously differentiable withconstant L and that minxisinRn f(x) has a solution x983183 Gradient descentwith αk equiv 1L generates a sequence xkinfin0 that satisfies

f(xk)minus f(x983183) le L983042x0 minus x9831839830422(2k)

Theorem 4 (Linear convergence of GD O((1minus γL)k))

Suppose f is Lipschitz continuously differentiable with constant L andstrongly convex with modulus of convexity γ Then f has a uniqueminimizer x983183 and gradient descent with αk equiv 1L generates a sequencexkinfin0 that satisfies

f(xk+1)minus f(x983183) le (1minus γL) (f(xk)minus f(x983183)) k = 0 1 2

Gradient Methods Lecture 7 April 8 - 10 2020 5 20

3 Descent Direction (DD) method

Choose dk to be a descent direction that is

nablaf(xk)Tdk lt 0

which ensures that

f(xk + αdk) lt f(xk)

for sufficiently small α gt 0

Line search methodxk+1 = xk + αkd

k

with a descent direction dk and sufficiently small αk gt 0 worksGradient descent is a special case

Gradient Methods Lecture 7 April 8 - 10 2020 6 20

Theorem 5 (Sublinear convergence of DD O(1radick))

Suppose f is Lipschitz continuously differentiable with constant L andf is bounded below by a constant f ie f(x) ge f Consider the linear

search method where dk satisfies

nablaf(xk)Tdk le minusη983042nablaf(xk)983042983042dk983042

for some η gt 0 and αk gt 0 satisfies the weak Wolfe conditions

f(xk + αkdk) le f(xk) + c1αknablaf(xk)Tdk

nablaf(xk + αkdk)Tdk ge c2nablaf(xk)Tdk

at all k for some constants c1 and c2 satisfying 0 lt c1 lt c2 lt 1 Thenfor any integer K ge 1 we have

min0lekleKminus1

983042nablaf(xk)983042 le983158

L

η2c1(1minus c2)

983158f(x0)minus f

K

Gradient Methods Lecture 7 April 8 - 10 2020 7 20

Step lengths satisfying the weak Wolfe conditions

Gradient Methods Lecture 7 April 8 - 10 2020 8 20

Corollary 6

Any accumulation point of xkinfin0 in Theorem 5 is stationary

Remark 1

Let xkinfin0 be the sequence in Theorem 5

If f is convex then any accumulation point of xkinfin0 is a solutionof minxisinRn f(x)

If f is nonconvex then accumulation points of xkinfin0 may be localminimizers saddle points or local maximizier

4 Frank-Wolfe (FW) method conditional gradient

Consider the optimization problem

minxisinΩ

f(x)

where Ω is compact and convex and f is convex and differentiable

Gradient Methods Lecture 7 April 8 - 10 2020 9 20

The conditional gradient method is as follows

vk = argminvisinΩ

vTnablaf(xk)

xk+1 = xk + αk(vk minus xk) αk =

2

k + 2

Theorem 7 (Sublinear convergence of FW O(1k))

Suppose Ω is a compact convex set with diameter D ie

983042xminus y983042 le D for all xy isin Ω

Suppose that f is convex and Lipschitz continuously differentiable in aneighborhood of Ω with Lipschitz constant L FW generates a sequencexkinfin0 that satisfies

f(xk)minus f(x983183) le2LD2

k + 2 k = 1 2

where x983183 is any solution of minxisinΩ f(x)

Gradient Methods Lecture 7 April 8 - 10 2020 10 20

5 The key idea of acceleration is momentum

Consider the iterate

xk+1 = xk minus αknablaf(983144xk)minuskminus1983131

i=0

microkinablaf(983144xi)

Due to more flexibility it may yield better convergence

This is the foundation of momentum method

xk+1 = xkminusαknablaf(983144xk) + βkMomentum

Gradient Methods Lecture 7 April 8 - 10 2020 11 20

51 Heavy-Ball (HB) method

Each iteration of this method has the form

xk+1 = xkminusαknablaf(xk) + βk(xk minus xkminus1)

where αk gt 0 and βk gt 0 (Two-step method)

HB method is not a descent method usually f(xk+1) gt f(xk) formany k This property is shared by other momentum methods

Example Consider the strongly convex quadratic function

minxisinRn

983069f(x) =

1

2xTAxminus bTx

983070

where the (constant) Hessian A has eigenvalues in the range [γ L]with 0 lt γ le L Let

αk = α =4

(radicL+

radicγ)2

βk = β =

radicLminusradic

γradicL+

radicγ

Gradient Methods Lecture 7 April 8 - 10 2020 12 20

It can be shown that

983042xk minus x983183983042 le Cβk

which further yields

f(xk)minus f(x983183) leLC2

2β2k

If L ≫ γ we haveβ asymp 1minus 2

983155γL

Therefore the complexity is

O(983155

Lγ log(1ε))

Convergence of GD

f(xk)minus f(x983183) le (1minus γL)k(f(x0)minus f(x983183))

Complexity of GD is

O((Lγ) log(1ε))

Gradient Methods Lecture 7 April 8 - 10 2020 13 20

52 Conjugate Gradient (CG) method

Given SPD A isin Rntimesn

Ax = bldquo hArr rdquo minxisinRn

1

2xTAxminus bTx

Steepest Descent iteration

xk+1 = xk minus (Axk minus b)T(Axk minus b)

(Axk minus b)TA(Axk minus b)(Axk minus b)

CG iteration

xk+1 = xk + αkpk where pk = minusnablaf(xk) + ξkp

kminus1

Gradient Methods Lecture 7 April 8 - 10 2020 14 20

53 Nesterovrsquos Accelerated Gradient (NAG)

NAG iteration

xk+1 = xk minus αknablaf(xk + βk(xk minus xkminus1)) + βk(x

k minus xkminus1)

Theorem 8 (Sublinear convergence of NAG O(1k2))

Suppose f is convex and Lipschitz continuously differentiable withconstant L Suppose the minimum of f is attained at x983183 NAG withx0 x1 = x0 minusnablaf(x0)L αk = 1L and βk defined as follows

λ0 = 0 λk+1 =1

2

9830611 +

9831561 + 4λ2

k

983062 βk =

λk minus 1

λk+1

yields xkinfin0 that satisfies

f(xk)minus f(x983183) le2L983042x0 minus x9831839830422

(k + 1)2 k = 1 2

Gradient Methods Lecture 7 April 8 - 10 2020 15 20

Theorem 9 (Linear convergence of NAG O((1minus983155

γL)k))

Suppose f is Lipschitz continuously differentiable with constant L andstrongly convex with modulus of convexity γ Suppose the uniqueminimizer of f is x983183 NAG with x0

x1 = x0 minus 1

Lnablaf(x0)

and

αk =1

L βk =

radicLminusradic

γradicL+

radicγ

yields xkinfin0 that satisfies

f(xk)minus f(x983183) leL+ γ

2983042x0 minus x9831839830422

9830611minus

983157γ

L

983062k

k = 1 2

Gradient Methods Lecture 7 April 8 - 10 2020 16 20

54 ldquoOptimalityrdquo of NAG

NAG is ldquooptimalrdquo because the convergence rate of NAG is thebest possible (possibly up to a constant) among algorithms thatmake use of all encountered gradient information

Example Consider the problem A=toeplitz([2-10middot middot middot 0])

minxisinRn

f(x) =1

2xTAxminus eT1 x

The iteration

x0 = 0 xk+1 = xk +

k983131

j=0

ξjnablaf(xj)

yields

f(xk)minus f(x983183) ge3983042A9830422

32(k + 1)2983042x0 minus x9831839830422 k = 1 2

n

2minus 1

Gradient Methods Lecture 7 April 8 - 10 2020 17 20

6 Prox-Gradient (PG) method

Consider the regularized optimization problem

minxisinRn

φ(x) = f(x) + λψ(x)

where f smooth and convex ψ is convex and λ ge 0

Each step of the prox-gradient method is defined as follows

xk+1 = proxαkλψ(xk minus αknablaf(xk))

for some step length αk gt 0 and

proxαkλψ(x) = argmin

u

983069αkλψ(u) +

1

2983042uminus x9830422

983070

xk+1 is the solution of a quadratic approximation to φ(x)

xk+1 = argminu

nablaf(xk)T(uminus xk) +1

2αk983042uminus xk9830422 + λψ(u)

Gradient Methods Lecture 7 April 8 - 10 2020 18 20

Define

Gα(x) =1

α

983043xminus proxαλψ (xminus αnablaf(x))

983044 α gt 0

Then at step k of prox-gradient method

xk+1 = xk minus αkGαk(xk)

Lemma 10

Suppose that ψ is a closed convex function and that f is convex andLipschitz continuously differentiable with constant L Then

(a) Gα(x) isin nablaf(x) + λpartψ(xminus αGα(x))

(b) For any z and any α isin (0 1L] we have that

φ(xminus αGα(x)) le φ(z) +Gα(x)T(xminus z)minus α

2983042Gα(x)9830422

Gradient Methods Lecture 7 April 8 - 10 2020 19 20

Theorem 11 (Sublinear convergence of PG O(1k))

Suppose that ψ is a closed convex function and that f is convex andLipschitz continuously differentiable with constant L Suppose that

minxisinRn

φ(x) = f(x) + λψ(x) λ ge 0

attains a minimizer x983183 with optimal objective value φ983183 Then if

αk =1

L

for all k in the prox-gradient method we have

φ(xk)minus φ983183 leL983042x0 minus x9831839830422

2k k = 1 2

Gradient Methods Lecture 7 April 8 - 10 2020 20 20

Page 3: Lecture 7: Gradient Methodsmath.xmu.edu.cn › group › nona › damc › Lecture07.pdf · Lecture 7: Gradient Methods April 8 - 10, 2020 Gradient Methods Lecture 7 April 8 - 10,

2 Gradient Descent (GD) method

Set dk = minusnablaf(xk) and αk =1

L

xk+1 = xk minus 1

Lnablaf(xk) k = 0 1 2

If f is Lipschitz continuously differentiable with constant L then(see Lecture 6 Page 11)

f(x+ αd) le f(x) + αnablaf(x)Td+ α2L

2983042d9830422

This yields

f(xk+1) = f

983061xk minus 1

Lnablaf(xk)

983062le f(xk)minus 1

2L983042nablaf(xk)9830422

which implies f(xk) is a nonincreasing sequence

Gradient Methods Lecture 7 April 8 - 10 2020 3 20

Theorem 1 (Sublinear convergence of GD O(1radick))

Suppose f is Lipschitz continuously differentiable with constant L andthere exists a constant f satisfying f(x) ge f Gradient descent with

αk equiv 1L generates a sequence xkinfin0 that satisfies

min0lekleKminus1

983042nablaf(xk)983042 le983157

2L(f(x0)minus f(xK))

Kle

9831582L(f(x0)minus f)

K

Proof The statement is a direct result of

Kminus1983131

k=0

983042nablaf(xk)9830422 le 2L

Kminus1983131

k=0

(f(xk)minus f(xk+1)) = 2L(f(x0)minus f(xK))

and

min0lekleKminus1

983042nablaf(xk)983042 =983157

min0lekleKminus1

983042nablaf(xk)9830422 le

983161983160983160983159 1

K

Kminus1983131

k=0

983042nablaf(xk)9830422

Gradient Methods Lecture 7 April 8 - 10 2020 4 20

Corollary 2

Any accumulation point of xkinfin0 in Theorem 1 is stationary

Theorem 3 (Sublinear convergence of GD O(1k))

Suppose f is convex and Lipschitz continuously differentiable withconstant L and that minxisinRn f(x) has a solution x983183 Gradient descentwith αk equiv 1L generates a sequence xkinfin0 that satisfies

f(xk)minus f(x983183) le L983042x0 minus x9831839830422(2k)

Theorem 4 (Linear convergence of GD O((1minus γL)k))

Suppose f is Lipschitz continuously differentiable with constant L andstrongly convex with modulus of convexity γ Then f has a uniqueminimizer x983183 and gradient descent with αk equiv 1L generates a sequencexkinfin0 that satisfies

f(xk+1)minus f(x983183) le (1minus γL) (f(xk)minus f(x983183)) k = 0 1 2

Gradient Methods Lecture 7 April 8 - 10 2020 5 20

3 Descent Direction (DD) method

Choose dk to be a descent direction that is

nablaf(xk)Tdk lt 0

which ensures that

f(xk + αdk) lt f(xk)

for sufficiently small α gt 0

Line search methodxk+1 = xk + αkd

k

with a descent direction dk and sufficiently small αk gt 0 worksGradient descent is a special case

Gradient Methods Lecture 7 April 8 - 10 2020 6 20

Theorem 5 (Sublinear convergence of DD O(1radick))

Suppose f is Lipschitz continuously differentiable with constant L andf is bounded below by a constant f ie f(x) ge f Consider the linear

search method where dk satisfies

nablaf(xk)Tdk le minusη983042nablaf(xk)983042983042dk983042

for some η gt 0 and αk gt 0 satisfies the weak Wolfe conditions

f(xk + αkdk) le f(xk) + c1αknablaf(xk)Tdk

nablaf(xk + αkdk)Tdk ge c2nablaf(xk)Tdk

at all k for some constants c1 and c2 satisfying 0 lt c1 lt c2 lt 1 Thenfor any integer K ge 1 we have

min0lekleKminus1

983042nablaf(xk)983042 le983158

L

η2c1(1minus c2)

983158f(x0)minus f

K

Gradient Methods Lecture 7 April 8 - 10 2020 7 20

Step lengths satisfying the weak Wolfe conditions

Gradient Methods Lecture 7 April 8 - 10 2020 8 20

Corollary 6

Any accumulation point of xkinfin0 in Theorem 5 is stationary

Remark 1

Let xkinfin0 be the sequence in Theorem 5

If f is convex then any accumulation point of xkinfin0 is a solutionof minxisinRn f(x)

If f is nonconvex then accumulation points of xkinfin0 may be localminimizers saddle points or local maximizier

4 Frank-Wolfe (FW) method conditional gradient

Consider the optimization problem

minxisinΩ

f(x)

where Ω is compact and convex and f is convex and differentiable

Gradient Methods Lecture 7 April 8 - 10 2020 9 20

The conditional gradient method is as follows

vk = argminvisinΩ

vTnablaf(xk)

xk+1 = xk + αk(vk minus xk) αk =

2

k + 2

Theorem 7 (Sublinear convergence of FW O(1k))

Suppose Ω is a compact convex set with diameter D ie

983042xminus y983042 le D for all xy isin Ω

Suppose that f is convex and Lipschitz continuously differentiable in aneighborhood of Ω with Lipschitz constant L FW generates a sequencexkinfin0 that satisfies

f(xk)minus f(x983183) le2LD2

k + 2 k = 1 2

where x983183 is any solution of minxisinΩ f(x)

Gradient Methods Lecture 7 April 8 - 10 2020 10 20

5 The key idea of acceleration is momentum

Consider the iterate

xk+1 = xk minus αknablaf(983144xk)minuskminus1983131

i=0

microkinablaf(983144xi)

Due to more flexibility it may yield better convergence

This is the foundation of momentum method

xk+1 = xkminusαknablaf(983144xk) + βkMomentum

Gradient Methods Lecture 7 April 8 - 10 2020 11 20

51 Heavy-Ball (HB) method

Each iteration of this method has the form

xk+1 = xkminusαknablaf(xk) + βk(xk minus xkminus1)

where αk gt 0 and βk gt 0 (Two-step method)

HB method is not a descent method usually f(xk+1) gt f(xk) formany k This property is shared by other momentum methods

Example Consider the strongly convex quadratic function

minxisinRn

983069f(x) =

1

2xTAxminus bTx

983070

where the (constant) Hessian A has eigenvalues in the range [γ L]with 0 lt γ le L Let

αk = α =4

(radicL+

radicγ)2

βk = β =

radicLminusradic

γradicL+

radicγ

Gradient Methods Lecture 7 April 8 - 10 2020 12 20

It can be shown that

983042xk minus x983183983042 le Cβk

which further yields

f(xk)minus f(x983183) leLC2

2β2k

If L ≫ γ we haveβ asymp 1minus 2

983155γL

Therefore the complexity is

O(983155

Lγ log(1ε))

Convergence of GD

f(xk)minus f(x983183) le (1minus γL)k(f(x0)minus f(x983183))

Complexity of GD is

O((Lγ) log(1ε))

Gradient Methods Lecture 7 April 8 - 10 2020 13 20

52 Conjugate Gradient (CG) method

Given SPD A isin Rntimesn

Ax = bldquo hArr rdquo minxisinRn

1

2xTAxminus bTx

Steepest Descent iteration

xk+1 = xk minus (Axk minus b)T(Axk minus b)

(Axk minus b)TA(Axk minus b)(Axk minus b)

CG iteration

xk+1 = xk + αkpk where pk = minusnablaf(xk) + ξkp

kminus1

Gradient Methods Lecture 7 April 8 - 10 2020 14 20

53 Nesterovrsquos Accelerated Gradient (NAG)

NAG iteration

xk+1 = xk minus αknablaf(xk + βk(xk minus xkminus1)) + βk(x

k minus xkminus1)

Theorem 8 (Sublinear convergence of NAG O(1k2))

Suppose f is convex and Lipschitz continuously differentiable withconstant L Suppose the minimum of f is attained at x983183 NAG withx0 x1 = x0 minusnablaf(x0)L αk = 1L and βk defined as follows

λ0 = 0 λk+1 =1

2

9830611 +

9831561 + 4λ2

k

983062 βk =

λk minus 1

λk+1

yields xkinfin0 that satisfies

f(xk)minus f(x983183) le2L983042x0 minus x9831839830422

(k + 1)2 k = 1 2

Gradient Methods Lecture 7 April 8 - 10 2020 15 20

Theorem 9 (Linear convergence of NAG O((1minus983155

γL)k))

Suppose f is Lipschitz continuously differentiable with constant L andstrongly convex with modulus of convexity γ Suppose the uniqueminimizer of f is x983183 NAG with x0

x1 = x0 minus 1

Lnablaf(x0)

and

αk =1

L βk =

radicLminusradic

γradicL+

radicγ

yields xkinfin0 that satisfies

f(xk)minus f(x983183) leL+ γ

2983042x0 minus x9831839830422

9830611minus

983157γ

L

983062k

k = 1 2

Gradient Methods Lecture 7 April 8 - 10 2020 16 20

54 ldquoOptimalityrdquo of NAG

NAG is ldquooptimalrdquo because the convergence rate of NAG is thebest possible (possibly up to a constant) among algorithms thatmake use of all encountered gradient information

Example Consider the problem A=toeplitz([2-10middot middot middot 0])

minxisinRn

f(x) =1

2xTAxminus eT1 x

The iteration

x0 = 0 xk+1 = xk +

k983131

j=0

ξjnablaf(xj)

yields

f(xk)minus f(x983183) ge3983042A9830422

32(k + 1)2983042x0 minus x9831839830422 k = 1 2

n

2minus 1

Gradient Methods Lecture 7 April 8 - 10 2020 17 20

6 Prox-Gradient (PG) method

Consider the regularized optimization problem

minxisinRn

φ(x) = f(x) + λψ(x)

where f smooth and convex ψ is convex and λ ge 0

Each step of the prox-gradient method is defined as follows

xk+1 = proxαkλψ(xk minus αknablaf(xk))

for some step length αk gt 0 and

proxαkλψ(x) = argmin

u

983069αkλψ(u) +

1

2983042uminus x9830422

983070

xk+1 is the solution of a quadratic approximation to φ(x)

xk+1 = argminu

nablaf(xk)T(uminus xk) +1

2αk983042uminus xk9830422 + λψ(u)

Gradient Methods Lecture 7 April 8 - 10 2020 18 20

Define

Gα(x) =1

α

983043xminus proxαλψ (xminus αnablaf(x))

983044 α gt 0

Then at step k of prox-gradient method

xk+1 = xk minus αkGαk(xk)

Lemma 10

Suppose that ψ is a closed convex function and that f is convex andLipschitz continuously differentiable with constant L Then

(a) Gα(x) isin nablaf(x) + λpartψ(xminus αGα(x))

(b) For any z and any α isin (0 1L] we have that

φ(xminus αGα(x)) le φ(z) +Gα(x)T(xminus z)minus α

2983042Gα(x)9830422

Gradient Methods Lecture 7 April 8 - 10 2020 19 20

Theorem 11 (Sublinear convergence of PG O(1k))

Suppose that ψ is a closed convex function and that f is convex andLipschitz continuously differentiable with constant L Suppose that

minxisinRn

φ(x) = f(x) + λψ(x) λ ge 0

attains a minimizer x983183 with optimal objective value φ983183 Then if

αk =1

L

for all k in the prox-gradient method we have

φ(xk)minus φ983183 leL983042x0 minus x9831839830422

2k k = 1 2

Gradient Methods Lecture 7 April 8 - 10 2020 20 20

Page 4: Lecture 7: Gradient Methodsmath.xmu.edu.cn › group › nona › damc › Lecture07.pdf · Lecture 7: Gradient Methods April 8 - 10, 2020 Gradient Methods Lecture 7 April 8 - 10,

Theorem 1 (Sublinear convergence of GD O(1radick))

Suppose f is Lipschitz continuously differentiable with constant L andthere exists a constant f satisfying f(x) ge f Gradient descent with

αk equiv 1L generates a sequence xkinfin0 that satisfies

min0lekleKminus1

983042nablaf(xk)983042 le983157

2L(f(x0)minus f(xK))

Kle

9831582L(f(x0)minus f)

K

Proof The statement is a direct result of

Kminus1983131

k=0

983042nablaf(xk)9830422 le 2L

Kminus1983131

k=0

(f(xk)minus f(xk+1)) = 2L(f(x0)minus f(xK))

and

min0lekleKminus1

983042nablaf(xk)983042 =983157

min0lekleKminus1

983042nablaf(xk)9830422 le

983161983160983160983159 1

K

Kminus1983131

k=0

983042nablaf(xk)9830422

Gradient Methods Lecture 7 April 8 - 10 2020 4 20

Corollary 2

Any accumulation point of xkinfin0 in Theorem 1 is stationary

Theorem 3 (Sublinear convergence of GD O(1k))

Suppose f is convex and Lipschitz continuously differentiable withconstant L and that minxisinRn f(x) has a solution x983183 Gradient descentwith αk equiv 1L generates a sequence xkinfin0 that satisfies

f(xk)minus f(x983183) le L983042x0 minus x9831839830422(2k)

Theorem 4 (Linear convergence of GD O((1minus γL)k))

Suppose f is Lipschitz continuously differentiable with constant L andstrongly convex with modulus of convexity γ Then f has a uniqueminimizer x983183 and gradient descent with αk equiv 1L generates a sequencexkinfin0 that satisfies

f(xk+1)minus f(x983183) le (1minus γL) (f(xk)minus f(x983183)) k = 0 1 2

Gradient Methods Lecture 7 April 8 - 10 2020 5 20

3 Descent Direction (DD) method

Choose dk to be a descent direction that is

nablaf(xk)Tdk lt 0

which ensures that

f(xk + αdk) lt f(xk)

for sufficiently small α gt 0

Line search methodxk+1 = xk + αkd

k

with a descent direction dk and sufficiently small αk gt 0 worksGradient descent is a special case

Gradient Methods Lecture 7 April 8 - 10 2020 6 20

Theorem 5 (Sublinear convergence of DD O(1radick))

Suppose f is Lipschitz continuously differentiable with constant L andf is bounded below by a constant f ie f(x) ge f Consider the linear

search method where dk satisfies

nablaf(xk)Tdk le minusη983042nablaf(xk)983042983042dk983042

for some η gt 0 and αk gt 0 satisfies the weak Wolfe conditions

f(xk + αkdk) le f(xk) + c1αknablaf(xk)Tdk

nablaf(xk + αkdk)Tdk ge c2nablaf(xk)Tdk

at all k for some constants c1 and c2 satisfying 0 lt c1 lt c2 lt 1 Thenfor any integer K ge 1 we have

min0lekleKminus1

983042nablaf(xk)983042 le983158

L

η2c1(1minus c2)

983158f(x0)minus f

K

Gradient Methods Lecture 7 April 8 - 10 2020 7 20

Step lengths satisfying the weak Wolfe conditions

Gradient Methods Lecture 7 April 8 - 10 2020 8 20

Corollary 6

Any accumulation point of xkinfin0 in Theorem 5 is stationary

Remark 1

Let xkinfin0 be the sequence in Theorem 5

If f is convex then any accumulation point of xkinfin0 is a solutionof minxisinRn f(x)

If f is nonconvex then accumulation points of xkinfin0 may be localminimizers saddle points or local maximizier

4 Frank-Wolfe (FW) method conditional gradient

Consider the optimization problem

minxisinΩ

f(x)

where Ω is compact and convex and f is convex and differentiable

Gradient Methods Lecture 7 April 8 - 10 2020 9 20

The conditional gradient method is as follows

vk = argminvisinΩ

vTnablaf(xk)

xk+1 = xk + αk(vk minus xk) αk =

2

k + 2

Theorem 7 (Sublinear convergence of FW O(1k))

Suppose Ω is a compact convex set with diameter D ie

983042xminus y983042 le D for all xy isin Ω

Suppose that f is convex and Lipschitz continuously differentiable in aneighborhood of Ω with Lipschitz constant L FW generates a sequencexkinfin0 that satisfies

f(xk)minus f(x983183) le2LD2

k + 2 k = 1 2

where x983183 is any solution of minxisinΩ f(x)

Gradient Methods Lecture 7 April 8 - 10 2020 10 20

5 The key idea of acceleration is momentum

Consider the iterate

xk+1 = xk minus αknablaf(983144xk)minuskminus1983131

i=0

microkinablaf(983144xi)

Due to more flexibility it may yield better convergence

This is the foundation of momentum method

xk+1 = xkminusαknablaf(983144xk) + βkMomentum

Gradient Methods Lecture 7 April 8 - 10 2020 11 20

51 Heavy-Ball (HB) method

Each iteration of this method has the form

xk+1 = xkminusαknablaf(xk) + βk(xk minus xkminus1)

where αk gt 0 and βk gt 0 (Two-step method)

HB method is not a descent method usually f(xk+1) gt f(xk) formany k This property is shared by other momentum methods

Example Consider the strongly convex quadratic function

minxisinRn

983069f(x) =

1

2xTAxminus bTx

983070

where the (constant) Hessian A has eigenvalues in the range [γ L]with 0 lt γ le L Let

αk = α =4

(radicL+

radicγ)2

βk = β =

radicLminusradic

γradicL+

radicγ

Gradient Methods Lecture 7 April 8 - 10 2020 12 20

It can be shown that

983042xk minus x983183983042 le Cβk

which further yields

f(xk)minus f(x983183) leLC2

2β2k

If L ≫ γ we haveβ asymp 1minus 2

983155γL

Therefore the complexity is

O(983155

Lγ log(1ε))

Convergence of GD

f(xk)minus f(x983183) le (1minus γL)k(f(x0)minus f(x983183))

Complexity of GD is

O((Lγ) log(1ε))

Gradient Methods Lecture 7 April 8 - 10 2020 13 20

52 Conjugate Gradient (CG) method

Given SPD A isin Rntimesn

Ax = bldquo hArr rdquo minxisinRn

1

2xTAxminus bTx

Steepest Descent iteration

xk+1 = xk minus (Axk minus b)T(Axk minus b)

(Axk minus b)TA(Axk minus b)(Axk minus b)

CG iteration

xk+1 = xk + αkpk where pk = minusnablaf(xk) + ξkp

kminus1

Gradient Methods Lecture 7 April 8 - 10 2020 14 20

53 Nesterovrsquos Accelerated Gradient (NAG)

NAG iteration

xk+1 = xk minus αknablaf(xk + βk(xk minus xkminus1)) + βk(x

k minus xkminus1)

Theorem 8 (Sublinear convergence of NAG O(1k2))

Suppose f is convex and Lipschitz continuously differentiable withconstant L Suppose the minimum of f is attained at x983183 NAG withx0 x1 = x0 minusnablaf(x0)L αk = 1L and βk defined as follows

λ0 = 0 λk+1 =1

2

9830611 +

9831561 + 4λ2

k

983062 βk =

λk minus 1

λk+1

yields xkinfin0 that satisfies

f(xk)minus f(x983183) le2L983042x0 minus x9831839830422

(k + 1)2 k = 1 2

Gradient Methods Lecture 7 April 8 - 10 2020 15 20

Theorem 9 (Linear convergence of NAG O((1minus983155

γL)k))

Suppose f is Lipschitz continuously differentiable with constant L andstrongly convex with modulus of convexity γ Suppose the uniqueminimizer of f is x983183 NAG with x0

x1 = x0 minus 1

Lnablaf(x0)

and

αk =1

L βk =

radicLminusradic

γradicL+

radicγ

yields xkinfin0 that satisfies

f(xk)minus f(x983183) leL+ γ

2983042x0 minus x9831839830422

9830611minus

983157γ

L

983062k

k = 1 2

Gradient Methods Lecture 7 April 8 - 10 2020 16 20

54 ldquoOptimalityrdquo of NAG

NAG is ldquooptimalrdquo because the convergence rate of NAG is thebest possible (possibly up to a constant) among algorithms thatmake use of all encountered gradient information

Example Consider the problem A=toeplitz([2-10middot middot middot 0])

minxisinRn

f(x) =1

2xTAxminus eT1 x

The iteration

x0 = 0 xk+1 = xk +

k983131

j=0

ξjnablaf(xj)

yields

f(xk)minus f(x983183) ge3983042A9830422

32(k + 1)2983042x0 minus x9831839830422 k = 1 2

n

2minus 1

Gradient Methods Lecture 7 April 8 - 10 2020 17 20

6 Prox-Gradient (PG) method

Consider the regularized optimization problem

minxisinRn

φ(x) = f(x) + λψ(x)

where f smooth and convex ψ is convex and λ ge 0

Each step of the prox-gradient method is defined as follows

xk+1 = proxαkλψ(xk minus αknablaf(xk))

for some step length αk gt 0 and

proxαkλψ(x) = argmin

u

983069αkλψ(u) +

1

2983042uminus x9830422

983070

xk+1 is the solution of a quadratic approximation to φ(x)

xk+1 = argminu

nablaf(xk)T(uminus xk) +1

2αk983042uminus xk9830422 + λψ(u)

Gradient Methods Lecture 7 April 8 - 10 2020 18 20

Define

Gα(x) =1

α

983043xminus proxαλψ (xminus αnablaf(x))

983044 α gt 0

Then at step k of prox-gradient method

xk+1 = xk minus αkGαk(xk)

Lemma 10

Suppose that ψ is a closed convex function and that f is convex andLipschitz continuously differentiable with constant L Then

(a) Gα(x) isin nablaf(x) + λpartψ(xminus αGα(x))

(b) For any z and any α isin (0 1L] we have that

φ(xminus αGα(x)) le φ(z) +Gα(x)T(xminus z)minus α

2983042Gα(x)9830422

Gradient Methods Lecture 7 April 8 - 10 2020 19 20

Theorem 11 (Sublinear convergence of PG O(1k))

Suppose that ψ is a closed convex function and that f is convex andLipschitz continuously differentiable with constant L Suppose that

minxisinRn

φ(x) = f(x) + λψ(x) λ ge 0

attains a minimizer x983183 with optimal objective value φ983183 Then if

αk =1

L

for all k in the prox-gradient method we have

φ(xk)minus φ983183 leL983042x0 minus x9831839830422

2k k = 1 2

Gradient Methods Lecture 7 April 8 - 10 2020 20 20

Page 5: Lecture 7: Gradient Methodsmath.xmu.edu.cn › group › nona › damc › Lecture07.pdf · Lecture 7: Gradient Methods April 8 - 10, 2020 Gradient Methods Lecture 7 April 8 - 10,

Corollary 2

Any accumulation point of xkinfin0 in Theorem 1 is stationary

Theorem 3 (Sublinear convergence of GD O(1k))

Suppose f is convex and Lipschitz continuously differentiable withconstant L and that minxisinRn f(x) has a solution x983183 Gradient descentwith αk equiv 1L generates a sequence xkinfin0 that satisfies

f(xk)minus f(x983183) le L983042x0 minus x9831839830422(2k)

Theorem 4 (Linear convergence of GD O((1minus γL)k))

Suppose f is Lipschitz continuously differentiable with constant L andstrongly convex with modulus of convexity γ Then f has a uniqueminimizer x983183 and gradient descent with αk equiv 1L generates a sequencexkinfin0 that satisfies

f(xk+1)minus f(x983183) le (1minus γL) (f(xk)minus f(x983183)) k = 0 1 2

Gradient Methods Lecture 7 April 8 - 10 2020 5 20

3 Descent Direction (DD) method

Choose dk to be a descent direction that is

nablaf(xk)Tdk lt 0

which ensures that

f(xk + αdk) lt f(xk)

for sufficiently small α gt 0

Line search methodxk+1 = xk + αkd

k

with a descent direction dk and sufficiently small αk gt 0 worksGradient descent is a special case

Gradient Methods Lecture 7 April 8 - 10 2020 6 20

Theorem 5 (Sublinear convergence of DD O(1radick))

Suppose f is Lipschitz continuously differentiable with constant L andf is bounded below by a constant f ie f(x) ge f Consider the linear

search method where dk satisfies

nablaf(xk)Tdk le minusη983042nablaf(xk)983042983042dk983042

for some η gt 0 and αk gt 0 satisfies the weak Wolfe conditions

f(xk + αkdk) le f(xk) + c1αknablaf(xk)Tdk

nablaf(xk + αkdk)Tdk ge c2nablaf(xk)Tdk

at all k for some constants c1 and c2 satisfying 0 lt c1 lt c2 lt 1 Thenfor any integer K ge 1 we have

min0lekleKminus1

983042nablaf(xk)983042 le983158

L

η2c1(1minus c2)

983158f(x0)minus f

K

Gradient Methods Lecture 7 April 8 - 10 2020 7 20

Step lengths satisfying the weak Wolfe conditions

Gradient Methods Lecture 7 April 8 - 10 2020 8 20

Corollary 6

Any accumulation point of xkinfin0 in Theorem 5 is stationary

Remark 1

Let xkinfin0 be the sequence in Theorem 5

If f is convex then any accumulation point of xkinfin0 is a solutionof minxisinRn f(x)

If f is nonconvex then accumulation points of xkinfin0 may be localminimizers saddle points or local maximizier

4 Frank-Wolfe (FW) method conditional gradient

Consider the optimization problem

minxisinΩ

f(x)

where Ω is compact and convex and f is convex and differentiable

Gradient Methods Lecture 7 April 8 - 10 2020 9 20

The conditional gradient method is as follows

vk = argminvisinΩ

vTnablaf(xk)

xk+1 = xk + αk(vk minus xk) αk =

2

k + 2

Theorem 7 (Sublinear convergence of FW O(1k))

Suppose Ω is a compact convex set with diameter D ie

983042xminus y983042 le D for all xy isin Ω

Suppose that f is convex and Lipschitz continuously differentiable in aneighborhood of Ω with Lipschitz constant L FW generates a sequencexkinfin0 that satisfies

f(xk)minus f(x983183) le2LD2

k + 2 k = 1 2

where x983183 is any solution of minxisinΩ f(x)

Gradient Methods Lecture 7 April 8 - 10 2020 10 20

5 The key idea of acceleration is momentum

Consider the iterate

xk+1 = xk minus αknablaf(983144xk)minuskminus1983131

i=0

microkinablaf(983144xi)

Due to more flexibility it may yield better convergence

This is the foundation of momentum method

xk+1 = xkminusαknablaf(983144xk) + βkMomentum

Gradient Methods Lecture 7 April 8 - 10 2020 11 20

51 Heavy-Ball (HB) method

Each iteration of this method has the form

xk+1 = xkminusαknablaf(xk) + βk(xk minus xkminus1)

where αk gt 0 and βk gt 0 (Two-step method)

HB method is not a descent method usually f(xk+1) gt f(xk) formany k This property is shared by other momentum methods

Example Consider the strongly convex quadratic function

minxisinRn

983069f(x) =

1

2xTAxminus bTx

983070

where the (constant) Hessian A has eigenvalues in the range [γ L]with 0 lt γ le L Let

αk = α =4

(radicL+

radicγ)2

βk = β =

radicLminusradic

γradicL+

radicγ

Gradient Methods Lecture 7 April 8 - 10 2020 12 20

It can be shown that

983042xk minus x983183983042 le Cβk

which further yields

f(xk)minus f(x983183) leLC2

2β2k

If L ≫ γ we haveβ asymp 1minus 2

983155γL

Therefore the complexity is

O(983155

Lγ log(1ε))

Convergence of GD

f(xk)minus f(x983183) le (1minus γL)k(f(x0)minus f(x983183))

Complexity of GD is

O((Lγ) log(1ε))

Gradient Methods Lecture 7 April 8 - 10 2020 13 20

52 Conjugate Gradient (CG) method

Given SPD A isin Rntimesn

Ax = bldquo hArr rdquo minxisinRn

1

2xTAxminus bTx

Steepest Descent iteration

xk+1 = xk minus (Axk minus b)T(Axk minus b)

(Axk minus b)TA(Axk minus b)(Axk minus b)

CG iteration

xk+1 = xk + αkpk where pk = minusnablaf(xk) + ξkp

kminus1

Gradient Methods Lecture 7 April 8 - 10 2020 14 20

53 Nesterovrsquos Accelerated Gradient (NAG)

NAG iteration

xk+1 = xk minus αknablaf(xk + βk(xk minus xkminus1)) + βk(x

k minus xkminus1)

Theorem 8 (Sublinear convergence of NAG O(1k2))

Suppose f is convex and Lipschitz continuously differentiable withconstant L Suppose the minimum of f is attained at x983183 NAG withx0 x1 = x0 minusnablaf(x0)L αk = 1L and βk defined as follows

λ0 = 0 λk+1 =1

2

9830611 +

9831561 + 4λ2

k

983062 βk =

λk minus 1

λk+1

yields xkinfin0 that satisfies

f(xk)minus f(x983183) le2L983042x0 minus x9831839830422

(k + 1)2 k = 1 2

Gradient Methods Lecture 7 April 8 - 10 2020 15 20

Theorem 9 (Linear convergence of NAG O((1minus983155

γL)k))

Suppose f is Lipschitz continuously differentiable with constant L andstrongly convex with modulus of convexity γ Suppose the uniqueminimizer of f is x983183 NAG with x0

x1 = x0 minus 1

Lnablaf(x0)

and

αk =1

L βk =

radicLminusradic

γradicL+

radicγ

yields xkinfin0 that satisfies

f(xk)minus f(x983183) leL+ γ

2983042x0 minus x9831839830422

9830611minus

983157γ

L

983062k

k = 1 2

Gradient Methods Lecture 7 April 8 - 10 2020 16 20

54 ldquoOptimalityrdquo of NAG

NAG is ldquooptimalrdquo because the convergence rate of NAG is thebest possible (possibly up to a constant) among algorithms thatmake use of all encountered gradient information

Example Consider the problem A=toeplitz([2-10middot middot middot 0])

minxisinRn

f(x) =1

2xTAxminus eT1 x

The iteration

x0 = 0 xk+1 = xk +

k983131

j=0

ξjnablaf(xj)

yields

f(xk)minus f(x983183) ge3983042A9830422

32(k + 1)2983042x0 minus x9831839830422 k = 1 2

n

2minus 1

Gradient Methods Lecture 7 April 8 - 10 2020 17 20

6 Prox-Gradient (PG) method

Consider the regularized optimization problem

minxisinRn

φ(x) = f(x) + λψ(x)

where f smooth and convex ψ is convex and λ ge 0

Each step of the prox-gradient method is defined as follows

xk+1 = proxαkλψ(xk minus αknablaf(xk))

for some step length αk gt 0 and

proxαkλψ(x) = argmin

u

983069αkλψ(u) +

1

2983042uminus x9830422

983070

xk+1 is the solution of a quadratic approximation to φ(x)

xk+1 = argminu

nablaf(xk)T(uminus xk) +1

2αk983042uminus xk9830422 + λψ(u)

Gradient Methods Lecture 7 April 8 - 10 2020 18 20

Define

Gα(x) =1

α

983043xminus proxαλψ (xminus αnablaf(x))

983044 α gt 0

Then at step k of prox-gradient method

xk+1 = xk minus αkGαk(xk)

Lemma 10

Suppose that ψ is a closed convex function and that f is convex andLipschitz continuously differentiable with constant L Then

(a) Gα(x) isin nablaf(x) + λpartψ(xminus αGα(x))

(b) For any z and any α isin (0 1L] we have that

φ(xminus αGα(x)) le φ(z) +Gα(x)T(xminus z)minus α

2983042Gα(x)9830422

Gradient Methods Lecture 7 April 8 - 10 2020 19 20

Theorem 11 (Sublinear convergence of PG O(1k))

Suppose that ψ is a closed convex function and that f is convex andLipschitz continuously differentiable with constant L Suppose that

minxisinRn

φ(x) = f(x) + λψ(x) λ ge 0

attains a minimizer x983183 with optimal objective value φ983183 Then if

αk =1

L

for all k in the prox-gradient method we have

φ(xk)minus φ983183 leL983042x0 minus x9831839830422

2k k = 1 2

Gradient Methods Lecture 7 April 8 - 10 2020 20 20

Page 6: Lecture 7: Gradient Methodsmath.xmu.edu.cn › group › nona › damc › Lecture07.pdf · Lecture 7: Gradient Methods April 8 - 10, 2020 Gradient Methods Lecture 7 April 8 - 10,

3 Descent Direction (DD) method

Choose dk to be a descent direction that is

nablaf(xk)Tdk lt 0

which ensures that

f(xk + αdk) lt f(xk)

for sufficiently small α gt 0

Line search methodxk+1 = xk + αkd

k

with a descent direction dk and sufficiently small αk gt 0 worksGradient descent is a special case

Gradient Methods Lecture 7 April 8 - 10 2020 6 20

Theorem 5 (Sublinear convergence of DD O(1radick))

Suppose f is Lipschitz continuously differentiable with constant L andf is bounded below by a constant f ie f(x) ge f Consider the linear

search method where dk satisfies

nablaf(xk)Tdk le minusη983042nablaf(xk)983042983042dk983042

for some η gt 0 and αk gt 0 satisfies the weak Wolfe conditions

f(xk + αkdk) le f(xk) + c1αknablaf(xk)Tdk

nablaf(xk + αkdk)Tdk ge c2nablaf(xk)Tdk

at all k for some constants c1 and c2 satisfying 0 lt c1 lt c2 lt 1 Thenfor any integer K ge 1 we have

min0lekleKminus1

983042nablaf(xk)983042 le983158

L

η2c1(1minus c2)

983158f(x0)minus f

K

Gradient Methods Lecture 7 April 8 - 10 2020 7 20

Step lengths satisfying the weak Wolfe conditions

Gradient Methods Lecture 7 April 8 - 10 2020 8 20

Corollary 6

Any accumulation point of xkinfin0 in Theorem 5 is stationary

Remark 1

Let xkinfin0 be the sequence in Theorem 5

If f is convex then any accumulation point of xkinfin0 is a solutionof minxisinRn f(x)

If f is nonconvex then accumulation points of xkinfin0 may be localminimizers saddle points or local maximizier

4 Frank-Wolfe (FW) method conditional gradient

Consider the optimization problem

minxisinΩ

f(x)

where Ω is compact and convex and f is convex and differentiable

Gradient Methods Lecture 7 April 8 - 10 2020 9 20

The conditional gradient method is as follows

vk = argminvisinΩ

vTnablaf(xk)

xk+1 = xk + αk(vk minus xk) αk =

2

k + 2

Theorem 7 (Sublinear convergence of FW O(1k))

Suppose Ω is a compact convex set with diameter D ie

983042xminus y983042 le D for all xy isin Ω

Suppose that f is convex and Lipschitz continuously differentiable in aneighborhood of Ω with Lipschitz constant L FW generates a sequencexkinfin0 that satisfies

f(xk)minus f(x983183) le2LD2

k + 2 k = 1 2

where x983183 is any solution of minxisinΩ f(x)

Gradient Methods Lecture 7 April 8 - 10 2020 10 20

5 The key idea of acceleration is momentum

Consider the iterate

xk+1 = xk minus αknablaf(983144xk)minuskminus1983131

i=0

microkinablaf(983144xi)

Due to more flexibility it may yield better convergence

This is the foundation of momentum method

xk+1 = xkminusαknablaf(983144xk) + βkMomentum

Gradient Methods Lecture 7 April 8 - 10 2020 11 20

51 Heavy-Ball (HB) method

Each iteration of this method has the form

xk+1 = xkminusαknablaf(xk) + βk(xk minus xkminus1)

where αk gt 0 and βk gt 0 (Two-step method)

HB method is not a descent method usually f(xk+1) gt f(xk) formany k This property is shared by other momentum methods

Example Consider the strongly convex quadratic function

minxisinRn

983069f(x) =

1

2xTAxminus bTx

983070

where the (constant) Hessian A has eigenvalues in the range [γ L]with 0 lt γ le L Let

αk = α =4

(radicL+

radicγ)2

βk = β =

radicLminusradic

γradicL+

radicγ

Gradient Methods Lecture 7 April 8 - 10 2020 12 20

It can be shown that

983042xk minus x983183983042 le Cβk

which further yields

f(xk)minus f(x983183) leLC2

2β2k

If L ≫ γ we haveβ asymp 1minus 2

983155γL

Therefore the complexity is

O(983155

Lγ log(1ε))

Convergence of GD

f(xk)minus f(x983183) le (1minus γL)k(f(x0)minus f(x983183))

Complexity of GD is

O((Lγ) log(1ε))

Gradient Methods Lecture 7 April 8 - 10 2020 13 20

52 Conjugate Gradient (CG) method

Given SPD A isin Rntimesn

Ax = bldquo hArr rdquo minxisinRn

1

2xTAxminus bTx

Steepest Descent iteration

xk+1 = xk minus (Axk minus b)T(Axk minus b)

(Axk minus b)TA(Axk minus b)(Axk minus b)

CG iteration

xk+1 = xk + αkpk where pk = minusnablaf(xk) + ξkp

kminus1

Gradient Methods Lecture 7 April 8 - 10 2020 14 20

53 Nesterovrsquos Accelerated Gradient (NAG)

NAG iteration

xk+1 = xk minus αknablaf(xk + βk(xk minus xkminus1)) + βk(x

k minus xkminus1)

Theorem 8 (Sublinear convergence of NAG O(1k2))

Suppose f is convex and Lipschitz continuously differentiable withconstant L Suppose the minimum of f is attained at x983183 NAG withx0 x1 = x0 minusnablaf(x0)L αk = 1L and βk defined as follows

λ0 = 0 λk+1 =1

2

9830611 +

9831561 + 4λ2

k

983062 βk =

λk minus 1

λk+1

yields xkinfin0 that satisfies

f(xk)minus f(x983183) le2L983042x0 minus x9831839830422

(k + 1)2 k = 1 2

Gradient Methods Lecture 7 April 8 - 10 2020 15 20

Theorem 9 (Linear convergence of NAG O((1minus983155

γL)k))

Suppose f is Lipschitz continuously differentiable with constant L andstrongly convex with modulus of convexity γ Suppose the uniqueminimizer of f is x983183 NAG with x0

x1 = x0 minus 1

Lnablaf(x0)

and

αk =1

L βk =

radicLminusradic

γradicL+

radicγ

yields xkinfin0 that satisfies

f(xk)minus f(x983183) leL+ γ

2983042x0 minus x9831839830422

9830611minus

983157γ

L

983062k

k = 1 2

Gradient Methods Lecture 7 April 8 - 10 2020 16 20

54 ldquoOptimalityrdquo of NAG

NAG is ldquooptimalrdquo because the convergence rate of NAG is thebest possible (possibly up to a constant) among algorithms thatmake use of all encountered gradient information

Example Consider the problem A=toeplitz([2-10middot middot middot 0])

minxisinRn

f(x) =1

2xTAxminus eT1 x

The iteration

x0 = 0 xk+1 = xk +

k983131

j=0

ξjnablaf(xj)

yields

f(xk)minus f(x983183) ge3983042A9830422

32(k + 1)2983042x0 minus x9831839830422 k = 1 2

n

2minus 1

Gradient Methods Lecture 7 April 8 - 10 2020 17 20

6 Prox-Gradient (PG) method

Consider the regularized optimization problem

minxisinRn

φ(x) = f(x) + λψ(x)

where f smooth and convex ψ is convex and λ ge 0

Each step of the prox-gradient method is defined as follows

xk+1 = proxαkλψ(xk minus αknablaf(xk))

for some step length αk gt 0 and

proxαkλψ(x) = argmin

u

983069αkλψ(u) +

1

2983042uminus x9830422

983070

xk+1 is the solution of a quadratic approximation to φ(x)

xk+1 = argminu

nablaf(xk)T(uminus xk) +1

2αk983042uminus xk9830422 + λψ(u)

Gradient Methods Lecture 7 April 8 - 10 2020 18 20

Define

Gα(x) =1

α

983043xminus proxαλψ (xminus αnablaf(x))

983044 α gt 0

Then at step k of prox-gradient method

xk+1 = xk minus αkGαk(xk)

Lemma 10

Suppose that ψ is a closed convex function and that f is convex andLipschitz continuously differentiable with constant L Then

(a) Gα(x) isin nablaf(x) + λpartψ(xminus αGα(x))

(b) For any z and any α isin (0 1L] we have that

φ(xminus αGα(x)) le φ(z) +Gα(x)T(xminus z)minus α

2983042Gα(x)9830422

Gradient Methods Lecture 7 April 8 - 10 2020 19 20

Theorem 11 (Sublinear convergence of PG O(1k))

Suppose that ψ is a closed convex function and that f is convex andLipschitz continuously differentiable with constant L Suppose that

minxisinRn

φ(x) = f(x) + λψ(x) λ ge 0

attains a minimizer x983183 with optimal objective value φ983183 Then if

αk =1

L

for all k in the prox-gradient method we have

φ(xk)minus φ983183 leL983042x0 minus x9831839830422

2k k = 1 2

Gradient Methods Lecture 7 April 8 - 10 2020 20 20

Page 7: Lecture 7: Gradient Methodsmath.xmu.edu.cn › group › nona › damc › Lecture07.pdf · Lecture 7: Gradient Methods April 8 - 10, 2020 Gradient Methods Lecture 7 April 8 - 10,

Theorem 5 (Sublinear convergence of DD O(1radick))

Suppose f is Lipschitz continuously differentiable with constant L andf is bounded below by a constant f ie f(x) ge f Consider the linear

search method where dk satisfies

nablaf(xk)Tdk le minusη983042nablaf(xk)983042983042dk983042

for some η gt 0 and αk gt 0 satisfies the weak Wolfe conditions

f(xk + αkdk) le f(xk) + c1αknablaf(xk)Tdk

nablaf(xk + αkdk)Tdk ge c2nablaf(xk)Tdk

at all k for some constants c1 and c2 satisfying 0 lt c1 lt c2 lt 1 Thenfor any integer K ge 1 we have

min0lekleKminus1

983042nablaf(xk)983042 le983158

L

η2c1(1minus c2)

983158f(x0)minus f

K

Gradient Methods Lecture 7 April 8 - 10 2020 7 20

Step lengths satisfying the weak Wolfe conditions

Gradient Methods Lecture 7 April 8 - 10 2020 8 20

Corollary 6

Any accumulation point of xkinfin0 in Theorem 5 is stationary

Remark 1

Let xkinfin0 be the sequence in Theorem 5

If f is convex then any accumulation point of xkinfin0 is a solutionof minxisinRn f(x)

If f is nonconvex then accumulation points of xkinfin0 may be localminimizers saddle points or local maximizier

4 Frank-Wolfe (FW) method conditional gradient

Consider the optimization problem

minxisinΩ

f(x)

where Ω is compact and convex and f is convex and differentiable

Gradient Methods Lecture 7 April 8 - 10 2020 9 20

The conditional gradient method is as follows

vk = argminvisinΩ

vTnablaf(xk)

xk+1 = xk + αk(vk minus xk) αk =

2

k + 2

Theorem 7 (Sublinear convergence of FW O(1k))

Suppose Ω is a compact convex set with diameter D ie

983042xminus y983042 le D for all xy isin Ω

Suppose that f is convex and Lipschitz continuously differentiable in aneighborhood of Ω with Lipschitz constant L FW generates a sequencexkinfin0 that satisfies

f(xk)minus f(x983183) le2LD2

k + 2 k = 1 2

where x983183 is any solution of minxisinΩ f(x)

Gradient Methods Lecture 7 April 8 - 10 2020 10 20

5 The key idea of acceleration is momentum

Consider the iterate

xk+1 = xk minus αknablaf(983144xk)minuskminus1983131

i=0

microkinablaf(983144xi)

Due to more flexibility it may yield better convergence

This is the foundation of momentum method

xk+1 = xkminusαknablaf(983144xk) + βkMomentum

Gradient Methods Lecture 7 April 8 - 10 2020 11 20

51 Heavy-Ball (HB) method

Each iteration of this method has the form

xk+1 = xkminusαknablaf(xk) + βk(xk minus xkminus1)

where αk gt 0 and βk gt 0 (Two-step method)

HB method is not a descent method usually f(xk+1) gt f(xk) formany k This property is shared by other momentum methods

Example: Consider the strongly convex quadratic function

min_{x∈R^n} { f(x) = (1/2) x^T A x − b^T x }

where the (constant) Hessian A has eigenvalues in the range [γ, L] with 0 < γ ≤ L. Let

α_k = α = 4 / (√L + √γ)²,  β_k = β = (√L − √γ) / (√L + √γ)

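A minimal Python sketch of the heavy-ball iteration on a random strongly convex quadratic, using the α and β from the example above; the test matrix and iteration count are illustrative choices, not from the lecture.

import numpy as np

rng = np.random.default_rng(1)
Q = rng.standard_normal((20, 20))
A = Q.T @ Q + 0.1 * np.eye(20)                    # SPD Hessian, eigenvalues in [gamma, L]
b = rng.standard_normal(20)
grad = lambda x: A @ x - b

eigs = np.linalg.eigvalsh(A)
gamma, L = eigs[0], eigs[-1]
alpha = 4.0 / (np.sqrt(L) + np.sqrt(gamma)) ** 2  # alpha = 4/(sqrt(L)+sqrt(gamma))^2
beta = (np.sqrt(L) - np.sqrt(gamma)) / (np.sqrt(L) + np.sqrt(gamma))

x_prev = np.zeros(20)
x = np.zeros(20)
for k in range(300):
    # x^{k+1} = x^k - alpha grad f(x^k) + beta (x^k - x^{k-1})
    x, x_prev = x - alpha * grad(x) + beta * (x - x_prev), x
print(np.linalg.norm(x - np.linalg.solve(A, b)))  # distance to the minimizer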

It can be shown that

‖x^k − x⋆‖ ≤ C β^k,

which further yields

f(x^k) − f(x⋆) ≤ (L C² / 2) β^{2k}.

If L ≫ γ, we have β ≈ 1 − 2√(γ/L). Therefore the complexity is

O(√(L/γ) log(1/ε)).

Convergence of GD:

f(x^k) − f(x⋆) ≤ (1 − γ/L)^k (f(x⁰) − f(x⋆)).

Complexity of GD is

O((L/γ) log(1/ε)).


5.2 Conjugate Gradient (CG) method

Given SPD A ∈ R^{n×n}:

Ax = b  “⟺”  min_{x∈R^n} (1/2) x^T A x − b^T x

Steepest Descent iteration:

x^{k+1} = x^k − [ (Ax^k − b)^T (Ax^k − b) / ((Ax^k − b)^T A (Ax^k − b)) ] (Ax^k − b)

CG iteration:

x^{k+1} = x^k + α_k p^k, where p^k = −∇f(x^k) + ξ_k p^{k−1}

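Here is a compact Python sketch of the CG iteration for SPD A, written for illustration; ξ_k = r_{k+1}^T r_{k+1} / r_k^T r_k is the usual choice that keeps successive directions conjugate, and the tridiagonal test matrix below is the toeplitz([2, −1, 0, . . . , 0]) matrix used later in the lecture.

import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    # CG for SPD A: x^{k+1} = x^k + alpha_k p^k, p^k = -grad f(x^k) + xi_k p^{k-1}
    x = np.zeros_like(b)
    r = b - A @ x                      # residual r = b - Ax = -grad f(x)
    p = r.copy()
    rs = r @ r
    for _ in range(len(b)):
        Ap = A @ p
        alpha = rs / (p @ Ap)          # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p      # xi_k = rs_new / rs
        rs = rs_new
    return x

# Example: tridiagonal toeplitz([2, -1, 0, ..., 0]) with b = e_1
n = 50
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.zeros(n); b[0] = 1.0
print(np.linalg.norm(A @ conjugate_gradient(A, b) - b))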

5.3 Nesterov's Accelerated Gradient (NAG)

NAG iteration:

x^{k+1} = x^k − α_k ∇f(x^k + β_k (x^k − x^{k−1})) + β_k (x^k − x^{k−1})

Theorem 8 (Sublinear convergence of NAG, O(1/k²))

Suppose f is convex and Lipschitz continuously differentiable with constant L. Suppose the minimum of f is attained at x⋆. NAG with x⁰, x¹ = x⁰ − ∇f(x⁰)/L, α_k = 1/L, and β_k defined as follows:

λ_0 = 0,  λ_{k+1} = (1/2) (1 + √(1 + 4λ_k²)),  β_k = (λ_k − 1) / λ_{k+1},

yields {x^k}_{k=0}^∞ that satisfies

f(x^k) − f(x⋆) ≤ 2L ‖x⁰ − x⋆‖² / (k + 1)²,  k = 1, 2, . . .

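A Python sketch of NAG with the λ_k/β_k schedule of Theorem 8, written here for illustration; the quadratic test function is an arbitrary convex example, and y^k denotes the extrapolated point x^k + β_k(x^k − x^{k−1}).

import numpy as np

def nag(grad, x0, L, n_iters=1000):
    # x^1 = x^0 - (1/L) grad f(x^0); then for k = 1, 2, ...
    #   y^k = x^k + beta_k (x^k - x^{k-1}),  x^{k+1} = y^k - (1/L) grad f(y^k),
    # with lambda_0 = 0, lambda_{k+1} = (1 + sqrt(1 + 4 lambda_k^2))/2, beta_k = (lambda_k - 1)/lambda_{k+1}.
    x_prev = x0.copy()
    x = x0 - grad(x0) / L
    lam = 1.0                                  # lambda_1 (lambda_0 = 0 gives lambda_1 = 1)
    for k in range(1, n_iters):
        lam_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * lam ** 2))
        beta = (lam - 1.0) / lam_next
        y = x + beta * (x - x_prev)            # extrapolated point
        x_prev, x = x, y - grad(y) / L
        lam = lam_next
    return x

# Illustrative convex quadratic; L = largest eigenvalue of A
rng = np.random.default_rng(2)
Q = rng.standard_normal((30, 30))
A = Q.T @ Q + 0.01 * np.eye(30)
b = rng.standard_normal(30)
L = np.linalg.eigvalsh(A)[-1]
x = nag(lambda x: A @ x - b, np.zeros(30), L)
print(np.linalg.norm(A @ x - b))

For the strongly convex case of Theorem 9 below, the only change is the constant momentum parameter β_k = (√L − √γ)/(√L + √γ).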

Theorem 9 (Linear convergence of NAG, O((1 − √(γ/L))^k))

Suppose f is Lipschitz continuously differentiable with constant L and strongly convex with modulus of convexity γ. Suppose the unique minimizer of f is x⋆. NAG with x⁰,

x¹ = x⁰ − (1/L) ∇f(x⁰)

and

α_k = 1/L,  β_k = (√L − √γ) / (√L + √γ)

yields {x^k}_{k=0}^∞ that satisfies

f(x^k) − f(x⋆) ≤ ((L + γ)/2) ‖x⁰ − x⋆‖² (1 − √(γ/L))^k,  k = 1, 2, . . .


5.4 “Optimality” of NAG

NAG is “optimal” because the convergence rate of NAG is the best possible (possibly up to a constant) among algorithms that make use of all encountered gradient information.

Example: Consider the problem with A = toeplitz([2, −1, 0, · · · , 0]):

min_{x∈R^n} f(x) = (1/2) x^T A x − e_1^T x

The iteration

x⁰ = 0,  x^{k+1} = x^k + Σ_{j=0}^{k} ξ_j ∇f(x^j)

yields

f(x^k) − f(x⋆) ≥ 3‖A‖₂ ‖x⁰ − x⋆‖² / (32 (k + 1)²),  k = 1, 2, . . . , n/2 − 1.


6 Prox-Gradient (PG) method

Consider the regularized optimization problem

min_{x∈R^n} φ(x) = f(x) + λψ(x)

where f is smooth and convex, ψ is convex, and λ ≥ 0.

Each step of the prox-gradient method is defined as follows:

x^{k+1} = prox_{α_k λ ψ}(x^k − α_k ∇f(x^k))

for some step length α_k > 0, and

prox_{α_k λ ψ}(x) = argmin_u { α_k λ ψ(u) + (1/2) ‖u − x‖² }.

x^{k+1} is the solution of a quadratic approximation to φ(x):

x^{k+1} = argmin_u ∇f(x^k)^T (u − x^k) + (1/(2α_k)) ‖u − x^k‖² + λψ(u)

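To make the prox step concrete, here is a Python sketch for the common special case ψ(x) = ‖x‖₁; this choice of ψ, the least-squares f, and the synthetic data are assumptions made for illustration, not part of the lecture. For this ψ the operator prox_{αλψ} is componentwise soft-thresholding, and the resulting method is the ISTA iteration.

import numpy as np

def soft_threshold(v, tau):
    # prox of tau*||.||_1: componentwise shrinkage toward zero
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_gradient_l1(A, b, lam, n_iters=500):
    # Prox-gradient for phi(x) = f(x) + lam*||x||_1 with f(x) = 0.5*||Ax - b||^2,
    # fixed step alpha_k = 1/L where L = ||A^T A||_2 is the Lipschitz constant of grad f.
    L = np.linalg.norm(A.T @ A, 2)
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        g = A.T @ (A @ x - b)                     # gradient of the smooth part
        x = soft_threshold(x - g / L, lam / L)    # x^{k+1} = prox_{(1/L) lam psi}(x^k - (1/L) grad f(x^k))
    return x

# Sparse-recovery toy example
rng = np.random.default_rng(3)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100); x_true[:5] = 1.0
b = A @ x_true + 0.01 * rng.standard_normal(40)
print(np.round(prox_gradient_l1(A, b, lam=0.1)[:10], 3))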

Define

G_α(x) = (1/α) ( x − prox_{αλψ}(x − α∇f(x)) ),  α > 0.

Then at step k of the prox-gradient method,

x^{k+1} = x^k − α_k G_{α_k}(x^k).

Lemma 10

Suppose that ψ is a closed convex function and that f is convex and Lipschitz continuously differentiable with constant L. Then

(a) G_α(x) ∈ ∇f(x) + λ ∂ψ(x − αG_α(x));

(b) For any z and any α ∈ (0, 1/L], we have that

φ(x − αG_α(x)) ≤ φ(z) + G_α(x)^T (x − z) − (α/2) ‖G_α(x)‖².


Theorem 11 (Sublinear convergence of PG, O(1/k))

Suppose that ψ is a closed convex function and that f is convex and Lipschitz continuously differentiable with constant L. Suppose that

min_{x∈R^n} φ(x) = f(x) + λψ(x),  λ ≥ 0,

attains a minimizer x⋆ with optimal objective value φ⋆. Then if α_k = 1/L for all k in the prox-gradient method, we have

φ(x^k) − φ⋆ ≤ L ‖x⁰ − x⋆‖² / (2k),  k = 1, 2, . . .
