
  • A robust accelerated optimization algorithm for strongly convex functions

    Saman Cyrus, Bin Hu, Bryan Van Scoy, Laurent Lessard

    University of Wisconsin–Madison

    June 27, 2018

  • Iterative algorithms

    Unconstrained optimization with f strongly convex:

    min_{x ∈ Rⁿ} f(x)

    where mI ⪯ ∇²f(x) ⪯ LI, and define κ = L/m (the condition number).

    • Gradient method: x_{k+1} = x_k − α∇f(x_k)

    • Nesterov's accelerated gradient method (AGM):
      y_k = x_k + β(x_k − x_{k−1})
      x_{k+1} = y_k − α∇f(y_k)
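    As a concrete reference, here is a minimal NumPy sketch of both updates. The quadratic test function, the step size α = 1/L, and the standard AGM momentum β = (√κ − 1)/(√κ + 1) are illustrative choices, not taken from the slides.

      import numpy as np

      # Illustrative strongly convex quadratic: f(x) = 0.5 * x'Qx, so m = 1, L = 10.
      Q = np.diag([1.0, 10.0])
      m, L = 1.0, 10.0

      def grad(x):
          return Q @ x

      def gradient_method(x0, alpha, iters=200):
          x = x0.copy()
          for _ in range(iters):
              x = x - alpha * grad(x)        # x_{k+1} = x_k - alpha * grad f(x_k)
          return x

      def nesterov_agm(x0, alpha, beta, iters=200):
          x_prev, x = x0.copy(), x0.copy()
          for _ in range(iters):
              y = x + beta * (x - x_prev)    # momentum step
              x_prev, x = x, y - alpha * grad(y)
          return x

      x0 = np.array([1.0, 1.0])
      kappa = L / m
      print(gradient_method(x0, alpha=1 / L))
      print(nesterov_agm(x0, alpha=1 / L, beta=(np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)))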

  • Iteration complexity

    [Figure: iterations to convergence vs. condition number κ, log–log axes,
    for the Gradient Method (α = 1/L), the Gradient Method (α = 2/(m+L)),
    Nesterov's AGM, Triple Momentum, and the theoretical lower bound.
    The gradient methods scale as κ log(1/ε); the accelerated methods
    scale as √κ log(1/ε).]

  • Noise robustness

    • ∇f is inexact/approximate.

    • Many different noise models have been studied.

    • We use relative deterministic noise (a minimal oracle sketch follows below):

      ‖∇f_noisy − ∇f_exact‖ / ‖∇f_exact‖ ≤ δ

    • Example: round-off error.
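    One way to realize this noise model in code, as a sketch: the perturbation direction is drawn at random here, which is an illustrative choice; the model itself only bounds the perturbation's relative size.

      import numpy as np

      def noisy_grad(grad, x, delta, rng=None):
          """Return g with ||g - grad(x)|| <= delta * ||grad(x)||."""
          rng = rng or np.random.default_rng(0)
          g = grad(x)
          d = rng.standard_normal(g.shape)
          d *= delta * np.linalg.norm(g) / np.linalg.norm(d)
          return g + d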

  • Performance in the presence of noise

    Gradient method: slow and robust

    [Figure: iterations to convergence vs. condition number κ for noise
    levels δ ∈ {0.01, 0.02, 0.05, 0.1, 0.2, 0.5}.]

    Nesterov’s AGM: fast and fragile

    [Figure: iterations to convergence vs. condition number κ for noise
    levels δ ∈ {0.05, 0.1, 0.2, 0.3, 0.4, 0.5}, on a quadratic objective.]

    Simulated by solving a small SDP (Lessard et al., SIAM Journal on Optimization, 2016).

  • Proposed algorithm: Robust momentum method

    Iteration update:

    x_{k+1} = x_k + β(x_k − x_{k−1}) − α∇f(y_k)
    y_k = x_k + γ(x_k − x_{k−1})

    with parameters:

    α = κ(1−ρ)²(1+ρ)/L,   β = κρ³/(κ−1),   γ = ρ³/((κ−1)(1−ρ)²(1+ρ))

    and a single tuning parameter ρ:

    1 − 1/√κ  ≤  ρ  ≤  1 − 1/κ
    (fast + fragile)      (slow + robust)
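    A direct NumPy sketch of the update with these parameter formulas; the gradient oracle and iteration count are left to the caller, and the example values at the end are illustrative.

      import numpy as np

      def robust_momentum(grad, x0, m, L, rho, iters=500):
          # rho in [1 - 1/sqrt(kappa), 1 - 1/kappa] trades speed for robustness.
          kappa = L / m
          alpha = kappa * (1 - rho)**2 * (1 + rho) / L
          beta = kappa * rho**3 / (kappa - 1)
          gamma = rho**3 / ((kappa - 1) * (1 - rho)**2 * (1 + rho))
          x_prev, x = x0.copy(), x0.copy()
          for _ in range(iters):
              y = x + gamma * (x - x_prev)                              # y_k
              x_prev, x = x, x + beta * (x - x_prev) - alpha * grad(y)  # x_{k+1}
          return x

      # Example: f(x) = 0.5 * x'Qx with Q = diag(1, 10), so kappa = 10 and
      # rho may be chosen anywhere in [1 - 1/sqrt(10), 1 - 1/10] ≈ [0.684, 0.9].
      Q = np.diag([1.0, 10.0])
      print(robust_momentum(lambda x: Q @ x, np.array([1.0, 1.0]), m=1.0, L=10.0, rho=0.8))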

  • Control interpretation

    Nesterov’s AGM:

    x_{k+1} = y_k − α∇f(y_k)
    y_k = x_k + β(x_k − x_{k−1})

    Block diagram: a linear system G in feedback with the nonlinearity ∇f,
    where u_k = ∇f(y_k) is the input to G and y_k is its output:

    [ x_{k+1} ]   [ 1+β  −β ] [ x_k     ]   [ −α ]
    [ x_k     ] = [  1    0 ] [ x_{k−1} ] + [  0 ] u_k

    y_k = [ 1+β  −β ] [ x_k ; x_{k−1} ]
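    The same interconnection can be run directly in code. A minimal sketch, written for scalar x_k so the state matrices stay 2×2 (the example gradient and parameter values are arbitrary illustrative choices):

      import numpy as np

      def agm_via_statespace(grad, x0, alpha, beta, iters=200):
          A = np.array([[1 + beta, -beta],
                        [1.0, 0.0]])
          B = np.array([-alpha, 0.0])
          C = np.array([1 + beta, -beta])
          xi = np.array([x0, x0])    # state [x_k, x_{k-1}]
          for _ in range(iters):
              y = C @ xi             # y_k
              u = grad(y)            # feedback through the nonlinearity
              xi = A @ xi + B * u    # advance the linear system G
          return xi[0]

      # Example: f(x) = x^2 on scalars, so grad f(x) = 2x.
      print(agm_via_statespace(lambda y: 2.0 * y, x0=1.0, alpha=0.3, beta=0.2))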

  • Frequency domain condition

    Transfer function for the linear part:

    G(z) = −α((1+γ)z − γ) / ((z−1)(z−β))

    Sufficient condition for linear convergence x_k − x⋆ = O(ρᵏ):

    Re[ (ρz⁻¹ − 1) · (1 − L·G(ρz)) / (1 − m·G(ρz)) ] < 0   for all |z| = 1

    • a combination of the Circle Criterion and Zames–Falb multipliers
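    This condition can be spot-checked numerically on a grid of the unit circle. A minimal sketch (a grid check is only a sanity check, not a proof; the ν argument anticipates the robustness margin introduced below and defaults to 0):

      import numpy as np

      def freq_condition_holds(alpha, beta, gamma, m, L, rho, nu=0.0, n=4096):
          """Check Re[(rho/z - 1)(1 - L G(rho z))/(1 - m G(rho z))] + nu <= 0
          on a grid of |z| = 1, staggered to avoid a possible pole at z = 1."""
          def G(s):
              return -alpha * ((1 + gamma) * s - gamma) / ((s - 1) * (s - beta))
          theta = (np.arange(n) + 0.5) * 2 * np.pi / n
          z = np.exp(1j * theta)
          val = (rho / z - 1) * (1 - L * G(rho * z)) / (1 - m * G(rho * z))
          return bool(np.all(val.real + nu <= 0))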

  • Gradient method

    [Figure: frequency-response locus in the complex plane (real vs.
    imaginary part) for the gradient method with α = 1/L and α = 2/(m+L),
    plotted for ρ = 1, ρ = 0.98, ρ = 0.95, and the optimal ρ.]
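    A sketch of how such a locus can be traced. Setting β = γ = 0 in the transfer function above reduces it to the gradient method, G(z) = −α/(z − 1); the grid again avoids z = 1, and m = 1, L = 10 are illustrative values:

      import numpy as np
      import matplotlib.pyplot as plt

      m, L = 1.0, 10.0
      theta = (np.arange(2000) + 0.5) * 2 * np.pi / 2000
      z = np.exp(1j * theta)

      for alpha, name in [(1 / L, "1/L"), (2 / (m + L), "2/(m+L)")]:
          for rho in (1.0, 0.98, 0.95):
              G = -alpha / (rho * z - 1)
              val = (rho / z - 1) * (1 - L * G) / (1 - m * G)
              plt.plot(val.real, val.imag, label=f"alpha = {name}, rho = {rho}")

      plt.xlabel("real part")
      plt.ylabel("imaginary part")
      plt.legend()
      plt.show()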

  • Idea: add a margin to improve robustness

    Re[ (ρz⁻¹ − 1) · (1 − L·G(ρz)) / (1 − m·G(ρz)) ] + ν ≤ 0   for all |z| = 1

    • ν > 0 is a robustness margin
    • tune α, β, γ to obtain a vertical line at −ν
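    For instance, reusing freq_condition_holds from the sketch above with the robust momentum parameters (m = 1, L = 10, and ρ = 0.85 are illustrative values inside the allowed range), the grid check should report that the condition holds, since the locus is a vertical line strictly left of zero:

      m, L = 1.0, 10.0
      kappa = L / m
      rho = 0.85    # between 1 - 1/sqrt(kappa) ≈ 0.684 and 1 - 1/kappa = 0.9
      alpha = kappa * (1 - rho)**2 * (1 + rho) / L
      beta = kappa * rho**3 / (kappa - 1)
      gamma = rho**3 / ((kappa - 1) * (1 - rho)**2 * (1 + rho))
      print(freq_condition_holds(alpha, beta, gamma, m, L, rho, nu=0.0))  # expect True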

  • Robust Momentum Method

    [Figure: frequency-response locus of the robust momentum method in the
    complex plane for ρ = 0.90, ρ = 0.85, ρ = 0.80, and the optimal ρ;
    each locus is a vertical line at −ν.]

  • Nesterov’s AGM — not a line!

    [Figure: frequency-response locus of Nesterov's AGM in the complex
    plane for ρ = 0.80, ρ = 0.77, ρ = 0.76, and the optimal ρ; the locus
    is not a vertical line.]

  • Trade-off: Performance vs noise

    [Figure: convergence rate ρ vs. noise strength δ ∈ [0, 1], with rate
    axis marks at 1 − 1/√κ, (κ−1)/(κ+1), 1 − 1/κ, and 1. Curves: GM
    (α = 1/L), GM (α = 2/(L+m)), GM (best α ∈ [0, 2/L]), Nesterov's AGM,
    RMM (ν = 0), RMM (ν = 1/3), RMM (ν = 2/3), and RMM (best ν ∈ [0, 1)).]

  • Simulation with relative gradient noise

    f(x) = x₁² + 10x₂²

    [Figure: error vs. iteration (log scale), in three panels with noise
    levels δ = 0, δ = 0.25, and δ = 0.5, comparing RMM (ν = 0),
    RMM (ν = 0.55), and AGM.]
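    A sketch reproducing this experiment for the robust momentum method, using the relative-noise oracle from earlier. For f(x) = x₁² + 10x₂² the Hessian is diag(2, 20), so m = 2, L = 20, and κ = 10. The slides label the RMM curves by ν; this sketch is parameterized by ρ instead, and ρ = 0.8 below is an illustrative choice:

      import numpy as np

      Q = np.diag([2.0, 20.0])    # Hessian of f(x) = x1^2 + 10 x2^2
      m, L = 2.0, 20.0
      kappa = L / m
      rng = np.random.default_rng(0)

      def noisy_grad(x, delta):
          g = Q @ x
          d = rng.standard_normal(2)
          return g + delta * np.linalg.norm(g) * d / np.linalg.norm(d)

      def rmm_errors(rho, delta, iters=100):
          alpha = kappa * (1 - rho)**2 * (1 + rho) / L
          beta = kappa * rho**3 / (kappa - 1)
          gamma = rho**3 / ((kappa - 1) * (1 - rho)**2 * (1 + rho))
          x_prev, x = np.array([1.0, 1.0]), np.array([1.0, 1.0])
          errs = []
          for _ in range(iters):
              y = x + gamma * (x - x_prev)
              x_prev, x = x, x + beta * (x - x_prev) - alpha * noisy_grad(y, delta)
              errs.append(np.linalg.norm(x))
          return errs

      for delta in (0.0, 0.25, 0.5):
          print(f"delta = {delta}: final error {rmm_errors(rho=0.8, delta=delta)[-1]:.2e}")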

  • Thank you