
  • Frank-Wolfe Algorithm &
    Alternating Direction Method of Multipliers

    Ives Macedo (ijamj@cs.ubc.ca)

    October 27, 2015

  • Where were we?

  • Previous episode... Proximal-gradient methods

    minimize_{x ∈ X}  f(x) + g(x)

    - f : X → R convex with Lipschitz-continuous gradient
    - g : X → R ∪ {+∞} convex and simple (i.e., proximable)

    prox_{λg} : X → X  (λ > 0)

    prox_{λg}(x) := argmin_{x' ∈ X} { g(x') + (1/(2λ)) ‖x' − x‖₂² }

    x^{k+1} := prox_{λ_k g}( x^k − λ_k ∇f(x^k) )
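To make the iteration above concrete, here is a minimal proximal-gradient (ISTA) sketch in Python for the lasso instance f(x) = ½‖Ax − b‖₂², g(x) = λ‖x‖₁, whose prox is soft-thresholding. The helper names and the synthetic data A, b, lam are illustrative assumptions, not material from the slides.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient_lasso(A, b, lam, num_iters=500):
    """ISTA for min_x 0.5*||Ax - b||_2^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2         # Lipschitz constant of the smooth gradient
    step = 1.0 / L                        # step size in (0, 2/L)
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)          # gradient of the smooth part f
        x = soft_threshold(x - step * grad, step * lam)   # prox step for g
    return x

# Tiny synthetic example (data made up purely for illustration).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
x_true = np.zeros(50); x_true[:3] = [1.0, -2.0, 0.5]
b = A @ x_true
x_hat = proximal_gradient_lasso(A, b, lam=0.1)
```

With g = δ_C, the indicator of a convex set C, the same loop becomes projected gradient: the prox step is simply the Euclidean projection onto C, as the next slide notes.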

  • Proximal-gradient methods: Good news

    - λ_k ∈ (0, 2/L) ⇒ f(x^k) + g(x^k) − min{f + g} ∈ O(1/k)

    - Acceleration gives O(1/k²)

    - Generalize projected gradient methods, where

      g(x) = δ_C(x) := { 0 if x ∈ C;  +∞ if x ∉ C }

  • Proximal-gradient methods: Bad news

    - Some sets C can be tough to project onto, but you can
      minimize linear functions over them

      ⇒ Frank-Wolfe Algorithm / Conditional Gradient Method

    - Dealing with a composite term g(Ax) ain't easy even when g is simple

      ⇒ Alternating Direction Method of Multipliers (ADMM)


  • Frank-Wolfe Algorithm

  • A motivating problem: Matrix completion

      ? ? 2 ?
      1 ? ? 3
      1 2 2 3
      ? 6 6 9
      3 ? ? 9
      1 ? 2 ?
      ? ? 6 9

  • A motivating problem: Matrix completion

      0 0 2 0
      1 0 0 3
      1 2 2 3
      0 6 6 9
      3 0 0 9
      1 0 2 0
      0 0 6 9

  • A motivating problem: Matrix completion

    rank ( 0 0 2 0
           1 0 0 3
           1 2 2 3
           0 6 6 9
           3 0 0 9
           1 0 2 0
           0 0 6 9 ) = 4

  • A motivating problem: Matrix completion

    rank ( 1 2 2 3
           1 2 2 3
           1 2 2 3
           3 6 6 9
           3 6 6 9
           1 2 2 3
           3 6 6 9 ) = 1

  • A motivating problem: Matrix completion with nuclear-norm lasso

    minimize_{X ∈ R^{n₁×n₂}}  (1/2) Σ_{k=1}^{m} (X_{i_k, j_k} − b_k)²   subject to  ‖X‖_* ≤ τ

    - ‖X‖_* = ‖σ(X)‖₁ = Σ_{i=1}^{min{n₁,n₂}} σ_i(X)

    - Projection onto {X : ‖X‖_* ≤ τ} potentially requires a full SVD

    - Linear minimization requires only one SVD triplet! (see the sketch below)
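The last bullet is the computational point that makes Frank-Wolfe attractive here: over the nuclear-norm ball, argmin_{‖X‖_* ≤ τ} ⟨G, X⟩ is attained at −τ u₁v₁ᵀ, where (u₁, σ₁, v₁) is the leading singular triplet of G. The helper below is a sketch of that oracle using SciPy's truncated SVD; the function name nuclear_norm_lmo is made up for illustration.

```python
import numpy as np
from scipy.sparse.linalg import svds

def nuclear_norm_lmo(G, tau):
    """Linear minimization over the nuclear-norm ball {X : ||X||_* <= tau}.

    argmin_{||X||_* <= tau} <G, X> = -tau * u1 v1^T, where (u1, v1) is the
    leading singular-vector pair of G, so only one SVD triplet is computed.
    """
    u, s, vt = svds(np.asarray(G, dtype=float), k=1)   # top singular triplet only
    return -tau * np.outer(u[:, 0], vt[0, :])
```

By contrast, projecting onto the same ball requires all singular values of the point being projected.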

  • Model problem

    minimize_{x ∈ C}  f(x)

    - f : Rⁿ → R is convex and continuously differentiable
    - C ⊆ Rⁿ is convex and compact (i.e., closed and bounded)
    - we can minimize linear functions over C, i.e., for every c ∈ Rⁿ we can
      find x̄ ∈ argmin_{x ∈ C} ⟨c, x⟩

  • Frank-Wolfe algorithm (Frank and Wolfe, 1956)

    x⁰ ∈ C

    x̄^{k+1} ∈ argmin_{x ∈ C} { f(x^k) + ⟨∇f(x^k), x − x^k⟩ }

    x^{k+1} := (1 − γ_k) x^k + γ_k x̄^{k+1},   γ_k := 2/(k + 2)

    Approximation similar to projected gradient, but no quadratic term!

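A compact sketch of the iteration above, written generically against a linear-minimization oracle for C. The helper names frank_wolfe and l1_ball_lmo and the small quadratic example are illustrative assumptions, not code from the talk.

```python
import numpy as np

def frank_wolfe(grad_f, lmo, x0, num_iters=200):
    """Frank-Wolfe / conditional gradient with step size gamma_k = 2/(k+2).

    grad_f : callable returning the gradient of f at x
    lmo    : callable returning argmin_{s in C} <g, s> for a gradient g
    x0     : feasible starting point in C
    """
    x = np.asarray(x0, dtype=float)
    for k in range(num_iters):
        s = lmo(grad_f(x))              # linear minimization over C
        gamma = 2.0 / (k + 2.0)         # open-loop step size from the slides
        x = (1.0 - gamma) * x + gamma * s
    return x

# Example: minimize 0.5*||x - c||^2 over the l1 ball {x : ||x||_1 <= 1}.
# (c is an arbitrary illustrative target point.)
c = np.array([0.8, -0.3, 0.1])

def l1_ball_lmo(g, radius=1.0):
    # A vertex of the l1 ball minimizing <g, s>: -radius * sign(g_i) * e_i
    i = np.argmax(np.abs(g))
    s = np.zeros_like(g)
    s[i] = -radius * np.sign(g[i])
    return s

x_fw = frank_wolfe(lambda x: x - c, l1_ball_lmo, x0=np.zeros(3))
```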

  • Frank-Wolfe algorithm: Visualizing the iterates

    [sequence of figures illustrating the iterates; images not reproduced in this transcript]

  • Frank-Wolfe algorithm: Curvature constant

    ℓ_f(y; x) := f(y) − f(x) − ⟨∇f(x), y − x⟩

    C_f := max_{x, x' ∈ C,  γ ∈ [0,1],  y = (1−γ)x + γx'}  (2/γ²) ℓ_f(y; x)

  • Frank-Wolfe algorithm: Curvature constant (example)

    f(x) = (1/2)‖x‖₂²

    ℓ_f(y; x) = (1/2)‖y − x‖₂²

    C_f = max_{x, x' ∈ C} ‖x' − x‖₂² = (diam C)²

    If ∇f is L-Lipschitz, then C_f ≤ L (diam C)²

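For the quadratic example above, the identity ℓ_f(y; x) = ½‖y − x‖₂² is a one-line expansion (using ∇f(x) = x), and plugging in y = (1 − γ)x + γx' recovers the (diam C)² value of C_f; a worked version of that step:

```latex
\ell_f(y;x) = \tfrac12\|y\|_2^2 - \tfrac12\|x\|_2^2 - \langle x,\, y - x\rangle
            = \tfrac12\|y\|_2^2 - \langle x, y\rangle + \tfrac12\|x\|_2^2
            = \tfrac12\|y - x\|_2^2,
\qquad
\frac{2}{\gamma^2}\,\ell_f\bigl((1-\gamma)x + \gamma x';\, x\bigr)
  = \frac{2}{\gamma^2}\cdot\tfrac12\,\gamma^2\,\|x' - x\|_2^2
  = \|x' - x\|_2^2 .
```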

  • Frank-Wolfe algorithm: Approximate subproblem minimizers

    x̄^{k+1} ∈ { x ∈ C : ℓ_f(x; x^k) ≤ min_{x' ∈ C} ℓ_f(x'; x^k) + (1/2) δ γ_k C_f }

  • Frank-Wolfe algorithm: Exact line-search

    γ_k ∈ argmin_{γ ∈ [0,1]}  f( (1 − γ) x^k + γ x̄^{k+1} )
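When the one-dimensional restriction γ ↦ f((1 − γ)x^k + γx̄^{k+1}) is cheap to evaluate, this line search can be approximated numerically. The snippet below is an assumed sketch using SciPy's bounded scalar minimizer; it returns a high-accuracy numerical minimizer rather than a truly exact one.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def line_search_fw_step(f, x, s):
    """Frank-Wolfe update with gamma chosen by numerical line search on [0, 1]."""
    res = minimize_scalar(lambda g: f((1.0 - g) * x + g * s),
                          bounds=(0.0, 1.0), method="bounded")
    gamma = res.x
    return (1.0 - gamma) * x + gamma * s
```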

  • Frank-Wolfe algorithm: Fully-corrective reoptimization

    x^{k+1} ∈ argmin_{x ∈ conv{x⁰, x̄¹, ..., x̄^{k+1}}}  f(x)

  • Frank-Wolfe algorithm: Primal convergence

    Theorem (Jaggi, 2013)

    f(x^k) − inf_C f  ≤  (2 C_f / (k + 2)) (1 + δ)

  • Frank-Wolfe algorithm: Lower bound on primal convergence

    Theorem (Canon and Cullum, 1968)

    There are instances with strongly convex objectives for which the
    original FWA generates sequences with the following behavior:
    for all C, ε > 0 there are infinitely many k such that

    f(x^k) − inf_C f  ≥  C / k^{1+ε}

  • Frank-Wolfe algorithm: Faster variants

    - Linear convergence can be obtained in certain cases if away/drop steps are
      used; see (Guélat and Marcotte, 1986) and (Lacoste-Julien and Jaggi, 2014)

    - For smooth f and strongly convex C, a simple variant has complexity
      O(1/k²) (Garber and Hazan, 2015)

  • Alternating Direction Method of Multipliers

  • A motivating problem: Total-variation image denoising

    [sequence of example images; not reproduced in this transcript]

    minimizef

    1

    2

    (f f)2 +

    f2

  • A motivating problem: Total-variation image denoising

    minimize_x  (1/2)‖x − x̂‖₂² + λ‖Dx‖₁

  • A motivating problem: Total-variation image denoising

    minimize_{x,y}  (1/2)‖x − x̂‖₂² + λ‖y‖₁   subject to  Dx − y = 0

  • Equality-constrained optimization: First-order optimality conditions

    minimize_x  f(x)   subject to  h(x) = 0

    Necessary for x* to be a minimizer:

    ∇f(x*) + ∇h(x*)ᵀ z* = 0
    h(x*) = 0

  • Equality-constrained optimization: Quadratic penalization

    x^{k+1} ∈ argmin_x { f(x) + (ρ_k/2)‖h(x)‖₂² }

    ∇f(x^{k+1}) + ∇h(x^{k+1})ᵀ [ρ_k h(x^{k+1})] = 0

    - Need h(x^{k+1}) → 0 and ρ_k → +∞ for ρ_k h(x^{k+1}) → z* ≠ 0

  • Equality-constrained optimization: Lagrangian minimization

    x^{k+1} ∈ argmin_x { f(x) + ⟨z^k, h(x)⟩ }
    z^{k+1} = z^k + ρ_k h(x^{k+1})

    ∇f(x^{k+1}) + ∇h(x^{k+1})ᵀ z^k = 0

    - (Super)gradient ascent on the concave dual

    - Stability issues when the argmin has multiple points at the solution

  • Equality-constrained optimization: Method of Multipliers / Augmented Lagrangian

    - MM = Lagrangian minimization + quadratic penalization

    x^{k+1} ∈ argmin_x { f(x) + ⟨z^k, h(x)⟩ + (ρ_k/2)‖h(x)‖₂² }
    z^{k+1} = z^k + ρ_k h(x^{k+1})

    ∇f(x^{k+1}) + ∇h(x^{k+1})ᵀ z^{k+1} = 0

    - Will work once ρ_k is sufficiently large (no need for ρ_k → +∞)
    - Computing x^{k+1} can be tough (a small sketch of an easy case follows below)
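To illustrate the iteration above in a case where the subproblem is easy, here is a sketch of the method of multipliers for the toy problem min ½‖x − c‖² subject to Ax = b, where minimizing the augmented Lagrangian reduces to a linear solve. The function name and the data A, b, c are assumptions for illustration only.

```python
import numpy as np

def method_of_multipliers(A, b, c, rho=1.0, num_iters=100):
    """Method of multipliers for min 0.5*||x - c||^2 s.t. Ax = b.

    Each x-update minimizes the augmented Lagrangian
        0.5*||x - c||^2 + <z, Ax - b> + (rho/2)*||Ax - b||^2,
    which here is a linear system; z is the multiplier (dual) estimate.
    """
    m, n = A.shape
    x = np.zeros(n)
    z = np.zeros(m)
    H = np.eye(n) + rho * A.T @ A          # Hessian of the augmented Lagrangian
    for _ in range(num_iters):
        rhs = c - A.T @ z + rho * A.T @ b
        x = np.linalg.solve(H, rhs)        # augmented-Lagrangian minimization
        z = z + rho * (A @ x - b)          # multiplier update
    return x, z
```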

  • Model problem

    minimize_{x ∈ X}  f(x) + g(Ax)

    - A : X → Y linear
    - g : Y → R ∪ {+∞} convex and proximable
    - f : X → R such that one can solve:

      minimize_{x ∈ X}  f(x) + (1/2)‖b − Ax‖₂²

  • Model problem: Method of Multipliers

    minimize_{x,y}  f(x) + g(y)   subject to  Ax − y = 0

    (x^{k+1}, y^{k+1}) ∈ argmin_{x,y} { f(x) + g(y) + (ρ_k/2)‖Ax − y + z^k/ρ_k‖₂² }
    z^{k+1} = z^k + ρ_k (Ax^{k+1} − y^{k+1})

    - Still a tricky joint minimization over x and y

    - Alternate!

  • Model problem: Alternating Direction Method of Multipliers

    x^{k+1} ∈ argmin_x { f(x) + (ρ_k/2)‖Ax − y^k + z^k/ρ_k‖₂² }

    y^{k+1} = argmin_y { g(y) + (ρ_k/2)‖Ax^{k+1} − y + z^k/ρ_k‖₂² }
            = prox_{(1/ρ_k) g}[ Ax^{k+1} + z^k/ρ_k ]

    z^{k+1} = z^k + ρ_k (Ax^{k+1} − y^{k+1})

  • Model problem: Alternating Direction Method of Multipliers

    - Simpler iterations when ρ_k ≡ ρ (defining z̃^k := z^k/ρ)

    x^{k+1} ∈ argmin_x { f(x) + (ρ/2)‖Ax − y^k + z̃^k‖₂² }

    y^{k+1} = prox_{(1/ρ) g}[ Ax^{k+1} + z̃^k ]

    z̃^{k+1} = z̃^k + (Ax^{k+1} − y^{k+1})
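Putting the scaled iteration to work on the motivating problem: below is a sketch of ADMM for total-variation denoising, min_x ½‖x − x̂‖₂² + λ‖Dx‖₁ with the splitting Dx − y = 0. The x-update is a linear solve with I + ρDᵀD, the y-update is soft-thresholding, and z plays the role of the scaled dual variable. The slides' example is 2-D image denoising; a 1-D signal keeps the sketch short, and the function names and synthetic data are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def tv_denoise_admm(x_noisy, lam, rho=1.0, num_iters=200):
    """Scaled ADMM for 1-D total-variation denoising:
        min_x 0.5*||x - x_noisy||^2 + lam*||D x||_1
    using the splitting Dx - y = 0 from the slides (z is the scaled dual)."""
    n = x_noisy.size
    D = np.diff(np.eye(n), axis=0)          # forward-difference operator, shape (n-1, n)
    H = np.eye(n) + rho * D.T @ D
    x = x_noisy.copy()
    y = D @ x
    z = np.zeros(n - 1)
    for _ in range(num_iters):
        x = np.linalg.solve(H, x_noisy + rho * D.T @ (y - z))   # x-update (linear solve)
        y = soft_threshold(D @ x + z, lam / rho)                 # prox of (lam/rho)*||.||_1
        z = z + D @ x - y                                        # scaled dual update
    return x

# Illustrative noisy piecewise-constant signal (synthetic data).
rng = np.random.default_rng(0)
signal = np.concatenate([np.zeros(40), np.ones(40), 0.5 * np.ones(40)])
x_denoised = tv_denoise_admm(signal + 0.1 * rng.standard_normal(120), lam=0.5)
```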

  • ADMM: Total-variation denoising

    [result figures; not reproduced in this transcript]

  • ADMM: Convergence

    - Function values decrease as O(1/k) (He and Yuan, 2012)

    - Linear convergence if f or g is strongly convex and under certain
      conditions on A (Deng and Yin, 2012)

  • Other problems suitable for ADMM: In case I haven't bored you out of your mind...

  • ADMM: What if f is only proximable?

    minimize_x  f(x) + g(Ax)

    minimize_{x₁, x₂, y}  f(x₁) + g(y)
    subject to  Ax₂ − y = 0,   x₁ − x₂ = 0

  • ADMM: What if f is only proximable?

    minimize_{x₁, x₂, y}  f(x₁) + g(y)
    subject to  Ax₂ − y = 0,   x₁ − x₂ = 0

    x₁^{k+1} = prox_{(1/ρ) f}[ x₂^k − z₂^k ]
    y^{k+1}  = prox_{(1/ρ) g}[ A x₂^k + z₁^k ]
    x₂^{k+1} = (I + AᵀA)⁻¹ ( x₁^{k+1} + z₂^k + Aᵀ(y^{k+1} − z₁^k) )
    z₁^{k+1} = z₁^k + (A x₂^{k+1} − y^{k+1})
    z₂^{k+1} = z₂^k + (x₁^{k+1} − x₂^{k+1})

  • ADMM: Sum of proximable functions

    minimize_x  Σ_{i=1}^{m} f_i(x)

    x_i^{k+1} = prox_{(1/ρ) f_i}[ x̄^k − z_i^k ]

    x̄^{k+1} = (1/m) Σ_{i=1}^{m} ( x_i^{k+1} + z_i^k )

    z_i^{k+1} = z_i^k + ( x_i^{k+1} − x̄^{k+1} )

    Distributed consensus

  • ADMM: Sum of proximable functions

    minimize_{x, x₁, ..., x_m}  Σ_{i=1}^{m} f_i(x_i)   subject to  x_i − x = 0  (i = 1, ..., m)
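A sketch of the consensus iteration from the previous slide, for a sum of two proximable terms. The prox operators, the vector a and the weight lam are illustrative assumptions, not data from the talk.

```python
import numpy as np

def consensus_admm(prox_list, n, rho=1.0, num_iters=200):
    """Consensus ADMM for min_x sum_i f_i(x), given the prox operators of the f_i.

    prox_list : list of callables; prox_list[i](v, t) = prox_{t * f_i}(v)
    n         : dimension of x
    """
    m = len(prox_list)
    x_bar = np.zeros(n)
    z = [np.zeros(n) for _ in range(m)]                      # scaled dual variables
    for _ in range(num_iters):
        x = [prox_list[i](x_bar - z[i], 1.0 / rho) for i in range(m)]   # local prox steps
        x_bar = sum(x[i] + z[i] for i in range(m)) / m       # consensus (averaging) step
        z = [z[i] + x[i] - x_bar for i in range(m)]          # dual updates
    return x_bar

# Example: f1(x) = 0.5*||x - a||^2 and f2(x) = lam*||x||_1 (a, lam are illustrative).
a = np.array([2.0, -1.0, 0.5])
lam = 0.5
prox_quad = lambda v, t: (v + t * a) / (1.0 + t)                          # prox of t*f1
prox_l1 = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t * lam, 0.0)  # prox of t*f2
x_star = consensus_admm([prox_quad, prox_l1], n=3)
```

Each local update touches only its own f_i, which is what makes this splitting natural for distributed computation.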