TD(λ) and the Proximal Algorithm
Dimitri P. Bertsekas
Laboratory for Information and Decision SystemsMassachusetts Institute of Technology
Optimization Methods Workshop
NIPS 2017
Bertsekas (M.I.T.) TD(λ) and the Proximal Algorithm 1 / 30
A Bridge Between Convex Analysis and Approximate DynamicProgramming
Convex Analysis
Deterministic ProblemsGeometric IdeasIterative DescentProximal Algorithms
Approximate DP
Stochastic ProblemsSimulation IdeasValue and Policy IterationTemporal Differences
4
Polyhedral Convexity Template
x0 x1 x2 x3 x4 f(x) X x
f(x0) + (x − x0)′g0
f(x1) + (x − x1)′g1
αx + (1 − α)y, 0 ≤ α ≤ 1
< 90◦
Level set!x | f(x) ≤ f∗ + αc2/2
"Optimal solution set x0
X
xk − αkgk
xk+1 = PX(xk − αkgk)
x∗
xk
1
4
Polyhedral Convexity Template
x0 x1 x2 x3 x4 f(x) X x
f(x0) + (x − x0)′g0
f(x1) + (x − x1)′g1
αx + (1 − α)y, 0 ≤ α ≤ 1
< 90◦
Level set!x | f(x) ≤ f∗ + αc2/2
"Optimal solution set x0
X
xk − αkgk
xk+1 = PX(xk − αkgk)
x∗
xk
1
C1 C2 H Normal xk�xk+1
ck xk xk+1
�k � 12ck kx � xkk2 �k f(xk)
Constraint set X x Ha Hb a w b sa sb a b yk = (Xk)�1xk yk+1
yk = (1, 1, 1)
(a) (b) xk xk+2 xk+3 xkrf(xk) xk+1 = xk + dk xk � ↵kDkrf(xk)
↵02x = b2 ↵0
1x = b1 a2 a1 (µ1 < 0) (µ2 > 0) �µ2a2 �µ1a1
{x | Ax = b, x � 0} x⇤ {x | Ax = b} {y | AXk = b}
Feasible directions at x d z � x⇤ x � x⇤ x⇤ = 0 x x⇤ � srf(x⇤)
Level sets of f �(t) = max{0, t} t �(t,�, c) x0ref x1
exp x3ref x4 =
x1ref x5 = x3
ref x0 x1 x2 x3
Origin of Destination of OD Pair w rw x1 x2 x3 x4 rf(x) xk+2 xk+3
xk+2 � sk+2rf(xk+2) xk+1 � sk+1rf(xk+1) xk+1 = xk � skrf(xk)
Constraint set X z � x⇤ x � x⇤ x⇤ = 0 x rf(x⇤) Level sets of f�(t) = max{0, t} t �(t,�, c) x0
ref x1exp x3
ref x4 = x1ref x5 = x3
ref
x0 x1 x2 x3 x4 x⇤ x xk xk�1 xk+1 �1 1 0 �k ↵k x1 x0 x2 X1 X2 rf(x2)
d0 d1 y2 ⇠0 ⇠1 ⇠2 x0 x1 x2 x0 x1 d1 = Q�1/2w1 d0 = Q�1/2w0
y0 y1 y2 w0 w1 ⇠0 = �c10d0 d1 = ⇠1 + c10d0 d2 = ⇠2 + c20d0 + c21d1
xi xref xmax xmin xnew xexp x xnew = xref (a) (b) (c) (d)
� �1 �2 �n�1 �nPn
i=1 ⇠i�i1�
�1+�n���1�n
1�
�1+�n�Pn
i=1⇠i�i
�1�n
�(⇠) =1Pn
i=1 ⇠i�i (⇠) =
nX
i=1
⇠i�i
1
C1 C2 H Normal xk�xk+1
ck xk xk+1
�k � 12ck kx � xkk2 �k f(xk)
Constraint set X x Ha Hb a w b sa sb a b yk = (Xk)�1xk yk+1
yk = (1, 1, 1)
(a) (b) xk xk+2 xk+3 xkrf(xk) xk+1 = xk + dk xk � ↵kDkrf(xk)
↵02x = b2 ↵0
1x = b1 a2 a1 (µ1 < 0) (µ2 > 0) �µ2a2 �µ1a1
{x | Ax = b, x � 0} x⇤ {x | Ax = b} {y | AXk = b}
Feasible directions at x d z � x⇤ x � x⇤ x⇤ = 0 x x⇤ � srf(x⇤)
Level sets of f �(t) = max{0, t} t �(t,�, c) x0ref x1
exp x3ref x4 =
x1ref x5 = x3
ref x0 x1 x2 x3
Origin of Destination of OD Pair w rw x1 x2 x3 x4 rf(x) xk+2 xk+3
xk+2 � sk+2rf(xk+2) xk+1 � sk+1rf(xk+1) xk+1 = xk � skrf(xk)
Constraint set X z � x⇤ x � x⇤ x⇤ = 0 x rf(x⇤) Level sets of f�(t) = max{0, t} t �(t,�, c) x0
ref x1exp x3
ref x4 = x1ref x5 = x3
ref
x0 x1 x2 x3 x4 x⇤ x xk xk�1 xk+1 �1 1 0 �k ↵k x1 x0 x2 X1 X2 rf(x2)
d0 d1 y2 ⇠0 ⇠1 ⇠2 x0 x1 x2 x0 x1 d1 = Q�1/2w1 d0 = Q�1/2w0
y0 y1 y2 w0 w1 ⇠0 = �c10d0 d1 = ⇠1 + c10d0 d2 = ⇠2 + c20d0 + c21d1
xi xref xmax xmin xnew xexp x xnew = xref (a) (b) (c) (d)
� �1 �2 �n�1 �nPn
i=1 ⇠i�i1�
�1+�n���1�n
1�
�1+�n�Pn
i=1⇠i�i
�1�n
�(⇠) =1Pn
i=1 ⇠i�i (⇠) =
nX
i=1
⇠i�i
1
C1 C2 H Normal xk�xk+1
ck xk xk+1
�k � 12ck kx � xkk2 �k f(xk)
Constraint set X x Ha Hb a w b sa sb a b yk = (Xk)�1xk yk+1
yk = (1, 1, 1)
(a) (b) xk xk+2 xk+3 xkrf(xk) xk+1 = xk + dk xk � ↵kDkrf(xk)
↵02x = b2 ↵0
1x = b1 a2 a1 (µ1 < 0) (µ2 > 0) �µ2a2 �µ1a1
{x | Ax = b, x � 0} x⇤ {x | Ax = b} {y | AXk = b}
Feasible directions at x d z � x⇤ x � x⇤ x⇤ = 0 x x⇤ � srf(x⇤)
Level sets of f �(t) = max{0, t} t �(t,�, c) x0ref x1
exp x3ref x4 =
x1ref x5 = x3
ref x0 x1 x2 x3
Origin of Destination of OD Pair w rw x1 x2 x3 x4 rf(x) xk+2 xk+3
xk+2 � sk+2rf(xk+2) xk+1 � sk+1rf(xk+1) xk+1 = xk � skrf(xk)
Constraint set X z � x⇤ x � x⇤ x⇤ = 0 x rf(x⇤) Level sets of f�(t) = max{0, t} t �(t,�, c) x0
ref x1exp x3
ref x4 = x1ref x5 = x3
ref
x0 x1 x2 x3 x4 x⇤ x xk xk�1 xk+1 �1 1 0 �k ↵k x1 x0 x2 X1 X2 rf(x2)
d0 d1 y2 ⇠0 ⇠1 ⇠2 x0 x1 x2 x0 x1 d1 = Q�1/2w1 d0 = Q�1/2w0
y0 y1 y2 w0 w1 ⇠0 = �c10d0 d1 = ⇠1 + c10d0 d2 = ⇠2 + c20d0 + c21d1
xi xref xmax xmin xnew xexp x xnew = xref (a) (b) (c) (d)
� �1 �2 �n�1 �nPn
i=1 ⇠i�i1�
�1+�n���1�n
1�
�1+�n�Pn
i=1⇠i�i
�1�n
�(⇠) =1Pn
i=1 ⇠i�i (⇠) =
nX
i=1
⇠i�i
1
C1 C2 H Normal xk�xk+1
ck xk xk+1
�k � 12ck kx � xkk2 �k f(xk)
Constraint set X x Ha Hb a w b sa sb a b yk = (Xk)�1xk yk+1
yk = (1, 1, 1)
(a) (b) xk xk+2 xk+3 xkrf(xk) xk+1 = xk + dk xk � ↵kDkrf(xk)
↵02x = b2 ↵0
1x = b1 a2 a1 (µ1 < 0) (µ2 > 0) �µ2a2 �µ1a1
{x | Ax = b, x � 0} x⇤ {x | Ax = b} {y | AXk = b}
Feasible directions at x d z � x⇤ x � x⇤ x⇤ = 0 x x⇤ � srf(x⇤)
Level sets of f �(t) = max{0, t} t �(t,�, c) x0ref x1
exp x3ref x4 =
x1ref x5 = x3
ref x0 x1 x2 x3
Origin of Destination of OD Pair w rw x1 x2 x3 x4 rf(x) xk+2 xk+3
xk+2 � sk+2rf(xk+2) xk+1 � sk+1rf(xk+1) xk+1 = xk � skrf(xk)
Constraint set X z � x⇤ x � x⇤ x⇤ = 0 x rf(x⇤) Level sets of f�(t) = max{0, t} t �(t,�, c) x0
ref x1exp x3
ref x4 = x1ref x5 = x3
ref
x0 x1 x2 x3 x4 x⇤ x xk xk�1 xk+1 �1 1 0 �k ↵k x1 x0 x2 X1 X2 rf(x2)
d0 d1 y2 ⇠0 ⇠1 ⇠2 x0 x1 x2 x0 x1 d1 = Q�1/2w1 d0 = Q�1/2w0
y0 y1 y2 w0 w1 ⇠0 = �c10d0 d1 = ⇠1 + c10d0 d2 = ⇠2 + c20d0 + c21d1
xi xref xmax xmin xnew xexp x xnew = xref (a) (b) (c) (d)
� �1 �2 �n�1 �nPn
i=1 ⇠i�i1�
�1+�n���1�n
1�
�1+�n�Pn
i=1⇠i�i
�1�n
�(⇠) =1Pn
i=1 ⇠i�i (⇠) =
nX
i=1
⇠i�i
1
C1 C2 H Normal xk�xk+1
ck xk xk+1
�k � 12ck kx � xkk2 �k f(xk)
Separating hyperplane
Constraint set X x Ha Hb a w b sa sb a b yk = (Xk)�1xk yk+1
yk = (1, 1, 1)
(a) (b) xk xk+2 xk+3 xkrf(xk) xk+1 = xk + dk xk � ↵kDkrf(xk)
↵02x = b2 ↵0
1x = b1 a2 a1 (µ1 < 0) (µ2 > 0) �µ2a2 �µ1a1
{x | Ax = b, x � 0} x⇤ {x | Ax = b} {y | AXk = b}
Feasible directions at x d z � x⇤ x � x⇤ x⇤ = 0 x x⇤ � srf(x⇤)
Level sets of f �(t) = max{0, t} t �(t,�, c) x0ref x1
exp x3ref x4 =
x1ref x5 = x3
ref x0 x1 x2 x3
Origin of Destination of OD Pair w rw x1 x2 x3 x4 rf(x) xk+2 xk+3
xk+2 � sk+2rf(xk+2) xk+1 � sk+1rf(xk+1) xk+1 = xk � skrf(xk)
Constraint set X z � x⇤ x � x⇤ x⇤ = 0 x rf(x⇤) Level sets of f�(t) = max{0, t} t �(t,�, c) x0
ref x1exp x3
ref x4 = x1ref x5 = x3
ref
x0 x1 x2 x3 x4 x⇤ x xk xk�1 xk+1 �1 1 0 �k ↵k x1 x0 x2 X1 X2 rf(x2)
d0 d1 y2 ⇠0 ⇠1 ⇠2 x0 x1 x2 x0 x1 d1 = Q�1/2w1 d0 = Q�1/2w0
y0 y1 y2 w0 w1 ⇠0 = �c10d0 d1 = ⇠1 + c10d0 d2 = ⇠2 + c20d0 + c21d1
xi xref xmax xmin xnew xexp x xnew = xref (a) (b) (c) (d)
� �1 �2 �n�1 �nPn
i=1 ⇠i�i1�
�1+�n���1�n
1�
�1+�n�Pn
i=1⇠i�i
�1�n
�(⇠) =1Pn
i=1 ⇠i�i (⇠) =
nX
i=1
⇠i�i
1
Bertsekas (M.I.T.) TD(λ) and the Proximal Algorithm 2 / 30
Fixed Point Problem Formulation
Problem: Solve x = T (x)
We assume that T : <n 7→ <n has a unique fixed point and is nonexpansive,∥∥T (x1)− T (x2)∥∥ ≤ γ‖x1 − x2‖, ∀ x1, x2 ∈ <n,
where 0 ≤ γ ≤ 1 and ‖ · ‖ is some norm
Special focus for this talk: The linear case
x = Ax + b
where I − A is invertible and A has spectral radius ≤ 1
Bertsekas (M.I.T.) TD(λ) and the Proximal Algorithm 3 / 30
Proximal Algorithm - Convex Analysis (Martinet, 1970)
The proximal mapping P(c) : <n 7→ <n for x − T (x) = 0, where c > 0
P(c)(x) = Unique solution of y − T (y) =1c
(x − y)
The proximal algorithm isxk+1 = P(c)(xk )
x∗ x∗ c v xkk xk+1+1 xk+2d z xP (c)(x)
) y − T (y) ) y − T (y)
) y − ) y −
+2 Slope = −1
c
) Proximal Mapping Proximal Algorithm ) Proximal Mapping Proximal Algorithm
Bertsekas (M.I.T.) TD(λ) and the Proximal Algorithm 4 / 30
Policy Iteration in DP - Bellman’s Equation
Solve x = T (x) = minµ Tµ(x) where µ: a policy, Tµ(x) = Aµx + bµ
Policy iteration alternates between policy evaluation and policy improvement
xk = Tµk (xk ), (linear), µk+1 ∈ arg minµ
Tµ(xk ), (componentwise)
Alternative policy evaluation based on the TD (Temporal Differences) mapping
For λ ∈ (0, 1) solve x = T (λ)µ (x) where T (λ)
µ is the multistep linear mapping
T (λ)µ = (1− λ)
∞∑`=0
λ`T `+1µ
that has the same fixed point as Tµ
Iterative policy evaluation: Iterate one or more times with Tµk or T (λ)µk
xk+1 = Tµk (xk ) (value iteration) or xk+1 = T (λ)µk (xk ) (λ-policy evaluation)
xk+1 = xk + γk(sampleT (λ)
µk (xk )− xk)
with γk ↓ 0 (TD(λ) algorithm)
Bertsekas (M.I.T.) TD(λ) and the Proximal Algorithm 5 / 30
Policy Evaluation with Subspace Projection for Very Large Problems
Use intermediate projection onto a subspace of basis functions
For a fixed policy, solve the projected equation x = ΠT (λ)(x) (Galerkin approximation)
xk
xk+2
xk+3
Extrapolated Forward-Backward Step Low-Dimensional Subspace
xk+1 = ΠT (λ)(xk)
) T (λ)(xk) ) T (λ)(x∗)
) x∗ = ΠT (λ)(x∗)
xk+1 = ΠT (λ)(xk ), Projected λ-policy evaluation
xk+1 = xk + γk Π(sample T (λ)(xk )− xk
), Projected TD(λ)
Simulation-based implementations (key characteristic of RL)
For large dimension (e.g., n > 1010) there is no alternative to simulation (becauseof high-dimensional inner products)
How simulation is implemented makes a big difference; e.g., sample collection,correlations, bias, choice of λ, etc. (We will not deal with that in this lecture.)
A lot of know-how has been accumulated over the last 30 yearsBertsekas (M.I.T.) TD(λ) and the Proximal Algorithm 6 / 30
KEY POINT OF THIS TALKTD(λ) Map for Policy Evaluation ≈ Proximal Map for the Lin. Bellman Eq
T P
T Px
T P (c)
T P (c)
λ =c
c + 1, c =
λ
1 − λ
T (λ)(x) = x + c+1c
(P (c)(x) − x
)
(
J T (x)
)T (λ)(x) = (P (c)T )(x) = (TP (c))(x)
) P (c)(x) = x + λ(T (λ)(x) − x
)
Extrapolation Formula T (λ) = T · P (c) = P (c) · T
T (λ) IS FASTER
TD(λ) IS A STOCHASTIC PROXIMAL ALGORITHM FOR LINEAR FIXED POINTS
Bertsekas (M.I.T.) TD(λ) and the Proximal Algorithm 7 / 30
Visualization
4
Polyhedral Convexity Template
xk
xk+1
g
fS(s | H0)
x∗
X
Level Sets of f
∂f(x∗)
Significance Levels s − t s + tSpace of Measurement XH0 True Type I Error
1
Pc,f (z) ⌃f(x) 0 slope � c v xk xk+1 xk+2
w pW (·; x) z = g(w) g
⇤k � Dk(x, xk) ⇤k+1 � Dk+1(x, xk+1)
Ty3 x3 Slope = y3
rx(z) = �(cl⇧)(x, z)
rx(µ) � ⌅ µ Z (u, 1)
= Min Common Value w�
= Max Crossing Value q�
Positive Halfspace {x | a⇥x ⇥ b}
a�(C) C C ⇤ S⇤ d z x
Hyperplane {x | a⇥x = b} = {x | a⇥x = a⇥x}
x� x f��x� + (1 � �)x
⇥
x x�
x0 � d x1 x2 x x4 � d x5 � d d
x0 x1 x2 x3
a0 a1 a2 a3
f(z)
z
X 0 u w (µ,⇥) (u, w)µ
⇥
⇥u + w
1
x � T (x) y � T (y) rf(x) P (c)(x) xk xk+1 xk+2
1
x � T (x) y � T (y) rf(x) P (c)(x) xk xk+1 xk+2
1
x � T (x) y � T (y) rf(x) P (c)(x) xk xk+1 xk+2 Slope = �1
c
1
x − T (x) y − T (y) ∇f(x) x − P (c)(x) xk xk+1 xk+2 Slope = −1
c
T (λ)(x) = T (x) x = P (c)(x)
1
x − T (x) y − T (y) ∇f(x) x − P (c)(x) xk xk+1 xk+2 Slope = −1
c
T (λ)(x) = T (x) x = P (c)(x)
1
xc = ΠP (c)(xc) Bias (→ 0 as c → ∞) E(c)(x) = T (x) P (c)(xc)
x − T (x) = 1c (x − x)
Q(λ) = (Ψ′ΞΦ)−1Ψ′ΞA(λ)Φ d(λ) = (Ψ′ΞΦ)−1Ψ′Ξb(λ) S = {Φr | r ∈ ℜs} x∗ Πx∗ x
xλ = ΠT (xλ) Πx∗ Bias (→ 0 as λ → 1) T (xλ) x Projection ΠxSolution Approximate solution
T (λ)(x) = T (x) x = P (c)(x)
x − T (x) y − T (y) ∇f(x) x − P (c)(x) xk xk+1 xk+2 Slope = −1
c
T (λ)(x) = T (x) x = P (c)(x)
Extrapolation by a Factor of 2 T (λ) = P (c) · T = T · P (c)
Extrapolation Formula T (λ) = P (c) · T = T · P (c)
Multistep Extrapolation T (λ) = P (c) · T = T · P (c)
1
The extrapolated iterate T (x) is closer to x∗ than the proximal iterate x
A FREE LUNCH
Bertsekas (M.I.T.) TD(λ) and the Proximal Algorithm 8 / 30
Potential Implications of the TD-Proximal Relation
Benefit to the TD contextClarify the nature of TD(λ) and other TD methods
Bring proximal methodology and insights to bear on exact and approximate DP
Benefit to the proximal contextBring large scale DP/RL methodology to bear on the proximal mainstream
Develop new convex analysis algorithms based on DP/RL ideas
Bertsekas (M.I.T.) TD(λ) and the Proximal Algorithm 9 / 30
References for this Talk
D. P. Bertsekas, “Proximal Algorithms and Temporal Differences for Large LinearSystems: Extrapolation, Approximation, and Simulation," Report LIDS-P-3205,MIT, Oct. 2016 (rev. Nov. 2017)
Related book references:D. P. Bertsekas, Abstract Dynamic Programming, 2nd Edition, in press
D. P. Bertsekas, Convex Optimization Algorithms, 2015
D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, 1996
Related works on Monte Carlo solution methods for linear systems:D. P. Bertsekas and H. Yu, “Projected Equation Methods for Approximate Solutionof Large Linear Systems," J. of Comp. and Applied Mathematics, Vol. 227, 2009
M. Wang and D. P. Bertsekas, “Convergence of Iterative Simulation-BasedMethods for Singular Linear Systems", Stoch. Systems, Vol. 3, 2013
M. Wang and D. P. Bertsekas, “Stabilization of Stochastic Iterative Methods forSingular and Nearly Singular Linear Systems", Math. of Op. Res., Vol. 39, 2013
Bertsekas (M.I.T.) TD(λ) and the Proximal Algorithm 10 / 30
Outline
1 Acceleration of the Proximal Algorithm for Linear Systems
2 Acceleration of the Proximal Algorithm for Nonlinear Systems
3 Acceleration of Forward-Backward and Proximal Gradient Algorithms
4 Linearized Proximal Algorithms for Nonlinear Systems
Bertsekas (M.I.T.) TD(λ) and the Proximal Algorithm 11 / 30
The Extrapolation Formula
Let c > 0 and λ = cc+1 . Consider the proximal mapping
P(c)(x) = Unique solution of y − T (y) =1c
(x − y)
Then:T (λ) = T · P(c) = P(c) · T
and x , P(c)(x), and T (λ)(x) are colinear:
T (λ)(x) = P(c)(x) +1c(P(c)(x)− x
)
T P
T Px
T P (c)
T P (c)
λ =c
c + 1, c =
λ
1 − λ
T (λ)(x) = x + c+1c
(P (c)(x) − x
)
(
J T (x)
)T (λ)(x) = (P (c)T )(x) = (TP (c))(x)
) P (c)(x) = x + λ(T (λ)(x) − x
)
Extrapolation Formula T (λ) = T · P (c) = P (c) · TBertsekas (M.I.T.) TD(λ) and the Proximal Algorithm 13 / 30
Proof outline
Main idea: Express the proximal mapping in terms of a power seriesWe have
P(c)(x) =
(c + 1
cI − A
)−1(b +
1c
x)
and by a series expansion(c + 1
cI − A
)−1
=
(1λ
I − A)−1
= λ(I − λA)−1 = λ
∞∑`=0
(λA)`
Recall that
T (λ)(x) = (1− λ)∞∑`=0
λ`A`+1x +∞∑`=0
λ`A`b
Using these relations and the fact 1c = 1−λ
λ, it follows that
T (λ) = T · P(c) = P(c) · T
Bertsekas (M.I.T.) TD(λ) and the Proximal Algorithm 14 / 30
Acceleration
The eigenvalues of T (λ) and P(c) are simply related:
θi = ζi · θi
whereθi = i th Eig(T (λ)), θi = i th Eig(P(c)), ζi = i th Eig(A)
Moreover, T (λ) and P(c) have the same eigenvectors
Spectral radius of T (λ) ≤ Spectral radius of P(c)
Bertsekas (M.I.T.) TD(λ) and the Proximal Algorithm 15 / 30
1 Acceleration of the Proximal Algorithm for Linear Systems
2 Acceleration of the Proximal Algorithm for Nonlinear Systems
3 Acceleration of Forward-Backward and Proximal Gradient Algorithms
4 Linearized Proximal Algorithms for Nonlinear Systems
Bertsekas (M.I.T.) TD(λ) and the Proximal Algorithm 17 / 30
Nonlinear System x = T (x) - Proximal Extrapolation
Assume that the system has a unique solution x∗, and T is nonexpansive:∥∥T (x1)− T (x2)∥∥ ≤ γ‖x1 − x2‖, ∀ x1, x2 ∈ <n
where ‖ · ‖ is some Euclidean norm and γ is a scalar with 0 ≤ γ ≤ 1
Consider the proximal mapping P(c):
P(c)(x) = Unique solution of y − T (y) =1c
(x − y)
Define the extrapolated proximal mapping
E (c)(x) = x +c + 1
c(P(c)(x)− x
)Important difference: P(c)(x) and E (c)(x) cannot be easily computed by simulation
Similar to the linear case, we have
E (c)(x) = T(P(c)(x)
),
∥∥E (c)(x)− x∗∥∥ ≤ γ∥∥P(c)(x)− x∗
∥∥Bertsekas (M.I.T.) TD(λ) and the Proximal Algorithm 18 / 30
Geometric Interpretation and Proof
x∗ d z x
) y − T (y)
) y −
+2 Slope = −1
c
x = P (c)(x)) E(c)(x) = T (x)
x − T (x) = 1c (x − x)
From the definition of P(c), we have
T(P(c)(x)
)= P(c)(x) +
1c(P(c)(x)− x
)so that
T(P(c)(x)
)= x +
c + 1c(P(c)(x)− x
) def= E (c)(x)
Hence, using the assumption,∥∥E (c)(x)− x∗∥∥ =
∥∥∥T(P(c)(x)
)− x∗
∥∥∥ =∥∥∥T(P(c)(x)
)− T (x∗)
∥∥∥ ≤ γ∥∥P(c)(x)− x∗∥∥
Bertsekas (M.I.T.) TD(λ) and the Proximal Algorithm 19 / 30
1 Acceleration of the Proximal Algorithm for Linear Systems
2 Acceleration of the Proximal Algorithm for Nonlinear Systems
3 Acceleration of Forward-Backward and Proximal Gradient Algorithms
4 Linearized Proximal Algorithms for Nonlinear Systems
Bertsekas (M.I.T.) TD(λ) and the Proximal Algorithm 21 / 30
Forward-Backward Splitting Algorithm for Fixed Point Problemx = T (x)− H(x)
xk+1 = P(α)(xk − αH(xk )), α > 0
≥ π/2 Pc,f(x)−x λ∗−cv∗ = (N1·N2)(λ∗−cv∗) 0 λ∗ N2(λ∗−cv∗) v∗ ∂h(x) x −∇f(x)
Slope Optimal dual proximal solution Optimal primal
x∗ = Pc,f (x∗) Nc,f(x)− x∗ Nc,f(x) X∗ x− x∗ < π/2 (N1 · N2)(yk) N2(yk)
xk−1 xk+1 yk yk+1 xk+2 ∇f(xk+1) yk − α∇f(yk) X
yk = xk − γk∇f(xk) yk = xk + βk(xk − xk−1)
x0 = x−1 x1 = y1 y2 x2 = x∗ xk+2 ∇f(xk+1)
y2 = x1 − α1∇f(x1) x2 = y2 + β2(xk − xk−1)
Gradient Projection Step Extrapolation StepOptimal F ⋆
k (λ) y f⋆(λ) λ Slope = −1/c Slope = 1/c p(u) p(u)+ c2∥u∥2
uk+1 u
f(y) ℓ(y; x) +L
2∥y − x∥2
Slope = 1/c Slope = −1/c Slope =xk − xk+1
ckf(xk+1)
f(xk+1) + (x − xk+1)′gk+1 xk+3
βd(x)α d(xk+1) d(xk) Slope =d(xk) − d(xk+1)
ckf(x) δk+1
Augmented Lagrangian Method Proximal Algorithm Dual ProximalAlgorithm
z = Pc,f (z) Nc,f(z) ∂f(x) 0 slope −c v vk xk xk+1 = Pc(xk) = xk−vk xk+2 = Pc(xk+1)
Pc,M (z) M(x) ∂f(x) 0 slope − c v xk xk+1 xk+2 xk+1 = xk − cvk
w pW (·; x) z = g(w) g
1
≥ π/2 Pc,f(x)−x λ∗−cv∗ = (N1·N2)(λ∗−cv∗) 0 λ∗ N2(λ∗−cv∗) v∗ ∂h(x) x −∇f(x)
Slope Optimal dual proximal solution Optimal primal
x∗ = Pc,f (x∗) Nc,f(x)− x∗ Nc,f(x) X∗ x− x∗ < π/2 (N1 · N2)(yk) N2(yk)
xk−1 xk+1 yk yk+1 xk+2 ∇f(xk+1) yk − α∇f(yk) X
yk = xk − γk∇f(xk) yk = xk + βk(xk − xk−1)
x0 = x−1 x1 = y1 y2 x2 = x∗ xk+2 ∇f(xk+1)
y2 = x1 − α1∇f(x1) x2 = y2 + β2(xk − xk−1)
Gradient Projection Step Extrapolation StepOptimal F ⋆
k (λ) y f⋆(λ) λ Slope = −1/α Slope = 1/c p(u) p(u)+ c2∥u∥2
uk+1 u
f(y) ℓ(y; x) +L
2∥y − x∥2
Slope = 1/c Slope = −1/c Slope =xk − xk+1
ckf(xk+1)
f(xk+1) + (x − xk+1)′gk+1 xk+3
βd(x)α d(xk+1) d(xk) Slope =d(xk) − d(xk+1)
ckf(x) δk+1
Augmented Lagrangian Method Proximal Algorithm Dual ProximalAlgorithm
z = Pc,f (z) Nc,f(z) ∂f(x) 0 slope −c v vk xk xk+1 = Pc(xk) = xk−vk xk+2 = Pc(xk+1)
Pc,M (z) M(x) ∂f(x) 0 slope − c v xk xk+1 xk+2 xk+1 = xk − cvk
w pW (·; x) z = g(w) g
1
x0 x0 − α∇f(x0) x1 x2 x3 x∗ x1 − α∇f(x1) x2 ∂h(x) x − ∇f(x)
≥ π/2 Pc,f(x)−x λ∗−cv∗ = (N1·N2)(λ∗−cv∗) 0 λ∗ N2(λ∗−cv∗) v∗ ∂h(x) x −∇f(x)
Gradient Step Proximal Step dual proximal solution Optimal primal
x∗ = Pc,f (x∗) Nc,f(x)− x∗ Nc,f(x) X∗ x− x∗ < π/2 (N1 · N2)(yk) N2(yk)
xk−1 xk+1 yk yk+1 xk+2 ∇f(xk+1) yk − α∇f(yk) X
yk = xk − γk∇f(xk) yk = xk + βk(xk − xk−1)
x0 = x−1 x1 = y1 y2 x2 = x∗ xk+2 ∇f(xk+1)
y2 = x1 − α1∇f(x1) x2 = y2 + β2(xk − xk−1)
Gradient Projection Step Extrapolation StepOptimal F ⋆
k (λ) y f⋆(λ) λ Slope = −1/α Slope = 1/c p(u) p(u)+ c2∥u∥2
uk+1 u
f(y) ℓ(y; x) +L
2∥y − x∥2
Slope = 1/c Slope = −1/c Slope =xk − xk+1
ckf(xk+1)
f(xk+1) + (x − xk+1)′gk+1 xk+3
βd(x)α d(xk+1) d(xk) Slope =d(xk) − d(xk+1)
ckf(x) δk+1
Augmented Lagrangian Method Proximal Algorithm Dual ProximalAlgorithm
z = Pc,f (z) Nc,f(z) ∂f(x) 0 slope −c v vk xk xk+1 = Pc(xk) = xk−vk xk+2 = Pc(xk+1)
Pc,M (z) M(x) ∂f(x) 0 slope − c v xk xk+1 xk+2 xk+1 = xk − cvk
1
x0 x0 − α∇f(x0) x1 x2 x3 x∗ x1 − α∇f(x1) x2 ∂h(x) x − ∇f(x)
≥ π/2 Pc,f(x)−x λ∗−cv∗ = (N1·N2)(λ∗−cv∗) 0 λ∗ N2(λ∗−cv∗) v∗ ∂h(x) x −∇f(x)
Gradient Step Proximal Step dual proximal solution Optimal primal
x∗ = Pc,f (x∗) Nc,f(x)− x∗ Nc,f(x) X∗ x− x∗ < π/2 (N1 · N2)(yk) N2(yk)
xk−1 xk+1 yk yk+1 xk+2 ∇f(xk+1) yk − α∇f(yk) X
yk = xk − γk∇f(xk) yk = xk + βk(xk − xk−1)
x0 = x−1 x1 = y1 y2 x2 = x∗ xk+2 ∇f(xk+1)
y2 = x1 − α1∇f(x1) x2 = y2 + β2(xk − xk−1)
Gradient Projection Step Extrapolation StepOptimal F ⋆
k (λ) y f⋆(λ) λ Slope = −1/α Slope = 1/c p(u) p(u)+ c2∥u∥2
uk+1 u
f(y) ℓ(y; x) +L
2∥y − x∥2
Slope = 1/c Slope = −1/c Slope =xk − xk+1
ckf(xk+1)
f(xk+1) + (x − xk+1)′gk+1 xk+3
βd(x)α d(xk+1) d(xk) Slope =d(xk) − d(xk+1)
ckf(x) δk+1
Augmented Lagrangian Method Proximal Algorithm Dual ProximalAlgorithm
z = Pc,f (z) Nc,f(z) ∂f(x) 0 slope −c v vk xk xk+1 = Pc(xk) = xk−vk xk+2 = Pc(xk+1)
Pc,M (z) M(x) ∂f(x) 0 slope − c v xk xk+1 xk+2 xk+1 = xk − cvk
1
xc = ΠP (c)(xc) Bias (→ 0 as c → ∞) E(c)(x) = T (x) P (c)(xc)
−H(x) z = x − αH(x) x = P (α)(z) x − T (x) = 1c (x − x)
x − T (x) = 1c (x − x)
Q(λ) = (Ψ′ΞΦ)−1Ψ′ΞA(λ)Φ d(λ) = (Ψ′ΞΦ)−1Ψ′Ξb(λ) S = {Φr | r ∈ ℜs} x∗ Πx∗ x
xλ = ΠT (xλ) Πx∗ Bias (→ 0 as λ → 1) T (xλ) x Projection ΠxSolution Approximate solution
T (λ)(x) = T (x) x = P (c)(x)
x − T (x) y − T (y) ∇f(x) x − P (c)(x) xk xk+1 xk+2 Slope = −1
c
T (λ)(x) = T (x) x = P (c)(x)
Extrapolation by a Factor of 2 T (λ) = P (c) · T = T · P (c)
Extrapolation Formula T (λ) = P (c) · T = T · P (c)
Multistep Extrapolation T (λ) = P (c) · T = T · P (c)
1
x � T (x) y � T (y) rf(x) P (c)(x) xk xk+1 xk+2
1
xc = ΠP (c)(xc) Bias (→ 0 as c → ∞) E(c)(x) = T (x) P (c)(xc)
−H(x) z = x − αH(x) x = P (α)(z) Forward Step
x − T (x) = 1c (x − x) x − T (x) = 1
c (x − x)
Q(λ) = (Ψ′ΞΦ)−1Ψ′ΞA(λ)Φ d(λ) = (Ψ′ΞΦ)−1Ψ′Ξb(λ) S = {Φr | r ∈ ℜs} x∗ Πx∗ x
xλ = ΠT (xλ) Πx∗ Bias (→ 0 as λ → 1) T (xλ) x Projection ΠxSolution Approximate solution
T (λ)(x) = T (x) x = P (c)(x)
x − T (x) y − T (y) ∇f(x) x − P (c)(x) xk xk+1 xk+2 Slope = −1
c
T (λ)(x) = T (x) x = P (c)(x)
Extrapolation by a Factor of 2 T (λ) = P (c) · T = T · P (c)
Extrapolation Formula T (λ) = P (c) · T = T · P (c)
Multistep Extrapolation T (λ) = P (c) · T = T · P (c)
1
xc = ΠP (c)(xc) Bias (→ 0 as c → ∞) E(c)(x) = T (x) P (c)(xc)
−H(x) z = x − αH(x) x = P (α)(z) Forward Step
x − T (x) = 1c (x − x) x − T (x) = 1
c (x − x)
Q(λ) = (Ψ′ΞΦ)−1Ψ′ΞA(λ)Φ d(λ) = (Ψ′ΞΦ)−1Ψ′Ξb(λ) S = {Φr | r ∈ ℜs} x∗ Πx∗ x
xλ = ΠT (xλ) Πx∗ Bias (→ 0 as λ → 1) T (xλ) x Projection ΠxSolution Approximate solution
T (λ)(x) = T (x) x = P (c)(x)
x − T (x) y − T (y) ∇f(x) x − P (c)(x) xk xk+1 xk+2 Slope = −1
c
T (λ)(x) = T (x) x = P (c)(x)
Extrapolation by a Factor of 2 T (λ) = P (c) · T = T · P (c)
Extrapolation Formula T (λ) = P (c) · T = T · P (c)
Multistep Extrapolation T (λ) = P (c) · T = T · P (c)
1
xc = ΠP (c)(xc) Bias (→ 0 as c → ∞) E(c)(x) = T (x) P (c)(xc)
−H(x) z = x − αH(x) x = P (α)(z) Forward Step
x − T (x) = 1c (x − x) x − T (x) = 1
c (x − x)
Q(λ) = (Ψ′ΞΦ)−1Ψ′ΞA(λ)Φ d(λ) = (Ψ′ΞΦ)−1Ψ′Ξb(λ) S = {Φr | r ∈ ℜs} x∗ Πx∗ x
xλ = ΠT (xλ) Πx∗ Bias (→ 0 as λ → 1) T (xλ) x Projection ΠxSolution Approximate solution
T (λ)(x) = T (x) x = P (c)(x)
x − T (x) y − T (y) ∇f(x) x − P (c)(x) xk xk+1 xk+2 Slope = −1
c
T (λ)(x) = T (x) x = P (c)(x)
Extrapolation by a Factor of 2 T (λ) = P (c) · T = T · P (c)
Extrapolation Formula T (λ) = P (c) · T = T · P (c)
Multistep Extrapolation T (λ) = P (c) · T = T · P (c)
1
xc = ΠP (c)(xc) Bias (→ 0 as c → ∞) E(c)(x) = T (x) P (c)(xc)
−H(x) z = x − αH(x) x = P (α)(z) Forward Step
x − T (x) + H(x) = 1α (z − x) + H(x) x − T (x) = 1
c (x − x)
Q(λ) = (Ψ′ΞΦ)−1Ψ′ΞA(λ)Φ d(λ) = (Ψ′ΞΦ)−1Ψ′Ξb(λ) S = {Φr | r ∈ ℜs} x∗ Πx∗ x
xλ = ΠT (xλ) Πx∗ Bias (→ 0 as λ → 1) T (xλ) x Projection ΠxSolution Approximate solution
T (λ)(x) = T (x) x = P (c)(x)
x − T (x) y − T (y) ∇f(x) x − P (c)(x) xk xk+1 xk+2 Slope = −1
c
T (λ)(x) = T (x) x = P (c)(x)
Extrapolation by a Factor of 2 T (λ) = P (c) · T = T · P (c)
Extrapolation Formula T (λ) = P (c) · T = T · P (c)
Multistep Extrapolation T (λ) = P (c) · T = T · P (c)
1
x � T (x) y � T (y) rf(x) P (c)(x) xk xk+1 xk+2
1
Properties (Lions and Mercier, 1979, Gabay, 1983, Tseng, 1991):If T is nonexpansive, and H is single-valued and strongly monotone, the F-Balgorithm converges to x∗ if α is sufficiently small
For a minimization problem where H is the gradient of a strongly convex function,FB = the proximal gradient algorithm
Bertsekas (M.I.T.) TD(λ) and the Proximal Algorithm 22 / 30
Extrapolation and Acceleration
Extrapolated forward-backward algorithm
zk = xk − αH(xk ), xk = P(α)(zk ) (Forward-Backward Iteration)
xk+1 = xk +1α
(xk − zk )− H(xk ) (Extrapolation)
λ Slope = −1/α
3 x∗
) y − T (y)
) Forward Step
) y −xk
zk = xk − αH(xk)
xk − T (xk) + H(xk) = 1α (zk − xk) + H(xk)
) Forward-Backward Step
) Forward-Backward Step
) xk = P (α)(zk)
Extrapolated Forward-Backward Step
Extrapolated Forward-Backward Step xk+1 = T (xk) − H(xk)
−H(y)
We havexk+1 = T (xk )− H(xk )
so there is acceleration if T − H is contractiveBertsekas (M.I.T.) TD(λ) and the Proximal Algorithm 23 / 30
Connect with TD: Apply Policy Iteration Ideas to the Proximal Context
Problem: Solve “concave" fixed point problem x = T (x) = minµ Tµ(x)
i th component of Tµ(x) = a(i, µ(i)
)′x + b(i, µ(i)
), µ(i) ∈M(i) (Linear)
) y − T (y)
) y −
+2 Slope = −1
c
xk
Linearized Mapping
Newton iteratexµk
y − Tµk(y)
) xk = P(c)µk (xk)
) T(λ)µk (xk) = Tµk
(xk)
} x∗
The policy iteration algorithm
xµk = Tµk (xµk ), µk+1 ∈ arg minµ
Tµ(xµk ), (“Newton" method)
The λ-policy iteration algorithm
xk+1 = T (λ)µk (xk ), µk+1 ∈ arg min
µTµ(xk+1), (“prox-linear" method)
Bertsekas (M.I.T.) TD(λ) and the Proximal Algorithm 25 / 30
A Fundamental Difficulty
The algorithm chases a moving target
The TD(λ) mapping T (λ)µk “targets" the fixed point of Tµk , but as µk changes so
does the target ...
This is why some policy iteration algorithms may not converge (particularly withcost function approximation) ...
Bertsekas (M.I.T.) TD(λ) and the Proximal Algorithm 26 / 30
Convergence Under Monotonicity Assumptions
Assume the following:For all µ, the matrix Aµ has nonnegative components
The mappings Tµ are all contractions with respect to a common sup-norm (thiscan be relaxed ...)
Then:T is a contraction and its fixed point is x∗ = minµ xµA sequence {xk} generated from an initial condition x0 such that x0 ≥ T (x0) ismonotonically nonincreasing and converges to x∗ (this can be improved ...)
Proof idea:Based on DP/policy iteration arguments
Monotonicity is critical
Once projection is introduced in policy iteration, monotonicity may be lost
Bertsekas (M.I.T.) TD(λ) and the Proximal Algorithm 27 / 30
Convergence of Randomized Version Without Monotonicity
Assume the following:The setM is finite
The mappings Tµ and T are all contractions with respect to a common norm
We use a randomized form of the linearized iteration:
xk+1 = Tµk (xk ), with probability p,
xk+1 = T (λ)µk (xk ), with probability 1− p,
followed by µk+1 ∈ arg minµ Tµ(xk )
Then:For any starting point x0, a sequence {xk} generated by the algorithm converges to thefixed point of T with probability one
Randomization resolves the “moving target" problem
Bertsekas (M.I.T.) TD(λ) and the Proximal Algorithm 28 / 30
Concluding Remarks
Proximal and multistep/TD iterations for fixed point problems are closelyconnected
x , P(c)(x), and T (λ)(x) are colinear and simply related (no line search needed)
TD(λ) mapping is “faster" than proximal
A free lunch: Acceleration of the proximal algorithm. It can be substantial,particularly for small c
Extrapolation formula provides new insight and justification for TD-type methodsI TD(λ) is stochastic version of the proximal algorithmI TD(λ) with subspace approximation is stochastic version of the projected proximal
The ideas extend to the forward-backward algorithm and potentially otheralgorithmic contexts that involve fixed points and proximal operators
The relation between proximal and TD methods extends to classes of nonlinearfixed point problems using linearization/policy iteration ideas
Bertsekas (M.I.T.) TD(λ) and the Proximal Algorithm 29 / 30
Top Related