
PILCO: A Model-Based and Data-Efficient Approach to Policy Search

(M.P. Deisenroth and C.E. Rasmussen)

CSC2541, November 4, 2016


PILCO Graphical Model

PILCO – Probabilistic Inference for Learning COntrol

Latent states {Xt} evolve through time based on previous states and controls

Policy π maps Zt, a noisy observation of Xt, into a control, Ut


PILCO Objective

Transitions follow the dynamical system

xt = f(xt−1, ut−1)

where x ∈ RD, u ∈ RF and f is a latent function.

Let π be parameterized by θ, with ut = π(xt, θ). The objective is to find the π that minimizes the expected cost of following π for T steps, Jπ(θ) = ∑_{t=1}^T E_{xt}[c(xt)]

Cost function encodes information about a target state, e.g., c(x) = 1 − exp(−‖x − xtarget‖²/σc²)
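As a concrete check of the shape of this cost, a minimal sketch (the function name and array handling are my own, not from the paper):

```python
import numpy as np

def saturating_cost(x, x_target, sigma_c):
    """Saturating cost c(x) = 1 - exp(-||x - x_target||^2 / sigma_c^2).

    Roughly quadratic near the target, saturating at 1 far away.
    """
    d2 = np.sum((np.asarray(x, float) - np.asarray(x_target, float)) ** 2)
    return 1.0 - np.exp(-d2 / sigma_c**2)
```

At the target the cost is exactly 0; far from the target it approaches 1, so gradients stay bounded everywhere.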


Algorithm
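The slide image showing the algorithm is not captured in the transcript; the loop it describes (collect data, learn a GP dynamics model, evaluate and improve the policy, apply the policy) can be sketched as follows. Every function body here is a toy stand-in of my own on a 1-D system, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(policy, T=10):
    """Hypothetical stand-in: run the policy on a toy linear system,
    recording (state, control) inputs and state-difference targets."""
    x = np.zeros(1)
    data = []
    for _ in range(T):
        u = policy(x)
        x_next = 0.9 * x + u + 0.01 * rng.standard_normal(1)  # toy dynamics
        data.append((np.concatenate([x, u]), x_next - x))      # targets are deltas
        x = x_next
    return data

def learn_gp_dynamics(data):
    """Stand-in for GP training via evidence maximization; just keeps the data."""
    return data

def improve_policy(model, theta):
    """Stand-in for analytic-gradient policy improvement: a fixed 'step'."""
    return theta - 0.1

def pilco(n_iters=3):
    theta = np.array([0.5])
    policy = lambda x: -theta * x
    data = rollout(policy)                    # initial trial to get data
    for _ in range(n_iters):
        model = learn_gp_dynamics(data)       # 1. learn dynamics model
        theta = improve_policy(model, theta)  # 2-3. evaluate & improve policy
        policy = lambda x: -theta * x
        data += rollout(policy)               # 4. apply policy, record new data
    return theta, data
```

The key structural point survives even in this toy: all interaction data ever collected feeds the model, which is why so few trials are needed.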


Dynamics Model Learning

Multiple plausible function approximators of f


Define a Gaussian process (GP) prior on the latent dynamics function f


Let the prior on f be GP(0, k(x̃, x̃′)), where x̃ ≜ [xᵀ uᵀ]ᵀ and the squared exponential kernel is given by

k(x̃, x̃′) = σf² exp(−½ (x̃ − x̃′)ᵀ Λ⁻¹ (x̃ − x̃′)),

with signal variance σf² and Λ = diag(ℓ1², …, ℓD+F²) holding the squared length-scales.
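A direct transcription of the squared exponential kernel (names are my own; the hyper-parameters σf and ℓi are the quantities learned later by evidence maximization):

```python
import numpy as np

def se_kernel(x1, x2, signal_var=1.0, lengthscales=None):
    """ARD squared exponential kernel:
    k(x, x') = sf^2 * exp(-0.5 * (x - x')^T Lambda^{-1} (x - x'))
    with Lambda = diag(lengthscales^2).
    """
    x1 = np.asarray(x1, float)
    x2 = np.asarray(x2, float)
    ell = np.ones_like(x1) if lengthscales is None else np.asarray(lengthscales, float)
    d = (x1 - x2) / ell
    return signal_var * np.exp(-0.5 * d @ d)
```

Per-dimension length-scales let the model discount state dimensions that barely affect the dynamics.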


Let ∆t = xt − xt−1 + ε, where ε ∼ N(0, Σε) and Σε = diag([σε1, …, σεD]). The GP yields one-step predictions (see Section 2.2 in reference 3)

Given n training inputs X̃ = [x̃1, …, x̃n] and corresponding training targets y = [∆1, …, ∆n], the GP hyper-parameters are learned by evidence maximization (type-2 maximum likelihood).
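The one-step posterior predictive equations from Section 2.2 of reference 3 in a minimal form; the unit-hyper-parameter SE kernel `se` below is assumed for illustration rather than learned:

```python
import numpy as np

def gp_predict(X, y, x_star, kernel, noise_var):
    """GP posterior at x_star (Rasmussen & Williams, Sec. 2.2):
    mean = k_*^T (K + sn^2 I)^{-1} y
    var  = k(x_*, x_*) - k_*^T (K + sn^2 I)^{-1} k_*
    """
    K = np.array([[kernel(a, b) for b in X] for a in X])
    k_star = np.array([kernel(a, x_star) for a in X])
    A = K + noise_var * np.eye(len(X))
    mean = k_star @ np.linalg.solve(A, np.asarray(y, float))
    var = kernel(x_star, x_star) - k_star @ np.linalg.solve(A, k_star)
    return mean, var

# Unit SE kernel, assumed for illustration only.
se = lambda a, b: float(np.exp(-0.5 * np.sum((np.asarray(a, float) - np.asarray(b, float)) ** 2)))
```

Note the predictive variance grows back to the prior variance away from the data, which is exactly the model uncertainty PILCO propagates through planning.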


Policy Evaluation

In evaluating the objective Jπ(θ), we must calculate p(xt), since each term Et = Ext[c(xt)] is an expectation with respect to p(xt).

We have xt = xt−1 + ∆t − ε, where in general, computing p(∆t) is analytically intractable.

Instead, p(∆t) is approximated with a Gaussian via moment matching.


Moment Matching

The input distribution p(xt−1, ut−1) is assumed to be Gaussian

When propagated through the GP model, we obtain p(∆t)

p(∆t) is approximated by a Gaussian via moment matching


p(xt) can now be approximated with N(µt, Σt), where µt = µt−1 + µ∆ and Σt = Σt−1 + Σ∆ + Cov[xt−1, ∆t] + Cov[∆t, xt−1]

µ∆ and Σ∆ are computed exactly via the laws of iterated expectation and variance
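PILCO obtains these moments in closed form for the SE kernel; the Monte Carlo sketch below (my own, purely to illustrate the idea) shows what moment matching does: it replaces a nonlinearly transformed density with the Gaussian sharing its first two moments:

```python
import numpy as np

def moment_match_mc(f, mu, var, n=200_000, seed=0):
    """Moment-match f(x), x ~ N(mu, var), with a Gaussian N(m, s):
    m and s are the mean/variance of the (generally non-Gaussian) output,
    estimated here by sampling instead of PILCO's analytic formulas.
    """
    rng = np.random.default_rng(seed)
    y = f(rng.normal(mu, np.sqrt(var), size=n))
    return y.mean(), y.var()
```

For f(x) = x² with x ∼ N(0, 1), the true output density is a chi-squared, but moment matching keeps only its mean (1) and variance (2).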


Analytic Gradient for Policy Improvement

Let Et = Ext[c(xt)] so that Jπ(θ) = ∑_{t=1}^T Et.

Et depends on θ through p(xt), which depends on θ through p(xt−1), which depends on θ through µt and Σt, …, each of which depends on θ through µu and Σu, the moments of the control distribution, where ut = π(xt, θ).

Chain rule is used to calculate derivatives

Analytic gradients allow for gradient-based non-convex optimization methods, e.g., CG or L-BFGS
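A toy illustration of that last point: scipy's L-BFGS with an objective returning both the value and its analytic gradient. The quadratic is a stand-in for Jπ(θ); nothing here reproduces PILCO's actual gradient computation:

```python
import numpy as np
from scipy.optimize import minimize

TARGET = np.array([1.0, -2.0])

def objective_and_grad(theta):
    """Toy stand-in for J(theta): returns value and analytic gradient."""
    diff = theta - TARGET
    return 0.5 * diff @ diff, diff

result = minimize(objective_and_grad, x0=np.zeros(2),
                  jac=True, method="L-BFGS-B")
```

With `jac=True` the same callable supplies value and gradient, so no finite differencing is needed; `method="CG"` accepts the same interface.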


Data-Efficiency


Advantages and Disadvantages

Advantages

Data-efficient

Incorporates model-uncertainty into long-term planning

Does not rely on expert knowledge, e.g., demonstrations or task-specific prior knowledge.

Disadvantages

Not an optimal control method. If the p(xt) do not cover the target region and σc induces a cost that is very peaked around the target state, PILCO gets stuck in a local optimum because of zero gradients.

Learned dynamics models are only confident in areas of thestate space previously observed.

Does not take temporal correlation into account; model uncertainty is treated as uncorrelated noise


Extension: PILCO with Bayesian Filtering

R. McAllister and C. Rasmussen, “Data-Efficient Reinforcement Learning in Continuous-State POMDPs.” https://arxiv.org/abs/1602.02523


References

1. M.P. Deisenroth and C.E. Rasmussen, “PILCO: A Model-Based and Data-Efficient Approach to Policy Search,” in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011.

2. R. McAllister and C. Rasmussen, “Data-Efficient Reinforcement Learning in Continuous-State POMDPs.” https://arxiv.org/abs/1602.02523

3. C.E. Rasmussen and C.K.I. Williams (2006). Gaussian Processes for Machine Learning. MIT Press. www.gaussianprocess.org/gpml/chapters

4. C.M. Bishop (2006). Pattern Recognition and Machine Learning, Chapter 6.4. Springer. ISBN 0-387-31073-8.
