Classifier-Based Approximate Policy Iteration

Alan Fern

Uniform Policy Rollout AlgorithmRollout[π,h,w](s) 1. For each ai run SimQ(s,ai,π,h) w times 2. Return action with best average of SimQ results

a1 a2ak

q11 q12 … q1w q21 q22 … q2w qk1 qk2 … qkw

… … … … … … … … …

SimQ(s,ai,π,h) trajectories

Each simulates taking action ai then following π for h-1 steps.

Samples of SimQ(s,ai,π,h)

Multi-Stage Rollout

a1 a2ak

… … … … … … … … …

Trajectories of SimQ(s,ai,Rollout[π,h,w],h)

Each step requires khw simulator callsfor Rollout policy

• Two stage: compute rollout policy of “rollout policy of π”

• Requires (khw)2 calls to the simulator for 2 stages• In general exponential in the number of stages

Example: Rollout for Solitaire [Yan et al. NIPS’04]

h Multiple levels of rollout can payoff but is expensive Can we somehow get the benefit of multiple levels

without the complexity?

Player Success Rate Time/Game

Human Expert 36.6% 20 min

(naïve) Base Policy

13.05% 0.021 sec

1 rollout 31.20% 0.67 sec

2 rollout 47.6% 7.13 sec

3 rollout 56.83% 1.5 min

4 rollout 60.51% 18 min

5 rollout 70.20% 1 hour 45 min

Approximate Policy Iteration: Main Ideah Nested rollout is expensive because the

“base policies” (i.e. nested rollouts themselves) are expensive

h Suppose that we could approximate a level-one rollout policy with a very fast function (e.g. O(1) time)

h Then we could approximate a level-two rollout policy while paying only the cost of level-one rollout

h Repeatedly applying this idea leads to approximate policy iteration

Return to Policy Iteration

Current Policy

Choose best action

at each state

Compute Vp

at all states

Improved Policy p’p

Approximate policy iteration: • Only computes values and improved action at some states.

• Uses those to infer a fast, compact policy over all states.

Approximate Policy Iteration

1. Generate trajectories of rollout policy (starting state of each trajectory is drawn from initial state distribution I)

2. “Learn a fast approximation” of rollout policy

3. Loop to step 1 using the learned policy as the base policy

p’Current Policy

Learn fast approximation

of p’

Sample p’ trajectories using rollout

p’ trajectories

What do we mean by generate trajectories?

technically rollout only approximates π’.

Generating Rollout Trajectories

… …s

a1 a2 ak

… … … … … … … …

a1 a2 ak

…… … … … … … … …

Get trajectories of current rollout policy from an initial stateRandom draw from i

run policy rollout run policy rollout

… …

Get trajectories of current rollout policy from an initial state

… …

a1 a2 ak

… … … … … … … …

a1 a2 ak

… … … … … … … …

Multiple trajectoriesdiffer since initial stateand transitions are stochastic

… …

Get trajectories of current rollout policy from an initial state

… …

{(s1,a1), (s2,a2),…,(sn,an)}

Results in a set of state-action pairs giving the action selected by “improved policy” in states that it visits.

1. Generate trajectories of rollout policy (starting state of each trajectory is drawn from initial state distribution I)

2. “Learn a fast approximation” of rollout policy

p’Current Policy

Learn fast approximation

of p’

p’ trajectories

What do we mean by “learn an approximation”?

technically rollout only approximates π’.

Aside: Classifier Learning

LearningAlgorithm

Training Data{(x1,c1),(x2,c2),…,(xn,cn)}

ClassifierH : X C

• A classifier is a function that labels inputs with class labels.• “Learning” classifiers from training data is a well studied problem

(decision trees, support vector machines, neural networks, etc).

Example problem:xi - image of a face

ci {male, female}

Aside: Control Policies are Classifiers

LearningAlgorithm

Training Data{(s1,a1), (s2,a2),…,(sn,an)}

Classifier/Policy p : states actions

A control policy maps states and goals to actions. p : states actions

p’Current Policy

Learn classifier to

approximate p’

p’ training data{(s1,a1), (s2,a2),…,(sn,an)}

1. Generate trajectories of rollout policy Results in training set of state-action pairs along trajectories T = {(s1,a1), (s2,a2),…,(sn,an)}

2. Learn a classifier based on T to approximate rollout policy

p’Current Policy

Learn classifier to

approximate p’

p’ training data{(s1,a1), (s2,a2),…,(sn,an)}

• The hope is that the learned classifier will capture the general structure of improved policy from examples

• Want classifier to quickly select correct actions in states outside of training data (classifier should generalize)

• Approach allows us to leverage large amounts of work in machine learning for classification

API for Inverted Pendulum

Consider the problem of balancing a pole by applying either a positive or negative force to the cart.

The state space is described by the velocity of the cartand angle of the pendulum.

There is noise in the force that is applied, so problem is stochastic.

Experimental Results

A data set from an API iteration. + is positive action, x is negative(ignore the circles in the figure)

Experimental Results

Learned classifier/policy after 2 iterations: (near optimal)blue = positive, red = negative

Support vector machineused as classifier.(take CS534 for details)

Maps any state to + or –

API for Stacking Blocks

Consider the problem of form a goal configuration of blocks/crates/etc. from a starting configuration using basic movements such as pickup, putdown, etc.

Also handle situations where actions fail and blocksfall.

Experimental ResultsExperimental ResultsBlocks World (20 blocks)

0 1 2 3 4 5 6 7 8

Iterations

The resulting policy is fast near optimal. These problems are very hard for more traditional planners.

API in Computer Vision

Summary of APIh Approximate policy iteration is a practical way to select

policies in large state spacesh Relies on ability to learn good, compact approximations of

improved policies (must be efficient to execute)h Relies on the effectiveness of rollout for the problem

h There are only a few positive theoretical results 5 convergence in the limit under strict assumptions5 PAC results for single iteration

h But often works well in practice

Classifier-Based Approximate Policy Iteration

Documents

Transcript of Classifier-Based Approximate Policy Iteration

Understanding and Visualizing Data Iteration in Machine ... · Understanding and Visualizing Data Iteration in Machine Learning Fred Hohmang, Kanit Wongsuphasawata, Mary Beth Keryq,

Tolerance levels for the Approximate Bayesian … levels for the Approximate Bayesian Computation ... composed with a covariance ... Tolerance levels for the Approximate Bayesian Computation

Approximate Inference in Graphical Models using LP Relaxationscnls.lanl.gov/~jasonj/poa/slides/sontag.pdf · Approximate Inference in Graphical Models using LP Relaxations David Sontag

Classifier Evaluation Learning Curve - Javeriana Calicic.puj.edu.co/wiki/...media=grupos:destino:1_3_evaluation_handouts… · 5/18/2008 Evaluation 1 Classifier Evaluation ... A2

Overview of Approximate Bayesian Computation · 2018-02-28 · Overview of Approximate Bayesian Computation S. A. Sisson Y. Fan and M. A. Beaumonty February 28, 2018 1 Introduction

A Timed Presentation - do not click the mouse. Approximate Run Time - 4 minutes.

Approximate degree, secret sharing, and concentration ... · Approximate degree, secret sharing, and concentration phenomena Andrej Bogdanov1, Nikhil S. Mande2, Justin Thaler2, and

1+eps-Approximate Sparse Recovery

Resampling Methods: The Bootstrap4.4.1 Bootstrap CIs Assuming Approximate Normality An approximate 100(1 )% con dence interval for is b tse B(b ) or b zse B(b ) (8) where t is the

Saharon Shelah- Historic Iteration with ℵε-Support

Parametric Density Estimation: Bayesian Estimation. Naïve ...rita/uml_course/add_mat/PDE_Bayesian_NB.pdf · Parametric Density Estimation: Bayesian Estimation. ... Learn Bayes classifier

Lecture IV Value Function Iteration with Discretization · 2015-09-17 · Lecture IV Value Function Iteration with Discretization Gianluca Violante New York University Quantitative

Recent Applications of Approximate ... - Statistics Researchstats.research.att.com › nycseminars › slides › rush.pdf · Recent Applications of Approximate Message Passing Algorithms

Math 10C - Unit 4 Workbook 1: Graphing Relations Approximate Completion Time: 2 Days Lesson 2: Domain and Range Approximate Completion Time: 1 Day Lesson 3: Functions Approximate Completion

`UQ' perspectives on ABC (approximate Bayesian computation) · Approximate Rejection Algorithm With Summaries Draw from ˇ( ) Simulate X ˘f( ) Accept if ˆ(S(D);S(X)) < If S is su

Note 10 (Some notesabout the meanand varianceof Rswcheng/Teaching/stat3875/lecture/01_Survey... · Ch7, p.49 Theorem 17 (approximate varianceof R) Note 12 (Some notesabout the approximate

Approximate inference for vector parameters

Eigenproblems II: Computation · Power Iteration Other Eigenvalues Multiple Eigenvalues QR Iteration Eigenproblems II: Computation CS 205A: Mathematical Methods for Robotics, Vision,

Νεςπο Ασαυήρ Υπολογιστική · Competitive learning •We can now design a competitive network classifier by setting the rows of to the desired prototype vectors

Approximate Linear Programming for MDPs