15-780: Grad AI Lecture 22: MDPs

Geoff Gordon (this lecture), Tuomas Sandholm

TAs Erik Zawadzki, Abe Othman

Transcript of 15-780: Grad AI Lecture 22: MDPs

Page 2: Review: Bayesian learning

Bayesian learning: P(θ | D) = P(D | θ) P(θ) / Z

‣ P(θ): prior over parametric model class

‣ P(D | θ): likelihood

‣ or, P(θ | X, Y) = P(Y | θ, X) P(θ) / Z as long as X ⊥ θ

Predictive distribution

Page 3: Review: Bayesian learning

Exact Bayes w/ conjugate prior, or numerical integration—this example: logistic regression

Or, MLE/MAP

[figure: logistic regression example]
Page 4: Review: MDPs

Sequential decision problem under uncertainty

States, actions, costs, transitions, discounting

Policy, execution trace

State-value (J) and action-value (Q) function

‣ (1–γ) × immediate cost + γ × future cost

Page 5: Review: MDPs

Tree search

Receding horizon tree search w/ heuristic

Dynamic programming (value iteration)

Pruning (once we realize a branch is bad, or subsampling scenarios)

Curse of dimensionality
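As a refresher, here is a minimal value-iteration sketch in Python using the lecture's normalized backup, Q = (1–γ) × immediate cost + γ × expected future cost; the 3-state model (C, T) is invented purely for illustration.

    import numpy as np

    gamma = 0.9

    # Invented model: C[s, a] = immediate cost, T[s, a, s'] = transition probability
    C = np.array([[1.0, 2.0],
                  [0.5, 1.0],
                  [0.0, 0.0]])
    T = np.array([[[0.8, 0.2, 0.0], [0.1, 0.7, 0.2]],
                  [[0.0, 0.9, 0.1], [0.0, 0.2, 0.8]],
                  [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])

    J = np.zeros(C.shape[0])
    for _ in range(1000):
        # Q(s, a) = (1 - gamma) C(s, a) + gamma E[J(s') | s' ~ T(. | s, a)]
        Q = (1 - gamma) * C + gamma * (T @ J)
        J_new = Q.min(axis=1)                 # costs, so the backup takes a min over actions
        if np.max(np.abs(J_new - J)) < 1e-10:
            break
        J = J_new

    print(J, Q.argmin(axis=1))                # value function and greedy policy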

Page 6: Alternate algorithms for “small” systems—policy evaluation

Linear equations: so, Gaussian elimination, biconjugate gradient, Gauss-Seidel iteration, …

‣ VI is essentially the Jacobi iterative method for matrix inversion

Stochastic-gradient-descent-like

‣ TD(λ), Q-learning

Qπ(s, a) = (1 − γ) C(s, a) + γ E[Jπ(s′) | s′ ∼ T(· | s, a)]

Jπ(s) = E[Qπ(s, a) | a ∼ π(· | s)]
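Since the two equations above are linear in Jπ once π is fixed, a "small" system can be evaluated with a single linear solve; a minimal sketch (the model arrays follow the invented 3-state example used earlier, and np.linalg.solve stands in for Gaussian elimination):

    import numpy as np

    def evaluate_policy(C, T, policy, gamma):
        """Solve J = (1 - gamma) c_pi + gamma T_pi J exactly for a deterministic policy."""
        n = C.shape[0]
        c_pi = C[np.arange(n), policy]        # cost of the chosen action in each state
        T_pi = T[np.arange(n), policy]        # n x n transition matrix under the policy
        # (I - gamma T_pi) J = (1 - gamma) c_pi
        return np.linalg.solve(np.eye(n) - gamma * T_pi, (1 - gamma) * c_pi)

    C = np.array([[1.0, 2.0], [0.5, 1.0], [0.0, 0.0]])
    T = np.array([[[0.8, 0.2, 0.0], [0.1, 0.7, 0.2]],
                  [[0.0, 0.9, 0.1], [0.0, 0.2, 0.8]],
                  [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
    print(evaluate_policy(C, T, policy=np.array([0, 1, 0]), gamma=0.9))

Iterating the fixed-policy backup instead of solving exactly gives the Jacobi-style iteration mentioned above.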

Page 7: Alternate algorithms for “small” systems—policy optimization

Policy iteration: alternately

‣ use any above method to evaluate current π

‣ replace π with greedy policy: at each state s, π(s) := arg mina Q(s, a)

Actor-critic: like policy iteration, but interleave solving for Jπ and updating π

‣ e.g., run biconjugate gradient for a few steps

‣ warm start: each Jπ probably similar to next

SARSA = AC w/ TD(λ) critic, ϵ-greedy policy
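A minimal policy-iteration sketch along these lines (exact evaluation by a linear solve, then greedy improvement; the improvement is an arg min because the objective is cost):

    import numpy as np

    def policy_iteration(C, T, gamma, iters=100):
        n, m = C.shape
        policy = np.zeros(n, dtype=int)
        for _ in range(iters):
            # evaluate the current policy (any policy-evaluation method would do here)
            c_pi = C[np.arange(n), policy]
            T_pi = T[np.arange(n), policy]
            J = np.linalg.solve(np.eye(n) - gamma * T_pi, (1 - gamma) * c_pi)
            # greedy improvement: pi(s) := arg min_a Q(s, a)
            Q = (1 - gamma) * C + gamma * (T @ J)
            new_policy = Q.argmin(axis=1)
            if np.array_equal(new_policy, policy):
                break
            policy = new_policy
        return policy, J

    # usage: policy, J = policy_iteration(C, T, gamma=0.9), with C (n x m), T (n x m x n)

Actor-critic would simply truncate the evaluation step (a few iterations of an iterative solver, warm-started at the previous J) rather than solving exactly.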

Page 8: Alternate algorithms for “small” systems—policy optimization

(Stochastic) policy gradient

‣ pick a parameterized policy class πθ(a | s)

‣ compute or estimate g = ∇θ Jπθ(s1)

‣ θ ← θ – ηg, repeat

More detail:

‣ can estimate g quickly by simulating a few trajectories

‣ can also use natural gradient to get faster convergence

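A rough sketch of estimating g from a few simulated trajectories, using a likelihood-ratio (REINFORCE-style) estimator for a softmax policy; the tiny simulator, the step size, and the trajectory counts are all invented, and no natural-gradient correction is applied.

    import numpy as np

    rng = np.random.default_rng(0)
    gamma, n_states, n_actions = 0.9, 3, 2

    # Invented model, used here only as a simulator
    C = np.array([[1.0, 2.0], [0.5, 1.0], [0.0, 0.0]])
    T = np.array([[[0.8, 0.2, 0.0], [0.1, 0.7, 0.2]],
                  [[0.0, 0.9, 0.1], [0.0, 0.2, 0.8]],
                  [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])

    def pi(theta, s):                          # softmax policy pi_theta(a | s)
        p = np.exp(theta[s] - theta[s].max())
        return p / p.sum()

    def estimate_gradient(theta, n_traj=100, horizon=30, s1=0):
        g = np.zeros_like(theta)
        for _ in range(n_traj):
            s, ret, score = s1, 0.0, np.zeros_like(theta)
            for t in range(horizon):
                p = pi(theta, s)
                a = rng.choice(n_actions, p=p)
                score[s] -= p                  # grad of log pi_theta(a | s) is
                score[s, a] += 1.0             #   e_a - pi_theta(. | s) for a softmax policy
                ret += (1 - gamma) * gamma ** t * C[s, a]
                s = rng.choice(n_states, p=T[s, a])
            g += score * ret                   # likelihood-ratio estimate of grad J(s1)
        return g / n_traj

    theta = np.zeros((n_states, n_actions))
    for _ in range(20):
        theta -= 0.5 * estimate_gradient(theta)    # theta <- theta - eta g
    print(pi(theta, 0))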

Page 9: Alternate algorithms for “small” systems—policy optimization

Linear programming

‣ analogy: use an LP to compute min(3, 6, 5)

‣ note min v. max

max J s.t. J ≤ 3, J ≤ 6, J ≤ 5

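The analogy in code, as a sketch (assuming SciPy is available; linprog minimizes, so we maximize J by minimizing −J):

    from scipy.optimize import linprog

    # maximize J  subject to  J <= 3, J <= 6, J <= 5
    res = linprog(c=[-1.0],
                  A_ub=[[1.0], [1.0], [1.0]],
                  b_ub=[3.0, 6.0, 5.0],
                  bounds=[(None, None)])
    print(res.x[0])    # 3.0, i.e. min(3, 6, 5)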

Page 10: Linear programming

Variables J(s) and Q(s,a) for all s, a

Note: dual of this LP is interesting

‣ generalizes single-source shortest paths

max J(s1) s.t.

Q(s, a) = (1 − γ) C(s, a) + γ E[J(s′) | s′ ∼ T(· | s, a)]

J(s) ≤ Q(s, a) ∀s, a
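A sketch of the same LP on a small invented MDP (SciPy assumed; the Q equality is substituted into the inequality so that only the J(s) remain as LP variables):

    import numpy as np
    from scipy.optimize import linprog

    gamma, s1 = 0.9, 0
    C = np.array([[1.0, 2.0], [0.5, 1.0], [0.0, 0.0]])
    T = np.array([[[0.8, 0.2, 0.0], [0.1, 0.7, 0.2]],
                  [[0.0, 0.9, 0.1], [0.0, 0.2, 0.8]],
                  [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
    n, m = C.shape

    # J(s) - gamma sum_{s'} T(s' | s, a) J(s')  <=  (1 - gamma) C(s, a)   for all s, a
    A_ub, b_ub = [], []
    for s in range(n):
        for a in range(m):
            row = -gamma * T[s, a]
            row[s] += 1.0
            A_ub.append(row)
            b_ub.append((1 - gamma) * C[s, a])

    # maximize J(s1): linprog minimizes, so the objective is -J(s1)
    obj = np.zeros(n)
    obj[s1] = -1.0
    res = linprog(obj, A_ub=np.array(A_ub), b_ub=b_ub, bounds=[(None, None)] * n)
    print(res.x[s1])    # optimal value at s1; other components of res.x may be loose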

Page 11: Model requirements

What do we have to know about the MDP in order to plan?

‣ full model

‣ simulation model

‣ no model: only the real world

Page 12: Model requirements

VI and LP require full model

PI and actor-critic inherit requirements of policy-evaluation subroutine

TD(λ), SARSA, policy gradient: OK with simulation model or no model

‣ horribly data-inefficient if used directly on real world with no model—don’t do this!

‣ note: model can be just { all of my data }

Page 13: A word on performance measurements

Multiple criteria we might care about:

‣ data (from real world)

‣ runtime

‣ calls to model (under some API)

Measure convergence rate of:

‣ J(s) or Q(s, a)

‣ π(s)

‣ actual (expected total discounted) cost

Page 14: Building a model

How to handle the lack of a model without horrible data inefficiency? Build one!

‣ hard inference problem; getting it wrong is bad

‣ this is why { all of my data } is a popular model

What do we do with posterior over models?

‣ just use MAP model (“certainty equivalent”)

‣ compute posterior over π*: slow, still wrong

‣ even slower: maxπ E[Jπ(s) | data, model class]

‣ except policy gradient (Ng’s helicopter)
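A minimal certainty-equivalent sketch (the transition counts and summed costs below stand in for "all of my data" and are invented; the add-one smoothing is an assumed choice of prior, not the lecture's prescription):

    import numpy as np

    gamma = 0.9

    # counts[s, a, s']: times (s, a) was observed to lead to s'; cost_sum[s, a]: total observed cost
    counts = np.array([[[8, 2, 0], [1, 7, 2]],
                       [[0, 9, 1], [0, 2, 8]],
                       [[0, 0, 10], [0, 0, 10]]], dtype=float)
    cost_sum = np.array([[10.0, 20.0], [5.0, 10.0], [0.0, 0.0]])

    # Point estimate of the model (Dirichlet-style smoothing of the transition counts)
    T_hat = (counts + 1.0) / (counts + 1.0).sum(axis=2, keepdims=True)
    C_hat = cost_sum / np.maximum(counts.sum(axis=2), 1.0)

    # Certainty equivalent: plan in the estimated model as if it were the truth
    J = np.zeros(counts.shape[0])
    for _ in range(1000):
        Q = (1 - gamma) * C_hat + gamma * (T_hat @ J)
        J = Q.min(axis=1)
    print(Q.argmin(axis=1), J)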

Page 15: Algorithms for large systems

[figure: panels labeled φi(s) and J(s), plotted against state]

Policy gradient: no change

Any value-based method: can’t even write down J(s) or Q(s,a)

So,

J(s) = Σi wi φi(s)

Q(s, a) = Σi wi φi(s, a)
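Concretely, the approximation is just a dot product between a feature vector and a learned weight vector; a small sketch with radial-basis features over a 1-D state space (the feature choice is an assumption for illustration):

    import numpy as np

    n_states, k = 100, 10
    states = np.arange(n_states)

    # phi_i(s): k radial-basis bumps spread across the state space
    centers = np.linspace(0, n_states - 1, k)
    Phi = np.exp(-0.5 * ((states[:, None] - centers[None, :]) / 10.0) ** 2)   # (n_states, k)

    w = np.zeros(k)          # weights, to be fit by TD(lambda), LSTD, etc.
    J_approx = Phi @ w       # J(s) = sum_i w_i phi_i(s), evaluated for every state at once
    print(Phi.shape, J_approx.shape)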

Page 16: Algorithms for large systems

Evaluation: TD(λ), LSTD

Optimization:

‣ policy iteration or actor-critic

‣ e.g., LSTD → LSPI

‣ approximate LP

‣ value iteration: only special cases, e.g., finite-element grid

Page 17: Least-squares temporal differences (LSTD)

Data: τ = (s1, a1, c1, s2, a2, c2, …) ~ π

Want Q(st, at) ≈ (1–γ)ct + γQ(st+1, at+1)

‣ wᵀΦ(st, at) ≈ (1–γ)ct + γwᵀΦ(st+1, at+1)

‣ Φ = vector of k features, w = weight vector

Qπ(s, a) = (1 − γ) C(s, a) + γ E[Jπ(s′) | s′ ∼ T(· | s, a)]

Jπ(s) = E[Qπ(s, a) | a ∼ π(· | s)]

Page 18: LSTD

wᵀΦ(st, at) ≈ (1–γ)ct + γwᵀΦ(st+1, at+1)

Vector notation:

‣ Fw ≈ (1–γ)c + γF1w (rows of F are Φ(st, at)ᵀ, rows of F1 are Φ(st+1, at+1)ᵀ, c is the vector of costs ct)

Overconstrained: multiply both sides by Fᵀ

‣ FᵀFw = (1–γ)Fᵀc + γFᵀF1w
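In code, the normal equations above become one k × k linear solve; a sketch (the data layout is an assumption, and regularization is omitted):

    import numpy as np

    def lstd(F, F1, c, gamma):
        """Solve F'(F - gamma F1) w = (1 - gamma) F' c for the LSTD weight vector w.

        F  : (T, k) array, row t is Phi(s_t, a_t)
        F1 : (T, k) array, row t is Phi(s_{t+1}, a_{t+1})
        c  : (T,)   array of observed costs c_t
        """
        A = F.T @ (F - gamma * F1)
        b = (1 - gamma) * (F.T @ c)
        return np.linalg.solve(A, b)

    # usage sketch: w = lstd(F, F1, c, gamma=0.99), then Q(s, a) is approximated by Phi(s, a) @ w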

Page 19: LSTD: example

[figure: panels labeled φi(s) and J(s), plotted against state]

100 states in a line; move left or right at cost 1 per state; goals at both ends; discount 0.99

optimal policy
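An end-to-end sketch of this example (the deterministic chain dynamics are reconstructed from the description above; the one-hot features, episode lengths, and the small ridge term are assumptions):

    import numpy as np

    n, gamma = 100, 0.99

    def step(s, a):
        """Move left (a = 0) or right (a = 1) at cost 1; both ends are absorbing, cost-0 goals."""
        if s in (0, n - 1):
            return s, 0.0
        return (s - 1 if a == 0 else s + 1), 1.0

    def policy(s):                     # optimal policy: head for the nearer goal
        return 0 if s < n // 2 else 1

    def phi(s, a):                     # one-hot (tabular) features over state-action pairs
        f = np.zeros(2 * n)
        f[2 * s + a] = 1.0
        return f

    # One 120-step episode from every interior start state, following the policy
    F, F1, c = [], [], []
    for s0 in range(1, n - 1):
        s = s0
        for _ in range(120):
            a = policy(s)
            s2, cost = step(s, a)
            F.append(phi(s, a)); F1.append(phi(s2, policy(s2))); c.append(cost)
            s = s2
    F, F1, c = np.array(F), np.array(F1), np.array(c)

    # LSTD normal equations from the previous page, with a tiny ridge for unvisited pairs
    A = F.T @ (F - gamma * F1) + 1e-6 * np.eye(2 * n)
    w = np.linalg.solve(A, (1 - gamma) * (F.T @ c))
    J_est = np.array([w @ phi(s, policy(s)) for s in range(n)])
    print(J_est[0], J_est[n // 2], J_est[-1])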

Page 20: LSTD: example

[figure: panels labeled φi(s) and J(s), plotted against state]

Compare: suboptimal policy

True J(x) has discontinuity at x=10

Pages 21–27: LSPI

[figures: seven successive LSPI iterations, each showing panels labeled φi(s) and J(s) plotted against state]
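For completeness, a rough LSPI loop matching these slides: LSTD on Q-features to evaluate the current greedy policy, then greedy (arg-min-cost) improvement; the data layout, the feature-map argument, and the ridge term are the same kinds of assumptions as in the LSTD sketch above.

    import numpy as np

    def lspi(F, S2, c, phi, n_actions, gamma, iters=20, ridge=1e-6):
        """LSPI on a fixed batch of transitions.

        F   : (T, k) array, row t is phi(s_t, a_t) for the visited state-action pair
        S2  : length-T sequence of successor states s_{t+1}
        c   : (T,)   array of observed costs
        phi : callable phi(s, a) returning a length-k feature vector
        """
        k = F.shape[1]
        w = np.zeros(k)
        for _ in range(iters):
            # greedy action of the current Q estimate at every successor state
            greedy = [min(range(n_actions), key=lambda a: w @ phi(s2, a)) for s2 in S2]
            F1 = np.array([phi(s2, a) for s2, a in zip(S2, greedy)])
            # LSTD-Q evaluation of that greedy policy
            A = F.T @ (F - gamma * F1) + ridge * np.eye(k)
            w = np.linalg.solve(A, (1 - gamma) * (F.T @ c))
        return w

    # usage sketch: w = lspi(F, S2, c, phi, n_actions=2, gamma=0.99);
    # the resulting policy is pi(s) = argmin over a of phi(s, a) @ w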