Approximate Linear Programming for MDPs


Transcript of Approximate Linear Programming for MDPs

Approximate Linear Programming for MDPs (ALP)
CompSci 590.2, Ron Parr, Duke University
Linear Programming MDP solution
Issue: Turn the non-linear max into a collection of linear constraints
$V(s) = \max_a \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s') \Big]$

$\forall s, a: \quad V(s) \ge R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s')$

Minimize: $\sum_s V(s)$
Weakly polynomial; slower than PI in practice (though can be modified to behave like PI)
Optimal action has tight constraints
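As a concrete illustration (not from the lecture), here is a minimal Python sketch that solves this LP for a small synthetic tabular MDP with scipy.optimize.linprog; P, R, and gamma below are made-up placeholders.

    import numpy as np
    from scipy.optimize import linprog

    n_states, n_actions, gamma = 3, 2, 0.9
    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
    R = rng.random((n_states, n_actions))                             # R[s, a]

    # One constraint per (s, a): V(s) >= R(s,a) + gamma * sum_s' P(s'|s,a) V(s'),
    # rewritten in linprog's A_ub @ V <= b_ub form as (gamma*P[s,a] - e_s) @ V <= -R(s,a).
    A_ub, b_ub = [], []
    for s in range(n_states):
        for a in range(n_actions):
            row = gamma * P[s, a]
            row[s] -= 1.0
            A_ub.append(row)
            b_ub.append(-R[s, a])

    # Minimize sum_s V(s); V may be negative, so lift linprog's default lower bound of 0.
    res = linprog(c=np.ones(n_states), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n_states)
    V_star = res.x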
Linear programming with samples
• Suppose we don't have the model, but have samples
• For a sample (s, a, r, s'), the constraint looks like: $V(s) \ge r + \gamma V(s')$
• What goes wrong?
Problem: Noise
• Suppose s goes to s1 w.p. 0.5 and to s2 w.p. 0.5, and
• V(s1) = 100, V(s2) = 0 → V(s) = 50γ
• Samples: (s, a, r, s1), (s, a, r, s2)
• Constraints: $V(s) \ge r + \gamma V(s_1)$ and $V(s) \ge r + \gamma V(s_2)$
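To make the failure concrete, a small worked example (assuming r = 0 and γ = 0.9, values not taken from the slides): the sampled constraints become V(s) ≥ 0.9·100 = 90 and V(s) ≥ 0.9·0 = 0. Minimizing V(s) subject to both yields V(s) = 90, whereas the correct value is 0.5·90 + 0.5·0 = 45; the sampled constraints push V(s) up to the best observed outcome rather than the average.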
Noise solution
• There is no (ideal) noise solution!
• Problem never really goes away
• LP methods most effective in low noise scenarios
• Can do local averaging by explicitly adding an average over “nearby” states to the LP constraints
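One way to read the local-averaging idea as code: rather than adding one constraint per noisy sample, average the samples taken at "nearby" states into a single constraint. The helper below is only a sketch; the sample format and the choice of neighbors are assumptions, not the lecture's construction.

    import numpy as np

    def averaged_constraint(samples, s_index, neighbor_ids, gamma, n_states):
        # Builds one A_ub row and one b_ub entry encoding
        #   V(s) >= mean_i [ r_i + gamma * V(s'_i) ]   over the chosen neighbor samples,
        # where samples[i] = (s, a, r, s') as state/action indices.
        row = np.zeros(n_states)
        row[s_index] -= 1.0
        rhs = 0.0
        k = len(neighbor_ids)
        for i in neighbor_ids:
            _, _, r, s_next = samples[i]
            row[s_next] += gamma / k
            rhs -= r / k
        return row, rhs  # use as: row @ V <= rhs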
Approximate Linear Program (ALP) with model
Minimize: $\sum_s \sum_k w_k \phi_k(s)$

$\forall s, a: \quad \sum_k w_k \phi_k(s) \ge R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \sum_k w_k \phi_k(s')$
Notes:
• No sampling yet
• Same number of constraints, just k variables
• Assumes we have access to a model
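A matching sketch for this ALP, again using scipy.optimize.linprog: the same synthetic setup as before, but the variables are the k feature weights w with V(s) ≈ Σ_k w_k φ_k(s). Phi is a random feature matrix with a constant column added so the LP is feasible; none of these values come from the slides.

    import numpy as np
    from scipy.optimize import linprog

    n_states, n_actions, k, gamma = 20, 3, 4, 0.95
    rng = np.random.default_rng(1)
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
    R = rng.random((n_states, n_actions))
    Phi = rng.random((n_states, k))                                   # features phi_k(s)
    Phi[:, 0] = 1.0                       # constant feature keeps the LP feasible

    # Constraint per (s, a):  Phi[s] @ w >= R[s,a] + gamma * (P[s,a] @ Phi) @ w
    A_ub = np.array([gamma * (P[s, a] @ Phi) - Phi[s]
                     for s in range(n_states) for a in range(n_actions)])
    b_ub = np.array([-R[s, a] for s in range(n_states) for a in range(n_actions)])

    # Objective: minimize sum_s Phi[s] @ w, i.e. the cost vector is the column sums of Phi.
    res = linprog(c=Phi.sum(axis=0), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * k)
    w = res.x
    V_approx = Phi @ w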
• Normally, we minimize: $\sum_s V(s)$
• We could do: $\sum_s c(s)\, V(s)$
• For c = some probability distribution
• $c_i$ = importance/relevance of state $i$
• $c_i$ doesn't matter for the exact case, but
• Can change the answer in the approximate case

Bound (de Farias and Van Roy): $\| V^* - \Phi \tilde{w} \|_{1,c} \;\le\; \frac{2}{1-\gamma} \min_w \| V^* - \Phi w \|_\infty$
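Relative to the ALP sketch above, only the cost vector changes: with a state-relevance distribution c, the objective becomes c·Φw. The snippet reuses Phi, A_ub, b_ub, n_states, and k from that sketch; the uniform c here is just an illustration.

    c_dist = np.full(n_states, 1.0 / n_states)   # c = some probability distribution over states
    res_weighted = linprog(c=c_dist @ Phi, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * k)
    w_weighted = res_weighted.x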
Improving the bound
• We can improve the bound if we pick weights c according to the stationary distribution of the optimal policy
• But how do we do that?
• Just run the optimal policy and then…
• Wait a sec…
• In practice, iterative weighting schemes may help (see the sketch below):
  • Start with arbitrary weights
  • Generate policy by solving ALP
  • Reweight based upon resulting policy
  • Repeat
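A rough sketch of that loop. solve_weighted_alp, greedy_policy, and estimate_state_distribution are hypothetical helpers standing in for the weighted ALP solve, greedy action selection, and an empirical estimate of the resulting policy's state distribution; they are not functions from the lecture.

    import numpy as np

    def iterative_alp(Phi, P, R, gamma, n_iters=10):
        # Hypothetical helpers: solve_weighted_alp, greedy_policy, estimate_state_distribution.
        n_states = Phi.shape[0]
        c = np.full(n_states, 1.0 / n_states)            # start with arbitrary weights
        w = None
        for _ in range(n_iters):
            w = solve_weighted_alp(Phi, P, R, gamma, c)  # generate policy by solving ALP
            policy = greedy_policy(Phi @ w, P, R, gamma)
            c = estimate_state_distribution(policy, P)   # reweight from the resulting policy
        return w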
Problem: Missing constraints
• We're doing approximation because n is large or infinite
• Can't write down the entire LP!
• General (for any LP) constraint sampling approach (sketched below):
  • Observe that for an n×k system, the optimal solution will have k tight constraints
  • Most constraints are loose/unnecessary
  • Sample some set of constraints
  • Repeat until no constraints are violated:
    • Solve LP
    • Find violated constraints, then add them back to the LP
• Works well if you have an efficient way to find most violated constraints
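A generic sketch of that loop, not specific to MDPs; solve_lp and find_violated are hypothetical callables that the problem at hand would supply.

    def constraint_generation(solve_lp, find_violated, initial_constraints):
        # solve_lp(constraints) -> solution x; find_violated(x) -> constraints x violates.
        active = list(initial_constraints)
        while True:
            x = solve_lp(active)            # solve the LP with the current constraint subset
            violated = find_violated(x)     # check the (possibly huge) full constraint set
            if not violated:
                return x                    # feasible for the full LP, hence optimal
            active.extend(violated)         # add the violated constraints back and repeat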
Missing constraints in ALP
• Often no obviously efficient way to find the most violated constraint
• Idea:
  • Sample constraints (s, a, r, s') by an initial policy
  • Repeat until ?
    • Solve LP, produce policy
    • Execute policy to sample new constraints (s, a, r, s')
• Challenge: Missing constraints can produce unbounded LP solutions
What if we’re missing constraints for some state?
[Figure: a four-state chain (1 → 2 → 3 → 4) illustrating a state with a missing constraint; the constraint shown at s4 compares V(s4) with R(s4) + γV(s4). Thanks to Gavin Taylor]
Avoiding constraint sampling problems
• If we can sample constraints from the stationary distribution of the optimal policy (de Farias and Van Roy), then we should be OK
• Wait a sec…
Making additional assumptions
• Regularized ALP (RALP)
  • Assumes that features are Lipschitz continuous (Petrik et al.)
  • Adds a bound on the 1-norm of the weights as a constraint (see the sketch below)
• Lipschitz constant for function f, with distance metric d
$| f(x) - f(y) | \le K_f \, d(x, y)$
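A minimal sketch of the RALP modification on top of the ALP above: the same Bellman constraints plus ||w||_1 ≤ ψ. linprog has no direct 1-norm constraint, so the standard trick of splitting w into nonnegative parts is used; psi is an assumed regularization parameter, not a value from the slides.

    import numpy as np
    from scipy.optimize import linprog

    def solve_ralp(Phi, A_bellman, b_bellman, psi):
        # A_bellman @ w <= b_bellman are the (sampled or exact) Bellman constraints,
        # built exactly as in the earlier ALP sketch. psi must be large enough that
        # some w within the L1 budget satisfies them, otherwise the LP is infeasible.
        n_states, k = Phi.shape
        # Split w = w_plus - w_minus with w_plus, w_minus >= 0, so that
        # sum(w_plus) + sum(w_minus) <= psi enforces ||w||_1 <= psi.
        c = np.concatenate([Phi.sum(axis=0), -Phi.sum(axis=0)])
        A_ub = np.vstack([np.hstack([A_bellman, -A_bellman]),
                          np.ones((1, 2 * k))])
        b_ub = np.concatenate([b_bellman, [psi]])
        res = linprog(c=c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * k))
        return res.x[:k] - res.x[k:]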
RALP error bound: the approximation error scales as $\frac{1}{1-\gamma}\big[\,\epsilon + \epsilon_p\,\big]$ (up to constants), where $\epsilon$ is the best max-norm error achievable with the features and $\epsilon_p$ accounts for sampled/missing constraints.
$\epsilon_p$ bounded as a function of:
• constraint density and
• Lipschitz constants of features
• L1 regularization tends to produce sparser solutions
Bicycle Domain
• Ride a bicycle to a destination
• S:
• A: Shift weight, turn handlebars
• R: Based on distance to goal and angle from upright
• Φ: Polynomials and cross terms of state dimensions (160 candidate features)
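For concreteness, a sketch of how polynomial and cross-term features can be generated from a continuous state vector; the exact 160-feature set used in the experiments is not reproduced here, and the degree below is only an illustration.

    import numpy as np
    from itertools import combinations_with_replacement

    def poly_features(state, degree=2):
        # All monomials of the state dimensions up to `degree`, including cross
        # terms, plus a constant feature.
        feats = [1.0]
        for d in range(1, degree + 1):
            for idx in combinations_with_replacement(range(len(state)), d):
                feats.append(float(np.prod([state[i] for i in idx])))
        return np.array(feats)

    # Example: a 4-dimensional state and degree 2 give 1 + 4 + 10 = 15 features.
    print(poly_features(np.array([0.1, -0.2, 0.3, 0.05])).shape)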

[Randløv and Alstrøm, 98] [http://english.people.com.cn]
[Figure: bicycle domain results; vertical axis 0–100, horizontal axis 0–5000]
• 100 features total
• No model used (incorrect deterministic action assumption)
• RALP had |A| constraints/sample
Problem: Picking actions
• How do we use the ALP solution to produce a policy?
• We get an approximate value function, not Q-functions
• Must use the model to pick actions: $\pi(s) = \arg\max_a \big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, \tilde{V}(s') \big]$
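A minimal sketch of that one-step lookahead, assuming a tabular model P[s, a, s'], rewards R[s, a], and the approximate values V_approx from an earlier sketch.

    import numpy as np

    def greedy_action(s, P, R, gamma, V_approx):
        # argmax_a [ R(s,a) + gamma * sum_s' P(s'|s,a) * V_approx(s') ]
        q = R[s] + gamma * P[s] @ V_approx   # one entry per action
        return int(np.argmax(q))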
• Continuous action space
• Continuous state space
• Lipschitz continuous value function
• Can we exploit this directly?
• Non-parametric approximate dynamic programming (NP-ALP)
  • Produces a piecewise-linear value function
  • Each segment has an associated optimal action
  • Actions are selected from previously tried actions, so greater sampling of the state-action space leads to finer-grained decisions
LP for NP-ALP
Minimize: $\sum_s V(s)$

$\forall s, a: \quad V(s) \ge R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s')$

$\forall s, t: \quad V(s) \le V(t) + L_V\, d(s, t)$
• Use with sampled transitions, k-nn averaging
• Size of LP is quadratic in the number of (sampled) states, but
  • Very sparse
  • Very amenable to constraint generation
• Solution is sparse (unless $L_V$ is large)
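A rough sketch of the NP-ALP LP built from sampled data: one variable per sampled state, a Bellman constraint per sampled transition, and a Lipschitz constraint for every ordered pair of sampled states. The sample format, the distance function dist, and the constant L_V are assumptions for illustration, not the lecture's exact formulation.

    import numpy as np
    from scipy.optimize import linprog

    def solve_np_alp(states, transitions, gamma, L_V, dist):
        # states: list of sampled states; transitions: list of (i, r, j) meaning
        # a sampled step from states[i] with reward r ending in states[j].
        n = len(states)
        A_ub, b_ub = [], []
        for i, r, j in transitions:                     # Bellman: V_i >= r + gamma * V_j
            row = np.zeros(n)
            row[i] -= 1.0
            row[j] += gamma
            A_ub.append(row)
            b_ub.append(-r)
        for i in range(n):                              # Lipschitz: V_i - V_j <= L_V * d(i, j)
            for j in range(n):
                if i != j:
                    row = np.zeros(n)
                    row[i] += 1.0
                    row[j] -= 1.0
                    A_ub.append(row)
                    b_ub.append(L_V * dist(states[i], states[j]))
        res = linprog(c=np.ones(n), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                      bounds=[(None, None)] * n)
        return res.x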
Still More NP-ALP Details
• Every value traceable to a tight Bellman constraint
• For action selection for a query state s (see the sketch below):
  • Find the state t which bounds the value of s
  • Action taken at t is optimal for s
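A sketch of that action-selection rule under the same assumptions as the NP-ALP sketch above: the query state's value is upper-bounded through the Lipschitz constraints by some sampled state t, and the action recorded at t is reused. The actions_taken array is an illustrative assumption.

    import numpy as np

    def np_alp_action(s_query, states, V, actions_taken, L_V, dist):
        # V: LP values of the sampled states; actions_taken[i]: action executed at states[i].
        # Pick the sampled state giving the tightest upper bound V(s_query) <= V_t + L_V * d.
        bounds = [V[i] + L_V * dist(s_query, states[i]) for i in range(len(states))]
        t = int(np.argmin(bounds))
        return actions_taken[t]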
• Error bounds:
Scales with gaps in sampling, Lipschitz constant
Note that this is a bound on the value of the resulting policy, not the quality of the resulting value function!
NP-ALP: Bicycle Balancing
• RALP: Discrete actions, model, RBF features
• NP-ALP: Continuous actions, no model, no features