Transcript of Lirong Xia, Reinforcement Learning (2), Tue, March 21, 2014.

Page 1:

Lirong Xia

Reinforcement Learning (2)

Tue, March 21, 2014

Page 2:

Reminder

• Project 2 due tonight
• Project 3 is online (more later)
  – due in two weeks

Page 3:

Recap: MDPs

• Markov decision processes (a minimal Python representation is sketched after this list):
  – States S
  – Start state s0
  – Actions A
  – Transitions p(s'|s,a) (or T(s,a,s'))
  – Rewards R(s,a,s') (and discount γ)

• MDP quantities:
  – Policy = choice of action for each (MAX) state
  – Utility (or return) = sum of discounted rewards
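To make the recap concrete, here is a minimal sketch of how such a tabular MDP could be represented in Python; the class name, field names, and method signatures are illustrative assumptions, not part of the course code.

```python
from typing import Dict, List, Tuple

class MDP:
    """Minimal tabular MDP container: states S, actions A, transitions T, rewards R, discount gamma."""
    def __init__(self,
                 states: List[str],
                 actions: List[str],
                 transitions: Dict[Tuple[str, str], List[Tuple[str, float]]],
                 rewards: Dict[Tuple[str, str, str], float],
                 start_state: str,
                 gamma: float = 0.9):
        self.states = states            # S
        self.actions = actions          # A
        self.transitions = transitions  # (s, a) -> list of (s', p(s'|s,a))
        self.rewards = rewards          # (s, a, s') -> R(s, a, s')
        self.start_state = start_state  # s0
        self.gamma = gamma              # discount γ

    def T(self, s: str, a: str) -> List[Tuple[str, float]]:
        """Successor states and their probabilities for taking action a in state s."""
        return self.transitions.get((s, a), [])

    def R(self, s: str, a: str, s_next: str) -> float:
        """Reward for the transition (s, a, s')."""
        return self.rewards.get((s, a, s_next), 0.0)
```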

Page 4:

Optimal Utilities

• The value of a state s:
  – V*(s) = expected utility starting in s and acting optimally

• The value of a Q-state (s,a):
  – Q*(s,a) = expected utility starting in s, taking action a, and thereafter acting optimally

• The optimal policy:
  – π*(s) = optimal action from state s (the relations among V*, Q*, and π* are written out below)
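These three quantities are tied together by the standard Bellman optimality relations; they are not written out on the slide, but the following is consistent with the value-iteration update on the next slide:

```latex
\begin{align*}
V^*(s)   &= \max_a Q^*(s,a) \\
Q^*(s,a) &= \sum_{s'} T(s,a,s')\,\bigl[R(s,a,s') + \gamma\,V^*(s')\bigr] \\
\pi^*(s) &= \arg\max_a Q^*(s,a)
\end{align*}
```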

Page 5:

Solving MDPs

• Value iteration
  – Start with $V_1(s) = 0$
  – Given $V_i$, calculate the values for all states for depth i+1:
      $V_{i+1}(s) = \max_a \sum_{s'} T(s,a,s')\,\bigl[R(s,a,s') + \gamma V_i(s')\bigr]$
  – Repeat until convergence
  – Use $V_i$ as the evaluation function when computing $V_{i+1}$ (a code sketch of this loop follows below)

• Policy iteration
  – Step 1: policy evaluation: calculate utilities for some fixed policy
  – Step 2: policy improvement: update the policy using one-step look-ahead with the resulting utilities as future values
  – Repeat until the policy converges
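A minimal value-iteration sketch in Python, assuming the hypothetical MDP container sketched earlier (its states/actions/T/R/gamma attributes) and a fixed number of sweeps:

```python
def value_iteration(mdp, num_iterations=100):
    """Compute V_i(s) by repeatedly applying the Bellman update above."""
    V = {s: 0.0 for s in mdp.states}          # V_1(s) = 0
    for _ in range(num_iterations):
        V_next = {}
        for s in mdp.states:
            # V_{i+1}(s) = max_a sum_{s'} T(s,a,s') [R(s,a,s') + gamma * V_i(s')]
            action_values = [
                sum(p * (mdp.R(s, a, s2) + mdp.gamma * V[s2])
                    for s2, p in mdp.T(s, a))
                for a in mdp.actions
            ]
            V_next[s] = max(action_values) if action_values else 0.0
        V = V_next
    return V
```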

Page 6:

Reinforcement learning

• We don't know T and/or R, but we can observe R
  – Learn by doing
  – We can have multiple episodes (trials)

Page 7:

The Story So Far: MDPs and RL

Things we know how to do:
• If we know the MDP
  – Compute V*, Q*, π* exactly
  – Evaluate a fixed policy π
• If we don't know the MDP
  – If we can estimate the MDP, then solve it
  – We can estimate V for a fixed policy π
  – We can estimate Q*(s,a) for the optimal policy while executing an exploration policy

Techniques:
• Computation
  – Value and policy iteration
  – Policy evaluation
• Model-based RL
  – Sampling
• Model-free RL
  – Q-learning

Page 8:

Model-Free Learning

• Model-free (temporal difference) learning
  – Experience the world through episodes: (s, a, r, s', a', r', s'', a'', r'', s''', ...)
  – Update estimates after each transition (s, a, r, s')
  – Over time, the updates will mimic Bellman updates

Q-Value Iteration (model-based, requires known MDP):
    $Q_{i+1}(s,a) = \sum_{s'} T(s,a,s')\,\bigl[R(s,a,s') + \gamma \max_{a'} Q_i(s',a')\bigr]$

Q-Learning (model-free, requires only experienced transitions):
    $Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,\bigl[r + \gamma \max_{a'} Q(s',a')\bigr]$
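A sketch of the model-based Q-value iteration update in Python, again assuming the hypothetical MDP container from above (the fixed iteration count is arbitrary):

```python
def q_value_iteration(mdp, num_iterations=100):
    """Q_{i+1}(s,a) = sum_{s'} T(s,a,s') [R(s,a,s') + gamma * max_{a'} Q_i(s',a')]."""
    Q = {(s, a): 0.0 for s in mdp.states for a in mdp.actions}
    for _ in range(num_iterations):
        Q_next = {}
        for s in mdp.states:
            for a in mdp.actions:
                Q_next[(s, a)] = sum(
                    p * (mdp.R(s, a, s2)
                         + mdp.gamma * max(Q[(s2, a2)] for a2 in mdp.actions))
                    for s2, p in mdp.T(s, a)
                )
        Q = Q_next
    return Q
```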

Page 9:

Detour: Q-Value Iteration

• We'd like to do Q-value updates to each Q-state:
    $Q_{i+1}(s,a) = \sum_{s'} T(s,a,s')\,\bigl[R(s,a,s') + \gamma \max_{a'} Q_i(s',a')\bigr]$
  – But we can't compute this update without knowing T and R

• Instead, compute the average as we go
  – Receive a sample transition (s, a, r, s')
  – This sample suggests
      $Q(s,a) \approx r + \gamma \max_{a'} Q(s',a')$
  – But we want to merge the new observation with the old ones
  – So keep a running average (sketched in code below):
      $Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,\bigl[r + \gamma \max_{a'} Q(s',a')\bigr]$
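A minimal tabular Q-learning update in Python; the Q-table is assumed to be a defaultdict keyed by (state, action), and the alpha/gamma defaults are illustrative, not the project's settings:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Running average: Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * [r + gamma * max_a' Q(s',a')]."""
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
    return Q[(s, a)]

# Usage: start with an empty table and update after every observed transition.
Q = defaultdict(float)
q_learning_update(Q, s="s0", a="north", r=-1.0, s_next="s1", actions=["north", "south"])
```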

Page 10:

Q-Learning Properties

• Will converge to the optimal policy
  – If you explore enough (i.e., visit each Q-state many times)
  – If you make the learning rate small enough
  – Basically, it doesn't matter how you select actions (!)

• Off-policy learning: learns optimal q-values, not the values of the policy you are following

Page 11:

Q-Learning


• Q-learning produces tables of q-values:

Page 12:

Exploration / Exploitation

• Random actions (ε-greedy)
  – Every time step, flip a coin
  – With probability ε, act randomly
  – With probability 1-ε, act according to the current policy (see the sketch below)
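A minimal ε-greedy action-selection sketch in Python; the Q-table and legal-action list are assumed inputs (this is not the project's required API):

```python
import random

def epsilon_greedy_action(Q, state, legal_actions, epsilon=0.1):
    """With probability epsilon act randomly; otherwise act greedily w.r.t. the current Q-values."""
    if random.random() < epsilon:
        return random.choice(legal_actions)                 # explore
    # Exploit: pick the legal action with the highest current Q-value
    # (Q is assumed to be a defaultdict keyed by (state, action)).
    return max(legal_actions, key=lambda a: Q[(state, a)])
```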

Page 13:

Today: Q-Learning with state abstraction

• In realistic situations, we cannot possibly learn about every single state!
  – Too many states to visit them all in training
  – Too many states to hold the Q-tables in memory

• Instead, we want to generalize:
  – Learn about some small number of training states from experience
  – Generalize that experience to new, similar states
  – This is a fundamental idea in machine learning, and we'll see it over and over again

Page 14:

Example: Pacman


• Let’s say we discover through experience that this state is bad:

• In naive Q-learning, we know nothing about this state or its Q-states:

• Or even this one!

Page 15:

Feature-Based Representations

• Solution: describe a state using a vector of features (properties)
  – Features are functions from states to real numbers (often 0/1) that capture important properties of the state
  – Example features:
    • Distance to closest ghost
    • Distance to closest dot
    • Number of ghosts
    • 1 / (distance to closest dot)²
    • Is Pacman in a tunnel? (0/1)
    • … etc.
    • Is it the exact state on this slide?
  – Can also describe a Q-state (s, a) with features (e.g., the action moves closer to food)

• This is similar to an evaluation function

Page 16:

Linear Feature Functions

• Using a feature representation, we can write a Q function (or value function) for any state using a few weights:
    $Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + \ldots + w_n f_n(s,a)$
  (a code sketch of this linear form follows below)

• Advantage: our experience is summed up in a few powerful numbers

• Disadvantage: states may share features but actually be very different in value!
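A sketch of such a linear Q-function in Python; the feature names, weights, and feature functions are illustrative placeholders rather than the project's actual feature extractor:

```python
def linear_q_value(weights, features, state, action):
    """Q(s,a) = sum_i w_i * f_i(s,a) for dicts of weights and feature functions keyed by name."""
    return sum(weights[name] * f(state, action) for name, f in features.items())

# Hypothetical usage (feature functions f_dot / f_ghost are assumed to exist):
# features = {"dist_to_dot": f_dot, "dist_to_ghost": f_ghost}
# weights  = {"dist_to_dot": 4.0,   "dist_to_ghost": -1.0}
# q = linear_q_value(weights, features, state, "North")
```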

Page 17:

Function Approximation

• Q-learning with linear Q-functions:
  – transition = (s, a, r, s')
  – $\text{difference} = \bigl[r + \gamma \max_{a'} Q(s',a')\bigr] - Q(s,a)$
  – Exact Q's:       $Q(s,a) \leftarrow Q(s,a) + \alpha\,[\text{difference}]$
  – Approximate Q's: $w_m \leftarrow w_m + \alpha\,[\text{difference}]\,f_m(s,a)$ (a code sketch of this update follows below)

• Intuitive interpretation:
  – Adjust the weights of the active features
  – E.g., if something unexpectedly bad happens, disprefer all states with that state's features

• Formal justification: online least squares
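A minimal sketch of this approximate Q-learning weight update in Python, building on the hypothetical linear_q_value helper above; the alpha and gamma defaults are placeholders:

```python
def approximate_q_update(weights, features, s, a, r, s_next, legal_actions,
                         alpha=0.01, gamma=0.9):
    """w_m <- w_m + alpha * difference * f_m(s,a), where
    difference = [r + gamma * max_a' Q(s',a')] - Q(s,a)."""
    q_next = max((linear_q_value(weights, features, s_next, a2)
                  for a2 in legal_actions), default=0.0)
    difference = (r + gamma * q_next) - linear_q_value(weights, features, s, a)
    for name, f in features.items():
        weights[name] += alpha * difference * f(s, a)
    return difference
```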

Page 18:

Example: Q-Pacman

Initial linear Q-function:
  Q(s,a) = 4.0 f_DOT(s,a) − 1.0 f_GST(s,a)

Features for action NORTH in state s:
  f_DOT(s, NORTH) = 0.5
  f_GST(s, NORTH) = 1.0
  Q(s, NORTH) = +1

Observed transition (a = NORTH, from s to s'):
  r = R(s, a, s') = −500
  difference = −501

Weight updates:
  w_DOT ← 4.0 + α [−501] (0.5)
  w_GST ← −1.0 + α [−501] (1.0)

Updated Q-function:
  Q(s,a) = 3.0 f_DOT(s,a) − 3.0 f_GST(s,a)
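As a quick sanity check (not on the original slide), the difference value is consistent with the approximate Q-learning rule above when the episode ends on this transition, so the future term is zero:

```latex
\text{difference} = \bigl[r + \gamma \max_{a'} Q(s',a')\bigr] - Q(s,a) = (-500 + 0) - (+1) = -501
```

The final weights shown (3.0 and −3.0) are then consistent with a learning rate of roughly α ≈ 0.004 in both updates.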

Page 19:

Linear Regression


Page 20:

Ordinary Least Squares (OLS)


Page 21:

Minimizing Error

Imagine we had only one point x with features f(x):

  $\text{error}(w) = \frac{1}{2}\Bigl(y - \sum_k w_k f_k(x)\Bigr)^2$

  $\frac{\partial\, \text{error}(w)}{\partial w_m} = -\Bigl(y - \sum_k w_k f_k(x)\Bigr) f_m(x)$

  $w_m \leftarrow w_m + \alpha \Bigl(y - \sum_k w_k f_k(x)\Bigr) f_m(x)$

Approximate Q update explained:

  $w_m \leftarrow w_m + \alpha \bigl[\underbrace{r + \gamma \max_{a'} Q(s',a')}_{\text{``target''}} - \underbrace{Q(s,a)}_{\text{``prediction''}}\bigr]\, f_m(s,a)$

Page 22:

How many features should we use?

• As many as possible?
  – computational burden
  – overfitting

• Feature selection is important
  – it requires domain expertise

Page 23:

Overfitting


Page 24:

(Figure: The Elements of Statistical Learning, Fig. 2.11.)

Page 25:

Overview of Project 3

• MDPs
  – Q1: value iteration
  – Q2: find parameters that lead to a certain optimal policy
  – Q3: similar to Q2

• Q-learning
  – Q4: implement the Q-learning algorithm
  – Q5: implement ε-greedy action selection
  – Q6: try the algorithm

• Approximate Q-learning and state abstraction
  – Q7: Pacman
  – Q8 (bonus): design and implement a state-abstraction Q-learning algorithm

• Hints
  – make your implementation general
  – try to use class methods as much as possible