Transcript of lecture slides — Laurent Charlin, 80-629
lcharlin/courses/80-629/slides_rl2.pdf · 2019-11-19

Page 1:

Sequential Decision Making II — Week #12

Machine Learning I 80-629A

Apprentissage Automatique I 80-629

Page 2:

Brief recap

• Markov Decision Processes (MDP)

• Offer a framework for sequential decision making

• Goal: find the optimal policy

• Dynamic programming and several algorithms (e.g., value iteration (VI), policy iteration (PI))

[Figure 3.1, Sutton & Barto, RL: An Introduction — the agent–environment interaction loop]

Page 3:

From MDPs to RL

• In MDPs we assume that we know:

1. Transition probabilities: P(s’ | s, a)

2. Reward function: R(s)

• RL is more general

• In RL both are typically unknown

• RL agents navigate the world to gather this information

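To make this concrete, here is a minimal sketch of the interaction loop through which an RL agent gathers experience. The environment's transition and reward tables exist but are hidden from the agent; all names (`step`, `random_policy`, the toy two-state dynamics) are illustrative assumptions, not course code.

```python
import random

# A toy 2-state environment; the agent never sees these tables directly.
TRANSITIONS = {("A", 0): "A", ("A", 1): "B", ("B", 0): "A", ("B", 1): "B"}
REWARDS = {"A": 0.0, "B": 1.0}

def step(state, action):
    """Hidden dynamics: returns (next_state, reward)."""
    next_state = TRANSITIONS[(state, action)]
    return next_state, REWARDS[next_state]

def random_policy(state):
    return random.choice([0, 1])

# The agent only collects (state, action, reward, next_state) samples.
state = "A"
experience = []
for t in range(10):
    action = random_policy(state)
    next_state, reward = step(state, action)
    experience.append((state, action, reward, next_state))
    state = next_state
```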

Page 4:

Experience (I)

A. Supervised Learning:

• Given fixed dataset

• Goal: maximize objective on test set (population)

B. Reinforcement Learning

• Collect data as agent interacts with the world

• Goal: maximize sum of rewards

Page 5:

Experience (II)

• Any supervised learning problem can be mapped to a reinforcement learning problem

• Loss function can be used as a reward function

• Dependent variables (Y) become the actions

• States encode information from the independent variables (x)

• Such a mapping isn’t very useful in practice

• SL is easier than RL

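As a hedged illustration of this mapping (the function names and the 0–1 loss choice are my own, not from the slides), a classification dataset can be wrapped as a one-step episodic RL problem: the state is the feature vector, the action is the predicted label, and the reward is the negative loss.

```python
def supervised_as_rl_episode(x, y_true, predict_action):
    """One-step episode: state = features x, action = predicted label,
    reward = +1 for a correct prediction, -1 otherwise (negative 0-1 loss)."""
    state = x
    action = predict_action(state)
    reward = 1.0 if action == y_true else -1.0
    return state, action, reward  # the episode ends immediately

# Example with a trivial "policy" that always predicts class 0 (illustrative only).
print(supervised_as_rl_episode(x=[0.2, 0.7], y_true=0, predict_action=lambda s: 0))
```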

Page 6:

Example

https://cdn.dribbble.com/users/3907/screenshots/318354/tounge-stuck-on-pole-shot.jpg

Slide adapted from Pascal Poupart

Supervised Learning vs. Reinforcement Learning: being told “Don’t touch” vs. finding out that your tongue will get stuck.

Pages 7–12:

RL applications

• Key: decision making over time, uncertain environments

• Robot navigation: self-driving cars, helicopter control

• Interactive systems: recommender systems, chatbots

• Game playing: Backgammon, Go

• Healthcare: monitoring systems

Page 13:

Reinforcement learning and recommender systems

• Most users have multiple interactions with the system over time

• Making recommendations over time can be advantageous (e.g., you could better explore one’s preferences)

• States: Some representation of user preferences (e.g., previous items they consumed)

• Actions: what to recommend

• Reward:

• positive if the user consumes the recommendation

• negative if the user does not consume the recommendation

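A minimal sketch of these components, with made-up names that are not from the course: the state summarizes what the user recently consumed, the action is the item to recommend, and the reward is +1 or −1 depending on whether the recommendation is consumed.

```python
def reward(recommended_item, consumed):
    """+1 if the user consumes the recommendation, -1 otherwise."""
    return 1.0 if consumed else -1.0

def next_state(history, recommended_item, consumed, max_len=5):
    """State: the user's most recently consumed items (a crude preference summary)."""
    if consumed:
        history = (history + [recommended_item])[-max_len:]
    return history

state = ["item_3", "item_7"]          # previously consumed items
action = "item_12"                    # what we recommend
state = next_state(state, action, consumed=True)
r = reward(action, consumed=True)     # +1.0
```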

Pages 14–16:

Challenges of reinforcement learning

• Credit assignment problem: which action(s) should be credited for obtaining a reward?

• A series of actions (getting coffee from the cafeteria)

• A small number of actions several time steps ago may be important (test taking: studying before, getting the grade long after)

• Exploration/exploitation tradeoff: as the agent interacts, should it exploit its current knowledge (exploitation) or seek out additional information (exploration)?

[Figure: two options, annotated R=10 and R=1]

Pages 17–19:

Algorithms for RL

• Two main classes of approach:

1. Model-based

• Learns a model of the transitions and uses it to optimize a policy given the model

2. Model-free

• Learns an optimal policy without explicitly learning the transitions

Pages 20–23:

Monte Carlo Methods

• Model-free

• Assumes the environment is episodic

• Think of playing poker: an episode is a poker hand

• Updates the policy after each episode

• Intuition:

• Experience many episodes

• Play many hands of poker

• Average the rewards received at each state

• What is the proportion of wins for each hand?

Page 24:

Prediction vs. control

1. Prediction: evaluate a given policy

2. Control: Learn a policy

• Sometimes also called

• passive (prediction)

• active (control)

Page 25:

First-visit Monte Carlo

• Given a fixed policy (prediction)

• Calculate the value function V(s) for each state

• Converges to V as the number of visits to each state goes to infinity

[First-visit MC prediction algorithm for estimating V ≈ vπ — Sutton & Barto, RL Book, Ch. 5]
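Here is a minimal sketch of first-visit Monte Carlo prediction in the spirit of the algorithm box referenced above. The `generate_episode` argument is an assumed placeholder that runs the fixed policy in the environment and returns a list of (state, reward) pairs, where each reward follows the corresponding state.

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, num_episodes, gamma=0.9):
    """Estimate V(s) for a fixed policy by averaging first-visit returns."""
    returns = defaultdict(list)   # state -> list of observed returns
    V = defaultdict(float)
    for _ in range(num_episodes):
        episode = generate_episode()          # [(s_0, r_1), (s_1, r_2), ...]
        G = 0.0
        # Work backwards through the episode, accumulating the discounted return.
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            # Only record G if this is the FIRST visit to s in the episode.
            if s not in (s2 for s2, _ in episode[:t]):
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```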

Pages 26–50:

Example: grid world

• Start state is top-left (start of episode)

• Bottom right is absorbing (end of episode), with reward R=10

• Policy is given (gray arrows)

[Grid-world figure: states numbered 1–18 on a grid; the gray arrows show the fixed policy; the absorbing bottom-right state gives R=10]

Episode (a sequence of (state, action) pairs; the actions are the arrows drawn on the slide):
(1, ·) → (2, ·) → (3, ·) → (7, ·) → (6, ·) → (7, ·) → (10, ·) → (13, ·) → (17, ·) → absorbing state

For state 7: return(7) is the discounted sum of the rewards received from the first visit to state 7 until the end of the episode (the visits to 6, 7, 10, 13, 17 and the absorbing state with R=10), i.e., return(7) = Σ_k γ^k · (reward received k steps after the first visit to state 7). V(7) is then the average of these first-visit returns over all episodes that visit state 7.
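A small sketch of this return computation. The intermediate reward values here are invented for illustration (assumed to be 0); only the R=10 for reaching the absorbing state comes from the slide.

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted sum of rewards, starting from the first element."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards observed after the first visit to state 7 in the example episode.
# Intermediate rewards assumed 0; reaching the absorbing state gives 10.
rewards_after_first_visit = [0, 0, 0, 0, 0, 10]
print(discounted_return(rewards_after_first_visit))  # 10 * gamma**5, about 5.9 for gamma = 0.9
```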

Page 51:

Example: Blackjack

• Episode: one hand

• States: sum of the player’s cards, the dealer’s showing card, usable ace or not

• Actions: {Stay, Hit}

• Rewards: {Win +1, Tie 0, Lose -1}

• A few other assumptions: infinite deck

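As a small sketch (my own encoding, not from the slides), the state can be written as a tuple and the fixed policy evaluated on the next pages can be written directly:

```python
# State: (player_sum, dealer_showing_card, usable_ace)
state = (15, 10, False)

def policy_hit_until_20(state):
    """The policy evaluated in Figure 5.1: hit unless the player's sum is 20 or 21."""
    player_sum, _, _ = state
    return "Stay" if player_sum >= 20 else "Hit"

print(policy_hit_until_20(state))  # "Hit"
```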

Pages 52–55:

• Evaluates the policy that hits except when the sum of the cards is 20 or 21

[Figure 5.1, Sutton & Barto — state-value functions for this policy; the slides annotate R=1 and R=-1]

Pages 56–58:

Q-value for control

• We know about state-value functions V(s)

• If the state transitions are known, they can be used to derive an optimal policy [recall value iteration]:

π*(s) = argmax_a [ R(s) + γ Σ_{s'} P(s' | s, a) V*(s') ]

• When state transitions are unknown, what can we do?

• Q(s, a): the value function of a (state, action) pair

π*(s) = argmax_a Q(s, a)
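A minimal sketch of acting greedily with respect to a learned Q table (a dictionary keyed by (state, action); all names are illustrative):

```python
def greedy_action(Q, state, actions):
    """pi*(s) = argmax_a Q(s, a), with Q stored as a dict keyed by (state, action)."""
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

Q = {("s1", "left"): 0.2, ("s1", "right"): 1.3}
print(greedy_action(Q, "s1", ["left", "right"]))  # "right"
```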

Pages 59–60:

Monte Carlo ES (control)

• Strong reasons to believe that it converges to the optimal policy

• The “exploring starts” requirement may be unrealistic

[Monte Carlo ES (Exploring Starts) algorithm — Sutton & Barto, RL Book, Ch. 5]
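A compact sketch of Monte Carlo control with exploring starts, in the spirit of the algorithm box cited above. `generate_episode_from` is an assumed helper that starts an episode from the given (state, action) pair, then follows the current policy, and returns a list of (state, action, reward) triples.

```python
import random
from collections import defaultdict

def mc_exploring_starts(generate_episode_from, states, actions, num_episodes, gamma=0.9):
    """Monte Carlo ES: estimate Q(s, a) from first-visit returns and act greedily."""
    Q = defaultdict(float)
    returns = defaultdict(list)
    policy = {s: random.choice(actions) for s in states}
    for _ in range(num_episodes):
        # Exploring start: every (state, action) pair has a chance to begin an episode.
        s0, a0 = random.choice(states), random.choice(actions)
        episode = generate_episode_from(s0, a0, policy)  # [(s, a, r), ...]
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = r + gamma * G
            if (s, a) not in [(s2, a2) for s2, a2, _ in episode[:t]]:  # first visit
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                policy[s] = max(actions, key=lambda act: Q[(s, act)])  # greedy improvement
    return policy, Q
```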

Pages 61–62:

Learning without “exploring starts”

• “Exploring starts” ensures that all states can be visited regardless of the policy

• A specific policy may not visit all states

• Solution: inject some uncertainty into the policy

Pages 63–64:

Monte Carlo without exploring starts (on-policy)

• Policy value cannot decrease:

Vπ'(s) ≥ Vπ(s) for all s ∈ S, where π is the policy at the current step and π' is the policy at the next step

[On-policy first-visit Monte Carlo control algorithm — Sutton & Barto, RL Book, Ch. 5]

Page 65:

Monte-Carlo methods summary

• Allow a policy to be learned through interactions

• States are effectively treated as being independent

• Focus on a subset of states (e.g., states for which playing optimally is of particular importance)

• Episodic (with or without exploring starts)


Pages 66–70:

Temporal Difference (TD) Learning

• One of the “central ideas of RL” [Sutton & Barto, RL book]

• Monte Carlo methods update V towards the observed return G_t:

V(s_t) ← V(s_t) + α [G_t − V(s_t)], where α is the step size and G_t = Σ_k γ^k R(s_{t+k}) is the observed return

• TD(0) updates “instantly”, after each transition, using the target R_{t+1} + γ V(s_{t+1}):

V(s_t) ← V(s_t) + α [R_{t+1} + γ V(s_{t+1}) − V(s_t)]

Page 71:

TD(0) for prediction

[Tabular TD(0) algorithm for estimating vπ — Sutton & Barto, RL Book, Ch. 6]
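A minimal sketch of tabular TD(0) prediction for a fixed policy. `env_reset`, `env_step`, and `policy` are assumed placeholders: an environment reset, a step function returning (next_state, reward, done), and a function mapping states to actions.

```python
from collections import defaultdict

def td0_prediction(env_reset, env_step, policy, num_episodes, alpha=0.1, gamma=0.9):
    """Estimate V(s) for a fixed policy with one TD(0) update per transition."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = env_reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env_step(s, a)
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])   # V(s) <- V(s) + alpha [r + gamma V(s') - V(s)]
            s = s_next
    return V
```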

Page 72:

TD for control

Page 74:

Comparing TD and MC

• MC requires going through full episodes before updating the value function. Episodic.

• Converges to the optimal solution


• TD updates each V(s) after each transition. Online.

• Converges to the optimal solution (under some conditions on the step size α)

• Empirically TD methods tend to converge faster

Pages 75–80:

Q-learning for control

• Actions are chosen with an ε-greedy policy:

a = argmax_a Q(s, a) with probability 1 − ε, or a random action with probability ε

• Converges to Q* as long as all (s, a) pairs continue to be updated, and with minor constraints on the learning rate

[Q-learning algorithm — Sutton & Barto, RL Book, Ch. 6]
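A minimal tabular Q-learning sketch with ε-greedy exploration; as before, `env_reset` and `env_step` are assumed placeholders for the environment interface, not course code.

```python
import random
from collections import defaultdict

def q_learning(env_reset, env_step, actions, num_episodes,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: off-policy TD control with an epsilon-greedy behaviour policy."""
    Q = defaultdict(float)   # keyed by (state, action)
    for _ in range(num_episodes):
        s = env_reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env_step(s, a)
            # The Q-learning target uses the greedy value of the next state.
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```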

Page 81:

Approximation techniques

• Methods we studied are “tabular”

• State value functions (and Q) can be approximated

• Linear approximation:

• Coupling between states through x(s)

• Adapt the algorithms for this case.

• Updates to the value function now imply updating the weights w using a gradient


V(s) = wᵀx(s)

Pages 82–86:

Approximation techniques (prediction)

• Linear approximation: V(s) = wᵀx(s)

• Objective: Σ_{s∈S} [ vπ(s) − wᵀx(s) ]²

• Gradient update: w_{t+1} = w_t + α Σ_{s∈S} [ vπ(s) − wᵀx(s) ] x(s)

• G_t is an unbiased estimator of vπ(s_t), so the observed return can stand in for the unknown vπ(s) in the update

[Sutton & Barto, RL Book, Ch. 9]
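A minimal sketch of the corresponding stochastic-gradient update with a linear value function (gradient Monte Carlo prediction). It assumes a feature function x(s) returning a NumPy vector and episodes given as lists of (state, return) pairs computed beforehand; all names are illustrative, not course code.

```python
import numpy as np

def gradient_mc_prediction(episodes, x, num_features, alpha=0.01):
    """Linear value approximation V(s) = w . x(s), updated towards observed returns G_t."""
    w = np.zeros(num_features)
    for episode in episodes:                 # episode: list of (state, G_t) pairs
        for s, G in episode:
            v_hat = w @ x(s)
            # Stochastic gradient step on [G - w.x(s)]^2; the gradient of v_hat w.r.t. w is x(s).
            w += alpha * (G - v_hat) * x(s)
    return w

# Tiny usage example with one-hot features over 3 states (illustrative only).
x = lambda s: np.eye(3)[s]
episodes = [[(0, 1.0), (1, 0.5), (2, 0.0)]]
w = gradient_mc_prediction(episodes, x, num_features=3)
```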

Pages 87–89:

Approximation techniques

• Works both for prediction and control

• Any model can be used to approximate the value function

• Recent work using deep neural networks yields impressive performance on computer (Atari) games

Page 90:

Summary

• Today we defined RL and studied several algorithms for solving RL problems (mostly for the tabular case)

• Main challenges

• Credit assignment

• Exploration/Exploitation tradeoff

• Algorithms

• Prediction

• Monte Carlo and TD(0)

• Control

• Q-learning

• Approximation algorithms can help scale reinforcement learning


Page 91:

Practical difficulties

• Compared to supervised learning, setting up an RL problem is often harder

• Need an environment (or at least a simulator)

• Rewards

• In some domains it’s clear (e.g., in games)

• In others it’s much more subtle (e.g., you want to please a human)


Page 92:

Acknowledgements

• The algorithms are from “Reinforcement Learning: An Introduction” by Richard Sutton and Andrew Barto

• The definitive RL reference

• Some of these slides were adapted from Pascal Poupart’s slides (CS686, U. Waterloo)

• The TD demo is from Andrej Karpathy