Download - Model Free Prediction and Control

Transcript

Page 1: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

usepackagexcolor

Model Free Prediction and Control

Clément Rambour

Clément Rambour Model Free Prediction and Control 1 / 44

Page 2: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Plan

1 Introduction

2 Model Free PredictionMonte Carlo samplingTD0 LearningTD(λ) Learning

3 Model Free ControlIntroductionOn policy et SARSAOff policy et Q-learning

Clément Rambour Model Free Prediction and Control 2 / 44

Page 3: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Plan

1 Introduction

2 Model Free PredictionMonte Carlo samplingTD0 LearningTD(λ) Learning

3 Model Free ControlIntroductionOn policy et SARSAOff policy et Q-learning

Clément Rambour Model Free Prediction and Control 3 / 44

Page 4: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Table of Contents

1 Introduction

2 Model Free PredictionMonte Carlo samplingTD0 LearningTD(λ) Learning

3 Model Free ControlIntroductionOn policy et SARSAOff policy et Q-learning

Clément Rambour Model Free Prediction and Control 4 / 44

Page 5: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Échantillonnage de Monte Carlo

ButApprendre vπ d’après les épisodes vécus en suivant π :

S1, A1, R1, · · · , sk ∼ π

La fonction de valeur d’un état vπ est donnée par :

vπ(s) = Eπ[Gt|St = s]

avec Gt = Rt+1 + γRt+2 + · · ·+ γT−1RT

Monte CarloSimplest way to estimate the value → mean of the returns

Clément Rambour Model Free Prediction and Control 5 / 44

Page 6: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Échantillonnage de Monte Carlo

ButApprendre vπ d’après les épisodes vécus en suivant π :

S1, A1, R1, · · · , sk ∼ π

La fonction de valeur d’un état vπ est donnée par :

vπ(s) = Eπ[Gt|St = s]

avec Gt = Rt+1 + γRt+2 + · · ·+ γT−1RT

Monte CarloSimplest way to estimate the value → mean of the returns

Clément Rambour Model Free Prediction and Control 5 / 44

Page 7: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Échantillonnage de Monte Carlo

First Visit Monte-Carlo Policy Evaluation

v̂π(s) =S(s)

N(s)

N(s) : compteur incrémenté si s est vu au moins une fois dansun épisodeS(s)← S(s) +Gt : retour total pour l’état s

Clément Rambour Model Free Prediction and Control 6 / 44

Page 8: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Échantillonnage de Monte Carlo

Every Visit Monte-Carlo Policy Evaluation

v̂π(s) =S(s)

N(s)

N(s) : compteur incrémenté à chaque fois que s est vu dans unépisodeS(s)← S(s) +Gt : retour total pour l’état s

Clément Rambour Model Free Prediction and Control 7 / 44

Page 9: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Moyenne incrémentielle

µk =1

k∑j=1

(xk +

k−1∑j=1

xj)

k(xk + (k − 1)µk−1)

= µk−1 +1

k(xk − µk−1)

V (s) est mis a jour après chaqueépisode :foreach St do

N(St)← N(St) + 1 Vπ(St)←V (St) +

1N(St)

(Gt − V (St))

end

En cas de non stationnarité, lamoyenne peut être mise à jour ie. lepoids des anciens épisodes → 0 :

Vπ(St)← V (St) + α(Gt − V (St))

α << 1

Clément Rambour Model Free Prediction and Control 8 / 44

Page 10: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Moyenne incrémentielle

µk =1

k∑j=1

(xk +

k−1∑j=1

xj)

k(xk + (k − 1)µk−1)

= µk−1 +1

k(xk − µk−1)

V (s) est mis a jour après chaqueépisode :foreach St do

N(St)← N(St) + 1 Vπ(St)←V (St) +

1N(St)

(Gt − V (St))

end

En cas de non stationnarité, lamoyenne peut être mise à jour ie. lepoids des anciens épisodes → 0 :

Vπ(St)← V (St) + α(Gt − V (St))

α << 1

Clément Rambour Model Free Prediction and Control 8 / 44

Page 11: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Moyenne incrémentielle

µk =1

k∑j=1

(xk +

k−1∑j=1

xj)

k(xk + (k − 1)µk−1)

= µk−1 +1

k(xk − µk−1)

V (s) est mis a jour après chaqueépisode :foreach St do

N(St)← N(St) + 1 Vπ(St)←V (St) +

1N(St)

(Gt − V (St))

end

En cas de non stationnarité, lamoyenne peut être mise à jour ie. lepoids des anciens épisodes → 0 :

Vπ(St)← V (St) + α(Gt − V (St))

α << 1

Clément Rambour Model Free Prediction and Control 8 / 44

Page 12: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Monte Carlo prediction : pros and cons

pros

IntuitifFacile à implémenterSans biais

ConsNe peut être mis à jour qu’après un épisode complet ie. ne peutpas être utilisé onlineLong à converger

Clément Rambour Model Free Prediction and Control 9 / 44

Page 13: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Table of Contents

1 Introduction

2 Model Free PredictionMonte Carlo samplingTD0 LearningTD(λ) Learning

3 Model Free ControlIntroductionOn policy et SARSAOff policy et Q-learning

Clément Rambour Model Free Prediction and Control 10 / 44

Page 14: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

De Monte Carlo à TD(0)

Monte CarloMise à jour basée sur le retour Gt :

vπ(St)← v(St) + α(Gt − v(St))

Offline

TD(0)

D’après les équations de Bellman :

vπ(s) = E[Rt+1 + vπ(St+1)|St = s]

Clément Rambour Model Free Prediction and Control 11 / 44

Page 15: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

De Monte Carlo à TD(0)

Monte CarloMise à jour basée sur le retour Gt :

vπ(St)← v(St) + α(Gt − v(St))

Offline

TD(0)

D’après les équations de Bellman :

vπ(s) = E[Rt+1 + vπ(St+1)|St = s]

Mise à jour basée sur le retour estimé Rt+1 + γvπ(St+1) :

vπ(St)← vπ(St) + α(Rt+1 + γvπ(St+1)− vπ(St))

Online

Clément Rambour Model Free Prediction and Control 11 / 44

Page 16: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

De Monte Carlo à TD(0)

Monte CarloMise à jour basée sur le retour Gt :

vπ(St)← v(St) + α(Gt − v(St))

Offline

TD(0)

D’après les équations de Bellman :

vπ(s) = E[Rt+1 + vπ(St+1)|St = s]

vπ(St)← vπ(St) + α(Rt+1 + γvπ(St+1)︸︷︷︸TD Target

−vπ(St)

︸︷︷︸TD Error = δt

)

Clément Rambour Model Free Prediction and Control 11 / 44

Page 17: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Temporal Difference

TD(0)

Intuition : Estimation de la valeur basée sur la connaissance de l’étatsuivant

Vπ(St)← Vπ(St) + α(Rt+1 + γVπ(St+1)− Vπ(St))

Généralisation : n-stepGénéralisation possible en se basant sur les n état suivants.

Clément Rambour Model Free Prediction and Control 12 / 44

Page 18: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Temporal Difference

TD(0)

Intuition : Estimation de la valeur basée sur la connaissance de l’étatsuivant

Vπ(St)← Vπ(St) + α(Rt+1 + γVπ(St+1)− Vπ(St))

Généralisation : n-stepGénéralisation possible en se basant sur les n état suivants.

D’après les équations de Bellman :

G(n)t = Rt+1 + γRt+2 + · · ·+ γnvπ(St+n)

Mise à jour de la valeur :

Vπ(SSt)← Vπ(St) + α(G

(n)t − Vπ(St))

Clément Rambour Model Free Prediction and Control 12 / 44

Page 19: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Temporal Difference

TD(0)

Intuition : Estimation de la valeur basée sur la connaissance de l’étatsuivant

Vπ(St)← Vπ(St) + α(Rt+1 + γVπ(St+1)− Vπ(St))

Généralisation : n-stepGénéralisation possible en se basant sur les n état suivants.

Équivalent à Monte-Carlo pour n→∞Nécessite de stocker n récompenses et valeurs d’état

Clément Rambour Model Free Prediction and Control 12 / 44

Page 20: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Exemple

Clément Rambour Model Free Prediction and Control 13 / 44

Page 21: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Exemple : MC vs TD

Clément Rambour Model Free Prediction and Control 14 / 44

Page 22: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Exemple : MC vs TD

Clément Rambour Model Free Prediction and Control 15 / 44

Page 23: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Exemple : MC vs TD

Clément Rambour Model Free Prediction and Control 16 / 44

Page 24: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Exemple : MC vs TD

Clément Rambour Model Free Prediction and Control 17 / 44

Page 25: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

MC vs TD

batch-MC converge vers la solution correspondant aux moindrecarrésBatch TD(0) converge vers le maximum de vraisemblanced’après la dynamique du modèle conditionné par la politique

Clément Rambour Model Free Prediction and Control 18 / 44

Page 26: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Table of Contents

1 Introduction

2 Model Free PredictionMonte Carlo samplingTD0 LearningTD(λ) Learning

3 Model Free ControlIntroductionOn policy et SARSAOff policy et Q-learning

Clément Rambour Model Free Prediction and Control 19 / 44

Page 27: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

TD(λ)

Utilisation de toutes les approximation du retour :

Gλt = (1− λ)∞∑n=1

λn−1G(n)t

(forward-view) TD(λ) :

Vπ(St)→ Vπ(st) + α(Gλt − Vπ(St)

)Nécessite des épisodes complets (comme MC)offline

Peut être approché online (backward-view)

Clément Rambour Model Free Prediction and Control 20 / 44

Page 28: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

TD(λ)

Utilisation de toutes les approximation du retour :

Gλt = (1− λ)∞∑n=1

λn−1G(n)t

(forward-view) TD(λ) :

Vπ(St)→ Vπ(st) + α(Gλt − Vπ(St)

)Nécessite des épisodes complets (comme MC)offlinePeut être approché online (backward-view)

Clément Rambour Model Free Prediction and Control 20 / 44

Page 29: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Eligibility traces

Plutôt que de regarder les n états suivants, on peut regarder lesn états précédentsPondération donnée par l’éligibilité des états observées

Éligibilité

Mise à jour basée sur les états récents et fréquents :

E0(s) = 0

Et(s) = γλEt−1(s) + 1St=s

Clément Rambour Model Free Prediction and Control 21 / 44

Page 30: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Backward view TD(λ)

Éligibilité calculée pour chaque étatMise à jour de tous les états à chaque itérationL’erreur TS δt est pondéré par l’éligibilité

δt = Rt+1 + γV (St+1)− V (St)

V (St)← V (St) + αδEt(St)

Clément Rambour Model Free Prediction and Control 22 / 44

Page 31: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Plan

1 Introduction

2 Model Free PredictionMonte Carlo samplingTD0 LearningTD(λ) Learning

3 Model Free ControlIntroductionOn policy et SARSAOff policy et Q-learning

Clément Rambour Model Free Prediction and Control 23 / 44

Page 32: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Table of Contents

1 Introduction

2 Model Free PredictionMonte Carlo samplingTD0 LearningTD(λ) Learning

3 Model Free ControlIntroductionOn policy et SARSAOff policy et Q-learning

Clément Rambour Model Free Prediction and Control 24 / 44

Page 33: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Model Free Control

Précédemment (programmation dynamique), l’améliorationgreedy de la politique se fait selon la fonction de valeur d’état :

πk+1(s) = argmaxa∈A

qπ(s, a)

→ on choisit les action qui amèneront aux meilleurs étatsNécessité de stocker la valeur de chaque état vπ(s)

Sans connaître la dynamique du modèle, on ne sait pas quelssont les états atteignables→ Nécessité de stocker la valeur de chaque couple (action-état)qπ(s, a)

Clément Rambour Model Free Prediction and Control 25 / 44

Page 34: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Model Free Control

Précédemment (programmation dynamique), l’améliorationgreedy de la politique se fait selon la fonction de valeur d’état :

πk+1(s) = argmaxa∈A

qπ(s, a)

→ on choisit les action qui amèneront aux meilleurs étatsNécessité de stocker la valeur de chaque état vπ(s)

Sans connaître la dynamique du modèle, on ne sait pas quelssont les états atteignables→ Nécessité de stocker la valeur de chaque couple (action-état)qπ(s, a)

Clément Rambour Model Free Prediction and Control 25 / 44

Page 35: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Generalized policy iteration

Même principe qu’enprogrammationdynamiqueAlternance :

Estimation de qπAmélioration greedyde la politique

Clément Rambour Model Free Prediction and Control 26 / 44

Page 36: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

GLIE

Condition suffisante pour la convergence vers qπ∗

DéfinitionUn processus d’amélioration de politique est dit Greedy in the Limitwith Infinite Exploration (GLIE) si

Tous les états-action sont explorés infiniments :

limk←∞

Nk(s, a) =∞

La politique converge vers une politique greedy :

limk←∞

πk(a|s) = 1argmaxa∈A

qk(s,a)

Par exemple : un apprentissage ε-greedy est GLIE si εk = 1k

Clément Rambour Model Free Prediction and Control 27 / 44

Page 37: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

GLIE

Condition suffisante pour la convergence vers qπ∗

DéfinitionUn processus d’amélioration de politique est dit Greedy in the Limitwith Infinite Exploration (GLIE) si

Tous les états-action sont explorés infiniments :

limk←∞

Nk(s, a) =∞

La politique converge vers une politique greedy :

limk←∞

πk(a|s) = 1argmaxa∈A

qk(s,a)

Par exemple : un apprentissage ε-greedy est GLIE si εk = 1k

Clément Rambour Model Free Prediction and Control 27 / 44

Page 38: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

ε-greedy policy iteration

Intuition : explorer continuement tout en convergent vers lapolitique optimale

Toutes les actions ont une probabilité non nulleL’action qui maximise q(s, a est prise avec la probabilité εLes autres actions peuvent être choisie avec une probabilité ε

πgreedy(a|s) =

εm + 1− ε si a∗ = argmax

a∈Aq(s, a)

εm sinon

avec m = |A|

Clément Rambour Model Free Prediction and Control 28 / 44

Page 39: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

GLIE Monte Carlo

Échantillonner k épisodes basés sur π : {S1, A1, R2, · · · , ST }Pour chaque couple (st, at) :

N(st, at)← N(st, at) + 1

Q(st, at)← Q(st, at) +1

N(st, at)(Gt −Q(st, at))

Amélioration ε-greedy de π :

ε← 1

π ← ε− greedy(Q)

GLIE MC converge vers la fonction action-valeur optimale

Clément Rambour Model Free Prediction and Control 29 / 44

Page 40: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

GLIE Monte Carlo

Échantillonner k épisodes basés sur π : {S1, A1, R2, · · · , ST }Pour chaque couple (st, at) :

N(st, at)← N(st, at) + 1

Q(st, at)← Q(st, at) +1

N(st, at)(Gt −Q(st, at))

Amélioration ε-greedy de π :

ε← 1

π ← ε− greedy(Q)

GLIE MC converge vers la fonction action-valeur optimale

Clément Rambour Model Free Prediction and Control 29 / 44

Page 41: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Table of Contents

1 Introduction

2 Model Free PredictionMonte Carlo samplingTD0 LearningTD(λ) Learning

3 Model Free ControlIntroductionOn policy et SARSAOff policy et Q-learning

Clément Rambour Model Free Prediction and Control 30 / 44

Page 42: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

SARSA

Clément Rambour Model Free Prediction and Control 31 / 44

Page 43: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

SARSA

SARSA converge vers la politique optimale (Q(s, a)→ q∗(s, a)) si,La séquence des politiques est GLIELe learning rate αt vérifie les conditions suivantes :

∞∑t=1

αt =∞

∞∑t=1

α2t <∞

Clément Rambour Model Free Prediction and Control 32 / 44

Page 44: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

SARSA

SARSA converge vers la politique optimale (Q(s, a)→ q∗(s, a)) si,La séquence des politiques est GLIELe learning rate αt vérifie les conditions suivantes :

∞∑t=1

αt =∞

∞∑t=1

α2t <∞

Clément Rambour Model Free Prediction and Control 32 / 44

Page 45: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Extensions

SARSA similaire à l’estimation avec TD(0) :On estime la fonction de valeur d’état-action q(s, a) par mise àjour de la TD Target :

Q(st, at)← Q(st, at) + α(Rt+1 + γQ(st+1, at+1)−Q(st, at))

contrôle : on cherche à suivre une politique optimale →πt = ε-greedy(Q(st, at))

L’estimation peut être étendue aux extension de TD déjàrencontrée

Clément Rambour Model Free Prediction and Control 33 / 44

Page 46: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Extensions

SARSA similaire à l’estimation avec TD(0) :On estime la fonction de valeur d’état-action q(s, a) par mise àjour de la TD Target :

Q(st, at)← Q(st, at) + α(Rt+1 + γQ(st+1, at+1)−Q(st, at))

contrôle : on cherche à suivre une politique optimale →πt = ε-greedy(Q(st, at))

L’estimation peut être étendue aux extension de TD déjàrencontrée

Clément Rambour Model Free Prediction and Control 33 / 44

Page 47: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Table of Contents

1 Introduction

2 Model Free PredictionMonte Carlo samplingTD0 LearningTD(λ) Learning

3 Model Free ControlIntroductionOn policy et SARSAOff policy et Q-learning

Clément Rambour Model Free Prediction and Control 34 / 44

Page 48: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Off-Policy learning

Intuition : Apprendre d’une autre politiqueÉvaluer une politique cible π(s|a) : vπ(s), qπ(s, a)Suivre une politique différente (comportement) : µ(a|s)

{S1, A1, R2, ..., ST } ∼ µ

Nombreux cas d’utilisation :Apprendre d’humainsRé-utiliser l’expérience passéeApprendre une politique optimale tout en suivant unepolitique exploratoire

Clément Rambour Model Free Prediction and Control 35 / 44

Page 49: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Importance sampling

Espérance de f(X) avec X ∼ P :

EP [f(X)] =∑

P (X)f(X)

Espérance de f(X) par rapport à une autre distribution Q :

EP [f(X)] =∑

P (X)f(X)

=∑

Q(X)P (X)

Q(x)f(X)

= EX∼Q[P (X)

Q(X)f(X)] (1)

Clément Rambour Model Free Prediction and Control 36 / 44

Page 50: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Off-Policy Monte-Carlo

Utiliser les retour générer par µ (le comportement) pour évaluerπ

Pondération du retour par le facteur de correction sur toutl’épisode :

Gπ/µt =

La valeur est mise à jour grâce au retour corrigé :

V (St)← V (St) + α(Gπ/µt − V (St))

Pour que la correction soit définie : µ 6= 0 si π 6= 0

Peut entraîner des problème de stabilité

Clément Rambour Model Free Prediction and Control 37 / 44

Page 51: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Off-Policy TD

Utiliser les retour générer par µ (le comportement) pour évaluerπ

Pondération de la TD target facteur de correction sur pour uneétape lors de la mise à jour :

V (St)← V (St) + α

(π(At|St)µ(At|St)

(Rt+1 + γV (St+1))− V (St)

)Meilleure variance et conditionnement que off-policy MCsampling

Clément Rambour Model Free Prediction and Control 38 / 44

Page 52: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Q-learning

Off policy TD learnging pour l’estimation de la fonction devaleur (état,action)Importance sampling plus nécessaireAt+1 ∼ µ(.|St+1)

On considère une action alternative A′ ∼ π(.|St+1)

La TD target est distribuée selon µ mais on a choisi une actionA′ donnée par π :

Q(St, At ← Q(st, At) + α((Rt+1 + γQ(St+1, A

′))−Q(St, At))

Clément Rambour Model Free Prediction and Control 39 / 44

Page 53: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Q-learning

Choix de µ et π ?µ gère l’explorationπ cherche l’optimalitéSimple schéma off-policy à partir d’une même politique :

π(S|A) = greedy(Q) = argmaxa′

Q(S,A)

µ(S|A) = ε− greedy(Q)

Clément Rambour Model Free Prediction and Control 40 / 44

Page 54: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Q-learning

Clément Rambour Model Free Prediction and Control 41 / 44

Page 55: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Q-learngin

Clément Rambour Model Free Prediction and Control 42 / 44

Page 56: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

SARSA vs. QLearning

Clément Rambour Model Free Prediction and Control 43 / 44

Page 57: Model Free Prediction and Control

Introduction Model Free Prediction Model Free Control

Bibliogrphie

Reinforcement Learning : an introduction, second edition,Richard S. Sutton and Adrew G. BartoReinforcement Learning courses, David Silver, DeepMind(https ://www.davidsilver.uk/)

Clément Rambour Model Free Prediction and Control 44 / 44

Top Related