
Lisa Torrey

University of Wisconsin – Madison

Doctoral Defense

May 2009

Relational Transfer in Reinforcement Learning

Transfer Learning

Given: source task S

Learn: target task T

Reinforcement Learning

[Diagram: the agent-environment loop]

The agent observes state s1; its Q-values start at Q(s1, a) = 0.
The policy chooses an action: π(s1) = a1.
The environment returns the next state and reward: δ(s1, a1) = s2, r(s1, a1) = r2.
The agent updates its estimate: Q(s1, a1) ← Q(s1, a1) + Δ.
The loop continues: π(s2) = a2, δ(s2, a2) = s3, r(s2, a2) = r3.

Exploration vs. exploitation

Goal: maximize reward

Reference: Sutton and Barto, Reinforcement Learning: An Introduction, MIT Press 1998
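To make the update Δ concrete, here is a minimal tabular Q-learning sketch in Python (a standard formulation from Sutton and Barto; the learning rate alpha, discount gamma, and epsilon-greedy exploration are illustrative choices, and the thesis itself uses the batch, function-approximation learner described later):

import random
from collections import defaultdict

Q = defaultdict(float)                 # Q[(state, action)], initialized to 0
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # illustrative constants

def policy(state, actions):
    # Exploration vs. exploitation: epsilon-greedy choice of pi(state)
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s_next, actions):
    # Q(s, a) <- Q(s, a) + Delta, with Delta the temporal-difference error
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])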

Learning Curves

[Plot: performance vs. training time]

higher start
higher slope
higher asymptote

RoboCup Domain

3-on-2 BreakAway

3-on-2 KeepAway

3-on-2 MoveDownfield

2-on-1 BreakAway

Qa(s) = w1f1 + w2f2 + w3f3 + …

Hand-coded defenders; single learning agent
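The Q-function above is just a weighted sum of state features; a minimal sketch in Python (feature values and weights below are made up for illustration):

# Linear Q-function for one action: Q_a(s) = w1*f1 + w2*f2 + w3*f3 + ...
def q_value(weights, features):
    return sum(w * f for w, f in zip(weights, features))

print(q_value([0.5, -0.1, 0.02], [0.8, 12.0, 30.0]))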

Transfer in Reinforcement Learning

Starting-point methods

Imitation methods

Hierarchical methods

Alteration methods

New RL algorithms

Relational Transfer

pass(t1)
pass(t2)
pass(Teammate)

Opponent 1
Opponent 2

IF feature(Opponent) THEN

Advice transfer: Advice taking (ECML 2005), Inductive logic programming, Skill-transfer algorithm (ECML 2006)

Macro transfer: Macro-operators, Demonstration, Macro-transfer algorithm (ILP 2007)

Markov Logic Network transfer: Markov Logic Networks, MLNs in macros, MLN Q-function transfer algorithm (AAAI workshop 2008), MLN policy-transfer algorithm (ILP 2009)

Thesis Contributions

Advice transfer: Advice taking, Inductive logic programming, Skill-transfer algorithm

Macro transfer: Macro-operators, Demonstration, Macro-transfer algorithm

Markov Logic Network transfer: Markov Logic Networks, MLNs in macros, MLN Q-function transfer algorithm, MLN policy-transfer algorithm

Thesis Contributions

Advice

IF these conditions hold

THEN pass is the best action

Transfer via Advice

Try what worked in a previous task!

Learning Without Advice

Batch Reinforcement Learning via Support Vector Regression (RL-SVR)

[Diagram: the agent interacts with the environment for Batch 1, Batch 2, …, and after each batch computes new Q-functions]

Find Q-functions (one per action) that minimize: ModelSize + C × DataMisfit
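As a rough illustration of this objective (not the thesis's actual RL-SVR solver, which uses linear-programming support vector regression), the sketch below fits one action's linear Q-function by minimizing a one-norm model-size term plus C times the absolute data misfit, using SciPy's general-purpose optimizer:

import numpy as np
from scipy.optimize import minimize

def fit_q_function(X, y, C=1.0):
    # X: feature matrix for states where this action was taken
    # y: target Q-values for those states
    # Illustrative formulation only; see the KBKR references for the real solver.
    def objective(w):
        model_size = np.sum(np.abs(w))            # ModelSize (one-norm of weights)
        data_misfit = np.sum(np.abs(X @ w - y))   # DataMisfit (absolute residuals)
        return model_size + C * data_misfit
    w0 = np.zeros(X.shape[1])
    return minimize(objective, w0, method="Powell").x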

Learning With Advice

Batch Reinforcement Learning with Advice (KBKR)

[Diagram: the same batch loop, with advice included when the Q-functions are computed]

Find Q-functions that minimize: ModelSize + C × DataMisfit + µ × AdviceMisfit

Robust to negative transfer!
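Continuing the sketch above, advice of the form "IF these conditions hold THEN pass is best" can be folded in as a soft penalty weighted by µ: whenever a state satisfies the advice conditions but the advised action's Q-value does not beat the alternatives, the shortfall is charged. This is only schematic; the actual KBKR formulation encodes advice as linear constraints with slack variables, which is what makes bad advice cheap to ignore.

import numpy as np

def advice_misfit(w_by_action, X_advice, advised_action, margin=0.0):
    # Penalty for states satisfying the advice conditions (rows of X_advice)
    # where the advised action's Q-value is not highest by `margin`.
    # Schematic only; names and structure are assumptions, not the thesis code.
    q = {a: X_advice @ w for a, w in w_by_action.items()}
    best_other = np.max([q[a] for a in q if a != advised_action], axis=0)
    shortfall = best_other + margin - q[advised_action]
    return np.sum(np.maximum(shortfall, 0.0))

The full objective is then ModelSize + C × DataMisfit + µ × AdviceMisfit, minimized jointly over the weights of all actions.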

Inductive Logic Programming

IF [ ] THEN pass(Teammate)

IF distance(Teammate) ≤ 5 AND angle(Teammate, Opponent) ≥ 15 THEN pass(Teammate)

IF distance(Teammate) ≤ 5 AND angle(Teammate, Opponent) ≥ 30 THEN pass(Teammate)

IF distance(Teammate) ≤ 5 THEN pass(Teammate)

IF distance(Teammate) ≤ 10 THEN pass(Teammate)

$F(\beta) = \frac{(1 + \beta^2) \times Precision \times Recall}{(\beta^2 \times Precision) + Recall}$

Reference: De Raedt, Logical and Relational Learning, Springer 2008
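The F(β) measure is easy to compute directly; a small helper like this (names are mine) is all that is needed to score candidate clauses, with a large β weighting recall more heavily than precision:

def f_beta(precision, recall, beta):
    # F(beta) = (1 + beta^2) * P * R / (beta^2 * P + R)
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Example: a clause with precision 0.9 and recall 0.5, scored with beta = 1
print(f_beta(0.9, 0.5, 1))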

Skill-Transfer Algorithm

[Diagram: Source task → ILP → transferred skill → Advice Taking → Target task]

IF distance(Teammate) ≤ 5 AND angle(Teammate, Opponent) ≥ 30
THEN pass(Teammate)

Selected Results: Skill transfer from 3-on-2 MoveDownfield to 4-on-3 MoveDownfield

IF distance(me, Teammate) ≥ 15
distance(me, Teammate) ≤ 27
distance(Teammate, rightEdge) ≤ 10
angle(Teammate, me, Opponent) ≥ 24
distance(me, Opponent) ≥ 4
THEN pass(Teammate)

Selected Results: Skill transfer from several tasks to 3-on-2 BreakAway

Torrey et al. ECML 2006

Advice transfer: Advice taking, Inductive logic programming, Skill-transfer algorithm

Macro transfer: Macro-operators, Demonstration, Macro-transfer algorithm

Markov Logic Network transfer: Markov Logic Networks, MLNs in macros, MLN Q-function transfer algorithm, MLN policy-transfer algorithm

Thesis Contributions

Macro-Operators

pass(Teammate)

move(Direction)

shoot(goalRight)

shoot(goalLeft)

IF [ ... ] THEN pass(Teammate)

IF [ ... ] THEN move(ahead)

IF [ ... ] THEN shoot(goalRight)

IF [ ... ] THEN shoot(goalLeft)

IF [ ... ] THEN pass(Teammate)

IF [ ... ] THEN move(left)

IF [ ... ] THEN shoot(goalRight)

IF [ ... ] THEN shoot(goalRight)

Demonstration Method

[Plot: which policy is used over the course of target-task training; the transferred source-task policy is used for an initial demonstration period, then the target-task policy takes over]

No more protection against negative transfer!

But… best-case scenario could be very good.

Macro-Transfer Algorithm

[Diagram: Source task → ILP → macro → Demonstration → Target task]

Macro-Transfer Algorithm: Learning structures

Positive: BreakAway games that score
Negative: BreakAway games that didn't score

ILP

IF actionTaken(Game, StateA, pass(Teammate), StateB)
actionTaken(Game, StateB, move(Direction), StateC)
actionTaken(Game, StateC, shoot(goalRight), StateD)
actionTaken(Game, StateD, shoot(goalLeft), StateE)
THEN isaGoodGame(Game)

Macro-Transfer Algorithm: Learning rules for arcs

Positive: states in good games that took the arc
Negative: states in good games that could have taken the arc but didn't

ILP

[Diagram: macro nodes pass(Teammate) → shoot(goalRight), with learned rules on each arc]
IF [ … ] THEN enter(State)
IF [ … ] THEN loop(State, Teammate)

Macro-Transfer Algorithm: Selecting and scoring rules

Rule 1: Precision = 1.0; Rule 2: Precision = 0.99; Rule 3: Precision = 0.96; …

Does the rule increase F(10) of the ruleset? If yes, add it to the ruleset.

Rule score = (# good games that follow the rule) / (# games that follow the rule)
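A minimal sketch of this selection-and-scoring step (helper names are mine, and ruleset_f_beta stands in for evaluating a whole ruleset): rules are considered in order of decreasing precision, each is kept only if it raises the ruleset's F(10), and each kept rule is scored by the fraction of games following it that are good.

def select_rules(candidate_rules, games, ruleset_f_beta, beta=10):
    # Greedy selection: keep a rule only if it raises F(beta) of the ruleset.
    # candidate_rules should be sorted by decreasing precision.
    ruleset = []
    for rule in candidate_rules:
        if ruleset_f_beta(ruleset + [rule], games, beta) > ruleset_f_beta(ruleset, games, beta):
            ruleset.append(rule)
    return ruleset

def rule_score(rule, games):
    # Rule score = (# good games that follow the rule) / (# games that follow the rule)
    followers = [g for g in games if rule.matches(g)]
    return sum(g.is_good for g in followers) / len(followers) if followers else 0.0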

Selected Results: Macro transfer from 2-on-1 BreakAway to 3-on-2 BreakAway

pass(Teammate) move(ahead)

pass(Teammate) move(right) shoot(goalLeft)

move(right) move(left) shoot(goalLeft) shoot(goalRight)

move(left) shoot(goalLeft) shoot(goalRight)

move(ahead) move(right) shoot(goalLeft) shoot(goalRight)

move(away) shoot(goalLeft) shoot(goalRight)

move(right) shoot(goalLeft) shoot(goalRight)

shoot(goalLeft) shoot(goalRight)

shoot(GoalPart)

Selected Results: Macro transfer from 2-on-1 BreakAway to 3-on-2 BreakAway

Torrey et al. ILP 2007

Selected Results: Macro self-transfer in 2-on-1 BreakAway

[Plot: probability of goal vs. training games]
Asymptote: 56%
Initial: 1%
Single macro: 32%
Multiple macros: 43%

Advice transfer: Advice taking, Inductive logic programming, Skill-transfer algorithm

Macro transfer: Macro-operators, Demonstration, Macro-transfer algorithm

Markov Logic Network transfer: Markov Logic Networks, MLNs in macros, MLN Q-function transfer algorithm, MLN policy-transfer algorithm

Thesis Contributions

Markov Logic Networks

Formulas (F):
evidence1(X) AND query(X)
evidence2(X) AND query(X)

Weights (W):
w0 = 1.1
w1 = 0.9

$P(world) = \frac{1}{Z} \exp\left( \sum_{i \in F} w_i \, n_i(world) \right)$

n_i(world) = # true groundings of the ith formula in the world

[Diagram: ground Markov network connecting query(x1) and query(x2) to their evidence nodes e1, e2, …]

Reference: Richardson and Domingos, Markov Logic Networks, Machine Learning 2006
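As a toy illustration of this formula (the two formulas and weights mirror the slide; the enumeration code is mine), one can weight each possible world by exp of its sum of weighted true-grounding counts and normalize by Z:

import math

# Toy MLN: two formulas "evidence_i(X) AND query(X)" with the weights from the slide.
weights = [1.1, 0.9]
evidence = [True, True]   # both evidence atoms assumed true for this illustration

def n_true(i, query_value):
    # Number of true groundings of formula i in this world (0 or 1 here)
    return int(evidence[i] and query_value)

def unnormalized(query_value):
    return math.exp(sum(w * n_true(i, query_value) for i, w in enumerate(weights)))

Z = unnormalized(False) + unnormalized(True)
print("P(query = true) =", unnormalized(True) / Z)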

MLN Weight Learning

From ILP: IF [ ... ] THEN …

Alchemy weight learning

MLN: w0 = 1.1

Reference: http://alchemy.cs.washington.edu

Markov Logic Networks in Macros

IF distance(Teammate, goal) < 12 THEN pass(Teammate)   (matches t1, score = 0.92)

IF angle(Teammate, defender) > 30 THEN pass(Teammate)   (matches t2, score = 0.88)

MLN for pass(Teammate): P(t1) = 0.35, P(t2) = 0.65

Markov Logic Networks in Macros

Formulas:
pass(Teammate) AND angle(Teammate, defender) > 30
pass(Teammate) AND distance(Teammate, goal) < 12

[Diagram: ground network with query nodes pass(t1) and pass(t2) connected to evidence nodes distance(t1, goal) < 12, angle(t1, defender) > 30, distance(t2, goal) < 12, angle(t2, defender) > 30]

Selected Results: Macro transfer from 2-on-1 BreakAway to 3-on-2 BreakAway

Selected Results: Macro self-transfer in 2-on-1 BreakAway

[Plot: probability of goal vs. training games]
Asymptote: 56%
Initial: 1%
Regular macro: 32%
Macro with MLN: 43%

MLN Q-Function Transfer Algorithm

[Diagram: Source task → ILP, Alchemy → MLN Q-functions → Demonstration → Target task]

MLN Q-functions: one MLN per action, each mapping a state to a Q-value
MLN for action 1: state → Q-value
MLN for action 2: state → Q-value

MLN Q-Function

Bins: 0 ≤ Qa < 0.2, 0.2 ≤ Qa < 0.4, 0.4 ≤ Qa < 0.6, …

[Plots: probability vs. bin number, one distribution per state]

$Q_a(s) = \sum_{bins} p_{bin} \cdot E[Q_a \mid bin]$
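A small sketch of this estimate (variable names are mine): the MLN assigns a probability to each Q-value bin for the current state, and the state's Q-value is the probability-weighted average of the bins' expected Q-values.

def mln_q_value(bin_probs, bin_expected_q):
    # Q_a(s) = sum over bins of P(bin) * E[Q_a | bin]
    # bin_probs: probabilities the MLN assigns to each bin for this state
    # bin_expected_q: expected Q-value of training examples in each bin
    return sum(p * q for p, q in zip(bin_probs, bin_expected_q))

# Example with three bins [0, 0.2), [0.2, 0.4), [0.4, 0.6)
print(mln_q_value([0.2, 0.5, 0.3], [0.1, 0.3, 0.5]))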

Selected Results: MLN Q-function transfer from 2-on-1 BreakAway to 3-on-2 BreakAway

IF distance(me, GoalPart) ≥ 42
distance(me, Teammate) ≥ 39
THEN pass(Teammate) falls into [0, 0.11]

IF angle(topRight, goalCenter, me) ≤ 42
angle(topRight, goalCenter, me) ≥ 55
angle(goalLeft, me, goalie) ≥ 20
angle(goalCenter, me, goalie) ≤ 30
THEN pass(Teammate) falls into [0.11, 0.27]

IF distance(Teammate, goalCenter) ≤ 9
angle(topRight, goalCenter, me) ≤ 85
THEN pass(Teammate) falls into [0.27, 0.43]

Selected Results: MLN Q-function transfer from 2-on-1 BreakAway to 3-on-2 BreakAway

Torrey et al. AAAI workshop 2008

MLN Policy-Transfer Algorithm

[Diagram: Source task → ILP, Alchemy → MLN policy → Demonstration → Target task]

MLN(F, W): maps a state and action to a probability

MLN Policy
[Table: probability of each action: move(ahead), pass(Teammate), shoot(goalLeft), …]

Policy = highest-probability action
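A minimal sketch of acting with such a policy (mln_action_prob is a stand-in for MLN inference, not an Alchemy API): compute each action's probability in the current state and take the most probable one.

def mln_policy(state, actions, mln_action_prob):
    # Policy = highest-probability action under the MLN.
    # mln_action_prob(state, action) is assumed to return the MLN's
    # probability that `action` is the best choice in `state`.
    return max(actions, key=lambda a: mln_action_prob(state, a))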

Selected Results: MLN policy transfer from 2-on-1 BreakAway to 3-on-2 BreakAway

IF angle(topRight, goalCenter, me) ≤ 70
timeLeft ≥ 98
distance(me, Teammate) ≥ 3
THEN pass(Teammate)

IF distance(me, GoalPart) ≥ 36
distance(me, Teammate) ≥ 12
timeLeft ≥ 91
angle(topRight, goalCenter, me) ≤ 80
THEN pass(Teammate)

IF distance(me, GoalPart) ≥ 27
angle(topRight, goalCenter, me) ≤ 75
distance(me, Teammate) ≥ 9
angle(Teammate, me, goalie) ≥ 25
THEN pass(Teammate)

Selected Results: MLN policy transfer from 2-on-1 BreakAway to 3-on-2 BreakAway

Torrey et al. ILP 2009

Selected Results: MLN self-transfer in 2-on-1 BreakAway

[Plot: probability of goal vs. training games]
Asymptote: 56%
Initial: 1%
MLN Q-function: 59%
MLN policy: 65%

Advice transfer: Advice taking (ECML 2005), Inductive logic programming, Skill-transfer algorithm (ECML 2006)

Macro transfer: Macro-operators, Demonstration, Macro-transfer algorithm (ILP 2007)

Markov Logic Network transfer: Markov Logic Networks, MLNs in macros, MLN Q-function transfer algorithm (AAAI workshop 2008), MLN policy-transfer algorithm (ILP 2009)

Thesis Contributions

Starting-point: Taylor et al. 2005, value-function transfer

Imitation: Fernandez and Veloso 2006, policy reuse

Hierarchical: Mehta et al. 2008, MaxQ transfer

Alteration: Walsh et al. 2006, aggregate states

New algorithms: Sharma et al. 2007, case-based RL

Related Work

Transfer can improve reinforcement learning: initial performance and learning speed.

Advice transfer: low initial performance, steep learning curves, robust to negative transfer.

Macro transfer and MLN transfer: high initial performance, shallow learning curves, vulnerable to negative transfer.

Conclusions

Conclusions

Close-transfer scenarios and distant-transfer scenarios:
[Diagrams: rankings of Multiple Macro, MLN Policy, Single Macro, MLN Q-Function, and Skill Transfer using = and ≥ relations, one ranking for close transfer and one for distant transfer]

Future Work

Multiple source tasks (Task S1, …) → Task T

Theoretical results:
How high can the initial performance be?
How quickly can the target-task learner improve?
How many episodes are "saved" through transfer?

Future Work

[Diagram: what is the relationship between Source and Target?]

Joint learning and inference in macros: single search, combined rule/weight learning

Future Work

pass(Teammate)

move(Direction)

Future Work: Refinement of transferred knowledge

Macros: revising rule scores, relearning rules, relearning structure
MLNs: revising weights, relearning rules

[Diagram: a too-specific clause or a too-general clause is revised into a better clause]
(Mihalkova et al. 2007)

Relational reinforcement learning: Q-learning with an MLN Q-function, policy search with MLN policies or macros

Future Work

MLN Q-functions lose too much information:
$Q_{action}(state) = \sum_{bins} p_{bin} \cdot E[Q \mid bin]$

Diverse tasks

Complex testbeds

Automated mapping

Protection against negative transfer

Future Work: General challenges in RL transfer

Advisor: Jude Shavlik

Collaborators: Trevor Walker and Richard Maclin

Committee: David Page, Mark Craven, Jerry Zhu, Michael Coen

UW Machine Learning Group

Grants: DARPA HR0011-04-1-0007, NRL N00173-06-1-G002, DARPA FA8650-06-C-7606

Acknowledgements

Backup Slides

Transfer in Reinforcement Learning

Starting-point methods

[Diagram: the source task provides the initial Q-table for target-task training]

No transfer:        Transfer:
0 0 0 0             2 5 4 8
0 0 0 0             9 1 7 2
0 0 0 0             5 9 1 4

Transfer in Reinforcement Learning

Imitation methods

[Plot: which policy (source or target) is used over the course of target-task training]

Transfer in Reinforcement Learning

Hierarchical methods

[Diagram: the high-level task Soccer decomposed into lower-level skills Run, Kick, Pass, and Shoot]

Transfer in Reinforcement Learning

Alteration methods

Task S: original states, original actions, original rewards → new states, new actions, new rewards

Policy Transfer Algorithm

[Diagram: Source task → transferred policy rules → Advice Taking → Target task]

IF Q(pass(Teammate)) > Q(other)

THEN pass(Teammate)

Advice Taking

[Flowchart: labeling training examples for pass(X). Based on whether the action taken was pass(X), whether the outcome was caught(X), whether pass(X) was good or clearly best, whether some action was good, and whether pass(X) was clearly bad, each state becomes a positive example for pass(X), a negative example for pass(X), or is rejected.]

Skill Transfer Algorithm

Markov Logic Networks in Macros: Exact Inference

Formulas:
pass(t1) AND angle(t1, defender) > 30
pass(t1) AND distance(t1, goal) < 12

Let x1 be the world where pass(t1) is true and x0 the world where pass(t1) is false.

$P(x_1) = \frac{1}{Z} \exp\left( \sum_{i \in F} w_i \, n_i(x_1) \right)$

$P(x_0) = \frac{1}{Z} \exp\left( \sum_{i \in F} w_i \, n_i(x_0) \right)$

Note: when pass(t1) is false, no formulas are true, so

$P(x_0) = \frac{1}{Z} \exp(0) = \frac{1}{Z}$

$P(x_1) = \frac{1}{Z} \exp\left( \sum_{i \in F} w_i \, n_i \right)$

Markov Logic Networks in Macros: Exact Inference

$P(x_0) + P(x_1) = 1$

$\frac{1}{Z} + \frac{1}{Z} \exp\left( \sum_{i \in F} w_i \, n_i(x_1) \right) = 1$

$Z = 1 + \exp\left( \sum_{i \in F} w_i \, n_i(x_1) \right)$

$P(pass(t_1) = true) = \frac{\exp\left( \sum_{i \in F} w_i \, n_i \right)}{1 + \exp\left( \sum_{i \in F} w_i \, n_i \right)}$
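For a quick numerical check of this closed form (using the toy weights from the earlier MLN slide and assuming both evidence atoms hold, so each formula has one true grounding), the probability is just a logistic function of the summed weights:

import math

weights = [1.1, 0.9]   # toy formula weights from the MLN slide
n_true = [1, 1]        # true groundings of each formula when pass(t1) is true

s = sum(w * n for w, n in zip(weights, n_true))
p_true = math.exp(s) / (1 + math.exp(s))   # P(pass(t1) = true)
print(p_true)                              # about 0.88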