Lisa Torrey
University of Wisconsin – Madison
Doctoral Defense
May 2009
Relational Transfer in Reinforcement Learning
Reinforcement Learning
[Diagram: the agent–environment loop]
The agent begins in state s1 with Q(s1, a) = 0 and chooses an action by its policy: π(s1) = a1.
The environment responds: δ(s1, a1) = s2, r(s1, a1) = r2.
The agent updates Q(s1, a1) ← Q(s1, a1) + Δ and acts again: π(s2) = a2.
The environment responds: δ(s2, a2) = s3, r(s2, a2) = r3.
Exploration vs. exploitation
Goal: maximize reward
Reference: Sutton and Barto, Reinforcement Learning: An Introduction, MIT Press 1998
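To make the loop concrete, here is a minimal tabular Q-learning sketch in Python; the environment interface (`env.reset()`, `env.step()`), the action list, and the parameter values are illustrative stand-ins, not the thesis implementation.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)  # Q(s, a) starts at 0, as on the slide
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Exploration vs. exploitation: epsilon-greedy version of pi(s)
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            # Environment responds: delta(s, a) = s', r(s, a) = r
            next_state, reward, done = env.step(action)
            # Q(s, a) <- Q(s, a) + Delta
            best_next = max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```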
RoboCup Domain
3-on-2 BreakAway
3-on-2 KeepAway
3-on-2 MoveDownfield
2-on-1 BreakAway
Qa(s) = w1·f1 + w2·f2 + w3·f3 + …
Hand-coded defenders
Single learning agent
Transfer in Reinforcement Learning
Starting-point methods
Imitation methods
Hierarchical methods
Alteration methods
New RL algorithms
Relational Transfer
pass(t1)
pass(t2)
pass(Teammate)
Opponent 1
Opponent 2
IF feature(Opponent) THEN …
Thesis Contributions
Advice transfer: advice taking (ECML 2005), inductive logic programming, skill-transfer algorithm (ECML 2006)
Macro transfer: macro-operators, demonstration, macro-transfer algorithm (ILP 2007)
Markov Logic Network transfer: Markov Logic Networks, MLNs in macros, MLN Q-function transfer algorithm (AAAI workshop 2008), MLN policy-transfer algorithm (ILP 2009)
Learning Without Advice
Batch Reinforcement Learning via Support Vector Regression (RL-SVR)
[Diagram: Environment ↔ Agent interaction producing Batch 1, Batch 2, …; after each batch, compute Q-functions]
Find Q-functions (one per action) that minimize: ModelSize + C × DataMisfit
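As a rough sketch of one batch iteration, the following Python uses scikit-learn's SVR as a stand-in for the thesis's support-vector regression formulation; its C parameter plays the role of the trade-off in ModelSize + C × DataMisfit. The batch tuple format is an assumption for the example.

```python
from sklearn.svm import SVR

def fit_q_functions(batch, actions, gamma=0.9):
    """Fit one Q-model per action from a batch of (features, action,
    reward, max next-state Q) tuples."""
    models = {}
    for action in actions:
        X, y = [], []
        for features, a, reward, next_q_max in batch:
            if a == action:
                X.append(features)
                y.append(reward + gamma * next_q_max)  # one-step Q target
        if X:
            models[action] = SVR(kernel="linear", C=1.0).fit(X, y)
    return models
```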
Learning With Advice
Batch Reinforcement Learning with Advice (KBKR)
[Diagram: the same batch loop — Environment ↔ Agent, Batch 1, Batch 2, … — with Advice as an additional input when computing Q-functions]
Find Q-functions that minimize: ModelSize + C × DataMisfit + µ × AdviceMisfit
Robust to negative transfer!
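To make the objective concrete, here is a toy Python version of the loss for a linear Q-function. Real KBKR encodes advice as linear constraints that are softened into the optimization; the hinge-style advice penalty and all names below are simplifications.

```python
import numpy as np

def kbkr_loss(w, X, y, advice_X, advice_target, C=1.0, mu=1.0):
    """Toy KBKR objective: ModelSize + C * DataMisfit + mu * AdviceMisfit
    for a linear Q-function Qa(s) = w . f(s)."""
    model_size = np.sum(np.abs(w))
    data_misfit = np.sum(np.abs(X @ w - y))
    # Simplified advice: "Qa should be at least advice_target on advice states"
    advice_misfit = np.sum(np.maximum(0.0, advice_target - advice_X @ w))
    return model_size + C * data_misfit + mu * advice_misfit
```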
Inductive Logic Programming
IF [ ] THEN pass(Teammate)
IF distance(Teammate) ≤ 5 AND angle(Teammate, Opponent) ≥ 15 THEN pass(Teammate)
IF distance(Teammate) ≤ 5 AND angle(Teammate, Opponent) ≥ 30 THEN pass(Teammate)
IF distance(Teammate) ≤ 5 THEN pass(Teammate)
IF distance(Teammate) ≤ 10 THEN pass(Teammate)
…
F(β) = ((1 + β²) × Precision × Recall) / ((β² × Precision) + Recall)
Reference: De Raedt, Logical and Relational Learning, Springer 2008
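A direct Python rendering of this measure; with β = 10, as used below for selecting macro rules, recall dominates precision.

```python
def f_measure(precision, recall, beta):
    """F(beta): weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    return ((1 + beta**2) * precision * recall) / (beta**2 * precision + recall)

print(f_measure(0.9, 0.5, beta=1))   # balanced: ~0.64
print(f_measure(0.9, 0.5, beta=10))  # recall-heavy: ~0.50
```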
Skill-Transfer Algorithm
Source → Target
IF distance(Teammate) ≤ 5 AND angle(Teammate, Opponent) ≥ 30 THEN pass(Teammate)
ILP
Advice Taking
Selected Results
Skill transfer from 3-on-2 MoveDownfield to 4-on-3 MoveDownfield
IF distance(me, Teammate) ≥ 15
AND distance(me, Teammate) ≤ 27
AND distance(Teammate, rightEdge) ≤ 10
AND angle(Teammate, me, Opponent) ≥ 24
AND distance(me, Opponent) ≥ 4
THEN pass(Teammate)
Thesis Contributions
Advice transfer: advice taking, inductive logic programming, skill-transfer algorithm
Macro transfer: macro-operators, demonstration, macro-transfer algorithm
Markov Logic Network transfer: Markov Logic Networks, MLNs in macros, MLN Q-function transfer algorithm, MLN policy-transfer algorithm
Macro-Operators
pass(Teammate)
move(Direction)
shoot(goalRight)
shoot(goalLeft)
IF [ ... ] THEN pass(Teammate)
IF [ ... ] THEN move(ahead)
IF [ ... ] THEN shoot(goalRight)
IF [ ... ] THEN shoot(goalLeft)
IF [ ... ] THEN pass(Teammate)
IF [ ... ] THEN move(left)
IF [ ... ] THEN shoot(goalRight)
IF [ ... ] THEN shoot(goalRight)
Demonstration Method
Source → Target
[Diagram: during initial target-task training, the transferred policy is used]
No more protection against negative transfer!
But… best-case scenario could be very good.
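A sketch of the demonstration schedule in Python: the agent follows the transferred source policy for an initial period, learning from those episodes, then continues with its own policy. Episode counts and interfaces are illustrative.

```python
def run_with_demonstration(env, transferred_policy, learner,
                           demo_episodes=100, total_episodes=3000):
    for episode in range(total_episodes):
        state = env.reset()
        done = False
        while not done:
            if episode < demo_episodes:
                action = transferred_policy(state)      # demonstration phase
            else:
                action = learner.choose_action(state)   # standard RL phase
            next_state, reward, done = env.step(action)
            learner.update(state, action, reward, next_state)  # learn throughout
            state = next_state
```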
Macro-Transfer Algorithm: Learning Structures
Positive: BreakAway games that score
Negative: BreakAway games that didn't score
ILP
IF actionTaken(Game, StateA, pass(Teammate), StateB)
AND actionTaken(Game, StateB, move(Direction), StateC)
AND actionTaken(Game, StateC, shoot(goalRight), StateD)
AND actionTaken(Game, StateD, shoot(goalLeft), StateE)
THEN isaGoodGame(Game)
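Because the clause chains each action's end state into the next action's start state, a game follows the structure when the macro's actions occur consecutively. A small Python check, with the trace format assumed for illustration:

```python
def follows_macro(game_actions, macro_actions):
    """True if the macro's action sequence appears consecutively in the game."""
    n = len(macro_actions)
    return any(game_actions[i:i + n] == macro_actions
               for i in range(len(game_actions) - n + 1))

game = ["pass", "move", "shoot_goalRight", "shoot_goalLeft"]
print(follows_macro(game, ["move", "shoot_goalRight"]))  # True
```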
Macro-Transfer Algorithm: Learning Rules for Arcs
Positive: states in good games that took the arc
Negative: states in good games that could have taken the arc but didn't
ILP
Arc: pass(Teammate) → shoot(goalRight)
IF [ … ] THEN enter(State)
IF [ … ] THEN loop(State, Teammate)
Macro-Transfer Algorithm: Selecting and Scoring Rules
Rule 1: precision = 1.0
Rule 2: precision = 0.99
Rule 3: precision = 0.96
… …
Does the rule increase the F(10) of the ruleset? If yes, add it to the ruleset (see the sketch below).
Rule score = (# games that follow the rule that are good) / (# games that follow the rule)
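A sketch of this greedy selection and scoring in Python; `f10_of` and the game objects are illustrative stand-ins for evaluating F(10) on the training games.

```python
def select_rules(candidates, f10_of):
    """Greedily keep a candidate rule only if it improves the ruleset's F(10).
    Candidates are assumed sorted by decreasing precision."""
    ruleset, best = [], 0.0
    for rule in candidates:
        score = f10_of(ruleset + [rule])
        if score > best:
            ruleset.append(rule)
            best = score
    return ruleset

def rule_score(rule, games):
    """Fraction of games following the rule that are good."""
    followers = [g for g in games if g.follows(rule)]
    return sum(g.is_good for g in followers) / len(followers) if followers else 0.0
```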
Selected Results
Macro transfer from 2-on-1 BreakAway to 3-on-2 BreakAway
pass(Teammate) move(ahead)
pass(Teammate) move(right) shoot(goalLeft)
move(right) move(left) shoot(goalLeft) shoot(goalRight)
move(left) shoot(goalLeft) shoot(goalRight)
move(ahead) move(right) shoot(goalLeft) shoot(goalRight)
move(away) shoot(goalLeft) shoot(goalRight)
move(right) shoot(goalLeft) shoot(goalRight)
shoot(goalLeft) shoot(goalRight)
shoot(GoalPart)
Selected Results
Macro self-transfer in 2-on-1 BreakAway
[Plot: probability of goal vs. training games]
Asymptote: 56%
Initial: 1%
Single macro: 32%
Multiple macro: 43%
Thesis Contributions
Advice transfer: advice taking, inductive logic programming, skill-transfer algorithm
Macro transfer: macro-operators, demonstration, macro-transfer algorithm
Markov Logic Network transfer: Markov Logic Networks, MLNs in macros, MLN Q-function transfer algorithm, MLN policy-transfer algorithm
Markov Logic Networks
Formulas (F):
evidence1(X) AND query(X)
evidence2(X) AND query(X)
Weights (W): w0 = 1.1, w1 = 0.9
P(world) = (1/Z) exp( Σᵢ∈F wᵢ · nᵢ(world) )
nᵢ(world) = # true groundings of the ith formula in the world
[Diagram: ground network connecting query(x1) and query(x2) to evidence e1, e2, …]
Reference: Richardson and Domingos, Markov Logic Networks, Machine Learning 2006
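A brute-force Python rendering of this probability, with each formula represented as a function counting its true groundings in a world and the partition function Z summed over an explicit world list (all names illustrative):

```python
from math import exp

def world_probability(world, worlds, formulas, weights):
    """P(world) = (1/Z) exp(sum_i w_i * n_i(world)), Z over all worlds."""
    def score(w):
        return exp(sum(wt * n(w) for wt, n in zip(weights, formulas)))
    Z = sum(score(w) for w in worlds)  # partition function
    return score(world) / Z
```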
MLN Weight Learning
From ILP: IF [ ... ] THEN …
↓ Alchemy weight learning
MLN: w0 = 1.1
Reference: http://alchemy.cs.washington.edu
Markov Logic Networks in Macros
IF distance(Teammate, goal) < 12
THEN pass(Teammate)
IF angle(Teammate, defender) > 30
THEN pass(Teammate)
Matches t1, score = 0.92
Matches t2, score = 0.88
pass(Teammate) MLN
P(t1) = 0.35, P(t2) = 0.65
Markov Logic Networks in Macros
pass(Teammate) AND angle(Teammate, defender) > 30
pass(Teammate) AND distance(Teammate, goal) < 12
[Ground network]
pass(t1) — evidence: distance(t1, goal) < 12, angle(t1, defender) > 30
pass(t2) — evidence: distance(t2, goal) < 12, angle(t2, defender) > 30
Selected Results
Macro self-transfer in 2-on-1 BreakAway
[Plot: probability of goal vs. training games]
Asymptote: 56%
Initial: 1%
Regular macro: 32%
Macro with MLN: 43%
MLN Q-Function Transfer Algorithm
Source → Target
ILP, Alchemy
Demonstration
MLN Q-functions:
MLN for action 1: State → Q-value
MLN for action 2: State → Q-value
…
MLN Q-Function
0 ≤ Qa < 0.2   0.2 ≤ Qa < 0.4   0.4 ≤ Qa < 0.6   …
[Plots: probability vs. bin number, one distribution per state]
Qa(s) = Σ over bins b of P(b) × E[Qa | b]
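In code, the estimate is just a probability-weighted average over bins; a minimal sketch with made-up bin probabilities and expected values:

```python
def mln_q_value(bin_probs, bin_expected_q):
    """Qa(s) = sum over bins b of P(b) * E[Qa | b]."""
    return sum(p * q for p, q in zip(bin_probs, bin_expected_q))

# e.g. bins [0, 0.2), [0.2, 0.4), [0.4, 0.6) with midpoints as E[Qa | b]
print(mln_q_value([0.2, 0.5, 0.3], [0.1, 0.3, 0.5]))  # 0.32
```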
Selected Results
MLN Q-function transfer from 2-on-1 BreakAway to 3-on-2 BreakAway
IF distance(me, GoalPart) ≥ 42
AND distance(me, Teammate) ≥ 39
THEN pass(Teammate) falls into [0, 0.11]
IF angle(topRight, goalCenter, me) ≤ 42
AND angle(topRight, goalCenter, me) ≥ 55
AND angle(goalLeft, me, goalie) ≥ 20
AND angle(goalCenter, me, goalie) ≤ 30
THEN pass(Teammate) falls into [0.11, 0.27]
IF distance(Teammate, goalCenter) ≤ 9
AND angle(topRight, goalCenter, me) ≤ 85
THEN pass(Teammate) falls into [0.27, 0.43]
Selected Results
MLN Q-function transfer from 2-on-1 BreakAway to 3-on-2 BreakAway
(Torrey et al., AAAI workshop 2008)
MLN Policy-Transfer Algorithm
Source → Target
ILP, Alchemy
Demonstration
MLN (F, W): State → Action probability
MLN Policy
Selected Results
MLN policy transfer from 2-on-1 BreakAway to 3-on-2 BreakAway
IF angle(topRight, goalCenter, me) ≤ 70
AND timeLeft ≥ 98
AND distance(me, Teammate) ≥ 3
THEN pass(Teammate)
IF distance(me, GoalPart) ≥ 36
AND distance(me, Teammate) ≥ 12
AND timeLeft ≥ 91
AND angle(topRight, goalCenter, me) ≤ 80
THEN pass(Teammate)
IF distance(me, GoalPart) ≥ 27
AND angle(topRight, goalCenter, me) ≤ 75
AND distance(me, Teammate) ≥ 9
AND angle(Teammate, me, goalie) ≥ 25
THEN pass(Teammate)
Selected Results
MLN policy transfer from 2-on-1 BreakAway to 3-on-2 BreakAway
(Torrey et al., ILP 2009)
Selected Results
MLN self-transfer in 2-on-1 BreakAway
[Plot: probability of goal vs. training games]
Asymptote: 56%
Initial: 1%
MLN Q-function: 59%
MLN policy: 65%
Thesis Contributions
Advice transfer: advice taking (ECML 2005), inductive logic programming, skill-transfer algorithm (ECML 2006)
Macro transfer: macro-operators, demonstration, macro-transfer algorithm (ILP 2007)
Markov Logic Network transfer: Markov Logic Networks, MLNs in macros, MLN Q-function transfer algorithm (AAAI workshop 2008), MLN policy-transfer algorithm (ILP 2009)
Related Work
Starting-point: Taylor et al. 2005, value-function transfer
Imitation: Fernandez and Veloso 2006, policy reuse
Hierarchical: Mehta et al. 2008, MaxQ transfer
Alteration: Walsh et al. 2006, aggregate states
New algorithms: Sharma et al. 2007, case-based RL
Conclusions
Transfer can improve reinforcement learning in both initial performance and learning speed.
Advice transfer: low initial performance, steep learning curves, robust to negative transfer.
Macro transfer and MLN transfer: high initial performance, shallow learning curves, vulnerable to negative transfer.
Conclusions
Close-transfer scenarios:
Multiple Macro = MLN Policy ≥ Single Macro = MLN Q-Function ≥ Skill Transfer
Distant-transfer scenarios:
Multiple Macro = MLN Policy ≥ Single Macro = MLN Q-Function ≥ Skill Transfer
Future Work
Theoretical results:
How high can the initial performance be?
How quickly can the target-task learner improve?
How many episodes are "saved" through transfer?
[Diagram: Source → Target — relationship?]
Future Work
Joint learning and inference in macros:
Single search
Combined rule/weight learning
[Diagram: macro nodes pass(Teammate) → move(Direction)]
Future Work
Refinement of transferred knowledge
Macros: revising rule scores, relearning rules, relearning structure
MLNs: revising weights, relearning rules
Too-specific clause → better clause; too-general clause → better clause (Mihalkova et al. 2007)
Future Work
Relational reinforcement learning:
Q-learning with an MLN Q-function
Policy search with MLN policies or macros
MLN Q-functions lose too much information:
Qaction(state) = Σ over bins b of P(b) × E[Q | b]
[Plot: probability vs. bin number]
Future Work
General challenges in RL transfer:
Diverse tasks
Complex testbeds
Automated mapping
Protection against negative transfer
Acknowledgements
Advisor: Jude Shavlik
Collaborators: Trevor Walker and Richard Maclin
Committee: David Page, Mark Craven, Jerry Zhu, Michael Coen
UW Machine Learning Group
Grants: DARPA HR0011-04-1-0007, NRL N00173-06-1-G002, DARPA FA8650-06-C-7606
Transfer in Reinforcement Learning: Starting-Point Methods
Initial Q-table, no transfer:
0 0 0 0
0 0 0 0
0 0 0 0
Initial Q-table, transfer from the source task:
2 5 4 8
9 1 7 2
5 9 1 4
→ target-task training
Transfer in Reinforcement Learning: Alteration Methods
Task S:
Original states, original actions, original rewards
→ New states, new actions, new rewards
Policy Transfer Algorithm
Source → Target
IF Q(pass(Teammate)) > Q(other) THEN pass(Teammate)
Advice Taking
Skill Transfer Algorithm: selecting training examples
[Flowchart]
If action = pass(X), outcome = caught(X), and pass(X) was good → positive example for pass(X).
If action ≠ pass(X) but pass(X) was clearly best → positive example for pass(X).
If some action was good but pass(X) was clearly bad → negative example for pass(X).
Otherwise → reject the example.
Markov Logic Networks in Macros: Exact Inference
x1 = world where pass(t1) is true; x0 = world where pass(t1) is false
P(x1) = (1/Z) exp( Σᵢ∈F wᵢ · nᵢ(x1) )
P(x0) = (1/Z) exp( Σᵢ∈F wᵢ · nᵢ(x0) )
Note: when pass(t1) is false, no formulas are true, so:
P(x0) = (1/Z) exp(0) = 1/Z
P(x1) = (1/Z) exp( Σᵢ∈F wᵢ )
Formulas:
pass(t1) AND angle(t1, defender) > 30
pass(t1) AND distance(t1, goal) < 12
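Normalizing over these two worlds gives a logistic form. A worked Python version using the example weights w0 = 1.1 and w1 = 0.9 from the weight-learning slide, and assuming both formulas' evidence holds when pass(t1) is true:

```python
from math import exp

w = [1.1, 0.9]             # example weights from the weight-learning slide
score_true = exp(sum(w))   # x1: pass(t1) true, both formulas satisfied once
score_false = exp(0)       # x0: pass(t1) false, no formulas true
p_pass = score_true / (score_true + score_false)
print(round(p_pass, 2))    # ~0.88
```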