Probabilistic Heuristic Search Algorithm
cs.brown.edu/degrees/undergrad/research/judah.pdf

Take the First Right and Go Straight Forever: Novel Planning Algorithms in Stochastic Infinite Domains
Judah Schvimer    Advisor: Prof. Michael Littman

Reward Functions

Discounting: With Discounting means 0 ≤ γ < 1; No Discounting means γ = 1.

Goal Reward:    Step → 0,  Goal → 1   (GRWD with discounting, GRND without)
Action Penalty: Step → −1, Goal → 0   (APWD with discounting, APND without)
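
To make the table concrete, here is one standard way to write the expected total reward of a fixed policy under each of the four schemes. This is my own formalization, not text from the poster, and it assumes the usual convention that the reward for the goal-entering transition is discounted by γ^(T−1), where T is the random, possibly infinite, number of steps the policy takes to reach the goal.

```latex
% Expected total reward of a fixed policy from the start state; T is the random,
% possibly infinite, number of steps it takes that policy to reach the goal.
\begin{align*}
V_{\mathrm{GRWD}} &= \mathbb{E}\!\left[\gamma^{T-1}\,\mathbf{1}\{T<\infty\}\right]
    && \text{goal reward } 1,\ \text{discount } 0 \le \gamma < 1\\
V_{\mathrm{GRND}} &= \Pr[\,T<\infty\,]
    && \text{goal reward } 1,\ \text{no discounting}\\
V_{\mathrm{APWD}} &= -\,\mathbb{E}\!\left[\frac{1-\gamma^{T}}{1-\gamma}\right]
    && \text{penalty } -1 \text{ per step, discounted}\\
V_{\mathrm{APND}} &= -\,\mathbb{E}[\,T\,]
    && \text{penalty } -1 \text{ per step, no discounting}
\end{align*}
```

These definitions line up with the claims around them: APND ranks policies by their expected number of steps, while GRND only sees whether the goal is reached at all, which is why its search wanders (see below).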

APWD and GRWD expand the same states, find the same answer, and take the same time. Why?
- The algorithm uses rewards only in the Set Policy step
- The tie breaker there chooses the policy with the greatest expected total reward
- APWD's and GRWD's expected total rewards are linear transformations of each other, so they rank policies identically
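
One way to verify the linear-transformation bullet, using the value definitions sketched above (again my formalization, for 0 < γ < 1):

```latex
\begin{align*}
V_{\mathrm{APWD}}
  = -\,\mathbb{E}\!\left[\frac{1-\gamma^{T}}{1-\gamma}\right]
  = \frac{\mathbb{E}\!\left[\gamma^{T}\right]-1}{1-\gamma}
  = \frac{\gamma\,V_{\mathrm{GRWD}}-1}{1-\gamma}
  = \frac{\gamma}{1-\gamma}\,V_{\mathrm{GRWD}}\;-\;\frac{1}{1-\gamma}.
\end{align*}
```

Since γ/(1−γ) > 0, this is an increasing affine map, so APWD and GRWD order every pair of policies the same way; the Set Policy tie breaker therefore makes identical choices under either scheme, which is exactly the "same states, same answer, same time" behavior claimed above.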

GRND Doesn't Terminate: worst possible performance. With an undiscounted goal reward, every goal-reaching policy looks equally good no matter how long it is, so the search is undirected and wanders infinitely.

APND and GRWD/APWD are not comparable:
- Domains exist where each performs better
- GRWD/APWD makes different decisions based on the discount factor
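
The poster makes these claims with example domains (the figure placeholder just below), which are not reproduced in this transcript. The following is a hypothetical toy domain of my own, sketched only to illustrate the two bullets: GRWD and APND can prefer different actions, and GRWD's preference itself flips with γ. All names and numbers here are invented.

```python
# Hypothetical toy domain (for illustration only; not one of the poster's domains):
# from the start state the agent commits to one of two actions.
#   "risky"  reaches the goal in 1 step with probability 0.5, otherwise in 11 steps.
#   "steady" always reaches the goal in exactly 3 steps.

def grwd_value(outcomes, gamma):
    """Goal Reward With Discounting: expected gamma**(T-1) for a goal reached after T steps."""
    return sum(p * gamma ** (steps - 1) for p, steps in outcomes)

def apnd_value(outcomes):
    """Action Penalty, No Discounting: minus the expected number of steps."""
    return -sum(p * steps for p, steps in outcomes)

actions = {
    "risky":  [(0.5, 1), (0.5, 11)],   # (probability, steps to goal)
    "steady": [(1.0, 3)],
}

for gamma in (0.5, 0.99):
    best = max(actions, key=lambda a: grwd_value(actions[a], gamma))
    print(f"GRWD, gamma={gamma}: prefers {best}")
print("APND: prefers", max(actions, key=lambda a: apnd_value(actions[a])))

# Prints (values rounded):
#   GRWD, gamma=0.5: prefers risky    (0.5005 vs 0.25)
#   GRWD, gamma=0.99: prefers steady  (0.952 vs 0.980)
#   APND: prefers steady              (expected steps 6 vs 3)
```

With γ = 0.5 the long branch is discounted almost to nothing, so GRWD gambles on the short one; with γ = 0.99 it agrees with APND, which is the sense in which GRWD/APWD's decisions depend on the discount factor.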

[Figures: two example domains, "GRWD Expands Fewer States" and "APND Expands Fewer States", each marking GRWD's Choice and APND's Choice.]

[Figure: card-drawing example labeled "Draw a 9 of Diamonds", "Draw an Ace of Spades", "5 of Hearts on 6 of Spades".]

Termination
✓ The optimal policy is finite
✓ Actions transition to a finite number of states
✓ The greatest probability of reaching the goal is 1*
✓ The reward scheme causes states within a finite number of steps of the start to have greater values than states an infinite number of steps away from the start

Probabilistic Heuristic Search Algorithm

1. Set Policy: Choose the policy with the greatest probability of reaching the goal, using a standard planning algorithm and assuming optimistically that unexplored states are goal states.
   1. If there is a tie, choose the policy with the greatest expected total reward.
   2. If there is still a tie, choose the policy arbitrarily, though consistently.
2. Short Circuiting (Optional): If the policy's pessimistic estimate for the probability of reaching the goal is better than the best optimistic estimate from a different first action, go to Step 6 and return only the optimal first action.
3. Termination: If there are no more fringe states in the current policy, go to Step 6; otherwise continue to Step 4.
4. Choose Expansion State: Among all fringe states, choose the one reached with the greatest probability.
   1. If there is a tie, choose one state arbitrarily, though consistently.
5. Expand: Expand the chosen fringe state by seeing where its actions transition and adding those states to the MDP; go to Step 1.
6. Policy Choice: Return the last expanded policy.
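
The following is a minimal control-flow sketch of the loop above, not the thesis implementation. Everything named here beyond the six step numbers is my own scaffolding: `PartialMDP`, `PolicyResult`, and the injected `plan`/`expand` callables are hypothetical, and `plan` is assumed to do the optimistic planning, the expected-total-reward tie breaking, and the pessimistic and fringe bookkeeping that the poster attributes to "a standard planning algorithm".

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Hashable, List, Set, Tuple, Union

State = Hashable
Action = Hashable


@dataclass
class PartialMDP:
    """The finite portion of a (possibly infinite) MDP expanded so far."""
    start: State
    goals: Set[State] = field(default_factory=set)
    # transitions[state][action] -> list of (next_state, probability)
    transitions: Dict[State, Dict[Action, List[Tuple[State, float]]]] = field(default_factory=dict)


@dataclass
class PolicyResult:
    """What the injected planner (Step 1) is assumed to hand back each iteration."""
    policy: Dict[State, Action]            # greedy action at each expanded state
    optimistic_reach: Dict[Action, float]  # per first action: optimistic P(goal), fringe treated as goals
    pessimistic_reach: float               # P(goal) of `policy` when fringe states are treated as non-goals
    fringe_reach: Dict[State, float]       # fringe state -> probability the policy reaches it


def probabilistic_heuristic_search(
    mdp: PartialMDP,
    plan: Callable[[PartialMDP], PolicyResult],    # Step 1; assumed to break ties by expected total reward
    expand: Callable[[PartialMDP, State], None],   # Step 5; adds the chosen state's transitions to `mdp`
    short_circuit: bool = True,
) -> Union[Dict[State, Action], Action]:
    while True:
        # Step 1: Set Policy via optimistic planning (unexplored states count as goals).
        result = plan(mdp)
        first_action = result.policy[mdp.start]

        # Step 2 (optional): Short Circuiting -- the chosen policy's pessimistic estimate
        # already beats the best optimistic estimate of every other first action.
        if short_circuit:
            rivals = [p for a, p in result.optimistic_reach.items() if a != first_action]
            if rivals and result.pessimistic_reach > max(rivals):
                return first_action                     # Step 6: return only the first action

        # Step 3: Termination -- the current policy has no fringe states left.
        if not result.fringe_reach:
            return result.policy                        # Step 6: Policy Choice

        # Step 4: Choose Expansion State -- the fringe state reached with the greatest
        # probability (ties broken consistently by the dict's insertion order).
        target = max(result.fringe_reach, key=result.fringe_reach.get)

        # Step 5: Expand it and loop back to Step 1.
        expand(mdp, target)
```

The inner planner is deliberately left behind a callable because the poster only describes the outer expansion loop.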

Modified Breadth First Search
✓ Uses short-circuiting termination
✓ Guaranteed to find the optimal policy, but not guaranteed to terminate when it does unless the greatest probability of reaching the goal equals 1
✓ Neither this nor the Probabilistic Heuristic Search Algorithm finds the policy with the fewest expected steps
