Probabilistic Heuristic Search Algorithm · cs.brown.edu/degrees/undergrad/research/judah.pdf


[Figure: APWD's reward compared with GRWD's reward]

Reward schemes:
  Goal Reward:     Step -> 0     Goal -> 1
  Action Penalty:  Step -> -1    Goal -> 0

Discounting:
  With Discounting: 0 ≤ γ < 1     No Discounting: γ = 1

Crossing a reward scheme with a discounting choice gives the four reward functions compared below: GRWD (Goal Reward With Discounting), GRND (Goal Reward No Discounting), APWD (Action Penalty With Discounting), and APND (Action Penalty No Discounting).
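To make the scheme/discounting combinations concrete, here is a small Python sketch; the function name, the example discount factor of 0.9, and the dictionary layout are illustrative assumptions, not anything from the poster.

```python
# Per-transition rewards for the two schemes, and the four named combinations.
# Names, signature, and the example gamma of 0.9 are illustrative assumptions.

def step_reward(scheme: str, reached_goal: bool) -> float:
    """Reward for a single transition under the given scheme."""
    if scheme == "goal_reward":      # Step -> 0, Goal -> 1
        return 1.0 if reached_goal else 0.0
    if scheme == "action_penalty":   # Step -> -1, Goal -> 0
        return 0.0 if reached_goal else -1.0
    raise ValueError(f"unknown scheme: {scheme}")

# (scheme, discount factor): "WD" uses some 0 <= gamma < 1, "ND" uses gamma = 1.
REWARD_FUNCTIONS = {
    "GRWD": ("goal_reward", 0.9),
    "GRND": ("goal_reward", 1.0),
    "APWD": ("action_penalty", 0.9),
    "APND": ("action_penalty", 1.0),
}
```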

APWD and GRWD expand the same states, find the same answer, and take the same time. Why?
- The algorithm uses rewards only in the Set Policy step, where the tie breaker chooses the policy with the greatest expected total reward
- APWD's and GRWD's expected total rewards are linear transformations of each other, so the tie breaker ranks policies the same way under either
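The linear-transformation claim can be checked directly from the reward table above. The derivation below is a reconstruction rather than text from the poster; it considers a trajectory that reaches the goal after n steps (with γ^n read as 0 if the goal is never reached).

```latex
\[
V_{\mathrm{GRWD}} = \gamma^{n},
\qquad
V_{\mathrm{APWD}} = -\sum_{k=0}^{n-1}\gamma^{k}
                  = -\frac{1-\gamma^{n}}{1-\gamma}
                  = \frac{1}{1-\gamma}\,V_{\mathrm{GRWD}} - \frac{1}{1-\gamma}.
\]
```

Since 1/(1-γ) > 0 for 0 ≤ γ < 1, the map is increasing and is preserved by taking expectations, so the two reward functions order policies' expected total rewards identically, which is why the tie breaker makes the same choices.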

Worst possible performance: an undirected, infinite, wandering search

APND and GRWD/APWD are not comparable:
- Domains exist where each performs better
- GRWD/APWD makes different decisions depending on the discount factor
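As an illustration of the last point, consider a toy choice (my own example, not one of the poster's domains) between two first actions that both reach the goal with probability 1, so the GRWD tie breaker decides on expected discounted reward: action A reaches the goal in 1 step with probability 0.5 and in 30 steps otherwise, while action B reaches it in 10 steps for sure.

```latex
\[
V_A(\gamma) = 0.5\,\gamma + 0.5\,\gamma^{30},
\qquad
V_B(\gamma) = \gamma^{10}.
\]
\[
\gamma = 0.5:\quad V_A \approx 0.25 > V_B \approx 0.001 \ \Rightarrow\ \text{choose } A;
\qquad
\gamma = 0.99:\quad V_A \approx 0.86 < V_B \approx 0.90 \ \Rightarrow\ \text{choose } B.
\]
```

APND, by contrast, compares undiscounted expected step counts (15.5 for A versus 10 for B) and prefers B regardless of γ, so whether it agrees with GRWD/APWD depends on the discount factor.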

[Figures: example domains contrasting GRWD's choice with APND's choice]

Take the First Right and Go Straight Forever: Novel Planning Algorithms in Stochastic Infinite Domains

Judah Schvimer, Advisor: Prof. Michael Littman

[Figure: card-game example domain with moves such as "Draw a 9 of Diamonds", "Draw an Ace of Spades", and "5 of Hearts on 6 of Spades"]

Termination
✓ The optimal policy is finite
✓ Actions transition to a finite number of states
✓ The greatest probability of reaching the goal is 1*
✓ The reward scheme causes states within a finite number of steps of the start to have greater values than states an infinite number of steps away from the start

 

1. Set Policy: Choose the policy with the greatest probability of reaching the goal using a standard planning algorithm, assuming optimistically that unexplored states are goal states.
   a. If there is a tie, choose the policy with the greatest expected total reward.
   b. If there is still a tie, choose the policy arbitrarily, though consistently.

2. Short Circuiting (Optional): If the policy's pessimistic estimate for the probability of reaching the goal is better than the best optimistic estimate from a different first action, go to Step 6 and return only the optimal first action.

3. Termination: If there are no more fringe states in the current policy, go to Step 6; otherwise continue to Step 4.

4. Choose Expansion State: Among all fringe states, choose the one reached with the greatest probability.
   a. If there is a tie, choose one state arbitrarily, though consistently.

5. Expand: Expand the chosen fringe state by seeing where its actions transition and adding those states to the MDP; go to Step 1.

6. Policy Choice: Return the last expanded policy.
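Read as a loop, the six steps fit in a few lines of code. The sketch below is a minimal Python rendering under assumed data structures (an `mdp` object offering `plan_optimistic`, `pessimistic_value`, `best_other_optimistic`, `fringe_states`, `reach_probability`, and `expand`, and a `policy.visits` test); these names are illustrative and the code is not the author's implementation.

```python
# Minimal sketch of the Probabilistic Heuristic Search main loop.
# The mdp/planner interface used here is assumed for illustration only.

def probabilistic_heuristic_search(mdp, short_circuit=False):
    while True:
        # Step 1: plan optimistically, treating unexplored (fringe) states as
        # goals; ties break on expected total reward, then on a fixed ordering.
        policy = mdp.plan_optimistic()

        # Step 2 (optional): stop early if this policy's pessimistic success
        # probability beats every other first action's optimistic estimate.
        if short_circuit and mdp.pessimistic_value(policy) > mdp.best_other_optimistic(policy):
            return policy  # Step 6: only the first action is really needed

        # Step 3: terminate once the current policy contains no fringe states.
        fringe = [s for s in mdp.fringe_states() if policy.visits(s)]
        if not fringe:
            return policy  # Step 6: return the last expanded policy

        # Step 4: pick the fringe state reached with the greatest probability
        # (max keeps the first maximum, so ties break consistently for a fixed order).
        state = max(fringe, key=mdp.reach_probability)

        # Step 5: expand it by adding the states its actions reach, then loop.
        mdp.expand(state)
```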

Modified Breadth First Search
✓ Uses short circuiting termination
✓ Guaranteed to find the optimal policy, but not guaranteed to terminate when it does unless the greatest probability of reaching the goal equals 1
✓ Neither this nor the Probabilistic Heuristic Search Algorithm finds the policy with the fewest expected steps
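The poster does not spell out the modification, but one plausible reading is that fringe states are expanded in breadth-first (FIFO) order rather than by reach probability, while keeping the same short-circuit termination test. The sketch below is that reading only; it reuses the assumed interface from the previous sketch and additionally assumes `expand` returns the newly created fringe states.

```python
# One possible reading of the modified breadth-first search: FIFO expansion
# order plus the short-circuit test. A reconstruction, not the poster's code.
from collections import deque

def modified_bfs(mdp):
    queue = deque(mdp.fringe_states())
    while queue:
        policy = mdp.plan_optimistic()
        # Short-circuit termination, as in Step 2 of the algorithm above.
        if mdp.pessimistic_value(policy) > mdp.best_other_optimistic(policy):
            return policy
        # Expand fringe states strictly in the order they were discovered.
        state = queue.popleft()
        queue.extend(mdp.expand(state))  # assumes expand returns new fringe states
    return mdp.plan_optimistic()
```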

Reward Functions

Probabilistic Heuristic Search Algorithm

GRWD Expands Fewer States

APND Expands Fewer States

GRND Doesn't Terminate