Probabilistic Heuristic Search Algorithm
cs.brown.edu/degrees/undergrad/research/judah.pdf

Take the First Right and Go Straight Forever: Novel Planning Algorithms in Stochastic Infinite Domains
Judah Schvimer    Advisor: Prof. Michael Littman

Reward Functions

Discounting: With Discounting means 0 ≤ γ < 1; No Discounting means γ = 1.

Goal Reward:    Step → 0,  Goal → 1   (GRWD with discounting, GRND without)
Action Penalty: Step → −1, Goal → 0   (APWD with discounting, APND without)
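
To make the table concrete, here is one standard way to write the expected total reward of a fixed policy under each of the four schemes. This is my own formalization, not text from the poster, and it assumes the usual convention that the reward for the goal-entering transition is discounted by γ^(T−1), where T is the random, possibly infinite, number of steps the policy takes to reach the goal.

```latex
% Expected total reward of a fixed policy from the start state; T is the random,
% possibly infinite, number of steps it takes that policy to reach the goal.
\begin{align*}
V_{\mathrm{GRWD}} &= \mathbb{E}\!\left[\gamma^{T-1}\,\mathbf{1}\{T<\infty\}\right]
    && \text{goal reward } 1,\ \text{discount } 0 \le \gamma < 1\\
V_{\mathrm{GRND}} &= \Pr[\,T<\infty\,]
    && \text{goal reward } 1,\ \text{no discounting}\\
V_{\mathrm{APWD}} &= -\,\mathbb{E}\!\left[\frac{1-\gamma^{T}}{1-\gamma}\right]
    && \text{penalty } -1 \text{ per step, discounted}\\
V_{\mathrm{APND}} &= -\,\mathbb{E}[\,T\,]
    && \text{penalty } -1 \text{ per step, no discounting}
\end{align*}
```

These definitions line up with the claims around them: APND ranks policies by their expected number of steps, while GRND only sees whether the goal is reached at all, which is why its search wanders (see below).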

APWD and GRWD expand the same states, find the same answer, and take the same time. Why?
- The algorithm uses rewards only in the Set Policy step
- The tie breaker there chooses the policy with the greatest expected total reward
- APWD's and GRWD's expected total rewards are linear transformations of each other, so they rank policies identically
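
One way to verify the linear-transformation bullet, using the value definitions sketched above (again my formalization, for 0 < γ < 1):

```latex
\begin{align*}
V_{\mathrm{APWD}}
  = -\,\mathbb{E}\!\left[\frac{1-\gamma^{T}}{1-\gamma}\right]
  = \frac{\mathbb{E}\!\left[\gamma^{T}\right]-1}{1-\gamma}
  = \frac{\gamma\,V_{\mathrm{GRWD}}-1}{1-\gamma}
  = \frac{\gamma}{1-\gamma}\,V_{\mathrm{GRWD}}\;-\;\frac{1}{1-\gamma}.
\end{align*}
```

Since γ/(1−γ) > 0, this is an increasing affine map, so APWD and GRWD order every pair of policies the same way; the Set Policy tie breaker therefore makes identical choices under either scheme, which is exactly the "same states, same answer, same time" behavior claimed above.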

GRND Doesn't Terminate: worst possible performance. With an undiscounted goal reward, every goal-reaching policy looks equally good no matter how long it is, so the search is undirected and wanders infinitely.

APND and GRWD/APWD are not comparable:
- Domains exist where each performs better
- GRWD/APWD makes different decisions based on the discount factor
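
The poster makes these claims with example domains (the figure placeholder just below), which are not reproduced in this transcript. The following is a hypothetical toy domain of my own, sketched only to illustrate the two bullets: GRWD and APND can prefer different actions, and GRWD's preference itself flips with γ. All names and numbers here are invented.

```python
# Hypothetical toy domain (for illustration only; not one of the poster's domains):
# from the start state the agent commits to one of two actions.
#   "risky"  reaches the goal in 1 step with probability 0.5, otherwise in 11 steps.
#   "steady" always reaches the goal in exactly 3 steps.

def grwd_value(outcomes, gamma):
    """Goal Reward With Discounting: expected gamma**(T-1) for a goal reached after T steps."""
    return sum(p * gamma ** (steps - 1) for p, steps in outcomes)

def apnd_value(outcomes):
    """Action Penalty, No Discounting: minus the expected number of steps."""
    return -sum(p * steps for p, steps in outcomes)

actions = {
    "risky":  [(0.5, 1), (0.5, 11)],   # (probability, steps to goal)
    "steady": [(1.0, 3)],
}

for gamma in (0.5, 0.99):
    best = max(actions, key=lambda a: grwd_value(actions[a], gamma))
    print(f"GRWD, gamma={gamma}: prefers {best}")
print("APND: prefers", max(actions, key=lambda a: apnd_value(actions[a])))

# Prints (values rounded):
#   GRWD, gamma=0.5: prefers risky    (0.5005 vs 0.25)
#   GRWD, gamma=0.99: prefers steady  (0.952 vs 0.980)
#   APND: prefers steady              (expected steps 6 vs 3)
```

With γ = 0.5 the long branch is discounted almost to nothing, so GRWD gambles on the short one; with γ = 0.99 it agrees with APND, which is the sense in which GRWD/APWD's decisions depend on the discount factor.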

[Figures: two example domains, "GRWD Expands Fewer States" and "APND Expands Fewer States", each marking GRWD's Choice and APND's Choice.]

[Figure: card-drawing example labeled "Draw a 9 of Diamonds", "Draw an Ace of Spades", "5 of Hearts on 6 of Spades".]

Termination
✓ The optimal policy is finite
✓ Actions transition to a finite number of states
✓ The greatest probability of reaching the goal is 1*
✓ The reward scheme causes states within a finite number of steps of the start to have greater values than states an infinite number of steps away from the start

Probabilistic Heuristic Search Algorithm

1. Set Policy: Choose the policy with the greatest probability of reaching the goal, using a standard planning algorithm and assuming optimistically that unexplored states are goal states.
   1. If there is a tie, choose the policy with the greatest expected total reward.
   2. If there is still a tie, choose the policy arbitrarily, though consistently.
2. Short Circuiting (Optional): If the policy's pessimistic estimate for the probability of reaching the goal is better than the best optimistic estimate from a different first action, go to Step 6 and return only the optimal first action.
3. Termination: If there are no more fringe states in the current policy, go to Step 6; otherwise continue to Step 4.
4. Choose Expansion State: Among all fringe states, choose the one reached with the greatest probability.
   1. If there is a tie, choose one state arbitrarily, though consistently.
5. Expand: Expand the chosen fringe state by seeing where its actions transition and adding those states to the MDP; go to Step 1.
6. Policy Choice: Return the last expanded policy.
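
The following is a minimal control-flow sketch of the loop above, not the thesis implementation. Everything named here beyond the six step numbers is my own scaffolding: `PartialMDP`, `PolicyResult`, and the injected `plan`/`expand` callables are hypothetical, and `plan` is assumed to do the optimistic planning, the expected-total-reward tie breaking, and the pessimistic and fringe bookkeeping that the poster attributes to "a standard planning algorithm".

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Hashable, List, Set, Tuple, Union

State = Hashable
Action = Hashable


@dataclass
class PartialMDP:
    """The finite portion of a (possibly infinite) MDP expanded so far."""
    start: State
    goals: Set[State] = field(default_factory=set)
    # transitions[state][action] -> list of (next_state, probability)
    transitions: Dict[State, Dict[Action, List[Tuple[State, float]]]] = field(default_factory=dict)


@dataclass
class PolicyResult:
    """What the injected planner (Step 1) is assumed to hand back each iteration."""
    policy: Dict[State, Action]            # greedy action at each expanded state
    optimistic_reach: Dict[Action, float]  # per first action: optimistic P(goal), fringe treated as goals
    pessimistic_reach: float               # P(goal) of `policy` when fringe states are treated as non-goals
    fringe_reach: Dict[State, float]       # fringe state -> probability the policy reaches it


def probabilistic_heuristic_search(
    mdp: PartialMDP,
    plan: Callable[[PartialMDP], PolicyResult],    # Step 1; assumed to break ties by expected total reward
    expand: Callable[[PartialMDP, State], None],   # Step 5; adds the chosen state's transitions to `mdp`
    short_circuit: bool = True,
) -> Union[Dict[State, Action], Action]:
    while True:
        # Step 1: Set Policy via optimistic planning (unexplored states count as goals).
        result = plan(mdp)
        first_action = result.policy[mdp.start]

        # Step 2 (optional): Short Circuiting -- the chosen policy's pessimistic estimate
        # already beats the best optimistic estimate of every other first action.
        if short_circuit:
            rivals = [p for a, p in result.optimistic_reach.items() if a != first_action]
            if rivals and result.pessimistic_reach > max(rivals):
                return first_action                     # Step 6: return only the first action

        # Step 3: Termination -- the current policy has no fringe states left.
        if not result.fringe_reach:
            return result.policy                        # Step 6: Policy Choice

        # Step 4: Choose Expansion State -- the fringe state reached with the greatest
        # probability (ties broken consistently by the dict's insertion order).
        target = max(result.fringe_reach, key=result.fringe_reach.get)

        # Step 5: Expand it and loop back to Step 1.
        expand(mdp, target)
```

The inner planner is deliberately left behind a callable because the poster only describes the outer expansion loop.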

Modified Breadth First Search
✓ Uses short-circuiting termination
✓ Guaranteed to find the optimal policy, but not guaranteed to terminate when it does unless the greatest probability of reaching the goal equals 1
✓ Neither this nor the Probabilistic Heuristic Search Algorithm finds the policy with the fewest expected steps
