Accumulation vs. replacement; model-free vs. model-based RL.


Today in history

•Last time:

•Explanations of Q-learning

•Action selection

•On/off-policy learning

•Use of experience

•Eligibility traces

•SARSA

•Today

•SARSA(λ)

•Replacing vs accumulating traces

•Thinking about eligibility

•R3 discussion

Administrivia

•Select presentation days:

•Tues, May 1:

•Alex, Blake, Diane

•Thu, May 3:

•Hairong, Jesse, Josh

Presentation hints

Terran’s packaged rant...

Presentation hints

•Formal presentation to an audience

•Trying to convince audience of something

•E.g., you have invented a great idea and proven that it works

•Subtext: you’re smart and they should invest in you

•Think of it as a sales pitch (sort-of)

•Get the core idea across

•Don’t dwell on tedious detail

•Don’t be fluffy

Presentation hints

•Practice!

•Time will be tight -- time yourself

•Get friends/colleagues to help you practice

•Practice!

•Think about order of material presentation

•Practice!

Presentation hints

•Avoid

•using

•every

•clever

•powerpoint

•trick

And be careful with cute, but pointless images

Presentation hints

Oh, and avoid using bizarre fonts and really tiny font sizes just so that you can cram as much junk on the screen as possible. Remember: it’s more important that the audience actually understand your material than that you convey more ‘volume’ of material in the same time. It’s essentially pointless to ream through bunches of text or incredible amounts of math if nobody in the audience gets it. At best, they will be bored and zone out for most of your talk. At worst, they will be actively put off or annoyed by your presentation. And, presumably, you want them all to like you and be impressed with your material and ideas, so it’s counterproductive to antagonize your audience. Remember: at some point, your project, future funding, and/or job may depend on a presentation like this, so it behooves you to keep your audience happy. I have actually seen people give abysmally bad presentations and be completely rejected from the job opening because of their poor presentations. Now that that has been said, I still need to fill out this page with a large blob of text so that it’s as intimidating as possible. Honestly, I don’t expect anybody to actually read this far even in the online copy, let alone in class. If you do actually get this far while I’m flashing this page up in class, do please shout out. I’ll be most impressed and you’ll get brownie points for speed reading. Even if you happen to read this far in the online copy, please send me a note, just to satisfy my curiosity about who’s determined enough to get that far. Hm. Still half a page to fill. This is a pretty drastically condensed slide. Let’s see. Need more text. Maybe a little web mining... Ok, here we go: APRIL is the cruellest month, breeding / Lilacs out of the dead land, mixing / Memory and desire, stirring / Dull roots with spring rain. / Winter kept us warm, covering / Earth in forgetful snow, feeding / A little life with dried tubers.
/ Summer surprised us, coming over the Starnbergersee / With a shower of rain; we stopped in the colonnade, / And went on in sunlight, into the Hofgarten, / And drank coffee, and talked for an hour. / Bin gar keine Russin, stamm' aus Litauen, echt deutsch. / And when we were children, staying at the archduke's, / My cousin's, he took me out on a sled, / And I was frightened. He said, Marie, / Marie, hold on tight. And down we went. / In the mountains, there you feel free. / I read, much of the night, and go south in the winter. / / What are the roots that clutch, what branches grow / Out of this stony rubbish? Son of man, / You cannot say, or guess, for you know only / A heap of broken images, where the sun beats, / And the dead tree gives no shelter, the cricket no relief, / And the dry stone no sound of water. Only / There is shadow under this red rock, / (Come in under the shadow of this red rock), / And I will show you something different from either / Your shadow at morning striding behind you / Or your shadow at evening rising to meet you; / I will show you fear in a handful of dust. / Frisch weht der Wind / Der Heimat zu. / Mein Irisch Kind, / Wo weilest du? / 'You gave me hyacinths first a year ago; / 'They called me the hyacinth girl.' / —Yet when we came back, late, from the Hyacinth garden, / Your arms full, and your hair wet, I could not / Speak, and my eyes failed, I was neither / Living nor dead, and I knew nothing, / Looking into the heart of light, the silence. / Od' und leer das Meer.

Presentation hints

Oh yeah. Don’t switch slides too quickly.

Presentation hints

•Be sure to look at audience

•Don’t just read from your slides

•Don’t stare at screen whole time

•Be careful w/ laser pointers

•Practice!

Back to RL...

The Q-learning algorithm

Algorithm: Q_learn
Inputs: State space S; Act. space A
        Discount γ (0<=γ<1); Learning rate α (0<=α<1)
Outputs: Q

Q=random(|S|,|A|); // Initialize
Repeat {
  s=get_current_world_state()
  a=pick_next_action(Q,s)
  (r,s’)=act_in_world(a)
  Q(s,a)=Q(s,a)+α*(r+γ*max_a’(Q(s’,a’))-Q(s,a))
} Until (bored)
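As a rough illustration, the Q_learn loop might look like this in Python. The five-state corridor world, the epsilon-greedy rule, the zero initialization of Q, and all parameter values are assumptions made for this sketch; none of them come from the slides.

```python
import random

# Hypothetical toy world (an assumption, not from the lecture): a 5-state
# corridor; action 0 = left, 1 = right; entering state 4 pays reward 1.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def act_in_world(s, a):
    """Return (reward, next_state) for taking action a in state s."""
    s_next = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    return (1.0 if s_next == GOAL else 0.0), s_next

def pick_next_action(Q, s, epsilon=0.1):
    """Epsilon-greedy selection with random tie-breaking (one possible choice)."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    best = max(Q[s])
    return random.choice([a for a in range(N_ACTIONS) if Q[s][a] == best])

def q_learn(episodes=500, gamma=0.9, alpha=0.5, seed=0):
    random.seed(seed)
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # zeros instead of random
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            a = pick_next_action(Q, s)
            r, s_next = act_in_world(s, a)
            # Off-policy target: the best action value in the next state.
            target = r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q

Q = q_learn()
```

In this deterministic corridor the learned values for moving right should approach 1, γ, γ², γ³ as you step back from the goal.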

SARSA-learning algorithm

Algorithm: SARSA_learn
Inputs: State space S; Act. space A
        Discount γ (0<=γ<1); Learning rate α (0<=α<1)
Outputs: Q

Q=random(|S|,|A|); // Initialize
s=get_current_world_state()
a=pick_next_action(Q,s)
Repeat {
  (r,s’)=act_in_world(a)
  a’=pick_next_action(Q,s’)
  Q(s,a)=Q(s,a)+α*(r+γ*Q(s’,a’)-Q(s,a))
  a=a’; s=s’;
} Until (bored)
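A minimal Python sketch of SARSA_learn, under the same kind of assumptions: the corridor world, the epsilon-greedy rule, zero-initialized Q, and the parameter values are all made up for illustration. The only structural difference from Q-learning is that the next action is picked before the update and its value is used in the target.

```python
import random

# Hypothetical toy world (an assumption, not from the lecture): a 5-state
# corridor; action 0 = left, 1 = right; entering state 4 pays reward 1.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def act_in_world(s, a):
    """Return (reward, next_state) for taking action a in state s."""
    s_next = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    return (1.0 if s_next == GOAL else 0.0), s_next

def pick_next_action(Q, s, epsilon=0.1):
    """Epsilon-greedy selection with random tie-breaking (one possible choice)."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    best = max(Q[s])
    return random.choice([a for a in range(N_ACTIONS) if Q[s][a] == best])

def sarsa_learn(episodes=500, gamma=0.9, alpha=0.5, seed=0):
    random.seed(seed)
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # zeros instead of random
    for _ in range(episodes):
        s = 0
        a = pick_next_action(Q, s)
        while s != GOAL:
            r, s_next = act_in_world(s, a)
            a_next = pick_next_action(Q, s_next)
            # On-policy target: the value of the action actually chosen next.
            Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
            s, a = s_next, a_next
    return Q

Q = sarsa_learn()
```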

Radioactive breadcrumbs

•Can now define eligibility traces for SARSA

• In addition to Q(s,a) table, keep an e(s,a) table

•Records “eligibility” (real number) for each state/action pair

•At every step ((s,a,r,s’,a’) tuple):

• Increment e(s,a) for current (s,a) pair by 1

•Update all Q(s’’,a’’) vals in proportion to their e(s’’,a’’)

•Decay all e(s’’,a’’) by factor of λγ

•Leslie Kaelbling calls this the “radioactive breadcrumbs” form of RL
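To see how the crumbs fade, here is a small sketch that tracks only the e table along a made-up path. The values of γ and λ and the state/action names are hypothetical. Decaying at the start of a step is equivalent to decaying at the end of the previous one, because every trace starts at 0.

```python
# Assumed values, for illustration only.
gamma, lam = 0.9, 0.8

# A hypothetical trajectory of distinct (state, action) pairs.
path = [("s0", "right"), ("s1", "right"), ("s2", "up"), ("s3", "right")]

e = {}  # eligibility table; pairs not yet visited are implicitly 0
for pair in path:
    for key in e:
        e[key] *= lam * gamma          # decay left over from the previous step
    e[pair] = e.get(pair, 0.0) + 1.0   # fresh crumb on the current pair

# e now holds the traces as seen by the Q update on the last step:
# a pair visited k steps ago has eligibility (lam * gamma) ** k.
for pair in path:
    print(pair, round(e[pair], 4))
```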

SARSA(λ)-learning alg.

Algorithm: SARSA(λ)_learn
Inputs: S, A, γ (0<=γ<1), α (0<=α<1), λ (0<=λ<1)
Outputs: Q

e(s,a)=0 // for all s, a
s=get_curr_world_st(); a=pick_nxt_act(Q,s)
Repeat {
  (r,s’)=act_in_world(a)
  a’=pick_next_action(Q,s’)
  δ=r+γ*Q(s’,a’)-Q(s,a)
  e(s,a)+=1
  foreach (s’’,a’’) pair in (S×A) {
    Q(s’’,a’’)=Q(s’’,a’’)+α*e(s’’,a’’)*δ
    e(s’’,a’’)*=λγ
  }
  a=a’; s=s’;
} Until (bored)
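One way the SARSA(λ) loop could be written in Python. As before, the corridor world, the epsilon-greedy rule, zero-initialized Q, and the parameter values are assumptions for this sketch. Resetting the traces at the start of each episode is also an assumption, since the slide only shows a single continuing loop.

```python
import random

# Hypothetical toy world (an assumption, not from the lecture): a 5-state
# corridor; action 0 = left, 1 = right; entering state 4 pays reward 1.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def act_in_world(s, a):
    """Return (reward, next_state) for taking action a in state s."""
    s_next = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    return (1.0 if s_next == GOAL else 0.0), s_next

def pick_next_action(Q, s, epsilon=0.1):
    """Epsilon-greedy selection with random tie-breaking (one possible choice)."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    best = max(Q[s])
    return random.choice([a for a in range(N_ACTIONS) if Q[s][a] == best])

def sarsa_lambda_learn(episodes=200, gamma=0.9, alpha=0.1, lam=0.8, seed=0):
    random.seed(seed)
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        # Assumption: traces are cleared at the start of every episode.
        e = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
        s = 0
        a = pick_next_action(Q, s)
        while s != GOAL:
            r, s_next = act_in_world(s, a)
            a_next = pick_next_action(Q, s_next)
            delta = r + gamma * Q[s_next][a_next] - Q[s][a]
            e[s][a] += 1.0  # accumulating trace
            for si in range(N_STATES):
                for ai in range(N_ACTIONS):
                    Q[si][ai] += alpha * e[si][ai] * delta
                    e[si][ai] *= lam * gamma
            s, a = s_next, a_next
    return Q

Q = sarsa_lambda_learn(episodes=1)
```

After a single episode from an all-zero Q, every right-moving pair on the way to the goal already has a positive value, because the one nonzero δ at the goal is shared out along the trail. One-step SARSA would have changed only Q[3][1] in that episode.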

The trail of crumbs

[Figure: the trail of crumbs, from Sutton & Barto, Sec 7.5]

[Figure: the trail of crumbs with λ=0]

[Figure: eligibility for a single state, e(s_i,a_j), marked at the 1st visit, 2nd visit, ...]


Eligibility trace followup

•Eligibility trace allows:

•Tracking where the agent has been

•Backup of rewards over longer periods

•Credit assignment: state/action pairs rewarded for having contributed to getting to the reward

•Why does it work?

The “forward view” of elig.

•Original SARSA did “one step” backup:

[Diagram: Q(s,a), r_t, Q(s_{t+1},a_{t+1}), rest of trajectory; arrow labeled “Info backup”]

•Could also do a “two step backup”:

[Diagram: Q(s,a), r_t, r_{t+1}, Q(s_{t+2},a_{t+2}), rest of trajectory; arrow labeled “Info backup”]

•Or even an “n step backup”
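Written out as update targets for Q(s_t,a_t), the one-, two-, and n-step backups can be sketched as follows. The symbol R_t^{(n)} is notation introduced here, not taken from the slides; the indexing follows the SARSA update, with r_t the reward received after taking a_t in s_t.

```latex
\begin{align*}
R_t^{(1)} &= r_t + \gamma\, Q(s_{t+1}, a_{t+1}) \\
R_t^{(2)} &= r_t + \gamma\, r_{t+1} + \gamma^{2}\, Q(s_{t+2}, a_{t+2}) \\
R_t^{(n)} &= \sum_{k=0}^{n-1} \gamma^{k}\, r_{t+k} + \gamma^{n}\, Q(s_{t+n}, a_{t+n})
\end{align*}
```

Each target would be used in the usual way: Q(s_t,a_t) = Q(s_t,a_t) + α*(R_t^{(n)} - Q(s_t,a_t)).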

The “forward view” of elig.

•Small-step backups (n=1, n=2, etc.) are slow and nearsighted

•Large-step backups (n=100, n=1000, n=∞) are expensive and may miss near-term effects

•Want a way to combine them

•Can take a weighted average of different backups

•E.g.:

[Diagram: two backups averaged with weights 1/3 and 2/3]
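A small numeric sketch of averaging two backups. Which backups receive the 1/3 and 2/3 weights is not recoverable from the slide, so the pairing below (one-step and two-step) is an assumption, as are the rewards, Q values, and γ. The helper `n_step_target` is a hypothetical name.

```python
def n_step_target(rewards, q_after, gamma, n):
    """n-step backup target: n discounted rewards plus a discounted Q estimate.

    rewards[k] is r_{t+k}; q_after[n] is Q(s_{t+n}, a_{t+n}).
    """
    discounted = sum(gamma ** k * rewards[k] for k in range(n))
    return discounted + gamma ** n * q_after[n]

# Hypothetical numbers, chosen only to keep the arithmetic easy to follow.
gamma = 0.5
rewards = [1.0, 2.0]          # r_t, r_{t+1}
q_after = {1: 4.0, 2: 8.0}    # Q(s_{t+1},a_{t+1}) and Q(s_{t+2},a_{t+2})

one_step = n_step_target(rewards, q_after, gamma, n=1)  # 1 + 0.5*4 = 3.0
two_step = n_step_target(rewards, q_after, gamma, n=2)  # 1 + 0.5*2 + 0.25*8 = 4.0

# Assumed pairing: weight 1/3 on the one-step backup, 2/3 on the two-step one.
mixed = one_step / 3 + 2 * two_step / 3
print(one_step, two_step, mixed)
```

Any set of non-negative weights that sums to 1 gives a valid mixed target in the same way.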

The “forward view” of elig.

•How do you know which number of steps to avg over? And what the weights should be?

•Accumulating eligibility traces are just a clever way to easily avg. over all n:

[Diagram: backups for every n, weighted by λ^0, λ^1, λ^2, ..., λ^(n-1)]
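For reference, Sutton & Barto write this average with an extra factor of (1-λ), so that the weights λ^(n-1) sum to 1. The symbols R_t^{(n)} and R_t^{λ} are notation assumed here rather than taken from the slides.

```latex
\begin{align*}
R_t^{(n)} &= \sum_{k=0}^{n-1} \gamma^{k}\, r_{t+k} + \gamma^{n}\, Q(s_{t+n}, a_{t+n}) \\
R_t^{\lambda} &= (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1}\, R_t^{(n)},
\qquad (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} = 1 \quad (0 \le \lambda < 1)
\end{align*}
```

With λ=0 only the one-step backup keeps any weight, which matches plain SARSA.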

Replacing traces

•The kind just described are accumulating e-traces

•Every time you go back to a state/action, add 1 more to e(s,a)

•There are also replacing eligibility traces

•Every time you go back to a state/action, reset e(s,a) to 1

•Works better sometimes

Sutton & Barto, Sec 7.8
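The difference shows up as soon as a pair is revisited before its trace has faded. The sketch below follows e(s,a) for a single pair under both rules; the visit pattern and the values of γ and λ are made up for illustration, and `trace_history` is a hypothetical helper.

```python
def trace_history(visits, gamma, lam, replacing):
    """Return e(s,a) for one pair at each step, given when it was visited.

    visits[t] is True when the pair is the current (s,a) at step t.
    """
    e, history = 0.0, []
    for visited in visits:
        e *= lam * gamma                       # decay from the previous step
        if visited:
            e = 1.0 if replacing else e + 1.0  # reset to 1, or add 1
        history.append(e)
    return history

# Hypothetical pattern: the pair is visited three steps in a row, then left.
visits = [True, True, True, False, False]
acc = trace_history(visits, gamma=0.5, lam=0.5, replacing=False)
rep = trace_history(visits, gamma=0.5, lam=0.5, replacing=True)

print(acc)  # [1.0, 1.25, 1.3125, 0.328125, 0.08203125]
print(rep)  # [1.0, 1.0, 1.0, 0.25, 0.0625]
```

The accumulating trace climbs above 1 on repeated visits, while the replacing trace is capped at 1.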
