Josh: Tic-tac-toe

• Josh: Tic-tac-toe

• Where might you find bandit problems?

• Clinical Trials
• Feynman: restaurants
• E-advertising (Yahoo, MSFT)
• Rewards to users (Diabetes study, DMN)

• Utility functions

Action-Value Methods

• ε-greedy
• Vs. running update?
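As a minimal sketch of how ε-greedy selection and a running (incremental) update fit together: the class name `EpsilonGreedy` and its parameters are hypothetical, chosen for illustration, and the estimates are plain sample averages.

```python
import random


class EpsilonGreedy:
    """Sample-average action-value estimates with epsilon-greedy selection."""

    def __init__(self, n_arms, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.q = [0.0] * n_arms      # current value estimate per arm
        self.counts = [0] * n_arms   # number of pulls per arm
        self.rng = random.Random(seed)

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.q))
        return max(range(len(self.q)), key=lambda a: self.q[a])

    def update(self, arm, reward):
        # Running mean: Q <- Q + (r - Q) / k, so no reward history is stored.
        self.counts[arm] += 1
        self.q[arm] += (reward - self.q[arm]) / self.counts[arm]
```

The running update gives the same estimate as averaging all stored rewards, but needs only the current estimate and a count per arm.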

Which is best?

Softmax

• Gibbs / Boltzmann distribution
• Action a on tth play
• Temperature goes to zero
  – (may be harder to set)
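A small sketch of softmax (Gibbs / Boltzmann) action selection, assuming value estimates are kept in a plain list. The function names and the `temperature` parameter name are illustrative choices, not taken from the slides.

```python
import math
import random


def softmax_probs(q_values, temperature):
    """Boltzmann probabilities: p(a) is proportional to exp(Q(a) / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the max before exponentiating for numerical stability.
    m = max(q_values)
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_select(q_values, temperature, rng=random):
    """Draw one action index according to the Boltzmann probabilities."""
    probs = softmax_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

With a high temperature the probabilities are close to uniform; as the temperature approaches zero, almost all probability mass moves to the action with the highest estimate, so selection becomes effectively greedy.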

Nonstationary

• Exponential, recency-weighted average

• Learning rate can vary per step – Why?

• How do you know if the task is stationary?
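To make the recency-weighted average concrete, here is a hedged sketch comparing a constant step size with the 1/k step size of a sample average. The function names and the default `alpha=0.1` are assumptions for illustration.

```python
def update_constant_alpha(q, reward, alpha=0.1):
    """Exponential recency-weighted average: Q <- Q + alpha * (r - Q)."""
    return q + alpha * (reward - q)


def update_sample_average(q, reward, k):
    """Sample average after the k-th reward (step size 1/k), for comparison."""
    return q + (reward - q) / k


# Rewards are 0 for 100 steps, then the task changes and rewards become 1.
q_const, q_avg = 0.0, 0.0
for k, reward in enumerate([0.0] * 100 + [1.0] * 100, start=1):
    q_const = update_constant_alpha(q_const, reward)
    q_avg = update_sample_average(q_avg, reward, k)
```

After the change, the constant-step estimate ends close to 1 while the sample average sits at 0.5, because a 1/k step size gives every past reward equal weight. Varying the step size per step is one way to move between these two behaviours.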

Initialization

• Optimistic
• Pessimistic
• Something else?
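The effect of the initial estimates can be seen in a toy run. This sketch assumes a purely greedy agent, deterministic rewards, and sample-average updates; `run_greedy` and the reward values are made up for illustration.

```python
def run_greedy(true_rewards, initial_value, steps=20):
    """Purely greedy agent with sample-average updates on deterministic rewards."""
    n = len(true_rewards)
    q = [initial_value] * n
    counts = [0] * n
    pulls = []
    for _ in range(steps):
        arm = max(range(n), key=lambda a: q[a])  # ties go to the lowest index
        counts[arm] += 1
        q[arm] += (true_rewards[arm] - q[arm]) / counts[arm]
        pulls.append(arm)
    return pulls


rewards = [0.2, 0.8, 0.5]
optimistic = run_greedy(rewards, initial_value=5.0)  # estimates start too high
pessimistic = run_greedy(rewards, initial_value=0.0)  # estimates start too low
```

With optimistic estimates every arm disappoints once, so the agent tries all three and then settles on the best one. Starting from zero, the first arm already looks better than the untried ones, so the agent never leaves it.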

Teaching

• http://www.youtube.com/watch?v=VTbbYLvhDSM

• N-armed bandit
• Multiple n-armed bandits (contextual bandit)
  – Bei’s research problem

• Reinforcement Learning

• Unfortunately, interval estimation methods are problematic in practice because of the complexity of the statistical methods used to estimate the confidence intervals.

• There is also a well-known algorithm for computing the Bayes optimal way to balance exploration and exploitation. This method is computationally intractable when done exactly, but there may be efficient ways to approximate it.

Bandit Algorithms

• Goal: minimize regret
• Regret: defined in terms of average reward
• The average reward of the best action is μ*, and that of any other action j is μj. There are K total actions. Tj(n) is the number of times we tried action j during our n executed actions.
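With the symbols just defined, the expected regret after n plays can be written as follows. This is the standard stochastic-bandit definition, offered as a reconstruction since the slide's own formula is not in the transcript.

```latex
R_n \;=\; \mu^{*}\, n \;-\; \sum_{j=1}^{K} \mu_j \,\mathbb{E}\!\left[T_j(n)\right]
    \;=\; \sum_{j=1}^{K} \left(\mu^{*} - \mu_j\right) \mathbb{E}\!\left[T_j(n)\right]
```

The second form follows because the counts T_j(n) sum to n; it shows that regret accumulates only through pulls of suboptimal actions.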

• Calculate confidence intervals (leverage Chernoff-Hoeffding bound)

• For each action j, record average reward xj and the number of times we’ve tried it as nj. n is the total number of actions we’ve tried.

• Try the action that maximizes xj + √(2 ln n / nj)
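A minimal sketch of the UCB1 rule, using the usual index of average reward plus √(2 ln n / nj). The class layout and names are assumptions for illustration; rewards are assumed to lie in [0, 1].

```python
import math


class UCB1:
    """UCB1: play the arm with the highest upper confidence bound."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms    # n_j: times each arm was tried
        self.means = [0.0] * n_arms   # x_j: average reward of each arm
        self.total = 0                # n: total number of plays

    def _index(self, arm):
        # Average reward plus a bonus that shrinks as the arm is tried more often.
        bonus = math.sqrt(2.0 * math.log(self.total) / self.counts[arm])
        return self.means[arm] + bonus

    def select(self):
        # Play every arm once first; the index is undefined for n_j = 0.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        return max(range(len(self.counts)), key=self._index)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.total += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```

There is no exploration parameter to tune: rarely tried arms keep a large bonus, so they are revisited occasionally, while the bonus of frequently tried arms shrinks toward zero.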

UCB1 regret

UCB1 - Tuned

• Can compute sample variance for each action, σj

• Easy hack for non-stationary environments?
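The variance-based bonus of UCB1-Tuned can be sketched as below, following the form usually given for the algorithm. The function name and argument names are hypothetical, and the cap of 1/4 assumes rewards in [0, 1].

```python
import math


def ucb1_tuned_index(mean, sum_sq, n_j, n):
    """UCB1-Tuned index for one arm.

    mean   -- average reward of the arm
    sum_sq -- sum of the arm's squared rewards
    n_j    -- number of times the arm was tried (must be > 0)
    n      -- total number of plays so far
    """
    # Upper bound on the arm's variance: sample variance plus a confidence term.
    variance_bound = sum_sq / n_j - mean ** 2 + math.sqrt(2.0 * math.log(n) / n_j)
    # 1/4 is the largest possible variance of a reward bounded in [0, 1].
    return mean + math.sqrt((math.log(n) / n_j) * min(0.25, variance_bound))
```

Because the variance term is capped at 1/4, this bonus is never larger than the plain UCB1 bonus, so arms with low observed variance are explored less.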

Adversarial Bandits

• Optimism can be naïve

• Reward vectors must be fixed in advance of the algorithm running.
• Payoffs can depend adversarially on the algorithm the player decides to use.
• Ex: if the player chooses the strategy of always picking the first action, then the adversary can just make that the worst possible action to choose.
• Rewards cannot depend on the random choices made by the player during the game.

• Why can’t the adversary just make all the payoffs zero? (or negative!)

• In this event the player won’t get any reward, but he can emotionally and psychologically accept this fate. If he never stood a chance to get any reward in the first place, why should he feel bad about the inevitable result?

• What a truly cruel adversary wants is, at the end of the game, to show the player what he could have won, and have it far exceed what he actually won. In this way the player feels regret for not using a more sensible strategy, and likely returns to the casino to lose more money.

• The trick that the player has up his sleeve is precisely the randomness in his choice of actions, and he can use its objectivity to partially overcome even the nastiest of adversaries.

• Exp3: Exponential-weight algorithm for Exploration and Exploitation
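A hedged sketch of Exp3 in its commonly stated form: mix a weight-proportional distribution with uniform exploration, then boost the chosen arm's weight by an importance-weighted reward estimate. The class and parameter names are assumptions, and rewards are assumed to lie in [0, 1].

```python
import math
import random


class Exp3:
    """Exponential-weight algorithm for Exploration and Exploitation."""

    def __init__(self, n_arms, gamma=0.1, seed=None):
        if not 0 < gamma <= 1:
            raise ValueError("gamma must be in (0, 1]")
        self.gamma = gamma
        self.weights = [1.0] * n_arms
        self.rng = random.Random(seed)

    def probabilities(self):
        # Mix the weight-proportional distribution with uniform exploration.
        total = sum(self.weights)
        k = len(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / k for w in self.weights]

    def select(self):
        probs = self.probabilities()
        return self.rng.choices(range(len(probs)), weights=probs, k=1)[0]

    def update(self, arm, reward):
        # Call right after select(): weights are unchanged in between, so these
        # are the same probabilities the arm was drawn from.
        probs = self.probabilities()
        estimate = reward / probs[arm]  # importance-weighted reward estimate
        self.weights[arm] *= math.exp(self.gamma * estimate / len(self.weights))
```

The randomness in `select` is what the adversary cannot anticipate, and dividing by the selection probability keeps the reward estimate unbiased for every arm. For long runs the weights would need periodic rescaling to avoid overflow, which this sketch omits.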

k-Meteorologists Problem

• ICML-09, Diuk, Li, and Leffler

• Imagine that you just moved to a new town that has multiple (k) radio and TV stations. Each morning, you tune in to one of the stations to find out what the weather will be like. Which of the k different meteorologists making predictions every morning is the most trustworthy? Let us imagine that, to decide on the best meteorologist, each morning for the first M days you tune in to all k stations and write down the probability that each meteorologist assigns to the chances of rain. Then, every evening you write down a 1 if it rained, and a 0 if it didn’t. Can this data be used to determine who is the best meteorologist?

• Related to expert algorithm selection
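One naive way to use the recorded data is to rank the meteorologists by mean squared error between forecast and outcome. This is an illustrative sketch under that assumption, not the algorithm from the cited paper, and the function name and sample numbers are made up.

```python
def rank_meteorologists(predictions, outcomes):
    """Rank meteorologists from most to least accurate by mean squared error.

    predictions[i][d] -- probability of rain given by meteorologist i on day d
    outcomes[d]       -- 1 if it rained on day d, 0 otherwise
    """
    scores = []
    for i, forecasts in enumerate(predictions):
        loss = sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(outcomes)
        scores.append((loss, i))
    return [i for _, i in sorted(scores)]


# Three meteorologists observed over four days (hypothetical data).
predictions = [
    [0.5, 0.5, 0.5, 0.5],  # always hedges
    [0.9, 0.1, 0.8, 0.2],  # mostly right
    [0.1, 0.9, 0.2, 0.8],  # mostly wrong
]
outcomes = [1, 0, 1, 0]
ranking = rank_meteorologists(predictions, outcomes)
```

Squared error rewards forecasts that are both confident and correct, so the hedging meteorologist lands between the accurate one and the inaccurate one.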

• PAC Subset Selection in Stochastic Multi-armed Bandits

• ICML-12
• Select best subset of m arms out of n possible
