# Adaptive Bandits: Towards the best history-dependent strategy


Adaptive Bandits: Towards the best history-dependent strategy

Odalric-Ambrym Maillard (INRIA Lille Nord-Europe) and Remi Munos (INRIA Lille Nord-Europe)

Abstract

We consider multi-armed bandit games with possibly adaptive opponents. We introduce models Θ of constraints based on equivalence classes on the common history (information shared by the player and the opponent) which define two learning scenarios: (1) The opponent is constrained, i.e. he provides rewards that are stochastic functions of equivalence classes defined by some model θ ∈ Θ. The regret is measured with respect to (w.r.t.) the best history-dependent strategy. (2) The opponent is arbitrary and we measure the regret w.r.t. the best strategy among all mappings from classes to actions (i.e. the best history-class-based strategy) for the best model in Θ. This allows us to model opponents (case 1) or strategies (case 2) which handle finite memory, periodicity, standard stochastic bandits and other situations. When Θ = {θ}, i.e. only one model is considered, we derive tractable algorithms achieving a tight regret (at time T) bounded by O(√(TAC)), where C is the number of classes of θ. Now, when many models are available, all known algorithms achieving a nice regret O(√T) are unfortunately not tractable and scale poorly with the number of models |Θ|. Our contribution here is to provide tractable algorithms with regret bounded by T^{2/3} C^{1/3} log(|Θ|)^{1/2}.

1 INTRODUCTION

Designing medical treatments for patients infected by the Human Immunodeficiency Virus (HIV) is challenging due to the ability of the HIV to mutate into new viral strains that become, with time, resistant to a specific drug (Ernst et al., 2006). Thus we need to alternate between drugs. The standard formalism of stochastic bandits (see Robbins, 1952) used for designing medical treatment strategies models each possible drug as an arm (action) and the immediate efficiency of the drug as a reward. In this setting, the rewards are assumed to be i.i.d., thus the optimal strategy is constant in time. However, in the case of adapting viruses like the HIV, no constant strategy (i.e., a strategy that constantly uses the same drug) is good in the long term. We thus need to design new algorithms (together with new performance criteria) to handle a larger class of strategies that may depend on the whole treatment history (i.e., past actions and rewards).

Appearing in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, Fort Lauderdale, FL, USA. Volume 15 of JMLR: W&CP. Copyright 2011 by the authors.

More formally, we consider a sequential decision-making problem with rewards provided by a possibly adaptive opponent. The general game is defined as follows: at each time-step t, the decision-maker (or player, or agent) selects an action a_t ∈ A (where A is a set of A = |A| possible actions), and simultaneously the opponent (or adversary, or environment) chooses a reward function r_t : A → [0, 1]. The agent receives the reward r_t(a_t). In this paper we consider the so-called bandit information setting, where the agent only sees the reward of the chosen action, and not the other rewards provided by the opponent. The goal of the agent is to maximize the cumulative sum of the rewards received, i.e. to choose a sequence of actions (a_t)_{t ≤ T} that maximizes ∑_{t=1}^T r_t(a_t).
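In code, this interaction protocol reads as follows. This is a minimal sketch; the function names are illustrative and not from the paper, and the opponent returns the reward function as a plain list of per-action rewards.

```python
import random

def play_bandit_game(player, opponent, T):
    """Run T rounds of the bandit game.  `player` maps the common history
    to an action index; `opponent` maps it to a reward function, here a
    list with one reward in [0, 1] per action.  Both moves are made
    simultaneously, and only the reward of the chosen action is revealed."""
    history = []            # common history of (action, observed reward) pairs
    total = 0.0
    for t in range(T):
        a = player(history)
        rewards = opponent(history)
        r = rewards[a]                  # bandit feedback: r_t(a_t) only
        history.append((a, r))
        total += r
    return total

# A uniformly random player against an opponent whose arm 0 always pays 1.
random.seed(0)
print(play_bandit_game(lambda h: random.randrange(2), lambda h: [1.0, 0.0], 100))
```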

Motivating Example In order to better understand our goal, consider the following very simple problem, for which no standard bandit algorithm achieves good cumulative rewards.

The set of actions is A = {a, b}, and the opponent is defined by: r(a) = 1 and r(b) = 0 if the last action of the player is b, and r(a) = 0 and r(b) = 1 if the last action is a. Finally, r(a) = r(b) = 1 for the first action.

In that game, playing a constant action a (or b) yields a cumulative reward of T/2 at time T. On the other hand, a player that switches its action at each round obtains a total reward of T, which is optimal. Although this opponent is very simple, the loss of using any usual multi-armed bandit algorithm (such as UCB (Auer et al., 2002) or EXP3 (Auer et al., 2003)) instead of this simple switching strategy is linear in T.
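This example can be simulated directly. The sketch below encodes actions a, b as 0, 1 and implements the reward rule as stated; it shows the switching strategy collecting the full reward T while a constant strategy falls far behind.

```python
def switching_opponent(history):
    """The opponent of the example: it rewards the action that differs
    from the player's last one (actions a, b encoded as 0, 1)."""
    if not history:
        return [1.0, 1.0]              # first round: both actions pay 1
    last = history[-1][0]
    return [0.0, 1.0] if last == 0 else [1.0, 0.0]

def run(player, T):
    history, total = [], 0.0
    for _ in range(T):
        a = player(history)
        r = switching_opponent(history)[a]
        history.append((a, r))
        total += r
    return total

constant = lambda h: 0                              # always play the same arm
switch = lambda h: 0 if not h else 1 - h[-1][0]     # alternate arms each round

print(run(switch, 1000))     # the switching strategy collects the full T
print(run(constant, 1000))   # a constant strategy falls far behind
```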

Adaptive Opponents In this paper, we consider the setting where the opponent is adaptive, in the sense that the reward functions can be arbitrary measurable functions of the past common history, where by common history we mean all the observed rewards (r_s(a_s))_{s<t} and past actions (a_s)_{s<t}.


For some equivalence relation ∼ on histories, we write [h] for the equivalence class of the history h w.r.t. ∼. We introduce ∼-constrained opponents, that are such that the reward functions only depend on the equivalence classes of the history, i.e. r_t(a) = r([h
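A constrained opponent of this kind can be sketched in a few lines. Below, a hypothetical class function θ maps a history to the last player action (an order-1 memory model), and the opponent draws Bernoulli rewards whose means depend only on that class; all names and the `means` table are illustrative assumptions, not from the paper.

```python
import random

def theta_last_action(history):
    """A hypothetical class function theta: two histories are equivalent
    iff they end with the same player action (order-1 memory)."""
    return history[-1][0] if history else None

def constrained_opponent(theta, class_means, rng):
    """A theta-constrained opponent: each arm's reward is a Bernoulli draw
    whose mean depends only on the class [h] of the current history."""
    def opponent(history):
        means = class_means[theta(history)]
        return [float(rng.random() < m) for m in means]
    return opponent

# Hypothetical class means: after action 0, arm 1 is good, and vice versa.
means = {None: [0.9, 0.1], 0: [0.1, 0.9], 1: [0.9, 0.1]}
opp = constrained_opponent(theta_last_action, means, random.Random(0))
print(opp([]), opp([(0, 1.0)]))   # reward vectors for two different classes
```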




number of classes C of H/∼). The following result easily follows from Bubeck (2010).

Theorem 2 Let sup denote the supremum taken over all ∼-constrained opponents and inf the infimum over all players; then the stochastic ∼-regret is lower-bounded as:

sup_{∼ : |H/∼| = C} inf_{algo} sup_{opp} R_T ≥ (1/20) √(TAC).

Let sup denote the supremum taken over all possible opponents; then the adversarial ∼-regret is lower-bounded as:

sup_{∼ : |H/∼| = C} inf_{algo} sup_{opp} R_T ≥ (1/20) √(TAC).

3 PLAYING USING A POOL OF MODELS

After this introductory section, we now turn to the main challenge of this paper. When playing against a given opponent, its model of constraints may not be known. It is thus natural to consider several equivalence relations defined by a pool of class functions (models) Θ, and to assume that the opponent plays with the model induced by some θ ∈ Θ. We consider two cases: either the opponent is a θ-constrained opponent with θ ∈ Θ, or the opponent is arbitrary, and we will compare our performance to that of the best model in Θ.
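A concrete pool Θ can be built from the finite-memory models mentioned in the introduction. The sketch below (names are illustrative assumptions) defines an order-m memory model, under which two histories are equivalent iff their last m player actions coincide, so with A actions the order-m model has C = A^m classes.

```python
from itertools import product

def memory_model(m):
    """Hypothetical class function for an order-m memory model: two
    histories are equivalent iff their last m player actions coincide."""
    def theta(actions):
        return tuple(actions[-m:]) if m > 0 else ()
    return theta

# A pool Theta of nested memory models; with A actions the order-m model
# has C = A**m classes (for histories of length >= m).
A = 2
pool = [memory_model(m) for m in range(3)]
for m, theta in enumerate(pool):
    classes = {theta(list(h)) for h in product(range(A), repeat=5)}
    print(m, len(classes))   # 1, 2 and 4 classes for m = 0, 1, 2
```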

We accordingly define two notions of regret. If we consider a θ-constrained opponent, where θ ∈ Θ, then one can define the so-called stochastic θ-regret as:

R_T = E( ∑_{t=1}^T max_{a∈A} [h


We provide two algorithms, based respectively on UCB and Exp3, that may be used by each individual expert θ ∈ Θ:

θ-UCB is defined as before (see Figure 1), except that instead of step (4) we define π_t as a Dirac distribution at the recommended action argmax_{a∈A} of the UCB index computed in the current class c_t.

θ-Exp3 is defined as before (see Figure 1), except that in step (1) no action is drawn from π_t (since the meta-algorithm chooses a_t ∼ q_t), and step (2) is replaced by:

l_{c_t}(a) = ((1 − r_t(a_t)) / q_t(a)) I{a_t = a} I{c = [h
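The per-class structure of the θ-UCB expert can be sketched as follows: one UCB1 instance per equivalence class of the model θ. This is an assumption-laden sketch, using the standard UCB1 bonus rather than the paper's exact index, and all names are illustrative.

```python
import math

class ThetaUCB:
    """Sketch of a theta-UCB expert: one UCB1 index per equivalence class
    of the model theta (the exact index in the paper may differ)."""

    def __init__(self, theta, n_actions):
        self.theta, self.A = theta, n_actions
        self.stats = {}   # class -> (pull counts, reward sums)

    def _get(self, c):
        if c not in self.stats:
            self.stats[c] = ([0] * self.A, [0.0] * self.A)
        return self.stats[c]

    def recommend(self, history):
        """Dirac recommendation: the arm maximizing the index in class [h]."""
        n, s = self._get(self.theta(history))
        t = sum(n) + 1
        def index(a):
            if n[a] == 0:
                return math.inf            # pull each arm once per class
            return s[a] / n[a] + math.sqrt(2 * math.log(t) / n[a])
        return max(range(self.A), key=index)

    def update(self, history, a, r):
        """Record the reward observed for action a in the class of history."""
        n, s = self._get(self.theta(history))
        n[a] += 1
        s[a] += r
```

Run with the last-action class function against the switching opponent of the motivating example, this expert quickly learns to alternate arms within each class.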


FUTURE WORK

We do not know whether, in the case of a pool of models Θ, there exist tractable algorithms with Θ-regret better than T^{2/3} with logarithmic dependency w.r.t. |Θ|. Here we have used a meta Exp4 algorithm, but we could have used other meta-algorithms using a mixture q_t(a) = ∑_θ p_t(θ) π_t^θ(a) (where the p_t are internal weights of the meta-algorithm). However, when computing the approximation term of the best model θ* by models θ ∈ Θ (see the supplementary material), it seems that the Θ-regret cannot be strongly reduced without making further assumptions on the structure of the game, since in general the mixture distribution q_t may not converge to the distribution π_t^{θ*} proposed by the best model θ*. This question remains open.

Acknowledgment

This work has been supported by the French National Research Agency (ANR) through the COSINUS program (project EXPLO-RA number ANR-08-COSI-004).

A SOME TECHNICAL PROOFS

In order to relate the cumulative reward of the Exp4 algorithm to that of the best expert on top of which it is run, we state the following simple lemma. Note that in our case, since the expert distributions π_t^θ are not fixed in advance but are random variables, we cannot apply the original result of Auer et al. (2003) for fixed expert advice, but need to adapt it. The proof easily follows from the original proof of Theorem 7.1 in Auer et al. (2003), and is reported in the supplementary material.

Lemma 1 For any γ ∈ (0, 1] and for any family of experts which includes the uniform expert, such that all expert advice is adapted to the filtration of the past, one has:

max_θ ∑_{t=1}^T E_{a_1,…,a_{t−1}}( E_{a_t ∼ π_t^θ}( r_t(a_t) ) ) − E_{a_1,…,a_T}( ∑_{t=1}^T r_t(a_t) ) ≤ (e − 1) γ T + (A log|Θ|) / γ.
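The mixing step that Exp4 performs over the experts of the lemma, q_t(a) = ∑_θ p_t(θ) π_t^θ(a), can be sketched as follows. This is a standard Exp4-style combination with γ-uniform exploration; the parameter names and expert labels are illustrative, not the paper's exact parameterization.

```python
def exp4_mixture(advice, weights, gamma):
    """Mix the experts' distributions pi_t^theta with internal weights
    p_t(theta), then add gamma-uniform exploration over the actions."""
    n_actions = len(next(iter(advice.values())))
    total_w = sum(weights.values())
    q = [0.0] * n_actions
    for theta, pi in advice.items():
        for a, p in enumerate(pi):
            q[a] += (weights[theta] / total_w) * p
    return [(1 - gamma) * qa + gamma / n_actions for qa in q]

advice = {"memory-0": [0.5, 0.5], "memory-1": [0.1, 0.9]}
weights = {"memory-0": 1.0, "memory-1": 3.0}
print(exp4_mixture(advice, weights, gamma=0.1))   # a valid mixed distribution
```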

A.1 The Rebel-bandit Setting

We now introduce the setting of Rebel bandits, which may be of independent interest. It will be used to compute the model-based regret of the Exp4 algorithm. In this setting, we consider that at time t the player proposes a probability distribution π_t over the arms, but actually receives the reward corresponding to an action drawn from another distribution q_t, the probability distribution proposed by the meta-algorithm.
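One round of this protocol can be sketched as below: the player proposes π, the action is drawn from the meta distribution q, and the loss estimate is importance-weighted by q (as in the θ-Exp3 update). The function name and return values are illustrative assumptions, not the paper's notation.

```python
import random

def rebel_round(pi, q, reward_fn, rng):
    """One round of the rebel-bandit protocol: the player proposes pi,
    but the action is actually drawn from the meta distribution q."""
    n = len(q)
    a = rng.choices(range(n), weights=q)[0]    # action actually played
    r = reward_fn(a)
    # Importance-weighted loss estimate: E_{a~q}[l_hat(b)] = 1 - r(b).
    l_hat = [(1.0 - r) / q[a] if b == a else 0.0 for b in range(n)]
    # Expected reward the player's own proposal pi would have earned.
    proposed_value = sum(p * reward_fn(b) for b, p in enumerate(pi))
    return a, r, l_hat, proposed_value

rng = random.Random(0)
a, r, l_hat, v = rebel_round([0.9, 0.1], [0.25, 0.75], lambda b: float(b == 0), rng)
print(a, r, l_hat, v)
```

The unbiasedness of the loss estimate holds under q, not under the proposed π, which is exactly the mismatch the rebel-bandit analysis has to control.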

Following (4), we define the best model of the pool:

θ* = argmax_{θ∈Θ} sup_{g : H/θ → A} E( ∑_{t=1}^T r_t(g([h


Now, using the facts that log x ≤ x − 1 and exp(x) ≤ 1 + x + x² for all x ≤ 0, the first term on the right-hand side of (5) is bounded by:

(η_t / 2) E_{a_t}( l_{t,c_t}(a)² ).

Thus, considering that

π_t(a) = exp( −η_t ∑_{s=1}^{t−1} l_{s,c_t}(a) ) / ∑_{a'} exp( −η_t ∑_{s=1}^{t−1} l_{s,c_t}(a') ),

we can introduce the quantity

Φ_t(η, c) = −(1/η) log( (1/A) ∑_a exp( −η ∑_{s=1}^t l_{s,c}(a) ) ),

so that the second term on the right-hand side of (5) is Φ_{t−1}(η_t, c_t) − Φ_t(η_t, c_t). Thus we deduce that:

R_T^q(θ) ≤ ∑_{t=1}^T E_{a_1,…,a_{t−1}}[ E_{a_t∼q_t}( (η_t/2) (1 − r_t(a_t))² π_t(a_t) / q_t(a_t)² ) + E_{a_t}( Φ_{t−1}(η_t, c_t) − Φ_t(η_t, c_t) ) − E_{a_t}( l_{t,c_t}(a_{c_t}) ) ],

where we have replaced l_{t,c_t}(a) by its definition.

Step 3. Now we consider the first term on the right-hand side of the previous equation, which is bounded by:

E_{a_t∼q_t}( (1 − r_t(a_t))² π_t(a_t) / q_t(a_t)² ) ≤ ∑_a π_t(a) / q_t(a),

since (1 − r_t(a))² ≤ 1.

Step 4. Introduce the equivalence classes. We now consider the second term, defined with the Φ functions. Let us introduce the notation I_c(t) = ∑_{s=1}^t I{c = [h


References

J.-Y. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits. In 22nd Annual Conference on Learning Theory, 2009.

J.-Y. Audibert, R. Munos, and Cs. Szepesvari. Exploration-exploitation trade-off using variance estimates in multi-armed bandits. Theoretical Computer Science, 2008.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235-256, 2002.

Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32:48-77, January 2003. ISSN 0097-5397.

Avrim Blum and Yishay Mansour. From external to internal regret. In Conference on Learning Theory, pages 621-636, 2005.

Avrim Blum and Yishay Mansour. From external to internal regret. Journal of Machine Learning Research, 8:1307-1324, December 2007.

S. Bubeck. Bandits Games and Clustering Foundations. PhD thesis, Universite Lille 1, 2010.

Nicolo Cesa-Bianchi and Gabor Lugosi. Potential-based algorithms in on-line prediction and game theory. Machine Learning, 51(3):239-261, 2003. ISSN 0885-6125.

Damien Ernst, Guy-Bart Stan, Jorge Goncalves, and Louis Wehenkel. Clinical data based optimal STI strategies for HIV; a reinforcement learning approach. In Machine Learning Conference of Belgium and the Netherlands (Benelearn), pages 65-72, 2006.

Dean Foster and Rakesh Vohra. Asymptotic calibration. Biometrika, 85:379-390, 1996.

Dean P. Foster and Rakesh Vohra. Regret in the on-line decision problem. Games and Economic Behavior, 29(1-2):7-35, 1999. ISSN 0899-8256.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT '95: Proceedings of the Second European Conference on Computational Learning Theory, pages 23-37, London, UK, 1995. Springer-Verlag. ISBN 3-540-59119-2.

Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127-1150, 2000. ISSN 0012-9682.

Marcus Hutter. Feature reinforcement learning: Part I: Unstructured MDPs. Journal of Artificial General Intelligence, 1:3-24, 2009.

Varun Kanade, H. Brendan McMahan, and Brent Bryan. Sleeping experts and bandits with stochastic action availability and adversarial rewards. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, pages 272-279, 2009.

Robert D. Kleinberg, Alexandru Niculescu-Mizil, and Yogeshwer Sharma. Regret bounds for sleeping experts and bandits. In Conference on Learning Theory, July 2008.

Ehud Lehrer and Dinah Rosenberg. A wide range no-regret theorem. Game Theory and Information, EconWPA, 2003.

Ronald Ortner. Online regret bounds for Markov decision processes with deterministic transitions. In Proceedings of the 19th International Conference on Algorithmic Learning Theory, pages 123-137, Berlin, Heidelberg, 2008. Springer-Verlag.

H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527-535, 1952.

Daniil Ryabko and Marcus Hutter. On the possibility of learning in reactive environments with arbitrary dependence. Theoretical Computer Science, 405(3):274-284, 2008. ISSN 0304-3975.

Gilles Stoltz. Incomplete Information and Internal Regret in Prediction of Individual Sequences. PhD thesis, Universite Paris-Sud, Orsay, France, May 2005.