Adaptive Bandits: Towards the best history-dependent strategy

Odalric-Ambrym Maillard, INRIA Lille Nord-Europe
Rémi Munos, INRIA Lille Nord-Europe

Appearing in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, Fort Lauderdale, FL, USA. Volume 15 of JMLR: W&CP 15. Copyright 2011 by the authors.

Abstract

We consider multi-armed bandit games with possibly adaptive opponents. We introduce models Θ of constraints based on equivalence classes on the common history (information shared by the player and the opponent) which define two learning scenarios: (1) The opponent is constrained, i.e. he provides rewards that are stochastic functions of equivalence classes defined by some model θ* ∈ Θ. The regret is measured with respect to (w.r.t.) the best history-dependent strategy. (2) The opponent is arbitrary and we measure the regret w.r.t. the best strategy among all mappings from classes to actions (i.e. the best history-class-based strategy) for the best model in Θ. This allows us to model opponents (case 1) or strategies (case 2) that handle finite memory, periodicity, standard stochastic bandits and other situations. When Θ = {θ}, i.e. only one model is considered, we derive tractable algorithms achieving a tight regret (at time T) bounded by Õ(√(TAC)), where C is the number of classes of θ. When many models are available, all known algorithms achieving a nice regret O(√T) are unfortunately not tractable and scale poorly with the number of models |Θ|. Our contribution here is to provide tractable algorithms with regret bounded by T^{2/3} C^{1/3} log(|Θ|)^{1/2}.

1 INTRODUCTION

Designing medical treatments for patients infected by the Human Immunodeficiency Virus (HIV) is challenging due to the ability of the HIV to mutate into new viral strains that become, with time, resistant to a specific drug (Ernst et al., 2006). Thus we need to alternate between drugs. The standard formalism of stochastic bandits (see Robbins (1952)) used for designing medical treatment strategies models each possible drug as an arm (action) and the immediate efficiency of the drug as a reward. In this setting, the rewards are assumed to be i.i.d., thus the optimal strategy is constant in time. However, in the case of adapting viruses like the HIV, no constant strategy (i.e., a strategy that constantly uses the same drug) is good in the long term. We thus need to design new algorithms (together with new performance criteria) to handle a larger class of strategies that may depend on the whole treatment history (i.e., past actions and rewards).

More formally, we consider a sequential decision-making problem with rewards provided by a possibly adaptive opponent. The general game is defined as follows: at each time step t, the decision-maker (or player, or agent) selects an action a_t ∈ A (where A is a set of A = |A| possible actions), and simultaneously the opponent (or adversary, or environment) chooses a reward function r_t : A → [0, 1]. The agent receives the reward r_t(a_t). In this paper we consider the so-called bandit information setting, where the agent only sees the reward of the chosen action, and not the other rewards provided by the opponent. The goal of the agent is to maximize the cumulative sum of the rewards received, i.e. to choose a sequence of actions (a_t)_{t≤T} that maximizes Σ_{t=1}^T r_t(a_t).
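To make this protocol concrete, here is a minimal sketch of the interaction loop against a possibly adaptive opponent. It is our own illustration rather than code from the paper: the `Opponent` and `UniformPlayer` interfaces, and the representation of the common history as a list of (action, reward) pairs, are assumptions made for readability.

```python
import random

class Opponent:
    """A possibly adaptive opponent: maps the common history to a reward function."""
    def reward_function(self, history):
        # `history` is the list of (action, reward) pairs seen so far by both sides;
        # return a dict mapping every action to a reward in [0, 1].
        raise NotImplementedError

class UniformPlayer:
    """Baseline player choosing actions uniformly at random (illustration only)."""
    def choose(self, history, actions):
        return random.choice(actions)

    def update(self, action, reward):
        pass

def play(player, opponent, actions, T):
    """Run the bandit game for T rounds; return the player's cumulative reward."""
    history, total = [], 0.0
    for t in range(T):
        r_t = opponent.reward_function(history)  # opponent commits to r_t from the history
        a_t = player.choose(history, actions)    # the player chooses a_t "simultaneously"
        reward = r_t[a_t]                        # bandit feedback: only r_t(a_t) is revealed
        player.update(a_t, reward)
        history.append((a_t, reward))            # the common history grows
        total += reward
    return total
```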

Motivating Example  In order to better understand our goal, consider the following very simple problem, for which no standard bandit algorithm achieves good cumulative rewards.

The set of actions is A = {a, b}, and the opponent is defined by: r(a) = 1 and r(b) = 0 if the last action of the player is b, and r(a) = 0 and r(b) = 1 if the last action is a. Finally, r(a) = r(b) = 1 for the first action.

In that game, playing a constant action a (or b) yields a cumulative reward of T/2 at time T. On the other hand, a player that switches its action at each round obtains a total reward of T, which is optimal. Although this opponent is very simple, the loss of using any usual multi-armed bandit algorithm (such as UCB, Auer et al. (2002), or EXP3, Auer et al. (2003)) instead of this simple switching strategy is linear in T.
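To see this gap concretely, the following hypothetical simulation (our code, not the paper's) implements the opponent above and compares a player that ignores the history (uniformly random, hence roughly T/2 reward in expectation) with the player that alternates actions at every round and collects the full reward T.

```python
import random

def switching_reward(last_action, action):
    """Motivating opponent: every action pays 1 on the first round; afterwards the
    action that differs from the player's last action pays 1, the other pays 0."""
    if last_action is None:
        return 1.0
    return 1.0 if action != last_action else 0.0

def run(strategy, T):
    """strategy(t, last_action) -> action in {'a', 'b'}; returns the cumulative reward."""
    total, last = 0.0, None
    for t in range(T):
        a_t = strategy(t, last)
        total += switching_reward(last, a_t)
        last = a_t
    return total

T = 10000
ignore_history = run(lambda t, last: random.choice(['a', 'b']), T)  # about T/2 in expectation
alternate = run(lambda t, last: 'a' if t % 2 == 0 else 'b', T)      # switches every round: T
print(ignore_history / T, alternate / T)
```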

Adaptive Opponents  In this paper, we consider the setting where the opponent is adaptive, in the sense that the reward functions can be arbitrary measurable functions of the past common history, where by common history we mean all the observed rewards (r_s(a_s))_{s<t} together with the actions (a_s)_{s<t} played so far.


For some equivalence relation θ on histories, we write [h]_θ for the equivalence class of the history h w.r.t. θ. We introduce θ-constrained opponents, which are such that the reward functions only depend on the equivalence class of the current history, i.e. r_t(a) = r([h_t]_θ, a).
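As an illustration, here is a small sketch in our own notation: a model θ is represented as a function mapping the common history to one of finitely many classes (here the "last action" model of the motivating example), and a θ-constrained opponent draws Bernoulli rewards whose means depend only on the pair (class, action). The names `last_action_model`, `ConstrainedOpponent` and the particular mean table are illustrative assumptions, not the paper's.

```python
import random

def last_action_model(history):
    """A model theta: maps the common history (a list of (action, reward) pairs)
    to a class; here the class is simply the last action played ('start' if none)."""
    return history[-1][0] if history else 'start'

class ConstrainedOpponent:
    """A theta-constrained opponent: the reward of each action is a stochastic
    function of the current equivalence class only (Bernoulli, class-dependent means)."""
    def __init__(self, model, means):
        self.model = model    # the class function theta
        self.means = means    # means[c][a]: illustrative mean reward of action a in class c

    def reward_function(self, history):
        c = self.model(history)
        return {a: float(random.random() < m) for a, m in self.means[c].items()}

# Means chosen to mimic the motivating example (rewards are deterministic here).
means = {'start': {'a': 1.0, 'b': 1.0},
         'a':     {'a': 0.0, 'b': 1.0},
         'b':     {'a': 1.0, 'b': 0.0}}
opponent = ConstrainedOpponent(last_action_model, means)
```

Such an opponent plugs directly into the interaction loop sketched earlier through its `reward_function` method.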


…where g([h_t]_θ) denotes the action chosen by a class-based strategy g : H/θ → A in the class of the current history.


…number of classes C of H/θ). The following result easily follows from Bubeck (2010).

Theorem 2  Let sup denote the supremum taken over all θ-constrained opponents and inf the infimum over all players; then the stochastic θ-regret is lower-bounded as:

    sup_{θ : |H/θ| = C} inf_{algo} sup_{opp} R_T ≥ (1/20) √(TAC).

Let sup denote the supremum taken over all possible opponents; then the adversarial θ-regret is lower-bounded as:

    sup_{θ : |H/θ| = C} inf_{algo} sup_{opp} R_T ≥ (1/20) √(TAC).

3 PLAYING USING A POOL OF MODELS

After this introductory section, we now turn to the main challenge of this paper. When playing against a given opponent, its model of constraints may not be known. It is thus natural to consider several equivalence relations defined by a pool of class functions (models) Θ, and to assume that the opponent plays with some model induced by some θ ∈ Θ. We consider two cases: either θ* ∈ Θ, i.e. the opponent is a θ*-constrained opponent with θ* ∈ Θ, or the opponent is arbitrary, and we will compare our performance to that of the best model in Θ.

We define accordingly two notions of regret. If we consider a θ*-constrained opponent, where θ* ∈ Θ, then one can define the so-called stochastic Θ-regret as:

    R_T = E( Σ_{t=1}^T max_{a∈A} μ([h_t]_{θ*}, a) − Σ_{t=1}^T r_t(a_t) ),

where μ([h]_{θ*}, a) denotes the mean reward of action a when the current history falls in the class [h]_{θ*}.
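Assuming the reconstruction above, and assuming the class-action mean rewards μ are known (as they are in a simulation, though not to the learner), the quantity inside the expectation can be computed for a single run as in the following sketch; `model` plays the role of θ* and `means[c][a]` that of μ([h]_{θ*}, a).

```python
def empirical_theta_regret(history, model, means):
    """Per-run version of the stochastic regret: sum over rounds of
    (best class-based mean reward) - (reward actually received).

    history: list of (action, reward) pairs produced by one run of the game;
    model:   the class function theta* used by the constrained opponent;
    means:   means[c][a] = mean reward of action a in class c (known only in simulation).
    """
    regret = 0.0
    for t, (action, reward) in enumerate(history):
        c = model(history[:t])                      # class of the history before round t
        regret += max(means[c].values()) - reward   # max_a mu([h_t], a) - r_t(a_t)
    return regret
```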


We provide two algorithms, based respectively on UCB and Exp3, that may be used by each individual expert θ ∈ Θ (a sketch of both experts is given below):

θ-UCB is defined as before (see Figure 1), except that instead of step (4) we define π_t as a Dirac distribution at the recommended action argmax_{a∈A} B_{t,c_t}(a), where B_{t,c_t}(a) denotes the upper-confidence index of action a in the current class c_t.

θ-Exp3 is defined as before (see Figure 1), except that in step (1), no action is drawn from π_t (since the meta algorithm chooses a_t ∼ q_t), and step (2) is replaced by:

    l̃^c_t(a) = ((1 − r_t(a_t)) / q_t(a)) · I{a_t = a} · I{c = [h_t]_θ}.
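The sketch below shows what such per-class experts might look like. It is our reading only: Figure 1 is not reproduced in this transcript, and the exploration term of the UCB index as well as the fixed learning rate eta are arbitrary illustrative choices.

```python
import math
from collections import defaultdict

class ThetaUCBExpert:
    """Per-class UCB expert for a model theta (illustrative sketch, not Figure 1)."""
    def __init__(self, model, actions):
        self.model, self.actions = model, actions
        self.counts = defaultdict(lambda: defaultdict(int))    # counts[c][a]
        self.sums = defaultdict(lambda: defaultdict(float))    # sums[c][a]

    def advice(self, history):
        """Dirac distribution at the action maximising the UCB index in the current class."""
        c = self.model(history)
        t_c = sum(self.counts[c].values()) + 1
        def index(a):
            n = self.counts[c][a]
            if n == 0:
                return float('inf')
            return self.sums[c][a] / n + math.sqrt(2.0 * math.log(t_c) / n)  # assumed index
        best = max(self.actions, key=index)
        return {a: 1.0 if a == best else 0.0 for a in self.actions}

    def update(self, history, action, reward, q):
        # `history` is the common history *before* the action; q (meta distribution) is unused.
        c = self.model(history)
        self.counts[c][action] += 1
        self.sums[c][action] += reward

class ThetaExp3Expert:
    """Per-class exponential weights using the importance-weighted loss
    l^c_t(a) = (1 - r_t(a_t)) / q_t(a) * I{a_t = a} * I{c = [h_t]_theta}."""
    def __init__(self, model, actions, eta=0.05):
        self.model, self.actions, self.eta = model, actions, eta
        self.cum_loss = defaultdict(lambda: defaultdict(float))  # cum_loss[c][a]

    def advice(self, history):
        c = self.model(history)
        w = {a: math.exp(-self.eta * self.cum_loss[c][a]) for a in self.actions}
        z = sum(w.values())
        return {a: w[a] / z for a in self.actions}

    def update(self, history, action, reward, q):
        # q is the meta algorithm's distribution q_t actually used to draw the action.
        c = self.model(history)
        self.cum_loss[c][action] += (1.0 - reward) / q[action]
```

Both experts expose the same `advice`/`update` interface so that a meta algorithm can query them and feed back the reward together with the sampling distribution q_t.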


FUTURE WORKS

We do not know whether, in the case of a pool of models Θ, there exist tractable algorithms with Θ-regret better than T^{2/3} with logarithmic dependency w.r.t. |Θ|. Here we have used a meta Exp4 algorithm, but we could have used other meta algorithms using a mixture q_t(a) = Σ_{θ∈Θ} p_t(θ) π_t^θ(a) (where the p_t are internal weights of the meta algorithm). However, when computing the approximation term of the best model by the models in Θ (see the supplementary material), it seems that the Θ-regret cannot be strongly reduced without making further assumptions on the structure of the game, since in general the mixture distribution q_t may not converge to the distribution π_t^{θ̄} proposed by the best model θ̄. This question remains open.
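For concreteness, here is a sketch of such a meta algorithm: an Exp4-style mixture q_t(a) = Σ_θ p_t(θ) π_t^θ(a) over the model-experts, with exponentially weighted p_t and a little uniform exploration. The learning rate, the exploration parameter gamma and the importance-weighted gain update are our own illustrative choices, not the tuning analysed in the paper.

```python
import math
import random

class MetaMixture:
    """Exp4-style meta algorithm: q_t(a) = sum_theta p_t(theta) * advice_theta(a),
    mixed with a little uniform exploration gamma (sketch only)."""
    def __init__(self, experts, actions, eta=0.05, gamma=0.05):
        self.experts, self.actions = experts, actions
        self.eta, self.gamma = eta, gamma
        self.log_w = [0.0] * len(experts)            # log-weights of the experts

    def distribution(self, history):
        advice = [e.advice(history) for e in self.experts]
        z = max(self.log_w)
        p = [math.exp(lw - z) for lw in self.log_w]
        s = sum(p)
        p = [x / s for x in p]                       # p_t(theta)
        q = {a: sum(p[i] * advice[i][a] for i in range(len(self.experts)))
             for a in self.actions}
        u = 1.0 / len(self.actions)
        q = {a: (1 - self.gamma) * q[a] + self.gamma * u for a in self.actions}
        return q, advice

    def step(self, history):
        q, advice = self.distribution(history)
        a = random.choices(self.actions, weights=[q[x] for x in self.actions])[0]
        return a, q, advice

    def update(self, history, action, reward, q, advice):
        # Exp4-style importance-weighted gain estimate for each expert.
        for i, adv in enumerate(advice):
            self.log_w[i] += self.eta * adv[action] * reward / q[action]
        for e in self.experts:
            e.update(history, action, reward, q)
```

A full loop would call `step`, observe the bandit feedback r_t(a_t), and then call `update`, with the per-class experts sketched above (θ-UCB or θ-Exp3) passed in as `experts`.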

Acknowledgment

This work has been supported by the French National Research Agency (ANR) through the COSINUS program (project EXPLO-RA, number ANR-08-COSI-004).

A SOME TECHNICAL PROOFS

In order to relate the cumulative reward of the Exp4 algorithm to that of the best expert on top of which it is run, we state the following simple lemma. Note that in our case, since the π_t^θ are not fixed in advance but are random variables, we cannot apply the original result of Auer et al. (2003) for fixed expert advice, but need to adapt it. The proof easily follows from the original proof of Theorem 7.1 in Auer et al. (2003), and is reported in the supplementary material.

Lemma 1  For any γ ∈ (0, 1], for any family of experts which includes the uniform expert and such that all expert advice is adapted to the filtration of the past, one has

    max_{θ∈Θ} Σ_{t=1}^T E_{a_1,…,a_{t−1}}( E_{a∼π_t^θ}( r_t(a) ) ) − E_{a_1,…,a_T}( Σ_{t=1}^T r_t(a_t) ) ≤ (e − 1) γ T + (A log|Θ|) / γ.

A.1 The Rebel-bandit Setting

We now introduce the setting of Rebel bandits, which may have its own interest. It will be used to compute the model-based regret of the Exp4 algorithm. In this setting, we consider that at time t the player proposes a probability distribution π_t over the arms, but actually receives the reward corresponding to an action drawn from another distribution q_t, the probability distribution proposed by the meta algorithm.

Following (4), we define the best model of the pool:

    θ̄ = argmax_{θ∈Θ} sup_{g : H/θ → A} E( Σ_{t=1}^T r_t(g([h_t]_θ)) ).


Now using the facts that log x ≤ x − 1 and exp(−x) ≤ 1 − x + x² for x ≥ 0, the first term on the right-hand side of (5) is bounded by

    (η_t / 2) E_{a_t}( l̃_{t,c_t}(a)² ).

Thus, considering that

    π_t^θ(a) = exp(−η_t Σ_{s=1}^{t−1} l̃_{s,c_t}(a)) / Σ_{a'∈A} exp(−η_t Σ_{s=1}^{t−1} l̃_{s,c_t}(a')),

we can introduce the quantity

    Φ_t(η, c) = (1/η) log( (1/A) Σ_{a∈A} exp(−η Σ_{s=1}^{t} l̃_{s,c}(a)) ),

so that the second term on the right-hand side of (5) is Φ_{t−1}(η_t, c_t) − Φ_t(η_t, c_t). Thus we deduce that:

    R^q_T(θ) ≤ Σ_{t=1}^T E_{a_1,…,a_{t−1}}[ E_{a_t∼q_t}( (η_t/2) (1 − r_t(a_t))² π_t^θ(a_t) / q_t(a_t)² ) + E_{a_t}( Φ_{t−1}(η_t, c_t) − Φ_t(η_t, c_t) ) − E_{a_t}( l̃_{t,c_t}(a*_{c_t}) ) ],

where we have replaced l̃_{t,c_t}(a) by its definition.

Step 3.  Now we consider the first term in the right-hand side of the previous equation, which is bounded by:

    E_{a_t∼q_t}( (1 − r_t(a_t))² π_t^θ(a_t) / q_t(a_t)² ) ≤ Σ_{a∈A} π_t^θ(a) / q_t(a),

since (1 − r_t(a_t))² ≤ 1.

Step 4.  Introduce the equivalence classes. We now consider the second term, defined with the Φ functions. Let us introduce the notation I_c(t) = Σ_{s=1}^{t} I{c = [h_s]_θ} …


References

J.-Y. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits. In 22nd Annual Conference on Learning Theory, 2009.

J.-Y. Audibert, R. Munos, and Cs. Szepesvári. Exploration-exploitation trade-off using variance estimates in multi-armed bandits. Theoretical Computer Science, 2008.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning Journal, 47:235–256, 2002.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32:48–77, January 2003. ISSN 0097-5397.

Avrim Blum and Yishay Mansour. From external to internal regret. In Conference on Learning Theory, pages 621–636, 2005.

Avrim Blum and Yishay Mansour. From external to internal regret. Journal of Machine Learning Research, 8:1307–1324, December 2007.

S. Bubeck. Bandits Games and Clustering Foundations. PhD thesis, Université Lille 1, 2010.

Nicolò Cesa-Bianchi and Gábor Lugosi. Potential-based algorithms in on-line prediction and game theory. Machine Learning, 51(3):239–261, 2003. ISSN 0885-6125.

Damien Ernst, Guy-Bart Stan, Jorge Goncalves, and Louis Wehenkel. Clinical data based optimal STI strategies for HIV: a reinforcement learning approach. In Machine Learning Conference of Belgium and the Netherlands (Benelearn), pages 65–72, 2006.

Dean Foster and Rakesh Vohra. Asymptotic calibration. Biometrika, 85:379–390, 1996.

Dean P. Foster and Rakesh Vohra. Regret in the on-line decision problem. Games and Economic Behavior, 29(1-2):7–35, 1999. ISSN 0899-8256.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT '95: Proceedings of the Second European Conference on Computational Learning Theory, pages 23–37, London, UK, 1995. Springer-Verlag. ISBN 3-540-59119-2.

Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000. ISSN 0012-9682.

Marcus Hutter. Feature reinforcement learning: Part I: Unstructured MDPs. Journal of Artificial General Intelligence, 1:3–24, 2009.

Varun Kanade, H. Brendan McMahan, and Brent Bryan. Sleeping experts and bandits with stochastic action availability and adversarial rewards. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, number 5, pages 272–279, 2009.

Robert D. Kleinberg, Alexandru Niculescu-Mizil, and Yogeshwer Sharma. Regret bounds for sleeping experts and bandits. In Conference on Learning Theory, July 2008.

Ehud Lehrer and Dinah Rosenberg. A wide range no-regret theorem. Game Theory and Information, EconWPA, 2003.

Ronald Ortner. Online regret bounds for Markov decision processes with deterministic transitions. In Proceedings of the 19th International Conference on Algorithmic Learning Theory, pages 123–137, Berlin, Heidelberg, 2008. Springer-Verlag.

H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527–535, 1952.

Daniil Ryabko and Marcus Hutter. On the possibility of learning in reactive environments with arbitrary dependence. Theoretical Computer Science, 405(3):274–284, 2008. ISSN 0304-3975.

Gilles Stoltz. Incomplete Information and Internal Regret in Prediction of Individual Sequences. PhD thesis, Université Paris-Sud, Orsay, France, May 2005.