Bandit Problems - Part III: Bandits for Optimization
Emilie Kaufmann (CNRS), Stochastic Bandits
RLSS, Lille, July 2019

The Stochastic Multi-Armed Bandit Setup

K arms ↔ K probability distributions: arm a has distribution ν_a with mean µ_a (ν_1, ν_2, ..., ν_K).

At round t, an agent:
- chooses an arm A_t
- receives a sample (reward) X_t ∼ ν_{A_t}

Sequential sampling strategy (bandit algorithm):

A_{t+1} = F_t(A_1, X_1, ..., A_t, X_t).

Goal: maximize E[ Σ_{t=1}^T X_t ] → not the only possible goal!

Should we maximize rewards? (clinical trials)

Arms: B(µ_1), B(µ_2), B(µ_3), B(µ_4), B(µ_5).

For the t-th patient in a clinical study, the agent:
- chooses a treatment A_t
- observes a response X_t ∈ {0, 1} with P(X_t = 1) = µ_{A_t}

Maximize rewards ↔ cure as many patients as possible.

Alternative goal: identify as quickly as possible the best treatment (without trying to cure patients during the study).

Should we maximize rewards? (website optimization)

Probability that version a of a website generates a conversion: µ_1, µ_2, ..., µ_K.

Best version: a⋆ = argmax_{a=1,...,K} µ_a.

Sequential protocol: for the t-th visitor:
- display version A_t
- observe a conversion indicator X_t ∼ B(µ_{A_t}).

Maximize rewards ↔ maximize the number of conversions.

Alternative goal: identify the best version (without trying to maximize conversions during the test).

A/B Testing (K = 2)

Goal: find the best version.

Improve your A/B tests?

A standard way to do A/B testing:
- allocate n_A users to page A and n_B users to page B (decided in advance, often n_A = n_B)
- perform a statistical test of "A better than B"

Alternative: fully adaptive A/B testing
- sequentially choose which version to allocate to each visitor
- (adaptively choose when to stop the experiment)

→ best arm identification in a bandit model
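For contrast with the adaptive approach developed next, here is a minimal Python sketch of the non-adaptive protocol above: a fixed allocation followed by a two-proportion z-test of "A better than B". The sample sizes and true conversion rates are hypothetical.

```python
import math
import random

# Non-adaptive A/B test sketch: fixed allocation, then a two-proportion
# z-test. Sample sizes and true conversion rates are hypothetical.
random.seed(0)
n_A, n_B = 5000, 5000
mu_A, mu_B = 0.05, 0.045          # unknown true conversion rates (assumption)

conv_A = sum(random.random() < mu_A for _ in range(n_A))
conv_B = sum(random.random() < mu_B for _ in range(n_B))

p_A, p_B = conv_A / n_A, conv_B / n_B
p_pool = (conv_A + conv_B) / (n_A + n_B)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_A + 1 / n_B))
z = (p_A - p_B) / se

# One-sided test at level 5%: reject "A is not better" if z > 1.645.
print(f"p_A={p_A:.4f}, p_B={p_B:.4f}, z={z:.2f}, A better: {z > 1.645}")
```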

Finding the best arm in a bandit model

Pure Exploration in Bandit Models

Goal: identify an arm with large mean as quickly and accurately as possible, i.e. identify

a⋆ = argmax_{a=1,...,K} µ_a.

Algorithm: made of three components:
→ sampling rule: A_t (arm to explore)
→ recommendation rule: B_t (current guess for the best arm)
→ stopping rule: τ (when do we stop exploring?)

Probability of error [Even-Dar et al., 2006, Audibert et al., 2010]: the probability of error after n rounds is

p_ν(n) = P_ν(B_n ≠ a⋆).

Simple regret [Bubeck et al., 2009]: the simple regret after n rounds is

r_ν(n) = µ⋆ − µ_{B_n}.

The two criteria are related by

∆_min p_ν(n) ≤ E_ν[r_ν(n)] ≤ ∆_max p_ν(n),

since the simple regret is zero on the event {B_n = a⋆} and lies between the smallest gap ∆_min and the largest gap ∆_max otherwise.

Several objectives

Algorithm: made of three components:
→ sampling rule: A_t (arm to explore)
→ recommendation rule: B_t (current guess for the best arm)
→ stopping rule: τ (when do we stop exploring?)

Objectives studied in the literature:
- Fixed-confidence setting: input = risk parameter δ (and tolerance parameter ε); guarantee P(B_τ ≠ a⋆) ≤ δ (or P(r_ν(τ) > ε) ≤ δ) while minimizing E[τ]. [Even-Dar et al., 2006]
- Fixed-budget setting: input = budget T; τ = T, minimize P(B_T ≠ a⋆) (or E[r_ν(T)]). [Bubeck et al., 2009, Audibert et al., 2010]
- Anytime exploration: no input; minimize p_ν(t) (or E[r_ν(t)]) for all t. [Jun and Nowak, 2016]

Outline

1. Finding the Best Arm in a Bandit Model
   - Algorithms for Best Arm Identification
   - Performance Lower Bounds
   - An asymptotically optimal algorithm
2. Beyond Best Arm Identification
   - Active Identification in Bandit Models
   - Examples
3. Bandit for Optimization in a Larger Space
   - Black-Box Optimization
   - Hierarchical Bandits
   - Bayesian Optimization
4. Bandit Tools for Planning in Games
   - Upper Confidence Bounds for Trees
   - BAI tools for Planning in Games

Can we use UCB?

Context: bounded rewards (ν_a supported in [0, 1]). We know good algorithms to maximize rewards, for example UCB(α):

A_{t+1} = argmax_{a=1,...,K} [ µ_a(t) + √( α ln(t) / N_a(t) ) ],

where µ_a(t) is the empirical mean of arm a and N_a(t) its number of draws after t rounds.

→ How good is it for best arm identification?

Possible recommendation rules [Bubeck et al., 2009]:
- Empirical Best Arm (EBA): B_t = argmax_a µ_a(t)
- Most Played Arm (MPA): B_t = argmax_a N_a(t)
- Empirical Distribution of Plays (EDP): B_t ∼ p_t, where p_t = ( N_1(t)/t, ..., N_K(t)/t )
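A minimal sketch of UCB(α) followed by the three recommendation rules above, on a hypothetical Bernoulli instance (the arm means, horizon and α below are illustrative choices, not values from the slides):

```python
import math
import random

# UCB(alpha) for a horizon n, then the three recommendation rules (EBA,
# MPA, EDP). N and S play the roles of N_a(t) and the sum of rewards.
random.seed(1)
mu = [0.5, 0.45, 0.43, 0.4]        # unknown means (assumption)
K, alpha, n = len(mu), 2.0, 5000

N = [0] * K                        # number of draws of each arm
S = [0.0] * K                      # sum of rewards of each arm

for t in range(1, n + 1):
    if t <= K:                     # draw each arm once to initialize
        a = t - 1
    else:
        a = max(range(K),
                key=lambda i: S[i] / N[i] + math.sqrt(alpha * math.log(t) / N[i]))
    S[a] += float(random.random() < mu[a])
    N[a] += 1

mu_hat = [S[a] / N[a] for a in range(K)]
eba = max(range(K), key=lambda a: mu_hat[a])      # Empirical Best Arm
mpa = max(range(K), key=lambda a: N[a])           # Most Played Arm
edp = random.choices(range(K), weights=N)[0]      # Empirical Distribution of Plays
print("EBA:", eba, "MPA:", mpa, "EDP:", edp)
```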

Can we use UCB? (continued)

UCB + Empirical Distribution of Plays:

E[r_ν(n)] = E[µ⋆ − µ_{B_n}]
          = E[ Σ_{b=1}^K (µ⋆ − µ_b) 1(B_n = b) ]
          = E[ Σ_{b=1}^K (µ⋆ − µ_b) P(B_n = b | F_n) ]
          = E[ Σ_{b=1}^K (µ⋆ − µ_b) N_b(n)/n ]
          = (1/n) Σ_{b=1}^K (µ⋆ − µ_b) E[N_b(n)]
          = R_ν(n)/n.

→ a conversion from cumulative regret to simple regret!

Can we use UCB? (continued)

UCB + Empirical Distribution of Plays:

E[r_ν(UCB(α), n)] ≤ R_ν(UCB(α), n)/n ≤ C(ν) ln(n)/n

(equivalently, in a distribution-free form, ≤ C √( Kα ln(n)/n )).

UCB + Most Played Arm:

Theorem [Bubeck et al., 2009]. With the Most Played Arm as recommendation rule, for n large enough,

E[r_ν(UCB(α), n)] ≤ C √( Kα ln(n)/n ).

(A more precise problem-dependent analysis is given in [Bubeck et al., 2009].)

Are those results good?

E_ν[r_ν(UCB(α), n)] ≈ √( Kα ln(n)/n ).

→ the uniform allocation strategy can beat UCB!

Theorem [Bubeck et al., 2009]. For n ≥ 4K ln(K)/∆²_min, the simple regret of the uniform allocation decays exponentially:

E_ν[r_ν(Unif, n)] ≤ ∆_max exp( − n ∆²_min / (8K) ).

→ the smaller the cumulative regret, the larger the simple regret:

Theorem [Bubeck et al., 2009] (informal statement).

"E[r_ν(A, n)] ≥ (∆_min/2) exp( − C × R_ν(A, n) )"

Can we use UCB? Summary

Answer:
- UCB has to be coupled with an appropriate recommendation rule
- it is not guaranteed to perform better than uniform exploration...

Variants of UCB?
- UCB-E for the fixed-budget setting [Audibert et al., 2010]
- LIL-UCB for the fixed-confidence setting [Jamieson et al., 2014]

Other algorithms?
- many algorithms specifically designed for best arm identification

Fixed Budget: Sequential Halving

Input: total number of plays T.
Idea: split the budget into ⌈log₂(K)⌉ phases of equal length, and eliminate the worst half of the remaining arms after each phase.

Initialization: S_0 = {1, ..., K}.
For r = 0 to ⌈log₂(K)⌉ − 1:
- sample each arm a ∈ S_r exactly t_r = ⌊ T / (|S_r| ⌈log₂(K)⌉) ⌋ times;
- let µ^r_a be the empirical mean of arm a over these t_r samples;
- let S_{r+1} be the set of ⌈|S_r|/2⌉ arms with largest µ^r_a.
Output: B_T, the unique arm in S_{⌈log₂(K)⌉}.

Theorem [Karnin et al., 2013]. Letting H_2(ν) = max_{a ≠ a⋆} a ∆_{[a]}^{−2} (where ∆_{[2]} ≤ ... ≤ ∆_{[K]} are the gaps ∆_a = µ⋆ − µ_a sorted in increasing order), for any bounded bandit instance,

P_ν(B_T ≠ a⋆) ≤ 3 log₂(K) exp( − T / (8 log₂(K) H_2(ν)) ).
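A sketch of Sequential Halving following the pseudo-code above, on hypothetical Bernoulli arms and budget; each surviving arm is re-sampled from scratch in every phase.

```python
import math
import random

# Sequential Halving sketch with total budget T; draw(a) stands for one
# sample from nu_a (Bernoulli arms are an assumption for the demo).
random.seed(2)
mu = [0.5, 0.45, 0.43, 0.4, 0.3]
K, T = len(mu), 2000

def draw(a):
    return float(random.random() < mu[a])

S = list(range(K))
n_phases = math.ceil(math.log2(K))
for r in range(n_phases):
    t_r = T // (len(S) * n_phases)            # budget per arm in this phase
    means = {a: sum(draw(a) for _ in range(t_r)) / t_r for a in S}
    S = sorted(S, key=lambda a: means[a], reverse=True)[:math.ceil(len(S) / 2)]

print("recommended arm B_T:", S[0])
```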

Fixed Confidence: Successive Elimination

Input: risk parameter δ ∈ (0, 1).
Idea: sample all remaining arms uniformly and eliminate arms that look sub-optimal.

Initialization: S = {1, ..., K}, t = 0.
While |S| > 1:
- draw all arms in S once, and set t ← t + |S|;
- remove from S every arm a such that max_{i∈S} µ_i(t) − µ_a(t) ≥ 2 √( ln(K t²/δ) / t ).
Output: the unique arm B_τ ∈ S.

Theorem [Even-Dar et al., 2006]. Successive Elimination satisfies P_ν(B_τ = a⋆) ≥ 1 − δ. Moreover,

P_ν( τ_δ = O( Σ_{a=2}^K (1/∆_a²) ln( K/(δ∆_a) ) ) ) ≥ 1 − δ.
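A sketch of Successive Elimination following the pseudo-code above, on hypothetical Bernoulli arms; as on the slide, t counts the total number of samples and appears in the elimination radius.

```python
import math
import random

# Successive Elimination sketch with risk parameter delta.
random.seed(3)
mu = [0.5, 0.45, 0.3, 0.2]
K, delta = len(mu), 0.1

S = list(range(K))
N = [0] * K                # draws per arm
X = [0.0] * K              # sum of rewards per arm
t = 0
while len(S) > 1:
    for a in S:            # draw every remaining arm once
        X[a] += float(random.random() < mu[a])
        N[a] += 1
    t += len(S)
    means = {a: X[a] / N[a] for a in S}
    radius = 2 * math.sqrt(math.log(K * t ** 2 / delta) / t)
    best = max(means.values())
    S = [a for a in S if best - means[a] < radius]

print("recommended arm:", S[0], "after", t, "samples")
```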

Fixed Confidence: LUCB

Maintain a confidence interval on each mean: I_a(t) = [LCB_a(t), UCB_a(t)].

[Figure: confidence intervals on the arm means, with the number of draws of each arm (771, 459, 200, 45, 48, 23).]

At round t, draw the two arms

B_t = argmax_b µ_b(t)   (empirical leader)
C_t = argmax_{c ≠ B_t} UCB_c(t)   (best challenger)

Stop at round t if LCB_{B_t}(t) > UCB_{C_t}(t) − ε.

Theorem [Kalyanakrishnan et al., 2012]. For well-chosen confidence intervals, P_ν(µ_{B_τ} > µ⋆ − ε) ≥ 1 − δ and

E[τ_δ] = O( [ 1/(∆_2² ∨ ε²) + Σ_{a=2}^K 1/(∆_a² ∨ ε²) ] ln(1/δ) ).
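A sketch of LUCB with Hoeffding-type confidence intervals (one possible choice of "well-chosen" intervals, with a classical exploration rate); the Bernoulli instance is hypothetical.

```python
import math
import random

# LUCB sketch: at each round, sample the empirical leader and the best
# challenger, stop when their confidence intervals separate (up to eps).
random.seed(4)
mu = [0.5, 0.45, 0.43, 0.4]
K, delta, eps = len(mu), 0.1, 0.0

N = [0] * K
S = [0.0] * K

def sample(a):
    S[a] += float(random.random() < mu[a])
    N[a] += 1

def radius(a, t):
    beta = math.log(5 * K * t ** 4 / (4 * delta))   # one classical choice (assumption)
    return math.sqrt(beta / (2 * N[a]))

for a in range(K):
    sample(a)
t = K
while True:
    means = [S[a] / N[a] for a in range(K)]
    B = max(range(K), key=lambda a: means[a])                    # empirical leader
    C = max((a for a in range(K) if a != B),
            key=lambda a: means[a] + radius(a, t))               # best challenger
    if means[B] - radius(B, t) > means[C] + radius(C, t) - eps:  # stopping rule
        break
    sample(B); sample(C)
    t += 2

print("recommended arm:", B, "sample complexity:", t)
```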


Minimizing the sample complexity

Context: exponential family bandit model, ν = (ν_1, ..., ν_K) ↔ µ = (µ_1, ..., µ_K) (Bernoulli, Gaussian with known variance, Poisson, ...).

Algorithm: made of three components:
→ sampling rule: A_t (arm to explore)
→ stopping rule: τ (when do we stop exploring?)
→ recommendation rule: B_τ (guess for the best arm when stopping)

Objective:
- a δ-correct strategy: for all µ with a unique optimal arm, P_µ(B_τ = a⋆) ≥ 1 − δ,
- with a small sample complexity E_µ[τ_δ].

→ What is the minimal sample complexity?

A first lower bound

Divergence function: kl(µ, µ′) = KL(ν_µ, ν_{µ′}) (for Gaussian distributions with variance σ², kl(µ, µ′) = (µ − µ′)²/(2σ²)).

Change of distribution lemma [Kaufmann et al., 2016]. Let µ and λ be such that a⋆(µ) ≠ a⋆(λ). For any δ-correct algorithm,

Σ_{a=1}^K E_µ[N_a(τ)] kl(µ_a, λ_a) ≥ kl_Ber(δ, 1 − δ).

For any a ∈ {2, ..., K}, introducing the alternative instance λ defined by

λ_a = µ_1 + ε,   λ_i = µ_i if i ≠ a,

the lemma gives E_µ[N_a(τ)] kl(µ_a, µ_1 + ε) ≥ kl_Ber(δ, 1 − δ), hence (letting ε → 0 and using kl_Ber(δ, 1 − δ) ≥ ln(1/(3δ)))

E_µ[N_a(τ)] ≥ (1/kl(µ_a, µ_1)) ln(1/(3δ)).

A first lower bound (continued)

Summing these constraints over the arms, one obtains:

Theorem [Kaufmann et al., 2016]. For any δ-correct algorithm,

E_µ[τ] ≥ ( 1/kl(µ_1, µ_2) + Σ_{a=2}^K 1/kl(µ_a, µ_1) ) ln(1/(3δ)).

→ not tight enough...

The best possible lower bound

Change of distribution lemma [Kaufmann et al., 2016]. Let µ and λ be such that a⋆(µ) ≠ a⋆(λ). For any δ-correct algorithm,

Σ_{a=1}^K E_µ[N_a(τ)] kl(µ_a, λ_a) ≥ kl_Ber(δ, 1 − δ).

Let Alt(µ) = {λ : a⋆(λ) ≠ a⋆(µ)}. Taking the worst alternative instance:

inf_{λ ∈ Alt(µ)} Σ_{a=1}^K E_µ[N_a(τ)] kl(µ_a, λ_a) ≥ kl_Ber(δ, 1 − δ)

E_µ[τ] × inf_{λ ∈ Alt(µ)} Σ_{a=1}^K (E_µ[N_a(τ)]/E_µ[τ]) kl(µ_a, λ_a) ≥ ln(1/(3δ))

E_µ[τ] × ( sup_{w ∈ Σ_K} inf_{λ ∈ Alt(µ)} Σ_{a=1}^K w_a kl(µ_a, λ_a) ) ≥ ln(1/(3δ)),

since the vector (E_µ[N_a(τ)]/E_µ[τ])_a belongs to the simplex Σ_K.

The best possible lower bound (continued)

Theorem [Garivier and Kaufmann, 2016]. For any δ-PAC algorithm,

E_µ[τ] ≥ T⋆(µ) ln(1/(3δ)),

where

T⋆(µ)^{-1} = sup_{w ∈ Σ_K} inf_{λ ∈ Alt(µ)} Σ_{a=1}^K w_a kl(µ_a, λ_a).

Moreover, the vector of optimal proportions,

w⋆(µ) = argmax_{w ∈ Σ_K} inf_{λ ∈ Alt(µ)} Σ_{a=1}^K w_a kl(µ_a, λ_a),

is well-defined and can be computed efficiently.
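The quantities T⋆(µ) and w⋆(µ) can be approximated numerically. Below is a rough sketch for Gaussian arms with known variance, using the closed form of the inner infimum (for each challenger a, the worst alternative moves the best mean and µ_a to their weighted average) and a crude random search over the simplex; a real implementation would use the efficient procedure of [Garivier and Kaufmann, 2016]. The instance is hypothetical.

```python
import numpy as np

# Approximate T*(mu) and w*(mu) for Gaussian arms with known variance.
rng = np.random.default_rng(0)
mu = np.array([0.5, 0.45, 0.43, 0.4])
sigma2 = 0.25
best = int(np.argmax(mu))
others = [a for a in range(len(mu)) if a != best]

def inner_inf(w):
    # inf over Alt(mu) of sum_a w_a kl(mu_a, lambda_a), Gaussian closed form
    return min(w[best] * w[a] * (mu[best] - mu[a]) ** 2
               / (2 * sigma2 * (w[best] + w[a])) for a in others)

candidates = rng.dirichlet(np.ones(len(mu)), size=50_000)   # random search
values = np.array([inner_inf(w) for w in candidates])
w_star = candidates[int(np.argmax(values))]
print("T*(mu) ~", 1.0 / values.max())
print("w*(mu) ~", np.round(w_star, 3))
```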


How to match the lower bound? Sampling rule.

Let µ(t) = (µ_1(t), ..., µ_K(t)) be the vector of empirical means. Introducing the set of under-sampled arms

U_t = { a : N_a(t) < √t },

the Tracking sampling rule chooses

A_{t+1} ∈ argmin_{a ∈ U_t} N_a(t)   if U_t ≠ ∅   (forced exploration)
A_{t+1} ∈ argmax_{1≤a≤K} [ (w⋆(µ(t)))_a − N_a(t)/t ]   otherwise   (tracking)

Lemma. Under the Tracking sampling rule,

P_µ( lim_{t→∞} N_a(t)/t = (w⋆(µ))_a ) = 1.
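A sketch of the Tracking sampling rule; `optimal_weights` is a stand-in for a routine computing w⋆(µ(t)) (for example along the lines of the earlier sketch), and a uniform placeholder is used here so the snippet runs on its own.

```python
import math

# Tracking rule: forced exploration of under-sampled arms, otherwise
# track the plug-in optimal proportions.
def optimal_weights(mu_hat):
    K = len(mu_hat)
    return [1.0 / K] * K          # placeholder, uniform weights (assumption)

def tracking_arm(t, counts, mu_hat):
    K = len(counts)
    under_sampled = [a for a in range(K) if counts[a] < math.sqrt(t)]
    if under_sampled:                                  # forced exploration
        return min(under_sampled, key=lambda a: counts[a])
    w = optimal_weights(mu_hat)                        # tracking step
    return max(range(K), key=lambda a: w[a] - counts[a] / t)

# usage: counts N_a(t) and empirical means mu_a(t) after t = 40 rounds
print(tracking_arm(40, [20, 10, 6, 4], [0.5, 0.45, 0.43, 0.4]))
```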

How to match the lower bound? Stopping rule.

A Generalized Likelihood Ratio statistic:

Z(t) = ln [ ℓ(X_1, ..., X_t ; µ(t)) / max_{λ ∈ Alt(µ(t))} ℓ(X_1, ..., X_t ; λ) ]
     = inf_{λ ∈ Alt(µ(t))} Σ_{a=1}^K N_a(t) kl(µ_a(t), λ_a).

→ a high value of Z(t) rejects the hypothesis "µ ∈ Alt(µ(t))".

Stopping and recommendation rule (can be traced back to [Chernoff, 1959]):

τ_δ = inf{ t ∈ N : Z(t) > β(t, δ) },   B_τ = argmax_{a=1,...,K} µ_a(τ).

→ How to pick the threshold β(t, δ)?
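A sketch of the statistic Z(t) in the Gaussian case, where the infimum reduces to pairwise terms, each attained at the weighted average of the two empirical means (this reduction is made explicit on the next slide); counts and means are hypothetical.

```python
# GLR statistic Z(t) for Gaussian arms with known variance sigma^2.
def glr_statistic(counts, means, sigma2=0.25):
    K = len(means)
    best = max(range(K), key=lambda a: means[a])
    def pair_cost(b):
        # inf over {lambda_best <= lambda_b} of the two-arm contribution
        m = (counts[best] * means[best] + counts[b] * means[b]) / (counts[best] + counts[b])
        kl = lambda x, y: (x - y) ** 2 / (2 * sigma2)
        return counts[best] * kl(means[best], m) + counts[b] * kl(means[b], m)
    return min(pair_cost(b) for b in range(K) if b != best)

# usage with hypothetical counts N_a(t) and empirical means mu_a(t)
print(glr_statistic([500, 300, 120, 80], [0.52, 0.46, 0.41, 0.39]))
```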

A δ-correct stopping rule

Z(t) = inf_{λ ∈ Alt(µ(t))} Σ_{a=1}^K N_a(t) kl(µ_a(t), λ_a)
     = min_{b ≠ B_t} inf_{λ : λ_{B_t} ≤ λ_b} [ N_{B_t}(t) kl(µ_{B_t}(t), λ_{B_t}) + N_b(t) kl(µ_b(t), λ_b) ],

since only the coordinates B_t and b of λ need to move in order to change the best arm.

P(B_{τ_δ} ≠ a⋆) ≤ P( ∃t ∈ N⋆, ∃a ≠ a⋆ : B_t = a, Z(t) > β(t, δ) )
              ≤ P( ∃t ∈ N⋆, ∃a ≠ a⋆ : inf_{λ_a ≤ λ_{a⋆}} Σ_{i ∈ {a, a⋆}} N_i(t) kl(µ_i(t), λ_i) > β(t, δ) )
              ≤ Σ_{a ≠ a⋆} P( ∃t ∈ N⋆ : N_a(t) kl(µ_a(t), µ_a) + N_{a⋆}(t) kl(µ_{a⋆}(t), µ_{a⋆}) > β(t, δ) ),

where the last event requires simultaneous deviations of the empirical means of the two arms.

Suitable thresholds β(t, δ): Bernoulli case [Garivier and Kaufmann, 2016]; exponential families [Kaufmann and Koolen, 2018].


An asymptotically optimal algorithm

Theorem. The Track-and-Stop strategy, which uses
- the Tracking sampling rule,
- the GLRT stopping rule with threshold

β(t, δ) ≈ ln((K − 1)/δ) + 2 ln ln((K − 1)/δ) + 6 ln(ln(t)),

- and recommends B_τ = argmax_{a=1,...,K} µ_a(τ),

is δ-PAC for every δ ∈ (0, 1) and satisfies

lim sup_{δ→0} E_µ[τ_δ] / ln(1/δ) = T⋆(µ).

Why? The stopping time can be rewritten as

τ_δ = inf{ t ∈ N⋆ : inf_{λ ∈ Alt(µ(t))} Σ_{a=1}^K N_a(t) kl(µ_a(t), λ_a) > β(t, δ) }
    = inf{ t ∈ N⋆ : t × inf_{λ ∈ Alt(µ(t))} Σ_{a=1}^K (N_a(t)/t) kl(µ_a(t), λ_a) > β(t, δ) }.

Since the Tracking rule ensures N_a(t)/t → (w⋆(µ))_a and µ(t) → µ, for large t

τ_δ ≈ inf{ t ∈ N⋆ : t × inf_{λ ∈ Alt(µ)} Σ_{a=1}^K (w⋆(µ))_a kl(µ_a, λ_a) > β(t, δ) }
    = inf{ t ∈ N⋆ : t × T⋆(µ)^{-1} > β(t, δ) },

which is of order T⋆(µ) ln(1/δ) when δ is small.

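Putting the pieces together, a self-contained (and deliberately crude) sketch of a Track-and-Stop run on a hypothetical Gaussian instance: plug-in weights obtained by random search, the Tracking rule, the pairwise GLR statistic, and the practical threshold β(t, δ) = ln((ln(t) + 1)/δ) used in the experiments reported next.

```python
import math
import numpy as np

# Crude Track-and-Stop sketch on Gaussian arms with variance sigma2.
rng = np.random.default_rng(0)
mu_true = np.array([0.5, 0.45, 0.43, 0.4])
sigma2, delta, K = 0.25, 0.1, len(mu_true)

def kl(x, y):
    return (x - y) ** 2 / (2 * sigma2)

def optimal_weights(mu, n_search=500):
    best = int(np.argmax(mu))
    def inner(w):
        return min(w[best] * w[a] * (mu[best] - mu[a]) ** 2
                   / (2 * sigma2 * (w[best] + w[a]))
                   for a in range(K) if a != best)
    ws = rng.dirichlet(np.ones(K), size=n_search)     # random search (sketch)
    return ws[int(np.argmax([inner(w) for w in ws]))]

def glr(counts, means):
    best = int(np.argmax(means))
    def pair(b):
        m = (counts[best] * means[best] + counts[b] * means[b]) / (counts[best] + counts[b])
        return counts[best] * kl(means[best], m) + counts[b] * kl(means[b], m)
    return min(pair(b) for b in range(K) if b != best)

counts, sums = np.zeros(K), np.zeros(K)
for a in range(K):                                    # initialization
    sums[a] += rng.normal(mu_true[a], math.sqrt(sigma2)); counts[a] += 1
t = K
while True:
    means = sums / counts
    if glr(counts, means) > math.log((math.log(t) + 1) / delta):   # stopping rule
        break
    under = [a for a in range(K) if counts[a] < math.sqrt(t)]
    if under:                                                      # forced exploration
        a = min(under, key=lambda i: counts[i])
    else:                                                          # tracking
        w = optimal_weights(means)
        a = int(np.argmax(w - counts / t))
    sums[a] += rng.normal(mu_true[a], math.sqrt(sigma2)); counts[a] += 1
    t += 1

print("recommendation:", int(np.argmax(sums / counts)), "tau_delta =", t)
```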

Numerical experiments

Experiments on two Bernoulli bandit models:
- µ1 = [0.5  0.45  0.43  0.4], for which w⋆(µ1) = [0.417  0.390  0.136  0.057]
- µ2 = [0.3  0.21  0.2  0.19  0.18], for which w⋆(µ2) = [0.336  0.251  0.177  0.132  0.104]

In practice, the threshold is set to β(t, δ) = ln( (ln(t) + 1)/δ ).

Table: expected number of draws E_µ[τ_δ] for δ = 0.1, averaged over N = 3000 experiments.

       Track-and-Stop   GLRT-SE   KL-LUCB   KL-SE
µ1     4052             4516      8437      9590
µ2     1406             3078      2716      3334

Observations and improvements

- Track-and-Stop (TaS) is a lower-bound inspired algorithm.
- KL-UCB versus TaS: very different sampling rules!

[Figure: number of draws of each arm under the two sampling rules, e.g. 947, 312, 91, 58 versus 600, 558, 210, 40.]

Two recent improvements of the Tracking sampling rule:
→ relax the need for computing the optimal weights at every round [Ménard, 2019]
→ get rid of the forced exploration by using Upper Confidence Bounds [Degenne et al., 2019]

Open problems

- Unlike the previously mentioned strategies, the sampling rule of TaS is anytime, i.e. it does not depend on a budget T or a risk parameter δ.
  → is it also a good strategy in the fixed-budget setting?
  → can we control E[r_ν(TaS, t)] for any t?
- Fixed-budget setting: no exactly matching upper and lower bounds; the best lower bounds are in [Carpentier and Locatelli, 2016].
- Top-Two Thompson Sampling [Russo, 2016]: a Bayesian (anytime) strategy that is optimal in a different (Bayesian, asymptotic) sense.
  → can we obtain frequentist guarantees for this algorithm?

Beyond Best Arm Identification


Active Identification in Bandit Models

Context: exponential family bandit model, ν ↔ µ = (µ_1, ..., µ_K) ∈ I^K.

Goal: given M regions R_1, ..., R_M of I^K, identify one region to which µ belongs.

Formalization: build
→ a sampling rule (A_t)
→ a stopping rule τ
→ a recommendation rule ı_τ ∈ {1, ..., M}
such that, for some risk parameter δ, P_µ(µ ∉ R_{ı_τ}) ≤ δ and E_µ[τ] is small.

Two cases:
- R_1, ..., R_M form a partition: a lower-bound inspired sampling rule + a GLRT stopping rule essentially works. Caution: computing the optimal allocation may be difficult [Juneja and Krishnasamy, 2019, Kaufmann and Koolen, 2018].
- R_1, ..., R_M are overlapping regions: a Track-and-Stop approach does not always work [Degenne and Koolen, 2019].


Bandits and thresholds (examples)

Anomaly detection: given a threshold θ,
- find all the arms whose mean is below θ [Locatelli et al., 2016]
- find whether there is an arm with mean below θ [Kaufmann et al., 2018]

Phase I clinical trial: find the arm with mean closest to the threshold, among arms with increasing means [Garivier et al., 2017]:

R_a = { λ ∈ Inc : a = argmin_i |θ − λ_i| },

where Inc denotes the set of mean vectors with increasing components.

Bandits and games (example)

Find the best move at the root of a game tree by actively sampling its leaves:

s⋆ = argmax_{s ∈ C(s_0)} V_s.

[Figure: a max-min tree with root s_0 and leaves with means µ_1, ..., µ_8.]

V_s = µ_s if s ∈ L (s is a leaf),
      max_{c ∈ C(s)} V_c if s is a MAX node,
      min_{c ∈ C(s)} V_c if s is a MIN node.

→ more details later.
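A small sketch of the value computation above: a recursive max/min evaluation of a game tree. The tree shape and leaf means are hypothetical, chosen only for illustration.

```python
# Recursive computation of V_s and of the best move s* at the root.
def value(node, maximizing=True):
    if not isinstance(node, list):          # leaf: node is its mean mu_l
        return node
    children = [value(c, not maximizing) for c in node]
    return max(children) if maximizing else min(children)

# root (MAX) -> three MIN nodes -> leaves (hypothetical means)
tree = [[0.5, 0.3, 0.2], [0.4, 0.3], [0.5, 0.6, 0.1]]
depth_one_values = [value(child, maximizing=False) for child in tree]
print("V of depth-one nodes:", depth_one_values,
      "-> s* =", depth_one_values.index(max(depth_one_values)))
```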

Bandit for Optimization in a Larger Space


Bandit problems from an optimization perspective

f : {1, ..., K} → R.   What is max_{a=1,...,K} f(a)?

Sequential evaluations: at time t, choose A_t ∈ {1, ..., K} and observe

X_t ∼ ν_{A_t}, where ν_a has mean f(a).

[Figure: a discrete function a ↦ f(a) on {1, ..., 5}.]

→ sequential optimization of a discrete function based on noisy observations.

After T observations, two possible objectives:
- minimize the cumulative regret: E[ Σ_{t=1}^T (f(a⋆) − f(A_t)) ]
- minimize the simple regret (optimization error): if B_T is a guess of the argmax, E[ f(a⋆) − f(B_T) ].

General Sequential (Noisy) Optimization

f : X → R.   What is max_{x ∈ X} f(x)?

Sequential evaluations: at time t, choose x_t ∈ X and observe

y_t = f(x_t) + ε_t.

→ sequential optimization of a black-box function based on noisy observations.

After T observations, two possible objectives:
- minimize the cumulative regret: E[ Σ_{t=1}^T (f(x⋆) − f(x_t)) ]
- minimize the simple regret (optimization error): if z_T is a guess of the argmax, E[ f(x⋆) − f(z_T) ].

Black-Box Optimization

- learning based on (costly, noisy) function evaluations only!
- no access to gradients of f

Examples of function f:
- a costly PDE solver (numerical experiments)
- a simulator of the effect of a chemical compound (drug discovery)
- training and evaluating a neural network (hyper-parameter optimization)

What we want to do

[Figure: successive evaluations of a black-box function f at chosen points x, gradually revealing its shape.]

How to choose the next query?


Hierarchical partitioning

Bandits in metric spaces [Kleinberg et al., 2008], X-armed bandits [Bubeck et al., 2011].

Idea: partition the space hierarchically, and adaptively choose in which cell to sample.

[Figure: a hierarchical partition of X at depths h = 0, 1, 2.]

- For any depth h, X is partitioned into K^h cells (P_{h,i})_{0 ≤ i ≤ K^h − 1}.
- This defines a K-ary tree T whose root (depth h = 0) is the whole space X.

Hierarchical Optimistic Optimization (HOO)

Assumptions (given a metric ℓ(x, y)):
- f(x⋆) − f(y) ≤ f(x⋆) − f(x) + max{ f(x⋆) − f(x) ; ℓ(x, y) }
- sup_{(x,y) ∈ P_{h,i}} ℓ(x, y) ≤ ν ρ^h.

Idea: use Upper Confidence Bounds on the maximum value of the function in each cell to guide exploration [Bubeck et al., 2011]:

U_{(h,i)}(t) = µ_{(h,i)}(t) + √( 2 ln(t) / N_{(h,i)}(t) ) + ν ρ^h

B_{(h,i)}(t) = min{ U_{(h,i)}(t) ; max( B_{(h+1,2i)}(t), B_{(h+1,2i+1)}(t) ) }
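A simplified sketch in the spirit of HOO: a fixed-depth binary partition of [0, 1], where each round follows the largest B-value down the tree, samples the midpoint of the selected cell, and updates all ancestor cells. The objective, depth and smoothness parameters are hypothetical, and the actual HOO algorithm grows the tree incrementally instead of fixing its depth.

```python
import math
import random

# Simplified HOO-style search on a fixed-depth binary partition of [0, 1].
DEPTH = 6          # binary tree, depth 6 -> 64 leaf cells
NU, RHO = 1.0, 0.5

counts = {}        # (h, i) -> number of samples falling in cell P_{h,i}
sums = {}          # (h, i) -> sum of rewards collected in P_{h,i}

def u_value(node, t):
    n = counts.get(node, 0)
    if n == 0:
        return float("inf")
    h, _ = node
    return sums[node] / n + math.sqrt(2 * math.log(t) / n) + NU * RHO ** h

def b_value(node, t):
    h, i = node
    if h == DEPTH:
        return u_value(node, t)
    children = [(h + 1, 2 * i), (h + 1, 2 * i + 1)]
    return min(u_value(node, t), max(b_value(c, t) for c in children))

def select_leaf(t):
    node = (0, 0)
    while node[0] < DEPTH:
        h, i = node
        children = [(h + 1, 2 * i), (h + 1, 2 * i + 1)]
        node = max(children, key=lambda c: b_value(c, t))
    return node

def noisy_f(x):    # hypothetical noisy objective
    return -(x - 0.3) ** 2 + 0.1 * random.gauss(0, 1)

random.seed(6)
for t in range(1, 201):
    leaf = select_leaf(t)
    h, i = leaf
    reward = noisy_f((i + 0.5) / 2 ** h)          # sample the cell midpoint
    for d in range(DEPTH + 1):                    # update all ancestor cells
        anc = (d, i >> (DEPTH - d))
        counts[anc] = counts.get(anc, 0) + 1
        sums[anc] = sums.get(anc, 0.0) + reward
```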

Results

Cumulative regret of HOO:

E[ Σ_{t=1}^T (f(x⋆) − f(x_t)) ] ≤ C T^{(d+1)/(d+2)} (ln(T))^{1/(d+2)}

for some near-optimality dimension d.

→ how to turn this into an optimizer? Recommend z_T ∼ U(x_1, ..., x_T), a point drawn uniformly among the queried points; then

E[ f(x⋆) − f(z_T) ] = R(HOO, T)/T ≤ C ( ln(T)/T )^{1/(d+2)}.

[Figure: a tree built by HOO.]

Many variants: DOO, SOO [Munos, 2011], StoSOO [Valko et al., 2013], POO [Grill et al., 2015], GPO [Shang et al., 2019], ...


Gaussian Process Regression

Assumption: the function f is drawn from a Gaussian Process, f ∼ GP(0, k(x, y)), i.e. for any distinct points x_1, ..., x_ℓ in X,

(f(x_1), f(x_2), ..., f(x_ℓ)) ∼ N(0, K), where K = (k(x_i, x_j))_{1 ≤ i,j ≤ ℓ}.

Bayesian inference: given some (possibly noisy) observations of f at x_1, ..., x_t, the posterior distribution of the function value at any point y is Gaussian:

f(y) | x_1, ..., x_t ∼ N( µ_t(y), σ_t(y)² )   [Rasmussen and Williams, 2005].

Bayesian Optimization

→ use the current GP posterior to pick the next point to query.

Example: [µ_t(y) ± β σ_t(y)] is a kind of confidence interval on f(y).

GP-UCB

GP-UCB [Srinivas et al., 2012] selects at round t + 1

x_{t+1} = argmax_{x ∈ X} [ µ_t(x) + √(β(t, δ)) σ_t(x) ].

→ Bayesian and frequentist guarantees in terms of cumulative regret for different choices of β(t, δ):

P( R_T(GP-UCB) ≤ C √( T β(T, δ) γ_T ) ) ≥ 1 − δ,

where γ_T denotes the maximal information gain after T rounds.
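A sketch of GP-UCB on a one-dimensional grid with a squared-exponential kernel, using the standard GP posterior formulas in plain numpy; the kernel hyper-parameters, noise level, β and the objective are hypothetical.

```python
import numpy as np

# GP-UCB sketch on a 1-D grid with a squared-exponential kernel.
rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 200)
noise, length, beta = 0.05, 0.1, 4.0

def k(a, b):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * length ** 2))

def f(x):                                   # hidden objective (assumption)
    return np.sin(6 * x) * x

X, y = [], []
for t in range(30):
    if X:
        Kxx = k(np.array(X), np.array(X)) + noise ** 2 * np.eye(len(X))
        Kxs = k(np.array(X), grid)
        mu_t = Kxs.T @ np.linalg.solve(Kxx, np.array(y))
        var_t = 1.0 - np.einsum("ij,ij->j", Kxs, np.linalg.solve(Kxx, Kxs))
        sigma_t = np.sqrt(np.maximum(var_t, 1e-12))
        x_next = grid[np.argmax(mu_t + np.sqrt(beta) * sigma_t)]   # UCB rule
    else:
        x_next = grid[rng.integers(len(grid))]                     # first query
    X.append(x_next)
    y.append(f(x_next) + noise * rng.normal())

print("best queried point:", X[int(np.argmax(y))])
```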

BO Algorithms

More generally, many Bayesian Optimization algorithms optimize an acquisition function α_t that depends on the posterior, and select

x_{t+1} = argmax_{x ∈ X} α_t(x).

Many other acquisition functions [Shahriari et al., 2016]:
- Expected Improvement
- Probability of Improvement
- Entropy Search ...

Remark: optimizing the acquisition function is another (non-trivial) optimization problem!

→ Thompson Sampling?
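As an example of such an acquisition function, here is a sketch of Expected Improvement for a maximization problem, given the Gaussian posterior at a candidate point; the numerical inputs of the usage line are hypothetical.

```python
import math

# Expected Improvement at a point with posterior N(mu, sigma^2), relative
# to the best value observed so far (maximization convention).
def expected_improvement(mu, sigma, best_so_far):
    if sigma <= 0:
        return max(mu - best_so_far, 0.0)
    z = (mu - best_so_far) / sigma
    pdf = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (mu - best_so_far) * cdf + sigma * pdf

# usage: posterior mean 0.8, posterior std 0.1, best observed value 0.75
print(expected_improvement(0.8, 0.1, 0.75))
```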

Bandit Tools for Planning in Games


Playout-Based Monte-Carlo Tree Search

Goal: decide on the next move based on the evaluation of possible trajectories (playouts) in the game, each ending with a random evaluation.

A famous bandit approach [Kocsis and Szepesvári, 2006]:
→ use UCB in each node to decide which child to explore next.

The UCT algorithm

N(s): number of visits of node s.
S(s): number of visits of s that end with the root player winning.

UCT in a MaxMin tree: in a MAX node s (= root player to move), go towards the child

argmax_{c ∈ C(s)} [ S(c)/N(c) + c √( ln(N(s)) / N(c) ) ].

[Figure: example node with N = 19 visits, children annotated with their win/visit counts and the corresponding UCB values.]

In a MIN node s (= adversary to move), go towards the child

argmin_{c ∈ C(s)} [ S(c)/N(c) − c √( ln(N(s)) / N(c) ) ].


+ the first Go AI programs were based on variants of UCT (+ heuristics).

− UCT is not based on statistically valid confidence intervals
− no sample complexity guarantees
− should we really maximize rewards?
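A sketch of one UCT simulation on a tiny hypothetical tree, using the two selection rules above and backing up visit and win counts after a random leaf evaluation.

```python
import math
import random

# One UCT simulation: UCB selection at MAX nodes, the symmetric rule at
# MIN nodes, random playout result at the leaf, then back up the counts.
random.seed(5)
leaf_prob = {"l1": 0.5, "l2": 0.3, "l3": 0.6, "l4": 0.1}   # hypothetical
children = {"root": ["m1", "m2"], "m1": ["l1", "l2"], "m2": ["l3", "l4"]}
node_type = {"root": "MAX", "m1": "MIN", "m2": "MIN"}
N = {s: 0 for s in ["root", "m1", "m2", "l1", "l2", "l3", "l4"]}
S = {s: 0 for s in N}
c = 0.7

def ucb(s, child, sign):
    if N[child] == 0:
        return float("inf") if sign > 0 else float("-inf")   # explore unvisited first
    return S[child] / N[child] + sign * c * math.sqrt(math.log(N[s]) / N[child])

def simulate():
    path, s = ["root"], "root"
    while s in children:
        if node_type[s] == "MAX":
            s = max(children[s], key=lambda ch: ucb(path[-1], ch, +1))
        else:
            s = min(children[s], key=lambda ch: ucb(path[-1], ch, -1))
        path.append(s)
    win = int(random.random() < leaf_prob[s])       # random playout result
    for node in path:                               # back up the statistics
        N[node] += 1
        S[node] += win

for _ in range(500):
    simulate()
print({ch: (S[ch], N[ch]) for ch in children["root"]})
```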


A simple model for MCTS

A fixed MAXMIN game tree T, with leaves L:
- MAX node (= root player to move)
- MIN node (= adversary to move)
- each leaf ℓ has a stochastic oracle O_ℓ that evaluates the position.

[Figure: a max-min tree with leaf means µ_1, ..., µ_8.]

At round t, an MCTS algorithm:
- picks a path down to a leaf L_t
- gets an evaluation of this leaf, X_t ∼ O_{L_t}.

Assumption: i.i.d. successive evaluations, with E_{X∼O_ℓ}[X] = µ_ℓ.

Goal

An MCTS algorithm should find the best move at the root:

V_s = µ_s if s ∈ L,
      max_{c ∈ C(s)} V_c if s is a MAX node,
      min_{c ∈ C(s)} V_c if s is a MIN node,

s⋆ = argmax_{s ∈ C(s_0)} V_s.

A structured BAI problem

MCTS algorithm: (L_t, τ, s_τ), where
- L_t is the sampling rule (which leaf to evaluate)
- τ is the stopping rule
- s_τ ∈ C(s_0) is the recommendation rule.

Goal: an (ε, δ)-PAC MCTS algorithm,

P( V_{s_τ} ≥ V_{s⋆} − ε ) ≥ 1 − δ,

with a small sample complexity τ [Teraoka et al., 2014].

Idea: use LUCB on the depth-one nodes.
→ requires confidence intervals on the values (V_s)_{s ∈ C(s_0)}
→ requires identifying a leaf to sample starting from each s ∈ C(s_0).

Page 95: Bandit Problems Part III - Bandits for Optimization · I observe conversion indicator X t ∼B(µ A t). Maximize rewards ↔maximize the number of conversions Alternative goal: identify

Using the samples collected for the leaves, one can build, for ` ∈ L,[LCB`(t),UCB`(t)] a confidence interval on µ`


Idea: Propagate these confidence intervals up in the tree



MAX node:  UCB_s(t) = max_{c ∈ C(s)} UCB_c(t),   LCB_s(t) = max_{c ∈ C(s)} LCB_c(t)



MIN node:  UCB_s(t) = min_{c ∈ C(s)} UCB_c(t),   LCB_s(t) = min_{c ∈ C(s)} LCB_c(t)

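A sketch of this propagation (my own encoding again: leaves carry their current interval as a (LCB, UCB) pair): both endpoints are maximized at a MAX node and minimized at a MIN node, exactly as in the two displays above.

```python
def propagate(node, is_max=True):
    """Return the interval (LCB_s, UCB_s) of node s, given leaf intervals."""
    if isinstance(node, tuple):                      # leaf: node = (LCB_l, UCB_l)
        return node
    child_bounds = [propagate(c, not is_max) for c in node]
    pick = max if is_max else min
    return (pick(b[0] for b in child_bounds),        # lower confidence bounds
            pick(b[1] for b in child_bounds))        # upper confidence bounds

# Hypothetical leaf intervals on a depth-two tree (root = MAX, depth one = MIN):
tree = [[(0.2, 0.6), (0.1, 0.5)], [(0.3, 0.7), (0.4, 0.9)]]
print(propagate(tree))   # interval on V_{s0}, here (0.3, 0.7)
```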


.First tool: confidence intervals


ℓ_s(t): the representative leaf of an internal node s ∈ T.


Idea: alternate optimistic/pessimistic moves starting from s
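One natural reading of this idea, sketched below with my own encoding and names (each node stores its already-propagated interval as (lcb, ucb, children)): from s, repeatedly move to the child with the largest UCB at a MAX node (optimistic move) and to the child with the smallest LCB at a MIN node (pessimistic move), until a leaf is reached.

```python
def representative_leaf(node, is_max, path=()):
    """Return the path of child indices from the starting node down to l_s(t)."""
    lcb, ucb, children = node
    if not children:                                                  # reached a leaf
        return path
    if is_max:
        i = max(range(len(children)), key=lambda j: children[j][1])   # largest UCB
    else:
        i = min(range(len(children)), key=lambda j: children[j][0])   # smallest LCB
    return representative_leaf(children[i], not is_max, path + (i,))

# Tiny depth-two example with hypothetical, already-propagated intervals:
leaf = lambda l, u: (l, u, [])
tree = (0.3, 0.7, [(0.1, 0.5, [leaf(0.2, 0.6), leaf(0.1, 0.5)]),
                   (0.3, 0.7, [leaf(0.3, 0.7), leaf(0.4, 0.9)])])
print(representative_leaf(tree, is_max=True))   # -> (1, 0)
```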


.Second tool: representative leaves


- run a BAI algorithm on the depth-one nodes

→ selects Rt ∈ C0

- sample the representative leaf associated with that node:

Lt = ℓ_{Rt}(t)

(≈ from depth one, run UCT based on statistically valid confidence intervals)
- update the confidence intervals
- stop when the BAI algorithm tells us to
- recommend the depth-one node chosen by the BAI algorithm
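The following is a compact, self-contained sketch of this loop on a depth-two tree (MAX at the root, MIN at depth one), with an LUCB-style choice of the two depth-one candidates. The tree, the exploration function and the stopping details are illustrative reconstructions, not the slides' exact pseudo-code.

```python
import math, random

MU = [[0.45, 0.50, 0.55], [0.35, 0.40, 0.60], [0.30, 0.47, 0.52]]  # unknown leaf means
DELTA, EPS = 0.1, 0.01
N_LEAVES = sum(len(row) for row in MU)

counts = [[0] * len(row) for row in MU]   # N_l(t) per leaf
sums = [[0.0] * len(row) for row in MU]   # sum of evaluations per leaf

def sample(i, j):
    """One Bernoulli evaluation of leaf j of move i."""
    counts[i][j] += 1
    sums[i][j] += 1.0 if random.random() < MU[i][j] else 0.0

def leaf_bounds(i, j):
    """Hoeffding-type interval with beta(s, delta) ~ ln(|L| ln(s) / delta)."""
    n = counts[i][j]
    beta = math.log(N_LEAVES * (1.0 + math.log(n)) / DELTA)
    width = math.sqrt(beta / (2 * n))
    mean = sums[i][j] / n
    return mean - width, mean + width

def move_bounds(i):
    """MIN node: both endpoints are minimized over the node's leaves."""
    bounds = [leaf_bounds(i, j) for j in range(len(MU[i]))]
    return min(b[0] for b in bounds), min(b[1] for b in bounds)

def move_value(i):
    """Empirical value of move i (MIN over its empirical leaf means)."""
    return min(sums[i][j] / counts[i][j] for j in range(len(MU[i])))

def representative_leaf(i):
    """Pessimistic move inside the MIN node: the leaf with the smallest LCB."""
    return min(range(len(MU[i])), key=lambda j: leaf_bounds(i, j)[0])

for i in range(len(MU)):                  # initialization: one evaluation per leaf
    for j in range(len(MU[i])):
        sample(i, j)

t = N_LEAVES
while True:
    leader = max(range(len(MU)), key=move_value)                    # empirical best move
    challenger = max((i for i in range(len(MU)) if i != leader),
                     key=lambda i: move_bounds(i)[1])               # best UCB among the rest
    if move_bounds(challenger)[1] - move_bounds(leader)[0] <= EPS:  # LUCB-style stop
        break
    for i in (leader, challenger):                                  # sample both candidates
        sample(i, representative_leaf(i))
        t += 1

print("recommended move:", leader, "after", t, "leaf evaluations")
```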


.The BAI-MCTS architecture


For some exploration function β, define

LCB_ℓ(t) = µ̂_ℓ(t) − √( β(N_ℓ(t), δ) / (2 N_ℓ(t)) ),
UCB_ℓ(t) = µ̂_ℓ(t) + √( β(N_ℓ(t), δ) / (2 N_ℓ(t)) ),

where µ̂_ℓ(t) is the empirical mean of the N_ℓ(t) evaluations of leaf ℓ.

Theorem [Kaufmann et al., 2018]. Choosing

β(s, δ) ≈ ln( |L| ln(s) / δ ),

LUCB-MCTS and UGapE-MCTS are (ε, δ)-PAC, and

P( τ = O( H*_ε(µ) ln(1/δ) ) ) ≥ 1 − δ

for UGapE-MCTS.
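Putting the confidence interval and this choice of β together, a small numerical sketch (my own; the "+1" inside the logarithm is just one way to keep β well defined after a single evaluation):

```python
import math

def leaf_interval(mu_hat, n, n_leaves, delta):
    """[LCB_l(t), UCB_l(t)] for an empirical mean mu_hat of n = N_l(t) evaluations."""
    beta = math.log(n_leaves * (1.0 + math.log(n)) / delta)   # ~ ln(|L| ln(n) / delta)
    width = math.sqrt(beta / (2 * n))
    return mu_hat - width, mu_hat + width

# e.g. 40 evaluations of a leaf with empirical mean 0.6, in a tree with 8 leaves:
print(leaf_interval(mu_hat=0.6, n=40, n_leaves=8, delta=0.1))
```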


.Theoretical guarantees


H*_ε(µ) := Σ_{ℓ ∈ L} 1 / ( ∆_ℓ² ∨ ∆*² ∨ ε² ),

where ∆* := V(s*) − V(s*_2) is the gap between the two best moves at the root, and

∆_ℓ := max_{s ∈ Ancestors(ℓ) \ {s0}} | V_{Parent(s)} − V_s |.

[Figure: example tree with leaf means 0.5, 0.3, 0.2, 0.4, 0.3, 0.5, 0.6, 0.1, propagated values 0.5, 0.4, 0.3, 0.6 at depth two and 0.4, 0.3 at depth one, and root value 0.4]
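To make the definition concrete, here is a small sketch computing H*_ε(µ) for a tree of values (my own encoding; it reads Ancestors(ℓ) \ {s0} as the nodes on the path from the root down to ℓ, so that ∆_ℓ is the largest value gap along the edges of that path):

```python
def annotate(node, is_max=True):
    """Backward induction: return (V_s, annotated children) for every node."""
    if not isinstance(node, list):                   # leaf: node is mu_l
        return (node, [])
    kids = [annotate(c, not is_max) for c in node]
    v = (max if is_max else min)(k[0] for k in kids)
    return (v, kids)

def leaf_gaps(annotated, running=0.0):
    """Delta_l per leaf: max of |V_Parent(s) - V_s| over the edges of its root path."""
    v, kids = annotated
    if not kids:
        return [running]
    gaps = []
    for child in kids:
        gaps += leaf_gaps(child, max(running, abs(v - child[0])))
    return gaps

def complexity(tree, eps):
    root = annotate(tree)                            # the root is a MAX node
    depth_one = sorted((k[0] for k in root[1]), reverse=True)
    delta_star = depth_one[0] - depth_one[1]         # gap between the two best root moves
    return sum(1.0 / max(g, delta_star, eps) ** 2 for g in leaf_gaps(root))

# Hypothetical depth-three tree with 8 leaves:
tree = [[[0.5, 0.3], [0.2, 0.4]], [[0.3, 0.5], [0.6, 0.1]]]
print(complexity(tree, eps=0.05))
```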

(slightly improved complexity in the work of [Huang et al., 2017])


.The complexity term


- Optimal and efficient algorithms for solving best action identification in a maxmin tree...
- ... and other generic active identification problems
- UCT versus BAI-MCTS on large-scale problems?


.Future work


- Best arm identification and regret minimization are two different problems that require different sampling rules
- Upper and Lower Confidence Bounds are useful in both settings
- Optimal algorithms for BAI are inspired by the lower bounds (cf. structured bandits)
- Tools for BAI → more general Active Identification problems
- Bandit tools inspire methods for sequential optimization in large spaces (game trees or continuous spaces)


.Take-home messages


That’s all!

Now you're ready to pull the right arm ;-)


Audibert, J.-Y., Bubeck, S., and Munos, R. (2010). Best Arm Identification in Multi-armed Bandits. In Proceedings of the 23rd Conference on Learning Theory.

Bubeck, S., Munos, R., and Stoltz, G. (2009). Pure Exploration in multi-armed bandit problems. In Algorithmic Learning Theory.

Bubeck, S., Munos, R., Stoltz, G., and Szepesvari, C. (2011). X-armed bandits. Journal of Machine Learning Research, 12:1587–1627.

Carpentier, A. and Locatelli, A. (2016). Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Conference on Learning Theory (COLT).

Chernoff, H. (1959). Sequential design of Experiments. The Annals of Mathematical Statistics, 30(3):755–770.

Degenne, R., Koolen, W., and Menard, P. (2019). Non asymptotic pure exploration by solving games. arXiv:1906.10431.


Degenne, R. and Koolen, W. M. (2019). Pure Exploration with Multiple Correct Answers. arXiv:1902.03475.

Even-Dar, E., Mannor, S., and Mansour, Y. (2006). Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems. Journal of Machine Learning Research, 7:1079–1105.

Garivier, A. and Kaufmann, E. (2016). Optimal best arm identification with fixed confidence. In Proceedings of the 29th Conference On Learning Theory.

Garivier, A., Menard, P., and Rossi, L. (2017). Thresholding bandit for dose-ranging: The impact of monotonicity. arXiv:1711.04454.

Grill, J., Valko, M., and Munos, R. (2015). Black-box optimization of noisy functions with unknown smoothness. In Advances in Neural Information Processing Systems.

Huang, R., Ajallooeian, M. M., Szepesvari, C., and Muller, M. (2017). Structured best arm identification with fixed confidence. In International Conference on Algorithmic Learning Theory (ALT).


Jamieson, K., Malloy, M., Nowak, R., and Bubeck, S. (2014). lil'UCB: an Optimal Exploration Algorithm for Multi-Armed Bandits. In Proceedings of the 27th Conference on Learning Theory.

Jun, K.-S. and Nowak, R. (2016). Anytime exploration for multi-armed bandits using confidence information. In International Conference on Machine Learning (ICML).

Juneja, S. and Krishnasamy, S. (2019). Sample complexity of partition identification using multi-armed bandits. In Conference on Learning Theory (COLT).

Kalyanakrishnan, S., Tewari, A., Auer, P., and Stone, P. (2012). PAC subset selection in stochastic multi-armed bandits. In International Conference on Machine Learning (ICML).

Karnin, Z., Koren, T., and Somekh, O. (2013). Almost optimal Exploration in multi-armed bandits. In International Conference on Machine Learning (ICML).

Kaufmann, E., Cappe, O., and Garivier, A. (2016). On the Complexity of Best Arm Identification in Multi-Armed Bandit Models. Journal of Machine Learning Research, 17(1):1–42.


Kaufmann, E. and Koolen, W. (2018). Mixture martingales revisited with applications to sequential tests and confidence intervals. arXiv:1811.11419.

Kaufmann, E., Koolen, W., and Garivier, A. (2018). Sequential test for the lowest mean: From Thompson to Murphy sampling. In Advances in Neural Information Processing Systems (NeurIPS).

Kleinberg, R., Slivkins, A., and Upfal, E. (2008). Multi-armed bandits in metric spaces. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing.

Kocsis, L. and Szepesvari, C. (2006). Bandit based monte-carlo planning. In Proceedings of the 17th European Conference on Machine Learning, ECML'06, pages 282–293, Berlin, Heidelberg. Springer-Verlag.

Locatelli, A., Gutzeit, M., and Carpentier, A. (2016). An optimal algorithm for the thresholding bandit problem. In International Conference on Machine Learning (ICML).

Menard, P. (2019). Gradient ascent for active exploration in bandit problems. arXiv:1905.08165.

Munos, R. (2011). Optimistic optimization of a deterministic function without the knowledge of its smoothness. In Advances in Neural Information Processing Systems.

Rasmussen, C. E. and Williams, C. K. I. (2005). Gaussian Processes for Machine Learning. The MIT Press.

Russo, D. (2016). Simple bayesian algorithms for best arm identification. In Proceedings of the 29th Conference On Learning Theory.

Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and de Freitas, N. (2016). Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175.

Shang, X., Kaufmann, E., and Valko, M. (2019). General parallel optimization without a metric. In Algorithmic Learning Theory (ALT).


Srinivas, N., Krause, A., Kakade, S., and Seeger, M. (2012). Information-Theoretic Regret Bounds for Gaussian Process Optimization in the Bandit Setting. IEEE Transactions on Information Theory, 58(5):3250–3265.

Teraoka, K., Hatano, K., and Takimoto, E. (2014). Efficient sampling method for Monte Carlo tree search problem. IEICE Transactions on Information and Systems, pages 392–398.

Valko, M., Carpentier, A., and Munos, R. (2013). Stochastic simultaneous optimistic optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML).
