
Bandits and Hyper-parameter Optimization
Xuedong Shang¹, Emilie Kaufmann², Michal Valko³
¹ Inria Lille, SequeL team   ² CNRS, CRIStAL   ³ Google DeepMind
Huawei Paris, Oct. 2019

What — Hyper-parameter optimization
Problem
We tackle hyper-parameter tuning for supervised learning tasks:
- global optimisation task: min { f(λ) : λ ∈ Ω };
- f(λ) ≜ E[ℓ(Y, g_λ^(n)(X))] measures the generalization power, where g_λ^(n) is the model trained with hyper-parameters λ on n samples and ℓ is the loss.
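To make the objective concrete, here is a minimal sketch (not from the slides) that estimates f(λ) for one configuration by cross-validated loss; the SVC learner and the wine dataset are illustrative assumptions.

```python
# Hypothetical illustration: estimate f(lambda) for one configuration by
# cross-validated misclassification loss. The SVC learner and the wine
# dataset are assumptions made for the sake of the example.
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

def f_hat(C, gamma):
    """Empirical proxy for f(lambda) = E[loss(Y, g_lambda(X))]."""
    accuracy = cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=5).mean()
    return 1.0 - accuracy  # 0/1 loss estimate

print(f_hat(C=1.0, gamma=1e-3))  # generalization-loss estimate for one lambda
```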




How — Best-arm identification
We see the problem as best-arm identification in a stochastic infinitely-armed bandit: arms' means are drawn from some reservoir distribution ν0.

In each round:
1. (optional) query a new arm from ν0;
2. sample an arm that was previously queried.

Goal: output an arm with mean close to µ★.

D-TTTS (Dynamic Top-Two Thompson Sampling): a dynamic algorithm built on TTTS. A minimal sketch of the interaction protocol follows below.
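The sketch below only illustrates the protocol (query the reservoir, then sample a queried arm); the selection rule is a placeholder, not the slides' algorithm, and all names are illustrative.

```python
# Sketch of the infinite-armed BAI protocol: query arms from the reservoir
# nu_0 (here Uniform[0,1]) and sample previously queried arms. The selection
# rule below is a placeholder, not the algorithm from the slides.
import random

class Arm:
    def __init__(self, mean):
        self.mean = mean              # unknown to the learner
        self.successes, self.pulls = 0, 0

    def sample(self):                 # Bernoulli reward with this arm's mean
        x = 1 if random.random() < self.mean else 0
        self.successes += x
        self.pulls += 1
        return x

def query_reservoir():                # nu_0: uniform reservoir distribution
    return Arm(random.random())

arms = [query_reservoir()]
for t in range(200):
    if random.random() < 0.5:         # step 1 (optional): query a new arm
        arms.append(query_reservoir())
    random.choice(arms).sample()      # step 2: sample a previously queried arm

# recommend the arm with the best empirical mean among sampled arms
best = max((a for a in arms if a.pulls > 0), key=lambda a: a.successes / a.pulls)
```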

How — D-TTTS
In this talk...
- Beta-Bernoulli Bayesian bandit model
- a uniform prior over the mean of new arms
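As a quick illustration of this model: with a uniform Beta(1,1) prior, after S_i successes in N_i pulls, arm i's posterior is Beta(S_i + 1, N_i − S_i + 1), and Thompson sampling draws one sample from each such posterior (the counts below are made-up).

```python
# Posterior sampling for one arm under the Beta-Bernoulli model with a
# uniform Beta(1,1) prior; S_i and N_i are made-up example counts.
import numpy as np

rng = np.random.default_rng(0)
S_i, N_i = 7, 10                               # 7 successes in 10 pulls
theta_i = rng.beta(S_i + 1, N_i - S_i + 1)     # one Thompson sample
print(theta_i)                                 # posterior mean is (S_i+1)/(N_i+2) = 0.667
```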


How — D-TTTS
 1: Initialization: µ1 ∼ ν0; A ← {µ1}; m ← 1; S1, N1 ← 0
 2: while budget still available do
 3:     µm+1 ∼ ν0; A ← A ∪ {µm+1}
 4:     Sm+1, Nm+1 ← 0; m ← m + 1
 5:     ∀i ∈ A, θi ∼ Beta(Si + 1, Ni − Si + 1)
 6:     I(1) ← arg max_{i=0,...,m} θi                    ▷ Thompson sampling
 7:     if U ∼ U([0, 1]) > β then
 8:         repeat
 9:             ∀i ∈ A, θ′i ∼ Beta(Si + 1, Ni − Si + 1)
10:             I(2) ← arg max_{i=0,...,m} θ′i
11:         until I(2) ≠ I(1)
12:         I(1) ← I(2)
13:     end if                                           ▷ TTTS
14:     Y ← evaluate arm I(1); X ∼ Ber(Y)
15:     SI(1) ← SI(1) + X; NI(1) ← NI(1) + 1
16: end while
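Below is a minimal Python sketch of the loop above, assuming a uniform reservoir ν0 on [0, 1], β = 0.5, and a black-box `evaluate` returning a value in [0, 1] (here simply the arm's latent mean); the final recommendation rule is an illustrative choice, not specified on the slide.

```python
# Minimal sketch of D-TTTS (Beta-Bernoulli model, uniform reservoir nu_0).
# `evaluate`, beta = 0.5, and the recommendation rule are assumptions
# made for illustration, not prescribed by the slides.
import numpy as np

rng = np.random.default_rng(42)
beta = 0.5

def evaluate(latent_mean):              # black-box "evaluate arm I(1)"
    return latent_mean

means = [rng.uniform()]                 # mu_1 ~ nu_0
S, N = [0], [0]                         # success and pull counts

for _ in range(200):                    # while budget still available
    means.append(rng.uniform())         # query a new arm from nu_0
    S.append(0); N.append(0)

    a, b = np.array(S) + 1, np.array(N) - np.array(S) + 1
    leader = int(np.argmax(rng.beta(a, b)))          # I(1): Thompson sample
    if rng.uniform() > beta:                         # with prob 1 - beta ...
        challenger = leader
        while challenger == leader:                  # ... re-sample until it differs
            challenger = int(np.argmax(rng.beta(a, b)))
        leader = challenger                          # I(1) <- I(2)

    y = evaluate(means[leader])          # Y <- evaluate arm I(1)
    x = rng.binomial(1, y)               # X ~ Ber(Y)
    S[leader] += x; N[leader] += 1

# recommend the arm with the highest posterior mean (illustrative choice)
best = int(np.argmax([(s + 1) / (n + 2) for s, n in zip(S, N)]))
```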

How — D-TTTS cont'd

D-TTTS in summary...
In each round, query a new arm endowed with a Beta(1,1) prior, without sampling it, and run TTTS on the new set of arms.

How — D-TTTS cont'd

Order statistic trick
With Lt−1 the list of arms that have been effectively sampled at time t, we run TTTS on the set Lt−1 ∪ {µ0}, where µ0 is a pseudo-arm with posterior Beta(t − |Lt−1|, 1).
Figure: Posterior distributions of 4 arms and the pseudo-arm
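The name of the trick suggests the standard order-statistic fact that the maximum of k i.i.d. Uniform(0,1) draws is Beta(k,1)-distributed, so one pseudo-arm can stand in for the k arms queried from a uniform reservoir but never sampled. A quick numerical check (k = 5 is an arbitrary choice):

```python
# Numerical check of the order-statistic fact: the maximum of k i.i.d.
# Uniform(0,1) draws is Beta(k,1)-distributed (k = 5 chosen arbitrarily).
import numpy as np

rng = np.random.default_rng(0)
k, n = 5, 200_000
max_of_uniforms = rng.uniform(size=(n, k)).max(axis=1)
beta_k_1 = rng.beta(k, 1, size=n)

print(max_of_uniforms.mean(), beta_k_1.mean())   # both approx k/(k+1) = 0.833
```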

Why
- TTTS is anytime for finitely-armed bandits;
- the flexibility of this Bayesian algorithm allows us to propose a dynamic version for infinite-armed BAI;
- unlike previous approaches, D-TTTS does not need to fix the number of arms queried in advance, and naturally adapts to the difficulty of the task.

HPO as a BAI problem
BAI                   HPO
query ν0              pick a new configuration λ
sample an arm         train the classifier gλ
reward                cross-validation loss
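A sketch of this mapping in code, with scikit-learn as an illustrative learner: querying ν0 draws a fresh configuration, and sampling an arm trains g_λ and returns its cross-validated score (one minus the cross-validation loss in the table) as the reward. The log-uniform draw and the dataset are assumptions, not prescribed by the slide.

```python
# Sketch of the BAI <-> HPO mapping in the table above; the log-uniform
# configuration draw and the SVC learner are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = load_wine(return_X_y=True)

def query_reservoir():                  # "query nu_0": pick a new configuration lambda
    return {"C": 10 ** rng.uniform(-5, 5), "gamma": 10 ** rng.uniform(-5, 5)}

def sample_arm(lam):                    # "sample an arm": train the classifier g_lambda
    score = cross_val_score(SVC(**lam), X, y, cv=3).mean()
    return score                        # reward = 1 - cross-validation loss

lam = query_reservoir()
print(lam, sample_arm(lam))
```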

Experiments — Setting
Classifier   Hyper-parameter      Type      Bounds
SVM          C                    R+        [10^-5, 10^5]
             γ                    R+        [10^-5, 10^5]

Table: hyper-parameters tuned for the UCI experiments.

Classifier   Hyper-parameter      Type      Bounds
MLP          hidden layer size    Integer   [5, 50]
             alpha                R+        [0, 0.9]
             learning rate init   R+        [10^-5, 10^-1]

Table: hyper-parameters tuned for the MNIST experiments.
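The two search spaces above, written out as plain sampling specifications; treating C, γ, and the initial learning rate as log-uniform over their bounds is an assumption suggested by the bounds, not stated on the slide.

```python
# The search spaces from the two tables, as plain data. The "log-uniform"
# choice for scale-type parameters is an assumption, not stated on the slide.
search_spaces = {
    "SVM (UCI)": {
        "C":     ("log-uniform", 1e-5, 1e5),
        "gamma": ("log-uniform", 1e-5, 1e5),
    },
    "MLP (MNIST)": {
        "hidden_layer_size":  ("integer",     5, 50),
        "alpha":              ("uniform",     0.0, 0.9),
        "learning_rate_init": ("log-uniform", 1e-5, 1e-1),
    },
}
```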

Experiments — Some results
Figure: Validation error vs. number of iterations for Hyperband, TPE, Random Search, H-TTTS, and D-TTTS on four datasets: (a) wine, (b) breast cancer, (c) adult, (d) MNIST.

Next?
- Extend to the non-stochastic setting;
- Theoretical guarantees?
Theorem (Shang, de Heide, et al. 2019)
The TTTS sampling rule coupled with the Chernoff stopping rule forms a δ-correct BAI strategy. Moreover, if all the arm means are distinct, it satisfies

    lim sup_{δ→0} E[τδ] / log(1/δ) ≤ 1 / Γ★β.
More details
Check out [Shang, Kaufmann, et al. 2019; Shang, de Heide, et al. 2019].

References
Thank you!
Xuedong Shang, Rianne de Heide, Pierre Ménard, Emilie Kaufmann, and Michal Valko. "Fixed-confidence guarantees for Bayesian best-arm identification". 2019.
Xuedong Shang, Emilie Kaufmann, and Michal Valko. "A simple dynamic bandit algorithm for hyper-parameter optimization". In: 6th ICML Workshop on AutoML. 2019.