Online Combinatorial Optimization under Bandit Feedback

M. Sadegh Talebi*

*Department of Automatic Control, KTH Royal Institute of Technology

February 2016

1 / 49


Combinatorial Optimization

Decision space M ⊂ {0, 1}^d. Each decision M ∈ M is a binary d-dimensional vector.
Combinatorial structure, e.g., matchings, spanning trees, fixed-size subsets, graph cuts, paths.

Weights θ ∈ R^d

Generic combinatorial (linear) optimization:

maximize M⊤θ = ∑_{i=1}^d M_i θ_i   over M ∈ M

Sequential decision making over T rounds

2 / 49
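The generic linear objective above can be sketched directly in code. A minimal illustration, assuming fixed-size subsets as the decision class (one of the structures listed); function names are illustrative, not from the thesis.

```python
from itertools import combinations

def linear_maximize(theta, decisions):
    # Generic combinatorial (linear) optimization: argmax_{M in decisions} M^T theta
    return max(decisions, key=lambda M: sum(m * t for m, t in zip(M, theta)))

def fixed_size_subsets(d, m):
    # Decision space M: binary d-dimensional vectors with exactly m ones
    for idx in combinations(range(d), m):
        M = [0] * d
        for i in idx:
            M[i] = 1
        yield tuple(M)

theta = [0.3, 0.9, 0.1, 0.7]
best = linear_maximize(theta, list(fixed_size_subsets(4, 2)))
# best picks the two largest weights (indices 1 and 3): (0, 1, 0, 1)
```

For the structures listed on the slide (matchings, spanning trees, paths), this brute-force argmax would be replaced by the corresponding polynomial-time combinatorial solver.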



Combinatorial Optimization under Uncertainty

Sequential decision making over T rounds

Known θ ⟹ always select M* := argmax_{M ∈ M} M⊤θ.

Weights θ could be initially unknown or unpredictably varying.

At time n, the environment chooses a reward vector X(n) ∈ R^d:
Stochastic: X(n) i.i.d. with E[X(n)] = θ.
Adversarial: X(n) chosen beforehand by an adversary.

Selecting M gives reward M⊤X(n) = ∑_{i=1}^d M_i X_i(n).

Sequential learning: at each step n, select M(n) ∈ M based on the previous decisions and observed rewards.

3 / 49



Regret

Goal: Maximize the collected rewards in expectation:

E[ ∑_{n=1}^T M(n)⊤X(n) ].

Or equivalently, minimize the regret over T rounds:

R(T) = max_{M ∈ M} E[ ∑_{n=1}^T M⊤X(n) ]  (oracle)  −  E[ ∑_{n=1}^T M(n)⊤X(n) ]  (your algorithm).

Quantifies the cumulative loss of not choosing the best decision (in hindsight).

The algorithm is learning iff R(T) = o(T).

4 / 49
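The regret definition can be evaluated directly on realized rewards when the decision space is small. A toy sketch (function names hypothetical), computing regret against the best fixed decision in hindsight:

```python
def decision_value(M, X):
    # M^T X for a binary decision M and a reward vector X
    return sum(m * x for m, x in zip(M, X))

def regret(reward_vectors, chosen, decisions):
    # R(T) = max_M sum_n M^T X(n)  -  sum_n M(n)^T X(n), here on realized rewards
    oracle = max(sum(decision_value(M, X) for X in reward_vectors) for M in decisions)
    algo = sum(decision_value(M_n, X) for M_n, X in zip(chosen, reward_vectors))
    return oracle - algo

decisions = [(1, 0), (0, 1)]
rewards = [(1, 0), (1, 0), (0, 1)]   # X(1), X(2), X(3)
chosen = [(0, 1), (0, 1), (0, 1)]    # the algorithm's picks M(n)
# best fixed decision (1,0) collects 2; the algorithm collects 1; regret = 1
print(regret(rewards, chosen, decisions))  # -> 1
```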



Feedback

Choose M(n) based on previous decisions and observed feedback

Full information: X(n) is revealed.

Semi-bandit feedback: Xi(n) is revealed iff Mi(n) = 1.

Bandit feedback: only the reward M(n)>X(n) is revealed.

Sequential learning is modeled as a Multi-Armed Bandit (MAB) problem.

Combinatorial MAB:

Decision M ∈ M ⟺ Arm

Element i ∈ {1, . . . , d} ⟺ Basic action

Each arm is composed of several basic actions.

5 / 49
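The three feedback models differ only in what the learner observes after playing a decision. A small sketch (function name hypothetical):

```python
def observe(M, X, feedback):
    # What the learner sees after playing decision M when the environment draws X
    if feedback == "full":         # full information: the whole vector X(n)
        return list(X)
    if feedback == "semi-bandit":  # X_i(n) revealed iff M_i(n) = 1
        return [x if m == 1 else None for m, x in zip(M, X)]
    if feedback == "bandit":       # only the scalar reward M(n)^T X(n)
        return sum(m * x for m, x in zip(M, X))
    raise ValueError(feedback)

M, X = (1, 0, 1), (2, 4, 7)
print(observe(M, X, "semi-bandit"))  # -> [2, None, 7]
print(observe(M, X, "bandit"))       # -> 9
```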



Application 1: Spectrum Sharing

[Figure: bipartite graph matching links 1–5 to channels 1–5]

K channels, L links

M ≡ the set of matchings from [L] to [K]

θ_ij ≡ data rate on the connection (link i, channel j)

X_ij(n) ≡ success/failure indicator for the transmission of link i on channel j

6 / 49


Application 2: Shortest-path Routing

[Figure: routing graph from Source to Destination]

M ≡ the set of paths

θi ≡ average transmission delay on link i

Xi(n) ≡ transmission delay of link i for n-th packet

7 / 49


Application 2: Shortest-path Routing

[Figure: routing graph; the chosen path (red) has link delays 2, 4, 7, 1, 6; total Delay = 20]

Semi-bandit feedback: (2, 4, 7, 1, 6) are revealed for chosen links (red).

Bandit feedback: 20 is revealed for the chosen path.

8 / 49



Exploiting Combinatorial Structure

Classical MAB (M a set of singletons; |M| = d):

Stochastic: R(T) ~ |M| log(T)
Adversarial: R(T) ~ √(|M| T)

Generic combinatorial M: |M| can grow exponentially in d ⟹ prohibitive regret.

Arms are correlated; they share basic actions.

⟹ exploit the combinatorial structure of M to get R(T) ~ C log(T) or R(T) ~ √(C T) with C ≪ |M|

How much can we reduce the regret by exploiting the combinatorial structure of M?

How to optimally do so?

10 / 49



Map of Thesis

How much can we reduce the regret by exploitingthe combinatorial structure of M?

How to optimally do so?

Chapter | Combinatorial Structure M       | Reward X
Ch. 3   | Generic                         | Bernoulli
Ch. 4   | Matroid                         | Bernoulli
Ch. 5   | Generic                         | Geometric
Ch. 6   | Generic (with fixed cardinality)| Adversarial

11 / 49


Outline

1 Combinatorial MABs: Bernoulli Rewards

2 Stochastic Matroid Bandits

3 Adversarial Combinatorial MABs

4 Conclusion and Future Directions

12 / 49



Stochastic CMABs

Rewards:

X(n) i.i.d., Bernoulli distributed with E[X(n)] = θ ∈ [0, 1]^d

X_i(n), i ∈ [d], are independent across i

µ_M := M⊤θ, the average reward of arm M

Average reward gap: ∆_M = µ* − µ_M
Optimality gap: ∆_min = min_{M ≠ M*} ∆_M

Algorithm                  | Regret
LLR (Gai et al., 2012)     | O( (m^4 d / ∆_min^2) log(T) )
CUCB (Chen et al., 2013)   | O( (m^2 d / ∆_min) log(T) )
CUCB (Kveton et al., 2015) | O( (m d / ∆_min) log(T) )
ESCB                       | O( (√m d / ∆_min) log(T) )

m = maximal cardinality of arms

14 / 49



Optimism in the face of uncertainty

Construct a confidence bound [b−, b+] for (unknown) µ s.t.

µ ∈ [b−, b+] with high probability

Maximization problem ⟹ we replace the (unknown) µ by b+, its Upper Confidence Bound (UCB) index.

“Optimism in the face of uncertainty” principle: choose the arm M with the highest UCB index.

Algorithm based on the optimistic principle:

For arm M and time n, find a confidence interval for µ_M:

P[ µ_M ∈ [b−_M(n), b+_M(n)] ] ≥ 1 − O( 1/(n log(n)) )

Choose M(n) ∈ argmax_{M ∈ M} b+_M(n)

15 / 49


Optimistic Principle

[Figure: arms M on the horizontal axis, mean rewards µ_M on the vertical axis; for each arm the empirical mean µ̂_M(n) is shown with a confidence interval reaching up to its index b_M(n)]

Choose M(n) ∈ argmax_M b_M(n)

How to construct the index for arms?

17 / 49



Index Construction

Naive approach: construct an index for each basic action ⟹ index of arm M = sum of the indexes of the basic actions in M.

Empirical mean θ̂_i(n); number of observations t_i(n). Hoeffding's inequality:

P[ θ_i ∈ ( θ̂_i(n) − √(log(1/δ)/(2 t_i(n))), θ̂_i(n) + √(log(1/δ)/(2 t_i(n))) ) ] ≥ 1 − 2δ

Choose δ = 1/n³.

Index: b_M(n) = ∑_{i=1}^d M_i θ̂_i(n)  [empirical mean µ̂_M(n)]  +  ∑_{i=1}^d M_i √(3 log(n)/(2 t_i(n)))  [confidence radius].

Then, with probability ≥ 1 − 1/n³,

µ_M ∈ [ µ̂_M(n) − ∑_{i=1}^d M_i √(3 log(n)/(2 t_i(n))),  µ̂_M(n) + ∑_{i=1}^d M_i √(3 log(n)/(2 t_i(n))) ].

18 / 49
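The naive index is just a sum of per-element Hoeffding bonuses. A direct transcription (variable names assumed):

```python
import math

def naive_index(M, theta_hat, t, n):
    # b_M(n) = sum_i M_i * theta_hat_i(n)              (empirical mean mu_hat_M(n))
    #        + sum_i M_i * sqrt(3 log(n) / (2 t_i(n))) (confidence radius)
    mean = sum(m * th for m, th in zip(M, theta_hat))
    radius = sum(math.sqrt(3 * math.log(n) / (2 * ti))
                 for m, ti in zip(M, t) if m)
    return mean + radius

M = (1, 1, 0)
idx = naive_index(M, (0.5, 0.5, 0.9), (10, 10, 10), n=100)
# the index exceeds the empirical mean 1.0 by the summed per-element radii
```

Note that the radius here is a sum of up to m element-wise radii; the arm-level indexes introduced next tighten this by taking a square root of a sum instead, which is where the √m improvement in the regret table comes from.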



Index Construction

Our approach: construct a confidence interval directly for each arm M.

Motivated by a concentration inequality for the sum of empirical KL-divergences.

For a given δ, consider the set

B = { λ ∈ [0, 1]^d : ∑_{i=1}^d t_i(n) kl(θ̂_i(n), λ_i) ≤ log(1/δ) }

with kl(u, v) := u log(u/v) + (1 − u) log((1 − u)/(1 − v)).

Find an upper confidence bound for µ_M that holds w.p. at least 1 − δ. Equivalently,

µ_M ≤ max_{λ ∈ B} M⊤λ  w.p. at least 1 − δ.

19 / 49



Proposed Indexes

Two new indexes:

(1) Index b_M(n): the optimal value of

b_M(n) = max_{λ ∈ [0,1]^d} ∑_{i=1}^d M_i λ_i
subject to: ∑_{i=1}^d M_i t_i(n) kl(θ̂_i(n), λ_i) ≤ f(n)   [f(n) plays the role of log(1/δ)],

with f(n) = log(n) + 4m log(log(n)).

b_M is computed by a line search (derived from the KKT conditions). It generalizes the KL-UCB index (Garivier & Cappé, 2011) to combinatorial MABs.

(2) Index c_M:

c_M(n) = µ̂_M(n) + √( (f(n)/2) ∑_{i=1}^d M_i / t_i(n) ).

20 / 49
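Both indexes can be computed numerically. A sketch under stated assumptions (names hypothetical, not the thesis code): b_M is found by an outer bisection on the KKT multiplier η, where each λ_i(η) is obtained by an inner bisection, since the stationarity map (λ − θ̂)/(λ(1 − λ)) is increasing on [θ̂, 1); c_M is a closed-form expression.

```python
import math

def kl(u, v):
    # Bernoulli KL divergence kl(u, v), with clipping for numerical safety
    eps = 1e-12
    u = min(max(u, eps), 1 - eps)
    v = min(max(v, eps), 1 - eps)
    return u * math.log(u / v) + (1 - u) * math.log((1 - u) / (1 - v))

def _lam(th, ti, eta, iters=60):
    # Stationarity of lam - eta*t_i*kl(th, lam): (lam-th)/(lam(1-lam)) = 1/(eta*t_i);
    # the left-hand side is increasing on [th, 1), so bisection applies.
    target = 1.0 / (eta * ti)
    lo, hi = th, 1.0 - 1e-12
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (mid - th) / (mid * (1.0 - mid)) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def b_index(M, theta_hat, t, f, iters=80):
    # b_M = max sum_i M_i*lam_i  s.t.  sum_i M_i*t_i*kl(theta_hat_i, lam_i) <= f
    items = [(min(max(th, 1e-6), 1 - 1e-6), ti)
             for m, th, ti in zip(M, theta_hat, t) if m]
    lo, hi = 1e-8, 1e8  # total KL spent decreases as the multiplier eta grows
    for _ in range(iters):
        eta = math.sqrt(lo * hi)  # geometric bisection over a wide range
        spent = sum(ti * kl(th, _lam(th, ti, eta)) for th, ti in items)
        if spent > f:
            lo = eta
        else:
            hi = eta
    return sum(_lam(th, ti, hi) for th, ti in items)  # hi is on the feasible side

def c_index(M, theta_hat, t, f):
    # c_M(n) = mu_hat_M(n) + sqrt(f(n)/2 * sum_i M_i/t_i(n))
    mu = sum(m * th for m, th in zip(M, theta_hat))
    return mu + math.sqrt(0.5 * f * sum(m / ti for m, ti in zip(M, t)))

M, th, t, f = (1, 1, 0, 1), (0.5, 0.4, 0.9, 0.6), (10, 20, 5, 10), 2.0
b, c = b_index(M, th, t, f), c_index(M, th, t, f)
# consistent with the next slide's theorem: c_M(n) >= b_M(n)
```

The returned b is a feasible (slightly conservative) value, so it never exceeds the exact optimum; numerically it should always sit between the empirical mean µ̂_M and c_M.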



Proposed Indexes

b_M(n) = max_{λ ∈ [0,1]^d} ∑_{i=1}^d M_i λ_i  subject to: ∑_{i=1}^d M_i t_i(n) kl(θ̂_i(n), λ_i) ≤ f(n),

c_M(n) = µ̂_M(n) + √( (f(n)/2) ∑_{i=1}^d M_i / t_i(n) ).

Theorem

For all M ∈ M and n ≥ 1: c_M(n) ≥ b_M(n).

Proof idea: Pinsker’s inequality + Cauchy-Schwarz inequality

21 / 49



ESCB Algorithm

ESCB ≡ Efficient Sampling for Combinatorial Bandits

Algorithm 1: ESCB

for n ≥ 1 do
    Select arm M(n) ∈ argmax_{M ∈ M} ζ_M(n).
    Observe the rewards, and update t_i(n) and θ̂_i(n) for all i ∈ M(n).
end for

ESCB-1 if ζ_M = b_M; ESCB-2 if ζ_M = c_M.

22 / 49
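The loop above can be simulated end to end. A self-contained sketch of ESCB-2 (the c_M index) on Bernoulli rewards with semi-bandit observations, for fixed-size subsets; the environment, parameters, and the small shifts inside f(n) (to keep log(log(·)) finite for early rounds) are illustrative choices, not from the thesis.

```python
import math
import random
from itertools import combinations

def escb2(decisions, theta, T, seed=0):
    # ESCB-2: each round, pick M(n) maximizing c_M(n); observe X_i(n) for i in M(n)
    rng = random.Random(seed)
    d = len(theta)
    t = [0] * d              # t_i(n): number of observations of element i
    s = [0.0] * d            # cumulative observed reward of element i
    m = max(sum(M) for M in decisions)
    total = 0.0
    for n in range(1, T + 1):
        # f(n) = log(n) + 4m log(log(n)), shifted slightly to stay finite for small n
        f = math.log(n + 1) + 4 * m * math.log(math.log(n + 2))
        def c_index(M):
            if any(M[i] and t[i] == 0 for i in range(d)):
                return float("inf")  # force one observation of every element first
            mu = sum(s[i] / t[i] for i in range(d) if M[i])
            return mu + math.sqrt(0.5 * f * sum(1.0 / t[i] for i in range(d) if M[i]))
        M = max(decisions, key=c_index)
        X = [1 if rng.random() < p else 0 for p in theta]  # Bernoulli X(n)
        for i in range(d):
            if M[i]:             # semi-bandit feedback: update chosen elements only
                t[i] += 1
                s[i] += X[i]
        total += sum(M[i] * X[i] for i in range(d))
    return total

decisions = [tuple(1 if i in c else 0 for i in range(4))
             for c in combinations(range(4), 2)]
reward = escb2(decisions, theta=[0.9, 0.8, 0.2, 0.1], T=300)
# the best arm {0, 1} yields 1.7 per round on average; ESCB should approach it
```

The argmax is done by enumeration here, which is only viable for small decision spaces; since the index is defined per arm rather than per element, this maximization is the computationally hard step for large M.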


Regret Analysis

Theorem

The regret under ESCB satisfies

R(T) ≤ (16 d √m / ∆_min) log(T) + O(log(log(T))).

Proof idea

c_M(n) ≥ b_M(n) ≥ µ_M with high probability.

Crucial concentration inequality (Magureanu et al., COLT 2014):

P[ max_{n ≤ T} ∑_{i=1}^d M_i t_i(n) kl(θ̂_i(n), θ_i) ≥ δ ] ≤ C_m (log(T) δ)^m e^{−δ}.

23 / 49



Regret Lower Bound

How far are we from the optimal algorithm?

Uniformly good algorithm π: Rπ(T) = O(log(T)) for all θ.

Notion of bad parameter: λ is bad if:

(i) it is statistically indistinguishable from the true parameter θ (in the sense of KL-divergence), i.e., the reward distribution of the optimal arm M⋆ is the same under θ and λ;
(ii) M⋆ is not optimal under λ.

Set of all bad parameters B(θ):

B(θ) = { λ ∈ [0,1]^d : λi = θi ∀i ∈ M⋆ (condition (i)), and max_{M∈M} M⊤λ > µ⋆ (condition (ii)) }.
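A small illustrative membership check for B(θ) (a hypothetical toy instance with brute force over a tiny M; the helper name is made up):

```python
from itertools import combinations

def is_bad(lam, theta, arms, m_star):
    # Conditions (i) and (ii) defining B(theta): lam agrees with theta on the
    # optimal arm, yet some arm strictly beats mu* under lam
    mu_star = sum(theta[i] for i in m_star)
    cond_i = all(abs(lam[i] - theta[i]) < 1e-12 for i in m_star)
    cond_ii = max(sum(lam[i] for i in A) for A in arms) > mu_star
    return cond_i and cond_ii

d, m = 4, 2
arms = list(combinations(range(d), m))      # all size-m subsets as arms
theta = [0.9, 0.8, 0.3, 0.2]
m_star = (0, 1)                             # optimal arm under theta
lam = [0.9, 0.8, 0.95, 0.2]                 # raises a suboptimal component only
```

Here lam raises only component 2, which M⋆ never samples, so it is indistinguishable from θ on M⋆ yet makes arm (0, 2) optimal.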

24 / 49



Regret Lower Bound

Theorem

For any uniformly good algorithm π, lim inf_{T→∞} Rπ(T)/log(T) ≥ c(θ), with

c(θ) = inf_{x ∈ R_+^{|M|}} ∑_{M∈M} ∆M xM

subject to: ∑_{i=1}^d kl(θi, λi) ∑_{M∈M} Mi xM ≥ 1, ∀λ ∈ B(θ).

The first problem-dependent tight LB.

Interpretation: each arm M must be sampled at least x⋆M log(T) times.

Proof idea: adaptive control of Markov chains with unknown transition probabilities (Graves & Lai, 1997).

25 / 49



Towards An Explicit LB

How does c(θ) scale with d, m?

Proposition

For most problems c(θ) = Ω(d−m).

Intuitive, since d − m basic actions are not sampled when playing M⋆.

Proof idea:
Construct a covering set H for the suboptimal basic actions.
Keep only the constraints for M ∈ H.

Definition

H is a covering set for basic actions if it is an (inclusion-wise) maximal subset of M \ {M⋆} such that for all distinct M, M′ ∈ H, we have (M \ M⋆) ∩ (M′ \ M⋆) = ∅.

26 / 49



Numerical Experiments

Matchings in Km,m

Parameter θ:

θi = a if i ∈ M⋆, and θi = b otherwise.

c(θ) = m(m−1)(a−b) / (2 kl(b, a))

matchings K5,5

a = 0.7, b = 0.5

[Figure: regret vs. time for LLR, CUCB, ESCB-1, ESCB-2, and Epoch-ESCB (left: linear scale; right: log scale, with the lower bound).]

27 / 49


Outline

1 Combinatorial MABs: Bernoulli Rewards

2 Stochastic Matroid Bandits

3 Adversarial Combinatorial MABs

4 Conclusion and Future Directions

28 / 49


Matroid

Combinatorial optimization over a matroid

Of particular interest in combinatorial optimization

Power of greedy solution

Matroid constraints arise in many applications

Cardinality constraints, partitioning constraints, coverage constraints

Definition

Given a finite set E and I ⊆ 2^E, the pair (E, I) is called a matroid if:

(i) If X ∈ I and Y ⊆ X, then Y ∈ I (closed under taking subsets).

(ii) If X, Y ∈ I with |X| > |Y|, then there is some element ℓ ∈ X \ Y such that Y ∪ {ℓ} ∈ I (augmentation property).

29 / 49



Matroid

E is the ground set, I is the set of independent sets.

Basis: any inclusion-wise maximal element of I.
Rank: the common cardinality of all bases.

Example: graphic matroid (for a graph G = (V, H)):

(H, I) with I = {F ⊆ H : (V, F) is a forest}.

A basis is a spanning forest of G.

30 / 49



Matroid Optimization

A weighted matroid is a triple (E, I, w), where w is a positive weight vector (wℓ is the weight of ℓ ∈ E).

Maximum-weight basis:

max_{X∈I} ∑_{ℓ∈X} wℓ

Can be solved greedily: at each step, add a new element of E with the largest weight such that the resulting set remains in I.
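For the graphic matroid, this greedy procedure is exactly Kruskal's algorithm with a union-find independence test; a short sketch (illustrative names, not the thesis code):

```python
def max_weight_basis(n_vertices, edges):
    # Greedy max-weight basis of the graphic matroid: adding an edge keeps a
    # forest iff its endpoints lie in different union-find components
    parent = list(range(n_vertices))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    basis = []
    for u, v, w in sorted(edges, key=lambda e: -e[2]):  # largest weight first
        ru, rv = find(u), find(v)
        if ru != rv:                       # independence oracle
            parent[ru] = rv
            basis.append((u, v, w))
    return basis

# triangle plus a pendant edge: the greedy basis is a max-weight spanning tree
edges = [(0, 1, 5.0), (1, 2, 4.0), (0, 2, 1.0), (2, 3, 2.0)]
tree = max_weight_basis(4, edges)
```

The lightest triangle edge (0, 2) closes a cycle and is rejected, leaving a spanning tree of total weight 11.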

31 / 49


Matroid Bandits

Weighted matroid G = (E, I, θ).
Set of basic actions ≡ ground set E of the matroid.

For each i, (Xi(n))_{n≥1} is an i.i.d. Bernoulli sequence with mean θi.

Each arm is a basis of G; M ≡ the set of bases of G.

Prior work:

Uniform matroids (Anantharam et al. 1985): Regret LB

Generic matroids (Kveton et al., 2014): OMM with regret O( (d/∆min) log(T) ).

32 / 49



Regret LB

Theorem

For all θ and every weighted matroid G = (E, I, θ), the regret of any uniformly good algorithm π satisfies

lim inf_{T→∞} Rπ(T)/log(T) ≥ c(θ) = ∑_{i∉M⋆} (θσ(i) − θi)/kl(θi, θσ(i)),

where for any i, σ(i) = arg min_{ℓ : (M⋆\{ℓ}) ∪ {i} ∈ I} θℓ.

Tight LB, the first explicit regret LB for matroid bandits.

Generalizes the LB of (Anantharam et al., 1985) to matroids.

Proof idea:
Specialization of the Graves-Lai result.
Choosing d − m box constraints in view of σ.
Lower bounding ∆M, M ∈ M, in terms of σ.
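As a sanity check, for the rank-m uniform matroid (arms = all size-m subsets), σ(i) is simply the m-th largest mean for every suboptimal i, so c(θ) can be evaluated directly (a sketch with illustrative names and numbers):

```python
import math

def kl(p, q, eps=1e-12):
    # Bernoulli KL divergence kl(p, q)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def c_uniform_matroid(theta, m):
    # c(theta) = sum over suboptimal i of (theta_sigma(i) - theta_i) / kl(theta_i, theta_sigma(i));
    # for the uniform matroid, exchanging i into M* always evicts its weakest element
    order = sorted(range(len(theta)), key=lambda i: -theta[i])
    best, rest = order[:m], order[m:]
    sigma = min(theta[i] for i in best)
    return sum((sigma - theta[i]) / kl(theta[i], sigma) for i in rest)

theta = [0.9, 0.8, 0.5, 0.4]
c = c_uniform_matroid(theta, m=2)
```

Each of the d − m suboptimal actions contributes one term, consistent with the Ω(d − m) scaling discussed earlier.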

33 / 49



KL-OSM Algorithm

KL-OSM (KL-based Optimal Sampling for Matroids)

Uses the KL-UCB index attached to each basic action i ∈ E:

ωi(n) = max{ q > θi(n) : ti(n) kl(θi(n), q) ≤ f(n) }

with f(n) = log(n) + 3 log(log(n)).

Relies on Greedy

Algorithm 2 KL-OSM

for n ≥ 1 do

Select M(n) ∈ arg max_{M∈M} ∑_{i∈M} ωi(n) using the Greedy algorithm.

Play M(n), observe the rewards, and update ti(n) and θi(n), ∀i ∈ M(n).

end for
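One KL-OSM round, sketched for the rank-m uniform matroid where Greedy reduces to keeping the m largest indices (illustrative numbers; ωi is found by bisection since kl(θi(n), ·) is increasing above θi(n)):

```python
import math

def kl(p, q, eps=1e-12):
    # Bernoulli KL divergence kl(p, q)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb(theta_hat, t, f_n, iters=60):
    # omega_i(n): largest q with t * kl(theta_hat, q) <= f(n)
    lo, hi = theta_hat, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if t * kl(theta_hat, mid) <= f_n:
            lo = mid
        else:
            hi = mid
    return lo

def kl_osm_round(theta_hat, t, n, m):
    # compute omega_i(n) and greedily keep the m largest (uniform matroid)
    f_n = math.log(n) + 3 * math.log(math.log(n))
    omega = [kl_ucb(th, ti, f_n) for th, ti in zip(theta_hat, t)]
    return sorted(range(len(omega)), key=lambda i: -omega[i])[:m]

M_n = kl_osm_round([0.9, 0.2, 0.6, 0.55], [50, 50, 50, 3], n=100, m=2)
```

Here the under-sampled action 3 receives an inflated index and is selected together with action 0, illustrating how the KL-UCB index forces exploration.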

34 / 49



KL-OSM Regret

Theorem

For any ε > 0, the regret under KL-OSM satisfies

R(T) ≤ (1 + ε) c(θ) log(T) + O(log(log(T)))

KL-OSM is asymptotically optimal:

lim sup_{T→∞} R(T)/log(T) ≤ c(θ)

The first optimal algorithm for matroid bandits

Runs in O(d log(d)T ) (in the independence oracle model)

35 / 49



Numerical Experiments: Spanning Trees

[Figure: regret vs. time for OMM and KL-OSM against the lower bound (left: linear scale; right: log scale).]

36 / 49


Outline

1 Combinatorial MABs: Bernoulli Rewards

2 Stochastic Matroid Bandits

3 Adversarial Combinatorial MABs

4 Conclusion and Future Directions

37 / 49


Adversarial Combinatorial MABs

Arms have the same cardinality m (but otherwise arbitrary)

Rewards X(n) ∈ [0, 1]d are arbitrary (oblivious adversary)

Bandit feedback: only M(n)>X(n) is observed at round n.

Regret

R(T) = max_{M∈M} E[ ∑_{n=1}^T M⊤X(n) ] − E[ ∑_{n=1}^T M(n)⊤X(n) ].

E[·] is w.r.t. the random seed of the algorithm.

38 / 49


CombEXP Algorithm

Inspired by the OSMD algorithm (Audibert et al., 2013):

max_{M∈M} M⊤X = max_{α∈conv(M)} α⊤X.

Maintain a distribution q = α/m over the basic actions {1, . . . , d}; q induces a distribution p over arms M.

Sample M from p, play it, and receive bandit feedback.

Update q (creating q̃) based on the feedback.

Project α̃ = m q̃ onto conv(M).

Introduce exploration.

39 / 49



CombEXP Algorithm

Algorithm 3 CombEXP

Initialization: Set q0 = µ0 (the uniform distribution over [d]); γ, η ∝ 1/√T.

for n ≥ 1 do

Mixing: Let q′_{n−1} = (1 − γ) q_{n−1} + γ µ0.

Decomposition: Select a distribution p_{n−1} over arms M such that ∑_M p_{n−1}(M) M = m q′_{n−1}.

Sampling: Select M(n) ∼ p_{n−1} and receive the reward Yn = M(n)⊤X(n).

Estimation: Let Σ_{n−1} = E_{M∼p_{n−1}}[MM⊤]. Set X̃(n) = Yn Σ⁺_{n−1} M(n).

Update: Set q̃n(i) ∝ q_{n−1}(i) e^{η X̃i(n)}, ∀i ∈ [d].

Projection: Set q_n = arg min_{p ∈ conv(M)} KL(p/m, q̃n).

end for
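As a degenerate sanity check (a sketch, not the thesis implementation): with m = 1 and M the standard basis vectors, conv(M) is the simplex, the decomposition is trivial (p_{n−1} = q′_{n−1}), Σ_{n−1} = diag(p_{n−1}) so the estimator reduces to importance weighting, and the projection is the identity; CombEXP then behaves like an EXP3-style rule:

```python
import math, random

def combexp_m1(reward_rows, seed=0):
    # CombEXP with m = 1 and arms = standard basis vectors e_1, ..., e_d
    rng = random.Random(seed)
    d = len(reward_rows[0])
    T = len(reward_rows)
    gamma = eta = min(1.0, 1.0 / math.sqrt(T))
    q = [1.0 / d] * d
    total = 0.0
    for n in range(T):
        p = [(1 - gamma) * qi + gamma / d for qi in q]     # mixing
        i = rng.choices(range(d), weights=p)[0]            # sampling
        y = reward_rows[n][i]                              # bandit feedback Y_n
        total += y
        x_hat = [0.0] * d
        x_hat[i] = y / p[i]                                # X~ = Y * Sigma^+ M(n)
        w = [qi * math.exp(eta * xh) for qi, xh in zip(q, x_hat)]
        s = sum(w)
        q = [wi / s for wi in w]                           # update; projection is trivial here
    return total

# oblivious adversary: arm 0 always pays 1, the others pay 0
rows = [[1.0] + [0.0] * 3 for _ in range(5000)]
reward = combexp_m1(rows)
```

After a short transient the distribution concentrates on arm 0, so the cumulative reward stays close to the best fixed arm's T = 5000.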

40 / 49


CombEXP: Regret

Theorem

R_CombEXP(T) ≤ 2 √( m³ T ( d + m^{1/2}/λmin ) log µmin⁻¹ ) + O(1),

where λmin is the smallest nonzero eigenvalue of E[MM⊤] when M is uniformly distributed over M, and

µmin = min_i (1/|M|) ∑_{M∈M} Mi.

For most problems, λmin = Ω(m/d) and µmin⁻¹ = O(poly(d/m)), so that

R(T) ∼ √( m³ d T log(d/m) ).

41 / 49



CombEXP with Approximate Projection

Exact projection with finitely many operations may be impossible ⇒ CombEXP with approximate projection.

Proposition

Assume that the projection step of CombEXP is solved up to accuracy O( 1/(n² log³(n)) ), ∀n ≥ 1. Then

R(T) ≤ 2 √( 2m³ T ( d + m^{1/2}/λmin ) log µmin⁻¹ ) + O(1).

The same regret scaling as for exact projection.

Proof idea: strong convexity of the KL divergence w.r.t. ‖·‖₁ + properties of the KL projection.
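For fixed-size subsets the projection has a simple closed structure: conv(M)/m is the capped simplex {q : ∑ qi = 1, 0 ≤ qi ≤ 1/m}, and the KKT conditions of the KL projection give qi = min(c q̃i, 1/m) with c found by bisection (a sketch assuming this particular decision set; names are illustrative):

```python
def kl_project_capped_simplex(q_tilde, m, iters=60):
    # KL projection of q_tilde onto {q : sum(q) = 1, 0 <= q_i <= 1/m},
    # i.e. conv(M)/m when M is the set of size-m subsets of [d].
    # KKT: q_i = min(c * q_tilde_i, 1/m); find c by bisection on the total mass.
    cap = 1.0 / m
    def mass(c):
        return sum(min(c * qt, cap) for qt in q_tilde)
    lo = 0.0
    hi = m / min(qt for qt in q_tilde if qt > 0)   # large enough that mass(hi) >= 1
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mass(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    c = (lo + hi) / 2
    return [min(c * qt, cap) for qt in q_tilde]

q = kl_project_capped_simplex([0.7, 0.1, 0.1, 0.1], m=2)
```

The over-weighted first coordinate is clipped at the cap 1/m and the remaining mass is spread proportionally, which is exactly the water-filling shape the KKT conditions predict.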

42 / 49



CombEXP: Complexity

Theorem

Let c = the number of equality constraints describing conv(M), and s = the number of inequality constraints.

Then, if the projection step of CombEXP is solved up to accuracy O(n⁻² log⁻³(n)), ∀n ≥ 1, CombEXP after T rounds has time complexity

O( T [ √s (c + d)³ log(T) + d⁴ ] ).

Box inequality constraints: O( T [ c² √s (c + d) log(T) + d⁴ ] ).

Proof idea:
Constructive proof of Carathéodory's theorem for the decomposition step.
Barrier method for the projection step.

43 / 49



Prior Work

Algorithm | Regret (symmetric problems)
Lower bound (Audibert et al., 2013) | Ω(m√(dT)), if d ≥ 2m
ComBand (Cesa-Bianchi & Lugosi, 2012) | O(√(m³ d T log(d/m)))
CombEXP | O(√(m³ d T log(d/m)))

Both ComBand and CombEXP are off the LB by a factor √(m log(d/m)).

ComBand relies on (approximate) sampling from M, whereas CombEXP performs convex optimization over conv(M).

44 / 49



Complexity Example: Matchings

Matchings in Km,m:

conv(M) is the set of all doubly stochastic m×m matrices (the Birkhoff polytope):

conv(M) = { Z ∈ R₊^{m×m} : ∑_{k=1}^m zik = 1 ∀i, ∑_{k=1}^m zkj = 1 ∀j }.

c = 2m and s = m² (box constraints).

Complexity of CombEXP: O(m⁵ T log(T)).

Complexity of ComBand: O(m¹⁰ F(T)) for some super-linear function F(T) (it needs to approximate a matrix permanent at each round).

45 / 49



Outline

1 Combinatorial MABs: Bernoulli Rewards

2 Stochastic Matroid Bandits

3 Adversarial Combinatorial MABs

4 Conclusion and Future Directions

46 / 49


Conclusion

Stochastic combinatorial MABs:
The first regret LB.
ESCB: best performance in terms of regret.

Stochastic matroid bandits:
The first explicit regret LB.
KL-OSM: the first optimal algorithm.

Adversarial combinatorial MABs:
CombEXP: the same regret as the state of the art, but with lower computational complexity.

More in the thesis!

47 / 49



Future Directions: Stochastic

Improvement to the proposed algorithms

Tighter regret analysis of ESCB-1 (order-optimality conjecture)
Can we amortize index computation?

Analysis of Thompson Sampling for stochastic combinatorialMABs

Stochastic combinatorial MABs under bandit feedback

Projection-free optimal algorithm for bandit and semi-bandit feedback

48 / 49



Publications

Combinatorial bandits revisited, with R. Combes, A. Proutiere, and M. Lelarge (NIPS 2015)

An optimal algorithm for stochastic matroid bandit optimization, with A. Proutiere (AAMAS 2016)

Spectrum bandit optimization, with M. Lelarge and A. Proutiere (ITW 2013)

Stochastic online shortest path routing: The value of feedback, with Z. Zou, R. Combes, A. Proutiere, and M. Johansson (submitted to IEEE TAC)

Thanks for your attention!

49 / 49
