Counterfactual Risk Minimization: Learning from logged bandit feedback
Adith Swaminathan, Thorsten Joachims
Software: http://www.cs.cornell.edu/~adith/poem/
Learning frameworks

                  Online           Batch
Full Information  Perceptron, …    SVM, …
Bandit Feedback   LinUCB, …        ?
Logged bandit feedback is everywhere!
[Figure: a news-recommendation interaction; Alice is shown a 'SpaceX launch' article and reacts 'Cool!']
• xi: Query, yi: Prediction, δi: Feedback
• Example: Alice wants tech news; x1 = (Alice, tech), y1 = (SpaceX), δ1 = 1
Goal
• Risk of h: X → Y:  R(h) = E_x[ δ(x, h(x)) ]
• Find h* ∈ H with minimum risk
• Can we find h* using data D collected from h0?
x1 = (Alice, tech), y1 = (SpaceX), δ1 = 1
x2 = (Alice, sports), y2 = (F1), δ2 = 5
Learning by replaying logs?
• Training/evaluation from logged data is counterfactual [Bottou et al]
x1 = (Alice, tech), y1 = (SpaceX), δ1 = 1
x2 = (Alice, sports), y2 = (F1), δ2 = 5
Replay x1 = (Alice, tech) with a different prediction y1′ = (Apple watch): what would Alice do?
Stochastic policies to the rescue!
• x ∼ Pr(X); however, y = h0(x), not h(x)
• Stochastic h: x ↦ a distribution over Y, y ∼ h(x);  R(h) = E_x E_{y∼h(x)}[ δ(x, y) ]
[Figure: h0 and h place different probability mass (likely vs. unlikely) on articles such as 'SpaceX launch']
Counterfactual risk estimators
• D = { (x1, y1, δ1, p1), (x2, y2, δ2, p2), …, (xn, yn, δn, pn) }
• pi = h0(yi | xi) … propensity [Rosenbaum et al]
• IPS estimator:  R̂(h) = (1/n) Σ_{i=1}^{n} δi · h(yi | xi) / pi

Basic Importance Sampling [Owen]:
E_x E_{y∼h}[ δ(x, y) ] = E_x E_{y∼h0}[ δ(x, y) · h(y|x) / h0(y|x) ]
(performance of the new system, estimated from samples of the old system, via importance weights)
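The IPS estimator above is a one-liner in practice. A minimal sketch in numpy, on hypothetical logged data (the arrays and values below are made up for illustration):

```python
import numpy as np

def ips_estimate(deltas, propensities, new_probs):
    """IPS estimate R_hat(h): average of delta_i * h(y_i|x_i) / p_i."""
    weights = new_probs / propensities      # importance weights h / h0
    return float(np.mean(deltas * weights))

# Hypothetical log: feedback delta_i and propensities p_i = h0(y_i|x_i),
# plus the new policy's probabilities h(y_i|x_i) for the logged actions.
deltas = np.array([1.0, 5.0, 2.0, 1.0])
p = np.array([0.9, 0.9, 0.9, 0.9])
h = np.array([0.9, 0.9, 0.9, 0.9])

# Sanity check: when h == h0, the estimate is the plain average feedback.
print(ips_estimate(deltas, p, h))  # 2.25
```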
Story so far
Logged bandit data → Fix h vs. h0 mismatch → Control variance → Tractable bound
Importance sampling causes non-uniform variance!
x1 = (Alice, sports), y1 = (F1), δ1 = 5, p1 = 0.9
x2 = (Alice, tech), y2 = (SpaceX), δ2 = 1, p2 = 0.9
x3 = (Alice, movies), y3 = (Star Wars), δ3 = 2, p3 = 0.9
x4 = (Alice, tech), y4 = (Tesla), δ4 = 1, p4 = 0.9

R̂(h1) = 1    R̂(h2) = 1.33
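The non-uniformity is easy to see numerically: on the same log, two candidate policies can have comparable estimated risk while their importance-weighted losses have very different spread, so the estimates are not equally reliable. A small sketch (numpy; h1 and h2 are made-up probabilities that the new policies assign to the logged actions):

```python
import numpy as np

def weighted_losses(deltas, propensities, new_probs):
    """Per-example terms u_i = delta_i * h(y_i|x_i) / p_i of the IPS estimate."""
    return deltas * new_probs / propensities

deltas = np.array([5.0, 1.0, 2.0, 1.0])  # feedback from the log above
p = np.full(4, 0.9)                       # logging propensities

u1 = weighted_losses(deltas, p, np.full(4, 0.45))                    # h1: uniform overlap with h0
u2 = weighted_losses(deltas, p, np.array([0.05, 0.85, 0.85, 0.85]))  # h2: avoids the high-loss action

print(round(u1.mean(), 3), round(u1.var(), 3))  # 1.125 0.672
print(round(u2.mean(), 3), round(u2.var(), 3))  # 1.014 0.329
```

Similar means, but very different variances: an error bound should charge h1 for the extra spread.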
Want: Error bound that captures variance of importance sampling
Counterfactual Risk Minimization
• W.h.p. over D ∼ h0, for all h ∈ H:
R(h) ≤ R̂(h) + O( √( Var(h) / n ) ) + O( C(H) / n )
(empirical risk + variance regularization + capacity control)
*conditions apply; refer to [Maurer et al]
Learning objective
h_CRM = argmin_{h ∈ H} { R̂(h) + λ √( Var(h) / n ) }
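As a concrete sketch, this objective can be evaluated directly from a log of (δi, pi) pairs and the new policy's probabilities h(yi|xi); the numbers below are made up and λ is arbitrary (numpy):

```python
import numpy as np

def crm_objective(deltas, propensities, new_probs, lam=0.5):
    """Empirical IPS risk plus a sample-variance penalty (the CRM objective)."""
    u = deltas * new_probs / propensities       # importance-weighted losses
    n = len(u)
    return u.mean() + lam * np.sqrt(u.var(ddof=1) / n)

# Two logged examples: feedback, logging propensity, new-policy probability.
obj = crm_objective(np.array([1.0, 5.0]),
                    np.array([0.5, 0.5]),
                    np.array([0.5, 0.5]))
print(obj)  # 4.0: empirical risk 3.0 plus 0.5 * sqrt(8 / 2)
```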
POEM: CRM algorithm for structured prediction
• CRFs: h_w ∈ H_lin, with h_w(y | x) = exp( w · φ(x, y) ) / Z(x; w)
• Policy Optimizer for Exponential Models:
w* = argmin_w  (1/n) Σ_{i=1}^{n} δi h_w(yi | xi) / pi  +  λ √( Var(u_w) / n )  +  μ ‖w‖²
(where u_w^i = δi h_w(yi | xi) / pi)
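The exponential-model policy h_w(y|x) is a softmax over joint features. A minimal illustration with a made-up feature map φ(x, y) (a sketch, not POEM's actual code):

```python
import numpy as np

def h_w(w, phi_candidates):
    """Exponential-model policy: h_w(y|x) = exp(w . phi(x, y)) / Z(x; w),
    computed over a finite set of candidate outputs y."""
    scores = phi_candidates @ w          # w . phi(x, y) for each candidate
    scores -= scores.max()               # stabilize the exponentials
    probs = np.exp(scores)
    return probs / probs.sum()           # normalize by Z(x; w)

# Made-up joint features phi(x, y) for 3 candidate outputs of one x.
phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(h_w(np.zeros(2), phi))             # w = 0 gives the uniform policy
```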
• Good: gradient descent; searches over infinitely many w
• Bad: not convex in w
• Ugly: resists stochastic optimization
Stochastically optimize √Var(u_w)?
• Taylor-approximate!  √Var(u_w) ≤ A_{w_t} Σ_{i=1}^{n} u_w^i + B_{w_t} Σ_{i=1}^{n} (u_w^i)² + C_{w_t}
• During an epoch: Adagrad with gradients ∇u_w^i + λ √n ( A_{w_t} ∇u_w^i + 2 B_{w_t} u_w^i ∇u_w^i )
• After each epoch: w_{t+1} ← w; recompute A_{w_{t+1}}, B_{w_{t+1}}
[Figure: the Taylor bound is refreshed at each iterate, moving from w_t to w_{t+1}]
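The Taylor step works because the square root is concave, so its tangent at the previous iterate upper-bounds it; the coefficients A_{w_t} and B_{w_t} come from expanding that tangent around the variance at w_t. A minimal numeric check of just the tangent-majorization fact (not the POEM implementation):

```python
import numpy as np

def sqrt_tangent(z, z0):
    """Tangent of sqrt at z0: sqrt(z0) + (z - z0) / (2 * sqrt(z0)).
    Since sqrt is concave, this tangent upper-bounds sqrt everywhere."""
    return np.sqrt(z0) + (z - z0) / (2.0 * np.sqrt(z0))

z = np.linspace(0.1, 10.0, 50)
# Majorization check at a made-up expansion point z0 = 4.
assert np.all(sqrt_tangent(z, 4.0) >= np.sqrt(z))
print("tangent majorizes sqrt")
```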
Experiment
• Supervised → Bandit conversion for multi-label classification [Agarwal et al]
• δ(x, y) = Hamming( y*(x), y )   (smaller is better)
• LibSVM datasets
  • Scene (few features, labels and data)
  • Yeast (many labels)
  • LYRL (many features and data)
  • TMC (many features, labels and data)
• Validate hyper-params (λ, μ) using the counterfactual estimate R̂_val(h)
• Report expected Hamming loss on the supervised test set
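For concreteness, the Hamming loss used above just counts disagreeing label positions; a tiny sketch (the label vectors are made up):

```python
import numpy as np

def hamming(y_star, y):
    """Count of label positions where prediction y disagrees with y*."""
    return int(np.sum(np.asarray(y_star) != np.asarray(y)))

# Made-up 4-label example: two of the four labels are wrong.
print(hamming([1, 0, 1, 0], [1, 1, 1, 1]))  # 2
```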
Approaches
• Baselines
  • h0: supervised CRF trained on 5% of the training data
• Proposed
  • IPS (no variance penalty; extends [Bottou et al])
  • POEM
• Skylines
  • Supervised CRF (independent logistic regressions)
(1) Does variance regularization help?
Test-set expected Hamming loss (δ; smaller is better):

                Scene   Yeast   LYRL    TMC
h0              1.543   5.547   1.463   3.445
IPS             1.519   4.614   1.118   3.023
POEM            1.143   4.517   0.996   2.522
Supervised CRF  0.659   2.822   0.222   1.189
(2) Is it efficient?
Avg Time (s) Scene Yeast LYRL TMC
POEM(B) 75.20 94.16 561.12 949.95
POEM(S) 4.71 5.02 120.09 276.13
CRF 4.86 3.28 62.93 99.18
• POEM recovers the same performance at a fraction of the L-BFGS cost
• Scales like a supervised CRF, but learns from bandit feedback
(3) Does generalization improve as n → ∞?
[Line plot: test-set expected Hamming loss (δ) vs. replay count (1-256) for h0, POEM, and the supervised CRF]
(4) Does stochasticity of h0 affect learning?
[Line plot: test-set expected Hamming loss (δ) vs. h0 parameter multiplier (0.5-32, stochastic → deterministic) for h0 and POEM]
Conclusion
• CRM principle to learn from logged bandit feedback
  • Variance regularization
• POEM for structured output prediction
  • Scales like a supervised CRF, learns from bandit feedback
• Contact: adith@cs.cornell.edu
• POEM available at http://www.cs.cornell.edu/~adith/poem/
• Long paper: Counterfactual risk minimization: Learning from logged bandit feedback, http://jmlr.org/proceedings/papers/v37/swaminathan15.html
• Thanks!
References
1. Art B. Owen. 2013. Monte Carlo theory, methods and examples.
2. Paul R. Rosenbaum and Donald B. Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70. 41-55.
3. Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. 2013. Counterfactual reasoning and learning systems: the example of computational advertising. J. Mach. Learn. Res. 14, 1, 3207-3260.
4. Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li and Robert Schapire. 2014. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. Proceedings of the 31st International Conference on Machine Learning. 1638-1646.
5. Andreas Maurer and Massimiliano Pontil. 2009. Empirical Bernstein bounds and sample-variance penalization. Proceedings of the 22nd Conference on Learning Theory.
6. Adith Swaminathan and Thorsten Joachims. 2015. Counterfactual risk minimization: Learning from logged bandit feedback. Proceedings of the 32nd International Conference on Machine Learning.