Counterfactual Risk Minimization: Learning from logged bandit feedback
Adith Swaminathan, Thorsten Joachims
Software: http://www.cs.cornell.edu/~adith/poem/
Learning frameworks

                  Online           Batch
Full Information  Perceptron, …    SVM, …
Bandit Feedback   LinUCB, …        ?
Logged bandit feedback is everywhere!
[Figure: a news-recommendation interaction; Alice is shown a 'SpaceX launch' article and reacts 'Cool!']
• xi: Query, yi: Prediction, δi: Feedback
• Example: Alice wants tech news; x1 = (Alice, tech), y1 = (SpaceX), δ1 = 1
Goal
• Risk of h: X → Y:  R(h) = E_x[ δ(x, h(x)) ]
• Find h* ∈ H with minimum risk
• Can we find h* using data D collected from h0?
x1 = (Alice, tech), y1 = (SpaceX), δ1 = 1
x2 = (Alice, sports), y2 = (F1), δ2 = 5
Learning by replaying logs?
• Training/evaluation from logged data is counterfactual [Bottou et al]
x1 = (Alice, tech), y1 = (SpaceX), δ1 = 1
x2 = (Alice, sports), y2 = (F1), δ2 = 5
Replay x1 = (Alice, tech) with a different prediction y1′ = (Apple watch): what would Alice do?
Stochastic policies to the rescue!
• x ∼ Pr(X); however, y = h0(x), not h(x)
• Stochastic h: x ↦ a distribution over Y, y ∼ h(x);  R(h) = E_x E_{y∼h(x)}[ δ(x, y) ]
[Figure: h0 and h place different probability mass (likely vs. unlikely) on articles such as 'SpaceX launch']
Counterfactual risk estimators
• D = { (x1, y1, δ1, p1), (x2, y2, δ2, p2), …, (xn, yn, δn, pn) }
• pi = h0(yi | xi) … propensity [Rosenbaum et al]
• IPS estimator:  R̂(h) = (1/n) Σ_{i=1}^{n} δi · h(yi | xi) / pi

Basic Importance Sampling [Owen]:
E_x E_{y∼h}[ δ(x, y) ] = E_x E_{y∼h0}[ δ(x, y) · h(y|x) / h0(y|x) ]
(performance of the new system, estimated from samples of the old system, via importance weights)
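The IPS estimator above is a one-liner in practice. A minimal sketch in numpy, on hypothetical logged data (the arrays and values below are made up for illustration):

```python
import numpy as np

def ips_estimate(deltas, propensities, new_probs):
    """IPS estimate R_hat(h): average of delta_i * h(y_i|x_i) / p_i."""
    weights = new_probs / propensities      # importance weights h / h0
    return float(np.mean(deltas * weights))

# Hypothetical log: feedback delta_i and propensities p_i = h0(y_i|x_i),
# plus the new policy's probabilities h(y_i|x_i) for the logged actions.
deltas = np.array([1.0, 5.0, 2.0, 1.0])
p = np.array([0.9, 0.9, 0.9, 0.9])
h = np.array([0.9, 0.9, 0.9, 0.9])

# Sanity check: when h == h0, the estimate is the plain average feedback.
print(ips_estimate(deltas, p, h))  # 2.25
```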
Story so far
Logged bandit data → Fix h vs. h0 mismatch → Control variance → Tractable bound
Importance sampling causes non-uniform variance!
x1 = (Alice, sports), y1 = (F1), δ1 = 5, p1 = 0.9
x2 = (Alice, tech), y2 = (SpaceX), δ2 = 1, p2 = 0.9
x3 = (Alice, movies), y3 = (Star Wars), δ3 = 2, p3 = 0.9
x4 = (Alice, tech), y4 = (Tesla), δ4 = 1, p4 = 0.9

R̂(h1) = 1    R̂(h2) = 1.33
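The non-uniformity is easy to see numerically: on the same log, two candidate policies can have comparable estimated risk while their importance-weighted losses have very different spread, so the estimates are not equally reliable. A small sketch (numpy; h1 and h2 are made-up probabilities that the new policies assign to the logged actions):

```python
import numpy as np

def weighted_losses(deltas, propensities, new_probs):
    """Per-example terms u_i = delta_i * h(y_i|x_i) / p_i of the IPS estimate."""
    return deltas * new_probs / propensities

deltas = np.array([5.0, 1.0, 2.0, 1.0])  # feedback from the log above
p = np.full(4, 0.9)                       # logging propensities

u1 = weighted_losses(deltas, p, np.full(4, 0.45))                    # h1: uniform overlap with h0
u2 = weighted_losses(deltas, p, np.array([0.05, 0.85, 0.85, 0.85]))  # h2: avoids the high-loss action

print(round(u1.mean(), 3), round(u1.var(), 3))  # 1.125 0.672
print(round(u2.mean(), 3), round(u2.var(), 3))  # 1.014 0.329
```

Similar means, but very different variances: an error bound should charge h1 for the extra spread.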
Want: Error bound that captures variance of importance sampling
Counterfactual Risk Minimization
• W.h.p. over D ∼ h0, for all h ∈ H:
R(h) ≤ R̂(h) + O( √( Var(h) / n ) ) + O( C(H) / n )
(empirical risk + variance regularization + capacity control)
*conditions apply; refer to [Maurer et al]
Learning objective
h_CRM = argmin_{h ∈ H} { R̂(h) + λ √( Var(h) / n ) }
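As a concrete sketch, this objective can be evaluated directly from a log of (δi, pi) pairs and the new policy's probabilities h(yi|xi); the numbers below are made up and λ is arbitrary (numpy):

```python
import numpy as np

def crm_objective(deltas, propensities, new_probs, lam=0.5):
    """Empirical IPS risk plus a sample-variance penalty (the CRM objective)."""
    u = deltas * new_probs / propensities       # importance-weighted losses
    n = len(u)
    return u.mean() + lam * np.sqrt(u.var(ddof=1) / n)

# Two logged examples: feedback, logging propensity, new-policy probability.
obj = crm_objective(np.array([1.0, 5.0]),
                    np.array([0.5, 0.5]),
                    np.array([0.5, 0.5]))
print(obj)  # 4.0: empirical risk 3.0 plus 0.5 * sqrt(8 / 2)
```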
POEM: CRM algorithm for structured prediction
• CRFs: h_w ∈ H_lin, with h_w(y | x) = exp( w · φ(x, y) ) / Z(x; w)
• Policy Optimizer for Exponential Models:
w* = argmin_w  (1/n) Σ_{i=1}^{n} δi h_w(yi | xi) / pi  +  λ √( Var(u_w) / n )  +  μ ‖w‖²
(where u_w^i = δi h_w(yi | xi) / pi)
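The exponential-model policy h_w(y|x) is a softmax over joint features. A minimal illustration with a made-up feature map φ(x, y) (a sketch, not POEM's actual code):

```python
import numpy as np

def h_w(w, phi_candidates):
    """Exponential-model policy: h_w(y|x) = exp(w . phi(x, y)) / Z(x; w),
    computed over a finite set of candidate outputs y."""
    scores = phi_candidates @ w          # w . phi(x, y) for each candidate
    scores -= scores.max()               # stabilize the exponentials
    probs = np.exp(scores)
    return probs / probs.sum()           # normalize by Z(x; w)

# Made-up joint features phi(x, y) for 3 candidate outputs of one x.
phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(h_w(np.zeros(2), phi))             # w = 0 gives the uniform policy
```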
• Good: gradient descent; searches over infinitely many w
• Bad: not convex in w
• Ugly: resists stochastic optimization
Stochastically optimize √Var(u_w)?
• Taylor-approximate!  √Var(u_w) ≤ A_{w_t} Σ_{i=1}^{n} u_w^i + B_{w_t} Σ_{i=1}^{n} (u_w^i)² + C_{w_t}
• During an epoch: Adagrad with gradients ∇u_w^i + λ √n ( A_{w_t} ∇u_w^i + 2 B_{w_t} u_w^i ∇u_w^i )
• After each epoch: w_{t+1} ← w; recompute A_{w_{t+1}}, B_{w_{t+1}}
[Figure: the Taylor bound is refreshed at each iterate, moving from w_t to w_{t+1}]
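The Taylor step works because the square root is concave, so its tangent at the previous iterate upper-bounds it; the coefficients A_{w_t} and B_{w_t} come from expanding that tangent around the variance at w_t. A minimal numeric check of just the tangent-majorization fact (not the POEM implementation):

```python
import numpy as np

def sqrt_tangent(z, z0):
    """Tangent of sqrt at z0: sqrt(z0) + (z - z0) / (2 * sqrt(z0)).
    Since sqrt is concave, this tangent upper-bounds sqrt everywhere."""
    return np.sqrt(z0) + (z - z0) / (2.0 * np.sqrt(z0))

z = np.linspace(0.1, 10.0, 50)
# Majorization check at a made-up expansion point z0 = 4.
assert np.all(sqrt_tangent(z, 4.0) >= np.sqrt(z))
print("tangent majorizes sqrt")
```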
Experiment
• Supervised → Bandit conversion for multi-label classification [Agarwal et al]
• δ(x, y) = Hamming( y*(x), y )   (smaller is better)
• LibSVM datasets
  • Scene (few features, labels and data)
  • Yeast (many labels)
  • LYRL (many features and data)
  • TMC (many features, labels and data)
• Validate hyper-params (λ, μ) using the counterfactual estimate R̂_val(h)
• Report expected Hamming loss on the supervised test set
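For concreteness, the Hamming loss used above just counts disagreeing label positions; a tiny sketch (the label vectors are made up):

```python
import numpy as np

def hamming(y_star, y):
    """Count of label positions where prediction y disagrees with y*."""
    return int(np.sum(np.asarray(y_star) != np.asarray(y)))

# Made-up 4-label example: two of the four labels are wrong.
print(hamming([1, 0, 1, 0], [1, 1, 1, 1]))  # 2
```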
Approaches
• Baselines
  • h0: supervised CRF trained on 5% of the training data
• Proposed
  • IPS (no variance penalty; extends [Bottou et al])
  • POEM
• Skylines
  • Supervised CRF (independent logistic regressions)
(1) Does variance regularization help?
Test-set expected Hamming loss (δ; smaller is better):

                Scene   Yeast   LYRL    TMC
h0              1.543   5.547   1.463   3.445
IPS             1.519   4.614   1.118   3.023
POEM            1.143   4.517   0.996   2.522
Supervised CRF  0.659   2.822   0.222   1.189
(2) Is it efficient?
Avg Time (s) Scene Yeast LYRL TMC
POEM(B) 75.20 94.16 561.12 949.95
POEM(S) 4.71 5.02 120.09 276.13
CRF 4.86 3.28 62.93 99.18
• POEM recovers the same performance at a fraction of the L-BFGS cost
• Scales like a supervised CRF, but learns from bandit feedback
(3) Does generalization improve as n → ∞?
[Line plot: test-set expected Hamming loss (δ) vs. replay count (1-256) for h0, POEM, and the supervised CRF]
(4) Does stochasticity of h0 affect learning?
[Line plot: test-set expected Hamming loss (δ) vs. h0 parameter multiplier (0.5-32, stochastic → deterministic) for h0 and POEM]
Conclusion
• CRM principle to learn from logged bandit feedback
  • Variance regularization
• POEM for structured output prediction
  • Scales like a supervised CRF, learns from bandit feedback
• Contact: adith@cs.cornell.edu
• POEM available at http://www.cs.cornell.edu/~adith/poem/
• Long paper: Counterfactual risk minimization: Learning from logged bandit feedback, http://jmlr.org/proceedings/papers/v37/swaminathan15.html
• Thanks!
References
1. Art B. Owen. 2013. Monte Carlo theory, methods and examples.
2. Paul R. Rosenbaum and Donald B. Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70. 41-55.
3. Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. 2013. Counterfactual reasoning and learning systems: the example of computational advertising. J. Mach. Learn. Res. 14, 1, 3207-3260.
4. Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li and Robert Schapire. 2014. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. Proceedings of the 31st International Conference on Machine Learning. 1638-1646.
5. Andreas Maurer and Massimiliano Pontil. 2009. Empirical Bernstein bounds and sample-variance penalization. Proceedings of the 22nd Conference on Learning Theory.
6. Adith Swaminathan and Thorsten Joachims. 2015. Counterfactual risk minimization: Learning from logged bandit feedback. Proceedings of the 32nd International Conference on Machine Learning.