
Transcript of: Counterfactual Model for Online Systems

Page 1


Counterfactual Model for Online Systems

CS 7792 - Fall 2016

Thorsten Joachims

Department of Computer Science & Department of Information Science

Cornell University

Imbens & Rubin, Causal Inference for Statistics, Social, and Biomedical Sciences, 2015. Chapters 1, 3, 12.

Interactive System Schematic

[Schematic: the system π₀ receives a context x, chooses action y for x, and accrues utility U(π₀).]

News Recommender
• Context x: User
• Action y: Portfolio of news articles
• Feedback δ(x, y): Reading time in minutes

Ad Placement
• Context x: User and page
• Action y: Ad that is placed
• Feedback δ(x, y): Click / no-click

Search Engine
• Context x: Query
• Action y: Ranking
• Feedback δ(x, y): Win/loss against a baseline in interleaving

Log Data from Interactive Systems
• Data: S = ((x₁, y₁, δ₁), …, (xₙ, yₙ, δₙ))
→ partial-information (aka "contextual bandit") feedback
• Properties
– Contexts xᵢ drawn i.i.d. from an unknown P(X)
– Actions yᵢ selected by the existing system π₀: X → Y
– Feedback δᵢ from an unknown function δ: X × Y → ℝ
[Schematic: context → π₀ → action → reward / loss]

[Zadrozny et al., 2003] [Langford & Li] [Bottou et al., 2014]
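To make the logging setup concrete, here is a minimal simulation sketch (not from the lecture; the policy, reward table, and all names are hypothetical). For each interaction it records the context, the chosen action, the bandit feedback, and the propensity with which π₀ chose that action:

```python
import numpy as np

rng = np.random.default_rng(0)
n_contexts, n_actions = 5, 3

# Hypothetical ground truth: expected reward delta(x, y) for every (context, action) pair.
true_delta = rng.uniform(0.0, 1.0, (n_contexts, n_actions))

def pi0(x):
    """Stochastic logging policy pi0(Y|x): a softmax over base scores, rotated by context."""
    logits = np.roll(np.array([1.0, 0.0, -1.0]), x % n_actions)
    p = np.exp(logits - logits.max())
    return p / p.sum()

def collect_log(n):
    """Collect S = ((x_1, y_1, delta_1), ..., (x_n, y_n, delta_n)) under pi0."""
    log = []
    for _ in range(n):
        x = rng.integers(n_contexts)               # context x ~ P(X), i.i.d.
        probs = pi0(x)
        y = rng.choice(n_actions, p=probs)         # action y ~ pi0(Y|x)
        d = rng.binomial(1, true_delta[x, y])      # bandit feedback: only delta(x, y) is observed
        log.append((x, y, d, probs[y]))            # record the propensity probs[y] as well
    return log

S = collect_log(10_000)
print(S[:3])
```

Logging the propensity alongside (x, y, δ) is what later makes inverse propensity scoring possible.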

Page 2


Goal

• Use interaction log data S = ((x₁, y₁, δ₁), …, (xₙ, yₙ, δₙ)) for the evaluation of a system π:
– Estimate online measures of some system π offline.
– System π can be different from the π₀ that generated the log.

Evaluation: Outline
• Evaluating Online Metrics Offline
– A/B testing (on-policy)
– Counterfactual estimation from logs (off-policy)
• Approach 1: "Model the world"
– Estimation via reward prediction
• Approach 2: "Model the bias"
– Counterfactual model
– Inverse propensity scoring (IPS) estimator

Online Performance Metrics
• Example metrics: CTR, revenue, time-to-success, interleaving, etc.
• The correct choice depends on the application and is not the focus of this lecture.
• This lecture: the metric is encoded as δ(x, y) [click / payoff / time for the (x, y) pair].

System
• Definition [Deterministic Policy]: a function y = π(x) that picks action y for context x.
• Definition [Stochastic Policy]: a distribution π(y|x) that samples action y given context x.
[Figure: two stochastic policies π₁(Y|x) and π₂(Y|x) as distributions over Y given x; deterministic policies π₁(x), π₂(x) put all mass on a single action.]

System Performance
Definition [Utility of Policy]: the expected reward / utility U(π) of policy π is
U(π) = ∫∫ δ(x, y) π(y|x) P(x) dx dy
[Figure: action distributions π(Y|xᵢ), π(Y|xⱼ), … for different contexts; δ(x, y) is, e.g., the reading time of user x for portfolio y.]
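In a small discrete world the utility integral reduces to a finite sum, which the following sketch evaluates exactly (toy distributions and a hypothetical policy, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n_contexts, n_actions = 5, 3

P_x = np.full(n_contexts, 1 / n_contexts)                # P(X): uniform toy distribution
delta = rng.uniform(0.0, 1.0, (n_contexts, n_actions))   # delta(x, y): hypothetical reward table

def pi(x):
    """A stochastic policy pi(Y|x): softmax over the rewards (a toy, not a real system)."""
    logits = 2.0 * delta[x]
    p = np.exp(logits - logits.max())
    return p / p.sum()

# U(pi) = sum_x P(x) sum_y pi(y|x) delta(x, y)  -- the integral becomes a finite sum here
U = sum(P_x[x] * (pi(x) @ delta[x]) for x in range(n_contexts))
print(f"U(pi) = {U:.3f}")
```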

Online Evaluation: A/B Testing
• Given S = ((x₁, y₁, δ₁), …, (xₙ, yₙ, δₙ)) collected under π₀:
Û(π₀) = (1/n) Σᵢ δᵢ
• To evaluate each candidate policy, field it:
– Deploy π₁: draw x ∼ P(X), predict y ∼ π₁(Y|x), get δ(x, y)
– Deploy π₂: draw x ∼ P(X), predict y ∼ π₂(Y|x), get δ(x, y)
⋮
– Deploy π_|H|: draw x ∼ P(X), predict y ∼ π_|H|(Y|x), get δ(x, y)

Page 3


Pros and Cons of A/B Testing
• Pros
– User-centric measure
– No need for manual ratings
– No user/expert mismatch
• Cons
– Requires interactive experimental control
– Risk of fielding a bad or buggy πᵢ
– Number of A/B tests is limited
– Long turnaround time

Evaluating Online Metrics Offline
• Online: on-policy A/B test
– Draw S₁ from π₁ → Û(π₁); draw S₂ from π₂ → Û(π₂); …; draw S₇ from π₇ → Û(π₇)
– One deployment, and one fresh sample, per candidate policy.
• Offline: off-policy counterfactual estimates
– Draw a single log S from π₀
– From that one log, estimate Û(π₁), Û(π₂), …, Û(π₃₀) for many candidate policies.
[Figure: seven separate online samples on the left; on the right, one log S from π₀ fans out into estimates Û(π₁) … Û(π₃₀).]

Evaluation: Outline
• Evaluating Online Metrics Offline
– A/B testing (on-policy)
– Counterfactual estimation from logs (off-policy)
• Approach 1: "Model the world"
– Estimation via reward prediction
• Approach 2: "Model the bias"
– Counterfactual model
– Inverse propensity scoring (IPS) estimator

Approach 1: Reward Predictor
• Idea: use S = ((x₁, y₁, δ₁), …, (xₙ, yₙ, δₙ)) from π₀ to train a reward predictor δ̂(x, y).
• Deterministic π: simulated A/B test with the predicted δ̂(x, y)
– For the actions y′ᵢ = π(xᵢ) of the new policy π, generate the predicted log
S′ = ((x₁, y′₁, δ̂(x₁, y′₁)), …, (xₙ, y′ₙ, δ̂(xₙ, y′ₙ)))
– Estimate the performance of π via Û_rp(π) = (1/n) Σᵢ δ̂(xᵢ, y′ᵢ)
• Stochastic π:
Û_rp(π) = (1/n) Σᵢ Σ_y δ̂(xᵢ, y) π(y|xᵢ)
[Figure: predicted rewards δ̂(x, y₁), δ̂(x, y₂), … over Y|x.]

Regression for Reward Prediction
Learn δ̂: X × Y → ℝ
1. Represent each pair via features Ψ(x, y)
2. Learn a regression based on Ψ(x, y) from the S collected under π₀
3. Predict δ̂(x, y′) for the actions y′ = π(x) of the new policy π
[Figure: regression surface δ̂(x, y) over feature space (Ψ₁, Ψ₂), queried at δ̂(x, y′).]
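A minimal end-to-end sketch of this "model the world" approach, assuming one-hot features Ψ(x, y) and ridge regression (a hypothetical setup, not the lecture's actual experiment):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
n_contexts, n_actions, n = 5, 3, 5000
delta = rng.uniform(0.0, 1.0, (n_contexts, n_actions))  # hidden true rewards delta(x, y)

def features(x, y):
    """Psi(x, y): one-hot encoding of the (context, action) cell."""
    psi = np.zeros(n_contexts * n_actions)
    psi[x * n_actions + y] = 1.0
    return psi

# 1) Log data from a skewed logging policy pi0 (action 0 heavily over-represented).
p0 = np.array([0.8, 0.1, 0.1])
xs = rng.integers(n_contexts, size=n)
ys = rng.choice(n_actions, size=n, p=p0)
ds = rng.binomial(1, delta[xs, ys]).astype(float)

# 2) Fit the reward predictor delta_hat by ridge regression on Psi(x, y).
X = np.stack([features(x, y) for x, y in zip(xs, ys)])
model = Ridge(alpha=1.0).fit(X, ds)

# 3) Simulated A/B test: evaluate a new deterministic policy (here, hypothetically, always action 1).
pi = lambda x: 1
U_rp = np.mean([model.predict(features(x, pi(x))[None, :])[0] for x in xs])
print(f"U_rp(pi) = {U_rp:.3f}   true U(pi) = {delta[:, 1].mean():.3f}")
```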

News Recommender: Experimental Setup
• Context x: user profile
• Action y: ranking
– Pick from 7 candidates to place into 3 slots
• Reward δ: "revenue"
– A complicated hidden function
• Logging policy π₀: non-uniform randomized logging system
– Plackett-Luce "explore around the current production ranker"
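A sketch of Plackett-Luce sampling as a logging policy: repeatedly softmax-sample the next item without replacement from the remaining candidates, so rankings close to the production ranker's order are most likely. The scores and temperature below are hypothetical; note that the propensity of the sampled ranking is the product of the per-step probabilities:

```python
import numpy as np

rng = np.random.default_rng(3)

def plackett_luce_sample(scores, temperature=1.0, slots=3):
    """Sample a top-`slots` ranking: repeatedly softmax-sample the next item
    without replacement from the remaining candidates. Returns the ranking
    and its propensity (the product of the per-step probabilities)."""
    remaining = list(range(len(scores)))
    ranking, propensity = [], 1.0
    for _ in range(slots):
        logits = np.array([scores[i] for i in remaining]) / temperature
        p = np.exp(logits - logits.max())
        p /= p.sum()
        k = rng.choice(len(remaining), p=p)
        propensity *= p[k]
        ranking.append(remaining.pop(k))
    return ranking, propensity

production_scores = np.array([2.0, 1.5, 1.2, 0.7, 0.5, 0.3, 0.1])  # 7 candidate articles
print(plackett_luce_sample(production_scores))  # e.g. ([0, 1, 2], 0.18...) -- varies per draw
```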

Page 4


News Recommender: Results
[Figure: estimation error of the reward predictor (RP); RP is inaccurate even with more training and logged data.]

Problems of the Reward Predictor
• Modeling bias
– Choice of features and model
• Selection bias
– π₀'s actions are over-represented in the log
Û_rp(π) = (1/n) Σᵢ δ̂(xᵢ, π(xᵢ))
[Figure: regression fit δ̂(x, y) over (Ψ₁, Ψ₂), evaluated at δ̂(x, π(x)).]
→ Can be unreliable and biased.

Evaluation: Outline
• Evaluating Online Metrics Offline
– A/B testing (on-policy)
– Counterfactual estimation from logs (off-policy)
• Approach 1: "Model the world"
– Estimation via reward prediction
• Approach 2: "Model the bias"
– Counterfactual model
– Inverse propensity scoring (IPS) estimator

Approach 2: "Model the Bias"
• Idea: fix the mismatch between the distribution π₀(Y|x) that generated the data and the distribution π(Y|x) we aim to evaluate:
U(π₀) = ∫∫ δ(x, y) π₀(y|x) P(x) dx dy → replace π₀(y|x) with π(y|x) to get U(π).

Counterfactual Model
• Example: treating heart attacks
– Treatments Y: bypass / stent / drugs
– Chosen treatment for patient xᵢ: yᵢ
– Outcomes δᵢ: 5-year survival (0 / 1)
– Which treatment is best?
[Figure: a grid of observed 0/1 outcomes, one row per patient xᵢ, i ∈ {1, …, n}, one chosen treatment per patient.]
• The same model covers, e.g., placing a vertical in web search:
– Treatments: Pos 1 / Pos 2 / Pos 3
– Outcome: click / no click on the SERP

Page 5


Counterfactual Model (continued)
• Which treatment is best?
– Everybody drugs? Everybody stent? Everybody bypass?
– Averaging the observed outcomes per chosen treatment gives drugs 3/4, stent 2/3, bypass 2/4 – really?
[Figure: the same outcome grid for patients xᵢ, i ∈ {1, …, n}.]

Treatment Effects
• Average treatment effect of treatment y:
U(y) = (1/n) Σᵢ δ(xᵢ, y)
• Example (n = 11 patients):
– U(bypass) = 4/11
– U(stent) = 6/11
– U(drugs) = 3/11
[Figure: the full 11-patient table; each patient has one factual outcome (for the treatment actually given) and counterfactual outcomes for the treatments not given.]

Assignment Mechanism
• Probabilistic treatment assignment
– For patient i: π₀(Yᵢ = y | xᵢ)
– Causes selection bias
• Inverse propensity score (IPS) estimator:
Û_ips(y) = (1/n) Σᵢ (𝕀{yᵢ = y} / pᵢ) δ(xᵢ, yᵢ)
– Propensity: pᵢ = π₀(Yᵢ = yᵢ | xᵢ)
– Unbiased: E[Û(y)] = U(y), if π₀(Yᵢ = y | xᵢ) > 0 for all i
• Example:
Û(drugs) = (1/11) (1/0.8 + 1/0.7 + 1/0.8 + 0/0.1) ≈ 0.36 < 0.75 (the naive estimate from before)
[Figure: the outcome grid again, now annotated with assignment probabilities π₀(Yᵢ = y | xᵢ) for every patient/treatment pair.]

Experimental vs. Observational
• Controlled experiment
– Assignment mechanism is under our control
– Propensities pᵢ = π₀(Yᵢ = yᵢ | xᵢ) are known by design
– Requirement: ∀y: π₀(Yᵢ = y | xᵢ) > 0 (probabilistic assignment)
• Observational study
– Assignment mechanism is not under our control
– Propensities pᵢ need to be estimated
– Estimate π̂₀(Yᵢ | zᵢ) ≈ π₀(Yᵢ | xᵢ) based on features zᵢ
– Requirement: π̂₀(Yᵢ | zᵢ) = π̂₀(Yᵢ | δᵢ, zᵢ) (unconfoundedness: given zᵢ, assignment carries no extra information about the outcomes)
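For the observational case, a minimal propensity-estimation sketch: fit a multinomial logistic regression π̂₀(Y|z) on the observed covariates z and read off the propensity of the treatment actually assigned (all data synthetic and hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 5000
z = rng.normal(size=(n, 4))                    # observed covariates z_i
hidden = z @ rng.normal(size=(4, 3))           # hidden assignment mechanism over 3 treatments
probs = np.exp(hidden) / np.exp(hidden).sum(axis=1, keepdims=True)
y = np.array([rng.choice(3, p=p) for p in probs])   # assigned treatments

# Fit pi0_hat(Y | z), then read off each patient's propensity p_i = pi0_hat(Y = y_i | z_i).
model = LogisticRegression(max_iter=1000).fit(z, y)
p_hat = model.predict_proba(z)[np.arange(n), y]
print(f"mean estimated propensity: {p_hat.mean():.3f}")
```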

Conditional Treatment Policies
• Policy (deterministic)
– Context xᵢ describing the patient
– Pick treatment yᵢ based on xᵢ: yᵢ = π(xᵢ)
– Example policy: π(A) = drugs, π(B) = stent, π(C) = bypass
• Average treatment effect:
U(π) = (1/n) Σᵢ δ(xᵢ, π(xᵢ))
• IPS estimator:
Û_ips(π) = (1/n) Σᵢ (𝕀{yᵢ = π(xᵢ)} / pᵢ) δ(xᵢ, yᵢ)
[Figure: the outcome grid, with each patient labeled by type A/B/C.]

Stochastic Treatment Policies
• Policy (stochastic)
– Context xᵢ describing the patient
– Pick treatment y based on xᵢ: y ∼ π(Y|xᵢ)
• Note: the assignment mechanism is a stochastic policy as well!
• Average treatment effect:
U(π) = (1/n) Σᵢ Σ_y δ(xᵢ, y) π(y|xᵢ)
• IPS estimator:
Û(π) = (1/n) Σᵢ (π(yᵢ|xᵢ) / pᵢ) δ(xᵢ, yᵢ)
[Figure: the outcome grid, with each patient labeled by type A/B/C.]

Page 6


Counterfactual Model = Logs

                 | Medical            | Search Engine | Ad Placement     | Recommender
Context xᵢ       | Diagnostics        | Query         | User + page      | User + movie
Treatment yᵢ     | Bypass/stent/drugs | Ranking       | Placed ad        | Watched movie
Outcome δᵢ       | Survival           | Click metric  | Click / no click | Star rating
Propensities pᵢ  | Controlled (*)     | Controlled    | Controlled       | Observational
New policy π     | FDA guidelines     | Ranker        | Ad placer        | Recommender

(xᵢ, yᵢ, δᵢ) are recorded in the log; the treatment effect U(π) is the average quality of the new policy.

Evaluation: Outline
• Evaluating Online Metrics Offline
– A/B testing (on-policy)
– Counterfactual estimation from logs (off-policy)
• Approach 1: "Model the world"
– Estimation via reward prediction
• Approach 2: "Model the bias"
– Counterfactual model
– Inverse propensity scoring (IPS) estimator

System Evaluation via Inverse Propensity Scoring

Definition [IPS Utility Estimator]: given S = ((x₁, y₁, δ₁), …, (xₙ, yₙ, δₙ)) collected under π₀,
Û_ips(π) = (1/n) Σᵢ δᵢ · π(yᵢ|xᵢ) / π₀(yᵢ|xᵢ)
where the propensity is pᵢ = π₀(yᵢ|xᵢ).
• Unbiased estimate of U(π) for any π, if the propensity is nonzero whenever π(yᵢ|xᵢ) > 0.
• Note: if π = π₀, this recovers the online A/B test with Û_ips(π₀) = (1/n) Σᵢ δᵢ
→ off-policy vs. on-policy estimation.
[Horvitz & Thompson, 1952] [Rubin, 1983] [Zadrozny et al., 2003] [Li et al., 2011]

Illustration of IPS
Û_ips(π) = (1/n) Σᵢ (π(yᵢ|xᵢ) / π₀(yᵢ|xᵢ)) δᵢ
[Figure: the two distributions π₀(Y|x) and π(Y|x); logged points are reweighted by the ratio of the two.]

IPS Estimator is Unbiased

E[Û(π)] = (1/n) Σ_{x₁,y₁} ⋯ Σ_{xₙ,yₙ} π₀(y₁|x₁) ⋯ π₀(yₙ|xₙ) P(x₁) ⋯ P(xₙ) Σᵢ (π(yᵢ|xᵢ) / π₀(yᵢ|xᵢ)) δ(xᵢ, yᵢ)

= (1/n) Σᵢ Σ_{xᵢ,yᵢ} π₀(yᵢ|xᵢ) P(xᵢ) (π(yᵢ|xᵢ) / π₀(yᵢ|xᵢ)) δ(xᵢ, yᵢ)
  [for each i, the sums over all other (xⱼ, yⱼ) marginalize out to 1]

= (1/n) Σᵢ Σ_{xᵢ,yᵢ} π(yᵢ|xᵢ) P(xᵢ) δ(xᵢ, yᵢ)
  [π₀ cancels; this needs probabilistic assignment, i.e. π₀(yᵢ|xᵢ) > 0 wherever π(yᵢ|xᵢ) > 0]

= (1/n) Σᵢ U(π)

= U(π)
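The derivation can be sanity-checked empirically: averaging Û_ips over many independently drawn logs should converge to U(π). A minimal check on a single-context toy problem (policies and rewards hypothetical):

```python
import numpy as np

rng = np.random.default_rng(7)
n_actions, n, reps = 3, 200, 2000
delta = np.array([0.2, 0.5, 0.9])   # single-context toy: reward depends on the action only
pi0 = np.array([0.7, 0.2, 0.1])     # logging policy
pi = np.array([0.1, 0.2, 0.7])      # target policy to evaluate

estimates = []
for _ in range(reps):
    y = rng.choice(n_actions, size=n, p=pi0)        # one replicated log of size n
    d = rng.binomial(1, delta[y])
    estimates.append(np.mean(d * pi[y] / pi0[y]))   # U_ips(pi) on this log

print(f"mean of U_ips over {reps} logs: {np.mean(estimates):.3f}")
print(f"true U(pi):                     {pi @ delta:.3f}")
```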

News Recommender: Results
[Figure: IPS eventually beats RP; the variance of the IPS estimate decays as O(1/n).]

Page 7


Evaluation: Outline
• Evaluating Online Metrics Offline
– A/B testing (on-policy)
– Counterfactual estimation from logs (off-policy)
• Approach 1: "Model the world"
– Estimation via reward prediction
• Approach 2: "Model the bias"
– Counterfactual model
– Inverse propensity scoring (IPS) estimator