
Page 1

Counterfactual Model for Learning Systems

CS 7792 - Fall 2018

Thorsten Joachims

Department of Computer Science & Department of Information Science

Cornell University

Imbens, Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences, 2015. Chapters 1, 3, 12.

Page 2

Interactive System Schematic

[Diagram: system π₀ receives context x, takes action y for x; utility U(π₀)]

Page 3

News Recommender

• Context x:
  – User
• Action y:
  – Portfolio of news articles
• Feedback δ(x, y):
  – Reading time in minutes

Page 4

Ad Placement

• Context x:
  – User and page
• Action y:
  – Ad that is placed
• Feedback δ(x, y):
  – Click / no-click

Page 5

Search Engine

• Context x:
  – Query
• Action y:
  – Ranking
• Feedback δ(x, y):
  – Click / no-click

Page 6

Log Data from Interactive Systems

• Data
  S = ((x₁, y₁, δ₁), …, (xₙ, yₙ, δₙ))
  Partial-information (aka "contextual bandit") feedback
• Properties
  – Contexts xᵢ drawn i.i.d. from unknown P(X)
  – Actions yᵢ selected by existing system π₀: X → Y
  – Feedback δᵢ from unknown function δ: X × Y → ℝ

[Zadrozny et al., 2003] [Langford & Li] [Bottou et al., 2014]

Page 7

Goal: Counterfactual Evaluation

• Use interaction log data
  S = ((x₁, y₁, δ₁), …, (xₙ, yₙ, δₙ))
  for evaluation of system π:
• Estimate online measures of some system π offline.
• System π can be different from the π₀ that generated the log.

Page 8

Evaluation: Outline

• Evaluating Online Metrics Offline
  – A/B testing (on-policy) → counterfactual estimation from logs (off-policy)
• Approach 1: "Model the world"
  – Estimation via reward prediction
• Approach 2: "Model the bias"
  – Counterfactual model
  – Inverse propensity scoring (IPS) estimator

Page 9

Online Performance Metrics

Example metrics:
– CTR
– Revenue
– Time-to-success
– Interleaving
– Etc.

The correct choice depends on the application and is not the focus of this lecture.

This lecture: the metric is encoded as δ(x, y) [click / payoff / time for the (x, y) pair].

Page 10

System

• Definition [Deterministic Policy]: Function
    y = π(x)
  that picks action y for context x.
• Definition [Stochastic Policy]: Distribution
    π(y|x)
  that samples action y given context x.

[Diagram: two stochastic policies π₁(Y|x) and π₂(Y|x) as distributions over actions Y for context x]

Page 11

System Performance

Definition [Utility of Policy]: The expected reward / utility U(π) of policy π is

  U(π) = ∫∫ δ(x, y) π(y|x) P(x) dx dy

…e.g. the reading time of user x for portfolio y.

[Diagram: distributions π(Y|xᵢ) and π(Y|xⱼ) for sampled contexts]

Page 12

Online Evaluation: A/B Testing

Given S = ((x₁, y₁, δ₁), …, (xₙ, yₙ, δₙ)) collected under π₀,

  Û(π₀) = (1/n) Σᵢ₌₁ⁿ δᵢ

A/B testing deploys each candidate policy in turn:
  Deploy π₁: draw x ∼ P(X), predict y ∼ π₁(Y|x), get δ(x, y)
  Deploy π₂: draw x ∼ P(X), predict y ∼ π₂(Y|x), get δ(x, y)
  ⋮
  Deploy π_|H|: draw x ∼ P(X), predict y ∼ π_|H|(Y|x), get δ(x, y)
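As a concrete sketch of the on-policy estimate above: deploying a policy and averaging the observed feedback is all the A/B estimate does. The context distribution, feedback function, and policies below are toy stand-ins, not from the lecture.

```python
import random

def ab_estimate(policy, delta, contexts):
    """On-policy estimate: deploy `policy`, draw an action per context,
    and average the observed feedback delta(x, y)."""
    total = 0.0
    for x in contexts:
        y = policy(x)          # y ~ pi(Y|x)
        total += delta(x, y)   # observed feedback for the (x, y) pair
    return total / len(contexts)

# Toy setup (hypothetical): binary actions, reward 1 iff action matches context parity.
random.seed(0)
delta = lambda x, y: 1.0 if y == x % 2 else 0.0
pi1 = lambda x: x % 2                    # always picks the rewarded action
pi2 = lambda x: random.randrange(2)      # guesses uniformly at random
contexts = [random.randrange(10) for _ in range(10000)]  # x ~ P(X)
print(ab_estimate(pi1, delta, contexts))  # → 1.0
print(ab_estimate(pi2, delta, contexts))  # ≈ 0.5
```

Note that each candidate policy needs its own deployment: the samples gathered for π₁ tell us nothing directly about π₂, which is exactly the limitation the off-policy estimators below address.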

Page 13

Pros and Cons of A/B Testing

• Pro
  – User-centric measure
  – No need for manual ratings
  – No user/expert mismatch
• Cons
  – Requires interactive experimental control
  – Risk of fielding a bad or buggy πᵢ
  – Number of A/B tests limited
  – Long turnaround time

Page 14

Evaluating Online Metrics Offline

• Online: on-policy A/B test
  – Draw S₁ from π₁ → Û(π₁)
  – Draw S₂ from π₂ → Û(π₂)
  – Draw S₃ from π₃ → Û(π₃)
  – … one deployment per candidate policy
• Offline: off-policy counterfactual estimates
  – Draw a single S from π₀ → Û(π₆), Û(π₁₂), Û(π₁₈), Û(π₂₄), Û(π₃₀), … for many policies at once

Page 15

Evaluation: Outline

• Evaluating Online Metrics Offline
  – A/B testing (on-policy) → counterfactual estimation from logs (off-policy)
• Approach 1: "Model the world"
  – Estimation via reward prediction
• Approach 2: "Model the bias"
  – Counterfactual model
  – Inverse propensity scoring (IPS) estimator

Page 16

Approach 1: Reward Predictor

• Idea:
  – Use S = ((x₁, y₁, δ₁), …, (xₙ, yₙ, δₙ)) from π₀ to estimate a reward predictor δ̂(x, y)
• Deterministic π: simulated A/B testing with predicted δ̂(x, y)
  – For actions y′ᵢ = π(xᵢ) from the new policy π, generate the predicted log
    S′ = ((x₁, y′₁, δ̂(x₁, y′₁)), …, (xₙ, y′ₙ, δ̂(xₙ, y′ₙ)))
  – Estimate the performance of π via Û_rp(π) = (1/n) Σᵢ₌₁ⁿ δ̂(xᵢ, y′ᵢ)
• Stochastic π: Û_rp(π) = (1/n) Σᵢ₌₁ⁿ Σ_y δ̂(xᵢ, y) π(y|xᵢ)

Page 17

Regression for Reward Prediction

Learn δ̂: X × Y → ℝ
1. Represent via features Ψ(x, y)
2. Learn a regression based on Ψ(x, y) from S collected under π₀
3. Predict δ̂(x, y′) for y′ = π(x) of the new policy π

[Diagram: regression fit over feature space (Ψ₁, Ψ₂)]

Page 18

News Recommender: Experiment Setup

• Context x: user profile
• Action y: ranking
  – Pick from 7 candidates to place into 3 slots
• Reward δ: "revenue"
  – Complicated hidden function
• Logging policy π₀: non-uniform randomized logging system
  – Plackett-Luce "explore around current production ranker"

Page 19

News Recommender: Results

[Plot] RP is inaccurate even with more training and logged data.

Page 20

Problems of the Reward Predictor

• Modeling bias
  – Choice of features and model
• Selection bias
  – π₀'s actions are over-represented

  Û_rp(π) = (1/n) Σᵢ δ̂(xᵢ, π(xᵢ))

→ Can be unreliable and biased

[Diagram: regression fit over feature space (Ψ₁, Ψ₂)]

Page 21

Evaluation: Outline

• Evaluating Online Metrics Offline
  – A/B testing (on-policy) → counterfactual estimation from logs (off-policy)
• Approach 1: "Model the world"
  – Estimation via reward prediction
• Approach 2: "Model the bias"
  – Counterfactual model
  – Inverse propensity score (IPS) weighting estimator

Page 22

Approach "Model the Bias"

• Idea: Fix the mismatch between the distribution π₀(Y|x) that generated the data and the distribution π(Y|x) we aim to evaluate:

  U(π₀) = ∫∫ δ(x, y) π₀(y|x) P(x) dx dy   →   U(π) = ∫∫ δ(x, y) π(y|x) P(x) dx dy

Page 23

Counterfactual Model

• Example: Treating Heart Attacks
  – Treatments Y: bypass / stent / drugs
  – Chosen treatment for patient xᵢ: yᵢ
  – Outcomes δᵢ: 5-year survival, 0 / 1
  – Which treatment is best?

[Table: patients xᵢ, i ∈ 1, …, n, with their observed 0/1 outcomes]

Page 24

Counterfactual Model

• Example: Treating Heart Attacks
  – Treatments Y: bypass / stent / drugs
  – Chosen treatment for patient xᵢ: yᵢ
  – Outcomes δᵢ: 5-year survival, 0 / 1
  – Which treatment is best?
• Search analogy: placing a vertical at Pos 1 / Pos 2 / Pos 3, with click / no click on the SERP as outcome

[Table: patients i ∈ 1, …, n, with their observed 0/1 outcomes]

Page 25

Counterfactual Model

• Example: Treating Heart Attacks
  – Treatments Y: bypass / stent / drugs
  – Chosen treatment for patient xᵢ: yᵢ
  – Outcomes δᵢ: 5-year survival, 0 / 1
  – Which treatment is best?
    • Everybody drugs?
    • Everybody stent?
    • Everybody bypass?
    Naive averages over the observed outcomes: drugs 3/4, stent 2/3, bypass 2/4 – really?

[Table: patients xᵢ, i ∈ 1, …, n, with their observed 0/1 outcomes]

Page 26

Treatment Effects

• Average treatment effect of treatment y:
  U(y) = (1/n) Σᵢ δ(xᵢ, y)
• Example (n = 11 patients):
  – U(bypass) = 4/11
  – U(stent) = 6/11
  – U(drugs) = 3/11

[Table: per-patient outcomes; the factual outcome of the chosen treatment is observed, the counterfactual outcomes of the other treatments are not]

Page 27

Assignment Mechanism

• Probabilistic treatment assignment
  – For patient i: π₀(Yᵢ = y | xᵢ)
  – Selection bias
• Inverse propensity score estimator

  Û_ips(y) = (1/n) Σᵢ 𝕀{yᵢ = y} / pᵢ · δ(xᵢ, yᵢ)

  – Propensity: pᵢ = π₀(Yᵢ = yᵢ | xᵢ)
  – Unbiased: E[Û(y)] = U(y), if π₀(Yᵢ = y | xᵢ) > 0 for all i
• Example
  – Û(drugs) = (1/11) (1/0.8 + 1/0.7 + 1/0.8 + 0/0.1) ≈ 0.36 < 0.75

[Table: per-patient outcomes and assignment probabilities π₀(Yᵢ = y | xᵢ)]

Page 28

Experimental vs. Observational

• Controlled experiment
  – Assignment mechanism under our control
  – Propensities pᵢ = π₀(Yᵢ = yᵢ | xᵢ) are known by design
  – Requirement: ∀y: π₀(Yᵢ = y | xᵢ) > 0 (probabilistic)
• Observational study
  – Assignment mechanism not under our control
  – Propensities pᵢ need to be estimated
  – Estimate π̂₀(Yᵢ | zᵢ) = π₀(Yᵢ | xᵢ) based on features zᵢ
  – Requirement: π̂₀(Yᵢ | zᵢ) = π̂₀(Yᵢ | δᵢ, zᵢ) (unconfounded)

Page 29

Conditional Treatment Policies

• Policy (deterministic)
  – Context xᵢ describing patient
  – Pick treatment yᵢ based on xᵢ: yᵢ = π(xᵢ)
  – Example policy: π(A) = drugs, π(B) = stent, π(C) = bypass
• Average treatment effect
  – U(π) = (1/n) Σᵢ δ(xᵢ, π(xᵢ))
• IPS estimator
  – Û_ips(π) = (1/n) Σᵢ 𝕀{yᵢ = π(xᵢ)} / pᵢ · δ(xᵢ, yᵢ)

[Table: patients with contexts A/B/C and their observed outcomes]

Page 30

Stochastic Treatment Policies

• Policy (stochastic)
  – Context xᵢ describing patient
  – Pick treatment y based on xᵢ: y ∼ π(Y|xᵢ)
• Note
  – The assignment mechanism is a stochastic policy as well!
• Average treatment effect
  – U(π) = (1/n) Σᵢ Σ_y δ(xᵢ, y) π(y|xᵢ)
• IPS estimator
  – Û(π) = (1/n) Σᵢ π(yᵢ|xᵢ) / pᵢ · δ(xᵢ, yᵢ)

[Table: patients with contexts A/B/C and their observed outcomes]

Page 31

Counterfactual Model = Logs

|                 | Medical        | Search Engine | Ad Placement     | Recommender   |
| Context xᵢ      | Diagnostics    | Query         | User + Page      | User + Movie  |
| Treatment yᵢ    | BP/Stent/Drugs | Ranking       | Placed Ad        | Watched Movie |
| Outcome δᵢ      | Survival       | Click metric  | Click / no Click | Star rating   |
| Propensities pᵢ | controlled (*) | controlled    | controlled       | observational |
| New policy π    | FDA Guidelines | Ranker        | Ad Placer        | Recommender   |
| T-effect U(π)   | Average quality of the new policy (all columns)                  |

(Context, treatment, outcome, and propensities are recorded in the log.)

Page 32

Evaluation: Outline

• Evaluating Online Metrics Offline
  – A/B testing (on-policy) → counterfactual estimation from logs (off-policy)
• Approach 1: "Model the world"
  – Estimation via reward prediction
• Approach 2: "Model the bias"
  – Counterfactual model
  – Inverse propensity scoring (IPS) estimator

Page 33

System Evaluation via Inverse Propensity Scoring

Definition [IPS Utility Estimator]: Given S = ((x₁, y₁, δ₁), …, (xₙ, yₙ, δₙ)) collected under π₀,

  Û_ips(π) = (1/n) Σᵢ₌₁ⁿ δᵢ · π(yᵢ|xᵢ) / π₀(yᵢ|xᵢ)

is an unbiased estimate of the utility of any π, if the propensity π₀(yᵢ|xᵢ) is nonzero whenever π(yᵢ|xᵢ) > 0.

Note: If π = π₀, this reduces to the online A/B test estimate Û_ips(π₀) = (1/n) Σᵢ δᵢ → off-policy vs. on-policy estimation.

[Horvitz & Thompson, 1952] [Rubin, 1983] [Zadrozny et al., 2003] [Li et al., 2011]

Page 34

Illustration of IPS

IPS estimator:
  Û_IPS(π) = (1/n) Σᵢ π(yᵢ|xᵢ) / π₀(yᵢ|xᵢ) · δᵢ

Unbiased: if
  ∀x, y: π(y|x) P(x) > 0 → π₀(y|x) > 0
then
  E[Û_IPS(π)] = U(π)

[Diagram: supports of π₀(Y|x) and π(Y|x) overlap]

Page 35

IPS Estimator is Unbiased

E[Û_IPS(π)]
  = (1/n) Σ_{x₁,y₁} π₀(y₁|x₁) P(x₁) … Σ_{xₙ,yₙ} π₀(yₙ|xₙ) P(xₙ) Σᵢ [π(yᵢ|xᵢ) / π₀(yᵢ|xᵢ)] δ(xᵢ, yᵢ)
  = (1/n) Σᵢ Σ_{xᵢ,yᵢ} π₀(yᵢ|xᵢ) P(xᵢ) [π(yᵢ|xᵢ) / π₀(yᵢ|xᵢ)] δ(xᵢ, yᵢ)    (samples independent; marginalize over the other n−1)
  = (1/n) Σᵢ Σ_{xᵢ,yᵢ} P(xᵢ) π(yᵢ|xᵢ) δ(xᵢ, yᵢ)    (full support: the π₀ terms cancel)
  = (1/n) Σᵢ U(π)    (the (x, y) pairs are identically distributed)
  = U(π)
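The unbiasedness argument can also be checked numerically: under a full-support logging policy, the average of many independent IPS estimates should match the true utility. A small Monte Carlo sketch on a toy two-context, two-action problem (all quantities hypothetical):

```python
import random

random.seed(1)
ACTIONS = [0, 1]
CONTEXTS = [0, 1]                         # x ~ uniform P(X)
delta = lambda x, y: float(y == x)        # reward 1 iff action matches context
pi0 = lambda y, x: 0.5                    # logging policy: uniform (full support)
pi = lambda y, x: 0.9 if y == x else 0.1  # target policy, evaluated offline

# True utility: U(pi) = E_x [ sum_y delta(x, y) pi(y|x) ] = 0.9
true_u = sum(sum(delta(x, y) * pi(y, x) for y in ACTIONS)
             for x in CONTEXTS) / len(CONTEXTS)

def u_ips_once(n):
    """One IPS estimate from n fresh samples logged under pi0."""
    total = 0.0
    for _ in range(n):
        x = random.choice(CONTEXTS)
        y = random.choice(ACTIONS)        # y ~ pi0(Y|x), propensity 0.5
        total += delta(x, y) * pi(y, x) / pi0(y, x)
    return total / n

avg = sum(u_ips_once(1000) for _ in range(200)) / 200
print(true_u)                     # → 0.9
print(abs(avg - true_u) < 0.05)   # → True: estimates center on U(pi)
```

Individual estimates fluctuate (the "large variance" con noted in the summary), but their mean converges to U(π), which is what unbiasedness claims.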

Page 36

News Recommender: Results

[Plot] IPS eventually beats RP; its variance decays as O(1/n).

Page 37

Counterfactual Policy Evaluation

• Controlled experiment setting:
  – Log data: D = ((x₁, y₁, δ₁, p₁), …, (xₙ, yₙ, δₙ, pₙ))
• Observational setting:
  – Log data: D = ((x₁, y₁, δ₁, z₁), …, (xₙ, yₙ, δₙ, zₙ))
  – Estimate propensities pᵢ = P(yᵢ | xᵢ, zᵢ) based on xᵢ and other confounders zᵢ
• Goal: estimate the average treatment effect of a new policy π.
  – IPS estimator
    Û(π) = (1/n) Σᵢ δᵢ π(yᵢ|xᵢ) / pᵢ
    (or many others)

Page 38

Evaluation: Summary

• Evaluating Online Metrics Offline
  – A/B testing (on-policy) → counterfactual estimation from logs (off-policy)
• Approach 1: "Model the world"
  – Estimation via reward prediction
  – Pro: low variance
  – Con: model mismatch can lead to high bias
• Approach 2: "Model the bias"
  – Counterfactual model
  – Inverse propensity scoring (IPS) estimator
  – Pro: unbiased for known propensities
  – Con: large variance

Page 39

From Evaluation to Learning

• Naïve "Model the World" learning:
  – Learn δ̂: X × Y → ℝ
  – Derive policy: π(x) = argmin_{y′} δ̂(x, y′)
• Naïve "Model the Bias" learning:
  – Find the policy that optimizes the IPS training error:
    π = argmin_{π′} Σᵢ [π′(yᵢ|xᵢ) / π₀(yᵢ|xᵢ)] δᵢ
  (δ is treated as a loss here.)

Page 40

Outline of Class

• Counterfactual and Causal Inference
• Evaluation
  – Improved counterfactual estimators
  – Applications in recommender systems, etc.
  – Dealing with missing propensities, randomization, etc.
• Learning
  – Batch learning from bandit feedback
  – Dealing with combinatorial and continuous action spaces
  – Learning theory
  – More general learning with partial-information data (e.g. ranking, embedding)