Counterfactual Model for Learning Systems
CS 7792 - Fall 2018
Thorsten Joachims
Department of Computer Science & Department of Information Science
Cornell University
Imbens, Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences. 2015. Chapters 1, 3, 12.
Interactive System Schematic
[Diagram: system π₀ receives context x, chooses action y for x; utility U(π₀)]
News Recommender
• Context x: User
• Action y: Portfolio of news articles
• Feedback δ(x, y): Reading time in minutes
Ad Placement
• Context x: User and page
• Action y: Ad that is placed
• Feedback δ(x, y): Click / no-click
Search Engine
• Context x: Query
• Action y: Ranking
• Feedback δ(x, y): Click / no-click
Log Data from Interactive Systems
• Data
$$S = ((x_1, y_1, \delta_1), \dots, (x_n, y_n, \delta_n))$$
Partial-information (aka "contextual bandit") feedback.
• Properties
 – Contexts xᵢ drawn i.i.d. from unknown P(X)
 – Actions yᵢ selected by existing system π₀: X → Y
 – Feedback δᵢ from unknown function δ: X × Y → ℝ
[Zadrozny et al., 2003] [Langford & Li] [Bottou et al., 2014]
Goal: Counterfactual Evaluation
• Use interaction log data
$$S = ((x_1, y_1, \delta_1), \dots, (x_n, y_n, \delta_n))$$
for evaluation of system π:
 – Estimate online measures of some system π offline.
 – System π can be different from the π₀ that generated the log.
Evaluation: Outline
• Evaluating Online Metrics Offline
 – A/B testing (on-policy)
 – Counterfactual estimation from logs (off-policy)
• Approach 1: "Model the world"
 – Estimation via reward prediction
• Approach 2: "Model the bias"
 – Counterfactual model
 – Inverse propensity scoring (IPS) estimator
Online Performance Metrics
• Example metrics
 – CTR
 – Revenue
 – Time-to-success
 – Interleaving
 – Etc.
• The correct choice depends on the application and is not the focus of this lecture.
• This lecture: metric encoded as δ(x, y) [click/payoff/time for the (x, y) pair]
System
• Definition [Deterministic Policy]: A function
$$y = \pi(x)$$
that picks action y for context x.
• Definition [Stochastic Policy]: A distribution
$$\pi(y \mid x)$$
that samples action y given context x.
[Figure: two example action distributions π₁(Y|x) and π₂(Y|x) over actions Y given context x]
System Performance
• Definition [Utility of Policy]: The expected reward / utility U(π) of policy π is
$$U(\pi) = \int\!\!\int \delta(x, y)\, \pi(y \mid x)\, P(x)\, dx\, dy$$
where δ(x, y) is e.g. the reading time of user x for portfolio y.
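To make the definition concrete, here is a minimal Python sketch (not part of the lecture) that compares the exact U(π) with a Monte Carlo estimate on a made-up world; P_x, delta, and pi1 are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy world (hypothetical): 3 contexts, 2 actions.
P_x = np.array([0.5, 0.3, 0.2])        # P(x)
delta = np.array([[1.0, 0.0],          # delta(x, y): reward for each
                  [0.2, 0.8],          # (context, action) pair
                  [0.5, 0.5]])

def utility(pi):
    """Exact U(pi) = sum_x P(x) sum_y pi(y|x) delta(x, y)."""
    return float(np.sum(P_x[:, None] * pi * delta))

def utility_mc(pi, n=20_000):
    """Monte Carlo estimate: draw x ~ P(x), y ~ pi(.|x), average delta."""
    xs = rng.choice(3, size=n, p=P_x)
    ys = np.array([rng.choice(2, p=pi[x]) for x in xs])
    return float(delta[xs, ys].mean())

pi1 = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])  # a stochastic policy
print(utility(pi1), utility_mc(pi1))  # the two should agree closely
```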
Online Evaluation: A/B Testing
• Deploy π₁: draw x ∼ P(X), predict y ∼ π₁(Y|x), get δ(x, y)
• Deploy π₂: draw x ∼ P(X), predict y ∼ π₂(Y|x), get δ(x, y)
 ⋮
• Deploy π_|H|: draw x ∼ P(X), predict y ∼ π_|H|(Y|x), get δ(x, y)
Given S = ((x₁, y₁, δ₁), …, (xₙ, yₙ, δₙ)) collected under π₀,
$$\hat{U}(\pi_0) = \frac{1}{n} \sum_{i=1}^{n} \delta_i$$
Pros and Cons of A/B Testing
• Pro
 – User-centric measure
 – No need for manual ratings
 – No user/expert mismatch
• Con
 – Requires interactive experimental control
 – Risk of fielding a bad or buggy πᵢ
 – Number of A/B tests limited
 – Long turnaround time
Evaluating Online Metrics Offline
• Online: on-policy A/B test. Draw a separate sample Sᵢ from each candidate policy πᵢ (S₁ from π₁, S₂ from π₂, …).
• Offline: off-policy counterfactual estimates. Draw a single sample S from π₀ and reuse it to evaluate all candidate policies.
[Figure: timeline contrasting one live deployment per policy (on-policy) against one logged sample S from π₀ evaluated offline (off-policy)]
Evaluation: Outline
• Evaluating Online Metrics Offline
 – A/B testing (on-policy)
 – Counterfactual estimation from logs (off-policy)
• Approach 1: "Model the world"
 – Estimation via reward prediction
• Approach 2: "Model the bias"
 – Counterfactual model
 – Inverse propensity scoring (IPS) estimator
Approach 1: Reward Predictor
• Idea:
 – Use S = ((x₁, y₁, δ₁), …, (xₙ, yₙ, δₙ)) from π₀ to estimate a reward predictor δ̂(x, y).
• Deterministic π: simulated A/B testing with predicted δ̂(x, y)
 – For actions y′ᵢ = π(xᵢ) from the new policy π, generate the predicted log
$$S' = ((x_1, y'_1, \hat{\delta}(x_1, y'_1)), \dots, (x_n, y'_n, \hat{\delta}(x_n, y'_n)))$$
 – Estimate the performance of π via
$$\hat{U}_{RP}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \hat{\delta}(x_i, y'_i)$$
• Stochastic π:
$$\hat{U}_{RP}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \sum_{y} \hat{\delta}(x_i, y)\, \pi(y \mid x_i)$$
A sketch of both estimators follows below.
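A minimal sketch of the two RP estimators, assuming a fitted predictor delta_hat(x, y), a policy, and the logged contexts are supplied by the caller (all names hypothetical):

```python
import numpy as np

def u_rp_deterministic(delta_hat, xs, pi):
    """U_RP(pi) = 1/n * sum_i delta_hat(x_i, pi(x_i)) for deterministic pi."""
    return float(np.mean([delta_hat(x, pi(x)) for x in xs]))

def u_rp_stochastic(delta_hat, xs, pi_prob, actions):
    """U_RP(pi) = 1/n * sum_i sum_y delta_hat(x_i, y) * pi(y|x_i)."""
    return float(np.mean([sum(delta_hat(x, y) * pi_prob(y, x) for y in actions)
                          for x in xs]))
```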
Regression for Reward Prediction
Learn δ̂: X × Y → ℝ:
1. Represent (x, y) via features Ψ(x, y).
2. Learn a regression for δ̂ based on Ψ(x, y) from S collected under π₀.
3. Predict δ̂(x, y′) for the actions y′ = π(x) of the new policy π.
[Figure: regression fit of δ̂(x, y) in feature space (Ψ₁, Ψ₂)]
A minimal fitting sketch follows below.
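One way to instantiate step 2 with scikit-learn's Ridge; the feature map Ψ and the synthetic log are hypothetical stand-ins, not the lecture's setup:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical log collected under pi_0: feature vectors Psi(x_i, y_i)
# and observed feedback delta_i for the logged actions only.
n, d = 1000, 10
Psi_logged = rng.normal(size=(n, d))
true_w = rng.normal(size=d)                           # unknown in practice
delta_logged = Psi_logged @ true_w + rng.normal(scale=0.1, size=n)

reward_model = Ridge(alpha=1.0).fit(Psi_logged, delta_logged)

def delta_hat(psi_xy):
    """Predicted reward delta_hat(x, y') for a feature vector Psi(x, y')."""
    return float(reward_model.predict(psi_xy.reshape(1, -1))[0])
```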
News Recommender: Exp Setup
• Context x: user profile
• Action y: ranking
 – Pick from 7 candidates to place into 3 slots
• Reward δ: "revenue"
 – Complicated hidden function
• Logging policy π₀: non-uniform randomized logging system
 – Plackett-Luce "explore around current production ranker"
News Recommender: Results
[Figure: RP is inaccurate even with more training and logged data]
Problems of Reward Predictor
• Modeling bias
 – choice of features and model
• Selection bias
 – π₀'s actions are over-represented
$$\hat{U}_{RP}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \hat{\delta}(x_i, \pi(x_i))$$
[Figure: δ̂(x, y) fit in feature space (Ψ₁, Ψ₂) vs. true δ(x, π(x))]
→ Can be unreliable and biased.
Evaluation: Outline
• Evaluating Online Metrics Offline
 – A/B testing (on-policy)
 – Counterfactual estimation from logs (off-policy)
• Approach 1: "Model the world"
 – Estimation via reward prediction
• Approach 2: "Model the bias"
 – Counterfactual model
 – Inverse propensity scoring (IPS) estimator
Approach "Model the Bias"
• Idea: fix the mismatch between the distribution π₀(y|x) that generated the data and the distribution π(y|x) we aim to evaluate. The logged data reflects
$$U(\pi_0) = \int\!\!\int \delta(x, y)\, \pi_0(y \mid x)\, P(x)\, dx\, dy,$$
but we want the same integral with π(y|x) in place of π₀(y|x).
Counterfactual Model
• Example: Treating Heart Attacks
 – Treatments Y: bypass / stent / drugs
 – Chosen treatment for patient xᵢ: yᵢ
 – Outcomes δᵢ: 5-year survival (0 / 1)
 – Which treatment is best?
[Figure: table of patients i ∈ 1, …, n with observed 0/1 outcomes for their chosen treatments]
The same model covers interactive systems, e.g. placing a vertical on a SERP: the treatments are positions (Pos 1 / Pos 2 / Pos 3), and the outcome is click / no-click on the SERP.
• Which treatment is best?
 – Everybody drugs? Everybody stent? Everybody bypass?
 – Naive averages over the observed outcomes: drugs 3/4, stent 2/3, bypass 2/4. Really?
Treatment Effects
• Average treatment effect of treatment y
$$U(y) = \frac{1}{n} \sum_{i} \delta(x_i, y)$$
• Example (over all n = 11 patients, counting both factual and counterfactual outcomes)
 – U(bypass) = 4/11
 – U(stent) = 6/11
 – U(drugs) = 3/11
[Figure: patient table showing the factual outcome for each patient's chosen treatment and the unobserved counterfactual outcomes for the other treatments]
A toy computation follows below.
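A toy computation of U(y). The full outcome matrix here is hypothetical (only its column means are chosen to match the slide's 4/11, 6/11, 3/11 example); in a real log only one entry per row, the factual outcome, would be observed.

```python
import numpy as np

# Columns: bypass, stent, drugs. Rows: 11 patients. Made-up numbers whose
# column means reproduce the example; only the factual entry per patient
# is observable in practice.
outcomes = np.array([
    [1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 0],
    [0, 1, 0], [0, 1, 0], [0, 0, 0], [0, 0, 0],
    [0, 0, 0], [0, 0, 0], [0, 0, 0],
])
U = outcomes.mean(axis=0)   # U(y) = 1/n * sum_i delta(x_i, y)
print(dict(zip(["bypass", "stent", "drugs"], U)))  # 4/11, 6/11, 3/11
```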
Assignment Mechanism
• Probabilistic treatment assignment
 – For patient i: π₀(Yᵢ = y | xᵢ)
 – → Selection bias
• Inverse propensity score estimator
$$\hat{U}_{IPS}(y) = \frac{1}{n} \sum_{i} \frac{\mathbb{1}\{y_i = y\}}{p_i}\, \delta(x_i, y_i)$$
 – Propensity: pᵢ = π₀(Yᵢ = yᵢ | xᵢ)
 – Unbiased: E[Û(y)] = U(y), if π₀(Yᵢ = y | xᵢ) > 0 for all i
• Example (see the sketch below)
$$\hat{U}(\text{drugs}) = \frac{1}{11} \left( \frac{1}{0.8} + \frac{1}{0.7} + \frac{1}{0.8} + \frac{0}{0.1} \right) \approx 0.36 < 0.75$$
[Figure: patient table annotated with assignment probabilities π₀(Yᵢ = y | xᵢ)]
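A sketch of the single-treatment IPS estimator. The four drugs entries mirror the slide's example (outcomes 1, 1, 1, 0 with propensities 0.8, 0.7, 0.8, 0.1); the other patients' treatments, outcomes, and propensities are invented for illustration.

```python
import numpy as np

def ips_treatment(y, y_logged, p_logged, delta_logged):
    """U_hat(y) = 1/n * sum_i 1{y_i = y} / p_i * delta_i."""
    sel = (y_logged == y).astype(float)
    return float(np.mean(sel / p_logged * delta_logged))

# Hypothetical 11-patient log; only the drugs entries match the slide.
y_logged = np.array(["drugs"] * 4 + ["stent"] * 4 + ["bypass"] * 3)
delta_logged = np.array([1, 1, 1, 0,  1, 0, 1, 1,  0, 1, 0])
p_logged = np.array([0.8, 0.7, 0.8, 0.1,  0.5, 0.4, 0.6, 0.7,  0.3, 0.5, 0.2])

print(ips_treatment("drugs", y_logged, p_logged, delta_logged))  # ~0.36
```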
Experimental vs. Observational
• Controlled experiment
 – Assignment mechanism under our control
 – Propensities pᵢ = π₀(Yᵢ = yᵢ | xᵢ) are known by design
 – Requirement: ∀y: π₀(Yᵢ = y | xᵢ) > 0 (probabilistic)
• Observational study
 – Assignment mechanism not under our control
 – Propensities pᵢ need to be estimated
 – Estimate π̂₀(yᵢ | zᵢ) = π₀(yᵢ | xᵢ) based on features zᵢ
 – Requirement: π̂₀(yᵢ | zᵢ) = π̂₀(yᵢ | δᵢ, zᵢ) (unconfounded)
Conditional Treatment Policies
• Policy (deterministic)
 – Context xᵢ describing the patient
 – Pick treatment yᵢ based on xᵢ: yᵢ = π(xᵢ)
 – Example policy: π(A) = drugs, π(B) = stent, π(C) = bypass
• Average treatment effect
$$U(\pi) = \frac{1}{n} \sum_{i} \delta(x_i, \pi(x_i))$$
• IPS estimator (sketched below)
$$\hat{U}_{IPS}(\pi) = \frac{1}{n} \sum_{i} \frac{\mathbb{1}\{y_i = \pi(x_i)\}}{p_i}\, \delta(x_i, y_i)$$
[Figure: patient table with per-patient contexts xᵢ = B, C, A, B, A, B, A, C, A, C, B]
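A sketch for deterministic policies, assuming the log stores (xᵢ, yᵢ, δᵢ, pᵢ); the argument names and the example policy mapping are mine. Only logged records whose action agrees with π(xᵢ) contribute, reweighted by 1/pᵢ.

```python
import numpy as np

def ips_deterministic(pi, x_logged, y_logged, p_logged, delta_logged):
    """U_hat_IPS(pi) = 1/n * sum_i 1{y_i = pi(x_i)} / p_i * delta_i."""
    match = np.array([float(y == pi(x)) for x, y in zip(x_logged, y_logged)])
    return float(np.mean(match / p_logged * delta_logged))

# Example policy from the slide: pi(A) = drugs, pi(B) = stent, pi(C) = bypass.
policy = {"A": "drugs", "B": "stent", "C": "bypass"}.get
```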
Stochastic Treatment Policies
• Policy (stochastic)
 – Context xᵢ describing the patient
 – Pick treatment y based on xᵢ: π(Y | xᵢ)
• Note
 – The assignment mechanism is a stochastic policy as well!
• Average treatment effect
$$U(\pi) = \frac{1}{n} \sum_{i} \sum_{y} \delta(x_i, y)\, \pi(y \mid x_i)$$
• IPS estimator
$$\hat{U}_{IPS}(\pi) = \frac{1}{n} \sum_{i} \frac{\pi(y_i \mid x_i)}{p_i}\, \delta(x_i, y_i)$$
[Figure: same patient table as above]
Counterfactual Model = Logs

                   Medical          Search Engine   Ad Placement      Recommender
Context xᵢ         Diagnostics      Query           User + Page       User + Movie
Treatment yᵢ       BP/Stent/Drugs   Ranking         Placed Ad         Watched Movie
Outcome δᵢ         Survival         Click metric    Click / no Click  Star rating
Propensities pᵢ    controlled (*)   controlled      controlled        observational
New policy π       FDA Guidelines   Ranker          Ad Placer         Recommender
T-effect U(π)      Average quality of the new policy.

(xᵢ, yᵢ, δᵢ, pᵢ are recorded in the log.)
Evaluation: Outline
• Evaluating Online Metrics Offline
 – A/B testing (on-policy)
 – Counterfactual estimation from logs (off-policy)
• Approach 1: "Model the world"
 – Estimation via reward prediction
• Approach 2: "Model the bias"
 – Counterfactual model
 – Inverse propensity scoring (IPS) estimator
System Evaluation via Inverse Propensity Scoring
• Definition [IPS Utility Estimator]: Given S = ((x₁, y₁, δ₁), …, (xₙ, yₙ, δₙ)) collected under π₀,
$$\hat{U}_{IPS}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \delta_i\, \frac{\pi(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}$$
with propensity pᵢ = π₀(yᵢ | xᵢ).
• Unbiased estimate of the utility for any π, if the propensity is nonzero whenever π(yᵢ | xᵢ) > 0.
• Note: if π = π₀, this reduces to the online A/B test estimate
$$\hat{U}_{IPS}(\pi_0) = \frac{1}{n} \sum_{i} \delta_i$$
 – Off-policy vs. on-policy estimation. (A minimal implementation follows below.)
[Horvitz & Thompson, 1952] [Rosenbaum & Rubin, 1983] [Zadrozny et al., 2003] [Li et al., 2011]
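A minimal implementation of the definition; the argument names are mine, and pi0_probs are the logged propensities π₀(yᵢ | xᵢ).

```python
import numpy as np

def ips_utility(delta, pi_probs, pi0_probs):
    """U_hat_IPS(pi) = 1/n * sum_i delta_i * pi(y_i|x_i) / pi_0(y_i|x_i)."""
    delta = np.asarray(delta, dtype=float)
    w = np.asarray(pi_probs, dtype=float) / np.asarray(pi0_probs, dtype=float)
    return float(np.mean(delta * w))

# Sanity check: with pi = pi_0 every weight is 1, recovering the
# on-policy A/B-test average (1/n) * sum_i delta_i.
d = [1.0, 0.0, 1.0]
assert ips_utility(d, [0.5, 0.2, 0.8], [0.5, 0.2, 0.8]) == np.mean(d)
```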
Illustration of IPS
• IPS estimator:
$$\hat{U}_{IPS}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}\, \delta_i$$
• Unbiased: if
$$\forall x, y:\; \pi(y \mid x)\, P(x) > 0 \;\Rightarrow\; \pi_0(y \mid x) > 0,$$
then
$$E\big[\hat{U}_{IPS}(\pi)\big] = U(\pi).$$
[Figure: overlap of the logging distribution π₀(Y|x) and the target distribution π(Y|x)]
IPS Estimator is Unbiased
$$E\big[\hat{U}_{IPS}(\pi)\big] = \sum_{x_1, y_1} \pi_0(y_1 \mid x_1) P(x_1) \cdots \sum_{x_n, y_n} \pi_0(y_n \mid x_n) P(x_n) \left[ \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}\, \delta(x_i, y_i) \right]$$
(expectation over the independent draws (xᵢ, yᵢ) ∼ π₀(y|x) P(x))
$$= \frac{1}{n} \sum_{i=1}^{n} \sum_{x_i, y_i} \pi_0(y_i \mid x_i)\, P(x_i)\, \frac{\pi(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}\, \delta(x_i, y_i)$$
(marginalize out the draws j ≠ i)
$$= \frac{1}{n} \sum_{i=1}^{n} \sum_{x_i, y_i} P(x_i)\, \pi(y_i \mid x_i)\, \delta(x_i, y_i)$$
(cancel π₀; full support guarantees π₀(yᵢ | xᵢ) > 0 wherever π puts mass)
$$= \frac{1}{n} \sum_{i=1}^{n} U(\pi) = U(\pi)$$
(the draws are identically distributed)
An empirical check follows below.
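An empirical check of unbiasedness on a made-up two-context, two-action world (all numbers hypothetical): the average of many IPS estimates should match the exact U(π).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy contextual bandit (hypothetical): 2 contexts, 2 actions.
P_x = np.array([0.6, 0.4])
delta = np.array([[1.0, 0.0], [0.1, 0.9]])   # delta(x, y)
pi0 = np.array([[0.7, 0.3], [0.5, 0.5]])     # logging policy (full support)
pi = np.array([[0.2, 0.8], [0.9, 0.1]])      # target policy

U_true = float(np.sum(P_x[:, None] * pi * delta))  # exact U(pi)

def ips_once(n=500):
    """One logged sample under pi_0 and its IPS estimate of U(pi)."""
    xs = rng.choice(2, size=n, p=P_x)
    ys = np.array([rng.choice(2, p=pi0[x]) for x in xs])
    return float(np.mean(delta[xs, ys] * pi[xs, ys] / pi0[xs, ys]))

estimates = [ips_once() for _ in range(200)]
print(U_true, np.mean(estimates))   # mean of IPS estimates ~= U(pi)
```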
News Recommender: Results
[Figure: IPS eventually beats RP; the estimation error decays at the usual O(1/√n) rate]
Counterfactual Policy Evaluation
• Controlled experiment setting:
 – Log data: D = ((x₁, y₁, δ₁, p₁), …, (xₙ, yₙ, δₙ, pₙ))
• Observational setting:
 – Log data: D = ((x₁, y₁, δ₁, z₁), …, (xₙ, yₙ, δₙ, zₙ))
 – Estimate propensities pᵢ = π(yᵢ | xᵢ, zᵢ) based on xᵢ and other confounders zᵢ (a sketch follows below)
• Goal: estimate the average treatment effect of a new policy π.
 – IPS estimator
$$\hat{U}(\pi) = \frac{1}{n} \sum_{i} \delta_i\, \frac{\pi(y_i \mid x_i)}{p_i}$$
 – … or many others.
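A sketch of propensity estimation in the observational setting with a logistic model (scikit-learn). The confounders z and the true assignment mechanism are synthetic, and unconfoundedness (z captures everything that influenced the assignment) is assumed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical observational log: confounders z_i and logged binary
# treatments y_i whose assignment mechanism is unknown to us.
n = 2000
Z = rng.normal(size=(n, 5))
y_logged = rng.binomial(1, 1 / (1 + np.exp(-Z[:, 0])))  # hidden mechanism

# Estimate pi_0(y = 1 | z) with a logistic model, then read off the
# estimated propensity of the action that was actually taken.
model = LogisticRegression().fit(Z, y_logged)
p1 = model.predict_proba(Z)[:, 1]
p_hat = np.where(y_logged == 1, p1, 1 - p1)   # estimated p_i
```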
Evaluation: Summary
• Evaluating Online Metrics Offline
 – A/B testing (on-policy)
 – Counterfactual estimation from logs (off-policy)
• Approach 1: "Model the world"
 – Estimation via reward prediction
 – Pro: low variance
 – Con: model mismatch can lead to high bias
• Approach 2: "Model the bias"
 – Counterfactual model
 – Inverse propensity scoring (IPS) estimator
 – Pro: unbiased for known propensities
 – Con: large variance
From Evaluation to Learning
• Naïve "Model the World" learning:
 – Learn δ̂: X × Y → ℝ
 – Derive the policy:
$$\pi(x) = \operatorname{argmin}_{y'} \hat{\delta}(x, y')$$
• Naïve "Model the Bias" learning:
 – Find the policy that optimizes the IPS training error (sketched below):
$$\hat{\pi} = \operatorname{argmin}_{\pi'} \sum_{i} \frac{\pi'(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}\, \delta_i$$
 – (Here δ encodes a loss; for rewards, use argmax.)
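A sketch of the naïve "model the bias" learner on a tiny synthetic log (uniform logging, made-up ground truth): enumerate a small deterministic policy class and pick the IPS-risk minimizer.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Hypothetical uniform-logging bandit log: 3 contexts, 2 actions;
# delta_i is a loss (1 if the logged action was suboptimal, else 0).
n = 5000
xs = rng.integers(0, 3, size=n)
ys = rng.integers(0, 2, size=n)
p = np.full(n, 0.5)                    # pi_0(y_i|x_i), uniform logging
optimal = np.array([0, 1, 0])          # ground truth, unknown in practice
delta = (ys != optimal[xs]).astype(float)

def ips_risk(policy):
    """IPS training error of a deterministic policy (tuple: context -> action)."""
    match = (ys == np.asarray(policy)[xs]).astype(float)
    return float(np.mean(match / p * delta))

# Enumerate the 2^3 deterministic policies and pick the IPS minimizer.
best = min(product([0, 1], repeat=3), key=ips_risk)
print(best)   # recovers (0, 1, 0)
```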
Outline of Class
• Counterfactual and causal inference
• Evaluation
 – Improved counterfactual estimators
 – Applications in recommender systems, etc.
 – Dealing with missing propensities, randomization, etc.
• Learning
 – Batch learning from bandit feedback
 – Dealing with combinatorial and continuous action spaces
 – Learning theory
 – More general learning with partial-information data (e.g. ranking, embedding)