Always Valid Inference (Ramesh Johari, Stanford)
-
Upload
hakka-labs -
Category
Technology
-
view
1.215 -
download
1
Transcript of Always Valid Inference (Ramesh Johari, Stanford)
Always Valid InferenceBringing Sequential Analysis to A/B Testing
Ramesh Johari | Leo Pekelis | David WalshStanford University / Optimizely
25 May 2016
1 / 29
A/B testing
▶ Visitors randomized toVariation A (control) or B(treatment).
▶ qA, qB: true conversionrates in each group
▶ Hypothesis test:
H0 : θ = 0 (Null)H1 : θ ̸= 0 (Alternative)
where θ = qA − qB.
2 / 29
How it works
Fixed horizon testing:▶ The usermust set sample size N in advance.▶ After each new visitor, the interface computes a p-value:
pn = P(data at least as "extreme" as current sample|H0).
▶ Decision rule: Reject H0 if p-value pN ≤ α.
Remarks:▶ This bounds Type I error (false positive probability) at
level α.▶ α trades off false positives and false negatives.
3 / 29
How it works
Fixed horizon testing:▶ The usermust set sample size N in advance.▶ After each new visitor, the interface computes a p-value:
pn = P(data at least as "extreme" as current sample|H0).
▶ Decision rule: Reject H0 if p-value pN ≤ α.
Remarks:▶ This bounds Type I error (false positive probability) at
level α.▶ α trades off false positives and false negatives.
3 / 29
Continuous Monitoring
4 / 29
Continuous monitoring
In practice:
Technology makes it convenient tocontinuouslymonitor tests!
E.g., results matrix:
5 / 29
The problemwith peeking
Unfortunately this dramatically inflates Type I errors!
Example: A sample path from an A/A test:
6 / 29
The problemwith peekingWith arbitrarily large horizon, Type I error is guaranteed.
Even on finite horizons, Type I errors are highly inflated:
7 / 29
Always Valid Statistics
8 / 29
Always valid p-values
A desired approach is one which decouplesa user's interaction model fromproviding statistically valid results.
Initial goal:▶ A user should be able to look at
their resultswhenever they want.▶ The p-value at that time should
give valid type I error control.
9 / 29
Always valid p-values
DefinitionA (fixed-horizon) p-value process is a (data-dependent)sequence pn such that for all n and all x ∈ [0, 1]:
P0(pn ≤ x) ≤ x.
DefinitionA p-value process is always valid if for any data-dependentstopping time T and all x ∈ [0, 1]:
P0(pT ≤ x) ≤ x.
Allows user to favorably bias the choice of T based on thedata that is seen.
10 / 29
Always valid p-values
DefinitionA (fixed-horizon) p-value process is a (data-dependent)sequence pn such that for all n and all x ∈ [0, 1]:
P0(pn ≤ x) ≤ x.
DefinitionA p-value process is always valid if for any data-dependentstopping time T and all x ∈ [0, 1]:
P0(pT ≤ x) ≤ x.
Allows user to favorably bias the choice of T based on thedata that is seen.
10 / 29
Always valid p-values
DefinitionA (fixed-horizon) p-value process is a (data-dependent)sequence pn such that for all n and all x ∈ [0, 1]:
P0(pn ≤ x) ≤ x.
DefinitionA p-value process is always valid if for any data-dependentstopping time T and all x ∈ [0, 1]:
P0(pT ≤ x) ≤ x.
Allows user to favorably bias the choice of T based on thedata that is seen.
10 / 29
Constructing Always Valid p-values
11 / 29
Sequential tests
DefinitionA sequential test {Tα} is a data-dependent rule for stoppingthe test and rejecting the null that:1. stops the test later when α is lower; and2. stops with probability≤ αwhen the null is true:
P0(Tα < ∞) ≤ α.
12 / 29
Constructing always valid p-values
TheoremGiven a sequential test, define the p-value pn to be:
the smallest α such that the α-level testwould have stopped by observation n.
Then the resulting p-value process is always valid.
13 / 29
Duality
Note that if a user stops the first time that the always validp-value drops below α, then the stopping time is Tα.
This natural stopping policy thus implements the originalsequential test.
We now ask: What family of sequential tests should we use?
14 / 29
Power and Run Length
15 / 29
High power and fast detectionTo choose a particular class of always valid p-values, we haveto return to what the user is trying to do.
Here is one way to view it:1. We choose a method of computing p-values.2. The user observes p-values, and wants to detect whether
a nonzero effect exists (reject H0).3. The user determines when they want to stop the test, up
to a maximum run time of N.
So what the user wants is:1. High power. In other words, high Pθ( reject the null ) for
nonzero θ.2. Fast detection. As quickly as possible (up to time N).
16 / 29
High power and fast detectionTo choose a particular class of always valid p-values, we haveto return to what the user is trying to do.
Here is one way to view it:1. We choose a method of computing p-values.2. The user observes p-values, and wants to detect whether
a nonzero effect exists (reject H0).3. The user determines when they want to stop the test, up
to a maximum run time of N.So what the user wants is:1. High power. In other words, high Pθ( reject the null ) for
nonzero θ.2. Fast detection. As quickly as possible (up to time N).
16 / 29
Special case: N(θ, 1) data
For simplicity: assume data generated fromN (θ, 1)distribution, where θ is unknown.
Consider testing:H0 : θ = 0
H1 : θ ̸= 0
Notation: Let Ln(θ, θ0; x, n) be LR of θ against θ0 = 0,given n observations with sample mean x.
17 / 29
ThemSPRT
Let H ∼ N(0, σ2), and consider:
Ln =
∫Ln(θ, θ0;Xn, n)dH(θ).
Define:Tα = inf
{n : Ln ≥ 1
α
}.
This test has high power.
Proposition (Robbins and Siegmund){Tα} is a sequential test of power one: themixture sequentialprobability ratio test (mSPRT).
18 / 29
Why it works
Under the null:▶ Typical fluctuations of sample mean are of size 1/
√n, so
any decision rule of the form:
Reject if |sample mean| > k/√
n
is bound to eventually reject. This is what fixed horizontesting (e.g., z-test) will do.
▶ In fact, by law of the iterated logarithm, the boundary√2 log log n/
√n is crossed infinitely often.
▶ mSPRT leads to boundary of the form C√
log n/√
n:▶ Goes to zero (so power one);▶ But slowly enough (so Type I error can be controlled).
19 / 29
Why it works
Under the null:▶ Typical fluctuations of sample mean are of size 1/
√n, so
any decision rule of the form:
Reject if |sample mean| > k/√
n
is bound to eventually reject. This is what fixed horizontesting (e.g., z-test) will do.
▶ In fact, by law of the iterated logarithm, the boundary√2 log log n/
√n is crossed infinitely often.
▶ mSPRT leads to boundary of the form C√
log n/√
n:▶ Goes to zero (so power one);▶ But slowly enough (so Type I error can be controlled).
19 / 29
Why it works
Under the null:▶ Typical fluctuations of sample mean are of size 1/
√n, so
any decision rule of the form:
Reject if |sample mean| > k/√
n
is bound to eventually reject. This is what fixed horizontesting (e.g., z-test) will do.
▶ In fact, by law of the iterated logarithm, the boundary√2 log log n/
√n is crossed infinitely often.
▶ mSPRT leads to boundary of the form C√
log n/√
n:▶ Goes to zero (so power one);▶ But slowly enough (so Type I error can be controlled).
19 / 29
Why it works
Graphical illustration:
20 / 29
Fast detection
In a Bayesian analysis, we show that in the limit of α → 0:
The optimal choice of mixing distribution in the mSPRTinvolves (roughly)matching the mixing variance σ2 to theprior variance of the effect size θ.
21 / 29
Run length
22 / 29
No free lunch?
What do we give up in return for continuous monitoring?▶ If the effect size is known in advance,
should only be better!▶ In practice, we don't know the effect size in advance.
The test we designed does not assume knowledge of theeffect size.
We compare our test to a fixed horizon test,using data from Optimizely.
23 / 29
Run lengths on OptimizelyOur results show robustness to not knowing the effect size:
24 / 29
Run lengths: Interpretation
Our results show robustness to not knowing the effect size.
Intuition:▶ Detecting an effect of size∆ takes
a run length proportional to 1/∆2
▶ So the penalty for guessing wrongabout δ is very high!
▶ An MDE that is 2x too small =⇒run length that is 4x too long
25 / 29
Run lengths: Theory
Suppose that effect size is drawn from a normal distribution.
In an appropriate scaling regime where α → 0 and N → ∞,we show that mSPRT at level α truncated toΘ(N) givessimilar power as fixed horizon test of length N, but withdetection time that is o(N).
26 / 29
Conclusions
27 / 29
Experimentation in the Internet age
Rapid innovation ininformation & communication technologyhas democratized the scientific method.
Our goal: "adapt" statistical methodology toact in partnership with the user.
Additional results:▶ Extension to multiple testing (Benjamini-Hochberg)▶ Extension to adaptive allocation (bandits)▶ Extension to confidence intervals
28 / 29
Optimizely Stats Engine
▶ The ideas presented in this talk were released to all ofOptimizely's customers on January 20, 2015
▶ Provides both always valid p-valuesandmultiple testing corrections
29 / 29