Always Valid Inference (Ramesh Johari, Stanford)

Always Valid InferenceBringing Sequential Analysis to A/B Testing

Ramesh Johari | Leo Pekelis | David WalshStanford University / Optimizely

[email protected]

25 May 2016

1 / 29

A/B testing

▶ Visitors randomized toVariation A (control) or B(treatment).

▶ qA, qB: true conversionrates in each group

▶ Hypothesis test:

H0 : θ = 0 (Null)H1 : θ ̸= 0 (Alternative)

where θ = qA − qB.

2 / 29

How it works

Fixed horizon testing:▶ The usermust set sample size N in advance.▶ After each new visitor, the interface computes a p-value:

pn = P(data at least as "extreme" as current sample|H0).

▶ Decision rule: Reject H0 if p-value pN ≤ α.

Remarks:▶ This bounds Type I error (false positive probability) at

level α.▶ α trades off false positives and false negatives.

3 / 29

Continuous Monitoring

4 / 29

Continuous monitoring

In practice:

Technology makes it convenient tocontinuouslymonitor tests!

E.g., results matrix:

5 / 29

The problemwith peeking

Unfortunately this dramatically inflates Type I errors!

Example: A sample path from an A/A test:

6 / 29

The problemwith peekingWith arbitrarily large horizon, Type I error is guaranteed.

Even on finite horizons, Type I errors are highly inflated:

7 / 29

Always Valid Statistics

8 / 29

Always valid p-values

A desired approach is one which decouplesa user's interaction model fromproviding statistically valid results.

Initial goal:▶ A user should be able to look at

their resultswhenever they want.▶ The p-value at that time should

give valid type I error control.

9 / 29

Always valid p-values

DefinitionA (fixed-horizon) p-value process is a (data-dependent)sequence pn such that for all n and all x ∈ [0, 1]:

P0(pn ≤ x) ≤ x.

DefinitionA p-value process is always valid if for any data-dependentstopping time T and all x ∈ [0, 1]:

P0(pT ≤ x) ≤ x.

Allows user to favorably bias the choice of T based on thedata that is seen.

10 / 29

Constructing Always Valid p-values

11 / 29

Sequential tests

DefinitionA sequential test {Tα} is a data-dependent rule for stoppingthe test and rejecting the null that:1. stops the test later when α is lower; and2. stops with probability≤ αwhen the null is true:

P0(Tα < ∞) ≤ α.

12 / 29

Constructing always valid p-values

TheoremGiven a sequential test, define the p-value pn to be:

the smallest α such that the α-level testwould have stopped by observation n.

Then the resulting p-value process is always valid.

13 / 29

Duality

Note that if a user stops the first time that the always validp-value drops below α, then the stopping time is Tα.

This natural stopping policy thus implements the originalsequential test.

We now ask: What family of sequential tests should we use?

14 / 29

Power and Run Length

15 / 29

High power and fast detectionTo choose a particular class of always valid p-values, we haveto return to what the user is trying to do.

Here is one way to view it:1. We choose a method of computing p-values.2. The user observes p-values, and wants to detect whether

a nonzero effect exists (reject H0).3. The user determines when they want to stop the test, up

to a maximum run time of N.

So what the user wants is:1. High power. In other words, high Pθ( reject the null ) for

nonzero θ.2. Fast detection. As quickly as possible (up to time N).

16 / 29

High power and fast detectionTo choose a particular class of always valid p-values, we haveto return to what the user is trying to do.

Here is one way to view it:1. We choose a method of computing p-values.2. The user observes p-values, and wants to detect whether

a nonzero effect exists (reject H0).3. The user determines when they want to stop the test, up

to a maximum run time of N.So what the user wants is:1. High power. In other words, high Pθ( reject the null ) for

nonzero θ.2. Fast detection. As quickly as possible (up to time N).

16 / 29

Special case: N(θ, 1) data

For simplicity: assume data generated fromN (θ, 1)distribution, where θ is unknown.

Consider testing:H0 : θ = 0

H1 : θ ̸= 0

Notation: Let Ln(θ, θ0; x, n) be LR of θ against θ0 = 0,given n observations with sample mean x.

17 / 29

ThemSPRT

Let H ∼ N(0, σ2), and consider:

Ln =

∫Ln(θ, θ0;Xn, n)dH(θ).

Define:Tα = inf

{n : Ln ≥ 1

α

}.

This test has high power.

Proposition (Robbins and Siegmund){Tα} is a sequential test of power one: themixture sequentialprobability ratio test (mSPRT).

18 / 29

Why it works

Under the null:▶ Typical fluctuations of sample mean are of size 1/

√n, so

any decision rule of the form:

Reject if |sample mean| > k/√

n

is bound to eventually reject. This is what fixed horizontesting (e.g., z-test) will do.

▶ In fact, by law of the iterated logarithm, the boundary√2 log log n/

√n is crossed infinitely often.

▶ mSPRT leads to boundary of the form C√

log n/√

n:▶ Goes to zero (so power one);▶ But slowly enough (so Type I error can be controlled).

19 / 29

Why it works

Graphical illustration:

20 / 29

Fast detection

In a Bayesian analysis, we show that in the limit of α → 0:

The optimal choice of mixing distribution in the mSPRTinvolves (roughly)matching the mixing variance σ2 to theprior variance of the effect size θ.

21 / 29

Run length

22 / 29

No free lunch?

What do we give up in return for continuous monitoring?▶ If the effect size is known in advance,

should only be better!▶ In practice, we don't know the effect size in advance.

The test we designed does not assume knowledge of theeffect size.

We compare our test to a fixed horizon test,using data from Optimizely.

23 / 29

Run lengths on OptimizelyOur results show robustness to not knowing the effect size:

24 / 29

Run lengths: Interpretation

Our results show robustness to not knowing the effect size.

Intuition:▶ Detecting an effect of size∆ takes

a run length proportional to 1/∆2

▶ So the penalty for guessing wrongabout δ is very high!

▶ An MDE that is 2x too small =⇒run length that is 4x too long

25 / 29

Run lengths: Theory

Suppose that effect size is drawn from a normal distribution.

In an appropriate scaling regime where α → 0 and N → ∞,we show that mSPRT at level α truncated toΘ(N) givessimilar power as fixed horizon test of length N, but withdetection time that is o(N).

26 / 29

Conclusions

27 / 29

Experimentation in the Internet age

Rapid innovation ininformation & communication technologyhas democratized the scientific method.

Our goal: "adapt" statistical methodology toact in partnership with the user.

Additional results:▶ Extension to multiple testing (Benjamini-Hochberg)▶ Extension to adaptive allocation (bandits)▶ Extension to confidence intervals

28 / 29

Optimizely Stats Engine

▶ The ideas presented in this talk were released to all ofOptimizely's customers on January 20, 2015

▶ Provides both always valid p-valuesandmultiple testing corrections

29 / 29

Always Valid Inference (Ramesh Johari, Stanford)

Technology

Transcript of Always Valid Inference (Ramesh Johari, Stanford)