Transcript of "Measuring Sample Quality with Kernels" (lmackey/papers/ksd-slides.pdf)

Page 1:

Measuring Sample Quality with Kernels

Lester Mackey∗

Joint work with Jackson Gorham†

Microsoft Research∗, Opendoor Labs†

June 25, 2018

Page 2:

Motivation: Large-scale Posterior Inference

Example: Bayesian logistic regression
1. Fixed covariate vector: v_l ∈ R^d for each datapoint l = 1, ..., L
2. Unknown parameter vector: β ∼ N(0, I)
3. Binary class label: Y_l | v_l, β ∼ind Ber(1 / (1 + e^{−⟨β, v_l⟩}))

Generative model simple to express
Posterior distribution over unknown parameters is complex

Normalization constant unknown, exact integration intractable

Standard inferential approach: Use Markov chain Monte Carlo (MCMC) to (eventually) draw samples from the posterior distribution

Benefit: Approximates intractable posterior expectations E_P[h(Z)] = ∫_X p(x) h(x) dx with asymptotically exact sample estimates E_Q[h(X)] = (1/n) ∑_{i=1}^n h(x_i)

Problem: Each new MCMC sample point x_i requires iterating over the entire observed dataset: prohibitive when the dataset is large!
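
To make the setup concrete, here is a minimal Python sketch (not from the slides) of the unnormalized log posterior and its gradient for this Bayesian logistic regression model; the covariate matrix V, label vector y, and all shapes below are illustrative assumptions. The gradient ∇ log p is the only ingredient the Stein discrepancies introduced later will need.

```python
import numpy as np

def log_posterior(beta, V, y):
    """Unnormalized log posterior: N(0, I) prior on beta plus the
    Bernoulli log likelihood with success probability 1/(1 + e^{-<beta, v_l>})."""
    logits = V @ beta                               # <beta, v_l> for every datapoint
    log_lik = np.sum(y * logits - np.logaddexp(0.0, logits))
    log_prior = -0.5 * np.sum(beta ** 2)
    return log_prior + log_lik                      # normalizing constant omitted

def grad_log_posterior(beta, V, y):
    """Score function grad_beta log p(beta | data); the unknown normalizer drops out."""
    probs = 1.0 / (1.0 + np.exp(-(V @ beta)))       # predicted P(Y_l = 1 | v_l, beta)
    return -beta + V.T @ (y - probs)

# Illustrative usage with random data (shapes only; not a real dataset)
rng = np.random.default_rng(0)
L, d = 1000, 5
V = rng.standard_normal((L, d))
y = rng.integers(0, 2, size=L).astype(float)
print(grad_log_posterior(np.zeros(d), V, y))
```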

Page 3:

Motivation: Large-scale Posterior Inference

Question: How do we scale Markov chain Monte Carlo (MCMC) posterior inference to massive datasets?

MCMC Benefit: Approximates intractable posterior expectations E_P[h(Z)] = ∫_X p(x) h(x) dx with asymptotically exact sample estimates E_Q[h(X)] = (1/n) ∑_{i=1}^n h(x_i)

Problem: Each point x_i requires iterating over the entire dataset!

Template solution: Approximate MCMC with subset posteriors [Welling and Teh, 2011, Ahn, Korattikara, and Welling, 2012, Korattikara, Chen, and Welling, 2014]

Approximate the standard MCMC procedure in a manner that makes use of only a small subset of datapoints per sample
Reduced computational overhead leads to faster sampling and reduced Monte Carlo variance
Introduces asymptotic bias: target distribution is not stationary
Hope that for a fixed amount of sampling time, variance reduction will outweigh the bias introduced

Page 4:

Motivation: Large-scale Posterior Inference

Template solution: Approximate MCMC with subset posteriors [Welling and Teh, 2011, Ahn, Korattikara, and Welling, 2012, Korattikara, Chen, and Welling, 2014]

Hope that for a fixed amount of sampling time, variance reduction will outweigh the bias introduced

Introduces new challenges

How do we compare and evaluate samples from approximate MCMC procedures?

How do we select samplers and their tuning parameters?

How do we quantify the bias-variance trade-off explicitly?

Difficulty: Standard evaluation criteria like effective sample size, trace plots, and variance diagnostics assume convergence to the target distribution and do not account for asymptotic bias

This talk: Introduce new measures suitable for comparing the quality of approximate MCMC samples

Page 5:

Quality Measures for Samples

Challenge: Develop a measure suitable for comparing the quality of any two samples approximating a common target distribution

Given

Continuous target distribution P with support X = R^d and density p

p known up to normalization, integration under P is intractable

Sample points x_1, ..., x_n ∈ X
Define the discrete distribution Q_n with, for any function h, E_{Q_n}[h(X)] = (1/n) ∑_{i=1}^n h(x_i) used to approximate E_P[h(Z)]

We make no assumption about the provenance of the x_i

Goal: Quantify how well E_{Q_n} approximates E_P in a manner that

I. Detects when a sample sequence is converging to the target

II. Detects when a sample sequence is not converging to the target

III. Is computationally feasible

Page 6:

Integral Probability Metrics

Goal: Quantify how well E_{Q_n} approximates E_P
Idea: Consider an integral probability metric (IPM) [Muller, 1997]

d_H(Q_n, P) = sup_{h ∈ H} |E_{Q_n}[h(X)] − E_P[h(Z)]|

Measures the maximum discrepancy between sample and target expectations over a class of real-valued test functions H
When H is sufficiently large, convergence of d_H(Q_n, P) to zero implies (Q_n)_{n≥1} converges weakly to P (Requirement II)

Examples

Bounded Lipschitz (or Dudley) metric, d_{BL_{‖·‖}} (H = BL_{‖·‖} ≜ {h : sup_x |h(x)| + sup_{x≠y} |h(x) − h(y)| / ‖x − y‖ ≤ 1})

Wasserstein (or Kantorovich-Rubinstein) distance, d_{W_{‖·‖}} (H = W_{‖·‖} ≜ {h : sup_{x≠y} |h(x) − h(y)| / ‖x − y‖ ≤ 1})
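
As a point of reference, the one-dimensional Wasserstein distance between two empirical samples can be estimated with SciPy; the sketch below assumes we can draw from P directly, which is exactly what is unavailable for the posterior targets above (see the next slide).

```python
import numpy as np
from scipy.stats import wasserstein_distance

# 1-D illustration only: estimate d_W between a sample Qn and draws from P.
rng = np.random.default_rng(0)
target_draws = rng.standard_normal(5000)   # stands in for P (assumed samplable here)
sample = rng.standard_normal(500) + 0.5    # a biased sample Qn
print(wasserstein_distance(sample, target_draws))
```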

Page 7:

Integral Probability Metrics

Goal: Quantify how well E_{Q_n} approximates E_P
Idea: Consider an integral probability metric (IPM) [Muller, 1997]

d_H(Q_n, P) = sup_{h ∈ H} |E_{Q_n}[h(X)] − E_P[h(Z)]|

Measures the maximum discrepancy between sample and target expectations over a class of real-valued test functions H
When H is sufficiently large, convergence of d_H(Q_n, P) to zero implies (Q_n)_{n≥1} converges weakly to P (Requirement II)

Problem: Integration under P intractable!
⇒ Most IPMs cannot be computed in practice

Idea: Only consider functions with E_P[h(Z)] known a priori to be 0
Then IPM computation only depends on Q_n!
How do we select this class of test functions?
Will the resulting discrepancy measure track sample sequence convergence (Requirements I and II)?
How do we solve the resulting optimization problem in practice?

Page 8:

Stein’s Method

Stein's method [1972] provides a recipe for controlling convergence:

1. Identify an operator T and a set G of functions g : X → R^d with E_P[(T g)(Z)] = 0 for all g ∈ G. T and G together define the Stein discrepancy [Gorham and Mackey, 2015]

S(Q_n, T, G) ≜ sup_{g ∈ G} |E_{Q_n}[(T g)(X)]| = d_{T G}(Q_n, P),

an IPM-type measure with no explicit integration under P

2. Lower bound S(Q_n, T, G) by a reference IPM d_H(Q_n, P)
⇒ S(Q_n, T, G) → 0 only if (Q_n)_{n≥1} converges to P (Req. II)

Performed once, in advance, for large classes of distributions

3. Upper bound S(Q_n, T, G) by any means necessary to demonstrate convergence to 0 (Requirement I)

Standard use: As an analytical tool to prove convergence
Our goal: Develop the Stein discrepancy into a practical quality measure

Page 9:

Identifying a Stein Operator T

Goal: Identify an operator T for which E_P[(T g)(Z)] = 0 for all g ∈ G
Approach: Generator method of Barbour [1988, 1990], Gotze [1991]

Identify a Markov process (Z_t)_{t≥0} with stationary distribution P
Under mild conditions, its infinitesimal generator

(A u)(x) = lim_{t→0} (E[u(Z_t) | Z_0 = x] − u(x)) / t

satisfies E_P[(A u)(Z)] = 0

Overdamped Langevin diffusion: dZ_t = (1/2) ∇ log p(Z_t) dt + dW_t

Generator: (A_P u)(x) = (1/2) ⟨∇u(x), ∇ log p(x)⟩ + (1/2) ⟨∇, ∇u(x)⟩

Stein operator: (T_P g)(x) ≜ ⟨g(x), ∇ log p(x)⟩ + ⟨∇, g(x)⟩ [Gorham and Mackey, 2015, Oates, Girolami, and Chopin, 2016]

Depends on P only through ∇ log p; computable even if p cannot be normalized!
Multivariate generalization of the density method operator (T g)(x) = g(x) (d/dx) log p(x) + g′(x) [Stein, Diaconis, Holmes, and Reinert, 2004]
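
A quick Monte Carlo sanity check (a sketch, not from the slides) of the zero-mean property E_P[(T_P g)(Z)] = 0 for a standard Gaussian target, where ∇ log p(x) = −x, using an arbitrary smooth choice g(x) = tanh(x) applied coordinatewise:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((200_000, 2))   # draws from P = N(0, I_2)
g = np.tanh(Z)                          # g_j(x) = tanh(x_j), a smooth bounded choice
div_g = 1.0 - g ** 2                    # d g_j / d x_j = 1 - tanh(x_j)^2
stein_values = np.sum(g * (-Z), axis=1) + np.sum(div_g, axis=1)   # (T_P g)(Z)
print(stein_values.mean())              # approximately 0, up to Monte Carlo error
```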

Page 10:

Identifying a Stein Set G

Goal: Identify a set G for which E_P[(T_P g)(Z)] = 0 for all g ∈ G

Approach: Reproducing kernels k : X × X → R
A reproducing kernel k is symmetric (k(x, y) = k(y, x)) and positive semidefinite (∑_{i,l} c_i c_l k(z_i, z_l) ≥ 0 for all z_i ∈ X, c_i ∈ R)

Gaussian kernel k(x, y) = e^{−(1/2)‖x−y‖₂²}

Inverse multiquadric kernel k(x, y) = (1 + ‖x − y‖₂²)^{−1/2}

Generates a reproducing kernel Hilbert space (RKHS) K_k
We define the kernel Stein set G_{k,‖·‖} as vector-valued g with

Each component g_j in K_k
Component norms ‖g_j‖_{K_k} jointly bounded by 1

E_P[(T_P g)(Z)] = 0 for all g ∈ G_{k,‖·‖} under mild conditions [Gorham and Mackey, 2017]

Page 11:

Computing the Kernel Stein Discrepancy

Kernel Stein discrepancy (KSD) S(Q_n, T_P, G_{k,‖·‖})
Stein operator (T_P g)(x) ≜ ⟨g(x), ∇ log p(x)⟩ + ⟨∇, g(x)⟩
Stein set G_{k,‖·‖} ≜ {g = (g_1, ..., g_d) | ‖v‖_* ≤ 1 for v_j ≜ ‖g_j‖_{K_k}}

Benefit: Computable in closed form [Gorham and Mackey, 2017]

S(Q_n, T_P, G_{k,‖·‖}) = ‖w‖ for w_j ≜ √(∑_{i,i′=1}^n k_0^j(x_i, x_{i′})).

Reduces to parallelizable pairwise evaluations of Stein kernels

k_0^j(x, y) ≜ (1 / (p(x) p(y))) ∇_{x_j} ∇_{y_j} (p(x) k(x, y) p(y))

Stein set choice inspired by the control functional kernels k_0 = ∑_{j=1}^d k_0^j of Oates, Girolami, and Chopin [2016]

When ‖·‖ = ‖·‖₂, recovers the KSD of Chwialkowski, Strathmann, and Gretton [2016], Liu, Lee, and Jordan [2016]

To ease notation, will use G_k ≜ G_{k,‖·‖₂} in the remainder of the talk
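
Below is a minimal NumPy sketch of this closed form for the IMQ kernel k(x, y) = (c² + ‖x − y‖₂²)^β with the ‖·‖₂ Stein set and equal sample weights 1/n; it is an illustration under those assumptions, not the authors' code. It uses the standard expansion of the Stein kernel, k_0(x, y) = ⟨∇ log p(x), ∇ log p(y)⟩ k(x, y) + ⟨∇ log p(x), ∇_y k(x, y)⟩ + ⟨∇ log p(y), ∇_x k(x, y)⟩ + tr(∇_x ∇_y k(x, y)).

```python
import numpy as np

def imq_stein_kernel(X, score, beta=-0.5, c=1.0):
    """Pairwise Stein kernel k_0(x_i, x_j) for the IMQ base kernel
    k(x, y) = (c^2 + ||x - y||^2)^beta; score(X) returns grad log p at each row."""
    n, d = X.shape
    S = score(X)                                   # (n, d) matrix of scores
    diff = X[:, None, :] - X[None, :, :]           # pairwise differences x_i - x_j
    sqdist = np.sum(diff ** 2, axis=-1)
    r = c ** 2 + sqdist
    k = r ** beta                                  # base kernel values
    gradx_k = 2 * beta * r[..., None] ** (beta - 1) * diff   # grad_x k; grad_y k = -grad_x k
    term1 = (S @ S.T) * k
    term2 = -np.einsum('id,ijd->ij', S, gradx_k)   # <score(x_i), grad_y k>
    term3 = np.einsum('jd,ijd->ij', S, gradx_k)    # <score(x_j), grad_x k>
    term4 = -4 * beta * (beta - 1) * r ** (beta - 2) * sqdist - 2 * d * beta * r ** (beta - 1)
    return term1 + term2 + term3 + term4

def ksd(X, score, beta=-0.5, c=1.0):
    """KSD of the equally weighted sample X: sqrt(sum_{i,i'} k_0(x_i, x_i')) / n."""
    return np.sqrt(imq_stein_kernel(X, score, beta, c).sum()) / X.shape[0]

# Example: standard Gaussian target, so score(x) = -x
rng = np.random.default_rng(0)
X_good = rng.standard_normal((200, 2))     # drawn from the target
X_bad = X_good + 2.0                       # shifted, off-target sample
print(ksd(X_good, lambda x: -x), ksd(X_bad, lambda x: -x))
```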

Page 12:

Detecting Non-convergence

Goal: Show S(Q_n, T_P, G_k) → 0 only if (Q_n)_{n≥1} converges to P
Let 𝒫 be the set of targets P with Lipschitz ∇ log p and distant strong log concavity (⟨∇ log(p(x)/p(y)), y − x⟩ / ‖x − y‖₂² ≥ k for ‖x − y‖₂ ≥ r)

Includes Gaussian mixtures with common covariance, Bayesian logistic and Student's t regression with Gaussian priors, ...

For a different Stein set G, Gorham, Duncan, Vollmer, and Mackey [2016] showed (Q_n)_{n≥1} converges to P if P ∈ 𝒫 and S(Q_n, T_P, G) → 0

New contribution [Gorham and Mackey, 2017]

Theorem (Univariate KSD detects non-convergence)

Suppose P ∈ 𝒫 and k(x, y) = Φ(x − y) for Φ ∈ C² with a non-vanishing generalized Fourier transform. If d = 1, then S(Q_n, T_P, G_k) → 0 only if (Q_n)_{n≥1} converges weakly to P.

Justifies use of the KSD with Gaussian, Matérn, or inverse multiquadric kernels k in the univariate case

Page 13:

The Importance of Kernel Choice

Goal: Show S(Q_n, T_P, G_k) → 0 only if Q_n converges to P

In higher dimensions, KSDs based on common kernels fail to detect non-convergence, even for Gaussian targets P

Theorem (KSD fails with light kernel tails [Gorham and Mackey, 2017])

Suppose d ≥ 3, P = N(0, I_d), and α ≜ (1/2 − 1/d)^{−1}. If k(x, y) and its derivatives decay at an o(‖x − y‖₂^{−α}) rate as ‖x − y‖₂ → ∞, then S(Q_n, T_P, G_k) → 0 for some (Q_n)_{n≥1} not converging to P.

Gaussian (k(x, y) = e^{−(1/2)‖x−y‖₂²}) and Matérn kernels fail for d ≥ 3

Inverse multiquadric kernels (k(x, y) = (1 + ‖x − y‖₂²)^β) with β < −1 fail for d > 2β / (1 + β)

The violating sample sequences (Q_n)_{n≥1} are simple to construct

Problem: Kernels with light tails ignore excess mass in the tails

Page 14:

The Importance of Tightness

Goal: Show S(Q_n, T_P, G_k) → 0 only if Q_n converges to P

A sequence (Q_n)_{n≥1} is uniformly tight if for every ε > 0, there is a finite number R(ε) such that sup_n Q_n(‖X‖₂ > R(ε)) ≤ ε

Intuitively, no mass in the sequence escapes to infinity

Theorem (KSD detects tight non-convergence [Gorham and Mackey, 2017])

Suppose that P ∈ 𝒫 and k(x, y) = Φ(x − y) for Φ ∈ C² with a non-vanishing generalized Fourier transform. If (Q_n)_{n≥1} is uniformly tight and S(Q_n, T_P, G_k) → 0, then (Q_n)_{n≥1} converges weakly to P.

Good news, but, ideally, the KSD would detect non-tight sequences automatically...

Page 15:

Detecting Non-convergence

Goal: Show S(Q_n, T_P, G_k) → 0 only if Q_n converges to P

Consider the inverse multiquadric (IMQ) kernel

k(x, y) = (c² + ‖x − y‖₂²)^β for some β < 0, c ∈ R.

IMQ KSD fails to detect non-convergence when β < −1

However, IMQ KSD automatically enforces tightness and detects non-convergence when β ∈ (−1, 0)

Theorem (IMQ KSD detects non-convergence [Gorham and Mackey, 2017])

Suppose P ∈ 𝒫 and k(x, y) = (c² + ‖x − y‖₂²)^β for β ∈ (−1, 0). If S(Q_n, T_P, G_k) → 0, then (Q_n)_{n≥1} converges weakly to P.

No extra assumptions on the sample sequence (Q_n)_{n≥1} needed

Intuition: Slow decay rate of the kernel ⇒ unbounded (coercive) test functions in T_P G_k ⇒ non-tight sequences detected

Page 16:

Detecting Convergence

Goal: Show S(Q_n, T_P, G_k) → 0 when Q_n converges to P

Proposition (KSD detects convergence [Gorham and Mackey, 2017])

If k ∈ C_b^{(2,2)} and ∇ log p is Lipschitz and square integrable under P, then S(Q_n, T_P, G_k) → 0 whenever the Wasserstein distance d_{W_{‖·‖₂}}(Q_n, P) → 0.

Covers Gaussian, Matérn, IMQ, and other common bounded kernels k

Page 17:

A Simple Example

[Figure: left panels plot discrepancy value (IMQ KSD, Wasserstein) against the number of sample points n for a sample drawn i.i.d. from the mixture target P and a sample drawn i.i.d. from a single mixture component; right panels plot the recovered Stein functions g and test functions h = T_P g against x.]

Left plot:

For target p(x) ∝ e^{−(1/2)(x+1.5)²} + e^{−(1/2)(x−1.5)²}, compare an i.i.d. sample Q_n from P and an i.i.d. sample Q′_n from one component
Expect S(Q_{1:n}, T_P, G_k) → 0 and S(Q′_{1:n}, T_P, G_k) ↛ 0
Compare IMQ KSD (β = −1/2, c = 1) with the Wasserstein distance
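
A short sketch of this comparison, reusing the ksd helper from the IMQ KSD sketch above (an assumption of this illustration); the mixture score is available in closed form, and the two samples mirror the left plot.

```python
import numpy as np

def mixture_score(X):
    """grad log p for p(x) ∝ exp(-(x+1.5)^2/2) + exp(-(x-1.5)^2/2)."""
    a = np.exp(-0.5 * (X + 1.5) ** 2)
    b = np.exp(-0.5 * (X - 1.5) ** 2)
    return (-(X + 1.5) * a - (X - 1.5) * b) / (a + b)

rng = np.random.default_rng(1)
n = 1000
centers = rng.choice([-1.5, 1.5], size=(n, 1))       # i.i.d. from the mixture target
X_mix = centers + rng.standard_normal((n, 1))
X_one = 1.5 + rng.standard_normal((n, 1))            # i.i.d. from a single component

# Reuses ksd() from the IMQ KSD sketch above (beta = -1/2, c = 1)
print(ksd(X_mix, mixture_score), ksd(X_one, mixture_score))
```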

Page 18:

A Simple Example

[Figure: same panels as the previous slide; the right panels show, for each sample, the optimal Stein functions g (top) and the associated test functions h = T_P g (bottom) plotted against x.]

Right plot: For n = 10³ sample points,

(Top) Recovered optimal Stein functions g

(Bottom) Associated test functions h ≜ T_P g which best discriminate the sample Q_n from the target P

Page 19:

The Importance of Kernel Choice

[Figure: Gaussian, Matérn, and inverse multiquadric kernel Stein discrepancies plotted against the number of sample points n in dimensions d = 5, 8, 20, for an i.i.d. sample from the target and for an off-target sample.]

Target P = N(0, I_d)

Off-target Q_n has all ‖x_i‖₂ ≤ 2 n^{1/d} log n, ‖x_i − x_j‖₂ ≥ 2 log n

Gaussian and Matérn KSDs are driven to 0 by an off-target sequence that does not converge to P

IMQ KSD (β = −1/2, c = 1) does not have this deficiency

Page 20:

Selecting Sampler Hyperparameters

Target posterior density: p(x) ∝ π(x) ∏_{l=1}^L π(y_l | x)

Prior π(x), Likelihood π(y | x)

Approximate slice sampling [DuBois, Korattikara, Welling, and Smyth, 2014]

Approximate MCMC procedure designed for scalability

Uses a random subset of datapoints to approximate each slice sampling step
Target P is not the stationary distribution

Tolerance parameter ε controls the number of datapoints evaluated

ε too small ⇒ too few sample points generated
ε too large ⇒ sampling from a very different distribution
Standard MCMC selection criteria like effective sample size (ESS) and asymptotic variance do not account for this bias

Page 21:

Selecting Sampler Hyperparameters

Setup [Welling and Teh, 2011]

Consider the posterior distribution P induced by L datapoints y_l drawn i.i.d. from a Gaussian mixture likelihood

Y_l | X ∼iid (1/2) N(X_1, 2) + (1/2) N(X_1 + X_2, 2)

under Gaussian priors on the parameters X ∈ R²

X_1 ∼ N(0, 10) ⊥ X_2 ∼ N(0, 1)

Draw m = 100 datapoints y_l with parameters (x_1, x_2) = (0, 1)
Induces a posterior with a second mode at (x_1, x_2) = (1, −1)

For a range of parameters ε, run approximate slice sampling for 148000 datapoint likelihood evaluations and store the resulting posterior sample Q_n

Use minimum IMQ KSD (β = −1/2, c = 1) to select an appropriate ε

Compare with a standard MCMC parameter selection criterion, effective sample size (ESS), a measure of Markov chain autocorrelation
Compute the median of each diagnostic over 50 random sequences
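
In code, the selection rule amounts to computing the KSD of each candidate tolerance's output and keeping the minimizer. The sketch below is schematic: run_approximate_slice_sampler and posterior_score are hypothetical placeholders (not part of any library), and ksd is the helper from the IMQ KSD sketch above.

```python
# Schematic only: run_approximate_slice_sampler(eps) is a hypothetical function that
# returns an (n, 2) array of samples for tolerance eps, and posterior_score(X) a
# hypothetical function returning grad log p(x | data) at each row of X.
tolerances = [0.0, 1e-4, 1e-3, 1e-2, 1e-1]
ksd_by_tol = {}
for eps in tolerances:
    X = run_approximate_slice_sampler(eps)
    ksd_by_tol[eps] = ksd(X, posterior_score)     # reuses ksd() from the sketch above

best_eps = min(ksd_by_tol, key=ksd_by_tol.get)    # tolerance with the smallest KSD
```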

Page 22:

Selecting Sampler Hyperparameters

[Figure: log median ESS (higher is better) and log median IMQ KSD (lower is better) plotted against the tolerance parameter ε, alongside scatter plots of the posterior samples (x_1, x_2) for ε = 0 (n = 230), ε = 10⁻² (n = 416), and ε = 10⁻¹ (n = 1000).]

ESS maximized at tolerance ε = 10⁻¹

IMQ KSD minimized at tolerance ε = 10⁻²

Page 23:

Selecting Samplers

Target posterior density: p(x) ∝ π(x) ∏_{l=1}^L π(y_l | x)

Prior π(x), Likelihood π(y | x)

Stochastic Gradient Fisher Scoring (SGFS) [Ahn, Korattikara, and Welling, 2012]

Approximate MCMC procedure designed for scalability

Approximates the Metropolis-adjusted Langevin algorithm and a continuous-time Langevin diffusion with a preconditioner
Random subset of datapoints used to select each sample
No Metropolis-Hastings correction step
Target P is not the stationary distribution

Two variants

SGFS-f inverts a d × d matrix for each new sample point
SGFS-d inverts a diagonal matrix to reduce sampling time

Page 24:

Selecting Samplers

Setup

MNIST handwritten digits [Ahn, Korattikara, and Welling, 2012]

10000 images, 51 features, binary label indicating whether each image is a 7 or a 9

Bayesian logistic regression posterior P

L independent observations (y_l, v_l) ∈ {1, −1} × R^d with

P(Y_l = 1 | v_l, X) = 1 / (1 + exp(−⟨v_l, X⟩))

Flat improper prior on the parameters X ∈ R^d

Use IMQ KSD (β = −1/2, c = 1) to compare SGFS-f to SGFS-d, drawing 10⁵ sample points and discarding the first half as burn-in

For external support, compare bivariate marginal means and 95% confidence ellipses with a surrogate ground truth Hamiltonian Monte Carlo chain with 10⁵ sample points [Ahn, Korattikara, and Welling, 2012]

Page 25:

Selecting Samplers

[Figure: left panel plots the IMQ kernel Stein discrepancy against the number of sample points n for SGFS-d and SGFS-f; right panels show SGFS sample scatter plots with bivariate marginal means and 95% confidence ellipses for the best- and worst-aligned coordinate pairs.]

Left: IMQ KSD quality comparison for SGFS Bayesian logistic regression (no surrogate ground truth used)

Right: SGFS sample points (n = 5 × 10⁴) with bivariate marginal means and 95% confidence ellipses (blue) that align best and worst with the surrogate ground truth sample (red).

Both suggest the small speed-up of SGFS-d (0.0017s per sample vs. 0.0019s for SGFS-f) is outweighed by the loss in inferential accuracy

Page 26:

Beyond Sample Quality Comparison

Goodness-of-fit testing

Chwialkowski, Strathmann, and Gretton [2016] used the KSD S(Q_n, T_P, G_k) to test whether a sample was drawn from a target distribution P (see also Liu, Lee, and Jordan [2016])
Their test with the default Gaussian kernel k experienced considerable loss of power as the dimension d increased
We recreate their experiment with the IMQ kernel (β = −1/2, c = 1)

For n = 500, generate a sample (x_i)_{i=1}^n with x_i = z_i + u_i e_1, z_i ∼iid N(0, I_d) and u_i ∼iid Unif[0, 1]. Target P = N(0, I_d).
Compare with the standard normality test of Baringhaus and Henze [1988]

Table: Mean power of multivariate normality tests across 400 simulations

          d=2   d=5   d=10  d=15  d=20  d=25
B&H       1.0   1.0   1.0   0.91  0.57  0.26
Gaussian  1.0   1.0   0.88  0.29  0.12  0.02
IMQ       1.0   1.0   1.0   1.0   1.0   1.0
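
The off-target sample in this experiment is easy to reproduce; the sketch below generates it and compares IMQ KSD values against a true Gaussian sample, reusing the ksd helper from the IMQ KSD sketch above. The actual test of Chwialkowski et al. additionally calibrates a rejection threshold with a wild bootstrap, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 15
Z = rng.standard_normal((n, d))            # on-target draws from P = N(0, I_d)
U = rng.uniform(0.0, 1.0, size=n)
X = Z.copy()
X[:, 0] += U                               # x_i = z_i + u_i e_1: perturbed first coordinate

# IMQ KSD under the Gaussian target (score(x) = -x); reuses ksd() from the sketch above
print(ksd(X, lambda x: -x), ksd(Z, lambda x: -x))
```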

Page 27:

Beyond Sample Quality Comparison

Improving sample quality

Given sample points (x_i)_{i=1}^n, can minimize the KSD S(Q_n, T_P, G_k) over all weighted samples Q_n = ∑_{i=1}^n q_n(x_i) δ_{x_i} for q_n a probability mass function

Liu and Lee [2016] do this with the Gaussian kernel k(x, y) = e^{−(1/h)‖x−y‖₂²}

Bandwidth h set to the median of the squared Euclidean distance between pairs of sample points

We recreate their experiment with the IMQ kernel k(x, y) = (1 + (1/h)‖x − y‖₂²)^{−1/2}
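
The reweighting problem is a quadratic program in q: minimize q^T K_0 q over the probability simplex, with K_0 the Stein kernel matrix. Liu and Lee solve it exactly; the sketch below instead uses simple exponentiated-gradient updates, a fixed c = 1 rather than the median bandwidth, and the imq_stein_kernel helper from the IMQ KSD sketch above, so it is an illustration rather than a reproduction of their method.

```python
import numpy as np

def reweight_by_ksd(X, score, steps=500, lr=0.1, beta=-0.5, c=1.0):
    """Approximately minimize q^T K0 q over the simplex with exponentiated-gradient steps."""
    K0 = imq_stein_kernel(X, score, beta, c)   # Stein kernel matrix from the sketch above
    q = np.full(X.shape[0], 1.0 / X.shape[0])  # start from equal weights
    for _ in range(steps):
        grad = 2.0 * K0 @ q                    # gradient of the squared KSD q^T K0 q
        q = q * np.exp(-lr * grad)             # multiplicative update keeps q positive
        q /= q.sum()                           # renormalize onto the simplex
    return q

# Example: reweight an i.i.d. Gaussian sample toward the target P = N(0, I_2)
rng = np.random.default_rng(3)
X = rng.standard_normal((100, 2))
q = reweight_by_ksd(X, lambda x: -x)
```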

Page 28:

Improving Sample Quality

[Figure: average MSE, ‖E_P[Z] − E_{Q̃_n}[X]‖₂² / d, plotted against dimension d for the initial sample Q_n and for the Gaussian KSD and IMQ KSD reweighted samples.]

MSE averaged over 500 simulations (± 2 standard errors)

Target P = N(0, I_d)

Starting sample Q_n = (1/n) ∑_{i=1}^n δ_{x_i} for x_i ∼iid P, n = 100.

Page 29:

Future Directions

Many opportunities for future development

1. Improve KSD scalability while maintaining convergence control

Inexpensive approximations of the kernel matrix
Subsampling of likelihood terms in ∇ log p

2. Addressing other inferential tasks

Control variate design [Oates, Girolami, and Chopin, 2016]

Variational inference [Liu and Wang, 2016, Liu and Feng, 2016]

Training generative adversarial networks [Wang and Liu, 2016] and variational autoencoders [Pu, Gan, Henao, Li, Han, and Carin, 2017]

3. Exploring the impact of Stein operator choice

An infinite number of operators T characterize P
How is the discrepancy impacted? How do we select the best T?

Thm: If ∇ log p is bounded and k ∈ C_0^{(1,1)}, then S(Q_n, T_P, G_k) → 0 for some (Q_n)_{n≥1} not converging to P
Diffusion Stein operators (T g)(x) = (1/p(x)) ⟨∇, p(x) m(x) g(x)⟩ of Gorham, Duncan, Vollmer, and Mackey [2016] may be appropriate for heavy tails

Page 30:

References I

S. Ahn, A. Korattikara, and M. Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proc. 29th ICML, ICML'12, 2012.

A. D. Barbour. Stein's method and Poisson process convergence. J. Appl. Probab., (Special Vol. 25A):175–184, 1988. ISSN 0021-9002. A celebration of applied probability.

A. D. Barbour. Stein's method for diffusion approximations. Probab. Theory Related Fields, 84(3):297–322, 1990. ISSN 0178-8051. doi: 10.1007/BF01197887.

L. Baringhaus and N. Henze. A consistent test for multivariate normality based on the empirical characteristic function. Metrika, 35(1):339–348, 1988.

K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In Proc. 33rd ICML, ICML, 2016.

C. DuBois, A. Korattikara, M. Welling, and P. Smyth. Approximate slice sampling for Bayesian posterior inference. In Proc. 17th AISTATS, pages 185–193, 2014.

J. Gorham and L. Mackey. Measuring sample quality with Stein's method. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Adv. NIPS 28, pages 226–234. Curran Associates, Inc., 2015.

J. Gorham and L. Mackey. Measuring sample quality with kernels. arXiv:1703.01717, Mar. 2017.

J. Gorham, A. Duncan, S. Vollmer, and L. Mackey. Measuring sample quality with diffusions. arXiv:1611.06972, Nov. 2016.

F. Gotze. On the rate of convergence in the multivariate CLT. Ann. Probab., 19(2):724–739, 1991.

A. Korattikara, Y. Chen, and M. Welling. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proc. of 31st ICML, ICML'14, 2014.

Q. Liu and Y. Feng. Two methods for wild variational inference. arXiv preprint arXiv:1612.00081, 2016.

Q. Liu and J. Lee. Black-box importance sampling. arXiv:1610.05247, Oct. 2016. To appear in AISTATS 2017.

Q. Liu and D. Wang. Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm. arXiv:1608.04471, Aug. 2016.

Q. Liu, J. Lee, and M. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In Proc. of 33rd ICML, volume 48 of ICML, pages 276–284, 2016.

L. Mackey and J. Gorham. Multivariate Stein factors for a class of strongly log-concave distributions. arXiv:1512.07392, 2015.

A. Muller. Integral probability metrics and their generating classes of functions. Adv. Appl. Probab., 29(2):429–443, 1997.

Page 31:

References II

C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), pages n/a–n/a, 2016. ISSN 1467-9868. doi: 10.1111/rssb.12185.

Y. Pu, Z. Gan, R. Henao, C. Li, S. Han, and L. Carin. VAE learning via Stein variational gradient descent. In Advances in Neural Information Processing Systems, pages 4237–4246, 2017.

C. Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symposium on Mathematical Statistics and Probability (Univ. California, Berkeley, Calif., 1970/1971), Vol. II: Probability theory, pages 583–602. Univ. California Press, Berkeley, Calif., 1972.

C. Stein, P. Diaconis, S. Holmes, and G. Reinert. Use of exchangeable pairs in the analysis of simulations. In Stein's method: expository lectures and applications, volume 46 of IMS Lecture Notes Monogr. Ser., pages 1–26. Inst. Math. Statist., Beachwood, OH, 2004.

D. Wang and Q. Liu. Learning to Draw Samples: With Application to Amortized MLE for Generative Adversarial Learning. arXiv:1611.01722, Nov. 2016.

M. Welling and Y. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.

Page 32:

Comparing Discrepancies

[Figure: left panels plot discrepancy value (IMQ KSD, graph Stein discrepancy, Wasserstein) against the number of sample points n for the two samples; center panels plot discrepancy computation time in seconds against n for d = 1 and d = 4; right panels show the recovered functions g and h = T_P g.]

Left: Samples drawn i.i.d. from either the bimodal Gaussian mixture target p(x) ∝ e^{−(1/2)(x+1.5)²} + e^{−(1/2)(x−1.5)²} or a single mixture component.

Right: Discrepancy computation time using d cores in d dimensions.
