Hypothesis Testing with Kernel Embeddings on Interdependent Data
Dino Sejdinovic
Department of Statistics, University of Oxford
joint work with Kacper Chwialkowski and Arthur Gretton (Gatsby Unit, UCL)
9 April 2015, Dagstuhl
D. Sejdinovic (Statistics, Oxford) Testing with Kernels 09/04/15 1 / 19
Kernel Embedding
feature map: x ↦ k(·, x) ∈ H_k, instead of x ↦ (φ_1(x), …, φ_s(x)) ∈ R^s
⟨k(·, x), k(·, y)⟩_{H_k} = k(x, y): inner products easily computed
embedding: P ↦ μ_k(P) = E_{X∼P} k(·, X) ∈ H_k, instead of P ↦ (Eφ_1(X), …, Eφ_s(X)) ∈ R^s
⟨μ_k(P), μ_k(Q)⟩_{H_k} = E_{X,Y} k(X, Y): inner products easily estimated
μ_k(P) represents expectations w.r.t. P, i.e., E_X f(X) = E_X ⟨f, k(·, X)⟩_{H_k} = ⟨f, μ_k(P)⟩_{H_k} for all f ∈ H_k
Kernel MMD
Definition
Kernel metric (MMD) between P and Q:
MMD_k(P, Q) = ‖E_X k(·, X) − E_Y k(·, Y)‖²_{H_k}
            = E_{X,X′} k(X, X′) + E_{Y,Y′} k(Y, Y′) − 2 E_{X,Y} k(X, Y)
Kernel MMD
A polynomial kernel k(x, x′) = (1 + x⊤x′)^s on R^p captures the difference in the first s (mixed) moments only
For a certain family of kernels (characteristic/universal), MMD_k(P, Q) = 0 iff P = Q: Gaussian exp(−‖z − z′‖²₂ / (2σ²)), Laplacian, inverse multiquadrics, B_{2n+1}-splines, …
Under mild assumptions, k-MMD metrizes the weak* topology on probability measures (Sriperumbudur, 2010): MMD_k(P_n, P) → 0 ⇔ P_n ⇝ P
Nonparametric two-sample tests
Testing H_0: P = Q vs. H_A: P ≠ Q based on samples {x_i}_{i=1}^{n_x} ∼ P, {y_i}_{i=1}^{n_y} ∼ Q.
Test statistic is an estimate of MMD_k(P, Q) = E_{X,X′} k(X, X′) + E_{Y,Y′} k(Y, Y′) − 2 E_{X,Y} k(X, Y):
MMD̂_k = 1/(n_x(n_x − 1)) Σ_{i≠j} k(x_i, x_j) + 1/(n_y(n_y − 1)) Σ_{i≠j} k(y_i, y_j) − 2/(n_x n_y) Σ_{i,j} k(x_i, y_j).
Degenerate U-statistic: 1/√n-convergence to MMD under H_A, 1/n-convergence to 0 under H_0.
O(n²) to compute (n = n_x + n_y); various approximations (block-based, random features) trade computation for power.
Gretton et al. (NIPS 2009, JMLR 2012, NIPS 2012)
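A minimal sketch of this estimator (not the authors' code; Gaussian kernel and helper names are assumptions for illustration):

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Gram matrix with entries k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * sigma**2))

def mmd_unbiased(x, y, sigma=1.0):
    """Unbiased estimate of MMD_k(P, Q) from samples x ~ P, y ~ Q."""
    nx, ny = len(x), len(y)
    Kxx = gaussian_kernel(x, x, sigma)
    Kyy = gaussian_kernel(y, y, sigma)
    Kxy = gaussian_kernel(x, y, sigma)
    # within-sample sums exclude the diagonal terms i = j
    term_x = (Kxx.sum() - np.trace(Kxx)) / (nx * (nx - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (ny * (ny - 1))
    return term_x + term_y - 2 * Kxy.sum() / (nx * ny)
```

As on the slide, the cost is O(n²) in the total sample size: three dense Gram matrices are formed and summed.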
Test threshold
For i.i.d. data, under H_0: P = Q:
(n_x n_y / (n_x + n_y)) MMD̂_k ⇝ Σ_{r=1}^∞ λ_r (Z_r² − 1), {Z_r}_{r=1}^∞ i.i.d. ∼ N(0, 1)
{λ_r} depend on both k and P: eigenvalues of T: L₂ → L₂,
(Tf)(x) = ∫ f(x′) k̃(x, x′) dP(x′), with k̃ the centred kernel.
Asymptotic null distribution typically estimated using a permutation test.
For interdependent samples, {Z_r}_{r=1}^∞ are correlated, with correlation structure determined by the dependence within the samples.
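The permutation approach for i.i.d. data can be sketched as follows (hypothetical helper names; a simple biased V-statistic is used for the permuted statistic):

```python
import numpy as np

def gaussian_gram(z, sigma=1.0):
    """Gram matrix of a Gaussian kernel on the pooled sample."""
    sq = np.sum(z**2, 1)[:, None] + np.sum(z**2, 1)[None, :] - 2 * z @ z.T
    return np.exp(-sq / (2 * sigma**2))

def mmd_v(K, nx):
    """Biased (V-statistic) MMD computed from the pooled Gram matrix K."""
    Kxx, Kyy, Kxy = K[:nx, :nx], K[nx:, nx:], K[:nx, nx:]
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

def permutation_threshold(x, y, alpha=0.05, n_perm=200, seed=0):
    """(1 - alpha) quantile of MMD under random relabelling of the pooled sample.
    Valid for i.i.d. data; mis-calibrated when the samples are dependent in time."""
    rng = np.random.default_rng(seed)
    z = np.concatenate([x, y])
    K = gaussian_gram(z)
    nx = len(x)
    stats = []
    for _ in range(n_perm):
        p = rng.permutation(len(z))
        Kp = K[np.ix_(p, p)]          # relabel by permuting rows and columns
        stats.append(mmd_v(Kp, nx))
    return np.quantile(stats, 1 - alpha)
```

As the following slides show, this relabelling destroys the within-sample dependence structure, which is exactly why the resulting threshold is wrong for time series.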
Nonparametric independence tests
H_0: X ⊥⊥ Y ⇔ P_{XY} = P_X P_Y
H_A: X ⊥̸⊥ Y ⇔ P_{XY} ≠ P_X P_Y
Test statistic:
HSIC(X, Y) = ‖μ_κ(P̂_{XY}) − μ_κ(P̂_X P̂_Y)‖²_{H_κ}, with κ = k ⊗ l
Gretton et al. (2005, 2008); Smola et al. (2007)
Related to distance covariance (dCov) in the statistics literature: Székely et al. (AoS 2007, AoAS 2009); S. et al. (AoS 2013)
HSIC computation
k(·, ·): kernel on images; ℓ(·, ·): kernel on text descriptions (e.g., "The Sealyham Terrier is the couch potato of the terrier world - he loves to lay around and take naps..." vs. "Cairn Terriers are independent little bundles of energy. They are alert and active with the trademark terrier temperament...") → kernel matrices K and L.
HSIC measures average similarity between the kernel matrices:
HSIC(X, Y) = (1/n²) ⟨HKH, HLH⟩, where H = I − (1/n) 11⊤ is the centering matrix.
Extensions: conditional independence testing (Fukumizu, Gretton, Sun and Schölkopf, 2008; Zhang, Peters, Janzing and Schölkopf, 2011); three-variable interaction / V-structure discovery (S., Gretton and Bergsma, 2013)
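The statistic above translates directly into a few lines (a sketch, not library code; Gaussian kernels on both variables are an assumption):

```python
import numpy as np

def gaussian_gram(z, sigma=1.0):
    """Gram matrix with entries exp(-||z_i - z_j||^2 / (2 sigma^2))."""
    sq = np.sum(z**2, 1)[:, None] + np.sum(z**2, 1)[None, :] - 2 * z @ z.T
    return np.exp(-sq / (2 * sigma**2))

def hsic(x, y, sigma=1.0):
    """HSIC(X, Y) = (1/n^2) <HKH, HLH>, H = I - (1/n) 11^T the centering matrix."""
    n = len(x)
    K = gaussian_gram(x, sigma)
    L = gaussian_gram(y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n
    # Frobenius inner product of the two centred Gram matrices
    return np.sum((H @ K @ H) * (H @ L @ H)) / n**2
```

Dependent pairs give visibly larger values than independent ones, which is what the test thresholds against.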
Kernel tests on time series
Kacper Chwialkowski Arthur Gretton
Test calibration for dependent observations
[Figure: two observed time series and their marginal density estimates.]
Is P the same distribution as Q?
Permutation test on AR(1): X_{t+1} = a X_t + √(1 − a²) ε_t
[Figures, for a = 0.1, 0.2, …, 0.8: density of V_n, histogram of the permuted statistic V_{n,p}, correct threshold, and bootstrapped threshold. As the AR coefficient a increases, the permutation-based threshold falls increasingly below the correct threshold.]
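The AR(1) process used in these experiments is parameterized so that its stationary marginal is N(0, 1) for every a; a minimal simulation sketch (helper name is an assumption):

```python
import numpy as np

def ar1(n, a, seed=0):
    """Simulate X_{t+1} = a X_t + sqrt(1 - a^2) eps_t with stationary marginal N(0, 1)."""
    rng = np.random.default_rng(seed)
    x = np.empty(n)
    x[0] = rng.standard_normal()          # start in the stationary distribution
    noise = rng.standard_normal(n)
    for t in range(1, n):
        x[t] = a * x[t - 1] + np.sqrt(1 - a**2) * noise[t]
    return x
```

Because the marginal never changes with a, any change in the rejection rate across the panels above is purely an effect of temporal dependence, not of a shifted null.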
Wild Bootstrap
Wild bootstrap process (Leucht and Neumann, 2013):
W_{t,n} = e^{−1/l_n} W_{t−1,n} + √(1 − e^{−2/l_n}) ε_t, where W_{0,n}, ε_1, …, ε_n i.i.d. ∼ N(0, 1), and
W̃_{t,n} = W_{t,n} − (1/n) Σ_{j=1}^n W_{j,n}.
MMD̂_{k,wb} := (1/n_x²) Σ_{i=1}^{n_x} Σ_{j=1}^{n_x} W̃^{(x)}_{i,n_x} W̃^{(x)}_{j,n_x} k(x_i, x_j) + (1/n_y²) Σ_{i=1}^{n_y} Σ_{j=1}^{n_y} W̃^{(y)}_{i,n_y} W̃^{(y)}_{j,n_y} k(y_i, y_j) − (2/(n_x n_y)) Σ_{i=1}^{n_x} Σ_{j=1}^{n_y} W̃^{(x)}_{i,n_x} W̃^{(y)}_{j,n_y} k(x_i, y_j).
Theorem (Chwialkowski, S. and Gretton, 2014)
Let k be bounded and Lipschitz continuous, and let {X_t} ∼ P and {Y_t} ∼ Q both be τ-dependent with τ(r) = O(r^{−6−ε}), but independent of each other. Then, under H_0: P = Q,
φ( (n_x n_y / (n_x + n_y)) MMD̂_k , (n_x n_y / (n_x + n_y)) MMD̂_{k,wb} ) → 0 in probability as n_x, n_y → ∞,
where φ is the Prokhorov metric.
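The wild bootstrap can be sketched as follows (a sketch under the stated model, with hypothetical function names; the Gram matrices Kxx, Kyy, Kxy are assumed precomputed):

```python
import numpy as np

def wild_bootstrap_process(n, l_n, rng):
    """W_t = exp(-1/l_n) W_{t-1} + sqrt(1 - exp(-2/l_n)) eps_t, then centred."""
    a = np.exp(-1.0 / l_n)
    w = np.empty(n)
    w[0] = rng.standard_normal()
    eps = rng.standard_normal(n)
    for t in range(1, n):
        w[t] = a * w[t - 1] + np.sqrt(1 - a**2) * eps[t]
    return w - w.mean()                  # W-tilde: centre the process

def mmd_wild_bootstrap(Kxx, Kyy, Kxy, l_n=10, rng=None):
    """One draw of the wild-bootstrapped MMD statistic."""
    if rng is None:
        rng = np.random.default_rng()
    nx, ny = Kxx.shape[0], Kyy.shape[0]
    wx = wild_bootstrap_process(nx, l_n, rng)
    wy = wild_bootstrap_process(ny, l_n, rng)
    return (wx @ Kxx @ wx / nx**2 + wy @ Kyy @ wy / ny**2
            - 2 * wx @ Kxy @ wy / (nx * ny))
```

Repeating this many times yields a bootstrap sample of the null statistic whose quantile serves as the test threshold; the multiplier process W_{t,n} is itself autocorrelated (with length scale l_n), which is what preserves the dependence structure that a permutation destroys.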
Wild Bootstrap on AR(1): X_{t+1} = a X_t + √(1 − a²) ε_t
[Figures, for a = 0.1, 0.2, …, 0.8: density of V_n, histogram of the wild-bootstrapped statistic V_{n,w}, correct threshold, and bootstrapped threshold. The wild-bootstrap threshold tracks the correct threshold across the whole range of a.]
Test calibration for dependent observations
Two-sample test experiment \ method        perm.   wild
MCMC diagnostics: i.i.d. vs i.i.d. (H0)    .040    .012
                  i.i.d. vs Gibbs (H0)     .528    .052
                  Gibbs vs Gibbs (H0)      .680    .060
[Figures: example pairs of time series and scatter of time series 1 vs. time series 2; Type I error vs. AR coefficient; Type II error vs. extinction rate, comparing procedures Vb1, Vb2 and Shift.]
Time Series Coupled at a Lag
[Figures: mean distance correlation vs. lag (−20 to 20); Type II error rate vs. sample size for KCSD and HSIC.]
X_t = cos(φ_{t,1}), φ_{t,1} = φ_{t−1,1} + 0.1 ε_{1,t} + 2π f_1 T_s, ε_{1,t} i.i.d. ∼ N(0, 1),
Y_t = [2 + C sin(φ_{t,1})] cos(φ_{t,2}), φ_{t,2} = φ_{t−1,2} + 0.1 ε_{2,t} + 2π f_2 T_s, ε_{2,t} i.i.d. ∼ N(0, 1).
Parameters: C = 0.4, f_1 = 4 Hz, f_2 = 20 Hz, 1/T_s = 100 Hz.
M. Besserve, N.K. Logothetis, and B. Schölkopf. Statistical analysis of coupled time series with kernel cross-spectral density operators. NIPS 2013.
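The coupled-oscillator model above can be simulated directly (a sketch with the stated parameters; the function name is an assumption):

```python
import numpy as np

def coupled_oscillators(n, C=0.4, f1=4.0, f2=20.0, fs=100.0, seed=0):
    """X_t = cos(phi_1), Y_t = (2 + C sin(phi_1)) cos(phi_2).
    Y's amplitude is modulated by X's phase, so X and Y are dependent."""
    rng = np.random.default_rng(seed)
    Ts = 1.0 / fs
    # phi_t = phi_{t-1} + 0.1 eps_t + 2 pi f Ts  =>  cumulative sum of increments
    phi1 = np.cumsum(0.1 * rng.standard_normal(n) + 2 * np.pi * f1 * Ts)
    phi2 = np.cumsum(0.1 * rng.standard_normal(n) + 2 * np.pi * f2 * Ts)
    x = np.cos(phi1)
    y = (2 + C * np.sin(phi1)) * np.cos(phi2)
    return x, y
```

Since the dependence enters only through the slowly drifting phase φ_{t,1}, it is spread across many lags, which is the regime the lag-aware tests in the figure are designed for.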
Summary
Interdependent data lead to incorrect Type I control for kernel tests (too many false positives).
Consistency of a wild bootstrap procedure under weak long-range dependencies (τ-mixing), applicable to both two-sample and independence tests
Applications: MCMC diagnostics, time series dependence across multiple lags
Open questions
Interdependent case: how to select parameters of the wild bootstrap / block bootstrap - does this require estimating mixing properties of the time series first?
Large-scale testing: tradeoffs between computation and power
How to interpret the discovered differences in distributions / discovered dependence? Do we really care about all possible differences between distributions?
Tuning parameters - can select kernels/hyperparameters to directly optimize relative efficiency of the test, but how does this affect tradeoffs with interdependent data? Sensitive interplay between the kernel hyperparameter and the wild bootstrap parameters
Multivariate interaction and graphical model selection - approximations?
References
K. Chwialkowski, DS and A. Gretton, A wild bootstrap for degenerate kernel tests. Advances in Neural Information Processing Systems (NIPS) 27, Dec. 2014.
DS, B. Sriperumbudur, A. Gretton and K. Fukumizu, Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Statist. 41(5): 2263-2291, Oct. 2013.
M. Besserve, N.K. Logothetis and B. Schölkopf, Statistical analysis of coupled time series with kernel cross-spectral density operators. Advances in Neural Information Processing Systems (NIPS) 26, Dec. 2013.
A. Leucht and M.H. Neumann, Dependent wild bootstrap for degenerate U- and V-statistics. J. Multivar. Anal. 117: 257-280, 2013.
A. Gretton, K.M. Borgwardt, M.J. Rasch, B. Schölkopf and A. Smola, A Kernel Two-Sample Test. J. Mach. Learn. Res. 13(Mar): 723-773, 2012.
Kernels and characteristic functions
E-distance/dCov of Székely and Rizzo (2004, 2005), Székely et al. (2007):
V²(X, Y) = E_{XY} E_{X′Y′} ‖X − X′‖₂ ‖Y − Y′‖₂ + E_X E_{X′} ‖X − X′‖₂ E_Y E_{Y′} ‖Y − Y′‖₂ − 2 E_{XY} [ E_{X′} ‖X − X′‖₂ E_{Y′} ‖Y − Y′‖₂ ]
[Figures: samples (X, Y) from a smooth density and from a rough density; Type II error vs. frequency (m = 512, α = 0.05) for distance kernels with q = 1/6, 1/3, 2/3, 1, 4/3 and a Gaussian kernel with median-heuristic bandwidth.]
DS, B. Sriperumbudur, A. Gretton and K. Fukumizu, Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics 41(5), p. 2263-2291, 2013.
Embeddings in Mercer's Expansion
Mercer's Expansion
For a compact metric space X, and a continuous kernel k,
k(x, y) = Σ_{r=1}^∞ λ_r Φ_r(x) Φ_r(y),
with {λ_r, Φ_r}_{r≥1} eigenvalue/eigenfunction pairs of f ↦ ∫ f(x) k(·, x) dP(x) on L₂(P).
H_k ∋ k(·, x) ↔ {√λ_r Φ_r(x)} ∈ ℓ²
H_k ∋ μ_k(P) ↔ {√λ_r E Φ_r(X)} ∈ ℓ²
‖μ_k(P̂) − μ_k(Q̂)‖²_{H_k} = Σ_{r=1}^∞ λ_r ( (1/n_x) Σ_{t=1}^{n_x} Φ_r(X_t) − (1/n_y) Σ_{t=1}^{n_y} Φ_r(Y_t) )²
Wild Bootstrap
ρ_x = n_x/n, ρ_y = n_y/n
{W_{t,n}}_{1≤t≤n}, with E W_{t,n} = 0 and E[W_{t,n} W_{t′,n}] = ζ(|t′ − t| / l_n), where lim_{u→0} ζ(u) = 1
ρ_x ρ_y n MMD̂_k = Σ_{r=1}^∞ λ_r ( √ρ_y Σ_{t=1}^{n_x} Φ_r(X_t)/√n_x − √ρ_x Σ_{t=1}^{n_y} Φ_r(Y_t)/√n_y )²
ρ_x ρ_y n MMD̂_{k,wb} = Σ_{r=1}^∞ λ_r ( √ρ_y Σ_{t=1}^{n_x} Φ_r(X_t) W̃^{(x)}_{t,n_x}/√n_x − √ρ_x Σ_{t=1}^{n_y} Φ_r(Y_t) W̃^{(y)}_{t,n_y}/√n_y )²
E[Φ_r(X_1) W_{1,n} Φ_s(X_t) W_{t,n}] = E[Φ_r(X_1) Φ_s(X_t)] ζ(|t − 1| / l_n) → E[Φ_r(X_1) Φ_s(X_t)] as n → ∞, for all t, r, s, provided the dependence between X_1 and X_t "disappears fast enough" (a τ-mixing condition).
ICML Workshop on Large-Scale Kernel Learning
Lille, France, 11 July 2015 (co-located with ICML 2015)
Foundational algorithmic techniques for large-scale kernel learning: matrix factorization, randomization and approximation, variational inference and sampling, inducing variables, random Fourier features, unifying frameworks
Interface between kernel methods and deep learning architectures
Tradeoffs between statistical and computational efficiency in kernel methods
Stochastic gradient techniques with kernel methods
Large-scale multiple kernel learning
Large-scale representation learning with kernels
Large-scale kernel methods for complex data types beyond perceptual data
Confirmed speakers: Francis Bach, Neil Lawrence, Russ Salakhutdinov, Marius Kloft, Zaid Harchaoui
Deadline for Submissions: Friday, May 1st, 2015, 23:00 UTC.