
Can we do statistical inference in a non-asymptotic way?¹

Guang Cheng²

Statistics@Purdue, www.science.purdue.edu/bigdata/

ONR Review Meeting @ Duke, Oct 11, 2017

¹Acknowledging NSF, ONR and the Simons Foundation.
²An "exploratory" and ongoing work with Z. Shang and Y. Yang.

Starting from Stat 101

Given that X1, ..., Xn iid∼ N(θ0, 1), we have

P(θ0 ∈ [X̄ ± 1.96/√n]) = 95%;

Without knowing the distribution of the Xi's, we obtain via the CLT

P(θ0 ∈ [X̄ ± 1.96/√n]) → 95%.

But the price is the "asymptotic" validity;

What if the distribution of X is unknown and n is finite?
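To make the finite-sample concern concrete, here is a small simulation sketch (my own illustration, not from the slides; the Exponential(1) choice and the sample sizes are assumptions): it checks the empirical coverage of the CLT interval X̄ ± 1.96/√n when X is non-Gaussian but still has unit variance.

```python
# Sketch: empirical coverage of the CLT-based interval X̄ ± 1.96/√n when the
# data are *not* Gaussian.  Here θ0 = 1 and Var(X) = 1 (Exponential(1)), so the
# interval formula still applies, but its 95% coverage is only asymptotic.
import numpy as np

rng = np.random.default_rng(0)

def clt_coverage(n, reps=20000):
    """Fraction of replications whose CLT interval covers θ0 = 1."""
    x = rng.exponential(scale=1.0, size=(reps, n))   # mean 1, variance 1
    xbar = x.mean(axis=1)
    return np.mean(np.abs(xbar - 1.0) <= 1.96 / np.sqrt(n))

for n in (10, 30, 100, 1000):
    print(f"n = {n:4d}: empirical coverage ≈ {clt_coverage(n):.3f}")
# Coverage approaches 0.95 only as n grows -- the "asymptotic validity" price.
```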


A first thought: concentration inequality

Concentration inequalities quantify how a random variable X deviates around its mean or median µ. They usually take the form of two-sided bounds for the tails of X − µ, such as

P(|X − µ| > t) ≤ something small.

The simplest concentration inequality is Chebyshev's inequality;

Concentration inequalities hold for any sample size n.
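As a worked example (my own, not from the slides), Chebyshev's inequality already yields a finite-sample confidence interval when Var(X) = 1, although a conservative one:

```python
# Sketch: since Var(X̄) = 1/n, Chebyshev gives P(|X̄ − θ0| > t) ≤ 1/(n t²).
# Setting the right-hand side to 0.05 gives the interval X̄ ± sqrt(20)/√n,
# valid for every n, but noticeably wider than the asymptotic 1.96/√n interval.
import math

for n in (10, 50, 200, 1000):
    chebyshev_half = math.sqrt(20.0 / n)   # solves 1/(n t²) = 0.05 for t
    clt_half = 1.96 / math.sqrt(n)
    print(f"n = {n:4d}: Chebyshev half-width {chebyshev_half:.3f} "
          f"vs CLT half-width {clt_half:.3f}")
```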


Does the concentration inequality idea really work?

For example, Hoeffding's inequality gives

P(θ0 ∈ [X̄ ± 1.27 c‖X‖ψ2 /√n]) ≥ 95%

for iid Xi's with sub-Gaussian tails³. Here ‖X‖ψ2 is the so-called sub-Gaussian norm, e.g., ‖X‖ψ2 = 1/√(log 2) for Rademacher X;

For illustration, let us examine one special case, i.e., the Bernoulli, within the class of sub-Gaussian random variables;

When the Xi's are iid∼ Ber(θ0), we have, for any sample size n,

P(θ0 ∈ [X̄ ± 1.36/√n]) ≥ 95%,

which is sharp in the rate but not in the constant, in comparison with the asymptotic CI [X̄ ± 0.98/√n] (computed at θ0 = 0.5).

³Relaxable to sub-exponential tails via Bernstein's inequality.


Empirical comparison for Bernoulli

The left plot compares the widths of the confidence intervals from the two methods; the right plot shows their coverage probabilities versus (small) sample size. Simulations were repeated 1000 times.

Two competing effects: conservativeness vs asymptotics.
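A sketch reproducing the comparison just described (my own re-implementation, not the authors' code): it contrasts the width and empirical coverage of the Hoeffding-based interval X̄ ± 1.36/√n with the asymptotic interval X̄ ± 0.98/√n for Ber(0.5) data over 1000 replications.

```python
# Sketch of the Bernoulli comparison: Hoeffding interval X̄ ± 1.36/√n vs
# asymptotic interval X̄ ± 0.98/√n, with θ0 = 0.5 and 1000 Monte Carlo runs.
import numpy as np

rng = np.random.default_rng(1)
theta0, reps = 0.5, 1000

def coverage(n, half_const):
    x = rng.binomial(1, theta0, size=(reps, n))
    xbar = x.mean(axis=1)
    return np.mean(np.abs(xbar - theta0) <= half_const / np.sqrt(n))

for n in (10, 20, 50, 100):
    print(f"n = {n:3d}: Hoeffding width {2 * 1.36 / np.sqrt(n):.3f}, "
          f"coverage {coverage(n, 1.36):.3f} | "
          f"asymptotic width {2 * 0.98 / np.sqrt(n):.3f}, "
          f"coverage {coverage(n, 0.98):.3f}")
# The Hoeffding interval is wider (conservative) but its ≥95% coverage holds at
# every n; the asymptotic interval is narrower but only approximately valid.
```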

How far can we go beyond these simple cases?

There are some recent results on more complicated "parametric models" obtained via various notions of concentration inequalities, such as Chernozhukov et al. (2013) and Strawn et al. (2014).

Chernozhukov et al. (2013) presented non-asymptotic results on bootstrap validity for high-dimensional problems such as ‖(1/√n) ∑_{i=1}^n Xi‖∞ for Xi ∈ R^p with p ≫ n;

Strawn et al. (2014) obtained non-asymptotic bounds on the expected concentration of a posterior (around the true parameter) in high-dimensional Bayesian linear models;

As far as we are aware, these bounds were mostly derived to justify (asymptotic) statistical estimation/inference in a non-asymptotic manner, rather than used to construct finite-sample inference.


Outline

1 Smoothing spline models

2 A non-asymptotic Bahadur representation

3 Statistical application: hypothesis testing

4 Simulations


Smoothing spline models

Consider a nonparametric regression model

yi = f(xi) + εi.

For simplicity, we assume that x ∈ I := [0, 1], and that x and ε are independent with Eε = 0 and Var(ε) = 1;

The standard smoothing spline estimate (or, more generally, kernel ridge estimate) is obtained as

f̂n := arg min_{f∈H} (1/n) ∑_{i=1}^n (yi − f(xi))² + λ‖f‖²_H,

where λ‖f‖²_H is the roughness penalty and H is a reproducing kernel Hilbert space (RKHS).
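For concreteness, here is a minimal kernel ridge regression sketch (my own illustration; it substitutes a Gaussian kernel for the Sobolev/smoothing-spline RKHS, and the bandwidth and λ values are assumptions) implementing the penalized least-squares criterion above via the representer theorem.

```python
# Minimal kernel ridge regression sketch.
import numpy as np

def kernel_ridge_fit(x, y, lam, bandwidth=0.1):
    """Return f̂ minimizing (1/n) Σ (yi − f(xi))² + λ‖f‖²_H over the RKHS of a
    Gaussian kernel.  By the representer theorem, f̂(·) = Σ_j α_j K(x_j, ·),
    where α solves (K + nλ I) α = y."""
    n = len(x)
    K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * bandwidth**2))  # Gram matrix
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    def f_hat(xnew):
        Knew = np.exp(-(np.asarray(xnew)[:, None] - x[None, :])**2 / (2 * bandwidth**2))
        return Knew @ alpha
    return f_hat

# Usage on synthetic data
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = np.sin(2 * np.pi * x) + rng.normal(size=100)
f_hat = kernel_ridge_fit(x, y, lam=1e-3)
print(f_hat(np.array([0.25, 0.5, 0.75])))
```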


A non-asymptotic Bahadur representation


Starting from a "naive" concentration inequality

We first investigate the concentration property of Tn := ‖f̂n − f0‖.

Lemma 1
Under proper non-asymptotic conditions on λ > 0 and M > 0, it holds that

sup_{f0 ∈ Hm(1)} P_{f0}(Tn ≥ dn(M, λ)) ≤ 2 exp(−M),    (1)

where dn(M, λ) = 2λ^{1/m} + cK(√(2M) + 1)(nλ^{1/(2m)})^{−1/2}.

Note that Lemma 1 holds uniformly over a unit ball

Hm(1) := {f ∈ Sm(I) : ‖f‖H ≤ 1}

for any sample size.

A closer examination of the naive concentration

We notice that

E{f̂n} ≈ f0 − Pλf0 ≠ f0!

The term Pλf0 is the estimation bias caused by penalization;

If we use Tn = ‖f̂n − f0‖ to test H0 : f = f0, we can prove that its minimal separation rate is n^{−m/(2m+1)}, which is sub-optimal in comparison with the minimax rate of testing, n^{−2m/(4m+1)};

To make inference, we need to develop a second-order expansion, which we call a "non-asymptotic" Bahadur representation.


Review of the (asymptotic) Bahadur representation

Bahadur representations (Bahadur, 1966) are often useful for studying the "asymptotic" properties of statistical estimators;

For example, He and Shao (1996) considered a general M-estimation framework:

(1/n) ∑_{i=1}^n ψ(xi, θ̂n) = o(δn)

for some δn → 0, and obtained that, as n → ∞,

θ̂n − θ0 − ∑_{i=1}^n Dn^{−1} ψ(xi, θ0) = O(Rn) almost surely,

for some Rn = o(n^{−1/2}).


Non-asymptotic Bahadur representation

Inspired by Shang and Cheng (2013), we define

T̂n := ‖f̂n − (I − Pλ)f0 − (1/n) ∑_{i=1}^n εi KXi‖

and study its concentration property.

Theorem 2
Under similar conditions as in Lemma 1, it holds that

sup_{f0 ∈ Hm(1)} P_{f0}(T̂n ≥ d̂n(M, λ)) ≤ 2 exp(−M),

where d̂n(M, λ) = cK² √M n^{−1/2} λ^{−1/(2m)} A(λ) dn(M, λ).

Statistical application: hypothesis testing


Nonparametric hypothesis of interest

Consider the following hypothesis testing problem:

H0 : f = f0  vs  H1 : f ≠ f0,

e.g., f0 = 0.

Test statistic

Clearly, T̂n = ‖f̂n − (I − Pλ)f0 − n^{−1} ∑_{i=1}^n εi KXi‖ cannot be used as a test statistic since the εi's are not observable;

Rather, we propose to use T̃n as the test statistic:

T̃n = ‖f̂n − (I − Pλ)f0‖² − E‖n^{−1} ∑_{i=1}^n εi KXi‖²
    = ‖f̂n − (I − Pλ)f0‖² − (1/n) ∑_{ν≥1} 1/(1 + λ/ρν);

In practice, we replace (I − Pλ)f0 by a "noiseless" version of f̂n:

f̂nNL = arg min_{f∈Sm(I)} (1/n) ∑_{i=1}^n (f0(Xi) − f(Xi))² + λ‖f‖²_{Sm(I)}.
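A simplified sketch of this practical test statistic (my own implementation, not the authors' code): it assumes a fit(x, y, lam) routine such as the kernel_ridge_fit sketch given earlier, approximates ‖·‖ by an L2 average over a grid, and replaces the eigenvalues ρν in the centering term by the empirical eigenvalues of the kernel Gram matrix, as suggested later in the talk.

```python
# Sketch: T̃n ≈ ‖f̂n − f̂nNL‖² − (1/n) Σν 1/(1 + λ/ρν), with grid and empirical
# eigenvalue approximations standing in for the exact RKHS quantities.
import numpy as np

def test_statistic(x, y, f0, lam, fit, bandwidth=0.1):
    n = len(x)
    f_hat = fit(x, y, lam)            # f̂n: fit to the observed Yi
    f_hat_nl = fit(x, f0(x), lam)     # f̂nNL: "noiseless" fit to f0(Xi)
    grid = np.linspace(0, 1, 500)
    sq_dist = np.mean((f_hat(grid) - f_hat_nl(grid))**2)   # ≈ ‖f̂n − f̂nNL‖²
    # Centering term with ρν replaced by empirical Gram-matrix eigenvalues.
    K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * bandwidth**2))
    rho = np.linalg.eigvalsh(K) / n
    centering = np.sum(1.0 / (1.0 + lam / np.maximum(rho, 1e-12))) / n
    return sq_dist - centering
```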


Exact Type I and II error control

Based on the non-asymptotic Bahadur representation, we can control the Type I and II errors for any finite sample size as follows.

Theorem 3
Under proper non-asymptotic conditions on λ, α and β, we have

Type I error:  P_{f0}(T̃n ≥ d̃n(log(15/α), λ)) ≤ α,
Type II error:  sup_{f−f0 ∈ Hm_{α,β}(1)} P_f(T̃n ≤ d̃n(log(15/α), λ)) ≤ β,

for any α, β ∈ (0, 1). The cut-off value is

d̃n(M, λ) ≈ 4√M ρK / (n λ^{1/(4m)}).

The separation set is Hm_{α,β}(1) := {g ∈ Hm(1) : ‖g‖ ≥ ρn(log(15/α), log(30/β), λ)}, where

ρn(M, L, λ) ≈ √(ζK λ) + √(4ρK √M / (n λ^{1/(4m)})).

Asymptotic minimax optimality

An important implication of Theorem 3 is that T̃n is asymptotically minimax optimal;

To see this, we minimize the separation function ρn(log(15/α), log(30/β), λ) over λ, and obtain the minimal separation rate n^{−2m/(4m+1)} when λ is chosen as

λ* := ((4ρK/ζK)² log(15/α))^{2m/(4m+1)} n^{−4m/(4m+1)} ≍ n^{−4m/(4m+1)}.

Following similar derivations, we prove that the test statistic based on the naive concentration, i.e., Tn = ‖f̂n − f0‖, is actually sub-optimal. So our intuition not to use it was correct.
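A tiny numerical sketch of the formula above (the values of ρK, ζK, m and α are placeholders, not from the slides): it evaluates λ* and the corresponding separation rate for a few sample sizes.

```python
# Sketch: evaluating λ* and the separation rate n^{-2m/(4m+1)} with assumed constants.
import math

m, alpha = 2, 0.05          # smoothness order and Type I error level (assumed)
rho_K, zeta_K = 1.0, 1.0    # kernel-dependent constants (placeholders)

def lambda_star(n):
    """λ* = ((4ρK/ζK)² log(15/α))^{2m/(4m+1)} · n^{−4m/(4m+1)}."""
    const = ((4 * rho_K / zeta_K) ** 2 * math.log(15 / alpha)) ** (2 * m / (4 * m + 1))
    return const * n ** (-4 * m / (4 * m + 1))

for n in (100, 1000, 10000):
    print(f"n = {n:5d}: λ* ≈ {lambda_star(n):.2e}, "
          f"separation rate n^(-2m/(4m+1)) ≈ {n ** (-2 * m / (4 * m + 1)):.3f}")
```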


Extensions

Composite hypotheses, e.g.,

H0 : f ∈ F0  vs  H1 : f ∉ F0,

where F0 = {f : f is linear on I};

Kernel ridge regression with different decaying rates of eigenvalues;

Non-Gaussian regression with a smooth log-concave density, such as logistic regression.


Simulations


Simulations

Consider the following hypothesis:

H0 : f = f0  vs  H1 : f ≠ f0,

where f0(x) = 5(x² − x + 1/6).

Data were generated as follows:

Yi = fc(Xi) + εi,  with  fc(x) = (c/2) x² + f0(x);

Set εi iid∼ N(0, 1) and Xi iid∼ Unif(0, 1);

Set the Type I and II error levels as α = β = 0.05.
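A sketch of this simulation design (my own re-implementation): c = 0 generates data under H0, and c ≠ 0 generates data under a local alternative.

```python
# Sketch: data generated from f_c(x) = (c/2)x² + f0(x), f0(x) = 5(x² − x + 1/6).
import numpy as np

rng = np.random.default_rng(2)

def f0(x):
    return 5 * (x**2 - x + 1/6)

def generate_data(n, c):
    """Draw (X, Y) with Xi ~ Unif(0, 1) and Yi = f_c(Xi) + N(0, 1) noise."""
    x = rng.uniform(0, 1, n)
    y = (c / 2) * x**2 + f0(x) + rng.normal(size=n)
    return x, y

x, y = generate_data(n=100, c=2)   # an alternative with c = 2
```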


Remark: cutoff value

Recall that the cutoff value d̃n(log(15/α), λ) is provided in Theorem 3. However, we found that it is quite conservative due to the use of some loose concentration inequalities;

Fortunately, we can simulate T̃n as follows. Conditioning on X, we simulate a number of synthetic datasets {Y(k)}_{k=1}^N from the null model

Yi(k) = f0(Xi) + N(0, 1),

each of which yields a new test statistic T̃n(k). The cutoff value is set as the (1 − α) sample quantile of {T̃n(k)}_{k=1}^N.
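A sketch of this Monte Carlo calibration (my own simplified implementation): conditioning on X, it refits the statistic on N synthetic null datasets and returns the (1 − α) sample quantile. The fit and stat arguments are assumptions standing in for the actual smoothing spline fit; the deterministic centering term of T̃n is omitted here because, being constant given X and λ, it cancels when the observed statistic is compared with the simulated quantile.

```python
# Sketch: Monte Carlo cutoff for the test statistic, conditioning on X.
import numpy as np

def mc_cutoff(x, f0, fit, stat, alpha=0.05, N=500, rng=None):
    """(1 − α) sample quantile of the statistic recomputed on N synthetic
    null datasets Y(k)_i = f0(X_i) + N(0, 1)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    f_hat_nl = fit(x, f0(x))                 # "noiseless" fit, computed once
    null_stats = []
    for _ in range(N):
        y_null = f0(x) + rng.normal(size=len(x))
        null_stats.append(stat(fit(x, y_null), f_hat_nl))
    return np.quantile(null_stats, 1 - alpha)

def stat(f_hat, f_hat_nl, grid=np.linspace(0, 1, 200)):
    return np.mean((f_hat(grid) - f_hat_nl(grid))**2)   # ≈ ‖f̂n − f̂nNL‖²

# Usage (with kernel_ridge_fit and f0 from the earlier sketches):
#   cutoff = mc_cutoff(x, f0, fit=lambda x, y: kernel_ridge_fit(x, y, lam=1e-3), stat=stat)
```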


Remark: the choice of λ

Rather, our theoretical analysis is useful for choosing λ;

In simulations, we choose λ by numerically minimizing the separation function

ρn(log(15/α), log(30/β), λ)

over λ. The resulting λ is denoted as λFS;

This non-asymptotic approach is computationally much cheaper than conventional generalized cross-validation (GCV);

Note that all constants in ρn(M, L, λ) and dn(M, λ), such as ρK, ζK and cK, only depend on λ and the eigenvalues ρν, which can be well approximated by the empirical eigenvalues of the reproducing kernel matrix.
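A sketch of this numerical choice of λFS (assumed placeholder constants ζK and ρK, and the reconstructed form of ρn from Theorem 3; not the authors' code), using a bounded scalar minimization.

```python
# Sketch: choose λ_FS by numerically minimizing
#   ρn(M, L, λ) ≈ sqrt(ζK λ) + sqrt(4 ρK √M / (n λ^{1/(4m)}))
# over λ.  ζK and ρK are placeholders; in practice they would be approximated
# from the empirical eigenvalues of the kernel matrix.
import numpy as np
from scipy.optimize import minimize_scalar

m, n, alpha = 2, 200, 0.05
zeta_K, rho_K = 1.0, 1.0        # placeholder kernel constants
M = np.log(15 / alpha)

def separation(lam):
    return np.sqrt(zeta_K * lam) + np.sqrt(4 * rho_K * np.sqrt(M) / (n * lam**(1 / (4 * m))))

res = minimize_scalar(separation, bounds=(1e-8, 1.0), method="bounded")
print(f"λ_FS ≈ {res.x:.2e}, minimal separation ≈ {res.fun:.3f}")
```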


Empirical comparison with an "asymptotically" valid test

For the simple hypothesis, we compare four testing procedures:

(S1) The proposed T̃n with the smoothing parameter λFS;

(S2) The asymptotically valid penalized likelihood ratio test (Shang and Cheng, 2013), denoted as PLRT(f0), with the same λFS;

(S3) The proposed T̃n with λ selected by GCV, denoted as λGCV;

(S4) The PLRT statistic PLRT(f0) with the same λGCV.

Note that the cut-off value for PLRT in S2 and S4 is obtained from Monte Carlo simulation and from the null limit distribution given in Shang and Cheng (2013), respectively.

Simulation results

  n   c   λFS^{1/4}   RP_FS   RP_PLRT   λGCV^{1/4}       RP_GCV   RP'_GCV
 -------------------------------------------------------------------------
  50  0     0.126     0.046    0.052    0.142 (0.008)    0.052    0.054
      1               0.100    0.092    0.138 (0.009)    0.088    0.084
      2               0.396    0.340    0.134 (0.009)    0.323    0.310
      3               0.822    0.764    0.128 (0.009)    0.752    0.748
 100  0     0.108     0.048    0.048    0.122 (0.007)    0.053    0.052
      1               0.264    0.170    0.117 (0.008)    0.167    0.144
      2               0.650    0.558    0.112 (0.007)    0.493    0.474
      3               0.976    0.934    0.110 (0.003)    0.924    0.918
 200  0     0.092     0.050    0.048    0.104 (0.006)    0.051    0.050
      1               0.368    0.334    0.102 (0.007)    0.325    0.290
      2               0.896    0.862    0.098 (0.006)    0.832    0.816
      3               1.00     1.00     0.094 (0.003)    1.00     1.00
 300  0     0.084     0.046    0.048    0.096 (0.006)    0.048    0.048
      1               0.426    0.404    0.094 (0.006)    0.397    0.394
      2               0.968    0.946    0.092 (0.005)    0.930    0.914
      3               1.00     1.00     0.090 (0.005)    1.00     1.00
 400  0     0.079     0.052    0.050    0.091 (0.004)    0.049    0.054
      1               0.668    0.640    0.087 (0.005)    0.631    0.618
      2               1.00     1.00     0.085 (0.004)    1.00     1.00
      3               1.00     1.00     0.084 (0.003)    1.00     1.00

Table 1: Simulation results for simple hypothesis testing. λGCV is an average value over 500 replicates (and varies with c). RP_FS, RP_PLRT, RP_GCV and RP'_GCV are the average rejection proportions, over 500 replicates, of T̃n with λFS, PLRT with λFS, T̃n with λGCV, and PLRT with λGCV, respectively. The rejection proportion at c = 0 reflects the Type I error; at c ≠ 0 it reflects power. (λGCV is obtained using the ssr function in the R package assist.)

Empirical observations

Overall, all four procedures have comparable Type I errors (i.e., at c = 0) for any sample size;

As for power performance, we note that

(i) the test using λFS is always more powerful than those using λGCV. This justifies the finite-sample advantage of the non-asymptotic formula for selecting λ;

(ii) T̃n is always more powerful than PLRT given the same choice of λ. In other words, S1 is always the most powerful one. This supports the need to remove the estimation bias in nonparametric testing;

The third observation is that, as c increases, λGCV keeps decreasing and becomes closer to λFS, but never reaches λFS. This is consistent with their different asymptotic orders, i.e., λFS ≍ n^{−2m/(2m+1/2)} and λGCV ≍ n^{−2m/(2m+1)}.


Discussions

The "non-asymptotic Bahadur representation" seems a possible direction to pursue for finite-sample inference;

The practical usefulness of finite-sample inference procedures relies on "computably sharp" concentration inequalities. Can this be achieved through data re-sampling or MCMC?

Does something like a "non-asymptotic" Fisher information exist?

Can the non-asymptotic approach be extended to the Bayesian domain, e.g., to choosing hyper-parameters in hierarchical priors?


Questions?

Thanks to ONR for support


Some necessary notation

In smoothing spline models, we set H as an m-th order Sobolev space, denoted as Sm(I), and ‖f‖²_H as ∫_I {f^(m)(x)}² dx;

Define 〈f, g〉 = E{f(X)g(X)} + λ ∫_I f^(m)(x) g^(m)(x) dx. Endowed with 〈·, ·〉, Sm(I) is an RKHS;

Let K(x1, x2) be the reproducing kernel function satisfying 〈Kx, f〉 = f(x), with Kx(·) = K(x, ·). The underlying eigen-system is denoted as {ρν, φν(·)}_{ν≥1}, where ρν ≍ ν^{−2m};

For later use, define

cK = λ^{1/(4m)} sup_{x∈I} K^{1/2}(x, x),
ρK = λ^{1/(2m)} E[K²(X1, X2)]  with X1, X2 iid∼ X.
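To illustrate how such constants might be approximated in practice (my own sketch; the Gaussian kernel, bandwidth, and Unif(0, 1) design are assumptions standing in for the Sobolev reproducing kernel), here is a small Monte Carlo computation of cK and ρK.

```python
# Sketch: empirical approximation of cK = λ^{1/(4m)} sup_x K(x,x)^{1/2} and
# ρK = λ^{1/(2m)} E[K(X1, X2)²], with a Gaussian kernel as a stand-in.
import numpy as np

rng = np.random.default_rng(4)
m, lam = 2, 1e-3

def K(u, v, bandwidth=0.1):
    return np.exp(-(u - v)**2 / (2 * bandwidth**2))

grid = np.linspace(0, 1, 1001)
c_K = lam**(1 / (4 * m)) * np.sqrt(K(grid, grid)).max()      # sup over a grid

x1, x2 = rng.uniform(0, 1, 100000), rng.uniform(0, 1, 100000)
rho_K = lam**(1 / (2 * m)) * np.mean(K(x1, x2)**2)           # Monte Carlo expectation

print(f"c_K ≈ {c_K:.3f}, ρ_K ≈ {rho_K:.4f}")
```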