
Page 1:

optimal inference in a class of nonparametric models

Timothy Armstrong (Yale University)

Michal Kolesár (Princeton University)

September 2015

Page 2:

setup

• Interested in inference on a linear functional Lf in the regression model

  y_i = f(x_i) + u_i,  u_i ∼ N(0, σ²(x_i)).

  x_i is fixed, σ²(x_i) is known.

• Important special cases:
  1. Inference at a point: Lf = f(0)
  2. Regression discontinuity: Lf = f(0+) − f(0−)
  3. ATE under unconfoundedness: x_i = (w_i, d_i), Lf = (1/n) ∑_i (f(w_i, 1) − f(w_i, 0))
  4. Partially linear model

Page 3:

key assumption

Convexity Assumption
f ∈ F, a known convex set

Rules out e.g. sparsity, but not the usual shape/smoothness restrictions:

• Monotonicity: F = {f : f non-increasing}
• Lipschitz class: F_Lip(C) = {f : |f(x_1) − f(x_2)| ≤ C|x_1 − x_2|} (or Hölder class generalizations)
• Taylor class: F_{T,2}(C) = {f : |f(x) − f(0) − f′(0)x| ≤ Cx²} (useful for RD / inference at a point)
• Sign restrictions in linear regression: {f(x) = x′β : β_j ≥ 0, j ∈ J}

• Will take C as known if necessary, and ask later whether this can be relaxed.

Page 4:

notions of finite-sample optimality

• Normality ⟹ can derive finite-sample procedures that minimize the worst-case loss over G ⊆ F
• Without normality, procedures will be valid and optimal asymptotically under regularity conditions, uniformly over F

1. Setting G = F yields minimax procedures.
   • Problem well-studied if loss is MSE; general solution in Donoho (1994), used to derive optimal kernels and rates of convergence (Stone, 1980; Fan, 1993; Cheng, Fan, and Marron, 1997)
   • Donoho (1994) derives fixed-length confidence intervals (CIs) that are almost optimal
2. G ⊂ F, "smoother" functions: adaptive inference ("directing power")
   • For two-sided CIs, Cai and Low (2004) give bounds

Page 5:

new finite-sample results: one-sided cis

• Derive one-sided CIs, [c, ∞), that minimize maximum quantiles of excess length over G, with c = L̂ − bias_F(L̂) − z_{1−α} sd(L̂) for an optimal estimator L̂, where bias_F denotes the worst-case bias over F
• For the case F = G (minimax CIs), L̂ has the same form as the minimax MSE estimators / fixed-length CIs of Donoho (1994)
• We show that if F is symmetric, adaptation is severely limited.
• Adaptation requires non-convexity or shape restrictions: otherwise, cannot do better at smaller C while maintaining coverage for larger C
• Conversely, any inference method that claims to do better than minimax CIs when f is smooth must be size distorted for some f ∈ F(C)
• Related to Low (1997), who shows that adapting to derivative smoothness classes is limited for two-sided (random-length) CIs.

Page 6:

new finite-sample results: two-sided cis

• We derive two-sided CIs that minimize expected length over G = {g}, solving the problem of "adaptation to a function" posed in Cai, Low, and Xia (2013)
• Can be used to bound the scope for adaptivity

Page 7:

implications for optimal bandwidth choice

Asymptotically, optimal procedures often correspond to kernel estimators with a fixed (optimal) kernel and a bandwidth that depends on the optimality criterion. We find that for RD and inference at a point:

• Optimal 95% fixed-length CIs use a larger bandwidth than minimax MSE estimators.
• Undersmoothing cannot be optimal
• Recentering CIs by estimating the bias cannot be optimal: it is essentially equivalent to using a higher-order kernel and undersmoothing (Calonico, Cattaneo, and Titiunik, 2014).
• The difference is small: the CI around the minimax MSE estimator is only 1% longer
• In practice, can keep the same bandwidth as for estimation, and construct the CI around it using a worst-case bias correction

Page 8:

applications

We apply the general results to:

1. RD with F = {f+ − f− : f± ∈ F_{T,2}(C)}, as in Cheng, Fan, and Marron (1997)
   • Optimal bandwidths balance the number of "effective observations" on each side of the cutoff
   • Illustrate with the empirical application from Lee (2008)
2. Linear regression with β possibly constrained (sign restrictions, sparsity, elliptical constraints)
3. Sample average treatment effect under unconfoundedness under a Hölder class (separate paper)

Page 9:

incomplete list of related literature

• Stats literature on minimax estimation/inference/rates of convergence/adaptivity: Ibragimov and Khas'minskii (1985), Donoho and Liu (1991), Donoho and Low (1992), Donoho (1994), Low (1995), Low (1997), Cai and Low (2004), Cai, Low, and Xia (2013), Cheng, Fan, and Marron (1997), Fan (1993), Fan, Gasser, Gijbels, Brockmann, and Engel (1997), Lepski and Tsybakov (2000)
• "Non-standard" CIs: Imbens and Manski (2004), Müller and Norets (2012), Calonico, Cattaneo, and Titiunik (2014), Calonico, Cattaneo, and Farrell (2015), Rothe (2015)
• Adaptive estimation/inference in econometrics: Sun (2005), Armstrong (2015), Chernozhukov, Chetverikov, and Kato (2014)

Page 10:

Finite-Sample results

Asymptotic results

Applications

Conclusion

Page 11:

running example

• Consider the problem of inference on f(0) when f is restricted to be in the Lipschitz class F = F_Lip(C) = {f : |f(x_1) − f(x_2)| ≤ C|x_1 − x_2|}.
• Assume σ(x) = σ, known
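For concreteness, here is a small simulated instance of the running example; all names and values are ours and purely illustrative, and the later code sketches reuse this (x, y, σ, C):

```python
import numpy as np

# Simulated instance of the running example: y_i = f(x_i) + u_i,
# u_i ~ N(0, sigma^2), with f in F_Lip(C). Values are illustrative only.
rng = np.random.default_rng(0)
n, sigma, C = 500, 1.0, 2.0
x = np.sort(rng.uniform(-1.0, 1.0, n))      # fixed design points
f = lambda t: (C / 2) * np.abs(t)           # a member of F_Lip(C); f(0) = 0
y = f(x) + sigma * rng.normal(size=n)
```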

Page 12:

performance criteria

• To measure performance of 1 − α one-sided CIs [c, ∞), we use maximum quantiles of excess length

  EL_β(c, G) = sup_{g∈G} q_{g,β}(Lg − c),

  where q_{g,β} is the βth quantile under g.

• For two-sided CIs, we focus on fixed-length CIs L̂ ± χ, where L̂ is an estimator, and χ is chosen to satisfy coverage:

  χ_α(L̂) = min{χ : inf_{f∈F} P_f(|L̂ − Lf| ≤ χ) ≥ 1 − α}

• For estimation, we use maximum root MSE, RMSE(L̂) = sup_{f∈F} (E_f(L̂ − Lf)²)^{1/2}

Page 13:

minimax testing problem

• In the running example, Lf = f(0), F = F_Lip(C); consider the minimax test of H0: Lf ≤ L_0 against H1: Lf ≥ L_0 + 2b
• Inverting minimax tests yields a CI that minimizes EL_β(c, F), where β is the minimax power of the test.
• First need to find the least favorable null and alternative. Problem equivalent to Y ∼ N(µ, σ²I), µ = (f(x_1), …, f(x_n)) ∈ M convex
• Both M_0 = M ∩ {f : Lf ≤ L_0} and M_1 = M ∩ {g : Lg ≥ L_0 + 2b} are convex; least favorable functions minimize the distance between them (Ingster and Suslina, 2003):

  (g*, f*) = argmin_{g∈M_1, f∈M_0} ∑_{i=1}^n (g(x_i) − f(x_i))²

Page 14:

[Figure: least favorable functions in the running example, plotted over x ∈ [−b/C, b/C] with levels L_0, L_0 + b, L_0 + 2b marked:]

  g*(x) = L_0 + b + (b − C|x|)_+
  f*(x) = L_0 + b − (b − C|x|)_+

Page 15:

• g*(x) = L_0 + b + (b − C|x|)_+, f*(x) = L_0 + b − (b − C|x|)_+
• The minimax test is then given by the LR test of µ_0 = (f*(x_1), …, f*(x_n)) against µ_1 = (g*(x_1), …, g*(x_n)): reject for large values of Y′(µ_1 − µ_0)
• The test can be written as rejecting whenever

  L̂(h) − L_0 − b(1 − ∑_i k_T(x_i/h)² / ∑_i k_T(x_i/h)) ≥ σ z_{1−α} (∑_i k_T(x_i/h)²)^{1/2} / ∑_i k_T(x_i/h),

  where k_T(u) = (1 − |u|)_+, h = b/C, and

  L̂(h) = ∑_i (g*(x_i) − f*(x_i)) Y_i / ∑_i (g*(x_i) − f*(x_i)) = ∑_i k_T(x_i/h) Y_i / ∑_i k_T(x_i/h)

• Key feature: non-random bias correction based on worst-case bias; it does not disappear asymptotically
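The estimator and the implied one-sided CI are simple to compute. A minimal sketch, with our own naming; it uses the worst-case bias C ∑_i k_T|x_i| / ∑_i k_T, which coincides with the b(1 − ∑k²/∑k) correction above when h = b/C:

```python
import numpy as np
from scipy.stats import norm

def minimax_onesided_ci(x, y, sigma, C, h, alpha=0.05):
    """One-sided CI [c, inf) for f(0) over F_Lip(C), built around the
    triangular-kernel estimator with a non-random worst-case bias
    correction, as on this slide. Sketch; names are ours."""
    k = np.maximum(1.0 - np.abs(x) / h, 0.0)          # k_T(x_i/h)
    Lhat = np.sum(k * y) / np.sum(k)                  # \hat{L}(h)
    max_bias = C * np.sum(k * np.abs(x)) / np.sum(k)  # attained at f = C|x|
    sd = sigma * np.sqrt(np.sum(k**2)) / np.sum(k)
    return Lhat, Lhat - max_bias - norm.ppf(1 - alpha) * sd
```

With the simulated (x, y) from the earlier sketch: Lhat, c = minimax_onesided_ci(x, y, 1.0, 2.0, h=0.3).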

Page 16:

general setup

• In general, we observe Y = Kf + σϵ, where ϵ is standard normal and K is a linear operator, with ⟨Kg, Kf⟩ = ∑_i (Kg)(x_i)(Kf)(x_i)
• Heteroskedasticity handled by setting Kf = (f(x_1)/σ(x_1), …, f(x_n)/σ(x_n)), Y = (Y_1/σ(x_1), …, Y_n/σ(x_n))
• Define the modulus of continuity (Donoho and Liu, 1991):

  ω(δ; F) = sup{L(g − f) : ‖K(g − f)‖ ≤ δ, g, f ∈ F}

  Denote the solutions by g*_δ, f*_δ, and let f*_{M,δ} = (g*_δ + f*_δ)/2

• The problem of finding LF functions is equivalent to finding ω^{−1}(·; F), so for the running example, g* = g*_{ω^{−1}(2b)}, f* = f*_{ω^{−1}(2b)}

Page 17:

class of optimal estimators

Define

  L̂_{δ,F} = Lf*_{M,δ} + (ω′(δ; F)/δ) ⟨K(g*_δ − f*_δ), Y − Kf*_{M,δ}⟩

These estimators minimize maximum bias given a variance bound (and vice versa) (Low, 1995). Their maximum and minimum bias over F satisfy

  maxbias_F(L̂_{δ,F}) = −minbias_F(L̂_{δ,F}) = ½(ω(δ; F) − δω′(δ; F)).

In the running example: L̂(h) = L̂_{ω^{−1}(2hC), F_Lip(C)}

Page 18:

centrosymmetry and translation invariance

When F has additional structure, L̂_δ simplifies:

• If F is translation invariant (for some ι ∈ F with Lι = 1, f + cι ∈ F for all f ∈ F and c ∈ R), then δ/ω′(δ; F) = ⟨K(g*_δ − f*_δ), Kι⟩, and the estimator has Nadaraya-Watson form:

  L̂_{δ,F} = Lf*_{M,δ} + ⟨K(g*_δ − f*_δ), Y − Kf*_{M,δ}⟩ / ⟨K(g*_δ − f*_δ), Kι⟩.

• If F is centrosymmetric (f ∈ F ⟹ −f ∈ F), then f*_δ = −g*_δ, and

  L̂_{δ,F} = (2ω′(δ; F)/δ) ⟨Kg*_δ, Y⟩ = ⟨Kg*_δ, Y⟩ / ⟨Kg*_δ, Kι⟩.

Page 19:

Theorem 1 (One-sided minimax CI)

Let

  c_{α,δ,F} = L̂_{δ,F} − bias_F(L̂_{δ,F}) − z_{1−α} σ ω′(δ; F).

Then [c_{α,δ,F}, ∞) is a 1 − α CI for Lf, with coverage minimized at f*_δ. For β = Φ(δ/σ − z_{1−α}), it minimizes EL_β(c, F) among all one-sided 1 − α CIs. All quantiles of excess length are maximized at g*_δ. The minimax excess length at quantile β is EL_β(c_{α,δ,F}; F) = ω(δ; F).

• β is the minimax power of the underlying tests (under translation invariance)
• Bias correction is based on the worst-case bias under F, and is non-random
• In the running example, using bandwidth h minimizes the β quantile of excess length with β = Φ(ω^{−1}(2hC)/σ − z_{1−α})

Page 20:

• For estimation and two-sided CIs, exact optimality results are hard
• Donoho (1994) shows that procedures based on L̂_{δ,F} are minimax optimal if we restrict attention to affine estimators
• Results use the fact that the problem is just as hard if we know that f is in the one-dimensional subfamily {λf*_δ + (1 − λ)g*_δ : 0 ≤ λ ≤ 1}

To state these results, consider Z ∼ N(θ, 1), θ ∈ [−τ, τ]

• The minimax linear estimator is c_ρ(τ)Z, c_ρ(τ) = τ²/(1 + τ²), with minimax risk ρ(τ) = τ²/(1 + τ²)
• The shortest fixed-length CI is c_χ(τ)Z ± χ_α(c_χ(τ)Z); the solution is characterized in Drees (1999), similar in spirit to Imbens and Manski (2004)
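Both coefficients are easy to compute numerically. A sketch for the bounded-normal-means problem above, with our own naming; c_χ is found by brute-force grid search rather than via the Drees (1999) characterization:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def halflength(c, tau, alpha=0.05):
    """Half-length of the fixed-length CI around cZ when Z ~ N(theta, 1)
    and |theta| <= tau: worst-case |bias| = (1 - c) tau, sd = c."""
    B, sd = (1.0 - c) * tau, c
    cover = lambda chi: norm.cdf((chi - B) / sd) - norm.cdf((-chi - B) / sd)
    return brentq(lambda chi: cover(chi) - (1 - alpha), 1e-12, B + 10 * sd)

tau = 2.0
c_rho = tau**2 / (1 + tau**2)                 # minimax-MSE shrinkage c_rho(tau)
grid = np.linspace(0.01, 1.0, 500)            # grid search for c_chi(tau)
c_chi = grid[np.argmin([halflength(c, tau) for c in grid])]
print(c_rho, c_chi)
```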

Page 21:

optimal shrinkage in bounded normal means

[Figure: shrinkage coefficients c_ACI and critical values χ as functions of τ ∈ [0, 5], for confidence levels 90%, 95%, and 95% (estimation).]

Page 22:

Theorem (Donoho (1994))

The minimax MSE affine estimator is L̂_{δ,F}, where δ solves

  max_{δ>0} (ω(δ; F)/δ) √(ρ(δ/(2σ))) σ,

and the optimal δ satisfies c_ρ(δ/(2σ)) = δω′(δ; F)/ω(δ; F).

The shortest fixed-length affine CI is L̂_{δ,F} ± (ω(δ; F)/δ) χ_α(δ/(2σ)) σ, where δ solves

  max_{δ>0} (ω(δ; F)/δ) χ_α(δ/(2σ)) σ,

and the optimal δ satisfies c_χ(δ/(2σ)) = δω′(δ; F)/ω(δ; F).

Page 23:

• For example, to find the minimax MSE optimal bandwidth in the running example, solve

  δ²/(4σ² + δ²) = c_ρ(δ/(2σ)) = δω′(δ; F)/ω(δ; F) = δ²/(2ω(δ; F) ∑_i g*_δ(x_i)),

  which yields

  σ² = C²h²(∑_i k_T(x_i/h) − ∑_i k_T(x_i/h)²).

  Asymptotically,

  h_opt,MSE = (3σ²/(C²n f_X(0)))^{1/3} + o_p(1)

• Can also use these results to derive optimal rates of convergence (e.g. Fan (1993); Cheng, Fan, and Marron (1997)): n^{−1/3} here
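A sketch solving the displayed bandwidth equation numerically and comparing it with the asymptotic formula; our naming, with the design from the simulated example above, so f_X(0) = 1/2:

```python
import numpy as np
from scipy.optimize import brentq

def mse_bandwidth(x, sigma, C):
    """Solve sigma^2 = C^2 h^2 (sum_i k_T(x_i/h) - sum_i k_T(x_i/h)^2)
    for the minimax-MSE bandwidth h in the running example. Sketch."""
    def gap(h):
        k = np.maximum(1.0 - np.abs(x) / h, 0.0)
        return C**2 * h**2 * (k.sum() - (k**2).sum()) - sigma**2
    return brentq(gap, 1e-6, np.abs(x).max())

n, sigma, C, fX0 = 500, 1.0, 2.0, 0.5
x = np.sort(np.random.default_rng(0).uniform(-1.0, 1.0, n))
h = mse_bandwidth(x, sigma, C)
h_asym = (3 * sigma**2 / (C**2 * n * fX0)) ** (1 / 3)   # asymptotic formula
print(h, h_asym)   # the two should be close
```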

Page 24:

adaptive inference

• One-sided CIs focus on good performance under the least favorable f ∈ F, which may be too pessimistic
• Alternative: optimize excess length over a smaller class G of smoother functions,

  inf_c sup_{g∈G} q_{g,β}(Lg − c),

  among c that satisfy inf_{f∈F} P_f(Lf ≥ c) ≥ 1 − α.

• Amounts to "directing power" at smooth alternatives, while maintaining size over all of F

Page 25:

adaptive inference in running example

• Associated testing problem in the running example: H0: Lf ≤ L_0 against H1: {Lf ≥ L_0 + 2b} ∩ {f ∈ G}
• Inverting these minimax tests will yield a CI that minimizes the β quantile of excess length over G, where β is the minimax power of the test.
• As long as G is convex, this is still equivalent to testing a convex null against a convex alternative ⟹ LF functions minimize the distance between the sets:

  (f*, g*) = argmin_{f∈F, g∈G} ∑_{i=1}^n (g(x_i) − f(x_i))²,  s.t. Lg ≥ L_0 + 2b, Lf ≤ L_0

Page 26:

• To make this concrete, consider G = {g(x) : g(x) = c, c ∈ R} (i.e. g(x) = cι), and suppose Lf ≥ L_0 + b under the alternative
• Solution: f*(x) = L_0 + b − (b − C|x|)_+ (as before), g*(x) = L_0 + b

[Figure: two panels showing the least favorable f* and g*, with levels L_0, L_0 + b, L_0 + 2b marked; g* is the constant function L_0 + b, f* is kinked at 0, plotted over [−b/C, b/C] and [−2b/C, 2b/C].]

Page 27:

• But g* − f* is the same as before, so the estimator is as before:

  L̂(h) = ∑_i (g*(x_i) − f*(x_i)) Y_i / ∑_i (g*(x_i) − f*(x_i)) = ∑_i k_T(x_i/h) Y_i / ∑_i k_T(x_i/h)

• Worst-case bias under the null and variance are the same as before ⟹ same CI as before

Summary
The one-sided CI that minimizes maximum excess length over F for β = Φ(δ/σ − z_{1−α}), subject to 1 − α coverage, also minimizes EL_{β′}(c; span(ι)) for β′ = Φ(δ/(2σ) − z_{1−α})

Page 28:

setup for general adaptivity result

• Define the ordered modulus of continuity (Cai and Low, 2004):

  ω(δ; F, G) = sup{Lg − Lf : ‖K(g − f)‖ ≤ δ, f ∈ F, g ∈ G},

  so that ω(δ; F) = ω(δ; F, F), and define

  L̂_{δ,F,G} = Lf*_{M,δ} + (ω′(δ; F, G)/δ) ⟨K(g*_δ − f*_δ), Y − Kf*_{M,δ}⟩,

  so that L̂_{δ,F,F} = L̂_{δ,F}

• The bias formulas generalize:

  maxbias_F(L̂_{δ,F,G}) = −minbias_G(L̂_{δ,F,G}) = ½(ω(δ; F, G) − δω′(δ; F, G)),

• In the running example, L̂(h) = L̂_{ω^{−1}(hC; F,G), F, G}

Page 29:

Theorem 2 (One-sided adaptive CIs)

Let F and G ⊆ F be convex, and suppose that f*_δ and g*_δ achieve the ordered modulus at δ. Let

  c_{α,δ,F,G} = L̂_{δ,F,G} − bias_F(L̂_{δ,F,G}) − z_{1−α} σ ω′(δ; F, G).

Then, for β = Φ(δ/σ − z_{1−α}), c_{α,δ,F,G} minimizes EL_β(c, G) among all one-sided 1 − α CIs, where Φ denotes the standard normal cdf. Minimum coverage is attained at f*_δ and equals 1 − α. All quantiles of excess length are maximized at g*_δ. The worst-case βth quantile of excess length is EL_β(c_{α,δ,F,G}, G) = ω(δ; F, G).

Page 30:

non-adaptivity under centrosymmetry

• Suppose F is centrosymmetric and

  f*_{δ,F,G} − g*_{δ,F,G} ∈ F.   (1)

  Holds for G "smooth enough", e.g. G = span(ι) under translation invariance, as in the running example

• Then 0 and f*_{δ,F,G} − g*_{δ,F,G} also solve the modulus, and since ω(δ; F) = sup{−2Lf : ‖Kf‖ ≤ δ/2, f ∈ F} under centrosymmetry,

  ω(δ; F, G) = ω(δ; F, {0}) = sup_{f∈F} {−Lf : ‖Kf‖ ≤ δ} = ½ ω(2δ; F),

• Implies c_{α,δ,F,G} = c_{α,δ,F,{0}} = c_{α,2δ,F}.

Page 31:

Theorem 3 (Non-adaptivity of one-sided CIs under centrosymmetry)

Let F be centrosymmetric. Then the one-sided CI that is minimax for the βth quantile also optimizes EL_β̃(c; G) for any G such that the solution to the ordered modulus problem exists and satisfies (1), where

  β̃ = Φ((z_β − z_{1−α})/2).

In particular, the minimax CI optimizes EL_β̃(c; {0}).

• The CI that is minimax for median excess length among 95% CIs also optimizes the Φ(−1.645/2) ≈ 0.205 quantile under the zero function.

Page 32:

bound on adaptivity

• The CI [c_{α,σ(z_β+z_{1−α}),F}, ∞) that is minimax for the βth quantile of excess length is unbiased at 0 and satisfies, with δ = σ(z_β + z_{1−α}),

  q_{0,β}(L0 − c_{α,δ,F}) = ½(ω′(δ; F)δ + ω(δ; F)).

  Hence,

  ω(δ; F, G) / q_{0,β}(L0 − c_{α,δ,F}) = ω(δ; F, G) / (½(ω′(δ)δ + ω(δ))) = ω(2δ) / (ω′(δ)δ + ω(δ)).

• Typically, ω(δ; F) = Aδ^r(1 + o(1)) as n → ∞ for some constant A, where r determines the optimal rate of convergence of the MSE. The ratio above is then 2^r/(1 + r), so for 1/2 ≤ r ≤ 1 the minimax CI has asymptotic efficiency of at least 94.3% when indeed f = 0.
• Adapting to a G that includes 0 is at least as hard as adapting to the zero function

Page 33:

implications of non-adaptivity result

• Need a shape restriction or non-convexity for adaptation
• Similar to the impossibility results in Low (1997) and Cai and Low (2004) for two-sided CIs, and in contrast to positive results for MSE
• The minimax rate of CI shrinkage describes the actual rate for all functions in the class
• Possible to construct estimators that do better when f is smoother, but impossible to tell how well you did
• For valid inference in cases where F is convex and centrosymmetric, one has to think hard about the appropriate C
• Not possible to estimate it from the data and do better than if we assume the worst possible case

Page 34:

adaptivity under monotonicity

• Suppose, in the running example, that we know f is non-increasing
• Least favorable functions without and with directing power:

[Figure: two panels showing the least favorable f* and g* over x ∈ [−2b/C, 2b/C], without directing power (left) and with directing power (right), with levels L_0, L_0 + b, L_0 + 2b marked.]

Page 35:

• Without directing power, the optimal estimator is again given by the triangular kernel, but now includes a bias correction (to ensure max bias = −min bias):

  L̂(h) = (∑_i k_i Y_i/σ_i²)/(∑_i k_i/σ_i²) + b (∑_i sign(x_i) k_i(1 − k_i)/σ_i²)/(∑_i k_i/σ_i²),

  where k_i = k_T(x_i/h), and the optimal bandwidth is bigger than without monotonicity. About a 20% reduction in quantiles of excess length

• With directing power, the optimal estimator averages all positive observations, and averages negative observations using the triangular kernel. Excess length shrinks at the parametric rate.
• When the Lipschitz assumption is dropped and only monotonicity is maintained, the optimal estimator averages all positive observations, and excess length still shrinks at the parametric rate

Page 36:

two-sided adaptive cis

• Fixed-length confidence intervals cannot be adaptive
• Cai and Low (2004) construct random-length confidence intervals that are within a constant factor of a lower bound on expected length
• Cai, Low, and Xia (2013) construct random-length confidence intervals under shape constraints that have near-minimum expected length for each individual function (again within a constant)

Page 37:

• Natural best-case scenario for two-sided CIs: optimize expected length at a single function, G = {g}
• By Pratt (1961), inverting UMP tests against G achieves exactly this
• Again amounts to testing a convex null against a convex alternative; the LF function under the null solves

  f*_θ = argmin_{f∈F, Lf≤θ} ∑_{i=1}^n (f(x_i) − g(x_i))²

Theorem 4 (Adaptation to a function)

The CI with minimum expected measure E_g λ(C), subject to 1 − α coverage on F, inverts the family of tests ϕ_θ, where ϕ_θ rejects for large values of ⟨K(g − f*_θ), Y⟩ with critical value given by its 1 − α quantile under f*_θ.

Page 38:

cis based on suboptimal estimators

• What is the efficiency loss of CIs around suboptimal affine estimators?
• Affine estimators are normal, with a variance that does not depend on f and a bias that does
• For each performance criterion, only the worst-case bias and the variance matter: if we can calculate them, we can also calculate the maximum MSE, and the form of one- and two-sided CIs
• Let χ_α(B) solve P(|Z + B| ≤ χ) = Φ(χ − B) − Φ(−χ − B) = 1 − α. Then for an estimator L̂ with variance V and maximum bias B, the shortest fixed-length CI is

  L̂ ± V^{1/2} χ_α(B/V^{1/2})
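The critical value χ_α(B) has no closed form, but it is a one-line root-finding problem. A sketch, with our own naming:

```python
from scipy.stats import norm
from scipy.optimize import brentq

def chi_alpha(B, alpha=0.05):
    """Solve Phi(chi - B) - Phi(-chi - B) = 1 - alpha: the fixed-length
    critical value for a unit-variance normal estimator with worst-case
    absolute bias B. At B = 0 this recovers z_{1-alpha/2}."""
    B = abs(B)
    f = lambda chi: norm.cdf(chi - B) - norm.cdf(-chi - B) - (1 - alpha)
    return brentq(f, 1e-10, B + 10.0)

print(chi_alpha(0.0))   # ~1.96
print(chi_alpha(1.0))   # larger, to cover the worst-case bias
```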

Page 39:

Theorem 5 (Suboptimal estimators)

Let L̂ = a + ⟨w, Y⟩ be an affine estimator. Then [L̂ − bias_F(L̂) − ‖w‖z_{1−α}σ, ∞) is a valid CI, and L̂ ± σ‖w‖ χ_α(bias_F(L̂)/(σ‖w‖)) is the shortest fixed-length 1 − α CI centered at L̂.

• Not a deep result, but very useful: allows one to compute the exact efficiency loss from using suboptimal estimators, or the size distortion of CIs with (pointwise) asymptotic justification
• An asymptotic version of this theorem can be used to calculate the asymptotic efficiency loss from using a suboptimal kernel and/or a suboptimal bandwidth

Page 40:

suboptimal estimators in running example

• Consider some other kernel k in the running example, L̂ = ∑_i k(x_i/h)Y_i / ∑_i k(x_i/h)
• Variance: σ² ∑_i k(x_i/h)² / (∑_i k(x_i/h))²
• Maximum bias, since f ∈ F_Lip(C):

  |∑_i k(x_i/h)(f(x_i) − f(0)) / ∑_i k(x_i/h)| ≤ C ∑_i |k(x_i/h)||x_i| / ∑_i k(x_i/h).

  The bound is attained at f(x) = C|x| if k ≥ 0; otherwise it gives an upper bound.
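Combining Theorem 5 with the bias bound above gives an honest fixed-length CI for any kernel. A sketch with our own naming (χ_α solved as in the previous sketch; the bias bound is exact when k ≥ 0):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def honest_ci(x, y, sigma, C, h, kernel, alpha=0.05):
    """Fixed-length CI around a kernel estimator of f(0) over F_Lip(C),
    using the worst-case bias bound from this slide. Sketch."""
    k = kernel(x / h)
    Lhat = np.sum(k * y) / np.sum(k)
    max_bias = C * np.sum(np.abs(k) * np.abs(x)) / np.sum(k)
    sd = sigma * np.sqrt(np.sum(k**2)) / np.sum(k)
    B = max_bias / sd
    chi = brentq(lambda c: norm.cdf(c - B) - norm.cdf(-c - B) - (1 - alpha),
                 1e-10, B + 10.0)
    return Lhat - sd * chi, Lhat + sd * chi

# e.g. a uniform kernel instead of the optimal triangular one:
# lo, hi = honest_ci(x, y, 1.0, 2.0, 0.3, lambda u: (np.abs(u) <= 1.0) * 1.0)
```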

Page 41:

Finite-Sample results

Asymptotic results

Applications

Conclusion

Page 42:

renormalization

• In many cases (depending on L and the smoothness class F, but including inference at a point and RD), the nonparametric regression problem is equivalent to the white noise model Y(dt) = f(t)dt + σ dW(t)
• See Brown and Low (1996) and Donoho and Low (1992)
• In the running example, this holds with σ² = σ(0)²/(n f_X(0))
• Suppose F = {f : J(f) ≤ C} for some J (as in the running example), and that for the white noise model, the following functionals are homogeneous:

  J(a f(·/h)) = a h^{−s_J} J(f)
  ⟨K a_1 f(·/h), K a_2 g(·/h)⟩ = a_1 a_2 h^{−2s_K} ⟨Kf, Kg⟩
  L(a f(·/h)) = a h^{−s_L} Lf

• In the running example, we have s_L = 0, s_J = 1, s_K = −1/2

Page 43:

• The (single-class) modulus problem then renormalizes: if g*_{C,δ}, f*_{C,δ} solve max |L(f_1 − f_0)| s.t. ‖K(f_1 − f_0)‖ ≤ δ, J(f_1) ≤ C, J(f_0) ≤ C, then

  g*_{C,δ} = a g*_{1,1}(·/h),  f*_{C,δ} = a f*_{1,1}(·/h),  ω_C(δ) = C^{1−r} δ^r ω_1(1),

  where a = δ^{−s_J/(s_K−s_J)} C^{s_K/(s_K−s_J)}, h = (C/δ)^{1/(s_K−s_J)}, and

  r = (s_L − s_J)/(s_K − s_J).

  In the running example (s_L = 0, s_J = 1, s_K = −1/2), this gives r = 2/3.

• Root minimax MSE and (excess) length of CIs will shrink at rate n^{−r/2}

Page 44:

optimal bandwidths

• The class of optimal estimators can be written as

  L̂_δ = L̂(h) = h^{2s_K−s_L} ⟨K k(·/h), Y⟩ + C h^{s_J−s_L} (L f_{M,1,1} − ⟨K k, K f_{M,1,1}⟩),

  with h = (C/δ)^{1/(s_K−s_J)} and kernel k = r ω_1(1)(g*_{1,1} − f*_{1,1}).

• Recall that the optimal δ is given by c_ℓ(δ/(2σ)) = δω′(δ)/ω(δ) (= r asymptotically, since ω(δ) ≈ Aδ^r). Plugging into the definition of h yields the optimal bandwidth

  h = (2σ c_ℓ^{−1}(r)/C)^{1/(s_J−s_K)},

  where, for one-sided CIs, c_β^{−1}(r) = (z_β + z_{1−α})/2

Page 45:

ratios of optimal bandwidths, s_K = −1/2, s_L = 0

[Figure: ratios of optimal CI bandwidths to the minimax-MSE bandwidth as a function of r ∈ [0.4, 0.9], for one-sided CIs (quantiles q = 0.5 and q = 0.8) and two-sided CIs, at confidence levels 0.95 and 0.99; the ratios lie between 1.0 and 2.0.]

Page 46:

takeaways from picture

• Optimal bandwidth ratios depend only on the dilation exponents s_L, s_K and s_J:

  h_ℓ/h_ℓ′ = (c_ℓ^{−1}(r)/c_ℓ′^{−1}(r))^{1/(s_J−s_K)}

• Bandwidths are of the same order in all cases: no undersmoothing
• For one-sided CIs, the bandwidth gets larger with the quantile that we are minimizing
• For 95+% two-sided CIs, if s_L = 0 and s_K = −1/2, the optimal fixed-length CI uses a larger bandwidth than the optimal MSE bandwidth

Page 47:

• For any bandwidth h, the worst-case bias is (C/2)((1 − r)/r) h^{s_J−s_L} (∫k²)^{1/2}
• Can use this worst-case bias to construct CIs around L̂(h)
• How much bigger are two-sided CIs around the minimax-MSE bandwidth? The ratio of CI lengths is given by

  (c_{χ,α}^{−1}(r)/c_ρ^{−1}(r))^{r−1} · χ_α(c_{χ,α}^{−1}(r)(1 − 1/r)) / χ_α(c_ρ^{−1}(r)(1 − 1/r)),

  where χ_α(B) solves P(|N(0,1) + B| ≤ χ) = Φ(χ − B) − Φ(−χ − B) = 1 − α

• Need to use χ_α(√((1 − r)/r)) instead of z_{1−α/2} as the critical value to ensure coverage for the CI around the minimax-MSE bandwidth
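Both ratios above can be evaluated numerically from the bounded-normal-means quantities: c_ρ^{−1}(r) = √(r/(1 − r)) follows from c_ρ(t) = t²/(1 + t²), and c_{χ,α}^{−1}(r) can be obtained by numerically inverting the CI shrinkage coefficient. A sketch with our own naming, reusing the χ_α solver from the earlier sketch and setting the exponents to the running example:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq, minimize_scalar

def chi_alpha(B, alpha=0.05):
    B = abs(B)
    return brentq(lambda c: norm.cdf(c - B) - norm.cdf(-c - B) - (1 - alpha),
                  1e-10, B + 10.0)

def c_chi(tau, alpha=0.05):
    # shrinkage coefficient of the shortest fixed-length affine CI when
    # Z ~ N(theta, 1), |theta| <= tau (cf. the bounded normal means slides)
    obj = lambda c: c * chi_alpha((1.0 - c) * tau / c, alpha)
    return minimize_scalar(obj, bounds=(1e-3, 1.0 - 1e-9), method="bounded").x

r, alpha, sJ, sK = 2/3, 0.05, 1.0, -0.5        # running-example exponents
t_rho = np.sqrt(r / (1 - r))                    # c_rho^{-1}(r)
t_chi = brentq(lambda t: c_chi(t, alpha) - r, 1e-2, 50.0)   # c_chi_alpha^{-1}(r)

bw_ratio = (t_chi / t_rho) ** (1 / (sJ - sK))   # h_FLCI / h_MSE (previous slide)
len_ratio = ((t_chi / t_rho) ** (r - 1)
             * chi_alpha(t_chi * (1 - 1 / r), alpha)
             / chi_alpha(t_rho * (1 - 1 / r), alpha))
print(bw_ratio, len_ratio)   # length ratio just below 1: a small price
```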

Page 48:

length of optimal cis relative to cis around mse bw

[Figure: length of the optimal fixed-length CI relative to the CI around the minimax-MSE bandwidth, as a function of r ∈ [0.5, 0.9], for confidence levels 0.7, 0.95, and 0.99; the length ratio stays between 0.94 and 1.00.]

Page 49:

"critical values" for ci around mse bandwidth

[Figure: critical values for the CI around the minimax-MSE bandwidth as a function of r ∈ [0.6, 0.9], for confidence levels 0.9, 0.95, and 0.99; values range from about 2.0 to 3.0.]

Page 50:

undercoverage with usual critical values

[Figure: coverage of CIs around the minimax-MSE bandwidth when the usual critical values are used, as a function of r ∈ [0.6, 0.9], for nominal levels 0.9, 0.95, and 0.99; actual coverage falls to between roughly 0.80 and 0.95.]

Page 51:

takeaways from pictures

• To construct two-sided CIs, can keep the same bandwidth as for estimation; the price is < 2% longer CIs at the 95% level
• Need to use a slightly higher critical value to ensure proper coverage

Page 52:

suboptimal kernels

• Results so far assumed the optimal kernel
• Under renormalization, maximum bias and variance renormalize in a similar way for suboptimal kernels
• For any kernel k, let h_k be the bandwidth that equates the maximum bias and the root variance, and let w(k) = se(L̂_k(h_k)) = sup_{f∈F} bias_f(L̂_k(h_k))
• Suppose the criterion scales linearly with maximum bias and root variance

Page 53:

Theorem 6 (Efficiency loss of suboptimal kernels)

1. The relative efficiency of two kernels k and k̃ (where the optimal bandwidth is used in both cases) does not depend on the performance criterion, and is given by w(k̃)/w(k)
2. Results for ratios of optimal bandwidths remain unchanged for suboptimal kernels
3. The efficiency loss from using a bandwidth optimal for a different criterion, rather than the bandwidth optimal for the criterion of interest, remains unchanged for suboptimal kernels

Page 54:

corollaries

• The bounds for minimax MSE efficiency of different kernels in Cheng, Fan, and Marron (1997): 1. are tight; and 2. hold for other efficiency criteria
• Using the minimax MSE bandwidth for two-sided CIs is a good idea no matter what kernel one uses

Page 55:

Finite-Sample results

Asymptotic results

Applications

Conclusion

Page 56:

rd

• Interested in Lf = lim_{x↓0} f(x) − lim_{x↑0} f(x).
• Let f+(x) = f(x)I(x > 0) and f−(x) = −f(x)I(x < 0), so that f = f+ − f−.
• We consider the class

  F_{RDT,2}(C) = {f+ − f− : f+ ∈ F_{T,2}(C; R+), f− ∈ F_{T,2}(C; R−)},

  where F_{T,2}(C; X) is the class from Sacks and Ylvisaker (1978),

  F_{T,2}(C; X) = {f : |f(x) − f(0) − f′(0)x| ≤ Cx² for all x ∈ X}.

• F_{T,2} is also used in Cheng, Fan, and Marron (1997) for estimation at a point, which justifies much of empirical RD practice

Page 57:

least favorable functions

Least favorable functions are symmetric, g*_δ(x) = −f*_δ(x), and have the form

  g*_δ(x) = [(b − b− + d+x − Cx²)_+ − (b − b− + d+x + Cx²)_−] 1(x > 0)
            − [(b− + d−x − Cx²)_+ − (b− + d−x + Cx²)_−] 1(x < 0)

with b−, d+, d− chosen to solve

  0 = ∑_{i=1}^n g−,b,C(x_i) x_i/σ²(x_i),  0 = ∑_{i=1}^n g+,b,C(x_i) x_i/σ²(x_i),

and

  ∑_{i=1}^n g+,b,C(x_i)/σ²(x_i) = ∑_{i=1}^n g−,b,C(x_i)/σ²(x_i)

Page 58:

optimal kernel

[Figure: the optimal equivalent kernel k(u), plotted for u ∈ [0, 1.46].]

• Asymptotically, g*_δ corresponds to the difference between two kernel estimators, with bandwidths chosen to equate the number of effective observations
• The optimal kernel is the same as for inference at a point, derived in Cheng, Fan, and Marron (1997) using an upper bound on the minimax MSE

Page 59:

application to Lee (2008)

• RD design:
  • X_i = margin of victory in the previous election for the Democratic party (negative for a Republican victory)
  • Y_i = Democratic vote share in the given election
  • D_i = I(X_i ≥ 0) = indicator for Democratic incumbency
  • n = 6558 observations on elections between 1946 and 1998
• For simplicity, assume homoscedastic errors on each side of the cutoff; use the estimates σ²−(0) = 155.3 and σ²+(0) = 210.3, derived using the Imbens and Kalyanaraman (2012) bandwidth
• LF functions are very close to scaled versions of the asymptotically optimal kernel
• Unless C is very small, results are in line with Lee (2008) and Imbens and Kalyanaraman (2012)

Page 60:

minimax mse estimator as function of c

[Figure: minimax MSE estimate of the electoral advantage (%) as a function of C, together with the squared-bias and variance shares and b; the horizontal axis also shows the effective number of observations. Note: L = C.]

Page 61:

optimal fixed-length cis

[Figure: optimal fixed-length CIs (lower, estimate, upper) for the electoral advantage (%) as a function of C, together with the squared-bias and variance shares and b; the horizontal axis also shows the effective number of observations.]

Page 62:

Finite-Sample results

Asymptotic results

Applications

Conclusion

Page 63:

summary

1. Give exact results for (i) minimax optimal and (ii) adaptive one-sided CIs.
   • CIs use a non-random bias correction based on the worst-case bias
   • Adaptivity without shape restrictions is severely limited, as in the two-sided case.
   • Impossible to avoid thinking hard about the appropriate C
2. Give an exact solution to the problem of "adaptation to a function"
3. Use these finite-sample results to characterize optimal tuning parameters for different performance criteria
   • Building CIs around the minimax MSE bandwidth is nearly optimal
   • Undersmoothing cannot be optimal