
optimal inference in a class of nonparametric models

Timothy Armstrong (Yale University)

Michal Kolesár (Princeton University)

September 2015

setup

• Interested in inference on a linear functional Lf in the regression model

yi = f(xi) + ui,   ui ∼ N(0, σ²(xi)).

xi is fixed, σ²(xi) is known.

• Important special cases:
  1. Inference at a point: Lf = f(0)
  2. Regression discontinuity: Lf = f(0+) − f(0−)
  3. ATE under unconfoundedness: xi = (wi, di), Lf = (1/n) ∑i (f(wi, 1) − f(wi, 0))
  4. Partially linear model


key assumption

Convexity Assumption

f ∈ F , a known convex set

Rules out e.g. sparsity, but not usual shape/smoothness restrictions:

Monotonicity: F = {f : f non-increasing}

Lipschitz class: FLip(C) = {f : |f(x1) − f(x2)| ≤ C|x1 − x2|} (or Hölder class generalizations)

Taylor class: FT,2(C) = {f : |f(x) − f(0) − f′(0)x| ≤ Cx²} (useful for RD / inference at a point)

Sign restrictions in linear regression: {f(x) = x′β : βj ≥ 0, j ∈ J}

• Will take C as known if necessary, and ask later if this can be relaxed.


notions of finite-sample optimality

• Normality =⇒ can derive finite-sample procedures that minimize the worst-case loss over G ⊆ F

• Without Normality, procedures will be valid and optimal asymptotically under regularity conditions, uniformly over F

1. Setting G = F yields minimax procedures.
   • Problem well studied if loss is MSE; general solution in Donoho (1994), used to derive optimal kernels and rates of convergence (Stone, 1980; Fan, 1993; Cheng, Fan, and Marron, 1997)
   • Donoho (1994) derives fixed-length confidence intervals (CIs) that are almost optimal
2. G ⊂ F of "smoother" functions: adaptive inference ("directing power")
   • For two-sided CIs, Cai and Low (2004) give bounds


new finite-sample results: one-sided cis

Derive one-sided CIs, [c, ∞), that minimize maximum quantiles of excess length over G, with c = L − biasF(L) − z1−α sd(L), for an optimal estimator L

• For the case F = G (minimax CIs), L has the same form as the minimax MSE estimators / fixed-length CIs of Donoho (1994)

• We show that if F is symmetric, adaptation is severely limited.
  • Adaptation requires non-convexity or shape restrictions: otherwise, one cannot do better at smaller C while maintaining coverage for larger C
  • Conversely, any inference method that claims to do better than minimax CIs when f is smooth must be size distorted for some f ∈ F(C)

• Related to Low (1997), who shows that adapting to derivative smoothness classes is limited for two-sided (random-length) CIs.


new finite-sample results: two-sided cis

We derive two-sided CIs that minimize expected length over G = {д}, solving the problem of "adaptation to a function" posed in Cai, Low, and Xia (2013)

• Can be used to bound scope for adaptivity


implications for optimal bandwidth choice

Asymptotically, optimal procedures often correspond to kernel estimators with a fixed (optimal) kernel and a bandwidth that depends on the optimality criterion. We find that for RD and inference at a point:

• Optimal 95% fixed-length CIs use a larger bandwidth than minimax MSE estimators.
• Undersmoothing cannot be optimal
• Recentering CIs by estimating the bias cannot be optimal: it is essentially equivalent to using a higher-order kernel and undersmoothing (Calonico, Cattaneo, and Titiunik, 2014).
• The difference is small: a CI around the minimax MSE estimator is only 1% longer
• In practice, one can keep the same bandwidth as for estimation and construct the CI around it using a worst-case bias correction


applications

We apply the general results to:

1. RD with F = {f+ − f− : f± ∈ FT,2(C)}, as in Cheng, Fan, and Marron (1997)
   • Optimal bandwidths balance the number of "effective observations" on each side of the cutoff
   • Illustrate with empirical application from Lee (2008)
2. Linear regression with β possibly constrained (sign restrictions, sparsity, elliptical constraints)
3. Sample average treatment effect under unconfoundedness under a Hölder class (separate paper)


incomplete list of related literature

• Stats literature on minimax estimation/inference/rates of convergence/adaptivity: Ibragimov and Khas'minskii (1985), Donoho and Liu (1991), Donoho and Low (1992), Donoho (1994), Low (1995), Low (1997), Cai and Low (2004), Cai, Low, and Xia (2013), Cheng, Fan, and Marron (1997), Fan (1993), Fan, Gasser, Gijbels, Brockmann, and Engel (1997), Lepski and Tsybakov (2000)

• "Non-standard" CIs: Imbens and Manski (2004), Müller and Norets (2012), Calonico, Cattaneo, and Titiunik (2014), Calonico, Cattaneo, and Farrell (2015), Rothe (2015)

• Adaptive estimation/inference in econometrics: Sun (2005), Armstrong (2015), Chernozhukov, Chetverikov, and Kato (2014)


Finite-Sample results

Asymptotic results

Applications

Conclusion

running example

• Consider the problem of inference on f(0) when f is restricted to be in the Lipschitz class F = FLip(C) = {f : |f(x1) − f(x2)| ≤ C|x1 − x2|}.

• Assume σ(x) = σ, known


performance criteria

• To measure performance of one-sided 1 − α CIs [c, ∞), we use maximum quantiles of excess length

ELβ(c, G) = sup_{д∈G} qд,β(Lд − c),

where qд,β is the βth quantile under д.

• For two-sided CIs, we focus on fixed-length CIs L ± χ, where L is an estimator and χ is chosen to satisfy coverage:

χα(L) = min{χ : inf_{f∈F} Pf(|L − Lf| ≤ χ) ≥ 1 − α}

• For estimation, we use maximum MSE, RMSE(L) = sup_{f∈F} Ef(L − Lf)²


minimax testing problem

• In running example, Lf = f(0), F = FLip(C); consider the minimax test of H0 : Lf ≤ L0 against H1 : Lf ≥ L0 + 2b

• Inverting minimax tests yields the CI that minimizes ELβ(c, F), where β is the minimax power of the test.

• First need to find the least favorable null and alternative. Problem equivalent to Y ∼ N(µ, σ²I), µ = (f(x1), . . . , f(xn)) ∈ M convex

• Both M0 = M ∩ {f : Lf ≤ L0} and M1 = M ∩ {д : Lд ≥ L0 + 2b} are convex—least favorable functions minimize the distance between them (Ingster and Suslina, 2003):

(д∗, f∗) = argmin_{д∈M1, f∈M0} ∑_{i=1}^n (д(xi) − f(xi))².


[Figure: least favorable functions f∗ and д∗ over x ∈ [−b/C, b/C], with levels L0, L0 + b, L0 + 2b marked]

д∗(x) = L0 + b + (b − C|x|)+
f∗(x) = L0 + b − (b − C|x|)+


• д∗(x) = L0 + b + (b − C|x|)+,  f∗(x) = L0 + b − (b − C|x|)+

• Minimax test then given by the LR test of µ0 = (f∗(x1), . . . , f∗(xn)) against µ1 = (д∗(x1), . . . , д∗(xn)): reject for large values of Y′(µ1 − µ0)

• Test can be written as rejecting whenever

L(h) − L0 − b (1 − ∑_{i=1}^n kT(xi/h)² / ∑_{i=1}^n kT(xi/h)) ≥ ((∑_{i=1}^n kT(xi/h)²)^{1/2} / ∑_{i=1}^n kT(xi/h)) σ z1−α,

where kT(u) = (1 − |u|)+, h = b/C, and

L(h) = ∑_{i=1}^n (д∗(xi) − f∗(xi)) Yi / ∑_{i=1}^n (д∗(xi) − f∗(xi)) = ∑_{i=1}^n kT(xi/h) Yi / ∑_{i=1}^n kT(xi/h)

• Key feature: a non-random bias correction based on the worst-case bias; it doesn't disappear asymptotically
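The following is a minimal numerical sketch of this construction for the running example: a triangular-kernel estimate of f(0) with the non-random worst-case bias correction over FLip(C) and a one-sided lower CI. It assumes a known constant σ; the data, bandwidth, and function names are ours, chosen only for illustration.

```python
import numpy as np
from scipy.stats import norm

def minimax_onesided_ci(x, y, sigma, C, h, alpha=0.05):
    """One-sided CI [c, inf) for f(0) in the running example (Lipschitz class)."""
    kT = np.maximum(1 - np.abs(x) / h, 0)             # triangular kernel k_T(u) = (1 - |u|)_+
    w = kT / kT.sum()                                  # estimator weights
    L_hat = w @ y                                      # L(h) = sum k_T(x_i/h) y_i / sum k_T(x_i/h)
    worst_bias = C * np.sum(np.abs(w) * np.abs(x))     # worst-case bias over F_Lip(C)
    sd = sigma * np.sqrt(np.sum(w ** 2))               # exact standard deviation (sigma known)
    c = L_hat - worst_bias - norm.ppf(1 - alpha) * sd  # non-random bias correction, then z_{1-alpha} * sd
    return L_hat, worst_bias, sd, c

# toy illustration: f(x) = 1 + 0.5 x lies in F_Lip(1)
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 500)
y = 1 + 0.5 * x + rng.normal(scale=0.5, size=x.size)
print(minimax_onesided_ci(x, y, sigma=0.5, C=1.0, h=0.3))
```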


general setup

• In general, we observe Y = Kf + σϵ, where ϵ is standard Normal and K a linear operator, with

⟨Kд, Kf⟩ = ∑i (Kд)(xi) (Kf)(xi),

• Heteroscedasticity handled by setting Kf = (f(x1)/σ(x1), . . . , f(xn)/σ(xn)), Y = (Y1/σ(x1), . . . , Yn/σ(xn)).

• Define modulus of continuity (Donoho and Liu, 1991):

ω(δ; F) = sup{L(д − f) : ‖K(д − f)‖ ≤ δ, д, f ∈ F}

Denote solutions by д∗δ, f∗δ, and let f∗M,δ = (д∗δ + f∗δ)/2

• Problem of finding LF functions equivalent to finding ω⁻¹(·; F), so for the running example, д∗ = д∗_{ω⁻¹(2b)}, f∗ = f∗_{ω⁻¹(2b)}


class of optimal estimators

Define

Lδ,F = Lf∗M,δ + (ω′(δ; F)/δ) ⟨K(д∗δ − f∗δ), Y − Kf∗M,δ⟩

These estimators minimize maximum bias given a variance bound (and vice versa) (Low, 1995). Their maximum and minimum bias over F satisfy

sup_F bias(Lδ,F) = −inf_F bias(Lδ,F) = ½ (ω(δ; F) − δω′(δ; F)),

In running example: L(h) = L_{ω⁻¹(2hC), FLip(C)}


centrosymmetry and translation invariance

When F has additional structure, Lδ,F simplifies:

• If F is translation invariant (for some ι ∈ F with Lι = 1, f + cι ∈ F for all f ∈ F and c ∈ R), then δ/ω′(δ; F) = ⟨K(д∗δ − f∗δ), Kι⟩, and the estimator has Nadaraya-Watson form:

Lδ,F = Lf∗M,δ + ⟨K(д∗δ − f∗δ), Y − Kf∗M,δ⟩ / ⟨K(д∗δ − f∗δ), Kι⟩.

• If F is centrosymmetric (f ∈ F =⇒ −f ∈ F), then f∗δ = −д∗δ, and

Lδ,F = (2ω′(δ; F)/δ) ⟨Kд∗δ, Y⟩ = ⟨Kд∗δ, Y⟩ / ⟨Kд∗δ, Kι⟩,


Theorem 1 (One-sided minimax CI)

Let

cα,δ,F = Lδ,F − biasF(Lδ,F) − z1−α σ ω′(δ; F).

Then [cα,δ,F, ∞) is a 1 − α CI for Lf, with coverage minimized at f∗δ. For β = Φ(δ/σ − z1−α), it minimizes ELβ(c, F) among all one-sided 1 − α CIs. All quantiles of excess length are maximized at д∗δ. The minimax excess length at quantile β is ELβ(cα,δ,F; F) = ω(δ; F).

• β is the minimax power of the underlying tests (under translation invariance)

• Bias correction based on the worst-case bias under F, non-random

• In running example, using bandwidth h minimizes the β quantile of excess length at β = Φ(ω⁻¹(2hC)/σ − z1−α)

• For estimation and two-sided CIs, exact optimality results are hard

• Donoho (1994) shows that procedures based on Lδ,F are minimax optimal if we restrict attention to affine estimators

• Results use the fact that the problem is just as hard if we know that f is in the one-dimensional subfamily {λf∗δ + (1 − λ)д∗δ : 0 ≤ λ ≤ 1}

To state these results, consider Z ∼ N(θ, 1), θ ∈ [−τ, τ]

• Minimax linear estimator is cρ(τ)Z, cρ(τ) = τ²/(1 + τ²), with minimax risk ρ(τ) = τ²/(1 + τ²)

• Shortest fixed-length CI is cχ(τ)Z ± χα(cχ(τ)Z), solution characterized in Drees (1999), similar in spirit to Imbens and Manski (2004)


optimal shrinkage in bounded normal means

[Figure: optimal shrinkage coefficients in the bounded normal means problem as a function of τ, for confidence levels 90% and 95%, and for 95% (estimation)]


Theorem (Donoho (1994))

The minimax MSE affine estimator is Lδ,F, where δ solves

max_{δ>0} (ω(δ; F)/δ) √ρ(δ/(2σ)) σ,

and the optimal δ satisfies cρ(δ/(2σ)) = δω′(δ; F)/ω(δ; F).

The shortest fixed-length affine CI is Lδ,F ± (ω(δ; F)/δ) χα(δ/(2σ)) σ, where δ solves

max_{δ>0} (ω(δ; F)/δ) χα(δ/(2σ)) σ,

and the optimal δ satisfies cχ(δ/(2σ)) = δω′(δ; F)/ω(δ; F).


• For example, to find the minimax MSE optimal bandwidth in the running example, solve

δ²/(4σ² + δ²) = cρ(δ/(2σ)) = δω′(δ; F)/ω(δ; F) = δ²/(2ω(δ; F) ∑i д∗δ(xi)),

which yields

σ² = C²h² (∑i kT(xi/h) − ∑i kT(xi/h)²).

Asymptotically,

hopt,MSE = (3σ²/(C²n fX(0)))^{1/3} + op(1)

• Can also use these results to derive optimal rates of convergence (e.g. Fan (1993); Cheng, Fan, and Marron (1997))—n^{−1/3} here
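As a rough illustration, the finite-sample equation above can be solved for h numerically and compared with the asymptotic formula. The sketch below assumes a uniform design on [−1, 1] (so fX(0) = 1/2); the function name and parameter values are ours.

```python
import numpy as np
from scipy.optimize import brentq

def mse_optimal_bandwidth(x, sigma, C):
    """Solve sigma^2 = C^2 h^2 (sum_i kT(x_i/h) - sum_i kT(x_i/h)^2) for h,
    where kT(u) = (1 - |u|)_+ is the triangular kernel."""
    def gap(h):
        kT = np.maximum(1 - np.abs(x) / h, 0)
        return C**2 * h**2 * (kT.sum() - (kT**2).sum()) - sigma**2
    # gap < 0 for tiny h (no effective observations) and > 0 for large h
    return brentq(gap, 1e-4, np.abs(x).max())

n, sigma, C = 1000, 0.5, 1.0
x = np.linspace(-1, 1, n)                              # uniform design, f_X(0) = 1/2
h_exact = mse_optimal_bandwidth(x, sigma, C)
h_asym = (3 * sigma**2 / (C**2 * n * 0.5)) ** (1 / 3)  # (3 sigma^2 / (C^2 n f_X(0)))^{1/3}
print(h_exact, h_asym)                                 # the two should be close
```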


adaptive inference

• One-sided CIs focus on good performance under the least favorable f ∈ F, which may be too pessimistic

• Alternative: optimize excess length over a smaller class G of smoother functions

inf_c sup_{д∈G} qд,β(Lд − c),

among c that satisfy inf_{f∈F} Pf(Lf ≥ c) ≥ 1 − α.

• Amounts to "directing power" at smooth alternatives, while maintaining size over all of F


adaptive inference in running example

• Associated testing problem in running example: H0 : Lf ≤ L0 against H1 : {Lf ≥ L0 + 2b} ∩ {f ∈ G}

• Inverting these minimax tests will yield the CI that minimizes the β quantile of excess length over G, where β is the minimax power of the test.

• As long as G is convex, this is still equivalent to testing a convex null against a convex alternative =⇒ LF functions minimize the distance between the sets:

(f∗, д∗) = argmin_{f∈F, д∈G} ∑_{i=1}^n (д(xi) − f(xi))²,   s.t. Lд ≥ L0 + 2b, Lf ≤ L0


• To make this concrete, consider G = {д(x) : д(x) = c, c ∈ R} (i.e. д(x) = cι), and suppose Lf ≥ L0 + b under the alternative

• Solution: f∗ = L0 + b − (b − C|x|)+ (as before), д∗(x) = L0 + b

[Figure: least favorable f∗ and д∗ without directing power (left panel, support [−b/C, b/C]) and with directing power (right panel, support [−2b/C, 2b/C])]


• But д∗ − f∗ is the same as before, so the estimator is the same as before:

L(h) = ∑_{i=1}^n (д∗(xi) − f∗(xi)) Yi / ∑_{i=1}^n (д∗(xi) − f∗(xi)) = ∑_{i=1}^n kT(xi/h) Yi / ∑_{i=1}^n kT(xi/h)

• Worst-case bias under the null and variance are the same as before =⇒ same CI as before

Summary

The one-sided CI that minimizes maximum excess length over F for β = Φ(δ/σ − z1−α) subject to 1 − α coverage also minimizes ELβ̃(c; span(ι)) for β̃ = Φ(δ/(2σ) − z1−α)


setup for general adaptivity result

• Define the ordered modulus of continuity (Cai and Low, 2004):

ω(δ; F, G) = sup{Lд − Lf : ‖K(д − f)‖ ≤ δ, f ∈ F, д ∈ G},

so that ω(δ; F) = ω(δ; F, F), and define

Lδ,F,G = Lf∗M,δ + (ω′(δ; F, G)/δ) ⟨K(д∗δ − f∗δ), Y − Kf∗M,δ⟩,

so that Lδ,F,F = Lδ,F

• Bias formulas generalize:

sup_F bias(Lδ,F,G) = −inf_G bias(Lδ,F,G) = ½ (ω(δ; F, G) − δω′(δ; F, G)),

• In running example, L(h) = L_{ω⁻¹(hC; F, G), F, G}


Theorem 2 (One-sided adaptive CIs)

Let F and G ⊆ F be convex, and suppose that f∗δ and д∗δ achieve the ordered modulus at δ. Let

cα,δ,F,G = Lδ,F,G − biasF(Lδ,F,G) − z1−α σ ω′(δ; F, G).

Then, for β = Φ(δ/σ − z1−α), [cα,δ,F,G, ∞) minimizes ELβ(c, G) among all one-sided 1 − α CIs, where Φ denotes the standard normal cdf. Minimum coverage is taken at f∗δ and equals 1 − α. All quantiles of excess length are maximized at д∗δ. The worst-case βth quantile of excess length is ELβ(cα,δ,F,G, G) = ω(δ; F, G).


non-adaptivity under centrosymmetry

• Suppose F is centrosymmetric and

f∗δ,F,G − д∗δ,F,G ∈ F.    (1)

Holds for G "smooth enough", e.g. G = span(ι) under translation invariance, as in the running example

• Then 0 and f∗δ,F,G − д∗δ,F,G also solve the modulus, and since ω(δ; F) = sup{−2Lf : ‖Kf‖ ≤ δ/2, f ∈ F} under centrosymmetry,

ω(δ; F, G) = ω(δ; F, {0}) = sup_{f∈F} {−Lf : ‖Kf‖ ≤ δ} = ½ ω(2δ; F),

• Implies cα,δ,F,G = cα,δ,F,{0} = cα,2δ,F.


Theorem 3 (Non-adaptivity of one-sided CIs under centrosymmetry)

Let F be centrosymmetric. Then the one-sided CI that is minimax for the βth quantile also optimizes ELβ̃(c; G) for any G such that the solution to the ordered modulus problem exists and satisfies (1), where

β̃ = Φ((zβ − z1−α)/2).

In particular, the minimax CI optimizes ELβ̃(c; {0}).

• The CI that is minimax for median excess length among 95% CIs also optimizes the Φ(−1.645/2) ≈ 0.205 quantile under the zero function.


bound on adaptivity

• The CI [cα,σ(zβ+z1−α),F, ∞) that is minimax for the βth quantile of excess length is unbiased at 0, and satisfies

q0,β(L0 − cα,σ(zβ+z1−α),F) = ½ (ω′(δ; F)δ + ω(δ; F)).

Hence,

ω(δ; F, G) / q0,β(L0 − cα,σ(zβ+z1−α),F) = ω(δ; F, G) / (½ (ω′(δ)δ + ω(δ))) = ω(2δ) / (ω′(δ)δ + ω(δ)).

• Typically, ω(δ; F) = Aδ^r (1 + o(1)) as n → ∞ for some constant A, where r is the optimal rate of convergence of the MSE. Then for 1/2 ≤ r ≤ 1, the minimax CI has asymptotic efficiency of at least 94.3% when indeed f = 0.

• Adapting to a G that includes 0 is at least as hard as adapting to the zero function
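To see where the 94.3% figure comes from: with ω(δ; F) = Aδ^r we have δω′(δ) = rω(δ) and ω(2δ) = 2^r ω(δ), so the ratio above reduces to 2^r/(1 + r). A quick numerical check (our own illustration):

```python
import numpy as np

r = np.linspace(0.5, 1, 501)
eff = 2**r / (1 + r)               # omega(2 delta) / (omega'(delta) delta + omega(delta))
print(eff.min(), r[eff.argmin()])  # about 0.943, attained at r = 1/2
```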


implications of non-adaptivity result

• Need shape restriction or non-convexity for adaptation

• Similar to impossibility results in Low (1997) and Cai and Low (2004) for two-sided CIs, and in contrast to positive results for MSE

• Minimax rate of shrinkage describes the actual rate for all functions in the class

• Possible to construct estimators that do better when f is smoother, but impossible to tell how well you did

• For valid inference in cases where F is convex and centrosymmetric, one has to think hard about the appropriate C

• Not possible to estimate it from the data and do better than if we assume the worst possible case


adaptivity under monotonicity

• Suppose, in the running example, that we know f is non-increasing

• Least favorable functions without and with directing power:

[Figure: least favorable functions f∗ and д∗ under monotonicity, without directing power (left panel) and with directing power (right panel), plotted on [−2b/C, 2b/C]]


• Without directing power, the optimal estimator is again given by the triangular kernel, but now includes a bias correction (to ensure maximum bias = −minimum bias):

L(h) = (∑i ki Yi/σi²) / (∑i ki/σi²) + b (∑i sign(xi) ki (1 − ki)/σi²) / (∑i ki/σi²),

where ki = kT(xi/h), and the optimal bandwidth is bigger than without monotonicity. About 20% reduction in quantiles of excess length

• With directing power, the optimal estimator averages all positive observations, and averages negative observations using the triangular kernel. Excess length shrinks at the parametric rate.

• When the Lipschitz assumption is dropped and only monotonicity is maintained, the optimal estimator averages all positive observations, and excess length still shrinks at the parametric rate


two-sided adaptive cis

• Fixed-length confidence intervals cannot be adaptive

• Cai and Low (2004) construct random-length confidence intervals that are within a constant factor of the lower bound on expected length

• Cai, Low, and Xia (2013) construct random-length confidence intervals under shape constraints that have near-minimum expected length for each individual function (again within a constant)


• Natural best-case scenario for two-sided CIs: optimize expected length at a single function, G = {д}

• By Pratt (1961), inverting UMP tests against G achieves exactly this

• Again amounts to testing a convex null against a convex alternative; the LF function under the null solves

f∗θ = argmin_{f∈F} ∑_{i=1}^n (f(xi) − д(xi))²,   s.t. Lf ≤ θ

Theorem 4 (Adaptation to a function)

The CI with minimum expected measure Eд λ(C) subject to 1 − α coverage on F inverts the family of tests ϕθ, where ϕθ rejects for large values of ⟨K(д − f∗θ), Y⟩ with critical value given by its 1 − α quantile under f∗θ.


cis based on suboptimal estimators

• What is the efficiency loss of CIs around suboptimal affine estimators?

• Affine estimators are Normal, with a variance that doesn't depend on f, and a bias that does

• For each performance criterion, only the worst-case bias and variance matter: if we can calculate them, then we can also calculate the maximum MSE, and the form of one- and two-sided CIs

• Let χα(B) solve P(|Z + B| ≤ χ) = Φ(χ − B) − Φ(−χ − B) = 1 − α. Then for an estimator L with variance V and maximum bias B, the shortest CI is L ± V^{1/2} χα(B/V^{1/2})


Theorem 5 (Suboptimal estimators)

Let L = a + ⟨w, Y⟩ be an affine estimator. Then [L − biasF(L) − ‖w‖ z1−α σ, ∞) is a valid CI and L ± σ‖w‖ χα(biasF(L)/(σ‖w‖)) is the shortest fixed-length 1 − α CI centered at L.

• Not a deep result, but very useful: it allows one to compute the exact efficiency loss from using suboptimal estimators, or the size distortion of CIs with (pointwise) asymptotic justification

• An asymptotic version of this theorem can be used to calculate the asymptotic efficiency loss from using a suboptimal kernel and/or a suboptimal bandwidth (see the sketch below)
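A minimal sketch of this construction, assuming σ and the worst-case bias are known. The helper solves the displayed equation for χα(B) numerically; the function names are ours.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def cv_fixed_length(b, alpha=0.05):
    """chi_alpha(b): solves Phi(chi - b) - Phi(-chi - b) = 1 - alpha for a bias-sd ratio b >= 0."""
    f = lambda chi: norm.cdf(chi - b) - norm.cdf(-chi - b) - (1 - alpha)
    return brentq(f, 0.0, b + norm.ppf(1 - alpha / 2) + 1.0)

def affine_cis(w, y, max_bias, sigma, alpha=0.05):
    """One-sided and fixed-length CIs around the affine estimator L = <w, Y>."""
    L = w @ y
    se = sigma * np.linalg.norm(w)                     # sigma * ||w||
    lower = L - max_bias - norm.ppf(1 - alpha) * se    # one-sided CI [lower, inf)
    half = se * cv_fixed_length(max_bias / se, alpha)  # fixed-length CI L +/- half
    return lower, (L - half, L + half)
```

For max_bias = 0 the critical value reduces to z1−α/2, and it approaches B + z1−α as the bias-sd ratio B grows, which is the sense in which the worst-case correction does not vanish.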


suboptimal estimators in running example

• Consider some other kernel k in the running example, L = ∑i k(xi/h) Yi / ∑i k(xi/h)

• Variance: σ² ∑i k(xi/h)² / (∑i k(xi/h))²

• Maximum bias, since f ∈ FLip(C):

|∑i k(xi/h) (f(xi) − f(0)) / ∑i k(xi/h)| ≤ C ∑i |k(xi/h)| |xi| / ∑i k(xi/h).

The bound is attained at f(x) = C|x| if k ≥ 0; otherwise it gives an upper bound.


Finite-Sample results

Asymptotic results

Applications

Conclusion

renormalization

• In many cases (depending on L and the smoothness of F, but including inference at a point and RD), the nonparametric regression problem is equivalent to the white noise model Y(dt) = f(t) + σϵ(t)

• See Brown and Low (1996) and Donoho and Low (1992)

• In running example, this holds with σ² = σ(0)²/(n fX(0))

• Suppose F = {f : J(f) ≤ C} for some J (as in running example), and that for the white noise model, the following functionals are homogeneous:

J(a f(·/h)) = a h^{−sJ} J(f)
⟨K a1 f(·/h), K a2 д(·/h)⟩ = a1 a2 h^{−2sK} ⟨Kf, Kд⟩
L(a f(·/h)) = a h^{−sL} Lf

• In running example, we have sL = 0, sJ = 1, sK = −1/2


• The (single-class) modulus problem then renormalizes: if д∗C,δ, f∗C,δ solve max |L(f1 − f0)| s.t. ‖K(f1 − f0)‖ ≤ δ, J(f1) ≤ C, J(f0) ≤ C, then

д∗C,δ = a д∗1,1(·/h),   f∗C,δ = a f∗1,1(·/h),

ωC(δ) = C^{1−r} δ^r ω1(1),

where a = δ^{−sJ/(sK−sJ)} C^{sK/(sK−sJ)}, h = (C/δ)^{1/(sK−sJ)}, and

r = (sL − sJ)/(sK − sJ).

• The root of the minimax MSE and the (excess) length of CIs will shrink at rate n^{−r/2}


optimal bandwidths

• Class of optimal estimators can be written as

Lδ = L(h) = h^{2sK−sL} ⟨Kk(·/h), Y⟩ + C h^{sJ−sL} (LfM,1,1 − ⟨Kk, KfM,1,1⟩),

with h = (C/δ)^{1/(sK−sJ)} and kernel k(u) = r ω1(1) (д∗1,1 − f∗1,1)(u).

• Recall that the optimal δ is given by cℓ(δ/(2σ)) = δω′(δ)/ω(δ). Plugging in the definition of h yields the optimal bandwidth

h = (2σ cℓ⁻¹(r)/C)^{1/(sJ−sK)},

where, for one-sided CIs, cβ⁻¹(r) = (zβ + z1−α)/2


ratios of optimal bandwidths, sK = −1/2, sL = 0

[Figure: ratios of optimal CI bandwidths to the optimal MSE bandwidth as a function of r, for one-sided CIs (quantiles q = 0.5 and q = 0.8) and two-sided CIs, at confidence levels 0.95 and 0.99]


takeaways from picture

• Optimal bandwidth ratios depend only on the dilation exponents sL, sK, and sJ:

hℓ/hℓ′ = (cℓ⁻¹(r) / cℓ′⁻¹(r))^{1/(sJ−sK)}

• Bandwidths of the same order in all cases: no undersmoothing

• For one-sided CIs, the bandwidth gets larger with the quantile that we are minimizing

• For 95+% two-sided CIs, if sL = 0 and sK = −1/2, the optimal fixed-length CI uses a larger bandwidth than the optimal MSE bandwidth (see the sketch below)
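A rough numerical illustration of this ratio for the running example (r = 2/3, sJ = 1, sK = −1/2). It assumes that the fixed-length coefficient cχ(τ) can be computed by minimizing the half-length c · χα((1 − c)τ/c) over linear estimators cZ in the bounded normal means problem, and that cρ(τ) = τ²/(1 + τ²); the function names are ours.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq, minimize_scalar

def cv(b, alpha=0.05):
    # critical value chi_alpha(b): P(|N(b,1)| <= chi) = 1 - alpha
    return brentq(lambda c: norm.cdf(c - b) - norm.cdf(-c - b) - (1 - alpha), 0.0, b + 10.0)

def c_chi(tau, alpha=0.05):
    # shrinkage coefficient of the shortest fixed-length affine CI for theta in [-tau, tau]
    obj = lambda c: c * cv((1 - c) * tau / c, alpha)
    return minimize_scalar(obj, bounds=(1e-6, 1 - 1e-9), method="bounded").x

def c_chi_inv(r, alpha=0.05):
    return brentq(lambda t: c_chi(t, alpha) - r, 1e-3, 50.0)  # tau with c_chi(tau) = r

def c_rho_inv(r):
    return np.sqrt(r / (1 - r))                               # c_rho(tau) = tau^2/(1+tau^2)

r, sJ, sK = 2 / 3, 1.0, -0.5
ratio = (c_chi_inv(r) / c_rho_inv(r)) ** (1 / (sJ - sK))
print(ratio)   # per the slides, this exceeds 1 for a 95% fixed-length CI
```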


• For any bandwidth h, the worst-case bias is (C/2)((1 − r)/r) h^{sJ−sL} (∫k²)^{1/2}

• Can use this worst-case bias to construct CIs around L(h)

• How much bigger are two-sided CIs around the minimax MSE bandwidth? The ratio of CI lengths is given by

(cχ,α⁻¹(r) / cρ⁻¹(r))^{r−1} · χα(cχ,α⁻¹(r)(1/r − 1)) / χα(cρ⁻¹(r)(1/r − 1)),

where χα(B) solves P(|N(0,1) + B| ≤ χ) = Φ(χ − B) − Φ(−χ − B) = 1 − α

• Need to use χα(√((1 − r)/r)) instead of z_{α/2} as a critical value to ensure coverage for the CI around the minimax MSE bandwidth


length of optimal cis relative to cis around mse bw

[Figure: percentage decrease in CI length from using the optimal fixed-length CI rather than the CI around the minimax MSE bandwidth, as a function of r, for confidence levels 0.7, 0.95, and 0.99]


“critical values” for ci around mse bandwidth

[Figure: critical values for the CI around the minimax MSE bandwidth as a function of r, for confidence levels 0.9, 0.95, and 0.99]


undercoverage with usual critical values

[Figure: coverage of CIs around the minimax MSE bandwidth when the usual critical values are used, as a function of r, for nominal levels 0.9, 0.95, and 0.99]


takeaways from pictures

• To construct two-sided CIs, one can keep the same bandwidth as for estimation; the price is < 2% for 95% CIs

• Need to use a slightly higher critical value to ensure proper coverage


suboptimal kernels

• Results so far assumed using optimal kernel

• Under renormalization, maximum bias and variance renormalize in a similar way for suboptimal kernels

• For any kernel k, let hk be the bandwidth that equates the maximum bias and root variance, and let w(k) = se(Lk(hk)) = sup_f bias_f(Lk(hk))

• Suppose criterion scales linearly with maximum bias and root variance


Theorem 6 (Efficiency loss of suboptimal kernels)

1. The relative efficiency of k and k̃ (where the optimal bandwidth is used in both cases) does not depend on the performance criterion, and is given by w(k)/w(k̃)

2. Results for ratios of optimal bandwidths remain unchanged for suboptimal kernels

3. The efficiency loss from using a bandwidth optimal for a different criterion rather than the bandwidth optimal for the criterion of interest remains unchanged for suboptimal kernels
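An illustrative computation of w(k) and the criterion-free relative efficiency in part 1, in the spirit of the running example: it uses the worst-case bias and variance formulas from the suboptimal-estimators slide for a triangular and a uniform kernel. The design, σ, C, and function names are our choices, for illustration only.

```python
import numpy as np
from scipy.optimize import brentq

def max_bias(kvals, x, C):
    return C * np.sum(np.abs(kvals) * np.abs(x)) / kvals.sum()  # worst-case bias over F_Lip(C)

def sd(kvals, sigma):
    return sigma * np.sqrt(np.sum(kvals**2)) / kvals.sum()      # root variance

def w_of_k(kernel, x, sigma, C, h_lo, h_hi):
    """w(k): common value of worst-case bias and sd at the bandwidth h_k equating them."""
    gap = lambda h: max_bias(kernel(x / h), x, C) - sd(kernel(x / h), sigma)
    h_k = brentq(gap, h_lo, h_hi)   # bias increases and sd decreases in h, so a root exists
    return sd(kernel(x / h_k), sigma)

triangular = lambda u: np.maximum(1 - np.abs(u), 0)
uniform = lambda u: (np.abs(u) <= 1).astype(float)

x = np.linspace(-1, 1, 2001)
sigma, C = 0.5, 1.0
w_tri = w_of_k(triangular, x, sigma, C, 0.01, 1.0)
w_uni = w_of_k(uniform, x, sigma, C, 0.01, 1.0)
print("relative efficiency of the uniform kernel:", w_tri / w_uni)
```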


corollaries

• The bounds for the minimax MSE efficiency of different kernels in Cheng, Fan, and Marron (1997) 1. are tight; and 2. hold for other efficiency criteria

• Using the minimax MSE bandwidth for two-sided CIs is a good idea no matter what kernel one uses


Finite-Sample results

Asymptotic results

Applications

Conclusion

rd

• Interested in Lf = lim_{x↓0} f(x) − lim_{x↑0} f(x).

• Let f+(x) = f(x) I(x > 0) and f−(x) = −f(x) I(x < 0), so that f = f+ − f−.

• We consider the class

FRD,T,2(C) = {f+ − f− : f+ ∈ FT,2(C; R+), f− ∈ FT,2(C; R−)},

where FT,2(C; X) is the class from Sacks and Ylvisaker (1978),

FT,2(C; X) = {f : |f(x) − f(0) − f′(0)x| ≤ Cx² for all x ∈ X}.

• FT,2 is also used in Cheng, Fan, and Marron (1997) for estimation at a point, which justifies much of empirical RD practice


least favorable functions

Least favorable functions are symmetric, д∗δ(x) = −f∗δ(x), and have the form

д∗δ(x) = [(b − b− + d+x − Cx²)+ − (b − b− + d+x + Cx²)−] 1(x > 0)
       + [(b− + d−x − Cx²)+ − (b− + d−x + Cx²)−] 1(x < 0)

with b−, d+, d− chosen to solve

0 = ∑_{i=1}^n д−,b,C(xi) xi / σ²(xi),   0 = ∑_{i=1}^n д+,b,C(xi) xi / σ²(xi),

and

∑_{i=1}^n д+,b,C(xi) / σ²(xi) = ∑_{i=1}^n д−,b,C(xi) / σ²(xi)


optimal kernel

[Figure: the optimal equivalent kernel k(u)]

• Asymptotically, д∗δ corresponds to the difference between two kernel estimators, with bandwidths chosen to equate the number of effective observations

• Optimal kernel is the same as for inference at a point, derived in Cheng, Fan, and Marron (1997) using an upper bound on the minimax MSE


application to Lee (2008)

• RD design:
  • Xi = margin of victory in the previous election for the Democratic party (negative for a Republican victory)
  • Yi = Democratic vote share in the given election
  • Di = I(Xi ≥ 0) = indicator for Democratic incumbency
  • n = 6558 observations of elections between 1946 and 1998

• For simplicity, assume homoscedastic errors; use the estimates σ²−(0) = 155.3 and σ²+(0) = 210.3 derived using the Imbens and Kalyanaraman (2012) bandwidth

• LF functions very close to scaled versions of optimal bandwidth

• Unless C is very small, results are in line with Lee (2008) and Imbens and Kalyanaraman (2012)


minimax mse estimator as function of c

[Figure: minimax MSE estimate of the electoral advantage (%), with squared bias and variance, as a function of C (equivalently, the effective number of observations)]


optimal fixed-length cis

[Figure: optimal fixed-length CIs (lower, estimate, upper) for the electoral advantage (%), with squared bias and variance, as a function of C / the effective number of observations]


Finite-Sample results

Asymptotic results

Applications

Conclusion

summary

1. Give exact results for 1. minimax optimal and 2. adaptive one-sided CIs.
   • CIs use a non-random bias correction based on the worst-case bias
   • Adaptivity without shape restrictions is severely limited, as in the two-sided case.
   • Impossible to avoid thinking hard about the appropriate C

2. Give exact solution to the problem of "adaptation to a function"

3. Use these finite-sample results to characterize optimal tuning parameters for different performance criteria
   • Building CIs around the minimax MSE bandwidth is nearly optimal
   • Undersmoothing cannot be optimal
