
arXiv:1605.09124v3 [cs.IT] 25 May 2020

Minimax Estimation of Divergences between Discrete Distributions

Yanjun Han, Jiantao Jiao, and Tsachy Weissman∗

May 26, 2020

Abstract

We study the minimax estimation of α-divergences between discrete distributions for integer α ≥ 1, which include the Kullback–Leibler divergence and the χ²-divergences as special examples. Dropping the usual theoretical tricks to acquire independence, we construct the first minimax rate-optimal estimator which does not require any Poissonization, sample splitting, or explicit construction of approximating polynomials. The estimator uses a hybrid approach which solves a problem-independent linear program based on moment matching in the non-smooth regime, and applies a problem-dependent bias-corrected plug-in estimator in the smooth regime, with a soft decision boundary between these regimes.

Contents

1 Introduction
  1.1 Notations
  1.2 Main Results
  1.3 Related Work
  1.4 Organization

2 Estimator Construction
  2.1 Construction for α-divergences with α ≥ 2
  2.2 Construction for KL divergence

3 Estimator Analysis
  3.1 A deterministic inequality via implicit polynomial approximation
  3.2 Bias-variance analysis in the non-smooth regime
  3.3 Bias-variance analysis in the smooth regime

A Auxiliary Lemmas

B Upper Bound Analysis of the KL Divergence
  B.1 Deterministic inequalities
  B.2 Bias-variance analysis in the non-smooth regime
  B.3 Bias-variance analysis in the smooth regime

C Minimax Lower Bounds
  C.1 Lower Bounds for α-Divergences
  C.2 Lower Bounds for KL Divergence

D Proof of Main Lemmas
  D.1 Proof of Lemma 1
  D.2 Proof of Lemma 2
  D.3 Proof of Lemma 3
  D.4 Proof of Lemma 9
  D.5 Proof of Lemma 10
  D.6 Proof of Lemma 11

E Proof of Auxiliary Lemmas
  E.1 Proof of Lemma 5
  E.2 Proof of Lemma 15
  E.3 Proof of Lemma 17
  E.4 Proof of Lemma 18
  E.5 Proof of Lemma 19
  E.6 Proof of Lemma 20
  E.7 Proof of Lemma 21
  E.8 Proof of Lemma 22
  E.9 Proof of Lemma 23

∗Yanjun Han and Tsachy Weissman are with the Department of Electrical Engineering, Stanford University, email: yjhan,[email protected]. Jiantao Jiao is with the Department of Electrical Engineering and Computer Sciences and Department of Statistics at University of California, Berkeley, email: [email protected]. A preliminary version [HJW16b] of this paper appeared in the International Symposium on Information Theory and Its Applications (ISITA) in Monterey, CA, USA, 2016. An earlier arXiv version [HJW16a] of this work is also available, while the current work made substantial changes.

1 Introduction

Divergences, as fundamental measures of the discrepancy between different probability distributions, are key quantities in information theory and statistics and arise in various disciplines. For example, the Kullback–Leibler (KL) divergence [KL51] is an information-theoretic measure arising naturally in data compression [CP04], communications [CK11], probability theory [San58], statistics [Kul97], optimization [DLR77], machine learning [Bis06,KW13], and many other disciplines. Besides the KL divergence, the Hellinger distance [Hel09] plays key roles in the classical asymptotic theory [Haj70,Haj72] and pairwise tests [Bir83,LC86] in statistics, and the Pearson χ²-divergence is an important measure in goodness-of-fit tests [IS12,ACT18].

In this paper, we consider the problem of estimating divergences based on observations from both distributions. Given jointly independent $m$ samples from distribution $P = (p_1, \cdots, p_k)$ and $n$ samples from distribution $Q = (q_1, \cdots, q_k)$ over some common alphabet $[k]$, the target is to estimate the following α-divergence [Ama12]:
$$D_\alpha(P\|Q) \triangleq \frac{1}{\alpha(\alpha-1)}\left(\sum_{i=1}^{k} \frac{p_i^\alpha}{q_i^{\alpha-1}} - 1\right), \qquad \alpha \in \mathbb{R}, \tag{1}$$


with the usual convention that $D_\alpha(P\|Q) = \infty$ if $P \not\ll Q$ when $\alpha \ge 1$ and $Q \not\ll P$ when $\alpha \le 0$. The α-divergence belongs to the general class of $f$-divergences [Csi67] with $f_\alpha(t) = (t^\alpha - 1)/(\alpha(\alpha-1))$, and when $\alpha = 1$, the α-divergence becomes the usual KL divergence
$$D_{\mathrm{KL}}(P\|Q) = \sum_{i=1}^{k} p_i \log\frac{p_i}{q_i}. \tag{2}$$
The family of α-divergences includes multiple well-known divergences widely used in information theory and statistics, where the choices of α = 0.5, 1, 2 give (possibly up to scalings) the squared Hellinger distance, KL divergence and χ²-divergence, respectively. For the clarity of presentation, throughout the paper we will assume that α ≥ 1 is an integer, which is enough to reflect our ideas.
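
For concreteness, a minimal numerical sketch of the plug-in evaluation of (1) and (2) for fully known distributions is given below (assuming NumPy; the example vectors are arbitrary). This is only the target functional itself, not the estimator constructed later in the paper.

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    """D_alpha(P||Q) = (sum_i p_i^alpha / q_i^(alpha-1) - 1) / (alpha*(alpha-1)),
    for alpha not in {0, 1}; inputs are probability vectors on a common alphabet."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return (np.sum(p ** alpha / q ** (alpha - 1)) - 1.0) / (alpha * (alpha - 1))

def kl_divergence(p, q):
    """D_KL(P||Q) = sum_i p_i log(p_i/q_i), with the convention 0*log 0 = 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Example: alpha = 2 recovers the chi^2-divergence up to the 1/(alpha*(alpha-1)) scaling.
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])
print(alpha_divergence(P, Q, 2), kl_divergence(P, Q))
```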

Estimating functionals (or properties) of discrete distributions has attracted a recent line of research in the past decade, such as [LNS99,Pan03,Pan04,CL11,VV11a,VV11b,VV13,JVHW15,WY16,HJW16a,OSW16,ZVV+16,HJM17,HJWW17,BZLV18,JHW18,WY19,HO19a,HO19b]. We refer to the survey paper [Ver19] for an overview. The main aim of this paper is not simply to apply the previously known methodology to a new functional; instead, we intend to improve the previous methodologies and construct minimax rate-optimal estimators satisfying the following properties:

1. No Poissonization: Most of the previous estimators are constructed under the Poissonized model where the frequency counts of symbols are mutually independent, while they cannot directly work for the original Multinomial sampling model and there is no direct reduction between estimators. In this paper, we aim to construct estimators without any Poissonization to ensure that they work for the exact models we are considering.

2. No sample splitting: In addition to Poissonization, most previous estimators also require a sample splitting to acquire further independence structure. Specifically, the observations are split into two or more parts, where the first part is to locate the probability of each symbol for regime classification, and the other parts are used for estimation (see, e.g. [JVHW15]). However, sample splitting does not make full use of all observations, and the hard decisions in regime classification make the final estimator unstable to slight changes in input and therefore lead to poor concentration properties [HS20]. Therefore, it is practically desirable to propose an estimator with soft decisions and no sample splitting.

3. No explicit polynomial construction: Many of the above works rely on an explicit polynomial approximation in both the estimator construction and analysis. Although the 1-dimensional best polynomial approximation of any given function can be computed efficiently using the Remez algorithm [Rem34], this task becomes very challenging even in two dimensions [Ric63]. Besides, there exist scenarios (e.g. Lemma 5 of this paper) where the uniform approximation error is not the appropriate target or the basis functions are not the low-degree monomials, and there is no known algorithm to construct the best approximating polynomial. Hence, it will be desirable if there is an approach adapting to any polynomial approximation without actually computing them.

In this paper, we construct minimax rate-optimal estimators satisfying all the above properties for the α-divergences with integer α ≥ 1, while the ideas are generalizable to other functionals. Also as a byproduct, we obtain the minimax estimation rates for general α-divergences, which could be of independent interest themselves.


1.1 Notations

Let $\mathbb{N}$ be the set of all positive integers, and for $n \in \mathbb{N}$, let $[n] \triangleq \{1, 2, \ldots, n\}$. For $k \in \mathbb{N}$, let $\mathcal{M}_k$ be the set of all discrete distributions supported on $[k]$. Let $\mathsf{B}(n, p)$ be the Binomial distribution with number of trials $n$ and success probability $p \in [0, 1]$, and $\mathsf{HG}(N, K, k)$ be the hypergeometric distribution with $k$ draws without replacement from a bin of $N$ total elements and $K \le N$ desired elements. For non-negative sequences $a_n$ and $b_n$, we write $a_n \ll b_n$ (or $a_n = o(b_n)$) to denote that $\lim_{n\to\infty} a_n/b_n = 0$, and $a_n \lesssim b_n$ (or $a_n = O(b_n)$) to denote that $\limsup_{n\to\infty} a_n/b_n < \infty$, and $a_n \gg b_n$ (or $a_n = \omega(b_n)$) to denote $b_n \ll a_n$, and $a_n \gtrsim b_n$ (or $a_n = \Omega(b_n)$) to denote $b_n \lesssim a_n$, and $a_n \asymp b_n$ (or $a_n = \Theta(b_n)$) to denote both $a_n \lesssim b_n$ and $b_n \lesssim a_n$. We also use $\lesssim_\alpha$ (resp. $\gtrsim_\alpha, \asymp_\alpha$) and $O_\alpha$ (resp. $\Omega_\alpha, \Theta_\alpha$) to denote the respective meanings with hidden constants depending only on α.

1.2 Main Results

Note that for α ≥ 1, the α-divergence $D_\alpha(P\|Q)$ is unbounded if $P \not\ll Q$. Hence, we restrict our attention to the following set of distribution pairs $(P, Q)$:
$$\mathcal{M}_k(U) \triangleq \left\{ (P, Q) \in \mathcal{M}_k \times \mathcal{M}_k : \frac{p_i}{q_i} \le U, \ \forall i \in [k] \right\}, \tag{3}$$
where $U \ge 1$ is a prespecified upper bound on the likelihood ratio $dP/dQ$. Restricting to the set $\mathcal{M}_k(U)$, it is clear that the α-divergence $D_\alpha(P\|Q)$ is upper bounded by some quantity depending only on $U$. Our first result characterizes the minimax estimation error of $D_\alpha(P\|Q)$ for integer α ≥ 2.

Theorem 1. Let α ≥ 2 be an integer. Then for any $m \gtrsim_\alpha U^{2(\alpha-1)}$, $n \gtrsim_\alpha kU^\alpha/\log k + U^{2\alpha-1}$, $U \gtrsim_\alpha (\log k)^2$ and $\log k \gtrsim_\alpha \log n$,
$$\inf_{\hat D} \sup_{(P,Q)\in\mathcal{M}_k(U)} \mathbb{E}_{(P,Q)}|\hat D - D_\alpha(P\|Q)| \asymp_\alpha \frac{kU^\alpha}{n\log n} + \frac{U^{\alpha-1}}{\sqrt{m}} + \frac{U^{\alpha-1/2}}{\sqrt{n}},$$
where the infimum is taken over all possible estimators $\hat D$ depending only on $m$ observations from $P$ and $n$ observations from $Q$. In particular, the upper bound is attained by an estimator without any Poissonization, sample splitting, or explicit polynomial construction.

The following corollary on the sample complexity is then immediate.

Corollary 1. For integer α ≥ 2, consistent estimation of $D_\alpha(P\|Q)$ is possible over $\mathcal{M}_k(U)$ iff $m \gg U^{2(\alpha-1)}$ and $n \gg kU^\alpha/\log k + U^{2\alpha-1}$.

Hence, given a small upper bound on the likelihood ratio, it only requires few samples from $P$ and a sub-linear sample size from $Q$. However, when the upper bound $U$ becomes larger, the number of samples required for consistent estimation grows polynomially in $U$ and may become enormous for large α, indicating the impossibility of learning in such scenarios. Also note that in the theorem statement $U$ cannot be too small, as $U$ close to 1 makes the divergence $D_\alpha(P\|Q)$ close to zero.

As for α = 1, the α-divergence becomes the KL divergence $D_{\mathrm{KL}}(P\|Q)$, and the following theorem summarizes the minimax risk of estimating $D_{\mathrm{KL}}(P\|Q)$.


Theorem 2. If $m \gtrsim k/\log k$, $n \gtrsim kU/\log k$, $U \gtrsim (\log k)^2$ and $\log k \gtrsim \log(m+n)$, we have
$$\inf_{\hat D} \sup_{(P,Q)\in\mathcal{M}_k(U)} \mathbb{E}_{(P,Q)}|\hat D - D_{\mathrm{KL}}(P\|Q)| \asymp \frac{k}{m\log m} + \frac{kU}{n\log n} + \frac{\log U}{\sqrt{m}} + \sqrt{\frac{U}{n}},$$
where the infimum is taken over all possible estimators $\hat D$ depending only on $m$ observations from $P$ and $n$ observations from $Q$. In particular, the upper bound is attained by an estimator without any Poissonization, sample splitting, or explicit polynomial construction.

Similarly, we have the following sample complexity on estimating the KL divergence.

Corollary 2. Consistent estimation of $D_{\mathrm{KL}}(P\|Q)$ is possible over $\mathcal{M}_k(U)$ iff $m \gg k/\log k$ and $n \gg kU/\log k$.

Consequently, Theorems 1 and 2 provide a complete characterization of the minimax rates of estimating α-divergences for all $\alpha \in \mathbb{N}$.

1.3 Related Work

There have been several attempts to estimate the KL divergence for the continuous case, see [WKV05,LP06,GBR+06,PC08,WKV09,NWJ10] and references therein. These approaches usually do not operate in the minimax framework, and focus on consistency but not rates of convergence, unless strong smoothness conditions on the densities are imposed to achieve the parametric rate (i.e., $\Theta(n^{-1})$ in mean squared error). In the discrete setting, [CKV06] and [ZG14] proved consistency of some specific estimators without arguing minimax optimality.

As for the minimax analysis of general functional estimation, there are three major approaches developed in the recent literature. The most prominent approach is via the best polynomial approximation, which operates in an ad-hoc manner and typically leads to tight minimax rates. Specifically, the minimax-optimal estimation procedures have been found for many non-smooth 1D functionals, including the entropy [Pan03,Pan04,VV11a,JVHW15,WY16], L1 norm of the mean vector [CL11], support size [VV13,WY19], support coverage [OSW16,ZVV+16], distance to uniformity [VV11b,JHW18], and general 1-Lipschitz functions [HO19a,HO19b]. Similar procedures can also be generalized to 2D functionals: for example, the optimal estimation of L1 or total variation distance between discrete distributions was studied in [JHW18], and for the KL divergence considered in this paper, [BZLV18] and an earlier version of this work [HJW16a] independently obtained the minimax rate following an explicit polynomial approximation approach. Generalizations to nonparametric functionals are also available [LNS99,HJM17,HJWW17]. We refer to the survey paper [Ver19] for an overview of the results. However, this line of research typically satisfies none of the three properties mentioned in the introduction: explicit polynomials are necessary by definition of this approach, and nearly all works adopt sample splitting and Poissonization for analytical simplicity. There is possibly only one exception [HJWW17], which gets rid of Poissonization by applying a careful Efron–Stein–Steele inequality to the i.i.d. sampling model, but it still heavily relies on sample splitting.

The second approach is based on linear programming, where the final estimator is the solution to an appropriate linear program. Early applications of linear programs to functional estimation include [Pan03,VV11a,VV11b,VV13] which directly involve the functional of interest, but there were only suboptimal error guarantees. Later, new moment matching based methods independent of the target functional appeared in [HJW18] for estimating sorted distributions and [RW19] for estimating the multiset of the mean in Gaussian location model, both of which have optimal error guarantees. This idea of moment matching, or method of moments, dates back to [Pea94], and was also used in other problems such as [HP15,WY18] for learning Gaussian mixtures and [KV17,TKV17] for learning a population of parameters. Although the moment matching approach does not require explicit polynomial approximation, it only achieves the minimax rate in a small parameter regime for most functionals [HJW18] and it is unknown how to adapt this approach to a given functional. Moreover, for the local moment matching idea which is necessary in most functional estimation problems where the regime classification is required, both Poissonization and sample splitting were still necessary in [HJW18]. A very recent work [HS20] was able to remove the sample splitting in the 1D case and prove the near-optimal exponential concentration property, but it still applied the Poissonization and did not involve the target functional. Hence, for linear programming based approaches, both achieving the minimax optimality and satisfying all target properties remain open.

The third approach is based on the profile maximum likelihood (PML) distribution, a concept first proposed in [OSVZ04] on estimating the probability multiset. It was later shown in [ADOS17] that plugging the PML distribution into general symmetric functionals attains the optimal sample complexity by a union bound over all possible profiles, and a better statistical guarantee was obtained recently in [HS20] using an involved chaining argument. There were also modifications of PML [CSS19b,HO19a] tailored for specific functionals, and its success was shown for a number of problems including the estimation of KL divergence [Ach18,HO19a]. However, the PML approach is designed to only work for symmetric functionals, and there is always a non-negligible gap between the performance of PML and that of the minimax optimal estimator. In addition, exact or approximate PML distributions are very hard to compute [CSS19a,ACSS20], which further makes it not suitable for our problem.

1.4 Organization

The rest of the paper is organized as follows. In Section 2, a hybrid estimator combining the idea of functional-independent moment matching and the functional-dependent bias-corrected plug-in approach is constructed for both the α-divergences and the KL divergence, with a soft boundary between these two phases. Section 3 presents the roadmap of the analysis of the above hybrid estimator for the α-divergences with α ≥ 2. In particular, it presents the idea of implicit polynomial approximations achieved by moment matching, and key technical insights to overcome the dependence incurred by removing Poissonization and sample splitting. Appendix A lists some necessary auxiliary lemmas for this paper, while we refer the upper bound analysis for the KL divergence estimator, the minimax lower bounds, and the proofs of all main and auxiliary lemmas, to the appendices B, C, D, and E, respectively.

2 Estimator Construction

In this section, we construct minimax optimal estimators of α-divergences for integer α ≥ 2 and α = 1 separately.


2.1 Construction for α-divergences with α ≥ 2

The high-level description of the estimator construction is as follows. First, following the general recipe in [JVHW15], we split the entire set $[0,1]^2$ of probability pairs $(p, q)$ into the smooth regime $\mathcal{R}_{\mathrm{s}}$ and the non-smooth regime $\mathcal{R}_{\mathrm{ns}}$. In the non-smooth regime $\mathcal{R}_{\mathrm{ns}}$, we construct a 2D measure $\mu$ supported on a set slightly larger than $\mathcal{R}_{\mathrm{ns}}$ such that the moments $\int p^\alpha q^d \mu(dp, dq)$ of $\mu$ for all $d = 0, 1, \cdots, O(\log n)$ are close to the true moments $\sum_{i=1}^{k} p_i^\alpha q_i^d \mathbb{1}((p_i, q_i) \in \mathcal{R}_{\mathrm{ns}})$ restricted to $\mathcal{R}_{\mathrm{ns}}$. Then we plug the measure $\mu$ into the target functional as the estimate in the non-smooth regime. As for the smooth regime, we simply apply an appropriate bias-corrected plug-in estimator of each $(p_i, q_i)$. However, one crucial difference from the previous approaches is that we do not attribute each symbol $i \in [k]$ to smooth/non-smooth regimes, either deterministically or randomly. Instead, each symbol $i \in [k]$ contains both non-smooth and smooth components of contributions towards the final functional which can be computed separately to construct the overall estimator. In the detailed implementation, some smoothing operation will be applied to both the true moments and the plug-in estimators to account for different components, which is the reason why we no longer require the sample splitting.

Fix constants $c_1, c_2 > 0$ to be chosen later, and assume that $n$ is an even integer. Moreover, for each $i \in [k]$, let $m\hat p_i, n\hat q_i$ be the number of occurrences of symbol $i$ in the $m$ observations from $P$ and the $n$ observations from $Q$, respectively. The detailed estimator construction for the α-divergence with integer α ≥ 2 is then as follows.

1. Estimate the non-smooth component:

• Estimators for smoothed moments: for $d \in \mathbb{N}$, define the following function $g_d : \mathbb{N} \to \mathbb{R}$ with $g_0(x) \equiv 1$, and
$$g_d(x) = \prod_{d'=0}^{d-1} \frac{x - d'}{n/2 - d'}. \tag{4}$$
We also define a modified version $\tilde g_d$ of $g_d$ as
$$\tilde g_d(x) = \min\left\{ g_d(x),\ g_d(\lceil 2c_1 \log n \rceil) \right\}. \tag{5}$$

Now for each $d \in \mathbb{N}$, we define our estimator of the smoothed $(\alpha, d)$ moments as follows (a small numerical sketch of this step is given after the construction below):
$$\hat M_{\alpha,d} = \sum_{i=1}^{k} \left[ \prod_{\ell=0}^{\alpha-1} \frac{m\hat p_i - \ell}{m - \ell} \right] \cdot \left[ \sum_{0 \le s \le c_1 \log n} \mathbb{P}\left( \mathsf{HG}\left(n, n\hat q_i, \frac{n}{2}\right) = s \right) \tilde g_d(n\hat q_i - s) \right], \tag{6}$$
where we recall that $\mathsf{HG}(N, K, k)$ denotes the hypergeometric distribution with $k$ draws without replacement from a bin of $N$ total elements and $K$ desired elements.

• The linear program: define $D = \lceil c_2 \log n \rceil$ and the (non-probability) measure $\hat\mu$ as the solution to the following linear program:
$$\begin{aligned}
\text{minimize } \ & L(\mu, \hat M) \triangleq \sum_{d=0}^{D} \left( \frac{3c_1 \log n}{n} \right)^{-d} \cdot \left| \int p^\alpha q^d \mu(dp, dq) - \hat M_{\alpha,d} \right| \\
\text{subject to } \ & \mathrm{supp}(\mu) \subseteq \mathcal{S} \triangleq \left\{ (p, q) \in [0,1]^2 : q \le \frac{3c_1 \log n}{n},\ p \le Uq \right\}, \\
& \mu(\mathbb{R}^2) \le k,
\end{aligned} \tag{7}$$


where supp(µ) denotes the support of the measure µ.

• Final estimator of the non-smooth component: compute
$$\hat D_{\alpha,\mathrm{ns}} = \int \frac{p^\alpha}{q^{\alpha-1}} \hat\mu(dp, dq). \tag{8}$$

2. Estimate the smooth component: define the function $h_\alpha : \mathbb{N} \to \mathbb{R}$ with
$$h_\alpha(x) = \mathbb{1}(x \neq 0) \cdot \left( \frac{1}{(2x/n)^{\alpha-1}} - \frac{\alpha(\alpha-1)}{n} \cdot \frac{1 - 2x/n}{(2x/n)^\alpha} \right), \tag{9}$$

and the final estimator of the smooth component is
$$\hat D_{\alpha,\mathrm{s}} = \sum_{i=1}^{k} \left[ \prod_{\ell=0}^{\alpha-1} \frac{m\hat p_i - \ell}{m - \ell} \right] \cdot \left[ \sum_{s > c_1 \log n} \mathbb{P}\left( \mathsf{HG}\left(n, n\hat q_i, \frac{n}{2}\right) = s \right) h_\alpha(n\hat q_i - s) \right]. \tag{10}$$

3. The final estimator $\hat D_\alpha$ is then defined as
$$\hat D_\alpha = \frac{1}{\alpha(\alpha-1)} \left( \hat D_{\alpha,\mathrm{ns}} + \hat D_{\alpha,\mathrm{s}} - 1 \right). \tag{11}$$
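
As referenced in step 1, the following is a literal (and unoptimized) sketch of the smoothed moment estimator (6), built from the count vectors of the two samples. It assumes NumPy/SciPy; the constant c1 and the synthetic count vectors are placeholders, and the sketch is meant only to make the hypergeometric smoothing in (4)–(6) concrete, not to reproduce the exact implementation analyzed in the paper.

```python
import numpy as np
from math import ceil, log
from scipy.stats import hypergeom

def g(x, d, n):
    # g_d(x) = prod_{d'=0}^{d-1} (x - d') / (n/2 - d'), an unbiased estimator of q^d
    # when x ~ Binomial(n/2, q); g_0 = 1 by convention.
    val = 1.0
    for dp in range(d):
        val *= (x - dp) / (n / 2 - dp)
    return val

def g_tilde(x, d, n, c1):
    # truncated version (5): cap g_d at its value at ceil(2*c1*log n) to control the variance
    return min(g(x, d, n), g(ceil(2 * c1 * log(n)), d, n))

def smoothed_moment(X, Y, m, n, alpha, d, c1):
    """M-hat_{alpha,d} of (6): X, Y are the count vectors of the m samples from P
    and the n samples from Q (n assumed even)."""
    thresh = c1 * log(n)
    total = 0.0
    for xi, yi in zip(X, Y):
        # falling-factorial unbiased estimator of p_i^alpha
        p_part = 1.0
        for ell in range(alpha):
            p_part *= (xi - ell) / (m - ell)
        # hypergeometric smoothing over the "first half" count s
        q_part = 0.0
        for s in range(0, int(thresh) + 1):
            w = hypergeom.pmf(s, n, yi, n // 2)   # P(HG(n, n*qhat_i, n/2) = s)
            q_part += w * g_tilde(yi - s, d, n, c1)
        total += p_part * q_part
    return total

# Example with synthetic counts (placeholder uniform distributions):
rng = np.random.default_rng(0)
m, n, k = 5000, 5000, 100
X = rng.multinomial(m, np.full(k, 1.0 / k))
Y = rng.multinomial(n, np.full(k, 1.0 / k))
print(smoothed_moment(X, Y, m, n, alpha=2, d=1, c1=4.0))
```

The inner sum over s implements the soft weight that replaces sample splitting: conditionally on the count $n\hat q_i$, the "first-half" count $s$ is hypergeometric, while unconditionally both halves are independent Binomials (see the discussion around (16) below).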

A few remarks on the estimator $\hat D_\alpha$ are in order:

• The high-level idea: In the above estimator construction, the final estimator $\hat D_\alpha$ consists of the non-smooth component $\hat D_{\alpha,\mathrm{ns}}$ in (8) and the smooth component $\hat D_{\alpha,\mathrm{s}}$ in (10). These two components aim to estimate the respective components in the following decomposition of the target functional
$$\sum_{i=1}^{k} \frac{p_i^\alpha}{q_i^{\alpha-1}} = D_{\alpha,\mathrm{ns}} + D_{\alpha,\mathrm{s}},$$
with
$$D_{\alpha,\mathrm{ns}} \triangleq \sum_{i=1}^{k} \frac{p_i^\alpha}{q_i^{\alpha-1}} \cdot \mathbb{P}\left( \mathsf{B}\left(\frac{n}{2}, q_i\right) \le c_1 \log n \right), \tag{12}$$
$$D_{\alpha,\mathrm{s}} \triangleq \sum_{i=1}^{k} \frac{p_i^\alpha}{q_i^{\alpha-1}} \cdot \mathbb{P}\left( \mathsf{B}\left(\frac{n}{2}, q_i\right) > c_1 \log n \right). \tag{13}$$
Recall that $\mathsf{B}(n, p)$ denotes the Binomial distribution with number of trials $n$ and success probability $p$, thus (12) and (13) provide a soft regime classification in the following sense: for each symbol $i \in [k]$, if $q_i$ is large then it mainly contributes to the smooth component in (13), and if $q_i$ is small then it mostly contributes to the non-smooth component in (12). Hence, each symbol contributes to both components with weights determined by a suitable Binomial probability, and our estimator aims to estimate these two components separately. The reason why we choose Binomial probabilities as the weights in (12) and (13) is that the smoothed moments with the current weights admit unbiased estimators, which will be discussed in details below.


• Regime classification: As explained in the high-level idea, the above estimator construction does not involve a hard regime classification as in previous works. However, the support constraint of $\mu$ in the linear program (7) and the thresholds $c_1 \log n$ in (12), (13) indicate that a soft regime classification is still performed with $q = 2c_1 \log n / n$ being the boundary of the non-smooth (where $q$ is smaller than the threshold) and smooth regimes (where $q$ is greater than the threshold). The choice ensures that symbols with probability far away from the boundary have weight $\mathbb{P}(\mathsf{B}(n/2, q) \le c_1 \log n)$ close to either zero or one, as shown in Lemma 4 in the Appendix.

• The function $g_d(x)$ and the modification $\tilde g_d(x)$: The choice of the function $g_d(x)$ in (4) is to satisfy the following identity: for $h \sim \mathsf{B}(n/2, q)$, it holds that
$$\mathbb{E}[g_d(h)] = \sum_{x=0}^{n/2} \binom{n/2}{x} q^x (1-q)^{n/2-x} \prod_{d'=0}^{d-1} \frac{x - d'}{n/2 - d'} = \sum_{x=d}^{n/2} \binom{n/2-d}{x-d} q^x (1-q)^{n/2-x} = q^d. \tag{14}$$
In other words, the quantity $g_d(h)$ is an unbiased estimator of $q^d$. Therefore, the function $g_d(x)$ is used for the unbiased estimation of monomials of $q$. We cut off the function $g_d(x)$ outside the interval $[0, 2c_1 \log n]$ to ensure a uniform small difference $|\tilde g_d(x) - \tilde g_d(y)|$ for all $x, y \in \mathbb{N}$ with $|x - y| \le 1$, which helps to reduce the variance of the smoothed moment estimator $\hat M_{\alpha,d}$ discussed below.

• The smoothed moment estimator $\hat M_{\alpha,d}$: Ideally we would like to estimate the local moments $\sum_{i=1}^{k} p_i^\alpha q_i^d \cdot \mathbb{1}(q_i \le 2c_1 \log n / n)$ for each $d \in \mathbb{N}$, if a hard regime classification could work. However, the above quantity does not admit an unbiased estimator, and is information-theoretically hard to estimate if some $q_i$'s are close to the boundary. Most previous works overcome the above difficulty via the sample splitting, where they split the empirical probability $\hat q_i$ into two independent halves $\hat q_{i,1}$ and $\hat q_{i,2}$, and the modified moments $\sum_{i=1}^{k} p_i^\alpha q_i^d \cdot \mathbb{1}(\hat q_{i,1} \le 2c_1 \log n / n)$ admit a simple unbiased estimator based on $\hat p_i$ and $\hat q_{i,2}$. Instead of sample splitting, we consider the following smoothed moments defined as
$$M_{\alpha,d} \triangleq \sum_{i=1}^{k} p_i^\alpha q_i^d \cdot \mathbb{P}\left( \mathsf{B}\left(\frac{n}{2}, q_i\right) \le c_1 \log n \right). \tag{15}$$
By Lemma 4, we see that the Binomial probability in (15) roughly corresponds to the indicator function $\mathbb{1}(q_i \le 2c_1 \log n / n)$, while (15) makes a soft decision. The key advantage of introducing the smoothed moments taking the form of (15) is that $M_{\alpha,d}$ now admits an unbiased estimator:

$$\mathbb{E}\left[ \sum_{i=1}^{k} \left[ \prod_{\ell=0}^{\alpha-1} \frac{m\hat p_i - \ell}{m - \ell} \right] \cdot \left[ \sum_{0 \le s \le c_1 \log n} \mathbb{P}\left( \mathsf{HG}\left(n, n\hat q_i, \frac{n}{2}\right) = s \right) g_d(n\hat q_i - s) \right] \right] = M_{\alpha,d}. \tag{16}$$
To see (16), note that for $s \sim \mathsf{HG}(n, n\hat q, n/2)$ and $t = n\hat q - s$, they are independent random variables each of which follows a Binomial distribution $\mathsf{B}(n/2, q)$ unconditionally on $n\hat q \sim \mathsf{B}(n, q)$. Hence, using the independence of $\hat p_i$ and $\hat q_i$, the unbiased condition (14) leads to (16). An alternative view of (16) is via the reduction by sufficiency of the estimator using sample splitting, and a similar idea concerning Poisson distributions can be found in [HS20].

• The linear program: The high-level intuition behind the specific objective function $L(\mu, \hat M)$ in the linear program (7) is that after applying the duality between moment matching and polynomial approximation, the following deterministic inequality holds for the estimation error of $\hat D_{\alpha,\mathrm{ns}}$ in estimating $D_{\alpha,\mathrm{ns}}$:
$$|\hat D_{\alpha,\mathrm{ns}} - D_{\alpha,\mathrm{ns}}| \lesssim_{\alpha,c_1,c_2} \frac{kU^\alpha}{n\log n} + n^{O(c_2)+\alpha-1} L(\hat\mu, \hat M).$$
In other words, the random estimation error is controlled exactly by the objective function $L(\hat\mu, \hat M)$. Now comes the key intuition: there must be an unknown feasible solution $\tilde\mu$ to (7) which depends on the unknown distributions $(P, Q)$, with moments close to the smoothed moments $M_{\alpha,d}$ in (15). Hence, the fact that $\hat\mu$ is a minimizer gives $L(\hat\mu, \hat M) \le L(\tilde\mu, \hat M)$, where the last quantity is a linear combination of the differences $|\hat M_{\alpha,d} - M_{\alpha,d}|$ whose expectation is thus much easier to upper bound.

In summary, the form of the linear program (7) ensures that as long as $\hat M_{\alpha,d}$ are accurate estimators of the true smoothed moments in (15), the plug-in approach (8) of the resulting minimizer $\hat\mu$ will also be accurate for the target quantity $D_{\alpha,\mathrm{ns}}$ (a discretized implementation sketch of this linear program is given below). Moreover, the bivariate polynomial we will use to approximate $p^\alpha/q^{\alpha-1}$ on the set $\mathcal{S}$ will take the form $p^\alpha Q(q)$ for some univariate polynomial $Q$ with degree at most $D$, so the degree of $p$ is always $\alpha$ and the support constraint of $\mu$ is present in the linear program. The last constraint $\mu(\mathbb{R}^2) \le k$ is mostly a technical condition and satisfied by the target measure $\tilde\mu$ in (18).

• The smooth component $\hat D_{\alpha,\mathrm{s}}$: For symbols with a significant contribution to $D_{\alpha,\mathrm{s}}$ in (13), Lemma 4 shows that we must have $q > c_1 \log n / n$. Consequently, the target functional $p^\alpha/q^{\alpha-1}$ becomes smooth on both arguments $(p, q)$, and the plug-in approach (possibly with bias correction) is expected to work well. To this end, the function $h_\alpha(x)$ in (9) serves as a plug-in estimator with order-one bias correction in the sense that if $X \sim \mathsf{B}(n/2, q)$ with $q \ge c_1 \log n / n$, it holds that $\mathbb{E}[h_\alpha(X)] \approx q^{1-\alpha}$ (cf. Lemma 2). Consequently, using the same insights of the smoothed moments and the identity (16), it can be seen that the estimator $\hat D_{\alpha,\mathrm{s}}$ in (10) is close to $D_{\alpha,\mathrm{s}}$ in (13) in expectation.

• Dependence on $(k, U)$: A potential drawback of the above estimator construction is that the estimator requires the knowledge of both the support size $k$ and the upper bound $U$, which are used in the constraints of the linear program (7). However, we remark that our main aim is to theoretically provide a minimax rate-optimal methodology which does not require any Poissonization, sample splitting, or explicit polynomial construction in general functional estimation problems, and therefore we focus more on the key theoretical insights than other possible improvements. Moreover, we also point out that if we are allowed to use an explicit polynomial approximation (still without Poissonization or sample splitting), the linear program (7) can be replaced by an explicit unbiased estimator of the above polynomial and become agnostic to both parameters $k$ and $U$.

• Computational complexity: The main computation lies in the steps (6), (7), (8) and (10), where other steps take $O(1)$ or $O(\log n)$ time for each evaluation. For steps (6) and (10), the inner summation is over at most $n\hat q_i$ different values of $s$ for each symbol $i \in [k]$, and therefore the total computational time is $O(\alpha k \sum_{i=1}^{k} n\hat q_i) = O(\alpha n k)$. As for the linear program (7) and integration (8), we may quantize the support of $\mu$ for both tasks to result in a polynomial time algorithm in total.

Based on the above discussions, we make a summary of how the constructed estimator $\hat D_\alpha$ satisfies the three properties in the introduction. First, to get rid of the sample splitting, the key idea is to use the smoothed moments which admit unbiased estimators. Second, to avoid the explicit construction of the polynomial, we use a linear program based on moment matching in the non-smooth regime which is independent of the target functional (except for the final plug-in step in (8)). Finally, to handle the original sampling model without Poissonization, we use Binomial probabilities as the weights in the estimator construction, while the dependence across symbols is handled in Section 3 based on the Efron–Stein–Steele inequality (cf. Lemma 7 in the Appendix).
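
As mentioned in the linear-program remark above, one concrete way to carry out the moment-matching step is to quantize the support set $\mathcal{S}$ on a finite grid and solve the resulting finite-dimensional linear program. The sketch below does this with scipy.optimize.linprog; the grid resolution, the input vector M_hat, and the constants are placeholders, so it illustrates the quantization idea rather than the exact program analyzed in the paper. The absolute values in the objective are handled in the standard way, by introducing one slack variable per moment constraint.

```python
import numpy as np
from math import log
from scipy.optimize import linprog

def solve_moment_lp(M_hat, alpha, n, k, U, c1, grid=50):
    """Discretized version of the linear program (7): find a nonnegative measure mu
    on a grid covering S = {q <= 3*c1*log(n)/n, p <= U*q} whose moments
    int p^alpha q^d dmu approximately match the estimated smoothed moments M_hat[d]."""
    D = len(M_hat) - 1
    q_max = 3 * c1 * log(n) / n
    qs = np.linspace(q_max / grid, q_max, grid)
    pts = [(p, q) for q in qs for p in np.linspace(0.0, U * q, grid)]
    P = np.array([p for p, _ in pts])
    Q = np.array([q for _, q in pts])
    J = len(pts)

    # variables: [w_1..w_J, t_0..t_D]; minimize sum_d (3*c1*log n / n)^{-d} * t_d
    weights = np.array([(3 * c1 * log(n) / n) ** (-d) for d in range(D + 1)])
    c = np.concatenate([np.zeros(J), weights])

    # |sum_j p_j^alpha q_j^d w_j - M_hat[d]| <= t_d  (two one-sided constraints per d)
    A_ub, b_ub = [], []
    for d in range(D + 1):
        a = P ** alpha * Q ** d
        A_ub.append(np.concatenate([a, -np.eye(D + 1)[d]]))
        b_ub.append(M_hat[d])
        A_ub.append(np.concatenate([-a, -np.eye(D + 1)[d]]))
        b_ub.append(-M_hat[d])
    # total mass constraint mu(R^2) <= k
    A_ub.append(np.concatenate([np.ones(J), np.zeros(D + 1)]))
    b_ub.append(k)

    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=(0, None), method="highs")
    w = res.x[:J]
    # plug the fitted measure into the target functional as in (8)
    vals = P ** alpha / Q ** (alpha - 1)
    return float(np.sum(w * vals)), (P, Q, w)
```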

The performance of the estimator $\hat D_\alpha$ is summarized in the following theorem.

Theorem 3. For integer α ≥ 2, constant c1 > 0 large enough and c2 log n ≥ 1, it holds that

$$\sup_{(P,Q)\in\mathcal{M}_k(U)} \mathbb{E}_{(P,Q)}|\hat D_\alpha - D_\alpha(P\|Q)| \lesssim_{\alpha,c_1,c_2} \frac{kU^\alpha}{n\log n} + \frac{\sqrt{k}\,U^\alpha}{n^{1-cc_2}} + \frac{U^{\alpha-1}}{\sqrt{m}} + \frac{U^{\alpha-1/2}}{\sqrt{n}},$$

where c > 0 is an absolute constant.

Based on Theorem 3 and the assumption $\log k \gtrsim_\alpha \log n$ in Theorem 1, we conclude that by choosing $c_2 > 0$ small enough, Theorem 3 implies the upper bound of Theorem 1.

2.2 Construction for KL divergence

The estimator construction for the KL divergence $D_{\mathrm{KL}}(P\|Q)$ is entirely similar. Specifically, if we write the KL divergence $D_{\mathrm{KL}}(P\|Q) = -H(P) - \sum_{i=1}^{k} p_i \log q_i$ as the negated sum of the Shannon entropy $H(P) = \sum_{i=1}^{k} -p_i \log p_i$ and the cross entropy $\sum_{i=1}^{k} p_i \log q_i$, the previous estimator construction can be applied to both entropies separately. The detailed construction is as follows (assuming that both $m$ and $n$ are even):

1. Estimate the non-smooth component of the cross entropy: first, following the same steps (4)–(7) with $\alpha = 1$, we solve for the problem-independent measure $\hat\mu^{(1)}$. Then we again apply the plug-in approach for the non-smooth component of the cross entropy:
$$\hat D^{(1)}_{\mathrm{KL},\mathrm{ns}} = \int -p \log q \cdot \hat\mu^{(1)}(dp, dq).$$

2. Estimate the non-smooth component of the Shannon entropy:

• Construct the function
$$g'_d(x) = \prod_{d'=0}^{d-1} \frac{x - d'}{m/2 - d'}$$
and its modification $\tilde g'_d(x) = \min\{ g'_d(x),\ g'_d(\lceil 2c_1 \log m \rceil) \}$, as well as the estimators of smoothed moments:
$$\hat M^{(2)}_d = \sum_{i=1}^{k} \sum_{0 \le s \le c_1 \log m} \mathbb{P}\left( \mathsf{HG}\left(m, m\hat p_i, \frac{m}{2}\right) = s \right) \tilde g'_d(m\hat p_i - s).$$

• Set $D' = \lceil c_2 \log m \rceil$ and solve the following linear program for the optimal $\hat\mu^{(2)}$:
$$\begin{aligned}
\text{minimize } \ & L^{(2)}(\mu^{(2)}, \hat M^{(2)}) \triangleq \sum_{d=0}^{D'} \left( \frac{3c_1 \log m}{m} \right)^{-d} \cdot \left| \int p^d \mu^{(2)}(dp) - \hat M^{(2)}_d \right| \\
\text{subject to } \ & \mathrm{supp}(\mu^{(2)}) \subseteq \left[ 0, \frac{3c_1 \log m}{m} \right], \quad \mu^{(2)}(\mathbb{R}) \le k.
\end{aligned}$$

• Apply the plug-in approach to estimate the non-smooth component of the Shannon entropy as
$$\hat D^{(2)}_{\mathrm{KL},\mathrm{ns}} = \int p \log p \cdot \hat\mu^{(2)}(dp).$$

3. Estimate the smooth component of the cross entropy: define the function $h^{(1)} : \mathbb{N} \to \mathbb{R}$ with
$$h^{(1)}(x) = \mathbb{1}(x \neq 0) \cdot \left( -\log\frac{2x}{n} - \frac{1 - 2x/n}{2x} \right),$$

and compute the following estimator
$$\hat D^{(1)}_{\mathrm{KL},\mathrm{s}} = \sum_{i=1}^{k} \hat p_i \cdot \left[ \sum_{s > c_1 \log n} \mathbb{P}\left( \mathsf{HG}\left(n, n\hat q_i, \frac{n}{2}\right) = s \right) h^{(1)}(n\hat q_i - s) \right].$$

4. Estimate the smooth component of the Shannon entropy: define the function $h^{(2)} : \mathbb{N} \to \mathbb{R}$ with
$$h^{(2)}(x) = \frac{2x}{m} \log\frac{2x}{m} - \frac{1 - 2x/m}{m},$$

and compute the following estimator
$$\hat D^{(2)}_{\mathrm{KL},\mathrm{s}} = \sum_{i=1}^{k} \sum_{s > c_1 \log m} \mathbb{P}\left( \mathsf{HG}\left(m, m\hat p_i, \frac{m}{2}\right) = s \right) h^{(2)}(m\hat p_i - s).$$

5. Final estimator:
$$\hat D_{\mathrm{KL}} = \hat D^{(1)}_{\mathrm{KL},\mathrm{ns}} + \hat D^{(2)}_{\mathrm{KL},\mathrm{ns}} + \hat D^{(1)}_{\mathrm{KL},\mathrm{s}} + \hat D^{(2)}_{\mathrm{KL},\mathrm{s}}.$$

In the above construction, the functions $h^{(1)}$ and $h^{(2)}$ are the bias-corrected plug-in estimators for the target functions $-\log q$ and $p\log p$, respectively. The estimation performance of the above estimator $\hat D_{\mathrm{KL}}$ is summarized in the following theorem.
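
The particular correction terms in $h^{(1)}$ and $h^{(2)}$ can be motivated by a heuristic second-order (delta-method) expansion; the following sketch only indicates where the correction terms come from, and the rigorous statements are Lemma 10 and its proof. For $X \sim \mathsf{B}(n/2, q)$ write $\hat q = 2X/n$, so that $\mathbb{E}[\hat q] = q$ and $\operatorname{Var}(\hat q) = 2q(1-q)/n$. A second-order expansion $\mathbb{E}[f(\hat q)] \approx f(q) + \tfrac12 f''(q)\operatorname{Var}(\hat q)$ with $f(t) = -\log t$ (so $f''(t) = 1/t^2$) gives
$$\mathbb{E}[-\log\hat q] \approx -\log q + \frac{1-q}{nq},$$
so subtracting the plug-in estimate $(1 - 2X/n)/(2X)$ of the bias term yields $h^{(1)}$. Similarly, for $\hat p = 2Y/m$ with $Y \sim \mathsf{B}(m/2, p)$ and $f(t) = t\log t$ (so $f''(t) = 1/t$),
$$\mathbb{E}[\hat p\log\hat p] \approx p\log p + \frac{1-p}{m},$$
which is removed by the correction term $-(1 - 2Y/m)/m$ in $h^{(2)}$.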


Theorem 4. For constant $c_1 > 0$ large enough and $c_2 \log n \ge 1$, it holds that

$$\sup_{(P,Q)\in\mathcal{M}_k(U)} \mathbb{E}_{(P,Q)}|\hat D_{\mathrm{KL}} - D_{\mathrm{KL}}(P\|Q)| \lesssim_{c_1,c_2} \frac{k}{m\log m} + \frac{kU}{n\log n} + (mn)^{cc_2}\left( \frac{\sqrt{k}}{m} + \frac{\sqrt{k}\,U}{n} \right) + \frac{\log U}{\sqrt{m}} + \sqrt{\frac{U}{n}},$$

where c > 0 is an absolute constant.

Based on Theorem 4 and the assumption $\log k \gtrsim \log(m+n)$ in Theorem 2, we again conclude that by choosing the constant $c_2 > 0$ small enough, Theorem 4 implies the upper bound of Theorem 2.

3 Estimator Analysis

In this section, we provide the roadmap of the error analysis of the estimator $\hat D_\alpha$ and prove Theorem 3, relegating the similar analysis of the estimator $\hat D_{\mathrm{KL}}$ and the proofs of main lemmas to the full version [HJW20]. Clearly, a triangle inequality gives

$$\mathbb{E}|\hat D_\alpha - D_\alpha(P\|Q)| \le \frac{\mathbb{E}|\hat D_{\alpha,\mathrm{ns}} - D_{\alpha,\mathrm{ns}}| + \mathbb{E}|\hat D_{\alpha,\mathrm{s}} - D_{\alpha,\mathrm{s}}|}{\alpha(\alpha-1)}, \tag{17}$$
and it suffices to upper bound the estimation error of the estimators $\hat D_{\alpha,\mathrm{ns}}, \hat D_{\alpha,\mathrm{s}}$ in (8), (10) in estimating the quantities $D_{\alpha,\mathrm{ns}}, D_{\alpha,\mathrm{s}}$ in (12), (13), respectively. To this end, Section 3.1 shows that for the plug-in estimator $\hat D_{\alpha,\mathrm{ns}}$ of the moment matching estimator, a deterministic upper bound of $|\hat D_{\alpha,\mathrm{ns}} - D_{\alpha,\mathrm{ns}}|$ in terms of $L(\tilde\mu, \hat M)$ for a properly constructed measure $\tilde\mu$ is available via an implicit polynomial approximation. Consequently, the estimation error of the non-smooth component $\mathbb{E}|\hat D_{\alpha,\mathrm{ns}} - D_{\alpha,\mathrm{ns}}|$ is effectively controlled by $\mathbb{E}[L(\tilde\mu, \hat M)]$, which further reduces to the bias-variance analysis of the smoothed moments estimator presented in Section 3.2. Finally, Section 3.3 deals with the estimation error (specifically, bias and variance) of the bias-corrected plug-in approach for the smooth component $D_{\alpha,\mathrm{s}}$.

3.1 A deterministic inequality via implicit polynomial approximation

In this subsection, we prove a deterministic upper bound on $|\hat D_{\alpha,\mathrm{ns}} - D_{\alpha,\mathrm{ns}}|$ for the plug-in approach of the measure $\hat\mu$ in the linear program (7). To this end, consider the following measure $\tilde\mu$ based on the perfect knowledge of $(P, Q)$:

$$\tilde\mu = \sum_{i=1}^{k} \delta_{(p_i, q_i)} \cdot \mathbb{P}\left( \mathsf{B}\left(\frac{n}{2}, q_i\right) \le c_1 \log n \right) \mathbb{1}\left( q_i \le \frac{3c_1 \log n}{n} \right), \tag{18}$$
where $\delta_{(p,q)}$ denotes the Dirac point measure on the single point $(p, q)$. Clearly the measure $\tilde\mu$ is supported on $\mathcal{S}$ and satisfies $\tilde\mu(\mathbb{R}^2) \le k$, thus it is a feasible solution to the linear program (7).


Moreover, a combination of (12) and (18) gives
$$\left| \int \frac{p^\alpha}{q^{\alpha-1}} \tilde\mu(dp, dq) - D_{\alpha,\mathrm{ns}} \right| = \sum_{i=1}^{k} \frac{p_i^\alpha}{q_i^{\alpha-1}} \cdot \mathbb{P}\left( \mathsf{B}\left(\frac{n}{2}, q_i\right) \le c_1 \log n \right) \mathbb{1}\left( q_i > \frac{3c_1 \log n}{n} \right) \le \frac{1}{n^{5\alpha}} \sum_{i=1}^{k} \frac{p_i^\alpha}{q_i^{\alpha-1}} \le \frac{U^{\alpha-1}}{n^{5\alpha}} \sum_{i=1}^{k} p_i = \frac{U^{\alpha-1}}{n^{5\alpha}},$$
where the inequalities follow from Lemma 4 and the bounded likelihood ratio $p_i \le Uq_i$. Hence, by a triangle inequality, the following deterministic inequality holds:

$$|\hat D_{\alpha,\mathrm{ns}} - D_{\alpha,\mathrm{ns}}| \le \left| \int \frac{p^\alpha}{q^{\alpha-1}} \left( \hat\mu(dp, dq) - \tilde\mu(dp, dq) \right) \right| + \frac{U^{\alpha-1}}{n^{5\alpha}}, \tag{19}$$
and the quantity of interest is the integral difference of the function $p^\alpha/q^{\alpha-1}$ with respect to the measures $\hat\mu$ and $\tilde\mu$.

Next we introduce an approximating polynomial of the function $p^\alpha/q^{\alpha-1}$ on the set $\mathcal{S}$, where it is implicit in the sense that it is only used in the estimator analysis but not in the estimator construction. By Lemma 5 in the Appendix, there exists a polynomial $Q_0(x) = \sum_{d=\alpha}^{D+\alpha} a_d x^d$ such that
$$\sup_{x\in[0,1]} |x - Q_0(x)| \lesssim_\alpha \frac{1}{D^2}.$$

Now define
$$Q(q) = \frac{3c_1 \log n}{n q^\alpha} \cdot Q_0\left( \frac{nq}{3c_1 \log n} \right), \qquad q \in \left[ 0, \frac{3c_1 \log n}{n} \right];$$
it is clear that $Q$ is a degree-$D$ polynomial, and

$$\sup_{(p,q)\in\mathcal{S}} \left| \frac{p^\alpha}{q^{\alpha-1}} - p^\alpha Q(q) \right| \le U^\alpha \sup_{q\in[0,\,3c_1\log n/n]} \left| q - q^\alpha Q(q) \right| \lesssim_{\alpha,c_1,c_2} \frac{U^\alpha}{n\log n}. \tag{20}$$

In other words, the bivariate polynomial $p^\alpha Q(q)$ is a good approximation of $p^\alpha/q^{\alpha-1}$ on $\mathcal{S}$ with a uniform approximation error in (20). Moreover, as $|Q_0(x)|$ is upper bounded by $O(1)$ on the unit interval $[0,1]$, the even polynomial $P_0(t) = Q_0(t^2)$ on $[-1,1]$ is also bounded by a constant. Consequently, Lemma 6 shows that $\max_{\alpha \le d \le \alpha+D} |a_d| \le (1+\sqrt{2})^{2(\alpha+D)} = O_\alpha(n^{cc_2})$ with some absolute constant $c > 0$. As a further result, if we express $Q(q) = \sum_{d=0}^{D} b_d q^d$, the coefficients have the upper bound
$$|b_d| = \left( \frac{3c_1 \log n}{n} \right)^{1-d-\alpha} |a_{d+\alpha}| \le n^{cc_2} \cdot \left( \frac{3c_1 \log n}{n} \right)^{1-d-\alpha}, \qquad d = 0, 1, \cdots, D. \tag{21}$$


Consequently, by (20) and (21), the integral difference in (19) can be upper bounded by
$$\begin{aligned}
& \left| \int \frac{p^\alpha}{q^{\alpha-1}} \left( \hat\mu(dp, dq) - \tilde\mu(dp, dq) \right) \right| \\
&\le \left| \int_{\mathcal{S}} \left( \frac{p^\alpha}{q^{\alpha-1}} - p^\alpha Q(q) \right) \left( \hat\mu(dp, dq) - \tilde\mu(dp, dq) \right) \right| + \sum_{d=0}^{D} |b_d| \cdot \left| \int p^\alpha q^d \left( \hat\mu(dp, dq) - \tilde\mu(dp, dq) \right) \right| \\
&\lesssim_{\alpha,c_1,c_2} \int_{\mathcal{S}} \frac{U^\alpha}{n\log n} \left( \hat\mu(dp, dq) + \tilde\mu(dp, dq) \right) + n^{cc_2} \left( \frac{3c_1 \log n}{n} \right)^{1-\alpha} \sum_{d=0}^{D} \left( \frac{3c_1 \log n}{n} \right)^{-d} \left| \int p^\alpha q^d \left( \hat\mu(dp, dq) - \tilde\mu(dp, dq) \right) \right| \\
&\lesssim_{\alpha,c_1,c_2} \frac{kU^\alpha}{n\log n} + n^{cc_2} \left( \frac{3c_1 \log n}{n} \right)^{1-\alpha} L(\hat\mu, \tilde\mu),
\end{aligned}$$
where the last inequality is due to the constraint $\hat\mu(\mathbb{R}^2), \tilde\mu(\mathbb{R}^2) \le k$, and we slightly abuse the notation $L(\hat\mu, \tilde\mu)$ to denote $L(\hat\mu, M)$ in (7) with $\hat M_{\alpha,d}$ replaced by $\int p^\alpha q^d \tilde\mu(dp, dq)$. Now by the triangle inequality and the definition of the minimizer $\hat\mu$, we have
$$L(\hat\mu, \tilde\mu) \le L(\hat\mu, \hat M) + L(\tilde\mu, \hat M) \le 2 L(\tilde\mu, \hat M),$$
and therefore the following final inequality holds for the difference $|\hat D_{\alpha,\mathrm{ns}} - D_{\alpha,\mathrm{ns}}|$:

$$|\hat D_{\alpha,\mathrm{ns}} - D_{\alpha,\mathrm{ns}}| \lesssim_{\alpha,c_1,c_2} \frac{kU^\alpha}{n\log n} + n^{cc_2} \left( \frac{3c_1 \log n}{n} \right)^{1-\alpha} L(\tilde\mu, \hat M), \tag{22}$$

where c > 0 is an absolute constant.

3.2 Bias-variance analysis in the non-smooth regime

To complete the error analysis for the non-smooth component, the deterministic inequality (22) shows that it suffices to upper bound the quantity $\mathbb{E}[L(\tilde\mu, \hat M)]$, or equivalently, to upper bound the estimation error $\mathbb{E}|\hat M_{\alpha,d} - M^\star_{\alpha,d}|$ of the smoothed moments for each $d = 0, 1, \cdots, D$, where

$$M^\star_{\alpha,d} \triangleq \int p^\alpha q^d \tilde\mu(dp, dq) = \sum_{i=1}^{k} p_i^\alpha q_i^d \cdot \mathbb{P}\left( \mathsf{B}\left(\frac{n}{2}, q_i\right) \le c_1 \log n \right) \mathbb{1}\left( q_i \le \frac{3c_1 \log n}{n} \right)$$
is very close to the smoothed moments defined in (15). Based on the following bias-variance decomposition

$$\mathbb{E}|\hat M_{\alpha,d} - M^\star_{\alpha,d}| \le |\mathbb{E}\hat M_{\alpha,d} - M^\star_{\alpha,d}| + \sqrt{\mathrm{Var}(\hat M_{\alpha,d})},$$

it suffices to analyze the bias and variance of the estimator $\hat M_{\alpha,d}$, respectively. Specifically, the bias $|\mathbb{E}\hat M_{\alpha,d} - M^\star_{\alpha,d}|$ is expected to be very small, as the estimator $\hat M_{\alpha,d}$ would be exactly unbiased if the modification $\tilde g_d(x)$ in (5) were not applied and $M^\star_{\alpha,d} = M_{\alpha,d}$, and the expected differences incurred by the modification $\tilde g_d$ as well as $|M_{\alpha,d} - M^\star_{\alpha,d}|$ are both small due to the concentration inequalities in Lemma 4. As for the variance $\mathrm{Var}(\hat M_{\alpha,d})$, the truncation in $\tilde g_d(x)$ ensures that changing one observation only results in a tiny change in the estimator $\hat M_{\alpha,d}$, and thus an involved application of the Efron–Stein–Steele inequality (cf. Lemma 7) gives a small variance. The following lemma summarizes the upper bounds on the bias and the variance.

Lemma 1. For integer α ≥ 2, constant $c_1 > 0$ large enough, $c_2 \log n \ge 1$ and any $d = 0, 1, \cdots, D$, if $(P, Q) \in \mathcal{M}_k(U)$, it holds that
$$|\mathbb{E}\hat M_{\alpha,d} - M^\star_{\alpha,d}| \lesssim_{\alpha,c_1,c_2} \frac{kU^\alpha}{n^{4\alpha}} \left( \frac{3c_1 \log n}{n} \right)^{\alpha+d},$$
$$\sqrt{\mathrm{Var}(\hat M_{\alpha,d})} \lesssim_{\alpha,c_1,c_2} \sqrt{k}\,U^\alpha \cdot n^{c'c_2} \left( \frac{3c_1 \log n}{n} \right)^{\alpha+d},$$

where c′ > 0 is an absolute constant.

Based on Lemma 1, the quantity $\mathbb{E}[L(\tilde\mu, \hat M)]$ can be upper bounded as
$$\begin{aligned}
\mathbb{E}[L(\tilde\mu, \hat M)] &\le \sum_{d=0}^{D} \left( \frac{3c_1 \log n}{n} \right)^{-d} \left[ |\mathbb{E}\hat M_{\alpha,d} - M^\star_{\alpha,d}| + \sqrt{\mathrm{Var}(\hat M_{\alpha,d})} \right] \\
&\lesssim_{\alpha,c_1,c_2} \sum_{d=0}^{D} \left( \frac{3c_1 \log n}{n} \right)^{-d} \cdot \left( \frac{3c_1 \log n}{n} \right)^{\alpha+d} \left[ \frac{kU^\alpha}{n^{4\alpha}} + \sqrt{k}\,U^\alpha \cdot n^{c'c_2} \right] \\
&\lesssim_{\alpha,c_1,c_2} \left( \frac{3c_1 \log n}{n} \cdot U \right)^{\alpha} \left[ \frac{k\log n}{n^{4\alpha}} + \sqrt{k}\,n^{c'c_2} \log n \right],
\end{aligned} \tag{23}$$
which together with (22) completes the analysis of $\mathbb{E}|\hat D_{\alpha,\mathrm{ns}} - D_{\alpha,\mathrm{ns}}|$.

3.3 Bias-variance analysis in the smooth regime

Now the only remaining quantity of interest is $\mathbb{E}|\hat D_{\alpha,\mathrm{s}} - D_{\alpha,\mathrm{s}}|$, and we use the bias-variance decomposition again to write
$$\mathbb{E}|\hat D_{\alpha,\mathrm{s}} - D_{\alpha,\mathrm{s}}| \le |\mathbb{E}\hat D_{\alpha,\mathrm{s}} - D_{\alpha,\mathrm{s}}| + \sqrt{\mathrm{Var}(\hat D_{\alpha,\mathrm{s}})}.$$

To handle the bias, we first need the following lemma on the performance of hα in (9).

Lemma 2. Let $X \sim \mathsf{B}(n/2, q)$ with $q \ge c_1 \log n / n$. Then for integer α ≥ 2 and constant $c_1 > 0$ large enough, we have
$$\left| \mathbb{E}[h_\alpha(X)] - q^{1-\alpha} \right| \lesssim_{\alpha,c_1} \frac{1}{nq^\alpha \log n}.$$
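
As an aside, the bias of $h_\alpha$ can be checked numerically by summing against the exact Binomial pmf; the sketch below does this with SciPy (the specific values of $n$, $q$, $\alpha$ are arbitrary placeholders, and the printed quantities are the exact bias and the order of the Lemma 2 bound).

```python
import numpy as np
from scipy.stats import binom

def h_alpha(x, n, alpha):
    # bias-corrected plug-in estimator (9) of q^{1-alpha}, with h_alpha(0) = 0
    if x == 0:
        return 0.0
    qhat = 2 * x / n
    return 1 / qhat ** (alpha - 1) - alpha * (alpha - 1) / n * (1 - qhat) / qhat ** alpha

# Exact bias of h_alpha(X) for X ~ B(n/2, q): a quick sanity check of Lemma 2.
n, alpha, q = 2000, 2, 0.05
ks = np.arange(n // 2 + 1)
pmf = binom.pmf(ks, n // 2, q)
expectation = sum(pmf[k] * h_alpha(k, n, alpha) for k in ks)
print(abs(expectation - q ** (1 - alpha)), 1 / (n * q ** alpha * np.log(n)))
```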

Let Xi ∼ B(n/2, qi) for all i ∈ [k], then the unbiased properties in (14) and (16) give

$$|\mathbb{E}\hat D_{\alpha,\mathrm{s}} - D_{\alpha,\mathrm{s}}| \le \sum_{i=1}^{k} p_i^\alpha \left| \mathbb{E}[h_\alpha(X_i)] - q_i^{1-\alpha} \right| \cdot \mathbb{P}\left( \mathsf{B}\left(\frac{n}{2}, q_i\right) > c_1 \log n \right).$$


By Lemma 4, the Binomial probability is at most $n^{-5\alpha}$ if $q_i \le c_1 \log n / n$, while the quantity $p_i^\alpha |\mathbb{E}[h_\alpha(X_i)] - q_i^{1-\alpha}|$ is always upper bounded by $O_\alpha(p_i(U^{\alpha-1} + n^{\alpha-1}))$. Therefore, the total contribution of symbols $i \in [k]$ with small probability $q_i \le c_1 \log n / n$ to the above quantity is at most $O_{\alpha,c_1}(n^{-5\alpha}(U^{\alpha-1} + n^{\alpha-1}))$. Now applying Lemma 2 for symbols with large probability gives the final bias bound

$$|\mathbb{E}\hat D_{\alpha,\mathrm{s}} - D_{\alpha,\mathrm{s}}| \lesssim_{\alpha,c_1} \frac{U^{\alpha-1} + n^{\alpha-1}}{n^{5\alpha}} + \sum_{i=1}^{k} \frac{p_i^\alpha}{nq_i^\alpha \log n} \lesssim_{\alpha,c_1} \frac{kU^\alpha}{n\log n}. \tag{24}$$

As for the variance $\mathrm{Var}(\hat D_{\alpha,\mathrm{s}})$, we again make use of the Efron–Stein–Steele inequality in Lemma 7 to handle the dependence across different empirical frequencies, and the following lemma presents the final variance bound.

Lemma 3. For integer α ≥ 2 and constant $c_1 > 0$ large enough, it holds that
$$\sqrt{\mathrm{Var}(\hat D_{\alpha,\mathrm{s}})} \lesssim_{\alpha,c_1,\varepsilon} \frac{U^{\alpha-1}}{\sqrt{m}} + \frac{U^{\alpha-1/2}}{\sqrt{n}} + \frac{\sqrt{k}\,U^\alpha}{n^{1-\varepsilon}},$$

where ε > 0 is any absolute constant.

Hence, combining the inequalities (17), (22), (23), (24) and Lemma 3, we conclude that

$$\mathbb{E}_{(P,Q)}|\hat D_\alpha - D_\alpha(P\|Q)| \lesssim_{\alpha,c_1,c_2} \frac{kU^\alpha}{n\log n} + \frac{\sqrt{k}\,U^\alpha}{n^{1-cc_2}} + \frac{U^{\alpha-1}}{\sqrt{m}} + \frac{U^{\alpha-1/2}}{\sqrt{n}}$$
for some absolute constant $c > 0$ and all $(P, Q) \in \mathcal{M}_k(U)$, which is exactly the statement of Theorem 3.

A Auxiliary Lemmas

Lemma 4 (Lemma 17 of [HJW18]). Fix any α > 0. There exists a constant $c = c(\alpha) > 0$ such that for all $c_1 > c$, the following inequalities hold for each $p \in [0, 1]$:
$$\mathbb{P}\left( \mathsf{B}(n/2, p) \le c_1 \log n \right) \le n^{-5\alpha} \quad \text{if } p \ge 3c_1\log n/n,$$
$$\mathbb{P}\left( \mathsf{B}(n/2, p) \ge c_1 \log n \right) \le n^{-5\alpha} \quad \text{if } p \le c_1\log n/n.$$
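
These two tail probabilities are easy to inspect numerically; the sketch below (assuming SciPy, with arbitrary placeholder values of $n$, $c_1$, $\alpha$) prints them for probabilities a constant factor above and below the threshold. Lemma 4 guarantees the $n^{-5\alpha}$ bound only once $c_1$ is large enough depending on α, so the printout is illustrative rather than a verification of the constant.

```python
import numpy as np
from scipy.stats import binom

n, c1, alpha = 10_000, 8.0, 2          # arbitrary choices
thresh = c1 * np.log(n)
p_hi = 3 * c1 * np.log(n) / n
p_lo = c1 * np.log(n) / n

print(binom.cdf(np.floor(thresh), n // 2, p_hi))    # P(B(n/2, p_hi) <= c1 log n)
print(binom.sf(np.ceil(thresh) - 1, n // 2, p_lo))  # P(B(n/2, p_lo) >= c1 log n)
print(n ** (-5 * alpha))                            # target bound n^{-5 alpha}
```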

Lemma 5. Fix any integer α ≥ 2. There exist absolute constants $c_\alpha, C_\alpha > 0$ such that for any $n \ge \alpha$, it holds that
$$\frac{c_\alpha}{n^2} \le \inf_{a_\alpha, \cdots, a_n \in \mathbb{R}} \sup_{x\in[0,1]} \left| x - \sum_{d=\alpha}^{n} a_d x^d \right| \le \frac{C_\alpha}{n^2}.$$

Lemma 6 ([Mar92]). Let $p(x) = \sum_{d=0}^{n} a_d x^d$ be a polynomial of degree $n$ with $|p(x)| \le 1$ for all $x \in [0, 1]$. Then
$$\max_{0 \le d \le n} |a_d| \le (1 + \sqrt{2})^n.$$

Lemma 7 (Efron–Stein–Steele Inequality [Ste86]). Let $X_1, X_2, \cdots, X_n$ be independent random variables, and for $i = 1, \cdots, n$, let $X'_i$ be an independent copy of $X_i$. Then for any $f$,
$$\mathrm{Var}(f(X_1, \cdots, X_n)) \le \frac{1}{2} \sum_{i=1}^{n} \mathbb{E}\left( f(X_1, \cdots, X_n) - f(X_1, \cdots, X_{i-1}, X'_i, X_{i+1}, \cdots, X_n) \right)^2.$$
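
Since Lemma 7 holds for any function of independent inputs, it can be illustrated by a quick Monte Carlo comparison of the two sides; the sketch below uses the coordinate-wise maximum of i.i.d. Uniform(0,1) variables as an arbitrary example.

```python
import numpy as np

rng = np.random.default_rng(0)

def efron_stein_check(f, n=20, trials=20_000):
    """Monte Carlo comparison of Var(f(X)) with the Efron-Stein upper bound
    (1/2) * sum_i E[(f(X) - f(X with X_i resampled))^2], for i.i.d. Uniform(0,1) inputs."""
    X = rng.random((trials, n))
    fX = f(X)
    bound = 0.0
    for i in range(n):
        Xi = X.copy()
        Xi[:, i] = rng.random(trials)      # independent copy of coordinate i
        bound += 0.5 * np.mean((fX - f(Xi)) ** 2)
    return np.var(fX), bound

variance, es_bound = efron_stein_check(lambda x: x.max(axis=1))
print(variance, es_bound)    # the variance should not exceed the Efron-Stein bound
```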


B Upper Bound Analysis of the KL Divergence

The analysis of the KL divergence estimator $\hat D_{\mathrm{KL}}$ follows similar lines to that of the α-divergences. First, using a triangle inequality, we may express the error as
$$\mathbb{E}|\hat D_{\mathrm{KL}} - D_{\mathrm{KL}}(P\|Q)| \le \mathbb{E}|\hat D^{(1)}_{\mathrm{KL},\mathrm{ns}} - D^{(1)}_{\mathrm{KL},\mathrm{ns}}| + \mathbb{E}|\hat D^{(2)}_{\mathrm{KL},\mathrm{ns}} - D^{(2)}_{\mathrm{KL},\mathrm{ns}}| + \mathbb{E}|\hat D^{(1)}_{\mathrm{KL},\mathrm{s}} + \hat D^{(2)}_{\mathrm{KL},\mathrm{s}} - (D^{(1)}_{\mathrm{KL},\mathrm{s}} + D^{(2)}_{\mathrm{KL},\mathrm{s}})|, \tag{25}$$

where

$$D^{(1)}_{\mathrm{KL},\mathrm{ns}} = \sum_{i=1}^{k} p_i \log\frac{1}{q_i} \cdot \mathbb{P}\left( \mathsf{B}\left(\frac{n}{2}, q_i\right) \le c_1 \log n \right), \qquad
D^{(2)}_{\mathrm{KL},\mathrm{ns}} = \sum_{i=1}^{k} p_i \log p_i \cdot \mathbb{P}\left( \mathsf{B}\left(\frac{m}{2}, p_i\right) \le c_1 \log m \right),$$
$$D^{(1)}_{\mathrm{KL},\mathrm{s}} = \sum_{i=1}^{k} p_i \log\frac{1}{q_i} \cdot \mathbb{P}\left( \mathsf{B}\left(\frac{n}{2}, q_i\right) > c_1 \log n \right), \qquad
D^{(2)}_{\mathrm{KL},\mathrm{s}} = \sum_{i=1}^{k} p_i \log p_i \cdot \mathbb{P}\left( \mathsf{B}\left(\frac{m}{2}, p_i\right) > c_1 \log m \right).$$

Based on (25), we apply the duality between moment matching and best polynomial approximation to upper bound the first two terms, and the bias-variance analysis for the last term. Similar to Section 3, the following subsections provide the roadmap of the analysis, with detailed proofs of the main lemmas postponed to Appendix D.

B.1 Deterministic inequalities

Similar to Section 3.1, we first find a random upper bound which holds deterministically for the quantities $|\hat D^{(1)}_{\mathrm{KL},\mathrm{ns}} - D^{(1)}_{\mathrm{KL},\mathrm{ns}}|$ and $|\hat D^{(2)}_{\mathrm{KL},\mathrm{ns}} - D^{(2)}_{\mathrm{KL},\mathrm{ns}}|$. To start with, note that the measures
$$\tilde\mu^{(1)} = \sum_{i=1}^{k} \delta_{(p_i, q_i)}\, \mathbb{P}\left( \mathsf{B}\left(\frac{n}{2}, q_i\right) \le c_1 \log n \right) \cdot \mathbb{1}\left( q_i \le \frac{3c_1 \log n}{n} \right),$$
$$\tilde\mu^{(2)} = \sum_{i=1}^{k} \delta_{p_i}\, \mathbb{P}\left( \mathsf{B}\left(\frac{m}{2}, p_i\right) \le c_1 \log m \right) \cdot \mathbb{1}\left( p_i \le \frac{3c_1 \log m}{m} \right)$$
are feasible solutions to the respective linear programs for the estimated measures $\hat\mu^{(1)}, \hat\mu^{(2)}$, and it remains to find good polynomial approximations of the functions $-p\log q$ and $p\log p$ over the respective domains. To this end, we recall the following result from [JVHW15, Lemma 18].

Lemma 8. There is an absolute constant $C > 0$ such that for all $n \in \mathbb{N}$, there exists a polynomial $P_n(x)$ of degree at most $n$ such that $P_n(0) = 0$ and
$$\sup_{x\in[0,1]} |x\log x - P_n(x)| \le \frac{C}{n^2}.$$


Let $P_0(x)$ and $Q_0(x)$ be the polynomials given by Lemma 8 with degrees $D' = \lceil c_2 \log m \rceil$ and $D = \lceil c_2 \log n \rceil$, respectively. Then by a simple scaling, the polynomial
$$P(p) = \frac{3c_1 \log m}{m} P_0\left( \frac{mp}{3c_1 \log m} \right) + p \log\left( \frac{3c_1 \log m}{m} \right)$$
satisfies that $|P(p) - p\log p| \lesssim_{c_1,c_2} (m\log m)^{-1}$ for all $0 \le p \le 3c_1 \log m / m$. Similarly, another polynomial (note that $Q_0(0) = 0$)

$$Q(q) = -\frac{1}{q} \left[ \frac{3c_1 \log n}{n} Q_0\left( \frac{nq}{3c_1 \log n} \right) + q \log\left( \frac{3c_1 \log n}{n} \right) \right]$$
satisfies that $|Q(q) + \log q| \lesssim_{c_1,c_2} (nq\log n)^{-1}$ for every $0 \le q \le 3c_1 \log n / n$. Hence, the bivariate polynomial $pQ(q)$ satisfies that for every $(p, q) \in \mathcal{S}$, it holds that

$$|pQ(q) + p\log q| \lesssim_{c_1,c_2} \frac{p}{nq\log n} \lesssim_{c_1,c_2} \frac{U}{n\log n}.$$
Hence, following the same lines of analysis of Section 3.1, the coefficient bound of $P_n(x)$ in Lemma 8 gives the following deterministic inequalities:

$$|\hat D^{(1)}_{\mathrm{KL},\mathrm{ns}} - D^{(1)}_{\mathrm{KL},\mathrm{ns}}| \lesssim_{c_1,c_2} \frac{kU}{n\log n} + n^{O(c_2)} \cdot L^{(1)}(\tilde\mu^{(1)}, \hat M^{(1)}),$$
$$|\hat D^{(2)}_{\mathrm{KL},\mathrm{ns}} - D^{(2)}_{\mathrm{KL},\mathrm{ns}}| \lesssim_{c_1,c_2} \frac{k}{m\log m} + m^{O(c_2)-1} \cdot L^{(2)}(\tilde\mu^{(2)}, \hat M^{(2)}). \tag{26}$$

B.2 Bias-variance analysis in the non-smooth regime

We follow the same lines of Section 3.2 to upper bound $\mathbb{E}[L^{(1)}(\tilde\mu^{(1)}, \hat M^{(1)})]$ and $\mathbb{E}[L^{(2)}(\tilde\mu^{(2)}, \hat M^{(2)})]$, or in other words, derive the estimation performance of the (smoothed) moment estimators $\hat M^{(1)}_d$ and $\hat M^{(2)}_d$. Specifically, let

$$M^\star_{1,d} = \sum_{i=1}^{k} p_i q_i^d\, \mathbb{P}\left( \mathsf{B}\left(\frac{n}{2}, q_i\right) \le c_1 \log n \right) \cdot \mathbb{1}\left( q_i \le \frac{3c_1 \log n}{n} \right),$$
$$M^\star_{2,d} = \sum_{i=1}^{k} p_i^d\, \mathbb{P}\left( \mathsf{B}\left(\frac{m}{2}, p_i\right) \le c_1 \log m \right) \cdot \mathbb{1}\left( p_i \le \frac{3c_1 \log m}{m} \right)$$
be the true moments of $\tilde\mu^{(1)}$ and $\tilde\mu^{(2)}$ respectively, and we will show that the moment estimators are almost unbiased in estimating the above quantities. Moreover, these estimators enjoy small variance by a perturbation argument in the Efron–Stein–Steele inequality. The following lemma summarizes the bias-variance analysis of the moment estimators.

Lemma 9. Let $c_2 \log n \ge 1$, and constant $c_1 > 0$ be large enough. Then for $0 \le d \le \lceil c_2 \log n \rceil$ and $(P, Q) \in \mathcal{M}_k(U)$, it holds that
$$|\mathbb{E}[\hat M^{(1)}_d] - M^\star_{1,d}| \lesssim_{c_1,c_2} \frac{1}{n^5} \left( \frac{3c_1 \log n}{n} \right)^d,$$
$$\sqrt{\mathrm{Var}(\hat M^{(1)}_d)} \lesssim_{c_1,c_2} n^{O(c_2)} \left( \frac{\sqrt{k}\,U}{n} + \frac{\sqrt{k}}{m} \right) \left( \frac{3c_1 \log n}{n} \right)^d.$$


Similarly, for $c_2 \log m \ge 1$ and $0 \le d \le \lceil c_2 \log m \rceil$, it holds that
$$|\mathbb{E}[\hat M^{(2)}_d] - M^\star_{2,d}| \lesssim_{c_1,c_2} \frac{1}{m^5} \left( \frac{3c_1 \log m}{m} \right)^d,$$
$$\sqrt{\mathrm{Var}(\hat M^{(2)}_d)} \lesssim_{c_1,c_2} \sqrt{k}\, m^{O(c_2)} \cdot \left( \frac{3c_1 \log m}{m} \right)^d.$$

Based on Lemma 9 and the definition of the objective function in the linear program, it holds that
$$\begin{aligned}
\mathbb{E}[L^{(1)}(\tilde\mu^{(1)}, \hat M^{(1)})] &= \sum_{d=0}^{\lceil c_2\log n\rceil} \left( \frac{3c_1 \log n}{n} \right)^{-d} \cdot \mathbb{E}|\hat M^{(1)}_d - M^\star_{1,d}| \\
&\le \sum_{d=0}^{\lceil c_2\log n\rceil} \left( \frac{3c_1 \log n}{n} \right)^{-d} \cdot \left( |\mathbb{E}\hat M^{(1)}_d - M^\star_{1,d}| + \sqrt{\mathrm{Var}(\hat M^{(1)}_d)} \right) \\
&\lesssim_{c_1,c_2} \sum_{d=0}^{\lceil c_2\log n\rceil} \left( \frac{1}{n^5} + n^{O(c_2)} \left( \frac{\sqrt{k}\,U}{n} + \frac{\sqrt{k}}{m} \right) \right) \\
&\lesssim_{c_1,c_2} n^{O(c_2)} \left( \frac{\sqrt{k}}{m} + \frac{\sqrt{k}\,U}{n} \right).
\end{aligned}$$

Similarly, we also have

$$\mathbb{E}[L^{(2)}(\tilde\mu^{(2)}, \hat M^{(2)})] \lesssim_{c_1,c_2} m^{O(c_2)} \cdot \sqrt{k}.$$

Consequently, plugging the previous inequalities into the deterministic inequality (26), we conclude that
$$\mathbb{E}|\hat D^{(1)}_{\mathrm{KL},\mathrm{ns}} - D^{(1)}_{\mathrm{KL},\mathrm{ns}}| + \mathbb{E}|\hat D^{(2)}_{\mathrm{KL},\mathrm{ns}} - D^{(2)}_{\mathrm{KL},\mathrm{ns}}| \lesssim_{c_1,c_2} \frac{kU}{n\log n} + \frac{k}{m\log m} + (mn)^{O(c_2)} \left( \frac{\sqrt{k}}{m} + \frac{\sqrt{k}\,U}{n} \right). \tag{27}$$

B.3 Bias-variance analysis in the smooth regime

Using the usual bias-variance decomposition, we have

$$\mathbb{E}|\hat D^{(1)}_{\mathrm{KL},\mathrm{s}} + \hat D^{(2)}_{\mathrm{KL},\mathrm{s}} - (D^{(1)}_{\mathrm{KL},\mathrm{s}} + D^{(2)}_{\mathrm{KL},\mathrm{s}})| \le |\mathbb{E}[\hat D^{(1)}_{\mathrm{KL},\mathrm{s}}] - D^{(1)}_{\mathrm{KL},\mathrm{s}}| + |\mathbb{E}[\hat D^{(2)}_{\mathrm{KL},\mathrm{s}}] - D^{(2)}_{\mathrm{KL},\mathrm{s}}| + \sqrt{\mathrm{Var}(\hat D^{(1)}_{\mathrm{KL},\mathrm{s}} + \hat D^{(2)}_{\mathrm{KL},\mathrm{s}})}. \tag{28}$$
Note that in (28) we did not apply the triangle inequality to the variance part, as taking the sum inside the variance will lead to a significant reduction on the total variance¹. To deal with the biases in (28), the first step is to analyze the bias-corrected plug-in estimators $h^{(1)}$ and $h^{(2)}$ in the estimator construction.

¹Specifically, the variance drops from $O_{c_1}(\log(kU)/\sqrt{m} + \sqrt{U/n})$ to $O_{c_1}(\log U/\sqrt{m} + \sqrt{U/n})$.


Lemma 10. Let $X \sim \mathsf{B}(n/2, q)$ with $q \ge c_1 \log n / n$, for a constant $c_1 > 0$ large enough. Then
$$|\mathbb{E}[h^{(1)}(X)] + \log q| \lesssim_{c_1} \frac{1}{nq\log n}.$$
Similarly, if $Y \sim \mathsf{B}(m/2, p)$ with $p \ge c_1 \log m / m$, it holds that
$$|\mathbb{E}[h^{(2)}(Y)] - p\log p| \lesssim_{c_1} \frac{1}{m\log m}.$$

By (16) and the triangle inequality, we have

$$|\mathbb{E}[\hat D^{(1)}_{\mathrm{KL},\mathrm{s}}] - D^{(1)}_{\mathrm{KL},\mathrm{s}}| \le \sum_{i=1}^{k} \mathbb{P}\left( \mathsf{B}\left(\frac{n}{2}, q_i\right) > c_1 \log n \right) \cdot p_i \left| \mathbb{E}[h^{(1)}(X_i)] + \log q_i \right|.$$

For each $i \in [k]$, if $q_i \le c_1 \log n / n$, the above Binomial probability is at most $n^{-5}$ by Lemma 4. Since $\|h^{(1)}\|_\infty \lesssim \log n$ and $p_i \le Uq_i$, the total contribution of symbols $i \in [k]$ with $q_i \le c_1 \log n / n$ in the above sum is at most $O(kU/n^4)$. For symbols $i \in [k]$ with a large probability $q_i > c_1 \log n / n$, we use Lemma 10 to conclude that

$$|\mathbb{E}[\hat D^{(1)}_{\mathrm{KL},\mathrm{s}}] - D^{(1)}_{\mathrm{KL},\mathrm{s}}| \lesssim_{c_1} \frac{kU}{n^4} + \sum_{i=1}^{k} \frac{p_i}{nq_i\log n} \lesssim_{c_1} \frac{kU}{n\log n}. \tag{29}$$

Similarly, the following upper bound holds for the other bias:

$$|\mathbb{E}[\hat D^{(2)}_{\mathrm{KL},\mathrm{s}}] - D^{(2)}_{\mathrm{KL},\mathrm{s}}| \lesssim_{c_1} \frac{k}{m\log m}. \tag{30}$$

Next we deal with the variance $\mathrm{Var}(\hat D^{(1)}_{\mathrm{KL},\mathrm{s}} + \hat D^{(2)}_{\mathrm{KL},\mathrm{s}})$ based on Lemma 7, and the upper bound is summarized in the following lemma.

Lemma 11. Let c1 > 0 be a constant large enough and ε > 0 be any constant, then

$$\sqrt{\mathrm{Var}(\hat D^{(1)}_{\mathrm{KL},\mathrm{s}} + \hat D^{(2)}_{\mathrm{KL},\mathrm{s}})} \lesssim_{c_1,\varepsilon} \frac{\log U}{\sqrt{m}} + \sqrt{\frac{U}{n}} + (mn)^{\varepsilon} \left( \frac{\sqrt{k}\,U}{n} + \frac{\sqrt{k}}{m} \right).$$

Hence, by (28), (29), (30), and Lemma 11, we conclude that

$$\mathbb{E}|\hat D^{(1)}_{\mathrm{KL},\mathrm{s}} + \hat D^{(2)}_{\mathrm{KL},\mathrm{s}} - (D^{(1)}_{\mathrm{KL},\mathrm{s}} + D^{(2)}_{\mathrm{KL},\mathrm{s}})| \lesssim_{c_1,\varepsilon} \frac{k}{m\log m} + \frac{kU}{n\log n} + \frac{\log U}{\sqrt{m}} + \sqrt{\frac{U}{n}} + (mn)^{\varepsilon} \left( \frac{\sqrt{k}\,U}{n} + \frac{\sqrt{k}}{m} \right). \tag{31}$$

Therefore, the desired Theorem 4 is a direct consequence of (25), (27), and (31).


C Minimax Lower Bounds

There are two main lemmas that we employ towards the proof of the minimax lower bounds. The first is Le Cam's two-point method, which helps to prove the minimax lower bound corresponding to the variance, or equivalently, the classical asymptotics. Suppose we observe a random vector $Z \in \mathcal{Z}$ with σ-algebra $\mathcal{A}$ and distribution $P_\theta$ where $\theta \in \Theta$. Let $\theta_0$ and $\theta_1$ be two elements of $\Theta$. Let $\hat T = \hat T(Z)$ be an arbitrary estimator of a function $T(\theta)$ based on $Z$. Le Cam's two-point method gives the following general minimax lower bound.

Lemma 12 (Section 2.4.2 of [Tsy09]). The following inequality holds:

$$\inf_{\hat T} \sup_{\theta\in\Theta} \mathbb{P}_\theta\left( |\hat T - T(\theta)| \ge \frac{|T(\theta_1) - T(\theta_0)|}{2} \right) \ge \frac{1}{4} \exp\left( -D_{\mathrm{KL}}\left( P_{\theta_1} \| P_{\theta_0} \right) \right).$$

The second lemma is the generalized two-point method or method of two fuzzy hypotheses. Let $\sigma_0$ and $\sigma_1$ be two prior distributions supported on $\Theta$. Write $F_i$ for the marginal distribution of $Z$ when the prior is $\sigma_i$ for $i = 0, 1$, and let $\hat T = \hat T(Z)$ be an arbitrary estimator of a function $T(\theta)$ based on $Z$. We have the following general minimax lower bound.

Lemma 13 (Theorem 2.15 of [Tsy09]). Suppose there exist $\zeta \in \mathbb{R}$, $s > 0$, $0 \le \beta_0, \beta_1 < 1$ such that
$$\sigma_0(\theta : T(\theta) \le \zeta - s) \ge 1 - \beta_0, \qquad \sigma_1(\theta : T(\theta) \ge \zeta + s) \ge 1 - \beta_1.$$

If TV(F1, F0) ≤ η < 1, then

$$\inf_{\hat T} \sup_{\theta\in\Theta} \mathbb{P}_\theta\left( |\hat T - T(\theta)| \ge s \right) \ge \frac{1 - \eta - \beta_0 - \beta_1}{2},$$
where $\mathrm{TV}(P, Q) = \int |dP - dQ|/2$ is the total variation distance between two probability measures $P$ and $Q$.

C.1 Lower Bounds for α-Divergences

In this section, we will make use of Lemma 12 and Lemma 13 to prove the following minimax lower bound in Theorem 1.

Theorem 5. Let α ≥ 2 be an integer. Then for any $m \gtrsim_\alpha U^{2(\alpha-1)}$, $n \gtrsim_\alpha kU^\alpha/\log k + U^{2\alpha-1}$, $U \gtrsim_\alpha (\log k)^2$, and $\log k \gtrsim_\alpha \log n$,
$$\inf_{\hat D} \sup_{(P,Q)\in\mathcal{M}_k(U)} \mathbb{E}_{(P,Q)}|\hat D - D_\alpha(P\|Q)| \gtrsim_\alpha \frac{kU^\alpha}{n\log n} + \frac{U^{\alpha-1}}{\sqrt{m}} + \frac{U^{\alpha-1/2}}{\sqrt{n}}.$$

Clearly Theorem 5 implies the lower bound of Theorem 1. We will use Lemma 12 to prove the last two terms, and apply Lemma 13 for the first term.


First consider the following two-point construction. Fix ε ∈ (0, 1/2) to be chosen later, let

$$P_1 = \left( \frac{1-\varepsilon}{2(k-1)}, \cdots, \frac{1-\varepsilon}{2(k-1)}, \frac{1+\varepsilon}{2} \right), \qquad
P_2 = \left( \frac{1+\varepsilon}{2(k-1)}, \cdots, \frac{1+\varepsilon}{2(k-1)}, \frac{1-\varepsilon}{2} \right),$$
$$Q = \left( \frac{1}{U(k-1)}, \cdots, \frac{1}{U(k-1)}, 1 - \frac{1}{U} \right).$$
Clearly, for $U \ge 3$ we have $(P_1, Q), (P_2, Q) \in \mathcal{M}_k(U)$. Under the above construction, the functional value difference between these two points is

$$|D_\alpha(P_1\|Q) - D_\alpha(P_2\|Q)| = \frac{U^{\alpha-1}}{\alpha(\alpha-1)2^\alpha} \left( 1 - \frac{1}{(U-1)^{\alpha-1}} \right) \left( (1+\varepsilon)^\alpha - (1-\varepsilon)^\alpha \right) = \Omega_\alpha\left( U^{\alpha-1}\varepsilon \right).$$

Moreover, the KL divergence between the observations is

$$D_{\mathrm{KL}}(P_1^{\otimes m} \| P_2^{\otimes m}) = m D_{\mathrm{KL}}(P_1\|P_2) = m\left( \frac{1-\varepsilon}{2}\log\frac{1-\varepsilon}{1+\varepsilon} + \frac{1+\varepsilon}{2}\log\frac{1+\varepsilon}{1-\varepsilon} \right) = O(m\varepsilon^2).$$

Consequently, choosing ε = 1/(2√m) in Lemma 12 gives

$$\inf_{\hat D} \sup_{(P,Q)\in\mathcal{M}_k(U)} \mathbb{E}_{(P,Q)}|\hat D - D_\alpha(P\|Q)| \gtrsim_\alpha \frac{U^{\alpha-1}}{\sqrt{m}}. \tag{32}$$
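
The two-point construction above is easy to inspect numerically; the sketch below (assuming NumPy, with arbitrary placeholder values of $k$, $U$, $\alpha$, $m$) computes the functional gap and $m\,D_{\mathrm{KL}}(P_1\|P_2)$ for $\varepsilon = 1/(2\sqrt{m})$, illustrating that the gap is of order $U^{\alpha-1}/\sqrt{m}$ while the two hypotheses remain statistically indistinguishable.

```python
import numpy as np

def alpha_div(p, q, alpha):
    return (np.sum(p ** alpha / q ** (alpha - 1)) - 1.0) / (alpha * (alpha - 1))

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

k, U, alpha, m = 1000, 10.0, 2, 10_000
eps = 1.0 / (2.0 * np.sqrt(m))

P1 = np.r_[np.full(k - 1, (1 - eps) / (2 * (k - 1))), (1 + eps) / 2]
P2 = np.r_[np.full(k - 1, (1 + eps) / (2 * (k - 1))), (1 - eps) / 2]
Q  = np.r_[np.full(k - 1, 1.0 / (U * (k - 1))), 1.0 - 1.0 / U]

gap = abs(alpha_div(P1, Q, alpha) - alpha_div(P2, Q, alpha))
print(gap, U ** (alpha - 1) / np.sqrt(m))   # functional gap vs. the claimed order
print(m * kl(P1, P2))                       # stays bounded, so the points are indistinguishable
```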

Next consider another two-point construction. Fix ε ∈ (0, 1/2) to be chosen later, let

$$P = \left( \frac{1}{2(k-1)}, \cdots, \frac{1}{2(k-1)}, \frac{1}{2} \right),$$
$$Q_1 = \left( \frac{1-\varepsilon}{4(k-1)U}, \cdots, \frac{1-\varepsilon}{4(k-1)U}, 1 - \frac{1-\varepsilon}{4U} \right), \qquad
Q_2 = \left( \frac{1+\varepsilon}{4(k-1)U}, \cdots, \frac{1+\varepsilon}{4(k-1)U}, 1 - \frac{1+\varepsilon}{4U} \right).$$

Clearly we have (P,Q1), (P,Q2) ∈ Mk(U). Moreover, the functional value difference is

$$|D_\alpha(P\|Q_1) - D_\alpha(P\|Q_2)| = \frac{1}{\alpha(\alpha-1)} \left| \frac{(2U)^{\alpha-1}}{2} \left( \frac{1}{(1-\varepsilon)^{\alpha-1}} - \frac{1}{(1+\varepsilon)^{\alpha-1}} \right) + \frac{1}{2^\alpha} \left( \frac{1}{\left(1 - \frac{1-\varepsilon}{4U}\right)^{\alpha-1}} - \frac{1}{\left(1 - \frac{1+\varepsilon}{4U}\right)^{\alpha-1}} \right) \right| = \Omega_\alpha(U^{\alpha-1}\varepsilon),$$

and the KL divergence between observations is

$$D_{\mathrm{KL}}(Q_1^{\otimes n}\|Q_2^{\otimes n}) = n\,D_{\mathrm{KL}}(Q_1\|Q_2) = n\left(\frac{1-\varepsilon}{4U}\log\frac{1-\varepsilon}{1+\varepsilon} + \left(1 - \frac{1-\varepsilon}{4U}\right)\log\frac{4U-1+\varepsilon}{4U-1-\varepsilon}\right) = O\left(\frac{n\varepsilon^2}{U}\right).$$


Hence, choosing $\varepsilon = \sqrt{U/(4n)}$ in this two-point construction gives
$$\inf_{\hat D}\sup_{(P,Q)\in\mathcal M_k(U)} \mathbb E_{(P,Q)}|\hat D - D_\alpha(P\|Q)| \gtrsim_\alpha \frac{U^{\alpha-1/2}}{\sqrt n}. \tag{33}$$

Finally we construct the fuzzy hypotheses σ0 and σ1 in Lemma 13. To this end, we first recall the following duality result between best polynomial approximation and moment matching.

Lemma 14 (Appendix E of [WY16]). Given a compact interval I = [a, b] with a > 0, an integer L > 0, and a continuous function φ on I, let
$$E_L(\phi; I) \triangleq \inf_{\{a_i\}}\sup_{x\in I}\left|\sum_{i=0}^{L} a_i x^i - \phi(x)\right|$$
denote the best uniform approximation error of φ by degree-L polynomials. Then
$$2E_L(\phi; I) = \max\left\{\int \phi(t)\,\nu_1(dt) - \int \phi(t)\,\nu_0(dt) \;:\; \int t^{l}\,\nu_1(dt) = \int t^{l}\,\nu_0(dt),\ l = 0,\ldots,L\right\},$$
where the maximum is taken over pairs of probability measures ν0 and ν1 supported on I.
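For readers who want to see the moment-matching side of Lemma 14 concretely, the sketch below solves a discretized version of the maximization as a linear program (illustration only: the interval, the grid, the degree L, and the use of raw monomial moments are ad-hoc choices, and the monomial constraints become numerically ill-conditioned for large L).

```python
import numpy as np
from scipy.optimize import linprog

def moment_matching_gap(phi, a, b, L, grid_size=400):
    """Maximize ∫phi dν1 − ∫phi dν0 over probability measures on a grid in [a, b]
    whose first L moments coincide; by Lemma 14 the optimum approximates 2 E_L(phi; I)."""
    t = np.linspace(a, b, grid_size)
    f = phi(t)
    # Decision variables: weights of ν1 (first block) and ν0 (second block), all ≥ 0.
    c = np.concatenate([-f, f])             # minimize −(∫phi dν1 − ∫phi dν0)
    A_eq, b_eq = [], []
    for l in range(L + 1):                   # moment matching, l = 0 equates total masses
        A_eq.append(np.concatenate([t ** l, -(t ** l)]))
        b_eq.append(0.0)
    A_eq.append(np.concatenate([np.ones_like(t), np.zeros_like(t)]))  # ν1 is a probability measure
    b_eq.append(1.0)
    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * (2 * grid_size), method="highs")
    return -res.fun

# Example: phi(x) = x^{1-alpha} with alpha = 2 on I = [0.05, 1].
gap = moment_matching_gap(lambda x: 1.0 / x, a=0.05, b=1.0, L=6)
print("approximate 2*E_L(x^{-1}; [0.05, 1]) with L = 6:", gap)
```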

To apply Lemma 14, we choose

$$I = \left[\frac{d_0}{n\log n}, \frac{d_1\log n}{n}\right], \qquad L = \lceil d_2\log n\rceil,$$

with constants d0, d1, d2 > 0 to be specified later. By the proof of the lower bound in Lemma 5 (cf. (66)) and proper scaling, for d0 > 0 small enough (depending only on α, d1, d2) it holds that
$$E_L(x^{1-\alpha}; I) \gtrsim_\alpha (n\log n)^{\alpha-1}.$$
Hence, by Lemma 14, there exist probability measures ν0, ν1 supported on I such that they have matching first L moments and $\Delta \triangleq \int x^{1-\alpha}(\nu_1(dx) - \nu_0(dx)) \gtrsim_\alpha (n\log n)^{\alpha-1}$.

Now consider the following priors σ0 and σ1 on (P,Q). First, under both priors, the probability vector P is always
$$P = \left(\frac{cU}{n\log n}, \cdots, \frac{cU}{n\log n}, 1 - \frac{c(k-1)U}{n\log n}\right),$$

where c > 0 is a small constant. Since n ≳ kU^α/log k, we have cUk/(n log n) < 1 for constant c > 0 small enough, and P is a probability vector. As for the prior distributions on the probability vector Q = (q1, · · · , qk), the prior µi with i ∈ {0, 1} assigns an i.i.d. prior νi to the first (k − 1) entries q1, · · · , qk−1, and qk = 1 − (k − 1)E_{X∼ν0}[X] = 1 − (k − 1)E_{X∼ν1}[X] is a common deterministic scalar. Again, as X ∼ νi is at most d1 log n/n, the assumptions U ≳ (log k)² and n ≳ kU^α/log k ensure that qk > 0 for constant d1 > 0 small enough. However, the above product measures µ0, µ1


may not make Q a probability vector (i.e., sum to one), and thus we consider the following set of approximate probability vectors:
$$\mathcal M_k(U, \varepsilon) = \left\{(P,Q): P \in \mathcal M_k,\ Q \in \mathbb R_+^k,\ \left|\sum_{i=1}^k q_i - 1\right| \le \varepsilon,\ p_i \le Uq_i\ \forall i \in [k]\right\}. \tag{34}$$

Finally, we define the priors σi, i ∈ {0, 1} to be used in Lemma 13 to be the pushforward measure of the restriction of µi, i ∈ {0, 1} to the following set:
$$E_i = \mathcal M_k\left(U, \frac{k}{n\log n}\right) \cap \left\{(P,Q): |D_\alpha(P\|Q) - \mathbb E_{(P,Q)\sim\mu_i}[D_\alpha(P\|Q)]| \le \frac{(k-1)\Delta}{4}\left(\frac{cU}{n\log n}\right)^{\alpha}\right\}. \tag{35}$$

To arrive at the desired minimax lower bound based on the above construction, we define the following quantities. Let R⋆(m,n,U) denote the minimax L1 risk in estimating Dα(P‖Q) over (P,Q) ∈ M_k(U), and let R⋆_P(m,n,U,ε) be the counterpart over (P,Q) ∈ M_k(U,ε) under the Poissonized model, where one observes the mutually independent histogram h_i ∼ Poi(nq_i), i ∈ [k], for the distribution Q (we do not change the sampling scheme for P). The following lemma shows that lower bounds on R⋆_P(m,n,U,ε) in the Poissonized model translate to lower bounds on R⋆(m,n,U) under the original sampling model.

Lemma 15. For ε ∈ (0, 1), the following inequality holds:
$$R^\star_{\mathrm P}(m,n,U,\varepsilon) \lesssim_\alpha \left(\varepsilon + e^{-n/8}\right)U^{\alpha-1} + R^\star(m, n/2, (1+\varepsilon)U).$$

The proof of Lemma 15 is postponed to Appendix E. By Lemma 15, it is clear that for the choice of ε = k/(n log n) in (35), the target minimax lower bound
$$R^\star(m,n,U) = \inf_{\hat D}\sup_{(P,Q)\in\mathcal M_k(U)} \mathbb E_{(P,Q)}|\hat D - D_\alpha(P\|Q)| \gtrsim_\alpha \frac{kU^{\alpha}}{n\log n} \tag{36}$$

follows from $R^\star_{\mathrm P}(m,n,U,\varepsilon) \gtrsim kU^{\alpha}/(n\log n)$. To this end, we apply Lemma 13 to the priors σ0, σ1 and parameters θ = (P,Q), T(θ) = Dα(P‖Q), ζ = (E_{(P,Q)∼µ0}[Dα(P‖Q)] + E_{(P,Q)∼µ1}[Dα(P‖Q)])/2, s = (k − 1)∆/4 · (cU/(n log n))^α, and Θ = M_k(U, ε). By the construction of the measures νi and µi, we have
$$\mathbb E_{(P,Q)\sim\mu_1}[D_\alpha(P\|Q)] - \mathbb E_{(P,Q)\sim\mu_0}[D_\alpha(P\|Q)] = \left(\frac{cU}{n\log n}\right)^{\alpha}\cdot(k-1)\int x^{1-\alpha}(\nu_1(dx) - \nu_0(dx)) = 4s,$$
and consequently (35) implies that Dα(P‖Q) ≤ ζ − s almost surely for (P,Q) ∼ µ0, and that Dα(P‖Q) ≥ ζ + s almost surely for (P,Q) ∼ µ1. In other words, β0 = β1 = 0 in Lemma 13. Hence, it remains to upper bound the total variation distance TV(σ0, σ1).

First we show that µi(E_i^c) is small for i ∈ {0, 1}. In fact, choosing a constant c > 0 small enough in the probability vector P and using the support of νi shows that pi/qi ≤ U always holds in the support of µ0, µ1. Hence, for i ∈ {0, 1},

$$\mu_i\left(\mathcal M_k(U,\varepsilon)^c\right) = \mu_i\left(\left|\sum_{i=1}^k q_i - 1\right| > \varepsilon\right) = \nu_i^{\otimes(k-1)}\left(\left|\sum_{i=1}^{k-1}(q_i - \mathbb E_{\nu_i}[q_i])\right| > \varepsilon\right) \le \frac{(k-1)\mathrm{Var}_{\nu_i}(q)}{\varepsilon^2} \le \frac{(k-1)(n\log n)^2}{k^2}\cdot\left(\frac{d_1\log n}{n}\right)^2 \to 0, \tag{37}$$

where we have used Chebyshev's inequality and the assumption log k ≳_α log n. Similarly, using the support of νi and applying Chebyshev's inequality again gives

$$\mu_i\left(\left\{(P,Q): |D_\alpha(P\|Q) - \mathbb E_{(P,Q)\sim\mu_i}[D_\alpha(P\|Q)]| > \frac{(k-1)\Delta}{4}\left(\frac{cU}{n\log n}\right)^{\alpha}\right\}\right) \le \left[\frac{(k-1)\Delta}{4}\left(\frac{cU}{n\log n}\right)^{\alpha}\right]^{-2}\cdot(k-1)\,\mathrm{Var}_{q\sim\nu_i}\left(\left(\frac{cU}{n\log n}\right)^{\alpha} q^{1-\alpha}\right) \lesssim_\alpha \frac{(n\log n)^2}{(k-1)U^{2\alpha}}\cdot\left(\frac{U}{n\log n}\right)^{2\alpha}\cdot\left(\frac{1}{n\log n}\right)^{2(1-\alpha)} \to 0. \tag{38}$$

Consequently, by (37) and (38), a union bound gives µi(E_i^c) → 0 as n → ∞. Furthermore, after choosing d2 > 0 large enough and d1 > 0 small enough, [WY16, Lemma 3] gives
$$\mathrm{TV}\left(\mu_0\circ(P^{\otimes m}, Q^{\otimes n})^{-1},\ \mu_1\circ(P^{\otimes m}, Q^{\otimes n})^{-1}\right) \le (k-1)\,\mathrm{TV}\left(\mathbb E_{X\sim\nu_0}[\mathrm{Poi}(X)],\ \mathbb E_{X\sim\nu_1}[\mathrm{Poi}(X)]\right) \to 0.$$

Letting µ′_i = µ_i ∘ (P^{⊗m}, Q^{⊗n})^{−1} for i ∈ {0, 1}, the triangle inequality gives
$$\mathrm{TV}(\sigma_0, \sigma_1) \le \mathrm{TV}(\sigma_0, \mu'_0) + \mathrm{TV}(\sigma_1, \mu'_1) + \mathrm{TV}(\mu'_0, \mu'_1) \le \mu_0(E_0^c) + \mu_1(E_1^c) + \mathrm{TV}(\mu'_0, \mu'_1) \to 0,$$

and therefore one may take η = 1/2 in Lemma 13 for n large enough. Hence, as s ≳_α kU^α/(n log n), Lemmas 13 and 15 lead to the desired inequality (36).

Finally, a combination of (32), (33) and (36) completes the proof of Theorem 5.

C.2 Lower Bounds for KL Divergence

The proof of the minimax lower bounds for the KL divergence is similar to that of the α-divergences. In this section we prove the following theorem.

Theorem 6. If m ≳ k/log k, n ≳ kU/log k, U ≳ (log k)², and log k ≳ log(m + n), we have
$$\inf_{\hat D}\sup_{(P,Q)\in\mathcal M_k(U)} \mathbb E_{(P,Q)}|\hat D - D_{\mathrm{KL}}(P\|Q)| \gtrsim \frac{k}{m\log m} + \frac{kU}{n\log n} + \frac{\log U}{\sqrt m} + \sqrt{\frac{U}{n}}.$$

Clearly Theorem 6 implies the lower bound of Theorem 2. Similarly, we will apply Lemma 12 for the last two terms of Theorem 6, and Lemma 13 for the first two terms.


Consider the same two-point constructions in Appendix C.1. In the first construction,

$$|D_{\mathrm{KL}}(P_1\|Q) - D_{\mathrm{KL}}(P_2\|Q)| = \left|\varepsilon\log\frac{U}{1-U^{-1}} + (1+\varepsilon)\log\frac{1+\varepsilon}{2} - (1-\varepsilon)\log\frac{1-\varepsilon}{2}\right| = \Omega(\varepsilon\log U),$$

and Appendix C.1 shows that $D_{\mathrm{KL}}(P_1^{\otimes m}\|P_2^{\otimes m}) = O(m\varepsilon^2)$. Hence, by choosing $\varepsilon = 1/(2\sqrt m)$, we arrive at
$$\inf_{\hat D}\sup_{(P,Q)\in\mathcal M_k(U)} \mathbb E_{(P,Q)}|\hat D - D_{\mathrm{KL}}(P\|Q)| \gtrsim \frac{\log U}{\sqrt m}. \tag{39}$$

In the second construction, we have
$$|D_{\mathrm{KL}}(P\|Q_1) - D_{\mathrm{KL}}(P\|Q_2)| = \left|\frac{1}{2}\log\frac{1+\varepsilon}{1-\varepsilon} - \frac{1}{2}\log\frac{4U-1+\varepsilon}{4U-1-\varepsilon}\right| = \Omega(\varepsilon),$$

and Appendix C.1 shows that $D_{\mathrm{KL}}(Q_1^{\otimes n}\|Q_2^{\otimes n}) = O(n\varepsilon^2/U)$. Hence, choosing $\varepsilon = \sqrt{U/(4n)}$, we have
$$\inf_{\hat D}\sup_{(P,Q)\in\mathcal M_k(U)} \mathbb E_{(P,Q)}|\hat D - D_{\mathrm{KL}}(P\|Q)| \gtrsim \sqrt{\frac{U}{n}}. \tag{40}$$

Next we construct the fuzzy hypotheses for the first two terms of Theorem 6. The first term Ω(k/(m log m)) essentially follows from the minimax lower bound for entropy estimation: [WY16] constructed two priors on P such that pi ∈ [Ω(1/(n log n)), O(log n/n)] for all i ∈ [k] under both priors. Hence, setting identical qi = Θ(1/(n log n)) deterministically, the assumption U ≳ (log n)² implies that the additional likelihood ratio condition is satisfied almost surely under both priors. Furthermore, the estimation of the KL divergence reduces to the estimation of the Shannon entropy H(P) = ∑_{i=1}^k −pi log pi of P. Consequently, the lower bound in [WY16] shows that
$$\inf_{\hat D}\sup_{(P,Q)\in\mathcal M_k(U)} \mathbb E_{(P,Q)}|\hat D - D_{\mathrm{KL}}(P\|Q)| \gtrsim \frac{k}{m\log m}. \tag{41}$$

As for the second term, we use a similar construction to that in Section C.1. First recall the following lemma, which is a counterpart of Lemma 5 for α = 1.

Lemma 16 (Lemma 5 of [WY16]). There exist absolute constants c, c′ > 0 such that for n ∈ ℕ,
$$\inf_{a_0,\cdots,a_n}\sup_{x\in[c/n^2,\,1]}\left|\log x - \sum_{d=0}^{n} a_d x^d\right| \ge c'.$$

By Lemma 14, Lemma 16 and a proper scaling, for the parameters
$$I = \left[\frac{d_0}{n\log n}, \frac{d_1\log n}{n}\right], \qquad L = \lceil d_2\log n\rceil,$$
with constants d0, d1, d2 > 0 to be specified later, there exist two probability measures ν0, ν1 supported on I with matching moments up to order L, and $\Delta \triangleq \int \log x\,(\nu_1(dx) - \nu_0(dx)) = \Omega(1)$. Next we choose the priors µ0, µ1 on (P,Q) as follows. Again, set P to be the deterministic vector

$$P = \left(\frac{cU}{n\log n}, \cdots, \frac{cU}{n\log n}, 1 - \frac{c(k-1)U}{n\log n}\right),$$


with c > 0 small enough to ensure that P is a valid probability vector. For Q = (q1, · · · , qk), the measure µi assigns νi independently to each coordinate q1, · · · , qk−1, and sets qk = 1 − (k − 1)E_{X∼ν0}[X] = 1 − (k − 1)E_{X∼ν1}[X] for the last coordinate. Similarly, we define the set of approximate probability vectors as in (34), and let the priors σi be the pushforward measure of the restriction of µi to the following set:

$$E_i = \mathcal M_k\left(U, \frac{k}{n\log n}\right) \cap \left\{(P,Q): |D_{\mathrm{KL}}(P\|Q) - \mathbb E_{(P,Q)\sim\mu_i}[D_{\mathrm{KL}}(P\|Q)]| \le \frac{(k-1)\Delta}{4}\left(\frac{cU}{n\log n}\right)\right\}. \tag{42}$$

Again, let R⋆(m,n,U) and R⋆_P(m,n,U,ε) be the target minimax risk over (P,Q) ∈ M_k(U) and the Poissonized minimax risk over (P,Q) ∈ M_k(U,ε), respectively. The relationship between these two quantities is summarized in the following lemma.

Lemma 17. For ε ∈ (0, 1), the following inequality holds:
$$R^\star_{\mathrm P}(m,n,U,\varepsilon) \lesssim \left(\varepsilon + e^{-n/8}\right)\log U + R^\star(m, n/2, (1+\varepsilon)U).$$

Consequently, choosing ε = k/(n log n) as in (42), Lemma 17 implies that, in order to establish the desired lower bound
$$\inf_{\hat D}\sup_{(P,Q)\in\mathcal M_k(U)} \mathbb E_{(P,Q)}|\hat D - D_{\mathrm{KL}}(P\|Q)| \gtrsim \frac{kU}{n\log n}, \tag{43}$$
it suffices to show that $R^\star_{\mathrm P}(m,n,U,\varepsilon) = \Omega(kU/(n\log n))$. To this end, arguments similar to those in Appendix C.1 lead to β0 = β1 = 0 in Lemma 13 with
$$\zeta = \frac{\mathbb E_{(P,Q)\sim\mu_0}[D_{\mathrm{KL}}(P\|Q)] + \mathbb E_{(P,Q)\sim\mu_1}[D_{\mathrm{KL}}(P\|Q)]}{2}, \qquad s = \frac{(k-1)\Delta}{4}\left(\frac{cU}{n\log n}\right) \gtrsim \frac{kU}{n\log n}.$$

Moreover, the same arguments give µi(M_k(U,ε)^c) → 0 for both i ∈ {0, 1}, and

$$\mu_i\left(\left\{(P,Q): |D_{\mathrm{KL}}(P\|Q) - \mathbb E_{(P,Q)\sim\mu_i}[D_{\mathrm{KL}}(P\|Q)]| > \frac{(k-1)\Delta}{4}\left(\frac{cU}{n\log n}\right)\right\}\right) \le \left[\frac{(k-1)\Delta}{4}\left(\frac{cU}{n\log n}\right)\right]^{-2}\cdot(k-1)\,\mathrm{Var}_{q\sim\nu_i}\left(\frac{cU}{n\log n}\log\frac{1}{q}\right) \lesssim \frac{(n\log n)^2}{(k-1)U^{2}}\cdot\left(\frac{U}{n\log n}\right)^{2}\cdot(\log n)^2 \to 0.$$

Hence, the union bound gives µi(E_i^c) → 0 for i ∈ {0, 1} as n → ∞, and the triangle inequality for the total variation distance gives TV(σ0, σ1) → 0, so that one may take η = 1/2 in Lemma 13 for n large enough. Therefore, (43) is a direct consequence of Lemma 13, and a combination of (39), (40), (41), and (43) completes the proof of Theorem 6.

D Proof of Main Lemmas

D.1 Proof of Lemma 1

For the bias bound, recall the quantity $M_{m,d}$ in (15) and define the following quantity:
$$M^{\star}_{\alpha,d} = \sum_{i=1}^{k}\prod_{\ell=0}^{\alpha-1}\frac{mp_i-\ell}{m-\ell}\cdot\left[\sum_{0\le s\le c_1\log n}\mathbb P\left(\mathrm{HG}\left(n, nq_i, \frac n2\right)=s\right)g_d(nq_i-s)\right].$$


As similar arguments of (16) lead to E[M⋆α,d] = Mα,d, a triangle inequality gives

|E[Mα,d]−M⋆α,d| ≤ |Mα,d −M⋆

α,d|+ |E[Mα,d − M⋆α,d]|. (44)

We upper bound the terms of (44) separately. First,

|Mα,d −M⋆α,d| =

k∑

i=1

pαi qdi · P

(B

(n2, qi

)≤ c1 log n

)1

(qi >

3c1 log n

n

)

(a)

≤k∑

i=1

Uαqα+di · exp

(−Ω

((nqi − 2c1 log n)

2

nqi∧ |nqi − 2c1 log n|

))1

(qi >

3c1 log n

n

)

(b)

.α,c1,c2

k∑

i=1

(3c1 log n

n

)α+d

· 1

n4α=

kUα

n4α

(3c1 log n

n

)α+d

, (45)

where (a) follows from the Chernoff bound, and (b) follows from the substitution qi = 3c1 log n/n·x with x ≥ 1 and supx≥1 x

α+dn−c1(x−1) = 1 as long as c1 log n ≥ α + d, which is satisfied forall 0 ≤ d ≤ D = ⌈c2 log n⌉ if we choose c1 much larger than c2. Second, for random variablesXi ∼ B(n/2, qi), we have

|E[Mα,d − M⋆α,d]| =

k∑

i=1

pαi · P(B

(n2, qi

)≤ c1 log n

)E[gd(Xi)− gd(Xi)]

≤k∑

i=1

pαi · P(B

(n2, qi

)≤ c1 log n

)E [gd(Xi)1(Xi > 2c1 log n)]

(c)

≤ Uα∑

i:qi≤3c1 logn/n

qαi · E[(2Xi/n)d1(Xi > 2c1 log n)]

+ Uα∑

i:qi>3c1 logn/n

qα+di · P

(B

(n2, qi

)≤ c1 log n

)

(d)

.α,c1,c2

kUα

n4α

(3c1 log n

n

)α+d

, (46)

where in (c) we have used that 0 ≤ gd(x) ≤ (2x/n)d for x ∈ N and E[gd(Xi)] = qdi , and (d) followsfrom the similar tail comparisons as in (45). Hence, the desired bias bound in Lemma 1 followsfrom (44), (45), and (46).

To deal with the variance Var(Mα,d), we shall need the following lemma.

Lemma 18. Let $(m\hat p_1, \cdots, m\hat p_k) \sim \mathrm{Multi}(m; p_1, \cdots, p_k)$ and $(n\hat q_1, \cdots, n\hat q_k) \sim \mathrm{Multi}(n; q_1, \cdots, q_k)$ be independent random vectors. Then for $S = \sum_{i=1}^k f_i(\hat p_i, \hat q_i)$ with any $f_i: [0,1]^2\to\mathbb R$, we have
$$\mathrm{Var}(S) \le 2m\cdot\sum_{i=1}^{k}\mathbb E\left[\hat p_i\left(f_i(\hat p_i, \hat q_i) - f_i\left(\hat p_i - \frac1m, \hat q_i\right)\right)^2\right] + 2n\cdot\sum_{i=1}^{k}\mathbb E\left[\hat q_i\left(f_i(\hat p_i, \hat q_i) - f_i\left(\hat p_i, \hat q_i - \frac1n\right)\right)^2\right],$$
with the convention that whenever $\hat p_i = 0$ or $\hat q_i = 0$, the respective product inside the expectation is zero.
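As an illustration of Lemma 18 (not used in the proof), the Monte Carlo sketch below compares the two sides of the inequality for a toy separable statistic of the empirical frequencies; the particular choice f(p̂, q̂) = p̂²q̂ and all parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
k, m, n, trials = 20, 500, 800, 20000
P = np.full(k, 1.0 / k)
Q = np.linspace(1, 2, k); Q /= Q.sum()

def f(p_hat, q_hat):
    return p_hat ** 2 * q_hat          # toy separable statistic f_i(p̂_i, q̂_i)

p_hat = rng.multinomial(m, P, size=trials) / m   # empirical frequencies, shape (trials, k)
q_hat = rng.multinomial(n, Q, size=trials) / n
S = f(p_hat, q_hat).sum(axis=1)

lhs = S.var()
rhs = (2 * m * (p_hat * (f(p_hat, q_hat) - f(p_hat - 1 / m, q_hat)) ** 2).mean(axis=0).sum()
       + 2 * n * (q_hat * (f(p_hat, q_hat) - f(p_hat, q_hat - 1 / n)) ** 2).mean(axis=0).sum())
print(f"Var(S) ≈ {lhs:.3e}  <=  Efron–Stein-type bound ≈ {rhs:.3e}")
```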

Lemma 18 relies on the Efron–Stein–Steele inequality (cf. Lemma 7) and its proof is postponed to Appendix E. To apply Lemma 18 for the variance, the function fi is
$$f_i(p_i, q_i) \equiv f(p_i, q_i) \triangleq u(p_i)v(q_i) \triangleq \prod_{\ell=0}^{\alpha-1}\frac{mp_i-\ell}{m-\ell}\cdot\sum_{0\le s\le c_1\log n}\mathbb P\left(\mathrm{HG}\left(n, nq_i, \frac n2\right)=s\right)g_d(nq_i-s).$$

The following lemma summarizes some properties of functions u and v.

Lemma 19. Let $m\hat p \sim \mathsf B(m, p)$ and $n\hat q \sim \mathsf B(n, q)$; then the following inequalities hold:
$$\mathbb E[u(\hat p)^2] \lesssim_\alpha p^{2\alpha},\qquad \mathbb E\left[\hat p\left(u(\hat p)-u\left(\hat p-\frac1m\right)\right)^2\right] \lesssim_\alpha \frac{p^{2\alpha-1}}{m^2},$$
$$\mathbb E[v(\hat q)^2] \lesssim \left(\frac{4c_1\log n}{n}\right)^{2d}\exp\left(-c(nq-3c_1\log n)_+\right),$$
$$\mathbb E\left[\hat q\left(v(\hat q)-v\left(\hat q-\frac1n\right)\right)^2\right] \lesssim \left(\frac{4c_1\log n}{n}\right)^{2d+1}\exp\left(-c(nq-3c_1\log n)_+\right),$$
where c > 0 is an absolute constant.

Based on Lemma 18 and Lemma 19, we conclude that

Var(Mα,d) . m

k∑

i=1

p2α−1i

m2

(4c1 log n

n

)2d

exp (−c(nqi − 3c1 log n)+)

+ n

k∑

i=1

p2αi ·(4c1 log n

n

)2d+1

exp (−c(nqi − 3c1 log n)+) .

Hence, with a large constant c1 > 0, the total contribution of symbols i ∈ [k] with qi > 4c1 log n/nto the above variance is at most kn−5α·(4c1 log n/n)2d, which is negligible compared to the claimedresult. Restricting to symbols with qi ≤ 4c1 log n/n, using pi ≤ Uqi and d ≤ c2 log n + 1 in theabove variance bound gives

Var(Mm,d) .α,c1 kU2α log n ·

(4c1 log n

n

)2(d+α)

.α,c1,c2 kU2α

(3c1 log n

n

)2(d+α)

· nO(c2),

as claimed.

D.2 Proof of Lemma 2

First, we recall from the Chernoff bound (cf. Lemma 4) that if X ∼ B(n/2, q) and q ≥ c1 log n/n, for constant c1 > 0 large enough we have
$$\mathbb P\left(\frac{nq}{4}\le X\le nq\right) \ge 1 - \frac{1}{n^{5\alpha}}. \tag{47}$$


Consequently, as $\|h_\alpha\|_\infty \lesssim_\alpha n^{\alpha-1}$, we have
$$\left|\mathbb E[h_\alpha(X)] - \mathbb E\left[h_\alpha(X)\mathbf 1\left(\frac{nq}{4}\le X\le nq\right)\right]\right| \lesssim \frac{1}{n^{4\alpha+1}}.$$

For nq/4 ≤ X ≤ nq, the Taylor expansion of $x^{1-\alpha}$ around x = q gives
$$\frac{1}{(2X/n)^{\alpha-1}} = \frac{1}{q^{\alpha-1}} - \frac{\alpha-1}{q^{\alpha}}\left(\frac{2X}{n}-q\right) + \frac{\alpha(\alpha-1)}{2q^{\alpha+1}}\left(\frac{2X}{n}-q\right)^2 - \frac{\alpha(\alpha-1)(\alpha-2)}{6q^{\alpha+2}}\left(\frac{2X}{n}-q\right)^3 + \frac{\alpha(\alpha-1)(\alpha-2)(\alpha-3)}{24\xi^{\alpha+3}}\left(\frac{2X}{n}-q\right)^4,$$
with ξ ∈ [q/2, 2q]. Using the central moments of the Binomial distribution
$$\mathbb E\left[\frac{2X}{n}-q\right] = 0,\qquad \mathbb E\left[\left(\frac{2X}{n}-q\right)^2\right] = \frac{2q(1-q)}{n},\qquad \mathbb E\left[\left(\frac{2X}{n}-q\right)^3\right] \lesssim \frac{q}{n^2},\qquad \mathbb E\left[\left(\frac{2X}{n}-q\right)^4\right] \lesssim \frac{q^2}{n^2}+\frac{q}{n^3}, \tag{48}$$

taking expectation at both sides of the Taylor expansion together with (47) gives
$$\left|\mathbb E\left[\left(\frac{1}{(2X/n)^{\alpha-1}} - \frac{1}{q^{\alpha-1}} - \frac{\alpha(\alpha-1)(1-q)}{nq^{\alpha}}\right)\mathbf 1\left(\frac{nq}{4}\le X\le nq\right)\right]\right| \lesssim_\alpha \frac{1}{q^{\alpha+2}}\cdot\frac{q}{n^2} + \frac{1}{q^{\alpha+3}}\cdot\left(\frac{q^2}{n^2}+\frac{q}{n^3}\right) + \frac{1}{n^{3\alpha}} \lesssim_\alpha \frac{1}{n^2q^{\alpha+1}} + \frac{1}{n^3q^{\alpha}} + \frac{1}{n^{4\alpha}} \lesssim_{\alpha,c_1} \frac{1}{nq^{\alpha}\log n}, \tag{49}$$

where in the last inequality we have used the assumption q ≥ c1 log n/n. Hence, it now remains to show that the bias-correction term (1 − (2X/n))/(2X/n)^α in the estimator hα is a good estimate of the target quantity (1 − q)/q^α. To this end, the Taylor expansion of (1 − x)/x^α around x = q again gives

$$\frac{1-(2X/n)}{(2X/n)^{\alpha}} = \frac{1-q}{q^{\alpha}} - \frac{\alpha-(\alpha-1)q}{q^{\alpha+1}}\left(\frac{2X}{n}-q\right) + \frac{\alpha(\alpha+1)-\alpha(\alpha-1)\xi}{2\xi^{\alpha+2}}\left(\frac{2X}{n}-q\right)^2,$$

with ξ ∈ [q/2, 2q] for nq/4 ≤ X ≤ nq. Using the central moments (48) again, we conclude that
$$\left|\mathbb E\left[\left(\frac{1-(2X/n)}{(2X/n)^{\alpha}} - \frac{1-q}{q^{\alpha}}\right)\mathbf 1\left(\frac{nq}{4}\le X\le nq\right)\right]\right| \lesssim_\alpha \frac{1}{n^{3\alpha}} + \frac{1}{nq^{\alpha+1}} \lesssim_{\alpha,c_1} \frac{1}{q^{\alpha}\log n}. \tag{50}$$

Hence, the desired bias bound follows from (49), (50) and the concentration inequality (47).
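As a quick numerical check of the Binomial central moments in (48) used above (illustrative only, with arbitrary n and q), the snippet below computes the exact moments of 2X/n − q for X ∼ B(n/2, q) and prints them next to the stated bounds.

```python
from math import comb

def central_moments(n, q, max_order=4):
    """Exact moments E[(2X/n - q)^r] for X ~ Binomial(n/2, q), r = 1..max_order."""
    trials = n // 2
    pmf = [comb(trials, x) * q ** x * (1 - q) ** (trials - x) for x in range(trials + 1)]
    return [sum(p * (2 * x / n - q) ** r for x, p in enumerate(pmf))
            for r in range(1, max_order + 1)]

n, q = 2000, 0.02
m1, m2, m3, m4 = central_moments(n, q)
print("first moment (should be 0):        ", m1)
print("second moment vs 2q(1-q)/n:        ", m2, 2 * q * (1 - q) / n)
print("third moment vs q/n^2:             ", m3, q / n ** 2)
print("fourth moment vs q^2/n^2 + q/n^3:  ", m4, q ** 2 / n ** 2 + q / n ** 3)
```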

D.3 Proof of Lemma 3

We use Lemma 18 to upper bound the variance of Dα,s, where we deal with the perturbations ofXm ∼ P and Y n ∼ Q, respectively. Specifically, the function fi in Lemma 18 is

fi(pi, qi) ≡ f(pi, qi) , u(pi)w(qi) ,α−1∏

ℓ=0

mpi − ℓ

m− ℓ·∑

s>c1 logn

P

(HG

(n, nqi,

n

2

)= s)hα(nqi − s).


The properties of u are summarized in Lemma 19, so it remains to consider the properties of thefunction w. First, for the second moment of w(q) with nq ∼ B(n, q), Cauchy–Schwartz inequalitygives

E[w(q)2] ≤ E

s>c1 logn

P

(HG

(n, nqi,

n

2

)= s)hα(nqi − s)2

= P

(B

(n2, q)> c1 log n

)· E[hα(X)2],

with X ∼ B(n/2, q), where the last identity uses the unbiased/sample splitting property in (16).The following lemma gives an upper bound of the second moment of hα(X).

Lemma 20. For q ≥ c1 log n/n with constant c1 > 0 large enough, it holds that
$$\mathbb E[h_\alpha(X)^2] \lesssim_{\alpha, c_1} \frac{1}{q^{2(\alpha-1)}}.$$

Consequently, by Lemma 4 and Lemma 20, we conclude that

E[w(q)2] .α,c1

1

n3α· 1(q <

c1 log n

n

)+

1

q2(α−1)· 1(q ≥ c1 log n

n

).

Hence, the variance upper bound Sm corresponding to the perturbation in Xm ∼ P in Lemma18 is

Sm , 2m ·k∑

i=1

E

[pi

(u(pi)− u

(pi −

1

m

))2]· E[w(qi)2]

.α,c1

i∈[k]:qi<c1 logn/n

p2α−1i

m· 1

n3α+

i∈[k]:qi≥c1 logn/n

p2α−1i

m· 1

q2(α−1)i

.α,c1

k∑

i=1

U2(α−1)pim

=U2(α−1)

m, (51)

where in the last inequality we have used that pi ≤ Uqi for all i ∈ [k].Next we deal with the perturbation in Y n ∼ Q, where a central quantity is to upper bound

the difference w(q)− w(q − 1/n). To do so, we distinguish into three cases:

• Case I: q ≤ c1 log n/n. In this case, the above upper bound on the second moment of w(q)gives

E

[q

(w(q)− w

(q − 1

n

))2]≤ 2E[w(q)2] + 2E[w(q − 1/n)2] .α,c1

1

n3α.


Consequently, by Lemma 19 we have

2n ·∑

i∈[k]:qi≤c1 logn/n

E[u(pi)2] · E

[qi

(w(qi)− w

(qi −

1

n

))2]

.α,c1

1

n3α−1

i∈[k]:qi≤c1 logn/n

p2αi

.α,c1

1

n3α−1

i∈[k]:qi≤c1 logn/n

pi

(U log n

n

)2α−1

.α,c1

U2α−1

n. (52)

• Case II: c1 log n/n < q ≤ 3c1 log n/n. In this case, it is clear that |w(q)| ≤ ‖hα‖∞ .α nα−1.Hence,

E

[q

(w(q)− w

(q − 1

n

))2].α n2(α−1) · E[q] = n2(α−1)q.

Consequently, it holds that

2n ·∑

i∈[k]:c1 logn/n<qi≤3c1 logn/n

E[u(pi)2] · E

[qi

(w(qi)− w

(qi −

1

n

))2]

.α,c1 n2α−1

i∈[k]:qi≤3c1 logn/n

p2αi qi

.α,c1 n2α−1

i∈[k]

U2α

(3c1 log n

n

)2α+1

.α,c1,εkU2α

n2−ε(53)

for any ε > 0.

• Case III: q > 3c1 log n/n. We claim that in this case w(q) is close to the following functionw⋆(q) under the L2 norm:

w⋆(q) ,∑

s≥0

P

(HG

(n, nq,

n

2

)= s)hα(nq − s).

In fact, the Cauchy–Schwartz inequality gives

E[(w(q)− w⋆(q))2] = E

s≤c1 logn

P

(HG

(n, nq,

n

2

)= s)hα(nq − s)

2

≤ E

s≤c1 logn

P

(HG

(n, nq,

n

2

)= s)hα(nq − s)2

≤ P

(B

(n2, q)≤ c1 log n

)· ‖hα‖2∞ .c1,α

1

n3α,

where in the last inequality we have used Lemma 4. Hence, by triangle inequality it sufficesto work with the function w⋆(q). To this end, note that an alternative definition of w⋆(q) is


w⋆(q) = E[hα(nq − Y)] with Y ∼ HG(n, nq, n/2). Similarly, w⋆(q − 1/n) = E[hα(nq − 1 − Z)] with Z ∼ HG(n, nq − 1, n/2). By definition of the hypergeometric distribution, it is clear that there exists a coupling between the random variables Y and Z such that Y − 1 ≤ Z ≤ Y almost surely (a small simulation of such a coupling is sketched after this list of cases), and consequently

∣∣∣∣w⋆(q)− w⋆

(q − 1

n

)∣∣∣∣ ≤ E |hα(nq − Y )− hα(nq − 1− Z)| ≤ E[∆hα(nq − Y )],

with ∆hα(x) , |hα(x)− hα(x− 1)|. The key property of the function ∆hα is summarizedin the following lemma.

Lemma 21. For X ∼ B(n/2, q) and q ≥ 3c1 log n/n with constant c1 > 0 large enough, it holds that
$$\mathbb E[\Delta h_\alpha(X)^2] \lesssim_{\alpha,c_1} \frac{1}{n^2q^{2\alpha}},\qquad \mathbb E[X\,\Delta h_\alpha(X)^2] \lesssim_{\alpha,c_1} \frac{1}{nq^{2\alpha-1}}.$$

Based on Lemma 21, we have the following chain of inequalities:

E

[q

(w⋆(q)− w⋆

(q − 1

n

))2]

(a)

≤ E

q ·

s≥0

P

(HG

(n, nq,

n

2

)= s)∆hα(nq − s)2

= E

s≥0

s

nP

(HG

(n, nq,

n

2

)= s)·∆hα(nq − s)2

+ E

s≥0

(q − s

n

)P

(HG

(n, nq,

n

2

)= s)·∆hα(nq − s)2

(b)= E

[X1

n

]· E[∆hα(X2)

2] +E[X2∆hα(X2)

2]

n

(c)

.α,c1

1

n2q2α−1,

where (a) is again due to the Cauchy–Schwartz inequality applied to E2[∆hα(nq−Y )], (b) is

due to the unbiased property similar to (16) with i.i.d. random variablesX1,X2 ∼ B(n/2, q),and (c) follows from Lemma 21. In summary, it holds that

2n ·∑

i∈[k]:qi>3c1 logn/n

E[u(pi)2] · E

[qi

(w(qi)− w

(qi −

1

n

))2]

.α,c1

i∈[k]:qi>3c1 logn/n

(p2αi

n3α−1+

p2αinq2α−1

i

)

.α,c1

k∑

i=1

(pi

n3α−1+

U2α−1pin

).α,c1

U2α−1

n. (54)
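As referenced in Case III above, the sketch below constructs a monotone coupling of the two hypergeometric variables explicitly (an illustration only, with arbitrary toy parameters): color balls 1,…,K red for Y and balls 1,…,K−1 red for Z, draw the same random subset of size n/2 for both, and the two counts then differ by at most one while keeping the correct hypergeometric marginals.

```python
import random

def coupled_hypergeometric(n, K, draws, rng):
    """One sample of (Y, Z) with Y ~ HG(n, K, draws), Z ~ HG(n, K-1, draws),
    coupled so that Y - 1 <= Z <= Y always holds."""
    sample = rng.sample(range(1, n + 1), draws)   # common draw without replacement
    Y = sum(1 for b in sample if b <= K)          # red balls are {1, ..., K}
    Z = sum(1 for b in sample if b <= K - 1)      # red balls are {1, ..., K-1}
    return Y, Z

rng = random.Random(0)
n, K = 100, 30
pairs = [coupled_hypergeometric(n, K, n // 2, rng) for _ in range(50000)]
assert all(y - 1 <= z <= y for y, z in pairs)
print("mean Y:", sum(y for y, _ in pairs) / len(pairs), " (expect", K * 0.5, ")")
print("mean Z:", sum(z for _, z in pairs) / len(pairs), " (expect", (K - 1) * 0.5, ")")
```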


Combining the above scenarios and inequalities (52), (53), (54), the variance upper bound Sn

corresponding to the perturbation in Y n ∼ Q in Lemma 18 is

Sn , 2n ·k∑

i=1

E[u(pi)2] · E

[qi

(w(qi)− w

(qi −

1

n

))2].α,c1,ε

U2α−1

n+

kU2α

n2−ε. (55)

Therefore, a combination of (51), (55) and Lemma 18 completes the proof of Lemma 3.

D.4 Proof of Lemma 9

We shall only prove the statements on M(1)d , as the analysis is entirely similar for M

(2)d . Adopting

the same idea of the proof of Lemma 1, the bias can be expressed as

|E[M (1)d ]−M⋆

1,d| ≤k∑

i=1

piqdi · P

(B

(n2, qi

)≤ c1 log n

)· 1(qi >

3c1 log n

n

)

︸ ︷︷ ︸,A1

+

k∑

i=1

piP(B

(n2, qi

)≤ c1 log n

)· E|gd(Xi)− gd(Xi)|

︸ ︷︷ ︸,A2

(56)

for Xi ∼ B(n/2, qi). By the Chernoff bound and tail comparison, for constant c1 > 0 large enoughand c2 > 0 small enough we have

supq≥3c1 logn/n

qd · P(B

(n2, q)≤ c1 log n

).c1,c2

1

n5

(3c1 log n

n

)d

.

Consequently, we have

A1 .c1,c2

1

n5

(3c1 log n

n

)d k∑

i=1

pi =1

n5

(3c1 log n

n

)d

. (57)

As for A2, by definition of the modification gd in (5), we have

E|gd(Xi)− gd(Xi)| ≤ E [gd(Xi)1(Xi ≥ 2c1 log n)]

.c1,c2

1

n5

(3c1 log n

n

)d

· 1(qi ≤

3c1 log n

n

)+ qdi · 1

(qi >

3c1 log n

n

),

where the second inequality follows from similar tail comparison and gd(x) ≤ (2x/n)d. Hence,

A2 .c1,c2 A1 +1

n5

(3c1 log n

n

)d k∑

i=1

pi .c1,c2

1

n5

(3c1 log n

n

)d

. (58)

Therefore, the claimed bias bound follows from a combination of (56), (57), and (58).


As for the variance of M(1)d , we apply Lemma 18 to the function

fi(pi, qi) ≡ f(pi, qi) , piv(qi) , pi ·∑

s≤c1 logn

P

(HG

(n, nqi,

n

2

)= s)gd(nqi − s).

Since Lemma 19 summarizes some properties of the function v, Lemma 18 leads to

Var(M(1)d ) ≤ 2m ·

k∑

i=1

E

[pim2

]E[v(qi)

2]+ 2n ·

k∑

i=1

E[p2i ] · E[qi

(v(qi)− v

(qi −

1

n

))2]

.c1,c2

1

m

k∑

i=1

pi

(4c1 log n

n

)2d

exp(−c(nqi − 3c1 log n)+)

+

k∑

i=1

(p2i +

pim

)· n(4c1 log n

n

)2d+1

exp(−c(nqi − 3c1 log n)+).

Using d ≤ ⌈c2 log n⌉ and distinguishing into two cases qi ≤ 4c1 log n/n and qi > 4c1 log n/n, weconclude that

Var(M(1)d ) .c1,c2 n

O(c2) ·(kU

mn+

kU2

n2

)(3c1 log n

n

)2d

.c1,c2 nO(c2)

(√k

m+

√kU

n

)2(3c1 log n

n

)2d

,

establishing the desired variance bound.

D.5 Proof of Lemma 10

As the second inequality essentially follows from the analysis of [JVHW15, Lemma 3], we solelyfocus on the first inequality. Recall that the concentration inequality (47) gives nq/4 ≤ X ≤ nqwith probability at least 1 − n−5, we may primarily work on the regime X ∈ [nq/4, nq]. In thisregime, the Taylor expansion of − log x around x = q gives that

− log2X

n= − log q − 1

q

(2X

n− q

)+

1

2q2

(2X

n− q

)2

− 1

3q3

(2X

n− q

)3

+1

4ξ4

(2X

n− q

)4

,

with ξ ∈ [q/2, 2q] if nq/4 ≤ X ≤ nq. Using the central moments of Binomial random variable in(48), we have

∣∣∣∣E[(

− log2X

n− 1− q

nq+ log q

)1

(nq4

≤ X ≤ nq)]∣∣∣∣ .c1

1

n2q2+

1

n3q3+

1

n5.c1

1

nq log n, (59)

where the last step is thanks to the assumption q ≥ c1 log n/n. As for the bias correction term,the Taylor expansion of (1− x)/(nx) around x = q gives

1− 2X/n

2X=

1− q

nq− 1

nq2

(2X

n− q

)+

1

nξ3

(2X

n− q

)2


for some ξ ∈ [q/2, 2q] if nq/4 ≤ X ≤ nq. Hence, using the central moments (48) again yields to∣∣∣∣E[(

log2X

n− 1− q

nq

)1

(nq4

≤ X ≤ nq)]∣∣∣∣ .c1

1

n2q2+

1

n5.c1

1

nq log n. (60)

Now the desired bias upper bound follows from the inequalities (59), (60), and the concentrationbound (47).

D.6 Proof of Lemma 11

By the estimator construction for the KL divergence, it is clear that D(1)KL,s+D

(2)KL,s =

∑ki=1 f(pi, qi),

with the bivariate function f given by

f(p, q) = p∑

s>c1 logn

P

(HG

(n, nq,

n

2

)= s)h(1)(nq − s)

+∑

s>c1 logm

P

(HG

(m,mp,

m

2

)= s)h(2)(mp− s).

For notational simplicity, we write f(p, q) = p · u(q) + v(p). We begin with some properties ofthe functions u and v which are summarized in the following lemma.

Lemma 22. For mp ∼ B(m, p), nq ∼ B(n, q), the following inequalities hold:

E[u(q)2] .c1 [1 + (log q)2] · 1(q >

3c1 log n

n

)+ (log n)2 · 1

(q ≤ 3c1 log n

n

),

E

[p

(v(p)− v

(p− 1

m

))2].c1

p[1 + (log p)2]

m2· 1(p >

3c1 logm

m

)+

(logm)5

m3· 1(p ≤ 3c1 logm

m

),

E

[q

(u(q)− u

(q − 1

n

))2].c1

1

n2q· 1(q >

3c1 log n

n

)+

(log n)3

n· 1(q ≤ 3c1 log n

n

).

Based on Lemma 22, the perturbation of q in Lemma 18 can be directly controlled. To seethis, note that

f(p, q)− f

(p, q − 1

n

)= p

(u(q)− u

(q − 1

n

)).

Hence, by Lemma 22,

2n ·k∑

i=1

E

[qi

(f(pi, qi)− f

(pi, qi −

1

n

))2]

= 2n

k∑

i=1

E[p2i ] · E[qi

(u(qi)− u

(qi −

1

n

))2]

.c1

i∈[k]:qi≤3c1 logn/n

(p2i +

pim

)· (log n)3 +

i∈[k]:qi>3c1 logn/n

(p2i +

pim

)· 1

nqi

.c1,εkU2

n2−2ε+

kU

n1−εm+

U

n+

k

nm.c1,ε

kU2

n2−2ε+

k

m2+

U

n, (61)


where the last step is due to the inequality kU/(n1−εm) ≤ kU2/n2−2ε + k/m2.The perturbation of p is slightly more involved to deal with. We write that

f(p, q)− f

(p− 1

m, q

)=

u(q)

m+ v(p)− v

(p− 1

m

).

We distinguish into three cases:

• Case I: p ≤ 3c1 logm/m. By Lemma 22, we see that

E

[p

(f(p, q)− f

(p− 1

m, q

))2].

E[p] · E[u(q)2]m2

+ E

[p

(v(p)− v

(p− 1

m

))2]

.c1

p(log n)2

m2+

(logm)5

m3.

Consequently, the total contribution of such symbols is

2m ·∑

i∈[k]:pi≤3c1 logm/m

E

[pi

(f(pi, qi)− f

(pi −

1

m, qi

))2]

.c1

i∈[k]:pi≤3c1 logm/m

(pi(log n)

2

m+

(logm)5

m2

)

.c1,ε (mn)ε(kU

mn+

k

m2

).c1,ε (mn)ε

(kU2

n2+

k

m2

). (62)

• Case II: q ≤ 3c1 log n/n. In this case, by Lemma 22 again we have

E

[p

(f(p, q)− f

(p− 1

m, q

))2].

E[p] · E[u(q)2]m2

+ E

[p

(v(p)− v

(p− 1

m

))2]

.c1

p(log n)2

m2+

(logm)5

m3+

p[1 + (log p)2]

m2

.c1

U(log n)3

m2n+

(logm)5

m3,

where in the last step we have used that p ≤ Uq. Consequently, the total contribution ofsuch symbols is

2m ·∑

i∈[k]:qi≤3c1 logn/n

E

[pi

(f(pi, qi)− f

(pi −

1

m, qi

))2]

.c1

i∈[k]:qi≤3c1 logn/n

(U(log n)2

mn+

(logm)5

m2

)

.c1,ε (mn)ε(kU

mn+

k

m2

).c1,ε (mn)ε

(kU2

n2+

k

m2

). (63)


• Case III: p > 3c1 logm/m and q > 3c1 log n/n. We consider the following identity:

f(p, q)− f

(p− 1

m, q

)=

u(q)

m+ v(p)− v

(p− 1

m

)

=u(q) + log q

m+

(v(p)− v

(p− 1

m

)− log p

m

)+

1

mlog

p

q.

The usefulness of the above decomposition lies in the following key result that each termnow enjoys a small second moment.

Lemma 23. For mp ∼ B(m, p), nq ∼ B(n, q) with p > 3c1 logm/m and q > 3c1 log n/n, itholds that

E[(u(q) + log q)2] .c1 1,

E

[(v(p)− v

(p− 1

m

)− log p

m

)2].c1

1

m2.

Based on Lemma 23 and the fact (47) that p ≤ 2p with probability at least 1 − m−5, wearrive at

E

[p

(f(p, q)− f

(p− 1

m, q

))2]

.c1

1

m5+ p ·

(E[(u(q) + log q)2]

m2+ E

[(v(p)− v

(p− 1

m

)− log p

m

)2]+

1

m2log2

p

q

)

.c1

p

m2+

p log2(p/q)

m2.c1

p(logU)2 + q

m2,

where the last step follows from the elementary inequality (log x)2 . (logU)2 + x−1 for all0 < x ≤ U . Consequently, the total contribution of such symbols to the final variance is

2m ·∑

i∈[k]:pi>3c1 logm/m,qi>3c1 logn/n

E

[pi

(f(pi, qi)− f

(pi −

1

m, qi

))2]

.c1

i∈[k]

pi(logU)2 + qim

.c1

(logU)2

m. (64)

In summary, a combination of the inequalities (62), (63), and (64) leads to

2m ·∑

i∈[k]

E

[pi

(f(pi, qi)− f

(pi −

1

m, qi

))2].c1,ε

(logU)2

m+ (mn)ε

(kU2

n2+

k

m2

), (65)

and the final result follows from Lemma 18 and (61), (65).


E Proof of Auxiliary Lemmas

E.1 Proof of Lemma 5

We prove the upper and lower approximation error bounds separately. The upper bound makes use of the Müntz polynomial: it was shown in [Che66, Page 169] that for real numbers m, p1, · · · , pn > −1/2, the distance of x^m to the linear span of x^{p1}, · · · , x^{pn} in the L2[0, 1] metric is

$$d_n(m; p_1, \cdots, p_n) = \frac{1}{\sqrt{2m+1}}\prod_{i=1}^{n}\frac{|m - p_i|}{m + p_i + 1}.$$

Specializing to the case where m = 1/2 and p_1, · · · , p_{n−α+1} = α − 1/2, · · · , n − 1/2, the above result implies that there exists some rational series $P(x) = \sum_{d=\alpha-1}^{n-1} b_d x^{d+1/2}$ such that
$$\left\|\sqrt{x} - P(x)\right\|_{L_2[0,1]} \le \frac{1}{\sqrt 2}\prod_{i=1}^{n-\alpha+1}\frac{\alpha + i - 1/2}{\alpha + i + 1/2} = O_\alpha\left(\frac{1}{n^2}\right).$$
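For a quick numerical illustration (not part of the proof), the snippet below evaluates the general Müntz distance formula quoted above with m = 1/2 and exponents α − 1/2, …, n − 1/2, and prints n²·d_n, which stays bounded, consistent with the O_α(1/n²) bound.

```python
def muntz_distance(m, exponents):
    """d_n(m; p_1, ..., p_n) from the Müntz formula quoted above."""
    d = 1.0 / (2 * m + 1) ** 0.5
    for p in exponents:
        d *= abs(m - p) / (m + p + 1)
    return d

alpha = 3
for n in (20, 80, 320, 1280):
    exps = [i - 0.5 for i in range(alpha, n + 1)]   # alpha - 1/2, ..., n - 1/2
    d = muntz_distance(0.5, exps)
    print(f"n = {n:5d}   n^2 * d_n = {n * n * d:.4f}")
```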

Hence, for Q(x) =∫ x0 P (t)dt, the Cauchy–Schwartz inequality gives

∣∣∣∣2

3x

3

2 −Q(x)

∣∣∣∣ ≤∫ x

0|√t− P (t)|dt ≤

√(∫ x

0dt

)(∫ x

0|√t− P (t)|2dt

)= Oα

(√x

n2

)

for all x ∈ [0, 1]. After dividing 2√x/3 at both sides, we conclude that the polynomial R(x) =

3Q(x)/(2√x) satisfies the claimed upper bound.

As for the lower bound, note that for any c ∈ (0, 1),

infaα,··· ,an∈R

supx∈[0,1]

∣∣∣∣∣x−n∑

d=α

adxd

∣∣∣∣∣ ≥ infa0,··· ,an∈R

supx∈[0,1]

∣∣∣∣∣x1−α −

n∑

d=0

adxd

∣∣∣∣∣

≥ infa0,··· ,an∈R

supx∈[(c/n)2,1]

∣∣∣∣∣x1−α −

n∑

d=0

adxd

∣∣∣∣∣

≥( cn

)2αinf

a0,··· ,an∈Rsup

x∈[(c/n)2,1]

∣∣∣∣∣x1−α −

n∑

d=0

adxd

∣∣∣∣∣ .

Hence, it suffices to show that for some constant c ∈ (0, 1), it holds that

infa0,··· ,an∈R

supx∈[(c/n)2,1]

∣∣∣∣∣x1−α −

n∑

d=0

adxd

∣∣∣∣∣ = Ωα

(n2(α−1)

). (66)

To this end, we introduce some necessary definitions and results from approximation theory. Forfunctions defined on [0, 1], define the r-th order Ditzian–Totik modulus of smoothness by [DT87]

ωrϕ(f, t)∞ , sup

0<h≤t‖∆r

hϕ(x)f(x)‖∞ = sup0<h≤t

∥∥∥∥∥r∑

ℓ=0

(r

)(−1)ℓf

(x+

(ℓ− r

2

)hϕ(x)

)∥∥∥∥∥∞

, (67)

where ϕ(x) ,√

x(1− x). This quantity is related to the polynomial approximation error via thefollowing lemma.


Lemma 24 ([DT87]). For any integer u > 0 and n > u, there exists some constant Mu dependingonly on u, such that for all t ∈ (0, 1) and f ,

En(f ; [0, 1]) ≤ Muωuϕ(f, 1/n)∞

Mu

nu

n∑

ℓ=0

(ℓ+ 1)u−1Eℓ(f ; [0, 1]) ≥ ωuϕ(f, 1/n)∞.

where En(f ; [0, 1]) is the best degree-n polynomial approximation error of f on [0, 1].

For function f(x) = [(c/n)2+(1− (c/n)2)x]1−α with α ≥ 2, by translation the approximationerror En(f ; [0, 1]) is the LHS of (66). It is straightforward to see that for any 1 ≤ m ≤ n/c,

ω1ϕ(f, 1/m) ≤ 2‖f‖∞ ≤ 2(n/c)2(α−1) ,

ω1ϕ(f, 1/m) ≥

∣∣∣∣f(0)− f

(1

8m2

)∣∣∣∣ = Ωα

((n/c)2(α−1)

).

Hence, there exist constants cα, Cα > 0 such that cα(n/c)2(α−1) ≤ ω1

ϕ(f, 1/m) ≤ Cα(n/c)2(α−1)

for 1 ≤ m ≤ n/c. Choosing u = 1 and D = 1/c (w.l.o.g. we assume that D > 1 is an integer),Lemma 24 gives

cα(n/c)2(α−1) ≤ ω1

ϕ

(f,

1

Dn

)

≤ M1

Dn

Dn∑

ℓ=0

(ℓ+ 1)u−1Eℓ(f ; [0, 1])

≤ M1

Dn

((nc

)2(α−1)+M1

n∑

ℓ=1

(nc

)2(α−1)+ (Dn− n)En(f ; [0, 1])

),

which gives

En(f ; [0, 1]) ≥1

n(1/c− 1)

(cαM1

(nc

)2α−(nc

)2(α−1)− nM1Cα

(nc

)2(α−1)).

Consequently, choosing the constant c > 0 small enough gives the claimed lower bound (66).

E.2 Proof of Lemma 15

For each n ∈ ℕ, let D̂n be an estimator achieving the minimax risk R⋆(m, n, (1 + ε)U) under the original sampling model with n samples from Q. By sufficiency arguments we may assume that D̂n only depends on the histograms. Then we construct an estimator D̂ for the Poissonized model as follows: let N = ∑_{i=1}^k h_i from the observed Poisson histograms; the estimator D̂ is then chosen to be D̂_N. Clearly, N ∼ Poi(n∑_{i=1}^k q_i), and conditioning on the realization of N, the Poisson sampling model is equivalent to sampling N independent samples from the discrete distribution Q/‖Q‖1. Hence, letting Q0 = Q/‖Q‖1 for (P,Q) ∈ M_k(U, ε), it holds that (P, Q0) ∈ M_k((1 + ε)U).
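The Poissonization step above relies on the standard fact that drawing N ∼ Poi(n) samples from a distribution makes the symbol counts independent Poisson variables; the quick simulation below (with arbitrary toy parameters) checks the marginal means/variances and the near-zero correlation between two counts.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 200, 100000
Q0 = np.array([0.5, 0.3, 0.2])           # a toy distribution Q/||Q||_1

N = rng.poisson(n, size=trials)           # Poissonized sample sizes
counts = np.array([rng.multinomial(Ni, Q0) for Ni in N])  # histograms h_1, h_2, h_3

for i, q in enumerate(Q0):
    print(f"h_{i+1}: mean {counts[:, i].mean():.2f}, var {counts[:, i].var():.2f}, "
          f"target n*q = {n * q:.2f}")
print("corr(h_1, h_2):", np.corrcoef(counts[:, 0], counts[:, 1])[0, 1])
```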


Hence, the estimation performance of the estimator D under the Poissonized model is

E(P,Q)|D −Dα(P‖Q)| ≤ |Dα(P‖Q0)−Dα(P‖Q)|+ E(P,Q)|D −Dα(P‖Q0)|

= |Dα(P‖Q0)−Dα(P‖Q)|+∞∑

ℓ=0

E(P,Q0)|Dℓ −Dα(P‖Q0)| · P(Poi(n) = ℓ)

(a)

.α εUα−1 +∞∑

ℓ=0

R⋆(m, ℓ, (1 + ε)U) · P(Poi(n) = ℓ)

(b)

≤ εUα−1 +Uα−1 − 1

α(α− 1)· P(Poi(n) < n/2) +R⋆(m,n/2, (1 + ε)U)

(c)

≤ εUα−1 +Uα−1 − 1

α(α− 1)e−n/8 +R⋆(m,n/2, (1 + ε)U),

where (a) is due to the fact that (P,Q0) ∈ Mk((1 + ε)U) and for (P,Q) ∈ Mk(U, ε),

|Dα(P‖Q0)−Dα(P‖Q)| = |‖Q‖α−11 − 1|

α(α − 1)·

k∑

i=1

pαiqα−1i

.α ε

k∑

i=1

pαiqα−1i

= Oα

(εUα−1

),

(b) follows from the decreasing property of the map ℓ 7→ R⋆(m, ℓ, (1 + ε)U) and R⋆(m, 0, (1 +ε)U) ≤ (Uα−1 − 1)/(α(α − 1)) by considering an estimator which always outputs zero, and (c)is due to the Chernoff bound. Finally, taking the supremum over (P,Q) ∈ Mk(U, ε) on the LHScompletes the proof of the lemma.

E.3 Proof of Lemma 17

The proof of Lemma 17 is similar to that of Lemma 15. Define the estimators (Dn, n ∈ N) andD as in the proof of Lemma 15, and for (P,Q) ∈ Mk(U, ε), define Q0 = Q/‖Q‖1 which satisfies(P,Q0) ∈ Mk((1 + ε)U). Consequently, the estimation performance of the estimator D underthe Poissonized model is

E(P,Q)|D −DKL(P‖Q)| ≤ |DKL(P‖Q0)−DKL(P‖Q)|+ E(P,Q)|D −DKL(P‖Q0)|

= |DKL(P‖Q0)−DKL(P‖Q)|+∞∑

ℓ=0

E(P,Q0)|Dℓ −DKL(P‖Q0)| · P(Poi(n) = ℓ)

(a)

. ε logU +

∞∑

ℓ=0

R⋆(m, ℓ, (1 + ε)U) · P(Poi(n) = ℓ)

(b)

≤ ε logU + logU · P(Poi(n) < n/2) +R⋆(m,n/2, (1 + ε)U)

(c)

≤ ε logU + logU · e−n/8 +R⋆(m,n/2, (1 + ε)U),

where (a) is due to the fact that (P,Q0) ∈ Mk((1 + ε)U) and for (P,Q) ∈ Mk(U, ε),

|DKL(P‖Q0)−DKL(P‖Q)| = | log ‖Q‖1 − 1| ·k∑

i=1

pi logpiqi

.α εk∑

i=1

pi logpiqi

= Oα (ε logU) ,


step (b) follows from the decreasing property of the map ℓ 7→ R⋆(m, ℓ, (1+ε)U) and R⋆(m, 0, (1+ε)U) ≤ logU by considering an estimator which always outputs zero, and (c) is due to the Chernoffbound. Finally, taking the supremum over (P,Q) ∈ Mk(U, ε) on the LHS completes the proof ofthe lemma.

E.4 Proof of Lemma 18

We first prove the following result: for (mp1, · · · ,mpk) ∼ Multi(m; p1, · · · , pk) and T =∑k

i=1 fi(pi),it holds that

Var(T ) ≤ 2m ·k∑

i=1

E

[pi

(f(pi)− f

(pi −

1

m

))2]. (68)

To prove (68), let Xm ∼ (p1, · · · , pk) be m i.i.d. observations and write T as a function of Xm.By the Efron–Stein–Steele inequality (cf. Lemma 7), we have

Var(T ) ≤ m

2· E[(T (X1, · · · ,Xm)− T (X ′

1, · · · ,Xm))2]

, (69)

where X ′1 is an independent copy of X1 mutually independent of (X2, · · · ,Xm). By the definition

of T , one may express the difference as

T (X ′1, · · · ,Xm)− T (X1, · · · ,Xn) = D− +D+,

where

D− = fX1(pX1

)− fX1

(pX1

− 1

m

),

D+ =

fX′

1

(pX′

1+ 1

m

)− fX′

1

(pX′

1

)if X ′

1 6= X1,

fX′

1

(pX′

1

)− fX′

1

(pX′

1− 1

m

)if X ′

1 = X1.

Note that conditioning on the empirical probabilities (p1, · · · , pk), the random variable X1 followsthe distribution (p1, · · · , pk) while the new observation X ′

1 follows the distribution (p1, · · · , pk),and they are mutually independent. Hence,

E[D2−] = E[E[D2

− | p1, · · · , pk]] =k∑

i=1

E

[pi

(fi(pi)− fi

(pi −

1

m

))2],


and

E[D2+] = E[E[D2

+ | p1, · · · , pk]]

=

k∑

i=1

pi · E[(1− pi)

(fi

(pi +

1

m

)− fi (pi)

)2

+ pi

(fi(pi)− fi

(pi −

1

m

))2]

=

k∑

i=1

m∑

j=1

pi

(fi

(j

m

)− fi

(j − 1

m

))2

×[m− j + 1

m·(

m

j − 1

)pj−1i (1− pi)

m−j+1 +j

m·(m

j

)pji (1− pi)

m−j

]

=k∑

i=1

m∑

j=1

(fi

(j

m

)− fi

(j − 1

m

))2

×[j

m·(m

j

)pji (1− pi)

m−j

]

=

k∑

i=1

E

[pi

(fi(pi)− fi

(pi −

1

m

))2]= E[D2

−].

Hence, by (69) and the triangle inequality, we have

Var(T ) ≤ m

2E[(D− +D+)

2] ≤ m(E[D2

−] + E[D2+])= 2m ·

k∑

i=1

E

[pi

(fi(pi)− fi

(pi −

1

m

))2],

establishing (68). For the original statement, write the sum S as a function of Xm ∼ (p1, · · · , pk)and Y n ∼ (q1, · · · , qk), and deal with the single changes in Xm and Y n separately as (68) in theEfron–Stein–Steele inequality.

E.5 Proof of Lemma 19

To deal with the function u, recall that the moment generating function of the Binomial distribution gives that for $m\hat p \sim \mathsf B(m, p)$,
$$\mathbb E[e^{\lambda\hat p}] = \left(1 - p + pe^{\lambda/m}\right)^m = \left(1 + p\sum_{i=1}^{\infty}\frac{(\lambda/m)^i}{i!}\right)^m,\qquad \forall\lambda\in\mathbb R.$$
Hence, comparing the coefficients of $\lambda^k$ at both sides gives
$$\mathbb E[\hat p^k] \lesssim_k \sum_{\ell=1}^{k}\frac{p^\ell}{m^{k-\ell}},\qquad\forall k\in\mathbb N. \tag{70}$$
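As an illustration of (70) (again only a sanity check with arbitrary parameters), the snippet below computes E[p̂^k] exactly for $m\hat p \sim \mathsf B(m, p)$ and compares it with the sum $\sum_{\ell=1}^{k} p^\ell/m^{k-\ell}$.

```python
from math import comb

def moment_phat(m, p, k):
    """Exact E[p_hat^k] for m*p_hat ~ Binomial(m, p)."""
    return sum(comb(m, x) * p ** x * (1 - p) ** (m - x) * (x / m) ** k
               for x in range(m + 1))

m, p = 200, 0.03
for k in (2, 3, 4):
    exact = moment_phat(m, p, k)
    bound = sum(p ** l / m ** (k - l) for l in range(1, k + 1))
    print(f"k={k}:  E[p_hat^k] = {exact:.3e}   sum_l p^l/m^(k-l) = {bound:.3e}")
```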

Consequently, as u(p) =∑α

d=1 adpd with coefficient bound |ad| = Oα(m

d−α), the Binomial mo-ment bound (70) gives

E[u(p)2] .α

α∑

d=1

2d∑

ℓ=1

m2d−2α · pℓ

m2d−ℓ.α p2α.


Similarly, simple algebra shows that u(p)− u(p− 1/m) =∑α−1

d=1 bdpd with |bd| = Oα(m

d−α), andtherefore (70) again leads to

E

[p

(u(p)− u

(p− 1

m

))2].α

2α−1∑

d=3

d∑

ℓ=1

md−2α−1 · pℓ

md−ℓ.α

p2α−1

m2.

As for the function v, as ‖gd‖∞ ≤ (4c1 log n/n)d, it holds that

|v(q)| ≤(4c1 log n

n

)d

· P(HG

(n, nq,

n

2

)≤ c1 log n

).

Hence,

E[v(q)2] ≤(4c1 log n

n

)2d

E

[P

(HG

(n, nq,

n

2

)≤ c1 log n

)]

=

(4c1 log n

n

)2d

P

(B

(n2, q)≤ c1 log n

)

≤(4c1 log n

n

)2d

exp(−Ω(nq − 3c1 log n)+)),

where the last step follows from the Chernoff bound. Similarly,

E

[q

(u(q)− u

(q − 1

n

))2].

(4c1 log n

n

)2d

E

[q · P

(HG

(n, nq,

n

2

)≤ c1 log n

)]

≤(4c1 log n

n

)2d√E[q2] · E

[P

(HG

(n, nq,

n

2

)≤ c1 log n

)]

≤(4c1 log n

n

)2d

·√

q2 +q

n· exp(−Ω((nq − 3c1 log n)+))

.

(4c1 log n

n

)2d+1

exp(−Ω((nq − 3c1 log n)+)),

where the last step follows from proper tail comparison.

E.6 Proof of Lemma 20

By definition of hα in (9), it holds that ‖hα‖∞ .α nα−1. Moreover, for any nq/4 ≤ X ≤ nq withq ≥ c1 log n/n, it holds that

|hα(X)| .α1

qα−1+

1

nqα.α,c1

1

qα−1.

Hence, the concentration bound (47) leads to

E[hα(X)2] = E

[hα(X)21

(nq4

≤ X ≤ nq)]

+ E

[hα(X)21

(X /∈

[nq4, nq])]

.α,c1 E

[1

q2(α−1)1

(nq4

≤ X ≤ nq)]

+ n2(α−1)P

(X /∈

[nq4, nq])

.α,c1

1

q2(α−1)+

1

n3α.α,c1

1

q2(α−1),

where in the last step we have again used that q ≥ c1 log n/n.


E.7 Proof of Lemma 21

This proof is entirely similar to that of Lemma 20. Specifically, it is easy to verify that ∆hα(x) ≤2‖hα‖∞ . nα−1 for all x ∈ N, and ∆hα(x) .α,c1 1/(nqα) whenever x ∈ [nq/4, nq]. Hence, theconcentration bound (47) leads to

E[∆hα(X)2] = E

[∆hα(X)21

(nq4

≤ X ≤ nq)]

+ E

[∆hα(X)21

(X /∈

[nq4, nq])]

.α,c1 E

[1

n2q2α1

(nq4

≤ X ≤ nq)]

+ n2(α−1)P

(X /∈

[nq4, nq])

.α,c1

1

n2q2α+

1

n3α.α,c1

1

n2q2α,

where in the last step we have again used that q ≥ 3c1 log n/n. The other inequality follows fromthe same arguments line by line.

E.8 Proof of Lemma 22

First we upper bound the second moment of u(q). By Cauchy–Schwartz,

u(q)2 ≤∑

s>c1 logn

P

(HG

(n, nq,

n

2

)= s)h(1)(nq − s)2.

Consequently, (16) leads to

E[u(q)2] ≤ P

(B

(n2, q)> c1 log n

)· E[h(1)(X)2]

for X ∼ B(n/2, q). We distinguish into two scenarios. First, if q ≤ c1 log n/n, Lemma 4 togetherwith the norm bound ‖h(1)‖∞ . log n gives an upper boundOc1(n

−5 log2 n) on the above quantity.Second, if q > c1 log n/n, following the same lines of the proofs of Lemma 20 and Lemma 21 leadsto E[h(1)(X)2] .c1 1 + (log q)2. Then after some algebra, combining these two scenarios impliesthe first statement of the lemma (despite that the threshold becomes different).

To deal with the function v, we again distinguish into two cases. First, when p ≤ 3c1 logm/m,Holder’s inequality gives

E[v(p)4] ≤ P

(B

(m2, p)> c1 logm

)· E[h(2)(X)4]

for X ∼ B(m/2, p). Further distinguishing into two cases p ≤ c1 logm/m and c1 logm/m < p ≤3c1 logm/m and using the above analysis for u, we conclude that for all p ≤ 3c1 logm/m it holdsthat

E[v(p)4] .c1

(log2m

m

)4

.

Consequently, in this case the target quantity can be upper bounded as

E

[p

(v(p)− v

(p− 1

m

))2].√

E[p2] ·√

E[v(p)4] + E [v(p− 1/m)4]

.c1

√p2 +

p

m·(log2m

m

)2

.c1

(logm)5

m3.


As for p > 3c1 logm/m, we use the coupling arguments as in the proof of Lemma 3. First, byLemma 4 and the assumption p > 3c1 logm/m, it suffices to replace v by the function v⋆ definedas

v⋆(p) =∑

s≥0

P

(HG

(m,mp,

m

2

)= s)· h(2)(mp − s),

with an additive error at most Oc1(n−5). Second, we may express v⋆(p) = E[h(2)(mp− Y )] with

Y ∼ HG(m,mp,m/2), and v⋆(p − 1/m) = E[h(2)(mp − 1 − Z)] with Z ∼ HG(m,mp − 1,m/2).Since there exists a coupling between (Y,Z) such that Y − 1 ≤ Z ≤ Y holds almost surely, weconclude that

(v⋆(p)− v⋆

(p− 1

m

))2

=(E[h(2)(mp − Y )− h(2)(mp− 1− Z)]

)2

≤(E[∆h(2)(mp− Y )]

)2

≤∑

s≥0

P

(HG

(m,mp,

m

2

)= s)·∆h(2)(mp− s)2,

where ∆h(2)(x) , |h(2)(x)−h(2)(x−1)|. Similar to Lemma 21, for random variable X ∼ B(m/2, p)with p ≥ 3c1 logm/m, the following inequalities hold:

E

[∆h(2)(X)2

].c1

1 + (log p)2

m2, E

[X ·∆h(2)(X)2

].c1

p[1 + (log p)2]

m.

Hence, for X1,X2 ∼ B(m/2, p), we have

E

[p

(v⋆(p)− v⋆

(p− 1

m

))2]≤ E

p ·

s≥0

P

(HG

(m,mp,

m

2

)= s)·∆h(2)(mp− s)2

= E

s≥0

s

mP

(HG

(m,mp,

m

2

)= s)·∆h(2)(mp− s)2

+ E

s≥0

P

(HG

(m,mp,

m

2

)= s)·(p− s

m

)∆h(2)(mp− s)2

= E

[X1

m

]· E[∆h(2)(X)2

]+

E[X ·∆h(2)(X)2

]

m

.c1

p[1 + (log p)2]

m2,

establishing the second inequality for p ≥ 3c1 logm/m. The third inequality follows from similarlines to the proof of the second inequality and is thus omitted.

E.9 Proof of Lemma 23

For the first inequality, consider the following bias-variance decomposition:

E[(u(q) + log q)2] = |E[u(q)] + log q|2 + Var(u(q)).


First, the bias can be upper bounded as

|E[u(q)] + log q| (a)=∣∣∣P(B

(n2, q)> c1 log n

)· E[h(1)(X)] + log q

∣∣∣(b)

≤∣∣∣E[h(1)(X)] + log q

∣∣∣+ | log q|n5

(c)

.c1

1

nq log n+

| log q|n5

.c1 1,

with X ∼ B(n/2, q), where (a) follows from (16), (b) is due to the triangle inequality and Lemma4, and (c) follows from Lemma 10 and the assumption q ≥ 3c1 log n/n. Second, applying Lemma18 to the Binomial model (which is a special case of the Multinomial model), we have

Var(u(q)) ≤ 2n · E[q

(u(q)− u

(q − 1

n

))2].c1

1

nq.c1 1,

where the second inequality is due to Lemma 22. Hence, combining the above inequalities com-pletes the proof of the first inequality.

As for the second inequality, we again deal with the bias and the variance separately. For thebias part, following the above steps involving (16), Lemma 4 and Lemma 10 leads to

|E[v(p)]− p log p| .c1

1

m logm,

∣∣∣∣E[v

(p− 1

m

)]−(p− 1

m

)log

(p− 1

m

)∣∣∣∣ .c1

1

m logm.

Combining the above inequalities and observing that

∣∣∣∣p log p−(p− 1

m

)log

(p− 1

m

)− 1

mlog p

∣∣∣∣ =(p− 1

m

)log

mp

mp− 1≤ 1

m,

a triangle inequality leads to the squared bias bound

∣∣∣∣E[v(p)− v

(p− 1

m

)]− log p

m

∣∣∣∣2

.c1

1

m2. (71)

The most challenging part is to upper bound the variance of the difference v(p)− v(p− 1/m).To this end, by the assumption p ≥ 3c1 logm/m, the similar techniques in the proof of Lemma22 show that it suffices to work with the new function v⋆ with

v⋆(p) =∑

s≥0

P

(B

(m,mp,

m

2

)= s)· h(2)(mp− s).

Consequently, applying the Efron–Stein–Steele inequality to the Binomial model again, Lemma18 yields to

Var

(v⋆(p)− v⋆

(p− 1

m

))≤ 2m · E

[p

(v⋆(p)− 2v⋆

(p− 1

m

)+ v⋆

(p− 2

m

))2]. (72)


To upper bound the finite difference in (72), we may write v⋆(p) = E[h(2)(mp −X)] with X ∼HG(m,mp,m/2), v⋆(p− 1/m) = E[h(2)(mp− 1− Y )] with Y ∼ HG(m,mp− 1,m/2), and v⋆(p−2/m) = E[h(2)(mp− 2−Z)] with Z ∼ HG(m,mp− 2,m/2). Note that there exists a coupling of(X,Y,Z) such that X−1 ≤ Y ≤ X and Y −1 ≤ Z ≤ Y hold almost surely, and E[X+Z|Y ] = 2Y .Hence, a taylor expansion of h(2)(x) around x = mp− Y − 1 gives∣∣∣∣v⋆(p)− 2v⋆

(p− 1

m

)+ v⋆

(p− 2

m

)∣∣∣∣

=∣∣∣E[h(2)(mp−X)− 2h(2)(mp− Y − 1) + h(2)(mp− Z − 2)]

∣∣∣

=

∣∣∣∣∣E[Dh(2)(mp− Y − 1)(2Y −X − Z) +

D2h(2)(ξ1)

2(Y + 1−X)2 +

D2h(2)(ξ2)

2(Y − 1− Z)2

]∣∣∣∣∣

≤ E

[sup

mp−X−2≤ξ≤mp−X|D2h(2)(ξ)|

], E[h(mp −X)],

where we use the notations Dh(2) and D2h(2) to denote the first- and second-order derivatives ofh(2), respectively, and the first-order term in the Taylor expansion is zero via conditioning on Y .Consequently, the Cauchy–Schwartz inequality gives

(v⋆(p)− 2v⋆

(p− 1

m

)+ v⋆

(p− 2

m

))2

≤∑

s≥0

P

(B

(m,mp,

m

2

)= s)· h(mp− s)2,

and therefore for X1,X2 ∼ B(m/2, p),

E

[p

(v⋆(p)− 2v⋆

(p− 1

m

)+ v⋆

(p− 2

m

))2]

≤ E

s≥0

p · P(B

(m,mp,

m

2

)= s)· h(mp − s)2

= E

s≥0

s

m· P(B

(m,mp,

m

2

)= s)· h(mp − s)2

+ E

s≥0

P

(B

(m,mp,

m

2

)= s)·(p− s

m

)h(mp − s)2

= E

[X1

m

]· E[h(X2)

2] +1

mE[X2 · h(X2)

2].

Finally, following the same lines as the proof of Lemma 20 and Lemma 21, and using h(X2) =Θ(1/(m2p)) whenever mp/4 ≤ X2 ≤ mp, for p ≥ 3c1 logm/m it holds that

E[h(X2)2] .c1

1

m4p2, E[X2 · h(X2)

2] .c1

1

m3p.

Based on the above inequalities, (72) gives the variance upper bound

Var

(v⋆(p)− v⋆

(p− 1

m

)).c1

1

m3p.c1

1

m2. (73)

Hence, a combination of (71) and (73) completes the proof of the second inequality.


References

[Ach18] Jayadev Acharya. Profile maximum likelihood is optimal for estimating kl divergence.In 2018 IEEE International Symposium on Information Theory (ISIT), pages 1400–1404. IEEE, 2018.

[ACSS20] Nima Anari, Moses Charikar, Kirankumar Shiragur, and Aaron Sidford. The Betheand Sinkhorn permanents of low rank matrices and implications for profile maximumlikelihood. arXiv preprint arXiv: 2004.02425, 2020.

[ACT18] Jayadev Acharya, Clement L Canonne, and Himanshu Tyagi. Inference under in-formation constraints i: Lower bounds from chi-square contraction. arXiv preprintarXiv:1812.11476, 2018.

[ADOS17] Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh. Aunified maximum likelihood approach for estimating symmetric properties of discretedistributions. In International Conference on Machine Learning, pages 11–21, 2017.

[Ama12] Shun-ichi Amari. Differential-geometrical methods in statistics, volume 28. SpringerScience & Business Media, 2012.

[Bir83] Lucien Birgé. Approximation dans les espaces métriques et théorie de l'estimation. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 65(2):181–237, 1983.

[Bis06] Christopher M Bishop. Pattern recognition. Machine Learning, 128, 2006.

[BZLV18] Yuheng Bu, Shaofeng Zou, Yingbin Liang, and Venugopal V Veeravalli. Estimationof kl divergence: Optimal minimax rate. IEEE Transactions on Information Theory,64(4):2648–2674, 2018.

[Che66] Elliott Ward Cheney. Introduction to approximation theory. 1966.

[CK11] Imre Csiszár and János Körner. Information theory: coding theorems for discrete memoryless systems. Cambridge University Press, 2011.

[CKV06] Haixiao Cai, Sanjeev R Kulkarni, and Sergio Verdu. Universal divergence estimationfor finite-alphabet sources. Information Theory, IEEE Transactions on, 52(8):3456–3475, 2006.

[CL11] T Tony Cai and Mark G Low. Testing composite hypotheses, Hermite polynomi-als and optimal estimation of a nonsmooth functional. The Annals of Statistics,39(2):1012–1041, 2011.

[CP04] Olivier Catoni and Jean Picard. Statistical learning theory and stochastic optimiza-tion: Ecole d’Ete de Probabilites de Saint-Flour, XXXI-2001, volume 31. SpringerScience & Business Media, 2004.

[Csi67] Imre Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229–318, 1967.


[CSS19a] Moses Charikar, Kirankumar Shiragur, and Aaron Sidford. Efficient profile maximumlikelihood for universal symmetric property estimation. In Proceedings of the 51stAnnual ACM SIGACT Symposium on Theory of Computing, pages 780–791, 2019.

[CSS19b] Moses Charikar, Kirankumar Shiragur, and Aaron Sidford. A general frameworkfor symmetric property estimation. In Advances in Neural Information ProcessingSystems, pages 12426–12436, 2019.

[DLR77] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood fromincomplete data via the em algorithm. Journal of the royal statistical society. SeriesB (methodological), pages 1–38, 1977.

[DT87] Zeev Ditzian and Vilmos Totik. Moduli of smoothness. Springer, 1987.

[GBR+06] Arthur Gretton, Karsten M Borgwardt, Malte Rasch, Bernhard Scholkopf, and Alex JSmola. A kernel method for the two-sample-problem. In Advances in neural infor-mation processing systems, pages 513–520, 2006.

[Haj70] Jaroslav Hájek. A characterization of limiting distributions of regular estimates. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 14(4):323–330, 1970.

[Haj72] Jaroslav Hájek. Local asymptotic minimax and admissibility in estimation. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 175–194, 1972.

[Hel09] Ernst Hellinger. Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. Journal für die reine und angewandte Mathematik (Crelles Journal), 1909(136):210–271, 1909.

[HJM17] Yanjun Han, Jiantao Jiao, and Rajarshi Mukherjee. On estimation of lr-norms ingaussian white noise models. arXiv preprint arXiv:1710.03863, 2017.

[HJW16a] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax rate-optimal estimationof divergences between discrete distributions. arXiv preprint arXiv:1605.09124v2,2016.

[HJW16b] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax rate-optimal estimationof kl divergence between discrete distributions. In 2016 International Symposium onInformation Theory and Its Applications (ISITA), pages 256–260. IEEE, 2016.

[HJW18] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Local moment matching: A unifiedmethodology for symmetric functional estimation and distribution estimation underwasserstein distance. In Conference On Learning Theory, pages 3189–3221, 2018.

[HJW20] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax rate-optimal estimationof divergences between discrete distributions. arXiv preprint arXiv:1605.09124v3,2020.

[HJWW17] Yanjun Han, Jiantao Jiao, Tsachy Weissman, and Yihong Wu. Optimal rates ofentropy estimation over lipschitz balls. to appear in the Annals of Statistics, 2017.


[HO19a] Yi Hao and Alon Orlitsky. The broad optimality of profile maximum likelihood. InAdvances in Neural Information Processing Systems, pages 10989–11001, 2019.

[HO19b] Yi Hao and Alon Orlitsky. Unified sample-optimal property estimation in near-lineartime. In Advances in Neural Information Processing Systems, pages 11104–11114,2019.

[HP15] Moritz Hardt and Eric Price. Tight bounds for learning a mixture of two gaussians.In Proceedings of the forty-seventh annual ACM symposium on Theory of computing,pages 753–760, 2015.

[HS20] Yanjun Han and Kirankumar Shiragur. The optimality of profile maximum likelihoodin estimating sorted discrete distributions. arXiv preprint arXiv:2004.03166, 2020.

[IS12] Yuri Ingster and Irina A Suslina. Nonparametric goodness-of-fit testing under Gaus-sian models, volume 169. Springer Science & Business Media, 2012.

[JHW18] Jiantao Jiao, Yanjun Han, and Tsachy Weissman. Minimax estimation of the l1distance. IEEE Transactions on Information Theory, 64(10):6672–6706, 2018.

[JVHW15] Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Minimax estima-tion of functionals of discrete distributions. Information Theory, IEEE Transactionson, 61(5):2835–2885, 2015.

[KL51] Solomon Kullback and Richard A Leibler. On information and sufficiency. Theannals of mathematical statistics, 22(1):79–86, 1951.

[Kul97] Solomon Kullback. Information theory and statistics. Courier Corporation, 1997.

[KV17] Weihao Kong and Gregory Valiant. Spectrum estimation from samples. The Annalsof Statistics, 45(5):2218–2247, 2017.

[KW13] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXivpreprint arXiv:1312.6114, 2013.

[LC86] Lucien Le Cam. Asymptotic methods in statistical decision theory. Springer, 1986.

[LNS99] Oleg Lepski, Arkady Nemirovski, and Vladimir Spokoiny. On estimation of the Lr

norm of a regression function. Probability theory and related fields, 113(2):221–253,1999.

[LP06] Young Kyung Lee and Byeong U Park. Estimation of Kullback–Leibler divergenceby local likelihood. Annals of the Institute of Statistical Mathematics, 58(2):327–340,2006.

[Mar92] VA Markov. On functions deviating least from zero in a given interval. Izdat. Imp. Akad. Nauk, St. Petersburg, pages 218–258, 1892.

[NWJ10] XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. Information Theory, IEEE Transactions on, 56(11):5847–5861, 2010.

[OSVZ04] Alon Orlitsky, Narayana P Santhanam, Krishnamurthy Viswanathan, and Junan Zhang. On modeling profiles instead of values. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 426–435. AUAI Press, 2004.

[OSW16] Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu. Optimal prediction of the number of unseen species. Proceedings of the National Academy of Sciences, 113(47):13283–13288, 2016.

[Pan03] Liam Paninski. Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253, 2003.

[Pan04] Liam Paninski. Estimating entropy on m bins given fewer than m samples. Information Theory, IEEE Transactions on, 50(9):2200–2203, 2004.

[PC08] Fernando Perez-Cruz. Kullback-Leibler divergence estimation of continuous distributions. In Information Theory, 2008. ISIT 2008. IEEE International Symposium on, pages 1666–1670. IEEE, 2008.

[Pea94] Karl Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A, 185:71–110, 1894.

[Rem34] E Ya Remez. Sur la détermination des polynômes d'approximation de degré donnée. Comm. Soc. Math. Kharkov, 10:41–63, 1934.

[Ric63] John R Rice. Tchebycheff approximation in several variables. Transactions of the American Mathematical Society, pages 444–466, 1963.

[RW19] Philippe Rigollet and Jonathan Weed. Uncoupled isotonic regression via minimum Wasserstein deconvolution. Information and Inference: A Journal of the IMA, 8(4):691–717, 2019.

[San58] IN Sanov. On the probability of large deviations of random variables. United States Air Force, Office of Scientific Research, 1958.

[Ste86] J Michael Steele. An Efron–Stein inequality for nonsymmetric statistics. The Annals of Statistics, 14(2):753–758, 1986.

[TKV17] Kevin Tian, Weihao Kong, and Gregory Valiant. Learning populations of parameters. In Advances in Neural Information Processing Systems, pages 5778–5787, 2017.

[Tsy09] A. Tsybakov. Introduction to Nonparametric Estimation. Springer-Verlag, 2009.

[Ver19] Sergio Verdu. Empirical estimation of information measures: A literature guide. Entropy, 21(8):720, 2019.

[VV11a] Gregory Valiant and Paul Valiant. Estimating the unseen: an n/log n-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the 43rd annual ACM symposium on Theory of computing, pages 685–694. ACM, 2011.

[VV11b] Gregory Valiant and Paul Valiant. The power of linear estimators. In Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual Symposium on, pages 403–412. IEEE, 2011.

[VV13] Paul Valiant and Gregory Valiant. Estimating the unseen: improved estimators for entropy and other properties. In Advances in Neural Information Processing Systems, pages 2157–2165, 2013.

[WKV05] Qing Wang, Sanjeev R Kulkarni, and Sergio Verdu. Divergence estimation of continuous distributions based on data-dependent partitions. Information Theory, IEEE Transactions on, 51(9):3064–3074, 2005.

[WKV09] Qing Wang, Sanjeev R Kulkarni, and Sergio Verdu. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. Information Theory, IEEE Transactions on, 55(5):2392–2405, 2009.

[WY16] Yihong Wu and Pengkun Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702–3720, 2016.

[WY18] Yihong Wu and Pengkun Yang. Optimal estimation of Gaussian mixtures via denoised method of moments. arXiv preprint arXiv:1807.07237, 2018.

[WY19] Yihong Wu and Pengkun Yang. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. The Annals of Statistics, 47(2):857–883, 2019.

[ZG14] Zhiyi Zhang and Michael Grabchak. Nonparametric estimation of Kullback-Leibler divergence. Neural Computation, 26(11):2570–2593, 2014.

[ZVV+16] James Zou, Gregory Valiant, Paul Valiant, Konrad Karczewski, Siu On Chan, Kaitlin Samocha, Monkol Lek, Shamil Sunyaev, Mark Daly, and Daniel G. MacArthur. Quantifying unobserved protein-coding variants in human populations provides a roadmap for large-scale sequencing projects. Nature Communications, 7, 2016.
