On the Estimation of α-Divergences


Transcript of On the Estimation of α-Divergences


    On the Estimation of α-Divergences

Barnabás Póczos and Jeff Schneider
School of Computer Science, Carnegie Mellon University
Pittsburgh, PA, USA 15213

    Abstract

We propose new nonparametric, consistent Rényi-α and Tsallis-α divergence estimators for continuous distributions. Given two independent and identically distributed samples, a "naïve" approach would be to simply estimate the underlying densities and plug the estimated densities into the corresponding formulas. Our proposed estimators, in contrast, avoid density estimation completely, estimating the divergences directly using only simple k-nearest-neighbor statistics. We are nonetheless able to prove that the estimators are consistent under certain conditions. We also describe how to apply these estimators to mutual information and demonstrate their efficiency via numerical experiments.

    1 Introduction

Many statistical, artificial intelligence, and machine learning problems require efficient estimation of the divergence between two distributions. We assume that these distributions are not given explicitly. Only two finite, independent and identically distributed (i.i.d.) samples are given from the two underlying distributions. The Rényi-α (Rényi, 1961, 1970) and Tsallis-α (Villmann and Haase, 2010) divergences are two widely applied and prominent examples of probability divergences. The popular Kullback–Leibler (KL) divergence is a special case of these families, and they can also be related to Csiszár's f-divergence (Csiszár, 1967). Under certain conditions, these divergences can estimate entropy and can also be applied to estimate Rényi and Tsallis mutual information. For more examples and other possible applications of these divergences, see our extended technical report (Póczos and Schneider, 2011). Despite their wide applicability, there is no known direct, consistent estimator for the Rényi-α or Tsallis-α divergence.

Appearing in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, Fort Lauderdale, FL, USA. Volume 15 of JMLR: W&CP 15. Copyright 2011 by the authors.

An indirect way to obtain the desired estimates would be to use a "plug-in" estimation scheme: first, apply a consistent density estimator for the underlying densities, and then plug them into the desired formula. The unknown densities, however, are nuisance parameters in the case of divergence estimation, and we would prefer to avoid estimating them. Furthermore, density estimators usually have tunable parameters, and we may need expensive cross-validation to achieve good performance.

This paper provides a direct, L2-consistent estimator for the Tsallis-α divergence and a weakly consistent estimator for the Rényi-α divergence. These estimators can also be applied to (Rényi and Tsallis) mutual information.

The work most closely related to this paper is that of Wang et al. (2009a), who provided an estimator for the α → 1 limit case only, i.e., for the KL divergence. However, we warn the reader that there is an apparent error in their work; they applied the reverse Fatou lemma under conditions when it does not hold. It is not obvious how this portion of the proof can be remedied. This error originates in the work of Kozachenko and Leonenko (1987) and can also be found in other works. Hero et al. (2002a,b) also investigated the Rényi divergence estimation problem but assumed that one of the two density functions is known. Gupta and Srivastava (2010) developed algorithms for estimating the Shannon entropy and the KL divergence for certain parametric families. Recently, Nguyen et al. (2009, 2010) developed methods for estimating f-divergences using their variational characterization properties. They estimate the likelihood ratio of the two underlying densities and plug that into the divergence formulas. This approach involves solving a convex minimization problem over an infinite-dimensional function space. For certain function classes defined by reproducing kernel Hilbert spaces (RKHS), however, they were able to reduce the computational load from solving infinite-dimensional problems to solving n-dimensional problems, where n denotes the sample size. When n is large, solving these convex problems can still be very demanding. Furthermore, choosing an appropriate RKHS also introduces questions regarding model selection. An appealing property of our estimator is that we do not need to solve minimization problems over function classes; we only need to calculate certain k-nearest-neighbor (k-nn) based statistics. Recently, Sricharan et al. (2010) proposed k-nearest-neighbor based methods for estimating non-linear functionals of a density, but in contrast to our approach, they were interested in the case where k increases with the sample size.

Our work borrows ideas from Leonenko et al. (2008a) and Goria et al. (2005), who considered Shannon and Rényi-α entropy estimation from a single sample.¹ In contrast, we propose divergence estimators using two independent samples. Recently, Póczos et al. (2010) and Pál et al. (2010) proposed methods for consistent Rényi information estimation, but these estimators also use only one sample and cannot be used for estimating divergences. Further information and useful reviews of several different divergences can be found, e.g., in Villmann and Haase (2010), Cichocki et al. (2009), and Wang et al. (2009b).

The paper is organized as follows. In the next section we formally define our estimation problem, introduce the Rényi-α and Tsallis-α divergences, and explain their most important properties. Section 3 briefly introduces k-nn based density estimators. We propose estimators for the Rényi-α and Tsallis-α divergences in Section 4 and also present our most important theoretical results about the asymptotic unbiasedness and consistency of the estimators. For their analysis we will need a few general tools, which we collect in Section 5. We prove the asymptotic unbiasedness of our estimators in Section 6. Due to a lack of space, we provide many details of the proofs in Póczos and Schneider (2011). The analysis of the asymptotic variances of our estimators follows an approach similar to that for their biases but is more complex; therefore, we relegate this material to Póczos and Schneider (2011) as well. Section 7 contains the results of numerical experiments that demonstrate the effectiveness of our proposed algorithm. We also demonstrate in that section how our divergence estimators can be used for Rényi and Tsallis mutual information (MI) estimation. Finally, we conclude with a discussion of our work.

¹ The original presentations of these works contained some errors; Leonenko and Pronzato (2010) provide corrections for some of these theorems.

    2 Divergences

For the remainder of this work we will assume that M0 ⊂ Rd is a measurable set with respect to the d-dimensional Lebesgue measure and that p and q are densities on this domain. The sets where they are strictly positive will be denoted by supp(p) and supp(q), respectively.

Let p and q be density functions on M0 ⊆ Rd, and let α ∈ R \ {0, 1}. The α-divergence D̃α(p‖q) (Cichocki et al., 2008) is defined as

D̃α(p‖q) := [1/(α(1 − α))] [1 − ∫_{M0} p^α(x) q^{1−α}(x) dx],   (1)

assuming this integral exists. One can see that this is a special case of Csiszár's f-divergence (Csiszár, 1967) and hence it is always nonnegative.² Divergences closely related to (1), but not special cases of it, are the Rényi-α (Rényi, 1961) and the Tsallis-α (Villmann and Haase, 2010) divergences.

Definition 1. Let p, q be density functions on M0 ⊆ Rd and let α ∈ R \ {1}. The Rényi-α divergence is defined as

Rα(p‖q) := [1/(α − 1)] log ∫_{M0} p^α(x) q^{1−α}(x) dx.   (2)

The Tsallis-α divergence is defined as

Tα(p‖q) := [1/(α − 1)] ( ∫_{M0} p^α(x) q^{1−α}(x) dx − 1 ).   (3)

Both definitions assume that the corresponding integral exists.

We can see that as α → 1 these divergences converge to the KL divergence. The following lemma summarizes the behavior of these divergences.

Lemma 2.

α < 0 ⇒ Rα(p‖q) ≤ 0, Tα(p‖q) ≤ 0;
α = 0 ⇒ Rα(p‖q) = Tα(p‖q) = 0;
0 < α < 1 ⇒ Rα(p‖q) ≥ 0, Tα(p‖q) ≥ 0;
α = 1 ⇒ Rα(p‖q) = Tα(p‖q) = KL(p‖q) ≥ 0;
1 < α ⇒ Rα(p‖q) ≥ 0, Tα(p‖q) ≥ 0.

We are now prepared to formally define the goal of our paper. Given two independent i.i.d. samples from distributions with densities p and q, respectively, we provide an L2-consistent estimator for

Dα(p‖q) := ∫_{M0} p^α(x) q^{1−α}(x) dx.   (4)

² See the Appendix for more details.

By plugging our estimate of (4) into (3) and (2), we immediately get an L2-consistent estimator for Tα(p‖q), as well as a weakly consistent estimator for Rα(p‖q) for α ≠ 1.
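To make the plug-in step concrete, here is a minimal sketch (our own helper, not part of the paper) that maps an estimate of (4) to the Tsallis and Rényi estimates via (3) and (2):

```python
import math

def tsallis_renyi_from_D(D_alpha_hat, alpha):
    """Plug an estimate of D_alpha = int p^alpha q^(1-alpha) into (3) and (2)."""
    assert alpha != 1.0  # alpha = 1 is the KL limit, handled separately
    tsallis = (D_alpha_hat - 1.0) / (alpha - 1.0)   # Tsallis-alpha divergence, equation (3)
    renyi = math.log(D_alpha_hat) / (alpha - 1.0)   # Renyi-alpha divergence, equation (2)
    return tsallis, renyi
```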

    3 k-nn Based Density Estimators

In the remainder of this paper we will heavily exploit some properties of k-nn based density estimators. In this section we define these estimators and briefly summarize their most important properties.

k-nn density estimators operate using only distances between the observations in a given sample and their kth nearest neighbors (breaking ties arbitrarily). Let X1:n := (X1, . . . , Xn) be an i.i.d. sample from a distribution with density p, and similarly let Y1:m := (Y1, . . . , Ym) be an i.i.d. sample from a distribution having density q. Let ρk(i) denote the Euclidean distance of the kth nearest neighbor of Xi in the sample X1:n, and similarly let νk(i) denote the distance of the kth nearest neighbor of Xi in the sample Y1:m. Let B(x, R) denote a closed ball around x ∈ Rd with radius R, and let V(B(x, R)) = c̄R^d be its volume, where c̄ stands for the volume of a d-dimensional unit ball.

Loftsgaarden and Quesenberry (1965) define the k-nn based density estimators of p and q at Xi as follows.

Definition 3 (k-nn based density estimators).

p̂k(Xi) = [k/(n − 1)] / V(B(Xi, ρk(i))) = k / [(n − 1) c̄ ρk^d(i)],   (5)

q̂k(Xi) = [k/m] / V(B(Xi, νk(i))) = k / [m c̄ νk^d(i)].   (6)

The following theorems show the consistency of these density estimators.³

Theorem 4 (k-nn density estimators, convergence in probability). If k(n) denotes the number of neighbors applied at sample size n, lim_{n→∞} k(n) = ∞, and lim_{n→∞} n/k(n) = ∞, then p̂k(n)(x) →p p(x) for almost all x.

Theorem 5 (k-nn density estimators, almost sure convergence in sup norm). If lim_{n→∞} k(n)/log(n) = ∞ and lim_{n→∞} n/k(n) = ∞, then lim_{n→∞} sup_x |p̂k(n)(x) − p(x)| = 0 almost surely.

³ We use Xn →p X and Xn →d X to represent convergence of random variables in probability and in distribution, respectively. Fn →w F will denote the weak convergence of distribution functions.

Note that these estimators are consistent only when k(n) → ∞. We will use these density estimators in our proposed divergence estimators; however, we will keep k fixed and will still be able to prove their consistency.
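For concreteness, the following brute-force numpy sketch implements (5) and (6); the names unit_ball_volume and knn_density are ours, and ties or zero distances are not handled carefully:

```python
import numpy as np
from math import gamma, pi

def unit_ball_volume(d):
    # c-bar: volume of the d-dimensional unit Euclidean ball
    return pi ** (d / 2.0) / gamma(d / 2.0 + 1.0)

def knn_density(sample, queries, k, exclude_self=False):
    """k-nn density estimate (Loftsgaarden-Quesenberry) at each query point."""
    n, d = sample.shape
    dists = np.linalg.norm(queries[:, None, :] - sample[None, :, :], axis=2)
    if exclude_self:
        # queries are the sample itself: ignore the zero distance to the point itself
        np.fill_diagonal(dists, np.inf)
    m = n - 1 if exclude_self else n
    rho_k = np.sort(dists, axis=1)[:, k - 1]               # distance to the k-th nearest neighbor
    return k / (m * unit_ball_volume(d) * rho_k ** d)      # equations (5) and (6)

# p-hat_k(X_i) and q-hat_k(X_i) from two samples X ~ p, Y ~ q:
# p_hat = knn_density(X, X, k, exclude_self=True); q_hat = knn_density(Y, X, k)
```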

    4 An Estimator for Dα(p‖q)

In this section we introduce our estimator for Dα(p‖q) and claim its L2 consistency in the form of several theorems. From now on we will assume that (4) can be rewritten as

Dα(p‖q) = ∫_M (q(x)/p(x))^{1−α} p(x) dx,   (7)

where M = supp(p). In other words, in the definition of Dα(p‖q), it is enough to integrate on the support of p. There are other possible ways to rewrite Dα(p‖q) (such as ∫ (q/p)^{1−α} p, ∫ (p/q)^α q, or ∫ (q/p)^{−α} q), and we could start our analysis from these forms as well. If we simply plugged (5) and (6) into (7), then we could estimate Dα(p‖q) with

(1/n) Σ_{i=1}^{n} [ (n − 1) ρk^d(i) / (m νk^d(i)) ]^{1−α};

however, this estimator is asymptotically biased. We will prove that by introducing a multiplicative term the following estimator is asymptotically unbiased under certain conditions:

D̂α(X1:n‖Y1:m) := (1/n) Σ_{i=1}^{n} [ (n − 1) ρk^d(i) / (m νk^d(i)) ]^{1−α} Bk,α,   (8)

where Bk,α := Γ(k)² / [Γ(k − α + 1) Γ(k + α − 1)]. Notably, this multiplicative term does not depend on p or q. The following theorems of this section contain our main results: D̂α(X1:n‖Y1:m) is an L2-consistent estimator for Dα(p‖q), i.e., it is asymptotically unbiased, and the variance of the estimator is asymptotically zero.
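A minimal sketch of the estimator in (8) might look as follows (the function name alpha_divergence_estimate is ours; scipy's k-d tree supplies the k-nn distances, and no input validation is done):

```python
import numpy as np
from math import lgamma, exp
from scipy.spatial import cKDTree

def alpha_divergence_estimate(X, Y, k, alpha):
    """Sketch of D-hat_alpha in (8) from i.i.d. samples X ~ p (n x d) and Y ~ q (m x d)."""
    n, d = X.shape
    m = Y.shape[0]
    # rho_k(i): distance from X_i to its k-th nearest neighbor within X \ {X_i}
    rho = cKDTree(X).query(X, k=k + 1)[0][:, k]
    # nu_k(i): distance from X_i to its k-th nearest neighbor within Y
    dY = cKDTree(Y).query(X, k=k)[0]
    nu = dY[:, -1] if k > 1 else dY
    # B_{k,alpha} = Gamma(k)^2 / (Gamma(k - alpha + 1) * Gamma(k + alpha - 1)), via log-gammas
    B = exp(2.0 * lgamma(k) - lgamma(k - alpha + 1.0) - lgamma(k + alpha - 1.0))
    ratios = ((n - 1.0) * rho ** d / (m * nu ** d)) ** (1.0 - alpha)
    return B * np.mean(ratios)
```

Note that the unit-ball volume c̄ appearing in (5) and (6) cancels in the ratio of the two density estimates, which is why it does not show up in (8).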

In our theorems we will assume that almost all points of M are in its interior and that M has the following additional property:

inf_{0<δ<δ0} inf_{x∈M} V(B(x, δ) ∩ M) / V(B(x, δ)) > 0.


Theorem 6 (Asymptotic unbiasedness). Assume that (a) 0 < γ := 1 − α < k, (b) p is bounded away from zero, (c) p is uniformly Lebesgue approximable, (d) ∃δ0 such that ∀δ ∈ (0, δ0), ∫_M H(x, p, δ, 1) p(x) dx < ∞, (e) ∫_M ‖x − y‖^γ p(y) dy < ∞ for almost all x ∈ M, (f) ∫∫_{M²} ‖x − y‖^γ p(y) p(x) dy dx < ∞, and that (g) q is bounded from above. Then

lim_{n,m→∞} E[D̂α(X1:n‖Y1:m)] = Dα(p‖q),

i.e., the estimator is asymptotically unbiased.

For the definition of a uniformly Lebesgue approximable function, see Definition 14. The following theorem states that the estimator is asymptotically unbiased when −k < γ := 1 − α < 0.

Theorem 7 (Asymptotic unbiasedness). Assume that (a) −k < γ := 1 − α < 0, (b) q is bounded away from zero, (c) q is uniformly Lebesgue approximable, (d) ∃δ0 such that ∀δ ∈ (0, δ0), ∫_M H(x, q, δ, 1) p(x) dx < ∞, (e) ∫_M ‖x − y‖^γ q(y) dy < ∞ for almost all x ∈ M, (f) ∫∫_{M²} ‖x − y‖^γ q(y) p(x) dy dx < ∞, (g) p is bounded from above, and that (h) supp(p) ⊆ supp(q). In this case, the estimator is asymptotically unbiased.

The following theorems provide conditions under which D̂ is L2 consistent. In the previous theorems we stated conditions that lead to asymptotically unbiased divergence estimation. In all of the following theorems we will assume that the estimator is asymptotically unbiased for the parameter γ = 1 − α as well as for a new parameter γ̃ := 2(1 − α) (corresponding to α̃ := 2α − 1), and also assume that max(Dα(p‖q), Dα̃(p‖q)) < ∞.


Figure 1: (a) a feasible domain M and (b) a domain that is not allowed under the property in (13).

Lemma 13 (Lebesgue (1910)). If g ∈ L1(Rd), then for any sequence of open balls B(x, Rn) with radii Rn → 0, and for almost all x ∈ Rd,

lim_{n→∞} ∫_{B(x,Rn)} g(t) dt / V(B(x, Rn)) = g(x).   (11)

This implies that if M ⊂ Rd is a Lebesgue-measurable set and g ∈ L1(M), then for any sequence Rn → 0, for any δ > 0, and for almost all x ∈ M, there exists an n0(x, δ) ∈ Z+ such that if n > n0(x, δ), then

g(x) − δ < ∫_{B(x,Rn)} g(t) dt / V(B(x, Rn)) < g(x) + δ.   (12)

We will later require a generalization of this property; namely, we will need it to hold uniformly over x ∈ M. However, for this generalization to hold we must put slight restrictions on the domain M to avoid effects around its boundary. We will consider only those domains M that possess the property that the intersection of M with an arbitrarily small ball having center in M has volume that cannot be arbitrarily small relative to the volume of the ball. To be more formal, we want the following inequality to be satisfied:

inf_{0<δ<δ0} inf_{x∈M} V(B(x, δ) ∩ M) / V(B(x, δ)) > 0.   (13)

Definition 14 (Uniformly Lebesgue approximable). We say that g is uniformly Lebesgue approximable on M if for any sequence Rn → 0 and for any δ > 0, there exists an n = n0(δ) ∈ Z+ (independent of x) such that if n > n0, then for almost all x ∈ M,

g(x) − δ < ∫_{B(x,Rn)∩M} g(t) dt / V(B(x, Rn) ∩ M) < g(x) + δ.   (14)

This property is a uniform variant of (12). The following lemma provides examples of uniformly Lebesgue-approximable functions.

Lemma 15. If g is uniformly continuous on M, then it is uniformly Lebesgue approximable on M.

Finally, as we proceed we will frequently use the following lemma.

Lemma 16 (Moments of the Erlang distribution). Let f_{x,k}(u) := (1/Γ(k)) λ^k(x) u^{k−1} exp(−λ(x)u) be the density of the Erlang distribution with parameters λ(x) > 0 and k ∈ Z+. Let γ ∈ R be such that γ + k > 0. The γth moment of this Erlang distribution can be calculated as

∫_0^∞ u^γ f_{x,k}(u) du = λ(x)^{−γ} Γ(k + γ) / Γ(k).
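As a quick numerical sanity check of Lemma 16 (our own illustration, not part of the paper), the closed form can be compared against a crude Riemann sum:

```python
import numpy as np
from math import gamma

def erlang_moment_closed_form(lam, k, g):
    # lambda^(-gamma) * Gamma(k + gamma) / Gamma(k), valid when gamma + k > 0
    return lam ** (-g) * gamma(k + g) / gamma(k)

def erlang_moment_numeric(lam, k, g, u_max=100.0, num=400_000):
    u = np.linspace(1e-9, u_max, num)
    du = u[1] - u[0]
    density = lam ** k * u ** (k - 1) * np.exp(-lam * u) / gamma(k)  # Erlang(k, lambda) density
    return np.sum(u ** g * density) * du

print(erlang_moment_closed_form(2.0, 3, 0.5))  # closed form from Lemma 16
print(erlang_moment_numeric(2.0, 3, 0.5))      # should agree to several decimals
```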

    6 Proving Asymptotic Unbiasedness

The following subsection contains several specific lemmas and theorems that we will use for proving the consistency of the proposed estimator in (8).

    6.1 Preliminaries

Recall that ρk(j) is a random variable that measures the distance from Xj to its kth nearest neighbor in X1:n \ {Xj}.

Lemma 17. Let ζ_{n,k,1} := (n − 1) ρk^d(1) be a random variable, and let F_{n,k,x}(u) := Pr(ζ_{n,k,1} < u | X1 = x) denote its conditional distribution function. Then

F_{n,k,x}(u) = 1 − Σ_{j=0}^{k−1} (n−1 choose j) (P_{n,u,x})^j (1 − P_{n,u,x})^{n−1−j},   (15)

where P_{n,u,x} := ∫_{M ∩ B(x, Rn(u))} p(t) dt and Rn(u) := (u/(n − 1))^{1/d}.

We also have the following (Leonenko et al., 2008a).

Lemma 18. F_{n,k,x} →w F_{k,x} for almost all x ∈ M, where F_{k,x}(u) := 1 − exp(−λu) Σ_{j=0}^{k−1} (λu)^j / j! is the Erlang distribution function with λ = c̄ p(x).
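As an aside, a small Monte-Carlo experiment (our own illustration) makes Lemma 18 tangible: for the uniform density on the unit square we have λ = c̄ p(x) = π at an interior point x, so the sample mean of ζ_{n,k,1} = (n − 1) ρk^d(1) should approach the Erlang mean k/λ:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n, trials = 2, 4, 5000, 2000
x = np.full(d, 0.5)            # a fixed interior point of [0, 1]^2
cbar = np.pi                   # volume of the unit ball for d = 2
zeta = np.empty(trials)
for t in range(trials):
    others = rng.random((n - 1, d))                         # the remaining n-1 points, p = uniform
    rho_k = np.sort(np.linalg.norm(others - x, axis=1))[k - 1]
    zeta[t] = (n - 1) * rho_k ** d
print(zeta.mean(), k / cbar)   # empirical mean vs. the Erlang mean k / (cbar * p(x))
```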

Lemma 19. Let ξ_{n,k,x} and ξ_{k,x} be random variables with distribution functions F_{n,k,x} and F_{k,x}, respectively, and let γ ∈ R be arbitrary. Then for almost all x ∈ M we have that ξ_{n,k,x}^γ →d ξ_{k,x}^γ.

Theorem 20. For almost all x ∈ M the following statements hold. If (i) −k < γ < 0, or (ii) 0 ≤ γ and ∫_M ‖x − y‖^γ p(y) dy < ∞, then lim_{n→∞} E[ξ_{n,k,x}^γ] = E[ξ_{k,x}^γ].


Similarly, if (i) −k < γ < 0, or (ii) 0 ≤ γ and ∫_M ‖x − y‖^γ q(y) dy < ∞, then the analogous statement holds for the distances νk measured in the sample Y1:m. The proof proceeds by noting that k + γ > 0 and using Lemma 16.

All that remains is to prove that if ξ_{n,k,x}^γ →d ξ_{k,x}^γ, then E[ξ_{n,k,x}^γ] → E[ξ_{k,x}^γ]. To see this, it is enough to show (according to Theorem 12) that for some ε > 0 and c(x) < ∞ the moments E[|ξ_{n,k,x}^γ|^{1+ε}] stay bounded by c(x) uniformly in n.

Lemma 22. Let 0 < γ := 1 − α < k, let p be uniformly Lebesgue approximable on M and bounded away from zero, and let q be bounded above by q̄. Let δ1 > 0, and let δ > 0 be so small that p(x) − δ > 0 for all x ∈ M. Then there exists an Np,q > 0 such that if m, n > Np,q, then for almost all x ∈ M,

fn(x) gm(x) ≤ γ² L(x, 1, k, γ, p, δ, δ1) [ L̂(q̄, 1)/(k − γ) + 1/γ ],

where L̂(q̄, β) := (q̄ c̄)^k exp(q̄ c̄ β).

Similarly, for the −k < γ := 1 − α < 0 case we have the following lemma.

Lemma 23. Let −k < γ := 1 − α < 0, and let supp(p) ⊆ supp(q). Furthermore, let q be uniformly Lebesgue approximable on M = supp(p) and bounded away from zero. Let p be bounded above by p̄. Let δ1 > 0, and let δ > 0 be so small that q(x) − δ > 0 for all x ∈ M. Then there exists an Np,q > 0 such that if m, n > Np,q, then for almost all x ∈ supp(p),

fn(x) gm(x) ≤ γ² L(x, 1, k, −γ, q, δ, δ1) [ L̂(p̄, 1)/(k + γ) − 1/γ ].


Now, for the two cases 0 < γ = 1 − α < k and −k < γ = 1 − α < 0, we can see that under the conditions detailed in Theorems 6 and 7 there exists a function J and a threshold number Np,q such that if n, m > Np,q, then for almost all x ∈ M, fn(x) gm(x) ≤ J(x) and ∫_M J(x) p(x) dx < ∞. Applying the Lebesgue dominated convergence theorem finishes the proofs of these theorems.

    7 Numerical Experiments

In this section we present a few numerical experiments to demonstrate the consistency of the proposed divergence estimators. We run experiments on beta distributions, where the domains are bounded, and we also study normal distributions, which have unbounded domains. We chose these distributions because in these cases the divergences have known closed-form expressions, and thus it is easy to evaluate our methods. We will also demonstrate that the proposed divergence estimators can be applied to estimate mutual information. We note that in our simulations the numerical results were very similar for the estimation of Rα and Tα; therefore, we will only present our results for the Rα case.

    7.1 Normal distributions

We begin our discussion by investigating the performance of our divergence estimators on normal distributions. Note that when α ∉ [0, 1], the divergences can easily become unbounded.⁵

In Figure 2 we display the performance of the proposed D̂α and R̂α divergence estimators when the underlying densities were zero-mean Gaussians with randomly chosen 5-dimensional covariance matrices. Our results demonstrate that when we increase the sample sizes n and m, the D̂α and R̂α values converge to their true values. For simplicity, in our experiments we always set n = m. The figures show five independent experiments; the number of instances was varied between 50 and 25,000. The number of nearest neighbors k was set to 8, and α to 0.8.
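To reproduce the flavor of this experiment one can reuse the alpha_divergence_estimate sketch from Section 4; the closed form below is our own addition (the standard zero-mean Gaussian evaluation of the integral in (4), valid whenever α ∈ (0, 1)):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_covariance(d):
    A = rng.standard_normal((d, d))
    return A @ A.T + d * np.eye(d)          # a well-conditioned random covariance matrix

def true_D_alpha_gaussian(S0, S1, alpha):
    # For N(0, S0) vs. N(0, S1):
    # D_alpha = |alpha S0^-1 + (1-alpha) S1^-1|^(-1/2) / (|S0|^(alpha/2) |S1|^((1-alpha)/2))
    A = alpha * np.linalg.inv(S0) + (1.0 - alpha) * np.linalg.inv(S1)
    return 1.0 / np.sqrt(np.linalg.det(A)
                         * np.linalg.det(S0) ** alpha
                         * np.linalg.det(S1) ** (1.0 - alpha))

d, k, alpha = 5, 8, 0.8
S0, S1 = random_covariance(d), random_covariance(d)
for n in (100, 1000, 10000):
    X = rng.multivariate_normal(np.zeros(d), S0, size=n)
    Y = rng.multivariate_normal(np.zeros(d), S1, size=n)
    print(n, alpha_divergence_estimate(X, Y, k, alpha), true_D_alpha_gaussian(S0, S1, alpha))
```

As the sample size grows, the estimates should approach the closed-form value, mirroring Figure 2.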

    7.2 Beta distributions

We were also interested in examining the performance of our estimators on beta distributions. To be able to study multidimensional cases, we construct d-dimensional distributions with independent 1-dimensional beta distributions as marginals. For a closed-form expression of the true divergence in this case, see the Appendix.

⁵ See the Appendix for the details.

Figure 2: Estimated vs. true divergence for the normal distribution experiments as a function of the number of observations. The results of five independent experiments are shown for estimating the (a) Dα(f‖g) and (b) Rα(f‖g) divergences.

Our first experiment, illustrated in Figures 3(a)–3(b), indicates that the estimators are consistent when d = 2; as we increase the number of instances, the estimators converge to the true Dα(f‖g) and Rα(f‖g) values. The figures show five independent experiments, varying the sample size between 100 and 10,000. α was set to 0.4, and we used k = 4 nearest neighbors in the density estimates. The parameters of the beta distributions were chosen independently and uniformly at random from [1, 2]. We repeated this experiment in 5d as well. The 5d results, shown in Figures 3(c)–3(d), show that the estimators were also consistent in this case.

    7.3 Mutual information estimation

In this section we demonstrate that the proposed divergence estimators can also be used to estimate mutual information. Let f be the density of a d-dimensional distribution with marginal densities f1, . . . , fd. The mutual information Iα(f) is the divergence between f and the product of the marginals; in particular, for the Rényi divergence we have Iα(f) = Rα(f‖∏_{i=1}^d fi). Therefore, if we are given a sample X1, . . . , X2n from f, we may estimate the mutual information as follows. We form one set of size n by setting aside the first n samples. We build another sample by randomly permuting the coordinates of the remaining n observations independently for each coordinate, forming n instances sampled from ∏_{i=1}^d fi (see the sketch below). Using these two sets, we can estimate Iα(f). Figures 4(a)–4(b) show the results of applying this procedure for a 2d Gaussian distribution with a randomly chosen covariance matrix. The subfigures show the true Dα and Rα values, as well as their estimates using different sample sizes. k was set to 8, and α was 0.8.
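A minimal sketch of this shuffling construction (the helper name renyi_mi_estimate is ours, and it reuses the alpha_divergence_estimate sketch from Section 4):

```python
import numpy as np

def renyi_mi_estimate(X, k, alpha, rng=None):
    """Sketch: Renyi mutual information I_alpha(f) = R_alpha(f || prod_i f_i) from a 2n x d sample."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0] // 2
    joint = X[:n]                                        # first half: a sample from f
    product = X[n:2 * n].copy()                          # second half, to be decorrelated
    for j in range(X.shape[1]):
        product[:, j] = rng.permutation(product[:, j])   # permute each coordinate independently
    d_hat = alpha_divergence_estimate(joint, product, k, alpha)
    return np.log(d_hat) / (alpha - 1.0)                 # R_alpha via equation (2)
```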

Figures 4(c)–4(d) show the results of repeating the previous experiment with two alterations. In this case we estimated the Shannon (rather than Rényi) mutual information.


Figure 3: Estimated vs. true divergence for the beta distribution experiments as a function of the number of observations. The figures show the results of five independent experiments for estimating the Dα(f‖g) and Rα(f‖g) divergences. (a,b): f and g were the densities of two 2d beta distributions; the marginal distributions were independent 1d betas with randomly chosen parameters. (c,d): The same as (a,b), but here f and g were the densities of two 5d beta distributions with independent marginals.

Figure 4: Estimated vs. true Rényi information for the mutual information experiments as a function of the number of observations. (a) and (b) show five independent experiments for estimating Dα(f‖g) and Iα(f) = Rα(f‖g) for a 2d Gaussian distribution using sample sizes between 100 and 20,000. In (c)–(d), we estimated the mutual information between the marginals of a 2d uniform distribution rotated by π/4. The sample size was varied from 500 to 40,000.

For this purpose we selected a 2d uniform distribution on [−1/2, 1/2]² rotated by π/4. Due to this rotation, the marginal distributions are no longer independent. Because our goal was to estimate the Shannon information, we set α to 0.9999. The number of nearest neighbors was k = 8, and the sample size was varied between 500 and 40,000. The estimators gave quite good results for the Shannon mutual information as well as for D1(f‖∏_{i=1}^d fi) = 1.
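A sketch of this α ≈ 1 setup (our own code, reusing the renyi_mi_estimate helper above):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40000
U = rng.random((n, 2)) - 0.5                    # uniform on [-1/2, 1/2]^2
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = U @ R.T                                     # the rotation makes the marginals dependent
print(renyi_mi_estimate(X, k=8, alpha=0.9999, rng=rng))   # close to the Shannon MI of the marginals
```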

    8 Discussion and Conclusion

We have derived new nonparametric estimators for the Rényi-α and Tsallis-α divergences, two important quantities with several applications in machine learning and statistics. Under certain conditions we showed the consistency of these estimators and how they can be applied to estimate mutual information. We also demonstrated their efficiency using numerical experiments.

The main idea in the proofs of our new theorems was that the expected value of our estimator can be rewritten as in (16). We showed that asymptotically the terms inside this expectation converge to the Erlang distribution and applied the well-known formulas for its moments. The main difficulty was to show that we can indeed switch the limit and expectation operators; that is, the limit of expectations equals the expectation of the limits of the random variables. For this purpose we bounded these random variables from above and applied the Lebesgue dominated convergence theorem. To derive a bound on these random variables, we made several assumptions on the densities p and q.

There remain some open issues: the conditions of the theorems could be weakened considerably, and the rates of the estimators are still unknown. It would also be desirable to investigate whether the proposed estimators are asymptotically normal.

    References

Cichocki, A., Lee, H., Kim, Y.-D., and Choi, S. (2008). Non-negative matrix factorization with α-divergence. Pattern Recognition Letters.


Cichocki, A., Zdunek, R., Phan, A., and Amari, S.-I. (2009). Nonnegative Matrix and Tensor Factorizations. John Wiley and Sons.

Csiszár, I. (1967). Information-type measures of differences of probability distributions and indirect observations. Studia Sci. Math. Hungarica, 2:299–318.

Goria, M. N., Leonenko, N. N., Mergel, V. V., and Inverardi, P. L. N. (2005). A new class of random vector entropy estimators and its applications in testing statistical hypotheses. Journal of Nonparametric Statistics, 17:277–297.

Gupta, M. and Srivastava, S. (2010). Parametric Bayesian estimation of differential entropy and relative entropy. Entropy, 12:818–843.

Hero, A. O., Ma, B., Michel, O., and Gorman, J. (2002a). Alpha-divergence for classification, indexing and retrieval. Communications and Signal Processing Laboratory Technical Report CSPL-328.

Hero, A. O., Ma, B., Michel, O. J. J., and Gorman, J. (2002b). Applications of entropic spanning graphs. IEEE Signal Processing Magazine, 19(5):85–95.

Kozachenko, L. F. and Leonenko, N. N. (1987). A statistical estimate for the entropy of a random vector. Problems of Information Transmission, 23:9–16.

Leonenko, N. and Pronzato, L. (2010). Correction of 'A class of Rényi information estimators for multidimensional densities', Ann. Statist., 36 (2008), 2153–2182.

Leonenko, N., Pronzato, L., and Savani, V. (2008a). A class of Rényi information estimators for multidimensional densities. Annals of Statistics, 36(5):2153–2182.

Leonenko, N., Pronzato, L., and Savani, V. (2008b). Estimation of entropies and divergences via nearest neighbours. Tatra Mt. Mathematical Publications, 39.

Loftsgaarden, D. O. and Quesenberry, C. P. (1965). A nonparametric estimate of a multivariate density function. Ann. Math. Statist., 36:1049–1051.

Nguyen, X., Wainwright, M., and Jordan, M. (2009). On surrogate loss functions and f-divergences. Annals of Statistics, 37:876–904.

Nguyen, X., Wainwright, M., and Jordan, M. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, to appear.

Pál, D., Póczos, B., and Szepesvári, C. (2010). Estimation of Rényi entropy and mutual information based on generalized nearest-neighbor graphs. In Proceedings of the Neural Information Processing Systems.

Póczos, B., Kirshner, S., and Szepesvári, C. (2010). REGO: Rank-based estimation of Rényi information using Euclidean graph optimization. In AISTATS 2010.

Póczos, B. and Schneider, J. (2011). On the estimation of alpha-divergences. CMU, Auton Lab Technical Report, http://www.cs.cmu.edu/~bapoczos/articles/poczos11alphaTR.pdf.

Rényi, A. (1961). On measures of entropy and information. In Fourth Berkeley Symposium on Mathematical Statistics and Probability. University of California Press.

Rényi, A. (1970). Probability Theory. North-Holland Publishing Company, Amsterdam.

Sricharan, K., Raich, R., and Hero, A. (2010). Empirical estimation of entropy functionals with confidence. Technical Report, http://arxiv.org/abs/1012.4188.

van der Vaart, A. W. (2007). Asymptotic Statistics. Cambridge University Press.

Villmann, T. and Haase, S. (2010). Mathematical aspects of divergence based vector quantization using Fréchet-derivatives. University of Applied Sciences Mittweida.

Wang, Q., Kulkarni, S. R., and Verdú, S. (2009a). Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Transactions on Information Theory, 55(5).

Wang, Q., Kulkarni, S. R., and Verdú, S. (2009b). Universal estimation of information measures for analog sources. Foundations and Trends in Communications and Information Theory, 5(3):265–352.