
Math 541: Statistical Theory II

Sufficient Statistics and Exponential Family

Lecturer: Songfeng Zheng

1 Statistics and Sufficient Statistics

Suppose we have a random sample X1, · · · , Xn taken from a distribution f(x|θ) which depends on an unknown parameter θ in a parameter space Θ. The purpose of parameter estimation is to estimate the parameter θ from the random sample.

We have already studied three parameter estimation methods: the method of moments, maximum likelihood, and Bayes estimation. We can see from the previous examples that the estimators can be expressed as functions of the random sample X1, · · · , Xn. Such a function is called a statistic.

Formally, any real-valued function T = r(X1, · · · , Xn) of the observations in the sample is called a statistic. Such a function must not involve any unknown parameter. For example, given a random sample X1, · · · , Xn, the quantities X̄, max(X1, · · · , Xn), median(X1, · · · , Xn), and r(X1, · · · , Xn) = 4 are statistics; however, X1 + µ is not a statistic if µ is unknown.

For the parameter estimation problem, we know nothing about the parameter except the observations from the distribution. Therefore, the observations X1, · · · , Xn are our first-hand source of information about the parameter; that is to say, all the available information about the parameter is contained in the observations. However, the estimators we obtain are always functions of the observations, i.e., the estimators are statistics, e.g. the sample mean, the sample standard deviation, etc. In some sense, this process can be thought of as “compressing” the original observation data: initially we have n numbers, but after the “compression” we have only one number. This “compression” can make us lose information about the parameter; it can never give us more information. The best case is that the “compressed” result contains the same amount of information as the n observations. We call such a statistic a sufficient statistic. From this intuitive analysis, we can see that a sufficient statistic “absorbs” all the available information about θ contained in the sample. This concept was introduced by R. A. Fisher in 1922.

If T (X1, · · · , Xn) is a statistic and t is a particular value of T, then the conditional joint distribution of X1, · · · , Xn given that T = t can be calculated. In general, this joint conditional distribution will depend on the value of θ. Therefore, for each value of t, there will be a family of possible conditional distributions corresponding to the different possible values of θ ∈ Θ. However, it may happen that for each possible value of t, the conditional joint distribution of X1, · · · , Xn given that T = t is the same for all values of θ ∈ Θ and therefore does not actually depend on the value of θ. In this case, we say that T is a sufficient statistic for the parameter θ.

Formally, a statistic T (X1, · · · , Xn) is said to be sufficient for θ if the conditional distribution of X1, · · · , Xn, given T = t, does not depend on θ for any value of t.

In other words, given the value of T, we can gain no more knowledge about θ from knowing more about the probability distribution of X1, · · · , Xn. We could envision keeping only T and throwing away all the Xi without losing any information!

The concept of sufficiency arises as an attempt to answer the following question: Is there a statistic, i.e. a function T (X1, · · · , Xn), that contains all the information in the sample about θ? If so, a reduction or compression of the original data to this statistic without loss of information is possible. For example, consider a sequence of independent Bernoulli trials with unknown probability of success, θ. We may have the intuitive feeling that the total number of successes contains all the information about θ that is in the sample, and that the order in which the successes occurred, for example, does not give any additional information about θ.

Example 1: Let X1, · · · , Xn be a sequence of independent Bernoulli trials with P(Xi = 1) = θ. We will verify that T = ∑_{i=1}^n Xi is sufficient for θ.

Proof: We have

P(X1 = x1, · · · , Xn = xn | T = t) = P(X1 = x1, · · · , Xn = xn) / P(T = t)

Bearing in mind that each Xi can take on only the values 0 or 1, the probability in the numerator is the probability that some particular set of t of the Xi are equal to 1 and the other n − t are equal to 0. Since the Xi are independent, this probability is θ^t (1 − θ)^{n−t}. To find the denominator, note that the distribution of T, the total number of ones, is binomial with n trials and probability of success θ. Therefore the ratio in the above equation is

θ^t (1 − θ)^{n−t} / [ C(n, t) θ^t (1 − θ)^{n−t} ] = 1 / C(n, t),

where C(n, t) = n!/(t!(n − t)!) denotes the binomial coefficient.

The conditional distribution thus does not involve θ at all. Given the total number of ones, the probability that they occur on any particular set of t trials is the same for any value of θ, so that particular set of trials contains no additional information about θ.
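As a sanity check, the following is a small numerical sketch (my own illustration, not part of the original notes): it simulates Bernoulli samples, conditions on T = t, and estimates the probability of one particular arrangement of the ones; the estimate should be close to 1/C(n, t) for any θ.

```python
# Monte Carlo check of Example 1 (illustrative sketch, not from the notes):
# given T = sum(X_i) = t, any particular arrangement of the ones should have
# conditional probability 1/C(n, t), no matter what theta is.
import random
from math import comb

def conditional_freq(theta, n=5, t=2, trials=200_000, seed=0):
    rng = random.Random(seed)
    target = (1, 1, 0, 0, 0)          # one particular arrangement with t ones
    hits = total = 0
    for _ in range(trials):
        x = tuple(1 if rng.random() < theta else 0 for _ in range(n))
        if sum(x) == t:               # condition on the event T = t
            total += 1
            hits += (x == target)
    return hits / total

# Both estimates should be close to 1/C(5, 2) = 0.1, regardless of theta.
print(conditional_freq(0.3), conditional_freq(0.8), 1 / comb(5, 2))
```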


2 Factorization Theorem

The preceding definition of sufficiency is hard to work with, because it does not indicate how to go about finding a sufficient statistic, and given a candidate statistic T, it would typically be very hard to determine whether it is a sufficient statistic because of the difficulty in evaluating the conditional distribution.

We shall now present a simple method for finding a sufficient statistic which can be applied in many problems. This method is based on the following result, which was developed with increasing generality by R. A. Fisher in 1922, J. Neyman in 1935, and P. R. Halmos and L. J. Savage in 1949, and is known as the Factorization Theorem.

Factorization Theorem: Let X1, · · · , Xn form a random sample from either a continuous distribution or a discrete distribution for which the pdf or the point mass function is f(x|θ), where the value of θ is unknown and belongs to a given parameter space Θ. A statistic T (X1, · · · , Xn) is a sufficient statistic for θ if and only if the joint pdf or the joint point mass function fn(x|θ) of X1, · · · , Xn can be factorized as follows for all values of x = (x1, · · · , xn) ∈ R^n and all values of θ ∈ Θ:

fn(x|θ) = u(x)v[T (x), θ].

Here, the functions u and v are nonnegative; the function u may depend on x but does not depend on θ, and the function v depends on θ but depends on the observed value x only through the value of the statistic T (x).

Note: In this expression, we can see that the statistic T (X1, · · · , Xn) is like an “interface” between the random sample X1, · · · , Xn and the function v.

Proof: We give a proof for the discrete case. First, suppose the frequency function can be factored in the given form. We let X = (X1, · · · , Xn) and x = (x1, · · · , xn); then

P(T = t) = ∑_{x: T(x)=t} P(X = x) = ∑_{x: T(x)=t} u(x) v(T(x), θ) = v(t, θ) ∑_{x: T(x)=t} u(x)

Then we have

P(X = x | T = t) = P(X = x, T = t) / P(T = t) = u(x) / ∑_{x: T(x)=t} u(x)

which does not depend on θ, and therefore T is a sufficient statistic.

Conversely, suppose the conditional distribution of X given T is independent of θ, that is, T is a sufficient statistic. Then we can let u(x) = P(X = x | T = t, θ), which by sufficiency does not depend on θ, and let v(t, θ) = P(T = t | θ). It follows that

P(X = x | θ) = P(X = x, T = t | θ)   (this is true because T(x) = t, so the event {T = t} is implied by {X = x} and adding it is redundant)

= P(X = x | T = t, θ) P(T = t | θ) = u(x) v(T(x), θ),


which is of the desired form.

Example 2: Suppose that X1, · · · , Xn form a random sample from a Poisson distribution for which the value of the mean θ is unknown (θ > 0). Show that T = ∑_{i=1}^n Xi is a sufficient statistic for θ.

Proof: For every set of nonnegative integers x1, · · · , xn, the joint probability mass function fn(x|θ) of X1, · · · , Xn is as follows:

fn(x|θ) = ∏_{i=1}^n [ e^{−θ} θ^{x_i} / x_i! ] = ( ∏_{i=1}^n 1/x_i! ) e^{−nθ} θ^{∑_{i=1}^n x_i}

It can be seen that fn(x|θ) has been expressed as the product of a function that does not depend on θ and a function that depends on θ but depends on the observed vector x only through the value of ∑_{i=1}^n x_i. By the factorization theorem, it follows that T = ∑_{i=1}^n Xi is a sufficient statistic for θ.

Example 3: Applying the Factorization Theorem to a Continuous Distribution. Suppose that X1, · · · , Xn form a random sample from a continuous distribution with the following p.d.f.:

f(x|θ) = θ x^{θ−1} for 0 < x < 1, and f(x|θ) = 0 otherwise.

It is assumed that the value of the parameter θ is unknown (θ > 0). We shall show that T = ∏_{i=1}^n Xi is a sufficient statistic for θ.

Proof: For 0 < xi < 1 (i = 1, · · · , n), the joint p.d.f. fn(x|θ) of X1, · · · , Xn is as follows:

fn(x|θ) = θ^n ( ∏_{i=1}^n x_i )^{θ−1}.

Furthermore, if at least one value of xi is outside the interval 0 < xi < 1, then fn(x|θ) = 0 for every value of θ ∈ Θ. The right side of the above equation depends on x only through the value of the product ∏_{i=1}^n x_i. Therefore, if we let u(x) = 1 and T(x) = ∏_{i=1}^n x_i, then fn(x|θ) can be considered to be factored in the form specified by the factorization theorem. It follows from the factorization theorem that the statistic T = ∏_{i=1}^n Xi is a sufficient statistic for θ.
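The same invariance can be seen numerically; here is a brief sketch (my own illustration, with arbitrary sample values): two samples with equal products give identical likelihood functions.

```python
# Illustrative check (not from the notes): the likelihood
# theta**n * prod(x)**(theta - 1) depends on the sample only through prod(x).
from math import prod

def likelihood(xs, theta):
    return theta ** len(xs) * prod(xs) ** (theta - 1)

a = [0.2, 0.5, 0.8]   # product = 0.08
b = [0.4, 0.4, 0.5]   # product = 0.08
for theta in (0.5, 1.0, 3.0):
    print(likelihood(a, theta), likelihood(b, theta))   # equal in each row
```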

Example 4: Suppose that X1, · · · , Xn form a random sample from a normal distribution for which the mean µ is unknown but the variance σ² is known. Find a sufficient statistic for µ.

Solution: For x = (x1, · · · , xn), the joint pdf of X1, · · · , Xn is

fn(x|µ) = ∏_{i=1}^n [ 1 / ((2π)^{1/2} σ) ] exp[ −(x_i − µ)² / (2σ²) ].

This can be rewritten as

fn(x|µ) = [ 1 / ((2π)^{n/2} σ^n) ] exp( −∑_{i=1}^n x_i² / (2σ²) ) exp( (µ/σ²) ∑_{i=1}^n x_i − nµ² / (2σ²) )


It can be seen that fn(x|µ) has now been expressed as the product of a function that does not depend on µ and a function that depends on x only through the value of ∑_{i=1}^n x_i. It follows from the factorization theorem that T = ∑_{i=1}^n Xi is a sufficient statistic for µ.

Since ∑_{i=1}^n x_i = n x̄, we can state equivalently that the final expression depends on x only through the value of x̄; therefore X̄ is also a sufficient statistic for µ. More generally, every one-to-one function of X̄ will be a sufficient statistic for µ.
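Here is a quick numerical illustration of this fact (my own sketch, assuming σ = 1 and arbitrary sample values): two samples with the same sum give likelihood functions that differ only by a constant factor, so they carry the same information about µ.

```python
# Illustrative sketch (not from the notes): with sigma known, two normal
# samples with the same sum have likelihoods whose ratio is free of mu.
from math import exp, pi, sqrt

def normal_likelihood(xs, mu, sigma=1.0):
    c = 1.0 / (sqrt(2 * pi) * sigma)
    out = 1.0
    for x in xs:
        out *= c * exp(-(x - mu) ** 2 / (2 * sigma ** 2))
    return out

a = [0.0, 1.0, 2.0]   # sum = 3
b = [1.0, 1.0, 1.0]   # sum = 3
for mu in (-1.0, 0.5, 2.0):
    print(normal_likelihood(a, mu) / normal_likelihood(b, mu))  # constant
```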

Property of sufficient statistic: Suppose that X1, · · · , Xn form a random sample from a distribution for which the p.d.f. is f(x|θ), where the value of the parameter θ belongs to a given parameter space Θ. Suppose that T (X1, · · · , Xn) and T′(X1, · · · , Xn) are two statistics, and there is a one-to-one map between T and T′; that is, the value of T′ can be determined from the value of T without knowing the values of X1, · · · , Xn, and the value of T can be determined from the value of T′ without knowing the values of X1, · · · , Xn. Then T′ is a sufficient statistic for θ if and only if T is a sufficient statistic for θ.

Proof: Suppose the one-to-one mapping between T and T′ is g, i.e. T′ = g(T) and T = g^{−1}(T′), where g^{−1} is also one-to-one. T is a sufficient statistic if and only if the joint pdf fn(x|θ) can be factorized as

fn(x|θ) = u(x) v[T(x), θ],

and this can be written as

u(x) v[T(x), θ] = u(x) v[g^{−1}(T′(x)), θ] = u(x) v′[T′(x), θ],

where we define v′[t′, θ] = v[g^{−1}(t′), θ].

Therefore the joint pdf can be factorized as u(x) v′[T′(x), θ], and by the factorization theorem, T′ is a sufficient statistic.

For instance, in Example 4 we showed that for the normal distribution, both T1 = ∑_{i=1}^n Xi and T2 = X̄ are sufficient statistics, and there is a one-to-one mapping between T1 and T2: T2 = T1/n. Other statistics like (∑_{i=1}^n Xi)³ and exp(∑_{i=1}^n Xi) are also sufficient statistics.

Example 5: Suppose that X1, · · · , Xn form a random sample from a beta distribution with parameters α and β, where the value of α is known and the value of β is unknown (β > 0). Show that the following statistic T is a sufficient statistic for the parameter β:

T = (1/n) ( ∑_{i=1}^n log[ 1/(1 − X_i) ] )³.

Proof: The p.d.f. f(x|β) of each individual observation Xi is

f(x|β) = [ Γ(α+β) / (Γ(α)Γ(β)) ] x^{α−1} (1 − x)^{β−1} for 0 ≤ x ≤ 1, and f(x|β) = 0 otherwise.

Therefore, the joint p.d.f. fn(x|β) of X1, · · · , Xn is

fn(x|β) = ∏_{i=1}^n [ Γ(α+β) / (Γ(α)Γ(β)) ] x_i^{α−1} (1 − x_i)^{β−1} = Γ(α)^{−n} ( ∏_{i=1}^n x_i )^{α−1} [ Γ(α+β)^n / Γ(β)^n ] ( ∏_{i=1}^n (1 − x_i) )^{β−1}


We define T′(X1, · · · , Xn) = ∏_{i=1}^n (1 − Xi), and because α is known, we can define

u(x) = Γ(α)^{−n} ( ∏_{i=1}^n x_i )^{α−1},    v(T′, β) = [ Γ(α+β)^n / Γ(β)^n ] T′(x1, · · · , xn)^{β−1}

We can see that the function v depends on x only through T′; therefore T′ is a sufficient statistic.

It is easy to see that

T = g(T′) = (−log T′)³ / n,

and the function g is a one-to-one mapping (for 0 < T′ < 1 we have −log T′ > 0, and the cube is strictly increasing). Therefore T is a sufficient statistic.

Example 6: Sampling from a Uniform distribution. Suppose that X1, · · · , Xn form a random sample from a uniform distribution on the interval [0, θ], where the value of the parameter θ is unknown (θ > 0). We shall show that T = max(X1, · · · , Xn) is a sufficient statistic for θ.

Proof: The p.d.f. f(x|θ) of each individual observation Xi is

f(x|θ) = 1/θ for 0 ≤ x ≤ θ, and f(x|θ) = 0 otherwise.

Therefore, the joint p.d.f. fn(x|θ) of X1, · · · , Xn is

fn(x|θ) = 1/θ^n if 0 ≤ x_i ≤ θ for i = 1, · · · , n, and fn(x|θ) = 0 otherwise.

It can be seen that if xi < 0 for at least one value of i (i = 1, · · · , n), then fn(x|θ) = 0 for every value of θ > 0. Therefore it is only necessary to consider the factorization of fn(x|θ) for values of xi ≥ 0 (i = 1, · · · , n).

Define h[max(x1, · · · , xn), θ] as

h[max(x1, · · · , xn), θ] = 1 if max(x1, · · · , xn) ≤ θ, and 0 if max(x1, · · · , xn) > θ.

Also, xi ≤ θ for i = 1, · · · , n if and only if max(x1, · · · , xn) ≤ θ. Therefore, for xi ≥ 0 (i = 1, · · · , n), we can rewrite fn(x|θ) as follows:

fn(x|θ) = (1/θ^n) h[max(x1, · · · , xn), θ].

Since the right side depends on x only through the value of max(x1, · · · , xn), it follows that T = max(X1, · · · , Xn) is a sufficient statistic for θ. According to the property of sufficient statistics, any one-to-one function of T is a sufficient statistic as well.
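A short numerical sketch (my own, with arbitrary sample values) makes the point: the likelihood is θ^{−n} whenever θ ≥ max(x) and 0 otherwise, so any two samples of the same size with the same maximum have identical likelihood functions.

```python
# Illustrative check (not from the notes): for Uniform[0, theta] the
# likelihood depends on the sample only through n and max(x).
def uniform_likelihood(xs, theta):
    return theta ** (-len(xs)) if max(xs) <= theta else 0.0

a = [0.2, 0.9, 0.5]
b = [0.9, 0.1, 0.3]   # same n, same maximum as a
for theta in (0.8, 1.0, 2.0):
    print(uniform_likelihood(a, theta), uniform_likelihood(b, theta))
```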


Example 7: Suppose that X1, X2, · · · , Xn are i.i.d. random variables on the interval [0, 1] with the density function

f(x|α) = [ Γ(2α) / Γ(α)² ] [x(1 − x)]^{α−1},

where α > 0 is a parameter to be estimated from the sample. Find a sufficient statistic for α.

Solution: The joint density function of x1, · · · , xn is

f(x1, · · · , xn|α) = ∏_{i=1}^n [ Γ(2α) / Γ(α)² ] [x_i(1 − x_i)]^{α−1} = [ Γ(2α)^n / Γ(α)^{2n} ] [ ∏_{i=1}^n x_i(1 − x_i) ]^{α−1}

Comparing with the form in the factorization theorem,

f(x1, · · · , xn|θ) = u(x1, · · · , xn)v[T (x1, · · · , xn), θ]

we see that T = ∏_{i=1}^n Xi(1 − Xi), v(t, α) = [ Γ(2α)^n / Γ(α)^{2n} ] t^{α−1}, and u(x1, · · · , xn) = 1; i.e. v depends on x1, · · · , xn only through t. By the factorization theorem, T = ∏_{i=1}^n Xi(1 − Xi) is a sufficient statistic. According to the property of sufficient statistics, any one-to-one function of T is a sufficient statistic as well.

3 Sufficient Statistics and Estimators

We know that estimators are statistics; in particular, we want the obtained estimator to be a sufficient statistic, since we want the estimator to absorb all the available information contained in the sample.

Suppose that X1, · · · , Xn form a random sample from a distribution for which the pdf or point mass function is f(x|θ), where the value of the parameter θ is unknown, and assume there is a sufficient statistic T (X1, · · · , Xn) for θ. We will show that the MLE of θ, θ̂, depends on X1, · · · , Xn only through the statistic T.

It follows from the factorization theorem that the likelihood function fn(x|θ) can be written as

fn(x|θ) = u(x)v[T (x), θ].

We know that the MLE θ̂ is the value of θ for which fn(x|θ) is maximized. We also know that both u and v are nonnegative and that u(x) does not involve θ. Therefore, θ̂ will be the value of θ for which v[T(x), θ] is maximized. Since v[T(x), θ] depends on x only through the function T(x), it follows that θ̂ will depend on x only through the function T(x). Thus the MLE θ̂ is a function of the sufficient statistic T (X1, · · · , Xn).
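As a concrete instance (my own sketch using the Bernoulli model of Example 1): since v[T(x), θ] = θ^t (1 − θ)^{n−t}, the MLE can be computed from t = ∑ x_i alone.

```python
# Sketch (not from the notes): the Bernoulli MLE needs only the sufficient
# statistic t = sum(x), not the raw sample.
def bernoulli_mle_from_t(t, n):
    # argmax over theta in (0, 1) of theta**t * (1 - theta)**(n - t)
    return t / n

x = [1, 0, 1, 1, 0]
print(bernoulli_mle_from_t(sum(x), len(x)))   # 0.6
```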

In many problems, the MLE θ̂ is actually a sufficient statistic. For instance, Example 1 shows that a sufficient statistic for the success probability in Bernoulli trials is ∑_{i=1}^n Xi, and we know the MLE for θ is X̄; Example 4 shows that a sufficient statistic for µ in the normal distribution is X̄, and this is the MLE for µ; Example 6 shows that a sufficient statistic of θ for the uniform distribution on [0, θ] is max(X1, · · · , Xn), and this is the MLE for θ.

The above discussion for the MLE also holds for the Bayes estimator. Let θ be a parameter with parameter space Θ equal to an interval of real numbers (possibly unbounded), and assume that the prior distribution for θ is p(θ). Let X have p.d.f. f(x|θ) conditional on θ. Suppose we have a random sample X1, · · · , Xn from f(x|θ), and let T (X1, · · · , Xn) be a sufficient statistic. We first show that the posterior p.d.f. of θ given X = x depends on x only through T (x).

The likelihood term is fn(x|θ); according to the Bayes formula, the posterior distribution of θ is

f(θ|x) = fn(x|θ) p(θ) / ∫_Θ fn(x|θ) p(θ) dθ = u(x) v(T(x), θ) p(θ) / ∫_Θ u(x) v(T(x), θ) p(θ) dθ = v(T(x), θ) p(θ) / ∫_Θ v(T(x), θ) p(θ) dθ,

where the second step uses the factorization theorem. We can see that the posterior p.d.f. of θ given X = x depends on x only through T (x).

Since the Bayes estimator of θ with respect to a specified loss function is calculated from this posterior p.d.f., the estimator also will depend on the observed vector x only through the value of T (x). In other words, the Bayes estimator is a function of the sufficient statistic T (X1, · · · , Xn).
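For a concrete illustration (my own sketch, assuming a Beta(a, b) prior with the Bernoulli model of Example 1): the unnormalized posterior is θ^{t+a−1} (1 − θ)^{n−t+b−1}, which depends on the data only through t = ∑ x_i.

```python
# Sketch (not from the notes): Beta prior + Bernoulli likelihood; samples
# with the same sufficient statistic t yield identical posteriors.
def unnormalized_posterior(theta, t, n, a=1.0, b=1.0):
    return theta ** (t + a - 1) * (1 - theta) ** (n - t + b - 1)

x1 = [1, 1, 0, 0, 0]
x2 = [0, 0, 1, 0, 1]   # same n, same sum as x1
print(unnormalized_posterior(0.4, sum(x1), len(x1)) ==
      unnormalized_posterior(0.4, sum(x2), len(x2)))   # True
```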

Summarizing the discussion above, both the MLE and the Bayes estimator are functions of a sufficient statistic; therefore they absorb all the available information contained in the sample at hand.

4 Exponential Family of Probability Distributions

A study of the properties of probability distributions that have sufficient statistics of the same dimension as the parameter space regardless of the sample size led to the development of what is called the exponential family of probability distributions. Many common distributions, including the normal, the binomial, the Poisson, and the gamma, are members of this family.

One-parameter members of the exponential family have density or mass function of the form

f(x|θ) = exp[c(θ)T (x) + d(θ) + S(x)]

Suppose that X1, · · · , Xn are i.i.d. samples from a member of the exponential family; then the joint probability function is

f(x|θ) = ∏_{i=1}^n exp[ c(θ) T(x_i) + d(θ) + S(x_i) ] = exp[ c(θ) ∑_{i=1}^n T(x_i) + n d(θ) ] exp[ ∑_{i=1}^n S(x_i) ]


From this result, it is apparent by the factorization theorem that ∑_{i=1}^n T(Xi) is a sufficient statistic.
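For instance (my own sketch, not from the notes), the Poisson distribution fits this one-parameter form with c(θ) = log θ, T(x) = x, d(θ) = −θ, and S(x) = −log x!, which the following code verifies numerically.

```python
# Sketch (not from the notes): the Poisson pmf equals its exponential-family
# form exp[c(theta) T(x) + d(theta) + S(x)].
from math import exp, log, factorial

def poisson_pmf(x, theta):
    return exp(-theta) * theta**x / factorial(x)

def poisson_expfam(x, theta):
    return exp(log(theta) * x - theta - log(factorial(x)))

for x, theta in [(0, 0.5), (3, 2.0), (7, 4.5)]:
    print(abs(poisson_pmf(x, theta) - poisson_expfam(x, theta)) < 1e-12)
```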

Example 8: The frequency function of the Bernoulli distribution is

P(X = x) = θ^x (1 − θ)^{1−x},  x = 0 or x = 1,
         = exp[ x log( θ/(1 − θ) ) + log(1 − θ) ].

It can be seen that this is a member of the exponential family with T(x) = x, and we can also see that ∑_{i=1}^n Xi is a sufficient statistic, which is the same as in Example 1.

Example 9: Suppose that X1, X2, · · · , Xn are i.i.d. random variables on the interval [0, 1] with the density function

f(x|α) = [ Γ(2α) / Γ(α)² ] [x(1 − x)]^{α−1},

where α > 0 is a parameter to be estimated from the sample. Find a sufficient statistic for α by verifying that this distribution belongs to the exponential family.

Solution: The density function is

f(x|α) = [ Γ(2α) / Γ(α)² ] [x(1 − x)]^{α−1} = exp{ log Γ(2α) − 2 log Γ(α) + α log[x(1 − x)] − log[x(1 − x)] }.

Comparing to the form of the exponential family,

T (x) = log[x(1− x)]; c(α) = α; S(x) = − log[x(1− x)]; d(α) = log Γ(2α)− 2 log Γ(α)

Therefore, f(x|α) belongs to the exponential family. Then the sufficient statistic is

∑_{i=1}^n T(Xi) = ∑_{i=1}^n log[Xi(1 − Xi)] = log[ ∏_{i=1}^n Xi(1 − Xi) ]

In Example 7, we found that the sufficient statistic was ∏_{i=1}^n Xi(1 − Xi), which is different from the result here. But both of them are sufficient statistics, because there is a one-to-one functional relationship between them (the logarithm).

A k-parameter member of the exponential family has a density or frequency function of the form

f(x|θ) = exp[ ∑_{i=1}^k c_i(θ) T_i(x) + d(θ) + S(x) ]

For example, the normal distribution, the gamma distribution (see the example below), and the beta distribution are of this form.

Example 10: Show that the gamma distribution belongs to the exponential family.


Proof: The gamma distribution has density function

f(x|α, β) = [ β^α / Γ(α) ] x^{α−1} e^{−βx},  0 ≤ x < ∞,

which can be written as

f(x|α, β) = exp{ −βx + (α − 1) log x + α log β − log Γ(α) }.

Comparing with the form of the exponential family

exp{ ∑_{i=1}^k c_i(θ) T_i(x) + d(θ) + S(x) },

we see that the gamma distribution has the form of the exponential family with c1(α, β) = −β, c2(α, β) = α − 1, T1(x) = x, T2(x) = log x, d(α, β) = α log β − log Γ(α), and S(x) = 0.

Therefore, the gamma distribution belongs to the exponential family.
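The identification above can be checked numerically; here is a brief sketch (my own, using math.lgamma for log Γ).

```python
# Sketch (not from the notes): the gamma density agrees with the
# two-parameter exponential-family form derived above.
from math import exp, log, lgamma

def gamma_pdf(x, a, b):
    return b**a / exp(lgamma(a)) * x**(a - 1) * exp(-b * x)

def gamma_expfam(x, a, b):
    return exp(-b * x + (a - 1) * log(x) + a * log(b) - lgamma(a))

for x, a, b in [(0.5, 2.0, 1.0), (3.0, 4.5, 0.7)]:
    print(abs(gamma_pdf(x, a, b) - gamma_expfam(x, a, b)) < 1e-12)
```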

5 Exercises

Instructions for Exercises 1 to 4: In each of these exercises, assume that the random variables X1, · · · , Xn form a random sample of size n from the distribution specified in that exercise, and show that the statistic T specified in the exercise is a sufficient statistic for the parameter.

Exercise 1: A normal distribution for which the mean µ is known and the variance σ² is unknown; T = ∑_{i=1}^n (Xi − µ)².

Exercise 2: A gamma distribution with parameters α and β, where the value of β is known and the value of α is unknown (α > 0); T = ∏_{i=1}^n Xi.

Exercise 3: A uniform distribution on the interval [a, b], where the value of a is known and the value of b is unknown (b > a); T = max(X1, · · · , Xn).

Exercise 4: A uniform distribution on the interval [a, b], where the value of b is known and the value of a is unknown (b > a); T = min(X1, · · · , Xn).

Exercise 5: Suppose that X1, · · · , Xn form a random sample from a gamma distribution with parameters α > 0 and β > 0, and the value of β is known. Show that the statistic T = ∑_{i=1}^n log Xi is a sufficient statistic for the parameter α.

Exercise 6: The Pareto distribution has density function:

f(x|x0, θ) = θ x0^θ x^{−θ−1},  x ≥ x0,  θ > 1

Assume that x0 > 0 is given and that X1, X2, · · · , Xn is an i.i.d. sample. Find a sufficient statistic for θ by (a) using the factorization theorem, and (b) using the property of the exponential family. Are they the same? If not, why are both of them sufficient?


Exercise 7: Verify that the following are members of the exponential family:

a) Geometric distribution p(x) = p^{x−1}(1 − p);

b) Poisson distribution p(x) = e^{−λ} λ^x / x!;

c) Normal distribution N(µ, σ²);

d) Beta distribution.