BS2a - University of Oxford

Lecture 2 - Sufficiency, Factorization Theorem, Minimal sufficiency
Jonathan Marchini (University of Oxford), BS2a MT 2013


Sufficient statistics

Let X1, . . . ,Xn be a random sample from f (x ; θ).

Definition 2 A statistic T(X1, . . . , Xn) is a function of the data that does not depend on unknown parameters.

Definition 3 A statistic T(X1, . . . , Xn) is said to be sufficient for θ if the conditional distribution of X1, . . . , Xn, given T, does not depend on θ. That is,

g(x | t, θ) = g(x | t)

Comment The definition says that a sufficient statistic T contains all the information there is in the sample about θ.


Example 7

n independent trials where the probability of success is p.

Let X1, . . . , Xn be indicator variables which are 1 or 0 according to whether the trial is a success or a failure. Let T = ∑_{i=1}^n Xi. The conditional distribution of X1, . . . , Xn given T = t is

g(x1, . . . , xn | t, p) = f(x1, . . . , xn, t | p) / h(t | p)
                        = ∏_{i=1}^n p^{xi}(1 − p)^{1−xi} / [C(n, t) p^t(1 − p)^{n−t}]
                        = p^t(1 − p)^{n−t} / [C(n, t) p^t(1 − p)^{n−t}]
                        = 1 / C(n, t),

where C(n, t) is the binomial coefficient. This does not depend on p, so T is sufficient for p.

Comment This makes sense, since the order of the outcomes carries no information about p.
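This calculation is easy to check numerically. The sketch below (an illustration, not part of the lecture notes; function names are mine) enumerates all length-n binary sequences and verifies that the conditional probability of each sequence given T = t equals 1/C(n, t), whatever the value of p:

```python
from itertools import product
from math import comb, isclose

def cond_prob(x, p):
    """P(X = x | T = t) for Bernoulli(p) trials, computed from first principles."""
    n, t = len(x), sum(x)
    joint = p**t * (1 - p)**(n - t)             # P(X = x) = p^t (1-p)^(n-t)
    h = comb(n, t) * p**t * (1 - p)**(n - t)    # P(T = t), T ~ Binomial(n, p)
    return joint / h

n = 4
for p in (0.2, 0.5, 0.9):
    for x in product((0, 1), repeat=n):
        # the conditional probability is 1/C(n, t), free of p
        assert isclose(cond_prob(x, p), 1 / comb(n, sum(x)))
print("conditional distribution given T does not depend on p")
```

The assertions pass for every p on the grid, matching the algebraic cancellation above.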


Theorem 1 : Factorization Criterion

T(X1, . . . , Xn) is a sufficient statistic for θ if and only if there exist two non-negative functions K1, K2 such that the likelihood function L(θ; x) can be written

L(θ; x) = K1[t(x1, . . . , xn); θ]K2[x1, . . . , xn],

where K1 depends on the sample only through T, and K2 does not depend on θ.
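For a concrete instance of the criterion, the Bernoulli likelihood of Example 7 factorizes with K1[t; p] = p^t(1 − p)^{n−t} and K2 ≡ 1. The short check below (illustrative Python, with helper names of my own choosing) confirms this, and in particular that two samples with the same sum have identical likelihood functions:

```python
from math import isclose

def likelihood(x, p):
    """L(p; x) = prod_i p^{x_i} (1-p)^{1-x_i} for a Bernoulli sample x."""
    out = 1.0
    for xi in x:
        out *= p**xi * (1 - p)**(1 - xi)
    return out

def K1(t, n, p):
    """Factor depending on the data only through t = sum(x); here K2(x) = 1."""
    return p**t * (1 - p)**(n - t)

x, y = [1, 0, 1, 1, 0], [0, 1, 0, 1, 1]   # same sum t = 3, different order
for p in (0.1, 0.4, 0.8):
    assert isclose(likelihood(x, p), K1(3, 5, p))
    assert isclose(likelihood(x, p), likelihood(y, p))  # data enter only via t
print("L(p; x) = K1[t; p] * K2[x] with K2 = 1")
```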


Proof - for discrete random variables

1. Assume that T is sufficient. Then the distribution of the sample is

L(θ; x) = f(x | θ) = f(x, t | θ) = g(x | t, θ) h(t | θ),

since t = T(x) is determined by x.

T is sufficient, which implies

g(x | t , θ) = g(x | t)

h(t | θ) depends on x through t(x) only, so

L(θ; x) = g(x | t)h(t | θ)

We set L(θ; x) = K1K2, where K1 ≡ h, K2 ≡ g.


2. Suppose L(θ; x) = f(x | θ) = K1[t; θ]K2[x].

Then

h(t | θ) = ∑_{x: T(x)=t} f(x, t | θ) = ∑_{x: T(x)=t} L(θ; x) = K1[t; θ] ∑_{x: T(x)=t} K2(x).

Thus

g(x | t, θ) = f(x, t | θ) / h(t | θ) = L(θ; x) / h(t | θ) = K2[x] / ∑_{x: T(x)=t} K2(x),

not depending on θ. (K1 cancels out in numerator and denominator.)


Minimal sufficiency

In general, we can envisage a sequence of functions T1(X), T2(X), T3(X), . . . such that Tj(X) is a function of Tj−1(X) (j ≥ 2), so that moving along the sequence gives progressive reductions of the data. If Tj(X) is sufficient for θ, then so is Ti(X) for i < j. Typically there will be a last member Tk(X) that is sufficient, with Ti(X) not sufficient for i > k. Then Tk(X) is minimal sufficient.

Example 7 (cont.) Consider n = 3 Bernoulli trials.
1 T1(X) = (X1, X2, X3) (the individual sequence of trials)
2 T2(X) = (X1, ∑_{i=1}^3 Xi) (the 1st random variable and the total sum)
3 T3(X) = ∑_{i=1}^3 Xi (the total sum)
4 T4(X) = I(T3(X) = 0) (I is the indicator function; Exercise Prove T4 is not sufficient)

Definition 4 A statistic is minimal sufficient if it can be expressed as a function of every other sufficient statistic.
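The exercise about T4 can be explored numerically. The sketch below (illustrative, not from the notes) computes the conditional probability of one sample point given T4 for two values of p and shows it changes with p, so T4 is not sufficient:

```python
from itertools import product

def cond_given_T4(x, p):
    """P(X = x | T4), where T4 = I(sum(X) = 0), for Bernoulli(p) trials."""
    n = len(x)
    def prob(seq):
        t = sum(seq)
        return p**t * (1 - p)**(n - t)
    t4 = int(sum(x) == 0)
    # condition on the event {T4 = t4} by summing over all sequences in it
    denom = sum(prob(s) for s in product((0, 1), repeat=n)
                if int(sum(s) == 0) == t4)
    return prob(x) / denom

x = (1, 0, 0)                 # a sequence with T4 = 0
a, b = cond_given_T4(x, 0.3), cond_given_T4(x, 0.7)
print(a, b)                   # the two values differ, so T4 is not sufficient
assert abs(a - b) > 1e-6
```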


Example 7 (cont.) : Minimal sufficiency

n Bernoulli trials with T = ∑_{i=1}^n Xi. Suppose T above is not minimal sufficient but another statistic U is MS. Then U can be given as a function of T (and not vice versa, otherwise T would be MS), and there exist values t1 ≠ t2 of T such that U(t1) = U(t2) (i.e. T → U is many-to-one, so U → T is not a function; assume for the moment that no other t makes U(t) = U(t1)). The event U = u is then the event T ∈ {t1, t2}. Let x1, . . . , xn contain t1 successes. Then

g(x1, . . . , xn | u, p) = g(x1, . . . , xn | t1, p) P(T = t1 | u, p)
                        = g(x1, . . . , xn | t1) P(T = t1 | T ∈ {t1, t2})
                        = p^{t1}(1 − p)^{n−t1} / [C(n, t1) p^{t1}(1 − p)^{n−t1} + C(n, t2) p^{t2}(1 − p)^{n−t2}],

which depends on p, so U is not sufficient, a contradiction; hence T must be MS (similar reasoning applies when more than two values ti are merged).
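The displayed formula can be evaluated directly. This illustrative sketch (not part of the notes) merges two values t1 ≠ t2 of T, as in the argument, and confirms that the resulting conditional probability varies with p:

```python
from math import comb

def g_merged(n, t1, t2, p):
    """P(X = x | T in {t1, t2}) for a fixed sequence x with t1 successes,
    following the displayed formula."""
    num = p**t1 * (1 - p)**(n - t1)
    den = (comb(n, t1) * p**t1 * (1 - p)**(n - t1)
           + comb(n, t2) * p**t2 * (1 - p)**(n - t2))
    return num / den

n, t1, t2 = 4, 1, 3            # merge two distinct values of T
vals = [g_merged(n, t1, t2, p) for p in (0.2, 0.5, 0.8)]
print(vals)                    # not constant in p: the merged statistic is not sufficient
assert max(vals) - min(vals) > 1e-6
```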


Minimal sufficiency and partitions of the sample space

Intuitively, a minimal sufficient statistic most efficiently captures all possible information about the parameter θ.

Any statistic T(X) partitions the sample space into subsets, and on each subset T(X) has a constant value.

Minimal sufficient statistics correspond to the coarsest possible partition of the sample space.

In the example of n = 3 Bernoulli trials, consider the following 4 statistics and the partitions they induce.


[Figure: the 8 outcomes TTT, TTH, THT, THH, HTT, HHT, HTH, HHH, shown with the partitions induced by T1(X) = (X1, X2, X3), T2(X) = (X1, ∑_{i=1}^3 Xi), T3(X) = ∑_{i=1}^3 Xi, and T4(X) = I(T3(X) = 0).]
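The partitions in the figure can be reproduced by counting the distinct values each statistic takes over the 8 outcomes. A small illustrative script (Python, not part of the notes):

```python
from itertools import product

outcomes = list(product((0, 1), repeat=3))   # the 8 sequences of 3 Bernoulli trials

T = {
    "T1": lambda x: x,                        # the full sequence
    "T2": lambda x: (x[0], sum(x)),           # first trial and the total
    "T3": lambda x: sum(x),                   # the total
    "T4": lambda x: int(sum(x) == 0),         # indicator of "no successes"
}

# Number of classes in the partition each statistic induces:
sizes = {name: len({t(x) for x in outcomes}) for name, t in T.items()}
print(sizes)   # {'T1': 8, 'T2': 6, 'T3': 4, 'T4': 2}
assert sizes["T1"] > sizes["T2"] > sizes["T3"] > sizes["T4"]
```

The partitions get strictly coarser along the sequence T1, T2, T3, T4, matching the figure.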

Lemma 1 : Lehmann-Scheffé partitions

Consider the partition of the sample space defined by putting x and y into the same class of the partition if and only if

L(θ; y)/L(θ; x) = f(y | θ)/f(x | θ) = m(x, y).

Then any statistic corresponding to this partition is minimal sufficient.

Comment This Lemma tells us how to define partitions that correspond to minimal sufficient statistics. It says that the ratio of the likelihoods of two values x and y in the same partition class (and hence with the same statistic value) should not depend on θ.
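Lemma 1 can be applied mechanically: group sample points whose likelihood ratio is constant in θ. The sketch below (illustrative; testing equality of the ratio over a grid of p values is a numerical stand-in for "constant in p") recovers the partition induced by T = ∑ Xi for n = 3 Bernoulli trials:

```python
from itertools import product

def likelihood(x, p):
    t = sum(x)
    return p**t * (1 - p)**(len(x) - t)

def same_class(x, y, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """x, y in the same Lehmann-Scheffe class iff L(p; y)/L(p; x) is constant in p."""
    ratios = [likelihood(y, p) / likelihood(x, p) for p in grid]
    return max(ratios) - min(ratios) < 1e-12

# Build the partition of the n = 3 Bernoulli sample space by likelihood ratios.
outcomes = list(product((0, 1), repeat=3))
classes = []
for x in outcomes:
    for cls in classes:
        if same_class(cls[0], x):
            cls.append(x)
            break
    else:
        classes.append([x])

# Each class consists exactly of the outcomes sharing a value of T = sum(x).
for cls in classes:
    assert len({sum(x) for x in cls}) == 1
print(len(classes), "classes, one per value of T")
```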


Proof (for discrete RVs)

1. Sufficiency. Suppose T is such a statistic. Then

g(x | t, θ) = f(x | θ) / f(t | θ)
            = f(x | θ) / ∑_{y∈τ} f(y | θ),   τ = {y : T(y) = t}
            = f(x | θ) / ∑_{y∈τ} f(x | θ) m(x, y)
            = [ ∑_{y∈τ} m(x, y) ]^{−1},

which does not depend on θ. Hence the Lehmann-Scheffé partition, and any statistic T corresponding to it, is sufficient.


2. Minimal sufficiency. Now suppose U is any other sufficient statistic and that U(x) = U(y) for some pair of values (x, y). If we can show that U(x) = U(y) implies T(x) = T(y), then the Lehmann-Scheffé partition induced by T includes the partition based on any other sufficient statistic. In other words, T is a function of every other sufficient statistic, and so must be minimal sufficient. Since U is sufficient we have

L(θ; y) / L(θ; x) = K1[u(y); θ]K2[y] / ( K1[u(x); θ]K2[x] ) = K2[y] / K2[x]   (since u(x) = u(y)),

which does not depend on θ. Hence x and y lie in the same Lehmann-Scheffé class, so T(x) = T(y): the statistic U produces a partition at least as fine as that induced by T, and the result is proved.


Sufficiency in an exponential family

Random sample X1, . . . , Xn. The likelihood is

L(θ; x) = ∏_{i=1}^n f(xi; θ)
        = ∏_{i=1}^n exp{ ∑_{j=1}^k Aj(θ)Bj(xi) + C(xi) + D(θ) }
        = exp{ ∑_{j=1}^k Aj(θ) ( ∑_{i=1}^n Bj(xi) ) + nD(θ) + ∑_{i=1}^n C(xi) }.

Exponential family form again.


Sufficiency in an exponential family

Suppose the family is in canonical form so φj = Aj(θ), and let tj = ∑_{i=1}^n Bj(xi) and C(x) = ∑_{i=1}^n C(xi).

L(θ; x) = exp{ ∑_{j=1}^k φj tj + nD(θ) + C(x) }.

By the factorization criterion, t1, . . . , tk are sufficient statistics for φ1, . . . , φk. In fact, we do not need canonical form. If

L(θ; x) = exp{ ∑_{j=1}^k Aj(θ) tj + nD(θ) + C(x) }

is a minimal k-dimensional linear exponential family, then (by the regularity conditions above) t1, . . . , tk are minimal sufficient for θ1, . . . , θk. Minimal sufficiency is verified using Lemma 1.
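As a concrete case, for a N(μ, σ²) random sample the statistics t1 = ∑ xi and t2 = ∑ xi² are sufficient for (μ, σ²). The illustrative check below (not part of the notes) computes the log-likelihood from the raw data and from (t1, t2) via the exponential-family form, and confirms they agree:

```python
from math import log, pi, isclose

def loglik_raw(x, mu, sigma2):
    """Normal log-likelihood computed from the raw sample."""
    return sum(-0.5 * log(2 * pi * sigma2) - (xi - mu)**2 / (2 * sigma2)
               for xi in x)

def loglik_suff(t1, t2, n, mu, sigma2):
    """The same log-likelihood via the sufficient statistics t1 = sum x_i,
    t2 = sum x_i^2 (A1 = mu/sigma2, B1 = x; A2 = -1/(2 sigma2), B2 = x^2)."""
    return (-0.5 * n * log(2 * pi * sigma2)
            - t2 / (2 * sigma2) + mu * t1 / sigma2 - n * mu**2 / (2 * sigma2))

x = [1.2, -0.7, 0.3, 2.1, 0.9]
t1, t2, n = sum(x), sum(xi**2 for xi in x), len(x)
for mu, s2 in [(0.0, 1.0), (0.5, 2.0), (-1.0, 0.5)]:
    # the data enter the likelihood only through (t1, t2)
    assert isclose(loglik_raw(x, mu, s2), loglik_suff(t1, t2, n, mu, s2))
print("normal likelihood depends on the data only through (sum x, sum x^2)")
```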
