PART II

STATISTICS

Chapter 9. Examples of statistical models.

9.1. Examples, basic concepts and terminology.

We use examples to review some of the basic concepts and terminology of statistics. They are: a) population (population distribution, underlying distribution); b) parameters: θ ∈ Θ; c) random variables vs. observations; d) statistics; and e) joint density vs. likelihood.

Example 9.1. Bernoulli trials. A coin has probability θ of landing heads and 1 − θ of landing tails when tossed. Toss the coin n times. Let Xi = 1 or 0, depending on whether the i-th toss is a head or a tail. Then X1, ..., Xn are iid ∼ Bin(1, θ).

For this example, a) population distribution: Bin(1, θ); b) parameter: θ ∈ [0, 1]; c) X1, ..., Xn are r.v.s, and the observations (which are fixed numbers 1 or 0 and are not random at all) are realized values of X1, ..., Xn. With a slight abuse of notation, we do NOT distinguish the random variables and their realizations in terms of notation. d) statistics: functions of the r.v.s, such as, for example, X̄ = (1/n)∑_{i=1}^n Xi or X1 − X2. e) joint density vs. likelihood:

The joint density (actually probability function) of (X1, ..., Xn) is

f(x1, ..., xn | θ) = P(X1 = x1, ..., Xn = xn) = ∏_{i=1}^n P(Xi = xi) = ∏_{i=1}^n θ^{xi}(1 − θ)^{1−xi} = θ^{∑_{i=1}^n xi}(1 − θ)^{n − ∑_{i=1}^n xi},

for xi = 0 or 1, 1 ≤ i ≤ n. The joint density function is a function of (x1, ..., xn) over all possible values of (X1, ..., Xn).

Interpretation: The larger the joint density function is at (x1, ..., xn), the more likely the r.v.s (X1, ..., Xn) take the value (x1, ..., xn). Here the parameter θ is fixed.

Likelihood: Suppose (X1, ..., Xn) takes the value (x1, ..., xn). The likelihood is

L(θ) ≡ L(θ | x1, ..., xn) = f(x1, ..., xn | θ) = θ^{∑_{i=1}^n xi}(1 − θ)^{n − ∑_{i=1}^n xi}.

This is identical to the joint density function in its expression. (In fact, they are identical!) But the viewpoint is entirely different. The likelihood is a function of the parameter θ with the values of the random variables (x1, ..., xn) being fixed, while the joint density function is a function of the values of the r.v.s (x1, ..., xn) with the parameter θ being fixed.

Suppose L(0.5)/L(0.7) = 10. This can be interpreted as: given the random variables taking values (x1, ..., xn), the parameter θ is 10 times more likely to be 0.5 than to be 0.7. □
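The likelihood ratio above is easy to check numerically. The following short Python sketch (not part of the original notes; the observed sequence x is hypothetical) computes L(θ) for the Bernoulli model and compares two parameter values:

import numpy as np

def bernoulli_likelihood(theta, x):
    # L(theta) = theta^(sum x_i) * (1 - theta)^(n - sum x_i)
    x = np.asarray(x)
    s = x.sum()
    return theta**s * (1 - theta)**(len(x) - s)

# hypothetical data: 10 tosses with 5 heads
x = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
ratio = bernoulli_likelihood(0.5, x) / bernoulli_likelihood(0.7, x)
print(ratio)   # how many times more likely theta = 0.5 is than theta = 0.7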

Example 9.2. Suppose X1, ..., Xn are iid ∼ N(µ, σ2). Then, the likelihood is

L(µ, σ²) = ∏_{i=1}^n (1/√(2πσ²)) e^{−(Xi−µ)²/(2σ²)} = exp{ (µ/σ²)∑_{i=1}^n Xi − (1/(2σ²))∑_{i=1}^n Xi² − nµ²/(2σ²) − (n/2) log(2πσ²) }.

As announced, we do not distinguish the values of the r.v.s from the r.v.s themselves. □

Example 9.3. X1, ..., Xn are iid ∼ P(λ).

L(λ) = exp{ (log λ)∑_{i=1}^n Xi − nλ − ∑_{i=1}^n log Xi! }.

Example 9.4. X1, ..., Xn are iid with common density function f(t) = 1/[π(1 + (t − θ)²)], which is the Cauchy distribution centered at θ. Then,

L(θ) = ∏_{i=1}^n 1/[π(1 + (Xi − θ)²)].


Example 9.5. X1, ..., Xn are iid ∼ Unif[0, θ] with common density f(t) = (1/θ)1{t ∈ [0, θ]}. Then,

L(θ) = (1/θ^n) 1{X(n) ≤ θ} 1{X(1) ≥ 0},

where X(1) ≤ ··· ≤ X(n) are the ordered X1, ..., Xn. Therefore X(1) = min_{1≤i≤n} Xi and X(n) = max_{1≤i≤n} Xi. □

We consider the following real example about the Hardy-Weinberg equilibrium for further illustration.

Example 9.6. The Hardy-Weinberg Equilibrium. According to the Hardy-Weinberg equilibrium, one of the fundamental laws in genetics, the blood types MM, MN and NN should have proportions (1 − θ)², 2θ(1 − θ) and θ², respectively, in the population. In 1937, a random sample of 1029 people was drawn from all residents of Hong Kong. Let X1, ..., Xn denote the blood types of these people, n = 1029. A brief summary shows that among these 1029 people, those with blood types MM, MN and NN number 342, 500 and 187, respectively.

In this example, a) the population distribution is a distribution over the blood types MM, MN, NN with chances (1 − θ)², 2θ(1 − θ) and θ², respectively; b) parameter: θ ∈ [0, 1]; c) the r.v.s are X1, ..., Xn, and their observed values are denoted the same; d) statistics: for example, Y1 ≡ ∑_{i=1}^n 1{Xi = MM}, the number of people with blood type MM in the sample. In the actual data, Y1 = 342. The density function of X1, ..., Xn is

f(x1, ..., xn | θ) = ∏_{i=1}^n [(1 − θ)²]^{1{xi=MM}} [2θ(1 − θ)]^{1{xi=MN}} [θ²]^{1{xi=NN}}
= (1 − θ)^{∑_{i=1}^n (2·1{xi=MM} + 1{xi=MN})} θ^{∑_{i=1}^n (2·1{xi=NN} + 1{xi=MN})} 2^{∑_{i=1}^n 1{xi=MN}}.

The likelihood is

L(θ) = (1 − θ)^{2×342+500} θ^{2×187+500} 2^{500}.

Then, the maximum likelihood estimator (to be introduced later) of θ is θ̂ = 874/(1184 + 874) = 874/2058 ≈ 0.4247. □
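As a quick numerical check of the MLE above, the following Python sketch (an illustration added here, using only the counts 342, 500 and 187 from the example) evaluates θ̂ = (2·187 + 500)/(2·1029):

# Observed counts from the 1937 Hong Kong sample (Example 9.6)
n_MM, n_MN, n_NN = 342, 500, 187

# log L(theta) = (2*n_MM + n_MN)*log(1 - theta) + (2*n_NN + n_MN)*log(theta) + const,
# so the maximizer has the closed form below.
theta_hat = (2 * n_NN + n_MN) / (2 * (n_MM + n_MN + n_NN))
print(theta_hat)   # 874 / 2058 = 0.4247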

9.2. Sufficiency, exponential families and Rao-Blackwell theorem.

9.2.1. Sufficiency

Sufficiency is an important concept about statistics. Heuristically, a statistic (possibly multi-dimensional) is sufficient for a parameter θ if all information about θ contained in the data set is contained in this statistic. A formal definition is as follows.

Suppose X1, ..., Xn are iid ∼ Pθ, θ ∈ Θ, where Pθ refers to the population distribution indexed by the parameter θ. A statistic T = T(X1, ..., Xn) is called sufficient for θ if, for any t, the conditional distribution of (X1, ..., Xn) given T = t is free of θ (in other words, is the same for all θ).

Remark. 1. T and θ can be multi-dimensional. The general definition of sufficiency does not require that the data be observations of iid r.v.s. 2. Special examples: (X1, ..., Xn), as an n-dimensional statistic, is sufficient; X1 is not sufficient unless n = 1. 3. All information about θ contained in (X1, ..., Xn) is contained in T. Once T is fixed at any value t, the further variation of X1, ..., Xn does not depend on the parameter θ. In other words, (X1, ..., Xn) depends on θ only through T. 4. In general, low-dimensional sufficient statistics are desired, as they achieve data reduction without loss of information about the parameter.

Example 9.7. X1, ..., Xn are iid ∼ Bin(1, θ). Then,

L(θ) = θ^Y (1 − θ)^{n−Y}, where Y = ∑_{i=1}^n Xi.


One can verify that

P(X1 = x1, ..., Xn = xn | Y = k) = 1/C(n, k) if ∑_{i=1}^n xi = k, and 0 otherwise, where C(n, k) = n!/(k!(n − k)!) is the binomial coefficient.

This implies Y is indeed sufficient. □

In general, checking whether a statistic is sufficient through the above definition of sufficiency can sometimes be difficult. The following factorization theorem is useful.

9.2.2. Factorization theorem

Theorem 9.1 (Factorization theorem) Suppose

L(θ) = g(T (X1, ..., Xn); θ)h(X1, ..., Xn).

Then T is sufficient for θ.

Proof (heuristic). Assume the density exists. Let A1, ..., An be sets on the real line.

P(X1 ∈ A1, ..., Xn ∈ An | T(X1, ..., Xn) = t)

= [∫_{A1}···∫_{An} 1{T(x1,...,xn)=t} L(θ) dx1···dxn] / [∫···∫ 1{T(x1,...,xn)=t} L(θ) dx1···dxn]

= [∫···∫ ∏_{i=1}^n 1{xi ∈ Ai} 1{T(x1,...,xn)=t} L(θ) dx1···dxn] / [∫···∫ 1{T(x1,...,xn)=t} L(θ) dx1···dxn]

= [∫···∫ ∏_{i=1}^n 1{xi ∈ Ai} 1{T(x1,...,xn)=t} g(t; θ) h(x1, ..., xn) dx1···dxn] / [∫···∫ 1{T(x1,...,xn)=t} g(t; θ) h(x1, ..., xn) dx1···dxn]

= [∫···∫ ∏_{i=1}^n 1{xi ∈ Ai} 1{T(x1,...,xn)=t} h(x1, ..., xn) dx1···dxn] / [∫···∫ 1{T(x1,...,xn)=t} h(x1, ..., xn) dx1···dxn],

which is free of θ. □

Using this theorem, one can easily see that X(n) is sufficient for θ based on X1, ..., Xn iid ∼ Unif[0, θ].

9.2.3. The k-parameter exponential family (of distributions)

A family of distributions {Pθ : θ ∈ Θ} is called a k-parameter exponential family if 1) θ is of dimension k; and 2) there exist functions c1, ..., ck and statistics T1, ..., Tk of X such that

L(θ) = 1{X ∈ A} exp{ ∑_{i=1}^k ci(θ)Ti(X) + d(θ) + S(X) },

where L(θ) is the likelihood (density) of X ∼ Pθ. (Notice that A must be a set free of θ; A is the support, i.e., the collection of possible values of the r.v. X.)

(T1, ..., Tk) is called the natural sufficient statistic for the parameter θ. One should keep in mind that both θ and X here can be multi-dimensional.

It follows straightforwardly from the factorization theorem that (T1, ..., Tk) is sufficient for θ.

It can be checked straightforwardly that the binomial distribution Bin(K, θ), the Poisson distribution P(λ) and the exponential distribution E(λ) are all 1-parameter exponential families, while the normal distribution N(µ, σ²) is a 2-parameter exponential family. On the other hand, the uniform distribution Unif[0, θ] and the Cauchy distribution centered at θ are not exponential families.

9.2.4. Rao-Blackwell theorem.

It is conceptually clear that a good estimator of θ should be based on sufficient statistics. The Rao-Blackwell theorem says that, by conditioning an estimator on a sufficient statistic, the resulting new estimator is at least as accurate in terms of mean squared error.


Theorem 9.2 (Rao-Blackwell Theorem) Let θ̂ be an estimator of θ with E(θ̂²) < ∞. Suppose T (possibly multi-dimensional) is a sufficient statistic for θ. Let θ̃ = E(θ̂|T). Then θ̃, as an estimator of θ, has a mean squared error smaller than or equal to that of θ̂, i.e.,

E[(θ̃ − θ)²] ≤ E[(θ̂ − θ)²].

Remark. In terms of mean squared error (MSE), the accuracy of any estimator is improved (or at least not worsened) by conditioning it on a sufficient statistic.

Proof. E(θ̃) = E(θ̂). From Jensen's inequality,

E(θ̃²) = E({E(θ̂|T)}²) ≤ E{E(θ̂²|T)} = E(θ̂²).

Hence,

E[(θ̃ − θ)²] = E([θ̃ − E(θ̃)]²) + [E(θ̃) − θ]²
= E(θ̃²) − [E(θ̃)]² + [E(θ̃) − θ]²
≤ E(θ̂²) − [E(θ̂)]² + [E(θ̂) − θ]²
= E[(θ̂ − θ)²]. □

Remark. As seen from the proof, conditioning on any statistic would decrease, or at least not increase, the MSE. The key to conditioning on a sufficient statistic is that the conditional distribution does not depend on the unknown parameter, and thus the conditional expectation is computable. If the statistic conditioned on is not sufficient, the resulting conditional expectation may involve the unknown parameter and is therefore not computable (not a statistic).

Example 9.8. Suppose X1, ..., Xn are iid ∼ N(θ, 1). Then

1). X1, for example, is an estimator of θ. The MSE is 1.

2). X̄ is sufficient, and E(X1|X̄) = X̄. The MSE is 1/n, which is a significant improvement in accuracy over X1.

3). (X1, ..., Xn) is an n-dimensional sufficient statistic, and E(X1|X1, ..., Xn) = X1. No improvement in accuracy, but no loss either.

4). X2 is not a sufficient statistic, and E(X1|X2) = θ. It appears that conditioning on X2 produces the best possible estimator. However, the conditional expectation E(X1|X2) = θ is not a function of the observations and is not a statistic.
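The MSE reduction promised by the Rao-Blackwell theorem is easy to see in a small simulation. The Python sketch below (an added illustration; θ = 2, n = 10 and the number of replications are arbitrary choices) compares the Monte Carlo MSE of X1 with that of E(X1|X̄) = X̄ for the N(θ, 1) model:

import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 2.0, 10, 100_000
X = rng.normal(theta, 1.0, size=(reps, n))

mse_X1   = np.mean((X[:, 0] - theta) ** 2)          # estimator X_1, MSE close to 1
mse_Xbar = np.mean((X.mean(axis=1) - theta) ** 2)   # E(X_1 | Xbar) = Xbar, MSE close to 1/n
print(mse_X1, mse_Xbar)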


Chapter 10. Point Estimation

This chapter discusses the most fundamental problem of statistical analysis: estimation of the unknown parameters. We consider two parameter estimation methods and provide some criteria to measure the accuracy of estimation.

10.1. Method of moment

For any r.v. X, its k-th moment is defined as E(X^k). For example, E(X) is the 1st moment and E(X²) is the 2nd moment. Based on a random sample {X1, ..., Xn}, the sample k-th moment is defined as (1/n)∑_{j=1}^n Xj^k.

Example 10.1 X1, ..., Xn iid ∼ Bin(1, θ).

The population 1st moment is E(Xi) = θ. The sample 1st moment is X̄, the sample mean. The method of moment estimation equates the sample 1st moment with the population 1st moment and solves for θ. The solution of X̄ = θ is X̄, which is therefore the method of moment estimator of θ. □

Example 10.2 X1, ..., Xn iid ∼ P(λ).

The method of moment estimator of λ is also X̄, the sample mean. □

Example 10.3 X1, ..., Xn iid ∼ N(µ, σ2).

The method of moment estimation: Solve the equations

X̄ = µ,   (1/n)∑_{j=1}^n Xj² = E(Xi²) = µ² + σ².

The solutions µ̂ ≡ X̄ and σ̂² ≡ (1/n)∑_{j=1}^n Xj² − X̄² = (1/n)∑_{j=1}^n (Xj − X̄)² are the method of moment estimators of (µ, σ²). □

Example 10.4 X1, ..., Xn iid ∼ Unif [0, θ].

E(Xi) = θ/2. The method of moment estimation solves the equation X̄ = θ/2, and the estimator is 2X̄. □

In general, for a k-dimensional parameter θ, let gj(θ) = E(Xi^j) denote the population j-th moment.

The method of moment estimation solves the k equations

(1/n)∑_{j=1}^n Xj = g1(θ),  ······,  (1/n)∑_{j=1}^n Xj^k = gk(θ),

for θ. The solution, denoted θ̂, is the method of moment estimator of θ.
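For a concrete instance of these equations, the following Python sketch (an added illustration; the simulated sample and its parameter values are hypothetical) computes the method of moment estimates of (µ, σ²) for a normal sample by equating the first two sample moments with µ and µ² + σ²:

import numpy as np

def mom_normal(x):
    # Equate the first two sample moments with mu and mu^2 + sigma^2.
    x = np.asarray(x)
    m1, m2 = x.mean(), np.mean(x ** 2)
    return m1, m2 - m1 ** 2          # (mu_hat, sigma2_hat)

rng = np.random.default_rng(1)
print(mom_normal(rng.normal(3.0, 2.0, size=1000)))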

10.2. Maximum likelihood estimation (MLE)

Suppose L(θ) is the likelihood function based on the data {X1, ..., Xn}. The MLE of θ, denoted θ̂, is such that

L(θ̂) = max_{θ∈Θ} L(θ).

Let l(θ) = log(L(θ)) be the log-likelihood. Since the log function is monotone increasing, one may equivalently define the MLE of θ by

l(θ̂) = max_{θ∈Θ} l(θ).


Example 10.5 X1, ..., Xn iid ∼ Bin(1, θ).

L(θ) = ∏_{i=1}^n θ^{Xi}(1 − θ)^{1−Xi} = θ^{∑_{i=1}^n Xi}(1 − θ)^{n − ∑_{i=1}^n Xi}.

l(θ) = nX̄ log(θ) + n(1 − X̄) log(1 − θ),  and  l′(θ) = nX̄/θ − n(1 − X̄)/(1 − θ).

Solving l′(θ) = 0, we get the MLE of θ: θ̂ = X̄. □

Example 10.6 X1, ..., Xn iid ∼ P(λ).

L(λ) = ∏_{i=1}^n (1/Xi!) e^{−λ} λ^{Xi}.

l(λ) = log(L(λ)) = −nλ + nX̄ log(λ) − ∑_{i=1}^n log(Xi!),  and  l′(λ) = −n + nX̄/λ.

Solving l′(λ) = 0, we get the MLE of λ: λ̂ = X̄. □

Example 10.7 X1, ..., Xn iid ∼ N(µ, σ2).

L(µ, σ²) = ∏_{i=1}^n (1/√(2πσ²)) exp{−(Xi − µ)²/(2σ²)}.

l(µ, σ²) = log(L(µ, σ²)) = (1/2)∑_{i=1}^n ( −(Xi − µ)²/σ² − log(σ²) − log(2π) ).

l_µ(µ, σ²) = ∂l(µ, σ²)/∂µ = ∑_{i=1}^n (Xi − µ)/σ²,

l_{σ²}(µ, σ²) = ∂l(µ, σ²)/∂σ² = ∑_{i=1}^n ( (Xi − µ)²/(2σ⁴) − 1/(2σ²) ).

Solving the equations l_µ(µ, σ²) = 0 and l_{σ²}(µ, σ²) = 0, we get the MLE of (µ, σ²) as µ̂ = X̄ and σ̂² = (1/n)∑_{i=1}^n (Xi − X̄)² = (1 − 1/n)s², where s² is the sample variance. □

Example 10.8 X1, ..., Xn iid ∼ Unif [0, θ].

L(θ) = ∏_{i=1}^n (1/θ)1{0 ≤ Xi ≤ θ} = (1/θ^n) 1{X(n) ≤ θ} 1{X(1) ≥ 0}, where X(1) and X(n) are the minimum and maximum of the observations.

So L(θ) = 0 if θ < X(n) and L(θ) = 1/θ^n if θ ≥ X(n). The maximum of L(θ) is 1/X(n)^n, and it is achieved when θ = X(n). Therefore the MLE of θ is θ̂ = X(n). □
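When the likelihood equation has no closed-form solution, as in the Cauchy model of Example 9.4, the MLE can be found numerically. The Python sketch below (an added illustration; the simulated sample centered at 1.5 is hypothetical) maximizes the Cauchy log-likelihood with scipy:

import numpy as np
from scipy.optimize import minimize_scalar

def neg_loglik_cauchy(theta, x):
    # -log L(theta) for the Cauchy location model of Example 9.4
    return np.sum(np.log(np.pi * (1.0 + (x - theta) ** 2)))

rng = np.random.default_rng(2)
x = rng.standard_cauchy(200) + 1.5          # hypothetical sample centered at 1.5
res = minimize_scalar(neg_loglik_cauchy, args=(x,), bounds=(x.min(), x.max()),
                      method="bounded")
print(res.x)    # numerical MLE of theta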

10.3. Unbiasedness and UMVUE

Definition. A statistic θ̂ = θ̂(X1, ..., Xn) is called an unbiased estimator of a parameter θ if E(θ̂) = θ. The quantity E(θ̂) − θ is called the bias of the estimator θ̂.

Example 10.9 X1, ..., Xn iid with mean µ and variance σ2.

The sample mean X̄ is an unbiased estimator of µ. The sample variance s² is an unbiased estimator of σ². Then, σ̂² ≡ (1 − 1/n)s² is a biased estimator of σ², and the bias is

E(σ̂²) − σ² = E( ((n − 1)/n) s² ) − σ² = ((n − 1)/n) σ² − σ² = −σ²/n.

Exercise Verify that s2 is an unbiased estimator of σ2, and that s is a biased estimator of σ.

Example 10.10 X1, ..., Xn iid ∼ Unif [0, θ].

The method of moment estimator θ̂ = 2X̄ is an unbiased estimator of θ.


The MLE θ̂ = X(n) is a biased estimator, and the bias is

E(θ̂) − θ = ∫_0^∞ P(X(n) > t) dt − θ = ∫_0^θ P(X(n) > t) dt − θ
= ∫_0^θ [1 − P(X(n) ≤ t)] dt − θ = ∫_0^θ [1 − (t/θ)^n] dt − θ = θ − θ∫_0^1 t^n dt − θ
= −θ/(n + 1).

Remark. Interpretation of unbiasedness: The estimator, say θ̂, is constructed based on the data set collected from a population with parameter θ. Suppose the same data collection procedure is repeated a large number, say M, of times, and each time an estimate is calculated. Then the average of these M estimates is close to θ as long as M is large enough.

Unbiasedness does not necessarily imply that the estimator is accurate. For example, X1 (using one single observation) is an unbiased estimator of the population mean µ, but it is obviously not an accurate estimator. The mean squared error (MSE) is, in general, a more popular criterion to measure the accuracy of estimation. The MSE of θ̂ is

E[(θ̂ − θ)²] = var(θ̂) + [E(θ̂) − θ]² = var(θ̂) + bias².

Example 10.11 X1, ..., Xn iid ∼ Unif [0, θ].

E(Xi) = θ/2 and var(Xi) = θ2/12.

For the method of moment estimator θ̂ = 2X̄, the MSE is

E[(θ̂ − θ)²] = var(2X̄) = (4/n)var(Xi) = θ²/(3n).

For the MLE θ̂ = X(n), which is biased, the MSE is

E[(θ̂ − θ)²] = var(θ̂) + bias²
= E(θ̂²) − [E(θ̂)]² + [−θ/(n + 1)]² = ∫_0^θ t² d(t/θ)^n − [nθ/(n + 1)]² + θ²/(n + 1)²
= θ²∫_0^1 n t^{n+1} dt − ((n² − 1)/(n + 1)²)θ² = (2/((n + 1)(n + 2)))θ².

Comparing the method of moment estimator with the MLE, we find that, although the MLE is biased and the method of moment estimator is unbiased, the MLE is much more accurate than the method of moment estimator in terms of MSE, especially when n is large.
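This comparison is easy to reproduce by simulation. The Python sketch below (an added illustration; θ = 5, n = 20 and the number of replications are arbitrary) estimates both MSEs by Monte Carlo; the values should be close to θ²/(3n) and 2θ²/((n + 1)(n + 2)), respectively:

import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 5.0, 20, 100_000
X = rng.uniform(0.0, theta, size=(reps, n))

mom = 2 * X.mean(axis=1)          # method of moment estimator 2*Xbar
mle = X.max(axis=1)               # MLE X_(n)
print(np.mean((mom - theta) ** 2), np.mean((mle - theta) ** 2))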

Searching for the estimator with the smallest MSE, for all θ ∈ Θ, is in general not possible. This is because certain trivial but obviously unreasonable estimators may be extremely accurate, in terms of MSE, for some values of the parameter. For example, let θ̂ ≡ 1. The MSE of θ̂ is (1 − θ)², which is 0 if the true θ is indeed 1. If, instead, we restrict attention to unbiased estimators, it is possible to find the estimator with the smallest MSE.

Definition. Suppose θ̂ is an unbiased estimator whose MSE (variance) is the smallest among all unbiased estimators, for all θ ∈ Θ. Then θ̂ is called the uniformly minimum variance unbiased estimator (UMVUE).

We already know that conditioning an estimator on a sufficient statistic reduces (or at least does not increase) the MSE, by the Rao-Blackwell theorem. To achieve the minimum variance for an unbiased estimator, the sufficient statistic is better if it contains little or no variation that is unrelated to the parameter θ. Roughly speaking, it is best if it contains only the information about θ and no redundant information. This leads to the concept of completeness of a statistic.

Definition. Suppose X1, ..., Xn ∼ Pθ. A statistic T = T(X1, ..., Xn) (possibly multi-dimensional) is called a complete statistic if E(g(T)) = Eθ(g(T)) = 0 for all θ ∈ Θ implies g(·) = 0.

Interpretation: If T is complete, no non-trivial function of T exists whose distribution is unrelated to θ. (By a trivial function, we mean a constant function.) If there existed such a function, say g(·), then E(g(T)) would be a constant unrelated to θ, and by the definition of completeness, E[g(T) − E(g(T))] = 0 would imply g(·) = E(g(T)), a trivial function. Therefore, any non-trivial function of T has a distribution related to θ. We may say that T contains only information about θ; no information or variation unrelated to θ is contained in T.

Example 10.12. X1, ..., Xn ∼ Pθ.

1). T = c (constant) is complete but not sufficient.

2). The entire data set (X1, ..., Xn), viewed as an n-dimensional statistic, is generally not complete but it is sufficient. For example, E(X1 − X2) = 0 and X1 − X2 is a nontrivial function of (X1, ..., Xn).

3). (X3, X2) is not complete and not sufficient.

4). (X̄, σ̂²) is sufficient and complete if the population is N(µ, σ²). □

Exercise. Let g(·) be a function and T a statistic. Show that if T is complete, so is g(T), and if g(T) is sufficient, so is T.

The following theorem states that conditioning any unbiased estimator on a sufficient and complete statistic produces the UMVUE.

Theorem 10.1 Suppose θ̂ is an unbiased estimator of θ and T is a sufficient and complete statistic. Then E(θ̂|T) is the unique UMVUE.

Proof. Consider any unbiased estimator θ̃. It follows from the Rao-Blackwell theorem that var(E(θ̃|T)) ≤ var(θ̃). Since θ̂ and θ̃ are both unbiased,

E( E(θ̂|T) − E(θ̃|T) ) = E(θ̂ − θ̃) = θ − θ = 0.

Since E(θ̂|T) − E(θ̃|T) is a function of T, the completeness of T implies E(θ̂|T) − E(θ̃|T) = 0, i.e., E(θ̂|T) = E(θ̃|T). Hence var(E(θ̂|T)) = var(E(θ̃|T)) ≤ var(θ̃). Consequently, E(θ̂|T) is the UMVUE and is unique. □

A direct search for a complete and sufficient statistic is usually not easy. However, for k-parameter exponential families, the natural sufficient statistic can be proved to be sufficient and complete.

Example 10.13 X1, ..., Xn ∼ N(µ, σ2).

This is an exponential family, and (X̄, σ̂²) as well as (X̄, s²) are sufficient and complete statistics.

Therefore (X̄, s²) is the UMVUE of (µ, σ²). □

Example 10.14 X1, ..., Xn ∼ P(λ).

X̄ is sufficient and complete for λ, and therefore X̄ is the UMVUE of λ. □

Example 10.15 X1, ..., Xn ∼ Unif [0, θ].

X(n) is sufficient. We show it is also complete.

Suppose E(g(X(n))) = 0 for all θ > 0.

E(g(X(n))) = ∫_0^θ g(x) dP(X(n) ≤ x) = ∫_0^θ g(x) d(x/θ)^n = (n/θ^n) ∫_0^θ g(t) t^{n−1} dt = 0.


Then ∫_0^θ g(t)t^{n−1} dt = 0 for all θ > 0, implying g(x) = 0 for all x > 0. Therefore X(n) is complete.

As a result, ((n + 1)/n) X(n) is the UMVUE of θ. □

10.4. Fisher information and Cramer-Rao lower bound

Suppose X ∼ Pθ with density (or probability function) fθ. X could be multi-dimensional. Define

I(θ) = E[ ( (∂/∂θ) log fθ(X) )² ] = −E[ (∂²/∂θ²) log fθ(X) ].

Then, I(θ) is called Fisher information (about parameter θ contained in X).

1). The larger I(θ) is, the more information about θ is contained in X; in other words, X is more sensitive to changes of the parameter value θ. Imagine that X is irrelevant to θ, i.e., fθ(·) remains the same for all θ. Then its derivative is 0 and therefore I(θ) is 0.

2). Suppose X1, ..., Xn are iid with common density fθ(·). Then the information about θ contained in each Xi is I(θ) as given above, and the information about θ contained in (X1, ..., Xn) is In(θ) = nI(θ). (Please show.)

3). The above definition requires certain regularity conditions. Among these conditions, the differentiability of fθ with respect to θ, for example, is required.

Example 10.16 X1, ..., Xn ∼ P(λ).

Observe that

log fλ(Xi) = log( (1/Xi!) e^{−λ} λ^{Xi} ) = Xi log λ − λ − log Xi!.

Therefore,

(∂²/∂λ²) log fλ(Xi) = −Xi/λ².

So the information about λ contained in each Xi is I(λ) = E(Xi/λ²) = 1/λ, and the information about λ contained in X1, ..., Xn is In(λ) = n/λ. □

Remark. If θ is multi-dimensional, say p-dimensional, I(θ) can likewise be defined as a p × p matrix. But we shall confine our attention to the Fisher information for one-dimensional θ.
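Since I(λ) = E[((∂/∂λ) log fλ(X))²], the Fisher information can also be checked by simulating the squared score. The Python sketch below (an added illustration with the arbitrary choice λ = 3) compares the Monte Carlo average of (X/λ − 1)² with 1/λ:

import numpy as np

rng = np.random.default_rng(4)
lam = 3.0
X = rng.poisson(lam, size=1_000_000)

score = X / lam - 1.0                 # d/d lambda of log f_lambda(X)
print(np.mean(score ** 2), 1 / lam)   # Monte Carlo estimate of I(lambda) vs 1/lambda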

Theorem 10.2 (Cramer-Rao (Fisher information) lower bound) Recall I(θ) defined above. Under some regularity conditions,

var(T(X)) ≥ (ψ′(θ))² / I(θ),

where ψ(θ) = E(T(X)) and ψ′(θ) = (∂/∂θ)ψ(θ). In particular, if ψ(θ) = θ, then var(T(X)) ≥ 1/I(θ).

Interpretation: Suppose T(X) (again, X can be multi-dimensional) is an unbiased estimator of ψ(θ). The above theorem provides a lower bound on the MSE or variance of T(X), under some regularity conditions. If T(X) achieves this lower bound on its variance, then T(X) is optimal in the sense of MSE among all unbiased estimators of ψ(θ). Notice that the regularity conditions, including differentiability of fθ w.r.t. θ, are not satisfied in some cases, such as Unif[0, θ].

Proof. (Heuristic)

ψ′(θ) = (∂/∂θ)E(T(X)) = ∫ T(x) (∂/∂θ)fθ(x) dx = ∫ T(x) [(∂/∂θ)fθ(x)/fθ(x)] fθ(x) dx
= E( T(X) (∂/∂θ) log fθ(X) ) = Cov( T(X), (∂/∂θ) log fθ(X) ),

since (∂/∂θ) log fθ(X) has mean 0. Then, by the Cauchy-Schwarz inequality,

(ψ′(θ))² ≤ var(T(X)) var( (∂/∂θ) log fθ(X) ) = var(T(X)) I(θ),

which yields the desired result. □

Remark. Suppose X1, ..., Xn ∼ Pθ with common density fθ. Suppose T = T(X1, ..., Xn) is a statistic with mean ψ(θ). Then,

var(T(X1, ..., Xn)) ≥ (ψ′(θ))² / (nI(θ)).

10.5 Consistency and asymptotic normality

Definition. A sequence of estimators θ̂ = θ̂(X1, ..., Xn) of a parameter θ is said to be consistent if θ̂ → θ in probability.

If √n(θ̂ − θ) → N(0, σ²) in distribution, we say θ̂ is consistent and asymptotically normal (at the root-n rate).

If √n(θ̂ − θ) → N(0, I^{−1}(θ)), we say θ̂ is asymptotically efficient.

Exercise. If √n(θ̂ − θ) → N(0, σ²) and g is a smooth function, then √n(g(θ̂) − g(θ)) → N(0, (g′(θ))²σ²), where g′(·) is the derivative of g(·).


Chapter 11. Interval estimation and tests of hypotheses

11.1 Confidence intervals.

Point estimation provides a single vector, say θ̂, which can be viewed as a point in real space, to estimate the unknown parameter, say θ, of a certain dimension, which is again a point in real space. However, an assessment of the estimation error, θ̂ − θ, or of the accuracy of the estimation is not provided automatically with the point estimate. The difficulty in assessing the accuracy of estimation lies in two aspects: the parameter θ is unknown, and θ̂ is random. A confidence interval (C.I.), or interval estimation, addresses the problem by providing an interval, rather than a point, that attempts to cover the unknown parameter with a certain level of confidence (chance).

Example 11.1 X1, ..., Xn ∼ N(µ, σ2).

Key facts: 1) √n(X̄ − µ)/σ ∼ N(0, 1); 2) √n(X̄ − µ)/s ∼ t_{n−1}; 3) (n − 1)s²/σ² ∼ χ²_{n−1}.

If σ is known, fact 1) implies

P( −z(α/2) ≤ √n(X̄ − µ)/σ ≤ z(α/2) ) = 1 − α,

where z(a) is such that P(N(0, 1) > z(a)) = a. It follows that

P( X̄ − z(α/2)σ/√n ≤ µ ≤ X̄ + z(α/2)σ/√n ) = 1 − α.

Then, µ is in the interval

[X̄ − z(α/2)σ/√n, X̄ + z(α/2)σ/√n]

with chance 1 − α. We say that the above interval is a confidence interval for µ at confidence level 1 − α. For notational simplicity, we also write the above interval as X̄ ± z(α/2)σ/√n.

If σ2 is unknown, it can be similarly derived, based on fact 2) that

P( X̄ − t_{n−1}(α/2)s/√n ≤ µ ≤ X̄ + t_{n−1}(α/2)s/√n ) = 1 − α,

where t_{n−1}(a) is such that P(t_{n−1} > t_{n−1}(a)) = a. As a result, X̄ ± t_{n−1}(α/2)s/√n is a C.I. for µ at confidence level 1 − α.

Suppose, for example, n = 9, X̄ = 0.223 and s² = 1.21. If σ is known to be 1.1, the C.I. for µ at level 95% is X̄ ± z(0.025)σ/√n = 0.223 ± 1.96 × 1.1/3 = 0.223 ± 0.719 = [−0.496, 0.942], and the 99% C.I. is X̄ ± z(0.005)σ/√n = 0.223 ± 2.576 × 1.1/3 = 0.223 ± 0.945.

If, on the other hand, σ is unknown, a t-distribution based C.I. for µ at level 95% is X̄ ± t8(0.025)s/√n = 0.223 ± 2.306 × 1.1/3 = 0.223 ± 0.846, and the 99% C.I. is X̄ ± t8(0.005)s/√n = 0.223 ± 3.355 × 1.1/3 = 0.223 ± 1.23. □
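These intervals can be reproduced with a few lines of Python (an added illustration using the numbers n = 9, X̄ = 0.223, s² = 1.21 and σ = 1.1 from the example):

import numpy as np
from scipy import stats

n, xbar, s2, sigma = 9, 0.223, 1.21, 1.1
s = np.sqrt(s2)

z = stats.norm.ppf(1 - 0.05 / 2)          # 1.96
t = stats.t.ppf(1 - 0.05 / 2, df=n - 1)   # 2.306

print(xbar - z * sigma / np.sqrt(n), xbar + z * sigma / np.sqrt(n))  # known sigma
print(xbar - t * s / np.sqrt(n), xbar + t * s / np.sqrt(n))          # unknown sigma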

Observe that the normality-based C.I.s are shorter than the t-distribution based C.I.s at the same confidence level. This reflects the fact that knowing the true standard deviation σ indeed improves the accuracy statement.

Exercise. Based on fact 3), show that [(n − 1)s²/χ²_{n−1}(α/2), (n − 1)s²/χ²_{n−1}(1 − α/2)] is a confidence interval for σ², where χ²_k(a) is such that P(χ²_k > χ²_k(a)) = a.

Example 11.2 X1, ..., Xn ∼ Unif [0, θ].

Key fact: X(n)/θ follows a distribution with density f(x) = nx^{n−1}, x ∈ [0, 1]. Then,

P( X(n)/θ ≤ (α/2)^{1/n} ) = ((α/2)^{1/n})^n = α/2,

and

P( X(n)/θ > (1 − α/2)^{1/n} ) = 1 − ((1 − α/2)^{1/n})^n = α/2.

Consequently,

P( (α/2)^{1/n} ≤ X(n)/θ ≤ (1 − α/2)^{1/n} ) = 1 − α/2 − α/2 = 1 − α.

Hence, [X(n)/(1 − α/2)^{1/n}, X(n)/(α/2)^{1/n}] is a C.I. for θ at confidence level 1 − α. □

As seen in the above example, to construct a C.I. for a parameter, the commonly adopted approach is to find the distribution of some r.v. containing the parameter. Such r.v.s are often based on the difference between the estimator and the target parameter, as in Example 11.1, or on their ratio, as in Example 11.2 or the Exercise. Confidence intervals are not unique. When the exact distributions of the r.v.s are not easy to identify, asymptotic distributions can be applied to construct C.I.s at approximate confidence levels.

Asymptotic normality-based C.I.s. Suppose √n(θ̂ − θ) → N(0, σ²). Then,

P( √n|θ̂ − θ|/σ ≤ z(α/2) ) ≈ 1 − α.

If σ is known, θ̂ ± z(α/2)σ/√n is a C.I. for θ at approximate level 1 − α.

If σ is unknown but can be consistently estimated by σ̂, then θ̂ ± z(α/2)σ̂/√n is a C.I. for θ at approximate level 1 − α.

Oftentimes σ = g(θ), where g is a known function. Then one may solve the inequality √n|θ̂ − θ|/g(θ) ≤ z(α/2) for θ to obtain a C.I. at approximate level 1 − α. This is usually slightly more accurate than the simpler C.I. θ̂ ± z(α/2)g(θ̂)/√n.

Example 11.3 X1, ..., Xn ∼ Bin(1, θ).

It follows from the central limit theorem that √n(X̄ − θ) → N(0, σ²),

where σ² = θ(1 − θ), which is a function of θ. Since X̄ is a consistent estimator of θ, a simple way of constructing a C.I. at approximate level 1 − α is

X̄ ± z(α/2)√(X̄(1 − X̄))/√n.

An alternative and slightly better way is to solve the inequality

√n|X̄ − θ| / √(θ(1 − θ)) ≤ z(α/2),

and obtain the C.I. at approximate level 1 − α as

(2nX̄ + z²(α/2)) / (2(n + z²(α/2)))  ±  z(α/2)√(4nX̄(1 − X̄) + z²(α/2)) / (2(n + z²(α/2))).

Since θ is in [0, 1], one can modify the above C.I.s, say [a, b], by restricting them to lie inside [0, 1], namely, modify [a, b] to [min(1, max(a, 0)), min(1, max(b, 0))]. □
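Both the simple interval and the one obtained by solving the inequality can be computed directly. The Python sketch below (an added illustration; the inputs X̄ = 0.26 and n = 100 are arbitrary) implements the two formulas and truncates them to [0, 1]:

import numpy as np
from scipy import stats

def simple_ci(xbar, n, alpha=0.05):
    z = stats.norm.ppf(1 - alpha / 2)
    half = z * np.sqrt(xbar * (1 - xbar) / n)
    return max(0.0, xbar - half), min(1.0, xbar + half)

def solved_ci(xbar, n, alpha=0.05):
    # Solve  n*(xbar - theta)^2 <= z^2 * theta*(1 - theta)  for theta.
    z = stats.norm.ppf(1 - alpha / 2)
    centre = (2 * n * xbar + z**2) / (2 * (n + z**2))
    half = z * np.sqrt(4 * n * xbar * (1 - xbar) + z**2) / (2 * (n + z**2))
    return max(0.0, centre - half), min(1.0, centre + half)

print(simple_ci(0.26, 100), solved_ci(0.26, 100))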

Remark. Confidence intervals may be one-sided; in other words, the left endpoint of the interval may be −∞, or the right endpoint of the interval may be ∞. For example, (−∞, X̄ + t_{n−1}(α)s/√n] is a C.I. for µ at level 1 − α, based on X1, ..., Xn ∼ N(µ, σ²).

Remark. When a parameter is multi-dimensional, say p-dimensional, a confidence region in p-dimensional real space can be constructed to attempt to contain the parameter. In particular, simultaneous C.I.s can be constructed for each component of the parameter.


Example 11.4 (Two sample problem) Johns Hopkins Regional Talent Searches.

Talent tests are given to males and females randomly selected from the population of the United States, and the results are summarized in the following table:

Group Sample sizes Sample means Sample standard deviations

Males 19883 416 87

Females 19937 386 74

Based on these data, construct a confidence interval for the mean difference in test results between all American males and females.

This is a standard two sample problem. The set-up is

X1, ..., X_{n1} are iid ∼ N(µ1, σ1²) and Y1, ..., Y_{n2} are iid ∼ N(µ2, σ2²). The first sample, the Xi's, is independent of the second sample, the Yj's. We are concerned with the comparison of µ1 and µ2.

Very often one can assume equal variances for the two population distributions, namely σ1² = σ2². Suppose this is true for this example. Under this assumption, σ1² (or σ2²) can be estimated by the so-called pooled estimator

s²_pooled = [(n1 − 1)s_x² + (n2 − 1)s_y²] / (n1 + n2 − 2).

Key fact:

( X̄ − Ȳ − (µ1 − µ2) ) / √( s²_pooled (1/n1 + 1/n2) ) ∼ t_{n1+n2−2}.

Therefore, a C.I. for µ1 − µ2 at level 1 − α is

X̄ − Ȳ ± t_{n1+n2−2}(α/2) s_pooled √(1/n1 + 1/n2).

In this example,

s²_pooled = [(n1 − 1)s_x² + (n2 − 1)s_y²] / (n1 + n2 − 2) = (19882 × 87² + 19936 × 74²) / (19883 + 19937 − 2) = 6520.

As a result, the C.I. for µ1 − µ2 at level 95% is 30 ± 1.583, and the C.I. at level 99% is 30 ± 2.085. Notice that 0 is not inside these C.I.s, which provides evidence that µ1 − µ2 is not zero. The interpretation of the confidence intervals is that American males and females differ, on average, in their talent test results. □
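The pooled estimate and the intervals can be verified with the following Python sketch (an added illustration using the summary numbers from the table above):

import numpy as np
from scipy import stats

n1, xbar, sx = 19883, 416, 87
n2, ybar, sy = 19937, 386, 74

s2_pooled = ((n1 - 1) * sx**2 + (n2 - 1) * sy**2) / (n1 + n2 - 2)
se = np.sqrt(s2_pooled * (1 / n1 + 1 / n2))

for level in (0.95, 0.99):
    t = stats.t.ppf(1 - (1 - level) / 2, df=n1 + n2 - 2)
    print(level, xbar - ybar - t * se, xbar - ybar + t * se)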

Remark. For the two-sample problem, if the two population variances are known, or are unknown and unequal, methods to construct C.I.s are also available.

11.2 Test of Hypotheses: Examples and Basic Concepts.

We use the following examples to review the basic framework of statistical hypothesis testing.

Example 11.5 (Cobra Cheese Company testing of natural milk.) Cobra Cheese Company buys milk from several suppliers as the essential material for its cheese products. The company wishes to set up a rule to guard against suppliers who might add extra water to the milk. Excess water can be detected by measuring the freezing temperature of the milk. The freezing temperature of natural milk follows a normal distribution with mean µ0 = −.545°C and standard deviation σ = 0.008. Excess water raises the freezing temperature of the milk towards 0°C. The company draws 5 samples from each batch of supplied milk and measures the freezing temperatures of these 5 samples.

Question a): How should a reasonable rule be set up based on the five samples? b): Suppose the sample average freezing temperature from a particular milk supplier is −0.538; how should one judge whether the milk is natural or not?


1). Theoretical setup:

X1, ..., Xn ∼ N(µ, σ²),

where n = 5 and σ² = 0.008². Here Xi represents the freezing temperature of the i-th sample.

2). Hypotheses: H0 : µ = µ0 = −.545 vs. Ha : µ > µ0. Notice that H0 (the null hypothesis) implies the milk is natural, and Ha (the alternative hypothesis) implies the milk is watered. In general, the null hypothesis and the alternative are asymmetric. The null hypothesis is usually the default position one takes when no testing is conducted.

3). Test or decision rule:

A statistical test is a decision rule, based on data, to reject or accept the null hypothesis. In this example, since the sample mean X̄ is a proxy for the population mean µ, it is natural to use a test of the form:

Reject H0 when (if and only if) X̄ > c,

where c is to be specified, depending on the requirements on the test. A test can be described by its rejection region:

R = {(X1, ..., Xn) : X̄ > c} = {X̄ > c},

which is the set of data values that calls for rejection of the null hypothesis.

4). Measuring the properties of a test: There are only two types of errors one can make in testing hypotheses: rejecting H0 when H0 is true, or rejecting Ha when Ha is true. The former is called a Type I error and the latter a Type II error. An ideal test procedure would keep the chances of both Type I and Type II errors small; unfortunately, that is in general impossible. A prescribed or required quantity α such that

α ≥ P(Reject H0 | H0)

is called the significance level. Notice that the probability of a type I error above can be a function of the parameter under H0. Therefore, a test of significance level α is also of significance level α′ for any α′ ≥ α. The smallest significance level of a test is called the size of the test. The chance of a type II error is

β = P(Accept H0 | Ha) = P(Reject Ha | Ha),

which is also a function of the parameter under Ha. This function is called the operating characteristic curve (OC curve). And 1 − β, as a function of the parameter under Ha, is called the power function of the test, representing the chance of making the correct decision of rejecting H0 when Ha is true. For a given significance level, the larger the power, the better the test.

For question a), suppose now we let c = −.539. Then the test is

Reject H0 when X̄ > −.539.

The chance of type I error is

P(Type I error | H0) = P(Reject H0 | H0) = P(X̄ > −.539 | µ = −.545)
= P( N(0, 1) > √5(−.539 + .545)/.008 ) = 0.047.

This test has size 0.047, and significance level 0.047 (or above). Its power function is, for µ > −.545 (i.e., under Ha):

P(Reject H0 | Ha) = P(X̄ > −.539 | µ)
= P( √5(X̄ − µ)/.008 > √5(−.539 − µ)/.008 )
= P( N(0, 1) > √5(−.539 − µ)/.008 ) = Φ( √5(µ + .539)/.008 ),

which is an increasing function of µ on (−.545, ∞).


5). p-values.

The p-value is an important index that measures the strength of the evidence contained in the data against the null hypothesis. The smaller the p-value, the stronger the evidence against H0. The p-value is the probability, computed assuming H0 is true, that the test statistic would take a value as extreme as or more extreme (in the direction favoring Ha) than the actually observed value. Suppose the p-value of a test is 3%. The interpretation is: assuming H0 is true and the same data collection and testing procedure is repeated a large number of times, only 3% of the time would the collected data be as “bad” as or “worse” than the present data.

For question b), suppose X̄ = −.538°C. Then, the p-value is

P(X̄ > −.538 | H0) = 1 − Φ( √5(−.538 + .545)/.008 ) = 0.025,

which is quite small, indicating strong evidence of watered milk. The argument is the following. Assuming the milk is natural, suppose the collection of the 5 samples is repeated a large number of times, say 1000. Only 2.5% of the time (25 times out of 1000) would the average freezing temperature be as high as or higher than −.538. As such a phenomenon is quite rare (2.5%), we can reasonably conclude that the assumption H0 is false. □
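The p-value calculation for question b) can be reproduced as follows (an added Python illustration using µ0 = −0.545, σ = 0.008, n = 5 and X̄ = −0.538 from the example):

import numpy as np
from scipy import stats

mu0, sigma, n = -0.545, 0.008, 5
xbar = -0.538

z = np.sqrt(n) * (xbar - mu0) / sigma
p_value = 1 - stats.norm.cdf(z)       # P(Xbar > -0.538 | mu = mu0)
print(z, p_value)                     # roughly 1.96 and 0.025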

Remark. Duality between hypothesis testing and confidence intervals: a hypothesis test can sometimes be conducted through a confidence interval. A C.I. can be interpreted as a set of likely values of the parameter. Consider the simple example of X1, ..., Xn ∼ N(µ, 1).

1). Suppose we are testing H0 : µ = µ0 vs. Ha : µ ≠ µ0. Recall that a C.I. at confidence level 1 − α is X̄ ± z(α/2)/√n. Then, a test based on the C.I. at significance level α can be

Reject H0 when µ0 ∉ X̄ ± z(α/2)/√n.

2). Suppose we are testing H0 : µ = µ0 vs. Ha : µ > µ0. A one-sided C.I. at confidence level 1 − α is (X̄ − z(α)/√n, ∞). Then, a test based on the C.I. at significance level α can be

Reject H0 when µ0 ∉ (X̄ − z(α)/√n, ∞).

3). Suppose we are testing H0 : µ = µ0 vs. Ha : µ < µ0. A one-sided C.I. at confidence level 1 − α is (−∞, X̄ + z(α)/√n). Then, a test based on the C.I. at significance level α can be

Reject H0 when µ0 ∉ (−∞, X̄ + z(α)/√n).

Example 11.6 (Swain vs. Alabama)

In 1965, the U.S. Supreme Court decided the case of Swain vs. Alabama. Swain, a black man, was convicted in Talladega County, Alabama, of assaulting a white woman. He was sentenced to death and was later executed in 1966. The case was appealed to the Supreme Court on the grounds that there were no blacks on the jury; moreover, no black “within the memory of persons now living has ever served on any petit jury in any civil or criminal case tried in Talladega County, Alabama.” The Supreme Court denied the appeal, on the following grounds. As provided by Alabama law, the jury was selected from a panel of about 100 persons. There were 8 blacks on the panel. (They did not serve on the jury because they were “struck,” or removed, through a maneuver called peremptory challenges by the prosecution. Such challenges were until quite recently constitutionally protected.) The Supreme Court ruled that the presence of 8 blacks on the panel showed “The overall percentage disparity has been small and reflects no studied attempt to include or exclude a specified number of blacks.” At that time in Alabama, only men over the age of 21 were eligible for jury duty (!). There were 16,000 such men in Talladega County, of whom about 26% were black. (The story can be found in, for example, http://www.stat.ucla.edu/cases/swain/)


Analysis: We focus on the problem of whether the selection of the 100-person panel was biased against blacks. The theoretical setup is

X1, ..., Xn ∼ Bin(1, p),

where Xi = 1 if the i-th panel member is black and 0 otherwise, and n = 100. (The actual selection is without replacement, while our iid assumption corresponds to sampling with replacement. The difference is negligible if the population size, here 16,000, is large.)

To answer the question, the standard statistical procedure is to consider hypotheses:

H0 : p = 0.26 vs. Ha : p < 0.26.

Clearly, H0 implies the selection of the panel is indeed random and fair, while Ha implies the selection of the panel is biased against blacks, namely that blacks are under-represented.

The estimator of p is X̄, and it is a reasonable test statistic. A large value of X̄ favors H0 and a small value of X̄ favors Ha. A reasonable test is

Reject H0 when X̄ ≤ c

for some constant c. For the actual case, with 8 blacks in the panel, the p-value is

P(X̄ ≤ 8/100 | H0) = P(Bin(100, 0.26) ≤ 8) ≈ P( N(0, 1) ≤ (8 − 26)/√(100 × .26 × .74) ) ≈ 0.00002,

which is extremely small. Therefore, we conclude there is abundant statistical evidence indicating that the alternative Ha is true. The interpretation of the p-value is: suppose the actual selection of the panel were fair and were repeated, say, 1 million times; only about 20 of those panels would contain 8 or fewer blacks. One might also find the exact value of P(Bin(100, 0.26) ≤ 8) using statistical software. □
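The normal approximation and the exact binomial probability mentioned above can be computed as follows (an added Python illustration; scipy is used only to evaluate the two probabilities):

from scipy import stats

# Exact and normal-approximation values of P(Bin(100, 0.26) <= 8)
p_exact = stats.binom.cdf(8, 100, 0.26)
p_approx = stats.norm.cdf((8 - 26) / (100 * 0.26 * 0.74) ** 0.5)
print(p_exact, p_approx)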

11.3. UMP and Neyman-Pearson Lemma.

As with estimation, there is also a notion of optimality for tests of statistical hypotheses.

Definition (Uniformly Most Powerful Test, UMP test) A test at significance level α is called a UMP test at level α if its chance of type II error is the smallest among all tests at significance level α.

Since, as argued before, the chances of type I and type II errors cannot both be kept small, we only compare tests at the same significance level α or, in other words, with the chance of type I error controlled at a given level α. The comparison is then on the chance of type II error, which is the smaller the better; equivalently, the larger the power function, the better the test. The best such test is called the UMP test.

The following Neyman-Pearson lemma shows that, in the simplest case, the likelihood ratio test is best in that it produces a UMP test.

Theorem 11.1 (Neyman-Pearson Lemma) Let L(θ; x) be the likelihood function of the parameter θ based on X (which could be multi-dimensional). Consider the hypotheses:

H0 : θ = θ0 vs. Ha : θ = θ1

Suppose the test

Reject H0 when L(θ1; x)/L(θ0; x) > c > 0

has size α. Then this test is a UMP test at significance level α.

Proof. Let R = {x : L(θ1; x)/L(θ0; x) > c} be the rejection region of the above likelihood ratio test. Then,

α = P(R | H0) = ∫_R L(θ0; x) dx.


Consider another test with significance level α and let R* be its rejection region. Then

α ≥ ∫_{R*} L(θ0; x) dx.

Taking the difference, we have

∫_R L(θ0; x) dx − ∫_{R*} L(θ0; x) dx = ∫_{R\R*} L(θ0; x) dx − ∫_{R*\R} L(θ0; x) dx ≥ 0.

Since L(θ1; x)/L(θ0; x) > c for x ∈ R and L(θ1; x)/L(θ0; x) ≤ c for x ∉ R, we get

∫_{R\R*} L(θ1; x) dx = ∫_{R\R*} [L(θ1; x)/L(θ0; x)] L(θ0; x) dx
≥ c ∫_{R\R*} L(θ0; x) dx ≥ c ∫_{R*\R} L(θ0; x) dx
≥ ∫_{R*\R} [L(θ1; x)/L(θ0; x)] L(θ0; x) dx
= ∫_{R*\R} L(θ1; x) dx.

As a result,

∫_R L(θ1; x) dx ≥ ∫_{R*} L(θ1; x) dx.

Since the former is the power of the likelihood ratio test and the latter is the power of the other test at significance level α, the UMP property of the likelihood ratio test is proved. □

Remark. UMP does not refer to one single test; it refers to a class of tests. For different significance levels, the UMP tests are different. A test of size α which rejects H0 if L(θ1; x)/L(θ0; x) > c, accepts H0 if L(θ1; x)/L(θ0; x) < c, and rejects or accepts H0 (possibly at random) if L(θ1; x)/L(θ0; x) = c is also UMP at significance level α. This is called a randomized test.

Example 11.7 X1, ..., Xn ∼ N(µ, σ2) where σ2 is known.

Consider the hypotheses:

H0 : µ = µ0 vs. Ha : µ = µ1,

where µ1 > µ0. The test

Reject H0 when X̄ − µ0 > z(α)σ/√n

is UMP at significance level α.

Observe that this test does not depend on the exact value of µ1, only on the fact that µ1 > µ0.

Now consider the hypotheses

H0 : µ = µ0 vs. Ha : µ > µ0.

The same test

Reject H0 when X̄ − µ0 > z(α)σ/√n

is a UMP test at significance level α.

Likewise, if µ1 < µ0, the test

Reject H0 when X̄ − µ0 < −z(α)σ/√n

is UMP at significance level α, and it is also UMP at significance level α for the hypotheses

H0 : µ = µ0 vs. Ha : µ < µ0.


Example 11.8 X1, ..., Xn ∼ P(λ).

Consider the hypotheses:

H0 : λ = 1 vs. Ha : λ > 1.

Pick any λ1 > 1. The likelihood function is

L(λ) = ∏_{i=1}^n e^{−λ} λ^{Xi}/Xi!,

and the likelihood ratio is

L(λ1)/L(1) = e^{−n(λ1−1)} λ1^{∑_{i=1}^n Xi},

which is an increasing function of ∑_{i=1}^n Xi since λ1 > 1. A test that

Rejects H0 when ∑_{i=1}^n Xi ≥ k

is a UMP test. This test can be randomized to obtain a UMP test at any given significance level α. □

11.4 Generalized likelihood ratio test.

In many cases of hypothesis testing, especially two-sided hypothesis testing, the UMP test does not exist, and the likelihood ratio test for simple hypotheses (meaning that both the null and alternative hypotheses contain one single distribution) is of limited use. The generalized likelihood ratio test is commonly used in general settings.

Example 11.9 X1, ..., Xn ∼ Bin(1, θ).

H0 : θ = θ0 vs. Ha : θ ≠ θ0. Following the idea of the likelihood ratio L(θ1)/L(θ0), we may consider using

sup{L(θ) : θ ∈ Ha} / sup{L(θ) : θ ∈ H0}.

Equivalently, and for computational convenience, we consider the generalized likelihood ratio:

GLR = sup{L(θ) : θ ∈ Ha ∪ H0} / sup{L(θ) : θ ∈ H0} = L(θ̂)/L(θ0) = (X̄/θ0)^{nX̄} ((1 − X̄)/(1 − θ0))^{n−nX̄}.

And the test is

Reject H0 when GLR ≥ c

for some constant c, depending on the significance level.

For large n, we know that X̄ is close to θ0 under H0. By applying a Taylor expansion, we can show

2 log(GLR) ≈ n(X̄ − θ0)² / (θ0(1 − θ0)).

It then follows from the central limit theorem that

2 log(GLR) → χ²_1

in distribution as n → ∞. Therefore, at approximate significance level α, we

Reject H0 when 2 log(GLR) ≥ χ²_1(α). □
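As an added illustration of this GLR test for the Bin(1, θ) model, the Python sketch below computes 2 log(GLR) and the approximate chi-square p-value for a hypothetical simulated sample (θ0 = 0.5 and the data-generating value 0.35 are arbitrary choices):

import numpy as np
from scipy import stats

def glr_stat(x, theta0):
    # 2 log GLR for H0: theta = theta0 in the Bin(1, theta) model
    x = np.asarray(x)
    n, xbar = len(x), x.mean()
    loglik = lambda th: n * xbar * np.log(th) + n * (1 - xbar) * np.log(1 - th)
    return 2 * (loglik(xbar) - loglik(theta0))

rng = np.random.default_rng(5)
x = rng.binomial(1, 0.35, size=200)          # hypothetical sample
stat = glr_stat(x, theta0=0.5)
print(stat, 1 - stats.chi2.cdf(stat, df=1))  # statistic and approximate p-value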

Example 11.10 X1, ..., Xn ∼ N(µ, σ2) where σ2 is unknown.


Consider H0 : µ = µ0 vs. Ha : µ ≠ µ0. Observe that a complete description should be

H0 : (µ, σ²) ∈ {µ0} × (0, ∞) vs. Ha : (µ, σ²) ∈ ((−∞, ∞) \ {µ0}) × (0, ∞).

The test statistic of the generalized likelihood ratio method is

GLR = sup{L(µ, σ²) : µ ∈ (−∞, ∞), σ² > 0} / sup{L(µ, σ²) : µ = µ0, σ² > 0} = L(µ̂, σ̂²)/L(µ0, σ̂0²),

where (µ̂, σ̂²) = (X̄, (1/n)∑_{i=1}^n (Xi − X̄)²) is the MLE of (µ, σ²), and σ̂0² = (1/n)∑_{i=1}^n (Xi − µ0)² is the MLE of σ² when µ is known to be µ0. Observe that

σ̂0² = σ̂² + (X̄ − µ0)².

After some calculation, we get

2 log(GLR) = n log(σ̂0²/σ̂²) = n log( 1 + (1/(n − 1)) · n(X̄ − µ0)²/s² ),

where s² = (n/(n − 1))σ̂² is the sample variance. The GLR test is then

Reject H0 when √n|X̄ − µ0|/s > c.

Since, under H0, √n(X̄ − µ0)/s ∼ t_{n−1}, choosing c = t_{n−1}(α/2) makes the above GLR test a significance level α test.

Observe that, again by the law of large numbers, the CLT and a Taylor expansion,

2 log(GLR) ≈ (n/(n − 1)) · n(X̄ − µ0)²/s² → χ²_1

in distribution as n → ∞. In fact, t_n → N(0, 1) and therefore t_n² → χ²_1. □

In general, suppose X1, ..., Xn ∼ Pθ. Consider the hypotheses

H0 : θ ∈ Θ0 vs. Ha : θ ∈ Θa.

If Θ0 is a (linear) space of dimension q and Θ0 ∪ Θa is a (linear) space of dimension r, then, under some regularity conditions,

2 log(GLR) → χ²_{r−q}

in distribution as n → ∞.


Chapter 12. Linear Regression Models

12.1 Illustrating simple linear regression models.

We use an elementary example to review the framework of linear regression models.

Example 12.1 Experiment on Hooke's law. Hooke's law states that the extension of a piano wire is proportional to the weight attached to the end of the wire. To verify the law, a simple experiment is carried out, and the results, subject to observational errors, are

xi (weights in kg): 0 2 4 6 8 10

Yi (length in cm): 439.00 439.12 439.21 439.31 439.40 439.50

0). Statistical model:

Yi = β0 + β1 xi + εi,  i = 1, ..., n,

where εi are iid ∼ N(0, σ2), and xi are non-random. (In this example, n = 6.)

β1: the mean change in the response when the covariate is increased by one unit.

1). Objective: Estimating the regression parameters (β0, β1) with accuracy.

2). Least squares estimator: By minimizing ∑_{i=1}^n (Yi − a − b xi)² over all (a, b), one obtains the LSE of (β0, β1) as

β̂1 = SSxy/SSxx = 0.0491,   β̂0 = Ȳ − β̂1 x̄ = 439.011,

where SSxy = ∑_{i=1}^n (xi − x̄)(Yi − Ȳ) and SSxx = ∑_{i=1}^n (xi − x̄)².

3). Fitted values, residuals and σ̂²:

Fitted values: Ŷi = β̂0 + β̂1 xi, the estimates of the deterministic part of the response, β0 + β1 xi.

Residuals: ε̂i = Yi − Ŷi, the estimates of the εi.

Ŷi: 439.011 439.109 439.208 439.306 439.404 439.502

ε̂i: -0.011 0.011 0.002 0.004 -0.004 -0.002

Estimator of σ²:

σ̂² = (1/(n − 2)) ∑_{i=1}^n (Yi − β̂0 − β̂1 xi)² = (1/(n − 2)) ( SSyy − SS²xy/SSxx ) = 0.0000704,

which is an unbiased estimator of σ². Here σ̂ = 0.0084.

Exercise. Verify that the residuals always satisfy the two linear constraints ∑_{i=1}^n ε̂i = 0 and ∑_{i=1}^n ε̂i(xi − x̄) = 0.

4). Distribution of the estimators.

(n − 2)σ̂²/σ² ∼ χ²_{n−2},

β̂1 = β1 + ∑_{i=1}^n (xi − x̄)εi / ∑_{i=1}^n (xi − x̄)² ∼ N(β1, σ²/SSxx),

and

β̂0 = β0 + ∑_{i=1}^n ( 1/n − (xi − x̄)x̄ / ∑_{j=1}^n (xj − x̄)² ) εi ∼ N( β0, σ²[1/n + x̄²/SSxx] ).

Therefore,

(β̂1 − β1) / (σ̂/√SSxx) ∼ t_{n−2},

(β̂0 − β0) / (σ̂√(1/n + x̄²/SSxx)) ∼ t_{n−2}.

Exercise. Verify that β̂0 and β̂1 are independent of the residuals ε̂i, i = 1, ..., n.

5). Confidence intervals and test of hypothesis:

Confidence intervals for β1 and β0 at confidence level 1 − α:

β̂1 ± t_{n−2}(α/2) σ̂/√SSxx,

β̂0 ± t_{n−2}(α/2) σ̂√(1/n + x̄²/SSxx).

For example, the C.I. for β1 at confidence level 95% is 0.0491 ± 0.0028, and the C.I. for β0 at confidence level 95% is 439.011 ± 0.017.

Test of hypothesis:

H0 : β1 = a vs. Ha : β1 > a.

A test at significance level α is

Reject H0 when β̂1 − a > t_{n−2}(α) σ̂/√SSxx.

For example, consider H0 : β1 = 0.05 against Ha : β1 ≠ 0.05. Notice that

|β̂1 − 0.05| / (σ̂/√SSxx) = 0.9.

The p-value is

P(|t4| > 0.9) = 2P(t4 > 0.9) = 0.419,

which is quite large. We accept H0.

6). Variance decomposition and ANOVA Table:

∑_{i=1}^n (Yi − Ȳ)² = ∑_{i=1}^n (Ŷi − Ȳ)² + ∑_{i=1}^n (Yi − Ŷi)²

0.1694 = 0.1691 + 0.0003

SSTo = SSR + SSE

SSyy = SS²xy/SSxx + [SSyy − SS²xy/SSxx]

Total variation in Y = Variation in Y due to X + Variation due to error.

Coefficient of determination:

R² = SSR/SSTo = 0.1691/0.1694 = 99.83%,

which measures the proportion of the total variation in the response that is due to the variation of the covariate.

ANOVA table:

Source of Variation   Sum of Squares   d.f.        MS                          F-statistic

Regression            SSR = 0.1691     1           MSR = SSR/1 = 0.1691        MSR/MSE = 2398.7

Error                 SSE = 0.0003     n − 2 = 4   MSE = SSE/(n − 2) = 0.00007

Total                 SSTo = 0.1694    n − 1 = 5


The ANOVA table can be used for testing the hypothesis

H0 : β1 = 0 vs. Ha : β1 ≠ 0.

And the significance level α test is

Reject H0 when MSR/MSE > F1,n−2(α).

The p-value is

P(F_{1,4} > 2398.7) = 0.0000,

which is extremely small. □

Exercise. Verify that MSE = σ̂² and that β̂1²/[σ̂²/SSxx] = MSR/MSE.
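All of the numbers in Example 12.1 can be reproduced from the data with a short Python sketch (an added illustration; only the weights and measured lengths from the table are used):

import numpy as np

x = np.array([0, 2, 4, 6, 8, 10], dtype=float)                   # weights (kg)
y = np.array([439.00, 439.12, 439.21, 439.31, 439.40, 439.50])   # lengths (cm)

n = len(x)
SSxx = np.sum((x - x.mean()) ** 2)
SSxy = np.sum((x - x.mean()) * (y - y.mean()))

b1 = SSxy / SSxx                      # slope estimate, about 0.0491
b0 = y.mean() - b1 * x.mean()         # intercept estimate, about 439.011
resid = y - (b0 + b1 * x)
sigma2_hat = np.sum(resid ** 2) / (n - 2)
print(b1, b0, sigma2_hat)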

12.2. General linear regression models.

0). The model:

Yi = β0 + β1 xi1 + ... + βp xip + εi = xi′β + εi,  i = 1, ..., n,

where xi = (1, xi1, ..., xip)′ is nonrandom and the εi are iid ∼ N(0, σ²). The regression parameters are β = (β0, ..., βp)′, where β0 is the intercept.

1). Matrix presentation:

Y_{n×1} = X_{n×(p+1)} β_{(p+1)×1} + ε_{n×1},

where Y = (Y1, ..., Yn)′, ε = (ε1, ..., εn)′, and X is the n × (p + 1) design matrix whose i-th row is (1, xi1, ..., xip).

Remark. This model includes the one-sample, two-sample and one-way layout models as special cases.

2). Least squares estimator and its properties.

Theorem 12.1 The least squares estimator, denoted by β̂, which minimizes

(Y − Xβ)′(Y − Xβ) ≡ ‖Y − Xβ‖²

over all β, is

β̂ = (X′X)^{−1}X′Y.

Here ‖ · ‖ is the Euclidean norm.

Proof. Observe that X′(Y − Xβ̂) = 0. Write

(Y − Xβ)′(Y − Xβ) = ‖Y − Xβ̂ − X(β − β̂)‖²
= ‖Y − Xβ̂‖² − 2[X(β − β̂)]′(Y − Xβ̂) + ‖X(β − β̂)‖²
= ‖Y − Xβ̂‖² + ‖X(β − β̂)‖²,

which is minimized when β = β̂. □

Interpretation: Think of Y as a vector in n-dimensional real space, and let X0, ..., Xp stand for the p + 1 columns of the matrix X. Then X0, ..., Xp are p + 1 vectors in n-dimensional real space, which span a linear subspace of dimension p + 1. Project Y onto this (p + 1)-dimensional subspace; the projection is

Xβ̂ = β̂0 X0 + ··· + β̂p Xp.

The least squares estimates β̂ are exactly the projection coefficients.

3). Gauss-Markov Theorem.

Theorem 12.2 The LSE β̂ is the best linear unbiased estimator (BLUE) of β in the sense that, among all unbiased linear estimators of β, i.e., those estimators that are linear functions of Y1, ..., Yn, the LSE has the smallest variance.

Proof. Consider any unbiased linear estimator A′Y of β, where A is an n × (p + 1) matrix of constants. Since Y = Xβ + ε, unbiasedness implies A′X = I, where I is the (p + 1) × (p + 1) identity matrix. The variance of this estimator is σ²A′A. To show A′A ≥ (X′X)^{−1}, let A′ = (X′X)^{−1}(X′ + E′). Then A′X = I implies E′X = 0. Therefore

A′A = (X′X)^{−1} + (X′X)^{−1}E′E(X′X)^{−1} ≥ (X′X)^{−1}.

The proof is complete. (Here the matrix inequality A′A ≥ (X′X)^{−1} means that the difference is non-negative definite.) □

4). Distribution of β̂:

β̂ = β + (X′X)^{−1}X′ε ∼ N( β, σ²(X′X)^{−1} ),

σ̂² = (1/(n − p − 1)) ∑_{i=1}^n (Yi − Ŷi)²,  with (n − p − 1)σ̂²/σ² ∼ χ²_{n−p−1},

(β̂k − βk) / (σ̂√a_{kk}) ∼ t_{n−p−1},

where a_{kk} is the k-th diagonal element of (X′X)^{−1} = (a_{jk})_{0 ≤ j,k ≤ p}.

Inference about the regression parameters β can be carried out based on the above distributions of β̂.


Review of Statistical Theory.

1. Parametric Models: Population distribution, parameters, observations, estimation, inference.

Commonly used distributions: a) Exponential family: Bin(n, p), P(λ), exponential with mean µ, N(µ, σ²), etc.; b) Uniform distributions: Unif[0, θ], Unif[−θ, θ], Unif[µ − θ, µ + θ], and the uniform over integers; c) Cauchy distribution.

2. Sufficiency, factorization theorem and Rao-Blackwell Theorem.

3. Point estimation: MLE and moment estimation.

4. Unbiasedness, MSE, completeness and UMVUE.

5. Fisher information, Cramer-Rao lower bound, consistency, asymptotic normality and asymptotic efficiency.

6. Confidence intervals (with exact or approximate confidence levels).

7. The t or normal-based C.I.s for the population mean in the one-sample problem. The t-based C.I. for the mean difference in the two-sample problem.

8. Concepts about the test of hypotheses: hypotheses, tests, the type I and type II errors, size, significance level, power function, OC (operating characteristic) curve; the p-value and its interpretation.

9. UMP (its definition) and the Neyman-Pearson lemma.

10. Generalized likelihood ratio test and its asymptotic distribution (2 log(GLR) → χ²_{r−q}).

11. Simple linear regression: model assumptions, the LSE and its distribution, the residuals, fitted values, σ̂², C.I.s and tests of hypotheses, variance decomposition and the ANOVA table.