Lecture 3. Inference about multivariate normal distribution

3.1 Point and Interval Estimation

Let X1, . . . , Xn be i.i.d. Np(µ, Σ). We are interested in evaluating the maximum likelihood estimates of µ and Σ. Recall that the joint density of X1 is

f(x) = |2πΣ|^{−1/2} exp[ −(1/2)(x − µ)′Σ^{−1}(x − µ) ],

for x ∈ Rp. The negative log-likelihood function, given observations x_1^n := x1, . . . , xn, up to an additive constant independent of µ and Σ, is then

ℓ*(µ, Σ | x_1^n) = (n/2) log|Σ| + (1/2) ∑_{i=1}^n (x_i − µ)′Σ^{−1}(x_i − µ)
                 = (n/2) log|Σ| + (n/2)(x̄ − µ)′Σ^{−1}(x̄ − µ) + (1/2) ∑_{i=1}^n (x_i − x̄)′Σ^{−1}(x_i − x̄).

To this end, denote the centered data matrix by X̃ = [x̃_1, . . . , x̃_n]_{p×n}, where x̃_i = x_i − x̄. Let

S0 = (1/n) X̃X̃′ = (1/n) ∑_{i=1}^n x̃_i x̃_i′.

Proposition 1. The MLEs of µ and Σ, which jointly minimize ℓ*(µ, Σ | x_1^n), are

µ̂_MLE = x̄,  Σ̂_MLE = S0.

Note that S0 is a biased estimator of Σ. The sample variance–covariance matrix S = (n/(n − 1)) S0 is unbiased.
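As a quick numerical check of Proposition 1 and the bias correction (a minimal numpy sketch on simulated data; the variable names are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))        # rows are observations x_1, ..., x_n

mu_mle = X.mean(axis=0)            # MLE of mu: the sample mean
Xc = X - mu_mle                    # centered data
S0 = Xc.T @ Xc / n                 # MLE of Sigma (biased)
S = Xc.T @ Xc / (n - 1)            # unbiased sample covariance: S = n/(n-1) * S0
```

Dividing by n − 1 instead of n is exactly the n/(n − 1) correction above.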

For interval estimation of µ, we largely follow Section 7.1 of Härdle and Simar (2012). First note that since µ ∈ Rp, we need to generalize the notion of intervals (primarily defined for R1) to a higher-dimensional space. A simple extension is a direct product of marginal intervals: for example, for intervals a < x < b and c < y < d, we obtain the rectangular region {(x, y) ∈ R2 : a < x < b, c < y < d}.

A confidence region A is a subset of Rp that depends on the (random) observations X1, . . . , Xn. A = A(X_1^n) is a confidence region of size 1 − α ∈ (0, 1) for the parameter µ ∈ Rp if

P(µ ∈ A(X_1^n)) ≥ 1 − α,

where P is with respect to the distribution of X_1^n.

(Elliptical confidence region) Corollary 7 in lecture 2 provides a pivot which paves a way to construct a confidence region for µ.


Since ((n − p)/p) (X̄ − µ)′S0^{−1}(X̄ − µ) ∼ F_{p,n−p} and

P( (X̄ − µ)′S0^{−1}(X̄ − µ) < (p/(n − p)) F_{1−α;p,n−p} ) = 1 − α,

we have that

A := { ν ∈ Rp : (X̄ − ν)′S0^{−1}(X̄ − ν) < (p/(n − p)) F_{1−α;p,n−p} }

is a confidence region of size 1 − α for the parameter µ.

Note: F_{1−α;p,n−p} is defined as the constant such that P(F < F_{1−α;p,n−p}) = 1 − α, where F is F_{p,n−p}-distributed.

The resulting confidence region for a given sample is an ellipse.
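A minimal sketch of checking membership in this elliptical region, using numpy and scipy on simulated data (the function name is my own, not from the lecture):

```python
import numpy as np
from scipy.stats import f

def in_confidence_region(X, nu, alpha=0.05):
    """True if nu lies in the size-(1 - alpha) elliptical region for mu."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    Xc = X - xbar
    S0 = Xc.T @ Xc / n                      # MLE of Sigma
    stat = (xbar - nu) @ np.linalg.solve(S0, xbar - nu)
    crit = p / (n - p) * f.ppf(1 - alpha, p, n - p)
    return stat < crit

rng = np.random.default_rng(1)
X = rng.normal(loc=[1.0, 2.0], size=(40, 2))
print(in_confidence_region(X, np.array([1.0, 2.0])))   # the true mean is usually covered
```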

(Simultaneous confidence intervals) Simultaneous confidence intervals for all linear combinations of the elements of µ, i.e., a′µ for arbitrary a ∈ Rp, provide confidence of size 1 − α for every interval in this framework that covers a′µ. This includes the marginal means µ1, . . . , µp (when choosing a wisely). We are interested in evaluating lower and upper bounds L(a) and U(a) satisfying

P(L(a) < a′µ < U(a) for all a ∈ Rp) ≥ 1 − α.

Note: the lower and upper limits of the interval depend on the data (hence are random) and are also specified by a.

First consider a single confidence interval by fixing a particular vector a. To evaluate a confidence interval for a′µ, write new random variables Y_i = a′X_i ∼ N1(a′µ, a′Σa) (i = 1, . . . , n), whose squared t-statistic is

t²(a) := n(a′µ − a′X̄)² / (a′Sa) ∼ F_{1,n−1}.

Thus, for any fixed a,

P(t²(a) ≤ F_{1−α;1,n−1}) = 1 − α.  (1)

Next, consider many projection vectors a_1, a_2, . . . , a_M (M is finite only for convenience). Simultaneous confidence intervals of a type similar to (1) are then

P( ⋂_{i=1}^M { t²(a_i) ≤ h(α) } ) ≥ 1 − α,

for some h(·). We collect some facts below.

1. max_a t²(a) ≤ h(α) implies t²(a_i) ≤ h(α) for all i.

2. max_a t²(a) = n(µ − X̄)′S^{−1}(µ − X̄), by HS Theorem 2.5 (note the rank).


3. Corollary 7 in lecture 2.

These suggest that we have h(α) = ((n − 1)p/(n − p)) F_{1−α;p,n−p} and

P( ⋂_{i=1}^M { t²(a_i) ≤ h(α) } ) ≥ P( max_a t²(a) ≤ h(α) ) = 1 − α.

Proposition 2. Simultaneously for all a ∈ Rp, the interval

a′X̄ ± √( h(α) a′Sa / n )

contains a′µ with probability 1 − α.
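Proposition 2 can be sketched in code as follows (numpy/scipy on simulated data; `simultaneous_ci` is an illustrative name):

```python
import numpy as np
from scipy.stats import f

def simultaneous_ci(X, a, alpha=0.05):
    """Simultaneous interval a'xbar +/- sqrt(h(alpha) a'Sa/n) from Proposition 2."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)                      # unbiased sample covariance
    h = (n - 1) * p / (n - p) * f.ppf(1 - alpha, p, n - p)
    half = np.sqrt(h * (a @ S @ a) / n)
    return a @ xbar - half, a @ xbar + half

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
lo, hi = simultaneous_ci(X, np.array([1.0, 0.0]))    # interval for mu_1
```

Taking a equal to a coordinate vector recovers simultaneous intervals for the marginal means µ_j; they are wider than the individual t-intervals because h(α) ≥ F_{1−α;1,n−1}.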

Example 1. From the Golub gene expression data, with dimension d = 7129, take the first and the 1674th variables (genes), to focus on the bivariate case (p = 2). There are two populations: 11 observations from AML, 27 from ALL. Figure 1 illustrates the elliptical confidence regions of size 95% and 99%. Figure 2 compares the elliptical confidence region with the simultaneous confidence intervals for a_1 = (1, 0)′ and a_2 = (0, 1)′.

Figure 1: Elliptical confidence regions of size 95% and 99%. Golub data, ALL (red) and AML (blue); horizontal axis gene #1, vertical axis gene #1674.


Figure 2: Simultaneous confidence intervals of size 95%. Golub data, ALL (red) and AML (blue); horizontal axis gene #1, vertical axis gene #1674.

3.2 Hypothesis testing

Consider testing a null hypothesis H0 : θ ∈ Ω0 against an alternative hypothesis H1 : θ ∈ Ω1. The principle of the likelihood ratio test is as follows. Let L0 be the maximized likelihood under H0 : θ ∈ Ω0, and L1 be the maximized likelihood under θ ∈ Ω0 ∪ Ω1. The likelihood ratio statistic, sometimes called the Wilks statistic, is then

W = −2 log(L0 / L1) ≥ 0.

The null hypothesis is rejected if the observed value of W is large. In some cases the exact distribution of W under H0 can be evaluated. In other cases, Wilks’ theorem states that for large n (sample size),

W ⟹ χ²_ν in distribution,

where ν is the number of free parameters in H1 but not in H0. If the degrees of freedom in Ω0 is q and the degrees of freedom in Ω0 ∪ Ω1 is r, then ν = r − q.

Consider testing hypotheses on µ and Σ of the multivariate normal distribution, based on an n-sample X1, . . . , Xn.

case I: H0 : µ = µ0, H1 : µ ≠ µ0, Σ is known. In this case, we know the exact distribution of the likelihood ratio statistic

W = n(x̄ − µ0)′Σ^{−1}(x̄ − µ0) ∼ χ²_p,


under H0.

[Show on blackboard.] It is worthwhile to first write

−2 log L(x; µ, Σ) = −2ℓ(x; µ, Σ) = n log|2πΣ| + n · tr(Σ^{−1}S0) + n(x̄ − µ)′Σ^{−1}(x̄ − µ).
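A sketch of case I on simulated data (scipy; the function name is illustrative):

```python
import numpy as np
from scipy.stats import chi2

def lrt_mean_known_cov(X, mu0, Sigma):
    """Case I likelihood ratio test: W = n (xbar-mu0)' Sigma^{-1} (xbar-mu0) ~ chi2_p."""
    n, p = X.shape
    d = X.mean(axis=0) - mu0
    W = n * d @ np.linalg.solve(Sigma, d)
    return W, chi2.sf(W, df=p)                 # statistic and p-value

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))                  # true mean 0, Sigma = I
W, pval = lrt_mean_known_cov(X, np.zeros(3), np.eye(3))
```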

case II: H0 : µ = µ0, H1 : µ ≠ µ0, Σ is unknown.

The MLEs under H1 are µ̂ = x̄ and Σ̂ = S0. The restricted MLE of Σ under H0 is

Σ̂^{(0)} = (1/n) ∑_{i=1}^n (x_i − µ0)(x_i − µ0)′ = S0 + δδ′/n, where δ = √n(x̄ − µ0).

The likelihood ratio statistic is then

W = n log|S0 + δδ′/n| − n log|S0|.

It turns out that W is a monotone increasing function of

δ′S^{−1}δ = n(x̄ − µ0)′S^{−1}(x̄ − µ0),

which is Hotelling’s T²(n − 1) statistic.
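Case II can thus be carried out via Hotelling’s T² and its F transformation from Section 3.1; a minimal sketch (illustrative function name, simulated data):

```python
import numpy as np
from scipy.stats import f

def hotelling_one_sample(X, mu0):
    """T^2 = n (xbar-mu0)' S^{-1} (xbar-mu0); (n-p)/((n-1)p) T^2 ~ F_{p,n-p} under H0."""
    n, p = X.shape
    d = X.mean(axis=0) - mu0
    S = np.cov(X, rowvar=False)
    T2 = n * d @ np.linalg.solve(S, d)
    F_stat = (n - p) / ((n - 1) * p) * T2
    return T2, f.sf(F_stat, p, n - p)

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 4))
T2, pval = hotelling_one_sample(X, np.zeros(4))
```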

case III: H0 : Σ = Σ0, H1 : Σ ≠ Σ0, µ is unknown. We have the likelihood ratio statistic

W = −n log|Σ0^{−1}S0| − np + n · trace(Σ0^{−1}S0).

This is a case where the exact distribution of W is difficult to evaluate. For large n, use Wilks’ theorem to approximate the distribution of W by χ²_ν with degrees of freedom ν = p(p + 1)/2.

Next, consider testing the equality of two mean vectors. Let X_{11}, . . . , X_{1n_1} be i.i.d. Np(µ1, Σ) and X_{21}, . . . , X_{2n_2} be i.i.d. Np(µ2, Σ).

case IV: H0 : µ1 = µ2, H1 : µ1 ≠ µ2, Σ is unknown. Since

X̄_1 − X̄_2 ∼ Np( µ1 − µ2, ((n_1 + n_2)/(n_1 n_2)) Σ ),
(n_1 + n_2 − 2) S_P ∼ Wp(n_1 + n_2 − 2, Σ),

where (n_1 + n_2 − 2) S_P := (n_1 − 1)S_1 + (n_2 − 1)S_2, we have Hotelling’s T² statistic for the two-sample problem

T²(n_1 + n_2 − 2) = (n_1 n_2/(n_1 + n_2)) (X̄_1 − X̄_2)′ S_P^{−1} (X̄_1 − X̄_2),

and by Theorem 5 in lecture 2,

((n_1 + n_2 − p − 1)/((n_1 + n_2 − 2)p)) T²(n_1 + n_2 − 2) ∼ F_{p, n_1+n_2−p−1}.

Similar to case II above, the likelihood ratio statistic is a monotone function of T²(n_1 + n_2 − 2).
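A sketch of the two-sample test (illustrative names, simulated data; the group sizes echo the Golub example):

```python
import numpy as np
from scipy.stats import f

def hotelling_two_sample(X1, X2):
    """Two-sample Hotelling's T^2 with pooled covariance S_P, and its F p-value."""
    (n1, p), (n2, _) = X1.shape, X2.shape
    d = X1.mean(axis=0) - X2.mean(axis=0)
    SP = ((n1 - 1) * np.cov(X1, rowvar=False) +
          (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    T2 = n1 * n2 / (n1 + n2) * d @ np.linalg.solve(SP, d)
    F_stat = (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p) * T2
    return T2, f.sf(F_stat, p, n1 + n2 - p - 1)

rng = np.random.default_rng(5)
X1 = rng.normal(size=(27, 2))                 # e.g. the ALL group
X2 = rng.normal(size=(11, 2))                 # e.g. the AML group
T2, pval = hotelling_two_sample(X1, X2)
```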


3.3 Hypothesis testing when p > n

In the high-dimensional situation where the dimension p is larger than the sample size (p > n − 1 or p > n_1 + n_2 − 2), the sample covariance S is not invertible, and thus Hotelling’s T² statistic, which is essential in the testing procedures above, cannot be computed. We survey important proposals for testing hypotheses on means in High-Dimension, Low-Sample-Size (HDLSS) data.

A basic idea in generalizing a test procedure to the p > n case is to base the test on a computable test statistic that is also an estimator of ‖µ − µ0‖ or ‖µ1 − µ2‖.

In case II (one sample), Dempster (1960) proposed to replace S^{−1} in Hotelling’s statistic by (trace(S) I_p)^{−1}. He showed that under H0 : µ = 0,

T_D = n X̄′X̄ / trace(S) ∼ F_{r,(n−1)r}, approximately,

for r = (trace(Σ))² / trace(Σ²), a measure of the sphericity of Σ. An estimator r̂ of r is used in testing.
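A sketch of Dempster’s statistic with the plug-in estimate of r (illustrative names; note no matrix inverse is needed, so p > n is fine):

```python
import numpy as np

def dempster_stat(X):
    """T_D = n xbar'xbar / trace(S), with plug-in sphericity r_hat = (tr S)^2 / tr(S^2)."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    TD = n * (xbar @ xbar) / np.trace(S)
    r_hat = np.trace(S) ** 2 / np.trace(S @ S)
    return TD, r_hat

rng = np.random.default_rng(6)
X = rng.normal(size=(20, 40))        # p = 40 > n = 20
TD, r_hat = dempster_stat(X)
```

T_D is then compared against the F_{r̂,(n−1)r̂} distribution.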

Bai and Saranadasa (1996) proposed to simply replace S^{−1} in Hotelling’s statistic by I_p, yielding T_B = n X̄′X̄. However, X̄′X̄ is not an unbiased estimator of µ′µ, since E(X̄′X̄) = (1/n) trace(Σ) + µ′µ. They showed that the standardized statistic

M_B = (n X̄′X̄ − trace(S)) / sd(n X̄′X̄ − trace(S))
    = (n X̄′X̄ − trace(S)) / √[ (2(n − 1)n/((n − 2)(n + 1))) ( trace(S²) − (1/n)(trace(S))² ) ]

has an asymptotic N(0, 1) distribution as p, n → ∞.
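A sketch of M_B on simulated HDLSS data (illustrative names; the grouping of the variance term follows the reconstruction of the display above, so treat it as an assumption):

```python
import numpy as np

def bai_saranadasa_stat(X):
    """Standardized statistic M_B; asymptotically N(0,1) under H0: mu = 0.
    Variance grouped as 2(n-1)n/((n-2)(n+1)) * (tr S^2 - (tr S)^2 / n)."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    num = n * (xbar @ xbar) - np.trace(S)
    var = 2 * (n - 1) * n / ((n - 2) * (n + 1)) * (np.trace(S @ S) - np.trace(S) ** 2 / n)
    return num / np.sqrt(var)

rng = np.random.default_rng(7)
X = rng.normal(size=(30, 100))       # HDLSS: p = 100 > n = 30
MB = bai_saranadasa_stat(X)
```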

Srivastava and Du (2008) proposed to replace S in Hotelling’s statistic by D_S = diag(S). Then

T_S = n X̄′D_S^{−1}X̄ − ((n − 1)/(n − 3)) p

can be used to estimate (n(n − 1)/(n − 3)) ‖D_Σ^{−1/2} µ‖², which is zero under H0 : µ = 0. Srivastava and Du’s test statistic is then

M_S = T_S / sd(T_S) = ( n X̄′D_S^{−1}X̄ − ((n − 1)/(n − 3)) p ) / √[ 2( trace(R²) − p²/(n − 1) ) ],

which has an asymptotic N(0, 1) distribution as p, n → ∞. Here R = D_S^{−1/2} S D_S^{−1/2} is the sample correlation matrix.
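A sketch of M_S (illustrative names; the dimensions are chosen so that trace(R²) − p²/(n − 1) is positive):

```python
import numpy as np

def srivastava_du_stat(X):
    """Standardized statistic M_S; asymptotically N(0,1) under H0: mu = 0."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    d_inv = 1.0 / np.diag(S)                          # diagonal of D_S^{-1}
    TS = n * (xbar * d_inv) @ xbar - (n - 1) / (n - 3) * p
    d_isqrt = np.sqrt(d_inv)
    R = d_isqrt[:, None] * S * d_isqrt[None, :]       # sample correlation matrix
    var = 2 * (np.trace(R @ R) - p ** 2 / (n - 1))
    return TS / np.sqrt(var)

rng = np.random.default_rng(8)
X = rng.normal(size=(40, 10))
MS = srivastava_du_stat(X)
```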

Chen and Qin (2010) improved the two-sample test for mean vectors of Bai and Saranadasa (1996). In testing H0 : µ1 = µ2, Bai and Saranadasa (1996) proposed to use

T_B = (X̄_1 − X̄_2)′(X̄_1 − X̄_2) − ((n_1 + n_2)/(n_1 n_2)) trace(S_P).

The subtraction of trace(S_P) is to make sure that E(T_B) = ‖µ1 − µ2‖². Chen and Qin (2010) proposed to avoid trace(S_P) altogether, by considering

T_C = ∑_{i≠j}^{n_1} X′_{1i}X_{1j} / (n_1(n_1 − 1)) + ∑_{i≠j}^{n_2} X′_{2i}X_{2j} / (n_2(n_2 − 1)) − 2 ∑_{i=1}^{n_1} ∑_{j=1}^{n_2} X′_{1i}X_{2j} / (n_1 n_2).

Since E(T_C) = ‖µ1 − µ2‖², Chen and Qin proposed to test based on T_C.
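A sketch of T_C; the i ≠ j exclusion is what removes the need for trace(S_P) (illustrative names, simulated data):

```python
import numpy as np

def chen_qin_stat(X1, X2):
    """T_C from Chen and Qin (2010): cross-product sums with i = j terms excluded,
    so that E(T_C) = ||mu1 - mu2||^2."""
    n1, n2 = X1.shape[0], X2.shape[0]
    G11, G22 = X1 @ X1.T, X2 @ X2.T
    term1 = (G11.sum() - np.trace(G11)) / (n1 * (n1 - 1))
    term2 = (G22.sum() - np.trace(G22)) / (n2 * (n2 - 1))
    term3 = 2.0 * (X1 @ X2.T).sum() / (n1 * n2)
    return term1 + term2 - term3

rng = np.random.default_rng(9)
X1 = rng.normal(size=(25, 200))           # two HDLSS samples with equal means
X2 = rng.normal(size=(25, 200))
TC = chen_qin_stat(X1, X2)
```

Chen and Qin studentize T_C with a variance estimator (omitted here) to obtain an asymptotically N(0, 1) test.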

There are many other ideas, including:


1. Take as test statistic the maximum of p normalized marginal mean differences (Cai et al., 2013);

2. Use a generalized inverse of S, denoted by S− or S†, to replace S−1;

3. Estimate Σ in a way that is invertible;

4. Reduce the dimension p of the random vector X by Z = h(X) ∈ Rd, for d < n, then apply the traditional theory of hypothesis testing.

Next lecture is on linear dimension reduction: principal component analysis.

References

Bai, Z. and Saranadasa, H. (1996), “Effect of high dimension: by an example of a two sample problem,” Statistica Sinica, 6, 311–329.

Cai, T. T., Liu, W., and Xia, Y. (2013), “Two-Sample Test of High Dimensional Means under Dependency,” to appear in Journal of the Royal Statistical Society: Series B.

Chen, S. X. and Qin, Y.-L. (2010), “A two-sample test for high-dimensional data with applications to gene-set testing,” The Annals of Statistics, 38, 808–835.

Dempster, A. P. (1960), “A significance test for the separation of two highly multivariate small samples,” Biometrics, 16, 41–50.

Härdle, W. K. and Simar, L. (2012), Applied Multivariate Statistical Analysis, 3rd ed., Springer.

Srivastava, M. S. and Du, M. (2008), “A test for the mean vector with fewer observations than the dimension,” Journal of Multivariate Analysis, 99, 386–402.


In evaluating the MLE of Σ for the multivariate normal, one can use the following famous results.

Note:

• ∂|Y|/∂Y = |Y| (Y^{−1})′

• (∂/∂Y) trace(AY^{−1}B) = −(Y^{−1}BAY^{−1})′

Hence

(∂/∂Σ) [ log|Σ| + trace(Σ^{−1}S0) ] = |Σ|^{−1}|Σ| Σ^{−1} − Σ^{−1}S0Σ^{−1} = Σ^{−1} − Σ^{−1}S0Σ^{−1}

(the transposes drop since Σ is symmetric). The first-order condition leads to Σ̂_MLE = S0.
