Lecture 3. Inference about multivariate normal distribution

3.1 Point and Interval Estimation

Let $X_1, \dots, X_n$ be i.i.d. $N_p(\mu, \Sigma)$. We are interested in evaluating the maximum likelihood estimates of $\mu$ and $\Sigma$. Recall that the density of $X_1$ is

\[ f(x) = |2\pi\Sigma|^{-1/2} \exp\left[ -\tfrac{1}{2} (x - \mu)' \Sigma^{-1} (x - \mu) \right], \]

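for $x \in \mathbb{R}^p$. The negative log-likelihood function, given observations $x_1^n = (x_1, \dots, x_n)$, is then

\[
\begin{aligned}
\ell(\mu, \Sigma \mid x_1^n) &= \frac{n}{2} \log|\Sigma| + \frac{1}{2} \sum_{i=1}^n (x_i - \mu)' \Sigma^{-1} (x_i - \mu) + c \\
&= \frac{n}{2} \log|\Sigma| + \frac{n}{2} (\bar{x} - \mu)' \Sigma^{-1} (\bar{x} - \mu) + \frac{1}{2} \sum_{i=1}^n (x_i - \bar{x})' \Sigma^{-1} (x_i - \bar{x}) + c,
\end{aligned}
\]

where $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$ is the sample mean and $c$ does not depend on $(\mu, \Sigma)$.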

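To this end, denote the centered data matrix by $\tilde{X} = [\tilde{x}_1, \dots, \tilde{x}_n]_{p \times n}$, where $\tilde{x}_i = x_i - \bar{x}$. Let

\[ S_0 = \frac{1}{n} \tilde{X} \tilde{X}' = \frac{1}{n} \sum_{i=1}^n \tilde{x}_i \tilde{x}_i'. \]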

Proposition 1. The m.l.e. of $\mu$ and $\Sigma$, which jointly minimize $\ell(\mu, \Sigma \mid x_1^n)$, are

\[ \hat{\mu}_{MLE} = \bar{x}, \qquad \hat{\Sigma}_{MLE} = S_0. \]

Note that $S_0$ is a biased estimator of $\Sigma$. The sample variance–covariance matrix $S = \frac{n}{n-1} S_0$ is unbiased.
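The estimators in Proposition 1 can be sketched numerically as follows. This is a minimal illustration, assuming NumPy and the p × n layout used above (one observation per column); the function name `mle_normal` and the simulated data are hypothetical.

```python
import numpy as np

def mle_normal(X):
    """Return (xbar, S0, S) for a p x n data matrix X (observations in columns)."""
    p, n = X.shape
    xbar = X.mean(axis=1)                 # MLE of mu
    Xc = X - xbar[:, None]                # centered data matrix
    S0 = Xc @ Xc.T / n                    # MLE of Sigma (biased)
    S = n / (n - 1) * S0                  # unbiased sample covariance
    return xbar, S0, S

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 50))              # simulated data, for illustration only
xbar, S0, S = mle_normal(X)
```

With this layout, `S` should agree with `np.cov(X)` and `S0` with `np.cov(X, bias=True)`.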

For interval estimation of $\mu$, we largely follow Section 7.1 of Härdle and Simar (2012). First note that since $\mu \in \mathbb{R}^p$, we need to generalize the notion of intervals (primarily defined for $\mathbb{R}^1$) to higher dimension. A simple extension is a direct product of marginal intervals: for intervals $a < x < b$ and $c < y < d$, we obtain a rectangular region $\{(x, y) \in \mathbb{R}^2 : a < x < b,\ c < y < d\}$.

A confidence region $A \subset \mathbb{R}^p$ is a set determined by the (random) observations $X_1, \dots, X_n$. $A = A(X_1^n)$ is a confidence region of size $1 - \alpha \in (0, 1)$ for the parameter $\mu$ if

\[ P(\mu \in A) \ge 1 - \alpha, \quad \text{for all } \mu \in \mathbb{R}^p. \]

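(Elliptical confidence region) Corollary 7 in Lecture 2 provides a pivot that paves the way to a confidence region for $\mu$. Since $\frac{n-p}{p} (\bar{X} - \mu)' S_0^{-1} (\bar{X} - \mu) \sim F_{p,n-p}$ and

\[ P\left( (\bar{X} - \mu)' S_0^{-1} (\bar{X} - \mu) < \frac{p}{n-p} F_{1-\alpha;p,n-p} \right) = 1 - \alpha, \]

the set

\[ A = \left\{ \mu \in \mathbb{R}^p : (\bar{X} - \mu)' S_0^{-1} (\bar{X} - \mu) < \frac{p}{n-p} F_{1-\alpha;p,n-p} \right\} \]

is a confidence region of size $1 - \alpha$ for the parameter $\mu$.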
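Membership in the elliptical region can be checked directly from its definition. The sketch below assumes NumPy and a p × n data matrix with n > p; the function name `in_elliptical_region` is hypothetical, and the F quantile is passed in as `f_crit` rather than computed (it could be obtained from a table or, for example, `scipy.stats.f.ppf(1 - alpha, p, n - p)`).

```python
import numpy as np

def in_elliptical_region(mu, X, f_crit):
    """True if mu lies in the elliptical confidence region built from X (p x n).

    f_crit is the F quantile F_{1-alpha; p, n-p}, supplied by the caller.
    """
    p, n = X.shape
    xbar = X.mean(axis=1)
    Xc = X - xbar[:, None]
    S0 = Xc @ Xc.T / n                            # biased covariance estimate
    diff = xbar - np.asarray(mu, dtype=float)
    q = diff @ np.linalg.solve(S0, diff)          # (xbar - mu)' S0^{-1} (xbar - mu)
    return q < p / (n - p) * f_crit

rng = np.random.default_rng(1)
X = rng.normal(size=(2, 30))                      # simulated data
f_crit = 3.0                                      # placeholder threshold, not a looked-up quantile
inside = in_elliptical_region(X.mean(axis=1), X, f_crit)
outside = in_elliptical_region([100.0, 100.0], X, f_crit)
```

The sample mean is always inside the region, since the quadratic form is zero there.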

(Simultaneous confidence intervals) Simultaneous confidence intervals for all linear combinations of the elements of $\mu$, that is $a'\mu$ for arbitrary $a \in \mathbb{R}^p$, provide confidence of size $1 - \alpha$ jointly for all intervals covering $a'\mu$, including the marginal means $\mu_1, \dots, \mu_p$. We are interested in evaluating lower and upper bounds $L(a)$ and $U(a)$ satisfying

\[ P\left( L(a) < a'\mu < U(a), \text{ for all } a \in \mathbb{R}^p \right) \ge 1 - \alpha, \quad \text{for all } \mu \in \mathbb{R}^p. \]

First consider a single confidence interval by fixing a particular vector $a$. To evaluate a confidence interval for $a'\mu$, define new random variables $Y_i = a'X_i \sim N_1(a'\mu, a'\Sigma a)$ ($i = 1, \dots, n$), whose squared t-statistic is $t^2(a) = n \frac{(a'\mu - a'\bar{X})^2}{a'Sa} \sim F_{1,n-1}$. Thus, for any fixed $a$,

\[ P\left( t^2(a) \le F_{1-\alpha;1,n-1} \right) = 1 - \alpha. \tag{1} \]

Next, consider many projection vectors $a_1, a_2, \dots, a_M$ ($M$ is finite only for convenience). Simultaneous confidence intervals analogous to (1) then take the form

\[ P\left( \bigcap_{i=1}^M \{ t^2(a_i) \le h(\alpha) \} \right) \ge 1 - \alpha, \]

for some $h(\cdot)$. By collecting some facts,

1. $\max_a t^2(a) \le h(\alpha)$ implies $t^2(a_i) \le h(\alpha)$ for all $i$,

2. $\max_a t^2(a) = n (\mu - \bar{X})' S^{-1} (\mu - \bar{X})$,

3. and Corollary 7 in Lecture 2,

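we have, for $h(\alpha) = \frac{(n-1)p}{n-p} F_{1-\alpha;p,n-p}$,

\[ P\left( \bigcap_{i=1}^M \{ t^2(a_i) \le h(\alpha) \} \right) \ge P\left( \max_a t^2(a) \le h(\alpha) \right) = 1 - \alpha. \]

Indeed, $\max_a t^2(a)$ is Hotelling's $T^2$ statistic, for which $\frac{n-p}{(n-1)p} T^2 \sim F_{p,n-p}$. Solving $t^2(a) \le h(\alpha)$ for $a'\mu$ gives the following.

Proposition 2. Simultaneously for all $a \in \mathbb{R}^p$, the interval

\[ a'\bar{X} \pm \sqrt{\frac{h(\alpha)}{n}\, a'Sa} = a'\bar{X} \pm \sqrt{\frac{(n-1)p}{n(n-p)} F_{1-\alpha;p,n-p}\; a'Sa} \]

contains $a'\mu$ with probability $1 - \alpha$.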
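A simultaneous interval for one direction a can be sketched as below. This assumes NumPy, a p × n data matrix with p ≥ 2 and n > p, and the half-width written out explicitly as sqrt((n−1)p / (n(n−p)) · F · a'Sa) with S the unbiased sample covariance. The function name `simultaneous_interval` is hypothetical, and the F quantile is again supplied by the caller as `f_crit`.

```python
import numpy as np

def simultaneous_interval(a, X, f_crit):
    """Simultaneous confidence interval for a'mu from a p x n data matrix X.

    f_crit is the F quantile F_{1-alpha; p, n-p}, supplied by the caller.
    """
    p, n = X.shape
    a = np.asarray(a, dtype=float)
    xbar = X.mean(axis=1)
    S = np.cov(X)                                  # unbiased sample covariance
    half = np.sqrt((n - 1) * p / (n * (n - p)) * f_crit * (a @ S @ a))
    center = a @ xbar
    return center - half, center + half

rng = np.random.default_rng(2)
X = rng.normal(size=(2, 30))                       # simulated data
lo, hi = simultaneous_interval([1.0, 0.0], X, 3.0) # 3.0 is a placeholder threshold
```

Choosing a = (1, 0)' gives the interval for the first marginal mean, as in the comparison of Example 1.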

Example 1. From the Golub gene expression data, with dimension d = 7129, take the first and 1674th variables (genes) to focus on the bivariate case (p = 2). There are two populations: 11 observations from AML and 27 from ALL. Figure 1 illustrates the elliptical confidence regions of size 95% and 99%. Figure 2 compares the elliptical confidence region with the simultaneous confidence intervals for $a_1 = (1, 0)'$ and $a_2 = (0, 1)'$.

[Figure 1 plot omitted. Panel title: Golub data, ALL (red) and AML (blue); axes: gene #1 and gene #1674.]

Figure 1: Elliptical confidence regions of size 95% and 99%.

[Figure 2 plot omitted. Panel title: Golub data, ALL (red) and AML (blue); axes: gene #1 and gene #1674.]

Figure 2: Simultaneous confidence intervals of size 95%.

3.2 Hypothesis testing

Consider testing a null hypothesis $H_0 : \theta \in \Omega_0$ against an alternative hypothesis $H_1 : \theta \in \Omega_1$. The principle of the likelihood ratio test is as follows: let $L_0$ be the maximized likelihood under $H_0 : \theta \in \Omega_0$, and $L_1$ be the maximized likelihood under $\theta \in \Omega_0 \cup \Omega_1$. The likelihood ratio statistic, sometimes called the Wilks statistic, is then

\[ W = -2 \log\left( \frac{L_0}{L_1} \right) \ge 0. \]

The null hypothesis is rejected if the observed value of $W$ is large. In some cases the exact distribution of $W$ under $H_0$ can be evaluated. In other cases, Wilks' theorem states that for large $n$ (sample size),

\[ W \overset{L}{\Longrightarrow} \chi^2_\nu, \]

where $\nu$ is the number of free parameters in $H_1$ that are not in $H_0$. If the degrees of freedom in $\Omega_0$ is $q$ and the degrees of freedom in $\Omega_0 \cup \Omega_1$ is $r$, then $\nu = r - q$.

Consider testing hypotheses on $\mu$ and $\Sigma$ of a multivariate normal distribution, based on an $n$-sample $X_1, \dots, X_n$.

case I: $H_0 : \mu = \mu_0$, $H_1 : \mu \ne \mu_0$, $\Sigma$ is known. In this case, we know the exact distribution of the likelihood ratio statistic

\[ W = n (\bar{x} - \mu_0)' \Sigma^{-1} (\bar{x} - \mu_0) \sim \chi^2_p, \]

under $H_0$.
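The statistic of case I is a single quadratic form and can be sketched as follows, assuming NumPy and a p × n data matrix. The function name `wilks_known_sigma` and the toy data are hypothetical; the resulting value would be compared with a chi-square quantile with p degrees of freedom (for example via `scipy.stats.chi2.sf`, not used here).

```python
import numpy as np

def wilks_known_sigma(X, mu0, Sigma):
    """W = n (xbar - mu0)' Sigma^{-1} (xbar - mu0) for a p x n data matrix X."""
    p, n = X.shape
    diff = X.mean(axis=1) - np.asarray(mu0, dtype=float)
    return n * (diff @ np.linalg.solve(Sigma, diff))

# Toy data: two observations (columns) in dimension p = 2, so xbar = (2, 3).
X = np.array([[1.0, 3.0],
              [2.0, 4.0]])
W = wilks_known_sigma(X, mu0=[0.0, 0.0], Sigma=np.eye(2))   # 2 * (2^2 + 3^2) = 26
```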

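case II: $H_0 : \mu = \mu_0$, $H_1 : \mu \ne \mu_0$, $\Sigma$ is unknown. The m.l.e. under $H_1$ are $\hat{\mu} = \bar{x}$ and $\hat{\Sigma} = S_0$. The restricted m.l.e. of $\Sigma$ under $H_0$ is $\hat{\Sigma}_{(0)} = \frac{1}{n} \sum_{i=1}^n (x_i - \mu_0)(x_i - \mu_0)' = S_0 + \frac{1}{n} \delta\delta'$, where $\delta = \sqrt{n} (\bar{x} - \mu_0)$. The likelihood ratio statistic is then

\[ W = n \log\left| S_0 + \tfrac{1}{n} \delta\delta' \right| - n \log|S_0| = n \log\left( 1 + \tfrac{1}{n} \delta' S_0^{-1} \delta \right) = n \log\left( 1 + \frac{\delta' S^{-1} \delta}{n-1} \right). \]

Thus $W$ is a monotone increasing function of

\[ \delta' S^{-1} \delta = n (\bar{x} - \mu_0)' S^{-1} (\bar{x} - \mu_0), \]

which is Hotelling's $T^2(n-1)$ statistic.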
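The monotone relation between W and Hotelling's statistic can be checked numerically. The sketch below assumes NumPy and a p × n data matrix with n > p; it computes the restricted m.l.e. of Σ directly from its definition (1/n) Σ (x_i − µ0)(x_i − µ0)', and the function name `lr_and_hotelling` is hypothetical. For T² = n (x̄ − µ0)' S⁻¹ (x̄ − µ0), the two quantities should satisfy W = n log(1 + T²/(n − 1)).

```python
import numpy as np

def lr_and_hotelling(X, mu0):
    """Return (W, T2) for testing mu = mu0 with unknown Sigma; X is p x n."""
    p, n = X.shape
    mu0 = np.asarray(mu0, dtype=float)
    xbar = X.mean(axis=1)
    Xc = X - xbar[:, None]
    S0 = Xc @ Xc.T / n                         # unrestricted MLE of Sigma
    X0 = X - mu0[:, None]
    Sigma0_hat = X0 @ X0.T / n                 # restricted MLE of Sigma under H0
    W = n * (np.linalg.slogdet(Sigma0_hat)[1] - np.linalg.slogdet(S0)[1])
    S = n / (n - 1) * S0                       # unbiased sample covariance
    d = xbar - mu0
    T2 = n * (d @ np.linalg.solve(S, d))       # Hotelling's T^2
    return W, T2

rng = np.random.default_rng(3)
X = rng.normal(size=(3, 40))                   # simulated data
W, T2 = lr_and_hotelling(X, mu0=np.zeros(3))
```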

case III: $H_0 : \Sigma = \Sigma_0$, $H_1 : \Sigma \ne \Sigma_0$, $\mu$ is unknown. We have the likelihood ratio statistic

\[ W = -n \log|\Sigma_0^{-1} S_0| - np + n\, \mathrm{trace}(\Sigma_0^{-1} S_0). \]

This is a case where the exact distribution of $W$ is difficult to evaluate. For large $n$, use Wilks' theorem to approximate the distribution of $W$ by $\chi^2_\nu$ with degrees of freedom $\nu = p(p+1)/2$.
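The statistic of case III and its approximate degrees of freedom can be sketched as follows, assuming NumPy and a p × n data matrix with n > p. The function name `lr_covariance` is hypothetical, and the chi-square comparison itself is left to the reader.

```python
import numpy as np

def lr_covariance(X, Sigma0):
    """Return (W, nu) for testing Sigma = Sigma0 with unknown mu; X is p x n."""
    p, n = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)
    S0 = Xc @ Xc.T / n
    M = np.linalg.solve(Sigma0, S0)                    # Sigma0^{-1} S0
    W = -n * np.linalg.slogdet(M)[1] - n * p + n * np.trace(M)
    return W, p * (p + 1) // 2

rng = np.random.default_rng(4)
X = rng.normal(size=(3, 50))                           # simulated data
W_id, nu = lr_covariance(X, np.eye(3))                 # H0: Sigma = I
W_same, _ = lr_covariance(X, np.cov(X, bias=True))     # Sigma0 equal to S0 gives W = 0
```

Writing λ_i for the eigenvalues of Σ0⁻¹S0, the statistic equals n Σ (λ_i − 1 − log λ_i), which is why it is nonnegative and vanishes exactly when S0 = Σ0.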

Next, consider testing the equality of two mean vectors. Let $X_{11}, \dots, X_{1n_1}$ be i.i.d. $N_p(\mu_1, \Sigma)$ and $X_{21}, \dots, X_{2n_2}$ be i.i.d. $N_p(\mu_2, \Sigma)$.

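case IV: $H_0 : \mu_1 = \mu_2$, $H_1 : \mu_1 \ne \mu_2$, $\Sigma$ is unknown. Since

\[ \bar{X}_1 - \bar{X}_2 \sim N_p\left( \mu_1 - \mu_2, \frac{n_1 + n_2}{n_1 n_2} \Sigma \right), \qquad (n_1 + n_2 - 2) S_P \sim W_p(n_1 + n_2 - 2, \Sigma), \]

where $S_P$ is the pooled sample covariance matrix, we have Hotelling's $T^2$ statistic for the two-sample problem

\[ T^2(n_1 + n_2 - 2) = \frac{n_1 n_2}{n_1 + n_2} (\bar{X}_1 - \bar{X}_2)' S_P^{-1} (\bar{X}_1 - \bar{X}_2), \]

and, by Theorem 5 in Lecture 2, under $H_0$

\[ \frac{n_1 + n_2 - p - 1}{(n_1 + n_2 - 2)p} T^2(n_1 + n_2 - 2) \sim F_{p,\, n_1 + n_2 - p - 1}. \]

As in case II above, the likelihood ratio statistic is a monotone function of $T^2(n_1 + n_2 - 2)$.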
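A sketch of the two-sample computation follows, assuming NumPy, p ≥ 2, and two p × n_k data matrices with n1 + n2 − 2 ≥ p. The text does not spell out S_P, so the code assumes the usual pooled covariance ((n1 − 1)S1 + (n2 − 1)S2)/(n1 + n2 − 2); the function name `two_sample_hotelling` is hypothetical.

```python
import numpy as np

def two_sample_hotelling(X1, X2):
    """Return (T2, F) for testing mu1 = mu2; X1 is p x n1 and X2 is p x n2."""
    p, n1 = X1.shape
    n2 = X2.shape[1]
    d = X1.mean(axis=1) - X2.mean(axis=1)
    # Assumed definition of the pooled covariance S_P.
    Sp = ((n1 - 1) * np.cov(X1) + (n2 - 1) * np.cov(X2)) / (n1 + n2 - 2)
    T2 = n1 * n2 / (n1 + n2) * (d @ np.linalg.solve(Sp, d))
    F = (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p) * T2   # compare with F_{p, n1+n2-p-1}
    return T2, F

rng = np.random.default_rng(5)
X1 = rng.normal(size=(2, 15))                  # simulated samples
X2 = rng.normal(size=(2, 20)) + 1.0
T2, F = two_sample_hotelling(X1, X2)
```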

3.3 Hypothesis testing when p > n

In the high-dimensional situation where the dimension $p$ is larger than the sample size ($p > n - 1$ or $p > n_1 + n_2 - 2$), the sample covariance $S$ is not invertible, so Hotelling's $T^2$ statistic, which is essential in the testing procedures above, cannot be computed. We survey important proposals for testing hypotheses on means in high dimension, low sample size data.

A basic idea in generalizing a test procedure to the $p > n$ case is to base the test on a computable test statistic that is also an estimator of $\|\mu - \mu_0\|$ or $\|\mu_1 - \mu_2\|$.

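In case II (one sample), Dempster (1960) proposed to replace $S^{-1}$ in Hotelling's statistic by $(\mathrm{trace}(S) I_p)^{-1}$. He showed that under $H_0 : \mu = 0$,

\[ T_D = \frac{n \bar{X}' \bar{X}}{\mathrm{trace}(S)} \sim F_{r,(n-1)r}, \quad \text{approximately,} \]

for $r = \frac{(\mathrm{trace}(\Sigma))^2}{\mathrm{trace}(\Sigma^2)}$, a measure of sphericity of $\Sigma$. An estimator $\hat{r}$ of $r$ is used in testing.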
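Dempster's statistic needs only trace(S), so it remains computable when p > n. The sketch below assumes NumPy and a p × n data matrix; the function names `dempster_statistic` and `sphericity` are hypothetical. The estimator of r is not specified in the text, so `sphericity` only evaluates r for a given Σ.

```python
import numpy as np

def dempster_statistic(X):
    """T_D = n xbar'xbar / trace(S) for a p x n data matrix X (p may exceed n)."""
    p, n = X.shape
    xbar = X.mean(axis=1)
    Xc = X - xbar[:, None]
    trace_S = (Xc ** 2).sum() / (n - 1)        # trace(S) without forming the p x p matrix
    return n * (xbar @ xbar) / trace_S

def sphericity(Sigma):
    """r = trace(Sigma)^2 / trace(Sigma^2); equals p when Sigma is proportional to I_p."""
    return np.trace(Sigma) ** 2 / np.trace(Sigma @ Sigma)

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 10))                 # simulated data with p = 100 > n = 10
T_D = dempster_statistic(X)
```

The value of r ranges from 1 (a rank-one Σ) to p (a spherical Σ).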

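Bai and Saranadasa (1996) proposed to simply replace $S^{-1}$ in Hotelling's statistic by $I_p$, yielding $T_B = n \bar{X}' \bar{X}$. However, $\bar{X}' \bar{X}$ is not an unbiased estimator of $\mu'\mu$, since $E(\bar{X}' \bar{X}) = \frac{1}{n} \mathrm{trace}(\Sigma) + \mu'\mu$. They showed that the standardized statistic

\[ M_B = \frac{n \bar{X}' \bar{X} - \mathrm{trace}(S)}{\mathrm{sd}\left( n \bar{X}' \bar{X} - \mathrm{trace}(S) \right)} = \frac{n \bar{X}' \bar{X} - \mathrm{trace}(S)}{\sqrt{\frac{2(n-1)n}{(n-2)(n+1)} \left( \mathrm{trace}(S^2) - \frac{1}{n} (\mathrm{trace}(S))^2 \right)}} \]

has an asymptotic $N(0, 1)$ distribution for $p, n \to \infty$.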
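The standardized statistic can be sketched as below, following the formula as written in these notes. The code assumes NumPy and a p × n data matrix with n > 2; the function name `bai_saranadasa_one_sample` is hypothetical. To avoid forming the p × p matrix S when p is large, it uses the n × n matrix G = X̃'X̃/(n − 1), which has the same trace and the same trace of its square as S.

```python
import numpy as np

def bai_saranadasa_one_sample(X):
    """Standardized statistic M_B for H0: mu = 0; X is p x n (p may exceed n)."""
    p, n = X.shape
    xbar = X.mean(axis=1)
    Xc = X - xbar[:, None]
    G = Xc.T @ Xc / (n - 1)                    # n x n matrix sharing nonzero eigenvalues with S
    tr_S = np.trace(G)                         # equals trace(S)
    tr_S2 = np.sum(G * G)                      # equals trace(S^2), since G is symmetric
    var_hat = 2 * (n - 1) * n / ((n - 2) * (n + 1)) * (tr_S2 - tr_S ** 2 / n)
    return (n * (xbar @ xbar) - tr_S) / np.sqrt(var_hat)

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 12))                 # simulated data with p = 200 > n = 12
M_B = bai_saranadasa_one_sample(X)
```

The value would be compared with standard normal quantiles.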


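Srivastava and Du (2008) proposed to replace $S$ in Hotelling's statistic by its diagonal part $D_S = \mathrm{diag}(S)$. Then $T_S = n \bar{X}' D_S^{-1} \bar{X} - \frac{n-1}{n-3} p$ can be used to estimate $\frac{n(n-1)}{n-3} \| D_\Sigma^{-1/2} \mu \|^2$, where $D_\Sigma = \mathrm{diag}(\Sigma)$; this quantity is zero under $H_0 : \mu = 0$. Srivastava and Du's test statistic is then

\[ M_S = \frac{T_S}{\mathrm{sd}(T_S)} = \frac{n \bar{X}' D_S^{-1} \bar{X} - \frac{n-1}{n-3} p}{\sqrt{2 \left( \mathrm{trace}(R^2) - \frac{p^2}{n-1} \right)}}, \]

which has an asymptotic $N(0, 1)$ distribution for $p, n \to \infty$. Here $R = D_S^{-1/2} S D_S^{-1/2}$ is the sample correlation matrix.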
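The two ingredients of this test, the centered statistic T_S and the sample correlation matrix R, can be sketched as follows. The code assumes NumPy and a p × n data matrix with n > 3 and no constant variable; the function names are hypothetical. The standardization by sd(T_S) is deliberately left out of the sketch.

```python
import numpy as np

def srivastava_du_centered(X):
    """T_S = n xbar' D_S^{-1} xbar - (n-1)/(n-3) p for a p x n data matrix X."""
    p, n = X.shape
    xbar = X.mean(axis=1)
    s_diag = X.var(axis=1, ddof=1)             # diagonal of S
    return n * np.sum(xbar ** 2 / s_diag) - (n - 1) / (n - 3) * p

def sample_correlation(X):
    """R = D_S^{-1/2} S D_S^{-1/2}."""
    S = np.cov(X)
    d = 1.0 / np.sqrt(np.diag(S))
    return d[:, None] * S * d[None, :]

rng = np.random.default_rng(8)
X = rng.normal(size=(6, 20))                   # simulated data
T_S = srivastava_du_centered(X)
R = sample_correlation(X)
```

Because only the diagonal of S is inverted, T_S is computable for any p, including p > n.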

Chen and Qin (2010) improve on the two-sample test for mean vectors of Bai and Saranadasa (1996). For testing $H_0 : \mu_1 = \mu_2$, Bai and Saranadasa (1996) proposed to use $T_B = (\bar{X}_1 - \bar{X}_2)'(\bar{X}_1 - \bar{X}_2) - \frac{n_1 + n_2}{n_1 n_2} \mathrm{trace}(S_P)$. The subtraction of the trace term ensures that $E(T_B) = \|\mu_1 - \mu_2\|^2$. Chen and Qin (2010) proposed to avoid $\mathrm{trace}(S_P)$ by considering

\[ T_C = \frac{\sum_{i \ne j}^{n_1} X_{1i}' X_{1j}}{n_1 (n_1 - 1)} + \frac{\sum_{i \ne j}^{n_2} X_{2i}' X_{2j}}{n_2 (n_2 - 1)} - 2\, \frac{\sum_{i=1}^{n_1} \sum_{j=1}^{n_2} X_{1i}' X_{2j}}{n_1 n_2}. \]

Since $E(T_C) = \|\mu_1 - \mu_2\|^2$, Chen and Qin proposed to base the test on $T_C$.
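The statistic T_C can be computed from inner products alone, as in the following sketch. It assumes NumPy and two p × n_k data matrices with n_k ≥ 2; the function name `chen_qin_statistic` is hypothetical. The off-diagonal sums are obtained as the total of each Gram matrix minus its trace.

```python
import numpy as np

def chen_qin_statistic(X1, X2):
    """T_C for testing mu1 = mu2; X1 is p x n1 and X2 is p x n2 (p may exceed n)."""
    n1, n2 = X1.shape[1], X2.shape[1]
    G11 = X1.T @ X1                            # inner products within sample 1
    G22 = X2.T @ X2                            # inner products within sample 2
    G12 = X1.T @ X2                            # inner products across samples
    term1 = (G11.sum() - np.trace(G11)) / (n1 * (n1 - 1))   # sum over i != j
    term2 = (G22.sum() - np.trace(G22)) / (n2 * (n2 - 1))
    return term1 + term2 - 2 * G12.sum() / (n1 * n2)

rng = np.random.default_rng(9)
X1 = rng.normal(size=(300, 8))                 # simulated samples with p = 300
X2 = rng.normal(size=(300, 10)) + 0.2
T_C = chen_qin_statistic(X1, X2)
```

Expanding the sums shows that T_C equals ‖x̄1 − x̄2‖² − trace(S1)/n1 − trace(S2)/n2, with S_k the unbiased sample covariance of sample k, so the statistic removes the trace bias without a pooled covariance.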

There are many other ideas, including:

1. Use a test statistic that is essentially the maximum of the $p$ normalized marginal mean differences (Cai et al., 2013);

2. Use a generalized inverse of $S$, denoted by $S^-$ or $S^\dagger$, to replace $S^{-1}$;

3. Estimate $\Sigma$ in a way that is invertible;

4. Reduce the dimension $p$ of the random vector $X$ by $Z = h(X) \in \mathbb{R}^d$, for $d < n$, then apply the traditional theory of hypothesis testing.

The next lecture covers linear dimension reduction: principal component analysis.

References

Bai, Z. and Saranadasa, H. (1996), “Effect of high dimension: by an example of a two sample problem,” Statist. Sinica, 6, 311–329.

Cai, T. T., Liu, W., and Xia, Y. (2013), “Two-Sample Test of High Dimensional Means under Dependency,” To appear in Journal of the Royal Statistical Society: Series B.

Chen, S. X. and Qin, Y.-L. (2010), “A two-sample test for high-dimensional data with applications to gene-set testing,” The Annals of Statistics, 38, 808–835.

Dempster, A. P. (1960), “A significance test for the separation of two highly multivariate small samples,” Biometrics, 16, 41–50.

Srivastava, M. S. and Du, M. (2008), “A test for the mean vector with fewer observations than the dimension,” Journal of Multivariate Analysis, 99, 386–402.


In evaluating the MLE of $\Sigma$ for the multivariate normal distribution, one can use the following famous result.

Lemma 3 (von Neumann). For any $m \times m$ symmetric nonnegative definite matrices $A$ and $B$ with eigenvalues $\sigma_A = (\sigma_{A1}, \dots, \sigma_{Am})'$ and $\sigma_B = (\sigma_{B1}, \dots, \sigma_{Bm})'$ in decreasing order,

\[ |\mathrm{trace}(A'B)| \le \sigma_A' \sigma_B, \]

and equality holds when $A$ and $B$ have the same eigenvectors.

A general version of the von Neumann inequality is:

Lemma 4 (von Neumann). For any $m \times n$ matrices $A$ and $B$ with vectors of singular values $\sigma_A$ and $\sigma_B$ in decreasing order,

\[ |\mathrm{trace}(A'B)| \le \sigma_A' \sigma_B, \]

and equality holds when $A$ and $B$ are simultaneously diagonalizable.

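The problem was to minimize the negative log-likelihood, which, after substituting $\mu = \bar{x}$ and dropping constants, is proportional to $\log|\Sigma| + \mathrm{trace}(\Sigma^{-1} S_0)$. The parameter $\Sigma$, assumed to be positive definite, is eigen-decomposed as $\Sigma = U \Lambda U'$. Likewise, we can eigen-decompose the real symmetric matrix $S_0 = V D V'$. Then

\[
\begin{aligned}
\log|\Sigma| + \mathrm{trace}(\Sigma^{-1} S_0) &= \log|\Lambda| + \mathrm{trace}(U \Lambda^{-1} U' V D V') \\
&\ge \log|\Lambda| + \mathrm{trace}(\Lambda^{-1} D) \\
&= \sum_{i=1}^p \left[ \log(\lambda_i) + d_i / \lambda_i \right] \\
&= \sum_{i=1}^p \left[ (a_i - \log(a_i)) + \log(d_i) \right],
\end{aligned}
\]

where $a_i = d_i / \lambda_i$ (assuming every $d_i > 0$). Each term is minimized when $a_i = 1$, that is $\lambda_i = d_i$, and the inequality becomes an equality when $U = V$, so the minimizer is $\hat{\Sigma} = V D V' = S_0$.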
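The conclusion of this argument, that Σ = S0 minimizes log|Σ| + trace(Σ⁻¹S0), can be checked numerically against randomly chosen positive definite candidates. The sketch assumes NumPy and a positive definite S0 (n > p); the function name `profile_objective` and the way candidates are generated are hypothetical choices for illustration.

```python
import numpy as np

def profile_objective(Sigma, S0):
    """log|Sigma| + trace(Sigma^{-1} S0) for positive definite Sigma."""
    return np.linalg.slogdet(Sigma)[1] + np.trace(np.linalg.solve(Sigma, S0))

rng = np.random.default_rng(10)
X = rng.normal(size=(4, 60))                   # simulated data, p = 4 and n = 60
S0 = np.cov(X, bias=True)
best = profile_objective(S0, S0)               # equals log|S0| + p

# Random positive definite candidates of the form A A' + I.
candidates = []
for _ in range(20):
    A = rng.normal(size=(4, 4))
    candidates.append(profile_objective(A @ A.T + np.eye(4), S0))
```

Every value in `candidates` should be at least `best`, in line with the bound a − log(a) ≥ 1 used above.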
