A Tutorial on Multivariate Statistical Analysis


Transcript of A Tutorial on Multivariate Statistical Analysis

Page 1: A Tutorial on Multivariate Statistical Analysis

A Tutorial on Multivariate Statistical Analysis

Craig A. Tracy
UC Davis

SAMSI, September 2006

Page 2: A Tutorial on Multivariate Statistical Analysis

ELEMENTARY STATISTICS

We collect real-valued data from a sequence of experiments,

\[
X_1, X_2, \ldots, X_n.
\]

We might assume the underlying law is N(µ, σ²) with unknown mean µ and variance σ², and we want to estimate µ and σ² from the data.

Sample Mean & Sample Variance:

\[
\bar X = \frac{1}{n}\sum_j X_j, \qquad S = \frac{1}{n-1}\sum_j (X_j - \bar X)^2
\]

These estimators are "unbiased":

\[
E(\bar X) = \mu, \qquad E(S) = \sigma^2
\]
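As a quick numerical illustration (not part of the original slides), the following Python sketch, assuming NumPy is available, draws many samples of size n from N(µ, σ²) and checks that the averages of X̄ and S over the trials are close to µ and σ².

```python
# Monte Carlo check of unbiasedness (illustrative sketch, not from the slides).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 2.0, 3.0, 10, 200_000

X = rng.normal(mu, sigma, size=(trials, n))
xbar = X.mean(axis=1)            # sample mean of each trial
S = X.var(axis=1, ddof=1)        # sample variance with the 1/(n-1) factor

print(xbar.mean())               # close to mu = 2.0
print(S.mean())                  # close to sigma^2 = 9.0
```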

Page 3: A Tutorial on Multivariate Statistical Analysis

Theorem: If X_1, X_2, ... are independent N(µ, σ²) variables, then X̄ and S are independent. We have that X̄ is N(µ, σ²/n) and (n − 1)S/σ² is χ²(n − 1).

Recall χ²(d) denotes the chi-squared distribution with d degrees of freedom. Its density is

\[
f_{\chi^2}(x) = \frac{1}{2^{d/2}\,\Gamma(d/2)}\, x^{d/2-1} e^{-x/2}, \qquad x \ge 0,
\]

where

\[
\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\, dt, \qquad \Re(z) > 0.
\]
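The distributional claims of the theorem can also be checked numerically. A minimal sketch, assuming SciPy is available, compares (n − 1)S/σ² with the χ²(n − 1) distribution via a Kolmogorov–Smirnov test.

```python
# Check that (n-1) S / sigma^2 behaves like chi^2(n-1) (illustrative sketch).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu, sigma, n, trials = 0.0, 2.0, 8, 50_000

X = rng.normal(mu, sigma, size=(trials, n))
S = X.var(axis=1, ddof=1)
T = (n - 1) * S / sigma**2                          # claimed to be chi^2(n-1)

print(stats.kstest(T, stats.chi2(df=n - 1).cdf))    # large p-value expected
```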

Page 4: A Tutorial on Multivariate Statistical Analysis

MULTIVARIATE GENERALIZATIONS

From the classic textbook of Anderson [1]:

    Multivariate statistical analysis is concerned with data that consists of sets of measurements on a number of individuals or objects. The sample data may be heights and weights of some individuals drawn randomly from a population of school children in a given city, or the statistical treatment may be made on a collection of measurements, such as lengths and widths of petals and lengths and widths of sepals of iris plants taken from two species, or one may study the scores on batteries of mental tests administered to a number of students.

p = # of sets of measurements on a given individual
n = # of observations = sample size

Page 5: A Tutorial on Multivariate Statistical Analysis

Remarks:

• In the above examples, one can assume that p ≪ n since typically many measurements will be taken.

• Today it is common for p ≫ 1, so n/p is no longer necessarily large.

  Vehicle Sound Signature Recognition: Vehicle noise is a stochastic signal. The power spectrum is discretized to a vector of length p = 1200, with n ≈ 1200 samples from the same kind of vehicle.

  Astrophysics: The Sloan Digital Sky Survey typically has many observations (say, of quasar spectra), with the spectrum of each quasar binned, resulting in a large p.

  Financial data: S&P 500 stocks observed over monthly intervals for twenty years.

Page 6: A Tutorial on Multivariate Statistical Analysis

GAUSSIAN DATA MATRICES

The data are now n independent column vectors of length p,

\[
\vec x_1, \vec x_2, \ldots, \vec x_n,
\]

from which we construct the n × p data matrix

\[
X = \begin{pmatrix} \vec x_1^{\,T} \\ \vec x_2^{\,T} \\ \vdots \\ \vec x_n^{\,T} \end{pmatrix}.
\]

The Gaussian assumption is that

\[
\vec x_j \sim N_p(\mu, \Sigma).
\]

Many applications assume the mean has already been subtracted out of the data, i.e. µ = 0.
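A short sketch (not from the slides, assuming NumPy) of how such an n × p Gaussian data matrix can be generated; the particular Σ below is made up for illustration.

```python
# Build an n x p data matrix whose rows are independent draws from N_p(mu, Sigma).
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3
mu = np.zeros(p)                              # mean assumed already subtracted out
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])           # an assumed positive definite covariance

X = rng.multivariate_normal(mu, Sigma, size=n)   # rows are x_j^T, shape (n, p)
print(X.shape)                                   # (200, 3)
```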

Page 7: A Tutorial on Multivariate Statistical Analysis

Multivariate Gaussian Distribution

If x and y are vectors, the matrix x ⊗ y is defined by

\[
(x \otimes y)_{jk} = x_j y_k.
\]

If µ = E(x) is the mean of the random vector x, then the covariance matrix of x is the p × p matrix

\[
\Sigma = E\left[(x - \mu) \otimes (x - \mu)\right].
\]

Σ is a symmetric, non-negative definite matrix. If Σ > 0 (positive definite) and X ∼ N_p(µ, Σ), then the density function of X is

\[
f_X(x) = (2\pi)^{-p/2} (\det\Sigma)^{-1/2} \exp\!\left[-\tfrac{1}{2}\,(x-\mu)^T \Sigma^{-1} (x-\mu)\right], \qquad x \in \mathbb{R}^p.
\]

Sample mean:

\[
\bar x = \frac{1}{n} \sum_j \vec x_j, \qquad E(\bar x) = \mu
\]

Page 8: A Tutorial on Multivariate Statistical Analysis

Sample covariance matrix:

\[
S = \frac{1}{n-1} \sum_{j=1}^{n} (\vec x_j - \bar x) \otimes (\vec x_j - \bar x)
\]

For µ = 0 the sample covariance matrix can be written simply as

\[
\frac{1}{n-1}\, X^T X.
\]
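A minimal sketch (assuming NumPy) computing the sample covariance matrix both ways: with the sample mean subtracted and, for µ = 0, directly as X^T X/(n − 1).

```python
# Sample covariance matrix computed two ways (illustrative sketch).
import numpy as np

rng = np.random.default_rng(3)
n, p = 500, 4
X = rng.standard_normal((n, p))        # here mu = 0 and Sigma = I_p

Xc = X - X.mean(axis=0)                # subtract the sample mean from each column
S_centered = Xc.T @ Xc / (n - 1)       # (1/(n-1)) sum_j (x_j - xbar)(x_j - xbar)^T
S_raw = X.T @ X / (n - 1)              # simplified form, valid when mu = 0

print(np.allclose(S_centered, np.cov(X, rowvar=False)))   # True
print(np.max(np.abs(S_centered - S_raw)))                 # small, since mu = 0
```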

Some Notation: If X is an n × p data matrix formed from the n independent column vectors x_j with cov(x_j) = Σ, we can form one column vector vec(X) of length pn:

\[
\mathrm{vec}(X) = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
\]

Page 9: A Tutorial on Multivariate Statistical Analysis

The covariance of vec(X) is the np × np matrix

\[
I_n \otimes \Sigma = \begin{pmatrix}
\Sigma & 0 & \cdots & 0 \\
0 & \Sigma & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \Sigma
\end{pmatrix}.
\]

In this case we say the data matrix X constructed from n independent x_j ∼ N_p(µ, Σ) has distribution

\[
N_p(M, I_n \otimes \Sigma),
\]

where M = E(X) = 1 ⊗ µ and 1 is the column vector of all 1's.

Page 10: A Tutorial on Multivariate Statistical Analysis

WISHART DISTRIBUTION

Definition: If A = X^T X where the n × p matrix X is N_p(0, I_n ⊗ Σ), Σ > 0, then A is said to have the Wishart distribution with n degrees of freedom and covariance matrix Σ. We will say A is W_p(n, Σ).

Remarks:

• The Wishart distribution is the multivariate generalization of the chi-squared distribution.

• A ∼ W_p(n, Σ) is positive definite with probability one if and only if n ≥ p.

• The sample covariance matrix

  \[
  S = \frac{1}{n-1}\, A
  \]

  is W_p(n − 1, \tfrac{1}{n-1}\Sigma).
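A sketch (assuming NumPy; the Σ used here is made up) of sampling A = X^T X and confirming it is positive definite when n ≥ p. SciPy's scipy.stats.wishart also provides a direct sampler.

```python
# Sampling a Wishart matrix A = X^T X with X an n x p N_p(0, Sigma) data matrix.
import numpy as np

rng = np.random.default_rng(4)
n, p = 20, 5
Sigma = 0.5 * np.eye(p) + 0.5          # assumed covariance: 1 on the diagonal, 0.5 off

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)   # n x p data matrix
A = X.T @ X                                               # A ~ W_p(n, Sigma)

print(np.linalg.eigvalsh(A).min() > 0)   # positive definite w.p. 1 since n >= p
```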

Page 11: A Tutorial on Multivariate Statistical Analysis

WISHART DENSITY FUNCTION, n ≥ p

Let S_p denote the space of p × p positive definite (symmetric) matrices. If A = (a_{jk}) ∈ S_p, let

\[
(dA) = \text{volume element of } A = \bigwedge_{j \le k} da_{jk}.
\]

The multivariate gamma function is

\[
\Gamma_p(a) = \int_{S_p} e^{-\mathrm{tr}(A)}\, (\det A)^{a-(p+1)/2}\,(dA), \qquad \Re(a) > (p-1)/2.
\]

Theorem: If A is W_p(n, Σ) with n ≥ p, then the density function of A is

\[
\frac{1}{2^{np/2}\,\Gamma_p(n/2)\,(\det\Sigma)^{n/2}}\; e^{-\frac{1}{2}\mathrm{tr}(\Sigma^{-1}A)}\,(\det A)^{(n-p-1)/2}.
\]

Page 12: A Tutorial on Multivariate Statistical Analysis

Sketch of Proof:

• The density function for X is the multivariate Gaussian (including the volume element (dX))

  \[
  (2\pi)^{-np/2} (\det\Sigma)^{-n/2}\, e^{-\frac{1}{2}\mathrm{tr}(\Sigma^{-1}X^T X)}\, (dX)
  \]

• Recall the QR factorization [8]: Let X denote an n × p matrix with n ≥ p and full column rank. Then there exist a unique n × p matrix Q with Q^T Q = I_p and a unique p × p upper triangular matrix R with positive diagonal elements such that X = QR. Note A = X^T X = R^T R. (A numerical check of this step is sketched after the proof.)

• A Jacobian calculation [1, 13]: If A = X^T X, then

  \[
  (dX) = 2^{-p} (\det A)^{(n-p-1)/2}\, (dA)\,(Q^T dQ)
  \]

Page 13: A Tutorial on Multivariate Statistical Analysis

where

\[
(Q^T dQ) = \bigwedge_{j=1}^{p} \bigwedge_{k=j+1}^{n} q_k^T\, dq_j
\]

and Q = (q_1, \ldots, q_p) is the column representation of Q.

• Thus the joint distribution of A and Q is

  \[
  (2\pi)^{-np/2} (\det\Sigma)^{-n/2}\, e^{-\frac{1}{2}\mathrm{tr}(\Sigma^{-1}A)} \times 2^{-p} (\det A)^{(n-p-1)/2}\, (dA)\,(Q^T dQ)
  \]

• Now integrate over all Q. Use the fact that

  \[
  \int_{V_{n,p}} (Q^T dQ) = \frac{2^p \pi^{np/2}}{\Gamma_p(n/2)},
  \]

  where V_{n,p} is the set of real n × p matrices Q satisfying Q^T Q = I_p. (When n = p this is the orthogonal group.)
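The QR step used above can be checked numerically. A minimal sketch (assuming NumPy); note that numpy.linalg.qr does not enforce positive diagonal entries of R, so the signs are adjusted to match the convention of the proof.

```python
# Numerical check of the QR factorization step: X = QR with A = X^T X = R^T R.
import numpy as np

rng = np.random.default_rng(5)
n, p = 10, 4
X = rng.standard_normal((n, p))

Q, R = np.linalg.qr(X, mode='reduced')   # Q is n x p, R is p x p upper triangular
signs = np.sign(np.diag(R))              # flip signs so diag(R) > 0
Q, R = Q * signs, (R.T * signs).T

A = X.T @ X
print(np.allclose(Q.T @ Q, np.eye(p)))   # Q^T Q = I_p
print(np.allclose(A, R.T @ R))           # A = R^T R
```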

Page 14: A Tutorial on Multivariate Statistical Analysis

Remarks regarding the Wishart density function

• The case p = 2 was obtained by R. A. Fisher in 1915.

• General p was obtained by J. Wishart in 1928 by geometrical arguments.

• The proof outlined above came later. (See [1, 13] for a complete proof.)

• When Q is a p × p orthogonal matrix,

  \[
  \frac{\Gamma_p(p/2)}{2^p \pi^{p^2/2}}\,(Q^T dQ)
  \]

  is normalized Haar measure for the orthogonal group O(p). We denote this Haar measure by (dQ).

• Siegel proved (see, e.g., [13])

  \[
  \Gamma_p(a) = \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma\!\left(a - \tfrac{1}{2}(j-1)\right).
  \]

Page 15: A Tutorial on Multivariate Statistical Analysis

EIGENVALUES OF A WISHART MATRIX

Theorem: If A is W_p(n, Σ) with n ≥ p, the joint density function of the eigenvalues ℓ_1 > · · · > ℓ_p of A is

\[
\frac{\pi^{p^2/2}\, 2^{-np/2}\,(\det\Sigma)^{-n/2}}{\Gamma_p(p/2)\,\Gamma_p(n/2)}
\prod_{j=1}^{p} \ell_j^{(n-p-1)/2} \prod_{j<k} |\ell_j - \ell_k|
\times \int_{O(p)} e^{-\frac{1}{2}\mathrm{tr}(\Sigma^{-1} Q L Q^T)}\,(dQ),
\]

where L = diag(ℓ_1, ..., ℓ_p) and (dQ) is normalized Haar measure. Note that ∆(ℓ) := ∏_{j<k}(ℓ_j − ℓ_k) is the Vandermonde determinant.

Corollary: If A is W_p(n, I_p), then the integral over the orthogonal group in the previous theorem is

\[
e^{-\frac{1}{2}\sum_j \ell_j}.
\]

Page 16: A Tutorial on Multivariate Statistical Analysis

Proof: Recall that the Wishart density function (times the volume element) is

\[
\frac{1}{2^{np/2}\,\Gamma_p(n/2)\,(\det\Sigma)^{n/2}}\; e^{-\frac{1}{2}\mathrm{tr}(\Sigma^{-1}A)}\,(\det A)^{(n-p-1)/2}\,(dA).
\]

The idea is to diagonalize A by an orthogonal transformation and then integrate over the orthogonal group, thereby giving the density function for the eigenvalues of A.

Let ℓ_1 > · · · > ℓ_p be the ordered eigenvalues of A and write

\[
A = QLQ^T, \qquad L = \mathrm{diag}(\ell_1, \ldots, \ell_p), \qquad Q \in O(p).
\]

The jth column of Q is a normalized eigenvector of A. The transformation is not 1–1 since Q = [±q_1, ..., ±q_p] works for each fixed A. The transformation is made 1–1 by requiring that the first element of each q_j be nonnegative. This restricts Q (as A varies) to a 2^{-p} part of O(p). We compensate for this at the end.

Page 17: A Tutorial on Multivariate Statistical Analysis

We need an expression for the volume element (dA) in terms of Q and L. First we compute the differential of A:

\[
\begin{aligned}
dA &= dQ\,L\,Q^T + Q\,dL\,Q^T + Q\,L\,dQ^T \\
Q^T dA\,Q &= Q^T dQ\,L + dL + L\,dQ^T Q \\
&= -dQ^T Q\,L + L\,dQ^T Q + dL \\
&= \left[L,\, dQ^T Q\right] + dL
\end{aligned}
\]

(We used that Q^T Q = I implies Q^T dQ = −dQ^T Q.)

We now use the following fact (see, e.g., page 58 in [13]): if X = BYB^T where X and Y are p × p symmetric matrices and B is a nonsingular p × p matrix, then (dX) = (det B)^{p+1} (dY). In our case Q is orthogonal, so the volume element (dA) equals the volume element (Q^T dA\,Q). The volume element is the exterior product of the diagonal elements of Q^T dA\,Q times the exterior product of the elements above the diagonal.

Page 18: A Tutorial on Multivariate Statistical Analysis

Since L is diagonal, the commutator [L, dQ^T Q] has zero diagonal elements. Thus the exterior product of the diagonal elements of Q^T dA\,Q is ∧_j dℓ_j.

The exterior product of the elements coming from the commutator is

\[
\prod_{j<k} (\ell_j - \ell_k) \bigwedge_{j<k} q_k^T\, dq_j,
\]

and so

\[
(dA) = \bigwedge_{j<k} q_k^T\, dq_j\; \Delta(\ell) \bigwedge_j d\ell_j
     = \frac{2^p \pi^{p^2/2}}{\Gamma_p(p/2)}\,(dQ)\; \Delta(\ell) \bigwedge_j d\ell_j.
\]

The theorem now follows once we integrate over all of O(p) and divide the result by 2^p.

Page 19: A Tutorial on Multivariate Statistical Analysis

• One is interested in limit laws as n, p → ∞. For Σ = I_p, Johnstone [11] proved, using RMT methods, that with centering and scaling constants

  \[
  \mu_{np} = \left(\sqrt{n-1} + \sqrt{p}\,\right)^2, \qquad
  \sigma_{np} = \left(\sqrt{n-1} + \sqrt{p}\,\right)\left(\frac{1}{\sqrt{n-1}} + \frac{1}{\sqrt{p}}\right)^{1/3},
  \]

  the ratio

  \[
  \frac{\ell_1 - \mu_{np}}{\sigma_{np}}
  \]

  converges in distribution as n, p → ∞, n/p → γ < ∞, to the GOE largest eigenvalue distribution [15]. (A numerical sketch of these constants follows after these remarks.)

• El Karoui [6] has extended the result to γ ≤ ∞. The case p ≫ n appears, for example, in microarray data.

• Soshnikov [14] has lifted the Gaussian assumption under the additional restriction n − p = O(p^{1/3}).
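A numerical sketch (assuming NumPy, not from the slides) of Johnstone's centering and scaling constants applied to the largest eigenvalue of a white Wishart matrix.

```python
# Rescaled largest eigenvalue of a W_p(n, I_p) matrix using Johnstone's constants.
import numpy as np

rng = np.random.default_rng(6)
n, p = 400, 100

mu_np = (np.sqrt(n - 1) + np.sqrt(p)) ** 2
sigma_np = (np.sqrt(n - 1) + np.sqrt(p)) * (1 / np.sqrt(n - 1) + 1 / np.sqrt(p)) ** (1 / 3)

X = rng.standard_normal((n, p))          # Sigma = I_p
ell_1 = np.linalg.eigvalsh(X.T @ X).max()

print((ell_1 - mu_np) / sigma_np)        # approximately GOE Tracy-Widom distributed
```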

Page 20: A Tutorial on Multivariate Statistical Analysis

• For Σ ≠ I_p, the difficulty in establishing limit theorems comes from the integral

  \[
  \int_{O(p)} e^{-\frac{1}{2}\mathrm{tr}(\Sigma^{-1} Q \Lambda Q^T)}\,(dQ)
  \]

  Using zonal polynomials, infinite series expansions have been derived for this integral, but these expansions are difficult to analyze. See Muirhead [13].

• For complex Gaussian data matrices X, similar density formulas are known for the eigenvalues of X*X. Limit theorems for Σ ≠ I_p are known since the analogous group integral, now over the unitary group, is known explicitly: the Harish-Chandra–Itzykson–Zuber (HCIZ) integral (see, e.g., [17]). See the work of Baik, Ben Arous and Peche [2, 3] and El Karoui [7].

Page 21: A Tutorial on Multivariate Statistical Analysis

PRINCIPAL COMPONENT ANALYSIS (PCA), H. Hotelling, 1933

Population Principal Components: Let x be a p × 1 random vector with E(x) = µ and cov(x) = Σ > 0. Let λ_1 ≥ λ_2 ≥ · · · ≥ λ_p denote the eigenvalues of Σ and let H be an orthogonal matrix diagonalizing Σ: H^T Σ H = Λ = diag(λ_1, ..., λ_p). We write H in column vector form

\[
H = [h_1, \ldots, h_p],
\]

so that h_j is the p × 1 eigenvector of Σ corresponding to the eigenvalue λ_j. Define the p × 1 vector

\[
u = H^T x = (u_1, \ldots, u_p)^T.
\]

Then

\[
\mathrm{cov}(u) = E\left[(H^T x - H^T\mu) \otimes (H^T x - H^T\mu)\right]
= H^T E\left[(x - \mu) \otimes (x - \mu)\right] H
= H^T \Sigma H = \Lambda,
\]

so the u_j are uncorrelated.

Page 22: A Tutorial on Multivariate Statistical Analysis

Definition: u_j is called the jth principal component of x. Note var(u_j) = λ_j.

Statistical interpretation: The claim is that u_1 is the linear combination of the components of x that has maximum variance.

Proof: For simplicity of notation, set µ = 0. Let b denote any p × 1 vector with b^T b = 1 and form b^T x. Then

\[
\mathrm{var}(b^T x) = E\left(b^T x \cdot b^T x\right) = E\left(b^T x \cdot (b^T x)^T\right) = b^T E(x x^T)\, b = b^T \Sigma\, b.
\]

We want to maximize the right-hand side subject to the constraint b^T b = 1. By the method of Lagrange multipliers we maximize

\[
b^T \Sigma\, b - \lambda\,(b^T b - 1).
\]

Since Σ is symmetric, the vector of partial derivatives is

\[
2\Sigma b - 2\lambda b.
\]

Thus b must be an eigenvector of Σ with eigenvalue λ.

Page 23: A Tutorial on Multivariate Statistical Analysis

The largest variance corresponds to choosing the largest eigenvalue. The general result is that u_r has maximum variance among all normalized combinations uncorrelated with u_1, ..., u_{r−1}.

Sample principal components: Let S denote the sample covariance matrix of the data matrix X and let Q = [q_1, ..., q_p] be a p × p orthogonal matrix diagonalizing S:

\[
Q^T S\, Q = \mathrm{diag}(\ell_1, \ldots, \ell_p).
\]

The ℓ_j are the sample variances that are estimates for the λ_j. The vectors q_j are sample estimates for the vectors h_j.

If x is the random vector and u = H^T x is the vector of principal components, then û = Q^T x is the vector of sample principal components.
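A minimal sketch (assuming NumPy; the population Σ is made up) of sample principal components obtained from the eigendecomposition of S.

```python
# Sample principal components from the eigendecomposition of S.
import numpy as np

rng = np.random.default_rng(7)
n, p = 300, 3
Sigma = np.array([[4.0, 1.0, 0.0],
                  [1.0, 2.0, 0.5],
                  [0.0, 0.5, 1.0]])       # an assumed population covariance
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

S = np.cov(X, rowvar=False)               # sample covariance matrix
ell, Q = np.linalg.eigh(S)                # eigenvalues ascending, columns of Q are q_j
ell, Q = ell[::-1], Q[:, ::-1]            # reorder so that ell_1 >= ell_2 >= ...

U = (X - X.mean(axis=0)) @ Q              # sample principal components of the data
print(ell)                                # estimates of lambda_1 >= ... >= lambda_p
print(np.allclose(np.cov(U, rowvar=False), np.diag(ell)))   # components uncorrelated
```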

Page 24: A Tutorial on Multivariate Statistical Analysis

SCREE PLOTS

In applications: how many of the ℓ_j's are significant? A scree plot displays the ordered sample eigenvalues ℓ_1 ≥ ℓ_2 ≥ · · · against their index; one looks for the "elbow" where the eigenvalues level off.
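An illustrative scree-plot sketch (assuming NumPy and matplotlib; the synthetic data below are made up so the eigenvalues decay visibly).

```python
# Scree plot: ordered sample eigenvalues ell_1 >= ell_2 >= ... against their index.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(8)
X = rng.standard_normal((200, 10)) @ np.diag(np.linspace(3.0, 0.5, 10))
S = np.cov(X, rowvar=False)
ell = np.sort(np.linalg.eigvalsh(S))[::-1]   # ordered sample eigenvalues

plt.plot(np.arange(1, len(ell) + 1), ell, "o-")
plt.xlabel("index j")
plt.ylabel("sample eigenvalue")
plt.title("Scree plot")
plt.show()
```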

Page 25: A Tutorial on Multivariate Statistical Analysis

CANONICAL CORRELATION ANALYSIS (CCA), H. Hotelling, 1936

Suppose a large data set is naturally decomposed into two groups. For example, p × 1 random vectors \vec x_1, ..., \vec x_n make up one set and q × 1 random vectors \vec y_1, ..., \vec y_m the other. We are interested in the correlations between these two data sets. For example, in medicine we might have n measurements of age, height, and weight (p = 3) and m measurements of systolic and diastolic blood pressures (q = 2). We are interested in what combination of the components of x is most correlated with a combination of the components of y.

Population Canonical Correlations: Let x and y be two random vectors of size p × 1 and q × 1, respectively. We assume E(x) = E(y) = 0 and p ≤ q. Form the (p + q) × 1 vector

\[
\begin{pmatrix} x \\ y \end{pmatrix}
\]

Page 26: A Tutorial on Multivariate Statistical Analysis

and its (p + q) × (p + q) covariance matrix

\[
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.
\]

Let

\[
u := \alpha^T x \in \mathbb{R}, \qquad v := \gamma^T y \in \mathbb{R},
\]

where α and γ are vectors to be determined. We want to maximize the correlation

\[
\mathrm{corr}(u, v) = \frac{\mathrm{cov}(u, v)}{\sqrt{\mathrm{var}(u)\,\mathrm{var}(v)}}.
\]

The correlation does not change under scale transformations u → cu, etc., so we can maximize this correlation subject to the constraints

\[
E(u^2) = E(\alpha^T x \cdot \alpha^T x) = \alpha^T \Sigma_{11}\,\alpha = 1 \tag{1}
\]

\[
E(v^2) = \gamma^T \Sigma_{22}\,\gamma = 1 \tag{2}
\]

Page 27: A Tutorial on Multivariate Statistical Analysis

Under these constraints,

\[
\mathrm{corr}(u, v) = E(\alpha^T x \cdot \gamma^T y) = \alpha^T \Sigma_{12}\,\gamma.
\]

Let

\[
\psi = \alpha^T \Sigma_{12}\,\gamma - \tfrac{1}{2}\rho\,(\alpha^T \Sigma_{11}\,\alpha - 1) - \tfrac{1}{2}\lambda\,(\gamma^T \Sigma_{22}\,\gamma - 1),
\]

where ρ and λ are Lagrange multipliers. Set the vectors of partial derivatives to zero:

\[
\frac{\partial \psi}{\partial \alpha} = \Sigma_{12}\,\gamma - \rho\,\Sigma_{11}\,\alpha = 0 \tag{3}
\]

\[
\frac{\partial \psi}{\partial \gamma} = \Sigma_{12}^T\,\alpha - \lambda\,\Sigma_{22}\,\gamma = 0 \tag{4}
\]

If we left multiply (3) by α^T and (4) by γ^T and use the normalization conditions (1) and (2), we conclude λ = ρ. Thus (3) and (4) become

\[
\begin{pmatrix} -\rho\,\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & -\rho\,\Sigma_{22} \end{pmatrix}
\begin{pmatrix} \alpha \\ \gamma \end{pmatrix} = 0 \tag{5}
\]

Page 28: A Tutorial on Multivariate Statistical Analysis

with corr(u, v) = ρ, and ρ ≥ 0 is a solution to

\[
\det \begin{pmatrix} -\rho\,\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & -\rho\,\Sigma_{22} \end{pmatrix} = 0.
\]

This is a polynomial equation in ρ of degree p + q. Let ρ_1 denote the maximum root and α_1 and γ_1 corresponding solutions of (5).

Definition: u_1 and v_1 are called the first canonical variables and their correlation ρ_1 = corr(u_1, v_1) is called the first canonical correlation coefficient.

More generally, the rth pair of canonical variables is the pair of linear combinations u_r = (α^{(r)})^T x and v_r = (γ^{(r)})^T y, each of unit variance and uncorrelated with the first r − 1 pairs of canonical variables and having maximum correlation. The correlation corr(u_r, v_r) is the rth canonical correlation coefficient.

Page 29: A Tutorial on Multivariate Statistical Analysis

REDUCTION TO AN EIGENVALUE PROBLEM

Since

\[
\begin{pmatrix} -\rho\,\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & -\rho\,\Sigma_{22} \end{pmatrix}
=
\begin{pmatrix} \Sigma_{11} & 0 \\ 0 & I \end{pmatrix}
\begin{pmatrix} -\rho I & \Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1} \\ \Sigma_{21} & -\rho I \end{pmatrix}
\begin{pmatrix} I & 0 \\ 0 & \Sigma_{22} \end{pmatrix},
\]

the determinantal equation becomes

\[
\det\left(\rho^2 I - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\right) = 0.
\]

The nonzero roots ρ_1, ..., ρ_k, where k = rank(Σ_12), are called the population canonical correlation coefficients.

In applications Σ is not known. One uses the sample covariance matrix S to obtain the sample canonical correlation coefficients.
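A minimal sketch (assuming NumPy; the synthetic data and the induced dependence are made up) of computing sample canonical correlation coefficients from the eigenvalue problem above.

```python
# Sample canonical correlations from det(rho^2 I - S21 S11^{-1} S12 S22^{-1}) = 0.
import numpy as np

rng = np.random.default_rng(9)
n, p, q = 500, 2, 3
Z = rng.standard_normal((n, p + q))
Z[:, p] += 0.8 * Z[:, 0]                   # make the first y-variable depend on x_1

S = np.cov(Z, rowvar=False)
S11, S12 = S[:p, :p], S[:p, p:]
S21, S22 = S[p:, :p], S[p:, p:]

M = S21 @ np.linalg.solve(S11, S12) @ np.linalg.inv(S22)
rho_sq = np.sort(np.linalg.eigvals(M).real)[::-1]
print(np.sqrt(np.clip(rho_sq, 0, None)))   # r_1 >= r_2 >= ... (at most min(p, q) nonzero)
```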

Page 30: A Tutorial on Multivariate Statistical Analysis

AN EXAMPLE

The first application of Hotelling's canonical correlations is by F. Waugh [16] in 1942. He begins his paper with

    Professor Hotelling's paper, "Relations between Two Sets of Variates," should be widely known and his method used by practical statisticians. Yet, few practical statisticians seem to know of the paper, and perhaps those few are inclined to regard it as a mathematical curiosity rather than an important and useful method for analyzing concrete problems. This may be due to . . .

The Practical Problem: Relation of wheat characteristics to flour characteristics. Guiding principle:

    The grade of the raw material should give a good indication of the probable grade of the finished product.

Page 31: A Tutorial on Multivariate Statistical Analysis

The data: 138 samples of Canadian Hard Red Spring wheat and the flour made from each of these samples.*

    wheat quality                      flour quality
    x_1 = kernel texture               y_1 = wheat per barrel of flour
    x_2 = test weight                  y_2 = ash in flour
    x_3 = damaged kernels              y_3 = crude protein in flour
    x_4 = crude protein in wheat       y_4 = gluten quality index
    x_5 = foreign material

u_1 = α_1^T x = index of wheat quality, v_1 = γ_1^T y = index of flour quality,

corr(u_1, v_1) = 0.909

* In this example p > q. The data are normalized to mean 0 and variance 1.

Page 32: A Tutorial on Multivariate Statistical Analysis

DISTRIBUTION OF SAMPLE CANONICAL CORRELATION COEFFICIENTS

Given a sample of n observations on

\[
\begin{pmatrix} x \\ y \end{pmatrix}
\]

drawn from N_{p+q}(µ, Σ), let A be the (unnormalized) sample covariance matrix; then A is W_{p+q}(n, Σ). We have [13]

Theorem (Constantine, 1963): Let A have the W_{p+q}(n, Σ) distribution, where p ≤ q, n ≥ p + q, and Σ and A are partitioned as above. Then the joint probability density function of r_1^2, ..., r_p^2, the eigenvalues of A_{11}^{-1} A_{12} A_{22}^{-1} A_{21} (let ξ_j := r_j^2), is

\[
c_{p,q,n} \prod_{j=1}^{p} (1 - \rho_j^2)^{n/2}\; \prod_{j=1}^{p} \left[\xi_j^{(q-p-1)/2} (1 - \xi_j)^{(n-p-q-1)/2}\right] \cdot |\Delta(\xi)| \cdot \mathcal{F}(\xi)
\]

where

Page 33: A Tutorial on Multivariate Statistical Analysis

• c_{p,q,n} is a normalization constant.

• ∆(ξ) is the Vandermonde determinant: ∆(ξ) = \prod_{j<k}^{p} (ξ_j − ξ_k).

• F(ξ) is a two-matrix hypergeometric function which can be expressed as an infinite series involving zonal polynomials. See Theorem 11.3.2 and Definition 7.3.2 in Muirhead [13].

Null Distribution: For Σ_12 = 0 (x and y are independent), the above joint density for the sample canonical correlation coefficients reduces to

\[
c_{p,q,n} \prod_{j=1}^{p} \left[\xi_j^{(q-p-1)/2} (1 - \xi_j)^{(n-p-q-1)/2}\right] \cdot |\Delta(\xi)|
\]

In this case the distribution of the largest sample canonical correlation coefficient r_1 can be used for testing the null hypothesis H: Σ_12 = 0. We reject H for large values of r_1.

Page 34: A Tutorial on Multivariate Statistical Analysis

ZONAL POLYNOMIALS

Zonal polynomials naturally arise in multivariate analysis when considering group integrals such as

\[
\int_{O(m)} e^{-\mathrm{tr}(X H Y H^T)}\,(dH),
\]

where X and Y are m × m symmetric, positive definite matrices and (dH) is normalized Haar measure. Zonal polynomials can be defined either through the representation theory of GL(m, R) [12] or as eigenfunctions of certain Laplacians [5, 13].

Let S_m denote the space of m × m symmetric, positive definite matrices. Zonal polynomials C_λ(X), X ∈ S_m, are certain homogeneous polynomials in the eigenvalues of X that are indexed by partitions λ.

To give the precise definition we need some preliminary definitions.

Page 35: A Tutorial on Multivariate Statistical Analysis

Definition: A partition λ of n is a sequence λ = (λ_1, λ_2, ..., λ_ℓ) where the λ_j ≥ 0 are weakly decreasing and ∑_j λ_j = n. We denote this by λ ⊢ n.

For example, λ = (5, 3, 3, 1) is a partition of 12. The number of nonzero parts of λ is called the length of λ, denoted ℓ(λ).

If λ and µ are two partitions of n, we say λ < µ (lexicographic order) if, for some index i, λ_j = µ_j for j < i and λ_i < µ_i. For example,

    (1, 1, 1, 1) < (2, 1, 1) < (2, 2) < (3, 1) < (4)

If λ ⊢ n with ℓ(λ) = m, we define the monomial

\[
x^\lambda = x_1^{\lambda_1} x_2^{\lambda_2} \cdots x_m^{\lambda_m}
\]

and say x^µ is of higher weight than x^λ if µ > λ.
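A small sketch in Python (the partitions helper below is written just for this example) that enumerates the partitions of 4 and confirms the lexicographic ordering shown above.

```python
# Enumerate partitions of n as weakly decreasing tuples and sort them lexicographically.
def partitions(n, largest=None):
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, largest), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest

print(sorted(partitions(4)))
# [(1, 1, 1, 1), (2, 1, 1), (2, 2), (3, 1), (4)]  -- the ordering given above
```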

Page 36: A Tutorial on Multivariate Statistical Analysis

Metrics and Laplacians

Let X ∈ S_m and let dX = (dx_{jk}) be the matrix of differentials of X. We define a metric on S_m by

\[
(ds)^2 = \mathrm{tr}\left(X^{-1} dX \cdot X^{-1} dX\right).
\]

A simple computation shows this metric is invariant under X → L X L^T, L ∈ GL(m, R).

Let n = m(m + 1)/2 = # of independent elements of X. We denote by vec(X) ∈ R^n the column vector representation of X, e.g.

\[
X = \begin{pmatrix} x_{11} & x_{12} \\ x_{12} & x_{22} \end{pmatrix}, \qquad
x := \mathrm{vec}(X) = \begin{pmatrix} x_{11} \\ x_{12} \\ x_{22} \end{pmatrix}.
\]

The metric (ds)^2 is a quadratic differential form in dx.

Page 37: A Tutorial on Multivariate Statistical Analysis

For example,

\[
(ds)^2 = \begin{pmatrix} dx_{11} & dx_{12} & dx_{22} \end{pmatrix} \cdot G(x) \cdot
\begin{pmatrix} dx_{11} \\ dx_{12} \\ dx_{22} \end{pmatrix},
\]

where

\[
G(x) = (g_{ij}) = \frac{1}{(x_{11}x_{22} - x_{12}^2)^2}
\begin{pmatrix}
x_{22}^2 & -2x_{12}x_{22} & x_{12}^2 \\
-2x_{12}x_{22} & 2x_{11}x_{22} + 2x_{12}^2 & -2x_{11}x_{12} \\
x_{12}^2 & -2x_{11}x_{12} & x_{11}^2
\end{pmatrix}.
\]

Labeling x = vec(X) with a single index, the metric is in the standard form

\[
(ds)^2 = \sum_{i,j} dx_i\, g_{ij}\, dx_j.
\]

Page 38: A Tutorial on Multivariate Statistical Analysis

and the Laplacian associated to this metric is

\[
\Delta_X = (\det G)^{-1/2} \sum_{j=1}^{n} \frac{\partial}{\partial x_j}\left[(\det G)^{1/2} \sum_{i=1}^{n} g^{ij}\, \frac{\partial}{\partial x_i}\right],
\]

where

\[
G^{-1} = (g^{ij}).
\]

More succinctly, if

\[
\nabla_X = \begin{pmatrix} \partial/\partial x_1 \\ \vdots \\ \partial/\partial x_n \end{pmatrix},
\]

then

\[
\Delta_X = (\det G)^{-1/2}\left(\nabla_X,\; (\det G)^{1/2}\, G^{-1} \nabla_X\right),
\]

where (·, ·) is the standard inner product on R^n.

Page 39: A Tutorial on Multivariate Statistical Analysis

One can show that ∆_X is an invariant differential operator:

\[
\Delta_{LXL^T} = \Delta_X, \qquad L \in GL(m, R).
\]

We now diagonalize X:

\[
X = HYH^T, \qquad Y = \mathrm{diag}(y_1, \ldots, y_m), \qquad H \in O(m).
\]

The Laplacian ∆_X is now expressed in terms of a radial part and an angular part. The radial part of ∆_X is the differential operator

\[
\sum_{j=1}^{m} y_j^2\, \frac{\partial^2}{\partial y_j^2}
+ \sum_{j=1}^{m} \sum_{\substack{k=1 \\ k \ne j}}^{m} \frac{y_j^2}{y_j - y_k}\, \frac{\partial}{\partial y_j}
+ \sum_{j} y_j\, \frac{\partial}{\partial y_j}.
\]

From now on we let ∆_X denote only the radial part.

We can now define zonal polynomials!

Page 40: A Tutorial on Multivariate Statistical Analysis

Definition: Let X ∈ S_m with eigenvalues x_1, ..., x_m and let λ = (λ_1, ..., λ_m) ⊢ k be a partition of k into not more than m parts. Then C_λ(X) is the symmetric, homogeneous polynomial of degree k in the x_j such that

1. The term of highest weight in C_λ(X) is x^λ.

2. C_λ(X) is an eigenfunction of the Laplacian ∆_X.

3. (tr(X))^k = (x_1 + \cdots + x_m)^k = \sum_{\lambda \vdash k,\; \ell(\lambda) \le m} C_\lambda(X).

Remarks:

• One must show there is a unique polynomial satisfying these requirements.

• The eigenvalue in (2) equals α_λ := \sum_j \lambda_j(\lambda_j - j) + k(m + 1)/2.

• The program MOPS [5] computes zonal polynomials.

Page 41: A Tutorial on Multivariate Statistical Analysis

The following theorem is at the core of why zonal polynomials appear in multivariate statistical analysis.

Theorem: If X, Y ∈ S_p, then

\[
\int_{O(p)} C_\lambda(X H Y H^T)\,(dH) = \frac{C_\lambda(X)\, C_\lambda(Y)}{C_\lambda(I_p)} \tag{6}
\]

where (dH) is normalized Haar measure.

Proof: Let f_λ(Y) denote the left-hand side of (6) and let Q ∈ O(p). Then f_λ(QYQ^T) = f_λ(Y) (let H → HQ in the integral and use the invariance of the measure). Thus f_λ is a symmetric function of the eigenvalues of Y. Since C_λ is homogeneous of degree |λ|, so is f_λ. Now apply the

Page 42: A Tutorial on Multivariate Statistical Analysis

Laplacian to f_λ:

\[
\begin{aligned}
\Delta_Y f_\lambda(Y) &= \int_{O(p)} \Delta_Y C_\lambda(X H Y H^T)\,(dH) \\
&= \int \Delta_Y C_\lambda(X^{1/2} H Y H^T X^{1/2})\,(dH) \\
&= \int \Delta_Y C_\lambda(L Y L^T)\,(dH) \qquad (L = X^{1/2} H) \\
&= \int \Delta_{LYL^T} C_\lambda(L Y L^T)\,(dH) \qquad \text{(invariance of } \Delta_Y\text{)} \\
&= \alpha_\lambda \int C_\lambda(L Y L^T)\,(dH) \\
&= \alpha_\lambda f_\lambda(Y)
\end{aligned}
\]

By the definition of C_λ(Y) we have f_λ(Y) = d_λ C_λ(Y) for some constant d_λ. Since f_λ(I_p) = C_λ(X), setting Y = I_p gives d_λ = C_λ(X)/C_λ(I_p), and the theorem follows.

Page 43: A Tutorial on Multivariate Statistical Analysis

Using Zonal Polynomials to Evaluate Group Integrals

\[
\begin{aligned}
\int_{O(p)} e^{-\rho\,\mathrm{tr}(X H Y H^T)}\,(dH)
&= \sum_{k=0}^{\infty} \frac{(-\rho)^k}{k!} \int_{O(p)} \left(\mathrm{tr}(X H Y H^T)\right)^k (dH) \\
&= \sum_{k=0}^{\infty} \frac{(-\rho)^k}{k!} \sum_{\substack{\lambda \vdash k \\ \ell(\lambda) \le p}} \int_{O(p)} C_\lambda(X H Y H^T)\,(dH) \\
&= \sum_{k=0}^{\infty} \frac{(-\rho)^k}{k!} \sum_{\substack{\lambda \vdash k \\ \ell(\lambda) \le p}} \frac{C_\lambda(X)\, C_\lambda(Y)}{C_\lambda(I_p)}
\end{aligned}
\]

Page 44: A Tutorial on Multivariate Statistical Analysis

References

[1] T. W. Anderson, An Introduction to Multivariate Statistical Analysis, third edition, John Wiley & Sons, Inc., 2003.

[2] J. Baik, G. Ben Arous, and S. Peche, Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices, Ann. Probab. 33 (2005), 1643–1697.

[3] J. Baik, Painleve formulas of the limiting distributions for nonnull complex sample covariance matrices, Duke Math. J. 133 (2006), 205–235.

[4] M. Dieng and C. A. Tracy, Application of random matrix theory to multivariate statistics, arXiv:math.PR/0603543.

[5] I. Dumitriu, A. Edelman and G. Shuman, MOPS: Multivariate Orthogonal Polynomials (symbolically), arXiv:math-ph/0409066.

[6] N. El Karoui, On the largest eigenvalue of Wishart matrices with identity covariance when n, p and p/n tend to infinity, arXiv:math.ST/0309355.

[7] N. El Karoui, Tracy–Widom limit for the largest eigenvalue of a large class of complex Wishart matrices, arXiv:math.PR/0503109.

[8] G. H. Golub and C. F. Van Loan, Matrix Computations, second edition, Johns Hopkins University Press, 1989.

[9] H. Hotelling, Analysis of a complex of statistical variables into principal components, J. of Educational Psychology 24 (1933), 417–441, 498–520.

[10] H. Hotelling, Relations between two sets of variates, Biometrika 28 (1936), 321–377.

[11] I. M. Johnstone, On the distribution of the largest eigenvalue in principal component analysis, Ann. Stat. 29 (2001), 295–327.

[12] I. G. Macdonald, Symmetric Functions and Hall Polynomials, 2nd ed., Oxford University Press, 1995.

[13] R. J. Muirhead, Aspects of Multivariate Statistical Theory, John Wiley & Sons Inc., 1982.

[14] A. Soshnikov, A note on universality of the distribution of the largest eigenvalue in certain sample covariance matrices, J. Statistical Physics 108 (2002), 1033–1056.

[15] C. A. Tracy and H. Widom, On orthogonal and symplectic matrix ensembles, Commun. Math. Phys. 177 (1996), 727–754.

[16] F. V. Waugh, Regression between sets of variables, Econometrica 10 (1942), 290–310.

[17] P. Zinn-Justin and J.-B. Zuber, On some integrals over the U(N) unitary group and their large N limit, J. Phys. A: Math. Gen. 36 (2003), 3173–3193.