SM-4331 Advanced Statistics Chapter 3 (Important ... · SM-4331 Advanced Statistics Chapter 3...

SM-4331 Advanced StatisticsChapter 3 (Important Univariate and Multivariate

Distributions)

Dr Haziq JamilFOS M1.09

Universiti Brunei Darussalam

Semester II 2019/20

Outline

1 χ2-distribution

2 Student’s t-distribution

3 F -distribution

4 Multivariate distributionsBivariate distributionsMultivariate distributions

5 Multinomial and categorical distributionMultinomial distributionCategorical distribution

6 Multivariate normal distribution

Haziq Jamil - https://haziqj.ml SM-4331 Chapter 3 Sem II 2019/20 0 / 70

https://haziqj.ml

χ2-distribution »

χ2-distribution

The χ2-distribution is an important distribution in statistics. It is closelylinked with the normal, Student’s t and F distributions. Inference for thevariance parameter σ2 relies on χ2-distributions. More importantly, mostgoodness-of-fit tests are based on χ2-distributions.

Definition 1 (χ2-distribution)

Let Z1, . . . ,Zkiid∼ N(0, 1), i.e. each Zi has pdf f (zi ) = (2π)−1/2e−z

2i /2 for

i = 1, . . . , k . Then,

X = Z 21 + · · ·+ Z 2

k =k∑

i=1

Z 2i

follows a χ2-distribution with k degrees of freedom. We write X ∼ χ2k .


https://haziqj.ml

χ2-distribution »

χ2-distribution (cont.)

RemarkOut of curiosity, the pdf of a χ2

k distribution is f (x) = Cxk/2−1e−x/2,where the normalising constant C is equal to 2−k/2Γ−1(k/2) (Γ(·) is thegamma function). The form of the pdf is less important to know than thedefinition of χ2

k distribution given in Definition 1.

Here are some important properties of the χ2k distribution.

1. X has support over [0,∞).2. E(X ) = k .3. Var(X ) = 2k .4. If X1 ∼ χ2

k1and X2 ∼ χ2

k2, and X1 ⊥ X2, then X1 + X2 ∼ χ2

k1+k2.


https://haziqj.ml

χ2-distribution »


Proof.Prove properties 2–4 as an exercise.


https://haziqj.ml

χ2-distribution »



https://haziqj.ml

χ2-distribution »

Probabilities tables for the χ2-distribution

Probabilities such as

P(χ2k ≤ x) =

∫ x

0fX (x) dx

where fX is the pdf of χ2k cannot be found in closed form. It is calculated

using computer approximations for the integral above. In R, use pchisq().

Alternatively, statistical tables are used. You will find tables for percentilesof the χ2 distribution. That is, you are able to find the value of x := χ2

k(A)such that

P(χ2k ≤ x) =

∫ x

0fX (x) dx = A

for various values of A and k .


https://haziqj.ml

χ2-distribution »

Example

Example 2

Let Y1, . . . ,Yniid∼ N(µ, σ2). Then,

Zi =Yi − µσ

∼ N(0, 1).

Hence,1σ2

n∑i=1

(Yi − µ)2 =n∑

i=1

Z 2i ∼ χ2

n.

Note that

1σ2

n∑i=1

(Yi − µ)2 =1σ2

n∑i=1

(Yi − Yn)2 +n

σ2 (Yn − µ)2. (1)


https://haziqj.ml

χ2-distribution »

Example (cont.)

Example 2Since Yn ∼ N(µ, σ2/n), it must be that

n

σ2 (Yn − µ)2 ∼ χ21.

It can also be proved (see Exercise Sheet 3) that

1σ2

n∑i=1

(Yi − Yn)2 ∼ χ2n−1.

Thus, the decomposition in (1) may formally be written as

χ2n = χ2

n−1 + χ21


https://haziqj.ml

1 χ2-distribution


3 F -distribution

4 Multivariate distributions

5 Multinomial and categorical distribution



https://haziqj.ml

Student’s t-distribution »

Student’s t-distribution

This is another important distribution in statistics, because:• The t-test is perhaps the most frequently used statistical test in

application.• Confidence intervals for normal mean with unknown variance may be

accurately constructed based on the t-distribution.

Historical note: The t-distribution was first studied by the EnglishmanWilliam Sealy Gosset (1876-1937), who worked as a statistician forGuinness, writing under the pen-name “Student”.


https://haziqj.ml


Student’s t-distribution (cont.)

Definition 3Suppose• Z ∼ N(0, 1),• X ∼ χ2

k , and• X ⊥ Z , i.e. X and Z are independent.

Then, the distribution of the random variable

T =Z√X/k

is called the t-distribution with k degrees of freedom. We write T ∼ tk .


https://haziqj.ml



RemarkThe pdf for T ∼ tk is given by

f (t) ∝(1 +

t2

k

)− k+12

,

but once again the actual form of the pdf is not as important as thedefinition of the t-distribution.

Some important properties of the t-distribution:1. T is continuous and symmetric over (−∞,∞).2. E(T ) = 0, provided E(|T |) <∞ (k > 1).3. Var(T ) = k

k−2 .4. Technically, k ∈ R, but we will usually deal with k ∈ Z > 0.


https://haziqj.ml



5. tkD−→ N(0, 1) as k →∞.

Proof.

If X ∼ χ2k , then by definition X = Z 2

1 + · · ·+ Z 2k , where Zi

iid∼ N(0, 1). Bythe LLN,

X

k=

Z 21 + · · ·+ Z 2

k

kP−→ E(Z 2

1 ) = 1.

as k →∞. Therefore,√X/k

P−→ 1, and in particular,

T =Z√X/k

D−→ N(0, 1)

following Slutzky’s theorem.


https://haziqj.ml



6. The t-distribution has heavy tails. That is, if T ∼ tk , E(|T |k) =∞.Comparing this to the normal distribution: X ∼ N(µ, σ2),E(|X |k) <∞ for any k > 0.

RemarkThis ‘heavy-tails’ property is a useful property in modelling abnormalphenomena or outliers (e.g. in financial or insurance data). C.f. “robuststatistics”.

Explore the t-distribution vs normal distribution here:https://eripoll12.shinyapps.io/t_Student/


https://eripoll12.shinyapps.io/t_Student/

https://haziqj.ml




https://haziqj.ml


An important property of normal samples

Theorem 4 (An important property of normal samples)

Let {X1, . . . ,Xn} be a sample from N(µ, σ2). Let

X =1n

n∑i=1

Xi , s2 =1

n − 1

n∑i=1

(Xi − X )2, and SE(X ) = s/√n.

Then,i. X ∼ N(µ, σ2/n)

ii. (n − 1)s2/σ2 ∼ χ2n−1

iii. X ⊥ s2

iv.√n(X−µ)

s = X−µSE(X )

∼ tn−1


https://haziqj.ml


An important property of normal samples (cont.)

Proof.i. follows directly from properties of normal distributions, and earlier we“proved” 1

σ2

∑ni=1(Xi − X )2 ∼ χ2

n−1 which implies ii.

Consider any Xj , j ∈ {1, . . . , n} and Cov(Xj − X , X ):

Cov(Xj − X , X ) = Cov(Xj , X )− Cov(X , X )

= Cov

(Xj ,

1n

n∑i=1

Xi

)− Var(X )

=1n

n∑i=1

Cov(Xj ,Xi )− σ2/n

= σ2/n − σ2/n = 0

Since the covariance is zero and they are normal, they are independent.


https://haziqj.ml



Proof.Following this, if X is independent of Xj − X for any j , it stands to reasonthat X is also independent of X = (X1 − X , . . . ,Xn − X )>, and also of

X>X =(X1 − X · · · Xn − X

)X1 − X...

Xn − X

=n∑

i=1

(Xi − X )2 = (n − 1)s2,

and thus also of s2.

Remark: we used the fact that if X ⊥ Yi , then g(X ) ⊥ g(Yi ), and alsog(X ) ⊥ {g(Y1) + · · ·+ g(Yn)}.


https://haziqj.ml



Proof.Finally, putting everything together,

N(0,1)︷︸︸︷√n(X − µ)/σ√√√√√ χ2n−1︷︸︸︷(n − 1)s2/σ2

n−1

=X − µs/√n

=X − µSE(X )

∼ tn−1.

RemarkThis is why for normal distributions where σ2 is unknown, and is estimatedby the unbiased sample variance s2, the standardised sample mean followsa t-distribution! This gives rise to the t-test.


https://haziqj.ml

1 χ2-distribution


3 F -distribution





https://haziqj.ml

F -distribution »

F -distribution

The F -distribution is another notable distribution in statistics. It commonlyarises as the null distribution of a test statistic, particularly in the analysisof variance (ANOVA).

Definition 5Let X1 ∼ χ2

k1and X2 ∼ χ2

k2. Then, the distribution of

Y =X1/k1

X2/k2

is called the F -distribution with (k1, k2) degrees of freedom. We writeY ∼ Fk1,k2 .


https://haziqj.ml

F -distribution »

F -distribution (cont.)

RemarkNot even going to bother writing down the pdf! See for yourself:https://en.wikipedia.org/wiki/F-distribution. Remember thedefinition, though.

Some important properties of the F -distribution:1. Y is continuous and has support over [0,∞), provided k1 > 1.2. E(Y ) = k2

k2−2 , provided k2 > 2.

3. Var(Y ) =2k22 (k1+k2−2)

k1(k2−2)2(k2−4), provided k2 > 4.

4. Technically, k1, k2 ∈ R>0, but we will usually deal with k1, k2 ∈ Z > 0.


https://en.wikipedia.org/wiki/F-distribution

https://haziqj.ml

F -distribution »


5. If Y ∼ Fk1,k2 , then Y−1 ∼ Fk2,k1 .6. If T ∼ tk , then T 2 ∼ F1,k .

Proof.Exercise: Prove properties 5 and 6.


https://haziqj.ml

F -distribution »



https://haziqj.ml

F -distribution »

The analysis of variance

The ANOVA, despite its name, is a (collection of) methods used to analysedifferences among group means in a sample. The ANOVA was developedby Sir Ronald Fisher.

Example 6Let Xij ∼ N(µj , σ

2), i = 1, . . . , nj and j = 1, . . . ,m with both µj and σ2

unknown. Let n =∑m

j=1 nj be the total sample size. Define the grandmean and the respective group means to be X = n−1∑

i ,j Xij andXj = n−1

j

∑nji=1 Xij respectively. The “total sum of squares” is

S =∑i ,j

(Xij − X )2


https://haziqj.ml

F -distribution »


Example 6The total sum of squares can be decomposed into

S =

S1︷︸︸︷∑i ,j

(Xij − Xj)2 +

S2︷︸︸︷∑j

nj(Xj − X )2

where S1 is called the “within sum of squares” (how much variation amongindividuals in each group) and S2 the “ between sum of squares” (howmuch variation in the mean among groups).

There is the concept of “degrees of freedom”: n − 1 in the TSS, m − 1 inthe BSS, and therefore n −m in the WSS.


https://haziqj.ml

F -distribution »


Example 6This gives rise to the ANOVA table:

Source SS d.f. MSS F-statistic

Between∑

j nj(Xj − X )2 m − 1∑

j nj (Xj−X )2

m−1

∑j nj (Xj−X )2/(m−1)∑i,j (Xij−Xj )2/(n−m)

Within∑

i ,j(Xij − Xj)2 n −m

∑i,j (Xij−Xj )

2

n−m

Total∑

i ,j(Xij − X )2 n − 1

What is the distribution of F?


https://haziqj.ml

F -distribution »


Example 6We have seen that

1σ2

∑i ,j

(Xij − X )2 ∼ χ2n−1.

In fact, we can also show similarly that

1σ2

∑i ,j

(Xij − Xj)2 ∼ χ2

n−m.

Using these two facts, we deduce that

1σ2

∑j

nj(Xj − X )2 ∼ χ2m−1

from the property of χ2-distributions.


https://haziqj.ml

F -distribution »


Example 6So now,

F =

��1/σ2

χ2m−1︷︸︸︷∑j

nj(Xj − X )2/(m − 1)

��1/σ2

χ2n−m︷︸︸︷∑i ,j

(Xij − X )2/(n −m)

is a ratio of two χ2-distributions, which means that F follows anF -distribution with (m − 1, n −m) degrees of freedom.


https://haziqj.ml

F -distribution »


Stop to thinkWhat happens when there are only two groups (m = 2)?• What is the distribution of χ2

1 equal to?• What is the distribution of the test statistic F?• What is the distribution of

√F?


https://haziqj.ml

1 χ2-distribution


3 F -distribution





https://haziqj.ml

Multivariate distributions » Bivariate distributions

Bivariate distributions

For a pair of r.v. (X ,Y ), the joint pdf fX ,Y (x , y) for (X ,Y ) is a probabilitydistribution that gives the probability that each of X and Y falls in anyparticular range or set of values specified for these variables.

The marginal distributions may be obtained by

fX (x) =

{∑y fX ,Y (x , y) if Y is discrete∫

y fX ,Y (x , y) dy if Y is continuous

fY (y) =

{∑x fX ,Y (x , y) if X is discrete∫

x fX ,Y (x , y) dx if X is continuous


https://haziqj.ml


Bivariate distributions (cont.)

The joint cdf is defined as

FX ,Y (x , y) = P(X ≤ x ,Y ≤ y).

From this, the marginal distributions (cdf) are given by

FX (x) = P(X ≤ x ,Y ≤ ∞) = FX ,Y (x ,∞)

andFY (y) = P(Y ≤ ∞,Y ≤ y) = FX ,Y (∞, y)


https://haziqj.ml


Bivariate distributions (cont.)

To be even more specific, we define the joint pdf as follows:

Definition 7 (Joint bivariate pdf)In the discrete case, i.e. X and Y are two discrete r.v., the jointprobability mass function (pmf) is

pX ,Y (x , y) = P(X = x ,Y = y)

In the continuous case, i.e. X and Y are two continuous r.v., the jointprobability density function (pdf) is

fX ,Y (x , y) =∂2

∂x∂yFX ,Y (x , y).


https://haziqj.ml


Properties of bivariate distributions

1. Since the joint pdf is a probability distribution, it must satisfy∑x

∑y

P(X = x ,Y = y) = 1

in the discrete case, and∫x

∫yfX ,Y (x , y) dx dy = 1

in the continuous case.


https://haziqj.ml


Properties of bivariate distributions (cont.)

2. We can write the pmf/pdf in terms of conditional distributions. Forthe discrete case, this is

pX ,Y (x , y) = P(Y = y |X = x) P(X = x)

= P(X = x |Y = y) P(Y = y).

For the continuous case, this is

fX ,Y (x , y) = fY |X (y |x)fX (x)

= fX |Y (x |y)fY (y).


https://haziqj.ml



3. Though we won’t be looking at these, it is possible to have a “mixed”joint pdf/pmf, i.e. X is continuous and Y is discrete, or the other wayaround. The joint pdf may be written

fX ,Y (x , y) = fX |Y (x |y) P(Y = y)

= P(Y = y |X = x)fX (x)

and its joint cdf by

FX ,Y (x , y) =∑y≤y

∫ x

−∞fX ,Y (x , y) dx

One common example is when using logistic regression in predictingprobability of a binary outcome, conditional on the value of acontinuously distributed outcome.


https://haziqj.ml



4. The covariance between X and Y , denoted Cov(X ,Y ), is

Cov(X ,Y ) = E [(X − EX )(Y − EY )]

= E(XY )− EX · EY

5. The correlation between X and Y , denoted ρXY , is

ρXY =Cov(X ,Y )√VarX · VarY

Stop to think• What is the covariance between a r.v. X and itself?• What possible values can ρ take?


https://haziqj.ml



6. Two r.v. X and Y are independent if and only if the joint cdfsatisfies

FX ,Y (x , y) = FX (x) · FY (y).

As for their pmf/pdf,

pX ,Y (x , y) = P(X = x ,Y = y) = P(X = x) · P(Y = y)

in the discrete case, and

fX ,Y (x , y) = fX (x) · fY (y)

in the continuous case.

RemarkX ⊥ Y ⇒ Cov(X ,Y ) = 0, but not necessarily the other way around.


https://haziqj.ml


Discrete bivariate distributions

If X takes discrete values x1, . . . , xm and Y takes discrete values y1, . . . , yn,their joint pmf may be presented in table:

y1 y2 · · · yn

x1 p11 p12 · · · p1n p1·x2 p21 p22 · · · p2n p2·...

......

. . ....

...xm pm1 pm2 · · · pmn pm·

p·1 p·2 · · · p·n

where pij = P(X = xi ,Y = yj), and pi· and p·j are the marginalprobabilities of X and Y respectively.


https://haziqj.ml


Discrete bivariate distributions (cont.)

So from the previous slides we know that

p·j =n∑

i=1

pij

pi· =m∑j=1

pij

and of course,∑n

i=1∑m

j=1 pij = 1.


https://haziqj.ml


Examples

Example 8Flip a fair coin two times. Let X = 1 if it is heads in the first flip, andX = 0 if it is tails. Let Y = 1 if the outcomes in the two flips are thesame, and Y = 0 if the two outcomes are different. The joint probabilityfunction is

Y = 1 Y = 0

X = 1 1/4 1/4X = 0 1/4 1/4

It is easy to see that X and Y are independent (although we had notassumed this).


https://haziqj.ml


Examples (cont.)

Example 9Consider a uniform distribution on the unit square [0, 1]× [0, 1]. It has pdfgiven by

f (x , y) =

{1 0 ≤ x ≤ 1, 0 ≤ y ≤ 10 otherwise

This is a well-defined pdf, as f ≥ 0 and∫ ∫

f (x , y) dx dy = 1. It is alsoeasy to see that X and Y are independent.

Find P(X < 1/2,Y < 1/2) and P(X + Y < 1).


https://haziqj.ml


Examples (cont.)

Example 9

P(X < 1/2,Y < 1/2) =

∫ 1/2

0

∫ 1/2

0dx dy

=[[xy ]

1/20

]1/20

= 1/4.

Note that the set {x + y < 1} corresponds to {0 < y < 1, 0 < x < 1− y}.

P(X + Y < 1) =

∫ 1

0dy

∫ 1−y

0dx

=

∫ 1

0dy [x ]1−y0

=

∫ 1

0(1− y) dy =

[y − y2/2

]10 = 1/2.


https://haziqj.ml


Conditional distributions

If X and Y are not independent, knowing X should be helpful indetermining Y , as X may carry some information on Y . Therefore itmakes sense to define the distribution of Y given, say X = x . This is theconcept of conditional distributions.

If X and Y are discrete, then you have probably seen that

P(Y = y |X = x) =P(Y = y ,X = x)

P(X = x)=

P(X = x |Y = y) P(Y = y)

P(X = x)

However, this definition does not extend to continuous random variables,because the probability of a single point has mass zero.


https://haziqj.ml


Conditional distributions (cont.)• If X and Y are independent, then fY |X (y |x) = fY (y)• Note that

fY |X (y |x)fX (x) = fX ,Y (x , y) = fX |Y (x |y)fY (y)

which offers alternative ways to determine the joint pdf.• For any r.v. X and Y ,

EX

(E(Y |X )

)= E(Y )

Proof.

E(

E(Y |X ))

=

∫ {∫y fY |X (y |x) dy

}fX (x) dx

=

∫ ∫y fX ,Y (x , y) dx dy

=

∫y f (y) dy = E(Y )


https://haziqj.ml



Example 11Let fX ,Y (x , y) = e−y for 0 < x < y <∞, and 0 otherwise. Find fY |X (y |x)and fX |Y (x |y).

We need to find the marginals first:

fX (x) =

∫ ∞x

e−y dy = e−x , 0 < x <∞

fY (y) =

∫ y

0e−y dx = ye−y , 0 < y <∞


https://haziqj.ml

Multivariate distributions » Multivariate distributions

Multivariate distributions

The bivariate results we saw earlier can be extended to more than twovariables, resulting in multivariate distributions.

Let X = (X1, . . . ,Xn)> be a random vector consisting of n r.v.. The jointcdf is defined as

F (x1, . . . , xn) = P(X1 ≤ x1, . . . ,Xn ≤ xn)

If X is continuous, the joint pdf satisfies

fX ,Y (x , y) =∂n

∂x1 · · · ∂xnF (x1, . . . , xn).


https://haziqj.ml


Multivariate distributions (cont.)

• In general, the pdf admits the factorisation

f (x1, . . . , xn) = f (x1)f (x2, . . . , xn|x1)

= f (x1)f (x2|x1)f (x3, . . . , xn|x1, x2)

...= f (x1)f (x2|x1)f (x3|x1, x2) · · · f (xn|x1, . . . , xn−1)

where f (xj |x1, . . . , xj−1) denotes the conditional pdf of Xj givenX1 = x1, . . . ,Xj−1 = xj−1.

• On the other hand, X1, . . . ,Xn are independent if and only if

f (x1, . . . , xn) = f (x1)f (x2) · · · f (xn)


https://haziqj.ml


Random samples

• We will often say “X1, . . . ,Xn is a random sample from a distributionwith pdf f ”. Without specifying independence, we should not assumethat it is an iid sample.• Thus, when doing inference, we should work with the joint pdf

f (x1, . . . , xn). Several multivariate distributions will be discussed inthe next section.• On the other hand, if X1, . . . ,Xn

iid∼ f (x), then and only then

f (x) = f (x1, . . . , xn) =n∏

i=1

f (xi )


https://haziqj.ml

1 χ2-distribution


3 F -distribution





https://haziqj.ml

Multinomial and categorical distribution » Multinomial distribution

Multinomial distribution

The multinomial distribution is an extension of the binomial distribution.Suppose we threw an k-sided die n times, and we recorded Xi , the numberof times the ith side turned up (i = 1, . . . , k). Then X = (X1, . . . ,Xk)>

follows a multinomial distribution.

Definition 12 (Multinomial distribution)

Let X = (X1, . . . ,Xk)> ∼ Mult(n, p1, . . . , pk). Then, the pmf of X is givenby

f (x1, . . . , xk) =n!

x1!x2! · · · xk !px11 px22 · · · p

xkk

Here, pj is the probability of success associated with event Xj .


https://haziqj.ml


Multinomial distribution (cont.)

• Obviously, each pj ≥ 0, and that∑k

j=1 pj = 1.

• We also observe that if out of the n trials, Xj is the number of“success” associated with the jth outcome, then

I X1 + · · ·+ Xk = n, and therefore, the Xjs are not independent.I Xj ∼ Bin(n, pj), and hence E(Xj) = npj and Var(Xj) = npj(1− pj).

• We can show that Cov(Xj ,Xj ′) = −npjpj ′ (see Exercise 3).

• Therefore,

E(X) = np ∈ Rk

Var(X) = n[diag(p)− pp>

]∈ Rk×k

where p = (p1, . . . , pk)>.


https://haziqj.ml



Expanding the expectation vector and variance-covariance matrix in full:

E(X) =

np1...

npk

Var(X) =

np1(1− p1) −np1p2 · · · −np1pk−np1p2 np2(1− p2) · · · −np2pk

......

. . ....

−np1pk −np2pk · · · npk(1− pk)


https://haziqj.ml



• Marginal distributions after collapsing categories are also multinomial.E.g. (X1, . . . ,X5)> ∼ Mult(n, p1, . . . , p5), then(X1 + X2,X3 + X4,X5)> ∼ Mult(n, p1 + p2, p3 + p4, p5).

• Conditional distributions are also multinomial.E.g.(X1, . . . ,X5)> ∼ Mult(n, p1, . . . , p5), then

(X1,X2)>|(X3 = x3,X4 = x4,X5 = x5)

∼ Mult

(n − x3 − x4 − x5,

p1

p1 + p2,

p2

p1 + p2

)


https://haziqj.ml



Example 13Suppose that in a three-way election for a large country, candidate Areceived 20% of the votes, candidate B received 30% of the votes, andcandidate C received 50% of the votes. If six voters are selected randomly,what is the probability that there will be exactly one supporter forcandidate A, two supporters for candidate B and three supporters forcandidate C in the sample?

Let (XA,XB ,XC ) represent the number of voters for candidates A, B and Crespectively. Then (XA,XB ,XC ) Mult(6, 0.2, 0.3, 0.5). SoP(XA = 1,XB = 2,XC = 3) = 6!

1!2!3!0.210.320.53 = 0.135.


https://haziqj.ml


Parameter estimation

QuestionSuppose that you observe a random sample X = (Xi1, . . . ,Xik) from aMult(n, p1, . . . , pk) distribution, where each Xij tells you the “number ofsuccesses” for category j . What is an appropriate estimator for each of thepj? Work this out as an exercise. Hint: Each component of X isBern(pj).


https://haziqj.ml

Multinomial and categorical distribution » Categorical distribution

Categorical distribution

When we have only one trial (n = 1), we have a special case of themultinomial distribution. There is exactly one entry in X = (X1, . . . ,Xk)that is equal to one, while the rest is zero.

Definition 14 (Categorical distribution)

Let X = (X1, . . . ,Xk)> ∼ Cat(p1, . . . , pk). Then, the pmf of X is given by

f (x1, . . . , xk) = px11 px22 · · · pxkk

Here, pj is the probability of success associated with event Xj .


https://haziqj.ml

Multinomial and categorical distribution » Categorical distribution

Categorical distribution (cont.)

• Since each Xj ∼ Bin(1, pj) ≡ Bern(pj), the categorical distribution iseffectively a generalisation of the Bernoulli distribution.

• Interestingly, we can also define

Y = arg maxj

Xj

so that Y is a random variable taking on one distinct valuej ∈ {1, . . . , k} (category) with probability pj . This formulation has alot of importance in choice modelling in statistics and econometrics.


https://haziqj.ml

1 χ2-distribution


3 F -distribution





https://haziqj.ml

Multivariate normal distribution »

Multivariate normal distribution

This is the multivariate extension of the regular normal distribution. It isundoubtedly the most commonly used distribution in statistics, and it has alot of interesting properties.

Definition 15 (Multivariate normal distribution)

Let X = (X1, . . . ,Xk)> be distributed according to a multivariate normaldistribution with mean vector µ ∈ Rk and variance-covariance matrixΣ ∈ Rk×k . It has pdf

f (x) = (2π)−k/2|Σ|−1/2 exp

{−12

(x− µ)>Σ−1(x− µ)

}and we write X ∼ Nk(µ,Σ).


https://haziqj.ml


Multivariate normal distribution (cont.)

Here are some properties of the multivariate normal distribution:1. µ is a vector of length k , and Σ is a square k × k , symmetric,

positive-definite matrix.2. The first and second moments of X are

E(X) = µ

E(XX>) = Σ + µµ>

so therefore,

Var(X) = E((X− µ)(X− µ)>

)= E(XX>)− µµ> = Σ

3. Let Σ = (σij). Then

σij =

{Var(Xi ) if i = j

Cov(Xi ,Xj) if i 6= j


https://haziqj.ml



4. When σij = 0 for all i 6= j , i.e. the components of X are uncorrelated,we have Σ = diag(σ11, . . . , σkk). Also, |Σ| =

∏ki=1 σii . Hence, the

pdf admits a simpler form

f (x) =k∏

i=1

{1√2πσii

e− 1

2σii(xi−µi )2

}and therefore, X1, . . . ,Xk are independent, and each Xi ∼ N(µi , σii ).

RemarkIn general, two r.v. X and Y which are independent imply thatCov(X ,Y ) = 0, but not necessarily the other way around. But if X and Yare two normal r.v., then X and Y are independent if and only ifCov(X ,Y ) = 0.


https://haziqj.ml



5. Suppose that we can partition X into X = (Xa,Xb)>, and similarly,

µ =

(µa

µb

)Σ =

(Σa Σab

Σ>ab Σb

).

Then,I Marginal distributions. Xa ∼ Ndim Xa(µa,Σa) and

Xb ∼ Ndim Xb(µb,Σb).

I Conditional distributions. Xa|Xb ∼ Ndim Xa(µa, Σa) andXb ∼ Ndim Xb

(µb, Σb), where

µa = µa + ΣabΣ−1b (Xb − µb) µb = µb + Σ>abΣ

−1a (Xa − µa)

Σa = Σa −ΣabΣ−1b Σ>ab Σb = Σb −Σ>abΣ

−1a Σab


https://haziqj.ml



6. Let A be a Rq×k constant matrix, and b ∈ Rk a constant vector.Then Y := AX + b ∈ Rq has distribution

Y ∼ Nq(Aµ,AΣA>)

This is simply the linearity property of normal distributions for themultivariate case.


https://haziqj.ml


Standard multivariate normal

A special case of the multivariate normal is when µ = (0, . . . , 0)>, andΣ = Ik . We would then have the standard multivariate normal distributionZ ∼ Nk(0, Ik).

The pdf of the standard normal is

φ(z) = (2π)−k/2 exp

{−12z>z

}


https://haziqj.ml


Standard multivariate normal (cont.)

Suppose that L is a (non-singular) matrix such that LL> = Σ. Note that,by the linearity property of the multivariate normal, X = µ + LZ is alsonormally distributed, with mean

E(X) = µ + LE(Z) = µ

and variance

Var(X) = 0 + LVar(Z)L> = LIkL> = Σ.

Therefore X ∼ Nk(µ,Σ).

This property is often used for simulating from multivariate normals: 1)Generate samples from k iid N(0, 1) distributions; then 2) apply thelinearity transformation above.


https://haziqj.ml



Of course, the other way around works too. If X ∼ Nk(µ,LL>), then

Z = L−1(X− µ)

has meanE(Z) = L−1(EX− µ) = 0

and variance

Var(Z) = L−1(VarX)(L−1)> = L−1LL>(L−1)> = Ik .

So it is possible to “standardise” a multivariate normal distribution. This isuseful when we want to calculate multivariate normal probabilities(although admittedly, this is a very computer-intensive problem involvingnumerical approximations).


https://haziqj.ml



Strategies for decomposing the variance-covariance matrix Σ:• Eigendecomposition. For a positive-definite matrix, we have that

Σ = ΓΛΓ>. The matrix Γ is a matrix of eigenvectors of Σ, andΛ = diag(λ1, . . . , λk) is a diagonal matrix of eigenvalues. Additionallyfor positive matrices, ΓΓ> = Ik (orthogonal).• Cholesky decomposition. The Cholesky decomposition of a

positive-definite matrix is a decomposition of the form Σ = LL>,where L is a lower-triangular matrix with real and positive diagonalentries.• LDL decomposition. Closely related to the Cholesky decomposition,

this is a decomposition of the form Σ = LDL>, where D is a diagonalmatrix, and L is a lower-unitriangular matrix (all diagonal elements are1).


https://haziqj.ml


Example

Example 16Suppose that X1 ∼ N(µ1, σ

21), and X2 ∼ N(µ2, σ

22). The covariance

between X1 and X2 is Cov(X1,X2) = σ12.

Then X = (X1,X2)> is a bivariate normal distribution with meanµ = (µ1, µ2)>, and variance

Σ =

(σ2

1 σ12σ12 σ2

2

).


https://haziqj.ml


Example (cont.)

Example 16The pdf is

f

(x1x2

)=(2π)−1 det

(σ2

1 σ12σ12 σ2

2

)−1

× exp

{−12(x1 − µ1 x2 − µ2

)( σ21 σ12

σ12 σ22

)−1(x1 − µ1x2 − µ2

)}


https://haziqj.ml


Example (cont.)

Example 16


https://haziqj.ml



Very briefly, suppose that you had n samples from a k-variate normaldistribution. That is, you observe X1, . . . ,Xn, where each Xi ∈ Rk is ak-dimensional vector.

The unconstrained maximum likelihood estimators for µ and Σ are

µ =1n

n∑i=1

xi = (X1, . . . , Xk)> ∈ Rk

and

Σ =1n

n∑i=1

(xi − µ)(xi − µ)> ∈ Rk×k .

It turns out also that µ D−→ N(µ,Σ/n), as n→∞ (multivariate CLT).


https://haziqj.ml



QuestionHow many total unknown parameters are there in a k-variate normaldistribution? Work this out as an exercise.


https://haziqj.ml

References I

Casella, G. and R. L. Berger (2002). Statistical Inference. 2nd ed. PacificGrove, CA: Duxbury. ISBN: 978-0-534-24312-8.

Pawitan, Y. (2001). In All Likelihood. Statistical Modelling and InferenceUsing Likelihood. Oxford University Press. ISBN: 978-0-19-850765-9.

Jamil, H. (Oct. 2018). “Regression modelling using priors depending onFisher information covariance kernels (I-priors)”. PhD thesis. LondonSchool of Economics and Political Science.

Wasserman, L. (2013). All of statistics: a concise course in statisticalinference. Springer Science & Business Media.


https://haziqj.ml

SM-4331 Advanced Statistics Chapter 3 (Important ... · SM-4331 Advanced Statistics Chapter 3...

Documents

Transcript of SM-4331 Advanced Statistics Chapter 3 (Important ... · SM-4331 Advanced Statistics Chapter 3...