Mathematical Statistics course # 7


Transcript of Mathematical Statistics course # 7

Page 1: Mathematical Statistics course # 7

Mathematical Statistics course # 7

Damien Garreau

Université Côte d’Azur

March 19, 2020

1

Page 2: Mathematical Statistics course # 7

Recap (I)

- Gaussian linear model:

  Y = Xβ + ε ,

  with Y ∈ R^n, X ∈ R^{n×(d+1)}, β ∈ R^{d+1}, and ε ∼ N(0, σ² I_n)
- least squares and maximum likelihood both yield the same estimator:

  β̂_n = (XᵀX)† Xᵀ Y .

- we also have an estimator for the variance:

  σ̂²_n = (1 / (n − d − 1)) ‖Y − X β̂_n‖² .

- we know the exact law of these estimators:

  β̂_n ∼ N(β, σ² (XᵀX)⁻¹)  and  σ̂²_n ∼ (σ² / (n − d − 1)) χ²(n − d − 1) .

2
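As a quick illustration of the recap (my own sketch, not part of the course), β̂_n and σ̂²_n can be computed with numpy on simulated data; all names and parameter values here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 100, 3, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])  # design matrix with intercept
beta = np.array([1.0, 2.0, -1.0, 0.5])
Y = X @ beta + sigma * rng.normal(size=n)                   # Gaussian linear model

beta_hat = np.linalg.pinv(X.T @ X) @ X.T @ Y                # (X'X)^dagger X'Y
residuals = Y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - d - 1)            # unbiased variance estimator
print(beta_hat, sigma2_hat)
```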

Page 3: Mathematical Statistics course # 7

Recap (II)

- Gauss-Markov theorem: β̂_n is the best linear unbiased estimator of β under mild assumptions
- Coefficient of determination:

  R² = r²_{y,ŷ} = (∑_{i=1}^n (y_i − ȳ)(ŷ_i − ȳ))² / (∑_{i=1}^n (y_i − ȳ)² ∑_{i=1}^n (ŷ_i − ȳ)²) .

- Intuition: higher is better (1 ⇒ perfect linear fit, 0 ⇒ uncorrelated)
- Residuals: e_i = y_i − ŷ_i .
- one should check a posteriori that the e_i have a nice behavior:
  - normality
  - homoscedasticity
  - independence

3

Page 4: Mathematical Statistics course # 7

Outline

1. Estimating the parameters of a discrete distribution

2. Chi-squared test for fit of a distribution

3. Extensions

4

Page 5: Mathematical Statistics course # 7

1. Estimating the parameters of a discrete distribution

5

Page 6: Mathematical Statistics course # 7

Multinomial distribution

Definition: let n and k be positive integers and p ∈ [0, 1]^k such that

  p_1 + · · · + p_k = 1 .

Let A_1, . . . , A_k be a set of distinct outcomes. For 1 ≤ i ≤ n, we define the independent random variables E_i, each taking the value A_j with probability p_j. Then we say that the vector N = (N_1, . . . , N_k), where

  N_j = ∑_{i=1}^n 1_{E_i = A_j} ,

has the multinomial distribution (“loi multinomiale”) with parameters n, k, p.

- Intuition: N_j counts the number of times the experiments yielded A_j
- Important: while the experiments are independent, the N_j are not (in particular, ∑_j N_j = n)
6
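A small illustrative sketch (mine, not from the slides): drawing N ∼ M(n, p) with numpy and checking that the counts indeed sum to n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, np.array([0.2, 0.3, 0.5])
N = rng.multinomial(n, p)      # one draw of N = (N_1, ..., N_k)
print(N, N.sum())              # the counts always sum to n
```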

Page 7: Mathematical Statistics course # 7

Example: the Bernoulli distribution

- let p ∈ (0, 1)
- set n = 1, k = 2, and

  p_1 = p and p_2 = 1 − p .

- one experiment E_1 with two outcomes {A_1, A_2}
- then N_1 has the Bernoulli distribution B(p):

  P(N_1 = 0) = P(E_1 = A_2) = p_2 = 1 − p ,

  and P(N_1 = 1) = 1 − P(N_1 = 0) = p.
- same reasoning for N_2 ∼ B(1 − p)

7

Page 8: Mathematical Statistics course # 7

Example: the binomial distribution

- let p ∈ (0, 1)
- set n ≥ 1, k = 2,

  p_1 = p and p_2 = 1 − p .

- n independent experiments E_j with two outcomes {A_1, A_2}
- then N_1 has the binomial distribution B(n, p): for any m ∈ {0, . . . , n},

  P(N_1 = m) = P(∃ S ⊆ {1, . . . , n} s.t. |S| = m and ∀j ∈ S, E_j = A_1, ∀j ∉ S, E_j = A_2)
             = (n choose m) p_1^m p_2^{n−m}
             = (n choose m) p^m (1 − p)^{n−m} .

8
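A hedged numerical check (illustrative parameters, not from the course): with k = 2, the first count N_1 of a multinomial draw follows B(n, p), which we can compare against scipy's binom.pmf by Monte Carlo.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 10, 0.3
draws = rng.multinomial(n, [p, 1 - p], size=100_000)[:, 0]   # many draws of N_1
for m in range(n + 1):
    print(m, round((draws == m).mean(), 4), round(stats.binom.pmf(m, n, p), 4))
```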

Page 9: Mathematical Statistics course # 7

Binomial distribution in pictures (I)

[Figure: probability mass function of the binomial distribution, for n = 10, p = 0.5; n = 20, p = 0.5; n = 20, p = 0.8]

9

Page 10: Mathematical Statistics course # 7

Binomial distribution in pictures (II)

[Figure: cumulative distribution function of the binomial distribution, for n = 10, p = 0.5; n = 20, p = 0.5; n = 20, p = 0.8]

10
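Figures of this kind can be reproduced along the following lines (a sketch assuming scipy and matplotlib are available; the parameters are taken from the legends above).

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

params = [(10, 0.5), (20, 0.5), (20, 0.8)]
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for n, p in params:
    m = np.arange(0, 21)
    ax1.plot(m, stats.binom.pmf(m, n, p), "o-", label=f"n = {n}, p = {p}")
    ax2.step(m, stats.binom.cdf(m, n, p), where="post", label=f"n = {n}, p = {p}")
ax1.set_title("Probability mass function")
ax2.set_title("Cumulative distribution function")
ax1.legend()
ax2.legend()
plt.show()
```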

Page 11: Mathematical Statistics course # 7

Properties of the multinomial distribution (I)

- set n, k ≥ 1 integers and p = (p_1, . . . , p_k)ᵀ such that p_j ≥ 0 and ∑_j p_j = 1
- let us consider N ∼ M(n, p)

Property: N/n takes values almost surely in the simplex:

  ∆_k = { t = (t_1, . . . , t_k) ∈ R_+^k s.t. ∑_j t_j = 1 } .

Proof: ∑_j N_j = ∑_j ∑_i 1_{E_i = A_j} = n almost surely. □

11

Page 12: Mathematical Statistics course # 7

The simplex

[Figure: the simplex in 3D]

12

Page 13: Mathematical Statistics course # 7

Properties of the multinomial distribution (II)

Property: the probability mass function of N ∼ M(n, p) is given by

  P(N_1 = n_1, . . . , N_k = n_k) = (n! / (n_1! · · · n_k!)) p_1^{n_1} · · · p_k^{n_k} .

Proof: same idea as for the binomial. □

Property: it holds that E[N_j] = n p_j, Var(N_j) = n p_j (1 − p_j), and Cov(N_j, N_{j′}) = −n p_j p_{j′}.

Proof: direct computation. □

13
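A Monte Carlo sanity check of these moments (my own sketch, with illustrative n and p):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, np.array([0.1, 0.3, 0.6])
draws = rng.multinomial(n, p, size=200_000)
print(draws.mean(axis=0), n * p)                                  # E[N_j]   vs  n p_j
print(draws.var(axis=0), n * p * (1 - p))                         # Var(N_j) vs  n p_j (1 - p_j)
print(np.cov(draws[:, 0], draws[:, 1])[0, 1], -n * p[0] * p[1])   # Cov(N_1, N_2) vs -n p_1 p_2
```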

Page 14: Mathematical Statistics course # 7

Estimating the parameters of a discrete distribution (I)

- consider a discrete set X = {x_1, . . . , x_k}
- we are given an i.i.d. sample X = (X_1, . . . , X_n) with common law given by p = (p_1, . . . , p_k):

  ∀i ∈ {1, . . . , n}, ∀j ∈ {1, . . . , k}, P(X_i = x_j) = p_j .

- Example: fair die (x_j = j, p_j = 1/6)
- How can we estimate p?
- consider the random variables 1_{X_i = x_j}. They are i.i.d. Bernoulli with mean p_j
- we can use the method of moments!
- since E[1_{X_i = x_j}] = p_j, we deduce

  p̂_j = (1/n) ∑_{i=1}^n 1_{X_i = x_j} .

14

Page 15: Mathematical Statistics course # 7

Estimating the parameters of a discrete distribution (II)

- for any 1 ≤ j ≤ k, define

  N_j = ∑_{i=1}^n 1_{X_i = x_j} .

- we can rewrite p̂_j = N_j / n
- N = (N_1, . . . , N_k)ᵀ has the multinomial distribution
- in particular,

  E[N_j / n] = p_j ,

  and the p̂_j are unbiased, consistent estimators of the p_j
- we can also construct joint confidence intervals since we have the joint law!

15
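A minimal sketch of this estimator on the fair-die example of the previous slides (illustrative, not the course's code), with rough marginal 95% confidence intervals from the normal approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
support = np.arange(1, 7)                                  # faces of a fair die
X = rng.choice(support, size=500, p=np.full(6, 1 / 6))     # i.i.d. sample
n = len(X)

N = np.array([(X == x).sum() for x in support])            # counts N_j
p_hat = N / n                                              # empirical frequencies
half_width = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)       # marginal normal-approx. CIs
for x, ph, hw in zip(support, p_hat, half_width):
    print(f"x = {x}: p_hat = {ph:.3f} +/- {hw:.3f}")
```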

Page 16: Mathematical Statistics course # 7

2. Chi-squared test for fit of a distribution

16

Page 17: Mathematical Statistics course # 7

Chi-squared test for fit of a distribution

- Statistical model: X = (X_1, . . . , X_n) i.i.d. with support X = (x_1, . . . , x_k) and p = (p_1, . . . , p_k) such that

  ∀1 ≤ i ≤ n, ∀1 ≤ j ≤ k, P(X_i = x_j) = p_j

- General idea: we are given a reference law

  p⁰ = (p⁰_1, . . . , p⁰_k) ,

  with full support on X (that is, p⁰_j > 0 for all j)
- we want to test whether p = p⁰ or not (“test d’ajustement à une loi donnée”)
- formally:

  H_0 : ∀j ∈ {1, . . . , k}, p_j = p⁰_j
  vs
  H_1 : ∃j ∈ {1, . . . , k} such that p_j ≠ p⁰_j .

17

Page 18: Mathematical Statistics course # 7

Example: testing the laws of heredity

- Example: genetics (Gregor Johann Mendel, 1822–1884)
- suppose that the color of tulips is red (R), blue (B), or purple
- we have a theory: the color is determined by a single gene that has two alleles: R and B
- thus purple parents (RB) theoretically yield 3 possibilities (RR, RB, and BB) with probabilities 0.25, 0.5, and 0.25
- Can we test our hypothesis?

18

Page 19: Mathematical Statistics course # 7

Chi-squared distance

- Idea: under H_0, p̂_n is close to p⁰
- thus we can take as test statistic a distance between p̂_n and p⁰

Definition: we define the chi-squared distance (“distance du chi deux”) between p̂_n and p⁰ by

  D²_n(p̂_n, p⁰) = n ∑_{j=1}^k (p̂_j − p⁰_j)² / p⁰_j = ∑_{j=1}^k (N_j − n p⁰_j)² / (n p⁰_j) .

- Intuition: weighted, re-normalized Euclidean distance between p̂_n and p⁰
- often rewritten as a function of the integer counts (see the sketch below)

19
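The distance is straightforward to compute from the counts; here is a small sketch of my own (the counts passed at the end are purely illustrative):

```python
import numpy as np

def chi2_distance(counts, p0):
    """D_n^2 = sum_j (N_j - n p0_j)^2 / (n p0_j), written with the integer counts."""
    counts = np.asarray(counts, dtype=float)
    p0 = np.asarray(p0, dtype=float)
    n = counts.sum()
    expected = n * p0
    return ((counts - expected) ** 2 / expected).sum()

print(chi2_distance([12, 18, 20], [1 / 3, 1 / 3, 1 / 3]))   # illustrative counts
```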

Page 20: Mathematical Statistics course # 7

Chi-squared distance under the null (I)

Theorem: the chi-squared distance has the following asymptotic behaviors:
- under H_0 : p = p⁰, we have

  D²_n(p̂_n, p⁰) −→ χ²_{k−1} in law ;

- on the other hand, under H_1 : p ≠ p⁰, we have

  D²_n(p̂_n, p⁰) −→ +∞

  almost surely.

- Intuition: we know the law of the test statistic under H_0 and H_1 when n → +∞
- in particular, it does not depend on p⁰!
20

Page 21: Mathematical Statistics course # 7

Chi-squared distance under the null (II)

Proof: we start with the behavior under H_0.
- for any i ∈ {1, . . . , n}, set

  Z_i = ( (1/√p⁰_1) (1_{X_i = x_1} − p⁰_1), . . . , (1/√p⁰_k) (1_{X_i = x_k} − p⁰_k) )ᵀ .

- the Z_i are i.i.d., centered, and have a covariance matrix Γ such that

  Γ_{j,j} = Var( (1/√p⁰_j) (1_{X_i = x_j} − p⁰_j) ) = p⁰_j (1 − p⁰_j) / p⁰_j = 1 − p⁰_j ,

  Γ_{j,j′} = Cov( (1/√p⁰_j) (1_{X_i = x_j} − p⁰_j), (1/√p⁰_{j′}) (1_{X_i = x_{j′}} − p⁰_{j′}) ) = −√(p⁰_j p⁰_{j′}) .

21

Page 22: Mathematical Statistics course # 7

Chi-squared distance under the null (III)

- denote by √p⁰ the vector (√p⁰_1, . . . , √p⁰_k)ᵀ
- the central limit theorem yields

  √n ( (1/√p⁰_1) (p̂_1 − p⁰_1), . . . , (1/√p⁰_k) (p̂_k − p⁰_k) )ᵀ = (1/√n) ∑_{i=1}^n Z_i −→ V in law ,

  where V ∼ N(0, Γ)
- by the continuous mapping theorem,

  D²_n(p̂_n, p⁰) −→ ‖V‖² in law .

- we notice that Γ = I_k − √p⁰ (√p⁰)ᵀ is the orthogonal projection onto the subspace orthogonal to the span of √p⁰
- by Cochran’s theorem, ‖V‖² ∼ χ²_{k−1}

22

Page 23: Mathematical Statistics course # 7

Chi-squared distance under the null (IV)

We now turn to the behavior under H_1.
- under H_1, there exists an index j_0 such that

  p_{j_0} ≠ p⁰_{j_0} .

- we write

  D²_n(p̂_n, p⁰) = n ∑_{j=1}^k (p̂_j − p⁰_j)² / p⁰_j ≥ n (p̂_{j_0} − p⁰_{j_0})² / p⁰_{j_0} ,

  and since p̂_{j_0} −→ p_{j_0} almost surely (law of large numbers), the lower bound is asymptotically equivalent to n (p_{j_0} − p⁰_{j_0})² / p⁰_{j_0}, hence

  D²_n(p̂_n, p⁰) −→ +∞ a.s. □

23

Page 24: Mathematical Statistics course # 7

Reminder: the χ² distribution

- let X ∼ χ²_d, then E[X] = d and Var(X) = 2d

[Figure: probability density function of the χ² distribution, for d = 5, d = 10, d = 20]

24

Page 25: Mathematical Statistics course # 7

Quantiles of the χ² distribution (I)

- recall the quantile function: c_{d,p} is a number such that

  P(χ²_d ≤ c_{d,p}) ≥ p .

- very often given as a table:

  d \ p    0.5    0.6    0.7    0.8    0.9    0.95   0.99
  1        0.45   0.71   1.07   1.64   2.71   3.84   6.63
  2        1.39   1.83   2.41   3.22   4.61   5.99   9.21
  3        2.37   2.95   3.66   4.64   6.25   7.81   11.34
  4        3.36   4.04   4.88   5.99   7.78   9.49   13.28
  5        4.35   5.13   6.06   7.29   9.24   11.07  15.09
  6        5.35   6.21   7.23   8.56   10.64  12.59  16.81
  7        6.35   7.28   8.38   9.8    12.02  14.07  18.48
  8        7.34   8.35   9.52   11.03  13.36  15.51  20.09
  9        8.34   9.41   10.66  12.24  14.68  16.92  21.67
  10       9.34   10.47  11.78  13.44  15.99  18.31  23.21

25
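The table can be recomputed with scipy's quantile (percent-point) function; an illustrative check, assuming scipy is installed:

```python
from scipy import stats

ps = (0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99)
for d in range(1, 11):
    print(d, [round(stats.chi2.ppf(p, d), 2) for p in ps])

print(stats.chi2.ppf(0.95, 9))   # c_{9, 0.95} ≈ 16.92, used on the next slide
```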

Page 26: Mathematical Statistics course # 7

Quantiles of the χ² distribution (II)

- How to read this table?
- suppose that we are looking for c_{9,0.95} (a number such that 95% of the observations from a χ²_9 random variable are smaller): in the table above, go to the row d = 9 and the column p = 0.95
- thus c_{9,0.95} = 16.92
26

Page 27: Mathematical Statistics course # 7

Quantiles of the χ² distribution (III)

[Figure: probability density function of the χ²_9 distribution, with its 95% quantile marked by a red vertical line]

- 95% of the observations fall to the left of the red line
27

Page 28: Mathematical Statistics course # 7

Chi-squared test for fit of a distribution

- finally, we propose the following test (“test d’ajustement du chi deux”):

  φ(X_1, . . . , X_n) = 1_{D²_n(p̂_n, p⁰) > c_{k−1, 1−α}} .

- asymptotically, the test has level α
- the test is also consistent (power converges to 1)
- Remark: the guarantees only hold for large n!
- usually, it is recommended to ensure that n ≥ 30 and n p⁰_j ≥ 5 for all j ∈ {1, . . . , k}
- if this is not the case, regroup classes

28
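Putting the pieces together, here is a sketch of the resulting decision rule (my own helper, not the course's code); the function name and signature are illustrative:

```python
import numpy as np
from scipy import stats

def chi2_gof_test(counts, p0, alpha=0.05):
    """Return (statistic, threshold, reject): reject H0 when D_n^2 > c_{k-1, 1-alpha}."""
    counts = np.asarray(counts, dtype=float)
    p0 = np.asarray(p0, dtype=float)
    n = counts.sum()
    stat = ((counts - n * p0) ** 2 / (n * p0)).sum()
    threshold = stats.chi2.ppf(1 - alpha, df=len(p0) - 1)
    return stat, threshold, stat > threshold
```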

Page 29: Mathematical Statistics course # 7

Chi-squared test: example

- recall the genetics example: p⁰_RR = 0.25, p⁰_RB = 0.50, and p⁰_BB = 0.25
- we want to test

  H_0 : p = p⁰ vs H_1 : p ≠ p⁰ .

- we repeat the crossing experiment independently and under the same conditions n = 100 times
- we obtain the following data:

  color    red   purple   blue
  number   27    55       18

- sample size is large enough
- the χ² statistic is given by

  D²_n(p̂_n, p⁰) = (27 − 25)²/25 + (55 − 50)²/50 + (18 − 25)²/25 = 2.62

- the 95% quantile of the χ²_2 distribution is 5.99, thus we do not reject the null
29
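The same computation can be done with scipy.stats.chisquare, which takes observed and expected counts and returns the statistic and the asymptotic p-value (an illustrative check of the slide's numbers):

```python
from scipy import stats

observed = [27, 55, 18]
expected = [25, 50, 25]                # n = 100 times p0 = (0.25, 0.50, 0.25)
stat, pvalue = stats.chisquare(observed, f_exp=expected)
print(stat, pvalue)                    # 2.62, p-value ≈ 0.27 > 0.05: do not reject H0
```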

Page 30: Mathematical Statistics course # 7

3. Extensions

30

Page 31: Mathematical Statistics course # 7

Chi-squared test for continuous distributions (I)

- possible to adapt the χ² statistic to continuous distributions
- Idea: split the space into k classes A_1, . . . , A_k
- from an i.i.d. sample X_1, . . . , X_n, we construct

  ∀i ∈ {1, . . . , n}, Y_i = j if X_i ∈ A_j .

- then define p̂ from the Y_i
- finally, take

  p⁰_j = P(X ∈ A_j) ,

  and use the previous test
- Remark: a rule of thumb is to choose the A_j such that the p⁰_j are approximately equal

31

Page 32: Mathematical Statistics course # 7

Chi-squared test for continuous distributions (II)

- Example: suppose that we want to test

  H_0 : X ∼ E(1) vs H_1 : X ≁ E(1) .

- split R_+ into five intervals:

  R_+ = [0, t_1) ∪ [t_1, t_2) ∪ [t_2, t_3) ∪ [t_3, t_4) ∪ [t_4, +∞)

- we write

  P(E(1) ∈ [t_j, t_{j+1})) = 1 − e^{−t_{j+1}} − (1 − e^{−t_j}) = e^{−t_j} − e^{−t_{j+1}} .

- then solve:

  e^{−0} − e^{−t_1} = 1/5
  e^{−t_1} − e^{−t_2} = 1/5
  e^{−t_2} − e^{−t_3} = 1/5
  e^{−t_3} − e^{−t_4} = 1/5
  e^{−t_4} − 0 = 1/5

  which gives e^{−t_1} = 4/5, e^{−t_2} = 3/5, e^{−t_3} = 2/5, e^{−t_4} = 1/5.

32

Page 33: Mathematical Statistics course # 7

Chi-squared test for continuous distributions (III)

- that is, t_1 ≈ 0.22, t_2 ≈ 0.51, t_3 ≈ 0.92, and t_4 ≈ 1.61

[Figure: equal mass splits for the exponential distribution, with classes A_1, A_2, A_3, A_4, A_5]

33
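The cut points are simply quantiles of E(1): t_j = −log(1 − j/5). A one-line check with numpy (my own sketch):

```python
import numpy as np

t = -np.log(1 - np.arange(1, 5) / 5)   # quantiles of E(1) at levels 1/5, 2/5, 3/5, 4/5
print(t.round(2))                      # approximately [0.22, 0.51, 0.92, 1.61]
```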

Page 34: Mathematical Statistics course # 7

Chi-squared test for continuous distributions (IV)

- suppose that we observe 100 realizations of X, distributed among the classes as follows:

  class    A_1   A_2   A_3   A_4   A_5
  number   20    27    31    15    7

- we compute the test statistic:

  D²_n(p̂_n, p⁰) = (20 − 20)²/20 + (27 − 20)²/20 + (31 − 20)²/20 + (15 − 20)²/20 + (7 − 20)²/20 = 18.2

- the 95% quantile of the χ²_4 distribution is 9.49: we reject the null

34
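The same computation with scipy (the expected counts here are 100 × 1/5 = 20 per class):

```python
from scipy import stats

observed = [20, 27, 31, 15, 7]
stat, pvalue = stats.chisquare(observed, f_exp=[20] * 5)   # uniform expected counts
print(stat, pvalue)                                        # 18.2, p-value ≈ 0.001: reject H0
```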

Page 35: Mathematical Statistics course # 7

Parametric family of distributions (I)

- let us now consider a parametric family of discrete distributions

  F = {p(θ), θ ∈ Θ} ,

  with Θ ⊆ R^D
- we want to test

  H_0 : p ∈ F vs H_1 : p ∉ F

  (“test d’ajustement à une famille de lois”)
- Idea: pick a reference law in F
- Solution: construct θ̂_n, the maximum likelihood estimator of θ
- we can then consider

  p(θ̂_n) = (p_1(θ̂_n), . . . , p_k(θ̂_n))ᵀ ∈ R^k .

35

Page 36: Mathematical Statistics course # 7

Parametric family of distributions (II)

- we then compare the empirical frequencies to p(θ̂_n) as before, giving rise to the test statistic

  D²_n(p̂_n, F) = n ∑_{j=1}^k (p̂_j − p_j(θ̂_n))² / p_j(θ̂_n) .

Theorem: Suppose that D < k − 1. Assume that the mapping

  p : θ ↦ p(θ) = (p_1(θ), . . . , p_k(θ))

is one-to-one, C², and has no singular point. Then, if θ̂_n is a consistent estimator of θ,

  D²_n(p̂_n, F) −→ χ²_{k−D−1} in law

under the null.

36

Page 37: Mathematical Statistics course # 7

Parametric family of distributions (III)

- under H_1, D²_n(p̂_n, F) → +∞ almost surely if the distance from p to F is > 0
- in summary, we have the following test (“test d’ajustement à une famille paramétrée de lois”):

  φ(X_1, . . . , X_n) = 1_{D²_n(p̂_n, F) > c_{k−D−1, 1−α}} ,

  where c_{·,·} denotes the χ² quantiles as before
- also possible to extend to continuous laws as explained previously
- we will see in a future course the Kolmogorov-Smirnov test, which avoids binning the data

37
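As an illustration of the parametric case, here is a hedged sketch of my own (simulated data, not from the course) testing fit to the Poisson family {P(λ), λ > 0}, so D = 1. For simplicity λ is estimated by the sample mean, the MLE on the ungrouped data, used as a stand-in for the estimator appearing in the theorem; the tail values are grouped so the expected counts stay reasonably large.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=300)            # illustrative sample
lam_hat = X.mean()                        # sample mean = ungrouped MLE of lambda

# classes {0}, {1}, ..., {5} and {6, 7, ...}
values = list(range(6))
p_theta = np.array([stats.poisson.pmf(j, lam_hat) for j in values] + [stats.poisson.sf(5, lam_hat)])
counts = np.array([(X == j).sum() for j in values] + [(X > 5).sum()])

n, k, D = len(X), len(counts), 1
stat = ((counts - n * p_theta) ** 2 / (n * p_theta)).sum()
threshold = stats.chi2.ppf(0.95, df=k - D - 1)     # c_{k-D-1, 1-alpha}
print(stat, threshold, stat > threshold)           # here we expect not to reject
```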

Page 38: Mathematical Statistics course # 7

Chi-squared test for independence (I)

- Setting: n observations (X_1, Y_1), . . . , (X_n, Y_n) i.i.d. taking discrete values
- denote by ν the joint law of (X, Y)
- we want to test

  H_0 : X independent from Y vs H_1 : not independent

  (“test du chi deux d’indépendance”)
- Idea: transform this problem into a particular case of the previous test:

  H_0 : ν ∈ F vs H_1 : ν ∉ F

- Question: which family to consider?

38

Page 39: Mathematical Statistics course # 7

Chi-squared test for independence (II)

- denote by X = {x_1, . . . , x_r} (resp. Y = {y_1, . . . , y_s}) the support of X (resp. Y)
- P(X) (resp. P(Y)) is the set of distributions on X (resp. Y) with positive mass
- we set

  F = {p ⊗ q, p ∈ P(X), q ∈ P(Y)}

- Intuition: the set of discrete distributions on X × Y that can be written as a product
- the parameter of this family is

  θ = (p_1, . . . , p_{r−1}, q_1, . . . , q_{s−1})ᵀ ∈ R^{r+s−2}

39

Page 40: Mathematical Statistics course # 7

Chi-squared test for independence (III)

- the maximum likelihood estimator is given by (p̂, q̂), where

  p̂_j = (1/n) ∑_{i=1}^n 1_{X_i = x_j} and q̂_{j′} = (1/n) ∑_{i=1}^n 1_{Y_i = y_{j′}} .

- on the other hand, the empirical frequencies are given by

  ν̂_{j,j′} = (1/n) ∑_{i=1}^n 1_{(X_i, Y_i) = (x_j, y_{j′})} .

- the test statistic is

  D²_n(p̂_n, q̂_n, ν̂_n) = n ∑_{j=1}^r ∑_{j′=1}^s (ν̂_{j,j′} − p̂_j q̂_{j′})² / (p̂_j q̂_{j′}) .

40

Page 41: Mathematical Statistics course # 7

Chi-squared test for independence (IV)

- one can check the assumptions of the theorem, and show that

  D²_n(p̂_n, q̂_n, ν̂_n) −→ χ²_{k−D−1} in law

  under the null
- Question: how many degrees of freedom?
- here k = rs (the number of classes, i.e. possible values of (X, Y)) and D = r − 1 + s − 1 = r + s − 2 (number of parameters for θ)
- thus

  k − D − 1 = rs − (r + s − 2) − 1 = (r − 1)(s − 1) .

- in summary, we have the following test (“test du chi deux d’indépendance”):

  φ(X_1, Y_1, . . . , X_n, Y_n) = 1_{D²_n(p̂_n, q̂_n, ν̂_n) > c_{(r−1)(s−1), 1−α}} ,

  where c_{·,·} denotes the χ² quantiles as before
41
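A sketch of the independence test on a small contingency table (the counts are illustrative, and the helper names are my own); scipy.stats.chi2_contingency without the continuity correction computes the same statistic.

```python
import numpy as np
from scipy import stats

table = np.array([[20, 30],        # illustrative counts of (X = x_j, Y = y_j')
                  [25, 25]])
n = table.sum()
nu_hat = table / n                                  # empirical joint frequencies
p_hat = table.sum(axis=1) / n                       # empirical law of X
q_hat = table.sum(axis=0) / n                       # empirical law of Y
prod = np.outer(p_hat, q_hat)                       # products p_hat_j * q_hat_j'
stat = n * ((nu_hat - prod) ** 2 / prod).sum()

r, s = table.shape
threshold = stats.chi2.ppf(0.95, df=(r - 1) * (s - 1))   # c_{(r-1)(s-1), 1-alpha}
print(stat, threshold, stat > threshold)

# scipy computes the same statistic (without Yates continuity correction):
chi2_stat, pvalue, dof, expected = stats.chi2_contingency(table, correction=False)
print(chi2_stat, pvalue, dof)
```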

Page 42: Mathematical Statistics course # 7

Chi-squared test for homogeneity (I)

- particular case of the chi-squared test for independence
- discrete distributions a and b on {x_1, . . . , x_k}
- we are given i.i.d. samples A_1, . . . , A_n and B_1, . . . , B_m from a and b
- we want to test (“test du chi deux d’homogénéité”):

  H_0 : a = b vs H_1 : a ≠ b .

- Example: categories of population (men / women), k outcomes
- Idea: set

  ((X_1, Y_1), . . . , (X_n, Y_n), (X_{n+1}, Y_{n+1}), . . . , (X_{n+m}, Y_{n+m})) = ((A_1, 1), . . . , (A_n, 1), (B_1, 2), . . . , (B_m, 2))

- then simply test for X independent of Y!

42

Page 43: Mathematical Statistics course # 7

Chi-squared test for homogeneity (II)

- more precisely, set

  N_j = ∑_{i=1}^n 1_{A_i = x_j} , M_j = ∑_{i=1}^m 1_{B_i = x_j} , and p̂_j = (N_j + M_j) / (n + m) .

- then the test statistic is

  D²_n(p̂_n) = ∑_{j=1}^k ( (N_j − n p̂_j)² / (n p̂_j) + (M_j − m p̂_j)² / (m p̂_j) ) .

- under H_0, D²_n(p̂_n) −→ χ²_{k−1} in law, and D²_n(p̂_n) −→ +∞ a.s. under H_1
- chi-squared test for homogeneity (“test du chi deux d’homogénéité”):

  φ(A_1, . . . , A_n, B_1, . . . , B_m) = 1_{D²_n(p̂_n) > c_{k−1, 1−α}} .

43
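A minimal sketch of the homogeneity test with illustrative counts for the two samples (my own example, not from the course):

```python
import numpy as np
from scipy import stats

N = np.array([30, 50, 20])    # counts of the x_j in sample A, n = 100
M = np.array([45, 40, 15])    # counts of the x_j in sample B, m = 100
n, m = N.sum(), M.sum()
p_hat = (N + M) / (n + m)     # pooled estimate of the common law under H0: a = b

stat = ((N - n * p_hat) ** 2 / (n * p_hat) + (M - m * p_hat) ** 2 / (m * p_hat)).sum()
threshold = stats.chi2.ppf(0.95, df=len(p_hat) - 1)   # c_{k-1, 1-alpha}
print(stat, threshold, stat > threshold)
```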