
Review of Statistics A-Exam

Qiwei Li, Ph.D. student
Rice University, Houston, Texas

January 1, 2013


To Kina


Contents

1 Probability Theory
    1.1 Set Theory
    1.2 Measure Theory
        1.2.1 σ-algebra
        1.2.2 Measures
        1.2.3 Measurable Functions
        1.2.4 Induced Measures
        1.2.5 Integral
        1.2.6 Density
        1.2.7 Conditional Expectation

2 Distributions
    2.1 Common Distributions
    2.2 Empirical Distributions
    2.3 Exponential Families
    2.4 Location and Scale Families
    2.5 Important Theorems
        2.5.1 Transformation
        2.5.2 Chebychev's Inequality
        2.5.3 Jensen's Inequality
        2.5.4 Cauchy-Schwarz Inequality
        2.5.5 Sums of Random Variables
        2.5.6 Important Equalities
        2.5.7 Order Statistics

3 Large Sample Theory
    3.1 Convergence Concepts
        3.1.1 Almost Sure Convergence
        3.1.2 Convergence in Probability
        3.1.3 Convergence in Distribution
        3.1.4 Law of Large Numbers (LLN)
        3.1.5 Central Limit Theorem (CLT)
    3.2 The Delta Method

4 Data Reduction
    4.1 Sufficient Statistics
        4.1.1 Definition
        4.1.2 Sufficiency Principle
        4.1.3 Verification
        4.1.4 Exponential Family Case
        4.1.5 Examples
    4.2 Minimal Sufficient Statistics
        4.2.1 Definition
        4.2.2 Verification
        4.2.3 Examples
    4.3 Ancillary Statistics
        4.3.1 Definition
        4.3.2 Location Family Case
        4.3.3 Scale Family Case
        4.3.4 Examples
    4.4 Complete Statistics
        4.4.1 Definition
        4.4.2 Exponential Family Case
        4.4.3 Examples

5 Point Estimation
    5.1 Basic Concepts
    5.2 Methods of Finding Estimators
        5.2.1 Method of Moments (MOM)
        5.2.2 Maximum Likelihood Estimator (MLE)
        5.2.3 Bayes Estimator
        5.2.4 Minimax Estimator
    5.3 Methods of Evaluating Estimators
        5.3.1 Mean Squared Error (MSE)
        5.3.2 Cramer-Rao Inequality
        5.3.3 Uniform Minimum Variance Unbiased Estimator (UMVUE)
        5.3.4 Equivariance and Invariance
        5.3.5 Admissible Estimator
        5.3.6 Asymptotic Evaluations

6 Hypothesis Testing
    6.1 Basic Concepts
        6.1.1 Error Probabilities
        6.1.2 Power Function
        6.1.3 p-Value
        6.1.4 Critical Region
        6.1.5 Confidence Interval
    6.2 Method of Finding Tests
        6.2.1 Likelihood Ratio Test (LRT)
        6.2.2 Bayesian Shelter Test
        6.2.3 Uniformly Most Powerful (UMP) Test

7 Linear Regression
    7.1 Simple Linear Regression
        7.1.1 Definition
        7.1.2 Least Squares (LS) Solution: Mathematical Solution
        7.1.3 Best Linear Unbiased Estimator (BLUE): Statistical Inference
        7.1.4 Maximum Likelihood Estimator (MLE): Statistical Inference
        7.1.5 Hypothesis Testing
    7.2 Multiple Linear Regression
        7.2.1 Definition
        7.2.2 Least Squares (LS) Solution: Mathematical Solution
        7.2.3 Best Linear Unbiased Estimator (BLUE): Statistical Inference
        7.2.4 Maximum Likelihood Estimator (MLE): Statistical Inference
        7.2.5 Model Selection with Hypothesis Testing
    7.3 ANOVA
        7.3.1 Definition
        7.3.2 Hypothesis Testing
        7.3.3 Coefficient of Determination
    7.4 Important Issues
        7.4.1 Multicollinearity
        7.4.2 Checking the Assumption of Regression Models
        7.4.3 Testing for Lack of Fit



1 Probability Theory

1.1 Set Theory

DeMorgan's Laws: For any two events A and B defined on a sample space S, (A ∪ B)^c = A^c ∩ B^c and (A ∩ B)^c = A^c ∪ B^c. (Theorem 1.1.4, Page 3)

1.2 Measure Theory

1.2.1 σ-algebra

σ-algebra: A collection of subsets of Ω is called a σ-algebra, denoted by F, if it satisfies the following three properties: a. ∅ ∈ F; b. If A ∈ F, then A^c ∈ F (F is closed under complementation); c. If A1, A2, · · · ∈ F, then ⋃_{i=1}^∞ Ai ∈ F (F is closed under countable unions). (Definition 1.2.1, Page 6)

A pair (Ω, F) is called a measurable space. The elements of F are called measurable sets or events. (Section 1.1.1, Cox's)

If C is any collection of subsets of Ω, then the σ-field generated by C is the smallest σ-field containing C. (Definition 1.1.2, Cox's)

1.2.2 Measures

Measure:

• A measure is a function µ defined for certain subsets A of a set Ω which assigns a nonnegative number µ(A) to each "measurable" set A. (Section 1.1, Cox's)

• A measure space (Ω, F, µ) is a triple, consisting of an underlying set Ω, a σ-field F, and a function µ called the measure with domain F and satisfying the following three properties: a. 0 ≤ µ(A) ≤ ∞ for all A ∈ F; b. µ(∅) = 0; c. If A1, A2, · · · is a sequence of disjoint elements of F, then µ(⋃_{i=1}^∞ Ai) = ∑_{i=1}^∞ µ(Ai). (Definition 1.1.4, Cox's)

Basic Properties are (Proposition 1.1.4, Cox’s):

• Monotonicity: A ⊂ B,A,B ∈ F implies µ(A) ≤ µ(B).

• Subadditivity: If A1, A2, · · · is any sequence of measurable sets, then µ(⋃_{i=1}^∞ Ai) ≤ ∑_{i=1}^∞ µ(Ai).

• If A1, A2, · · · is a decreasing sequence of measurable sets, and if µ(A1) < ∞, then µ(⋂_{i=1}^∞ Ai) = lim_{i→∞} µ(Ai).

Examples include:

• Probability: Denoted by P instead of µ, the subset A is called an event and Ω is called the sample space. (Section 1.1, Cox's)

• Length: Denoted by m, the Lebesgue measure, which assigns to each interval of real numbers its length. (Section 1.1, Cox's)

• Counting: Denoted by #, which is defined as the number of elements of A,A ∈ F . (Example 1.1.3,Cox’s)

• Unit Point Mass: Denoted by δx, δx(A) = 1 if x ∈ A and δx(A) = 0 otherwise. (Example 1.1.4, Cox’s)


σ-finite: A measure space (Ω, F, µ) is called σ-finite if and only if there is an infinite sequence A1, A2, · · · in F such that: a. µ(Ai) < ∞ for each i; b. Ω = ⋃_{i=1}^∞ Ai. (Definition 1.3.2, Cox's)

Product Measure: Let (Ω1, F1, µ1) and (Ω2, F2, µ2) be σ-finite measure spaces. Then there exists a unique measure µ1 × µ2 (called the product measure) on (Ω1 × Ω2, F1 × F2) such that for all A1 ∈ F1 and A2 ∈ F2, (µ1 × µ2)(A1 × A2) = µ1(A1) µ2(A2). (Theorem 1.3.1, Cox's)

1.2.3 Measurable Functions

Inverse Image: Consider two sets Ω and Λ. Let f : Ω → Λ be any function and A ⊂ Λ. Then the inverse image of A under f is f^{-1}(A) = {ω ∈ Ω : f(ω) ∈ A}. If A is a collection of subsets of Λ, then we write f^{-1}(A) = {f^{-1}(A) : A ∈ A}. (Section 1.2.1, Cox's)

Measurable Function: Let (Ω, F) and (Λ, G) be measurable spaces and f : Ω → Λ a function. Then f is a measurable function if and only if f^{-1}(G) ⊂ F, in which case we shall write f : (Ω, F) → (Λ, G). (Definition 1.2.1, Cox's) If f : (Ω, F) → (Λ, G), then the σ-field generated by f is f^{-1}(G) and is denoted by σ(f). (Definition 1.2.2, Cox's)

Note that if Λ = R and G is the Borel σ-field, then we say f is Borel measurable or a real-valued Borel function. All real-valued functions that arise in "practice" are Borel functions, such as the indicator function of A: I_A(x) = 1 if x ∈ A and I_A(x) = 0 otherwise. The class of simple functions is obtained by taking linear combinations of indicators: φ(ω) = ∑_{i=1}^n a_i I_{A_i}(ω). (Section 1.2.1, Cox's)

1.2.4 Induced Measures

Induced Measure: Let (Ω, F, µ) be a measure space, (Λ, G) a measurable space, and f : (Ω, F) → (Λ, G) a measurable function. Define a function (µ ∘ f^{-1})(C) = µ(f^{-1}(C)), C ∈ G. We claim that µ ∘ f^{-1} is a measure and we call µ ∘ f^{-1} the measure induced by f. (Section 1.2.2, Cox's)

Note that if µ = P is a probability measure and X is a random variable, then P ∘ X^{-1} is called the distribution of X or the law of X and is sometimes denoted P_X or Law[X]. (Section 1.2.2, Cox's)

1.2.5 Integral

We define the integral of f over the set A ∈ F as ∫_A f(ω) dµ(ω) = ∫ I_A(ω) f(ω) dµ(ω). (Section 1.2.3, Cox's)

Basic Properties are (Proposition 1.2.5, Cox’s):

• If ∫f dµ exists and a ∈ R, then ∫af dµ exists and equals a∫f dµ.

• If ∫f dµ and ∫g dµ both exist and ∫f dµ + ∫g dµ is defined, then ∫(f + g) dµ is defined and equals ∫f dµ + ∫g dµ.

• If f(ω) < g(ω) for all ω ∈ Ω, then ∫f dµ ≤ ∫g dµ.

• |∫f dµ| ≤ ∫|f| dµ.

Examples include (Section 1.2.3, Cox’s):

• Probability: E[X] = ∫X dP.

• Counting: ∑_i f(a_i) = ∫f d#.

• Unit Point Mass: f(x) = ∫f dδ_x.

Fubini's Theorem: Let Ω = Ω1 × Ω2, F = F1 × F2, and µ = µ1 × µ2, where µ1 and µ2 are σ-finite. If f is a Borel function on Ω whose integral with respect to µ exists, then ∫_Ω f(ω) dµ(ω) = ∫_{Ω1×Ω2} f(ω1, ω2) d(µ1 × µ2)(ω1, ω2) = ∫_{Ω2} [∫_{Ω1} f(ω1, ω2) dµ1(ω1)] dµ2(ω2). (Theorem 1.3.2, Cox's)


1.2.6 Density

Density: Let f : (Ω, F, µ) → (R, B) be nonnegative and put ν(A) = ∫_A f dµ, A ∈ F. Then ν is a measure on (Ω, F) and the function f is called the density of ν with respect to µ. (Theorem 1.2.9, Cox's)

Let µ and ν be measures on (Ω, F). We say ν is absolutely continuous with respect to µ, and write ν << µ, if and only if for all A ∈ F, µ(A) = 0 implies ν(A) = 0. We sometimes say µ dominates ν, or that µ is a dominating measure for ν. (Definition 1.4.1, Cox's)

Radon-Nikodym Theorem: Let (Ω, F, µ) be a σ-finite measure space and suppose ν << µ. Then there is a nonnegative Borel function f such that ν(A) = ∫_A f dµ for all A ∈ F. Furthermore, f is unique µ-almost everywhere; i.e., if ν(A) = ∫_A g dµ for all A ∈ F, then g = f µ-almost everywhere. (Theorem 1.4.1, Cox's)

Let ν1, µ1 and ν2, µ2 be σ-finite measures on (Ω1, F1) and (Ω2, F2), respectively, with ν1 << µ1 and ν2 << µ2. Then ν1 × ν2 << µ1 × µ2 and d(ν1 × ν2)/d(µ1 × µ2)(ω1, ω2) = (dν1/dµ1)(ω1) · (dν2/dµ2)(ω2), µ1 × µ2-almost everywhere. (Proposition 1.4.3, Cox's)

1.2.7 Conditional Expectation

Let X : (Ω, F, P) → (R, B) be a random variable with E[|X|] < ∞, and suppose G is a sub-σ-field of F. Then the conditional expectation of X given G, denoted E[X|G], is the essentially unique random variable Z satisfying: a. Z is G-measurable; b. ∫_A Z dP = ∫_A X dP for all A ∈ G. (Definition 1.5.1, Cox's)

Suppose A1, · · · , An are events which partition Ω and P(Ai) > 0 for each i. Let Y = ∑_{i=1}^n a_i I_{A_i} be a simple random variable, where a1, · · · , an are distinct real numbers. If X is an integrable random variable, then E[X|Y] = ∑_{i=1}^n (∫_{A_i} X dP / P(Ai)) I_{A_i}, almost surely. (Proposition 1.5.3, Cox's)


2 Distributions

2.1 Common Distributions

Please refer to Table 1 and Table 2. (Wikipedia and Page 621-626)

2.2 Empirical Distributions

Let X1, · · · , Xn be iid real random variables with common cdf F. Then the empirical distribution function is defined as Fn(A) = (1/n)∑_{i=1}^n δ_{X_i}(A). (Section 1.1.4, Cox's)

2.3 Exponential Families

A family of pdfs or pmfs is called an exponential family if it can be expressed as

f(x|θ) = h(x)c(θ) exp(∑_{i=1}^k w_i(θ)t_i(x)),

where h(x) ≥ 0 and t1(x), · · · , tk(x) are real-valued functions of the observation x, and c(θ) ≥ 0 and w1(θ), · · · , wk(θ) are real-valued functions of the possibly vector-valued parameter θ. (Page 111)

An exponential family is sometimes reparameterized as

f(x|η) = h(x)c*(η) exp(∑_{i=1}^k η_i t_i(x)),

where η is called the natural parameter and c*(η) = 1/∫_{-∞}^{∞} h(x) exp(∑_{i=1}^k η_i t_i(x)) dx. (Page 114)

Note that any family of distributions whose support set depends on the parameter cannot be an exponential family.
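
As a quick illustration, the Bernoulli(p) pmf fits this form: p(x|p) = p^x(1 − p)^{1−x} = (1 − p) exp(x log(p/(1 − p))) for x = 0, 1, so one may take h(x) = 1, c(p) = 1 − p, t1(x) = x, and w1(p) = log(p/(1 − p)); the corresponding natural parameter is η = log(p/(1 − p)).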

2.4 Location and Scale Families

Let f(x) be any pdf. Then the family of pdfs f(x − µ), indexed by the parameter µ, −∞ < µ < ∞, is called the location family with standard pdf f(x), and µ is called the location parameter for the family. (Definition 3.5.2, Page 116)

Let f(x) be any pdf. Then for any σ > 0, the family of pdfs (1/σ)f(x/σ), indexed by the parameter σ, is called the scale family with standard pdf f(x), and σ is called the scale parameter for the family. (Definition 3.5.4, Page 119)

Let f(x) be any pdf. Then for any µ, −∞ < µ < ∞, and any σ > 0, the family of pdfs (1/σ)f((x − µ)/σ), indexed by the parameter (µ, σ), is called the location-scale family with standard pdf f(x); µ is called the location parameter and σ is called the scale parameter. (Definition 3.5.5, Page 119)

2.5 Important Theorems

2.5.1 Transformation

Let X have pdf fX(x), let Y = g(X), and define the sample space X. Suppose there exists a partition, A0, · · · , Ak, of X such that P(X ∈ A0) = 0 and fX(x) is continuous on each Ai. Further, suppose there exist functions g1(x), · · · , gk(x), defined on A1, · · · , Ak, respectively, satisfying g(x) = gi(x) for x ∈ Ai, gi(x) is monotone on Ai, the set Y = {y : y = gi(x) for some x ∈ Ai} is the same for each i = 1, · · · , k, and g_i^{-1}(y) has a continuous derivative on Y for each i = 1, · · · , k. Then

fY(y) = ∑_{i=1}^k fX(g_i^{-1}(y)) |d/dy g_i^{-1}(y)| if y ∈ Y,

and fY(y) = 0 otherwise. (Theorem 2.1.8, Page 53)
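
A standard worked example of this theorem: let X ∼ n(0, 1) and Y = g(X) = X². Take A0 = {0}, A1 = (−∞, 0), A2 = (0, ∞), g1(x) = g2(x) = x², g1^{-1}(y) = −√y, g2^{-1}(y) = √y, and Y = (0, ∞). Then fY(y) = fX(−√y)(1/(2√y)) + fX(√y)(1/(2√y)) = (1/√(2πy)) e^{−y/2} for y > 0, the chi-squared pdf with 1 degree of freedom.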

7

Page 10: Review...Probability: Denoted by P instead of , the subset Ais called an event and is called the sample space. (Section 1.1, Cox’s) Length: Denoted by m, which is a Lebesgue measure

2.5.2 Chebychev’s Inequality

Let X be a random variable and let g(x) be a nonnegative function. Then, for any r > 0, P(g(X) ≥ r) ≤ E[g(X)]/r. (Theorem 3.6.1, Page 122)
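
In particular, taking g(x) = (x − µ)²/σ² with µ = E[X] and σ² = Var(X), and r = t², gives the familiar form P(|X − µ| ≥ tσ) ≤ 1/t²; e.g., no random variable can be 2 or more standard deviations from its mean with probability greater than 1/4.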

2.5.3 Jensen’s Inequality

Let f be a convex function on a convex set K ⊂ Rn and suppose X is a random n-vector with E[||X||] < ∞ and X ∈ K almost surely. Then E[X] ∈ K and f(E[X]) ≤ E[f(X)]. (Theorem 2.1.4, Cox's)

2.5.4 Cauchy-Schwarz Inequality

For any random variables X and Y, (E[|XY|])² ≤ E[X²]E[Y²], and the equality holds if and only if either X = 0 almost surely or Y = cX almost surely for some constant c. (Theorem 2.1.5, Cox's)

2.5.5 Sums of Random Variables

Let X1, · · · , Xn be a random sample from a population with mgf MX(t). Then the mgf of the sample mean X̄ is M_X̄(t) = (MX(t/n))^n. (Theorem 5.2.7, Page 215)

If X and Y are independent continuous random variables with pdfs fX(x) and fY(y), then the pdf of Z = X + Y is fZ(z) = ∫_{-∞}^{∞} fX(w) fY(z − w) dw. (Theorem 5.2.9, Page 215)

2.5.6 Important Equalities

• E[aX + bY ] = aE[X] + bE[Y ].

• V ar(aX + bY ) = a2V ar(X) + b2V ar(Y ) + 2abCov(X,Y ).

• Cov(X,Y ) = E[XY ]− E[X]E[Y ].

• Cov(AX + C,BY +D) = ACov(X,Y )BT .

• E[g(X)h(Y )] = E[g(X)]E[h(Y )], if X and Y are independent.

2.5.7 Order Statistics

Let X(1), · · · , X(n) denote the order statistics of a random sample, X1, · · · , Xn, from a continuous population with cdf FX(x) and pdf fX(x). Then the pdf of X(j) is fX(j)(x) = (n!/((j−1)!(n−j)!)) fX(x) [FX(x)]^{j−1} [1 − FX(x)]^{n−j}. (Theorem 5.4.4, Page 229)
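
For instance, if X1, · · · , Xn are iid uniform(0, 1), then FX(x) = x and fX(x) = 1 on (0, 1), so taking j = n gives the pdf of the maximum, fX(n)(x) = n x^{n−1} for 0 < x < 1; i.e., X(n) ∼ beta(n, 1).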


3 Large Sample Theory

3.1 Convergence Concepts

3.1.1 Almost Sure Convergence

A sequence of random variables, X1, X2, · · ·, converges almost surely to a random variable X if, for every ε > 0, P(lim_{n→∞} |Xn − X| < ε) = 1. (Definition 5.5.6, Page 234)

3.1.2 Convergence in Probability

A sequence of random variables, X1, X2, · · ·, converges in probability to a random variable X if, for every ε > 0, lim_{n→∞} P(|Xn − X| ≥ ε) = 0 or, equivalently, lim_{n→∞} P(|Xn − X| < ε) = 1. (Definition 5.5.1, Page 232)

Note that the X1, X2, · · · are typically not iid random variables. (Page 232)

Suppose that X1, X2, · · · converges in probability to a random variable X and that h is a continuous function. Then h(X1), h(X2), · · · converges in probability to h(X). (Theorem 5.5.4, Page 233)

3.1.3 Convergence in Distribution

A sequence of random variables, X1, X2, · · ·, converges in distribution to a random variable X if lim_{n→∞} FXn(x) = FX(x) at all points x where FX(x) is continuous. (Definition 5.5.10, Page 235)

Slutsky's Theorem: If Xn → X in distribution and Yn → a, a constant, in probability, then YnXn → aX in distribution and Xn + Yn → X + a in distribution. (Theorem 5.5.17, Page 239)

Cramer-Wold Theorem: Suppose that X1, X2, · · · is a sequence of random k-vectors. Then Xn converges to X in distribution if and only if aᵀXn converges to aᵀX in distribution for all a ∈ Rk.

3.1.4 Law of Large Numbers (LLN)

• Weak LLN: Let X1, X2, · · · be iid random variables with E[Xi] = µ and Var(Xi) = σ² < ∞. Define X̄n = (1/n)∑_{i=1}^n Xi. Then, for every ε > 0, lim_{n→∞} P(|X̄n − µ| < ε) = 1; that is, X̄n converges in probability to µ. (Theorem 5.5.2, Page 233)

• Strong LLN: Let X1, X2, · · · be iid random variables with E[Xi] = µ and Var(Xi) = σ² < ∞, and define X̄n = (1/n)∑_{i=1}^n Xi. Then, for every ε > 0, P(lim_{n→∞} |X̄n − µ| < ε) = 1; that is, X̄n converges almost surely to µ. (Theorem 5.5.9, Page 235)

Note that the only moment condition needed is that E|Xi| < ∞. (Page 235)

3.1.5 Central Limit Theorem (CLT)

• CLT: Let X1, X2, · · · be a sequence of iid random variables whose mgfs exist in a neighborhood of 0. Let E[Xi] = µ and Var(Xi) = σ² > 0. Define X̄n = (1/n)∑_{i=1}^n Xi, and let Gn(x) denote the cdf of √n(X̄n − µ)/σ. Then, for any x, −∞ < x < ∞, lim_{n→∞} Gn(x) = ∫_{-∞}^x (1/√(2π)) e^{−y²/2} dy; that is, √n(X̄n − µ)/σ has a limiting standard normal distribution. (Theorem 5.5.14, Page 236)

• Stronger CLT: Let X1, X2, · · · be a sequence of iid random variables with E[Xi] = µ and 0 < Var(Xi) = σ² < ∞. Define X̄n = (1/n)∑_{i=1}^n Xi, and let Gn(x) denote the cdf of √n(X̄n − µ)/σ. Then, for any x, −∞ < x < ∞, lim_{n→∞} Gn(x) = ∫_{-∞}^x (1/√(2π)) e^{−y²/2} dy; that is, √n(X̄n − µ)/σ has a limiting standard normal distribution. (Theorem 5.5.15, Page 238)


3.2 The Delta Method

Let Yn be a sequence of random variables that satisfies √n(Yn − θ) → n(0, σ²) in distribution. For a given function g and a specific value of θ, suppose that g′(θ) exists and is not 0. Then √n(g(Yn) − g(θ)) → n(0, σ²(g′(θ))²) in distribution. (Theorem 5.5.24, Page 243)

Taylor Series: If a function g(x) has derivatives of order r, that is, g^{(r)}(x) = (d^r/dx^r) g(x) exists, then for any constant a, the Taylor polynomial of order r about a is Tr(x) = ∑_{i=0}^r (g^{(i)}(a)/i!)(x − a)^i. (Definition, Page 241)


4 Data Reduction

4.1 Sufficient Statistics

4.1.1 Definition

• A statistic T(X) is a sufficient statistic for θ if the conditional distribution of the sample X given the value of T(X) does not depend on θ. (Definition 6.2.1, Page 272)

• A sufficient statistic for a parameter θ is a statistic that, in a certain sense, captures all the information about θ contained in the sample. Any additional information in the sample, besides the value of the sufficient statistic, does not contain any more information about θ. (Page 272)

4.1.2 Sufficiency Principle

If T(X) is a sufficient statistic for θ, then any inference about θ should depend on the sample X only through the value T(X). That is, if x and y are two sample points such that T(x) = T(y), then the inference about θ should be the same whether X = x or X = y is observed. (Page 272)

4.1.3 Verification

• If p(x|θ) is the joint pdf or pmf of X and q(t|θ) is the pdf or pmf of T(X), then T(X) is a sufficient statistic for θ if, for every x in the sample space, the ratio p(x|θ)/q(t|θ) is constant as a function of θ. (Theorem 6.2.2, Page 274)

• Factorization Theorem: Let f(x|θ) denote the joint pdf or pmf of a sample X. A statistic T(X) is a sufficient statistic for θ if and only if there exist functions g(t|θ) and h(x) such that, for all sample points x and all parameter points θ, f(x|θ) = g(t|θ)h(x). (Theorem 6.2.6, Page 276)
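
As a quick illustration of the Factorization Theorem, let X1, · · · , Xn be iid Poisson(λ). Then f(x|λ) = ∏_{i=1}^n e^{−λ}λ^{xi}/xi! = (e^{−nλ}λ^{∑xi}) · (1/∏_{i=1}^n xi!) = g(t|λ)h(x) with t = ∑_{i=1}^n xi, so T(X) = ∑_{i=1}^n Xi is a sufficient statistic for λ.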

4.1.4 Exponential Family Case

Let X1, · · · , Xn be iid observations from a pdf or pmf f(x|θ) that belongs to an exponential family given by

f(x|θ) = h(x)c(θ) exp(∑_{i=1}^k w_i(θ)t_i(x)),

where θ = (θ1, · · · , θd), d ≤ k. Then T(X) = (∑_{i=1}^n t1(Xi), · · · , ∑_{i=1}^n tk(Xi)) is a sufficient statistic for θ. (Theorem 6.2.10, Page 279)

4.1.5 Examples

• Binomial Sufficient Statistic: T(X) = ∑_{i=1}^n Xi. (Example 6.2.3, Page 274)

• Normal Sufficient Statistic with Known Variance: T(X) = (1/n)∑_{i=1}^n Xi. (Example 6.2.4, Page 274 and Example 6.2.7, Page 277)

• Normal Sufficient Statistic: T(X) = (X̄, S²) = ((1/n)∑_{i=1}^n Xi, (1/(n−1))∑_{i=1}^n (Xi − X̄)²). (Example 6.2.9, Page 279)

• Discrete Uniform Sufficient Statistic: T(X) = X(n). (Example 6.2.8, Page 277)


4.2 Minimal Sufficient Statistics

4.2.1 Definition

A sufficient statistic T(X) is called a minimal sufficient statistic if, for any other sufficient statistic T′(X), T(X) is a function of T′(X). (Definition 6.2.11, Page 280)

4.2.2 Verification

• Let f(x|θ) be the pmf or pdf of a sample X. Suppose there exists a function T(X) such that, for every two sample points x and y, the ratio f(x|θ)/f(y|θ) is constant as a function of θ if and only if T(x) = T(y). Then T(X) is a minimal sufficient statistic for θ. (Theorem 6.2.13, Page 281)

• If a minimal sufficient statistic exists, then any complete statistic is also a minimal sufficient statistic. (Theorem 6.2.28, Page 289)

4.2.3 Examples

• Normal Minimal Sufficient Statistic: T(X) = (X̄, S²) = ((1/n)∑_{i=1}^n Xi, (1/(n−1))∑_{i=1}^n (Xi − X̄)²). (Example 6.2.14, Page 281)

• Uniform Minimal Sufficient Statistic: T(X) = (X(1), X(n)). (Example 6.2.15, Page 282)

4.3 Ancillary Statistics

4.3.1 Definition

A statistic S(X) whose distribution does not depend on the parameter θ is called an ancillary statistic. (Definition 6.2.16, Page 282)

4.3.2 Location Family Case

Let X1, · · · , Xn be iid observations from a location parameter family with cdf F(x − θ), −∞ < θ < ∞. The range R = X(n) − X(1) is an ancillary statistic. (Example 6.2.18, Page 283)

4.3.3 Scale Family Case

Let X1, · · · , Xn be iid observations from a scale parameter family with cdf F(x/σ), σ > 0. Then any statistic that depends on the sample only through the n − 1 values X1/Xn, · · · , Xn−1/Xn is an ancillary statistic. (Example 6.2.19, Page 284)

4.3.4 Examples

• Uniform Ancillary Statistic S(X) = X(n) −X(1). (Example 6.2.17, Page 282)

• Normal Ancillary Statistic: S(X) = (X1 − X̄)/S.

4.4 Complete Statistics

4.4.1 Definition

Let f(t|θ) be a family of pdfs or pmfs for a statistic T(X). The family of probability distributions is called complete if Eθg(T) = 0 for all θ implies Pθ(g(T) = 0) = 1 for all θ. Equivalently, T(X) is called a complete statistic. (Definition 6.2.21, Page 285)
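
As a sketch of how the definition is used (this is the argument behind Example 6.2.22 below), let X1, · · · , Xn be iid Bernoulli(p), 0 < p < 1, so that T = ∑_{i=1}^n Xi ∼ binomial(n, p). If E_p g(T) = ∑_{t=0}^n g(t)(n choose t)p^t(1 − p)^{n−t} = 0 for all p, then, writing r = p/(1 − p), the polynomial ∑_{t=0}^n g(t)(n choose t)r^t vanishes for all r > 0, which forces every coefficient g(t)(n choose t) to be 0; hence P_p(g(T) = 0) = 1 for all p and T is complete.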


Basu's Theorem: If T(X) is a complete and minimal sufficient statistic, then T(X) is independent of every ancillary statistic. (Theorem 6.2.24, Page 287)

4.4.2 Exponential Family Case

Let X1, · · · , Xn be iid observations from an exponential family with pdf or pmf of the form

f(x|θ) = h(x)c(θ) exp(∑_{i=1}^k w_i(θ)t_i(x)),

where θ = (θ1, · · · , θd). Then the statistic T(X) = (∑_{i=1}^n t1(Xi), · · · , ∑_{i=1}^n tk(Xi)) is complete if {(w1(θ), · · · , wk(θ)) : θ ∈ Θ} contains an open set in Rk. (Theorem 6.2.25, Page 288)

4.4.3 Examples

• Binomial Complete Sufficient Statistic: T(X) = ∑_{i=1}^n Xi. (Example 6.2.22, Page 285)

• Discrete Uniform Sufficient Statistic: T(X) = X(n). (Example 6.2.23, Page 286)


5 Point Estimation

5.1 Basic Concepts

• Point estimator: A point estimator is any function W(X1, · · · , Xn) of a sample; that is, any statistic is a point estimator. (Definition 7.1.1, Page 311)

• Loss function: The loss function is a nonnegative function that generally increases as the distance between the point estimator and the parameter increases. For example, the absolute error loss function is L(θ, W(X)) = |θ − W(X)| and the squared error loss function is L(θ, W(X)) = (θ − W(X))². (Page 349)

• Risk function: The risk function is the average loss that will be incurred if the estimator δ(x) is used: R(θ, δ) = Eθ L(θ, δ(X)). (Page 349)

5.2 Methods of Finding Estimators

5.2.1 Method of Moments (MOM)

Let X1, · · · , Xn be a sample from a population with pdf or pmf f(x|θ1, · · · , θk). MOM estimators are found by equating the first k sample moments to the corresponding k population moments and solving the resulting system of simultaneous equations: mj = (1/n)∑_{i=1}^n Xi^j = E[X^j], j = 1, · · · , k. (Page 312)

Examples include:

• Normal MOM Estimators: θ̂MOM = X̄ and σ̂²MOM = (1/n)∑_{i=1}^n (Xi − X̄)². (Example 7.2.1, Page 313)

• Binomial MOM Estimators: k̂MOM = X̄² / (X̄ − (1/n)∑_{i=1}^n (Xi − X̄)²) and p̂MOM = X̄/k̂MOM. (Example 7.2.2, Page 313)

5.2.2 Maximum Likelihood Estimator (MLE)

If X1, · · · , Xn are an iid sample from a population with pdf or pmf f(x|θ1, · · · , θk), the likelihood function is defined by L(θ|x) = ∏_{i=1}^n f(xi|θ1, · · · , θk). For each sample point x, let θ̂MLE be a parameter value at which L(θ|x) attains its maximum as a function of θ, with x held fixed. An MLE of the parameter θ based on a sample X is θ̂MLE(X). (Definition 7.2.4, Page 316)

Invariance property of MLEs: If θ̂MLE is the MLE of θ, then for any function τ(θ), the MLE of τ(θ) is τ(θ̂MLE). (Theorem 7.2.10, Page 320)

Examples include:

• Normal MLE with Known Variance: θ̂MLE = X̄. (Example 7.2.6, Page 317)

• Normal MLE: θ̂MLE = X̄ and σ̂²MLE = (1/n)∑_{i=1}^n (Xi − X̄)². (Example 7.2.11, Page 321)

• Bernoulli MLE: p̂MLE = X̄. (Example 7.2.7, Page 317)

5.2.3 Bayes Estimator

If we denote the prior distribution by π(θ) and the sampling distribution by f(x|θ), then the posterior distribution, the conditional distribution of θ given the sample x, is π(θ|x) = f(x|θ)π(θ)/m(x), where m(x) = ∫ f(x|θ)π(θ) dθ is the marginal distribution of X. (Page 324)


Bayes risk: In a Bayesian analysis we would use the prior distribution on the parameter to compute an average risk,

∫_Θ R(θ, δ)π(θ) dθ = ∫_Θ (∫_X L(θ, δ(x)) f(x|θ) dx) π(θ) dθ = ∫_X (∫_Θ L(θ, δ(x)) π(θ|x) dθ) m(x) dx,

where ∫_Θ L(θ, δ(x)) π(θ|x) dθ is called the posterior expected loss. For squared error loss, the Bayes rule is the mean of the posterior distribution; for absolute error loss, the Bayes rule is the median of the posterior distribution. (Page 352-353)

Recall that when the loss function is of the form W(θ)(δ(X) − g(θ))², the Bayes estimator is given by E[W(θ)g(θ) | X] / E[W(θ) | X]. (Page 3, Statistics A-Exam (January 4th, 2012))
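
For a quick justification of the squared-error case: for any estimator δ, the posterior expected loss satisfies ∫_Θ (θ − δ(x))² π(θ|x) dθ = Var(θ|x) + (E[θ|x] − δ(x))², which is minimized by taking δ(x) = E[θ|x], the posterior mean.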

Examples with conjugate priors include:

• Binomial Bayes Estimator: p̂Bayes = (y + α)/(α + β + n). (Example 7.2.14, Page 324)

• Normal Bayes Estimator with Known Variance: θ̂Bayes = (τ²/(τ² + σ²)) X̄ + (σ²/(τ² + σ²)) µ. (Example 7.2.16, Page 326)

5.2.4 Minimax Estimator

An estimator δM(X) is called minimax with respect to a risk function R(θ, δ) if it achieves the smallest maximum risk among all estimators, meaning it satisfies sup_{θ∈Θ} R(θ, δM) = inf_δ sup_{θ∈Θ} R(θ, δ). (Wikipedia)

5.3 Methods of Evaluating Estimators

5.3.1 Mean Squared Error (MSE)

The mean squared error of an estimator W of a parameter θ is the function of θ defined by Eθ(W − θ)² = VarθW + (BiasθW)², where BiasθW = EθW − θ. (Definition 7.3.1 and Definition 7.3.2, Page 330)

5.3.2 Cramer-Rao Inequality

Let X1, · · · , Xn be a sample with pdf f(x|θ), and let W(X) be any estimator satisfying (d/dθ)EθW(X) = ∫_X (∂/∂θ)[W(x)f(x|θ)] dx and VarθW(X) < ∞. Then

VarθW(X) ≥ ((d/dθ)EθW(X))² / Eθ((∂/∂θ log f(X|θ))²). (Theorem 7.3.9, Page 335)

The regularity conditions are: a. Θ ⊂ Rk is an open set; b. VarθW(X) < ∞ or EθW(X)² < ∞ for all θ ∈ Θ; c. (∂/∂θ) log f(x|θ) exists for all x and θ ∈ Θ; d. EθW(X) is differentiable for all θ ∈ Θ and the derivative can be computed by interchanging differentiation and integration; e. the Fisher information matrix is strictly positive definite for all θ ∈ Θ. (Page 177, Cox's)

The quantity Eθ((∂/∂θ log f(X|θ))²) is called the Fisher information of the sample, reflecting the fact that the information number gives a bound on the variance of the UMVUE of θ.

Furthermore, if X1, · · · , Xn are iid with pdf f(x|θ), then VarθW(X) ≥ ((d/dθ)EθW(X))² / (n Eθ((∂/∂θ log f(X|θ))²)). (Corollary 7.3.10, Page 337)


Furthermore, if X1, · · · , Xn are iid observations from an exponential family with pdf or pmf of the form f(x|θ) = h(x)c(θ) exp(∑_{i=1}^k w_i(θ)t_i(x)), where θ = (θ1, · · · , θd), then

VarθW(X) ≥ ((d/dθ)EθW(X))² / (−n Eθ((∂²/∂θ²) log f(X|θ))). (Lemma 7.3.11, Page 338)

Let X1, · · · , Xn be iid f(x|θ), where f(x|θ) satisfies the conditions of the Cramer-Rao Theorem. Let L(θ|x) = ∏_{i=1}^n f(xi|θ) denote the likelihood function. If W(X) is any unbiased estimator of τ(θ), then W(X) attains the Cramer-Rao Lower Bound if and only if a(θ)(W(x) − τ(θ)) = (∂/∂θ) log L(θ|x) for some function a(θ). (Corollary 7.3.15, Page 341)

5.3.3 Uniform Minimum Variance Unbiased Estimator (UMVUE)

An estimator W* is a best unbiased estimator of τ(θ) if it satisfies EθW* = τ(θ) for all θ and, for any other estimator W with EθW = τ(θ), we have VarθW* ≤ VarθW for all θ. W* is also called a UMVUE of τ(θ). (Definition 7.3.7, Page 334)

Rao-Blackwell Theorem: Let W be any unbiased estimator of τ(θ), and let T be a sufficient statistic for θ. Define φ(T) = E(W|T). Then Eθφ(T) = τ(θ) and Varθφ(T) ≤ VarθW for all θ; that is, φ(T) is a uniformly better unbiased estimator of τ(θ). Moreover, a best unbiased estimator, when it exists, is unique. (Theorem 7.3.17, Page 342 and Theorem 7.3.19, Page 343)

Let T be a complete sufficient statistic for a parameter θ, and let φ(T) be any estimator based only on T. Then φ(T) is the UMVUE of its expected value. (Theorem 7.3.23, Page 347)

Examples include

• Poisson UMVUE: λ̂UMVUE = X̄. (Example 7.3.12, Page 338)

• Normal UMVUE with Known Mean: σ̂²UMVUE = (1/n)∑_{i=1}^n (Xi − µ)². (Example 7.3.16, Page 341)

• Uniform UMVUE: θ̂UMVUE = ((n + 1)/n) X(n). (Example 7.3.22, Page 346)

5.3.4 Equivariance and Invariance

Consider a location-scale family: X1, · · · , Xn iid with pdf or pmf f_{ab}(x) = (1/b) f((x − a)/b). If â is an estimator of a and b̂ is an estimator of b, then location and scale equivariance mean â(x + α) = â(x) + α and b̂(βx) = βb̂(x), respectively; location invariance means b̂(x + α) = b̂(x). (Page 170, Cox's)

5.3.5 Admissible Estimator

An order can be defined between two estimators δ and δ′ based on their risk functions:

• δ is said to be at least as good as δ′ if R(θ, δ) ≤ R(θ, δ′) for all θ ∈ Θ.

• δ and δ′ are said to be equivalent if R(θ, δ) = R(θ, δ′),∀θ ∈ Θ.

δ is said to be inadmissible if there is an estimator δ′ that is better than δ, i.e., R(θ, δ′) ≤ R(θ, δ) for all θ ∈ Θ with strict inequality for some θ. δ is said to be admissible if it is not inadmissible. (http://www.stat.unc.edu/faculty/cji/lecture2.pdf)

5.3.6 Asymptotic Evaluations

A sequence of estimators Wn is a consistent sequence of estimators of the parameter θ if, for every ε > 0 and every θ ∈ Θ, lim_{n→∞} Pθ(|Wn − θ| < ε) = 1 or lim_{n→∞} Pθ(|Wn − θ| ≥ ε) = 0. (Definition 10.1.1, Page 468)


According to Chebychev's Inequality, Pθ(|Wn − θ| ≥ ε) ≤ Eθ(Wn − θ)²/ε² = [VarθWn + (BiasθWn)²]/ε²; therefore, if Wn is a sequence of estimators of a parameter θ satisfying lim_{n→∞} VarθWn = 0 and lim_{n→∞} BiasθWn = 0 for every θ ∈ Θ, then Wn is a consistent sequence of estimators of θ. (Theorem 10.1.3, Page 469)

Let Wn be a consistent sequence of estimators of a parameter θ. Let a1, a2, · · · and b1, b2, · · · be sequences of constants satisfying lim_{n→∞} an = 1 and lim_{n→∞} bn = 0. Then the sequence Un = anWn + bn is a consistent sequence of estimators of θ. (Theorem 10.1.5, Page 469)

Consistency of MLEs: Let X1, · · · , Xn be iid f(x|θ), and let L(θ|x) = ∏_{i=1}^n f(xi|θ) be the likelihood function. Let θ̂MLE denote the MLE of θ, and let τ(θ) be a continuous function of θ. Under some regularity conditions on f(x|θ) and L(θ|x), for every ε > 0 and every θ ∈ Θ, lim_{n→∞} Pθ(|τ(θ̂MLE) − τ(θ)| ≥ ε) = 0. That is, τ(θ̂MLE) is a consistent estimator of τ(θ). (Theorem 10.1.6, Page 470)

A sequence of estimators Wn is asymptotically efficient for a parameter τ(θ) if √n(Wn − τ(θ)) → n(0, v(θ)) in distribution and v(θ) = (τ′(θ))² / Eθ((∂/∂θ log f(X|θ))²); that is, the asymptotic variance of Wn achieves the Cramer-Rao Lower Bound. (Definition 10.1.11, Page 471)

Asymptotic Efficiency of MLEs: Let X1, · · · , Xn be iid f(x|θ), let θ̂MLE denote the MLE of θ, and let τ(θ) be a continuous function of θ. Under some regularity conditions on f(x|θ) and L(θ|x), √n(τ(θ̂MLE) − τ(θ)) → n(0, v(θ)), where v(θ) is the Cramer-Rao Lower Bound. That is, τ(θ̂MLE) is a consistent and asymptotically efficient estimator of τ(θ). (Theorem 10.1.12, Page 472)

Note that it is very hard to beat the MLE when the sample size n is large; for finite-sample problems, however, the MLE may not be the best estimator.


6 Hypothesis Testing

6.1 Basic Concepts

6.1.1 Error Probabilities

• If θ ∈ Θ0 but the hypothesis test incorrectly decides to reject H0, then the test has made a Type I Error. (Page 382)

• If θ ∈ Θ0^c but the test decides to accept H0, a Type II Error has been made. (Page 382)

6.1.2 Power Function

The power function of a hypothesis test with rejection region R is the function of θ defined by β(θ) = Eθφ(X) = Pθ(X ∈ R), where Pθ(X ∈ R) is the probability of a Type I Error if θ ∈ Θ0, and one minus the probability of a Type II Error if θ ∈ Θ0^c. (Definition 8.3.1, Page 383)

Here, φ(X) is the conditional probability of rejecting H0 after observing X = x. The size of a test is size(φ) = sup_{θ∈Θ0} β(θ).
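
A standard worked example: let X1, · · · , Xn be iid n(θ, σ²) with σ² known, and consider the test that rejects H0 : θ ≤ θ0 in favor of H1 : θ > θ0 when √n(X̄ − θ0)/σ > z_{1−α}, the 1 − α quantile of n(0, 1). Its power function is β(θ) = Pθ(√n(X̄ − θ0)/σ > z_{1−α}) = 1 − Φ(z_{1−α} + √n(θ0 − θ)/σ), which is increasing in θ and equals α at θ = θ0, so the size of the test is sup_{θ≤θ0} β(θ) = α.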

6.1.3 p-Value

• A p-value p(X) is a test statistic satisfying 0 ≤ p(x) ≤ 1 for every sample point x. Small values of p(x) give evidence that H1 is true. A p-value is valid if, for every θ ∈ Θ0 and every 0 ≤ α ≤ 1, Pθ(p(X) ≤ α) ≤ α. (Definition 8.3.26, Page 397)

• The p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis H0 is true. (Wikipedia)

6.1.4 Critical Region

The critical region of a hypothesis test is the set of all outcomes which cause the null hypothesis to be rejected in favor of the alternative hypothesis. (Wikipedia)

6.1.5 Confidence Interval

For an interval estimator [L(X), U(X)] of a parameter θ, the coverage probability of [L(X), U(X)] is the probability that the random interval [L(X), U(X)] covers the true parameter θ; in symbols, it is denoted by Pθ(θ ∈ [L(X), U(X)]). (Definition 9.1.4, Page 418) For an interval estimator [L(X), U(X)] of a parameter θ, the confidence coefficient of [L(X), U(X)] is the infimum of the coverage probabilities, inf_θ Pθ(θ ∈ [L(X), U(X)]). (Definition 9.1.5, Page 418) Interval estimators, together with a measure of confidence, are sometimes known as confidence intervals. A confidence interval with confidence coefficient equal to some value, say 1 − α, is simply called a 1 − α confidence interval. (Page 419)

6.2 Method of Finding Tests

6.2.1 Likelihood Ratio Test (LRT)

The likelihood ratio test statistic for testing H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ0^c is φ(x) = sup_{Θ0} L(θ|x) / sup_Θ L(θ|x). A likelihood ratio test is any test that has a rejection region of the form {x : φ(x) ≤ c}, where c is any number satisfying 0 ≤ c ≤ 1. (Definition 8.2.1, Page 375)


6.2.2 Bayesian Shelter Test

Considering initially the general problem H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1 and letting π denote the prior (H0 : π = π0 versus H1 : π = π1), the likelihood ratio for the Neyman-Pearson test is f(x|θ ∈ Θ1)/f(x|θ ∈ Θ0) = ∫_{Θ1} f(x|θ) dπ1(θ) / ∫_{Θ0} f(x|θ) dπ0(θ). (Page 273, Cox's)

6.2.3 Uniformly Most Powerful (UMP) Test

Neyman-Pearson Lemma: Consider testing H0 : θ = θ0 versus H1 : θ = θ1, where the pdf or pmf corresponding to θi is f(x|θi), i = 0, 1, using a test with rejection region R that satisfies x ∈ R if f(x|θ1) > kf(x|θ0) and x ∈ R^c if f(x|θ1) < kf(x|θ0), for some k ≥ 0, and α = Pθ0(X ∈ R). Then: a. (Sufficiency) Any test that satisfies the above conditions is a uniformly most powerful (UMP) level α test; b. (Necessity) If there exists a test satisfying the above conditions with k > 0, then every UMP level α test is a size α test and every UMP level α test satisfies the first condition except perhaps on a set A satisfying Pθ0(X ∈ A) = Pθ1(X ∈ A) = 0. (Theorem 8.3.12, Page 388)

Monotone Likelihood Ratio (MLR): A family of pdfs or pmfs {f(x|θ) : θ ∈ Θ} for a univariate random variable X with real-valued parameter θ has an MLR if, for every θ2 > θ1, f(x|θ2)/f(x|θ1) is a monotone (nonincreasing or nondecreasing) function of x on {x : f(x|θ1) > 0 or f(x|θ2) > 0}. (Definition 8.3.16, Page 391)

Exponential Family Case: Suppose f(x|θ) = h(x)c(θ) exp(w(θ)t(x)) is a one-parameter exponential family with w(θ) strictly monotone increasing in θ. Then f(x|θ) has the strict MLR property in t(x). If w(θ) is strictly decreasing, replace it by −w(θ) and we get the MLR property in −t(x). (Page 266-267, Cox's)

Karlin-Rubin Theorem: Consider testing H0 : θ ≤ θ0 versus H1 : θ > θ0. Suppose that T is a sufficient statistic for θ and the family of pdfs or pmfs {g(t|θ) : θ ∈ Θ} of T has a nondecreasing monotone likelihood ratio. Then for any t0, the test that rejects H0 if and only if T > t0 is a UMP level α test, where α = Pθ0(T > t0). (Theorem 8.3.16, Page 391)


7 Linear Regression

7.1 Simple Linear Regression

7.1.1 Definition

In simple linear regression we have a relationship of the form Yi = β0 + β1xi + εi. It is common to suppose that E[εi] = 0, so that E[Yi] = β0 + β1xi, which is called the population regression function. We call Yi the response variable and xi the predictor variable. (Page 539)

Based on the data (x1, y1), · · · , (xn, yn), we define the following quantities:

• The sample means: x̄ = (1/n)∑_{i=1}^n xi and ȳ = (1/n)∑_{i=1}^n yi;

• The sums of squares: Sxx = ∑_{i=1}^n (xi − x̄)² and Syy = ∑_{i=1}^n (yi − ȳ)²;

• The sum of cross-products: Sxy = ∑_{i=1}^n (xi − x̄)(yi − ȳ) = ∑_{i=1}^n (xi − x̄)yi;

• The sum of squared errors (SSE): SSE = ∑_{i=1}^n [yi − (β0 + β1xi)]². (Page 541 and 543)

7.1.2 Least Squares (LS) Solution: Mathematical Solution

The least squares solutions of β0 and β1 are defined to be those values for which the line β0 + β1x minimizes SSE, and the solutions are β̂1 = Sxy/Sxx and β̂0 = ȳ − β̂1x̄. (Page 543)

The least squares method should be considered only as a method of "fitting a line" to a set of data, not as a method of statistical inference. (Page 544)

7.1.3 Best Linear Unbiased Estimator (BLUE): Statistical Inference

We assume that Yi = β0 + β1xi + εi, i = 1, · · · , n, where ε1, · · · , εn are uncorrelated random variables with E[εi] = 0 and Var(εi) = σ². (Page 545)

The best linear unbiased estimators are β̂1 = Sxy/Sxx and β̂0 = ȳ − β̂1x̄. (Page 547) Since they are unbiased, we have E[β̂1] = β1 and E[β̂0] = β0. The variances are Var(β̂1) = σ²/Sxx and Var(β̂0) = σ²(1/n + x̄²/Sxx). The mean square error (MSE) is σ̂² = (1/(n − 2))∑_{i=1}^n [yi − (β̂0 + β̂1xi)]².

The confidence interval for β1 with known σ² is [β̂1 − Z_{α/2}√(σ²/Sxx), β̂1 + Z_{α/2}√(σ²/Sxx)]. The confidence interval with unknown σ² is [β̂1 − t(α/2, n − 2)√(σ̂²/Sxx), β̂1 + t(α/2, n − 2)√(σ̂²/Sxx)].

7.1.4 Maximum Likelihood Estimator (MLE): Statistical Inference

For the simple linear model, we further assume that the errors are distributed εi ∼ Normal(0, σ²) and do not depend on x. Then the MLEs are β̂1 = Sxy/Sxx, β̂0 = ȳ − β̂1x̄, and σ̂² = (1/n)∑_{i=1}^n [yi − (β̂0 + β̂1xi)]². The MLEs are consistent and asymptotically efficient.

7.1.5 Hypothesis Testing

For the simple linear model, we further assume that the errors are distributed εi ∼ Normal(0, σ²) and do not depend on x.

If σ² is known, the test statistic Z0 = (β̂1 − β1)/√(σ²/Sxx) ∼ Normal(0, 1). We reject H0 once we observe Z0 ≤ Z_{α/2} or Z0 ≥ Z_{1−α/2}. If σ² is unknown, the test statistic t0 = (β̂1 − β1)/√(σ̂²/Sxx) ∼ t(n − 2). We reject H0 once we observe t0 ≤ t(α/2, n − 2) or t0 ≥ t(1 − α/2, n − 2).


7.2 Multiple Linear Regression

7.2.1 Definition

We assume that Yi = β0 + β1xi1 + · · · + βp−1xi(p−1) + εi, i = 1, · · · , n, with the assumptions E[εi] = 0, Var(εi) = σ², and Cov(εi, εj) = 0 for all i ≠ j. We can also write the model in matrix form as Yn×1 = Xn×p βp×1 + εn×1.

7.2.2 Least Squares (LS) Solution: Mathematical Solution

The sum of squared errors is SSE = (Y − Xβ)ᵀ(Y − Xβ). Minimizing SSE, we obtain the LS solution β̂ = (XᵀX)⁻¹XᵀY.
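
A sketch of the derivation: expanding SSE = YᵀY − 2βᵀXᵀY + βᵀXᵀXβ and setting its gradient with respect to β to zero gives the normal equations XᵀXβ = XᵀY; when XᵀX is invertible this yields β̂ = (XᵀX)⁻¹XᵀY, and since the Hessian 2XᵀX is positive semidefinite, this is a minimizer.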

7.2.3 Best Linear Unbiased Estimator (BLUE): Statistical Inference

Gauss-Markov Theorem: If E[Y] = Xβ and Cov(Y) = σ²I, then the LS solution β̂ has minimum variance among all linear unbiased estimators. So the BLUE is β̂ = (XᵀX)⁻¹XᵀY. We have E[β̂] = β and Cov(β̂) = σ²(XᵀX)⁻¹. The MSE is σ̂² = (1/(n − p))(Y − Xβ̂)ᵀ(Y − Xβ̂).

Substituting back into the model, we get the estimate of Y: Ŷ = Xβ̂ = X(XᵀX)⁻¹XᵀY = HY, where we call H = X(XᵀX)⁻¹Xᵀ the hat matrix. Both H and I − H are symmetric and idempotent.

7.2.4 Maximum Likelihood Estimator (MLE): Statistical Inference

For the multiple linear model, we further assume that the errors are distributed εi ∼ Normal(0, σ²) and do not depend on X. Then the MLE is β̂ = (XᵀX)⁻¹XᵀY, and the MSE is σ̂² = (1/n)(Y − Xβ̂)ᵀ(Y − Xβ̂). Note that σ̂² is a biased MLE, but it is asymptotically unbiased.

7.2.5 Model Selection with Hypothesis Testing

Consider two multiple linear regression models, denoted by ω (small model, with q parameters) and Ω (large model, with p parameters). The null hypothesis is H0: the small model ω is "better" than Ω, and the alternative hypothesis is H1: the small model ω is not "better" than Ω. The test statistic can be written as F = ((SSEω − SSEΩ)/(p − q)) / (SSEΩ/(n − p)) ∼ F(p − q, n − p); when ω is the intercept-only model, SSEω = SSTOΩ and F reduces to ((SSTOΩ − SSEΩ)/(p − 1))/(SSEΩ/(n − p)) = MSRΩ/MSEΩ. We reject H0 once we observe F > F(1 − α, p − q, n − p).

7.3 ANOVA

7.3.1 Definition

Analysis of Variance is an alternative method for testing the significance of regression, based on the examination of the variance of the data: ∑_{i=1}^n (yi − ȳ)² = ∑_{i=1}^n (yi − ŷi)² + ∑_{i=1}^n (ŷi − ȳ)².

• Sum of total squares (SSTO): ∑_{i=1}^n (yi − ȳ)²;

• Sum of squared errors (SSE): ∑_{i=1}^n (yi − ŷi)²;

• Sum of squares due to regression (SSR): ∑_{i=1}^n (ŷi − ȳ)²;

• Mean square error (MSE): SSE/(n − 2);

• Mean square regression (MSR): SSR/1.

7.3.2 Hypothesis Testing

When β1 = 0, MSR and MSE have the same expectation σ². Therefore, the test statistic F0 = MSR/MSE ∼ F(1, n − 2) under H0 : β1 = 0. If F0 > F(1 − α, 1, n − 2), we reject H0.


7.3.3 Coefficient of Determination

The coefficient of determination is a measure of how much variation in the data Y is explained by the regressor X. It is constructed as the ratio of SSR to SSTO: R² = SSR/SSTO = 1 − SSE/SSTO, 0 ≤ R² ≤ 1. R² is called the proportion of variation explained by the predictor variable X. Note that it is not surprising that R² can be increased by adding more predictors.

7.4 Important Issues

7.4.1 Multicollinearity

Multicollinearity occurs when some predictors are (exactly or nearly) linear combinations of other predictors, i.e., they are highly correlated. In the exact case XᵀX is not of full rank (it is singular), so (XᵀX)⁻¹ does not exist. This can have detrimental effects such as

• A unique estimate β̂ of β cannot be found;

• The estimates of βj are imprecise;

• The interpretation of a regression coefficient, as measuring the change in the expected value of the response variable when the given predictor is increased by one unit while the other predictors are held constant, is not applicable;

• The estimation of the standard errors can blow up.

We can detect multicollinearity through

• Examine the eigenvalues of XᵀX, ordered λ1 ≥ · · · ≥ λp; a condition number κ = √(λ1/λp) ≥ 30 indicates that XᵀX is ill-conditioned;

• Examine the variance inflation factors (VIF) 1/(1 − R²_j); a higher VIF indicates that xj is linearly dependent on the rest of the predictors.

Some of the ways to mitigate the effects of highly correlated predictor variables:

• Collect more data;

• Change the model specification;

• Eliminate “trouble” variables;

• Do ridge regression.

7.4.2 Checking the Assumption of Regression Models

• Checking Error Assumptions: We can check the constant-variance assumption by plotting residuals against fitted values, check normality with a quantile-quantile (Q-Q) plot, and check for correlation via the autocorrelation function (ACF) or a test statistic such as the Durbin-Watson statistic DW = ∑_{i=2}^n (ε̂i − ε̂i−1)² / ∑_{i=1}^n ε̂i², small values of which suggest correlated errors.

• Checking Unusual Observations: Recall the hat matrix; in simple linear regression hii = 1/n + (xi − x̄)²/Sxx. We can identify extreme X values by looking for hii > 2p/n or via the DFFITS and DFBETAS statistics.

• Checking Structural Assumptions: Partial regression (added variable or adjusted variable) plots can isolate the effect of xi on Y.


7.4.3 Testing for Lack of Fit

Transformations of either the response or the predictor variables can help improve model fit or help correct model assumptions.

• The shape of the response “curve” can provide a strong hint for a linearizing transformation.

• The Box-Cox method takes the guesswork out of finding the appropriate transformation of the response, for best fit of the data, from the family of transformations Y′ = Y^λ, where Y′ = log Y when λ = 0.


Table 1: Summary of Common Distributions. For each of the uniform, exponential, normal, gamma, beta, Bernoulli, binomial, Poisson, geometric, and negative binomial distributions, the table lists the pdf or pmf and its support, the moments E[X], Var(X), and the mgf, the exponential-family components h(x), c(θ), t(x), and w(θ), and the sufficient and complete statistics.


Table 2: Summary of Common Distributions (Continued). For the same distributions as Table 1, the table lists the pdf or pmf, the Fisher information, and the MOM, MLE, unbiased, UMVUE, and Bayes point estimators.
