
### Transcript of “1. Probability 2. Random variables” (jhuan/EECS730_F12/slides/3_statII.pdf)

• Today

1. Probability

2. Random variables

CS 8803-MDM Lecture 2 – p. 3/20

• Probability

We won’t spend that much time on it, but we need to start here.


• Sample Spaces and Events

If we toss a coin twice, then the sample space, or set of all possible outcomes or realizations ω, is Ω = {HH, HT, TH, TT}.

An event is a subset of this set; for example, the event that the first toss is heads is A = {HH, HT}.
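As a quick sketch in Python (an illustration, not part of the original slides), the two-toss sample space and this event are just small sets:

```python
# Hypothetical illustration (not in the slides): the two-toss sample space
# and the event "first toss is heads", as Python sets.
from itertools import product

omega = {a + b for a, b in product("HT", repeat=2)}  # Ω = {HH, HT, TH, TT}
A = {w for w in omega if w[0] == "H"}                # event: first toss is heads

# With four equally likely outcomes, P(A) = |A| / |Ω| = 1/2.
p_A = len(A) / len(omega)
```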


• Probability

We’ll assign a real number P(A) to each event A, called the probability of A. To qualify as a probability, P must satisfy three axioms:

1. P(A) ≥ 0 for every A

2. P(Ω) = 1

3. If A_1, A_2, . . . are disjoint, then

P(⋃_{i=1}^∞ A_i) = ∑_{i=1}^∞ P(A_i). (1)

Note that frequentists and Bayesians agree on these.


• Random Variables

This is where we start talking about data.


• Random Variables

A random variable is a mapping, or function

X : Ω → R (2)

that assigns a real number X(ω) to each outcome ω.

For example, if Ω = {(x, y) : x² + y² ≤ 1} and our outcomes are samples (x, y) from the unit disk, then these are some random variables: X(ω) = x, Y(ω) = y, Z(ω) = x + y.
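The unit-disk example can be sketched in Python (an illustration, not from the slides; `sample_disk` is a made-up helper):

```python
# Hypothetical illustration (not in the slides): draw one outcome ω = (x, y)
# from the unit disk by rejection sampling, then evaluate the random
# variables X(ω) = x, Y(ω) = y, Z(ω) = x + y on it.
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def sample_disk():
    """Sample uniformly from the unit disk by rejection from [-1, 1]^2."""
    while True:
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x * x + y * y <= 1:
            return (x, y)

omega = sample_disk()
X, Y, Z = omega[0], omega[1], omega[0] + omega[1]
```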


• Distribution Functions

Suppose X is a random variable and x is a specific value of it (data).

Cumulative distribution function (CDF): the function F : R → [0, 1] (sometimes written F_X) defined by F(x) = P(X ≤ x).

X is discrete if it takes countably many values {x1, x2, . . .}.

Probability (mass) function for X: f(x) = P(X = x).
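For a concrete discrete case, here is a short Python sketch (an illustration, not from the slides) of the pmf and CDF of a fair die:

```python
# Hypothetical illustration (not in the slides): the probability mass
# function and CDF of a fair six-sided die, a discrete random variable.
from fractions import Fraction

f = {x: Fraction(1, 6) for x in range(1, 7)}  # f(x) = P(X = x)

def F(x):
    """CDF: F(x) = P(X <= x), the running sum of the pmf."""
    return sum(p for v, p in f.items() if v <= x)
```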


• Distribution Functions

X is continuous if there exists a function f such that f(x) ≥ 0 for all x, ∫_{−∞}^{∞} f(x) dx = 1, and for every a ≤ b,

P(a < X < b) = ∫_a^b f(x) dx. (3)

f is the probability density function (PDF).

We have that F(x) = ∫_{−∞}^x f(t) dt, and f(x) = F′(x) wherever F is differentiable.


• Discrete Distributions

Some examples of discrete distributions:

Bernoulli: X is the outcome of a coin flip, with P(X = 1) = p and P(X = 0) = 1 − p for some p ∈ [0, 1]. We say X ∼ Bernoulli(p), with f(x) = p^x (1 − p)^(1−x) for x ∈ {0, 1}.

Binomial: the distribution of the number of heads in n independent flips of such a coin; f(x) = (n choose x) p^x (1 − p)^(n−x) for x ∈ {0, 1, . . . , n}.
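The binomial probability function is easy to write down directly; a Python sketch (an illustration, not from the slides; `binomial_pmf` is a made-up name):

```python
# Hypothetical illustration (not in the slides): the Binomial(n, p)
# probability function f(x) = C(n, x) p^x (1 - p)^(n - x).
from math import comb

def binomial_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Like any pmf, it sums to 1 over its support x = 0, ..., n.
total = sum(binomial_pmf(x, 10, 0.3) for x in range(11))
```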


• Continuous Distributions

Some examples of continuous distributions:

Uniform: X ∼ Uniform(a, b) if f(x) = 1/(b − a) for x ∈ [a, b], and 0 otherwise.

Gaussian: X ∼ N(µ, σ²) if

f(x) = (1/(σ√(2π))) exp{−(x − µ)²/(2σ²)}

for µ ∈ R, σ > 0. We call its PDF φ(x) and its CDF Φ(x).
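A Python sketch of this density (an illustration, not from the slides), with a crude numerical check that it integrates to 1:

```python
# Hypothetical illustration (not in the slides): evaluate the N(µ, σ²)
# density and check numerically that it integrates to 1 (a midpoint
# Riemann sum over [-8, 8] captures essentially all the mass).
from math import exp, pi, sqrt

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

dx = 0.001
approx = sum(gaussian_pdf(-8 + (k + 0.5) * dx) * dx for k in range(16000))
```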


• Multivariate Distributions

We can define a distribution over a vector of random variables; we say this is a multivariate distribution.

Our dataset generally consists of samples from a multivariate distribution. Each of the columns is a random variable. We can also consider the whole vector of random variables as a random variable.


• Expectation

The expected value, or mean, or first moment of X is

E(X) = EX = µ = ∫ x f(x) dx. (10)

Note that in the discrete case this means ∑_x x f(x).

E(∑_i a_i X_i) = ∑_i a_i E(X_i) (11)

for constants a_1, a_2, . . . , a_n. If the X_i are independent,

E(∏_i a_i X_i) = ∏_i a_i E(X_i). (12)
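Linearity (11) can be checked by simulation; a Python sketch (an illustration, not from the slides), using two independent fair dice:

```python
# Hypothetical illustration (not in the slides): a Monte Carlo check of
# linearity of expectation, E(a1·X1 + a2·X2) = a1·E(X1) + a2·E(X2),
# using two independent fair dice (each with mean 3.5).
import random

random.seed(1)
N = 200_000
a1, a2 = 2.0, -3.0

lhs = sum(a1 * random.randint(1, 6) + a2 * random.randint(1, 6) for _ in range(N)) / N
rhs = a1 * 3.5 + a2 * 3.5  # exact value from linearity
```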


• Variance

The kth moment of X is defined to be E(X^k), assuming that E(X^k) < ∞.

If X has mean µ, the variance of X is

σ² = V(X) = VX = E(X − µ)² = ∫ (x − µ)² f(x) dx (13)

and σ = sd(X) = √V(X).


• Sample Statistics

If X_1, . . . , X_N are random variables, then the sample mean is

X̄ = (1/N) ∑_{i=1}^N X_i (14)

and the sample variance is

S² = (1/(N − 1)) ∑_{i=1}^N (X_i − X̄)². (15)

If X_1, . . . , X_N are IID, then

E(X̄) = E(X_i) = µ, V(X̄) = σ²/N, E(S²) = σ². (16)
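Formulas (14) and (15) in Python (an illustration, not from the slides; the small dataset is made up):

```python
# Hypothetical illustration (not in the slides): the sample mean (14) and
# sample variance (15); Python's statistics module uses the same 1/(N - 1)
# convention for the sample variance.
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
N = len(data)

xbar = sum(data) / N
s2 = sum((x - xbar) ** 2 for x in data) / (N - 1)
```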


• Today

1. Some inequalities

2. Asymptotic theory

3. Point estimation

CS 8803-MDM Lecture 3 – p. 2/32

• Some inequalities

A few very useful inequalities that will come up in many contexts – in particular, they lie at the heart of learning theory.


• Standard Normal Distribution

We say that a random variable has a standard Normaldistribution if µ = 0 and σ = 1, and we denote it by Z.

If X ∼ N(µ, σ2) then Z = (X − µ)/σ ∼ N(0, 1).

If Z ∼ N(0, 1) then X = µ + σZ ∼ N(µ, σ2).
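Both directions of this transformation are one-liners; a Python sketch (an illustration, not from the slides):

```python
# Hypothetical illustration (not in the slides): standardizing a draw from
# N(µ, σ²) via Z = (X - µ)/σ, and mapping back with X = µ + σZ.
import random

random.seed(2)
mu, sigma = 10.0, 2.0

x = random.gauss(mu, sigma)  # one draw from N(µ, σ²)
z = (x - mu) / sigma         # the corresponding draw from N(0, 1)
x_back = mu + sigma * z      # recovers the original draw
```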


• Markov’s Inequality

Theorem (Markov’s inequality): Suppose X is a non-negative random variable and E(X) exists. Then for any t > 0,

P(X > t) ≤ E(X)/t. (1)
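The bound can be checked empirically; a Python sketch (an illustration, not from the slides), using an Exponential(1) sample as the non-negative X:

```python
# Hypothetical illustration (not in the slides): Markov's bound
# P(X > t) <= E(X)/t checked on a non-negative sample (Exponential(1)).
import random

random.seed(3)
sample = [random.expovariate(1.0) for _ in range(100_000)]
mean = sum(sample) / len(sample)

t = 3.0
empirical = sum(x > t for x in sample) / len(sample)  # relative frequency of X > t
bound = mean / t                                      # Markov's bound E(X)/t
```

Note how loose the bound is here: the empirical tail probability is far below E(X)/t.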


• Markov’s Inequality: Proof

Since X ≥ 0,

E(X) = ∫_0^∞ x f(x) dx (2)

= ∫_0^t x f(x) dx + ∫_t^∞ x f(x) dx (3)

≥ ∫_t^∞ x f(x) dx (4)

≥ t ∫_t^∞ f(x) dx (5)

= t P(X > t). (6)


• Chebyshev’s Inequality

Theorem (Chebyshev’s inequality): If µ = E(X) and σ² = V(X), then

P(|X − µ| ≥ t) ≤ σ²/t² (7)

and

P(|X − µ|/σ ≥ u) ≤ 1/u² (8)

(or P(|Z| ≥ u) ≤ 1/u² if Z = (X − µ)/σ).

For example, P(|Z| > 2) ≤ 1/4 and P(|Z| > 3) ≤ 1/9.


• Chebyshev’s Inequality: Proof

Using Markov’s inequality,

P(|X − µ| ≥ t) = P(|X − µ|² ≥ t²) (9)

≤ E(X − µ)²/t² (10)

= σ²/t². (11)

The second part follows by setting t = uσ.


• Chebyshev’s Inequality: Example

Suppose we test a classifier on a set of N new examples. Let X_i = 1 if the prediction is wrong and X_i = 0 if it is right; then X̄_N = (1/N) ∑_{i=1}^N X_i is the observed error rate. Each X_i may be regarded as a Bernoulli with unknown mean p; we would like to estimate this.

How likely is X̄_N to not be within ε of p?


• Chebyshev’s Inequality: Example

We have that V(X̄_N) = V(X_i)/N = p(1 − p)/N and

P(|X̄_N − p| > ε) ≤ V(X̄_N)/ε² (12)

= p(1 − p)/(Nε²) (13)

≤ 1/(4Nε²) (14)

since p(1 − p) ≤ 1/4 for all p.

For ε = .2 and N = 100 the bound is .0625.
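The final bound (14) is a one-line computation; a Python sketch (an illustration, not from the slides; `chebyshev_error_bound` is a made-up name):

```python
# Hypothetical illustration (not in the slides): Chebyshev's bound
# 1/(4Nε²) on the chance the observed error rate misses p by more than ε.
def chebyshev_error_bound(N, eps):
    """Bound on P(|error rate - p| > eps), using p(1 - p) <= 1/4."""
    return 1.0 / (4 * N * eps**2)

bound = chebyshev_error_bound(N=100, eps=0.2)  # the slide's example: 0.0625
```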


• Asymptotic theory

What happens as you get more data.


• Convergence

Suppose X_1, X_2, . . . is a sequence of random variables and X is another random variable. Let F_N be the CDF of X_N and F the CDF of X.

X_N converges in probability to X, written X_N →p X, if for every ε > 0,

P(|X_N − X| > ε) → 0 (19)

as N → ∞.


• Convergence

X_N converges in distribution to X, written X_N ⇝ X, if

lim_{N→∞} F_N(t) = F(t) (20)

at all t for which F is continuous.


• Convergence

X_N converges in quadratic mean (or L²) to X, written X_N →qm X, if

E(X_N − X)² → 0 (21)

as N → ∞.


• Convergence

These are ordered in strength:

X_N →qm X ⇒ X_N →p X ⇒ X_N ⇝ X. (22)

Special case: if P(X = c) = 1 for some c ∈ R, then X_N ⇝ X ⇒ X_N →p X. But in general none of the reverse implications hold.


• (Weak) Law of Large Numbers

Theorem (WLLN): If X_1, . . . , X_N are IID and E(X_i) = µ, then X̄_N →p µ.

This says that the sample mean X̄_N approaches the true mean µ as N gets large.
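A Python sketch of this behavior (an illustration, not from the slides; a simulation, not a proof):

```python
# Hypothetical illustration (not in the slides): sample means of IID die
# rolls settle near the true mean µ = 3.5 as N grows (WLLN in action).
import random

random.seed(4)

def sample_mean(N):
    return sum(random.randint(1, 6) for _ in range(N)) / N

small, large = sample_mean(100), sample_mean(100_000)
```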


• WLLN: Proof

To make the proof simpler (though it’s not strictly necessary), assume the variance is finite (σ² < ∞). Then, using Chebyshev’s inequality,

P(|X̄_N − µ| > ε) ≤ V(X̄_N)/ε² (23)

= σ²/(Nε²), (24)

which approaches 0 as N → ∞.


• Central Limit Theorem

Theorem (CLT): If X_1, . . . , X_N are IID (with any distribution) with mean µ and variance σ², then

Z_N = (X̄_N − µ)/√V(X̄_N) = √N (X̄_N − µ)/σ ⇝ Z (25)

where Z ∼ N(0, 1). In other words,

lim_{N→∞} P(Z_N ≤ z) = Φ(z) = ∫_{−∞}^z (1/√(2π)) e^(−x²/2) dx. (26)


• Central Limit Theorem

This says that probability statements about X̄_N can be approximated using a Normal distribution. This is written as

Z_N = √N (X̄_N − µ)/σ ≈ N(0, 1) (27)

or

X̄_N ≈ N(µ, σ²/N). (28)
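The approximation (27) can be seen by simulation; a Python sketch (an illustration, not from the slides; the choice of Uniform(0, 1), N = 50, and the 1.96 cutoff are all assumptions of the sketch):

```python
# Hypothetical illustration (not in the slides): standardized means of IID
# Uniform(0, 1) draws behave like N(0, 1); about 95% of them land within
# ±1.96, as the Normal approximation predicts.
import random
from math import sqrt

random.seed(5)
mu, sigma = 0.5, sqrt(1 / 12)  # mean and sd of Uniform(0, 1)
N, trials = 50, 20_000

def z_of_mean():
    xbar = sum(random.random() for _ in range(N)) / N
    return sqrt(N) * (xbar - mu) / sigma

within = sum(abs(z_of_mean()) <= 1.96 for _ in range(trials)) / trials
```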


• Acknowledgement: These slides are from Alexander Gray @ Georgia Tech.
