
Today

1. Probability

2. Random variables


Probability

We won’t spend that much time on it, but we need to start here.


Sample Spaces and Events

If we toss a coin twice then the sample space, or set of all possible outcomes or realizations ω, is Ω = {HH, HT, TH, TT}.

An event is a subset of this set; for example, the event that the first toss is heads is A = {HH, HT}.


Probability

We’ll assign a real number P(A) to each event A, called the probability of A. To qualify as a probability, P must satisfy three axioms:

1. P(A) ≥ 0 for every A

2. P(Ω) = 1

3. If A1, A2, . . . are disjoint then

P(⋃_{i=1}^∞ Ai) = ∑_{i=1}^∞ P(Ai). (1)

Note that frequentists and Bayesians agree on these.
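A minimal sketch (not from the slides) that checks the three axioms on the two-coin-toss sample space, assuming each of the four outcomes is equally likely:

```python
# Sketch: the two-coin-toss sample space with a uniform probability measure
# (the equal weighting is an illustrative assumption, not from the slides).
from itertools import product

omega = set(product("HT", repeat=2))      # Ω = {HH, HT, TH, TT}
P = lambda A: len(A) / len(omega)         # probability of an event A ⊆ Ω

assert all(P({w}) >= 0 for w in omega)    # axiom 1: P(A) ≥ 0
assert P(omega) == 1                      # axiom 2: P(Ω) = 1

A, B = {("H", "T")}, {("T", "H")}         # disjoint events
assert P(A | B) == P(A) + P(B)            # axiom 3 (finite case)

first_heads = {w for w in omega if w[0] == "H"}
print(P(first_heads))                     # 0.5
```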


Random Variables

This is where we start talking about data.


Random Variables

A random variable is a mapping, or function

X : Ω → R (2)

that assigns a real number X(ω) to each outcome ω.

For example, if Ω = {(x, y) : x² + y² ≤ 1} and our outcomes are samples (x, y) from the unit disk, then these are some random variables: X(ω) = x, Y(ω) = y, Z(ω) = x + y.
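A small sketch of the unit-disk example: outcomes ω are points drawn from the disk (here by rejection sampling, just one convenient choice), and X, Y, Z are functions of ω.

```python
# Sketch: random variables as functions of an outcome ω from the unit disk.
import numpy as np

rng = np.random.default_rng(0)

def sample_disk():
    # Rejection sampling from the unit disk (an illustrative choice).
    while True:
        x, y = rng.uniform(-1, 1, size=2)
        if x**2 + y**2 <= 1:
            return (x, y)

X = lambda w: w[0]
Y = lambda w: w[1]
Z = lambda w: w[0] + w[1]

w = sample_disk()
print(X(w), Y(w), Z(w))   # Z(ω) = X(ω) + Y(ω)
```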


Distribution Functions

Suppose X is a random variable, x a specific value of it (data).

Cumulative distribution function (CDF): the function F : R → [0, 1] (sometimes F_X) defined by F(x) = P(X ≤ x).

X is discrete if it takes countably many values x1, x2, . . ..

Probability (mass) function for X: f(x) = P(X = x).


Distribution Functions

X is continuous if there exists a function f such that f(x) ≥ 0 for all x, ∫_{−∞}^{∞} f(x) dx = 1, and for every a ≤ b,

P(a < X < b) = ∫_a^b f(x) dx. (3)

f is the probability density function (PDF).

We have that F(x) = ∫_{−∞}^x f(t) dt and f(x) = F′(x) wherever F is differentiable.


Discrete Distributions

Some examples of discrete distributions:

X is the outcome of a coin flip. P(X = 1) = p and P(X = 0) = 1 − p for some p ∈ [0, 1]. We say X ∼ Bernoulli(p). f(x) = p^x (1 − p)^(1−x) for x ∈ {0, 1}.

Binomial: the distribution of the number of outcomes of a given type (say, heads) in n independent coin flips.
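A quick sketch of these two mass functions; the Binomial formula with n flips is the standard one and is not spelled out on the slide:

```python
# Sketch: Bernoulli pmf from the slide, plus the standard Binomial(n, p) pmf.
from math import comb

def bernoulli_pmf(x, p):
    return p**x * (1 - p)**(1 - x)                 # x in {0, 1}

def binomial_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)    # x in {0, ..., n}

p = 0.3
print(bernoulli_pmf(1, p), bernoulli_pmf(0, p))          # 0.3 0.7
print(sum(binomial_pmf(x, 10, p) for x in range(11)))    # ≈ 1.0
```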


Continuous Distributions

Some examples of continuous distributions:

Uniform: X ∼ Uniform(a, b) if f(x) = 1/(b − a) for x ∈ [a, b], 0 otherwise.

Gaussian: X ∼ N(µ, σ²) if f(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)) for µ ∈ R, σ > 0. We call its PDF φ(x) and its CDF Φ(x).
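A sketch that writes out both densities and numerically checks that each integrates to 1 (the grid and the chosen parameters are arbitrary):

```python
# Sketch: Uniform(a, b) and N(µ, σ²) densities, checked to integrate to ~1.
import numpy as np

def uniform_pdf(x, a, b):
    return np.where((x >= a) & (x <= b), 1.0 / (b - a), 0.0)

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

xs = np.linspace(-10.0, 10.0, 200_001)
print(np.trapz(uniform_pdf(xs, 0.0, 2.0), xs))    # ≈ 1
print(np.trapz(gaussian_pdf(xs, 0.0, 1.0), xs))   # ≈ 1
```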


Multivariate Distributions

Can define a distribution over a vector of random variables. We say this is a multivariate distribution.

Our dataset generally consists of samples from a multivariate distribution. Each of the columns is a random variable. We can also consider the whole vector of random variables as a random variable.


Expectation

The expected value, or mean, or first moment of X is

E(X) = EX = µ = ∫ x f(x) dx. (10)

Note that in the discrete case this means ∑_x x f(x).

E(∑_i ai Xi) = ∑_i ai E(Xi) (11)

for constants a1, a2, . . . , an. If the Xi are independent,

E(∏_i ai Xi) = ∏_i ai E(Xi). (12)
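A Monte Carlo sketch of (11) and (12), using two independent Gaussians and constants a1 = 2, a2 = −3 (all arbitrary choices for illustration):

```python
# Sketch: linearity of expectation, and the product rule under independence.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
X1 = rng.normal(1.0, 1.0, n)       # E(X1) = 1
X2 = rng.normal(4.0, 2.0, n)       # E(X2) = 4, independent of X1
a1, a2 = 2.0, -3.0

print(np.mean(a1 * X1 + a2 * X2), a1 * 1.0 + a2 * 4.0)            # both ≈ -10
print(np.mean((a1 * X1) * (a2 * X2)), (a1 * 1.0) * (a2 * 4.0))    # both ≈ -24
```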


Variance

The kth moment of X is defined to be E(X^k), assuming that E(X^k) < ∞.

If X has mean µ, the variance of X is

σ² = V(X) = VX = E(X − µ)² = ∫ (x − µ)² f(x) dx (13)

and σ = sd(X) = √V(X).


Sample Statistics

If X1, . . . , XN are random variables then the sample mean is

X̄ = (1/N) ∑_{i=1}^N Xi (14)

and the sample variance is

S² = (1/(N − 1)) ∑_{i=1}^N (Xi − X̄)². (15)

If X1, . . . , XN are IID, then

E(X̄) = E(Xi) = µ, V(X̄) = σ²/N, E(S²) = σ². (16)
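A simulation sketch of (16): drawing many datasets of size N from a Gaussian (an arbitrary choice) and checking that the sample mean and sample variance behave as claimed:

```python
# Sketch: E(X̄) = µ, V(X̄) = σ²/N, and E(S²) = σ², checked by simulation.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N, trials = 5.0, 2.0, 20, 200_000

X = rng.normal(mu, sigma, size=(trials, N))
xbar = X.mean(axis=1)              # sample mean of each dataset
s2 = X.var(axis=1, ddof=1)         # ddof=1 gives the 1/(N-1) divisor

print(xbar.mean(), mu)             # ≈ µ
print(xbar.var(), sigma**2 / N)    # ≈ σ²/N = 0.2
print(s2.mean(), sigma**2)         # ≈ σ² = 4
```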


Today

1. Some inequalities

2. Asymptotic theory

3. Point estimation


Some inequalities

A few very useful inequalities that will come up in many contexts – in particular, they lie at the heart of learning theory.


Standard Normal Distribution

We say that a random variable has a standard Normal distribution if µ = 0 and σ = 1, and we denote it by Z.

If X ∼ N(µ, σ2) then Z = (X − µ)/σ ∼ N(0, 1).

If Z ∼ N(0, 1) then X = µ + σZ ∼ N(µ, σ2).


Markov’s Inequality

Theorem (Markov’s inequality): Suppose X is a non-negative random variable and E(X) exists. Then for any t > 0,

P(X > t) ≤ E(X)/t. (1)
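A simulation sketch of the bound for a non-negative random variable; the Exponential(1) choice is arbitrary:

```python
# Sketch: Markov's inequality P(X > t) ≤ E(X)/t, checked on Exponential(1).
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=1_000_000)   # E(X) = 1, X ≥ 0

for t in [1.0, 2.0, 5.0]:
    print(t, (X > t).mean(), X.mean() / t)       # empirical P(X > t) vs bound
```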


Markov’s Inequality: Proof

Since X ≥ 0,

E(X) = ∫_0^∞ x f(x) dx (2)
     = ∫_0^t x f(x) dx + ∫_t^∞ x f(x) dx (3)
     ≥ ∫_t^∞ x f(x) dx (4)
     ≥ t ∫_t^∞ f(x) dx (5)
     = t P(X > t). (6)


Chebyshev’s Inequality

Theorem (Chebyshev’s inequality): If µ = E(X) and σ² = V(X), then

P(|X − µ| ≥ t) ≤ σ²/t² (7)

and

P(|X − µ|/σ ≥ u) ≤ 1/u² (8)

(or P(|Z| ≥ u) ≤ 1/u² if Z = (X − µ)/σ).

For example, P(|Z| > 2) ≤ 1/4 and P(|Z| > 3) ≤ 1/9.


Chebyshev’s Inequality: Proof

Using Markov’s inequality,

P(|X − µ| ≥ t) = P(|X − µ|² ≥ t²) (9)
               ≤ E(X − µ)²/t² (10)
               = σ²/t². (11)

The second part follows by setting t = uσ.


Chebyshev’s Inequality: Example

Suppose we test a classifier on a set of N new examples. Let Xi = 1 if the prediction is wrong and Xi = 0 if it is right; then X̄_N = (1/N) ∑_{i=1}^N Xi is the observed error rate. Each Xi may be regarded as a Bernoulli with unknown mean p; we would like to estimate this.

How likely is X̄_N to not be within ε of p?


Chebyshev’s Inequality: Example

We have that V(X̄_N) = V(X)/N = p(1 − p)/N and

P(|X̄_N − p| > ε) ≤ V(X̄_N)/ε² (12)
                 = p(1 − p)/(Nε²) (13)
                 ≤ 1/(4Nε²) (14)

since p(1 − p) ≤ 1/4 for all p.

For ε = .2 and N = 100 the bound is .0625.
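A simulation sketch of this example, with a true error rate p = 0.1 picked only for illustration:

```python
# Sketch: Chebyshev bound vs simulated P(|X̄_N - p| > ε) for the classifier example.
import numpy as np

rng = np.random.default_rng(0)
p, N, eps, trials = 0.1, 100, 0.2, 100_000

errors = rng.random(size=(trials, N)) < p     # Xi = 1 with probability p
xbar = errors.mean(axis=1)                    # observed error rate per trial

print((np.abs(xbar - p) > eps).mean())        # simulated probability (very small)
print(1 / (4 * N * eps**2))                   # Chebyshev bound: 0.0625
```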


Asymptotic theory

What happens as you get more data.


Convergence

Suppose X1, X2, . . . is a sequence of random variables and X is another random variable. FN is the CDF of XN and F is the CDF of X.

XN converges in probability to X, XN →p X, if for every ε > 0,

P(|XN − X| > ε) → 0 (19)

as N → ∞.


Convergence

XN converges in distribution to X, XN ⇝ X, if

lim_{N→∞} FN(t) = F(t) (20)

at all t for which F is continuous.


Convergence

XN converges in quadratic mean (or L2) to X, XN →qm X, if

E(XN − X)² → 0 (21)

as N → ∞.


Convergence

These are ordered in strength:

XN →qm X ⇒ XN →p X ⇒ XN ⇝ X. (22)

Special case: if P(X = c) = 1 for some c ∈ R, then XN ⇝ X ⇒ XN →p X. But in general none of the reverse implications hold.


(Weak) Law of Large Numbers

Theorem (WLLN): If X1, . . . , XN are IID and E(Xi) = µ, then X̄_N →p µ.

This says that the sample mean X̄_N approaches the true mean µ as N gets large.


WLLN: Proof

To make the proof simpler (though it’s not strictly necessary), assume the variance is finite (σ² < ∞). Then using Chebyshev’s inequality,

P(|X̄_N − µ| > ε) ≤ V(X̄_N)/ε² (23)
                 = σ²/(Nε²) (24)

which approaches 0 as N → ∞.
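A small sketch of the WLLN in action, using Uniform(0, 1) draws (true mean 0.5) just as an example:

```python
# Sketch: the sample mean of IID Uniform(0,1) draws approaches µ = 0.5 as N grows.
import numpy as np

rng = np.random.default_rng(0)
for N in [10, 100, 10_000, 1_000_000]:
    xbar = rng.uniform(0.0, 1.0, N).mean()
    print(N, xbar, abs(xbar - 0.5))   # the gap shrinks with N
```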


Central Limit Theorem

Theorem (CLT): If X1, . . . , XN are IID (with any distribution), with mean µ and variance σ², then

Z_N = (X̄_N − µ)/√V(X̄_N) = √N (X̄_N − µ)/σ ⇝ Z (25)

where Z ∼ N(0, 1). In other words,

lim_{N→∞} P(Z_N ≤ z) = Φ(z) = ∫_{−∞}^z (1/√(2π)) e^{−x²/2} dx. (26)


Central Limit Theorem

This says that probability statements about X̄_N can be approximated using a Normal distribution. This is written as

Z_N = √N (X̄_N − µ)/σ ≈ N(0, 1) (27)

or

X̄_N ≈ N(µ, σ²/N). (28)
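A simulation sketch of (26): Z_N built from skewed Exponential(1) samples (so µ = σ = 1, an arbitrary non-Normal choice) already has tail probabilities close to Φ(z) at N = 50:

```python
# Sketch: CLT — empirical P(Z_N ≤ z) vs the standard Normal CDF Φ(z).
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
mu = sigma = 1.0                   # mean and sd of Exponential(1)
N, trials = 50, 200_000

X = rng.exponential(scale=1.0, size=(trials, N))
ZN = np.sqrt(N) * (X.mean(axis=1) - mu) / sigma

Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))   # standard Normal CDF
for z in [-1.0, 0.0, 1.0, 2.0]:
    print(z, (ZN <= z).mean(), Phi(z))         # empirical vs limiting value
```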


Acknowledgement: The slides are from Alexander Gray @ Georgia Tech.