  • Today

    1. Probability

    2. Random variables

  • Probability

    We won’t spend that much time on it, but we need to start here.

  • Sample Spaces and Events

    If we toss a coin twice then the sample space, or set of all possible outcomes or realizations ω, is Ω = {HH, HT, TH, TT}.

    An event is a subset of this set; for example, the event that the first toss is heads is A = {HH, HT}.

  • Probability

    We’ll assign a real number P(A) to each event A, called the probability of A. To qualify as a probability, P must satisfy three axioms:

    1. P(A) ≥ 0 for every A

    2. P(Ω) = 1

    3. If A1, A2, . . . are disjoint then

       P(⋃_{i=1}^{∞} A_i) = ∑_{i=1}^{∞} P(A_i). (1)

    Note that frequentists and Bayesians agree on these.
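    As a concrete check, here is a minimal Python sketch (our own illustration, not from the slides) verifying the three axioms for the uniform measure on the two-coin sample space:

        # Sample space for two coin tosses; uniform measure (each outcome has probability 1/4).
        omega = {"HH", "HT", "TH", "TT"}

        def P(event):
            return len(event & omega) / len(omega)

        A = {"HH", "HT"}   # event: first toss is heads
        B = {"TT"}         # event: both tails; disjoint from A

        assert P(A) >= 0                  # axiom 1: non-negativity
        assert P(omega) == 1.0            # axiom 2: P(Omega) = 1
        assert P(A | B) == P(A) + P(B)    # axiom 3: additivity for disjoint events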

  • Random Variables

    This is where we start talking about data.

  • Random Variables

    A random variable is a mapping, or function

    X : Ω → R (2)

    that assigns a real number X(ω) to each outcome ω.

    For example, if Ω = {(x, y) : x² + y² ≤ 1} and our outcomes are samples (x, y) from the unit disk, then these are some random variables: X(ω) = x, Y(ω) = y, Z(ω) = x + y.
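    As an illustrative sketch (ours, assuming NumPy), we can draw outcomes ω = (x, y) approximately uniformly from the unit disk by rejection sampling and evaluate these random variables on them:

        import numpy as np

        rng = np.random.default_rng(0)

        # Rejection sampling: draw from the enclosing square, keep points inside the disk.
        pts = rng.uniform(-1, 1, size=(100_000, 2))
        disk = pts[(pts ** 2).sum(axis=1) <= 1]

        # Random variables evaluated at each outcome omega = (x, y).
        X = disk[:, 0]     # X(omega) = x
        Y = disk[:, 1]     # Y(omega) = y
        Z = X + Y          # Z(omega) = x + y

        print(X.mean(), Y.mean(), Z.mean())   # all near 0 by symmetry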

  • Distribution Functions

    Suppose X is a random variable, x a specific value of it (data).

    Cumulative distribution function (CDF): the function F : R → [0, 1] (sometimes written FX) defined by F(x) = P(X ≤ x).

    X is discrete if it takes countably many values {x1, x2, . . .}.

    Probability (mass) function for X: f(x) = P(X = x).
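    For instance, here is a small sketch (ours, not from the slides) of the PMF and CDF of X = the number of heads in two fair tosses:

        # PMF f(x) = P(X = x) for the number of heads in two fair tosses.
        pmf = {0: 0.25, 1: 0.5, 2: 0.25}

        def F(x):
            """CDF F(x) = P(X <= x): a right-continuous step function."""
            return sum(p for v, p in pmf.items() if v <= x)

        print(F(-1), F(0.5), F(1), F(5))   # 0, 0.25, 0.75, 1.0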

  • Distribution Functions

    X is continuous if there exists a function f such that f(x) ≥ 0 for all x, ∫_{−∞}^{∞} f(x) dx = 1, and for every a ≤ b,

       P(a < X < b) = ∫_a^b f(x) dx. (3)

    f is the probability density function (PDF).

    We have that F(x) = ∫_{−∞}^{x} f(t) dt and f(x) = F′(x) wherever F is differentiable.
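    The relation f(x) = F′(x) can be checked numerically, e.g. for the standard Normal (a sketch assuming SciPy is available):

        import numpy as np
        from scipy.stats import norm

        x = np.linspace(-3, 3, 601)
        h = 1e-5

        # Central finite difference of the CDF should match the PDF.
        f_numeric = (norm.cdf(x + h) - norm.cdf(x - h)) / (2 * h)
        print(np.max(np.abs(f_numeric - norm.pdf(x))))   # tiny, ~1e-10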

  • Discrete Distributions

    Some examples of discrete distributions:

    X is the outcome of a coin flip: P(X = 1) = p and P(X = 0) = 1 − p for some p ∈ [0, 1]. We say X ∼ Bernoulli(p). f(x) = p^x (1 − p)^{1−x} for x ∈ {0, 1}.

    Binomial: the distribution of the number of heads in n independent flips of such a coin.
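    A brief simulation sketch (ours, assuming NumPy and SciPy) showing that a sum of n independent Bernoulli(p) variables follows the Binomial(n, p) distribution:

        import numpy as np
        from scipy.stats import binom

        rng = np.random.default_rng(0)
        n, p, reps = 10, 0.3, 200_000

        flips = rng.random((reps, n)) < p    # reps x n Bernoulli(p) draws
        heads = flips.sum(axis=1)            # each row sum is a Binomial(n, p) sample

        for x in range(n + 1):
            print(x, (heads == x).mean(), binom.pmf(x, n, p))   # empirical vs exact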

  • Continuous Distributions

    Some examples of continuous distributions:

    Uniform: X ∼ Uniform(a, b) if f(x) = 1/(b − a) for x ∈ [a, b], 0 otherwise.

    Gaussian: X ∼ N(µ, σ²) if f(x) = (1/(σ√(2π))) exp{−(x − µ)²/(2σ²)} for µ ∈ R, σ > 0. We call its PDF φ(x) and its CDF Φ(x).

  • Multivariate Distributions

    Can define a distribution over a vector of random variables. We say this is a multivariate distribution.

    Our dataset generally consists of samples from a multivariate distribution. Each of the columns is a random variable. We can also consider the whole vector of random variables as a random variable.

  • Expectation

    The expected value, or mean, or first moment of X is

       E(X) = EX = µ = ∫ x f(x) dx. (10)

    Note that in the discrete case this means ∑_x x f(x).

       E(∑_i ai Xi) = ∑_i ai E(Xi) (11)

    for constants a1, a2, . . . , an. If the Xi are independent,

       E(∏_i ai Xi) = ∏_i ai E(Xi). (12)
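    Note that (11) requires no independence, which a quick simulation sketch (ours, assuming NumPy) can illustrate with deliberately dependent variables:

        import numpy as np

        rng = np.random.default_rng(0)
        X1 = rng.normal(1.0, 1.0, 500_000)
        X2 = 2 * X1 + rng.normal(0.0, 0.5, 500_000)   # strongly dependent on X1

        a1, a2 = 3.0, -2.0
        lhs = np.mean(a1 * X1 + a2 * X2)
        rhs = a1 * X1.mean() + a2 * X2.mean()
        print(lhs, rhs)   # agree up to Monte Carlo error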

  • Variance

    The kth moment of X is defined to be E(X^k), assuming that E(X^k) < ∞.

    If X has mean µ, the variance of X is

       σ² = V(X) = VX = E(X − µ)² = ∫ (x − µ)² f(x) dx (13)

    and σ = sd(X) = √V(X).

  • Sample Statistics

    If X1, . . . , XN are random variables then the sample mean is

       X̄ = (1/N) ∑_{i=1}^{N} Xi (14)

    and the sample variance is

       S² = (1/(N − 1)) ∑_{i=1}^{N} (Xi − X̄)². (15)

    If X1, . . . , XN are IID, then

       E(X̄) = E(Xi) = µ, V(X̄) = σ²/N, E(S²) = σ². (16)
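    In NumPy the 1/(N − 1) factor in (15) corresponds to ddof=1; here is a short sketch (ours) checking the claims in (16) by simulation:

        import numpy as np

        rng = np.random.default_rng(0)
        mu, sigma, N, reps = 5.0, 2.0, 30, 100_000

        samples = rng.normal(mu, sigma, (reps, N))
        xbar = samples.mean(axis=1)
        s2 = samples.var(axis=1, ddof=1)    # divide by N - 1, not N

        print(xbar.mean(), mu)              # E(sample mean) = mu
        print(xbar.var(), sigma**2 / N)     # V(sample mean) = sigma^2 / N
        print(s2.mean(), sigma**2)          # E(S^2) = sigma^2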

  • Today

    1. Some inequalities

    2. Asymptotic theory

    3. Point estimation

  • Some inequalities

    A few very useful inequalities that will come up in many contexts; in particular, they lie at the heart of learning theory.

  • Standard Normal Distribution

    We say that a random variable has a standard Normal distribution if µ = 0 and σ = 1, and we denote it by Z.

    If X ∼ N(µ, σ2) then Z = (X − µ)/σ ∼ N(0, 1).

    If Z ∼ N(0, 1) then X = µ + σZ ∼ N(µ, σ2).
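    Both transformations are easy to verify by simulation (a sketch, ours, assuming NumPy):

        import numpy as np

        rng = np.random.default_rng(0)
        mu, sigma = 3.0, 2.0

        X = rng.normal(mu, sigma, 1_000_000)
        Z = (X - mu) / sigma                  # standardized: ~ N(0, 1)
        print(Z.mean(), Z.std())              # approximately 0 and 1

        Z2 = rng.standard_normal(1_000_000)
        X2 = mu + sigma * Z2                  # ~ N(mu, sigma^2)
        print(X2.mean(), X2.std())            # approximately mu and sigma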

  • Markov’s Inequality

    Theorem (Markov’s inequality): Suppose X is a non-negative random variable and E(X) exists. Then for any t > 0,

       P(X > t) ≤ E(X)/t. (1)
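    A simulation sketch (ours, assuming NumPy) comparing the actual tail probability with the Markov bound for an Exponential(1) variable, which is non-negative with E(X) = 1:

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.exponential(1.0, 1_000_000)   # non-negative, E(X) = 1

        for t in [1.0, 2.0, 5.0]:
            print(t, (X > t).mean(), X.mean() / t)   # P(X > t) vs the bound E(X)/t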

  • Markov’s Inequality: Proof

    Since X ≥ 0,

       E(X) = ∫_0^∞ x f(x) dx (2)

            = ∫_0^t x f(x) dx + ∫_t^∞ x f(x) dx (3)

            ≥ ∫_t^∞ x f(x) dx (4)

            ≥ t ∫_t^∞ f(x) dx (5)

            = t P(X > t). (6)

  • Chebyshev’s Inequality

    Theorem (Chebyshev’s inequality): If µ = E(X) and σ² = V(X), then

       P(|X − µ| ≥ t) ≤ σ²/t² (7)

    and

       P(|X − µ|/σ ≥ u) ≤ 1/u² (8)

    (or P(|Z| ≥ u) ≤ 1/u² if Z = (X − µ)/σ).

    For example, P(|Z| > 2) ≤ 1/4 and P(|Z| > 3) ≤ 1/9.
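    For a Normal variable the bound is quite loose, as a short sketch (ours, assuming SciPy) shows:

        from scipy.stats import norm

        for u in [2, 3, 4]:
            exact = 2 * norm.sf(u)         # exact P(|Z| > u) for Z ~ N(0, 1)
            print(u, exact, 1 / u**2)      # e.g. u = 2: about 0.0455 vs 0.25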

  • Chebyshev’s Inequality: Proof

    Using Markov’s inequality,

       P(|X − µ| ≥ t) = P(|X − µ|² ≥ t²) (9)

                      ≤ E(X − µ)²/t² (10)

                      = σ²/t². (11)

    The second part follows by setting t = uσ.

  • Chebyshev’s Inequality: Example

    Suppose we test a classifier on a set of N new examples. Let Xi = 1 if the prediction is wrong and Xi = 0 if it is right; then X̄N = (1/N) ∑_{i=1}^{N} Xi is the observed error rate. Each Xi may be regarded as a Bernoulli with unknown mean p; we would like to estimate this.

    How likely is X̄N to not be within ε of p?

  • Chebyshev’s Inequality: Example

    We have that V(X̄N) = V(Xi)/N = p(1 − p)/N and

       P(|X̄N − p| > ε) ≤ V(X̄N)/ε² (12)

                       = p(1 − p)/(Nε²) (13)

                       ≤ 1/(4Nε²) (14)

    since p(1 − p) ≤ 1/4 for all p.

    For ε = .2 and N = 100 the bound is .0625.
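    The bound and the actual probability can be compared by simulation (a sketch, ours, assuming NumPy; the true error rate p below is an arbitrary illustrative choice):

        import numpy as np

        rng = np.random.default_rng(0)
        p, N, eps, reps = 0.3, 100, 0.2, 100_000   # p is hypothetical

        errors = rng.random((reps, N)) < p          # Bernoulli(p) mistakes
        xbar = errors.mean(axis=1)                  # observed error rates

        print((np.abs(xbar - p) > eps).mean())      # simulated probability, far below the bound
        print(1 / (4 * N * eps**2))                 # Chebyshev bound: 0.0625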

  • Asymptotic theory

    What happens as you get more data.

  • Convergence

    Suppose X1, X2, . . . is a sequence of random variables and X is another random variable. FN is the CDF of XN and F is the CDF of X.

    XN converges in probability to X, written XN →p X, if for every ε > 0,

       P(|XN − X| > ε) → 0 (19)

    as N → ∞.

  • Convergence

    XN converges in distribution to X, written XN ⇝ X, if

       lim_{N→∞} FN(t) = F(t) (20)

    at all t for which F is continuous.

  • Convergence

    XN converges in quadratic mean (or L2) to X, written XN →qm X, if

       E(XN − X)² → 0 (21)

    as N → ∞.

  • Convergence

    These are ordered in strength:

       XN →qm X ⇒ XN →p X ⇒ XN ⇝ X (22)

    Special case: if P(X = c) = 1 for some c ∈ R, then XN ⇝ X ⇒ XN →p X. But in general none of the reverse implications hold.

  • (Weak) Law of Large Numbers

    Theorem (WLLN): If X1, . . . , XN are IID and E(Xi) = µ, then X̄N →p µ.

    This says that the sample mean X̄N approaches the true mean µ as N gets large.
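    A simulation sketch (ours, assuming NumPy) of the running sample mean settling toward µ:

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.random(100_000)    # Uniform(0, 1) draws, true mean mu = 0.5

        running = np.cumsum(X) / np.arange(1, X.size + 1)
        for N in [10, 100, 10_000, 100_000]:
            print(N, running[N - 1])   # approaches 0.5 as N grows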

  • WLLN: Proof

    To make the proof simpler (though it’s not strictly necessary), assume the variance is finite (σ² < ∞). Then, using Chebyshev’s inequality,

       P(|X̄N − µ| > ε) ≤ V(X̄N)/ε² (23)

                       = σ²/(Nε²), (24)

    which approaches 0 as N → ∞.

  • Central Limit Theorem

    Theorem (CLT): If X1, . . . , XN are IID (with any distribution), with mean µ and variance σ², then

       ZN = (X̄N − µ)/√V(X̄N) = √N (X̄N − µ)/σ ⇝ Z (25)

    where Z ∼ N(0, 1). In other words,

       lim_{N→∞} P(ZN ≤ z) = Φ(z) = ∫_{−∞}^{z} (1/√(2π)) e^{−x²/2} dx. (26)

  • Central Limit Theorem

    This says that probability statements about X̄N can be approximated using a Normal distribution. This is written as

       ZN = √N (X̄N − µ)/σ ≈ N(0, 1) (27)

    or

       X̄N ≈ N(µ, σ²/N). (28)
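    A simulation sketch (ours, assuming NumPy and SciPy) of the CLT for a skewed distribution: standardized means of Exponential(1) samples already track the standard Normal CDF at moderate N:

        import numpy as np
        from scipy.stats import norm

        rng = np.random.default_rng(0)
        mu, sigma, N, reps = 1.0, 1.0, 50, 200_000   # Exponential(1): mu = sigma = 1

        xbar = rng.exponential(mu, (reps, N)).mean(axis=1)
        Z_N = np.sqrt(N) * (xbar - mu) / sigma

        for z in [-1.0, 0.0, 1.0]:
            print(z, (Z_N <= z).mean(), norm.cdf(z))   # empirical vs Phi(z)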

  • Acknowledgement: The slides are from Alexander Gray at Georgia Tech.
