Transcript of: Quick Tour of Basic Probability Theory, Stanford University
snap.stanford.edu/class/cs224w-2012/prob_tutorial.pdf

Page 1

Quick Tour of Basic Probability Theory

CS224W: Social and Information Network Analysis, Fall 2012

Page 2

Outline

Today’s goal: A gentle refresher on probability

You should have seen this before

Outline

- Basic definitions
- Random variables
- Maximum likelihood estimation

Page 3

Fundamentals of Probability

- Sample space Ω: set of all possible outcomes
- Event space F = 2^Ω (an event is a subset of the sample space)
- Probability measure: a function P : F → ℝ such that:
  - P(A) ≥ 0 for all A ∈ F
  - P(Ω) = 1
  - For disjoint events A_i, P(∪_i A_i) = ∑_i P(A_i)

In this session, I’ll focus mostly on the discrete case (things are basically the same in the continuous case).

Page 4

Example

Consider throwing a die twice:

- Sample space Ω = {1, 2, 3, 4, 5, 6} × {1, 2, 3, 4, 5, 6}
- Event space F = 2^Ω (example events: let A be the event that the sum is even and let B be the event that we roll at least one 6)
- Probability measure: the function P is simple counting in this simple discrete case. Example: P(A) = 18/36 = 1/2, P(B) = 11/36.

Note that multiple events can happen simultaneously, e.g. if we roll a 6 then a 2, the outcome is (6, 2), and both A and B have occurred.
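The counting on this slide can be checked by brute-force enumeration. A minimal sketch in Python (the event names A and B match the slide):

```python
from itertools import product

# All 36 equally likely outcomes of two die rolls.
outcomes = list(product(range(1, 7), repeat=2))

# Event A: the sum is even; event B: at least one 6.
A = [o for o in outcomes if sum(o) % 2 == 0]
B = [o for o in outcomes if 6 in o]

print(len(A) / 36)  # 0.5
print(len(B) / 36)  # 11/36
```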

Page 5

Union

For any two events A and B, the union of the two (“A or B”) is:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

e.g. P(A ∪ B) = 18/36 + 11/36 − 5/36 = 24/36 = 2/3
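Inclusion-exclusion can be verified against a direct count of the union (a sketch reusing the dice events from the previous slide):

```python
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))
A = {o for o in outcomes if sum(o) % 2 == 0}   # sum is even
B = {o for o in outcomes if 6 in o}            # at least one 6

p = lambda E: Fraction(len(E), 36)
# Inclusion-exclusion agrees with counting the union directly:
print(p(A) + p(B) - p(A & B))  # 2/3
print(p(A | B))                # 2/3
```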

Page 6

Conditional probability

Let A and B be two events. Then the conditional probability of A given B is:

P(A|B) = P(A ∩ B) / P(B)

“What’s the probability of A once we know B has happened?”

Rewriting gives us the useful product rule:

P(A ∩ B) = P(A|B)P(B)
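For the dice events above, the definition and the product rule can be checked directly (a sketch; exact arithmetic via Fraction avoids float noise):

```python
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))
A = {o for o in outcomes if sum(o) % 2 == 0}   # sum is even
B = {o for o in outcomes if 6 in o}            # at least one 6

p = lambda E: Fraction(len(E), 36)
p_A_given_B = p(A & B) / p(B)
print(p_A_given_B)                      # 5/11
# The product rule recovers the joint probability:
print(p_A_given_B * p(B) == p(A & B))   # True
```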

Page 7

Independence

Two events A and B are independent if

P(A ∩ B) = P(A)P(B)

Equivalently: P(A|B) = P(A) and P(B|A) = P(B)

Intuitively, knowing A doesn’t tell you anything about B and vice versa.

But beware of relying on your intuition: rolling two dice (x_a and x_b), the events x_a = 2 and x_a + x_b = k are independent if k = 7 and dependent otherwise.
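The counterintuitive claim about k = 7 can be checked exhaustively (a sketch; it tests every k from 2 to 12):

```python
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))
p = lambda E: Fraction(len(E), 36)

E1 = {o for o in outcomes if o[0] == 2}               # x_a = 2
indep = {}
for k in range(2, 13):
    Ek = {o for o in outcomes if sum(o) == k}         # x_a + x_b = k
    indep[k] = p(E1 & Ek) == p(E1) * p(Ek)

print([k for k in indep if indep[k]])  # [7]
```

The intuition: when k = 7, each value of the first die pairs with exactly one value of the second, so the sum carries no information about the first die.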

Page 8

Union bound

Recall that P(A ∪ B) = P(A) + P(B) − P(A ∩ B) for any two events A and B.

If we’re trying to upper bound the probability that A or B happens, the worst case is that A and B are disjoint (so P(A ∩ B) = 0).

The surprisingly useful union bound now follows. Let A_i be some (not necessarily independent!) events. Then:

P(∪_i A_i) ≤ ∑_i P(A_i)
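A quick numeric check of the union bound on three deliberately overlapping (hence dependent) dice events; the particular events are an arbitrary choice for illustration:

```python
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))
p = lambda E: Fraction(len(E), 36)

# Three overlapping (dependent) events on two die rolls.
events = [
    {o for o in outcomes if sum(o) % 2 == 0},  # sum even
    {o for o in outcomes if 6 in o},           # at least one 6
    {o for o in outcomes if sum(o) >= 10},     # sum at least 10
]
union = set().union(*events)
print(p(union))                    # 2/3
print(sum(p(E) for E in events))   # 35/36, a valid upper bound
```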

Page 9

Bayes’ Rule

Most important basic rule of probability!

For two events A and B (such that P(B) ≠ 0):

P(A|B) = P(B|A)P(A) / P(B)

Often used to update beliefs:

posterior = “support B provides for A”× “prior”

Page 10

Bayes’ Rule Example

Your friend told you she had a great conversation with someone on the Caltrain. Not knowing anything else, your prior belief that her conversation partner was a woman is 50%. Let W denote this event. Let L denote the event that her conversation partner has long hair. If you learn L to be true, how should you update your beliefs about W?

P(W) = 0.5, and suppose P(L) = 0.6 and P(L|W) = 0.75 are known.

Then P(W|L) = P(L|W)P(W)/P(L) = 0.75 · 0.5/0.6 = 62.5%.
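The update is a one-liner; a sketch with the numbers from the slide:

```python
# Caltrain example: P(W) = 0.5, P(L) = 0.6, P(L|W) = 0.75.
p_W, p_L, p_L_given_W = 0.5, 0.6, 0.75

p_W_given_L = p_L_given_W * p_W / p_L   # Bayes' rule
print(p_W_given_L)  # 0.625
```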

Page 11

Random Variables

A random variable is technically a function X : Ω → ℝ

Probabilities of random-variable events come from the underlying P function: P(X = k) = P({ω ∈ Ω | X(ω) = k})

It’s called a random variable because it’s a variable that doesn’t take on a single, deterministic value; it can take on a set of different values, each with an associated probability.

e.g. Let X be a random variable that counts the number of 6’s we roll in 2 rolls of a die.

P(X = 2) = P((6, 6)) = 1/36
P(X = 1) = P((1, 6)) + … + P((5, 6)) + P((6, 1)) + … + P((6, 5)) = 10/36
P(X = 0) = 25/36
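The same pmf can be tabulated by enumeration (a sketch; Counter tallies how many outcomes map to each value of X):

```python
from itertools import product
from fractions import Fraction
from collections import Counter

# X = number of 6's in two die rolls.
counts = Counter(o.count(6) for o in product(range(1, 7), repeat=2))
pmf = {x: Fraction(n, 36) for x, n in counts.items()}
print(sorted(pmf.items()))  # probabilities 25/36, 10/36, 1/36 for x = 0, 1, 2
```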

Page 12

Distributions

A probability mass function (pmf) assigns a probability to each possible value of a random variable (in the discrete case)

Example: funny die

Page 13

Distributions

Another example: distribution over sum of two die rolls

Page 14

Probability density functions

The PDF of a continuous random variable X describes the relative likelihood for X to take on a given value:

P[a ≤ X ≤ b] = ∫_a^b f(x) dx

Page 15

Cumulative Distributions

The CDF of a random variable X is:

F(x) = P(X ≤ x)

Page 16

Properties of Distribution Functions

- CDF (cumulative distribution function):
  - 0 ≤ F_X(x) ≤ 1
  - F_X is monotone increasing, with lim_{x→−∞} F_X(x) = 0 and lim_{x→∞} F_X(x) = 1
- pmf:
  - 0 ≤ p_X(x) ≤ 1
  - ∑_x p_X(x) = 1
  - ∑_{x∈A} p_X(x) = P(X ∈ A)
- pdf:
  - f_X(x) ≥ 0
  - ∫_{−∞}^{∞} f_X(x) dx = 1
  - ∫_{x∈A} f_X(x) dx = P(X ∈ A)

Page 17

Some Common Random Variables

- X ∼ Bernoulli(p) (0 ≤ p ≤ 1): p_X(x) = p if x = 1, and 1 − p if x = 0
- X ∼ Geometric(p) (0 ≤ p ≤ 1): p_X(x) = p(1 − p)^(x−1) for x = 1, 2, …
- X ∼ Uniform(a, b) (a < b): f_X(x) = 1/(b − a) if a ≤ x ≤ b, and 0 otherwise
- X ∼ Normal(μ, σ²): f_X(x) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²))
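The four formulas above transcribe directly into code. A sketch; the parameter values (p = 0.3, μ = 0, σ² = 1) are arbitrary illustrative choices:

```python
import math

p = 0.3
bernoulli = {1: p, 0: 1 - p}
geometric = lambda x: p * (1 - p) ** (x - 1)          # x = 1, 2, ...
uniform = lambda x, a, b: 1 / (b - a) if a <= x <= b else 0.0
normal = lambda x, mu, s2: (
    math.exp(-(x - mu) ** 2 / (2 * s2)) / (math.sqrt(s2) * math.sqrt(2 * math.pi))
)

# Sanity check: probabilities sum (or integrate) to 1.
print(sum(bernoulli.values()))                   # 1.0
print(sum(geometric(x) for x in range(1, 200)))  # ≈ 1.0
```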

Page 18

Expectation and Variance

- If the discrete random variable X has pmf p(x), then the expectation is E[X] = ∑_x x · p(x)
- The continuous case is similar: E[X] = ∫_{−∞}^{∞} x · f_X(x) dx
- Expectation is linear:
  - for any constant a ∈ ℝ, E[a] = a
  - E[a · g(X) + b · h(X)] = a · E[g(X)] + b · E[h(X)]
- Var[X] = E[(X − E[X])²] = E[X²] − E[X]²
- Variance is not linear

Example: expectation of rolling a die once: 1 · 1/6 + 2 · 1/6 + 3 · 1/6 + 4 · 1/6 + 5 · 1/6 + 6 · 1/6 = 3.5
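The die example, plus the variance identity Var[X] = E[X²] − E[X]², as a short exact-arithmetic sketch:

```python
from fractions import Fraction

# Fair die: p(x) = 1/6 for x = 1..6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

E = sum(x * p for x, p in pmf.items())        # E[X]
E2 = sum(x * x * p for x, p in pmf.items())   # E[X^2]
var = E2 - E ** 2                             # Var[X] = E[X^2] - E[X]^2
print(E)    # 7/2
print(var)  # 35/12
```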

Page 19

Indicator variables

An indicator variable just indicates whether an event occurs or not:

I_A = { 1 if A occurs; 0 otherwise }

They have a very useful property:

E[I_A] = 1 · P(I_A = 1) + 0 · P(I_A = 0)
       = P(I_A = 1)
       = P(A)

Page 20

Method of indicators

Goal: find expected number of successes out of N trials

Method: define an indicator (Bernoulli) random variable for each trial, and find the expected value of the sum.

Example: N professors are at dinner and take a random coat when they leave. What’s the expected number of professors with the right coat?

Let G be the number of professors who get the right coat, and let G_i be an indicator for the event that professor i gets his own coat. Then

G = G_1 + G_2 + … + G_N

These events are not independent!

Page 21

But linearity of expectation saves us:

E[G] = E[G_1 + G_2 + … + G_N]
     = E[G_1] + E[G_2] + … + E[G_N]
     = 1/N + 1/N + … + 1/N = 1

Remember: linearity of expectation does not assume independence!
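The E[G] = 1 result can be sanity-checked by simulation. A Monte Carlo sketch (the 20,000-trial count and the seed are arbitrary choices):

```python
import random

# Average number of professors who get their own coat, over many
# random coat assignments; should be close to 1 for any n.
def avg_matches(n, trials=20_000, seed=0):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        coats = list(range(n))
        rng.shuffle(coats)                  # random coat assignment
        total += sum(i == c for i, c in enumerate(coats))
    return total / trials

print(avg_matches(5), avg_matches(50))  # both close to 1
```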

Page 22

Some Useful Inequalities

- Markov’s Inequality: let X be a nonnegative random variable and a > 0. Then:

  P(X ≥ a) ≤ E[X] / a

Example: back to the professors and their coats. We know that E[G] = 1, so applying Markov’s Inequality gives us:

P(G ≥ a) ≤ 1/a

Plugging in a = 5, we get that the chance that at least 5 professors get the right coats is no higher than 20% (regardless of N).
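Markov's bound is often very loose, which a simulation makes vivid. A sketch for 10 professors (the trial count and seed are arbitrary):

```python
import random

# Empirical P(G >= 5) for 10 professors; Markov only promises <= 0.2,
# but the true probability is far smaller.
rng = random.Random(1)
trials = 20_000
hits = sum(
    sum(i == c for i, c in enumerate(rng.sample(range(10), 10))) >= 5
    for _ in range(trials)
)
print(hits / trials)  # well below the Markov bound of 0.2
```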

Page 23

- Chernoff bound: let X_1, …, X_n be independent Bernoulli random variables with P(X_i = 1) = p_i. Denoting μ = E[∑_{i=1}^n X_i] = ∑_{i=1}^n p_i,

  P(∑_{i=1}^n X_i ≥ (1 + δ)μ) ≤ ( e^δ / (1 + δ)^(1+δ) )^μ

  for any δ > 0. Multiple variants of Chernoff-type bounds exist, which can be useful in different settings.

Page 24

Parameter Estimation: Maximum Likelihood

- Say we have a parametrized distribution f_X(x; θ) and we don’t know the parameter(s) θ.
- We observe IID samples x_1, …, x_n.
- Goal: estimate θ.
- The maximum likelihood estimator (MLE) is the value of θ that maximizes the likelihood of observing the data you observed.

Page 25

MLE Example

Say you flip a coin with unknown bias θ of landing heads n times and get n_H heads and n_T tails. What’s the MLE estimate for the coin’s bias?

The likelihood of observing the data given a particular θ is P(D|θ) = θ^(n_H) (1 − θ)^(n_T).

Take logs: log P(D|θ) = n_H log(θ) + n_T log(1 − θ).

Page 26

MLE Example continued

Take the derivative and set it to 0:

d/dθ log P(D|θ) = 0
d/dθ [n_H log(θ) + n_T log(1 − θ)] = 0
n_H/θ − n_T/(1 − θ) = 0
θ = n_H/(n_H + n_T)

Sometimes it is not possible to find the optimal estimate in closed form, in which case iterative methods must be used.
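The closed-form answer can be cross-checked against a brute-force search over the log-likelihood. A sketch (the counts n_H = 7, n_T = 3 are made-up data):

```python
import math

nH, nT = 7, 3
theta_mle = nH / (nH + nT)                # closed-form MLE

# Grid search over the log-likelihood nH*log(t) + nT*log(1-t).
loglik = lambda t: nH * math.log(t) + nT * math.log(1 - t)
theta_grid = max((i / 1000 for i in range(1, 1000)), key=loglik)
print(theta_mle, theta_grid)  # 0.7 0.7
```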

Page 27

Interesting limits

- lim_{n→∞} (1 + k/n)^n = e^k
- n! ∼ √(2πn) (n/e)^n as n → ∞; in fact √(2πn) (n/e)^n is a lower bound for n! (Stirling’s approximation)
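Both limits can be checked numerically at moderate n (a sketch; n = 100,000, k = 3, and n = 20 are arbitrary choices):

```python
import math

# (1 + k/n)^n approaches e^k.
n, k = 100_000, 3.0
print((1 + k / n) ** n / math.exp(k))   # ≈ 1

# Stirling's formula is a lower bound for n!.
n = 20
stirling = math.sqrt(2 * math.pi * n) * (n / math.e) ** n
ratio = stirling / math.factorial(n)
print(ratio)  # just below 1
```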