
Introduction to Probability Theory

Fabio G. Cozman - fgcozman@usp.br

August 29, 2018

Part I: Basic Concepts for Finite Spaces

1 Possibility/sample space, outcomes, events

2 Variables and indicator functions

3 Probabilities, expectations

4 Properties of probabilities

Possibility/sample space

1 Possibility/sample space: set Ω.

2 Elements ω of Ω are outcomes.

3 Subsets of Ω are events (no fuzziness!).

Example

Two coins are tossed; each coin can come up heads (H) or tails (T). Then Ω = {HH, HT, TH, TT}. Consider three events. Event A = {HH} is the event that both tosses produce heads. Event B = {HH, TT} is the event that both tosses produce identical outcomes. Event C = {HH, TH} is the event that the second toss yields heads. Note that A = B ∩ C.

Probability measure (finite spaces!)

A probability measure is a function that assigns a probability value to each event.

PU1 For any event A, P(A) ≥ 0.

PU2 The space Ω has probability one: P(Ω) = 1.

PU3 If events A and B are disjoint (that is, A ∩ B = ∅), then P(A ∪ B) = P(A) + P(B).

Easy example

Example

A six-sided die is rolled. Suppose all outcomes of Ω are assigned precise and identical probability values.

We must have ∑_{ω∈Ω} P(ω) = P(Ω) = 1, thus P(ω) = 1/6 for every outcome.

The event A = {1, 3, 5} (outcome is odd) has probability P(A) = 1/2.

The event B = {2, 3, 5} (outcome is prime) has probability P(B) = 1/2, and P(A ∩ B) = P({3, 5}) = 1/3.
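A quick check of this example in Python (an illustrative sketch, not part of the original slides), treating events as subsets of Ω and the measure as a dictionary of outcome probabilities:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
P = {w: Fraction(1, 6) for w in omega}       # P(ω) = 1/6 for every outcome

def prob(event):
    """Probability of an event, i.e. a subset of Ω."""
    return sum(P[w] for w in event)

A = {1, 3, 5}        # outcome is odd
B = {2, 3, 5}        # outcome is prime

print(prob(A))       # 1/2
print(prob(B))       # 1/2
print(prob(A & B))   # P({3, 5}) = 1/3
```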

Properties of probabilities

1 As A and Ac are disjoint and A ∪ Ac = Ω, we have P(A) + P(Ac) = P(Ω) = 1, so

P(A) = 1 − P(Ac).

2 P(∅) = 1 − P(∅c) = 1 − P(Ω) = 0.

3 P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

4 For n mutually disjoint events Bi,

P(∪_{i=1}^n Bi) = ∑_{i=1}^n P(Bi).

5 If events Bi form a partition of Ω,

P(A) = P(∪_{i=1}^n (A ∩ Bi)) = ∑_i P(A ∩ Bi).

Random variables

1 A function X : Ω → ℝ is usually called a random variable.

2 If X is a variable, then any function f : ℝ → ℝ defines a random variable f(X).

Example

The age in months of a person ω selected from a population Ω is a variable X. The same population can be used to define a different variable Y, where Y(ω) is the weight (rounded to the next kilogram) of a person ω selected from Ω. We can also have a random variable Z = X + Y.

Distributions

Possibility space Ω, variable X : Ω → ℝ.

Then: possibility space ΩX containing every possible value of X.

A probability measure on Ω induces a measure over subsets of ΩX:

P(X ∈ A) = P({ω ∈ Ω : X(ω) ∈ A}).

The induced measure on ΩX is usually called the distribution of X.

Expectations

Given a variable X, its expectation is

E[X] = ∑_{ω∈Ω} X(ω) P(ω) = ∑_x x P(X = x).

An expectation functional yields a real number for each variable. Properties:

For constants α and β, if α ≤ X ≤ β, then α ≤ E[X] ≤ β.

E[X + Y] = E[X] + E[Y].

The variance is E[(X − E[X])²].
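A minimal Python sketch (not from the slides) computing these quantities on the two-coin space from the earlier example; X counts heads and Y indicates heads on the second toss, both illustrative choices:

```python
from fractions import Fraction

omega = ["HH", "HT", "TH", "TT"]
P = {w: Fraction(1, 4) for w in omega}

X = {w: w.count("H") for w in omega}             # number of heads
Y = {w: int(w[1] == "H") for w in omega}         # indicator: second toss is heads

def E(Z):
    """Expectation of a variable given as a dict ω -> Z(ω)."""
    return sum(Z[w] * P[w] for w in omega)

EX = E(X)
print(EX)                                        # E[X] = 1
print(E({w: X[w] + Y[w] for w in omega}))        # E[X + Y] = 3/2 = E[X] + E[Y]
print(E({w: (X[w] - EX) ** 2 for w in omega}))   # variance of X = 1/2
```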

Part II: A Bit of History

1 Old times.

2 Classical probabilities.

3 Frequentist and Bayesian schemes.

Brief Look: History of Probabilities

Classical: Leibniz, Fermat, Pascal, De Moivre (1600); Bayes (1700); Laplace (1800); modifications: Keynes, Jeffreys, Jaynes (1940).

Frequentist: Venn, Boole, De Morgan (1850); Fisher, Neyman/Pearson (1900).

Bayesian: Ramsey, De Finetti (1930); Savage (1950).

The Classical Theory: Ancient Time

First thoughts appeared in Philosophy:

Aristotle: “the probable is that which for the most part happens”

Bishop Butler: “to us probability is the very guide of life”

Also, many philosophers have used probability to prove the existence of God (e.g., the proof of the ecliptic)

The Classical Theory: Evolution

Pascal, De Moivre, Bernoulli: central limit theorem, law of large numbers.

Thomas Bayes: what you believe depends on what you believed before; we need prior distributions.

Bayes’ rule: P(A|B) = P(B|A) P(A) / P(B).

The Classical Theory: Laplace

Probability is the ratio of the number of favorable cases to that of all the cases possible.

The Principle of Non-Sufficient Reason: two possible cases are equally probable if there is no reason to prefer one to the other.

The Classical Theory: Difficulties

The great problem: the Principle of Non-Sufficient Reason.

Too many proofs from too little knowledge.

The problem of reparameterizations: if you are not sure about x, you are not sure about x². How can we express that?

Now Come the Frequentists

Basic idea: instead of using ignorance, let’s use knowledge.

Let’s define probability as the limiting relative frequency of observations:

P(A) = lim_{n→∞} n_A / n.

Venn, Boole and De Morgan proposed it around 1850; Statistics was built upon this conception of probabilities.

The Frequentist Theory: Difficulties

The definition is too poor compared to what we want.

It is impossible to talk about probabilities for things that will happen only once!

More mathematically, how to use the limit in the definition (lim_{n→∞} n_A/n)?

Many deterministic sequences have limits; do random sequences have limits?

A Brief Summary So Far

Classical Theory: probability is the ratio of favorable cases to the number of cases (Principle of Non-Sufficient Reason). Problem: the Principle of Non-Sufficient Reason is untenable.

Frequentist Theory: probability is a limiting relative frequency. Problems: too narrow a concept; hard to define mathematically.

The Emergence of Subjectivism

Since everything else seems to fail, why don’t we admit that there is a component of subjectivism in probability?

Ramsey/De Finetti’s groundbreaking idea: let’s define probability as a “fair” betting strategy:

I’ll give you 1 unit of currency if President X is re-elected.
How much would you pay, as a “fair” price, for this bet?
The amount you pay is your probability that X is re-elected.

The Bayesian Theory: Savage’s Idea

Axiomatize preferences over “gambles”.

From preferences, obtain “money” (utility) and probabilities.

Result: if f ≺ g, then there is a probability measure P and a utility function U such that E[U(f)] < E[U(g)].

The Bayesian Theory: Basics

All forms of uncertainty are reduced to probability.

Judgements of uncertainty are reduced to preferences.

All forms of updating knowledge are equivalent to application of Bayes’ rule.

Frequentists Versus Bayesians

Bayesians:

Induction is a solved problem: you define your prior, you collect data, and then you apply Bayes’ rule, always following decision theory.

Challenges: basically subjective (annoying priors).

Frequentists:

Induction is an ad hoc activity; Statistics furnishes useful tools for induction.

Some tools: significance testing, hypothesis testing, least squares...

Challenges: based on shaky foundations; piecemeal and ad hoc approach to problems.

Part III

1 Moments, variance, covariance.

2 Weak laws of large numbers.

Moments

Definition

The ith moment of X is the expectation E[X^i].

Definition

The ith central moment of X is the expectation E[(X − E[X])^i].

Definition

The variance V[X] of X is the second central moment of X.

Note:

V[X] = E[(X − E[X])²] = E[X²] − E[X]².

Markov inequality

1 Suppose X ≥ 0 and t > 0. If X(ω) < t, then I_{X≥t}(ω) = 0 and X(ω)/t ≥ 0 = I_{X≥t}(ω). If X(ω) ≥ t, then X(ω)/t ≥ 1 = I_{X≥t}(ω). Consequently X/t ≥ I_{X≥t}, hence E[X]/t ≥ E[I_{X≥t}] = P(X ≥ t), so:

P(X ≥ t) ≤ E[X] / t.

2 Chebyshev inequality:

P(|X − E[X]| ≥ t) ≤ V[X] / t².
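A quick Monte Carlo check (illustrative only, not a proof) of both inequalities, using an exponential variable with mean 2 as an arbitrary nonnegative example; the threshold t = 5 is also arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)     # X >= 0 with E[X] = 2, V[X] = 4

t = 5.0
# Markov: P(X >= t) <= E[X]/t
print(np.mean(x >= t), "<=", x.mean() / t)
# Chebyshev: P(|X - E[X]| >= t) <= V[X]/t^2
print(np.mean(np.abs(x - x.mean()) >= t), "<=", x.var() / t**2)
```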

Digression: Covariance

Definition

The covariance of variables X and Y is Cov(X, Y) = E[(X − E[X])(Y − E[Y])].

If two variables X and Y are such that Cov(X, Y) = 0, then X and Y are uncorrelated.

Very weak law of large numbers

Theorem

If variables X1, X2, ..., Xn have expectations E[Xi] ∈ [µ_min, µ_max] and variances V[Xi] ∈ [σ²_min, σ²_max], and Xi and Xj are uncorrelated for every i ≠ j, then for any ε > 0,

P( µ_min − ε < (∑_i Xi)/n < µ_max + ε ) ≥ 1 − σ²_max / (n ε²).

Weak law of large numbers

Theorem

If variables X1, X2, ... have expectations E[Xi] = µ and variances V[Xi] = σ², and Xi and Xj are uncorrelated for every i ≠ j, then for any ε > 0,

lim_{n→∞} P( |(∑_i Xi)/n − µ| < ε ) = 1.

Philosophy behind the “law”

Idea: irregularities observed in the Xi do not affect the average of these variables.

We should have regularity out of apparent chaos: even though the random variables behave randomly, their average does approach some meaningful number (the probability...).

Suggests the “definition”: P(A) = lim_{n→∞} #A/n.
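A short simulation (an illustrative sketch, not from the slides) of this frequentist picture: the relative frequency of the event "die shows an odd face" approaches P(A) = 1/2 as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)
rolls = rng.integers(1, 7, size=100_000)        # fair die rolls
hits = (rolls % 2 == 1)                         # event A: odd outcome

for n in (10, 100, 1_000, 100_000):
    print(n, hits[:n].mean())                   # relative frequency n_A / n, approaching 0.5
```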

Part IV: Conditioning

1 Bayes rule.

2 Theorem of total probabilities.

Conditioning: Bayes rule

Definition

If P(B) > 0, then

P(A|B) = P(A ∩ B) / P(B).

Definition

The conditional expectation of X given B, denoted by E[X|B], is defined only if P(B) > 0, as

E[X|B] = ∑_x x P(X = x|B) = E[I_B X] / P(B).

Basic facts

For any C such that P(C ) > 0:

For any A, P(A|C ) ≥ 0.

P(Ω|C ) = 1.

If A ∩ B = ∅, then P(A ∪ B|C) = P(A|C) + P(B|C).

Note that P(A|A) = 1 whenever P(A) > 0.

Properties

1 E[X|B] = ∑_{ω∈B} X(ω) P(ω|B).

2 For events {Bi}_{i=1}^n,

P(B1 ∩ B2 ∩ · · · ∩ Bn) = P(B1) ∏_{i=2}^n P(Bi | B1 ∩ · · · ∩ B_{i−1}).

More properties

1 Total probability theorem: if events Bi form a partition of Ω such that all P(Bi) > 0,

P(A) = ∑_i P(A ∩ Bi) = ∑_i P(A|Bi) P(Bi).

2 Then:

P(Bi|A) = P(A|Bi) P(Bi) / ∑_j P(A|Bj) P(Bj).

Example

1 Some individuals in an office have a disease D; a test to detect the disease yields result R or Rc.

2 P(R |D) = 9/10 (sensitivity of the test).

3 P(Rc |Dc) = 4/5 (specificity of the test).

4 P(D) = 1/9.

5 Then:

P(D|R) = (9/10 × 1/9) / (9/10 × 1/9 + 1/5 × 8/9) = 9/25.
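The same computation done explicitly with Bayes' rule and exact fractions (an illustrative sketch):

```python
from fractions import Fraction

p_D = Fraction(1, 9)                 # P(D)
sens = Fraction(9, 10)               # P(R | D), sensitivity
spec = Fraction(4, 5)                # P(Rc | Dc), specificity

p_R = sens * p_D + (1 - spec) * (1 - p_D)    # total probability theorem
p_D_given_R = sens * p_D / p_R               # Bayes' rule
print(p_D_given_R)                           # 9/25
```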

Three-prisoners problem

1 Three prisoners, Teddy, Jay, and Mark, are waiting to be executed.

2 The governor will select one to be freed (with equal probability).

3 The warden knows the governor’s decision.

4 Teddy convinces the warden to say the name of one of his fellow inmates who will be executed (useless information...).

5 The warden is honest.

6 The warden says that Jay is to be executed: Teddy is happy (1/3 to 1/2)!

7 But if the warden had said Mark, Teddy would also be happy??

Analysis

1 Possibility space:

Ω = { Teddy freed ∩ warden says Jay,
      Teddy freed ∩ warden says Mark,
      Jay freed ∩ warden says Mark,
      Mark freed ∩ warden says Jay }.

2 We know that

P(Teddy freed) = P(Jay freed) = P(Mark freed) = 1/3.

3 How would the warden behave if Teddy is to be freed? That is, what is

P(warden says Jay | Teddy freed)?

Possible conclusion...

If

P(warden says Jay | Teddy freed) = 1/2,

then:

P(Teddy freed ∩ warden says Jay) = 1/6,

P(Teddy freed ∩ warden says Mark) = 1/6,

P(Jay freed ∩ warden says Mark) = 1/3,

P(Mark freed ∩ warden says Jay) = 1/3.

Hence

P(Teddy freed | warden names Jay) = (1/6) / (1/3 + 1/6) = 1/3.

Complete analysis...

1 The statement does not say anything about the behaviour of the warden.

2 All that is really known is

P(warden names Jay | Teddy freed) ∈ [0, 1].

3 Consequently

P(Teddy freed | warden names Jay) ∈ [ 0/(0 + 1/3), (1/3)/(1/3 + 1/3) ] = [0, 1/2].
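A small sketch (illustrative, not from the slides) that makes the dependence on the warden's behaviour explicit; the function name and the parameter q = P(warden says Jay | Teddy freed) are hypothetical labels:

```python
from fractions import Fraction

def posterior_teddy(q):
    """P(Teddy freed | warden names Jay) as a function of q."""
    teddy_jay = Fraction(1, 3) * q        # Teddy freed and warden says Jay
    mark_jay = Fraction(1, 3)             # Mark freed: warden must say Jay
    return teddy_jay / (teddy_jay + mark_jay)

print(posterior_teddy(Fraction(1, 2)))    # 1/3, the "possible conclusion"
print(posterior_teddy(Fraction(0)))       # 0, the lower bound
print(posterior_teddy(Fraction(1)))       # 1/2, the upper bound
```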

Part V: Probability mass functions

1 Mass functions.

2 Marginal probability mass functions.

3 Conditional probability mass functions.

4 Multivariate models.

Probability mass function

1 The probability mass function is simply pX(x) = P(X = x).

2 Then P(X ∈ A) = ∑_{x∈A} pX(x).

Example (Uniform distribution)

The uniform distribution for X assigns pX(x) = 1/k for every value x of X, where k is the number of values of X.

Example (Bernoulli distribution)

A binary variable X with values 0 and 1. The Bernoulli distribution with parameter p for X takes two values: pX(0) = 1 − p and pX(1) = p. Then E[X] = 0(1 − p) + 1p = p and V[X] = p(1 − p).

More on probability mass functions...

1 For Y = f(X),

pY(y) = P(Y = y) = ∑_{x∈ΩX : f(x)=y} pX(x),

pY(y) = P(Y = y) = ∑_{ω∈Ω : f(X(ω))=y} P(ω).

2 Conditional probability mass function: pX|B(x|B) = P(X = x|B).

3 Joint probability mass function pX,Y:

pX,Y(x, y) = P(X = x ∩ Y = y).

Marginal probability mass functions

pX(x) = P(X = x) = ∑_{y∈ΩY} P(X = x ∩ Y = y) = ∑_{y∈ΩY} pX,Y(x, y).

Example

X and Y with three values each and

pX,Y(x, y)   y = 1   y = 2   y = 3
x = 1        1/10    1/25    1/20
x = 2        1/20    1/5     1/25
x = 3        1/10    1/50    2/5
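Computing the marginal pX from this joint table, as a check (an illustrative sketch):

```python
from fractions import Fraction as F

p_xy = {
    (1, 1): F(1, 10), (1, 2): F(1, 25), (1, 3): F(1, 20),
    (2, 1): F(1, 20), (2, 2): F(1, 5),  (2, 3): F(1, 25),
    (3, 1): F(1, 10), (3, 2): F(1, 50), (3, 3): F(2, 5),
}
assert sum(p_xy.values()) == 1               # the joint pmf sums to one

# marginal of X: sum the joint pmf over the values of Y
p_x = {x: sum(p for (x2, _), p in p_xy.items() if x2 == x) for x in (1, 2, 3)}
for x in (1, 2, 3):
    print(x, p_x[x])                         # 19/100, 29/100, 13/25
```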

Expectations and mass functions

Finite spaces:

E[X] = ∑_{ω∈Ω} X(ω) P(ω)
     = ∑_{x∈ΩX} ∑_{ω : X(ω)=x} x P(ω)
     = ∑_{x∈ΩX} x ∑_{ω : X(ω)=x} P(ω)
     = ∑_{x∈ΩX} x P(X = x)
     = ∑_{x∈ΩX} x pX(x).

Conditional probability mass

For a variable X and event A such that P(A) > 0,

pX|A(x|A) = P(X = x|A).

For variables X and Y,

pX|Y(x|y) = P(X = x|Y = y).

Iterated expectations

Denote E[X|Y = y] by E[X|y]. Then E[X|Y] is a function of Y. For finite spaces:

E[X] = ∑_{x∈ΩX} x pX(x)
     = ∑_{x∈ΩX} x ∑_{y∈ΩY} pX,Y(x, y)
     = ∑_{x∈ΩX} ∑_{y∈ΩY} x pX|Y(x|y) pY(y)
     = ∑_{y∈ΩY} E[X|Y = y] pY(y)
     = E[E[X|Y]].

(A similar expression holds for infinite spaces.)
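A numerical check of E[X] = E[E[X|Y]] on the joint table from the marginalization example (an illustrative sketch):

```python
from fractions import Fraction as F

p_xy = {
    (1, 1): F(1, 10), (1, 2): F(1, 25), (1, 3): F(1, 20),
    (2, 1): F(1, 20), (2, 2): F(1, 5),  (2, 3): F(1, 25),
    (3, 1): F(1, 10), (3, 2): F(1, 50), (3, 3): F(2, 5),
}
xs, ys = (1, 2, 3), (1, 2, 3)

p_y = {y: sum(p_xy[x, y] for x in xs) for y in ys}            # marginal of Y
E_X = sum(x * p_xy[x, y] for x in xs for y in ys)             # E[X] directly
E_X_given_y = {y: sum(x * p_xy[x, y] for x in xs) / p_y[y] for y in ys}

print(E_X)                                                    # 233/100
print(sum(E_X_given_y[y] * p_y[y] for y in ys))               # E[E[X|Y]] = 233/100
```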

For sets of variables

1 Probability mass function, conditional, joint, marginal, etc.

2 Vectors:

X = [X1, ..., Xn]^T (a column vector),

pX(x) = P(X = x),

E[X] = [E[X1], ..., E[Xn]]^T.

Part VI: Independence

1 Independence for two events, for many events.

2 Independence for random variables.

3 Conditional independence.

Independence for events

1 A and B are independent when

P(A|B) = P(A) whenever P(B) > 0;

or, equivalently,

P(A ∩ B) = P(A) P(B).

2 Many events {Ai}_{i=1}^n are independent when, for every subset S of the events,

P(∩_{i∈S} Ai) = ∏_{i∈S} P(Ai).

(Pairwise independence is not enough!)
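A classic illustration of that warning (not taken from the slides): two fair coins and their exclusive-or are pairwise independent but not mutually independent.

```python
from fractions import Fraction
from itertools import product

P = {w: Fraction(1, 4) for w in product((0, 1), repeat=2)}   # two fair coins

def prob(pred):
    """Probability of the event described by the predicate pred."""
    return sum(p for w, p in P.items() if pred(w))

X = lambda w: w[0]          # first coin
Y = lambda w: w[1]          # second coin
Z = lambda w: w[0] ^ w[1]   # exclusive-or of the two coins

# Pairwise: P(X=1, Z=1) = 1/4 = P(X=1) P(Z=1), and similarly for the other pairs.
print(prob(lambda w: X(w) == 1 and Z(w) == 1),
      prob(lambda w: X(w) == 1) * prob(lambda w: Z(w) == 1))
# But P(X=1, Y=1, Z=1) = 0, not 1/8 = P(X=1) P(Y=1) P(Z=1): no mutual independence.
print(prob(lambda w: X(w) == 1 and Y(w) == 1 and Z(w) == 1))
```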

Independence for random variables

For all events such that the conditional probabilities are defined,

P(Xi = xi | ∩_{j≠i} {Xj = xj}) = P(Xi = xi);

that is,

p(xi | ∩_{j≠i} {Xj = xj}) = p(xi).

Or, more concisely:

p(x1, ..., xn) = ∏_{i=1}^n p(xi).

Conditional independence

1 (X ⊥⊥ Y | A) if

E[f(X) g(Y)|A] = E[f(X)|A] E[g(Y)|A]

for all functions f, g, whenever P(A) > 0.

2 (X ⊥⊥ Y | Z) if

(X ⊥⊥ Y | Z = z)

for every category z of Z such that P(Z = z) > 0.

Part VII: Laws of Large Numbers

1 Weak law.

2 Strong law.

Weak law of large numbers again

1 Independence implies uncorrelation: for i ≠ j,

E[(Xi − E[Xi])(Xj − E[Xj])] = E[Xi − E[Xi]] E[Xj − E[Xj]] = 0.

2 If independent variables X1, X2, ... have expectations E[Xi] = µ and variances V[Xi] = σ², then for any ε > 0,

lim_{n→∞} P( |(∑_i Xi)/n − µ| < ε ) = 1.

3 There are variants: assuming no variance, assuming expectations change, etc.

Advanced: strong law of large numbers

In a sequence of variables X1, ..., Xn with expectation µ, the mean converges to the expectation with probability one:

P( lim_{n→∞} (∑_{i=1}^n Xi)/n = µ ) = 1.

1 It requires the theory of infinite spaces.

2 It is hard to prove and requires several assumptions.

3 It is really a strong result.

Part VIII: General Spaces

1 Infinities.

2 General axioms.

Infinite spaces

So far, Ω has been a finite set:
1 random variables have finitely many values;
2 a probability mass function specifies a distribution through finitely many values.

Now suppose Ω is an infinite set; Ω may be
1 countable (natural, odd, integer, rational numbers), or
2 uncountable (real numbers).

Kolmogorov’s axioms

PU1 For any event A, P(A) ≥ 0.

PU2 The space Ω has probability one: P(Ω) = 1.

PU3 If events A and B are disjoint (that is, A ∩ B = ∅), then P(A ∪ B) = P(A) + P(B).

PU4 If a sequence of events Ai is such that lim_{n→∞} ∩_{i=1}^n Ai = ∅, then lim_{n→∞} P(∩_{i=1}^n Ai) = 0.

Equivalent axioms (countable additivity)

P1 For any event A, P(A) ≥ 0.

P2 The space Ω has probability one: P(Ω) = 1.

P3 If countably many events {Ai}_{i=1}^∞ are disjoint, then P(∪_i Ai) = ∑_{i=1}^∞ P(Ai).

The last axiom introduces countable additivity.

Example with discrete variable

Suppose X has integer values 0, 1, 2, . . . .

Then X has a Poisson distribution with parameter λ > 0 when

P(X = x) = e^{−λ} λ^x / x!,

for x ≥ 0.

Part IX: Mathematics of Infinite Spaces

1 Fields and algebras; Borel algebra.

2 Measurability.

3 Lebesgue integration.

Digression: Measurability

Given an infinite Ω, we have to specify probability values for its subsets. All subsets?

1 It is impossible to define a countably additive probability measure over all subsets of the real numbers for which the probability of an interval [a, b] is b − a.

2 In fact, there are subsets of ℝ that are unmeasurable: a countably additive set-function cannot be defined on them such that [a, b] maps to b − a.

3 Ulam’s theorem: if a countably additive measure is defined over all subsets of the real numbers and vanishes on all singletons, it is identically zero.

Kolmogorov’s solution: fields

Consider first a finite set Ω. A field F is a nonempty set of subsets of Ω such that:
1 if A ∈ F, then Ac ∈ F;
2 if A ∈ F and B ∈ F, then A ∪ B ∈ F.

Note: if A is in a field, then Ac, ∅ and Ω are automatically in the field.

Example: {∅, A, B, Ac, Bc, A ∪ B, Ac ∪ B, A ∪ Bc, Ac ∪ Bc, A ∩ B, Ac ∩ B, A ∩ Bc, Ac ∩ Bc, (Ac ∩ B) ∪ (A ∩ Bc), (A ∪ Bc) ∩ (Ac ∪ B), Ω}.
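A small sketch (illustrative, not from the slides) of how such a field can be generated mechanically: close a collection of subsets under complement and union until nothing new appears. The helper name and the choice of Ω, A, B below are hypothetical.

```python
def generate_field(omega, seeds):
    """Smallest field on omega containing the seed sets."""
    field = {frozenset(s) for s in seeds} | {frozenset(), frozenset(omega)}
    changed = True
    while changed:
        changed = False
        current = list(field)
        for S in current:                      # close under complement
            comp = frozenset(omega) - S
            if comp not in field:
                field.add(comp)
                changed = True
        for S in current:                      # close under union
            for T in current:
                if S | T not in field:
                    field.add(S | T)
                    changed = True
    return field

omega = {1, 2, 3, 4}
A, B = {1, 2}, {2, 3}
print(len(generate_field(omega, [A, B])))      # 16 sets, as in the example above
```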

σ-fields

Now consider an infinite set Ω. A σ-field is a set of subsets of Ω such that:
1 if A ∈ F, then Ac ∈ F;
2 if Ai ∈ F for all i, then ∪_i Ai ∈ F.

Note that σ-fields are closed under countable unions.

Fields and algebras

Fields are also called algebras.

σ-fields are also called σ-algebras.

In fact, “algebra” seems to be a better term (there are other meanings for the word “field” that do not apply here...).

Terminology is confusing!

Borel algebras

“The minimal σ-algebra containing the open sets/compact sets of a topological space Ω.”

The Borel algebra for the real numbers: the smallest σ-algebra on ℝ that contains the intervals.

The elements of a Borel algebra are the Borel sets.

Consequences of countable additivity

No way to extend arbitrary assessments over arbitrary spaces.

No uniform distribution on the integers.

BUT: countable additivity basically allows us to use integrals to compute expectations!

VERY important: there is a unique probability measure that corresponds to an expectation (and vice-versa)!

End of digression

A probability space is a triple (Ω,F ,P), where

Ω is a set (possibility space).

F is a σ-algebra on Ω.

P is a probability measure on F; that is, a non-negative, normalized (to unity), countably additive set-function.

Note: in almost all books on probability theory, the probability space takes the real numbers and their Borel algebra.

Extending the previous theory

A variable with finitely many values is called simple. We know how to relate expectation and probability for those.

Take two possible approximations for E[X]:

E[X] ≈ sup {E[Y] : Y ≤ X, Y is simple},

E[X] ≈ inf {E[Z] : Z ≥ X, Z is simple}.

In fact, for many random variables, both approximations coincide, and then

E[X] = sup {E[Y] : Y ≤ X, Y is simple}.

The problem is to characterize these variables, and the properties of this functional.

Measurable random variables

A function f : Ω → ℝ is F-measurable with respect to an algebra F when, for every α ∈ ℝ, the set

{ω : f(ω) ≤ α}

belongs to F.

Note: there are more general definitions in the literature...

A random variable X is measurable when it is a measurable function.

The Lebesgue integral

A variable with finitely many values is called simple. Take two possible approximations for E[X]:

E[X] ≈ sup {E[Y] : Y ≤ X, Y is simple},

E[X] ≈ inf {E[Z] : Z ≥ X, Z is simple}.

In fact, given an expectation/measure, we have

E[X] = sup {E[Y] : Y ≤ X, Y is simple}.

This quantity is the Lebesgue integral with respect to the probability measure P. Notation:

E[X] = ∫ X dP.


Discrete random variables

Suppose X has an enumerable set of values. Then

E[X] = ∑_x x P(X = x).

Example: X has a Poisson distribution with parameter λ > 0 when

P(X = x) = e^{−λ} λ^x / x!,

for integer x ≥ 0; then

E[X] = λ and V[X] = λ.

The Riemann integral

Under quite general conditions, the Lebesgue integral can be computed using the Riemann integral (the “usual” integral).

The key idea is to define densities and then to integrate with respect to densities.

Part X: Densities and the like

1 Cumulative distribution functions and densities.

2 A summary.

Cumulative distribution function

The function

FX(x) = P({ω : X(ω) ≤ x})

is the cumulative distribution function of X. Note:

FX is a non-negative, non-decreasing function, with FX(−∞) = 0 and FX(∞) = 1.

When X has a density pX, P(X ∈ [a, b]) = FX(b) − FX(a) = ∫_a^b pX(x) dx.

Densities

For a measurable variable X, the density of X is, when it exists:

pX(x) = dFX(x)/dx = d P({ω : X(ω) ≤ τ})/dτ evaluated at τ = x.

Then:

E[X] = ∫_{ΩX} x pX(x) dx,

where ΩX is the set of values of X, and the integral is the Riemann integral.

Summary

1 Kolmogorov’s theory: probability space (Ω, F, P), where Ω is a general possibility space, F is a σ-algebra, and P is a non-negative, normalized (to unity), countably additive set-function (a normalized measure).
The most common σ-algebra for the real numbers is the Borel algebra (intervals).
Random variables are F-measurable functions.
Expectations of measurable functions are Lebesgue integrals.

2 The distribution of X is entirely captured by FX(x), the cumulative distribution function.

3 If FX is continuous, the variable X is continuous, and we can differentiate FX(x) to obtain the density pX(x).

4 If the distribution of X has a density pX(x), then the expectation E[X] is a Riemann integral ∫ x pX(x) dx.

Part XI: Catalog of Distributions

1 Common densities.

2 De Moivre - Laplace’s theorem.

Uniform distribution

Suppose X is a real-valued variable.

The distribution of X is uniform if its density is

pX(x) = 1/(b − a) if x ∈ [a, b],

and pX(x) = 0 otherwise.

Gaussian distribution

X has a Gaussian distribution when

pX(x) = (1/√(2πσ²)) exp( −(x − µ)² / (2σ²) ).

E[X] = µ and V[X] = σ².

De Moivre - Laplace’s theorem

Take n and p ∈ [0, 1] such that n p (1 − p) ≫ 1; then

(n choose k) p^k (1 − p)^{n−k} ≈ exp( −(k − np)² / (2np(1 − p)) ) / √(2πnp(1 − p)).

(That is, the ratio of the two sides goes to 1.)

That is, the probability that k among n trials are positive can, when n grows without bound, be approximated by a Gaussian density.
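A quick numerical comparison (illustrative only) of the two sides, for n = 100 and p = 0.3, so that np(1 − p) = 21:

```python
import math

n, p = 100, 0.3
for k in (20, 25, 30, 35, 40):
    binom = math.comb(n, k) * p**k * (1 - p)**(n - k)          # exact binomial pmf
    gauss = (math.exp(-(k - n*p)**2 / (2*n*p*(1 - p)))
             / math.sqrt(2*math.pi*n*p*(1 - p)))               # Gaussian approximation
    print(k, round(binom, 4), round(gauss, 4))
```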

Gamma distribution

X has a gamma distribution with parameters α and β when

pX(x) = (β^α / Γ(α)) x^{α−1} e^{−βx},

for x > 0, and pX(x) = 0 otherwise.

Gamma function:

Γ(α) = ∫_0^∞ z^{α−1} e^{−z} dz.

Note: for any positive integer k, Γ(k) = (k − 1)!.

We have E[X] = α/β and the variance of X is α/β².

Gamma and exponential distributions

Important: if {Xi}_{i=1}^n are independent and have gamma distributions with parameters αi and β, then X = X1 + · · · + Xn has a gamma distribution with parameters α1 + · · · + αn and β.

If α = 1 and β > 0, then X has an exponential distribution with parameter β,

pX(x) = β e^{−βx},

for x > 0, and pX(x) = 0 otherwise.

Chi-square distribution

X has a chi-square distribution when

pX(x) = (1/(√2 Γ(1/2))) exp(−x/2)/√x

when x > 0, and pX(x) = 0 otherwise.

The χ² with n degrees of freedom:

pX(x) = (1/(2^{n/2} Γ(n/2))) x^{n/2−1} exp(−x/2)

when x > 0, and pX(x) = 0 otherwise. (This is a gamma distribution with α = n/2 and β = 1/2.)

If {Xi}_{i=1}^n are Gaussian variables with µ = 0 and σ² = 1, then X1² + · · · + Xn² has a χ² distribution with n degrees of freedom.

Beta distribution

Often used to model random variables that are limited to an interval.

X has a beta distribution with parameters α and β when

pX(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1},

for x ∈ [0, 1], and 0 otherwise.

Note: a beta density is proportional to x^{α−1}(1 − x)^{β−1}. If α = β = 1, we obtain the uniform distribution.

For a beta distribution pX(·) with parameters α and β, the expected value of X is α/(α + β) and the variance is αβ/((α + β)²(α + β + 1)).

Dirichlet distribution

A column vector of dimension n has a Dirichlet distribution when:

pX1,...,Xn(x1, ..., xn) = ( Γ(∑_{i=1}^n αi) / ∏_{i=1}^n Γ(αi) ) ∏_{i=1}^n xi^{αi−1}

when ∑_i xi = 1, and 0 otherwise.

The distribution is defined on a simplex of dimension n − 1.

The values {αi}_{i=1}^n, where αi > 0, are the parameters of the Dirichlet distribution.

This is a direct generalization of the beta distribution.

t distribution

X has a t distribution with n degrees of freedom (for n > 0) when

pX(x) = ( Γ((n + 1)/2) / (Γ(n/2) √(nπ)) ) (1 + x²/n)^{−(n+1)/2}.

When n = 1, the distribution is called the Cauchy distribution: it is a distribution with undefined expected value and variance!

Important: if X has a Gaussian distribution with µ = 0 and σ² = 1, and Y, independent of X, has a χ² distribution with n degrees of freedom, then X/√(Y/n) has a t distribution with n degrees of freedom.

Part XII

1 Multivariate densities.

Multivariate densities

For two variables X and Y,

FX,Y(x, y) = P(X ≤ x, Y ≤ y)

and

pX,Y(x, y) = ∂²FX,Y(x, y) / ∂x ∂y.

Then

P((X, Y) ∈ D) = ∫∫_D pX,Y(x, y) dx dy.

...and similarly for any number of variables.

Gaussian vector

A column vector X of dimension n has a Gaussian joint probability density when:

pX(x) = ( 1 / √((2π)^n det P) ) exp( −(x − µ)^T P^{−1} (x − µ) / 2 ),

where µ is a vector and P is a square matrix of appropriate dimensions.

For a Gaussian vector, we have:

E[X] = µ;  V[X] = E[(X − E[X])(X − E[X])^T] = P.

Part XIII: Functions

1 Functions of random variables.

2 Expected values of functions.

Functions of a random variable

Example: if Y = aX + b, with a > 0, then:

f(X) ≤ y  if and only if  X ≤ (y − b)/a,

and then:

FY(y) = P(f(X) ≤ y) = P( X ≤ (y − b)/a ),

so FY(y) = FX((y − b)/a).

Functions of a random variable

Example: If Z = X + Y , then:

FZ(z) = P(Z ≤ z) = P(X + Y ≤ z) = ∫_{−∞}^{∞} ∫_{−∞}^{z−y} pX,Y(x, y) dx dy.

Linear combinations and conditioning

For a Gaussian vector X and a vector Y of dimension m such that Y = AX, where A is a matrix, we have:

pY(y) = ( 1 / √((2π)^m det Q) ) exp( −(y − ν)^T Q^{−1} (y − ν) / 2 ),

where ν = Aµ and Q = A P A^T.

If X and Y are jointly Gaussian vectors, pX|Y(x|y) is also Gaussian, with

E[X|Y] = E[X] + Cov[X, Y] V[Y]^{−1} (Y − E[Y]),

V[X|Y] = V[X] − Cov[X, Y] V[Y]^{−1} Cov[Y, X].

Also, we have: if Y = f(X) and X has density pX(x), then

E[Y] = E[f(X)] = ∫ f(x) pX(x) dx.

Concepts of covariance, correlation, etc., are defined just as for discrete random variables.
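A small numpy sketch of the conditional-Gaussian formulas above, for a hypothetical two-dimensional Gaussian vector (X, Y); the numbers below are illustrative, not from the slides:

```python
import numpy as np

mu = np.array([1.0, 2.0])                 # [E[X], E[Y]]
P = np.array([[4.0, 3.0],                 # [[V[X],      Cov[X, Y]],
              [3.0, 9.0]])                #  [Cov[Y, X], V[Y]     ]]

y_obs = np.array([5.0])                   # observed value of Y
cov_xy, var_y = P[:1, 1:], P[1:, 1:]

mean_x_given_y = mu[:1] + cov_xy @ np.linalg.inv(var_y) @ (y_obs - mu[1:])
var_x_given_y = P[:1, :1] - cov_xy @ np.linalg.inv(var_y) @ P[1:, :1]

print(mean_x_given_y)   # [2.]
print(var_x_given_y)    # [[3.]]
```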

Part XIV: Important concepts

1 Conditioning.

2 Independence.

Usual solution for conditioning

Introduce:

pX|Y(x|y) = pX,Y(x, y) / pY(y) = pX,Y(x, y) / ∫ pX,Y(x, y) dx.

Then:

E[X|Y] = ∫ x pX|Y(x|Y) dx.

Independence: basic definition

Consider random variables X1, X2, ..., Xn. These random variables are independent when, assuming all distributions have densities,

p(X1, ..., Xn) = ∏_{i=1}^n p(Xi).

Part XV: Some advanced concepts

1 Kolmogorovian conditioning.

2 Other definitions of independence.

3 Convergence.

4 Laws of large numbers.

5 Exchangeability.

6 Central limit theorem.

7 Bayesian consensus.

Conditioning, again

Conditional expectation:

E[X|A] = E[X I_A] / P(A)

whenever P(A) > 0.

Now consider E[X|Y]. If Ω is uncountable, E[X|Y = y] may face the difficulty that one can have P(Y = y) = 0 for all values of Y. So it is hard to define P(X|Y) as a function of X and Y.

Real thing: Kolmogorovian conditioning

Definition: E[X|Y] is a random variable that is “Y-measurable” and such that

E[f(Y)(X − E[X|Y])] = 0

for any function f(Y).

This is not simple!

Usually, a proper density pX|Y(x|y) exists such that “probabilities” P(A(X)|B(Y)) can be calculated.

Independence, again

Variables {Xi}_{i=1}^n are independent if

E[fi(Xi) | ∩_{j≠i} {Xj ∈ Aj}] = E[fi(Xi)],

for all functions fi(Xi) and all events ∩_{j≠i} {Xj ∈ Aj} with positive probability.

For all functions fi(Xi),

E[ ∏_{i=1}^n fi(Xi) ] = ∏_{i=1}^n E[fi(Xi)].

For all sets of events {Ai}_{i=1}^n,

P(∩_{i=1}^n {Xi ∈ Ai}) = ∏_{i=1}^n P(Xi ∈ Ai).

Convergence

There are many notions of convergence for random variables. Consider a sequence of random variables Xi defined on the same probability space (Ω, F, P).

1 Convergence in distribution (in distribution function): lim_{n→∞} FXn(x) = FX(x) at every continuity point x of FX.

2 Convergence in probability: lim_{n→∞} P(|Xn − X| ≥ ε) = 0.

3 Almost sure convergence (with probability one): P(lim_{n→∞} Xn = X) = 1.

Review: Law of Large Numbers

Consider an infinite sequence of independent variables with expectation µ and variance σ².

Define X̄ = (∑_{i=1}^n Xi)/n.

Weak law of large numbers:

lim_{n→∞} P( |X̄ − µ| ≥ ε ) = 0.

Strong law of large numbers:

P( lim_{n→∞} X̄ = µ ) = 1.

Central limit theorem

Take a sequence of n independent random variables Xi with means µi and variances σi².

Consider the random variable X = ∑_i Xi; then E[X] = µ = ∑_i µi and the variance is σ² = ∑_i σi².

If we define

Z = (X − µ)/σ,

then the distribution of Z tends to a Gaussian distribution with expectation 0 and variance 1 as n → ∞.
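A Monte Carlo illustration (not from the slides): standardized sums of i.i.d. uniform variables look approximately standard Gaussian; the sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 50, 100_000
x = rng.uniform(0.0, 1.0, size=(trials, n))          # each has mean 1/2, variance 1/12

z = (x.sum(axis=1) - n * 0.5) / np.sqrt(n / 12.0)    # standardized sums
print(z.mean(), z.std())                             # close to 0 and 1
print(np.mean(np.abs(z) <= 1.96))                    # close to 0.95, as for a standard Gaussian
```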

Exchangeability

1 Binary variables X1, X2, . . . are exchangeable if P(X1, X2, . . .) does not change if we just change the order of the variables.

2 If X1, X2, . . . are exchangeable, then

P(k ones in n selected variables)

can always be written as

∫ (n choose k) θ^k (1 − θ)^{n−k} p(θ) dθ.

(This is De Finetti’s representation theorem.)

3 Note the deep implications of exchangeability!

Bayesian consensus

1 Suppose that n Bayesians have different priors Pi(θ).

2 Suppose they all observe X1, X2, . . . .

3 Suppose all Xi are independent and identically distributed according to P(Xi|θ*).

4 Then all Bayesians will reach an identical posterior P(θ|X1, X2, . . .) that is infinitely concentrated around θ*.

Summary

1 Kolmogorov’s theory: probability space (Ω, F, P), where Ω is a general possibility space, F is a σ-algebra, and P is a non-negative, normalized (to unity), countably additive set-function (a normalized measure).

2 Random variables are F-measurable functions, and expectations are Lebesgue integrals (there are many univariate and multivariate densities!).

3 The conditional density p(X|Y) is given by p(X, Y)/p(Y); independence means “conditional equal to unconditional” or “factorization” (actually only the latter in general).

4 Convergence concepts are important, with many results: laws of large numbers, central limit theorems...