Introduction to Probability Theory - USP · The Classical Theory: Ancient Time First thoughts...
Part I: Basic Concepts for Finite Spaces
1 Possibility/sample space, outcomes, events
2 Variables and indicator functions
3 Probabilities, expectations
4 Properties of probabilities
Possibility/sample space
1 Possibility/sample space: set Ω.
2 Elements ω of Ω are outcomes.
3 Subsets of Ω are events (no fuzziness!).
Example
Two coins are tossed; each coin can be heads (H) or tails (T). Then Ω = {HH, HT, TH, TT}. Consider three events. Event A = {HH} is the event that both tosses produce heads. Event B = {HH, TT} is the event that both tosses produce identical outcomes. Event C = {HH, TH} is the event that the second toss yields heads. Note that A = B ∩ C.
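This enumeration is easy to verify mechanically; a small Python sketch (variable names are ours):

```python
from itertools import product

# Sample space for two coin tosses: all sequences of H/T of length 2.
omega = set(product("HT", repeat=2))

A = {("H", "H")}                         # both tosses are heads
B = {w for w in omega if w[0] == w[1]}   # identical outcomes
C = {w for w in omega if w[1] == "H"}    # second toss is heads

# The slide's claim: A = B ∩ C.
assert A == B & C
```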
Probability measure (finite spaces!)
A probability measure is a function that assigns a probability value to each event.
PU1 For any event A, P(A) ≥ 0.
PU2 The space Ω has probability one: P(Ω) = 1.
PU3 If events A and B are disjoint (that is, A ∩ B = ∅), then P(A ∪ B) = P(A) + P(B).
Easy example
Example
A six-sided die is rolled. Suppose all outcomes of Ω are assigned precise and identical probability values.
We must have ∑_{ω∈Ω} P({ω}) = P(Ω) = 1, thus we have P({ω}) = 1/6 for all outcomes.
The event A = {1, 3, 5} (outcome is odd) has probability P(A) = 1/2.
The event B = {1, 2, 3, 5} (outcome is prime) has probability P(B) = 2/3 and P(A ∩ B) = P({1, 3, 5}) = 1/2.
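The same computation in Python, with exact fractions:

```python
from fractions import Fraction

# Uniform measure on a six-sided die: each outcome has probability 1/6.
omega = {1, 2, 3, 4, 5, 6}
P = {w: Fraction(1, 6) for w in omega}

def prob(event):
    """P(A) = sum of the probabilities of the outcomes in A."""
    return sum(P[w] for w in event)

A = {1, 3, 5}        # outcome is odd
B = {1, 2, 3, 5}     # "outcome is prime" as labeled on the slide

assert prob(A) == Fraction(1, 2)
assert prob(B) == Fraction(2, 3)
assert prob(A & B) == Fraction(1, 2)
```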
Properties of probabilities
1 As A and Ac are disjoint and A ∪ Ac = Ω, we have P(A) + P(Ac) = P(Ω) = 1, so
P(A) = 1 − P(Ac).
2 P(∅) = 1 − P(∅c) = 1 − P(Ω) = 0.
3 P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
4 For n mutually disjoint events Bi,
P(∪_{i=1}^n Bi) = ∑_{i=1}^n P(Bi).
5 If events Bi form a partition of Ω,
P(A) = P(∪_{i=1}^n (A ∩ Bi)) = ∑_i P(A ∩ Bi).
Random variables
1 A function X : Ω → ℝ is usually called a random variable.
2 If X is a variable, then any function f : ℝ → ℝ defines a random variable f(X).
Example
The age in months of a person ω selected from a population Ω is a variable X. The same population can be used to define a different variable Y, where Y(ω) is the weight (rounded to the nearest kilogram) of a person ω selected from Ω. We can also have a random variable Z = X + Y.
Distributions
Possibility space Ω, variable X : Ω → ℝ.
Then: possibility space ΩX containing every possible value of X.
The probability measure on Ω induces a measure over subsets of ΩX:
P(X ∈ A) = P({ω ∈ Ω : X(ω) ∈ A}).
The induced measure on ΩX is usually called the distribution of X.
Expectations
Given variable X, its expectation is
E[X] = ∑_{ω∈Ω} X(ω)P({ω}) = ∑_x x P(X = x).
An expectation functional yields a real number for each variable. Properties:
For constants α and β, if α ≤ X ≤ β, then α ≤ E[X] ≤ β.
E[X + Y] = E[X] + E[Y].
Variance is E[(X − E[X])²].
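Both expressions for E[X] can be checked on the die example; a Python sketch:

```python
from fractions import Fraction

# X = value shown by a fair die; compute E[X] two equivalent ways.
omega = {1, 2, 3, 4, 5, 6}
P = {w: Fraction(1, 6) for w in omega}
X = {w: w for w in omega}  # the identity random variable

# E[X] as a sum over outcomes: sum of X(ω) P({ω}).
e1 = sum(X[w] * P[w] for w in omega)
# E[X] as a sum over values: sum of x P(X = x).
values = set(X.values())
e2 = sum(x * sum(P[w] for w in omega if X[w] == x) for x in values)
assert e1 == e2 == Fraction(7, 2)

# Variance: E[(X - E[X])^2].
var = sum((X[w] - e1) ** 2 * P[w] for w in omega)
assert var == Fraction(35, 12)
```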
Part II: A Bit of History
1 Old times.
2 Classical probabilities.
3 Frequentist and Bayesian schemes.
Brief Look: History of Probabilities
Classical
Leibniz, Fermat, Pascal, De Moivre (1600s)
Bayes (1700s)
Laplace (1800s)
Modifications: Keynes, Jeffreys, Jaynes (1940s)
Frequentist
Venn, Boole, De Morgan (1850s)
Fisher, Neyman/Pearson (1900s)
Bayesian
Ramsey, De Finetti (1930s)
Savage (1950s)
The Classical Theory: Ancient Time
First thoughts appeared in Philosophy:
Aristotle: “the probable is that which for the mostpart happens”
Bishop Butler: “to us probability is the very guideof life”
Also, many philosophers have used probability to prove the existence of God (e.g., the proof of the ecliptic)
The Classical Theory: Evolution
Pascal, De Moivre, Bernoulli: central limit theorem, law of large numbers.
Thomas Bayes: what you believe depends on what you believed before; we need prior distributions.
Bayes' rule: P(A|B) = P(B|A)P(A) / P(B).
The Classical Theory: Laplace
Probability is the ratio of the number of favorable cases to that of all the cases possible.
The Principle of Non-Sufficient Reason: two possible cases are equally probable if there is no reason to prefer one to the other.
The Classical Theory: Difficulties
The great problem: the Principle of Non-Sufficient Reason.
Too many proofs from too little knowledge.
The problem of reparameterizations: if you are not sure about x, you are not sure about x². How to express that?
Now Come the Frequentists
Basic idea: instead of using ignorance, let's use knowledge.
Let's define probability as the limiting relative frequency of observations:
P(A) = lim_{n→∞} n_A / n.
Venn, Boole and De Morgan proposed it around 1850; Statistics was built upon this conception of probabilities.
The Frequentist Theory: Difficulties
The definition is too poor compared to what we want.
It is impossible to talk about probabilities for things that will happen only once!
More mathematically, how to use the limit in the definition (lim_{n→∞} n_A/n)?
Many deterministic sequences have limits. Do random sequences have limits?
A Brief Summary So Far
Classical theory: probability is the ratio of favorable cases to the number of cases (Principle of Non-Sufficient Reason). Problem: the Principle of Non-Sufficient Reason is untenable.
Frequentist theory: probability is a limiting relative frequency. Problems: too narrow a concept; hard to define mathematically.
The Emergence of Subjectivism
Since everything else seems to fail, why don't we admit that there is a component of subjectivism in probability? Ramsey/De Finetti's groundbreaking idea: let's define probability as a "fair" betting strategy:
I'll give you 1 unit of currency if President X is re-elected.
How much would you pay to bet "fairly" that X will be re-elected?
The amount you pay is your probability for X being re-elected.
The Bayesian Theory: Savage’s Idea
Axiomatize preferences over "gambles".
From preferences, obtain "money" (utility) and probabilities.
Result: if f ≺ g, then there is a probability measure P and a utility function U such that E[U(f)] < E[U(g)].
The Bayesian Theory: Basics
All forms of uncertainty are reduced to probability.
Judgements of uncertainty are reduced to preferences.
All forms of updating knowledge are equivalent to application of Bayes' rule.
Frequentists Versus Bayesians
Bayesians:
Induction is a solved problem: you define your prior, you collect data and then you apply Bayes' rule, always following decision theory.
Challenges: basically subjective (annoying priors).
Frequentists:
Induction is an ad hoc activity; Statistics furnishes useful tools for induction.
Some tools: significance testing, hypothesis testing, least squares...
Challenges: based on shaky foundations; piecemeal and ad hoc approach to problems.
Part III
1 Moments, variance, covariance.
2 Weak laws of large numbers.
Moments
Definition
The ith moment of X is the expectation E[X^i].
Definition
The ith central moment of X is the expectation E[(X − E[X])^i].
Definition
The variance V[X] of X is the second central moment of X.
Note:
V[X] = E[(X − E[X])²] = E[X²] − E[X]².
Markov inequality
1 Suppose X ≥ 0 and t > 0. If X(ω) < t, then I_{X≥t}(ω) = 0, so X(ω)/t ≥ I_{X≥t}(ω). If X(ω) ≥ t, then X(ω)/t ≥ 1 = I_{X≥t}(ω). Consequently, X/t ≥ I_{X≥t}, and then E[X]/t ≥ E[I_{X≥t}] = P(X ≥ t), so:
P(X ≥ t) ≤ E[X] / t.
2 Chebyshev inequality:
P(|X − E[X]| ≥ t) ≤ V[X] / t².
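Both inequalities can be checked empirically; a small Python sketch (the exponential distribution and the threshold t = 3 are our choices for illustration):

```python
import random

random.seed(0)
# Empirical check of Markov's inequality for X ~ Exponential(1), E[X] = 1.
n = 100_000
xs = [random.expovariate(1.0) for _ in range(n)]
mean = sum(xs) / n

t = 3.0
frac_ge_t = sum(x >= t for x in xs) / n
# Markov: P(X >= t) <= E[X]/t.
assert frac_ge_t <= mean / t

# Chebyshev on the same sample (V[X] = 1 for Exponential(1)).
var = sum((x - mean) ** 2 for x in xs) / n
frac_dev = sum(abs(x - mean) >= t for x in xs) / n
assert frac_dev <= var / t**2
```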
Digression: Covariance
Definition
The covariance of variables X and Y is Cov(X, Y) = E[(X − E[X])(Y − E[Y])].
If two variables X and Y are such that Cov(X, Y) = 0, then X and Y are uncorrelated.
Very weak law of large numbers
Theorem
If variables X1, X2, . . . , Xn have expectations E[Xi] ∈ [µ̲, µ̄] and variances V[Xi] ∈ [σ̲², σ̄²], and Xi and Xj are uncorrelated for every i ≠ j, then for any ε > 0,
P(µ̲ − ε < (∑_i Xi)/n < µ̄ + ε) ≥ 1 − σ̄² / (nε²).
Weak law of large numbers
Theorem
If variables X1, X2, . . . have expectations E[Xi] = µ and variances V[Xi] = σ², and Xi and Xj are uncorrelated for every i ≠ j, then for any ε > 0,
lim_{n→∞} P(|(∑_i Xi)/n − µ| < ε) = 1.
Philosophy behind the “law”
Idea: irregularities observed in the Xi do not affect the average of these variables.
We should have regularity out of apparent chaos: even though the random variables behave randomly, their average does approach some meaningful number (the probability...).
This suggests the “definition”: P(A) = lim_{n→∞} #A/n.
Part IV: Conditioning
1 Bayes' rule.
2 Theorem of total probabilities.
Conditioning: Bayes rule
Definition
If P(B) > 0, then
P(A|B) = P(A ∩ B) / P(B).
Definition
The conditional expectation of X given B, denoted by E[X|B], is defined only if P(B) > 0 as
E[X|B] = ∑_x x P(X = x|B) = E[I_B X] / P(B).
Basic facts
For any C such that P(C ) > 0:
For any A, P(A|C ) ≥ 0.
P(Ω|C ) = 1.
If A ∩ B = ∅, then P(A ∪ B|C) = P(A|C) + P(B|C).
Note that P(A|A) = 1 whenever P(A) > 0.
Properties
1 E[X|B] = ∑_{ω∈B} X(ω)P({ω}|B).
2 For events {Bi}_{i=1}^n,
P(B1 ∩ B2 ∩ · · · ∩ Bn) = P(B1) ∏_{i=2}^n P(Bi | ∩_{j=1}^{i−1} Bj).
More properties
1 Total probability theorem: if events Bi form a partition of Ω such that all P(Bi) > 0,
P(A) = ∑_i P(A ∩ Bi) = ∑_i P(A|Bi)P(Bi).
2 Then:
P(Bi|A) = P(A|Bi)P(Bi) / ∑_j P(A|Bj)P(Bj).
Example
1 Individuals in an office may have a disease D. A test to detect the disease yields result R or Rc.
2 P(R|D) = 9/10 (sensitivity of the test).
3 P(Rc|Dc) = 4/5 (specificity of the test).
4 P(D) = 1/9.
5 Then:
P(D|R) = (9/10 × 1/9) / (9/10 × 1/9 + 1/5 × 8/9) = 9/25.
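The computation on this slide can be reproduced with exact fractions; a minimal Python sketch:

```python
from fractions import Fraction

# Bayes' rule on the disease-test example, with exact arithmetic.
sens = Fraction(9, 10)   # P(R | D), sensitivity
spec = Fraction(4, 5)    # P(Rc | Dc), specificity
p_d = Fraction(1, 9)     # prior P(D)

# Total probability: P(R) = P(R|D)P(D) + P(R|Dc)P(Dc).
p_r = sens * p_d + (1 - spec) * (1 - p_d)
# Bayes: P(D|R) = P(R|D)P(D) / P(R).
p_d_given_r = sens * p_d / p_r
assert p_d_given_r == Fraction(9, 25)
```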
Three-prisoners problem
1 Three prisoners, Teddy, Jay, and Mark, are waiting to be executed.
2 The governor will select one to be freed (with equal probability).
3 The warden knows the governor's decision.
4 Teddy convinces the warden to say the name of one of his fellow inmates who will be executed (useless information...).
5 The warden is honest.
6 The warden says that Jay is to be executed: Teddy is happy (1/3 to 1/2)!
7 But if the warden said Mark, Teddy would also be happy??
Analysis
1 Possibility space:
Ω = { Teddy freed ∩ warden says Jay,
Teddy freed ∩ warden says Mark,
Jay freed ∩ warden says Mark,
Mark freed ∩ warden says Jay }.
2 We know that
P(Teddy freed) = P(Jay freed) = P(Mark freed) = 1/3.
3 How would the warden behave if Teddy is to be freed? That is, what is
P(warden says Jay|Teddy freed)?
Possible conclusion...
If
P(warden says Jay|Teddy freed) = 1/2,
then:
P(Teddy freed ∩ warden says Jay) = 1/6,
P(Teddy freed ∩ warden says Mark) = 1/6,
P(Jay freed ∩ warden says Mark) = 1/3,
P(Mark freed ∩ warden says Jay) = 1/3.
Hence
P(Teddy freed|warden names Jay) = (1/6) / (1/3 + 1/6) = 1/3.
Complete analysis...
1 The statement does not say anything about the behaviour of the warden.
2 All that is really known is
P(warden names Jay|Teddy freed) ∈ [0, 1].
3 Consequently
P(Teddy freed|warden names Jay) ∈ [ 0/(0 + 1/3), (1/3)/(1/3 + 1/3) ] = [0, 1/2].
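A small Python sketch of the analysis, parameterizing the warden's unknown behaviour by q = P(warden names Jay|Teddy freed) (the name q is ours):

```python
from fractions import Fraction

def p_teddy_freed_given_jay_named(q):
    """P(Teddy freed | warden names Jay), where q = P(warden names
    Jay | Teddy freed). The warden never names the prisoner to be freed."""
    teddy_jay = Fraction(1, 3) * q   # Teddy freed AND warden says Jay
    mark_jay = Fraction(1, 3)        # Mark freed: warden must say Jay
    return teddy_jay / (teddy_jay + mark_jay)

# Unbiased warden (q = 1/2): Teddy's probability stays at 1/3.
assert p_teddy_freed_given_jay_named(Fraction(1, 2)) == Fraction(1, 3)
# The extremes q = 0 and q = 1 give the interval [0, 1/2].
assert p_teddy_freed_given_jay_named(Fraction(0)) == 0
assert p_teddy_freed_given_jay_named(Fraction(1)) == Fraction(1, 2)
```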
Part V: Probability mass functions
1 Mass functions.
2 Marginal probability mass functions.
3 Conditional probability mass functions.
4 Multivariate models.
Probability mass function
1 The probability mass function is simply pX(x) = P(X = x).
2 Then P(X ∈ A) = ∑_{x∈A} pX(x).
Example (Uniform distribution)
The uniform distribution for X assigns pX(x) = 1/k for every value x of X, where k is the number of values of X.
Example (Bernoulli distribution)
A binary variable X has values 0 and 1. The Bernoulli distribution with parameter p for X takes two values: pX(0) = 1 − p and pX(1) = p. Then E[X] = 0(1 − p) + 1p = p and V[X] = p(1 − p).
More on probability mass functions...
1 For Y = f(X),
pY(y) = P(Y = y) = ∑_{x∈ΩX : f(x)=y} pX(x),
pY(y) = P(Y = y) = ∑_{ω∈Ω : f(X(ω))=y} P({ω}).
2 Conditional probability mass function: pX|B(x|B) = P(X = x|B).
3 Joint probability mass function of (X, Y):
pX,Y(x, y) = P(X = x ∩ Y = y).
Marginal probability mass functions
pX(x) = P(X = x) = ∑_{y∈ΩY} P(X = x ∩ Y = y) = ∑_{y∈ΩY} pX,Y(x, y).
Example
X and Y with three values each and

pX,Y(x, y)   y = 1   y = 2   y = 3
x = 1        1/10    1/25    1/20
x = 2        1/20    1/5     1/25
x = 3        1/10    1/50    2/5
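The marginal pX can be read off this table; a quick Python check with exact fractions:

```python
from fractions import Fraction

F = Fraction
# Joint pmf from the table: joint[(x, y)] = p_{X,Y}(x, y).
joint = {
    (1, 1): F(1, 10), (1, 2): F(1, 25), (1, 3): F(1, 20),
    (2, 1): F(1, 20), (2, 2): F(1, 5),  (2, 3): F(1, 25),
    (3, 1): F(1, 10), (3, 2): F(1, 50), (3, 3): F(2, 5),
}
assert sum(joint.values()) == 1  # a valid joint pmf

# Marginal of X: p_X(x) = sum over y of p_{X,Y}(x, y).
pX = {x: sum(joint[(x, y)] for y in (1, 2, 3)) for x in (1, 2, 3)}
assert pX[1] == F(19, 100)
assert sum(pX.values()) == 1
```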
Expectations and mass functions
Finite spaces:
E[X] = ∑_{ω∈Ω} X(ω)P({ω})
     = ∑_{x∈ΩX} ∑_{ω:X(ω)=x} x P({ω})
     = ∑_{x∈ΩX} x ∑_{ω:X(ω)=x} P({ω})
     = ∑_{x∈ΩX} x P(X = x)
     = ∑_{x∈ΩX} x pX(x).
Conditional probability mass
For variable X and event A such that P(A) > 0,
pX|A(x|A) = P(X = x|A).
For variables X and Y,
pX|Y(x|y) = P(X = x|Y = y).
Iterated expectations
Denote E[X|Y = y] by E[X|y]. Then E[X|Y] is a function of Y. For finite spaces:
E[X] = ∑_{x∈ΩX} x pX(x)
     = ∑_{x∈ΩX} x ∑_{y∈ΩY} pX,Y(x, y)
     = ∑_{x∈ΩX} ∑_{y∈ΩY} x pX|Y(x|y) pY(y)
     = ∑_{y∈ΩY} E[X|Y = y] pY(y)
     = E[E[X|Y]].
(A similar expression holds for infinite spaces.)
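The identity E[E[X|Y]] = E[X] can be verified on any small joint pmf; a Python sketch with a joint pmf chosen by us for illustration:

```python
from fractions import Fraction

F = Fraction
# A small joint pmf for (X, Y); the values are ours, only for illustration.
joint = {(0, 0): F(1, 4), (0, 1): F(1, 8),
         (1, 0): F(1, 8), (1, 1): F(1, 2)}

eX = sum(x * p for (x, y), p in joint.items())

# p_Y(y) and E[X | Y = y] from the joint pmf.
pY = {y: sum(p for (x, yy), p in joint.items() if yy == y) for y in (0, 1)}
eX_given_Y = {y: sum(x * p for (x, yy), p in joint.items() if yy == y) / pY[y]
              for y in (0, 1)}

# Iterated expectation: E[E[X|Y]] = sum over y of E[X|Y=y] p_Y(y) = E[X].
assert sum(eX_given_Y[y] * pY[y] for y in (0, 1)) == eX
```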
For sets of variables
1 Probability mass function, conditional, joint, marginal, etc.
2 Vectors:
X = [X1, . . . , Xn]^T,
pX(x) = P(X = x),
E[X] = [E[X1], . . . , E[Xn]]^T.
Part VI: Independence
1 Independence for two events, for many events.
2 Independence for random variables.
3 Conditional independence.
Independence for events
1 A and B are independent if
P(A|B) = P(A) whenever P(B) > 0;
or, equivalently,
P(A ∩ B) = P(A)P(B).
2 Many events are independent if: for all subsets of the events {Ai}_{i=1}^n,
P(∩_i Ai) = ∏_i P(Ai).
(Pairwise independence is not enough!)
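That pairwise independence is not enough can be seen in the classic counterexample of two fair bits and their exclusive-or; a Python sketch:

```python
from fractions import Fraction
from itertools import product

# Classic counterexample: X, Y fair independent bits, Z = X xor Y.
# Each pair of events is independent, but the three together are not.
omega = [(x, y) for x, y in product((0, 1), repeat=2)]
P = Fraction(1, 4)  # uniform on the four outcomes

def prob(pred):
    return sum(P for w in omega if pred(w))

A = lambda w: w[0] == 1            # X = 1
B = lambda w: w[1] == 1            # Y = 1
C = lambda w: (w[0] ^ w[1]) == 1   # Z = 1

# Pairwise independent:
assert prob(lambda w: A(w) and C(w)) == prob(A) * prob(C)
assert prob(lambda w: B(w) and C(w)) == prob(B) * prob(C)
# But not mutually independent (the triple intersection is empty):
assert prob(lambda w: A(w) and B(w) and C(w)) != prob(A) * prob(B) * prob(C)
```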
Independence for random variables
For all events such that the conditional probabilities are defined,
P(Xi = xi | ∩_{j≠i} {Xj = xj}) = P(Xi = xi);
that is,
p(xi | ∩_{j≠i} {Xj = xj}) = p(xi).
Or, more concisely:
p(x1, . . . , xn) = ∏_{i=1}^n p(xi).
Conditional independence
1 (X ⊥⊥ Y |A) if
E[f(X)g(Y)|A] = E[f(X)|A] E[g(Y)|A]
for all functions f, g, whenever P(A) > 0.
2 (X ⊥⊥ Y |Z) if
(X ⊥⊥ Y |Z = z)
for every category z of Z such that P(Z = z) > 0.
Part VII: Laws of Large Numbers
1 Weak law.
2 Strong law.
Weak law of large numbers again
1 Independence implies uncorrelation:
E[(Xi − E[Xi])(Xj − E[Xj])] = E[Xi − E[Xi]] E[Xj − E[Xj]] = 0.
2 If independent variables X1, X2, . . . have expectations E[Xi] = µ and variances V[Xi] = σ², then for any ε > 0,
lim_{n→∞} P(|(∑_i Xi)/n − µ| < ε) = 1.
3 There are variants: assuming no variance, assuming expectations change, etc.
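A Monte Carlo sketch of the weak law in Python (the distribution, sample sizes and trial counts are our choices):

```python
import random

random.seed(1)
# The fraction of sample means within ε of µ grows with n (weak LLN).
mu, eps, trials = 0.5, 0.05, 2000

def frac_close(n):
    """Fraction of `trials` sample means of n uniforms within eps of mu."""
    hits = 0
    for _ in range(trials):
        mean = sum(random.random() for _ in range(n)) / n  # Uniform(0, 1)
        hits += abs(mean - mu) < eps
    return hits / trials

# Larger samples concentrate much more tightly around µ = 1/2.
assert frac_close(400) > frac_close(10)
```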
Advanced: strong law of large numbers
In a sequence of variables X1, . . . , Xn, the mean converges to the expectation with probability one:
P(lim_{n→∞} (∑_{i=1}^n Xi)/n = µ) = 1.
1 It requires the theory of infinite spaces.
2 It is hard to prove and requires several assumptions.
3 It is really a strong result.
Part VIII: General Spaces
1 Infinities.
2 General axioms.
Infinite spaces
So far, Ω has been a finite set.
1 Random variables have finitely many values.
2 A probability mass function specifies a distribution through finitely many values.
Now suppose Ω is an infinite set: Ω may be
1 countable (natural, odd, integer, rational numbers), or
2 uncountable (real numbers).
Kolmogorov’s axioms
PU1 For any event A, P(A) ≥ 0.
PU2 The space Ω has probability one: P(Ω) = 1.
PU3 If events A and B are disjoint (that is, A ∩ B = ∅), then P(A ∪ B) = P(A) + P(B).
PU4 If the sequence Ai is such that lim_{n→∞} ∩_{i=1}^n Ai = ∅, then lim_{n→∞} P(∩_{i=1}^n Ai) = 0.
Equivalent axioms (countable additivity)
P1 For any event A, P(A) ≥ 0.
P2 The space Ω has probability one: P(Ω) = 1.
P3 If countably many events {Ai}_{i=1}^∞ are disjoint, then P(∪_i Ai) = ∑_{i=1}^∞ P(Ai).
The last axiom introduces countable additivity.
Example with discrete variable
Suppose X has integer values 0, 1, 2, . . . .
Then X has a Poisson distribution with parameter λ > 0 when
P(X = x) = e^{−λ} λ^x / x!,
for x ≥ 0.
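A quick numerical check of the Poisson mass function in Python (λ = 3 is our choice):

```python
import math

# Poisson pmf with parameter λ; check normalization and mean numerically.
lam = 3.0

def pmf(x):
    return math.exp(-lam) * lam**x / math.factorial(x)

# For λ = 3 the tail beyond x = 100 is negligible.
total = sum(pmf(x) for x in range(100))
mean = sum(x * pmf(x) for x in range(100))
assert abs(total - 1.0) < 1e-12
assert abs(mean - lam) < 1e-12   # E[X] = λ
```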
Part IX: Mathematics of Infinite Spaces
1 Fields and algebras; Borel algebra.
2 Measurability.
3 Lebesgue integration.
Digression: Measurability
Given an infinite Ω, we have to specify probability values for its subsets. All subsets?
1 It is impossible to define a countably additive probability measure over all subsets of the real numbers for which the probability of an interval [a, b] is b − a.
2 In fact, there are subsets of ℝ that are non-measurable: a countably additive set-function cannot be defined on them such that [a, b] maps to b − a.
3 Ulam's theorem: if a countably additive measure is defined over all subsets of the real numbers and vanishes on all singletons, it is identically zero.
Kolmogorov’s solution: fields
Consider first a finite set Ω. A field F is a nonempty set of subsets of Ω such that:
1 if A ∈ F, then Ac ∈ F;
2 if A ∈ F and B ∈ F, then A ∪ B ∈ F.
Note: if A is in a field, then Ac, ∅ and Ω are automatically in the field.
Example: {∅, A, B, Ac, Bc, A ∪ B, Ac ∪ B, A ∪ Bc, Ac ∪ Bc, A ∩ B, Ac ∩ B, A ∩ Bc, Ac ∩ Bc, (Ac ∩ B) ∪ (A ∩ Bc), (A ∪ Bc) ∩ (Ac ∪ B), Ω}.
σ-fields
Now consider an infinite set Ω. A σ-field is a set F of subsets of Ω such that
1 if A ∈ F, then Ac ∈ F;
2 if Ai ∈ F for countably many Ai, then ∪_i Ai ∈ F.
Note that σ-fields are closed under countable unions.
Fields and algebras
Fields are also called algebras.
σ-fields are also called σ-algebras.
In fact, “algebra” seems to be a better term (there are other meanings for the word “field” that do not apply here...).
Terminology is confusing!
Borel algebras
“The minimal σ-algebra containing the open sets/compact sets of a topological space Ω.”
The Borel algebra for the real numbers: the smallest σ-algebra on ℝ that contains the intervals.
The elements of a Borel algebra are the Borel sets.
Consequences of countable additivity
No way to extend arbitrary assessments over arbitrary spaces.
No uniform distribution on the integers.
BUT: countable additivity basically allows us to use integrals to compute expectations!
VERY important: there is a unique probability measure that corresponds to an expectation (and vice-versa)!
End of digression
A probability space is a triple (Ω,F ,P), where
Ω is a set (possibility space).
F is a σ-algebra on Ω.
P is a probability measure on F; that is, a non-negative, normalized (to unity) and countably additive set-function.
Note: in almost all books on probability theory, the probability space takes the real numbers and their Borel algebra.
Extending the previous theory
A variable with finitely many values is called simple. We know how to relate expectation and probability for those.
Take two possible approximations for E[X]:
E[X] ≈ sup {E[Y] : Y ≤ X, Y is simple},
E[X] ≈ inf {E[Z] : Z ≥ X, Z is simple}.
In fact, for many random variables, both approximations coincide and then
E[X] = sup {E[Y] : Y ≤ X, Y is simple}.
The problem is to characterize these variables, and the properties of this functional.
Measurable random variables
A function f : Ω → ℝ is F-measurable with respect to an algebra F when every set
{ω : f(ω) ≤ α}
belongs to F.
Note: there are more general definitions in the literature...
A random variable X is measurable when it is a measurable function.
The Lebesgue integral
A variable with finitely many values is called simple. Take two possible approximations for E[X]:
E[X] ≈ sup {E[Y] : Y ≤ X, Y is simple},
E[X] ≈ inf {E[Z] : Z ≥ X, Z is simple}.
In fact, given an expectation/measure, we have
E[X] = sup {E[Y] : Y ≤ X, Y is simple}.
This quantity is the Lebesgue integral with respect to the probability measure P. Notation:
E[X] = ∫ X dP.
Discrete random variables
Suppose X has an enumerable set of values. Then
E[X] = ∑_x x P(X = x).
Example: X has a Poisson distribution with parameter λ > 0 when
P(X = x) = e^{−λ} λ^x / x!,
for integer x ≥ 0; then
E[X] = λ and V[X] = λ.
The Riemann integral
Under quite general conditions, the Lebesgue integral can be computed using the Riemann integral (the “usual” integral).
The key idea is to define densities and then to integrate with respect to densities.
Part X: Densities and the like
1 Cumulative distribution functions and densities.
2 A summary.
Cumulative distribution function
The function
FX(x) = P({ω : X(ω) ≤ x})
is the cumulative distribution function of X. Note:
FX is a non-negative, non-decreasing function, with FX(−∞) = 0 and FX(∞) = 1.
P(a < X ≤ b) = FX(b) − FX(a) = ∫_a^b pX(x) dx (when X has a density pX).
Densities
For a measurable variable X, the density of X is, when it exists:
pX(x) = dFX(x)/dx = (d/dτ) P({ω : X(ω) ≤ τ}) |_{τ=x}.
Then:
E[X] = ∫_{ΩX} x pX(x) dx,
where ΩX is the set of values of X, and the integral is the Riemann integral.
Summary
1 Kolmogorov's theory: probability space (Ω, F, P), where Ω is a general possibility space, F is a σ-algebra, and P is a non-negative, normalized-to-unity and countably additive set-function (a normalized measure).
The most common σ-algebra for the real numbers is the Borel algebra (generated by the intervals). Random variables are F-measurable functions. Expectations of measurable functions are Lebesgue integrals.
2 The distribution of X is entirely captured by FX(x), the cumulative distribution function.
3 If FX is continuous, the variable X is called continuous; when FX is differentiable, we can differentiate FX(x) to obtain the density pX(x).
4 If the distribution of X has a density pX(x), then the expectation E[X] is a Riemann integral ∫ x pX(x) dx.
Part XI: Catalog of Distributions
1 Common densities.
2 De Moivre - Laplace’s theorem.
Uniform distribution
Suppose X is a real-valued variable.
The distribution of X is uniform on [a, b] if its density is
pX(x) = 1/(b − a) if x ∈ [a, b],
and pX(x) = 0 otherwise.
Gaussian distribution
X has a Gaussian distribution when
pX(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)).
E[X] = µ and V[X] = σ².
De Moivre - Laplace’s theorem
Take n and p ∈ [0, 1] such that np(1 − p) ≫ 1; then
(n choose k) p^k (1 − p)^{n−k} ≈ exp(−(k − np)²/(2np(1 − p))) / √(2πnp(1 − p)).
(That is, the ratio of the two sides goes to 1.)
That is, the probability that k among n trials are positive, when n grows without bound, can be approximated by a Gaussian density.
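A numerical illustration in Python (n, p and the choice of k near np are ours):

```python
import math

# Compare the binomial pmf with its Gaussian approximation near the mean.
n, p = 1000, 0.3
k = int(n * p)  # k near np, where the approximation is best

binom = math.comb(n, k) * p**k * (1 - p) ** (n - k)
var = n * p * (1 - p)
gauss = math.exp(-((k - n * p) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# The ratio of the two sides is close to 1 for large n.
assert abs(binom / gauss - 1) < 1e-2
```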
Gamma distribution
X has a gamma distribution with parameters α and β when
pX(x) = (β^α / Γ(α)) x^{α−1} e^{−βx},
for x > 0, and pX(x) = 0 otherwise.
Gamma function:
Γ(α) = ∫_0^∞ z^{α−1} e^{−z} dz.
Note: for any positive integer k, Γ(k) = (k − 1)!.
We have E[X] = α/β and the variance of X is α/β².
Gamma and exponential distributions
Important: if {Xi}_{i=1}^n are independent and have gamma distributions with parameters αi and β, then X = X1 + · · · + Xn has a gamma distribution with parameters α1 + · · · + αn and β.
If α = 1 and β > 0, then X has an exponential distribution with parameter β,
pX(x) = β e^{−βx},
for x > 0, and pX(x) = 0 otherwise.
Chi-square distribution
X has a chi-square distribution when
pX(x) = (1/(√2 Γ(1/2))) exp(−x/2)/√x
when x > 0, and pX(x) = 0 otherwise.
The χ² with n degrees of freedom:
pX(x) = (1/(2^{n/2} Γ(n/2))) x^{n/2−1} exp(−x/2)
when x > 0, and pX(x) = 0 otherwise. (Gamma distribution with α = n/2 and β = 1/2.)
If {Xi}_{i=1}^n are independent Gaussian variables with µ = 0 and σ² = 1, then X1² + · · · + Xn² has a χ² distribution with n degrees of freedom.
Beta distribution
Often used to model random variables that are limited to an interval. X has a beta distribution with parameters α and β when
pX(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1},
for x ∈ [0, 1], and 0 otherwise.
Note: a beta density is proportional to x^{α−1}(1 − x)^{β−1}. If α = β = 1, then we obtain the uniform distribution.
For a beta distribution pX(·) with parameters α and β, the expected value of X is α/(α + β) and the variance is αβ/((α + β)²(α + β + 1)).
Dirichlet distribution
A column vector of dimension n has a Dirichlet distribution when:
pX1,...,Xn(x1, . . . , xn) = (Γ(∑_{i=1}^n αi) / ∏_{i=1}^n Γ(αi)) ∏_{i=1}^n xi^{αi−1},
when ∑_i xi = 1, and 0 otherwise.
The distribution is defined on a simplex of dimension n − 1.
The values {αi}_{i=1}^n, where αi > 0, are the parameters of the Dirichlet distribution.
This is a direct generalization of the beta distribution.
t distribution
X has a t distribution with n degrees of freedom (for n > 0) when
pX(x) = (Γ((n + 1)/2) / (Γ(n/2)√(nπ))) (1 + x²/n)^{−(n+1)/2}.
When n = 1, the distribution is called the Cauchy distribution: a distribution with undefined expected value and variance!
Important: if X has a Gaussian distribution with µ = 0 and σ² = 1, and Y is independent of X with a χ² distribution with n degrees of freedom, then X/√(Y/n) has a t distribution with n degrees of freedom.
Part XII
1 Multivariate densities.
Multivariate densities
For two variables X and Y,
FX,Y(x, y) = P(X ≤ x, Y ≤ y)
and
pX,Y(x, y) = ∂²FX,Y(x, y)/∂x∂y.
Then
P((X, Y) ∈ D) = ∫∫_D pX,Y(x, y) dx dy.
...and similarly for any number of variables.
Gaussian vector
A column vector X of dimension n has a Gaussian joint probability density when:
pX(x) = (1/√((2π)^n det P)) exp(−(x − µ)ᵀ P⁻¹ (x − µ)/2),
where µ is a vector and P is a square matrix of appropriate dimensions.
For a Gaussian vector, we have:
E[X] = µ;
V[X] = E[(X − E[X])(X − E[X])ᵀ] = P.
Part XIII: Functions
1 Functions of random variables.
2 Expected values of functions.
Functions of a random variable
Example: if Y = aX + b, with a > 0, then:
Y ≤ y ⇔ X ≤ (y − b)/a,
and then:
FY(y) = P(Y ≤ y) = P(X ≤ (y − b)/a),
so FY(y) = FX((y − b)/a).
Functions of a random variable
Example: if Z = X + Y, then:
FZ(z) = P(Z ≤ z) = P(X + Y ≤ z) = ∫_{−∞}^{∞} ∫_{−∞}^{z−y} pX,Y(x, y) dx dy.
Linear combinations and conditioning
For a Gaussian vector X and a vector Y of dimension m such that Y = AX, where A is a matrix, we have:
pY(y) = (1/√((2π)^m det Q)) exp(−(y − ν)ᵀ Q⁻¹ (y − ν)/2),
where ν = Aµ and Q = APAᵀ.
If X and Y are jointly Gaussian vectors, pX|Y(x|y) is also Gaussian with
E[X|Y] = E[X] + Cov[X, Y] V[Y]⁻¹ (Y − E[Y]),
V[X|Y] = V[X] − Cov[X, Y] V[Y]⁻¹ Cov[Y, X].
Also, we have: if Y = f(X) and X has density pX(x), then
E[Y] = E[f(X)] = ∫ f(x) pX(x) dx.
Concepts of covariance, correlation, etc., are defined just as for discrete random variables.
Part XIV: Important concepts
1 Conditioning.
2 Independence.
Usual solution for conditioning
Introduce:
pX|Y(x|y) = pX,Y(x, y) / pY(y) = pX,Y(x, y) / ∫ pX,Y(x, y) dx.
Then:
E[X|Y] = ∫ x pX|Y(x|Y) dx.
Independence: basic definition
Consider random variables X1, X2, . . . , Xn. When all distributions have densities, these random variables are independent if
p(X1, . . . , Xn) = ∏_{i=1}^n p(Xi).
Part XV: Some advanced concepts
1 Kolmogorovian conditioning.
2 Other definitions of independence.
3 Convergence.
4 Laws of large numbers.
5 Exchangeability.
6 Central limit theorem.
7 Bayesian consensus.
Conditioning, again
Conditional expectation:
E[X|A] = E[X I_A] / P(A)
whenever P(A) > 0.
Now consider E[X|Y]. If Ω is uncountable, E[X|Y = y] may face the difficulty that one can have P(Y = y) = 0 for all values of Y.
So, it is hard to define P(X|Y) as a function of X and Y.
Real thing: Kolmogorovian conditioning
Definition: E[X|Y] is a random variable that is "Y-measurable" and such that
E[f(Y)(X − E[X|Y])] = 0
for any function f(Y).
This is not simple!
Usually, a proper density pX|Y(x|y) exists such that "probabilities" P(A(X)|B(Y)) can be calculated.
Independence, again
Variables {Xi}_{i=1}^n are independent if
E[fi(Xi) | ∩_{j≠i} {Xj ∈ Aj}] = E[fi(Xi)],
for all functions fi(Xi) and all events ∩_{j≠i} {Xj ∈ Aj} with positive probability.
Equivalently, for all functions fi(Xi),
E[∏_{i=1}^n fi(Xi)] = ∏_{i=1}^n E[fi(Xi)].
Equivalently, for all sets of events {Ai}_{i=1}^n,
P(∩_{i=1}^n {Xi ∈ Ai}) = ∏_{i=1}^n P(Xi ∈ Ai).
Convergence
There are many notions of convergence for random variables. Consider a sequence of random variables Xi defined in the same probability space (Ω, F, P).
1 Convergence in distribution (function): lim_{n→∞} FXn(x) = FX(x).
2 Convergence in probability: lim_{n→∞} P(|Xn − X| ≥ ε) = 0.
3 Almost sure convergence (with probability one): P(lim_{n→∞} Xn = X) = 1.
Review: Law of Large Numbers
Consider an infinite sequence of independent variables with expectation µ and variance σ².
Define X̄ = (∑_{i=1}^n Xi)/n.
Weak law of large numbers:
lim_{n→∞} P(|X̄ − µ| ≥ ε) = 0.
Strong law of large numbers:
P(lim_{n→∞} X̄ = µ) = 1.
Central limit theorem
Take a sequence of n independent random variables Xi with mean µi and variance σi².
Consider the random variable X = ∑_i Xi; then E[X] = ∑_i µi and the variance is σ² = ∑_i σi².
If we define
Z = (X − µ)/σ, where µ = E[X],
then the distribution of Z tends to a Gaussian distribution with expectation 0 and variance 1 as n → ∞.
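A simulation sketch of the theorem in Python, in the i.i.d. uniform case (all constants are our choices):

```python
import random

random.seed(2)
# CLT sketch: standardized sums of uniforms look standard normal.
n, trials = 50, 5000
mu, sigma2 = 0.5, 1.0 / 12.0   # mean and variance of Uniform(0, 1)

zs = []
for _ in range(trials):
    s = sum(random.random() for _ in range(n))
    z = (s - n * mu) / (n * sigma2) ** 0.5   # standardize the sum
    zs.append(z)

# About 68% of a standard normal lies within one standard deviation.
frac = sum(abs(z) <= 1 for z in zs) / trials
assert abs(frac - 0.683) < 0.03
```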
Exchangeability
1 Binary variables X1, X2, . . . are exchangeable if P(X1, X2, . . .) does not change if we just change the order of variables.
2 If X1, X2, . . . are exchangeable, then
P(k ones in n selected variables)
can always be written as
∫ (n choose k) θ^k (1 − θ)^{n−k} p(θ) dθ.
(This is De Finetti's representation theorem.)
3 Note the deep implications of exchangeability!
Bayesian consensus
1 Suppose that n Bayesians have different priors Pi(θ).
2 Suppose they all observe X1, X2, . . . .
3 Suppose all Xi are independent and identically distributed as P(Xi|θ*).
4 Then all Bayesians will reach an identical posterior P(θ|X1, X2, . . .) that is infinitely concentrated around θ*.
Summary
1 Kolmogorov's theory: probability space (Ω, F, P), where Ω is a general possibility space, F is a σ-algebra, and P is a non-negative, normalized-to-unity and countably additive set-function (a normalized measure).
2 Random variables are F-measurable functions, expectations are Lebesgue integrals (there are many univariate and multivariate densities!).
3 Conditional density p(X|Y) is given by p(X, Y)/p(Y); independence means "conditional equal to unconditional" or "factorization" (actually only the latter).
4 Convergence concepts are important, with many results: laws of large numbers, central limit theorems...