
Introduction to Probability Theory

Fabio G. Cozman - fgcozman@usp.br

August 29, 2018

Part I: Basic Concepts for Finite Spaces

1 Possibility/sample space, outcomes, events

2 Variables and indicator functions

3 Probabilities, expectations

4 Properties of probabilities

Possibility/sample space

1 Possibility/sample space: set Ω.

2 Elements ω of Ω are outcomes.

3 Subsets of Ω are events (no fuzziness!).

Example

Two coins are tossed; each coin can come up heads (H) or tails (T). Then Ω = {HH, HT, TH, TT}. Consider three events. Event A = {HH} is the event that both tosses produce heads. Event B = {HH, TT} is the event that both tosses produce identical outcomes. Event C = {HH, TH} is the event that the second toss yields heads. Note that A = B ∩ C.

Probability measure (finite spaces!)

A probability measure is a function that assigns a probability value to each event.

PU1 For any event A, P(A) ≥ 0.

PU2 The space Ω has probability one: P(Ω) = 1.

PU3 If events A and B are disjoint (that is, A ∩ B = ∅), then P(A ∪ B) = P(A) + P(B).

Easy example

Example

A six-sided die is rolled. Suppose all outcomes of Ω are assigned precise and identical probability values.

We must have ∑_{ω∈Ω} P(ω) = P(Ω) = 1, thus P(ω) = 1/6 for every outcome.

The event A = {1, 3, 5} (outcome is odd) has probability P(A) = 1/2.

The event B = {2, 3, 5} (outcome is prime) has probability P(B) = 1/2, and P(A ∩ B) = P({3, 5}) = 1/3.
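A quick check of this example in Python (an illustrative sketch, not part of the original slides), treating events as subsets of Ω and the measure as a dictionary of outcome probabilities:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
P = {w: Fraction(1, 6) for w in omega}       # P(ω) = 1/6 for every outcome

def prob(event):
    """Probability of an event, i.e. a subset of Ω."""
    return sum(P[w] for w in event)

A = {1, 3, 5}        # outcome is odd
B = {2, 3, 5}        # outcome is prime

print(prob(A))       # 1/2
print(prob(B))       # 1/2
print(prob(A & B))   # P({3, 5}) = 1/3
```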

Properties of probabilities

1 As A and Ac are disjoint and A ∪ Ac = Ω, we have P(A) + P(Ac) = P(Ω) = 1, so

P(A) = 1 − P(Ac).

2 P(∅) = 1 − P(∅c) = 1 − P(Ω) = 0.

3 P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

4 For n mutually disjoint events Bi,

P(∪_{i=1}^n Bi) = ∑_{i=1}^n P(Bi).

5 If events Bi form a partition of Ω,

P(A) = P(∪_{i=1}^n (A ∩ Bi)) = ∑_i P(A ∩ Bi).

Random variables

1 A function X : Ω → ℝ is usually called a random variable.

2 If X is a variable, then any function f : ℝ → ℝ defines a random variable f(X).

Example

The age in months of a person ω selected from a population Ω is a variable X. The same population can be used to define a different variable Y, where Y(ω) is the weight (rounded to the next kilogram) of a person ω selected from Ω. We can also have a random variable Z = X + Y.

Distributions

Possibility space Ω, variable X : Ω → ℝ.

Then: possibility space ΩX containing every possible value of X.

A probability measure on Ω induces a measure over subsets of ΩX:

P(X ∈ A) = P({ω ∈ Ω : X(ω) ∈ A}).

The induced measure on ΩX is usually called the distribution of X.

Expectations

Given a variable X, its expectation is

E[X] = ∑_{ω∈Ω} X(ω) P(ω) = ∑_x x P(X = x).

An expectation functional yields a real number for each variable. Properties:

For constants α and β, if α ≤ X ≤ β, then α ≤ E[X] ≤ β.

E[X + Y] = E[X] + E[Y].

The variance is E[(X − E[X])²].
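A minimal Python sketch (not from the slides) computing these quantities on the two-coin space from the earlier example; X counts heads and Y indicates heads on the second toss, both illustrative choices:

```python
from fractions import Fraction

omega = ["HH", "HT", "TH", "TT"]
P = {w: Fraction(1, 4) for w in omega}

X = {w: w.count("H") for w in omega}             # number of heads
Y = {w: int(w[1] == "H") for w in omega}         # indicator: second toss is heads

def E(Z):
    """Expectation of a variable given as a dict ω -> Z(ω)."""
    return sum(Z[w] * P[w] for w in omega)

EX = E(X)
print(EX)                                        # E[X] = 1
print(E({w: X[w] + Y[w] for w in omega}))        # E[X + Y] = 3/2 = E[X] + E[Y]
print(E({w: (X[w] - EX) ** 2 for w in omega}))   # variance of X = 1/2
```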

Part II: A Bit of History

1 Old times.

2 Classical probabilities.

3 Frequentist and Bayesian schemes.

Brief Look: History of Probabilities

Classical: Leibniz, Fermat, Pascal, De Moivre (1600); Bayes (1700); Laplace (1800); modifications: Keynes, Jeffreys, Jaynes (1940).

Frequentist: Venn, Boole, De Morgan (1850); Fisher, Neyman/Pearson (1900).

Bayesian: Ramsey, De Finetti (1930); Savage (1950).

The Classical Theory: Ancient Time

First thoughts appeared in Philosophy:

Aristotle: “the probable is that which for the most part happens”

Bishop Butler: “to us probability is the very guide of life”

Also, many philosophers have used probability to prove the existence of God (e.g., the proof of the ecliptic)

The Classical Theory: Evolution

Pascal, De Moivre, Bernoulli: central limit theorem, law of large numbers.

Thomas Bayes: what you believe depends on what you believed before; we need prior distributions.

Bayes’ rule: P(A|B) = P(B|A) P(A) / P(B).

The Classical Theory: Laplace

Probability is the ratio of the number of favorable cases to that of all the cases possible.

The Principle of Non-Sufficient Reason: two possible cases are equally probable if there is no reason to prefer one to the other.

The Classical Theory: Difficulties

The great problem: the Principle of Non-Sufficient Reason.

Too many proofs from too little knowledge.

The problem of reparameterizations: if you are not sure about x, you are not sure about x². How can we express that?

Now Come the Frequentists

Basic idea: instead of using ignorance, let’s use knowledge.

Let’s define probability as the limiting relative frequency of observations:

P(A) = lim_{n→∞} n_A / n.

Venn, Boole and De Morgan proposed it around 1850; Statistics was built upon this conception of probabilities.

The Frequentist Theory: Difficulties

The definition is too poor compared to what we want.

It is impossible to talk about probabilities for things that will happen only once!

More mathematically, how to use the limit in the definition (lim_{n→∞} n_A/n)?

Many deterministic sequences have limits; do random sequences have limits?

A Brief Summary So Far

Classical Theory: probability is the ratio of favorable cases to the number of cases (Principle of Non-Sufficient Reason). Problem: the Principle of Non-Sufficient Reason is untenable.

Frequentist Theory: probability is a limiting relative frequency. Problems: too narrow a concept; hard to define mathematically.

The Emergence of Subjectivism

Since everything else seems to fail, why don’t we admit that there is a component of subjectivism in probability?

Ramsey/De Finetti’s groundbreaking idea: let’s define probability as a “fair” betting strategy:

I’ll give you 1 unit of currency if President X is re-elected.
How much would you pay, as a “fair” price, for this bet?
The amount you pay is your probability that X is re-elected.

The Bayesian Theory: Savage’s Idea

Axiomatize preferences over “gambles”.

From preferences, obtain “money” (utility) and probabilities.

Result: if f ≺ g, then there is a probability measure P and a utility function U such that E[U(f)] < E[U(g)].

The Bayesian Theory: Basics

All forms of uncertainty are reduced to probability.

Judgements of uncertainty are reduced to preferences.

All forms of updating knowledge are equivalent to application of Bayes’ rule.

Frequentists Versus Bayesians

Bayesians:

Induction is a solved problem: you define your prior, you collect data, and then you apply Bayes’ rule, always following decision theory.

Challenges: basically subjective (annoying priors).

Frequentists:

Induction is an ad hoc activity; Statistics furnishes useful tools for induction.

Some tools: significance testing, hypothesis testing, least squares...

Challenges: based on shaky foundations; piecemeal and ad hoc approach to problems.

Part III

1 Moments, variance, covariance.

2 Weak laws of large numbers.

Moments

Definition

The ith moment of X is the expectation E[X^i].

Definition

The ith central moment of X is the expectation E[(X − E[X])^i].

Definition

The variance V[X] of X is the second central moment of X.

Note:

V[X] = E[(X − E[X])²] = E[X²] − E[X]².

Markov inequality

1 Suppose X ≥ 0 and t > 0. If X(ω) < t, then I_{X≥t}(ω) = 0 and X(ω)/t ≥ 0 = I_{X≥t}(ω). If X(ω) ≥ t, then X(ω)/t ≥ 1 = I_{X≥t}(ω). Consequently X/t ≥ I_{X≥t}, hence E[X]/t ≥ E[I_{X≥t}] = P(X ≥ t), so:

P(X ≥ t) ≤ E[X] / t.

2 Chebyshev inequality:

P(|X − E[X]| ≥ t) ≤ V[X] / t².
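A quick Monte Carlo check (illustrative only, not a proof) of both inequalities, using an exponential variable with mean 2 as an arbitrary nonnegative example; the threshold t = 5 is also arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)     # X >= 0 with E[X] = 2, V[X] = 4

t = 5.0
# Markov: P(X >= t) <= E[X]/t
print(np.mean(x >= t), "<=", x.mean() / t)
# Chebyshev: P(|X - E[X]| >= t) <= V[X]/t^2
print(np.mean(np.abs(x - x.mean()) >= t), "<=", x.var() / t**2)
```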

Digression: Covariance

Definition

The covariance of variables X and Y is Cov(X, Y) = E[(X − E[X])(Y − E[Y])].

If two variables X and Y are such that Cov(X, Y) = 0, then X and Y are uncorrelated.

Very weak law of large numbers

Theorem

If variables X1, X2, ..., Xn have expectations E[Xi] ∈ [µ_min, µ_max] and variances V[Xi] ∈ [σ²_min, σ²_max], and Xi and Xj are uncorrelated for every i ≠ j, then for any ε > 0,

P( µ_min − ε < (∑_i Xi)/n < µ_max + ε ) ≥ 1 − σ²_max / (n ε²).

Weak law of large numbers

Theorem

If variables X1, X2, ... have expectations E[Xi] = µ and variances V[Xi] = σ², and Xi and Xj are uncorrelated for every i ≠ j, then for any ε > 0,

lim_{n→∞} P( |(∑_i Xi)/n − µ| < ε ) = 1.

Philosophy behind the “law”

Idea: irregularities observed in the Xi do not affect the average of these variables.

We should have regularity out of apparent chaos: even though the random variables behave randomly, their average does approach some meaningful number (the probability...).

Suggests the “definition”: P(A) = lim_{n→∞} #A/n.
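A short simulation (an illustrative sketch, not from the slides) of this frequentist picture: the relative frequency of the event "die shows an odd face" approaches P(A) = 1/2 as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)
rolls = rng.integers(1, 7, size=100_000)        # fair die rolls
hits = (rolls % 2 == 1)                         # event A: odd outcome

for n in (10, 100, 1_000, 100_000):
    print(n, hits[:n].mean())                   # relative frequency n_A / n, approaching 0.5
```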

Part IV: Conditioning

1 Bayes rule.

2 Theorem of total probabilities.

Conditioning: Bayes rule

Definition

If P(B) > 0, then

P(A|B) = P(A ∩ B) / P(B).

Definition

The conditional expectation of X given B, denoted by E[X|B], is defined only if P(B) > 0, as

E[X|B] = ∑_x x P(X = x|B) = E[I_B X] / P(B).

Basic facts

For any C such that P(C ) > 0:

For any A, P(A|C ) ≥ 0.

P(Ω|C ) = 1.

If A ∩ B = ∅, then P(A ∪ B|C) = P(A|C) + P(B|C).

Note that P(A|A) = 1 whenever P(A) > 0.

Properties

1 E[X|B] = ∑_{ω∈B} X(ω) P(ω|B).

2 For events {Bi}_{i=1}^n,

P(B1 ∩ B2 ∩ · · · ∩ Bn) = P(B1) ∏_{i=2}^n P(Bi | B1 ∩ · · · ∩ B_{i−1}).

More properties

1 Total probability theorem: if events Bi form a partition of Ω such that all P(Bi) > 0,

P(A) = ∑_i P(A ∩ Bi) = ∑_i P(A|Bi) P(Bi).

2 Then:

P(Bi|A) = P(A|Bi) P(Bi) / ∑_j P(A|Bj) P(Bj).

Example

1 Some individuals in an office have a disease D; a test to detect the disease yields result R or Rc.

2 P(R |D) = 9/10 (sensitivity of the test).

3 P(Rc |Dc) = 4/5 (specificity of the test).

4 P(D) = 1/9.

5 Then:

P(D|R) = (9/10 × 1/9) / (9/10 × 1/9 + 1/5 × 8/9) = 9/25.
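The same computation done explicitly with Bayes' rule and exact fractions (an illustrative sketch):

```python
from fractions import Fraction

p_D = Fraction(1, 9)                 # P(D)
sens = Fraction(9, 10)               # P(R | D), sensitivity
spec = Fraction(4, 5)                # P(Rc | Dc), specificity

p_R = sens * p_D + (1 - spec) * (1 - p_D)    # total probability theorem
p_D_given_R = sens * p_D / p_R               # Bayes' rule
print(p_D_given_R)                           # 9/25
```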

Three-prisoners problem

1 Three prisoners, Teddy, Jay, and Mark, are waiting to be executed.

2 The governor will select one to be freed (with equal probability).

3 The warden knows the governor’s decision.

4 Teddy convinces the warden to say the name of one of his fellow inmates who will be executed (useless information...).

5 The warden is honest.

6 The warden says that Jay is to be executed: Teddy is happy (1/3 to 1/2)!

7 But if the warden had said Mark, Teddy would also be happy??

Analysis

1 Possibility space:

Ω = { Teddy freed ∩ warden says Jay,
      Teddy freed ∩ warden says Mark,
      Jay freed ∩ warden says Mark,
      Mark freed ∩ warden says Jay }.

2 We know that

P(Teddy freed) = P(Jay freed) = P(Mark freed) = 1/3.

3 How would the warden behave if Teddy is to be freed? That is, what is

P(warden says Jay | Teddy freed)?

Possible conclusion...

If

P(warden says Jay | Teddy freed) = 1/2,

then:

P(Teddy freed ∩ warden says Jay) = 1/6,

P(Teddy freed ∩ warden says Mark) = 1/6,

P(Jay freed ∩ warden says Mark) = 1/3,

P(Mark freed ∩ warden says Jay) = 1/3.

Hence

P(Teddy freed | warden names Jay) = (1/6) / (1/3 + 1/6) = 1/3.

Complete analysis...

1 The statement does not say anything about the behaviour of the warden.

2 All that is really known is

P(warden names Jay | Teddy freed) ∈ [0, 1].

3 Consequently

P(Teddy freed | warden names Jay) ∈ [ 0/(0 + 1/3), (1/3)/(1/3 + 1/3) ] = [0, 1/2].
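A small sketch (illustrative, not from the slides) that makes the dependence on the warden's behaviour explicit; the function name and the parameter q = P(warden says Jay | Teddy freed) are hypothetical labels:

```python
from fractions import Fraction

def posterior_teddy(q):
    """P(Teddy freed | warden names Jay) as a function of q."""
    teddy_jay = Fraction(1, 3) * q        # Teddy freed and warden says Jay
    mark_jay = Fraction(1, 3)             # Mark freed: warden must say Jay
    return teddy_jay / (teddy_jay + mark_jay)

print(posterior_teddy(Fraction(1, 2)))    # 1/3, the "possible conclusion"
print(posterior_teddy(Fraction(0)))       # 0, the lower bound
print(posterior_teddy(Fraction(1)))       # 1/2, the upper bound
```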

Part V: Probability mass functions

1 Mass functions.

2 Marginal probability mass functions.

3 Conditional probability mass functions.

4 Multivariate models.

Probability mass function

1 The probability mass function is simply pX(x) = P(X = x).

2 Then P(X ∈ A) = ∑_{x∈A} pX(x).

Example (Uniform distribution)

The uniform distribution for X assigns pX(x) = 1/k for every value x of X, where k is the number of values of X.

Example (Bernoulli distribution)

A binary variable X with values 0 and 1. The Bernoulli distribution with parameter p for X takes two values: pX(0) = 1 − p and pX(1) = p. Then E[X] = 0(1 − p) + 1p = p and V[X] = p(1 − p).

More on probability mass functions...

1 For Y = f(X),

pY(y) = P(Y = y) = ∑_{x∈ΩX : f(x)=y} pX(x),

pY(y) = P(Y = y) = ∑_{ω∈Ω : f(X(ω))=y} P(ω).

2 Conditional probability mass function: pX|B(x|B) = P(X = x|B).

3 Joint probability mass function pX,Y:

pX,Y(x, y) = P(X = x ∩ Y = y).

Marginal probability mass functions

pX(x) = P(X = x) = ∑_{y∈ΩY} P(X = x ∩ Y = y) = ∑_{y∈ΩY} pX,Y(x, y).

Example

X and Y with three values each and

pX,Y(x, y)   y = 1   y = 2   y = 3
x = 1        1/10    1/25    1/20
x = 2        1/20    1/5     1/25
x = 3        1/10    1/50    2/5
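Computing the marginal pX from this joint table, as a check (an illustrative sketch):

```python
from fractions import Fraction as F

p_xy = {
    (1, 1): F(1, 10), (1, 2): F(1, 25), (1, 3): F(1, 20),
    (2, 1): F(1, 20), (2, 2): F(1, 5),  (2, 3): F(1, 25),
    (3, 1): F(1, 10), (3, 2): F(1, 50), (3, 3): F(2, 5),
}
assert sum(p_xy.values()) == 1               # the joint pmf sums to one

# marginal of X: sum the joint pmf over the values of Y
p_x = {x: sum(p for (x2, _), p in p_xy.items() if x2 == x) for x in (1, 2, 3)}
for x in (1, 2, 3):
    print(x, p_x[x])                         # 19/100, 29/100, 13/25
```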

Expectations and mass functions

Finite spaces:

E[X] = ∑_{ω∈Ω} X(ω) P(ω)
     = ∑_{x∈ΩX} ∑_{ω : X(ω)=x} x P(ω)
     = ∑_{x∈ΩX} x ∑_{ω : X(ω)=x} P(ω)
     = ∑_{x∈ΩX} x P(X = x)
     = ∑_{x∈ΩX} x pX(x).

Conditional probability mass

For a variable X and event A such that P(A) > 0,

pX|A(x|A) = P(X = x|A).

For variables X and Y,

pX|Y(x|y) = P(X = x|Y = y).

Iterated expectations

Denote E[X|Y = y] by E[X|y]. Then E[X|Y] is a function of Y. For finite spaces:

E[X] = ∑_{x∈ΩX} x pX(x)
     = ∑_{x∈ΩX} x ∑_{y∈ΩY} pX,Y(x, y)
     = ∑_{x∈ΩX} ∑_{y∈ΩY} x pX|Y(x|y) pY(y)
     = ∑_{y∈ΩY} E[X|Y = y] pY(y)
     = E[E[X|Y]].

(A similar expression holds for infinite spaces.)
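A numerical check of E[X] = E[E[X|Y]] on the joint table from the marginalization example (an illustrative sketch):

```python
from fractions import Fraction as F

p_xy = {
    (1, 1): F(1, 10), (1, 2): F(1, 25), (1, 3): F(1, 20),
    (2, 1): F(1, 20), (2, 2): F(1, 5),  (2, 3): F(1, 25),
    (3, 1): F(1, 10), (3, 2): F(1, 50), (3, 3): F(2, 5),
}
xs, ys = (1, 2, 3), (1, 2, 3)

p_y = {y: sum(p_xy[x, y] for x in xs) for y in ys}            # marginal of Y
E_X = sum(x * p_xy[x, y] for x in xs for y in ys)             # E[X] directly
E_X_given_y = {y: sum(x * p_xy[x, y] for x in xs) / p_y[y] for y in ys}

print(E_X)                                                    # 233/100
print(sum(E_X_given_y[y] * p_y[y] for y in ys))               # E[E[X|Y]] = 233/100
```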

For sets of variables

1 Probability mass function, conditional, joint, marginal, etc.

2 Vectors:

X = [X1, ..., Xn]^T (a column vector),

pX(x) = P(X = x),

E[X] = [E[X1], ..., E[Xn]]^T.

Part VI: Independence

1 Independence for two events, for many events.

2 Independence for random variables.

3 Conditional independence.

Independence for events

1 A and B are independent when

P(A|B) = P(A) whenever P(B) > 0;

or, equivalently,

P(A ∩ B) = P(A) P(B).

2 Many events {Ai}_{i=1}^n are independent when, for every subset S of the events,

P(∩_{i∈S} Ai) = ∏_{i∈S} P(Ai).

(Pairwise independence is not enough!)
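A classic illustration of that warning (not taken from the slides): two fair coins and their exclusive-or are pairwise independent but not mutually independent.

```python
from fractions import Fraction
from itertools import product

P = {w: Fraction(1, 4) for w in product((0, 1), repeat=2)}   # two fair coins

def prob(pred):
    """Probability of the event described by the predicate pred."""
    return sum(p for w, p in P.items() if pred(w))

X = lambda w: w[0]          # first coin
Y = lambda w: w[1]          # second coin
Z = lambda w: w[0] ^ w[1]   # exclusive-or of the two coins

# Pairwise: P(X=1, Z=1) = 1/4 = P(X=1) P(Z=1), and similarly for the other pairs.
print(prob(lambda w: X(w) == 1 and Z(w) == 1),
      prob(lambda w: X(w) == 1) * prob(lambda w: Z(w) == 1))
# But P(X=1, Y=1, Z=1) = 0, not 1/8 = P(X=1) P(Y=1) P(Z=1): no mutual independence.
print(prob(lambda w: X(w) == 1 and Y(w) == 1 and Z(w) == 1))
```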

Independence for random variables

For all events such that the conditional probabilities are defined,

P(Xi = xi | ∩_{j≠i} {Xj = xj}) = P(Xi = xi);

that is,

p(xi | ∩_{j≠i} {Xj = xj}) = p(xi).

Or, more concisely:

p(x1, ..., xn) = ∏_{i=1}^n p(xi).

Conditional independence

1 (X ⊥⊥ Y | A) if

E[f(X) g(Y)|A] = E[f(X)|A] E[g(Y)|A]

for all functions f, g, whenever P(A) > 0.

2 (X ⊥⊥ Y | Z) if

(X ⊥⊥ Y | Z = z)

for every category z of Z such that P(Z = z) > 0.

Part VII: Laws of Large Numbers

1 Weak law.

2 Strong law.

Weak law of large numbers again

1 Independence implies uncorrelation: for i ≠ j,

E[(Xi − E[Xi])(Xj − E[Xj])] = E[Xi − E[Xi]] E[Xj − E[Xj]] = 0.

2 If independent variables X1, X2, ... have expectations E[Xi] = µ and variances V[Xi] = σ², then for any ε > 0,

lim_{n→∞} P( |(∑_i Xi)/n − µ| < ε ) = 1.

3 There are variants: assuming no variance, assuming expectations change, etc.

Advanced: strong law of large numbers

In a sequence of variables X1, ..., Xn with expectation µ, the mean converges to the expectation with probability one:

P( lim_{n→∞} (∑_{i=1}^n Xi)/n = µ ) = 1.

1 It requires the theory of infinite spaces.

2 It is hard to prove and requires several assumptions.

3 It is really a strong result.

Part VIII: General Spaces

1 Infinities.

2 General axioms.

Infinite spaces

So far, Ω has been a finite set:
1 random variables have finitely many values;
2 a probability mass function specifies a distribution through finitely many values.

Now suppose Ω is an infinite set; Ω may be
1 countable (natural, odd, integer, rational numbers), or
2 uncountable (real numbers).

Kolmogorov’s axioms

PU1 For any event A, P(A) ≥ 0.

PU2 The space Ω has probability one: P(Ω) = 1.

PU3 If events A and B are disjoint (that is, A ∩ B = ∅), then P(A ∪ B) = P(A) + P(B).

PU4 If a sequence of events Ai is such that lim_{n→∞} ∩_{i=1}^n Ai = ∅, then lim_{n→∞} P(∩_{i=1}^n Ai) = 0.

Equivalent axioms (countable additivity)

P1 For any event A, P(A) ≥ 0.

P2 The space Ω has probability one: P(Ω) = 1.

P3 If countably many events {Ai}_{i=1}^∞ are disjoint, then P(∪_i Ai) = ∑_{i=1}^∞ P(Ai).

The last axiom introduces countable additivity.

Example with discrete variable

Suppose X has integer values 0, 1, 2, . . . .

Then X has a Poisson distribution with parameter λ > 0 when

P(X = x) = e^{−λ} λ^x / x!,

for x ≥ 0.

Part IX: Mathematics of Infinite Spaces

1 Fields and algebras; Borel algebra.

2 Measurability.

3 Lebesgue integration.

Digression: Measurability

Given an infinite Ω, we have to specify probability values for its subsets. All subsets?

1 It is impossible to define a countably additive probability measure over all subsets of the real numbers for which the probability of an interval [a, b] is b − a.

2 In fact, there are subsets of ℝ that are unmeasurable: a countably additive set-function cannot be defined on them such that [a, b] maps to b − a.

3 Ulam’s theorem: if a countably additive measure is defined over all subsets of the real numbers and vanishes on all singletons, it is identically zero.

Kolmogorov’s solution: fields

Consider first a finite set Ω. A field F is a nonempty set of subsets of Ω such that:
1 if A ∈ F, then Ac ∈ F;
2 if A ∈ F and B ∈ F, then A ∪ B ∈ F.

Note: if A is in a field, then Ac, ∅ and Ω are automatically in the field.

Example: {∅, A, B, Ac, Bc, A ∪ B, Ac ∪ B, A ∪ Bc, Ac ∪ Bc, A ∩ B, Ac ∩ B, A ∩ Bc, Ac ∩ Bc, (Ac ∩ B) ∪ (A ∩ Bc), (A ∪ Bc) ∩ (Ac ∪ B), Ω}.
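A small sketch (illustrative, not from the slides) of how such a field can be generated mechanically: close a collection of subsets under complement and union until nothing new appears. The helper name and the choice of Ω, A, B below are hypothetical.

```python
def generate_field(omega, seeds):
    """Smallest field on omega containing the seed sets."""
    field = {frozenset(s) for s in seeds} | {frozenset(), frozenset(omega)}
    changed = True
    while changed:
        changed = False
        current = list(field)
        for S in current:                      # close under complement
            comp = frozenset(omega) - S
            if comp not in field:
                field.add(comp)
                changed = True
        for S in current:                      # close under union
            for T in current:
                if S | T not in field:
                    field.add(S | T)
                    changed = True
    return field

omega = {1, 2, 3, 4}
A, B = {1, 2}, {2, 3}
print(len(generate_field(omega, [A, B])))      # 16 sets, as in the example above
```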

σ-fields

Now consider an infinite set Ω. A σ-field is a set of subsets of Ω such that:
1 if A ∈ F, then Ac ∈ F;
2 if Ai ∈ F for all i, then ∪_i Ai ∈ F.

Note that σ-fields are closed under countable unions.

Fields and algebras

Fields are also called algebras.

σ-fields are also called σ-algebras.

In fact, “algebra” seems to be a better term (there are other meanings for the word “field” that do not apply here...).

Terminology is confusing!

Borel algebras

“The minimal σ-algebra containing the open sets/compact sets of a topological space Ω.”

The Borel algebra for the real numbers: the smallest σ-algebra on ℝ that contains the intervals.

The elements of a Borel algebra are the Borel sets.

Consequences of countable additivity

No way to extend arbitrary assessments over arbitrary spaces.

No uniform distribution on the integers.

BUT: countable additivity basically allows us to use integrals to compute expectations!

VERY important: there is a unique probability measure that corresponds to an expectation (and vice-versa)!

End of digression

A probability space is a triple (Ω,F ,P), where

Ω is a set (possibility space).

F is a σ-algebra on Ω.

P is a probability measure on F; that is, a non-negative, normalized (to unity), countably additive set-function.

Note: in almost all books on probability theory, the probability space takes the real numbers and their Borel algebra.

Extending the previous theory

A variable with finitely many values is called simple. We know how to relate expectation and probability for those.

Take two possible approximations for E[X]:

E[X] ≈ sup {E[Y] : Y ≤ X, Y is simple},

E[X] ≈ inf {E[Z] : Z ≥ X, Z is simple}.

In fact, for many random variables, both approximations coincide, and then

E[X] = sup {E[Y] : Y ≤ X, Y is simple}.

The problem is to characterize these variables, and the properties of this functional.

Measurable random variables

A function f : Ω → ℝ is F-measurable with respect to an algebra F when, for every α ∈ ℝ, the set

{ω : f(ω) ≤ α}

belongs to F.

Note: there are more general definitions in the literature...

A random variable X is measurable when it is a measurable function.

The Lebesgue integral

A variable with finitely many values is called simple. Take two possible approximations for E[X]:

E[X] ≈ sup {E[Y] : Y ≤ X, Y is simple},

E[X] ≈ inf {E[Z] : Z ≥ X, Z is simple}.

In fact, given an expectation/measure, we have

E[X] = sup {E[Y] : Y ≤ X, Y is simple}.

This quantity is the Lebesgue integral with respect to the probability measure P. Notation:

E[X] = ∫ X dP.


Discrete random variables

Suppose X has an enumerable set of values. Then

E[X] = ∑_x x P(X = x).

Example: X has a Poisson distribution with parameter λ > 0 when

P(X = x) = e^{−λ} λ^x / x!,

for integer x ≥ 0; then

E[X] = λ and V[X] = λ.

The Riemann integral

Under quite general conditions, the Lebesgue integral can be computed using the Riemann integral (the “usual” integral).

The key idea is to define densities and then to integrate with respect to densities.

Part X: Densities and the like

1 Cumulative distribution functions and densities.

2 A summary.

Cumulative distribution function

The function

FX(x) = P({ω : X(ω) ≤ x})

is the cumulative distribution function of X. Note:

FX is a non-negative, non-decreasing function, with FX(−∞) = 0 and FX(∞) = 1.

When X has a density pX, P(X ∈ [a, b]) = FX(b) − FX(a) = ∫_a^b pX(x) dx.

Densities

For a measurable variable X, the density of X is, when it exists:

pX(x) = dFX(x)/dx = d P({ω : X(ω) ≤ τ})/dτ evaluated at τ = x.

Then:

E[X] = ∫_{ΩX} x pX(x) dx,

where ΩX is the set of values of X, and the integral is the Riemann integral.

Summary

1 Kolmogorov’s theory: probability space (Ω, F, P), where Ω is a general possibility space, F is a σ-algebra, and P is a non-negative, normalized (to unity), countably additive set-function (a normalized measure).
The most common σ-algebra for the real numbers is the Borel algebra (intervals).
Random variables are F-measurable functions.
Expectations of measurable functions are Lebesgue integrals.

2 The distribution of X is entirely captured by FX(x), the cumulative distribution function.

3 If FX is continuous, the variable X is continuous, and we can differentiate FX(x) to obtain the density pX(x).

4 If the distribution of X has a density pX(x), then the expectation E[X] is a Riemann integral ∫ x pX(x) dx.

Part XI: Catalog of Distributions

1 Common densities.

2 De Moivre - Laplace’s theorem.

Uniform distribution

Suppose X is a real-valued variable.

The distribution of X is uniform if its density is

pX(x) = 1/(b − a) if x ∈ [a, b],

and pX(x) = 0 otherwise.

Gaussian distribution

X has a Gaussian distribution when

pX(x) = (1/√(2πσ²)) exp( −(x − µ)² / (2σ²) ).

E[X] = µ and V[X] = σ².

De Moivre - Laplace’s theorem

Take n and p ∈ [0, 1] such that n p (1 − p) ≫ 1; then

(n choose k) p^k (1 − p)^{n−k} ≈ exp( −(k − np)² / (2np(1 − p)) ) / √(2πnp(1 − p)).

(That is, the ratio of the two sides goes to 1.)

That is, the probability that k among n trials are positive can, when n grows without bound, be approximated by a Gaussian density.
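A quick numerical comparison (illustrative only) of the two sides, for n = 100 and p = 0.3, so that np(1 − p) = 21:

```python
import math

n, p = 100, 0.3
for k in (20, 25, 30, 35, 40):
    binom = math.comb(n, k) * p**k * (1 - p)**(n - k)          # exact binomial pmf
    gauss = (math.exp(-(k - n*p)**2 / (2*n*p*(1 - p)))
             / math.sqrt(2*math.pi*n*p*(1 - p)))               # Gaussian approximation
    print(k, round(binom, 4), round(gauss, 4))
```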

Gamma distribution

X has a gamma distribution with parameters α and β when

pX(x) = (β^α / Γ(α)) x^{α−1} e^{−βx},

for x > 0, and pX(x) = 0 otherwise.

Gamma function:

Γ(α) = ∫_0^∞ z^{α−1} e^{−z} dz.

Note: for any positive integer k, Γ(k) = (k − 1)!.

We have E[X] = α/β and the variance of X is α/β².

Gamma and exponential distributions

Important: if {Xi}_{i=1}^n are independent and have gamma distributions with parameters αi and β, then X = X1 + · · · + Xn has a gamma distribution with parameters α1 + · · · + αn and β.

If α = 1 and β > 0, then X has an exponential distribution with parameter β,

pX(x) = β e^{−βx},

for x > 0, and pX(x) = 0 otherwise.

Chi-square distribution

X has a chi-square distribution when

pX(x) = (1/(√2 Γ(1/2))) exp(−x/2)/√x

when x > 0, and pX(x) = 0 otherwise.

The χ² with n degrees of freedom:

pX(x) = (1/(2^{n/2} Γ(n/2))) x^{n/2−1} exp(−x/2)

when x > 0, and pX(x) = 0 otherwise. (This is a gamma distribution with α = n/2 and β = 1/2.)

If {Xi}_{i=1}^n are Gaussian variables with µ = 0 and σ² = 1, then X1² + · · · + Xn² has a χ² distribution with n degrees of freedom.

Beta distribution

Often used to model random variables that are limited to an interval.

X has a beta distribution with parameters α and β when

pX(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1},

for x ∈ [0, 1], and 0 otherwise.

Note: a beta density is proportional to x^{α−1}(1 − x)^{β−1}. If α = β = 1, we obtain the uniform distribution.

For a beta distribution pX(·) with parameters α and β, the expected value of X is α/(α + β) and the variance is αβ/((α + β)²(α + β + 1)).

Dirichlet distribution

A column vector of dimension n has a Dirichlet distribution when:

pX1,...,Xn(x1, ..., xn) = ( Γ(∑_{i=1}^n αi) / ∏_{i=1}^n Γ(αi) ) ∏_{i=1}^n xi^{αi−1}

when ∑_i xi = 1, and 0 otherwise.

The distribution is defined on a simplex of dimension n − 1.

The values {αi}_{i=1}^n, where αi > 0, are the parameters of the Dirichlet distribution.

This is a direct generalization of the beta distribution.

t distribution

X has a t distribution with n degrees of freedom (for n > 0) when

pX(x) = ( Γ((n + 1)/2) / (Γ(n/2) √(nπ)) ) (1 + x²/n)^{−(n+1)/2}.

When n = 1, the distribution is called the Cauchy distribution: it is a distribution with undefined expected value and variance!

Important: if X has a Gaussian distribution with µ = 0 and σ² = 1, and Y, independent of X, has a χ² distribution with n degrees of freedom, then X/√(Y/n) has a t distribution with n degrees of freedom.

Part XII

1 Multivariate densities.

Multivariate densities

For two variables X and Y,

FX,Y(x, y) = P(X ≤ x, Y ≤ y)

and

pX,Y(x, y) = ∂²FX,Y(x, y) / ∂x ∂y.

Then

P((X, Y) ∈ D) = ∫∫_D pX,Y(x, y) dx dy.

...and similarly for any number of variables.

Gaussian vector

A column vector X of dimension n has a Gaussian joint probability density when:

pX(x) = ( 1 / √((2π)^n det P) ) exp( −(x − µ)^T P^{−1} (x − µ) / 2 ),

where µ is a vector and P is a square matrix of appropriate dimensions.

For a Gaussian vector, we have:

E[X] = µ;  V[X] = E[(X − E[X])(X − E[X])^T] = P.

Part XIII: Functions

1 Functions of random variables.

2 Expected values of functions.

Functions of a random variable

Example: if Y = aX + b, with a > 0, then:

f(X) ≤ y  if and only if  X ≤ (y − b)/a,

and then:

FY(y) = P(f(X) ≤ y) = P( X ≤ (y − b)/a ),

so FY(y) = FX((y − b)/a).

Functions of a random variable

Example: If Z = X + Y , then:

FZ(z) = P(Z ≤ z) = P(X + Y ≤ z) = ∫_{−∞}^{∞} ∫_{−∞}^{z−y} pX,Y(x, y) dx dy.

Linear combinations and conditioning

For a Gaussian vector X and a vector Y of dimension m such that Y = AX, where A is a matrix, we have:

pY(y) = ( 1 / √((2π)^m det Q) ) exp( −(y − ν)^T Q^{−1} (y − ν) / 2 ),

where ν = Aµ and Q = A P A^T.

If X and Y are jointly Gaussian vectors, pX|Y(x|y) is also Gaussian, with

E[X|Y] = E[X] + Cov[X, Y] V[Y]^{−1} (Y − E[Y]),

V[X|Y] = V[X] − Cov[X, Y] V[Y]^{−1} Cov[Y, X].

Also, we have: if Y = f(X) and X has density pX(x), then

E[Y] = E[f(X)] = ∫ f(x) pX(x) dx.

Concepts of covariance, correlation, etc., are defined just as for discrete random variables.
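A small numpy sketch of the conditional-Gaussian formulas above, for a hypothetical two-dimensional Gaussian vector (X, Y); the numbers below are illustrative, not from the slides:

```python
import numpy as np

mu = np.array([1.0, 2.0])                 # [E[X], E[Y]]
P = np.array([[4.0, 3.0],                 # [[V[X],      Cov[X, Y]],
              [3.0, 9.0]])                #  [Cov[Y, X], V[Y]     ]]

y_obs = np.array([5.0])                   # observed value of Y
cov_xy, var_y = P[:1, 1:], P[1:, 1:]

mean_x_given_y = mu[:1] + cov_xy @ np.linalg.inv(var_y) @ (y_obs - mu[1:])
var_x_given_y = P[:1, :1] - cov_xy @ np.linalg.inv(var_y) @ P[1:, :1]

print(mean_x_given_y)   # [2.]
print(var_x_given_y)    # [[3.]]
```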

Part XIV: Important concepts

1 Conditioning.

2 Independence.

Usual solution for conditioning

Introduce:

pX|Y(x|y) = pX,Y(x, y) / pY(y) = pX,Y(x, y) / ∫ pX,Y(x, y) dx.

Then:

E[X|Y] = ∫ x pX|Y(x|Y) dx.

Independence: basic definition

Consider random variables X1, X2, ..., Xn. These random variables are independent when, assuming all distributions have densities,

p(X1, ..., Xn) = ∏_{i=1}^n p(Xi).

Part XV: Some advanced concepts

1 Kolmogorovian conditioning.

2 Other definitions of independence.

3 Convergence.

4 Laws of large numbers.

5 Exchangeability.

6 Central limit theorem.

7 Bayesian consensus.

Conditioning, again

Conditional expectation:

E[X|A] = E[X I_A] / P(A)

whenever P(A) > 0.

Now consider E[X|Y]. If Ω is uncountable, E[X|Y = y] may face the difficulty that one can have P(Y = y) = 0 for all values of Y. So it is hard to define P(X|Y) as a function of X and Y.

Real thing: Kolmogorovian conditioning

Definition: E[X|Y] is a random variable that is “Y-measurable” and such that

E[f(Y)(X − E[X|Y])] = 0

for any function f(Y).

This is not simple!

Usually, a proper density pX|Y(x|y) exists such that “probabilities” P(A(X)|B(Y)) can be calculated.

Independence, again

Variables {Xi}_{i=1}^n are independent if

E[fi(Xi) | ∩_{j≠i} {Xj ∈ Aj}] = E[fi(Xi)],

for all functions fi(Xi) and all events ∩_{j≠i} {Xj ∈ Aj} with positive probability.

For all functions fi(Xi),

E[ ∏_{i=1}^n fi(Xi) ] = ∏_{i=1}^n E[fi(Xi)].

For all sets of events {Ai}_{i=1}^n,

P(∩_{i=1}^n {Xi ∈ Ai}) = ∏_{i=1}^n P(Xi ∈ Ai).

Convergence

There are many notions of convergence for random variables. Consider a sequence of random variables Xi defined on the same probability space (Ω, F, P).

1 Convergence in distribution (in distribution function): lim_{n→∞} FXn(x) = FX(x) at every continuity point x of FX.

2 Convergence in probability: lim_{n→∞} P(|Xn − X| ≥ ε) = 0.

3 Almost sure convergence (with probability one): P(lim_{n→∞} Xn = X) = 1.

Review: Law of Large Numbers

Consider an infinite sequence of independent variables with expectation µ and variance σ².

Define X̄ = (∑_{i=1}^n Xi)/n.

Weak law of large numbers:

lim_{n→∞} P( |X̄ − µ| ≥ ε ) = 0.

Strong law of large numbers:

P( lim_{n→∞} X̄ = µ ) = 1.

Central limit theorem

Take a sequence of n independent random variables Xi with means µi and variances σi².

Consider the random variable X = ∑_i Xi; then E[X] = µ = ∑_i µi and the variance is σ² = ∑_i σi².

If we define

Z = (X − µ)/σ,

then the distribution of Z tends to a Gaussian distribution with expectation 0 and variance 1 as n → ∞.
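A Monte Carlo illustration (not from the slides): standardized sums of i.i.d. uniform variables look approximately standard Gaussian; the sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 50, 100_000
x = rng.uniform(0.0, 1.0, size=(trials, n))          # each has mean 1/2, variance 1/12

z = (x.sum(axis=1) - n * 0.5) / np.sqrt(n / 12.0)    # standardized sums
print(z.mean(), z.std())                             # close to 0 and 1
print(np.mean(np.abs(z) <= 1.96))                    # close to 0.95, as for a standard Gaussian
```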

Exchangeability

1 Binary variables X1, X2, . . . are exchangeable if P(X1, X2, . . .) does not change if we just change the order of the variables.

2 If X1, X2, . . . are exchangeable, then

P(k ones in n selected variables)

can always be written as

∫ (n choose k) θ^k (1 − θ)^{n−k} p(θ) dθ.

(This is De Finetti’s representation theorem.)

3 Note the deep implications of exchangeability!

Bayesian consensus

1 Suppose that n Bayesians have different priors Pi(θ).

2 Suppose they all observe X1, X2, . . . .

3 Suppose all Xi are independent and identically distributed according to P(Xi|θ*).

4 Then all Bayesians will reach an identical posterior P(θ|X1, X2, . . .) that is infinitely concentrated around θ*.

Summary

1 Kolmogorov’s theory: probability space (Ω, F, P), where Ω is a general possibility space, F is a σ-algebra, and P is a non-negative, normalized (to unity), countably additive set-function (a normalized measure).

2 Random variables are F-measurable functions, and expectations are Lebesgue integrals (there are many univariate and multivariate densities!).

3 The conditional density p(X|Y) is given by p(X, Y)/p(Y); independence means “conditional equal to unconditional” or “factorization” (actually only the latter in general).

4 Convergence concepts are important, with many results: laws of large numbers, central limit theorems...