Probability Theory for Machine Learning


Transcript of Probability Theory for Machine Learning

Page 1: Probability Theory for Machine Learning

Probability Theory for Machine Learning

Jesse Bettencourt

September 2018

Introduction to Machine Learning

CSC411

University of Toronto

Page 2: Probability Theory for Machine Learning

Introduction to Notation

Page 3: Probability Theory for Machine Learning

Motivation

Uncertainty arises through:

• Noisy measurements

• Finite size of data sets

• Ambiguity

• Limited model complexity

Probability theory provides a consistent framework for the

quantification and manipulation of uncertainty.


Page 4: Probability Theory for Machine Learning

Sample Space

Sample space Ω is the set of all possible outcomes of an

experiment.

Observations ω ∈ Ω are points in the space, also called sample outcomes, realizations, or elements.

Events E ⊂ Ω are subsets of the sample space.


Page 5: Probability Theory for Machine Learning

Sample Space Coin Example

In this experiment we flip a coin twice:

Sample space (all outcomes): Ω = {HH, HT, TH, TT}

Observation: ω = HT is a valid sample since ω ∈ Ω

Event (both flips the same): E = {HH, TT} is a valid event since E ⊂ Ω
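A minimal sketch of this example in Python, enumerating the two-flip sample space and checking the event:

```python
from itertools import product

# Sample space for two coin flips: Omega = {HH, HT, TH, TT}
omega = {"".join(flips) for flips in product("HT", repeat=2)}

# Event E: both flips are the same
E = {w for w in omega if w[0] == w[1]}

print(sorted(omega))  # ['HH', 'HT', 'TH', 'TT']
print(sorted(E))      # ['HH', 'TT']
print(E <= omega)     # True: E is a subset of Omega, so it is a valid event
```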


Page 6: Probability Theory for Machine Learning

Probability

Page 7: Probability Theory for Machine Learning

Probability

The probability of an event E, P(E), satisfies three axioms:

1: P(E) ≥ 0 for every E

2: P(Ω) = 1

3: If E_1, E_2, . . . are disjoint then

P(⋃_{i=1}^∞ E_i) = ∑_{i=1}^∞ P(E_i)


Page 8: Probability Theory for Machine Learning

Joint and Conditional Probabilities

Joint Probability of A and B is denoted P(A,B)

Conditional Probability of A given B is denoted P(A|B).

• Assuming P(B) > 0, then P(A|B) = P(A,B)/P(B)

• Product Rule: P(A,B) = P(A|B)P(B) = P(B|A)P(A)


Page 9: Probability Theory for Machine Learning

Conditional Example

60% of ML students pass the final and 45% of ML students pass

both the final and the midterm.

What percent of students who passed the final also passed the

midterm?


Page 10: Probability Theory for Machine Learning

Conditional Example

60% of ML students pass the final and 45% of ML students pass

both the final and the midterm.

What percent of students who passed the final also passed the

midterm?

Reword: What percent passed the midterm given they passed the

final?

P(M|F) = P(M,F)/P(F)
       = 0.45/0.60
       = 0.75


Page 11: Probability Theory for Machine Learning

Independence

Events A and B are independent if P(A,B) = P(A)P(B)

Events A and B are conditionally independent given C if

P(A,B|C) = P(B|A,C) P(A|C) = P(B|C) P(A|C)


Page 12: Probability Theory for Machine Learning

Marginalization and Law of Total Probability

Marginalization (Sum Rule)

P(X) = ∑_Y P(X, Y)

Law of Total Probability

P(X) = ∑_Y P(X|Y) P(Y)
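A small sketch of both rules on a hypothetical joint distribution over two binary variables (the numbers are illustrative):

```python
# Hypothetical joint distribution P(X, Y) over binary X and Y (illustrative numbers)
P_joint = {(0, 0): 0.10, (0, 1): 0.30,
           (1, 0): 0.25, (1, 1): 0.35}

# Marginalization (sum rule): P(X) = sum over Y of P(X, Y)
P_X = {x: P_joint[(x, 0)] + P_joint[(x, 1)] for x in (0, 1)}
P_Y = {y: P_joint[(0, y)] + P_joint[(1, y)] for y in (0, 1)}

# Law of total probability: P(X) = sum over Y of P(X|Y) P(Y), with P(X|Y) = P(X, Y) / P(Y)
P_X_alt = {x: sum(P_joint[(x, y)] / P_Y[y] * P_Y[y] for y in (0, 1)) for x in (0, 1)}

print(P_X)      # {0: 0.4, 1: 0.6} (up to float rounding)
print(P_X_alt)  # same values, recovered via the law of total probability
```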


Page 13: Probability Theory for Machine Learning

Bayes’ Rule

Page 14: Probability Theory for Machine Learning

Bayes’ Rule

Bayes’ Rule:

P(A|B) = P(B|A) P(A) / P(B)


Page 15: Probability Theory for Machine Learning

Bayes’ Rule

Bayes’ Rule:

P(A|B) = P(B|A) P(A) / P(B)

P(θ|x) = P(x|θ) P(θ) / P(x)

Posterior = (Likelihood × Prior) / Evidence

Posterior ∝ Likelihood × Prior


Page 16: Probability Theory for Machine Learning

Bayes’ Example

Suppose you have tested positive for a disease. What is the

probability you actually have the disease?


Page 17: Probability Theory for Machine Learning

Bayes’ Example

Suppose you have tested positive for a disease. What is the

probability you actually have the disease?

This depends on the accuracy and sensitivity of the test and on the prior probability of the disease:

P(T = 1|D = 1) = 0.95 (true positive)

P(T = 1|D = 0) = 0.10 (false positive)

P(D = 1) = 0.1 (prior)

So P(D = 1|T = 1) =?


Page 18: Probability Theory for Machine Learning

Bayes’ Example

Suppose you have tested positive for a disease. What is the

probability you actually have the disease?

P(T = 1|D = 1) = 0.95 (true positive)

P(T = 1|D = 0) = 0.10 (false positive)

P(D = 1) = 0.1 (prior)

So P(D = 1|T = 1) =?

Use Bayes’ Rule:

P(A|B) = P(B|A) P(A) / P(B)

P(D = 1|T = 1) = P(T = 1|D = 1) P(D = 1) / P(T = 1)


Page 19: Probability Theory for Machine Learning

Bayes’ Example

Suppose you have tested positive for a disease. What is the

probability you actually have the disease?

P(T = 1|D = 1) = 0.95 (true positive)

P(T = 1|D = 0) = 0.10 (false positive)

P(D = 1) = 0.1 (prior)

Use Bayes’ Rule:

P(D = 1|T = 1) = P(T = 1|D = 1) P(D = 1) / P(T = 1)

P(D = 1|T = 1) = (0.95 × 0.1) / P(T = 1)


Page 20: Probability Theory for Machine Learning

Bayes’ Example

Suppose you have tested positive for a disease. What is the

probability you actually have the disease?

P(T = 1|D = 1) = 0.95 (true positive)

P(T = 1|D = 0) = 0.10 (false positive)

P(D = 1) = 0.1 (prior)

P(D = 1|T = 1) = (0.95 × 0.1) / P(T = 1)   (Bayes' Rule)

By the Law of Total Probability:

P(T = 1) = ∑_D P(T = 1|D) P(D)
         = P(T = 1|D = 1) P(D = 1) + P(T = 1|D = 0) P(D = 0)
         = 0.95 × 0.1 + 0.10 × 0.90
         = 0.185

Page 21: Probability Theory for Machine Learning

Bayes’ Example

Suppose you have tested positive for a disease. What is the

probability you actually have the disease?

P(T = 1|D = 1) = 0.95 (true positive)

P(T = 1|D = 0) = 0.10 (false positive)

P(D = 1) = 0.1 (prior)

P(T = 1) = 0.185 (from Law of Total Probability)

P(D = 1|T = 1) = (0.95 × 0.1) / P(T = 1)
              = (0.95 × 0.1) / 0.185
              ≈ 0.51

The probability you have the disease given you tested positive is about 51%.
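The whole calculation fits in a few lines; here is a sketch in Python that reproduces the numbers above:

```python
# Quantities from the slides
p_t1_given_d1 = 0.95   # P(T=1 | D=1), true positive rate
p_t1_given_d0 = 0.10   # P(T=1 | D=0), false positive rate
p_d1 = 0.10            # P(D=1), prior probability of disease

# Law of total probability: P(T=1)
p_t1 = p_t1_given_d1 * p_d1 + p_t1_given_d0 * (1 - p_d1)

# Bayes' rule: P(D=1 | T=1)
p_d1_given_t1 = p_t1_given_d1 * p_d1 / p_t1

print(round(p_t1, 3))           # 0.185
print(round(p_d1_given_t1, 3))  # 0.514
```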


Page 22: Probability Theory for Machine Learning

Random Variables and Statistics

Page 23: Probability Theory for Machine Learning

Random Variable

How do we connect sample spaces and events to data?

A random variable is a mapping which assigns a real number X(ω) to each observed outcome ω ∈ Ω.

For example, let's flip a coin 10 times. X(ω) counts the number of Heads we observe in our sequence. If ω = HHTHTHHTHT then X(ω) = 6.
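Viewed as a function from outcomes to real numbers, this random variable is a one-liner; a sketch:

```python
# X maps an outcome (a sequence of flips) to the number of heads in it
def X(omega: str) -> int:
    return omega.count("H")

print(X("HHTHTHHTHT"))  # 6
```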


Page 24: Probability Theory for Machine Learning

I.I.D.

Random variables are said to be independent and identically

distributed (i.i.d.) if they are sampled from the same probability

distribution and are mutually independent.

This is a common assumption for observations. For example, coin flips are assumed to be i.i.d.


Page 25: Probability Theory for Machine Learning

Discrete and Continuous Random Variables

Discrete Random Variables

• Takes countably many values, e.g., number of heads

• Distribution defined by probability mass function (PMF)

• Marginalization: p(x) = ∑_y p(x, y)

Continuous Random Variables

• Takes uncountably many values, e.g., time to complete task

• Distribution defined by probability density function (PDF)

• Marginalization: p(x) = ∫ p(x, y) dy


Page 26: Probability Theory for Machine Learning

Probability Distribution Statistics

Mean: First Moment, µ

E[x] = ∑_{i=1}^∞ x_i p(x_i)   (univariate discrete r.v.)

E[x] = ∫_{−∞}^{∞} x p(x) dx   (univariate continuous r.v.)

Variance: Second Moment, σ²

Var[x] = ∫_{−∞}^{∞} (x − µ)² p(x) dx
       = E[(x − µ)²]
       = E[x²] − E[x]²
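A quick numerical check of the identity Var[x] = E[x²] − E[x]² on a small made-up discrete distribution:

```python
# A small discrete random variable: values and their probabilities (illustrative)
values = [0.0, 1.0, 2.0]
probs  = [0.2, 0.5, 0.3]

mean   = sum(x * p for x, p in zip(values, probs))        # E[x]
second = sum(x ** 2 * p for x, p in zip(values, probs))   # E[x^2]

var_direct   = sum((x - mean) ** 2 * p for x, p in zip(values, probs))  # E[(x - mu)^2]
var_shortcut = second - mean ** 2                                       # E[x^2] - E[x]^2

print(mean)                      # 1.1
print(var_direct, var_shortcut)  # both ~0.49
```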


Page 27: Probability Theory for Machine Learning

Gaussian Distribution

Page 28: Probability Theory for Machine Learning

Univariate Gaussian Distribution

Also known as the Normal Distribution, N(µ, σ²)

N(x | µ, σ²) = 1/√(2πσ²) · exp(−(x − µ)²/(2σ²))
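A direct transcription of this density into Python, as a sketch (scipy.stats.norm.pdf computes the same thing):

```python
import math

def gaussian_pdf(x: float, mu: float, sigma2: float) -> float:
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

print(gaussian_pdf(0.0, 0.0, 1.0))  # ~0.3989, the standard normal density at 0
```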


Page 29: Probability Theory for Machine Learning

Multivariate Gaussian Distribution

Multidimensional generalization of the Gaussian.

x is a D-dimensional vector

µ is a D-dimensional mean vector

Σ is a D × D covariance matrix with determinant |Σ|

N(x | µ, Σ) = 1/((2π)^{D/2} |Σ|^{1/2}) · exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ))
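The same formula in NumPy, as a sketch (scipy.stats.multivariate_normal implements it directly):

```python
import numpy as np

def mvn_pdf(x: np.ndarray, mu: np.ndarray, Sigma: np.ndarray) -> float:
    """Multivariate Gaussian density N(x | mu, Sigma)."""
    D = x.shape[0]
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
    return float(norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff))

x, mu, Sigma = np.zeros(2), np.zeros(2), np.eye(2)
print(mvn_pdf(x, mu, Sigma))  # ~0.159 = 1/(2*pi), the standard bivariate normal at the origin
```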


Page 30: Probability Theory for Machine Learning

Covariance Matrix

Recall that x and µ are D-dimensional vectors

Covariance matrix Σ is a matrix whose (i, j) entry is the covariance

Σ_ij = Cov(x_i, x_j)
     = E[(x_i − µ_i)(x_j − µ_j)]
     = E[x_i x_j] − µ_i µ_j

Notice that the diagonal entries are the variances of the individual elements.

The covariance matrix is symmetric and positive semi-definite (this is useful for whitening).


Page 31: Probability Theory for Machine Learning

Whitening Transform

Whitening is a linear transform that converts a d-dimensional random vector x = (x_1, . . . , x_d)ᵀ

with mean µ = E[x] = (µ_1, . . . , µ_d)ᵀ and

positive definite d × d covariance matrix Cov(x) = Σ

into a new d-dimensional random vector

z = (z_1, . . . , z_d)ᵀ = W x

with “white” covariance matrix, Cov(z) = I.

The d × d matrix W is called the whitening matrix.

Mahalanobis or ZCA whitening matrix: W_ZCA = Σ^{−1/2}
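A sketch of ZCA whitening in NumPy: build Σ^{−1/2} from the eigendecomposition of the sample covariance and check that the whitened data has (approximately) identity covariance. The data and variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2-D data (illustrative)
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
x = rng.normal(size=(10_000, 2)) @ A.T        # rows are samples

mu = x.mean(axis=0)
Sigma = np.cov(x, rowvar=False)               # sample covariance

# ZCA whitening matrix W = Sigma^{-1/2} via the eigendecomposition of Sigma
eigvals, eigvecs = np.linalg.eigh(Sigma)
W_zca = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

z = (x - mu) @ W_zca.T                        # whitened data: z = W (x - mu)
print(np.cov(z, rowvar=False).round(2))       # approximately the identity matrix
```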


Page 32: Probability Theory for Machine Learning

Inferring Parameters

Page 33: Probability Theory for Machine Learning

Inferring Parameters

We have data X and we assume it is sampled from some

distribution.

How do we figure out the parameters that ‘best’ fit that

distribution?

Maximum Likelihood Estimation (MLE)

θ_MLE = argmax_θ P(X|θ)

Maximum a Posteriori (MAP)

θ_MAP = argmax_θ P(θ|X)


Page 34: Probability Theory for Machine Learning

MLE for Univariate Gaussian Distribution

We are trying to infer the parameters for a Univariate Gaussian Distribution: mean (µ) and variance (σ²).

N(x | µ, σ²) = 1/√(2πσ²) · exp(−(x − µ)²/(2σ²))

The likelihood that our observations x_1, . . . , x_N were generated by a univariate Gaussian with parameters µ and σ² is

Likelihood = p(x_1, . . . , x_N | µ, σ²) = ∏_{i=1}^N 1/√(2πσ²) · exp(−(x_i − µ)²/(2σ²))


Page 35: Probability Theory for Machine Learning

MLE for Univariate Gaussian Distribution

For MLE we want to maximize this likelihood, which is difficult because it is a product of terms:

Likelihood = p(x_1, . . . , x_N | µ, σ²) = ∏_{i=1}^N 1/√(2πσ²) · exp(−(x_i − µ)²/(2σ²))

So we take the log of the likelihood, turning the product into a sum:

Log Likelihood = log p(x_1, . . . , x_N | µ, σ²)
              = ∑_{i=1}^N log[ 1/√(2πσ²) · exp(−(x_i − µ)²/(2σ²)) ]

Since log is monotonically increasing, maximizing L(θ) is equivalent to maximizing log L(θ): argmax_θ L(θ) = argmax_θ log L(θ).


Page 36: Probability Theory for Machine Learning

MLE for Univariate Gaussian Distribution

The log-likelihood simplifies to

L(µ, σ) = ∑_{i=1}^N log[ 1/√(2πσ²) · exp(−(x_i − µ)²/(2σ²)) ]
        = −(N/2) log(2πσ²) − ∑_{i=1}^N (x_i − µ)²/(2σ²)

Which we want to maximize. How?


Page 37: Probability Theory for Machine Learning

MLE for Univariate Gaussian Distribution

To maximize we take the derivatives, set equal to 0, and solve:

L(µ, σ) = −(N/2) log(2πσ²) − ∑_{i=1}^N (x_i − µ)²/(2σ²)

Derivative w.r.t. µ, set equal to 0, and solve for µ:

∂L(µ, σ)/∂µ = 0  =⇒  µ = (1/N) ∑_{i=1}^N x_i

Therefore the µ that maximizes the likelihood is the average of the data points.

Derivative w.r.t. σ², set equal to 0, and solve for σ²:

∂L(µ, σ)/∂σ² = 0  =⇒  σ² = (1/N) ∑_{i=1}^N (x_i − µ)²
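A sketch that checks these closed-form estimates on data drawn from a known Gaussian (the true parameters below are chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)   # true mu = 2, sigma^2 = 9

# Closed-form MLE from the derivation above
mu_mle = x.mean()                        # (1/N) sum_i x_i
sigma2_mle = ((x - mu_mle) ** 2).mean()  # (1/N) sum_i (x_i - mu)^2, i.e. np.var(x) with ddof=0

print(mu_mle, sigma2_mle)  # close to 2.0 and 9.0
```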


Page 38: Probability Theory for Machine Learning

MLE for a biased coin

Suppose we observe a single outcome from the toss of a biased coin, which has probability θ of landing on heads.

log p(x|θ) = x log(θ) + (1 − x) log(1 − θ)

The MLE maximizes the log-likelihood,

θ_MLE = x

where x is 0 or 1. There is a 100% chance of observing the same outcome again!
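To see why θ_MLE = x, set the derivative of the log-likelihood to zero:

d/dθ [x log(θ) + (1 − x) log(1 − θ)] = x/θ − (1 − x)/(1 − θ) = 0

⟹ x(1 − θ) = (1 − x)θ  ⟹  θ = x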


Page 41: Probability Theory for Machine Learning

MAP for a biased coin

We can place a prior distribution on θ. In this case, θ ∼ Beta(2, 2) (conjugate prior).

Then the posterior is

p(θ|x) = Beta(x + 2, 3 − x)

(Show this! A derivation is sketched below.)

Which gives the MAP estimate

θ_MAP = (x + 1)/3

This is 1/3 if we see a tails and 2/3 if we see a heads.

Priors help us reach reasonable conclusions when we have limited observations. MAP agrees with MLE in the limit of infinitely many observations.
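A sketch of the derivation asked for above, using Bayes' rule with a Bernoulli likelihood and the Beta(2, 2) prior:

p(θ|x) ∝ p(x|θ) p(θ)
       ∝ [θ^x (1 − θ)^{1−x}] · [θ (1 − θ)]     (since Beta(2, 2) has density ∝ θ^{2−1}(1 − θ)^{2−1})
       = θ^{x+1} (1 − θ)^{2−x}
       ∝ Beta(θ | x + 2, 3 − x)

The mode of Beta(α, β) is (α − 1)/(α + β − 2) for α, β > 1, so

θ_MAP = ((x + 2) − 1)/((x + 2) + (3 − x) − 2) = (x + 1)/3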
