Prospective evaluation and success of a machine learning ...
Probability Theory for Machine Learning
Transcript of Probability Theory for Machine Learning
Probability Theory for Machine Learning
Jesse Bettencourt
September 2018
Introduction to Machine Learning
CSC411
University of Toronto
Introduction to Notation
Motivation
Uncertainty arises through:
• Noisy measurements
• Finite size of data sets
• Ambiguity
• Limited Model Complexity
Probability theory provides a consistent framework for the
quantification and manipulation of uncertainty.
1
Sample Space
Sample space Ω is the set of all possible outcomes of an
experiment.
Observations ω ∈ Ω are points in the space also called sample
outcomes, realizations, or elements.
Events E ⊂ Ω are subsets of the sample space.
2
Sample Space Coin Example
In this experiment we flip a coin twice:
Sample space All outcomes Ω = HH,HT ,TH,TTObservation ω = HT valid sample since ω ∈ Ω
Event Both flips same E = HH,TT valid event since E ⊂ Ω
3
Probability
Probability
The probability of an event E, P(E ), satisfies three axioms:
1: P(E ) ≥ 0 for every E
2: P(Ω) = 1
3: If E1,E2, . . . are disjoint then
P(∞⋃i=1
Ei ) =∞∑i=1
P(Ei )
4
Joint and Conditional Probabilities
Joint Probability of A and B is denoted P(A,B)
Conditional Probability of A given B is denoted P(A|B).
• Assuming P(B) > 0, then P(A|B) = P(A,B)/P(B)
• Product Rule: P(A,B) = P(A|B)P(B) = P(B|A)P(A)
5
Conditional Example
60% of ML students pass the final and 45% of ML students pass
both the final and the midterm.
What percent of students who passed the final also passed the
midterm?
6
Conditional Example
60% of ML students pass the final and 45% of ML students pass
both the final and the midterm.
What percent of students who passed the final also passed the
midterm?
Reword: What percent passed the midterm given they passed the
final?
P(M|F ) = P(M,F )/P(F )
= 0.45/0.60
= 0.75
7
Independence
Events A and B are independent if P(A,B) = P(A)P(B)
Events A and B are conditionally independent given C if
P(A,B|C ) = P(B|A,C )P(A|C ) = P(B|C )P(A|C )
8
Marginalization and Law of Total Probability
Marginalization (Sum Rule)
P(X ) =∑Y
P(X ,Y )
Law of Total Probability
P(X ) =∑Y
P(X |Y )P(Y )
9
Bayes’ Rule
Bayes’ Rule
Bayes’ Rule:
P(A|B) =P(B|A)P(A)
P(B)
10
Bayes’ Rule
Bayes’ Rule:
P(A|B) =P(B|A)P(A)
P(B)
P(θ|x) =P(x |θ)P(θ)
P(x)
Posterior =Likelihood ∗ Prior
Evidence
Posterior ∝ Likelihood × Prior
11
Bayes’ Example
Suppose you have tested positive for a disease. What is the
probability you actually have the disease?
12
Bayes’ Example
Suppose you have tested positive for a disease. What is the
probability you actually have the disease?
This depends on accuracy and sensitivity of test and prior
probability of the disease:
P(T = 1|D = 1) = 0.95 (true positive)
P(T = 1|D = 0) = 0.10 (false positive)
P(D = 1) = 0.1 (prior)
So P(D = 1|T = 1) =?
13
Bayes’ Example
Suppose you have tested positive for a disease. What is the
probability you actually have the disease?
P(T = 1|D = 1) = 0.95 (true positive)
P(T = 1|D = 0) = 0.10 (false positive)
P(D = 1) = 0.1 (prior)
So P(D = 1|T = 1) =?
Use Bayes’ Rule:
P(A|B) =P(B|A)P(A)
P(B)
P(D = 1|T = 1) =P(T = 1|D = 1)P(D = 1)
P(T = 1)
14
Bayes’ Example
Suppose you have tested positive for a disease. What is the
probability you actually have the disease?
P(T = 1|D = 1) = 0.95 (true positive)
P(T = 1|D = 0) = 0.10 (false positive)
P(D = 1) = 0.1 (prior)
Use Bayes’ Rule:
P(D = 1|T = 1) =P(T = 1|D = 1)P(D = 1)
P(T = 1)
P(D = 1|T = 1) =0.95 ∗ 0.1
P(T = 1)
15
Bayes’ Example
Suppose you have tested positive for a disease. What is the
probability you actually have the disease?
P(T = 1|D = 1) = 0.95 (true positive)
P(T = 1|D = 0) = 0.10 (false positive)
P(D = 1) = 0.1 (prior)
P(D = 1|T = 1) =0.95 ∗ 0.1
P(T = 1)(Bayes’ Rule)
By Law of Total Probability
P(T = 1) =∑D
P(T = 1|D)P(D)
= P(T = 1|D = 1)P(D = 1) + P(T = 1|D = 0)P(D = 0)
= 0.95 ∗ 0.1 + 0.1 ∗ 0.90
= 0.18516
Bayes’ Example
Suppose you have tested positive for a disease. What is the
probability you actually have the disease?
P(T = 1|D = 1) = 0.95 (true positive)
P(T = 1|D = 0) = 0.10 (false positive)
P(D = 1) = 0.1 (prior)
P(T = 1) = 0.185 (from Law of Total Probability)
P(D = 1|T = 1) =0.95 ∗ 0.1
P(T = 1)
=0.95 ∗ 0.1
0.185
= 0.51
Probability you have the disease given you tested positive is 51%
17
Random Variables and Statistics
Random Variable
How do we connect sample spaces and events to data?
A random variable is a mapping which assigns a real number X (ω)
to each observed outcome ω ∈ Ω
For example, let’s flip a coin 10 times. X (ω) counts the number of
Heads we observe in our sequence. If ω = HHTHTHHTHT then
X (ω) = 6.
18
I.I.D.
Random variables are said to be independent and identically
distributed (i.i.d.) if they are sampled from the same probability
distribution and are mutually independent.
This is a common assumption for observations. For example, coin
flips are assumed to be iid.
19
Discrete and Continuous Random Variables
Discrete Random Variables
• Takes countably many values, e.g., number of heads
• Distribution defined by probability mass function (PMF)
• Marginalization: p(x) =∑
y p(x , y)
Continuous Random Variables
• Takes uncountably many values, e.g., time to complete task
• Distribution defined by probability density function (PDF)
• Marginalization: p(x) =∫y p(x , y)dy
20
Probability Distribution Statistics
Mean: First Moment, µ
E [x ] =∞∑i=1
xip(xi ) (univariate discrete r.v.)
E [x ] =
∫ ∞−∞
xp(x)dx (univariate continuous r.v.)
Variance: Second Moment, σ2
Var [x ] =
∫ ∞−∞
(x − µ)2p(x)dx
= E [(x − µ)2]
= E [x2]− E [x ]2
21
Gaussian Distribution
Univariate Gaussian Distribution
Also known as the Normal Distribution, N (µ, σ2)
N (x |µ, σ2) =1√
2πσ2exp− 1
2σ2(x − µ)2
22
Multivariate Gaussian Distribution
Multidimensional generalization of the Gaussian.
x is a D-dimensional vector
µ is a D-dimensional mean vector
Σ is a D × D covariance matrix with determinant |Σ|
N (x|µ,Σ) =1
(2π)D/21
|Σ|1/2exp−1
2(x − µ)TΣ−1(x − µ)
23
Covariance Matrix
Recall that x and µ are D-dimensional vectors
Covariance matrix Σ is a matrix whose (i , j) entry is the covariance
Σij = Cov(xi , xj)
= E [(xi − µi )(xj − µj)]
= E [(xixj)]− µiµj
so notice that the diagonal entries are the variance of each
elements.
The covariant matrix has the property that it is symmetric and
positive-semidefinite (this is useful for whitening).
24
Whitening Transform
Whitening is a linear transform that converts a d-dimensional
random vector x = (x1, . . . , xd)T
with mean µ = E [x] = (µ1, . . . , µd)T and
positive definite d × d covariance matrix Cov(x) = Σ
into a new random d-dimensional vector
z = (z1, . . . , zd)T = W x
with “white” covariance matrix, Cov(z) = I
The d × d covariance matrix W is called the whitening matrix.
Mahalanobis or ZCA whitening matrix: WZCA = Σ−12
25
Inferring Parameters
Inferring Parameters
We have data X and we assume it is sampled from some
distribution.
How do we figure out the parameters that ‘best’ fit that
distribution?
Maximum Likelihood Estimation (MLE)
θMLE = argmaxθ
P(X |θ)
Maximum a Posteriori (MAP)
θMAP = argmaxθ
P(θ|X )
26
MLE for Univariate Gaussian Distribution
We are trying to infer the parameters for a Univariate Gaussian
Distribution, mean (µ) and variance (σ2).
N (x |µ, σ2) =1√
2πσ2exp− 1
2σ2(x − µ)2
The likelihood that our observations x1, . . . , xN were generated by
a univariate Gaussian with parameters µ and σ2 is
Likelihood = p(x1 . . . xN |µ, σ2) =N∏i=1
1√2πσ2
exp− 1
2σ2(xi − µ)2
27
MLE for Univariate Gaussian Distribution
For MLE we want to maximize this likelihood, which is difficult
because it is represented by a product of terms
Likelihood = p(x1 . . . xN |µ, σ2) =N∏i=1
1√2πσ2
exp− 1
2σ2(xi − µ)2
So we take the log of the likelihood so the product becomes a sum
Log Likelihood = log p(x1 . . . xN |µ, σ2)
=N∑i=1
log1√
2πσ2exp− 1
2σ2(xi − µ)2
Since log is monotonically increasing max L(θ) = max log L(θ)
28
MLE for Univariate Gaussian Distribution
The log Likelihood simplifies to
L(µ, σ) =N∑i=1
log1√
2πσ2exp− 1
2σ2(xi − µ)2
= −1
2N log(2πσ2)−
N∑i=1
(xi − µ)2
2σ2
Which we want to maximize. How?
29
MLE for Univariate Gaussian Distribution
To maximize we take the derivatives, set equal to 0, and solve:
L(µ, σ) = −1
2N log(2πσ2)−
N∑i=1
(xi − µ)2
2σ2
Derivative w.r.t. µ, set equal to 0, and solve for µ
∂L(µ, σ)
∂µ= 0 =⇒ µ =
1
N
N∑i=1
xi
Therefore the µ that maximizes the likelihood is the average of the
data points.
Derivative w.r.t. σ2, set equal to 0, and solve for σ2
∂L(µ, σ)
∂σ2= 0 =⇒ σ2 =
1
N
N∑i=1
(xi − µ)2
30
MLE for a biased coin
Suppose we observe a single outcome from the toss of a biased
coin, which has probably θ of landing on heads.
log p(x |θ) = x log(θ) + (1− x) log(1− θ)
The MLE maximizes the log-likelihood,
θMLE = x
where x is 0 or 1. There is a 100% chance of observing the same
outcome again!
31
MLE for a biased coin
Suppose we observe a single outcome from the toss of a biased
coin, which has probably θ of landing on heads.
log p(x |θ) = x log(θ) + (1− x) log(1− θ)
The MLE maximizes the log-likelihood,
θMLE = x
where x is 0 or 1.
There is a 100% chance of observing the same
outcome again!
31
MLE for a biased coin
Suppose we observe a single outcome from the toss of a biased
coin, which has probably θ of landing on heads.
log p(x |θ) = x log(θ) + (1− x) log(1− θ)
The MLE maximizes the log-likelihood,
θMLE = x
where x is 0 or 1. There is a 100% chance of observing the same
outcome again!
31
MAP for a biased coin
We can place a prior distribution on θ. In this case, θ ∼ Beta(2, 2)
(conjugate prior).
Then the posterior is,
p(θ|x) = Beta(x + 2, 3− x)
(Show this!)
Which gives the MAP estimate,
θMAP =x + 1
3
This is 1/3 if we see a tails and 2/3 if we see a heads.
Priors help us reach reasonable conclusions when we have limited
observations. MAP is consistent with MLE when we have infinite
observations.
32
MAP for a biased coin
We can place a prior distribution on θ. In this case, θ ∼ Beta(2, 2)
(conjugate prior).
Then the posterior is,
p(θ|x) = Beta(x + 2, 3− x)
(Show this!) Which gives the MAP estimate,
θMAP =x + 1
3
This is 1/3 if we see a tails and 2/3 if we see a heads.
Priors help us reach reasonable conclusions when we have limited
observations. MAP is consistent with MLE when we have infinite
observations.
32
MAP for a biased coin
We can place a prior distribution on θ. In this case, θ ∼ Beta(2, 2)
(conjugate prior).
Then the posterior is,
p(θ|x) = Beta(x + 2, 3− x)
(Show this!) Which gives the MAP estimate,
θMAP =x + 1
3
This is 1/3 if we see a tails and 2/3 if we see a heads.
Priors help us reach reasonable conclusions when we have limited
observations. MAP is consistent with MLE when we have infinite
observations.32