Probability in Machine Learning
Three Axioms of Probability
• Given events Eᵢ in a sample space S, the sample space is the union of all events: S = ⋃ᵢ₌₁ⁿ Eᵢ
• First axiom: P(E) ∈ ℝ, 0 ≤ P(E) ≤ 1
• Second axiom: P(S) = 1
• Third axiom: additivity; for any countable sequence of mutually exclusive events Eᵢ,
P(⋃ᵢ₌₁ⁿ Eᵢ) = P(E₁) + P(E₂) + ⋯ + P(Eₙ) = Σᵢ₌₁ⁿ P(Eᵢ)
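A minimal sanity check of the three axioms, using a fair six-sided die as the sample space (the die and its uniform probabilities are assumptions for illustration):

    from fractions import Fraction

    # Sample space of one fair die roll; each outcome is an elementary event.
    S = {1, 2, 3, 4, 5, 6}
    P = {outcome: Fraction(1, 6) for outcome in S}

    # First axiom: every probability lies in [0, 1].
    assert all(0 <= p <= 1 for p in P.values())
    # Second axiom: the whole sample space has probability 1.
    assert sum(P.values()) == 1
    # Third axiom: mutually exclusive events add, e.g. P({1} ∪ {2}) = P({1}) + P({2}).
    assert P[1] + P[2] == Fraction(1, 3)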
Random Variables
β’ A random variable is a variable whose values are numerical outcomes of a random phenomenon.
β’ Discrete variables and Probability Mass Function (PMF)
β’ Continuous Variables and Probability Density Function (PDF)
∫ f_X(x) dx = 1 (a PDF integrates to 1 over its support)
Pr(a ≤ X ≤ b) = ∫ₐᵇ f_X(x) dx
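A quick numeric check of both properties (a sketch assuming SciPy is available, with the standard normal as an arbitrary choice of f_X):

    import numpy as np
    from scipy import integrate, stats

    # Standard normal PDF as a concrete f_X (an arbitrary illustrative choice).
    f = stats.norm(loc=0.0, scale=1.0).pdf

    # The PDF integrates to 1 over the real line.
    total, _ = integrate.quad(f, -np.inf, np.inf)
    print(total)  # ~1.0

    # Pr(a <= X <= b) is the integral of f_X from a to b.
    prob, _ = integrate.quad(f, -1.0, 1.0)
    print(prob)   # ~0.6827, consistent with the 68-95-99.7 rule below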
Expected Value (Expectation)
• Expectation of a random variable X: E[X] = Σᵢ xᵢ pᵢ (discrete case) or E[X] = ∫ x f_X(x) dx (continuous case)
• Expected value of rolling one die? (3.5; see the sketch below)
https://en.wikipedia.org/wiki/Expected_value
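A short sketch comparing the exact expectation of a fair die with a Monte Carlo estimate (assuming NumPy):

    import numpy as np

    rng = np.random.default_rng(0)

    # Exact expectation: E[X] = sum of x * P(x) for a fair six-sided die.
    exact = sum(x * 1 / 6 for x in range(1, 7))  # 3.5

    # Monte Carlo estimate from simulated rolls.
    rolls = rng.integers(1, 7, size=100_000)
    print(exact, rolls.mean())  # both ~3.5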
Expected Value of Playing Roulette
• Bet $1 on a single number (0–36) and get a $35 payoff if you win. What's the expected value?
E[gain from $1 bet] = (−1) × 36/37 + 35 × 1/37 = −1/37 ≈ −$0.027
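The same answer by simulation (a sketch assuming NumPy; betting on the number 17 is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(1)

    # Exact EV: lose $1 with probability 36/37, win $35 with probability 1/37.
    exact = -1 * 36 / 37 + 35 * 1 / 37            # -1/37, about -0.027
    spins = rng.integers(0, 37, size=1_000_000)   # wheel outcomes 0..36
    gains = np.where(spins == 17, 35, -1)         # $1 bet on a single number
    print(exact, gains.mean())                    # both ~-0.027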
Variance
β’ The variance of a random variable X is the expected value of the squared deviation from the mean of X
Var(X) = E[(X − μ)²]
Bias and Variance
Covariance
β’ Covariance is a measure of the joint variability of two random variables.
Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]
https://programmathically.com/covariance-and-correlation/
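Both forms of the definition agree numerically; a sketch assuming NumPy, with y constructed to co-vary with x:

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(size=100_000)
    y = 2 * x + rng.normal(size=100_000)  # y co-varies with x by construction

    # Cov(X, Y) = E[XY] - E[X]E[Y]
    cov_manual = (x * y).mean() - x.mean() * y.mean()
    cov_numpy = np.cov(x, y, ddof=0)[0, 1]
    print(cov_manual, cov_numpy)  # both ~2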
Standard Deviation
σ = √( (1/N) Σᵢ₌₁ᴺ (xᵢ − μ)² ) = √Var(X)
68–95–99.7 rule
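A quick simulation of the rule (assuming NumPy): for normally distributed data, about 68%, 95%, and 99.7% of samples fall within 1, 2, and 3 standard deviations of the mean.

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

    # Fraction of samples within k standard deviations of the mean.
    for k in (1, 2, 3):
        print(k, np.mean(np.abs(x) <= k))  # ~0.683, ~0.954, ~0.997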
Six Sigma
• A product has a 99.99966% chance to be free of defects
https://www.leansixsigmadefinition.com/glossary/six-sigma/
Union, Intersection, and Conditional Probability
• P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
• P(A ∩ B) is abbreviated as P(AB)
• Conditional probability P(A|B) is the probability of event A given that B has occurred
– P(A|B) = P(AB) / P(B)
– P(AB) = P(A|B) P(B) = P(B|A) P(A)
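A numeric illustration with die rolls (assuming NumPy): with A = "roll is a six" and B = "roll is even", P(A|B) should come out to 1/3.

    import numpy as np

    rng = np.random.default_rng(4)
    rolls = rng.integers(1, 7, size=100_000)

    A = rolls == 6        # event A: roll is a six
    B = rolls % 2 == 0    # event B: roll is even

    # P(A|B) = P(AB) / P(B), estimated from relative frequencies.
    print((A & B).mean() / B.mean())  # ~1/3
    print(A[B].mean())                # same quantity, counted directly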
Chain Rule of Probability
• The joint probability can be expanded by the chain rule:
P(A₁A₂⋯Aₙ) = P(A₁) P(A₂|A₁) P(A₃|A₁A₂) ⋯ P(Aₙ|A₁⋯Aₙ₋₁)
Mutually Exclusive
• P(AB) = 0
• P(A ∪ B) = P(A) + P(B)
Independence of Events
• Two events A and B are said to be independent if the probability of their intersection equals the product of their individual probabilities: P(AB) = P(A) P(B)
– Equivalently, P(A|B) = P(A)
Bayes Rule
P(A|B) = P(B|A) P(A) / P(B)
Proof:
Recall P(A|B) = P(AB) / P(B)
So P(AB) = P(A|B) P(B) = P(B|A) P(A)
Then Bayes: P(A|B) = P(B|A) P(A) / P(B)
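Bayes rule in action on the classic diagnostic-test example (all numbers are assumptions for illustration):

    # Assumed numbers: prevalence P(D) = 0.01, sensitivity P(+|D) = 0.99,
    # false-positive rate P(+|not D) = 0.05.
    p_d = 0.01
    p_pos_given_d = 0.99
    p_pos_given_not_d = 0.05

    # Total probability: P(+) = P(+|D) P(D) + P(+|not D) P(not D)
    p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

    # Bayes rule: P(D|+) = P(+|D) P(D) / P(+)
    print(p_pos_given_d * p_d / p_pos)  # ~0.167: a positive test is far from certain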
Naïve Bayes Classifier
Naïve = Assume All Features Independent
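A minimal working example of a naive Bayes classifier (a sketch assuming scikit-learn and its bundled iris dataset):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Gaussian naive Bayes treats every feature as conditionally independent
    # given the class, which is exactly the "naive" assumption above.
    clf = GaussianNB().fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # accuracy on held-out data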
Graphical Model
P(a, b, c, d, e) = P(a) P(b|a) P(c|a, b) P(d|b) P(e|c)
Normal (Gaussian) Distribution
• A type of continuous probability distribution for a real-valued random variable
β’ One of the most important distributions
Central Limit Theorem
• Averages of samples of observations of random variables independently drawn from independent distributions converge to the normal distribution
https://corporatefinanceinstitute.com/resources/knowledge/other/central-limit-theorem/
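A simulation sketch (assuming NumPy): means of samples drawn from a decidedly non-normal uniform distribution cluster around μ with spread σ/√n, as the theorem predicts.

    import numpy as np

    rng = np.random.default_rng(5)

    # 100,000 sample means, each over n = 50 uniform draws.
    n, trials = 50, 100_000
    means = rng.uniform(0, 1, size=(trials, n)).mean(axis=1)

    # Uniform(0, 1) has mu = 0.5 and sigma = sqrt(1/12).
    mu, sigma = 0.5, np.sqrt(1 / 12)
    print(means.mean(), mu)                  # both ~0.5
    print(means.std(), sigma / np.sqrt(n))   # both ~0.0408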
Bernoulli Distribution
• Definition: X takes value 1 with probability p and value 0 with probability q = 1 − p
• PMF: P(X = 1) = p, P(X = 0) = 1 − p
• E[X] = p
• Var(X) = pq = p(1 − p)
https://en.wikipedia.org/wiki/Bernoulli_distribution
https://acegif.com/flipping-coin-gifs/
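A quick check of both moments by simulation (assuming NumPy; the bias p = 0.3 is arbitrary):

    import numpy as np

    rng = np.random.default_rng(6)
    p = 0.3  # arbitrary coin bias for illustration

    x = rng.random(1_000_000) < p   # Bernoulli(p) samples as booleans
    print(x.mean(), p)              # E[X] = p
    print(x.var(), p * (1 - p))     # Var(X) = pq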
Information Theory
• Self-information: I(x) = −log P(x)
• Shannon entropy: H = −Σᵢ pᵢ log₂ pᵢ
Entropy of Bernoulli Variable
• H(x) = E[I(x)] = −E[log P(x)]; for a Bernoulli variable, H(p) = −p log₂ p − (1 − p) log₂(1 − p)
https://en.wikipedia.org/wiki/Information_theory
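A sketch (assuming NumPy) computing Shannon entropy; for a Bernoulli variable the entropy peaks at p = 0.5, the least predictable coin.

    import numpy as np

    # Shannon entropy in bits: H = -sum(p_i * log2(p_i)), with 0 log 0 taken as 0.
    def entropy(probs):
        probs = np.asarray(probs, dtype=float)
        probs = probs[probs > 0]
        return -np.sum(probs * np.log2(probs))

    for p in (0.1, 0.5, 0.9):
        print(p, entropy([p, 1 - p]))  # ~0.469, 1.0, ~0.469 bits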
Kullback-Leibler (KL) Divergence
• D_KL(P‖Q) = E[log P(x) − log Q(x)] = E[log (P(x) / Q(x))], with the expectation taken over x ~ P
KL Divergence is Asymmetric!
D_KL(P‖Q) ≠ D_KL(Q‖P)
https://www.deeplearningbook.org/slides/03_prob.pdf
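Computing both directions for two small discrete distributions (assuming NumPy; P and Q are arbitrary examples) makes the asymmetry concrete:

    import numpy as np

    # D_KL(P || Q) = sum(P * log(P / Q)) for discrete distributions.
    def kl(p, q):
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        return np.sum(p * np.log(p / q))

    P = np.array([0.7, 0.2, 0.1])
    Q = np.array([0.4, 0.4, 0.2])
    print(kl(P, Q), kl(Q, P))  # ~0.184 vs ~0.192: not equal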
Key Takeaways
• Expected value (expectation) is the mean (weighted average) of a random variable
• Events A and B are independent if P(AB) = P(A) P(B)
• Events A and B are mutually exclusive if P(AB) = 0
• The central limit theorem makes the normal distribution a sensible default when the data's true distribution is unknown
• Entropy is the expected value of self-information: −E[log p(x)]
• KL divergence measures the difference between two probability distributions and is asymmetric