Probability in Machine Learning


Three Axioms of Probability

• Given an event E_i in a sample space S, where S = E_1 ∪ E_2 ∪ ⋯ ∪ E_N

• First axiom: P(E) ∈ ℝ, 0 ≤ P(E) ≤ 1

• Second axiom: P(S) = 1

• Third axiom (additivity): for any countable sequence of mutually exclusive events E_i, P(E_1 ∪ E_2 ∪ ⋯ ∪ E_n) = P(E_1) + P(E_2) + ⋯ + P(E_n) = Σ_i P(E_i)
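
As a quick sanity check, here is a minimal Python sketch that verifies the three axioms numerically, assuming a fair six-sided die as the sample space:

```python
# Minimal sketch: the three axioms checked on a fair six-sided die (assumed example).
P = {outcome: 1/6 for outcome in range(1, 7)}    # sample space S = {1, ..., 6}

# First axiom: every probability is a real number in [0, 1].
assert all(0 <= p <= 1 for p in P.values())

# Second axiom: P(S) = 1.
assert abs(sum(P.values()) - 1) < 1e-12

# Third axiom: additivity for mutually exclusive events, P(E1 ∪ E2) = P(E1) + P(E2).
E1, E2 = {1, 2}, {5, 6}                          # disjoint events
lhs = sum(P[o] for o in E1 | E2)
rhs = sum(P[o] for o in E1) + sum(P[o] for o in E2)
assert abs(lhs - rhs) < 1e-12
```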

Random Variables

• A random variable is a variable whose values are numerical outcomes of a random phenomenon.

• Discrete variables and Probability Mass Function (PMF)

• Continuous variables and Probability Density Function (PDF)

π‘₯

𝑝𝑋 π‘₯ = 1

Pr π‘Ž ≀ 𝑋 ≀ 𝑏 = ΰΆ±π‘Ž

𝑏

𝑓𝑋 π‘₯ 𝑑π‘₯
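
For a concrete contrast between a PMF and a PDF, here is a small sketch, assuming a fair die for the discrete case and a standard normal (via scipy.stats.norm) for the continuous one:

```python
from scipy.stats import norm

# Discrete: the PMF of a fair die sums to 1 over all outcomes.
pmf = {x: 1/6 for x in range(1, 7)}
print(sum(pmf.values()))            # 1.0

# Continuous: Pr(a <= X <= b) is the integral of the PDF, i.e. a difference of the CDF.
a, b = -1.0, 1.0
print(norm.cdf(b) - norm.cdf(a))    # ~0.6827 for the standard normal
```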

Expected Value (Expectation)

• Expectation of a random variable X (discrete case): E[X] = Σ_i x_i P(X = x_i)

• Expected value of rolling one die?

https://en.wikipedia.org/wiki/Expected_value
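
The die question above follows directly from the definition; a one-line sketch:

```python
# E[X] = sum over x of x * P(X = x) for a fair six-sided die.
expected = sum(x * (1/6) for x in range(1, 7))
print(expected)   # 3.5
```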

Expected Value of Playing Roulette

• Bet $1 on a single number (0–36) and receive a $35 payoff if you win. What is the expected value?

𝐸 πΊπ‘Žπ‘–π‘› π‘“π‘Ÿπ‘œπ‘š $1 𝑏𝑒𝑑 = βˆ’1 Γ—36

37+ 35 Γ—

1

37=βˆ’1

37
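
The same expectation, written out in Python for the 37-pocket wheel described above:

```python
# $1 single-number bet: lose $1 with probability 36/37, win $35 with probability 1/37.
p_win, p_lose = 1/37, 36/37
expected_gain = -1 * p_lose + 35 * p_win
print(expected_gain)   # -1/37 ≈ -0.027, about -2.7 cents per $1 bet
```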

Variance

• The variance of a random variable X is the expected value of the squared deviation from the mean of X.

Var(X) = E[(X − μ)²]

Bias and Variance

Covariance

• Covariance is a measure of the joint variability of two random variables.

Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]

https://programmathically.com/covariance-and-correlation/

Standard Deviation

σ = √((1/N) Σ_{i=1}^{N} (x_i − μ)²) = √Var(X)

68–95–99.7 rule

6 Sigma

• A product has a 99.99966% chance of being free of defects

https://www.leansixsigmadefinition.com/glossary/six-sigma/

Union, Intersection, and Conditional Probability

• P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

• P(A ∩ B) is abbreviated as P(AB)

• Conditional probability P(A|B) is the probability of event A given that B has occurred

− P(A|B) = P(AB) / P(B)

− P(AB) = P(A|B) P(B) = P(B|A) P(A)
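
A small enumeration example (two fair dice, with A = "the sum is 8" and B = "the first die shows 3" as hypothetical events) showing P(A|B) = P(AB) / P(B):

```python
from itertools import product

space = list(product(range(1, 7), repeat=2))    # 36 equally likely outcomes
A = {o for o in space if sum(o) == 8}           # A: the sum is 8
B = {o for o in space if o[0] == 3}             # B: the first die shows 3

p_B = len(B) / len(space)
p_AB = len(A & B) / len(space)
print(p_AB / p_B)                               # P(A|B) = (1/36) / (6/36) = 1/6
```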

Chain Rule of Probability

• The joint probability can be expressed with the chain rule: P(A_1 A_2 ⋯ A_n) = P(A_1) P(A_2|A_1) P(A_3|A_1 A_2) ⋯ P(A_n|A_1 ⋯ A_{n−1})

Mutually Exclusive

• P(AB) = 0

• P(A ∪ B) = P(A) + P(B)

Independence of Events

• Two events A and B are said to be independent if the probability of their intersection is equal to the product of their individual probabilities

− P(AB) = P(A) P(B)

− P(A|B) = P(A)

Bayes Rule

P(A|B) = P(B|A) P(A) / P(B)

Proof:

Remember P(A|B) = P(AB) / P(B)

So P(AB) = P(A|B) P(B) = P(B|A) P(A)

Then Bayes: P(A|B) = P(B|A) P(A) / P(B)
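
A tiny helper that applies the rule; the prior, likelihood, and evidence values below are hypothetical, chosen only to show the mechanics:

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: P(B|A) = 0.9, P(A) = 0.01, P(B) = 0.05.
print(bayes(p_b_given_a=0.9, p_a=0.01, p_b=0.05))   # 0.18
```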

Naïve Bayes Classifier

Naïve = assume all features are conditionally independent given the class
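
A minimal sketch of a naïve Bayes classifier using scikit-learn's GaussianNB; the toy feature matrix and labels below are made up for illustration:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.0], [4.1, 3.9]])   # features
y = np.array([0, 0, 1, 1])                                        # class labels

model = GaussianNB()              # treats features as independent given the class
model.fit(X, y)
print(model.predict([[1.1, 2.0]]))        # -> [0]
print(model.predict_proba([[1.1, 2.0]]))  # posterior class probabilities via Bayes' rule
```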

Graphical Model

𝑝 π‘Ž, 𝑏, 𝑐, 𝑑, 𝑒 = 𝑝 π‘Ž 𝑝 𝑏 π‘Ž 𝑝 𝑐 π‘Ž, 𝑏 𝑝 𝑑 𝑏 𝑝(𝑒|𝑐)

Normal (Gaussian) Distribution

• A type of continuous probability distribution for a real-valued random variable.

• One of the most important distributions

Central Limit Theorem

• Averages of samples of observations of random variables independently drawn from independent distributions converge to the normal distribution

https://corporatefinanceinstitute.com/resources/knowledge/other/central-limit-theorem/
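
A quick simulation of the theorem, assuming a (clearly non-normal) uniform distribution as the source: the distribution of sample means concentrates around the true mean and looks increasingly normal as the sample size grows:

```python
import numpy as np

rng = np.random.default_rng(0)
sample_means = rng.uniform(0, 1, size=(10_000, 30)).mean(axis=1)  # 10,000 means of 30 draws each

print(sample_means.mean())   # ~0.5, the mean of the underlying uniform
print(sample_means.std())    # ~0.053 ≈ sqrt(1/12) / sqrt(30), shrinking with sample size
```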

Bernoulli Distribution

• Definition: X takes the value 1 with probability p and the value 0 with probability q = 1 − p

• PMF: P(X = 1) = p, P(X = 0) = q = 1 − p

• E[X] = p

• Var(X) = pq = p(1 − p)

https://en.wikipedia.org/wiki/Bernoulli_distribution

https://acegif.com/flipping-coin-gifs/
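
A short simulation checking E[X] = p and Var(X) = pq empirically (p = 0.3 is an arbitrary choice):

```python
import numpy as np

p = 0.3
rng = np.random.default_rng(0)
flips = rng.binomial(n=1, p=p, size=100_000)   # Bernoulli is binomial with n = 1

print(flips.mean())   # ~0.30 = p
print(flips.var())    # ~0.21 = p * (1 - p) = pq
```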

Information Theory

• Self-information: I(x) = −log P(x)

• Shannon Entropy: H = −Σ_i p_i log₂ p_i
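
A small helper computing Shannon entropy for a discrete distribution; the example distributions are illustrative:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # 0 * log(0) is treated as 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))              # 1.0 bit for a fair coin
print(entropy([0.9, 0.1]))              # ~0.469 bits for a biased coin
```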

Entropy of Bernoulli Variable

• H(x) = E[I(x)] = −E[log P(x)]

https://en.wikipedia.org/wiki/Information_theory

Kullback-Leibler (KL) Divergence

• D_KL(p ‖ q) = E[log P(x) − log Q(x)] = E[log (P(x)/Q(x))]

KL Divergence is Asymmetric!

D_KL(p ‖ q) ≠ D_KL(q ‖ p)

https://www.deeplearningbook.org/slides/03_prob.pdf
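
A minimal sketch of KL divergence between two discrete distributions (the distributions p and q below are made up), showing the asymmetry:

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))    # assumes all entries are > 0

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.4])

print(kl(p, q))   # D_KL(p ‖ q)
print(kl(q, p))   # D_KL(q ‖ p), a different value
```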

Key Takeaways

• Expected value (expectation) is the mean (weighted average) of a random variable

• Events A and B are independent if P(AB) = P(A) P(B)

• Events A and B are mutually exclusive if P(AB) = 0

• The central limit theorem tells us that the normal distribution is a reasonable default when the data's probability distribution is unknown

• Entropy is the expected value of self-information, −E[log P(x)]

• KL divergence measures the difference between two probability distributions and is asymmetric