Probability in Machine Learning


Three Axioms of Probability

• Given an event E_i in a sample space S, where S = E_1 ∪ E_2 ∪ ⋯ ∪ E_N

• First axiom: P(E) ∈ ℝ, 0 ≤ P(E) ≤ 1

• Second axiom: P(S) = 1

• Third axiom (additivity): for any countable sequence of mutually exclusive events E_i, P(E_1 ∪ E_2 ∪ ⋯ ∪ E_n) = P(E_1) + P(E_2) + ⋯ + P(E_n) = Σ_i P(E_i)
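
As a quick sanity check, here is a minimal Python sketch that verifies the three axioms numerically, assuming a fair six-sided die as the sample space:

```python
# Minimal sketch: the three axioms checked on a fair six-sided die (assumed example).
P = {outcome: 1/6 for outcome in range(1, 7)}    # sample space S = {1, ..., 6}

# First axiom: every probability is a real number in [0, 1].
assert all(0 <= p <= 1 for p in P.values())

# Second axiom: P(S) = 1.
assert abs(sum(P.values()) - 1) < 1e-12

# Third axiom: additivity for mutually exclusive events, P(E1 ∪ E2) = P(E1) + P(E2).
E1, E2 = {1, 2}, {5, 6}                          # disjoint events
lhs = sum(P[o] for o in E1 | E2)
rhs = sum(P[o] for o in E1) + sum(P[o] for o in E2)
assert abs(lhs - rhs) < 1e-12
```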

Random Variables

• A random variable is a variable whose values are numerical outcomes of a random phenomenon.

• Discrete variables and Probability Mass Function (PMF)

• Continuous variables and Probability Density Function (PDF)

π‘₯

𝑝𝑋 π‘₯ = 1

Pr π‘Ž ≀ 𝑋 ≀ 𝑏 = ΰΆ±π‘Ž

𝑏

𝑓𝑋 π‘₯ 𝑑π‘₯
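
For a concrete contrast between a PMF and a PDF, here is a small sketch, assuming a fair die for the discrete case and a standard normal (via scipy.stats.norm) for the continuous one:

```python
from scipy.stats import norm

# Discrete: the PMF of a fair die sums to 1 over all outcomes.
pmf = {x: 1/6 for x in range(1, 7)}
print(sum(pmf.values()))            # 1.0

# Continuous: Pr(a <= X <= b) is the integral of the PDF, i.e. a difference of the CDF.
a, b = -1.0, 1.0
print(norm.cdf(b) - norm.cdf(a))    # ~0.6827 for the standard normal
```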

Expected Value (Expectation)

• Expectation of a random variable X (discrete case): E[X] = Σ_i x_i P(X = x_i)

• Expected value of rolling one die?

https://en.wikipedia.org/wiki/Expected_value
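
The die question above follows directly from the definition; a one-line sketch:

```python
# E[X] = sum over x of x * P(X = x) for a fair six-sided die.
expected = sum(x * (1/6) for x in range(1, 7))
print(expected)   # 3.5
```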

Expected Value of Playing Roulette

• Bet $1 on a single number (0–36) and receive a $35 payoff if you win. What is the expected value?

𝐸 πΊπ‘Žπ‘–π‘› π‘“π‘Ÿπ‘œπ‘š $1 𝑏𝑒𝑑 = βˆ’1 Γ—36

37+ 35 Γ—

1

37=βˆ’1

37
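
The same expectation, written out in Python for the 37-pocket wheel described above:

```python
# $1 single-number bet: lose $1 with probability 36/37, win $35 with probability 1/37.
p_win, p_lose = 1/37, 36/37
expected_gain = -1 * p_lose + 35 * p_win
print(expected_gain)   # -1/37 ≈ -0.027, about -2.7 cents per $1 bet
```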

Variance

• The variance of a random variable X is the expected value of the squared deviation from the mean of X.

Var(X) = E[(X − μ)²]

Bias and Variance

Covariance

• Covariance is a measure of the joint variability of two random variables.

Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]

https://programmathically.com/covariance-and-correlation/

Standard Deviation

σ = √((1/N) Σ_{i=1}^{N} (x_i − μ)²) = √Var(X)

68–95–99.7 rule

6 Sigma

• A product has a 99.99966% chance of being free of defects

https://www.leansixsigmadefinition.com/glossary/six-sigma/

Union, Intersection, and Conditional Probability

• P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

• P(A ∩ B) is abbreviated as P(AB)

• Conditional probability P(A|B) is the probability of event A given that B has occurred

− P(A|B) = P(AB) / P(B)

− P(AB) = P(A|B) P(B) = P(B|A) P(A)
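
A small enumeration example (two fair dice, with A = "the sum is 8" and B = "the first die shows 3" as hypothetical events) showing P(A|B) = P(AB) / P(B):

```python
from itertools import product

space = list(product(range(1, 7), repeat=2))    # 36 equally likely outcomes
A = {o for o in space if sum(o) == 8}           # A: the sum is 8
B = {o for o in space if o[0] == 3}             # B: the first die shows 3

p_B = len(B) / len(space)
p_AB = len(A & B) / len(space)
print(p_AB / p_B)                               # P(A|B) = (1/36) / (6/36) = 1/6
```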

Chain Rule of Probability

• The joint probability can be expressed with the chain rule: P(A_1 A_2 ⋯ A_n) = P(A_1) P(A_2|A_1) P(A_3|A_1 A_2) ⋯ P(A_n|A_1 ⋯ A_{n−1})

Mutually Exclusive

• P(AB) = 0

• P(A ∪ B) = P(A) + P(B)

Independence of Events

• Two events A and B are said to be independent if the probability of their intersection is equal to the product of their individual probabilities

− P(AB) = P(A) P(B)

− P(A|B) = P(A)

Bayes Rule

P(A|B) = P(B|A) P(A) / P(B)

Proof:

Remember P(A|B) = P(AB) / P(B)

So P(AB) = P(A|B) P(B) = P(B|A) P(A)

Then Bayes: P(A|B) = P(B|A) P(A) / P(B)
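
A tiny helper that applies the rule; the prior, likelihood, and evidence values below are hypothetical, chosen only to show the mechanics:

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: P(B|A) = 0.9, P(A) = 0.01, P(B) = 0.05.
print(bayes(p_b_given_a=0.9, p_a=0.01, p_b=0.05))   # 0.18
```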

Naïve Bayes Classifier

Naïve = assume all features are conditionally independent given the class
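
A minimal sketch of a naïve Bayes classifier using scikit-learn's GaussianNB; the toy feature matrix and labels below are made up for illustration:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.0], [4.1, 3.9]])   # features
y = np.array([0, 0, 1, 1])                                        # class labels

model = GaussianNB()              # treats features as independent given the class
model.fit(X, y)
print(model.predict([[1.1, 2.0]]))        # -> [0]
print(model.predict_proba([[1.1, 2.0]]))  # posterior class probabilities via Bayes' rule
```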

Graphical Model

𝑝 π‘Ž, 𝑏, 𝑐, 𝑑, 𝑒 = 𝑝 π‘Ž 𝑝 𝑏 π‘Ž 𝑝 𝑐 π‘Ž, 𝑏 𝑝 𝑑 𝑏 𝑝(𝑒|𝑐)

Normal (Gaussian) Distribution

• A type of continuous probability distribution for a real-valued random variable.

• One of the most important distributions

Central Limit Theorem

• Averages of samples of observations of random variables independently drawn from independent distributions converge to the normal distribution

https://corporatefinanceinstitute.com/resources/knowledge/other/central-limit-theorem/
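
A quick simulation of the theorem, assuming a (clearly non-normal) uniform distribution as the source: the distribution of sample means concentrates around the true mean and looks increasingly normal as the sample size grows:

```python
import numpy as np

rng = np.random.default_rng(0)
sample_means = rng.uniform(0, 1, size=(10_000, 30)).mean(axis=1)  # 10,000 means of 30 draws each

print(sample_means.mean())   # ~0.5, the mean of the underlying uniform
print(sample_means.std())    # ~0.053 ≈ sqrt(1/12) / sqrt(30), shrinking with sample size
```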

Bernoulli Distribution

• Definition: X takes the value 1 with probability p and the value 0 with probability q = 1 − p

• PMF: P(X = 1) = p, P(X = 0) = q = 1 − p

• E[X] = p

• Var(X) = pq = p(1 − p)

https://en.wikipedia.org/wiki/Bernoulli_distribution

https://acegif.com/flipping-coin-gifs/
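
A short simulation checking E[X] = p and Var(X) = pq empirically (p = 0.3 is an arbitrary choice):

```python
import numpy as np

p = 0.3
rng = np.random.default_rng(0)
flips = rng.binomial(n=1, p=p, size=100_000)   # Bernoulli is binomial with n = 1

print(flips.mean())   # ~0.30 = p
print(flips.var())    # ~0.21 = p * (1 - p) = pq
```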

Information Theory

• Self-information: I(x) = −log P(x)

• Shannon Entropy: H = −Σ_i p_i log₂ p_i
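
A small helper computing Shannon entropy for a discrete distribution; the example distributions are illustrative:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # 0 * log(0) is treated as 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))              # 1.0 bit for a fair coin
print(entropy([0.9, 0.1]))              # ~0.469 bits for a biased coin
```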

Entropy of Bernoulli Variable

• H(x) = E[I(x)] = −E[log P(x)]

https://en.wikipedia.org/wiki/Information_theory

Kullback-Leibler (KL) Divergence

• D_KL(p ‖ q) = E[log P(x) − log Q(x)] = E[log (P(x)/Q(x))]

KL Divergence is Asymmetric!

D_KL(p ‖ q) ≠ D_KL(q ‖ p)

https://www.deeplearningbook.org/slides/03_prob.pdf
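
A minimal sketch of KL divergence between two discrete distributions (the distributions p and q below are made up), showing the asymmetry:

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))    # assumes all entries are > 0

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.4])

print(kl(p, q))   # D_KL(p ‖ q)
print(kl(q, p))   # D_KL(q ‖ p), a different value
```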

Key Takeaways

• Expected value (expectation) is the mean (weighted average) of a random variable

• Events A and B are independent if P(AB) = P(A) P(B)

• Events A and B are mutually exclusive if P(AB) = 0

• The central limit theorem tells us that the normal distribution is a reasonable default when the data's probability distribution is unknown

• Entropy is the expected value of self-information, −E[log P(x)]

• KL divergence measures the difference between two probability distributions and is asymmetric