
SINGULAR LEARNING THEORY

Part I: Statistical Learning

Shaowei Lin (Institute for Infocomm Research, Singapore)

21-25 May 2013, Motivic Invariants and Singularities Thematic Program

Notre Dame University

Overview

Probability

Statistics

Bayesian

Regression

Algebraic Geometry

Asymptotic Theory

∫_{[0,1]²} (1 − x²y²)^{N/2} dx dy  ≈  (π/8) N^{−1/2} log N  −  (π/8)(1/log 2 − 2 log 2 − γ) N^{−1/2}  −  (1/4) N^{−1} log N  +  (1/4)(1/log 2 + 1 − γ) N^{−1}  −  (√(2π)/128) N^{−3/2} log N  +  · · ·

[Overview diagram: Statistical Learning, Algebraic Geometry and Asymptotic Theory, with Algebraic Statistics and Singular Learning Theory at their intersections.]

Overview

Probability

Statistics

Bayesian

Regression

Part I: Statistical Learning

Part II: Real Log Canonical Thresholds

Part III: Singularities in Graphical Models

Probability


• Random Variables

• Discrete·Continuous

• Gaussian

• Basic Concepts

• Independence


Random Variables


A probability space (ξ, F, P) consists of

• a sample space ξ, which is the set of all possible outcomes,
• a collection♯ F of events, which are subsets of ξ,
• an assignment P : F → [0, 1] of probabilities to events.

A (real-valued) random variable X : ξ → R^k is

• a function from the sample space to a real vector space,
• a measurement of the possible outcomes.
• X ∼ P means "X has the distribution given by P".

♯ σ-algebra: closed under complements and countable unions, and contains ∅. Probability measure: P(∅) = 0, P(ξ) = 1, countable additivity for disjoint events. Measurable function: for all x ∈ R^k, the preimage of {y ∈ R^k : y ≤ x} is in F.

Example. Rolling a fair die.

ξ = {⚀, ⚁, ⚂, ⚃, ⚄, ⚅},  F = {∅, …}
X ∈ {1, 2, 3, 4, 5, 6},  P(X = 1) = 1/6,  P(X ≤ 3) = 1/2
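A quick simulation sketch (not part of the slides) of the die example, using Python's standard library; the empirical frequencies should approach the probabilities stated above.

```python
import random

# Sample space: the six faces of a fair die; the random variable X maps each
# face to its numeric value (here the faces are represented by the numbers directly).
faces = [1, 2, 3, 4, 5, 6]

N = 100_000
rolls = [random.choice(faces) for _ in range(N)]

# Empirical estimates of P(X = 1) and P(X <= 3), which should be close
# to the exact values 1/6 and 1/2.
print(sum(r == 1 for r in rolls) / N)
print(sum(r <= 3 for r in rolls) / N)
```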

Discrete and Continuous Random Variables


If ξ is finite, we say X is a discrete random variable.

• probability mass function p(x) = P(X = x), x ∈ R^k.

If ξ is infinite, we define the

• cumulative distribution function (CDF) F(x)

  F(x) = P(X ≤ x), x ∈ R^k.

• probability density function♯ (PDF) p(y)

  F(x) = ∫_{y ∈ R^k : y ≤ x} p(y) dy, x ∈ R^k.

If the PDF exists, then X is a continuous random variable.

♯ The Radon-Nikodym derivative of F(x) with respect to the Lebesgue measure on R^k. We can also define PDFs for discrete variables if we allow the Dirac delta function.

The probability mass/density function is often informally referred to as the distribution of X.

Gaussian Random Variables


Example. Multivariate Gaussian distribution X ∼ N(µ, Σ).

X ∈ R^k, mean µ ∈ R^k, covariance Σ ∈ R^{k×k} positive definite.

p(x) = (2π)^{−k/2} (det Σ)^{−1/2} exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) )

[Figure: PDFs of univariate Gaussian distributions X ∼ N(µ, σ²).]
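As a sanity check (my own sketch, not from the slides), the density above can be evaluated directly with NumPy and compared against scipy.stats.multivariate_normal; the values of µ, Σ and x below are made up for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) at x, following the formula on the slide."""
    k = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (k / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm_const

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])   # symmetric positive definite
x = np.array([0.5, 0.5])

print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # the two values agree
```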

Basic Concepts


The expectation E[X] is the integral of the random variable X with respect to its probability measure, i.e. the "average".

Discrete variables:    E[X] = ∑_{x ∈ R^k} x P(X = x)
Continuous variables:  E[X] = ∫_{R^k} x p(x) dx

The variance E[(X − E[X])²] measures the "spread" of X.

The conditional probability♯ P(A|B) of two events A, B ∈ F is the probability that A will occur given that we know B has occurred.

• If P(B) > 0, then P(A|B) = P(A ∩ B)/P(B).

♯ The formal definition depends on the notion of conditional expectation.

Example. Weather forecast.

             Rain   No Rain
Thunder      0.2    0.1
No Thunder   0.3    0.4

P(Rain|Thunder) = 0.2/(0.1 + 0.2) = 2/3

Independence


Let X ∈ R^k, Y ∈ R^l, Z ∈ R^m be random variables.

X, Y are independent (X ⊥⊥ Y) if

• P(X ∈ S, Y ∈ T) = P(X ∈ S) P(Y ∈ T) for all measurable subsets S ⊂ R^k, T ⊂ R^l.
• i.e. "knowing X gives no information about Y"

X, Y are conditionally independent given Z (X ⊥⊥ Y | Z) if

• P(X ∈ S, Y ∈ T | Z = z) = P(X ∈ S | Z = z) P(Y ∈ T | Z = z) for all z ∈ R^m and measurable subsets S ⊂ R^k, T ⊂ R^l.
• i.e. "any dependence between X and Y is due to Z"

Example. Hidden variables.

Favorite color X ∈ {red, blue}, favorite food Y ∈ {salad, steak}. If X, Y are dependent, one may ask if there is a hidden variable, e.g. gender Z ∈ {female, male}, such that X ⊥⊥ Y | Z.

Statistics


• Statistical Model

• Maximum Likelihood

• Kullback-Leibler

• Mixture Models


Statistical Model


Let ∆ denote the space of probability distributions on the outcome space ξ.

Model: a family M of probability distributions, i.e. a subset of ∆.

Parametric model: a family M of distributions p(·|ω) indexed by parameters ω in a space Ω, i.e. we have a map Ω → ∆.

Example. Biased coin tosses.

Number of heads in two tosses of a coin: H ∈ ξ = {0, 1, 2}.

Space of distributions: ∆ = {p ∈ R^3_{≥0} : p(0) + p(1) + p(2) = 1}

Probability of getting heads: ω ∈ Ω = [0, 1] ⊂ R.

Parametric model for H:
p(0|ω) = (1 − ω)²
p(1|ω) = 2(1 − ω)ω
p(2|ω) = ω²

Implicit equation: 4 p(0)p(2) − p(1)² = 0
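A small sketch (mine, for illustration) confirming numerically that every distribution in this parametric family satisfies the implicit equation 4 p(0)p(2) − p(1)² = 0:

```python
import numpy as np

def coin_model(w):
    """Distribution of the number of heads H in two tosses of a coin with P(heads) = w."""
    return np.array([(1 - w) ** 2, 2 * (1 - w) * w, w ** 2])

for w in np.linspace(0.0, 1.0, 11):
    p = coin_model(w)
    # Every point of the parametrized family lies on the variety 4*p0*p2 - p1^2 = 0
    # and on the probability simplex.
    assert abs(4 * p[0] * p[2] - p[1] ** 2) < 1e-12
    assert abs(p.sum() - 1.0) < 1e-12

print("all parameter values satisfy the implicit equation")
```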

Maximum Likelihood


A sample X1, . . . , XN of X is a set of independent, identically distributed (i.i.d.) random variables with the same distribution as X.

Goal: Given a statistical model {p(·|ω) : ω ∈ Ω} and a sample, find a distribution p(·|ω̂) that best describes the sample.

A statistic f(X1, . . . , XN) is a function of the sample. An important statistic is the maximum likelihood estimate (MLE). It is a parameter ω̂ that maximizes the likelihood function

L(ω) = ∏_{i=1}^{N} p(Xi|ω).

Example. Biased coin tosses.

Suppose the table below summarizes a sample of H of size 100.

H       0    1    2
Count   25   45   30

Then L(ω) = 2^{45} ω^{105} (1 − ω)^{95}, and ω̂ = 105/200.
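A sketch (not from the slides) that recovers the MLE ω̂ = 105/200 numerically from the counts in the table, using SciPy's bounded scalar minimizer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

counts = {0: 25, 1: 45, 2: 30}   # observed numbers of heads in 100 pairs of tosses

def neg_log_likelihood(w):
    probs = {0: (1 - w) ** 2, 1: 2 * (1 - w) * w, 2: w ** 2}
    return -sum(n * np.log(probs[h]) for h, n in counts.items())

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)        # ~ 0.525
print(105 / 200)    # the closed-form MLE from the slide
```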

Kullback-Leibler Divergence


Let X1, . . . , XN be a sample of a discrete variable X. The empirical distribution is the function

q(x) = (1/N) ∑_{i=1}^{N} δ(x − Xi)

where δ(·) is the Kronecker delta function.

The Kullback-Leibler divergence of a distribution p from q is

K(q||p) = ∑_{x ∈ R^k} q(x) log ( q(x) / p(x) ).

Proposition. ML distributions q̂(·) = p(·|ω̂) ∈ M minimize the KL divergence from the empirical distribution q(·), since

K(q||q̂) = ∑_{x ∈ R^k} q(x) log q(x)  −  (1/N) log L(ω̂),

and the first term (the entropy term) does not depend on ω̂.
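The identity in the proposition can be checked numerically. Below is a sketch (my own) that computes K(q||p(·|ω)) for the biased-coin sample above and confirms it equals the entropy term minus (1/N) log L(ω) for several values of ω:

```python
import numpy as np

counts = np.array([25, 45, 30])   # sample of size N = 100 for H in {0, 1, 2}
N = counts.sum()
q = counts / N                    # empirical distribution

def model(w):
    return np.array([(1 - w) ** 2, 2 * (1 - w) * w, w ** 2])

def kl(q, p):
    return float(np.sum(q * np.log(q / p)))

for w in [0.3, 0.525, 0.7]:
    p = model(w)
    lhs = kl(q, p)
    entropy_term = float(np.sum(q * np.log(q)))          # sum_x q(x) log q(x)
    log_likelihood = float(np.sum(counts * np.log(p)))   # log L(w)
    rhs = entropy_term - log_likelihood / N
    print(w, lhs, rhs)   # lhs == rhs; both are smallest at the MLE w = 0.525
```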

Kullback-Leibler Divergence


Let X1, . . . , XN be a sample of a continuous variable X. The empirical distribution is the generalized function

q(x) = (1/N) ∑_{i=1}^{N} δ(x − Xi)

where δ(·) is the Dirac delta function.

The Kullback-Leibler divergence of a distribution p from q is

K(q||p) = ∫_{R^k} q(x) log ( q(x) / p(x) ) dx.

Proposition. ML distributions q̂(·) = p(·|ω̂) ∈ M minimize the KL divergence from the empirical distribution q(·), since

K(q||q̂) = ∫_{R^k} q(x) log q(x) dx  −  (1/N) log L(ω̂),

and the first term (the entropy term) does not depend on ω̂.


Kullback-Leibler Divergence


Example. Population mean and variance.

Let X ∼ N(µ, σ²) be the height of a random Singaporean. Given a sample X1, . . . , XN, estimate the mean µ and the variance σ².

Now, the Kullback-Leibler divergence is

K(q||p(·|µ, σ)) = (1/(2σ²N)) ∑_{i=1}^{N} (Xi − µ)² + log σ + constant.

Differentiating this function gives us the MLE

µ̂ = (1/N) ∑_{i=1}^{N} Xi,   σ̂² = (1/N) ∑_{i=1}^{N} (Xi − µ̂)².

The MLE for the model mean is the sample mean.
The MLE for the model variance is the sample variance.
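A sketch (mine) of these estimators on simulated heights; the population values 170 and 7² are made up for illustration. Note that the MLE divides by N, unlike the unbiased estimator that divides by N − 1.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=170.0, scale=7.0, size=1000)   # simulated heights in cm

# MLEs from the formulas above: sample mean and (biased) sample variance.
mu_hat = X.mean()
sigma2_hat = ((X - mu_hat) ** 2).mean()           # divides by N, not N - 1

print(mu_hat, sigma2_hat)
print(np.isclose(sigma2_hat, np.var(X)))          # np.var uses ddof=0 by default, i.e. the MLE
```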

Mixture Models


A mixture of distributions p1(·), . . . , pm(·) is a convex combination

p(x) = ∑_{i=1}^{m} αi pi(x),  x ∈ R^k,

i.e. the mixing coefficients αi are nonnegative and sum to one.

Example. Gaussian mixtures.

Mixing univariate Gaussians N(µi, σi²), i = 1, . . . , m, produces distributions of the form

p(x) = ∑_{i=1}^{m} ( αi / √(2πσi²) ) exp( −(x − µi)² / (2σi²) ).

This mixture model is therefore described by the parameters

ω = (α1, . . . , αm, µ1, . . . , µm, σ1, . . . , σm)

and is frequently used in cluster analysis.
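A sketch (not from the slides) of the Gaussian mixture density above; the mixing coefficients, means and standard deviations are arbitrary illustrative values.

```python
import numpy as np

def gaussian_mixture_pdf(x, alphas, mus, sigmas):
    """Density of a univariate Gaussian mixture, evaluated at the points in x."""
    x = np.asarray(x)[..., None]   # broadcast x against the m mixture components
    comp = np.exp(-((x - mus) ** 2) / (2 * sigmas ** 2)) / np.sqrt(2 * np.pi * sigmas ** 2)
    return comp @ alphas           # convex combination of the component densities

alphas = np.array([0.3, 0.7])     # nonnegative, summing to one
mus = np.array([-2.0, 1.0])
sigmas = np.array([0.5, 1.5])

xs = np.linspace(-8, 8, 2001)
p = gaussian_mixture_pdf(xs, alphas, mus, sigmas)
print((p * (xs[1] - xs[0])).sum())   # ~ 1: the mixture density integrates to one
```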

Bayesian Statistics


• Interpretations

• Bayes’ Rule

• Parameter Estimation

• Model Selection

• Estimating Integrals

• Information Criterion

• Singularities


Interpretations of Probability Theory


Bayes’ Rule


Updating our belief in an event A based on an observation B:

P(A|B) = ( P(B|A) / P(B) ) · P(A),

where P(A|B) is the posterior (new belief) and P(A) is the prior (old belief).

Example. Biased coin toss.

Let θ denote P(heads) of a coin. Determine if the coin is fair (θ = 1/2) or biased (θ = 3/4).

Old belief: P(fair) = 0.9.

Now, suppose we observed a sample with 8 heads and 2 tails. New belief:

P(fair|sample) = P(sample|fair) P(fair) / P(sample)
               = (1/2)^8 (1/2)^2 (0.9) / [ (1/2)^8 (1/2)^2 (0.9) + (3/4)^8 (1/4)^2 (0.1) ]
               ≈ 0.584
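The posterior probability 0.584 can be reproduced with a few lines of Python (a sketch, not part of the slides):

```python
# Prior belief about the coin and the likelihood of the observed sample
# (8 heads, 2 tails) under each hypothesis, as in the example above.
prior = {"fair": 0.9, "biased": 0.1}
theta = {"fair": 1 / 2, "biased": 3 / 4}

def likelihood(t):
    return t ** 8 * (1 - t) ** 2   # binomial coefficient cancels in Bayes' rule

joint = {h: likelihood(theta[h]) * prior[h] for h in prior}
evidence = sum(joint.values())                       # P(sample)
posterior = {h: joint[h] / evidence for h in joint}  # updated beliefs
print(posterior["fair"])                             # ~ 0.584
```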

Parameter Estimation


Let D = {X1, . . . , XN} denote a sample of X, i.e. "the data".

Frequentists: Compute the maximum likelihood estimate.
Bayesians: Treat parameters ω ∈ Ω as random variables with priors p(ω) that require updating.

Posterior distribution on parameters ω ∈ Ω:   p(ω|D) = p(D|ω) p(ω) / ∫_Ω p(D|ω) p(ω) dω

Posterior mean:   ∫_Ω ω p(ω|D) dω

Posterior mode:   argmax_{ω ∈ Ω} p(ω|D)

Predictive distribution on outcomes x ∈ ξ:   p(x|D) = ∫_Ω p(x|ω) p(ω|D) dω

These integrals are difficult to compute or even estimate!
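For a one-dimensional parameter these integrals can still be approximated on a grid. Below is a sketch (my own, assuming a uniform prior on Ω = [0, 1]) for the biased-coin data with 105 heads in 200 tosses:

```python
import numpy as np

omega = np.linspace(1e-6, 1 - 1e-6, 10_001)
prior = np.ones_like(omega)                              # p(omega): uniform on [0, 1]
log_lik = 105 * np.log(omega) + 95 * np.log(1 - omega)   # constants cancel when normalizing
unnorm = np.exp(log_lik - log_lik.max()) * prior         # p(D|omega) p(omega), rescaled for stability

d_omega = omega[1] - omega[0]
posterior = unnorm / (unnorm.sum() * d_omega)            # grid approximation of p(omega|D)

post_mean = (omega * posterior).sum() * d_omega
post_mode = omega[np.argmax(posterior)]
print(post_mean, post_mode)   # mean ~ 106/202, mode ~ 105/200 (the MLE)
```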

Model Selection


Which model M1, . . . , Mm best describes the sample D?

Frequentists: Pick the model containing the ML distribution.
Bayesians: Assign priors p(Mi), p(ω|Mi) and compute

P(Mi|D) = P(D|Mi) P(Mi) / P(D) ∝ P(D|Mi) P(Mi),

where P(D|Mi) is the likelihood integral

P(D|Mi) = ∫_Ω ∏_{j=1}^{N} p(Xj|ω, Mi) p(ω|Mi) dω.

Parameter estimation is a form of model selection! For each ω ∈ Ω, we define a model Mω with one distribution.
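A sketch (mine) of the likelihood integral for the coin data, comparing two candidate models: M1 "the coin is fair" (a single distribution) and M2 "ω is unknown", with a uniform prior p(ω|M2) chosen purely for illustration.

```python
import numpy as np
from scipy.integrate import quad

counts = {0: 25, 1: 45, 2: 30}   # the sample of size 100 from the MLE example

def log_likelihood(w):
    p = {0: (1 - w) ** 2, 1: 2 * (1 - w) * w, 2: w ** 2}
    return sum(n * np.log(p[h]) for h, n in counts.items())

# M1: a single distribution (omega = 1/2), so P(D|M1) needs no integral.
log_evidence_fair = log_likelihood(0.5)

# M2: integrate the likelihood over Omega = [0, 1] with a uniform prior,
# rescaling by its maximum so quad works at a reasonable magnitude.
log_max = log_likelihood(105 / 200)
scaled, _ = quad(lambda w: np.exp(log_likelihood(w) - log_max), 0.0, 1.0)
log_evidence_free = log_max + np.log(scaled)

# log P(D|M1) and log P(D|M2); combine with the priors P(Mi) via Bayes' rule
# to obtain the posterior model probabilities P(Mi|D).
print(log_evidence_fair, log_evidence_free)
```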

Estimating Integrals


Generally, there are three ways to estimate statistical integrals.

1. Exact methods: compute a closed-form formula for the integral, e.g. Baldoni·Berline·De Loera·Koppe·Vergne (2010); Lin·Sturmfels·Xu (2009).

2. Numerical methods: approximate using Markov chain Monte Carlo (MCMC) and other sampling techniques.

3. Asymptotic methods: analyze how the integral behaves for large samples. Rewrite the likelihood integral as

Z_N = ∫_Ω e^{−N f(ω)} ϕ(ω) dω

where f(ω) = −(1/N) log L(ω) and ϕ(ω) is the prior on Ω.

Bayesian Information Criterion


Laplace approximation: If f(ω) is uniquely minimized at the MLE ω̂ and the Hessian ∂²f(ω̂) is full rank, then asymptotically

−log Z_N ≈ N f(ω̂) + (dim Ω / 2) log N + O(1)

as the sample size N → ∞.

Bayesian information criterion (BIC): select the model that minimizes

N f(ω̂) + (dim Ω / 2) log N.

[Figure: graphs of e^{−N f(ω)} for N = 1 and N = 10; the integral is the volume under the graph.]
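A numerical sketch (mine) comparing −log Z_N with the BIC approximation for the one-parameter coin model, holding the empirical frequencies of H fixed as N grows and assuming a uniform prior ϕ(ω) = 1:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

q = np.array([0.25, 0.45, 0.30])   # fixed empirical frequencies of H = 0, 1, 2

def f(w):
    p = np.array([(1 - w) ** 2, 2 * (1 - w) * w, w ** 2])
    return float(-(q * np.log(p)).sum())   # f(w) = -(1/N) log L(w) for these frequencies

w_hat = minimize_scalar(f, bounds=(1e-6, 1 - 1e-6), method="bounded").x
dim_Omega = 1

for N in [10, 100, 1000, 10_000]:
    # Integrate e^{-N (f(w) - f(w_hat))} to avoid underflow, then shift back.
    shifted, _ = quad(lambda w: np.exp(-N * (f(w) - f(w_hat))), 0.0, 1.0, points=[w_hat])
    neg_log_ZN = N * f(w_hat) - np.log(shifted)
    bic = N * f(w_hat) + (dim_Omega / 2) * np.log(N)
    print(N, neg_log_ZN, bic, neg_log_ZN - bic)   # the gap stays bounded (the O(1) term)
```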

Singularities in Statistical Models


Informally, the model is singular at ω0 ∈ Ω if the Laplace approximation fails when the empirical distribution is p(·|ω0).

Formally, if we define the Kullback-Leibler function

K(ω) = ∫_ξ p(x|ω0) log ( p(x|ω0) / p(x|ω) ) dx,

then ω0 is a singularity when the Hessian ∂²K(ω0) is not full rank.

Statistical models with hidden variables, e.g. mixture models, often contain many singularities.
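A toy illustration (my own construction, not an example from the slides): observe Y ∼ N((a, ab), I_2) with parameters ω = (a, b). The Kullback-Leibler function for a true parameter (a0, b0) can be written down explicitly, and its Hessian drops rank exactly when a0 = 0, so those points are singularities.

```python
import sympy as sp

a, b, a0, b0 = sp.symbols("a b a0 b0", real=True)

# KL divergence between N((a0, a0*b0), I) and N((a, a*b), I):
K = ((a - a0) ** 2 + (a * b - a0 * b0) ** 2) / 2

H = sp.hessian(K, (a, b)).subs({a: a0, b: b0})   # Hessian of K at the true parameter
print(sp.simplify(H))          # [[1 + b0**2, a0*b0], [a0*b0, a0**2]]
print(sp.simplify(H.det()))    # a0**2, so the Hessian is rank-deficient iff a0 = 0

print(H.subs({a0: 0, b0: 3}).rank())   # 1 < dim(Omega) = 2  ->  a singularity
print(H.subs({a0: 1, b0: 3}).rank())   # 2                   ->  a regular point
```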

Linear Regression


• Least Squares

• Sparsity Penalty

• Paradigms

Least Squares


Suppose we have random variables Y ∈ R, X ∈ R^d that satisfy

Y = ω · X + ε.

Parameters ω ∈ R^d; noise ε ∼ N(0, 1); data (Yi, Xi), i = 1, . . . , N.

• Commonly computed quantities (see the sketch after this list)
  MLE: argmin_ω ∑_{i=1}^{N} |Yi − ω · Xi|²
  Penalized MLE: argmin_ω ∑_{i=1}^{N} |Yi − ω · Xi|² + π(ω)

• Commonly used penalties
  LASSO: π(ω) = |ω|_1 · β
  Bayesian Info Criterion (BIC): π(ω) = |ω|_0 · log N
  Akaike Info Criterion (AIC): π(ω) = |ω|_0 · 2

• Commonly asked questions
  Model selection (e.g. which factors are important?)
  Parameter estimation (e.g. how important are the factors?)
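A sketch (mine) of the MLE and the BIC-penalized MLE on simulated data; the design, the true coefficients, and the exhaustive search over supports are all illustrative choices, not part of the slides.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
N, d = 200, 5
X = rng.normal(size=(N, d))
true_w = np.array([2.0, 0.0, -1.0, 0.0, 0.0])   # only factors 0 and 2 matter
Y = X @ true_w + rng.normal(size=N)             # noise eps ~ N(0, 1)

# MLE (ordinary least squares) using all d coordinates.
w_mle, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Penalized MLE with the BIC penalty |w|_0 * log N, via exhaustive search over
# supports (AIC would replace log N by 2).
def bic_score(support):
    if not support:
        return float(np.sum(Y ** 2))
    Xs = X[:, support]
    ws, *_ = np.linalg.lstsq(Xs, Y, rcond=None)
    rss = float(np.sum((Y - Xs @ ws) ** 2))
    return rss + len(support) * np.log(N)

supports = [list(s) for k in range(d + 1) for s in combinations(range(d), k)]
best = min(supports, key=bic_score)
print(w_mle)   # estimates for all five coefficients
print(best)    # typically recovers [0, 2], the truly relevant factors
```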

Sparsity Penalty


The best model is usually selected using a score

argmin_{u ∈ Ω} l(u) + π(u)

where the likelihood l(u) measures the fitting error of the model while the penalty π(u) measures its complexity.

Recently, sparse penalties derived from statistical considerations were found to be highly effective.

Bayesian info criterion (BIC) ↔ marginal likelihood integral
Akaike info criterion (AIC) ↔ Kullback-Leibler divergence
Compressive sensing ↔ ℓ1-regularization of BIC

Singular learning theory plays an important role in these derivations.

Paradigms in Statistical Learning


• Probability: the theory of random phenomena.
  Statistics: the theory of making sense of data.
  Learning: the art of prediction using data.

• Frequentists: "True distribution, maximum likelihood."
  Bayesians: "Belief updates, maximum a posteriori."

  Interpretations do not affect the correctness of probability theory, but they greatly affect the statistical methodology.

• We often have to balance the complexity of the model with its fitting error via some suitable probabilistic criteria. The learning algorithm also needs to be computable to be useful.


Thank you!

“Algebraic Methods for Evaluating Integrals in Bayesian Statistics”

http://math.berkeley.edu/~shaowei/swthesis.pdf

(PhD dissertation, May 2011)

References


1. V. I. ARNOL'D, S. M. GUSEIN-ZADE AND A. N. VARCHENKO: Singularities of Differentiable Maps, Vol. II, Birkhauser, Boston, 1985.

2. V. BALDONI, N. BERLINE, J. A. DE LOERA, M. KOPPE, M. VERGNE: How to integrate a polynomial over a simplex, Mathematics of Computation 80 (2010) 297–325.

3. A. BRAVO, S. ENCINAS AND O. VILLAMAYOR: A simplified proof of desingularisation and applications, Rev. Math. Iberoamericana 21 (2005) 349–458.

4. D. A. COX, J. B. LITTLE AND D. O'SHEA: Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra, Springer-Verlag, New York, 1997.

5. R. DURRETT: Probability: Theory and Examples (4th edition), Cambridge University Press, 2010.

6. M. EVANS, Z. GILULA AND I. GUTTMAN: Latent class analysis of two-way contingency tables by Bayesian methods, Biometrika 76 (1989) 557–563.

7. H. HIRONAKA: Resolution of singularities of an algebraic variety over a field of characteristic zero I, II, Ann. of Math. (2) 79 (1964) 109–203.

8. S. LIN, B. STURMFELS AND Z. XU: Marginal likelihood integrals for mixtures of independence models, J. Mach. Learn. Res. 10 (2009) 1611–1631.

9. S. LIN: Algebraic methods for evaluating integrals in Bayesian statistics, PhD dissertation, Dept. of Mathematics, UC Berkeley (2011).

10. L. MACKEY: Fundamentals: Probability and Statistics, slides for CS 281A: Statistical Learning Theory at UC Berkeley, Aug 2009.

11. S. WATANABE: Algebraic Geometry and Statistical Learning Theory, Cambridge Monographs on Applied and Computational Mathematics 25, Cambridge University Press, Cambridge, 2009.