Page 1

SINGULAR LEARNING THEORY

Part I: Statistical Learning

Shaowei Lin (Institute for Infocomm Research, Singapore)

21-25 May 2013

Motivic Invariants and Singularities Thematic Program

Notre Dame University

Page 2

Overview

Probability

Statistics

Bayesian

Regression

Algebraic Geometry

Asymptotic Theory

∫_{[0,1]^2} (1 − x^2 y^2)^{N/2} dx dy ≈ √(π/8) N^{−1/2} log N − √(π/8) (1/log 2 − 2 log 2 − γ) N^{−1/2} − (1/4) N^{−1} log N + (1/4) (1/log 2 + 1 − γ) N^{−1} − (√(2π)/128) N^{−3/2} log N + ···

Statistical Learning

[Diagram relating Algebraic Geometry, Asymptotic Theory, and Statistical Learning to Algebraic Statistics and Singular Learning Theory.]

Page 3

Overview

Probability

Statistics

Bayesian

Regression

Part I: Statistical Learning

Part II: Real Log Canonical Thresholds

Part III: Singularities in Graphical Models

Page 4

Probability

Probability

• Random Variables

• Discrete·Continuous

• Gaussian

• Basic Concepts

• Independence

Statistics

Bayesian

Regression

Page 5

Random Variables

A probability space (ξ, F, P) consists of

• a sample space ξ, which is the set of all possible outcomes,
• a collection♯ F of events, which are subsets of ξ,
• an assignment P : F → [0, 1] of probabilities to events.

A (real-valued) random variable X : ξ → R^k is

• a function from the sample space to a real vector space,
• a measurement of the possible outcomes.
• X ∼ P means “X has the distribution given by P”.

♯ σ-algebra: closed under complement and countable union, and contains ∅. Probability measure: P(∅) = 0, P(ξ) = 1, countable additivity for disjoint events. Measurable function: for all x ∈ R^k, the preimage of {y ∈ R^k : y ≤ x} is in F.

Example. Rolling a fair die.

ξ = {⚀, ⚁, ⚂, ⚃, ⚄, ⚅},  F = {∅, …},  X ∈ {1, 2, 3, 4, 5, 6}
P(X = 1) = 1/6,  P(X ≤ 3) = 1/2

Page 6

Discrete and Continuous Random Variables

If ξ is finite, we say X is a discrete random variable.

• probability mass function p(x) = P(X = x), x ∈ R^k.

If ξ is infinite, we define the

• cumulative distribution function (CDF) F(x)

  F(x) = P(X ≤ x), x ∈ R^k.

• probability density function♯ (PDF) p(y)

  F(x) = ∫_{y ∈ R^k : y ≤ x} p(y) dy, x ∈ R^k.

If the PDF exists, then X is a continuous random variable.

♯ Radon-Nikodym derivative of F(x) with respect to the Lebesgue measure on R^k. We can also define PDFs for discrete variables if we allow the Dirac delta function.

The probability mass/density function is often informally referred to as the distribution of X.

Page 7

Gaussian Random Variables

Example. Multivariate Gaussian distribution X ∼ N(µ, Σ).

X ∈ R^k, mean µ ∈ R^k, covariance Σ ∈ R^{k×k} positive definite (Σ ≻ 0)

p(x) = 1 / ( (2π)^{k/2} (det Σ)^{1/2} ) · exp( −(1/2) (x − µ)^⊤ Σ^{−1} (x − µ) )

[Figure: PDFs of univariate Gaussian distributions X ∼ N(µ, σ^2).]
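A minimal numerical sketch of this density (not from the slides; the function name gaussian_pdf and the use of NumPy are my own choices):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Density of the multivariate Gaussian N(mu, Sigma) evaluated at the point x."""
    k = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)          # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (k / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

# Standard bivariate Gaussian at the origin: density is 1/(2*pi) ~ 0.159.
print(gaussian_pdf(np.zeros(2), np.zeros(2), np.eye(2)))
```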

Page 8

Basic Concepts

The expectation E[X] is the integral of the random variable X with respect to its probability measure, i.e. the “average”.

Discrete variables:    E[X] = ∑_{x ∈ R^k} x P(X = x)
Continuous variables:  E[X] = ∫_{R^k} x p(x) dx

The variance E[(X − E[X])^2] measures the “spread” of X.

The conditional probability♯ P(A|B) of two events A, B ∈ F is the probability that A will occur given that we know B has occurred.

• If P(B) > 0, then P(A|B) = P(A ∩ B)/P(B).

♯ The formal definition depends on the notion of conditional expectation.

Example. Weather forecast.

              Rain   No Rain
Thunder       0.2    0.1
No Thunder    0.3    0.4

P(Rain|Thunder) = 0.2/(0.1 + 0.2) = 2/3
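The conditional probability in this example can be checked with a few lines of Python (the dictionary layout below is my own illustration of the table):

```python
# Joint distribution over (thunder, rain), taken from the table above.
joint = {
    ("thunder", "rain"): 0.2, ("thunder", "no rain"): 0.1,
    ("no thunder", "rain"): 0.3, ("no thunder", "no rain"): 0.4,
}

p_thunder = sum(p for (t, r), p in joint.items() if t == "thunder")
p_rain_and_thunder = joint[("thunder", "rain")]
print(p_rain_and_thunder / p_thunder)    # P(Rain | Thunder) = 0.2 / 0.3 = 2/3
```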

Page 9

Independence

Let X ∈ R^k, Y ∈ R^l, Z ∈ R^m be random variables.

X, Y are independent (X ⊥⊥ Y) if

• P(X ∈ S, Y ∈ T) = P(X ∈ S) P(Y ∈ T) for all measurable subsets S ⊂ R^k, T ⊂ R^l.
• i.e. “knowing X gives no information about Y”

X, Y are conditionally independent given Z (X ⊥⊥ Y | Z) if

• P(X ∈ S, Y ∈ T | Z = z) = P(X ∈ S | Z = z) P(Y ∈ T | Z = z) for all z ∈ R^m and measurable subsets S ⊂ R^k, T ⊂ R^l.
• i.e. “any dependence between X and Y is due to Z”

Example. Hidden variables.

Favorite color X ∈ {red, blue}, favorite food Y ∈ {salad, steak}. If X, Y are dependent, one may ask if there is a hidden variable, e.g. gender Z ∈ {female, male}, such that X ⊥⊥ Y | Z.

Page 10

Statistics

Probability

Statistics

• Statistical Model

• Maximum Likelihood

• Kullback-Leibler

• Mixture Models

Bayesian

Regression

Page 11

Statistical Model

Let ∆ denote the space of distributions on the outcome space ξ.

Model: a family M of probability distributions, i.e. a subset of ∆.

Parametric model: a family M of distributions p(·|ω) indexed by parameters ω in a space Ω, i.e. we have a map Ω → ∆.

Example. Biased coin tosses.

Number of heads in two tosses of a coin: H ∈ ξ = {0, 1, 2}

Space of distributions: ∆ = {p ∈ R^3_{≥0} : p(0) + p(1) + p(2) = 1}

Probability of getting heads: ω ∈ Ω = [0, 1] ⊂ R

Parametric model for H:
p(0|ω) = (1 − ω)^2
p(1|ω) = 2(1 − ω)ω
p(2|ω) = ω^2

Implicit equation: 4 p(0) p(2) − p(1)^2 = 0
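A quick check, in Python, that every distribution in this parametric model satisfies the implicit equation (the helper name model is my own):

```python
# Parametric model for the number of heads H in two tosses of a biased coin.
def model(w):
    """Return (p(0|w), p(1|w), p(2|w)) for heads-probability w."""
    return ((1 - w) ** 2, 2 * (1 - w) * w, w ** 2)

for w in (0.1, 0.5, 0.9):
    p0, p1, p2 = model(w)
    # Every distribution in the image of the map Omega -> Delta satisfies this equation.
    print(round(4 * p0 * p2 - p1 ** 2, 12))    # 0.0 each time
```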

Page 12

Maximum Likelihood

A sample X1, . . . , XN of X is a set of independent, identically distributed (i.i.d.) random variables with the same distribution as X.

Goal: Given a statistical model {p(·|ω) : ω ∈ Ω} and a sample, find a distribution p(·|ω) that best describes the sample.

A statistic f(X1, . . . , XN) is a function of the sample. An important statistic is the maximum likelihood estimate (MLE). It is a parameter ω̂ that maximizes the likelihood function

L(ω) = ∏_{i=1}^N p(Xi|ω).

Example. Biased coin tosses.

Suppose the table below summarizes a sample of H of size 100.

H       0    1    2
Count   25   45   30

Then L(ω) = 2^45 ω^105 (1 − ω)^95 and ω̂ = 105/200.
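A hedged sketch of how this MLE can be found numerically; the grid search below is my own illustration and should agree with the closed-form answer 105/200:

```python
import numpy as np

counts = {0: 25, 1: 45, 2: 30}    # observed counts of H in a sample of size 100

def log_likelihood(w):
    p = {0: (1 - w) ** 2, 1: 2 * (1 - w) * w, 2: w ** 2}
    return sum(n * np.log(p[h]) for h, n in counts.items())

grid = np.linspace(0.001, 0.999, 9999)
w_hat = grid[np.argmax([log_likelihood(w) for w in grid])]
print(w_hat, 105 / 200)           # both are close to 0.525
```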

Page 13

Kullback-Leibler Divergence

Let X1, . . . , XN be a sample of a discrete variable X. The empirical distribution is the function

q(x) = (1/N) ∑_{i=1}^N δ(x − Xi)

where δ(·) is the Kronecker delta function.

The Kullback-Leibler divergence of a distribution p from q is

K(q||p) = ∑_{x ∈ R^k} q(x) log [ q(x) / p(x) ].

Proposition. ML distributions minimize the KL divergence of q̂(·) = p(·|ω̂) ∈ M from the empirical distribution q(·).

K(q||q̂) = ∑_{x ∈ R^k} q(x) log q(x)   [entropy]   − (1/N) log L(ω̂)
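A small numerical check (my own sketch) that the KL divergence from the empirical distribution splits into the entropy term minus (1/N) log L(ω), using the coin-toss counts from the earlier example:

```python
import numpy as np

counts = {0: 25, 1: 45, 2: 30}
N = sum(counts.values())
q = {h: n / N for h, n in counts.items()}       # empirical distribution

def model(w):
    return {0: (1 - w) ** 2, 1: 2 * (1 - w) * w, 2: w ** 2}

w = 0.4                                         # any parameter value works here
p = model(w)
KL = sum(q[h] * np.log(q[h] / p[h]) for h in q)
entropy_term = sum(q[h] * np.log(q[h]) for h in q)
log_L = sum(n * np.log(p[h]) for h, n in counts.items())
print(KL, entropy_term - log_L / N)             # the two numbers agree
```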

Page 14

Kullback-Leibler Divergence

Let X1, . . . , XN be a sample of a continuous variable X. The empirical distribution is the generalized function

q(x) = (1/N) ∑_{i=1}^N δ(x − Xi)

where δ(·) is the Dirac delta function.

The Kullback-Leibler divergence of a distribution p from q is

K(q||p) = ∫_{R^k} q(x) log [ q(x) / p(x) ] dx.

Proposition. ML distributions minimize the KL divergence of q̂(·) = p(·|ω̂) ∈ M from the empirical distribution q(·).

K(q||q̂) = ∫_{R^k} q(x) log q(x) dx   [entropy]   − (1/N) log L(ω̂)

Page 15

Kullback-Leibler Divergence

Page 16

Kullback-Leibler Divergence

Example. Population mean and variance.

Let X ∼ N(µ, σ^2) be the height of a random Singaporean. Given a sample X1, . . . , XN, estimate the mean µ and variance σ^2.

Now, the Kullback-Leibler divergence is

K(q||p(·|µ, σ)) = (1/(2σ^2 N)) ∑_{i=1}^N (Xi − µ)^2 + log σ + constant.

Differentiating this function gives us the MLE

µ̂ = (1/N) ∑_{i=1}^N Xi,    σ̂^2 = (1/N) ∑_{i=1}^N (Xi − µ̂)^2.

The MLE for the model mean is the sample mean. The MLE for the model variance is the sample variance.
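A short sketch of these estimators on simulated data (the simulated heights are my own illustration, not data from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=170.0, scale=7.0, size=1000)    # simulated heights, in cm

mu_hat = X.mean()                                  # MLE of the mean = sample mean
sigma2_hat = ((X - mu_hat) ** 2).mean()            # MLE of the variance = sample variance (1/N, not 1/(N-1))
print(mu_hat, sigma2_hat)                          # close to 170 and 49
```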

Page 17

Mixture Models

A mixture of distributions p1(·), . . . , pm(·) is a convex combination

p(x) = ∑_{i=1}^m αi pi(x),   x ∈ R^k,

i.e. the mixing coefficients αi are nonnegative and sum to one.

Example. Gaussian mixtures.

Mixing univariate Gaussians N(µi, σi^2), i = 1, . . . , m, produces distributions of the form

p(x) = ∑_{i=1}^m [ αi / √(2π σi^2) ] exp( −(x − µi)^2 / (2σi^2) ).

This mixture model is therefore described by parameters

ω = (α1, . . . , αm, µ1, . . . , µm, σ1, . . . , σm)

and is frequently used in cluster analysis.
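A minimal sketch of evaluating such a mixture density; the function name mixture_pdf and the chosen parameters are my own:

```python
import numpy as np

def mixture_pdf(x, alphas, mus, sigmas):
    """Density of a univariate Gaussian mixture, evaluated at the array x."""
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for a, m, s in zip(alphas, mus, sigmas):
        total += a * np.exp(-(x - m) ** 2 / (2 * s ** 2)) / np.sqrt(2 * np.pi * s ** 2)
    return total

# A two-component mixture; its density should integrate to 1.
grid = np.linspace(-10.0, 10.0, 20001)
p = mixture_pdf(grid, alphas=[0.3, 0.7], mus=[-2.0, 1.5], sigmas=[1.0, 0.5])
print((p * (grid[1] - grid[0])).sum())    # ~1.0
```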

Page 18

Bayesian Statistics

Probability

Statistics

Bayesian

• Interpretations

• Bayes’ Rule

• Parameter Estimation

• Model Selection

• Estimating Integrals

• Information Criterion

• Singularities

Regression

Page 19

Interpretations of Probability Theory

Page 20

Interpretations of Probability Theory

Page 21

Bayes’ Rule

Updating our belief in an event A based on an observation B:

P(A|B) = [ P(B|A) / P(B) ] · P(A)

where P(A|B) is the posterior (new belief) and P(A) is the prior (old belief).

Example. Biased coin toss.

Let θ denote P(heads) of a coin. Determine if the coin is fair (θ = 1/2) or biased (θ = 3/4).

Old belief: P(fair) = 0.9
Now, suppose we observed a sample with 8 heads and 2 tails.
New belief:

P(fair|sample) = P(sample|fair) P(fair) / P(sample)
             = (1/2)^8 (1/2)^2 (0.9) / [ (1/2)^8 (1/2)^2 (0.9) + (3/4)^8 (1/4)^2 (0.1) ] ≈ 0.584
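The same posterior can be computed directly (a small sketch; the variable names are mine):

```python
# Posterior probability that the coin is fair, after observing 8 heads and 2 tails.
p_fair = 0.9                                  # prior (old belief)
lik_fair = (1 / 2) ** 8 * (1 / 2) ** 2        # P(sample | theta = 1/2)
lik_biased = (3 / 4) ** 8 * (1 / 4) ** 2      # P(sample | theta = 3/4)

posterior = lik_fair * p_fair / (lik_fair * p_fair + lik_biased * (1 - p_fair))
print(posterior)                              # ~0.584
```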

Page 22

Parameter Estimation

Let D = {X1, . . . , XN} denote a sample of X, i.e. “the data”.

Frequentists: Compute the maximum likelihood estimate.
Bayesians: Treat parameters ω ∈ Ω as random variables with priors p(ω) that require updating.

Posterior distribution on parameters ω ∈ Ω:    p(ω|D) = p(D|ω) p(ω) / ∫_Ω p(D|ω) p(ω) dω

Posterior mean:    ∫_Ω ω p(ω|D) dω

Posterior mode:    argmax_{ω ∈ Ω} p(ω|D)

Predictive distribution on outcomes x ∈ ξ:    p(x|D) = ∫_Ω p(x|ω) p(ω|D) dω

These integrals are difficult to compute or even estimate!
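As a rough illustration of what these quantities look like when the integrals are tractable, here is a grid approximation (my own sketch) for a coin with unknown heads-probability, 8 heads and 2 tails, and a uniform prior:

```python
import numpy as np

# Grid approximation of the posterior over the heads-probability w of a coin,
# given 8 heads and 2 tails and a uniform prior on Omega = [0, 1].
w = np.linspace(0.0, 1.0, 10001)
dw = w[1] - w[0]
likelihood = w ** 8 * (1 - w) ** 2             # p(D | w)
unnorm = likelihood * 1.0                      # times the uniform prior p(w) = 1
posterior = unnorm / (unnorm.sum() * dw)       # normalize so the density integrates to 1

post_mean = (w * posterior * dw).sum()         # ~0.75  (the Beta(9, 3) posterior mean)
post_mode = w[np.argmax(posterior)]            # ~0.80  (the posterior mode)
print(post_mean, post_mode)
```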

Page 23

Model Selection

Which model M1, . . . ,Mm best describes the sample D?

Frequentists: Pick the model containing the ML distribution.
Bayesians: Assign priors p(Mi), p(ω|Mi) and compute

P(Mi|D) = P(D|Mi) P(Mi) / P(D) ∝ P(D|Mi) P(Mi),

where P(D|Mi) is the likelihood integral

P(D|Mi) = ∫_Ω ∏_{j=1}^N p(Xj|ω, Mi) p(ω|Mi) dω.

Parameter estimation is a form of model selection! For each ω ∈ Ω, we define a model Mω with one distribution.
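A small sketch (my own example, not from the slides) comparing the likelihood integrals of two models for a sequence of 10 tosses with 8 heads and 2 tails: a fair-coin model with no parameters, and a model with an unknown heads-probability and a uniform prior:

```python
import numpy as np

heads, tails = 8, 2

# M1: the coin is fair, so the observed sequence has probability (1/2)^10.
P_D_given_M1 = (1 / 2) ** (heads + tails)

# M2: unknown heads-probability w with a uniform prior; integrate the likelihood over Omega = [0, 1].
w = np.linspace(0.0, 1.0, 100001)
dw = w[1] - w[0]
P_D_given_M2 = (w ** heads * (1 - w) ** tails).sum() * dw

print(P_D_given_M1, P_D_given_M2)    # with equal priors P(Mi), the model with the larger value is preferred
```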

Page 24

Estimating Integrals

Generally, there are three ways to estimate statistical integrals.

1. Exact methods
Compute a closed-form formula for the integral, e.g. Baldoni·Berline·De Loera·Koppe·Vergne, 2010; Lin·Sturmfels·Xu, 2009.

2. Numerical methods
Approximate using Markov chain Monte Carlo (MCMC) and other sampling techniques.

3. Asymptotic methods
Analyze how the integral behaves for large samples. Rewrite the likelihood integral as

ZN = ∫_Ω e^{−N f(ω)} ϕ(ω) dω

where f(ω) = −(1/N) log L(ω) and ϕ(ω) is the prior on Ω.

Page 25

Bayesian Information Criterion

Laplace approximation: If f(ω) is uniquely minimized at the MLE ω̂ and the Hessian ∂^2 f(ω̂) is full rank, then asymptotically

−log ZN ≈ N f(ω̂) + (dim Ω / 2) log N + O(1)

as the sample size N → ∞.

Bayesian information criterion (BIC): Select the model that minimizes

N f(ω̂) + (dim Ω / 2) log N.

[Figure: graphs of e^{−N f(ω)} for N = 1 and N = 10. The integral is the volume under the graph.]
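A numerical sanity check of the Laplace approximation in a regular (non-singular) toy case of my own choosing, with Ω = [0, 1], a uniform prior, and f(ω) = (ω − 0.3)^2 / 2:

```python
import numpy as np

def f(w):
    return (w - 0.3) ** 2 / 2      # unique minimum at w_hat = 0.3, with f(w_hat) = 0

w = np.linspace(0.0, 1.0, 200001)
dw = w[1] - w[0]
for N in (10, 100, 1000, 10000):
    Z_N = np.exp(-N * f(w)).sum() * dw          # uniform prior phi(w) = 1 on Omega = [0, 1]
    bic = N * f(0.3) + 0.5 * np.log(N)          # N f(w_hat) + (dim Omega / 2) log N
    print(N, -np.log(Z_N), bic)                 # the difference stays O(1) as N grows
```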

Page 26

Singularities in Statistical Models

Informally, the model is singular at ω0 ∈ Ω if the Laplace approximation fails when the empirical distribution is p(·|ω0).

Formally, if we define the Kullback-Leibler function

K(ω) = ∫_ξ p(x|ω0) log [ p(x|ω0) / p(x|ω) ] dx,

then ω0 is a singularity when the Hessian ∂^2 K(ω0) is not full rank.

Statistical models with hidden variables, e.g. mixture models, often contain many singularities.
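An illustrative toy example of a singularity (my own construction, not from the slides): for the model X ∼ N(ab, 1) with parameters ω = (a, b) and true parameter ω0 = (0, 0), the function K works out to K(a, b) = (ab)^2 / 2, and its Hessian at ω0 is rank-deficient. The sketch below checks this with a finite-difference Hessian:

```python
import numpy as np

a0, b0 = 0.0, 0.0                      # true parameter w0 = (0, 0)

def K(a, b):
    # KL( N(a0*b0, 1) || N(a*b, 1) ) for unit-variance Gaussians
    return (a * b - a0 * b0) ** 2 / 2

def hessian(fun, a, b, h=1e-4):
    """Numerical Hessian of fun at (a, b) by central finite differences."""
    def d2(v1, v2):
        (da1, db1), (da2, db2) = v1, v2
        return (fun(a + h*(da1 + da2), b + h*(db1 + db2))
                - fun(a + h*(da1 - da2), b + h*(db1 - db2))
                - fun(a - h*(da1 - da2), b - h*(db1 - db2))
                + fun(a - h*(da1 + da2), b - h*(db1 + db2))) / (4 * h * h)
    e1, e2 = (1, 0), (0, 1)
    return np.array([[d2(e1, e1), d2(e1, e2)],
                     [d2(e2, e1), d2(e2, e2)]])

H = hessian(K, a0, b0)
print(H)
print(np.linalg.matrix_rank(H, tol=1e-6))   # 0 < dim Omega = 2, so w0 is a singularity
```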

Page 27

Linear Regression

Probability

Statistics

Bayesian

Regression

• Least Squares

• Sparsity Penalty

• Paradigms

Page 28

Least Squares

Suppose we have random variables Y ∈ R, X ∈ R^d that satisfy

Y = ω · X + ε.

Parameters ω ∈ R^d; noise ε ∼ N(0, 1); data (Yi, Xi), i = 1, …, N.

• Commonly computed quantities
MLE: argmin_ω ∑_{i=1}^N |Yi − ω · Xi|^2
Penalized MLE: argmin_ω ∑_{i=1}^N |Yi − ω · Xi|^2 + π(ω)

• Commonly used penalties
LASSO: π(ω) = |ω|_1 · β
Bayesian Info Criterion (BIC): π(ω) = |ω|_0 · log N
Akaike Info Criterion (AIC): π(ω) = |ω|_0 · 2

• Commonly asked questions
Model selection (e.g. which factors are important?)
Parameter estimation (e.g. how important are the factors?)
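A hedged sketch of BIC-penalized model selection for this regression setting, on simulated data (the data-generating coefficients and the exhaustive subset search are my own illustration):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
N, d = 100, 4
X = rng.normal(size=(N, d))
w_true = np.array([2.0, 0.0, -1.0, 0.0])       # only factors 0 and 2 actually matter
Y = X @ w_true + rng.normal(size=N)            # Y = w . X + noise

def rss(cols):
    """Residual sum of squares of the least-squares fit that uses only the given columns."""
    if not cols:
        return float((Y ** 2).sum())
    w, *_ = np.linalg.lstsq(X[:, cols], Y, rcond=None)
    return float(((Y - X[:, cols] @ w) ** 2).sum())

scores = {}
for k in range(d + 1):
    for cols in combinations(range(d), k):
        cols = list(cols)
        scores[tuple(cols)] = rss(cols) + len(cols) * np.log(N)   # BIC penalty; use 2*len(cols) for AIC
print(min(scores, key=scores.get))             # typically selects the factors (0, 2)
```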

Page 29

Sparsity Penalty

The best model is usually selected using a score

argmin_{u ∈ Ω} l(u) + π(u)

where the likelihood term l(u) measures the fitting error of the model while the penalty π(u) measures its complexity.

Recently, sparse penalties derived from statistical considerations were found to be highly effective.

Bayesian info criterion (BIC) ↔ Marginal likelihood integral
Akaike info criterion (AIC) ↔ Kullback-Leibler divergence

Compressive sensing ↔ ℓ1-regularization of BIC

Singular learning theory plays an important role in these derivations.

Page 30

Paradigms in Statistical Learning

• Probability: the theory of random phenomena
  Statistics: the theory of making sense of data
  Learning: the art of prediction using data

• Frequentists: “True distribution, maximum likelihood”
  Bayesians: “Belief updates, maximum a posteriori”

Interpretations do not affect the correctness of probability theory, but they greatly affect the statistical methodology.

• We often have to balance the complexity of the model with its fitting error via some suitable probabilistic criteria. The learning algorithm also needs to be computable to be useful.

Page 31

Thank you!

“Algebraic Methods for Evaluating Integrals in Bayesian Statistics”

http://math.berkeley.edu/~shaowei/swthesis.pdf

(PhD dissertation, May 2011)

Page 32

References

1. V. I. ARNOL’D, S. M. GUSEIN-ZADE AND A. N. VARCHENKO: Singularities of Differentiable Maps, Vol. II, Birkhauser, Boston, 1985.

2. V. BALDONI, N. BERLINE, J. A. DE LOERA, M. KOPPE, M. VERGNE: How to integrate a polynomial over a simplex, Mathematics of Computation 80 (2010) 297–325.

3. A. BRAVO, S. ENCINAS AND O. VILLAMAYOR: A simplified proof of desingularisation and applications, Rev. Mat. Iberoamericana 21 (2005) 349–458.

4. D. A. COX, J. B. LITTLE AND D. O’SHEA: Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra, Springer-Verlag, New York, 1997.

5. R. DURRETT: Probability: Theory and Examples (4th edition), Cambridge U. Press, 2010.

6. M. EVANS, Z. GILULA AND I. GUTTMAN: Latent class analysis of two-way contingency tables by Bayesian methods, Biometrika 76 (1989) 557–563.

7. H. HIRONAKA: Resolution of singularities of an algebraic variety over a field of characteristic zero I, II, Ann. of Math. (2) 79 (1964) 109–203.

8. S. LIN, B. STURMFELS AND Z. XU: Marginal likelihood integrals for mixtures of independence models, J. Mach. Learn. Res. 10 (2009) 1611–1631.

9. S. LIN: Algebraic methods for evaluating integrals in Bayesian statistics, PhD dissertation, Dept. Mathematics, UC Berkeley (2011).

10. L. MACKEY: Fundamentals - Probability and Statistics, slides for CS 281A: Statistical Learning Theory at UC Berkeley, Aug 2009.

11. S. WATANABE: Algebraic Geometry and Statistical Learning Theory, Cambridge Monographs on Applied and Computational Mathematics 25, Cambridge University Press, Cambridge, 2009.