
  • SINGULAR LEARNING THEORY

    Part I: Statistical Learning

    Shaowei Lin (Institute for Infocomm Research, Singapore)

    21-25 May 2013, Motivic Invariants and Singularities Thematic Program

    Notre Dame University

  • Overview

    Statistical Learning: Probability, Statistics, Bayesian, Regression

    Algebraic Geometry

    Asymptotic Theory, e.g.

        ∫_{[0,1]²} (1 − x²y²)^{N/2} dx dy
            ≈ √(π/8) N^{−1/2} log N − √(π/8) (1/log 2 − 2 log 2 − γ) N^{−1/2}
              − (1/4) N^{−1} log N + (1/4) (1/log 2 + 1 − γ) N^{−1}
              − (√(2π)/128) N^{−3/2} log N + · · ·

    [Diagram: these areas combine to give Algebraic Statistics and Singular Learning Theory.]
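    To make the quoted expansion concrete, here is a minimal numerical sketch (Python with numpy/scipy, not part of the slides; the choice N = 10000 and the use of only the first few terms are my own) comparing the integral with the asymptotic approximation.

        import numpy as np
        from scipy import integrate

        N = 10_000
        gamma = np.euler_gamma

        # Numerically integrate (1 - x^2 y^2)^(N/2) over the unit square.
        val, _ = integrate.dblquad(lambda y, x: (1 - x**2 * y**2) ** (N / 2),
                                   0, 1, lambda x: 0, lambda x: 1)

        # First few terms of the asymptotic expansion quoted above.
        approx = (np.sqrt(np.pi / 8) * N**-0.5 * np.log(N)
                  - np.sqrt(np.pi / 8) * (1/np.log(2) - 2*np.log(2) - gamma) * N**-0.5
                  - 0.25 * N**-1 * np.log(N)
                  + 0.25 * (1/np.log(2) + 1 - gamma) * N**-1)

        print(val, approx)   # the two values should roughly agree for large N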

  • Overview

    Probability

    Statistics

    Bayesian

    Regression

    Part I: Statistical Learning

    Part II: Real Log Canonical Thresholds

    Part III: Singularities in Graphical Models

  • Probability

    Probability

    • Random Variables

    • Discrete·Continuous

    • Gaussian

    • Basic Concepts

    • Independence

    Statistics

    Bayesian

    Regression

  • Random Variables


    A probability space (ξ, F, P) consists of

    • a sample space ξ, which is the set of all possible outcomes,
    • a collection♯ F of events, which are subsets of ξ,
    • an assignment♭ P : F → [0, 1] of probabilities to events.

    A (real-valued) random variable X : ξ → R^k is

    • a function♮ from the sample space to a real vector space,
    • a measurement of the possible outcomes.
    • X ∼ P means "X has the distribution given by P".

    ♯ σ-algebra: closed under complement and countable union, and contains ∅.
    ♭ probability measure: P(∅) = 0, P(ξ) = 1, countable additivity for disjoint events.
    ♮ measurable function: for all x ∈ R^k, the preimage of {y ∈ R^k : y ≤ x} is in F.

    Example. Rolling a fair die.

    ξ = {⚀, ⚁, ⚂, ⚃, ⚄, ⚅},  F = {∅, {⚀}, {⚀, ⚁}, . . .},
    X ∈ {1, 2, 3, 4, 5, 6},  P(X=1) = 1/6,  P(X≤3) = 1/2.

  • Discrete and Continuous Random Variables


    If ξ is finite, we say X is a discrete random variable.

    • probability mass function p(x) = P(X = x), x ∈ R^k.

    If ξ is infinite, we define the

    • cumulative distribution function (CDF) F(x)

          F(x) = P(X ≤ x),  x ∈ R^k.

    • probability density function♯ (PDF) p(y)

          F(x) = ∫_{y ∈ R^k : y ≤ x} p(y) dy,  x ∈ R^k.

    If the PDF exists, then X is a continuous♭ random variable.

    ♯ Radon-Nikodym derivative of F(x) with respect to the Lebesgue measure on R^k.
    ♭ We can also define PDFs for discrete variables if we allow the Dirac delta function.

    The probability mass/density function is often informally referred to
    as the distribution of X.

  • Gaussian Random Variables


    Example. Multivariate Gaussian distribution X ∼ N(µ, Σ).

    X ∈ R^k, mean µ ∈ R^k, covariance Σ ∈ R^{k×k} positive definite,

        p(x) = (2π)^{−k/2} (det Σ)^{−1/2} exp( −(1/2) (x − µ)ᵀ Σ^{−1} (x − µ) ).

    [Figure: PDFs of univariate Gaussian distributions X ∼ N(µ, σ²).]
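    As a sanity check on the density formula, here is a minimal Python sketch (numpy/scipy and the example numbers are mine, not from the slides) that evaluates it directly and compares against scipy.

        import numpy as np
        from scipy.stats import multivariate_normal

        def gaussian_pdf(x, mu, Sigma):
            """Density of N(mu, Sigma) at x, following the formula above."""
            k = len(mu)
            diff = x - mu
            quad = diff @ np.linalg.solve(Sigma, diff)        # (x-mu)^T Sigma^{-1} (x-mu)
            norm = (2 * np.pi) ** (-k / 2) * np.linalg.det(Sigma) ** (-0.5)
            return norm * np.exp(-0.5 * quad)

        mu = np.array([0.0, 1.0])
        Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
        x = np.array([0.5, 0.5])
        print(gaussian_pdf(x, mu, Sigma))                      # hand-rolled density
        print(multivariate_normal(mu, Sigma).pdf(x))           # scipy gives the same value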

  • Basic Concepts


    The expectation E[X] is the integral of the random variable X
    with respect to its probability measure, i.e. the "average".

    Discrete variables:    E[X] = Σ_{x ∈ R^k} x P(X = x)
    Continuous variables:  E[X] = ∫_{R^k} x p(x) dx

    The variance E[(X − E[X])²] measures the "spread" of X.

    The conditional probability♯ P(A|B) of two events A, B ∈ F is
    the probability that A will occur given that we know B has occurred.

    • If P(B) > 0, then P(A|B) = P(A ∩ B)/P(B).

    ♯ The formal definition depends on the notion of conditional expectation.

    Example. Weather forecast.

                    Rain   No Rain
    Thunder          0.2     0.1
    No Thunder       0.3     0.4

    P(Rain|Thunder) = 0.2/(0.1 + 0.2) = 2/3
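    A small Python sketch of the same conditional-probability computation (the dictionary encoding of the table is mine):

        # Joint distribution from the weather table above.
        joint = {
            ("thunder", "rain"): 0.2, ("thunder", "no rain"): 0.1,
            ("no thunder", "rain"): 0.3, ("no thunder", "no rain"): 0.4,
        }

        p_thunder = sum(p for (t, r), p in joint.items() if t == "thunder")
        p_rain_and_thunder = joint[("thunder", "rain")]
        print(p_rain_and_thunder / p_thunder)   # P(Rain | Thunder) = 0.2 / 0.3 ≈ 0.667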

  • Independence


    Let X ∈ R^k, Y ∈ R^l, Z ∈ R^m be random variables.

    X, Y are independent (X ⊥⊥ Y) if

    • P(X ∈ S, Y ∈ T) = P(X ∈ S) P(Y ∈ T)
      for all measurable subsets S ⊂ R^k, T ⊂ R^l;
    • i.e. "knowing X gives no information about Y".

    X, Y are conditionally independent given Z (X ⊥⊥ Y | Z) if

    • P(X ∈ S, Y ∈ T | Z = z) = P(X ∈ S | Z = z) P(Y ∈ T | Z = z)
      for all z ∈ R^m and measurable subsets S ⊂ R^k, T ⊂ R^l;
    • i.e. "any dependence between X and Y is due to Z".

    Example. Hidden variables.

    Favorite color X ∈ {red, blue}, favorite food Y ∈ {salad, steak}.
    If X, Y are dependent, one may ask if there is a hidden variable,
    e.g. gender Z ∈ {female, male}, such that X ⊥⊥ Y | Z.

  • Statistics

    Probability

    Statistics

    • Statistical Model

    • Maximum Likelihood

    • Kullback-Leibler

    • Mixture Models

    Bayesian

    Regression

  • Statistical Model


    Let ∆ denote the space of distributions with outcomes ξ.

    Model: a family M of probability distributions, i.e. a subset of ∆.

    Parametric model: a family M of distributions p(·|ω) indexed
    by parameters ω in a space Ω, i.e. we have a map Ω → ∆.

    Example. Biased coin tosses.

    Number of heads in two tosses of a coin: H ∈ ξ = {0, 1, 2}.
    Space of distributions:

        ∆ = {p ∈ R³_{≥0} : p(0) + p(1) + p(2) = 1}

    Probability of getting heads: ω ∈ Ω = [0, 1] ⊂ R.
    Parametric model for H:

        p(0|ω) = (1 − ω)²
        p(1|ω) = 2ω(1 − ω)
        p(2|ω) = ω²

    Implicit equation: 4 p(0) p(2) − p(1)² = 0.
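    A minimal sketch checking that every point of this parametric model lies in the simplex ∆ and satisfies the implicit equation (the grid of test parameters is my choice):

        import numpy as np

        def coin_model(w):
            """Map the parameter ω ∈ [0, 1] to the distribution (p(0), p(1), p(2))."""
            return np.array([(1 - w) ** 2, 2 * w * (1 - w), w ** 2])

        for w in np.linspace(0, 1, 5):
            p0, p1, p2 = coin_model(w)
            assert abs(p0 + p1 + p2 - 1) < 1e-12          # lies in the simplex ∆
            assert abs(4 * p0 * p2 - p1 ** 2) < 1e-12     # satisfies 4 p(0)p(2) − p(1)² = 0
        print("all tested points of the model satisfy the implicit equation")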

  • Maximum Likelihood


    A sample X1, . . . , XN of X is a set of independent, identically
    distributed (i.i.d.) random variables with the same distribution as X.

    Goal: given a statistical model {p(·|ω) : ω ∈ Ω} and a sample,
    find a distribution p(·|ω̂) that best describes the sample.

    A statistic f(X1, . . . , XN) is a function of the sample.
    An important statistic is the maximum likelihood estimate (MLE).
    It is a parameter ω̂ that maximizes the likelihood function

        L(ω) = ∏_{i=1}^{N} p(Xi|ω).

    Example. Biased coin tosses.

    Suppose the table below summarizes a sample of H of size 100.

        H       0    1    2
        Count   25   45   30

    Then L(ω) = 2^45 ω^105 (1 − ω)^95 and ω̂ = 105/200.
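    A short numerical sketch that recovers the same MLE by maximizing the likelihood (scipy.optimize and the use of the log-likelihood are my choices, not from the slides):

        import numpy as np
        from scipy.optimize import minimize_scalar

        # Counts of H = 0, 1, 2 from the sample of size 100 above.
        counts = {0: 25, 1: 45, 2: 30}

        def neg_log_likelihood(w):
            p = [(1 - w) ** 2, 2 * w * (1 - w), w ** 2]
            return -sum(n * np.log(p[h]) for h, n in counts.items())

        res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
        print(res.x, 105 / 200)   # numerical MLE ≈ 0.525 = 105/200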

  • Kullback-Leibler Divergence


    Let X1, . . . , XN be a sample of a discrete variable X.
    The empirical distribution is the function

        q̂(x) = (1/N) Σ_{i=1}^{N} δ(x − Xi)

    where δ(·) is the Kronecker delta function.

    The Kullback-Leibler divergence of a distribution p from q is

        K(q||p) = Σ_{x ∈ R^k} q(x) log( q(x)/p(x) ).

    Proposition. ML distributions minimize the KL divergence
    of q(·) = p(·|ω) ∈ M from the empirical distribution q̂(·):

        K(q̂||q) = Σ_{x ∈ R^k} q̂(x) log q̂(x) − (1/N) log L(ω),

    where the first sum (the entropy term of q̂) does not depend on ω.
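    A small sketch of the proposition for the coin model: minimizing K(q̂ || p(·|ω)) over a grid of ω recovers the MLE from the previous slide (the grid search is my choice).

        import numpy as np

        counts = np.array([25, 45, 30])          # sample of H from the earlier slide
        q_hat = counts / counts.sum()            # empirical distribution

        def model(w):
            return np.array([(1 - w) ** 2, 2 * w * (1 - w), w ** 2])

        def kl(q, p):
            return np.sum(q * np.log(q / p))

        ws = np.linspace(0.01, 0.99, 981)
        divs = [kl(q_hat, model(w)) for w in ws]
        print(ws[np.argmin(divs)])               # ≈ 0.525, the MLE from before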

  • Kullback-Leibler Divergence


    Let X1, . . . , XN be a sample of a continuous variable X.
    The empirical distribution is the generalized function

        q̂(x) = (1/N) Σ_{i=1}^{N} δ(x − Xi)

    where δ(·) is the Dirac delta function.

    The Kullback-Leibler divergence of a distribution p from q is

        K(q||p) = ∫_{R^k} q(x) log( q(x)/p(x) ) dx.

    Proposition. ML distributions minimize the KL divergence
    of q(·) = p(·|ω) ∈ M from the empirical distribution q̂(·):

        K(q̂||q) = ∫_{R^k} q̂(x) log q̂(x) dx − (1/N) log L(ω),

    where the first integral (the entropy term of q̂) does not depend on ω.


  • Kullback-Leibler Divergence


    Example. Population mean and variance.

    Let X ∼ N(µ, σ²) be the height of a random Singaporean.
    Given a sample X1, . . . , XN, estimate the mean µ and variance σ².

    Now, the Kullback-Leibler divergence is

        K(q̂||q) = (1/(2σ²N)) Σ_{i=1}^{N} (Xi − µ)² + log σ + constant.

    Differentiating this function gives us the MLE

        µ̂ = (1/N) Σ_{i=1}^{N} Xi,     σ̂² = (1/N) Σ_{i=1}^{N} (Xi − µ̂)².

    The MLE for the model mean is the sample mean.
    The MLE for the model variance is the sample variance.
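    A quick sketch confirming this on synthetic data: a numerical optimizer applied to the negative log-likelihood lands on the sample mean and (1/N) sample variance. The data, optimizer and starting point are my own choices; only the formulas come from the slide.

        import numpy as np
        from scipy.optimize import minimize

        rng = np.random.default_rng(0)
        X = rng.normal(loc=170.0, scale=7.0, size=500)      # synthetic "heights" (made-up numbers)

        def neg_log_likelihood(params):                     # −log L(µ, σ), up to a constant
            mu, log_sigma = params
            return 0.5 * np.sum((X - mu) ** 2) / np.exp(2 * log_sigma) + X.size * log_sigma

        res = minimize(neg_log_likelihood, x0=[150.0, 2.0]) # optimize over (µ, log σ)
        mu_hat, sigma2_hat = res.x[0], np.exp(2 * res.x[1])

        print(mu_hat, X.mean())                             # MLE mean = sample mean
        print(sigma2_hat, X.var())                          # MLE variance = (1/N) sample variance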

  • Mixture Models


    A mixture of distributions p1(·), . . . , pm(·) is a convex combination

        p(x) = Σ_{i=1}^{m} αi pi(x),   x ∈ R^k,

    i.e. the mixing coefficients αi are nonnegative and sum to one.

    Example. Gaussian mixtures.

    Mixing univariate Gaussians N(µi, σi²), i = 1, . . . , m, produces
    distributions of the form

        p(x) = Σ_{i=1}^{m} ( αi / √(2πσi²) ) exp( −(x − µi)² / (2σi²) ).

    This mixture model is therefore described by the parameters

        ω = (α1, . . . , αm, µ1, . . . , µm, σ1, . . . , σm)

    and is frequently used in cluster analysis.
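    A minimal sketch of a two-component Gaussian mixture (the weights, means and variances below are made up for illustration): evaluate the density formula and cross-check it against samples drawn component-by-component.

        import numpy as np

        alphas = np.array([0.3, 0.7])
        mus    = np.array([-2.0, 1.0])
        sigmas = np.array([0.5, 1.5])

        def mixture_pdf(x):
            comps = alphas / np.sqrt(2 * np.pi * sigmas ** 2) \
                    * np.exp(-(x - mus) ** 2 / (2 * sigmas ** 2))
            return comps.sum()

        def sample(n, rng=np.random.default_rng(0)):
            which = rng.choice(len(alphas), size=n, p=alphas)   # pick a component
            return rng.normal(mus[which], sigmas[which])        # then draw from it

        xs = sample(100_000)
        print(mixture_pdf(0.0))                                  # density at 0
        print(np.mean(np.abs(xs) < 0.05) / 0.1)                  # ≈ same value, estimated from samples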

  • Bayesian Statistics

    Probability

    Statistics

    Bayesian

    • Interpretations

    • Bayes’ Rule

    • Parameter Estimation

    • Model Selection

    • Estimating Integrals

    • Information Criterion

    • Singularities

    Regression

  • Interpretations of Probability Theory


  • Bayes’ Rule


    Updating our belief in an event A based on an observation B:

        P(A|B) = P(B|A) P(A) / P(B),

    where P(A|B) is the posterior (new belief) and P(A) is the prior (old belief).

    Example. Biased coin toss.

    Let θ denote P(heads) of a coin.
    Determine if the coin is fair (θ = 1/2) or biased (θ = 3/4).

    Old belief: P(fair) = 0.9.
    Now, suppose we observed a sample with 8 heads and 2 tails.
    New belief:

        P(fair|sample) = P(sample|fair) P(fair) / P(sample)
                       = (1/2)^8 (1/2)^2 (0.9) / [ (1/2)^8 (1/2)^2 (0.9) + (3/4)^8 (1/4)^2 (0.1) ]
                       ≈ 0.584
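    The same update in a few lines of Python (only the numbers from the slide are used):

        # Posterior probability that the coin is fair, after seeing 8 heads and 2 tails.
        prior_fair = 0.9
        lik_fair   = (1 / 2) ** 8 * (1 / 2) ** 2
        lik_biased = (3 / 4) ** 8 * (1 / 4) ** 2

        posterior_fair = (lik_fair * prior_fair /
                          (lik_fair * prior_fair + lik_biased * (1 - prior_fair)))
        print(posterior_fair)   # ≈ 0.584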

  • Parameter Estimation


    Let D = {X1, . . . , XN} denote a sample of X, i.e. "the data".

    Frequentists: compute the maximum likelihood estimate.
    Bayesians: treat the parameters ω ∈ Ω as random variables
    with priors p(ω) that require updating.

    Posterior distribution on parameters ω ∈ Ω:

        p(ω|D) = p(D|ω) p(ω) / ∫_Ω p(D|ω) p(ω) dω

    Posterior mean:   ∫_Ω ω p(ω|D) dω
    Posterior mode:   argmax_{ω ∈ Ω} p(ω|D)

    Predictive distribution on outcomes x ∈ ξ:

        p(x|D) = ∫_Ω p(x|ω) p(ω|D) dω

    These integrals are difficult to compute or even estimate!
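    For the one-parameter coin model these integrals can be approximated on a grid. The sketch below assumes a uniform prior on ω (my choice, not from the slides) and reuses the sample from the maximum-likelihood slide.

        import numpy as np

        counts = {0: 25, 1: 45, 2: 30}                     # the data D
        ws = np.linspace(0.001, 0.999, 999)
        prior = np.ones_like(ws) / len(ws)                 # uniform prior on the grid

        def likelihood(w):
            p = [(1 - w) ** 2, 2 * w * (1 - w), w ** 2]
            return np.prod([p[h] ** n for h, n in counts.items()])

        post = np.array([likelihood(w) for w in ws]) * prior
        post /= post.sum()                                  # normalized p(ω|D)

        print(np.sum(ws * post))                            # posterior mean
        print(ws[np.argmax(post)])                          # posterior mode ≈ MLE 0.525
        print(np.sum(ws ** 2 * post))                       # predictive p(H = 2 | D)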

  • Model Selection


    Which of the models M1, . . . , Mm best describes the sample D?

    Frequentists: pick the model containing the ML distribution.
    Bayesians: assign priors p(Mi), p(ω|Mi) and compute

        P(Mi|D) = P(D|Mi) P(Mi) / P(D)  ∝  P(D|Mi) P(Mi),

    where P(D|Mi) is the likelihood integral

        P(D|Mi) = ∫_Ω ∏_{j=1}^{N} p(Xj | ω, Mi) p(ω|Mi) dω.

    Parameter estimation is a form of model selection!
    For each ω ∈ Ω, we define a model Mω with one distribution.
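    A sketch of Bayesian model selection for the coin data. The two candidate models and the uniform prior on ω are my own choices for illustration: M1 is the full one-parameter coin model, M2 fixes ω = 1/2.

        import numpy as np

        counts = {0: 25, 1: 45, 2: 30}                     # data D from the earlier slides

        def likelihood(w):
            p = [(1 - w) ** 2, 2 * w * (1 - w), w ** 2]
            return np.prod([p[h] ** n for h, n in counts.items()])

        # M1: one-parameter coin model with a uniform prior on ω ∈ [0, 1].
        ws = np.linspace(0.0005, 0.9995, 1000)
        P_D_given_M1 = np.mean([likelihood(w) for w in ws])   # grid approximation of ∫ L(ω) dω

        # M2: a zero-parameter model that fixes ω = 1/2 ("fair coin").
        P_D_given_M2 = likelihood(0.5)

        print(P_D_given_M1, P_D_given_M2)
        # With equal model priors, the posterior odds are P(D|M1) : P(D|M2).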

  • Estimating Integrals


    Generally, there are three ways to estimate statistical integrals.

    1. Exact methods
       Compute a closed-form formula for the integral, e.g.
       Baldoni·Berline·De Loera·Köppe·Vergne, 2010; Lin·Sturmfels·Xu, 2009.

    2. Numerical methods
       Approximate using Markov chain Monte Carlo (MCMC)
       and other sampling techniques.

    3. Asymptotic methods
       Analyze how the integral behaves for large samples.
       Rewrite the likelihood integral as

           Z_N = ∫_Ω e^{−N f(ω)} ϕ(ω) dω

       where f(ω) = −(1/N) log L(ω) and ϕ(ω) is the prior on Ω.

  • Bayesian Information Criterion


    Laplace approximation: if f(ω) is uniquely minimized at the MLE ω̂
    and the Hessian ∂²f(ω̂) is full rank, then asymptotically

        − log Z_N ≈ N f(ω̂) + (dim Ω / 2) log N + O(1)

    as the sample size N → ∞.

    Bayesian information criterion (BIC): select the model that minimizes

        N f(ω̂) + (dim Ω / 2) log N.

    [Figure: graphs of e^{−N f(ω)} for N = 1 and N = 10; the integral is the volume under the graph.]
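    A numerical sketch of the Laplace/BIC approximation for the one-parameter coin model. The uniform prior ϕ(ω) = 1 and the grid approximation of Z_N are my own choices; the comparison − log Z_N versus N f(ω̂) + (1/2) log N is the statement above.

        import numpy as np
        from scipy.special import logsumexp

        q0 = np.array([0.25, 0.45, 0.30])          # empirical distribution of H

        def f(w):                                  # f(ω) = −Σ_h q0(h) log p(h|ω)
            p = np.array([(1 - w) ** 2, 2 * w * (1 - w), w ** 2])
            return -np.sum(q0 * np.log(p))

        ws = np.linspace(1e-4, 1 - 1e-4, 20001)
        fs = np.array([f(w) for w in ws])
        w_hat = ws[np.argmin(fs)]                  # MLE

        for N in [100, 1000, 10000]:
            log_Z = logsumexp(-N * fs) + np.log(ws[1] - ws[0])   # grid estimate of log ∫ e^{−N f(ω)} dω
            bic = N * f(w_hat) + 0.5 * np.log(N)                 # dim Ω = 1
            print(N, -log_Z, bic)                  # the difference stays bounded (O(1)) as N grows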

  • Singularities in Statistical Models


    Informally, the model is singular at ω0 ∈ Ω if the Laplace
    approximation fails when the empirical distribution is p(·|ω0).

    Formally, define the Kullback-Leibler function

        K(ω) = ∫_ξ p(x|ω0) log( p(x|ω0) / p(x|ω) ) dx.

    Then ω0 is a singularity when the Hessian ∂²K(ω0) is not full rank.

    Statistical models with hidden variables, e.g. mixture models,
    often contain many singularities.
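    A small symbolic sketch of the definition. The model below is my own standard illustration, not one from the slides: X ∼ N(ab, 1) with parameters ω = (a, b) and true parameter ω0 = (0, 0). For unit-variance Gaussians, K(ω) = (mean(ω) − mean(ω0))²/2 = (ab)²/2, and its Hessian at ω0 is rank deficient.

        import sympy as sp

        # Illustrative singular model (assumption): X ~ N(a*b, 1), true parameter (a, b) = (0, 0).
        a, b = sp.symbols("a b", real=True)
        K = (a * b) ** 2 / 2                   # Kullback-Leibler function K(a, b)

        H = sp.hessian(K, (a, b))
        H0 = H.subs({a: 0, b: 0})
        print(H0)                              # Matrix([[0, 0], [0, 0]])
        print(H0.rank())                       # 0 < 2 = dim Ω, so ω0 = (0, 0) is a singularity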

  • Linear Regression

    Probability

    Statistics

    Bayesian

    Regression

    • Least Squares

    • Sparsity Penalty

    • Paradigms

  • Least Squares


    Suppose we have random variables Y ∈ R, X ∈ R^d that satisfy

        Y = ω · X + ε.

    Parameters ω ∈ R^d; noise ε ∼ N(0, 1); data (Yi, Xi), i = 1, . . . , N.

    • Commonly computed quantities
          MLE              argmin_ω Σ_{i=1}^{N} |Yi − ω · Xi|²
          Penalized MLE    argmin_ω Σ_{i=1}^{N} |Yi − ω · Xi|² + π(ω)

    • Commonly used penalties
          LASSO                              π(ω) = |ω|₁ · β
          Bayesian Info Criterion (BIC)      π(ω) = |ω|₀ · log N
          Akaike Info Criterion (AIC)        π(ω) = |ω|₀ · 2

    • Commonly asked questions
          Model selection (e.g. which factors are important?)
          Parameter estimation (e.g. how important are the factors?)
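    A sketch of both questions on synthetic data (the data-generating coefficients and the list of candidate supports are made up): ordinary least squares gives the MLE, and the BIC-style penalty Σ|Yi − ω·Xi|² + |ω|₀ log N selects a subset of factors.

        import numpy as np

        rng = np.random.default_rng(0)
        N, d = 200, 3
        X = rng.normal(size=(N, d))
        omega_true = np.array([2.0, 0.0, -1.0])          # the second factor is irrelevant
        Y = X @ omega_true + rng.normal(size=N)          # Y = ω · X + ε,  ε ~ N(0, 1)

        # MLE = ordinary least squares
        omega_mle, *_ = np.linalg.lstsq(X, Y, rcond=None)
        print(omega_mle)

        # BIC-penalized model selection over candidate supports: RSS + |ω|_0 · log N
        def bic_score(support):
            Xs = X[:, support]
            w, *_ = np.linalg.lstsq(Xs, Y, rcond=None)
            rss = np.sum((Y - Xs @ w) ** 2)
            return rss + len(support) * np.log(N)

        subsets = [[0], [2], [0, 2], [0, 1, 2]]
        print(min(subsets, key=bic_score))               # typically selects [0, 2]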

  • Sparsity Penalty


    The best model is usually selected using a score

        argmin_{u ∈ Ω} l(u) + π(u),

    where the likelihood term l(u) measures the fitting error of the model
    while the penalty π(u) measures its complexity.

    Recently, sparse penalties derived from statistical considerations
    were found to be highly effective.

        Bayesian info criterion (BIC)  ↔  marginal likelihood integral
        Akaike info criterion (AIC)    ↔  Kullback-Leibler divergence
        Compressive sensing            ↔  ℓ1-regularization of BIC

    Singular learning theory plays an important role in these derivations.

  • Paradigms in Statistical Learning


    • Probability: the theory of random phenomena.
      Statistics: the theory of making sense of data.
      Learning: the art of prediction using data.

    • Frequentists: "true distribution, maximum likelihood".
      Bayesians: "belief updates, maximum a posteriori".

      Interpretations do not affect the correctness of probability theory,
      but they greatly affect the statistical methodology.

    • We often have to balance the complexity of the model
      with its fitting error via some suitable probabilistic criteria.
      The learning algorithm also needs to be computable to be useful.


  • Thank you!

    "Algebraic Methods for Evaluating Integrals in Bayesian Statistics"
    (PhD dissertation, May 2011)

    http://math.berkeley.edu/~shaowei/swthesis.pdf

  • References


    1. V. I. ARNOL'D, S. M. GUSEĬN-ZADE AND A. N. VARCHENKO: Singularities of Differentiable Maps, Vol. II, Birkhäuser, Boston, 1985.
    2. V. BALDONI, N. BERLINE, J. A. DE LOERA, M. KÖPPE AND M. VERGNE: How to integrate a polynomial over a simplex, Mathematics of Computation 80 (2010) 297–325.
    3. A. BRAVO, S. ENCINAS AND O. VILLAMAYOR: A simplified proof of desingularisation and applications, Rev. Mat. Iberoamericana 21 (2005) 349–458.
    4. D. A. COX, J. B. LITTLE AND D. O'SHEA: Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra, Springer-Verlag, New York, 1997.
    5. R. DURRETT: Probability: Theory and Examples (4th edition), Cambridge University Press, 2010.
    6. M. EVANS, Z. GILULA AND I. GUTTMAN: Latent class analysis of two-way contingency tables by Bayesian methods, Biometrika 76 (1989) 557–563.
    7. H. HIRONAKA: Resolution of singularities of an algebraic variety over a field of characteristic zero I, II, Ann. of Math. (2) 79 (1964) 109–203.
    8. S. LIN, B. STURMFELS AND Z. XU: Marginal likelihood integrals for mixtures of independence models, J. Mach. Learn. Res. 10 (2009) 1611–1631.
    9. S. LIN: Algebraic methods for evaluating integrals in Bayesian statistics, PhD dissertation, Dept. of Mathematics, UC Berkeley (2011).
    10. L. MACKEY: Fundamentals - Probability and Statistics, slides for CS 281A: Statistical Learning Theory at UC Berkeley, Aug 2009.
    11. S. WATANABE: Algebraic Geometry and Statistical Learning Theory, Cambridge Monographs on Applied and Computational Mathematics 25, Cambridge University Press, Cambridge, 2009.
