
  • Learning the Structure of Mixed Graphical Models

    Jason Lee with Trevor Hastie, Michael Saunders, Yuekai Sun, and Jonathan Taylor

    Institute for Computational & Mathematical Engineering, Stanford University

    June 26th, 2014

  • Examples of Graphical Models

    I Pairwise MRF:

      $$p(y) = \frac{1}{Z(\Theta)} \exp\Big( \sum_{(r,j) \in E(G)} \phi_{rj}(y_r, y_j) \Big)$$

    I Multivariate Gaussian distribution (Gaussian MRF):

      $$p(x) = \frac{1}{Z(\Theta)} \exp\Big( -\frac{1}{2} \sum_{s=1}^{p} \sum_{t=1}^{p} \beta_{st} x_s x_t + \sum_{s=1}^{p} \alpha_s x_s \Big)$$
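    The Gaussian MRF density above is an unnormalized exponential-family model. The minimal numpy sketch below evaluates its exponent, i.e. the log-density up to the $\log Z(\Theta)$ term; the names `B` (matrix of $\beta_{st}$) and `alpha` (vector of $\alpha_s$) are illustrative choices, not notation from the slides.

```python
# Minimal sketch: unnormalized log-density of the Gaussian MRF above.
# B plays the role of the matrix [beta_st]; alpha of the vector [alpha_s].
import numpy as np

def gaussian_mrf_log_potential(x, B, alpha):
    """Return -1/2 * sum_st beta_st x_s x_t + sum_s alpha_s x_s (the exponent)."""
    return -0.5 * x @ B @ x + alpha @ x

rng = np.random.default_rng(0)
p = 4
A = rng.standard_normal((p, p))
B = A @ A.T + p * np.eye(p)          # symmetric positive definite precision matrix
alpha = rng.standard_normal(p)
x = rng.standard_normal(p)
print(gaussian_mrf_log_potential(x, B, alpha))
```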

  • Mixed Graphical Model

    I Want a simple joint distribution on p continuous variables and q discrete (categorical) variables.

    I Joint distribution of p Gaussian variables is multivariate Gaussian.

    I Joint distribution of q discrete variables is a pairwise MRF.

    I Conditional distributions can be estimated via (generalized) linear regression.

    I What about the potential term between a continuous variable $x_s$ and a discrete variable $y_j$?

  • Mixed Model - Joint Distribution

    $$p(x, y; \Theta) = \frac{1}{Z(\Theta)} \exp\Big( \sum_{s=1}^{p} \sum_{t=1}^{p} -\frac{1}{2} \beta_{st} x_s x_t + \sum_{s=1}^{p} \alpha_s x_s + \sum_{s=1}^{p} \sum_{j=1}^{q} \rho_{sj}(y_j) x_s + \sum_{j=1}^{q} \sum_{r=1}^{q} \phi_{rj}(y_r, y_j) \Big)$$
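    As a concrete illustration of the joint density above, here is a hedged numpy sketch that evaluates the exponent (the unnormalized log-density). The array layout (`beta`, `alpha`, `rho[s, j, level]`, `phi[r, j, level_r, level_j]`) and the assumption that every discrete variable has the same number of levels `L` are simplifications made for this sketch, not choices from the paper.

```python
# Hedged sketch: the exponent of the mixed model joint density shown above.
import numpy as np

def mixed_log_potential(x, y, beta, alpha, rho, phi):
    p, q = len(x), len(y)
    val = -0.5 * x @ beta @ x + alpha @ x                                    # continuous terms
    val += sum(rho[s, j, y[j]] * x[s] for s in range(p) for j in range(q))   # continuous-discrete
    val += sum(phi[r, j, y[r], y[j]] for r in range(q) for j in range(q))    # discrete-discrete
    return val

rng = np.random.default_rng(0)
p, q, L = 3, 2, 2
A = rng.standard_normal((p, p)); beta = A @ A.T + p * np.eye(p)
alpha = rng.standard_normal(p)
rho = rng.standard_normal((p, q, L))
phi = rng.standard_normal((q, q, L, L))
x = rng.standard_normal(p); y = rng.integers(0, L, size=q)
print(mixed_log_potential(x, y, beta, alpha, rho, phi))
```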

  • Properties of the Mixed Model

    I Pairwise model with 3 types of potentials: discrete-discrete, continuous-discrete, and continuous-continuous. Thus it has O((p + q)^2) parameters.

    I p(x | y) is Gaussian with covariance $\Sigma = B^{-1}$ and mean $\mu = B^{-1}\big(\alpha + \sum_j \rho_{\cdot j}(y_j)\big)$, where $B = [\beta_{st}]$ and $\rho_{\cdot j}(y_j)$ is the vector with entries $\rho_{sj}(y_j)$ (see the sketch after this slide).

    I The conditional distributions of x have the same covariance regardless of the values taken by the discrete variables y. The mean depends additively on the values of the discrete variables y.

    I Special case of Lauritzen's mixed graphical model.
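    A small sketch of the second bullet, reusing the illustrative array layout from the previous sketch: the conditional covariance $B^{-1}$ is computed once and does not depend on y, while the mean shifts additively with the discrete levels.

```python
# Sketch (illustrative array layout): p(x | y) is Gaussian with
# covariance Sigma = B^{-1} and mean mu = B^{-1} (alpha + sum_j rho[:, j, y_j]).
import numpy as np

def conditional_gaussian(y, beta, alpha, rho):
    Sigma = np.linalg.inv(beta)                              # independent of y
    gamma = alpha + sum(rho[:, j, y[j]] for j in range(len(y)))
    mu = Sigma @ gamma                                       # additive in the discrete levels
    return mu, Sigma

rng = np.random.default_rng(0)
p, q, L = 3, 2, 2
A = rng.standard_normal((p, p)); beta = A @ A.T + p * np.eye(p)
alpha = rng.standard_normal(p)
rho = rng.standard_normal((p, q, L))
y = rng.integers(0, L, size=q)
mu, Sigma = conditional_gaussian(y, beta, alpha, rho)
print(mu, Sigma, sep="\n")
```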

  • Related Work

    I Lauritzen proposed the conditional Gaussian model.

    I Fellinghauer et al. (2011) use random forests to fit the conditional distributions. This is tailored for mixed models.

    I Cheng, Levina, and Zhu (2013) generalize to include higher-order edges.

    I Yang et al. (2014) and Shizhe Chen, Witten, and Shojaie (2014) generalize beyond Gaussian and categorical variables.

  • Outline

    Parameter Learning

    Structure Learning

    Experimental Results

  • Pseudolikelihood

    I Log-likelihood: $\ell(\Theta) = \log p(x^i; \Theta)$. Its derivative is $T(x, y) - \mathbb{E}_{p(\Theta)}[T(x, y)]$, where T are the sufficient statistics. This is hard to compute.

    I Log-pseudolikelihood: $\ell_{PL}(\Theta) = \sum_s \log p(x^i_s \mid x^i_{\setminus s}; \Theta)$ (see the sketch after this slide).

    I The pseudolikelihood is an asymptotically consistent approximation to the likelihood, obtained by using the product of the conditional distributions.

    I The partition function cancels out in the conditional distributions, so gradients of the log-pseudolikelihood are cheap to compute.
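    To make the last two bullets concrete, here is a hedged sketch of the log-pseudolikelihood for the Gaussian-only case (no discrete variables), where each conditional $p(x_s \mid x_{\setminus s})$ is an explicit univariate Gaussian and no partition function over x is ever needed. The function name and parameterization are illustrative.

```python
# Hedged sketch: log-pseudolikelihood sum_s log p(x_s | x_{\s}) for a Gaussian MRF
# p(x) ∝ exp(-x'Bx/2 + a'x). Each conditional is N((a_s - sum_{t≠s} B_st x_t)/B_ss, 1/B_ss),
# so every term is available in closed form.
import numpy as np

def gaussian_log_pseudolikelihood(x, B, a):
    val = 0.0
    for s in range(len(x)):
        prec = B[s, s]
        mean = (a[s] - (B[s] @ x - B[s, s] * x[s])) / prec   # regression on the other nodes
        val += 0.5 * np.log(prec / (2 * np.pi)) - 0.5 * prec * (x[s] - mean) ** 2
    return val

rng = np.random.default_rng(0)
p = 4
A = rng.standard_normal((p, p)); B = A @ A.T + p * np.eye(p)
a = rng.standard_normal(p); x = rng.standard_normal(p)
print(gaussian_log_pseudolikelihood(x, B, a))
```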

  • Conditional Distribution of a Discrete Variable

    For a discrete variable $y_r$ with $L_r$ states, its conditional distribution is a multinomial distribution, as used in (multiclass) logistic regression. Whenever a discrete variable is a predictor, each level contributes an additive effect; continuous variables contribute linear effects.

    $$p(y_r \mid y_{\setminus r}, x; \Theta) = \frac{\exp\big( \sum_s \rho_{sr}(y_r) x_s + \phi_{rr}(y_r, y_r) + \sum_{j \neq r} \phi_{rj}(y_r, y_j) \big)}{\sum_{l=1}^{L_r} \exp\big( \sum_s \rho_{sr}(l) x_s + \phi_{rr}(l, l) + \sum_{j \neq r} \phi_{rj}(l, y_j) \big)}$$

    This is just multinomial logistic regression:

    $$p(y_r = k) = \frac{\exp(\theta_k^T z)}{\sum_{l=1}^{L_r} \exp(\theta_l^T z)}$$
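    A minimal sketch of the multinomial-logistic form $p(y_r = k) = \exp(\theta_k^T z) / \sum_l \exp(\theta_l^T z)$ shown above; stacking one coefficient vector per level into `Theta` is an illustrative layout.

```python
# Minimal sketch of the softmax / multinomial logistic probabilities above.
import numpy as np

def class_probabilities(Theta, z):
    """Theta: (L_r, d) array of per-level coefficients; z: length-d predictor vector."""
    scores = Theta @ z
    scores -= scores.max()              # stabilize before exponentiating
    w = np.exp(scores)
    return w / w.sum()

Theta = np.array([[0.5, -1.0], [0.0, 0.2], [-0.3, 0.8]])   # L_r = 3 levels, d = 2 predictors
z = np.array([1.0, 2.0])
print(class_probabilities(Theta, z))    # probabilities summing to 1
```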

  • A continuous variable $x_s$ given all other variables is a Gaussian distribution with a linear regression model for the mean.

    $$p(x_s \mid x_{\setminus s}, y; \Theta) = \sqrt{\frac{\beta_{ss}}{2\pi}} \exp\left( -\frac{\beta_{ss}}{2} \left( \frac{\alpha_s + \sum_j \rho_{sj}(y_j) - \sum_{t \neq s} \beta_{st} x_t}{\beta_{ss}} - x_s \right)^2 \right)$$

    This can be expressed as linear regression

    $$\mathbb{E}(x_s \mid z_1, \ldots, z_p) = \theta^T z = \theta_0 + \sum_j z_j \theta_j \qquad (1)$$

    $$p(x_s \mid z_1, \ldots, z_p) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (x_s - \theta^T z)^2 \right) \quad \text{with } \sigma^2 = 1/\beta_{ss} \qquad (2)$$
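    The following hedged sketch spells out equations (1)-(2) for one coordinate: the conditional mean of $x_s$ is an affine function of the other variables (each discrete variable contributing through $\rho_{sj}(y_j)$) and the variance is $1/\beta_{ss}$. Array layouts are the same illustrative ones used earlier.

```python
# Sketch of equations (1)-(2): x_s | rest is Gaussian with a linear-regression mean
# and variance sigma2 = 1/beta_ss.
import numpy as np

def conditional_of_xs(s, x, y, beta, alpha, rho):
    prec = beta[s, s]
    intercept = (alpha[s] + sum(rho[s, j, y[j]] for j in range(len(y)))) / prec
    others = [t for t in range(len(x)) if t != s]
    mean = intercept - sum(beta[s, t] * x[t] for t in others) / prec   # theta' z in (1)
    sigma2 = 1.0 / prec                                                # as in (2)
    return mean, sigma2

def gaussian_logpdf(v, mean, sigma2):
    return -0.5 * np.log(2 * np.pi * sigma2) - (v - mean) ** 2 / (2 * sigma2)

rng = np.random.default_rng(0)
p, q, L = 3, 2, 2
A = rng.standard_normal((p, p)); beta = A @ A.T + p * np.eye(p)
alpha = rng.standard_normal(p); rho = rng.standard_normal((p, q, L))
x = rng.standard_normal(p); y = rng.integers(0, L, size=q)
mean, sigma2 = conditional_of_xs(0, x, y, beta, alpha, rho)
print(mean, sigma2, gaussian_logpdf(x[0], mean, sigma2))
```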

  • Two more parameter estimation methods

    Neighborhood selection / separate regressions.

    I Each node maximizes its own conditional likelihood $p(x_s \mid x_{\setminus s})$. Intuitively, this should behave similarly to the pseudolikelihood, since the pseudolikelihood jointly minimizes $-\sum_s \log p(x_s \mid x_{\setminus s})$.

    I This has twice the number of parameters as the pseudolikelihood/likelihood because the separate regressions do not enforce symmetry.

    I Easily distributed (see the sketch after this slide).

    Maximum Likelihood

    I Believed to be more statistically efficient.

    I Computationally intractable.
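    A hedged sketch of the separate-regressions idea on an all-continuous toy problem, using scikit-learn's `Lasso` as a stand-in for the penalized (generalized) linear regressions; the mixed model would additionally need multinomial regressions and group penalties. Each node's fit is independent, which is what makes the approach easy to distribute, and the resulting coefficient matrix is not symmetric.

```python
# Hedged sketch of neighborhood selection with separate regressions: each node is
# regressed on all the others with an l1 penalty; the loop parallelizes trivially.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))

coef = np.zeros((p, p))
for s in range(p):                       # one regression per node; embarrassingly parallel
    others = [t for t in range(p) if t != s]
    fit = Lasso(alpha=0.1).fit(X[:, others], X[:, s])
    coef[s, others] = fit.coef_

# The regressions do not enforce symmetry: coef[s, t] and coef[t, s] are estimated
# separately, so they generally differ.
print(np.round(coef, 3))
```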

  • Outline

    Parameter Learning

    Structure Learning

    Experimental Results

  • Sparsity and Conditional Independence

    I Lack of an edge (u, v) means $X_u \perp X_v \mid X_{\setminus \{u,v\}}$ ($X_u$ and $X_v$ are conditionally independent).

    I This means that the corresponding parameter block $\beta_{st}$, $\rho_{sj}$, or $\phi_{rj}$ is 0.

    I Each parameter block is a different size. The continuous-continuous edges are scalars, the continuous-discrete edges are vectors, and the discrete-discrete edges are tables.

  • Structure Learning

    [Figure omitted: estimated structure of a graph over nodes 1-10.]

  • Parameters of the mixed model

    Figure: $\beta_{st}$ shown in red, $\rho_{sj}$ shown in blue, and $\phi_{rj}$ shown in orange. The rectangles correspond to a group of parameters.

  • Regularizer

    $$\min_{\Theta} \ \ell_{PL}(\Theta) + \lambda \Big( \sum_{s,t} w_{st} \|\beta_{st}\| + \sum_{s,j} w_{sj} \|\rho_{sj}\| + \sum_{r,j} w_{rj} \|\phi_{rj}\| \Big)$$

    I Each edge group is of a different size and a different distribution, so we need a different penalty weight for each group.

    I By the KKT conditions, a group is non-zero iff $\|\nabla \ell_g\| > \lambda w_g$. Thus we choose weights $w_g \propto \mathbb{E}_0 \|\nabla \ell_g\|$ (see the prox sketch after this slide).
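    Since the penalty is a weighted sum of group norms, the key computational primitive is its proximal operator, which acts on each edge group by block soft-thresholding. The sketch below assumes a flat parameter vector with groups given as index lists; this layout is illustrative.

```python
# Sketch of the proximal operator of the weighted group penalty above: for each group g,
# prox_{t*lam*w_g*||.||}(v_g) = max(0, 1 - t*lam*w_g/||v_g||) * v_g  (block soft-thresholding).
import numpy as np

def group_prox(v, groups, weights, lam, t=1.0):
    out = v.copy()
    for g, w in zip(groups, weights):
        norm = np.linalg.norm(v[g])
        scale = max(0.0, 1.0 - t * lam * w / norm) if norm > 0 else 0.0
        out[g] = scale * v[g]
    return out

v = np.array([0.3, -0.1, 2.0, 1.5, -0.05])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4])]   # one index set per edge group
weights = [1.0, 1.0, 1.0]
print(group_prox(v, groups, weights, lam=0.5))                  # small groups are zeroed out
```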

  • Optimization Algorithm: Proximal Newton method

    I Split the objective as $g + h$, with $g = \ell_{PL}$ smooth and $h$ the nonsmooth penalty:

      $$\min_{\Theta} \ \ell_{PL}(\Theta) + \lambda \Big( \sum_{s,t} \|\beta_{st}\| + \sum_{s,j} \|\rho_{sj}\| + \sum_{r,j} \|\phi_{rj}\| \Big)$$

    I First-order methods: proximal gradient and accelerated proximal gradient, which have similar convergence properties to their smooth counterparts (sublinear convergence rate, and linear convergence rate under strong convexity).

    I Second-order methods: model the smooth part $g(x)$ with a quadratic model. Proximal gradient corresponds to a linear model of the smooth function $g(x)$.

  • Proximal Newton-like Algorithms

    I Build a quadratic model about the iterate $x_k$ and solve this as a subproblem (a runnable sketch follows after the algorithm):

      $$x^{+} = \arg\min_{u} \ g(x) + \nabla g(x)^T (u - x) + \frac{1}{2t} (u - x)^T H (u - x) + h(u)$$

    Algorithm 1: A generic proximal Newton-type method

    Require: starting point $x_0 \in \operatorname{dom} f$
    1: repeat
    2:   Choose an approximation to the Hessian $H_k$.
    3:   Solve the subproblem for a search direction:
         $\Delta x_k \leftarrow \arg\min_d \ \nabla g(x_k)^T d + \frac{1}{2} d^T H_k d + h(x_k + d)$.
    4:   Select $t_k$ with a backtracking line search.
    5:   Update: $x_{k+1} \leftarrow x_k + t_k \Delta x_k$.
    6: until stopping conditions are satisfied.
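    Below is a hedged, self-contained sketch of Algorithm 1 on a small lasso-style problem ($g(x) = \tfrac{1}{2}\|Ax - b\|^2$, $h(x) = \lambda\|x\|_1$), with the exact Hessian, a fixed number of proximal-gradient iterations for the subproblem, and a simplified backtracking line search. It illustrates the structure of the method, not the PNOPT implementation.

```python
# Hedged sketch of a generic proximal Newton-type method on min_x 0.5||Ax-b||^2 + lam*||x||_1.
import numpy as np

def soft(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def prox_newton_lasso(A, b, lam, outer=10, inner=50):
    n, p = A.shape
    x = np.zeros(p)
    f = lambda v: 0.5 * np.sum((A @ v - b) ** 2) + lam * np.sum(np.abs(v))
    H = A.T @ A                                     # exact Hessian of the smooth part
    L = np.linalg.eigvalsh(H).max()                 # step-size constant for the subproblem
    for _ in range(outer):
        grad = A.T @ (A @ x - b)
        d = np.zeros(p)
        for _ in range(inner):                      # prox-gradient on the quadratic model in d
            v = d - (grad + H @ d) / L
            d = soft(x + v, lam / L) - x
        t = 1.0
        while f(x + t * d) > f(x) and t > 1e-8:     # backtracking line search (simplified)
            t *= 0.5
        x = x + t * d
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10)); b = rng.standard_normal(40)
print(np.round(prox_newton_lasso(A, b, lam=1.0), 3))
```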

  • Why are these proximal?

    Definition (Scaled proximal mappings)

    Let h be a convex function and H a positive definite matrix. Then the scaled proximal mapping of h at x is defined to be

    $$\operatorname{prox}_h^H(x) = \arg\min_{y} \ h(y) + \frac{1}{2} \|y - x\|_H^2 .$$

    The proximal Newton update is

    $$x_{k+1} = \operatorname{prox}_{h}^{H_k}\big( x_k - H_k^{-1} \nabla g(x_k) \big)$$

    and is analogous to the proximal gradient update

    $$x_{k+1} = \operatorname{prox}_{h/L}\Big( x_k - \frac{1}{L} \nabla g(x_k) \Big).$$
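    A small numerical illustration of the analogy above for $h = \lambda\|\cdot\|_1$: when H is diagonal the scaled proximal mapping is coordinate-wise soft-thresholding, and with $H = L I$ the proximal Newton update reproduces the proximal gradient update exactly. The toy quadratic $g$ used here is an assumption made for the example.

```python
# Sketch: for h(x) = lam*||x||_1 and diagonal H, the scaled prox is separable
# soft-thresholding; with H = L*I it matches prox_{h/L}(x - grad/L).
import numpy as np

def soft(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def scaled_prox_l1_diag(x, lam, Hdiag):
    # argmin_y lam*||y||_1 + 0.5*(y-x)' H (y-x), with H diagonal
    return soft(x, lam / Hdiag)

# Toy smooth part g(x) = 0.5*||x - c||^2, so grad g(x) = x - c and the Hessian is I (L = 1).
lam, L = 0.3, 1.0
c = np.array([1.0, -0.2, 0.05])
x = np.zeros(3)
grad = x - c

prox_gradient_step = soft(x - grad / L, lam / L)
prox_newton_step = scaled_prox_l1_diag(x - grad / L, lam, np.full(3, L))
print(prox_gradient_step, prox_newton_step)   # identical, as the slide's analogy suggests
```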

  • A classical idea

    Traces back to:

    I Projected Newton-type methods

    I Cost-approximation methods

    Popular methods tailored to specific problems:

    I glmnet: lasso and elastic-net regularized generalized linear models

    I LIBLINEAR: $\ell_1$-regularized logistic regression

    I QUIC: sparse inverse covariance estimation

  • I Theoretical analysis shows that this converges quadratically with the exact Hessian and superlinearly with BFGS (Lee, Sun, and Saunders 2012).

    I Empirical results on the structure learning problem confirm this. Requires very few derivatives of the log-partition function.

    I If we solve the subproblems with first-order methods, we only require the proximal operator of the nonsmooth $h(u)$. The method is very general.

    I The method allows you to choose how to solve the subproblem, and comes with a stopping criterion that preserves the convergence rate.

    I PNOPT package: www.stanford.edu/group/SOL/software/pnopt

  • Statistical Consistency

    Special case of a more general model selection consistency theorem.

    Theorem (Lee, Sun, and Taylor 2013)

    1. $\|\hat{\Theta} - \Theta^\star\|_F \le C \sqrt{\dfrac{|A| \log |G|}{n}}$

    2. $\hat{\theta}_g = 0$ for $g \in I$.

    $|A|$ is the number of active edges, and $I$ is the set of inactive edges. The main assumption is a generalized irrepresentable condition.

  • Outline

    Parameter Learning

    Structure Learning

    Experimental Results

  • Synthetic Experiment

    [Figure omitted: plot of the probability of recovery (0-100%) on the y-axis against sample size n (0-2000) on the x-axis, with curves for ML and PL.]

    Figure: Blue nodes are continuous variables, red nodes are binary variables, and the orange, green and dark blue lines represent the 3 types of edges. Plot of the probability of correct edge recovery at a given sample size (p + q = 20). Results are averaged over 100 trials.

  • Survey Experiments

    I The survey dataset we consider consists of 11 variables, of which 2 are continuous and 9 are discrete: age (continuous), log-wage (continuous), year (7 states), sex (2 states), marital status (5 states), race (4 states), education level (5 states), geographic region (9 states), job class (2 states), health (2 states), and health insurance (2 states).

    I All the evaluations are done using a holdout test set of size 100,000 for the survey experiments.

    I The regularization parameter is varied over the interval $[5 \times 10^{-5}, 0.7]$ at 50 points equispaced on a log scale for all experiments (see the sketch after this slide).
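    For reference, a regularization path like the one described in the last bullet can be generated as follows (a sketch; the endpoint values are taken from the slide).

```python
# Sketch: 50 regularization values equispaced on a log scale over [5e-5, 0.7].
import numpy as np

lams = np.logspace(np.log10(5e-5), np.log10(0.7), num=50)
print(lams[0], lams[-1], len(lams))   # 5e-05 ... 0.7, 50 points
```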

  • Comparing Against Separate Regressions

    [Figure omitted: panels comparing the "Separate" regressions against the "Joint" model, for the full model and for the individual variables age, logwage, year, sex, and marital.]