
  • Gaussian variational approximation with structured covariance matrices

    David Nott

    Department of Statistics and Applied Probability National University of Singapore

    Collaborators: Linda Tan, Victor Ong, Michael Smith, Matias Quiroz, Robert Kohn


  • Bayesian computation as usual

    Data y to be observed, unknowns θ. Construct a model for (y , θ):

    p(y, θ) = p(θ) p(y ∣ θ),  where p(θ) is the prior and p(y ∣ θ) is the likelihood.

    Condition on the observed y to obtain the posterior:

    p(θ ∣ y) ∝ p(θ) p(y ∣ θ).

    Summarization of the posterior is done using algorithms like MCMC, sequential Monte Carlo, etc.

    These usual Monte Carlo algorithms are exact in principle.


  • Variational approximations

    Increasingly there is interest in algorithms that do not possess this exact-in-principle property.

    Why approximate inference?

    Make use of the scalability of optimization-based approaches to computation.

    Approximate inference methods are enough to understand why certain models are unsuitable.

    Approximate inference methods may perform as well as exact methods for predictive inference.

    This talk concerns a popular framework for approximate inference, variational approximation.


  • Variational inference basics (Blei et al., 2017; Ormerod and Wand, 2012)

    Variational approximation reformulates Bayesian computation as an optimization problem.

    Define an approximating family with some parameters (Gaussian, for example).

    Define some measure of "closeness" of an approximation to the true posterior.

    Optimize that measure over the approximating family.

    Variational parameters to be optimized will be denoted λ.


  • Gaussian variational approximation, toy example

    [Three consecutive slides of figures illustrating the toy example; the plots are not recoverable from this transcript.]

  • Gaussian variational approximation

    In this talk we consider a multivariate normal approximation denoted qλ(θ) = N(µ,Σ) with λ = (µ,Σ) to be optimized.

    The best normal approximation will be in the sense of minimizing Kullback-Leibler (KL) divergence,

    KL(qλ(θ) ∣∣ p(θ ∣ y)) = log p(y) − ∫ log { p(θ) p(y ∣ θ) / qλ(θ) } qλ(θ) dθ

    = log p(y) − L(λ),

    where p(y) = ∫ p(θ) p(y ∣ θ) dθ and L(λ) is called the variational lower bound.
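    To see why this identity holds, here is the standard derivation, spelled out in LaTeX notation:

    \begin{align*}
    \mathrm{KL}(q_\lambda(\theta) \,\|\, p(\theta \mid y))
      &= \int q_\lambda(\theta) \log \frac{q_\lambda(\theta)}{p(\theta \mid y)} \, d\theta
       = \int q_\lambda(\theta) \log \frac{q_\lambda(\theta)\, p(y)}{p(\theta)\, p(y \mid \theta)} \, d\theta \\
      &= \log p(y) - \underbrace{\int q_\lambda(\theta) \log \frac{p(\theta)\, p(y \mid \theta)}{q_\lambda(\theta)} \, d\theta}_{\mathcal{L}(\lambda)} .
    \end{align*}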

    Minimizing the KL divergence is equivalent to maximizing L(λ).


  • Gaussian variational approximation

    The optimization of L(λ) is challenging for a normal family when the dimension d of θ is large.

    Key difficulty: With no restriction on Σ in qλ(θ) = N(µ,Σ), there are d + d(d + 1)/2 parameters to optimize.
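    To get a feel for the numbers, a quick Python illustration (my own, not from the talk) of how d + d(d + 1)/2 grows:

    # Number of variational parameters for an unrestricted N(mu, Sigma) approximation:
    # d for the mean plus d(d+1)/2 for a dense covariance (or Cholesky) parametrization.
    for d in (10, 100, 1000, 10000):
        n_params = d + d * (d + 1) // 2
        print(d, n_params)   # e.g. d = 1000 already gives 501500 parameters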

    For high-dimensional θ we need reduced parametrizations for Σ:

    Exploiting conditional independence structure to motivate sparsity in Σ−1 (Tan and Nott, 2017).

    Factor models.


  • How should we optimize? Stochastic gradient ascent (Robbins and Monro, 1951)

    We will use stochastic gradient ascent methods for optimizing the lower bound.

    Suppose that ∇λL(λ) is the gradient of L(λ) and that ∇̂λL(λ) is an unbiased estimate of it.

    Stochastic gradient ascent

    Initialize λ(0)

    for t = 0,1, . . . and until some stopping rule is satisfied

    λ(t+1) = λ(t) + ρt ∇̂λL(λ(t))

    Typically the learning rate sequence ρt, t ≥ 0 satisfies ∑t ρt = ∞ and ∑t ρ²t < ∞ (the Robbins-Monro conditions).
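    A minimal Python sketch of the recursion above, assuming a user-supplied unbiased gradient estimator grad_estimate (hypothetical); the step size schedule shown is one common choice satisfying the conditions above:

    import numpy as np

    def stochastic_gradient_ascent(grad_estimate, lam0, n_iter=10000, a=1.0, b=100.0):
        """Maximize L(lambda) using unbiased but noisy gradient estimates."""
        lam = np.asarray(lam0, dtype=float)
        for t in range(n_iter):
            rho_t = a / (b + t)                      # sum rho_t = inf, sum rho_t^2 < inf
            lam = lam + rho_t * grad_estimate(lam)   # noisy ascent step
        return lam

    In practice, adaptive step size rules are often used in place of a fixed schedule like this one.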

  • Reparametrization gradients (Kingma and Welling, 2013; Rezende et al., 2014)

    Low-variance gradient estimates are crucial for the stability and fast convergence of stochastic optimization; here they are obtained using the reparametrization trick.

    Writing h(θ) = p(θ)p(y ∣θ), the lower bound is

    L(λ) = ∫ {log h(θ) − log qλ(θ)}qλ(θ) dθ.

    Suppose that for the variational family qλ(θ) we can write θ ∼ qλ(θ) as θ = t(z, λ) where z ∼ f (z) and f (⋅) does not depend on λ. Then

    L(λ) = Ef [ log h(t(z, λ)) − log qλ(t(z, λ)) ].

    Differentiating under the integral sign, ∇λL(λ) is an expectation with respect to f (⋅): simulation from f (⋅) gives unbiased estimates.
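    Written out (a standard form of this estimator; the single-sample version is the S = 1 case):

    \nabla_\lambda \mathcal{L}(\lambda)
      = E_f\big[\nabla_\lambda\{\log h(t(z,\lambda)) - \log q_\lambda(t(z,\lambda))\}\big],
    \qquad
    \widehat{\nabla_\lambda \mathcal{L}}(\lambda)
      = \frac{1}{S}\sum_{s=1}^{S} \nabla_\lambda\{\log h(t(z^{(s)},\lambda)) - \log q_\lambda(t(z^{(s)},\lambda))\},
    \quad z^{(s)} \overset{iid}{\sim} f.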


  • Reparametrization gradients for the Gaussian family (Titsias and Lázaro-Gredilla, 2014; Kucukelbir et al., 2016)

    Suppose qλ(θ) = N(θ; µ, Σ = CC⊺), where µ is the mean vector, Σ is the covariance matrix, and C the Cholesky factor of Σ.

    Variational parameter λ = (µ,C). We can write θ ∼ qλ(θ) as

    θ = µ + Cz,  z ∼ f(z) = N(0, I).

    This shows that the Gaussian family has the structure required for the reparametrization trick.
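    A minimal Python sketch (my own illustration, not the speaker's code) of a one-sample reparametrization gradient for this Gaussian family, assuming a user-supplied grad_log_h(theta) returning ∇θ log h(θ) = ∇θ{log p(θ) + log p(y ∣ θ)}. Writing log qλ(µ + Cz) as a function of z and λ, its µ-derivative vanishes and its C-derivative reduces to diag(1/Cii):

    import numpy as np

    def reparam_gradient(mu, C, grad_log_h, rng):
        """One-sample unbiased estimate of dL/dmu and dL/dC for q = N(mu, C C^T)."""
        z = rng.standard_normal(mu.size)        # z ~ N(0, I), free of lambda
        theta = mu + C @ z                      # theta ~ q_lambda(theta)
        g = grad_log_h(theta)                   # gradient of log h at the sampled theta
        grad_mu = g                             # d/dmu of [log h - log q] (the log q term drops)
        grad_C = np.tril(np.outer(g, z))        # chain rule through theta = mu + C z
        grad_C += np.diag(1.0 / np.diag(C))     # from -log q = sum_i log C_ii + terms free of C
        return grad_mu, grad_C

    Feeding reparam_gradient into a loop like stochastic_gradient_ascent above, with λ collecting µ and the lower triangle of C, gives a basic Gaussian variational approximation algorithm.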


  • Gaussian variational approximation using conditional independence structure

    The low variance gradient estimates provided by the reparametrization trick are crucial to efficient and stable stochastic variational optimization methods. However, learning a Gaussian variational approximation is still hard if we parametrize Σ with a dense Cholesky factor: the number of parameters in the covariance matrix grows quadratically with the parameter dimension.

    We need to use the structure of the model to obtain parsimonious structured parametrizations of covariance matrices suitable for Gaussian variational approximations in high dimensions.


  • Gaussian variational approximation

    What exploitable structure is available? Conditional independence structure.

    Can we match the true conditional independence structure in the Gaussian approximation to make such approximations practical in high dimensions?

    For a Gaussian random vector with covariance matrix Σ, if Ω = Σ−1 then Ωij = 0 implies variables i and j are conditionally independent given the rest.
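    A small numerical check (illustration only) of this fact: put a zero in a 3 × 3 precision matrix and verify that the corresponding conditional covariance is diagonal.

    import numpy as np

    # Precision matrix with Omega[0, 2] = 0: theta_1 and theta_3 are conditionally
    # independent given theta_2 (the marginal covariance Sigma is still dense).
    Omega = np.array([[2.0, 0.5, 0.0],
                      [0.5, 2.0, 0.7],
                      [0.0, 0.7, 2.0]])
    Sigma = np.linalg.inv(Omega)

    A, B = [0, 2], [1]                      # condition variables A on variable B
    cond_cov = (Sigma[np.ix_(A, A)]
                - Sigma[np.ix_(A, B)] @ np.linalg.inv(Sigma[np.ix_(B, B)]) @ Sigma[np.ix_(B, A)])
    print(np.round(cond_cov, 10))           # off-diagonal entries are zero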


  • Motivating example: longitudinal generalized linear mixed model

    Observations y = (y1, …, yn), yi = (yi1, …, yi,ni)⊺.

    Observation specific random effects b = (b1, …, bn)⊺. Assume bi ∼ N(0, G), say. The yi are conditionally independent given b, with likelihood

    ∏_{i=1}^n p(yi ∣ bi, η),

    where η denotes fixed effects and variance parameters. Joint posterior for θ = (b⊺, η⊺)⊺:

    p(θ ∣ y) ∝ p(η) ∏_{i=1}^n p(bi ∣ η) p(yi ∣ bi, η).

    In the joint posterior, bi and bj, i ≠ j, are conditionally independent given η.


  • Motivating example: longitudinal generalized linear mixed model

    Consider a sparse Ω = Σ−1 in the Gaussian variational approximation of the form:

    Ω =
              b1        b2       ⋯   bn        η
        b1  [ Ω11       0        ⋯   0         Ω1,n+1   ]
        b2  [ 0         Ω22      ⋯   0         Ω2,n+1   ]
        ⋮   [ ⋮         ⋮        ⋱   ⋮         ⋮        ]
        bn  [ 0         0        ⋯   Ωnn       Ωn,n+1   ]
        η   [ Ωn+1,1    Ωn+1,2   ⋯   Ωn+1,n    Ωn+1,n+1 ]


  • Motivating example: longitudinal generalized linear mixed model

    It will be convenient later to parametrize Ω in terms of its Cholesky factor, Ω = TT⊺ say, where T is lower triangular. By imposing sparse structure on T we can impose sparse structure on Ω (the leftmost non-zero entries in each row match). Choose T of the form

    T =
              b1        b2       ⋯   bn        η
        b1  [ T11       0        ⋯   0         0        ]
        b2  [ 0         T22      ⋯   0         0        ]
        ⋮   [ ⋮         ⋮        ⋱   ⋮         ⋮        ]
        bn  [ 0         0        ⋯   Tnn       0        ]
        η   [ Tn+1,1    Tn+1,2   ⋯   Tn+1,n    Tn+1,n+1 ]
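    One payoff of parametrizing with the sparse Cholesky factor T of Ω, rather than a dense Cholesky factor of Σ, is that the reparametrization trick still applies: if z ∼ N(0, I), then θ = µ + T⁻⊺z has covariance T⁻⊺T⁻¹ = (TT⊺)⁻¹ = Σ. A minimal sketch (my own illustration) of the corresponding sampler:

    import numpy as np
    from scipy.linalg import solve_triangular

    def sample_theta(mu, T, rng):
        """Draw theta = mu + T^{-T} z with z ~ N(0, I); then Cov(theta) = (T T^T)^{-1}."""
        z = rng.standard_normal(mu.size)
        # Solve T^T x = z; T^T is upper triangular, so only a back-substitution is
        # needed, which can exploit the sparsity of T.
        x = solve_triangular(T.T, z, lower=False)
        return mu + x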


  • More general framework

    Observations y = (y1, …, yn), observation specific latent variables b1, …, bn, global parameters η. Joint model p(y, θ) for (y, θ), θ = (b⊺, η⊺)⊺, of the form

    p(η) {∏_{i=1}^n p(yi ∣ bi, η)} p(b1, …, bk ∣ η) {∏_{i>k} p(bi ∣ bi−1, …, bi−k, η)}

    for some 0 ≤ k ≤ n. Then bi is conditionally independent of the other latent variables in p(θ ∣ y) given η and its k neighbouring values.

    Our previous random effects example fits this structure with k = 0. A state space model for a time series, where the bi are the states, fits this structure with k = 1. For a state space model, use Ω of the form


  • More general framework

    Ω =
                b1        b2        b3       ⋯   bn−1        bn         η
        b1    [ Ω11       Ω21⊺      0        ⋯   0           0          Ωn+1,1⊺   ]
        b2    [ Ω21       Ω22       Ω32⊺     ⋯   0           0          Ωn+1,2⊺   ]
        b3    [ 0         Ω32       Ω33      ⋯   0           0          Ωn+1,3⊺   ]
        ⋮     [ ⋮         ⋮         ⋮        ⋱   ⋮           ⋮          ⋮         ]
        bn−1  [ 0         0         0        ⋯   Ωn−1,n−1    Ωn,n−1⊺    Ωn+1,n−1⊺ ]
        bn    [ 0         0         0        ⋯   Ωn,n−1      Ωnn        Ωn+1,n⊺   ]
        η     [ Ωn+1,1    Ωn+1,2    Ωn+1,3   ⋯   Ωn+1,n−1    Ωn+1,n     Ωn+1,n+1  ]

    i.e. block tridiagonal in the latent blocks b1, …, bn, with dense final block row and column for η.
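    A small sketch (my own illustration, with hypothetical block sizes and a hypothetical helper precision_pattern) of the allowed non-zero pattern of Ω in this framework: a band of width k among the latent blocks b1, …, bn, plus dense rows and columns for η.

    import numpy as np

    def precision_pattern(n, p_b, p_eta, k):
        """Boolean mask of allowed non-zeros in Omega for n latent blocks of size p_b,
        a global block eta of size p_eta, and dependence on k neighbouring blocks."""
        d = n * p_b + p_eta
        mask = np.zeros((d, d), dtype=bool)
        for i in range(n):
            for j in range(max(0, i - k), min(n, i + k + 1)):   # banded latent blocks
                mask[i*p_b:(i+1)*p_b, j*p_b:(j+1)*p_b] = True
        mask[n*p_b:, :] = True                                   # eta rows are dense
        mask[:, n*p_b:] = True                                   # eta columns are dense
        return mask

    # k = 0 reproduces the random effects pattern; k = 1 the state space pattern above.
    print(precision_pattern(n=4, p_b=1, p_eta=2, k=1).astype(int))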