Gaussian variational approximation with structured covariance matrices
David Nott
Department of Statistics and Applied Probability, National University of Singapore
Collaborators: Linda Tan, Victor Ong, Michael Smith, Matias Quiroz, Robert Kohn
David Nott, NUS Gaussian variational approximation 1 / 39
Bayesian computation as usual
Data y to be observed, unknowns θ. Construct a model for (y , θ):
p(y , θ) = p(θ) × p(y ∣θ),   with prior p(θ) and likelihood p(y ∣θ).

Condition on the observed y:

p(θ∣y) ∝ p(θ)p(y ∣θ),   the posterior.
Summarization of the posterior is done using algorithms like MCMC, sequential Monte Carlo, etc.
These usual Monte Carlo algorithms are exact in principle.
Variational approximations
Increasingly there is interest in algorithms that do not possess the exact-in-principle property.
Why approximate inference?
- Make use of the scalability of optimization-based approaches to computation.
- Approximate inference methods are enough to understand why certain models are unsuitable.
- Approximate inference methods may perform as well as exact methods for predictive inference.
This talk concerns a popular framework for approximate inference, variational approximation.
Variational inference basics (Blei et al., 2017; Ormerod and Wand, 2010)
Variational approximation reformulates Bayesian computation as an optimization problem.
Define an approximating family with some parameters (Gaussian, for example). Then:
- Define some measure of "closeness" of an approximation to the true posterior.
- Optimize that measure over the approximating family.
Variational parameters to be optimized will be denoted λ.
Gaussian variational approximation, toy example

[Three figure-only slides illustrating the toy example.]
Gaussian variational approximation
In this talk we consider a multivariate normal approximation denoted qλ(θ) = N(µ,Σ) with λ = (µ,Σ) to be optimized.
The best normal approximation will be in the sense of minimizing Kullback-Leibler (KL) divergence,

KL(qλ(θ) ∣∣ p(θ∣y)) = log p(y) − ∫ log [ p(θ)p(y ∣θ) / qλ(θ) ] qλ(θ) dθ
                    = log p(y) − L(λ),

where p(y) = ∫ p(θ)p(y ∣θ) dθ and L(λ) is called the variational lower bound.
Minimizing the KL divergence is equivalent to maximizing L(λ).
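As a concrete check of this identity (a toy illustration of my own, not from the slides): in a conjugate normal-normal model log p(y) is known in closed form, and if qλ is taken to be the exact posterior the Monte Carlo estimate of L(λ) equals log p(y), since the KL term is zero.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = 1.0  # single observation; model y | theta ~ N(theta, 1), prior theta ~ N(0, 1)

def log_h(theta):
    # log h(theta) = log p(theta) + log p(y | theta)
    return norm.logpdf(theta, 0.0, 1.0) + norm.logpdf(y, theta, 1.0)

# Exact posterior is N(y/2, 1/2); use it as q_lambda, so KL = 0 and L(lambda) = log p(y)
mu, sigma = y / 2.0, np.sqrt(0.5)
theta = rng.normal(mu, sigma, size=100_000)
elbo = np.mean(log_h(theta) - norm.logpdf(theta, mu, sigma))

log_evidence = norm.logpdf(y, 0.0, np.sqrt(2.0))  # marginal p(y) = N(y; 0, 2)
print(elbo, log_evidence)  # the two agree
```

Here log h(θ) − log qλ(θ) is constant in θ at the optimum, so the estimate is exact; for any other qλ the estimated bound falls below log p(y) by the KL divergence.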
Gaussian variational approximation
The optimization of L(λ) is challenging for a normal family when the dimension d of θ is large.
Key difficulty: with no restriction on Σ in qλ(θ) = N(µ,Σ), there are d + d(d + 1)/2 parameters to optimize.
For high-dimensional θ we need reduced parametrizations for Σ:
- Exploiting conditional independence structure to motivate sparsity in Σ⁻¹ (Tan and Nott, 2017).
- Factor models.
How should we optimize? Stochastic gradient ascent (Robbins and Monro, 1951)
We will use stochastic gradient ascent methods for optimizing the lower bound.
Suppose that ∇λL(λ) is the gradient of L(λ) and that ∇̂λL(λ) is an unbiased estimate of it.
Stochastic gradient ascent
- Initialize λ⁽⁰⁾.
- For t = 0, 1, . . ., until some stopping rule is satisfied:
  λ⁽ᵗ⁺¹⁾ = λ⁽ᵗ⁾ + ρt ∇̂λL(λ⁽ᵗ⁾).

Typically the learning rate sequence ρt, t ≥ 0, satisfies ∑t ρt = ∞ and ∑t ρt² < ∞ (Robbins and Monro, 1951).
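The iteration above can be sketched on a toy problem (objective and noise invented for illustration): maximize L(λ) = −(λ − 3)², observing only the exact gradient plus zero-mean noise, with step sizes ρt = 1/(t + 10) satisfying the Robbins-Monro conditions.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_estimate(lam):
    # Unbiased estimate of dL/dlambda for L(lam) = -(lam - 3)^2
    return -2.0 * (lam - 3.0) + rng.normal(0.0, 1.0)

lam = 0.0
for t in range(5000):
    rho = 1.0 / (t + 10)  # sum of rho_t diverges, sum of rho_t^2 is finite
    lam += rho * grad_estimate(lam)

print(lam)  # close to the maximizer lambda = 3
```

The decaying step sizes average out the gradient noise, so the iterates settle near the maximizer even though no single gradient evaluation is exact.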
Reparametrization gradients (Kingma and Welling, 2013; Rezende et al., 2014)
Low variance gradient estimates are crucial for stability and fast convergence of stochastic optimization; they are achieved using the reparametrization trick.
Writing h(θ) = p(θ)p(y ∣θ), the lower bound is

L(λ) = ∫ [log h(θ) − log qλ(θ)] qλ(θ) dθ.
Suppose that for the variational family qλ(θ) we can write θ ∼ qλ(θ) as θ = t(z, λ), where z ∼ f(z) and f(⋅) does not depend on λ. Then

L(λ) = Ef [log h(t(z, λ)) − log qλ(t(z, λ))].
Differentiating under the integral sign, ∇λL(λ) is an expectation with respect to f(⋅): simulation from f(⋅) gives unbiased estimates.
Reparametrization gradients for the Gaussian family (Titsias and Lázaro-Gredilla, 2014; Kucukelbir et al., 2016)
Suppose qλ(θ) = N(θ; µ, Σ = CC⊺), where µ is the mean vector, Σ is the covariance matrix, and C is the Cholesky factor of Σ.
Variational parameter λ = (µ, C). We can write θ ∼ qλ(θ) as

θ = µ + Cz,   z ∼ f(z) = N(0, I).
This shows that the Gaussian family has the structure required for the reparametrization trick.
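A minimal numerical check of this representation (illustrative values, not from the talk): draws θ = µ + Cz reproduce the target mean and covariance.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative target N(mu, Sigma); C is the lower triangular Cholesky factor
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
C = np.linalg.cholesky(Sigma)

# Reparametrization: theta = mu + C z with z ~ N(0, I)
z = rng.standard_normal((100_000, 2))
theta = mu + z @ C.T

print(theta.mean(axis=0))           # approximately mu
print(np.cov(theta, rowvar=False))  # approximately Sigma
```

Because θ is a deterministic, differentiable function of (µ, C) given z, gradients of Monte Carlo averages over z pass through to the variational parameters.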
Gaussian variational approximation using conditional independence structure
- The low variance gradient estimates provided by the reparametrization trick are crucial to efficient and stable stochastic variational optimization methods.
- However, learning a Gaussian variational approximation is still hard if we parametrize Σ with a dense Cholesky factor: the number of parameters in the covariance matrix grows quadratically with the parameter dimension.
We need to use the structure of the model to obtain parsimonious structured parametrizations of covariance matrices suitable for Gaussian variational approximations in high dimensions.
Gaussian variational approximation
What exploitable structure is available? Conditional independence structure.
Can we match the true conditional independence structure in the Gaussian approximation to make such approximations practical in high dimensions?
For a Gaussian random vector with covariance matrix Σ, if Ω = Σ⁻¹ then Ωij = 0 implies variables i and j are conditionally independent given the rest.
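This can be seen numerically (an illustration of my own using an AR(1)-style tridiagonal precision): the sparsity lives in Ω = Σ⁻¹, not in Σ.

```python
import numpy as np

# Tridiagonal precision: variables i and j with |i - j| > 1 are
# conditionally independent given the rest (Omega_ij = 0)
d, phi = 5, 0.8
Omega = (1 + phi**2) * np.eye(d)
Omega[0, 0] = Omega[-1, -1] = 1.0
for i in range(d - 1):
    Omega[i, i + 1] = Omega[i + 1, i] = -phi

Sigma = np.linalg.inv(Omega)
print(np.count_nonzero(np.abs(Omega) > 1e-10))  # 13 nonzeros: tridiagonal
print(bool(np.all(np.abs(Sigma) > 1e-10)))      # True: Sigma itself is dense
```

So a variational family can capture full marginal dependence while storing only the few nonzero precision entries.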
Motivating example: longitudinal generalized linear mixed model
Observations y = (y1, . . . , yn), yi = (yi1, . . . , yini)⊺.

Observation specific random effects b = (b1, . . . , bn)⊺; assume bi ∼ N(0, G), say. The yi are conditionally independent given b, with likelihood

∏_{i=1}^n p(yi ∣bi , η),

where η denotes fixed effects and variance parameters. The joint posterior for θ = (b⊺, η⊺)⊺ is

p(θ∣y) ∝ p(η) ∏_{i=1}^n p(bi ∣η) p(yi ∣bi , η).

In the joint posterior, bi and bj, i ≠ j, are conditionally independent given η.
Motivating example: longitudinal generalized linear mixed model
Consider a sparse Ω = Σ⁻¹ in the Gaussian variational approximation of the form (rows and columns ordered b1, b2, . . . , bn, η):

Ω = ⎛ Ω11     0       . . .  0       Ω1,n+1   ⎞
    ⎜ 0       Ω22     . . .  0       Ω2,n+1   ⎟
    ⎜ ⋮       ⋮       ⋱      ⋮       ⋮        ⎟
    ⎜ 0       0       . . .  Ωnn     Ωn,n+1   ⎟
    ⎝ Ωn+1,1  Ωn+1,2  . . .  Ωn+1,n  Ωn+1,n+1 ⎠
Motivating example: longitudinal generalized linear mixed model
It will be convenient later to parametrize Ω in terms of its Cholesky factor, Ω = TT⊺ say, where T is lower triangular. By imposing sparse structure on T we can impose sparse structure on Ω: the leftmost non-zero entries in each row of T match those of Ω. Choose T of the form
T = ⎛ T11     0       . . .  0       0        ⎞
    ⎜ 0       T22     . . .  0       0        ⎟
    ⎜ ⋮       ⋮       ⋱      ⋮       ⋮        ⎟
    ⎜ 0       0       . . .  Tnn     0        ⎟
    ⎝ Tn+1,1  Tn+1,2  . . .  Tn+1,n  Tn+1,n+1 ⎠

(rows and columns ordered b1, . . . , bn, η).
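A small numerical check of this construction (scalar blocks and illustrative values of my own): with T block diagonal in the bi entries and a dense last row for η, Ω = TT⊺ has exactly the arrow sparsity pattern of the previous slide.

```python
import numpy as np

rng = np.random.default_rng(3)

# n scalar random effects b_1..b_n plus a scalar global parameter eta.
# T is lower triangular: nonzero diagonal, dense last row, zeros elsewhere.
n = 4
T = np.zeros((n + 1, n + 1))
T[np.arange(n), np.arange(n)] = rng.uniform(0.5, 1.5, n)  # T_ii
T[n, :] = rng.uniform(0.5, 1.5, n + 1)                    # last row (eta)

Omega = T @ T.T
# Blocks between distinct b_i, b_j are exactly zero, matching the
# conditional independence of b_i and b_j given eta; the eta row is dense.
off_diag = Omega[:n, :n] - np.diag(np.diag(Omega[:n, :n]))
print(bool(np.allclose(off_diag, 0.0)))      # True
print(bool(np.all(np.abs(Omega[n]) > 0.0)))  # True
```

The zeros arise because rows i and j of T (i ≠ j ≤ n) have their single nonzero in different columns, so their inner product vanishes.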
More general framework
Observations y = (y1, . . . , yn), observation specific latent variables b1, . . . , bn, global parameters η. Joint model p(y, θ) for (y, θ), θ = (b⊺, η⊺)⊺, of the form

p(η) ∏_{i=1}^n p(yi ∣bi , η) × p(b1, . . . , bk ∣η) ∏_{i>k} p(bi ∣bi−1, . . . , bi−k , η),

for some 0 ≤ k ≤ n.
- bi is conditionally independent of the other latent variables in p(θ∣y) given η and k neighbouring values.
- Our previous random effects example fits this structure with k = 0.
- A state space model for a time series where the bi are the states fits this structure with k = 1.

For a state space model, use Ω of the form
More general framework
Ω = ⎛ Ω11     Ω21⊺    0       . . .  0         0        Ωn+1,1⊺  ⎞
    ⎜ Ω21     Ω22     Ω32⊺    . . .  0         0        Ωn+1,2⊺  ⎟
    ⎜ 0       Ω32     Ω33     . . .  0         0        Ωn+1,3⊺  ⎟
    ⎜ ⋮       ⋮       ⋮       ⋱      ⋮         ⋮        ⋮        ⎟
    ⎜ 0       0       0       . . .  Ωn−1,n−1  Ωn,n−1⊺  Ωn+1,n−1⊺ ⎟
    ⎜ 0       0       0       . . .  Ωn,n−1    Ωnn      Ωn+1,n⊺  ⎟
    ⎝ Ωn+1,1  Ωn+1,2  Ωn+1,3  . . .  Ωn+1,n−1  Ωn+1,n   Ωn+1,n+1 ⎠

(rows and columns ordered b1, b2, b3, . . . , bn−1, bn, η).
Again, we can impose the appropriate sparse structure on Ω by parametrizing in terms of the Cholesky factor T, Ω = TT⊺, and matching the row sparsity of T to that of Ω.
Reparametrization gradients
Variational parameters λ = (µ, T). Now qλ(θ) = N(µ, Σ = T⁻⊺T⁻¹). Generative representation for applying the reparametrization trick: z ∼ N(0, I), θ = µ + T⁻⊺z.
Simulation from the variational distribution, and gradient estimation, can be done using only solutions of sparse triangular linear systems.
Expressions for lower bound gradients, where z ∼ f(z) = N(0, I):

∇µL = Ef [∇θ log h(µ + T⁻⊺z) + Tz],
∇TL = −Ef [T⁻⊺z (∇θ log h(µ + T⁻⊺z) + Tz)⊺ T⁻⊺].
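A sketch of the sampling step (illustrative dense example of my own; in practice T is sparse and the triangular solves are cheap): θ = µ + T⁻⊺z via a single triangular solve.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(4)

# Generic SPD precision for illustration; in practice T would be sparse
d = 3
A = rng.standard_normal((d, d))
Omega = A @ A.T + d * np.eye(d)
T = np.linalg.cholesky(Omega)  # lower triangular, Omega = T T^T
mu = np.zeros(d)

# theta = mu + T^{-T} z: one triangular solve per batch of draws
z = rng.standard_normal((d, 100_000))
theta = mu[:, None] + solve_triangular(T.T, z, lower=False)

print(np.cov(theta))  # approximately Sigma = inv(Omega) = T^{-T} T^{-1}
```

No explicit inversion of Ω or Σ is ever formed; everything goes through T.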
Logistic regression with random effects (Hosmer et al., 2013)
- Data on 500 subjects studied over seven years.
- Response: whether the subject is taking drugs from 3 or more different groups.
- Covariates for gender, race, age, number of outpatient mental health visits.
- Random intercept model.
- 8 fixed effects parameters, one variance parameter, 500 random intercepts.
Example: polypharmacy data (Hosmer et al., 2013)
Compare our approach with:
- Automatic differentiation variational inference (ADVI, implemented in the statistical package Stan) (Kucukelbir et al., 2016).
- Doubly stochastic variational inference (DSVI) (Titsias and Lázaro-Gredilla, 2014).
DSVI and ADVI have each been implemented in a "mean field" version (diagonal covariance structure) and in a version assuming a dense Cholesky factor for the covariance. Although DSVI and ADVI are essentially similar methods, they use different step size choices and stopping rules.
Example: polypharmacy data (Hosmer et al., 2013)

[Figure: estimated marginal posterior densities for β0, βGender, βRace, βAge, βMHV41, βMHV42, βMHV43, βINPTMHV21 and ζ1.]

Blue = mean field (dashed = ADVI, solid = DSVI), Green = dense Cholesky (dashed = ADVI, solid = DSVI), Black = MCMC, Red = sparse precision Cholesky.
Example: polypharmacy data (Hosmer et al., 2013)
The sparse precision Cholesky approach is most accurate. Runtimes:
- ADVI methods: 7 seconds (mean field), 75 seconds (dense Cholesky). Step size sequences based on Stan defaults.
- DSVI methods: 30 seconds (mean field), 262 seconds (dense Cholesky). Adaptive step sizes chosen by ADADELTA.
- Sparse precision Cholesky: 56 seconds. Adaptive step sizes chosen by ADADELTA.

DSVI and sparse precision Cholesky algorithms are implemented in Julia.
An alternative to exploiting conditional independence: factor structure
Using a sparse precision matrix in the variational family is natural with conditional independence structure coming from the model.
In other cases, it may be useful to consider factor structure. Consider
qλ(θ) = N(µ, BB⊺ + D²),   (1)

where B is a d × p full rank matrix with p ≪ d, and D is diagonal with diagonal elements δ = (δ1, . . . , δd)⊺.
A random draw from (1) can be written as
θ = µ +Bz +Dε,
where (z⊺, ε⊺)⊺ ∼ N(0, I).
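The generative representation can be checked directly (dimensions and values of my own, chosen for illustration): draws θ = µ + Bz + Dε have sample covariance close to BB⊺ + D².

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative dimensions: d components, p factors with p << d
d, p = 6, 2
B = rng.standard_normal((d, p))
delta = rng.uniform(0.5, 1.0, d)   # diagonal of D
mu = np.zeros(d)

# theta = mu + B z + D eps with (z, eps) ~ N(0, I)
z = rng.standard_normal((p, 200_000))
eps = rng.standard_normal((d, 200_000))
theta = mu[:, None] + B @ z + delta[:, None] * eps

Sigma = B @ B.T + np.diag(delta**2)
print(bool(np.allclose(np.cov(theta), Sigma, atol=0.08)))  # True
```

All dependence across the d components is carried by the p-dimensional z; the ε terms only add component-wise noise.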
Gaussian variational approximation
In this generative model θ = µ + Bz + Dε:
- The term Bz involving the low-dimensional factor z captures dependence among components.
- The "idiosyncratic" variation Dε, which is independent between components, captures component specific variation.
Factor models can give parsimonious descriptions of high-dimensional covariance structure; we set the upper triangle of B to zero (Geweke and Zhou, 1996).
Variational family parametrized by λ = (µ, B, δ): the dimension of λ grows linearly in d.
Optimization of the lower bound
The generative representation of the factor model is the basis for reparametrization gradients in stochastic gradient optimization. Gradient expressions for variational parameters λ = (µ, B, δ):
∇µL(λ) = Ef [∇θ log h(µ + Bz + Dε) + (BB⊺ + D²)⁻¹(Bz + Dε)],
∇BL(λ) = Ef [∇θ log h(µ + Bz + Dε)z⊺ + (BB⊺ + D²)⁻¹(Bz + Dε)z⊺],
∇δL(λ) = Ef [diag(∇θ log h(µ + Bz + Dε)ε⊺ + (BB⊺ + D²)⁻¹(Bz + Dε)ε⊺)].
The high-dimensional matrix inversions (BB⊺ + D²)⁻¹ can be done conveniently using the Sherman-Morrison-Woodbury formula.
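A sketch of that computation (an illustration of my own): applying (BB⊺ + D²)⁻¹ to a vector via Woodbury requires only a p × p linear solve, never a d × d inversion.

```python
import numpy as np

rng = np.random.default_rng(6)

d, p = 500, 4  # illustrative sizes
B = rng.standard_normal((d, p))
delta = rng.uniform(0.5, 1.5, d)

# Woodbury: (BB^T + D^2)^{-1} = D^{-2} - D^{-2} B (I + B^T D^{-2} B)^{-1} B^T D^{-2}
Dinv2 = 1.0 / delta**2                       # diagonal of D^{-2}
K = np.eye(p) + (B * Dinv2[:, None]).T @ B   # p x p matrix I + B^T D^{-2} B

def sigma_inv_times(v):
    # Apply (BB^T + D^2)^{-1} to v with only a p x p linear solve
    w = Dinv2 * v
    return w - Dinv2 * (B @ np.linalg.solve(K, B.T @ w))

v = rng.standard_normal(d)
direct = np.linalg.solve(B @ B.T + np.diag(delta**2), v)
print(bool(np.allclose(sigma_inv_times(v), direct)))  # True
```

The cost per application is O(dp) plus an O(p³) solve, versus O(d³) for the direct route.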
Logistic regression with random effects (Hosmer et al., 2013)

- Data on 500 subjects studied over seven years.
- Response: whether the subject is taking drugs from 3 or more different groups.
- Covariates for gender, race, age, number of outpatient mental health visits.
- Random intercept model.
- 8 fixed effects parameters, one variance parameter, 500 random intercepts.
Logistic regression with random effects (Hosmer et al., 2013)

[Figure: estimated marginal posterior densities for β0, βgender, βrace, βage, βM1, βM2, βM3, βIM and ζ, comparing Tan & Nott (2016), VAFC (p = 0), VAFC (p = 4) and VAFC (p = 20).]
Logistic regression with random effects (Hosmer et al., 2013)

[Figure: means (left) and standard deviations (right) of the variational distributions, VAFC versus Tan & Nott.]
Extensions
Combining factor and conditional independence structure can be fruitful.
Consider a time series with observations y1, . . . , yn and a state space structure with states b1, . . . , bn.
Suppose the states are high-dimensional. Model the posterior dependence of b = (b1, . . . , bn)⊺ through a dynamic factor model:
- factor structure to give a reduced dimension description of the states; and
- conditional independence structure in time for the dynamic factors.
Example: a spatio-temporal model (Wikle and Hooten, 2006)
- Spatio-temporal model of Wikle and Hooten (2006).
- Dataset on the spread of the Eurasian collared-dove across North America (North American Breeding Bird Survey).
- p = 111 spatial grid points with the dove counts aggregated within each area.
- Vectors of spatial counts yt modelled as conditionally independent Poisson variables where the log means are given by a latent high-dimensional Markovian dynamic process ut plus measurement error.
- The dynamic process ut evolves according to a discretized diffusion equation.
Example: a spatio-temporal model (Wikle and Hooten, 2006)
- In this model, there are 4,223 unknowns.
- Dimension reduction: consider 4 factors for parametrization of spatial state vectors.
- 6,428 variational parameters, compared with 8,923,199 in the unrestricted parametrization.
Estimates of spatio-temporal intensity
Gaussian VA (left) and MCMC (right) estimates for 1999-2003
Spatially averaged intensity over time
[Figure: draws of the spatially averaged intensity ϕt, 1986-2003, from the MCMC posterior (left) and the VB posterior (right).]
Spatially varying diffusion coefficients
Columns are for locations with zero (left), low (middle) and high (right) total count.
[Figure: densities of ψi from MCMC and VB at locations i = 1, 5, 35, 46, 96, 105.]
Future work
- Using algorithms for Gaussian approximations within more complex procedures.
- Fitting Gaussian mixture approximations (Guo et al., 2016; Miller et al., 2016).
- Gaussian copula approximations (Han et al., 2015).
- Skew normal copula.
- How to make these methods easy to use? Amortized variational inference methods.
References
D. Blei, A. Kucukelbir and J. McAuliffe. Variational inference: A review for statisticians.Journal of the American Statistical Association, 112(518), 859-877, 2017.
J. Ormerod and M. Wand. Explaining variational approximations. The AmericanStatistician, 64(2), 140-153, 2010.
Victor M.-H. Ong, David J. Nott and Michael S. Smith (2018). Gaussian variationalapproximation with a factor covariance structure. JCGS,doi:10.1080/10618600.2017.1390472.
Matias Quiroz, David J. Nott and Robert Kohn (2018). Gaussian variationalapproximation for high-dimensional state space models. arXiv:1801.07873.
Linda S. L. Tan and David J. Nott (2018). Gaussian variational approximation with sparse precision matrix. Statistics and Computing, 28(2), 259-275.
C.K. Wikle and M.B. Hooten. Hierarchical Bayesian Spatio-Temporal Models forPopulation Spread. In Applications of Computational Statistics in theEnvironmental Sciences: Hierarchical Bayes and MCMC Methods. J.S. Clark andA. Gelfand (eds). Oxford University Press. 145-169, 2006.