
15 | Variational inference

15.1 Foundations

Variational inference is a statistical inference framework for probabilistic models that comprise unobservable random variables. Its general starting point is a joint probability density function over observable random variables y and unobservable random variables ϑ,

p(y, ϑ) = p(ϑ)p(y|ϑ), (15.1)

where p(ϑ) is usually referred to as the prior density and p(y|ϑ) as the likelihood. Given an observed value of y, the first aim of variational inference is to determine the conditional density of ϑ given the observed value of y, referred to as the posterior density. The second aim of variational inference is to evaluate the marginal density of the observed data, or, equivalently, its logarithm

$$
\ln p(y) = \ln \int p(y, \vartheta)\, d\vartheta. \tag{15.2}
$$

Eq. (15.2) is commonly referred to as the log marginal likelihood or log model evidence. The log model evidence allows for comparing different models in their plausibility to explain observed data. In the variational inference framework it is not the log model evidence itself which is evaluated, but rather a lower bound approximation to it. This is due to the fact that if a model comprises many unobservable variables ϑ, the integration of the right-hand side of (15.2) can become analytically burdensome or even intractable. To nevertheless achieve its two aims, variational inference in effect replaces an integration problem with an optimization problem. To this end, variational inference exploits a set of information theoretic quantities as introduced in Chapter 11 and below. Specifically, the following log model evidence decomposition forms the core of the variational inference approach (Figure 15.1):

ln p(y) = F (q(ϑ)) +KL(q(ϑ)‖p(ϑ|y)). (15.3)

In eq. (15.3), q(ϑ) denotes an arbitrary probability density over the unobservable variables which is used as an approximation of the posterior density p(ϑ|y). In the following, q(ϑ) is referred to as the variational density. In words, (15.3) states that for an arbitrary variational density q(ϑ), the log model evidence comprises the sum of two information theoretic quantities: the so-called variational free energy, defined as

$$
F(q(\vartheta)) := \int q(\vartheta) \ln\left(\frac{p(y, \vartheta)}{q(\vartheta)}\right) d\vartheta \tag{15.4}
$$

and the KL divergence between the true posterior density p(ϑ|y) and the variational density q(ϑ),

$$
\mathrm{KL}(q(\vartheta)\,\|\,p(\vartheta|y)) = \int q(\vartheta) \ln\left(\frac{q(\vartheta)}{p(\vartheta|y)}\right) d\vartheta. \tag{15.5}
$$

Based on these definitions, it is straightforward to show the validity of the log model evidence decomposition:


Figure 15.1. Visualization of the log model evidence decomposition that lies at the heart of the variational inference approach. The upper vertical bar represents the log model evidence, which is a function of the probabilistic model p(y, ϑ) and is constant for any observation of y. As shown in the main text, the log model evidence can readily be rewritten into the sum of the variational free energy term F (q(ϑ)) and a KL divergence term KL(q(ϑ)||p(ϑ|y)), if one introduces an arbitrary variational density over the unobservable variables ϑ. Maximizing the variational free energy hence minimizes the KL divergence between the variational density q(ϑ) and the true posterior density p(ϑ|y) and renders the variational free energy a better approximation of the log model evidence. Equivalently, minimizing the KL divergence between the variational density q(ϑ) and the true posterior density p(ϑ|y) maximizes the free energy and also renders it a tighter approximation to the log model evidence ln p(y).

Proof of (15.3)

By definition, we have

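$$
\begin{aligned}
F(q(\vartheta)) &= \int q(\vartheta) \ln\left(\frac{p(y, \vartheta)}{q(\vartheta)}\right) d\vartheta \\
&= \int q(\vartheta) \ln\left(\frac{p(y)\,p(\vartheta|y)}{q(\vartheta)}\right) d\vartheta \\
&= \int q(\vartheta) \ln p(y)\, d\vartheta + \int q(\vartheta) \ln\left(\frac{p(\vartheta|y)}{q(\vartheta)}\right) d\vartheta \\
&= \ln p(y) \int q(\vartheta)\, d\vartheta + \int q(\vartheta) \ln\left(\frac{p(\vartheta|y)}{q(\vartheta)}\right) d\vartheta \\
&= \ln p(y) - \int q(\vartheta) \ln\left(\frac{q(\vartheta)}{p(\vartheta|y)}\right) d\vartheta \\
&= \ln p(y) - \mathrm{KL}(q(\vartheta)\,\|\,p(\vartheta|y)),
\end{aligned} \tag{15.6}
$$

from which eq. (15.3) follows immediately. □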

The log model evidence decomposition (15.3) can be used to achieve the aims of variational inference as follows: first, the non-negativity of the KL divergence implies that the variational free energy F (q(ϑ)) is always smaller than or equal to the log model evidence, i.e.,

F (q(ϑ)) ≤ ln p(y). (15.7)

This fact can be exploited in the numerical application of variational inference to probabilistic models: because the log model evidence is a fixed quantity which only depends on the choice of p(y, ϑ) and a specific data realization, manipulating the variational density q(ϑ) for a given data set in such a manner that the variational free energy increases has two consequences. First, the lower bound to the log model evidence becomes tighter, and the variational free energy a better approximation to the log model evidence. Second, because the left-hand side of eq. (15.3) remains constant, the KL divergence between the true posterior and its variational approximation decreases, which renders the variational density q(ϑ) an increasingly better approximation to the true posterior distribution p(ϑ|y). Because the variational free energy is a lower bound to the log model evidence, it is also referred to as evidence lower bound (ELBO).
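The decomposition (15.3) and the bound (15.7) can be checked numerically on a small discrete model, where the integrals reduce to sums. The following is a minimal sketch under stated assumptions: the three-valued variable ϑ, the prior and likelihood values, and the choice of q are all hypothetical and serve illustration only.

```python
import math

# Hypothetical toy model: theta takes three values, y is one fixed observation.
prior = [0.5, 0.3, 0.2]            # p(theta)
likelihood = [0.10, 0.40, 0.70]    # p(y | theta), evaluated at the observed y

joint = [p * l for p, l in zip(prior, likelihood)]   # p(y, theta)
evidence = sum(joint)                                 # p(y)
posterior = [j / evidence for j in joint]             # p(theta | y)


def free_energy(q):
    """Discrete analogue of (15.4): sum_theta q(theta) ln(p(y, theta) / q(theta))."""
    return sum(qk * math.log(jk / qk) for qk, jk in zip(q, joint) if qk > 0)


def kl(q, p):
    """Discrete analogue of (15.5): sum_theta q(theta) ln(q(theta) / p(theta))."""
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0)


q = [0.6, 0.3, 0.1]                # an arbitrary variational distribution
log_evidence = math.log(evidence)

# Both printed numbers should agree up to floating point error, cf. (15.3).
print(log_evidence, free_energy(q) + kl(q, posterior))
```

Setting q equal to the exact posterior makes the KL term vanish, in which case the free energy coincides with the log model evidence.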

The maximization of a variational free energy in terms of a variational density is a very general approach for posterior density and log model evidence approximation. Like the maximum likelihood approach, it serves as a guiding principle rather than a concrete numerical algorithm. In other words, algorithms that make use of the variational free energy log model evidence decomposition are jointly referred to as variational inference algorithms, but many variants exist. In the following two sections, we will discuss two specific variants and illustrate them with examples. The variants will be referred to as free-form mean-field variational inference and fixed-form mean-field variational inference. Here, the term mean-field refers to a factorization assumption with respect to the variational densities

PMFN | © 2019 Dirk Ostwald CC BY-NC-SA 4.0


over s sets of the unobserved random variables,

$$
q(\vartheta) = \prod_{i=1}^{s} q(\vartheta_i). \tag{15.8}
$$

Such a factorization allows the variational free energy to be optimized independently for the variational densities q(ϑi) in a coordinate-wise fashion for i = 1, ..., s, a procedure sometimes referred to as coordinate ascent variational inference (CAVI). The free-form and fixed-form variants of mean-field variational inference then differ in their assumptions about the variational densities q(ϑi): the defining feature of the free-form mean-field variational inference approach is that the parametric form of variational densities is not predetermined, but analytically evaluated based on a central result from variational calculus. As such, the free-form mean-field variational inference approach is useful to emphasize the roots of variational inference in variational calculus, but is also analytically quite demanding. In a functional neuroimaging context, the free-form mean-field approach thus serves primarily didactic purposes. The fixed-form mean-field variational inference approach, on the other hand, is characterized by predetermined functional forms of the variational densities and is of high practical relevance in functional neuroimaging. In particular, a fixed-form mean-field variational inference approach that rests on Gaussian variational densities enjoys widespread popularity in functional neuroimaging (under the label variational Bayes Laplace algorithm) and theoretical neuroscience (under the label free energy principle). In contrast to the free-form mean-field approach, the fixed-form mean-field approach is less analytically demanding and replaces a variational optimization problem with a standard numerical optimization problem. This is achieved by analytically evaluating the variational free energy in terms of the parameters of the variational densities.

15.2 Free-form mean-field variational inference

Free-form mean-field variational inference rests on a factorization of the variational density over sets of unobserved random variables

q(ϑ) = q(ϑs)q(ϑ\s), (15.9)

referred to as a mean-field approximation. In (15.9), ϑ\s denotes all unobserved variables not in the sth group. For the factorization (15.9), the variational free energy becomes a function of two arguments, namely q(ϑs) and q(ϑ\s). Due to the complexity of the integrals involved, a simultaneous analytical maximization of the variational free energy with respect to both its arguments is often difficult to achieve, and a coordinate-wise approach, i.e., maximization first with respect to q(ϑs) and second with respect to q(ϑ\s), is preferred. Notably, the assumed factorization over sets of variables corresponds to the assumption that the respective variables form stochastically independent contributions to the multivariate posterior, which, depending on the true form of the generative model, may have weak or strong implications for the validity of the ensuing posterior inference.

The question is thus how to obtain the arguments q(ϑs) and q(ϑ\s) that maximize the variational free energy. It turns out that this challenge corresponds to a well-known problem in statistical physics, which has long been solved in a general fashion using variational calculus (Hinton and Van Camp, 1993). In contrast to ordinary calculus, which deals with the optimization of functions with respect to real numbers, variational calculus deals with the optimization of functions (in the context of variational calculus also referred to as functionals) with respect to functions. Using variational calculus, it can be shown that the variational free energy is maximized with respect to the unobserved variable partition ϑs, if q(ϑs) is set proportional (i.e., equal up to a scaling factor) to the exponential of the expected log joint probability of y and ϑ under the variational density over ϑ\s. Formally, this can be written as

$$
q(\vartheta_s) \propto \exp\left(\int q(\vartheta_{\setminus s}) \ln p(y, \vartheta)\, d\vartheta_{\setminus s}\right). \tag{15.10}
$$

The result stated in eq. (15.10) is fundamental. It represents the general free-form mean-field variational inference strategy to obtain variational densities over unobserved variables in light of data while maximizing the lower bound to the log model evidence. We thus refer to eq. (15.10) as the free-form variational inference theorem. In the following, we provide two proofs that (15.10) maximizes the variational free energy with respect to q(ϑs). The first proof is constructive in that it uses the constrained Gateaux derivative approach from variational calculus (Chapter 7) to generate the solution (15.10). The second proof eschews recourse to variational calculus techniques and uses a reformulation of the variational free energy in terms of a KL divergence involving the right-hand side of (15.10) (cf. Tzikas et al. (2008)).



Proof I of (15.10)

We first note that the aim of the free-form mean-field variational inference approach is to approximate the log marginal likelihood ln p(y) by iteratively maximizing its lower bound F (q(ϑs), q(ϑ\s)) with respect to the arguments q(ϑs) and q(ϑ\s). For iterations i = 1, 2, . . ., during the first maximization of finding q(i+1)(ϑs), q(i)(ϑ\s) is treated as a constant, while during the second maximization of finding q(i+1)(ϑ\s), q(i+1)(ϑs) is treated as a constant. However, as ϑs and ϑ\s may be used interchangeably, we here concern ourselves only with the case of maximizing F (q(ϑs), q(ϑ\s)) with respect to q(ϑs). To obtain an expression for q(i+1)(ϑs), we thus consider the variational free energy functional

$$
F\left(q(\vartheta_s), q^{(i)}(\vartheta_{\setminus s})\right) = \iint q(\vartheta_s)\, q^{(i)}(\vartheta_{\setminus s}) \ln\left(\frac{p(y, \vartheta)}{q(\vartheta_s)\, q^{(i)}(\vartheta_{\setminus s})}\right) d\vartheta_s\, d\vartheta_{\setminus s}, \quad \text{where} \quad \int q(\vartheta_s)\, d\vartheta_s = 1. \tag{15.11}
$$

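In this case, the extended Lagrange function (cf. Chapter 7) is given by

$$
F\left(q(\vartheta_s), q^{(i)}(\vartheta_{\setminus s})\right) = \int q(\vartheta_s)\, q^{(i)}(\vartheta_{\setminus s}) \ln\left(\frac{p(y, \vartheta)}{q(\vartheta_s)\, q^{(i)}(\vartheta_{\setminus s})}\right) d\vartheta_{\setminus s} + \lambda q(\vartheta_s) - \lambda. \tag{15.12}
$$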

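Furthermore, the Gateaux derivative δF(q(ϑs), q(i)(ϑ\s)) is given by the derivative of F with respect to q(ϑs), because F is not a function of q′(ϑs). One thus obtains

$$
\begin{aligned}
\delta F\left(q(\vartheta_s), q^{(i)}(\vartheta_{\setminus s})\right)
&= \frac{\partial}{\partial q(\vartheta_s)} \left( \int q(\vartheta_s)\, q^{(i)}(\vartheta_{\setminus s}) \ln\left(\frac{p(y, \vartheta)}{q(\vartheta_s)\, q^{(i)}(\vartheta_{\setminus s})}\right) d\vartheta_{\setminus s} + \lambda q(\vartheta_s) - \lambda \right) \\
&= \frac{\partial}{\partial q(\vartheta_s)} \left( \int q(\vartheta_s)\, q^{(i)}(\vartheta_{\setminus s}) \ln p(y, \vartheta)\, d\vartheta_{\setminus s} - \int q(\vartheta_s)\, q^{(i)}(\vartheta_{\setminus s}) \ln\left(q(\vartheta_s)\, q^{(i)}(\vartheta_{\setminus s})\right) d\vartheta_{\setminus s} + \lambda q(\vartheta_s) - \lambda \right) \\
&= \frac{\partial}{\partial q(\vartheta_s)} \left( q(\vartheta_s) \int q^{(i)}(\vartheta_{\setminus s}) \ln p(y, \vartheta)\, d\vartheta_{\setminus s} - \int q(\vartheta_s)\, q^{(i)}(\vartheta_{\setminus s}) \ln q(\vartheta_s)\, d\vartheta_{\setminus s} - \int q(\vartheta_s)\, q^{(i)}(\vartheta_{\setminus s}) \ln q^{(i)}(\vartheta_{\setminus s})\, d\vartheta_{\setminus s} + \lambda q(\vartheta_s) - \lambda \right) \\
&= \frac{\partial}{\partial q(\vartheta_s)} \left( q(\vartheta_s) \int q^{(i)}(\vartheta_{\setminus s}) \ln p(y, \vartheta)\, d\vartheta_{\setminus s} - q(\vartheta_s) \ln q(\vartheta_s) \int q^{(i)}(\vartheta_{\setminus s})\, d\vartheta_{\setminus s} - q(\vartheta_s) \int q^{(i)}(\vartheta_{\setminus s}) \ln q^{(i)}(\vartheta_{\setminus s})\, d\vartheta_{\setminus s} + \lambda q(\vartheta_s) - \lambda \right) \\
&= \int q^{(i)}(\vartheta_{\setminus s}) \ln p(y, \vartheta)\, d\vartheta_{\setminus s} - \ln q(\vartheta_s) - \int q^{(i)}(\vartheta_{\setminus s}) \ln q^{(i)}(\vartheta_{\setminus s})\, d\vartheta_{\setminus s} + \lambda \\
&= \int q^{(i)}(\vartheta_{\setminus s}) \ln p(y, \vartheta)\, d\vartheta_{\setminus s} - \ln q(\vartheta_s) + c,
\end{aligned} \tag{15.13}
$$

where

$$
c := \lambda - \int q^{(i)}(\vartheta_{\setminus s}) \ln q^{(i)}(\vartheta_{\setminus s})\, d\vartheta_{\setminus s}. \tag{15.14}
$$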

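Setting the Gateaux derivative to zero thus yields

$$
\ln q^{(i+1)}(\vartheta_s) = \int q^{(i)}(\vartheta_{\setminus s}) \ln p(y, \vartheta)\, d\vartheta_{\setminus s} + c. \tag{15.15}
$$

Taking the exponential and subsuming the multiplicative constant under the proportionality factor then yields the free-form variational inference theorem for mean-field approximations

$$
q^{(i+1)}(\vartheta_s) \propto \exp\left(\int q^{(i)}(\vartheta_{\setminus s}) \ln p(y, \vartheta)\, d\vartheta_{\setminus s}\right). \tag{15.16}
$$

□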

Proof II of (15.10)

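Consider maximization of the variational free energy with respect to q(ϑs):

$$
\begin{aligned}
F(q(\vartheta_s) q(\vartheta_{\setminus s}))
&= \iint q(\vartheta_s) q(\vartheta_{\setminus s}) \ln\left(\frac{p(y, \vartheta)}{q(\vartheta_s) q(\vartheta_{\setminus s})}\right) d\vartheta_s\, d\vartheta_{\setminus s} \\
&= \iint q(\vartheta_s) q(\vartheta_{\setminus s}) \left(\ln p(y, \vartheta) - \ln q(\vartheta_s) - \ln q(\vartheta_{\setminus s})\right) d\vartheta_s\, d\vartheta_{\setminus s} \\
&= \iint q(\vartheta_s) q(\vartheta_{\setminus s}) \left(\ln p(y, \vartheta) - \ln q(\vartheta_s)\right) d\vartheta_{\setminus s}\, d\vartheta_s - \iint q(\vartheta_s) q(\vartheta_{\setminus s}) \ln q(\vartheta_{\setminus s})\, d\vartheta_s\, d\vartheta_{\setminus s} \\
&= \iint q(\vartheta_s) q(\vartheta_{\setminus s}) \left(\ln p(y, \vartheta) - \ln q(\vartheta_s)\right) d\vartheta_{\setminus s}\, d\vartheta_s - \int q(\vartheta_{\setminus s}) \ln q(\vartheta_{\setminus s}) \left(\int q(\vartheta_s)\, d\vartheta_s\right) d\vartheta_{\setminus s} \\
&= \iint q(\vartheta_s) q(\vartheta_{\setminus s}) \ln p(y, \vartheta)\, d\vartheta_{\setminus s}\, d\vartheta_s - \iint q(\vartheta_s) q(\vartheta_{\setminus s}) \ln q(\vartheta_s)\, d\vartheta_{\setminus s}\, d\vartheta_s - \int q(\vartheta_{\setminus s}) \ln q(\vartheta_{\setminus s}) \cdot 1\, d\vartheta_{\setminus s} \\
&= \int q(\vartheta_s) \left(\int q(\vartheta_{\setminus s}) \ln p(y, \vartheta)\, d\vartheta_{\setminus s}\right) d\vartheta_s - \int q(\vartheta_s) \ln q(\vartheta_s) \left(\int q(\vartheta_{\setminus s})\, d\vartheta_{\setminus s}\right) d\vartheta_s - c \\
&= \int q(\vartheta_s) \left(\int q(\vartheta_{\setminus s}) \ln p(y, \vartheta)\, d\vartheta_{\setminus s}\right) d\vartheta_s - \int q(\vartheta_s) \ln q(\vartheta_s) \cdot 1\, d\vartheta_s - c \\
&= \int q(\vartheta_s) \left(\int q(\vartheta_{\setminus s}) \ln p(y, \vartheta)\, d\vartheta_{\setminus s}\right) d\vartheta_s - \int q(\vartheta_s) \ln q(\vartheta_s)\, d\vartheta_s - c \\
&= \int q(\vartheta_s) \ln\left(\exp\left(\int q(\vartheta_{\setminus s}) \ln p(y, \vartheta)\, d\vartheta_{\setminus s}\right)\right) d\vartheta_s - \int q(\vartheta_s) \ln q(\vartheta_s)\, d\vartheta_s - c \\
&= \int q(\vartheta_s) \left(\ln\left(\exp\left(\int q(\vartheta_{\setminus s}) \ln p(y, \vartheta)\, d\vartheta_{\setminus s}\right)\right) - \ln q(\vartheta_s)\right) d\vartheta_s - c \\
&= \int q(\vartheta_s) \ln\left(\frac{\exp\left(\int q(\vartheta_{\setminus s}) \ln p(y, \vartheta)\, d\vartheta_{\setminus s}\right)}{q(\vartheta_s)}\right) d\vartheta_s - c \\
&= -\int q(\vartheta_s) \ln\left(\frac{q(\vartheta_s)}{\exp\left(\int q(\vartheta_{\setminus s}) \ln p(y, \vartheta)\, d\vartheta_{\setminus s}\right)}\right) d\vartheta_s - c \\
&= -\mathrm{KL}\left(q(\vartheta_s)\,\Big\|\,\exp\left(\int q(\vartheta_{\setminus s}) \ln p(y, \vartheta)\, d\vartheta_{\setminus s}\right)\right) - c,
\end{aligned} \tag{15.17}
$$

where c denotes the integral ∫ q(ϑ\s) ln q(ϑ\s) dϑ\s, which does not depend on q(ϑs). Maximizing the negative KL divergence by setting, up to a normalization constant,

$$
q(\vartheta_s) = \exp\left(\int q(\vartheta_{\setminus s}) \ln p(y, \vartheta)\, d\vartheta_{\setminus s}\right) \tag{15.18}
$$

thus maximizes the variational free energy. □

Figure 15.2. The log model evidence decomposition visualized in Figure 15.1 is exploited in numerical algorithms for free-form VB inference: based on a mean-field approximation q(ϑ) = q(ϑs)q(ϑ\s), the variational free energy can be maximized in a coordinate-wise fashion. Maximizing the variational free energy in turn has two implications: it decreases the KL divergence between q(ϑ) and the true posterior p(ϑ|y) and renders the variational free energy a closer approximation to the log model evidence. This holds true because the log model evidence for a given observation y is constant (represented by the constant length of the vertical bar) and the KL divergence is non-negative.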

Based on the free-form variational inference theorem, algorithmic implementations of variational inference can use an iterative coordinate-wise variational free energy ascent. For iterations i = 0, 1, 2, . . ., this strategy proceeds as follows. The ascent starts by initializing q(0)(ϑs) and q(0)(ϑ\s), commonly by equating them to the prior distributions over ϑs and ϑ\s, respectively. Based on (15.10) it then continues by maximizing the variational free energy F (q(i)(ϑs), q(i)(ϑ\s)), first with respect to the density q(i)(ϑs), given q(i)(ϑ\s), yielding the updated density q(i+1)(ϑs). Then, by exchanging the labelling of ϑs and ϑ\s in eq. (15.9), the ascent continues by maximizing the variational free energy with respect to the density q(i)(ϑ\s), given q(i+1)(ϑs), yielding q(i+1)(ϑ\s). This procedure is then iterated until convergence.

Commonly, the initialization step sets the variational density q(0)(ϑ) to the prior distribution p(ϑ). This defines the starting point of the iterative procedure as representative of the knowledge about the unknown variables before observed data is taken into account. Further, this choice often enables the use of the well-known benefits of parameterized conjugate priors in the context of variational inference. The initialization of the variational density in terms of the prior distribution, and the subsequent optimization of the variational densities should, however, not be confused with an empirical Bayesian approach, in which priors themselves are learned from the data: on each iteration of the variational inference algorithm sketched above, the variational density corresponds to the approximate posterior distribution, not an updated prior distribution. An empirical Bayesian extension of the variational inference algorithm, on the other hand, would correspond to a variation of the prior distribution (specifying the variational inference algorithm starting conditions) after convergence with the aim of increasing the log model evidence per se. The variational inference algorithm as described here merely increases the lower bound to the fixed log model evidence, which is determined by the choice of the prior p(ϑ) and likelihood p(y|ϑ), i.e., the generative model p(y, ϑ). To summarize, a general iterative algorithm for free-form mean-field variational inference is outlined below. This iterative scheme shares some similarities with expectation-maximization algorithms for models comprising unobserved variables (Dempster et al., 1977; Wu, 1983). In fact, variational inference can be viewed as a generalization of expectation-maximization algorithms for maximum likelihood estimation to Bayesian inference. For the general linear model, this line of thought is further investigated in Chapter 22.

An iterative free-form mean-field variational inference algorithm.

Initialization

0. Initialize q(0)(ϑs) and q(0)(ϑ\s) appropriately, e.g., by setting q(0)(ϑs) ∝ p(ϑs) and q(0)(ϑ\s) ∝ p(ϑ\s)

Until convergence

1. Set q(i+1)(ϑs) proportional to exp(∫ q(i)(ϑ\s) ln p(y, ϑ) dϑ\s)

2. Set q(i+1)(ϑ\s) proportional to exp(∫ q(i+1)(ϑs) ln p(y, ϑ) dϑs)
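The two update steps of the algorithm can be sketched for a model in which both variable groups are discrete, so that all integrals become sums over table entries. This is an illustrative sketch only: the joint table values are hypothetical, and the variational densities are initialized uniformly rather than at a prior, which is an assumption made here for brevity.

```python
import math

# Hypothetical table of p(y, theta_s = j, theta_rest = k) at the observed y.
joint = [[0.10, 0.02, 0.01],
         [0.03, 0.12, 0.04]]
n_s, n_r = len(joint), len(joint[0])


def normalize(log_w):
    """Exponentiate log weights and rescale them so that they sum to one."""
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    z = sum(w)
    return [v / z for v in w]


def free_energy(q_s, q_r):
    """Discrete free energy: sum_{j,k} q_s[j] q_r[k] ln(p(y, j, k) / (q_s[j] q_r[k]))."""
    total = 0.0
    for j in range(n_s):
        for k in range(n_r):
            w = q_s[j] * q_r[k]
            if w > 0:
                total += w * math.log(joint[j][k] / w)
    return total


# Step 0 (assumption: uniform initialization instead of the prior).
q_s = [1.0 / n_s] * n_s
q_r = [1.0 / n_r] * n_r
history = [free_energy(q_s, q_r)]

for _ in range(20):
    # Step 1: q(theta_s) proportional to exp(E_{q(theta_rest)}[ln p(y, theta)]).
    q_s = normalize([sum(q_r[k] * math.log(joint[j][k]) for k in range(n_r))
                     for j in range(n_s)])
    # Step 2: q(theta_rest) proportional to exp(E_{q(theta_s)}[ln p(y, theta)]).
    q_r = normalize([sum(q_s[j] * math.log(joint[j][k]) for j in range(n_s))
                     for k in range(n_r)])
    history.append(free_energy(q_s, q_r))

log_evidence = math.log(sum(sum(row) for row in joint))
print(history[0], history[-1], log_evidence)
```

The recorded free energy values should be non-decreasing and stay below the log model evidence, in line with (15.7). Because the joint table does not factorize, a gap between the final free energy and the log evidence is expected.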

Free-form mean-field variational inference for a Gaussian model

Probabilistic model

To demonstrate the free-form mean-field variational inference, we consider the estimation of the expectation and precision parameter of a univariate Gaussian distribution based on n independently and identically distributed data realizations yi, i = 1, ..., n (Penny and Roberts, 2000; Bishop, 2006; Chappell et al., 2008; Murphy, 2012). To this end, we assume that the yi, i = 1, ..., n are generated by a univariate Gaussian distribution with true, but unknown, expectation parameter µ ∈ R and precision parameter λ > 0. We denote the concatenation of the data realizations by y := (y1, . . . , yn)T ∈ Rn. To recapitulate, the aim of variational inference is, based on appropriately chosen prior densities, first, to obtain posterior densities that quantify the remaining uncertainty over the true, but unknown, unobservable variables given the observable variables, and second, to obtain an approximation to the log model evidence, i.e., the log probability of the data given the probabilistic model. In the current example, the probabilistic model takes the form of the joint probability density function,

$$
p(y, \mu, \lambda) = p(\mu, \lambda)\, p(y|\mu, \lambda) = p(\mu, \lambda) \prod_{i=1}^{n} p(y_i|\mu, \lambda). \tag{15.19}
$$

A possible choice for a prior joint density of the unobservable variables is given by the product of a univariate Gaussian density for µ and a Gamma density for λ, i.e.,

$$
p(y, \mu, \lambda) := p(\mu)\, p(\lambda)\, p(y|\mu, \lambda) := N\left(\mu; m_\mu, s_\mu^2\right) G\left(\lambda; a_\lambda, b_\lambda\right) \prod_{i=1}^{n} N\left(y_i; \mu, \lambda^{-1}\right). \tag{15.20}
$$

Note that many other prior densities are conceivable. In fact, a more commonly discussed scenario is the case of a non-independent Gaussian-Gamma prior density (Bishop, 2006; Murphy, 2012). With respect to the factorized prior density considered here, the non-independent Gaussian-Gamma prior density has the advantage that it belongs to the conjugate-exponential class and allows for the derivation of an exact analytical solution for the form of the posterior distribution. On the other hand, it is not clear in which applied scenarios a dependency of the prior over the expectation parameter µ on the prior density over λ is in fact a reasonable assumption. We here thus focus on the factorized prior density, as it corresponds to a more parsimonious choice than its non-factorized counterpart. Furthermore, it demonstrates how variational inference can be used to derive posterior density approximations in model scenarios where no analytical treatment is possible.

Variational inference

For the posterior density, we consider the mean-field approximation

p(µ, λ|y) ≈ q(µ)q(λ). (15.21)

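Recall that the free-form mean-field variational inference theorem states that the variational density over the unobservable variable partition ϑs is given by

$$
q(\vartheta_s) \propto \exp\left(\int q(\vartheta_{\setminus s}) \ln p(y, \vartheta)\, d\vartheta_{\setminus s}\right). \tag{15.22}
$$

For the current example,

$$
q(\mu, \lambda) := q(\mu)\, q(\lambda), \tag{15.23}
$$

and thus

$$
q(\mu) = c_\mu \exp\left(\int q(\lambda) \ln p(y, \mu, \lambda)\, d\lambda\right) \tag{15.24}
$$

and

$$
q(\lambda) = c_\lambda \exp\left(\int q(\mu) \ln p(y, \mu, \lambda)\, d\mu\right), \tag{15.25}
$$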

where cµ and cλ denote proportionality constants that render the proportionality statement in (15.22) equalities in (15.24) and (15.25), respectively. In the following, we shall derive an iterative scheme based on the equations above. For this purpose, it is first helpful to explicitly denote the iterative nature of the approach by denoting the variational densities q(µ) and q(λ) as q(i)(µ) and q(i)(λ). This also stresses the fact that in eqs. (15.24) and (15.25), the left-hand variational densities refer to their state at the (i + 1)th algorithm iteration, while the right-hand variational densities refer to their state at the ith algorithm iteration. Second, as we are dealing with densities from the exponential family, it is helpful to log transform both eqs. (15.24) and (15.25). For i = 0, 1, 2, ... eqs. (15.24) and (15.25) may thus be rewritten as

$$
\ln q^{(i+1)}(\mu) := \int q^{(i)}(\lambda) \ln p(y, \mu, \lambda)\, d\lambda + \ln c_\mu \tag{15.26}
$$

and

$$
\ln q^{(i+1)}(\lambda) := \int q^{(i+1)}(\mu) \ln p(y, \mu, \lambda)\, d\mu + \ln c_\lambda. \tag{15.27}
$$

To obtain an expression for q(i+1)(µ), we first note that we can express eq. (15.26) as

$$
\ln q^{(i+1)}(\mu) = -\frac{1}{2} \langle\lambda\rangle_{q^{(i)}(\lambda)} \sum_{i=1}^{n} (y_i - \mu)^2 - \frac{1}{2 s_\mu^2} (\mu - m_\mu)^2 + c_\mu \tag{15.28}
$$

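where cµ denotes a constant including additive terms devoid of µ. Based on (15.28) and using the completing-the-square theorem for Gaussian distributions (cf. Chapter 10), we can then infer that q(i+1)(µ) is proportional to a Gaussian density

$$
q^{(i+1)}(\mu) = N\left(\mu; m_\mu^{(i+1)}, s_\mu^{2(i+1)}\right), \tag{15.29}
$$

with parameters

$$
m_\mu^{(i+1)} = \frac{m_\mu + s_\mu^2 \langle\lambda\rangle_{q^{(i)}(\lambda)} \sum_{i=1}^{n} y_i}{1 + n s_\mu^2 \langle\lambda\rangle_{q^{(i)}(\lambda)}} \quad \text{and} \quad s_\mu^{2(i+1)} = \frac{s_\mu^2}{1 + n s_\mu^2 \langle\lambda\rangle_{q^{(i)}(\lambda)}}. \tag{15.30}
$$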

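Next, to obtain an expression for q(i+1)(λ), we first note that we can express eq. (15.27) as

$$
\ln q^{(i+1)}(\lambda) = \frac{n}{2} \ln\lambda - \frac{\lambda}{2} \left\langle \sum_{i=1}^{n} (y_i - \mu)^2 \right\rangle_{q^{(i+1)}(\mu)} + (a_\lambda - 1) \ln\lambda - \frac{\lambda}{b_\lambda} + c_\lambda, \tag{15.31}
$$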



Figure 15.3. Free-form variational inference for the Gaussian. (A) The panels depict the true underlying data model p(y|µ, λ), for µ = 1 and λ = 5 as solid line and n = 10 samples yi from this model on the abscissa as red dots. Based on these samples, on each iteration of the VB algorithm, a variational approximation q(µ)q(λ) is updated. The first panel of (A) shows the univariate Gaussian model as approximated by the expectations over q(µ) and q(λ) as dashed line. The second panel of (A) shows the effect of the update of the density q(µ) on the first iteration of the algorithm. As q(µ) governs the mean of the univariate Gaussian, the dashed Gaussian is now centered on the mean of the data points. The third panel of (A) shows the effect of the update of the density q(λ) on the first iteration of the algorithm. As q(λ) governs the precision of the univariate Gaussian model, the dashed Gaussian updates its variance based on the data variability. The fourth and fifth panels of (A) show the corresponding two steps on the 8th iteration. (B) The panels of (B) show the factorized variational density q(µ)q(λ) over VB algorithm iterations. The white dot in each panel indicates the true underlying parameters that gave rise to the observed data. Note that these parameters were not sampled from the prior density, but that the prior density embeds the initial uncertainty about this true, but unknown, parameter value before the observation of any data. The ordering of the panels is as in (A). (C) The panel shows the evolution of the variational free energy over iterations of the VB algorithm. For the current model and data set, the variational free energy levels off from approximately 4 iterations onwards. In the variational inference framework, the final value of the variational free energy after convergence of the algorithm corresponds to the approximation to the log model evidence ln p(y).

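where cλ denotes a constant including additive terms devoid of λ. Expressing the right-hand side of (15.31) in multiplicative terms involving λ and taking exponentials, it then follows that q(i+1)(λ) is proportional to a Gamma density

$$
G\left(\lambda; a_\lambda^{(i+1)}, b_\lambda^{(i+1)}\right) \tag{15.32}
$$

with parameters

$$
a_\lambda^{(i+1)} = \frac{n}{2} + a_\lambda \quad \text{and} \quad b_\lambda^{(i+1)} = \left( \frac{1}{b_\lambda} + \frac{1}{2} \left( \sum_{i=1}^{n} y_i^2 - 2 \sum_{i=1}^{n} y_i\, m_\mu^{(i+1)} + n \left( \left(m_\mu^{(i+1)}\right)^2 + s_\mu^{2(i+1)} \right) \right) \right)^{-1}. \tag{15.33}
$$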
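The update equations (15.30) and (15.33) translate directly into a short iterative routine. The sketch below is illustrative rather than a reference implementation: the function name `cavi_gaussian`, the prior parameter values, and the random seed are assumptions, and the Gamma density is taken in its shape and scale form so that the expectation of λ equals the product of its two parameters.

```python
import random


def cavi_gaussian(y, m_mu, s2_mu, a_lam, b_lam, n_iter=10):
    """Iterate the updates (15.30) and (15.33), starting from the prior."""
    n = len(y)
    sum_y = sum(y)
    sum_y2 = sum(v * v for v in y)
    a_i, b_i = a_lam, b_lam        # q^(0)(lambda) is set to the prior p(lambda)
    m_i, s2_i = m_mu, s2_mu        # q^(0)(mu) is set to the prior p(mu)
    for _ in range(n_iter):
        # Expectation of lambda under Gamma(shape a_i, scale b_i).
        e_lam = a_i * b_i
        # Update of q(mu), eq. (15.30).
        m_i = (m_mu + s2_mu * e_lam * sum_y) / (1 + n * s2_mu * e_lam)
        s2_i = s2_mu / (1 + n * s2_mu * e_lam)
        # Update of q(lambda), eq. (15.33), using <mu> = m_i and <mu^2> = m_i^2 + s2_i.
        a_i = n / 2 + a_lam
        b_i = 1 / (1 / b_lam
                   + 0.5 * (sum_y2 - 2 * sum_y * m_i + n * (m_i ** 2 + s2_i)))
    return m_i, s2_i, a_i, b_i


# Simulated data with the values quoted in Figure 15.3 (mu = 1, lambda = 5, n = 10).
random.seed(0)
true_mu, true_lam = 1.0, 5.0
y = [random.gauss(true_mu, true_lam ** -0.5) for _ in range(10)]

# Hypothetical prior parameters, chosen only for illustration.
m, s2, a, b = cavi_gaussian(y, m_mu=0.0, s2_mu=10.0, a_lam=1.0, b_lam=1.0)
print("q(mu): mean", m, "variance", s2)
print("q(lambda): expectation", a * b)
```

With a broad prior on µ, the variational mean should end up close to the sample mean of the data, and the updates typically stabilize within a few iterations.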

A number of things are noteworthy. First, the Gaussian and Gamma density forms of the variational densities q(i+1)(µ) and q(i+1)(λ) follow directly from the form of the probabilistic model eq. (15.20) and the free-form mean-field theorem for variational inference. In other words, the functional forms of the densities q(i+1)(µ) and q(i+1)(λ) are not predetermined, but automatically fall out of the variational inference approach, hence free-form variational inference. Second, if the variational density q(0)(λ) is initialized using the prior density p(λ), the expected value $\langle\lambda\rangle_{q^{(i)}(\lambda)}$ is determined by the prior parameters $a_\lambda$ and $b_\lambda$ for i = 0, and by the variational density parameters $a_\lambda^{(i)}$ and $b_\lambda^{(i)}$ for i = 1, 2, .... In other words, the parameter update equations (15.30) and (15.33) are fully determined in terms of the prior density parameters $a_\lambda, b_\lambda, m_\mu, s_\mu^2$, the data realizations y1, y2, ..., yn, and the variational density parameters $m_\mu^{(i)}, s_\mu^{2(i)}, a_\lambda^{(i)}$ and $b_\lambda^{(i)}$. Third, an explicit form of the variational free energy is not required for its maximization by means of the variational densities q(i)(µ) and q(i)(λ). It is nevertheless useful to evaluate it in order to monitor the progression of the iterative algorithm. For the current example, it takes the form



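$$
\begin{aligned}
&F : \mathbb{R}^n \times \mathbb{R} \times \mathbb{R}_{>0} \times \mathbb{R}_{>0} \times \mathbb{R}^2_{>0} \to \mathbb{R}, \\
&\left(y, m_\mu^{(i)}, s_\mu^{2(i)}, a_\lambda^{(i)}, b_\lambda^{(i)}, m_\mu, s_\mu^2, a_\lambda, b_\lambda\right) \mapsto F\left(y, m_\mu^{(i)}, s_\mu^{2(i)}, a_\lambda^{(i)}, b_\lambda^{(i)}, m_\mu, s_\mu^2, a_\lambda, b_\lambda\right) := \\
&\qquad \frac{1}{2} \left( \psi\left(a_\lambda^{(i)}\right) + \ln\left(b_\lambda^{(i)}\right) \right) - \frac{1}{2} a_\lambda^{(i)} b_\lambda^{(i)} \left( \sum_{i=1}^{n} y_i^2 + n \left( \left(m_\mu^{(i)}\right)^2 + s_\mu^{2(i)} \right) - 2 m_\mu^{(i)} \sum_{i=1}^{n} y_i \right) \\
&\qquad + \frac{1}{2} \ln \frac{s_\mu^2}{s_\mu^{2(i)}} + \frac{m_\mu^2 + \left(m_\mu^{(i)}\right)^2 + s_\mu^{2(i)} - 2 m_\mu^{(i)} m_\mu}{2 s_\mu^2} - \frac{1}{2} \\
&\qquad + \left(b_\lambda^{(i)} - 1\right) \psi\left(b_\lambda^{(i)}\right) - \ln a_\lambda^{(i)} - b_\lambda^{(i)} - \ln\Gamma\left(b_\lambda^{(i)}\right) + \ln\Gamma\left(b_\lambda\right) \\
&\qquad + b_\lambda \ln a_\lambda - \left(b_\lambda - 1\right) \left( \psi\left(b_\lambda^{(i)}\right) + \ln a_\lambda^{(i)} \right) + \frac{a_\lambda^{(i)} b_\lambda^{(i)}}{a_\lambda},
\end{aligned} \tag{15.34}
$$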

where Γ and ψ denote the Gamma and digamma functions, respectively. We visualize the free-form mean-field variational inference approach for the expectation and precision parameter of a univariate Gaussian in Figure 15.3.

Proof of eqs. (15.28), (15.29) and (15.30)

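We first note that with the probabilistic model (15.20), eq. (15.26) can be rewritten as

$$
\begin{aligned}
\ln q^{(i+1)}(\mu) &= \left\langle \ln p(y, \mu, \lambda) \right\rangle_{q^{(i)}(\lambda)} + \ln c_\mu \\
&= \left\langle \ln\left( \prod_{i=1}^{n} p(y_i|\mu, \lambda)\, p(\mu)\, p(\lambda) \right) \right\rangle_{q^{(i)}(\lambda)} + \ln c_\mu \\
&= \left\langle \sum_{i=1}^{n} \ln p(y_i|\mu, \lambda) \right\rangle_{q^{(i)}(\lambda)} + \left\langle \ln p(\mu) \right\rangle_{q^{(i)}(\lambda)} + \left\langle \ln p(\lambda) \right\rangle_{q^{(i)}(\lambda)} + \ln c_\mu.
\end{aligned} \tag{15.35}
$$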

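Substitution of the example-specific densities (cf. eq. (15.20))

$$
p(\mu) = N\left(\mu; m_\mu, s_\mu^2\right), \quad p(\lambda) = G\left(\lambda; a_\lambda, b_\lambda\right), \quad \text{and} \quad p(y_i|\mu, \lambda) = N\left(y_i; \mu, \lambda^{-1}\right) \tag{15.36}
$$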

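then yields

$$
\begin{aligned}
\ln q^{(i+1)}(\mu)
&= \left\langle \sum_{i=1}^{n} \ln\left( \lambda^{\frac{1}{2}} (2\pi)^{-\frac{1}{2}} \exp\left( -\frac{\lambda}{2} (y_i - \mu)^2 \right) \right) \right\rangle_{q^{(i)}(\lambda)} \\
&\quad + \left\langle \ln\left( \left(2\pi s_\mu^2\right)^{-\frac{1}{2}} \exp\left( -\frac{1}{2 s_\mu^2} (\mu - m_\mu)^2 \right) \right) \right\rangle_{q^{(i)}(\lambda)} \\
&\quad + \left\langle \ln\left( \frac{1}{\Gamma(a_\lambda)} \frac{1}{b_\lambda^{a_\lambda}} \lambda^{a_\lambda - 1} \exp\left( -\frac{\lambda}{b_\lambda} \right) \right) \right\rangle_{q^{(i)}(\lambda)} + \ln c_\mu \\
&= \left\langle \frac{n}{2} \ln\lambda - \frac{n}{2} \ln 2\pi - \frac{\lambda}{2} \sum_{i=1}^{n} (y_i - \mu)^2 \right\rangle_{q^{(i)}(\lambda)} \\
&\quad + \left\langle -\frac{1}{2} \ln s_\mu^2 - \frac{1}{2} \ln 2\pi - \frac{1}{2 s_\mu^2} (\mu - m_\mu)^2 \right\rangle_{q^{(i)}(\lambda)} \\
&\quad + \left\langle -\ln\Gamma(a_\lambda) - a_\lambda \ln(b_\lambda) + (a_\lambda - 1) \ln\lambda - \frac{\lambda}{b_\lambda} \right\rangle_{q^{(i)}(\lambda)} + \ln c_\mu.
\end{aligned} \tag{15.37}
$$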

Grouping all terms devoid of µ in a constant cµ and accounting for the linearity of expectations then results in

$$
\ln q^{(i+1)}(\mu) = -\frac{1}{2} \langle\lambda\rangle_{q^{(i)}(\lambda)} \sum_{i=1}^{n} (y_i - \mu)^2 - \frac{1}{2 s_\mu^2} (\mu - m_\mu)^2 + c_\mu. \tag{15.38}
$$

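Next, to use the completing-the-square theorem for inferring that q(i+1)(µ) conforms to a Gaussian density with the parameters of eq. (15.29), we first rewrite the right-hand side of eq. (15.28) as a quadratic expression in µ. We have

$$
\begin{aligned}
\ln q^{(i+1)}(\mu)
&= -\frac{1}{2} \langle\lambda\rangle_{q^{(i)}(\lambda)} \sum_{i=1}^{n} (y_i - \mu)^2 - \frac{1}{2} \frac{(\mu - m_\mu)^2}{s_\mu^2} + c_\mu \\
&= -\frac{1}{2} \langle\lambda\rangle_{q^{(i)}(\lambda)} \left( \sum_{i=1}^{n} y_i^2 - 2\mu \sum_{i=1}^{n} y_i + n\mu^2 \right) - \frac{1}{2} \frac{\mu^2 - 2\mu m_\mu + m_\mu^2}{s_\mu^2} + c_\mu \\
&= -\frac{1}{2} \left( \langle\lambda\rangle_{q^{(i)}(\lambda)} \sum_{i=1}^{n} y_i^2 - \langle\lambda\rangle_{q^{(i)}(\lambda)} 2\mu \sum_{i=1}^{n} y_i + \langle\lambda\rangle_{q^{(i)}(\lambda)} n\mu^2 + \frac{1}{s_\mu^2} \mu^2 - \frac{2 m_\mu}{s_\mu^2} \mu + \frac{1}{s_\mu^2} m_\mu^2 \right) + c_\mu \\
&= -\frac{1}{2} \left( \langle\lambda\rangle_{q^{(i)}(\lambda)} n\mu^2 + \frac{1}{s_\mu^2} \mu^2 - \langle\lambda\rangle_{q^{(i)}(\lambda)} 2\mu \sum_{i=1}^{n} y_i - \frac{2 m_\mu}{s_\mu^2} \mu + \langle\lambda\rangle_{q^{(i)}(\lambda)} \sum_{i=1}^{n} y_i^2 + \frac{1}{s_\mu^2} m_\mu^2 \right) + c_\mu \\
&= -\frac{1}{2} \left( \left( \langle\lambda\rangle_{q^{(i)}(\lambda)} n + \frac{1}{s_\mu^2} \right) \mu^2 - \left( 2 \langle\lambda\rangle_{q^{(i)}(\lambda)} \sum_{i=1}^{n} y_i + \frac{2 m_\mu}{s_\mu^2} \right) \mu + \langle\lambda\rangle_{q^{(i)}(\lambda)} \sum_{i=1}^{n} y_i^2 + \frac{1}{s_\mu^2} m_\mu^2 \right) + c_\mu.
\end{aligned} \tag{15.39}
$$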

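Resolving brackets and grouping terms devoid of µ with the constant cµ, resulting in the new constant c̃µ, and re-expressing the coefficient of µ2 then results in

$$
\begin{aligned}
\ln q^{(i+1)}(\mu)
&= -\frac{1}{2} \left( \langle\lambda\rangle_{q^{(i)}(\lambda)} n + \frac{1}{s_\mu^2} \right) \mu^2 + \left( \langle\lambda\rangle_{q^{(i)}(\lambda)} \sum_{i=1}^{n} y_i + \frac{m_\mu}{s_\mu^2} \right) \mu + \tilde{c}_\mu \\
&= -\frac{1}{2} \left( \frac{\langle\lambda\rangle_{q^{(i)}(\lambda)} n s_\mu^2}{s_\mu^2} + \frac{1}{s_\mu^2} \right) \mu^2 + \left( \langle\lambda\rangle_{q^{(i)}(\lambda)} \sum_{i=1}^{n} y_i + \frac{m_\mu}{s_\mu^2} \right) \mu + \tilde{c}_\mu \\
&= -\frac{1}{2} \left( \frac{1 + n s_\mu^2 \langle\lambda\rangle_{q^{(i)}(\lambda)}}{s_\mu^2} \right) \mu^2 + \left( \langle\lambda\rangle_{q^{(i)}(\lambda)} \sum_{i=1}^{n} y_i + \frac{m_\mu}{s_\mu^2} \right) \mu + \tilde{c}_\mu.
\end{aligned} \tag{15.40}
$$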

Using the completing-the-square theorem (cf. Chapter 10) in the form

$$
\exp\left( -\frac{1}{2} a x^2 + b x \right) \propto N\left(x; a^{-1} b, a^{-1}\right) \tag{15.41}
$$
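The proportionality in (15.41) can be checked numerically: the ratio between the left-hand side and the Gaussian density with expectation b/a and variance 1/a should be the same at every x. The sketch below uses hypothetical coefficient values and an arbitrary set of evaluation points.

```python
import math


def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density with expectation `mean` and variance `var`."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


a, b = 4.0, 6.0                 # hypothetical coefficients with a > 0
mean, var = b / a, 1 / a        # Gaussian parameters read off from (15.41)

xs = [-2.0, 0.0, 1.5, 3.0]
ratios = [math.exp(-0.5 * a * x * x + b * x) / gaussian_pdf(x, mean, var)
          for x in xs]

# A constant ratio confirms proportionality; its value is the normalization constant.
print(ratios)
```

Completing the square by hand gives the constant as exp(b²/(2a)) · sqrt(2π/a), which the printed ratios should reproduce.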

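then yields

$$
q^{(i+1)}(\mu) \propto N\left(\mu; m_\mu^{(i+1)}, s_\mu^{2(i+1)}\right) \tag{15.42}
$$

where the variational parameters $m_\mu^{(i+1)}$ and $s_\mu^{2(i+1)}$ may be expressed in terms of the expectation of λ under the ith variational density q(i)(λ), the prior parameters $m_\mu$ and $s_\mu^2$, and the data yi as

$$
s_\mu^{2(i+1)} = \left( \frac{1 + n s_\mu^2 \langle\lambda\rangle_{q^{(i)}(\lambda)}}{s_\mu^2} \right)^{-1} = \frac{s_\mu^2}{1 + n s_\mu^2 \langle\lambda\rangle_{q^{(i)}(\lambda)}} \tag{15.43}
$$

and

$$
\begin{aligned}
m_\mu^{(i+1)}
&= \frac{s_\mu^2}{1 + n s_\mu^2 \langle\lambda\rangle_{q^{(i)}(\lambda)}} \left( \langle\lambda\rangle_{q^{(i)}(\lambda)} \sum_{i=1}^{n} y_i + \frac{m_\mu}{s_\mu^2} \right) \\
&= \frac{s_\mu^2}{1 + n s_\mu^2 \langle\lambda\rangle_{q^{(i)}(\lambda)}} \left( \frac{m_\mu + \langle\lambda\rangle_{q^{(i)}(\lambda)} s_\mu^2 \sum_{i=1}^{n} y_i}{s_\mu^2} \right) \\
&= \frac{m_\mu + s_\mu^2 \langle\lambda\rangle_{q^{(i)}(\lambda)} \sum_{i=1}^{n} y_i}{1 + n s_\mu^2 \langle\lambda\rangle_{q^{(i)}(\lambda)}}.
\end{aligned} \tag{15.44}
$$

□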



Proof of eq. (15.31) and eq. (15.33)

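In analogy to the derivation of eq. (15.28) we have

$$
\begin{aligned}
\ln q^{(i+1)}(\lambda)
&= \left\langle \sum_{i=1}^{n} \ln\left( \lambda^{\frac{1}{2}} (2\pi)^{-\frac{1}{2}} \exp\left( -\frac{\lambda}{2} (y_i - \mu)^2 \right) \right) \right\rangle_{q^{(i+1)}(\mu)} \\
&\quad + \left\langle \ln\left( \left(2\pi s_\mu^2\right)^{-\frac{1}{2}} \exp\left( -\frac{1}{2 s_\mu^2} (\mu - m_\mu)^2 \right) \right) \right\rangle_{q^{(i+1)}(\mu)} \\
&\quad + \left\langle \ln\left( \frac{1}{\Gamma(a_\lambda)} \frac{1}{b_\lambda^{a_\lambda}} \lambda^{a_\lambda - 1} \exp\left( -\frac{\lambda}{b_\lambda} \right) \right) \right\rangle_{q^{(i+1)}(\mu)} + \ln c_\lambda \\
&= \left\langle \frac{n}{2} \ln\lambda - \frac{n}{2} \ln 2\pi - \frac{\lambda}{2} \sum_{i=1}^{n} (y_i - \mu)^2 \right\rangle_{q^{(i+1)}(\mu)} \\
&\quad + \left\langle -\frac{1}{2} \ln s_\mu^2 - \frac{1}{2} \ln 2\pi - \frac{1}{2 s_\mu^2} (\mu - m_\mu)^2 \right\rangle_{q^{(i+1)}(\mu)} \\
&\quad + \left\langle -\ln\Gamma(a_\lambda) - a_\lambda \ln(b_\lambda) + (a_\lambda - 1) \ln\lambda - \frac{\lambda}{b_\lambda} \right\rangle_{q^{(i+1)}(\mu)} + \ln c_\lambda.
\end{aligned} \tag{15.45}
$$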

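Grouping all terms devoid of λ in a constant cλ and using the linearity of expectations then simplifies the above to

$$
\begin{aligned}
\ln q^{(i+1)}(\lambda)
&= \left\langle \frac{n}{2} \ln\lambda - \frac{\lambda}{2} \sum_{i=1}^{n} (y_i - \mu)^2 \right\rangle_{q^{(i+1)}(\mu)} + \left\langle (a_\lambda - 1) \ln\lambda - \frac{\lambda}{b_\lambda} \right\rangle_{q^{(i+1)}(\mu)} + c_\lambda \\
&= \frac{n}{2} \ln\lambda - \frac{\lambda}{2} \left\langle \sum_{i=1}^{n} (y_i - \mu)^2 \right\rangle_{q^{(i+1)}(\mu)} + (a_\lambda - 1) \ln\lambda - \frac{\lambda}{b_\lambda} + c_\lambda.
\end{aligned} \tag{15.46}
$$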

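Reorganizing the right-hand side of equation (15.31) in multiplicative terms involving λ and expressing the expectations of µ under the variational density q(i+1)(µ) yields

$$
\begin{aligned}
\ln q^{(i+1)}(\lambda)
&= \left( \frac{n}{2} + a_\lambda - 1 \right) \ln\lambda - \left( \frac{1}{b_\lambda} + \frac{1}{2} \left\langle \sum_{i=1}^{n} (y_i - \mu)^2 \right\rangle_{q^{(i+1)}(\mu)} \right) \lambda + c_\lambda \\
&= \left( \frac{n}{2} + a_\lambda - 1 \right) \ln\lambda - \left( \frac{1}{b_\lambda} + \frac{1}{2} \left\langle \sum_{i=1}^{n} y_i^2 - 2\mu \sum_{i=1}^{n} y_i + n\mu^2 \right\rangle_{q^{(i+1)}(\mu)} \right) \lambda + c_\lambda \\
&= \left( \frac{n}{2} + a_\lambda - 1 \right) \ln\lambda - \left( \frac{1}{b_\lambda} + \frac{1}{2} \left( \sum_{i=1}^{n} y_i^2 - 2 \sum_{i=1}^{n} y_i \langle\mu\rangle_{q^{(i+1)}(\mu)} + n \langle\mu^2\rangle_{q^{(i+1)}(\mu)} \right) \right) \lambda + c_\lambda \\
&= \left( \frac{n}{2} + a_\lambda - 1 \right) \ln\lambda - \left( \frac{1}{b_\lambda} + \frac{1}{2} \left( \sum_{i=1}^{n} y_i^2 - 2 \sum_{i=1}^{n} y_i\, m_\mu^{(i+1)} + n \left( \left(m_\mu^{(i+1)}\right)^2 + s_\mu^{2(i+1)} \right) \right) \right) \lambda + c_\lambda.
\end{aligned} \tag{15.47}
$$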

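Taking the exponential on both sides then yields

$$
q^{(i+1)}(\lambda) \propto \lambda^{\left(\frac{n}{2} + a_\lambda - 1\right)} \exp\left( -\left( \frac{1}{b_\lambda} + \frac{1}{2} \left( \sum_{i=1}^{n} y_i^2 - 2 \sum_{i=1}^{n} y_i\, m_\mu^{(i+1)} + n \left( \left(m_\mu^{(i+1)}\right)^2 + s_\mu^{2(i+1)} \right) \right) \right) \lambda \right). \tag{15.48}
$$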

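Up to a normalization constant, q(i+1)(λ) is thus given by a Gamma density function

$$
q^{(i+1)}(\lambda) \propto G\left(\lambda; a_\lambda^{(i+1)}, b_\lambda^{(i+1)}\right) \tag{15.49}
$$

with parameters

$$
a_\lambda^{(i+1)} := \frac{n}{2} + a_\lambda \quad \text{and} \quad b_\lambda^{(i+1)} := \left( \frac{1}{b_\lambda} + \frac{1}{2} \left( \sum_{i=1}^{n} y_i^2 - 2 \sum_{i=1}^{n} y_i\, m_\mu^{(i+1)} + n \left( \left(m_\mu^{(i+1)}\right)^2 + s_\mu^{2(i+1)} \right) \right) \right)^{-1}. \tag{15.50}
$$

□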



Proof of eq. (15.34)

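We first reformulate the variational free energy functional as

$$
\begin{aligned}
F(q(\vartheta)) &= \int q(\vartheta) \ln\left(\frac{p(y, \vartheta)}{q(\vartheta)}\right) d\vartheta \\
&= \int q(\vartheta) \ln\left(\frac{p(y|\vartheta)\, p(\vartheta)}{q(\vartheta)}\right) d\vartheta \\
&= \int q(\vartheta) \left( \ln\left(p(y|\vartheta)\right) - \ln\left(\frac{q(\vartheta)}{p(\vartheta)}\right) \right) d\vartheta \\
&= \int q(\vartheta) \ln\left(p(y|\vartheta)\right) d\vartheta - \int q(\vartheta) \ln\left(\frac{q(\vartheta)}{p(\vartheta)}\right) d\vartheta \\
&= \int q(\vartheta) \ln\left(p(y|\vartheta)\right) d\vartheta - \mathrm{KL}(q(\vartheta)\,\|\,p(\vartheta)),
\end{aligned} \tag{15.51}
$$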

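where the first term on the right-hand side is sometimes referred to as the average likelihood and the second term is the KL divergence between the variational and prior distributions. We next evaluate the average likelihood term. To this end, substitution of the relevant probability densities yields

$$
\begin{aligned}
\int q(\vartheta) \ln p(y|\vartheta)\, d\vartheta &= \iint q^{(i)}(\mu) q^{(i)}(\lambda) \ln\left(\prod_{i=1}^{n} N(y_i; \mu, \lambda^{-1})\right) d\mu\, d\lambda \\
&= \iint q^{(i)}(\mu) q^{(i)}(\lambda) \ln\left(\left(\frac{\lambda}{2\pi}\right)^{\frac{n}{2}} \prod_{i=1}^{n} \exp\left(-\frac{\lambda}{2}(y_i-\mu)^2\right)\right) d\mu\, d\lambda \\
&= \iint q^{(i)}(\mu) q^{(i)}(\lambda) \left(\frac{n}{2}\ln\lambda - \frac{n}{2}\ln 2\pi - \frac{\lambda}{2}\sum_{i=1}^{n}(y_i-\mu)^2\right) d\mu\, d\lambda \\
&= \frac{n}{2}\iint q^{(i)}(\mu) q^{(i)}(\lambda) \ln\lambda\, d\mu\, d\lambda - \iint q^{(i)}(\mu) q^{(i)}(\lambda) \left(\frac{\lambda}{2}\sum_{i=1}^{n}(y_i-\mu)^2\right) d\mu\, d\lambda - \frac{n}{2}\ln 2\pi \\
&= \frac{n}{2}\int q^{(i)}(\lambda) \ln\lambda\, d\lambda - \int q^{(i)}(\lambda) \frac{\lambda}{2}\left(\int q^{(i)}(\mu)\left(\sum_{i=1}^{n}(y_i-\mu)^2\right) d\mu\right) d\lambda - \frac{n}{2}\ln 2\pi.
\end{aligned}
\tag{15.52}
$$

The first integral term on the right-hand side of eq. (15.52) is the expectation of the logarithm of λ under the variational density $q^{(i)}(\lambda) = G\left(\lambda; a_\lambda^{(i)}, b_\lambda^{(i)}\right)$ and evaluates to (cf. Johnson et al., 1994)

$$
\int q^{(i)}(\lambda) \ln\lambda\, d\lambda = \psi\left(a_\lambda^{(i)}\right) + \ln b_\lambda^{(i)},
\tag{15.53}
$$

where ψ denotes the digamma function. The second integral term on the right-hand side of eq. (15.52) evaluates to

$$
\begin{aligned}
\int q^{(i)}(\lambda) \frac{\lambda}{2}\left(\int q^{(i)}(\mu)\left(\sum_{i=1}^{n}(y_i-\mu)^2\right) d\mu\right) d\lambda &= \int q^{(i)}(\lambda) \left(\frac{1}{2}\lambda\left(\int q^{(i)}(\mu)\left(\sum_{i=1}^{n} y_i^2 - 2\mu\sum_{i=1}^{n} y_i + n\mu^2\right) d\mu\right)\right) d\lambda \\
&= \frac{1}{2}\int q^{(i)}(\lambda)\, \lambda \left(\sum_{i=1}^{n} y_i^2 - 2\sum_{i=1}^{n} y_i \int q^{(i)}(\mu)\mu\, d\mu + n\int q^{(i)}(\mu)\mu^2\, d\mu\right) d\lambda \\
&= \frac{1}{2} a_\lambda^{(i)} b_\lambda^{(i)} \left(\sum_{i=1}^{n} y_i^2 - 2 m_\mu^{(i)}\sum_{i=1}^{n} y_i + n\left(\left(m_\mu^{(i)}\right)^2 + s_\mu^{2(i)}\right)\right).
\end{aligned}
\tag{15.54}
$$

Combining eqs. (15.52), (15.53), and (15.54), the average likelihood term in eq. (15.51) thus evaluates to

$$
\int q(\vartheta) \ln p(y|\vartheta)\, d\vartheta = \frac{n}{2}\left(\psi\left(a_\lambda^{(i)}\right) + \ln b_\lambda^{(i)}\right) - \frac{1}{2} a_\lambda^{(i)} b_\lambda^{(i)} \left(\sum_{i=1}^{n} y_i^2 - 2 m_\mu^{(i)}\sum_{i=1}^{n} y_i + n\left(\left(m_\mu^{(i)}\right)^2 + s_\mu^{2(i)}\right)\right) - \frac{n}{2}\ln 2\pi.
\tag{15.55}
$$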

To evaluate the KL divergence term in eq. (15.51), we first note that with the additivity property of the KL divergence for factorized densities (cf. Chapter 12), we have

$$
KL(q(\mu)q(\lambda)\|p(\mu)p(\lambda)) = KL(q(\mu)\|p(\mu)) + KL(q(\lambda)\|p(\lambda)).
\tag{15.56}
$$

In the current example, the variable µ is governed by Gaussian densities for both the prior density p(µ) and the variational densities. More specifically, in the variational inference algorithm, the prior density for µ has parameters $m_\mu, s_\mu^2$, while the variational density q(µ) corresponds to the ith variational density $q^{(i)}(\mu)$ with parameters $m_\mu^{(i)}, s_\mu^{2(i)}$. With the known form of the KL divergence for univariate Gaussian densities, we thus have

$$
KL\left(q^{(i)}(\mu)\|p(\mu)\right) = KL\left(N\left(\mu; m_\mu^{(i)}, s_\mu^{2(i)}\right) \| N\left(\mu; m_\mu, s_\mu^2\right)\right) = \frac{1}{2}\ln\frac{s_\mu^2}{s_\mu^{2(i)}} + \frac{m_\mu^2 + m_\mu^{(i)2} + s_\mu^{2(i)} - 2 m_\mu^{(i)} m_\mu}{2 s_\mu^2} - \frac{1}{2}.
\tag{15.57}
$$
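The closed form in eq. (15.57) can be checked numerically against the defining integral of the KL divergence. The following sketch uses only the Python standard library; the function names and the simple Riemann-sum quadrature are illustrative assumptions rather than anything prescribed by the text.

```python
import math


def kl_gauss(m_q, s2_q, m_p, s2_p):
    """KL(N(m_q, s2_q) || N(m_p, s2_p)) for univariate Gaussians, cf. eq. (15.57)."""
    return (0.5 * math.log(s2_p / s2_q)
            + (m_p ** 2 + m_q ** 2 + s2_q - 2.0 * m_q * m_p) / (2.0 * s2_p)
            - 0.5)


def kl_gauss_numeric(m_q, s2_q, m_p, s2_p, half_width=12.0, steps=20001):
    """Riemann-sum approximation of int q(x) ln(q(x) / p(x)) dx."""
    def log_pdf(x, m, s2):
        return -0.5 * math.log(2.0 * math.pi * s2) - (x - m) ** 2 / (2.0 * s2)

    sd = math.sqrt(s2_q)
    lo = m_q - half_width * sd          # integrate over +/- 12 standard deviations
    dx = 2.0 * half_width * sd / (steps - 1)
    total = 0.0
    for j in range(steps):
        x = lo + j * dx
        log_q = log_pdf(x, m_q, s2_q)
        total += math.exp(log_q) * (log_q - log_pdf(x, m_p, s2_p)) * dx
    return total


closed_form = kl_gauss(1.0, 0.5, 0.0, 2.0)
numeric = kl_gauss_numeric(1.0, 0.5, 0.0, 2.0)
```

Both values should agree to several decimal places, and the divergence vanishes when the two densities coincide.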

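Similarly, the variable λ is governed by Gamma densities for both the prior and the variational densities. Specifically, the prior density of λ has parameters $a_\lambda$ and $b_\lambda$, while the variational density q(λ) corresponds to the ith variational Gamma distribution over λ with parameters $a_\lambda^{(i)}$ and $b_\lambda^{(i)}$. With the known form of the KL divergence for Gamma densities with shape parameter a and scale parameter b, we thus have

$$
\begin{aligned}
KL\left(q^{(i)}(\lambda)\|p(\lambda)\right) &= KL\left(G\left(\lambda; a_\lambda^{(i)}, b_\lambda^{(i)}\right) \| G\left(\lambda; a_\lambda, b_\lambda\right)\right) \\
&= \left(a_\lambda^{(i)} - 1\right)\psi\left(a_\lambda^{(i)}\right) - \ln b_\lambda^{(i)} - a_\lambda^{(i)} - \ln\Gamma\left(a_\lambda^{(i)}\right) + \ln\Gamma(a_\lambda) \\
&\quad + a_\lambda \ln b_\lambda - (a_\lambda - 1)\left(\psi\left(a_\lambda^{(i)}\right) + \ln b_\lambda^{(i)}\right) + \frac{a_\lambda^{(i)} b_\lambda^{(i)}}{b_\lambda}.
\end{aligned}
\tag{15.58}
$$

The KL divergence term in eq. (15.51) thus evaluates to

$$
\begin{aligned}
KL(q(\vartheta)\|p(\vartheta)) &= \frac{1}{2}\ln\frac{s_\mu^2}{s_\mu^{2(i)}} + \frac{m_\mu^2 + m_\mu^{(i)2} + s_\mu^{2(i)} - 2 m_\mu^{(i)} m_\mu}{2 s_\mu^2} - \frac{1}{2} \\
&\quad + \left(a_\lambda^{(i)} - 1\right)\psi\left(a_\lambda^{(i)}\right) - \ln b_\lambda^{(i)} - a_\lambda^{(i)} - \ln\Gamma\left(a_\lambda^{(i)}\right) + \ln\Gamma(a_\lambda) \\
&\quad + a_\lambda \ln b_\lambda - (a_\lambda - 1)\left(\psi\left(a_\lambda^{(i)}\right) + \ln b_\lambda^{(i)}\right) + \frac{a_\lambda^{(i)} b_\lambda^{(i)}}{b_\lambda}.
\end{aligned}
\tag{15.59}
$$

□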

15.3 Fixed-form mean-field variational inference

The central idea of fixed-form mean-field variational inference is to pre-define the parametric form of the factorized variational density

$$
q(\vartheta) = \prod_{i=1}^{k} q(\vartheta_i)
\tag{15.60}
$$

at all stages of an iterative algorithm for the maximization of the variational free energy. Because the joint density p(y, ϑ) is defined during the formulation of the probabilistic model of interest, this entails that all densities of the variational free energy

$$
F(q(\vartheta)) = \int q(\vartheta) \ln\left(\frac{p(y,\vartheta)}{q(\vartheta)}\right) d\vartheta
\tag{15.61}
$$

are defined in parametric form at all times of the procedure. If the integral on the right-hand side of eq. (15.61) can be analytically evaluated (or at least be approximated) as a function of the parameters of the variational densities $q(\vartheta_i)$, i = 1, ..., k, the variational problem of maximizing a functional with respect to probability density functions is rendered a problem of multivariate optimization, which in turn can be addressed using the standard machinery of nonlinear optimization (Chapter 4). In the following, we will exemplify the fixed-form mean-field variational inference approach using a non-linear Gaussian model with a single mean-field partition, which forms the basis for many models in functional neuroimaging (Friston, 2008).

Fixed-form variational inference for a non-linear Gaussian model

Probabilistic model

We consider the following hierarchical nonlinear Gaussian model comprising an unobservable random vector x and an observable random vector y

$$
x = \mu_x + \eta \tag{15.62}
$$

$$
y = f(x) + \varepsilon, \tag{15.63}
$$

where $x, \mu_x, \eta \in \mathbb{R}^m$ and $y, \varepsilon \in \mathbb{R}^n$, and ε and η are random vectors with distributions

$$
\eta \sim N(0_m, \Sigma_x) \quad\text{and}\quad \varepsilon \sim N\left(0_n, \lambda_y^{-1} I_n\right), \tag{15.64}
$$

where $\Sigma_x \in \mathbb{R}^{m \times m}$ is positive-definite (p.d.), $\lambda_y > 0$, and $f : \mathbb{R}^m \to \mathbb{R}^n$ is a multivariate vector-valued function. To apply the fixed-form mean-field variational inference approach to this model, we first consider the joint density implicit in eq. (15.62). It is given by

$$
p(y, x) = p(y|x)p(x), \tag{15.65}
$$

where the conditional density of the observable random variable is specified by

$$
p(y|x) = N\left(y; f(x), \lambda_y^{-1} I_n\right) \tag{15.66}
$$

and the marginal distribution of the unobserved random vector x is specified by

$$
p(x) = N(x; \mu_x, \Sigma_x). \tag{15.67}
$$

In functional form, we can thus write the joint density (15.65) as the product of two multivariate Gaussian distributions,

$$
p(y, x) = N\left(y; f(x), \lambda_y^{-1} I_n\right) N(x; \mu_x, \Sigma_x). \tag{15.68}
$$

We assume that the prior parameters $\mu_x$ and $\Sigma_x$, as well as the likelihood parameter $\lambda_y$, are known. The aim of variational inference applied to (15.68) is to obtain an approximation to the posterior distribution p(x|y) and the log model evidence ln p(y).

Variational inference

To achieve these aims, fixed-form mean-field variational inference approximates the posterior distribution p(x|y) using a predefined parametric form of the variational distribution. A common choice is to use a Gaussian, i.e.,

$$
q(x) := N(x; m_x, S_x). \tag{15.69}
$$

This latter definition is often referred to as the Laplace approximation in the functional neuroimaging literature (e.g., Friston et al., 2007a). This is unfortunate, because the term Laplace approximation is used in the machine learning and statistics literature for the approximation of an arbitrary probability density function with a Gaussian density and not for the definition of a variational distribution in terms of a Gaussian distribution.

The definition of q(x) as in eq. (15.69) allows for reformulating the variational problem of maximizing the variational free energy as a standard problem in nonlinear optimization. Substitution of eqs. (15.68) and (15.69) on the right-hand side of eq. (15.61) yields the variational free energy function

$$
F : \mathbb{R}^m \times \mathbb{R}^{m \times m} \to \mathbb{R},\ (m_x, S_x) \mapsto F(m_x, S_x) := \int N(x; m_x, S_x) \ln\left(\frac{N\left(y; f(x), \lambda_y^{-1} I_n\right) N(x; \mu_x, \Sigma_x)}{N(x; m_x, S_x)}\right) dx. \tag{15.70}
$$

Note that (15.70) specifies the variational free energy explicitly as a multivariate real-valued function in the variational density parameters $m_x$ and $S_x$. From a mathematical perspective, it is worth noting that the fixed-form reformulation of the variational free energy by no means results in a trivial problem. First, the function F is defined by an integral term involving the nonlinear function f, which, as discussed below, can often only be approximated. This is important, because it calls into question the validity of the optimized free energy approximation to the log model evidence. However, as of now, the magnitude of the ensuing approximation error as a function of the degree of nonlinearity of f does not seem to have been systematically studied in the literature. Second, the function F is not a simple real-valued multivariate function in the sense that its arguments are just arbitrary real vectors. Its second argument is a covariance parameter, which has a predefined structure, i.e., it has to be a positive-definite matrix. Fortunately, optimization of the function F with respect to this parameter can be achieved analytically, as discussed below. We next provide the functional form of the free energy function (15.70) and its derivation and then proceed to discuss its optimization with respect to the variational parameters $S_x$ and $m_x$.



Approximate evaluation of the variational free energy

Using a multivariate first-order Taylor approximation of the nonlinear function f, the function defined in eq. (15.70) can be approximated as

$$
\begin{aligned}
F : \mathbb{R}^m \times \mathbb{R}^{m \times m} \to \mathbb{R},\ (m_x, S_x) \mapsto F(m_x, S_x) := &-\frac{n}{2}\ln 2\pi + \frac{n}{2}\ln\lambda_y - \frac{\lambda_y}{2}(y - f(m_x))^T (y - f(m_x)) - \frac{\lambda_y}{2}\mathrm{tr}\left(J_f(m_x)^T J_f(m_x) S_x\right) \\
&-\frac{m}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_x| - \frac{1}{2}(m_x - \mu_x)^T \Sigma_x^{-1} (m_x - \mu_x) - \frac{1}{2}\mathrm{tr}\left(\Sigma_x^{-1} S_x\right) \\
&+\frac{1}{2}\ln|S_x| + \frac{m}{2}\ln(2\pi e),
\end{aligned}
\tag{15.71}
$$

where tr denotes the trace operator and $J_f(m_x)$ denotes the Jacobian matrix of the function f evaluated at the variational expectation parameter. Note that, formally, different symbols for the function defined in (15.70) and its approximation provided in (15.71) would be appropriate.
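The approximate free energy of eq. (15.71) can be transcribed into NumPy almost term by term. The sketch below is illustrative only: the function name `free_energy`, its argument names, and the convention that `f` and `jac_f` are Python callables returning a vector and an n × m Jacobian are assumptions, not part of the text.

```python
import numpy as np


def free_energy(m_x, S_x, y, f, jac_f, lam_y, mu_x, Sigma_x):
    """First-order Taylor approximation of the variational free energy, eq. (15.71)."""
    n, m = y.size, m_x.size
    r = y - f(m_x)                      # residual at the variational expectation
    J = jac_f(m_x)                      # n x m Jacobian of f at m_x
    d = m_x - mu_x                      # deviation from the prior expectation
    Sigma_inv = np.linalg.inv(Sigma_x)
    # first row of (15.71): expected log likelihood
    log_lik = (-0.5 * n * np.log(2 * np.pi) + 0.5 * n * np.log(lam_y)
               - 0.5 * lam_y * (r @ r)
               - 0.5 * lam_y * np.trace(J.T @ J @ S_x))
    # second row of (15.71): expected log prior
    log_prior = (-0.5 * m * np.log(2 * np.pi)
                 - 0.5 * np.linalg.slogdet(Sigma_x)[1]
                 - 0.5 * d @ Sigma_inv @ d
                 - 0.5 * np.trace(Sigma_inv @ S_x))
    # third row of (15.71): entropy of the Gaussian variational density
    entropy = 0.5 * np.linalg.slogdet(S_x)[1] + 0.5 * m * np.log(2 * np.pi * np.e)
    return log_lik + log_prior + entropy
```

A useful sanity check is the linear case f(x) = Ax: there the Taylor expansion is exact, and evaluating the function at the exact Gaussian posterior parameters should reproduce the log model evidence ln p(y).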

Derivation of (15.71)

Using the properties of the logarithm and the linearity of the integral, we first decompose the variational free energy integral as follows:

$$
\begin{aligned}
F(q(x)) &= \int q(x) \ln\left(\frac{p(y,x)}{q(x)}\right) dx \\
&= \int q(x) \left(\ln p(y,x) - \ln q(x)\right) dx \\
&= \int q(x) \ln p(y,x)\, dx - \int q(x) \ln q(x)\, dx.
\end{aligned}
\tag{15.72}
$$

Of the remaining two integral terms, the latter, including its negative sign, corresponds to the differential entropy of a multivariate Gaussian distribution, which is well known to be a nonlinear function of the variational covariance parameter $S_x$:

$$
-\int q(x) \ln q(x)\, dx = -\langle \ln N(x; m_x, S_x)\rangle_{N(x;m_x,S_x)} = \frac{1}{2}\ln|S_x| + \frac{m}{2}\ln(2\pi e). \tag{15.73}
$$

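There thus remains the evaluation of the first integral term, which corresponds to the expectation of the log joint density of the observed and unobserved random variables under the variational density of the unobserved random variables. Substitution of p(y, x) results in

$$
\begin{aligned}
\langle \ln p(y,x)\rangle_{q(x)} &= \left\langle \ln\left(N\left(y; f(x), \lambda_y^{-1} I_n\right) N(x; \mu_x, \Sigma_x)\right)\right\rangle_{N(x;m_x,S_x)} \\
&= \left\langle \ln N\left(y; f(x), \lambda_y^{-1} I_n\right)\right\rangle_{N(x;m_x,S_x)} + \left\langle \ln N(x; \mu_x, \Sigma_x)\right\rangle_{N(x;m_x,S_x)} \\
&= \left\langle \ln\left((2\pi)^{-\frac{n}{2}} \left|\lambda_y^{-1} I_n\right|^{-\frac{1}{2}} \exp\left(-\frac{1}{2}(y - f(x))^T \left(\lambda_y^{-1} I_n\right)^{-1} (y - f(x))\right)\right)\right\rangle_{N(x;m_x,S_x)} \\
&\quad + \left\langle \ln\left((2\pi)^{-\frac{m}{2}} |\Sigma_x|^{-\frac{1}{2}} \exp\left(-\frac{1}{2}(x - \mu_x)^T \Sigma_x^{-1} (x - \mu_x)\right)\right)\right\rangle_{N(x;m_x,S_x)} \\
&= \left\langle -\frac{n}{2}\ln 2\pi + \frac{n}{2}\ln\lambda_y - \frac{\lambda_y}{2}(y - f(x))^T (y - f(x))\right\rangle_{N(x;m_x,S_x)} \\
&\quad + \left\langle -\frac{m}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_x| - \frac{1}{2}(x - \mu_x)^T \Sigma_x^{-1} (x - \mu_x)\right\rangle_{N(x;m_x,S_x)} \\
&= -\frac{n}{2}\ln 2\pi + \frac{n}{2}\ln\lambda_y - \frac{\lambda_y}{2}\left\langle (y - f(x))^T (y - f(x))\right\rangle_{N(x;m_x,S_x)} \\
&\quad - \frac{m}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_x| - \frac{1}{2}\left\langle (x - \mu_x)^T \Sigma_x^{-1} (x - \mu_x)\right\rangle_{N(x;m_x,S_x)}.
\end{aligned}
\tag{15.74}
$$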

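There thus remain two integral terms. Of these, the latter can be evaluated using the Gaussian expectation theorem (Chapter 10). Specifically, we have

$$
\begin{aligned}
\left\langle (x - \mu_x)^T \Sigma_x^{-1} (x - \mu_x)\right\rangle_{N(x;m_x,S_x)} &= \left\langle x^T \Sigma_x^{-1} x - x^T \Sigma_x^{-1} \mu_x - \mu_x^T \Sigma_x^{-1} x + \mu_x^T \Sigma_x^{-1} \mu_x\right\rangle_{N(x;m_x,S_x)} \\
&= \left\langle x^T \Sigma_x^{-1} x\right\rangle_{N(x;m_x,S_x)} - 2\mu_x^T \Sigma_x^{-1} \langle x\rangle_{N(x;m_x,S_x)} + \mu_x^T \Sigma_x^{-1} \mu_x \\
&= \mathrm{tr}\left(\Sigma_x^{-1} S_x\right) + m_x^T \Sigma_x^{-1} m_x - 2\mu_x^T \Sigma_x^{-1} m_x + \mu_x^T \Sigma_x^{-1} \mu_x \\
&= \mathrm{tr}\left(\Sigma_x^{-1} S_x\right) + (m_x - \mu_x)^T \Sigma_x^{-1} (m_x - \mu_x).
\end{aligned}
\tag{15.75}
$$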
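The identity in eq. (15.75) is easy to probe by simulation. The following sketch compares the closed form with a Monte Carlo average over samples from N(x; m_x, S_x); the numerical values, the seed, and the function name `expected_quadratic` are arbitrary illustrative choices.

```python
import numpy as np


def expected_quadratic(m_x, S_x, mu_x, Sigma_x):
    """Closed form of <(x - mu_x)^T Sigma_x^{-1} (x - mu_x)> under N(m_x, S_x), eq. (15.75)."""
    Sigma_inv = np.linalg.inv(Sigma_x)
    d = m_x - mu_x
    return np.trace(Sigma_inv @ S_x) + d @ Sigma_inv @ d


rng = np.random.default_rng(0)
m_x = np.array([1.0, -0.5])
S_x = np.array([[0.5, 0.1], [0.1, 0.3]])
mu_x = np.zeros(2)
Sigma_x = np.array([[2.0, 0.3], [0.3, 1.0]])

# Monte Carlo estimate of the same expectation
x = rng.multivariate_normal(m_x, S_x, size=200_000)
dev = x - mu_x
mc_estimate = np.mean(np.einsum("ij,jk,ik->i", dev, np.linalg.inv(Sigma_x), dev))
closed_form = expected_quadratic(m_x, S_x, mu_x, Sigma_x)
```

The trace term captures the contribution of the variational covariance, while the quadratic term captures the offset between the variational and prior expectations.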



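There thus remains the evaluation of the first integral term in (15.74). We first note that we can write this term as

$$
\begin{aligned}
\left\langle (y - f(x))^T (y - f(x))\right\rangle_{N(x;m_x,S_x)} &= \left\langle y^T y - y^T f(x) - f(x)^T y + f(x)^T f(x)\right\rangle_{N(x;m_x,S_x)} \\
&= y^T y - 2 y^T \langle f(x)\rangle_{N(x;m_x,S_x)} + \left\langle f(x)^T f(x)\right\rangle_{N(x;m_x,S_x)}.
\end{aligned}
\tag{15.76}
$$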

We are thus led to the evaluation of the expectation of a Gaussian random variable x under the nonlinear transformation f. In the functional neuroimaging literature, the function f is then approximated using a multivariate first-order Taylor expansion in order to evaluate the remaining expectations (cf. Friston et al., 2007a; ?). Denoting the Jacobian matrix of f evaluated at the variational expectation parameter $m_x$ by $J_f(m_x)$, we thus have

$$
f(x) \approx f(m_x) + J_f(m_x)(x - m_x). \tag{15.77}
$$

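By replacing f(x) in the first expectation of the right-hand side of eq. (15.76) with the approximation (15.77), we obtain

$$
\begin{aligned}
\langle f(x)\rangle_{N(x;m_x,S_x)} &\approx \langle f(m_x) + J_f(m_x)(x - m_x)\rangle_{N(x;m_x,S_x)} \\
&= f(m_x) + J_f(m_x)\left(\langle x - m_x\rangle_{N(x;m_x,S_x)}\right) \\
&= f(m_x) + J_f(m_x)\left(\langle x\rangle_{N(x;m_x,S_x)} - m_x\right) \\
&= f(m_x) + J_f(m_x)(m_x - m_x) \\
&= f(m_x).
\end{aligned}
\tag{15.78}
$$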

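Further, replacing f(x) in the second expectation of the right-hand side of (15.76) with the approximation (15.77), we obtain

$$
\begin{aligned}
\left\langle f(x)^T f(x)\right\rangle_{N(x;m_x,S_x)} &\approx \left\langle (f(m_x) + J_f(m_x)(x - m_x))^T (f(m_x) + J_f(m_x)(x - m_x))\right\rangle_{N(x;m_x,S_x)} \\
&= \left\langle f(m_x)^T f(m_x) + 2 f(m_x)^T J_f(m_x)(x - m_x) + (J_f(m_x)(x - m_x))^T (J_f(m_x)(x - m_x))\right\rangle_{N(x;m_x,S_x)} \\
&= f(m_x)^T f(m_x) + 2 f(m_x)^T J_f(m_x)\langle (x - m_x)\rangle_{N(x;m_x,S_x)} \\
&\quad + \left\langle (J_f(m_x)(x - m_x))^T (J_f(m_x)(x - m_x))\right\rangle_{N(x;m_x,S_x)}.
\end{aligned}
\tag{15.79}
$$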

Considering the first remaining expectation then yields

〈(x−mx)〉N(x;mx,Sx) = 〈x〉N(x;mx,Sx) −mx = mx −mx = 0. (15.80)

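To evaluate the second remaining expectation, we first rewrite it as

$$
\left\langle (J_f(m_x)(x - m_x))^T (J_f(m_x)(x - m_x))\right\rangle_{N(x;m_x,S_x)} = \left\langle (x - m_x)^T J_f(m_x)^T J_f(m_x)(x - m_x)\right\rangle_{N(x;m_x,S_x)} \tag{15.81}
$$

and note that $(x - m_x)^T \in \mathbb{R}^{1 \times m}$, $J_f(m_x)^T \in \mathbb{R}^{m \times n}$, $J_f(m_x) \in \mathbb{R}^{n \times m}$ and $(x - m_x) \in \mathbb{R}^{m \times 1}$. Application of the Gaussian expectation theorem then yields

$$
\begin{aligned}
\left\langle (x - m_x)^T J_f(m_x)^T J_f(m_x)(x - m_x)\right\rangle_{N(x;m_x,S_x)} &= (m_x - m_x)^T J_f(m_x)^T J_f(m_x)(m_x - m_x) + \mathrm{tr}\left(J_f(m_x)^T J_f(m_x) S_x\right) \\
&= \mathrm{tr}\left(J_f(m_x)^T J_f(m_x) S_x\right).
\end{aligned}
\tag{15.82}
$$

We thus have

$$
\left\langle f(x)^T f(x)\right\rangle_{N(x;m_x,S_x)} \approx f(m_x)^T f(m_x) + \mathrm{tr}\left(J_f(m_x)^T J_f(m_x) S_x\right). \tag{15.83}
$$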

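In summary, we obtain the following approximation for the first expectation on the right-hand side of (15.74)

$$
\begin{aligned}
\left\langle (y - f(x))^T (y - f(x))\right\rangle_{N(x;m_x,S_x)} &= y^T y - 2 y^T \langle f(x)\rangle_{N(x;m_x,S_x)} + \left\langle f(x)^T f(x)\right\rangle_{N(x;m_x,S_x)} \\
&\approx y^T y - 2 y^T f(m_x) + f(m_x)^T f(m_x) + \mathrm{tr}\left(J_f(m_x)^T J_f(m_x) S_x\right) \\
&= (y - f(m_x))^T (y - f(m_x)) + \mathrm{tr}\left(J_f(m_x)^T J_f(m_x) S_x\right).
\end{aligned}
\tag{15.84}
$$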

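Concatenating the results, we have thus obtained the following approximation of the expectation of the log joint density of observed and unobserved random variables under the variational density

$$
\begin{aligned}
\langle \ln p(y,x)\rangle_{q(x)} \approx &-\frac{n}{2}\ln 2\pi + \frac{n}{2}\ln\lambda_y - \frac{\lambda_y}{2}\left((y - f(m_x))^T (y - f(m_x)) + \mathrm{tr}\left(J_f(m_x)^T J_f(m_x) S_x\right)\right) \\
&-\frac{m}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_x| - \frac{1}{2}\left(\mathrm{tr}\left(\Sigma_x^{-1} S_x\right) + (m_x - \mu_x)^T \Sigma_x^{-1} (m_x - \mu_x)\right).
\end{aligned}
\tag{15.85}
$$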



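Together with the previously evaluated entropy term, we thus obtain an approximation for the variational free energy functional under the fixed-form assumption $q(x) = N(x; m_x, S_x)$ that can be written as a multivariate real-valued function in the variational density parameters $m_x$ and $S_x$:

$$
\begin{aligned}
F : \mathbb{R}^m \times \mathbb{R}^{m \times m} \to \mathbb{R},\ (m_x, S_x) \mapsto F(m_x, S_x) := &-\frac{n}{2}\ln 2\pi + \frac{n}{2}\ln\lambda_y - \frac{\lambda_y}{2}(y - f(m_x))^T (y - f(m_x)) - \frac{\lambda_y}{2}\mathrm{tr}\left(J_f(m_x)^T J_f(m_x) S_x\right) \\
&-\frac{m}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_x| - \frac{1}{2}(m_x - \mu_x)^T \Sigma_x^{-1} (m_x - \mu_x) - \frac{1}{2}\mathrm{tr}\left(\Sigma_x^{-1} S_x\right) \\
&+\frac{1}{2}\ln|S_x| + \frac{m}{2}\ln(2\pi e).
\end{aligned}
\tag{15.86}
$$

□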

Maximization with respect to the variational variance parameter

Optimizing nonlinear multivariate real-valued functions such as (15.71) is the fundamental aim of nonlinear optimization (Chapter 4). Intuitively, many nonlinear optimization methods are based on a simple premise: from basic calculus we know that a necessary condition for an extremal point at a given location in the input space of a function is that the first derivative evaluates to zero at this point, i.e., the function is neither increasing nor decreasing. If one extends this idea to functions of multidimensional entities, one can show that one may maximize the function F with respect to its input argument $S_x$ based on a simple formula. Omitting all terms of the function F that do not depend on $S_x$ and which hence do not contribute to changes in the value of F as $S_x$ changes, we can write the first derivative of F with respect to $S_x$ suggestively as

$$
\frac{\partial}{\partial S_x} F(m_x, S_x) = -\frac{\lambda_y}{2} J_f(m_x)^T J_f(m_x) - \frac{1}{2}\Sigma_x^{-1} + \frac{1}{2} S_x^{-1}. \tag{15.87}
$$

Setting the derivative of F with respect to $S_x$ to zero and solving for the extremal argument $S_x$ then yields the following update rule for the variational covariance parameters

$$
S_x = \left(\lambda_y J_f(m_x)^T J_f(m_x) + \Sigma_x^{-1}\right)^{-1}. \tag{15.88}
$$
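As a quick numerical illustration of eqs. (15.87) and (15.88), the sketch below evaluates the covariance update for a fixed Jacobian and confirms that the derivative expression vanishes at the updated value. The function names and the example matrices are illustrative assumptions.

```python
import numpy as np


def update_S_x(J, lam_y, Sigma_x):
    """Variational covariance update of eq. (15.88) for a given Jacobian J = J_f(m_x)."""
    return np.linalg.inv(lam_y * J.T @ J + np.linalg.inv(Sigma_x))


def dF_dS(S_x, J, lam_y, Sigma_x):
    """Derivative of the free energy with respect to S_x as written in eq. (15.87)."""
    return (-0.5 * lam_y * J.T @ J
            - 0.5 * np.linalg.inv(Sigma_x)
            + 0.5 * np.linalg.inv(S_x))


J = np.array([[1.0, 0.5], [0.0, 2.0], [1.5, -1.0]])   # example 3 x 2 Jacobian
Sigma_x = np.array([[1.0, 0.2], [0.2, 0.5]])
S_x = update_S_x(J, lam_y=3.0, Sigma_x=Sigma_x)
residual = dF_dS(S_x, J, 3.0, Sigma_x)                # numerically a zero matrix
```

Because both summands inside the inverse are symmetric and the prior precision is positive-definite, the updated covariance is itself a valid positive-definite matrix.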

Proof of (15.88)

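We only provide a heuristic proof to demonstrate the general idea. A formal mathematical proof would also require the characterization of the function F as a concave function and a sensible notation for derivatives of functions of multivariate entities such as vectors and positive-definite matrices. Here, we use the notation for partial derivatives. We have

$$
\frac{\partial}{\partial S_x} F(m_x, S_x) = -\frac{\lambda_y}{2}\frac{\partial}{\partial S_x}\mathrm{tr}\left(J_f(m_x)^T J_f(m_x) S_x\right) - \frac{1}{2}\frac{\partial}{\partial S_x}\mathrm{tr}\left(\Sigma_x^{-1} S_x\right) + \frac{1}{2}\frac{\partial}{\partial S_x}\ln|S_x| \tag{15.89}
$$

which, using the following rules for matrix derivatives involving the trace operator and logarithmic determinants (cf. equations (103) and (57) in Petersen et al., 2006)

$$
\frac{\partial}{\partial X}\mathrm{tr}\left(A X^T\right) = A \quad\text{and}\quad \frac{\partial}{\partial X}\ln|X| = \left(X^T\right)^{-1} \tag{15.90}
$$

with $S_x = S_x^T$ yields

$$
\frac{\partial}{\partial S_x} F(m_x, S_x) = -\frac{\lambda_y}{2} J_f(m_x)^T J_f(m_x) - \frac{1}{2}\Sigma_x^{-1} + \frac{1}{2} S_x^{-1}. \tag{15.91}
$$

Setting the above to zero then yields the equivalent relations

$$
\begin{aligned}
&\frac{\partial}{\partial S_x} F(m_x, S_x) = 0 \\
\Leftrightarrow\ &-\frac{\lambda_y}{2} J_f(m_x)^T J_f(m_x) - \frac{1}{2}\Sigma_x^{-1} + \frac{1}{2} S_x^{-1} = 0 \\
\Leftrightarrow\ &S_x = \left(\lambda_y J_f(m_x)^T J_f(m_x) + \Sigma_x^{-1}\right)^{-1}.
\end{aligned}
\tag{15.92}
$$

□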



Maximization with respect to the variational expectation parameter

In contrast to the variational covariance parameter, maximization of the variational free energy function with respect to $m_x$ cannot be achieved analytically, but requires an iterative numerical optimization algorithm. In the functional neuroimaging literature, the algorithm employed to this end is fairly specific, but related to standard nonlinear optimization algorithms such as gradient and Newton methods (Chapter 4). To simplify the notational complexity of the discussion below, we first rewrite the variational free energy function eq. (15.71) as a function of only the variational expectation parameter, assuming that it has been maximized with respect to the variational covariance parameter $S_x$ previously, and remove any additive terms devoid of $m_x$. The function of interest then takes the form

$$
F : \mathbb{R}^m \to \mathbb{R},\ m_x \mapsto F(m_x) := -\frac{\lambda_y}{2}(y - f(m_x))^T (y - f(m_x)) - \frac{\lambda_y}{2}\mathrm{tr}\left(J_f(m_x)^T J_f(m_x) S_x\right) - \frac{1}{2}(m_x - \mu_x)^T \Sigma_x^{-1} (m_x - \mu_x). \tag{15.93}
$$

As discussed in Chapter 4, numerical optimization schemes usually work by guessing an initial value for the maximization argument of the nonlinear function under study and then iteratively updating this guess according to some update rule. A very basic gradient ascent scheme for the function specified in eq. (15.93) is provided as Algorithm 1. A prerequisite for the application of this algorithm is the availability of the gradient of the function F evaluated at $m_x^{(k)}$ for k = 0, 1, 2, .... In the functional neuroimaging literature, it is suggested to approximate this gradient analytically by omitting higher derivatives of the function f with respect to $m_x$ (Friston et al., 2007a). The function (15.93) comprises first derivatives of the function f with respect to $m_x$ in the form of the Jacobian $J_f(m_x)$ in the second term. If this term is omitted, the gradient of F evaluates to

$$
\nabla F(m_x) = \lambda_y J_f(m_x)^T (y - f(m_x)) - \Sigma_x^{-1}(m_x - \mu_x) \tag{15.94}
$$

and, with the step-size κ > 0 of Algorithm 1, the update rule for the variational expectation parameter takes the form

$$
m_x^{(k+1)} = m_x^{(k)} + \kappa\left(\lambda_y J_f\left(m_x^{(k)}\right)^T \left(y - f\left(m_x^{(k)}\right)\right) - \Sigma_x^{-1}\left(m_x^{(k)} - \mu_x\right)\right) \quad\text{for } k = 0, 1, 2, ... \tag{15.95}
$$

Proof of (15.94)

We have

$$
\nabla F(m_x) = -\frac{\lambda_y}{2}\frac{\partial}{\partial m_x}\left((y - f(m_x))^T (y - f(m_x))\right) - \frac{\lambda_y}{2}\frac{\partial}{\partial m_x}\mathrm{tr}\left(J_f(m_x)^T J_f(m_x) S_x\right) - \frac{1}{2}\frac{\partial}{\partial m_x}\left((m_x - \mu_x)^T \Sigma_x^{-1} (m_x - \mu_x)\right). \tag{15.96}
$$

Notably, the second term above involves second-order derivatives of the function f with respect to $m_x$. Following Friston et al. (2007a) we neglect these terms, and obtain, using the rules of the calculus for multivariate real-valued functions (Petersen and Pedersen, 2012) and noting that the derivative of $y - f(m_x)$ with respect to $m_x$ is $-J_f(m_x)$,

$$
\begin{aligned}
\nabla F(m_x) &= -\frac{\lambda_y}{2}\, 2\left(-J_f(m_x)\right)^T (y - f(m_x)) - \frac{1}{2}\, 2\,\Sigma_x^{-1}(m_x - \mu_x) \\
&= \lambda_y J_f(m_x)^T (y - f(m_x)) - \Sigma_x^{-1}(m_x - \mu_x).
\end{aligned}
\tag{15.97}
$$

□

Algorithm 1: A gradient ascent algorithm

Initialization

1. Define a starting point $m_x^{(0)} \in \mathbb{R}^m$ and set k := 0. If $\nabla F(m_x^{(0)}) = 0$, stop! $m_x^{(0)}$ is a zero of ∇F. If not, proceed to iterations.

Until Convergence

1. Set $m_x^{(k+1)} := m_x^{(k)} + \kappa\nabla F(m_x^{(k)})$ for a step-size κ > 0.

2. If $\nabla F(m_x^{(k+1)}) = 0$, stop! $m_x^{(k+1)}$ is a zero of ∇F. If not, go to 3.

3. Set k := k + 1 and go to 1.
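Algorithm 1 can be sketched as follows. The gradient used here is the one obtained by differentiating the first and third term of (15.93) with the chain rule, i.e. $\lambda_y J_f(m_x)^T (y - f(m_x)) - \Sigma_x^{-1}(m_x - \mu_x)$, with the trace term omitted as described in the text. The function names, the default step-size, and the use of a gradient-norm tolerance in place of the exact test ∇F = 0 are assumptions made for this illustration.

```python
import numpy as np


def grad_F(m_x, y, f, jac_f, lam_y, mu_x, Sigma_inv):
    """Approximate gradient of the free energy with the trace term omitted."""
    return lam_y * jac_f(m_x).T @ (y - f(m_x)) - Sigma_inv @ (m_x - mu_x)


def gradient_ascent(m0, grad, kappa=1e-2, tol=1e-8, max_iter=100_000):
    """Algorithm 1 with a gradient-norm tolerance instead of an exact zero test.

    grad  : callable returning the gradient at a point
    kappa : fixed step-size (assumed small enough for convergence)
    Returns the final iterate and the number of iterations performed.
    """
    m = np.asarray(m0, dtype=float)
    for k in range(max_iter):
        g = grad(m)
        if np.linalg.norm(g) <= tol:
            return m, k
        m = m + kappa * g               # ascent step: move along the gradient
    return m, max_iter
```

With a fixed step-size the scheme converges only if κ is small relative to the curvature of F, which is one reason why the more elaborate schemes discussed in the text are preferred in practice.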



From an optimization perspective, gradient ascent schemes for the maximization of nonlinear functions are suboptimal. Furthermore, as can be shown in simple univariate examples, the approximated gradient can easily fail to reliably identify the necessary condition for an extremal point. The gradient ascent scheme of Algorithm 1 is thus of little data-analytical relevance in functional neuroimaging. It is, however, of interest with respect to its speculative neurobiological implementation in the context of the free energy principle for perception as discussed below.

A more robust method is a globalized Newton scheme with Hessian modification and numerically evaluated gradients and Hessians, documented as Algorithm 2 (Ostwald and Starke, 2016). Intuitively, this algorithm works by approximating the target function F at the current iterand $m_x^{(k)}$ by a second-order Taylor expansion and analytically determining an extremal point of this approximation at each iteration. The location of this extremal point corresponds to the search direction $p_k$. If, however, the Hessian $H_F(m_x^{(k)})$ is not positive-definite, there is no guarantee that $p_k$ is an ascent direction. Especially in regions far away from a local extremal point, this can often be the case. Thus, a number of modification techniques have been developed that minimally change $H_F(m_x^{(k)})$, but render it positive-definite, such that an ascent direction is obtained. Finally, on each iteration the Newton step-size $t_k$ is determined so as to yield an increase in the target function, but with a step-size that is not too short. This approach is referred to as backtracking and the conditions for sensible step-lengths are given by the necessary and sufficient Wolfe-conditions. Notably, the algorithm documented as Algorithm 2 is a standard approach for the iterative optimization of a nonlinear function, and hence analytical results on its performance bounds are available. For a detailed discussion of this algorithm, see Nocedal and Wright (2006).

Algorithm 2: A globalized Newton method with Hessian modification

Initialization

1. Define a starting point $m_x^{(0)} \in \mathbb{R}^m$ and set k := 0. If $\nabla F(m_x^{(0)}) = 0$, stop! $m_x^{(0)}$ is a zero of ∇F. If not, proceed to iterations.

Until Convergence

1. Evaluate the Newton search direction $p_k := \left(H_F(m_x^{(k)})\right)^{-1}\nabla F(m_x^{(k)})$

2. If $p_k^T \nabla F(m_x^{(k)}) < 0$, $p_k$ is a descent direction. In this case, modify $H_F(m_x^{(k)})$ to render it positive definite.

3. Evaluate a step-size $t_k$ fulfilling the sufficient Wolfe-condition using the following algorithm: set $t_k := 1$ and select ρ ∈ ]0, 1[, c ∈ ]0, 1[. Until $F\left(m_x^{(k)} + t_k p_k\right) \geq F\left(m_x^{(k)}\right) + c\, t_k \nabla F\left(m_x^{(k)}\right)^T p_k$, set $t_k := \rho t_k$.

4. Set $m_x^{(k+1)} := m_x^{(k)} + t_k \left(H_F(m_x^{(k)})\right)^{-1}\nabla F(m_x^{(k)})$

5. If $\nabla F(m_x^{(k+1)}) = 0$, stop! $m_x^{(k+1)}$ is a zero of ∇F. If not, set k := k + 1 and go to 1.


Finally, the actual optimization scheme employed in much of the functional neuroimaging literature, because it has been implemented in SPM, is more specific. It derives from the local linearization method for nonlinear stochastic dynamical systems as suggested by Ozaki (1992) and is often formulated in differential equation form (cf. Friston et al., 2007a). We provide a standard iterative formulation of this approach as Algorithm 3. Note that the analytical and convergence properties of this algorithm are not very well understood and provide an interesting avenue for future research.

Algorithm 3: A local linearization-based gradient ascent algorithm

Initialization

1. Define a starting point $m_x^{(0)} \in \mathbb{R}^m$ and set k := 0. If $\nabla F(m_x^{(0)}) = 0$, stop! $m_x^{(0)}$ is a zero of ∇F. If not, proceed to iterations.

Until Convergence

1. Set $m_x^{(k+1)} := m_x^{(k)} + \left(\exp\left(\tau H_F(m_x^{(k)})\right) - I\right) H_F(m_x^{(k)})^{-1}\nabla F(m_x^{(k)})$

2. If $\nabla F(m_x^{(k+1)}) = 0$, stop! $m_x^{(k+1)}$ is a zero of ∇F. If not, set k := k + 1 and go to 1.


The free energy principle for perception

The free energy principle is a body of work that attempts to cast biological self-organization in terms of random dynamical systems and emerged under this label over the last two decades. With respect to cognition, the free energy principle aims to provide computational, algorithmic, and neurobiological-implementational descriptions of both sensation and action. A central tenet of the free energy principle is that cognitive agents engage in variational inference. In the following, we shall briefly discuss a subset of ideas that emerged in the application of variational inference as an algorithmic depiction of perception (cf. Friston, 2005; Bogacz, 2017).

The free energy principle for perception postulates that free energy maximization in hierarchical models can be neurobiologically implemented using predictive processing. In general, predictive processing theories of perception assert that neurobiological systems encode models of the world and reconcile their model-based predictions with incoming sensory information to arrive at perceptions. In hierarchical frameworks, the idea is that low-level areas represent prediction errors between the top-down conveyed predictions and the actual sensory representation. From the perspective of the free energy, we can integrate this intuitive idea and the formal idea of fixed-form mean-field variational inference for the non-linear Gaussian model (15.62) as follows (Figure 15.4): in the world, there exists a true, but unknown, state of the variable x, which we denote by x∗ and refer to as a cause. Based on a nonlinear transformation implemented by the function f, this true, but unknown, cause evokes an observation or point y in the sensorium of a cognitive agent. Under the free energy principle, the cognitive agent encodes a probabilistic representation p(y, x) of these environmental actualities: high-level cortices represent the prior distributions p(x), whereas low-level early sensory cortices represent the conditional distributions p(y|x). The computational problem of the neural architecture can then be formulated as the problem of inferring the true, but unknown, cause x∗ by means of the posterior distribution p(x|y). The free energy principle proposes that this inversion of the generative model p(y, x) is performed using fixed-form mean-field variational inference and provides a speculative account of this inversion in terms of neural representations. Note that in the current setting the cause is assumed to be static, i.e., it does not change over time. This may be viewed from two perspectives: either the model is assumed to represent the perception of static environmental causes (for example, because x∗ represents the static image of an object which is partially occluded by another object, where the nonlinear occlusion transformation is implemented in the function f), or the model is assumed to represent the perception of a dynamic environmental cause at a singular point in time.

With respect to the speculative neural implementation of the free energy maximization scheme that ensues by applying variational inference to the model formulated in (15.62), the update of the variational covariance parameter derived above plays only a minor role. Instead, the focus of the neurobiological interpretation of the numerical scheme is on the iterative updates of the variational expectation parameter $m_x$. To reflect this, we will consider the following simplified version of the variational free energy function of eq. (15.71):

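$$
F : \mathbb{R}^m \to \mathbb{R},\ m_x^{(k)} \mapsto F\left(m_x^{(k)}\right) := -\frac{\lambda_y}{2}\left(y - f\left(m_x^{(k)}\right)\right)^T \left(y - f\left(m_x^{(k)}\right)\right) - \frac{1}{2}\left(m_x^{(k)} - \mu_x\right)^T \Sigma_x^{-1} \left(m_x^{(k)} - \mu_x\right) \tag{15.98}
$$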

In this free energy function, the second term denotes the weighted deviation between the prior expectation parameter $\mu_x$ and the kth estimate of the variational expectation parameter, while the first term denotes the weighted deviation between the observed data y and the data prediction of the generative model based on the kth estimate of the variational expectation parameter. Under the free energy principle for perception, these terms are referred to as prediction error terms. By defining

$$
\varepsilon_y^{(k)} := \lambda_y^{1/2}\left(y - f\left(m_x^{(k)}\right)\right) \quad\text{and}\quad \varepsilon_x^{(k)} := \Sigma_x^{-1/2}\left(m_x^{(k)} - \mu_x\right) \tag{15.99}
$$



Figure 15.4. Free energy principle interpretation of fixed-form mean-field variational inference for a non-linear Gaussian model. Note that we assume that there exists a fixed true, but unknown, cause x∗ for the cognitive agent's sensory representation y. The perceptual act is conceived as inference, i.e., formation of the posterior distribution over causes in a generative model encoded by the agent. This generative model is assumed to mirror the nonlinear transformation from external cause to sensory representation in the form of its likelihood p(y|x) and be equipped with prior assumptions about possible external causes in the form of its marginal distribution p(x). Because the function f is assumed to be complicated, variational inference of the posterior distribution is invoked.

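the simplified variational free energy function (15.98) can then be rewritten as

$$
F : \mathbb{R}^m \to \mathbb{R},\ m_x^{(k)} \mapsto F\left(m_x^{(k)}\right) = -\frac{1}{2}\varepsilon_y^{(k)T}\varepsilon_y^{(k)} - \frac{1}{2}\varepsilon_x^{(k)T}\varepsilon_x^{(k)}. \tag{15.100}
$$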

In this formulation, it is readily seen that maximization of the variational free energy for a model of the form (15.62) corresponds to the minimization of the prediction error terms $\varepsilon_y^{(k)}$ and $\varepsilon_x^{(k)}$, because these enter (15.100) squared and with negative sign. Moreover, in terms of the probabilistic model, the second term in (15.100) refers to the relation of the current variational expectation parameter estimate $m_x^{(k)}$ to a parameter of the prior distribution p(x). From a Bayesian perspective, one may regard the prior distribution as "hierarchically higher" than the likelihood distribution, because, in ancestral sampling schemes generating data, one may first draw a value x from the prior distribution and then, based on this value, draw a realization y of the data from the distribution p(y|x). This suggests allocating the second term in (15.100) to a higher cortical area than the first term. In the current two-level model the first term in (15.100) involves the data y, which, in neural terms, is most readily conceived as being represented in a primary sensory area, such as the primary visual cortex. Based on a review of the anatomical and physiological basis of cortical systems, the free energy principle for perception then suggests a mapping between neural populations in cortical areas and the terms involved in eqs. (15.98), (15.99), and (15.100) as depicted in Figure 15.5.
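The prediction error formulation can be made concrete with a short NumPy sketch. It assumes that the data-level error is scaled by the square root of the precision $\lambda_y$, so that the sum of squared errors reproduces the two terms of (15.98). It also replaces the symmetric matrix square root $\Sigma_x^{-1/2}$ by a Cholesky-based whitening, which yields a different vector with the same squared norm. All function names are hypothetical.

```python
import numpy as np


def prediction_errors(m_x, y, f, lam_y, mu_x, Sigma_x):
    """Precision-weighted prediction errors at the data level and the prior level."""
    eps_y = np.sqrt(lam_y) * (y - f(m_x))
    # Whitening via the Cholesky factor L of Sigma_x (Sigma_x = L L^T):
    # z = L^{-1} d satisfies z^T z = d^T Sigma_x^{-1} d.
    L = np.linalg.cholesky(Sigma_x)
    eps_x = np.linalg.solve(L, m_x - mu_x)
    return eps_y, eps_x


def simplified_free_energy(eps_y, eps_x):
    """Negative half sum of squared prediction errors, cf. eq. (15.100)."""
    return -0.5 * eps_y @ eps_y - 0.5 * eps_x @ eps_x
```

Evaluating `simplified_free_energy` on the two error vectors should agree with a direct evaluation of the quadratic forms in (15.98).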

A toy example

To illustrate the foregoing discussion, we consider the model

x = µx + η (15.101)

y = f(x) + ε, (15.102)

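where $x, \mu_x, \eta \in \mathbb{R}^2$, $y, \varepsilon \in \mathbb{R}^{d^2}$, ε and η are distributed with densities $p(\varepsilon) = N\left(\varepsilon; 0, \lambda_y^{-1} I_{d^2}\right)$ and $p(\eta) = N\left(\eta; 0, \lambda_x^{-1} I_2\right)$ with $\lambda_y, \lambda_x > 0$, respectively, and f is the multivariate vector-valued function

$$
f : \mathbb{R}^2 \to \mathbb{R}^{d^2},\ x \mapsto f(x) := \left(\exp\left(-\frac{1}{\sigma}(c_i - x)^T (c_i - x)\right)\right)_{1 \leq i \leq d^2}. \tag{15.103}
$$

In (15.103), $c_i \in C$ for $i = 1, \ldots, d^2$ denotes a two-dimensional coordinate vector formed by the Cartesian product of an equipartition of an interval $[s_{\min}, s_{\max}] \subset \mathbb{R}$ with $d \in \mathbb{N}$ support points with itself:

$$
C := \{s_1, s_2, \ldots, s_d\} \times \{s_1, s_2, \ldots, s_d\} \subset \mathbb{R}^2. \tag{15.104}
$$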
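The observation function of eq. (15.103) and the grid of eq. (15.104) can be sketched in NumPy as follows. The function names, the ordering of the grid points, and the analytical Jacobian (obtained by differentiating the exponential) are assumptions of this illustration; the text does not specify an implementation.

```python
import numpy as np


def make_centres(s_min, s_max, d):
    """All d^2 points of the Cartesian product of an equipartition of [s_min, s_max]."""
    s = np.linspace(s_min, s_max, d)
    g1, g2 = np.meshgrid(s, s, indexing="ij")
    return np.column_stack([g1.ravel(), g2.ravel()])    # shape (d^2, 2)


def f(x, centres, sigma):
    """Observation function of eq. (15.103): one Gaussian bump value per centre c_i."""
    diff = centres - x
    return np.exp(-np.sum(diff ** 2, axis=1) / sigma)


def jac_f(x, centres, sigma):
    """Jacobian of f with respect to x, shape (d^2, 2)."""
    diff = centres - x
    # d/dx exp(-|c - x|^2 / sigma) = (2 / sigma) (c - x) exp(-|c - x|^2 / sigma)
    return (2.0 / sigma) * diff * f(x, centres, sigma)[:, None]


centres = make_centres(-2.5, 2.5, 30)
x_true = np.array([1.0, 1.0])
rng = np.random.default_rng(1)
# one noisy data realization with precision lambda_y = 100
y = f(x_true, centres, 0.4) + rng.normal(scale=100.0 ** -0.5, size=centres.shape[0])
```

Reshaping `y` to a d × d array gives an image in which the bump is centred near the grid point closest to the true location.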

The intuition of this model system is as follows: the true, but unknown, state of the world refers to a coordinate vector x∗ that refers to the location of an object (e.g., a mouse) in the plane. The cognitive agent (e.g., a cat) does not observe the location of the object directly, but only indirectly, upon transformation by the function f and under the addition of measurement noise. The function f can be thought of as a blanket that covers the object, which is hidden and only conveys information about the location x∗ of the object by means of a bump in the blanket. The task of the cognitive agent is to estimate the unobserved location of the object based on the sensory availability of this bump.

[Figure 15.5: schematic with a low, an intermediate, and a high level. Labelled quantities: data representation $y$, cause representation $m_x^{(k)}$, cause prior representation $\mu_x$, predictions $f(m_x^{(k)})$ and $\mu_x$, prediction errors $\varepsilon_y^{(k)}$ and $\varepsilon_x^{(k)}$, and the associated gradient components.]

Figure 15.5. Predictive processing as a speculative neural implementation of the free energy principle for perception. The figure depicts the speculative neurobiological implementation of the gradient ascent scheme described in the main text. External information is projected onto data representing units residing in deep layers (V) of early sensory (low-level) cortices. Neuronal units lying in more superficial layers (III) of low-level cortices are supposed to encode the prediction-error signal, resulting from the difference between their sensory input from lower layers at the same level and data predictions formed by the projection of deep intermediate-level units. In turn, the superficial low-level units are assumed to project the gradient component that allows for adjusting the prediction at the deep intermediate-level units. These units, in correspondence to the deep data-representing units in low-level cortices, are assumed to encode the predicted causes of sensory information. Superficial units at this intermediate level in turn are thought to represent the prediction error between higher-order causes, specifically, the prior expectations projected down from deep high-level representation units and the cause representations obtained from the deep units at their own level. The superficial units of the intermediate level in turn are speculated to project a gradient component to the deep units at their level, adjusting the prediction of the intermediate level projected to the lowest level. Note that in the current scheme, the prior predictions at the highest level are fixed.

To model the task of the agent, the fixed-form variational inference approach was employed to estimate the variational parameter $m_x \in \mathbb{R}^2$ using the simplified variational free energy (15.98). Figure 15.6A depicts the true, but unknown, location of the object $x^* = (1, 1)^T$, the prior parameter $\mu_x = (0, 0)^T$, which corresponds to the initialization of the variational parameter $m_x^{(0)}$, and an isocontour of the prior distribution over $x$ specified by $m_x^{(0)}$ and $\lambda_x = 70$. Figure 15.6B depicts a data realization where $s_{\min} = -2.5$, $s_{\max} = 2.5$, $d = 30$, $\sigma = 0.4$ and $\lambda_y = 100$. Figure 15.6C and Figure 15.6E depict the negative variational free energy landscape as a function of $m_x \in \mathbb{R}^2$. Note that in accordance with the nonlinear optimization literature, we here consider minimization of the negative variational free energy function, rather than maximization of the (positive) variational free energy function. The negative variational free energy has a minimum in the region of the true, but unknown, value $x^*$, and a slight depression around the location of the prior parameter $\mu_x$. Figure 15.6C and Figure 15.6D depict the application of the globalized Newton approach with Hessian modification and the ensuing squared and precision-weighted prediction errors at the low (blue) and high (red) level. During the first few iterations, the iterands $m_x^{(i)}$ partially leave the depicted free energy surface and alternate between two locations while making little progress towards the minimum. After approximately ten iterations, the iterands enter the minimum's well and quickly converge to the minimum. With a gradient norm convergence criterion of $10^{-3}$, convergence is reached after 30 iterations. Figure 15.6E and Figure 15.6F depict the analogous results for the local linearization approach. Here, based on a temporal regularization parameter of $t = 10^{-4}$, a local search is performed, which leads the iterands directly into the minimum's well, but a few more iterations are required for convergence under the same gradient norm criterion. For both methods, the prediction errors at the low level decrease, while those at the high level increase minimally, owing to the deviation of the variational expectation parameter at convergence from the prior parameter.
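The structure of this optimization problem can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation behind Figure 15.6: the observation function `f` is a hypothetical Gaussian bump on a square sensor grid, the data are noise-free, the negative variational free energy is reduced to the two precision-weighted squared prediction error terms (constants dropped), and plain gradient descent with backtracking stands in for the globalized Newton and local linearization schemes. All function names and parameter values (grid size 11, bump width 0.8, prior precision 1) are illustrative choices rather than the values used in the text.

```python
import math

def make_grid(s_min, s_max, d):
    """Return the d*d sensor locations of a square grid on [s_min, s_max]^2."""
    step = (s_max - s_min) / (d - 1)
    axis = [s_min + i * step for i in range(d)]
    return [(a, b) for a in axis for b in axis]

def f(x, grid, sigma):
    """Hypothetical observation function: a Gaussian bump centred on x."""
    return [math.exp(-((s[0] - x[0]) ** 2 + (s[1] - x[1]) ** 2) / (2 * sigma ** 2))
            for s in grid]

def neg_free_energy(m, y, mu, lam_y, lam_x, grid, sigma):
    """Sum of precision-weighted squared prediction errors (constants dropped)."""
    pred = f(m, grid, sigma)
    low = sum((yj - pj) ** 2 for yj, pj in zip(y, pred))      # low-level error
    high = sum((mi - ui) ** 2 for mi, ui in zip(m, mu))       # high-level error
    return 0.5 * lam_y * low + 0.5 * lam_x * high

def gradient(m, y, mu, lam_y, lam_x, grid, sigma):
    """Gradient of neg_free_energy with respect to the 2-D parameter m."""
    pred = f(m, grid, sigma)
    g = [lam_x * (m[i] - mu[i]) for i in range(2)]
    for s, yj, pj in zip(grid, y, pred):
        for i in range(2):
            jac = pj * (s[i] - m[i]) / sigma ** 2             # d f_j / d m_i
            g[i] -= lam_y * jac * (yj - pj)
    return g

def minimize(m0, y, mu, lam_y, lam_x, grid, sigma, tol=1e-3, max_iter=500):
    """Gradient descent with backtracking; stops when the gradient norm < tol."""
    m = list(m0)
    args = (y, mu, lam_y, lam_x, grid, sigma)
    n_iter = 0
    for n_iter in range(max_iter):
        g = gradient(m, *args)
        g_norm = math.hypot(g[0], g[1])
        if g_norm < tol:
            break
        v, t = neg_free_energy(m, *args), 1.0
        for _ in range(60):                                   # halve step until sufficient decrease
            cand = [m[0] - t * g[0], m[1] - t * g[1]]
            if neg_free_energy(cand, *args) <= v - 0.3 * t * g_norm ** 2:
                break
            t *= 0.5
        m = cand
    return m, n_iter

grid = make_grid(-2.5, 2.5, 11)
x_true, mu_x = (1.0, 1.0), (0.0, 0.0)
y = f(x_true, grid, 0.8)                                      # noise-free data (assumption)
m_hat, n_iter = minimize(mu_x, y, mu_x, 100.0, 1.0, grid, 0.8)
```

With the wide bump and weak prior assumed here, the descent started at the prior expectation should end close to the true location; a narrower bump or a stronger prior changes the landscape and can prevent this.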

Figure 15.6. Example application of the fixed-form variational inference approach. For a detailed description, please refer to the main text (pmfn 14.m).

Naturally, the nonlinear optimization scheme that lies at the heart of the fixed-form mean-field variational inference method discussed here suffers from the usual problems of (non-global) nonlinear optimization approaches. Figure 15.7 depicts a number of scenarios in which both the globalized Newton method and the local linearization method will have difficulty identifying the correct minimum of the negative variational free energy. Of particular importance for the successful application of these nonlinear optimization methods in the current scenario are the absolute values of, and the ratio between, the precision parameters $\lambda_x$ and $\lambda_y$. Figure 15.7A depicts a scenario with low prior precision (the Gaussian isocontour is so large that it falls outside the space depicted in the leftmost panel) and low noise precision. In this case, the data (middle panel) convey little information about the true, but unknown, location of the object, and the free energy surface has a wavy profile with no clear structure. As the algorithm is initialized in a local minimum corresponding to the prior expectation, numerical nonlinear optimization approaches are likely to fail. Figure 15.7B depicts a scenario in which the ratio between $\lambda_x$ and $\lambda_y$ renders the negative variational free energy surface virtually flat at the location of the prior expectation. As a consequence, gradient descent schemes will be initialized with steps around zero and usually fail to make sufficient progress towards the local minimum. Finally, Figure 15.7C depicts a scenario in which the prior distribution has relatively high precision, but the negative variational free energy function still has a global minimum at the data-supported location of the true, but unknown, coordinates. In this case, the negative free energy surface has a lot of structure, but if the algorithm is initialized in the local minimum of the prior expectation, it will usually remain there and fail to identify the global minimum.
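The last failure mode can be reproduced in a one-dimensional toy version of the problem. The sketch below is an illustration under assumptions, not a reconstruction of Figure 15.7: it uses a hypothetical narrow Gaussian-bump observation function, noise-free data, and illustrative parameter values, and it contrasts a local descent started at the prior expectation with a brute-force grid search, which is used here only to locate the global minimum for comparison.

```python
import math

SENSORS = [-2.5 + 0.1 * j for j in range(51)]    # hypothetical 1-D sensor grid
SIGMA, LAM_Y, LAM_X = 0.2, 100.0, 10.0           # narrow bump, illustrative precisions
X_TRUE, MU_X = 1.5, 0.0                          # true location and prior expectation

def f(x):
    """Hypothetical observation function: a narrow Gaussian bump centred on x."""
    return [math.exp(-(s - x) ** 2 / (2 * SIGMA ** 2)) for s in SENSORS]

Y = f(X_TRUE)                                    # noise-free data (assumption)

def neg_free_energy(m):
    """Precision-weighted squared prediction errors at both levels."""
    low = sum((yj - pj) ** 2 for yj, pj in zip(Y, f(m)))
    return 0.5 * LAM_Y * low + 0.5 * LAM_X * (m - MU_X) ** 2

def gradient(m):
    """Derivative of neg_free_energy with respect to m."""
    g = LAM_X * (m - MU_X)
    for s, yj, pj in zip(SENSORS, Y, f(m)):
        g -= LAM_Y * pj * (s - m) / SIGMA ** 2 * (yj - pj)
    return g

def local_descent(m, tol=1e-3, max_iter=200):
    """Gradient descent with backtracking, started at m."""
    for _ in range(max_iter):
        g = gradient(m)
        if abs(g) < tol:
            break
        v, t = neg_free_energy(m), 1.0
        for _ in range(60):                      # halve step until sufficient decrease
            cand = m - t * g
            if neg_free_energy(cand) <= v - 0.3 * t * g * g:
                break
            t *= 0.5
        m = cand
    return m

m_local = local_descent(MU_X)                    # local search from the prior expectation
candidates = [-2.5 + 0.01 * i for i in range(501)]
m_grid = min(candidates, key=neg_free_energy)    # brute-force global search
```

Because the bump around the prior expectation barely overlaps the bump in the data, the gradient at the starting point is almost zero, so `m_local` stays near the prior expectation while `m_grid` lies near the true location and attains a lower negative free energy.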

15.4 Bibliographical remarks

Introductions to variational inference can be found in all standard machine learning textbooks, such as Bishop (2006), Barber (2012), and Murphy (2012). More recent reviews include Blei et al. (2016) and, in the context of the resurgence of neural network approaches, Doersch (2016). The widespread use of variational inference in functional neuroimaging based on its SPM implementation originates from Friston et al. (2007a). Finally, recent technical reviews of the free energy principle include Bogacz (2017) and Buckley et al. (2017).

15.5 Study questions

1. For a probabilistic model p(y, ϑ), write down the log marginal likelihood decomposition that forms the core of the variational inference approach and discuss its components.



Figure 15.7. Example scenarios in the application of fixed-form variational inference to a nonlinear Gaussian model. For a detailed description, please refer to the main text (pmfn 14.m).

2. Write down the definition of the Kullback-Leibler divergence. What does the Kullback-Leibler divergence measure?

3. Write down the definition of the variational free energy and explain its importance in variational inference.

4. Write down the free-form mean-field variational inference theorem and explain its relevance.

5. What are the commonalities and differences between free-form mean-field and fixed-form mean-field variational inference?

6. What is the central postulate of the free energy principle for perception?

Study questions answers

1. For a probabilistic model $p(y, \vartheta)$ with observable random variables $y$ and unobservable random variables $\vartheta$, the log marginal likelihood decomposition

$$\ln p(y) = F(q(\vartheta)) + KL(q(\vartheta) \| p(\vartheta|y)) \tag{15.105}$$

forms the core of the variational inference approach. Here, $\ln p(y)$ denotes the logarithm of the marginal likelihood $p(y) = \int p(y, \vartheta) \, d\vartheta$, $F(q(\vartheta))$ denotes the variational free energy evaluated for the variational distribution $q(\vartheta)$ that serves as an approximation of the posterior distribution $p(\vartheta|y)$, and $KL(q(\vartheta) \| p(\vartheta|y))$ denotes the KL divergence between the variational distribution $q(\vartheta)$ and the true posterior distribution $p(\vartheta|y)$.



2. The Kullback-Leibler divergence of two probability distributions specified in terms of the probability density functions $q(x)$ and $p(x)$ is defined as

$$KL(q(x) \| p(x)) = \int q(x) \ln\left(\frac{q(x)}{p(x)}\right) dx. \tag{15.106}$$

The Kullback-Leibler divergence measures the dissimilarity of probability distributions.

3. For a probabilistic model $p(y, \vartheta)$ comprising observable random variables $y$ and unobservable random variables $\vartheta$ and a variational distribution $q(\vartheta)$ that serves as an approximation of the posterior density $p(\vartheta|y)$, the variational free energy is defined as

$$F(q(\vartheta)) := \int q(\vartheta) \ln\left(\frac{p(y, \vartheta)}{q(\vartheta)}\right) d\vartheta. \tag{15.107}$$

Because the variational free energy is a lower bound on the log marginal likelihood $\ln p(y)$, maximizing the variational free energy with respect to the variational distribution renders the variational free energy an approximation to the log marginal likelihood and, given the non-negativity of the KL divergence, renders the variational distribution an approximation to the true posterior distribution $p(\vartheta|y)$.

4. The free-form mean-field variational inference theorem states that the variational distribution that maximizes the variational free energy with respect to the $s$th partition of the mean-field representation of $q(\vartheta)$ can be determined according to

$$q(\vartheta_s) \propto \exp\left(\int q(\vartheta_{\setminus s}) \ln p(y, \vartheta) \, d\vartheta_{\setminus s}\right), \tag{15.108}$$

where $q(\vartheta_{\setminus s})$ denotes the variational density over all unobservable random variables not in the $s$th partition. Partition-wise application of the free-form mean-field variational inference theorem allows for developing coordinate-wise free-form variational free energy maximization algorithms.

5. Both free- and fixed-form mean-field variational inference rest on maximizing the variational free energy with respect to a factorized variational distribution $q(\vartheta) = \prod_{i=1}^{k} q(\vartheta_i)$ over unobserved random variables $\vartheta$ that serves as an approximation to the true posterior distribution $p(\vartheta|y)$ of a probabilistic model $p(y, \vartheta)$. For both free- and fixed-form mean-field variational inference, the factorization of the variational distribution is referred to as a mean-field approximation. Free-form mean-field variational inference uses a central result from variational calculus to determine the functional form of the variational distributions and their parameters. In contrast, fixed-form mean-field variational inference pre-defines the functional form of the variational distributions, evaluates the variational free energy integral to render it a multivariate real-valued function of the variational distribution parameters, and maximizes the resulting function using standard techniques of nonlinear optimization.

6. The central postulate of the free energy principle for perception is that perception corresponds to variational inference.
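The quantities in answers 1 to 4 can be checked numerically on a small discrete model, where the integrals become sums. The following Python sketch is an illustration only: the joint table `JOINT` (the values of the joint density at the observed data for two binary unobservable variables), the initial variational distributions, and all function names are hypothetical. It applies the free-form mean-field update partition by partition and records the variational free energy and the KL divergence to the true posterior after each sweep.

```python
import math

# Hypothetical values of p(y, t1, t2) at the observed y for two binary unobservables.
JOINT = [[0.30, 0.05],
         [0.05, 0.20]]
LOG_EVIDENCE = math.log(sum(sum(row) for row in JOINT))    # ln p(y)

def free_energy(q1, q2):
    """F(q) = sum over t of q(t) ln(p(y, t) / q(t)) for q(t1, t2) = q1(t1) q2(t2)."""
    return sum(q1[i] * q2[j] * (math.log(JOINT[i][j]) - math.log(q1[i] * q2[j]))
               for i in range(2) for j in range(2))

def kl_to_posterior(q1, q2):
    """KL(q || p(t | y)) with the true posterior p(t | y) = p(y, t) / p(y)."""
    p_y = math.exp(LOG_EVIDENCE)
    return sum(q1[i] * q2[j] * math.log(q1[i] * q2[j] * p_y / JOINT[i][j])
               for i in range(2) for j in range(2))

def normalize(w):
    total = sum(w)
    return [v / total for v in w]

def update_q1(q2):
    """Mean-field update: q(t1) proportional to exp(sum_t2 q(t2) ln p(y, t1, t2))."""
    return normalize([math.exp(sum(q2[j] * math.log(JOINT[i][j]) for j in range(2)))
                      for i in range(2)])

def update_q2(q1):
    """Mean-field update: q(t2) proportional to exp(sum_t1 q(t1) ln p(y, t1, t2))."""
    return normalize([math.exp(sum(q1[i] * math.log(JOINT[i][j]) for i in range(2)))
                      for j in range(2)])

q1, q2 = [0.5, 0.5], [0.6, 0.4]                            # arbitrary initialization
trace = [(free_energy(q1, q2), kl_to_posterior(q1, q2))]
for _ in range(20):                                        # coordinate-wise sweeps
    q1 = update_q1(q2)
    q2 = update_q2(q1)
    trace.append((free_energy(q1, q2), kl_to_posterior(q1, q2)))
```

In every entry of `trace`, the free energy and the KL divergence sum to `LOG_EVIDENCE`, the free energy does not decrease from one sweep to the next, and it stays below the log evidence because the posterior of this example does not factorize.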

