1 Appendix: Common distributions
This Appendix provides details for common univariate and multivariate distributions, including definitions, moments, and simulation. Devroye (1986) provides a complete treatment of random number generation, although care must be taken as many distributions can be parameterized in different ways.
Uniform
• A random variable X has a uniform distribution on the interval [α, β], denoted U(α, β), if the probability density function (pdf) is

p(x|α, β) = 1/(β − α)

for x ∈ [α, β] and 0 otherwise. The mean and variance of a uniform random variable are E(X) = (α + β)/2 and var(X) = (β − α)²/12, respectively.
• The uniform distribution plays a foundational role in random number generation. In
particular, uniform random numbers are required for the inverse transform simulation
method, accept-reject algorithms, and the Metropolis algorithm. Fast and accurate
pre-programmed algorithms are available in most statistical software packages and
programming languages.
Bernoulli
• A random variable X has a Bernoulli distribution with parameter θ, denoted X ∼ Ber(θ), if the probability mass function (pmf) is

Prob(X = x|θ) = θ^x (1 − θ)^{1−x}

for x ∈ {0, 1}. The mean and variance of a Bernoulli random variable are E(X) = θ and var(X) = θ(1 − θ), respectively.
• To simulate X ∼ Ber(θ),
1. Draw U ∼ U(0, 1)
2. Set X = 1 if U < θ, and X = 0 otherwise.
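The two steps above can be sketched in Python using only the standard library (the function name rbernoulli is ours):

```python
import random

def rbernoulli(theta, rng=random):
    """Draw X ~ Ber(theta): X = 1 if a uniform draw falls below theta, else 0."""
    return 1 if rng.random() < theta else 0

# Sanity check: the frequency of ones should be close to theta.
random.seed(1)
draws = [rbernoulli(0.3) for _ in range(100_000)]
```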
Binomial
• A random variable X ∈ {0, 1, ..., n} has a Binomial distribution with parameters n and θ, denoted X ∼ Bin(n, θ), if the pmf is

Prob(X = x|n, θ) = n!/(x!(n − x)!) θ^x (1 − θ)^{n−x},

where n! = n(n − 1) ··· 2 · 1. The mean and variance of a Binomial random variable are E(X) = nθ and var(X) = nθ(1 − θ), respectively. The Binomial distribution arises as the distribution of a sum of n independent Bernoulli trials, and is closely related to a number of other distributions. If W_1, ..., W_n are i.i.d. Ber(p), then W_1 + ··· + W_n ∼ Bin(n, p). As n → ∞ with np = λ held fixed, X ∼ Bin(n, p) converges in distribution to a Poisson distribution with parameter λ.
• To simulate X ∼ Bin(n, θ),
1. Draw X_1, ..., X_n independently, X_i ∼ Ber(θ)
2. Set X = ∑_{i=1}^n X_i, the count of X_i = 1.
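A minimal Python sketch of this sum-of-Bernoullis construction (standard library only; the function name rbinomial is ours):

```python
import random

def rbinomial(n, theta, rng=random):
    """Draw X ~ Bin(n, theta) as the number of successes in n Bernoulli(theta) trials."""
    return sum(1 for _ in range(n) if rng.random() < theta)

random.seed(2)
samples = [rbinomial(10, 0.5) for _ in range(50_000)]
```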
Multinomial
• A vector of random variables X = (X_1, ..., X_k) has a Multinomial distribution, denoted X ∼ Mult(n, p_1, ..., p_k), if

Prob(X = x|p_1, ..., p_k) = n!/(x_1! ··· x_k!) ∏_{i=1}^k p_i^{x_i},

where ∑_{i=1}^k x_i = n. The Multinomial distribution is a natural extension of the Bernoulli and Binomial distributions. The Bernoulli distribution gives a single trial resulting in success or failure. The Binomial distribution is an extension that involves n independently repeated Bernoulli trials. The Multinomial allows for multiple outcomes, instead of the two outcomes in the Binomial distribution. There are still n total trials, but now the outcome of each trial is assigned to one of k categories, and x_i counts the number of outcomes in category i. The probability of category i is p_i. The mean, variance, and covariances of the Multinomial distribution are given by

E(X_i) = np_i, var(X_i) = np_i(1 − p_i), and cov(X_i, X_j) = −np_i p_j.
Multinomial distributions are often used in modeling finite mixture distributions
where the Multinomial random variables represent the various mixture components.
Dirichlet
• A vector of random variables X = (X_1, ..., X_k) has a Dirichlet distribution, denoted X ∼ D(α_1, ..., α_k), if ∑_{i=1}^k X_i = 1 and

p(x|α_1, ..., α_k) = Γ(∑_{i=1}^k α_i)/(Γ(α_1) ··· Γ(α_k)) ∏_{i=1}^k x_i^{α_i−1}.

The Dirichlet distribution is used as a prior for mixture probabilities in mixture models. Writing α_0 = ∑_{i=1}^k α_i, the mean, variance, and covariances of the Dirichlet distribution are

E(X_i) = α_i/α_0, var(X_i) = α_i ∑_{j≠i} α_j / (α_0²(α_0 + 1)), and cov(X_i, X_j) = −α_i α_j / (α_0²(α_0 + 1)).

• To simulate a Dirichlet X = (X_1, ..., X_k) ∼ D(α_1, ..., α_k), use the two-step procedure
Step 1: Draw k independent Gammas, Y_i ∼ G(α_i, 1)
Step 2: Set X_i = Y_i / ∑_{j=1}^k Y_j.
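The normalize-independent-Gammas procedure can be sketched in Python (standard library only; random.gammavariate(a, 1.0) draws a G(a, 1) variate; the function name rdirichlet is ours):

```python
import random

def rdirichlet(alphas, rng=random):
    """Draw X ~ D(alpha_1,...,alpha_k): normalize independent Gamma(alpha_i, 1) draws."""
    ys = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(ys)
    return [y / total for y in ys]

random.seed(3)
samples = [rdirichlet([2.0, 3.0, 5.0]) for _ in range(20_000)]
```

Each draw lies on the simplex, and the sample mean of component i approaches α_i/α_0.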
Poisson
• A random variable X ∈ {0, 1, 2, ...} (the non-negative integers) has a Poisson distribution with parameter λ, denoted X ∼ Poi(λ), if the pmf is

Prob(X = x|λ) = e^{−λ} λ^x / x!.

The mean and variance of a Poisson random variable are E(X) = λ and var(X) = λ, respectively.
• To simulate X ∼ Poi(λ),
1. Draw Z_1, Z_2, ... independently, Z_i ∼ exp(1)
2. Set X = max{n ≥ 0 : ∑_{i=1}^n Z_i ≤ λ},
the number of unit-rate exponential arrivals in [0, λ].
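A sketch of this arrival-counting construction in Python (standard library only; random.expovariate(1.0) draws a unit-mean exponential; the function name rpoisson is ours):

```python
import random

def rpoisson(lam, rng=random):
    """Draw X ~ Poi(lam) by counting unit-rate exponential arrivals in [0, lam]."""
    x, total = 0, rng.expovariate(1.0)
    while total <= lam:
        x += 1
        total += rng.expovariate(1.0)
    return x

random.seed(4)
samples = [rpoisson(4.0) for _ in range(50_000)]
```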
Exponential
• A random variable X ∈ R_+ has an exponential distribution with parameter µ, denoted X ∼ exp(µ), if the pdf is

p(x|µ) = (1/µ) exp(−x/µ).

The mean and variance of an exponential random variable are E(X) = µ and var(X) = µ², respectively.
• The inverse transform method is the easiest way to simulate exponential random variables, since the cumulative distribution function (cdf) is F(x) = 1 − e^{−x/µ}.
• To simulate X ∼ exp (µ),
1. Draw U ∼ U [0, 1]
2. Set X = −µ ln (1− U) .
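These two steps can be sketched in Python (standard library only; the function name rexp is ours):

```python
import math
import random

def rexp(mu, rng=random):
    """Inverse-transform draw from exp(mu) (mean mu): X = -mu * ln(1 - U)."""
    return -mu * math.log(1.0 - rng.random())

random.seed(5)
samples = [rexp(2.0) for _ in range(100_000)]
```

Using 1 − U rather than U avoids log(0), since random.random() returns values in [0, 1).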
Gamma
• A random variable X ∈ R_+ has a Gamma distribution with parameters α and β, denoted X ∼ G(α, β), if the pdf is

p(x|α, β) = β^α/Γ(α) x^{α−1} exp(−βx).

The mean and variance of a Gamma random variable are E(X) = α/β and var(X) = α/β², respectively. It is important to realise that there are different parameterizations of the Gamma distribution; e.g. MATLAB parameterizes the Gamma density as

p(x|α, β) = 1/(Γ(α)β^α) x^{α−1} exp(−x/β).

If X = Y/β and Y ∼ G(α, 1), then X ∼ G(α, β). To see this, use the inverse transform Y = βX with dY/dX = β, which implies that

p(x|α, β) = 1/Γ(α) (βx)^{α−1} exp(−βx) β = β^α/Γ(α) x^{α−1} exp(−βx),

the density of a G(α, β) random variable. The exponential distribution is a special case of the Gamma distribution when α = 1: X ∼ G(1, µ^{−1}) implies that X ∼ exp(µ).
• Gamma random variable simulation is standard, with built-in generators in most software packages. These algorithms typically use accept/reject algorithms that are customized to the specific values of α and β. To simulate X ∼ G(α, β) when α is integer-valued,
1. Draw X_1, ..., X_α independently, X_i ∼ exp(1)
2. Set X = (1/β) ∑_{i=1}^α X_i.
For non-integer α, accept-reject methods provide fast and accurate algorithms for Gamma simulation.
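For integer α, the sum-of-exponentials construction can be sketched in Python (standard library only; the function name rgamma_int is ours; β is the rate, matching the G(α, β) parameterization with mean α/β):

```python
import math
import random

def rgamma_int(alpha, beta, rng=random):
    """Draw X ~ G(alpha, beta) (rate beta) for integer alpha: scale a sum of
    alpha unit exponentials by 1/beta."""
    return sum(-math.log(1.0 - rng.random()) for _ in range(alpha)) / beta

random.seed(6)
samples = [rgamma_int(3, 2.0) for _ in range(50_000)]
```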
Beta
• A random variable X ∈ [0, 1] has a Beta distribution with parameters α and β, denoted X ∼ B(α, β), if the pdf is

p(x|α, β) = x^{α−1}(1 − x)^{β−1}/B(α, β),

where B(α, β) = Γ(α)Γ(β)/Γ(α + β) is the Beta function. As ∫ p(x|α, β) dx = 1, we have B(α, β) = ∫_0^1 x^{α−1}(1 − x)^{β−1} dx. The mean and variance of a Beta random variable are

E(X) = α/(α + β) and var(X) = αβ/((α + β)²(α + β + 1)),

respectively. If α = β = 1, then X ∼ U(0, 1).
• If α and β are integers, to simulate X ∼ B(α, β),
1. Draw X_1 ∼ G(α, 1) and X_2 ∼ G(β, 1)
2. Set X = X_1/(X_1 + X_2).
When α is integer-valued and β is non-integer-valued (a common case in Bayesian inference), exponential random variables can be used to simulate Beta random variables via the transformation method.
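The ratio-of-Gammas construction can be sketched in Python (standard library only; the function name rbeta is ours):

```python
import random

def rbeta(alpha, beta, rng=random):
    """Draw X ~ B(alpha, beta) as X1/(X1+X2) with X1 ~ G(alpha,1), X2 ~ G(beta,1)."""
    x1 = rng.gammavariate(alpha, 1.0)
    x2 = rng.gammavariate(beta, 1.0)
    return x1 / (x1 + x2)

random.seed(7)
samples = [rbeta(2, 3) for _ in range(50_000)]
```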
Chi-squared
• A random variable X ∈ R_+ has a Chi-squared distribution with parameter ν, denoted X ∼ X²_ν, if the pdf is

p(x|ν) = 1/(2^{ν/2} Γ(ν/2)) x^{ν/2−1} exp(−x/2).

The mean and variance of X are E(X) = ν and var(X) = 2ν, respectively. The X²_ν distribution is a special case of the Gamma distribution: X²_ν = G(ν/2, 1/2).
• Simulating chi-squared random variables typically uses the transformation method. For integer values of ν, the following two-step procedure simulates a X²_ν random variable:
Step 1: Draw Z_1, ..., Z_ν independently, Z_i ∼ N(0, 1)
Step 2: Set X = ∑_{i=1}^ν Z_i².
When ν is large, simulating using normal random variables is computationally costly, and alternative, more computationally efficient algorithms use Gamma random variable generation.
Inverse Gamma
• A random variable X ∈ R_+ has an inverse Gamma distribution, denoted X ∼ IG(α, β), if the pdf is

p(x|α, β) = β^α/Γ(α) x^{−(α+1)} exp(−β/x).

The mean (for α > 1) and variance (for α > 2) of the inverse Gamma distribution are

E(X) = β/(α − 1) and var(X) = β²/((α − 1)²(α − 2)).

If Y ∼ G(α, β), then X = Y^{−1} ∼ IG(α, β). To see this, substitute y = 1/x (so dy = −x^{−2} dx, which flips the limits of integration) in

1 = ∫_0^∞ β^α/Γ(α) y^{α−1} exp(−βy) dy = ∫_0^∞ β^α/Γ(α) (1/x)^{α−1} exp(−β/x) (1/x²) dx
  = ∫_0^∞ β^α/Γ(α) x^{−(α+1)} exp(−β/x) dx.
• The following two steps simulate an IG(α, β) random variable:
Step 1: Draw Y ∼ G(α, 1)
Step 2: Set X = β/Y.
Again, as in the case of the Gamma distribution, some authors use a different parameterization for this distribution, so it is important to make sure you are drawing using the correct parameters. In the case of prior distributions over the scale, σ², it is additionally complicated because some authors (Zellner, 1971) parameterize σ instead of σ².
Pareto
• A random variable X ∈ R_+ has a Pareto distribution, denoted Par(α, β), if the pdf for x > β and α > 0 is

p(x|α, β) = αβ^α/x^{α+1}.

The mean and variance are E(X) = αβ/(α − 1) if α > 1 and var(X) = αβ²/((α − 1)²(α − 2)) if α > 2.
• The following two steps simulate a Par(α, β) random variable:
Step 1: Draw Y ∼ exp(1/α), an exponential with mean 1/α (rate α)
Step 2: Set X = βe^Y.
Normal
• A random variable X ∈ R has a normal distribution with parameters µ and σ², denoted X ∼ N(µ, σ²), if the pdf is

p(x|µ, σ²) = 1/√(2πσ²) exp(−(x − µ)²/(2σ²)).

We will also use φ(x|µ, σ²) to denote the pdf and Φ(x|µ, σ²) the cdf. The mean and variance are E(X) = µ and var(X) = σ².
• Given the importance of normal random variables, all software packages have func-
tions to draw normal random variables. The algorithms typically use transformation
methods drawing uniform and exponential random variables or look-up tables. The
classic algorithm is the Box-Muller approach based on simulating uniforms.
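The Box-Muller transform can be sketched in Python (standard library only; the function name box_muller is ours):

```python
import math
import random

def box_muller(rng=random):
    """Box-Muller: map two independent uniforms to two independent N(0,1) draws."""
    u1 = 1.0 - rng.random()                # in (0, 1], avoids log(0)
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))     # r^2 = -2 ln U1 is a chi-squared(2) draw
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

random.seed(8)
pairs = [box_muller() for _ in range(50_000)]
samples = [z for pair in pairs for z in pair]
```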
Log-Normal
• A random variable X ∈ R_+ has a lognormal distribution with parameters µ and σ², denoted X ∼ LN(µ, σ²), if the pdf is

p(x|µ, σ²) = 1/√(2πσ²) (1/x) exp(−(ln x − µ)²/(2σ²)).

The mean and variance of the log-normal distribution are E(X) = e^{µ+σ²/2} and var(X) = exp(2µ + σ²)(exp(σ²) − 1). It is related to a normal distribution via the transformation X = e^{µ+σZ}, where Z ∼ N(0, 1). Although all finite moments of the lognormal exist, the distribution does not admit a moment-generating function.
• Simulating lognormal random variables via the transformation method is straightforward, since X = e^{µ+σZ}, where Z ∼ N(0, 1), is LN(µ, σ²).
Truncated Normal
• A random variable X has a truncated normal distribution, denoted T N(µ, σ²), with parameters µ, σ² and truncation region (a, b), if the pdf is

p(x|a < x < b) = φ(x|µ, σ²)/(Φ(b|µ, σ²) − Φ(a|µ, σ²)),

where ∫_{−∞}^b φ(x|µ, σ²) dx = Φ(b|µ, σ²). The mean of a truncated normal distribution is

E(X|a < X < b) = µ + σ (φ_a − φ_b)/(Φ_b − Φ_a),

where φ_x is the standard normal density evaluated at (x − µ)/σ and Φ_x is the standard normal cdf evaluated at (x − µ)/σ.
• The inversion method can be used to simulate this distribution. A two-step algorithm that provides a draw from a truncated standard normal is
Step 1: Draw U ∼ U[0, 1]
Step 2: Set X = Φ^{−1}[Φ(a) + U(Φ(b) − Φ(a))],
where Φ(a) = ∫_{−∞}^a (2π)^{−1/2} exp(−x²/2) dx. For the general truncated normal,
Step 1: Draw U ∼ U[0, 1]
Step 2: Set X = µ + σΦ^{−1}[Φ((a − µ)/σ) + U(Φ((b − µ)/σ) − Φ((a − µ)/σ))],
where Φ^{−1} is the inverse of the cdf.
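The general inversion scheme can be sketched in Python (statistics.NormalDist supplies the standard normal cdf and inverse cdf; the function name rtruncnorm is ours):

```python
import random
from statistics import NormalDist

def rtruncnorm(mu, sigma, a, b, rng=random):
    """Inverse-CDF draw from N(mu, sigma^2) truncated to (a, b)."""
    std = NormalDist()                     # standard normal cdf and inverse cdf
    Fa = std.cdf((a - mu) / sigma)
    Fb = std.cdf((b - mu) / sigma)
    u = rng.random()
    return mu + sigma * std.inv_cdf(Fa + u * (Fb - Fa))

random.seed(9)
samples = [rtruncnorm(0.0, 1.0, -1.0, 1.0) for _ in range(20_000)]
```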
Double exponential
• A random variable X ∈ R has a double exponential (or Laplace) distribution with parameters µ and σ, denoted X ∼ DE(µ, σ), if the pdf is

p(x|µ, σ) = 1/(2σ) exp(−|x − µ|/σ).

The mean and variance are E(X) = µ and var(X) = 2σ².
• The composition method can be used to simulate a DE(µ, σ) random variable:
Step 1: Draw λ ∼ exp(2) and Z ∼ N(0, 1)
Step 2: Set X = µ + σ√λ Z.
Check exponential
• A random variable X ∈ R has a check (or asymmetric) exponential distribution with parameters τ, µ, and σ, denoted X ∼ CE(τ, µ, σ), if the pdf is

p(x|τ, µ, σ) = 1/(σµ_τ) exp(−ρ_τ(x − µ)/σ),

where ρ_τ(x) = |x| − (2τ − 1)x and µ_τ^{−1} = 2τ(1 − τ). The double exponential is a special case when τ = 1/2.
• The composition method simulates a CE(τ, µ, σ) random variable:
Step 1: Draw λ ∼ exp(µ_τ) and Z ∼ N(0, 1)
Step 2: Set X = µ + (2τ − 1)σλ + σ√λ Z.
T
• A random variable X ∈ R has a t-distribution with parameters ν, µ, and σ², denoted X ∼ t_ν(µ, σ²), if the pdf is

p(x|ν, µ, σ²) = Γ((ν + 1)/2)/(√(νπσ²) Γ(ν/2)) (1 + (x − µ)²/(νσ²))^{−(ν+1)/2}.

When µ = 0 and σ = 1, the distribution is denoted simply as t_ν. The mean and variance of the t-distribution are E(X) = µ and var(X) = σ²ν/(ν − 2) for ν > 2. The Cauchy distribution is the special case where ν = 1.
• The composition method simulates a t_ν(µ, σ²) random variable:
Step 1: Draw λ ∼ IG(ν/2, ν/2) and Z ∼ N(0, 1)
Step 2: Set X = µ + σ√λ Z.
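This scale-mixture construction can be sketched in Python (standard library only; an IG(a, b) draw is obtained as b divided by a G(a, 1) draw; the function name rt is ours):

```python
import random

def rt(nu, mu, sigma, rng=random):
    """Composition draw from t_nu(mu, sigma^2):
    lam ~ IG(nu/2, nu/2), then X = mu + sigma * sqrt(lam) * Z."""
    lam = (nu / 2.0) / rng.gammavariate(nu / 2.0, 1.0)   # IG(a, b) draw as b / G(a, 1)
    z = rng.gauss(0.0, 1.0)
    return mu + sigma * lam ** 0.5 * z

random.seed(10)
samples = [rt(10, 1.0, 2.0) for _ in range(50_000)]
```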
Z
• The class of Z-distributions for a, b > 0 has density

f_Z(z|a, b, µ, σ) = 1/(σB(a, b)) e^{a(z−µ)/σ}/(1 + e^{(z−µ)/σ})^{a+b} ≡ Z(z; a, b, σ, µ).

A typical parameterisation is a = δ + θ, b = δ − θ, or δ = (a + b)/2, θ = (a − b)/2. This is a variance-mean mixture of normals where

Z(z; a, b, σ, µ) = ∫_0^∞ 1/√(2πλσ²) exp{−(z − µ − (a − b)λσ/2)²/(2λσ²)} p_{a,b}(λ) dλ,

where p_{a,b}(λ) is a Polya distribution, an infinite mixture of exponentials that can easily be sampled as

λ =_D ∑_{k=0}^∞ 2ψ_k^{−1} Z_k, where ψ_k = (a + k)(b + k) and Z_k ∼ exp(1).
• Z-distribution simulation is then given by

X = µ + (a − b)λσ/2 + σ√λ Z, where Z ∼ N(0, 1).
Exponential power
• A random variable has an exponential power distribution, denoted X ∼ EP(µ, σ, γ), if the pdf is

p(x|µ, σ, γ) = 1/(2σΓ(1 + γ^{−1})) exp(−|(x − µ)/σ|^γ).

The mean and variance are E(X) = µ and var(X) = (Γ(3/γ)/Γ(1/γ))σ². The normal and double exponential distributions are special cases.
• Exponential power simulation relies on the composition method: if X ∼ EP(µ, σ, γ), then X = µ + σ√λ Z, where Z ∼ N(0, 1) and the scale λ is related to a positive stable random variable, with density proportional to λ^{−3/2} St⁺_{γ/2}(λ^{−1}).
Inverse Gaussian
• A random variable X ∈ R_+ has an inverse Gaussian distribution with parameters µ and α, denoted X ∼ IN(µ, α), if the pdf is

p(x|µ, α) = √(α/(2πx³)) exp(−α(x − µ)²/(2µ²x)).

The mean and variance of an inverse Gaussian random variable are E(X) = µ and var(X) = µ³/α, respectively.
• To simulate an inverse Gaussian IN(µ, α),
Step 0: Draw U ∼ U(0, 1) and V ∼ X²₁
Step 1: Set W = µ + µ²V/(2α) − (µ/(2α))√(4µαV + µ²V²)
Step 2: Set X = W if U ≤ µ/(µ + W), and X = µ²/W otherwise.
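This accept-swap scheme (the Michael, Schucany, and Haas construction) can be sketched in Python, stated directly in the (µ, α) parameterization used here (standard library only; the function name rinvgauss is ours):

```python
import random

def rinvgauss(mu, alpha, rng=random):
    """Draw X ~ IN(mu, alpha) (mean mu, variance mu^3/alpha): compute the smaller
    root W, accept it with probability mu/(mu + W), else return mu^2/W."""
    v = rng.gauss(0.0, 1.0) ** 2                          # V ~ chi-squared(1)
    w = mu + mu * mu * v / (2.0 * alpha) \
        - (mu / (2.0 * alpha)) * (4.0 * mu * alpha * v + mu * mu * v * v) ** 0.5
    return w if rng.random() <= mu / (mu + w) else mu * mu / w

random.seed(11)
samples = [rinvgauss(1.0, 2.0) for _ in range(50_000)]
```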
Generalized inverse Gaussian
• A random variable X ∈ R_+ has a generalized inverse Gaussian distribution with parameters a, b, and p, denoted X ∼ GIG(a, b, p), if the pdf is

p(x|a, b, p) = (a/b)^{p/2} x^{p−1}/(2K_p(√(ab))) exp(−[ax + b/x]/2),

where K_p is the modified Bessel function of the third kind. The mean and variance are known, but are complicated expressions involving the Bessel functions. The Gamma distribution is the special case with b = 0 (and β = a/2), and the inverse Gamma is the special case with a = 0 (and β = b/2).
• Simulating GIG random variables is typically done using resampling methods.
Multivariate normal
• A k × 1 random vector X ∈ R^k has a multivariate normal distribution with parameters µ and Σ, denoted X ∼ N_k(µ, Σ), if the pdf is

p(x|µ, Σ) = (2π)^{−k/2} |Σ|^{−1/2} exp(−(x − µ)′Σ^{−1}(x − µ)/2),

where |Σ| is the determinant of the positive definite symmetric matrix Σ. The mean and covariance matrix of a multivariate normal are E(X) = µ and cov(X) = Σ, respectively.
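Multivariate normal draws are typically generated as X = µ + LZ, where L is the Cholesky factor of Σ and Z is a vector of independent standard normals. A bivariate sketch in Python (standard library only; the function name rmvnorm2 is ours):

```python
import math
import random

def rmvnorm2(mu, sigma, rng=random):
    """Draw from a bivariate N(mu, Sigma): X = mu + L Z, where L is the
    Cholesky factor of Sigma (Sigma = L L') and Z has independent N(0,1) entries."""
    l11 = math.sqrt(sigma[0][0])
    l21 = sigma[1][0] / l11
    l22 = math.sqrt(sigma[1][1] - l21 * l21)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2)

random.seed(12)
samples = [rmvnorm2((1.0, -1.0), [[2.0, 0.6], [0.6, 1.0]]) for _ in range(50_000)]
```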
Multivariate T
• A k × 1 random vector X ∈ R^k has a multivariate t-distribution with parameters ν, µ, and Σ, denoted X ∼ t_ν(µ, Σ), if the pdf is given by

p(x|ν, µ, Σ) = Γ((ν + k)/2)/(Γ(ν/2)(νπ)^{k/2}|Σ|^{1/2}) [1 + (x − µ)′Σ^{−1}(x − µ)/ν]^{−(ν+k)/2}.

The mean and covariance matrix of a multivariate t random variable are E(X) = µ and cov(X) = (ν/(ν − 2))Σ for ν > 2, respectively.
• The following two steps provide a draw from a multivariate t-distribution:
Step 1: Simulate Y ∼ N_k(0, Σ) and Z ∼ X²_ν
Step 2: Set X = µ + Y(Z/ν)^{−1/2}.
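These two steps can be sketched in Python for the bivariate case (standard library only; the function name rmvt2 is ours; the chi-squared draw uses a G(ν/2, 1) variate scaled by 2):

```python
import math
import random

def rmvt2(nu, mu, sigma, rng=random):
    """Draw from a bivariate t_nu(mu, Sigma): X = mu + Y * (Z/nu)^(-1/2),
    with Y ~ N_2(0, Sigma) and Z ~ chi-squared(nu)."""
    l11 = math.sqrt(sigma[0][0])                 # Cholesky factor of Sigma
    l21 = sigma[1][0] / l11
    l22 = math.sqrt(sigma[1][1] - l21 * l21)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    y = (l11 * z1, l21 * z1 + l22 * z2)          # Y ~ N_2(0, Sigma)
    chi2 = 2.0 * rng.gammavariate(nu / 2.0, 1.0)  # Z ~ chi-squared(nu)
    scale = (chi2 / nu) ** -0.5
    return (mu[0] + y[0] * scale, mu[1] + y[1] * scale)

random.seed(14)
samples = [rmvt2(8, (0.0, 2.0), [[1.0, 0.2], [0.2, 1.0]]) for _ in range(50_000)]
```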
Wishart
• A random m × m matrix Σ has a Wishart distribution, Σ ∼ W_m(v, V), if the density function is given by

p(Σ|v, V) = |Σ|^{(v−m−1)/2}/(2^{vm/2}|V|^{v/2}Γ_m(v/2)) exp(−tr(V^{−1}Σ)/2),

for v > m, where

Γ_m(v/2) = π^{m(m−1)/4} ∏_{k=1}^m Γ((v − k + 1)/2)

is the multivariate Gamma function. If v < m, then Σ does not have a density, although its distribution is well defined. The Wishart distribution arises naturally in multivariate settings with normally distributed random variables as the distribution of quadratic forms of multivariate normal random variables.
• The Wishart distribution can be viewed as a multivariate generalization of the X²_ν distribution. From this, it is clear how to sample from a Wishart distribution:
Step 1: Draw X_j ∼ N(0, V) for j = 1, ..., v
Step 2: Set Σ = ∑_{j=1}^v X_j X_j′.
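The sum-of-outer-products construction can be sketched in Python for the 2 × 2 case (standard library only; the function name rwishart2 is ours):

```python
import math
import random

def rwishart2(v, V, rng=random):
    """Draw Sigma ~ W_2(v, V) as the sum of v outer products X_j X_j'
    with X_j ~ N(0, V) (bivariate case, V positive definite)."""
    l11 = math.sqrt(V[0][0])                       # Cholesky factor of V
    l21 = V[1][0] / l11
    l22 = math.sqrt(V[1][1] - l21 * l21)
    s = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(v):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x = (l11 * z1, l21 * z1 + l22 * z2)        # X_j ~ N(0, V)
        for i in range(2):
            for j in range(2):
                s[i][j] += x[i] * x[j]
    return s

random.seed(13)
draws = [rwishart2(5, [[1.0, 0.3], [0.3, 2.0]]) for _ in range(20_000)]
```

Each draw is symmetric, and the sample average approaches E[Σ] = vV.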
Inverted Wishart
• A random m × m matrix Σ has an inverted Wishart distribution, denoted Σ ∼ IW_m(v, V), if the density function is

p(Σ|v, V) = |V|^{v/2}|Σ|^{−(v+m+1)/2}/(2^{vm/2}Γ_m(v/2)) exp(−tr(VΣ^{−1})/2).

This also implies that Σ^{−1} has a Wishart distribution, Σ^{−1} ∼ W_m(v, V^{−1}). The Jacobian of the transformation is |∂Σ^{−1}/∂Σ| = |Σ|^{−(m+1)}.
• To generate Σ ∼ IW_m(v, V), follow the two-step procedure:
Step 1: Draw X_i ∼ N(0, V^{−1}) for i = 1, ..., v
Step 2: Set Σ = (∑_{i=1}^v X_i X_i′)^{−1}.
In cases where m is extremely large, there are more efficient algorithms for drawing inverted Wishart random variables that involve factoring V and sampling from univariate X² distributions.
2 Likelihoods, Priors, and Posteriors
This appendix provides combinations of likelihoods and priors for the following types of observed data: Bernoulli, Poisson, exponential, normal, normal regression, and multivariate normal. For each specification, proper conjugate and Jeffreys' priors are given. The overriding Bayesian paradigm takes the form of Bayes rule,

p(parameters|data) = p(data|parameters) p(parameters)/p(data),

where the types of data and parameters are problem specific.
Bernoulli observations
• If the data (y_t|θ) ∼ Ber(θ) with θ ∈ [0, 1], then the likelihood is

p(y|θ) = ∏_{t=1}^T p(y_t|θ) = ∏_{t=1}^T θ^{y_t}(1 − θ)^{1−y_t} = θ^{∑_{t=1}^T y_t}(1 − θ)^{T−∑_{t=1}^T y_t}.
• A conjugate prior distribution is the Beta family, θ ∼ B(a, A), where

p(θ) = Γ(a + A)/(Γ(a)Γ(A)) θ^{a−1}(1 − θ)^{A−1}.

By Bayes rule, the posterior distribution is also Beta:

p(θ|y) ∝ p(y|θ)p(θ) = θ^{a+∑_{t=1}^T y_t−1}(1 − θ)^{A+T−∑_{t=1}^T y_t−1} ∼ B(a_T, A_T),

where a_T = a + ∑_{t=1}^T y_t and A_T = A + T − ∑_{t=1}^T y_t.
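This conjugate update is pure arithmetic and can be sketched as a small Python helper (the function name beta_bernoulli_update is ours):

```python
def beta_bernoulli_update(a, A, y):
    """Posterior B(a_T, A_T) for Bernoulli data y under a B(a, A) prior:
    a_T = a + sum(y), A_T = A + T - sum(y)."""
    s = sum(y)
    return a + s, A + len(y) - s

# Example: a B(1, 1) (uniform) prior with data (1, 0, 1, 1)
posterior = beta_bernoulli_update(1, 1, [1, 0, 1, 1])
```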
• Fisher’s information for Bernoulli observations is
I (θ) = −Eθ
[∂2 ln p (yt|θ)
∂θ2
]=
1
θ (1− θ),
where Eθ denotes the expectation under a Ber(θ). Jeffreys’ prior is
p (θ) = I (θ)12 = θ−
12 (1− θ)−
12 ∼ B
(1
2,1
2
).
Multinomial observations
• If (y|θ) is Multinomial data from k categories, (y|θ) ∼ Mult(θ_1, ..., θ_k), then the likelihood for T trials is given by

p(y_1, ..., y_k|θ_1, ..., θ_k) = T!/(y_1! ··· y_k!) θ_1^{y_1} ··· θ_k^{y_k}, where ∑_{i=1}^k y_i = T.

• A conjugate prior is a Dirichlet distribution, θ ∼ Dir(α), with density

p(θ_1, ..., θ_k|α) = Γ(∑_i α_i)/∏_i Γ(α_i) θ_1^{α_1−1} ··· θ_k^{α_k−1}.

The posterior is then

p(θ_1, ..., θ_k|α, y_1, ..., y_k) ∝ p(y_1, ..., y_k|θ_1, ..., θ_k) p(θ_1, ..., θ_k|α)
∝ θ_1^{y_1} ··· θ_k^{y_k} θ_1^{α_1−1} ··· θ_k^{α_k−1}
= θ_1^{α_1+y_1−1} ··· θ_k^{α_k+y_k−1}
∼ Dir(α + y),

which is again a Dirichlet with parameter α + y.
Poisson observations
• If the data (y_t|λ) ∼ Poi(λ), then the likelihood is

p(y|λ) = ∏_{t=1}^T e^{−λ}λ^{y_t}/y_t! ∝ e^{−λT}λ^{∑_{t=1}^T y_t}.

• A conjugate prior for λ is a Gamma distribution, λ ∼ G(a, A), with density

p(λ) = A^a/Γ(a) λ^{a−1} exp(−λA).

The posterior distribution is Gamma:

p(λ|y) ∝ p(y|λ)p(λ) = e^{−λ(A+T)}λ^{a+∑_{t=1}^T y_t−1} ∼ G(a_T, A_T),

where a_T = a + ∑_{t=1}^T y_t and A_T = A + T.
• Fisher’s information for Poisson observations is
I (λ) = −Eλ
[∂2 ln p (yt|λ)
∂λ2
]=
1
λ.
Jeffreys’ prior is then p (λ) ≡ I (λ)12 = λ−
12 . This can be viewed as a special case of
the Gamma prior with a = 12
and A = 0.
Exponential observations
• If the data (y_t|µ) ∼ exp(µ), then the likelihood (with µ parameterized here as the rate, so the exponential has mean 1/µ) is

p(y|µ) = ∏_{t=1}^T µ exp(−µy_t) ∝ µ^T exp(−µ∑_{t=1}^T y_t).

• A conjugate prior for µ is a Gamma distribution, µ ∼ G(a, A). The posterior is

p(µ|y) ∝ p(y|µ)p(µ) ∝ µ^{a+T−1} e^{−µ(A+∑_{t=1}^T y_t)} ∼ G(a_T, A_T),

where a_T = a + T and A_T = A + ∑_{t=1}^T y_t.
• Fishers’ information for exponential observations is
I (µ) = −Eµ
[∂2 ln p (yt|µ)
∂µ2
]=
(1
µ
)2
.
Jeffreys prior for exponential observations is p (µ) ≡ µ−1. This is a special case of
the Gamma prior with a = 1 and A = 0.
Normal observations with known variance
• If the data (y_t|µ, σ²) ∼ N(µ, σ²), the likelihood is

p(y|µ, σ²) = (1/(2πσ²))^{T/2} exp(−∑_{t=1}^T (y_t − µ)²/(2σ²)).

We can factorize

∑_{t=1}^T (y_t − µ)² = ∑_{t=1}^T [(y_t − ȳ)² + (ȳ − µ)²].

Thus, the likelihood, as a function of µ, is proportional to

exp(−T(µ − ȳ)²/(2σ²)),

with the other terms in p(y|µ, σ²) being absorbed into the constant of integration.
• A conjugate prior distribution for µ that is independent of σ² is given by µ ∼ N(a, A). The posterior distribution is

p(µ|y, σ²) ∝ p(y|µ, σ²)p(µ) ∝ exp(−[ (µ − ȳ)²/(σ²/T) + (µ − a)²/A ]/2).

Completing the square yields

(µ − ȳ)²/(σ²/T) + (µ − a)²/A = (µ − a_T)²/A_T + (ȳ − a)²/(σ²/T + A),

with parameters

a_T/A_T = a/A + ȳ/(σ²/T) and 1/A_T = 1/(σ²/T) + 1/A.

The posterior distribution is

p(µ|y, σ²) ∝ exp(−[ (µ − a_T)²/A_T + (ȳ − a)²/(σ²/T + A) ]/2) ∼ N(a_T, A_T).
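The precision-weighted update above is simple arithmetic and can be sketched as a Python helper (the function name normal_mean_update is ours):

```python
def normal_mean_update(a, A, ybar, sigma2, T):
    """Posterior N(a_T, A_T) for mu under a N(a, A) prior, given T observations
    with sample mean ybar and known variance sigma2:
      1/A_T = T/sigma2 + 1/A,   a_T = A_T * (a/A + ybar * T/sigma2)."""
    A_T = 1.0 / (T / sigma2 + 1.0 / A)
    a_T = A_T * (a / A + ybar * T / sigma2)
    return a_T, A_T

# With an N(0, 1) prior and one observation y = 1 from N(mu, 1),
# the posterior mean splits the difference.
a_T, A_T = normal_mean_update(0.0, 1.0, 1.0, 1.0, 1)
```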
• A conjugate prior distribution for µ conditional on σ² is µ ∼ N(a, σ²A). The posterior distribution is p(µ|y, σ²) ∝ p(y|µ, σ²)p(µ), which gives

p(µ|y, σ²) ∝ exp{−[ (µ − ȳ)²/T^{−1} + (µ − a)²/A ]/(2σ²)}.

Completing the square for the quadratic term in the exponential,

(µ − ȳ)²/T^{−1} + (µ − a)²/A = (µ − a_T)²/A_T + (ȳ − a)²/(T^{−1} + A),

where

a_T/A_T = a/A + ȳ/T^{−1} and 1/A_T = 1/T^{−1} + 1/A.

The posterior distribution is

p(µ|y, σ²) ∝ exp(−(µ − a_T)²/(2σ²A_T)) ∼ N(a_T, A_Tσ²).

Notice the slight differences between this example and the previous one, in terms of the hyper-parameters and the form of the posterior distribution.
• Fisher’s information for normal observations with σ2 known is
I (µ) = −Eλ
[∂2 ln p (yt|µ, σ2)
∂µ2
]≡ 1.
Jeffreys prior for normal observations (with known variance) is a constant, p (µ) ≡ 1
which is improper. However, the posterior is proper and can be viewed as a limiting
of the normal conjugate prior with a = 0 and A→∞.
Normal Variance with known Mean
• Given µ, if (y_t|µ, σ²) ∼ N(µ, σ²), the likelihood for σ² is

(1/σ²)^{T/2} exp(−∑_{t=1}^T (y_t − µ)²/(2σ²)).

• A conjugate inverse Gamma prior, σ² ∼ IG(b/2, B/2), has pdf

p(σ²) = (B/2)^{b/2}/Γ(b/2) (σ²)^{−b/2−1} exp(−B/(2σ²)).
By Bayes rule,

p(σ²|µ, y) ∝ p(y|µ, σ²)p(σ²) ∝ (1/σ²)^{(b+T)/2 + 1} exp(−[ B + ∑_{t=1}^T (y_t − µ)² ]/(2σ²)) ∼ IG(b_T/2, B_T/2),

where b_T = b + T and B_T = B + ∑_{t=1}^T (y_t − µ)².

The parameterization of the inverse Gamma, σ² ∼ IG(b/2, B/2), is used as opposed to σ² ∼ IG(b, B) so that the updated hyperparameters do not carry any 1/2 terms. This is chosen for notational simplicity. It is also common in the literature to assume p(σ) ∼ IG(b/2, B/2), which only changes the first term in the expression for b_T.
• Fisher’s information for normal observations (with known mean) is
I(σ2)
= −Eσ2
[∂2 ln p (yt|µ, σ2)
∂ (σ2)2
]≡ 1
σ2.
Jeffreys’ prior is p (σ2) ∝ (σ2)−1
, which is improper distribution. However, the re-
sulting posterior is proper and can be viewed as a limiting of the inverse Gamma
conjugate prior with B = 0 and b = 0. A flat or constant prior for σ2 also leads to a
proper posterior. Assuming p (σ2) ≡ 1 yields a conditional posterior
p(σ2|µ, y
)∼ IG
(T
2,
∑Tt=1 (yt − µ)2
2
).
Unknown mean and variance: dependent priors
• If the data (y_t|µ, σ²) ∼ N(µ, σ²), and assuming that both µ and σ² are unknown, then the likelihood as a function of µ and σ² is

(1/σ²)^{T/2} exp(−∑_{t=1}^T (y_t − µ)²/(2σ²)).
• A conjugate prior for (µ, σ²) is p(µ, σ²) = p(µ|σ²)p(σ²), where

p(µ|σ²) ∼ N(a, Aσ²) and p(σ²) ∼ IG(b/2, B/2).

This distribution is often expressed as N(a, Aσ²)IG(b/2, B/2). This prior assumes that µ and σ² are dependent. Bayes rule and a few lines of algebra yield a posterior

p(µ, σ²|y) ∝ p(y|µ, σ²)p(µ|σ²)p(σ²)
∝ (1/σ²)^{(T+b)/2 + 1/2 + 1} exp(−[ (µ − ȳ)²/(1/T) + (µ − a)²/A + ∑_{t=1}^T (y_t − ȳ)² + B ]/(2σ²)).

Combining the quadratic terms that depend on µ by completing the square,

(µ − ȳ)²/(1/T) + (µ − a)²/A = (µ − a_T)²/A_T + (ȳ − a)²/(1/T + A),

where the hyper-parameters are

a_T/A_T = a/A + ȳ/T^{−1} and 1/A_T = 1/A + 1/T^{−1}.

Inserting this into the likelihood gives a posterior

p(µ, σ²|y) ∝ (1/σ²)^{(T+b)/2 + 1/2 + 1} exp(−[ (µ − a_T)²/A_T + B + S ]/(2σ²)),

where

S = (ȳ − a)²/(1/T + A) + ∑_{t=1}^T (y_t − ȳ)².

Given the conjugate prior structure, the posterior factors as p(µ, σ²|y) = p(µ|σ², y)p(σ²|y). A few lines of algebra show that

p(µ, σ²|y) ∝ (1/σ²)^{1/2} exp(−(µ − a_T)²/(2σ²A_T)) × (1/σ²)^{(T+b)/2 + 1} exp(−(B + S)/(2σ²)),

giving p(µ|σ², y) ∼ N(a_T, σ²A_T) and p(σ²|y) ∼ IG(b_T/2, B_T/2), with b_T = b + T and B_T = B + S.
Marginal parameter distributions. In this specification, the marginal parameter distributions, p(σ²|y) and p(µ|y), are both known analytically. p(σ²|y) is inverse Gamma. The marginal p(µ|y) is

p(µ|y) = ∫_0^∞ p(µ, σ²|y) dσ² = ∫_0^∞ p(µ|σ², y)p(σ²|y) dσ².

Both p(µ|σ², y) and p(σ²|y) are known, and the integral can be computed analytically. Ignoring integration constants,

p(µ|y) ∝ ∫_0^∞ (1/σ²)^{(b_T+1)/2 + 1} exp(−[ (µ − a_T)²/A_T + B_T ]/(2σ²)) dσ²
∝ [ 1 + (µ − a_T)²/(A_T B_T) ]^{−(b_T+1)/2},

using our integration results. This is the kernel of a t-distribution; thus the marginal posterior is p(µ|y) ∼ t_{b_T}(a_T, A_T B_T).
The marginal likelihood, p(y) = ∫ p(y|µ, σ²)p(µ, σ²) dµ dσ², can be computed from

p(y|µ, σ²) = K_y (1/σ²)^{T/2} exp(−[ T(µ − ȳ)² + S ]/(2σ²)),
p(µ|σ²) = K_µ (1/σ²)^{1/2} exp(−(µ − a)²/(2σ²A)),
p(σ²) = K_σ (1/σ²)^{b/2 + 1} exp(−B/(2σ²)),

where here S = ∑_{t=1}^T (y_t − ȳ)², and the constants are K_y = (2π)^{−T/2}, K_σ = (B/2)^{b/2}/Γ(b/2), and K_µ = (2πA)^{−1/2}. Substituting these expressions, the marginal likelihood is
p(y) = K_y K_µ K_σ ∫_0^∞ (1/σ²)^{(b+T+1)/2 + 1} exp(−(S + B)/(2σ²)) [ ∫_{−∞}^∞ exp(−[ (µ − ȳ)²/T^{−1} + (µ − a)²/A ]/(2σ²)) dµ ] dσ².
Completing the square inside the integrand gives

(µ − ȳ)²/T^{−1} + (µ − a)²/A = (µ − a_T^µ)²/A_T^µ + (ȳ − a)²/(T^{−1} + A),

with hyper-parameters

a_T^µ/A_T^µ = ȳ/T^{−1} + a/A and 1/A_T^µ = 1/T^{−1} + 1/A.
The integrals can then be expressed as

∫_0^∞ (1/σ²)^{(b+T+1)/2 + 1} exp(−[ S + B + (ȳ − a)²/(T^{−1} + A) ]/(2σ²)) [ ∫_{−∞}^∞ exp(−(µ − a_T^µ)²/(2σ²A_T^µ)) dµ ] dσ².

The inner integral is ∫_{−∞}^∞ exp(−(µ − a_T^µ)²/(2σ²A_T^µ)) dµ = √(2πA_T^µ)(σ²)^{1/2}. Using this, σ² can be integrated out, yielding the expression for the marginal likelihood:
p(y) = √(2πA_T^µ) K_y K_µ K_σ ∫_0^∞ (1/σ²)^{(b+T)/2 + 1} exp(−[ S + B + (ȳ − a)²/(T^{−1} + A) ]/(2σ²)) dσ²

= (1/(2π))^{T/2} (A_T^µ/A)^{1/2} (B/2)^{b/2} [Γ((b + T)/2)/Γ(b/2)] [ S + B + (ȳ − a)²/(T^{−1} + A) ]^{−(b+T)/2}.
Finally, the predictive distribution can also be computed analytically, since

p(y_{T+1}|y^T) = ∫ p(y_{T+1}|µ, σ²) p(µ, σ²|y^T) dµ dσ².

To simplify, first compute the integral against µ by substituting from the posterior. Since p(µ|σ², y) ∼ N(a_T, σ²A_T), we have that µ = a_T + σ√(A_T) Z, where Z is an independent standard normal. Substituting into y_{T+1} = µ + σε_{T+1} gives

y_{T+1} = a_T + σ√(A_T) Z + σε_{T+1} = a_T + ση_{T+1},

where η_{T+1} ∼ N(0, A_T + 1). The predictive is therefore

p(y_{T+1}|y^T) ∝ ∫_0^∞ (1/σ²)^{1/2} exp(−(y_{T+1} − a_T)²/(2σ²(A_T + 1))) (1/σ²)^{b_T/2 + 1} exp(−B_T/(2σ²)) dσ²

∝ ∫_0^∞ (1/σ²)^{(b_T+1)/2 + 1} exp(−[ B_T + (y_{T+1} − a_T)²/(A_T + 1) ]/(2σ²)) dσ²

∝ [ 1 + (y_{T+1} − a_T)²/(B_T(A_T + 1)) ]^{−(b_T+1)/2} ∼ t_{b_T}(a_T, B_T(A_T + 1)).
• Jeffreys’ prior for normal observations with unknown mean and variance is a bivariate
distribution. Fishers’ information is
I(µ, σ2
)= −Eµ,σ2
∂2 ln p(yt|µ,σ2)∂µ2
∂2 ln p(yt|µ,σ2)∂µ∂σ2
∂2 ln p(yt|µ,σ2)∂µ∂σ2
∂2 ln p(yt|µ,σ2)∂(σ2)2
=
[1σ2 0
0 2σ2
],
generating a prior distribution p (µ, σ2) ≡ det (I (µ, σ2))12 = σ−2. This prior is im-
proper, but leads to a proper posterior distribution that can be viewed as a limiting
case of the usual conjugate posterior
p(µ, σ2|y
)∼ N
(aT , ATσ
2)IG(bT2,BT
2
),
where aT = 0, AT →∞, b = 0 and B = 0.
Regression
• Consider a regression model specification, y_t|x_t, β, σ² ∼ N(x_tβ, σ²), where x_t is a vector of observed covariates, β is a k × 1 vector of regression coefficients, and ε_t ∼ N(0, σ²). To express the likelihood, it is useful to stack the data into matrices, y = Xβ + ε, where y is a T × 1 vector of dependent variables, X is a T × k matrix of regressors, and ε is a T × 1 vector of errors.

The usual OLS regression estimator is β̂ = (X′X)^{−1}X′y and the residual sum of squares is S = (y − Xβ̂)′(y − Xβ̂). Completing the square,

(y − Xβ)′(y − Xβ) = (β − β̂)′(X′X)(β − β̂) + (y − Xβ̂)′(y − Xβ̂)
= (β − β̂)′(X′X)(β − β̂) + S,

which implies that

p(y|β, σ²) ∝ (1/σ²)^{T/2} exp(−(y − Xβ)′(y − Xβ)/(2σ²))
= (1/σ²)^{T/2} exp(−(β − β̂)′(X′X)(β − β̂)/(2σ²) − S/(2σ²)).

Hence (β̂, S) are sufficient statistics for (β, σ²).
• A proper conjugate prior is p(β|σ²) ∼ N_k(a, σ²A) and p(σ²) ∼ IG(b/2, B/2), with

p(β|σ²) = (2π)^{−k/2}(σ²)^{−k/2}|A|^{−1/2} exp(−(β − a)′A^{−1}(β − a)/(2σ²)),

since |σ²A| = (σ²)^k|A|. The posterior distribution is

p(β, σ²|y) ∝ (1/σ²)^{(b+T)/2 + 1}(1/σ²)^{k/2} exp(−Q(β, σ²)/(2σ²)),

where the quadratic form is defined by

Q(β, σ²) = (y − Xβ)′(y − Xβ) + (β − a)′A^{−1}(β − a) + B.
We can complete the square by combining (y − Xβ)′(y − Xβ) + (β − a)′A^{−1}(β − a). Stack the vectors as

ỹ = (y; A^{−1/2}a) and W = (X; A^{−1/2}),

where ỹ is a (T + k) × 1 vector and W is a (T + k) × k matrix. Then,

(y − Xβ)′(y − Xβ) + (β − a)′A^{−1}(β − a) = (ỹ − Wβ)′(ỹ − Wβ).

By analogy to the previous case, define β̂ = (W′W)^{−1}W′ỹ, where

W′W = X′X + A^{−1} and W′ỹ = X′y + A^{−1}a.

Adding and subtracting Wβ̂ to ỹ − Wβ and simplifying gives

(ỹ − Wβ)′(ỹ − Wβ) = [(ỹ − Wβ̂) + (Wβ̂ − Wβ)]′[(ỹ − Wβ̂) + (Wβ̂ − Wβ)]
= (β − β̂)′W′W(β − β̂) + (ỹ − Wβ̂)′(ỹ − Wβ̂)
= (β − a_T)′A_T^{−1}(β − a_T) + (ỹ − Wa_T)′(ỹ − Wa_T),

where A_T^{−1} = W′W = X′X + A^{−1} and a_T = A_T[X′y + A^{−1}a]. Straightforward algebra shows that (ỹ − Wβ̂)′(ỹ − Wβ̂) = y′y + a′A^{−1}a − a_T′A_T^{−1}a_T. Putting the pieces together, the quadratic form is

Q(β, σ²) = (β − a_T)′A_T^{−1}(β − a_T) + y′y + a′A^{−1}a − a_T′A_T^{−1}a_T + B
= (β − a_T)′A_T^{−1}(β − a_T) + B_T,

where B_T = y′y + a′A^{−1}a − a_T′A_T^{−1}a_T + B. This leads to the same form of posterior as before.
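For scalar β (k = 1), the updates A_T^{−1} = X′X + A^{−1}, a_T = A_T(X′y + A^{−1}a), b_T = b + T, and B_T = y′y + a′A^{−1}a − a_T′A_T^{−1}a_T + B reduce to plain arithmetic and can be sketched as a Python helper (the function name regression_update_1d is ours):

```python
def regression_update_1d(x, y, a, A, b, B):
    """Conjugate update for y_t = x_t * beta + eps_t with scalar beta,
    prior beta|sigma2 ~ N(a, sigma2 * A) and sigma2 ~ IG(b/2, B/2).
    Returns (a_T, A_T, b_T, B_T)."""
    xtx = sum(xi * xi for xi in x)
    xty = sum(xi * yi for xi, yi in zip(x, y))
    yty = sum(yi * yi for yi in y)
    prec_T = xtx + 1.0 / A              # A_T^{-1} = X'X + A^{-1}
    A_T = 1.0 / prec_T
    a_T = A_T * (xty + a / A)           # a_T = A_T (X'y + A^{-1} a)
    b_T = b + len(y)
    B_T = yty + a * a / A - a_T * a_T * prec_T + B
    return a_T, A_T, b_T, B_T
```

With a very diffuse prior and noiseless data y_t = 2x_t, the posterior mean recovers β = 2 and the residual term B_T is essentially zero.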
3 Direct sampling
3.1 Generating i.i.d. random variables from distributions
The Gibbs sampler and MH algorithms require simulating i.i.d. random variables from "recognizable" distributions. Appendix 1 provides a list of common "recognizable" distributions, along with methods for generating random variables from these distributions. This section briefly reviews the standard methods and approaches for simulating random variables from recognizable distributions.

Most of these algorithms first generate random variables from a relatively simple "building block" distribution, such as a uniform or normal distribution, and then transform these draws to obtain a sample from another distribution. This section describes a number of these approaches that are commonly encountered in practice. Most "random-number" generators actually use deterministic methods, along with transformations. In this regard, it is important to remember Von Neumann's famous quotation: "Anyone who attempts to generate random numbers by deterministic means is, of course, living in a state of sin."
Inverse CDF method The inverse distribution method uses samples of uniform random variables to generate draws from random variables with a continuous distribution function, F. Since F(X) is uniformly distributed on [0, 1], draw a uniform random variable and invert the CDF to get a draw from F. Thus, to sample from F,
Step 1: Draw U ∼ U[0, 1]
Step 2: Set X = F^{−1}(U),
where F^{−1}(U) = inf{x : F(x) ≥ U}.
This inversion method provides i.i.d. draws from F provided that F^{−1}(U) can be exactly calculated. For example, the CDF of an exponential random variable with parameter µ is F(x) = 1 − exp(−µx), which can easily be inverted. When F^{−1} cannot be analytically calculated, approximate inversions can be used. For example, suppose that the density is a known analytical function. Then F(x) can be computed to an arbitrary degree of accuracy on a grid, and inversions can be approximately calculated, generating an approximate draw from F. As with all approximations, there is a natural trade-off between computational speed and accuracy. One example where efficient approximations are possible is inversions involving normal distributions, which is useful for generating truncated normal random variables. Outside of these limited cases, the inverse transform method does not provide a computationally attractive approach for drawing random variables from a given distribution function. In particular, it does not work well in multiple dimensions.
Functional Transformations The second main method uses functional transformations
to express the distribution of a random variable that is a known function of another
random variable. Suppose that X ∼ F, admitting a density f, and that y = h(x) is
an increasing continuous function, so that x = h^{-1}(y) defines the inverse of the
function h. The distribution function of Y is given by

F_Y(y) = Prob(Y ≤ y) = ∫_{-∞}^{h^{-1}(y)} f(x) dx = F(h^{-1}(y)).

Differentiating with respect to y gives the density via Leibniz's rule:

f_Y(y) = f(h^{-1}(y)) |dh^{-1}(y)/dy|,

where the subscript makes explicit that the density is over the random variable Y. This result is used
widely. For example, if X ∼ N(0, 1) and Y = µ + σX, then x = h^{-1}(y) = (y − µ)/σ, the
distribution function is F((y − µ)/σ), and the density is

f_Y(y) = (1/(√(2π)σ)) exp(−(y − µ)^2/(2σ^2)).

Transformations are widely used to simulate both univariate and multivariate random
variables. As examples, if Y ∼ χ^2_ν and ν is an integer, then Y = Σ_{i=1}^ν X_i^2, where each X_i is an
independent standard normal. Exponential random variables can be used to simulate χ^2,
Gamma, beta, and Poisson random variables. The famous Box-Muller algorithm simulates
normals from uniform and exponential random variables. In the multivariate setting,
Wishart (and inverse Wishart) random variables can be generated via sums of squared vectors of
standard normal random variables.
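A minimal sketch of the Box-Muller algorithm, which maps pairs of uniforms to pairs of independent standard normals via an exponential radius (function name and sample size are ours):

```python
import math
import random

def box_muller(n, seed=0):
    """n standard normal draws from uniform pairs via the Box-Muller transform."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        u1, u2 = rng.random(), rng.random()
        r = math.sqrt(-2.0 * math.log(1.0 - u1))  # radius: sqrt of an exponential draw
        out.append(r * math.cos(2.0 * math.pi * u2))
        out.append(r * math.sin(2.0 * math.pi * u2))
    return out[:n]

z = box_muller(100_000)
mean = sum(z) / len(z)
var = sum(v * v for v in z) / len(z) - mean ** 2
print(mean, var)  # near 0 and 1
```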
Mixture distributions In the multidimensional case, a special case of the transformation
generates continuous mixture distributions. The density of a continuous mixture
distribution is given by

p(x) = ∫ p(x|λ) p(λ) dλ,

where p(x|λ) is viewed as a density conditional on the parameter λ. One example of this is
the class of scale mixtures of normal distributions, where

p(x|λ) = (1/√(2πλ)) exp(−x^2/(2λ))

and λ is the conditional variance of X. It is often simpler just to write

X = √λ ε,

where ε ∼ N(0, 1). The distribution of √λ determines the marginal distribution of X.
Here are a number of examples of scale mixture distributions.
• T-distribution. The t-distribution arises in many Bayesian inference problems involving
inverse Gamma priors and conditionally normally distributed likelihoods. If
p(y_t|λ) ∼ N(0, λ) and p(λ) ∼ IG(b/2, B/2), then the marginal distribution of y_t is
t_b(0, B/b). The proof is direct by analytically computing the marginal distribution,

p(y_t) = ∫_0^∞ p(y_t|λ) p(λ) dλ.

Using our integration results:

p(y) = ∫_0^∞ [(1/√(2π)) (1/λ)^{1/2} exp(−y^2/(2λ))] × [((B/2)^{b/2}/Γ(b/2)) (1/λ)^{b/2+1} exp(−B/(2λ))] dλ

(the first factor is the likelihood, the second the prior)

∝ ∫_0^∞ (1/λ)^{(b+1)/2+1} exp(−(y^2 + B)/(2λ)) dλ,

which is in the class of scale mixture integrals. We obtain

p(y) ∝ [1 + y^2/B]^{−(b+1)/2}.

Thus y_t ∼ t_b(0, B/b). More generally, given µ, if y_t|µ, λ, σ^2 ∼ N(µ, λσ^2) and λ ∼ IG(b/2, B/2), then
p(y_t|µ, σ^2) ∼ t_b(µ, σ^2 B/b).
• Double-exponential distribution. The double exponential arises as a scale mixture
distribution: if p(y_t|λ) ∼ N(0, λ) and λ is exponential with mean 2, i.e. p(λ) = (1/2)e^{−λ/2},
then the marginal distribution of y_t is DE(0, 1). The proof is again by direct integration
using the results in our integration appendix:

∫_0^∞ (1/2)(1/√(2πλ)) exp{−(1/2)(y^2/λ + λ)} dλ = (1/2) exp(−|y|).

More generally, if µ and σ^2 are known, then y_t|µ, λ, σ^2 ∼ N(µ, λσ^2) and λ ∼ exp(2)
imply p(y_t|µ, σ^2) ∼ DE(µ, σ^2), as follows by substituting b = (y − µ)/σ in the integral
identity and multiplying both sides by σ^{−1}.
• Asymmetric Laplacean. The asymmetric Laplacean distribution is a scale mixture
of normal distributions: if

p(y_t|λ) ∼ N((2τ − 1)λ, λ) and λ ∼ exp(µ_τ^{−1}),

then y_t ∼ CE(τ, 0, 1). The proof uses our integration appendix and

∫_0^∞ (1/√(2πλ)) exp{−(1/(2λ))(y + (2τ − 1)λ)^2 − 2τ(1 − τ)λ} dλ = (1/2) exp(−|y| − (2τ − 1)y).

In general, if µ and σ^2 are known, then

p(y_t|µ, λ, σ^2) ∼ N(µ + (1 − 2τ)λ, λσ^2) and λ ∼ exp(µ_τ^{−1}),

which leads to p(y_t|µ, σ^2) ∼ CE(τ, µ, σ^2).
• Exponential power family. This family of distributions (Box and Tiao, 1973)
is given by

p(y|σ, γ) = (2σ)^{−1} c(γ) exp(−|y/σ|^γ),

where c(γ) = Γ(1 + γ^{−1})^{−1}. Following West (1987), we have the following scale
mixture of normals representation:

p(y|σ = 1, γ) = c(γ) exp(−|y|^γ) = ∫_0^∞ (1/√(2πλ)) exp(−y^2/(2λ)) p(λ|γ) dλ,

where p(λ|γ) ∝ λ^{−3/2} St^+_{γ/2}(λ^{−1}) and St^+_a is the (analytically intractable) density of a
positive stable distribution.
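The representation X = √λ ε gives a direct simulation recipe for several of the examples above. A minimal numpy sketch (the parameter values b = 8, B = 4 and the sample size are our choices; the t case uses λ ∼ IG(b/2, B/2), and the double exponential uses an exponential mixing density with mean 2):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# t-distribution: y|lam ~ N(0, lam), lam ~ IG(b/2, B/2); here b = 8, B = 4
b, B = 8.0, 4.0
g = rng.gamma(shape=b / 2.0, scale=1.0, size=n)
lam = (B / 2.0) / g                          # inverse-gamma(b/2, B/2) draws
y_t = np.sqrt(lam) * rng.standard_normal(n)
print(y_t.var())                             # theoretical variance is B/(b - 2) = 2/3

# double exponential: y|lam ~ N(0, lam), lam exponential with mean 2
lam2 = rng.exponential(scale=2.0, size=n)
y_de = np.sqrt(lam2) * rng.standard_normal(n)
print(y_de.var())                            # Laplace(0, 1) variance is 2
```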
Factorization Method Another method that is useful in some multivariate settings
is known as factorization. The rules of probability imply that a joint density, p (x1, ..., xn),
can always be factored as
p (x1, ..., xn) = p (xn|x1, ..., xn−1) p (x1, ..., xn−1)
= p (x1) p (x2|x1) · · · p (xn|x1, ..., xn−1) .
In this case, simulating X1 ∼ p (x1), X2 ∼ p (x2|X1), ... Xn ∼ p (xn|X1, ..., Xn−1) generates
a draw from the joint distribution. This procedure is common in Bayesian statistics,
where the distribution of X_1 is a marginal distribution and the other distributions are
conditional distributions. It is used repeatedly to express a joint posterior in terms of
lower-dimensional conditional posteriors.
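As a sketch of the factorization method, the following draws from a bivariate normal by simulating the marginal of X_1 and then the conditional of X_2 given X_1 (the correlation-ρ example is our own):

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho = 100_000, 0.7

# p(x1, x2) = p(x1) p(x2 | x1) for a standard bivariate normal with correlation rho
x1 = rng.standard_normal(n)                                       # X1 ~ p(x1)
x2 = rho * x1 + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(n)  # X2 ~ p(x2 | X1)

print(np.corrcoef(x1, x2)[0, 1])  # close to rho
```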
Rejection sampling The final general method discussed here is the accept-reject
method, developed by von Neumann. Suppose that the goal is to generate a sample from
f(x), where it is assumed that f is dominated by a density g, that is, f(x) ≤ c g(x) for
some constant c. The accept-reject algorithm is a two-step procedure:

Step 1: Draw U ∼ U[0, 1] and X ∼ g
Step 2: Accept Y = X if U ≤ f(X)/(c g(X)); otherwise return to Step 1.

Rejection sampling simulates repeatedly until a draw satisfying U ≤ f(X)/(c g(X)) is found. By
direct calculation, it is clear that Y has density f:
Prob(Y ≤ y) = Prob(X ≤ y | U ≤ f(X)/(c g(X)))

= Prob(X ≤ y, U ≤ f(X)/(c g(X))) / Prob(U ≤ f(X)/(c g(X)))

= [∫_{-∞}^y (∫_0^{f(x)/(c g(x))} du) g(x) dx] / [∫_{-∞}^{∞} (∫_0^{f(x)/(c g(x))} du) g(x) dx]

= [(1/c) ∫_{-∞}^y f(x) dx] / [(1/c) ∫_{-∞}^{∞} f(x) dx]

= ∫_{-∞}^y f(x) dx.
Rejection sampling requires (a) a bounding or dominating density g; (b) an ability to
evaluate the ratio f/g; (c) an ability to simulate i.i.d. draws from g; and (d) the bounding
constant c. Rejection sampling does not require that the normalization constant ∫ f(x) dx
be known, since the algorithm only requires the ratio f/(cg). For continuous densities
on a bounded support, it is easy to satisfy (a) and (c) (a uniform density works), but for
continuous densities on unbounded support it can be more difficult, since we need to find a
density with heavier tails and higher peaks. Setting c = sup_x f(x)/g(x) maximizes the acceptance
probability. In practice, finding the constant is difficult because f generally depends on a
multi-dimensional parameter vector, f(x|θ_f), and thus the bounding is over x and θ_f.

Rejection sampling is often used to generate random variables from various recognizable
distributions, such as the Gamma or beta distributions. In these cases, the structure of
the densities is well known and the bounding density can be tailored to generate fast
and efficient rejection sampling algorithms. In many of these cases, the densities are
log-concave (e.g., Normal, double exponential, Gamma, and Beta). A density f is log-concave
if ln f(x) is concave. For differentiable densities, this is equivalent to assuming that
d ln f(x)/dx = f′(x)/f(x) is non-increasing in x, that is, d^2 ln f(x)/dx^2 ≤ 0. Under these
conditions, it is possible to develop “black-box” generation methods that perform well in
many settings. Another modification, which “adapts” the dominating densities, works well
for log-concave densities.
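As a sketch of the two steps above, the following targets the Beta(2, 2) density f(x) = 6x(1 − x) with a uniform proposal g, so that c = sup_x f(x)/g(x) = f(1/2) = 1.5 (the target and constants are our choices, not an example from the text):

```python
import random

def rejection_beta22(n, seed=0):
    """Accept-reject draws from f(x) = 6 x (1 - x) on [0, 1] with a U[0, 1] proposal.

    c = sup_x f(x)/g(x) = 1.5, so the expected acceptance rate is 1/c = 2/3.
    """
    rng = random.Random(seed)
    c = 1.5
    draws = []
    while len(draws) < n:
        x = rng.random()              # X ~ g (uniform proposal)
        u = rng.random()              # U ~ U[0, 1]
        if u <= 6 * x * (1 - x) / c:  # accept if U <= f(X) / (c g(X))
            draws.append(x)
    return draws

draws = rejection_beta22(100_000)
print(sum(draws) / len(draws))  # Beta(2, 2) has mean 1/2
```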
The basis of rejection sampling is the simple fact that, since

f(x) = ∫_0^{f(x)} du = ∫_0^{f(x)} U(du),

where U is the uniform distribution function, the density f is a marginal distribution from
the joint distribution

(X, U) ∼ U({(x, u) : 0 ≤ u ≤ f(x)}).

More generally, if X ∈ R^d is a random vector with density f and U is an independent
random variable distributed U[0, 1], then (X, cUf(X)) is uniformly distributed on the set
A = {(x, u) : x ∈ R^d, 0 ≤ u ≤ cf(x)}. Conversely, if (X, U) is uniformly distributed on A,
then the density of X is f(x).
Multinomial Resampling Sampling N values from a discrete distribution (x_i, p_i)
can be done by simulating standard uniforms U_i and then using binary search to find the
value of j, and hence x_j, corresponding to

q_{j−1} < U_i ≤ q_j,

where q_j = Σ_{l=1}^j p_l and q_0 = 0. This algorithm is commonly used due to its simplicity,
but it is inefficient, as it requires O(N ln N) operations; the ln N factor comes from the
binary search.

A more efficient method (particularly suited to particle filtering) is to simulate N + 1
exponentially distributed variables z_0, . . . , z_N via z_i = − ln U_i, calculate the running
totals Z_j = Σ_{l=0}^j z_l, and then merge the sorted uniforms Z_i/Z_N against the q_j in a
single pass: output x_j for the i-th draw when q_{j−1} Z_N < Z_i ≤ q_j Z_N. This is an O(N) algorithm.
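A sketch of the O(N) pass (the function name and the small example distribution are our own):

```python
import itertools
import math
import random

def multinomial_resample(values, probs, n, seed=0):
    """O(n) multinomial resampling via cumulative sums of exponentials.

    Z_0, ..., Z_n are running totals of exponential draws, so Z_i / Z_n are
    n ordered U[0, 1] variates; they are merged against the cumulative
    probabilities q_j in a single pass.
    """
    rng = random.Random(seed)
    z = [-math.log(1.0 - rng.random()) for _ in range(n + 1)]  # exponential draws
    totals = list(itertools.accumulate(z))                     # Z_0, ..., Z_n
    z_n = totals[-1]
    out, j, q = [], 0, probs[0]
    for i in range(n):
        u = totals[i] / z_n                        # i-th ordered uniform
        while u > q and j < len(probs) - 1:        # find bin with q_{j-1} < u <= q_j
            j += 1
            q += probs[j]
        out.append(values[j])
    return out

sample = multinomial_resample([0, 1, 2], [0.2, 0.5, 0.3], 100_000)
print(sample.count(1) / len(sample))  # close to 0.5
```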
Slice Sampling Slice sampling can be used to sample the conditional posterior when
the scale mixture is a stable distribution. Here we have

p(λ_t|σ^2, y_t) ∝ p(y_t|λ_t, σ^2) p(λ_t),

where p(y_t|λ_t, σ^2) = N(0, σ^2 λ_t). To do this, introduce a uniform random variable u_t and
consider the joint distribution

p(λ_t, u_t|σ^2) ∝ p(λ_t) I(0 ≤ u_t ≤ φ(y_t; 0, λ_t σ^2)).

Then the algorithm alternates between the conditionals

p(λ_t|u_t, σ^2) ∝ p(λ_t) on {λ_t : φ(y_t; 0, λ_t σ^2) > u_t}
p(u_t|λ_t, σ^2) ∼ U[0, φ(y_t; 0, λ_t σ^2)].

The accept/reject sampling alternative is as follows. Since we have the upper bound
φ(y_t; 0, λ_t σ^2) ≤ (2π y_t^2)^{−1/2} exp(−1/2), sample λ_t via

λ_t ∼ p(λ_t)
u ∼ U[0, (2π y_t^2)^{−1/2} e^{−1/2}],

and if u > φ(y_t; 0, σ^2 λ_t), repeat.
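The two-step slice recursion is easiest to see on a target where the slice is available in closed form. The sketch below uses f(x) = e^{−x} rather than the stable-mixture target above (the target choice is ours): the slice {x : f(x) > u} is simply [0, −ln u).

```python
import math
import random

def slice_sample_exponential(n_iter, seed=0):
    """Slice sampler for f(x) = exp(-x) on [0, inf).

    Alternates u | x ~ U(0, f(x)] with x | u uniform on the slice
    {x : f(x) > u} = [0, -ln u), both available in closed form here.
    """
    rng = random.Random(seed)
    x, draws = 1.0, []
    for _ in range(n_iter):
        u = math.exp(-x) * (1.0 - rng.random())  # vertical step, u in (0, f(x)]
        x = -math.log(u) * rng.random()          # horizontal step on the slice
        draws.append(x)
    return draws

draws = slice_sample_exponential(200_000)
post = draws[1_000:]                             # drop a short burn-in
print(sum(post) / len(post))  # stationary distribution is exp(1), with mean 1
```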
3.2 Integration results
There are a number of useful integration results that are repeatedly used for normalising
probability distributions. The first identity is
∫_{-∞}^{∞} exp(−(x − µ)^2/(2σ^2)) dx = √(2πσ^2),

for all µ, which defines the univariate normal distribution. The second identity is for the
Gamma function, defined as

Γ(α) = ∫_0^∞ y^{α−1} e^{−y} dy,

which is important for a number of distributions, most notably the Gamma and inverse
Gamma. This implies that y^{α−1} e^{−y}/Γ(α) is a proper density. Changing variables to x = βy
gives the standard form of the Gamma distribution.
Integration by parts implies Γ(α + 1) = αΓ(α), so that for integer α, Γ(α) = (α − 1)!,
with Γ(1) = 1 and Γ(1/2) = √π. For other fractional values, the duplication formula

Γ(α) Γ(α + 1/2) = 2^{1−2α} √π Γ(2α)

is useful.
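These Gamma function identities are easy to sanity-check numerically with Python's math.gamma (the test values of α are arbitrary):

```python
import math

# duplication formula: Gamma(a) Gamma(a + 1/2) = 2^(1 - 2a) sqrt(pi) Gamma(2a)
for a in (0.5, 1.0, 1.3, 4.7):
    lhs = math.gamma(a) * math.gamma(a + 0.5)
    rhs = 2.0 ** (1 - 2 * a) * math.sqrt(math.pi) * math.gamma(2 * a)
    print(a, lhs, rhs)

# special values: Gamma(1/2) = sqrt(pi) and, for integer n, Gamma(n) = (n - 1)!
print(math.gamma(0.5), math.sqrt(math.pi))
print(math.gamma(5.0))  # (5 - 1)! = 24
```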
A number of integrals are useful for analytic characterization of distributions, either priors
or posteriors, that are scale mixtures of normals. As discussed in the previous appendix,
scale mixtures involve integrals that are a product of two distributions. The following
integral identities are useful for analyzing these specifications. For any given p, a, b > 0,

∫_0^∞ x^{p−1} exp(−a x^b) dx = (1/b) a^{−p/b} Γ(p/b)

∫_0^∞ (1/x)^{p+1} exp(−a x^{−b}) dx = (1/b) a^{−p/b} Γ(p/b).

These integrals are useful when combining Gamma or inverse Gamma prior
distributions with normal likelihood functions. Second, for any a and b,

∫_0^∞ (a/√(2πx)) exp{−(1/2)(a^2 x + b^2/x)} dx = exp(−|ab|),

which is useful for the double exponential distribution. A related integral, which is useful
for the check exponential distribution, is

∫_0^∞ (a/√(2πx)) exp{−(1/2)(b^2/x + 2(2τ − 1)b + a^2 x)} dx = exp(−|ab| − (2τ − 1)b).
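Both families of identities can be checked by simple quadrature (the grid, the parameter values, and the trapezoidal helper are our choices):

```python
import math
import numpy as np

def trapz(y, x):
    """Simple trapezoidal rule (avoids version differences around np.trapz)."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

x = np.linspace(1e-8, 60.0, 1_000_000)

# first identity: int_0^inf x^(p-1) exp(-a x^b) dx = (1/b) a^(-p/b) Gamma(p/b)
p, a, b = 2.5, 1.3, 2.0
lhs1 = trapz(x ** (p - 1) * np.exp(-a * x ** b), x)
rhs1 = (1 / b) * a ** (-p / b) * math.gamma(p / b)

# second identity: int_0^inf (a/sqrt(2 pi x)) exp(-(a^2 x + b^2/x)/2) dx = exp(-|ab|)
a2, b2 = 1.5, 0.7
lhs2 = trapz(a2 / np.sqrt(2 * np.pi * x) * np.exp(-(a2 ** 2 * x + b2 ** 2 / x) / 2), x)
rhs2 = math.exp(-abs(a2 * b2))

print(lhs1, rhs1)
print(lhs2, rhs2)
```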
3.3 Useful algebraic formulae
One of the most useful algebraic tricks for calculating posterior distributions is completing
the square (have fun showing the algebra for the second part!).

In the scalar case, we have the identity

(x − µ_1)^2/Σ_1 + (x − µ_2)^2/Σ_2 = (x − µ_3)^2/Σ_3 + (µ_1 − µ_2)^2/(Σ_1 + Σ_2),

where µ_3 = Σ_3(Σ_1^{−1}µ_1 + Σ_2^{−1}µ_2) and Σ_3 = (Σ_1^{−1} + Σ_2^{−1})^{−1}.
A shrinkage interpretation is the following. When analyzing shrinkage it is common
to define the weight w = (Σ_1 + Σ_2)^{−1} Σ_1, so that µ_3 = (1 − w)µ_1 + wµ_2, Σ_3 = Σ_2 w, and
Σ_1 + Σ_2 = Σ_1 w^{−1}. We can then rewrite completing the square as

(x − µ_1)^2/Σ_1 + (x − µ_2)^2/Σ_2 = (µ_1 − µ_2)^2/(Σ_1 w^{−1}) + (x − (1 − w)µ_1 − wµ_2)^2/(Σ_2 w).
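A quick numerical check of the scalar identity and its shrinkage form, using w = (Σ_1 + Σ_2)^{−1}Σ_1 so that Σ_1 w^{−1} = Σ_1 + Σ_2, Σ_2 w = Σ_3, and µ_3 = (1 − w)µ_1 + wµ_2 (the test values are arbitrary):

```python
# numerical check of both scalar completing-the-square identities
mu1, mu2, S1, S2, x = 0.3, 2.0, 1.5, 0.8, 0.7   # arbitrary test values

S3 = 1.0 / (1.0 / S1 + 1.0 / S2)
mu3 = S3 * (mu1 / S1 + mu2 / S2)
w = S1 / (S1 + S2)                               # w = (S1 + S2)^(-1) S1

lhs = (x - mu1) ** 2 / S1 + (x - mu2) ** 2 / S2
rhs1 = (x - mu3) ** 2 / S3 + (mu1 - mu2) ** 2 / (S1 + S2)
# shrinkage form: S1 / w = S1 + S2, S2 * w = S3, and mu3 = (1 - w) mu1 + w mu2
rhs2 = (mu1 - mu2) ** 2 / (S1 / w) + (x - (1 - w) * mu1 - w * mu2) ** 2 / (S2 * w)

print(lhs, rhs1, rhs2)  # all three agree
```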
In the vector-matrix case, completing the square becomes

(x − µ_1)′Σ_1^{−1}(x − µ_1) + (x − µ_2)′Σ_2^{−1}(x − µ_2) = (x − µ_3)′Σ_3^{−1}(x − µ_3) + (µ_1 − µ_2)′Σ_4^{−1}(µ_1 − µ_2),

where µ_3 = Σ_3(Σ_1^{−1}µ_1 + Σ_2^{−1}µ_2) and

Σ_3 = (Σ_1^{−1} + Σ_2^{−1})^{−1}
Σ_4^{−1} = Σ_1^{−1}(Σ_1^{−1} + Σ_2^{−1})^{−1}Σ_2^{−1} = (Σ_1 + Σ_2)^{−1}.
Completing the square for the product of two normal densities is also useful when
interpreting the fundamental identity: likelihood times prior equals posterior times
marginal. The identity yields

φ(x; µ_1, Σ_1) φ(x; µ_2, Σ_2) = φ(x; µ, Σ) c(µ_1, Σ_1, µ_2, Σ_2),

where φ(x; µ, Σ) denotes the k-dimensional normal density with mean µ and covariance Σ,

c(µ_1, Σ_1, µ_2, Σ_2) = (2π)^{−k/2} |Σ_1 + Σ_2|^{−1/2} exp(−(1/2)(µ_2 − µ_1)′(Σ_1 + Σ_2)^{−1}(µ_2 − µ_1)),

and the parameters are µ = Σ(Σ_1^{−1}µ_1 + Σ_2^{−1}µ_2) and Σ = (Σ_1^{−1} + Σ_2^{−1})^{−1}.
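The product identity can be verified numerically; note that the constant c(µ_1, Σ_1, µ_2, Σ_2), i.e. (2π)^{−k/2}|Σ_1 + Σ_2|^{−1/2} exp(−½(µ_2 − µ_1)′(Σ_1 + Σ_2)^{−1}(µ_2 − µ_1)), is itself a normal density, namely φ(µ_1; µ_2, Σ_1 + Σ_2), which the sketch below exploits (dimension and test values are our choices):

```python
import numpy as np

def mvn_pdf(x, mu, S):
    """Multivariate normal density phi(x; mu, S)."""
    k = len(mu)
    d = x - mu
    quad = d @ np.linalg.solve(S, d)
    return float(np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** k * np.linalg.det(S)))

rng = np.random.default_rng(2)
k = 3
A1, A2 = rng.standard_normal((k, k)), rng.standard_normal((k, k))
S1, S2 = A1 @ A1.T + np.eye(k), A2 @ A2.T + np.eye(k)   # random SPD covariances
mu1, mu2, x = rng.standard_normal(k), rng.standard_normal(k), rng.standard_normal(k)

S = np.linalg.inv(np.linalg.inv(S1) + np.linalg.inv(S2))
mu = S @ (np.linalg.solve(S1, mu1) + np.linalg.solve(S2, mu2))
c = mvn_pdf(mu1, mu2, S1 + S2)   # the normalizing constant c(mu1, S1, mu2, S2)

lhs = mvn_pdf(x, mu1, S1) * mvn_pdf(x, mu2, S2)
rhs = mvn_pdf(x, mu, S) * c
print(lhs, rhs)
```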
3.4 The EM, ECM, and ECME algorithms
MCMC methods have been used extensively to perform numerical integration. There is
also interest in using simulation-based methods to optimise functions. The EM algorithm
is an algorithm in a general class of Q-maximisation algorithms that find a (deterministic)
sequence {θ^(g)} converging to arg max_{θ∈Θ} Q(θ).

First, define a function Q(θ, φ) such that Q(θ) = Q(θ, θ) and Q(θ, θ) ≥ Q(θ, φ) for all φ.
Then define

θ^(g+1) = arg max_{θ∈Θ} Q(θ, θ^(g)).

In order to prove convergence, note that this construction yields the sequence of inequalities

Q(θ^(0), θ^(0)) ≤ Q(θ^(1), θ^(0)) ≤ Q(θ^(1), θ^(1)) ≤ · · ·,

where the odd-numbered inequalities follow from the maximisation step and the even-numbered
ones from the inequality above.
In many models we have to deal with a latent variable, and estimation requires integration.
For example, suppose that we have a triple (y, z, θ) with joint probability specification
p(y, z, θ) = p(y|z, θ)p(z, θ). This occurs in missing data problems and in estimation
problems for mixture models.

A standard application of the EM algorithm is to find

arg max_{θ∈Θ} ∫ p(y|z, θ) p(z|θ) dz.

As we are just finding an optimum, we do not need the prior specification p(θ). The EM
algorithm finds a sequence of parameter values θ^(g) by alternating between an expectation
and a maximisation step. This still requires the numerical (or analytical) computation of
the criterion function Q(θ, θ^(g)) described below.
EM algorithms have been used extensively in mixture models and missing data
problems. The EM algorithm uses the particular choice

Q(θ) = log p(y|θ) = log ∫ p(y, z|θ) dz.

Here the likelihood has a mixture representation, where z is the latent variable (missing
data, state variable, etc.). This is a Q-maximization algorithm with

Q(θ, θ^(g)) = ∫ log p(y, z|θ) p(z|θ^(g), y) dz = E_{z|θ^(g),y}[log p(y, z|θ)].

To implement EM you need to be able to calculate Q(θ, θ^(g)) and optimize it at each iteration.

The EM algorithm and its extensions ECM and ECME are methods of computing
maximum likelihood estimates or posterior modes in the presence of missing data. Let the
objective function be ℓ(θ) = log p(θ|y) + c(y), where c(y) is a possibly unknown normalizing
constant that does not depend on θ and y denotes observed data. We have a mixture
representation,

p(θ|y) = ∫ p(θ, z|y) dz = ∫ p(θ|z, y) p(z|y) dz,
where the distribution of the latent variables is p(z|θ, y) = p(y|θ, z)p(z|θ)/p(y|θ).

In some cases the complete-data log-posterior is simple enough for arg max_θ log p(θ|z, y)
to be computed in closed form. The EM algorithm alternates between the Expectation and
Maximization steps for which it is named. The E-step and M-step compute

Q(β|β^(g)) = E_{z|β^(g),y}[log p(y, z|β)] = ∫ log p(y, z|β) p(z|β^(g), y) dz

β^(g+1) = arg max_β Q(β|β^(g)).

This has an important monotonicity property that ensures ℓ(β^(g)) ≤ ℓ(β^(g+1)) for all g.
In fact, the monotonicity proof given by Dempster et al. (1977) shows that any β with
Q(β, β^(g)) ≥ Q(β^(g), β^(g)) also satisfies the log-likelihood inequality ℓ(β) ≥ ℓ(β^(g)).
In problems with many parameters, the M-step of EM may be difficult. In this case θ may
be partitioned into components (θ_1, . . . , θ_k) in such a way that maximizing log p(θ_j|θ_{−j}, z, y)
is easy. The ECM algorithm pairs the EM algorithm's E-step with k conditional maximization
(CM) steps, each maximizing Q over one component θ_j with the components of θ_{−j}
fixed at their most recent values. Because each CM step increases Q, the ECM algorithm
retains the monotonicity property. The ECME algorithm replaces some of ECM's
CM steps with maximizations over ℓ instead of Q. Liu and Rubin (1994) show that doing
so can greatly increase the rate of convergence.
In many cases we will have a parameter vector θ = (β, ν) partitioned into its components
and a missing data vector z = (λ, ω). Then we compute the Q(β, ν|β^(g), ν^(g)) objective
function and derive E- and M-steps from it to provide an iterative algorithm for
updating parameters. To update the hyperparameter ν we can maximize the full
posterior p(β, ν|y) with β fixed at β^(g+1). The algorithm can be summarized as follows:

β^(g+1) = arg max_β Q(β|β^(g), ν^(g)), where Q(β|β^(g), ν^(g)) = E_{z|β^(g),ν^(g),y}[log p(y, z|β, ν^(g))]
ν^(g+1) = arg max_ν log p(β^(g+1), ν|y).
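As an illustration of the E- and M-steps, the following sketch runs EM for a simple two-component normal mixture with known unit variances and equal weights (the model, simulated data, and starting values are our own, not an example from the text):

```python
import numpy as np

rng = np.random.default_rng(3)
# simulated data: 50/50 mixture of N(-2, 1) and N(2, 1); z is the latent label
true_means = np.array([-2.0, 2.0])
z = rng.integers(0, 2, size=5000)
y = rng.normal(true_means[z], 1.0)

def em_two_means(y, m, n_iter=100):
    """EM for a 50/50 mixture of N(m[0], 1) and N(m[1], 1) with unknown means."""
    for _ in range(n_iter):
        # E-step: responsibilities p(z = j | y, m^(g)); shared constants cancel
        d = np.exp(-0.5 * (y[:, None] - m[None, :]) ** 2)
        r = d / d.sum(axis=1, keepdims=True)
        # M-step: argmax of Q(m | m^(g)) = E[log p(y, z | m)] -> weighted means
        m = (r * y[:, None]).sum(axis=0) / r.sum(axis=0)
    return m

m_hat = em_two_means(y, m=np.array([-0.5, 0.5]))
print(m_hat)  # close to the true means (-2, 2)
```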
Simulated Annealing (SA) A simulation-based approach to finding θ* = arg max_{θ∈Θ} H(θ)
is to sample from a sequence of densities

π_J(θ) = e^{J H(θ)} / ∫ e^{J H(θ)} dµ(θ),

where J is an (inverse) temperature parameter. Instead of looking at derivatives and
performing gradient-based optimization, one simulates from the sequence of densities. This
forms a time-inhomogeneous Markov chain, and under suitable regularity conditions on the
relaxation schedule for the temperature we have θ^(g) → θ*. The main caveat is that we need
to know the criterion function H(θ) to evaluate the Metropolis probability for sampling from
the sequence of densities. This is not always available.
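A minimal sketch of this scheme (the criterion function H and the geometric temperature schedule are our own choices): a Metropolis random walk targets π_J ∝ e^{J H(θ)} while J is slowly increased, so the chain concentrates on the global maximum of H.

```python
import math
import random

def simulated_annealing(h, theta0, n_iter=30_000, seed=0):
    """Metropolis sampling from pi_J(theta) ∝ exp(J h(theta)) while J increases."""
    rng = random.Random(seed)
    theta = theta0
    for g in range(n_iter):
        J = 0.1 * 1000.0 ** (g / (n_iter - 1))  # schedule: J grows from 0.1 to 100
        prop = theta + rng.gauss(0.0, 0.5)      # random-walk proposal
        delta = J * (h(prop) - h(theta))        # Metropolis log-acceptance ratio
        if delta >= 0 or rng.random() < math.exp(delta):
            theta = prop
    return theta

# tilted double well: local maximum near -0.88, global maximum near 1.11
h = lambda t: -(t * t - 1.0) ** 2 + t
theta_hat = simulated_annealing(h, theta0=0.0)
print(theta_hat)  # the chain should settle near the global maximum
```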
An interesting generalisation, which is appropriate in latent variable mixture models,
is the following. Suppose that H(θ) = E_{z|θ}{H(z, θ)} is unavailable in closed form, where
without loss of generality we assume that H(z, θ) ≥ 0. In this case we can use latent
variable simulated annealing (LVSA) methods. Define a joint probability distribution for
z^J = (z_1, . . . , z_J) as

π_J(z^J, θ) ∝ ∏_{j=1}^J H(z_j, θ) p(z_j|θ) µ(θ),

for some measure µ which ensures integrability of the joint. This distribution has the
property that its marginal distribution on θ is given by

π_J(θ) ∝ E_{z|θ}{H(z, θ)}^J µ(θ) = e^{J ln H(θ)} µ(θ).

By the simulated annealing argument we see that this marginal collapses on the maximum
of ln H(θ). The advantage of this approach is that it is typically straightforward to sample
with MCMC from the conditionals

π_J(z_j|θ) ∝ H(z_j, θ) p(z_j|θ) and π_J(θ|z^J) ∝ ∏_{j=1}^J H(z_j, θ) p(z_j|θ) µ(θ).

Jacquier, Johannes and Polson (2007) apply this to finding MLE estimates for commonly
encountered latent variable models.
3.5 Notes and Discussion
Geman (1987) provides a monograph-length discussion of Markov chain optimisation methods.
Pincus (1968) provides an early application of the Metropolis algorithm to simulated annealing
optimisation. Standard references on simulated annealing include Kirkpatrick (1984),
Kirkpatrick et al. (1983), Aarts and Korst (1989), and Van Laarhoven and Aarts (1987). Muller
(2000) and Mueller et al. (2004) propose this version of latent variable simulated annealing
for dealing with the joint problem of integration and optimisation. Slice sampling applications
in statistics are discussed in Besag and Green (1993), Polson (1996), and Damien et
al. (1999), and Neal (2003) provides a general algorithm. General strategies based on the
ratio-of-uniforms method are in Wakefield et al. (1991). Devroye (1986) and Ripley (1987)
discuss many tailor-made algorithms for simulation.

The EM algorithm for missing data problems was originally developed by Dempster,
Laird and Rubin (1977) but has its roots in the hidden Markov model literature and
the Baum-Welch algorithm. Liu and Rubin (1994) consider faster ECME algorithms.
Liang and Wong (2001) and Liu, Liang and Wong (2000) consider extensions, including
evolutionary MCMC algorithms.