1 Appendix: Common distributions
This Appendix provides details for common univariate and multivariate distributions, including definitions, moments, and simulation. Devroye (1986) provides a complete treatment of random number generation, although care must be taken as many distributions can be parameterized in different ways.
Uniform
• A random variable X has a uniform distribution on the interval [α, β], denoted U(α, β), if the probability density function (pdf) is

p(x|α, β) = 1/(β − α)

for x ∈ [α, β] and 0 otherwise. The mean and variance of a uniform random variable are E(X) = (α + β)/2 and var(X) = (β − α)²/12, respectively.
• The uniform distribution plays a foundational role in random number generation. In
particular, uniform random numbers are required for the inverse transform simulation
method, accept-reject algorithms, and the Metropolis algorithm. Fast and accurate
pre-programmed algorithms are available in most statistical software packages and
programming languages.
Bernoulli
• A random variable X has a Bernoulli distribution with parameter θ, denoted X ∼ Ber(θ), if the probability mass function (pmf) is

Prob(X = x|θ) = θ^x (1 − θ)^{1−x}

for x ∈ {0, 1}. The mean and variance of a Bernoulli random variable are E(X) = θ and var(X) = θ(1 − θ), respectively.
• To simulate X ∼ Ber(θ),
1. Draw U ∼ U(0, 1)
2. Set X = 1 if U < θ, and X = 0 otherwise.
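The two steps above can be sketched in Python using only the standard library (the function name rbernoulli is ours):

```python
import random

def rbernoulli(theta, rng=random):
    """Draw X ~ Ber(theta): X = 1 if a uniform draw falls below theta, else 0."""
    return 1 if rng.random() < theta else 0

# Sanity check: the frequency of ones should be close to theta.
random.seed(1)
draws = [rbernoulli(0.3) for _ in range(100_000)]
```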
Binomial
• A random variable X ∈ {0, 1, ..., n} has a Binomial distribution with parameters n and θ, denoted X ∼ Bin(n, θ), if the pmf is

Prob(X = x|n, θ) = n!/(x!(n − x)!) θ^x (1 − θ)^{n−x},

where n! = n(n − 1) ··· 2 · 1. The mean and variance of a Binomial random variable are E(X) = nθ and var(X) = nθ(1 − θ), respectively. The Binomial distribution arises as the distribution of a sum of n independent Bernoulli trials, and is closely related to a number of other distributions. If W_1, ..., W_n are i.i.d. Ber(p), then W_1 + ··· + W_n ∼ Bin(n, p). As n → ∞ with np = λ held fixed, X ∼ Bin(n, p) converges in distribution to a Poisson distribution with parameter λ.
• To simulate X ∼ Bin(n, θ),
1. Draw X_1, ..., X_n independently, X_i ∼ Ber(θ)
2. Set X = ∑_{i=1}^n X_i, the count of X_i = 1.
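A minimal Python sketch of this sum-of-Bernoullis construction (standard library only; the function name rbinomial is ours):

```python
import random

def rbinomial(n, theta, rng=random):
    """Draw X ~ Bin(n, theta) as the number of successes in n Bernoulli(theta) trials."""
    return sum(1 for _ in range(n) if rng.random() < theta)

random.seed(2)
samples = [rbinomial(10, 0.5) for _ in range(50_000)]
```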
Multinomial
• A vector of random variables X = (X_1, ..., X_k) has a Multinomial distribution, denoted X ∼ Mult(n, p_1, ..., p_k), if

Prob(X = x|p_1, ..., p_k) = n!/(x_1! ··· x_k!) ∏_{i=1}^k p_i^{x_i},

where ∑_{i=1}^k x_i = n. The Multinomial distribution is a natural extension of the Bernoulli and Binomial distributions. The Bernoulli distribution gives a single trial resulting in success or failure. The Binomial distribution is an extension that involves n independently repeated Bernoulli trials. The Multinomial allows for multiple outcomes, instead of the two outcomes in the Binomial distribution. There are still n total trials, but now the outcome of each trial is assigned to one of k categories, and x_i counts the number of outcomes in category i. The probability of category i is p_i. The mean, variance, and covariances of the Multinomial distribution are given by

E(X_i) = np_i, var(X_i) = np_i(1 − p_i), and cov(X_i, X_j) = −np_i p_j.
Multinomial distributions are often used in modeling finite mixture distributions
where the Multinomial random variables represent the various mixture components.
Dirichlet
• A vector of random variables X = (X_1, ..., X_k) has a Dirichlet distribution, denoted X ∼ D(α_1, ..., α_k), if ∑_{i=1}^k X_i = 1 and

p(x|α_1, ..., α_k) = Γ(∑_{i=1}^k α_i)/(Γ(α_1) ··· Γ(α_k)) ∏_{i=1}^k x_i^{α_i−1}.

The Dirichlet distribution is used as a prior for mixture probabilities in mixture models. Writing α_0 = ∑_{i=1}^k α_i, the mean, variance, and covariances of the Dirichlet distribution are

E(X_i) = α_i/α_0, var(X_i) = α_i ∑_{j≠i} α_j / (α_0²(α_0 + 1)), and cov(X_i, X_j) = −α_i α_j / (α_0²(α_0 + 1)).

• To simulate a Dirichlet X = (X_1, ..., X_k) ∼ D(α_1, ..., α_k), use the two-step procedure
Step 1: Draw k independent Gammas, Y_i ∼ G(α_i, 1)
Step 2: Set X_i = Y_i / ∑_{j=1}^k Y_j.
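The normalize-independent-Gammas procedure can be sketched in Python (standard library only; random.gammavariate(a, 1.0) draws a G(a, 1) variate; the function name rdirichlet is ours):

```python
import random

def rdirichlet(alphas, rng=random):
    """Draw X ~ D(alpha_1,...,alpha_k): normalize independent Gamma(alpha_i, 1) draws."""
    ys = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(ys)
    return [y / total for y in ys]

random.seed(3)
samples = [rdirichlet([2.0, 3.0, 5.0]) for _ in range(20_000)]
```

Each draw lies on the simplex, and the sample mean of component i approaches α_i/α_0.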
Poisson
• A random variable X ∈ {0, 1, 2, ...} (the non-negative integers) has a Poisson distribution with parameter λ, denoted X ∼ Poi(λ), if the pmf is

Prob(X = x|λ) = e^{−λ} λ^x / x!.

The mean and variance of a Poisson random variable are E(X) = λ and var(X) = λ, respectively.
• To simulate X ∼ Poi(λ),
1. Draw Z_1, Z_2, ... independently, Z_i ∼ exp(1)
2. Set X = max{n ≥ 0 : ∑_{i=1}^n Z_i ≤ λ},
the number of unit-rate exponential arrivals in [0, λ].
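A sketch of this arrival-counting construction in Python (standard library only; random.expovariate(1.0) draws a unit-mean exponential; the function name rpoisson is ours):

```python
import random

def rpoisson(lam, rng=random):
    """Draw X ~ Poi(lam) by counting unit-rate exponential arrivals in [0, lam]."""
    x, total = 0, rng.expovariate(1.0)
    while total <= lam:
        x += 1
        total += rng.expovariate(1.0)
    return x

random.seed(4)
samples = [rpoisson(4.0) for _ in range(50_000)]
```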
Exponential
• A random variable X ∈ R_+ has an exponential distribution with parameter µ, denoted X ∼ exp(µ), if the pdf is

p(x|µ) = (1/µ) exp(−x/µ).

The mean and variance of an exponential random variable are E(X) = µ and var(X) = µ², respectively.
• The inverse transform method is the easiest way to simulate exponential random variables, since the cumulative distribution function (cdf) is F(x) = 1 − e^{−x/µ}.
• To simulate X ∼ exp (µ),
1. Draw U ∼ U [0, 1]
2. Set X = −µ ln (1− U) .
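These two steps can be sketched in Python (standard library only; the function name rexp is ours):

```python
import math
import random

def rexp(mu, rng=random):
    """Inverse-transform draw from exp(mu) (mean mu): X = -mu * ln(1 - U)."""
    return -mu * math.log(1.0 - rng.random())

random.seed(5)
samples = [rexp(2.0) for _ in range(100_000)]
```

Using 1 − U rather than U avoids log(0), since random.random() returns values in [0, 1).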
Gamma
• A random variable X ∈ R_+ has a Gamma distribution with parameters α and β, denoted X ∼ G(α, β), if the pdf is

p(x|α, β) = β^α/Γ(α) x^{α−1} exp(−βx).

The mean and variance of a Gamma random variable are E(X) = α/β and var(X) = α/β², respectively. It is important to realise that there are different parameterizations of the Gamma distribution; e.g. MATLAB parameterizes the Gamma density as

p(x|α, β) = 1/(Γ(α)β^α) x^{α−1} exp(−x/β).

If X = Y/β and Y ∼ G(α, 1), then X ∼ G(α, β). To see this, use the inverse transform Y = βX with dY/dX = β, which implies that

p(x|α, β) = 1/Γ(α) (βx)^{α−1} exp(−βx) β = β^α/Γ(α) x^{α−1} exp(−βx),

the density of a G(α, β) random variable. The exponential distribution is a special case of the Gamma distribution when α = 1: X ∼ G(1, µ^{−1}) implies that X ∼ exp(µ).
• Gamma random variable simulation is standard, with built-in generators in most software packages. These algorithms typically use accept/reject algorithms that are customized to the specific values of α and β. To simulate X ∼ G(α, β) when α is integer-valued,
1. Draw X_1, ..., X_α independently, X_i ∼ exp(1)
2. Set X = (1/β) ∑_{i=1}^α X_i.
For non-integer α, accept-reject methods provide fast and accurate algorithms for Gamma simulation.
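For integer α, the sum-of-exponentials construction can be sketched in Python (standard library only; the function name rgamma_int is ours; β is the rate, matching the G(α, β) parameterization with mean α/β):

```python
import math
import random

def rgamma_int(alpha, beta, rng=random):
    """Draw X ~ G(alpha, beta) (rate beta) for integer alpha: scale a sum of
    alpha unit exponentials by 1/beta."""
    return sum(-math.log(1.0 - rng.random()) for _ in range(alpha)) / beta

random.seed(6)
samples = [rgamma_int(3, 2.0) for _ in range(50_000)]
```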
Beta
• A random variable X ∈ [0, 1] has a Beta distribution with parameters α and β, denoted X ∼ B(α, β), if the pdf is

p(x|α, β) = x^{α−1}(1 − x)^{β−1}/B(α, β),

where B(α, β) = Γ(α)Γ(β)/Γ(α + β) is the Beta function. As ∫ p(x|α, β) dx = 1, we have B(α, β) = ∫_0^1 x^{α−1}(1 − x)^{β−1} dx. The mean and variance of a Beta random variable are

E(X) = α/(α + β) and var(X) = αβ/((α + β)²(α + β + 1)),

respectively. If α = β = 1, then X ∼ U(0, 1).
• If α and β are integers, to simulate X ∼ B(α, β),
1. Draw X_1 ∼ G(α, 1) and X_2 ∼ G(β, 1)
2. Set X = X_1/(X_1 + X_2).
When α is integer-valued and β is non-integer-valued (a common case in Bayesian inference), exponential random variables can be used to simulate Beta random variables via the transformation method.
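The ratio-of-Gammas construction can be sketched in Python (standard library only; the function name rbeta is ours):

```python
import random

def rbeta(alpha, beta, rng=random):
    """Draw X ~ B(alpha, beta) as X1/(X1+X2) with X1 ~ G(alpha,1), X2 ~ G(beta,1)."""
    x1 = rng.gammavariate(alpha, 1.0)
    x2 = rng.gammavariate(beta, 1.0)
    return x1 / (x1 + x2)

random.seed(7)
samples = [rbeta(2, 3) for _ in range(50_000)]
```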
Chi-squared
• A random variable X ∈ R_+ has a Chi-squared distribution with parameter ν, denoted X ∼ X²_ν, if the pdf is

p(x|ν) = 1/(2^{ν/2} Γ(ν/2)) x^{ν/2−1} exp(−x/2).

The mean and variance of X are E(X) = ν and var(X) = 2ν, respectively. The X²_ν distribution is a special case of the Gamma distribution: X²_ν = G(ν/2, 1/2).
• Simulating chi-squared random variables typically uses the transformation method. For integer values of ν, the following two-step procedure simulates a X²_ν random variable:
Step 1: Draw Z_1, ..., Z_ν independently, Z_i ∼ N(0, 1)
Step 2: Set X = ∑_{i=1}^ν Z_i².
When ν is large, simulating using normal random variables is computationally costly, and alternative, more computationally efficient algorithms use Gamma random variable generation.
Inverse Gamma
• A random variable X ∈ R_+ has an inverse Gamma distribution, denoted X ∼ IG(α, β), if the pdf is

p(x|α, β) = β^α/Γ(α) x^{−(α+1)} exp(−β/x).

The mean (for α > 1) and variance (for α > 2) of the inverse Gamma distribution are

E(X) = β/(α − 1) and var(X) = β²/((α − 1)²(α − 2)).

If Y ∼ G(α, β), then X = Y^{−1} ∼ IG(α, β). To see this, substitute y = 1/x (so dy = −x^{−2} dx, which flips the limits of integration) in

1 = ∫_0^∞ β^α/Γ(α) y^{α−1} exp(−βy) dy = ∫_0^∞ β^α/Γ(α) (1/x)^{α−1} exp(−β/x) (1/x²) dx
  = ∫_0^∞ β^α/Γ(α) x^{−(α+1)} exp(−β/x) dx.
• The following two steps simulate an IG(α, β) random variable:
Step 1: Draw Y ∼ G(α, 1)
Step 2: Set X = β/Y.
Again, as in the case of the Gamma distribution, some authors use a different parameterization for this distribution, so it is important to make sure you are drawing using the correct parameters. In the case of prior distributions over the scale, σ², it is additionally complicated because some authors (Zellner, 1971) parameterize σ instead of σ².
Pareto
• A random variable X ∈ R_+ has a Pareto distribution, denoted Par(α, β), if the pdf for x > β and α > 0 is

p(x|α, β) = αβ^α/x^{α+1}.

The mean and variance are E(X) = αβ/(α − 1) if α > 1 and var(X) = αβ²/((α − 1)²(α − 2)) if α > 2.
• The following two steps simulate a Par(α, β) random variable:
Step 1: Draw Y ∼ exp(1/α), an exponential with mean 1/α (rate α)
Step 2: Set X = βe^Y.
Normal
• A random variable X ∈ R has a normal distribution with parameters µ and σ², denoted X ∼ N(µ, σ²), if the pdf is

p(x|µ, σ²) = 1/√(2πσ²) exp(−(x − µ)²/(2σ²)).

We will also use φ(x|µ, σ²) to denote the pdf and Φ(x|µ, σ²) the cdf. The mean and variance are E(X) = µ and var(X) = σ².
• Given the importance of normal random variables, all software packages have func-
tions to draw normal random variables. The algorithms typically use transformation
methods drawing uniform and exponential random variables or look-up tables. The
classic algorithm is the Box-Muller approach based on simulating uniforms.
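The Box-Muller transform can be sketched in Python (standard library only; the function name box_muller is ours):

```python
import math
import random

def box_muller(rng=random):
    """Box-Muller: map two independent uniforms to two independent N(0,1) draws."""
    u1 = 1.0 - rng.random()                # in (0, 1], avoids log(0)
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))     # r^2 = -2 ln U1 is a chi-squared(2) draw
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

random.seed(8)
pairs = [box_muller() for _ in range(50_000)]
samples = [z for pair in pairs for z in pair]
```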
Log-Normal
• A random variable X ∈ R_+ has a lognormal distribution with parameters µ and σ², denoted X ∼ LN(µ, σ²), if the pdf is

p(x|µ, σ²) = 1/√(2πσ²) (1/x) exp(−(ln x − µ)²/(2σ²)).

The mean and variance of the log-normal distribution are E(X) = e^{µ+σ²/2} and var(X) = exp(2µ + σ²)(exp(σ²) − 1). It is related to a normal distribution via the transformation X = e^{µ+σZ}, where Z ∼ N(0, 1). Although all finite moments of the lognormal exist, the distribution does not admit a moment-generating function.
• Simulating lognormal random variables via the transformation method is straightforward, since X = e^{µ+σZ}, where Z ∼ N(0, 1), is LN(µ, σ²).
Truncated Normal
• A random variable X has a truncated normal distribution, denoted T N(µ, σ²), with parameters µ, σ² and truncation region (a, b), if the pdf is

p(x|a < x < b) = φ(x|µ, σ²)/(Φ(b|µ, σ²) − Φ(a|µ, σ²)),

where ∫_{−∞}^b φ(x|µ, σ²) dx = Φ(b|µ, σ²). The mean of a truncated normal distribution is

E(X|a < X < b) = µ + σ (φ_a − φ_b)/(Φ_b − Φ_a),

where φ_x is the standard normal density evaluated at (x − µ)/σ and Φ_x is the standard normal cdf evaluated at (x − µ)/σ.
• The inversion method can be used to simulate this distribution. A two-step algorithm that provides a draw from a truncated standard normal is
Step 1: Draw U ∼ U[0, 1]
Step 2: Set X = Φ^{−1}[Φ(a) + U(Φ(b) − Φ(a))],
where Φ(a) = ∫_{−∞}^a (2π)^{−1/2} exp(−x²/2) dx. For the general truncated normal,
Step 1: Draw U ∼ U[0, 1]
Step 2: Set X = µ + σΦ^{−1}[Φ((a − µ)/σ) + U(Φ((b − µ)/σ) − Φ((a − µ)/σ))],
where Φ^{−1} is the inverse of the cdf.
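The general inversion scheme can be sketched in Python (statistics.NormalDist supplies the standard normal cdf and inverse cdf; the function name rtruncnorm is ours):

```python
import random
from statistics import NormalDist

def rtruncnorm(mu, sigma, a, b, rng=random):
    """Inverse-CDF draw from N(mu, sigma^2) truncated to (a, b)."""
    std = NormalDist()                     # standard normal cdf and inverse cdf
    Fa = std.cdf((a - mu) / sigma)
    Fb = std.cdf((b - mu) / sigma)
    u = rng.random()
    return mu + sigma * std.inv_cdf(Fa + u * (Fb - Fa))

random.seed(9)
samples = [rtruncnorm(0.0, 1.0, -1.0, 1.0) for _ in range(20_000)]
```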
Double exponential
• A random variable X ∈ R has a double exponential (or Laplace) distribution with parameters µ and σ, denoted X ∼ DE(µ, σ), if the pdf is

p(x|µ, σ) = 1/(2σ) exp(−|x − µ|/σ).

The mean and variance are E(X) = µ and var(X) = 2σ².
• The composition method can be used to simulate a DE(µ, σ) random variable:
Step 1: Draw λ ∼ exp(2) and Z ∼ N(0, 1)
Step 2: Set X = µ + σ√λ Z.
Check exponential
• A random variable X ∈ R has a check (or asymmetric) exponential distribution with parameters τ, µ, and σ, denoted X ∼ CE(τ, µ, σ), if the pdf is

p(x|τ, µ, σ) = 1/(σµ_τ) exp(−ρ_τ(x − µ)/σ),

where ρ_τ(x) = |x| − (2τ − 1)x and µ_τ^{−1} = 2τ(1 − τ). The double exponential is a special case when τ = 1/2.
• The composition method simulates a CE(τ, µ, σ) random variable:
Step 1: Draw λ ∼ exp(µ_τ) and Z ∼ N(0, 1)
Step 2: Set X = µ + (2τ − 1)σλ + σ√λ Z.
T
• A random variable X ∈ R has a t-distribution with parameters ν, µ, and σ², denoted X ∼ t_ν(µ, σ²), if the pdf is

p(x|ν, µ, σ²) = Γ((ν + 1)/2)/(√(νπσ²) Γ(ν/2)) (1 + (x − µ)²/(νσ²))^{−(ν+1)/2}.

When µ = 0 and σ = 1, the distribution is denoted simply as t_ν. The mean and variance of the t-distribution are E(X) = µ and var(X) = σ²ν/(ν − 2) for ν > 2. The Cauchy distribution is the special case where ν = 1.
• The composition method simulates a t_ν(µ, σ²) random variable:
Step 1: Draw λ ∼ IG(ν/2, ν/2) and Z ∼ N(0, 1)
Step 2: Set X = µ + σ√λ Z.
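This scale-mixture construction can be sketched in Python (standard library only; an IG(a, b) draw is obtained as b divided by a G(a, 1) draw; the function name rt is ours):

```python
import random

def rt(nu, mu, sigma, rng=random):
    """Composition draw from t_nu(mu, sigma^2):
    lam ~ IG(nu/2, nu/2), then X = mu + sigma * sqrt(lam) * Z."""
    lam = (nu / 2.0) / rng.gammavariate(nu / 2.0, 1.0)   # IG(a, b) draw as b / G(a, 1)
    z = rng.gauss(0.0, 1.0)
    return mu + sigma * lam ** 0.5 * z

random.seed(10)
samples = [rt(10, 1.0, 2.0) for _ in range(50_000)]
```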
Z
• The class of Z-distributions for a, b > 0 has density

f_Z(z|a, b, µ, σ) = 1/(σB(a, b)) e^{a(z−µ)/σ}/(1 + e^{(z−µ)/σ})^{a+b} ≡ Z(z; a, b, σ, µ).

A typical parameterisation is a = δ + θ, b = δ − θ, or δ = (a + b)/2, θ = (a − b)/2. This is a variance-mean mixture of normals where

Z(z; a, b, σ, µ) = ∫_0^∞ 1/√(2πλσ²) exp{−(z − µ − (a − b)λσ/2)²/(2λσ²)} p_{a,b}(λ) dλ,

where p_{a,b}(λ) is a Polya distribution, an infinite mixture of exponentials that can easily be sampled as

λ =_D ∑_{k=0}^∞ 2ψ_k^{−1} Z_k, where ψ_k = (a + k)(b + k) and Z_k ∼ exp(1).
• Z-distribution simulation is then given by

X = µ + (a − b)λσ/2 + σ√λ Z, where Z ∼ N(0, 1).
Exponential power
• A random variable has an exponential power distribution, denoted X ∼ EP(µ, σ, γ), if the pdf is

p(x|µ, σ, γ) = 1/(2σΓ(1 + γ^{−1})) exp(−|(x − µ)/σ|^γ).

The mean and variance are E(X) = µ and var(X) = (Γ(3/γ)/Γ(1/γ))σ². The normal and double exponential distributions are special cases.
• Exponential power simulation relies on the composition method: if X ∼ EP(µ, σ, γ), then X = µ + σ√λ Z, where Z ∼ N(0, 1) and the scale λ is related to a positive stable random variable, with density proportional to λ^{−3/2} St⁺_{γ/2}(λ^{−1}).
Inverse Gaussian
• A random variable X ∈ R_+ has an inverse Gaussian distribution with parameters µ and α, denoted X ∼ IN(µ, α), if the pdf is

p(x|µ, α) = √(α/(2πx³)) exp(−α(x − µ)²/(2µ²x)).

The mean and variance of an inverse Gaussian random variable are E(X) = µ and var(X) = µ³/α, respectively.
• To simulate an inverse Gaussian IN(µ, α),
Step 0: Draw U ∼ U(0, 1) and V ∼ X²₁
Step 1: Set W = µ + µ²V/(2α) − (µ/(2α))√(4µαV + µ²V²)
Step 2: Set X = W if U ≤ µ/(µ + W), and X = µ²/W otherwise.
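This accept-swap scheme (the Michael, Schucany, and Haas construction) can be sketched in Python, stated directly in the (µ, α) parameterization used here (standard library only; the function name rinvgauss is ours):

```python
import random

def rinvgauss(mu, alpha, rng=random):
    """Draw X ~ IN(mu, alpha) (mean mu, variance mu^3/alpha): compute the smaller
    root W, accept it with probability mu/(mu + W), else return mu^2/W."""
    v = rng.gauss(0.0, 1.0) ** 2                          # V ~ chi-squared(1)
    w = mu + mu * mu * v / (2.0 * alpha) \
        - (mu / (2.0 * alpha)) * (4.0 * mu * alpha * v + mu * mu * v * v) ** 0.5
    return w if rng.random() <= mu / (mu + w) else mu * mu / w

random.seed(11)
samples = [rinvgauss(1.0, 2.0) for _ in range(50_000)]
```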
Generalized inverse Gaussian
• A random variable X ∈ R_+ has a generalized inverse Gaussian distribution with parameters a, b, and p, denoted X ∼ GIG(a, b, p), if the pdf is

p(x|a, b, p) = (a/b)^{p/2} x^{p−1}/(2K_p(√(ab))) exp(−[ax + b/x]/2),

where K_p is the modified Bessel function of the third kind. The mean and variance are known, but are complicated expressions involving the Bessel functions. The Gamma distribution is the special case with b = 0 (and β = a/2), and the inverse Gamma is the special case with a = 0 (and β = b/2).
• Simulating GIG random variables is typically done using resampling methods.
Multivariate normal
• A k × 1 random vector X ∈ R^k has a multivariate normal distribution with parameters µ and Σ, denoted X ∼ N_k(µ, Σ), if the pdf is

p(x|µ, Σ) = (2π)^{−k/2} |Σ|^{−1/2} exp(−(x − µ)′Σ^{−1}(x − µ)/2),

where |Σ| is the determinant of the positive definite symmetric matrix Σ. The mean and covariance matrix of a multivariate normal are E(X) = µ and cov(X) = Σ, respectively.
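Multivariate normal draws are typically generated as X = µ + LZ, where L is the Cholesky factor of Σ and Z is a vector of independent standard normals. A bivariate sketch in Python (standard library only; the function name rmvnorm2 is ours):

```python
import math
import random

def rmvnorm2(mu, sigma, rng=random):
    """Draw from a bivariate N(mu, Sigma): X = mu + L Z, where L is the
    Cholesky factor of Sigma (Sigma = L L') and Z has independent N(0,1) entries."""
    l11 = math.sqrt(sigma[0][0])
    l21 = sigma[1][0] / l11
    l22 = math.sqrt(sigma[1][1] - l21 * l21)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2)

random.seed(12)
samples = [rmvnorm2((1.0, -1.0), [[2.0, 0.6], [0.6, 1.0]]) for _ in range(50_000)]
```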
Multivariate T
• A k × 1 random vector X ∈ R^k has a multivariate t-distribution with parameters ν, µ, and Σ, denoted X ∼ t_ν(µ, Σ), if the pdf is given by

p(x|ν, µ, Σ) = Γ((ν + k)/2)/(Γ(ν/2)(νπ)^{k/2}|Σ|^{1/2}) [1 + (x − µ)′Σ^{−1}(x − µ)/ν]^{−(ν+k)/2}.

The mean and covariance matrix of a multivariate t random variable are E(X) = µ and cov(X) = (ν/(ν − 2))Σ for ν > 2, respectively.
• The following two steps provide a draw from a multivariate t-distribution:
Step 1: Simulate Y ∼ N_k(0, Σ) and Z ∼ X²_ν
Step 2: Set X = µ + Y(Z/ν)^{−1/2}.
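These two steps can be sketched in Python for the bivariate case (standard library only; the function name rmvt2 is ours; the chi-squared draw uses a G(ν/2, 1) variate scaled by 2):

```python
import math
import random

def rmvt2(nu, mu, sigma, rng=random):
    """Draw from a bivariate t_nu(mu, Sigma): X = mu + Y * (Z/nu)^(-1/2),
    with Y ~ N_2(0, Sigma) and Z ~ chi-squared(nu)."""
    l11 = math.sqrt(sigma[0][0])                 # Cholesky factor of Sigma
    l21 = sigma[1][0] / l11
    l22 = math.sqrt(sigma[1][1] - l21 * l21)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    y = (l11 * z1, l21 * z1 + l22 * z2)          # Y ~ N_2(0, Sigma)
    chi2 = 2.0 * rng.gammavariate(nu / 2.0, 1.0)  # Z ~ chi-squared(nu)
    scale = (chi2 / nu) ** -0.5
    return (mu[0] + y[0] * scale, mu[1] + y[1] * scale)

random.seed(14)
samples = [rmvt2(8, (0.0, 2.0), [[1.0, 0.2], [0.2, 1.0]]) for _ in range(50_000)]
```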
Wishart
• A random m × m matrix Σ has a Wishart distribution, Σ ∼ W_m(v, V), if the density function is given by

p(Σ|v, V) = |Σ|^{(v−m−1)/2}/(2^{vm/2}|V|^{v/2}Γ_m(v/2)) exp(−tr(V^{−1}Σ)/2),

for v > m, where

Γ_m(v/2) = π^{m(m−1)/4} ∏_{k=1}^m Γ((v − k + 1)/2)

is the multivariate Gamma function. If v < m, then Σ does not have a density, although its distribution is well defined. The Wishart distribution arises naturally in multivariate settings with normally distributed random variables as the distribution of quadratic forms of multivariate normal random variables.
• The Wishart distribution can be viewed as a multivariate generalization of the X²_ν distribution. From this, it is clear how to sample from a Wishart distribution:
Step 1: Draw X_j ∼ N(0, V) for j = 1, ..., v
Step 2: Set Σ = ∑_{j=1}^v X_j X_j′.
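The sum-of-outer-products construction can be sketched in Python for the 2 × 2 case (standard library only; the function name rwishart2 is ours):

```python
import math
import random

def rwishart2(v, V, rng=random):
    """Draw Sigma ~ W_2(v, V) as the sum of v outer products X_j X_j'
    with X_j ~ N(0, V) (bivariate case, V positive definite)."""
    l11 = math.sqrt(V[0][0])                       # Cholesky factor of V
    l21 = V[1][0] / l11
    l22 = math.sqrt(V[1][1] - l21 * l21)
    s = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(v):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x = (l11 * z1, l21 * z1 + l22 * z2)        # X_j ~ N(0, V)
        for i in range(2):
            for j in range(2):
                s[i][j] += x[i] * x[j]
    return s

random.seed(13)
draws = [rwishart2(5, [[1.0, 0.3], [0.3, 2.0]]) for _ in range(20_000)]
```

Each draw is symmetric, and the sample average approaches E[Σ] = vV.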
Inverted Wishart
• A random m × m matrix Σ has an inverted Wishart distribution, denoted Σ ∼ IW_m(v, V), if the density function is

p(Σ|v, V) = |V|^{v/2}|Σ|^{−(v+m+1)/2}/(2^{vm/2}Γ_m(v/2)) exp(−tr(VΣ^{−1})/2).

This also implies that Σ^{−1} has a Wishart distribution, Σ^{−1} ∼ W_m(v, V^{−1}). The Jacobian of the transformation is |∂Σ^{−1}/∂Σ| = |Σ|^{−(m+1)}.
• To generate Σ ∼ IW_m(v, V), follow the two-step procedure:
Step 1: Draw X_i ∼ N(0, V^{−1}) for i = 1, ..., v
Step 2: Set Σ = (∑_{i=1}^v X_i X_i′)^{−1}.
In cases where m is extremely large, there are more efficient algorithms for drawing inverted Wishart random variables that involve factoring V and sampling from univariate X² distributions.
2 Likelihoods, Priors, and Posteriors
This appendix provides combinations of likelihoods and priors for the following types of observed data: Bernoulli, Poisson, exponential, normal, normal regression, and multivariate normal. For each specification, proper conjugate and Jeffreys' priors are given. The overriding Bayesian paradigm takes the form of Bayes rule,

p(parameters|data) = p(data|parameters) p(parameters)/p(data),

where the types of data and parameters are problem specific.
Bernoulli observations
• If the data (y_t|θ) ∼ Ber(θ) with θ ∈ [0, 1], then the likelihood is

p(y|θ) = ∏_{t=1}^T p(y_t|θ) = ∏_{t=1}^T θ^{y_t}(1 − θ)^{1−y_t} = θ^{∑_{t=1}^T y_t}(1 − θ)^{T−∑_{t=1}^T y_t}.
• A conjugate prior distribution is the Beta family, θ ∼ B(a, A), where

p(θ) = Γ(a + A)/(Γ(a)Γ(A)) θ^{a−1}(1 − θ)^{A−1}.

By Bayes rule, the posterior distribution is also Beta:

p(θ|y) ∝ p(y|θ)p(θ) = θ^{a+∑_{t=1}^T y_t−1}(1 − θ)^{A+T−∑_{t=1}^T y_t−1} ∼ B(a_T, A_T),

where a_T = a + ∑_{t=1}^T y_t and A_T = A + T − ∑_{t=1}^T y_t.
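This conjugate update is pure arithmetic and can be sketched as a small Python helper (the function name beta_bernoulli_update is ours):

```python
def beta_bernoulli_update(a, A, y):
    """Posterior B(a_T, A_T) for Bernoulli data y under a B(a, A) prior:
    a_T = a + sum(y), A_T = A + T - sum(y)."""
    s = sum(y)
    return a + s, A + len(y) - s

# Example: a B(1, 1) (uniform) prior with data (1, 0, 1, 1)
posterior = beta_bernoulli_update(1, 1, [1, 0, 1, 1])
```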
• Fisher’s information for Bernoulli observations is
I (θ) = −Eθ
[∂2 ln p (yt|θ)
∂θ2
]=
1
θ (1− θ),
where Eθ denotes the expectation under a Ber(θ). Jeffreys’ prior is
p (θ) = I (θ)12 = θ−
12 (1− θ)−
12 ∼ B
(1
2,1
2
).
Multinomial observations
• If (y|θ) is Multinomial data from k categories, (y|θ) ∼ Mult(θ_1, ..., θ_k), then the likelihood for T trials is given by

p(y_1, ..., y_k|θ_1, ..., θ_k) = T!/(y_1! ··· y_k!) θ_1^{y_1} ··· θ_k^{y_k}, where ∑_{i=1}^k y_i = T.

• A conjugate prior is a Dirichlet distribution, θ ∼ Dir(α), with density

p(θ_1, ..., θ_k|α) = Γ(∑_i α_i)/∏_i Γ(α_i) θ_1^{α_1−1} ··· θ_k^{α_k−1}.

The posterior is then

p(θ_1, ..., θ_k|α, y_1, ..., y_k) ∝ p(y_1, ..., y_k|θ_1, ..., θ_k) p(θ_1, ..., θ_k|α)
∝ θ_1^{y_1} ··· θ_k^{y_k} θ_1^{α_1−1} ··· θ_k^{α_k−1}
= θ_1^{α_1+y_1−1} ··· θ_k^{α_k+y_k−1}
∼ Dir(α + y),

which is again a Dirichlet with parameter α + y.
Poisson observations
• If the data (y_t|λ) ∼ Poi(λ), then the likelihood is

p(y|λ) = ∏_{t=1}^T e^{−λ}λ^{y_t}/y_t! ∝ e^{−λT}λ^{∑_{t=1}^T y_t}.

• A conjugate prior for λ is a Gamma distribution, λ ∼ G(a, A), with density

p(λ) = A^a/Γ(a) λ^{a−1} exp(−λA).

The posterior distribution is Gamma:

p(λ|y) ∝ p(y|λ)p(λ) = e^{−λ(A+T)}λ^{a+∑_{t=1}^T y_t−1} ∼ G(a_T, A_T),

where a_T = a + ∑_{t=1}^T y_t and A_T = A + T.
• Fisher’s information for Poisson observations is
I (λ) = −Eλ
[∂2 ln p (yt|λ)
∂λ2
]=
1
λ.
Jeffreys’ prior is then p (λ) ≡ I (λ)12 = λ−
12 . This can be viewed as a special case of
the Gamma prior with a = 12
and A = 0.
Exponential observations
• If the data (y_t|µ) ∼ exp(µ), then the likelihood (with µ parameterized here as the rate, so the exponential has mean 1/µ) is

p(y|µ) = ∏_{t=1}^T µ exp(−µy_t) ∝ µ^T exp(−µ∑_{t=1}^T y_t).

• A conjugate prior for µ is a Gamma distribution, µ ∼ G(a, A). The posterior is

p(µ|y) ∝ p(y|µ)p(µ) ∝ µ^{a+T−1} e^{−µ(A+∑_{t=1}^T y_t)} ∼ G(a_T, A_T),

where a_T = a + T and A_T = A + ∑_{t=1}^T y_t.
• Fishers’ information for exponential observations is
I (µ) = −Eµ
[∂2 ln p (yt|µ)
∂µ2
]=
(1
µ
)2
.
Jeffreys prior for exponential observations is p (µ) ≡ µ−1. This is a special case of
the Gamma prior with a = 1 and A = 0.
Normal observations with known variance
• If the data (y_t|µ, σ²) ∼ N(µ, σ²), the likelihood is

p(y|µ, σ²) = (1/(2πσ²))^{T/2} exp(−∑_{t=1}^T (y_t − µ)²/(2σ²)).

We can factorize

∑_{t=1}^T (y_t − µ)² = ∑_{t=1}^T [(y_t − ȳ)² + (ȳ − µ)²].

Thus, the likelihood, as a function of µ, is proportional to

exp(−T(µ − ȳ)²/(2σ²)),

with the other terms in p(y|µ, σ²) being absorbed into the constant of integration.
• A conjugate prior distribution for µ that is independent of σ² is given by µ ∼ N(a, A). The posterior distribution is

p(µ|y, σ²) ∝ p(y|µ, σ²)p(µ) ∝ exp(−[ (µ − ȳ)²/(σ²/T) + (µ − a)²/A ]/2).

Completing the square yields

(µ − ȳ)²/(σ²/T) + (µ − a)²/A = (µ − a_T)²/A_T + (ȳ − a)²/(σ²/T + A),

with parameters

a_T/A_T = a/A + ȳ/(σ²/T) and 1/A_T = 1/(σ²/T) + 1/A.

The posterior distribution is

p(µ|y, σ²) ∝ exp(−[ (µ − a_T)²/A_T + (ȳ − a)²/(σ²/T + A) ]/2) ∼ N(a_T, A_T).
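The precision-weighted update above is simple arithmetic and can be sketched as a Python helper (the function name normal_mean_update is ours):

```python
def normal_mean_update(a, A, ybar, sigma2, T):
    """Posterior N(a_T, A_T) for mu under a N(a, A) prior, given T observations
    with sample mean ybar and known variance sigma2:
      1/A_T = T/sigma2 + 1/A,   a_T = A_T * (a/A + ybar * T/sigma2)."""
    A_T = 1.0 / (T / sigma2 + 1.0 / A)
    a_T = A_T * (a / A + ybar * T / sigma2)
    return a_T, A_T

# With an N(0, 1) prior and one observation y = 1 from N(mu, 1),
# the posterior mean splits the difference.
a_T, A_T = normal_mean_update(0.0, 1.0, 1.0, 1.0, 1)
```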
• A conjugate prior distribution for µ conditional on σ² is µ ∼ N(a, σ²A). The posterior distribution is p(µ|y, σ²) ∝ p(y|µ, σ²)p(µ), which gives

p(µ|y, σ²) ∝ exp{−[ (µ − ȳ)²/T^{−1} + (µ − a)²/A ]/(2σ²)}.

Completing the square for the quadratic term in the exponential,

(µ − ȳ)²/T^{−1} + (µ − a)²/A = (µ − a_T)²/A_T + (ȳ − a)²/(T^{−1} + A),

where

a_T/A_T = a/A + ȳ/T^{−1} and 1/A_T = 1/T^{−1} + 1/A.

The posterior distribution is

p(µ|y, σ²) ∝ exp(−(µ − a_T)²/(2σ²A_T)) ∼ N(a_T, A_Tσ²).

Notice the slight differences between this example and the previous one, in terms of the hyper-parameters and the form of the posterior distribution.
• Fisher’s information for normal observations with σ2 known is
I (µ) = −Eλ
[∂2 ln p (yt|µ, σ2)
∂µ2
]≡ 1.
Jeffreys prior for normal observations (with known variance) is a constant, p (µ) ≡ 1
which is improper. However, the posterior is proper and can be viewed as a limiting
of the normal conjugate prior with a = 0 and A→∞.
Normal Variance with known Mean
• Given µ, if (y_t|µ, σ²) ∼ N(µ, σ²), the likelihood for σ² is

(1/σ²)^{T/2} exp(−∑_{t=1}^T (y_t − µ)²/(2σ²)).

• A conjugate inverse Gamma prior, σ² ∼ IG(b/2, B/2), has pdf

p(σ²) = (B/2)^{b/2}/Γ(b/2) (σ²)^{−b/2−1} exp(−B/(2σ²)).
By Bayes rule,

p(σ²|µ, y) ∝ p(y|µ, σ²)p(σ²) ∝ (1/σ²)^{(b+T)/2 + 1} exp(−[ B + ∑_{t=1}^T (y_t − µ)² ]/(2σ²)) ∼ IG(b_T/2, B_T/2),

where b_T = b + T and B_T = B + ∑_{t=1}^T (y_t − µ)².

The parameterization of the inverse Gamma, σ² ∼ IG(b/2, B/2), is used as opposed to σ² ∼ IG(b, B) so that the updated hyperparameters do not carry any 1/2 terms. This is chosen for notational simplicity. It is also common in the literature to assume p(σ) ∼ IG(b/2, B/2), which only changes the first term in the expression for b_T.
• Fisher’s information for normal observations (with known mean) is
I(σ2)
= −Eσ2
[∂2 ln p (yt|µ, σ2)
∂ (σ2)2
]≡ 1
σ2.
Jeffreys’ prior is p (σ2) ∝ (σ2)−1
, which is improper distribution. However, the re-
sulting posterior is proper and can be viewed as a limiting of the inverse Gamma
conjugate prior with B = 0 and b = 0. A flat or constant prior for σ2 also leads to a
proper posterior. Assuming p (σ2) ≡ 1 yields a conditional posterior
p(σ2|µ, y
)∼ IG
(T
2,
∑Tt=1 (yt − µ)2
2
).
Unknown mean and variance: dependent priors
• If the data (y_t|µ, σ²) ∼ N(µ, σ²), and assuming that both µ and σ² are unknown, then the likelihood as a function of µ and σ² is

(1/σ²)^{T/2} exp(−∑_{t=1}^T (y_t − µ)²/(2σ²)).
• A conjugate prior for (µ, σ²) is p(µ, σ²) = p(µ|σ²)p(σ²), where

p(µ|σ²) ∼ N(a, Aσ²) and p(σ²) ∼ IG(b/2, B/2).

This distribution is often expressed as N(a, Aσ²)IG(b/2, B/2). This prior assumes that µ and σ² are dependent. Bayes rule and a few lines of algebra yield a posterior

p(µ, σ²|y) ∝ p(y|µ, σ²)p(µ|σ²)p(σ²)
∝ (1/σ²)^{(T+b)/2 + 1/2 + 1} exp(−[ (µ − ȳ)²/(1/T) + (µ − a)²/A + ∑_{t=1}^T (y_t − ȳ)² + B ]/(2σ²)).

Combining the quadratic terms that depend on µ by completing the square,

(µ − ȳ)²/(1/T) + (µ − a)²/A = (µ − a_T)²/A_T + (ȳ − a)²/(1/T + A),

where the hyper-parameters are

a_T/A_T = a/A + ȳ/T^{−1} and 1/A_T = 1/A + 1/T^{−1}.

Inserting this into the likelihood gives a posterior

p(µ, σ²|y) ∝ (1/σ²)^{(T+b)/2 + 1/2 + 1} exp(−[ (µ − a_T)²/A_T + B + S ]/(2σ²)),

where

S = (ȳ − a)²/(1/T + A) + ∑_{t=1}^T (y_t − ȳ)².

Given the conjugate prior structure, the posterior factors as p(µ, σ²|y) = p(µ|σ², y)p(σ²|y). A few lines of algebra show that

p(µ, σ²|y) ∝ (1/σ²)^{1/2} exp(−(µ − a_T)²/(2σ²A_T)) × (1/σ²)^{(T+b)/2 + 1} exp(−(B + S)/(2σ²)),

giving p(µ|σ², y) ∼ N(a_T, σ²A_T) and p(σ²|y) ∼ IG(b_T/2, B_T/2), with b_T = b + T and B_T = B + S.
Marginal parameter distributions. In this specification, the marginal parameter distributions, p(σ²|y) and p(µ|y), are both known analytically. p(σ²|y) is inverse Gamma. The marginal p(µ|y) is

p(µ|y) = ∫_0^∞ p(µ, σ²|y) dσ² = ∫_0^∞ p(µ|σ², y)p(σ²|y) dσ².

Both p(µ|σ², y) and p(σ²|y) are known, and the integral can be computed analytically. Ignoring integration constants,

p(µ|y) ∝ ∫_0^∞ (1/σ²)^{(b_T+1)/2 + 1} exp(−[ (µ − a_T)²/A_T + B_T ]/(2σ²)) dσ²
∝ [ 1 + (µ − a_T)²/(A_T B_T) ]^{−(b_T+1)/2},

using our integration results. This is the kernel of a t-distribution; thus the marginal posterior is p(µ|y) ∼ t_{b_T}(a_T, A_T B_T).
The marginal likelihood, p(y) = ∫ p(y|µ, σ²)p(µ, σ²) dµ dσ², can be computed from

p(y|µ, σ²) = K_y (1/σ²)^{T/2} exp(−[ T(µ − ȳ)² + S ]/(2σ²)),
p(µ|σ²) = K_µ (1/σ²)^{1/2} exp(−(µ − a)²/(2σ²A)),
p(σ²) = K_σ (1/σ²)^{b/2 + 1} exp(−B/(2σ²)),

where here S = ∑_{t=1}^T (y_t − ȳ)², and the constants are K_y = (2π)^{−T/2}, K_σ = (B/2)^{b/2}/Γ(b/2), and K_µ = (2πA)^{−1/2}. Substituting these expressions, the marginal likelihood is
p(y) = K_y K_µ K_σ ∫_0^∞ (1/σ²)^{(b+T+1)/2 + 1} exp(−(S + B)/(2σ²)) [ ∫_{−∞}^∞ exp(−[ (µ − ȳ)²/T^{−1} + (µ − a)²/A ]/(2σ²)) dµ ] dσ².
Completing the square inside the integrand gives

(µ − ȳ)²/T^{−1} + (µ − a)²/A = (µ − a_T^µ)²/A_T^µ + (ȳ − a)²/(T^{−1} + A),

with hyper-parameters

a_T^µ/A_T^µ = ȳ/T^{−1} + a/A and 1/A_T^µ = 1/T^{−1} + 1/A.
The integrals can then be expressed as

∫_0^∞ (1/σ²)^{(b+T+1)/2 + 1} exp(−[ S + B + (ȳ − a)²/(T^{−1} + A) ]/(2σ²)) [ ∫_{−∞}^∞ exp(−(µ − a_T^µ)²/(2σ²A_T^µ)) dµ ] dσ².

The inner integral is ∫_{−∞}^∞ exp(−(µ − a_T^µ)²/(2σ²A_T^µ)) dµ = √(2πA_T^µ)(σ²)^{1/2}. Using this, σ² can be integrated out, yielding the expression for the marginal likelihood:
p(y) = √(2πA_T^µ) K_y K_µ K_σ ∫_0^∞ (1/σ²)^{(b+T)/2 + 1} exp(−[ S + B + (ȳ − a)²/(T^{−1} + A) ]/(2σ²)) dσ²

= (1/(2π))^{T/2} (A_T^µ/A)^{1/2} (B/2)^{b/2} [Γ((b + T)/2)/Γ(b/2)] [ S + B + (ȳ − a)²/(T^{−1} + A) ]^{−(b+T)/2}.
Finally, the predictive distribution can also be computed analytically, since

p(y_{T+1}|y^T) = ∫ p(y_{T+1}|µ, σ²) p(µ, σ²|y^T) dµ dσ².

To simplify, first compute the integral against µ by substituting from the posterior. Since p(µ|σ², y) ∼ N(a_T, σ²A_T), we have that µ = a_T + σ√(A_T) Z, where Z is an independent standard normal. Substituting into y_{T+1} = µ + σε_{T+1} gives

y_{T+1} = a_T + σ√(A_T) Z + σε_{T+1} = a_T + ση_{T+1},

where η_{T+1} ∼ N(0, A_T + 1). The predictive is therefore

p(y_{T+1}|y^T) ∝ ∫_0^∞ (1/σ²)^{1/2} exp(−(y_{T+1} − a_T)²/(2σ²(A_T + 1))) (1/σ²)^{b_T/2 + 1} exp(−B_T/(2σ²)) dσ²

∝ ∫_0^∞ (1/σ²)^{(b_T+1)/2 + 1} exp(−[ B_T + (y_{T+1} − a_T)²/(A_T + 1) ]/(2σ²)) dσ²

∝ [ 1 + (y_{T+1} − a_T)²/(B_T(A_T + 1)) ]^{−(b_T+1)/2} ∼ t_{b_T}(a_T, B_T(A_T + 1)).
• Jeffreys’ prior for normal observations with unknown mean and variance is a bivariate
distribution. Fishers’ information is
I(µ, σ2
)= −Eµ,σ2
∂2 ln p(yt|µ,σ2)∂µ2
∂2 ln p(yt|µ,σ2)∂µ∂σ2
∂2 ln p(yt|µ,σ2)∂µ∂σ2
∂2 ln p(yt|µ,σ2)∂(σ2)2
=
[1σ2 0
0 2σ2
],
generating a prior distribution p (µ, σ2) ≡ det (I (µ, σ2))12 = σ−2. This prior is im-
proper, but leads to a proper posterior distribution that can be viewed as a limiting
case of the usual conjugate posterior
p(µ, σ2|y
)∼ N
(aT , ATσ
2)IG(bT2,BT
2
),
where aT = 0, AT →∞, b = 0 and B = 0.
Regression
• Consider a regression model specification, y_t|x_t, β, σ² ∼ N(x_tβ, σ²), where x_t is a vector of observed covariates, β is a k × 1 vector of regression coefficients, and ε_t ∼ N(0, σ²). To express the likelihood, it is useful to stack the data into matrices, y = Xβ + ε, where y is a T × 1 vector of dependent variables, X is a T × k matrix of regressors, and ε is a T × 1 vector of errors.

The usual OLS regression estimator is β̂ = (X′X)^{−1}X′y and the residual sum of squares is S = (y − Xβ̂)′(y − Xβ̂). Completing the square,

(y − Xβ)′(y − Xβ) = (β − β̂)′(X′X)(β − β̂) + (y − Xβ̂)′(y − Xβ̂)
= (β − β̂)′(X′X)(β − β̂) + S,

which implies that

p(y|β, σ²) ∝ (1/σ²)^{T/2} exp(−(y − Xβ)′(y − Xβ)/(2σ²))
= (1/σ²)^{T/2} exp(−(β − β̂)′(X′X)(β − β̂)/(2σ²) − S/(2σ²)).

Hence (β̂, S) are sufficient statistics for (β, σ²).
• A proper conjugate prior is p(β|σ²) ∼ N_k(a, σ²A) and p(σ²) ∼ IG(b/2, B/2), with

p(β|σ²) = (2π)^{−k/2}(σ²)^{−k/2}|A|^{−1/2} exp(−(β − a)′A^{−1}(β − a)/(2σ²)),

since |σ²A| = (σ²)^k|A|. The posterior distribution is

p(β, σ²|y) ∝ (1/σ²)^{(b+T)/2 + 1}(1/σ²)^{k/2} exp(−Q(β, σ²)/(2σ²)),

where the quadratic form is defined by

Q(β, σ²) = (y − Xβ)′(y − Xβ) + (β − a)′A^{−1}(β − a) + B.
We can complete the square by combining (y − Xβ)′(y − Xβ) + (β − a)′A^{−1}(β − a). Stack the vectors as

ỹ = (y; A^{−1/2}a) and W = (X; A^{−1/2}),

where ỹ is a (T + k) × 1 vector and W is a (T + k) × k matrix. Then,

(y − Xβ)′(y − Xβ) + (β − a)′A^{−1}(β − a) = (ỹ − Wβ)′(ỹ − Wβ).

By analogy to the previous case, define β̂ = (W′W)^{−1}W′ỹ, where

W′W = X′X + A^{−1} and W′ỹ = X′y + A^{−1}a.

Adding and subtracting Wβ̂ to ỹ − Wβ and simplifying gives

(ỹ − Wβ)′(ỹ − Wβ) = [(ỹ − Wβ̂) + (Wβ̂ − Wβ)]′[(ỹ − Wβ̂) + (Wβ̂ − Wβ)]
= (β − β̂)′W′W(β − β̂) + (ỹ − Wβ̂)′(ỹ − Wβ̂)
= (β − a_T)′A_T^{−1}(β − a_T) + (ỹ − Wa_T)′(ỹ − Wa_T),

where A_T^{−1} = W′W = X′X + A^{−1} and a_T = A_T[X′y + A^{−1}a]. Straightforward algebra shows that (ỹ − Wβ̂)′(ỹ − Wβ̂) = y′y + a′A^{−1}a − a_T′A_T^{−1}a_T. Putting the pieces together, the quadratic form is

Q(β, σ²) = (β − a_T)′A_T^{−1}(β − a_T) + y′y + a′A^{−1}a − a_T′A_T^{−1}a_T + B
= (β − a_T)′A_T^{−1}(β − a_T) + B_T,

where B_T = y′y + a′A^{−1}a − a_T′A_T^{−1}a_T + B. This leads to the same form of posterior as before.
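For scalar β (k = 1), the updates A_T^{−1} = X′X + A^{−1}, a_T = A_T(X′y + A^{−1}a), b_T = b + T, and B_T = y′y + a′A^{−1}a − a_T′A_T^{−1}a_T + B reduce to plain arithmetic and can be sketched as a Python helper (the function name regression_update_1d is ours):

```python
def regression_update_1d(x, y, a, A, b, B):
    """Conjugate update for y_t = x_t * beta + eps_t with scalar beta,
    prior beta|sigma2 ~ N(a, sigma2 * A) and sigma2 ~ IG(b/2, B/2).
    Returns (a_T, A_T, b_T, B_T)."""
    xtx = sum(xi * xi for xi in x)
    xty = sum(xi * yi for xi, yi in zip(x, y))
    yty = sum(yi * yi for yi in y)
    prec_T = xtx + 1.0 / A              # A_T^{-1} = X'X + A^{-1}
    A_T = 1.0 / prec_T
    a_T = A_T * (xty + a / A)           # a_T = A_T (X'y + A^{-1} a)
    b_T = b + len(y)
    B_T = yty + a * a / A - a_T * a_T * prec_T + B
    return a_T, A_T, b_T, B_T
```

With a very diffuse prior and noiseless data y_t = 2x_t, the posterior mean recovers β = 2 and the residual term B_T is essentially zero.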
3 Direct sampling
3.1 Generating i.i.d. random variables from distributions
The Gibbs sampler and MH algorithms require simulating i.i.d. random variables from "recognizable" distributions. Appendix 1 provides a list of common "recognizable" distributions, along with methods for generating random variables from these distributions. This section briefly reviews the standard methods and approaches for simulating random variables from recognizable distributions.

Most of these algorithms first generate random variables from a relatively simple "building block" distribution, such as a uniform or normal distribution, and then transform these draws to obtain a sample from another distribution. This section describes a number of these approaches that are commonly encountered in practice. Most "random-number" generators actually use deterministic methods, along with transformations. In this regard, it is important to remember Von Neumann's famous quotation: "Anyone who attempts to generate random numbers by deterministic means is, of course, living in a state of sin."
Inverse CDF method The inverse distribution method uses samples of uniform random variables to generate draws from random variables with a continuous distribution function, F. Since F(X) is uniformly distributed on [0, 1], draw a uniform random variable and invert the CDF to get a draw from F. Thus, to sample from F,
Step 1: Draw U ∼ U[0, 1]
Step 2: Set X = F^{−1}(U),
where F^{−1}(U) = inf{x : F(x) ≥ U}.
This inversion method provides i.i.d. draws from F provided that F^{−1}(U) can be exactly calculated. For example, the CDF of an exponential random variable with parameter µ is F(x) = 1 − exp(−µx), which can easily be inverted. When F^{−1} cannot be analytically calculated, approximate inversions can be used. For example, suppose that the density is a known analytical function. Then F(x) can be computed to an arbitrary degree of accuracy on a grid, and inversions can be approximately calculated, generating an approximate draw from F. As with all approximations, there is a natural trade-off between computational speed and accuracy. One example where efficient approximations are possible is inversions involving normal distributions, which is useful for generating truncated normal random variables. Outside of these limited cases, the inverse transform method does not provide a computationally attractive approach for drawing random variables from a given distribution function. In particular, it does not work well in multiple dimensions.
Functional Transformations The second main method uses functional transformations
to express the distribution of a random variable that is a known function of another
random variable. Suppose that X ∼ F, admitting a density f, and that y = h(x) is
an increasing continuous function, so that x = h^{-1}(y) defines the inverse of the
function h. The distribution function of Y is given by

F_Y(y) = Prob(Y ≤ y) = ∫_{-∞}^{h^{-1}(y)} f(x) dx = F(h^{-1}(y)).

Differentiating with respect to y gives the density via Leibniz's rule:

f_Y(y) = f(h^{-1}(y)) |dh^{-1}(y)/dy|,

where the subscript makes explicit that the density is over the random variable Y. This result is used
widely. For example, if X ∼ N(0, 1) and Y = µ + σX, then x = h^{-1}(y) = (y − µ)/σ, the
distribution function is F((y − µ)/σ), and the density is

f_Y(y) = (1/(√(2π)σ)) exp(−(y − µ)^2/(2σ^2)).

Transformations are widely used to simulate both univariate and multivariate random
variables. As examples, if Y ∼ χ^2_ν and ν is an integer, then Y = Σ_{i=1}^ν X_i^2, where each X_i is an
independent standard normal. Exponential random variables can be used to simulate χ^2,
Gamma, beta, and Poisson random variables. The famous Box-Muller algorithm simulates
normals from uniform and exponential random variables. In the multivariate setting,
Wishart (and inverse Wishart) random variables can be generated via sums of squared vectors of
standard normal random variables.
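A minimal sketch of the Box-Muller algorithm, which maps pairs of uniforms to pairs of independent standard normals via an exponential radius (function name and sample size are ours):

```python
import math
import random

def box_muller(n, seed=0):
    """n standard normal draws from uniform pairs via the Box-Muller transform."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        u1, u2 = rng.random(), rng.random()
        r = math.sqrt(-2.0 * math.log(1.0 - u1))  # radius: sqrt of an exponential draw
        out.append(r * math.cos(2.0 * math.pi * u2))
        out.append(r * math.sin(2.0 * math.pi * u2))
    return out[:n]

z = box_muller(100_000)
mean = sum(z) / len(z)
var = sum(v * v for v in z) / len(z) - mean ** 2
print(mean, var)  # near 0 and 1
```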
Mixture distributions In the multidimensional case, a special case of the transformation
generates continuous mixture distributions. The density of a continuous mixture
distribution is given by

p(x) = ∫ p(x|λ) p(λ) dλ,

where p(x|λ) is viewed as a density conditional on the parameter λ. One example of this is
the class of scale mixtures of normal distributions, where

p(x|λ) = (1/√(2πλ)) exp(−x^2/(2λ))

and λ is the conditional variance of X. It is often simpler just to write

X = √λ ε,

where ε ∼ N(0, 1). The distribution of √λ determines the marginal distribution of X.
Here are a number of examples of scale mixture distributions.
• T-distribution. The t-distribution arises in many Bayesian inference problems involving
inverse Gamma priors and conditionally normally distributed likelihoods. If
p(y_t|λ) ∼ N(0, λ) and p(λ) ∼ IG(b/2, B/2), then the marginal distribution of y_t is
t_b(0, B/b). The proof is direct by analytically computing the marginal distribution,

p(y_t) = ∫_0^∞ p(y_t|λ) p(λ) dλ.

Using our integration results:

p(y) = ∫_0^∞ [(1/√(2π)) (1/λ)^{1/2} exp(−y^2/(2λ))] × [((B/2)^{b/2}/Γ(b/2)) (1/λ)^{b/2+1} exp(−B/(2λ))] dλ

(the first factor is the likelihood, the second the prior)

∝ ∫_0^∞ (1/λ)^{(b+1)/2+1} exp(−(y^2 + B)/(2λ)) dλ,

which is in the class of scale mixture integrals. We obtain

p(y) ∝ [1 + y^2/B]^{−(b+1)/2}.

Thus y_t ∼ t_b(0, B/b). More generally, given µ, if y_t|µ, λ, σ^2 ∼ N(µ, λσ^2) and λ ∼ IG(b/2, B/2), then
p(y_t|µ, σ^2) ∼ t_b(µ, σ^2 B/b).
• Double-exponential distribution. The double exponential arises as a scale mixture
distribution: if p(y_t|λ) ∼ N(0, λ) and λ is exponential with mean 2, i.e. p(λ) = (1/2)e^{−λ/2},
then the marginal distribution of y_t is DE(0, 1). The proof is again by direct integration
using the results in our integration appendix:

∫_0^∞ (1/2)(1/√(2πλ)) exp{−(1/2)(y^2/λ + λ)} dλ = (1/2) exp(−|y|).

More generally, if µ and σ^2 are known, then y_t|µ, λ, σ^2 ∼ N(µ, λσ^2) and λ ∼ exp(2)
imply p(y_t|µ, σ^2) ∼ DE(µ, σ^2), as follows by substituting b = (y − µ)/σ in the integral
identity and multiplying both sides by σ^{−1}.
• Asymmetric Laplacean. The asymmetric Laplacean distribution is a scale mixture
of normal distributions: if

p(y_t|λ) ∼ N((2τ − 1)λ, λ) and λ ∼ exp(µ_τ^{−1}),

then y_t ∼ CE(τ, 0, 1). The proof uses our integration appendix and

∫_0^∞ (1/√(2πλ)) exp{−(1/(2λ))(y + (2τ − 1)λ)^2 − 2τ(1 − τ)λ} dλ = (1/2) exp(−|y| − (2τ − 1)y).

In general, if µ and σ^2 are known, then

p(y_t|µ, λ, σ^2) ∼ N(µ + (1 − 2τ)λ, λσ^2) and λ ∼ exp(µ_τ^{−1}),

which leads to p(y_t|µ, σ^2) ∼ CE(τ, µ, σ^2).
• Exponential power family. This family of distributions (Box and Tiao, 1973)
is given by

p(y|σ, γ) = (2σ)^{−1} c(γ) exp(−|y/σ|^γ),

where c(γ) = Γ(1 + γ^{−1})^{−1}. Following West (1987), we have the following scale
mixture of normals representation:

p(y|σ = 1, γ) = c(γ) exp(−|y|^γ) = ∫_0^∞ (1/√(2πλ)) exp(−y^2/(2λ)) p(λ|γ) dλ,

where p(λ|γ) ∝ λ^{−3/2} St^+_{γ/2}(λ^{−1}) and St^+_a is the (analytically intractable) density of a
positive stable distribution.
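The representation X = √λ ε gives a direct simulation recipe for several of the examples above. A minimal numpy sketch (the parameter values b = 8, B = 4 and the sample size are our choices; the t case uses λ ∼ IG(b/2, B/2), and the double exponential uses an exponential mixing density with mean 2):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# t-distribution: y|lam ~ N(0, lam), lam ~ IG(b/2, B/2); here b = 8, B = 4
b, B = 8.0, 4.0
g = rng.gamma(shape=b / 2.0, scale=1.0, size=n)
lam = (B / 2.0) / g                          # inverse-gamma(b/2, B/2) draws
y_t = np.sqrt(lam) * rng.standard_normal(n)
print(y_t.var())                             # theoretical variance is B/(b - 2) = 2/3

# double exponential: y|lam ~ N(0, lam), lam exponential with mean 2
lam2 = rng.exponential(scale=2.0, size=n)
y_de = np.sqrt(lam2) * rng.standard_normal(n)
print(y_de.var())                            # Laplace(0, 1) variance is 2
```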
Factorization Method Another method that is useful in some multivariate settings
is known as factorization. The rules of probability imply that a joint density, p (x1, ..., xn),
can always be factored as
p (x1, ..., xn) = p (xn|x1, ..., xn−1) p (x1, ..., xn−1)
= p (x1) p (x2|x1) · · · p (xn|x1, ..., xn−1) .
In this case, simulating X1 ∼ p (x1), X2 ∼ p (x2|X1), ... Xn ∼ p (xn|X1, ..., Xn−1) generates
a draw from the joint distribution. This procedure is common in Bayesian statistics,
where the distribution of X_1 is a marginal distribution and the other distributions are
conditional distributions. It is used repeatedly to express a joint posterior in terms of
lower-dimensional conditional posteriors.
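As a sketch of the factorization method, the following draws from a bivariate normal by simulating the marginal of X_1 and then the conditional of X_2 given X_1 (the correlation-ρ example is our own):

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho = 100_000, 0.7

# p(x1, x2) = p(x1) p(x2 | x1) for a standard bivariate normal with correlation rho
x1 = rng.standard_normal(n)                                       # X1 ~ p(x1)
x2 = rho * x1 + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(n)  # X2 ~ p(x2 | X1)

print(np.corrcoef(x1, x2)[0, 1])  # close to rho
```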
Rejection sampling The final general method discussed here is the accept-reject
method, developed by von Neumann. Suppose that the goal is to generate a sample from
f(x), where it is assumed that f is dominated by a density g, that is, f(x) ≤ c g(x) for
some constant c. The accept-reject algorithm is a two-step procedure:

Step 1: Draw U ∼ U[0, 1] and X ∼ g
Step 2: Accept Y = X if U ≤ f(X)/(c g(X)); otherwise return to Step 1.

Rejection sampling simulates repeatedly until a draw satisfying U ≤ f(X)/(c g(X)) is found. By
direct calculation, it is clear that Y has density f:
Prob(Y ≤ y) = Prob(X ≤ y | U ≤ f(X)/(c g(X)))

= Prob(X ≤ y, U ≤ f(X)/(c g(X))) / Prob(U ≤ f(X)/(c g(X)))

= [∫_{-∞}^y (∫_0^{f(x)/(c g(x))} du) g(x) dx] / [∫_{-∞}^{∞} (∫_0^{f(x)/(c g(x))} du) g(x) dx]

= [(1/c) ∫_{-∞}^y f(x) dx] / [(1/c) ∫_{-∞}^{∞} f(x) dx]

= ∫_{-∞}^y f(x) dx.
Rejection sampling requires (a) a bounding or dominating density g; (b) an ability to
evaluate the ratio f/g; (c) an ability to simulate i.i.d. draws from g; and (d) the bounding
constant c. Rejection sampling does not require that the normalization constant ∫ f(x) dx
be known, since the algorithm only requires the ratio f/(cg). For continuous densities
on a bounded support, it is easy to satisfy (a) and (c) (a uniform density works), but for
continuous densities on unbounded support it can be more difficult, since we need to find a
density with heavier tails and higher peaks. Setting c = sup_x f(x)/g(x) maximizes the acceptance
probability. In practice, finding the constant is difficult because f generally depends on a
multi-dimensional parameter vector, f(x|θ_f), and thus the bounding is over x and θ_f.

Rejection sampling is often used to generate random variables from various recognizable
distributions, such as the Gamma or beta distributions. In these cases, the structure of
the densities is well known and the bounding density can be tailored to generate fast
and efficient rejection sampling algorithms. In many of these cases, the densities are
log-concave (e.g., Normal, double exponential, Gamma, and Beta). A density f is log-concave
if ln f(x) is concave. For differentiable densities, this is equivalent to assuming that
d ln f(x)/dx = f′(x)/f(x) is non-increasing in x, that is, d^2 ln f(x)/dx^2 ≤ 0. Under these
conditions, it is possible to develop “black-box” generation methods that perform well in
many settings. Another modification, which “adapts” the dominating densities, works well
for log-concave densities.
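As a sketch of the two steps above, the following targets the Beta(2, 2) density f(x) = 6x(1 − x) with a uniform proposal g, so that c = sup_x f(x)/g(x) = f(1/2) = 1.5 (the target and constants are our choices, not an example from the text):

```python
import random

def rejection_beta22(n, seed=0):
    """Accept-reject draws from f(x) = 6 x (1 - x) on [0, 1] with a U[0, 1] proposal.

    c = sup_x f(x)/g(x) = 1.5, so the expected acceptance rate is 1/c = 2/3.
    """
    rng = random.Random(seed)
    c = 1.5
    draws = []
    while len(draws) < n:
        x = rng.random()              # X ~ g (uniform proposal)
        u = rng.random()              # U ~ U[0, 1]
        if u <= 6 * x * (1 - x) / c:  # accept if U <= f(X) / (c g(X))
            draws.append(x)
    return draws

draws = rejection_beta22(100_000)
print(sum(draws) / len(draws))  # Beta(2, 2) has mean 1/2
```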
The basis of rejection sampling is the simple fact that, since

f(x) = ∫_0^{f(x)} du = ∫_0^{f(x)} U(du),

where U is the uniform distribution function, the density f is a marginal distribution from
the joint distribution

(X, U) ∼ U({(x, u) : 0 ≤ u ≤ f(x)}).

More generally, if X ∈ R^d is a random vector with density f and U is an independent
random variable distributed U[0, 1], then (X, cUf(X)) is uniformly distributed on the set
A = {(x, u) : x ∈ R^d, 0 ≤ u ≤ cf(x)}. Conversely, if (X, U) is uniformly distributed on A,
then the density of X is f(x).
Multinomial Resampling Sampling N values from a discrete distribution (x_i, p_i)
can be done by simulating standard uniforms U_i and then using binary search to find the
value of j, and hence x_j, corresponding to

q_{j−1} < U_i ≤ q_j,

where q_j = Σ_{l=1}^j p_l and q_0 = 0. This algorithm is commonly used due to its simplicity,
but it is inefficient, as it requires O(N ln N) operations; the ln N factor comes from the
binary search.

A more efficient method (particularly suited to particle filtering) is to simulate N + 1
exponentially distributed variables z_0, . . . , z_N via z_i = − ln U_i, calculate the running
totals Z_j = Σ_{l=0}^j z_l, and then merge the sorted uniforms Z_i/Z_N against the q_j in a
single pass: output x_j for the i-th draw when q_{j−1} Z_N < Z_i ≤ q_j Z_N. This is an O(N) algorithm.
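A sketch of the O(N) pass (the function name and the small example distribution are our own):

```python
import itertools
import math
import random

def multinomial_resample(values, probs, n, seed=0):
    """O(n) multinomial resampling via cumulative sums of exponentials.

    Z_0, ..., Z_n are running totals of exponential draws, so Z_i / Z_n are
    n ordered U[0, 1] variates; they are merged against the cumulative
    probabilities q_j in a single pass.
    """
    rng = random.Random(seed)
    z = [-math.log(1.0 - rng.random()) for _ in range(n + 1)]  # exponential draws
    totals = list(itertools.accumulate(z))                     # Z_0, ..., Z_n
    z_n = totals[-1]
    out, j, q = [], 0, probs[0]
    for i in range(n):
        u = totals[i] / z_n                        # i-th ordered uniform
        while u > q and j < len(probs) - 1:        # find bin with q_{j-1} < u <= q_j
            j += 1
            q += probs[j]
        out.append(values[j])
    return out

sample = multinomial_resample([0, 1, 2], [0.2, 0.5, 0.3], 100_000)
print(sample.count(1) / len(sample))  # close to 0.5
```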
Slice Sampling Slice sampling can be used to sample the conditional posterior when
the scale mixture is a stable distribution. Here we have

p(λ_t|σ^2, y_t) ∝ p(y_t|λ_t, σ^2) p(λ_t),

where p(y_t|λ_t, σ^2) = N(0, σ^2 λ_t). To do this, introduce a uniform random variable u_t and
consider the joint distribution

p(λ_t, u_t|σ^2) ∝ p(λ_t) I(0 ≤ u_t ≤ φ(y_t; 0, λ_t σ^2)).

Then the algorithm alternates between the conditionals

p(λ_t|u_t, σ^2) ∝ p(λ_t) on {λ_t : φ(y_t; 0, λ_t σ^2) > u_t}
p(u_t|λ_t, σ^2) ∼ U[0, φ(y_t; 0, λ_t σ^2)].

The accept/reject sampling alternative is as follows. Since we have the upper bound
φ(y_t; 0, λ_t σ^2) ≤ (2π y_t^2)^{−1/2} exp(−1/2), sample λ_t via

λ_t ∼ p(λ_t)
u ∼ U[0, (2π y_t^2)^{−1/2} e^{−1/2}],

and if u > φ(y_t; 0, σ^2 λ_t), repeat.
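The two-step slice recursion is easiest to see on a target where the slice is available in closed form. The sketch below uses f(x) = e^{−x} rather than the stable-mixture target above (the target choice is ours): the slice {x : f(x) > u} is simply [0, −ln u).

```python
import math
import random

def slice_sample_exponential(n_iter, seed=0):
    """Slice sampler for f(x) = exp(-x) on [0, inf).

    Alternates u | x ~ U(0, f(x)] with x | u uniform on the slice
    {x : f(x) > u} = [0, -ln u), both available in closed form here.
    """
    rng = random.Random(seed)
    x, draws = 1.0, []
    for _ in range(n_iter):
        u = math.exp(-x) * (1.0 - rng.random())  # vertical step, u in (0, f(x)]
        x = -math.log(u) * rng.random()          # horizontal step on the slice
        draws.append(x)
    return draws

draws = slice_sample_exponential(200_000)
post = draws[1_000:]                             # drop a short burn-in
print(sum(post) / len(post))  # stationary distribution is exp(1), with mean 1
```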
3.2 Integration results
There are a number of useful integration results that are repeatedly used for normalising
probability distributions. The first identity is
∫_{-∞}^{∞} exp(−(x − µ)^2/(2σ^2)) dx = √(2πσ^2),

for all µ, which defines the univariate normal distribution. The second identity is for the
Gamma function, defined as

Γ(α) = ∫_0^∞ y^{α−1} e^{−y} dy,

which is important for a number of distributions, most notably the Gamma and inverse
Gamma. This implies that y^{α−1} e^{−y}/Γ(α) is a proper density. Changing variables to x = βy
gives the standard form of the Gamma distribution.
Integration by parts implies Γ(α + 1) = αΓ(α), so that for integer α, Γ(α) = (α − 1)!,
with Γ(1) = 1 and Γ(1/2) = √π. For other fractional values, the duplication formula

Γ(α) Γ(α + 1/2) = 2^{1−2α} √π Γ(2α)

is useful.
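These Gamma function identities are easy to sanity-check numerically with Python's math.gamma (the test values of α are arbitrary):

```python
import math

# duplication formula: Gamma(a) Gamma(a + 1/2) = 2^(1 - 2a) sqrt(pi) Gamma(2a)
for a in (0.5, 1.0, 1.3, 4.7):
    lhs = math.gamma(a) * math.gamma(a + 0.5)
    rhs = 2.0 ** (1 - 2 * a) * math.sqrt(math.pi) * math.gamma(2 * a)
    print(a, lhs, rhs)

# special values: Gamma(1/2) = sqrt(pi) and, for integer n, Gamma(n) = (n - 1)!
print(math.gamma(0.5), math.sqrt(math.pi))
print(math.gamma(5.0))  # (5 - 1)! = 24
```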
A number of integrals are useful for analytic characterization of distributions, either priors
or posteriors, that are scale mixtures of normals. As discussed in the previous appendix,
scale mixtures involve integrals that are a product of two distributions. The following
integral identities are useful for analyzing these specifications. For any given p, a, b > 0,

∫_0^∞ x^{p−1} exp(−a x^b) dx = (1/b) a^{−p/b} Γ(p/b)

∫_0^∞ (1/x)^{p+1} exp(−a x^{−b}) dx = (1/b) a^{−p/b} Γ(p/b).

These integrals are useful when combining Gamma or inverse Gamma prior
distributions with normal likelihood functions. Second, for any a and b,

∫_0^∞ (a/√(2πx)) exp{−(1/2)(a^2 x + b^2/x)} dx = exp(−|ab|),

which is useful for the double exponential distribution. A related integral, which is useful
for the check exponential distribution, is

∫_0^∞ (a/√(2πx)) exp{−(1/2)(b^2/x + 2(2τ − 1)b + a^2 x)} dx = exp(−|ab| − (2τ − 1)b).
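Both families of identities can be checked by simple quadrature (the grid, the parameter values, and the trapezoidal helper are our choices):

```python
import math
import numpy as np

def trapz(y, x):
    """Simple trapezoidal rule (avoids version differences around np.trapz)."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

x = np.linspace(1e-8, 60.0, 1_000_000)

# first identity: int_0^inf x^(p-1) exp(-a x^b) dx = (1/b) a^(-p/b) Gamma(p/b)
p, a, b = 2.5, 1.3, 2.0
lhs1 = trapz(x ** (p - 1) * np.exp(-a * x ** b), x)
rhs1 = (1 / b) * a ** (-p / b) * math.gamma(p / b)

# second identity: int_0^inf (a/sqrt(2 pi x)) exp(-(a^2 x + b^2/x)/2) dx = exp(-|ab|)
a2, b2 = 1.5, 0.7
lhs2 = trapz(a2 / np.sqrt(2 * np.pi * x) * np.exp(-(a2 ** 2 * x + b2 ** 2 / x) / 2), x)
rhs2 = math.exp(-abs(a2 * b2))

print(lhs1, rhs1)
print(lhs2, rhs2)
```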
3.3 Useful algebraic formulae
One of the most useful algebraic tricks for calculating posterior distributions is completing
the square (have fun showing the algebra for the second part!).

In the scalar case, we have the identity

(x − µ_1)^2/Σ_1 + (x − µ_2)^2/Σ_2 = (x − µ_3)^2/Σ_3 + (µ_1 − µ_2)^2/(Σ_1 + Σ_2),

where µ_3 = Σ_3(Σ_1^{−1}µ_1 + Σ_2^{−1}µ_2) and Σ_3 = (Σ_1^{−1} + Σ_2^{−1})^{−1}.
A shrinkage interpretation is the following. When analyzing shrinkage it is common
to define the weight w = (Σ_1 + Σ_2)^{−1} Σ_1, so that µ_3 = (1 − w)µ_1 + wµ_2, Σ_3 = Σ_2 w, and
Σ_1 + Σ_2 = Σ_1 w^{−1}. We can then rewrite completing the square as

(x − µ_1)^2/Σ_1 + (x − µ_2)^2/Σ_2 = (µ_1 − µ_2)^2/(Σ_1 w^{−1}) + (x − (1 − w)µ_1 − wµ_2)^2/(Σ_2 w).
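A quick numerical check of the scalar identity and its shrinkage form, using w = (Σ_1 + Σ_2)^{−1}Σ_1 so that Σ_1 w^{−1} = Σ_1 + Σ_2, Σ_2 w = Σ_3, and µ_3 = (1 − w)µ_1 + wµ_2 (the test values are arbitrary):

```python
# numerical check of both scalar completing-the-square identities
mu1, mu2, S1, S2, x = 0.3, 2.0, 1.5, 0.8, 0.7   # arbitrary test values

S3 = 1.0 / (1.0 / S1 + 1.0 / S2)
mu3 = S3 * (mu1 / S1 + mu2 / S2)
w = S1 / (S1 + S2)                               # w = (S1 + S2)^(-1) S1

lhs = (x - mu1) ** 2 / S1 + (x - mu2) ** 2 / S2
rhs1 = (x - mu3) ** 2 / S3 + (mu1 - mu2) ** 2 / (S1 + S2)
# shrinkage form: S1 / w = S1 + S2, S2 * w = S3, and mu3 = (1 - w) mu1 + w mu2
rhs2 = (mu1 - mu2) ** 2 / (S1 / w) + (x - (1 - w) * mu1 - w * mu2) ** 2 / (S2 * w)

print(lhs, rhs1, rhs2)  # all three agree
```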
In the vector-matrix case, completing the square becomes

(x − µ_1)′Σ_1^{−1}(x − µ_1) + (x − µ_2)′Σ_2^{−1}(x − µ_2) = (x − µ_3)′Σ_3^{−1}(x − µ_3) + (µ_1 − µ_2)′Σ_4^{−1}(µ_1 − µ_2),

where µ_3 = Σ_3(Σ_1^{−1}µ_1 + Σ_2^{−1}µ_2) and

Σ_3 = (Σ_1^{−1} + Σ_2^{−1})^{−1}
Σ_4^{−1} = Σ_1^{−1}(Σ_1^{−1} + Σ_2^{−1})^{−1}Σ_2^{−1} = (Σ_1 + Σ_2)^{−1}.
Completing the square for the product of two normal densities is also useful when
interpreting the fundamental identity: likelihood times prior equals posterior times
marginal. The identity yields

φ(x; µ_1, Σ_1) φ(x; µ_2, Σ_2) = φ(x; µ, Σ) c(µ_1, Σ_1, µ_2, Σ_2),

where φ(x; µ, Σ) denotes the k-dimensional normal density with mean µ and covariance Σ,

c(µ_1, Σ_1, µ_2, Σ_2) = (2π)^{−k/2} |Σ_1 + Σ_2|^{−1/2} exp(−(1/2)(µ_2 − µ_1)′(Σ_1 + Σ_2)^{−1}(µ_2 − µ_1)),

and the parameters are µ = Σ(Σ_1^{−1}µ_1 + Σ_2^{−1}µ_2) and Σ = (Σ_1^{−1} + Σ_2^{−1})^{−1}.
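The product identity can be verified numerically; note that the constant c(µ_1, Σ_1, µ_2, Σ_2), i.e. (2π)^{−k/2}|Σ_1 + Σ_2|^{−1/2} exp(−½(µ_2 − µ_1)′(Σ_1 + Σ_2)^{−1}(µ_2 − µ_1)), is itself a normal density, namely φ(µ_1; µ_2, Σ_1 + Σ_2), which the sketch below exploits (dimension and test values are our choices):

```python
import numpy as np

def mvn_pdf(x, mu, S):
    """Multivariate normal density phi(x; mu, S)."""
    k = len(mu)
    d = x - mu
    quad = d @ np.linalg.solve(S, d)
    return float(np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** k * np.linalg.det(S)))

rng = np.random.default_rng(2)
k = 3
A1, A2 = rng.standard_normal((k, k)), rng.standard_normal((k, k))
S1, S2 = A1 @ A1.T + np.eye(k), A2 @ A2.T + np.eye(k)   # random SPD covariances
mu1, mu2, x = rng.standard_normal(k), rng.standard_normal(k), rng.standard_normal(k)

S = np.linalg.inv(np.linalg.inv(S1) + np.linalg.inv(S2))
mu = S @ (np.linalg.solve(S1, mu1) + np.linalg.solve(S2, mu2))
c = mvn_pdf(mu1, mu2, S1 + S2)   # the normalizing constant c(mu1, S1, mu2, S2)

lhs = mvn_pdf(x, mu1, S1) * mvn_pdf(x, mu2, S2)
rhs = mvn_pdf(x, mu, S) * c
print(lhs, rhs)
```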
3.4 The EM, ECM, and ECME algorithms
MCMC methods have been used extensively to perform numerical integration. There is
also interest in using simulation-based methods to optimise functions. The EM algorithm
is an algorithm in a general class of Q-maximisation algorithms that find a (deterministic)
sequence {θ^(g)} converging to arg max_{θ∈Θ} Q(θ).

First, define a function Q(θ, φ) such that Q(θ) = Q(θ, θ) and Q(θ, θ) ≥ Q(θ, φ) for all φ.
Then define

θ^(g+1) = arg max_{θ∈Θ} Q(θ, θ^(g)).

In order to prove convergence, note that this construction yields the sequence of inequalities

Q(θ^(0), θ^(0)) ≤ Q(θ^(1), θ^(0)) ≤ Q(θ^(1), θ^(1)) ≤ · · ·,

where the odd-numbered inequalities follow from the maximisation step and the even-numbered
ones from the inequality above.
In many models we have to deal with a latent variable, and estimation requires integration.
For example, suppose that we have a triple (y, z, θ) with joint probability specification
p(y, z, θ) = p(y|z, θ)p(z, θ). This occurs in missing data problems and in estimation
problems for mixture models.

A standard application of the EM algorithm is to find

arg max_{θ∈Θ} ∫ p(y|z, θ) p(z|θ) dz.

As we are just finding an optimum, we do not need the prior specification p(θ). The EM
algorithm finds a sequence of parameter values θ^(g) by alternating between an expectation
and a maximisation step. This still requires the numerical (or analytical) computation of
the criterion function Q(θ, θ^(g)) described below.
EM algorithms have been used extensively in mixture models and missing data
problems. The EM algorithm uses the particular choice

Q(θ) = log p(y|θ) = log ∫ p(y, z|θ) dz.

Here the likelihood has a mixture representation, where z is the latent variable (missing
data, state variable, etc.). This is a Q-maximization algorithm with

Q(θ, θ^(g)) = ∫ log p(y, z|θ) p(z|θ^(g), y) dz = E_{z|θ^(g),y}[log p(y, z|θ)].

To implement EM you need to be able to calculate Q(θ, θ^(g)) and optimize it at each iteration.

The EM algorithm and its extensions ECM and ECME are methods of computing
maximum likelihood estimates or posterior modes in the presence of missing data. Let the
objective function be ℓ(θ) = log p(θ|y) + c(y), where c(y) is a possibly unknown normalizing
constant that does not depend on θ and y denotes observed data. We have a mixture
representation,

p(θ|y) = ∫ p(θ, z|y) dz = ∫ p(θ|z, y) p(z|y) dz,
where the distribution of the latent variables is p(z|θ, y) = p(y|θ, z)p(z|θ)/p(y|θ).

In some cases the complete-data log-posterior is simple enough for arg max_θ log p(θ|z, y)
to be computed in closed form. The EM algorithm alternates between the Expectation and
Maximization steps for which it is named. The E-step and M-step compute

Q(β|β^(g)) = E_{z|β^(g),y}[log p(y, z|β)] = ∫ log p(y, z|β) p(z|β^(g), y) dz

β^(g+1) = arg max_β Q(β|β^(g)).

This has an important monotonicity property that ensures ℓ(β^(g)) ≤ ℓ(β^(g+1)) for all g.
In fact, the monotonicity proof given by Dempster et al. (1977) shows that any β with
Q(β, β^(g)) ≥ Q(β^(g), β^(g)) also satisfies the log-likelihood inequality ℓ(β) ≥ ℓ(β^(g)).
In problems with many parameters, the M-step of EM may be difficult. In this case θ may
be partitioned into components (θ_1, . . . , θ_k) in such a way that maximizing log p(θ_j|θ_{−j}, z, y)
is easy. The ECM algorithm pairs the EM algorithm's E-step with k conditional maximization
(CM) steps, each maximizing Q over one component θ_j with the components of θ_{−j}
fixed at their most recent values. Because each CM step increases Q, the ECM algorithm
retains the monotonicity property. The ECME algorithm replaces some of ECM's
CM steps with maximizations over ℓ instead of Q. Liu and Rubin (1994) show that doing
so can greatly increase the rate of convergence.
In many cases we will have a parameter vector θ = (β, ν) partitioned into its components
and a missing data vector z = (λ, ω). Then we compute the Q(β, ν|β^(g), ν^(g)) objective
function and derive E- and M-steps from it to provide an iterative algorithm for
updating parameters. To update the hyperparameter ν we can maximize the full
posterior p(β, ν|y) with β fixed at β^(g+1). The algorithm can be summarized as follows:

β^(g+1) = arg max_β Q(β|β^(g), ν^(g)), where Q(β|β^(g), ν^(g)) = E_{z|β^(g),ν^(g),y}[log p(y, z|β, ν^(g))]
ν^(g+1) = arg max_ν log p(β^(g+1), ν|y).
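As an illustration of the E- and M-steps, the following sketch runs EM for a simple two-component normal mixture with known unit variances and equal weights (the model, simulated data, and starting values are our own, not an example from the text):

```python
import numpy as np

rng = np.random.default_rng(3)
# simulated data: 50/50 mixture of N(-2, 1) and N(2, 1); z is the latent label
true_means = np.array([-2.0, 2.0])
z = rng.integers(0, 2, size=5000)
y = rng.normal(true_means[z], 1.0)

def em_two_means(y, m, n_iter=100):
    """EM for a 50/50 mixture of N(m[0], 1) and N(m[1], 1) with unknown means."""
    for _ in range(n_iter):
        # E-step: responsibilities p(z = j | y, m^(g)); shared constants cancel
        d = np.exp(-0.5 * (y[:, None] - m[None, :]) ** 2)
        r = d / d.sum(axis=1, keepdims=True)
        # M-step: argmax of Q(m | m^(g)) = E[log p(y, z | m)] -> weighted means
        m = (r * y[:, None]).sum(axis=0) / r.sum(axis=0)
    return m

m_hat = em_two_means(y, m=np.array([-0.5, 0.5]))
print(m_hat)  # close to the true means (-2, 2)
```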
Simulated Annealing (SA) A simulation-based approach to finding θ* = arg max_{θ∈Θ} H(θ)
is to sample from a sequence of densities

π_J(θ) = e^{J H(θ)} / ∫ e^{J H(θ)} dµ(θ),

where J is an (inverse) temperature parameter. Instead of looking at derivatives and
performing gradient-based optimization, one simulates from the sequence of densities. This
forms a time-inhomogeneous Markov chain, and under suitable regularity conditions on the
relaxation schedule for the temperature we have θ^(g) → θ*. The main caveat is that we need
to know the criterion function H(θ) to evaluate the Metropolis probability for sampling from
the sequence of densities. This is not always available.
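A minimal sketch of this scheme (the criterion function H and the geometric temperature schedule are our own choices): a Metropolis random walk targets π_J ∝ e^{J H(θ)} while J is slowly increased, so the chain concentrates on the global maximum of H.

```python
import math
import random

def simulated_annealing(h, theta0, n_iter=30_000, seed=0):
    """Metropolis sampling from pi_J(theta) ∝ exp(J h(theta)) while J increases."""
    rng = random.Random(seed)
    theta = theta0
    for g in range(n_iter):
        J = 0.1 * 1000.0 ** (g / (n_iter - 1))  # schedule: J grows from 0.1 to 100
        prop = theta + rng.gauss(0.0, 0.5)      # random-walk proposal
        delta = J * (h(prop) - h(theta))        # Metropolis log-acceptance ratio
        if delta >= 0 or rng.random() < math.exp(delta):
            theta = prop
    return theta

# tilted double well: local maximum near -0.88, global maximum near 1.11
h = lambda t: -(t * t - 1.0) ** 2 + t
theta_hat = simulated_annealing(h, theta0=0.0)
print(theta_hat)  # the chain should settle near the global maximum
```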
An interesting generalisation, which is appropriate in latent variable mixture models,
is the following. Suppose that H(θ) = E_{z|θ}{H(z, θ)} is unavailable in closed form, where
without loss of generality we assume that H(z, θ) ≥ 0. In this case we can use latent
variable simulated annealing (LVSA) methods. Define a joint probability distribution for
z^J = (z_1, . . . , z_J) as

π_J(z^J, θ) ∝ ∏_{j=1}^J H(z_j, θ) p(z_j|θ) µ(θ),

for some measure µ which ensures integrability of the joint. This distribution has the
property that its marginal distribution on θ is given by

π_J(θ) ∝ E_{z|θ}{H(z, θ)}^J µ(θ) = e^{J ln H(θ)} µ(θ).

By the simulated annealing argument we see that this marginal collapses on the maximum
of ln H(θ). The advantage of this approach is that it is typically straightforward to sample
with MCMC from the conditionals

π_J(z_j|θ) ∝ H(z_j, θ) p(z_j|θ) and π_J(θ|z^J) ∝ ∏_{j=1}^J H(z_j, θ) p(z_j|θ) µ(θ).

Jacquier, Johannes and Polson (2007) apply this to finding MLE estimates for commonly
encountered latent variable models.
3.5 Notes and Discussion
Geman (1987) provides a monograph-length discussion of Markov chain optimisation methods.
Pincus (1968) provides an early application of the Metropolis algorithm to simulated annealing
optimisation. Standard references on simulated annealing include Kirkpatrick (1984),
Kirkpatrick et al. (1983), Aarts and Korst (1989), and Van Laarhoven and Aarts (1987). Muller
(2000) and Mueller et al. (2004) propose this version of latent variable simulated annealing
for dealing with the joint problem of integration and optimisation. Slice sampling applications
in statistics are discussed in Besag and Green (1993), Polson (1996), and Damien et
al. (1999), and Neal (2003) provides a general algorithm. General strategies based on the
ratio-of-uniforms method are in Wakefield et al. (1991). Devroye (1986) and Ripley (1987)
discuss many tailor-made algorithms for simulation.

The EM algorithm for missing data problems was originally developed by Dempster,
Laird and Rubin (1977) but has its roots in the hidden Markov model literature and
the Baum-Welch algorithm. Liu and Rubin (1994) consider faster ECME algorithms.
Liang and Wong (2001) and Liu, Liang and Wong (2000) consider extensions, including
evolutionary MCMC algorithms.