Lecture 14: Bayesian inference and Monte Carlo methods

STAT 545: Intro. to Computational Statistics
Vinayak Rao
Purdue University

November 1, 2018

Bayesian inference

Given a set of observations X, MLE maximizes the likelihood:

θ_MLE = argmax_θ p(X|θ)

What if we believe θ is close to 0, is sparse, or is smooth?

Encode this with a ‘prior’ probability p(θ).

• Represents a priori beliefs about θ

Given observations, we can calculate the ‘posterior’:

p(θ|X) = p(X|θ)p(θ) / p(X)

We can calculate the maximum a posteriori (MAP) solution:

θ_MAP = argmax_θ p(θ|X)

Point estimate discards information about uncertainty in θ


Bayesian inference

Bayesian inference works with the entire distribution p(θ|X).

• Represents a posteriori beliefs about θ

Allows us to maintain and propagate uncertainty.

E.g. consider the likelihood p(X|θ) = N(X|θ, 1)

• What is a good prior over θ?
• What is a convenient prior over θ?


Bayesian inference

The posterior distribution p(θ|X) ∝ p(X|θ)p(θ) summarizes all new information about θ provided by the data.

In practice, these distributions are unwieldy.

Need approximations.

An exception: ‘Conjugate priors’ for exponential family distributions.


Conjugate exponential family priors

Let observations come from an exponential family:

p(x|θ) = (1/Z(θ)) h(x) exp(θ⊤ϕ(x))
       = h(x) exp(θ⊤ϕ(x) − ζ(θ)),   with ζ(θ) = log Z(θ)

Place a prior over θ:

p(θ|a, b) ∝ η(θ) exp(θ⊤a − ζ(θ)b)

Given a set of observations X = {x1, . . . , xN},

p(θ|X) ∝ ( ∏_{i=1}^N h(xi) exp(θ⊤ϕ(xi) − ζ(θ)) ) η(θ) exp(θ⊤a − ζ(θ)b)
        ∝ η(θ) exp( θ⊤(a + ∑_{i=1}^N ϕ(xi)) − ζ(θ)(b + N) )


Conjugate priors (contd.)

Prior over θ: exp. fam. distribution with parameters (a,b).

Posterior: same family with parameters (a + ∑_{i=1}^N ϕ(xi), b + N).

A rare instance where an analytical expression for the posterior exists.

In most cases a simple prior quickly leads to a complicated posterior, requiring Monte Carlo methods.

Note the conjugate prior is an entire family of distributions.

• The actual distribution is chosen by setting the parameters (a, b) (a has the same dimension as ϕ, b is a scalar)

• These might be set by e.g. talking to a domain expert.


Conjugate priors: Beta-Bernoulli example

Let xi ∈ {0, 1} indicate if a new drug works for subject i.

The unknown probability of success is π: x ∼ Bern(π).

p(x|π) = π^1(x=1) (1 − π)^1(x=0)
       = exp( 1(x = 1) log π + (1 − 1(x = 1)) log(1 − π) )
       = (1 − π) exp( 1(x = 1) log(π / (1 − π)) )
       = (1 / (1 + exp(θ))) exp(ϕ(x)θ)

This is an exponential family distribution, with θ = log(π / (1 − π)), ϕ(x) = 1(x = 1), h(x) = 1, Z(θ) = 1 + exp(θ). Defining ζ(θ) = log Z(θ) as in the previous slide,

p(x|θ) = exp(ϕ(x)θ − ζ(θ))


Conjugate priors: Beta-Bernoulli example

When θ = log(π / (1 − π)) is unknown, a Bayesian places a prior on it.

As before, define an exp. fam. prior with parameters a:

p(θ|a) ∝ exp(a1θ + a2ζ(θ))

Then given data X = (x1, . . . , xN),

p(θ|a, X) ∝ p(θ, X|a)
          ∝ exp( (a1 + ∑_{i=1}^N 1(xi = 1)) θ + (a2 − N) ζ(θ) )

Thus, the posterior is in the same family as the prior, but with updated parameters (a1 + ∑_{i=1}^N 1(xi = 1), a2 − N).


Conjugate priors: Beta-Bernoulli example

Looking at the prior more carefully, we see:

p(θ|a) ∝ exp(a1θ + a2ζ(θ))
       ∝ exp( a1 log(π / (1 − π)) + a2 log(1 − π) )
       ∝ π^a1 (1 − π)^(a2 − a1)
       = π^(b1 − 1) (1 − π)^(b2 − 1)

This is just the Beta(b1, b2) distribution, and you can check that the posterior is Beta(b1 + ∑_{i=1}^N 1(xi = 1), b2 + ∑_{i=1}^N 1(xi = 0)).

b1 and b2 are sometimes called pseudo-observations, and capture our prior beliefs: before seeing any x’s our prior is as if we saw b1 successes and b2 failures. After seeing data, we factor actual observations into the pseudo-observations.
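As a minimal R sketch of this conjugate update (the binary data, prior pseudo-counts, and sample size below are made up for illustration):

set.seed(1)
x <- rbinom(20, size = 1, prob = 0.7)   # hypothetical binary outcomes (1 = drug worked)
b1 <- 2; b2 <- 2                        # prior pseudo-observations: Beta(b1, b2)

# conjugate update: posterior is Beta(b1 + #successes, b2 + #failures)
post1 <- b1 + sum(x == 1)
post2 <- b2 + sum(x == 0)

post1 / (post1 + post2)                 # posterior mean of pi
qbeta(c(0.025, 0.975), post1, post2)    # 95% posterior credible interval for pi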


Monte Carlo methods


Monte Carlo integration

What about the situation when the posterior p(θ|X) is no longer simple/available in closed form?

What information about p(θ|X) do we really need?

Typically, expectations of different functions g:

E_{θ|X}[g] = ∫ g(θ) p(θ|X) dθ

What is g to calculate 1) the mean, 2) the variance, 3) p(θ > 10|X)?


Monte Carlo integration

Let us forget the posterior distribution p(θ|X), and consider some general probability distribution p(x). We want

µ := E_p[g] = ∫_X g(x) p(x) dx

Sampling approximation: rather than visit all points in X, calculate a summation over a finite set:

µ ≈ (1/N) ∑_{i=1}^N g(xi) =: µ̂

Monte Carlo approximation:

• Obtain points by sampling from p(x): xi ∼ p
• Approximate integration with summation

µ ≈ (1/N) ∑_{i=1}^N g(xi)
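As a quick R illustration of the idea (my own toy example, not from the slides): estimate E_p[g] for g(x) = x² under p = N(0, 1), whose true value is 1.

set.seed(1)
x <- rnorm(1e5)               # samples x_i ~ p = N(0, 1)
g <- function(x) x^2          # function whose expectation we want
mean(g(x))                    # Monte Carlo estimate of E_p[g], close to 1
sd(g(x)) / sqrt(length(x))    # estimated standard error, shrinking as N^(-1/2)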


Monte Carlo integration

µ̂ = (1/N) ∑_{i=1}^N g(xi)

If xi ∼ p,

E_p[µ̂] = (1/N) ∑_{i=1}^N E_p[g] = µ    (unbiased estimate)

Var_p[µ̂] = (1/N) Var_p[g],   Error = StdDev ∝ N^(−1/2)

(1/N) ∑_{i=1}^N g(xi) → E_p[g] = µ as N → ∞    (consistent estimate, LLN)


Monte Carlo Sampling

Is this a good idea?

• In low-dims, worth considering numerical methods like quadrature. In high-dims, these quickly become infeasible.

Simpson’s rule in d dimensions, with N grid points:

error ∝ N^(−4/d)

Monte Carlo integration:

error ∝ N^(−1/2)

Independent of dimensionality!

• If unbiasedness is important to you.
• Very simple.
• Very modular: easily incorporated into more complex models (Gibbs sampling)


Monte Carlo Sampling (contd.)

An aside: Monte Carlo should be your method of last resort!

Don’t hesitate to use numerical integration

• Numerical integration can be much faster and more accurate

Contrast

> integrate(function(x) x * exp(-x), lower = 0, upper = Inf)

with

> mean(rexp(1000))

Both estimate ∫_0^∞ x e^(−x) dx = E[X] = 1 for X ∼ Exponential(1); the quadrature call returns 1 with a very small estimated error, while the Monte Carlo mean has standard error of roughly 1/√1000 ≈ 0.03.

Generating random variables

• The simplest useful probability distribution is Unif(0, 1).
• In theory, can be used to generate any other RV.
• Easy to generate uniform RVs on a deterministic computer?

No!

• Instead: pseudorandom numbers.
• Map a seed to a ‘random-looking’ sequence.
• Downside: http://boallen.com/random-numbers.html
• Upside: Can use seeds for reproducibility or debugging
  > set.seed(1)
• Careful with batch/parallel processing.


Generating random variables

R has a bunch of random number generators.

rnorm, rgamma, rbinom, rexp, rpois, etc.

What if we want samples from some other distribution?

Generating random variables

Inverse transform sampling

Let X have pdf p(x), and cdf F(x) = P(X ≤ x) = ∫_{−∞}^x p(u) du

Let X ∼ p(·) and U = F(X).

Then U is Unif(0, 1).

Equivalently, sample U ∼ Unif(0, 1) and let X = F^(−1)(U). Then X ∼ p(·) (see Wikipedia for a proof).

E.g. −log(U) is Exponential(1).

Usually it is hard to compute F^(−1).
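A minimal R sketch of inverse transform sampling for the Exponential(1) example above (runif and qexp are base R; the sample size is arbitrary):

u <- runif(1e5)          # U ~ Unif(0, 1)
x <- -log(u)             # X = F^(-1)(U) for F(x) = 1 - exp(-x), i.e. Exponential(1)
mean(x); var(x)          # both should be close to 1
x2 <- qexp(runif(1e5))   # the same idea via the built-in quantile function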


Rejection sampling

Let p(x) = f(x)/Z.

Probability of a sample falling in [x0, x0 + ∆x] is p(x0)∆x.

If we sample points uniformly below the curve Mf(x):
Probability of a sample in [x0, x0 + ∆x] = Mf(x0)∆x / ∫_X Mf(x) dx = p(x0)∆x.

How to do this (without sampling from p)?

If Mf(x) ≤ Nq(x) ∀x, for a constant N and a distribution q(·):

• Sample points uniformly under Nq(x)
  (sample x0 ∼ q(·), and assign it a uniform height in [0, Nq(x0)]).
• Keep only points under Mf(x).

Equivalent algorithm (convince yourself):

• Propose x∗ ∼ q(·)
• Accept with probability Mf(x∗)/(Nq(x∗))

We need a bound on f(x). A loose bound leads to lots of rejections.
Probability of acceptance = MZ/N.
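A small R sketch of this accept/reject scheme (my own illustrative choices of target, proposal and constants): take the unnormalized target f(x) = exp(−x²/2)|sin(x)| (Example 1 on a later slide), proposal q = N(0, 1), M = 1 and N = √(2π), so that Mf(x) ≤ Nq(x) for all x.

f <- function(x) exp(-x^2 / 2) * abs(sin(x))   # unnormalized target, p(x) = f(x)/Z
Nc <- sqrt(2 * pi)                             # envelope constant: 1 * f(x) <= Nc * dnorm(x)

rejection_sample <- function(n) {
  out <- numeric(0)
  while (length(out) < n) {
    x_star <- rnorm(n)                              # propose from q = N(0, 1)
    u <- runif(n)
    keep <- u < f(x_star) / (Nc * dnorm(x_star))    # accept w.p. M f(x*) / (N q(x*))
    out <- c(out, x_star[keep])
  }
  out[1:n]
}

samples <- rejection_sample(1e4)               # draws distributed according to p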


Intractable normalization constants

A probability density takes the form p(x) = f(x)/Z

• Z = ∫_X f(x) dx is the normalization constant

• Ensures probability integrates to 1

Often Z is difficult to calculate (intractable integral over f(x))

Consequently, evaluating p(x) is hard

However, rejection sampling doesn’t need Z or p(x)


Rejection sampling (contd.)

Example 1:

p(x) ∝ exp(−x²/2) |sin(x)|

Example 2 (truncated normal):

p(x) ∝ exp(−x²/2) 1{x > c}

What is M for each case? What can we say about efficiency?
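A quick R sketch of the efficiency issue in Example 2 (my own illustration): with proposal q = N(0, 1), M = 1 and N = √(2π), the acceptance probability for a proposal x∗ is just 1{x∗ > c}, so the overall acceptance rate is P(X > c), which collapses as c grows.

c <- 3
x_star <- rnorm(1e6)    # proposals from q = N(0, 1)
mean(x_star > c)        # acceptance rate, about pnorm(c, lower.tail = FALSE) ~ 0.0013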


Importance Sampling

Rather than accept/reject, assign weights to samples.

Observe:

E_p[g] = ∫ g(x) p(x) dx = ∫ g(x) (p(x)/q(x)) q(x) dx = E_q[ g(x) p(x)/q(x) ]

Use a Monte Carlo approximation to the latter expectation:

• Draw proposals xs from q(·) and calculate weights w(xs) = p(xs)/q(xs).

∫ g(x) p(x) dx ≈ (1/N) ∑_{s=1}^N w(xs) g(xs)

Since w(x) = p(x)/q(x) = f(x)/(Z q(x)):

• We don’t need a bounding envelope.
• We need the normalization constant Z (but see later).
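A minimal R sketch of importance sampling (my own toy setup): estimate E_p[g] for p = N(0, 1) and g(x) = x², whose true value is 1, using a wider proposal q = N(0, 2²).

g <- function(x) x^2
xs <- rnorm(1e5, mean = 0, sd = 2)        # proposals from q = N(0, 2^2)
w <- dnorm(xs, 0, 1) / dnorm(xs, 0, 2)    # weights w(x) = p(x)/q(x)
mean(w * g(xs))                           # importance sampling estimate of E_p[x^2], close to 1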


Importance sampling vs simple Monte Carlo

Simple Monte Carlo/MCMC (left figure) uses a sampling approximation.

Importance sampling (right figure) weights the samples.

Importance Sampling (contd)

When does this make sense?

Sometimes it’s easier to simulate from q(x) than p(x).

Sometimes it’s better to simulate from q(x) than p(x)!

To reduce variance. E.g. rare event simulation.

Let X ∼ N(0, 1)

• What is P(X > 5)?
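A sketch of the rare-event point (my own illustration): P(X > 5) ≈ 2.9 × 10⁻⁷ for X ∼ N(0, 1), so naive Monte Carlo with a million draws usually sees no hits at all; proposing from q = N(5, 1) and weighting fixes this.

mean(rnorm(1e6) > 5)                      # naive Monte Carlo: almost always 0

xs <- rnorm(1e6, mean = 5)                # proposals from q = N(5, 1)
w <- dnorm(xs, 0, 1) / dnorm(xs, 5, 1)    # weights w(x) = p(x)/q(x)
mean(w * (xs > 5))                        # close to pnorm(5, lower.tail = FALSE) ~ 2.9e-7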


Importance sampling:

Let X = (x1, . . . , x100) be a hundred dice. What is p(∑ xi ≥ 550)?

Rejection sampling (from p(x)) leads to high rejection.

A better choice might be to bias the dice, e.g. q(xi = v) ∝ v (for v ∈ {1, . . . , 6}).


Importance sampling:

Define S_X = ∑ xi.

p(S ≥ 550) = ∑_{y ∈ all configs of 100 dice} δ(∑ y ≥ 550) p(y)
           = ∑_{y ∈ all configs of 100 dice} [p(y)/q(y)] δ(∑ y ≥ 550) q(y)

For a proposal X∗ ∼ q,

w(X∗) = p(X∗)/q(X∗) = (1/6)^100 / ∏_i q(x∗_i)

Use the approximation p(S ≥ 550) ≈ (1/N) ∑_{j=1}^N w(X^j) δ(∑_i x^j_i ≥ 550)
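A sketch of this dice calculation in R (the proposal and weights follow the setup above; the sample size is arbitrary and the exact answer is not given on the slides):

q_probs <- (1:6) / sum(1:6)           # biased proposal q(x = v) ∝ v
N <- 1e5
terms <- replicate(N, {
  y <- sample(1:6, 100, replace = TRUE, prob = q_probs)
  w <- prod((1/6) / q_probs[y])       # w(X) = (1/6)^100 / prod_i q(x_i)
  w * (sum(y) >= 550)
})
mean(terms)                           # importance sampling estimate of p(S >= 550)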


Importance sampling (contd)

What is the variance of the estimate?

Var[µ̂_imp] = E[µ̂²_imp] − µ²

           = E[ ( (1/N) ∑_{i=1}^N w_i g(xi) )² ] − µ²

           = E_q[ ( p(x)g(x)/q(x) )² ] − µ²    (per sample, i.e. taking N = 1)

           = ∫_X q(x) ( p(x)g(x)/q(x) )² dx − µ²

           ≥ ( ∫_X q(x) p(x)g(x)/q(x) dx )² − µ²    (Jensen's inequality)

           = 0 (!)


Importance sampling (contd)

We achieve this lower bound when q(x) ∝ p(x)g(x). A slightly useless result, because

q(x) = p(x)g(x) / ∫_X p(x)g(x) dx

requires solving the integral we care about.


Importance sampling (contd)

We want a small variance in the weights w(xi). It is easy to check that E_q[w(x)] = 1.

Var_q[w(x)] = E_q[w(x)²] − E_q[w(x)]²
            = ∫_X (p(x)/q(x))² q(x) dx − 1 = ∫_X p(x)²/q(x) dx − 1

This can be unbounded, e.g. p = N(0, 2) and q = N(0, 1).

A popular diagnostic statistic: the effective sample size (ESS).

ESS = ( ∑_{i=1}^N w(xi) )² / ∑_{i=1}^N w(xi)²

Small ESS → large variability in the w’s → bad estimate.
A large ESS promises you nothing!
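A quick R sketch of the ESS diagnostic for a mismatched pair like the one above (my own illustration, reading N(0, 2) as standard deviation 2):

set.seed(1)
xs <- rnorm(1e4)                         # proposals from q = N(0, 1)
w <- dnorm(xs, 0, 2) / dnorm(xs, 0, 1)   # weights for the wider target p = N(0, sd = 2)
sum(w)^2 / sum(w^2)                      # ESS: typically well below 1e4, and unstable across runs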


Importance sampling when Z is unknown

Importance weights are w(x) = p(x)/q(x), where p(x) = f(x)/Z.

How can we estimate Z = ∫ f(x) dx?

Z = ∫ f(x) dx = ∫ (f(x)/q(x)) q(x) dx

Reuse samples from the proposal distribution q(x). Writing w̃(x) = f(x)/q(x) for the unnormalized weights,

Ẑ = (1/N) ∑_{i=1}^N f(xi)/q(xi) = (1/N) ∑_{i=1}^N w̃(xi)

Can use this to approximate the importance sampling weights w(xi):

w(xi) = p(xi)/q(xi) = f(xi)/(Z q(xi)) ≈ (1/Ẑ) w̃(xi)

Use (1/Ẑ) w̃(x) instead of w(x) in the Monte Carlo approximation.
This is biased for finite N, but consistent as N → ∞.
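A final R sketch of this self-normalized scheme (my own toy example): target p ∝ f with f(x) = exp(−x²/2)|sin(x)| as in the rejection sampling example, proposal q = N(0, 1), estimating E_p[g] for g(x) = x².

f <- function(x) exp(-x^2 / 2) * abs(sin(x))   # unnormalized target density
g <- function(x) x^2
xs <- rnorm(1e5)                               # samples from the proposal q = N(0, 1)
w_tilde <- f(xs) / dnorm(xs)                   # unnormalized weights f(x)/q(x)
Z_hat <- mean(w_tilde)                         # estimate of the normalizing constant Z
mean((w_tilde / Z_hat) * g(xs))                # self-normalized estimate of E_p[g]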
