Post on 11-Mar-2018
The Likelihood, the prior and Bayes Theorem
Douglas Nychka, www.image.ucar.edu/~nychka
• Likelihoods for three examples.
• Prior, Posterior for a Normal example.
• Priors for Surface temperature and the CO2 problem.
NCAR/IMAGe Summer School, July 2007. Supported by the National Science Foundation.
The big picture
The goal is to estimate ‘parameters’: θ
System states, e.g. the temperature fields from Lecture 1 or the surface fluxes of CO2

or

Statistical parameters, e.g. climatological means or correlation coefficients.

We also have ... Data, whose distribution depends on these unknown quantities (θ).
Reversing roles in the problem:‘inverse probability’
We start by specifying the conditional distribution of the data given the parameters.

We end by finding the conditional distribution of the parameters given the data.
Likelihood
L(Y, θ) or [Y |θ]
the conditional density of the data given the parameters.
Assume that you know the parameters exactly: what is the distribution of the data?

This is called a likelihood because, for a given pair of data and parameters, it registers how 'likely' the data are.
E.g.

[Figure: solid and dashed densities over θ, with the observed data value Y marked on the axis.]
Data is ‘unlikely’ under the dashed density.
Some likelihood examples.
It does not get easier than this! A noisy observation of θ.
Y = θ + N(0,1)
Likelihood:

L(Y, θ) = (1/√(2π)) e^(−(Y−θ)²/2)
Minus log likelihood:

−log L(Y, θ) = (Y−θ)²/2 + log(2π)/2

The -log likelihood is usually simpler algebraically.

Note the foreshadowing of a squared term here!
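This simplest likelihood can be checked numerically. A minimal sketch in plain Python (the observation value Y = 1.5 is made up for illustration):

```python
import math

def neg_log_lik(theta, y):
    """Minus log likelihood for Y = theta + N(0, 1):
    (y - theta)^2 / 2 plus a constant that does not involve theta."""
    return (y - theta) ** 2 / 2 + math.log(2 * math.pi) / 2

def likelihood(theta, y):
    """The corresponding density of Y given theta."""
    return math.exp(-neg_log_lik(theta, y))

y = 1.5  # a made-up observation
# The -log likelihood is minimized (likelihood maximized) at theta = y:
assert neg_log_lik(y, y) < neg_log_lik(0.0, y)
assert neg_log_lik(y, y) < neg_log_lik(3.0, y)
```

The squared term is the only part that depends on θ, which is why the constant is usually dropped.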
Temperature station data

Observational model: Y_j = T(x_j) + N(0, σ²) for j = 1, ..., N

L(Y, θ) = (1/(√(2π)σ)) e^(−(Y_1−T(x_1))²/(2σ²)) × (1/(√(2π)σ)) e^(−(Y_2−T(x_2))²/(2σ²)) × ... × (1/(√(2π)σ)) e^(−(Y_N−T(x_N))²/(2σ²))

combining

L(Y, θ) = (1/(√(2π)σ)^N) e^(−∑_{j=1}^{N} (Y_j−T(x_j))²/(2σ²))

Minus log likelihood:

−log L(Y, θ) = ∑_{j=1}^{N} (Y_j − T(x_j))²/(2σ²) + (N/2) log(2π) + N log(σ)
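The product of N Gaussian factors becoming a sum of squares can be sketched with NumPy. The station values below are made up; only the structure of the computation matches the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up example: "true" temperatures at 5 stations plus N(0, sigma^2) noise.
T_true = np.array([10.0, 12.0, 9.5, 11.0, 10.5])
sigma = 0.5
Y = T_true + rng.normal(0.0, sigma, size=T_true.size)

def neg_log_lik(T, Y, sigma):
    """Combined -log likelihood: the per-station quadratic terms
    summed, plus the constants from the Gaussian normalizers."""
    N = Y.size
    return (np.sum((Y - T) ** 2) / (2 * sigma**2)
            + (N / 2) * np.log(2 * np.pi) + N * np.log(sigma))

# The -log likelihood is smallest when T matches the data exactly:
assert neg_log_lik(Y, Y, sigma) < neg_log_lik(T_true, Y, sigma)
```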
A simplifying complication

It will be useful to express the relationship as

Y = HT + e

• Y is the data vector.
• H is an indicator matrix of ones and zeroes that picks out the grid locations where there are stations.
• T is the huge vector of all temperatures on the grid.
• e is the measurement error vector, distributed (0, R).

Minus log likelihood:

−log L(Y, θ) ∝ (Y − HT)ᵀ R⁻¹ (Y − HT)/2 + log(|R|)/2

The quadratic thing again!
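A small sketch of this matrix form, with made-up sizes, grid values, and errors (not the actual station network):

```python
import numpy as np

# Made-up sizes: 3 stations observed on a grid of 6 points.
n_grid, stations = 6, [0, 2, 5]               # station j sits at grid point stations[j]
H = np.zeros((len(stations), n_grid))
H[np.arange(len(stations)), stations] = 1.0   # indicator matrix of ones and zeroes

T = np.linspace(10.0, 12.5, n_grid)           # temperatures on the grid
R = 0.25 * np.eye(len(stations))              # measurement error covariance
Y = H @ T + np.array([0.1, -0.2, 0.05])       # data = HT + e (made-up errors)

resid = Y - H @ T
quad = resid @ np.linalg.solve(R, resid) / 2  # (Y - HT)^T R^{-1} (Y - HT) / 2
```

Each row of H has a single one, so H @ T simply reads off the grid values at the station locations.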
CO2 Inverse Problem

Z (data), x (concentrations) and u (sources).

z_j = h_j(x_i) + N(0, R) for j = 1, ..., N

and x_i is determined by the dynamical model: x_{i+1} = Φ(x_i) + G(u)

L(Z, u) = (1/(√(2π) R)^(NI)) e^(−∑_{i,j} (z_j − h_j(x_i))²/(2R²))
Minus log likelihood:

−log L(Z, u) = ∑_{i=0}^{I−1} ∑_{j=1}^{N} (z_j − h_j(x_i))²/(2R²) + (NI/2) log(2π) + NI log(R)

[Figure: sources (u) and concentrations (x).]
Priors
To get the conditional distribution of the parameters given the data we need the distribution of the parameters in the absence of any data. This is called the prior.

The simplest prior for θ: for the first example take θ to be N(µ, σ²).

If it seems bizarre to put a distribution on this unknown quantity then you are probably following this lecture!

We are now ready to use Bayes theorem.
Bayes Theorem again
Three ways of stating Bayes Thm:

• (parameters given data) ∝ (data given parameters) × (parameters)

• [θ|Y] ∝ [Y|θ] [θ]

• conditional density given the data ∝ L(Y, θ) × prior(θ)

Posterior: the conditional density of the parameters given the data.
Back to first example: Y = θ + N(0,1)

Likelihood:

[Y|θ] = (1/√(2π)) e^(−(Y−θ)²/2)

Prior for θ:

[θ] = (1/(√(2π)σ)) e^(−(θ−µ)²/(2σ²))

Posterior:

[θ|Y] ∝ (1/√(2π)) e^(−(Y−θ)²/2) × (1/(√(2π)σ)) e^(−(θ−µ)²/(2σ²))
Simplifying the posterior for Gaussian-Gaussian

[θ|Y] ∝ [Y|θ][θ] ∝ e^(−(Y−θ)²/2 − (θ−µ)²/(2σ²)) ∝ e^(−(θ−Y*)²/(2(σ*)²))

posterior mean:

Y* = (Yσ² + µ)/(1 + σ²)

Y* = µ + (Y − µ) σ²/(1 + σ²)

Remember that θ has prior mean µ.

posterior variance:

(σ*)² = σ²/(1 + σ²)

(σ*)² = σ² − σ⁴/(1 + σ²)

Without other information Y has variance of 1 about θ.
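The closed-form mean and variance can be checked against a brute-force grid posterior. A sketch in plain Python, using the DATA = 1.5, N(0, (1.5)²) numbers that appear in a later slide:

```python
import math

Y, mu, sigma = 1.5, 0.0, 1.5  # data, prior mean, prior sd

def neg_log_post(theta):
    # -log likelihood + -log prior, up to constants
    return (Y - theta) ** 2 / 2 + (theta - mu) ** 2 / (2 * sigma**2)

# Unnormalized posterior on a fine grid of theta values.
grid = [i * 0.001 - 10.0 for i in range(20001)]
w = [math.exp(-neg_log_post(t)) for t in grid]
Z = sum(w)
mean = sum(t * wi for t, wi in zip(grid, w)) / Z
var = sum((t - mean) ** 2 * wi for t, wi in zip(grid, w)) / Z

# Closed-form answers from the slide:
Y_star = (Y * sigma**2 + mu) / (1 + sigma**2)   # posterior mean
var_star = sigma**2 / (1 + sigma**2)            # posterior variance
assert abs(mean - Y_star) < 1e-3
assert abs(var - var_star) < 1e-3
```

The grid approach works for any prior, which is exactly the numerical route real Bayes problems take when the algebra fails.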
As Jeff Anderson says:
products of Gaussians are Gaussian ...
Where do these products come from? What are statistical names for them?

It is quite easy to consider priors where the algebra to find the posterior is impossible! Most real Bayes problems are solved numerically.

More on this topic and MCMC at the end of this lecture.
Some posteriors for this example
DATA = 1.5, PRIOR N(0, (1.5)²)

[Figure: likelihood and posterior densities over θ.]
Prior not very informative
DATA, PRIOR N(0, (2.5)²)

[Figure: likelihood and posterior densities over θ.]
Prior is informative
DATA, PRIOR N(0, (.5)²)

[Figure: likelihood and posterior densities over θ.]
-log posterior

(Y − θ)²/2 + (θ − µ)²/(2σ²) + constant

= −log likelihood + −log prior

= fit to data + control/constraints on parameter

This is how the separate terms originate in a variational approach.
The Big Picture

It is useful to report the values where the posterior has its maximum.

This is called the posterior mode.

Variational DA techniques = finding the posterior mode.

Maximizing the posterior is the same as minimizing the -log posterior.
Prior for the surface temperature problem
Use climatology!
The station data suggests that the temperature field is Gaussian:

N(µ, Σ), and assume that µ and Σ are known.

We have cheated in Lecture 1 and estimated the mean field µ(x) and the covariance function from the data.

There is actually some uncertainty in these choices.
Finding the posterior temperature field for a given year.

Posterior = Likelihood × Prior
Posterior ∝ N(HT, R) × N(µ, Σ)

Jeff Anderson: products of Gaussians are Gaussian ...

After some heavy lifting:

Posterior = N(T̂, Pᵃ)

T̂ = µ + ΣHᵀ(HΣHᵀ + R)⁻¹(Y − Hµ)

Pᵃ = Σ − ΣHᵀ(HΣHᵀ + R)⁻¹HΣ

These are the Kalman filter equations.
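The update equations can be sketched directly. A minimal NumPy illustration with made-up sizes, prior, and covariances (not the actual temperature grid or station network):

```python
import numpy as np

# Made-up sizes: 2 stations on a 4-point grid.
n_grid, stations = 4, [1, 3]
H = np.zeros((2, n_grid))
H[np.arange(2), stations] = 1.0       # indicator observation matrix

mu = np.full(n_grid, 10.0)            # prior mean field (made up)
Sigma = 0.5 * np.eye(n_grid) + 0.5    # prior covariance with spatial correlation
R = 0.1 * np.eye(2)                   # measurement error covariance
Y = np.array([11.0, 9.5])             # station observations (made up)

S = H @ Sigma @ H.T + R               # H Sigma H^T + R
K = Sigma @ H.T @ np.linalg.inv(S)    # gain: Sigma H^T (H Sigma H^T + R)^{-1}
T_hat = mu + K @ (Y - H @ mu)         # posterior mean
P_a = Sigma - K @ H @ Sigma           # posterior covariance

# The data reduce the posterior variance at every grid point:
assert np.all(np.diag(P_a) <= np.diag(Sigma) + 1e-12)
```

Because the prior covariance correlates grid points, even the unobserved points are updated by the two stations.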
Another Big Picture Slide
Posterior = Likelihood × Prior
-log Posterior = -log Likelihood + -log Prior
For the temperature problem we have
-log Posterior =
(Y −HT )TR−1(Y −HT )/2 + (T − µ)TΣ−1(T − µ)/2
+ other stuff.
(Remember T is the free variable here.)
The CO2 Problem
Prior
For the sources: N(µ_{u,k}, P_{u,k}).
For the initial concentrations: N(µ_x, P_x)

-log posterior

∑_{i=0}^{I−1} ∑_{j=1}^{N} (z_j − h_j(x_i))ᵀ R_j⁻¹ (z_j − h_j(x_i))/2

+ ∑_{k=1}^{K} (u_k − µ_{u,k})ᵀ P_{u,k}⁻¹ (u_k − µ_{u,k})/2

+ (x_0 − µ_x)ᵀ P_x⁻¹ (x_0 − µ_x)/2 + constant
Some Comments
Dynamical constraint: Given the time-varying sources and initial concentrations, all the subsequent concentrations are found by the dynamical model.

Due to the linear properties of tracers

X = ΩU

Ω: after you generate this matrix you will have used up your computing allocation!

Posterior: Easy to write down but difficult to compute; focus on finding where the posterior is maximized.

A future direction is to draw samples from the posterior.
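The X = ΩU relation can be illustrated on a toy linear model. Everything here (Φ, G, the sizes) is made up; the sketch only shows how Ω could be assembled column-by-column by running the model on unit sources, which is why generating it is so expensive:

```python
import numpy as np

n, I = 3, 4                            # state size, number of time steps
rng = np.random.default_rng(1)
Phi = 0.5 * np.eye(n)                  # made-up linear dynamics
G = rng.normal(size=(n, n))            # made-up source-to-state map

def run(U):
    """Run x_{i+1} = Phi x_i + G u_i from x_0 = 0;
    return the stacked concentrations x_1, ..., x_I."""
    x, out = np.zeros(n), []
    for i in range(I):
        x = Phi @ x + G @ U[i]
        out.append(x.copy())
    return np.concatenate(out)

# Column k of Omega is the model's response to the k-th unit source:
# one model run per column, I*n runs in total.
Omega = np.column_stack([run(e.reshape(I, n)) for e in np.eye(I * n)])

# Linearity means any source history U factors through Omega:
U = rng.normal(size=(I, n))
assert np.allclose(run(U), Omega @ U.reshape(-1))
```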