LECTURE 05: BAYESIAN ESTIMATION

ECE 8443 – Pattern RecognitionECE 8527 – Introduction to Machine Learning and Pattern Recognition

LECTURE 05: BAYESIAN ESTIMATION

• Objectives:Bayesian EstimationExample

• Resources:D.H.S.: Chapter 3 (Part 2)J.O.S.: Bayesian Parameter EstimationA.K.: The Holy TrinityA.E.: Bayesian MethodsJ.H.: Euro Coin

http://rii.ricoh.com/~stork/DHSch3part2.ppt

http://www-ccrma.stanford.edu/~jos/bayes/Bayesian_Parameter_Estimation.html



https://engineering.purdue.edu/kak/Trinity.pdf

http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/AV0809/eshky.pdf

http://www.isip.msstate.edu/publications/seminars/msstate_misc/2002/euro_coin/presentation_v0.pdf


http://www.isip.piconepress.com/publications/presentations_misc/2002/isip/euro_coin/

http://www.mat.ulaval.ca/informatique/guide94/img14.png


ECE 8527: Lecture 05, Slide 2

• In Chapter 2, we learned how to design an optimal classifier if we knew the prior probabilities, P(ωi), and class-conditional densities, p(x|ωi).

• Bayes: treat the parameters as random variables having some known prior distribution. Observations of samples converts this to a posterior.

• Bayesian learning: sharpen the a posteriori density causing it to peak near the true value.

• Supervised vs. unsupervised: do we know the class assignments of the training data.

• Bayesian estimation and ML estimation produce very similar results in many cases.

• Reduces statistical inference (prior knowledge or beliefs about the world) to probabilities.

Introduction to Bayesian Parameter Estimation


• Posterior probabilities, P(ωi|x), are central to Bayesian classification.• Bayes formula allows us to compute P(ωi|x) from the priors, P(ωi), and the

likelihood, p(x|ωi).• But what If the priors and class-conditional densities are unknown?• The answer is that we can compute the posterior, P(ωi|x), using all of the

information at our disposal (e.g., training data).• For a training set, D, Bayes formula becomes:

c

jjj

iii

DPDp

DPDpDP

1)(),(

)(),(evidence

priorlikelihood),|(

x

xx

• We assume priors are known: P(ωi|D) = P(ωi).

• Also, assume functional independence:

Di have no influence on

This gives:

c

jjjj

iiii

PDp

PDpDP

1)(),(

)(),(),|(

x

xx

Class-Conditional Densities

jiifDp j ),( x


• Assume the parametric form of the evidence, p(x), is known: p(x|θ).

• Any information we have about θ prior to collecting samples is contained in a known prior density p(θ).

• Observation of samples converts this to a posterior, p(θ|D), which we hope is peaked around the true value of θ.

• Our goal is to estimate a parameter vector:

dDpDp )()( ,xx

• We can write the joint distribution as a product:

dDpp

dDpDpDp

)(

),()(

()

()

xxx

because the samples are drawn independently.

The Parameter Distribution

)( Dp x)Dp (

• This equation links the class-conditional density

to the posterior, . But numerical solutions are typically required!


• Case: only mean unknown ),()( 2 Np x

• Known prior density: ),()( 200 Np

• Using Bayes formula:

)()(

])()([

)()()()(

)()()(

)(

)()()()(

1

pp

pDp

dpDppDp

DppDp

Dp

pDpDpDp

n

kk

x

• Rationale: Once a value of μ is

known, the density for x is

completely known. α is a

normalization factor that

depends on the data, D.

Univariate Gaussian Case


• Applying our Gaussian assumptions:

n

k

k

n

k

kn

k

k

n

k

kk

n

k

k

n

k

kDp

12

2

220

020

2

12

2

220

020

2

12

2

20

20

12

2

22

2

20

200

2

1

22

0

0

2

0

0

01

2

)2(2

21exp

)2(2

21exp

21exp

22

21exp

21exp

21exp

21

21exp

21|

x

xx

xx

x

x

Univariate Gaussian Case


• Now we need to work this into a simpler form:

Univariate Gaussian Case (Cont.)

n

kkn

n

n

kk

n

kk

n

k

kn

k

n

k

k

nwhere

nn

nn

n

Dp

1

20

02

220

2

20

0

12

220

2

20

0

122

0

2

2

2

20

0

122

0

2

12

2

12

2

220

020

2

1ˆ

)ˆ(12121(exp

12121(exp

2221exp

2221exp

)2(2

21exp|

x

x

x

x

x


Univariate Gaussian Case (Cont.)• p(μ|D) is an exponential of a quadratic function, which makes it a normal

distribution. Because this is true for any n, it is referred to as a reproducing density.

• p(μ) is referred to as a conjugate prior.

• Write p(μ|D) ~ N(μn,σn2):

• Expand the quadratic term:

• Equate coefficients of our two functions:

])(21exp[

21)( 2

n

n

nDp

)]2

(21exp[

21])(

21exp[

21)( 2

222

n

nn

nn

n

n

Dp

20

02

220

2

2

22

)ˆ(12121exp

221exp

21

n

n

nn

n

nn



• Rearrange terms so that the dependencies on μ are clear:

• Associate terms related to σ2 and μ:

• There is actually a third equation involving terms not related to :

but we can ignore this since it is not a function of μ and is a complicated equation to solve.

20

02

220

2

22

22

2

)ˆ(12121exp

2121exp

21exp

21

n

n

n

nn

n

n

nn

n

n

n

nn

n

n

or

21

21

21exp

21

21exp

21

02

2

2

2


• Two equations and two unknowns. Solve for μn and σn2. First, solve for μn

2 :

220

220

220

20

2

20

2

2

11

nnnn


• Next, solve for μn:

• Summarizing:

220

2202

0220

2

220

20 ˆ)(

n

nnn

n

nn

220

2

0220

20

220

220

20

0220

220

2

20

2

02

2

ˆ

1)(ˆ

)(ˆ

nnn

nnn

n

n

n

nnnn


• μn represents our best guess after n samples.

• σn2 represents our uncertainty about this guess.

• σn2 approaches σ2/n for large n – each additional observation decreases our

uncertainty.

• The posterior, p(μ|D), becomes more sharply peaked as n grows large. This is known as Bayesian learning.

Bayesian Learning


• How do we obtain p(x|D) (derivation is tedious):

),(])(21exp[

21

])(21exp[

21][)(

21exp[

21

)()(

222

22

nn

n

n

n

n

n

fx

dx

dDpxpDxp

()

where:

df

n

nn

n

nn

2

22

22

22

22)(

21exp[),( x

• Note that: ),(~)|( 22nnNDp x

• The conditional mean, μn, is treated as the true mean.• p(x|D) and P(ωj) can be used to design the classifier.

Class-Conditional Density


• Applying Bayes formula:

))(2)((21exp

)|(|

010

1

110

1

1

n

ik

n

kk

n

ppDp

x

x

tt

which has the form:

)()(21exp| 1

nnDp t

• Once again: ),(~| nnNDp

and we have a reproducing density.

Multivariate Case

• Assume:

where are assumed to be known.

),(~)(and),(~)( 00 NpNp x

00 , and


• Equating coefficients between the two Gaussians:

n

kkn

nnn

n

n

n

n

1

010

11

10

11

1ˆ

ˆ

x

• The solution to these equations is:

nn

nnn

n

nn

1)1(

)1(1ˆ)1(

100

01

01

00

• It also follows that: ),(~)|( nnNDp x

Estimation Equations


• p(x | D) computation can be applied to any situation in which the unknown density can be parameterized.

• The basic assumptions are:

The form of p(x | θ) is assumed known, but the value of θ is not known exactly.

Our knowledge about θ is assumed to be contained in a known prior density p(θ).

The rest of our knowledge about θ is contained in a set D of n random variables x1, x2, …, xn drawn independently according to the unknown probability density function p(x).

General Theory


Formal Solution• The posterior is given by:

• Using Bayes formula, we can write p(D| θ ) as:

• and by the independence assumption:

• This constitutes the formal solution to the problem because we have an expression for the probability of the data given the parameters.

• This also illuminates the relation to the maximum likelihood estimate:

• Suppose p(D| θ ) reaches a sharp peak at .

dDppDp )|()|()|( xx

dppppDp

)|()|()|(

DD

n

kpDp1()|( kx

θθ ˆ


Comparison to Maximum Likelihood

• This also illuminates the relation to the maximum likelihood estimate:

Suppose p(D| θ ) reaches a sharp peak at .

p(θ| D) will also peak at the same place if p(θ) is well-behaved.

p(x|D) will be approximately , which is the ML result.

If the peak of p(D| θ ) is very sharp, then the influence of prior information on the uncertainty of the true value of θ can be ignored.

However, the Bayes solution tells us how to use all of the available information to compute the desired density p(x|D).

θθ ˆ

|xp


Recursive Bayes Incremental Learning

• To indicate explicitly the dependence on the number of samples, let:

• We can then write our expression for p(D| θ ):

where .

• We can write the posterior density using a recursive relation:

• where .

• This is called the Recursive Bayes Incremental Learning because we have a method for incrementally updating our estimates.

dDpxpDpxp

dpDppDpDp

nn

nn

n

nn

)|()|()|()|(

)|()|()|(

1

1

)()|( 0 pDp

nnD xxx ,...,, 21

)|()|()|( 1 nn

n DppDp x

)()|( 0 pDp


When do ML and Bayesian Estimation Differ?

• For infinite amounts of data, the solutions converge. However, limited data is always a problem.

• If prior information is reliable, a Bayesian estimate can be superior.

• Bayesian estimates for uniform priors are similar to an ML solution.

• If p(θ| D) is broad or asymmetric around the true value, the approaches are likely to produce different solutions.

• When designing a classifier using these techniques, there are three sources of error:

Bayes Error: the error due to overlapping distributions

Model Error: the error due to an incorrect model or incorrect assumption about the parametric form.

Estimation Error: the error arising from the fact that the parameters are estimated from a finite amount of data.


Noninformative Priors and Invariance

• The information about the prior is based on the designer’s knowledge of the problem domain.

• We expect the prior distributions to be “translation and scale invariance” – they should not depend on the actual value of the parameter.

• A prior that satisfies this property is referred to as a “noninformative prior”:

The Bayesian approach remains applicable even when little or no prior information is available.

Such situations can be handled by choosing a prior density giving equal weight to all possible values of θ.

Priors that seemingly impart no prior preference, the so-called noninformative priors, also arise when the prior is required to be invariant under certain transformations.

Frequently, the desire to treat all possible values of θ equitably leads to priors with infinite mass. Such noninformative priors are called improper priors.


Example of Noninformative Priors

• For example, if we assume the prior distribution of a mean of a continuous

random variable is independent of the choice of the origin, the only prior

that could satisfy this is a uniform distribution (which isn’t possible).

• Consider a parameter σ, and a transformation of this variable:

new variable, . Suppose we also scale by a positive constant:

. A noninformative prior on σ is the inverse distribution

p(σ) = 1/ σ, which is also improper.

lnln)ln(ˆ aa

)ln(~


• Direct computation of p(D|θ) and p(θ|D) for large data sets is challenging(e.g. neural networks)

• We need a parametric form for p(x|θ) (e.g., Gaussian)

• Gaussian case: computation of the sample mean and covariance, which was straightforward, contained all the information relevant to estimating the unknown population mean and covariance.

• This property exists for other distributions.

• A sufficient statistic is a function s of the samples D that contains all the information relevant to a parameter, θ.

• A statistic, s, is said to be sufficient for θ if p(D|s,θ) is independent of θ:

)|()|(

)|(),|(),|( spsDp

spsDpDsp

Sufficient Statistics


• Theorem: A statistic, s, is sufficient for θ, if and only if p(D|θ) can be written as: .

• There are many ways to formulate sufficient statistics(e.g., define a vector of the samples themselves).

• Useful only when the function g() and the sufficient statistic are simple(e.g., sample mean calculation).

• The factoring of p(D|θ) is not unique:

)(),()|( DhsgDp

)sfDhDhsgsfsg (/)()(),()(),(

• Define a kernel density invariant to scaling:

• Significance: most practical applications of parameter estimation involve simple sufficient statistics and simple kernel densities.

dsg

sgsg),(),(),(~

The Factorization Theorem


• This isolates the θ dependence in the first term, and hence, the sample mean

is a sufficient statistic using the Factorization Theorem.

• The kernel is:

nt

ndn ng

~1~

21exp

21,~~ 1

2/12/

Gaussian Distributions

n

kk

tkd

n

kk

tt

n

k

tkk

ttd

kt

kn

k d

n

Dp

1

12/12/

1

11

1

1112/12/

1

1 2/12/

21exp

21

2exp

21exp

21

21exp

21)(

xx

x

xx

xx


• This can be generalized: )]()()(exp[)|( cbaxx tp

and: )(),(])()()(exp[)|(1

DhgnDpn

kkk

t s

xxcban

1k

• Examples:

The Exponential Family


Summary• Introduction of Bayesian parameter estimation.• The role of the class-conditional distribution in a Bayesian estimate.• Estimation of the posterior and probability density function assuming the

only unknown parameter is the mean, and the conditional density of the “features” given the mean, p(x|θ), can be modeled as a Gaussian distribution.

• Bayesian estimates of the mean for the multivariate Gaussian case.• General theory for Bayesian estimation.• Comparison to maximum likelihood estimates.• Recursive Bayesian incremental learning.• Noninformative priors.• Sufficient statistics• Kernel density.


• Getting ahead a bit, let’s see how we can put these ideas to work on a simple example due to David MacKay, and explained by Jon Hamaker.

“The Euro Coin”

http://www.inference.phy.cam.ac.uk/mackay/abstracts/euro.html

http://www.isip.piconepress.com/publications/presentations_misc/2002/isip/euro_coin/

LECTURE 05: BAYESIAN ESTIMATION

Documents

Transcript of LECTURE 05: BAYESIAN ESTIMATION