Bayesian parameter estimation
Nuno Vasconcelos, UCSD
1
Bayesian parameter estimation
• the main difference with respect to ML is that in the Bayesian case Θ is a random variable
• basic concepts
– training set D = {x_1, ..., x_n} of examples drawn independently
– probability density for observations given parameter
$$P_{X|\Theta}(x|\theta)$$
– prior distribution for parameter configurations, which encodes prior beliefs about them
$$P_\Theta(\theta)$$
• goal: to compute the posterior distribution
$$P_{\Theta|X}(\theta|D)$$
2
Bayes vs ML
• there are a number of significant differences between Bayesian and ML estimates
• D1:
– ML produces a number, the best estimate
– to measure its goodness we need to measure bias and variance
– this can only be done with repeated experiments
– Bayes produces a complete characterization of the parameter from the single dataset
– in addition to the most probable estimate, we obtain a characterization of the uncertainty
[figure: two posteriors, a peaked one (lower uncertainty) and a broad one (higher uncertainty)]
3
Bayes vs ML
• D2: optimal estimate
– under ML there is one “best” estimate
– under Bayes there is no “best” estimate
– only a random variable that takes different values with different probabilities
– technically speaking, it makes no sense to talk about the “best” estimate
• D3: predictions
– remember that we do not really care about the parameters themselves
– they are needed only in the sense that they allow us to build models
– that can be used to make predictions (e.g. the BDR)
– unlike ML, Bayes uses ALL information in the training set to make predictions
4
Bayes vs ML
• let’s consider the BDR under the “0-1” loss and an independent sample D = {x_1, ..., x_n}
• ML-BDR:
– pick i if
$$i^*(x) = \arg\max_i P_{X|Y}(x|i; \theta_i^*)\, P_Y(i), \quad \text{where} \quad \theta_i^* = \arg\max_\theta P_{X|Y}(D|i, \theta)$$
• two steps:
– i) find θ*
– ii) plug into the BDR
• all information not captured by θ* is lost, not used at decision time
5
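The two steps above can be sketched numerically. A minimal sketch, assuming two Gaussian classes with a known shared variance and equal priors; the class means, sample sizes, and seed are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
D0 = rng.normal(-1.0, sigma, size=50)  # assumed training sample for class 0
D1 = rng.normal(+1.0, sigma, size=50)  # assumed training sample for class 1

# step i) find theta*: the ML estimate of each class parameter (here, the mean)
theta0, theta1 = D0.mean(), D1.mean()

# step ii) plug theta* into the BDR; with equal priors P_Y(i) = 1/2,
# picking the largest posterior reduces to comparing plugged-in likelihoods
def ml_bdr(x):
    p0 = np.exp(-(x - theta0) ** 2 / (2 * sigma ** 2))
    p1 = np.exp(-(x - theta1) ** 2 / (2 * sigma ** 2))
    return 0 if p0 > p1 else 1
```

Note that only `theta0` and `theta1` survive into the decision rule; everything else the training set could say (e.g. how uncertain those estimates are) is discarded, which is the point made above.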
Bayesian BDR
• this problem is avoided by Bayesian estimates
– pick i if
$$i^*(x) = \arg\max_i P_{X|Y,T}(x|i, D)\, P_Y(i)$$
$$\text{where} \quad P_{X|Y,T}(x|i, D) = \int P_{X|Y,\Theta}(x|i, \theta)\, P_{\Theta|Y,T}(\theta|i, D)\, d\theta$$
• note:
– as before, the bottom equation is repeated for each class
– hence, we can drop the dependence on the class
– and consider the more general problem of estimating
$$P_{X|T}(x|D) = \int P_{X|\Theta}(x|\theta)\, P_{\Theta|T}(\theta|D)\, d\theta$$
6
The predictive distribution
7
• the distribution
$$P_{X|T}(x|D) = \int P_{X|\Theta}(x|\theta)\, P_{\Theta|T}(\theta|D)\, d\theta$$
is known as the predictive distribution
• this follows from the fact that it allows us
– to predict the value of x
– given ALL the information available in the training set
• note that it can also be written as
$$P_{X|T}(x|D) = E_{\Theta|T}\left[ P_{X|\Theta}(x|\theta) \mid D \right]$$
– since each parameter value defines a model
– this is an expectation over all possible models
– each model is weighted by its posterior probability, given the training data
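The average-over-models view can be checked numerically. A sketch, assuming a Gaussian observation model N(θ, 1) and a Gaussian posterior over θ; the posterior parameters and integration grid are illustrative assumptions:

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

mu_post, var_post = 0.5, 0.04     # assumed posterior P_{Theta|T}(theta|D)
theta = np.linspace(-5, 5, 2001)  # grid over models for numerical integration
dtheta = theta[1] - theta[0]

def predictive(x):
    # P_{X|T}(x|D) = integral of P_{X|Theta}(x|theta) P_{Theta|T}(theta|D) dtheta:
    # each model N(theta, 1) is weighted by its posterior probability
    return float(np.sum(gauss(x, theta, 1.0) * gauss(theta, mu_post, var_post)) * dtheta)
```

For this Gaussian case the integral has the closed form N(µ, 1 + σ²), which the numerical average reproduces to within grid error.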
The predictive distribution
8
• suppose that
$$P_{X|\Theta}(x|\theta) \sim \mathcal{N}(\theta, 1) \quad \text{and} \quad P_{\Theta|T}(\theta|D) \sim \mathcal{N}(\mu, \sigma^2)$$
• the predictive distribution is an average of all these Gaussians
$$P_{X|T}(x|D) = \int P_{X|\Theta}(x|\theta)\, P_{\Theta|T}(\theta|D)\, d\theta$$
[figure: posterior $P_{\Theta|T}(\theta|D)$ over θ, with components at $\mu_1, \mu_2$ carrying weights $\pi_1, \pi_2$, and the resulting predictive $P_{X|T}(x|D)$ over x as the weighted average of the corresponding Gaussians]
The predictive distribution
• Bayes vs ML
– ML: pick one model
– Bayes: average all models
• are Bayesian predictions very different from those of ML?
– they can be, unless the posterior $P_{\Theta|T}(\theta|D)$ is narrow
[figure: a posterior sharply peaked at $\theta_{max}$ (Bayes ~ ML) vs a broad posterior (very different)]
9
MAP approximation
• this sounds good, why use ML at all?
• the main problem with Bayes is that the integral
$$P_{X|T}(x|D) = \int P_{X|\Theta}(x|\theta)\, P_{\Theta|T}(\theta|D)\, d\theta$$
can be quite nasty
• in practice one is frequently forced to use approximations
• one possibility is to do something similar to ML, i.e. pick only one model
• this can be made to account for the prior by
– picking the model that has the largest posterior probability given the training data
$$\theta_{MAP} = \arg\max_\theta P_{\Theta|T}(\theta|D)$$
10
MAP approximation
• this can usually be computed since
$$\theta_{MAP} = \arg\max_\theta P_{\Theta|T}(\theta|D) = \arg\max_\theta P_{T|\Theta}(D|\theta)\, P_\Theta(\theta)$$
and corresponds to approximating the posterior by a delta function centered at its maximum
[figure: posterior $P_{\Theta|T}(\theta|D)$ and its delta-function approximation centered at $\theta_{MAP}$]
11
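The gap between the MAP plug-in and the full integral can be seen numerically. A sketch using a deliberately wide Gaussian posterior; the Gaussian model, the posterior parameters, and the grid are illustrative assumptions:

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

mu_post, var_post = 0.0, 4.0      # assumed, deliberately wide posterior over theta
theta = np.linspace(-20, 20, 4001)
dtheta = theta[1] - theta[0]

# full Bayes: integrate the model N(theta, 1) against the posterior
full_bayes = float(np.sum(gauss(0.0, theta, 1.0)
                          * gauss(theta, mu_post, var_post)) * dtheta)

# MAP approximation: replace the posterior by a delta at its maximum
theta_map = mu_post               # the mode of a Gaussian posterior is its mean
map_plugin = gauss(0.0, theta_map, 1.0)
```

With this wide posterior the plug-in density at x = 0 overshoots the full predictive by more than a factor of two; as the posterior narrows toward a delta, the two agree.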
MAP vs ML
• ML-BDR
– pick i if
$$i^*(x) = \arg\max_i P_{X|Y}(x|i; \theta_i^*)\, P_Y(i), \quad \text{where} \quad \theta_i^* = \arg\max_\theta P_{X|Y}(D|i, \theta)$$
• Bayes MAP-BDR
– pick i if
$$i^*(x) = \arg\max_i P_{X|Y}(x|i; \theta_i^{MAP})\, P_Y(i), \quad \text{where} \quad \theta_i^{MAP} = \arg\max_\theta P_{T|Y,\Theta}(D|i, \theta)\, P_\Theta(\theta)$$
– the difference is non-negligible only when the dataset is small
• there are better alternative approximations
12
Example
• let’s consider an example of why Bayes is useful
• example: communications
– a bit is transmitted by a source, corrupted by noise, and received by a decoder
[figure: Y → channel → X]
– Q: what should the optimal decoder do to recover Y?
13
Example
• the optimal solution is to threshold X
– pick T
– decision rule
$$Y = \begin{cases} 0, & \text{if } x < T \\ 1, & \text{if } x > T \end{cases}$$
– what is the threshold?
– the midpoint between signal values
$$T = \frac{\mu_0 + \mu_1}{2}$$
14
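The decision rule above is a one-liner. A sketch, with illustrative values for the two signal means:

```python
mu0, mu1 = 0.0, 2.0          # assumed signal means for bits 0 and 1

T = (mu0 + mu1) / 2          # threshold at the midpoint between signal values

def decode(x):
    # Y = 0 if x < T, Y = 1 if x > T
    return 0 if x < T else 1
```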
Example
• today we consider a slight variation
• still:
– two states:
Y = 0: transmit signal s = −µ0
Y = 1: transmit signal s = µ0
– same noise model
$$X = s + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$
[figure: transmitter → atmosphere → receiver, Y in, X out]
15
Example
• the BDR is still
– pick “0” if
$$x < \frac{\mu_0 + (-\mu_0)}{2} = 0$$
– this is optimal and everything works wonderfully
– one day we get a phone call: the receiver is generating a lot of errors!
– something must have changed in the rover
– there is no way to go to Mars and check
• goal: to do as best as possible with the info that we have at X and our knowledge of the system
16
Example
• what we know:
– the received signal is Gaussian, with the same variance σ², but the means have changed
– there is a calibration mode: the rover can send a test sequence, but it is expensive and can only send a few bits
– if everything is normal, the received means should be µ0 and −µ0
• action:
– ask the system to transmit a few 1s and measure X
– compute the ML estimate of the mean of X
$$\hat{\mu} = \frac{1}{n} \sum_i x_i$$
• result: the estimate is different from µ0
17
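The calibration step amounts to a sample average. A sketch with made-up numbers for the shifted mean, noise level, and number of calibration bits:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_shifted, sigma = 0.8, 0.5            # assumed true (changed) mean and noise std
X = rng.normal(mu_shifted, sigma, 10)   # only a few calibration bits are affordable

mu_hat = X.mean()                       # ML estimate: (1/n) * sum_i x_i
```

With so few bits the estimate is noisy, which is exactly why the slides go on to blend it with the prior belief.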
Example
• we need to combine two forms of information
– our prior is that
$$X \sim \mathcal{N}(\mu_0, \sigma^2)$$
– our “data driven” estimate is that
$$X \sim \mathcal{N}(\hat{\mu}, \sigma^2)$$
• Q: what do we do?
– $\mu_n = f(\hat{\mu}, \mu_0, n)$
– for large n, $\mu_n \approx f(\hat{\mu})$
– for small n, $\mu_n \approx f(\mu_0)$
– intuitive combination
$$\mu_n = \alpha_n \hat{\mu} + (1 - \alpha_n)\, \mu_0, \quad \alpha_n \in [0,1], \quad \alpha_n \xrightarrow{n \to \infty} 1, \quad \alpha_n \xrightarrow{n \to 0} 0$$
18
Bayesian solution
• Gaussian likelihood (observations)
$$P_{T|\mu}(D|\mu) = G(D, \mu, \sigma^2), \quad \sigma^2 \text{ known}$$
• Gaussian prior (what we know)
$$P_\mu(\mu) = G(\mu, \mu_0, \sigma_0^2)$$
– µ0, σ0² are known hyper-parameters
• we need to compute
– the posterior distribution for µ
$$P_{\mu|T}(\mu|D) = \frac{P_{T|\mu}(D|\mu)\, P_\mu(\mu)}{P_T(D)}$$
19
Bayesian solution
• posterior distribution
$$P_{\mu|T}(\mu|D) = \frac{P_{T|\mu}(D|\mu)\, P_\mu(\mu)}{P_T(D)}$$
• note that
– this is a probability density
– we can ignore constants (terms that do not depend on µ)
– and normalize when we are done
• we only need to work with
$$P_{\mu|T}(\mu|D) \propto P_{T|\mu}(D|\mu)\, P_\mu(\mu) \propto \prod_i P_{X|\mu}(x_i|\mu)\, P_\mu(\mu)$$
20
Bayesian solution
• plugging in the Gaussians
$$\begin{aligned}
P_{\mu|T}(\mu|D) &\propto \prod_i P_{X|\mu}(x_i|\mu)\, P_\mu(\mu) \\
&\propto \prod_i G(x_i, \mu, \sigma^2)\, G(\mu, \mu_0, \sigma_0^2) \\
&\propto \exp\left\{ -\sum_i \frac{(x_i - \mu)^2}{2\sigma^2} - \frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right\} \\
&\propto \exp\left\{ -\frac{1}{2}\left[ \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right)\mu^2 - 2\left( \sum_i \frac{x_i}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \right)\mu \right] \right\}
\end{aligned}$$
21
Bayesian solution
• this is a Gaussian; we just need to put it in the standard quadratic form to know its mean and variance
$$P_{\mu|T}(\mu|D) \propto \exp\left\{ -\frac{1}{2}\left[ \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right)\mu^2 - 2\left( \sum_i \frac{x_i}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \right)\mu \right] \right\}$$
• use the completing-the-squares trick
$$a x^2 + b x + c = a\left( x + \frac{b}{2a} \right)^2 + c - \frac{b^2}{4a}$$
22
Bayesian solution
• in this case
$$P_{\mu|T}(\mu|D) \propto \exp\left\{ -\frac{1}{2}\left[ \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right)\mu^2 - 2\left( \sum_i \frac{x_i}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \right)\mu \right] \right\}$$
and, dropping the constant term,
$$a x^2 + b x + c = a\left( x + \frac{b}{2a} \right)^2 + c - \frac{b^2}{4a} \propto a\left( x + \frac{b}{2a} \right)^2$$
• we have
$$P_{\mu|T}(\mu|D) \propto \exp\left\{ -\frac{1}{2}\left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right)\left[ \mu - \frac{ \sum_i \frac{x_i}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} }{ \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} } \right]^2 \right\}$$
23
Bayesian solution
• and using
$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \quad \Leftrightarrow \quad \sigma_n^2 = \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right)^{-1}$$
• we have
$$P_{\mu|T}(\mu|D) \propto \exp\left\{ -\frac{1}{2}\left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right)\left[ \mu - \sigma_n^2 \left( \sum_i \frac{x_i}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \right) \right]^2 \right\} \propto \exp\left\{ -\frac{(\mu - \mu_n)^2}{2\sigma_n^2} \right\}$$
• and
$$P_{\mu|T}(\mu|D) = G(\mu, \mu_n, \sigma_n^2), \quad \mu_n = \sigma_n^2 \left( \sum_i \frac{x_i}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \right) = \frac{\sigma_0^2 \sum_i x_i + \sigma^2 \mu_0}{n \sigma_0^2 + \sigma^2}$$
24
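The closed-form update above is two lines of code. A sketch, where the data and hyper-parameter values are illustrative:

```python
import numpy as np

def posterior_params(x, mu0, var0, var):
    """Posterior G(mu, mu_n, var_n) for a Gaussian mean with known variance var."""
    n = len(x)
    var_n = var0 * var / (n * var0 + var)                     # sigma_n^2
    mu_n = (var0 * np.sum(x) + var * mu0) / (n * var0 + var)  # mu_n
    return mu_n, var_n

x = np.array([0.9, 1.1, 1.3, 0.7])  # illustrative observations, mu_ML = 1.0
mu_n, var_n = posterior_params(x, mu0=0.0, var0=1.0, var=0.25)
```

`mu_n` is a convex combination of the ML estimate and µ0, and the precisions add up as 1/σn² = n/σ² + 1/σ0², which the following slides spell out.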
Bayesian solution
• this can be rewritten as
$$P_{\mu|T}(\mu|D) = G(\mu, \mu_n, \sigma_n^2)$$
$$\mu_n = \underbrace{\frac{n \sigma_0^2}{n \sigma_0^2 + \sigma^2}}_{\alpha_n}\, \mu_{ML} + \underbrace{\frac{\sigma^2}{n \sigma_0^2 + \sigma^2}}_{1 - \alpha_n}\, \mu_0$$
$$\sigma_n^2 = \frac{\sigma_0^2 \sigma^2}{n \sigma_0^2 + \sigma^2} \quad \Leftrightarrow \quad \frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}$$
• we can compare with our “intuitive” solution
25
Bayesian solution
• we had
$$\mu_n = \alpha_n \hat{\mu} + (1 - \alpha_n)\, \mu_0, \quad \alpha_n \in [0,1], \quad \alpha_n \xrightarrow{n \to \infty} 1, \quad \alpha_n \xrightarrow{n \to 0} 0$$
• the Bayesian solution is
$$\mu_n = \underbrace{\frac{n \sigma_0^2}{n \sigma_0^2 + \sigma^2}}_{\alpha_n}\, \mu_{ML} + \underbrace{\frac{\sigma^2}{n \sigma_0^2 + \sigma^2}}_{1 - \alpha_n}\, \mu_0$$
• note that
$$\alpha_n \in [0,1], \quad \alpha_n \xrightarrow{n \to \infty} 1, \quad \alpha_n \xrightarrow{n \to 0} 0$$
• it is exactly the same as our heuristic
26
Bayesian solution
• for free, Bayes also gives us
– the weighting constants
$$\alpha_n = \frac{n \sigma_0^2}{n \sigma_0^2 + \sigma^2}$$
– a measure of the uncertainty of our estimate
$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}$$
– note that 1/σ² is a measure of precision
– this should be read as
P_Bayes = P_ML + P_prior
– the Bayesian precision is greater than both that of ML and that of the prior
27
Observations
– 1) note that precision increases with n; the variance goes to zero
$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}$$
we are guaranteed that in the limit of infinite data we have convergence to a single estimate
– 2) for large n the likelihood term dominates the prior term
$$\mu_n = \alpha_n \hat{\mu} + (1 - \alpha_n)\, \mu_0, \quad \alpha_n \in [0,1], \quad \alpha_n \xrightarrow{n \to \infty} 1, \quad \alpha_n \xrightarrow{n \to 0} 0$$
the solution is equivalent to that of ML
– for small n, the prior dominates
– this always happens for Bayesian solutions
$$P_{\mu|T}(\mu|D) \propto \prod_i P_{X|\mu}(x_i|\mu)\, P_\mu(\mu)$$
28
Observations
– 3) for a given n
if σ0² >> σ², i.e. we really don’t know what µ is a priori, then µn = µ_ML
– on the other hand, if σ0² << σ², i.e. we are very certain a priori, then µn = µ0
$$\mu_n = \alpha_n \hat{\mu} + (1 - \alpha_n)\, \mu_0, \quad \alpha_n = \frac{n \sigma_0^2}{n \sigma_0^2 + \sigma^2}$$
• in summary,
– the Bayesian estimate combines the prior beliefs with the evidence provided by the data
– in a very intuitive manner
29
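The two regimes above can be sketched by pushing σ0² to extremes; the data and all numeric values are illustrative:

```python
import numpy as np

def mu_n(x, mu0, var0, var):
    # posterior mean: (sigma_0^2 * sum_i x_i + sigma^2 * mu_0) / (n sigma_0^2 + sigma^2)
    return (var0 * np.sum(x) + var * mu0) / (len(x) * var0 + var)

x = np.array([2.0, 2.2, 1.8])                  # mu_ML = 2.0
mu0 = 0.0

vague = mu_n(x, mu0, var0=1e6, var=1.0)        # sigma_0^2 >> sigma^2: follows the data
confident = mu_n(x, mu0, var0=1e-6, var=1.0)   # sigma_0^2 << sigma^2: sticks to the prior
```

`vague` lands essentially on µ_ML = 2.0 and `confident` essentially on µ0 = 0.0, matching the two cases discussed above.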