11.5 MAP Estimator - Binghamton


Transcript of 11.5 MAP Estimator - Binghamton

Page 1

11.5 MAP Estimator

Recall that the "hit-or-miss" cost function gave the MAP estimator… it maximizes the a posteriori PDF.

Q: Given that the MMSE estimator is "the most natural" one… why would we consider the MAP estimator?

A: If x and θ are not jointly Gaussian, the form of the MMSE estimate requires integration to find the conditional mean. MAP avoids this computational problem!

Note: MAP doesn't require this integration. We trade the "natural criterion" for "computational ease."

What else do you gain? More flexibility to choose the prior PDF.

Page 2

Notation and Form for MAP

Notation:

$$\hat{\theta}_{MAP} = \arg\max_{\theta}\, p(\theta \mid \mathbf{x})$$

i.e., $\hat{\theta}_{MAP}$ maximizes the posterior PDF. "arg max" extracts the value of θ that achieves the maximum.

Equivalent Form (via Bayes' Rule):

$$\hat{\theta}_{MAP} = \arg\max_{\theta}\, \big[\, p(\mathbf{x} \mid \theta)\, p(\theta) \,\big]$$

Proof: Use

$$p(\theta \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \theta)\, p(\theta)}{p(\mathbf{x})}$$

so that

$$\hat{\theta}_{MAP} = \arg\max_{\theta}\, \frac{p(\mathbf{x} \mid \theta)\, p(\theta)}{p(\mathbf{x})} = \arg\max_{\theta}\, \big[\, p(\mathbf{x} \mid \theta)\, p(\theta) \,\big]$$

since the denominator $p(\mathbf{x})$ does not depend on θ.
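To make the computational point concrete, here is a minimal numerical sketch (not from the slides; the exponential prior, noise level, and observed value are assumptions): with a non-Gaussian prior, the MAP estimate needs only a maximization of $p(x \mid \theta)\,p(\theta)$, while the MMSE estimate needs integrations for the conditional mean.

```python
import numpy as np

# Assumed setup (not from the slides): one sample x = theta + w, w ~ N(0, sigma^2),
# with an exponential prior p(theta) = lam * exp(-lam * theta), theta >= 0.
sigma, lam, x = 1.0, 0.5, 2.0

theta = np.linspace(0.0, 20.0, 20001)                    # dense grid over theta
post = np.exp(-(x - theta) ** 2 / (2 * sigma**2)) * lam * np.exp(-lam * theta)

theta_map = theta[np.argmax(post)]                       # MAP: just a maximization
theta_mmse = np.trapz(theta * post, theta) / np.trapz(post, theta)  # MMSE: integration

print("MAP  =", round(theta_map, 3))   # posterior mode (analytically x - lam*sigma^2 = 1.5)
print("MMSE =", round(theta_mmse, 3))  # posterior mean; differs since the posterior is skewed
```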

Page 3

Vector MAP <Not as straightforward as the vector extension for MMSE>

The obvious extension leads to problems: choose $\hat{\theta}_i$ to minimize

$$\mathcal{R}(\hat{\theta}_i) = E\{C(\theta_i - \hat{\theta}_i)\}$$

with the expectation taken over $p(\mathbf{x}, \theta_i)$. This gives

$$\hat{\theta}_i = \arg\max_{\theta_i}\, p(\theta_i \mid \mathbf{x})$$

where $p(\theta_i \mid \mathbf{x})$ is the 1-D marginal conditioned on x, e.g.,

$$p(\theta_1 \mid \mathbf{x}) = \int \cdots \int p(\boldsymbol{\theta} \mid \mathbf{x})\, d\theta_2 \cdots d\theta_p$$

You need to integrate to get it!!

Problem: The whole point of MAP was to avoid the integration needed for MMSE!!!

Is there a way around this? Can we find an integration-free vector MAP?

Page 4

Circular Hit-or-Miss Cost Function <Not in Book>

First look at the p-dimensional cost function for this "troubling" version of a vector MAP: it consists of p individual applications of the 1-D "hit-or-miss" cost. For p = 2,

$$C(\varepsilon_1, \varepsilon_2) = \begin{cases} 0, & (\varepsilon_1, \varepsilon_2) \text{ in square} \\ 1, & (\varepsilon_1, \varepsilon_2) \text{ not in square} \end{cases}$$

where the "hit" region is the square $|\varepsilon_1| \le \delta$, $|\varepsilon_2| \le \delta$.

The corners of the square "let too much in" ⇒ use a circle!

$$C(\boldsymbol{\varepsilon}) = \begin{cases} 0, & \|\boldsymbol{\varepsilon}\| \le \delta \\ 1, & \|\boldsymbol{\varepsilon}\| > \delta \end{cases}$$

This actually seems more natural than the "square" cost function!!!

Page 5

MAP Estimate using Circular Hit-or-Miss <Back to Book>

So… what vector Bayesian estimator comes from using this circular hit-or-miss cost function? It can be shown to be the following "vector MAP":

$$\hat{\boldsymbol{\theta}}_{MAP} = \arg\max_{\boldsymbol{\theta}}\, p(\boldsymbol{\theta} \mid \mathbf{x})$$

Does Not Require Integration!!! That is… find the maximum of the joint conditional PDF over all the θ_i jointly, conditioned on x.

Page 6

How Do These Vector MAP Versions Compare? In general: they are NOT the same!!

Example: p = 2. [Figure: the joint posterior $p(\theta_1, \theta_2 \mid \mathbf{x})$ shown as three uniform rectangular blocks of heights 1/6, 1/3, and 1/6 over the region $1 \le \theta_1 \le 5$, $0 \le \theta_2 \le 2$.]

The vector MAP using the circular hit-or-miss is $\hat{\boldsymbol{\theta}} = [2.5 \;\; 0.5]^T$ (the location of the tallest block).

To find the vector MAP using element-wise maximization, form the marginals. [Figure: $p(\theta_1 \mid \mathbf{x})$ takes levels 1/6 and 1/3 over $1 \le \theta_1 \le 5$; $p(\theta_2 \mid \mathbf{x})$ takes levels 1/3 and 2/3 over $0 \le \theta_2 \le 2$.] Their peaks give

$$\hat{\boldsymbol{\theta}} = [2.5 \;\; 1.5]^T$$

which differs from the circular hit-or-miss answer in the second component.
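A quick numerical check of this example. The exact placement of the three blocks is not fully recoverable from the slide, so the layout below is an assumption chosen to reproduce the slide's two answers: a block of height 1/3 on $2 \le \theta_1 \le 3$, $0 \le \theta_2 \le 1$, and blocks of height 1/6 on $1 \le \theta_1 \le 3$ and $3 \le \theta_1 \le 5$, each with $1 \le \theta_2 \le 2$. Because the blocks are flat, the maximizer is a set; the code reports the center of that set.

```python
import numpy as np

d = 0.01
t1 = np.arange(0.0, 5.0, d) + d / 2          # theta_1 grid (cell midpoints)
t2 = np.arange(0.0, 2.0, d) + d / 2          # theta_2 grid (cell midpoints)
T1, T2 = np.meshgrid(t1, t2, indexing="ij")

p = np.zeros_like(T1)                        # assumed block layout (see lead-in)
p[(T1 > 2) & (T1 < 3) & (T2 < 1)] = 1 / 3
p[(T1 > 1) & (T1 < 3) & (T2 > 1)] = 1 / 6
p[(T1 > 3) & (T2 > 1)] = 1 / 6

# Vector MAP from the circular hit-or-miss cost: joint maximizer of p(th1, th2 | x)
mask = p == p.max()
print("joint MAP   :", T1[mask].mean(), T2[mask].mean())        # (2.5, 0.5)

# Element-wise vector MAP: maximize each 1-D marginal (this is where integration sneaks in)
p1 = p.sum(axis=1) * d                                          # p(th1 | x)
p2 = p.sum(axis=0) * d                                          # p(th2 | x)
print("marginal MAP:", t1[p1 == p1.max()].mean(), t2[p2 == p2.max()].mean())  # (2.5, 1.5)
```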

Page 7

"Bayesian MLE"

Recall… as we keep getting good data, $p(\theta \mid \mathbf{x})$ becomes more concentrated as a function of θ. But since

$$\hat{\theta}_{MAP} = \arg\max_{\theta}\, p(\theta \mid \mathbf{x}) = \arg\max_{\theta}\, \big[\, p(\mathbf{x} \mid \theta)\, p(\theta) \,\big]$$

… $p(\mathbf{x} \mid \theta)$ should also become more concentrated as a function of θ.

[Figure: $p(\mathbf{x} \mid \theta)$ sharply peaked in θ, with $p(\theta)$ slowly varying across the peak.]

• Note that the prior PDF is nearly constant where $p(\mathbf{x} \mid \theta)$ is non-zero.
• This becomes truer as N → ∞, as $p(\mathbf{x} \mid \theta)$ gets more concentrated.

$$\arg\max_{\theta}\, \big[\, p(\mathbf{x} \mid \theta)\, p(\theta) \,\big] \approx \arg\max_{\theta}\, p(\mathbf{x} \mid \theta)$$

The left side is the MAP estimate; the right side is a "Bayesian MLE," which uses the conditional PDF $p(\mathbf{x} \mid \theta)$ rather than the parameterized PDF $p(\mathbf{x}; \theta)$.
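A small demonstration of this convergence (a sketch; the parameter values are assumptions), using the DC-level-in-WGN setup of the upcoming example, where both estimates have closed forms. Here the posterior is Gaussian, so the MAP estimate coincides with the MMSE estimate, and it approaches the MLE $\bar{x}$ as N grows.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_A, sig_A, sig = 1.0, 2.0, 1.0                 # assumed prior mean/std and noise std
A = rng.normal(mu_A, sig_A)                      # one realization of the parameter

for N in [1, 10, 100, 10000]:
    x = A + rng.normal(0.0, sig, N)              # x[n] = A + w[n]
    xbar = x.mean()                              # MLE
    alpha = sig_A**2 / (sig_A**2 + sig**2 / N)   # weight -> 1 as N -> infinity
    A_map = alpha * xbar + (1 - alpha) * mu_A    # MAP (= MMSE: Gaussian posterior)
    print(f"N={N:6d}  MLE={xbar:7.4f}  MAP={A_map:7.4f}  diff={abs(xbar - A_map):.2e}")
```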

Page 8

11.6 Performance Characterization

The performance of Bayesian estimators is characterized by looking at the estimation error:

$$\varepsilon = \theta - \hat{\theta}$$

The error is random for two reasons: θ is random (due to the a priori PDF) and $\hat{\theta}$ is random (due to x).

Performance is characterized by the error's PDF $p(\varepsilon)$. We'll focus on its mean and variance. If ε is Gaussian, then these tell the whole story; this will be the case for the Bayesian Linear Model (see Thm. 10.3).

We'll also concentrate on the MMSE estimator.

Page 9

Performance of Scalar MMSE Estimator

∫==

θθθ

θθ

dp

E

)|(

}|{ˆ

x

xThe estimator is:

Function of x

So the estimation error is: ),(}|{ θθθε xx fE =−=Function of two RV’s

General Result for a function of two RVs: Z = f (X, Y)

dydxyxpyxfZE XY ),(),(}{ ∫∫=

{ } dydxyxpZEyxfZEZEZ XY ),(}){),((}){(}var{ 22 ∫∫ −=−=

Page 10

So… applying the mean result gives:

$$E\{\varepsilon\} = E_{\mathbf{x},\theta}\big\{\theta - E\{\theta \mid \mathbf{x}\}\big\} = E_{\mathbf{x}}\Big\{ E_{\theta \mid \mathbf{x}}\big\{\theta - E\{\theta \mid \mathbf{x}\}\big\} \Big\}$$

(see the chart on "Decomposing Joint Expectations" in "Notes on 2 RVs"). Passing $E_{\theta \mid \mathbf{x}}$ through the terms:

$$E\{\varepsilon\} = E_{\mathbf{x}}\Big\{ E_{\theta \mid \mathbf{x}}\{\theta\} - E_{\theta \mid \mathbf{x}}\big\{E\{\theta \mid \mathbf{x}\}\big\} \Big\} = E_{\mathbf{x}}\Big\{ E\{\theta \mid \mathbf{x}\} - E\{\theta \mid \mathbf{x}\} \Big\} = E_{\mathbf{x}}\{0\} = 0$$

Here $E_{\theta \mid \mathbf{x}}\{\theta\}$ and $E\{\theta \mid \mathbf{x}\}$ are two notations for the same thing, and the second term is evaluated as

$$E_{\theta \mid \mathbf{x}}\big\{E\{\theta \mid \mathbf{x}\}\big\} = \int E\{\theta \mid \mathbf{x}\}\, p(\theta \mid \mathbf{x})\, d\theta = E\{\theta \mid \mathbf{x}\} \int p(\theta \mid \mathbf{x})\, d\theta = E\{\theta \mid \mathbf{x}\}$$

since $E\{\theta \mid \mathbf{x}\}$ does not depend on θ.

$$E\{\varepsilon\} = 0$$

i.e., the mean of the estimation error (over data & parameter) is zero!!!!

Page 11

And… applying the variance result gives (using E{ε} = 0):

$$\mathrm{var}\{\varepsilon\} = E\big\{(\varepsilon - E\{\varepsilon\})^2\big\} = E\{\varepsilon^2\} = E_{\mathbf{x},\theta}\big\{(\theta - \hat{\theta})^2\big\} = \iint (\theta - \hat{\theta})^2\, p(\mathbf{x}, \theta)\, d\theta\, d\mathbf{x} = Bmse(\hat{\theta})$$

So… the MMSE estimation error has:

• mean = 0
• var = Bmse

So… when we minimize the Bmse we are minimizing the variance of the estimation error.

If ε is Gaussian then $\varepsilon \sim \mathcal{N}\big(0,\, Bmse(\hat{\theta})\big)$.

Page 12

Ex. 11.6: DC Level in WGN w/ Gaussian Prior

We saw that

$$Bmse(\hat{A}) = \frac{1}{N/\sigma^2 + 1/\sigma_A^2}$$

with

$$\hat{A} = \underbrace{\frac{\sigma_A^2}{\sigma_A^2 + \sigma^2/N}}_{\text{constant}}\, \bar{x} + \underbrace{\frac{\sigma^2/N}{\sigma_A^2 + \sigma^2/N}}_{\text{constant}}\, \mu_A$$

So… the error $\varepsilon = A - \hat{A}$ is Gaussian, because $\hat{A}$ is a linear combination of the jointly Gaussian data samples (if X is Gaussian then Y = aX + b is also Gaussian):

$$\varepsilon \sim \mathcal{N}\!\left(0,\; \frac{1}{N/\sigma^2 + 1/\sigma_A^2}\right)$$

Note: As N gets large this PDF collapses around 0. This estimate is "consistent in the Bayesian sense."

Bayesian consistency: for large N, $\hat{A} \approx A$ (regardless of the realization of A!).
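A Monte Carlo sketch that checks both claims, error mean ≈ 0 and error variance ≈ $Bmse(\hat{A})$; the parameter values are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)
mu_A, sig_A, sig, N, trials = 0.5, 1.5, 1.0, 20, 200_000   # assumed values

A = rng.normal(mu_A, sig_A, trials)                        # draw A from the prior
xbar = A + rng.normal(0.0, sig, (N, trials)).mean(axis=0)  # sample mean of x[n] = A + w[n]
alpha = sig_A**2 / (sig_A**2 + sig**2 / N)
A_hat = alpha * xbar + (1 - alpha) * mu_A                  # MMSE estimate
err = A - A_hat

bmse = 1.0 / (N / sig**2 + 1.0 / sig_A**2)
print("mean(err)    =", err.mean())                        # ~ 0
print("var(err)     =", err.var())                         # ~ Bmse
print("Bmse(theory) =", bmse)
```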

Page 13

Performance of Vector MMSE Estimator

Vector estimation error: $\boldsymbol{\varepsilon} = \boldsymbol{\theta} - \hat{\boldsymbol{\theta}}$. The mean result is obvious: $E\{\boldsymbol{\varepsilon}\} = \mathbf{0}$.

Must extend the variance result:

$$\mathbf{C}_{\varepsilon} = \mathrm{cov}\{\boldsymbol{\varepsilon}\} = E_{\mathbf{x},\boldsymbol{\theta}}\{\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T\} \triangleq \mathbf{M}_{\hat{\boldsymbol{\theta}}}$$

Some new notation: $\mathbf{M}_{\hat{\boldsymbol{\theta}}}$ is the "Bayesian Mean Square Error Matrix."

Look some more at this (see the chart on "Decomposing Joint Expectations"):

$$\mathbf{M}_{\hat{\boldsymbol{\theta}}} = E_{\mathbf{x},\boldsymbol{\theta}}\Big\{ \big[\boldsymbol{\theta} - E\{\boldsymbol{\theta} \mid \mathbf{x}\}\big]\big[\boldsymbol{\theta} - E\{\boldsymbol{\theta} \mid \mathbf{x}\}\big]^T \Big\} = E_{\mathbf{x}}\Big\{ E_{\boldsymbol{\theta} \mid \mathbf{x}}\Big\{ \big[\boldsymbol{\theta} - E\{\boldsymbol{\theta} \mid \mathbf{x}\}\big]\big[\boldsymbol{\theta} - E\{\boldsymbol{\theta} \mid \mathbf{x}\}\big]^T \Big\} \Big\} = E_{\mathbf{x}}\{\mathbf{C}_{\theta \mid \mathbf{x}}\}$$

since the inner expectation is $\mathbf{C}_{\theta \mid \mathbf{x}}$, which in general is a function of x.

General vector results:

$$E\{\boldsymbol{\varepsilon}\} = \mathbf{0}, \qquad \mathbf{C}_{\varepsilon} = \mathbf{M}_{\hat{\boldsymbol{\theta}}} = E_{\mathbf{x}}\{\mathbf{C}_{\theta \mid \mathbf{x}}\}$$

Page 14

The Diagonal Elements of $\mathbf{M}_{\hat{\boldsymbol{\theta}}}$ are the Bmse's of the Estimates

Why do we call the error covariance the "Bayesian MSE Matrix"? To see this:

$$\big[\mathbf{M}_{\hat{\boldsymbol{\theta}}}\big]_{ii} = E\big\{[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T]_{ii}\big\} = \int \!\! \int \cdots \int \big(\theta_i - E\{\theta_i \mid \mathbf{x}\}\big)^2\, p(\mathbf{x}, \boldsymbol{\theta})\, d\theta_1 \cdots d\theta_p\, d\mathbf{x}$$

Integrating over all the other parameters ("marginalizing" the PDF) gives

$$\big[\mathbf{M}_{\hat{\boldsymbol{\theta}}}\big]_{ii} = \iint \big(\theta_i - E\{\theta_i \mid \mathbf{x}\}\big)^2\, p(\mathbf{x}, \theta_i)\, d\theta_i\, d\mathbf{x} = Bmse(\hat{\theta}_i)$$

Page 15

Perf. of MMSE Est. for Jointly Gaussian Case

Let the data vector x and the parameter vector θ be jointly Gaussian.

Nothing new to say about the mean result: $E\{\boldsymbol{\varepsilon}\} = \mathbf{0}$.

Now… look at the error covariance (i.e., the Bayesian MSE matrix). Recall the general result:

$$\mathbf{C}_{\varepsilon} = \mathbf{M}_{\hat{\boldsymbol{\theta}}} = E_{\mathbf{x}}\{\mathbf{C}_{\theta \mid \mathbf{x}}\}$$

Thm. 10.2 says that for jointly Gaussian vectors, $\mathbf{C}_{\theta \mid \mathbf{x}}$ does NOT depend on x, so

$$\mathbf{C}_{\varepsilon} = \mathbf{M}_{\hat{\boldsymbol{\theta}}} = E_{\mathbf{x}}\{\mathbf{C}_{\theta \mid \mathbf{x}}\} = \mathbf{C}_{\theta \mid \mathbf{x}}$$

Thm. 10.2 also gives the form as:

$$\mathbf{C}_{\varepsilon} = \mathbf{M}_{\hat{\boldsymbol{\theta}}} = \mathbf{C}_{\theta\theta} - \mathbf{C}_{\theta x}\mathbf{C}_{xx}^{-1}\mathbf{C}_{x\theta}$$

Page 16

Perf. of MMSE Est. for Bayesian Linear Model

Recall the model: $\mathbf{x} = \mathbf{H}\boldsymbol{\theta} + \mathbf{w}$, with $\boldsymbol{\theta} \sim \mathcal{N}(\boldsymbol{\mu}_\theta, \mathbf{C}_\theta)$ and $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{C}_w)$.

Nothing new to say about the mean result: $E\{\boldsymbol{\varepsilon}\} = \mathbf{0}$.

Now… for the error covariance… this is nothing more than a special case of the jointly Gaussian case we just saw:

$$\mathbf{C}_{\varepsilon} = \mathbf{M}_{\hat{\boldsymbol{\theta}}} = \mathbf{C}_{\theta\theta} - \mathbf{C}_{\theta x}\mathbf{C}_{xx}^{-1}\mathbf{C}_{x\theta}$$

Evaluating the covariances for the Bayesian linear model,

$$\mathbf{C}_{\theta x} = \mathbf{C}_{\theta}\mathbf{H}^T, \qquad \mathbf{C}_{xx} = \mathbf{H}\mathbf{C}_{\theta}\mathbf{H}^T + \mathbf{C}_w$$

gives

$$\mathbf{C}_{\varepsilon} = \mathbf{M}_{\hat{\boldsymbol{\theta}}} = \mathbf{C}_{\theta} - \mathbf{C}_{\theta}\mathbf{H}^T\big(\mathbf{H}\mathbf{C}_{\theta}\mathbf{H}^T + \mathbf{C}_w\big)^{-1}\mathbf{H}\mathbf{C}_{\theta} = \big(\mathbf{C}_{\theta}^{-1} + \mathbf{H}^T\mathbf{C}_w^{-1}\mathbf{H}\big)^{-1}$$

The alternate (second) form follows from the matrix inversion lemma… see (10.33).
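A quick numerical sanity check that the two forms agree, using random symmetric positive-definite matrices; this is purely illustrative, not from the slides.

```python
import numpy as np

# Check: C_th - C_th H^T (H C_th H^T + C_w)^{-1} H C_th == (C_th^{-1} + H^T C_w^{-1} H)^{-1}
rng = np.random.default_rng(2)
p, n = 3, 5
H = rng.normal(size=(n, p))
A = rng.normal(size=(p, p)); C_th = A @ A.T + np.eye(p)   # SPD prior covariance
B = rng.normal(size=(n, n)); C_w = B @ B.T + np.eye(n)    # SPD noise covariance

form1 = C_th - C_th @ H.T @ np.linalg.solve(H @ C_th @ H.T + C_w, H @ C_th)
form2 = np.linalg.inv(np.linalg.inv(C_th) + H.T @ np.linalg.solve(C_w, H))

print(np.allclose(form1, form2))   # True: the matrix inversion lemma at work
```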

Page 17

Summary of MMSE Est. Error Results

1. For all cases: the estimation error is zero mean, $E\{\boldsymbol{\varepsilon}\} = \mathbf{0}$.

2. Error covariance for three "nested" cases:

General case:

$$\mathbf{C}_{\varepsilon} = \mathbf{M}_{\hat{\boldsymbol{\theta}}} = E_{\mathbf{x}}\{\mathbf{C}_{\theta \mid \mathbf{x}}\}, \qquad \big[\mathbf{M}_{\hat{\boldsymbol{\theta}}}\big]_{ii} = Bmse(\hat{\theta}_i)$$

Jointly Gaussian:

$$\mathbf{C}_{\varepsilon} = \mathbf{M}_{\hat{\boldsymbol{\theta}}} = \mathbf{C}_{\theta \mid \mathbf{x}} = \mathbf{C}_{\theta\theta} - \mathbf{C}_{\theta x}\mathbf{C}_{xx}^{-1}\mathbf{C}_{x\theta}$$

Bayesian linear (jointly Gaussian & linear observation):

$$\mathbf{C}_{\varepsilon} = \mathbf{M}_{\hat{\boldsymbol{\theta}}} = \mathbf{C}_{\theta} - \mathbf{C}_{\theta}\mathbf{H}^T\big(\mathbf{H}\mathbf{C}_{\theta}\mathbf{H}^T + \mathbf{C}_w\big)^{-1}\mathbf{H}\mathbf{C}_{\theta} = \big(\mathbf{C}_{\theta}^{-1} + \mathbf{H}^T\mathbf{C}_w^{-1}\mathbf{H}\big)^{-1}$$

Page 18

Main Bayesian Approaches

• MAP ("hit-or-miss" cost function):
Estimate: $\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}}\, p(\boldsymbol{\theta} \mid \mathbf{x})$
Easy to implement… but performance analysis is challenging.

• MMSE ("squared" cost function; in general a nonlinear estimate):
Estimate: $\hat{\boldsymbol{\theta}} = E\{\boldsymbol{\theta} \mid \mathbf{x}\}$
Err. Cov.: $\mathbf{M}_{\hat{\boldsymbol{\theta}}} = E_{\mathbf{x}}\{\mathbf{C}_{\theta \mid \mathbf{x}}\}$
Hard to implement… requires numerical integration.

• Jointly Gaussian x and θ (yields a linear estimate):
Estimate: $\hat{\boldsymbol{\theta}} = E\{\boldsymbol{\theta}\} + \mathbf{C}_{\theta x}\mathbf{C}_{xx}^{-1}\big(\mathbf{x} - E\{\mathbf{x}\}\big)$
Err. Cov.: $\mathbf{M}_{\hat{\boldsymbol{\theta}}} = \mathbf{C}_{\theta\theta} - \mathbf{C}_{\theta x}\mathbf{C}_{xx}^{-1}\mathbf{C}_{x\theta}$
Easier to implement… but $\mathbf{C}_{\theta x}$ can be hard to find.

• Bayesian Linear Model (yields a linear estimate):
Estimate: $\hat{\boldsymbol{\theta}} = \boldsymbol{\mu}_\theta + \mathbf{C}_{\theta}\mathbf{H}^T\big(\mathbf{H}\mathbf{C}_{\theta}\mathbf{H}^T + \mathbf{C}_w\big)^{-1}\big(\mathbf{x} - \mathbf{H}\boldsymbol{\mu}_\theta\big)$
Err. Cov.: $\mathbf{M}_{\hat{\boldsymbol{\theta}}} = \mathbf{C}_{\theta} - \mathbf{C}_{\theta}\mathbf{H}^T\big(\mathbf{H}\mathbf{C}_{\theta}\mathbf{H}^T + \mathbf{C}_w\big)^{-1}\mathbf{H}\mathbf{C}_{\theta}$
"Easy" to implement… only need an accurate model: $\mathbf{C}_{\theta}$, $\mathbf{C}_w$, $\mathbf{H}$.

Page 19

11.7 Example: Bayesian Deconvolution

This example shows the power of Bayesian approaches over classical methods in signal estimation problems (i.e., estimating the signal itself rather than some parameters).

[Block diagram: $s(t) \rightarrow h(t) \rightarrow \Sigma \rightarrow x(t)$, with $w(t)$ entering the summer.]

• s(t): modeled as a zero-mean WSS Gaussian process with known ACF $R_s(\tau)$
• h(t): assumed known
• w(t): Gaussian bandlimited white noise with known variance
• Measured data = samples of x(t)

Goal: Observe x(t) & estimate s(t). Note: at the output, s(t) is smeared & noisy.

So… model it as a D-T system.

Page 20

Sampled-Data Formulation

$$\begin{bmatrix} x[0] \\ x[1] \\ \vdots \\ x[N-1] \end{bmatrix} = \begin{bmatrix} h[0] & 0 & \cdots & 0 \\ h[1] & h[0] & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ h[N-1] & h[N-2] & \cdots & h[N-n_s] \end{bmatrix} \begin{bmatrix} s[0] \\ s[1] \\ \vdots \\ s[n_s-1] \end{bmatrix} + \begin{bmatrix} w[0] \\ w[1] \\ \vdots \\ w[N-1] \end{bmatrix}$$

Here x is the measured data vector, H is the known observation matrix, s is the signal vector to estimate, and w is AWGN with $\mathbf{C}_w = \sigma^2\mathbf{I}$.

We have modeled s(t) as a zero-mean WSS process with known ACF… so s[n] is a D-T WSS process with known ACF $R_s[m]$… so the vector s has a known covariance matrix (Toeplitz & symmetric) given by:

$$\mathbf{C}_s = \begin{bmatrix} R_s[0] & R_s[1] & R_s[2] & \cdots & R_s[n_s-1] \\ R_s[1] & R_s[0] & R_s[1] & \cdots & R_s[n_s-2] \\ R_s[2] & R_s[1] & R_s[0] & \cdots & R_s[n_s-3] \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ R_s[n_s-1] & R_s[n_s-2] & R_s[n_s-3] & \cdots & R_s[0] \end{bmatrix}$$

The model for the prior PDF is then $\mathbf{s} \sim \mathcal{N}(\mathbf{0}, \mathbf{C}_s)$, with s and w independent.

Page 21

MMSE Solution for Deconvolution

We have the case of the Bayesian Linear Model… so:

$$\hat{\mathbf{s}} = \mathbf{C}_s\mathbf{H}^T\big(\mathbf{H}\mathbf{C}_s\mathbf{H}^T + \sigma^2\mathbf{I}\big)^{-1}\mathbf{x}$$

Note that this is a linear estimate. The matrix applied to x is called "the Wiener filter."

The performance of the filter is characterized by:

$$\mathbf{C}_{\varepsilon} = \mathbf{M}_{\hat{\mathbf{s}}} = \big(\mathbf{C}_s^{-1} + \mathbf{H}^T\mathbf{H}/\sigma^2\big)^{-1}$$
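A minimal sketch of the Wiener-filter deconvolution. The impulse response h[n], the decaying ACF used for $R_s[k]$, and the noise variance are all assumed values chosen only to exercise the formulas above.

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(3)
N = 200
Rs = 2.0 * 0.9 ** np.arange(N)                # assumed WSS ACF R_s[k]
Cs = toeplitz(Rs)                             # Toeplitz, symmetric prior covariance
sig2 = 0.25                                   # assumed noise variance, C_w = sig2 * I

h = np.array([1.0, 0.8, 0.3])                 # assumed known impulse response h[n]
H = sum(h[i] * np.eye(N, k=-i) for i in range(len(h)))  # lower-triangular Toeplitz H

s = rng.multivariate_normal(np.zeros(N), Cs)  # draw the signal s ~ N(0, C_s)
x = H @ s + rng.normal(0.0, np.sqrt(sig2), N) # x = H s + w

# Wiener filter estimate: s_hat = C_s H^T (H C_s H^T + sig2 I)^{-1} x
s_hat = Cs @ H.T @ np.linalg.solve(H @ Cs @ H.T + sig2 * np.eye(N), x)

# Error covariance M = (C_s^{-1} + H^T H / sig2)^{-1}; its diagonal holds the Bmse's
M = np.linalg.inv(np.linalg.inv(Cs) + (H.T @ H) / sig2)
print("empirical MSE of s_hat:", np.mean((s - s_hat) ** 2))
print("mean of diag(M)       :", M.diagonal().mean())
```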

Page 22

Sub-Example: No Inverse Filtering, Noise Only

Direct observation of s with H = I… x = s + w.

[Block diagram: $s(t) \rightarrow \Sigma \rightarrow x(t)$, with $w(t)$ entering the summer.]

Goal: observe x(t) & "de-noise" s(t). Note: at the output, s(t) appears with noise.

$$\hat{\mathbf{s}} = \mathbf{C}_s\big(\mathbf{C}_s + \sigma^2\mathbf{I}\big)^{-1}\mathbf{x}, \qquad \mathbf{C}_{\varepsilon} = \mathbf{M}_{\hat{\mathbf{s}}} = \big(\mathbf{C}_s^{-1} + \mathbf{I}/\sigma^2\big)^{-1}$$

Note the dimensionality problem: the # of "parms" = the # of observations. Classical methods fail (they can only give $\hat{\mathbf{s}} = \mathbf{x}$)… Bayesian methods can solve it!!

For insight… consider the "single sample" case:

$$\hat{s}[0] = \frac{R_s[0]}{R_s[0] + \sigma^2}\, x[0] = \frac{\eta}{1 + \eta}\, x[0], \qquad \eta = \frac{R_s[0]}{\sigma^2} \;(= SNR)$$

High SNR: $\hat{s}[0] \approx x[0]$ (data driven). Low SNR: $\hat{s}[0] \approx 0$ (prior-PDF driven).

Page 23

Sub-Sub-Example: Specific Signal Model

Direct observation of s with H = I… x = s + w. But here… the signal follows a specific random signal model:

$$s[n] = -a_1 s[n-1] + u[n]$$

where u[n] is a white Gaussian "driving process." This is a 1st-order "auto-regressive" model: AR(1). Such a random signal has an ACF & PSD of

$$R_s[k] = \frac{\sigma_u^2}{1 - a_1^2}\,(-a_1)^{|k|}, \qquad P_s(f) = \frac{\sigma_u^2}{\big|1 + a_1 e^{-j2\pi f}\big|^2}$$

See Figures 11.9 & 11.10 in the textbook.
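A de-noising sketch for this AR(1) sub-sub-example (H = I); the values of $a_1$, $\sigma_u^2$, and the noise variance are assumptions. It builds $\mathbf{C}_s$ from the ACF above and compares the Bayesian estimate against the classical $\hat{\mathbf{s}} = \mathbf{x}$.

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(4)
N, a1, sig_u, sig2 = 500, -0.95, 1.0, 1.0     # assumed AR(1) and noise parameters

# Prior covariance from the ACF: R_s[k] = sig_u^2 / (1 - a1^2) * (-a1)^|k|
k = np.arange(N)
Cs = toeplitz(sig_u**2 / (1.0 - a1**2) * (-a1) ** k)

# Generate one AR(1) realization: s[n] = -a1 s[n-1] + u[n]
s = np.zeros(N)
s[0] = rng.normal(0.0, sig_u / np.sqrt(1.0 - a1**2))   # stationary initialization
for n in range(1, N):
    s[n] = -a1 * s[n - 1] + rng.normal(0.0, sig_u)
x = s + rng.normal(0.0, np.sqrt(sig2), N)              # direct observation, H = I

# Bayesian de-noising: s_hat = C_s (C_s + sig2 I)^{-1} x
s_hat = Cs @ np.linalg.solve(Cs + sig2 * np.eye(N), x)

print("classical (s_hat = x) MSE:", np.mean((x - s) ** 2))      # about sig2
print("Bayesian de-noised MSE   :", np.mean((s_hat - s) ** 2))  # noticeably smaller
```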