Bayesian parameter estimation
Nuno Vasconcelos, UCSD
1
Bayesian parameter estimation
• the main difference with respect to ML is that in the Bayesian case Θ is a random variable
• basic concepts
– training set D = {x_1, ..., x_n} of examples drawn independently
– probability density for observations given parameter
$$P_{X|\Theta}(x|\theta)$$
– prior distribution for parameter configurations, which encodes prior beliefs about them
$$P_\Theta(\theta)$$
• goal: to compute the posterior distribution
$$P_{\Theta|X}(\theta|D)$$
2
Bayes vs ML
• there are a number of significant differences between Bayesian and ML estimates
• D1:
– ML produces a number, the best estimate
– to measure its goodness we need to measure bias and variance
– this can only be done with repeated experiments
– Bayes produces a complete characterization of the parameter from the single dataset
– in addition to the most probable estimate, we obtain a characterization of the uncertainty
[figure: two posteriors, a peaked one (lower uncertainty) and a broad one (higher uncertainty)]
3
Bayes vs ML
• D2: optimal estimate
– under ML there is one “best” estimate
– under Bayes there is no “best” estimate
– only a random variable that takes different values with different probabilities
– technically speaking, it makes no sense to talk about the “best” estimate
• D3: predictions
– remember that we do not really care about the parameters themselves
– they are needed only in the sense that they allow us to build models
– that can be used to make predictions (e.g. the BDR)
– unlike ML, Bayes uses ALL information in the training set to make predictions
4
Bayes vs ML
• let’s consider the BDR under the “0-1” loss and an independent sample D = {x_1, ..., x_n}
• ML-BDR:
– pick i if
$$i^*(x) = \arg\max_i P_{X|Y}(x|i; \theta_i^*)\, P_Y(i), \quad \text{where} \quad \theta_i^* = \arg\max_\theta P_{X|Y}(D|i, \theta)$$
• two steps:
– i) find θ*
– ii) plug into the BDR
• all information not captured by θ* is lost, not used at decision time
5
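The two steps above can be sketched numerically. A minimal sketch, assuming two Gaussian classes with a known shared variance and equal priors; the class means, sample sizes, and seed are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
D0 = rng.normal(-1.0, sigma, size=50)  # assumed training sample for class 0
D1 = rng.normal(+1.0, sigma, size=50)  # assumed training sample for class 1

# step i) find theta*: the ML estimate of each class parameter (here, the mean)
theta0, theta1 = D0.mean(), D1.mean()

# step ii) plug theta* into the BDR; with equal priors P_Y(i) = 1/2,
# picking the largest posterior reduces to comparing plugged-in likelihoods
def ml_bdr(x):
    p0 = np.exp(-(x - theta0) ** 2 / (2 * sigma ** 2))
    p1 = np.exp(-(x - theta1) ** 2 / (2 * sigma ** 2))
    return 0 if p0 > p1 else 1
```

Note that only `theta0` and `theta1` survive into the decision rule; everything else the training set could say (e.g. how uncertain those estimates are) is discarded, which is the point made above.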
Bayesian BDR
• this problem is avoided by Bayesian estimates
– pick i if
$$i^*(x) = \arg\max_i P_{X|Y,T}(x|i, D)\, P_Y(i)$$
$$\text{where} \quad P_{X|Y,T}(x|i, D) = \int P_{X|Y,\Theta}(x|i, \theta)\, P_{\Theta|Y,T}(\theta|i, D)\, d\theta$$
• note:
– as before, the bottom equation is repeated for each class
– hence, we can drop the dependence on the class
– and consider the more general problem of estimating
$$P_{X|T}(x|D) = \int P_{X|\Theta}(x|\theta)\, P_{\Theta|T}(\theta|D)\, d\theta$$
6
The predictive distribution
7
• the distribution
$$P_{X|T}(x|D) = \int P_{X|\Theta}(x|\theta)\, P_{\Theta|T}(\theta|D)\, d\theta$$
is known as the predictive distribution
• this follows from the fact that it allows us
– to predict the value of x
– given ALL the information available in the training set
• note that it can also be written as
$$P_{X|T}(x|D) = E_{\Theta|T}\left[ P_{X|\Theta}(x|\theta) \mid D \right]$$
– since each parameter value defines a model
– this is an expectation over all possible models
– each model is weighted by its posterior probability, given the training data
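The average-over-models view can be checked numerically. A sketch, assuming a Gaussian observation model N(θ, 1) and a Gaussian posterior over θ; the posterior parameters and integration grid are illustrative assumptions:

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

mu_post, var_post = 0.5, 0.04     # assumed posterior P_{Theta|T}(theta|D)
theta = np.linspace(-5, 5, 2001)  # grid over models for numerical integration
dtheta = theta[1] - theta[0]

def predictive(x):
    # P_{X|T}(x|D) = integral of P_{X|Theta}(x|theta) P_{Theta|T}(theta|D) dtheta:
    # each model N(theta, 1) is weighted by its posterior probability
    return float(np.sum(gauss(x, theta, 1.0) * gauss(theta, mu_post, var_post)) * dtheta)
```

For this Gaussian case the integral has the closed form N(µ, 1 + σ²), which the numerical average reproduces to within grid error.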
The predictive distribution
8
• suppose that
$$P_{X|\Theta}(x|\theta) \sim \mathcal{N}(\theta, 1) \quad \text{and} \quad P_{\Theta|T}(\theta|D) \sim \mathcal{N}(\mu, \sigma^2)$$
• the predictive distribution is an average of all these Gaussians
$$P_{X|T}(x|D) = \int P_{X|\Theta}(x|\theta)\, P_{\Theta|T}(\theta|D)\, d\theta$$
[figure: posterior $P_{\Theta|T}(\theta|D)$ over θ, with components at $\mu_1, \mu_2$ carrying weights $\pi_1, \pi_2$, and the resulting predictive $P_{X|T}(x|D)$ over x as the weighted average of the corresponding Gaussians]
The predictive distribution
• Bayes vs ML
– ML: pick one model
– Bayes: average all models
• are Bayesian predictions very different from those of ML?
– they can be, unless the posterior $P_{\Theta|T}(\theta|D)$ is narrow
[figure: a posterior sharply peaked at $\theta_{max}$ (Bayes ~ ML) vs a broad posterior (very different)]
9
MAP approximation
• this sounds good, why use ML at all?
• the main problem with Bayes is that the integral
$$P_{X|T}(x|D) = \int P_{X|\Theta}(x|\theta)\, P_{\Theta|T}(\theta|D)\, d\theta$$
can be quite nasty
• in practice one is frequently forced to use approximations
• one possibility is to do something similar to ML, i.e. pick only one model
• this can be made to account for the prior by
– picking the model that has the largest posterior probability given the training data
$$\theta_{MAP} = \arg\max_\theta P_{\Theta|T}(\theta|D)$$
10
MAP approximation
• this can usually be computed since
$$\theta_{MAP} = \arg\max_\theta P_{\Theta|T}(\theta|D) = \arg\max_\theta P_{T|\Theta}(D|\theta)\, P_\Theta(\theta)$$
and corresponds to approximating the posterior by a delta function centered at its maximum
[figure: posterior $P_{\Theta|T}(\theta|D)$ and its delta-function approximation centered at $\theta_{MAP}$]
11
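The gap between the MAP plug-in and the full integral can be seen numerically. A sketch using a deliberately wide Gaussian posterior; the Gaussian model, the posterior parameters, and the grid are illustrative assumptions:

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

mu_post, var_post = 0.0, 4.0      # assumed, deliberately wide posterior over theta
theta = np.linspace(-20, 20, 4001)
dtheta = theta[1] - theta[0]

# full Bayes: integrate the model N(theta, 1) against the posterior
full_bayes = float(np.sum(gauss(0.0, theta, 1.0)
                          * gauss(theta, mu_post, var_post)) * dtheta)

# MAP approximation: replace the posterior by a delta at its maximum
theta_map = mu_post               # the mode of a Gaussian posterior is its mean
map_plugin = gauss(0.0, theta_map, 1.0)
```

With this wide posterior the plug-in density at x = 0 overshoots the full predictive by more than a factor of two; as the posterior narrows toward a delta, the two agree.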
MAP vs ML
• ML-BDR
– pick i if
$$i^*(x) = \arg\max_i P_{X|Y}(x|i; \theta_i^*)\, P_Y(i), \quad \text{where} \quad \theta_i^* = \arg\max_\theta P_{X|Y}(D|i, \theta)$$
• Bayes MAP-BDR
– pick i if
$$i^*(x) = \arg\max_i P_{X|Y}(x|i; \theta_i^{MAP})\, P_Y(i), \quad \text{where} \quad \theta_i^{MAP} = \arg\max_\theta P_{T|Y,\Theta}(D|i, \theta)\, P_\Theta(\theta)$$
– the difference is non-negligible only when the dataset is small
• there are better alternative approximations
12
Example
• let’s consider an example of why Bayes is useful
• example: communications
– a bit is transmitted by a source, corrupted by noise, and received by a decoder
[figure: Y → channel → X]
– Q: what should the optimal decoder do to recover Y?
13
Example
• the optimal solution is to threshold X
– pick T
– decision rule
$$Y = \begin{cases} 0, & \text{if } x < T \\ 1, & \text{if } x > T \end{cases}$$
– what is the threshold?
– the midpoint between signal values
$$T = \frac{\mu_0 + \mu_1}{2}$$
14
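The decision rule above is a one-liner. A sketch, with illustrative values for the two signal means:

```python
mu0, mu1 = 0.0, 2.0          # assumed signal means for bits 0 and 1

T = (mu0 + mu1) / 2          # threshold at the midpoint between signal values

def decode(x):
    # Y = 0 if x < T, Y = 1 if x > T
    return 0 if x < T else 1
```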
Example
• today we consider a slight variation
• still:
– two states:
Y = 0: transmit signal s = −µ0
Y = 1: transmit signal s = µ0
– same noise model
$$X = s + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$
[figure: transmitter → atmosphere → receiver, Y in, X out]
15
Example
• the BDR is still
– pick “0” if
$$x < \frac{\mu_0 + (-\mu_0)}{2} = 0$$
– this is optimal and everything works wonderfully
– one day we get a phone call: the receiver is generating a lot of errors!
– something must have changed in the rover
– there is no way to go to Mars and check
• goal: to do as best as possible with the info that we have at X and our knowledge of the system
16
Example
• what we know:
– the received signal is Gaussian, with the same variance σ², but the means have changed
– there is a calibration mode: the rover can send a test sequence, but it is expensive and can only send a few bits
– if everything is normal, the received means should be µ0 and −µ0
• action:
– ask the system to transmit a few 1s and measure X
– compute the ML estimate of the mean of X
$$\hat{\mu} = \frac{1}{n} \sum_i x_i$$
• result: the estimate is different from µ0
17
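The calibration step amounts to a sample average. A sketch with made-up numbers for the shifted mean, noise level, and number of calibration bits:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_shifted, sigma = 0.8, 0.5            # assumed true (changed) mean and noise std
X = rng.normal(mu_shifted, sigma, 10)   # only a few calibration bits are affordable

mu_hat = X.mean()                       # ML estimate: (1/n) * sum_i x_i
```

With so few bits the estimate is noisy, which is exactly why the slides go on to blend it with the prior belief.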
Example
• we need to combine two forms of information
– our prior is that
$$X \sim \mathcal{N}(\mu_0, \sigma^2)$$
– our “data driven” estimate is that
$$X \sim \mathcal{N}(\hat{\mu}, \sigma^2)$$
• Q: what do we do?
– $\mu_n = f(\hat{\mu}, \mu_0, n)$
– for large n, $\mu_n \approx f(\hat{\mu})$
– for small n, $\mu_n \approx f(\mu_0)$
– intuitive combination
$$\mu_n = \alpha_n \hat{\mu} + (1 - \alpha_n)\, \mu_0, \quad \alpha_n \in [0,1], \quad \alpha_n \xrightarrow{n \to \infty} 1, \quad \alpha_n \xrightarrow{n \to 0} 0$$
18
Bayesian solution
• Gaussian likelihood (observations)
$$P_{T|\mu}(D|\mu) = G(D, \mu, \sigma^2), \quad \sigma^2 \text{ known}$$
• Gaussian prior (what we know)
$$P_\mu(\mu) = G(\mu, \mu_0, \sigma_0^2)$$
– µ0, σ0² are known hyper-parameters
• we need to compute
– the posterior distribution for µ
$$P_{\mu|T}(\mu|D) = \frac{P_{T|\mu}(D|\mu)\, P_\mu(\mu)}{P_T(D)}$$
19
Bayesian solution
• posterior distribution
$$P_{\mu|T}(\mu|D) = \frac{P_{T|\mu}(D|\mu)\, P_\mu(\mu)}{P_T(D)}$$
• note that
– this is a probability density
– we can ignore constants (terms that do not depend on µ)
– and normalize when we are done
• we only need to work with
$$P_{\mu|T}(\mu|D) \propto P_{T|\mu}(D|\mu)\, P_\mu(\mu) \propto \prod_i P_{X|\mu}(x_i|\mu)\, P_\mu(\mu)$$
20
Bayesian solution
• plugging in the Gaussians
$$\begin{aligned}
P_{\mu|T}(\mu|D) &\propto \prod_i P_{X|\mu}(x_i|\mu)\, P_\mu(\mu) \\
&\propto \prod_i G(x_i, \mu, \sigma^2)\, G(\mu, \mu_0, \sigma_0^2) \\
&\propto \exp\left\{ -\sum_i \frac{(x_i - \mu)^2}{2\sigma^2} - \frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right\} \\
&\propto \exp\left\{ -\frac{1}{2}\left[ \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right)\mu^2 - 2\left( \sum_i \frac{x_i}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \right)\mu \right] \right\}
\end{aligned}$$
21
Bayesian solution
• this is a Gaussian; we just need to put it in the standard quadratic form to know its mean and variance
$$P_{\mu|T}(\mu|D) \propto \exp\left\{ -\frac{1}{2}\left[ \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right)\mu^2 - 2\left( \sum_i \frac{x_i}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \right)\mu \right] \right\}$$
• use the completing-the-squares trick
$$a x^2 + b x + c = a\left( x + \frac{b}{2a} \right)^2 + c - \frac{b^2}{4a}$$
22
Bayesian solution
• in this case
$$P_{\mu|T}(\mu|D) \propto \exp\left\{ -\frac{1}{2}\left[ \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right)\mu^2 - 2\left( \sum_i \frac{x_i}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \right)\mu \right] \right\}$$
and, dropping the constant term,
$$a x^2 + b x + c = a\left( x + \frac{b}{2a} \right)^2 + c - \frac{b^2}{4a} \propto a\left( x + \frac{b}{2a} \right)^2$$
• we have
$$P_{\mu|T}(\mu|D) \propto \exp\left\{ -\frac{1}{2}\left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right)\left[ \mu - \frac{ \sum_i \frac{x_i}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} }{ \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} } \right]^2 \right\}$$
23
Bayesian solution
• and using
$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \quad \Leftrightarrow \quad \sigma_n^2 = \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right)^{-1}$$
• we have
$$P_{\mu|T}(\mu|D) \propto \exp\left\{ -\frac{1}{2}\left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right)\left[ \mu - \sigma_n^2 \left( \sum_i \frac{x_i}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \right) \right]^2 \right\} \propto \exp\left\{ -\frac{(\mu - \mu_n)^2}{2\sigma_n^2} \right\}$$
• and
$$P_{\mu|T}(\mu|D) = G(\mu, \mu_n, \sigma_n^2), \quad \mu_n = \sigma_n^2 \left( \sum_i \frac{x_i}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \right) = \frac{\sigma_0^2 \sum_i x_i + \sigma^2 \mu_0}{n \sigma_0^2 + \sigma^2}$$
24
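The closed-form update above is two lines of code. A sketch, where the data and hyper-parameter values are illustrative:

```python
import numpy as np

def posterior_params(x, mu0, var0, var):
    """Posterior G(mu, mu_n, var_n) for a Gaussian mean with known variance var."""
    n = len(x)
    var_n = var0 * var / (n * var0 + var)                     # sigma_n^2
    mu_n = (var0 * np.sum(x) + var * mu0) / (n * var0 + var)  # mu_n
    return mu_n, var_n

x = np.array([0.9, 1.1, 1.3, 0.7])  # illustrative observations, mu_ML = 1.0
mu_n, var_n = posterior_params(x, mu0=0.0, var0=1.0, var=0.25)
```

`mu_n` is a convex combination of the ML estimate and µ0, and the precisions add up as 1/σn² = n/σ² + 1/σ0², which the following slides spell out.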
Bayesian solution
• this can be rewritten as
$$P_{\mu|T}(\mu|D) = G(\mu, \mu_n, \sigma_n^2)$$
$$\mu_n = \underbrace{\frac{n \sigma_0^2}{n \sigma_0^2 + \sigma^2}}_{\alpha_n}\, \mu_{ML} + \underbrace{\frac{\sigma^2}{n \sigma_0^2 + \sigma^2}}_{1 - \alpha_n}\, \mu_0$$
$$\sigma_n^2 = \frac{\sigma_0^2 \sigma^2}{n \sigma_0^2 + \sigma^2} \quad \Leftrightarrow \quad \frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}$$
• we can compare with our “intuitive” solution
25
Bayesian solution
• we had
$$\mu_n = \alpha_n \hat{\mu} + (1 - \alpha_n)\, \mu_0, \quad \alpha_n \in [0,1], \quad \alpha_n \xrightarrow{n \to \infty} 1, \quad \alpha_n \xrightarrow{n \to 0} 0$$
• the Bayesian solution is
$$\mu_n = \underbrace{\frac{n \sigma_0^2}{n \sigma_0^2 + \sigma^2}}_{\alpha_n}\, \mu_{ML} + \underbrace{\frac{\sigma^2}{n \sigma_0^2 + \sigma^2}}_{1 - \alpha_n}\, \mu_0$$
• note that
$$\alpha_n \in [0,1], \quad \alpha_n \xrightarrow{n \to \infty} 1, \quad \alpha_n \xrightarrow{n \to 0} 0$$
• it is exactly the same as our heuristic
26
Bayesian solution
• for free, Bayes also gives us
– the weighting constants
$$\alpha_n = \frac{n \sigma_0^2}{n \sigma_0^2 + \sigma^2}$$
– a measure of the uncertainty of our estimate
$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}$$
– note that 1/σ² is a measure of precision
– this should be read as
P_Bayes = P_ML + P_prior
– the Bayesian precision is greater than both that of ML and that of the prior
27
Observations
– 1) note that precision increases with n; the variance goes to zero
$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}$$
we are guaranteed that in the limit of infinite data we have convergence to a single estimate
– 2) for large n the likelihood term dominates the prior term
$$\mu_n = \alpha_n \hat{\mu} + (1 - \alpha_n)\, \mu_0, \quad \alpha_n \in [0,1], \quad \alpha_n \xrightarrow{n \to \infty} 1, \quad \alpha_n \xrightarrow{n \to 0} 0$$
the solution is equivalent to that of ML
– for small n, the prior dominates
– this always happens for Bayesian solutions
$$P_{\mu|T}(\mu|D) \propto \prod_i P_{X|\mu}(x_i|\mu)\, P_\mu(\mu)$$
28
Observations
– 3) for a given n
if σ0² >> σ², i.e. we really don’t know what µ is a priori, then µn = µ_ML
– on the other hand, if σ0² << σ², i.e. we are very certain a priori, then µn = µ0
$$\mu_n = \alpha_n \hat{\mu} + (1 - \alpha_n)\, \mu_0, \quad \alpha_n = \frac{n \sigma_0^2}{n \sigma_0^2 + \sigma^2}$$
• in summary,
– the Bayesian estimate combines the prior beliefs with the evidence provided by the data
– in a very intuitive manner
29
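The two regimes above can be sketched by pushing σ0² to extremes; the data and all numeric values are illustrative:

```python
import numpy as np

def mu_n(x, mu0, var0, var):
    # posterior mean: (sigma_0^2 * sum_i x_i + sigma^2 * mu_0) / (n sigma_0^2 + sigma^2)
    return (var0 * np.sum(x) + var * mu0) / (len(x) * var0 + var)

x = np.array([2.0, 2.2, 1.8])                  # mu_ML = 2.0
mu0 = 0.0

vague = mu_n(x, mu0, var0=1e6, var=1.0)        # sigma_0^2 >> sigma^2: follows the data
confident = mu_n(x, mu0, var0=1e-6, var=1.0)   # sigma_0^2 << sigma^2: sticks to the prior
```

`vague` lands essentially on µ_ML = 2.0 and `confident` essentially on µ0 = 0.0, matching the two cases discussed above.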