Lectures 3 & 4: Estimators


Transcript of Lectures 3 & 4: Estimators

Page 1: Lectures 3 & 4: Estimators

Lectures 3 & 4: Estimators

• Maximum likelihood estimation (MLE)
  – Likelihood function, information matrix
• Least squares estimation
  – Total least squares
  – General least squares
  – Regularization
• Maximum a Posteriori estimation (MAP)
  – Posterior density via Bayes' rule
  – Confidence regions

Hilary Term 2007 A. Zisserman

Maximum Likelihood Estimation

In the line fitting (linear regression) example the estimate of the line parameters θ involved two steps:

1. Write down the likelihood function expressing the probability of the data z given the parameters θ

2. Maximize the likelihood to determine θ

i.e. choose the value of θ so as to make the data as likely as possible.

Note, the solution does not depend on the variance $\sigma^2$ of the noise.

[Figure: data points $(x_i, y_i)$ and the fitted line]

$$\min_{a,b} \sum_{i}^{n} \left(y_i - (a x_i + b)\right)^2$$

Page 2: Lectures 3 & 4: Estimators

Example application: Navigation

Take multiple measurements of beacons to improve location estimates

Page 3: Lectures 3 & 4: Estimators

The Likelihood function

More formally, the likelihood is defined as

$$L(\theta) = p(\mathbf{z}|\theta)$$

• the likelihood is the conditional probability of obtaining measurements z = (z_1, z_2, …, z_n) given that the true value is θ

[Figure: sketch of the density p(z|θ)]

Note, in general

• The likelihood is a function of θ, but it is not a probability distribution over θ,

and its integral with respect to θ does not (necessarily) equal one.

• It would be incorrect to refer to this as “the likelihood of the data”.

Example: Gaussian Sensors

Likelihood is a Gaussian response in 2D for true value θ = x

[Figure: two 2D Gaussian likelihood surfaces for measurements z_1 and z_2 of the true position x]

$$L(\theta) = p(\mathbf{z}|\theta)$$

Given observations z and a likelihood function L(θ), the Maximum Likelihood estimate of θ is the value of θ which maximizes the likelihood function:

$$\hat{\theta} = \arg\max_{\theta} L(\theta)$$

We now look at several problems involving combining multiple measurements from Gaussian sensors.

Page 4: Lectures 3 & 4: Estimators

MLE Example 1

Suppose we have two independent sonar measurements (z_1, z_2) of position x, and the sensor error may be modelled as p(z_i|θ) = N(θ, σ²). Determine the MLE of x.

Solution

Since the sensors are independent, the likelihood is

$$L(x) = p(z_1, z_2|x) = p(z_1|x)\, p(z_2|x)$$

and since the sensors are Gaussian (ignoring irrelevant normalization constants)

$$L(x) \sim e^{-\frac{(z_1-x)^2}{2\sigma^2}} \times e^{-\frac{(z_2-x)^2}{2\sigma^2}} = e^{-\frac{(z_1-x)^2 + (z_2-x)^2}{2\sigma^2}}$$

[Figure: the sensor likelihoods p(z_1|x) and p(z_2|x), centred at the readings z_1 and z_2]

The negative log likelihood is

$$-\ln L(x) = \left((z_1-x)^2 + (z_2-x)^2\right)/(2\sigma^2) = \left(2x^2 - 2x(z_1+z_2) + z_1^2 + z_2^2\right)/(2\sigma^2) = (x - \bar{x})^2/\sigma^2 + c(z_1, z_2)$$

with

$$\bar{x} = \frac{z_1 + z_2}{2}, \qquad \text{so that} \qquad L(x) \sim e^{-\frac{(x-\bar{x})^2}{\sigma^2}}$$

Page 5: Lectures 3 & 4: Estimators

Note

• the likelihood is a Gaussian
• the variance is reduced (cf. the original sensor variance)

[Figure: the sensor likelihoods p(z_1|x), p(z_2|x) and the narrower combined likelihood L(x) ∼ e^{-(x-\bar{x})^2/\sigma^2}]

Maximum likelihood estimate of x:

$$\hat{x} = \arg\max_x L(x) = \arg\min_x \{-\ln L(x)\}$$

Using the negative log likelihood

$$-\ln L(x) = (x - \bar{x})^2/\sigma^2 + c(z_1, z_2)$$

compute the minimum by differentiating w.r.t. x:

$$\frac{d\{-\ln L(x)\}}{dx} = 2(x - \bar{x})/\sigma^2 = 0 \quad\Rightarrow\quad \hat{x}_{ML} = \bar{x} = \frac{z_1 + z_2}{2}$$

Page 6: Lectures 3 & 4: Estimators

MLE Example 2: Gaussian fusion with different variances

Suppose we have two independent sonar measurements (z_1, z_2) of position x, and the sensor errors may be modelled as p(z_1|θ) = N(θ, σ_1²) and p(z_2|θ) = N(θ, σ_2²). Determine the MLE of x.

Solution

Again, as the sensors are independent,

$$L(x) = p(z_1, z_2|x) = p(z_1|x)\, p(z_2|x) \sim e^{-\frac{(z_1-x)^2}{2\sigma_1^2}} \times e^{-\frac{(z_2-x)^2}{2\sigma_2^2}}$$

The negative log likelihood is then

$$-\ln L(x) = \frac{1}{2}\left[\sigma_1^{-2}(z_1 - x)^2 + \sigma_2^{-2}(z_2 - x)^2\right] + \text{const}$$

$$= \frac{1}{2}\left[(\sigma_1^{-2} + \sigma_2^{-2})x^2 - 2(\sigma_1^{-2}z_1 + \sigma_2^{-2}z_2)x\right] + \text{const}$$

$$= \frac{1}{2}(\sigma_1^{-2} + \sigma_2^{-2})\left[x - \frac{\sigma_1^{-2}z_1 + \sigma_2^{-2}z_2}{\sigma_1^{-2} + \sigma_2^{-2}}\right]^2 + \text{const}$$

so the likelihood is maximised (equivalently, the negative log likelihood is minimised) w.r.t. x when

$$x_{MLE} = \frac{\sigma_1^{-2}z_1 + \sigma_2^{-2}z_2}{\sigma_1^{-2} + \sigma_2^{-2}}$$

e.g. if the sensors are p(z_1|x) ∼ N(x, 10²) and p(z_2|x) ∼ N(x, 20²), and we obtain sensor readings of z_1 = 130, z_2 = 170, then

$$x_{MLE} = \frac{130/10^2 + 170/20^2}{1/10^2 + 1/20^2} = 138.0$$

• the ML estimate is closer to the more confident measurement
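As a quick numerical check (not part of the original slides), the information-weighted average can be computed directly in Matlab:

z  = [130 170];                 % sensor readings
s2 = [10^2 20^2];               % sensor variances
w  = 1 ./ s2;                   % statistical information (inverse variances)
xMLE  = sum(w .* z) / sum(w)    % = 138.0
varML = 1 / sum(w)              % variance of the fused estimate (= 80, i.e. sigma_L ~ 8.94)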

Page 7: Lectures 3 & 4: Estimators

The Information Matrix

We have seen that for two independent Normal variables:

$$\hat{x} = \frac{\sigma_1^{-2} z_1 + \sigma_2^{-2} z_2}{\sigma_1^{-2} + \sigma_2^{-2}}$$   (information-weighted mean)

$$\sigma_{\hat{x}}^{-2} = \sigma_1^{-2} + \sigma_2^{-2}$$   (additivity of statistical information, like the parallel-resistors formula)

This motivates the definition of the information matrix

$$S = \Sigma^{-1}, \qquad S_{\hat{x}} = S_{z_1} + S_{z_2}$$

In 2D, with $\mathbf{z} = (z_x, z_y)^\top$,

$$S_{\hat{x}} = \Sigma_{\hat{x}}^{-1} = S_{z_1} + S_{z_2}, \qquad \hat{\mathbf{x}} = \Sigma_{\hat{x}}\left(S_{z_1}\mathbf{z}_1 + S_{z_2}\mathbf{z}_2\right)$$

If $\Sigma = \begin{bmatrix}\sigma_x^2 & 0\\ 0 & \sigma_y^2\end{bmatrix}$ then $S = \begin{bmatrix}1/\sigma_x^2 & 0\\ 0 & 1/\sigma_y^2\end{bmatrix}$

• large covariance → uncertainty, lack of information
• small covariance → some confidence, or strong information

A singular covariance matrix implies "infinite" or "perfect" information.

Page 8: Lectures 3 & 4: Estimators

MLE Example 3: Gaussian fusion with 2D sensors

Suppose we have two independent 2D vision measurements (z_1, z_2) of position x, and the sensor error may be modelled as p(z_1|θ) = N(θ, Σ_1) and p(z_2|θ) = N(θ, Σ_2). Determine the MLE of x, where

$$\mathbf{z}_1 = \begin{pmatrix}1\\1\end{pmatrix}, \quad \Sigma_1 = \begin{bmatrix}1 & 0\\ 0 & 4\end{bmatrix} \qquad \text{and} \qquad \mathbf{z}_2 = \begin{pmatrix}2\\-1\end{pmatrix}, \quad \Sigma_2 = \begin{bmatrix}4 & 0\\ 0 & 1\end{bmatrix}$$

Solution

The likelihood is

$$L(\mathbf{x}) = p(\mathbf{z}_1|\mathbf{x})\, p(\mathbf{z}_2|\mathbf{x})$$

Again, the sensors are independent so we sum their information:

$$\Sigma_{\hat{x}}^{-1} = \Sigma_1^{-1} + \Sigma_2^{-1} = \begin{bmatrix}1 & 0\\ 0 & 0.25\end{bmatrix} + \begin{bmatrix}0.25 & 0\\ 0 & 1\end{bmatrix} = \begin{bmatrix}1.25 & 0\\ 0 & 1.25\end{bmatrix}$$

$$\hat{\mathbf{x}} = \Sigma_{\hat{x}}\left(S_{z_1}\mathbf{z}_1 + S_{z_2}\mathbf{z}_2\right) = \begin{bmatrix}0.8 & 0\\ 0 & 0.8\end{bmatrix}\left(\begin{bmatrix}1 & 0\\ 0 & 0.25\end{bmatrix}\begin{pmatrix}1\\1\end{pmatrix} + \begin{bmatrix}0.25 & 0\\ 0 & 1\end{bmatrix}\begin{pmatrix}2\\-1\end{pmatrix}\right) = \begin{pmatrix}1.2\\-0.6\end{pmatrix}$$

[Figure: the two Gaussian likelihoods N(z_1, Σ_1) and N(z_2, Σ_2) plotted in the (x, y) plane]
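A small Matlab check of these numbers (a sketch, not part of the original slides):

Sigma1 = [1 0; 0 4];  z1 = [1; 1];
Sigma2 = [4 0; 0 1];  z2 = [2; -1];
Sx = inv(Sigma1) + inv(Sigma2);      % information matrices add: [1.25 0; 0 1.25]
Px = inv(Sx);                        % fused covariance: [0.8 0; 0 0.8]
xhat = Px * (Sigma1\z1 + Sigma2\z2)  % = [1.2; -0.6]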

Page 9: Lectures 3 & 4: Estimators

Line fitting with errors in both coordinates

Suppose points are measured from a true line y = ax + b, with errors in both coordinates:

$$x_i = \bar{x}_i + v_i, \quad v_i \sim N(0, \sigma^2), \qquad y_i = \bar{y}_i + w_i, \quad w_i \sim N(0, \sigma^2)$$

where $(\bar{x}_i, \bar{y}_i)$ lies on the line. In the case of linear regression, the ML estimate gave the cost function

$$\min_{a,b} \sum_{i}^{n} \left(y_i - (a x_i + b)\right)^2$$

(y_i is the measured value and a x_i + b the predicted value). In the case of errors in both coordinates, the ML cost function is

$$\min_{a,b} \sum_{i}^{n} (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \quad \text{subject to} \quad \hat{y}_i = a\hat{x}_i + b$$

i.e. it is necessary also to estimate (predict) the corrected points $(\hat{x}_i, \hat{y}_i)$.

Page 10: Lectures 3 & 4: Estimators

[Figure: a measured point and its corrected point on the fitted line]

This estimator is called total least squares, or orthogonal regression.

Solution [exercise – non-examinable]

• the line passes through the centroid of the points
• the line normal is parallel to the eigenvector of the covariance matrix corresponding to the least eigenvalue (equivalently, the line direction is the eigenvector with the largest eigenvalue)

where

$$\boldsymbol{\mu} = \frac{1}{n}\sum_{i}^{n}\mathbf{x}_i, \qquad \Sigma = \sum_{i}^{n}\left(\mathbf{x}_i - \boldsymbol{\mu}\right)\left(\mathbf{x}_i - \boldsymbol{\mu}\right)^\top$$

$$\min_{a,b} \sum_{i}^{n} (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \quad \text{subject to} \quad \hat{y}_i = a\hat{x}_i + b$$
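A minimal Matlab sketch of this solution on synthetic data (not part of the original slides; the data and variable names are illustrative):

n = 50;
xt = linspace(0, 10, n)'; yt = 0.7*xt + 2;            % hypothetical true line
x = xt + 0.2*randn(n, 1); y = yt + 0.2*randn(n, 1);   % noise in both coordinates
X = [x y];
mu = mean(X, 1);                 % centroid: the fitted line passes through it
[V, D] = eig(cov(X));            % eigen-decomposition of the 2x2 covariance
[dmin, imin] = min(diag(D));
nrm = V(:, imin);                % line normal: eigenvector of the least eigenvalue
a = -nrm(1)/nrm(2);              % slope and intercept of the fitted line
b = mu(2) - a*mu(1);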

Application: Estimating Geometric Transformations

The line estimation problem is equivalent to estimating a 1D affine transformation from noisy point correspondences

$$x' = \alpha x + \beta, \qquad \{x'_i \leftrightarrow x_i\}$$

Can also compute the ML estimate of 2D transformations

Page 11: Lectures 3 & 4: Estimators

Reminder: Plane Projective transformation

• A projective transformation is also called a "homography" or a "collineation".

• H has 8 degrees of freedom.

• It can be estimated from 4 or more point correspondences.

Maximum Likelihood estimation of homography H

If the measurement error is Gaussian, then the ML estimate of H and the corrected correspondences is given by minimizing

$$\min_{H,\{\hat{\mathbf{x}}_i, \hat{\mathbf{x}}'_i\}} \sum_i d(\mathbf{x}_i, \hat{\mathbf{x}}_i)^2 + d(\mathbf{x}'_i, \hat{\mathbf{x}}'_i)^2 \quad \text{subject to} \quad \hat{\mathbf{x}}'_i = H\hat{\mathbf{x}}_i$$

where d(·,·) denotes image distance and $\hat{\mathbf{x}}_i, \hat{\mathbf{x}}'_i$ are the corrected points.

Cost function minimization:

• for 2D affine transformations there is a matrix solution (non-examinable)
• for 2D projective transformations, minimize numerically using e.g. non-linear gradient descent

Page 12: Lectures 3 & 4: Estimators

e.g. Camera rotating about its centre

image plane 1

image plane 2

• The two image planes are related by a homography H

• H only depends on the camera centre, C, and the planes, not on the 3D structure

Example : Building panoramic mosaics

4 frames from a sequence of 30

The camera rotates (and zooms) with fixed centre

Page 13: Lectures 3 & 4: Estimators

Example : Building panoramic mosaics

30 frames

The camera rotates (and zooms) with fixed centre

Homography between two frames

ML estimate of homography from these 100s of point correspondences

Page 14: Lectures 3 & 4: Estimators

Choice of mosaic frame

This produces the classic "bow-tie" mosaic.

Choose central image as reference

Page 15: Lectures 3 & 4: Estimators

General Linear Least Squares

Write the residuals as an n-vector

$$\mathbf{w} = \mathbf{y} - H\boldsymbol{\theta}$$

For the line-fitting cost $\min_{a,b}\sum_i (y_i - a x_i - b)^2$ this is

$$\begin{pmatrix}w_1\\ w_2\\ \vdots\\ w_n\end{pmatrix} = \begin{pmatrix}y_1\\ y_2\\ \vdots\\ y_n\end{pmatrix} - \begin{bmatrix}x_1 & 1\\ x_2 & 1\\ \vdots & \vdots\\ x_n & 1\end{bmatrix}\begin{pmatrix}a\\ b\end{pmatrix}$$

where H is an n × 2 matrix. Then the sum of squared residuals becomes

$$\mathbf{w}^\top\mathbf{w} = (\mathbf{y} - H\boldsymbol{\theta})^\top(\mathbf{y} - H\boldsymbol{\theta})$$

We want to minimize this w.r.t. θ:

$$\frac{d}{d\boldsymbol{\theta}}\,(\mathbf{y} - H\boldsymbol{\theta})^\top(\mathbf{y} - H\boldsymbol{\theta}) = -2H^\top(\mathbf{y} - H\boldsymbol{\theta}) = 0 \quad\Rightarrow\quad H^\top H\,\boldsymbol{\theta} = H^\top\mathbf{y}$$

Summary

• If the generative model is linear, y = Hθ + w,
• then the ML solution for Gaussian noise w is

$$\hat{\boldsymbol{\theta}} = (H^\top H)^{-1}H^\top\mathbf{y} = H^{+}\mathbf{y}$$

• where the matrix $H^{+} = (H^\top H)^{-1}H^\top$ is the pseudo-inverse of H.
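A minimal Matlab sketch of this solution for the straight-line case, on synthetic data (not part of the original slides):

n = 20;
x = (1:n)';
y = 2*x + 1 + 0.5*randn(n, 1);   % noisy samples of a hypothetical line y = 2x + 1
H = [x ones(n, 1)];              % n x 2 design matrix
theta = (H'*H) \ (H'*y);         % theta = [a; b], i.e. (H'H)^{-1} H' y
% equivalently: theta = pinv(H)*y, or theta = H \ y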

Page 16: Lectures 3 & 4: Estimators

Generalization I: multi-linear regression

$$y_i = \theta_2 x_i + \theta_1 z_i + \theta_0 + w_i$$

$$H = \begin{bmatrix}x_1 & z_1 & 1\\ x_2 & z_2 & 1\\ \vdots & \vdots & \vdots\\ x_n & z_n & 1\end{bmatrix}, \qquad \hat{\boldsymbol{\theta}} = (H^\top H)^{-1}H^\top\mathbf{y}$$

Generalization II: regression with basis functions

More general still, with basis functions $\phi_j(x)$:

$$y_i = \sum_j \theta_j\,\phi_j(x_i) + w_i$$

for example, quadratic regression:

$$y_i = \theta_2 x_i^2 + \theta_1 x_i + \theta_0 + w_i, \qquad H = \begin{bmatrix}\phi_2(x_1) & \phi_1(x_1) & 1\\ \phi_2(x_2) & \phi_1(x_2) & 1\\ \vdots & \vdots & \vdots\\ \phi_2(x_n) & \phi_1(x_n) & 1\end{bmatrix} = \begin{bmatrix}x_1^2 & x_1 & 1\\ x_2^2 & x_2 & 1\\ \vdots & \vdots & \vdots\\ x_n^2 & x_n & 1\end{bmatrix}$$

• what is the correct function to fit?
• how is the quality of the estimate assessed?
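A minimal Matlab sketch of quadratic regression with this design matrix, again on synthetic data (not part of the original slides):

n = 30;
x = linspace(-2, 2, n)';
y = 0.5*x.^2 - x + 3 + 0.2*randn(n, 1);   % hypothetical quadratic plus noise
H = [x.^2 x ones(n, 1)];                  % design matrix built from the basis functions
theta = (H'*H) \ (H'*y)                   % [theta_2; theta_1; theta_0]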

Page 17: Lectures 3 & 4: Estimators

We have seen:

• An estimator is a function f(z) which computes an estimate $\hat{\theta} = f(\mathbf{z})$ for the parameters θ from the measurements z = (z_1, z_2, …, z_n)

• The Maximum Likelihood estimator is

$$\hat{\theta} = \arg\max_\theta p(\mathbf{z}|\theta)$$

Now, we will compute the posterior, and hence the Maximum a Posteriori (MAP) estimator.

Bayes’ rule and the Posterior distribution

Suppose we have

• p(z|θ) the conditional density of the sensor(s)

• additionally, p(θ), a prior density of θ, the parameter to be estimated.

Use Bayes' rule to get the posterior density

$$p(\theta|\mathbf{z}) = \frac{p(\mathbf{z}|\theta)\,p(\theta)}{p(\mathbf{z})} = \frac{\text{likelihood} \times \text{prior}}{\text{normalizing factor}}$$

Page 18: Lectures 3 & 4: Estimators

[Figures: two examples of a likelihood p(z|θ), a prior p(θ), and the resulting posterior p(θ|z) ∼ p(z|θ)p(θ), plotted against θ]

Page 19: Lectures 3 & 4: Estimators

Example: posterior for uniform distributions

[Figures: uniform (box) likelihoods p(z|θ) and priors p(θ), and the resulting posteriors p(θ|z) ∼ p(z|θ)p(θ), with the MAP estimate marked; where is the MLE?]

MAP Estimator

$$\hat{\theta}_{MAP} = \arg\max_\theta p(\theta|\mathbf{z})$$

Page 20: Lectures 3 & 4: Estimators

MAP Example: Gaussian fusion

Consider again two sensors modelled as independent normal distributions:

$$z_1 = \theta + w_1, \quad p(z_1|\theta) \sim N(\theta, \sigma_1^2), \qquad z_2 = \theta + w_2, \quad p(z_2|\theta) \sim N(\theta, \sigma_2^2)$$

so that the joint likelihood is

$$p(z_1, z_2|\theta) = p(z_1|\theta) \times p(z_2|\theta)$$

Suppose in addition we have prior information that

$$p(\theta) \sim N(\theta_p, \sigma_p^2)$$

Then the posterior density is

$$p(\theta|z_1, z_2) \sim p(z_1, z_2|\theta) \times p(\theta)$$

The posterior is a Gaussian, and we can write down its mean and variance simply by adding statistical information:

$$p(\theta|z_1, z_2) \sim N(\mu_{pos}, \sigma_{pos}^2)$$

where

$$\sigma_{pos}^{-2} = \sigma_1^{-2} + \sigma_2^{-2} + \sigma_p^{-2}, \qquad \mu_{pos} = \frac{\sigma_1^{-2}z_1 + \sigma_2^{-2}z_2 + \sigma_p^{-2}\theta_p}{\sigma_1^{-2} + \sigma_2^{-2} + \sigma_p^{-2}} = \theta_{MAP}$$

e.g. if the sensors are p(z_1|x) ∼ N(x, 10²), p(z_2|x) ∼ N(x, 20²), the prior is p(x) ∼ N(150, 30²), and we obtain sensor readings of z_1 = 130, z_2 = 170, then

$$x_{MAP} = \frac{130/10^2 + 170/20^2 + 150/30^2}{1/10^2 + 1/20^2 + 1/30^2} = 139.0, \qquad \sigma_{pos} = \frac{1}{\sqrt{1/10^2 + 1/20^2 + 1/30^2}} = 8.57$$

Page 21: Lectures 3 & 4: Estimators

Matlab example

[Figure: the likelihoods p(z_1|x), p(z_2|x), the prior p(θ), and the posterior p(θ|z_1, z_2) plotted over the range 50 to 250]

inc = 1; t = 50:inc:250;
% normal(t, mu, sigma2) is assumed to be a helper returning the Gaussian pdf
% N(mu, sigma2) evaluated at t, e.g. normpdf(t, mu, sqrt(sigma2))
pz1 = normal(t, 130, 10^2); pz2 = normal(t, 170, 20^2); prx = normal(t, 150, 30^2);
posx = pz1 .* pz2 .* prx / (inc*(sum(pz1 .* pz2 .* prx)));   % normalize numerically
plot(t, pz1, 'r', t, pz2, 'g', t, prx, 'k:', t, posx, 'b');
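For comparison, the closed-form posterior mean and standard deviation from the previous slide can be computed directly (a small added check, not part of the original code):

z  = [130 170 150];             % the two readings plus the prior mean
s2 = [10^2 20^2 30^2];          % the corresponding variances
w  = 1 ./ s2;
xMAP   = sum(w .* z) / sum(w)   % = 139.0 (approx)
sigpos = 1 / sqrt(sum(w))       % = 8.57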

MAP vs MLE

For the posterior,

$$\sigma_{pos}^{-2} = \sigma_1^{-2} + \sigma_2^{-2} + \sigma_p^{-2}, \qquad \theta_{MAP} = \frac{\sigma_1^{-2}z_1 + \sigma_2^{-2}z_2 + \sigma_p^{-2}\theta_p}{\sigma_1^{-2} + \sigma_2^{-2} + \sigma_p^{-2}}$$

Recall that for the likelihood,

$$\sigma_L^{-2} = \sigma_1^{-2} + \sigma_2^{-2}, \qquad \theta_{MLE} = \frac{\sigma_1^{-2}z_1 + \sigma_2^{-2}z_2}{\sigma_1^{-2} + \sigma_2^{-2}}$$

so

$$\sigma_{pos}^{-2} = \sigma_L^{-2} + \sigma_p^{-2}, \qquad \theta_{MAP} = \frac{\sigma_p^{-2}\theta_p + \sigma_L^{-2}\theta_{MLE}}{\sigma_p^{-2} + \sigma_L^{-2}}$$

i.e. the prior acts as an additional sensor.

Page 22: Lectures 3 & 4: Estimators

• Decrease in posterior variance relative to the prior and likelihood:

$$\frac{1}{\sigma_{pos}^2} = \frac{1}{\sigma_L^2} + \frac{1}{\sigma_p^2}$$

e.g.

$$\sigma_L = \frac{1}{\sqrt{1/10^2 + 1/20^2}} = 8.94, \qquad \sigma_p = 30, \qquad \sigma_{pos} = \frac{1}{\sqrt{1/10^2 + 1/20^2 + 1/30^2}} = 8.57$$

If $\sigma_p \to \infty$ then $\sigma_{pos} \to \sigma_L$ (prior is very uninformative).

• Effect of a uniform prior?

• Update to the prior:

$$\theta_{MAP} = \theta_p + \frac{\sigma_p^2}{\sigma_p^2 + \sigma_L^2}\left(\theta_{MLE} - \theta_p\right)$$

  – If θ_MLE = θ_p then θ_MAP is unchanged from the prior (but its variance is decreased)
  – If σ_L is large compared to σ_p then θ_MAP ≈ θ_p (noisy sensor)
  – If σ_p is large compared to σ_L then θ_MAP ≈ θ_MLE (weak prior)
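A quick Matlab check (not in the original) that this update form reproduces the x_MAP = 139.0 computed earlier from the same readings:

thp = 150; sp2 = 30^2;                          % prior mean and variance
thMLE = 138.0; sL2 = 1/(1/10^2 + 1/20^2);       % MLE and its variance (= 80) from earlier
thMAP = thp + sp2/(sp2 + sL2)*(thMLE - thp)     % = 139.0 (approx)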

Page 23: Lectures 3 & 4: Estimators

MAP Application 1: Berkeley traffic tracking system

Tracks cars by:

• computing corner features on each video frame

• tracking corners (2D) from frame to frame

Objective: estimate image corner position x(t)

Tracking corner features

[Figure: detected corner features, their frame-to-frame tracks, and their grouping into vehicles]

Page 24: Lectures 3 & 4: Estimators

Berkeley visual traffic monitoring system

Traffic by day and by night

Page 25: Lectures 3 & 4: Estimators

ground plane homography

Velocity measurement gate

Page 26: Lectures 3 & 4: Estimators

Tracking Corners

[Figure: a corner feature with its true position and a noisy measurement]

Corners need 2D Gaussians

A corner feature in 2D coordinates is a vector $\mathbf{x} = (x, y)^\top$. Suppose $x \sim N(0, \sigma_x^2)$, $y \sim N(0, \sigma_y^2)$; then

$$p(x, y) = \frac{1}{2\pi\sigma_x\sigma_y}\exp\left\{-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right)\right\}$$

and the covariance matrix is

$$\Sigma = \begin{bmatrix}\sigma_x^2 & 0\\ 0 & \sigma_y^2\end{bmatrix}$$

Page 27: Lectures 3 & 4: Estimators

Corner fusion in 2D

Given a vehicle at position x with prior distribution N(x_0, P_0) and a corner measurement z ∼ N(x, R).

Summing information:

$$S = P^{-1} = R^{-1} + P_0^{-1}$$

Information-weighted mean:

$$\hat{\mathbf{x}} = P\left(R^{-1}\mathbf{z} + P_0^{-1}\mathbf{x}_0\right)$$

[Figure: the measurement likelihood and the prior distribution shown as ellipses in the image]

Example, with isotropic measurement error:

Prior: $\mathbf{x}_0 = \begin{pmatrix}-1\\-1\end{pmatrix}$, $P_0 = \begin{bmatrix}2 & 1\\ 1 & 3\end{bmatrix}$; Likelihood: $\mathbf{z} = \begin{pmatrix}1\\2\end{pmatrix}$, $R = \begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix}$

Page 28: Lectures 3 & 4: Estimators

Posterior:

$$P = \begin{bmatrix}0.64 & 0.09\\ 0.09 & 0.73\end{bmatrix}, \qquad \hat{\mathbf{x}} = \begin{pmatrix}0.55\\ 1.36\end{pmatrix}$$

[Figure: the posterior distribution ellipse]
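A quick Matlab check of the fused values (not part of the original slides):

P0 = [2 1; 1 3];  x0 = [-1; -1];   % prior N(x0, P0)
R  = eye(2);      z  = [1; 2];     % measurement N(x, R)
P  = inv(inv(R) + inv(P0))         % = [0.64 0.09; 0.09 0.73] (to 2 d.p.)
xh = P * (R\z + P0\x0)             % = [0.55; 1.36] (to 2 d.p.)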

Where does the prior come from?

Use the posterior from the previous time step, “updated” to take

account of the dynamics

Iterate:

• prediction (posterior + dynamics) to give prior

• make measurement z

• fuse measurement and prior to give new posterior

$$\mathbf{x}_t - \mathbf{x}_{t-1} \sim N\!\left(\begin{pmatrix}0\\ v\end{pmatrix},\; \sigma^2\begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix}\right)$$

Page 29: Lectures 3 & 4: Estimators

Compute estimate and covariance for tracked corner

Prior:

$$\mathbf{x}_0 = \begin{pmatrix}-1\\-1\end{pmatrix}, \quad P_0 = \begin{bmatrix}2 & 1\\ 1 & 3\end{bmatrix}$$

Dynamics:

$$\mathbf{v} = \begin{pmatrix}0\\0\end{pmatrix}, \quad P_D = \begin{bmatrix}0.3 & 0\\ 0 & 0.3\end{bmatrix}$$

Likelihood:

$$\mathbf{z}_1 = \begin{pmatrix}6\\2\end{pmatrix}, \quad R = \begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix}$$

a) Prediction at t = 1:

$$\bar{P}_1 = P_0 + P_D = \begin{bmatrix}2.3 & 1.0\\ 1.0 & 3.3\end{bmatrix}, \qquad \bar{\mathbf{x}}_1 = \mathbf{x}_0 + \mathbf{v} = \begin{pmatrix}-1\\-1\end{pmatrix}$$

b) Fuse measurement at t = 1:

$$P_1 = \left(\bar{P}_1^{-1} + R^{-1}\right)^{-1} = \begin{bmatrix}0.674 & 0.076\\ 0.076 & 0.750\end{bmatrix}$$

$$\hat{\mathbf{x}}_1 = P_1\left(\bar{P}_1^{-1}\bar{\mathbf{x}}_1 + R^{-1}\mathbf{z}_1\right) = \begin{pmatrix}3.95\\ 1.78\end{pmatrix}$$
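A small Matlab sketch reproducing both steps (not part of the original slides):

P0 = [2 1; 1 3];   x0 = [-1; -1];   % prior at t = 0
PD = 0.3*eye(2);   v  = [0; 0];     % dynamics noise and mean displacement
R  = eye(2);       z1 = [6; 2];     % measurement at t = 1
Pp = P0 + PD;      xp = x0 + v;     % a) prediction: [2.3 1; 1 3.3] and (-1, -1)
P1 = inv(inv(Pp) + inv(R));         % b) fuse:       [0.674 0.076; 0.076 0.750]
x1 = P1 * (Pp\xp + R\z1)            %                (3.95, 1.78)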

Berkeley tracker

In action …

Page 30: Lectures 3 & 4: Estimators

Confidence regions

• To quantify confidence and uncertainty, define a confidence region R about a point x (e.g. the mode) such that at a confidence level c ≤ 1

$$p(x \in R) = c$$

• we can then say (for example) there is a 99% probability that the true value is in R

• e.g. for a univariate normal distribution N(µ, σ²)

$$p(|x - \mu| < \sigma) \approx 0.68, \qquad p(|x - \mu| < 2\sigma) \approx 0.95, \qquad p(|x - \mu| < 3\sigma) \approx 0.997$$

[Figure: univariate normal density with the ±σ, ±2σ, ±3σ regions indicated]

• In 2D confidence regions are usually chosen as ellipses:

d    confidence
1    39%
2    86%
3    99%

[Figure: 2D Gaussian with confidence ellipses at d = 1, 2, 3]

For an n-dimensional normal distribution with mean zero and covariance Σ,

$$\mathbf{x}^\top \Sigma^{-1} \mathbf{x}$$

is distributed as a χ² distribution on n degrees of freedom. In 2D the region enclosed by the ellipse

$$\mathbf{x}^\top \Sigma^{-1} \mathbf{x} = d^2$$

defines a confidence region with

$$p(\chi_2^2 < d^2) = c$$

where c can be obtained from standard χ² tables (e.g. HLT) or computed numerically.
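The table above can be reproduced directly: for two degrees of freedom the χ² CDF is 1 − e^{−d²/2}. A small Matlab check (not part of the slides):

d = [1 2 3];
c = 1 - exp(-d.^2/2)           % = 0.39, 0.86, 0.99 (to 2 d.p.)
% equivalently, with the Statistics Toolbox: c = chi2cdf(d.^2, 2)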

Page 31: Lectures 3 & 4: Estimators

MAP Application 2: Image Restoration

[Figure: a 1D binary signal u_i ∈ {0, 1} plotted against i]

Suppose the sensor model is

$$z_i = u_i + w_i, \qquad u_i \in \{0, 1\}, \quad w_i \sim N(0, \sigma^2)$$

Then the likelihood is

$$p(\mathbf{z}|\mathbf{u}) \sim \prod_i e^{-(z_i - u_i)^2/(2\sigma^2)}$$

Assume the prior has the form

$$p(\mathbf{u}) \sim \prod_i e^{-\lambda(u_i - u_{i-1})^2}$$

Then the negative log of the posterior p(u|z) is, up to scale,

$$C = \sum_i (z_i - u_i)^2 + d\,(u_i - u_{i-1})^2$$

where the first term comes from the likelihood, the second from the prior (with d = 2λσ²), and the MAP estimate is

$$\mathbf{u}_{MAP} = \arg\min_{\mathbf{u}} C(\mathbf{u})$$

[Figure: the original binary image, the original plus noise, and the MAP restoration, for d = 10]
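For a single 1D signal the cost C(u) can be minimized exactly by dynamic programming, since each u_i is binary and the prior couples only neighbouring samples. A Matlab sketch of this approach on synthetic data (not the slides' own implementation, which is unspecified):

n = 200; sigma = 0.4; d = 10;
u_true = double(mod(floor((0:n-1)/25), 2));   % hypothetical piecewise-constant signal
z = u_true + sigma*randn(1, n);               % noisy measurements z_i = u_i + w_i
states = [0 1];
C = zeros(2, n); B = ones(2, n);              % cost-to-here and back-pointers
C(:, 1) = (z(1) - states').^2;
for i = 2:n
  for s = 1:2
    [cmin, B(s, i)] = min(C(:, i-1) + d*(states(s) - states').^2);
    C(s, i) = cmin + (z(i) - states(s))^2;
  end
end
u_map = zeros(1, n);                          % backtrack the optimal labelling
[cbest, s] = min(C(:, n)); u_map(n) = states(s);
for i = n-1:-1:1
  s = B(s, i+1); u_map(i) = states(s);
end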

Page 32: Lectures 3 & 4: Estimators

[Figure: the original binary image, the original plus noise, and the MAP restoration, for d = 60]

Page 33: Lectures 3 & 4: Estimators

Loss Functions

[Figure: a posterior p(θ|z) ∼ p(z|θ)p(θ) with the MAP estimate at its mode]

Example: hitting a target

Loss function (x̂ is the chosen value, x the actual value):

$$L(\hat{x}, x) = \begin{cases} 1 & \text{if } |\hat{x} - x| > t \\ 0 & \text{if } |\hat{x} - x| \le t \end{cases}$$

Expected loss:

$$\bar{L}(\hat{x}) = \int p(x|\mathbf{z})\, L(\hat{x}, x)\, dx$$

[Figure: a target of width 2t centred on the chosen value x̂; the actual value x incurs zero loss if it falls within the target]

Choose x̂ to minimize the expected loss:

$$\hat{x} = \arg\min_{\hat{x}} \int p(x|\mathbf{z})\, L(\hat{x}, x)\, dx$$

Example 1: squared loss function

$$L(\hat{x}, x) = (\hat{x} - x)^2$$

This results in choosing the mean value:

$$\mu = \arg\min_u \int (x - u)^2\, p(x)\, dx$$

Sketch proof: differentiate w.r.t. u,

$$\int 2(u - x)\, p(x)\, dx = 0 \quad\Rightarrow\quad u = \int x\, p(x)\, dx = \mu$$

Page 34: Lectures 3 & 4: Estimators

Choose x̂ to minimize the expected loss:

$$\hat{x} = \arg\min_{\hat{x}} \int p(x|\mathbf{z})\, L(\hat{x}, x)\, dx$$

Example 2: absolute loss function

$$L(\hat{x}, x) = |\hat{x} - x|$$

This results in choosing the median value:

$$x_{med} = \arg\min_u \int |x - u|\, p(x)\, dx$$

[proof: exercise]
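Both results can be checked numerically; a small Matlab sketch (not from the slides), using samples from a skewed distribution as a stand-in posterior:

x = exp(randn(1, 20000));            % lognormal samples: mean ~1.65, median ~1
u = linspace(0, 4, 401);             % candidate point estimates
sqloss = zeros(size(u)); absloss = zeros(size(u));
for k = 1:numel(u)
  sqloss(k)  = mean((x - u(k)).^2);  % expected squared loss
  absloss(k) = mean(abs(x - u(k)));  % expected absolute loss
end
[m1, i1] = min(sqloss); [m2, i2] = min(absloss);
[u(i1) mean(x)]                      % squared-loss minimizer is close to the mean
[u(i2) median(x)]                    % absolute-loss minimizer is close to the median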

Page 35: Lectures 3 & 4: Estimators

Warning: MAP and change of parameterization

[Figure: a posterior p(θ|z) and, after the non-linear map θ′ = g(θ), the transformed posterior p(θ′|z); the MAP estimates of the two do not correspond under g]

Problem:

• MAP finds the maximum value of the posterior, but the posterior is a probability density, and a density changes by a Jacobian factor under a non-linear reparameterization, so its maximum can move.

Solution:

• Loss functions transform correctly under a change of coordinates (reparameterization).

Example of non-linear map: perspective projection