Estimation Theory


Transcript of Estimation Theory

Page 1: Estimation Theory

Estimation Theory

(Introduction) Estimation Theory

Page 2: Estimation Theory

Course Objective

Extract information from noisy signals

Parameter Estimation Problem : Given a set of measured data

x[0], x[1], . . . , x[N − 1]

which depends on an unknown parameter vector θ, determine an estimator

θ̂ = g(x[0], x[1], . . . , x[N − 1])

where g is some function.

Applications : Image processing, communications, biomedicine, system identification, state estimation in control, etc.

(Introduction) Estimation Theory

Page 3: Estimation Theory

Some Examples

Range estimation : We transmit a pulse that is reflected by the aircraft. An echo is received after τ seconds. The range θ is estimated from the equation θ = τc/2, where c is the speed of light.

System identification : The input of a plant is excited with u and the output signal y is measured. If we have

y[k] = G(q⁻¹, θ) u[k] + n[k]

where n[k] is the measurement noise, the model parameters are estimated using u[k] and y[k] for k = 1, . . . , N.

DC level in noise : Consider a set of data x[0], x[1], . . . , x[N − 1] that can be modeled as

x[n] = A + w[n]

where w[n] is some zero-mean noise process. A can be estimated using the measured data set.

(Introduction) Estimation Theory

Page 4: Estimation Theory

Outline

Classical estimation (θ deterministic)

Minimum Variance Unbiased Estimator (MVU)

Cramer-Rao Lower Bound (CRLB)

Best Linear Unbiased Estimator (BLUE)

Maximum Likelihood Estimator (MLE)

Least Squares Estimator (LSE)

Bayesian estimation (θ stochastic)

Minimum Mean Square Error Estimator (MMSE)

Maximum A Posteriori Estimator (MAP)

Linear MMSE Estimator

Kalman Filter

(Introduction) Estimation Theory

Page 5: Estimation Theory

References

Main reference :

Fundamentals of Statistical Signal Processing: Estimation Theory

by Steven M. Kay, Prentice-Hall, 1993 (available in Library de La Fontaine, RLC). We cover Chapters 1 to 14, skipping Chapter 5 and Chapter 9.

Other references :

Lessons in Estimation Theory for Signal Processing, Communications and Control. By Jerry M. Mendel, Prentice-Hall, 1995.

Probability, Random Processes and Estimation Theory for Engineers. By Henry Stark and John W. Woods, Prentice-Hall, 1986.

(Introduction) Estimation Theory

Page 6: Estimation Theory

Review of Probability and Random Variables

(Probability and Random Variables) Estimation Theory

Page 7: Estimation Theory

Random Variables

Random Variable : A rule X(·) that assigns to every element of a sample space Ω a real value is called a RV. So X is not really a variable that varies randomly but a function whose domain is Ω and whose range is some subset of the real line.

Example : Consider the experiment of flipping a coin twice. The sample space (the possible outcomes) is :

Ω = {HH, HT, TH, TT}

We can define a random variable X such that

X(HH) = 1, X(HT) = 1.1, X(TH) = 1.6, X(TT) = 1.8

The random variable X assigns to each event (e.g. E = {HT, TH} ⊂ Ω) a subset of the real line (in this case B = {1.1, 1.6}).

(Probability and Random Variables) Estimation Theory

Page 8: Estimation Theory

Probability Distribution Function

For any element ζ in Ω, the event {ζ | X(ζ) ≤ x} is an important event. The probability of this event,

Pr[{ζ | X(ζ) ≤ x}] = P_X(x)

is called the probability distribution function of X.

Example : For the random variable defined earlier, we have :

P_X(1.5) = Pr[{ζ | X(ζ) ≤ 1.5}] = Pr[{HH, HT}] = 0.5

P_X(x) can be computed for all x ∈ R. It is clear that 0 ≤ P_X(x) ≤ 1.

Remark :

For the same experiment (flipping a coin twice) we could define another random variable that would lead to a different P_X(x).

In most engineering problems the sample space is a subset of the real line, so X(ζ) = ζ and P_X(x) is a continuous function of x.

(Probability and Random Variables) Estimation Theory

Page 9: Estimation Theory

Probability Density Function (PDF)

The Probability Density Function, if it exists, is given by :

p_X(x) = dP_X(x)/dx

When we deal with a single random variable the subscripts are removed :

p(x) = dP(x)/dx

Properties :

(i) ∫_{−∞}^{∞} p(x) dx = P(∞) − P(−∞) = 1

(ii) Pr[{ζ | X(ζ) ≤ x}] = Pr[X ≤ x] = P(x) = ∫_{−∞}^{x} p(α) dα

(iii) Pr[x1 < X ≤ x2] = ∫_{x1}^{x2} p(x) dx

(Probability and Random Variables) Estimation Theory

Page 10: Estimation Theory

Gaussian Probability Density Function

A random variable is distributed according to a Gaussian or normal distribution if the PDF is given by :

p(x) = (1/√(2πσ²)) exp[−(x − µ)²/(2σ²)]

The PDF has two parameters : µ, the mean, and σ², the variance.

We write X ∼ N(µ, σ²) when the random variable X has a normal (Gaussian) distribution with mean µ and standard deviation σ. Small σ means small variability (uncertainty) and large σ means large variability.

Remark : The Gaussian distribution is important because, according to the Central Limit Theorem, the (suitably normalized) sum of N independent RVs has a PDF that converges to a Gaussian distribution as N goes to infinity.

(Probability and Random Variables) Estimation Theory

Page 11: Estimation Theory

Some other common PDF

Uniform (b > a) : p(x) = 1/(b − a) for a < x < b, and 0 otherwise

Exponential (µ > 0) : p(x) = (1/µ) exp(−x/µ) u(x)

Rayleigh (σ > 0) : p(x) = (x/σ²) exp(−x²/(2σ²)) u(x)

Chi-square χ²_n : p(x) = [1/(2^{n/2} Γ(n/2))] x^{n/2−1} exp(−x/2) for x > 0, and 0 for x < 0

where Γ(z) = ∫_0^∞ t^{z−1} e^{−t} dt and u(x) is the unit step function.

(Probability and Random Variables) Estimation Theory

Page 12: Estimation Theory

Joint, Marginal and Conditional PDF

Joint PDF : Consider two random variables X and Y, then :

Pr[x1 < X ≤ x2 and y1 < Y ≤ y2] = ∫_{x1}^{x2} ∫_{y1}^{y2} p(x, y) dy dx

Marginal PDF :

p(x) = ∫_{−∞}^{∞} p(x, y) dy   and   p(y) = ∫_{−∞}^{∞} p(x, y) dx

Conditional PDF : p(x|y) is defined as the PDF of X conditioned on knowing the value of Y.

Bayes’ Formula : Consider two RVs defined on the same probability space; then we have :

p(x, y) = p(x|y) p(y) = p(y|x) p(x)   or   p(x|y) = p(x, y)/p(y)

(Probability and Random Variables) Estimation Theory

Page 13: Estimation Theory

Independent Random Variables

Two RVs X and Y are independent if and only if :

p(x, y) = p(x) p(y)

A direct conclusion is that :

p(x|y) = p(x, y)/p(y) = p(x) p(y)/p(y) = p(x)   and   p(y|x) = p(y)

which means conditioning does not change the PDF.

Remark : For a jointly Gaussian PDF the contours of constant density are ellipses centered at (µx, µy). For independent X and Y the major (or minor) axis is parallel to the x or y axis.

(Probability and Random Variables) Estimation Theory

Page 14: Estimation Theory

Expected Value of a Random Variable

The expected value, if it exists, of a random variable X with PDF p(x) is defined by :

E(X) = ∫_{−∞}^{∞} x p(x) dx

Some properties of the expected value :

E{X + Y} = E{X} + E{Y}   and   E{aX} = aE{X}

The expected value of Y = g(X) can be computed by :

E(Y) = ∫_{−∞}^{∞} g(x) p(x) dx

Conditional expectation : The conditional expectation of X given that a specific value of Y has occurred is :

E(X|Y) = ∫_{−∞}^{∞} x p(x|y) dx

(Probability and Random Variables) Estimation Theory

Page 15: Estimation Theory

Moments of a Random Variable

The r-th moment of X is defined as :

E(X^r) = ∫_{−∞}^{∞} x^r p(x) dx

The first moment of X is its expected value or the mean (µ = E(X)).

Moments of Gaussian RVs : A Gaussian RV with N(µ, σ²) has moments of all orders in closed form :

E(X)  = µ
E(X²) = µ² + σ²
E(X³) = µ³ + 3µσ²
E(X⁴) = µ⁴ + 6µ²σ² + 3σ⁴
E(X⁵) = µ⁵ + 10µ³σ² + 15µσ⁴
E(X⁶) = µ⁶ + 15µ⁴σ² + 45µ²σ⁴ + 15σ⁶

(Probability and Random Variables) Estimation Theory

Page 16: Estimation Theory

Central Moments

The r-th central moment of X is defined as :

E[(X − µ)^r] = ∫_{−∞}^{∞} (x − µ)^r p(x) dx

The second central moment (variance) is denoted by σ² or var(X).

Central moments of Gaussian RVs :

E[(X − µ)^r] = 0 if r is odd,   σ^r (r − 1)!! if r is even

where n!! denotes the double factorial, i.e. the product of every odd number from n down to 1.

(Probability and Random Variables) Estimation Theory

Page 17: Estimation Theory

Some properties of Gaussian RVs

If X ∼ N(µx, σx²) then Z = (X − µx)/σx ∼ N(0, 1).

If Z ∼ N(0, 1) then X = σx Z + µx ∼ N(µx, σx²).

If X ∼ N(µx, σx²) then Z = aX + b ∼ N(aµx + b, a²σx²).

If X ∼ N(µx, σx²) and Y ∼ N(µy, σy²) are two independent RVs, then

aX + bY ∼ N(aµx + bµy, a²σx² + b²σy²)

The sum of squares of n independent RVs with standard normal distribution N(0, 1) has a χ²_n distribution with n degrees of freedom. For large n, χ²_n converges to N(n, 2n).

The Euclidean norm √(X² + Y²) of two independent RVs with standard normal distribution has the Rayleigh distribution.

(Probability and Random Variables) Estimation Theory

Page 18: Estimation Theory

Covariance

For two RVs X and Y, the covariance is defined as

σxy = E[(X − µx)(Y − µy)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − µx)(y − µy) p(x, y) dx dy

If X and Y are zero mean then σxy = E{XY}.

var(X + Y) = σx² + σy² + 2σxy        var(aX) = a²σx²

Important formula : The relation between the variance and the mean of X is given by

σ² = E[(X − µ)²] = E(X²) − 2µE(X) + µ² = E(X²) − µ²

The variance is the mean of the square minus the square of the mean.

(Probability and Random Variables) Estimation Theory

Page 19: Estimation Theory

Independence, Uncorrelatedness and Orthogonality

If σxy = 0, then X and Y are uncorrelated and E{XY} = E{X}E{Y}.

X and Y are called orthogonal if E{XY} = 0.

If X and Y are independent then they are uncorrelated :

p(x, y) = p(x) p(y)   ⇒   E{XY} = E{X}E{Y}

Uncorrelatedness does not imply independence. For example, if X is a normal RV with zero mean and Y = X², then X and Y are clearly dependent, but

σxy = E{XY} − E{X}E{Y} = E{X³} − 0 = 0

Correlation only captures the linear dependence between two RVs, so it is weaker than independence.

For jointly Gaussian RVs, independence is equivalent to being uncorrelated.

(Probability and Random Variables) Estimation Theory

Page 20: Estimation Theory

Random Vectors

Random Vector : is a vector of random variables¹ :

x = [x1, x2, . . . , xn]^T

Expectation Vector : µx = E(x) = [E(x1), E(x2), . . . , E(xn)]^T

Covariance Matrix : Cx = E[(x − µx)(x − µx)^T]

Cx is an n × n symmetric matrix which is assumed to be positive definite and hence invertible.

The elements of this matrix are [Cx]_{ij} = E{[xi − E(xi)][xj − E(xj)]}. If the random variables are uncorrelated then Cx is a diagonal matrix.

Multivariate Gaussian PDF :

p(x) = 1/√((2π)^n det(Cx)) exp[−(1/2)(x − µx)^T Cx⁻¹ (x − µx)]

1. In some books (including our main reference) there is no distinction between a random variable X and its specific value x. From now on we adopt the notation of our reference.

(Probability and Random Variables) Estimation Theory

Page 21: Estimation Theory

Random Processes

Discrete Random Process : x[n] is a sequence of random variables defined for every integer n.

Mean value : defined as E(x[n]) = µx[n].

Autocorrelation Function (ACF) : defined as

rxx[k, n] = E(x[n] x[n + k])

Wide Sense Stationary (WSS) : x[n] is WSS if its mean and its autocorrelation function (ACF) do not depend on n.

Autocovariance function : defined as

cxx[k] = E[(x[n] − µx)(x[n + k] − µx)] = rxx[k] − µx²

Cross-correlation Function (CCF) : defined as

rxy[k] = E(x[n] y[n + k])

Cross-covariance function : defined as

cxy[k] = E[(x[n] − µx)(y[n + k] − µy)] = rxy[k] − µx µy

(Probability and Random Variables) Estimation Theory

Page 22: Estimation Theory

Discrete White Noise

Some properties of the ACF and CCF :

rxx[0] ≥ |rxx[k]|,   rxx[k] = rxx[−k],   rxy[k] = ryx[−k]

Power Spectral Density : The Fourier transform of the ACF and CCF gives the auto-PSD and cross-PSD :

Pxx(f) = Σ_{k=−∞}^{∞} rxx[k] exp(−j2πfk)

Pxy(f) = Σ_{k=−∞}^{∞} rxy[k] exp(−j2πfk)

Discrete White Noise : a discrete random process with zero mean and rxx[k] = σ²δ[k], where δ[k] is the Kronecker delta function. The PSD of white noise is Pxx(f) = σ², completely flat with frequency.

(Probability and Random Variables) Estimation Theory

Page 23: Estimation Theory

Introduction and Minimum Variance Unbiased Estimation

(Minimum Variance Unbiased Estimation) Estimation Theory

Page 24: Estimation Theory

The Mathematical Estimation Problem

Parameter Estimation Problem : Given a set of measured data

x = {x[0], x[1], . . . , x[N − 1]}

which depends on an unknown parameter vector θ, determine an estimator

θ̂ = g(x[0], x[1], . . . , x[N − 1])

where g is some function.

The first step is to find the PDF of the data as a function of θ : p(x; θ).

Example : Consider the problem of a DC level in white Gaussian noise with a single observation x[0] = θ + w[0], where w[0] has the PDF N(0, σ²). Then the PDF of x[0] is :

p(x[0]; θ) = (1/√(2πσ²)) exp[−(1/(2σ²))(x[0] − θ)²]

(Minimum Variance Unbiased Estimation) Estimation Theory

Page 25: Estimation Theory

The Mathematical Estimation Problem

Example : Consider a data sequence that can be modeled as a linear trend in white Gaussian noise :

x[n] = A + Bn + w[n],   n = 0, 1, . . . , N − 1

Suppose that w[n] ∼ N(0, σ²) and is uncorrelated with all the other samples. Letting θ = [A B] and x = [x[0], x[1], . . . , x[N − 1]], the PDF is :

p(x; θ) = Π_{n=0}^{N−1} p(x[n]; θ) = (2πσ²)^{−N/2} exp[−(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A − Bn)²]

The quality of any estimator for this problem is related to the assumptions on the data model — in this example, the linear trend and the WGN PDF assumption.

(Minimum Variance Unbiased Estimation) Estimation Theory

Page 26: Estimation Theory

The Mathematical Estimation Problem

Classical versus Bayesian estimation

If we assume θ is deterministic, we have a classical estimation problem. The following methods will be studied : MVU, MLE, BLUE, LSE.

If we assume θ is a random variable with a known PDF, then we have a Bayesian estimation problem. In this case the data are described by the joint PDF

p(x, θ) = p(x|θ) p(θ)

where p(θ) summarizes our knowledge about θ before any data are observed and p(x|θ) summarizes the knowledge provided by the data x conditioned on knowing θ. The following methods will be studied : MMSE, MAP, Kalman Filter.

(Minimum Variance Unbiased Estimation) Estimation Theory

Page 27: Estimation Theory

Assessing Estimator Performance

Consider the problem of estimating a DC level A in uncorrelated noise :

x[n] = A + w[n],   n = 0, 1, . . . , N − 1

Consider the following estimators :

Â1 = (1/N) Σ_{n=0}^{N−1} x[n]

Â2 = x[0]

Suppose that A = 1, Â1 = 0.95 and Â2 = 0.98. Which estimator is better?

An estimator is a random variable, so its performance can only be described by its PDF or statistically (e.g. by Monte-Carlo simulation).

(Minimum Variance Unbiased Estimation) Estimation Theory

Page 28: Estimation Theory

Unbiased Estimators

An estimator that on average yields the true value is unbiased. Mathematically :

E(θ̂ − θ) = 0   for a < θ < b

Let's compute the expectation of the two estimators Â1 and Â2 :

E(Â1) = (1/N) Σ_{n=0}^{N−1} E(x[n]) = (1/N) Σ_{n=0}^{N−1} E(A + w[n]) = (1/N) Σ_{n=0}^{N−1} (A + 0) = A

E(Â2) = E(x[0]) = E(A + w[0]) = A + 0 = A

Both estimators are unbiased. Which one is better? Now, let's compute the variance of the two estimators :

var(Â1) = var[(1/N) Σ_{n=0}^{N−1} x[n]] = (1/N²) Σ_{n=0}^{N−1} var(x[n]) = (1/N²) Nσ² = σ²/N

var(Â2) = var(x[0]) = σ² > var(Â1)
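As a quick numerical illustration (not from the slides), the following minimal Python/NumPy sketch runs a Monte-Carlo simulation of the two estimators; the values of A, σ, N and the number of runs M are arbitrary assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
A, sigma, N, M = 1.0, 1.0, 50, 10000       # assumed DC level, noise std, samples, MC runs

x = A + sigma * rng.standard_normal((M, N))  # M realizations of x[0..N-1]
A1 = x.mean(axis=1)                          # sample-mean estimator for each run
A2 = x[:, 0]                                 # single-sample estimator for each run

# Both estimators should be (empirically) unbiased; their variances should be
# close to sigma^2/N and sigma^2 respectively.
print("mean of A1:", A1.mean(), " var of A1:", A1.var(), " (sigma^2/N =", sigma**2 / N, ")")
print("mean of A2:", A2.mean(), " var of A2:", A2.var(), " (sigma^2   =", sigma**2, ")")
```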

(Minimum Variance Unbiased Estimation) Estimation Theory

Page 29: Estimation Theory

Unbiased Estimators

Remark : When several unbiased estimators of the same parameter from independent sets of data are available, i.e. θ̂1, θ̂2, . . . , θ̂n, a better estimator can be obtained by averaging :

θ̂ = (1/n) Σ_{i=1}^{n} θ̂i   ⇒   E(θ̂) = θ

Assuming that the estimators have the same variance, we have :

var(θ̂) = (1/n²) Σ_{i=1}^{n} var(θ̂i) = (1/n²) n var(θ̂i) = var(θ̂i)/n

By increasing n, the variance will decrease (if n → ∞, θ̂ → θ).

This is not the case for biased estimators, no matter how many estimators are averaged.

(Minimum Variance Unbiased Estimation) Estimation Theory

Page 30: Estimation Theory

Minimum Variance Criterion

The most logical criterion for estimation is the Mean Square Error (MSE) :

mse(θ̂) = E[(θ̂ − θ)²]

Unfortunately this criterion generally leads to unrealizable estimators (the estimator depends on θ) :

mse(θ̂) = E{[θ̂ − E(θ̂) + E(θ̂) − θ]²} = E{[θ̂ − E(θ̂) + b(θ)]²}

where b(θ) = E(θ̂) − θ is defined as the bias of the estimator. Therefore :

mse(θ̂) = E{[θ̂ − E(θ̂)]²} + 2b(θ)E[θ̂ − E(θ̂)] + b²(θ) = var(θ̂) + b²(θ)

Instead of minimizing the MSE we can minimize the variance over the unbiased estimators :

Minimum Variance Unbiased Estimator

(Minimum Variance Unbiased Estimation) Estimation Theory

Page 31: Estimation Theory

Minimum Variance Unbiased Estimator

Existence of the MVU Estimator : In general, the MVU estimator does not always exist : there may be no unbiased estimator at all, or none of the unbiased estimators may have uniformly minimum variance.

Finding the MVU Estimator : There is no known procedure which always leads to the MVU estimator. Three existing approaches are :

1 Determine the Cramer-Rao lower bound (CRLB) and check to see ifsome estimator satisfies it.

2 Apply the Rao-Blackwell-Lehmann-Scheffe theorem (we will skip it).

3 Restrict to linear unbiased estimators.

(Minimum Variance Unbiased Estimation) Estimation Theory Spring 2013 31 / 152

Page 32: Estimation Theory

Cramer-Rao Lower Bound

(Cramer-Rao Lower Bound) Estimation Theory

Page 33: Estimation Theory

Cramer-Rao Lower Bound

CRLB is a lower bound on the variance of any unbiased estimator.

var(θ̂) ≥ CRLB(θ)

Note that the CRLB is a function of θ.

It tells us the best performance that can be achieved (useful in feasibility studies and for comparison with other estimators).

It may lead us to compute the MVU estimator.

(Cramer-Rao Lower Bound) Estimation Theory

Page 34: Estimation Theory

Cramer-Rao Lower Bound

Theorem (scalar case)

Assume that the PDF p(x; θ) satisfies the regularity condition

E[∂ ln p(x; θ)/∂θ] = 0   for all θ

Then the variance of any unbiased estimator θ̂ satisfies

var(θ̂) ≥ [−E(∂² ln p(x; θ)/∂θ²)]⁻¹

An unbiased estimator that attains the CRLB can be found iff :

∂ ln p(x; θ)/∂θ = I(θ)(g(x) − θ)

for some functions g(x) and I(θ). The estimator is θ̂ = g(x) and the minimum variance is 1/I(θ).

(Cramer-Rao Lower Bound) Estimation Theory

Page 35: Estimation Theory

Cramer-Rao Lower Bound

Example : Consider x[0] = A + w[0] with w[0] ∼ N(0, σ²).

p(x[0]; A) = (1/√(2πσ²)) exp[−(1/(2σ²))(x[0] − A)²]

ln p(x[0]; A) = −ln √(2πσ²) − (1/(2σ²))(x[0] − A)²

Then

∂ ln p(x[0]; A)/∂A = (1/σ²)(x[0] − A)   ⇒   −∂² ln p(x[0]; A)/∂A² = 1/σ²

According to the theorem :

var(Â) ≥ σ²,   I(A) = 1/σ²,   and   Â = g(x[0]) = x[0]

(Cramer-Rao Lower Bound) Estimation Theory

Page 36: Estimation Theory

Cramer-Rao Lower Bound

Example : Consider multiple observations of a DC level in WGN :

x[n] = A + w[n],   n = 0, 1, . . . , N − 1,   with w[n] ∼ N(0, σ²)

p(x; A) = (2πσ²)^{−N/2} exp[−(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)²]

Then

∂ ln p(x; A)/∂A = ∂/∂A [−ln[(2πσ²)^{N/2}] − (1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)²]
                = (1/σ²) Σ_{n=0}^{N−1} (x[n] − A) = (N/σ²) [(1/N) Σ_{n=0}^{N−1} x[n] − A]

According to the theorem :

var(Â) ≥ σ²/N,   I(A) = N/σ²,   and   Â = g(x) = (1/N) Σ_{n=0}^{N−1} x[n]

(Cramer-Rao Lower Bound) Estimation Theory

Page 37: Estimation Theory

Transformation of Parameters

If it is desired to estimate α = g(θ), then the CRLB is :

var(α̂) ≥ [−E(∂² ln p(x; θ)/∂θ²)]⁻¹ (∂g/∂θ)²

Example : Compute the CRLB for estimation of the power (A²) of a DC level in noise :

var(Â²) ≥ (σ²/N)(2A)² = 4A²σ²/N

Definition

Efficient estimator : An unbiased estimator that attains the CRLB is said to be efficient.

Example : Knowing that x̄ = (1/N) Σ_{n=0}^{N−1} x[n] is an efficient estimator for A, is x̄² an efficient estimator for A²?

(Cramer-Rao Lower Bound) Estimation Theory

Page 38: Estimation Theory

Transformation of Parameters

Solution : Knowing that x̄ ∼ N(A, σ²/N), we have :

E(x̄²) = E²(x̄) + var(x̄) = A² + σ²/N ≠ A²

So the estimator Â² = x̄² is not even unbiased. Let's look at the variance of this estimator :

var(x̄²) = E(x̄⁴) − E²(x̄²)

From the moments of Gaussian RVs (slide 15) :

E(x̄⁴) = A⁴ + 6A²(σ²/N) + 3(σ²/N)²

Therefore :

var(x̄²) = A⁴ + 6A²(σ²/N) + 3σ⁴/N² − (A² + σ²/N)² = 4A²σ²/N + 2σ⁴/N²
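A minimal Monte-Carlo check of these expressions (not part of the slides); the values of A, σ, N and the number of runs are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
A, sigma, N, M = 1.0, 1.0, 20, 200000     # assumed DC level, noise std, samples, MC runs

xbar = (A + sigma * rng.standard_normal((M, N))).mean(axis=1)
est = xbar**2                             # estimator of A^2 based on the sample mean

print("bias:", est.mean() - A**2, " predicted:", sigma**2 / N)
print("var :", est.var(), " predicted:", 4*A**2*sigma**2/N + 2*sigma**4/N**2)
```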

(Cramer-Rao Lower Bound) Estimation Theory

Page 39: Estimation Theory

Transformation of Parameters

Remarks :

The estimator Â² = x̄² is biased and not efficient.

As N → ∞ the bias goes to zero and the variance of the estimator approaches the CRLB. Such estimators are called asymptotically efficient.

General Remarks :

If g(θ) = aθ + b is an affine function of θ, then ĝ(θ) = g(θ̂) is an efficient estimator. First, it is unbiased : E(aθ̂ + b) = aθ + b = g(θ); moreover,

var(ĝ(θ)) ≥ (∂g/∂θ)² var(θ̂) = a² var(θ̂)

but var(ĝ(θ)) = var(aθ̂ + b) = a² var(θ̂), so the CRLB is achieved.

If g(θ) is a nonlinear function of θ and θ̂ is an efficient estimator, then g(θ̂) is an asymptotically efficient estimator.

(Cramer-Rao Lower Bound) Estimation Theory

Page 40: Estimation Theory

Cramer-Rao Lower Bound

Theorem (Vector Parameter)

Assume that the PDF p(x; θ) satisfies the regularity condition

E[∂ ln p(x; θ)/∂θ] = 0   for all θ

Then the covariance matrix of any unbiased estimator θ̂ satisfies C_θ̂ − I⁻¹(θ) ≥ 0, where ≥ 0 means that the matrix is positive semidefinite. I(θ) is called the Fisher information matrix and is given by :

[I(θ)]_{ij} = −E(∂² ln p(x; θ)/∂θi ∂θj)

An unbiased estimator that attains the CRLB can be found iff :

∂ ln p(x; θ)/∂θ = I(θ)(g(x) − θ)
(Cramer-Rao Lower Bound) Estimation Theory

Page 41: Estimation Theory

CRLB Extension to Vector Parameter

Example : Consider a DC level in WGN with A and σ² unknown. Compute the CRLB for estimation of θ = [A σ²]^T.

ln p(x; θ) = −(N/2) ln 2π − (N/2) ln σ² − (1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)²

The Fisher information matrix is :

I(θ) = −E [ ∂² ln p/∂A²       ∂² ln p/∂A∂σ²
            ∂² ln p/∂A∂σ²    ∂² ln p/∂(σ²)² ]  =  [ N/σ²   0
                                                    0       N/(2σ⁴) ]

The matrix is diagonal (just for this example) and can be easily inverted to yield :

var(Â) ≥ σ²/N        var(σ̂²) ≥ 2σ⁴/N

Is there any unbiased estimator that achieves these bounds?

(Cramer-Rao Lower Bound) Estimation Theory Spring 2013 41 / 152

Page 42: Estimation Theory

Transformation of Parameters

If it is desired to estimate α = g(θ), and the CRLB for the covariance of θ̂ is I⁻¹(θ), then :

C_α̂ ≥ (∂g/∂θ) [−E(∂² ln p(x; θ)/∂θ²)]⁻¹ (∂g/∂θ)^T

Example : Consider a DC level in WGN with A and σ² unknown. Compute the CRLB for estimation of the signal-to-noise ratio α = A²/σ².

We have θ = [A σ²]^T and α = g(θ) = θ1²/θ2, so the Jacobian is :

∂g(θ)/∂θ = [∂g(θ)/∂θ1   ∂g(θ)/∂θ2] = [2A/σ²   −A²/σ⁴]

So the CRLB is :

var(α̂) ≥ [2A/σ²   −A²/σ⁴] [N/σ²  0 ; 0  N/(2σ⁴)]⁻¹ [2A/σ²   −A²/σ⁴]^T = (4α + 2α²)/N

(Cramer-Rao Lower Bound) Estimation Theory Spring 2013 42 / 152

Page 43: Estimation Theory

Linear Models with WGN

If the N observed data samples can be modeled as

x = Hθ + w

where

x : N × 1 observation vector
H : N × p observation matrix (known, rank p)
θ : p × 1 vector of parameters to be estimated
w : N × 1 noise vector with PDF N(0, σ²I)

compute the CRLB and the MVU estimator that achieves this bound.

Step 1 : Compute ln p(x; θ).

Step 2 : Compute I(θ) = −E[∂² ln p(x; θ)/∂θ²] and the covariance matrix of θ̂ : C_θ̂ = I⁻¹(θ).

Step 3 : Find the MVU estimator g(x) by factoring

∂ ln p(x; θ)/∂θ = I(θ)[g(x) − θ]

(Linear Models) Estimation Theory Spring 2013 43 / 152

Page 44: Estimation Theory

Linear Models with WGN

Step 1 : ln p(x; θ) = −ln[(2πσ²)^{N/2}] − (1/(2σ²))(x − Hθ)^T(x − Hθ).

Step 2 :

∂ ln p(x; θ)/∂θ = −(1/(2σ²)) ∂/∂θ [x^T x − 2x^T Hθ + θ^T H^T Hθ] = (1/σ²)[H^T x − H^T Hθ]

Then I(θ) = −E[∂² ln p(x; θ)/∂θ²] = (1/σ²) H^T H.

Step 3 : Find the MVU estimator g(x) by factoring

∂ ln p(x; θ)/∂θ = I(θ)[g(x) − θ] = (H^T H/σ²)[(H^T H)⁻¹H^T x − θ]

Therefore :

θ̂ = g(x) = (H^T H)⁻¹H^T x        C_θ̂ = I⁻¹(θ) = σ²(H^T H)⁻¹

(Linear Models) Estimation Theory Spring 2013 44 / 152

Page 45: Estimation Theory

Linear Models with WGN

For a linear model with WGN, represented by x = Hθ + w, the MVU estimator is :

θ̂ = (H^T H)⁻¹H^T x

This estimator is efficient and attains the CRLB.

That the estimator is unbiased can be seen easily by :

E(θ̂) = (H^T H)⁻¹H^T E(Hθ + w) = θ

The statistical performance of θ̂ is completely specified because θ̂ is a linear transformation of the Gaussian vector x and hence has a Gaussian distribution :

θ̂ ∼ N(θ, σ²(H^T H)⁻¹)
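A minimal Python/NumPy sketch (not from the slides) of the linear-model MVU estimator on simulated data; the matrix H, the true θ and the noise level are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, sigma = 100, 3, 0.5
H = rng.standard_normal((N, p))          # assumed known observation matrix (full rank)
theta = np.array([1.0, -2.0, 0.5])       # assumed true parameter vector

x = H @ theta + sigma * rng.standard_normal(N)

# MVU estimator for the linear model with WGN: theta_hat = (H^T H)^-1 H^T x
theta_hat = np.linalg.solve(H.T @ H, H.T @ x)
C_theta = sigma**2 * np.linalg.inv(H.T @ H)   # covariance of the estimator

print("theta_hat:", theta_hat)
print("std devs :", np.sqrt(np.diag(C_theta)))
```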

(Linear Models) Estimation Theory Spring 2013 45 / 152

Page 46: Estimation Theory

Example (Curve Fitting)

Consider fitting the data x[n] by a p-th order polynomial function of n :

x[n] = θ0 + θ1 n + θ2 n² + · · · + θp n^p + w[n]

We have N data samples, so :

x = [x[0], x[1], . . . , x[N − 1]]^T
w = [w[0], w[1], . . . , w[N − 1]]^T
θ = [θ0, θ1, . . . , θp]^T

and x = Hθ + w, where H is the N × (p + 1) matrix

H = [ 1    0      0        · · ·   0
      1    1      1        · · ·   1
      1    2      4        · · ·   2^p
      ⋮     ⋮      ⋮         ⋱      ⋮
      1   N−1   (N−1)²    · · ·  (N−1)^p ]

Hence the MVU estimator is θ̂ = (H^T H)⁻¹H^T x.

(Linear Models) Estimation Theory Spring 2013 46 / 152

Page 47: Estimation Theory

Example (Fourier Analysis)

Consider the Fourier analysis of the data x[n] :

x[n] = Σ_{k=1}^{M} a_k cos(2πkn/N) + Σ_{k=1}^{M} b_k sin(2πkn/N) + w[n]

so we have θ = [a1, a2, . . . , aM, b1, b2, . . . , bM]^T and x = Hθ + w, where :

H = [h_1^a, h_2^a, . . . , h_M^a, h_1^b, h_2^b, . . . , h_M^b]

with

h_k^a = [cos(2πk·0/N), cos(2πk·1/N), . . . , cos(2πk(N−1)/N)]^T
h_k^b = [sin(2πk·0/N), sin(2πk·1/N), . . . , sin(2πk(N−1)/N)]^T

Hence the MVU estimate of the Fourier coefficients is θ̂ = (H^T H)⁻¹H^T x.

(Linear Models) Estimation Theory Spring 2013 47 / 152

Page 48: Estimation Theory

Example (Fourier Analysis)

After simplification (noting that (H^T H)⁻¹ = (2/N) I), we have :

θ̂ = (2/N) [(h_1^a)^T x, . . . , (h_M^a)^T x, (h_1^b)^T x, . . . , (h_M^b)^T x]^T

which is the same as the standard solution :

â_k = (2/N) Σ_{n=0}^{N−1} x[n] cos(2πkn/N),    b̂_k = (2/N) Σ_{n=0}^{N−1} x[n] sin(2πkn/N)

From the properties of linear models the estimates are unbiased.

The covariance matrix is :

C_θ̂ = σ²(H^T H)⁻¹ = (2σ²/N) I

Note that θ̂ is Gaussian and C_θ̂ is diagonal (the amplitude estimates are independent).
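A minimal Python/NumPy sketch (not from the slides) comparing the closed-form Fourier-coefficient estimates with the general linear-model solution; the number of harmonics, the true coefficients and the noise level are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
N, sigma = 64, 0.3
n = np.arange(N)

# Assumed true coefficients for a 2-harmonic signal
a_true, b_true = np.array([1.0, 0.5]), np.array([-0.7, 0.2])
x = sum(a_true[k-1]*np.cos(2*np.pi*k*n/N) + b_true[k-1]*np.sin(2*np.pi*k*n/N) for k in (1, 2))
x = x + sigma * rng.standard_normal(N)

# Closed-form estimates a_k = (2/N) sum_n x[n] cos(2 pi k n / N), similarly for b_k
a_hat = np.array([(2/N) * np.sum(x * np.cos(2*np.pi*k*n/N)) for k in (1, 2)])
b_hat = np.array([(2/N) * np.sum(x * np.sin(2*np.pi*k*n/N)) for k in (1, 2)])
print("a_hat:", a_hat, " b_hat:", b_hat)

# Same result from the general MVU/LS estimator (H^T H)^-1 H^T x
H = np.column_stack([np.cos(2*np.pi*k*n/N) for k in (1, 2)] +
                    [np.sin(2*np.pi*k*n/N) for k in (1, 2)])
print("via LS :", np.linalg.solve(H.T @ H, H.T @ x))
```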

(Linear Models) Estimation Theory Spring 2013 48 / 152

Page 49: Estimation Theory

Example (System Identification)

Consider identification of a Finite Impulse Response (FIR) model h[k], k = 0, 1, . . . , p − 1, with input u[n] and output x[n] given for n = 0, 1, . . . , N − 1 :

x[n] = Σ_{k=0}^{p−1} h[k] u[n − k] + w[n],   n = 0, 1, . . . , N − 1

The FIR model can be represented by the linear model x = Hθ + w where

θ = [h[0], h[1], . . . , h[p − 1]]^T   (p × 1)

H = [ u[0]      0         · · ·   0
      u[1]      u[0]      · · ·   0
      ⋮          ⋮           ⋱      ⋮
      u[N−1]   u[N−2]    · · ·   u[N−p] ]   (N × p)

The MVU estimate is θ̂ = (H^T H)⁻¹H^T x with C_θ̂ = σ²(H^T H)⁻¹.
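A minimal Python/NumPy sketch (not from the slides) of FIR identification via the linear model above; the impulse response, input and noise level are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, sigma = 200, 4, 0.1
h_true = np.array([1.0, 0.5, -0.3, 0.1])      # assumed FIR impulse response

u = rng.standard_normal(N)                     # known input sequence
H = np.zeros((N, p))                           # N x p convolution matrix: H[n, k] = u[n - k]
for k in range(p):
    H[k:, k] = u[:N - k]

x = H @ h_true + sigma * rng.standard_normal(N)
h_hat = np.linalg.solve(H.T @ H, H.T @ x)      # MVU / LS estimate of the impulse response
print("h_hat:", h_hat)
```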

(Linear Models) Estimation Theory Spring 2013 49 / 152

Page 50: Estimation Theory

Linear Models with Colored Gaussian Noise

Determine the MVU estimator for the linear model x = Hθ + w with w a colored Gaussian noise with PDF N(0, C).

Whitening approach : Since C is positive definite, its inverse can be factored as C⁻¹ = D^T D where D is an invertible matrix. This matrix acts as a whitening transformation for w :

E[(Dw)(Dw)^T] = E(D w w^T D^T) = D C D^T = D D⁻¹ D⁻ᵀ D^T = I

Now if we transform the linear model x = Hθ + w to :

x′ = Dx = DHθ + Dw = H′θ + w′

where w′ = Dw ∼ N(0, I) is white, we can compute the MVU estimator as :

θ̂ = (H′^T H′)⁻¹H′^T x′ = (H^T D^T D H)⁻¹H^T D^T D x

so we have :

θ̂ = (H^T C⁻¹H)⁻¹H^T C⁻¹x   with   C_θ̂ = (H′^T H′)⁻¹ = (H^T C⁻¹H)⁻¹
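A minimal Python/NumPy sketch (not from the slides) of the colored-noise estimator; the covariance model (exponentially decaying correlation), H and the true θ are arbitrary assumptions, and the estimator is computed with linear solves rather than an explicit whitening matrix.

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 80, 2
H = rng.standard_normal((N, p))
theta = np.array([2.0, -1.0])

# Assumed colored-noise covariance: C[i, j] = 0.5^|i-j|
C = 0.5 ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
w = rng.multivariate_normal(np.zeros(N), C)
x = H @ theta + w

# theta_hat = (H^T C^-1 H)^-1 H^T C^-1 x
Cinv_H = np.linalg.solve(C, H)
Cinv_x = np.linalg.solve(C, x)
theta_hat = np.linalg.solve(H.T @ Cinv_H, H.T @ Cinv_x)
print("theta_hat:", theta_hat)
```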

(Linear Models) Estimation Theory Spring 2013 50 / 152

Page 51: Estimation Theory

Linear Models with known components

Consider a linear model x = Hθ + s + w, where s is a known signal. To determine the MVU estimator let x′ = x − s, so that x′ = Hθ + w is a standard linear model. The MVU estimator is :

θ̂ = (H^T H)⁻¹H^T (x − s)   with   C_θ̂ = σ²(H^T H)⁻¹

Example : Consider a DC level plus an exponential in WGN : x[n] = A + r^n + w[n], where r is known. Then we have :

[x[0], x[1], . . . , x[N−1]]^T = [1, 1, . . . , 1]^T A + [1, r, . . . , r^{N−1}]^T + [w[0], w[1], . . . , w[N−1]]^T

The MVU estimator is :

Â = (H^T H)⁻¹H^T (x − s) = (1/N) Σ_{n=0}^{N−1} (x[n] − r^n)   with   var(Â) = σ²/N

(Linear Models) Estimation Theory Spring 2013 51 / 152

Page 52: Estimation Theory

Best Linear Unbiased Estimators (BLUE)

Problems of finding the MVU estimators :

The MVU estimator does not always exist or impossible to find.

The PDF of data may be unknown.

BLUE is a suboptimal estimator that :

restricts the estimate to be linear in the data : θ̂ = Ax

restricts the estimate to be unbiased : E(θ̂) = AE(x) = θ

minimizes the variance of the estimate;

needs only the mean and the variance of the data (not the PDF). As a result, in general, the PDF of the estimate cannot be computed.

Remark : The unbiasedness restriction implies a linear model for the data. However, the BLUE may still be used if the data are transformed suitably or the model is linearized.

(Best Linear Unbiased Estimators) Estimation Theory Spring 2013 52 / 152

Page 53: Estimation Theory

Finding the BLUE (Scalar Case)

1 Choose a linear estimator for the observed data x[n], n = 0, 1, . . . , N − 1 :

θ̂ = Σ_{n=0}^{N−1} a_n x[n] = a^T x   where a = [a0, a1, . . . , a_{N−1}]^T

2 Restrict the estimate to be unbiased :

E(θ̂) = Σ_{n=0}^{N−1} a_n E(x[n]) = θ

3 Minimize the variance :

var(θ̂) = E{[θ̂ − E(θ̂)]²} = E{[a^T x − a^T E(x)]²} = E{a^T [x − E(x)][x − E(x)]^T a} = a^T C a

(Best Linear Unbiased Estimators) Estimation Theory Spring 2013 53 / 152

Page 54: Estimation Theory

Finding the BLUE (Scalar Case)

Consider the problem of amplitude estimation of a known signal in noise :

x[n] = θ s[n] + w[n]

1 Choose a linear estimator : θ̂ = Σ_{n=0}^{N−1} a_n x[n] = a^T x

2 Restrict the estimate to be unbiased : E(θ̂) = a^T E(x) = a^T s θ = θ, hence a^T s = 1, where s = [s[0], s[1], . . . , s[N−1]]^T

3 Minimize a^T C a subject to a^T s = 1.

The constrained optimization can be solved using a Lagrange multiplier :

minimize J = a^T C a + λ(a^T s − 1)

The optimal solution is :

θ̂ = s^T C⁻¹x / (s^T C⁻¹s)   and   var(θ̂) = 1/(s^T C⁻¹s)

(Best Linear Unbiased Estimators) Estimation Theory Spring 2013 54 / 152

Page 55: Estimation Theory

Finding the BLUE (Vector Case)

Theorem (Gauss–Markov)

If the data are of the general linear model form

x = Hθ + w

where w is a noise vector with zero mean and covariance C (the PDF of w is otherwise arbitrary), then the BLUE of θ is :

θ̂ = (H^T C⁻¹H)⁻¹H^T C⁻¹x

and the covariance matrix of θ̂ is

C_θ̂ = (H^T C⁻¹H)⁻¹

Remark : If the noise is Gaussian then the BLUE is the MVU estimator.

(Best Linear Unbiased Estimators) Estimation Theory Spring 2013 55 / 152

Page 56: Estimation Theory

Finding the BLUE

Example : Consider the problem of a DC level in noise : x[n] = A + w[n], where w[n] has an unspecified PDF with var(w[n]) = σₙ². We have θ = A and H = 1 = [1, 1, . . . , 1]^T. The covariance matrix is :

C = diag(σ₀², σ₁², . . . , σ_{N−1}²)   ⇒   C⁻¹ = diag(1/σ₀², 1/σ₁², . . . , 1/σ_{N−1}²)

and hence the BLUE is :

Â = (H^T C⁻¹H)⁻¹H^T C⁻¹x = (Σ_{n=0}^{N−1} 1/σₙ²)⁻¹ Σ_{n=0}^{N−1} x[n]/σₙ²

and the minimum variance is :

var(Â) = (H^T C⁻¹H)⁻¹ = (Σ_{n=0}^{N−1} 1/σₙ²)⁻¹
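A minimal Python/NumPy sketch (not from the slides) of this BLUE as a weighted average; the noise variances and DC level are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
A, N = 1.0, 50
sigma2 = rng.uniform(0.1, 2.0, N)               # assumed known, per-sample noise variances
x = A + np.sqrt(sigma2) * rng.standard_normal(N)

# BLUE for a DC level with unequal noise variances: weights 1/sigma_n^2
w = 1.0 / sigma2
A_blue = np.sum(w * x) / np.sum(w)
var_blue = 1.0 / np.sum(w)
print("A_blue:", A_blue, " var:", var_blue, " (plain sample mean:", x.mean(), ")")
```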

(Best Linear Unbiased Estimators) Estimation Theory Spring 2013 56 / 152

Page 57: Estimation Theory

Maximum Likelihood Estimation

(Maximum Likelihood Estimation) Estimation Theory Spring 2013 57 / 152

Page 58: Estimation Theory

Maximum Likelihood Estimation

Problems : The MVU estimator often does not exist or cannot be found. The BLUE is restricted to linear models.

Maximum Likelihood Estimator (MLE) :

can always be applied if the PDF is known;

is optimal for large data sizes;

is computationally complex and requires numerical methods.

Basic Idea : Choose the parameter value that makes the observed data the most likely data to have been observed.

Likelihood Function : the PDF p(x; θ) when θ is regarded as a variable (not a parameter).

ML Estimate : the value of θ that maximizes the likelihood function.

Procedure : Form the log-likelihood function ln p(x; θ); differentiate with respect to θ, set to zero and solve for θ.

(Maximum Likelihood Estimation) Estimation Theory Spring 2013 58 / 152

Page 59: Estimation Theory

Maximum Likelihood Estimation

Example : Consider a DC level in WGN with unknown variance, x[n] = A + w[n]. Suppose that A > 0 and σ² = A. The PDF is :

p(x; A) = (2πA)^{−N/2} exp[−(1/(2A)) Σ_{n=0}^{N−1} (x[n] − A)²]

Taking the derivative of the log-likelihood function, we have :

∂ ln p(x; A)/∂A = −N/(2A) + (1/A) Σ_{n=0}^{N−1} (x[n] − A) + (1/(2A²)) Σ_{n=0}^{N−1} (x[n] − A)²

What is the CRLB? Does an MVU estimator exist?

The MLE can be found by setting the above equation to zero :

Â = −1/2 + √( (1/N) Σ_{n=0}^{N−1} x²[n] + 1/4 )
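A minimal Python/NumPy sketch (not from the slides) evaluating this closed-form MLE on simulated data; the true A and sample size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
A, N = 2.0, 100000                        # assumed DC level; sigma^2 = A by the model assumption
x = A + np.sqrt(A) * rng.standard_normal(N)

# Closed-form MLE for x[n] = A + w[n], w[n] ~ N(0, A):
A_mle = -0.5 + np.sqrt(np.mean(x**2) + 0.25)
print("A_mle:", A_mle)
```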

(Maximum Likelihood Estimation) Estimation Theory Spring 2013 59 / 152

Page 60: Estimation Theory

Stochastic Convergence

Convergence in distribution : Let p(x_N) be a sequence of PDFs. If there exists a PDF p(x) such that

lim_{N→∞} p(x_N) = p(x)

at every point x at which p(x) is continuous, we say that p(x_N) converges in distribution to p(x). We also write x_N →d x.

Example

Consider N independent RVs x1, x2, . . . , xN with mean µ and finite variance σ². Let x̄_N = (1/N) Σ_{n=1}^{N} x_n. Then, according to the Central Limit Theorem (CLT), z_N = √N (x̄_N − µ)/σ converges in distribution to z ∼ N(0, 1).

(Maximum Likelihood Estimation) Estimation Theory Spring 2013 60 / 152

Page 61: Estimation Theory

Stochastic Convergence

Convergence in probability : A sequence of random variables x_N converges in probability to the random variable x if for every ε > 0

lim_{N→∞} Pr{|x_N − x| > ε} = 0

Convergence with probability 1 (almost sure convergence) : A sequence of random variables x_N converges with probability 1 to the random variable x if and only if

Pr{ lim_{N→∞} x_N = x } = 1

Example

Consider N independent RVs x1, x2, . . . , xN with mean µ and finite variance σ². Let x̄_N = (1/N) Σ_{n=1}^{N} x_n. Then, according to the Law of Large Numbers (LLN), x̄_N converges in probability to µ.

(Maximum Likelihood Estimation) Estimation Theory Spring 2013 61 / 152

Page 62: Estimation Theory

Asymptotic Properties of Estimators

Asymptotic Unbiasedness : The estimator θ̂_N is an asymptotically unbiased estimator of θ if :

lim_{N→∞} E(θ̂_N − θ) = 0

Asymptotic Distribution : refers to p(x_n) as it evolves for n = 1, 2, . . ., especially for large values of n (it is not the ultimate form of the distribution, which may be degenerate).

Asymptotic Variance : is not equal to lim_{N→∞} var(θ̂_N) (which is the limiting variance). It is defined as :

asymptotic var(θ̂_N) = (1/N) lim_{N→∞} E{ N [θ̂_N − lim_{N→∞} E(θ̂_N)]² }

Mean-Square Convergence : The estimator θ̂_N converges to θ in a mean-squared sense if :

lim_{N→∞} E[(θ̂_N − θ)²] = 0

(Maximum Likelihood Estimation) Estimation Theory Spring 2013 62 / 152

Page 63: Estimation Theory

Consistency

The estimator θ̂_N is a consistent estimator of θ if for every ε > 0

plim(θ̂_N) = θ   ⇔   lim_{N→∞} Pr[|θ̂_N − θ| > ε] = 0

Remarks :

If θ̂_N is asymptotically unbiased and its limiting variance is zero, then it converges to θ in mean square.

If θ̂_N converges to θ in mean square, then the estimator is consistent.

Asymptotic unbiasedness does not imply consistency, and vice versa.

plim can be treated as an operator, e.g. :

plim(xy) = plim(x) plim(y),   plim(x/y) = plim(x)/plim(y)

The importance of consistency is that any continuous function of a consistent estimator is itself a consistent estimator.

(Maximum Likelihood Estimation) Estimation Theory Spring 2013 63 / 152

Page 64: Estimation Theory

Maximum Likelihood Estimation

Properties : The MLE may be biased and is not necessarily an efficient estimator. However :

The MLE is a consistent estimator, meaning that

lim_{N→∞} Pr[|θ̂ − θ| > ε] = 0

The MLE asymptotically attains the CRLB (its asymptotic variance equals the CRLB).

Under some "regularity" conditions, the MLE is asymptotically normally distributed,

θ̂ ∼ N(θ, I⁻¹(θ))   (asymptotically)

even if the PDF of x is not Gaussian.

If an efficient (MVU) estimator exists, then the ML procedure will find it.

(Maximum Likelihood Estimation) Estimation Theory Spring 2013 64 / 152

Page 65: Estimation Theory

Maximum Likelihood Estimation

Example

Consider DC level in WGN with known variance σ2.

Solution :

p(x; A) = (2πσ²)^{−N/2} exp[−(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)²]

∂ ln p(x; A)/∂A = (1/σ²) Σ_{n=0}^{N−1} (x[n] − A) = 0   ⇒   Σ_{n=0}^{N−1} x[n] − NA = 0

which leads to Â = x̄ = (1/N) Σ_{n=0}^{N−1} x[n]

(Maximum Likelihood Estimation) Estimation Theory Spring 2013 65 / 152

Page 66: Estimation Theory

MLE for Transformed Parameters (Invariance Property)

Theorem (Invariance Property of the MLE)

The MLE of the parameter α = g(θ), where the PDF p(x; θ) is parameterized by θ, is given by

α̂ = g(θ̂)

where θ̂ is the MLE of θ.

It can be proved using the properties of consistent estimators.

If α = g(θ) is a one-to-one function, then

α̂ = arg max_α p(x; g⁻¹(α)) = g(θ̂)

If α = g(θ) is not a one-to-one function, then a modified likelihood function is maximized :

p_T(x; α) = max_{θ: α=g(θ)} p(x; θ)   and   α̂ = arg max_α p_T(x; α) = g(θ̂)

(Maximum Likelihood Estimation) Estimation Theory Spring 2013 66 / 152

Page 67: Estimation Theory

MLE for Transformed Parameters (Invariance Property)

Example

Consider a DC level in WGN and find the MLE of α = exp(A). Since g(θ) is a one-to-one function :

α̂ = arg max_α p_T(x; α) = arg max_α p(x; ln α) = exp(x̄)

Example

Consider a DC level in WGN and find the MLE of α = A². Since g(θ) is not a one-to-one function :

α̂ = arg max_{α≥0} {p(x; √α), p(x; −√α)}
  = [arg max_{√α, α≥0} {p(x; √α), p(x; −√α)}]²
  = [arg max_{−∞<A<∞} p(x; A)]² = Â² = x̄²

(Maximum Likelihood Estimation) Estimation Theory Spring 2013 67 / 152

Page 68: Estimation Theory

MLE (Extension to Vector Parameter)

Example

Consider a DC level in WGN with unknown variance. The vector parameter θ = [A σ²]^T should be estimated. We have :

∂ ln p(x; θ)/∂A = (1/σ²) Σ_{n=0}^{N−1} (x[n] − A)

∂ ln p(x; θ)/∂σ² = −N/(2σ²) + (1/(2σ⁴)) Σ_{n=0}^{N−1} (x[n] − A)²

which leads to the following MLE :

Â = x̄,    σ̂² = (1/N) Σ_{n=0}^{N−1} (x[n] − x̄)²

(Maximum Likelihood Estimation) Estimation Theory Spring 2013 68 / 152

Page 69: Estimation Theory

MLE for General Gaussian Case

Consider the general Gaussian case where x ∼ N(µ(θ), C(θ)). The partial derivatives of the log-PDF are, for k = 1, . . . , p :

∂ ln p(x; θ)/∂θ_k = −(1/2) tr[C⁻¹(θ) ∂C(θ)/∂θ_k] + [∂µ(θ)/∂θ_k]^T C⁻¹(θ)(x − µ(θ))
                    − (1/2)(x − µ(θ))^T [∂C⁻¹(θ)/∂θ_k] (x − µ(θ))

By setting these equations equal to zero, the MLE can be found.

A particular case is when C is known (the first and third terms are zero).

If, in addition, µ(θ) is linear in θ, the general linear model is obtained.

(Maximum Likelihood Estimation) Estimation Theory Spring 2013 69 / 152

Page 70: Estimation Theory

MLE for General Linear Models

Consider the general linear model x = Hθ + w where w is a noise vector with PDF N(0, C) :

p(x; θ) = 1/√((2π)^N det(C)) exp[−(1/2)(x − Hθ)^T C⁻¹(x − Hθ)]

Taking the derivative of ln p(x; θ) leads to :

∂ ln p(x; θ)/∂θ = [∂(Hθ)^T/∂θ] C⁻¹(x − Hθ) = H^T C⁻¹(x − Hθ)

Then

H^T C⁻¹(x − Hθ) = 0   ⇒   θ̂ = (H^T C⁻¹H)⁻¹H^T C⁻¹x

which is the same as the MVU estimator. The PDF of θ̂ is :

θ̂ ∼ N(θ, (H^T C⁻¹H)⁻¹)

(Maximum Likelihood Estimation) Estimation Theory Spring 2013 70 / 152

Page 71: Estimation Theory

MLE (Numerical Method)

Newton–Raphson : A closed-form estimator cannot always be computed by maximizing the likelihood function. However, the maximum can be computed by numerical methods such as the iterative Newton–Raphson algorithm :

θ_{k+1} = θ_k − [∂² ln p(x; θ)/∂θ∂θ^T]⁻¹ ∂ ln p(x; θ)/∂θ |_{θ=θ_k}

Remarks :

The Hessian can be replaced by the negative of its expectation, the Fisher information matrix I(θ).

This method suffers from convergence problems (local maxima).

Typically, for large data lengths, the log-likelihood function becomes more nearly quadratic and the algorithm produces the MLE.
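A minimal Python/NumPy sketch (not from the slides) of a scalar Newton–Raphson iteration, applied to the σ² = A example from slide 59; here the Hessian is approximated by a finite difference of the score, which is a simplification rather than the analytic second derivative, and the data parameters are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
A_true, N = 2.0, 5000
x = A_true + np.sqrt(A_true) * rng.standard_normal(N)   # model with sigma^2 = A

def score(A):
    # d/dA of the log-likelihood for x[n] = A + w[n], w[n] ~ N(0, A)
    return -N/(2*A) + np.sum(x - A)/A + np.sum((x - A)**2)/(2*A**2)

def hessian(A):
    # finite-difference approximation of the second derivative
    h = 1e-5
    return (score(A + h) - score(A - h)) / (2*h)

A = x.mean()                       # initial guess
for _ in range(20):                # Newton-Raphson iterations
    A = A - score(A) / hessian(A)

print("Newton-Raphson MLE:", A)
print("closed-form MLE  :", -0.5 + np.sqrt(np.mean(x**2) + 0.25))
```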

(Maximum Likelihood Estimation) Estimation Theory Spring 2013 71 / 152

Page 72: Estimation Theory

Least Squares Estimation

(Least Squares) Estimation Theory Spring 2013 72 / 152

Page 73: Estimation Theory

The Least Squares Approach

In all the previous methods we assumed that the measured signal x[n] is the sum of a true signal s[n] and a measurement error w[n] with a known probabilistic model. In the least squares method

x[n] = s[n, θ] + e[n]

where e[n] represents the modeling and measurement errors. The objective is to minimize the LS cost :

J(θ) = Σ_{n=0}^{N−1} (x[n] − s[n, θ])²

We do not need a probabilistic assumption, only a deterministic signal model.

It has a broader range of applications.

No claim about optimality can be made.

The statistical performance cannot be assessed.

(Least Squares) Estimation Theory Spring 2013 73 / 152

Page 74: Estimation Theory

The Least Squares Approach

Example

Estimate the DC level of a signal. We observe x[n] = A + ε[n] for n = 0, . . . , N − 1 and the LS criterion is :

J(A) = Σ_{n=0}^{N−1} (x[n] − A)²

∂J(A)/∂A = −2 Σ_{n=0}^{N−1} (x[n] − A) = 0   ⇒   Â = (1/N) Σ_{n=0}^{N−1} x[n]

(Least Squares) Estimation Theory Spring 2013 74 / 152

Page 75: Estimation Theory

Linear Least Squares

Suppose that the observation model is linear, x = Hθ + ε. Then

J(θ) = Σ_{n=0}^{N−1} (x[n] − s[n, θ])² = (x − Hθ)^T(x − Hθ) = x^T x − 2x^T Hθ + θ^T H^T Hθ

where H is full rank. The gradient is

∂J(θ)/∂θ = −2H^T x + 2H^T Hθ = 0   ⇒   θ̂ = (H^T H)⁻¹H^T x

The minimum LS cost is :

J_min = (x − Hθ̂)^T(x − Hθ̂) = x^T[I − H(H^T H)⁻¹H^T]x = x^T(x − Hθ̂)

where I − H(H^T H)⁻¹H^T is an idempotent matrix.

(Least Squares) Estimation Theory Spring 2013 75 / 152

Page 76: Estimation Theory

Comparing Different Estimators for the Linear Model

Consider the following linear model :

x = Hθ + w

Estimator | Assumption                   | Estimate
LSE       | no probabilistic assumption  | θ̂ = (H^T H)⁻¹H^T x
BLUE      | w is white with unknown PDF  | θ̂ = (H^T H)⁻¹H^T x
MLE       | w is white Gaussian noise    | θ̂ = (H^T H)⁻¹H^T x
MVUE      | w is white Gaussian noise    | θ̂ = (H^T H)⁻¹H^T x

For the MLE and MVUE the PDF of the estimate is Gaussian.

(Least Squares) Estimation Theory Spring 2013 76 / 152

Page 77: Estimation Theory

Weighted Linear Least Squares

The LS criterion can be modified by including a positive definite (symmetric) weighting matrix W :

J(θ) = (x − Hθ)^T W (x − Hθ)

That leads to the following estimator :

θ̂ = (H^T W H)⁻¹H^T W x

and minimum LS cost :

J_min = x^T[W − W H(H^T W H)⁻¹H^T W]x

Remark : If we take W = C⁻¹, where C is the covariance of the noise, the weighted least squares estimator is the BLUE. However, there is no purely LS-based reason for this choice.

(Least Squares) Estimation Theory Spring 2013 77 / 152

Page 78: Estimation Theory

Geometrical Interpretation

Recall the general signal model s = Hθ. If we denote the columns of H by h_i we have :

s = [h1 h2 · · · hp] [θ1, θ2, . . . , θp]^T = Σ_{i=1}^{p} θi hi

The signal model is a linear combination of the vectors hi.

The LS estimator minimizes the length of the error vector between the data and the signal model, ε = x − s :

J(θ) = (x − Hθ)^T(x − Hθ) = ‖x − Hθ‖²

The data vector can lie anywhere in R^N, while the signal vectors must lie in a p-dimensional subspace of R^N, termed S^p, which is spanned by the columns of H.

(Least Squares) Estimation Theory Spring 2013 78 / 152

Page 79: Estimation Theory

Geometrical Interpretation (Orthogonal Projection)

Intuitively, it is clear that the LS error is minimized when ŝ = Hθ̂ is the orthogonal projection of x onto S^p.

So the LS error vector ε = x − ŝ is orthogonal to all columns of H :

H^T ε = 0   ⇒   H^T(x − Hθ̂) = 0   ⇒   θ̂ = (H^T H)⁻¹H^T x

The signal estimate is the projection of x onto S^p :

ŝ = Hθ̂ = H(H^T H)⁻¹H^T x = Px

where P is the orthogonal projection matrix.

Note that if z ∈ Range(H), then Pz = z. Recall that Range(H) is the subspace spanned by the columns of H.

Now, since Px ∈ S^p, we have P(Px) = Px. Therefore any projection matrix is idempotent, i.e. P² = P.

It can be verified that P is symmetric and singular (with rank p).

(Least Squares) Estimation Theory Spring 2013 79 / 152

Page 80: Estimation Theory

Geometrical Interpretation (Orthonormal columns of H)

Recall that

H^T H = [ <h1,h1>  <h1,h2>  · · ·  <h1,hp>
          <h2,h1>  <h2,h2>  · · ·  <h2,hp>
          ⋮          ⋮          ⋱      ⋮
          <hp,h1>  <hp,h2>  · · ·  <hp,hp> ]

If the columns of H are orthonormal then H^T H = I and θ̂ = H^T x.

In this case θ̂i = hi^T x, thus

ŝ = Hθ̂ = Σ_{i=1}^{p} θ̂i hi = Σ_{i=1}^{p} (hi^T x) hi

If we increase the number of parameters (the order of the linear model), we can easily compute the new estimate.

(Least Squares) Estimation Theory Spring 2013 80 / 152

Page 81: Estimation Theory

Choosing the Model Order

Suppose that you have a set of data and the objective is to fit apolynomial to data. What is the best polynomial order ?

Remarks :

It is clear that by increasing the order, J_min is monotonically non-increasing.

By choosing p = N we can fit the model to the data perfectly. However, we fit the noise as well.

We should choose the simplest model that adequately describes the data.

We increase the order only if the cost reduction is significant.

If we have an idea of the expected level of J_min, we increase p until this level is approximately attained.

There is an order-recursive LS algorithm to efficiently compute a (p + 1)-th order model based on a p-th order one (see Section 8.6 of the textbook).

(Least Squares) Estimation Theory Spring 2013 81 / 152

Page 82: Estimation Theory

Sequential Least Squares

Suppose that θ̂[N − 1], based on x[N − 1] = [x[0], . . . , x[N − 1]]^T, is available. If we get a new data sample x[N], we want to compute θ̂[N] as a function of θ̂[N − 1] and x[N].

Example

Consider the LS estimate of a DC level : Â[N − 1] = (1/N) Σ_{n=0}^{N−1} x[n]. We have :

Â[N] = (1/(N+1)) Σ_{n=0}^{N} x[n] = (1/(N+1)) { N [(1/N) Σ_{n=0}^{N−1} x[n]] + x[N] }
     = (N/(N+1)) Â[N − 1] + (1/(N+1)) x[N]
     = Â[N − 1] + (1/(N+1)) (x[N] − Â[N − 1])

new estimate = old estimate + gain × prediction error

(Least Squares) Estimation Theory Spring 2013 82 / 152

Page 83: Estimation Theory

Sequential Least Squares

Example (DC level in uncorrelated noise)

x[n] = A + w[n] with var(w[n]) = σₙ². The WLS estimate (or BLUE) is :

Â[N − 1] = ( Σ_{n=0}^{N−1} x[n]/σₙ² ) / ( Σ_{n=0}^{N−1} 1/σₙ² )

Similarly to the previous example, we can obtain :

Â[N] = Â[N − 1] + [ (1/σ_N²) / Σ_{n=0}^{N} 1/σₙ² ] (x[N] − Â[N − 1]) = Â[N − 1] + K[N] (x[N] − Â[N − 1])

The gain factor K[N] can be reformulated as :

K[N] = var(Â[N − 1]) / (var(Â[N − 1]) + σ_N²)

(Least Squares) Estimation Theory Spring 2013 83 / 152

Page 84: Estimation Theory

Sequential Least Squares

Consider the general linear model x = Hθ + w, where w is an uncorrelated noise with covariance matrix C. The BLUE (or WLS) is :

θ̂ = (H^T C⁻¹H)⁻¹H^T C⁻¹x   and   C_θ̂ = (H^T C⁻¹H)⁻¹

Let's define :

C[n] = diag(σ₀², σ₁², . . . , σₙ²)

H[n] = [ H[n − 1]   (n × p)
         h^T[n]     (1 × p) ]

x[n] = [x[0], x[1], . . . , x[n]]^T

The objective is to find θ̂[n], based on n + 1 data samples, as a function of θ̂[n − 1] and the new data point x[n]. The batch estimator is :

θ̂[n − 1] = Σ[n − 1] H^T[n − 1] C⁻¹[n − 1] x[n − 1]

with Σ[n − 1] = (H^T[n − 1] C⁻¹[n − 1] H[n − 1])⁻¹

(Least Squares) Estimation Theory Spring 2013 84 / 152

Page 85: Estimation Theory

Sequential Least Squares

Estimator Update :

θ̂[n] = θ̂[n − 1] + K[n] (x[n] − h^T[n] θ̂[n − 1])

where

K[n] = Σ[n − 1] h[n] / (σₙ² + h^T[n] Σ[n − 1] h[n])

Covariance Update :

Σ[n] = (I − K[n] h^T[n]) Σ[n − 1]

The following lemma is used to compute the updates :

Matrix Inversion Lemma

(A + BCD)⁻¹ = A⁻¹ − A⁻¹B[C⁻¹ + DA⁻¹B]⁻¹DA⁻¹

with A = Σ⁻¹[n − 1], B = h[n], C = 1/σₙ², D = h^T[n].

Initialization?

(Least Squares) Estimation Theory Spring 2013 85 / 152
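A minimal Python/NumPy sketch (not from the slides) of the sequential LS recursion above; the initial estimate and a large initial covariance (weak prior information) are arbitrary assumptions for the initialization question.

```python
import numpy as np

def sequential_ls(x, H, sigma2, theta0, Sigma0):
    """Sequential LS for x[n] = h[n]^T theta + w[n] with var(w[n]) = sigma2[n]."""
    theta, Sigma = theta0.copy(), Sigma0.copy()
    sigma2 = np.broadcast_to(sigma2, len(x))
    for n in range(len(x)):
        h = H[n]
        K = Sigma @ h / (sigma2[n] + h @ Sigma @ h)      # gain vector
        theta = theta + K * (x[n] - h @ theta)           # estimator update
        Sigma = Sigma - np.outer(K, h) @ Sigma           # covariance update
    return theta, Sigma

# Example: the recursion reproduces the batch LS solution
rng = np.random.default_rng(9)
N, p = 200, 3
H = rng.standard_normal((N, p))
theta_true = np.array([1.0, 2.0, -1.0])
x = H @ theta_true + 0.1 * rng.standard_normal(N)

theta0 = np.zeros(p)
Sigma0 = 1e6 * np.eye(p)          # large initial covariance (assumed weak prior knowledge)
print(sequential_ls(x, H, 0.01, theta0, Sigma0)[0])
print(np.linalg.solve(H.T @ H, H.T @ x))
```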

Page 86: Estimation Theory

Constrained Least Squares

Linear constraints of the form Aθ = b can be considered in the LS solution. The LS criterion becomes :

J_c(θ) = (x − Hθ)^T(x − Hθ) + λ^T(Aθ − b)

∂J_c/∂θ = −2H^T x + 2H^T Hθ + A^T λ

Setting the gradient equal to zero produces :

θ̂_c = (H^T H)⁻¹H^T x − (1/2)(H^T H)⁻¹A^T λ = θ̂ − (H^T H)⁻¹A^T (λ/2)

where θ̂ = (H^T H)⁻¹H^T x is the unconstrained LSE. Now Aθ̂_c = b can be solved to find λ :

λ = 2[A(H^T H)⁻¹A^T]⁻¹(Aθ̂ − b)
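A minimal Python/NumPy sketch (not from the slides) of constrained LS; the constraint that the parameters sum to a given value is a hypothetical example.

```python
import numpy as np

rng = np.random.default_rng(10)
N, p = 100, 3
H = rng.standard_normal((N, p))
x = H @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.standard_normal(N)

# Hypothetical linear constraint: the parameters must sum to 6  (A theta = b)
A = np.ones((1, p))
b = np.array([6.0])

G = np.linalg.inv(H.T @ H)
theta_u = G @ H.T @ x                                        # unconstrained LSE
lam = 2 * np.linalg.solve(A @ G @ A.T, A @ theta_u - b)      # Lagrange multiplier
theta_c = theta_u - G @ A.T @ (lam / 2)                      # constrained LSE

print("unconstrained:", theta_u, " sum:", theta_u.sum())
print("constrained  :", theta_c, " sum:", theta_c.sum())
```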

(Least Squares) Estimation Theory Spring 2013 86 / 152

Page 87: Estimation Theory

Nonlinear Least Squares

Many applications have a nonlinear observation model, i.e. s(θ) cannot be written as Hθ. This leads to a nonlinear optimization problem that must be solved numerically.

Newton-Raphson Method : Find a zero of the gradient of the criterion by linearizing the gradient around the current estimate :

θ_{k+1} = θ_k − [∂g(θ)/∂θ]⁻¹ g(θ) |_{θ=θ_k}

where g(θ) = [∂s^T(θ)/∂θ] [x − s(θ)] and

[∂g(θ)/∂θ]_{ij} = Σ_{n=0}^{N−1} [ (x[n] − s[n]) ∂²s[n]/∂θi∂θj − (∂s[n]/∂θj)(∂s[n]/∂θi) ]

Around the solution, [x − s(θ)] is small, so the first term in the Jacobian can be neglected. This makes the method equivalent to the Gauss-Newton algorithm, which is numerically more robust.

(Least Squares) Estimation Theory Spring 2013 87 / 152

Page 88: Estimation Theory

Nonlinear Least Squares

Gauss-Newton Method : Linearize the signal model around the current estimate and solve the resulting linear problem.

s(θ) ≈ s(θ0) + ∂s(θ)/∂θ |_{θ=θ0} (θ − θ0) = s(θ0) + H(θ0)(θ − θ0)

The solution to the linearized problem is :

θ̂ = [H^T(θ0)H(θ0)]⁻¹H^T(θ0)[x − s(θ0) + H(θ0)θ0]
  = θ0 + [H^T(θ0)H(θ0)]⁻¹H^T(θ0)[x − s(θ0)]

If we now iterate the solution, it becomes :

θ_{k+1} = θ_k + [H^T(θ_k)H(θ_k)]⁻¹H^T(θ_k)[x − s(θ_k)]

Remark : Both the Newton-Raphson and the Gauss-Newton methods can have convergence problems.
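A minimal Python/NumPy sketch (not from the slides) of the Gauss-Newton iteration for a hypothetical one-parameter nonlinear model s[n] = θ^n; the true value, initial guess and noise level are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(11)
N = 50
n = np.arange(N)
theta_true = 0.9                       # assumed decay factor of the model s[n] = theta^n
x = theta_true**n + 0.05 * rng.standard_normal(N)

def s(th):    return th**n
def Hmat(th): return (n * th**(n - 1)).reshape(-1, 1)   # ds/dtheta as an N x 1 matrix

th = np.array([0.5])                   # initial guess
for _ in range(20):                    # Gauss-Newton iterations
    H = Hmat(th[0])
    r = x - s(th[0])
    th = th + np.linalg.solve(H.T @ H, H.T @ r)

print("Gauss-Newton estimate of theta:", th[0])
```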

(Least Squares) Estimation Theory Spring 2013 88 / 152

Page 89: Estimation Theory

Nonlinear Least Squares

Transformation of parameters : Transform into a linear problem by seeking an invertible function α = g(θ) such that :

s(θ) = s(g⁻¹(α)) = Hα

So the nonlinear LSE is θ̂ = g⁻¹(α̂) where α̂ = (H^T H)⁻¹H^T x.

Example (Estimate the amplitude and phase of a sinusoidal signal)

s[n] = A cos(2πf0 n + φ),   n = 0, 1, . . . , N − 1

The LS problem is nonlinear; however, we have :

A cos(2πf0 n + φ) = A cos φ cos 2πf0 n − A sin φ sin 2πf0 n

If we let α1 = A cos φ and α2 = −A sin φ then the signal model becomes linear, s = Hα. The LSE is α̂ = (H^T H)⁻¹H^T x and g⁻¹(α̂) gives

Â = √(α̂1² + α̂2²)   and   φ̂ = arctan(−α̂2/α̂1)

(Least Squares) Estimation Theory Spring 2013 89 / 152
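A minimal Python/NumPy sketch (not from the slides) of the amplitude/phase example via the linear reparameterization; the frequency, true amplitude/phase and noise level are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(12)
N, f0 = 100, 0.08                       # assumed known frequency (cycles/sample)
A_true, phi_true = 1.5, 0.7
n = np.arange(N)
x = A_true * np.cos(2*np.pi*f0*n + phi_true) + 0.2 * rng.standard_normal(N)

# Linear reparameterization: alpha1 = A cos(phi), alpha2 = -A sin(phi)
H = np.column_stack([np.cos(2*np.pi*f0*n), np.sin(2*np.pi*f0*n)])
alpha = np.linalg.solve(H.T @ H, H.T @ x)

A_hat = np.hypot(alpha[0], alpha[1])
phi_hat = np.arctan2(-alpha[1], alpha[0])
print("A_hat:", A_hat, " phi_hat:", phi_hat)
```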

Page 90: Estimation Theory

Nonlinear Least Squares

Separability of parameters : If the signal model is linear in some of the parameters, try to write it as s = H(α)β. The LS error can be minimized with respect to β :

β̂ = [H^T(α)H(α)]⁻¹H^T(α)x

The remaining cost function depends only on α and can be minimized by a numerical method (e.g. brute force) :

J(α, β̂) = x^T[ I − H(α)[H^T(α)H(α)]⁻¹H^T(α) ]x

Then

α̂ = arg max_α x^T[ H(α)[H^T(α)H(α)]⁻¹H^T(α) ]x

Remark : This method is interesting if the dimension of α is much less than the dimension of β.

(Least Squares) Estimation Theory Spring 2013 90 / 152

Page 91: Estimation Theory

Nonlinear Least Squares

Example (Damped Exponentials)

Consider the following signal model :

s[n] = A1 r^n + A2 r^{2n} + A3 r^{3n},   n = 0, 1, . . . , N − 1

with θ = [A1, A2, A3, r]^T. It is known that 0 < r < 1. Let's take β = [A1, A2, A3]^T so we get s = H(r)β with :

H(r) = [ 1         1           1
         r         r²          r³
         ⋮          ⋮           ⋮
         r^{N−1}  r^{2(N−1)}  r^{3(N−1)} ]

Step 1 : Maximize x^T[ H(r)[H^T(r)H(r)]⁻¹H^T(r) ]x to obtain r̂.

Step 2 : Compute β̂ = [H^T(r̂)H(r̂)]⁻¹H^T(r̂)x.
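A minimal Python/NumPy sketch (not from the slides) of the two-step separable LS procedure; the true parameters, noise level and the grid for the brute-force search over r are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(13)
N = 60
n = np.arange(N)
r_true, beta_true = 0.8, np.array([1.0, -0.5, 0.3])   # assumed true model

def Hmat(r):
    return np.column_stack([r**n, r**(2*n), r**(3*n)])

x = Hmat(r_true) @ beta_true + 0.02 * rng.standard_normal(N)

# Step 1: brute-force search over r of the concentrated cost x^T P(r) x
def concentrated(r):
    H = Hmat(r)
    P = H @ np.linalg.pinv(H)          # projection onto the columns of H(r)
    return x @ P @ x

r_grid = np.linspace(0.3, 0.99, 200)
r_hat = r_grid[np.argmax([concentrated(r) for r in r_grid])]

# Step 2: linear LS for the amplitudes at r = r_hat
H = Hmat(r_hat)
beta_hat = np.linalg.solve(H.T @ H, H.T @ x)
print("r_hat:", r_hat, " beta_hat:", beta_hat)
```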

(Least Squares) Estimation Theory Spring 2013 91 / 152

Page 92: Estimation Theory

The Bayesian Philosophy

(The Bayesian Philosophy) Estimation Theory Spring 2013 92 / 152

Page 93: Estimation Theory

Introduction

Classical Approach :

Assumes θ is unknown but deterministic.

Some prior knowledge on θ cannot be used.

Variance of the estimate may depend on θ.

MVUE may not exist.

In Monte-Carlo simulations, we do M runs for each fixed θ and then compute the sample mean and variance for each θ (no averaging over θ).

Bayesian Approach :

Assumes θ is random with a known prior PDF, p(θ).

We estimate a realization of θ based on the available data.

Variance of the estimate does not depend on θ.

A Bayesian estimate always exists.

In Monte-Carlo simulations, we do M runs for randomly chosen θ and then compute the sample mean and variance over all θ values.

(The Bayesian Philosophy) Estimation Theory Spring 2013 93 / 152

Page 94: Estimation Theory

Mean Square Error (MSE)

Classical MSE

The classical MSE is a function of the unknown parameter θ and cannot be used for constructing estimators :

mse(θ̂) = E{[θ − θ̂(x)]²} = ∫ [θ − θ̂(x)]² p(x; θ) dx

Note that the E{·} is with respect to the PDF of x.

Bayesian MSE :

The Bayesian MSE is not a function of θ and can be minimized to find an estimator :

Bmse(θ̂) = E{[θ − θ̂(x)]²} = ∫∫ [θ − θ̂(x)]² p(x, θ) dx dθ

Note that the E{·} is with respect to the joint PDF of x and θ.

(The Bayesian Philosophy) Estimation Theory Spring 2013 94 / 152

Page 95: Estimation Theory

Minimum Mean Square Error Estimator

Consider the estimation of A that minimizes the Bayesian MSE, where A is a random variable with uniform prior PDF p(A) = U[−A0, A0], independent of w[n].

Bmse(Â) = ∫∫ [A − Â]² p(x, A) dx dA = ∫ [ ∫ [A − Â]² p(A|x) dA ] p(x) dx

where we used Bayes' theorem : p(x, A) = p(A|x)p(x).

Since p(x) ≥ 0 for all x, we need only minimize the integral in brackets. Setting its derivative with respect to Â equal to zero,

∂/∂Â ∫ [A − Â]² p(A|x) dA = −2 ∫ A p(A|x) dA + 2Â ∫ p(A|x) dA = 0

which results in

Â = ∫ A p(A|x) dA = E(A|x)

Bayesian MMSE Estimate : the conditional mean of A given the data x, i.e. the mean of the posterior PDF p(A|x).

(The Bayesian Philosophy) Estimation Theory Spring 2013 95 / 152

Page 96: Estimation Theory

Minimum Mean Square Error Estimator

How to compute the Bayesian MMSE estimate :

Â = E(A|x) = ∫ A p(A|x) dA

The posterior PDF can be computed using Bayes' rule :

p(A|x) = p(x|A)p(A)/p(x) = p(x|A)p(A) / ∫ p(x|A)p(A) dA

p(x) is the marginal PDF defined as p(x) = ∫ p(x, A) dA.

The integral in the denominator acts as a "normalization" of p(x|A)p(A) so that p(A|x) integrates to 1.

p(x|A) has exactly the same form as p(x; A), which is used in classical estimation.

The MMSE estimator always exists and can be computed by :

Â = ∫ A p(x|A)p(A) dA / ∫ p(x|A)p(A) dA

(The Bayesian Philosophy) Estimation Theory Spring 2013 96 / 152

Page 97: Estimation Theory

Minimum Mean Square Error Estimator

Example : For p(A) = U[−A0, A0] we have :

Â = [ (1/(2A0)) ∫_{−A0}^{A0} A (2πσ²)^{−N/2} exp[−(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)²] dA ]
    / [ (1/(2A0)) ∫_{−A0}^{A0} (2πσ²)^{−N/2} exp[−(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)²] dA ]

Before collecting the data, the mean of the prior PDF p(A) is the best estimate; after collecting the data, the best estimate is the mean of the posterior PDF p(A|x).

The choice of p(A) is crucial for the quality of the estimation.

Only a Gaussian prior PDF leads to a closed-form estimator.

Conclusion : For an accurate estimator choose a prior PDF that can be physically justified. For a closed-form estimator choose a Gaussian prior PDF.

(The Bayesian Philosophy) Estimation Theory Spring 2013 97 / 152

Page 98: Estimation Theory

Properties of the Gaussian PDF

Theorem (Conditional PDF of Bivariate Gaussian)

If x and y are distributed according to a bivariate Gaussian PDF,

p(x, y) = 1/(2π√det(C)) exp[ −(1/2) [x − E(x), y − E(y)] C⁻¹ [x − E(x), y − E(y)]^T ]

with covariance matrix

C = [ var(x)      cov(x, y)
      cov(x, y)   var(y)   ]

then the conditional PDF p(y|x) is also Gaussian and :

E(y|x) = E(y) + [cov(x, y)/var(x)] [x − E(x)]

var(y|x) = var(y) − cov²(x, y)/var(x)

(The Bayesian Philosophy) Estimation Theory Spring 2013 98 / 152

Page 99: Estimation Theory

Properties of the Gaussian PDF

Example (DC level in WGN with Gaussian prior PDF)

Consider the Bayesian model x[0] = A + w[0] (just one observation) with a prior PDF A ∼ N(µA, σA²), independent of the noise.

First we compute the covariance cov(A, x[0]) :

cov(A, x[0]) = E{(A − µA)(x[0] − E(x[0]))}
             = E{(A − µA)(A + w[0] − µA − 0)}
             = E{(A − µA)² + (A − µA)w[0]} = var(A) + 0 = σA²

The Bayesian MMSE estimate is also Gaussian with :

Â = µ_{A|x} = µA + [cov(A, x[0])/var(x[0])] (x[0] − µA) = µA + [σA²/(σ² + σA²)] (x[0] − µA)

var(A|x) = σ²_{A|x} = σA² − cov²(A, x[0])/var(x[0]) = σA² [1 − σA²/(σ² + σA²)]

(The Bayesian Philosophy) Estimation Theory Spring 2013 99 / 152

Page 100: Estimation Theory

Properties of the Gaussian PDF

Theorem (Conditional PDF of Multivariate Gaussian)

If x (with dimension k × 1) and y (with dimension l × 1) are jointly Gaussian with PDF

p(x, y) = 1/((2π)^{(k+l)/2} √det(C)) exp[ −(1/2) [x − E(x); y − E(y)]^T C⁻¹ [x − E(x); y − E(y)] ]

with covariance matrix

C = [ Cxx  Cxy
      Cyx  Cyy ]

then the conditional PDF p(y|x) is also Gaussian and :

E(y|x) = E(y) + Cyx Cxx⁻¹ [x − E(x)]

C_{y|x} = Cyy − Cyx Cxx⁻¹ Cxy

(The Bayesian Philosophy) Estimation Theory Spring 2013 100 / 152

Page 101: Estimation Theory

Bayesian Linear Model

Let the data be modeled as

x = Hθ + w

where θ is a p × 1 random vector with prior PDF N(µθ, Cθ) and w is a noise vector with PDF N(0, Cw).

Since θ and w are independent and Gaussian, they are jointly Gaussian, so the posterior PDF is also Gaussian. In order to find the MMSE estimator we compute the covariance matrices :

Cxx = E{[x − E(x)][x − E(x)]^T} = E{(Hθ + w − Hµθ)(Hθ + w − Hµθ)^T}
    = E{(H(θ − µθ) + w)(H(θ − µθ) + w)^T}
    = H E{(θ − µθ)(θ − µθ)^T} H^T + E(ww^T) = H Cθ H^T + Cw

Cθx = E{(θ − µθ)[H(θ − µθ) + w]^T} = E{(θ − µθ)(θ − µθ)^T} H^T = Cθ H^T

(The Bayesian Philosophy) Estimation Theory Spring 2013 101 / 152

Page 102: Estimation Theory

Bayesian Linear Model

Theorem (Posterior PDF for the Bayesian General Linear Model)

If the observed data can be modeled as

x = Hθ + w

where θ is a p × 1 random vector with prior PDF N (µθ,Cθ) and w is anoise vector with PDF N (0,Cw ), then the posterior PDF p(x|θ) isGaussian with mean :

E(θ|x) = µθ + CθH^T (HCθH^T + Cw)^{-1} (x − Hµθ)

and covariance

C_{θ|x} = Cθ − CθH^T (HCθH^T + Cw)^{-1} HCθ

Remark : In contrast to the classical general linear model, H need not be full rank.
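A small numerical sketch of this theorem with assumed dimensions and parameters (illustrative values only) :

    import numpy as np

    # Sketch (assumed values): posterior mean and covariance for x = H theta + w.
    rng = np.random.default_rng(2)
    N, p = 20, 2
    H = np.column_stack([np.ones(N), np.arange(N, dtype=float)])   # example regressors
    mu_theta = np.zeros(p)
    C_theta = np.diag([4.0, 0.25])
    C_w = 0.5 * np.eye(N)

    theta = rng.multivariate_normal(mu_theta, C_theta)
    x = H @ theta + rng.multivariate_normal(np.zeros(N), C_w)

    S = H @ C_theta @ H.T + C_w
    G = C_theta @ H.T @ np.linalg.inv(S)
    theta_hat = mu_theta + G @ (x - H @ mu_theta)       # E(theta|x)
    C_post = C_theta - G @ H @ C_theta                  # C_{theta|x}
    print(theta, theta_hat, np.diag(C_post))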

(The Bayesian Philosophy) Estimation Theory Spring 2013 102 / 152

Page 103: Estimation Theory

Bayesian Linear Model

Example (DC Level in WGN with Gaussian Prior PDF)

x[n] = A + w[n],  n = 0, . . . , N − 1,   A ∼ N(µA, σ²_A),   w[n] ∼ N(0, σ²)

We have the general Bayesian linear model x = HA+w with H = 1. Then

E(A|x) = µA + σ²_A 1^T (σ²_A 11^T + σ²I)^{-1} (x − 1µA)

var(A|x) = σ²_A − σ²_A 1^T (σ²_A 11^T + σ²I)^{-1} 1 σ²_A

We use the Matrix Inversion Lemma to get :

(I + (σ²_A/σ²) 11^T)^{-1} = I − 11^T / (σ²/σ²_A + 1^T 1) = I − 11^T / (σ²/σ²_A + N)

so that, with x̄ denoting the sample mean,

E(A|x) = µA + (σ²_A / (σ²_A + σ²/N)) (x̄ − µA)   and   var(A|x) = (σ²/N) σ²_A / (σ²_A + σ²/N)
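A quick check of this closed form against the direct matrix expression (a sketch with assumed values) :

    import numpy as np

    # Sketch (assumed values): DC level in WGN with a Gaussian prior.
    rng = np.random.default_rng(3)
    N, mu_A, var_A, var_w = 25, 1.0, 0.5, 2.0
    A = rng.normal(mu_A, np.sqrt(var_A))
    x = A + np.sqrt(var_w) * rng.standard_normal(N)

    # closed form obtained via the matrix inversion lemma
    A_closed = mu_A + var_A / (var_A + var_w / N) * (x.mean() - mu_A)

    # direct evaluation of mu_A + var_A 1^T (var_A 11^T + var_w I)^{-1} (x - 1 mu_A)
    one = np.ones(N)
    Cxx = var_A * np.outer(one, one) + var_w * np.eye(N)
    A_direct = mu_A + var_A * one @ np.linalg.solve(Cxx, x - mu_A * one)
    print(A_closed, A_direct)                    # identical up to round-off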

(The Bayesian Philosophy) Estimation Theory Spring 2013 103 / 152

Page 104: Estimation Theory

Nuisance Parameters

Definition (Nuisance Parameter)

Suppose that θ and α are unknown parameters but we are only interested in estimating θ. In this case α is called a nuisance parameter.

In the classical approach we have to estimate both, but in the Bayesian approach we can integrate it out. Note that in the Bayesian approach we can find p(θ|x) from p(θ, α|x) as a marginal PDF :

p(θ|x) = ∫ p(θ, α|x) dα

We can also express it as

p(θ|x) = p(x|θ)p(θ) / ∫ p(x|θ)p(θ) dθ    where    p(x|θ) = ∫ p(x|θ, α) p(α|θ) dα

Furthermore, if α is independent of θ we have :

p(x|θ) = ∫ p(x|θ, α) p(α) dα

(The Bayesian Philosophy) Estimation Theory Spring 2013 104 / 152

Page 105: Estimation Theory

General Bayesian Estimators

(General Bayesian Estimators) Estimation Theory Spring 2013 105 / 152

Page 106: Estimation Theory

General Bayesian Estimators

Risk Function

A general Bayesian estimator is obtained by minimizing the Bayes Risk

θ̂ = arg min_θ̂ R(θ̂)

where R(θ̂) = E[C(ε)] is the Bayes risk, C(ε) is a cost function, and ε = θ − θ̂ is the estimation error.

Three common risk functions

1. Quadratic : R(θ̂) = E(ε²) = E[(θ − θ̂)²]

2. Absolute : R(θ̂) = E|ε| = E|θ − θ̂|

3. Hit-or-Miss : R(θ̂) = E[C(ε)] where C(ε) = 0 for |ε| < δ and C(ε) = 1 for |ε| ≥ δ

(General Bayesian Estimators) Estimation Theory Spring 2013 106 / 152

Page 107: Estimation Theory

General Bayesian Estimators

The Bayes risk is

R(θ̂) = E[C(ε)] = ∫∫ C(θ − θ̂) p(x, θ) dx dθ = ∫ g(θ̂) p(x) dx

where g(θ̂) = ∫ C(θ − θ̂) p(θ|x) dθ should be minimized.

Quadratic

For this case R(θ̂) = E[(θ − θ̂)²] = Bmse(θ̂), which leads to the MMSE estimator with

θ̂ = E(θ|x)

so θ̂ is the mean of the posterior PDF p(θ|x).

(General Bayesian Estimators) Estimation Theory Spring 2013 107 / 152

Page 108: Estimation Theory

General Bayesian Estimators

Absolute

In this case we have :

g(θ̂) = ∫ |θ − θ̂| p(θ|x) dθ = ∫_{−∞}^{θ̂} (θ̂ − θ) p(θ|x) dθ + ∫_{θ̂}^{∞} (θ − θ̂) p(θ|x) dθ

By setting the derivative of g(θ̂) with respect to θ̂ equal to zero and using Leibniz's rule, we get :

∫_{−∞}^{θ̂} p(θ|x) dθ = ∫_{θ̂}^{∞} p(θ|x) dθ

Then θ̂ is the median (area to the left = area to the right) of p(θ|x).

(General Bayesian Estimators) Estimation Theory Spring 2013 108 / 152

Page 109: Estimation Theory

General Bayesian Estimators

Hit-or-Miss

In this case we have

g(θ̂) = ∫_{−∞}^{θ̂−δ} 1 · p(θ|x) dθ + ∫_{θ̂+δ}^{∞} 1 · p(θ|x) dθ = 1 − ∫_{θ̂−δ}^{θ̂+δ} p(θ|x) dθ

For δ arbitrarily small, the optimal estimate is the location of the maximum of p(θ|x), i.e. the mode of the posterior PDF.

This estimator is called the maximum a posteriori (MAP) estimator.

Remark : For a unimodal and symmetric posterior PDF (e.g. a Gaussian PDF), the mean, the mode and the median are the same.
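Illustration : the three cost functions pick out the mean, the median and the mode of the posterior, which differ once the posterior is skewed. A minimal sketch with an assumed posterior represented on a grid :

    import numpy as np

    # Sketch: mean (quadratic cost), median (absolute cost) and mode (hit-or-miss)
    # of an assumed skewed posterior represented on a grid.
    theta = np.linspace(0.0, 12.0, 4001)
    post = theta**2 * np.exp(-theta)                 # unnormalized, skewed example
    post /= np.trapz(post, theta)

    mean = np.trapz(theta * post, theta)             # MMSE estimate
    cdf = np.cumsum(post) * (theta[1] - theta[0])
    median = theta[np.searchsorted(cdf, 0.5)]        # absolute-error estimate
    mode = theta[np.argmax(post)]                    # MAP estimate
    print(mean, median, mode)                        # all differ for a skewed PDF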

(General Bayesian Estimators) Estimation Theory Spring 2013 109 / 152

Page 110: Estimation Theory

Minimum Mean Square Error Estimators

Extension to the vector parameter case : In general, as in the scalar case, we can write :

θ̂_i = E(θ_i|x) = ∫ θ_i p(θ_i|x) dθ_i,   i = 1, 2, . . . , p

Then p(θ_i|x) can be computed as a marginal conditional PDF. For example :

θ̂_1 = ∫ θ_1 [ ∫ · · · ∫ p(θ|x) dθ_2 · · · dθ_p ] dθ_1 = ∫ θ_1 p(θ|x) dθ

In vector form we have :

θ̂ = [ ∫ θ_1 p(θ|x) dθ ; ∫ θ_2 p(θ|x) dθ ; . . . ; ∫ θ_p p(θ|x) dθ ] = ∫ θ p(θ|x) dθ = E(θ|x)

Similarly Bmse(θ̂_i) = ∫ [C_{θ|x}]_{ii} p(x) dx

(General Bayesian Estimators) Estimation Theory Spring 2013 110 / 152

Page 111: Estimation Theory

Properties of MMSE Estimators

For the Bayesian linear model, poor prior knowledge leads to the MVU estimator.

θ̂ = E(θ|x) = µθ + (Cθ^{-1} + H^T Cw^{-1} H)^{-1} H^T Cw^{-1} (x − Hµθ)

For no prior knowledge, µθ → 0 and Cθ^{-1} → 0, and therefore :

θ̂ → (H^T Cw^{-1} H)^{-1} H^T Cw^{-1} x

Commutes over affine transformations. Suppose that α = Aθ + b ; then the MMSE estimator for α is

α̂ = E(α|x) = E(Aθ + b|x) = A E(θ|x) + b = A θ̂ + b

Enjoys the additive property for independent data sets. Assume that θ, x1, x2 are jointly Gaussian with x1 and x2 independent :

θ̂ = E(θ) + C_{θx1} C_{x1x1}^{-1} (x1 − E(x1)) + C_{θx2} C_{x2x2}^{-1} (x2 − E(x2))

(General Bayesian Estimators) Estimation Theory Spring 2013 111 / 152

Page 112: Estimation Theory

Maximum A Posteriori Estimators

In the MAP estimation approach we have :

θ̂ = arg max_θ p(θ|x)    where    p(θ|x) = p(x|θ)p(θ) / p(x)

An equivalent maximization is :

θ̂ = arg max_θ p(x|θ)p(θ)    or    θ̂ = arg max_θ [ln p(x|θ) + ln p(θ)]

If p(θ) is uniform or approximately constant around the maximum of p(x|θ) (for large data length), we can remove p(θ) to obtain the Bayesian maximum likelihood estimator :

θ̂ = arg max_θ p(x|θ)

(General Bayesian Estimators) Estimation Theory Spring 2013 112 / 152

Page 113: Estimation Theory

Maximum A Posteriori Estimators

Example (DC Level in WGN with Uniform Prior PDF)

The MMSE estimator cannot be obtained in explicit form due to the need to evaluate the following integrals :

       (1/(2A0)) ∫_{−A0}^{A0} A (2πσ²)^{−N/2} exp[ −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² ] dA
Â = -------------------------------------------------------------------------------------------
       (1/(2A0)) ∫_{−A0}^{A0} (2πσ²)^{−N/2} exp[ −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² ] dA

The MAP estimator is given in terms of the sample mean x̄ as :

Â = −A0   if x̄ < −A0
Â = x̄     if −A0 ≤ x̄ ≤ A0
Â = A0    if x̄ > A0

Remark : The main advantage of the MAP estimator is that for non jointly Gaussian PDFs it may lead to explicit solutions or require less computational effort.

(General Bayesian Estimators) Estimation Theory Spring 2013 113 / 152

Page 114: Estimation Theory

Maximum A Posteriori Estimators

Extension to the vector parameter case

θ̂_1 = arg max_{θ_1} p(θ_1|x) = arg max_{θ_1} ∫ · · · ∫ p(θ|x) dθ_2 . . . dθ_p

This needs integral evaluation of the marginal conditional PDFs.

An alternative is the following vector MAP estimator that maximizes the joint conditional PDF :

θ̂ = arg max_θ p(θ|x) = arg max_θ p(x|θ)p(θ)

This corresponds to a circular Hit-or-Miss cost function.

In general, as N → ∞, the MAP estimator → the Bayesian MLE.

If the posterior PDF is Gaussian, the mode is identical to the mean ; therefore the MAP estimator is identical to the MMSE estimator.

The invariance property of ML theory does not hold for the MAP estimator.

(General Bayesian Estimators) Estimation Theory Spring 2013 114 / 152

Page 115: Estimation Theory

Performance Description

In classical estimation, the mean and the variance of the estimate (or its PDF) indicate the performance of the estimator.

In the Bayesian approach, the PDF of the estimate is different for each realization of θ, so a good estimator should perform well for every possible value of θ.

The estimation error ε = θ − θ̂ is a function of two random variables (θ and x).

The mean and the variance of the estimation error, with respect to both random variables, indicate the performance of the estimator.

The mean value of the estimation error is zero, so the estimates are unbiased (in the Bayesian sense).

E_{x,θ}(θ − θ̂) = E_{x,θ}[θ − E(θ|x)] = E_x[ E_{θ|x}(θ) − E(θ|x) ] = E_x(0) = 0

The variance of the estimation error is the Bmse :

var(ε) = E_{x,θ}(ε²) = E_{x,θ}[(θ − θ̂)²] = Bmse(θ̂)

(General Bayesian Estimators) Estimation Theory Spring 2013 115 / 152

Page 116: Estimation Theory

Performance Description (Vector Parameter Case)

The vector of estimation errors is ε = θ − θ̂ and has zero mean.

Its covariance matrix is :

Mθ̂ = E_{x,θ}(εε^T) = E_{x,θ}{ [θ − E(θ|x)][θ − E(θ|x)]^T }
    = E_x{ E_{θ|x}{ [θ − E(θ|x)][θ − E(θ|x)]^T } }
    = E_x(C_{θ|x})

If x and θ are jointly Gaussian, we have :

Mθ̂ = C_{θ|x} = Cθθ − Cθx Cxx^{-1} Cxθ

since Cθ|x does not depend on x.

For a Bayesian linear model we have :

Mθ̂ = C_{θ|x} = Cθ − CθH^T (HCθH^T + Cw)^{-1} HCθ

In this case ε is a linear transformation of x and θ and thus is Gaussian. Therefore :

ε ∼ N(0, Mθ̂)

(General Bayesian Estimators) Estimation Theory Spring 2013 116 / 152

Page 117: Estimation Theory

Signal Processing Example

Deconvolution Problem : Estimate a signal transmitted through a channel with known impulse response :

x[n] = Σ_{m=0}^{ns−1} h[n − m] s[m] + w[n],   n = 0, 1, . . . , N − 1

where s[n] is a WSS Gaussian process with known ACF, and w[n] is WGN with variance σ².

[ x[0]     ]   [ h[0]     0        · · ·  0         ] [ s[0]      ]   [ w[0]     ]
[ x[1]     ] = [ h[1]     h[0]     · · ·  0         ] [ s[1]      ] + [ w[1]     ]
[   ...    ]   [  ...      ...      ...   ...       ] [   ...     ]   [   ...    ]
[ x[N − 1] ]   [ h[N − 1] h[N − 2] · · ·  h[N − ns] ] [ s[ns − 1] ]   [ w[N − 1] ]

x = Hθ + w  ⇒  ŝ = CsH^T (HCsH^T + σ²I)^{-1} x

where Cs is a symmetric Toeplitz matrix with [Cs]_ij = rss[i − j]
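A compact sketch of this estimator in code; the channel impulse response and the AR(1)-type ACF of s[n] below are assumed illustrative values :

    import numpy as np

    # Sketch (assumed values): Bayesian deconvolution
    #   s_hat = Cs H^T (H Cs H^T + sigma^2 I)^{-1} x
    rng = np.random.default_rng(4)
    N = ns = 60
    sigma2 = 0.5
    h = np.array([1.0, 0.6, 0.3])                      # assumed channel impulse response

    H = np.zeros((N, ns))
    for m in range(ns):                                # H[n, m] = h[n - m]
        for n in range(m, min(N, m + len(h))):
            H[n, m] = h[n - m]

    a, var_u = 0.8, 1.0                                # assumed AR(1) model for s[n]
    lags = np.abs(np.arange(ns)[:, None] - np.arange(ns)[None, :])
    Cs = var_u / (1 - a**2) * (-a) ** lags             # [Cs]_ij = rss[i - j]

    s = rng.multivariate_normal(np.zeros(ns), Cs)
    x = H @ s + np.sqrt(sigma2) * rng.standard_normal(N)
    s_hat = Cs @ H.T @ np.linalg.solve(H @ Cs @ H.T + sigma2 * np.eye(N), x)
    print(np.mean((s - s_hat) ** 2), Cs[0, 0])         # estimation MSE vs. prior variance rss[0]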

(General Bayesian Estimators) Estimation Theory Spring 2013 117 / 152

Page 118: Estimation Theory

Signal Processing Example

Consider that H = I (no filtering), so the Bayesian model becomes

x = s+ w

Classical Approach : The MVU estimator is ŝ = (H^T H)^{-1} H^T x = x.

Bayesian Approach : The MMSE estimator is ŝ = Cs(Cs + σ²I)^{-1} x.

Scalar case : We estimate s[0] based on x [0] :

ŝ[0] = (rss[0] / (rss[0] + σ²)) x[0] = (η / (η + 1)) x[0]

where η = rss[0]/σ² is the SNR.

Signal is a Realization of an Autoregressive Process : Consider a first-order AR process s[n] = −a s[n − 1] + u[n], where u[n] is WGN with variance σ²_u. The ACF of s is :

rss[k] = (σ²_u / (1 − a²)) (−a)^{|k|}  ⇒  [Cs]_ij = rss[i − j]

(General Bayesian Estimators) Estimation Theory Spring 2013 118 / 152

Page 119: Estimation Theory

Linear Bayesian Estimators

(Linear Bayesian Estimators) Estimation Theory Spring 2013 119 / 152

Page 120: Estimation Theory

Introduction

Problems with general Bayesian estimators :

difficult to determine in closed form,

need intensive computations,

involve multidimensional integration (MMSE) or multidimensional maximization (MAP),

can be determined only under the jointly Gaussian assumption.

What can we do if the joint PDF is not Gaussian or unknown ?

Keep the MMSE criterion ;

Restrict the estimator to be linear.

This leads to the Linear MMSE Estimator :

Which is a suboptimal estimator that can be easily implemented.

It needs only the first and second moments of the joint PDF.

It is analogous to BLUE in classical estimation.

In practice, this estimator is termed Wiener filter.

(Linear Bayesian Estimators) Estimation Theory Spring 2013 120 / 152

Page 121: Estimation Theory

Linear MMSE Estimators (scalar case)

Goal : Estimate θ, given the data vector x. Assume that only the first two moments of the joint PDF of x and θ are available :

[ E(θ) ]        [ Cθθ  Cθx ]
[ E(x) ]   ,    [ Cxθ  Cxx ]

LMMSE Estimator : Take the class of all affine estimators

θ̂ = Σ_{n=0}^{N−1} a_n x[n] + a_N = a^T x + a_N

where a = [a_0, a_1, . . . , a_{N−1}]^T. Then minimize the Bayesian MSE :

Bmse(θ̂) = E[(θ − θ̂)²]

Computing a_N : Let's differentiate the Bmse with respect to a_N :

∂/∂a_N E[(θ − a^T x − a_N)²] = −2 E[(θ − a^T x − a_N)]

Setting this equal to zero gives : a_N = E(θ) − a^T E(x).

(Linear Bayesian Estimators) Estimation Theory Spring 2013 121 / 152

Page 122: Estimation Theory

Linear MMSE Estimators (scalar case)

Minimize the Bayesian MSE :

Bmse(θ̂) = E[(θ − θ̂)²] = E[(θ − a^T x − E(θ) + a^T E(x))²]
         = E{ [a^T (x − E(x)) − (θ − E(θ))]² }
         = E[a^T (x − E(x))(x − E(x))^T a] − E[a^T (x − E(x))(θ − E(θ))]
           − E[(θ − E(θ))(x − E(x))^T a] + E[(θ − E(θ))²]
         = a^T Cxx a − a^T Cxθ − Cθx a + Cθθ

This can be minimized by setting the gradient to zero :

∂Bmse(θ̂)/∂a = 2 Cxx a − 2 Cxθ = 0

which results in a = Cxx^{-1} Cxθ and leads to (note that Cθx = Cxθ^T) :

θ̂ = a^T x + a_N = Cxθ^T Cxx^{-1} x + E(θ) − Cxθ^T Cxx^{-1} E(x)
   = E(θ) + Cθx Cxx^{-1} (x − E(x))

(Linear Bayesian Estimators) Estimation Theory Spring 2013 122 / 152

Page 123: Estimation Theory

Linear MMSE Estimators (scalar case)

The minimum Bayesian MSE is obtained as :

Bmse(θ̂) = Cxθ^T Cxx^{-1} Cxx Cxx^{-1} Cxθ − Cxθ^T Cxx^{-1} Cxθ − Cθx Cxx^{-1} Cxθ + Cθθ
         = Cθθ − Cθx Cxx^{-1} Cxθ

Remarks :

The results are identical to those of the MMSE estimator for a jointly Gaussian PDF.

The LMMSE estimator relies on the correlation between random variables.

If a parameter is uncorrelated with the data (but nonlinearly dependent), it cannot be estimated with an LMMSE estimator.

(Linear Bayesian Estimators) Estimation Theory Spring 2013 123 / 152

Page 124: Estimation Theory

Linear MMSE Estimators (scalar case)

Example (DC Level in WGN with Uniform Prior PDF)

The data model is : x[n] = A + w[n], n = 0, 1, . . . , N − 1, where A ∼ U[−A0, A0] is independent of w[n] (WGN with variance σ²). We have E(A) = 0 and E(x) = 0. The covariances are :

Cxx = E(xx^T) = E[(A1 + w)(A1 + w)^T] = E(A²) 11^T + σ²I

Cθx = E(Ax^T) = E[A(A1 + w)^T] = E(A²) 1^T

Therefore the LMMSE estimator is (with x̄ the sample mean) :

Â = Cθx Cxx^{-1} x = σ²_A 1^T (σ²_A 11^T + σ²I)^{-1} x = (A0²/3) / (A0²/3 + σ²/N) x̄

where

σ²_A = E(A²) = ∫_{−A0}^{A0} A² (1/(2A0)) dA = A³/(6A0) |_{−A0}^{A0} = A0²/3
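A short numerical check (assumed values) that the shrinkage form above matches the direct matrix expression :

    import numpy as np

    # Sketch (assumed values): LMMSE estimator of A ~ U[-A0, A0] in WGN.
    rng = np.random.default_rng(5)
    A0, sigma, N = 3.0, 1.0, 15
    A = rng.uniform(-A0, A0)
    x = A + sigma * rng.standard_normal(N)

    var_A = A0**2 / 3                                  # E(A^2) for the uniform prior
    A_shrink = var_A / (var_A + sigma**2 / N) * x.mean()

    one = np.ones(N)
    Cxx = var_A * np.outer(one, one) + sigma**2 * np.eye(N)
    A_matrix = var_A * one @ np.linalg.solve(Cxx, x)   # C_thetax Cxx^{-1} x
    print(A_shrink, A_matrix)                          # identical up to round-off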

(Linear Bayesian Estimators) Estimation Theory Spring 2013 124 / 152

Page 125: Estimation Theory

Linear MMSE Estimators (scalar case)

Example (DC Level in WGN with Uniform Prior PDF)

Comparison of different Bayesian estimators :

MMSE :

       ∫_{−A0}^{A0} A exp[ −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² ] dA
Â = -------------------------------------------------------------------
       ∫_{−A0}^{A0} exp[ −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)² ] dA

MAP (with x̄ the sample mean) :

Â = −A0   if x̄ < −A0
Â = x̄     if −A0 ≤ x̄ ≤ A0
Â = A0    if x̄ > A0

LMMSE :

Â = (A0²/3) / (A0²/3 + σ²/N) x̄
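A minimal numerical comparison of the three estimators for one simulated data record (assumed values; the MMSE integral is evaluated on a grid) :

    import numpy as np

    # Sketch (assumed values): compare MMSE, MAP and LMMSE for the uniform prior.
    rng = np.random.default_rng(6)
    A0, sigma, N = 1.0, 1.0, 10
    A = rng.uniform(-A0, A0)
    x = A + sigma * rng.standard_normal(N)
    xbar = x.mean()

    grid = np.linspace(-A0, A0, 2001)                       # MMSE by grid integration
    loglik = -0.5 / sigma**2 * np.sum((x[:, None] - grid)**2, axis=0)
    w = np.exp(loglik - loglik.max())
    A_mmse = np.trapz(grid * w, grid) / np.trapz(w, grid)

    A_map = np.clip(xbar, -A0, A0)                          # clamp the sample mean
    A_lmmse = (A0**2 / 3) / (A0**2 / 3 + sigma**2 / N) * xbar
    print(A, A_mmse, A_map, A_lmmse)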

(Linear Bayesian Estimators) Estimation Theory Spring 2013 125 / 152

Page 126: Estimation Theory

Geometrical Interpretation

Vector space of random variables : The set of scalar zero mean random variables is a vector space.

The zero length vector of the set is a RV with zero variance.

For any real scalar a, ax is another zero mean RV in the set.

For two RVs x and y , x + y = y + x is a RV in the set.

For two RVs x and y, the inner product is : ⟨x, y⟩ = E(xy)

The length of x is defined as : ‖x‖ = √⟨x, x⟩ = √E(x²)

Two RVs x and y are orthogonal if : ⟨x, y⟩ = E(xy) = 0.

For RVs x1, x2, y and real numbers a1 and a2 we have :

⟨a1 x1 + a2 x2, y⟩ = a1 ⟨x1, y⟩ + a2 ⟨x2, y⟩,   i.e.   E[(a1 x1 + a2 x2)y] = a1 E(x1 y) + a2 E(x2 y)

The projection of y on x is : (⟨y, x⟩/‖x‖²) x = (E(yx)/σ²_x) x

(Linear Bayesian Estimators) Estimation Theory Spring 2013 126 / 152

Page 127: Estimation Theory

Geometrical Interpretation

The LMMSE estimator can be determined using the vector space viewpoint. We have :

θ̂ = Σ_{n=0}^{N−1} a_n x[n]

where a_N is zero because of the zero mean assumption.

θ̂ belongs to the subspace spanned by x[0], x[1], . . . , x[N − 1]; θ is not in this subspace. A good estimator will minimize the MSE :

E[(θ − θ̂)²] = ‖ε‖²

where ε = θ − θ̂ is the error vector. Clearly the length of the error vector is minimized when ε is orthogonal to the subspace spanned by x[0], x[1], . . . , x[N − 1], or to each data sample :

E[(θ − θ̂) x[n]] = 0   for n = 0, 1, . . . , N − 1

(Linear Bayesian Estimators) Estimation Theory Spring 2013 127 / 152

Page 128: Estimation Theory

Geometrical Interpretation

The LMMSE estimator can be determined by solving the following equations :

E[ (θ − Σ_{m=0}^{N−1} a_m x[m]) x[n] ] = 0,   n = 0, 1, . . . , N − 1

or

Σ_{m=0}^{N−1} a_m E(x[m]x[n]) = E(θ x[n]),   n = 0, 1, . . . , N − 1

[ E(x²[0])         E(x[0]x[1])      · · ·  E(x[0]x[N − 1]) ] [ a_0     ]   [ E(θx[0])     ]
[ E(x[1]x[0])      E(x²[1])         · · ·  E(x[1]x[N − 1]) ] [ a_1     ] = [ E(θx[1])     ]
[     ...              ...           ...        ...        ] [  ...    ]   [    ...       ]
[ E(x[N − 1]x[0])  E(x[N − 1]x[1])  · · ·  E(x²[N − 1])    ] [ a_{N−1} ]   [ E(θx[N − 1]) ]

Therefore

Cxx a = Cθx^T  ⇒  a = Cxx^{-1} Cθx^T

The LMMSE estimator is :

θ̂ = a^T x = Cθx Cxx^{-1} x
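A sketch of these normal equations in code, with the second moments E(x[m]x[n]) and E(θx[n]) estimated by Monte Carlo for an assumed zero-mean model θ ∼ U[−1, 1], x[n] = θ + w[n] :

    import numpy as np

    # Sketch (assumed model): build and solve the normal equations Cxx a = Cθx^T.
    rng = np.random.default_rng(7)
    N, sigma, trials = 5, 1.0, 200_000
    theta = rng.uniform(-1.0, 1.0, size=trials)
    X = theta[:, None] + sigma * rng.standard_normal((trials, N))

    Cxx = X.T @ X / trials                     # estimates of E(x[m] x[n])
    Ctx = X.T @ theta / trials                 # estimates of E(theta x[n])
    a = np.linalg.solve(Cxx, Ctx)              # normal equations

    x_new = theta[0] + sigma * rng.standard_normal(N)
    print(a, a @ x_new)                        # LMMSE weights and one estimate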

(Linear Bayesian Estimators) Estimation Theory Spring 2013 128 / 152

Page 129: Estimation Theory

The vector LMMSE Estimator

We want to estimate θ = [θ_1, θ_2, . . . , θ_p]^T with a linear estimator that minimizes the Bayesian MSE for each element.

θ̂_i = Σ_{n=0}^{N−1} a_{in} x[n] + a_{iN}    Minimize Bmse(θ̂_i) = E[(θ_i − θ̂_i)²]

Therefore :

θ̂_i = E(θ_i) + C_{θ_i x} Cxx^{-1} (x − E(x)),   i = 1, 2, . . . , p
       (dimensions : 1×N, N×N, N×1)

The scalar LMMSE estimators can be combined into a vector estimator :

θ̂ = E(θ) + Cθx Cxx^{-1} (x − E(x))
     (dimensions : p×1, p×N, N×N, N×1)

and similarly

Bmse(θ̂_i) = [Mθ̂]_ii   where   Mθ̂ = Cθθ − Cθx Cxx^{-1} Cxθ    (Mθ̂, Cθθ : p×p ; Cθx : p×N ; Cxθ : N×p)

(Linear Bayesian Estimators) Estimation Theory Spring 2013 129 / 152

Page 130: Estimation Theory

The vector LMMSE Estimator

Theorem (Bayesian Gauss–Markov Theorem)

If the data are described by the Bayesian linear model form

x = Hθ + w

where θ is a p × 1 random vector with mean E(θ) and covariance Cθθ, w is a noise vector with zero mean and covariance Cw which is uncorrelated with θ (the joint PDF p(θ, w) is otherwise arbitrary), then the LMMSE estimator of θ is :

θ̂ = E(θ) + CθθH^T (HCθθH^T + Cw)^{-1} (x − HE(θ))

The performance of the estimator is measured by ε = θ − θ̂, whose mean is zero and whose covariance matrix is

Cε = Mθ̂ = Cθθ − CθθH^T (HCθθH^T + Cw)^{-1} HCθθ = (Cθθ^{-1} + H^T Cw^{-1} H)^{-1}

(Linear Bayesian Estimators) Estimation Theory Spring 2013 130 / 152

Page 131: Estimation Theory

Sequential LMMSE Estimation

Objective : Given the estimate θ̂[n − 1] based on x[n − 1], update to the new estimate θ̂[n] using the new sample x[n].

Example (DC Level in White Noise)

Assume that both A and w [n] have zero mean :

x[n] = A + w[n]  ⇒  Â[N − 1] = (σ²_A / (σ²_A + σ²/N)) x̄

Estimator Update : Â[N] = Â[N − 1] + K[N](x[N] − Â[N − 1])

where K[N] = Bmse(Â[N − 1]) / (Bmse(Â[N − 1]) + σ²)

Minimum MSE Update :

Bmse(Â[N]) = (1 − K[N]) Bmse(Â[N − 1])
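A minimal sketch of this sequential update (assumed values), starting from the prior and processing one sample at a time :

    import numpy as np

    # Sketch (assumed values): sequential LMMSE for x[n] = A + w[n], A zero mean.
    rng = np.random.default_rng(8)
    var_A, sigma2, N = 2.0, 1.0, 50
    A = rng.normal(0.0, np.sqrt(var_A))
    x = A + np.sqrt(sigma2) * rng.standard_normal(N)

    A_hat, bmse = 0.0, var_A                    # prior mean and prior variance
    for n in range(N):
        K = bmse / (bmse + sigma2)              # gain
        A_hat = A_hat + K * (x[n] - A_hat)      # estimator update
        bmse = (1 - K) * bmse                   # minimum MSE update
    # batch (closed-form) estimate for comparison
    print(A_hat, var_A / (var_A + sigma2 / N) * x.mean(), bmse)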

(Linear Bayesian Estimators) Estimation Theory Spring 2013 131 / 152

Page 132: Estimation Theory

Sequential LMMSE Estimation

Vector space view : If two observations are orthogonal, the LMMSE estimate of θ is the sum of the projections of θ on each observation.

1. Find the LMMSE estimator of A based on x[0], yielding Â[0] :

Â[0] = (E(Ax[0]) / E(x²[0])) x[0] = (σ²_A / (σ²_A + σ²)) x[0]

2. Find the LMMSE estimator of x[1] based on x[0], yielding x̂[1|0] :

x̂[1|0] = (E(x[0]x[1]) / E(x²[0])) x[0] = (σ²_A / (σ²_A + σ²)) x[0]

3. Determine the innovation of the new data : x̃[1] = x[1] − x̂[1|0]. This error vector is orthogonal to x[0].

4. Add to Â[0] the LMMSE estimator of A based on the innovation :

Â[1] = Â[0] + (E(Ax̃[1]) / E(x̃²[1])) x̃[1]   (the projection of A on x̃[1])
     = Â[0] + K[1](x[1] − x̂[1|0])

(Linear Bayesian Estimators) Estimation Theory Spring 2013 132 / 152

Page 133: Estimation Theory

Sequential LMMSE Estimation

Basic Idea : Generate a sequence of orthogonal RVs, namely, the innovations :

x̃[0] = x[0],  x̃[1] = x[1] − x̂[1|0],  x̃[2] = x[2] − x̂[2|0, 1],  . . . ,  x̃[n] = x[n] − x̂[n|0, 1, . . . , n − 1]

Then, add the individual estimators to yield :

Â[N] = Σ_{n=0}^{N} K[n] x̃[n] = Â[N − 1] + K[N] x̃[N]   where   K[n] = E(A x̃[n]) / E(x̃²[n])

It can be shown that :

x̃[N] = x[N] − Â[N − 1]   and   K[N] = Bmse(Â[N − 1]) / (σ² + Bmse(Â[N − 1]))

and the minimum MSE is updated as :

Bmse(Â[N]) = (1 − K[N]) Bmse(Â[N − 1])

(Linear Bayesian Estimators) Estimation Theory Spring 2013 133 / 152

Page 134: Estimation Theory

General Sequential LMMSE Estimation

Consider the general Bayesian linear model x = Hθ + w, where w is an uncorrelated noise with the diagonal covariance matrix Cw. Let's define :

Cw[n] = diag(σ²_0, σ²_1, . . . , σ²_n)

H[n] = [ H[n − 1] ]      (n × p)
       [ h^T[n]   ]      (1 × p)

x[n] = [x[0], x[1], . . . , x[n]]^T

Estimator Update :

θ̂[n] = θ̂[n − 1] + K[n](x[n] − h^T[n] θ̂[n − 1])

where

K[n] = M[n − 1] h[n] / (σ²_n + h^T[n] M[n − 1] h[n])

Minimum MSE Matrix Update :

M[n] = (I − K[n] h^T[n]) M[n − 1]

(Linear Bayesian Estimators) Estimation Theory Spring 2013 134 / 152

Page 135: Estimation Theory

General Sequential LMMSE Estimation

Remarks :

For the initialization of the sequential LMMSE estimator, we can use the prior information :

θ̂[−1] = E(θ),   M[−1] = Cθθ

For no prior knowledge about θ we can let Cθθ → ∞. Then we have the same form as the sequential LSE, although the approaches are fundamentally different.

No matrix inversion is required.

The gain factor K[n] weights confidence in the new data (measured by σ²_n) against all previous data (summarized by M[n − 1]).

(Linear Bayesian Estimators) Estimation Theory Spring 2013 135 / 152

Page 136: Estimation Theory

Wiener Filtering

Consider the signal model : x[n] = s[n] + w[n], n = 0, 1, . . . , N − 1, where the signal and noise are zero mean with known covariance matrices.

There are three main problems concerning Wiener filters :

Filtering : Estimate θ = s[n] (scalar) based on the data set x = [x[0], x[1], . . . , x[n]]. The signal sample is estimated based on the present and past data only.

Smoothing : Estimate θ = s = [s[0], s[1], . . . , s[N − 1]] (vector) based on the data set x = [x[0], x[1], . . . , x[N − 1]]. The signal sample is estimated based on the present, past and future data.

Prediction : Estimate θ = x[N − 1 + ℓ] for a positive integer ℓ based on the data set x = [x[0], x[1], . . . , x[N − 1]].

Remarks : All these problems are solved using the LMMSE estimator :

θ̂ = Cθx Cxx^{-1} x    with the minimum Bmse :   Mθ̂ = Cθθ − Cθx Cxx^{-1} Cxθ

(Linear Bayesian Estimators) Estimation Theory Spring 2013 136 / 152

Page 137: Estimation Theory

Wiener Filtering (Smoothing)

Estimate θ = s = [s[0], s[1], . . . , s[N − 1]] (vector) based on the data set x = [x[0], x[1], . . . , x[N − 1]].

Cxx = Rxx = Rss +Rww and Cθx = E (sxT ) = E (s(s+w)T ) = Rss

Therefore :

ŝ = Csx Cxx^{-1} x = Rss(Rss + Rww)^{-1} x = Wx

Filter Interpretation : The Wiener smoothing matrix can be interpreted as an FIR filter. Write the rows of W as a_n^T (so a_n^T is the (n + 1)th row of W) and define h_n = [h^(n)[0], h^(n)[1], . . . , h^(n)[N − 1]], which is just the vector a_n flipped upside down. Then

ŝ[n] = a_n^T x = Σ_{k=0}^{N−1} h^(n)[N − 1 − k] x[k]

This represents a time-varying, non-causal FIR filter.

(Linear Bayesian Estimators) Estimation Theory Spring 2013 137 / 152

Page 138: Estimation Theory

Wiener Filtering (Filtering)

Estimate θ = s[n] based on the data set x = [x[0], x[1], . . . , x[n]]. We have again Cxx = Rxx = Rss + Rww and

Cθx = E(s[n] [x[0], x[1], . . . , x[n]]) = [rss[n], rss[n − 1], . . . , rss[0]] = (r'ss)^T

Therefore :

ŝ[n] = (r'ss)^T (Rss + Rww)^{-1} x = a^T x

Filter Interpretation : If we define h_n = [h^(n)[0], h^(n)[1], . . . , h^(n)[n]], which is just the vector a flipped upside down, then :

ŝ[n] = a^T x = Σ_{k=0}^{n} h^(n)[n − k] x[k]

This represents a time-varying, causal FIR filter. The impulse response can be computed as :

(Rss + Rww) a = r'ss  ⇒  (Rss + Rww) h_n = rss
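A short sketch of the causal Wiener filter for the AR(1) signal model used earlier (assumed values) : the weight vector a is obtained from (Rss + Rww) a = r'ss and applied to x[0], . . . , x[n] :

    import numpy as np

    # Sketch (assumed values): causal Wiener estimate of s[n] from x[0..n],
    # with AR(1) signal ACF rss[k] = var_u/(1-a^2) (-a)^|k| and white noise.
    rng = np.random.default_rng(9)
    n, a, var_u, var_w = 30, 0.8, 1.0, 0.5
    k = np.arange(n + 1)
    rss = var_u / (1 - a**2) * (-a) ** k

    lags = np.abs(k[:, None] - k[None, :])
    Rss = var_u / (1 - a**2) * (-a) ** lags
    Rww = var_w * np.eye(n + 1)

    r_prime = rss[::-1]                            # [rss[n], rss[n-1], ..., rss[0]]
    weights = np.linalg.solve(Rss + Rww, r_prime)  # (Rss + Rww) a = r'ss

    # generate one realization and estimate s[n]
    s = np.zeros(n + 1)
    s[0] = rng.normal(0, np.sqrt(var_u / (1 - a**2)))
    for t in range(1, n + 1):
        s[t] = -a * s[t - 1] + rng.normal(0, np.sqrt(var_u))
    x = s + np.sqrt(var_w) * rng.standard_normal(n + 1)
    print(weights @ x, s[n])                       # estimate of s[n] vs. true value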

(Linear Bayesian Estimators) Estimation Theory Spring 2013 138 / 152

Page 139: Estimation Theory

Wiener Filtering (Prediction)

Estimate θ = x[N − 1 + ℓ] based on the data set x = [x[0], x[1], . . . , x[N − 1]]. We have again Cxx = Rxx and

Cθx = E(x[N − 1 + ℓ] [x[0], x[1], . . . , x[N − 1]]) = [rxx[N − 1 + ℓ], rxx[N − 2 + ℓ], . . . , rxx[ℓ]] = (r'xx)^T

Therefore :

x̂[N − 1 + ℓ] = (r'xx)^T Rxx^{-1} x = a^T x

Filter Interpretation : If we let again h[N − k] = a_k, we have

x̂[N − 1 + ℓ] = a^T x = Σ_{k=0}^{N−1} h[N − k] x[k]

This represents a time-invariant, causal FIR filter. The impulse response can be computed as :

Rxx h = rxx

For ℓ = 1, this is equivalent to an auto-regressive process.

(Linear Bayesian Estimators) Estimation Theory Spring 2013 139 / 152

Page 140: Estimation Theory

Kalman Filters

(Kalman Filters) Estimation Theory Spring 2013 140 / 152

Page 141: Estimation Theory

Introduction

The Kalman filter is a generalization of the Wiener filter.

In the Wiener filter we estimate s[n] based on the noisy observation vector x[n]. We assume that s[n] is a stationary random process with known mean and covariance matrix.

In the Kalman filter we assume that s[n] is a non-stationary random process whose mean and covariance matrix vary according to a known dynamical model.

The Kalman filter is a sequential MMSE estimator of s[n]. If the signal and noise are jointly Gaussian, the Kalman filter is the optimal MMSE estimator; if not, it is the optimal LMMSE estimator.

The Kalman filter has many applications in control theory for state estimation.

It can be generalized to vector signals and noise (in contrast to the Wiener filter).

(Kalman Filters) Estimation Theory Spring 2013 141 / 152

Page 142: Estimation Theory

Dynamical Signal Models

Consider a first order dynamical model :

s[n] = as[n − 1] + u[n] n ≥ 0

where u[n] is WGN with variance σ²_u, s[−1] ∼ N(µs, σ²_s), and s[−1] is independent of u[n] for all n ≥ 0. It can be shown that :

s[n] = a^{n+1} s[−1] + Σ_{k=0}^{n} a^k u[n − k]  ⇒  E(s[n]) = a^{n+1} µs

and

Cs[m, n] = E[(s[m] − E(s[m]))(s[n] − E(s[n]))] = a^{m+n+2} σ²_s + σ²_u a^{m−n} Σ_{k=0}^{n} a^{2k}   for m ≥ n

and Cs[m, n] = Cs[n, m] for m < n.

(Kalman Filters) Estimation Theory Spring 2013 142 / 152

Page 143: Estimation Theory

Dynamical Signal Models

Theorem (Vector Gauss-Markov Model)

The Gauss-Markov model for a p × 1 vector signal s[n] is :

s[n] = As[n − 1] + Bu[n] n ≥ 0

A is p × p, B is p × r, and all eigenvalues of A are less than 1 in magnitude. The r × 1 vector u[n] ∼ N(0, Q) and the initial condition is a p × 1 random vector s[−1] ∼ N(µs, Cs), independent of the u[n]'s. Then the signal process is Gaussian with mean E(s[n]) = A^{n+1} µs and covariance :

Cs[m, n] = A^{m+1} Cs [A^{n+1}]^T + Σ_{k=m−n}^{m} A^k B Q B^T [A^{n−m+k}]^T

for m ≥ n and Cs[m, n] = Cs[n, m] for m < n. The covariance matrix is C[n] = Cs[n, n], and the mean and covariance propagation equations are :

E(s[n]) = A E(s[n − 1]) ,   C[n] = A C[n − 1] A^T + B Q B^T

(Kalman Filters) Estimation Theory Spring 2013 143 / 152

Page 144: Estimation Theory

Scalar Kalman Filter

Data model :

State equation s[n] = as[n− 1] + u[n]

Observation equation x [n] = s[n] + w [n]

Assumptions :

u[n] is zero mean Gaussian noise with independent samples and E(u²[n]) = σ²_u.

w[n] is zero mean Gaussian noise with independent samples and E(w²[n]) = σ²_n (time-varying variance).

s[−1] ∼ N(µs, σ²_s) (for simplicity we suppose that µs = 0).

s[−1], u[n] and w[n] are independent.

Objective : Develop a sequential MMSE estimator to estimate s[n] based on the data X[n] = [x[0], x[1], . . . , x[n]]. This estimator is the mean of the posterior PDF :

ŝ[n|n] = E(s[n] | x[0], x[1], . . . , x[n])

(Kalman Filters) Estimation Theory Spring 2013 144 / 152

Page 145: Estimation Theory

Scalar Kalman Filter

To develop the equations of the Kalman filter we need the following properties :

Property 1 : For two uncorrelated jointly Gaussian data vectors, the MMSE estimator θ̂ (if θ is zero mean) is given by :

θ̂ = E(θ|x1, x2) = E(θ|x1) + E(θ|x2)

Property 2 : If θ = θ1 + θ2, then the MMSE estimator of θ is :

θ̂ = E(θ|x) = E(θ1 + θ2|x) = E(θ1|x) + E(θ2|x)

Basic Idea : Generate the innovation x̃[n] = x[n] − x̂[n|n − 1], which is uncorrelated with the previous samples X[n − 1]. Then use x̃[n] instead of x[n] for estimation (X[n] is equivalent to [X[n − 1], x̃[n]]).

(Kalman Filters) Estimation Theory Spring 2013 145 / 152

Page 146: Estimation Theory

Scalar Kalman Filter

From Property 1, we have :

ŝ[n|n] = E(s[n] | X[n − 1], x̃[n]) = E(s[n] | X[n − 1]) + E(s[n] | x̃[n])
       = ŝ[n|n − 1] + E(s[n] | x̃[n])

From Property 2, we have :

ŝ[n|n − 1] = E(a s[n − 1] + u[n] | X[n − 1])
           = a E(s[n − 1] | X[n − 1]) + E(u[n] | X[n − 1])
           = a ŝ[n − 1|n − 1]

The MMSE estimator of s[n] based on x̃[n] is :

E(s[n] | x̃[n]) = (E(s[n] x̃[n]) / E(x̃²[n])) x̃[n] = K[n](x[n] − x̂[n|n − 1])

where x̂[n|n − 1] = ŝ[n|n − 1] + ŵ[n|n − 1] = ŝ[n|n − 1]. Finally we have :

ŝ[n|n] = a ŝ[n − 1|n − 1] + K[n](x[n] − ŝ[n|n − 1])

(Kalman Filters) Estimation Theory Spring 2013 146 / 152

Page 147: Estimation Theory

Scalar State – Scalar Observation Kalman Filter

State Model : s[n] = a s[n − 1] + u[n]
Observation Model : x[n] = s[n] + w[n]

Prediction :
ŝ[n|n − 1] = a ŝ[n − 1|n − 1]

Minimum Prediction MSE :
M[n|n − 1] = a² M[n − 1|n − 1] + σ²_u

Kalman Gain :
K[n] = M[n|n − 1] / (σ²_n + M[n|n − 1])

Correction :
ŝ[n|n] = ŝ[n|n − 1] + K[n](x[n] − ŝ[n|n − 1])

Minimum MSE :
M[n|n] = (1 − K[n]) M[n|n − 1]
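A minimal implementation of these five recursions (assumed values for a, σ²_u, σ²_n and the prior), initialized from the prior as in the data model :

    import numpy as np

    # Sketch (assumed values): scalar Kalman filter for s[n] = a s[n-1] + u[n],
    # x[n] = s[n] + w[n].
    rng = np.random.default_rng(10)
    a, var_u, var_w, var_s, N = 0.95, 0.1, 1.0, 1.0, 200

    s = np.zeros(N)
    x = np.zeros(N)
    s_prev = rng.normal(0.0, np.sqrt(var_s))        # s[-1], mu_s = 0
    for n in range(N):
        s[n] = a * s_prev + rng.normal(0, np.sqrt(var_u))
        x[n] = s[n] + rng.normal(0, np.sqrt(var_w))
        s_prev = s[n]

    s_hat, M = 0.0, var_s                            # s_hat[-1|-1], M[-1|-1]
    est = np.zeros(N)
    for n in range(N):
        s_pred = a * s_hat                           # prediction
        M_pred = a**2 * M + var_u                    # minimum prediction MSE
        K = M_pred / (var_w + M_pred)                # Kalman gain
        s_hat = s_pred + K * (x[n] - s_pred)         # correction
        M = (1 - K) * M_pred                         # minimum MSE
        est[n] = s_hat
    print(np.mean((s - est)**2), M)                  # empirical MSE vs. steady-state M[n|n]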

(Kalman Filters) Estimation Theory Spring 2013 147 / 152

Page 148: Estimation Theory

Properties of Kalman Filter

The Kalman filter is an extension of the sequential MMSE estimator, applied to time-varying parameters that are represented by a dynamical model.

No matrix inversion is required.

The Kalman filter is a time-varying linear filter.

The Kalman filter computes (and uses) its own performance measure, the Bayesian MSE M[n|n].

The prediction stage increases the error, while the correction stage decreases it.

The Kalman filter generates an uncorrelated sequence from the observations, so it can be viewed as a whitening filter.

The Kalman filter is optimal in the Gaussian case and is the optimal LMMSE estimator if the Gaussian assumption is not valid.

In the KF, M[n|n], M[n|n − 1] and K[n] do not depend on the measurements and can be computed off line. At steady state (n → ∞) the Kalman filter becomes an LTI filter (M[n|n], M[n|n − 1] and K[n] become constant).

(Kalman Filters) Estimation Theory Spring 2013 148 / 152

Page 149: Estimation Theory

Vector State – Scalar Observation Kalman Filter

State Model : s[n] = As[n − 1] + Bu[n] with u[n] ∼ N(0, Q)
Observation Model : x[n] = h^T[n] s[n] + w[n] with s[−1] ∼ N(µs, Cs)

Prediction :
ŝ[n|n − 1] = A ŝ[n − 1|n − 1]

Minimum Prediction MSE (p × p) :
M[n|n − 1] = A M[n − 1|n − 1] A^T + B Q B^T

Kalman Gain (p × 1) :
K[n] = M[n|n − 1] h[n] / (σ²_n + h^T[n] M[n|n − 1] h[n])

Correction :
ŝ[n|n] = ŝ[n|n − 1] + K[n](x[n] − h^T[n] ŝ[n|n − 1])

Minimum MSE Matrix (p × p) :
M[n|n] = (I − K[n] h^T[n]) M[n|n − 1]

(Kalman Filters) Estimation Theory Spring 2013 149 / 152

Page 150: Estimation Theory

Vector State – Vector Observation Kalman Filter

State Model : s[n] = As[n − 1] + Bu[n] with u[n] ∼ N(0, Q)
Observation Model : x[n] = H[n] s[n] + w[n] with w[n] ∼ N(0, C[n])

Prediction :
ŝ[n|n − 1] = A ŝ[n − 1|n − 1]

Minimum Prediction MSE (p × p) :
M[n|n − 1] = A M[n − 1|n − 1] A^T + B Q B^T

Kalman Gain Matrix (p × M) : (needs matrix inversion !)
K[n] = M[n|n − 1] H^T[n] (C[n] + H[n] M[n|n − 1] H^T[n])^{-1}

Correction :
ŝ[n|n] = ŝ[n|n − 1] + K[n](x[n] − H[n] ŝ[n|n − 1])

Minimum MSE Matrix (p × p) :
M[n|n] = (I − K[n] H[n]) M[n|n − 1]
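A compact sketch of these vector recursions, with an assumed 2-dimensional state (position, velocity) and a noisy scalar observation of the position written in vector form :

    import numpy as np

    # Sketch (assumed values): vector-state Kalman filter with time-invariant H and C.
    rng = np.random.default_rng(11)
    A = np.array([[0.99, 1.0], [0.0, 0.9]])          # eigenvalues inside the unit circle
    B = np.eye(2)
    Q = np.diag([0.01, 0.01])
    H = np.array([[1.0, 0.0]])                       # observe the first state only
    C = np.array([[1.0]])
    N = 100

    s = rng.multivariate_normal(np.zeros(2), np.eye(2))   # s[-1] ~ N(0, Cs)
    s_hat, M = np.zeros(2), np.eye(2)                     # initialized from the prior
    for n in range(N):
        s = A @ s + B @ rng.multivariate_normal(np.zeros(2), Q)
        x = H @ s + rng.multivariate_normal(np.zeros(1), C)

        s_pred = A @ s_hat                                # prediction
        M_pred = A @ M @ A.T + B @ Q @ B.T                # minimum prediction MSE
        K = M_pred @ H.T @ np.linalg.inv(C + H @ M_pred @ H.T)   # gain (matrix inversion)
        s_hat = s_pred + K @ (x - H @ s_pred)             # correction
        M = (np.eye(2) - K @ H) @ M_pred                  # minimum MSE matrix
    print(s, s_hat, np.diag(M))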

(Kalman Filters) Estimation Theory Spring 2013 150 / 152

Page 151: Estimation Theory

Extended Kalman Filter

The extended Kalman filter is a sub-optimal approach used when the state and the observation equations are nonlinear. Nonlinear data model : Consider the following nonlinear data model

s[n] = f(s[n − 1]) + Bu[n]

x[n] = g(s[n]) + w[n]

where f(·) and g(·) are nonlinear (vector-to-vector) functions. Model linearization : We linearize the model around the estimated value of s using a first-order Taylor series.

f(s[n − 1]) ≈ f(ŝ[n − 1|n − 1]) + (∂f/∂s[n − 1]) (s[n − 1] − ŝ[n − 1|n − 1])

g(s[n]) ≈ g(ŝ[n|n − 1]) + (∂g/∂s[n]) (s[n] − ŝ[n|n − 1])

We denote the Jacobians by :

A[n − 1] = ∂f/∂s[n − 1] evaluated at ŝ[n − 1|n − 1],    H[n] = ∂g/∂s[n] evaluated at ŝ[n|n − 1]

(Kalman Filters) Estimation Theory Spring 2013 151 / 152

Page 152: Estimation Theory

Extended Kalman Filter

Now we use the linearized models :

s[n] = A[n − 1] s[n − 1] + B u[n] + f(ŝ[n − 1|n − 1]) − A[n − 1] ŝ[n − 1|n − 1]

x[n] = H[n] s[n] + w[n] + g(ŝ[n|n − 1]) − H[n] ŝ[n|n − 1]

There are two differences with the standard Kalman filter :

1. A is now time-varying.
2. Both equations have known terms added to them. This does not change the derivation of the Kalman filter.

The extended Kalman filter has exactly the same equations, where A is replaced with A[n − 1].

A[n − 1] and H[n] should be computed at each sampling time.

MSE matrices and Kalman gain can no longer be computed off line.

(Kalman Filters) Estimation Theory Spring 2013 152 / 152