Estimation Theory


Transcript of Estimation Theory

  • Estimation Theory

    Alireza Karimi

    Laboratoire d'Automatique, MEC2 397, email: alireza.karimi@epfl.ch

    Spring 2013

    (Introduction) Estimation Theory Spring 2013 1 / 152

  • Course Objective

    Extract information from noisy signals

    Parameter Estimation Problem : Given a set of measured data

    {x[0], x[1], . . . , x[N−1]}

    which depends on an unknown parameter vector θ, determine an estimator

    θ̂ = g(x[0], x[1], . . . , x[N−1])

    where g is some function.

    Applications : Image processing, communications, biomedicine, system identification, state estimation in control, etc.

    (Introduction) Estimation Theory Spring 2013 2 / 152

  • Some Examples

    Range estimation : We transmit a pulse that is reflected by the aircraft. An echo is received after τ seconds. The range is estimated from the equation R = cτ/2, where c is the speed of light.

    System identification : The input of a plant is excited with u and the output signal y is measured. If we have :

    y[k] = G(q⁻¹, θ)u[k] + n[k]

    where n[k] is the measurement noise, the model parameters θ are estimated using u[k] and y[k] for k = 1, . . . , N.

    DC level in noise : Consider a set of data {x[0], x[1], . . . , x[N−1]} that can be modeled as :

    x[n] = A + w[n]

    where w[n] is some zero-mean noise process. A can be estimated using the measured data set.

    (Introduction) Estimation Theory Spring 2013 3 / 152

  • Outline

    Classical estimation (θ deterministic)

    Minimum Variance Unbiased Estimator (MVU)

    Cramer-Rao Lower Bound (CRLB)

    Best Linear Unbiased Estimator (BLUE)

    Maximum Likelihood Estimator (MLE)

    Least Squares Estimator (LSE)

    Bayesian estimation (θ stochastic)

    Minimum Mean Square Error Estimator (MMSE)

    Maximum A Posteriori Estimator (MAP)

    Linear MMSE Estimator

    Kalman Filter

    (Introduction) Estimation Theory Spring 2013 4 / 152

  • References

    Main reference :

    Fundamentals of Statistical Signal Processing: Estimation Theory

    by Steven M. KAY, Prentice-Hall, 1993 (available in Library de La Fontaine, RLC). We cover Chapters 1 to 14, skipping Chapter 5 and Chapter 9.

    Other references :

    Lessons in Estimation Theory for Signal Processing, Communications and Control. By Jerry M. Mendel, Prentice-Hall, 1995.

    Probability, Random Processes and Estimation Theory for Engineers. By Henry Stark and John W. Woods, Prentice-Hall, 1986.

    (Introduction) Estimation Theory Spring 2013 5 / 152

  • Review of Probability and Random Variables

    (Probability and Random Variables) Estimation Theory Spring 2013 6 / 152

  • Random Variables

    Random Variable : A rule X(ω) that assigns to every element ω of a sample space Ω a real value is called a RV. So X is not really a variable that varies randomly but a function whose domain is Ω and whose range is some subset of the real line.

    Example : Consider the experiment of flipping a coin twice. The sample space (the possible outcomes) is :

    Ω = {HH, HT, TH, TT}

    We can define a random variable X such that

    X(HH) = 1, X(HT) = 1.1, X(TH) = 1.6, X(TT) = 1.8

    Random variable X assigns to each event (e.g. E = {HT, TH}) a subset of the real line (in this case B = {1.1, 1.6}).

    (Probability and Random Variables) Estimation Theory Spring 2013 7 / 152

  • Probability Distribution Function

    For any element ω in Ω, the event {ω | X(ω) ≤ x} is an important event. The probability of this event

    Pr[{ω | X(ω) ≤ x}] = P_X(x)

    is called the probability distribution function of X.

    Example : For the random variable defined earlier, we have :

    P_X(1.5) = Pr[{ω | X(ω) ≤ 1.5}] = Pr[{HH, HT}] = 0.5

    P_X(x) can be computed for all x ∈ R. It is clear that 0 ≤ P_X(x) ≤ 1.

    Remark :

    For the same experiment (throwing a coin twice) we could define another random variable that would lead to a different P_X(x).

    In most engineering problems the sample space is a subset of the real line, so X(ω) = ω and P_X(x) is a continuous function of x.

    (Probability and Random Variables) Estimation Theory Spring 2013 8 / 152

  • Probability Density Function (PDF)

    The Probability Density Function, if it exists, is given by :

    p_X(x) = dP_X(x)/dx

    When we deal with a single random variable the subscripts are removed :

    p(x) = dP(x)/dx

    Properties :

    (i) ∫_{−∞}^{∞} p(x) dx = P(∞) − P(−∞) = 1

    (ii) Pr[{ω | X(ω) ≤ x}] = Pr[X ≤ x] = P(x) = ∫_{−∞}^{x} p(ξ) dξ

    (iii) Pr[x1 < X ≤ x2] = ∫_{x1}^{x2} p(x) dx

    (Probability and Random Variables) Estimation Theory Spring 2013 9 / 152

  • Gaussian Probability Density Function

    A random variable is distributed according to a Gaussian or normal

    distribution if the PDF is given by :

    p(x) = (1/√(2πσ²)) exp[−(x − μ)²/(2σ²)]

    The PDF has two parameters : μ, the mean, and σ², the variance.

    We write X ∼ N(μ, σ²) when the random variable X has a normal (Gaussian) distribution with mean μ and standard deviation σ. Small σ means small variability (uncertainty) and large σ means large variability.

    Remark : The Gaussian distribution is important because, according to the Central Limit Theorem, the sum of N independent RVs has a PDF that converges to a Gaussian distribution when N goes to infinity.

    (Probability and Random Variables) Estimation Theory Spring 2013 10 / 152

  • Some other common PDF

    Uniform (b > a) : p(x) = 1/(b − a) for a < x < b, and 0 otherwise

    Exponential (λ > 0) : p(x) = (1/λ) exp(−x/λ) u(x)

    Rayleigh (σ > 0) : p(x) = (x/σ²) exp(−x²/(2σ²)) u(x)

    Chi-square χ²_n : p(x) = [1/(2^{n/2} Γ(n/2))] x^{n/2−1} exp(−x/2) for x > 0, and 0 for x < 0

    where Γ(z) = ∫_0^∞ t^{z−1} e^{−t} dt and u(x) is the unit step function.

    (Probability and Random Variables) Estimation Theory Spring 2013 11 / 152

  • Joint, Marginal and Conditional PDF

    Joint PDF : Consider two random variables X and Y, then :

    Pr[x1 < X ≤ x2 and y1 < Y ≤ y2] = ∫_{x1}^{x2} ∫_{y1}^{y2} p(x, y) dy dx

    Marginal PDF :

    p(x) = ∫ p(x, y) dy   and   p(y) = ∫ p(x, y) dx

    Conditional PDF : p(x|y) is defined as the PDF of X conditioned on knowing the value of Y.

    Bayes Formula : Consider two RVs defined on the same probability space, then we have :

    p(x, y) = p(x|y)p(y) = p(y|x)p(x)   or   p(x|y) = p(x, y)/p(y)

    (Probability and Random Variables) Estimation Theory Spring 2013 12 / 152

  • Independent Random Variables

    Two RVs X and Y are independent if and only if :

    p(x , y) = p(x)p(y)

    A direct conclusion is that :

    p(x|y) = p(x, y)/p(y) = p(x)p(y)/p(y) = p(x)   and   p(y|x) = p(y)

    which means conditioning does not change the PDF.

    Remark : For a joint Gaussian PDF the contours of constant density are ellipses centered at (μx, μy). For independent X and Y the major (or minor) axis is parallel to the x or y axis.

    (Probability and Random Variables) Estimation Theory Spring 2013 13 / 152

  • Expected Value of a Random Variable

    The expected value, if it exists, of a random variable X with PDF p(x) is defined by :

    E(X) = ∫ x p(x) dx

    Some properties of the expected value :

    E{X + Y} = E{X} + E{Y},   E{aX} = aE{X}

    The expected value of Y = g(X) can be computed by :

    E(Y) = ∫ g(x) p(x) dx

    Conditional expectation : The conditional expectation of X given that a specific value of Y has occurred is :

    E(X|Y) = ∫ x p(x|y) dx

    (Probability and Random Variables) Estimation Theory Spring 2013 14 / 152

  • Moments of a Random Variable

    The r-th moment of X is defined as :

    E(X^r) = ∫ x^r p(x) dx

    The first moment of X is its expected value or the mean (μ = E(X)).

    Moments of Gaussian RVs : A Gaussian RV with N(μ, σ²) has moments of all orders in closed form :

    E(X) = μ,   E(X²) = μ² + σ²

    E(X³) = μ³ + 3μσ²

    E(X⁴) = μ⁴ + 6μ²σ² + 3σ⁴

    E(X⁵) = μ⁵ + 10μ³σ² + 15μσ⁴

    E(X⁶) = μ⁶ + 15μ⁴σ² + 45μ²σ⁴ + 15σ⁶

    (Probability and Random Variables) Estimation Theory Spring 2013 15 / 152

  • Central Moments

    The r-th central moment of X is defined as :

    E[(X − μ)^r] = ∫ (x − μ)^r p(x) dx

    The second central moment (variance) is denoted by σ² or var(X).

    Central Moments of Gaussian RVs :

    E[(X − μ)^r] = 0 if r is odd, and σ^r (r − 1)!! if r is even

    where n!! denotes the double factorial, that is the product of every odd number from n down to 1.

    (Probability and Random Variables) Estimation Theory Spring 2013 16 / 152

  • Some properties of Gaussian RVs

    If X ∼ N(μx, σx²) then Z = (X − μx)/σx ∼ N(0, 1).

    If Z ∼ N(0, 1) then X = σx Z + μx ∼ N(μx, σx²).

    If X ∼ N(μx, σx²) then Z = aX + b ∼ N(aμx + b, a²σx²).

    If X ∼ N(μx, σx²) and Y ∼ N(μy, σy²) are two independent RVs, then

    aX + bY ∼ N(aμx + bμy, a²σx² + b²σy²)

    The sum of squares of n independent RVs with standard normal distribution N(0, 1) has a χ²_n distribution with n degrees of freedom. For large values of n, χ²_n converges to N(n, 2n).

    The Euclidian norm √(X² + Y²) of two independent RVs with standard normal distribution has the Rayleigh distribution.

    (Probability and Random Variables) Estimation Theory Spring 2013 17 / 152

  • Covariance

    For two RVs X and Y, the covariance is defined as

    σxy = E[(X − μx)(Y − μy)] = ∫∫ (x − μx)(y − μy) p(x, y) dx dy

    If X and Y are zero mean then σxy = E{XY}.

    var(X + Y) = σx² + σy² + 2σxy

    var(aX) = a²σx²

    Important formula : The relation between the variance and the mean of X is given by

    σ² = E[(X − μ)²] = E(X²) − 2μE(X) + μ² = E(X²) − μ²

    The variance is the mean of the square minus the square of the mean.

    (Probability and Random Variables) Estimation Theory Spring 2013 18 / 152

  • Independence, Uncorrelatedness and Orthogonality

    If σxy = 0, then X and Y are uncorrelated and

    E{XY} = E{X}E{Y}

    X and Y are called orthogonal if E{XY} = 0.

    If X and Y are independent then they are uncorrelated :

    p(x, y) = p(x)p(y)  ⇒  E{XY} = E{X}E{Y}

    Uncorrelatedness does not imply independence. For example, if X is a normal RV with zero mean and Y = X², we have p(y|x) ≠ p(y) but

    σxy = E{XY} − E{X}E{Y} = E{X³} − 0 = 0

    Correlation only captures the linear dependence between two RVs, so it is weaker than independence.

    For jointly Gaussian RVs, independence is equivalent to being uncorrelated.

    (Probability and Random Variables) Estimation Theory Spring 2013 19 / 152

  • Random Vectors

    Random Vector : is a vector of random variables 1 :

    x = [x1, x2, . . . , xn]^T

    Expectation Vector : μx = E(x) = [E(x1), E(x2), . . . , E(xn)]^T

    Covariance Matrix : Cx = E[(x − μx)(x − μx)^T]

    Cx is an n × n symmetric matrix which is assumed to be positive definite and so invertible.

    The elements of this matrix are : [Cx]_ij = E{[xi − E(xi)][xj − E(xj)]}. If the random variables are uncorrelated then Cx is a diagonal matrix.

    Multivariate Gaussian PDF :

    p(x) = [1/√((2π)^n det(Cx))] exp[−(1/2)(x − μx)^T Cx⁻¹ (x − μx)]

    1. In some books (including our main reference) there is no distinction between the random variable X and its specific value x. From now on we adopt the notation of our reference.

    (Probability and Random Variables) Estimation Theory Spring 2013 20 / 152

  • Random Processes

    Discrete Random Process : x[n] is a sequence of random variables defined for every integer n.

    Mean value : is defined as E(x[n]) = μx[n].

    Autocorrelation Function (ACF) : is defined as

    rxx[k, n] = E(x[n] x[n + k])

    Wide Sense Stationary (WSS) : x[n] is WSS if its mean and its autocorrelation function (ACF) do not depend on n.

    Autocovariance function : is defined as

    cxx[k] = E[(x[n] − μx)(x[n + k] − μx)] = rxx[k] − μx²

    Cross-correlation Function (CCF) : is defined as

    rxy[k] = E(x[n] y[n + k])

    Cross-covariance function : is defined as

    cxy[k] = E[(x[n] − μx)(y[n + k] − μy)] = rxy[k] − μxμy

    (Probability and Random Variables) Estimation Theory Spring 2013 21 / 152

  • Discrete White Noise

    Some properties of ACF and CCF :

    rxx[0] ≥ |rxx[k]|,   rxx[−k] = rxx[k],   rxy[−k] = ryx[k]

    Power Spectral Density : The Fourier transform of the ACF and CCF gives the Auto-PSD and Cross-PSD :

    Pxx(f) = Σ_{k=−∞}^{∞} rxx[k] exp(−j2πfk)

    Pxy(f) = Σ_{k=−∞}^{∞} rxy[k] exp(−j2πfk)

    Discrete White Noise : is a discrete random process with zero mean and rxx[k] = σ²δ[k], where δ[k] is the Kronecker impulse function. The PSD of white noise becomes Pxx(f) = σ² and is completely flat with frequency.

    (Probability and Random Variables) Estimation Theory Spring 2013 22 / 152

  • Introduction and Minimum Variance UnbiasedEstimation

    (Minimum Variance Unbiased Estimation) Estimation Theory Spring 2013 23 / 152

  • The Mathematical Estimation Problem

    Parameter Estimation Problem : Given a set of measured data

    x = {x[0], x[1], . . . , x[N−1]}

    which depends on an unknown parameter vector θ, determine an estimator

    θ̂ = g(x[0], x[1], . . . , x[N−1])

    where g is some function.

    The first step is to find the PDF of the data as a function of θ : p(x; θ)

    Example : Consider the problem of DC level in white Gaussian noise with one observed data sample x[0] = A + w[0], where w[0] has the PDF N(0, σ²). Then the PDF of x[0] is :

    p(x[0]; θ) = (1/√(2πσ²)) exp[−(x[0] − θ)²/(2σ²)]

    (Minimum Variance Unbiased Estimation) Estimation Theory Spring 2013 24 / 152

  • The Mathematical Estimation Problem

    Example : Consider a data sequence that can be modeled with a linear trend in white Gaussian noise :

    x[n] = A + Bn + w[n]   n = 0, 1, . . . , N−1

    Suppose that w[n] ∼ N(0, σ²) and is uncorrelated with all the other samples. Letting θ = [A B]^T and x = [x[0], x[1], . . . , x[N−1]]^T, the PDF is :

    p(x; θ) = Π_{n=0}^{N−1} p(x[n]; θ) = [1/(2πσ²)^{N/2}] exp[−(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A − Bn)²]

    The quality of any estimator for this problem is related to the assumptions on the data model — in this example, the linear trend and the WGN PDF assumption.

    (Minimum Variance Unbiased Estimation) Estimation Theory Spring 2013 25 / 152

  • The Mathematical Estimation Problem

    Classical versus Bayesian estimation

    If we assume θ is deterministic we have a classical estimation problem. The following methods will be studied : MVU, MLE, BLUE, LSE.

    If we assume θ is a random variable with a known PDF, then we have a Bayesian estimation problem. In this case the data are described by the joint PDF

    p(x, θ) = p(x|θ)p(θ)

    where p(θ) summarizes our knowledge about θ before any data is observed and p(x|θ) summarizes our knowledge provided by the data x conditioned on knowing θ. The following methods will be studied : MMSE, MAP, Kalman Filter.

    (Minimum Variance Unbiased Estimation) Estimation Theory Spring 2013 26 / 152

  • Assessing Estimator Performance

    Consider the problem of estimating a DC level A in uncorrelated noise :

    x[n] = A + w[n]   n = 0, 1, . . . , N−1

    Consider the following estimators :

    Â1 = (1/N) Σ_{n=0}^{N−1} x[n]

    Â2 = x[0]

    Suppose that A = 1, Â1 = 0.95 and Â2 = 0.98. Which estimator is better ?

    An estimator is a random variable, so its performance can only be described by its PDF or statistically (e.g. by Monte-Carlo simulation, as in the sketch below).

    (Minimum Variance Unbiased Estimation) Estimation Theory Spring 2013 27 / 152
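
    A minimal Monte-Carlo sketch (assuming Python with NumPy, and arbitrary values A = 1, σ = 1, N = 50) comparing the two estimators statistically rather than on a single realization:

```python
import numpy as np

rng = np.random.default_rng(0)
A, sigma, N, M = 1.0, 1.0, 50, 10000      # true DC level, noise std, samples, MC runs

x = A + sigma * rng.standard_normal((M, N))   # M independent realizations of x[0..N-1]
A1 = x.mean(axis=1)                            # sample-mean estimator for each run
A2 = x[:, 0]                                   # first-sample estimator for each run

# Both estimators are unbiased, but the sample mean has variance sigma^2/N
print("mean(A1) =", A1.mean(), " var(A1) =", A1.var(), " (theory:", sigma**2 / N, ")")
print("mean(A2) =", A2.mean(), " var(A2) =", A2.var(), " (theory:", sigma**2, ")")
```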

  • Unbiased Estimators

    An estimator that on the average yields the true value is unbiased. Mathematically :

    E(θ̂) − θ = 0   for a < θ < b

    Let's compute the expectation of the two estimators Â1 and Â2 :

    E(Â1) = (1/N) Σ_{n=0}^{N−1} E(x[n]) = (1/N) Σ_{n=0}^{N−1} E(A + w[n]) = (1/N) Σ_{n=0}^{N−1} (A + 0) = A

    E(Â2) = E(x[0]) = E(A + w[0]) = A + 0 = A

    Both estimators are unbiased. Which one is better ? Now, let's compute the variance of the two estimators :

    var(Â1) = var[(1/N) Σ_{n=0}^{N−1} x[n]] = (1/N²) Σ_{n=0}^{N−1} var(x[n]) = (1/N²) N σ² = σ²/N

    var(Â2) = var(x[0]) = σ² > var(Â1)

    (Minimum Variance Unbiased Estimation) Estimation Theory Spring 2013 28 / 152

  • Unbiased Estimators

    Remark : When several unbiased estimators of the same parameter from independent sets of data are available, i.e., θ̂1, θ̂2, . . . , θ̂n, a better estimator can be obtained by averaging :

    θ̂ = (1/n) Σ_{i=1}^{n} θ̂i   ⇒   E(θ̂) = θ

    Assuming that the estimators have the same variance, we have :

    var(θ̂) = (1/n²) Σ_{i=1}^{n} var(θ̂i) = (1/n²) n var(θ̂i) = var(θ̂i)/n

    By increasing n, the variance will decrease (if n → ∞, θ̂ → θ).

    It is not the case for biased estimators, no matter how many estimators are averaged.

    (Minimum Variance Unbiased Estimation) Estimation Theory Spring 2013 29 / 152

  • Minimum Variance Criterion

    The most logical criterion for estimation is the Mean Square Error (MSE) :

    mse(θ̂) = E[(θ̂ − θ)²]

    Unfortunately this criterion leads to unrealizable estimators (the estimator will depend on θ) :

    mse(θ̂) = E{[θ̂ − E(θ̂) + E(θ̂) − θ]²} = E{[θ̂ − E(θ̂) + b(θ)]²}

    where b(θ) = E(θ̂) − θ is defined as the bias of the estimator. Therefore :

    mse(θ̂) = E{[θ̂ − E(θ̂)]²} + 2b(θ)E[θ̂ − E(θ̂)] + b²(θ) = var(θ̂) + b²(θ)

    Instead of minimizing the MSE we can minimize the variance of the unbiased estimators :

    Minimum Variance Unbiased Estimator

    (Minimum Variance Unbiased Estimation) Estimation Theory Spring 2013 30 / 152

  • Minimum Variance Unbiased Estimator

    Existence of the MVU Estimator : In general the MVU estimator does not always exist. There may be no unbiased estimator, or none of the unbiased estimators has a uniformly minimum variance.

    Finding the MVU Estimator : There is no known procedure which always leads to the MVU estimator. Three existing approaches are :

    1 Determine the Cramer-Rao lower bound (CRLB) and check to see if some estimator satisfies it.

    2 Apply the Rao-Blackwell-Lehmann-Scheffe theorem (we will skip it).

    3 Restrict to linear unbiased estimators.

    (Minimum Variance Unbiased Estimation) Estimation Theory Spring 2013 31 / 152

  • Cramer-Rao Lower Bound

    (Cramer-Rao Lower Bound) Estimation Theory Spring 2013 32 / 152

  • Cramer-Rao Lower Bound

    The CRLB is a lower bound on the variance of any unbiased estimator :

    var(θ̂) ≥ CRLB(θ)

    Note that the CRLB is a function of θ.

    It tells us what is the best performance that can be achieved (useful in feasibility studies and comparison with other estimators).

    It may lead us to compute the MVU estimator.

    (Cramer-Rao Lower Bound) Estimation Theory Spring 2013 33 / 152

  • Cramer-Rao Lower Bound

    Theorem (scalar case)

    Assume that the PDF p(x; θ) satisfies the regularity condition

    E[∂ ln p(x; θ)/∂θ] = 0   for all θ

    Then the variance of any unbiased estimator θ̂ satisfies

    var(θ̂) ≥ [−E(∂² ln p(x; θ)/∂θ²)]⁻¹

    An unbiased estimator that attains the CRLB can be found iff :

    ∂ ln p(x; θ)/∂θ = I(θ)(g(x) − θ)

    for some functions g(x) and I(θ). The estimator is θ̂ = g(x) and the minimum variance is 1/I(θ).

    (Cramer-Rao Lower Bound) Estimation Theory Spring 2013 34 / 152

  • Cramer-Rao Lower Bound

    Example : Consider x[0] = A + w[0] with w[0] ∼ N(0, σ²).

    p(x[0]; A) = (1/√(2πσ²)) exp[−(x[0] − A)²/(2σ²)]

    ln p(x[0]; A) = −ln√(2πσ²) − (x[0] − A)²/(2σ²)

    Then

    ∂ ln p(x[0]; A)/∂A = (x[0] − A)/σ²

    ∂² ln p(x[0]; A)/∂A² = −1/σ²

    According to the Theorem :

    var(Â) ≥ σ²   and   I(A) = 1/σ²   and   Â = g(x[0]) = x[0]

    (Cramer-Rao Lower Bound) Estimation Theory Spring 2013 35 / 152

  • Cramer-Rao Lower Bound

    Example : Consider multiple observations for DC level in WGN :

    x[n] = A + w[n]   n = 0, 1, . . . , N−1   with   w[n] ∼ N(0, σ²)

    p(x; A) = [1/(2πσ²)^{N/2}] exp[−(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)²]

    Then

    ∂ ln p(x; A)/∂A = ∂/∂A [−ln(2πσ²)^{N/2} − (1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)²]

    = (1/σ²) Σ_{n=0}^{N−1} (x[n] − A) = (N/σ²) [(1/N) Σ_{n=0}^{N−1} x[n] − A]

    According to the Theorem :

    var(Â) ≥ σ²/N   and   I(A) = N/σ²   and   Â = g(x) = (1/N) Σ_{n=0}^{N−1} x[n]

    (Cramer-Rao Lower Bound) Estimation Theory Spring 2013 36 / 152

  • Transformation of Parameters

    If it is desired to estimate α = g(θ), then the CRLB is :

    var(α̂) ≥ [−E(∂² ln p(x; θ)/∂θ²)]⁻¹ (∂g/∂θ)²

    Example : Compute the CRLB for estimation of the power (A²) of a DC level in noise :

    var(Â²) ≥ (σ²/N)(2A)² = 4A²σ²/N

    Definition

    Efficient estimator : An unbiased estimator that attains the CRLB is said to be efficient.

    Example : Knowing that x̄ = (1/N) Σ_{n=0}^{N−1} x[n] is an efficient estimator for A, is x̄² an efficient estimator for A² ?

    (Cramer-Rao Lower Bound) Estimation Theory Spring 2013 37 / 152

  • Transformation of Parameters

    Solution : Knowing that x̄ ∼ N(A, σ²/N), we have :

    E(x̄²) = E²(x̄) + var(x̄) = A² + σ²/N ≠ A²

    So the estimator Â² = x̄² is not even unbiased. Let's look at the variance of this estimator :

    var(x̄²) = E(x̄⁴) − E²(x̄²)

    but we have from the moments of Gaussian RVs (slide 15) :

    E(x̄⁴) = A⁴ + 6A²(σ²/N) + 3(σ²/N)²

    Therefore :

    var(x̄²) = A⁴ + 6A²(σ²/N) + 3σ⁴/N² − (A² + σ²/N)² = 4A²σ²/N + 2σ⁴/N²

    (Cramer-Rao Lower Bound) Estimation Theory Spring 2013 38 / 152

  • Transformation of Parameters

    Remarks :

    The estimator Â² = x̄² is biased and not efficient.

    As N → ∞ the bias goes to zero and the variance of the estimator approaches the CRLB. This type of estimator is called asymptotically efficient.

    General Remarks :

    If g(θ) = aθ + b is an affine function of θ, then g(θ̂) is an efficient estimator of g(θ). First, it is unbiased : E(aθ̂ + b) = aθ + b = g(θ); moreover :

    var(g(θ̂)) ≥ (∂g/∂θ)² var(θ̂) = a² var(θ̂)

    but var(g(θ̂)) = var(aθ̂ + b) = a² var(θ̂), so that the CRLB is achieved.

    If g(θ) is a nonlinear function of θ and θ̂ is an efficient estimator, then g(θ̂) is an asymptotically efficient estimator.

    (Cramer-Rao Lower Bound) Estimation Theory Spring 2013 39 / 152

  • Cramer-Rao Lower Bound

    Theorem (Vector Parameter)

    Assume that the PDF p(x; θ) satisfies the regularity condition

    E[∂ ln p(x; θ)/∂θ] = 0   for all θ

    Then the covariance matrix of any unbiased estimator θ̂ satisfies C_θ̂ − I⁻¹(θ) ≥ 0, where ≥ 0 means that the matrix is positive semidefinite. I(θ) is called the Fisher information matrix and is given by :

    [I(θ)]_ij = −E(∂² ln p(x; θ)/∂θi ∂θj)

    An unbiased estimator that attains the CRLB can be found iff :

    ∂ ln p(x; θ)/∂θ = I(θ)(g(x) − θ)

    (Cramer-Rao Lower Bound) Estimation Theory Spring 2013 40 / 152

  • CRLB Extension to Vector Parameter

    Example : Consider a DC level in WGN with A and σ² unknown. Compute the CRLB for estimation of θ = [A σ²]^T.

    ln p(x; θ) = −(N/2) ln 2π − (N/2) ln σ² − (1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)²

    The Fisher information matrix is :

    I(θ) = −E [ ∂² ln p(x;θ)/∂A²       ∂² ln p(x;θ)/∂A∂σ²
                ∂² ln p(x;θ)/∂σ²∂A    ∂² ln p(x;θ)/∂(σ²)² ]
         = [ N/σ²   0
             0      N/(2σ⁴) ]

    The matrix is diagonal (just for this example) and can be easily inverted to yield :

    var(Â) ≥ σ²/N      var(σ̂²) ≥ 2σ⁴/N

    Is there any unbiased estimator that achieves these bounds ?

    (Cramer-Rao Lower Bound) Estimation Theory Spring 2013 41 / 152

  • Transformation of Parameters

    If it is desired to estimate α = g(θ), and the CRLB for the covariance of θ̂ is I⁻¹(θ), then :

    C_α̂ ≥ (∂g/∂θ) [−E(∂² ln p(x; θ)/∂θ²)]⁻¹ (∂g/∂θ)^T

    Example : Consider a DC level in WGN with A and σ² unknown. Compute the CRLB for estimation of the signal to noise ratio α = A²/σ².

    We have θ = [A σ²]^T and α = g(θ) = θ1²/θ2, then the Jacobian is :

    ∂g(θ)/∂θ = [∂g(θ)/∂θ1   ∂g(θ)/∂θ2] = [2A/σ²   −A²/σ⁴]

    So the CRLB is :

    var(α̂) ≥ [2A/σ²  −A²/σ⁴] [ N/σ²  0 ; 0  N/(2σ⁴) ]⁻¹ [2A/σ²  −A²/σ⁴]^T = (4α + 2α²)/N

    (Cramer-Rao Lower Bound) Estimation Theory Spring 2013 42 / 152

  • Linear Models with WGN

    If the N point samples of observed data can be modeled as

    x = Hθ + w

    where

    x = N × 1 observation vector
    H = N × p observation matrix (known, rank p)
    θ = p × 1 vector of parameters to be estimated
    w = N × 1 noise vector with PDF N(0, σ²I)

    compute the CRLB and the MVU estimator that achieves this bound.

    Step 1 : Compute ln p(x; θ).

    Step 2 : Compute I(θ) = −E[∂² ln p(x; θ)/∂θ²] and the covariance matrix of θ̂ : C_θ̂ = I⁻¹(θ).

    Step 3 : Find the MVU estimator g(x) by factoring

    ∂ ln p(x; θ)/∂θ = I(θ)[g(x) − θ]

    (Linear Models) Estimation Theory Spring 2013 43 / 152

  • Linear Models with WGN

    Step 1 : ln p(x; θ) = −(N/2) ln(2πσ²) − (1/(2σ²))(x − Hθ)^T(x − Hθ).

    Step 2 :

    ∂ ln p(x; θ)/∂θ = −(1/(2σ²)) ∂/∂θ [x^T x − 2x^T Hθ + θ^T H^T Hθ] = (1/σ²)[H^T x − H^T Hθ]

    Then I(θ) = −E[∂² ln p(x; θ)/∂θ²] = (1/σ²) H^T H

    Step 3 : Find the MVU estimator g(x) by factoring

    ∂ ln p(x; θ)/∂θ = I(θ)[g(x) − θ] = (H^T H/σ²)[(H^T H)⁻¹H^T x − θ]

    Therefore :

    θ̂ = g(x) = (H^T H)⁻¹H^T x      C_θ̂ = I⁻¹(θ) = σ²(H^T H)⁻¹

    (Linear Models) Estimation Theory Spring 2013 44 / 152

  • Linear Models with WGN

    For a linear model with WGN represented by x = Hθ + w, the MVU estimator is :

    θ̂ = (H^T H)⁻¹H^T x

    This estimator is efficient and attains the CRLB.

    That the estimator is unbiased can be seen easily by :

    E(θ̂) = (H^T H)⁻¹H^T E(Hθ + w) = θ

    The statistical performance of θ̂ is completely specified because θ̂ is a linear transformation of a Gaussian vector x and hence has a Gaussian distribution (a numerical sketch follows below) :

    θ̂ ∼ N(θ, σ²(H^T H)⁻¹)

    (Linear Models) Estimation Theory Spring 2013 45 / 152
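
    A minimal sketch of this MVU estimator (assuming Python with NumPy and an arbitrary 2-parameter linear-trend model x[n] = A + Bn + w[n]):

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma = 100, 0.5
theta_true = np.array([1.0, 0.2])             # arbitrary true [A, B]

n = np.arange(N)
H = np.column_stack([np.ones(N), n])          # N x 2 observation matrix
x = H @ theta_true + sigma * rng.standard_normal(N)

# MVU estimator for the linear model with WGN: theta_hat = (H^T H)^-1 H^T x
theta_hat = np.linalg.solve(H.T @ H, H.T @ x)
C_theta = sigma**2 * np.linalg.inv(H.T @ H)   # covariance of the estimate

print("theta_hat =", theta_hat)
print("std of estimates =", np.sqrt(np.diag(C_theta)))
```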

  • Example (Curve Fitting)

    Consider fitting the data x[n] by a p-th order polynomial function of n :

    x[n] = θ0 + θ1 n + θ2 n² + · · · + θp n^p + w[n]

    We have N data samples, then :

    x = [x[0], x[1], . . . , x[N−1]]^T
    w = [w[0], w[1], . . . , w[N−1]]^T
    θ = [θ0, θ1, . . . , θp]^T

    so x = Hθ + w, where H is the N × (p+1) matrix :

    H = [ 1   0      0       · · ·   0
          1   1      1       · · ·   1
          1   2      4       · · ·   2^p
          ⋮   ⋮      ⋮       ⋱       ⋮
          1   N−1   (N−1)²   · · ·   (N−1)^p ]

    Hence the MVU estimator is : θ̂ = (H^T H)⁻¹H^T x

    (Linear Models) Estimation Theory Spring 2013 46 / 152

  • Example (Fourier Analysis)

    Consider the Fourier analysis of the data x[n] :

    x[n] = Σ_{k=1}^{M} ak cos(2πkn/N) + Σ_{k=1}^{M} bk sin(2πkn/N) + w[n]

    so we have θ = [a1, a2, . . . , aM, b1, b2, . . . , bM]^T and x = Hθ + w where :

    H = [h_1^a, h_2^a, . . . , h_M^a, h_1^b, h_2^b, . . . , h_M^b]

    with

    h_k^a = [1, cos(2πk/N), cos(2πk·2/N), . . . , cos(2πk(N−1)/N)]^T

    h_k^b = [0, sin(2πk/N), sin(2πk·2/N), . . . , sin(2πk(N−1)/N)]^T

    Hence the MVU estimate of the Fourier coefficients is : θ̂ = (H^T H)⁻¹H^T x

    (Linear Models) Estimation Theory Spring 2013 47 / 152

  • Example (Fourier Analysis)

    After simplification (noting that (H^T H)⁻¹ = (2/N) I), we have :

    θ̂ = (2/N) [(h_1^a)^T x, . . . , (h_M^a)^T x, (h_1^b)^T x, . . . , (h_M^b)^T x]^T

    which is the same as the standard solution (see the sketch below) :

    âk = (2/N) Σ_{n=0}^{N−1} x[n] cos(2πkn/N),      b̂k = (2/N) Σ_{n=0}^{N−1} x[n] sin(2πkn/N)

    From the properties of linear models the estimates are unbiased.

    The covariance matrix is :

    C_θ̂ = σ²(H^T H)⁻¹ = (2σ²/N) I

    Note that θ̂ is Gaussian and C_θ̂ is diagonal (the amplitude estimates are independent).

    (Linear Models) Estimation Theory Spring 2013 48 / 152
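
    A small sketch (NumPy; arbitrary N = 64, M = 2 and arbitrary true coefficients) checking that the standard Fourier-coefficient formulas recover the amplitudes from noisy data:

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, sigma = 64, 2, 0.3
a_true, b_true = np.array([1.0, -0.5]), np.array([0.3, 0.8])

n = np.arange(N)
x = sum(a_true[k] * np.cos(2*np.pi*(k+1)*n/N) + b_true[k] * np.sin(2*np.pi*(k+1)*n/N)
        for k in range(M)) + sigma * rng.standard_normal(N)

# Standard solution: a_k = (2/N) sum x[n] cos(2 pi k n / N), b_k likewise with sin
a_hat = np.array([2/N * np.sum(x * np.cos(2*np.pi*(k+1)*n/N)) for k in range(M)])
b_hat = np.array([2/N * np.sum(x * np.sin(2*np.pi*(k+1)*n/N)) for k in range(M)])
print("a_hat =", a_hat, " b_hat =", b_hat)
```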

  • Example (System Identication)

    Consider identification of a Finite Impulse Response (FIR) model, h[k] for k = 0, 1, . . . , p−1, with input u[n] and output x[n] provided for n = 0, 1, . . . , N−1 :

    x[n] = Σ_{k=0}^{p−1} h[k] u[n − k] + w[n]   n = 0, 1, . . . , N−1

    The FIR model can be represented by the linear model x = Hθ + w where

    θ = [h[0], h[1], . . . , h[p−1]]^T   (p × 1)

    H = [ u[0]      0         · · ·   0
          u[1]      u[0]      · · ·   0
          ⋮         ⋮         ⋱       ⋮
          u[N−1]    u[N−2]    · · ·   u[N−p] ]   (N × p)

    The MVU estimate is θ̂ = (H^T H)⁻¹H^T x with C_θ̂ = σ²(H^T H)⁻¹ (a sketch is given below).

    (Linear Models) Estimation Theory Spring 2013 49 / 152
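
    A minimal FIR identification sketch (NumPy; arbitrary impulse response, white input and noise level):

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, sigma = 200, 4, 0.1
h_true = np.array([1.0, 0.5, -0.3, 0.1])          # arbitrary FIR coefficients
u = rng.standard_normal(N)                        # white input
w = sigma * rng.standard_normal(N)

# Build the N x p convolution matrix with H[n, k] = u[n-k] (u[m] = 0 for m < 0)
H = np.zeros((N, p))
for k in range(p):
    H[k:, k] = u[:N-k]
x = H @ h_true + w

h_hat = np.linalg.solve(H.T @ H, H.T @ x)         # MVU / LS estimate
print("h_hat =", h_hat)
```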

  • Linear Models with Colored Gaussian Noise

    Determine the MVU estimator for the linear model x = Hθ + w with w a colored Gaussian noise, w ∼ N(0, C).

    Whitening approach : Since C is positive definite, its inverse can be factored as C⁻¹ = D^T D where D is an invertible matrix. This matrix acts as a whitening transformation for w :

    E[(Dw)(Dw)^T] = E(Dww^T D^T) = DCD^T = DD⁻¹D⁻^T D^T = I

    Now if we transform the linear model x = Hθ + w to :

    x′ = Dx = DHθ + Dw = H′θ + w′

    where w′ = Dw ∼ N(0, I) is white, we can compute the MVU estimator as :

    θ̂ = (H′^T H′)⁻¹H′^T x′ = (H^T D^T DH)⁻¹H^T D^T Dx

    so, we have :

    θ̂ = (H^T C⁻¹H)⁻¹H^T C⁻¹x   with   C_θ̂ = (H′^T H′)⁻¹ = (H^T C⁻¹H)⁻¹

    (Linear Models) Estimation Theory Spring 2013 50 / 152

  • Linear Models with known components

    Consider a linear model x = Hθ + s + w, where s is a known signal. To determine the MVU estimator let x′ = x − s, so that x′ = Hθ + w is a standard linear model. The MVU estimator is :

    θ̂ = (H^T H)⁻¹H^T (x − s)   with   C_θ̂ = σ²(H^T H)⁻¹

    Example : Consider a DC level and exponential in WGN : x[n] = A + r^n + w[n], where r is known. Then we have :

    [x[0], x[1], . . . , x[N−1]]^T = [1, 1, . . . , 1]^T A + [1, r, . . . , r^{N−1}]^T + [w[0], w[1], . . . , w[N−1]]^T

    The MVU estimator is :

    Â = (H^T H)⁻¹H^T (x − s) = (1/N) Σ_{n=0}^{N−1} (x[n] − r^n)   with   var(Â) = σ²/N

    (Linear Models) Estimation Theory Spring 2013 51 / 152

  • Best Linear Unbiased Estimators (BLUE)

    Problems in finding the MVU estimator :

    The MVU estimator does not always exist or may be impossible to find.

    The PDF of the data may be unknown.

    BLUE is a suboptimal estimator that :

    restricts estimates to be linear in the data : θ̂ = Ax

    restricts estimates to be unbiased : E(θ̂) = AE(x) = θ

    minimizes the variance of the estimates ;

    needs only the mean and the variance of the data (not the PDF). As a result, in general, the PDF of the estimates cannot be computed.

    Remark : The unbiasedness restriction implies a linear model for the data. However, BLUE may still be used if the data are transformed suitably or the model is linearized.

    (Best Linear Unbiased Estimators) Estimation Theory Spring 2013 52 / 152

  • Finding the BLUE (Scalar Case)

    1 Choose a linear estimator for the observed data x[n], n = 0, 1, . . . , N−1 :

    θ̂ = Σ_{n=0}^{N−1} an x[n] = a^T x   where a = [a0, a1, . . . , a_{N−1}]^T

    2 Restrict the estimate to be unbiased :

    E(θ̂) = Σ_{n=0}^{N−1} an E(x[n]) = θ

    3 Minimize the variance

    var(θ̂) = E{[θ̂ − E(θ̂)]²} = E{[a^T x − a^T E(x)]²} = E{a^T [x − E(x)][x − E(x)]^T a} = a^T C a

    (Best Linear Unbiased Estimators) Estimation Theory Spring 2013 53 / 152

  • Finding the BLUE (Scalar Case)

    Consider the problem of amplitude estimation of known signals in noise :

    x[n] = θ s[n] + w[n]

    1 Choose a linear estimator : θ̂ = Σ_{n=0}^{N−1} an x[n] = a^T x

    2 Restrict the estimate to be unbiased : E(θ̂) = a^T E(x) = a^T s θ = θ, hence a^T s = 1, where s = [s[0], s[1], . . . , s[N−1]]^T

    3 Minimize a^T C a subject to a^T s = 1.

    The constrained optimization can be solved using Lagrange multipliers :

    Minimize J = a^T C a + λ(a^T s − 1)

    The optimal solution is :

    θ̂ = s^T C⁻¹x / (s^T C⁻¹s)   and   var(θ̂) = 1/(s^T C⁻¹s)

    (Best Linear Unbiased Estimators) Estimation Theory Spring 2013 54 / 152

  • Finding the BLUE (Vector Case)

    Theorem (Gauss–Markov)

    If the data are of the general linear model form

    x = Hθ + w

    where w is a noise vector with zero mean and covariance C (the PDF of w is arbitrary), then the BLUE of θ is :

    θ̂ = (H^T C⁻¹H)⁻¹H^T C⁻¹x

    and the covariance matrix of θ̂ is

    C_θ̂ = (H^T C⁻¹H)⁻¹

    Remark : If the noise is Gaussian then the BLUE is the MVU estimator.

    (Best Linear Unbiased Estimators) Estimation Theory Spring 2013 55 / 152

  • Finding the BLUE

    Example : Consider the problem of DC level in noise : x[n] = A + w[n], where w[n] is of unspecified PDF with var(w[n]) = σn². We have θ = A and H = 1 = [1, 1, . . . , 1]^T. The covariance matrix is :

    C = diag(σ0², σ1², . . . , σ²_{N−1})      C⁻¹ = diag(1/σ0², 1/σ1², . . . , 1/σ²_{N−1})

    and hence the BLUE is (see the sketch below) :

    Â = (H^T C⁻¹H)⁻¹H^T C⁻¹x = [Σ_{n=0}^{N−1} 1/σn²]⁻¹ Σ_{n=0}^{N−1} x[n]/σn²

    and the minimum covariance is :

    var(Â) = (H^T C⁻¹H)⁻¹ = [Σ_{n=0}^{N−1} 1/σn²]⁻¹

    (Best Linear Unbiased Estimators) Estimation Theory Spring 2013 56 / 152
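
    A minimal sketch of this BLUE (NumPy; the per-sample noise variances are arbitrary but assumed known):

```python
import numpy as np

rng = np.random.default_rng(4)
N, A = 100, 1.0
var_n = rng.uniform(0.1, 2.0, size=N)             # known but unequal noise variances
x = A + np.sqrt(var_n) * rng.standard_normal(N)

# BLUE: weighted average with weights 1/sigma_n^2
w = 1.0 / var_n
A_blue = np.sum(w * x) / np.sum(w)
var_blue = 1.0 / np.sum(w)                        # minimum variance of the BLUE
print("A_blue =", A_blue, " var(A_blue) =", var_blue)
```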

  • Maximum Likelihood Estimation

    (Maximum Likelihood Estimation) Estimation Theory Spring 2013 57 / 152

  • Maximum Likelihood Estimation

    Problems : The MVU estimator often does not exist or cannot be found. The BLUE is restricted to linear models.

    Maximum Likelihood Estimator (MLE) :

    can always be applied if the PDF is known ;

    is optimal for large data size ;

    is computationally complex and may require numerical methods.

    Basic Idea : Choose the parameter value that makes the observed data the most likely data to have been observed.

    Likelihood Function : is the PDF p(x; θ) when θ is regarded as a variable (not a parameter).

    ML Estimate : is the value of θ that maximizes the likelihood function.

    Procedure : Find the log-likelihood function ln p(x; θ) ; differentiate w.r.t. θ, set to zero and solve for θ̂.

    (Maximum Likelihood Estimation) Estimation Theory Spring 2013 58 / 152

  • Maximum Likelihood Estimation

    Example : Consider DC level in WGN with unknown variance, x[n] = A + w[n]. Suppose that A > 0 and σ² = A. The PDF is :

    p(x; A) = [1/(2πA)^{N/2}] exp[−(1/(2A)) Σ_{n=0}^{N−1} (x[n] − A)²]

    Taking the derivative of the log-likelihood function, we have :

    ∂ ln p(x; A)/∂A = −N/(2A) + (1/A) Σ_{n=0}^{N−1} (x[n] − A) + (1/(2A²)) Σ_{n=0}^{N−1} (x[n] − A)²

    What is the CRLB ? Does an MVU estimator exist ?

    The MLE can be found by setting the above equation to zero (a numerical check is sketched below) :

    Â = −1/2 + √[(1/N) Σ_{n=0}^{N−1} x²[n] + 1/4]

    (Maximum Likelihood Estimation) Estimation Theory Spring 2013 59 / 152
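
    A quick numerical check of this MLE (NumPy; the closed form is compared with a brute-force maximization of the log-likelihood over a grid, with arbitrary A = 2 and N = 1000):

```python
import numpy as np

rng = np.random.default_rng(5)
A_true, N = 2.0, 1000
x = A_true + np.sqrt(A_true) * rng.standard_normal(N)    # sigma^2 = A

# Closed-form MLE: A_hat = -1/2 + sqrt(mean(x^2) + 1/4)
A_closed = -0.5 + np.sqrt(np.mean(x**2) + 0.25)

# Brute-force check: maximize the log-likelihood over a grid of A > 0
A_grid = np.linspace(0.5, 4.0, 20000)
loglik = [-(N/2)*np.log(2*np.pi*a) - np.sum((x - a)**2)/(2*a) for a in A_grid]
A_grid_max = A_grid[np.argmax(loglik)]

print("closed form:", A_closed, " grid search:", A_grid_max)
```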

  • Stochastic Convergence

    Convergence in distribution : Let {p(x_N)} be a sequence of PDFs. If there exists a PDF p(x) such that

    lim_{N→∞} p(x_N) = p(x)

    at every point x at which p(x) is continuous, we say that p(x_N) converges in distribution to p(x). We also write x_N →d x.

    Example

    Consider N independent RVs x1, x2, . . . , xN with mean μ and finite variance σ². Let x̄_N = (1/N) Σ_{n=1}^{N} xn. Then according to the Central Limit Theorem (CLT), z_N = √N (x̄_N − μ)/σ converges in distribution to z ∼ N(0, 1).

    (Maximum Likelihood Estimation) Estimation Theory Spring 2013 60 / 152

  • Stochastic Convergence

    Convergence in probability : A sequence of random variables {x_N} converges in probability to the random variable x if for every ε > 0

    lim_{N→∞} Pr{|x_N − x| > ε} = 0

    Convergence with probability 1 (almost sure convergence) : A sequence of random variables {x_N} converges with probability 1 to the random variable x if and only if for all possible events

    Pr{ lim_{N→∞} x_N = x } = 1

    Example

    Consider N independent RVs x1, x2, . . . , xN with mean μ and finite variance σ². Let x̄_N = (1/N) Σ_{n=1}^{N} xn. Then according to the Law of Large Numbers (LLN), x̄_N converges in probability to μ.

    (Maximum Likelihood Estimation) Estimation Theory Spring 2013 61 / 152

  • Asymptotic Properties of Estimators

    Asymptotic Unbiasedness : Estimator θ̂_N is an asymptotically unbiased estimator of θ if :

    lim_{N→∞} E(θ̂_N) − θ = 0

    Asymptotic Distribution : It refers to p(x_n) as it evolves from n = 1, 2, . . ., especially for large values of n (it is not the ultimate form of the distribution, which may be degenerate).

    Asymptotic Variance : is not equal to lim_{N→∞} var(θ̂_N) (which is the limiting variance). It is defined as :

    asymptotic var(θ̂_N) = (1/N) lim_{N→∞} E{ N [θ̂_N − lim_{N→∞} E(θ̂_N)]² }

    Mean-Square Convergence : Estimator θ̂_N converges to θ in a mean-squared sense if :

    lim_{N→∞} E[(θ̂_N − θ)²] = 0

    (Maximum Likelihood Estimation) Estimation Theory Spring 2013 62 / 152

  • Consistency

    Estimator θ̂_N is a consistent estimator of θ if for every ε > 0

    plim(θ̂_N) = θ,  i.e.  lim_{N→∞} Pr[|θ̂_N − θ| > ε] = 0

    Remarks :

    If θ̂_N is asymptotically unbiased and its limiting variance is zero, then it converges to θ in mean-square.

    If θ̂_N converges to θ in mean-square, then the estimator is consistent.

    Asymptotic unbiasedness does not imply consistency and vice versa.

    plim can be treated as an operator, e.g. :

    plim(xy) = plim(x) plim(y) ;   plim(x/y) = plim(x)/plim(y)

    The importance of consistency is that any continuous function of a consistent estimator is itself a consistent estimator.

    (Maximum Likelihood Estimation) Estimation Theory Spring 2013 63 / 152

  • Maximum Likelihood Estimation

    Properties : The MLE may be biased and is not necessarily an efficient estimator. However :

    The MLE is a consistent estimator, meaning that

    lim_{N→∞} Pr[|θ̂ − θ| > ε] = 0

    The MLE asymptotically attains the CRLB (its asymptotic variance is equal to the CRLB).

    Under some regularity conditions, the MLE is asymptotically normally distributed

    θ̂ ∼a N(θ, I⁻¹(θ))

    even if the PDF of x is not Gaussian.

    If an MVU estimator exists, then the ML procedure will find it.

    (Maximum Likelihood Estimation) Estimation Theory Spring 2013 64 / 152

  • Maximum Likelihood Estimation

    Example

    Consider DC level in WGN with known variance σ².

    Sol. Then

    p(x; A) = [1/(2πσ²)^{N/2}] exp[−(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)²]

    ∂ ln p(x; A)/∂A = (1/σ²) Σ_{n=0}^{N−1} (x[n] − A) = 0   ⇒   Σ_{n=0}^{N−1} x[n] − NA = 0

    which leads to Â = x̄ = (1/N) Σ_{n=0}^{N−1} x[n]

    (Maximum Likelihood Estimation) Estimation Theory Spring 2013 65 / 152

  • MLE for Transformed Parameters (Invariance Property)

    Theorem (Invariance Property of the MLE)

    The MLE of the parameter α = g(θ), where the PDF p(x; θ) is parameterized by θ, is given by

    α̂ = g(θ̂)

    where θ̂ is the MLE of θ.

    It can be proved using the property of consistent estimators.

    If α = g(θ) is a one-to-one function, then

    α̂ = arg max_α p(x; g⁻¹(α)) = g(θ̂)

    If α = g(θ) is not a one-to-one function, then

    p_T(x; α) = max_{θ : α = g(θ)} p(x; θ)   and   α̂ = arg max_α p_T(x; α) = g(θ̂)

    (Maximum Likelihood Estimation) Estimation Theory Spring 2013 66 / 152

  • MLE for Transformed Parameters (Invariance Property)

    Example

    Consider DC level in WGN and find the MLE of α = exp(A). Since g(θ) is a one-to-one function, then :

    α̂ = arg max_α p_T(x; α) = arg max_α p(x; ln α) = exp(x̄)

    Example

    Consider DC level in WGN and find the MLE of α = A². Since g(θ) is not a one-to-one function, then :

    p_T(x; α) = max_{A=±√α, α≥0} {p(x; √α), p(x; −√α)}

    α̂ = arg max_{α≥0} p_T(x; α) = [arg max_A p(x; A)]² = x̄²
  • MLE (Extension to Vector Parameter)

    Example

    Consider DC level in WGN with unknown variance. The vector parameter θ = [A σ²]^T should be estimated. We have :

    ∂ ln p(x; θ)/∂A = (1/σ²) Σ_{n=0}^{N−1} (x[n] − A)

    ∂ ln p(x; θ)/∂σ² = −N/(2σ²) + (1/(2σ⁴)) Σ_{n=0}^{N−1} (x[n] − A)²

    which leads to the following MLE :

    Â = x̄      σ̂² = (1/N) Σ_{n=0}^{N−1} (x[n] − x̄)²

    (Maximum Likelihood Estimation) Estimation Theory Spring 2013 68 / 152

  • MLE for General Gaussian Case

    Consider the general Gaussian case where x ∼ N(μ(θ), C(θ)). The partial derivative of the log PDF is :

    ∂ ln p(x; θ)/∂θk = −(1/2) tr(C⁻¹(θ) ∂C(θ)/∂θk) + (∂μ(θ)/∂θk)^T C⁻¹(θ)(x − μ(θ)) − (1/2)(x − μ(θ))^T (∂C⁻¹(θ)/∂θk)(x − μ(θ))

    for k = 1, . . . , p.

    By setting the above equations equal to zero, the MLE can be found.

    A particular case is when C is known (the first and third terms become zero).

    In addition, if μ(θ) is linear in θ, the general linear model is obtained.

    (Maximum Likelihood Estimation) Estimation Theory Spring 2013 69 / 152

  • MLE for General Linear Models

    Consider the general linear model x = Hθ + w where w is a noise vector with PDF N(0, C) :

    p(x; θ) = [1/√((2π)^N det(C))] exp[−(1/2)(x − Hθ)^T C⁻¹(x − Hθ)]

    Taking the derivative of ln p(x; θ) leads to :

    ∂ ln p(x; θ)/∂θ = (∂(Hθ)^T/∂θ) C⁻¹(x − Hθ)

    Then

    H^T C⁻¹(x − Hθ) = 0   ⇒   θ̂ = (H^T C⁻¹H)⁻¹H^T C⁻¹x

    which is the same as the MVU estimator. The PDF of θ̂ is :

    θ̂ ∼ N(θ, (H^T C⁻¹H)⁻¹)

    (Maximum Likelihood Estimation) Estimation Theory Spring 2013 70 / 152

  • MLE (Numerical Method)

    Newton–Raphson : A closed-form estimator cannot always be computed by maximizing the likelihood function. However, the maximum can be computed by numerical methods like the iterative Newton–Raphson algorithm :

    θ_{k+1} = θ_k − [∂² ln p(x; θ)/∂θ ∂θ^T]⁻¹ ∂ ln p(x; θ)/∂θ |_{θ=θ_k}

    Remarks :

    The Hessian can be replaced by the negative of its expectation, the Fisher information matrix I(θ).

    This method suffers from convergence problems (local maxima).

    Typically, for large data length, the log-likelihood function becomes more quadratic and the algorithm will produce the MLE.

    (Maximum Likelihood Estimation) Estimation Theory Spring 2013 71 / 152

  • Least Squares Estimation

    (Least Squares) Estimation Theory Spring 2013 72 / 152

  • The Least Squares Approach

    In all the previous methods we assumed that the measured signal x[n] is the sum of a true signal s[n] and a measurement error w[n] with a known probabilistic model. In the least squares method

    x[n] = s[n, θ] + e[n]

    where e[n] represents the modeling and measurement errors. The objective is to minimize the LS cost :

    J(θ) = Σ_{n=0}^{N−1} (x[n] − s[n, θ])²

    We do not need a probabilistic assumption but only a deterministic signal model.

    It has a broader range of applications.

    No claim about optimality can be made.

    The statistical performance cannot be assessed.

    (Least Squares) Estimation Theory Spring 2013 73 / 152

  • The Least Squares Approach

    Example

    Estimate the DC level of a signal. We observe x[n] = A + ε[n] for n = 0, . . . , N−1 and the LS criterion is :

    J(A) = Σ_{n=0}^{N−1} (x[n] − A)²

    ∂J(A)/∂A = −2 Σ_{n=0}^{N−1} (x[n] − A) = 0   ⇒   Â = (1/N) Σ_{n=0}^{N−1} x[n]

    (Least Squares) Estimation Theory Spring 2013 74 / 152

  • Linear Least Squares

    Suppose that the observation model is linear, x = Hθ + ε, then

    J(θ) = Σ_{n=0}^{N−1} (x[n] − s[n, θ])² = (x − Hθ)^T(x − Hθ) = x^T x − 2x^T Hθ + θ^T H^T Hθ

    where H is full rank. The gradient is

    ∂J(θ)/∂θ = −2H^T x + 2H^T Hθ = 0   ⇒   θ̂ = (H^T H)⁻¹H^T x

    The minimum LS cost is :

    Jmin = (x − Hθ̂)^T(x − Hθ̂) = x^T [I − H(H^T H)⁻¹H^T] x = x^T(x − Hθ̂)

    where I − H(H^T H)⁻¹H^T is an idempotent matrix.

    (Least Squares) Estimation Theory Spring 2013 75 / 152

  • Comparing Dierent Estimators for the Linear Model

    Consider the following linear model :

    x = Hθ + w

    Estimator | Assumption | Estimate

    LSE   | No probabilistic assumption  | θ̂ = (H^T H)⁻¹H^T x
    BLUE  | w is white with unknown PDF  | θ̂ = (H^T H)⁻¹H^T x
    MLE   | w is white Gaussian noise    | θ̂ = (H^T H)⁻¹H^T x
    MVUE  | w is white Gaussian noise    | θ̂ = (H^T H)⁻¹H^T x

    For the MLE and MVUE the PDF of the estimate will be Gaussian.

    (Least Squares) Estimation Theory Spring 2013 76 / 152

  • Weighted Linear Least Squares

    The LS criterion can be modified by including a positive definite (symmetric) weighting matrix W :

    J(θ) = (x − Hθ)^T W (x − Hθ)

    That leads to the following estimator :

    θ̂ = (H^T WH)⁻¹H^T Wx

    and minimum LS cost :

    Jmin = x^T [W − WH(H^T WH)⁻¹H^T W] x

    Remark : If we take W = C⁻¹, where C is the covariance of the noise, then the weighted least squares estimator is the BLUE. However, there is no purely LS-based reason for this choice.

    (Least Squares) Estimation Theory Spring 2013 77 / 152

  • Geometrical Interpretation

    Recall the general signal model s = Hθ. If we denote the columns of H by hi, we have :

    s = [h1 h2 · · · hp] [θ1, θ2, . . . , θp]^T = Σ_{i=1}^{p} θi hi

    The signal model is a linear combination of the vectors {hi}.

    The LS minimizes the length of the error vector between the data and the signal model, ε = x − s :

    J(θ) = (x − Hθ)^T(x − Hθ) = ‖x − Hθ‖²

    The data vector can lie anywhere in R^N, while signal vectors must lie in a p-dimensional subspace of R^N, termed S^p, which is spanned by the columns of H.

    (Least Squares) Estimation Theory Spring 2013 78 / 152

  • Geometrical Interpretation (Orthogonal Projection)

    Intuitively, it is clear that the LS error is minimized when ŝ = Hθ̂ is the orthogonal projection of x onto S^p.

    So the LS error vector ε = x − ŝ is orthogonal to all columns of H :

    H^T ε = 0   ⇒   H^T(x − Hθ̂) = 0   ⇒   θ̂ = (H^T H)⁻¹H^T x

    The signal estimate is the projection of x onto S^p :

    ŝ = Hθ̂ = H(H^T H)⁻¹H^T x = Px

    where P is the orthogonal projection matrix.

    Note that if z ∈ Range(H), then Pz = z. Recall that Range(H) is the subspace spanned by the columns of H.

    Now, since Px ∈ S^p then P(Px) = Px. Therefore any projection matrix is idempotent, i.e. P² = P.

    It can be verified that P is symmetric and singular (with rank p).

    (Least Squares) Estimation Theory Spring 2013 79 / 152

  • Geometrical Interpretation (Orthonormal columns of H)

    Recall that

    H^T H = [ <h1,h1>  <h1,h2>  · · ·  <h1,hp>
              <h2,h1>  <h2,h2>  · · ·  <h2,hp>
              ⋮        ⋮        ⋱      ⋮
              <hp,h1>  <hp,h2>  · · ·  <hp,hp> ]

    If the columns of H are orthonormal then H^T H = I and θ̂ = H^T x.

    In this case we have θ̂i = hi^T x, thus

    ŝ = Hθ̂ = Σ_{i=1}^{p} θ̂i hi = Σ_{i=1}^{p} (hi^T x) hi

    If we increase the number of parameters (the order of the linear model), we can easily compute the new estimate.

    (Least Squares) Estimation Theory Spring 2013 80 / 152

  • Choosing the Model Order

    Suppose that you have a set of data and the objective is to fit a polynomial to the data. What is the best polynomial order ?

    Remarks :

    It is clear that by increasing the order, Jmin is monotonically non-increasing.

    By choosing p = N we can perfectly fit the data to the model. However, we fit the noise as well.

    We should choose the simplest model that adequately describes the data.

    We increase the order only if the cost reduction is significant.

    If we have an idea about the expected level of Jmin, we increase p to approximately attain this level.

    There is an order-recursive LS algorithm to efficiently compute a (p + 1)-th order model based on a p-th order one (see Section 8.6).

    (Least Squares) Estimation Theory Spring 2013 81 / 152

  • Sequential Least Squares

    Suppose that θ̂[N − 1], based on x[N − 1] = [x[0], . . . , x[N − 1]]^T, is available. If we get a new data sample x[N], we want to compute θ̂[N] as a function of θ̂[N − 1] and x[N].

    Example

    Consider the LS estimate of a DC level : Â[N − 1] = (1/N) Σ_{n=0}^{N−1} x[n]. We have :

    Â[N] = (1/(N+1)) Σ_{n=0}^{N} x[n] = (1/(N+1)) { N [(1/N) Σ_{n=0}^{N−1} x[n]] + x[N] }

         = (N/(N+1)) Â[N − 1] + (1/(N+1)) x[N]

         = Â[N − 1] + (1/(N+1)) (x[N] − Â[N − 1])

    new estimate = old estimate + gain × prediction error

    (Least Squares) Estimation Theory Spring 2013 82 / 152

  • Sequential Least Squares

    Example (DC level in uncorrelated noise)

    x[n] = A + w[n] and var(w[n]) = σn². The WLS estimate (or BLUE) is :

    Â[N − 1] = [Σ_{n=0}^{N−1} x[n]/σn²] / [Σ_{n=0}^{N−1} 1/σn²]

    Similar to the previous example we can obtain :

    Â[N] = Â[N − 1] + [ (1/σN²) / Σ_{n=0}^{N} 1/σn² ] (x[N] − Â[N − 1])

         = Â[N − 1] + K[N] (x[N] − Â[N − 1])

    The gain factor K[N] can be reformulated as :

    K[N] = var(Â[N − 1]) / (var(Â[N − 1]) + σN²)

    (Least Squares) Estimation Theory Spring 2013 83 / 152

  • Sequential Least Squares

    Consider the general linear model x = Hθ + w, where w is an uncorrelated noise with covariance matrix C. The BLUE (or WLS) is :

    θ̂ = (H^T C⁻¹H)⁻¹H^T C⁻¹x   and   C_θ̂ = (H^T C⁻¹H)⁻¹

    Let's define :

    C[n] = diag(σ0², σ1², . . . , σn²)

    H[n] = [ H[n − 1] ; h^T[n] ]   (H[n − 1] is n × p and h^T[n] is the new 1 × p row)

    x[n] = [x[0], x[1], . . . , x[n]]^T

    The objective is to find θ̂[n], based on n + 1 data samples, as a function of θ̂[n − 1] and the new data x[n]. The batch estimator is :

    θ̂[n − 1] = Σ[n − 1] H^T[n − 1] C⁻¹[n − 1] x[n − 1]

    with Σ[n − 1] = (H^T[n − 1] C⁻¹[n − 1] H[n − 1])⁻¹

    (Least Squares) Estimation Theory Spring 2013 84 / 152

  • Sequential Least Squares

    Estimator Update :

    θ̂[n] = θ̂[n − 1] + K[n](x[n] − h^T[n] θ̂[n − 1])

    where

    K[n] = Σ[n − 1] h[n] / (σn² + h^T[n] Σ[n − 1] h[n])

    Covariance Update :

    Σ[n] = (I − K[n] h^T[n]) Σ[n − 1]

    The following lemma is used to compute the updates :

    Matrix Inversion Lemma

    (A + BCD)⁻¹ = A⁻¹ − A⁻¹B[C⁻¹ + DA⁻¹B]⁻¹DA⁻¹

    with A = Σ⁻¹[n − 1], B = h[n], C = 1/σn², D = h^T[n].

    Initialization ? (A sequential LS sketch is given below.)

    (Least Squares) Estimation Theory Spring 2013 85 / 152
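
    A minimal sequential LS sketch for the scalar DC-level case (NumPy; h[n] = 1 for this model, and initializing from the first sample is an assumption made here so that the recursion reproduces the batch BLUE exactly):

```python
import numpy as np

rng = np.random.default_rng(6)
N, A = 200, 1.0
var_n = rng.uniform(0.2, 1.5, size=N)             # known noise variance per sample
x = A + np.sqrt(var_n) * rng.standard_normal(N)

A_hat = x[0]                                      # initialize with the first sample
Sigma = var_n[0]                                  # variance of the initial estimate
for n in range(1, N):
    K = Sigma / (var_n[n] + Sigma)                # gain
    A_hat = A_hat + K * (x[n] - A_hat)            # estimator update
    Sigma = (1 - K) * Sigma                       # covariance update

A_batch = np.sum(x / var_n) / np.sum(1 / var_n)   # batch BLUE for comparison
print("sequential:", A_hat, " batch:", A_batch)
```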

  • Constrained Least Squares

    Linear constraints in the form Aθ = b can be considered in the LS solution. The LS criterion becomes :

    Jc(θ) = (x − Hθ)^T(x − Hθ) + λ^T(Aθ − b)

    ∂Jc/∂θ = −2H^T x + 2H^T Hθ + A^T λ

    Setting the gradient equal to zero produces :

    θ̂c = (H^T H)⁻¹H^T x − (1/2)(H^T H)⁻¹A^T λ = θ̂ − (H^T H)⁻¹A^T (λ/2)

    where θ̂ = (H^T H)⁻¹H^T x is the unconstrained LSE. Now Aθ̂c = b can be solved to find λ :

    λ/2 = [A(H^T H)⁻¹A^T]⁻¹(Aθ̂ − b)

    (Least Squares) Estimation Theory Spring 2013 86 / 152
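
    A minimal constrained-LS sketch (NumPy; the 3-parameter model and the constraint θ1 + θ2 + θ3 = 1 are arbitrary assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
N, p = 50, 3
H = rng.standard_normal((N, p))
theta_true = np.array([0.5, 0.3, 0.2])            # satisfies the constraint
x = H @ theta_true + 0.1 * rng.standard_normal(N)

A = np.ones((1, p))                               # constraint A theta = b
b = np.array([1.0])

HtH_inv = np.linalg.inv(H.T @ H)
theta_u = HtH_inv @ H.T @ x                       # unconstrained LSE
# Constrained LSE: theta_c = theta_u - (H^T H)^-1 A^T [A (H^T H)^-1 A^T]^-1 (A theta_u - b)
lam_half = np.linalg.solve(A @ HtH_inv @ A.T, A @ theta_u - b)
theta_c = theta_u - HtH_inv @ A.T @ lam_half

print("unconstrained:", theta_u, " constrained:", theta_c, " A theta_c =", A @ theta_c)
```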

  • Nonlinear Least Squares

    Many applications have a nonlinear observation model : s(θ) ≠ Hθ. This leads to a nonlinear optimization problem that can be solved numerically.

    Newton-Raphson Method : Find a zero of the gradient of the criterion by linearizing the gradient around the current estimate.

    θ_{k+1} = θ_k − (∂g(θ)/∂θ)⁻¹ g(θ) |_{θ=θ_k}

    where g(θ) = (∂s^T(θ)/∂θ)[x − s(θ)] and

    [∂g(θ)/∂θ]_ij = Σ_{n=0}^{N−1} [(x[n] − s[n]) ∂²s[n]/∂θi∂θj − (∂s[n]/∂θj)(∂s[n]/∂θi)]

    Around the solution [x − s(θ)] is small, so the first term in the Jacobian can be neglected. This makes the method equivalent to the Gauss-Newton algorithm, which is numerically more robust.

    (Least Squares) Estimation Theory Spring 2013 87 / 152

  • Nonlinear Least Squares

    Gauss-Newton Method : Linearize the signal model around the current estimate and solve the resulting linear problem.

    s(θ) ≈ s(θ0) + ∂s(θ)/∂θ |_{θ=θ0} (θ − θ0) = s(θ0) + H(θ0)(θ − θ0)

    The solution to the linearized problem is :

    θ̂ = [H^T(θ0)H(θ0)]⁻¹H^T(θ0)[x − s(θ0) + H(θ0)θ0]

      = θ0 + [H^T(θ0)H(θ0)]⁻¹H^T(θ0)[x − s(θ0)]

    If we now iterate the solution, it becomes :

    θ_{k+1} = θ_k + [H^T(θ_k)H(θ_k)]⁻¹H^T(θ_k)[x − s(θ_k)]

    Remark : Both the Newton-Raphson and the Gauss-Newton methods can have convergence problems. (A Gauss-Newton sketch is given below.)

    (Least Squares) Estimation Theory Spring 2013 88 / 152
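
    A minimal Gauss-Newton sketch (NumPy) for an assumed scalar example, fitting the decay rate r of s[n] = r^n; here H(θ) is the N × 1 Jacobian ∂s/∂r and the starting point is chosen reasonably close to the true value, since a poor initial guess can make the iteration diverge:

```python
import numpy as np

rng = np.random.default_rng(8)
N, r_true = 50, 0.9
n = np.arange(N)
x = r_true**n + 0.05 * rng.standard_normal(N)

r = 0.8                                            # initial guess
for _ in range(20):
    s = r**n                                       # signal model s(r)
    Hr = n * r**np.maximum(n - 1, 0)               # Jacobian ds/dr (column vector)
    # Gauss-Newton update: r <- r + (H^T H)^-1 H^T (x - s)
    r = r + (Hr @ (x - s)) / (Hr @ Hr)

print("estimated r =", r)
```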

  • Nonlinear Least Squares

    Transformation of parameters : Transform into a linear problem by seeking an invertible function α = g(θ) such that :

    s(θ) = s(g⁻¹(α)) = Hα

    So the nonlinear LSE is θ̂ = g⁻¹(α̂) where α̂ = (H^T H)⁻¹H^T x.

    Example (Estimate the amplitude and phase of a sinusoidal signal)

    s[n] = A cos(2πf0 n + φ)   n = 0, 1, . . . , N−1

    The LS problem is nonlinear; however, we have :

    A cos(2πf0 n + φ) = A cos φ cos 2πf0 n − A sin φ sin 2πf0 n

    If we let α1 = A cos φ and α2 = −A sin φ, then the signal model becomes linear, s = Hα. The LSE is α̂ = (H^T H)⁻¹H^T x and g⁻¹(α̂) gives (see the sketch below)

    Â = √(α̂1² + α̂2²)   and   φ̂ = arctan(−α̂2/α̂1)

    (Least Squares) Estimation Theory Spring 2013 89 / 152
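
    A minimal sketch of this reparameterization (NumPy; A = 1.5, φ = 0.7 and f0 = 0.1 are arbitrary values):

```python
import numpy as np

rng = np.random.default_rng(9)
N, A_true, phi_true, f0 = 100, 1.5, 0.7, 0.1
n = np.arange(N)
x = A_true * np.cos(2*np.pi*f0*n + phi_true) + 0.1 * rng.standard_normal(N)

# Linear model in alpha1 = A cos(phi), alpha2 = -A sin(phi)
H = np.column_stack([np.cos(2*np.pi*f0*n), np.sin(2*np.pi*f0*n)])
alpha = np.linalg.solve(H.T @ H, H.T @ x)

A_hat = np.hypot(alpha[0], alpha[1])
phi_hat = np.arctan2(-alpha[1], alpha[0])          # inverse mapping g^{-1}(alpha)
print("A_hat =", A_hat, " phi_hat =", phi_hat)
```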

  • Nonlinear Least Squares

    Separability of parameters : If the signal model is linear in some of the parameters, try to write it as s = H(α)β. The LS error can then be minimized w.r.t. β :

    β̂ = [H^T(α)H(α)]⁻¹H^T(α)x

    The cost function becomes a function of α only, which can be minimized by a numerical method (e.g. brute force) :

    J(α, β̂) = x^T [I − H(α)[H^T(α)H(α)]⁻¹H^T(α)] x

    Then

    α̂ = arg max_α x^T [H(α)[H^T(α)H(α)]⁻¹H^T(α)] x

    Remark : This method is interesting if the dimension of α is much less than the dimension of β.

    (Least Squares) Estimation Theory Spring 2013 90 / 152

  • Nonlinear Least Squares

    Example (Damped Exponentials)

    Consider the following signal model :

    s[n] = A1 r^n + A2 r^{2n} + A3 r^{3n}   n = 0, 1, . . . , N−1

    with θ = [A1, A2, A3, r]^T. It is known that 0 < r < 1. Let's take β = [A1, A2, A3]^T so we get s = H(r)β with :

    H(r) = [ 1         1            1
             r         r²           r³
             ⋮         ⋮            ⋮
             r^{N−1}   r^{2(N−1)}   r^{3(N−1)} ]

    Step 1 : Maximize x^T [H(r)[H^T(r)H(r)]⁻¹H^T(r)] x to obtain r̂ (see the sketch below).

    Step 2 : Compute β̂ = [H^T(r̂)H(r̂)]⁻¹H^T(r̂)x.

    (Least Squares) Estimation Theory Spring 2013 91 / 152
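
    A minimal separable-LS sketch for the damped-exponential example (NumPy; the true values and the grid over r are arbitrary, and lstsq is used because H(r) can be poorly conditioned for small r):

```python
import numpy as np

rng = np.random.default_rng(10)
N = 60
r_true, A_true = 0.8, np.array([1.0, -0.6, 0.3])
n = np.arange(N)

def Hmat(r):
    # N x 3 matrix with columns r^n, r^(2n), r^(3n)
    return np.column_stack([r**n, r**(2*n), r**(3*n)])

x = Hmat(r_true) @ A_true + 0.05 * rng.standard_normal(N)

# Step 1: grid search over r maximizing x^T H (H^T H)^-1 H^T x
def score(r):
    H = Hmat(r)
    coef = np.linalg.lstsq(H, x, rcond=None)[0]
    return x @ (H @ coef)

r_grid = np.linspace(0.1, 0.95, 500)
r_hat = r_grid[np.argmax([score(r) for r in r_grid])]

# Step 2: linear LS for the amplitudes given r_hat
A_hat = np.linalg.lstsq(Hmat(r_hat), x, rcond=None)[0]
print("r_hat =", r_hat, " A_hat =", A_hat)
```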

  • The Bayesian Philosophy

    (The Bayesian Philosophy) Estimation Theory Spring 2013 92 / 152

  • Introduction

    Classical Approach :

    Assumes θ is unknown but deterministic.

    Some prior knowledge on θ cannot be used.

    Variance of the estimate may depend on θ.

    The MVUE may not exist.

    In Monte-Carlo simulations, we do M runs for each fixed θ and then compute the sample mean and variance for each θ (no averaging over θ).

    Bayesian Approach :

    Assumes θ is random with a known prior PDF, p(θ).

    We estimate a realization of θ based on the available data.

    Variance of the estimate does not depend on θ.

    A Bayesian estimate always exists.

    In Monte-Carlo simulations, we do M runs for randomly chosen θ and then we compute the sample mean and variance over all θ values.

    (The Bayesian Philosophy) Estimation Theory Spring 2013 93 / 152

  • Mean Square Error (MSE)

    Classical MSE

    The classical MSE is a function of the unknown parameter and cannot be used for constructing estimators :

    mse(θ̂) = E{[θ − θ̂(x)]²} = ∫ [θ − θ̂(x)]² p(x; θ) dx

    Note that the E{·} is w.r.t. the PDF of x.

    Bayesian MSE :

    The Bayesian MSE is not a function of θ and can be minimized to find an estimator :

    Bmse(θ̂) = E{[θ − θ̂(x)]²} = ∫∫ [θ − θ̂(x)]² p(x, θ) dx dθ

    Note that the E{·} is w.r.t. the joint PDF of x and θ.

    (The Bayesian Philosophy) Estimation Theory Spring 2013 94 / 152

  • Minimum Mean Square Error Estimator

    Consider the estimation of A that minimizes the Bayesian MSE, where A is a random variable with uniform prior PDF p(A) = U[−A0, A0], independent of w[n].

    Bmse(Â) = ∫∫ [A − Â]² p(x, A) dx dA = ∫ [ ∫ [A − Â]² p(A|x) dA ] p(x) dx

    where we used Bayes' theorem : p(x, A) = p(A|x)p(x).

    Since p(x) ≥ 0 for all x, we need only minimize the integral in brackets. We set the derivative equal to zero :

    ∂/∂Â ∫ [A − Â]² p(A|x) dA = −2 ∫ A p(A|x) dA + 2Â ∫ p(A|x) dA

    which results in

    Â = ∫ A p(A|x) dA = E(A|x)

    Bayesian MMSE Estimate : is the conditional mean of A given the data x, or the mean of the posterior PDF p(A|x).

    (The Bayesian Philosophy) Estimation Theory Spring 2013 95 / 152

  • Minimum Mean Square Error Estimator

    How to compute the Bayesian MMSE estimate :

    Â = E(A|x) = ∫ A p(A|x) dA

    The posterior PDF can be computed using Bayes' rule :

    p(A|x) = p(x|A)p(A)/p(x) = p(x|A)p(A) / ∫ p(x|A)p(A) dA

    p(x) is the marginal PDF defined as p(x) = ∫ p(x, A) dA.

    The integral in the denominator acts as a normalization of p(x|A)p(A) such that the integral of p(A|x) equals 1.

    p(x|A) has exactly the same form as p(x; A) which is used in classical estimation.

    The MMSE estimator always exists and can be computed by :

    Â = ∫ A p(x|A)p(A) dA / ∫ p(x|A)p(A) dA

    (The Bayesian Philosophy) Estimation Theory Spring 2013 96 / 152

  • Minimum Mean Square Error Estimator

    Example : For p(A) = U[−A0, A0] we have :

    Â = [ (1/(2A0)) ∫_{−A0}^{A0} A (1/(2πσ²)^{N/2}) exp[−(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)²] dA ]
        / [ (1/(2A0)) ∫_{−A0}^{A0} (1/(2πσ²)^{N/2}) exp[−(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A)²] dA ]

    Before collecting the data, the mean of the prior PDF p(A) is the best estimate, while after collecting the data the best estimate is the mean of the posterior PDF p(A|x).

    The choice of p(A) is crucial for the quality of the estimation.

    Only a Gaussian prior PDF leads to a closed-form estimator.

    Conclusion : For an accurate estimator choose a prior PDF that can be physically justified. For a closed-form estimator choose a Gaussian prior PDF.

    (The Bayesian Philosophy) Estimation Theory Spring 2013 97 / 152

  • Properties of the Gaussian PDF

    Theorem (Conditional PDF of Bivariate Gaussian)

    If x and y are distributed according to a bivariate Gaussian PDF :

    p(x, y) = [1/(2π√det(C))] exp[−(1/2) [x − E(x), y − E(y)] C⁻¹ [x − E(x), y − E(y)]^T]

    with covariance matrix : C = [ var(x)     cov(x, y)
                                   cov(x, y)  var(y) ]

    then the conditional PDF p(y|x) is also Gaussian and :

    E(y|x) = E(y) + (cov(x, y)/var(x)) [x − E(x)]

    var(y|x) = var(y) − cov²(x, y)/var(x)

    (The Bayesian Philosophy) Estimation Theory Spring 2013 98 / 152

  • Properties of the Gaussian PDF

    Example (DC level in WGN with Gaussian prior PDF)

    Consider the Bayesian model x[0] = A + w[0] (just one observation) with a prior PDF A ∼ N(μA, σA²) independent of the noise. First we compute the covariance cov(A, x[0]) :

    cov(A, x[0]) = E{(A − μA)(x[0] − E(x[0]))}
                 = E{(A − μA)(A + w[0] − μA − 0)}
                 = E{(A − μA)² + (A − μA)w[0]} = var(A) + 0 = σA²

    The Bayesian MMSE estimate is the mean of the posterior PDF, which is also Gaussian, with :

    Â = μ_{A|x} = μA + (cov(A, x[0])/var(x[0])) [x[0] − μA] = μA + (σA²/(σ² + σA²))(x[0] − μA)

    var(A|x) = σ²_{A|x} = σA² − cov²(A, x[0])/var(x[0]) = σA² (1 − σA²/(σ² + σA²))

    (The Bayesian Philosophy) Estimation Theory Spring 2013 99 / 152

  • Properties of the Gaussian PDF

    Theorem (Conditional PDF of Multivariate Gaussian)

    If x (with dimension k × 1) and y (with dimension l × 1) are jointly Gaussian with PDF :

    p(x, y) = [1/((2π)^{(k+l)/2} √det(C))] exp[−(1/2) [x − E(x); y − E(y)]^T C⁻¹ [x − E(x); y − E(y)]]

    with covariance matrix : C = [ Cxx  Cxy
                                   Cyx  Cyy ]

    then the conditional PDF p(y|x) is also Gaussian and :

    E(y|x) = E(y) + Cyx Cxx⁻¹ [x − E(x)]

    C_{y|x} = Cyy − Cyx Cxx⁻¹ Cxy

    (The Bayesian Philosophy) Estimation Theory Spring 2013 100 / 152

  • Bayesian Linear Model

Let the data be modeled as

$x = H\theta + w$

where $\theta$ is a $p \times 1$ random vector with prior PDF $\mathcal{N}(\mu_\theta, C_\theta)$ and $w$ is a noise vector with PDF $\mathcal{N}(0, C_w)$.

Since $\theta$ and $w$ are independent and Gaussian, they are jointly Gaussian, so the posterior PDF is also Gaussian. In order to find the MMSE estimator we should compute the covariance matrices :

$C_{xx} = E\{[x - E(x)][x - E(x)]^T\} = E\{(H\theta + w - H\mu_\theta)(H\theta + w - H\mu_\theta)^T\} = H\,E\{(\theta - \mu_\theta)(\theta - \mu_\theta)^T\}H^T + E(ww^T) = H C_\theta H^T + C_w$

$C_{\theta x} = E\{(\theta - \mu_\theta)[H(\theta - \mu_\theta) + w]^T\} = E\{(\theta - \mu_\theta)(\theta - \mu_\theta)^T\}H^T = C_\theta H^T$

    (The Bayesian Philosophy) Estimation Theory Spring 2013 101 / 152

  • Bayesian Linear Model

    Theorem (Posterior PDF for the Bayesian General Linear Model)

If the observed data can be modeled as

$x = H\theta + w$

where $\theta$ is a $p \times 1$ random vector with prior PDF $\mathcal{N}(\mu_\theta, C_\theta)$ and $w$ is a noise vector with PDF $\mathcal{N}(0, C_w)$, then the posterior PDF $p(\theta|x)$ is Gaussian with mean :

$E(\theta|x) = \mu_\theta + C_\theta H^T (H C_\theta H^T + C_w)^{-1}(x - H\mu_\theta)$

and covariance

$C_{\theta|x} = C_\theta - C_\theta H^T (H C_\theta H^T + C_w)^{-1} H C_\theta$

Remark : In contrast to the classical general linear model, H need not be full rank.

    (The Bayesian Philosophy) Estimation Theory Spring 2013 102 / 152

  • Bayesian Linear Model

    Example (DC Level in WGN with Gaussian Prior PDF)

$x[n] = A + w[n], \quad n = 0, \ldots, N-1, \quad A \sim \mathcal{N}(\mu_A, \sigma_A^2), \quad w[n] \sim \mathcal{N}(0, \sigma^2)$

We have the general Bayesian linear model $x = HA + w$ with $H = \mathbf{1}$ (the all-ones vector). Then

$E(A|x) = \mu_A + \sigma_A^2 \mathbf{1}^T (\mathbf{1}\sigma_A^2\mathbf{1}^T + \sigma^2 I)^{-1}(x - \mathbf{1}\mu_A)$

$\mathrm{var}(A|x) = \sigma_A^2 - \sigma_A^2 \mathbf{1}^T (\mathbf{1}\sigma_A^2\mathbf{1}^T + \sigma^2 I)^{-1}\mathbf{1}\sigma_A^2$

We use the Matrix Inversion Lemma to get :

$\left(I + \frac{\sigma_A^2}{\sigma^2}\mathbf{1}\mathbf{1}^T\right)^{-1} = I - \frac{\mathbf{1}\mathbf{1}^T}{\frac{\sigma^2}{\sigma_A^2} + \mathbf{1}^T\mathbf{1}} = I - \frac{\mathbf{1}\mathbf{1}^T}{\frac{\sigma^2}{\sigma_A^2} + N}$

$E(A|x) = \mu_A + \frac{\sigma_A^2}{\sigma_A^2 + \frac{\sigma^2}{N}}(\bar{x} - \mu_A) \qquad \text{and} \qquad \mathrm{var}(A|x) = \frac{\frac{\sigma^2}{N}\,\sigma_A^2}{\sigma_A^2 + \frac{\sigma^2}{N}}$
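To make these expressions concrete, here is a small numerical sketch (the parameter values are assumed, not from the slides) that evaluates the general Bayesian linear-model formulas with H = 1 and checks them against the simplified N-sample expressions above.

```python
import numpy as np

rng = np.random.default_rng(2)
N, mu_A, sigma_A2, sigma2 = 20, 1.0, 0.5, 2.0       # assumed values
A = mu_A + np.sqrt(sigma_A2) * rng.standard_normal()
x = A + np.sqrt(sigma2) * rng.standard_normal(N)

H = np.ones((N, 1))
Cx = sigma_A2 * H @ H.T + sigma2 * np.eye(N)        # H C_theta H^T + C_w

# General Bayesian linear model formulas
post_mean = mu_A + (sigma_A2 * H.T @ np.linalg.solve(Cx, x - mu_A)).item()
post_var = (sigma_A2 - sigma_A2 * H.T @ np.linalg.solve(Cx, H) * sigma_A2).item()

# Simplified N-sample expressions
alpha = sigma_A2 / (sigma_A2 + sigma2 / N)
print(post_mean, mu_A + alpha * (x.mean() - mu_A))
print(post_var, (sigma2 / N) * alpha)
```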

    (The Bayesian Philosophy) Estimation Theory Spring 2013 103 / 152

  • Nuisance Parameters

Definition (Nuisance Parameter)

Suppose that $\theta$ and $\alpha$ are unknown parameters but we are only interested in estimating $\theta$. In this case $\alpha$ is called a nuisance parameter.

In the classical approach we have to estimate both, but in the Bayesian approach we can integrate it out.

Note that in the Bayesian approach we can find $p(\theta|x)$ from $p(\theta,\alpha|x)$ as a marginal PDF :

$p(\theta|x) = \int p(\theta, \alpha|x)\, d\alpha$

We can also express it as

$p(\theta|x) = \frac{p(x|\theta)p(\theta)}{\int p(x|\theta)p(\theta)\, d\theta} \quad \text{where} \quad p(x|\theta) = \int p(x|\theta, \alpha)\, p(\alpha|\theta)\, d\alpha$

Furthermore, if $\alpha$ is independent of $\theta$ we have :

$p(x|\theta) = \int p(x|\theta, \alpha)\, p(\alpha)\, d\alpha$

(The Bayesian Philosophy) Estimation Theory Spring 2013 104 / 152

  • General Bayesian Estimators

    (General Bayesian Estimators) Estimation Theory Spring 2013 105 / 152

  • General Bayesian Estimators

    Risk Function

A general Bayesian estimator is obtained by minimizing the Bayes Risk

$\hat{\theta} = \arg\min_{\hat{\theta}} \mathcal{R}(\hat{\theta})$

where $\mathcal{R}(\hat{\theta}) = E\{C(\epsilon)\}$ is the Bayes Risk, $C(\epsilon)$ is a cost function and $\epsilon = \theta - \hat{\theta}$ is the estimation error.

Three common risk functions :

1. Quadratic : $\mathcal{R}(\hat{\theta}) = E\{\epsilon^2\} = E\{(\theta - \hat{\theta})^2\}$

2. Absolute : $\mathcal{R}(\hat{\theta}) = E\{|\epsilon|\} = E\{|\theta - \hat{\theta}|\}$

3. Hit-or-Miss : $\mathcal{R}(\hat{\theta}) = E\{C(\epsilon)\}$ where $C(\epsilon) = \begin{cases} 0 & |\epsilon| < \delta \\ 1 & |\epsilon| \geq \delta \end{cases}$

    (General Bayesian Estimators) Estimation Theory Spring 2013 106 / 152

  • General Bayesian Estimators

The Bayes risk is

$\mathcal{R}(\hat{\theta}) = E\{C(\epsilon)\} = \int\!\!\int C(\theta - \hat{\theta})\, p(x, \theta)\, d\theta\, dx = \int g(\hat{\theta})\, p(x)\, dx$

where $g(\hat{\theta}) = \int C(\theta - \hat{\theta})\, p(\theta|x)\, d\theta$ should be minimized.

Quadratic

For this case $\mathcal{R}(\hat{\theta}) = E\{(\theta - \hat{\theta})^2\} = \mathrm{Bmse}(\hat{\theta})$, which leads to the MMSE estimator with

$\hat{\theta} = E(\theta|x)$

so $\hat{\theta}$ is the mean of the posterior PDF $p(\theta|x)$.

    (General Bayesian Estimators) Estimation Theory Spring 2013 107 / 152

  • General Bayesian Estimators

Absolute

In this case we have :

$g(\hat{\theta}) = \int |\theta - \hat{\theta}|\, p(\theta|x)\, d\theta = \int_{-\infty}^{\hat{\theta}} (\hat{\theta} - \theta)\, p(\theta|x)\, d\theta + \int_{\hat{\theta}}^{\infty} (\theta - \hat{\theta})\, p(\theta|x)\, d\theta$

By setting the derivative of $g(\hat{\theta})$ equal to zero and using Leibniz's rule, we get :

$\int_{-\infty}^{\hat{\theta}} p(\theta|x)\, d\theta = \int_{\hat{\theta}}^{\infty} p(\theta|x)\, d\theta$

Then $\hat{\theta}$ is the median (area to the left = area to the right) of $p(\theta|x)$.

    (General Bayesian Estimators) Estimation Theory Spring 2013 108 / 152

  • General Bayesian Estimators

Hit-or-Miss

In this case we have

$g(\hat{\theta}) = \int_{-\infty}^{\hat{\theta} - \delta} 1 \cdot p(\theta|x)\, d\theta + \int_{\hat{\theta} + \delta}^{\infty} 1 \cdot p(\theta|x)\, d\theta = 1 - \int_{\hat{\theta} - \delta}^{\hat{\theta} + \delta} p(\theta|x)\, d\theta$

For arbitrarily small $\delta$ the optimal estimate is the location of the maximum of $p(\theta|x)$, i.e. the mode of the posterior PDF.

This estimator is called the maximum a posteriori (MAP) estimator.

Remark : For a unimodal and symmetric posterior PDF (e.g. a Gaussian PDF), the mean, the mode and the median are the same.
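The three cost functions pick out different summaries of the same posterior. The sketch below (an illustration with an assumed skewed posterior, not from the slides) computes the mean, median and mode of a discretized posterior, which are the MMSE, absolute-error and MAP estimates respectively; for a skewed posterior the three differ.

```python
import numpy as np
from scipy import stats

# Assumed skewed posterior p(theta|x): a Gamma density on a grid
theta = np.linspace(0.01, 20, 4000)
post = stats.gamma.pdf(theta, a=3.0, scale=2.0)
post /= np.trapz(post, theta)                        # normalize

mean = np.trapz(theta * post, theta)                 # MMSE (quadratic cost)
cdf = np.cumsum(post) * (theta[1] - theta[0])
median = theta[np.searchsorted(cdf, 0.5)]            # absolute cost
mode = theta[np.argmax(post)]                        # MAP (hit-or-miss cost)
print(mean, median, mode)
```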

    (General Bayesian Estimators) Estimation Theory Spring 2013 109 / 152

  • Minimum Mean Square Error Estimators

Extension to the vector parameter case : In general, as in the scalar case, we can write :

$\hat{\theta}_i = E(\theta_i|x) = \int \theta_i\, p(\theta_i|x)\, d\theta_i \qquad i = 1, 2, \ldots, p$

Then $p(\theta_i|x)$ can be computed as a marginal conditional PDF. For example :

$\hat{\theta}_1 = \int \theta_1 \left[\int \cdots \int p(\theta|x)\, d\theta_2 \cdots d\theta_p\right] d\theta_1 = \int \theta_1\, p(\theta|x)\, d\theta$

In vector form we have :

$\hat{\theta} = \begin{bmatrix} \int \theta_1\, p(\theta|x)\, d\theta \\ \int \theta_2\, p(\theta|x)\, d\theta \\ \vdots \\ \int \theta_p\, p(\theta|x)\, d\theta \end{bmatrix} = \int \theta\, p(\theta|x)\, d\theta = E(\theta|x)$

Similarly $\mathrm{Bmse}(\hat{\theta}_i) = \int [C_{\theta|x}]_{ii}\, p(x)\, dx$

    (General Bayesian Estimators) Estimation Theory Spring 2013 110 / 152

  • Properties of MMSE Estimators

For the Bayesian Linear Model, poor prior knowledge leads to the MVU estimator :

$\hat{\theta} = E(\theta|x) = \mu_\theta + (C_\theta^{-1} + H^T C_w^{-1} H)^{-1} H^T C_w^{-1} (x - H\mu_\theta)$

For no prior knowledge $\mu_\theta \to 0$ and $C_\theta^{-1} \to 0$, and therefore :

$\hat{\theta} \to (H^T C_w^{-1} H)^{-1} H^T C_w^{-1} x$

Commutes over affine transformations. Suppose that $\alpha = A\theta + b$; then the MMSE estimator for $\alpha$ is

$\hat{\alpha} = E(\alpha|x) = E(A\theta + b|x) = A\,E(\theta|x) + b = A\hat{\theta} + b$

Enjoys the additive property for independent data sets. Assume that $\theta$, $x_1$, $x_2$ are jointly Gaussian with $x_1$ and $x_2$ independent :

$\hat{\theta} = E(\theta) + C_{\theta x_1} C_{x_1 x_1}^{-1} [x_1 - E(x_1)] + C_{\theta x_2} C_{x_2 x_2}^{-1} [x_2 - E(x_2)]$

    (General Bayesian Estimators) Estimation Theory Spring 2013 111 / 152

  • Maximum A Posteriori Estimators

In the MAP estimation approach we have :

$\hat{\theta} = \arg\max_{\theta}\, p(\theta|x) \quad \text{where} \quad p(\theta|x) = \frac{p(x|\theta)p(\theta)}{p(x)}$

An equivalent maximization is :

$\hat{\theta} = \arg\max_{\theta}\, p(x|\theta)p(\theta) \quad \text{or} \quad \hat{\theta} = \arg\max_{\theta}\, [\ln p(x|\theta) + \ln p(\theta)]$

If $p(\theta)$ is uniform or is approximately constant around the maximum of $p(x|\theta)$ (for large data length), we can remove $p(\theta)$ to obtain the Bayesian Maximum Likelihood :

$\hat{\theta} = \arg\max_{\theta}\, p(x|\theta)$

(General Bayesian Estimators) Estimation Theory Spring 2013 112 / 152

  • Maximum A Posteriori Estimators

    Example (DC Level in WGN with Uniform Prior PDF)

The MMSE estimator cannot be obtained in explicit form, due to the need to evaluate the following integrals :

$\hat{A} = \dfrac{\frac{1}{2A_0}\int_{-A_0}^{A_0} A\, \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n]-A)^2\right] dA}{\frac{1}{2A_0}\int_{-A_0}^{A_0} \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n]-A)^2\right] dA}$

The MAP estimator is given as :

$\hat{A} = \begin{cases} -A_0 & \bar{x} < -A_0 \\ \bar{x} & |\bar{x}| \leq A_0 \\ A_0 & \bar{x} > A_0 \end{cases}$

Remark : The main advantage of the MAP estimator is that for non jointly Gaussian PDFs it may lead to explicit solutions or less computational effort.
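A minimal sketch of this MAP estimator (the measurement values are assumed): by the piecewise expression above it is just the sample mean clamped to the prior support.

```python
import numpy as np

def map_dc_level_uniform_prior(x, A0):
    """MAP estimate of a DC level with prior U[-A0, A0]: the clamped sample mean."""
    return float(np.clip(np.mean(x), -A0, A0))

x = np.array([2.3, 1.9, 2.6, 2.1])                   # assumed measurements (sample mean 2.225)
print(map_dc_level_uniform_prior(x, A0=2.0))         # 2.0: the sample mean exceeds A0
```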

    (General Bayesian Estimators) Estimation Theory Spring 2013 113 / 152

  • Maximum A Posteriori Estimators

Extension to the vector parameter case

$\hat{\theta}_1 = \arg\max_{\theta_1}\, p(\theta_1|x) = \arg\max_{\theta_1} \int \cdots \int p(\theta|x)\, d\theta_2 \ldots d\theta_p$

This needs integral evaluation of the marginal conditional PDFs.

An alternative is the following vector MAP estimator that maximizes the joint conditional PDF :

$\hat{\theta} = \arg\max_{\theta}\, p(\theta|x) = \arg\max_{\theta}\, p(x|\theta)p(\theta)$

This corresponds to a circular Hit-or-Miss cost function.

In general, as $N \to \infty$, the MAP estimator tends to the Bayesian MLE.

If the posterior PDF is Gaussian, the mode is identical to the mean, therefore the MAP estimator is identical to the MMSE estimator.

The invariance property in ML theory does not hold for the MAP estimator.

    (General Bayesian Estimators) Estimation Theory Spring 2013 114 / 152

  • Performance Description

In classical estimation the mean and the variance of the estimate (or its PDF) indicate the performance of the estimator.

In the Bayesian approach the PDF of the estimate is different for each realization of $\theta$, so a good estimator should perform well for every possible value of $\theta$.

The estimation error $\epsilon = \theta - \hat{\theta}$ is a function of two random variables ($\theta$ and $x$). The mean and the variance of the estimation error, with respect to both random variables, indicate the performance of the estimator.

The mean value of the estimation error is zero, so the estimates are unbiased (in the Bayesian sense) :

$E_{x,\theta}(\theta - \hat{\theta}) = E_{x,\theta}(\theta - E(\theta|x)) = E_x\left[E_{\theta|x}(\theta) - E_{\theta|x}(E(\theta|x))\right] = E_x(0) = 0$

The variance of the error is the Bmse :

$\mathrm{var}(\epsilon) = E_{x,\theta}(\epsilon^2) = E_{x,\theta}[(\theta - \hat{\theta})^2] = \mathrm{Bmse}(\hat{\theta})$

    (General Bayesian Estimators) Estimation Theory Spring 2013 115 / 152

  • Performance Description (Vector Parameter Case)

The vector of the estimation error is $\epsilon = \theta - \hat{\theta}$ and has zero mean. Its covariance matrix is :

$M_{\hat{\theta}} = E_{x,\theta}(\epsilon\epsilon^T) = E_{x,\theta}\{[\theta - E(\theta|x)][\theta - E(\theta|x)]^T\} = E_x\left[E_{\theta|x}\{[\theta - E(\theta|x)][\theta - E(\theta|x)]^T\}\right] = E_x(C_{\theta|x})$

If $x$ and $\theta$ are jointly Gaussian, we have :

$M_{\hat{\theta}} = C_{\theta|x} = C_{\theta\theta} - C_{\theta x} C_{xx}^{-1} C_{x\theta}$

since $C_{\theta|x}$ does not depend on $x$.

For a Bayesian linear model we have :

$M_{\hat{\theta}} = C_{\theta|x} = C_\theta - C_\theta H^T (H C_\theta H^T + C_w)^{-1} H C_\theta$

In this case $\hat{\theta}$ is a linear transformation of $x$ and $\theta$, and thus $\epsilon$ is Gaussian. Therefore :

$\epsilon \sim \mathcal{N}(0, M_{\hat{\theta}})$

(General Bayesian Estimators) Estimation Theory Spring 2013 116 / 152

  • Signal Processing Example

Deconvolution Problem : Estimate a signal transmitted through a channel with known impulse response :

$x[n] = \sum_{m=0}^{n_s-1} h[n-m]\, s[m] + w[n] \qquad n = 0, 1, \ldots, N-1$

where $s[n]$ is a WSS Gaussian process with known ACF, and $w[n]$ is WGN with variance $\sigma^2$.

$\begin{bmatrix} x[0] \\ x[1] \\ \vdots \\ x[N-1] \end{bmatrix} = \begin{bmatrix} h[0] & 0 & \cdots & 0 \\ h[1] & h[0] & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ h[N-1] & h[N-2] & \cdots & h[N-n_s] \end{bmatrix} \begin{bmatrix} s[0] \\ s[1] \\ \vdots \\ s[n_s-1] \end{bmatrix} + \begin{bmatrix} w[0] \\ w[1] \\ \vdots \\ w[N-1] \end{bmatrix}$

$x = Hs + w \quad \Rightarrow \quad \hat{s} = C_s H^T (H C_s H^T + \sigma^2 I)^{-1} x$

where $C_s$ is a symmetric Toeplitz matrix with $[C_s]_{ij} = r_{ss}[i-j]$.

    (General Bayesian Estimators) Estimation Theory Spring 2013 117 / 152

  • Signal Processing Example

Consider that H = I (no filtering), so the Bayesian model becomes

$x = s + w$

Classical Approach : The MVU estimator is $\hat{s} = (H^T H)^{-1} H^T x = x$.

Bayesian Approach : The MMSE estimator is $\hat{s} = C_s (C_s + \sigma^2 I)^{-1} x$.

Scalar case : We estimate $s[0]$ based on $x[0]$ :

$\hat{s}[0] = \frac{r_{ss}[0]}{r_{ss}[0] + \sigma^2}\, x[0] = \frac{\eta}{\eta + 1}\, x[0]$

where $\eta = r_{ss}[0]/\sigma^2$ is the SNR.

Signal is a Realization of an Auto-Regressive Process : Consider a first-order AR process : $s[n] = a\, s[n-1] + u[n]$, where $u[n]$ is WGN with variance $\sigma_u^2$. The ACF of $s$ is :

$r_{ss}[k] = \frac{\sigma_u^2}{1 - a^2}\, a^{|k|} \qquad [C_s]_{ij} = r_{ss}[i-j]$

    (General Bayesian Estimators) Estimation Theory Spring 2013 118 / 152

  • Linear Bayesian Estimators

    (Linear Bayesian Estimators) Estimation Theory Spring 2013 119 / 152

  • Introduction

Problems with general Bayesian estimators :

They are difficult to determine in closed form,

need intensive computations,

involve multidimensional integration (MMSE) or multidimensional maximization (MAP),

and can be determined only under the jointly Gaussian assumption.

What can we do if the joint PDF is not Gaussian or unknown ?

Keep the MMSE criterion ;

Restrict the estimator to be linear.

This leads to the Linear MMSE Estimator :

It is a suboptimal estimator that can be easily implemented.

It needs only the first and second moments of the joint PDF.

It is analogous to BLUE in classical estimation.

In practice, this estimator is termed the Wiener filter.

    (Linear Bayesian Estimators) Estimation Theory Spring 2013 120 / 152

  • Linear MMSE Estimators (scalar case)

Goal : Estimate $\theta$, given the data vector $x$. Assume that only the first two moments of the joint PDF of $x$ and $\theta$ are available :

$\begin{bmatrix} E(\theta) \\ E(x) \end{bmatrix}, \qquad \begin{bmatrix} C_{\theta\theta} & C_{\theta x} \\ C_{x\theta} & C_{xx} \end{bmatrix}$

LMMSE Estimator : Take the class of all affine estimators

$\hat{\theta} = \sum_{n=0}^{N-1} a_n x[n] + a_N = a^T x + a_N$

where $a = [a_0, a_1, \ldots, a_{N-1}]^T$. Then minimize the Bayesian MSE :

$\mathrm{Bmse}(\hat{\theta}) = E[(\theta - \hat{\theta})^2]$

Computing $a_N$ : Let us differentiate the Bmse with respect to $a_N$ :

$\frac{\partial}{\partial a_N} E\left[(\theta - a^T x - a_N)^2\right] = -2 E[\theta - a^T x - a_N]$

Setting this equal to zero gives : $a_N = E(\theta) - a^T E(x)$.

(Linear Bayesian Estimators) Estimation Theory Spring 2013 121 / 152

  • Linear MMSE Estimators (scalar case)

Minimize the Bayesian MSE :

$\mathrm{Bmse}(\hat{\theta}) = E[(\theta - \hat{\theta})^2] = E\left[(\theta - a^T x - E(\theta) + a^T E(x))^2\right] = E\left\{[a^T(x - E(x)) - (\theta - E(\theta))]^2\right\}$

$= E[a^T(x - E(x))(x - E(x))^T a] - E[a^T(x - E(x))(\theta - E(\theta))] - E[(\theta - E(\theta))(x - E(x))^T a] + E[(\theta - E(\theta))^2]$

$= a^T C_{xx} a - a^T C_{x\theta} - C_{\theta x} a + C_{\theta\theta}$

This can be minimized by setting the gradient to zero :

$\frac{\partial\, \mathrm{Bmse}(\hat{\theta})}{\partial a} = 2 C_{xx} a - 2 C_{x\theta} = 0$

which results in $a = C_{xx}^{-1} C_{x\theta}$ and leads to (note that $C_{x\theta} = C_{\theta x}^T$) :

$\hat{\theta} = a^T x + a_N = C_{\theta x} C_{xx}^{-1} x + E(\theta) - C_{\theta x} C_{xx}^{-1} E(x) = E(\theta) + C_{\theta x} C_{xx}^{-1} (x - E(x))$
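The closed-form weights $a = C_{xx}^{-1} C_{x\theta}$ can be checked numerically. A small sketch (assumed joint distribution, not from the slides): draw correlated $(\theta, x)$ samples, estimate the moments, form the LMMSE weights, and compare the resulting Bmse with $C_{\theta\theta} - C_{\theta x} C_{xx}^{-1} C_{x\theta}$.

```python
import numpy as np

rng = np.random.default_rng(4)
M, N = 200_000, 3                                    # M Monte Carlo trials, N data samples

# Assumed model: theta ~ N(0,1), x[n] = theta + independent noise (all zero mean)
theta = rng.standard_normal(M)
x = theta[:, None] + 0.5 * rng.standard_normal((M, N))

# Sample moments
Cxx = np.cov(x, rowvar=False)                        # N x N
Cxtheta = np.array([np.cov(x[:, n], theta)[0, 1] for n in range(N)])

a = np.linalg.solve(Cxx, Cxtheta)                    # a = Cxx^{-1} C_{x theta}
theta_hat = x @ a                                    # E(theta) = E(x) = 0 here, so a_N = 0

bmse_empirical = np.mean((theta - theta_hat) ** 2)
bmse_theory = np.var(theta) - Cxtheta @ a            # C_theta,theta - C_theta,x Cxx^{-1} C_x,theta
print(bmse_empirical, bmse_theory)
```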

    (Linear Bayesian Estimators) Estimation Theory Spring 2013 122 / 152

  • Linear MMSE Estimators (scalar case)

The minimum Bayesian MSE is obtained as :

$\mathrm{Bmse}(\hat{\theta}) = C_{\theta x} C_{xx}^{-1} C_{xx} C_{xx}^{-1} C_{x\theta} - C_{\theta x} C_{xx}^{-1} C_{x\theta} - C_{\theta x} C_{xx}^{-1} C_{x\theta} + C_{\theta\theta} = C_{\theta\theta} - C_{\theta x} C_{xx}^{-1} C_{x\theta}$

Remarks :

The results are identical to those of the MMSE estimator for a jointly Gaussian PDF.

The LMMSE estimator relies on the correlation between random variables.

If a parameter is uncorrelated with the data (even if nonlinearly dependent on it), it cannot be estimated with an LMMSE estimator.

    (Linear Bayesian Estimators) Estimation Theory Spring 2013 123 / 152

  • Linear MMSE Estimators (scalar case)

    Example (DC Level in WGN with Uniform Prior PDF)

The data model is : $x[n] = A + w[n], \quad n = 0, 1, \ldots, N-1$

where $A \sim \mathcal{U}[-A_0, A_0]$ is independent of $w[n]$ (WGN with variance $\sigma^2$). We have $E(A) = 0$ and $E(x) = 0$. The covariances are :

$C_{xx} = E(x x^T) = E[(A\mathbf{1} + w)(A\mathbf{1} + w)^T] = E(A^2)\mathbf{1}\mathbf{1}^T + \sigma^2 I$

$C_{Ax} = E(A x^T) = E[A(A\mathbf{1} + w)^T] = E(A^2)\mathbf{1}^T$

Therefore the LMMSE estimator is :

$\hat{A} = C_{Ax} C_{xx}^{-1} x = \sigma_A^2 \mathbf{1}^T (\sigma_A^2 \mathbf{1}\mathbf{1}^T + \sigma^2 I)^{-1} x = \frac{A_0^2/3}{A_0^2/3 + \sigma^2/N}\, \bar{x}$

where

$\sigma_A^2 = E(A^2) = \int_{-A_0}^{A_0} A^2 \frac{1}{2A_0}\, dA = \left.\frac{A^3}{6A_0}\right|_{-A_0}^{A_0} = \frac{A_0^2}{3}$

    (Linear Bayesian Estimators) Estimation Theory Spring 2013 124 / 152

  • Linear MMSE Estimators (scalar case)

    Example (DC Level in WGN with Uniform Prior PDF)

Comparison of different Bayesian estimators :

MMSE : $\hat{A} = \dfrac{\int_{-A_0}^{A_0} A \exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n]-A)^2\right] dA}{\int_{-A_0}^{A_0} \exp\left[-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n]-A)^2\right] dA}$

MAP : $\hat{A} = \begin{cases} -A_0 & \bar{x} < -A_0 \\ \bar{x} & |\bar{x}| \leq A_0 \\ A_0 & \bar{x} > A_0 \end{cases}$

LMMSE : $\hat{A} = \dfrac{A_0^2/3}{A_0^2/3 + \sigma^2/N}\, \bar{x}$

    (Linear Bayesian Estimators) Estimation Theory Spring 2013 125 / 152

  • Geometrical Interpretation

Vector space of random variables : The set of scalar zero-mean random variables is a vector space.

The zero-length vector of the set is an RV with zero variance.

For any real scalar $a$, $ax$ is another zero-mean RV in the set.

For two RVs $x$ and $y$, $x + y = y + x$ is an RV in the set.

For two RVs $x$ and $y$, the inner product is : $\langle x, y \rangle = E(xy)$

The length of $x$ is defined as : $\|x\| = \sqrt{\langle x, x \rangle} = \sqrt{E(x^2)}$

Two RVs $x$ and $y$ are orthogonal if : $\langle x, y \rangle = E(xy) = 0$.

For RVs $x_1$, $x_2$, $y$ and real numbers $a_1$ and $a_2$ we have :

$\langle a_1 x_1 + a_2 x_2,\, y \rangle = a_1 \langle x_1, y \rangle + a_2 \langle x_2, y \rangle \qquad \Leftrightarrow \qquad E[(a_1 x_1 + a_2 x_2) y] = a_1 E(x_1 y) + a_2 E(x_2 y)$

The projection of $y$ on $x$ is : $\dfrac{\langle y, x \rangle}{\|x\|^2}\, x = \dfrac{E(yx)}{\sigma_x^2}\, x$

    (Linear Bayesian Estimators) Estimation Theory Spring 2013 126 / 152

  • Geometrical Interpretation

The LMMSE estimator can be determined using the vector space viewpoint. We have :

$\hat{\theta} = \sum_{n=0}^{N-1} a_n x[n]$

where $a_N$ is zero because of the zero-mean assumption.

$\hat{\theta}$ belongs to the subspace spanned by $x[0], x[1], \ldots, x[N-1]$ ; $\theta$ is not in this subspace.

A good estimator will minimize the MSE :

$E[(\theta - \hat{\theta})^2] = \|\epsilon\|^2$

where $\epsilon = \theta - \hat{\theta}$ is the error vector.

Clearly the length of the error vector is minimized when $\epsilon$ is orthogonal to the subspace spanned by $x[0], x[1], \ldots, x[N-1]$, i.e. to each data sample :

$E\left[(\theta - \hat{\theta})\, x[n]\right] = 0 \qquad \text{for } n = 0, 1, \ldots, N-1$

    (Linear Bayesian Estimators) Estimation Theory Spring 2013 127 / 152

  • Geometrical Interpretation

The LMMSE estimator can be determined by solving the following equations :

$E\left[\left(\theta - \sum_{m=0}^{N-1} a_m x[m]\right) x[n]\right] = 0 \qquad n = 0, 1, \ldots, N-1$

or

$\sum_{m=0}^{N-1} a_m E(x[m]x[n]) = E(\theta x[n]) \qquad n = 0, 1, \ldots, N-1$

$\begin{bmatrix} E(x^2[0]) & E(x[0]x[1]) & \cdots & E(x[0]x[N-1]) \\ E(x[1]x[0]) & E(x^2[1]) & \cdots & E(x[1]x[N-1]) \\ \vdots & \vdots & \ddots & \vdots \\ E(x[N-1]x[0]) & E(x[N-1]x[1]) & \cdots & E(x^2[N-1]) \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_{N-1} \end{bmatrix} = \begin{bmatrix} E(\theta x[0]) \\ E(\theta x[1]) \\ \vdots \\ E(\theta x[N-1]) \end{bmatrix}$

Therefore

$C_{xx}\, a = C_{\theta x}^T \quad \Rightarrow \quad a = C_{xx}^{-1} C_{\theta x}^T$

The LMMSE estimator is :

$\hat{\theta} = a^T x = C_{\theta x} C_{xx}^{-1} x$

    (Linear Bayesian Estimators) Estimation Theory Spring 2013 128 / 152

  • The vector LMMSE Estimator

We want to estimate $\theta = [\theta_1, \theta_2, \ldots, \theta_p]^T$ with a linear estimator that minimizes the Bayesian MSE for each element :

$\hat{\theta}_i = \sum_{n=0}^{N-1} a_{in} x[n] + a_{iN} \qquad \text{minimize } \mathrm{Bmse}(\hat{\theta}_i) = E\left[(\theta_i - \hat{\theta}_i)^2\right]$

Therefore :

$\hat{\theta}_i = E(\theta_i) + \underbrace{C_{\theta_i x}}_{1 \times N}\, \underbrace{C_{xx}^{-1}}_{N \times N}\, \underbrace{(x - E(x))}_{N \times 1} \qquad i = 1, 2, \ldots, p$

The scalar LMMSE estimators can be combined into a vector estimator :

$\hat{\theta} = \underbrace{E(\theta)}_{p \times 1} + \underbrace{C_{\theta x}}_{p \times N}\, \underbrace{C_{xx}^{-1}}_{N \times N}\, \underbrace{(x - E(x))}_{N \times 1}$

and similarly

$\mathrm{Bmse}(\hat{\theta}_i) = [M_{\hat{\theta}}]_{ii} \quad \text{where} \quad \underbrace{M_{\hat{\theta}}}_{p \times p} = \underbrace{C_{\theta\theta}}_{p \times p} - \underbrace{C_{\theta x}}_{p \times N}\, C_{xx}^{-1}\, \underbrace{C_{x\theta}}_{N \times p}$

    (Linear Bayesian Estimators) Estimation Theory Spring 2013 129 / 152

  • The vector LMMSE Estimator

Theorem (Bayesian Gauss-Markov Theorem)

If the data are described by the Bayesian linear model form

$x = H\theta + w$

where $\theta$ is a $p \times 1$ random vector with mean $E(\theta)$ and covariance $C_\theta$, and $w$ is a noise vector with zero mean and covariance $C_w$ that is uncorrelated with $\theta$ (the joint PDF $p(\theta, w)$ is otherwise arbitrary), then the LMMSE estimator of $\theta$ is :

$\hat{\theta} = E(\theta) + C_\theta H^T (H C_\theta H^T + C_w)^{-1} (x - H E(\theta))$

The performance of the estimator is measured by $\epsilon = \theta - \hat{\theta}$, whose mean is zero and whose covariance matrix is

$C_\epsilon = M_{\hat{\theta}} = C_\theta - C_\theta H^T (H C_\theta H^T + C_w)^{-1} H C_\theta = \left(C_\theta^{-1} + H^T C_w^{-1} H\right)^{-1}$
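The two expressions for the error covariance are equal by the matrix inversion lemma; a quick numerical sketch (random positive-definite matrices with assumed dimensions) can confirm this :

```python
import numpy as np

rng = np.random.default_rng(5)
p, N = 3, 6                                           # assumed dimensions

H = rng.standard_normal((N, p))
A = rng.standard_normal((p, p)); C_theta = A @ A.T + p * np.eye(p)   # positive-definite prior covariance
B = rng.standard_normal((N, N)); C_w = B @ B.T + N * np.eye(N)       # positive-definite noise covariance

form1 = C_theta - C_theta @ H.T @ np.linalg.solve(H @ C_theta @ H.T + C_w, H @ C_theta)
form2 = np.linalg.inv(np.linalg.inv(C_theta) + H.T @ np.linalg.solve(C_w, H))
print(np.allclose(form1, form2))                      # True: the two forms coincide
```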

    (Linear Bayesian Estimators) Estimation Theory Spring 2013 130 / 152

  • Sequential LMMSE Estimation

Objective : Given $\hat{\theta}[n-1]$ based on the data $\{x[0], \ldots, x[n-1]\}$, update the estimate to $\hat{\theta}[n]$ using the new sample $x[n]$.

Example (DC Level in White Noise)

Assume that both $A$ and $w[n]$ have zero mean :

$x[n] = A + w[n] \qquad \hat{A}[N-1] = \frac{\sigma_A^2}{\sigma_A^2 + \sigma^2/N}\, \bar{x}$

Estimator Update :

$\hat{A}[N] = \hat{A}[N-1] + K[N]\,(x[N] - \hat{A}[N-1]) \quad \text{where} \quad K[N] = \frac{\mathrm{Bmse}(\hat{A}[N-1])}{\mathrm{Bmse}(\hat{A}[N-1]) + \sigma^2}$

Minimum MSE Update :

$\mathrm{Bmse}(\hat{A}[N]) = (1 - K[N])\,\mathrm{Bmse}(\hat{A}[N-1])$

(Linear Bayesian Estimators) Estimation Theory Spring 2013 131 / 152

  • Sequential LMMSE Estimation

Vector space view : If two observations are orthogonal, the LMMSE estimate of $\theta$ is the sum of the projections of $\theta$ on each observation.

1. Find the LMMSE estimator of A based on x[0], yielding $\hat{A}[0]$ :

$\hat{A}[0] = \frac{E(A\, x[0])}{E(x^2[0])}\, x[0] = \frac{\sigma_A^2}{\sigma_A^2 + \sigma^2}\, x[0]$

2. Find the LMMSE estimator of x[1] based on x[0], yielding $\hat{x}[1|0]$ :

$\hat{x}[1|0] = \frac{E(x[0]x[1])}{E(x^2[0])}\, x[0] = \frac{\sigma_A^2}{\sigma_A^2 + \sigma^2}\, x[0]$

3. Determine the innovation of the new data : $\tilde{x}[1] = x[1] - \hat{x}[1|0]$. This error vector is orthogonal to x[0].

4. Add to $\hat{A}[0]$ the LMMSE estimator of A based on the innovation :

$\hat{A}[1] = \hat{A}[0] + \underbrace{\frac{E(A\tilde{x}[1])}{E(\tilde{x}^2[1])}\, \tilde{x}[1]}_{\text{the projection of } A \text{ on } \tilde{x}[1]} = \hat{A}[0] + K[1]\,(x[1] - \hat{x}[1|0])$

    (Linear Bayesian Estimators) Estimation Theory Spring 2013 132 / 152

  • Sequential LMMSE Estimation

Basic Idea : Generate a sequence of orthogonal RVs, namely the innovations :

$\left\{\; \tilde{x}[0] = x[0],\;\; \tilde{x}[1] = x[1] - \hat{x}[1|0],\;\; \tilde{x}[2] = x[2] - \hat{x}[2|0,1],\;\; \ldots,\;\; \tilde{x}[n] = x[n] - \hat{x}[n|0,1,\ldots,n-1] \;\right\}$

Then, add the individual estimators to yield :

$\hat{A}[N] = \sum_{n=0}^{N} K[n]\,\tilde{x}[n] = \hat{A}[N-1] + K[N]\,\tilde{x}[N] \quad \text{where} \quad K[n] = \frac{E(A\tilde{x}[n])}{E(\tilde{x}^2[n])}$

It can be shown that :

$\tilde{x}[N] = x[N] - \hat{A}[N-1] \quad \text{and} \quad K[N] = \frac{\mathrm{Bmse}(\hat{A}[N-1])}{\sigma^2 + \mathrm{Bmse}(\hat{A}[N-1])}$

and the minimum MSE can be updated as :

$\mathrm{Bmse}(\hat{A}[N]) = (1 - K[N])\,\mathrm{Bmse}(\hat{A}[N-1])$

    (Linear Bayesian Estimators) Estimation Theory Spring 2013 133 / 152

  • General Sequential LMMSE Estimation

Consider the general Bayesian linear model $x = H\theta + w$, where $w$ is an uncorrelated noise with diagonal covariance matrix $C_w$. Let us define :

$C_w[n] = \mathrm{diag}(\sigma_0^2, \sigma_1^2, \ldots, \sigma_n^2)$

$H[n] = \begin{bmatrix} H[n-1] \\ h^T[n] \end{bmatrix} \quad \text{where } H[n-1] \text{ is } n \times p \text{ and } h^T[n] \text{ is } 1 \times p, \qquad x[n] = \begin{bmatrix} x[0], x[1], \ldots, x[n] \end{bmatrix}^T$

Estimator Update :

$\hat{\theta}[n] = \hat{\theta}[n-1] + K[n]\left(x[n] - h^T[n]\,\hat{\theta}[n-1]\right)$

where

$K[n] = \frac{M[n-1]\, h[n]}{\sigma_n^2 + h^T[n]\, M[n-1]\, h[n]}$

Minimum MSE Matrix Update :

$M[n] = \left(I - K[n]\, h^T[n]\right) M[n-1]$
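A sketch of this recursion for a generic linear model (the true parameter, the rows of H, the noise variances and the prior statistics are all assumed for illustration; the three update equations are the ones given above) :

```python
import numpy as np

rng = np.random.default_rng(6)
p, n_samples = 2, 100

theta_true = np.array([1.0, -0.5])                    # assumed true parameter
theta_hat = np.zeros(p)                               # prior mean E(theta) = 0 (assumed)
M = 10.0 * np.eye(p)                                  # prior covariance C_theta (assumed)

for n in range(n_samples):
    h = rng.standard_normal(p)                        # n-th row of H (assumed regressor)
    sigma2_n = 0.5                                    # noise variance of sample n (assumed)
    x_n = h @ theta_true + np.sqrt(sigma2_n) * rng.standard_normal()

    # Gain, estimator update and MSE-matrix update
    K = M @ h / (sigma2_n + h @ M @ h)
    theta_hat = theta_hat + K * (x_n - h @ theta_hat)
    M = (np.eye(p) - np.outer(K, h)) @ M

print(theta_hat)                                      # approaches theta_true as n grows
```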

    (Linear Bayesian Estimators) Estimation Theory Spring 2013 134 / 152

  • General Sequential LMMSE Estimation

Remarks :

For the initialization of the sequential LMMSE estimator, we can use the prior information :

$\hat{\theta}[-1] = E(\theta) \qquad M[-1] = C_\theta$

For no prior knowledge about $\theta$ we can let $C_\theta \to \infty$. Then we have the same form as the sequential LSE, although the approaches are fundamentally different.

No matrix inversion is required.

The gain factor $K[n]$ weights confidence in the new data (measured by $\sigma_n^2$) against all previous data (summarized by $M[n-1]$).

    (Linear Bayesian Estimators) Estimation Theory Spring 2013 135 / 152

  • Wiener Filtering

Consider the signal model : $x[n] = s[n] + w[n], \quad n = 0, 1, \ldots, N-1$

where the signal and noise are zero mean with known covariance matrices.

There are three main problems concerning Wiener filters :

Filtering : Estimate $\theta = s[n]$ (scalar) based on the data set $x = [x[0], x[1], \ldots, x[n]]^T$. The signal sample is estimated based on the present and past data only.

Smoothing : Estimate $\theta = s = [s[0], s[1], \ldots, s[N-1]]^T$ (vector) based on the data set $x = [x[0], x[1], \ldots, x[N-1]]^T$. The signal sample is estimated based on the present, past and future data.

Prediction : Estimate $\theta = x[N-1+\ell]$ for a positive integer $\ell$ based on the data set $x = [x[0], x[1], \ldots, x[N-1]]^T$.

Remark : All these problems are solved using the LMMSE estimator :

$\hat{\theta} = C_{\theta x} C_{xx}^{-1} x \quad \text{with minimum Bmse} \quad M_{\hat{\theta}} = C_{\theta\theta} - C_{\theta x} C_{xx}^{-1} C_{x\theta}$

    (Linear Bayesian Estimators) Estimation Theory Spring 2013 136 / 152

  • Wiener Filtering (Smoothing)

Estimate $\theta = s = [s[0], s[1], \ldots, s[N-1]]^T$ (vector) based on the data set $x = [x[0], x[1], \ldots, x[N-1]]^T$.

$C_{xx} = R_{xx} = R_{ss} + R_{ww} \qquad C_{\theta x} = E(s x^T) = E(s(s + w)^T) = R_{ss}$

Therefore :

$\hat{s} = C_{sx} C_{xx}^{-1} x = R_{ss}(R_{ss} + R_{ww})^{-1} x = W x$

Filter Interpretation : The Wiener smoothing matrix W can be interpreted as an FIR filter. Let us define :

$W = [a_0, a_1, \ldots, a_{N-1}]^T$

where $a_n^T$ is the $(n+1)$-th row of $W$. Let us also define $h_n = [h^{(n)}[0], h^{(n)}[1], \ldots, h^{(n)}[N-1]]^T$, which is just the vector $a_n$ flipped upside down. Then

$\hat{s}[n] = a_n^T x = \sum_{k=0}^{N-1} h^{(n)}[N-1-k]\, x[k]$

This represents a time-varying, non-causal FIR filter.

(Linear Bayesian Estimators) Estimation Theory Spring 2013 137 / 152

  • Wiener Filtering (Filtering)

Estimate $\theta = s[n]$ based on the data set $x = [x[0], x[1], \ldots, x[n]]^T$. We have again $C_{xx} = R_{xx} = R_{ss} + R_{ww}$ and

$C_{\theta x} = E\left(s[n]\,[x[0], x[1], \ldots, x[n]]\right) = [r_{ss}[n], r_{ss}[n-1], \ldots, r_{ss}[0]] = r_{ss}'^{\,T}$

Therefore :

$\hat{s}[n] = r_{ss}'^{\,T} (R_{ss} + R_{ww})^{-1} x = a^T x$

Filter Interpretation : If we define $h_n = [h^{(n)}[0], h^{(n)}[1], \ldots, h^{(n)}[n]]^T$, which is just the vector $a$ flipped upside down, then :

$\hat{s}[n] = a^T x = \sum_{k=0}^{n} h^{(n)}[n-k]\, x[k]$

This represents a time-varying, causal FIR filter. The impulse response can be computed from :

$(R_{ss} + R_{ww})\, a = r_{ss}' \quad \text{or, in terms of the flipped vector,} \quad (R_{ss} + R_{ww})\, h_n = [r_{ss}[0], r_{ss}[1], \ldots, r_{ss}[n]]^T$

    (Linear Bayesian Estimators) Estimation Theory Spring 2013 138 / 152

  • Wiener Filtering (Prediction)

Estimate $\theta = x[N-1+\ell]$ based on the data set $x = [x[0], x[1], \ldots, x[N-1]]^T$. We have again $C_{xx} = R_{xx}$ and

$C_{\theta x} = E\left(x[N-1+\ell]\,[x[0], x[1], \ldots, x[N-1]]\right) = [r_{xx}[N-1+\ell], r_{xx}[N-2+\ell], \ldots, r_{xx}[\ell]] = r_{xx}'^{\,T}$

Therefore :

$\hat{x}[N-1+\ell] = r_{xx}'^{\,T} R_{xx}^{-1} x = a^T x$

Filter Interpretation : If we again let $h[N-k] = a_k$, we have

$\hat{x}[N-1+\ell] = a^T x = \sum_{k=0}^{N-1} h[N-k]\, x[k]$

This represents a time-invariant, causal FIR filter. The impulse response can be computed from :

$R_{xx}\, h = r_{xx}'$

For $\ell = 1$, this is equivalent to the one-step linear prediction of an Auto-Regressive process.

(Linear Bayesian Estimators) Estimation Theory Spring 2013 139 / 152
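A sketch of one-step prediction ($\ell = 1$) for the AR(1) ACF used earlier (the parameters are assumed): solving $R_{xx} a = r_{xx}'$ puts essentially all the weight on the most recent sample, with coefficient close to $a$, which matches the AR recursion $\hat{x}[N] = a\, x[N-1]$.

```python
import numpy as np
from scipy.linalg import toeplitz

a_true, sigma_u2, N = 0.9, 1.0, 10                    # assumed AR(1) parameters and data length

# ACF of the AR(1) process and the prediction equations R_xx a = r'_xx for l = 1
rxx = sigma_u2 / (1 - a_true**2) * a_true ** np.arange(N + 1)
Rxx = toeplitz(rxx[:N])                               # N x N autocorrelation matrix
r_prime = rxx[N:0:-1]                                 # [rxx[N], rxx[N-1], ..., rxx[1]]

a_coef = np.linalg.solve(Rxx, r_prime)
print(np.round(a_coef, 6))                            # ~[0, ..., 0, a_true]: x_hat[N] = a x[N-1]
```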

  • Kalman Filters

    (Kalman Filters) Estimation Theory Spring 2013 140 / 152

  • Introduction

The Kalman filter is a generalization of the Wiener filter.

In the Wiener filter we estimate s[n] based on the noisy observation vector x[n]. We assume that s[n] is a stationary random process with known mean and covariance matrix.

In the Kalman filter we assume that s[n] is a non-stationary random process whose mean and covariance matrix vary according to a known dynamical model.

The Kalman filter is a sequential MMSE estimator of s[n]. If the signal and noise are jointly Gaussian, the Kalman filter is the optimal MMSE estimator; if not, it is the optimal LMMSE estimator.

The Kalman filter has many applications in control theory for state estimation.

It can be generalized to vector signals and noise (in contrast to the Wiener filter).

    (Kalman Filters) Estimation Theory Spring 2013 141 / 152

  • Dynamical Signal Models

Consider a first-order dynamical model :

$s[n] = a\, s[n-1] + u[n] \qquad n \geq 0$

where $u[n]$ is WGN with variance $\sigma_u^2$, $s[-1] \sim \mathcal{N}(\mu_s, \sigma_s^2)$ and $s[-1]$ is independent of $u[n]$ for all $n \geq 0$. It can be shown that :

$s[n] = a^{n+1} s[-1] + \sum_{k=0}^{n} a^k u[n-k] \qquad E(s[n]) = a^{n+1}\mu_s$

and

$C_s[m,n] = E\left[(s[m] - E(s[m]))(s[n] - E(s[n]))\right] = a^{m+n+2}\sigma_s^2 + \sigma_u^2\, a^{m-n} \sum_{k=0}^{n} a^{2k} \qquad \text{for } m \geq n$

and $C_s[m,n] = C_s[n,m]$ for $m < n$.

(Kalman Filters) Estimation Theory Spring 2013 142 / 152

  • Dynamical Signal Models

    Theorem (Vector Gauss-Markov Model)

The Gauss-Markov model for a $p \times 1$ vector signal $s[n]$ is :

$s[n] = A\, s[n-1] + B\, u[n] \qquad n \geq 0$

where $A$ is $p \times p$, $B$ is $p \times r$ and all eigenvalues of $A$ are less than 1 in magnitude. The $r \times 1$ vector $u[n] \sim \mathcal{N}(0, Q)$ and the initial condition is a $p \times 1$ random vector $s[-1] \sim \mathcal{N}(\mu_s, C_s)$, independent of the $u[n]$'s. Then, the signal process is Gaussian with mean $E(s[n]) = A^{n+1}\mu_s$ and covariance :

$C_s[m,n] = A^{m+1} C_s \left[A^{n+1}\right]^T + \sum_{k=m-n}^{m} A^k B Q B^T \left[A^{n-m+k}\right]^T \qquad \text{for } m \geq n$

and $C_s[m,n] = C_s^T[n,m]$ for $m < n$. The covariance matrix is $C[n] = C_s[n,n]$, and the mean and covariance propagation equations are :

$E(s[n]) = A\, E(s[n-1]) \qquad C[n] = A\, C[n-1]\, A^T + B Q B^T$

    (Kalman Filters) Estimation Theory Spring 2013 143 / 152

  • Scalar Kalman Filter

Data model :

State equation : $s[n] = a\, s[n-1] + u[n]$
Observation equation : $x[n] = s[n] + w[n]$

Assumptions :

$u[n]$ is zero-mean Gaussian noise with independent samples and $E(u^2[n]) = \sigma_u^2$.

$w[n]$ is zero-mean Gaussian noise with independent samples and $E(w^2[n]) = \sigma_n^2$ (time-varying variance).

$s[-1] \sim \mathcal{N}(\mu_s, \sigma_s^2)$ (for simplicity we suppose that $\mu_s = 0$).

$s[-1]$, $u[n]$ and $w[n]$ are independent.

Objective : Develop a sequential MMSE estimator to estimate $s[n]$ based on the data $X[n] = [x[0], x[1], \ldots, x[n]]^T$. This estimator is the mean of the posterior PDF :

$\hat{s}[n|n] = E\left(s[n]\,\big|\,x[0], x[1], \ldots, x[n]\right)$

(Kalman Filters) Estimation Theory Spring 2013 144 / 152

  • Scalar Kalman Filter

To develop the equations of the Kalman filter we need the following properties :

Property 1 : For two uncorrelated, jointly Gaussian data vectors, the MMSE estimator (if it is zero mean) is given by :

$\hat{\theta} = E(\theta|x_1, x_2) = E(\theta|x_1) + E(\theta|x_2)$

Property 2 : If $\theta = \theta_1 + \theta_2$, then the MMSE estimator of $\theta$ is :

$\hat{\theta} = E(\theta|x) = E(\theta_1 + \theta_2|x) = E(\theta_1|x) + E(\theta_2|x)$

Basic Idea : Generate the innovation $\tilde{x}[n] = x[n] - \hat{x}[n|n-1]$, which is uncorrelated with the previous samples $X[n-1]$. Then use $\tilde{x}[n]$ instead of $x[n]$ for estimation ($X[n]$ is equivalent to $[X[n-1], \tilde{x}[n]]$).

    (Kalman Filters) Estimation Theory Spring 2013 145 / 152

  • Scalar Kalman Filter

From Property 1, we have :

$\hat{s}[n|n] = E(s[n]\,|\,X[n-1], \tilde{x}[n]) = E(s[n]\,|\,X[n-1]) + E(s[n]\,|\,\tilde{x}[n]) = \hat{s}[n|n-1] + E(s[n]\,|\,\tilde{x}[n])$

From Property 2, we have :

$\hat{s}[n|n-1] = E(a s[n-1] + u[n]\,|\,X[n-1]) = a\,E(s[n-1]\,|\,X[n-1]) + E(u[n]\,|\,X[n-1]) = a\,\hat{s}[n-1|n-1]$

The MMSE estimator of $s[n]$ based on $\tilde{x}[n]$ is :

$E(s[n]\,|\,\tilde{x}[n]) = \frac{E(s[n]\tilde{x}[n])}{E(\tilde{x}^2[n])}\,\tilde{x}[n] = K[n]\,(x[n] - \hat{x}[n|n-1])$

where $\hat{x}[n|n-1] = \hat{s}[n|n-1] + \hat{w}[n|n-1] = \hat{s}[n|n-1]$.

Finally we have :

$\hat{s}[n|n] = a\,\hat{s}[n-1|n-1] + K[n]\,(x[n] - \hat{s}[n|n-1])$

(Kalman Filters) Estimation Theory Spring 2013 146 / 152

  • Scalar State Scalar Observation Kalman Filter

State Model : $s[n] = a\, s[n-1] + u[n]$
Observation Model : $x[n] = s[n] + w[n]$

Prediction :

$\hat{s}[n|n-1] = a\,\hat{s}[n-1|n-1]$

Minimum Prediction MSE :

$M[n|n-1] = a^2 M[n-1|n-1] + \sigma_u^2$

Kalman Gain :

$K[n] = \frac{M[n|n-1]}{\sigma_n^2 + M[n|n-1]}$

Correction :

$\hat{s}[n|n] = \hat{s}[n|n-1] + K[n]\,(x[n] - \hat{s}[n|n-1])$

Minimum MSE :

$M[n|n] = (1 - K[n])\, M[n|n-1]$

(Kalman Filters) Estimation Theory Spring 2013 147 / 152
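A sketch implementing exactly these recursions for the scalar state / scalar observation case (the numerical values of $a$, $\sigma_u^2$, $\sigma_n^2$ and the initialization are assumed) :

```python
import numpy as np

rng = np.random.default_rng(7)
a, sigma_u2, sigma_n2, n_steps = 0.95, 0.1, 1.0, 300  # assumed model parameters

# Simulate the state and the noisy observations
s = np.zeros(n_steps)
for n in range(1, n_steps):
    s[n] = a * s[n - 1] + np.sqrt(sigma_u2) * rng.standard_normal()
x = s + np.sqrt(sigma_n2) * rng.standard_normal(n_steps)

# Scalar Kalman filter: prediction, gain, correction, MSE update
s_hat, M = 0.0, 1.0                                   # assumed initialization (mu_s = 0, M[-1|-1] = sigma_s^2)
s_filt = np.zeros(n_steps)
for n in range(n_steps):
    s_pred = a * s_hat                                # s_hat[n|n-1]
    M_pred = a**2 * M + sigma_u2                      # M[n|n-1]
    K = M_pred / (sigma_n2 + M_pred)                  # Kalman gain
    s_hat = s_pred + K * (x[n] - s_pred)              # s_hat[n|n]
    M = (1 - K) * M_pred                              # M[n|n]
    s_filt[n] = s_hat

print(np.mean((s_filt - s) ** 2), np.mean((x - s) ** 2))   # filtered MSE vs raw observation MSE
```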

  • Properties of Kalman Filter

The Kalman filter is an extension of the sequential MMSE estimator, applied to time-varying parameters that are represented by a dynamic model.

No matrix inversion is required.

The Kalman filter is a time-varying linear filter.

The Kalman filter computes (and uses) its own performance measure, the Bayesian MSE M[n|n].

The prediction stage increases the error, while the correction stage decreases it.

The Kalman filter generates an uncorrelated sequence from the observations, so it can be viewed as a whitening filter.

The Kalman filter is optimal in the Gaussian case and is the optimal LMMSE estimator if the Gaussian assumption is not valid.

In the KF, M[n|n], M[n|n-1] and K[n] do not depend on the measurements and can be computed off line. At steady state ($n \to \infty$) the Kalman filter becomes an LTI filter (M[n|n], M[n|n-1] and K[n] become constant).

    (Kalman Filters) Estimation Theory Spring 2013 148 / 152

  • Vector State Scalar Ob