Alternatives to Least-Squares – Adaptive Control Lecture Notes, © Guy A. Dumont, 1997-2005 (courses.ece.ubc.ca/574/ident2.pdf)

  • Alternatives to Least-Squares

    • Need a method that gives consistent estimates in the presence of coloured noise

    • Generalized Least-Squares

    • Instrumental Variable Method

    • Maximum Likelihood Identification


  • Generalized Least Squares

    • Noise model

        w(t) = e(t)/C(q−1)

    where {e(t)} ∼ N(0, σ) and C(q−1) is monic and of degree n.

    • System described as

        A(q−1)y(t) = B(q−1)u(t) + e(t)/C(q−1)


  • Generalized Least Squares

    • Defining filtered sequences

    ỹ(t) = C(q−1)y(t)

    ũ(t) = C(q−1)u(t)

    • System becomes

    A(q−1)ỹ(t) = B(q−1)ũ(t) + e(t)


  • Generalized Least Squares

    • If C(q−1) is known, then least-squares gives consistent estimates of A and B, given ỹ and ũ

    • The problem, however, is that in practice C(q−1) is not known and {ỹ} and {ũ} cannot be obtained.

    • An iterative method proposed by Clarke (1967) solves that problem (a code sketch follows the algorithm on the next slide)


  • Generalized Least Squares

    1. Set Ĉ(q−1) = 1

    2. Compute {ỹ(t)} and {ũ(t)} for t = 1, . . . , N

    3. Use the least-squares method to estimate Â and B̂ from ỹ and ũ

    4. Compute the residuals

        ŵ(t) = Â(q−1)y(t)− B̂(q−1)u(t)

    5. Use the least-squares method to estimate Ĉ from

        Ĉ(q−1)ŵ(t) = ε(t)

    i.e.

        ŵ(t) = −c1ŵ(t− 1)− c2ŵ(t− 2)− · · ·+ ε(t)

    where {ε(t)} is white

    6. If converged, then stop, otherwise repeat from step 2.

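    A minimal MATLAB sketch of the iterative procedure above, written in the same style as the RLS code given later in these notes. The data vectors y and u (columns of length N), the common order n of A and B, the order nc of C, the fixed iteration count and the inline regressor construction are assumptions of this sketch.

    function [Ahat, Bhat, Chat] = gls(y, u, n, nc, niter)
    % Iterative generalized least-squares (Clarke, 1967) - sketch only.
    % y, u  : column vectors of output and input data
    % n     : order of A and B;  nc : order of C;  niter : number of iterations
    N    = length(y);
    Chat = 1;                            % step 1: start with C_hat = 1
    for it = 1:niter
        yf = filter(Chat, 1, y);         % step 2: y_tilde = C_hat(q^-1) y
        uf = filter(Chat, 1, u);         %         u_tilde = C_hat(q^-1) u
        X = zeros(N-n, 2*n);             % step 3: LS on A y_tilde = B u_tilde + e
        for k = 1:n
            X(:, k)   = -yf(n+1-k : N-k);
            X(:, n+k) =  uf(n+1-k : N-k);
        end
        th   = X \ yf(n+1:N);
        Ahat = [1, th(1:n)'];            % A_hat = 1 + a1 q^-1 + ... + an q^-n
        Bhat = [0, th(n+1:2*n)'];        % B_hat = b1 q^-1 + ... + bn q^-n
        w = filter(Ahat, 1, y) - filter(Bhat, 1, u);   % step 4: residuals w_hat
        Xc = zeros(N-nc, nc);            % step 5: LS for C_hat from C_hat w = eps
        for k = 1:nc
            Xc(:, k) = -w(nc+1-k : N-k);
        end
        Chat = [1, (Xc \ w(nc+1:N))'];   % C_hat = 1 + c1 q^-1 + ... + cnc q^-nc
    end

    In a full implementation, the fixed iteration count would be replaced by a convergence test on the loss or on the whiteness of ŵ (next slide).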

  • Generalized Least Squares

    • For convergence, the loss function and/or the whiteness of the residuals can be tested (a simple whiteness check is sketched below).

    • An advantage of GLS is that it may not only give consistent estimates of the deterministic part of the system, but it also gives a representation of the noise that the LS method does not provide.

    • The consistency of GLS depends on the signal-to-noise ratio, the probability of consistent estimation increasing with the S/N.

    • There is, however, no guarantee of obtaining consistent estimates.

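    As a complement, the whiteness of the residuals mentioned above can be checked with their sample autocorrelation. A small sketch, assuming a residual vector w and Gaussian confidence bands (both choices are this sketch's own):

    function iswhite = whitenesstest(w, maxlag)
    % Simple whiteness check: the normalized autocorrelation of w at lags
    % 1..maxlag should stay inside the +/- 1.96/sqrt(N) band if w is white.
    w   = w(:) - mean(w);
    N   = length(w);
    r0  = sum(w.^2)/N;
    rho = zeros(maxlag, 1);
    for k = 1:maxlag
        rho(k) = sum(w(1+k:N).*w(1:N-k))/(N*r0);   % autocorrelation at lag k
    end
    iswhite = all(abs(rho) < 1.96/sqrt(N));        % true if no lag is significant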

  • Instrumental Variable Method (IV) 4

    • As seen previously, the LS estimate

    θ̂ = [XTX]−1XTY

    is unbiased if W is independent of X.

    • Assume that a matrix V is available, which is correlated with X but not with W and such that V TX is positive definite, i.e.

    E[V TX] is nonsingular

    E[V TW ] = 0

    4 T. Söderström and P.G. Stoica, Instrumental Variable Methods for System Identification, Springer-Verlag, Berlin, 1983


  • Instrumental Variable Method (IV)

    • Then

        V TY = V TXθ + V TW

    and θ is estimated by

    θ̂ = [V TX]−1V TY

    V is called the instrumental variable matrix.


  • Instrumental Variable Method (IV)

    • Ideally V is the noise-free process output and the IV estimate θ̂ is consistent.

    • There are many possible ways to construct the instrumental variable. For instance, it may be built using an initial least-squares estimate (a sketch of this construction follows below):

    Â1ŷ(t) = B̂1u(t)

    and the kth row of V is given by

    vTk = [−ŷ(k − 1), . . . ,−ŷ(k − n), u(k − 1), . . . , u(k − n)]

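    A sketch of the basic off-line IV estimate with the instrument construction just described: fit an initial least-squares model, simulate its noise-free output ŷ, and build V by replacing y with ŷ in the regressor. The data setup (column vectors y and u, order n) is an assumption of this sketch.

    function thetaIV = ivestimate(y, u, n)
    % Off-line instrumental-variable estimate, instruments from an initial LS model.
    N = length(y);
    X = zeros(N-n, 2*n);                      % regressor matrix from measured data
    for k = 1:n
        X(:, k)   = -y(n+1-k : N-k);
        X(:, n+k) =  u(n+1-k : N-k);
    end
    Y  = y(n+1:N);
    th = X \ Y;                               % initial least-squares estimate
    A1 = [1, th(1:n)'];                       % A1_hat(q^-1)
    B1 = [0, th(n+1:2*n)'];                   % B1_hat(q^-1)
    yhat = filter(B1, A1, u);                 % noise-free model output: A1 yhat = B1 u
    V = zeros(N-n, 2*n);                      % instrument matrix: y replaced by yhat
    for k = 1:n
        V(:, k)   = -yhat(n+1-k : N-k);
        V(:, n+k) =  u(n+1-k : N-k);
    end
    thetaIV = (V'*X) \ (V'*Y);                % theta_hat = [V'X]^(-1) V'Y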

  • Instrumental Variable Method (IV)

    • Consistent estimation cannot be guaranteed in general.

    • A two-tier method in its off-line version, the IV method is more useful in its recursive form.

    • Use of instrumental variable in closed-loop.

    – Often the instrumental variable is constructed from the input sequence. This cannot be done in closed-loop, as the input is formed from old outputs, and hence is correlated with the noise w, unless w is white. In that situation, the following choices are available.


  • Instrumental Variable Method (IV)

    • Instrumental variables in closed-loop

    – Delayed inputs and outputs. If the noise w is assumed to be a moving average of order n, then choosing v(t) = x(t− d) with d > n gives an instrument uncorrelated with the noise. This, however, will only work with a time-varying regulator.

    – Reference signals. Building the instruments from the setpoint will satisfy the noise independence condition. However, the setpoint must be a sufficiently rich signal for the estimates to converge.

    – External signal. This in effect relates to the closed-loop identifiability condition (covered a bit later...). A typical external signal is a white noise independent of w(t).


  • Maximum-likelihood identification

    The maximum-likelihood method considers the ARMAX model below, where u is the input, y the output and e is zero-mean white noise with standard deviation σ:

    A(q−1)y(t) = B(q−1)u(t− k) + C(q−1)e(t)

    where

    A(q−1) = 1 + a1q−1 + · · ·+ anq−n

    B(q−1) = b1q−1 + · · ·+ bnq−n

    C(q−1) = 1 + c1q−1 + · · ·+ cnq−n

    The parameters of A, B, C as well as σ are unknown.


  • Maximum-likelihood identification

    Defining

    θT = [ a1 · · · an b1 · · · bn c1 · · · cn ]

    xT = [ −y(t− 1) · · · u(t− k) · · · e(t− 1) · · · ]

    the ARMAX model can be written as

    y(t) = xT (t)θ + e(t)

    Unfortunately, one cannot use the least-squares method on this model since the sequence e(t) is unknown.

    In the case of known parameters, the past values of e(t) can be reconstructed exactly from:

    ε(t) = [A(q−1)y(t)−B(q−1)u(t− k)]/C(q−1)


  • Maximum-likelihood identification

    Defining the performance index below, the maximum-likelihood method is then summarized by the following steps.

    V = (1/2) ∑ ε2(t)   (sum over t = 1, . . . , N)

    • Minimize V with respect to θ̂, using for instance a Newton-Raphson algorithm. Note that ε is linear in the parameters of A and B but not in those of C. We then have to use some iterative procedure

    θ̂i+1 = θ̂i − αi(V ′′(θ̂i))−1V ′(θ̂i)

    • Initial estimate θ̂0 is usually obtained from a least-squares estimate

    • Estimate the noise variance as

    σ̂2 = (2/N)V (θ̂)

    A sketch of the overall procedure is given below.

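    As a complement to the steps above, a sketch of off-line ML (prediction-error) estimation of the ARMAX model. Instead of the Newton-Raphson iteration it hands the loss V(θ) to MATLAB's general-purpose fminsearch; that substitution, the function names and the argument layout are this sketch's own.

    function [thetahat, sigmahat] = mlarmax(y, u, n, k, theta0)
    % Off-line ML / prediction-error estimate of the ARMAX model
    %   A(q^-1)y(t) = B(q^-1)u(t-k) + C(q^-1)e(t),  deg A = deg B = deg C = n.
    % theta0 = [a1..an b1..bn c1..cn]' is an initial guess (e.g. from LS).
    N        = length(y);
    loss     = @(th) armaxloss(th, y, u, n, k);
    thetahat = fminsearch(loss, theta0);          % minimize V(theta)
    sigmahat = sqrt(2*loss(thetahat)/N);          % sigma_hat^2 = (2/N) V(theta_hat)
    end

    function V = armaxloss(th, y, u, n, k)
    % V = 1/2 sum eps(t)^2  with  eps = [A y - B u(t-k)]/C
    A    = [1, th(1:n)'];
    B    = [zeros(1, k+1), th(n+1:2*n)'];         % q^-k B(q^-1)
    C    = [1, th(2*n+1:3*n)'];
    epsv = filter(A, C, y) - filter(B, C, u);     % eps = (A/C)y - (q^-k B/C)u
    V    = 0.5*sum(epsv.^2);
    end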

  • Properties of the Maximum-likelihood Estimate (MLE)

    • If the model order is sufficient, the MLE is consistent, i.e. θ̂ → θ as N →∞.

    • The MLE is asymptotically normal with mean θ and standard deviation σθ.

    • The MLE is asymptotically efficient, i.e. there is no other unbiased estimator giving a smaller σθ.


  • Properties of the Maximum-likelihood Estimate (MLE)

    • The Cramer-Rao inequality says that there is a lower limit on the precision of an unbiased estimate, given by

    cov θ̂ ≥ Mθ−1

    where Mθ is the Fisher Information Matrix, Mθ = −E[(log L)θθ]. For the MLE

    σθ2 = Mθ−1

    i.e.

    σθ2 = σ2 Vθθ−1

    If σ is estimated, then

    σθ2 = (2V/N) Vθθ−1


  • Identification In Practice

    1. Specify a model structure

    2. Compute the best model in this structure

    3. Evaluate the properties of this model

    4. Test a new structure, go to step 1

    5. Stop when satisfactory model is obtained



  • MATLAB System Identification Toolbox

    • Most used package

    • Graphical User Interface

    • Automates all the steps

    • Easy to use

    • Familiarize yourself with it by running the examples (a few command-line calls are sketched below)

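    For orientation, a minimal command-line session with the toolbox looks roughly as follows; iddata, arx, armax, compare and present are standard toolbox commands, while the data vectors, sampling period and model orders are placeholders of this sketch.

    % y, u: column vectors of measured output and input, Ts: sampling period
    data   = iddata(y, u, Ts);          % package the experiment data
    marx   = arx(data, [2 2 1]);        % ARX model,   na = nb = 2, delay nk = 1
    marmax = armax(data, [2 2 2 1]);    % ARMAX model, na = nb = nc = 2, nk = 1
    compare(data, marx, marmax)         % compare simulated outputs with the data
    present(marmax)                     % display the estimated parameters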

  • RECURSIVE IDENTIFICATION

    • There are many situations when it is preferable to perform the identification on-line, such as in adaptive control.

    • Identification methods need to be implemented in a recursive fashion, i.e. the parameter estimate at time t should be computed as a function of the estimate at time t− 1 and of the incoming information at time t.

    • Recursive least-squares

    • Recursive instrumental variables

    • Recursive extended least-squares and recursive maximum likelihood


  • Recursive Least-Squares (RLS)

    We have seen that, with t observations available, the least-squares estimate is

    θ̂(t) = [XT (t)X(t)]−1XT (t)Y (t)

    with

    Y T (t) = [ y(1) · · · y(t) ]

    X(t) = [ xT (1) ; · · · ; xT (t) ]   (the rows xT (i) stacked)

    Assume one additional observation becomes available; the problem is then to find θ̂(t + 1) as a function of θ̂(t) and of y(t + 1) and u(t + 1).


  • Recursive Least-Squares (RLS)

    Defining X(t + 1) and Y (t + 1) as

    X(t + 1) = [ X(t) ; xT (t + 1) ]      Y (t + 1) = [ Y (t) ; y(t + 1) ]

    and defining P (t) and P (t + 1) as

    P (t) = [XT (t)X(t)]−1      P (t + 1) = [XT (t + 1)X(t + 1)]−1

    one can write

    P (t + 1) = [XT (t)X(t) + x(t + 1)xT (t + 1)]−1

    θ̂(t + 1) = P (t + 1)[XT (t)Y (t) + x(t + 1)y(t + 1)]


  • Matrix Inversion Lemma

    Let A, D and [D−1 + CA−1B] be nonsingular square matrices. Then A + BDC is invertible and

    (A + BDC)−1 = A−1 −A−1B(D−1 + CA−1B)−1CA−1

    Proof: The simplest way to prove it is by direct multiplication (a numerical spot-check is sketched after the proof)

    (A + BDC)(A−1 − A−1B(D−1 + CA−1B)−1CA−1)

    = I + BDCA−1 − B(D−1 + CA−1B)−1CA−1

    −BDCA−1B(D−1 + CA−1B)−1CA−1

    = I + BDCA−1 − BD(D−1 + CA−1B)(D−1 + CA−1B)−1CA−1

    = I

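    The identity is also easy to check numerically; a throwaway sketch with random matrices (dimensions chosen arbitrarily):

    % Numerical spot-check of (A+BDC)^-1 = A^-1 - A^-1 B (D^-1 + C A^-1 B)^-1 C A^-1
    rng(0);                                % reproducible random matrices
    n = 4; m = 2;
    A = eye(n) + rand(n);   B = rand(n, m);
    D = eye(m) + rand(m);   C = rand(m, n);
    lhs = inv(A + B*D*C);
    rhs = inv(A) - inv(A)*B*inv(inv(D) + C*inv(A)*B)*C*inv(A);
    disp(norm(lhs - rhs))                  % should be of the order of 1e-15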

  • Matrix Inversion Lemma

    An alternative form, useful for deriving recursive least-squares, is obtained when B and C are n× 1 and 1× n (i.e. column and row vectors):

    (A + BC)−1 = A−1 − A−1BCA−1/(1 + CA−1B)

    Now, consider

    P (t + 1) = [XT (t)X(t) + x(t + 1)xT (t + 1)]−1

    and use the matrix-inversion lemma with

    A = XT (t)X(t) B = x(t + 1) C = xT (t + 1)


  • Recursive Least-Squares (RLS)

    Some simple matrix manipulations then give the recursive least-squares algorithm:

    θ̂(t + 1) = θ̂(t) + K(t + 1)[y(t + 1)− xT (t + 1)θ̂(t)]

    K(t + 1) = P (t)x(t + 1)/[1 + xT (t + 1)P (t)x(t + 1)]

    P (t + 1) = P (t)− P (t)x(t + 1)xT (t + 1)P (t)/[1 + xT (t + 1)P (t)x(t + 1)]

    Note that K(t + 1) can also be expressed as

    K(t + 1) = P (t + 1)x(t + 1)


  • Recursive Least-Squares (RLS)

    • The recursive least-squares algorithm is the exact mathematical equivalent of the batch least-squares.

    • Once initialized, no matrix inversion is needed

    • Matrices stay the same size all the time

    • Computationally very efficient

    • P is proportional to the covariance matrix of the estimate, and is thus called the covariance matrix.

    • The algorithm has to be initialized with θ̂(0) and P (0). Generally, P (0) is initialized as αI where I is the identity matrix and α is a large positive number. The larger α, the less confidence is put in the initial estimate θ̂(0).


  • RLS and Kalman Filter

    There are some very strong connections between the recursive least-squares algorithm and the Kalman filter. Indeed, the RLS algorithm has the structure of a Kalman filter:

    θ̂(t + 1) = θ̂(t) + K(t + 1)[y(t + 1)− xT (t + 1)θ̂(t)]

    (new estimate = old estimate + correction)

    where K(t + 1) is the Kalman gain.


  • RLS

    The following Matlab code is a straightforward implementation of the RLS algorithm (a usage sketch follows the listing):

    function [thetaest,P]=rls(y,x,thetaest,P)

    % RLS

    % y,x: current measurement and regressor

    % thetaest, P: parameter estimates and covariance matrix

    K= P*x/(1+x’*P*x); % Gain

    P= P- (P*x*x’*P)/(1+x’*P*x); % Covariance matrix update

    thetaest= thetaest +K*(y-x’*thetaest); %Parameter estimate update

    % end

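    A usage sketch for the function above, driving it with data from a simulated second-order ARX system; the particular system, noise level and initialization are made up for illustration.

    % Simulate  y(t) = 1.5 y(t-1) - 0.7 y(t-2) + u(t-1) + 0.5 u(t-2) + e(t)
    N = 500;
    u = randn(N, 1);  e = 0.1*randn(N, 1);  y = zeros(N, 1);
    for t = 3:N
        y(t) = 1.5*y(t-1) - 0.7*y(t-2) + u(t-1) + 0.5*u(t-2) + e(t);
    end
    thetaest = zeros(4, 1);               % theta = [a1 a2 b1 b2]'
    P = 1000*eye(4);                      % large P(0): little confidence in theta(0)
    for t = 3:N
        x = [-y(t-1); -y(t-2); u(t-1); u(t-2)];
        [thetaest, P] = rls(y(t), x, thetaest, P);
    end
    disp(thetaest')                       % should approach [-1.5 0.7 1.0 0.5]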

  • Recursive Extended Least-Squares and Recursive Maximum-Likelihood

    Because the prediction error is not linear in the C-parameters, it is not possible to obtain an exact recursive maximum likelihood method as for the least-squares method.

    The ARMAX model

    A(q−1)y(t) = B(q−1)u(t) + C(q−1)e(t)

    can be written as

    y(t) = xT(t)θ + e(t)

    with

    θ = [a1, . . . , an, b1, . . . , bn, c1, . . . , cn]T

    xT(t) = [−y(t− 1), . . . ,−y(t− n), u(t− 1), . . . , u(t− n), e(t− 1), . . . , e(t− n)]


  • Recursive Extended Least-Squares and Approximate Maximum-Likelihood

    • If e(t) were known, RLS could be used to estimate θ; however, it is unknown and thus has to be estimated.

    • This can be done in two ways, using either the prediction error or the residual.

    • The first case corresponds to the RELS method, the second to the AML method.


  • Recursive Extended Least-Squares and Approximate Maximum-Likelihood

    • The one–step ahead prediction error is defined as

    ε(t) = y(t)− ŷ(t | t− 1)

    = y(t)− xT (t)θ̂(t− 1)

    x(t) = [−y(t− 1), . . . , u(t− 1), . . . , ε(t− 1), . . . , ε(t− n)]T

    • The residual is defined as

    η(t) = y(t)− ŷ(t | t)

    = y(t)− xT (t)θ̂(t)

    x(t) = [−y(t− 1), . . . , u(t− 1), . . . , η(t− 1), . . . , η(t− n)]T


  • Recursive Extended Least-Squares and Approximate Maximum-Likelihood

    • Sometimes ε(t) and η(t) are also referred to as a-priori and a-posteriori prediction errors.

    • Because it uses the latest estimate θ̂(t), as opposed to θ̂(t− 1) for ε(t), η(t) is a better estimate, especially in transient behaviour.

    • Note however that if θ̂(t) converges as t −→∞ then η(t) −→ ε(t).


  • Recursive Extended Least-Squares and Approximate Maximum-Likelihood

    The two schemes are then described by

    θ̂(t + 1) = θ̂(t) + K(t + 1)[y(t + 1)− xT (t + 1)θ̂(t)]

    K(t + 1) = P (t)x(t + 1)/[1 + xT(t + 1)P (t)x(t + 1)]

    P (t + 1) = P (t)− P (t)x(t + 1)xT (t + 1)P (t)/[1 + xT (t + 1)P (t)x(t + 1)]

    but differ in their definition of x(t). A sketch of both variants is given below.

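    A sketch of one recursion step covering both variants; the only difference is whether the noise entries of the regressor are refreshed with the a-priori error ε (RELS) or the a-posteriori residual η (AML). The regressor bookkeeping and the function interface are this sketch's own.

    function [theta, P, x] = relsaml(ynew, unew, theta, P, x, n, variant)
    % One RELS / AML update for an ARMAX model of order n.
    % x holds the regressor [-y(t-1)..-y(t-n), u(t-1)..u(t-n), v(t-1)..v(t-n)]'
    % where v is the prediction error (RELS) or the residual (AML).
    epsil = ynew - x'*theta;                   % a-priori prediction error
    K     = P*x/(1 + x'*P*x);                  % gain
    P     = P - (P*x*x'*P)/(1 + x'*P*x);       % covariance update
    theta = theta + K*epsil;                   % parameter update
    eta   = ynew - x'*theta;                   % a-posteriori residual
    if strcmp(variant, 'RELS'), v = epsil; else, v = eta; end
    % shift the regressor for the next step (newest values enter each block)
    x = [-ynew; x(1:n-1); unew; x(n+1:2*n-1); v; x(2*n+1:3*n-1)];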

  • Recursive Extended Least-Squares and Approximate Maximum-Likelihood

    • The RELS algorithm uses the prediction error. This algorithm is called RELS, Extended Matrix or RML1 in the literature. It has generally good convergence properties, and has been proved consistent for moving-average and first-order autoregressive processes. However, counterexamples to general convergence exist, see for example Ljung (1975).

    • The AML algorithm uses the residual error. The AML has better convergence properties than the RML, and indeed convergence can be proven under rather unrestrictive conditions.


  • Recursive Maximum-Likelihood

    • Yet another approach.

    • The ML can also be interpreted in terms of data filtering. Consider the performance index:

    V (t) = (1/2) ∑ ε2(i)   (sum over i = 1, . . . , t)

    with ε(t) = y(t)− xT (t)θ̂(t− 1)


  • Recursive Maximum-Likelihood

    • Because V is a nonlinear function of C, it has to be approximated by a Taylor series truncated after the second term. The resulting scheme is then:

    θ̂(t + 1) = θ̂(t) + K(t + 1)[y(t + 1)− xT (t + 1)θ̂(t)]

    K(t + 1) = P (t)xf(t + 1)/[1 + xTf (t + 1)P (t)xf(t + 1)]

    P (t + 1) = P (t)− P (t)xf(t + 1)xTf (t + 1)P (t)/[1 + xTf (t + 1)P (t)xf(t + 1)]

    where xf denotes the regressor filtered through 1/Ĉ(q−1).


  • Properties of AML and RML

    Definition 1. A discrete transfer function is said to be strictly positive real if it is stable and

    Re H(ejw) > 0   for all w, − π < w ≤ π

    on the unit circle.

    This condition can be checked by replacing z by (1 + jw)/(1− jw) and extracting the real part of the resulting expression.

    For the convergence of AML, the following theorem is then available.


  • Properties of AML and RML

    Theorem. [Ljung & Söderström, 1983] Assume both process and model are described by ARMAX models, with model order ≥ process order; then if

    1. {u(t)} is sufficiently rich

    2. 1/C(q−1) − 1/2 is strictly positive real (a numerical check of this condition is sketched after this slide),

    then θ̂(t) will converge such that

    E[ε(t, θ̂)− e(t)]2 = 0

    If model and process have the same order, this implies

    θ̂(t) −→ θ as t −→∞

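    Condition 2 can be verified numerically by evaluating 1/C(ejw) − 1/2 on a dense frequency grid, rather than through the bilinear substitution mentioned in the definition; the example polynomial below is this sketch's own.

    % Numerical check of Re{ 1/C(e^{jw}) - 1/2 } > 0 on -pi < w <= pi.
    C = [1 0.5];                            % example: C(q^-1) = 1 + 0.5 q^-1
    w = linspace(-pi + 1e-3, pi, 2000);     % frequency grid on the unit circle
    Cw = zeros(size(w));
    for k = 0:length(C)-1
        Cw = Cw + C(k+1)*exp(-1j*k*w);      % C evaluated with q^-1 = e^{-jw}
    end
    H = 1./Cw - 0.5;
    isSPR = all(abs(roots(C)) < 1) && all(real(H) > 0);   % stability + positivity
    disp(isSPR)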

  • A Unified Algorithm

    Looking at all the previous algorithms, it is obvious that they all have the same form, with only different parameters. They can all be represented by a recursive prediction-error method (RPEM).

    θ̂(t + 1) = θ̂(t) + K(t + 1)ε(t + 1)

    K(t + 1) = P (t)z(t + 1)/[1 + xT (t + 1)P (t)z(t + 1)]

    P (t + 1) = P (t)− P (t)z(t + 1)xT (t + 1)P (t)/[1 + xT (t + 1)P (t)z(t + 1)]


  • Tracking Time-Varying Parameters

    All previous methods use the least-squares criterion

    V (t) = (1/t) ∑ [y(i)− xT (i)θ̂]2   (sum over i = 1, . . . , t)

    and thus identify the average behaviour of the process. When the parameters are time-varying, it is desirable to base the identification on the most recent data rather than on old data that is no longer representative of the process. This can be achieved by exponential discounting of old data, using the criterion

    V (t) = (1/t) ∑ λt−i [y(i)− xT (i)θ̂]2   (sum over i = 1, . . . , t)

    where 0 < λ ≤ 1 is called the forgetting factor.


  • Tracking Time-Varying Parameters

    The new criterion can also be written

    V (t) = λV (t− 1) + [y(t)− xT (t)θ̂]2

    Then, it can be shown (Goodwin and Payne, 1977) that the RLS scheme becomes

    θ̂(t + 1) = θ̂(t) + K(t + 1)[y(t + 1)− xT (t + 1)θ̂(t)]

    K(t + 1) = P (t)x(t + 1)/[λ + xT(t + 1)P (t)x(t + 1)]

    P (t + 1) = (1/λ) ( P (t)− P (t)x(t + 1)xT (t + 1)P (t)/[λ + xT (t + 1)P (t)x(t + 1)] )

    In choosing λ, one has to compromise between fast tracking and long-term quality of the estimates. The use of the forgetting factor may give rise to problems (a sketch of RLS with forgetting follows this slide).

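    A sketch of the update with exponential forgetting, written as a drop-in variant of the rls function shown earlier; the function name and the fixed λ argument are this sketch's choices.

    function [thetaest, P] = rlsff(y, x, thetaest, P, lambda)
    % RLS with exponential forgetting factor, 0 < lambda <= 1.
    % y, x: current measurement and regressor
    % thetaest, P: parameter estimate and covariance matrix
    den      = lambda + x'*P*x;
    K        = P*x/den;                            % gain
    P        = (P - (P*x*x'*P)/den)/lambda;        % covariance update with forgetting
    thetaest = thetaest + K*(y - x'*thetaest);     % parameter estimate update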

  • Tracking Time-Varying Parameters

    The smaller λ is, the faster the algorithm can track, but the more the estimates will vary, even if the true parameters are time-invariant. A small λ may also cause blowup of the covariance matrix P, since in the absence of excitation the covariance matrix update equation essentially becomes

    P (t + 1) = (1/λ)P (t)

    in which case P grows exponentially, leading to wild fluctuations in the parameter estimates. One way around this is to vary the forgetting factor according to the prediction error ε as in

    λ(t) = 1− kε2(t)

    Then, in case of low excitation, ε will be small and λ will be close to 1. In case of large prediction errors, λ will decrease.


  • Exponential Forgetting and Resetting Algorithm

    The following scheme, due to Salgado, Goodwin and Middleton5, is recommended (a sketch of the update follows this slide):

    ε(t + 1) = y(t + 1)− xT (t + 1)θ̂(t)

    θ̂(t + 1) = θ̂(t) + αP (t)x(t + 1)ε(t + 1)/[λ + xT (t + 1)P (t)x(t + 1)]

    P (t + 1) = (1/λ) ( P (t)− P (t)x(t + 1)xT (t + 1)P (t)/[λ + xT (t + 1)P (t)x(t + 1)] ) + βI − γP (t)2

    where I is the identity matrix, and α, β and γ are constants.

    5 M.E. Salgado, G.C. Goodwin, and R.H. Middleton, “Exponential Forgetting and Resetting”, International Journal of Control, vol. 47, no. 2, pp. 477–485, 1988.

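    A sketch of one EFRA update following the equations above, with the constants quoted on the next slide (α = 0.5, β = γ = 0.005, λ = 0.95) used as defaults; the function interface is this sketch's own.

    function [theta, P] = efra(y, x, theta, P, lambda, alpha, beta, gamma)
    % One step of the exponential forgetting and resetting algorithm (EFRA).
    if nargin < 5, lambda = 0.95; alpha = 0.5; beta = 0.005; gamma = 0.005; end
    epsil = y - x'*theta;                          % prediction error
    den   = lambda + x'*P*x;
    theta = theta + alpha*P*x*epsil/den;           % parameter update
    P     = (P - (P*x*x'*P)/den)/lambda ...
            + beta*eye(length(theta)) - gamma*P^2; % bounded covariance update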

  • Exponential Forgetting and Resetting Algorithm

    With the EFRA, the covariance matrix is bounded on both sides:

    σminI ≤ P (t) ≤ σmaxI ∀t

    where

    σmin ≈ β/(α− η)      σmax ≈ η/γ + β/η

    with

    η = (1− λ)/λ

    With α = 0.5, β = γ = 0.005 and λ = 0.95, σmin = 0.01 and σmax = 10.
