# Alternatives to Least-Squares - UBCcourses.ece.ubc.ca/574/ident2.pdfAlternatives to Least-Squares...

date post

16-Mar-2020Category

## Documents

view

6download

0

Embed Size (px)

### Transcript of Alternatives to Least-Squares - UBCcourses.ece.ubc.ca/574/ident2.pdfAlternatives to Least-Squares...

Alternatives to Least-Squares

• Need a method that gives consistent estimates in presence of colourednoise

• Generalized Least-Squares

• Instrumental Variable Method

• Maximum Likelihood Identification

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 61

Generalized Least Squares

• Noise modelw(t) =

e(t)C(q−1)

where {e(t)} = N(0, σ) and C(q−1) is monic and of degree n.

• System described as

A(q−1)y(t) = B(q−1)u(t) +1

C(q−1)e(t)

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 62

Generalized Least Squares

• Defining filtered sequences

ỹ(t) = C(q−1)y(t)

ũ(t) = C(q−1)u(t)

• System becomes

A(q−1)ỹ(t) = B(q−1)ũ(t) + e(t)

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 63

Generalized Least Squares

• If C(q−1) is known, then least-squares gives consistent estimates of Aand B, given ỹ and ũ

• The problem however, is that in practice C(q−1) is not known and {ỹ}and {ũ} cannot be obtained.

• An iterative method proposed by Clarke (1967) solves that problem

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 64

Generalized Least Squares

1. Set Ĉ(q−1) = 1

2. Compute {ỹ(t)} and {ũ(t)} for t = 1, . . . , N3. Use least–squares method to estimate Â and B̂ from ỹ and ũ

4. Compute the residuals

ŵ(t) = Â(q−1

)y(t)− B̂(q−1)u(t)

5. Use least-squares method to estimate Ĉ from

Ĉ(q−1

)ŵ(t) = ε(t)

i.e.

ŵ(t) = −c1ŵ(t− 1)− c2ŵ(t− 2)− · · ·+ ε(t)where {ε(t)} is white

6. If converged, then stop, otherwise repeat from step 2.

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 65

Generalized Least Squares

• For convergence, the loss function and/or the whiteness of the residualscan be tested.

• An advantage of GLS is that not only it may give consistent estimatesof the deterministic part of the system, but also gives a representationof the noise that the LS method does not give.

• The consistency of GLS depends on the signal to noise ratio, theprobability of consistent estimation increasing with the S/N.

• There is, however no guarantee of obtaining consistent estimates

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 66

Instrumental Variable Method (IV) 4

• As seen previously, the LS estimate

θ̂ = [XTX]−1XTY

is unbiased if W is independent of X.

• Assume that a matrix V is available, which is correlated with X but notwith W and such that V TX is positive definite, i.e.

E[V TX] is nonsingular

E[V TW ] = 0

4T. Söderström and P.G. Stoica,Instrumental Variable Methods for System Identification, Spinger-Verlag,Berlin, 1983

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 67

Instrumental Variable Method (IV)

• Then,V TY = V TXθ + V TW

and θ estimated by

θ̂ = [V TX]−1V TY

V is called the instrumental variable matrix.

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 68

Instrumental Variable Method (IV)

• Ideally V is the noise-free process output and the IV estimate θ̂ isconsistent.

• There are many possible ways to construct the instrumental variable. Forinstance, it may be built using an initial least–squares estimates:

Â1ŷ(t) = B̂1u(t)

and the kth row of V is given by

vTk = [−ŷ(k − 1), . . . ,−ŷ(k − n), u(k − 1), . . . , u(k − n)]

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 69

Instrumental Variable Method (IV)

• Consistent estimation cannot be guaranteed in general.

• A two-tier method in its off–line version, the IV method is more usefulin its recursive form.

• Use of instrumental variable in closed-loop.

– Often the instrumental variable is constructed from the input sequence.This cannot be done in closed-loop, as the input is formed from oldoutputs, and hence is correlated with the noise w, unless w is white.In that situation, the following choices are available.

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 70

Instrumental Variable Method (IV)

• Instrumental variables in closed-loop

– Delayed inputs and outputs. If the noise w is assumed to be a movingaverage of order n, then choosing v(t) = x(t− d) with d > n gives aninstrument uncorrelated with the noise. This, however will only workwith a time-varying regulator.

– Reference signals. Building the instruments from the setpoint willsatisfy the noise independence condition. However, the setpoint mustbe a sufficiently rich signal for the estimates to converge.

– External signal. This is effect relates to the closed-loop identifiabilitycondition (covered a bit later...). A typical external signal is a whitenoise independent of w(t).

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 71

Maximum-likelihood identification

The maximum-likelihood method considers the ARMAX model belowwhere u is the input, y the output and e is zero-mean white noise withstandard deviation σ:

A(q−1)y(t) = B(q−1)u(t− k) + C(q−1)e(t)

whereA(q−1) = 1 + a1q−1 + · · ·+ anq−n

B(q−1) = b1q−1 + · · ·+ bnq−n

C(q−1) = 1 + c1q−1 + · · ·+ cnq−n

The parameters of A, B, C as well as σ are unknown.

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 72

Maximum-likelihood identificationDefining

θT

= [ a1 · · · an b1 · · · bn c1 · · · cn]

xT

= [ −y(t) · · · u(t− k) · · · e(t− 1) · · · ]

the ARMAX model can be written as

y(t) = xT (t)θ + e(t)

Unfortunately, one cannot use the least-squares method on this model sincethe sequence e(t) is unknown.

In case of known parameters, the past values of e(t) can be reconstructedexactly from the sequence:

�(t) = [A(q−1)y(t)−B(q−1)]/C(q−1)

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 73

Maximum-likelihood identification

Defining the performance index below, the maximum-likelihood method is then

summarized by the following steps.

V =1

2

NXt=1

�2(t)

• Minimize V with respect to θ̂, using for instance a Newton-Raphson algorithm. Notethat � is linear in the parameters of A and B but not in those of C. We then have to

use some iterative procedure

θ̂i+1 = θ̂i − αi(V ′′(θ̂i))−1V ′(θ̂i)

• Initial estimate θ̂0 is usually obtained from a least-squares estimate• Estimate the noise variance as

σ̂2=

2

NV (θ̂)

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 74

Properties of the Maximum-likelihood Estimate (MLE)

• If the model order is sufficient, the MLE is consistent, i.e. θ̂ → θ asN →∞.

• The MLE is asymptotically normal with mean θ and standard deviationσθ.

• The MLE is asymptotically efficient, i.e. there is no other unbiasedestimator giving a smaller σθ.

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 75

Properties of the Maximum-likelihood Estimate (MLE)

• The Cramer-Rao inequality says that there is a lower limit on the precisionof an unbiased estimate, given by

cov θ̂ ≥ M−1θ

where Mθ is the Fisher Information Matrix Mθ = −E[(log L)θθ] For theMLE

σ2θ = M−1θ

i.e.σ2θ = σ

2V −1θθif σ is estimated, then

σ2θ =2VN

V −1θθ

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 76

Identification In Practice

1. Specify a model structure

2. Compute the best model in this structure

3. Evaluate the properties of this model

4. Test a new structure, go to step 1

5. Stop when satisfactory model is obtained

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 77

MATLAB System Identification Toolbox

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 78

MATLAB System Identification Toolbox

• Most used package

• Graphical User Interface

• Automates all the steps

• Easy to use

• Familiarize yourself with it by running the examples

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 79

RECURSIVE IDENTIFICATION

• There are many situations when it is preferable to perform theidentification on-line, such as in adaptive control.

• Identification methods need to be implemented in a recursive fashion,i.e. the parameter estimate at time t should be computed as a functionof the estimate at time t− 1 and of the incoming information at time t.

• Recursive least-squares

• Recursive instrumental variables

• Recursive extended least-squares and recursive maximum likelihood

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 80

Recursive Least-Squares (RLS)

We have seen that, with t observations available, the least-squaresestimate is

θ̂(t) = [XT (t)X(t)]−1XT (t)Y (t)

withY T (t) = [ y(1) · · · y(t)]

X(t) =

xT (1)...xT (t)

Assume one additional observation becomes available, the problem is thento find θ̂(t + 1) as a function of θ̂(t) and y(t + 1) and u(t + 1).

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 81

Recursive Least-Squares (RLS)

Defining X(t + 1) and Y (t + 1) as

X(t + 1) =[

X(t)xT (t + 1)

]Y (t + 1) =

[Y (t)

y(t + 1)

]and defining P (t) and P (t + 1) as

P (t) = [XT (t)X(t)]−1 P (t + 1) = [XT (t + 1)X(t + 1)]−1

one can write

P (t + 1) = [XT (t)X(t) + x(t + 1)xT (t + 1)]−1

θ̂(t + 1) = P (t + 1)[XT (t)Y (t) + x(t + 1)y(t + 1)]

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 82

Matrix Inversion Lemma

Let A, D and [D−1 + CA−1B] be nonsingular square matrices. ThenA + BDC is invertible and

(A + BDC)−1 = A−1 −A−1B(D−1 + CA−1B)−1CA−1

Proof The simplest way to prove it is by direct multiplication

(A + BDC)(A−1 − A−1B(D−1 + CA−1B)−1CA−1)

= I + BDCA−1 − B(D−1 + CA−1B)−1CA−1

−BDCA−1B(D−1 + CA−1B)−1CA−1

= I + BDCA−1 − BD(D−1 + CA−1B)(D−1 + CA−1B)−1CA−1

= I

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 83

Matrix Inversion Lemma

An alternative form, useful for deriving recursive least-squares is obtainedwhen B and C are n× 1 and 1× n (i.e. column and row vectors):

(A + BC)−1 = A−1 − A−1BCA−1

1 + CA−1B

Now, consider

P (t + 1) = [XT (t)X(t) + x(t + 1)xT (t + 1)]−1

and use the matrix-inversion lemma with

A = XT (t)X(t) B = x(t + 1) C = xT (t + 1)

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 84

Recursive Least-Squares (RLS)

Some simple matrix manipulations then give the recursive least-squaresalgorithm:

θ̂(t + 1) = θ̂(t) + K(t + 1)[y(t + 1)− xT (t + 1)θ̂(t)]

K(t + 1) =P (t)x(t + 1)

1 + xT (t + 1)P (t)x(t + 1)

P (t + 1) = P (t)−P (t)x(t + 1)xT (t + 1)P (t)

1 + xT (t + 1)P (t)x(t + 1)

Note that K(t + 1) can also be expressed as

K(t + 1) = P (t + 1)x(t + 1)

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 85

Recursive Least-Squares (RLS)

• The recursive least-squares algorithm is the exact mathematical equivalent of the batchleast-squares.

• Once initialized, no matrix inversion is needed• Matrices stay the same size all the time• Computationally very efficient• P is proportional to the covariance matrix of the estimate, and is thus called the

covariance matrix.

• The algorithm has to be initialized with θ̂(0) and P (0). Generally, P (0) is initializedas αI where I is the identity matrix and α is a large positive number. The larger α,

the less confidence is put in the initial estimate θ̂(0).

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 86

RLS and Kalman Filter

There are some very strong connections between the recursive least-squares algorithm and the Kalman filter. Indeed, the RLS algorithm has thestructure of a Kalman filter:

θ̂(t + 1)︸ ︷︷ ︸new

= θ̂(t)︸︷︷︸old

+K(t + 1) [y(t + 1)− xT (t + 1)θ̂(t)]︸ ︷︷ ︸correction

where K(t + 1) is the Kalman gain.

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 87

RLS

The following Matlab code is a straightforward implementation of theRLS algorithm:function [thetaest,P]=rls(y,x,thetaest,P)

% RLS

% y,x: current measurement and regressor

% thetaest, P: parameter estimates and covariance matrix

K= P*x/(1+x’*P*x); % Gain

P= P- (P*x*x’*P)/(1+x’*P*x); % Covariance matrix update

thetaest= thetaest +K*(y-x’*thetaest); %Parameter estimate update

% end

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 88

Recursive Extended Least-Squares and RecursiveMaximum-Likelihood

Because the prediction error is not linear in the C–parameters, it is not possible to an

exact recursive maximum likelihood method as for the least–squares method.

The ARMAX model

A(q−1

)y(t) = B(q−1

)u(t) + C(q−1

)e(t)

can be written as

y(t) = xT(t)θ + e(t)

with

θ = [a1, . . . , an, b1, . . . , bn, c1, . . . , cn]T

xT(t) = [−y(t− 1), . . . ,−y(t− n), u(t− 1),

. . . , u(t− n), e(t− 1), . . . , e(t− n)]T

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 89

Recursive Extended Least-Squares and ApproximateMaximum-Likelihood

• If e(t) was known, RLS could be used to estimate θ, however it isunknown and thus has to be estimated.

• It can be done in two ways, either using the prediction error or theresidual.

• The first case corresponds to the RELS method, the second to the AMLmethod.

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 90

Recursive Extended Least-Squares and ApproximateMaximum-Likelihood

• The one–step ahead prediction error is defined as

ε(t) = y(t)− ŷ(t | t− 1)

= y(t)− xT (t)θ̂(t− 1)

x(t) = [−y(t− 1), . . . , u(t− 1), . . . , ε(t− 1), . . . , ε(t− n)]T

• The residual is defined as

η(t) = y(t)− ŷ(t | t)

= y(t)− xT (t)θ̂(t)

x(t) = [−y(t− 1), . . . , u(t− 1), . . . , η(t− 1), . . . , η(t− n)]T

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 91

Recursive Extended Least-Squares and ApproximateMaximum-Likelihood

• Sometimes ε(t) and η(t) are also referred to as a–priori and a–posterioriprediction errors.

• Because it uses the latest estimate θ̂(t), as opposed to θ̂(t− 1) for ε(t),η(t) is a better estimate, especially in transient behaviour.

• Note however that if θ̂(t) converges as t −→∞ then η(t) −→ ε(t).

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 92

Recursive Extended Least-Squares and ApproximateMaximum-Likelihood

The two schemes are then described by

θ̂(t + 1) = θ̂(t) + K(t + 1)[y(t + 1)− xT (t + 1)θ̂(t)]

K(t + 1) = P (t + 1)x(t + 1)/[1 + xT(t + 1)P (t)x(t + 1)]

P (t + 1) = P (t)−P (t)x(t + 1)xT (t + 1)P (t)

[1 + xT (t + 1)P (t)x(t + 1)]

but differ by their definition of x(t)

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 93

Recursive Extended Least-Squares and ApproximateMaximum-Likelihood

• The RELS algorithm corresponds uses the prediction error. This algorithmis called RELS, Extended Matrix or RML1 in the literature. It hasgenerally good convergence properties, and has been proved consistentfor moving–average and first–order auto regressive processes. However,counterexamples to general convergence exist, see for example Ljung(1975).

• The AML algorithm uses the residual error. The AML has betterconvergence properties than the RML, and indeed convergence can beproven under rather unrestrictive conditions.

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 94

Recursive Maximum-Likelihood

• Yet another approach.

• The ML can also be interpreted in terms of data filtering. Consider theperformance index:

V (t) =12

t∑i=1

ε2(i)

with ε(t) = y(t)− xT (t)θ̂(t− 1)

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 95

Recursive Maximum-Likelihood

• Because V is a nonlinear function of C, it has to be approximated by aTaylor series truncated after the second term. The resulting scheme isthen:

θ̂(t + 1) = θ̂(t) + K(t + 1)[y(t + 1)− xT (t + 1)θ̂(t)]

K(t + 1) =P (t + 1)xf(t + 1)

[1 + xTf (t + 1)P (t)xf(t + 1)]

P (t + 1) = P (t)−P (t)xf(t + 1)x

Tf (t + 1)P (t)

[1 + xTf (t + 1)P (t)xf(t + 1)]

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 96

Properties of AML and RML

Definition 1. A discrete transfer function is said to be strictly positive realif it is stable and

ReH(ejw) > 0 ∀w − π < w ≤ π

on the unit circle.

This condition can be checked by replacing z by 1+jw1−jw and extracting thereal part of the resulting expression.

For the convergence of AML, the following theorem is then available.

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 97

Properties of AML and RML

Theorem. [Ljung & Söderström, 1983)] Assume both process andmodel are described by ARMAX with order model ≥ order process, then if

1. {u(t)} is sufficiently rich

2. 1C(q−1) −

12$isstrictlypositivereal then θ̂(t) will converge such that

E[ε(t, θ̂)− e(t)]2 = 0

If model and process have the same order, this implies

θ̂(t) −→ θ as t −→∞

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 98

A Unified Algorithm

Looking at all the previous algorithms, it is obvious that they all havethe same form, with only different parameters. They can all be representedby a recursive prediction - error method (RPEM).

θ̂(t + 1) = θ̂(t) + K(t + 1)ε(t + 1)

K(t + 1) = P (t)z(t + 1)/[1 + xT (t + 1)P (t)z(t + 1)]

P (t + 1) = P (t)− P (t)z(t + 1)xT (t + 1)P (t)

[1 + xT (t + 1)P (t)x(t + 1)]

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 99

Tracking Time-Varying ParametersAll previous methods use the least–squares criterion

V (t) =1t

t∑i=1

[y(i)− xT (i)θ̂]2

and thus identify the average behaviour of the process. When the parametersare time varying, it is desirable to base the identification on the most recentdata rather than on the old one, not representative of the process anymore.This can be achieved by exponential discounting of old data, using thecriterion

V (t) =1t

t∑i=1

λt−i[y(i)− xT (i)θ̂]2

where 0 < λ ≤ is called the forgetting factor.

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 100

Tracking Time-Varying Parameters

The new criterion can also be written

V (t) = λV (t− 1) + [y(t)− xT (t)θ̂]2

Then, it can be shown (Goodwin and Payne, 1977) that the RLS schemebecomes

θ̂(t + 1) = θ̂(t) + K(t + 1)[y(t + 1)− xT (t + 1)θ̂(t)]

K(t + 1) = P (t)x(t + 1)/[λ + xT(t + 1)P (t)x(t + 1)]

P (t + 1) =

(P (t)−

P (t)x(t + 1)xT (t + 1)P (t)

[λ + xT (t + 1)P (t)x(t + 1)]

)1

λ

In choosing λ, one has to compromise between fast tracking and long termquality of the estimates. The use of the forgetting may give rise to problems.

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 101

Tracking Time-Varying Parameters

The smaller λ is, the faster the algorithm can track, but the more theestimates will vary, even the true parameters are time-invariant.A small λ may also cause blowup of the covariance matrix P , since inthe absence of excitation, covariance matrix update equation essentiallybecomes

P (t + 1) =1λP (t)

in which case P grows exponentially, leading to wild fluctuations in theparameter estimates.One way around this is to vary the forgetting factor according to theprediction error ε as in

λ(t) = 1− kε2(t)Then, in case of low excitation ε will be small and λ will be close to 1. Incase of large prediction errors, λ will decrease.

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 102

Exponential Forgetting and Resetting AlgorithmThe following scheme due Salgado, Goodwin and Middleton5 is

recommended:

ε(t + 1) = y(t + 1)− xT (t + 1)θ̂(t)

θ̂(t + 1) = θ̂T (t) +αP (t)x(t + 1)

λ + xT (t + 1)P (t)x(k + 1)ε(t)

P (t + 1) =1λ

[P (t)− P (t)x(t + 1)x

T (t + 1)P (t)λ + x(t + 1)TP (t)x(t + 1)

]+βI − γP (t)2

where I is the identity matrix, and α, β and γ are constants.5M.E. Salgado, G.C. Goodwin, and R.H. Middleton, “Exponential Forgetting and Resetting”, International

Journal of Control, vol. 47, no. 2, pp. 477–485, 1988.

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 103

Exponential Forgetting and Resetting Algorithm

With the EFRA, the covariance matrix is bounded on both sides:

σminI ≤ P (t) ≤ σmaxI ∀t

where

σmin ≈β

α− ησmax ≈

η

γ+

β

η

with

η =1− λ

λWith α = 0.5, β = γ = 0.005 and λ = 0.95, σmin = 0.01 and σmax = 10.

Adaptive Control Lecture Notes – c©Guy A. Dumont, 1997-2005 104