Transcript of sysid/course/2018slides/SysID_lecture11.small.pdf

System Identification
Lecture 11: Statistical properties of parameter estimators, Instrumental variable methods

Roy Smith

2018-11-28 11.1

2018-11-28 11.2


Statistical basis for estimation methods

Parametrised models:

$G = G(\theta, z), \quad H = H(\theta, z)$  (pulse response, ARX, ARMAX, …, state-space)

Estimation

$$\hat{\theta} = \arg\min_\theta\; J(\theta, Z^K), \qquad (Z^K\text{: finite-length measured noisy data})$$

Examples:
§ Least squares (linear regression)
§ Prediction error methods
§ Correlation methods

How do the statistical properties of the data (i.e. noise effects) influence our choice of methods and our results?

2018-11-28 11.3

Maximum likelihood estimation

Basic formulation

Consider K observations, z1, . . . , zK .

Each is a realisation of a random variable, with joint probability distribution,

$$f(\underbrace{x_1, \ldots, x_K}_{\text{random variables}};\, \theta) \longleftarrow \text{family of distributions parametrised by } \theta.$$

Another common notation is,

$$f(x_1, \ldots, x_K \mid \theta) \longleftarrow \text{the pdf for } x_1, \ldots, x_K \text{ given } \theta.$$

For independent variables,

$$f(x_1, \ldots, x_K; \theta) = f_1(x_1; \theta)\, f_2(x_2; \theta) \cdots f_K(x_K; \theta) = \prod_{i=1}^{K} f_i(x_i; \theta)$$

2018-11-28 11.4


Maximum likelihood estimation

Likelihood function

Substituting the observations, $Z^K = \{z_1, \ldots, z_K\}$, gives a function of $\theta$,

$$L(\theta) = f(x_1, \ldots, x_K; \theta)\Big|_{x_i = z_i,\; i = 1, \ldots, K}. \qquad \text{(Likelihood function)}$$

Maximum likelihood estimator:

$$\hat{\theta}_{\mathrm{ML}} = \arg\max_\theta\; L(\theta).$$

The value chosen for $\theta$ is the one that gives the most “agreement” with the observation.

2018-11-28 11.5

Maximum likelihood estimation

Estimating the mean of a Gaussian distribution ($\sigma^2 = 0.5$)

[Figure: the family of Gaussian pdfs
$$f(x; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\theta)^2}{2\sigma^2}}$$
plotted against $x$ for a range of mean values $\theta$.]

2018-11-28 11.6


Maximum likelihood estimation

Estimating the mean of a Gaussian distribution ($\sigma^2 = 0.5$)

Datum: $z = 7.0$

[Figure: the Gaussian pdf $f(x; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\theta)^2}{2\sigma^2}}$ with the likelihood $L(\theta) = f(z; \theta)$ traced out as $\theta$ varies, maximised at $\hat{\theta}_{\mathrm{ML}} = 7.00$.]

2018-11-28 11.7

Maximum likelihood estimation

Log-likelihood function

It is often mathematically easier to consider,

$$\hat{\theta}_{\mathrm{ML}} = \arg\max_\theta\; \ln L(\theta).$$

As the $\ln$ function is monotonic, this gives the same $\hat{\theta}$.

The natural logarithm is typically used so as to be able to handle the exponentials in typical pdfs.

2018-11-28 11.8
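A minimal numerical sketch (not from the slides) of the single-datum example above: for one Gaussian observation the log-likelihood is maximised at the datum itself, reproducing $\hat{\theta}_{\mathrm{ML}} = 7.00$. The grid search is purely illustrative; the maximiser can of course be read off analytically.

```python
import math

# Maximise ln L(theta) for a single Gaussian datum z = 7.0 with known
# variance sigma2 = 0.5 (the values used on the slides).
z, sigma2 = 7.0, 0.5

def log_likelihood(theta):
    # ln f(z; theta) for a Gaussian with mean theta and variance sigma2
    return -0.5 * math.log(2 * math.pi * sigma2) - (z - theta) ** 2 / (2 * sigma2)

# Crude grid search over candidate means; the maximiser is the datum itself.
grid = [k / 100 for k in range(0, 1201)]
theta_ml = max(grid, key=log_likelihood)
print(theta_ml)  # 7.0
```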


Example

Estimation of the mean of a set of samples

$z_i,\; i = 1, \ldots, K$, with $z_i \sim \mathcal{N}(\theta_0, \sigma_i^2)$. (note: different variances)

Sample mean estimate:
$$\hat{\theta}_{\mathrm{SM}} = \frac{1}{K} \sum_{i=1}^{K} z_i$$

Probability density functions (pdf): $\theta$ is the common mean of the distributions.

$$f_i(x_i; \theta) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left( -\frac{(x_i - \theta)^2}{2\sigma_i^2} \right)$$

For independent samples the joint pdf is:

$$f(x_1, \ldots, x_K; \theta) = \prod_{i=1}^{K} \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left( -\frac{(x_i - \theta)^2}{2\sigma_i^2} \right)$$

2018-11-28 11.9

Example

Estimation of the mean of a set of samples

$$\begin{aligned}
\hat{\theta}_{\mathrm{ML}} &= \arg\max_\theta\; \ln f(x_1, \ldots, x_K; \theta)\Big|_{x_i = z_i,\; i = 1, \ldots, K}\\
&= \arg\max_\theta\; \ln L(\theta)\\
&= \arg\max_\theta \left( -\frac{K}{2} \ln(2\pi) - \sum_{i=1}^{K} \frac{1}{2} \ln(\sigma_i^2) - \frac{1}{2} \sum_{i=1}^{K} \frac{(z_i - \theta)^2}{\sigma_i^2} \right)
\end{aligned}$$

This gives (differentiate and equate to zero),

$$\hat{\theta}_{\mathrm{ML}} = \left( \sum_{i=1}^{K} \frac{1}{\sigma_i^2} \right)^{-1} \sum_{i=1}^{K} \frac{z_i}{\sigma_i^2}$$

2018-11-28 11.10
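A quick numerical sketch of this inverse-variance weighted estimate (the data values below are made up for illustration), compared with the plain sample mean, which ignores the differing variances:

```python
# ML estimate of a common mean from samples with different known
# variances: the inverse-variance weighted mean derived above.
z      = [7.0, 5.0, 6.0]   # illustrative data
sigma2 = [0.5, 1.0, 2.0]   # known, different variances

w = [1.0 / s for s in sigma2]                         # weights 1/sigma_i^2
theta_ml = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)

# Unweighted sample mean for comparison.
theta_sm = sum(z) / len(z)
print(theta_ml, theta_sm)  # 6.2857... 6.0
```

The low-variance sample $z_1 = 7.0$ pulls the ML estimate above the sample mean, as the weighting suggests it should.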


Bayesian approach

Random parameter framework

Consider $\theta$ to be a random variable with pdf: $f_\theta(x)$.

This is an a priori distribution (assumed before the experiment).

Conditional distribution (inference from the experiment)

Our model (plus assumptions) gives a conditional distribution,

$$f(x_1, \ldots, x_K \mid \theta)$$

On the basis of the experiment ($x_i = z_i$),

$$\mathrm{Prob}(\theta \mid z_1, \ldots, z_K) = \frac{\mathrm{Prob}(Z^K \mid \theta)\, \mathrm{Prob}(\theta)}{\mathrm{Prob}(Z^K)}$$

So,

$$\arg\max_\theta\; f(\theta \mid z_1, \ldots, z_K) = \arg\max_\theta\; f(Z^K \mid \theta)\, f_\theta(\theta)$$

2018-11-28 11.11

Maximum a posteriori (MAP) estimation

Estimator

Given data, $Z^K$,

$$\hat{\theta}_{\mathrm{MAP}} = \arg\max_\theta\; f(Z^K \mid \theta)\, f_\theta(\theta).$$

We can interpret the maximum likelihood estimator as,

$$\hat{\theta}_{\mathrm{ML}} = \arg\max_\theta\; f(x_1, \ldots, x_K; \theta)\Big|_{x_i = z_i,\; i = 1, \ldots, K} = \arg\max_\theta\; f(Z^K \mid \theta)$$

These estimates coincide if we assume a uniform distribution for $\theta$.

2018-11-28 11.12


MAP estimation

A priori parameter distribution

$$f_\theta(\theta) = \frac{1}{\sqrt{2\pi\sigma_\theta^2}}\, e^{-\frac{(\theta - \theta_a)^2}{2\sigma_\theta^2}}, \qquad \theta_a = 5, \quad \sigma_\theta^2 = 1.$$

[Figure: the prior pdf $f_\theta(\theta)$ against $\theta$, centred at $\theta_a = 5$ with $\theta_a \pm \sigma_\theta$ marked.]

2018-11-28 11.13

MAP estimation

Estimating the mean: Gaussian distribution ($\sigma^2 = 0.5$, $\theta_a = 5$, $\sigma_\theta^2 = 1$)

[Figure: the unnormalised posterior
$$f(x; \theta)\, f_\theta(\theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\theta)^2}{2\sigma^2}} \cdot \frac{1}{\sqrt{2\pi\sigma_\theta^2}}\, e^{-\frac{(\theta - \theta_a)^2}{2\sigma_\theta^2}}$$
plotted against $x$ for a range of mean values $\theta$.]

2018-11-28 11.14


MAP estimation

Estimating the mean: Gaussian distribution ($\sigma^2 = 0.5$, $\theta_a = 5$, $\sigma_\theta^2 = 1$)

Datum: $z = 7.0$

[Figure: $f(z; \theta)\, f_\theta(\theta)$ as a function of $\theta$, where
$$f(x; \theta)\, f_\theta(\theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\theta)^2}{2\sigma^2}} \cdot \frac{1}{\sqrt{2\pi\sigma_\theta^2}}\, e^{-\frac{(\theta - \theta_a)^2}{2\sigma_\theta^2}},$$
maximised at $\hat{\theta}_{\mathrm{MAP}} = 6.33$.]

2018-11-28 11.15
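For this Gaussian-likelihood, Gaussian-prior case the maximiser of $f(z;\theta)f_\theta(\theta)$ has a well-known closed form: a precision-weighted average of the datum and the prior mean. A minimal check with the slide's numbers:

```python
# MAP estimate for a Gaussian likelihood with a Gaussian prior:
# the precision-weighted combination of datum and prior mean.
z, sigma2 = 7.0, 0.5          # datum and likelihood variance (slide values)
theta_a, sigma2_a = 5.0, 1.0  # prior mean and prior variance (slide values)

theta_map = (z / sigma2 + theta_a / sigma2_a) / (1 / sigma2 + 1 / sigma2_a)
print(round(theta_map, 2))  # 6.33, matching the slide
```

The estimate sits between the datum (7.0) and the prior mean (5.0), pulled toward the datum because the likelihood is the more precise of the two.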

Cramér-Rao bound

Mean-square error matrix

$$P = E\left\{ \left( \hat{\theta}(Z^K) - \theta_0 \right) \left( \hat{\theta}(Z^K) - \theta_0 \right)^T \right\}$$

Assume that the pdf for $Z^K$ is $f(Z^K; \theta)$.

Cramér-Rao inequality

Assume $E\{\hat{\theta}(Z^K)\} = \theta_0$, and $Z^K \subset \mathbb{R}^K$. Then,

$$P \ge M^{-1} \qquad (M \text{ is the Fisher information matrix})$$

$$M = E\left\{ \left( \frac{d}{d\theta} \ln f(Z^K; \theta) \right) \left( \frac{d}{d\theta} \ln f(Z^K; \theta) \right)^T \right\}\Bigg|_{\theta = \theta_0} = -E\left\{ \frac{d^2}{d\theta^2} \ln f(Z^K; \theta) \right\}\Bigg|_{\theta = \theta_0}$$

2018-11-28 11.16
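As an illustrative aside (not on the slides): for $K$ i.i.d. samples from $\mathcal{N}(\theta_0, \sigma^2)$ the Fisher information is $M = K/\sigma^2$, so the bound on the variance of an unbiased estimator is $\sigma^2/K$, and the sample mean attains it. A quick Monte Carlo check of that claim:

```python
import numpy as np

# Cramer-Rao bound for the Gaussian mean: sigma2/K. The sample mean is
# unbiased and attains the bound, so its Monte Carlo variance should
# match sigma2/K closely.
rng = np.random.default_rng(3)
theta0, sigma2, K, trials = 2.0, 0.5, 50, 20000

samples = rng.normal(theta0, np.sqrt(sigma2), size=(trials, K))
est = samples.mean(axis=1)          # sample-mean estimate in each trial
print(est.var(), sigma2 / K)        # both about 0.01
```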


Maximum likelihood: statistical properties

Asymptotic results for i.i.d. variables

Consider a parametrised family of pdfs,

$$f(x_1, \ldots, x_K; \theta) = \prod_{i=1}^{K} f_i(x_i; \theta).$$

Then,

$$\lim_{K \to \infty} \hat{\theta}_{\mathrm{ML}} \xrightarrow{\text{w.p. } 1} \theta_0,$$

and

$$\lim_{K \to \infty} \sqrt{K} \left( \hat{\theta}_{\mathrm{ML}}(Z^K) - \theta_0 \right) \sim \mathcal{N}\left( 0, M^{-1} \right).$$

2018-11-28 11.17

Prediction error statistics

Prediction error framework

$$\varepsilon(k, \theta) = y(k) - \hat{y}(k, \theta)$$

Assume that $\varepsilon(k, \theta)$ is i.i.d. with pdf: $f_\varepsilon(x; \theta)$.

For example, in the ARX case: $\varepsilon(k, \theta_0) = e(k) \sim \mathcal{N}(0, \sigma^2)$.

Joint pdf for the prediction errors:

$$f(X^K; \theta) = \prod_{k=1}^{K} f_\varepsilon(\varepsilon(k, \theta); \theta)$$

2018-11-28 11.18


Prediction error statistics

Maximum likelihood approach

$$\begin{aligned}
\hat{\theta}_{\mathrm{ML}} &= \arg\max_\theta\; f(X^K; \theta)\Big|_{X^K = Z^K} = \arg\max_\theta\; L(\theta)\\
&= \arg\max_\theta\; \ln f(Z^K \mid \theta)\\
&= \arg\max_\theta\; \frac{1}{K} \sum_{k=1}^{K} \ln f_\varepsilon(\varepsilon(k, \theta); \theta).
\end{aligned}$$

If we choose the prediction error cost function as,

$$l(\varepsilon, \theta) = -\ln f_\varepsilon(\varepsilon; \theta),$$

then,

$$\hat{\theta}_{\mathrm{PE}} = \arg\min_\theta\; \frac{1}{K} \sum_{k=1}^{K} l(\varepsilon(k, \theta), \theta) = \hat{\theta}_{\mathrm{ML}}$$

2018-11-28 11.19

Prediction error statistics

Example

Gaussian noise case, $\varepsilon(k) \sim \mathcal{N}(0, \sigma^2)$:

$$l(\varepsilon(k, \theta), \theta) = -\ln f_\varepsilon(\varepsilon; \theta) = \text{constant} + \frac{1}{2} \ln \sigma^2 + \frac{1}{2} \frac{\varepsilon(k, \theta)^2}{\sigma^2}$$

If $\sigma^2$ is constant (and not a parameter to be estimated) then,

$$\hat{\theta}_{\mathrm{ML}} = \arg\max_\theta\; L(\theta) = \arg\min_\theta\; \frac{1}{K} \sum_{k=1}^{K} l(\varepsilon(k, \theta), \theta) = \arg\min_\theta\; \|\varepsilon(k, \theta)\|_2^2 = \hat{\theta}_{\mathrm{PE}}$$

2018-11-28 11.20
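A small numerical sketch of this equivalence (the scalar model and data below are invented for illustration): with fixed $\sigma^2$, maximising the Gaussian log-likelihood and minimising the sum of squared prediction errors select the same $\theta$, since the log-likelihood is an affine, decreasing function of the squared-error sum.

```python
import math

# Tiny scalar model y(k) = theta*u(k) + e(k) with made-up data.
u = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
sigma2 = 0.25  # fixed, known noise variance

def sse(theta):
    # sum of squared prediction errors
    return sum((yk - theta * uk) ** 2 for uk, yk in zip(u, y))

def loglik(theta):
    # Gaussian log-likelihood of the prediction errors
    return sum(-0.5 * math.log(2 * math.pi * sigma2)
               - (yk - theta * uk) ** 2 / (2 * sigma2)
               for uk, yk in zip(u, y))

grid = [k / 1000 for k in range(1500, 2500)]
theta_pe = min(grid, key=sse)     # prediction-error estimate
theta_ml = max(grid, key=loglik)  # maximum-likelihood estimate
print(theta_pe == theta_ml)       # True: the two criteria pick the same theta
```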


Prediction error statistics

Example

If we have a linear predictor and independent Gaussian noise, then,

$$\hat{\theta} = \arg\min_\theta\; \|\varepsilon(k, \theta)\|_2^2,$$

§ Is a linear, least-squares problem;
§ Is equivalent to minimizing $\sum_{k=1}^{K} -\ln f_\varepsilon(\varepsilon; \theta)$;
§ Is equivalent to a maximum likelihood estimation;
§ Gives (asymptotically) the minimum variance parameter estimates.

2018-11-28 11.21

Linear regression statistics

One-step ahead predictor

$$\hat{y}(k|\theta) = \varphi^T(k)\theta + \mu(k)$$

In the ARX case $\mu(k) = e(k)$. In other special cases $\mu(k)$ can depend on $Z^K$.

Prediction error: $\varepsilon(k) = y(k) - \varphi^T(k)\theta$

A typical cost function is:

$$J(\theta, Z^K) = \frac{1}{K} \sum_{k=0}^{K-1} \|\varepsilon(k)\|_2^2$$

Least-squares criterion:

$$\hat{\theta}_{\mathrm{LS}} = \underbrace{\left( \frac{1}{K} \sum_{k=0}^{K-1} \varphi(k)\, \varphi^T(k) \right)^{-1}}_{R_K^{-1} \in \mathbb{R}^{d \times d}}\; \underbrace{\frac{1}{K} \sum_{k=0}^{K-1} \varphi(k)\, y(k)}_{f_K \in \mathbb{R}^{d}}$$

2018-11-28 11.22
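A simulation sketch of this least-squares criterion for a first-order ARX-type model (all signals, coefficients, and noise levels below are invented for illustration): build the regressors $\varphi(k)$, form $R_K$ and $f_K$, and solve.

```python
import numpy as np

# Simulated first-order model y(k) = -a0*y(k-1) + b0*u(k-1) + e(k)
# with white noise e, so least squares is consistent here.
rng = np.random.default_rng(0)
K, a0, b0 = 2000, -0.7, 1.5
u = rng.standard_normal(K)
e = 0.1 * rng.standard_normal(K)
y = np.zeros(K)
for k in range(1, K):
    y[k] = -a0 * y[k - 1] + b0 * u[k - 1] + e[k]

# Regressor phi(k) = [-y(k-1), u(k-1)]^T, parameter theta = [a, b]^T.
Phi = np.column_stack([-y[:-1], u[:-1]])
RK = Phi.T @ Phi / K           # (1/K) sum phi(k) phi(k)^T
fK = Phi.T @ y[1:] / K         # (1/K) sum phi(k) y(k)
theta_ls = np.linalg.solve(RK, fK)
print(theta_ls)                # close to [a0, b0] = [-0.7, 1.5]
```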


Linear regression statistics

[Block diagram: $u(k) \to \frac{B(\theta, z)}{A(\theta, z)}$ and $v(k) \to \frac{1}{A(\theta, z)}$, summed to give $y(k)$.]

Least-squares estimator properties

The least-squares estimate can be expressed as,

$$\hat{\theta}_{\mathrm{LS}} = R_K^{-1} f_K$$

True plant:

$$y(k) = \varphi^T(k)\theta_0 + v(k)$$

Asymptotic bias:

$$\lim_{K \to \infty} \hat{\theta}_{\mathrm{LS}} - \theta_0 = \lim_{K \to \infty} R_K^{-1}\, \frac{1}{K} \sum_{k=0}^{K-1} \varphi(k)\, v(k) = (R^*)^{-1} f^*,$$

$$R^* = E\left\{ \varphi(k)\, \varphi^T(k) \right\}, \qquad f^* = E\left\{ \varphi(k)\, v(k) \right\}.$$

2018-11-28 11.23

Linear regression statistics

Consistency of the LS estimator

For consistency, $\lim_{K \to \infty} \hat{\theta}_{\mathrm{LS}} = \theta_0$, we require $(R^*)^{-1} f^* = 0$.

So,

1. $R^*$ must be non-singular. Persistency of excitation requirement.
2. $f^* = E\{\varphi(k)\, v(k)\} = 0$. This happens if either:
   2a. $v(k)$ is zero-mean and independent of $\varphi(k)$; or
   2b. $u(k)$ is independent of $v(k)$ and $G$ is FIR ($n = 0$).

This gives,

$$\lim_{K \to \infty} \sqrt{K} \left( \hat{\theta}_{\mathrm{LS}} - \theta_0 \right) \sim \mathcal{N}\left( 0, \sigma_0^2 (R^*)^{-1} \right).$$

2018-11-28 11.24


Correlation methods

Ideal prediction error estimator

$$y(k) - \hat{y}(k|k-1) = \varepsilon(k) \underbrace{= e(k)}_{\text{ideally}}$$

The sequence of prediction errors, $\{e(k),\; k = 0, \ldots, K-1\}$, is white.

If the estimator is optimal ($\hat{\theta} = \theta_0$) then the prediction errors contain no further information about the process.

Another interpretation: the prediction errors, $\varepsilon(k)$, are uncorrelated with the experimental data, $Z^K$.

2018-11-28 11.25

Correlation methods

Approach

Select a sequence, $\zeta(k)$, derived from the past data, $Z^K$.

Require that the error, $\varepsilon(k, \theta)$, is uncorrelated with $\zeta(k)$:

$$\frac{1}{K} \sum_{k=0}^{K-1} \zeta(k)\, \varepsilon(k, \theta) = 0 \qquad \text{(could also use a function, } \alpha(\varepsilon)\text{)}$$

We can view the identification problem as finding $\hat{\theta}$ such that this relationship is satisfied.

The values, $\zeta(k)$, are known as instruments.

Typically $\zeta(k) \in \mathbb{R}^{d \times n_y}$, where $\theta \in \mathbb{R}^{d}$, $y(k) \in \mathbb{R}^{n_y}$.

2018-11-28 11.26


Correlation methods

Procedure

Choose a linear filter, $F(z)$, for the prediction errors,

$$\varepsilon_F(k, \theta) = F(z)\, \varepsilon(k, \theta) \qquad \text{(this is optional)}.$$

Choose a sequence of correlation vectors, $\zeta(k, Z^K, \theta)$, constructed from the data (and possibly $\theta$).

Choose a function $\alpha(\varepsilon)$ (default is $\alpha(\varepsilon) = \varepsilon$). Then,

$$\hat{\theta} = \theta \text{ solving } f_K(\theta, Z^K) = \frac{1}{K} \sum_{k=0}^{K-1} \zeta(k, \theta)\, \alpha(\varepsilon(k, \theta)) = 0.$$

2018-11-28 11.27

Pseudo-linear regressions

Regression-based one-step ahead predictors

For ARX, ARMAX, etc., model structures we can write the predictor,

$$\hat{y}(k|\theta) = \varphi^T(k, \theta)\, \theta.$$

We previously solved this via LS (or iterative LS, or optimisation) methods.

Correlation based solution

$$\hat{\theta}_{\mathrm{PLR}} = \theta \text{ solving } \frac{1}{K} \sum_{k=0}^{K-1} \varphi(k, \theta) \underbrace{\left( y(k) - \varphi^T(k, \theta)\, \theta \right)}_{\text{prediction error}} = 0.$$

The prediction errors are orthogonal to the regressor, $\varphi(k, \theta)$.

2018-11-28 11.28


Instrumental variable methods

Instrumental variables

$$\hat{\theta}_{\mathrm{IV}} = \theta \text{ solving } \frac{1}{K} \sum_{k=0}^{K-1} \zeta(k, \theta) \left( y(k) - \varphi^T(k, \theta)\, \theta \right) = 0.$$

This is solved by,

$$\hat{\theta}_{\mathrm{IV}} = \left( \frac{1}{K} \sum_{k=0}^{K-1} \zeta(k)\, \varphi^T(k) \right)^{-1} \frac{1}{K} \sum_{k=0}^{K-1} \zeta(k)\, y(k).$$

So, for consistency we require,

$$E\left\{ \zeta(k)\, \varphi^T(k) \right\} \text{ to be nonsingular,}$$

and

$$E\left\{ \zeta(k)\, v(k) \right\} = 0 \qquad \text{(uncorrelated w.r.t. prediction error)}$$

2018-11-28 11.29
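The closed-form solution above can be wrapped as a small helper (an illustrative sketch; `iv_estimate` is not from the slides). A convenient sanity check: with $\zeta(k) = \varphi(k)$ the formula reduces to ordinary least squares.

```python
import numpy as np

# theta_IV = ( (1/K) sum zeta(k) phi(k)^T )^{-1} (1/K) sum zeta(k) y(k).
# Zeta and Phi stack the rows zeta(k)^T and phi(k)^T; y stacks y(k).
def iv_estimate(Zeta, Phi, y):
    K = len(y)
    R = Zeta.T @ Phi / K   # (1/K) sum zeta(k) phi(k)^T, must be invertible
    f = Zeta.T @ y / K     # (1/K) sum zeta(k) y(k)
    return np.linalg.solve(R, f)

# Degenerate check: with zeta(k) = phi(k) this is exactly least squares.
rng = np.random.default_rng(1)
Phi = rng.standard_normal((500, 2))
theta0 = np.array([0.5, -1.0])
y = Phi @ theta0 + 0.01 * rng.standard_normal(500)
print(iv_estimate(Phi, Phi, y))  # close to [0.5, -1.0]
```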

Example

ARX model

$$y(k) + a_1 y(k-1) + \cdots + a_n y(k-n) = b_1 u(k-1) + \cdots + b_m u(k-m) + v(k)$$

One approach: filtered input signals as instruments

[Block diagram: $u(k) \to \frac{B(\theta, z)}{A(\theta, z)}$ with $v(k) \to \frac{1}{A(\theta, z)}$, summed to give $y(k)$; in parallel, $u(k) \to \frac{P(z)}{Q(z)}$ gives $x(k)$.]

$$x(k) + q_1 x(k-1) + \cdots + q_n x(k-n) = p_1 u(k-1) + \cdots + p_m u(k-m)$$

2018-11-28 11.30


Instrumental variable example

[Block diagram: $u(k) \to \frac{B(\theta, z)}{A(\theta, z)}$ with $v(k) \to \frac{1}{A(\theta, z)}$, summed to give $y(k)$; in parallel, $u(k) \to \frac{P(z)}{Q(z)}$ gives $x(k)$.]

$$\zeta(k) = \begin{bmatrix} -x(k-1) & \ldots & -x(k-n) & u(k-1) & \ldots & u(k-m) \end{bmatrix}^T$$

Here,

$$R_K = \frac{1}{K} \sum_{k=0}^{K-1} \zeta(k)\, \varphi^T(k) \text{ is required to be invertible,}$$

and we also need,

$$E\left\{ \frac{1}{K} \sum_{k=0}^{K-1} \zeta(k)\, v(k) \right\} = 0.$$

2018-11-28 11.31

Instrumental variable example

Invertibility of $R_K$?

$$y = \frac{B(z)}{A(z)} u + \frac{1}{A(z)} v, \qquad x = \frac{P(z)}{Q(z)} u$$

So, $\zeta(k)\, \varphi^T(k)$ has the form,

$$\begin{aligned}
\zeta(k)\, \varphi^T(k) &= \begin{bmatrix} x_0^{k-1} \\ u_0^{k-1} \end{bmatrix} \begin{bmatrix} y_0^{k-1} & u_0^{k-1} \end{bmatrix}\\
&= \begin{bmatrix} \frac{P}{Q} u_0^{k-1} \\ u_0^{k-1} \end{bmatrix} \begin{bmatrix} \frac{B}{A} u_0^{k-1} + \frac{1}{A} v_0^{k-1} & u_0^{k-1} \end{bmatrix}\\
&= \underbrace{\begin{bmatrix} \frac{P}{Q} u_0^{k-1} \\ u_0^{k-1} \end{bmatrix} \begin{bmatrix} \frac{B}{A} u_0^{k-1} & u_0^{k-1} \end{bmatrix}}_{\text{invertible?}} + \underbrace{\begin{bmatrix} \frac{P}{Q} u_0^{k-1} \\ u_0^{k-1} \end{bmatrix} \begin{bmatrix} \frac{1}{A} v_0^{k-1} & 0 \end{bmatrix}}_{\text{vanishing? } (\to 0)}
\end{aligned}$$

2018-11-28 11.32


Instrumental variable example

$$y = \frac{B(z)}{A(z)} u + \frac{1}{A(z)} v, \qquad x = \frac{P(z)}{Q(z)} u$$

$$\zeta(k)\, \varphi^T(k) = \begin{bmatrix} \frac{P(z)}{Q(z)} u_0^{k-1} \\ u_0^{k-1} \end{bmatrix} \begin{bmatrix} \frac{B(z)}{A(z)} u_0^{k-1} & u_0^{k-1} \end{bmatrix} + \begin{bmatrix} \frac{P(z)}{Q(z)} u_0^{k-1} \\ u_0^{k-1} \end{bmatrix} \begin{bmatrix} \frac{1}{A(z)} v_0^{k-1} & 0 \end{bmatrix}$$

This will be invertible if:

§ $v(k)$ and $u(k)$ are uncorrelated.
§ $u(k)$ and $x(k) = \frac{P(z)}{Q(z)} u(k)$ are sufficiently exciting.
§ There are no pole/zero cancellations between $\frac{P(z)}{Q(z)}$ and $\frac{B(z)}{A(z)}$.

2018-11-28 11.33

Instrumental variable approach

A nonlinear estimation problem

[Block diagram: $u(k) \to \frac{B(\theta, z)}{A(\theta, z)}$ with $v(k)$ added to give $y(k)$; in parallel, $u(k) \to \frac{P(z)}{Q(z)}$ gives $x(k)$.]

Choosing $P(z)$ and $Q(z)$

The procedure works well when $P(z) \approx B(z)$ and $Q(z) \approx A(z)$.

Approach:

1. Estimate $\hat{\theta}_{\mathrm{LS}}$ via linear regression.
2. Select $Q(z) = A_{\mathrm{LS}}(z)$ and $P(z) = B_{\mathrm{LS}}(z)$.
3. Calculate $\hat{\theta}_{\mathrm{IV}}$.

2018-11-28 11.34
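A sketch of this three-step procedure on a simulated first-order example (all coefficients, noise levels, and the MA(1) noise model below are invented for illustration): coloured noise biases plain least squares, while the IV step, using instruments generated from $u$ alone, removes the bias.

```python
import numpy as np

# True system: y(k) = -a0*y(k-1) + b0*u(k-1) + v(k), with coloured v.
rng = np.random.default_rng(2)
K, a0, b0 = 5000, -0.7, 1.0
u = rng.standard_normal(K)
e = rng.standard_normal(K)
v = 0.5 * (e + 0.9 * np.concatenate([[0.0], e[:-1]]))  # MA(1) noise
y = np.zeros(K)
for k in range(1, K):
    y[k] = -a0 * y[k - 1] + b0 * u[k - 1] + v[k]

Phi = np.column_stack([-y[:-1], u[:-1]])  # phi(k) = [-y(k-1), u(k-1)]^T

# 1. Least-squares estimate (biased here, since v is not white).
theta_ls = np.linalg.solve(Phi.T @ Phi, Phi.T @ y[1:])

# 2. Instruments: simulate x(k) from the LS model driven by u only,
#    so the instruments are uncorrelated with the noise v.
a_ls, b_ls = theta_ls
x = np.zeros(K)
for k in range(1, K):
    x[k] = -a_ls * x[k - 1] + b_ls * u[k - 1]
Zeta = np.column_stack([-x[:-1], u[:-1]])

# 3. Instrumental-variable estimate.
theta_iv = np.linalg.solve(Zeta.T @ Phi, Zeta.T @ y[1:])
print(theta_ls, theta_iv)  # theta_iv is much closer to [a0, b0]
```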


Instrumental variable approach

Considerations

§ Variance and MSE depend on the choice of instruments.
§ Consistency (asymptotic unbiasedness) is lost if:
  § noise and instruments are correlated (for example, in closed loop, generating instruments from $u$);
  § the model order selection is incorrect;
  § the filter dynamics cancel plant dynamics;
  § the true system is not in the model set.
§ Closed-loop approaches: generate instruments from the excitation, $r$.

2018-11-28 11.35

Bibliography

Prediction error minimization
Lennart Ljung, System Identification: Theory for the User, 2nd Ed., Prentice-Hall, 1999, [sections 7.1, 7.2 & 7.3].

Parameter estimation statistics
Lennart Ljung, System Identification: Theory for the User, 2nd Ed., Prentice-Hall, 1999, [section 7.4].

Correlation and instrumental variable methods
Lennart Ljung, System Identification: Theory for the User, 2nd Ed., Prentice-Hall, 1999, [sections 7.5 & 7.6].

2018-11-28 11.36