
Pattern Recognition applied to Biomedical Signals

Dr Philip de Chazal
Chief Technical Officer
BiancaMed, Dublin, Ireland

Outline

1st Hour: Focus on Pattern Recognition basics
1. Pattern Recognition Overview
2. Classifiers: LDA, QDA, FFNN, HMM etc.
3. Performance Assessment
– Data splitting
– Performance measures (sensitivity, specificity etc.)
4. Features
– Transformations
– Missing values
– Feature Selection

Outline

2nd Hour: Case study on sleep apnea detection from the electrocardiogram
– Database
– Expert annotations
– Features
– Performance assessment
– Practical tips

1. Pattern Recognition Overview

Pattern Recognition
– Supervised learning
  – Plug-in parameters
    – Parametric: Discriminant Analysis, Bayesian Classifiers
    – Non-parametric: Nearest neighbour (kNN), Decision Trees
  – Distributed parameters: Neural Networks (feedforward etc.), HMM
– Unsupervised learning
  – Self-organising feature maps
  – Cluster Analysis
  – Hebbian Learning
  – Vector Quantisation

Classifier Types: Parametric models with plug-in parameters

Define a parametric model between the input features and the output classes. The model has adjustable parameters which are set using training data.

[Diagram: Features → Classifier (with adjustable parameters) → Classes]

2. Classification Methods

Bayes theorem
Gaussian models
Linear discriminant analysis
– Derivation
– Covariance inversion
– Training equations
– Classifying equations
Quadratic discriminant analysis
Feedforward neural networks

Bayes theorem

[Figure: class-conditional densities p(x|C1)P(C1) and p(x|C2)P(C2) plotted against x, with decision regions ℜ1 and ℜ2]

Bayes theorem

Optimally classify an object into one of c mutually exclusive classes given priors and class densities. For a c-class problem, Bayes' rule states that the posterior probability of the kth class is related to its prior probability and its class density function by

$$p_k = \frac{\pi_k\, f\!\left(\mathbf{x}\mid\boldsymbol{\theta}_k\right)}{\sum_{l=1}^{c} \pi_l\, f\!\left(\mathbf{x}\mid\boldsymbol{\theta}_l\right)}, \qquad k = 1,\ldots,c$$

where $p_k$ is the posterior probability, $\pi_k$ is the prior probability, $f(\mathbf{x}\mid\boldsymbol{\theta}_k)$ is the class density, and the denominator is the normalising component.
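As a quick illustration of Bayes' rule, this sketch computes the posteriors for a two-class problem with Gaussian class densities; the priors and Gaussian parameters are hypothetical placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-class problem: pi_k, mu_k, Sigma_k are placeholders.
priors = np.array([0.6, 0.4])                          # must sum to 1
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]   # mu_k (d = 2)
covs = [np.eye(2), np.eye(2)]                          # Sigma_k

def posterior(x):
    # Unnormalised joint: pi_k * f(x | theta_k) for each class
    joint = np.array([p * multivariate_normal.pdf(x, m, S)
                      for p, m, S in zip(priors, means, covs)])
    return joint / joint.sum()      # divide by the normalising component

print(posterior(np.array([1.0, 0.5])))   # posteriors, sum to 1
```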

Likelihood function

If we have $N = \sum_{k=1}^{c} N_k$ labelled data (d×1) vectors $\mathbf{x}_n^{(k)},\ n = 1..N_k$, then a likelihood can be formed as follows:

$$l(\boldsymbol{\theta}) = \prod_{k=1}^{c}\prod_{n=1}^{N_k} \pi_k\, f\!\left(\mathbf{x}_n^{(k)}\mid\boldsymbol{\theta}_k\right)$$

Our aim is to find the values of θ for each class that maximise the likelihood l(θ). Equivalently, we can find the values of θ that maximise the log-likelihood:

$$L(\boldsymbol{\theta}) = \log l(\boldsymbol{\theta}) = \sum_{k=1}^{c}\sum_{n=1}^{N_k} \log\!\left(\pi_k f\!\left(\mathbf{x}_n^{(k)}\mid\boldsymbol{\theta}_k\right)\right) = \sum_{k=1}^{c}\sum_{n=1}^{N_k} \log f\!\left(\mathbf{x}_n^{(k)}\mid\boldsymbol{\theta}_k\right) + \sum_{k=1}^{c} N_k \log \pi_k$$

Think of it as combined probabilities

LDA: Gaussian parametric model for the class densities

The class densities are modelled with a d-dimensional Gaussian model with a common covariance across all classes:

$$f\!\left(\mathbf{x}\mid\boldsymbol{\theta}_k\right) = f\!\left(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}\right) = (2\pi)^{-d/2}\,\lvert\boldsymbol{\Sigma}\rvert^{-1/2} \exp\!\left[-\tfrac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_k\right)^T \boldsymbol{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_k\right)\right]$$

where Σ is the common covariance and μ_k is the class mean.

LDA: Log-likelihood

For a training example:

$$\log f\!\left(\mathbf{x}_n^{(k)}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}\right) = -\tfrac{d}{2}\log(2\pi) - \tfrac{1}{2}\log\lvert\boldsymbol{\Sigma}\rvert - \tfrac{1}{2}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)^T\boldsymbol{\Sigma}^{-1}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)$$

Hence

$$\sum_{k=1}^{c}\sum_{n=1}^{N_k}\log f\!\left(\mathbf{x}_n^{(k)}\mid\boldsymbol{\theta}_k\right) = -\tfrac{dN}{2}\log(2\pi) - \tfrac{N}{2}\log\lvert\boldsymbol{\Sigma}\rvert - \tfrac{1}{2}\sum_{k=1}^{c}\sum_{n=1}^{N_k}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)^T\boldsymbol{\Sigma}^{-1}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)$$

And the log-likelihood over all training examples:

$$L\!\left(\boldsymbol{\mu}_1,\ldots,\boldsymbol{\mu}_c,\boldsymbol{\Sigma}\right) = -\tfrac{dN}{2}\log(2\pi) - \tfrac{N}{2}\log\lvert\boldsymbol{\Sigma}\rvert - \tfrac{1}{2}\sum_{k=1}^{c}\sum_{n=1}^{N_k}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)^T\boldsymbol{\Sigma}^{-1}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right) + \sum_{k=1}^{c} N_k\log\pi_k$$

LDA: Maximising the log-likelihood function

1. Setting the derivative with respect to each class mean to zero:

$$\frac{\partial L}{\partial \boldsymbol{\mu}_k} = [\text{many steps omitted!}] = \boldsymbol{\Sigma}^{-1}\left(\sum_{n=1}^{N_k}\mathbf{x}_n^{(k)} - N_k\boldsymbol{\mu}_k\right) = \mathbf{0} \quad\Rightarrow\quad \boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{n=1}^{N_k}\mathbf{x}_n^{(k)}$$

2. Setting the derivative with respect to the common covariance to zero (H is applied element-wise and accounts for the symmetry of Σ):

$$\frac{\partial L}{\partial \boldsymbol{\Sigma}} = [\text{many steps omitted!}] = \mathbf{H}\circ\left[\boldsymbol{\Sigma}^{-1}\left(\sum_{k=1}^{c}\sum_{n=1}^{N_k}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)^T - N\boldsymbol{\Sigma}\right)\right] = \mathbf{0}$$

$$\Rightarrow\quad \boldsymbol{\Sigma} = \frac{1}{N}\sum_{k=1}^{c}\sum_{n=1}^{N_k}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)^T, \qquad \mathbf{H} = \begin{bmatrix} 0.5 & 1 & \cdots & 1\\ 1 & 0.5 & & \vdots\\ \vdots & & \ddots & 1\\ 1 & \cdots & 1 & 0.5 \end{bmatrix}$$
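A minimal sketch of these maximum-likelihood estimates, assuming the data arrives as an (N, d) array X with integer class labels y (the names are illustrative):

```python
import numpy as np

# LDA ML estimates derived above: class means mu_k and a single
# pooled covariance Sigma, normalised by the total sample count N.
def lda_fit(X, y):
    N, d = X.shape
    classes = np.unique(y)
    mus = np.array([X[y == k].mean(axis=0) for k in classes])  # mu_k
    Sigma = np.zeros((d, d))
    for i, k in enumerate(classes):
        D = X[y == k] - mus[i]
        Sigma += D.T @ D            # sum of (x - mu_k)(x - mu_k)^T
    Sigma /= N                      # pooled over ALL N samples
    priors = np.array([(y == k).mean() for k in classes])      # pi_k
    return mus, Sigma, priors
```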

QDA: Separate Gaussian parametric model for the class densities

The class densities are modelled with a d-dimensional Gaussian model with a separate covariance for each class:

$$f\!\left(\mathbf{x}\mid\boldsymbol{\theta}_k\right) = f\!\left(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}^{(k)}\right) = (2\pi)^{-d/2}\,\lvert\boldsymbol{\Sigma}^{(k)}\rvert^{-1/2} \exp\!\left[-\tfrac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_k\right)^T \left(\boldsymbol{\Sigma}^{(k)}\right)^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_k\right)\right]$$

where Σ^(k) is the class-specific covariance and μ_k is the class mean.

QDA: Log-likelihood

For a training example:

$$\log f\!\left(\mathbf{x}_n^{(k)}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}^{(k)}\right) = -\tfrac{d}{2}\log(2\pi) - \tfrac{1}{2}\log\lvert\boldsymbol{\Sigma}^{(k)}\rvert - \tfrac{1}{2}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)^T\left(\boldsymbol{\Sigma}^{(k)}\right)^{-1}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)$$

Hence

$$\sum_{k=1}^{c}\sum_{n=1}^{N_k}\log f\!\left(\mathbf{x}_n^{(k)}\mid\boldsymbol{\theta}_k\right) = -\tfrac{dN}{2}\log(2\pi) - \tfrac{1}{2}\sum_{k=1}^{c} N_k\log\lvert\boldsymbol{\Sigma}^{(k)}\rvert - \tfrac{1}{2}\sum_{k=1}^{c}\sum_{n=1}^{N_k}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)^T\left(\boldsymbol{\Sigma}^{(k)}\right)^{-1}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)$$

And the log-likelihood over all training examples:

$$L\!\left(\boldsymbol{\mu}_1,\ldots,\boldsymbol{\mu}_c,\boldsymbol{\Sigma}^{(1)},\ldots,\boldsymbol{\Sigma}^{(c)}\right) = -\tfrac{dN}{2}\log(2\pi) - \tfrac{1}{2}\sum_{k=1}^{c} N_k\log\lvert\boldsymbol{\Sigma}^{(k)}\rvert - \tfrac{1}{2}\sum_{k=1}^{c}\sum_{n=1}^{N_k}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)^T\left(\boldsymbol{\Sigma}^{(k)}\right)^{-1}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right) + \sum_{k=1}^{c} N_k\log\pi_k$$

QDA: Maximising the log-likelihood function

1. As for LDA, the class means are:

$$\frac{\partial L}{\partial \boldsymbol{\mu}_k} = [\text{many steps omitted!}] = \left(\boldsymbol{\Sigma}^{(k)}\right)^{-1}\left(\sum_{n=1}^{N_k}\mathbf{x}_n^{(k)} - N_k\boldsymbol{\mu}_k\right) = \mathbf{0} \quad\Rightarrow\quad \boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{n=1}^{N_k}\mathbf{x}_n^{(k)}$$

2. For each class-specific covariance (with H the same symmetry matrix as before):

$$\frac{\partial L}{\partial \boldsymbol{\Sigma}^{(k)}} = [\text{many steps omitted!}] = \mathbf{H}\circ\left[\left(\boldsymbol{\Sigma}^{(k)}\right)^{-1}\left(\sum_{n=1}^{N_k}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)^T - N_k\boldsymbol{\Sigma}^{(k)}\right)\right] = \mathbf{0}$$

$$\Rightarrow\quad \boldsymbol{\Sigma}^{(k)} = \frac{1}{N_k}\sum_{n=1}^{N_k}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)^T$$
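The QDA estimates differ from LDA only in the covariance step; a sketch under the same assumed (X, y) layout:

```python
import numpy as np

# QDA ML estimates: same class means, but a separate covariance
# Sigma^(k) per class, normalised by N_k rather than pooled over N.
def qda_fit(X, y):
    classes = np.unique(y)
    mus, Sigmas, priors = [], [], []
    for k in classes:
        Xk = X[y == k]
        mu = Xk.mean(axis=0)
        D = Xk - mu
        mus.append(mu)
        Sigmas.append(D.T @ D / len(Xk))  # (1/N_k) sum (x-mu)(x-mu)^T
        priors.append(len(Xk) / len(X))
    return np.array(mus), Sigmas, np.array(priors)
```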

Covariance inversion

If a covariance matrix does not have full rank then it cannot be inverted.
Workaround:
– Identify the eigenvectors of the covariance matrix with zero eigenvalues
– Remove the corresponding columns of the covariance matrix (equivalent to removing the corresponding features)
– Invert the resulting submatrix
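One way to realise this workaround, sketched here via an eigendecomposition: inverting within the retained subspace is equivalent to removing the zero-eigenvalue directions and inverting the remaining submatrix (i.e. a pseudo-inverse).

```python
import numpy as np

# Drop directions whose eigenvalues are numerically zero, then invert
# within the retained subspace (the Moore-Penrose pseudo-inverse).
def safe_inverse(Sigma, tol=1e-10):
    w, V = np.linalg.eigh(Sigma)       # eigenvalues w, eigenvectors V
    keep = w > tol * w.max()           # discard zero-eigenvalue directions
    return (V[:, keep] / w[keep]) @ V[:, keep].T
```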

Training Equations

For each of the N (d×1) training feature vectors x_n, define an associated (c×1) target vector t_n which has one element set to 1 and all other elements set to zero. The position of the element with value 1 indicates the class; e.g. for a four-class problem, a target vector indicating that the associated training feature vector belongs to class 3 is t = [0 0 1 0]ᵀ.

Form a (d×N) matrix of feature vectors X = [x₁ x₂ … x_N], a (c×N) matrix of target vectors T = [t₁ t₂ … t_N], a (d×c) matrix of mean vectors M = [μ₁ μ₂ … μ_c], and a (c×1) vector of prior probabilities π = [π₁ π₂ … π_c]ᵀ.

Matlab implementation – Target vectors

Vector   Class 1   Class 2   Class 3   Class 4   Class 5
t1       1         0         0         0         0
t2       0         1         0         0         0
t3       0         0         1         0         0
t4       0         0         0         1         0
t5       0         0         0         0         1

Training Equations

$$\mathbf{M} = \mathbf{X}\mathbf{T}^T\left(\mathbf{T}\mathbf{T}^T\right)^{-1}$$

LDA (common covariance):

$$\boldsymbol{\Sigma} = \frac{1}{N}\left(\mathbf{X} - \mathbf{M}\mathbf{T}\right)\mathbf{X}^T$$

QDA (class-specific covariance):

$$\boldsymbol{\Sigma}^{(k)} = \frac{1}{N_k}\left(\mathbf{X}\circ\left(\mathbf{1}_d\,\mathbf{T}_{k,\cdot}\right) - \boldsymbol{\mu}_k\,\mathbf{T}_{k,\cdot}\right)\mathbf{X}^T$$

where T_{k,·} means the kth row of matrix T and 1_d is a (d×1) vector of ones.
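A sketch of these matrix training equations, assuming X is the (d×N) feature matrix and T the (c×N) one-hot target matrix defined above; the prior estimate π_k = N_k/N is the standard ML plug-in, not shown in the equations:

```python
import numpy as np

# Matrix-form LDA training: class means, pooled covariance, priors.
def train_lda_matrix(X, T):
    d, N = X.shape
    M = X @ T.T @ np.linalg.inv(T @ T.T)  # M = X T' (T T')^-1, class means
    Sigma = (X - M @ T) @ X.T / N         # Sigma = (X - M T) X' / N
    priors = T.sum(axis=1) / N            # pi_k = N_k / N
    return M, Sigma, priors
```

Note T Tᵀ is simply diag(N₁,…,N_c), and the columns of M T hold the mean of each sample's own class, which is why the covariance expression collapses to the sum of (x−μ_k)(x−μ_k)ᵀ terms.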

Processing – general form

$$p_k = \frac{\pi_k f\!\left(\mathbf{x}\mid\boldsymbol{\theta}_k\right)}{\sum_{l=1}^{c}\pi_l f\!\left(\mathbf{x}\mid\boldsymbol{\theta}_l\right)} = \frac{K\,\pi_k f\!\left(\mathbf{x}\mid\boldsymbol{\theta}_k\right)}{\sum_{l=1}^{c} K\,\pi_l f\!\left(\mathbf{x}\mid\boldsymbol{\theta}_l\right)} = \frac{\exp\!\left(\log\!\left(\pi_k f\!\left(\mathbf{x}\mid\boldsymbol{\theta}_k\right)K\right)\right)}{\sum_{l=1}^{c}\exp\!\left(\log\!\left(\pi_l f\!\left(\mathbf{x}\mid\boldsymbol{\theta}_l\right)K\right)\right)} = \frac{\exp\left(y_k\right)}{\sum_{l=1}^{c}\exp\left(y_l\right)}$$

where $y_k = \log\!\left(\pi_k f\!\left(\mathbf{x}\mid\boldsymbol{\theta}_k\right)K\right)$ and K is any constant common to all classes.

Since the posteriors sum to one,

$$\sum_{k=1}^{c} p_k = \frac{\sum_{k=1}^{c}\pi_k f\!\left(\mathbf{x}\mid\boldsymbol{\theta}_k\right)}{\sum_{l=1}^{c}\pi_l f\!\left(\mathbf{x}\mid\boldsymbol{\theta}_l\right)} = 1 \qquad\therefore\quad p_c = 1 - \sum_{k=1}^{c-1} p_k = \frac{\exp\left(y_c\right)}{\sum_{l=1}^{c}\exp\left(y_l\right)}$$
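This constant-K freedom is exactly what makes the computation numerically safe in practice: subtracting max(y) inside the exponentials changes K but not the posteriors. A minimal sketch:

```python
import numpy as np

# Stable softmax: shifting all scores by a constant leaves the
# posteriors unchanged (the constant K cancels top and bottom).
def posteriors_from_scores(y):
    z = np.exp(y - y.max())
    return z / z.sum()

print(posteriors_from_scores(np.array([-1000.0, -1001.0])))  # no underflow
```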

Processing – LDA

$$\begin{aligned}
\log\!\left(\pi_k f\!\left(\mathbf{x}\mid\boldsymbol{\theta}_k\right)\right) &= \log\pi_k - \tfrac{d}{2}\log(2\pi) - \tfrac{1}{2}\log\lvert\boldsymbol{\Sigma}\rvert - \tfrac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_k\right)^T\boldsymbol{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_k\right)\\
&= \log\pi_k - \tfrac{d}{2}\log(2\pi) - \tfrac{1}{2}\log\lvert\boldsymbol{\Sigma}\rvert - \tfrac{1}{2}\mathbf{x}^T\boldsymbol{\Sigma}^{-1}\mathbf{x} + \boldsymbol{\mu}_k^T\boldsymbol{\Sigma}^{-1}\mathbf{x} - \tfrac{1}{2}\boldsymbol{\mu}_k^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k
\end{aligned}$$

Choosing $K = (2\pi)^{d/2}\lvert\boldsymbol{\Sigma}\rvert^{1/2}\exp\!\left(\tfrac{1}{2}\mathbf{x}^T\boldsymbol{\Sigma}^{-1}\mathbf{x}\right)$ (all terms common to every class):

$$y_k = \log\!\left(\pi_k f\!\left(\mathbf{x}\mid\boldsymbol{\theta}_k\right)K\right) = \mathbf{a}_k^T\mathbf{x} + b_k, \qquad \mathbf{a}_k = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k, \qquad b_k = \log\pi_k - \tfrac{1}{2}\boldsymbol{\mu}_k^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k$$

A linear equation in x.
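A sketch of classification with this linear discriminant, assuming mus is a (c×d) array of class means (illustrative names):

```python
import numpy as np

# y_k = a_k' x + b_k with a_k = Sigma^-1 mu_k and
# b_k = log(pi_k) - mu_k' Sigma^-1 mu_k / 2.
def lda_scores(x, mus, Sigma, priors):
    Sinv = np.linalg.inv(Sigma)
    A = mus @ Sinv                                   # rows are a_k'
    b = np.log(priors) - 0.5 * np.einsum('kd,kd->k', A, mus)
    return A @ x + b     # take argmax, or softmax for the posteriors
```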

Processing – QDA

$$\log\!\left(\pi_k f\!\left(\mathbf{x}\mid\boldsymbol{\theta}_k\right)\right) = \log\pi_k - \tfrac{d}{2}\log(2\pi) - \tfrac{1}{2}\log\lvert\boldsymbol{\Sigma}^{(k)}\rvert - \tfrac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_k\right)^T\left(\boldsymbol{\Sigma}^{(k)}\right)^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_k\right)$$

Choosing $K = (2\pi)^{d/2}$ (the only term common to every class):

$$y_k = \log\!\left(\pi_k f\!\left(\mathbf{x}\mid\boldsymbol{\theta}_k\right)K\right) = \log\pi_k - \tfrac{1}{2}\log\lvert\boldsymbol{\Sigma}^{(k)}\rvert - \tfrac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_k\right)^T\left(\boldsymbol{\Sigma}^{(k)}\right)^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_k\right)$$

A quadratic equation in x.
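And the quadratic counterpart, one score per class:

```python
import numpy as np

# y_k = log(pi_k) - log|Sigma_k|/2 - (x-mu_k)' Sigma_k^-1 (x-mu_k)/2
def qda_scores(x, mus, Sigmas, priors):
    y = []
    for mu, S, p in zip(mus, Sigmas, priors):
        d = x - mu
        _, logdet = np.linalg.slogdet(S)        # stable log-determinant
        y.append(np.log(p) - 0.5 * logdet - 0.5 * d @ np.linalg.solve(S, d))
    return np.array(y)
```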

Feedforward Neural Networks

Multilayer perceptron artificial neural network, with 0 or more hidden layers.

[Figure: a multilayer perceptron with input features x₁…x_d, a hidden layer of M units with weights w^(1), biases b^(1) and outputs y^(1), and an output layer with weights w^(2), biases b^(2) producing class outputs y₁^(2)…y_c^(2)]

Feedforward Neural Networks

Flexible linear or nonlinear mapping from features to classes. A feedforward neural network is a 'universal function approximator'. Except for linear networks, training requires numerical optimisation; the back-propagation algorithm is used for efficient training.

Nodal equation:

$$y_m^{(1)} = \varphi\!\left(b_m^{(1)} + \sum_{i=1}^{d} w_{mi}^{(1)} x_i\right)$$

where the bracketed term is the linearly summed inputs, φ is the non-linearity, and y_m^(1) is the output.
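A sketch of this nodal equation applied to a whole layer at once; the logistic non-linearity and the random weights are assumptions for illustration:

```python
import numpy as np

# y^(1) = phi(b^(1) + W^(1) x), here with phi = logistic sigmoid
def layer_forward(x, W, b):
    return 1.0 / (1.0 + np.exp(-(b + W @ x)))

# Hypothetical two-layer network: d inputs -> M hidden -> c outputs
rng = np.random.default_rng(0)
d, M, c = 4, 3, 2
x = rng.normal(size=d)
h = layer_forward(x, rng.normal(size=(M, d)), np.zeros(M))    # hidden layer
out = layer_forward(h, rng.normal(size=(c, M)), np.zeros(c))  # output layer
```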

3. Performance Measurement Methods

Two-way classification
– Prior probabilities
– Sensitivity
– Specificity
– Positive and negative predictivity
– Accuracy

Multiway classification
– Prior probabilities
– Sensitivity
– Specificity
– Positive and negative predictivity
– Accuracy

Data splitting

Two-way classification

Diagnostic        True status
Allocation        Abnormal (D)   Normal (~D)   Total
Abnormal (S)      a (TP)         b (FP)        a+b
Normal (~S)       c (FN)         d (TN)        c+d
Total             a+c            b+d           N

a is the number of cases which were classified abnormal and were truly abnormal (true positives, TP). b is the number of cases which were classified abnormal but were in fact normal (false positives, FP). c is the number of cases which were classified normal but were in fact abnormal (false negatives, FN). d is the number of cases which were classified normal and were truly normal (true negatives, TN). N is the total number of cases.

Probability of having the disease: P = (a+c)/N
Probability of not having the disease: (b+d)/N = 1 − P

Two way classification

Sensitivity (Se) = p(S|D) = Percentage of well classified abnormals aa c

=+

Specificity (Sp) = p(~S|~D)= Percentage of well classified normals db d

=+

Diagnostic True statusAllocation Abnormal (D) Normal (~D) Total

Abnormal (S) a b a+b

Normal (~S) c d c+d

Total a+c b+d N

Accuracy (A) = Percentage of well classified cases a dN

=+

Two way classification

Predictive value of positive test (PV+) = p(D|S)

=Percentage of well classified positives aa b

P SeP Se P Sp

=+

=+ − −

.. ( ).( )1 1

Predictive value of a negative test (PV-) =p(~D|~S)

=Percentage of well classified negatives dc d

P SpP Se P Sp

=+

=−

− + −( ).

.( ) ( ).1

1 1

Diagnostic True statusAllocation Abnormal (D) Normal (~D) Total

Abnormal (S) a b a+b

Normal (~S) c d c+d

Total a+c b+d N
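A sketch collecting the two-way measures above from the table counts a, b, c, d:

```python
# Two-way performance measures from the confusion table counts.
def two_way_measures(a, b, c, d):
    N = a + b + c + d
    return {
        'sensitivity': a / (a + c),   # Se = p(S|D)
        'specificity': d / (b + d),   # Sp = p(~S|~D)
        'ppv': a / (a + b),           # PV+ = p(D|S)
        'npv': d / (c + d),           # PV- = p(~D|~S)
        'accuracy': (a + d) / N,
    }
```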

Multiway classification

                 Diagnostic Allocation
True Status      No disease   Disease 1   Disease 2   …   Disease n   Sum
No disease       N00          N01         N02         …   N0n         N0.
Disease 1        N10          N11         N12         …   N1n         N1.
Disease 2        N20          N21         N22         …   N2n         N2.
⋮
Disease n        Nn0          Nn1         Nn2         …   Nnn         Nn.
Sum              N.0          N.1         N.2         …   N.n         N..

Prevalence of disease i = probability of having disease i: P_i = N_i. / N..

Multiway classification (table as above)

Sensitivity for disease i = proportion of correctly classified cases with disease i: Se_i = N_ii / N_i.

Specificity = proportion of correctly classified normals = N00 / N0.

Accuracy (A) = proportion of correctly classified cases = Σ_i N_ii / N..

Multiway classification (table as above)

Predictive value for disease i = proportion of cases classified disease i which are correct: PV_i = N_ii / N.i
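A sketch of the multiway measures, assuming a confusion matrix Nmat with true status along the rows and diagnostic allocation along the columns, class 0 being 'no disease':

```python
import numpy as np

# Multiway performance measures from the n+1 x n+1 confusion matrix.
def multiway_measures(Nmat):
    row, col, total = Nmat.sum(1), Nmat.sum(0), Nmat.sum()
    return {
        'prevalence': row / total,                # P_i  = N_i. / N..
        'sensitivity': np.diag(Nmat) / row,       # Se_i = N_ii / N_i.
        'specificity': Nmat[0, 0] / row[0],       #        N_00 / N_0.
        'predictive_value': np.diag(Nmat) / col,  # PV_i = N_ii / N._i
        'accuracy': np.trace(Nmat) / total,
    }
```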

Data splitting

Ideal
– Maximum data for training
– Maximum data for testing
– Training and test data independent
– Conflicting requirements

Resubstitution
– Train and test on the same recordings
– Positively biased results

Holdout
– Train on one sample, test on the remaining sample

Cross fold validation

Cross fold validation

[Figure: grid of 10 ECG records × 5 runs, marking training and testing fold assignments]

Illustration of 5-fold cross validation using 10 ECG records. The data is divided into 5 mutually exclusive folds and the classifier is trained and tested 5 times. Each time a different test fold is used and the remainder of the data is used for training.

Unbiased, but computationally intensive.
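A sketch of the fold assignment in the figure, splitting record indices into k mutually exclusive folds:

```python
import numpy as np

# Each record appears in the test fold exactly once.
def kfold_indices(n_records, k, seed=0):
    idx = np.random.default_rng(seed).permutation(n_records)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, test

for train, test in kfold_indices(10, 5):
    pass  # train the classifier on `train` records, evaluate on `test`
```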

4. Features

Transformations
Missing values
Feature Selection

Transformations

Look at the histogram of each feature; if a feature has a skewed distribution, try applying a transformation that results in a less skewed distribution.

[Figure: histogram of the original feature; histogram of log(feature)]
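A minimal sketch of such a transformation; the lognormal feature is a synthetic stand-in for a skewed measurement:

```python
import numpy as np

# A log transform often makes a right-skewed feature more symmetric.
# log1p guards against zeros (an assumption about the feature's range).
feature = np.random.default_rng(0).lognormal(size=1000)  # skewed example
transformed = np.log1p(feature)                          # less skewed
```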

Missing Values

Practical data sets often have missing feature values due to faulty measurements etc. Most classifier models require all feature values to be present. What to do?

– Delete all cases with one or more missing feature values
– Estimate the missing feature values:
  – Replace with the average (a bad choice, as it skews the distribution)
  – Replace with a random value
  – Replace with the value from another case that is "similar" and has all features (see the sketch below)
  – See academic.uprm.edu/~eacuna/IFCS04r.pdf for a good summary
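A sketch of the "similar case" option (a nearest-neighbour donor), assuming features arrive as an (N, d) array with NaN marking missing values and at least one complete case:

```python
import numpy as np

# Fill each missing value from the most similar complete case,
# measuring similarity on the observed features only.
def impute_from_similar(X):
    X = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]   # assumes this is non-empty
    for row in X:
        miss = np.isnan(row)
        if miss.any():
            dists = ((complete[:, ~miss] - row[~miss]) ** 2).sum(axis=1)
            row[miss] = complete[dists.argmin(), miss]  # donor's values
    return X
```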

Feature Selection

The aim is to find a subset of the available features that provides "acceptable" performance.

– Fewer features mean easier implementation
– There may exist subsets of the available features that provide higher classification performance than the full feature set
– Irrelevant and redundant features in general reduce classifier performance
– Methods to look for include "filter" and "wrapper" methods, forward selection (sketched below), backward elimination, stepwise, exhaustive, and beam search

Part 2: Case Study: Sleep Apnea Detection using the Electrocardiogram

– Obstructive sleep apnea: 2-4% prevalence, disrupted sleep, treated with a Continuous Positive Airway Pressure (CPAP) mask
– Diagnosis uses the polysomnogram (multiple signals)
– The diagnostic test costs $1500 and is carried out in hospital
– Only 15% of those with the disease have been diagnosed

Case Study: Sleep Apnea Detection using the Electrocardiogram

Study Objective

See if we can determine a method that can reliably detect sleep apnea using the Electrocardiogram (ECG).
Benefits:
– The test can be done at home
– Low cost
– Reduced waiting lists in hospitals

Sleep Apnoea ECG database

Computers in Cardiology Conference 2000 Challenge
– Automated ECG apnoea detection

Uses modified lead V2 ECG from a PSG database of patients at Philipps University in Germany (T. Penzel), which had been scored by sleep physiologists using the complete polysomnogram. Supplied the raw ECG waveform from a single lead and QRS detection times (unverified). 70 records in total (about 8 hrs each); 35 released for training, 35 withheld for independent testing.

PSG scored on an epoch-by-epoch basis. The goal was to mimic the human scorer:

– Each epoch labelled as Normal (NR) or Sleep disordered respiration (SDR)

Epoch based scoring

Splitting up of the data: we have 70 overnight recordings of ECG
– Every minute annotated as 'normal' or 'sleep disordered breathing' by an expert
– Over 32,000 labels
– 35 recordings available for training, 35 withheld for testing

– Guilleminault et al. were the first to report on the characteristic bradycardia/tachycardia pattern associated with obstructive apnoeas (Guilleminault C et al., "Cyclical variation of heart rate in sleep apnea syndrome," Lancet, 1984).

– Ichimaru reported on low-frequency heart rate fluctuations caused by Cheyne-Stokes respiration (Ichimaru Y, Yanaga T, "Frequency characteristics of the heart rate variability produced by Cheyne-Stokes respiration during 24-hr ambulatory electrocardiographic monitoring," Comput Biomed Res, 1989).

Bradycardia/tachycardia patterns

[Figure: brady/tachy heart-rate patterns, from Stein et al., J. Cardiovasc. Electrophysiol., 2003]

EDR signal

Modulation of the chest-lead ECG signal amplitude by respiration.

[Figure: ECG trace with the respiration signal overlaid]

EDR(n) = area enclosed by the QRS complex(n)

ECG derived respiration (EDR)

A. Travaglini et al., "Respiratory signal derived from eight-lead ECG," Computers in Cardiology, 1998.

G. B. Moody et al., "Clinical Validation of the ECG-Derived Respiration (EDR) Technique," Computers in Cardiology, 1986.

[Figure: EDR example from Travaglini et al.]
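A sketch of the EDR definition above, producing one value per beat; the ±50 ms integration window around each QRS detection point is an assumption, not from the slides:

```python
import numpy as np

# EDR(n) = area of the ECG within a window around QRS complex n.
def edr(ecg, qrs_samples, fs):
    half = int(0.05 * fs)               # assumed 50 ms either side of QRS
    return np.array([np.trapz(ecg[q - half:q + half]) / fs
                     for q in qrs_samples])
```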

Features

EDR frequency domain             RR interval frequency domain    RR interval time domain
Record-based stdev EDR amplitude Record-based stdev RR interval  NN50
Epoch-based mean EDR amplitude   Epoch-based mean RR interval    pNN50
Epoch-based st.dev. EDR amplitude Epoch-based st.dev. RR interval Allan Factor at 5-25 s time scales
Record-based mean EDR amplitude  Record-based mean RR interval   SDNN
32 PSD features                  32 PSD features                 Serial correlation

88 features in all

Experiments

Linear and quadratic discriminant classifiers
– Quick to train
– Wanted to focus the study on the features, not the classifiers
– Wanted a system that is readily implemented on microprocessors

Different combinations of feature groups:
– RR time domain and frequency domain
– EDR frequency domain

Feature selection
Covariance regularisation (not discussed here)

Training – all features
35-fold cross validation; each fold contained 1 record
– Removed intra-record bias

Training – feature selection
Used the best-first feature selection strategy.
Outer loop: 35-fold cross validation. Inner loop: 34-fold cross validation.
As before, each fold contained 1 record (see the sketch below)
– Removed intra-record bias
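A sketch of this record-wise protocol, one fold per record so that epochs from the same recording never appear in both train and test:

```python
import numpy as np

# Yields boolean train/test masks over the epochs, one fold per record.
def record_folds(record_ids):
    for rec in np.unique(record_ids):
        test = record_ids == rec
        yield ~test, test

# e.g. record_ids = np.array([0, 0, 1, 1, 2, ...]), one entry per epoch
```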

Performance Assessment 1

During feature selection, classifiers were compared using 'Accuracy':

Accuracy = (TP+TN) / (TP+FN+FP+TN)

               Expert
Predicted      SDR      NR
SDR            TP       FP
NR             FN       TN

Performance Assessment 2

In addition, classifiers were assessed using sensitivity and specificity (same table as above):

Sensitivity = TP / (TP+FN)

Specificity = TN / (TN+FP)

Results – cross validation

(a) All features, (b) feature selection

– LDA better than QDA
– Feature selection improved QDA
– Best features were RR and EDR combined

Results – withheld set

– Results on new data (the withheld set) were similar to the cross-validation results, which is an encouraging sign!

Results – feature separation across classes

– RR PSD features: good separation at low frequencies
– EDR PSD features: good separation, particularly at low frequencies

Practical tips

– Always start with a simple system and add complexity only if needed, e.g. start with LDA and progress to a neural network if needed.
– Focus on finding good discriminating features. If your features are poor then no fancy classifier will help.
– As far as possible, make sure your performance assessment is unbiased:
  – Be careful of training and testing with features from the same record
  – Never make many performance assessments on the same data set and then report the best, as this will be a positively biased result. Be particularly careful with feature selection, where many thousands of comparisons may be made.
– Consider the target device for the pattern recognition system:
  – E.g. a low-power, low-computation device will influence your choice of features and classifier

Bibliography

Data splitting
– R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in Proc. 14th Int. Joint Conference on Artificial Intelligence, 1995, pp. 1137-1143.

Performance estimation
– M. H. Zweig (1993), "Receiver Operating Characteristic (ROC) Plots: A Fundamental Evaluation Tool in Clinical Medicine," Clin. Chem., vol. 39(4), pp. 561-577.
– J. Michaelis and J. L. Willems (1987), "Performance Evaluation of Diagnostic ECG Programs," in Proceedings of the Computers in Cardiology Conference, Leuven, Belgium, Sept 12-15, pp. 25-30, K. L. Ripley (Ed.), IEEE Computer Society.

Classifiers
– C. M. Bishop, Neural Networks for Pattern Recognition. New York: Oxford University Press, 1995.
– B. D. Ripley, Pattern Recognition and Neural Networks. Cambridge, England: Cambridge University Press, 1996.
– R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification. New York: John Wiley and Sons, 2001.
– Warren Sarle, SAS Institute, http://www.faqs.org/faqs/ai-faq/neural-nets/part1/
– Mike James, Classification Algorithms. John Wiley and Sons, 1985.

Sleep apnea
– P. de Chazal, C. Heneghan, E. Sheridan, R. B. Reilly, P. Nolan, M. O'Malley (2003), "Automated Processing of the Single Lead Electrocardiogram for the Detection of Obstructive Sleep Apnea," IEEE Transactions on Biomedical Engineering, vol. 50, no. 6, June 2003, pp. 686-696.

Email – contact me

philip.dechazal@biancamed.com
philip@ee.ucd.ie