Pattern Recognition applied to Biomedical Signals
Dr Philip de Chazal, Chief Technical Officer, BiancaMed, Dublin, Ireland
Outline
1st Hour: Focus on Pattern Recognition basics
1. Pattern Recognition Overview
2. Classifiers: LDA, QDA, FFNN, HMM etc
3. Performance Assessment
– Data splitting
– Performance measures (sensitivity, specificity etc)
4. Features
– Transformations
– Missing values
– Feature Selection
Outline
2nd hour: Case study on sleep apnea detection from the electrocardiogram
– Database
– Expert annotations
– Features
– Performance assessment
– Practical tips
1. Pattern Recognition Overview

Pattern Recognition

Supervised learning
– Parametric, plug-in parameters: Discriminant Analysis, Bayesian Classifiers
– Parametric, distributed parameters: Neural Networks (feedforward etc), HMM
– Non-parametric: Nearest neighbour (kNN), Decision Trees

Unsupervised learning
– Self-organising feature maps
– Cluster analysis
– Hebbian learning
– Vector quantisation
Classifier Types: Parametric models with plug-in parameters

Define a parametric model between the input features and the output classes. The model has adjustable parameters which are set using training data.

[Diagram: Features → Classifier (with adjustable parameters) → Classes]
2. Classification Methods
– Bayes theorem
– Gaussian models
– Linear discriminant analysis
  – Derivation
  – Covariance inversion
  – Training equations
  – Classifying equations
– Quadratic discriminant analysis
– Feedforward neural networks
Bayes theorem

[Figure: class-conditional curves $p(x|C_1)P(C_1)$ and $p(x|C_2)P(C_2)$ with the corresponding decision regions $\Re_1$ and $\Re_2$]

Optimally classify an object into one of c mutually exclusive classes given priors and class densities. For a c-class problem, Bayes' rule states that the posterior probability of the kth class is related to its prior probability and its class density function by

$$ p_k = \frac{\pi_k f\left(\mathbf{x}\mid\boldsymbol{\theta}_k\right)}{\sum_{l=1}^{c}\pi_l f\left(\mathbf{x}\mid\boldsymbol{\theta}_l\right)} $$

where $f\left(\mathbf{x}\mid\boldsymbol{\theta}_k\right)$ is the class density, $p_k$ the posterior probability, $\pi_k$ the prior probability and the denominator the normalising component.
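As a small numerical sketch of Bayes' rule above (assumptions: a hypothetical 1-D, two-class problem with equal priors and Gaussian class densities; none of these numbers come from the slides):

```python
import math

def posterior(x, priors, densities):
    """Bayes' rule: posterior p_k is prior_k * f(x | class k), normalised."""
    joint = [p * f(x) for p, f in zip(priors, densities)]
    norm = sum(joint)                      # normalising component
    return [j / norm for j in joint]

def gauss(mu, sigma):
    """1-D Gaussian density with mean mu and standard deviation sigma."""
    return lambda x: math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical example: class 1 centred at 0, class 2 centred at 2
p = posterior(0.8, priors=[0.5, 0.5], densities=[gauss(0.0, 1.0), gauss(2.0, 1.0)])
```

Since x = 0.8 lies closer to the class-1 mean, its posterior is larger, and the posteriors sum to one by construction.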
Likelihood function

If we have $N = \sum_{k=1}^{c} N_k$ labelled (d×1) data vectors $\mathbf{x}_n^{(k)},\ n = 1..N_k$, then a likelihood can be formed as follows:

$$ l\left(\boldsymbol{\theta}\right) = \prod_{k=1}^{c}\prod_{n=1}^{N_k} \pi_k f\left(\mathbf{x}_n^{(k)}\mid\boldsymbol{\theta}_k\right) $$
Our aim is to find the values of θ for each class that maximise the value of the likelihood l(θ). Equivalently we can find the values of θ that maximise the value of the log-likelihood:

$$ L\left(\boldsymbol{\theta}\right) = \log l\left(\boldsymbol{\theta}\right) = \sum_{k=1}^{c}\sum_{n=1}^{N_k}\log\left(\pi_k f\left(\mathbf{x}_n^{(k)}\mid\boldsymbol{\theta}_k\right)\right) = \sum_{k=1}^{c}\sum_{n=1}^{N_k}\log f\left(\mathbf{x}_n^{(k)}\mid\boldsymbol{\theta}_k\right) + \sum_{k=1}^{c} N_k\log\pi_k $$

Think of it as combined probabilities.
LDA: Gaussian parametric model for the class densities

The class densities are modelled with a Gaussian model (d-dimensional) with a common covariance Σ across all classes and a class mean μ_k:

$$ f\left(\mathbf{x}\mid\boldsymbol{\theta}_k\right) = f\left(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}\right) = (2\pi)^{-d/2}\left|\boldsymbol{\Sigma}\right|^{-1/2}\exp\left[-\tfrac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_k\right)^T\boldsymbol{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_k\right)\right] $$
LDA: Log-likelihood

For a training example:

$$ \log f\left(\mathbf{x}_n^{(k)}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}\right) = -\tfrac{d}{2}\log(2\pi) - \tfrac{1}{2}\log\left|\boldsymbol{\Sigma}\right| - \tfrac{1}{2}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)^T\boldsymbol{\Sigma}^{-1}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right) $$

Hence:

$$ \sum_{k=1}^{c}\sum_{n=1}^{N_k}\log f\left(\mathbf{x}_n^{(k)}\mid\boldsymbol{\theta}_k\right) = -\tfrac{dN}{2}\log(2\pi) - \tfrac{N}{2}\log\left|\boldsymbol{\Sigma}\right| - \tfrac{1}{2}\sum_{k=1}^{c}\sum_{n=1}^{N_k}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)^T\boldsymbol{\Sigma}^{-1}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right) $$

And the log-likelihood over all training examples:

$$ L\left(\boldsymbol{\mu}_1,...,\boldsymbol{\mu}_c,\boldsymbol{\Sigma}\right) = -\tfrac{dN}{2}\log(2\pi) - \tfrac{N}{2}\log\left|\boldsymbol{\Sigma}\right| - \tfrac{1}{2}\sum_{k=1}^{c}\sum_{n=1}^{N_k}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)^T\boldsymbol{\Sigma}^{-1}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right) + \sum_{k=1}^{c} N_k\log\pi_k $$
LDA: Maximising the log-likelihood function

1. Setting the derivative with respect to each class mean to zero:

$$ \frac{\partial L}{\partial\boldsymbol{\mu}_k} = [\text{many steps omitted!}] = \boldsymbol{\Sigma}^{-1}\left(\sum_{n=1}^{N_k}\mathbf{x}_n^{(k)} - N_k\boldsymbol{\mu}_k\right) = 0 \quad\Rightarrow\quad \boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{n=1}^{N_k}\mathbf{x}_n^{(k)} $$

2. Setting the derivative with respect to the common covariance to zero, where H (which accounts for the symmetry of Σ) is the matrix

$$ \mathbf{H} = \begin{bmatrix} 0.5 & 1 & \cdots & 1 \\ 1 & 0.5 & & \vdots \\ \vdots & & \ddots & 1 \\ 1 & \cdots & 1 & 0.5 \end{bmatrix} $$

$$ \frac{\partial L}{\partial\boldsymbol{\Sigma}} = [\text{many steps omitted!}] = \mathbf{H}\times\boldsymbol{\Sigma}^{-1}\left(\sum_{k=1}^{c}\sum_{n=1}^{N_k}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)^T - N\boldsymbol{\Sigma}\right) = 0 $$

$$ \Rightarrow\quad \boldsymbol{\Sigma} = \frac{1}{N}\sum_{k=1}^{c}\sum_{n=1}^{N_k}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)^T $$
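The maximum-likelihood estimates above (class means, priors and pooled covariance) can be sketched in a few lines of numpy; the toy data here is hypothetical:

```python
import numpy as np

def fit_lda(X, y, c):
    """Maximum-likelihood LDA estimates matching the equations above:
    mu_k = class mean, pi_k = N_k / N, Sigma = pooled covariance (divide by N)."""
    N, d = X.shape
    mus, priors = [], []
    Sigma = np.zeros((d, d))
    for k in range(c):
        Xk = X[y == k]
        mu = Xk.mean(axis=0)            # mu_k = (1/N_k) sum_n x_n^(k)
        mus.append(mu)
        priors.append(len(Xk) / N)      # pi_k = N_k / N
        D = Xk - mu
        Sigma += D.T @ D                # accumulate within-class scatter
    return np.array(mus), np.array(priors), Sigma / N   # ML estimate: /N, not /(N-1)

# Toy data (hypothetical): two classes in 2-D
X = np.array([[0., 0.], [1., 0.], [4., 4.], [5., 4.]])
y = np.array([0, 0, 1, 1])
mus, priors, Sigma = fit_lda(X, y, 2)
```

Note the maximum-likelihood covariance divides by N rather than the unbiased N−1.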
QDA: Separate Gaussian parametric model for the class densities

The class densities are modelled with a Gaussian model (d-dimensional) with a separate covariance Σ^(k) for each class:

$$ f\left(\mathbf{x}\mid\boldsymbol{\theta}_k\right) = f\left(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}^{(k)}\right) = (2\pi)^{-d/2}\left|\boldsymbol{\Sigma}^{(k)}\right|^{-1/2}\exp\left[-\tfrac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_k\right)^T\left(\boldsymbol{\Sigma}^{(k)}\right)^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_k\right)\right] $$
QDA: Log-likelihood

For a training example:

$$ \log f\left(\mathbf{x}_n^{(k)}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}^{(k)}\right) = -\tfrac{d}{2}\log(2\pi) - \tfrac{1}{2}\log\left|\boldsymbol{\Sigma}^{(k)}\right| - \tfrac{1}{2}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)^T\left(\boldsymbol{\Sigma}^{(k)}\right)^{-1}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right) $$

Hence:

$$ \sum_{k=1}^{c}\sum_{n=1}^{N_k}\log f\left(\mathbf{x}_n^{(k)}\mid\boldsymbol{\theta}_k\right) = -\tfrac{dN}{2}\log(2\pi) - \tfrac{1}{2}\sum_{k=1}^{c}N_k\log\left|\boldsymbol{\Sigma}^{(k)}\right| - \tfrac{1}{2}\sum_{k=1}^{c}\sum_{n=1}^{N_k}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)^T\left(\boldsymbol{\Sigma}^{(k)}\right)^{-1}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right) $$

And the log-likelihood over all training examples:

$$ L\left(\boldsymbol{\mu}_1,...,\boldsymbol{\mu}_c,\boldsymbol{\Sigma}^{(1)},...,\boldsymbol{\Sigma}^{(c)}\right) = -\tfrac{dN}{2}\log(2\pi) - \tfrac{1}{2}\sum_{k=1}^{c}N_k\log\left|\boldsymbol{\Sigma}^{(k)}\right| - \tfrac{1}{2}\sum_{k=1}^{c}\sum_{n=1}^{N_k}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)^T\left(\boldsymbol{\Sigma}^{(k)}\right)^{-1}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right) + \sum_{k=1}^{c}N_k\log\pi_k $$
QDA: Maximising the log-likelihood function

1. As for LDA, setting the derivative with respect to each class mean to zero:

$$ \frac{\partial L}{\partial\boldsymbol{\mu}_k} = [\text{many steps omitted!}] = \left(\boldsymbol{\Sigma}^{(k)}\right)^{-1}\left(\sum_{n=1}^{N_k}\mathbf{x}_n^{(k)} - N_k\boldsymbol{\mu}_k\right) = 0 \quad\Rightarrow\quad \boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{n=1}^{N_k}\mathbf{x}_n^{(k)} $$

2. Setting the derivative with respect to each class covariance to zero (H as before):

$$ \frac{\partial L}{\partial\boldsymbol{\Sigma}^{(k)}} = [\text{many steps omitted!}] = \mathbf{H}\times\left(\boldsymbol{\Sigma}^{(k)}\right)^{-1}\left(\sum_{n=1}^{N_k}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)^T - N_k\boldsymbol{\Sigma}^{(k)}\right) = 0 $$

$$ \Rightarrow\quad \boldsymbol{\Sigma}^{(k)} = \frac{1}{N_k}\sum_{n=1}^{N_k}\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)\left(\mathbf{x}_n^{(k)}-\boldsymbol{\mu}_k\right)^T $$
Covariance inversion

If a covariance matrix does not have full rank then it cannot be inverted. Work-around:
– Identify the eigenvectors of the covariance matrix with zero eigenvalues
– Remove the corresponding columns of the covariance matrix (equivalent to removing the corresponding features)
– Invert the submatrix
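A closely related work-around is to invert only over the non-degenerate eigen-directions (a pseudo-inverse), rather than literally deleting feature columns. A sketch, assuming numpy and treating eigenvalues below a small tolerance as zero:

```python
import numpy as np

def safe_inverse(S, tol=1e-10):
    """Invert a possibly rank-deficient covariance matrix by dropping
    the eigen-directions with ~zero eigenvalues (zero variance)."""
    w, V = np.linalg.eigh(S)           # S = V diag(w) V^T (S is symmetric)
    keep = w > tol                     # directions with non-zero variance
    w_inv = np.zeros_like(w)
    w_inv[keep] = 1.0 / w[keep]
    return V @ np.diag(w_inv) @ V.T    # inverse on the kept subspace

# Rank-1 "covariance": ordinary inversion would fail here
S = np.array([[1.0, 1.0], [1.0, 1.0]])
S_inv = safe_inverse(S)
```

The result agrees with the Moore-Penrose pseudo-inverse on this subspace.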
Training Equations

Define an associated (c×1) target vector $\mathbf{t}_n$ for each of the N (d×1) training feature vectors $\mathbf{x}_n$, which has one element set to 1 and all other elements set to zero. The position of the element with value 1 indicates the class, e.g. for a four-class problem a target vector indicating that the associated training feature vector belongs to class 3 is $\mathbf{t} = [0\ 0\ 1\ 0]^T$.

Form a (d×N) matrix of feature vectors $\mathbf{X} = [\mathbf{x}_1\ \mathbf{x}_2\ ...\ \mathbf{x}_N]$, a (c×N) matrix of target vectors $\mathbf{T} = [\mathbf{t}_1\ \mathbf{t}_2\ ...\ \mathbf{t}_N]$, a (d×c) matrix of mean vectors $\mathbf{M} = [\boldsymbol{\mu}_1\ \boldsymbol{\mu}_2\ ...\ \boldsymbol{\mu}_c]$ and a (c×1) vector of prior probabilities $\boldsymbol{\pi} = [\pi_1\ \pi_2\ ...\ \pi_c]^T$.
Matlab implementation – Target vectors

Vector   Class 1   Class 2   Class 3   Class 4   Class 5
t1       1         0         0         0         0
t2       0         1         0         0         0
t3       0         0         1         0         0
t4       0         0         0         1         0
t5       0         0         0         0         1
Training Equations

LDA:

$$ \mathbf{M} = \mathbf{X}\mathbf{T}^T\left(\mathbf{T}\mathbf{T}^T\right)^{-1} $$

$$ \boldsymbol{\Sigma} = \frac{1}{N}\left(\mathbf{X} - \mathbf{M}\mathbf{T}\right)\mathbf{X}^T $$

QDA (where $\mathbf{T}_{k.}$ means the kth row of matrix $\mathbf{T}$, a 1×N indicator of the class-k examples):

$$ \boldsymbol{\Sigma}^{(k)} = \frac{1}{N_k}\left(\mathbf{X}\operatorname{diag}\left(\mathbf{T}_{k.}\right) - \boldsymbol{\mu}_k\mathbf{T}_{k.}\right)\mathbf{X}^T $$
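The matrix training equations can be checked numerically. A sketch assuming numpy, with a hypothetical toy data set (d = 2 features as columns of X, N = 4 examples, c = 2 classes encoded one-hot in T):

```python
import numpy as np

# Hypothetical data: X is d x N (one column per example), T is c x N one-hot
X = np.array([[0., 1., 4., 5.],
              [0., 0., 4., 4.]])
T = np.array([[1., 1., 0., 0.],
              [0., 0., 1., 1.]])
N = X.shape[1]

M = X @ T.T @ np.linalg.inv(T @ T.T)    # (d x c) matrix of class means
Sigma = (X - M @ T) @ X.T / N           # pooled covariance (LDA)
```

M @ T expands each class mean to the columns of its class, so (X − M T) holds the per-example deviations from the class mean, and the product with X^T gives the pooled scatter.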
Processing – general form

Any constant factor K cancels from Bayes' rule, so the posterior can be written as a softmax of class scores:

$$ p_k = \frac{\pi_k f\left(\mathbf{x}\mid\boldsymbol{\theta}_k\right)}{\sum_{l=1}^{c}\pi_l f\left(\mathbf{x}\mid\boldsymbol{\theta}_l\right)} = \frac{K\,\pi_k f\left(\mathbf{x}\mid\boldsymbol{\theta}_k\right)}{\sum_{l=1}^{c}K\,\pi_l f\left(\mathbf{x}\mid\boldsymbol{\theta}_l\right)} = \frac{\exp\left(\log\left(\pi_k f\left(\mathbf{x}\mid\boldsymbol{\theta}_k\right)\right)+\log K\right)}{\sum_{l=1}^{c}\exp\left(\log\left(\pi_l f\left(\mathbf{x}\mid\boldsymbol{\theta}_l\right)\right)+\log K\right)} = \frac{\exp\left(y_k\right)}{\sum_{l=1}^{c}\exp\left(y_l\right)} $$

where $y_k = \log\left(\pi_k f\left(\mathbf{x}\mid\boldsymbol{\theta}_k\right)\right) + \log K$.

Since the posteriors sum to one, the last one follows from the others:

$$ \sum_{k=1}^{c} p_k = 1 \quad\Rightarrow\quad p_c = 1 - \sum_{k=1}^{c-1} p_k $$
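The exp/sum-of-exps form above is the softmax function. In code it is usually evaluated with the maximum score subtracted first; this is the same trick as the constant K, which cancels from numerator and denominator. A sketch assuming numpy:

```python
import numpy as np

def softmax(y):
    """Posterior probabilities from class scores y_k.
    Subtracting max(y) is a common constant and cancels, exactly as log K does,
    but prevents overflow in exp()."""
    z = np.exp(y - np.max(y))
    return z / z.sum()

# Naively computing exp(1000) would overflow; the shifted form is fine
p = softmax(np.array([1000.0, 1001.0]))
```

Because any common shift cancels, the result equals softmax([0, 1]).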
Processing - LDA

$$ \log\left(\pi_k f\left(\mathbf{x}\mid\boldsymbol{\theta}_k\right)\right) = \log\pi_k - \tfrac{d}{2}\log(2\pi) - \tfrac{1}{2}\log\left|\boldsymbol{\Sigma}\right| - \tfrac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_k\right)^T\boldsymbol{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_k\right) $$

Expanding the quadratic term and collecting the terms that do not depend on k into the constant K:

$$ \log\left(\pi_k f\left(\mathbf{x}\mid\boldsymbol{\theta}_k\right)\right) = \log\pi_k + \boldsymbol{\mu}_k^T\boldsymbol{\Sigma}^{-1}\mathbf{x} - \tfrac{1}{2}\boldsymbol{\mu}_k^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k + K, \qquad K = -\tfrac{d}{2}\log(2\pi) - \tfrac{1}{2}\log\left|\boldsymbol{\Sigma}\right| - \tfrac{1}{2}\mathbf{x}^T\boldsymbol{\Sigma}^{-1}\mathbf{x} $$

Hence the class score is a linear equation in x:

$$ y_k = \mathbf{a}_k^T\mathbf{x} + b_k, \qquad \mathbf{a}_k = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k, \qquad b_k = \log\pi_k - \tfrac{1}{2}\boldsymbol{\mu}_k^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k $$
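Classification then reduces to evaluating a linear score per class and taking the largest. A minimal numpy sketch, assuming fitted parameters (the values below are hypothetical: two classes with identity covariance and equal priors):

```python
import numpy as np

def lda_scores(x, mus, priors, Sigma):
    """Linear class scores y_k = a_k^T x + b_k with
    a_k = Sigma^{-1} mu_k and b_k = log pi_k - 0.5 mu_k^T Sigma^{-1} mu_k."""
    Sinv = np.linalg.inv(Sigma)
    ys = []
    for mu, pi in zip(mus, priors):
        a = Sinv @ mu
        b = np.log(pi) - 0.5 * mu @ Sinv @ mu
        ys.append(a @ x + b)
    return np.array(ys)

# Hypothetical parameters: class 0 at the origin, class 1 at (4, 4)
mus = [np.array([0., 0.]), np.array([4., 4.])]
scores = lda_scores(np.array([0.5, 0.5]), mus, [0.5, 0.5], np.eye(2))
```

The test point lies near the class-0 mean, so its linear score is the larger one.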
Processing - QDA

$$ \log\left(\pi_k f\left(\mathbf{x}\mid\boldsymbol{\theta}_k\right)\right) = \log\pi_k - \tfrac{d}{2}\log(2\pi) - \tfrac{1}{2}\log\left|\boldsymbol{\Sigma}^{(k)}\right| - \tfrac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_k\right)^T\left(\boldsymbol{\Sigma}^{(k)}\right)^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_k\right) $$

Only $K = -\tfrac{d}{2}\log(2\pi)$ is common to all classes, so the class score is a quadratic equation in x:

$$ y_k = \log\pi_k - \tfrac{1}{2}\log\left|\boldsymbol{\Sigma}^{(k)}\right| - \tfrac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_k\right)^T\left(\boldsymbol{\Sigma}^{(k)}\right)^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_k\right) $$
Feedforward Neural Networks

Multilayer perceptron artificial neural network with 0 or more hidden layers.

[Figure: network diagram mapping input features x1..xd through a hidden layer with weights w^(1), biases b^(1) and outputs y^(1), then an output layer with weights w^(2), biases b^(2) and outputs y^(2) giving the output classes]
Feedforward Neural Networks

Flexible linear or nonlinear mapping from features to classes. A feedforward neural network is a 'universal function approximator'. Except for linear networks, training requires numerical optimisation; the back-propagation algorithm is used for efficient training.

Nodal equation (linearly summed inputs passed through a non-linearity φ to give the output):

$$ y_m^{(1)} = \varphi\left(b_m^{(1)} + \sum_{i=1}^{d} w_{mi}^{(1)} x_i\right) $$
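Stacking the nodal equation layer by layer gives the whole forward pass. A sketch assuming numpy; the tiny network below (2 inputs, 3 tanh hidden nodes, 2 linear outputs) and its random weights are hypothetical:

```python
import numpy as np

def layer(x, W, b, phi=np.tanh):
    """Nodal equation for one layer: y_m = phi(b_m + sum_i w_mi * x_i)."""
    return phi(b + W @ x)

# Hypothetical tiny network: 2 inputs -> 3 hidden (tanh) -> 2 outputs (linear)
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((2, 3)), rng.standard_normal(2)

x = np.array([0.5, -1.0])
h = layer(x, W1, b1)                     # hidden layer, tanh non-linearity
y = layer(h, W2, b2, phi=lambda z: z)    # linear output layer
```

For a linear network (identity φ throughout) the whole mapping collapses to a single linear map, which is why only nonlinear networks need iterative training.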
3. Performance Measurement Methods
Two-way classification
– Prior probabilities
– Sensitivity
– Specificity
– Positive and negative predictivity
– Accuracy

Multiway classification
– Prior probabilities
– Sensitivity
– Specificity
– Positive and negative predictivity
– Accuracy
Data splitting
Two-way classification

                        True status
Diagnostic Allocation   Abnormal (D)   Normal (~D)   Total
Abnormal (S)            a (TP)         b (FP)        a+b
Normal (~S)             c (FN)         d (TN)        c+d
Total                   a+c            b+d           N

a is the number of cases which were classified abnormal and were truly abnormal (true positives, TP). b is the number of cases which were classified abnormal but were in fact normal (false positives, FP). c is the number of cases which were classified normal but were in fact abnormal (false negatives, FN). d is the number of cases which were classified normal and were truly normal (true negatives, TN). N is the total number of cases.

Probability of having the disease: $P = \dfrac{a+c}{N}$

Probability of not having the disease: $\dfrac{b+d}{N} = 1 - P$
Two-way classification

Sensitivity (Se) = p(S|D) = percentage of well classified abnormals $= \dfrac{a}{a+c}$

Specificity (Sp) = p(~S|~D) = percentage of well classified normals $= \dfrac{d}{b+d}$

Accuracy (A) = percentage of well classified cases $= \dfrac{a+d}{N}$
Two-way classification

Predictive value of a positive test (PV+) = p(D|S) = percentage of well classified positives:

$$ PV^{+} = \frac{a}{a+b} = \frac{P \cdot Se}{P \cdot Se + (1-P)(1-Sp)} $$

Predictive value of a negative test (PV−) = p(~D|~S) = percentage of well classified negatives:

$$ PV^{-} = \frac{d}{c+d} = \frac{(1-P) \cdot Sp}{P(1-Se) + (1-P) \cdot Sp} $$
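The two-way measures follow directly from the table counts. A minimal sketch with hypothetical counts, which also lets us check the prevalence-based identity for PV+:

```python
def two_way_measures(a, b, c, d):
    """a = TP, b = FP, c = FN, d = TN, as in the two-way table above."""
    N = a + b + c + d
    se = a / (a + c)        # sensitivity
    sp = d / (b + d)        # specificity
    acc = (a + d) / N       # accuracy
    ppv = a / (a + b)       # predictive value of a positive test (PV+)
    npv = d / (c + d)       # predictive value of a negative test (PV-)
    return se, sp, acc, ppv, npv

# Hypothetical counts: 40 TP, 10 FP, 20 FN, 30 TN (prevalence P = 0.6)
se, sp, acc, ppv, npv = two_way_measures(a=40, b=10, c=20, d=30)
```

With these counts PV+ computed from the table equals P·Se / (P·Se + (1−P)(1−Sp)), as the identity above states.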
Multiway classification

                 Diagnostic Allocation
True Status      No disease  Disease 1  Disease 2  ..  Disease n  Sum
No disease       N00         N01        N02            N0n       N0.
Disease 1        N10         N11        N12            N1n       N1.
Disease 2        N20         N21        N22            N2n       N2.
:
Disease n        Nn0         Nn1        Nn2            Nnn       Nn.
Sum              N.0         N.1        N.2            N.n       N..

Prevalence of disease i = probability of having disease i: $P_i = \dfrac{N_{i.}}{N_{..}}$
Multiway classification

Sensitivity for disease i = proportion of correctly classified cases with disease i: $Se_i = \dfrac{N_{ii}}{N_{i.}}$

Specificity = proportion of correctly classified normals $= \dfrac{N_{00}}{N_{0.}}$

Accuracy (A) = proportion of correctly classified cases $= \dfrac{\sum_i N_{ii}}{N_{..}}$
Multiway classification

Predictive value for disease i = proportion of cases classified disease i which are correct: $PV_i = \dfrac{N_{ii}}{N_{.i}}$
Data splitting

Ideal
– Maximum data for training
– Maximum data for testing
– Training and test data independent
– These are conflicting requirements

Resubstitution
– Train and test on the same recordings
– Positively biased results

Holdout
– Train on one sample, test on the remaining sample

Cross-fold validation
– Unbiased, but computationally intensive

[Figure: illustration of 5-fold cross validation using 10 ECG records. The data is divided into 5 mutually exclusive folds and the classifier is trained and tested 5 times. Each time a different test fold is used and the remainder of the data is used for training.]
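The fold construction can be sketched in a few lines (splitting by record, as in the 5-fold/10-record illustration; the record IDs are hypothetical):

```python
def k_fold_splits(record_ids, k):
    """Divide records into k mutually exclusive folds and yield (train, test)
    pairs so that each record appears in exactly one test fold."""
    folds = [record_ids[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test

# 10 ECG records, 5 folds, as in the illustration
splits = list(k_fold_splits(list(range(10)), 5))
```

Splitting at the record level (rather than the epoch level) is what removes the intra-record bias discussed later in the case study.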
4. Features
– Transformations
– Missing values
– Feature Selection
Transformations
Look at the histogram of each feature; if it has a skewed distribution, try applying a transformation that results in a less skewed distribution.

[Figure: histogram of the original feature (skewed) and histogram of log(feature) (less skewed)]
Missing Values
Practical data sets often have missing feature values due to faulty measurements etc. The majority of classifier models require all feature values to be present. What to do?
– Delete all cases with one or more missing feature values
– Estimate the missing feature values:
  – Replace with the feature average (a bad choice as it skews the distribution)
  – Replace with a random value
  – Replace with the value from another case that is "similar" and has all features
See academic.uprm.edu/~eacuna/IFCS04r.pdf for a good summary.
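Mean imputation, the first (and flawed) option above, is easy to sketch; the tiny data set is hypothetical and `None` marks a missing value:

```python
def impute_with_mean(rows):
    """Replace missing values (None) with the per-feature mean of the
    observed values. Simple, but as the slide notes it skews the distribution."""
    d = len(rows[0])
    means = []
    for j in range(d):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(d)] for r in rows]

data = [[1.0, 2.0], [3.0, None], [None, 4.0]]
filled = impute_with_mean(data)
```

Every imputed value sits exactly at the feature mean, which artificially sharpens the distribution around it; the "similar case" (nearest-neighbour) strategy avoids this.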
Feature Selection
The aim is to find a subset of the available features that provides "acceptable" performance.
– Fewer features means easier implementation
– There may exist subsets of the available features that provide higher classification performance than the full feature set
– Irrelevant and redundant features in general reduce classifier performance
– Methods to consider include "filter" and "wrapper" methods, forward selection, backward elimination, stepwise, exhaustive and beam search
Part 2: Case Study: Sleep Apnea Detection using the Electrocardiogram
Obstructive sleep apnea: 2-4% prevalence, disrupted sleep, treated with a Continuous Positive Airway Pressure (CPAP) mask.
Diagnosis uses the polysomnogram (multiple signals). The diagnostic test costs $1500 and is carried out in hospital. Only 15% of those with the disease have been diagnosed.
Study Objective
See if we can determine a method that can reliably detect sleep apnea using the electrocardiogram (ECG).
Benefits:
– The test can be done at home
– Low cost
– Reduced waiting lists in hospitals
Sleep Apnoea ECG database
Computers in Cardiology Conference 2000 Challenge
– Automated ECG apnoea detection

Uses a modified lead V2 ECG from a PSG database of patients at Philipps University in Germany (T. Penzel), which had been scored by sleep physiologists using the complete polysomnogram. Supplied the raw ECG waveform from a single lead and (unverified) QRS detection times. 70 records in total (about 8 hrs each); 35 released for training, 35 withheld for independent testing.

The PSG was scored on an epoch-by-epoch basis; the goal was to mimic the human scorer.
– Each epoch labelled as Normal (NR) or Sleep disordered respiration (SDR)

Epoch based scoring

Splitting up of the data: we have 70 overnight recordings of ECG.
– Every minute annotated as 'normal' or 'sleep disordered breathing' by an expert
– Over 32000 labels
– 35 recordings available for training, 35 withheld for testing
Guilleminault et al. were the first to report on the characteristic bradycardia/tachycardia pattern associated with obstructive apnoeas (Guilleminault C et al., "Cyclical variation of heart rate in sleep apnea syndrome," Lancet, 1984).
Ichimaru reported on low-frequency heart rate fluctuations caused by Cheyne-Stokes respiration (Ichimaru Y, Yanaga T, "Frequency characteristics of the heart rate variability produced by Cheyne-Stokes respiration during 24-hr ambulatory electrocardiographic monitoring," Comput Biomed Res, 1989).
Bradycardia/tachycardia patterns

[Figure: brady/tachy heart rate patterns, from Stein et al., J. Cardiovasc. Electrophysiol., 2003]
ECG derived respiration (EDR)

Modulation of the chest-lead ECG signal amplitude by respiration:

EDR(n) = area enclosed by the nth QRS complex

[Figure: ECG, respiration and derived EDR signals]

A. Travaglini, et al., "Respiratory signal derived from eight-lead ECG," in Computers in Cardiology, 1998.
G. B. Moody, et al., "Clinical Validation of the ECG-Derived Respiration (EDR) Technique," in Computers in Cardiology, 1986.
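A rough sketch of the per-beat EDR computation described above. Assumptions (not specified on the slide): numpy, a uniformly sampled ECG, supplied QRS sample indices, and a fixed ±50 ms window as the "area enclosed by the QRS complex":

```python
import numpy as np

def edr(ecg, qrs_samples, fs, half_width_s=0.05):
    """EDR(n) = area enclosed by the nth QRS complex, approximated here as
    the integral of the signal over a +/-50 ms window around each QRS time."""
    w = int(half_width_s * fs)
    values = []
    for q in qrs_samples:
        seg = ecg[max(0, q - w): q + w + 1]
        values.append(seg.sum() / fs)   # rectangular-rule integral
    return np.array(values)

# Synthetic example: three "beats" whose amplitude grows with respiration
fs = 100
ecg = np.zeros(fs * 4)
for i, q in enumerate([100, 200, 300]):
    ecg[q - 2: q + 3] = 1.0 + 0.5 * i   # hypothetical growing QRS amplitude
edr_signal = edr(ecg, [100, 200, 300], fs)
```

Because the beat amplitudes grow, the EDR samples increase beat by beat, tracking the amplitude modulation.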
Features (88 in all)

RR interval time domain
– Epoch-based mean and st.dev. RR interval
– Record-based mean and st.dev. RR interval
– NN50, pNN50, SDNN
– Allan Factor at 5-25 secs time scale
– Serial correlation

RR interval frequency domain
– 32 PSD features

EDR frequency domain
– Epoch-based mean and st.dev. EDR amplitude
– Record-based mean and st.dev. EDR amplitude
– 32 PSD features
Experiments

Linear and quadratic discriminant classifiers
– Quick to train
– Wanted to focus the study on the features, not the classifiers
– Wanted a system that is readily implemented on microprocessors

Different combinations of feature groups:
– RR time domain and frequency domain
– EDR frequency domain

Feature selection; covariance regularisation (not discussed here)

Training – all features
– 35-fold cross validation; each fold contained 1 record
– Removed intra-record bias

Training – feature selection
– Used the best-first feature selection strategy
– Outer loop: 35-fold cross validation; inner loop: 34-fold cross validation
– As before, each fold contained 1 record (removed intra-record bias)
Performance Assessment 1

During feature selection, classifiers were compared using 'Accuracy':

Accuracy = (TP+TN) / (TP+FN+FP+TN)

                 Expert
Predicted        NR        SDR
NR               TN        FN
SDR              FP        TP

Performance Assessment 2

In addition, classifiers were assessed using sensitivity and specificity:

Sensitivity = TP / (TP+FN)
Specificity = TN / (TN+FP)
Results – cross validation

(a) All features, (b) feature selection
• LDA performed better than QDA
• Feature selection improved QDA
• The best features were RR and EDR combined

Results – withheld set

• Results on new data (the withheld set) were similar to the cross-validation results, which is an encouraging sign!
Results – feature separation across classes

RR PSD features
– Good separation at low frequencies

EDR PSD features
– Good separation, particularly at low frequencies
Practical tips
Always start with a simple system and add complexity if needed, e.g. start with LDA and progress to a neural network if needed.
Focus on finding good discriminating features. If your features are poor then no fancy classifier will help.
As best as possible, make sure your performance assessment is unbiased:
– Be careful of training and testing with features from the same record
– Never make many performance assessments on the same data set and then report the best, as this will be a positively biased result. Be particularly careful with feature selection, where many thousands of comparisons may be made.
Consider the target device for the pattern recognition system:
– E.g. a low-power, low-computation device will influence your choice of features and classifier
Bibliography
Data splitting
– R. Kohavi, "A study of cross validation and bootstrap for accuracy estimation and model selection," in Proc. of 14th Int. Joint Conference on Artificial Intelligence, 1995, pp. 1137-1143.

Performance estimation
– M. H. Zweig (1993), "Receiver Operator Characteristic (ROC) Plots: A Fundamental Evaluation Tool in Clinical Medicine," Clin. Chem., vol. 39(4), pp. 561-577.
– J. Michaelis and J. L. Willems (1987), "Performance Evaluation of Diagnostic ECG Programs," Proceedings of the Computers in Cardiology Conference, Leuven, Belgium, Sept 12-15, pp. 25-30, edited by K. L. Ripley, IEEE Computer Society.

Classifiers
– C. M. Bishop, Neural Networks for Pattern Recognition. New York: Oxford University Press, 1995.
– B. D. Ripley, Pattern Recognition and Neural Networks. Cambridge, England: Cambridge University Press, 1996.
– R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification. New York: John Wiley and Sons, 2001.
– Warren Sarle, SAS Institute, http://www.faqs.org/faqs/ai-faq/neural-nets/part1/
– Mike James, Classification Algorithms. John Wiley and Sons, 1985.

Sleep apnea
– P. de Chazal, C. Heneghan, E. Sheridan, R. B. Reilly, P. Nolan, M. O'Malley (2003), "Automated Processing of the Single Lead Electrocardiogram for the Detection of Obstructive Sleep Apnea," IEEE Transactions on Biomedical Engineering, vol. 50, no. 6, June 2003, pp. 686-696.
Email – contact me
philip.dechazal@biancamed.com
philip@ee.ucd.ie