Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov...

Post on 14-Jun-2020

14 views 0 download

Transcript of Alexey Pomerantsev & Oxana Rodionova - Chemometrics · Alexey Pomerantsev & Oxana Rodionova Semenov...

28.06.12 1CAC-2012

Robust SIMCA Robust SIMCA bearing on nonbearing on non--robust PCArobust PCA

Alexey Pomerantsev & Oxana Rodionova

Semenov Institute of Chemical Physics, Moscow

Use and abuse of robust PCAUse and abuse of robust PCA

28.06.12 2CAC-2012

Contaminated dataContaminated data

Rγ=(1–γ)R+γD

Few outliers Two groups

28.06.12 3CAC-2012

Classical methods for contaminated data Classical methods for contaminated data

Masking effect Swamping effect

28.06.12 4CAC-2012

Classical and robust statisticsClassical and robust statisticsRobustClassical

( )xmedian ~ =x

∑=

=I

iix

Ix

1

1

( )xs ~median1.4826 MAD −= x

∑=

−−

=I

ii xx

Is

1

22 )(1

1

Classical

Robust

outlier

28.06.12 5CAC-2012

Extremes and OutliersExtremes and Outliers

)2(

),(

222 χ∝+

∝⎟⎠⎞

⎜⎝⎛

yx

Nyx I0

)1|2(222 α−χ≤+ −yx

( )Iyx 1222 )1(|2 β−χ≤+ −

α is Extreme significance

β is Outlier significance

α=0.01β=0.05

28.06.12 6CAC-2012

ProblemProblem

OK

Regular data

OKRobustmethods

BADClassicalmethods

Contaminated data

?

28.06.12 7CAC-2012

Principal Component AnalysisPrincipal Component Analysis

I

A

TA

A PA

EA+X I= × J

J

t

I

J

28.06.12 8CAC-2012

PCA PCA robustificationrobustification

(1) Data pre-processing;

(2) Decomposition;

(3) Calculation of thresholds.

28.06.12 9CAC-2012

Scores & Orthogonal DistancesScores & Orthogonal DistancesSD:

distance within the model

∑=

λ==

A

a a

iaiii

th1

21tt )( tTTt

OD:distance to the model

∑∑

+==

==K

Aaia

J

jiji tev

1

2

1

2

28.06.12 10CAC-2012

Data Driven SIMCAData Driven SIMCA

SD OD

hu =

v (u1,...., uI )% (u0/N) χ2(N)

u0= ?

N = ?

J. Chemometrics 22 (2008) 601-609

28.06.12 11CAC-2012

Tolerance AreasTolerance Areas

α is Extreme significance

β is Outlier significance

00 vvN

hhNz vh +=

)(2vh NNz +χ∝

)1|(2 α−+χ≤ −vh NNz

( )Ivh NNz 12 )1(| β−+χ≤ −

28.06.12 12CAC-2012

Classical Data Driven (CDD) SIMCA Classical Data Driven (CDD) SIMCA Classical Method of Moments

2

20

0ˆ2intˆ,ˆusuNuu ==

∑∑==

−−

==I

iiu

I

ii uu

Isu

Iu

1

22

1)(

11,1

(u1,...., uI )% (u0/N) χ2(N)Given

Then

Where

28.06.12 13CAC-2012

Robust Data Driven (RDD) SIMCA Robust Data Driven (RDD) SIMCA Robust Method of Moments

M=median(u) R=interquartile(u)

Given

Then

Where

[ ]⎪⎪⎩

⎪⎪⎨

χ−χ=

χ=⇐

−−

),25.0(),75.0(

),5.0(

~

~

220

200

NNNuR

NNuM

N

u

(u1,...., uI )% (u0/N) χ2(N)

u(1) ≤ u(2 )≤ .... ≤ u(I-1) ≤ u(I)

½ ½

¼ ¼

28.06.12 14CAC-2012

Dual Data Driven (3D) SIMCA Dual Data Driven (3D) SIMCA Given

Then

X=TtP+Eh=(h1,...., hI) v=(v1,...., vI)

NoRDD SIMCA

YesCDD SIMCA

RDD SIMCACDD SIMCA

( ) ( )vh NvNh ˆˆˆˆ00

( ) ( )vh NvNh ~~~~00

( ) ( )vvhh NNNN ˆ~&ˆ~≈≈

28.06.12 15CAC-2012

TOMCAT TOMCAT ToolBoxToolBox

M. Daszykowski, S. Serneels, K. Kaczmarek, P. Van Espen, C. Croux, B. Walczak, ChemoLab 85 (2007) 269-277

http://chemometria.us.edu.pl/RobustToolbox/

Robust PCArobust PCs, robust singular values

Robust classification rulesz-transformed robust OD and SD

)(

)(medianODROD

)(

)(medianSDRD

OD

OD

SD

SD

n

ii

n

ii

Q

Q

−=

−=

Robust pre-processingrobust centering & scaling

28.06.12 16CAC-2012

Case study I. Simulated regular dataCase study I. Simulated regular data

),(N),(N 2I0εV0δ

εδx

σ∝∝

+=

The numbers of variables, J=3

The numbers of objects, I=100

The number of principal components, A=2

The δ properties are:

E(δ) = 0, v11= v22 = v33 = 0.28, rank(V) = 2.

The ε component properties are:

E(ε) = 0, σ=0.05

28.06.12 17CAC-2012

SIMCA plotsSIMCA plots

ext=5

out=11

out=0out=12

CDD SIMCA TOMCAT

extreme area (α=0.05) outlier area (β=0.05)

28.06.12 18CAC-2012

Totally in 10 regular data setsTotally in 10 regular data sets

70

60

Expected

28.06.12 19CAC-2012

Case study II. Simulated data with outliersCase study II. Simulated data with outliers

),(N),(N 2I0εV0δ

εδx

σ∝∝

+=

The numbers of variables, J=3

The numbers of objects, I=100

The number of principal components, A=2

The δ properties are:

E(δ) = 0, v11= v22 = v33 = 0.28, rank(V) = 2.

The ε component properties are:

E(ε) = 0, σ=0.05 (first 97 objects)

E(ε) = 0, σ =0.2 (last 3 objects)

28.06.12 20CAC-2012

SIMCA plotsSIMCA plots

28.06.12 21CAC-2012

REFERENCE & RDDREFERENCE & RDD--SIMCASIMCA

28.06.12 22CAC-2012

Totally in 10 data sets with outliersTotally in 10 data sets with outliers

Expected

28.06.12 23CAC-2012

Case study II. Real world data with 2 groupsCase study II. Real world data with 2 groups

Substance in the closed PE bags,

82 drums measured by NIR.

Totally: 246 spectra

Group G1: 196 objects

Group G2: 50 objects ACA 642 (2009) 222-227

28.06.12 24CAC-2012

Expected/observed number of extremesExpected/observed number of extremes

Clean subset G1 Contaminated dataset G1+G2

Expected number of extremes N=αI

28.06.12 25CAC-2012

Results of separationResults of separation

Subset G1 revealed Subset G2 revealed

28.06.12 26CAC-2012

Conclusion 1Conclusion 1Each tool has its purpose: classical methods are for regular data, whereas robust methods should be used for contaminated data. Do not expect that there exists a common tool that yields reasonable results in both cases.

28.06.12 27CAC-2012

Conclusion 2Conclusion 2

Extreme objects play an important role in data analysis. These objects should not be confused with outliers. The number of extremes should be compared to the expected number, coupled with the significance level α.

Clean dataset Contaminated dataset

28.06.12 28CAC-2012

Conclusion 3Conclusion 3The proposed Dual Data Driven PCA/SIMCA approach looks like a fine competitor to the pure classical and to the strictly robust methods. This technique has demonstrated a proper performance in the analysis of both regular and contaminated data sets.

Clean dataset Contaminated dataset

28.06.12 29CAC-2012

Thank you for Thank you for your attentionyour attention

A Lawyer’s Mistake