A Convex Hull Peeling Depth Approach to Nonparametric ...

46
A Convex Hull Peeling Depth Approach to Nonparametric Massive Multivariate Data Analysis with Applications Hyunsook Lee . [email protected] Department of Statistics The Pennsylvania State University Hyunsook Lee, Department of Statistics, Penn State Univ – p. 1

Transcript of A Convex Hull Peeling Depth Approach to Nonparametric ...

Page 1: A Convex Hull Peeling Depth Approach to Nonparametric ...

A Convex Hull Peeling Depth Approach toNonparametric Massive Multivariate Data

Analysis with Applications

Hyunsook Lee.

[email protected]

Department of Statistics

The Pennsylvania State University

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 1

Page 2: A Convex Hull Peeling Depth Approach to Nonparametric ...

Outlines

I Convex Hull Peeling (CHP) and Multivariate Data Analysis

Definitions on CHP

Data Depth (Ordering Multivariate Data)

Quantiles and Density Estimation

I Color Magnitude (CM) Diagram and Sloan Digital Sky Survey

I Nonparametric Descriptive Statistics with CHP

Multivariate Median

Skewness and Kurtosis of a Multivariate Distribution

I Outlier Detection with CHP

Level α ; Shape Distortion; Balloon Plot

I Concluding Remarks

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 2

Page 3: A Convex Hull Peeling Depth Approach to Nonparametric ...

Definitions

Convex Set A set C ⊆ Rd is convex if for every two points x, y ∈ C thewhole segment xy is also contained in C.

Convex Hull The convex hull of a set of points X in Rd is denoted byCH(X), is the intersection of all convex sets in Rd containing X. In

algorithms, a convex hull indicates points of a shape invariant minimal

subset of CH(X) (vertices, extreme points), connecting these points

produces a wrap of CH(X).

−2 −1 0 1 2

−2

−1

01

2

x

y

−2 −1 0 1 2

−2

−1

01

2

x

y

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 3

Page 4: A Convex Hull Peeling Depth Approach to Nonparametric ...

Convex Hull Peeling

−2 −1 0 1 2

−2

−1

01

2

x

yBefore

−2 −1 0 1 2

−2

−1

01

2x

y

After

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 4

Page 5: A Convex Hull Peeling Depth Approach to Nonparametric ...

Convex Hull Peeling Depth (CHPD)

[CHPD:] For a point x ∈ Rd and the data set X = {X1, ..., XN−1}, letC1 = CH{x, X} and denote a set of its vertices V1. We can getCj = Cj−1\Vj−1 through CHP until x ∈ Vj (j = 2, ...). Then,

CHPD(x) =](∪k

i=1Vi)

Nfor k s.t. k = minj{j : x ∈ Vj} ; otherwise CHPD

is 0.

I Tukey (1974): Locating data center (median) by the Convex Hull PeelingProcess.

I Barnett (1976): Ordering based on Depth

I p̂th quantiles are 1 − p̂thCHPDs.

I Hyper-polygons of 1 − p̂th depth obtainable from any dimensional data.

I QHULL(Barber et. al., 1996) works for general dimensions (http://qhull.org).

I Why CHPD...

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 5

Page 6: A Convex Hull Peeling Depth Approach to Nonparametric ...

Challenges inNonparametric Multivariate Analysis

How to Order Multivariate Data?

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 6

Page 7: A Convex Hull Peeling Depth Approach to Nonparametric ...

Challenges inNonparametric Multivariate Analysis

How to Order Multivariate Data?

Ordering Multivariate Data → Data DepthI Mahalanobis Depth : Mahalanobis (1936)

I Convex Hull Peeling Depth: Barnett (1976)

I Half Space Depth: Tukey (1975)

I Simplical Depth : Liu (1990)

I Oja Depth : Oja (1983)

I Majority Depth : Singh (1991)

I Ordering is not uniformly defined

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 6

Page 8: A Convex Hull Peeling Depth Approach to Nonparametric ...

Statistical Data Depth(Zuo and Serfling, 2000)

(P1) (Affine invariance) D(Ax + b; FAX+b) = D(x; FX) for all X (Anonsingular matrix) holds for any random vector X in Rd, any d × d

nonsingular matrix A, and any d-vector b;

(P2) (Maximality at center) D(θ; F ) = supx∈Rd D(x; F ) holds for anyF ∈ F having center θ;

(P3) (Monotonicity) for any F ∈ F having deepest point θ,D(x; F ) ≤ D(θ + α(x − θ); F ) holds for α ∈ [0, 1]; and

(P4) D(x; F ) → 0 as ||x|| → ∞, for each F ∈ F .

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 7

Page 9: A Convex Hull Peeling Depth Approach to Nonparametric ...

Convex Hull Peeling Depth

I affine invariance

I maximality at center

I monotonicity relative to deepest point

I vanishing at infinity

CHPD has these properties and points of smallest depth are possibleoutliers

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 8

Page 10: A Convex Hull Peeling Depth Approach to Nonparametric ...

Quantile Estimation

I Median: A point(s) left after peeling(will show robustness of this estimator later)

I pth Quantile: Level set whose central region contains ∼ 100p% data(will define the level set and the central region later)

I No Closed Form; Empirical Process

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 9

Page 11: A Convex Hull Peeling Depth Approach to Nonparametric ...

Empirical Density Estimation

Density Estimation with CHPD on Bivariate Normal Data (McDermott, 2003)

100000 Bivariate Normal SampleQuantiles={0.99,0.95,0.90,0.80,...0.20,0.10,0.05,0.01}

I

−4 −2 0 2 4

−4−2

02

4

x

y

−4 −3 −2 −1 0 1 2 3 40.00

0.05

0.10

0.15

0.20

−3−2−1 0 1 2

3

x

y

dens

ity

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 10

Page 12: A Convex Hull Peeling Depth Approach to Nonparametric ...

Lessons and Further Studies

I Sample from a convex distribution (no doughnut shape)

I Works on Massive data−→ Sequential Method

I Without previous knowledge, no model or prior is known to start ananalysis. Exploratory data analysis for a large database

I Nonparametric and non-distance based approach

I Where CHP can be applied and how?−→ Multi-color diagram from astronomy, where a plethora of freedata archives is available.

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 11

Page 13: A Convex Hull Peeling Depth Approach to Nonparametric ...

Color Magnitude diagram

Two dimensional Color-Color diagram orCelebrated Hertzsprung-Russell diagram (switch)

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 12

Page 14: A Convex Hull Peeling Depth Approach to Nonparametric ...

Color Magnitude diagram

Two dimensional Color-Color diagram orCelebrated Hertzsprung-Russell diagram (switch)

What if we can see beyond 2 dimensions without bias (projection)Then, 3 or higher dimensional color diagrams might have popularity.

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 12

Page 15: A Convex Hull Peeling Depth Approach to Nonparametric ...

Color Magnitude diagram

Two dimensional Color-Color diagram orCelebrated Hertzsprung-Russell diagram (switch)

What if we can see beyond 2 dimensions without bias (projection)Then, 3 or higher dimensional color diagrams might have popularity.

CHP may assist analyzing multi-color diagrams.Need a suitable data set with colors.

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 12

Page 16: A Convex Hull Peeling Depth Approach to Nonparametric ...

Sloan Digital Sky Survey: SDSS

Commissioned 2000, now Data Release 5 is available.5 bands; 4 variables (u-g, g-r, r-i, i-z)

I Studies on analyzing astronomical massive data received spotlightsrecently. http://www.sdss.org

I July, 2005: Data Release Four6670 square degrees, 180 million objectsAvailable from http://www.sdss.org/dr4From SpecPhotoAll with SQL:

I Attributes of photometric data are color indices, u,b,g,i,z along withcoordinates.

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 13

Page 17: A Convex Hull Peeling Depth Approach to Nonparametric ...

SQL for SDSS

select ra, dec, z, psfMag_u, psfMag_g, psfMag_r,

psfMag_i, psfMag_z

from SpecPhotoAll

where specclass= 2

I Note — 2: galaxies, 3: QSO, 4: HighZ QSO

I Galaxies: 499043

I Quasars: 70204

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 14

Page 18: A Convex Hull Peeling Depth Approach to Nonparametric ...

Multivariate Descriptive Statistics

I CHP Median

I CHP Skewness

I CHP Kurtosis

with bivariate simulated data and SDSS DR4

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 15

Page 19: A Convex Hull Peeling Depth Approach to Nonparametric ...

Convex Hull Peeling Median (CHPM)

Multivariate Median: the inner most point among data→ Survey of Multivariate Median (Small, 1990)

CHPM: recursive peeling leads to the inner most point(s). The averageof these largest depth points is the median of a data set.

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 16

Page 20: A Convex Hull Peeling Depth Approach to Nonparametric ...

Convex Hull Peeling Median (CHPM)

Multivariate Median: the inner most point among data→ Survey of Multivariate Median (Small, 1990)

CHPM: recursive peeling leads to the inner most point(s). The averageof these largest depth points is the median of a data set.

Simulations: Sample from standard bivariate normal distributionn mean median CHPM

104 (0.001338, -0.02232) (-0.005305, -0.01643) (0.000918, -0.010589)

106 (0.000072, 0.000114) (0.001185, -0.000717) (0.002455, -0.000456)

Sequential CHPM → (0.004741, -0.004111)

Setting for the sequential method: m=10000 and d=0.05

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 16

Page 21: A Convex Hull Peeling Depth Approach to Nonparametric ...

Application: Median

Quasars u-g g-r r-i i-z

Mean 0.4619 0.2484 0.1649 0.1008

Median 0.2520 0.1750 0.1520 0.0770

CHPM 0.2530 0.1640 0.1913 0.0700

Galaxies u-g g-r r-i i-z

Mean 1.622 0.9211 0.4226 0.3439

Median 1.680 0.8930 0.4200 0.3540

CHPM 1.790 0.957 0.424 0.367

Seq. CHPM 1.772 0.950 0.4228 0.363

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 17

Page 22: A Convex Hull Peeling Depth Approach to Nonparametric ...

Robustness of Convex Hull Peeling Median

Breakdown point of a convex hull peeling median goes to zero asn → ∞ (Donoho, 1982). Outliers are necessarily located at infinity.

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 18

Page 23: A Convex Hull Peeling Depth Approach to Nonparametric ...

Robustness of Convex Hull Peeling Median

Breakdown point of a convex hull peeling median goes to zero asn → ∞ (Donoho, 1982). Outliers are necessarily located at infinity.

Empirical mean square error (EMSE) and Relative Efficiency (RE):Model:(1 − ε)N((0, 0), I) + εN(·, 4I)

n = 5000, m = 500, Tj=(CHPM, Mean)EMSE = 1

m

∑mi=1 ||Tj − µ||2

N((5, 5)t, 4I) N((10, 10)t, 4I)

ε CHPM Mean RE CHPM Mean RE

0 0.002178 0.000417 0.191689 0.002178 0.000417 0.191689

0.005 0.0028521 0.001682 0.589961 0.002891 0.005444 1.88291

0.05 0.016842 0.125522 7.45262 0.017824 0.500610 28.08597

0.2 0.139215 2.00109 14.37612 0.1435910 8.0017 55.7264

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 18

Page 24: A Convex Hull Peeling Depth Approach to Nonparametric ...

Generalized Quantile Process

EinMahl and Mason (1992)

Un(t) = inf{λ(A) : Pn(A) ≥ t, A ∈ A}, 0 < t < 1.

I Central Region:RCH(t) = {x ∈ R

d : CHPD(x) ≥ t}

I Level Set:BCH(t) = ∂RCH(t)

= {x ∈ Rd : CHPD(x) = t}

I Volume Functional:VCH(t) = V olume(RCH(t))

−→ One dimensional mapping.

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 19

Page 25: A Convex Hull Peeling Depth Approach to Nonparametric ...

−4 −2 0 2 4

−4−2

02

4

x

y

−4 −2 0 2 4

02

46

8

x

y

−3 −2 −1 0 1 2 3

−3−2

−10

12

3

x

y

−50 0 50 100

−200

010

030

0

x

y

→ not equi-probability contours, assume smooth convex distributions

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 20

Page 26: A Convex Hull Peeling Depth Approach to Nonparametric ...

Skewness Measure

Let xj,i be the ith vertex in a level set BCH,j comprised by the jth

peeling process. A measure of skewness:

Rj =maxi ||xj,i − CHPM || − mini ||xj,i − CHPM ||

mini ||xj,i − CHPM ||

Not only a sequence of Rj visualizes but also quantizes the skewnessalong depths.Denominator for the regularization → affine invariant Rj

symmetric: flat Rj along convex hull peels

skewed: fluctuating Rj

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 21

Page 27: A Convex Hull Peeling Depth Approach to Nonparametric ...

Simulation: Skewness Measure

−3 −2 −1 0 1 2 3

−2

02

N(0, I)

−4 −2 0 2

−2

02

N(0, Σ)

5 10 15 20 25 30

−3

−2

−1

01

23

χ102

N(0

, 1)

non normal

0 20 40 60 80

0.0

0.5

1.0

1.5

2.0

2.5

jthconvex hull level set

Rj

0 20 40 60 80

1.0

1.5

2.0

2.5

jthconvex hull level set

Rj

0 20 40 60 802

34

56

jthconvex hull level set

Rj

2000 Hyunsook Lee, Department of Statistics, Penn State Univ – p. 22

Page 28: A Convex Hull Peeling Depth Approach to Nonparametric ...

Application: Skewness Measure (Quasars)

0 20 40 60 80

24

68

1012

14

jthconvex hull level set

Rj

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 23

Page 29: A Convex Hull Peeling Depth Approach to Nonparametric ...

Application: Skewness Measure (Galaxies)

0 50 100 150 200

24

68

1012

jthconvex hull level set

Rj

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 24

Page 30: A Convex Hull Peeling Depth Approach to Nonparametric ...

Kurtosis Measure

Quantile (Depth) based Kurtosis:

KCH(r) =VCH( 1

2 − r2 ) + VCH( 1

2 + r2 ) − 2VCH( 1

2 )

VCH( 12 − r

2 ) − VCH( 12 + r

2 )

Tailweight:

t(r, s) =VCH(r)

VCH(s)

for 0 < s < r ≤ 1. Here,VCH(r) indicates the volume functional at depth r.

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 25

Page 31: A Convex Hull Peeling Depth Approach to Nonparametric ...

Simulation: Kurtosis Measure (Tailweight)

1.0 0.8 0.6 0.4 0.2 0.0

0.0

0.2

0.4

0.6

0.8

1.0

r : depth

t(r, r

min)

uniformnormalt10

cauchy

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 26

Page 32: A Convex Hull Peeling Depth Approach to Nonparametric ...

Application: Kurtosis Measure (Quasars)

1.0 0.8 0.6 0.4 0.2 0.0

0.0

0.2

0.4

0.6

0.8

1.0

depth

v(r,

r min)

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 27

Page 33: A Convex Hull Peeling Depth Approach to Nonparametric ...

Multivariate Outlier Detection

I What are Outliers?I Detecting Algorithms

Level α

Shape DistortionBalloon Plot

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 28

Page 34: A Convex Hull Peeling Depth Approach to Nonparametric ...

What are Outliers?

Outliers are...

I Cumbersome Observations

I Lead to New Scientific Discoveries

I Improve Models (Robust Statistics)

I ...

I No Clear Objectives but Come Along Often

CHP: Experience and relative Robustness support the Idea of OutlierDetection.⇒ We need a clear definition on outliers; especially, outliers of the 21stcentury. And outlier detecting methods.

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 29

Page 35: A Convex Hull Peeling Depth Approach to Nonparametric ...

Outliers are observations....

I Huber (1972): unlikely to belong to the main population.

I Barnett and Leroy (1994): appear inconsistent with the remainder.

I Hawkins (1980): deviated so much to arouse suspicion.

I Beckman and Cook (1983): surprising and discrepant to theinvestigator.Discordant Observations or Contaminants

I Rohlf (1975): somewhat isolated from the main cloud of points.

Yet, somewhat VAGUE!

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 30

Page 36: A Convex Hull Peeling Depth Approach to Nonparametric ...

Some Outlier Detection Methods

Univariate: Box-and-Whisker plot, Order statistics, ...

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 31

Page 37: A Convex Hull Peeling Depth Approach to Nonparametric ...

Some Outlier Detection Methods

Univariate: Box-and-Whisker plot, Order statistics, ...

Multivariate: Mostly bivariate applications

I Generalized Gap Test (Rolhf, 1975)

I Bivariate Box Plot (Zani et. al, 1999)

I Sunburst Plot (Liu et. al., 1999)

I Bag plot (Miller et. al., 2003)

and Mahalanobis distance D(x) = (x − µ̂)Σ̂−1(x − µ̂).

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 31

Page 38: A Convex Hull Peeling Depth Approach to Nonparametric ...

Some Outlier Detection Methods

Univariate: Box-and-Whisker plot, Order statistics, ...

Multivariate: Mostly bivariate applications

I Generalized Gap Test (Rolhf, 1975)

I Bivariate Box Plot (Zani et. al, 1999)

I Sunburst Plot (Liu et. al., 1999)

I Bag plot (Miller et. al., 2003)

and Mahalanobis distance D(x) = (x − µ̂)Σ̂−1(x − µ̂).

Difficulties of multivariate analysis arise from the complexity of orderingmultivariate data.

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 31

Page 39: A Convex Hull Peeling Depth Approach to Nonparametric ...

Quantile Based Outlier Detection

−4 −2 0 2 4

−4

−2

02

4

x

y

bivariate standard normal

−20 −10 0 10 20

−20

−10

010

20

x

y

bivariate t5 with ρ=−0.5

−4 −3 −2 −1 0 1 2 3 40.00

0.05

0.10

0.15

0.20

−3−2−1 0 1 2

3

x

y

dens

ity

−6 −4 −2 0 2 4 60.00

0.05

0.10

0.15

0.20

−6−4−2 0 2 4

6

x

y

dens

ity

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 32

Page 40: A Convex Hull Peeling Depth Approach to Nonparametric ...

Contour Shape Changes

−3 −2 −1 0 1 2 3−

20

2

x

y

Bivariate Normal Sample

−2 0 2 4 6

−2

02

4

x

y

Outliers Added

−3 −2 −1 0 1 2 3

−2

02

x

y

−2 0 2 4 6−

20

24

x

y

0 20 40 60 80

010

2030

4050

jthconvex hull level set

Vol

j

0 20 40 60 80

010

2030

4050

jthconvex hull level set

Vol

j

GAP

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 33

Page 41: A Convex Hull Peeling Depth Approach to Nonparametric ...

Balloon Plot

−3 −2 −1 0 1 2 3

−2

02

x

y

Bivariate Normal Sample

−2 0 2 4 6

−2

02

4

x

y

Outliers Added

A Balloon Plot is obtained by blowing .5th CHPD polyhedron by 1.5times (lengthwise). Let V.5 be a set of vertices of .5th CHPD hull. Theballoon for outlier detection is

B1.5 = {yi : yi = xi + 1.5(xi − CHPM), xi ∈ V.5}.

In other words, blow the balloon of IQR 1.5 times larger.Hyunsook Lee, Department of Statistics, Penn State Univ – p. 34

Page 42: A Convex Hull Peeling Depth Approach to Nonparametric ...

Outliers in Quasar Population

0 20 40 60 80

010

020

030

040

0

jthconvex hull level set

Vol

j

0 20 40 60 80

01

23

4jthconvex hull level set

Vol

j0.25

Volumes of 1st CH, .01 Depth CH, .05 Depth CH: (474.134, 14.442,4.353)

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 35

Page 43: A Convex Hull Peeling Depth Approach to Nonparametric ...

Outliers in Galaxy Population

0 50 100 150 200

010

0020

0030

0040

0050

00

j (peel)

Vol

j

0 50 100 150 200

02

46

8j(peel)

Vol

j0.25

Volumes of 1st CH, .01 Depth CH, .05 Depth CH: (4919.492, 4.310,1.075)

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 36

Page 44: A Convex Hull Peeling Depth Approach to Nonparametric ...

Discussion on CHP

Convex Hull Peeling is..

I a robust location estimator.

I a tool for descriptive statistics.Skewness and Kurtosis measure.

I a reasonable approach for detecting multivariate outliers.

I a starter for clustering.

⇒Our methods help to characterize multivariate distributions andidentify outlier candidates from multivariate massive data; therefore,the results initiate scientists to study further with less bias.CHP as Exploratory Data Analysis and Data Mining Tools.

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 37

Page 45: A Convex Hull Peeling Depth Approach to Nonparametric ...

Concluding Remarks

Drawbacks of CHPD

I Limited to moderate dimension data.

I CHPD estimates depths inward not outward.

I Convexity of a data set.

I No population/theoretical counterpart.

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 38

Page 46: A Convex Hull Peeling Depth Approach to Nonparametric ...

Concluding Remarks

Drawbacks of CHPD

I Limited to moderate dimension data.

I CHPD estimates depths inward not outward.

I Convexity of a data set.

I No population/theoretical counterpart.

No assumption on data distribution, Non-distancebased, Affine invariant, Applicable to streaming data,Detecting Outliers, Providing Multivariate DescriptiveStatistics, Exploratory data analysis

Hyunsook Lee, Department of Statistics, Penn State Univ – p. 38