Sparse Representation of Multivariate Extremes with Applications to Anomaly Ranking (AISTATS2016)


Transcript of Sparse Representation of Multivariate Extremes with Applications to Anomaly Ranking (AISTATS2016)

Page 1: Sparse Representation of Multivariate Extremes with Applications to Anomaly Ranking (AISTATS2016)
Page 2: Sparse Representation of Multivariate Extremes with Applications to Anomaly Ranking (AISTATS2016)


4 Experiments on simulated data

4.1 Simulation on 2D data

The purpose of this simulation is to provide an insight into the rationale of the algorithm in the bivariate case. Normal data are simulated under a 2D logistic distribution with asymmetric parameters (white and green points in Fig. 3), while the abnormal ones are uniformly distributed. Thus, the 'normal' extremes should be concentrated around the axes, while the 'abnormal' ones could be anywhere. The training set (white points) consists of normal observations. The testing set consists of normal observations (white points) and abnormal ones (red points). Fig. 3 represents the level sets of the scoring function s_n produced by the algorithm (inverted colors: the darker, the more abnormal) in both the transformed and the non-transformed input space.

Figure 3: Level sets of s_n on simulated 2D data

4.2 Recovering the support of the dependence structure

In this section, we simulate data whose asymptotic behavior corresponds to some exponent measure µ. This measure is chosen so that it concentrates on K chosen cones. The experiments illustrate how much data is needed to properly recover the K sub-cones (namely the dependence structure), depending on its complexity: if the dependence structure spreads over a large number K of sub-cones, then a large number of observations is required.

Datasets of size 50000 (resp. 150000) are generated in ℝ^10 according to a popular multivariate extreme value model introduced by [22], namely a multivariate asymmetric logistic distribution (G_log). The data have the following features: (i) they resemble 'real life' data, that is, the X_i^j's are non-zero and the transformed V̂_i's belong to the interior cone C_{1,...,d}; (ii) the associated (asymptotic) exponent measure concentrates on K disjoint cones {C_{α_m}, 1 ≤ m ≤ K}. For the sake of reproducibility,

G_log(x) = exp{ −∑_{m=1}^{K} ( ∑_{j∈α_m} (|A(j)| x_j)^{−1/w_{α_m}} )^{w_{α_m}} },

where |A(j)| is the cardinality of the set {α ∈ D : j ∈ α} and where w_{α_m} = 0.1 is a dependence parameter (strong dependence). The data are simulated using Algorithm 2.2 in [20].
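The experiments rely on Algorithm 2.2 of [20] for the asymmetric logistic model; as a rough illustration only, the sketch below simulates the symmetric special case (a single cone α = {1, . . . , d} with weight w) via Kanter's positive-stable construction, which yields unit-Fréchet margins and G(x) = exp{−(∑_j x_j^{−1/w})^w}. All function names here are ours, not from the paper.

    import numpy as np

    def positive_stable(w, size, rng):
        # Kanter (1975): S >= 0 with Laplace transform E[exp(-t*S)] = exp(-t**w), 0 < w < 1
        U = rng.uniform(0.0, np.pi, size)
        E = rng.exponential(size=size)
        A = (np.sin(w * U) ** (w / (1 - w)) * np.sin((1 - w) * U)
             / np.sin(U) ** (1 / (1 - w)))
        return (A / E) ** ((1 - w) / w)

    def symmetric_logistic(n, d, w, seed=0):
        # n points in R^d with unit-Frechet margins and symmetric logistic
        # dependence G(x) = exp(-(sum_j x_j**(-1/w))**w); small w = strong dependence
        rng = np.random.default_rng(seed)
        S = positive_stable(w, (n, 1), rng)  # one stable factor shared by all coordinates
        E = rng.exponential(size=(n, d))
        return (S / E) ** w

    X = symmetric_logistic(50000, 10, w=0.1)  # w = 0.1 as in the experiments

With w = 0.1, the shared stable factor dominates and the simulated extremes concentrate near the diagonal of the corresponding cone, mirroring the strong-dependence regime used here.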

The subset D of sub-cones with non-zero µ-mass is randomly chosen (for each fixed number of sub-cones K), and the purpose is to recover D with Algorithm 1. For each K, 100 experiments are performed and we count the number of 'errors', that is, the number of non-recovered or falsely discovered sub-cones. Table 1 shows the average number of errors over the 100 experiments.

# sub-cones K | Aver. # errors (n=5e4) | Aver. # errors (n=15e4)
 3 |  0.07 | 0.00
 5 |  0.00 | 0.01
10 |  0.01 | 0.06
15 |  0.09 | 0.02
20 |  0.39 | 0.14
25 |  1.12 | 0.39
30 |  1.82 | 0.98
35 |  3.59 | 1.85
40 |  6.59 | 3.14
45 |  8.06 | 5.23
50 | 11.21 | 7.87

Table 1: Support recovery on simulated data

The results are very promising in situations where the number of sub-cones is moderate w.r.t. the number of observations. Indeed, when the total number of sub-cones in the dependence structure is 'too large' (relative to the number of observations), some sub-cones are under-represented and become 'too weak' to survive the thresholding (Step 4 in Algorithm 1). Handling complex dependence structures without a comfortable number of observations thus requires a careful choice of the thresholding level µ_min, for instance by cross-validation.
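To make the thresholding step concrete, here is a minimal sketch of Step 4 and of the error count reported in Table 1, assuming `mass` maps each candidate sub-cone (a frozenset of coordinate indices) to its estimated µ-mass; the names are ours.

    def threshold_cones(mass, mu_min):
        # Step 4 (sketch): keep only sub-cones whose estimated mass reaches mu_min
        return {alpha for alpha, m in mass.items() if m >= mu_min}

    def recovery_errors(recovered, truth):
        # errors as counted in Table 1: non-recovered plus falsely discovered sub-cones
        return len(recovered ^ truth)  # symmetric difference of the two sets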

5 Real-world data sets

5.1 Sparse structure of extremes (wave data)

Our goal here is to verify that the two expected phenomena mentioned in the introduction, 1- a sparse dependence structure of extremes (small number of sub-cones with non-zero mass), 2- low dimension of the sub-cones with non-zero mass, do occur with real data.

We consider wave direction data provided by Shell, which consist of 58585 measurements D_i, 1 ≤ i ≤ 58585, of wave directions between 0° and 360° at 50 different locations (buoys in the North Sea). The dimension is thus 50. The angle 90° being fairly rare, we work with data obtained as X_i^j = 1/(10^{−10} + |90 − D_i^j|), where D_i^j is the wave direction at buoy j, time i. Thus, D_i^j's close to 90 correspond to extreme X_i^j's.
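As a hedged sketch of the preprocessing just described, the snippet below applies the 1/(10^{−10} + |90 − D|) transform and then the rank standardization V̂_i^j = 1/(1 − F̂_j(X_i^j)) used by Algorithm 1; the random `D` is a stand-in for the (n, 50) array of wave directions, and the empirical-CDF convention rank/(n + 1) is our choice to avoid division by zero.

    import numpy as np

    def wave_features(D):
        # directions close to 90 degrees are mapped to large values (extremes)
        return 1.0 / (1e-10 + np.abs(90.0 - D))

    def rank_transform(X):
        # V_hat_i^j = 1 / (1 - F_hat_j(X_i^j)) with F_hat_j(x) = rank / (n + 1)
        n = X.shape[0]
        ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1  # ranks in 1..n per column
        return 1.0 / (1.0 - ranks / (n + 1.0))

    D = np.random.default_rng(0).uniform(0.0, 360.0, size=(58585, 50))  # stand-in for the Shell data
    V = rank_transform(wave_features(D))  # approximately unit-Pareto margins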

Results in Table 2 (µ_total denotes the total probability mass of µ) show that the number of sub-cones C_α identified by Algorithm 1 is indeed small




Page 8: Sparse Representation of Multivariate Extremes with Applications to Anomaly Ranking (AISTATS2016)

[Slides: univariate extreme value theory background. The recoverable formulas are:]

Z_n = max{X_1, . . . , X_n}, a_n > 0, b_n ∈ ℝ,
lim_{n→∞} n P((Z_n − b_n)/a_n ≥ z) = −log G(z) = (1 + γz)^{−1/γ}, for 1 + γz > 0, γ ∈ ℝ.

Page 9: Sparse Representation of Multivariate Extremes with Applications to Anomaly Ranking (AISTATS2016)

[Figure: histograms of block maxima (legend: block sizes 100 and 200) of samples drawn from Exp(1), N(0, 1), Par(2) and U(0, 1) distributions; x-axis: maximum.]
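The histograms summarized above show renormalized block maxima approaching their extreme value limits; here is a small numeric check for the Exp(1) panel, where Z_n − log n converges to a standard Gumbel law (γ = 0) whose mean is Euler's constant:

    import numpy as np

    rng = np.random.default_rng(0)
    n, blocks = 1000, 5000
    Z = rng.exponential(size=(blocks, n)).max(axis=1)  # block maxima of Exp(1) samples
    print(Z.mean() - np.log(n))  # close to 0.5772..., the mean of the standard Gumbel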

Page 10: Sparse Representation of Multivariate Extremes with Applications to Anomaly Ranking (AISTATS2016)

[Slides: multivariate extreme value theory background. The recoverable formulas are:]

Componentwise maxima Z_n = max{X_1, . . . , X_n}, a_n > 0, b_n ∈ ℝ^d, margins G_j(z) = (1 + γ_j z)^{−1/γ_j},
lim_{n→∞} n P((Z_n^1 − b_n^1)/a_n^1 ≥ z_1 or . . . or (Z_n^d − b_n^d)/a_n^d ≥ z_d) = −log G(z).

Page 11: Sparse Representation of Multivariate Extremes with Applications to Anomaly Ranking (AISTATS2016)

[Slides: standardization to unit-Pareto margins via V^j = 1/(1 − F_j(X^j)):]

lim_{n→∞} n P(V^1/n ≥ v_1 or . . . or V^d/n ≥ v_d) = µ(([0, v_1] × · · · × [0, v_d])^c).
Page 13: Sparse Representation of Multivariate Extremes with Applications to Anomaly Ranking (AISTATS2016)

stated below shows that nonzero mass on C_α is the same as nonzero mass on R_α^ε for ε arbitrarily small.

Figure 1: Truncated cones in 3D. Figure 2: Truncated ε-rectangles in 2D.

Lemma 1. For any nonempty index subset ∅ ≠ α ⊂ {1, . . . , d}, the exponent measure of C_α is

µ(C_α) = lim_{ε→0} µ(R_α^ε).

Proof. First consider the case α = {1, . . . , d}. Then the R_α^ε's form an increasing sequence of sets as ε decreases and C_α = R_α^0 = ∪_{ε>0, ε∈ℚ} R_α^ε. The result follows from the 'continuity from below' property of the measure µ. Now, for ε ≥ 0 and α ⊊ {1, . . . , d}, consider the sets

O_α^ε = {x ∈ ℝ_+^d : ∀j ∈ α : x_j > ε},
N_α^ε = {x ∈ ℝ_+^d : ∀j ∈ α : x_j > ε, ∃j ∉ α : x_j > ε},

so that N_α^ε ⊂ O_α^ε and R_α^ε = O_α^ε \ N_α^ε. Observe also that C_α = O_α^0 \ N_α^0. Thus, µ(R_α^ε) = µ(O_α^ε) − µ(N_α^ε) and µ(C_α) = µ(O_α^0) − µ(N_α^0), so that it is sufficient to show that

µ(N_α^0) = lim_{ε→0} µ(N_α^ε) and µ(O_α^0) = lim_{ε→0} µ(O_α^ε).

Notice that the N_α^ε's and the O_α^ε's form two increasing sequences of sets (as ε decreases), and that N_α^0 = ∪_{ε>0, ε∈ℚ} N_α^ε and O_α^0 = ∪_{ε>0, ε∈ℚ} O_α^ε. This proves the desired result.

We may now make precise the above heuristic interpretation of the quantities µ(C_α): the vector M = {µ(C_α) : ∅ ≠ α ⊂ {1, . . . , d}} asymptotically describes the dependence structure of the extremal observations. Indeed, by

Page 14: Sparse Representation of Multivariate Extremes with Applications to Anomaly Ranking (AISTATS2016)

Figure 3: Estimation procedure

equivalently the corresponding sub-cones are low-dimensional compared with d.

In fact, M̂(α) is (up to a normalizing constant) an empirical version of the conditional probability that T(X) belongs to the rectangle rR_α^ε, given that ‖T(X)‖ exceeds a large threshold r. Indeed, as explained in Remark 1,

M(α) = lim_{r→∞} µ([0, 1]^c) P(T(X) ∈ rR_α^ε | ‖T(X)‖ ≥ r).   (3.4)

The remainder of this section is devoted to obtaining non-asymptotic upper bounds on the error ‖M̂ − M‖_∞. The main result is stated in Theorem 1. First of all, notice that the error may be decomposed as the sum of a stochastic term and a bias term inherent to the ε-thickening approach:

‖M̂ − M‖_∞ = max_α |µ_n(R_α^ε) − µ(C_α)|
           ≤ max_α |µ − µ_n|(R_α^ε) + max_α |µ(R_α^ε) − µ(C_α)|.   (3.5)

Here and in what follows, for notational convenience, we simply write 'α' for 'α nonempty subset of {1, . . . , d}'. The main steps of the argument leading to



Page 21: Sparse Representation of Multivariate Extremes with Applications to Anomaly Ranking (AISTATS2016)

Nicolas Goix, Anne Sabourin, Stephan Clemencon

Figure 4: sub-cone dimensions of wave data

compared to the total number of sub-cones, 2^50 − 1 (Phenomenon 1). Extreme data are essentially concentrated in 18 sub-cones. Further, the dimension of those sub-cones is essentially moderate (Phenomenon 2): 93%, 98.6% and 99.6% of the mass is assigned to sub-cones of dimension no greater than 10, 15 and 20, respectively (to be compared with d = 50). Histograms displaying the mass repartition produced by Algorithm 1 are given in Fig. 4.

                                                      | non-extreme data | extreme data
# of sub-cones with positive mass (µ_min/µ_total = 0) | 3413             | 858
ditto after thresholding (µ_min/µ_total = 0.002)      | 2                | 64
ditto after thresholding (µ_min/µ_total = 0.005)      | 1                | 18

Table 2: Total number of sub-cones of wave data

5.2 Anomaly Detection

Figure 5: Combination of any AD algorithm with DAMEX

The main purpose of Algorithm 1 is to deal with extreme data. In this section we show that it may be combined with a standard AD algorithm to handle both extreme and non-extreme data, improving the global performance of the chosen standard algorithm. This can be done, as illustrated in Fig. 5, by splitting the input space into an extreme region and a non-extreme one, then applying Algorithm 1 to the extreme region, while the non-extreme one is processed with the standard algorithm.
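A minimal sketch of this combination, under explicit assumptions: `damex_score` stands for the scoring function produced by Algorithm 1 on the extreme region, the extreme region is taken to be {x : ‖T(x)‖_∞ ≥ r} for a chosen radius r, and scikit-learn's IsolationForest handles the rest; all names and the splitting rule are illustrative, not the paper's exact pipeline.

    import numpy as np
    from sklearn.ensemble import IsolationForest

    def fit_hybrid(T_train, r, damex_score):
        # T_train: rank-transformed training data; r: radius splitting the input space
        iforest = IsolationForest(random_state=0).fit(T_train)

        def score(T):
            extreme = np.linalg.norm(T, ord=np.inf, axis=1) >= r
            s = np.empty(len(T))
            s[extreme] = damex_score(T[extreme])              # DAMEX on the extreme region
            s[~extreme] = iforest.score_samples(T[~extreme])  # iForest elsewhere
            return s  # in both branches, lower scores flag more abnormal points

        return score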

One standard AD algorithm is the Isolation Forest (iForest) algorithm, which we chose in view of its established high performance ([15]). Our aim is to compare the results obtained with the combined method 'iForest + DAMEX' described above to those obtained with iForest alone on the whole input space.

Dataset     | number of samples | number of features
shuttle     | 85849             | 9
forestcover | 286048            | 54
SA          | 976158            | 41
SF          | 699691            | 4
http        | 619052            | 3
smtp        | 95373             | 3

Table 3: Datasets characteristics

Six reference datasets in AD are considered: shuttle, forestcover, http, smtp, SF and SA. The experiments are performed in a semi-supervised framework (the training set consists of normal data). In an unsupervised framework (training set including abnormal data), the improvements brought by the use of DAMEX are less significant, but the precision score is still increased when the recall is high (high rate of true positives), inducing more vertical ROC curves near the origin.

The shuttle dataset is available in the UCI repository [14]. We use instances from all different classes but class 4, which yields an anomaly ratio (class 1) of 7.15%. In the forestcover data, also available at the UCI repository ([14]), the normal data are the instances from class 2, while instances from class 4 are anomalies, which yields an anomaly ratio of 0.9%. The last four datasets belong to the KDD Cup '99 dataset ([12], [21]), which consists of a wide variety of hand-injected attacks (anomalies) in a closed network (normal background). Since the original demonstrative purpose of the dataset concerns supervised AD, the anomaly rate is very high (80%). We thus transform the KDD data to obtain smaller anomaly rates. For datasets SF, http and smtp, we proceed as described in [23]: SF is obtained by picking out the data with positive logged-in attribute, and focusing on the intrusion attack, which gives 0.3% of anomalies. The two datasets http and smtp are two subsets of SF corresponding to a third feature equal to 'http' (resp. to 'smtp'). Finally, the SA dataset is obtained as in [8] by selecting all the normal data, together with a small proportion (1%) of anomalies.

Table 3 summarizes the characteristics of these datasets. For each of them, 20 experiments on random training and testing datasets are performed, yielding averaged ROC and Precision-Recall curves whose AUCs are presented in Table 4. The parameter µ_min is fixed to µ_total/(# charged sub-cones), the averaged mass of the non-empty sub-cones. The parameters (k, ε) are chosen according to Remarks 1 and 2. The stability w.r.t. k (resp. ε) is investigated over the range [n^{1/4}, n^{2/3}] (resp. [0.0001, 0.1]).
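For concreteness, the stability grids just mentioned could be laid out as follows (the exact grid points are not specified in the text; this is an assumption):

    import numpy as np

    n = 100_000  # illustrative training-set size
    k_grid = np.unique(np.round(n ** np.linspace(1/4, 2/3, 6)).astype(int))  # within [n^(1/4), n^(2/3)]
    eps_grid = [1e-4, 1e-3, 1e-2, 1e-1]  # epsilon explored within [0.0001, 0.1]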


Dataset     | iForest only ROC | PR    | iForest + DAMEX ROC | PR
shuttle     | 0.996            | 0.974 | 0.997               | 0.987
forestcov.  | 0.964            | 0.193 | 0.976               | 0.363
http        | 0.993            | 0.185 | 0.999               | 0.500
smtp        | 0.900            | 0.004 | 0.898               | 0.003
SF          | 0.941            | 0.041 | 0.980               | 0.694
SA          | 0.990            | 0.387 | 0.999               | 0.892

Table 4: Results in terms of AUC

This yields parameters (k, ε) = (n^{1/3}, 0.0001) for SA and forestcover, and (k, ε) = (n^{1/2}, 0.01) for shuttle. As the datasets http, smtp and SF do not have enough features to consider the stability, we choose the (standard) parameters (k, ε) = (n^{1/2}, 0.01). DAMEX significantly improves the precision for each dataset, except for smtp.

Figure 6: ROC and PR curve on forestcover dataset

Figure 7: ROC and PR curve on http dataset

In terms of AUC of the ROC curve, one observes slight or negligible improvements. Figures 6, 7, 8, 9 represent averaged ROC and PR curves for forestcover, http, smtp and SF. The curves for the two other datasets are available in the supplementary material. Except for the smtp dataset, one observes a higher slope at the origin of the ROC curve when using DAMEX. This illustrates the fact that DAMEX is particularly adapted to situations where one has to work under a low false positive rate constraint.

Figure 8: ROC and PR curve on smtp dataset

Figure 9: ROC and PR curve on SF dataset

Concerning the smtp dataset, the algorithm seems to be unable to capture any extreme dependence structure, either because the latter is non-existent (no regularly varying tail), or because the convergence is too slow to appear in our relatively small dataset.

6 Conclusion

The DAMEX algorithm detects anomalies occurring in extreme regions of a possibly high-dimensional input space by identifying lower-dimensional subspaces around which normal extreme data concentrate. It is designed in accordance with well-established results borrowed from multivariate Extreme Value Theory. Various experiments on simulated data and real Anomaly Detection datasets demonstrate its ability to recover the support of the extremal dependence structure of the data, thus improving the performance of standard Anomaly Detection algorithms. These results pave the way towards novel approaches in Machine Learning that take advantage of multivariate Extreme Value Theory tools for learning tasks involving features subject to extreme behaviors.

Acknowledgements

Part of this work has been supported by the industrial chair 'Machine Learning for Big Data' from Telecom ParisTech.
