Sparse Representation of Multivariate Extremes with Applications to Anomaly Ranking (AISTATS2016)

Sparse Representation of Multivariate Extremes for Anomaly Detection
4 Experiments on simulated data
4.1 Simulation on 2D data
The purpose of this simulation is to provide insight into the rationale of the algorithm in the bivariate case. Normal data are simulated under a 2D logistic distribution with asymmetric parameters (white and green points in Fig. 3), while the abnormal ones are uniformly distributed. Thus, the 'normal' extremes should be concentrated around the axes, while the 'abnormal' ones could be anywhere. The training set (white points) consists of normal observations. The testing set consists of normal observations (white points) and abnormal ones (red points). Fig. 3 represents the level sets of this scoring function (inverted colors: the darker, the more abnormal) in both the transformed and the non-transformed input space.
Figure 3: Level sets of s_n on simulated 2D data
4.2 Recovering the support of the dependence structure
In this section, we simulate data whose asymptotic behavior corresponds to some exponent measure µ. This measure is chosen so that it concentrates on K chosen cones. The experiments illustrate how much data is needed to properly recover the K subcones (namely, the dependence structure), depending on its complexity. If the dependence structure spreads over a high number K of subcones, then a high number of observations will be required.
Datasets of size 50000 (resp. 150000) are generated in R^10 according to a popular multivariate extreme value model introduced by [22], namely a multivariate asymmetric logistic distribution (G_log). The data have the following features: (i) they resemble 'real life' data, that is, the X_i^j's are nonzero and the transformed V̂_i's belong to the interior cone C_{1,...,d}; (ii) the associated (asymptotic) exponent measure concentrates on K disjoint cones {C_{α_m}, 1 ≤ m ≤ K}. For the sake of reproducibility,

    G_log(x) = exp{ − Σ_{m=1}^K ( Σ_{j∈α_m} (A(j) x_j)^{−1/w_{α_m}} )^{w_{α_m}} },

where A(j) is the cardinality of the set {α ∈ D : j ∈ α} and where w_{α_m} = 0.1 is a dependence parameter (strong dependence). The data are simulated using Algorithm 2.2 in
[20]. The subset D of subcones with nonzero µ-mass is randomly chosen (for each fixed number of subcones K), and the purpose is to recover D with Algorithm 1. For each K, 100 experiments are run, and we count the number of 'errors', that is, the number of non-recovered or falsely discovered subcones. Table 1 shows the average number of errors over the 100 experiments.
# subcones K   Aver. # errors (n = 5e4)   Aver. # errors (n = 15e4)
      3                 0.07                       0.01
      5                 0.00                       0.01
     10                 0.01                       0.06
     15                 0.09                       0.02
     20                 0.39                       0.14
     25                 1.12                       0.39
     30                 1.82                       0.98
     35                 3.59                       1.85
     40                 6.59                       3.14
     45                 8.06                       5.23
     50                11.21                       7.87

Table 1: Support recovery on simulated data
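The error count reported in Table 1 can be made concrete with a small sketch. This is a hypothetical reconstruction (the paper gives no code): an 'error' is a subcone that is either missed or falsely discovered, i.e. an element of the symmetric difference between the true support D and the recovered one.

```python
def support_errors(true_support, recovered_support):
    # Number of non-recovered plus falsely discovered subcones:
    # the size of the symmetric difference between the two supports.
    true_set = {frozenset(a) for a in true_support}
    rec_set = {frozenset(a) for a in recovered_support}
    return len(true_set ^ rec_set)

# Toy example: one missed subcone ({4, 5, 6}) and one false discovery ({7}).
D_true = [{1, 2}, {3}, {4, 5, 6}]
D_hat = [{1, 2}, {3}, {7}]
print(support_errors(D_true, D_hat))  # 2 errors
```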
The results are very promising in situations where the number of subcones is moderate w.r.t. the number of observations. Indeed, when the total number of subcones in the dependence structure is 'too large' (relative to the number of observations), some subcones are under-represented and become 'too weak' to survive the thresholding (Step 4 in Algorithm 1). Handling complex dependence structures without a comfortable number of observations thus requires a careful choice of the thresholding level µ_min, for instance by cross-validation.
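The thresholding step discussed above (Step 4 in Algorithm 1) amounts to discarding faces whose empirical mass falls below µ_min. A minimal sketch, with hypothetical names:

```python
def threshold_support(masses, mu_min):
    # Keep only the subcones alpha whose empirical mass exceeds mu_min;
    # under-represented subcones are pruned, which is why mu_min must be
    # chosen carefully (e.g. by cross-validation) when the dependence
    # structure is complex relative to the sample size.
    return {alpha: m for alpha, m in masses.items() if m > mu_min}

masses = {frozenset({1}): 0.40, frozenset({1, 2}): 0.35, frozenset({2}): 0.003}
print(sorted(len(a) for a in threshold_support(masses, mu_min=0.01)))  # [1, 2]
```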
5 Real-world data sets
5.1 Sparse structure of extremes (wave data)
Our goal here is to verify that the two expected phenomena mentioned in the introduction do occur with real data: (1) sparse dependence structure of extremes (small number of subcones with nonzero mass); (2) low dimension of the subcones with nonzero mass.
We consider wave direction data provided by Shell, which consist of 58585 measurements D_i, i ≤ 58595, of wave directions between 0° and 360° at 50 different locations (buoys in the North Sea). The dimension is thus 50. The angle 90° being fairly rare, we work with data obtained as X_i^j = 1/(10^{−10} + |90 − D_i^j|), where D_i^j is the wave direction at buoy j and time i. Thus, D_i^j's close to 90 correspond to extreme X_i^j's. Results in Table 2 (µ_total denotes the total probability mass of µ) show that the number of subcones C_α identified by Algorithm 1 is indeed small
Background (extreme value theory). Let Z_n = max{X_1, . . . , X_n}. For suitable sequences a_n > 0 and b_n ∈ R,

    lim_{n→∞} n P( (Z_n − b_n)/a_n ≥ z ) = −log G(z) = (1 + γz)^{−1/γ},    1 + γz > 0, γ ∈ R,

so that G is the generalized extreme value distribution, G(z) = exp{−(1 + γz)^{−1/γ}}. In the multivariate case, Z_n = max{X_1, . . . , X_n} componentwise, b_n ∈ R^d, each margin satisfies the same limit with G_j(z) = exp{−(1 + γ_j z)^{−1/γ_j}}, and jointly

    lim_{n→∞} n P( (Z_n^1 − b_n^1)/a_n^1 ≥ z_1 or . . . or (Z_n^d − b_n^d)/a_n^d ≥ z_d ) = −log G(z).
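The block-maxima limit can be checked numerically. The sketch below is our illustration (not from the paper): block maxima of Exp(1) samples, centered with b_n = log n and a_n = 1, fall in the Gumbel case γ → 0, where the limit distribution is exp(−exp(−z)).

```python
import math
import random

random.seed(0)
n, n_blocks = 1000, 2000
# Centered block maxima of n i.i.d. Exp(1) variables: Z_n - log(n).
maxima = [max(random.expovariate(1.0) for _ in range(n)) - math.log(n)
          for _ in range(n_blocks)]

# The empirical P(Z_n - log n <= z) should approach the Gumbel cdf
# exp(-exp(-z)) as n grows.
z = 1.0
empirical = sum(m <= z for m in maxima) / n_blocks
gumbel = math.exp(-math.exp(-z))
print(round(empirical, 2), round(gumbel, 2))
```

With 2000 blocks the two values typically agree to within a few percent, illustrating the convergence stated above.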
[Figure: histograms of block maxima over blocks of size 100 and 200, drawn from Exp(1), N(0, 1), Par(2) and U(0, 1) samples]
Marginal standardization. Setting V^j = 1/(1 − F_j(X^j)), where F_j is the j-th marginal distribution function, makes each margin standard Pareto. Writing [0, v] = [0, v_1] × · · · × [0, v_d], the exponent measure µ arises as the limit

    lim_{n→∞} n P( V^1/n ≥ v_1 or . . . or V^d/n ≥ v_d ) = µ([0, v]^c).

[Figure: estimation procedure in dimension 2, showing the empirical masses µ_n^{{1},ε}, µ_n^{{1,2},ε}, µ_n^{{2},ε} of the ε-thickened regions and their total Σ_α µ_n^{α,ε}]
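In practice the standardization V^j = 1/(1 − F_j(X^j)) is applied with the empirical cdf of each coordinate. A minimal sketch (our illustration; the function name is ours), using the strict empirical cdf F̂(x) = #{y < x}/n so the transform never divides by zero:

```python
def rank_transform(column):
    # Empirical version of v = 1/(1 - F(x)): with F_hat(x) = #{y < x}/n,
    # the transformed value is n/(n - rank), so the largest observation
    # maps to n and the margin becomes approximately standard Pareto.
    n = len(column)
    return [n / (n - sum(1 for y in column if y < x)) for x in column]

print(rank_transform([10.0, 2.0, 7.0, 5.0]))  # [4.0, 1.0, 2.0, ~1.33]
```

Applying this per coordinate puts all components on a common scale, which is what allows the masses of the subcones to be compared.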
stated below shows that nonzero mass on C_α is the same as nonzero mass on R_α^ε for ε arbitrarily small.

Figure 1: Truncated cones in 3D. Figure 2: Truncated ε-rectangles in 2D

Lemma 1. For any nonempty index subset ∅ ≠ α ⊂ {1, . . . , d}, the exponent measure of C_α is

    µ(C_α) = lim_{ε→0} µ(R_α^ε).

Proof. First consider the case α = {1, . . . , d}. Then the R_α^ε's form an increasing sequence of sets as ε decreases, and C_α = R_α^0 = ∪_{ε>0, ε∈Q} R_α^ε. The result follows from the 'continuity from below' property of the measure µ. Now, for ε ≥ 0 and α ⊊ {1, . . . , d}, consider the sets

    O_α^ε = {x ∈ R_+^d : ∀j ∈ α, x_j > ε},
    N_α^ε = {x ∈ R_+^d : ∀j ∈ α, x_j > ε; ∃j ∉ α, x_j > ε},

so that N_α^ε ⊂ O_α^ε and R_α^ε = O_α^ε \ N_α^ε. Observe also that C_α = O_α^0 \ N_α^0. Thus, µ(R_α^ε) = µ(O_α^ε) − µ(N_α^ε) and µ(C_α) = µ(O_α^0) − µ(N_α^0), so that it is sufficient to show that

    µ(N_α^0) = lim_{ε→0} µ(N_α^ε)  and  µ(O_α^0) = lim_{ε→0} µ(O_α^ε).

Notice that the N_α^ε's and the O_α^ε's form two increasing sequences of sets (as ε decreases), and that N_α^0 = ∪_{ε>0, ε∈Q} N_α^ε and O_α^0 = ∪_{ε>0, ε∈Q} O_α^ε. This proves the desired result.

We may now make precise the above heuristic interpretation of the quantities µ(C_α): the vector M = {µ(C_α) : ∅ ≠ α ⊂ {1, . . . , d}} asymptotically describes the dependence structure of the extremal observations. Indeed, by
Figure 3: Estimation procedure

equivalently, the corresponding subcones are low-dimensional compared with d.

In fact, M̂(α) is (up to a normalizing constant) an empirical version of the conditional probability that T(X) belongs to the rectangle rR_α^ε, given that ‖T(X)‖ exceeds a large threshold r. Indeed, as explained in Remark 1,

    M(α) = lim_{r→∞} µ([0,1]^c) P( T(X) ∈ rR_α^ε | ‖T(X)‖ ≥ r ).    (3.4)

The remainder of this section is devoted to obtaining non-asymptotic upper bounds on the error ‖M̂ − M‖_∞. The main result is stated in Theorem 1. First, notice that the error may be decomposed as the sum of a stochastic term and a bias term inherent to the ε-thickening approach:

    ‖M̂ − M‖_∞ = max_α | µ_n(R_α^ε) − µ(C_α) |
               ≤ max_α | µ − µ_n |(R_α^ε) + max_α | µ(R_α^ε) − µ(C_α) |.    (3.5)

Here and throughout, for notational convenience, we simply write 'α' for 'α nonempty subset of {1, . . . , d}'. The main steps of the argument leading to
Nicolas Goix, Anne Sabourin, Stephan Clemencon
Figure 4: subcone dimensions of wave data
compared to the total number of subcones, 2^50 − 1 (Phenomenon 1). Extreme data are essentially concentrated in 18 subcones. Further, the dimension of those subcones is essentially moderate (Phenomenon 2): 93%, 98.6% and 99.6% of the mass is assigned to subcones of dimension no greater than 10, 15 and 20 respectively (to be compared with d = 50). Histograms displaying the mass repartition produced by Algorithm 1 are given in Fig. 4.
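The transformation of wave directions into heavy-tailed variables described above can be sketched as follows; reading the formula with the absolute deviation |90 − D| is our interpretation of the text.

```python
import numpy as np

def direction_to_extreme(D):
    """Map wave directions (in degrees) so that angles near 90
    yield large values: X = 1 / (1e-10 + |90 - D|)."""
    D = np.asarray(D, dtype=float)
    return 1.0 / (1e-10 + np.abs(90.0 - D))
```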
                                                       non-extreme data   extreme data
# of subcones with positive mass (µ_min/µ_total = 0)         3413             858
ditto after thresholding (µ_min/µ_total = 0.002)                2              64
ditto after thresholding (µ_min/µ_total = 0.005)                1              18

Table 2: Total number of subcones of wave data
5.2 Anomaly Detection
Figure 5: Combination of any AD algorithm with DAMEX
The main purpose of Algorithm 1 is to deal with extreme data. In this section we show that it may be combined with a standard AD algorithm to handle both extreme and non-extreme data, improving the global performance of the chosen standard algorithm. This can be done, as illustrated in Fig. 5, by splitting the input space into an extreme region and a non-extreme one, then applying Algorithm 1 to the extreme region, while the non-extreme one is processed with the standard algorithm.
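The splitting scheme of Fig. 5 can be sketched as follows; `is_extreme`, `damex_score` and `standard_score` are hypothetical stand-ins for the extreme-region test, Algorithm 1's scoring function and the standard AD scorer.

```python
import numpy as np

def combined_score(X, is_extreme, damex_score, standard_score):
    """Score extreme points with DAMEX and the remaining points
    with the standard AD algorithm (sketch of Fig. 5)."""
    X = np.asarray(X, dtype=float)
    mask = np.asarray(is_extreme, dtype=bool)
    scores = np.empty(len(X))
    scores[mask] = damex_score(X[mask])
    scores[~mask] = standard_score(X[~mask])
    return scores
```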
One standard AD algorithm is Isolation Forest (iForest), which we chose in view of its established high performance ([15]). Our aim is to compare the results obtained with the combined method ‘iForest + DAMEX’ described above to those obtained with iForest alone on the whole input space.
Dataset        # samples   # features
shuttle            85849            9
forestcover       286048           54
SA                976158           41
SF                699691            4
http              619052            3
smtp               95373            3

Table 3: Dataset characteristics
Six reference datasets in AD are considered: shuttle, forestcover, http, smtp, SF and SA. The experiments are performed in a semi-supervised framework (the training set consists of normal data). In an unsupervised framework (training set including abnormal data), the improvements brought by the use of DAMEX are less significant, but the precision score is still increased when the recall is high (high rate of true positives), inducing more vertical ROC curves near the origin.
The shuttle dataset is available in the UCI repository [14]. We use instances from all classes but class 4, which yields an anomaly ratio (class 1) of 7.15%. In the forestcover data, also available at the UCI repository ([14]), the normal data are the instances from class 2, while instances from class 4 are anomalies, which yields an anomaly ratio of 0.9%. The last four datasets belong to the KDD Cup ’99 dataset ([12], [21]), which consists of a wide variety of hand-injected attacks (anomalies) in a closed network (normal background). Since the original demonstrative purpose of the dataset concerns supervised AD, the anomaly rate is very high (80%). We thus transform the KDD data to obtain smaller anomaly rates. For datasets SF, http and smtp, we proceed as described in [23]: SF is obtained by picking up the data with positive logged_in attribute and focusing on the intrusion attack, which gives 0.3% of anomalies. The two datasets http and smtp are subsets of SF corresponding to a third feature equal to ’http’ (resp. ’smtp’). Finally, the SA dataset is obtained as in [8] by selecting all the normal data, together with a small proportion (1%) of anomalies.
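The SF-style selection described above (keeping records with a positive logged_in attribute) reduces to boolean masking. A toy sketch with synthetic arrays, not the actual KDD Cup '99 preprocessing pipeline:

```python
import numpy as np

def select_positive_logged_in(X, logged_in, labels):
    """Keep only records whose logged_in attribute is positive
    and report the resulting anomaly ratio (labels == 1)."""
    mask = np.asarray(logged_in) > 0
    X_sub = np.asarray(X)[mask]
    y_sub = np.asarray(labels)[mask]
    return X_sub, y_sub, float(y_sub.mean())
```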
Table 3 summarizes the characteristics of these datasets. For each of them, 20 experiments on random training and testing datasets are performed, yielding averaged ROC and Precision-Recall curves whose AUCs are presented in Table 4. The parameter µ_min is fixed to µ_total/(# charged subcones), the averaged mass of the non-empty subcones.
The parameters (k, ε) are chosen according to Remarks 1 and 2. The stability w.r.t. k (resp. ε) is investigated over the range [n^{1/4}, n^{2/3}] (resp. [0.0001, 0.1]). This yields parameters (k, ε) = (n^{1/3}, 0.0001) for SA and forestcover, and (k, ε) = (n^{1/2}, 0.01) for shuttle. As the datasets http, smtp and SF do not have enough features to consider the stability, we choose the (standard) parameters (k, ε) = (n^{1/2}, 0.01). DAMEX significantly improves the precision for each dataset, except for smtp.

Dataset       iForest only         iForest + DAMEX
              ROC      PR          ROC      PR
shuttle       0.996    0.974       0.997    0.987
forestcov.    0.964    0.193       0.976    0.363
http          0.993    0.185       0.999    0.500
smtp          0.900    0.004       0.898    0.003
SF            0.941    0.041       0.980    0.694
SA            0.990    0.387       0.999    0.892

Table 4: Results in terms of AUC
Figure 6: ROC and PR curve on forestcover dataset
Figure 7: ROC and PR curve on http dataset
In terms of AUC of the ROC curve, one observes slight or negligible improvements. Figures 6, 7, 8 and 9 represent averaged ROC and PR curves for forestcover, http, smtp and SF. The curves for the two other datasets are available in the supplementary material. Except for the smtp dataset, one observes a higher slope at the origin of the ROC curve when using DAMEX. This illustrates the fact that DAMEX is particularly adapted to situations where one has to work under a low false-positive-rate constraint.
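The "slope at the origin" observation can be quantified as the true-positive rate achievable within a small false-positive budget; a sketch (not the evaluation code used in the paper):

```python
import numpy as np

def tpr_at_fpr(scores, y_true, max_fpr=0.01):
    """TPR achievable while keeping FPR <= max_fpr, scanning the
    ROC curve from its origin (most anomalous scores first)."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)                 # true positives at each cutoff
    fp = np.cumsum(1 - y)             # false positives at each cutoff
    n_pos = max(int(y.sum()), 1)
    n_neg = max(int((1 - y).sum()), 1)
    ok = fp / n_neg <= max_fpr        # cutoffs within the FPR budget
    return float(tp[ok].max() / n_pos) if ok.any() else 0.0
```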
Figure 8: ROC and PR curve on smtp dataset
Figure 9: ROC and PR curve on SF dataset
Concerning the smtp dataset, the algorithm seems unable to capture any extreme dependence structure, either because the latter is nonexistent (no regularly varying tail), or because the convergence is too slow to appear in our relatively small dataset.
6 Conclusion
The DAMEX algorithm makes it possible to detect anomalies occurring in extreme regions of a possibly high-dimensional input space by identifying lower-dimensional subspaces around which normal extreme data concentrate. It is designed in accordance with well-established results borrowed from multivariate Extreme Value Theory. Various experiments on simulated data and real Anomaly Detection datasets demonstrate its ability to recover the support of the extremal dependence structure of the data, thus improving the performance of standard Anomaly Detection algorithms. These results pave the way towards novel approaches in Machine Learning that take advantage of multivariate Extreme Value Theory tools for learning tasks involving features subject to extreme behaviors.
Acknowledgements
Part of this work has been supported by the industrial chair ‘Machine Learning for Big Data’ from Telecom ParisTech.