N IME ERIES - Data Science Summer School · max )( min) ; where max min = 1 + N=T 2 p N=T , and 2 [...

1
O N C LUSTERING F INANCIAL T IME S ERIES G AUTIER M ARTI ,P HILIPPE D ONNAT AND F RANK N IELSEN N OISY CORRELATION MATRICES Let X be the matrix storing the standardized re- turns of N = 560 assets (credit default swaps) over a period of T = 2500 trading days. Then, the empirical correlation matrix of the re- turns is C = 1 T XX > . We can compute the empirical density of its eigenvalues ρ(λ)= 1 N dn(λ) , where n(λ) counts the number of eigenvalues of C less than λ. From random matrix theory, the Marchenko- Pastur distribution gives the limit distribution as N →∞, T →∞ and T /N fixed. It reads: ρ(λ)= T /N 2π p (λ max - λ)(λ - λ min ) λ , where λ max min = 1+ N/T ± 2 p N/T , and λ [λ min max ]. Figure 1: Marchenko-Pastur density vs. empirical den- sity of the correlation matrix eigenvalues Notice that the Marchenko-Pastur density fits well the empirical density meaning that most of the information contained in the empirical corre- lation matrix amounts to noise: only 26 eigenval- ues are greater than λ max . The highest eigenvalue corresponds to the ‘mar- ket’, the 25 others can be associated to ‘industrial sectors’. C LUSTERING TIME SERIES Given a correlation matrix of the returns, Figure 2: An empirical and noisy correlation matrix one can re-order assets using a hierarchical clus- tering algorithm to make the hierarchical correla- tion pattern blatant, Figure 3: The same noisy correlation matrix re-ordered by a hierarchical clustering algorithm and finally filter the noise according to the corre- lation pattern: Figure 4: The resulting filtered correlation matrix B EYOND CORRELATION Sklar’s Theorem. For any random vector X =(X 1 ,...,X N ) having continuous marginal cumulative distribution functions F i , its joint cumulative distribution F is uniquely expressed as F (X 1 ,...,X N )= C (F 1 (X 1 ),...,F N (X N )), where C , the multivariate distribution of uniform marginals, is known as the copula of X . Figure 5: ArcelorMittal and Société générale prices are projected on dependence distribution space; notice their heavy-tailed exponential distribution. Let θ [0, 1]. Let (X, Y ) ∈V 2 . Let G =(G X ,G Y ), where G X and G Y are respectively X and Y marginal cdf. We define the following distance d 2 θ (X, Y )= θd 2 1 (G X (X ),G Y (Y )) + (1 - θ )d 2 0 (G X ,G Y ), where d 2 1 (G X (X ),G Y (Y )) = 3E[|G X (X ) - G Y (Y )| 2 ], and d 2 0 (G X ,G Y )= 1 2 R R q dG X - q dG Y 2 dλ. C LUSTERING R ESULTS &S TABILITY Figure 6: (Top) The returns correlation structure ap- pears more clearly using rank correlation; (Bottom) Clusters of returns distributions can be partly described by the returns volatility Figure 7: Stability test on Odd/Even trading days sub- sampling: our approach (GNPR) yields more stable clusters with respect to this perturbation than standard approaches (using Pearson correlation or L 2 distances).

Transcript of N IME ERIES - Data Science Summer School · max )( min) ; where max min = 1 + N=T 2 p N=T , and 2 [...

Page 1: N IME ERIES - Data Science Summer School · max )( min) ; where max min = 1 + N=T 2 p N=T , and 2 [ min; max]. 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 ¸ 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4

ON CLUSTERING FINANCIAL TIME SERIESGAUTIER MARTI, PHILIPPE DONNAT AND FRANK NIELSEN

NOISY CORRELATION MATRICESLet X be the matrix storing the standardized re-turns of N = 560 assets (credit default swaps)over a period of T = 2500 trading days.

Then, the empirical correlation matrix of the re-turns is

C =1

TXX>.

We can compute the empirical density of itseigenvalues

ρ(λ) =1

N

dn(λ)

dλ,

where n(λ) counts the number of eigenvalues ofC less than λ.

From random matrix theory, the Marchenko-Pastur distribution gives the limit distribution asN →∞, T →∞ and T/N fixed. It reads:

ρ(λ) =T/N

√(λmax − λ)(λ− λmin)

λ,

where λmaxmin = 1 + N/T ± 2

√N/T , and λ ∈

[λmin, λmax].

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0

λ

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

1.8

ρ(λ

)

Figure 1: Marchenko-Pastur density vs. empirical den-sity of the correlation matrix eigenvalues

Notice that the Marchenko-Pastur density fitswell the empirical density meaning that most ofthe information contained in the empirical corre-lation matrix amounts to noise: only 26 eigenval-ues are greater than λmax.The highest eigenvalue corresponds to the ‘mar-ket’, the 25 others can be associated to ‘industrialsectors’.

CLUSTERING TIME SERIESGiven a correlation matrix of the returns,

0 100 200 300 400 5000

100

200

300

400

500

Figure 2: An empirical and noisy correlation matrix

one can re-order assets using a hierarchical clus-tering algorithm to make the hierarchical correla-tion pattern blatant,

0 100 200 300 400 5000

100

200

300

400

500

Figure 3: The same noisy correlation matrix re-orderedby a hierarchical clustering algorithm

and finally filter the noise according to the corre-lation pattern:

0 100 200 300 400 5000

100

200

300

400

500

Figure 4: The resulting filtered correlation matrix

BEYOND CORRELATIONSklar’s Theorem. For any random vector X = (X1, . . . , XN ) having continuous marginal cumulativedistribution functions Fi, its joint cumulative distribution F is uniquely expressed as

F (X1, . . . , XN ) = C(F1(X1), . . . , FN (XN )),

where C, the multivariate distribution of uniform marginals, is known as the copula of X .

Figure 5: ArcelorMittal and Société générale prices are projected on dependence ⊕ distribution space; notice theirheavy-tailed exponential distribution.

Let θ ∈ [0, 1]. Let (X,Y ) ∈ V2. Let G = (GX , GY ), where GX and GY are respectively X and Y marginalcdf. We define the following distance

d2θ(X,Y ) = θd21(GX(X), GY (Y )) + (1− θ)d20(GX , GY ),

where d21(GX(X), GY (Y )) = 3E[|GX(X)−GY (Y )|2], and d20(GX , GY ) =12

∫R

(√dGX

dλ −√

dGY

)2

dλ.

CLUSTERING RESULTS & STABILITY

0 5 10 15 20 25 30

Standard Deviation in basis points0

5

10

15

20

25

30

35

Num

ber

of

occ

urr

ence

s

Standard Deviations Histogram

Figure 6: (Top) The returns correlation structure ap-pears more clearly using rank correlation; (Bottom)Clusters of returns distributions can be partly describedby the returns volatility

Figure 7: Stability test on Odd/Even trading days sub-sampling: our approach (GNPR) yields more stableclusters with respect to this perturbation than standardapproaches (using Pearson correlation or L2 distances).