35
Sparse Covariance Selection via Robust Maximum Likelihood Estimation O. Banerjee, A. d’Aspremont and L. El Ghaoui ORFE, Princeton University & EECS, U.C. Berkeley Available online at www.princeton.edu/ aspremon 1
• Upload

others
• Category

## Documents

• view

8
• download

0

### Transcript of Sparse Covariance Selection via Robust Maximum Likelihood ...

Sparse Covariance Selection via

Robust Maximum Likelihood Estimation

O. Banerjee, A. d’Aspremont and L. El Ghaoui

ORFE, Princeton University & EECS, U.C. Berkeley

Available online at www.princeton.edu/∼aspremon

1

Introduction

• We estimate a sample covariance matrix Σ from empirical data.

• Objective: infer dependence relationships between variables.

• We also want this information to be as sparse as possible.

• Basic solution: look at the magnitude of the covariance coefficients:

|Σij| > β ⇔ variables i and j are related.

We can do better. . .

2

Covariance Selection

Following Dempster (1972), look for zeros in the inverse matrix instead:

• Parsimony. Suppose that we are estimating a Gaussian density:

f(x, Σ) =

(

1

)

p2(

1

detΣ

)12

exp

(

−1

2xTΣ−1x

)

,

a sparse inverse matrix Σ−1 corresponds to a sparse representation of thedensity f as a member of an exponential family of distributions:

f(x, Σ) = exp(α0 + t(x) + α11t11(x) + . . . + αrstrs(x))

with here tij(x) = xixj and αij = Σ−1ij .

• Dempster (1972) calls Σ−1ij a concentration coefficient.

There is more. . .

3

Covariance Selection

• We have m + 1 observations xi ∈ Rn on n random variables.

• Estimate a sample covariance matrix S such that

S =1

m

m+1∑

i=1

(xi − x)(xi − x)T

• Choose a (symmetric) subset I of index pairs and denote by J theremaining indices so that I ∪ J = N

2.

• Choose a matrix Σ such that:

◦ Σij = Sij for all indices (i, j) in J

◦ Σ−1ij = 0 for all indices (i, j) in I

4

Covariance Selection

Why is this a better choice? Dempster (1972) shows:

• Existence and Uniqueness. If there is a positive semidefinite matrix Σij

satisfying Σij = Sij on J , then there is only one such matrix satisfying

Σ−1ij = 0 on I.

• Maximum Entropy. Among all Gaussian models Σ such that Σij = Sij on

J , the choice Σ−1ij = 0 on I has maximum entropy.

• Maximum Likelihood. Among all Gaussian models Σ such that Σ−1ij = 0

on I, the choice Σij = Sij on J has maximum likelihood.

5

Applications & Related Work

• Gene expression data. The sample data is composed of gene expressionvectors and we want to isolate links in the expression of various genes.See Dobra, Hans, Jones, Nevins, Yao & West (2004), Dobra & West(2004) for example.

• Speech Recognition. See Bilmes (1999), Bilmes (2000) or Chen &Gopinath (1999).

• Finance. Identify links between sectors, etc.

• Related work by Dahl, Roychowdhury & Vandenberghe (2005): interiorpoint methods for large, sparse MLE.

6

Outline

• Introduction

• Robust Maximum Likelihood Estimation

• Algorithms

• Numerical Results

7

Maximum Likelihood Estimation

• We can estimate Σ by solving the following maximum likelihood problem:

maxX∈Sn

log detX − Tr(SX)

• Problem here: How do we make Σ−1 sparse?

• Or, in other words, how do we efficiently choose I and J?

• Solution: penalize the MLE.

8

AIC and BIC

Original solution in Akaike (1973), penalize the likelihood function:

maxX∈Sn

log detX − Tr(SX) − ρCard(X)

where Card(X) is the number of nonzero elements in X .

• ρ = 2/(m + 1) for the Akaike Information Criterion (AIC).

• ρ = log(m+1)(m+1) for the Bayesian Information Criterion (BIC).

Of course, this is a (NP-Hard) combinatorial problem. . .

9

Convex Relaxation

• We can form a convex relaxation of AIC or BIC penalized MLE

maxX∈Sn

log detX − Tr(SX) − ρCard(X)

replacing Card(X) by ‖X‖1 =∑

ij |Xij| to solve

maxX∈Sn

log detX − Tr(SX) − ρ‖X‖1

• This is the classic l1 heuristic, ‖X‖1 is a convex lower bound onCard(X).

• See Fazel, Hindi & Boyd (2000) or d’Aspremont, El Ghaoui, Jordan &Lanckriet (2004) for related applications.

10

Robustness

• This penalized MLE problem can be rewritten:

maxX∈Sn

min|Uij|≤ρ

log detX − Tr((S − U)X)

• This can be interpreted as a robust MLE problem with componentwisenoise of magnitude ρ on the elements of S.

• The relaxed sparsity requirement is equivalent to a robustification.

• See d’Aspremont et al. (2004) for similar results on sparse PCA.

11

Outline

• Introduction

• Robust Maximum Likelihood Estimation

• Algorithms

• Numerical Results

12

Algorithms

• We need to solve:

maxX∈Sn

log detX − Tr(SX) − ρ‖X‖1

• For medium size problems, this can be done using interior point methods.

• In practice, we need to solve very large instances. . .

• The ‖X‖1 penalty implicitly introduce O(n2) linear constraints.

13

Algorithms

Here, we can exploit problem structure

• Our problem here has a particular min-max structure:

maxX∈Sn

min|Uij|≤ρ

log detX − Tr((S − U)X)

• This min-max structure means that we use prox function algorithms byNesterov (2005) (see also Nemirovski (2004)) to solve large, denseproblem instances.

• We also detail a “greedy” block-coordinate descent method with goodempirical performance.

14

Nesterov’s method

Solveminx∈Q1

f(x)

• Starts from a particular min-max model on the problem:

f(x) = f(x) + maxu

{〈Tx, u〉 − φ(u) : u ∈ Q2}

• assuming that:

◦ f is defined over a compact convex set Q1 ⊂ Rn

◦ f(x) is convex, differentiable and has a Lipschitz continuous gradientwith constant M ≥ 0

◦ T ∈ Rn×n

◦ φ(u) is a continuous convex function over some closed compact setQ2 ⊂ Rn.

15

Nesterov’s method

Assuming that a problem can be written according to this min-max model,the algorithm works as follows. . .

• Regularization. Add strongly convex penalty inside the min-maxrepresentation to produce an ǫ-approximation of f with Lipschitzcontinuous gradient (generalized Moreau-Yosida regularization step, seeLemarechal & Sagastizabal (1997) for example).

• Optimal first order minimization. Use optimal first order scheme forLipschitz continuous functions detailed in Nesterov (1983) to the solvethe regularized problem.

Caveat: Only efficient if the subproblems involved in these steps can besolved explicitly or very efficiently. . .

16

Nesterov’s method

• The min-max model makes this an ideal candidate for robust optimization

• For fixed problem size, the number of iterations required to get an ǫsolution is given by

O

(

1

ǫ

)

compared to O(

1ǫ2

)

for generic first-order methods.

• Each iteration has low memory requirements.

• Change in granularity of the solver: larger number of cheaper iterations.

17

Nesterov’s method

• We solve the following (modified) problem:

max{X∈Sn

: αIn�X�βIn}min

{U∈Sn: |Uij|≤ρ}

log detX − Tr((S − U)X)

equivalent to the original problem if α ≤ 1/(‖S‖ + nρ) and β ≥ n/ρ.

• We can write this in the min-max model’s format:

minX∈Q1

maxU∈Q2

f(X) + Tr((TX)U) = minX∈Q1

f(X),

where:

◦ f(X) = − log detX + Tr(SX)

◦ T = ρIn2

◦ Q1 := {X ∈ Sn : αIn � X � βIn}◦ Q2 := {U ∈ Sn : ‖U‖∞ ≤ 1}.

18

Nesterov’s method

Regularization. The objective is first smoothed by penalization, we define:

fǫ(X) := f(X) + maxU∈Q2

Tr(XU) − (ǫ/2D2)d2(U)

which is an ǫ approximation of f where

• the prox function on Q2 is d2(U) = 12 Tr(UTU)

• the constant D2 is given by D2 := maxQ2d2(U) = n2/2.

In this case, this corresponds to a classic Moreau-Yosida regularization of thepenalty ‖X‖1 and the function fǫ has a Lipschitz continuous gradient withconstant:

Lǫ := M + D2ρ2/(2ǫ)

19

Nesterov’s method

Optimal first-order minimization. The minimization algorithm in Nesterov(1983) then involves the following steps:

Choose ǫ > 0 and set X0 = βIn, For k = 0, . . . , N(ǫ) do

1. Compute ∇fǫ(Xk) = −X−1 + Σ + U∗(Xk)

2. FindYk = arg minY {Tr(∇fǫ(Xk)(Y − Xk)) + 1

2Lǫ‖Y − Xk‖2F : Y ∈ Q1}.

3. Find Zk =

arg minX

{

Lǫβ2d1(X) +

∑ki=0

i+12 Tr(∇fǫ(Xi)(X − Xi)) : X ∈ Q1

}

.

4. Update Xk = 2k+3Zk + k+1

k+3Yk.

20

Nesterov’s method

• We have chosen a prox function d1(X) for the set {αIn � X � βIn}:

d1(X) = − log detX + log β

• Step 1 only amount to computing the inverse of X and the (explicit)solution to the regularized subproblem on Q2.

• Steps 2 and 3 are both projections on Q1 and require an eigenvalue

decomposition.

• This means that the total complexity estimate of the method is:

O

(

κ√

n(log κ)

ǫ(4n4αρ + n3

√ǫ)

)

where log κ = log(β/α) bounds the solution’s condition number.

21

Dual block-coordinate descent

• Here we consider the dual of the original problem:

maximize log det(S + U)subject to ‖U‖∞ ≤ ρ

S + U � 0

• The diagonal entries of an optimal U are Uij = ρ.

• We will solve for U column by column, sweeping all the columns.

22

Dual block-coordinate descent

• Let C = S + U be the current iterate, after permutation we can alwaysassume that we optimize over the last column:

maximize log det

(

C11 C12 + uC21 + uT C22

)

subject to ‖u‖∞ ≤ ρ

where C12 is the last column of C (off-diag.).

• Each iteration reduces to a simple box-constrained QP:

minimize uT (C11)−1usubject to ‖u‖∞ ≤ ρ

• We stop when Tr(SX) + ρ‖X‖1 − n ≤ ǫ where X = C−1.

23

Outline

• Introduction

• Robust Maximum Likelihood Estimation

• Algorithms

• Numerical Results

24

Numerical Examples

Generate random examples:

• Take a sparse, random p.s.d. matrix A ∈ Sn

• We add a uniform noise with magnitude σ to its inverse

We then solve the penalized MLE problem (or the modified one):

maxX∈Sn

log detX − Tr(SX) − ρ‖X‖1

and compare the solution with the original matrix A.

25

Numerical Examples

A basic example. . .

Noisy inverse Σ−1 Solution for ρ = σOriginal inverse A

The original inverse covariance matrix A, the noisy inverse Σ−1 and thesolution.

26

−0.5 0 0.5 10

1

2

3

4

5

6

7

log(ρ/σ)

Err

or(in

%)

Average and standard deviation of the percentage of errors (false positives +false negatives) versus log(ρ/σ) on random problems.

27

10−2

10−1

100

101

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5Min. X

ij with A

ij non zero

Mean Xij with A

ij non zero

Max. Xij with A

ij=0

Mean Xij with A

ij=0

ρ/σ

Coeffi

cien

ts V

28

0 20 40 60 80 100 120 140 160 180 2000

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0.5

Coefficients

Mag

nitude

Classification Error. Magnitude of solution coefs associated with nonzerocoefs in A (solid line) and magnitude of solution coefs associated with zerocoefs in A (dashed line). Here n = 100.

29

101

102

103

10−1

100

101

102

103

104

n

CPU

Tim

e(in

seco

nds)

Computing time. CPU time (in seconds) to reach gap of ǫ = 1 versusproblem size n on random problems, solved using Nesterov’s method (stars)and the coordinate descent algorithm (dots).

30

0 5 10 15 20 25 30 3510

−1

100

101

102

CPU Time (in seconds)

Dual

ity

Gap

A

C

B

Computing time. Convergence plot a random problem of size n = 100, thistime comparing Nesterov’s method where ǫ = 5 (solid line A) and ǫ = 1(solid line B) with one sweep of the block-coordinate descent method(dashed line C).

31

32

Conclusion

• A convex relaxation for sparse covariance selection.

• Robustness interpretation.

• Two algorithms for dense large-scale instances.

• Precision requirements? Thresholding? . . .

Slides and software available online at www.princeton.edu/∼aspremon

33

References

Akaike, J. (1973), Information theory and an extension of the maximum likelihood principle,

in B. N. Petrov & F. Csaki, eds, ‘Second international symposium on information

theory’, Akedemiai Kiado, Budapest, pp. 267–281.

Bilmes, J. A. (1999), ‘Natural statistic models for automatic speech recognition’, Ph.D.

thesis, UC Berkeley, Dept. of EECS, CS Division .

Bilmes, J. A. (2000), ‘Factored sparse inverse covariance matrices’, IEEE International

Conference on Acoustics, Speech, and Signal Processing .

Chen, S. S. & Gopinath, R. A. (1999), ‘Model selection in acoustic modeling’,

EUROSPEECH .

Dahl, J., Roychowdhury, V. & Vandenberghe, L. (2005), ‘Maximum likelihood estimation of

gaussian graphical models: numerical implementation and topology selection’, UCLA

preprint .

d’Aspremont, A., El Ghaoui, L., Jordan, M. & Lanckriet, G. R. G. (2004), ‘A direct

formulation for sparse PCA using semidefinite programming’, Advances in Neural

Information Processing Systems 17.

Dempster, A. (1972), ‘Covariance selection’, Biometrics 28, 157–175.

Dobra, A., Hans, C., Jones, B., Nevins, J. J. R., Yao, G. & West, M. (2004), ‘Sparse

graphical models for exploring gene expression data’, Journal of Multivariate Analysis

90(1), 196–212.

34

Dobra, A. & West, M. (2004), ‘Bayesian covariance selection’, working paper .

Fazel, M., Hindi, H. & Boyd, S. (2000), ‘A rank minimization heuristic with application to

minimum order system approximation’, American Control Conference, September 2000 .

Lemarechal, C. & Sagastizabal, C. (1997), ‘Practical aspects of the Moreau-Yosida

regularization: theoretical preliminaries’, SIAM Journal on Optimization 7(2), 367–385.

Nemirovski, A. (2004), ‘Prox-method with rate of convergence o(1/t) for variational

inequalities with Lipschitz continuous monotone operators and smooth convex-concave

saddle-point problems’, SIAM Journal on Optimization 15(1), 229–251.

Nesterov, Y. (1983), ‘A method of solving a convex programming problem with convergence

rate O(1/k2)’, Soviet Math. Dokl. 27(2), 372–376.

Nesterov, Y. (2005), ‘Smooth minimization of nonsmooth functions’, Mathematical

Programming, Series A 103, 127–152.

35