

Maximal Sparsity Representation via l1 Minimization

David L. Donoho∗ and Michael Elad†

August 15, 2002

Abstract

Finding a sparse representation of signals is desired in many applications. For a repre-

sentation dictionary D and a given signal S ∈ span{D}, we are interested in finding the

sparsest vector γ such that Dγ = S. Previous results have shown that if D is composed

of a pair of unitary matrices, then under some restrictions dictated by the nature of the

matrices involved, one can find the sparsest representation using an l1 minimization rather

than using the l0 norm of the required composition. Obviously, such a result is highly de-

sired since it leads to a convex Linear Programming form. In this paper we extend previous

results and prove a similar relationship for the most general dictionary D. We also show

that previous results emerge as special cases of the new extended theory. In addition,

we show that the above results can be markedly improved if an ensemble of such signals is

given, and higher order moments are used.

Keywords: Sparse Representation, Atomic Decomposition, Convex Optimization, Linear

Programming, Basis Pursuit, Matching Pursuit.

∗Department of Statistics, Stanford University, Stanford 94305-9025 CA, USA.
†Department of Computer Science (SCCM), Stanford University, Stanford 94305-9025 CA, USA.

Draft: to be submitted to the IEEE Transactions on Information Theory in September 2002.

1 Introduction

A sparse representation of a signal is a desirable, efficient description that can be used for its analysis or compression [1]. However, far deeper reasons lead to the search for sparse representations of signals. As it turns out, one of the most natural and effective priors in Bayesian theory for signal estimation is the existence of a sparse representation over a suitable dictionary. This prior leans on the assumption that the signal's ground-truth representation is expected to be simple, and thus sparse in some representation space [1]. Indeed, it is sparsity that led to the vast theoretical and applied work in Wavelet theory [1].

More formally, we are given a representation dictionary D defined as a matrix of size [N × L]. We hereby assume that the columns of D, denoted {dk}, k = 1, . . . , L, are normalized, i.e. dk^H dk = 1 for all 1 ≤ k ≤ L. These columns are to be used to represent incoming signals S ∈ span{D} ⊆ C^N. Note that we do not claim any relationship between N and L; in particular, N may be larger than L, implying that the proposed representation space is not complete.

Given a signal vector S, we are interested in finding the sparsest vector γ such that Dγ = S.

This process is commonly referred to as atomic decomposition, since we decompose the signal

S into its building atoms, taken from the dictionary. The emphasis here is on finding such a

decomposition that uses as few atoms as possible. Thus, we resort to the following optimization

problem

(P0) Minimize ‖γ‖0 subject to S = Dγ. (1)

Obviously, two easy-to-solve special cases are the case of a unique solution to Dγ = S and the


case of no feasible solution at all. While both these cases lead to an easy-to-solve (P0), in general, solving (P0) requires a combinatorial search through all the combinations of columns from D, and as such, its complexity grows exponentially with L.
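To make this combinatorial cost concrete, the following is a minimal brute-force sketch (real-valued case; the function name solve_p0 and the tolerance are illustrative assumptions). It enumerates column subsets of growing size and is therefore usable only for tiny dictionaries.

```python
# Brute-force (P0) sketch: try every subset of columns, smallest first.
# Practical only for very small N and L -- the cost is exponential in L.
import itertools
import numpy as np

def solve_p0(D, S, tol=1e-10):
    N, L = D.shape
    for size in range(1, L + 1):                          # growing support size
        for cols in itertools.combinations(range(L), size):
            sub = D[:, cols]
            coef, *_ = np.linalg.lstsq(sub, S, rcond=None)
            if np.linalg.norm(sub @ coef - S) < tol:      # exact representation found
                gamma = np.zeros(L)
                gamma[list(cols)] = coef
                return gamma                              # a sparsest representation
    return None                                           # S is not in span{D}
```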

Thus, we are interested either in an approximation of the (P0) solution or, better yet, in a numerical shortcut leading to its exact solution. Matching Pursuit (MP) [1, 2] and Basis Pursuit (BP) [3] are two different methods to achieve the required simplifying goal. In MP and related algorithms, a sequential sub-optimal representation is sought using a greedy algorithm. As such, this family of algorithms leads to an approximation of the (P0) solution.
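A minimal sketch of the greedy idea (real-valued case; the fixed iteration count is an illustrative stopping rule, and practical variants stop on the residual norm):

```python
# Greedy matching pursuit sketch: repeatedly pick the atom best correlated
# with the current residual and peel off its contribution.
import numpy as np

def matching_pursuit(D, S, n_iter=10):
    residual = np.array(S, dtype=float)
    gamma = np.zeros(D.shape[1])
    for _ in range(n_iter):
        correlations = D.T @ residual            # atoms are assumed unit-norm
        k = int(np.argmax(np.abs(correlations))) # best-matching atom
        gamma[k] += correlations[k]
        residual = residual - correlations[k] * D[:, k]
    return gamma
```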

A numerically more complicated approach, which in some cases leads to the exact solution of

(P0), is the BP algorithm. BP suggests solving (P0) by replacing it with a related (P1) problem

defined by

(P1) Minimize ‖γ‖1 subject to S = Dγ. (2)

As can be seen, the l0 penalty is replaced by an l1 norm (sum of absolute values). As such, (P1) is a convex programming problem, implying that we expect no local-minima problems in its numerical solution. Actually, a well-known result from optimization theory shows that an l1 minimization can be solved using a Linear Programming procedure [3, 4, 5]. Recent results in numerical optimization and the introduction of interior point methods turn the above-described problem into a practically solvable one, even for very large dictionaries.
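The standard reduction of (P1) to a Linear Program splits γ into its positive and negative parts. A minimal sketch, assuming real-valued data (the function name basis_pursuit and the random example are illustrative):

```python
# (P1) as a Linear Program: write gamma = u - v with u, v >= 0, so that
# ||gamma||_1 = 1^T u + 1^T v and the constraint D(u - v) = S stays linear.
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(D, S):
    N, L = D.shape
    c = np.ones(2 * L)                           # objective: sum of u and v entries
    A_eq = np.hstack([D, -D])                    # D u - D v = S
    res = linprog(c, A_eq=A_eq, b_eq=S, bounds=(0, None), method="highs")
    return res.x[:L] - res.x[L:]                 # recovered gamma

# Tiny usage example with a random normalized dictionary and a 3-sparse signal.
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)                   # normalize the atoms
gamma_true = np.zeros(50)
gamma_true[[3, 17, 42]] = [1.0, -2.0, 0.5]
gamma_hat = basis_pursuit(D, D @ gamma_true)
```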

A most interesting and surprising result due to Donoho and Huo [6] is that the solution of

(P1) in some cases coincides with the (P0) one. Donoho and Huo assumed a specific structure


of D, built by concatenating two unitary matrices Φ and Ψ, each of size N × N, so that L = 2N. For this specific dictionary form, they developed conditions for the equivalence

between the (P0) and (P1) solutions. These conditions were expressed in terms of the involved

dictionary D (and actually, more accurately, in terms of Φ and Ψ). Later these conditions were

improved by Elad and Bruckstein to show that the equivalence is actually true for a wider class

of signals [7, 8].

In this paper we further extend the results in [6, 7, 8], and prove a (P0)-(P1) equivalence for

the most general form of dictionaries. In order to prove this equivalence we address two questions

for a given dictionary D and signal S:

1. Uniqueness: Having solved the (P1) problem, under which conditions can we guarantee that this is also the (P0) solution? This question is answered by generalizing the

uniqueness Theorem in [6, 7, 8].

2. Equivalence: Knowing the solution of the (P0) problem (or actually, knowing its l0 norm),

what are the conditions under which (P1) is guaranteed to lead to the exact same solution?

This question is answered by generalizing the equivalence Theorem in [6, 7, 8].

The proposed analysis adopts a totally new line of reasoning, compared to the work done in

[6, 7, 8], and yet, we show that all previous results emerge as special cases of this new analysis.

So far, atomic decomposition has been targeted at a single given vector, finding the limitations of using (P1) instead of (P0) in order to decompose it into its building atoms taken from the dictionary D. This is the problem solved in [6, 7, 8] and in this paper too. An interesting extension of the above results corresponds to a source generating an ensemble of random


sparse-representation signals from the same dictionary using the same stationary random rule. The questions raised are whether there is something to gain from the given multiple signals, and if so, how. As it turns out, the use of higher moments leads in this case to a similar (P0) formulation, and again a similar (P1) form comes to replace it as a tractable alternative. We show that, indeed, similar relations between (P0) and (P1) hold, with far weaker conditions due to the increased dimensionality, implying that fewer restrictions are posed to guarantee the desired (P0)-(P1) equivalence.

This paper is organized as follows: In the next section we briefly repeat the main results found

in [6, 7, 8] on the uniqueness and equivalence Theorems for the two-unitary matrices dictionary.

Section 3 then extends the uniqueness Theorem for an arbitrary dictionary. Section 4 similarly

extends the equivalence results to general-form dictionaries. The idea of using an ensemble of signals

and higher moments for accurate sparse decomposition is covered in Section 5. We summarize

and draw future research directions in Section 6.

2 Previous Results

As was said before, previous results refer to the special case where the dictionary is built by

concatenating two unitary matrices, Φ and Ψ of size N ×N each, giving D = [Φ,Ψ]. We define

φi and ψj (1 ≤ i, j ≤ N) as the columns of these two unitary matrices. Following [6] we define a

real-positive scalar M representing the cross-correlation between these two bases by

M = Sup{|〈φi, ψj〉|, ∀1 ≤ i, j ≤ N}.


Thus, given the two matrices Φ and Ψ, M can be computed, and it is easy to show [6, 8] that

1/√N ≤M ≤ 1. The lower bound is obtained for a pair such as spikes and sines [6] or Identity

and Hadamard matrices [8]. The upper bound is obtained if at least one of the vectors in Φ is

also found in Ψ. Using this definition for M , the following Theorem states the requirement on

a given representation such that it is guaranteed to be the solution of the (P0) problem:

Theorem 1 - Uniqueness: Given a dictionary D = [Φ,Ψ], given its corresponding cross-

correlation scalar value M as defined above, and given a signal S, a representation of

the signal by S = Dγ is necessarily the sparsest one possible if ‖γ‖0 < 1/M .

This Theorem's proof is given in [7, 8]. A somewhat weaker version of it requiring ‖γ‖0 < 0.5(1 + 1/M) is proven in [6].

Thus, having solved (P1) for the incoming signal, we measure the l0 norm of its solution, and if it is sparse enough (below 1/M), we conclude that this is also the (P0) solution. For the best cases where 1/M = √N, we get that the requirement translates into ‖γ‖0 < √N.
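A quick numerical illustration of this best case (a sketch, using the Identity/Hadamard pair mentioned above; the choice N = 64 and the use of scipy's hadamard helper are assumptions):

```python
# Cross-correlation M for the Identity/normalized-Hadamard pair: M = 1/sqrt(N).
import numpy as np
from scipy.linalg import hadamard

N = 64                                  # a power of two, so a Hadamard matrix exists
Phi = np.eye(N)
Psi = hadamard(N) / np.sqrt(N)          # unitary Hadamard basis
M = np.max(np.abs(Phi.T @ Psi))         # sup over |<phi_i, psi_j>|
print(M, 1 / np.sqrt(N))                # both print 0.125
```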

The above Theorem by itself is not sufficient because nothing is claimed about why (P1) should lead to the (P0) solution in the first place. All we know at the moment is that if, by coincidence, we get a sparse solution out of (P1), then we can claim it is the desired (P0) solution. The next Theorem closes this gap:

Theorem 2 - Equivalence: Given a dictionary D = [Φ,Ψ], given its corresponding cross-correlation scalar value M as defined above, and given a signal S, if there exists a sparse representation satisfying ‖γ‖0 < (√2 − 0.5)/M, then the (P1) solution necessarily finds it.


This Theorem’s proof is given in [7, 8]. Again, a somewhat weaker version of this Theorem

requiring ‖γ‖0 < 0.5(1 + 1/M) is proven in [6]. Note that there is an uncomfortable gap between

the above two Theorems. In a later work, Feuer and Nemirovsky managed to show that this gap

is indeed unbridgeable [12], by proving that the bound in the above theorem is tight.

To summarize, we are encouraged to solve the (P1) problem because it is guaranteed to lead

to the sparsest possible solution, provided that it is sparse enough to begin with. All this

corresponds to the limited case of dictionaries built from two unitary matrices. An attempt to

extend the two Theorems to non-unitary but still square and non-singular matrices Φ and Ψ

was proposed already in [6] using a different definition for M :

M = Sup{ |[Φ^{-1}Ψ]i,j| , |[Ψ^{-1}Φ]i,j| , ∀ 1 ≤ i, j ≤ N }.

As it turns out, using this definition, the above two Theorems hold as well. However, this new

definition may lead to values of M above 1, rendering the above Theorems useless. In the next

two sections we shall show an alternative treatment which overcomes this difficulty, and does

that while making no assumption on the structure of the dictionary D.

3 A Uniqueness Theorem for Arbitrary Dictionaries

We are given a dictionary D defined as a matrix of size [N × L], where its columns {dk}, k = 1, . . . , L, are normalized. An incoming signal vector S is to be represented using the dictionary by S = Dγ.


Assume that we found two different suitable representations, i.e.

∃ γ1 ≠ γ2 | S = Dγ1 = Dγ2. (3)

Thus we have that the difference of the representation vectors, δγ = γ1 − γ2, must be in the null space of the dictionary, namely D(γ1 − γ2) = Dδγ = 0. This implies that some group of

columns from D is required to be linearly dependent. It is clear that any N + 1 such vectors are

by definition linearly dependent, so obviously we are expecting a number smaller than that. In

order to proceed with this analysis, we need to define the notion of the Matrix Spark, which is similar in nature to the matrix Rank.

Definition - Spark: Given a matrix A we define the non-negative integer σ = Spark(A) as the

largest possible number such that every sub-group of σ columns from A is linearly independent, and at least one sub-group of σ + 1 columns of A is linearly dependent.

Clearly, if we assume that there are no zero columns in A, we have that σ ≥ 1, and equality

is obtained if there are two columns from A that are linearly dependent. Note that A could be

full rank and yet σ = 1. At the other extreme we get that σ ≤ Min{L,Rank{A}}.
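A brute-force sketch of this definition (real-valued case; the combinatorial sweep is exactly what the bounds developed below are meant to avoid, and the rank tolerance is an assumption):

```python
# Spark by exhaustive search, following the definition above: the largest sigma
# such that every sigma-subset of columns is linearly independent.
import itertools
import numpy as np

def spark(A, tol=1e-10):
    N, L = A.shape
    for size in range(1, min(N, L) + 1):
        for cols in itertools.combinations(range(L), size):
            if np.linalg.matrix_rank(A[:, cols], tol=tol) < size:
                return size - 1                  # a dependent subset of this size exists
    return min(N, L)                             # no dependent subset found up to min(N, L)
```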

As an interesting example, let us consider the case where D = [Φ,Ψ], Φ = IN (identity matrix

of size N × N), and Ψ = FN (the Discrete Fourier Transform matrix of size N × N). Then,

based on Laplace's Theorem [6] we have that a given train-of-spikes with √N spacing in one basis transforms to the exact train of exponents in the Fourier domain. Thus, there exists a group of 2√N vectors in this D that is linearly dependent, and therefore in this case we can say that Spark{D} < 2√N. As we shall see later, for this case we get that Spark{D} = 2√N − 1.


At the moment let us assume that the Spark is computed simply by sweeping through all the

possible combinations of columns of the matrix A and testing for linear dependence. Later we

shall show that the Spark can be bounded both from above and below. Having found the Spark

of our dictionary, σ = Spark(D), then obviously we require that every vector in its column null-space should satisfy

Dδγ = 0 =⇒ ‖γ1 − γ2‖0 = ‖δγ‖0 > σ. (4)

On the other hand we have that ‖γ1 − γ2‖0 ≤ ‖γ1‖0 + ‖γ2‖0. Thus, we get that for the two arbitrary representations found, the following must hold true

‖γ1‖0 + ‖γ2‖0 > σ. (5)

This inequality can be interpreted as an uncertainty law:

Theorem 3 - Uncertainty Law: Given a dictionary D and given its corresponding Spark

value σ, for every non-zero signal S and every pair of different representations of it, i.e., S = Dγ1 = Dγ2, we have that the sparsity of the two representations together must be above σ, as in Equation (5).

An immediate consequence of this result is a new general uniqueness Theorem. Using Equation (5), if there exists a representation satisfying ‖γ1‖0 ≤ σ/2, then necessarily, due to the above Theorem, any other representation γ2 of this signal must satisfy ‖γ2‖0 > σ/2, implying that γ1 is the sparsest possible one.


Theorem 4 - New Uniqueness: Given a dictionary D, given its corresponding Spark value

σ, and given a signal S, a representation of the signal by S = Dγ is necessarily the sparsest one

possible if ‖γ‖0 ≤ σ/2.

The obvious question at this point is what is the relationship between the M defined in previous results and the newly defined notion of the Spark of a matrix. In order to explain this relationship we present the following analysis on bounding the Spark. Note that the proposed bounds are important not only for relating the new results to the previous ones, but also because we need methods to approximate the Spark and replace the impossible sweep through all the column combinations by a tractable and computationally reasonable method for computing the Spark.

3.1 Bounding the Spark from Below

Let us build the Gramian matrix of our dictionary, G = D^H D. Obviously, every entry in G is an inner product of a pair of columns from the dictionary, the main diagonal contains exact '1'-s due to the normalization of D's columns, and all the entries outside the main diagonal are, in the general case, complex values with magnitude equal to or smaller than 1.

If the Spark is known to be σ, it implies that any leading minor of G of size σ × σ must be

positive definite [9]. This reasoning works the other way around, namely, if every σ × σ leading

minor is guaranteed to be positive definite, then obviously the Spark is at least σ. The problem

is we do not want to sweep through all combinations of columns from D, nor do we want to

check every possible σ × σ leading minor of the Gramian matrix.

Instead, we use the well known Gersgorin Disk Theorem [9], or better yet, its special case


property that claims that every strictly diagonally dominant matrix must be positive definite.

A matrix is strictly diagonally dominant if, for each row, the sum of absolute values of the off-diagonal entries is strictly smaller than the main diagonal entry. In our case, if for every

σ× σ leading minor we get a strictly diagonally dominant matrix, then obviously these matrices

are positive definite, and thus the Spark of D is at least σ.

Using the above rule, let us search for the most problematic set of column vectors from the

dictionary. By problematic we refer to the set that tends to create the smallest possible diagonally

non-dominant leading minor. Thus, if we take the Gramian matrix G and perform a decreasing

rearrangement of its absolute entries in each row, we should get that the first column has all ’1’-s

(taken from the main diagonal), and as we observe the entries from left to right in each row we

are expecting to see a decrease in magnitude. Computing the cumulative sum along each such row, excluding the first entry, let us define Pk as the smallest number of entries in the kth row whose cumulative sum exceeds 1. Assume that we computed P = Min_{1≤k≤L} Pk. Then, clearly, every

leading minor of size P × P must be diagonally dominant by definition. Moreover, using minors

of size (P + 1)× (P + 1), at least one of them is expected to be diagonally non-dominant. Thus,

P is a lower bound on the actual Spark of D. The process we have just described is exactly the

method to find the ’most problematic’ set of columns from D, and this way bound the Matrix’s

Spark. To summarize, we have the following Theorem:

Theorem 5 - Lower-Bound on the Spark: Given the dictionary D and its corresponding

Gramian matrix G = D^H D, apply the following stages:

1. Perform a decreasing rearrangement of |G| in each row,


2. Compute the cumulative sum per each such row excluding the first entry,

3. Compute Pk, the smallest number of entries in the kth row whose cumulative sum exceeds 1, and

4. Find P = Min_{1≤k≤L} Pk.

Then, σ = Spark(D) ≥ P .
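A sketch of the four stages of Theorem 5 (assuming unit-norm columns; here Pk is taken as the smallest number of largest off-diagonal magnitudes whose running sum reaches 1, a conservative reading of stage 3 that guarantees strict diagonal dominance of every smaller minor):

```python
# Lower bound on the Spark via the Gramian (Theorem 5), assuming unit-norm columns.
import numpy as np

def spark_lower_bound(D):
    G = np.abs(D.conj().T @ D)                    # |Gramian|; the diagonal is all 1's
    L = G.shape[0]
    P = L
    for k in range(L):
        row = np.sort(G[k])[::-1]                 # stage 1: decreasing rearrangement
        csum = np.cumsum(row[1:])                 # stage 2: cumulative sum, diagonal excluded
        idx = np.searchsorted(csum, 1.0)          # stage 3: first place the sum reaches 1
        P_k = idx + 1 if idx < len(csum) else L
        P = min(P, P_k)                           # stage 4: P = min over k of P_k
    return P                                      # Spark(D) >= P
```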

A special case of interest to us is obtained if we simply assume that ∀ i ≠ j, |Gi,j| ≤ M. Note the resemblance of this M to the M defined in Section 2 and originally in [6]. Then, clearly, for an arbitrary leading minor of size (P + 1) × (P + 1), we should require PM ≥ 1 in order to allow it to become a diagonally non-dominant matrix. Thus we get σ = Spark(D) ≥ P ≥ 1/M. Hence,

Theorem 6 - Lower-Bound on the Spark (special case 1): Given the dictionary D and

its corresponding Gramian matrix G = D^H D, define M as an upper bound on the off-diagonal entries of G, i.e., ∀ i ≠ j, |Gi,j| ≤ M. Then, the following relationship holds: σ = Spark(D) ≥ 1/M.

Using this new simple bound and plugging it into Theorem 4, we get exactly the uniqueness Theorem as stated by Donoho and Huo [6], namely, if a proposed representation has fewer than 0.5(1 + 1/M) non-zeros, it is necessarily the sparsest one possible. Note that, although they look different, the requirements ‖γ‖0 < 0.5(1 + 1/M) and ‖γ‖0 ≤ 0.5/M are equivalent since ‖γ‖0 is an integer.

As an example for this last result, if we return to the special case where D = [Φ,Ψ], Φ = IN

and Ψ = FN, then we have that M = 1/√N, and thus σ = Spark(D) ≥ √N. Remember that we claimed that for this case σ = 2√N − 1, so clearly the new bound should be improved.


An interesting question that emerges is why didn’t we get the better 1/M result as stated

in [7, 8] and as appears in Theorem 1? Answering this question may lead to the improvement

we mentioned for the above example as well. It turns out that if we plug in the fact that the

dictionary is built of two unitary matrices D = [Φ,Ψ], the Gramian matrix contains many zeros

corresponding to the orthogonality of the columns in Φ and Ψ, and in this case the bound can be

further improved. In an attempt to compose the ’worst’ (P + 1) × (P + 1) leading minor, it is

not hard to see that we need to take half of the vectors from Φ and half from Ψ (and let us

conveniently assume that P is odd). Then we get that in each row there are (P − 1)/2 exact

zeros, (P + 1)/2 values assumed to be below or equal to M in their absolute value, and a single '1' corresponding to the main diagonal. Thus, in this special case we get that diagonal non-dominance is achieved for (P + 1)/2 · M ≥ 1, leading to P ≥ 2/M − 1. Using Theorem 5 we

get σ = Spark(D) ≥ P ≥ 2/M − 1. Thus,

Theorem 7 - Lower-Bound on the Spark (special case 2): Given the specific dictio-

nary D = [Φ,Ψ] where Φ and Ψ are both unitary N × N matrices, and assuming that for

its corresponding Gramian matrix G = D^H D we define ∀ i ≠ j, |Gi,j| ≤ M, we have that

σ = Spark(D) ≥ 2/M − 1.

Note that, again, using this result and plugging it into Theorem 4, we get exactly the result in Theorem 1 as taken from [7, 8] (and again the difference in appearance is caused by replacing the [≥] sign by a [>] one). Returning again to the example with D = [IN, FN], we know that M = 1/√N and thus σ = Spark(D) ≥ 2√N − 1. On the other hand, we have seen that there exists a set of 2√N columns that are linearly dependent, and therefore we conclude that σ = 2√N − 1 in this case.


So, in this example we got that the proposed lower bound on the Spark in Theorem 7 is actually

a tight bound.

Returning back to the general dictionary form, we should ask how close the found bound is to the actual Spark. As it turns out, this bound is rather loose, which typically means that σ = Spark(D) ≫ 1/M. This gap is not surprising if we bear in mind that requiring diagonal dominance for positive definiteness is a highly restrictive approach, and it is commonly known that Gersgorin disks give far too loose bounds on eigenvalue locations [9]. This is why we turn to methods to bound the Spark from above, as discussed hereafter.

3.2 Bounding the Spark from Above

Let us propose a presumably practical method for finding the matrix Spark. Define the following

sequence of optimization problems for k = 1, 2, ..., L:

(Rk) Minimize ‖U‖0 subject to DU = 0 and Uk = 1. (6)

If the Spark value σ is achieved by a set of columns from D containing the first column, then the solution of (R1) necessarily satisfies Min ‖U‖0 = σ. Similarly, by sweeping through k = 1, 2, ..., L we guarantee that the solution with the minimal l0 norm necessarily gives the exact matrix Spark. That is to say, if we denote the solution of the (Rk) problem as U_k^opt, then we get

σ = Spark(D) = Min_{1≤k≤L} ‖U_k^opt‖0. (7)


However, as we know by now, minimization of the l0 norm is notoriously hard. Thus, in the spirit of the Basis Pursuit approach discussed in this paper, let us replace the minimization of the l0 norm by the more convenient l1 norm. Thus, we define the sequence of optimization problems for k = 1, 2, ..., L:

(Qk) Minimize ‖V‖1 subject to DV = 0 and Vk = 1. (8)

This time we have a set of convex programming problems that we can solve using a Linear Programming solver. Let us define the solution of the (Qk) problem as V_k^opt. Then, clearly,

‖U_k^opt‖0 ≤ ‖V_k^opt‖0 =⇒ σ = Spark(D) ≤ Min_{1≤k≤L} ‖V_k^opt‖0. (9)

So, let us recap the above discussion into the following new Theorem on bounding the Spark:

Theorem 8 - Upper-Bound on the Spark: Given the dictionary D, apply the following

stages:

1. Solve the sequence of L optimization problems defined as (Qk), and denote their corresponding solutions by V_k^opt,

2. Compute the l0 norm of the found solutions V_k^opt and find the smallest one, denoted as ‖V‖0^min.

Then, σ = Spark(D) ≤ ‖V‖0^min.
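A sketch of Theorem 8 in the real-valued case, solving each (Qk) as a Linear Program (the split V = u − v and the nonzero-counting threshold are assumptions of this illustration):

```python
# Upper bound on the Spark: solve the L problems (Q_k) with an LP and keep the
# smallest l0 norm among their solutions.
import numpy as np
from scipy.optimize import linprog

def spark_upper_bound(D, tol=1e-8):
    N, L = D.shape
    best = np.inf
    for k in range(L):
        c = np.ones(2 * L)                        # ||V||_1 with V = u - v, u, v >= 0
        pin = np.zeros(2 * L)
        pin[k], pin[L + k] = 1.0, -1.0            # enforces V_k = u_k - v_k = 1
        A_eq = np.vstack([np.hstack([D, -D]), pin])
        b_eq = np.concatenate([np.zeros(N), [1.0]])
        res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
        if res.success:
            V = res.x[:L] - res.x[L:]
            best = min(best, int(np.sum(np.abs(V) > tol)))
    return best                                   # Spark(D) <= best
```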

It is interesting to note that in order to make a statement regarding the relation between (P0)

and (P1) solutions with a dictionary D of size N × L, we find ourselves required to use the (P0)


and (P1) relation on dictionaries of size N × (L − 1). Is there some sort of recursiveness that

could be exploited? We leave this as an open question.

4 An Equivalence Theorem for Arbitrary Dictionaries

In the previous Section we focused on extending the uniqueness Theorem to an arbitrary dictionary D. Here we similarly extend the equivalence Theorem from [6, 8] to such general-form dictionaries. So, the question we focus on now is: if a signal S has a sparse representation γ in

dictionaries. So, the question we focus on now is: if a signal S has a sparse representation γ in

the dictionary D, what are the conditions such that solving the (P1) optimization problem leads

to this solution as well?

Assume that the sparsest representation is found and denoted as γ0. We have that Dγ0 = S. Assume also that a second representation γ1 is found, i.e. Dγ1 = S, and obviously ‖γ1‖0 > ‖γ0‖0. In order for (P1) to lead to the γ0 solution, we must have that ‖γ1‖1 ≥ ‖γ0‖1, that is to say, we need to get that the sparsest solution γ0 is also "shortest" in the l1 metric. In addition, we should require that Dγ0 = S = Dγ1, or differently, D(γ0 − γ1) = Dx = 0. So, let us define the following optimization problem:

Minimize ‖γ1‖1 − ‖γ0‖1 = ∑_{k=1}^{L} |γ0(k) + x(k)| − ∑_{k=1}^{L} |γ0(k)| Subject to Dx = 0 (10)

and if the value of the penalty function at the minimum is positive, it implies that the l1 norm of the denser representation is higher than that of the sparse solution. This in turn means that the (P1) problem leads to the sparsest solution γ0, as required.

Since the optimization problem in Equation (10) is difficult to work with, following [6, 8], we


perform several simplification stages, while guaranteeing that the minimum value of the penalty

function only gets smaller. First, we split the summations over the on-support and the off-support of the sparse representation γ0:

∑_{k=1}^{L} |γ0(k) + x(k)| − ∑_{k=1}^{L} |γ0(k)| = ∑_{off support of γ0} |x(k)| + ∑_{on support of γ0} (|γ0(k) + x(k)| − |γ0(k)|).

Using |v +m| ≥ |v| − |m| we have that |v +m| − |v| ≥ |v| − |m| − |v| = −|m| and thus

∑_{off support of γ0} |x(k)| + ∑_{on support of γ0} (|γ0(k) + x(k)| − |γ0(k)|) ≥ ∑_{off support of γ0} |x(k)| − ∑_{on support of γ0} |x(k)| = ∑_{k=1}^{L} |x(k)| − 2 · ∑_{on support of γ0} |x(k)|.

So, if we replace the optimization problem in Equation (10) with

Minimize ∑_{k=1}^{L} |x(k)| − 2 · ∑_{on support of γ0} |x(k)| Subject to Dx = 0, (11)

then obviously, the minimum value of the penalty function is expected to be lower, and if it is

still above zero, it implies that solving (P1) is going to lead to the proper sparsest solution.

Following [8], we add a constraint in order to avoid the trivial solution x = 0, which corresponds

to the case where the two representations are the same. The new constraint 1^T |x| = 1 implies


that the sum of absolute entries in x is required to be 1. Thus, Equation (11) becomes

Minimize 1 − 2 · 1_{γ0}^T |x| Subject to Dx = 0 and 1^T |x| = 1. (12)

Note that in the new formulation we used a slightly different notation. The vector 1_{γ0} is a binary vector of length L obtained by putting '1'-s and '0'-s where the condition γ0 ≠ 0 holds true or false, respectively.

Looking at Equation (12), we see that both x and |x| appear in it, and this complicates its

solution. Let us replace the constraint Dx = 0 with a weaker requirement that is posed on the

vector |x|. If the feasible solution is required to satisfy Dx = 0, then clearly it must also satisfy

the weaker condition D^H Dx = Gx = 0, where G is the Gramian matrix we have already used

in the previous Section to bound the Spark. Thus we have that

Gx = 0 =⇒ (G − I + I)x = 0 =⇒ −x = (G − I)x =⇒ |x| = |(G − I)x| ≤ |G − I| · |x|. (13)

The matrix (G − I) is the Gramian matrix with its main diagonal nulled to zero. If we take the new constraint and plug it back into Equation (12) instead of the original Dx = 0, we get

Minimize 1 − 2 · 1_{γ0}^T z Subject to {|G − I| · z ≥ z & 1^T z = 1 & z ≥ 0}. (14)

Note that we have defined z = |x|.

In order to further simplify the problem and come up with simple requirements for this opti-


mization problem to give a positive value at the minimum, we assume that the off-diagonal entries in G satisfy ∀ i ≠ j, |Gi,j| ≤ M. Thus the constraint |G − I| · z ≥ z is replaced by

z ≤ |G − I| z ≤ M · (1 − I)z,

where I is the L × L identity matrix and 1 is an L × L matrix with all entries equal to 1. Using the fact that 1z = 1 · 1^T z = 1 (since 1^T z = 1), we get that the constraint becomes

z ≤ M · (1 − I)z = M1 − Mz =⇒ z ≤ (M / (1 + M)) · 1.

Going back to our minimization task as written in Equation (14), we have that the minimum value is obtained by assuming that on the support of γ0 all the z(k) values are exactly z(k) = M/(1 + M), and then the penalty term becomes

1 − 2 · 1_{γ0}^T z = 1 − (2M / (1 + M)) · ‖γ0‖0,

and therefore we require

1 − (2M / (1 + M)) · ‖γ0‖0 ≥ 0 =⇒ ‖γ0‖0 ≤ (1/2)(1 + 1/M).

To summarize, we have the following result:

Theorem 9 - New Equivalence: Given a dictionary D, given its corresponding M value

extracted from the Gramian matrix G = D^H D, and given a signal S, if the sparsest representation


of the signal by S = Dγ0 satisfies ‖γ0‖0 ≤ 0.5(1 + 1/M), then the (P1) solution is guaranteed to

find this sparse representation.

The above Theorem poses the same requirement as the one posed by Donoho and Huo [6] in

their equivalence Theorem. However, there is a basic and major difference between these

two results. The new result does not assume any structure on the dictionary, whereas Donoho

and Huo assumed the two-unitary matrices dictionary.
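As a small numerical sanity check of this bound, consider the following sketch (the Identity/Hadamard pair with N = 64 is an illustrative choice, giving M = 1/8, so any representation with at most 4 atoms satisfies the condition):

```python
# Verify Theorem 9 numerically for D = [I, Hadamard/sqrt(N)]: a 4-sparse gamma0
# satisfies ||gamma0||_0 <= 0.5*(1 + 1/M) = 4.5, so (P1) should recover it exactly.
import numpy as np
from scipy.linalg import hadamard
from scipy.optimize import linprog

N = 64
D = np.hstack([np.eye(N), hadamard(N) / np.sqrt(N)])
L = D.shape[1]
gamma0 = np.zeros(L)
gamma0[[3, 70, 90, 127]] = [1.0, -0.5, 2.0, 0.3]
S = D @ gamma0

res = linprog(np.ones(2 * L), A_eq=np.hstack([D, -D]), b_eq=S,
              bounds=(0, None), method="highs")
gamma_hat = res.x[:L] - res.x[L:]
print(np.allclose(gamma_hat, gamma0, atol=1e-6))   # expected: True
```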

As in the uniqueness case, if we assume that the dictionary is composed of two unitary matrices

concatenated together, then this may lead to an improvement of the bound to 1/M . We skip

this analysis because major sections of it are exactly the same ones as described in [8].

A question that remains unanswered at this stage is whether we can prove a more general

equivalence result without using M , but rather using the notion of Spark of the dictionary.

5 Sparse Representation of a Random Ensemble of Signals

So far we assumed that only one signal is given to us, and we are to seek its decomposition into a

set of building atoms using our knowledge that it is a sparse linear combination of the dictionary

columns. Assume now that the source generating this signal is activated infinitely many times,

creating this way an infinite sequence of signals {Sk}_{k=1}^∞. If we assume that the source uses

the same probabilistic law for generating the representation coefficients in all instances, then

this should somehow serve to improve our capability to decompose the incoming signals into

their building atoms.

More specifically, let us assume that the source draws the representation coefficients for all these


instances {Sk}_{k=1}^∞ = {Dγk}_{k=1}^∞ from the same L distribution laws {Pj(α)}_{j=1}^L, independently and identically across instances. Thus, each coefficient is generated using a different statistical rule. We further assume that F of the coefficients are capable of having non-zero values, and the remaining L − F coefficients are exactly zero with probability 1. Thus, we may claim for each signal that it is a sparse composition of atoms of the same F elements, but with varying coefficients due to the probabilities {Pj(α)}_{j=1}^L.

Clearly, given the signal S1, we can apply previous results and seek its sparse representation using the (P1) solution, provided that F, the number of non-zero entries in the representation, is low enough. If another signal S2 is given as well, apart from our knowledge that it also has a sparse representation, we know that the non-zeros in the two representations are expected to appear in the same locations. This is new, powerful knowledge that we seek to exploit. Here we use higher moments to achieve this gain, and thus we need an infinitely long sequence of signals. Let us define the mean and variance values for each representation coefficient as

{ mj = ∫_α α · Pj(α) dα ,  σ_j^2 = ∫_α (α − mj)^2 · Pj(α) dα }_{j=1}^{L}. (15)

Thus, the mean and covariance of the representation vector γ are given by

E{γ} = [ m1 m2 . . . mL ]^T = M,


E{γ γ^H} = diag(σ_1^2, σ_2^2, . . . , σ_L^2) + M · M^H = Σ + M · M^H. (16)

Using these definitions, the mean and covariance of the incoming signals are given by

E{S} = E{Dγ} = D · E{γ} = D · M,

E{S S^H} = E{D γ γ^H D^H} = D · E{γ γ^H} · D^H = D · (Σ + M · M^H) · D^H.

Since we gathered an infinitely long sequence of signals {Sk}_{k=1}^∞, and since the process described here is ergodic, we may obtain the above mean-covariance pair by computing the estimates

E{S} = D · M = lim_{n→∞} (1/n) ∑_{k=1}^{n} Sk,

E{S S^H} = D · (Σ + M · M^H) · D^H = lim_{n→∞} (1/n) ∑_{k=1}^{n} Sk Sk^H.

Now recall that, according to our assumptions, only F of the diagonal entries of Σ are expected to be non-zero, and all remaining L − F are supposed to be exactly zero. Thus, after the removal of the rank-one mean term D · M · M^H · D^H (based on the estimated first moment), the covariance matrix of the signal is expected to be of rank F exactly. This, by the way, leads to a good cleaning method for the estimated covariance in cases of insufficient measurements, i.e., by applying an SVD to the estimated DΣD^H and nulling all but the F largest singular values [10].
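A short sketch of this estimation-and-cleaning step (real-valued case; the ensemble array shape and the value of F are illustrative assumptions, and a finite ensemble is exactly the insufficient-measurements situation just mentioned):

```python
# Estimate the second moment from an ensemble of signals, remove the rank-one
# mean term, and keep only the F largest singular values of the result.
import numpy as np

def estimate_rank_f_covariance(signals, F):
    """signals: array of shape (n, N); returns a rank-F estimate of D Sigma D^H."""
    mean = signals.mean(axis=0)                   # estimates D * M
    C = signals.T @ signals / len(signals)        # estimates D (Sigma + M M^H) D^H
    C = C - np.outer(mean, mean)                  # remove the rank-one mean term
    U, s, Vh = np.linalg.svd(C)
    s[F:] = 0.0                                   # null all but the F largest singular values
    return (U * s) @ Vh
```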

So, at this stage we have reached the point where, through the incoming signals, we are capable of


computing the rank-F matrix DΣD^H. A slightly different way to write this matrix is

DΣD^H = ∑_{k=1}^{L} σ_k^2 · (dk dk^H). (17)

However, our goal is to find the building atoms of the measured signals, i.e. the indices where the σj are non-zero. Whereas SVD can reduce the matrix rank [10], it cannot map the remaining F rank-1 terms to the columns of the dictionary, and thus cannot be used to solve our problem. Instead, we suggest the following optimization problem

Minimize ‖α‖0 subject to DΣD^H = ∑_{k=1}^{L} αk · (dk dk^H). (18)

Since the sparsest solution has exactly F non-zeros, we expect to obtain this result from the above problem. As before, replacing this l0 optimization problem by an l1 one, we should instead solve

Minimize ‖α‖1 subject to DΣD^H = ∑_{k=1}^{L} αk · (dk dk^H), (19)

and all the previous Theorems on uniqueness and equivalence hold as well. Note that we refer to the rank-1 terms as vectors rather than matrices, and thus the formulation is exactly the same. The dictionary in this case is built from the outer product of each atom with itself. Thus, the new dictionary is of size N^2 × L. Therefore, if for the one-signal problem we used the scalar M_single,


defined as M_single = sup_{1≤i≠j≤L} |di^H dj|, then the new M_multiple should be

M_multiple = sup_{1≤i≠j≤L} |(di ⊗ di)^H (dj ⊗ dj)| = sup_{1≤i≠j≤L} |(di^H dj) ⊗ (di^H dj)| = M_single^2. (20)

Thus M, being a value smaller than 1, becomes smaller still, and all the sparseness requirements in the developed Theorems improve markedly. As an example, for the dictionary composed of the identity and the Hadamard unitary matrices, if N = 256, then it is easy to show that M_single = 1/√N = 1/16. Thus, using Theorem 9, we have that if a signal has a representation with fewer than 0.5(1 + 1/M_single) = 8.5 atoms, then (P1) will find this representation. Now, if a sequence of such signals is obtained, we get that if the signals are composed of fewer than 0.5(1 + 1/M_multiple) = 128.5 atoms, then a correct decomposition can be found using (P1).
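The squaring of M can be checked directly (a sketch for the real case; the random dictionary is an illustrative assumption): the columns of the new dictionary are the vectorized outer products dk dk^T, and their mutual inner products are exactly the squares of the original ones.

```python
# Check that the outer-product dictionary has cross-correlation M_single**2.
import numpy as np

def mutual_coherence(D):
    G = np.abs(D.T @ D)
    np.fill_diagonal(G, 0.0)
    return G.max()

rng = np.random.default_rng(0)
D = rng.standard_normal((16, 40))
D /= np.linalg.norm(D, axis=0)                                   # unit-norm atoms

D2 = np.stack([np.outer(d, d).ravel() for d in D.T], axis=1)     # size N^2 x L
print(np.isclose(mutual_coherence(D2), mutual_coherence(D) ** 2))  # True
```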

Before we conclude this topic of how to exploit the existence of multiple signals, several com-

ments are in order:

• So far we have seen two extremes - one signal or infinitely many signals. A more practical

question is how to exploit the support overlap in the multiple signal case if a finite set of

signals is given to us. This question remains at the moment unanswered.

• Notice that in the above two optimization problems we conveniently chose not to use an additional constraint forcing positivity on the unknowns, due to their origin as variances. As it turns out, this additional constraint does not impact the correctness of the Theorems obtained in this paper, and thus we are allowed to use them as they are. Of course, from the practical point of view, adding the positivity constraint should improve the conditioning of the optimization problems involved and thereby stabilize their solutions.


• A vital analysis of inaccurate cases is missing, both here and in the single-signal case. An extension of the obtained results to the case where the representation is an approximate one is in order. This is especially important in the multiple-signal case, where an estimated covariance matrix is used. Again, we leave these questions open at the moment.

• The improvement found in using the large ensemble of signals comes from the fact that the representation coefficients are statistically independent. Actually, K-th order moments (e.g. K = 3, 4, ...) could be used just as well, leading to a better bound 0.5(1 + 1/M_multiple), since then we have M_multiple = M_single^K. This leads to the conclusion that in the case of an infinite number of measured signals, (P1) can be applied successfully as a replacement for (P0) for all signals, provided that a sufficiently high-order moment is used.

6 Summary and Conclusions

This work addresses the problem of decomposing a signal into its building atoms. We assume

that a dictionary of the atoms is given, and we seek the sparsest representation. The basis pursuit

method [3] proposes to replace the minimization of the l0-norm of the representation coefficients with the l1-norm, leading to a solvable problem. Later contributions [6, 7, 8, 12] proposed a theoretic

background for such a replacement, proving that, indeed, a sparse representation is unique, and

also proving that for sufficiently sparse representations, there is an equivalence between the l0-

norm and the l1-norm minimizations. However, all these theoretic results were based on the

assumption that the dictionary is composed of a pair of unitary bases.

In this paper we propose extensions to the uniqueness and equivalence results mentioned


above, and treat the most general dictionary case. We show that similar theorems are found to

be true for any dictionary. A basic tool used in order to prove these theorems is the Spark of a

matrix. We bound this value from both sides in order to practically evaluate it.

Another contribution of this paper is the decomposition of multiple signals generated by the

same statistical source. We show that using the above understanding, far better results are

achieved when higher order moments are used.

Open questions for future research:

• Our equivalence Theorem for the general dictionary case is weaker than the uniqueness Theorems that are based on the Spark of the dictionary matrix. We did not find a parallel result to the σ/2 uniqueness result, nor did we find a bound using the ordering method described in Theorem 5. Further work needs to be done here in

order to improve our results.

• We found ways to bound the Spark of a matrix from below and above. Are there better

ways to compute/bound the Spark? Is there a way to exploit the order-reduction property

we found in the upper bound on the Spark? Further work is required in order to establish

better methods to compute the Spark.

• The multiple-signals case was solved only for an infinite number of measurements, building on estimation of moments. A similar result should be obtained for the case of a finite number of signals, using a deterministic approach rather than a statistical one.

• All the results in this paper should be extended to the case of approximate representation

where a bounded error is allowed in the equation Dγ = S. We expect all the results to


hold as well, and somehow improve as a function of the allowed error norm.

References

[1] S. Mallat, A Wavelet Tour of Signal Processing, 1998, Academic Press, Second

Edition.

[2] S. Mallat & Z. Zhang, Matching Pursuit with Time-Frequency Dictionaries, IEEE Transac-

tions on Signal Processing, Volume 41, number 12, pages 3397-3415, December 1993.

[3] S.S. Chen & D.L. Donoho & M.A. Saunders, Atomic decomposition by basis pursuit, SIAM

Review, January 2001, volume 43, number 1, pages 129-159.

[4] P.E. Gill & W. Murray & M.H. Wright, Numerical Linear Algebra and Optimization,

1985, Cambridge University Press.

[5] D. Bertsekas, Non-Linear Programming, 1995, Athena Scientific.

[6] D.L. Donoho & X. Huo, Uncertainty Principles and Ideal Atomic Decomposition, IEEE

Transactions on Information Theory, November 2001, volume 47, number 7, pages 2845-62.

[7] M. Elad & A.M. Bruckstein, On Sparse Representations, International Conference on Image

Processing (ICIP) 2001, Thessaloniki, Greece, November 2001.

[8] M. Elad & A.M. Bruckstein, A Generalized Uncertainty Principle and Sparse Representation in Pairs of ℜ^N Bases, accepted to the IEEE Transactions on Information Theory in December 2001.


[9] R.A. Horn & C.R. Johnson, Matrix Analysis, 1991, Addison-Wesley, Redwood City, CA.

[10] G.H. Golub & C.F. Van Loan, Matrix Computations, Third Edition, 1996, The Johns Hopkins University Press.

[11] D.L. Donoho & P.B. Stark, Uncertainty Principles and Signal Recovery, SIAM Journal on

Applied Mathematics, June 1989, Volume 49/3, pages 906-931.

[12] A. Feuer & A. Nemirovsky, On Sparse Representations in Pairs of Bases, Submitted to the

IEEE Transactions on Information Theory in July 2002.
