
Singular Value Decomposition

(Com S 477/577 Notes)

Yan-Bin Jia

Sep 12, 2017

1 Introduction

Now comes a highlight of linear algebra. Any real m × n matrix can be factored as

A = UΣVᵀ

where U is an m × m orthogonal matrix¹ whose columns are the eigenvectors of AAᵀ, V is an n × n orthogonal matrix whose columns are the eigenvectors of AᵀA, and Σ is an m × n diagonal matrix of the form

Σ = ( σ1                )
    (     ⋱          0  )
    (        σr         )
    (     0          0  )

with σ1 ≥ σ2 ≥ · · · ≥ σr > 0 and r = rank(A). In the above, σ1, . . . , σr are the square roots of the eigenvalues of AᵀA. They are called the singular values of A.

Our basic goal is to “solve” the system Ax = b for all matrices A and vectors b. A second goal is to solve the system using a numerically stable algorithm. A third goal is to solve the system in a reasonably efficient manner. For instance, we do not want to compute A⁻¹ using determinants.

Three situations arise regarding the basic goal:

(a) If A is square and invertible, we want to have the solution x = A⁻¹b.

(b) If A is underconstrained, we want the entire set of solutions.

(c) If A is overconstrained, we could simply give up. But this case arises a lot in practice, so instead we will ask for the least-squares solution. In other words, we want that x̄ which minimizes the error ‖Ax̄ − b‖. Geometrically, Ax̄ is the point in the column space of A closest to b.

¹That is, UUᵀ = UᵀU = I.


Gaussian elimination is reasonably efficient, but it is not numerically very stable. In particular, elimination does not deal well with nearly singular matrices. The method is not designed for overconstrained systems. Even for underconstrained systems, the method requires extra work.

The poor numerical character of elimination can be seen in a couple of ways. First, the elimination process assumes a non-singular matrix. But singularity, and rank in general, is a slippery concept. After all, the matrix A contains continuous, possibly noisy, entries. Yet, rank is a discrete integer. Strictly speaking, each of the two sets below consists of linearly independent vectors:

(1, 0, 0)ᵀ, (0, 1, 0)ᵀ, (0, 0, 1)ᵀ    and    (1.01, 1.00, 1.00)ᵀ, (1.00, 1.01, 1.00)ᵀ, (1.00, 1.00, 1.01)ᵀ.

Yet, the first set seems genuinely independent, while the second set seems “almost dependent”. Second, elimination-based methods work like LU decomposition, which represents the coefficient matrix A as a matrix product LDU, where L and U are respectively lower and upper triangular and D is diagonal. One solves the system Ax = b by solving (via forward and back substitution) Ly = b and Ux = D⁻¹y. If A is nearly singular, then D will contain near-zero entries on its diagonal, and thus D⁻¹ will contain large numbers. That is OK since in principle one needs the large numbers to obtain a true solution. The problem is if A contains noisy entries. Then the large numbers may be pure noise that dominates the true information. Furthermore, since L and U can be fairly arbitrary, they may distort or magnify that noise across other variables.

Singular value decomposition is a powerful technique for dealing with sets of equations or matrices that are either singular or else numerically very close to singular. In many cases where Gaussian elimination and LU decomposition fail to give satisfactory results, SVD will not only diagnose the problem but also give you a useful numerical answer. It is also the method of choice for solving most linear least-squares problems.
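As a concrete illustration (a minimal NumPy sketch added to these notes, using the matrix that appears later as Example 2), a direct solve is formally fine but amplifies a small perturbation of b, while the singular values immediately expose how close A is to being singular:

import numpy as np

# The nearly singular matrix from Example 2 later in these notes.
A = np.array([[1.01, 1.00, 1.00],
              [1.00, 1.01, 1.00],
              [1.00, 1.00, 1.00]])

# An elimination-style solve succeeds, but a 0.01 perturbation of b moves
# the solution by an amount on the order of 1: the noise is amplified.
b  = np.array([1.00, 1.00, 1.00])
b2 = np.array([1.00, 1.00, 1.01])
print(np.linalg.solve(A, b))    # [0, 0, 1]
print(np.linalg.solve(A, b2))   # roughly [-1, -1, 3.01]

# The SVD flags the trouble directly: two singular values are tiny
# compared with the largest one.
print(np.linalg.svd(A, compute_uv=False))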

2 SVD Close-up

An n × n symmetric matrix A has an eigendecomposition in the form of

A = SΛS⁻¹,

where Λ is a diagonal matrix with the eigenvalues δi of A on the diagonal and S contains the eigenvectors of A.

Why is the above decomposition appealing? The answer lies in the change of coordinates y = S⁻¹x. Instead of working with the system Ax = b, we can work with the system Λy = c, where c = S⁻¹b. Since Λ is diagonal, we are left with a trivial system

δi yi = ci,   i = 1, . . . , n.

If this system has a solution, then another change of coordinates gives us x, that is, x = Sy.²
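The following sketch (added here, not part of the original notes) carries out exactly this change of coordinates with NumPy, taking A symmetric so that S may be chosen orthogonal, as footnote 2 explains:

import numpy as np

# A small symmetric matrix and a right-hand side.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

# Eigendecomposition A = S Lambda S^T; S is orthogonal because A is symmetric.
delta, S = np.linalg.eigh(A)

c = S.T @ b     # change of coordinates: c = S^{-1} b = S^T b
y = c / delta   # solve the trivial diagonal system delta_i y_i = c_i
x = S @ y       # change back: x = S y

print(x, np.linalg.solve(A, b))   # the two solutions agree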

²There is no reason to believe that computing S⁻¹b is any easier than computing A⁻¹b. However, in the ideal case, the eigenvectors of A are orthogonal. This is true, for instance, if AAᵀ = AᵀA. In that case the columns of S are orthogonal and so we can take S to be orthogonal. But then S⁻¹ = Sᵀ, and the problem of solving Ax = b becomes very simple.


Unfortunately, the decomposition A = SΛS⁻¹ is not always possible. The condition for its existence is that the n × n matrix A has n linearly independent eigenvectors. Even worse, what do we do if A is not square?

The answer is to work with AᵀA and AAᵀ, both of which are symmetric (and have n and m orthogonal eigenvectors, respectively). So we have the following decompositions:

AᵀA = V D Vᵀ,
AAᵀ = U D′Uᵀ,

where V is an n × n orthogonal matrix consisting of the eigenvectors of AᵀA, D an n × n diagonal matrix with the eigenvalues of AᵀA on the diagonal, U an m × m orthogonal matrix consisting of the eigenvectors of AAᵀ, and D′ an m × m diagonal matrix with the eigenvalues of AAᵀ on the diagonal. It turns out that D and D′ have the same non-zero diagonal entries except that the order might be different.
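A quick numerical check of this claim (an added sketch with an arbitrarily chosen rectangular matrix):

import numpy as np

# An arbitrary 4x2 matrix, so A^T A is 2x2 and A A^T is 4x4.
A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, 0.0],
              [1.0, 1.0]])

d       = np.linalg.eigvalsh(A.T @ A)   # eigenvalues of A^T A (the entries of D)
d_prime = np.linalg.eigvalsh(A @ A.T)   # eigenvalues of A A^T (the entries of D')

print(np.sort(d)[::-1])         # the non-zero entries ...
print(np.sort(d_prime)[::-1])   # ... agree; A A^T simply has two extra zeros

# Both coincide with the squared singular values of A.
print(np.linalg.svd(A, compute_uv=False) ** 2)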

Recall the SVD form of A:

A = U Σ Vᵀ                    (1)

where A is m × n, U is m × m, Σ is m × n, and Vᵀ is n × n.

[Figure: the block structure of (1). The columns u1, . . . , um of U are shown with u1, . . . , ur spanning col(A) and ur+1, . . . , um spanning null(Aᵀ); Σ carries σ1, . . . , σr on its diagonal and zeros elsewhere; the rows v1ᵀ, . . . , vnᵀ of Vᵀ are shown with v1, . . . , vr spanning row(A) and vr+1, . . . , vn spanning null(A).]

There are several facts about SVD:

(a) rank(A) = rank(Σ) = r.

(b) The column space of A is spanned by the first r columns of U .

(c) The null space of A is spanned by the last n− r columns of V .

(d) The row space of A is spanned by the first r columns of V .

(e) The null space of AT is spanned by the last m− r columns of U .

We can think of U and V as rotations and reflections and Σ as a stretching matrix. The next figure illustrates the sequence of transformations under A on the unit vectors v1 and v2 (and all other vectors on the unit circle) in the case m = n = 2. Note that V = (v1 v2). When multiplied by Vᵀ on the left, the two vectors undergo a rotation and become unit vectors i = (1, 0)ᵀ and j = (0, 1)ᵀ. Then the matrix Σ stretches these two resulting vectors to σ1 i and σ2 j, respectively. In the last step, the vectors undergo a final rotation due to U and become σ1 u1 and σ2 u2.
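This factor-by-factor action is easy to check numerically (an added sketch, using the 2 × 2 matrix of Example 1 below): A sends each right singular vector vi to σi ui.

import numpy as np

# The 2x2 matrix of Example 1 below.
A = np.array([[ 2.0, 2.0],
              [-1.0, 1.0]])

U, s, Vt = np.linalg.svd(A)

# A v_i = sigma_i u_i: V^T rotates v_i onto the i-th coordinate axis,
# Sigma stretches it by sigma_i, and U rotates it to its final position.
for i in range(2):
    v_i = Vt[i]                       # i-th right singular vector
    print(A @ v_i, s[i] * U[:, i])    # the two sides agree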


[Figure: v1 and v2 on the unit circle are first rotated by Vᵀ to i and j, then stretched by Σ to σ1 i and σ2 j, and finally rotated by U to σ1 u1 and σ2 u2; the composite of the three steps is A.]

From (1) we also see that

A = σ1 u1 v1ᵀ + · · · + σr ur vrᵀ.

We can swap σi with σj as long as we swap ui with uj and vi with vj at the same time. If σi = σj, then ui and uj can be swapped as long as vi and vj are also swapped. SVD is unique up to the permutations of (ui, σi, vi) or of (ui, vi) among those with equal σi's. It is also unique up to the signs of ui and vi, which have to change simultaneously.
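The expansion is easy to verify in code, and dropping its trailing terms is how SVD is used to build low-rank approximations (an added sketch; the 3 × 2 matrix is arbitrary):

import numpy as np

# An arbitrary 3x2 matrix.
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-12))               # numerical rank

# A equals the sum of the r rank-one terms sigma_i u_i v_i^T.
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(r))
print(np.allclose(A, A_rebuilt))         # True

# Keeping only the leading term gives a rank-one approximation of A.
A_rank1 = s[0] * np.outer(U[:, 0], Vt[0])
print(np.linalg.norm(A - A_rank1))       # the size of what was discarded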

It follows that

AᵀA = V ΣᵀUᵀUΣVᵀ = V diag(σ1², . . . , σr², 0, . . . , 0) Vᵀ.

Hence σ1², . . . , σr² (and 0 if r < n) are the eigenvalues of AᵀA, which is positive definite if rank(A) = n, and v1, . . . , vn its eigenvectors. Similarly, we have

AAᵀ = U diag(σ1², . . . , σr², 0, . . . , 0) Uᵀ.

Therefore σ1², . . . , σr² (and 0 if r < m) are also eigenvalues of AAᵀ, and u1, . . . , um its eigenvectors.

Example 1. Find the singular value decomposition of

A = (  2  2 )
    ( −1  1 ).


The eigenvalues of

AᵀA = ( 5  3 )
      ( 3  5 )

are 2 and 8, corresponding to unit eigenvectors

v1 = (−1/√2, 1/√2)ᵀ   and   v2 = (1/√2, 1/√2)ᵀ,

respectively. Hence σ1 = √2 and σ2 = √8 = 2√2. We have

Av1 = σ1 u1 v1ᵀv1 = σ1 u1 = (0, √2)ᵀ,   so u1 = (0, 1)ᵀ;
Av2 = σ2 u2 v2ᵀv2 = σ2 u2 = (2√2, 0)ᵀ,   so u2 = (1, 0)ᵀ.

The SVD of A is therefore

A = UΣVᵀ = ( 0  1 ) ( √2   0  ) ( −1/√2  1/√2 )
           ( 1  0 ) ( 0   2√2 ) (  1/√2  1/√2 ).

Note that we can also compute the eigenvectors u1 and u2 directly from AAᵀ.
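As a sanity check (an added sketch), NumPy reproduces this factorization; note that np.linalg.svd orders the singular values decreasingly, so its first singular value is the 2√2 above, and the signs of corresponding columns of U and V may differ from the hand computation:

import numpy as np

A = np.array([[ 2.0, 2.0],
              [-1.0, 1.0]])

U, s, Vt = np.linalg.svd(A)
print(s)                                   # [2.828..., 1.414...] = [2*sqrt(2), sqrt(2)]
print(np.allclose(U @ np.diag(s) @ Vt, A)) # True: A = U Sigma V^T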

3 More Discussion

That rank(A) = rank(Σ) tells us that we can determine the rank of A by counting the non-zero entries in Σ. In fact, we can do better. Recall that one of our complaints about Gaussian elimination was that it did not handle noise or nearly singular matrices well. SVD remedies this situation.

For example, suppose that an n × n matrix A is nearly singular. Indeed, perhaps A should be singular, but due to noisy data, it is not quite singular. This will show up in Σ, for instance, when all of the n diagonal entries in Σ are non-zero and some of the diagonal entries are almost zero.

More generally, an m × n matrix A may appear to have rank r, yet when we look at Σ we may find that some of the singular values are very close to zero. If there are l such values, then the “true” rank of A is probably r − l, and we would do well to modify Σ. Specifically, we should replace the l nearly zero singular values with zero.

Geometrically, the effect of this replacement is to reduce the column space of A and increase its null space. The point is that the column space is warped along directions for which σi ≈ 0. In effect, solutions to Ax = b get pulled off to infinity (since 1/σi ≈ ∞) along vectors that are almost in the null space. So, it is better to ignore the i-th coordinate by zeroing σi.

Example 2. The matrix

A = ( 1.01  1.00  1.00 )
    ( 1.00  1.01  1.00 )
    ( 1.00  1.00  1.00 )

yields

Σ = ( 3.01  0     0    )
    ( 0     0.01  0    )
    ( 0     0     0.01 ).

Since 0.01 is significantly smaller than 3.01, we could treat it as zero and take the rank of A as one.
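In code this becomes a “numerical rank” computation: count only the singular values above some tolerance relative to σ1 (an added sketch; the relative tolerance 1e-2 below is an illustrative choice made to match this example, not a universal constant):

import numpy as np

def numerical_rank(A, rel_tol=1e-2):
    """Count the singular values larger than rel_tol times the largest one."""
    s = np.linalg.svd(A, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

A = np.array([[1.01, 1.00, 1.00],
              [1.00, 1.01, 1.00],
              [1.00, 1.00, 1.00]])

print(np.linalg.matrix_rank(A))          # 3: every singular value is non-zero
print(numerical_rank(A, rel_tol=1e-2))   # 1: the tiny singular values are treated as zero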


4 The SVD Solution

For the moment, let us suppose A is an n × n square matrix. The following picture sketches the way in which SVD solves the system Ax = b.

[Figure: b lies outside the column space col(A); its projection onto col(A) is b′. The solvable system Ax = b′ has the affine set of solutions x* + null(A), and the SVD solution x* is the member of that set closest to the origin.]

The system Ax = b′ has an affine set of solutions, given by x0 + null(A), where x0 is any solution. It is easy to describe the null space null(A) given the SVD decomposition, since it is just the span of the last n − r columns of V. Also note that Ax = b has no solution since b is not in the column space of A.

SVD solves Ax = b by determining that x which is the closest to the origin, i.e., which has the minimum norm. It first projects b onto the column space of A, obtaining b′, and then solves Ax = b′. Essentially, SVD obtains the least-squares solution.

Given the diagonal matrix Σ in the SVD of A, denote by Σ⁺ the diagonal matrix whose diagonal entries are of the form:

(Σ⁺)ii = 1/σi   if 1 ≤ i ≤ r;
(Σ⁺)ii = 0      if r + 1 ≤ i ≤ n.

Thus the product matrix

Σ · Σ⁺ = Σ⁺ · Σ = diag(1, . . . , 1, 0, . . . , 0)

has r 1's on its diagonal. If Σ is invertible, then Σ⁺ = Σ⁻¹.

Example 3. If

Σ = ( 2  0  0 )
    ( 0  5  0 )
    ( 0  0  0 )

then

Σ⁺ = ( 1/2  0    0 )
     ( 0    1/5  0 )
     ( 0    0    0 ).

So how do we compute a solution to Ax = b for all the cases? We first compute the SVD decomposition A = UΣVᵀ, then compute

x̄ = V Σ⁺Uᵀb

(a numerical sketch of this recipe follows the case list below).

(a) If A is invertible, then x̄ is the unique solution to Ax = b.

(b) If A is singular and b is in the range of A, then x̄ is the solution closest to the origin, and V Σ⁺Uᵀ is called a pseudo-inverse of A. The set of solutions is x̄ + null(A), where null(A) is spanned by the last n − r columns of V.

(c) If A is singular and b is not in the range of A, then x̄ is the least-squares solution.
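A minimal sketch of this recipe (added here; the rank-2 matrix and right-hand side are arbitrary illustrations of case (c)) builds Σ⁺ explicitly and compares the result with NumPy's own pseudo-inverse and least-squares routines:

import numpy as np

def svd_solve(A, b, tol=1e-12):
    """Return x_bar = V Sigma^+ U^T b, zeroing singular values below tol."""
    U, s, Vt = np.linalg.svd(A)
    s_plus = np.array([1.0 / si if si > tol else 0.0 for si in s])
    c = U.T @ b                       # coordinates of b in the u-basis
    y = np.zeros(Vt.shape[0])
    y[:len(s)] = s_plus * c[:len(s)]  # apply Sigma^+ entrywise
    return Vt.T @ y                   # x_bar = V y

# A singular 3x3 matrix (rank 2) and a b outside its column space:
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])
b = np.array([1.0, 0.0, 0.0])

print(svd_solve(A, b))
print(np.linalg.pinv(A) @ b)                  # the pseudo-inverse gives the same x_bar
print(np.linalg.lstsq(A, b, rcond=None)[0])   # the least-squares solution agrees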

So far we have assumed that A is square. How do we solve Ax = b for a general m × n matrix A? We still make use of the n × m pseudo-inverse matrix

A⁺ = V Σ⁺Uᵀ.

The SVD solution is x̄ = A⁺b. The pseudoinverse A⁺ has the following properties:

A⁺ui = (1/σi) vi   if i ≤ r,   and   A⁺ui = 0   if i > r;
(A⁺)ᵀvi = (1/σi) ui   if i ≤ r,   and   (A⁺)ᵀvi = 0   if i > r.

Namely, the vectors u1, . . . , ur in the column space of A go back to the row space. The other vectors ur+1, . . . , um are in the null space of Aᵀ, and A⁺ sends them to zero. When we know what happens to each basis vector ui, we know A⁺ because vi and σi will be determined.
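These mapping properties can be confirmed directly (an added sketch with an arbitrary 3 × 2 matrix of rank 2, so m = 3, n = 2, r = 2):

import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [1.0, 0.0]])

U, s, Vt = np.linalg.svd(A)
A_plus = np.linalg.pinv(A)

for i in range(3):
    if i < len(s) and s[i] > 1e-12:
        # A+ u_i = (1/sigma_i) v_i for i <= r
        print(A_plus @ U[:, i], Vt[i] / s[i])
    else:
        # the remaining u_i lie in null(A^T), and A+ sends them to zero
        print(A_plus @ U[:, i])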

Example 4. The pseudoinverse of the matrix

A = (  2  2 )
    ( −1  1 )

from Example 1 is A⁺ = A⁻¹, because A is invertible.

And we have

A⁺ = A⁻¹ = V Σ⁻¹Uᵀ = ( 1/4  −1/2 )
                     ( 1/4   1/2 ).

In general there will not be any zero singular values. However, if A has column degeneracies there may be near-zero singular values. It may be useful to zero these out, to remove noise. (The implication is that the overdetermined set is not quite as overdetermined as it seemed, that is, the null space is not trivial.)


5 SVD Algorithm

Here we only take a brief look at how the SVD algorithm actually works. For more details we refer to [2]. It is useful to establish a contrast with Gaussian elimination, which reduces a matrix A by a series of row operations that zero out portions of columns of A. Row operations imply pre-multiplying the matrix A. They are all collected together in the matrix L⁻¹, where A = LDU.

In contrast, SVD zeros out portions of both rows and columns. Thus, whereas Gaussian elimination only reduces A using pre-multiplication, SVD uses both pre- and post-multiplication. As a result, SVD can at each stage rely on orthogonal matrices to perform its reductions on A. By using orthogonal matrices, SVD reduces the risk of magnifying noise and errors. The pre-multiplication matrices are gathered together in the matrix Uᵀ, while the post-multiplication matrices are gathered together in the matrix V.

There are two phases to the SVD decomposition algorithm:

(i) SVD reduces A to bidiagonal form using a series of orthogonal transformations. This phase is deterministic and has a running time that depends only on the size of the matrix A.

(ii) SVD removes the superdiagonal elements from the bidiagonal matrix using orthogonal transformations. This phase is iterative but converges quickly.

Let us take a slightly closer look at the first phase. Step 1 in this phase creates two orthogonal matrices U1 and V1 such that

U1 A = ( a′11  a′12  · · ·  a′1n )
       (  0                     )
       (  ⋮         B′          )
       (  0                     ),

U1 A V1 = ( a″11  a″12  0  · · ·  0 )
          (  0                      )
          (  ⋮          B″          )
          (  0                      ).

If A is m × n then B″ is (m − 1) × (n − 1). The next step of this phase recursively works on B″, and so forth, until orthogonal matrices U1, . . . , Un−1, V1, . . . , Vn−2 are produced such that Un−1 · · ·U1 A V1 · · ·Vn−2 is bidiagonal (assume m ≥ n).

In both phases, the orthogonal transformations are constructed from Householder matrices. For practitioners, free online libraries such as [5, 6] offer SVD source code in C++.
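For a flavor of phase (i), here is a minimal sketch of its first step (added here; the 4 × 3 matrix and the helper householder are illustrative, not from the notes). A Householder matrix built from the first column of A plays the role of U1, and one built from the remainder of the first row of U1A plays the role of V1:

import numpy as np

def householder(x):
    """Orthogonal (and symmetric) H with H x = (+-||x||, 0, ..., 0)^T."""
    v = x.astype(float).copy()
    v[0] += (1.0 if v[0] >= 0 else -1.0) * np.linalg.norm(x)
    v /= np.linalg.norm(v)
    return np.eye(len(x)) - 2.0 * np.outer(v, v)

# Step 1 of the bidiagonalization phase on an illustrative 4x3 matrix.
A = np.array([[4.0, 1.0, 2.0],
              [2.0, 3.0, 0.0],
              [1.0, 0.0, 1.0],
              [3.0, 1.0, 5.0]])
m, n = A.shape

# Pre-multiplication: U1 zeroes out column 1 below the (1,1) entry.
U1 = householder(A[:, 0])
B = U1 @ A

# Post-multiplication: V1 (identity on the first coordinate) zeroes out
# row 1 to the right of the (1,2) entry, leaving column 1 untouched.
V1 = np.eye(n)
V1[1:, 1:] = householder(B[0, 1:])
B = B @ V1

print(np.round(B, 10))   # matches the U1 A V1 pattern displayed above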

References

[1] M. Erdmann. Lecture notes for 16-811 Mathematical Fundamentals for Robotics. The Robotics Institute, Carnegie Mellon University, 1998.

[2] G. E. Forsythe, M. A. Malcolm, and C. B. Moler. Computer Methods for Mathematical Computations. Prentice-Hall, 1977.


[3] W. H. Press, et al. Numerical Recipes in C++: The Art of Scientific Computing. Cambridge University Press, 2nd edition, 2002.

[4] G. Strang. Introduction to Linear Algebra. Wellesley-Cambridge Press, 1993.

[5] Boost C++ Libraries. http://www.boost.org/.

[6] LTI-Lib. http://ltilib.sourceforge.net/doc/homepage/index.shtml.
