
CS 4780/6780: Fundamentals of Data Science Spring 2019

Topic 4: Review of Linear Algebra

Instructor: Daniel L. Pimentel-Alarcón © Copyright 2019

GO GREEN. AVOID PRINTING, OR PRINT DOUBLE-SIDED.

4.1 Vectors

In words, a vector is simply a point in space.

Example 4.1. Here are two vectors in 3-dimensional space, often denoted as R3:

x = [1 −2 π]T,  y = [e 0 1]T.

Why do I care about vectors?

If you are wondering this, you are asking yourself the right question! The reason is simple: in data science we have to deal with data, and it is useful to arrange it in vectors.

Example 4.2 (Electronic health records). Hospitals keep health records of their patients, containing information such as weight, amount of exercise they do, and glucose level. The information of the ith patient can be arranged as a vector

xi = [weight exercise glucose]T ∈ R3.




In this sort of problem we want to identify causes for diseases. This can be done by analyzing the patterns in the vectors of different patients. For example, if our data x1, . . . , xN looks like:

then it is reasonable to conclude that excess weight and lack of exercise are highly correlated with diabetes.

Of course, this is an oversimplified example. Not all correlations are as evident. Health records actually include much more comprehensive information, such as age, gender, ethnicity, cholesterol levels, etc. This would produce data vectors xi in higher dimensions:

xi = [weight exercise glucose age gender ethnicity cholesterol · · ·]T ∈ RD.

Now you will have to use your imagination to decide what D-dimensional space looks like. In fact, it can be very challenging to visualize points in RD (with D > 3, obviously). Luckily, using linear algebra we can find lines, planes, curves, etc. (similar to the gray curves depicted in the figure above, only in higher dimensions) that explain our data (just as the gray curves explain the correlations between weight, exercise and glucose).

Example 4.3 (Recommender systems). Similarly, Amazon, Netflix, Pandora, Spotify, Pinterest, Yelp, Apple, etc., keep information about their users, such as age, gender, income level, and very importantly, ratings of their products. The information of the ith user can be arranged as a vector

xi = [age gender income (rating of item 1) (rating of item 2) · · · (rating of item D−3)]T ∈ RD.

In this sort of problem we want to analyze these data vectors to predict which users will like which items, in order to make good recommendations. If Amazon recommends you an item you will like, you are more likely to buy it. You can see why all these companies have a great interest in this problem,



and they are paying a lot of money to people who work on this.

This can be done by finding structures (e.g., lines or curves) in high dimensions that explain the data. As in Example 4.2, where we discovered that weight and exercise are good predictors for diabetes, here we want to discover which variables (e.g., gender, age, income, etc.) can predict which items (e.g., movies, shoes, songs, etc.) you would like.

Example 4.4 (Genomics). The genome of each individual can be stored as a vector containing the corresponding sequence of nucleotides, e.g., Adenine, Thymine, Guanine, Cytosine, Thymine, . . .

In this sort of problem we want to analyze these data vectors to determine which genes are correlated with which diseases (or features, like height or weight).

Example 4.5 (Image processing). An m × n grayscale image can be stored in a data matrix X ∈ Rm×n whose (i, j)th entry contains the gray intensity of pixel (i, j). Furthermore, X can be vectorized, i.e., we can stack its columns to form a vector x ∈ RD, with D = mn.

We want to analyze these vectors to interpret the image: for example, identifying the objects that appear in the image, classifying faces, etc.
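The column-stacking above can be sketched in a few lines of Python with numpy; the 2 × 3 "image" below is made up purely for illustration:

```python
import numpy as np

# Toy 2x3 "grayscale image" (the entries are made-up gray intensities).
X = np.array([[0.1, 0.5, 0.9],
              [0.2, 0.6, 1.0]])
m, n = X.shape

# Vectorize by stacking the columns of X; order="F" means column-major,
# which matches the column-stacking convention in the text.
x = X.flatten(order="F")

print(x.shape)  # (6,) -- a vector in R^D with D = m*n = 6
```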



Example 4.6 (Computer vision). The images X1, . . . , XN ∈ Rm×n that form a video can be vectorized to obtain vectors x1, . . . , xN ∈ RD.

Similar to image processing, we want to analyze these vectors to interpret the video: for example, to distinguish background from foreground, track objects, etc. This has applications in surveillance, defense, robotics, etc.

Example 4.7 (Neural activity). Functional magnetic resonance imaging (fMRI) generates a series of MRI images over time. Because oxygenated and deoxygenated hemoglobin have slightly different magnetic characteristics, variations in the MRI intensity indicate areas of the brain with increased blood flow and hence neural activity. The central task in fMRI is to reliably detect neural activity at different spatial locations (pixels) in the brain. The measurements over time at the (i, j)th pixel can be stored in a data vector xij ∈ RD.


The idea is to analyze these vectors to determine the active pixels.

Example 4.8 (Solar flares). The Sun, like all active stars, is constantly producing huge electromagnetic flares. Every now and then, these flares hit the Earth. The last time this happened was in 1859, and all that happened was that you could see the northern lights all the way down to Mexico (not a bad side effect!). However, back in 1859 we didn't have a massive power grid, satellites, wireless communications, GPS, airplanes, space stations, etc. If a flare hit the Earth now, all these systems would be crippled, and repairing them could take years and cost trillions of dollars to the U.S. alone! To make things worse, it turns out that these flares are not rare at all! It is estimated that the chance that a flare hits the Earth in the next decade is about 12%.



Of course, we cannot stop these flares any more than we can stop an earthquake. If it hits us, it hits us. However, as with an earthquake, we can act ahead. If we know that a flare is coming, we can turn everything off, let it pass, and then turn everything back on, like nothing happened. Hence NASA and other institutions are investing a great deal of time, effort and money to develop techniques that enable us to predict that a flare is coming.

So essentially, we want to devise a sort of flare radar or detector. This radar would receive, for example, an image X of the Sun (or equivalently, a vectorized image x ∈ RD), and would have to decide whether a flare is coming or not.

These are only a few examples that I hope help convince you that vectors are the backbone of data science. In these notes we will review some of the most basic linear algebra concepts that will later enable us to use the powerful vector machinery developed over centuries to tackle the modern problems of data science.

4.2 Fundamental Concepts

One of the most elemental vector manipulations is the linear combination, which essentially means scaling and adding vectors.

Definition 4.1 (Linear combination, coefficients). A vector z is a linear combination of {x1, . . . , xR} if it can be written as

z = ∑_{r=1}^{R} cr xr    (4.1)

for some c1, . . . , cR ∈ R. The scalars {c1, . . . , cR} are called the coefficients of z with respect to (w.r.t.) {x1, . . . , xR}.



Example 4.9. Let x and y be as in Example 4.1. Let

z = −3x + 2y = −3 [1 −2 π]T + 2 [e 0 1]T = [−3 6 −3π]T + [2e 0 2]T = [−3+2e 6 −3π+2]T.

Then z is a linear combination of x and y, with coefficients −3 and 2.
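This computation can be reproduced as a quick sanity check; a minimal numpy sketch:

```python
import numpy as np

x = np.array([1.0, -2.0, np.pi])
y = np.array([np.e, 0.0, 1.0])

# The linear combination z = -3x + 2y, with coefficients -3 and 2.
z = -3 * x + 2 * y

print(np.allclose(z, [-3 + 2 * np.e, 6.0, -3 * np.pi + 2]))  # True
```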

Another fundamental concept is linear independence.

Definition 4.2 (Linear independence). A set of vectors {x1, . . . , xR} is linearly independent if

∑_{r=1}^{R} cr xr = 0

implies cr = 0 for every r = 1, . . . , R. Otherwise we say it is linearly dependent.

Intuitively, a set of vectors is linearly independent if none of them can be written as a linear combination of the others.

Example 4.10. The following vectors are linearly independent:

x1 = [1 0 0]T,  x2 = [0 1 0]T,  x3 = [0 0 1]T,

because we cannot write any of them as a linear combination of the others. In contrast, the following vectors are linearly dependent:

y1 = [5 5 2]T,  y2 = [1 1 0]T,  y3 = [0 0 1]T,

because we can write y1 as a linear combination of y2 and y3, namely, y1 = 5y2 + 2y3.
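In practice, linear (in)dependence can be checked numerically: stack the vectors as the columns of a matrix, which is full column rank exactly when the set is independent. A minimal numpy sketch with the vectors above:

```python
import numpy as np

# Stack each set of vectors as the columns of a matrix; the set is linearly
# independent exactly when the matrix has full column rank.
X = np.column_stack(([1, 0, 0], [0, 1, 0], [0, 0, 1]))
Y = np.column_stack(([5, 5, 2], [1, 1, 0], [0, 0, 1]))

print(np.linalg.matrix_rank(X))  # 3 -> linearly independent
print(np.linalg.matrix_rank(Y))  # 2 -> linearly dependent (y1 = 5*y2 + 2*y3)
```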

4.3 Matrices

Matrices are very handy structures to arrange and manipulate vectors.

Example 4.11. We can arrange the vectors in Example 4.10 in the following matrices:

X =
[ 1 0 0 ]
[ 0 1 0 ]
[ 0 0 1 ],
Y =
[ 5 1 0 ]
[ 5 1 0 ]
[ 2 0 1 ].



4.3.1 Review of Basic Matrix Operations

• Matrix Multiplication. Given matrices A ∈ Rm×k and B ∈ Rk×n, their product, denoted by AB, is another matrix of size m × n whose (i, j)th entry is given by:

[AB]ij = ∑_{ℓ=1}^{k} Aiℓ Bℓj.

Intuitively, the (i, j)th entry of AB is given by the multiplication of the ith row of A and the jth column of B. For example, if

A =
[ 1 2 3 ]
[ 4 5 6 ],
B =
[ 1 2 ]
[ 3 4 ]
[ 5 6 ],    (4.2)

Then:

AB =
[ (1·1 + 2·3 + 3·5)  (1·2 + 2·4 + 3·6) ]
[ (4·1 + 5·3 + 6·5)  (4·2 + 5·4 + 6·6) ]
=
[ 22 28 ]
[ 49 64 ].
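The same product can be checked with numpy, where `@` denotes matrix multiplication:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])    # 2x3
B = np.array([[1, 2],
              [3, 4],
              [5, 6]])       # 3x2

# The @ operator performs matrix multiplication; the result is 2x2.
print(A @ B)
# [[22 28]
#  [49 64]]
```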

• Scalar Multiplication. Given a scalar c and a matrix A ∈ Rm×n, their product, denoted by cA, is an m × n matrix whose (i, j)th entry is given by c times the (i, j)th entry of A. For example, with A as in (4.2), and c = 7,

cA = 7A =
[ 7 14 21 ]
[ 28 35 42 ].

• Transposition. The transpose of a matrix A ∈ Rm×n, denoted by AT, is an n × m matrix whose (i, j)th entry is given by the (j, i)th entry of A. Intuitively, transposing a matrix is like flipping its rows and columns along the diagonal. For example, with A as in (4.2),

AT =
[ 1 4 ]
[ 2 5 ]
[ 3 6 ].

• Trace. Given a square matrix X ∈ Rm×m, its trace, denoted by tr(X), is the sum of its diagonal entries:

tr(X) = ∑_{i=1}^{m} Xii.

For example, with X as in Example 4.11, tr(X) = 3.

• Identity Matrix. The identity matrix of size m × m, denoted by Im, is a square matrix whose diagonal entries are all ones, and off-diagonal entries are all zeros. For example, the 3 × 3 identity matrix is

I3 =
[ 1 0 0 ]
[ 0 1 0 ]
[ 0 0 1 ].

Whenever there is no possible confusion about the size of the identity matrix, people often drop the m subindex, and simply denote it as I.



• Inverse. Given a square matrix X ∈ Rm×m, its inverse, denoted by X−1, is a matrix such that XX−1 = X−1X = I. For example, with X as in Example 4.11, X−1 = X. As another example, consider

X =
[ 1 1 1 ]
[ 1 2 4 ]
[ 1 3 9 ].

Then

X−1 =
[ 3 −3 1 ]
[ −2.5 4 −1.5 ]
[ 0.5 −1 0.5 ].

You can verify that XX−1 = X−1X = I. Here is a special trick to invert 2 × 2 matrices:

[ a b ; c d ]−1 = 1/(ad − bc) · [ d −b ; −c a ].

Of course, this requires that ad − bc ≠ 0.
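Both inverses above can be verified numerically; a quick numpy sketch (the 2 × 2 values a, b, c, d below are arbitrary illustrative choices with ad − bc ≠ 0):

```python
import numpy as np

X = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 4.0],
              [1.0, 3.0, 9.0]])

Xinv = np.linalg.inv(X)
print(np.allclose(X @ Xinv, np.eye(3)))  # True

# The 2x2 trick: [a b; c d]^{-1} = (1/(ad - bc)) * [d -b; -c a].
a, b, c, d = 1.0, 2.0, 3.0, 4.0          # arbitrary values with ad - bc != 0
M = np.array([[a, b], [c, d]])
Minv = np.array([[d, -b], [-c, a]]) / (a * d - b * c)
print(np.allclose(M @ Minv, np.eye(2)))  # True
```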

• Hadamard Product. Given two matrices A, B ∈ Rm×n, their Hadamard product (aka point-wise product) is a matrix of size m × n, denoted as A ⊙ B, whose (i, j)th entry is given by the product of the (i, j)th entries of A and B, i.e.,

[A ⊙ B]ij = Aij Bij.

For example, with A, B as in (4.2),

A ⊙ BT =
[ 1 2 3 ]   [ 1 3 5 ]   [ 1 6 15 ]
[ 4 5 6 ] ⊙ [ 2 4 6 ] = [ 8 20 36 ].
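In numpy, `*` on equal-sized arrays is exactly the Hadamard product; a quick check of the example above:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[1, 2],
              [3, 4],
              [5, 6]])

# On equal-sized arrays, * is the entrywise (Hadamard) product.
H = A * B.T
print(H)
# [[ 1  6 15]
#  [ 8 20 36]]
```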

• Vector Operations. Notice that vectors are 1-column matrices, so all matrix operators that apply to non-square matrices also apply to vectors. For instance, with the same setup as in Example 4.10,

x1Tx2 = [1 0 0] [0 1 0]T = 0.

Notice that x1Tx2 ≠ x1x2T:

x1x2T =
[ 1 ]             [ 0 1 0 ]
[ 0 ] [0 1 0]  =  [ 0 0 0 ]
[ 0 ]             [ 0 0 0 ].

• Norms. Given a matrix A ∈ Rm×n, its Frobenius norm is defined as:

‖A‖F := √( ∑_{i=1}^{m} ∑_{j=1}^{n} A²ij ).

In the particular case of vectors, this is often called the ℓ2-norm or Euclidean norm, and is denoted by ‖x‖2, or simply ‖x‖. Norms are important because they essentially quantify the size of a matrix or vector. Just as there are several ways to quantify the size of a person (e.g., age, height, weight), there



are also several ways to quantify the size of a matrix or vector, for which we can use different norms. Another example is the ℓ1-norm (aka taxi-cab or Manhattan norm), defined as:

‖A‖1 := ∑_{i=1}^{m} ∑_{j=1}^{n} |Aij|.

Intuitively, the ℓ2-norm measures the point-to-point size, while the ℓ1-norm measures the taxi-cab size. For example, for the same vector x = [4 3]T, here are two different notions to quantify its size:

‖x‖2 = √(4² + 3²) = 5,    ‖x‖1 = |4| + |3| = 7.
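These norms are all available through `np.linalg.norm`; a quick check of the numbers above, plus ‖A‖F for the matrix A in (4.2):

```python
import numpy as np

x = np.array([4.0, 3.0])
print(np.linalg.norm(x, 2))  # 5.0 (Euclidean / l2 norm)
print(np.linalg.norm(x, 1))  # 7.0 (taxi-cab / l1 norm)

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
# Frobenius norm: square root of the sum of squared entries.
print(np.isclose(np.linalg.norm(A, "fro"), np.sqrt(91)))  # True
```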

The norm ‖x‖ of a vector x is essentially its size. Norms are also useful because they allow us to measure the distance between vectors (through their difference). For example, consider the following images:

and vectorize them as in Example 4.5 to produce vectors x, y, z. We want to do face clustering, i.e., we want to know which images correspond to the same person. If ‖x − y‖ is small (i.e., x is similar to y), it is reasonable to conclude that the first two images correspond to the same person. If ‖x − z‖ is large (i.e., x is very different from z), it is reasonable to conclude that the first and third images correspond to different persons.

Norms satisfy the so-called triangle inequality: ‖x + y‖ ≤ ‖x‖ + ‖y‖. This allows you to draw useful conclusions. For example, knowing that ‖x − y‖ is small and that ‖x − z‖ is large allows us to conclude that ‖y − z‖ is also large. Intuitively, this allows us to conclude that if x, y correspond to the same person, and x and z correspond to different persons, then y and z also correspond to different persons. In other words, nothing weird will happen.

• Inner Product. Given two matrices A, B ∈ Rm×n, their inner product is defined as:

〈A,B〉 := tr(ATB).

For example, with A as in (4.2),

〈A,A〉 = tr(ATA) = tr(
[ 1 4 ]  [ 1 2 3 ]
[ 2 5 ]  [ 4 5 6 ]
[ 3 6 ]
) = tr(
[ 17 22 27 ]
[ 22 29 36 ]
[ 27 36 45 ]
) = 91.



Notice that for vectors x, y ∈ RD, xTy will always be a scalar, so we can drop the trace and simply write 〈x,y〉 = xTy. For example, with the same setup as in Example 4.10:

〈x1,x2〉 = x1Tx2 = [1 0 0] [0 1 0]T = 0,

〈y1,y2〉 = y1Ty2 = [5 5 2] [1 1 0]T = 10.

Inner products are of particular importance because they measure the similarity between matrices and vectors. In particular, recall from your kindergarten classes that the angle θ between two vectors is given by:

cos θ = 〈x,y〉 / (‖x‖‖y‖).

More generally, the larger the inner product between two objects (in absolute value), the more similar they are.
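Both computations can be verified numerically; a minimal numpy sketch using A from (4.2) and y1, y2 from Example 4.10:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
# Matrix inner product: <A, A> = tr(A^T A).
print(np.trace(A.T @ A))  # 91

# For vectors, the inner product is x^T y, and it recovers the angle:
y1 = np.array([5.0, 5.0, 2.0])
y2 = np.array([1.0, 1.0, 0.0])
cos_theta = (y1 @ y2) / (np.linalg.norm(y1) * np.linalg.norm(y2))
print(np.isclose(cos_theta, 10 / np.sqrt(108)))  # True
```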

4.3.2 Why do I care about Matrices?

Matrix operators are useful because they allow us to write otherwise complex and burdensome operations in simple and concise matrix form.

Example 4.12. With the same settings as in Examples 4.10 and 4.11, we can write linear combinations using a simple matrix multiplication. For instance, instead of writing

y1 = [5 5 2]T = 5 [1 0 0]T + 5 [0 1 0]T + 2 [0 0 1]T = 5x1 + 5x2 + 2x3,

we can let c = [5 5 2]T be y1's coefficient vector, and equivalently write in simpler form:

y1 =
[ 5 ]   [ 1 0 0 ] [ 5 ]
[ 5 ] = [ 0 1 0 ] [ 5 ] = Xc.
[ 2 ]   [ 0 0 1 ] [ 2 ]

Similar matrix expressions will be ubiquitous throughout this course, so you should start getting familiar and feeling comfortable with matrix operations.
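The same identity takes one line in numpy, confirming that Xc reproduces the linear combination:

```python
import numpy as np

X = np.eye(3)              # columns are x1, x2, x3 from Example 4.10
c = np.array([5.0, 5.0, 2.0])

# The linear combination 5*x1 + 5*x2 + 2*x3 collapses to one product Xc.
y1 = X @ c
print(y1)  # [5. 5. 2.]
```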

4.4 Subspaces

Subspaces are essentially high-dimensional generalizations of lines: a 1-dimensional subspace is a line, a 2-dimensional subspace is a plane, and so on. Subspaces are useful because data often lies near subspaces.

Example 4.13. The health records data in Example 4.2 lies near a 1-dimensional subspace (line):



In higher dimensions subspaces may be harder to visualize, so you will have to use your imagination to decide how a higher-dimensional subspace looks. Luckily, we have a precise and formal mathematical way to define them:

Definition 4.3 (Subspace). A subset U ⊆ RD is a subspace if for every a, b ∈ R and every u, v ∈ U, au + bv ∈ U.

Definition 4.4 (Span). span[u1, . . . ,uR] is the set of all linear combinations of {u1, . . . ,uR}. More formally,

span[u1, . . . , uR] := { x ∈ RD : x = ∑_{r=1}^{R} cr ur for some c1, . . . , cR ∈ R }.

For any vectors u1, . . . , uR ∈ RD, span[u1, . . . , uR] is a subspace. For example, here is a subspace U (plane) spanned by two vectors, u1 and u2:

Definition 4.5 (Basis). A set of linearly independent vectors {u1, . . . , uR} is a basis of a subspace U if each v ∈ U can be written as

v = ∑_{r=1}^{R} cr ur

for a unique set of coefficients {c1, . . . , cR}.



Definition 4.6 (Orthogonal). A collection of vectors {x1, . . . , xR} is orthogonal if 〈xi, xj〉 = 0 for every i ≠ j.

If ‖x‖ = 1, we say x is a unit vector, or that it is normalized. Similarly, a collection of normalized, orthogonal vectors is called orthonormal. There is a tight relation between inner products and norms. The following is one of the most important and useful inequalities describing this relationship.

Proposition 4.1 (Cauchy-Schwarz inequality). For every x, y ∈ RD,

|〈x,y〉| ≤ ‖x‖‖y‖.

Furthermore, if y ≠ 0, then equality holds if and only if x = ay for some a ∈ R.
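A quick numerical illustration of both the inequality and its equality case (the random vectors below are arbitrary; any x, y ∈ R5 would do):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
y = rng.standard_normal(5)

# |<x, y>| <= ||x|| ||y|| for any pair of vectors...
print(abs(x @ y) <= np.linalg.norm(x) * np.linalg.norm(y))  # True

# ...with equality when x is a scalar multiple of y (here x = 3y):
lhs = abs((3 * y) @ y)
rhs = np.linalg.norm(3 * y) * np.linalg.norm(y)
print(np.isclose(lhs, rhs))  # True
```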

4.5 Projections

In words, the projection x̂ of a vector x onto a subspace U is the vector in U that is closest to x:

More formally,

Definition 4.7 (Projection). The projection of x ∈ RD onto a subspace U is the vector x̂ ∈ U satisfying

‖x − x̂‖ ≤ ‖x − u‖ for every u ∈ U.

Notice that if x ∈ U, then x̂ = x. The following proposition tells us exactly how to compute projections.

Proposition 4.2. Let {u1, . . . , uR} be an orthonormal basis of U. The projection of x ∈ RD onto U is given by

x̂ = ∑_{r=1}^{R} 〈x, ur〉 ur.

In other words, the coefficient of x̂ w.r.t. ur is given by 〈x, ur〉.



Furthermore, the following proposition tells us that we can compute projections very efficiently: just using a simple matrix multiplication! This makes projections very attractive in practice. For example, as we saw before, data often lies near subspaces. We can measure how close using the norm of the residual x − x̂.

Proposition 4.3 (Projection operator). Let U ∈ RD×R be a matrix whose columns form a basis of U. The projection operator PU : RD → U that maps any vector x ∈ RD to its projection x̂ ∈ U is given by:

PU = U(UTU)−1UT.

Notice that if U is orthonormal, then PU = UUT.

Proof. Since x̂ ∈ U, we can write x̂ as Uc for some c ∈ RR. We thus want to find the c that minimizes:

‖x − Uc‖₂² = (x − Uc)T(x − Uc) = xTx − 2cTUTx + cTUTUc.

Since this is convex in c, we can use elementary optimization to find the desired minimizer, i.e., we will take the derivative w.r.t. c, set it to zero, and solve for c. (To learn more about how to take derivatives w.r.t. vectors and matrices, see Old and New Matrix Algebra Useful for Statistics by Thomas P. Minka.) The derivative w.r.t. c is given by:

−2UTx + 2UTUc.

Setting to zero and solving for c we obtain:

c := arg min_{c ∈ RR} ‖x − Uc‖₂² = (UTU)−1UTx,

where we know UTU is invertible because U is a basis by assumption, so its columns are linearly independent. It follows that

x̂ = Uc = U(UTU)−1UTx = PUx,

as claimed. If U is orthonormal, then UTU = I, and hence PU simplifies to UUT. Notice that c contains the coefficients of x̂ w.r.t. the basis U.
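Proposition 4.3 translates directly into code; a minimal numpy sketch with an arbitrary (non-orthonormal) basis of a 2-dimensional subspace of R5:

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.standard_normal((5, 2))   # columns: a (non-orthonormal) basis of U
x = rng.standard_normal(5)

# Projection operator P_U = U (U^T U)^{-1} U^T from Proposition 4.3.
P = U @ np.linalg.inv(U.T @ U) @ U.T
x_hat = P @ x

# The residual x - x_hat is orthogonal to every basis vector...
print(np.allclose(U.T @ (x - x_hat), 0))  # True
# ...and projecting twice changes nothing (P is idempotent).
print(np.allclose(P @ x_hat, x_hat))      # True
```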

4.6 Gram-Schmidt Orthogonalization

Orthonormal bases have very nice and useful properties. For example, in Proposition 4.3, if the basis U is orthonormal, then the projection operator simplifies to UUT, which requires much less computation than U(UTU)−1UT. The following procedure tells us exactly how to transform an arbitrary basis into an orthonormal basis.



Proposition 4.4 (Gram-Schmidt procedure). Let {u1, . . . , uR} be a basis of U. Let

v′r = u1 if r = 1,
v′r = ur − ∑_{k=1}^{r−1} 〈ur, vk〉 vk if r = 2, . . . , R,

vr = v′r / ‖v′r‖.

Then {v1, . . . , vR} is an orthonormal basis of U.

Proof. We know from Proposition 4.2 that ∑_{k=1}^{r−1} 〈ur, vk〉 vk is the projection of ur onto span[v1, . . . , vr−1] (whose basis {v1, . . . , vr−1} is orthonormal by induction). This implies v′r is the orthogonal residual of ur w.r.t. span[v1, . . . , vr−1], and hence it is orthogonal to {v1, . . . , vr−1}, as desired. Finally, vr is simply the normalized version of v′r.
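The procedure can be implemented directly from Proposition 4.4; a minimal numpy sketch (the 3 × 3 basis below is an arbitrary invertible matrix, so its columns form a basis of R3):

```python
import numpy as np

def gram_schmidt(U):
    """Columns of U form an arbitrary basis; returns an orthonormal basis."""
    V = []
    for r in range(U.shape[1]):
        # Subtract the projection of u_r onto the span of v_1, ..., v_{r-1}...
        v = U[:, r] - sum((U[:, r] @ vk) * vk for vk in V)
        # ...and normalize the orthogonal residual.
        V.append(v / np.linalg.norm(v))
    return np.column_stack(V)

U = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 4.0],
              [1.0, 3.0, 9.0]])   # invertible, so its columns are a basis
V = gram_schmidt(U)
print(np.allclose(V.T @ V, np.eye(3)))  # True: V is orthonormal
```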

4.7 Conclusions

These notes review several fundamental concepts of linear algebra that lie at the heart of data science, and will be crucial in every topic to be studied in this course.