
Machine Learning – Brett Bernstein

Lecture 13: Principal Component Analysis

Intro Question

Let S ∈ R^{n×n} be symmetric.

1. How does trace S relate to the spectral decomposition S = WΛW^T, where W is orthogonal and Λ is diagonal?

Solution. We use the following useful property of traces: trace AB = trace BA for any matrices A, B where the dimensions allow. Thus we have

    trace S = trace W(ΛW^T) = trace (ΛW^T)W = trace Λ,

so the trace of S is the sum of its eigenvalues.

2. How do you solve w* = arg max_{‖w‖2=1} w^T S w? What is w*^T S w*?

Solution. Suppose S were diagonal:

    S = diag(λ1, λ2, . . . , λn),

with λ1 ≥ · · · ≥ λn. Then w* = e1, the first standard basis vector, since for any unit vector v,

    v^T S v = ∑_{i=1}^n λi vi^2 ≤ λ1 ∑_{i=1}^n vi^2 = λ1 = e1^T S e1.

In general, we have S = WΛW^T, so we want W^T w = e1 (using the fact that ‖W^T v‖2 = ‖v‖2). But then the answer is the first column of W, i.e., the eigenvector corresponding to the largest eigenvalue λ1.
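As a quick numerical sanity check (not part of the notes; the matrix size, random seed, and number of trials are arbitrary), the following Python sketch verifies both answers: trace S equals the sum of the eigenvalues, and no random unit vector beats the eigenvector for the largest eigenvalue.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5
    A = rng.standard_normal((n, n))
    S = (A + A.T) / 2                      # a random symmetric matrix

    lam, W = np.linalg.eigh(S)             # eigenvalues in ascending order
    print(np.trace(S), lam.sum())          # trace S = sum of eigenvalues (question 1)

    w_star = W[:, -1]                      # eigenvector for the largest eigenvalue
    best = w_star @ S @ w_star             # equals lam[-1]
    for _ in range(10_000):
        w = rng.standard_normal(n)
        w /= np.linalg.norm(w)             # random unit vector
        assert w @ S @ w <= best + 1e-9    # never exceeds w*^T S w* (question 2)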

Principal Component Analysis (PCA)

This will be our first topic of unsupervised learning. Simply put, in unsupervised learning we have no y-values (i.e., no labels). As such, our goal is to find and exploit intrinsic structure in the training data. With PCA, we are trying to find a low dimensional affine subspace that explains most of the variance in our dataset.


Definition of Principal Components

When studying PCA, we will always work with centered data. As such, we define the centered data matrix

    X̃ := X − X̄,

with rows x̃i^T = (xi − x̄)^T, where x̄ is the sample mean of the data points and X̄ is the matrix whose every row is x̄^T. Next we need the concept of variance along a direction.

Definition 1 (Variance Along a Direction). Let x̃1, . . . , x̃n be the centered data with x̃i ∈ R^d. Fix a direction w ∈ R^d with ‖w‖2 = 1. The sample variance along the direction w is given by

    (1/(n − 1)) ∑_{i=1}^n (x̃i^T w)^2.

This is the sample variance of the components

    x̃1^T w, . . . , x̃n^T w.
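A minimal sketch of this definition (synthetic data and an arbitrary direction, purely for illustration): center the data, project every point onto a unit vector w, and compare the formula above with the usual sample variance of those projections.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.standard_normal((100, 3))          # n = 100 raw data points in R^3
    X_tilde = X - X.mean(axis=0)               # centered data, rows x̃_i^T

    w = np.array([1.0, 2.0, -1.0])
    w /= np.linalg.norm(w)                     # unit direction

    proj = X_tilde @ w                         # the components x̃_i^T w
    n = X.shape[0]
    var_along_w = (proj ** 2).sum() / (n - 1)  # the definition above
    print(var_along_w, proj.var(ddof=1))       # agrees with the usual sample variance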

[Figure: centered data points x̃1, . . . , x̃7 in the (x1, x2)-plane, together with a unit direction w.]

[Figure: the same centered data points with their projections onto the line spanned by w; the projections give the values w^T x̃i.]

As an aside, note that the variance along w is also the sample variance of

    x1^T w, . . . , xn^T w,

where we haven't centered the data.

Define the first loading vector w(1) to be the direction along which the sample variance is maximized:

    w(1) = arg max_{‖w‖2=1} (1/(n − 1)) ∑_{i=1}^n (x̃i^T w)^2.

The maximizer will not be unique, so we arbitrarily choose one of the maximizers. We then define x̃i^T w(1) to be the first principal component of the centered data point x̃i. That is, it is the component of x̃i in the direction w(1).

The kth loading vector w(k) maximizes the variance along it while being orthogonal to the first k − 1 loading vectors:

    w(k) = arg max_{‖w‖2=1, w ⊥ w(1),...,w(k−1)} (1/(n − 1)) ∑_{i=1}^n (x̃i^T w)^2.

Taken together, w(1), . . . , w(d) form an orthonormal basis of R^d. Analogously, we define x̃i^T w(k) to be the kth principal component of the centered data point x̃i. If W is the matrix whose kth column is w(k), then W^T x̃i expresses x̃i in terms of its principal components. We can write X̃W to express the entire data matrix in terms of principal components.
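Not in the original notes: a small numerical sketch of this change of basis on synthetic data. The loading vectors are taken to be the eigenvectors of the sample covariance matrix (this is derived in the next section); the point here is that the columns of X̃W, i.e., the principal components, are uncorrelated, so their sample covariance matrix comes out diagonal.

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.standard_normal((200, 4)) @ rng.standard_normal((4, 4))  # synthetic data
    X_tilde = X - X.mean(axis=0)                     # centered data matrix X̃

    S = X_tilde.T @ X_tilde / (X.shape[0] - 1)       # sample covariance matrix
    lam, W = np.linalg.eigh(S)
    lam, W = lam[::-1], W[:, ::-1]                   # sort eigenvalues in decreasing order

    PCs = X_tilde @ W                                # row i holds the principal components of x̃_i
    print(np.round(np.cov(PCs, rowvar=False), 6))    # diagonal, with lam on the diagonal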


Computing Principal Components

Recall that w(1) is defined by

    w(1) = arg max_{‖w‖2=1} (1/(n − 1)) ∑_{i=1}^n (x̃i^T w)^2.

We now perform some algebra to simplify this expression. Note that

    ∑_{i=1}^n (x̃i^T w)^2 = ∑_{i=1}^n (x̃i^T w)(x̃i^T w)
                          = ∑_{i=1}^n (w^T x̃i)(x̃i^T w)
                          = w^T [ ∑_{i=1}^n x̃i x̃i^T ] w
                          = w^T X̃^T X̃ w.

This shows

    w(1) = arg max_{‖w‖2=1} (1/(n − 1)) w^T X̃^T X̃ w = arg max_{‖w‖2=1} w^T S w,

where S = (1/(n − 1)) X̃^T X̃ is the sample covariance matrix. From the introductory questions, we know that the maximizer is the eigenvector corresponding to the largest eigenvalue of S, and the maximum value attained is the sample variance along w(1).

In fact, we can take this further. Suppose we compute the spectral decomposition of S. That is,

    S = WΛW^T,

where W is orthogonal and

    Λ = diag(λ1, λ2, . . . , λd),

with λ1 ≥ λ2 ≥ · · · ≥ λd ≥ 0 (all are non-negative since S is PSD). Then the ith column of W is w(i) and λi is the variance along w(i).
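A short sketch of this recipe on synthetic data (not from the notes): form S, take its eigendecomposition with np.linalg.eigh, reorder the eigenpairs so the eigenvalues are decreasing, and check that λ1 equals the sample variance along w(1).

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.standard_normal((500, 6)) @ rng.standard_normal((6, 6))  # synthetic data
    n = X.shape[0]
    X_tilde = X - X.mean(axis=0)

    S = X_tilde.T @ X_tilde / (n - 1)            # sample covariance matrix
    lam, W = np.linalg.eigh(S)                   # eigh returns ascending eigenvalues
    lam, W = lam[::-1], W[:, ::-1]               # reorder so lam[0] >= lam[1] >= ...

    w1 = W[:, 0]                                 # first loading vector w(1)
    var_along_w1 = ((X_tilde @ w1) ** 2).sum() / (n - 1)
    print(lam[0], var_along_w1)                  # equal up to floating point error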

As an aside, we sketch the proof idea.

Proof sketch. By the spectral theorem we have

    S = ∑_{i=1}^d λi W:,i W:,i^T,


where W:,i is the ith column of W. Note that

    w(k) = arg max_{‖w‖2=1, w ⊥ w(1),...,w(k−1)} w^T S w
         = arg max_{‖w‖2=1, w ⊥ w(1),...,w(k−1)} w^T [ ∑_{i=1}^d λi W:,i W:,i^T ] w
         = arg max_{‖w‖2=1, w ⊥ w(1),...,w(k−1)} w^T [ ∑_{i=k}^d λi W:,i W:,i^T ] w.

Noting that the maximizer w must be in the span of W:,k, . . . , W:,d, we see the maximizer is w(k) = W:,k: writing w = ∑_{i=k}^d αi W:,i with ∑_{i=k}^d αi^2 = 1, we have

    ( ∑_{i=k}^d αi W:,i )^T [ ∑_{i=k}^d λi W:,i W:,i^T ] ( ∑_{i=k}^d αi W:,i )
        = ∑_{i=k}^d αi^2 λi  ≤  λk ∑_{i=k}^d αi^2  =  λk  =  W:,k^T S W:,k.

Let’s illustrate the ideas thus far using an example.

Example 2. A collection of people come to a testing site to have their heights measured twice. The two testers use different measuring devices, each of which introduces errors into the measurement process. Below we depict some of the measurements computed (already centered).

[Figure: scatter plot of the centered height measurements, with Tester 1 on the horizontal axis and Tester 2 on the vertical axis.]

1. Describe (vaguely) what you expect the sample covariance matrix to look like.

2. What do you think w(1) and w(2) are?

We can now plot the data in terms of the principal components (i.e., we plot X̃W ).

[Figure: the same measurements shown in the original (Tester 1, Tester 2) coordinates and replotted in principal component coordinates, with the w(1)-axis horizontal and the w(2)-axis vertical; almost all of the spread lies along w(1).]
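The data in the figure is not reproduced here, but the following sketch simulates a comparable setup (every number is made up for illustration): each person's true height is measured twice with independent device errors. Both coordinates share the underlying height, so the off-diagonal entry of the sample covariance is large, w(1) points roughly along the direction in which both measurements increase together, and w(2) captures the disagreement between the testers.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 200
    height = rng.normal(0.0, 10.0, size=n)             # centered true heights (assumed spread 10)
    tester1 = height + rng.normal(0.0, 1.0, size=n)    # assumed small device error
    tester2 = height + rng.normal(0.0, 3.0, size=n)    # assumed larger device error

    X = np.column_stack([tester1, tester2])
    X_tilde = X - X.mean(axis=0)

    S = X_tilde.T @ X_tilde / (n - 1)
    lam, W = np.linalg.eigh(S)
    lam, W = lam[::-1], W[:, ::-1]

    print(np.round(S, 1))        # large off-diagonal entry: the measurements are highly correlated
    print(np.round(W[:, 0], 2))  # w(1): roughly ±(0.7, 0.7), the shared "height" direction
    print(np.round(W[:, 1], 2))  # w(2): roughly ±(0.7, -0.7), the disagreement between testers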

Uses of Principal Component Analysis

1. Dimensionality reduction: In our height example above, we can replace our two features with only a single feature, the first principal component. This can be used as a preprocessing step in a supervised learning algorithm. More about this in a moment.

2. Visualization: If we have high dimensional data, it can be hard to plot it effectively. Sometimes plotting the first two principal components can reveal interesting geometric structure in the data.

3. Principal Component Regression: Building on dimensionality reduction, suppose we begin with a dataset D = {(x1, y1), . . . , (xn, yn)} and want to build a linear model. We can choose some k and replace each x̃i with its first k principal components. Afterward we perform linear regression. This is called principal component regression, and can be thought of as a discrete variant of ridge regression (see HTF 3.4.1). A small sketch follows this list.
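As promised above, here is a minimal sketch of the principal component regression pipeline on synthetic data. The dataset, the choice k = 2, handling the intercept by centering y, and the plain least-squares solve are all illustrative assumptions; in practice k would be chosen by validation.

    import numpy as np

    rng = np.random.default_rng(5)
    n, d, k = 300, 10, 2
    X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))    # synthetic inputs
    y = X @ rng.standard_normal(d) + rng.normal(0.0, 0.5, size=n)    # synthetic responses

    x_bar = X.mean(axis=0)
    X_tilde = X - x_bar                              # center the inputs
    S = X_tilde.T @ X_tilde / (n - 1)
    lam, W = np.linalg.eigh(S)
    W_k = W[:, ::-1][:, :k]                          # first k loading vectors

    Z = X_tilde @ W_k                                # first k principal components of each point
    theta, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None)   # linear regression on the PCs

    def predict(x_new):
        # project a new point onto the k loading vectors, then apply the fitted weights
        return (x_new - x_bar) @ W_k @ theta + y.mean()

    print(predict(X[0]), y[0])                       # fitted value vs. observed value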

When performing dimensionality reduction, one must choose how many principal components to use. This is often done using a scree plot: a plot of the eigenvalues of S in descending order.

[Figure: scree plot taken from Jolliffe’s Principal Component Analysis.] Often people look for an “elbow” in the scree plot: a point where the plot becomes much less steep.
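One possible way to produce such a plot, on synthetic data (matplotlib is assumed to be available; where the “elbow” lies is left to visual judgment):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(6)
    X = rng.standard_normal((300, 8)) @ rng.standard_normal((8, 8))  # synthetic data
    X_tilde = X - X.mean(axis=0)
    S = X_tilde.T @ X_tilde / (X.shape[0] - 1)       # sample covariance matrix

    lam = np.linalg.eigvalsh(S)[::-1]                # eigenvalues of S, in descending order
    plt.plot(np.arange(1, lam.size + 1), lam, "o-")
    plt.xlabel("component number")
    plt.ylabel("eigenvalue of S")
    plt.title("Scree plot")
    plt.show()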


Other Comments About PCA

1. Often people standardize their data before running PCA to add scale-invariance. Stated differently, if a feature is scaled (maybe by choice of measurement unit) by a large factor, we arbitrarily increase its variance, and thus can incorrectly force it to be a large part of the first principal component. A small sketch of this effect appears after this list.

2. Define the dispersion of the data by

    Δ = ∑_{i=1}^n ‖xi − x̄‖2^2.

Projecting the centered data onto the k-dimensional subspace spanned by w(1), . . . , w(k) maximizes the resulting dispersion over all possible k-dimensional subspaces.

3. The k-dimensional subspace V spanned by w(1), . . . , w(k) best fits the centered data in the least-squares sense: it minimizes

    ∑_{i=1}^n ‖x̃i − P_V(x̃i)‖2^2

over all k-dimensional subspaces, where P_V orthogonally projects onto V.

4. Converting your data into principal components can hurt interpretability since the new features are linear combinations (i.e., blends or baskets) of your old features.

5. The last loading vectors, if they correspond to small eigenvalues, are nearly in the null space of X̃, and thus reveal (approximate) linear dependencies in the centered data.
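The sketch referenced in item 1, on synthetic correlated data (the factor of 1000 stands in for an arbitrary change of measurement unit): rescaling a single feature lets it dominate the first loading vector unless each column is first divided by its sample standard deviation.

    import numpy as np

    rng = np.random.default_rng(7)
    Z = rng.standard_normal((200, 3))
    X = Z @ np.array([[1.0, 0.5, 0.0],
                      [0.5, 1.0, 0.5],
                      [0.0, 0.5, 1.0]])              # correlated features
    X[:, 0] *= 1000.0                                # same feature, recorded in a much smaller unit

    def first_loading_vector(A):
        A_tilde = A - A.mean(axis=0)
        S = A_tilde.T @ A_tilde / (A.shape[0] - 1)
        lam, W = np.linalg.eigh(S)
        return W[:, -1]                              # eigenvector of the largest eigenvalue

    print(np.round(first_loading_vector(X), 3))      # roughly ±(1, 0, 0): the unit choice dominates
    X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize each column
    print(np.round(first_loading_vector(X_std), 3))  # reflects the correlations, not the units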

Example 3. Suppose you have the following data:

[Figure: data points forming two concentric rings in the (x1, x2)-plane.]

1. How can we get the first principal component to properly distinguish the rings above?

Solution. Add features or use kernels. Below we added the feature ‖x̃i‖2 and took the first principal component.

[Figure: the data plotted along the resulting first principal component axis, labeled w(1); the two rings appear as two separated groups.]
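A rough illustration of this trick on synthetic rings. The radii, noise level, and sample sizes are made up, and this sketch appends the squared norm ‖x̃i‖2^2 rather than ‖x̃i‖2 itself so that, for these particular radii, the added feature carries the largest sample variance; the idea is the same.

    import numpy as np

    rng = np.random.default_rng(8)

    def ring(radius, n):
        # noisy points on a circle of the given radius
        theta = rng.uniform(0.0, 2 * np.pi, size=n)
        r = radius + rng.normal(0.0, 0.1, size=n)
        return np.column_stack([r * np.cos(theta), r * np.sin(theta)])

    X = np.vstack([ring(1.0, 100), ring(3.0, 100)])      # inner ring, then outer ring
    X_tilde = X - X.mean(axis=0)

    extra = np.sum(X_tilde ** 2, axis=1)                 # added feature: squared norm of x̃_i
    F = np.column_stack([X_tilde, extra])                # augmented data
    F_tilde = F - F.mean(axis=0)

    S = F_tilde.T @ F_tilde / (F.shape[0] - 1)
    lam, W = np.linalg.eigh(S)
    pc1 = F_tilde @ W[:, -1]                             # first principal component of each point

    print(pc1[:100].mean(), pc1[100:].mean())            # two well-separated groups of values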
