Lecture 13 - Principal Component Analysis

  • Lecture 13 Principal Component Analysis

    Brett Bernstein

    CDS at NYU

    April 25, 2017


  • Lecture 13 Initial Question

    Intro Question

    Question

    Let S ∈ R^{n×n} be symmetric.

    1 How does trace(S) relate to the spectral decomposition S = WΛWᵀ, where W is orthogonal and Λ is diagonal?

    2 How do you solve w∗ = arg max_{‖w‖2 = 1} wᵀSw? What is w∗ᵀSw∗?


  • Lecture 13 Initial Question

    Intro Solution

    Solution

    1 We use the following useful property of traces: trace(AB) = trace(BA) for any matrices A, B whose dimensions allow the product. Thus, since WᵀW = I,

        trace(S) = trace(W(ΛWᵀ)) = trace((ΛWᵀ)W) = trace(ΛWᵀW) = trace(Λ),

    so the trace of S is the sum of its eigenvalues.

    2 w∗ is an eigenvector corresponding to the largest eigenvalue of S. Then w∗ᵀSw∗ is the largest eigenvalue.

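As a numerical illustration of both facts (a sketch, not part of the slides), the snippet below builds an arbitrary symmetric matrix with numpy, compares its trace with the sum of its eigenvalues, and checks that the top eigenvector of `np.linalg.eigh` (which returns eigenvalues in ascending order) attains the largest Rayleigh quotient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build an arbitrary symmetric matrix S.
A = rng.standard_normal((5, 5))
S = (A + A.T) / 2

# eigh: eigenvalues in ascending order, orthonormal eigenvectors as columns.
eigvals, eigvecs = np.linalg.eigh(S)

# Fact 1: trace(S) equals the sum of the eigenvalues.
print(np.trace(S), eigvals.sum())

# Fact 2: the top eigenvector maximizes w^T S w over unit vectors,
# and the maximum value is the largest eigenvalue.
w_star = eigvecs[:, -1]
print(w_star @ S @ w_star, eigvals[-1])

# Random unit vectors never beat the largest eigenvalue (up to rounding).
W = rng.standard_normal((1000, 5))
W /= np.linalg.norm(W, axis=1, keepdims=True)
print(((W @ S) * W).sum(axis=1).max() <= eigvals[-1] + 1e-12)
```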

  • Lecture 13 Principal Component Analysis (PCA)

    Unsupervised Learning

    1 Where did the y's go?

    2 Try to find intrinsic structure in unlabeled data.

    3 With PCA, we are looking for a low dimensional affine subspace that approximates our data well.


  • Lecture 13 Definition of Principal Components

    Centered Data

    1 Throughout this lecture we will work with centered data.

    2 Suppose X ∈ R^{n×d} is our data matrix. Define

        x̄ = (1/n) ∑_{i=1}^n xi.

    3 Let X̄ ∈ R^{n×d} be the matrix with x̄ in every row.

    4 Define the centered data:

        X̃ = X − X̄,   x̃i = xi − x̄.

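A minimal numpy sketch of the centering step, assuming the rows of X are the observations xi (the toy data and its offset are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3)) + np.array([5.0, -2.0, 0.5])  # toy data, one row per observation

x_bar = X.mean(axis=0)   # sample mean of the rows
X_tilde = X - x_bar      # broadcasting subtracts x_bar from every row

print(np.allclose(X_tilde.mean(axis=0), 0.0))  # centered data has (numerically) zero mean
```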

  • Lecture 13 Definition of Principal Components

    Variance Along A Direction

    Definition

    Let x̃1, . . . , x̃n ∈ R^d be the centered data. Fix a direction w ∈ R^d with ‖w‖2 = 1. The sample variance along w is given by

        (1/(n − 1)) ∑_{i=1}^n (x̃iᵀ w)².

    This is the sample variance of the components

        x̃1ᵀw, . . . , x̃nᵀw.

    1 This is also the sample variance of

        x1ᵀw, . . . , xnᵀw,

    using the uncentered data.

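A small sketch of the variance along a direction; the toy data matrix and the direction w below are arbitrary choices. It also checks the last point: the centered and uncentered projections have the same sample variance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2)) @ np.array([[3.0, 0.0], [1.5, 0.5]]) + 10.0  # toy data
X_tilde = X - X.mean(axis=0)
n = len(X)

w = np.array([1.0, 2.0])
w /= np.linalg.norm(w)   # unit direction

var_centered = np.sum((X_tilde @ w) ** 2) / (n - 1)   # (1/(n-1)) sum_i (x̃_i^T w)^2
var_uncentered = np.var(X @ w, ddof=1)                # sample variance of the uncentered projections

print(var_centered, var_uncentered)  # the two values agree
```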

  • Lecture 13 Definition of Principal Components

    Variance Along A Direction

    [Figure: seven centered data points x̃1, . . . , x̃7 in the (x1, x2) plane, together with a unit direction w.]


  • Lecture 13 Definition of Principal Components

    Variance Along A Direction

    [Figure: the same points x̃1, . . . , x̃7 with their projections onto w; the wᵀx̃i values are marked along the direction w.]


  • Lecture 13 Definition of Principal Components

    First Principal Component

    1 Define the first loading vector w(1) to be the direction giving the highest variance:

        w(1) = arg max_{‖w‖2 = 1} (1/(n − 1)) ∑_{i=1}^n (x̃iᵀ w)².

    2 Maximizer is not unique, so we choose one.

    Definition

    The first principal component of x̃i is x̃iᵀ w(1).

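To make the definition concrete, here is an illustrative sketch (toy 2D data, assumed) that approximates w(1) by brute force, scanning unit directions w = (cos θ, sin θ). This is only for intuition about the maximization; it is not how the loading vectors are computed in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])  # toy 2D data
X_tilde = X - X.mean(axis=0)
n = len(X_tilde)

# Scan unit directions w = (cos θ, sin θ); θ in [0, π) since w and -w give the same variance.
thetas = np.linspace(0.0, np.pi, 10000, endpoint=False)
W = np.column_stack([np.cos(thetas), np.sin(thetas)])       # one unit vector per row
variances = ((X_tilde @ W.T) ** 2).sum(axis=0) / (n - 1)    # variance along each direction

w1 = W[np.argmax(variances)]
print("approximate first loading vector:", w1)
print("variance along it:", variances.max())
```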

  • Lecture 13 Definition of Principal Components

    Principal Components

    1 Define the kth loading vector w(k) to be the direction giving the highest variance that is orthogonal to the first k − 1 loading vectors:

        w(k) = arg max_{‖w‖2 = 1, w ⊥ w(1), . . . , w(k−1)} (1/(n − 1)) ∑_{i=1}^n (x̃iᵀ w)².

    2 The complete set of loading vectors w(1), . . . , w(d) form an orthonormal basis for R^d.

    Definition

    The kth principal component of x̃i is x̃iᵀ w(k).


  • Lecture 13 Definition of Principal Components

    Principal Components

    1 Let W denote the matrix with the kth loading vector w(k) as its kth column.

    2 Then Wᵀx̃i gives the principal components of x̃i as a column vector.

    3 X̃W gives a new data matrix in terms of principal components.

    4 If we compute the singular value decomposition (SVD) of X̃ we get

        X̃ = VDWᵀ,

    where D ∈ R^{n×d} is diagonal with non-negative entries, and V ∈ R^{n×n}, W ∈ R^{d×d} are orthogonal.

    5 Then X̃ᵀX̃ = W(DᵀD)Wᵀ. Thus we can use the SVD of our data matrix to obtain the loading vectors W and the eigenvalues Λ = (1/(n − 1)) DᵀD.

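A sketch of the SVD route just described, on toy data. It uses the thin SVD (which gives the same loading vectors), forms the principal components X̃W, and checks that the recovered eigenvalues reproduce the sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4)) @ rng.standard_normal((4, 4))  # toy data
X_tilde = X - X.mean(axis=0)
n = len(X_tilde)

# Thin SVD: X_tilde = V @ diag(d) @ W.T, columns of W are the loading vectors.
V, d, Wt = np.linalg.svd(X_tilde, full_matrices=False)
W = Wt.T

principal_components = X_tilde @ W      # new data matrix, one row per observation
eigenvalues = d ** 2 / (n - 1)          # Λ = DᵀD / (n − 1), already in descending order

# Sanity check against the sample covariance matrix S = X̃ᵀX̃ / (n − 1).
S = X_tilde.T @ X_tilde / (n - 1)
print(np.allclose(W @ np.diag(eigenvalues) @ W.T, S))
```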

  • Lecture 13 Computing Principal Components

    Some Linear Algebra

    Recall that w(1) is defined by

        w(1) = arg max_{‖w‖2 = 1} (1/(n − 1)) ∑_{i=1}^n (x̃iᵀ w)².

    We now perform some algebra to simplify this expression. Note that

        ∑_{i=1}^n (x̃iᵀ w)²
            = ∑_{i=1}^n (x̃iᵀ w)(x̃iᵀ w)
            = ∑_{i=1}^n (wᵀ x̃i)(x̃iᵀ w)
            = wᵀ [ ∑_{i=1}^n x̃i x̃iᵀ ] w
            = wᵀ X̃ᵀX̃ w.

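A quick numerical check of this identity, with arbitrary toy data and an arbitrary unit vector w:

```python
import numpy as np

rng = np.random.default_rng(0)
X_tilde = rng.standard_normal((50, 3))
X_tilde -= X_tilde.mean(axis=0)   # center the toy data

w = rng.standard_normal(3)
w /= np.linalg.norm(w)            # arbitrary unit vector

lhs = np.sum((X_tilde @ w) ** 2)        # sum_i (x̃_i^T w)^2
rhs = w @ X_tilde.T @ X_tilde @ w       # w^T X̃^T X̃ w
print(np.allclose(lhs, rhs))
```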

  • Lecture 13 Computing Principal Components

    Some Linear Algebra

    1 This shows

        w(1) = arg max_{‖w‖2 = 1} (1/(n − 1)) wᵀ X̃ᵀX̃ w = arg max_{‖w‖2 = 1} wᵀSw,

    where S = (1/(n − 1)) X̃ᵀX̃ is the sample covariance matrix.

    2 By the introductory problem this implies w(1) is the eigenvector corresponding to the largest eigenvalue of S .

    3 We also learn that the variance along w(1) is λ1, the largest eigenvalue of S .

    4 With a bit more work we can see that w(k) is the eigenvector corresponding to the kth largest eigenvalue, with λk giving the associated variance.

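Putting this together, a sketch of the resulting recipe on toy data: form S, take its eigendecomposition, sort by eigenvalue, and check that the variance along w(1) equals λ1.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))  # toy data
X_tilde = X - X.mean(axis=0)
n, d = X_tilde.shape

S = X_tilde.T @ X_tilde / (n - 1)        # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)     # ascending order
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]                     # λ1 ≥ λ2 ≥ ... ≥ λd
W = eigvecs[:, order]                    # kth column is the kth loading vector w(k)

# The variance along w(1) equals λ1.
w1 = W[:, 0]
print(np.sum((X_tilde @ w1) ** 2) / (n - 1), lam[0])
```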

  • Lecture 13 Computing Principal Components

    PCA Example

    Example

    A collection of people come to a testing site to have their heights measured twice. The two testers use different measuring devices, each of which introduces errors into the measurement process. Below we depict some of the measurements computed (already centered).


  • Lecture 13 Computing Principal Components

    PCA Example

    [Figure: scatter plot of the centered measurements, Tester 1 on the horizontal axis and Tester 2 on the vertical axis, both spanning roughly −20 to 20.]

    1 Describe (vaguely) what you expect the sample covariance matrix to look like.

    2 What do you think w(1) and w(2) are?


  • Lecture 13 Computing Principal Components

    PCA Example: Solutions

    1 We expect tester 2 to have a larger variance than tester 1, and the two testers' measurements to be nearly perfectly correlated. The sample covariance matrix is

        S = ( 40.5154   93.5069
              93.5069  232.8653 ).

    2 We have S = WΛWᵀ with

        W = ( 0.3762  −0.9265
              0.9265   0.3762 ),

        Λ = ( 270.8290   0
              0          2.5518 ).

    Note that trace(Λ) = trace(S).

    Since λ2 is small, w(2) is almost in the null space of S. This suggests −0.9265 x̃1 + 0.3762 x̃2 ≈ 0 for the data points (x̃1, x̃2); in other words, x̃2 ≈ 2.46 x̃1. Maybe tester 2 used centimeters and tester 1 used inches.

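The slide's raw measurements are not given, so the sketch below simulates plausible data under the stated guess: tester 1 records inches, tester 2 centimeters, each with its own measurement noise. The height distribution, noise levels, and sample size are all invented, but the eigendecomposition shows the same qualitative structure as on the slide: one dominant eigenvalue, with w(1) roughly parallel to (1, 2.54).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Hypothetical data: true heights in inches; tester 1 records inches,
# tester 2 records centimeters, each with its own measurement noise.
height_in = rng.normal(67.0, 4.0, size=n)
t1 = height_in + rng.normal(0.0, 1.0, size=n)          # tester 1 (inches)
t2 = 2.54 * height_in + rng.normal(0.0, 2.0, size=n)   # tester 2 (centimeters)

X = np.column_stack([t1, t2])
X_tilde = X - X.mean(axis=0)

S = X_tilde.T @ X_tilde / (n - 1)
eigvals, W = np.linalg.eigh(S)
eigvals, W = eigvals[::-1], W[:, ::-1]   # sort into descending order

print("S =\n", S)
print("eigenvalues:", eigvals)           # one dominant eigenvalue, one small one
print("w(1):", W[:, 0])                  # roughly parallel to (1, 2.54), up to sign
print(np.trace(S), eigvals.sum())        # trace check
```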

  • Lecture 13 Computing Principal Components

    PCA Example: Plot In Terms of Principal Components

    [Figure: the same scatter plot in the original (Tester 1, Tester 2) coordinates and in the principal component coordinates (w(1), w(2)); the w(1) values span roughly −20 to 20, while the w(2) values span only about −1.25 to 6.25.]


  • Lecture 13 Computing Principal Components

    Uses of PCA: Dimensionality Reduction

    1 In our height example above, we can replace our two features with only a single feature, the first principal component.

    2 This can be used as a preprocessing step in a supervised learning algorithm.

    3 When performing dimensionality reduction, one must choose how many principal components to use. This is often done using a scree plot: a plot of the eigenvalues of S in descending order.

    4 Often people look for an “elbow” in the scree plot: a point where the plot becomes much less steep.

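A sketch of the reduction step together with a scree plot, using numpy and matplotlib on toy data; k is fixed by hand here, whereas in practice one would read it off the elbow of the scree plot.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 10))  # toy data
X_tilde = X - X.mean(axis=0)
n = len(X_tilde)

S = X_tilde.T @ X_tilde / (n - 1)
eigvals, W = np.linalg.eigh(S)
eigvals, W = eigvals[::-1], W[:, ::-1]   # descending order

# Scree plot: eigenvalues of S in descending order; look for an "elbow".
plt.plot(np.arange(1, len(eigvals) + 1), eigvals, "o-")
plt.xlabel("component number")
plt.ylabel("eigenvalue of S")
plt.title("Scree plot")
plt.show()

# Keep the first k principal components as the reduced feature matrix.
k = 3                            # chosen by hand for this toy example
X_reduced = X_tilde @ W[:, :k]   # shape (n, k)
```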

  • Lecture 13 Computing Principal Components

    Scree Plot

    [Figure: example scree plot, from Jolliffe, Principal Component Analysis.]

  • Lecture 13 Computing Principal Components

    Uses of PCA: Visualization

    1 Visualization: If we have high dimensional data, it can be hard to plot it effectively. Sometimes plotting the first two principal components can reveal interesting geometric structure in the data.

    (Figure source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2735096/)
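A sketch of this kind of plot, using placeholder high-dimensional data that contains two hidden groups purely for illustration; the first two principal components separate them.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Placeholder high-dimensional data with two hidden groups, purely for illustration.
X = np.vstack([
    rng.standard_normal((100, 20)) + 2.0,
    rng.standard_normal((100, 20)) - 2.0,
])
labels = np.array([0] * 100 + [1] * 100)

X_tilde = X - X.mean(axis=0)
_, _, Wt = np.linalg.svd(X_tilde, full_matrices=False)
pcs = X_tilde @ Wt.T[:, :2]   # first two principal components of each point

plt.scatter(pcs[:, 0], pcs[:, 1], c=labels, s=10)
plt.xlabel("first principal component")
plt.ylabel("second principal component")
plt.show()
```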

  • Lecture 13 Computing Principal Components

    Uses of PCA: Principal Component Regression

    1 Want to build a linear model with a dataset