Principal components - University of Massachusetts Amherst
Transcript of people.math.umass.edu/~anna/stat697F/Chapter10_part2.pdf
Principal components

- Previously, for an N × p data matrix X with centered columns, the first principal component Xν is defined to have the maximum variance; that is, the principal component loading vector ν solves

max_ν ν^T (X^T X) ν, subject to ν^T ν = 1

- The solution is that ν is the eigenvector of X^T X corresponding to the largest eigenvalue. That is, let the eigendecomposition be

X^T X = V D^2 V^T

with decreasing diagonal elements of the diagonal matrix D. The columns of V contain the p principal component loading vectors.


- The first q principal components for the ith observation x_i are then

λ̂_i = V_q^T (x_i − x̄)

where V_q consists of the first q columns of V and x̄ is the column average vector of X.
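As a concrete illustration, here is a minimal numpy sketch of the two slides above: the loadings are the eigenvectors of X^T X for the column-centered X, and the first q scores of observation x_i are λ̂_i = V_q^T (x_i − x̄). The function name, the random data, and q = 2 are illustrative assumptions, not from the source.

import numpy as np

def pca_scores(X, q):
    # Center the columns and form X^T X = V D^2 V^T
    xbar = X.mean(axis=0)
    Xc = X - xbar
    evals, V = np.linalg.eigh(Xc.T @ Xc)
    # eigh returns eigenvalues in ascending order; reorder to decreasing
    V = V[:, np.argsort(evals)[::-1]]
    Vq = V[:, :q]                      # first q principal component loadings
    scores = Xc @ Vq                   # row i equals V_q^T (x_i - xbar)
    return Vq, scores

# Usage on random data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Vq, scores = pca_scores(X, q=2)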


Principal components: another angle

Consider using a hyperplane of rank q to approximate the data X:

f(λ) = µ + M_q λ

where µ is a location vector in R^p, M_q is a p × q matrix with q orthogonal unit vectors as columns, and λ is a q-vector of parameters. Fitting such a model to the data by least squares amounts to minimizing the reconstruction error

min_{µ, λ_i, M_q} ∑_{i=1}^{N} ‖x_i − µ − M_q λ_i‖².

The solution is

µ̂ = x̄,   λ̂_i = V_q^T (x_i − x̄),

where V_q consists of the first q columns of the eigenvector matrix V.


Let the singular value decomposition of X be

X = U D V^T

It can be seen that λ̂_i is the ith row of U_q D_q.
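This equivalence can be checked numerically. A short sketch on generic random data (the names are illustrative); it also forms the rank-q reconstruction x̄ + V_q λ̂_i from the previous slide.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
q = 2

xbar = X.mean(axis=0)
Xc = X - xbar
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)    # centered X = U D V^T

scores_svd = U[:, :q] * d[:q]          # ith row of U_q D_q
scores_eig = Xc @ Vt[:q].T             # V_q^T (x_i - xbar), stacked as rows
print(np.allclose(scores_svd, scores_eig))            # True: the two expressions agree

X_hat = xbar + scores_svd @ Vt[:q]     # rank-q reconstruction xbar + V_q lambda_i
print(np.linalg.norm(X - X_hat))       # reconstruction error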


90 data points in three dimensions were generated near the surface of a half-sphere of radius 1. The points were in three clusters - red, green and blue - located near (0,1,0), (0,0,1) and (1,0,0).


Example: Handwritten Digits

Figure 14.22 shows a sample of 130 handwritten 3's, each a digitized 16 x 16 grayscale image, from a total of 658 such 3's. We see considerable variation in writing styles, character thickness and orientation. We consider these images as points x_i in R^256, and compute their principal components via the SVD (singular value decomposition). Figure 14.23 shows the first two principal components of these data. For each of these first two principal components, we computed the 5%, 25%, 50%, 75% and 95% quantile points, and used them to define the rectangular grid superimposed on the plot. The circled points indicate those images close to the vertices of the grid.
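A hedged sketch of this analysis, using scikit-learn's built-in 8 x 8 digits data as a stand-in for the 16 x 16 images of the source (the actual dataset is not reproduced here); the quantile grid and the variance-explained calculation follow the description above and the next slide.

import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
X3 = digits.data[digits.target == 3]        # the "3" images as vectors

xbar = X3.mean(axis=0)
U, d, Vt = np.linalg.svd(X3 - xbar, full_matrices=False)
scores = U[:, :2] * d[:2]                   # first two principal components

# 5%, 25%, 50%, 75% and 95% quantile points defining the superimposed grid
grid1 = np.percentile(scores[:, 0], [5, 25, 50, 75, 95])
grid2 = np.percentile(scores[:, 1], [5, 25, 50, 75, 95])

# Cumulative proportion of variance explained by the leading components
var_explained = np.cumsum(d ** 2) / np.sum(d ** 2)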


Here we have displayed the first two principal component directions, v_1 and v_2, as images. Although there are 256 possible principal components, approximately 50 account for 90% of the variation in the threes, and 12 account for 63%.


Principal curves


Principal curves generalize the principal component line, providing a smooth one-dimensional curved approximation to a set of data points in R^p.

- Let f(λ) be a parameterized smooth curve in R^p. Hence f(λ) is a vector function with p coordinates, each a smooth function of the single parameter λ.

- Let λ_f(x) define the closest point on the curve to x. Then f(λ) is called a principal curve for the distribution of the random vector X if

f(λ) = E(X | λ_f(X) = λ).

This says f(λ) is the average of all data points that project to it, that is, the points for which it is "responsible." This is also known as a self-consistency property.


Algorithm to find a principal curve

To find a principal curve f(λ) = (f_1(λ), f_2(λ), ..., f_p(λ)) of the data X^T = (X_1, ..., X_p), consider the following alternating steps:

- f̂_j(λ) ← E(X_j | λ(X) = λ), j = 1, 2, ..., p; that is, smoothing X_j as a function of λ.

- λ̂_f(x) ← argmin_{λ′} ‖x − f̂(λ′)‖². That is, for each point x, find its projection λ on the current principal curve.

Initialize with f(λ_1) = x̄ + V_1 λ_1, which is the data reconstruction based on the first principal component.
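A minimal sketch of this alternating algorithm, assuming a discretized curve and a simple Gaussian-kernel smoother for the conditional-expectation step; the function names, grid size and bandwidth are illustrative choices, not from the source.

import numpy as np

def kernel_smooth(lam, y, grid, bandwidth):
    # Nadaraya-Watson estimate of E(y | lam), evaluated on the grid
    w = np.exp(-0.5 * ((grid[:, None] - lam[None, :]) / bandwidth) ** 2)
    return (w @ y) / (w.sum(axis=1) + 1e-12)

def principal_curve(X, n_grid=100, n_iter=10, bandwidth=0.3):
    N, p = X.shape
    xbar = X.mean(axis=0)
    Xc = X - xbar
    # Initialize with the first principal component line f(lam) = xbar + lam * v1
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    v1 = Vt[0]
    lam = Xc @ v1
    grid = np.linspace(lam.min(), lam.max(), n_grid)
    f = xbar + np.outer(grid, v1)          # current curve, evaluated on the grid
    for _ in range(n_iter):
        # Projection step: lambda_f(x) <- argmin_lam ||x - f(lam)||^2
        d2 = ((X[:, None, :] - f[None, :, :]) ** 2).sum(axis=2)
        lam = grid[d2.argmin(axis=1)]
        # Smoothing step: f_j(lam) <- E(X_j | lambda(X) = lam), one smoother per coordinate
        f = np.column_stack([kernel_smooth(lam, X[:, j], grid, bandwidth)
                             for j in range(p)])
    return grid, f, lam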


Principal surfaces

Principal surfaces have exactly the same form as principal curves, but are two-dimensional, with coordinate functions

f(λ_1, λ_2) = (f_1(λ_1, λ_2), ..., f_p(λ_1, λ_2)).

The estimates in the smoothing step above are obtained from two-dimensional surface smoothers.


Self-organizing maps: K-means clustering with 2-dimensional representation


Document retrieval has gained importance with the rapid development of the Internet and the Web, and SOMs have proved to be useful for organizing and indexing large corpora. This example is taken from the WEBSOM homepage http://websom.hut.fi/. Figure 14.19 represents a SOM fit to 12,088 newsgroup comp.ai.neural-nets articles. The labels are generated automatically by the WEBSOM software and provide a guide as to the typical content of a node.


SOM algorithm

Consider a SOM with a two-dimensional rectangular grid of K prototypes m_j ∈ R^p. Each of the K prototypes is parametrized with respect to an integer coordinate pair l_j ∈ Q_1 × Q_2. Here Q_1 = {1, 2, ..., q_1}, similarly Q_2, and K = q_1 q_2. The m_j are initialized, for example, to lie in the two-dimensional principal component plane of the data. The observations x_i are processed one at a time. We find the closest prototype m_j to x_i in Euclidean distance in R^p, and then for all neighbors m_k of m_j, move m_k toward x_i via the update

m_k ← m_k + α (x_i − m_k).

The "neighbors" of m_j are defined to be all m_k such that the distance between l_j and l_k is small. The simplest approach uses Euclidean distance, and "small" is determined by a threshold r. This neighborhood always includes the closest prototype m_j itself.
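A minimal sketch of this online update, assuming a q1 x q2 grid, a fixed learning rate α, and a fixed neighborhood threshold r (in practice both are typically decreased over training); the initialization in the principal component plane follows the description above, but its scaling here is an arbitrary choice.

import numpy as np

def som_online(X, q1=5, q2=5, alpha=0.05, r=1.0, n_epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    N, p = X.shape
    # Integer grid coordinates l_j for the K = q1*q2 prototypes
    L = np.array([(i + 1, j + 1) for i in range(q1) for j in range(q2)], dtype=float)
    # Initialize the prototypes m_j in the two-dimensional principal component plane
    xbar = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - xbar, full_matrices=False)
    M = xbar + ((L - L.mean(axis=0)) / L.std(axis=0)) @ Vt[:2]
    for _ in range(n_epochs):
        for i in rng.permutation(N):
            xi = X[i]
            j = np.argmin(((M - xi) ** 2).sum(axis=1))       # closest prototype to x_i
            nbr = np.linalg.norm(L - L[j], axis=1) <= r       # neighbors of m_j (includes j)
            M[nbr] += alpha * (xi - M[nbr])                   # move them toward x_i
    return M, L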


Batch processing

In the batch version of the SOM, we update each m_j via

m_j = ∑ w_k x_k / ∑ w_k.

- The sum is over points x_k that mapped (i.e., were closest) to neighbors m_k of m_j.

- The weight function may be rectangular, that is, equal to 1 for the neighbors of m_k, or may decrease smoothly with distance ‖l_k − l_j‖ as before.

- If the neighborhood size is chosen small enough so that it consists only of m_k, then with rectangular weights this reduces to the K-means clustering procedure.
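A sketch of one batch update with rectangular weights; it assumes prototypes M and grid coordinates L as produced, for example, by the online sketch above, and the threshold r is an illustrative choice. The step would be repeated until the prototypes stop changing.

import numpy as np

def som_batch_step(X, M, L, r=1.0):
    # Map each observation to its closest prototype
    d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
    closest = d2.argmin(axis=1)
    M_new = M.copy()
    for j in range(M.shape[0]):
        # Grid neighbors of prototype j (always includes j itself)
        nbr = np.where(np.linalg.norm(L - L[j], axis=1) <= r)[0]
        mask = np.isin(closest, nbr)          # points x_k that mapped to those neighbors
        if mask.any():
            w = np.ones(mask.sum())           # rectangular weights w_k = 1
            M_new[j] = (w[:, None] * X[mask]).sum(axis=0) / w.sum()
    return M_new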


- Relationship with principal surfaces:
  - If we use a kernel surface smoother to estimate each coordinate function of the principal surface, f_j(λ_1, λ_2), this has the same form as the batch version of SOMs. The SOM weights w_k are just the weights in the kernel.

  - Principal surfaces provide a smooth parameterization of the entire manifold in terms of its coordinate functions, while SOMs are discrete and produce only the estimated prototypes for approximating the data.

  - The smooth parameterization in principal surfaces preserves distance locally; however, there is little indication in the SOM projection that the red cluster is tighter than the others.


Kernel principal components

The data on the left are 450 points falling in three concentric clusters of 150 points each. The points are uniformly distributed in angle, with radius 1, 2.8 and 5 in the three groups, and Gaussian noise with standard deviation 0.25 added to each point.
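A sketch that generates data of this kind (the exact dataset of the figure is not reproduced); adding the Gaussian noise to each coordinate is one reading of "added to each point".

import numpy as np

rng = np.random.default_rng(0)
clusters = []
for radius in (1.0, 2.8, 5.0):
    angle = rng.uniform(0.0, 2.0 * np.pi, 150)       # uniformly distributed in angle
    ring = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])
    clusters.append(ring + rng.normal(scale=0.25, size=ring.shape))
X = np.vstack(clusters)                               # 450 x 2 data matrix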


Kernel principal components

Treating the kernel matrix K as an inner-product matrix of the implicit features, kernel PCA computes the eigendecomposition of the double-centered version of the Gram matrix

K̃ = (I − M) K (I − M) = U D^2 U^T

where M = 11^T/N. Then the kernel principal components are defined as

Z = U D.

The elements of the mth component Z_m can be written (up to centering) as

z_im = ∑_{j=1}^{N} α_jm K(x_i, x_j)

where α_jm = u_jm / d_m.
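A minimal sketch of these formulas, assuming a radial kernel of the form exp(-‖x − x'‖²/c); the exact parameterization of the radial kernel (14.67) is an assumption here.

import numpy as np

def kernel_pca(X, c=2.0, n_components=2):
    N = X.shape[0]
    # Radial kernel matrix K(x, x') = exp(-||x - x'||^2 / c)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-d2 / c)
    # Double centering: K_tilde = (I - M) K (I - M) with M = 11^T / N
    J = np.eye(N) - np.ones((N, N)) / N
    K_tilde = J @ K @ J
    # Eigendecomposition K_tilde = U D^2 U^T (eigh returns ascending eigenvalues)
    evals, U = np.linalg.eigh(K_tilde)
    order = np.argsort(evals)[::-1][:n_components]
    D = np.sqrt(np.maximum(evals[order], 0.0))
    return U[:, order] * D                 # kernel principal components Z = U D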


Kernel principal components applied to the toy example of Figure 14.29, using different kernels. (Top left:) Radial kernel (14.67) with c = 2. (Top right:) Radial kernel with c = 10.


Sparse principal components

Here the shape of the mid-sagittal cross-section of the corpus callosum (CC) is related to various clinical parameters in a study involving 569 elderly persons. For such applications, a number of landmarks are identified along the circumference of the shape; these are aligned by Procrustes analysis to allow for rotations, and in this case scaling as well. The features used for PCA are the sequence of coordinate pairs for each landmark, unpacked into a single vector.


Low walking speed relates to CCs that are thinner (displaying atrophy) in regions connecting the motor control and cognitive centers of the brain. Low verbal fluency relates to CCs that are thinner in regions connecting auditory/visual/cognitive centers. The sparse principal components procedure gives a more parsimonious, and potentially more informative, picture of the important differences.


Sparse principal component analysis

Let x_i be the ith row of X. For a single component, the sparse principal component technique solves

min_{θ,ν} ∑_{i=1}^{N} ‖x_i − θ ν^T x_i‖² + λ‖ν‖_2² + λ_1‖ν‖_1, subject to ‖θ‖_2 = 1

- If both λ and λ_1 are zero and N > p, it is easy to show that ν = θ and is the largest principal component direction.

- When p ≫ N the solution is not necessarily unique unless λ > 0. For any λ > 0 and λ_1 = 0, the solution for ν is proportional to the largest principal component direction.

- The second penalty on ν encourages sparseness of the loadings.
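A hedged sketch of the alternating scheme for this single-component criterion (the scheme is spelled out for multiple components on the next slides): with θ fixed, the ν-step is an elastic net regression of Xθ on X, solved here with scikit-learn's ElasticNet, whose alpha/l1_ratio parameterization only approximates the λ, λ_1 penalties above; with ν fixed, the θ-step normalizes X^T X ν to unit length. Function and parameter names are illustrative.

import numpy as np
from sklearn.linear_model import ElasticNet

def sparse_pc_single(X, alpha=0.1, l1_ratio=0.5, n_iter=50):
    Xc = X - X.mean(axis=0)
    # Initialize theta at the ordinary first principal component direction
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    theta = Vt[0]
    nu = theta.copy()
    enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False, max_iter=5000)
    for _ in range(n_iter):
        # nu-step: elastic net regression of X theta on X (theta fixed)
        nu = enet.fit(Xc, Xc @ theta).coef_
        if not np.any(nu):
            break                          # penalty large enough to zero out nu entirely
        # theta-step: minimizer of sum ||x_i - theta nu^T x_i||^2 with ||theta|| = 1
        g = Xc.T @ (Xc @ nu)
        theta = g / np.linalg.norm(g)
    norm = np.linalg.norm(nu)
    return (nu / norm if norm > 0 else nu), theta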


For multiple components, the sparse principal components procedure minimizes

∑_{i=1}^{N} ‖x_i − θ V^T x_i‖_2² + λ ∑_{k=1}^{K} ‖ν_k‖_2² + ∑_{k=1}^{K} λ_{1k} ‖ν_k‖_1,

subject to θ^T θ = I_K. Here V is a p × K matrix with columns ν_k, and θ is also p × K.

- The above minimization is not jointly convex in V and θ, but it is convex in each parameter with the other parameter fixed.

- Minimization over V with θ fixed is equivalent to K elastic net problems and can be done efficiently.

- On the other hand, minimization over θ with V fixed can be solved by a simple SVD calculation (see the sketch after this list).

- These steps are alternated until convergence.
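As a sketch of the SVD step referenced above: with V fixed, minimizing over θ subject to θ^T θ = I_K is a Procrustes-type problem whose solution can be written via the SVD of X^T X V. Here X is assumed column-centered and the function name is illustrative.

import numpy as np

def update_theta(X, V):
    # With V fixed, theta = U W^T where X^T X V = U D W^T (reduced SVD)
    U, _, Wt = np.linalg.svd(X.T @ (X @ V), full_matrices=False)
    return U @ Wt          # p x K matrix with orthonormal columns (theta^T theta = I_K)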