Principal components - University of Massachusetts Amherst
Transcript of people.math.umass.edu/~anna/stat697F/Chapter10_part2.pdf
Principal components

- Previously, for an N × p data matrix X with centered columns, the first principal component Xν is defined to have the maximum variance; that is, the principal component loading vector ν solves

max_ν ν^T (X^T X) ν, subject to ν^T ν = 1

- The solution is that ν is the eigenvector of X^T X corresponding to the largest eigenvalue. That is, let the eigendecomposition be

X^T X = V D^2 V^T

with decreasing diagonal elements of the diagonal matrix D. The columns of V contain the p principal component loading vectors.


- The first q principal components for the ith observation x_i are then

λ̂_i = V_q^T (x_i − x̄)

where V_q consists of the first q columns of V and x̄ is the column average vector of X.
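As a concrete illustration, here is a minimal numpy sketch of the two slides above: the loadings are the eigenvectors of X^T X for the column-centered X, and the first q scores of observation x_i are λ̂_i = V_q^T (x_i − x̄). The function name, the random data, and q = 2 are illustrative assumptions, not from the source.

import numpy as np

def pca_scores(X, q):
    # Center the columns and form X^T X = V D^2 V^T
    xbar = X.mean(axis=0)
    Xc = X - xbar
    evals, V = np.linalg.eigh(Xc.T @ Xc)
    # eigh returns eigenvalues in ascending order; reorder to decreasing
    V = V[:, np.argsort(evals)[::-1]]
    Vq = V[:, :q]                      # first q principal component loadings
    scores = Xc @ Vq                   # row i equals V_q^T (x_i - xbar)
    return Vq, scores

# Usage on random data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Vq, scores = pca_scores(X, q=2)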


Principal components: another angle

Consider using a hyperplane of rank q to approximate the data X:

f(λ) = µ + M_q λ

where µ is a location vector in R^p, M_q is a p × q matrix with q orthogonal unit vectors as columns, and λ is a q-vector of parameters. Fitting such a model to the data by least squares amounts to minimizing the reconstruction error

min_{µ, λ_i, M_q} ∑_{i=1}^{N} ‖x_i − µ − M_q λ_i‖².

The solution is

µ̂ = x̄,   λ̂_i = V_q^T (x_i − x̄),

where V_q consists of the first q columns of the eigenvector matrix V.


Let the singular value decomposition of X be

X = U D V^T

It can be seen that λ̂_i is the ith row of U_q D_q.
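This equivalence can be checked numerically. A short sketch on generic random data (the names are illustrative); it also forms the rank-q reconstruction x̄ + V_q λ̂_i from the previous slide.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
q = 2

xbar = X.mean(axis=0)
Xc = X - xbar
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)    # centered X = U D V^T

scores_svd = U[:, :q] * d[:q]          # ith row of U_q D_q
scores_eig = Xc @ Vt[:q].T             # V_q^T (x_i - xbar), stacked as rows
print(np.allclose(scores_svd, scores_eig))            # True: the two expressions agree

X_hat = xbar + scores_svd @ Vt[:q]     # rank-q reconstruction xbar + V_q lambda_i
print(np.linalg.norm(X - X_hat))       # reconstruction error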


90 data points in three dimensions were generated near the surface of a half-sphere of radius 1. The points were in three clusters - red, green and blue - located near (0,1,0), (0,0,1) and (1,0,0).


Example: Handwritten Digits

Figure 14.22 shows a sample of 130 handwritten 3's, each a digitized 16 x 16 grayscale image, from a total of 658 such 3's. We see considerable variation in writing styles, character thickness and orientation. We consider these images as points x_i in R^256, and compute their principal components via the SVD (singular value decomposition). Figure 14.23 shows the first two principal components of these data. For each of these first two principal components, we computed the 5%, 25%, 50%, 75% and 95% quantile points, and used them to define the rectangular grid superimposed on the plot. The circled points indicate those images close to the vertices of the grid.
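A hedged sketch of this analysis, using scikit-learn's built-in 8 x 8 digits data as a stand-in for the 16 x 16 images of the source (the actual dataset is not reproduced here); the quantile grid and the variance-explained calculation follow the description above and the next slide.

import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
X3 = digits.data[digits.target == 3]        # the "3" images as vectors

xbar = X3.mean(axis=0)
U, d, Vt = np.linalg.svd(X3 - xbar, full_matrices=False)
scores = U[:, :2] * d[:2]                   # first two principal components

# 5%, 25%, 50%, 75% and 95% quantile points defining the superimposed grid
grid1 = np.percentile(scores[:, 0], [5, 25, 50, 75, 95])
grid2 = np.percentile(scores[:, 1], [5, 25, 50, 75, 95])

# Cumulative proportion of variance explained by the leading components
var_explained = np.cumsum(d ** 2) / np.sum(d ** 2)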


Here we have displayed the first two principal component directions, v_1 and v_2, as images. Although there are 256 possible principal components, approximately 50 account for 90% of the variation in the threes, and 12 account for 63%.


Principal curves


Principal curves generalize the principal component line, providing a smooth one-dimensional curved approximation to a set of data points in R^p.

- Let f(λ) be a parameterized smooth curve in R^p. Hence f(λ) is a vector function with p coordinates, each a smooth function of the single parameter λ.

- Let λ_f(x) define the closest point on the curve to x. Then f(λ) is called a principal curve for the distribution of the random vector X if

f(λ) = E(X | λ_f(X) = λ).

This says f(λ) is the average of all data points that project to it, that is, the points for which it is "responsible." This is also known as a self-consistency property.


Algorithm to find a principal curve

To find a principal curve f(λ) = (f_1(λ), f_2(λ), ..., f_p(λ)) of the data X^T = (X_1, ..., X_p), consider the following alternating steps:

- f̂_j(λ) ← E(X_j | λ(X) = λ), j = 1, 2, ..., p; that is, smoothing X_j as a function of λ.

- λ̂_f(x) ← argmin_{λ′} ‖x − f̂(λ′)‖². That is, for each point x, find its projection λ on the current principal curve.

Initialize with f(λ_1) = x̄ + V_1 λ_1, which is the data reconstruction based on the first principal component.
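A minimal sketch of this alternating algorithm, assuming a discretized curve and a simple Gaussian-kernel smoother for the conditional-expectation step; the function names, grid size and bandwidth are illustrative choices, not from the source.

import numpy as np

def kernel_smooth(lam, y, grid, bandwidth):
    # Nadaraya-Watson estimate of E(y | lam), evaluated on the grid
    w = np.exp(-0.5 * ((grid[:, None] - lam[None, :]) / bandwidth) ** 2)
    return (w @ y) / (w.sum(axis=1) + 1e-12)

def principal_curve(X, n_grid=100, n_iter=10, bandwidth=0.3):
    N, p = X.shape
    xbar = X.mean(axis=0)
    Xc = X - xbar
    # Initialize with the first principal component line f(lam) = xbar + lam * v1
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    v1 = Vt[0]
    lam = Xc @ v1
    grid = np.linspace(lam.min(), lam.max(), n_grid)
    f = xbar + np.outer(grid, v1)          # current curve, evaluated on the grid
    for _ in range(n_iter):
        # Projection step: lambda_f(x) <- argmin_lam ||x - f(lam)||^2
        d2 = ((X[:, None, :] - f[None, :, :]) ** 2).sum(axis=2)
        lam = grid[d2.argmin(axis=1)]
        # Smoothing step: f_j(lam) <- E(X_j | lambda(X) = lam), one smoother per coordinate
        f = np.column_stack([kernel_smooth(lam, X[:, j], grid, bandwidth)
                             for j in range(p)])
    return grid, f, lam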


Principal surfaces

Principal surfaces have exactly the same form as principal curves, but are two-dimensional, with coordinate functions

f(λ_1, λ_2) = (f_1(λ_1, λ_2), ..., f_p(λ_1, λ_2)).

The estimates in the smoothing step above are obtained from two-dimensional surface smoothers.


Self-organizing maps: K-means clustering with 2-dimensional representation


Document retrieval has gained importance with the rapid development of the Internet and the Web, and SOMs have proved to be useful for organizing and indexing large corpora. This example is taken from the WEBSOM homepage http://websom.hut.fi/. Figure 14.19 represents a SOM fit to 12,088 newsgroup comp.ai.neural-nets articles. The labels are generated automatically by the WEBSOM software and provide a guide as to the typical content of a node.


SOM algorithm

Consider a SOM with a two-dimensional rectangular grid of K prototypes m_j ∈ R^p. Each of the K prototypes is parametrized with respect to an integer coordinate pair l_j ∈ Q_1 × Q_2. Here Q_1 = {1, 2, ..., q_1}, similarly Q_2, and K = q_1 q_2. The m_j are initialized, for example, to lie in the two-dimensional principal component plane of the data. The observations x_i are processed one at a time. We find the closest prototype m_j to x_i in Euclidean distance in R^p, and then for all neighbors m_k of m_j, move m_k toward x_i via the update

m_k ← m_k + α (x_i − m_k).

The "neighbors" of m_j are defined to be all m_k such that the distance between l_j and l_k is small. The simplest approach uses Euclidean distance, and "small" is determined by a threshold r. This neighborhood always includes the closest prototype m_j itself.
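A minimal sketch of this online update, assuming a q1 x q2 grid, a fixed learning rate α, and a fixed neighborhood threshold r (in practice both are typically decreased over training); the initialization in the principal component plane follows the description above, but its scaling here is an arbitrary choice.

import numpy as np

def som_online(X, q1=5, q2=5, alpha=0.05, r=1.0, n_epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    N, p = X.shape
    # Integer grid coordinates l_j for the K = q1*q2 prototypes
    L = np.array([(i + 1, j + 1) for i in range(q1) for j in range(q2)], dtype=float)
    # Initialize the prototypes m_j in the two-dimensional principal component plane
    xbar = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - xbar, full_matrices=False)
    M = xbar + ((L - L.mean(axis=0)) / L.std(axis=0)) @ Vt[:2]
    for _ in range(n_epochs):
        for i in rng.permutation(N):
            xi = X[i]
            j = np.argmin(((M - xi) ** 2).sum(axis=1))       # closest prototype to x_i
            nbr = np.linalg.norm(L - L[j], axis=1) <= r       # neighbors of m_j (includes j)
            M[nbr] += alpha * (xi - M[nbr])                   # move them toward x_i
    return M, L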


Batch processing

In the batch version of the SOM, we update each m_j via

m_j = ∑ w_k x_k / ∑ w_k.

- The sum is over points x_k that mapped (i.e., were closest) to neighbors m_k of m_j.

- The weight function may be rectangular, that is, equal to 1 for the neighbors of m_k, or may decrease smoothly with distance ‖l_k − l_j‖ as before.

- If the neighborhood size is chosen small enough so that it consists only of m_k, then with rectangular weights this reduces to the K-means clustering procedure.
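A sketch of one batch update with rectangular weights; it assumes prototypes M and grid coordinates L as produced, for example, by the online sketch above, and the threshold r is an illustrative choice. The step would be repeated until the prototypes stop changing.

import numpy as np

def som_batch_step(X, M, L, r=1.0):
    # Map each observation to its closest prototype
    d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
    closest = d2.argmin(axis=1)
    M_new = M.copy()
    for j in range(M.shape[0]):
        # Grid neighbors of prototype j (always includes j itself)
        nbr = np.where(np.linalg.norm(L - L[j], axis=1) <= r)[0]
        mask = np.isin(closest, nbr)          # points x_k that mapped to those neighbors
        if mask.any():
            w = np.ones(mask.sum())           # rectangular weights w_k = 1
            M_new[j] = (w[:, None] * X[mask]).sum(axis=0) / w.sum()
    return M_new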


- Relationship with principal surfaces:
  - If we use a kernel surface smoother to estimate each coordinate function of the principal surface, f_j(λ_1, λ_2), this has the same form as the batch version of SOMs. The SOM weights w_k are just the weights in the kernel.

  - Principal surfaces provide a smooth parameterization of the entire manifold in terms of its coordinate functions, while SOMs are discrete and produce only the estimated prototypes for approximating the data.

  - The smooth parameterization in principal surfaces preserves distance locally; however, there is little indication in the SOM projection that the red cluster is tighter than the others.


Kernel principal components

The data on the left are 450 points falling in three concentric clusters of 150 points each. The points are uniformly distributed in angle, with radius 1, 2.8 and 5 in the three groups, and Gaussian noise with standard deviation 0.25 added to each point.
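A sketch that generates data of this kind (the exact dataset of the figure is not reproduced); adding the Gaussian noise to each coordinate is one reading of "added to each point".

import numpy as np

rng = np.random.default_rng(0)
clusters = []
for radius in (1.0, 2.8, 5.0):
    angle = rng.uniform(0.0, 2.0 * np.pi, 150)       # uniformly distributed in angle
    ring = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])
    clusters.append(ring + rng.normal(scale=0.25, size=ring.shape))
X = np.vstack(clusters)                               # 450 x 2 data matrix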


Kernel principal components

Treating the kernel matrix K as an inner-product matrix of the implicit features, kernel PCA computes the eigendecomposition of the double-centered version of the Gram matrix

K̃ = (I − M) K (I − M) = U D^2 U^T

where M = 11^T/N. Then the kernel principal components are defined as

Z = U D.

The elements of the mth component Z_m can be written (up to centering) as

z_im = ∑_{j=1}^{N} α_jm K(x_i, x_j)

where α_jm = u_jm / d_m.
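A minimal sketch of these formulas, assuming a radial kernel of the form exp(-‖x − x'‖²/c); the exact parameterization of the radial kernel (14.67) is an assumption here.

import numpy as np

def kernel_pca(X, c=2.0, n_components=2):
    N = X.shape[0]
    # Radial kernel matrix K(x, x') = exp(-||x - x'||^2 / c)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-d2 / c)
    # Double centering: K_tilde = (I - M) K (I - M) with M = 11^T / N
    J = np.eye(N) - np.ones((N, N)) / N
    K_tilde = J @ K @ J
    # Eigendecomposition K_tilde = U D^2 U^T (eigh returns ascending eigenvalues)
    evals, U = np.linalg.eigh(K_tilde)
    order = np.argsort(evals)[::-1][:n_components]
    D = np.sqrt(np.maximum(evals[order], 0.0))
    return U[:, order] * D                 # kernel principal components Z = U D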


Kernel principal components applied to the toy example of Figure 14.29, using different kernels. (Top left:) Radial kernel (14.67) with c = 2. (Top right:) Radial kernel with c = 10.


Sparse principal components

Here the shape of the mid-sagittal cross-section of the corpus callosum (CC) is related to various clinical parameters in a study involving 569 elderly persons. For such applications, a number of landmarks are identified along the circumference of the shape; these are aligned by Procrustes analysis to allow for rotations, and in this case scaling as well. The features used for PCA are the sequence of coordinate pairs for each landmark, unpacked into a single vector.


Low walking speed relates to CCs that are thinner (displaying atrophy) in regions connecting the motor control and cognitive centers of the brain. Low verbal fluency relates to CCs that are thinner in regions connecting auditory/visual/cognitive centers. The sparse principal components procedure gives a more parsimonious, and potentially more informative, picture of the important differences.


Sparse principal component analysis

Let x_i be the ith row of X. For a single component, the sparse principal component technique solves

min_{θ,ν} ∑_{i=1}^{N} ‖x_i − θ ν^T x_i‖² + λ‖ν‖_2² + λ_1‖ν‖_1, subject to ‖θ‖_2 = 1

- If both λ and λ_1 are zero and N > p, it is easy to show that ν = θ and is the largest principal component direction.

- When p ≫ N the solution is not necessarily unique unless λ > 0. For any λ > 0 and λ_1 = 0, the solution for ν is proportional to the largest principal component direction.

- The second penalty on ν encourages sparseness of the loadings.
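A hedged sketch of the alternating scheme for this single-component criterion (the scheme is spelled out for multiple components on the next slides): with θ fixed, the ν-step is an elastic net regression of Xθ on X, solved here with scikit-learn's ElasticNet, whose alpha/l1_ratio parameterization only approximates the λ, λ_1 penalties above; with ν fixed, the θ-step normalizes X^T X ν to unit length. Function and parameter names are illustrative.

import numpy as np
from sklearn.linear_model import ElasticNet

def sparse_pc_single(X, alpha=0.1, l1_ratio=0.5, n_iter=50):
    Xc = X - X.mean(axis=0)
    # Initialize theta at the ordinary first principal component direction
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    theta = Vt[0]
    nu = theta.copy()
    enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False, max_iter=5000)
    for _ in range(n_iter):
        # nu-step: elastic net regression of X theta on X (theta fixed)
        nu = enet.fit(Xc, Xc @ theta).coef_
        if not np.any(nu):
            break                          # penalty large enough to zero out nu entirely
        # theta-step: minimizer of sum ||x_i - theta nu^T x_i||^2 with ||theta|| = 1
        g = Xc.T @ (Xc @ nu)
        theta = g / np.linalg.norm(g)
    norm = np.linalg.norm(nu)
    return (nu / norm if norm > 0 else nu), theta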


For multiple components, the sparse principal components procedure minimizes

∑_{i=1}^{N} ‖x_i − θ V^T x_i‖_2² + λ ∑_{k=1}^{K} ‖ν_k‖_2² + ∑_{k=1}^{K} λ_{1k} ‖ν_k‖_1,

subject to θ^T θ = I_K. Here V is a p × K matrix with columns ν_k, and θ is also p × K.

- The above minimization is not jointly convex in V and θ, but it is convex in each parameter with the other parameter fixed.

- Minimization over V with θ fixed is equivalent to K elastic net problems and can be done efficiently.

- On the other hand, minimization over θ with V fixed can be solved by a simple SVD calculation (see the sketch after this list).

- These steps are alternated until convergence.
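As a sketch of the SVD step referenced above: with V fixed, minimizing over θ subject to θ^T θ = I_K is a Procrustes-type problem whose solution can be written via the SVD of X^T X V. Here X is assumed column-centered and the function name is illustrative.

import numpy as np

def update_theta(X, V):
    # With V fixed, theta = U W^T where X^T X V = U D W^T (reduced SVD)
    U, _, Wt = np.linalg.svd(X.T @ (X @ V), full_matrices=False)
    return U @ Wt          # p x K matrix with orthonormal columns (theta^T theta = I_K)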