Principal Component Analysis of Tree Topology
Transcript of Principal Component Analysis of Tree Topology
1
Principal Component Analysis of Tree Topology
Yongdai Kim, Seoul National University, June 5, 2011
Presented by J. S. Marron, SAMSI
2
UNC, Stat & OR
Dyck Path Challenges
Data trees not like PC projections:
• Branch Lengths ≥ 0
• Big flat spots
3
Brain Data: Mean – 2 σ1 PC1
Careful about values < 0
4
Interpretation: Directions Leave Positive Orthant
(pic here)
5
Visualize Trees
Important Note: Tendency Towards Large Flat Spots And Bursts of Nearby Branches
6
Dyck Path Challenges
Data trees not like PC projections:
• Branch Lengths ≥ 0
• Big flat spots
Alternate Approaches:
• Branch Length Representation
• Tree Pruning
• Non-negative Matrix Factorization
• Bayesian Factor Model
7
Dyck Path Challenges
Data trees not like PC projections:
• Branch Lengths ≥ 0
• Big flat spots
Alternate Approaches:
• Branch Length Representation
• Tree Pruning
• Non-negative Matrix Factorization
• Bayesian Factor Model
Discussed by Dan
9
Dyck Path Challenges
Data trees not like PC projections:
• Branch Lengths ≥ 0
• Big flat spots
Alternate Approaches:
• Branch Length Representation
• Tree Pruning
• Non-negative Matrix Factorization
• Bayesian Factor Model
Discussed by Lingsong
10
Non-negative Matrix Factorization
Ideas:
• Linearly Approx. Data (as in PCA)
• But Stay in Positive Orthant
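The idea on this slide (linear approximation that stays in the positive orthant) can be sketched with the classical Lee–Seung multiplicative updates. This is a generic illustration, not the talk's implementation; the data, rank `r`, and iteration count are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(8)

def nmf(V, r, iters=500, eps=1e-9):
    """Minimal NMF sketch via Lee-Seung multiplicative updates:
    approximate nonnegative V (n x p) as H @ W with H, W >= 0,
    so the approximation stays in the positive orthant (unlike PCA)."""
    n, p = V.shape
    H = rng.random((n, r)) + eps
    W = rng.random((r, p)) + eps
    for _ in range(iters):
        # Multiplicative updates keep every entry nonnegative
        W *= (H.T @ V) / (H.T @ H @ W + eps)
        H *= (V @ W.T) / (H @ W @ W.T + eps)
    return H, W

# Hypothetical nonnegative "branch-length"-style data matrix
V = rng.random((20, 6))
H, W = nmf(V, r=2)
approx = H @ W   # rank-2 approximation, entrywise >= 0
```

Because both factors are nonnegative, the reconstruction can never leave the positive orthant, which is exactly the contrast with PC projections drawn on the earlier slides.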
12
Contents
1. Introduction
2. Proposed Method
3. Bayesian Factor Model
4. PCA
5. Estimation of Projected Trees
13
Introduction
Given data trees $T_1, \dots, T_n$, let $v_1, \dots, v_n$ be the corresponding branch length vectors.
Dimension $p$ = # nodes in the support (union) tree.
For tree $T_i$, define the tree topology vector $y_i$, a $p$-dimensional binary vector, where
$$y_{ik} = I(v_{ik} \neq 0), \quad k = 1, \dots, p.$$
Goal: a PCA method for $y_1, \dots, y_n$.
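The construction of topology vectors from branch lengths can be sketched directly; the three toy trees below are hypothetical, not data from the talk.

```python
import numpy as np

def topology_vectors(V):
    """Binary tree-topology vectors y_i from branch-length vectors v_i.

    V: (n, p) array of nonnegative branch lengths, one row per tree,
       indexed by the p nodes of the support (union) tree.
    Returns an (n, p) 0/1 array with y_ik = I(v_ik != 0).
    """
    return (np.asarray(V) != 0).astype(int)

# Three toy trees over a 4-node support tree (made-up values)
V = np.array([[1.0, 0.0, 2.5, 0.0],
              [0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.0, 3.0]])
Y = topology_vectors(V)
```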
14
Visualize Trees
Important Note: Tendency Towards Large Flat Spots And Bursts of Nearby Branches
15
Goal of Bayes Factor Model
Model Large Flat Spots as yi = 0
16
Proposed Method
1. Gaussian Latent Variable Model
2. Est. Corr. Matrix: Bayes Factor Model
3. PCA on Est’ed Correlation Matrix
4. Interpret in Tree Space
17
Proposed Method
1. Gaussian Latent Variable Model
• $X_i \sim N_p(\mu, \Sigma)$, $i = 1, \dots, n$
• Assume $y_{ik} = I(x_{ik} > 0,\ y_{i,pa(k)} \neq 0)$, $k = 1, \dots, p$, where $X_i = (x_{i1}, \dots, x_{ip})' \in \mathbb{R}^p$ and $pa(k)$ is the parent node of $k$.
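A forward simulation from this latent variable model can be sketched as follows. The 4-node tree, the parent map `pa`, and the choice $\Sigma = I$ are all illustrative assumptions, not the talk's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4-node tree: node 0 is the root, pa[k] gives k's parent
# (pa[0] = -1 marks the root, which is always present).
pa = [-1, 0, 0, 1]
p, n = len(pa), 5

mu = np.zeros(p)
X = rng.normal(loc=mu, scale=1.0, size=(n, p))  # latent Gaussians, Sigma = I here

# y_ik = I(x_ik > 0 and parent present); process nodes root-first
Y = np.zeros((n, p), dtype=int)
for k in range(p):
    parent_on = 1 if pa[k] < 0 else Y[:, pa[k]]
    Y[:, k] = (X[:, k] > 0).astype(int) * parent_on
```

The root-first loop enforces the parent constraint: a node can only be present when its parent is, so every simulated $y_i$ is a valid tree topology.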
18
Proposed Method
2. Estimation of the correlation matrix by Bayesian factor model
• Estimate $\mu$ and $\Sigma$ by a Bayesian factor model
19
Proposed Method
3. PCA with an estimated correlation matrix
• Apply PCA to the estimated correlation matrix
20
Proposed Method
4. Estimation of projected trees
• Define projected trees on PCA directions
• Estimate the projected trees by MCMC algorithm
21
Bayesian Factor Model
1. Model
2. Priors
3. MCMC algorithm
4. Convergence diagnostic
22
Bayesian Factor Model
1. Model
• $X_i \mid \mu, Z_i, W \sim N_p\big(\mu + \sum_{l=1}^{q} z_{il} W_l,\ \Psi\big), \quad i = 1, \dots, n$,
where $q$ is a positive integer, $W_1, \dots, W_q$ are $p$-dimensional vectors, $z_{il} \sim N(0, \sigma^2_{z,l})$, $l = 1, \dots, q$, and $\Psi = \mathrm{diag}(\psi^2_k,\ k = 1, \dots, p)$. Let $\Sigma_z = \mathrm{diag}(\sigma^2_{z,l},\ l = 1, \dots, q)$ and set $\psi^2_k = 1$ for identifiability.
23
Bayesian Factor Model
2. Priors
• $\mu_k \overset{iid}{\sim} N(0, \sigma_0^2)$
• $w_{lk} \overset{iid}{\sim} N(0, 1)$ for $l \le k$ and $w_{lk} = 0$ for $l > k$, where $W_l = (w_{l1}, \dots, w_{lp})'$
• $\sigma^2_{z,l},\ l = 1, \dots, q \overset{iid}{\sim} \mathrm{InvGamma}(a_z, b_z)$
• This prior has been proposed by Ghosh and Dunson (2009)
24
Bayesian Factor Model
3. MCMC algorithm
• Notation: let $R = \{X, Z, W, \sigma^2_z, y, \mu\}$, where $W_k = (w_{1k}, \dots, w_{qk})'$, $X = (X_1, \dots, X_n)$, $Z = (Z_1, \dots, Z_n)$, $W = (W_1, \dots, W_p)$, $\sigma^2_z = (\sigma^2_{z,1}, \dots, \sigma^2_{z,q})$, and $y = (y_1, \dots, y_n)$.
• Step 1. Generate $X_{ik} \mid R \setminus \{X_{ik}\}$:
  - If $y_{ik} = 1$ and $y_{i,pa(k)} = 1$, generate $X_{ik} \sim N\big(\mu_k + \sum_{l=1}^{q} z_{il} w_{lk},\ 1\big)$ conditional on $X_{ik} > 0$.
  - If $y_{ik} = 0$ and $y_{i,pa(k)} = 1$, generate $X_{ik} \sim N\big(\mu_k + \sum_{l=1}^{q} z_{il} w_{lk},\ 1\big)$ conditional on $X_{ik} \le 0$.
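Step 1 requires draws from a unit-variance normal truncated to one side of zero. A minimal sketch using rejection sampling is below; the conditional mean value is hypothetical, and a production sampler would use an inverse-CDF or tail-efficient method instead.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_truncated_normal(mean, positive, rng, max_tries=10000):
    """Draw X ~ N(mean, 1) conditioned on X > 0 (positive=True) or X <= 0,
    by simple rejection sampling. A sketch of Step 1 of the Gibbs sampler;
    real implementations would use an inverse-CDF or tail sampler."""
    for _ in range(max_tries):
        x = rng.normal(mean, 1.0)
        if (x > 0) == positive:
            return x
    raise RuntimeError("rejection sampling failed; mean too far from 0")

# Update one latent X_ik given y_ik; cond_mean stands in for
# mu_k + sum_l z_il w_lk (hypothetical value)
cond_mean = 0.3
x_on  = sample_truncated_normal(cond_mean, positive=True,  rng=rng)   # y_ik = 1
x_off = sample_truncated_normal(cond_mean, positive=False, rng=rng)   # y_ik = 0
```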
25
Bayesian Factor Model
3. MCMC algorithm
• Step 2. Generate $\mu_k \mid R \setminus \{\mu_k\},\ k = 1, \dots, p \sim N(m_k, \tau_k^2)$, where
$$m_k = \tau_k^2 \sum_{i=1}^{n} y_{i,pa(k)} \Big(X_{ik} - \sum_{l=1}^{q} z_{il} w_{lk}\Big)$$
and
$$\tau_k^2 = \Big(\frac{1}{\sigma_0^2} + \sum_{i=1}^{n} y_{i,pa(k)}\Big)^{-1}.$$
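The conjugate normal update in Step 2 is a one-liner once the residuals are in hand. The values below (prior variance, indicators, residuals) are hypothetical stand-ins for one node $k$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy quantities for one node k (all values hypothetical)
sigma0_sq = 10.0                          # prior variance of mu_k
y_pa = np.array([1, 1, 0, 1])             # y_{i,pa(k)} for i = 1..n
resid = np.array([0.4, -0.2, 0.9, 1.1])   # X_ik - sum_l z_il w_lk

# Step 2: mu_k | rest ~ N(m_k, tau_k^2), with psi_k^2 = 1
tau_sq = 1.0 / (1.0 / sigma0_sq + y_pa.sum())
m_k = tau_sq * np.sum(y_pa * resid)
mu_k = rng.normal(m_k, np.sqrt(tau_sq))
```

Only observations whose parent node is present ($y_{i,pa(k)} = 1$) contribute, which is why the indicator multiplies both the residual sum and the precision.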
26
Bayesian Factor Model
3. MCMC algorithm
• Step 3. Generate
$$W_k \mid R \setminus \{W_k\},\ k = 1, \dots, p \sim \big(N_{\min(k,q)}(m_k, \Omega_k),\ 0_{q-\min(k,q)}\big),$$
i.e., the first $\min(k,q)$ components are drawn from the normal and the remaining components are fixed at 0, where
$$m_k = \Omega_k\, Z_k' \Lambda_k (\tilde X_k - \mu_k 1_n)$$
and
$$\Omega_k = \big(I_{\min(k,q)} + Z_k' \Lambda_k Z_k\big)^{-1}.$$
Here $\tilde X_k = (X_{1k}, \dots, X_{nk})'$, $\Lambda_k = \mathrm{diag}\big(I(y_{i,pa(k)} = 1),\ i = 1, \dots, n\big)$, and $Z_k$ is the $n \times \min(k,q)$ matrix with rows $(z_{i1}, \dots, z_{i,\min(k,q)})$.
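Step 3 can be sketched for a single node; all state values (factor scores, indicators, data column) are hypothetical, and the lower-triangular constraint appears as the trailing zeros of $W_k$.

```python
import numpy as np

rng = np.random.default_rng(9)

# Hypothetical current state for one node k (n = 5 trees, q = 3 factors)
n, q, k = 5, 3, 2                      # node index k (1-based, as on the slides)
dim = min(k, q)                        # only the first min(k,q) loadings are free
Z_k = rng.normal(size=(n, dim))        # rows (z_i1, ..., z_i,min(k,q))
lam = np.diag(rng.integers(0, 2, n).astype(float))  # Lambda_k = diag(I(y_{i,pa(k)}=1))
x_k = rng.normal(size=n)               # (X_1k, ..., X_nk)'
mu_k = 0.1

# Step 3: free part of W_k ~ N(m_k, Omega_k); remaining q - min(k,q) entries are 0
Omega_k = np.linalg.inv(np.eye(dim) + Z_k.T @ lam @ Z_k)
m_k = Omega_k @ Z_k.T @ lam @ (x_k - mu_k * np.ones(n))
W_k = np.zeros(q)
W_k[:dim] = rng.multivariate_normal(m_k, Omega_k)
```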
27
Bayesian Factor Model
3. MCMC algorithm
• Step 4. Generate $Z_i \mid R \setminus \{Z_i\},\ i = 1, \dots, n \sim N_q(m_i, \Omega_i)$, where
$$m_i = \Omega_i\, \mathbf{W}' \Lambda_i (X_i - \mu)$$
and
$$\Omega_i = \big(\mathbf{W}' \Lambda_i \mathbf{W} + \Sigma_z^{-1}\big)^{-1},$$
with $\mathbf{W} = (W_1, \dots, W_q)$ the $p \times q$ loading matrix and $\Lambda_i = \mathrm{diag}\big(I(y_{i,pa(k)} = 1),\ k = 1, \dots, p\big)$.
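Step 4 is a standard multivariate-normal conditional draw; the sketch below uses made-up values for one tree $i$ with $p = 4$ nodes and $q = 2$ factors.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical current state for one tree i (p = 4 nodes, q = 2 factors)
p, q = 4, 2
W = rng.normal(size=(p, q))            # loading matrix (W_1, ..., W_q)
mu = np.zeros(p)
X_i = rng.normal(size=p)
lam = np.diag([1.0, 1.0, 0.0, 1.0])    # Lambda_i = diag(I(y_{i,pa(k)} = 1))
Sigma_z_inv = np.diag([1.0 / 2.0, 1.0 / 0.5])

# Step 4: Z_i | rest ~ N_q(m_i, Omega_i)
Omega_i = np.linalg.inv(W.T @ lam @ W + Sigma_z_inv)
m_i = Omega_i @ W.T @ lam @ (X_i - mu)
Z_i = rng.multivariate_normal(m_i, Omega_i)
```

Nodes whose parent is absent are zeroed out by $\Lambda_i$, so they carry no information about the factor scores.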
28
Bayesian Factor Model
3. MCMC algorithm
• Step 5. Generate
$$\sigma^2_{z,l} \mid R \setminus \{\sigma^2_{z,l}\},\ l = 1, \dots, q \sim \mathrm{InvGamma}\Big(a_z + \frac{n}{2},\ b_z + \frac{1}{2}\sum_{i=1}^{n} z_{il}^2\Big).$$
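Step 5 is a conjugate inverse-gamma draw. The hyperparameters and factor scores below are hypothetical; since numpy has no inverse-gamma sampler, the draw inverts a gamma variate.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical hyperparameters and current factor scores for factor l
a_z, b_z = 2.0, 1.0
z_l = np.array([0.8, -1.2, 0.3, 2.0])   # z_il for i = 1..n
n = len(z_l)

# Step 5: sigma^2_{z,l} | rest ~ InvGamma(a_z + n/2, b_z + sum_i z_il^2 / 2).
# Draw Gamma(shape, scale = 1/rate) and invert it.
shape = a_z + n / 2.0
rate = b_z + 0.5 * np.sum(z_l ** 2)
sigma2_zl = 1.0 / rng.gamma(shape, 1.0 / rate)
```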
29
Bayesian Factor Model
4. Convergence diagnostics
• 100,000 iterations of the MCMC algorithm after 10,000 burn-in iterations
• 1,000 posterior samples obtained at every 100th iteration
• Trace plots, autocorrelation functions (ACF), and histograms of three selected $\mu_k$ (at the 25%, 50%, and 75% quantiles) and a selected $\mathrm{Cov}(X_k, X_{k'})$
(Note: $\mathrm{Cov}(X) = W\, \mathrm{diag}(\sigma^2_{z,l},\ l = 1, \dots, q)\, W' + \Psi$.)
32
PCA
• Apply the standard PCA to the estimated correlation matrix to obtain the PCA direction vectors $W_1, \dots, W_q$ with the corresponding eigenvalues $\lambda_1 \ge \dots \ge \lambda_q > 0$
• Scree plot
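PCA on a correlation matrix reduces to an eigendecomposition; the sketch below uses a correlation matrix estimated from made-up Gaussian data rather than the talk's Bayesian estimate.

```python
import numpy as np

rng = np.random.default_rng(6)

# Stand-in for the estimated correlation matrix (from hypothetical data)
corr = np.corrcoef(rng.normal(size=(50, 5)), rowvar=False)

# PCA on the correlation matrix: eigendecomposition, sorted by eigenvalue
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]          # lambda_1 >= lambda_2 >= ...
W_dirs = eigvecs[:, order]    # columns are PCA direction vectors W_1, W_2, ...

# Scree plot values: proportion of variance explained by each component
scree = lam / lam.sum()
```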
33
Visualizing Modes of Variation
• Assume that $X_i$ follows the multivariate Gaussian distribution with mean $\mu$ and covariance
$$\Sigma = \sum_{l=1}^{q} \lambda_l W_l W_l' + \Psi,$$
where $\Psi = \mathrm{diag}(\psi_k^2,\ k = 1, \dots, p)$ and $\psi_k^2 = 1 - \sum_{l=1}^{q} \lambda_l w_{lk}^2$.
34
Visualizing Modes of Variation
• With $(W_l,\ l = 1, \dots, q)$, $(\lambda_l,\ l = 1, \dots, q)$, and $\mu$ obtained from PCA held fixed, the posterior distributions of $X_i$ and $Z_i$ conditional on the $y_i$'s are obtained by the MCMC algorithm.
• Latent variables: $X_i \mid Z_i \sim N_p\big(\mu + \sum_{l=1}^{q} z_{il} W_l,\ \Psi\big)$ and $Z_i \sim N_q\big(0,\ \mathrm{diag}(\lambda_l,\ l = 1, \dots, q)\big)$.
• With all the other parameters fixed, $X_i$ and $Z_i$ are estimable.
35
Visualizing Modes of Variation
• Obtain the posterior distribution of the projection onto the $l$th PCA direction, $X_i^{(l)} = W_l W_l' X_i$, given $y_i$.
• Define the $l$th projected tree $T_i^{(l)}$ by
$$y_{ik}^{(l)} = \mathrm{median}\big\{ I\big(X_{ik}^{(l)} > 0,\ y_{i,pa(k)}^{(l)} \neq 0\big) \,\big|\, y_i \big\}.$$
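Turning posterior draws of the projected latent vectors into a projected tree via the posterior median indicator can be sketched as follows; the parent map, sample count, and draw means are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical posterior draws of the projected latent vectors for one tree:
# S MCMC samples over p = 4 nodes (pa[k] = parent of node k, -1 = root)
pa = [-1, 0, 0, 1]
S, p = 200, 4
X_proj = rng.normal(loc=[0.8, -0.5, 0.2, 0.6], size=(S, p))

# For each draw, form the topology indicator root-first (child needs parent on),
# then take the posterior median over draws to define the projected tree.
ind = np.zeros((S, p), dtype=int)
for k in range(p):
    parent_on = 1 if pa[k] < 0 else ind[:, pa[k]]
    ind[:, k] = (X_proj[:, k] > 0).astype(int) * parent_on
y_proj = np.median(ind, axis=0).astype(int)   # projected tree topology vector
```

Because the per-draw indicators already respect the parent constraint and the median is monotone, the resulting $y^{(l)}_i$ is again a valid tree topology.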
36
Center Point, μ
37
Approximately μ + 0.5 PC1
38
Approximately μ + 1.0 PC1
39
Approximately μ + 1.5 PC1
40
Approximately μ + 2.0 PC1
41
Center Point, μ
42
Approximately μ - 0.5 PC1
43
Approximately μ - 1.0 PC1
44
Approximately μ - 1.5 PC1
45
Approximately μ - 2.0 PC1
46
Visualizing Modes of Variation
• Hard to Interpret
• Scaling Issues?
• Promising and Intuitive
• Work in Progress …
• Future goals:
• Improved Notion of PCA
• Tune Bayes Approach for Better Interpretation
• Integrate with Non-Neg. Matrix Factorization