Principal Component Analysis of Tree Topology

Yongdai Kim, Seoul National University, 2011. 6. 5
Presented by J. S. Marron, SAMSI


Transcript of Principal Component Analysis of Tree Topology

Page 1: Principal Component Analysis of Tree Topology

1

Principal Component Analysis of Tree Topology

Yongdai Kim
Seoul National University
2011. 6. 5

Presented by J. S. Marron, SAMSI

Page 2: Principal Component Analysis of Tree Topology

2

UNC, Stat & OR

Dyck Path Challenges

Data trees not like PC projections
• Branch Lengths ≥ 0
• Big flat spots

Page 3: Principal Component Analysis of Tree Topology

3


Brain Data: Mean – 2 σ1 PC1

Careful about values < 0

Page 4: Principal Component Analysis of Tree Topology

4


Interpretation: Directions Leave Positive Orthant

(pic here)

Page 5: Principal Component Analysis of Tree Topology

5


Visualize Trees

Important Note: Tendency Towards Large Flat Spots And Bursts of Nearby Branches

Page 6: Principal Component Analysis of Tree Topology

6


Dyck Path Challenges

Data trees not like PC projections
• Branch Lengths ≥ 0
• Big flat spots

Alternate Approaches:
• Branch Length Representation
• Tree Pruning
• Non-negative Matrix Factorization
• Bayesian Factor Model

Page 7: Principal Component Analysis of Tree Topology

7


Dyck Path Challenges

Data trees not like PC projections
• Branch Lengths ≥ 0
• Big flat spots

Alternate Approaches:
• Branch Length Representation
• Tree Pruning
• Non-negative Matrix Factorization
• Bayesian Factor Model

Discussed by Dan

Page 8: Principal Component Analysis of Tree Topology

8


Dyck Path Challenges

Data trees not like PC projections
• Branch Lengths ≥ 0
• Big flat spots

Alternate Approaches:
• Branch Length Representation
• Tree Pruning
• Non-negative Matrix Factorization
• Bayesian Factor Model

Discussed by Dan

Page 9: Principal Component Analysis of Tree Topology

9


Dyck Path Challenges

Data trees not like PC projections
• Branch Lengths ≥ 0
• Big flat spots

Alternate Approaches:
• Branch Length Representation
• Tree Pruning
• Non-negative Matrix Factorization
• Bayesian Factor Model

Discussed by Lingsong

Page 10: Principal Component Analysis of Tree Topology

10


Non-negative Matrix Factorization

Ideas:

Linearly Approx. Data (as in PCA)

But Stay in Positive Orthant
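The two ideas above (a linear approximation, constrained to the positive orthant) can be illustrated with a small sketch. The slide does not say which NMF algorithm is meant; the multiplicative-update scheme below is one common choice, swapped in here as an assumption. The names `nmf`, `scores`, `parts` and the toy data are hypothetical.

```python
import numpy as np

def nmf(V, rank, n_iter=500, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V (n x p) by scores @ parts, both >= 0."""
    rng = np.random.default_rng(seed)
    n, p = V.shape
    scores = rng.random((n, rank)) + eps
    parts = rng.random((rank, p)) + eps
    for _ in range(n_iter):
        # Multiplicative updates: factors are only rescaled by non-negative
        # ratios, so they never leave the positive orthant.
        parts *= (scores.T @ V) / (scores.T @ scores @ parts + eps)
        scores *= (V @ parts.T) / (scores @ parts @ parts.T + eps)
    return scores, parts

# Toy data: 6 "trees" with 4 branch lengths, non-negative and of rank 2.
rng = np.random.default_rng(1)
V = rng.random((6, 2)) @ rng.random((2, 4))
scores, parts = nmf(V, rank=2)
rel_err = np.linalg.norm(V - scores @ parts) / np.linalg.norm(V)
```

Unlike PCA directions, every row of `parts` is itself a valid non-negative branch-length vector.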

Page 11: Principal Component Analysis of Tree Topology

11


Dyck Path Challenges

Data trees not like PC projections
• Branch Lengths ≥ 0
• Big flat spots

Alternate Approaches:
• Branch Length Representation
• Tree Pruning
• Non-negative Matrix Factorization
• Bayesian Factor Model

Page 12: Principal Component Analysis of Tree Topology

12

Contents

1. Introduction

2. Proposed Method

3. Bayesian Factor Model

4. PCA

5. Estimation of Projected Trees

Page 13: Principal Component Analysis of Tree Topology

13

Introduction

Given data trees $T_1, \dots, T_n$, let $v_1, \dots, v_n$ be the branch length vectors.

Dimension $p$ = # nodes in support (union) tree.

For tree $T_i$, define the tree topology vector $y_i$, a $p$-dimensional binary vector where

$y_{ik} = I(v_{ik} > 0), \quad k = 1, \dots, p.$

Goal: PCA method for $y_1, \dots, y_n$.
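A minimal sketch of the topology vector, assuming each tree is already stored as a branch-length vector over the nodes of the support tree (zero where a node is absent). The function name and the toy vectors are hypothetical.

```python
def topology_vector(branch_lengths):
    """Return y with y_k = 1 if node k is present (branch length > 0), else 0."""
    return [1 if v > 0 else 0 for v in branch_lengths]

# Two hypothetical trees over a support tree with p = 5 nodes.
v = [
    [2.0, 1.5, 0.0, 0.7, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0],
]
y = [topology_vector(v_i) for v_i in v]
```

The second tree shows the "large flat spots" issue: most of its entries are zero.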

Page 14: Principal Component Analysis of Tree Topology

14


Visualize Trees

Important Note: Tendency Towards Large Flat Spots And Bursts of Nearby Branches

Page 15: Principal Component Analysis of Tree Topology

15


Goal of Bayes Factor Model

Model Large Flat Spots as yi = 0

Page 16: Principal Component Analysis of Tree Topology

16

Proposed Method

1. Gaussian Latent Variable Model

2. Est. Corr. Matrix: Bayes Factor Model

3. PCA on Est’ed Correlation Matrix

4. Interpret in Tree Space

Page 17: Principal Component Analysis of Tree Topology

17

Proposed Method

1. Gaussian Latent Variable Model

$X_i \sim N_p(\mu, \Sigma), \quad i = 1, \dots, n$

Assume $y_{ik} = I(x_{ik} > 0,\ x_{i,pa(k)} > 0)$, $k = 1, \dots, p$, where $\mu = (\mu_1, \dots, \mu_p)' \in R^p$ and $pa(k)$ is the parent node of $k$.
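A small sketch of how a latent Gaussian vector maps to a topology vector, assuming the rule reads "node k is present when both its own latent value and its parent's latent value are positive". Treating the root as having no parent constraint is an additional assumption, and `parent` is a hypothetical encoding of the support tree.

```python
def topology_from_latent(x, parent):
    """y_k = 1 iff x_k > 0 and the latent value at the parent of k is > 0.

    parent[k] is the index of the parent of node k; the root has parent None
    and (by assumption) only needs x_k > 0.
    """
    y = []
    for k, x_k in enumerate(x):
        pa = parent[k]
        parent_ok = True if pa is None else x[pa] > 0
        y.append(1 if (x_k > 0 and parent_ok) else 0)
    return y

parent = [None, 0, 0, 1, 1]        # hypothetical 5-node support tree
x = [0.8, -0.3, 1.2, 0.5, -1.0]    # one latent draw X_i
y = topology_from_latent(x, parent)
```

Node 3 has a positive latent value but is absent because its parent (node 1) is absent, so the output is always a valid subtree.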

Page 18: Principal Component Analysis of Tree Topology

18

Proposed Method

2. Estimation of the correlation matrix by Bayesian factor model

• Estimate $\mu$ and $\Sigma$ by Bayesian factor model

Page 19: Principal Component Analysis of Tree Topology

19

Proposed Method

3. PCA with an estimated correlation matrix

• Apply PCA to the estimated correlation matrix $\Sigma$

Page 20: Principal Component Analysis of Tree Topology

20

Proposed Method

4. Estimation of projected tree

• Define projected trees on PCA directions

• Estimate the projected trees by MCMC algorithm

Page 21: Principal Component Analysis of Tree Topology

21

Bayesian Factor Model

1. Model

2. Priors

3. MCMC algorithm

4. Convergence diagnostic

Page 22: Principal Component Analysis of Tree Topology

Bayesian Factor Model

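1. Model

$X_i \mid z_{i1}, \dots, z_{iq} \sim N_p\left(\mu + \sum_{l=1}^{q} z_{il} W_l,\ \Sigma_\varepsilon\right), \quad i = 1, \dots, n,$

where $q$ is a positive integer, $W_1, \dots, W_q$ are $p$-dimensional vectors, $z_{il} \sim N(0, \sigma^2_{z,l})$, $l = 1, \dots, q$, and $\Sigma_\varepsilon = diag(\sigma^2_{\varepsilon,k},\ k = 1, \dots, p)$.

Let $\Sigma_z = diag(\sigma^2_{z,l},\ l = 1, \dots, q)$ and $\sigma^2_{\varepsilon,k} = 1$ for identifiability.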

Page 23: Principal Component Analysis of Tree Topology

Bayesian Factor Model

2. Prior

• $\mu_k \sim N(0, \sigma^2_\mu)$ iid

• $w_{lk} \sim N(0, 1)$ iid for $l \le k$ and $w_{lk} \sim \delta_0$ for $l > k$, where $W_l = (w_{l1}, \dots, w_{lp})'$.

• $\sigma^2_{z,l},\ l = 1, \dots, q \sim InvGamma(a_z, b_z)$ iid

• This prior has been proposed by Ghosh and Dunson (2009)

Page 24: Principal Component Analysis of Tree Topology

Bayesian Factor Model

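3. MCMC algorithm

• Notation

Let $R = \{X, Z, W, \sigma^2_z, \sigma^2_\varepsilon, y, \mu\}$, where $W_k = (w_{1k}, \dots, w_{qk})'$ collects the loadings of node $k$, $X = (X_1, \dots, X_n)'$, $Z = (Z_1, \dots, Z_n)'$, $W = (W_1, \dots, W_p)'$, $\sigma^2_z = (\sigma^2_{z,1}, \dots, \sigma^2_{z,q})$, and $y = (y_1, \dots, y_n)$.

• Step 1. generate $X_{ik} \mid R \setminus \{X_{ik}\}$:

- If $y_{ik} = 1$ and $y_{i,pa(k)} = 1$, generate $X_{ik} \sim N\left(\mu_k + \sum_{l=1}^{q} z_{il} w_{lk},\ \sigma^2_{\varepsilon,k}\right)$ conditional on $X_{ik} > 0$.

- If $y_{ik} = 0$ and $y_{i,pa(k)} = 1$, generate $X_{ik} \sim N\left(\mu_k + \sum_{l=1}^{q} z_{il} w_{lk},\ \sigma^2_{\varepsilon,k}\right)$ conditional on $X_{ik} < 0$.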
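Step 1 draws each latent value from a normal distribution restricted to one side of zero. A minimal sketch of such a draw by inverse-CDF sampling, using only the Python standard library. The function name, the stand-in mean 0.3 and the unit standard deviation are assumptions for illustration, not values from the slides.

```python
import random
from statistics import NormalDist

def draw_truncated_normal(mean, sd, positive, rng=random):
    """Draw from N(mean, sd^2) restricted to (0, inf) if positive, else (-inf, 0)."""
    dist = NormalDist(mean, sd)
    p0 = dist.cdf(0.0)                      # P(X <= 0)
    lo, hi = (p0, 1.0) if positive else (0.0, p0)
    u = lo + (hi - lo) * rng.random()       # uniform on the allowed CDF range
    # inv_cdf needs 0 < u < 1; this clip makes the sketch unreliable when the
    # allowed region has probability below about 1e-12.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return dist.inv_cdf(u)

rng = random.Random(0)
mean_ik = 0.3   # stands in for mu_k + sum_l z_il * w_lk
pos = [draw_truncated_normal(mean_ik, 1.0, True, rng) for _ in range(1000)]
neg = [draw_truncated_normal(mean_ik, 1.0, False, rng) for _ in range(1000)]
```

The first case corresponds to a present node (positive latent value), the second to an absent node whose parent is present.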

Page 25: Principal Component Analysis of Tree Topology

Bayesian Factor Model

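3. MCMC algorithm

• Step 2. generate

$\mu_k \mid R \setminus \{\mu_k\},\ k = 1, \dots, p \sim N(\tilde{\mu}_k, \tilde{\sigma}^2_k),$

where

$\tilde{\mu}_k = \left(\frac{1}{\sigma^2_\mu} + \frac{\sum_{i=1}^{n} y_{i,pa(k)}}{\sigma^2_{\varepsilon,k}}\right)^{-1} \frac{1}{\sigma^2_{\varepsilon,k}} \sum_{i=1}^{n} y_{i,pa(k)} \left(X_{ik} - \sum_{l=1}^{q} z_{il} w_{lk}\right)$

and

$\tilde{\sigma}^2_k = \left(\frac{1}{\sigma^2_\mu} + \frac{\sum_{i=1}^{n} y_{i,pa(k)}}{\sigma^2_{\varepsilon,k}}\right)^{-1}.$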
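Step 2 is a Gaussian update for each $\mu_k$. The sketch below assumes a conjugate N(0, prior_var) prior and assumes, as the $y_{i,pa(k)}$ terms suggest, that only trees whose parent node is present contribute. The function name, argument names and toy numbers are hypothetical.

```python
import random

def mu_k_posterior(resid, parent_present, prior_var, noise_var=1.0):
    """Posterior mean and variance of mu_k under a N(0, prior_var) prior.

    resid[i] = X_ik - sum_l z_il * w_lk
    parent_present[i] = 1 if y_{i,pa(k)} = 1, else 0
    """
    n_eff = sum(parent_present)
    precision = 1.0 / prior_var + n_eff / noise_var
    weighted_sum = sum(r for r, m in zip(resid, parent_present) if m) / noise_var
    return weighted_sum / precision, 1.0 / precision

resid = [0.9, 1.1, -5.0, 1.0]
parent_present = [1, 1, 0, 1]      # the third tree is ignored
mean, var = mu_k_posterior(resid, parent_present, prior_var=100.0)
draw = random.Random(0).gauss(mean, var ** 0.5)
```

With a diffuse prior the posterior mean is close to the average of the contributing residuals, and the variance shrinks as more trees contain the parent node.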

Page 26: Principal Component Analysis of Tree Topology

Bayesian Factor Model

3. MCMC algorithm

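• Step 3. generate

$W_k \mid R \setminus \{W_k\},\ k = 1, \dots, p \sim \left(N_{\min\{k,q\}}(m_k, V_k),\ 0_{q - \min\{k,q\}}\right),$

where

$m_k = \left(I_{\min\{k,q\}} + \frac{1}{\sigma^2_{\varepsilon,k}} Z_k' \Lambda_k Z_k\right)^{-1} \frac{1}{\sigma^2_{\varepsilon,k}} Z_k' \Lambda_k (X_k - \mu_k 1_n)$

and

$V_k = \left(I_{\min\{k,q\}} + \frac{1}{\sigma^2_{\varepsilon,k}} Z_k' \Lambda_k Z_k\right)^{-1}.$

Here $X_k = (X_{1k}, \dots, X_{nk})'$, $\Lambda_k = diag(I(y_{i,pa(k)} = 1),\ i = 1, \dots, n)$, and $Z_k$ keeps the columns of the $n \times q$ factor score matrix $Z$ selected by $diag(I(l \le \min\{k,q\}),\ l = 1, \dots, q)$.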

Page 27: Principal Component Analysis of Tree Topology

Bayesian Factor Model

3. MCMC algorithm

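• Step 4. generate

$Z_i \mid R \setminus \{Z_i\},\ i = 1, \dots, n \sim N_q(\tilde{m}_i, \tilde{\Sigma}_i),$

where

$\tilde{m}_i = \left(\Sigma_z^{-1} + W' \Delta_i \Sigma_\varepsilon^{-1} W\right)^{-1} W' \Delta_i \Sigma_\varepsilon^{-1} (X_i - \mu)$

and

$\tilde{\Sigma}_i = \left(\Sigma_z^{-1} + W' \Delta_i \Sigma_\varepsilon^{-1} W\right)^{-1},$

where $W$ is the $p \times q$ loading matrix and $\Delta_i = diag(I(y_{i,pa(k)} = 1),\ k = 1, \dots, p)$.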

Page 28: Principal Component Analysis of Tree Topology

Bayesian Factor Model

3. MCMC algorithm

• Step 5. generate

$\sigma^2_{z,l} \mid R \setminus \{\sigma^2_{z,l}\},\ l = 1, \dots, q \sim InvGamma\left(a_z + n/2,\ b_z + \sum_{i=1}^{n} z_{il}^2 / 2\right)$
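Step 5 is an inverse-gamma update for each factor variance. A minimal standard-library sketch, assuming the shape/rate parameterization InvGamma(a_z + n/2, b_z + Σ z²/2). The hyperparameter values a_z = b_z = 1 and the simulated factor scores are hypothetical.

```python
import random

def draw_sigma2_z(z_l, a_z, b_z, rng=random):
    """Draw sigma^2_{z,l} from InvGamma(a_z + n/2, b_z + sum_i z_il^2 / 2)."""
    shape = a_z + len(z_l) / 2.0
    rate = b_z + sum(z * z for z in z_l) / 2.0
    # If G ~ Gamma(shape, scale = 1/rate), then 1/G ~ InvGamma(shape, rate).
    return 1.0 / rng.gammavariate(shape, 1.0 / rate)

rng = random.Random(0)
z_l = [rng.gauss(0.0, 2.0) for _ in range(200)]   # scores with true variance 4
draws = [draw_sigma2_z(z_l, a_z=1.0, b_z=1.0, rng=rng) for _ in range(2000)]
post_mean = sum(draws) / len(draws)
```

With 200 scores the draws concentrate near the sample variance of `z_l`, which is what the update should do when the prior is weak.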

Page 29: Principal Component Analysis of Tree Topology

Bayesian Factor Model

4. Convergence diagnostic

• 100000 iterations of the MCMC algorithm after 10000 burn-in iterations

• 1000 posterior samples obtained by keeping every 100th iteration

• Trace plots, ACFs (autocorrelation functions) and histograms of three selected $\mu_k$'s and a selected $Cov(X_k, X_{k'})$

(Note $Cov(X) = W\, diag(\sigma^2_{z,l},\ l = 1, \dots, q)\, W' + \Sigma_\varepsilon$.)
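The sampling schedule and the ACF diagnostic can be sketched as follows. The chain here is only a list of iteration indices, used to check the bookkeeping; `thin` and `acf` are hypothetical helper names, not functions from the talk.

```python
def thin(chain, burn_in, every):
    """Drop the burn-in draws, then keep every `every`-th remaining draw."""
    return chain[burn_in + every - 1 :: every]

def acf(x, max_lag):
    """Sample autocorrelation of x at lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    return [
        sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / c0
        for k in range(max_lag + 1)
    ]

# Index-only stand-in for a chain: 10000 burn-in + 100000 further iterations.
chain = list(range(110_000))
kept = thin(chain, burn_in=10_000, every=100)
```

Thinning by 100 is meant to leave nearly uncorrelated draws, so the ACF of the retained samples should drop close to zero after lag 0.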

Page 30: Principal Component Analysis of Tree Topology

Bayesian Factor Model

4. Convergence diagnostic: Three $\mu_k$'s

• 100000 iterations of the MCMC algorithm after 10000 burn-in iterations

• 1000 posterior samples obtained by keeping every 100th iteration

• Trace plots, ACFs and histograms of the three selected $\mu_k$'s

Page 31: Principal Component Analysis of Tree Topology

Bayesian Factor Model

4. Convergence diagnostic: A $Cov(X_k, X_{k'})$

• 100000 iterations of the MCMC algorithm after 10000 burn-in iterations

• 1000 posterior samples obtained by keeping every 100th iteration

• Trace plots, ACFs and histograms of the three selected $\mu_k$'s (25%, 50%, 75%) and $Cov(X_k, X_{k'})$, where $Cov(X) = W\, diag(\sigma^2_{z,l},\ l = 1, \dots, q)\, W' + \Sigma_\varepsilon$

Page 32: Principal Component Analysis of Tree Topology

32

PCA

• Scree plot

Apply standard PCA to the estimated correlation matrix to obtain the $q$ PCA direction vectors $W_1, \dots, W_q$ with the corresponding eigenvalues $\lambda_1 \ge \dots \ge \lambda_q \ge 0$.
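A minimal sketch of this PCA step using NumPy's symmetric eigendecomposition. The 3 × 3 correlation matrix is a made-up stand-in for the posterior estimate, and `pca_directions` is a hypothetical helper name.

```python
import numpy as np

def pca_directions(corr, q):
    """Top-q eigenvectors (as columns) and eigenvalues of a symmetric matrix."""
    eigvals, eigvecs = np.linalg.eigh(corr)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:q]        # indices of the q largest
    return eigvecs[:, order], eigvals[order]

# Hypothetical estimated correlation matrix for p = 3 nodes.
corr = np.array([[1.0, 0.6, 0.2],
                 [0.6, 1.0, 0.1],
                 [0.2, 0.1, 1.0]])
W, lam = pca_directions(corr, q=2)
scree = lam / np.trace(corr)    # share of total variance per direction
```

Plotting `scree` against the component index gives the scree plot. Because the input is a correlation matrix, its trace equals p.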

Page 33: Principal Component Analysis of Tree Topology

33

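Visualizing Modes of Variation

• Assume that $X_i$ follows the multivariate Gaussian distribution with mean $\mu$ and covariance

$\Sigma_X = \sum_{l=1}^{q} \lambda_l W_l W_l' + \Sigma_\varepsilon,$

where $\Sigma_\varepsilon = diag(\sigma^2_{\varepsilon,k},\ k = 1, \dots, p)$ and

$\sigma^2_{\varepsilon,k} = 1 - \sum_{l=1}^{q} \lambda_l w_{lk}^2.$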
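One way to read the choice of residual variance on this slide, assuming the covariance is the low-rank PCA part plus a diagonal term and that $w_{lk}$ denotes the $k$-th entry of $W_l$: the diagonal term is exactly what is needed for the covariance to keep a unit diagonal, so it remains a correlation matrix.

```latex
\begin{aligned}
(\Sigma_X)_{kk}
  &= \sum_{l=1}^{q} \lambda_l w_{lk}^2 + \sigma^2_{\varepsilon,k} \\
  &= \sum_{l=1}^{q} \lambda_l w_{lk}^2
     + \Bigl(1 - \sum_{l=1}^{q} \lambda_l w_{lk}^2\Bigr) = 1 .
\end{aligned}
```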

Page 34: Principal Component Analysis of Tree Topology

34

Visualizing Modes of Variation

• With $(W_l,\ l = 1, \dots, q)$, $(\lambda_l,\ l = 1, \dots, q)$ and $\Sigma_\varepsilon$ fixed at the values obtained from PCA, the posterior distribution of $\mu$ as well as of $X_i$ conditional on the $y$'s is obtained by the MCMC algorithm.

Latent variable $Z_i$:

$X_i \mid Z_i \sim N_p\left(\mu + \sum_{l=1}^{q} z_{il} W_l,\ \Sigma_\varepsilon\right)$ and $Z_i \sim N_q(0,\ diag(\lambda_l,\ l = 1, \dots, q))$

All of the other parameters $\mu$, $X_i$, and $Z_i$ are estimable.

Page 35: Principal Component Analysis of Tree Topology

35

Visualizing Modes of Variation

• Obtain the posterior distribution of the projection $X_i^{(l)} = (W_l' X_i)\, W_l$, given $y_i$.

• Define the $l$-th projected tree $T_i^{(l)}$ by

$y_{ik}^{(l)} = median\{ I(X_{ik}^{(l)} > 0,\ X_{i,pa(k)}^{(l)} > 0) \mid y_i \}$
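A small sketch of the projected-tree rule, assuming posterior draws of the projected latent vector are available as plain lists. Each node's indicator is computed per draw and the node-wise median is taken. The function name, the `parent` encoding and the three toy draws are hypothetical.

```python
from statistics import median

def projected_topology(draws, parent):
    """Node-wise posterior median of I(X_k > 0, X_pa(k) > 0) over MCMC draws.

    With an even number of draws the median can be 0.5, which int() maps to 0.
    """
    p = len(parent)
    indicators = []
    for x in draws:
        row = []
        for k in range(p):
            pa = parent[k]
            ok = x[k] > 0 and (pa is None or x[pa] > 0)
            row.append(1 if ok else 0)
        indicators.append(row)
    return [int(median(col)) for col in zip(*indicators)]

parent = [None, 0, 0]                 # hypothetical 3-node support tree
draws = [
    [1.0, 0.5, -0.2],
    [0.7, -0.1, -0.4],
    [0.9, 0.3, 0.6],
]
y_proj = projected_topology(draws, parent)
```

A node is kept in the projected tree when it is present in more than half of the posterior draws.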

Page 36: Principal Component Analysis of Tree Topology

36

Center Point, μ

Page 37: Principal Component Analysis of Tree Topology

37

Approximately μ + 0.5 PC1

Page 38: Principal Component Analysis of Tree Topology

38

Approximately μ + 1.0 PC1

Page 39: Principal Component Analysis of Tree Topology

39

Approximately μ + 1.5 PC1

Page 40: Principal Component Analysis of Tree Topology

40

Approximately μ + 2.0 PC1

Page 41: Principal Component Analysis of Tree Topology

41

Center Point, μ

Page 42: Principal Component Analysis of Tree Topology

42

Approximately μ - 0.5 PC1

Page 43: Principal Component Analysis of Tree Topology

43

Approximately μ - 1.0 PC1

Page 44: Principal Component Analysis of Tree Topology

44

Approximately μ - 1.5 PC1

Page 45: Principal Component Analysis of Tree Topology

45

Approximately μ - 2.0 PC1

Page 46: Principal Component Analysis of Tree Topology

46

Visualizing Modes of Variation
• Hard to Interpret
• Scaling Issues?
• Promising and Intuitive
• Work in Progress …
• Future goals
• Improved Notion of PCA
• Tune Bayes Approach for Better Interpretation
• Integrate with Non-Neg. Matrix Factorization
• …