Principal Component Analysis of Tree Topology



Yongdai Kim, Seoul National University, June 5, 2011

Presented by J. S. Marron, SAMSI


Dyck Path Challenges

Data trees not like PC projections
• Branch Lengths ≥ 0
• Big flat spots


Brain Data: Mean – 2σ₁ PC1

Careful about values < 0


Interpretation: Directions Leave Positive Orthant

(pic here)


Visualize Trees

Important Note: Tendency Towards Large Flat Spots And Bursts of Nearby Branches


Dyck Path Challenges

Data trees not like PC projections
• Branch Lengths ≥ 0
• Big flat spots

Alternate Approaches:
• Branch Length Representation
• Tree Pruning
• Non-negative Matrix Factorization
• Bayesian Factor Model


Dyck Path Challenges

Data trees not like PC projections
• Branch Lengths ≥ 0
• Big flat spots

Alternate Approaches:
• Branch Length Representation
• Tree Pruning
• Non-negative Matrix Factorization
• Bayesian Factor Model

Discussed by Dan


Dyck Path Challenges

Data trees not like PC projections
• Branch Lengths ≥ 0
• Big flat spots

Alternate Approaches:
• Branch Length Representation
• Tree Pruning
• Non-negative Matrix Factorization
• Bayesian Factor Model

Discussed by Dan


Dyck Path Challenges

Data trees not like PC projections
• Branch Lengths ≥ 0
• Big flat spots

Alternate Approaches:
• Branch Length Representation
• Tree Pruning
• Non-negative Matrix Factorization
• Bayesian Factor Model

Discussed by Lingsong


Non-negative Matrix Factorization

Ideas:

Linearly Approx. Data (as in PCA)

But Stay in Positive Orthant
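As a rough sketch of this idea (not the implementation behind the talk; scikit-learn and every parameter choice below are assumptions), non-negative matrix factorization keeps both the scores and the basis non-negative, so reconstructions never leave the positive orthant:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((50, 20))          # nonnegative data matrix, e.g. sampled Dyck paths

model = NMF(n_components=3, init="nndsvda", max_iter=500)
scores = model.fit_transform(X)   # nonnegative scores, shape (50, 3)
basis = model.components_         # nonnegative basis vectors, shape (3, 20)

X_hat = scores @ basis            # approximation stays in the positive orthant
```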


Dyck Path Challenges

Data trees not like PC projections
• Branch Lengths ≥ 0
• Big flat spots

Alternate Approaches:
• Branch Length Representation
• Tree Pruning
• Non-negative Matrix Factorization
• Bayesian Factor Model


Contents

1. Introduction

2. Proposed Method

3. Bayesian Factor Model

4. PCA

5. Estimation of Projected Trees


Introduction

• Given data $T_1, \ldots, T_n$, let $v_1, \ldots, v_n$ be the branch length vectors.

• Dimension $p$ = # nodes in the support (union) tree.

• For tree $T_i$, define the tree topology vector $y_i$, a $p$-dimensional binary vector where $y_{ik} = I(v_{ik} > 0)$, $k = 1, \ldots, p$.

• Goal: a PCA method for $y_1, \ldots, y_n$.
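A minimal sketch of this topology-vector construction, assuming the branch-length vectors are stacked as rows of a NumPy array (the array contents are illustrative):

```python
import numpy as np

# Branch-length vectors v_1, ..., v_n as rows of an (n, p) array,
# where p = number of nodes in the support (union) tree.
v = np.array([[1.2, 0.0, 0.4],
              [0.8, 0.3, 0.0]])

# Tree topology vectors: y_ik = I(v_ik > 0), indicating whether
# node k is present in tree T_i.
y = (v > 0).astype(int)
print(y)   # [[1 0 1]
           #  [1 1 0]]
```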


Visualize Trees

Important Note: Tendency Towards Large Flat Spots And Bursts of Nearby Branches


Goal of Bayes Factor Model

Model Large Flat Spots as $y_i = 0$


Proposed Method

1. Gaussian Latent Variable Model

2. Est. Corr. Matrix: Bayes Factor Model

3. PCA on Est’ed Correlation Matrix

4. Interpret in Tree Space


Proposed Method

1. Gaussian Latent Variable Model

$X_i \sim N_p(\mu, \Sigma), \quad i = 1, \ldots, n$

Assume $y_{ik} = I(x_{ik} > 0,\ x_{i,pa(k)} > 0)$, $k = 1, \ldots, p$, where $X_i = (x_{i1}, \ldots, x_{ip})' \in \mathbb{R}^p$ and $pa(k)$ is the parent node of $k$.
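A hedged simulation of this latent-variable construction; the parent map pa, the sizes, and the identity covariance are illustrative assumptions, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 5, 100
pa = [0, 0, 1, 1, 2]    # pa[k] = parent node of k; node 0 acts as its own parent (root)
mu = rng.normal(size=p)

X = rng.multivariate_normal(mu, np.eye(p), size=n)   # X_i ~ N_p(mu, Sigma), Sigma = I here

# y_ik = I(x_ik > 0, x_{i,pa(k)} > 0): node k is present only when both its
# latent coordinate and its parent's latent coordinate are positive.
y = ((X > 0) & (X[:, pa] > 0)).astype(int)
```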


Proposed Method

2. Estimation of the correlation matrix by Bayesian factor model

• Estimate $\mu$ and $\Sigma$ by the Bayesian factor model


Proposed Method

3. PCA with an estimated correlation matrix

• Apply PCA to the estimated correlation matrix $\hat{\Sigma}$


Proposed Method

4. Estimation of projected trees

• Define projected trees on PCA directions

• Estimate the projected trees by MCMC algorithm


Bayesian Factor Model

1. Model

2. Priors

3. MCMC algorithm

4. Convergence diagnostic

Bayesian Factor Model

1. Model


$X_i \mid z_{i1}, \ldots, z_{iq} \sim N_p\Big(\mu + \sum_{l=1}^{q} z_{il} W_l,\ \Psi\Big), \quad i = 1, \ldots, n,$

where $q$ is a positive integer, $W_1, \ldots, W_q$ are $p$-dimensional vectors, $z_{il} \sim N(0, \sigma_{z,l}^2)$, $l = 1, \ldots, q$, and $\Psi = \mathrm{diag}(\sigma_{\epsilon,k}^2,\ k = 1, \ldots, p)$. Let $\Sigma_z = \mathrm{diag}(\sigma_{z,l}^2,\ l = 1, \ldots, q)$; set $\sigma_{\epsilon,k}^2 = 1$ for identifiability.
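A small sketch of sampling from the factor model as reconstructed above; q, the loadings, and the factor variances are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
p, q, n = 5, 2, 100
mu = np.zeros(p)
W = rng.normal(size=(q, p))              # rows W_l: the q loading vectors in R^p
sigma_z2 = np.array([2.0, 0.5])          # factor variances sigma^2_{z,l}

Z = rng.normal(size=(n, q)) * np.sqrt(sigma_z2)   # z_il ~ N(0, sigma^2_{z,l})
eps = rng.normal(size=(n, p))                     # unit error variance, per identifiability
X = mu + Z @ W + eps                              # X_i = mu + sum_l z_il W_l + eps_i
```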

Bayesian Factor Model

2. Priors

• $\mu_k \sim N(0, \sigma_\mu^2)$, iid

• $w_{lk} \sim N(0,1)$, iid for $l < k$; $w_{kk} \sim N(0,1)$ truncated to $(0, \infty)$; and $w_{lk} = 0$ for $l > k$, where $W_l = (w_{l1}, \ldots, w_{lp})'$

• $\sigma_{z,l}^2,\ l = 1, \ldots, q \sim \mathrm{InvGamma}(a_z, b_z)$, iid

• This prior has been proposed by Ghosh and Dunson (2009)

Bayesian Factor Model

3. MCMC algorithm

• Notation. Let $R = \{X, Z, W, \sigma_z^2, \sigma_\epsilon^2, y, \mu\}$, where $W_k = (w_{1k}, \ldots, w_{qk})$, $X = (X_1, \ldots, X_n)$, $Z = (Z_1, \ldots, Z_n)$, $\sigma_z^2 = (\sigma_{z,1}^2, \ldots, \sigma_{z,q}^2)$, and $y = (y_1, \ldots, y_n)$.

• Step 1. Generate $X_{ik} \mid R \setminus \{X_{ik}\}$:

– If $y_{ik} = 1$ and $y_{i,pa(k)} = 1$, generate $X_{ik} \sim N\big(\mu_k + \sum_{l=1}^{q} z_{il} w_{lk},\ \sigma_{\epsilon,k}^2\big)$ conditional on $X_{ik} > 0$.

– If $y_{ik} = 0$ and $y_{i,pa(k)} = 1$, generate $X_{ik} \sim N\big(\mu_k + \sum_{l=1}^{q} z_{il} w_{lk},\ \sigma_{\epsilon,k}^2\big)$ conditional on $X_{ik} \le 0$.
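One way to implement the truncated-normal draws in Step 1 is scipy.stats.truncnorm, which takes standardized bounds; the helper below is an illustrative sketch under the reconstruction above, not the authors' code:

```python
import numpy as np
from scipy.stats import truncnorm

def sample_x_ik(y_ik, mean, sd=1.0, rng=None):
    """Draw X_ik from N(mean, sd^2) truncated to (0, inf) if y_ik = 1,
    or to (-inf, 0] if y_ik = 0; the parent node is assumed present."""
    if y_ik == 1:
        a, b = (0.0 - mean) / sd, np.inf     # truncnorm expects standardized bounds
    else:
        a, b = -np.inf, (0.0 - mean) / sd
    return truncnorm.rvs(a, b, loc=mean, scale=sd, random_state=rng)

x = sample_x_ik(1, mean=0.3)   # mean would be mu_k + sum_l z_il w_lk
```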

Bayesian Factor Model

3. MCMC algorithm

• Step 2. Generate $\mu_k \mid R \setminus \{\mu_k\},\ k = 1, \ldots, p \sim N(m_k, s_k^2)$, where

$m_k = s_k^2 \sum_{i=1}^{n} y_{i,pa(k)} \Big( X_{ik} - \sum_{l=1}^{q} z_{il} w_{lk} \Big)$

and

$s_k^2 = \Big( \sum_{i=1}^{n} y_{i,pa(k)} + \frac{1}{\sigma_\mu^2} \Big)^{-1}.$
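A sketch of this conjugate normal draw for $\mu_k$ (the formulas above are themselves a hedged reconstruction, and the prior variance sigma_mu2 is an assumption):

```python
import numpy as np

def sample_mu_k(x_k, resid_k, parent_on, sigma_mu2=1.0, rng=None):
    """x_k: (n,) latent values X_ik; resid_k: (n,) values sum_l z_il w_lk;
    parent_on: (n,) indicators y_{i,pa(k)}. Unit error variance assumed."""
    if rng is None:
        rng = np.random.default_rng()
    s2_k = 1.0 / (parent_on.sum() + 1.0 / sigma_mu2)    # posterior variance s_k^2
    m_k = s2_k * np.sum(parent_on * (x_k - resid_k))    # posterior mean m_k
    return rng.normal(m_k, np.sqrt(s2_k))
```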

Bayesian Factor Model

3. MCMC algorithm

• Step 3. Generate $W_k \mid R \setminus \{W_k\},\ k = 1, \ldots, p \sim N_{\min\{k,q\}}(\hat{m}_k, \hat{\Sigma}_k)$, truncated so that $w_{kk} > 0$, where

$\hat{m}_k = \hat{\Sigma}_k Z_k' (\tilde{X}_k - \mu_k 1_n)$

and

$\hat{\Sigma}_k = \big( I_{\min\{k,q\}} + Z_k' Z_k \big)^{-1}.$

Here $\tilde{X}_k = (X_{1k}, \ldots, X_{nk})'$ and $Z_k = \mathrm{diag}\big(I(y_{i,pa(k)} = 1),\ i = 1, \ldots, n\big)\, Z\, \mathrm{diag}\big(I(l \le \min\{k,q\}),\ l = 1, \ldots, q\big)$.

Bayesian Factor Model

3. MCMC algorithm

• Step 4. Generate $Z_i \mid R \setminus \{Z_i\},\ i = 1, \ldots, n \sim N_q(\hat{m}_i, \hat{\Sigma}_i)$, where

$\hat{m}_i = \hat{\Sigma}_i W' \Delta_i (X_i - \mu)$

and

$\hat{\Sigma}_i = \big( W' \Delta_i W + \Sigma_z^{-1} \big)^{-1},$

where $\Delta_i = \mathrm{diag}\big(I(y_{i,pa(k)} = 1),\ k = 1, \ldots, p\big)$.
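A sketch of this multivariate normal draw for $Z_i$, following the reconstructed formulas (names and shapes are illustrative):

```python
import numpy as np

def sample_z_i(W, x_i, mu, delta_i, sigma_z2, rng=None):
    """W: (q, p) loadings; x_i, mu: (p,); delta_i: (p,) indicators I(y_{i,pa(k)} = 1);
    sigma_z2: (q,) factor variances."""
    if rng is None:
        rng = np.random.default_rng()
    D = np.diag(delta_i.astype(float))
    cov = np.linalg.inv(W @ D @ W.T + np.diag(1.0 / sigma_z2))  # Sigma_i
    mean = cov @ W @ D @ (x_i - mu)                             # m_i
    return rng.multivariate_normal(mean, cov)
```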

Bayesian Factor Model

3. MCMC algorithm

• Step 5. Generate

$\sigma_{z,l}^2 \mid R \setminus \{\sigma_{z,l}^2\},\ l = 1, \ldots, q \sim \mathrm{InvGamma}\Big( \frac{a_z + n}{2},\ \frac{b_z + \sum_{i=1}^{n} z_{il}^2}{2} \Big).$
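Step 5 is a standard inverse-gamma update; with SciPy it might look as follows (the hyperparameters a_z and b_z are placeholders):

```python
import numpy as np
from scipy.stats import invgamma

def sample_sigma_z2_l(z_l, a_z=1.0, b_z=1.0):
    """z_l: (n,) factor scores z_il for a fixed factor l."""
    shape = (a_z + len(z_l)) / 2.0               # (a_z + n) / 2
    scale = (b_z + np.sum(z_l ** 2)) / 2.0       # (b_z + sum_i z_il^2) / 2
    return invgamma.rvs(shape, scale=scale)
```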

Bayesian Factor Model

4. Convergence diagnostic.

• 100,000 iterations of the MCMC algorithm after 10,000 burn-in iterations

• 1,000 posterior samples, obtained at every 100th iteration

• Trace plots, ACF (autocorrelation functions), and histograms of three selected $\mu_k$'s and a selected $\mathrm{Cov}(X_k, X_{k'})$

(Note $\mathrm{Cov}(X) = W\, \mathrm{diag}(\sigma_{z,l}^2,\ l = 1, \ldots, q)\, W' + \Psi$.)

Bayesian Factor Model

4. Convergence diagnostic: three $\mu_k$'s

• 100,000 iterations of the MCMC algorithm after 10,000 burn-in iterations

• 1,000 posterior samples, obtained at every 100th iteration

• Trace plots, ACF, and histograms of the three selected $\mu_k$'s

Bayesian Factor Model

4. Convergence diagnostic: a $\mathrm{Cov}(X_k, X_{k'})$

• 100,000 iterations of the MCMC algorithm after 10,000 burn-in iterations

• 1,000 posterior samples, obtained at every 100th iteration

• Trace plots, ACF, and histograms of the three selected $\mu_k$'s (25%, 50%, 75%) and $\mathrm{Cov}(X_k, X_{k'})$, where $\mathrm{Cov}(X) = W\, \mathrm{diag}(\sigma_{z,l}^2,\ l = 1, \ldots, q)\, W' + \Psi$


PCA

• Scree plot (figure omitted)

• Apply standard PCA to the estimated correlation matrix to obtain the PCA direction vectors $\hat{W}_1, \ldots, \hat{W}_q$, with the corresponding eigenvalues $\hat{\lambda}_1 \ge \cdots \ge \hat{\lambda}_q \ge 0$.
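PCA on an estimated correlation matrix reduces to an eigendecomposition; a minimal sketch, assuming corr_hat is a symmetric positive semidefinite NumPy array:

```python
import numpy as np

def pca_directions(corr_hat):
    """Return eigenvalues lambda_1 >= ... >= lambda_q >= 0 (the scree plot
    heights) and the PCA direction vectors W_1, ..., W_q as columns."""
    eigvals, eigvecs = np.linalg.eigh(corr_hat)   # ascending order
    order = np.argsort(eigvals)[::-1]             # sort descending
    return eigvals[order], eigvecs[:, order]
```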


Visualizing Modes of Variation

• Assume that $X_i$ follows the multivariate Gaussian distribution with mean $\mu$ and covariance

$\Sigma_X = \sum_{l=1}^{q} \sigma_{z,l}^2 W_l W_l' + \Psi,$

where $\Psi = \mathrm{diag}(\psi_k^2,\ k = 1, \ldots, p)$ and $\psi_k^2 = 1 - \sum_{l=1}^{q} \sigma_{z,l}^2 w_{lk}^2.$
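A one-line check of this diagonal constraint, with illustrative loadings and factor variances (not values from the talk):

```python
import numpy as np

W = np.array([[0.6, 0.2, 0.1],        # illustrative (q, p) loadings
              [0.1, 0.5, 0.3]])
sigma_z2 = np.array([1.0, 0.8])       # illustrative factor variances

# psi_k^2 = 1 - sum_l sigma^2_{z,l} w_lk^2, so that diag(Sigma_X) = 1
psi2 = 1.0 - (sigma_z2[:, None] * W**2).sum(axis=0)
print(psi2)   # [0.632 0.76  0.918]
```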


Visualizing Modes of Variation

• With $(W_l,\ l = 1, \ldots, q)$, $(\sigma_{z,l}^2,\ l = 1, \ldots, q)$, and $\mu$ fixed at the values obtained from PCA, the posterior distributions of $X_i$ as well as $Z_i$ conditional on the $y_i$'s are obtained by the MCMC algorithm.

• Latent variables: $X_i \mid Z_i \sim N_p\big(\mu + \sum_{l=1}^{q} z_{il} W_l,\ \Psi\big)$ and $Z_i \sim N_q\big(0,\ \mathrm{diag}(\sigma_{z,l}^2,\ l = 1, \ldots, q)\big)$

• All of the other parameters, $X_i$ and $Z_i$, are estimable.


Visualizing Modes of Variation

• Obtain the posterior distribution of the projection $X_{li} = (W_l' X_i)\, W_l$, given $y_i$.

• Define the $l$-th projected tree $T_i^{(l)}$ by

$y_{ik}^{(l)} = \mathrm{median}\big\{ I\big(X_{lik} > 0,\ X_{li,pa(k)} > 0\big) \,\big|\, y_i \big\}.$
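A hedged sketch of this projected-tree estimate: for each node, take the posterior median of the presence indicator over MCMC draws of the projected latent vector (function and argument names are made up):

```python
import numpy as np

def projected_tree(x_proj_samples, pa):
    """x_proj_samples: (n_mcmc, p) posterior draws of the l-th projection of X_i;
    pa: parent map pa[k]. Returns the node vector y_i^(l) of the projected tree."""
    present = (x_proj_samples > 0) & (x_proj_samples[:, pa] > 0)  # indicator per draw
    return (np.median(present.astype(float), axis=0) >= 0.5).astype(int)
```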


Center Point, μ


Approximately μ + 0.5 PC1


Approximately μ + 1.0 PC1


Approximately μ + 1.5 PC1


Approximately μ + 2.0 PC1


Center Point, μ


Approximately μ - 0.5 PC1


Approximately μ - 1.0 PC1


Approximately μ - 1.5 PC1


Approximately μ - 2.0 PC1


Visualizing Modes of Variation

• Hard to Interpret
• Scaling Issues?
• Promising and Intuitive
• Work in Progress …

Future goals:

• Improved Notion of PCA
• Tune Bayes Approach for Better Interpretation
• Integrate with Non-Negative Matrix Factorization
• …