Principal Component Analysis of Tree Topology
Transcript of Principal Component Analysis of Tree Topology
1
Principal Component Analysis of Tree Topology
Yongdai Kim, Seoul National University, June 5, 2011
Presented by J. S. Marron, SAMSI
2
UNC, Stat & OR
Dyck Path Challenges
Data trees not like PC projections:
• Branch Lengths ≥ 0
• Big flat spots
3
Brain Data: Mean – 2 σ1 PC1
Careful about values < 0
4
Interpretation: Directions Leave Positive Orthant
(pic here)
5
Visualize Trees
Important Note: Tendency Towards Large Flat Spots And Bursts of Nearby Branches
6
Dyck Path Challenges
Data trees not like PC projections:
• Branch Lengths ≥ 0
• Big flat spots
Alternate Approaches:
• Branch Length Representation
• Tree Pruning
• Non-negative Matrix Factorization
• Bayesian Factor Model
7
Dyck Path Challenges
Data trees not like PC projections:
• Branch Lengths ≥ 0
• Big flat spots
Alternate Approaches:
• Branch Length Representation
• Tree Pruning
• Non-negative Matrix Factorization
• Bayesian Factor Model
Discussed by Dan
9
Dyck Path Challenges
Data trees not like PC projections:
• Branch Lengths ≥ 0
• Big flat spots
Alternate Approaches:
• Branch Length Representation
• Tree Pruning
• Non-negative Matrix Factorization
• Bayesian Factor Model
Discussed by Lingsong
10
Non-negative Matrix Factorization
Ideas:
• Linearly Approx. Data (as in PCA)
• But Stay in Positive Orthant
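The idea on this slide (linear approximation that stays in the positive orthant) can be sketched with the classical Lee–Seung multiplicative updates. This is a generic illustration, not the talk's implementation; the data, rank `r`, and iteration count are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(8)

def nmf(V, r, iters=500, eps=1e-9):
    """Minimal NMF sketch via Lee-Seung multiplicative updates:
    approximate nonnegative V (n x p) as H @ W with H, W >= 0,
    so the approximation stays in the positive orthant (unlike PCA)."""
    n, p = V.shape
    H = rng.random((n, r)) + eps
    W = rng.random((r, p)) + eps
    for _ in range(iters):
        # Multiplicative updates keep every entry nonnegative
        W *= (H.T @ V) / (H.T @ H @ W + eps)
        H *= (V @ W.T) / (H @ W @ W.T + eps)
    return H, W

# Hypothetical nonnegative "branch-length"-style data matrix
V = rng.random((20, 6))
H, W = nmf(V, r=2)
approx = H @ W   # rank-2 approximation, entrywise >= 0
```

Because both factors are nonnegative, the reconstruction can never leave the positive orthant, which is exactly the contrast with PC projections drawn on the earlier slides.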
12
Contents
1. Introduction
2. Proposed Method
3. Bayesian Factor Model
4. PCA
5. Estimation of Projected Trees
13
Introduction
Given data trees $T_1, \dots, T_n$, let $v_1, \dots, v_n$ be the corresponding branch length vectors.
Dimension $p$ = # nodes in the support (union) tree.
For tree $T_i$, define the tree topology vector $y_i$, a $p$-dimensional binary vector, where
$$y_{ik} = I(v_{ik} \neq 0), \quad k = 1, \dots, p.$$
Goal: a PCA method for $y_1, \dots, y_n$.
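The construction of topology vectors from branch lengths can be sketched directly; the three toy trees below are hypothetical, not data from the talk.

```python
import numpy as np

def topology_vectors(V):
    """Binary tree-topology vectors y_i from branch-length vectors v_i.

    V: (n, p) array of nonnegative branch lengths, one row per tree,
       indexed by the p nodes of the support (union) tree.
    Returns an (n, p) 0/1 array with y_ik = I(v_ik != 0).
    """
    return (np.asarray(V) != 0).astype(int)

# Three toy trees over a 4-node support tree (made-up values)
V = np.array([[1.0, 0.0, 2.5, 0.0],
              [0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.0, 3.0]])
Y = topology_vectors(V)
```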
14
Visualize Trees
Important Note: Tendency Towards Large Flat Spots And Bursts of Nearby Branches
15
Goal of Bayes Factor Model
Model Large Flat Spots as yi = 0
16
Proposed Method
1. Gaussian Latent Variable Model
2. Est. Corr. Matrix: Bayes Factor Model
3. PCA on Est’ed Correlation Matrix
4. Interpret in Tree Space
17
Proposed Method
1. Gaussian Latent Variable Model
• $X_i \sim N_p(\mu, \Sigma)$, $i = 1, \dots, n$
• Assume $y_{ik} = I(x_{ik} > 0,\ y_{i,pa(k)} \neq 0)$, $k = 1, \dots, p$, where $X_i = (x_{i1}, \dots, x_{ip})' \in \mathbb{R}^p$ and $pa(k)$ is the parent node of $k$.
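A forward simulation from this latent variable model can be sketched as follows. The 4-node tree, the parent map `pa`, and the choice $\Sigma = I$ are all illustrative assumptions, not the talk's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4-node tree: node 0 is the root, pa[k] gives k's parent
# (pa[0] = -1 marks the root, which is always present).
pa = [-1, 0, 0, 1]
p, n = len(pa), 5

mu = np.zeros(p)
X = rng.normal(loc=mu, scale=1.0, size=(n, p))  # latent Gaussians, Sigma = I here

# y_ik = I(x_ik > 0 and parent present); process nodes root-first
Y = np.zeros((n, p), dtype=int)
for k in range(p):
    parent_on = 1 if pa[k] < 0 else Y[:, pa[k]]
    Y[:, k] = (X[:, k] > 0).astype(int) * parent_on
```

The root-first loop enforces the parent constraint: a node can only be present when its parent is, so every simulated $y_i$ is a valid tree topology.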
18
Proposed Method
2. Estimation of the correlation matrix by Bayesian factor model
• Estimate $\mu$ and $\Sigma$ by a Bayesian factor model
19
Proposed Method
3. PCA with an estimated correlation matrix
• Apply PCA to the estimated correlation matrix
20
Proposed Method
4. Estimation of projected trees
• Define projected trees on PCA directions
• Estimate the projected trees by MCMC algorithm
21
Bayesian Factor Model
1. Model
2. Priors
3. MCMC algorithm
4. Convergence diagnostic
22
Bayesian Factor Model
1. Model
• $X_i \mid \mu, Z_i, W \sim N_p\big(\mu + \sum_{l=1}^{q} z_{il} W_l,\ \Psi\big), \quad i = 1, \dots, n$,
where $q$ is a positive integer, $W_1, \dots, W_q$ are $p$-dimensional vectors, $z_{il} \sim N(0, \sigma^2_{z,l})$, $l = 1, \dots, q$, and $\Psi = \mathrm{diag}(\psi^2_k,\ k = 1, \dots, p)$. Let $\Sigma_z = \mathrm{diag}(\sigma^2_{z,l},\ l = 1, \dots, q)$ and set $\psi^2_k = 1$ for identifiability.
23
Bayesian Factor Model
2. Priors
• $\mu_k \overset{iid}{\sim} N(0, \sigma_0^2)$
• $w_{lk} \overset{iid}{\sim} N(0, 1)$ for $l \le k$ and $w_{lk} = 0$ for $l > k$, where $W_l = (w_{l1}, \dots, w_{lp})'$
• $\sigma^2_{z,l},\ l = 1, \dots, q \overset{iid}{\sim} \mathrm{InvGamma}(a_z, b_z)$
• This prior has been proposed by Ghosh and Dunson (2009)
24
Bayesian Factor Model
3. MCMC algorithm
• Notation: let $R = \{X, Z, W, \sigma^2_z, y, \mu\}$, where $W_k = (w_{1k}, \dots, w_{qk})'$, $X = (X_1, \dots, X_n)$, $Z = (Z_1, \dots, Z_n)$, $W = (W_1, \dots, W_p)$, $\sigma^2_z = (\sigma^2_{z,1}, \dots, \sigma^2_{z,q})$, and $y = (y_1, \dots, y_n)$.
• Step 1. Generate $X_{ik} \mid R \setminus \{X_{ik}\}$:
  - If $y_{ik} = 1$ and $y_{i,pa(k)} = 1$, generate $X_{ik} \sim N\big(\mu_k + \sum_{l=1}^{q} z_{il} w_{lk},\ 1\big)$ conditional on $X_{ik} > 0$.
  - If $y_{ik} = 0$ and $y_{i,pa(k)} = 1$, generate $X_{ik} \sim N\big(\mu_k + \sum_{l=1}^{q} z_{il} w_{lk},\ 1\big)$ conditional on $X_{ik} \le 0$.
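Step 1 requires draws from a unit-variance normal truncated to one side of zero. A minimal sketch using rejection sampling is below; the conditional mean value is hypothetical, and a production sampler would use an inverse-CDF or tail-efficient method instead.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_truncated_normal(mean, positive, rng, max_tries=10000):
    """Draw X ~ N(mean, 1) conditioned on X > 0 (positive=True) or X <= 0,
    by simple rejection sampling. A sketch of Step 1 of the Gibbs sampler;
    real implementations would use an inverse-CDF or tail sampler."""
    for _ in range(max_tries):
        x = rng.normal(mean, 1.0)
        if (x > 0) == positive:
            return x
    raise RuntimeError("rejection sampling failed; mean too far from 0")

# Update one latent X_ik given y_ik; cond_mean stands in for
# mu_k + sum_l z_il w_lk (hypothetical value)
cond_mean = 0.3
x_on  = sample_truncated_normal(cond_mean, positive=True,  rng=rng)   # y_ik = 1
x_off = sample_truncated_normal(cond_mean, positive=False, rng=rng)   # y_ik = 0
```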
25
Bayesian Factor Model
3. MCMC algorithm
• Step 2. Generate $\mu_k \mid R \setminus \{\mu_k\},\ k = 1, \dots, p \sim N(m_k, \tau_k^2)$, where
$$m_k = \tau_k^2 \sum_{i=1}^{n} y_{i,pa(k)} \Big(X_{ik} - \sum_{l=1}^{q} z_{il} w_{lk}\Big)$$
and
$$\tau_k^2 = \Big(\frac{1}{\sigma_0^2} + \sum_{i=1}^{n} y_{i,pa(k)}\Big)^{-1}.$$
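The conjugate normal update in Step 2 is a one-liner once the residuals are in hand. The values below (prior variance, indicators, residuals) are hypothetical stand-ins for one node $k$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy quantities for one node k (all values hypothetical)
sigma0_sq = 10.0                          # prior variance of mu_k
y_pa = np.array([1, 1, 0, 1])             # y_{i,pa(k)} for i = 1..n
resid = np.array([0.4, -0.2, 0.9, 1.1])   # X_ik - sum_l z_il w_lk

# Step 2: mu_k | rest ~ N(m_k, tau_k^2), with psi_k^2 = 1
tau_sq = 1.0 / (1.0 / sigma0_sq + y_pa.sum())
m_k = tau_sq * np.sum(y_pa * resid)
mu_k = rng.normal(m_k, np.sqrt(tau_sq))
```

Only observations whose parent node is present ($y_{i,pa(k)} = 1$) contribute, which is why the indicator multiplies both the residual sum and the precision.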
26
Bayesian Factor Model
3. MCMC algorithm
• Step 3. Generate
$$W_k \mid R \setminus \{W_k\},\ k = 1, \dots, p \sim \big(N_{\min(k,q)}(m_k, \Omega_k),\ 0_{q-\min(k,q)}\big),$$
i.e., the first $\min(k,q)$ components are drawn from the normal and the remaining components are fixed at 0, where
$$m_k = \Omega_k\, Z_k' \Lambda_k (\tilde X_k - \mu_k 1_n)$$
and
$$\Omega_k = \big(I_{\min(k,q)} + Z_k' \Lambda_k Z_k\big)^{-1}.$$
Here $\tilde X_k = (X_{1k}, \dots, X_{nk})'$, $\Lambda_k = \mathrm{diag}\big(I(y_{i,pa(k)} = 1),\ i = 1, \dots, n\big)$, and $Z_k$ is the $n \times \min(k,q)$ matrix with rows $(z_{i1}, \dots, z_{i,\min(k,q)})$.
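Step 3 can be sketched for a single node; all state values (factor scores, indicators, data column) are hypothetical, and the lower-triangular constraint appears as the trailing zeros of $W_k$.

```python
import numpy as np

rng = np.random.default_rng(9)

# Hypothetical current state for one node k (n = 5 trees, q = 3 factors)
n, q, k = 5, 3, 2                      # node index k (1-based, as on the slides)
dim = min(k, q)                        # only the first min(k,q) loadings are free
Z_k = rng.normal(size=(n, dim))        # rows (z_i1, ..., z_i,min(k,q))
lam = np.diag(rng.integers(0, 2, n).astype(float))  # Lambda_k = diag(I(y_{i,pa(k)}=1))
x_k = rng.normal(size=n)               # (X_1k, ..., X_nk)'
mu_k = 0.1

# Step 3: free part of W_k ~ N(m_k, Omega_k); remaining q - min(k,q) entries are 0
Omega_k = np.linalg.inv(np.eye(dim) + Z_k.T @ lam @ Z_k)
m_k = Omega_k @ Z_k.T @ lam @ (x_k - mu_k * np.ones(n))
W_k = np.zeros(q)
W_k[:dim] = rng.multivariate_normal(m_k, Omega_k)
```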
27
Bayesian Factor Model
3. MCMC algorithm
• Step 4. Generate $Z_i \mid R \setminus \{Z_i\},\ i = 1, \dots, n \sim N_q(m_i, \Omega_i)$, where
$$m_i = \Omega_i\, \mathbf{W}' \Lambda_i (X_i - \mu)$$
and
$$\Omega_i = \big(\mathbf{W}' \Lambda_i \mathbf{W} + \Sigma_z^{-1}\big)^{-1},$$
with $\mathbf{W} = (W_1, \dots, W_q)$ the $p \times q$ loading matrix and $\Lambda_i = \mathrm{diag}\big(I(y_{i,pa(k)} = 1),\ k = 1, \dots, p\big)$.
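Step 4 is a standard multivariate-normal conditional draw; the sketch below uses made-up values for one tree $i$ with $p = 4$ nodes and $q = 2$ factors.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical current state for one tree i (p = 4 nodes, q = 2 factors)
p, q = 4, 2
W = rng.normal(size=(p, q))            # loading matrix (W_1, ..., W_q)
mu = np.zeros(p)
X_i = rng.normal(size=p)
lam = np.diag([1.0, 1.0, 0.0, 1.0])    # Lambda_i = diag(I(y_{i,pa(k)} = 1))
Sigma_z_inv = np.diag([1.0 / 2.0, 1.0 / 0.5])

# Step 4: Z_i | rest ~ N_q(m_i, Omega_i)
Omega_i = np.linalg.inv(W.T @ lam @ W + Sigma_z_inv)
m_i = Omega_i @ W.T @ lam @ (X_i - mu)
Z_i = rng.multivariate_normal(m_i, Omega_i)
```

Nodes whose parent is absent are zeroed out by $\Lambda_i$, so they carry no information about the factor scores.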
28
Bayesian Factor Model
3. MCMC algorithm
• Step 5. Generate
$$\sigma^2_{z,l} \mid R \setminus \{\sigma^2_{z,l}\},\ l = 1, \dots, q \sim \mathrm{InvGamma}\Big(a_z + \frac{n}{2},\ b_z + \frac{1}{2}\sum_{i=1}^{n} z_{il}^2\Big).$$
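Step 5 is a conjugate inverse-gamma draw. The hyperparameters and factor scores below are hypothetical; since numpy has no inverse-gamma sampler, the draw inverts a gamma variate.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical hyperparameters and current factor scores for factor l
a_z, b_z = 2.0, 1.0
z_l = np.array([0.8, -1.2, 0.3, 2.0])   # z_il for i = 1..n
n = len(z_l)

# Step 5: sigma^2_{z,l} | rest ~ InvGamma(a_z + n/2, b_z + sum_i z_il^2 / 2).
# Draw Gamma(shape, scale = 1/rate) and invert it.
shape = a_z + n / 2.0
rate = b_z + 0.5 * np.sum(z_l ** 2)
sigma2_zl = 1.0 / rng.gamma(shape, 1.0 / rate)
```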
29
Bayesian Factor Model
4. Convergence diagnostics
• 100,000 iterations of the MCMC algorithm after 10,000 burn-in iterations
• 1,000 posterior samples obtained at every 100th iteration
• Trace plots, autocorrelation functions (ACF), and histograms of three selected $\mu_k$ (at the 25%, 50%, and 75% quantiles) and a selected $\mathrm{Cov}(X_k, X_{k'})$
(Note: $\mathrm{Cov}(X) = W\, \mathrm{diag}(\sigma^2_{z,l},\ l = 1, \dots, q)\, W' + \Psi$.)
32
PCA
• Apply the standard PCA to the estimated correlation matrix to obtain the PCA direction vectors $W_1, \dots, W_q$ with the corresponding eigenvalues $\lambda_1 \ge \dots \ge \lambda_q > 0$
• Scree plot
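PCA on a correlation matrix reduces to an eigendecomposition; the sketch below uses a correlation matrix estimated from made-up Gaussian data rather than the talk's Bayesian estimate.

```python
import numpy as np

rng = np.random.default_rng(6)

# Stand-in for the estimated correlation matrix (from hypothetical data)
corr = np.corrcoef(rng.normal(size=(50, 5)), rowvar=False)

# PCA on the correlation matrix: eigendecomposition, sorted by eigenvalue
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]          # lambda_1 >= lambda_2 >= ...
W_dirs = eigvecs[:, order]    # columns are PCA direction vectors W_1, W_2, ...

# Scree plot values: proportion of variance explained by each component
scree = lam / lam.sum()
```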
33
Visualizing Modes of Variation
• Assume that $X_i$ follows the multivariate Gaussian distribution with mean $\mu$ and covariance
$$\Sigma = \sum_{l=1}^{q} \lambda_l W_l W_l' + \Psi,$$
where $\Psi = \mathrm{diag}(\psi_k^2,\ k = 1, \dots, p)$ and $\psi_k^2 = 1 - \sum_{l=1}^{q} \lambda_l w_{lk}^2$.
34
Visualizing Modes of Variation
• With $(W_l,\ l = 1, \dots, q)$, $(\lambda_l,\ l = 1, \dots, q)$, and $\mu$ obtained from PCA held fixed, the posterior distributions of $X_i$ and $Z_i$ conditional on the $y_i$'s are obtained by the MCMC algorithm.
• Latent variables: $X_i \mid Z_i \sim N_p\big(\mu + \sum_{l=1}^{q} z_{il} W_l,\ \Psi\big)$ and $Z_i \sim N_q\big(0,\ \mathrm{diag}(\lambda_l,\ l = 1, \dots, q)\big)$.
• With all the other parameters fixed, $X_i$ and $Z_i$ are estimable.
35
Visualizing Modes of Variation
• Obtain the posterior distribution of the projection onto the $l$th PCA direction, $X_i^{(l)} = W_l W_l' X_i$, given $y_i$.
• Define the $l$th projected tree $T_i^{(l)}$ by
$$y_{ik}^{(l)} = \mathrm{median}\big\{ I\big(X_{ik}^{(l)} > 0,\ y_{i,pa(k)}^{(l)} \neq 0\big) \,\big|\, y_i \big\}.$$
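Turning posterior draws of the projected latent vectors into a projected tree via the posterior median indicator can be sketched as follows; the parent map, sample count, and draw means are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical posterior draws of the projected latent vectors for one tree:
# S MCMC samples over p = 4 nodes (pa[k] = parent of node k, -1 = root)
pa = [-1, 0, 0, 1]
S, p = 200, 4
X_proj = rng.normal(loc=[0.8, -0.5, 0.2, 0.6], size=(S, p))

# For each draw, form the topology indicator root-first (child needs parent on),
# then take the posterior median over draws to define the projected tree.
ind = np.zeros((S, p), dtype=int)
for k in range(p):
    parent_on = 1 if pa[k] < 0 else ind[:, pa[k]]
    ind[:, k] = (X_proj[:, k] > 0).astype(int) * parent_on
y_proj = np.median(ind, axis=0).astype(int)   # projected tree topology vector
```

Because the per-draw indicators already respect the parent constraint and the median is monotone, the resulting $y^{(l)}_i$ is again a valid tree topology.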
36
Center Point, μ
37
Approximately μ + 0.5 PC1
38
Approximately μ + 1.0 PC1
39
Approximately μ + 1.5 PC1
40
Approximately μ + 2.0 PC1
41
Center Point, μ
42
Approximately μ - 0.5 PC1
43
Approximately μ - 1.0 PC1
44
Approximately μ - 1.5 PC1
45
Approximately μ - 2.0 PC1
46
Visualizing Modes of Variation
• Hard to Interpret
• Scaling Issues?
• Promising and Intuitive
• Work in Progress …
• Future goals:
• Improved Notion of PCA
• Tune Bayes Approach for Better Interpretation
• Integrate with Non-Neg. Matrix Factorization