A powerful low-parameter method for inferring quartets under the General Markov Model
Jeremy Sumner
Barbara Holland
Peter Jarvis
The general Markov model (GMM)
[Figure: tree labelled with π and the matrices M1, M2, M3, M4, M5]
e.g. π =
      A     C     G     T
      0.3   0.2   0.4   0.1
M =
      A     C     G     T
A     0.80  0.10  0.07  0.03
C     0.05  0.75  0.15  0.05
G     0.02  0.03  0.92  0.03
T     0.02  0.05  0.04  0.87
Base composition
• In the GMM the mutation transition matrices do not have to be symmetrical.
• As a consequence of this, base frequencies could be different in different taxa.
• Almost all commonly used phylogenetic methods and models cannot account for drift in base composition across the tree.
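To make the base composition point concrete, here is a minimal sketch that pushes the example π through the example M from the GMM slide. It assumes rows of M are the "from" base and columns the "to" base; the function name `propagate` is made up for illustration. Note that the T row of M as shown sums to 0.98, so the resulting frequencies sum to slightly less than 1.

```python
# Example values copied from the GMM slide.
# Assumption: rows of M are the "from" base, columns the "to" base.
BASES = "ACGT"
pi = [0.3, 0.2, 0.4, 0.1]
M = [
    [0.80, 0.10, 0.07, 0.03],
    [0.05, 0.75, 0.15, 0.05],
    [0.02, 0.03, 0.92, 0.03],
    [0.02, 0.05, 0.04, 0.87],
]


def propagate(freqs, matrix):
    """Base frequencies at the far end of an edge: row vector freqs times matrix."""
    n = len(freqs)
    return [sum(freqs[i] * matrix[i][j] for i in range(n)) for j in range(n)]


child = propagate(pi, M)
for base, before, after in zip(BASES, pi, child):
    print(f"{base}: {before:.3f} -> {after:.3f}")
```

Because M is not symmetrical, the frequencies at the end of the edge differ from π, which is the drift in base composition described above.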
The exception: Log-det distances
dxy = -ln det Fxy
x: GCCTACGTCGAAGTCGTAGCTGTGCATGCTAGCGTCTC...
y: GTCTACATCGAAGTCGTATTTGTGCATGCAACAGTCTC...
Fxy =
      A   C   G   T
A     6   0   0   0
C     1   8   0   2
G     1   1   8   1
T     1   0   0   9
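A minimal sketch of how Fxy and the log-det can be computed from a pair of aligned sequences, using only the visible parts of the two sequences on this slide (both are truncated there). Rows index the base in the first sequence and columns the base in the second. Rescaling the counts to proportions before taking the determinant is an assumption, as is the omission of any further normalisation that practical implementations may apply; the helper names are made up for illustration.

```python
import math

BASES = "ACGT"
# Visible prefixes of the two sequences on the slide (truncated with "..." there).
x = "GCCTACGTCGAAGTCGTAGCTGTGCATGCTAGCGTCTC"
y = "GTCTACATCGAAGTCGTATTTGTGCATGCAACAGTCTC"


def divergence_counts(a, b):
    """4x4 table: entry [i][j] counts sites with base i in a and base j in b."""
    F = [[0] * 4 for _ in range(4)]
    for s, t in zip(a, b):
        F[BASES.index(s)][BASES.index(t)] += 1
    return F


def det(m):
    """Determinant by Laplace expansion along the first row (fine for 4x4)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, entry in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * entry * det(minor)
    return total


F = divergence_counts(x, y)
n_sites = sum(map(sum, F))
# Assumption: counts are rescaled to proportions before taking the determinant.
F_prop = [[count / n_sites for count in row] for row in F]
d_xy = -math.log(det(F_prop))

for row in F:
    print(row)
print("d_xy =", round(d_xy, 4))
```

The count table printed by this sketch matches the Fxy shown on the slide.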
Markov invariants
• The log det is an example (the simplest) of a Markov invariant
• JS and PJ (Jeremy Sumner and Peter Jarvis) extended the theory of Markov invariants to larger subsets of taxa
– Tangles (3 taxa)
– Squangles (4 taxa)
– Stangles
Math wizards...
...and their magical polynomials
coefficient   indices (1-256)
 3    1 18 69 171 256
-3    1 18 69 172 255
-3    1 18 69 175 252
 3    1 18 69 176 251
-3    1 18 69 187 240
 3    1 18 69 188 239
 3    1 18 69 191 236
-3    1 18 69 192 235
-1    1 18 71 169 256
 1    1 18 71 172 253
 1    1 18 71 173 252
-1    1 18 71 176 249
 1    1 18 71 185 240
-1    1 18 71 188 237
-1    1 18 71 189 236
 1    1 18 71 192 233
 1    1 18 72 169 255
-1    1 18 72 171 253
-1    1 18 72 173 251
 1    1 18 72 175 249
-1    1 18 72 185 239
 1    1 18 72 187 237
 1    1 18 72 189 235
-1    1 18 72 191 233
-3    1 18 73 167 256
 3    1 18 73 168 255
 3    1 18 73 175 248
-3    1 18 73 176 247
 3    1 18 73 183 240
-3    1 18 73 184 239
-3    1 18 73 191 232
 3    1 18 73 192 231
and another 66,712 terms
e.g. 3*p1*p18*p73*p168*p255 = 3*pAAAA*pACAC*pCAGA*pGGCT*pTTTG
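The index-to-pattern mapping in the example term is consistent with reading (index - 1) as a four-digit base-4 number with A=0, C=1, G=2, T=3. A minimal sketch of that reading (the function name `index_to_pattern` is made up for illustration):

```python
BASES = "ACGT"


def index_to_pattern(index, n_taxa=4):
    """Convert a 1-based pattern index (1..4**n_taxa) to a site pattern such as 'ACAC'."""
    if not 1 <= index <= 4 ** n_taxa:
        raise ValueError("index out of range")
    value = index - 1
    letters = []
    for _ in range(n_taxa):
        # Peel off base-4 digits, least significant (last taxon) first.
        value, digit = divmod(value, 4)
        letters.append(BASES[digit])
    return "".join(reversed(letters))


# Coefficient and indices of the example term on the slide.
coefficient, indices = 3, [1, 18, 73, 168, 255]
print(coefficient, [index_to_pattern(i) for i in indices])
# -> 3 ['AAAA', 'ACAC', 'CAGA', 'GGCT', 'TTTG']
```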
Squangle table
Quartet     q1    q2    q3
(12)(34)     0    -u     u
(13)(24)     v     0    -v
(14)(23)    -w     w     0

q1 + q2 + q3 = 0
Choosing a quartet
[Figure: two builds of a diagram relating the observed q1, q2, q3 to the expected values (0, -u, u); labelled "u", "-u" in the first build and "u = 0" in the second]
Residual sum of squares
• Pick the quartet tree that minimises the residual sum of squares (RSS)
• u = max {0, (q3-q2)/2} (v, w similar)
• The RSS are always of the form
q1^2 + [(q3-q2)/2 - u]^2
If things are in the right order (q3 > q2) then the second term vanishes, but if they aren't then u gets set to 0
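A minimal sketch of the RSS calculation for all three quartets. The slide only gives the formula for u; obtaining v and w by cycling (q1, q2, q3) through the rows (0, -u, u), (v, 0, -v), (-w, w, 0) is an assumption, and the function name `quartet_fits` is made up for illustration. The input values are the squangles from the mtDNA example in this deck.

```python
def quartet_fits(q1, q2, q3):
    """Return [(quartet, edge parameter, RSS)] for the three quartets.

    Assumption: v and w follow from u by cycling (q1, q2, q3) through the
    rows (0, -u, u), (v, 0, -v), (-w, w, 0).
    """
    fits = []
    # Each entry: name, value expected to be 0, value expected to be +param,
    # value expected to be -param.
    for name, zero, plus, minus in (
        ("(12)(34)", q1, q3, q2),  # u = max(0, (q3 - q2) / 2)
        ("(13)(24)", q2, q1, q3),  # v = max(0, (q1 - q3) / 2)
        ("(14)(23)", q3, q2, q1),  # w = max(0, (q2 - q1) / 2)
    ):
        half_diff = (plus - minus) / 2
        param = max(0.0, half_diff)
        rss = zero ** 2 + (half_diff - param) ** 2
        fits.append((name, param, rss))
    return fits


# Squangle values from the mtDNA example.
fits = quartet_fits(9.14e-07, -7.58e-06, 6.67e-06)
for name, param, rss in fits:
    print(f"{name}: parameter = {param:.3g}, RSS = {rss:.3g}")
best = min(fits, key=lambda fit: fit[2])[0]
print("chosen quartet:", best)
```

With these inputs the sketch gives u, v, w and RSS values that agree with the example to within rounding of the printed q values.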
Weights (I)
• Weight each quartet
wi = 1/RSSi
• A posterior probability (ish) weighting scheme for the quartets is then
pi = wi/(w1+w2+w3)
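The weighting scheme above can be sketched as follows. The function name `quartet_weights` is made up for illustration, and the sketch assumes every RSS is strictly positive. The inputs are the rounded RSS values from the example in this deck; the sketch is not claimed to reproduce that example's p values exactly, which may come from unrounded inputs or a slightly different scheme.

```python
def quartet_weights(rss_values):
    """Normalised inverse-RSS weights: w_i = 1/RSS_i, p_i = w_i / (w_1 + w_2 + w_3)."""
    # Assumption: every RSS is strictly positive; an RSS of exactly 0
    # would need special handling.
    inverse = [1.0 / rss for rss in rss_values]
    total = sum(inverse)
    return [w / total for w in inverse]


# Rounded RSS values as printed in the mtDNA example.
p = quartet_weights([8.36e-13, 6.58e-11, 6.25e-11])
print([round(value, 3) for value in p])
```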
Example
((Rhea,Hippo),Platypus,Wallaroo);
MtDNA genomes, 13856 sites
q1 = 9.14e-07      q2 = -7.58e-06     q3 = 6.67e-06
u = 7.13e-06       v = 0              w = 0
RSS1 = 8.36e-13    RSS2 = 6.58e-11    RSS3 = 6.25e-11
p1 = 0.978         p2 = 0.011         p3 = 0.011
Weights (II)
• The RSS weights give a measure of the relative support for each topology.
• It would also be useful to have a quartet weight that was related to the edge length of the middle edge of the quartet
The most likely suspect is u = (q3-q2)/2
[Figure: q1 plotted against the probability of mutation on the middle edge]
[Figure: q3-q2 plotted against the probability of mutation on the middle edge]
Felsenstein tree, pendant short edges = 0.01, pendant long edges = 0.1
Basic simulation setup
[Figure: quartet trees for the Felsenstein zone and the Farris zone]
Jukes-Cantor model: equal base frequencies, all changes equally likely
100 data sets for each parameter choice
Simulations (I)
Testing power compared to cNJ
[Figure: four panels, each plotting frequency against #sites (201, 402, 804, 1608, 10000) with series SQ(12)(34), SQ(13)(24), SQ(14)(23), NJ(12)(34), NJ(13)(24), NJ(14)(23)]
Panels:
– Farris: short edge = 0.0025, long edge = 0.15
– Farris: short edge = 0.005, long edge = 0.075
– Felsenstein: short edge = 0.0025, long edge = 0.15
– Felsenstein: short edge = 0.005, long edge = 0.075
Simulations (II)
Adding base composition drift
• Added a GC bias along the long edges

      A      C      G      T
A     *      pl*b   pl*b   pl
C     pl     *      pl     pl
G     pl     pl     *      pl
T     pl     pl*b   pl*b   *

GC bias on long edges (each cell: SQ / NJ)
#Sites   bias = 1     bias = 2    bias = 3   bias = 4   bias = 5
200      71 / 66      59 / 49     50 / 15    39 / 0     24 / 0
400      86 / 86      68 / 47     62 / 8     42 / 0     35 / 0
800      93 / 91      75 / 53     63 / 3     60 / 0     36 / 0
1600     100 / 100    92 / 56     79 / 0     59 / 0     38 / 0
10000    100 / 100    100 / 67    95 / 0     78 / 0     50 / 0
Felsenstein: short edge = 0.005, long edge = 0.075
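The GC-biased matrix used on the long edges can be sketched as below. The slide does not say what "*" stands for; this sketch assumes it is the diagonal entry that makes each row sum to 1. The function name `gc_biased_matrix` and the values pl = 0.02, b = 3 are made up for illustration.

```python
BASES = "ACGT"


def gc_biased_matrix(pl, b):
    """Transition matrix in which changes from A or T into C or G are multiplied by b.

    pl is the off-diagonal change probability and b the GC bias (b = 1 means no bias).
    Assumption: '*' on the slide is the diagonal entry that makes each row sum to 1.
    """
    M = [[0.0] * 4 for _ in range(4)]
    for i, src in enumerate(BASES):
        for j, dst in enumerate(BASES):
            if i == j:
                continue
            biased = src in "AT" and dst in "CG"
            M[i][j] = pl * b if biased else pl
        # Diagonal is still 0.0 here, so this is 1 minus the off-diagonal sum.
        M[i][i] = 1.0 - sum(M[i])
    return M


# Illustrative values only; not taken from the simulations.
for base, row in zip(BASES, gc_biased_matrix(pl=0.02, b=3)):
    print(base, [round(entry, 3) for entry in row])
```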
Simulations (II)
Adding a proportion of invariant sites
#Sites   pInv = 0   0.1    0.2    0.3    0.4    0.5
200      73         64     60     53     48     35
400      84         78     62     57     42     41
800      93         76     70     55     47     34
1600     97         89     72     66     35     22
10000    100        100    97     63     26     4
Felsenstein: short edge = 0.005, long edge = 0.075
Putting it all together
• Most people want to build trees on more than 4 taxa
• Fortunately there are already several methods for going from quartets to larger trees
– Q*
– Quartet puzzling
– Any supertree method
• Or from quartets to splits graphs
– QNet
QNet – distance-based weights
[Figures: QNet output for mt genomes; panel labels: 1st codon pos, 2nd, 3rd]
Detecting invariant sites
• The residual sum of squares (RSS) scores give an opportunity to detect invariant sites.
• Remove constant sites in order to
– Idea 1: Minimise sum of RSS
– Idea 2: Minimise minimum RSS
15,000 sites, of which 5000 are invariable
Proportion of constant sites out of 10,000 variable sites was 0.58

constant sites   PP                    sum RSS     min RSS
0.72             0.25   0.44   0.30    2.22E-09    7.14E-10
0.70             0.28   0.40   0.32    1.74E-09    5.40E-10
0.68             0.31   0.35   0.33    1.45E-09    3.81E-10
0.66             0.37   0.30   0.33    1.20E-09    2.46E-10
0.64             0.45   0.25   0.30    9.90E-10    1.37E-10
0.62             0.59   0.18   0.23    8.31E-10    5.85E-11
0.60             0.80   0.09   0.11    7.40E-10    1.33E-11
0.57             0.98   0.01   0.01    7.07E-10    2.06E-14
0.55             0.97   0.02   0.02    7.15E-10    1.29E-11
0.52             0.82   0.09   0.08    7.38E-10    4.26E-11
0.50             0.69   0.17   0.14    7.47E-10    7.76E-11
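Both ideas amount to picking the row of the table with the smallest score. A minimal sketch using the "constant sites", "sum RSS" and "min RSS" columns copied from the table above (variable names are made up for illustration):

```python
# (constant sites, sum RSS, min RSS), copied from the table.
rows = [
    (0.72, 2.22e-09, 7.14e-10),
    (0.70, 1.74e-09, 5.40e-10),
    (0.68, 1.45e-09, 3.81e-10),
    (0.66, 1.20e-09, 2.46e-10),
    (0.64, 9.90e-10, 1.37e-10),
    (0.62, 8.31e-10, 5.85e-11),
    (0.60, 7.40e-10, 1.33e-11),
    (0.57, 7.07e-10, 2.06e-14),
    (0.55, 7.15e-10, 1.29e-11),
    (0.52, 7.38e-10, 4.26e-11),
    (0.50, 7.47e-10, 7.76e-11),
]

best_by_sum = min(rows, key=lambda row: row[1])[0]  # Idea 1
best_by_min = min(rows, key=lambda row: row[2])[0]  # Idea 2
print(best_by_sum, best_by_min)
# -> 0.57 0.57
```

On this table both criteria select the 0.57 row, the closest tabulated value to the stated 0.58.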
Vagaries of real data
• Dealing sensibly with missing or ambiguous data
• Currently, any site with a question mark, gap or ambiguity anywhere in the alignment is removed
• Seems better to do this on a per-quartet basis
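A minimal sketch of what per-quartet filtering could look like: only the four taxa in the current quartet are checked, so a gap in some fifth taxon does not cost the quartet a site. The function name, the dict-based input format and the toy alignment are all made up for illustration; this is not the authors' code.

```python
VALID = set("ACGT")


def quartet_sites(alignment, quartet):
    """Restrict four sequences to the sites where all four taxa have A, C, G or T.

    alignment: dict mapping taxon name -> aligned sequence (hypothetical format).
    quartet: the four taxon names to compare.
    """
    rows = [alignment[name].upper() for name in quartet]
    keep = [column for column in zip(*rows) if all(ch in VALID for ch in column)]
    if not keep:
        return [""] * len(rows)
    return ["".join(chars) for chars in zip(*keep)]


# Toy alignment, made up for illustration (not real data).
toy = {
    "t1": "ACGTAC",
    "t2": "AC-TAC",
    "t3": "ACGTNC",
    "t4": "ACGTAC",
    "t5": "??GTAC",
}
print(quartet_sites(toy, ["t1", "t2", "t3", "t4"]))  # drops sites 3 and 5 only
print(quartet_sites(toy, ["t1", "t3", "t4", "t5"]))  # drops sites 1, 2 and 5
```

Filtering over the whole toy alignment would leave only sites 4 and 6, whereas each quartet here keeps three or four sites.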
Code
• R code
• Python code, creates output that can be understood by QNet
Simulation plans
• Compare to likelihood
• Compare to NJ with log-det distances
• Look at rates across sites instead of just proportions of invariant sites