A powerful low-parameter method for inferring quartets under the General Markov Model Jeremy Sumner...

A powerful low-parameter method for inferring quartets under the General Markov Model

Jeremy Sumner

Barbara Holland

Peter Jarvis

The general Markov model (GMM)

[Figure: a tree with root distribution π and transition matrices M1 to M5 on its edges]

e.g. π =

    A     C     G     T
    0.3   0.2   0.4   0.1

M =

         A      C      G      T
    A    0.80   0.10   0.07   0.03
    C    0.05   0.75   0.15   0.05
    G    0.02   0.03   0.92   0.03
    T    0.02   0.05   0.04   0.87

Base composition

• In the GMM the mutation transition matrices do not have to be symmetrical.

• As a consequence of this, base frequencies could be different in different taxa.

• Almost all phylogenetic methods / commonly used models cannot account for drift in base composition across the tree.
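To make the link between asymmetric matrices and drifting base composition concrete, here is a minimal sketch that pushes the example π through the example M from the GMM slide. It assumes that the rows of M index the starting base and the columns the ending base (the rows are the ones that sum to about 1); the function name `push_forward` is illustrative only.

```python
# Values copied from the example slide; bases are ordered A, C, G, T.
BASES = "ACGT"
pi = [0.3, 0.2, 0.4, 0.1]
M = [
    [0.80, 0.10, 0.07, 0.03],  # from A
    [0.05, 0.75, 0.15, 0.05],  # from C
    [0.02, 0.03, 0.92, 0.03],  # from G
    [0.02, 0.05, 0.04, 0.87],  # from T (sums to 0.98 as printed on the slide)
]


def push_forward(freqs, matrix):
    """Base frequencies at the far end of an edge: sum_i freqs[i] * matrix[i][j]."""
    return [sum(freqs[i] * matrix[i][j] for i in range(4)) for j in range(4)]


child = push_forward(pi, M)
for base, before, after in zip(BASES, pi, child):
    print(f"{base}: {before:.3f} -> {after:.3f}")
```

Because M is not symmetric, the frequencies at the end of the edge differ from π, which is the kind of compositional drift the bullets above describe.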

The exception: Log-det distances

$d_{xy} = -\ln \det F_{xy}$

x: GCCTACGTCGAAGTCGTAGCTGTGCATGCTAGCGTCTC...
y: GTCTACATCGAAGTCGTATTTGTGCATGCAACAGTCTC...

$F_{xy}$ = (rows: base in x, columns: base in y)

         A   C   G   T
    A    6   0   0   0
    C    1   8   0   2
    G    1   1   8   1
    T    1   0   0   9
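The count matrix above can be rebuilt from the two 38-site fragments on the slide. The sketch below does that and then applies the bare formula $d_{xy} = -\ln \det F_{xy}$. Normalising the counts to relative frequencies before taking the determinant is an assumption made here, and published log-det variants usually add further normalising terms that this sketch leaves out. The helper names are illustrative.

```python
import math

BASES = "ACGT"
# The two aligned fragments shown on the slide (38 sites each).
x = "GCCTACGTCGAAGTCGTAGCTGTGCATGCTAGCGTCTC"
y = "GTCTACATCGAAGTCGTATTTGTGCATGCAACAGTCTC"


def divergence_matrix(seq_x, seq_y):
    """Entry [i][j] counts the sites with base i in x and base j in y."""
    counts = [[0] * 4 for _ in range(4)]
    for a, b in zip(seq_x, seq_y):
        counts[BASES.index(a)][BASES.index(b)] += 1
    return counts


def det(m):
    """Determinant by Laplace expansion along the first row (fine for 4x4)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, entry in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * entry * det(minor)
    return total


F = divergence_matrix(x, y)
n_sites = len(x)
# Assumption: use relative frequencies rather than raw counts.
F_rel = [[c / n_sites for c in row] for row in F]
d_xy = -math.log(det(F_rel))
print(F)
print(d_xy)
```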

Markov invariants

• The log-det is the simplest example of a Markov invariant

• Jeremy Sumner and Peter Jarvis extended the theory of Markov invariants to larger subsets of taxa

– Tangles (3 taxa)

– Squangles (4 taxa)

– Stangles

Math wizards...

...and their magical polynomials

Each term is a coefficient followed by five indices in the range 1-256:

    coefficient   indices
     3            1 18 69 171 256
    -3            1 18 69 172 255
    -3            1 18 69 175 252
     3            1 18 69 176 251
    -3            1 18 69 187 240
     3            1 18 69 188 239
     3            1 18 69 191 236
    -3            1 18 69 192 235
    -1            1 18 71 169 256
     1            1 18 71 172 253
     1            1 18 71 173 252
    -1            1 18 71 176 249
     1            1 18 71 185 240
    -1            1 18 71 188 237
    -1            1 18 71 189 236
     1            1 18 71 192 233
     1            1 18 72 169 255
    -1            1 18 72 171 253
    -1            1 18 72 173 251
     1            1 18 72 175 249
    -1            1 18 72 185 239
     1            1 18 72 187 237
     1            1 18 72 189 235
    -1            1 18 72 191 233
    -3            1 18 73 167 256
     3            1 18 73 168 255
     3            1 18 73 175 248
    -3            1 18 73 176 247
     3            1 18 73 183 240
    -3            1 18 73 184 239
    -3            1 18 73 191 232
     3            1 18 73 192 231

and another 66,712 terms

e.g. 3*p1*p18*p73*p168*p255 = 3*pAAAA*pACAC*pCAGA*pGGCT*pTTTG
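The worked term suggests how the indices 1-256 map to site patterns: the five index/pattern pairs in the example are all consistent with reading the index minus one as a four-digit base-4 number with A=0, C=1, G=2, T=3. That mapping is inferred from the example rather than stated on the slide, and the function names below are illustrative.

```python
BASES = "ACGT"


def index_to_pattern(index):
    """Map a 1-based index in 1..256 to a four-letter site pattern."""
    if not 1 <= index <= 256:
        raise ValueError("index must be in 1..256")
    k = index - 1
    letters = []
    for _ in range(4):
        k, digit = divmod(k, 4)  # peel off the least significant base-4 digit
        letters.append(BASES[digit])
    return "".join(reversed(letters))


def pattern_to_index(pattern):
    """Inverse mapping: four-letter pattern back to its 1-based index."""
    index = 0
    for base in pattern:
        index = index * 4 + BASES.index(base)
    return index + 1


coefficient, indices = 3, [1, 18, 73, 168, 255]
patterns = [index_to_pattern(i) for i in indices]
print(coefficient, patterns)
```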

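Squangle table

[Figure: the three quartet trees on taxa 1-4: 12|34, 13|24 and 14|23]

Expected values of the squangles q1, q2, q3, one row per quartet tree in the order above:

             q1    q2    q3
    12|34     0    -u     u
    13|24     v     0    -v
    14|23    -w     w     0

$q_1 + q_2 + q_3 = 0$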

Choosing a quartet

[Figure: two versions of a diagram relating q1, q2, q3 to the squangle table rows (0, -u, u), (v, 0, -v), (-w, w, 0): one labelled with u and -u, the other with u = 0]

Residual sum of squares

• Pick the quartet tree that minimises the residual sum of squares (RSS)

• $u = \max\{0, (q_3 - q_2)/2\}$ ($v$, $w$ similar)

• The RSS are always of the form $q_1^2 + [(q_3 - q_2)/2 - u]^2$

If the squangles are in the right order ($q_3 > q_2$), then $u = (q_3 - q_2)/2$ and the second term vanishes; if they are not, $u$ is set to 0 and the second term contributes $[(q_3 - q_2)/2]^2$.
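The selection rule can be sketched in a few lines. The slide gives only the formula for $u$ and says $v$ and $w$ are similar; the sketch assumes they follow by cycling the indices according to the squangle table rows, i.e. $v = \max\{0, (q_1 - q_3)/2\}$ and $w = \max\{0, (q_2 - q_1)/2\}$. The q values used are those of the mtDNA example in this talk, and the function names are illustrative.

```python
def squangle_fit(q1, q2, q3):
    """Return [(label, fitted edge parameter, RSS), ...] for the three quartet trees."""
    # Each case: (label, squangle expected to be 0, squangle expected positive,
    # squangle expected negative). The cyclic pattern for v and w is an assumption.
    cases = [
        ("tree 1", q1, q3, q2),  # row (0, -u, u)
        ("tree 2", q2, q1, q3),  # row (v, 0, -v)
        ("tree 3", q3, q2, q1),  # row (-w, w, 0)
    ]
    fits = []
    for label, zero_q, plus_q, minus_q in cases:
        half_diff = (plus_q - minus_q) / 2
        edge = max(0.0, half_diff)  # the edge parameter is not allowed to go negative
        rss = zero_q ** 2 + (half_diff - edge) ** 2
        fits.append((label, edge, rss))
    return fits


def choose_quartet(q1, q2, q3):
    """Pick the tree with the smallest RSS."""
    return min(squangle_fit(q1, q2, q3), key=lambda fit: fit[2])


fits = squangle_fit(9.14e-07, -7.58e-06, 6.67e-06)
for label, edge, rss in fits:
    print(label, f"edge={edge:.3g}", f"RSS={rss:.3g}")
print("chosen:", choose_quartet(9.14e-07, -7.58e-06, 6.67e-06)[0])
```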

Weights (I)

• Weight each quartet: $w_i = 1/\mathrm{RSS}_i$

• A posterior probability (ish) weighting scheme for the quartets is then $p_i = w_i/(w_1 + w_2 + w_3)$

Example

((Rhea,Hippo),Platypus,Wallaroo);

MtDNA genomes, 13856 sites

    i    q_i          RSS_i       p_i
    1     9.14e-07    8.36e-13    0.978
    2    -7.58e-06    6.58e-11    0.011
    3     6.67e-06    6.25e-11    0.011

u = 7.13e-06, v = 0, w = 0
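As a check on the weighting scheme, the sketch below applies $w_i = 1/\mathrm{RSS}_i$ and the normalisation $p_i = w_i/(w_1 + w_2 + w_3)$ to the rounded RSS values printed in this example. With those rounded inputs it gives $p_1$ of roughly 0.975, close to but not identical to the 0.978 shown, so the slide may have used unrounded values or a slightly different scheme. The function name is illustrative.

```python
def quartet_weights(rss_values):
    """w_i = 1/RSS_i, normalised so the three weights sum to 1."""
    # Note: an RSS of exactly 0 would need special handling (division by zero).
    weights = [1.0 / rss for rss in rss_values]
    total = sum(weights)
    return [w / total for w in weights]


# RSS values as printed on the example slide (three significant figures).
rss = [8.36e-13, 6.58e-11, 6.25e-11]
p = quartet_weights(rss)
print([round(value, 3) for value in p])
```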

Weights (II)

• The RSS weights give a measure of the relative support for each topology.

• It would also be useful to have a quartet weight that was related to the edge length of the middle edge of the quartet

The most likely suspect is $u = (q_3 - q_2)/2$

[Figure: q1 (vertical axis, -0.03 to 0.03) against the probability of mutation on the middle edge (horizontal axis, 0 to 0.16)]

[Figure: q3-q2 (vertical axis, -0.1 to 0.9) against the probability of mutation on the middle edge (horizontal axis, 0 to 0.16)]

Felsenstein tree, pendant short edges = 0.01, pendant long edges = 0.1

Basic simulation setup

[Figure: two quartet trees, labelled Felsenstein zone and Farris zone]

Jukes-Cantor model: equal base frequencies, all changes equally likely

100 data sets for each parameter choice
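A setup like this can be sketched as follows. Several things here are assumptions rather than details from the slides: edge values are treated as the probability that the base changes along the edge, the middle edge is given the short value, the true tree is 12|34, the Felsenstein-zone tree puts the long edges on non-sister taxa (1 and 3), and the Farris-zone tree puts them on sisters (1 and 2). All function names are illustrative.

```python
import random

BASES = "ACGT"


def jc_step(base, p_change, rng):
    """Jukes-Cantor edge: with probability p_change, move to one of the other three bases."""
    if rng.random() < p_change:
        return rng.choice([b for b in BASES if b != base])
    return base


def simulate_quartet(n_sites, pendant, middle, seed=0):
    """Simulate site patterns on the tree 12|34.

    pendant: change probabilities for the edges leading to taxa 1..4.
    middle:  change probability for the internal edge.
    """
    rng = random.Random(seed)
    columns = []
    for _ in range(n_sites):
        left = rng.choice(BASES)            # internal node joining taxa 1 and 2
        right = jc_step(left, middle, rng)  # internal node joining taxa 3 and 4
        parents = (left, left, right, right)
        columns.append("".join(jc_step(par, p, rng) for par, p in zip(parents, pendant)))
    return columns


short, long_ = 0.005, 0.075
felsenstein = simulate_quartet(804, (long_, short, long_, short), short)
farris = simulate_quartet(804, (long_, long_, short, short), short)
print(felsenstein[:5], farris[:5])
```

Repeating the call with 100 different seeds per parameter choice would give the replicate data sets; each list of patterns can then be tallied into the 256 pattern frequencies that the squangles take as input.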

Simulations (I): Testing power compared to cNJ

[Figure: four bar charts of frequency against #sites (201, 402, 804, 1608, 10000), each with bars for SQ(12)(34), SQ(13)(24), SQ(14)(23), NJ(12)(34), NJ(13)(24) and NJ(14)(23). Panels:

– Farris: short edge = 0.0025, long edge = 0.15

– Farris: short edge = 0.005, long edge = 0.075

– Felsenstein: short edge = 0.0025, long edge = 0.15

– Felsenstein: short edge = 0.005, long edge = 0.075]

Simulations (II): Adding base composition drift

• Added a GC bias along the long edges

         A        C        G        T
    A    *        p_l*b    p_l*b    p_l
    C    p_l      *        p_l      p_l
    G    p_l      p_l      *        p_l
    T    p_l      p_l*b    p_l*b    *
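The biased matrix can be built programmatically. In this sketch the rows are read as the starting base and the columns as the ending base, so changes from A or T to C or G are multiplied by the bias b. Treating the diagonal "*" as whatever makes each row sum to 1 is an assumption, and the values of `p_l` and `b` used below are hypothetical, not taken from the talk.

```python
BASES = "ACGT"
GC = {"C", "G"}


def gc_biased_matrix(p_l, b):
    """Rows = starting base, columns = ending base, in the order A, C, G, T."""
    matrix = []
    for src in BASES:
        row = []
        for dst in BASES:
            if src == dst:
                row.append(0.0)  # placeholder for the diagonal "*"
            elif src not in GC and dst in GC:
                row.append(p_l * b)  # A or T changing to C or G gets the bias
            else:
                row.append(p_l)
        # Assumption: "*" is whatever makes the row sum to 1.
        row[BASES.index(src)] = 1.0 - sum(row)
        matrix.append(row)
    return matrix


# Hypothetical parameter values, for illustration only.
m = gc_biased_matrix(0.025, 3.0)
for base, row in zip(BASES, m):
    print(base, [round(value, 3) for value in row])
```

With b = 1 every off-diagonal entry equals p_l, which recovers the unbiased Jukes-Cantor form.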

GC bias on long edges

Each cell gives SQ / NJ.

    #Sites   bias = 1    2          3         4        5
    200      71 / 66     59 / 49    50 / 15   39 / 0   24 / 0
    400      86 / 86     68 / 47    62 / 8    42 / 0   35 / 0
    800      93 / 91     75 / 53    63 / 3    60 / 0   36 / 0
    1600     100 / 100   92 / 56    79 / 0    59 / 0   38 / 0
    10000    100 / 100   100 / 67   95 / 0    78 / 0   50 / 0

Felsenstein: short edge = 0.005, long edge = 0.075

Simulations (II): Adding a proportion of invariant sites

    #Sites   pInv = 0   0.1   0.2   0.3   0.4   0.5
    200      73         64    60    53    48    35
    400      84         78    62    57    42    41
    800      93         76    70    55    47    34
    1600     97         89    72    66    35    22
    10000    100        100   97    63    26    4

Felsenstein: short edge = 0.005, long edge = 0.075

Putting it all together

• Most people want to build trees on more than 4 taxa

• Fortunately there are already several methods for going from quartets to larger trees

– Q*

– Quartet puzzling

– Any supertree method

• Or from quartets to splits graphs

– QNet

QNet: distance based weights

[Figures: QNet output for the mt genomes, and for the mt genomes split by codon position (1st, 2nd, 3rd)]

Detecting invariant sites

• The residual sum of squares (RSS) scores give an opportunity to detect invariant sites.

• Remove constant sites in order to

– Idea 1: Minimise sum of RSS

– Idea 2: Minimise minimum RSS

Test data: 15,000 sites, of which 5000 are invariable. The proportion of constant sites among the 10,000 variable sites was 0.58.

    constant sites   PP: p1   p2     p3     sum RSS    min RSS
    0.72             0.25     0.44   0.30   2.22E-09   7.14E-10
    0.70             0.28     0.40   0.32   1.74E-09   5.40E-10
    0.68             0.31     0.35   0.33   1.45E-09   3.81E-10
    0.66             0.37     0.30   0.33   1.20E-09   2.46E-10
    0.64             0.45     0.25   0.30   9.90E-10   1.37E-10
    0.62             0.59     0.18   0.23   8.31E-10   5.85E-11
    0.60             0.80     0.09   0.11   7.40E-10   1.33E-11
    0.57             0.98     0.01   0.01   7.07E-10   2.06E-14
    0.55             0.97     0.02   0.02   7.15E-10   1.29E-11
    0.52             0.82     0.09   0.08   7.38E-10   4.26E-11
    0.50             0.69     0.17   0.14   7.47E-10   7.76E-11
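Both ideas amount to scanning the table for a minimum. The short sketch below does that on the values copied from the table, reading the first column as the proportion of constant sites remaining after removal (an interpretation, since the slide does not spell it out).

```python
# Rows copied from the table: (constant sites, sum RSS, min RSS).
rows = [
    (0.72, 2.22e-09, 7.14e-10),
    (0.70, 1.74e-09, 5.40e-10),
    (0.68, 1.45e-09, 3.81e-10),
    (0.66, 1.20e-09, 2.46e-10),
    (0.64, 9.90e-10, 1.37e-10),
    (0.62, 8.31e-10, 5.85e-11),
    (0.60, 7.40e-10, 1.33e-11),
    (0.57, 7.07e-10, 2.06e-14),
    (0.55, 7.15e-10, 1.29e-11),
    (0.52, 7.38e-10, 4.26e-11),
    (0.50, 7.47e-10, 7.76e-11),
]

best_by_sum = min(rows, key=lambda row: row[1])  # Idea 1: minimise sum of RSS
best_by_min = min(rows, key=lambda row: row[2])  # Idea 2: minimise minimum RSS
print("Idea 1 picks constant sites =", best_by_sum[0])
print("Idea 2 picks constant sites =", best_by_min[0])
```

Both criteria select the 0.57 row, which sits next to the 0.58 quoted for the variable sites.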

Vagaries of real data

• Dealing sensibly with missing or ambiguous data

• Currently, every site with a question mark, gap or ambiguity code in any sequence is removed from the whole alignment

• It seems better to do this filtering on a per-quartet basis
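Per-quartet filtering could look like the sketch below: for each set of four taxa, keep only the sites where all four have an unambiguous base, regardless of what the other taxa show. The toy alignment and the function name are hypothetical and are not taken from the authors' code.

```python
from itertools import combinations

UNAMBIGUOUS = set("ACGT")


def quartet_columns(alignment, quartet):
    """Yield the 4-letter pattern at each site where all four chosen taxa have A, C, G or T."""
    seqs = [alignment[name].upper() for name in quartet]
    for column in zip(*seqs):
        if all(base in UNAMBIGUOUS for base in column):
            yield "".join(column)


# Hypothetical toy alignment (not data from the talk).
alignment = {
    "t1": "ACGTAC-T",
    "t2": "ACGTACGT",
    "t3": "AC?TACGT",
    "t4": "ACGNACGT",
    "t5": "A-GTACGT",
}

for quartet in combinations(sorted(alignment), 4):
    kept = list(quartet_columns(alignment, quartet))
    print(quartet, len(kept), "sites kept")
```

In this toy case, filtering over the whole alignment would leave 4 of the 8 sites, whereas each individual quartet keeps at least 5.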

Code

• R code

• Python code, creates output that can be understood by QNet

Simulation plans

• Compare to likelihood

• Compare to NJ with log-det distances

• Look at rates across sites instead of just proportions of invariant sites
