Page 1: A powerful low-parameter method for inferring quartets under the General Markov Model Jeremy Sumner Barbara Holland Peter Jarvis.

A powerful low-parameter method for inferring quartets under the General Markov Model

Jeremy Sumner

Barbara Holland

Peter Jarvis

Page 2:

The general Markov model (GMM)

[Figure: tree with root distribution π and transition matrices M1, M2, M3, M4, M5 on its edges]

e.g. π =

     A     C     G     T
     0.3   0.2   0.4   0.1

M =

     A     C     G     T
A    0.80  0.10  0.07  0.03
C    0.05  0.75  0.15  0.05
G    0.02  0.03  0.92  0.03
T    0.02  0.05  0.04  0.87

Page 3:

Base composition

• In the GMM the mutation transition matrices do not have to be symmetrical.

• As a consequence of this, base frequencies could be different in different taxa.

• Almost all phylogenetic methods / commonly used models cannot account for drift in base composition across the tree.
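
A minimal sketch of the point above, using the example π and M from the GMM slide: pushing the root frequencies through a non-symmetric transition matrix gives different base frequencies at the other end of the edge. The function name `propagate` is hypothetical, and because the T row of M sums to 0.98 as transcribed, the sketch renormalises each row; that step is an assumption, not part of the slides.

```python
import numpy as np

BASES = "ACGT"

# Root distribution and transition matrix copied from the GMM slide.
pi = np.array([0.3, 0.2, 0.4, 0.1])
M = np.array([
    [0.80, 0.10, 0.07, 0.03],
    [0.05, 0.75, 0.15, 0.05],
    [0.02, 0.03, 0.92, 0.03],
    [0.02, 0.05, 0.04, 0.87],  # sums to 0.98 as transcribed
])

# Assumption: renormalise rows so that M is a proper stochastic matrix.
M = M / M.sum(axis=1, keepdims=True)


def propagate(pi, M):
    """Base frequencies at the far end of an edge with transition matrix M."""
    return pi @ M


child = propagate(pi, M)
for base, before, after in zip(BASES, pi, child):
    print(f"{base}: {before:.3f} -> {after:.3f}")
```

With a symmetric M and uniform π the frequencies would stay put; here they shift, which is the base composition drift the slide refers to.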

Page 4:

The exception: Log-det distances

dxy = -ln det Fxy

x: GCCTACGTCGAAGTCGTAGCTGTGCATGCTAGCGTCTC...
y: GTCTACATCGAAGTCGTATTTGTGCATGCAACAGTCTC...

Fxy = (rows: base in x, columns: base in y)

     A   C   G   T
A    6   0   0   0
C    1   8   0   2
G    1   1   8   1
T    1   0   0   9
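
As a hedged illustration, the sketch below rebuilds the count matrix Fxy from the two sequence fragments on this slide and then evaluates -ln det. The function names are hypothetical. Scaling the counts to frequencies before taking the determinant is an assumption (the slide shows raw counts), and published log-det distances usually apply further normalisation that is not shown here.

```python
import numpy as np

BASES = "ACGT"

# The two aligned sequence fragments shown on the slide (38 sites each).
x = "GCCTACGTCGAAGTCGTAGCTGTGCATGCTAGCGTCTC"
y = "GTCTACATCGAAGTCGTATTTGTGCATGCAACAGTCTC"


def divergence_counts(x, y):
    """4x4 matrix whose (i, j) entry counts sites with base i in x and base j in y."""
    F = np.zeros((4, 4), dtype=int)
    for a, b in zip(x, y):
        F[BASES.index(a), BASES.index(b)] += 1
    return F


def logdet_distance(F):
    """-ln det of the divergence matrix, after scaling counts to frequencies."""
    freqs = F / F.sum()  # assumption: use joint frequencies rather than counts
    return -np.log(np.linalg.det(freqs))


F = divergence_counts(x, y)
d = logdet_distance(F)
print(F)
print(d)
```

The printed count matrix should agree with the Fxy table on the slide, which is a useful check that rows index the first sequence and columns the second.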

Page 5:

Markov invariants

• The log-det is the simplest example of a Markov invariant.

• JS and PJ (Jeremy Sumner and Peter Jarvis) extended the theory of Markov invariants to larger subsets of taxa:

– Tangles (3 taxa)

– Squangles (4 taxa)

– Stangles

Page 6:

Math wizards...

Page 7:

...and their magical polynomials

coefficient   indices (1-256)
     3        1   18   69   171   256
    -3        1   18   69   172   255
    -3        1   18   69   175   252
     3        1   18   69   176   251
    -3        1   18   69   187   240
     3        1   18   69   188   239
     3        1   18   69   191   236
    -3        1   18   69   192   235
    -1        1   18   71   169   256
     1        1   18   71   172   253
     1        1   18   71   173   252
    -1        1   18   71   176   249
     1        1   18   71   185   240
    -1        1   18   71   188   237
    -1        1   18   71   189   236
     1        1   18   71   192   233
     1        1   18   72   169   255
    -1        1   18   72   171   253
    -1        1   18   72   173   251
     1        1   18   72   175   249
    -1        1   18   72   185   239
     1        1   18   72   187   237
     1        1   18   72   189   235
    -1        1   18   72   191   233
    -3        1   18   73   167   256
     3        1   18   73   168   255
     3        1   18   73   175   248
    -3        1   18   73   176   247
     3        1   18   73   183   240
    -3        1   18   73   184   239
    -3        1   18   73   191   232
     3        1   18   73   192   231

and another 66,712 terms

e.g. 3*p1*p18*p73*p168*p255 = 3*pAAAA*pACAC*pCAGA*pGGCT*pTTTG
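
The index-to-pattern mapping in the example term can be sketched as follows. It assumes the 256 site patterns are numbered from 1 in base-4 order with A < C < G < T and the first taxon most significant, which is consistent with the example above (1 = AAAA, 18 = ACAC, 255 = TTTG). The function names and the dictionary `p` of pattern frequencies are hypothetical.

```python
BASES = "ACGT"


def index_to_pattern(index, n_taxa=4):
    """Map a 1-based index (1..4**n_taxa) to a site pattern such as 'ACAC'."""
    k = index - 1
    letters = []
    for _ in range(n_taxa):
        k, r = divmod(k, 4)
        letters.append(BASES[r])
    return "".join(reversed(letters))


def term_value(coefficient, indices, p):
    """Evaluate one monomial: the coefficient times the product of p[index]."""
    value = coefficient
    for i in indices:
        value *= p[i]
    return value


coefficient, indices = 3, (1, 18, 73, 168, 255)
print([index_to_pattern(i) for i in indices])

# Hypothetical input: uniform pattern frequencies, just to exercise term_value.
p = {i: 1.0 / 256 for i in range(1, 257)}
print(term_value(coefficient, indices, p))
```

A full squangle would sum `term_value` over every row of the coefficient table, including the terms not reproduced on the slide.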

Page 8:

Squangle table

[Figure: the three quartet trees, with leaves ordered 1 2 3 4, 1 3 2 4 and 1 4 3 2]

q1    q2    q3
 0    -u     u
 v     0    -v
-w     w     0

q1 + q2 + q3 = 0

Page 9:

Choosing a quartet

[Figure annotations: 0, 0, 0, 0; u, -u]

q1    q2    q3
 0    -u     u
 v     0    -v
-w     w     0

Page 10:

Choosing a quartet

[Figure annotations: 0, 0, 0, 0; u = 0]

q1    q2    q3
 0    -u     u
 v     0    -v
-w     w     0

Page 11:

Residual sum of squares

• Pick the quartet tree that minimises the residual sum of squares (RSS)

• u = max{0, (q3 - q2)/2} (v, w similar)

• The RSS are always of the form

q1^2 + [(q3 - q2)/2 - u]^2

If things are in the right order (q3 > q2) then the second term vanishes, but if they aren't then u gets set to 0
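
The rule above can be sketched in a few lines. This is an illustration under assumptions rather than the authors' code: it reads each row of the squangle table as the expected (q1, q2, q3) under one tree, and it obtains v and w from u by cycling the indices ("v, w similar"). The function names are hypothetical.

```python
def fit_quartets(q1, q2, q3):
    """Return [(param, rss)] for the three trees, following the table's cyclic pattern."""
    q = (q1, q2, q3)
    fits = []
    for i in range(3):
        # Tree i expects q[i] = 0, q[i+1] = -param and q[i+2] = +param.
        half_diff = (q[(i + 2) % 3] - q[(i + 1) % 3]) / 2
        param = max(0.0, half_diff)  # u, v or w, constrained to be non-negative
        rss = q[i] ** 2 + (half_diff - param) ** 2
        fits.append((param, rss))
    return fits


def choose_quartet(q1, q2, q3):
    """Index (0, 1 or 2) of the tree with the smallest RSS."""
    fits = fit_quartets(q1, q2, q3)
    return min(range(3), key=lambda i: fits[i][1])


# Toy values that fit the first row of the table exactly: (0, -u, u) with u = 1.
print(fit_quartets(0.0, -1.0, 1.0))
print(choose_quartet(0.0, -1.0, 1.0))
```

For the toy input the first tree has zero RSS, while the other two pay both for a non-zero residual and for having their parameter clamped to 0.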

Page 12:

Weights (I)

• Weight each quartet

wi = 1/RSSi

• A posterior probability (ish) weighting scheme for the quartets is then

pi = wi/(w1+w2+w3)
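
A short sketch of the weighting scheme, assuming the three RSS values are already available and are all strictly positive (an RSS of exactly zero would need special handling, which is not attempted here). The function name is hypothetical.

```python
def quartet_weights(rss):
    """Normalised inverse-RSS weights: w_i = 1/RSS_i, p_i = w_i / (w_1 + w_2 + w_3)."""
    w = [1.0 / r for r in rss]
    total = sum(w)
    return [wi / total for wi in w]


# Toy RSS values: the first tree fits four times better than the other two.
print(quartet_weights([1.0, 4.0, 4.0]))
```

Because the weights are normalised, only the ratios between the RSS values matter, not their overall scale.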

Page 13:

Example

MtDNA genomes, 13856 sites

((Rhea,Hippo),Platypus,Wallaroo);

q1 = 9.14e-07      q2 = -7.58e-06     q3 = 6.67e-06

u = 7.13e-06       v = 0              w = 0

RSS1 = 8.36e-13    RSS2 = 6.58e-11    RSS3 = 6.25e-11

p1 = 0.978         p2 = 0.011         p3 = 0.011
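
As a check on the arithmetic, the u, v, w and RSS values can be recomputed from the rounded q values shown on this slide. The forms used for v, w, RSS2 and RSS3 are assumed by cyclic analogy with the formula for u and RSS1; the results agree with the slide up to rounding of the displayed inputs.

```latex
\begin{aligned}
(q_3 - q_2)/2 &= \frac{6.67\times10^{-6} + 7.58\times10^{-6}}{2} = 7.125\times10^{-6} > 0
  \;\Rightarrow\; u = 7.125\times10^{-6},\\
\mathrm{RSS}_1 &= q_1^2 + \bigl[(q_3 - q_2)/2 - u\bigr]^2 = (9.14\times10^{-7})^2 \approx 8.35\times10^{-13},\\
(q_1 - q_3)/2 &\approx -2.88\times10^{-6} < 0 \;\Rightarrow\; v = 0,\qquad
\mathrm{RSS}_2 = q_2^2 + (2.88\times10^{-6})^2 \approx 6.57\times10^{-11},\\
(q_2 - q_1)/2 &\approx -4.25\times10^{-6} < 0 \;\Rightarrow\; w = 0,\qquad
\mathrm{RSS}_3 = q_3^2 + (4.25\times10^{-6})^2 \approx 6.25\times10^{-11}.
\end{aligned}
```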

Page 14:

Weights (II)

• The RSS weights give a measure of the relative support for each topology.

• It would also be useful to have a quartet weight that was related to the edge length of the middle edge of the quartet

q1    q2    q3
 0    -u     u
 v     0    -v
-w     w     0

The most likely suspect is u = (q3-q2)/2

Page 15:

[Figure, two plots, each shown alongside the squangle table:
 Plot 1: q1 (y-axis, -0.03 to 0.03) against probability of mutation on middle edge (x-axis, 0 to 0.16)
 Plot 2: q3-q2 (y-axis, -0.1 to 0.9) against probability of mutation on middle edge (x-axis, 0 to 0.16)]

Felsenstein tree, pendant short edges = 0.01, pendant long edges = 0.1

Page 16:

Basic simulation setup

[Figure: example trees for the Felsenstein zone and the Farris zone]

Jukes-Cantor model: equal base frequencies, all changes equally likely

100 data sets for each parameter choice
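
A rough sketch of how one such data set could be generated. Everything about the parameterisation is an assumption made for illustration: edges are given as total change probabilities, the internal edge is short, the Felsenstein zone puts the long edges on non-sister taxa and the Farris zone puts them on sister taxa. The function names are hypothetical and this is not the authors' simulation code.

```python
import numpy as np

BASES = "ACGT"


def jc_matrix(p):
    """Jukes-Cantor transition matrix with total change probability p on the edge."""
    M = np.full((4, 4), p / 3.0)
    np.fill_diagonal(M, 1.0 - p)
    return M


def simulate_quartet(n_sites, pendant, middle, rng):
    """Simulate an alignment on the tree ((1,2),(3,4)).

    pendant: change probabilities for the edges leading to taxa 1..4.
    middle:  change probability for the internal edge.
    Returns four strings of length n_sites.
    """
    def evolve(states, p):
        M = jc_matrix(p)
        return np.array([rng.choice(4, p=M[s]) for s in states])

    left = rng.choice(4, size=n_sites)   # node joining taxa 1 and 2, uniform base frequencies
    right = evolve(left, middle)         # node joining taxa 3 and 4
    parents = (left, left, right, right)
    leaves = [evolve(parent, p) for parent, p in zip(parents, pendant)]
    return ["".join(BASES[s] for s in leaf) for leaf in leaves]


rng = np.random.default_rng(0)
short, long_ = 0.005, 0.075
# Assumption: Felsenstein zone = long edges on non-sister taxa 1 and 3.
felsenstein = simulate_quartet(200, (long_, short, long_, short), short, rng)
# Assumption: Farris zone = long edges on sister taxa 1 and 2.
farris = simulate_quartet(200, (long_, long_, short, short), short, rng)
print(felsenstein[0][:40])
```

Repeating the call 100 times per parameter choice, with a fresh random state each time, would give one batch of replicate data sets in the sense of the slide.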

Page 17:

Simulations (I)

Testing power compared to cNJ

[Figure: four bar charts of frequency (y-axis) against #sites (x-axis: 201, 402, 804, 1608, 10000), each with bars for SQ(12)(34), SQ(13)(24), SQ(14)(23), NJ(12)(34), NJ(13)(24) and NJ(14)(23). Panels:
 Farris: short edge = 0.0025, long edge = 0.15
 Farris: short edge = 0.005, long edge = 0.075
 Felsenstein: short edge = 0.0025, long edge = 0.15
 Felsenstein: short edge = 0.005, long edge = 0.075]

Page 18:

Simulations (II)

Adding base composition drift

• Added a GC bias along the long edges

     A       C       G       T
A    *       pl*b    pl*b    pl
C    pl      *       pl      pl
G    pl      pl      *       pl
T    pl      pl*b    pl*b    *
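
The biased matrix can be built directly from the table. In this sketch the diagonal entries marked * are assumed to be whatever makes each row sum to 1, and the value pl = 0.025 is a hypothetical choice (one third of a 0.075 change probability, as under Jukes-Cantor); neither is stated on the slide. The function name is also hypothetical.

```python
import numpy as np

BASES = "ACGT"


def gc_biased_matrix(p_l, b):
    """Transition matrix with A->C, A->G, T->C and T->G multiplied by the bias b."""
    M = np.full((4, 4), p_l)
    for src in "AT":
        for dst in "CG":
            M[BASES.index(src), BASES.index(dst)] = p_l * b
    np.fill_diagonal(M, 0.0)
    # Assumption: the '*' diagonal entries make each row sum to 1.
    np.fill_diagonal(M, 1.0 - M.sum(axis=1))
    return M


M = gc_biased_matrix(0.025, 3)
print(M)

# Starting from equal base frequencies, one biased edge raises the C and G frequencies.
uniform = np.full(4, 0.25)
print(uniform @ M)
```

With b = 1 the matrix reduces to the symmetric Jukes-Cantor form, so the bias parameter alone controls how far base composition drifts along the long edges.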

Page 19:

GC bias on long edges

#Sites   method   bias = 1      2      3      4      5
200      SQ           71       59     50     39     24
         NJ           66       49     15      0      0
400      SQ           86       68     62     42     35
         NJ           86       47      8      0      0
800      SQ           93       75     63     60     36
         NJ           91       53      3      0      0
1600     SQ          100       92     79     59     38
         NJ          100       56      0      0      0
10000    SQ          100      100     95     78     50
         NJ          100       67      0      0      0

Felsenstein: short edge = 0.005, long edge = 0.075

Page 20:

Simulations (II)

Adding a proportion of invariant sites

#Sites   pInv = 0    0.1    0.2    0.3    0.4    0.5
200           73     64     60     53     48     35
400           84     78     62     57     42     41
800           93     76     70     55     47     34
1600          97     89     72     66     35     22
10000        100    100     97     63     26      4

Felsenstein: short edge = 0.005, long edge = 0.075

Page 21:

Putting it all together

• Most people want to build trees on more than 4 taxa

• Fortunately there are already several methods for going from quartets to larger trees

– Q*

– Quartet puzzling

– Any supertree method

• Or from quartets to splits graphs

– QNet

Page 22:

QNet – distance based weights

[Figure: two panels, both labelled mt genomes]

Page 23:

QNet – distance based weights

[Figure: three panels labelled 1st codon pos, 2nd, 3rd]

Page 24:

Detecting invariant sites

• The residual sum of squares (RSS) scores give an opportunity to detect invariant sites.

• Remove constant sites in order to

– Idea 1: Minimise sum of RSS

– Idea 2: Minimise minimum RSS

Page 25:

15,000 sites of which 5000 are invariable

Proportion of constant sites out of 10,000 variable sites was 0.58

constant sites   PP:                    sum RSS     min RSS
0.72             0.25   0.44   0.30     2.22E-09    7.14E-10
0.70             0.28   0.40   0.32     1.74E-09    5.40E-10
0.68             0.31   0.35   0.33     1.45E-09    3.81E-10
0.66             0.37   0.30   0.33     1.20E-09    2.46E-10
0.64             0.45   0.25   0.30     9.90E-10    1.37E-10
0.62             0.59   0.18   0.23     8.31E-10    5.85E-11
0.60             0.80   0.09   0.11     7.40E-10    1.33E-11
0.57             0.98   0.01   0.01     7.07E-10    2.06E-14
0.55             0.97   0.02   0.02     7.15E-10    1.29E-11
0.52             0.82   0.09   0.08     7.38E-10    4.26E-11
0.50             0.69   0.17   0.14     7.47E-10    7.76E-11
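
Both selection ideas from the previous slide can be applied to this table directly. The sketch below copies the three relevant columns and picks the row that minimises each criterion; the variable names are hypothetical.

```python
# Rows copied from the table: (proportion of constant sites, sum RSS, min RSS).
rows = [
    (0.72, 2.22e-09, 7.14e-10),
    (0.70, 1.74e-09, 5.40e-10),
    (0.68, 1.45e-09, 3.81e-10),
    (0.66, 1.20e-09, 2.46e-10),
    (0.64, 9.90e-10, 1.37e-10),
    (0.62, 8.31e-10, 5.85e-11),
    (0.60, 7.40e-10, 1.33e-11),
    (0.57, 7.07e-10, 2.06e-14),
    (0.55, 7.15e-10, 1.29e-11),
    (0.52, 7.38e-10, 4.26e-11),
    (0.50, 7.47e-10, 7.76e-11),
]

best_by_sum = min(rows, key=lambda r: r[1])[0]  # Idea 1: minimise sum of RSS
best_by_min = min(rows, key=lambda r: r[2])[0]  # Idea 2: minimise minimum RSS
print(best_by_sum, best_by_min)  # prints: 0.57 0.57
```

On this table both criteria select 0.57, close to the stated proportion of 0.58 among the variable sites, with the minimum RSS showing the sharper dip.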

Page 26:

Vagaries of real data

• Dealing sensibly with missing or ambiguous data

• Currently remove all sites with question marks, gaps or ambiguities over the whole alignment

• Seems better to do this on a per quartet basis
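
A per-quartet filter could look something like the sketch below. It assumes the alignment is held as a dictionary from taxon name to sequence and that only A, C, G and T count as unambiguous; the function name and the toy alignment are hypothetical.

```python
def quartet_sites(alignment, quartet):
    """Keep only the columns where all four chosen taxa have an unambiguous base.

    alignment: dict mapping taxon name -> aligned sequence (all the same length).
    quartet:   four taxon names.
    """
    seqs = [alignment[name].upper() for name in quartet]
    keep = [col for col in zip(*seqs) if all(c in "ACGT" for c in col)]
    if not keep:
        return [""] * len(seqs)
    return ["".join(chars) for chars in zip(*keep)]


# Hypothetical toy alignment: t5 is entirely ambiguous.
alignment = {
    "t1": "ACGT-A",
    "t2": "ACGTAA",
    "t3": "AC?TAA",
    "t4": "ACGTAA",
    "t5": "NNNNNN",
}
print(quartet_sites(alignment, ("t1", "t2", "t3", "t4")))
```

Filtering over the whole toy alignment would discard every column because of t5, whereas the quartet (t1, t2, t3, t4) keeps four of its six sites.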

Page 27:

Code

• R code

• Python code, creates output that can be understood by QNet

Page 28:

Simulation plans

• Compare to likelihood

• Compare to NJ with log-det distances

• Look at rates across sites instead of just proportions of invariant sites