Week 5: Distance methods, DNA and protein modelsevolution.gs.washington.edu/gs570/2010/week5.pdf ·...

Week 5: Distance methods, DNA and protein models

Genome 570

February, 2010

Week 5: Distance methods, DNA and protein models – p.1/34

Kimura 2-parameter model

β β β β

Parameters of K2P in terms of transition/transversion rati o

α = RR+1

Transition [sic] probabilities for K2P

Prob (transition|t) = 14− 1

(− 2R+1

+ 14exp

(− 2

Prob (transversion|t) = 12− 1

(− 2

R+1t).

Transition and transversion when R = 10

0.0 0.5 1.0 1.5 2.0 2.5 3.0

Total differences

Transitions

Transversions

Time (branch length)

R = 10

Transition and transversion when R = 2

0.0 0.5 1.0 1.5 2.0 2.5 3.0

Transversions

Transitions

Total differences

Time (branch length)

ML estimates for the K2P model

t = −14ln

[(1 − 2Q)(1 − 2P − Q)2

R = − ln(1−2P−Q)− ln(1−2Q) − 1

Likelihood for two species under the K2P model

L = Prob (data | t,R)

)n(1 − P − Q)n−n1−n2 Pn1

where n1 is the number of sites differing by transitions, and n2 is thenumber of sites differing by transversions. P and Q are the expectedfractions of transition and transversion differences, as given by theexpressions three screens above.

The Tamura/Nei model, F84, and HKY

To : A G C TFrom :

A − αRπG/πR + βπG βπC βπT

G αRπA/πR + βπA − βπC βπT

C βπA βπG − αYπT/πY + βπT

T βπA βπG αYπC/πY + βπC −

For the F84 model, αR = αY

For the HKY model, αR/αY = πR/πY

Transition/transversion ratio for the Tamura-Nei model

Ts = 2 αR πA πG / πR + 2 αY πC πT / πY

+ β ( πAπG + 2πC πT)

Tv = 2 β πR πY

To get Ts/Tv = R and Ts + Tv = 1,

2πRπY(1 + R)

αY =πRπYR − πAπG − πCπT

(1 + R) (πYπAπGρ + πRπCπT)

αR = ρ αY

Using fictional events to mimic the Tamura-Nei model

We imagine two types of events:

Type I:If the existing base is a purine, draw a replacement from apurine pool with bases in relative proportions πA : πG . Thisevent has rate αR.If the existing base is a pyrimidine, draw a replacement from apyrimidine pool with bases in relative proportions πC : πT .This event has rate αY.

Type II: No matter what the existing base is, replace it by a basedrawn from a pool at the overall equilibrium frequencies:πA : πC : πG : πT. This event has rate β.

Transition [sic] probabilities with the Tamura-Nei model

If the branch starts with a purine:No events exp(−(αR + β)t)

Some type I, no type II exp(−β t) (1 − exp(−αR t))

Some type II 1 − exp(−β t)

If the branch starts with a pyrimidine:No events exp(−(αY + β) t)

Some type I, no type II exp(−β t) (1 − exp(−αY t))

Some type II 1 − exp(−β t)

A transition probability

So if we want to compute the probability of getting a G given that a branchstarts with an A, we add up

The probability of no events, times 0 (as you can’t get a G from an Awith no events)

The probability of “some type I, no type II” times πG/πY (as the lasttype I event puts in a G with probability equal to the fraction of G’sout of all purines).

The probability of “some type II” times πG (as if there is any type IIevent, we thereafter have a probability of G equal to its overallexpected frequency, and further type I events don’t change that).

A transition probability

So that, for example

Prob (G|A, t) =

exp(−β t) (1 − exp(−αR t)) πG

+ (1 − exp(−β t)) πG

A more compact expression

More generally, we can use the Kronecker delta notation δij and the“Watson-Kronecker” notation εij to write

Prob (j | i, t) =

exp(−(αi + β)t) δij

+exp(−βt) (1 − exp(−αit))(

πjεijP

k εjkπk

+(1 − exp(−βt)) πj

where δij is 1 if the two bases i and j are different (0 otherwise),and εij is 1 if, of the two bases i and j, one is a purine and one is apyrimidine (0 otherwise).

Reversibility and the GTR model

πi Prob (j|i, t) = πj Prob (i|j, t)

The general time-reversible model:

To : A G C TFrom :

A − πG α πC β πT γG πA α − πC δ πT εC πA β πG δ − πT ηT πA γ πG ε πC η −

The GTR model

βπβπ

γπαπ

δπεπ

ηπ C

Standardizing the rates

2πA πG α + 2 πA πC β + 2 πA πT γ

+ 2 πG πC δ + 2 πG πT ε + 2 πC πT η = 1

General Time Reversible models – inferenceA data example (simulated under a K2P model, true distance 0.2

transition/transversion ratio = 2

A G C T total

A 93 13 3 3 112

G 10 105 3 4 122

C 6 4 113 18 141

T 7 4 21 93 125

total 116 126 140 118 500

Averaging across the diagonal ...

A G C T total

A 93 11.5 4.5 5 114

G 11.5 105 3.5 4 124

C 4.5 3.5 113 19.5 140.5T 5 4 19.5 93 121.5

total 114 124 140.5 121.5 500

Dividing each column by its sum

(column, because Pij is to be the probability of change from j to i )

0.815789 0.0927419 0.0320285 0.04115230.100877 0.846774 0.024911 0.03292180.0394737 0.0282258 0.80427 0.1604940.0438596 0.0322581 0.13879 0.765432

Rate matrix from the matrix logarithm

If the rate matrix is A,

P = eA t

so that

At = log(P

−0.212413 0.110794 0.034160 0.0467260.120512 −0.174005 0.025043 0.0355540.0421002 0.028375 −0.236980 0.2055790.0498001 0.034837 0.177778 −0.287859

Standardizing the rates

If we denote by D the diagonal matrix of observed base frequencies, andwe require that the rate of (potentially-observable) substitution is 1:

−trace(AD) = 1

We get:

t = −trace(AtD) = −trace(log(P)D

and that also gives us an estimate of the rate matrix:

A = log(P

) /− trace

(log(P)D

The rate estimates

−0.931124 0.485671 0.149741 0.2048260.528274 −0.762764 0.109776 0.1558520.184549 0.124383 −1.038820 0.9011680.218302 0.152710 0.779302 −1.261850

moderately close to the actual K2P rate matrix used in the simulationwhich was:

−1 2/3 1/6 1/62/3 −1 1/6 1/61/6 1/6 −1 2/31/6 1/6 2/3 −1

(but if any of the eigenvalues of log(P) are negative, this doesn’t workand the divergence time is estimated to be infinite).

The lattice of these modelsGeneral 12−parameter model (12)

General time−reversible model (9)

Kimura K2P (2)

Tamura−Nei (6)

HKY (5) F84 (5)

Jukes−Cantor (1)Week 5: Distance methods, DNA and protein models – p.25/34

Variation of rates of evolution across sites

L(t) =

sites∏

(∫∞

f(r) πniPmini

(r t) dr

The Gamma distribution

f(r) =1

Γ(α) βαxα−1 e−

E[x] = α β

Var[x] = α β2

To get a mean of 1, set β = 1/α so that

f(r) =αα

Γ(α)rα−1 e−αr.

so that the squared coefficient of variation is 1/α.

Gamma distributions

0.5 1 1.5 20rate

α = 0.25cv = 2

α = 1cv = 1

= 11.1111cv = 0.3

Gamma rate variation in the Jukes-Cantor modelFor example, for the Jukes-Cantor distance, to get the fraction of sitesdifferent we do

∫∞

(1 − e−

43r ut

leading to the formula for D as a function of DS

D = −3

[1 −

(1 −

)−1/α

Gamma rate variation in other modelsFor many other distances such as the Tamura-Nei family, the transitionprobabilites are of the form

Pij(t) = Aij + Bij e−bt + Cij e−ct

and integrating termwise we can make use of the fact that

[e−b r t

)−α

Dayhoff’s PAM001 matrixA R N D C Q E G H I L K M F P S T Wala arg asn asp cys gln glu gly his ile leu lys met phe pro ser thr trp

A ala 9867 2 9 10 3 8 17 21 2 6 4 2 6 2 22 35 32 0R arg 1 9913 1 0 1 10 0 0 10 3 1 19 4 1 4 6 1 8N asn 4 1 9822 36 0 4 6 6 21 3 1 13 0 1 2 20 9 1D asp 6 0 42 9859 0 6 53 6 4 1 0 3 0 0 1 5 3 0C cys 1 1 0 0 9973 0 0 0 1 1 0 0 0 0 1 5 1 0Q gln 3 9 4 5 0 9876 27 1 23 1 3 6 4 0 6 2 2 0E glu 10 0 7 56 0 35 9865 4 2 3 1 4 1 0 3 4 2 0G gly 21 1 12 11 1 3 7 9935 1 0 1 2 1 1 3 21 3 0H his 1 8 18 3 1 20 1 0 9912 0 1 1 0 2 3 1 1 1I ile 2 2 3 1 2 1 2 0 0 9872 9 2 12 7 0 1 7 0L leu 3 1 3 0 0 6 1 1 4 22 9947 2 45 13 3 1 3 4K lys 2 37 25 6 0 12 7 2 2 4 1 9926 20 0 3 8 11 0M met 1 1 0 0 0 2 0 0 0 5 8 4 9874 1 0 1 2 0F phe 1 1 1 0 0 0 0 1 2 8 6 0 4 9946 0 2 1 3P pro 13 5 2 1 1 8 3 2 5 1 2 2 1 1 9926 12 4 0S ser 28 11 34 7 11 4 6 16 2 2 1 7 4 3 17 9840 38 5T thr 22 2 13 4 1 3 2 2 1 11 2 8 6 1 5 32 9871 0W trp 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 9976Y tyr 1 0 3 0 3 0 1 0 4 1 1 0 0 21 0 1 1 2V val 13 2 1 1 3 2 2 3 3 57 11 1 17 1 3 2 10 0

The codon model

U C A G

ser stop stop

UCA UAA UGA

changing, and to what

Probabilities of change vary dependingon whether amino acid is

Goldman & Yang, MBE 1994; Muse and Weir, MBE 1994

A codon-based model of protein evolution

In each cell:Pij ij

ijP is the probability of codon change

and is the probability that the change is accepted

where (v)

AAA AAC AAG AAT ACA ACC ACG ACT AGA AGC AGG

01010000

observation:

Considerations for a protein model

Making a model for protein evolution (a not-very-practical approach)

Use a good model of DNA evolution.

Use the appropriate genetic code.

When an amino acid changes, accept it with probability that declinesas the amino acids become more different.Fit this to empirical information on protein evolution.

Take into account variation of rate from site to site.Take into account correlation of rates in adjacent sites.

How about protein structure? Secondary structure? 3D structure?

(the first four steps are the “codon model” of Goldman and Yang, 1994and Muse and Gaut, 1994, both in Molecular Biology and Evolution. Thenext two are the rate variation machinery of Yang, 1995, 1996 andFelsenstein and Churchill, 1996).

Week 5: Distance methods, DNA and protein modelsevolution.gs.washington.edu/gs570/2010/week5.pdf ·...

Documents

Transcript of Week 5: Distance methods, DNA and protein modelsevolution.gs.washington.edu/gs570/2010/week5.pdf ·...

Week 5: Distance methods, DNA and protein modelsevolution.gs.washington.edu/gs570/2008/week5.pdf · 2008-06-04 · cv = 2 α= 1 cv = 1 = 11.1111 ... For example, for the Jukes-Cantor

Recombinant DNA technology

Athens Digital Week

Business Week 2012 - Raptopoulos

DNA Mitocondrial

Business Week 2012 - Vasilaras

Distance Relay Test

2015 Holy Week Schedule

ΧΡΗΜΑ Week #587

Ανασυνδυασμένο DNA

Symbolic Data Analysis: Dissimilarity/Similarity/Distance ... · Dis/Similarity / Distance Measures Distance Measures, Similarity/Dissimilarity Matrices: Goalis to subdivide the complete

[AIESEC] Welcome Week Presentation

Happy Week #406

DNA 6η ∆ιάλεξη

Week 4 - 1 slides

Theta Week 160

Χρήμα Week #519

Europe Code Week Presentation

20120509 hsa business-week

Unit 10 - Week 9