Week 5: Distance methods, DNA and protein modelsevolution.gs.washington.edu/gs570/2008/week5.pdf ·...
Transcript of Week 5: Distance methods, DNA and protein modelsevolution.gs.washington.edu/gs570/2008/week5.pdf ·...
Week 5: Distance methods, DNA and protein models
Genome 570
April, 2008
Week 5: Distance methods, DNA and protein models – p.1/34
Kimura 2-parameter model
A G
C T
α
α
β β β β
Week 5: Distance methods, DNA and protein models – p.2/34
Parameters of K2P in terms of transition/transversion ratio
α = RR+1
β =(
12
)1
R+1
Week 5: Distance methods, DNA and protein models – p.3/34
Transition [sic] probabilities for K2P
Prob (transition|t) = 14− 1
2exp
(− 2R+1
R+1t)
+ 14exp
(− 2
R+1t)
Prob (transversion|t) = 12− 1
2exp
(− 2
R+1t).
Week 5: Distance methods, DNA and protein models – p.4/34
Transition and transversion when R = 10
0.00
0.10
0.20
0.30
0.40
0.50
0.60
0.0 0.5 1.0 1.5 2.0 2.5 3.0
Total differences
Transitions
Transversions
Time (branch length)
Dif
fere
nce
s
R = 10
Week 5: Distance methods, DNA and protein models – p.5/34
Transition and transversion when R = 2
0.00
0.10
0.20
0.30
0.40
0.50
0.60
0.70
0.0 0.5 1.0 1.5 2.0 2.5 3.0
Transversions
Transitions
Total differences
Dif
fere
nce
s
Time (branch length)
R = 2
Week 5: Distance methods, DNA and protein models – p.6/34
ML estimates for the K2P model
t = −14ln
[(1 − 2Q)(1 − 2P − Q)2
]
R = − ln(1−2P−Q)− ln(1−2Q) − 1
2
Week 5: Distance methods, DNA and protein models – p.7/34
Likelihood for two species under the K2P model
L = Prob (data | t, R)
=(
14
)n(1 − P − Q)n−n1−n2 Pn1
(12Q
)n2
where n1 is the number of sites differing by transitions, and n2 is thenumber of sites differing by transversions. P and Q are the expected
fractions of transition and transversion differences, as given by theexpressions three screens above.
Week 5: Distance methods, DNA and protein models – p.8/34
The Tamura/Nei model, F84, and HKY
To : A G C TFrom :
A − αRπG/πR + βπG βπC βπT
G αRπA/πR + βπA − βπC βπT
C βπA βπG − αYπT/πY + βπT
T βπA βπG αYπC/πY + βπC −
For the F84 model, αR = αY
For the HKY model, αR/αY = πR/πY
Week 5: Distance methods, DNA and protein models – p.9/34
Transition/transversion ratio for the Tamura-Nei model
Ts = 2 αR πA πG / πR + 2 αY πC πT / πY
+ β ( πAπG + 2πC πT)
Tv = 2 β πR πY
To get Ts/Tv = R and Ts + Tv = 1,
β =1
2πRπY(1 + R)
αY =πRπYR − πAπG − πCπT
(1 + R) (πYπAπGρ + πRπCπT)
αR = ρ αY
Week 5: Distance methods, DNA and protein models – p.10/34
Using fictional events to mimic the Tamura-Nei model
We imagine two types of events:
Type I:
If the existing base is a purine, draw a replacement from apurine pool with bases in relative proportions πA : πG . Thisevent has rate αR.
If the existing base is a pyrimidine, draw a replacement from apyrimidine pool with bases in relative proportions πC : πT .
This event has rate αY.
Type II: No matter what the existing base is, replace it by a basedrawn from a pool at the overall equilibrium frequencies:πA : πC : πG : πT. This event has rate β.
Week 5: Distance methods, DNA and protein models – p.11/34
Transition [sic] probabilities with the Tamura-Nei model
If the branch starts with a purine:
No events exp(−(αR + β)t)
Some type I, no type II exp(−β t) (1 − exp(−αR t))
Some type II 1 − exp(−β t)
If the branch starts with a pyrimidine:
No events exp(−(αY + β) t)
Some type I, no type II exp(−β t) (1 − exp(−αY t))
Some type II 1 − exp(−β t)
Week 5: Distance methods, DNA and protein models – p.12/34
A transition probability
So if we want to compute the probability of getting a G given that abranch starts with an A, we add up
The probability of no events, times 0 (as you can’t get a G from an A
with no events)
The probability of “some type I, no type II” times πG/πY (as the
last type I event puts in a G with probability equal to the fraction ofG’s out of all purines).
The probability of “some type II” times πG (as if there is any type IIevent, we thereafter have a probability of G equal to its overall
expected frequency, and further type I events don’t change that).
Week 5: Distance methods, DNA and protein models – p.13/34
A transition probability
So that, for example
Prob (G|A, t) =
exp(−β t) (1 − exp(−αR t)) πGπR
+ (1 − exp(−β t)) πG
Week 5: Distance methods, DNA and protein models – p.14/34
A more compact expression
More generally, we can use the Kronecker delta notation δij and the
“Watson-Kronecker” notation εij to write
Prob (j | i, t) =
exp(−(αi + β)t) δij
+ exp(−βt) (1 − exp(−αit))(
πjεij∑k εjkπk
)
+ (1 − exp(−βt)) πj
Week 5: Distance methods, DNA and protein models – p.15/34
Reversibility and the GTR model
πi Prob (j|i, t) = πj Prob (i|j, t)
The general time-reversible model:
To : A G C TFrom :
A − πG α πC β πT γG πA α − πC δ πT εC πA β πG δ − πT ηT πA γ πG ε πC η −
Week 5: Distance methods, DNA and protein models – p.16/34
Standardizing the rates
2πA πG α + 2 πA πC β + 2 πA πT γ
+ 2 πG πC δ + 2 πG πT ε + 2 πC πT η = 1
Week 5: Distance methods, DNA and protein models – p.17/34
General Time Reversible models – inference
A data example (simulated under a K2P model, true distance 0.2transition/transversion ratio = 2
A G C T total
A 93 13 3 3 112
G 10 105 3 4 122
C 6 4 113 18 141
T 7 4 21 93 125
total 116 126 140 118 500
Week 5: Distance methods, DNA and protein models – p.18/34
Averaging across the diagonal ...
A G C T total
A 93 11.5 4.5 5 114
G 11.5 105 3.5 4 124
C 4.5 3.5 113 19.5 140.5T 5 4 19.5 93 121.5
total 114 124 140.5 121.5 500
Week 5: Distance methods, DNA and protein models – p.19/34
Dividing each column by its sum
(column, because Pij is to be the probability of change from j to i )
P =
0.815789 0.0927419 0.0320285 0.04115230.100877 0.846774 0.024911 0.03292180.0394737 0.0282258 0.80427 0.1604940.0438596 0.0322581 0.13879 0.765432
Week 5: Distance methods, DNA and protein models – p.20/34
Rate matrix from the matrix logarithm
If the rate matrix is A,
P = eA t
so that
At = log(P
)
=
−0.212413 0.110794 0.034160 0.0467260.120512 −0.174005 0.025043 0.0355540.0421002 0.028375 −0.236980 0.2055790.0498001 0.034837 0.177778 −0.287859
.
Week 5: Distance methods, DNA and protein models – p.21/34
Standardizing the rates
If we denote by D the diagonal matrix of observed base frequencies,
and we require that the rate of (potentially-observable) substitution is 1:
−trace(AD) = 1
We get:
t = −trace(AtD) = −trace(log(P)D
)
and that also gives us an estimate of the rate matrix:
A = log(P
)/− trace
(log(P)D
)
Week 5: Distance methods, DNA and protein models – p.22/34
The rate estimates
A =
−0.931124 0.485671 0.149741 0.2048260.528274 −0.762764 0.109776 0.1558520.184549 0.124383 −1.038820 0.9011680.218302 0.152710 0.779302 −1.261850
.
moderately close to the actual K2P rate matrix used in the simulationwhich was:
A =
−1 2/3 1/6 1/62/3 −1 1/6 1/61/6 1/6 −1 2/31/6 1/6 2/3 −1
(but if any of the eigenvalues of log(P) are negative, this doesn’t work
and the divergence time is estimated to be infinite).
Week 5: Distance methods, DNA and protein models – p.23/34
The lattice of these models
General 12−parameter model (12)
General time−reversible model (9)
Kimura K2P (2)
Tamura−Nei (6)
HKY (5) F84 (5)
Jukes−Cantor (1)Week 5: Distance methods, DNA and protein models – p.24/34
Variation of rates of evolution across sites
L(t) =
sites∏
i=1
(∫ ∞
0
f(r) πniPmini
(r t) dr
)
Week 5: Distance methods, DNA and protein models – p.25/34
The Gamma distribution
f(r) =1
Γ(α) βαxα−1 e−
xβ
E[x] = α β
Var[x] = α β2
To get a mean of 1, set β = 1/α so that
f(r) =αα
Γ(α)rα−1 e−αr.
so that the squared coefficient of variation is 1/α.
Week 5: Distance methods, DNA and protein models – p.26/34
Gamma distributions
0.5 1 1.5 20rate
freq
uen
cyα
α = 0.25cv = 2
α = 1cv = 1
= 11.1111cv = 0.3
Week 5: Distance methods, DNA and protein models – p.27/34
Gamma rate variation in the Jukes-Cantor model
For example, for the Jukes-Cantor distance, to get the fraction of sitesdifferent we do
DS =
∫∞
0
f(r)3
4
(1 − e−
43r ut
)dr
leading to the formula for D as a function of DS
D = −3
4α
[1 −
(1 −
4
3DS
)−1/α
]
Week 5: Distance methods, DNA and protein models – p.28/34
Gamma rate variation in other models
For many other distances such as the Tamura-Nei family, the transitionprobabilites are of the form
Pij(t) = Aij + Bij e−bt + Cij e−ct
and integrating termwise we can make use of the fact that
Er
[e−b r t
]=
(1 +
1
αb t
)−α
Week 5: Distance methods, DNA and protein models – p.29/34
Dayhoff’s PAM001 matrix
A R N D C Q E G H I L K M F P S T W Y Vala arg asn asp cys gln glu gly his ile leu lys met phe pro ser thr trp tyr val
A ala 9867 2 9 10 3 8 17 21 2 6 4 2 6 2 22 35 32 0 2 18R arg 1 9913 1 0 1 10 0 0 10 3 1 19 4 1 4 6 1 8 0 1N asn 4 1 9822 36 0 4 6 6 21 3 1 13 0 1 2 20 9 1 4 1D asp 6 0 42 9859 0 6 53 6 4 1 0 3 0 0 1 5 3 0 0 1C cys 1 1 0 0 9973 0 0 0 1 1 0 0 0 0 1 5 1 0 3 2Q gln 3 9 4 5 0 9876 27 1 23 1 3 6 4 0 6 2 2 0 0 1E glu 10 0 7 56 0 35 9865 4 2 3 1 4 1 0 3 4 2 0 1 2G gly 21 1 12 11 1 3 7 9935 1 0 1 2 1 1 3 21 3 0 0 5H his 1 8 18 3 1 20 1 0 9912 0 1 1 0 2 3 1 1 1 4 1I ile 2 2 3 1 2 1 2 0 0 9872 9 2 12 7 0 1 7 0 1 33L leu 3 1 3 0 0 6 1 1 4 22 9947 2 45 13 3 1 3 4 2 15K lys 2 37 25 6 0 12 7 2 2 4 1 9926 20 0 3 8 11 0 1 1M met 1 1 0 0 0 2 0 0 0 5 8 4 9874 1 0 1 2 0 0 4F phe 1 1 1 0 0 0 0 1 2 8 6 0 4 9946 0 2 1 3 28 0P pro 13 5 2 1 1 8 3 2 5 1 2 2 1 1 9926 12 4 0 0 2S ser 28 11 34 7 11 4 6 16 2 2 1 7 4 3 17 9840 38 5 2 2T thr 22 2 13 4 1 3 2 2 1 11 2 8 6 1 5 32 9871 0 2 9W trp 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 9976 1 0Y tyr 1 0 3 0 3 0 1 0 4 1 1 0 0 21 0 1 1 2 9945 1V val 13 2 1 1 3 2 2 3 3 57 11 1 17 1 3 2 10 0 2 9901
Week 5: Distance methods, DNA and protein models – p.30/34
The codon model
U C A G
U
C
A
G
phe
phe
leu
leu
leu
leu
leu
leu
ile
ile
ile
met
val
val
val
val
ser stop stop
U
C
C
U
U
C
A
G
A
G
A
G
U
C
A
G
UUU
UUC
UUA
UUG
CUU
CUC
CUA
CUG
AUU
AUC
AUA
AUG
GUU
GUC
GUA
GUG
UCA UAA UGA
changing, and to what
Probabilities of change vary dependingon whether amino acid is
Goldman & Yang, MBE 1994; Muse and Weir, MBE 1994
Week 5: Distance methods, DNA and protein models – p.31/34
A codon-based model of protein evolution
In each cell:Pij ij
(v) a
ijP is the probability of codon change
and is the probability that the change is accepted
where (v)
ij a
....
....
AAA AAC AAG AAT ACA ACC ACG ACT AGA AGC AGG
AAA
AAC
AAG
AAT
ACA
ACC
ACG
ACT
AGA
AGC
AGG
01010000
000
lys
lys
asn
asn
thr
thr
thr
thr
arg
ser
arg
observation:
Week 5: Distance methods, DNA and protein models – p.32/34
Considerations for a protein model
Making a model for protein evolution (a not-very-practical approach)
Use a good model of DNA evolution.
Use the appropriate genetic code.
When an amino acid changes, accept it with probability that
declines as the amino acids become more different.
Fit this to empirical information on protein evolution.
Take into account variation of rate from site to site.
Take into account correlation of rates in adjacent sites.
How about protein structure? Secondary structure? 3D structure?
(the first four steps are the “codon model” of Goldman and Yang, 1994
and Muse and Gaut, 1994, both in Molecular Biology and Evolution. Thenext two are the rate variation machinery of Yang, 1995, 1996 and
Felsenstein and Churchill, 1996).
Week 5: Distance methods, DNA and protein models – p.33/34
How it was done
This freeware-friendly presentation prepared with
Linux (operating system)
LaTeX (mathematical typesetting)
ps2pdf (to turn Postscript into a PDF)
Idraw (drawing program to modify plots and draw figures)
Adobe Acrobat Reader (to display the PDF in full-screen mode)
Week 5: Distance methods, DNA and protein models – p.34/34