GTR model - kaims.eti.pg.gda.plkaims.eti.pg.gda.pl/~giaro/biocomp/rest.pdfPROSITE – protein...


Page 1:

GTR model

The most general time-reversible model (GTR, General Time Reversible). Universal assumption, time reversibility:

$\pi_i P(j|i,t) = \pi_j P(i|j,t) \;\Rightarrow\; \pi_i R_{j,i} = \pi_j R_{i,j}$

Criticism: the stationary probabilities $\pi_1, \pi_2, \pi_3, \pi_4$ of letters observed in DNA are not equal, so the $\pi_i$'s should be model parameters.

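$$
R = \begin{pmatrix}
* & \beta\pi_1 & \alpha\pi_1 & \chi\pi_1 \\
\beta\pi_2 & * & \delta\pi_2 & \eta\pi_2 \\
\alpha\pi_3 & \delta\pi_3 & * & \varepsilon\pi_3 \\
\chi\pi_4 & \eta\pi_4 & \varepsilon\pi_4 & *
\end{pmatrix}
$$

Parameters: the $\pi_i$'s and 6 more independent parameters, $\alpha, \beta, \chi, \delta, \varepsilon, \eta$.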

The more general the model, the more adequate it is, but the more parameters must be set by the user.
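The structure of the GTR rate matrix can be checked numerically. The following is a minimal sketch, not taken from the slides: the function names are hypothetical, `R[j][i]` is assumed to hold the rate of the substitution i -> j (row j, column i), and the diagonal entries marked `*` are assumed to be chosen so that every column sums to zero.

```python
def gtr_rate_matrix(pi, alpha, beta, chi, delta, epsilon, eta):
    """Build a 4x4 GTR rate matrix; R[j][i] is the rate of i -> j (assumed convention)."""
    # One exchangeability parameter per unordered pair of letters (0-based indices).
    s = {(0, 1): beta, (0, 2): alpha, (0, 3): chi,
         (1, 2): delta, (1, 3): eta, (2, 3): epsilon}
    n = 4
    R = [[0.0] * n for _ in range(n)]
    for (i, j), rate in s.items():
        R[j][i] = rate * pi[j]  # substitution i -> j
        R[i][j] = rate * pi[i]  # substitution j -> i
    # Assumption: each diagonal entry makes its column sum to zero.
    for i in range(n):
        R[i][i] = -sum(R[j][i] for j in range(n) if j != i)
    return R


def is_time_reversible(pi, R, tol=1e-12):
    """Check pi_i * R[j][i] == pi_j * R[i][j] for all pairs of letters."""
    n = len(pi)
    return all(abs(pi[i] * R[j][i] - pi[j] * R[i][j]) <= tol
               for i in range(n) for j in range(n))


# Illustrative (made-up) parameter values.
pi = [0.1, 0.2, 0.3, 0.4]
R = gtr_rate_matrix(pi, alpha=1.0, beta=2.0, chi=3.0,
                    delta=4.0, epsilon=5.0, eta=6.0)
print(is_time_reversible(pi, R))
```

Because every off-diagonal rate is an exchangeability parameter times the stationary probability of the target letter, the reversibility condition holds for any choice of the six parameters.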

PROSITE – protein families database

PROSITE collects biologically significant sequence patterns obtained from multialignments of amino acid sequences:
- functional amino acid patterns,
- protein domains,
- protein families characterized by conserved motifs.

It can detect these patterns in given amino acid sequences.

Two kinds of records describing patterns:
- profiles,
- regular expressions of a special format.

Page 2:

PROSITE – protein families database

PROSITE pattern notation:
- `-` – separator between the pattern's elements,
- `V` – any letter: the one-letter code of an amino acid,
- `x` – any amino acid,
- `[…]` – one amino acid from the bracket,
- `{…}` – one amino acid, but not from the bracket,
- `e(i)` – for element e and number i: repetition of e exactly i times,
- `e(i,j)` – repetition of e exactly k times, where i ≤ k ≤ j.

Example. Pattern of a family of RNA-binding proteins:
[RK]-G-{EDRKHPCG}-[AGSCI]-[FY]-[LIVA]-x-[FYM]

Fragment of multialignment (the matching part is set apart):
- SRSLKM RGQAFVIF KEVSSAT
- KLTGRP RGVAFVRY NKREEAQ
- VGCSVH KGFAFVQY VNERNAR
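The notation above maps almost directly onto ordinary regular expressions. The sketch below is an illustration only: `prosite_to_regex` and `find_motif` are hypothetical helper names, and the translation covers just the elements listed on this slide, not the full PROSITE syntax.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern (slide notation only) into a Python regex."""
    pieces = []
    for element in pattern.split("-"):
        # Split an element into its core and an optional (i) or (i,j) repetition.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"cannot parse element {element!r}")
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":                                    # any amino acid
            rx = "."
        elif core.startswith("[") and core.endswith("]"):  # one of the listed
            rx = core
        elif core.startswith("{") and core.endswith("}"):  # none of the listed
            rx = "[^" + core[1:-1] + "]"
        elif len(core) == 1 and core.isalpha():            # one specific amino acid
            rx = core
        else:
            raise ValueError(f"unsupported element {element!r}")
        if low is not None:                                # repetition count
            rx += "{" + low + ("," + high if high else "") + "}"
        pieces.append(rx)
    return "".join(pieces)


def find_motif(pattern, sequence):
    """Return the first substring of `sequence` matching `pattern`, or None."""
    match = re.search(prosite_to_regex(pattern), sequence)
    return match.group(0) if match else None


PATTERN = "[RK]-G-{EDRKHPCG}-[AGSCI]-[FY]-[LIVA]-x-[FYM]"
for seq in ("SRSLKMRGQAFVIFKEVSSAT",
            "KLTGRPRGVAFVRYNKREEAQ",
            "VGCSVHKGFAFVQYVNERNAR"):
    print(seq, find_motif(PATTERN, seq))
```

Applied to the three multialignment rows, the search picks out the eight-letter fragment in the middle of each sequence.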

PROSITE – protein families database

Example. PROSITE pattern description.

Page 3:

Maximum likelihood

Idea: use a model of sequence evolution $P(i|j,t)$. Find a tree $T(V,E)$ with weights on edges (time lengths) $w: E \to \mathbb{R}_{\ge 0}$ for which the probability $P(w_v : v \in L \mid T, w)$ of the appearance of the observed leaf sequences $w_v$ ($v \in L$, $|w_v| = l$) is the maximum possible.

Phylogenetic analysis of species set $L$:
1. Find a gene/protein whose homologues are present in all species of $L$.
2. Make a multialignment and delete the columns containing spaces (gaps).
3. Calculating the likelihood $P(w_v : v \in L \mid T, w)$ for a given tree $T$ and weights $w$ is efficient. But a heuristic search over both the different trees and the weights of their edges ($2|L|-3$ continuous parameters) is necessary.

- the phylogenetic model as well as the final result are the most reliable,
- heuristic optimization is performed over discrete and continuous variables; the search space is very large and the numerical calculations are time-consuming.
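To give a feeling for the size of the search space, the sketch below counts the discrete and continuous parts separately. The edge count $2|L|-3$ is the one quoted above; the number of unrooted binary tree shapes, $(2n-5)!! = 3 \cdot 5 \cdots (2n-5)$ for $n \ge 3$ leaves, is a standard counting result that is not stated on the slides. The function names are hypothetical.

```python
def num_edges_unrooted(n_leaves):
    """Number of edges (continuous parameters) of an unrooted binary tree."""
    return 2 * n_leaves - 3


def num_unrooted_binary_trees(n_leaves):
    """Number of distinct unrooted binary tree shapes on labelled leaves: (2n-5)!!."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for k in range(3, 2 * n_leaves - 4, 2):  # 3 * 5 * ... * (2n - 5)
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, num_edges_unrooted(n), num_unrooted_binary_trees(n))
```

The number of tree shapes grows faster than exponentially with the number of species, which is why an exhaustive search is not an option and heuristics are used.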

Page 4:

Maximum likelihood

- each position in the words may be treated separately, so the likelihood is a product over all multialignment columns,
- if the evolution model is time-reversible (i.e. $\pi_i P(j|i,t) = \pi_j P(i|j,t)$), then adding a root at any place does not change the likelihood, so the heuristic search may be performed on unrooted trees.

Problem. How to find the likelihood for one multialignment column?

Given a binary rooted tree $T(V,E)$, weights $w: E \to \mathbb{R}_{\ge 0}$ and letters $a_u \in \Sigma$ for leaves $u \in L$. Knowing the letter $a_v = a$ in vertex $v \in V$, find the probability that the correct letters appear in the leaves as a result of evolution in the subtree $T_v$ rooted at $v$:

$P(a_u : u \in L(T_v) \mid T_v, a_v = a, w|_{E(T_v)}) = ?$

Dynamic programming, proceeding bottom-up. For a vertex $v$ whose children $x$ and $y$ are joined to it by edges of weights $w_x$ and $w_y$:

$$
P(a_u : u \in L(T_v) \mid T_v, a_v = a, w|_{E(T_v)}) =
\sum_{b \in \Sigma} P(b|a,w_x)\, P(a_u : u \in L(T_x) \mid T_x, a_x = b, w|_{E(T_x)})
\cdot
\sum_{c \in \Sigma} P(c|a,w_y)\, P(a_u : u \in L(T_y) \mid T_y, a_y = c, w|_{E(T_y)})
$$

Likelihood $= \sum_{a \in \Sigma} \pi_a \cdot P(a_u : u \in L \mid T, a_{root} = a, w)$
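The recursion above can be sketched in a few lines. The sketch rests on several assumptions that are not in the slides: the substitution model is Jukes-Cantor with unit rate (chosen only because its $P(b|a,t)$ has a short closed form), a tree is encoded as either a leaf letter or a tuple `(left, w_left, right, w_right)`, a leaf contributes probability 1 for its own letter and 0 otherwise, and all function names are hypothetical.

```python
import math

ALPHABET = "ACGT"


def jc69(b, a, t):
    """P(b | a, t) under the Jukes-Cantor model with unit rate (assumed model)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def conditionals(tree, P=jc69):
    """For every letter a, the probability of the leaf letters below `tree`
    given that the root of `tree` carries a. Each vertex is visited once."""
    if isinstance(tree, str):  # leaf: only its own letter is possible
        return {a: 1.0 if a == tree else 0.0 for a in ALPHABET}
    x, wx, y, wy = tree
    cx, cy = conditionals(x, P), conditionals(y, P)
    return {a: sum(P(b, a, wx) * cx[b] for b in ALPHABET)
               * sum(P(c, a, wy) * cy[c] for c in ALPHABET)
            for a in ALPHABET}


def column_likelihood(tree, pi=None, P=jc69):
    """Likelihood of one multialignment column: sum over root letters."""
    pi = pi or {a: 0.25 for a in ALPHABET}
    cond = conditionals(tree, P)
    return sum(pi[a] * cond[a] for a in ALPHABET)


# One column with letters A, C, G on three leaves; edge weights are made up.
tree = (("A", 0.1, "C", 0.2), 0.3, "G", 0.4)
print(column_likelihood(tree))
```

For a whole multialignment the column likelihoods are multiplied (in practice their logarithms are summed), as stated in the first bullet of this slide.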

Page 5:

Bayesian approach

- $D$ – input data: sequences $w_v$ ($v \in L$, $|w_v| = l$),
- $(T,w)$ – result: a binary unrooted tree with edge weights,
- $P(D|T,w)$ – likelihood. But we want the most probable weighted tree for the given data $D$, not the tree $(T,w)$ for which the data $D$ is the most probable!
- Weighted trees (regardless of data) are not seen as equiprobable: $P(T,w)$ – prior probability (density) of weighted trees,
- Posterior probability of a tree:

$P(T,w|D) = P(T,w,D)/P(D) = P(D|T,w)\,P(T,w)/P(D) \sim P(D|T,w)\,P(T,w)$

Here $P(D|T,w)$ is the likelihood and $P(T,w)$ is the prior. $P(D)$ is hard to estimate, but it is an unnecessary factor because it does not depend on $(T,w)$. The product $P(D|T,w)\,P(T,w)$ is the new quality function of a weighted tree.
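The difference between choosing by likelihood and choosing by posterior can be seen on a toy case. In the sketch below the two candidate trees `T1`, `T2` and all numbers are invented purely for illustration, and the finite set of candidates stands in for the continuous space of weighted trees.

```python
def posterior(likelihood, prior):
    """Posterior P(h | D) over a finite set of hypotheses h."""
    unnormalized = {h: likelihood[h] * prior[h] for h in likelihood}
    evidence = sum(unnormalized.values())  # P(D), the same for every hypothesis
    return {h: u / evidence for h, u in unnormalized.items()}


# Hypothetical values: P(D | tree) and P(tree) for two candidate trees.
likelihood = {"T1": 0.020, "T2": 0.015}
prior = {"T1": 0.2, "T2": 0.8}

post = posterior(likelihood, prior)
best_by_likelihood = max(likelihood, key=likelihood.get)
best_by_posterior = max(post, key=post.get)
print(best_by_likelihood, best_by_posterior, post)
```

Since the evidence divides every entry by the same number, ranking trees by likelihood times prior gives the same order as ranking by the posterior.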

Bayesian approach

Metropolis–Hastings algorithm (Markov chain Monte Carlo).
Input: connected graph $G(V,E)$, function $f: V \to \mathbb{R}^+$.
Output: state-switching probabilities $a_{uv}$ for all $\{u,v\} \in E$ such that the discrete-time Markov chain with states from $V$ has stationary distribution $\sim f$.

1. Create probability distributions $p_v(u) > 0$ (for $u \in N_G(v)$).
2. Let $v \in V$;
3. repeat
       choose random $u \in N_G(v)$ with probability distribution $p_v(u)$;
       $a := \big(p_u(v)\, f(u)\big) / \big(p_v(u)\, f(v)\big)$;
       if $a \ge$ random([0;1]) then $v := u$;
       print $v$;
   until false;

In phylogenetics: the states are weighted trees and $f$ is the posterior probability. Do not optimize! Just run the chain for a long time and take a sample of probable trees.
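The Metropolis–Hastings loop can be tried out on a small graph. The sketch below makes assumptions beyond the slides: the proposal distribution $p_v(u)$ is uniform over the neighbours of $v$, the graph is a dict of adjacency lists, the loop runs for a fixed number of steps instead of forever, and all names are hypothetical. `transition_row` is added only so that the stationary distribution can be checked exactly.

```python
import random


def acceptance(graph, f, v, u):
    """Probability of accepting a proposed move v -> u (uniform proposals assumed)."""
    p_vu = 1.0 / len(graph[v])  # p_v(u)
    p_uv = 1.0 / len(graph[u])  # p_u(v)
    return min(1.0, (p_uv * f(u)) / (p_vu * f(v)))


def run_chain(graph, f, start, steps, rng=None):
    """Run the chain for `steps` iterations and return the visited states."""
    rng = rng or random.Random()
    v, visited = start, []
    for _ in range(steps):
        u = rng.choice(graph[v])  # propose a neighbour
        if rng.random() <= acceptance(graph, f, v, u):
            v = u                 # accept the move, otherwise stay at v
        visited.append(v)
    return visited


def transition_row(graph, f, v):
    """Exact one-step transition probabilities out of state v."""
    row = {v: 0.0}
    for u in graph[v]:
        row[u] = row.get(u, 0.0) + acceptance(graph, f, v, u) / len(graph[v])
    row[v] += 1.0 - sum(p for state, p in row.items() if state != v)
    return row


# Toy graph and target weights (made up); f need not be normalized.
graph = {"A": ["B", "C"], "B": ["A", "C", "D"], "C": ["A", "B"], "D": ["B"]}
weights = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}
sample = run_chain(graph, weights.get, "A", 10000, random.Random(0))
print({s: sample.count(s) / len(sample) for s in graph})
```

Only the ratio $f(u)/f(v)$ enters the acceptance step, so $f$ may be an unnormalized posterior such as likelihood times prior.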