Selection analysis using HyPhy
-
Author
bcbbslides -
Category
Science
-
view
178 -
download
0
Embed Size (px)
Transcript of Selection analysis using HyPhy
Office of Technology Information Systems
Fall 2015Kurt Wollenberg, PhDPhylogenetics SpecialistBioinformatics and Computational Biology Branch (BCBB)Phylogenetics and Sequence AnalysisLecture 6: Selection Analysis Using HyPhy
10
Course OrganizationBuilding a clean sequenceCollecting homologsAligning your sequencesBuilding treesFurther Analysis
Lecture OrganizationWhat is selection?What does selection look like?HyPhy: using maximum likelihood to quantify selectionHyPhy inputInterpreting your output
What is selection?Mutation the source of all variationSubstitution changing one nucleotide to anotherSynonymous substitutionNon-synonymous substitutionConservative substitution
Synonymous/non-synonymousConservative/non-conservative
Adapted from Miller and Levine
Substitution...GGG TGT TGT...
...GGC AGT TGG...
...G-S-W......G-C-C...conservative
What is selection?Differential survival of genotypesPositive selection - diversifyingMaintain variationVariation can also be maintained by genetic driftNegative selection purifyingRemove variation
Why is selection important?PathogensSurviving the immune systemPatientsSurviving infectionIdentification of the molecular basis for survival
What is selection analysis?Determining the dN/dS ratioInference of ancestral statesDetermination of the effect of inferred nucleotide changes on amino acidsIs dN/dS significantly different from 1?
What is selection analysis?Positive (Diversifying) SelectiondN/dS > 1Negative (Purifying) SelectiondN/dS < 1
Two types of selection analysisSelection at sitesAre there significant levels of non-synonymous substitution at specific codons?Averaged across lineages.Selection along lineagesAre there significant levels of non-synonymous substitution in specific groups?Averaged across sequences.
Two types of selection analysis 5 100 IMC1_01B4fs TGCACTAATA ATCTGATT-- -------AAT ATCACTGAGA ATACTAATAA MC1_01A10 TGCACT---A ATCTGACAAA GGCTATTAAG ACCAATGGGA ATGCTAATAA MC1_01C1 TGCACTAATA ATCTGATT-- -------AAT ATCACTGAGA ATACTAATAA MC1_01A20 TGCACTAATA ATCTGACAAA GGCTAGTAAT GCCACTGAGA AGGCTAATAA MC1_01TA1 TGCACTAATA ATCTGATT-- -------AAT ATCACTGAGA ATACTAATAA
TACCATTACT AAT------- --ACCACTAT TAATAGCGGG GGAGAAATAA TACCAGTACT AATGCCACTG AGAGTACCAT TACTAATACC ACTGAAATAA TACCATTACT AAT------- --ACCACTAT TAATAGCGGG GGAGAAATAA TACCATTACT AAT------- --ACCACTAT TGATAGCGGG GGAGAAATAA TACCATTACT AAT------- --ACCACTAT TGATAGCGGG GGAGAAATAA Selection at sites (codons)Selection along lineages
Calculating dN/dSEstimate transition/transversion rate ratioTransition: A G C TTransversion: purine purimidineBased on four-fold degenerate sites at third codon positions and nondegenerate sites.Count synonymous and nonsynonymous substitutions for each codon site or along each lineage.Use ML estimates of ancestral codon states.Correct counts for multiple hits.Fitting the data to a distribution of dN/dS ratios. Calculate the probability of dN/dS falling into a particular class.
HyPhy: the softwareInput data formatSequences must be aligned.Fasta, PHYLIP, or Nexus format.Include the tree after the alignment.Program optionsAnalysis scripts.http://datamonkey.org
Standard AnalysesHyPhy: the software
Positive Selection AnalysesHyPhy: the software
REL relative effects model very similar to PAML codeml F61 analysis2pFEL 2-parameter fixed-effects modelMEME mixed-effects modelThe AlgorithmsHyPhy: the software
Analyzes levels of positive selection at individual sites while allowing level of positive selection to vary from branch to branch.Other models (FEL, REL, etc.) assume level of positive selection to be uniform across all branches.Includes site-to-site variation in synonymous substitution rate, just like FEL.The MEME AlgorithmHyPhy: the softwareMixed Effects Model of Evolution
HyPhy: Input file formats>MC1_01B4fs TGCACTAATAATCTGATT---------AATATCACTGAGAATACTAATAATACCATTACT>MC1_01A10 TGCACT---AATCTGACAAAGGCTATTAAGACCAATGGGAATGCTAATAATACCAGTACT>MC1_01C1 TGCACTAATAATCTGATT---------AATATCACTGAGAATACTAATAATACCATTACT>MC1_01A20 TGCACTAATAATCTGACAAAGGCTAGTAATGCCACTGAGAAGGCTAATAATACCATTACT >MC1_01TA1 TGCACTAATAATCTGATT---------AATATCACTGAGAATACTAATAATACCATTACT
(((MC1_01B4fs,MC1_01A10),MC1_01C1),(MC1_01A20,MC1_01TA1)); FastaMake sure these match!No stop codons!
HyPhy: Input file formats>MC1_01B4fs >MC1_01A10 >MC1_01C1 >MC1_01A20 >MC1_01TA1
TGCACTAATAATCTGATT---------AATATCACTGAGAATACTAATAATACCATTACTTGCACT---AATCTGACAAAGGCTATTAAGACCAATGGGAATGCTAATAATACCAGTACTTGCACTAATAATCTGATT---------AATATCACTGAGAATACTAATAATACCATTACTTGCACTAATAATCTGACAAAGGCTAGTAATGCCACTGAGAAGGCTAATAATACCATTACT TGCACTAATAATCTGATT---------AATATCACTGAGAATACTAATAATACCATTACT
AAT---------ACCACTATTAATAGCGGGGGAGAAATAA AATGCCACTGAGAGTACCATTACTAATACCACTGAAATAA AAT---------ACCACTATTAATAGCGGGGGAGAAATAA AAT---------ACCACTATTGATAGCGGGGGAGAAATAA AAT---------ACCACTATTGATAGCGGGGGAGAAATAA
(((MC1_01B4fs,MC1_01A10),MC1_01C1),(MC1_01A20,MC1_01TA1)); Fasta - interleavedMake sure these match!No stop codons!
HyPhy: Input file formats 5 100 IMC1_01B4fs TGCACTAATA ATCTGATT-- -------AAT ATCACTGAGA ATACTAATAA MC1_01A10 TGCACT---A ATCTGACAAA GGCTATTAAG ACCAATGGGA ATGCTAATAA MC1_01C1 TGCACTAATA ATCTGATT-- -------AAT ATCACTGAGA ATACTAATAA MC1_01A20 TGCACTAATA ATCTGACAAA GGCTAGTAAT GCCACTGAGA AGGCTAATAA MC1_01TA1 TGCACTAATA ATCTGATT-- -------AAT ATCACTGAGA ATACTAATAA
TACCATTACT AAT------- --ACCACTAT TAATAGCGGG GGAGAAATAA TACCAGTACT AATGCCACTG AGAGTACCAT TACTAATACC ACTGAAATAA TACCATTACT AAT------- --ACCACTAT TAATAGCGGG GGAGAAATAA TACCATTACT AAT------- --ACCACTAT TGATAGCGGG GGAGAAATAA TACCATTACT AAT------- --ACCACTAT TGATAGCGGG GGAGAAATAA
1(((MC1_01B4fs,MC1_01A10),MC1_01C1),(MC1_01A20,MC1_01TA1)); PHLYIPMake sure these match!No stop codons!
HyPhy: Input file formats #NEXUS Begin data;Dimensions ntax=5 nchar=100;Format datatype=DNA gap=- interleave;MATRIX MC1_01B4fs TGCACTAATA ATCTGATT-- -------AAT ATCACTGAGA ATACTAATAA MC1_01A10 TGCACT---A ATCTGACAAA GGCTATTAAG ACCAATGGGA ATGCTAATAA MC1_01C1 TGCACTAATA ATCTGATT-- -------AAT ATCACTGAGA ATACTAATAA MC1_01A20 TGCACTAATA ATCTGACAAA GGCTAGTAAT GCCACTGAGA AGGCTAATAA MC1_01TA1 TGCACTAATA ATCTGATT-- -------AAT ATCACTGAGA ATACTAATAA
TACCATTACT AAT------- --ACCACTAT TAATAGCGGG GGAGAAATAA TACCAGTACT AATGCCACTG AGAGTACCAT TACTAATACC ACTGAAATAA TACCATTACT AAT------- --ACCACTAT TAATAGCGGG GGAGAAATAA TACCATTACT AAT------- --ACCACTAT TGATAGCGGG GGAGAAATAA TACCATTACT AAT------- --ACCACTAT TGATAGCGGG GGAGAAATAA;End;Begin trees;TREE tree=(((MC1_01B4fs,MC1_01A10),MC1_01C1),(MC1_01A20,MC1_01TA1));End; NexusMake sure these match!No stop codons!
HyPhy: Input file formats 5 100 IMC1_01B4fs TGCACTAATA ATCTGATT-- -------AAT ATCACTGAGA ATACTAATAA MC1_01A10 TGCACT---A ATCTGACAAA GGCTATTAAG ACCAATGGGA ATGCTAATAA MC1_01C1 TGCACTAATA ATCTGATT-- -------AAT ATCACTGAGA ATACTAATAA MC1_01A20 TGCACTAATA ATCTGACAAA GGCTAGTAAT GCCACTGAGA AGGCTAATAA MC1_01TA1 TGCACTAATA ATCTGATT-- -------AAT ATCACTGAGA ATACTAATAA
TACCATTACT AAT------- --ACCACTAT TAATAGCGGG GGAGAAATAA TACCAGTACT AATGCCACTG AGAGTACCAT TACTAATACC ACTGAAATAA TACCATTACT AAT------- --ACCACTAT TAATAGCGGG GGAGAAATAA TACCATTACT AAT------- --ACCACTAT TGATAGCGGG GGAGAAATAA TACCATTACT AAT------- --ACCACTAT TGATAGCGGG GGAGAAATAA Sequence dataGAPS!Gaps must be in multiples of three and preserve the reading frame!ML models cannot analyze gapped codon sites they will be ignored for the analysis.
HyPhy: data input
Codon frequency models0: 1/61 All codons equally probable1: F1x4 - Codon frequencies based on overall nucleotide frequencies (fA, fC, fG, fT)2: F3x4 - Codon frequencies based on nucleotide frequencies at each codon site (fA1, fC1, fG1, fT1, fA2, fC2, fG2, fT2, fA3, fC3, fG3, fT3)3: F61 - Codon frequencies based on frequency of each codon in the data (faaa, faac, faag, faat, faca, facc, facg, fact, faga, fagc, fagg, fagt, ...)F3x4 is used in the MEME analysis
HyPhy: data output
HyPhy: data output
HyPhy: data outputCodonalphabeta1p1beta2p2LRTp-valueq-valueLog(L)1250.00000.00000.725313.60100.27478.01760.00810.3258-30.69571530.91390.00000.892749.53350.10738.77970.00550.2580-22.60631630.66460.00000.847820.02700.15225.43090.03040.8559-25.81351880.00000.00000.96731883.38000.03279.79070.00330.1853-12.70602110.72450.00000.96604586.81000.034011.90000.00110.1062-22.59282180.89030.00000.9662304.81500.033813.31920.00060.1555-17.26402340.00000.00000.8160263.59100.184010.18910.00270.1893-33.74092370.09690.09690.719117.68310.280912.37690.00090.1252-35.26642390.67190.00000.825411.67360.17465.95410.02320.7269-22.81522640.00000.00000.93708.80910.06306.67460.01600.5655-11.4430
HyPhy: further analyses
Selection AnalysisSite vs. Lineage SelectionGenerally, if significant positive (or negative) selection occurs at only a few sites, averaging over the length of the sequences will result in no significant positive (or negative) selection being detected.
Significance of ResultsSelection analysis only determines if there is a significant excess or lack of non-synonymous substitution. The relative level of physiochemical similarity among the two amino acids is beyond the capabilities of selection analysis software.Selection Analysis
ReferencesFundamentals of Molecular Evolution. 2nd edition. Grauer and Li. 2000A good introduction to the major concepts in molecular evolution.The Phylogenetics Handbook. P. Lemey, editor. 2005.Chapter 14: Theory of and practice of diversifying selection analysis.MEME reference: Murrell B, et al. (2012) Detecting individual sites subject to episodic diversifying selection. PLoS Genetics,8(7):e1002764.
Seminar Follow-Up SiteFor access to past recordings, handouts, slides visit this site from the NIH network: http://collab.niaid.nih.gov/sites/research/SIG/Bioinformatics/
34
1. Select a Subject MatterView:Seminar DetailsHandout and Reference DocsRelevant LinksSeminar Recording Links2. Select a TopicRecommended Browsers:IE for Windows,Safari for Mac (Firefox on a Mac is incompatible with NIH Authentication technology)LoginIf prompted to log in use NIH\ in front of your username
35Retrieving Slides/Handouts
This lecture series
36Retrieving Slides/Handouts
This lectureThese slides
37
Questions?
38Next
Molecular Evolutionary Analysis of Pathogens Using BEASTThursday, 10 December at 1300