Selection analysis using HyPhy

Fall 2015 Kurt Wollenberg, PhD Phylogenetics Specialist Bioinformatics and Computational Biology Branch (BCBB) Phylogenetics and Sequence Analysis Lecture 6: Selection Analysis Using HyPhy


Office of Technology Information Systems


Course Organization
  Building a clean sequence
  Collecting homologs
  Aligning your sequences
  Building trees
  Further Analysis

Lecture Organization
  What is selection?
  What does selection look like?
  HyPhy: using maximum likelihood to quantify selection
  HyPhy input
  Interpreting your output

What is selection?
  Mutation: the source of all variation
  Substitution: changing one nucleotide to another
    Synonymous substitution
    Non-synonymous substitution
    Conservative substitution

Synonymous/non-synonymous
Conservative/non-conservative

Adapted from Miller and Levine

Substitution
  ...GGG TGT TGT...   encodes   ...G-C-C...
  ...GGC AGT TGG...   encodes   ...G-S-W...
  conservative
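The codon example above can be checked mechanically. The following is a minimal sketch, assuming the standard genetic code; the names CODON_TABLE and classify are illustrative and are not part of HyPhy. It translates each codon pair from the slide and labels the change as synonymous or non-synonymous. It does not attempt to judge whether a replacement is conservative.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def classify(before: str, after: str) -> str:
    """Label a codon change as identical, synonymous, or non-synonymous."""
    if before == after:
        return "identical"
    same_aa = CODON_TABLE[before] == CODON_TABLE[after]
    return "synonymous" if same_aa else "non-synonymous"

# The two codon triplets shown on the slide.
first = ["GGG", "TGT", "TGT"]
second = ["GGC", "AGT", "TGG"]
for old, new in zip(first, second):
    print(old, CODON_TABLE[old], "->", new, CODON_TABLE[new], classify(old, new))
```

Only the first of the three changes (GGG to GGC, both glycine) leaves the amino acid unchanged.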

What is selection?
  Differential survival of genotypes
  Positive selection: diversifying
    Maintains variation
    Variation can also be maintained by genetic drift
  Negative selection: purifying
    Removes variation

Why is selection important?
  Pathogens
    Surviving the immune system
  Patients
    Surviving infection
  Identification of the molecular basis for survival

What is selection analysis?
  Determining the dN/dS ratio
    Inference of ancestral states
    Determination of the effect of inferred nucleotide changes on amino acids
  Is dN/dS significantly different from 1?

What is selection analysis?
  Positive (Diversifying) Selection
    dN/dS > 1
  Negative (Purifying) Selection
    dN/dS < 1
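One common way to ask whether dN/dS differs significantly from 1 is a likelihood ratio test between a model with dN/dS fixed at 1 and a model where it is estimated. The sketch below assumes one extra free parameter (a chi-square distribution with 1 degree of freedom); the log-likelihoods and the dN/dS estimate are hypothetical numbers, and lrt_p_value and describe are illustrative names, not HyPhy functions.

```python
import math

def lrt_p_value(logl_null: float, logl_alt: float) -> float:
    """p-value for a likelihood ratio test with one extra free parameter.

    The chi-square survival function with 1 degree of freedom
    equals erfc(sqrt(x / 2)).
    """
    statistic = max(0.0, 2.0 * (logl_alt - logl_null))
    return math.erfc(math.sqrt(statistic / 2.0))

def describe(omega: float, p_value: float, alpha: float = 0.05) -> str:
    """Turn a dN/dS estimate and its test result into a plain label."""
    if p_value > alpha:
        return "no significant departure from dN/dS = 1"
    return "positive (diversifying)" if omega > 1 else "negative (purifying)"

# Hypothetical values, for illustration only.
p = lrt_p_value(logl_null=-1234.5, logl_alt=-1230.1)
print(round(p, 4), describe(omega=2.3, p_value=p))
```

A dN/dS estimate above 1 is only called positive selection here when the test also rejects dN/dS = 1.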

Two types of selection analysis
  Selection at sites
    Are there significant levels of non-synonymous substitution at specific codons?
    Averaged across lineages.
  Selection along lineages
    Are there significant levels of non-synonymous substitution in specific groups?
    Averaged across sequences.

Two types of selection analysis

 5 100 I
MC1_01B4fs  TGCACTAATA ATCTGATT-- -------AAT ATCACTGAGA ATACTAATAA
MC1_01A10   TGCACT---A ATCTGACAAA GGCTATTAAG ACCAATGGGA ATGCTAATAA
MC1_01C1    TGCACTAATA ATCTGATT-- -------AAT ATCACTGAGA ATACTAATAA
MC1_01A20   TGCACTAATA ATCTGACAAA GGCTAGTAAT GCCACTGAGA AGGCTAATAA
MC1_01TA1   TGCACTAATA ATCTGATT-- -------AAT ATCACTGAGA ATACTAATAA

            TACCATTACT AAT------- --ACCACTAT TAATAGCGGG GGAGAAATAA
            TACCAGTACT AATGCCACTG AGAGTACCAT TACTAATACC ACTGAAATAA
            TACCATTACT AAT------- --ACCACTAT TAATAGCGGG GGAGAAATAA
            TACCATTACT AAT------- --ACCACTAT TGATAGCGGG GGAGAAATAA
            TACCATTACT AAT------- --ACCACTAT TGATAGCGGG GGAGAAATAA

  Selection at sites (codons)
  Selection along lineages

Calculating dN/dS
  Estimate the transition/transversion rate ratio
    Transition: A ↔ G, C ↔ T
    Transversion: purine ↔ pyrimidine
    Based on four-fold degenerate sites at third codon positions and nondegenerate sites.
  Count synonymous and nonsynonymous substitutions for each codon site or along each lineage.
    Use ML estimates of ancestral codon states.
    Correct counts for multiple hits.
  Fit the data to a distribution of dN/dS ratios.
    Calculate the probability of dN/dS falling into a particular class.
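The transition/transversion distinction in the first step can be sketched in a few lines. This is only a raw count of observed differences between two aligned sequences, over all columns; it is not the ML estimate described above and it does not restrict itself to four-fold degenerate or nondegenerate sites. The function names are illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_type(x: str, y: str) -> str:
    """Return 'transition', 'transversion', or 'none' for two nucleotides."""
    if x == y:
        return "none"
    same_class = ({x, y} <= PURINES) or ({x, y} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"

def count_ts_tv(seq1: str, seq2: str):
    """Count observed transitions and transversions, skipping gaps and ambiguity codes."""
    ts = tv = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        kind = substitution_type(x, y)
        ts += kind == "transition"
        tv += kind == "transversion"
    return ts, tv

# Toy sequences, for illustration only.
print(count_ts_tv("AACC-T", "GTCTAT"))
```

Columns containing a gap are skipped, which mirrors the later point that gapped codon sites are ignored in the analysis.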

HyPhy: the software
  Input data format
    Sequences must be aligned.
    Fasta, PHYLIP, or Nexus format.
    Include the tree after the alignment.
  Program options
    Analysis scripts.
    http://datamonkey.org

HyPhy: the software
  Standard Analyses

HyPhy: the software
  Positive Selection Analyses

HyPhy: the software
The Algorithms
  REL: random-effects likelihood model; very similar to PAML codeml F61 analysis
  2pFEL: 2-parameter fixed-effects model
  MEME: mixed-effects model

HyPhy: the software
The MEME Algorithm
  Mixed Effects Model of Evolution
  Analyzes levels of positive selection at individual sites while allowing the level of positive selection to vary from branch to branch.
  Other models (FEL, REL, etc.) assume the level of positive selection to be uniform across all branches.
  Includes site-to-site variation in synonymous substitution rate, just like FEL.

HyPhy: Input file formats
Fasta

>MC1_01B4fs
TGCACTAATAATCTGATT---------AATATCACTGAGAATACTAATAATACCATTACT
>MC1_01A10
TGCACT---AATCTGACAAAGGCTATTAAGACCAATGGGAATGCTAATAATACCAGTACT
>MC1_01C1
TGCACTAATAATCTGATT---------AATATCACTGAGAATACTAATAATACCATTACT
>MC1_01A20
TGCACTAATAATCTGACAAAGGCTAGTAATGCCACTGAGAAGGCTAATAATACCATTACT
>MC1_01TA1
TGCACTAATAATCTGATT---------AATATCACTGAGAATACTAATAATACCATTACT

(((MC1_01B4fs,MC1_01A10),MC1_01C1),(MC1_01A20,MC1_01TA1));

  Make sure these match! (sequence names in the alignment and in the tree)
  No stop codons!
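Both warnings on this slide can be checked before running an analysis. The sketch below is a hypothetical pre-flight check, not a HyPhy feature: it parses Fasta text, pulls tip names out of the Newick tree with a simple regular expression, and looks for in-frame stop codons assuming the standard genetic code and a reading frame that starts at the first column. The example data are the first 18 alignment columns of the slide's sequences, together with the slide's tree.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}

def read_fasta(text):
    """Parse Fasta text into {name: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records

def tree_tip_names(newick):
    """Extract tip labels: any label that directly follows '(' or ','."""
    return set(re.findall(r"[(,]\s*([^(),:;\s]+)", newick))

def in_frame_stops(seq):
    """Return (codon_number, codon) for every in-frame stop codon."""
    return [(i // 3 + 1, seq[i:i + 3])
            for i in range(0, len(seq) - 2, 3)
            if seq[i:i + 3] in STOP_CODONS]

def check_input(fasta_text, newick):
    """Return a list of human-readable problems (empty if none found)."""
    records = read_fasta(fasta_text)
    tips = tree_tip_names(newick)
    problems = []
    for name in sorted(set(records) - tips):
        problems.append(f"{name}: in alignment but not in tree")
    for name in sorted(tips - set(records)):
        problems.append(f"{name}: in tree but not in alignment")
    for name, seq in records.items():
        for position, codon in in_frame_stops(seq):
            problems.append(f"{name}: stop codon {codon} at codon {position}")
    return problems

FASTA = """>MC1_01B4fs
TGCACTAATAATCTGATT
>MC1_01A10
TGCACT---AATCTGACA
>MC1_01C1
TGCACTAATAATCTGATT
>MC1_01A20
TGCACTAATAATCTGACA
>MC1_01TA1
TGCACTAATAATCTGATT
"""
TREE = "(((MC1_01B4fs,MC1_01A10),MC1_01C1),(MC1_01A20,MC1_01TA1));"
print(check_input(FASTA, TREE) or "no problems found")
```

Returning a list of problems rather than stopping at the first one makes it easier to fix a whole alignment in one pass.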

HyPhy: Input file formats
Fasta - interleaved

>MC1_01B4fs
>MC1_01A10
>MC1_01C1
>MC1_01A20
>MC1_01TA1

TGCACTAATAATCTGATT---------AATATCACTGAGAATACTAATAATACCATTACT
TGCACT---AATCTGACAAAGGCTATTAAGACCAATGGGAATGCTAATAATACCAGTACT
TGCACTAATAATCTGATT---------AATATCACTGAGAATACTAATAATACCATTACT
TGCACTAATAATCTGACAAAGGCTAGTAATGCCACTGAGAAGGCTAATAATACCATTACT
TGCACTAATAATCTGATT---------AATATCACTGAGAATACTAATAATACCATTACT

AAT---------ACCACTATTAATAGCGGGGGAGAAATAA
AATGCCACTGAGAGTACCATTACTAATACCACTGAAATAA
AAT---------ACCACTATTAATAGCGGGGGAGAAATAA
AAT---------ACCACTATTGATAGCGGGGGAGAAATAA
AAT---------ACCACTATTGATAGCGGGGGAGAAATAA

(((MC1_01B4fs,MC1_01A10),MC1_01C1),(MC1_01A20,MC1_01TA1));

  Make sure these match!
  No stop codons!

HyPhy: Input file formats
PHYLIP

 5 100 I
MC1_01B4fs  TGCACTAATA ATCTGATT-- -------AAT ATCACTGAGA ATACTAATAA
MC1_01A10   TGCACT---A ATCTGACAAA GGCTATTAAG ACCAATGGGA ATGCTAATAA
MC1_01C1    TGCACTAATA ATCTGATT-- -------AAT ATCACTGAGA ATACTAATAA
MC1_01A20   TGCACTAATA ATCTGACAAA GGCTAGTAAT GCCACTGAGA AGGCTAATAA
MC1_01TA1   TGCACTAATA ATCTGATT-- -------AAT ATCACTGAGA ATACTAATAA

            TACCATTACT AAT------- --ACCACTAT TAATAGCGGG GGAGAAATAA
            TACCAGTACT AATGCCACTG AGAGTACCAT TACTAATACC ACTGAAATAA
            TACCATTACT AAT------- --ACCACTAT TAATAGCGGG GGAGAAATAA
            TACCATTACT AAT------- --ACCACTAT TGATAGCGGG GGAGAAATAA
            TACCATTACT AAT------- --ACCACTAT TGATAGCGGG GGAGAAATAA

1
(((MC1_01B4fs,MC1_01A10),MC1_01C1),(MC1_01A20,MC1_01TA1));

  Make sure these match!
  No stop codons!

HyPhy: Input file formats
Nexus

#NEXUS
Begin data;
Dimensions ntax=5 nchar=100;
Format datatype=DNA gap=- interleave;
MATRIX
MC1_01B4fs  TGCACTAATA ATCTGATT-- -------AAT ATCACTGAGA ATACTAATAA
MC1_01A10   TGCACT---A ATCTGACAAA GGCTATTAAG ACCAATGGGA ATGCTAATAA
MC1_01C1    TGCACTAATA ATCTGATT-- -------AAT ATCACTGAGA ATACTAATAA
MC1_01A20   TGCACTAATA ATCTGACAAA GGCTAGTAAT GCCACTGAGA AGGCTAATAA
MC1_01TA1   TGCACTAATA ATCTGATT-- -------AAT ATCACTGAGA ATACTAATAA

            TACCATTACT AAT------- --ACCACTAT TAATAGCGGG GGAGAAATAA
            TACCAGTACT AATGCCACTG AGAGTACCAT TACTAATACC ACTGAAATAA
            TACCATTACT AAT------- --ACCACTAT TAATAGCGGG GGAGAAATAA
            TACCATTACT AAT------- --ACCACTAT TGATAGCGGG GGAGAAATAA
            TACCATTACT AAT------- --ACCACTAT TGATAGCGGG GGAGAAATAA
;
End;
Begin trees;
TREE tree=(((MC1_01B4fs,MC1_01A10),MC1_01C1),(MC1_01A20,MC1_01TA1));
End;

  Make sure these match!
  No stop codons!

HyPhy: Input file formats
Sequence data

 5 100 I
MC1_01B4fs  TGCACTAATA ATCTGATT-- -------AAT ATCACTGAGA ATACTAATAA
MC1_01A10   TGCACT---A ATCTGACAAA GGCTATTAAG ACCAATGGGA ATGCTAATAA
MC1_01C1    TGCACTAATA ATCTGATT-- -------AAT ATCACTGAGA ATACTAATAA
MC1_01A20   TGCACTAATA ATCTGACAAA GGCTAGTAAT GCCACTGAGA AGGCTAATAA
MC1_01TA1   TGCACTAATA ATCTGATT-- -------AAT ATCACTGAGA ATACTAATAA

            TACCATTACT AAT------- --ACCACTAT TAATAGCGGG GGAGAAATAA
            TACCAGTACT AATGCCACTG AGAGTACCAT TACTAATACC ACTGAAATAA
            TACCATTACT AAT------- --ACCACTAT TAATAGCGGG GGAGAAATAA
            TACCATTACT AAT------- --ACCACTAT TGATAGCGGG GGAGAAATAA
            TACCATTACT AAT------- --ACCACTAT TGATAGCGGG GGAGAAATAA

  GAPS!
  Gaps must be in multiples of three and preserve the reading frame!
  ML models cannot analyze gapped codon sites; they will be ignored for the analysis.
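The gap rule can be checked per sequence. The sketch below reports every run of gap characters that is not made of whole codons. It assumes the reading frame starts at the first alignment column, and it applies a slightly stricter reading of the rule than the slide states: each run must also begin on a codon boundary. gap_problems is an illustrative name, not a HyPhy function.

```python
def gap_problems(seq: str, gap: str = "-"):
    """Report gap runs that are not whole codons.

    Returns (start_column, run_length) pairs, with 1-based columns, for
    every run that does not start on a codon boundary or whose length
    is not a multiple of three.
    """
    problems = []
    i = 0
    while i < len(seq):
        if seq[i] != gap:
            i += 1
            continue
        start = i
        while i < len(seq) and seq[i] == gap:
            i += 1
        length = i - start
        if start % 3 != 0 or length % 3 != 0:
            problems.append((start + 1, length))
    return problems

# Two rows from the slide (first 60 columns), then a deliberately broken toy row.
print(gap_problems("TGCACTAATAATCTGATT---------AATATCACTGAGAATACTAATAATACCATTACT"))
print(gap_problems("TGCACT---AATCTGACAAAGGCTATTAAGACCAATGGGAATGCTAATAATACCAGTACT"))
print(gap_problems("TGCA--CTAAT"))
```

The two slide rows pass, since their gaps are runs of nine and three columns starting on codon boundaries; the toy row has a two-column gap and is flagged.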

HyPhy: data input

Codon frequency models
  0: 1/61 - All codons equally probable
  1: F1x4 - Codon frequencies based on overall nucleotide frequencies (fA, fC, fG, fT)
  2: F3x4 - Codon frequencies based on nucleotide frequencies at each codon site (fA1, fC1, fG1, fT1, fA2, fC2, fG2, fT2, fA3, fC3, fG3, fT3)
  3: F61 - Codon frequencies based on frequency of each codon in the data (faaa, faac, faag, faat, faca, facc, facg, fact, faga, fagc, fagg, fagt, ...)
  F3x4 is used in the MEME analysis
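The four models differ only in how many frequencies are counted from the data. The sketch below computes each one from codon-aligned sequences. It is a simplified illustration: products of nucleotide frequencies are renormalised over the 61 sense codons of the standard genetic code, which may differ in detail from what HyPhy does internally. All function names are illustrative.

```python
from collections import Counter
from itertools import product

STOPS = {"TAA", "TAG", "TGA"}
SENSE_CODONS = ["".join(c) for c in product("ACGT", repeat=3) if "".join(c) not in STOPS]

def codons_of(seqs):
    """Yield complete, unambiguous codons from codon-aligned sequences."""
    for seq in seqs:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if set(codon) <= set("ACGT"):
                yield codon

def normalise(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

def equal_frequencies():
    """Model 0: every sense codon has frequency 1/61."""
    return {c: 1 / 61 for c in SENSE_CODONS}

def f1x4(seqs):
    """Model 1: one set of nucleotide frequencies shared by all codon positions."""
    nuc = normalise(Counter("".join(codons_of(seqs))))
    return normalise({c: nuc.get(c[0], 0) * nuc.get(c[1], 0) * nuc.get(c[2], 0)
                      for c in SENSE_CODONS})

def f3x4(seqs):
    """Model 2: separate nucleotide frequencies for each codon position."""
    codons = list(codons_of(seqs))
    pos = [normalise(Counter(c[i] for c in codons)) for i in range(3)]
    return normalise({c: pos[0].get(c[0], 0) * pos[1].get(c[1], 0) * pos[2].get(c[2], 0)
                      for c in SENSE_CODONS})

def f61(seqs):
    """Model 3: observed frequency of each sense codon."""
    counts = Counter(c for c in codons_of(seqs) if c in SENSE_CODONS)
    return normalise({c: counts.get(c, 0) for c in SENSE_CODONS})

# Toy data, for illustration only.
toy = ["ATGAAA", "ATGAAG"]
print(f3x4(toy)["ATG"], f61(toy)["ATG"])
```

With so little data, F61 assigns zero frequency to every unobserved codon, while F3x4 spreads some probability onto codons such as ATA that never occur in the toy sequences.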

HyPhy: data output

HyPhy: data output

HyPhy: data output

Codon  alpha   beta1   p1      beta2      p2      LRT      p-value  q-value  Log(L)
125    0.0000  0.0000  0.7253    13.6010  0.2747   8.0176  0.0081   0.3258   -30.6957
153    0.9139  0.0000  0.8927    49.5335  0.1073   8.7797  0.0055   0.2580   -22.6063
163    0.6646  0.0000  0.8478    20.0270  0.1522   5.4309  0.0304   0.8559   -25.8135
188    0.0000  0.0000  0.9673  1883.3800  0.0327   9.7907  0.0033   0.1853   -12.7060
211    0.7245  0.0000  0.9660  4586.8100  0.0340  11.9000  0.0011   0.1062   -22.5928
218    0.8903  0.0000  0.9662   304.8150  0.0338  13.3192  0.0006   0.1555   -17.2640
234    0.0000  0.0000  0.8160   263.5910  0.1840  10.1891  0.0027   0.1893   -33.7409
237    0.0969  0.0969  0.7191    17.6831  0.2809  12.3769  0.0009   0.1252   -35.2664
239    0.6719  0.0000  0.8254    11.6736  0.1746   5.9541  0.0232   0.7269   -22.8152
264    0.0000  0.0000  0.9370     8.8091  0.0630   6.6746  0.0160   0.5655   -11.4430
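The LRT and p-value columns are linked by the test's null distribution. The sketch below assumes that MEME's p-value comes from a mixture of chi-square distributions with weights 0.33 (point mass at zero), 0.30 (1 degree of freedom) and 0.37 (2 degrees of freedom). These weights are my recollection of the MEME reference, so treat them as an assumption and check the paper before relying on them. The (codon, LRT, p-value) triples are copied from the output table; the function names are illustrative.

```python
import math

# (codon, LRT, p-value) as listed in the output table.
ROWS = [
    (125, 8.0176, 0.0081), (153, 8.7797, 0.0055), (163, 5.4309, 0.0304),
    (188, 9.7907, 0.0033), (211, 11.9000, 0.0011), (218, 13.3192, 0.0006),
    (234, 10.1891, 0.0027), (237, 12.3769, 0.0009), (239, 5.9541, 0.0232),
    (264, 6.6746, 0.0160),
]

def chi2_sf(x: float, df: int) -> float:
    """Chi-square survival function for df = 1 or 2 (closed forms)."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df 1 and 2 are supported in this sketch")

def meme_p_value(lrt: float) -> float:
    """Assumed mixture: 0.33 point mass at zero, 0.30 chi2(1), 0.37 chi2(2)."""
    if lrt <= 0:
        return 1.0
    return 0.30 * chi2_sf(lrt, 1) + 0.37 * chi2_sf(lrt, 2)

for codon, lrt, reported in ROWS:
    print(codon, lrt, reported, round(meme_p_value(lrt), 4))
```

Printing the recomputed value next to the reported one shows how closely the assumed mixture tracks the table. Note that every site in the table has p-value below 0.05 but none has q-value below 0.1, so the choice of threshold column matters when reporting sites.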

HyPhy: further analyses

Selection Analysis
Site vs. Lineage Selection
  Generally, if significant positive (or negative) selection occurs at only a few sites, averaging over the length of the sequences will result in no significant positive (or negative) selection being detected.
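A crude arithmetic illustration of this dilution effect, using hypothetical per-site dN/dS values that do not come from any analysis. A real gene-wide estimate is not a simple mean of per-site values, so this only shows the direction of the effect.

```python
# Hypothetical gene of 100 codons: 97 under purifying selection,
# 3 under strong positive selection.
site_omega = [0.1] * 97 + [5.0] * 3

# Averaging over the whole sequence hides the three selected sites.
gene_wide = sum(site_omega) / len(site_omega)

# Looking site by site recovers them (1-based codon numbers).
selected_sites = [i + 1 for i, w in enumerate(site_omega) if w > 1]

print(round(gene_wide, 3), selected_sites)
```

The average stays well below 1 even though three codons have dN/dS far above 1, which is why a site-level method is needed to find them.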

Selection Analysis
Significance of Results
  Selection analysis only determines whether there is a significant excess or lack of non-synonymous substitution.
  Assessing the relative physicochemical similarity of the two amino acids is beyond the capabilities of selection analysis software.

References
  Fundamentals of Molecular Evolution. 2nd edition. Graur and Li. 2000.
    A good introduction to the major concepts in molecular evolution.
  The Phylogenetic Handbook. P. Lemey, editor. 2005.
    Chapter 14: Theory and practice of diversifying selection analysis.
  MEME reference: Murrell B, et al. (2012) Detecting individual sites subject to episodic diversifying selection. PLoS Genetics, 8(7):e1002764.

Seminar Follow-Up Site
  For access to past recordings, handouts, and slides, visit this site from the NIH network:
  http://collab.niaid.nih.gov/sites/research/SIG/Bioinformatics/


1. Select a Subject Matter
   View:
     Seminar Details
     Handout and Reference Docs
     Relevant Links
     Seminar Recording Links
2. Select a Topic

Recommended Browsers:
  IE for Windows, Safari for Mac (Firefox on a Mac is incompatible with NIH Authentication technology)
Login
  If prompted to log in, use NIH\ in front of your username

Retrieving Slides/Handouts
  This lecture series
  This lecture
  These slides

Questions?

Next
  Molecular Evolutionary Analysis of Pathogens Using BEAST
  Thursday, 10 December at 1300