Clustal Ω for Protein Multiple Sequence Alignment Des Higgins (Conway Institute, University College...

17 August 2011

Clustal Ω for Protein Multiple Sequence Alignment

Des Higgins (Conway Institute, University College Dublin, Ireland), “Clustal Omega for Protein Multiple Sequence Alignment,” presentation at ISMB/ECCB 2011.

Sievers et al., “Fast, scalable generation of high quality protein multiple sequence alignments using Clustal Omega,” unpublished manuscript, 2011.

Presented by Hershel Safer in Ron Shamir’s group meeting on 17.8.2011.

Clustal Omega for Protein Multiple Sequence Alignment – Hershel Safer

Outline

Background on multiple sequence alignment (MSA)

Considerations for a new MSA tool

Clustal Ω

Benchmarking: Methods and issues

Benchmarking results

References


Example of MSA: Globins


From Higgins 2011

Example continued: Red columns are alpha helices


From Higgins 2011

Approaches to finding MSAs

Exact solution using dynamic programming: Finding “optimal” MSA for N sequences of length L takes time O(L^N)

Progressive alignment: Greedy heuristic that mimics evolution.

• Start by creating guide tree that specifies “evolutionary closeness.” Complexity is O(N^2) for fixed L.

• Build increasingly large sub-alignments in the order specified by the guide tree. Complexity is O(N).

• Works for up to a few thousand sequences
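To make the cost of the exact approach above concrete, here is a minimal Python sketch of the dynamic-programming recurrence for the N = 2 case (Needleman-Wunsch global alignment score), plus a helper that counts table cells for larger N. The scoring values (match +1, mismatch -1, gap -1) and the function names `nw_score` and `dp_cells` are illustrative assumptions, not taken from the talk.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of two sequences (Needleman-Wunsch)."""
    rows, cols = len(a) + 1, len(b) + 1
    # dp[i][j] = best score for aligning a[:i] with b[:j]
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        dp[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,   # align the two residues
                           dp[i - 1][j] + gap,       # gap in b
                           dp[i][j - 1] + gap)       # gap in a
    return dp[-1][-1]


def dp_cells(length, n_seqs):
    """Cells in the N-dimensional DP table for n_seqs sequences of equal length."""
    return (length + 1) ** n_seqs


print(nw_score("ACGT", "AGT"))
print(dp_cells(100, 2), dp_cells(100, 5))
```

The table for two sequences of length 100 has about 10^4 cells; for five such sequences it already has about 10^10, which is why the exact method is impractical beyond a handful of sequences.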


Example of progressive alignment


From Higgins 2011

Example of progressive alignment, cont’d.


From Higgins 2011

Features of progressive alignment

Advantages

• Fast

• Gives pretty good results on large problems

• Provides good basis for manual tweaking

Disadvantages

• Hard to know if a solution is good – no objective function

• Errors are not corrected. Once two sequences are aligned, they keep the same relative alignment (e.g., later indels apply identically to both sequences).


Consistency criterion

Addresses problem of errors introduced by early mis-alignments

Use library of pairwise alignments that is created for building the guide tree

For each pair of aligned residues in the library, check their alignment in other pairwise alignments.

Scores for progressive alignment are modified to reflect consistency across the entire library of pairwise comparisons. Helps avoid early mis-alignment.

Complexity: worst case O(N^3 L^2), in practice O(N^3 L).
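The consistency idea can be sketched in a few lines of Python. This is a simplified illustration in the spirit of the T-Coffee library extension, not the actual implementation: the weight of an aligned residue pair is increased by the support it receives through residues of every third sequence. The data layout, the weights, and the names `build_library` and `extended_weight` are assumptions made for the example.

```python
from collections import defaultdict


def build_library(pairwise):
    """Build a symmetric lookup from pairwise alignments.

    pairwise maps (seq_s, seq_t) to a list of (i, j, weight) aligned residue pairs.
    Returns lib[(s, i)][(t, j)] = weight.
    """
    lib = defaultdict(dict)
    for (s, t), pairs in pairwise.items():
        for i, j, w in pairs:
            lib[(s, i)][(t, j)] = w
            lib[(t, j)][(s, i)] = w
    return lib


def extended_weight(lib, x, y):
    """Consistency-adjusted weight for residues x = (seq, pos) and y = (seq, pos)."""
    total = lib[x].get(y, 0)                 # direct pairwise evidence
    for z, w_xz in lib[x].items():
        if z[0] in (x[0], y[0]):
            continue                         # intermediates must come from a third sequence
        w_zy = lib[z].get(y, 0)
        total += min(w_xz, w_zy)             # support only if x~z and z~y both hold
    return total


pairwise = {
    ("A", "B"): [(0, 0, 80), (1, 2, 80)],
    ("A", "C"): [(0, 1, 70), (1, 2, 70)],
    ("C", "B"): [(1, 0, 90), (2, 1, 60)],
}
lib = build_library(pairwise)
print(extended_weight(lib, ("A", 0), ("B", 0)))  # supported through C
print(extended_weight(lib, ("A", 1), ("B", 2)))  # not supported through C
```

In this toy library the first pair gains weight because sequence C agrees with it, while the second pair keeps only its direct weight. Looping over every third sequence for every residue pair is where the cubic dependence on N comes from.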


Two kinds of popular MSA tools

Fast (<10,000 sequences)

• Clustal W

• MAFFT (with --partree, can handle >>10,000 sequences)

• Muscle

• Kalign

Accurate but slow (up to a few hundred sequences)

• T-Coffee

• ProbCons

• MSAProbs


Why a new MSA tool?

Starting to see uses for MSAs with hundreds of thousands of sequences

• Metagenomics

• Next-generation sequencing


Goals for a new MSA tool

Want a tool that scales well (time and space) to hundreds of thousands of sequences and still gives accurate results

Scalability: Up to several hours to align hundreds of thousands of sequences on a desktop computer

Accuracy: Similar to Clustal W


Clustal Ω: Possibly the last MSA tool you will need

Building guide tree: Use mBed to cluster in time O(N log(N))

Progressive alignment: Use HHalign to sequentially align pairs of profile HMMs

Take advantage of existing alignments

• External profile alignment: Use an existing profile HMM of sequences homologous to input set to help align input set

• Iterate guide tree construction and/or progressive alignment

• Add sequences to existing alignments without starting from scratch


Building guide tree using mBed

Reduces quadratic time/space of clustering and guide-tree construction to O(N log(N))

1. Cluster sequences

a. Select log2(N) seed sequences

b. Compute distance from each sequence to all seeds, using k-tuple distance measure (k=2) for unaligned sequences.

c. Cluster sequences using k-means

2. Build guide tree

a. Construct UPGMA sub-tree separately for each cluster (use UPGMA code from Muscle)

b. Link sub-trees using distances between clusters
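Step 1 above (seed selection, embedding, clustering) can be sketched as follows. This is a rough illustration under stated assumptions, not the mBed code: the k-tuple distance is a simplified shared-k-mer measure rather than Clustal's exact one, seeds are picked at random, and the k-means routine is a plain Lloyd's iteration. All function names are hypothetical. Step 2 (UPGMA sub-trees and linking) is not shown.

```python
import math
import random
from collections import Counter


def ktuple_distance(a, b, k=2):
    """Simplified k-tuple distance: 1 - fraction of shared k-mers (an assumption)."""
    ka = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    kb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    shared = sum((ka & kb).values())          # multiset intersection of k-mers
    denom = min(sum(ka.values()), sum(kb.values()))
    return 1.0 - shared / denom if denom else 1.0


def embed(seqs, rng):
    """Represent each sequence by its distances to about log2(N) seed sequences."""
    n_seeds = max(1, round(math.log2(len(seqs))))
    seeds = rng.sample(seqs, n_seeds)
    return [[ktuple_distance(s, seed) for seed in seeds] for s in seqs]


def kmeans(vectors, n_clusters, rng, n_iter=20):
    """Plain Lloyd's k-means; returns one cluster label per vector."""
    centers = [list(v) for v in rng.sample(vectors, n_clusters)]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: nearest center by Euclidean distance
        for idx, v in enumerate(vectors):
            labels[idx] = min(range(n_clusters),
                              key=lambda c: math.dist(v, centers[c]))
        # Update step: move each center to the mean of its members
        for c in range(n_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


seqs = ["MKVLAAGIV", "MKVLAAGIA", "MKVLSAGIV", "MKVLAAGLV",
        "WWPPHHYYC", "WWPPHHYYD", "WWPAHHYYC", "WWPPHHFYC"]
rng = random.Random(0)
vectors = embed(seqs, rng)
labels = kmeans(vectors, 2, rng)
print(labels)
```

Each sequence is compared only to the seeds, so the number of distance computations is N times the number of seeds rather than N^2, which is where the O(N log(N)) behavior comes from.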


Progressive alignment using HHalign

HHalign is a method for pairwise alignment of profile HMMs

It was designed to search HMM databases to identify remote homologs (sequence identity <20%)

In Clustal Ω, sequences and sub-alignments are converted to profile HMMs. Transition, insertion, and deletion probabilities are computed, and pseudo-counts are added as needed.

HHalign is used to align sub-alignments, in the order defined by the guide tree.
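As a rough illustration of turning a sub-alignment into a profile, the Python sketch below computes per-column residue probabilities with a uniform pseudo-count. It covers only the emission part of a profile HMM; transition, insertion, and deletion probabilities are omitted, and the uniform pseudo-count and the name `column_profile` are assumptions for the example, not HHalign's actual scheme.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(alignment, pseudocount=1.0):
    """Per-column residue probabilities for a list of equal-length aligned strings.

    Gaps ('-') are ignored; a uniform pseudocount is added to every residue so
    that unobserved residues keep a nonzero probability.
    """
    profile = []
    for col in range(len(alignment[0])):
        counts = Counter(row[col] for row in alignment if row[col] != "-")
        total = (sum(counts[aa] for aa in AMINO_ACIDS)
                 + pseudocount * len(AMINO_ACIDS))
        profile.append({aa: (counts[aa] + pseudocount) / total
                        for aa in AMINO_ACIDS})
    return profile


profile = column_profile(["MKV", "MRV", "M-V"])
print(round(profile[0]["M"], 3), round(profile[1]["K"], 3))
```

A single sequence is just the special case of an alignment with one row, which is how individual sequences and sub-alignments can be handled uniformly.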


External profile alignment (EPA)

Take advantage of existing HMMs to guide pairwise alignment in early stages – avoid seemingly good alignments that are bad in the context of the entire MSA

If the kinds of sequences are known, can often find a relevant HMM in Pfam.

Contribution of external profile decreases as sub-alignments get larger, as larger sub-alignments contain the information that would come from the external profile.

Overhead: Can triple the alignment time


EPA performance


Iteration instead of EPA

Can bootstrap profile information if external profile is not available or not desired

MSA of original sequences can be converted to HMM and used as in EPA

MSA can also be used to rebuild guide tree

Can iterate this process

Can decouple iteration of guide-tree construction and HMM construction – can freeze one and just iterate the other, or iterate both
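For orientation, the Python sketch below assembles a `clustalo` command line that requests iteration and, optionally, an external HMM. The option names (`--iter`, `--max-guidetree-iterations`, `--max-hmm-iterations`, `--hmm-in`) are those listed in the command-line help of released Clustal Omega 1.x versions; they are not taken from the talk and may differ in the version presented, so check `clustalo --help`. The helper name and file names are hypothetical, and the command is only built, not run.

```python
def clustalo_command(infile, outfile, iterations=0,
                     max_guidetree_iterations=None,
                     max_hmm_iterations=None, hmm_in=None):
    """Build (but do not run) a clustalo argument list."""
    cmd = ["clustalo", "-i", infile, "-o", outfile]
    if hmm_in:
        # External profile alignment: supply an existing HMM
        cmd.append(f"--hmm-in={hmm_in}")
    if iterations:
        # Combined guide-tree / HMM iterations
        cmd.append(f"--iter={iterations}")
    if max_guidetree_iterations is not None:
        # Cap guide-tree rebuilding separately (0 freezes the guide tree)
        cmd.append(f"--max-guidetree-iterations={max_guidetree_iterations}")
    if max_hmm_iterations is not None:
        # Cap HMM rebuilding separately
        cmd.append(f"--max-hmm-iterations={max_hmm_iterations}")
    return cmd


# Two iterations, rebuilding only the HMM (guide tree frozen after the first pass)
cmd = clustalo_command("seqs.fasta", "aligned.fasta",
                       iterations=2, max_guidetree_iterations=0)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` on a machine where Clustal Omega is installed.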


Iteration performance


Availability of Clustal Ω

Download a copy (Unix/Linux, Windows, Mac)

EBI website

Galaxy analysis system (coming soon?)


Benchmark databases for MSA

BAliBASE

• Collection of manually refined MSAs based on 3D structural superposition

• Annotated core blocks: highly conserved regions that can be reliably aligned

• Occasionally updated to reflect the kinds of complex sequences encountered in real problems, as the kinds of alignments being attempted change.

• Divided into reference sets that represent different kinds of alignment challenges

Other MSA benchmark DBs: Prefab, Homstrad, Oxbench, SABmark, IRMbase


Clustal Ω benchmarking approach

Compared to 11 other MSA programs

Score is fraction of columns identical in generated and reference alignments

Used 3 benchmark databases

• BAliBASE: Consider only core regions of alignments

• Prefab

• HomFam: Created for this work to test scalability to many sequences. Combined Homstrad families with corresponding Pfam families. Only tested with “fast” tools.
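The column score described above can be sketched in Python as follows. This is a minimal illustration, assuming both alignments contain the same sequences in the same row order; it scores every reference column, whereas the BAliBASE comparison considers only core regions. The function names are hypothetical.

```python
def column_signature(alignment):
    """For each column, the tuple of residue indices per row (None for a gap)."""
    counters = [0] * len(alignment)      # next residue index for each sequence
    columns = []
    for col in range(len(alignment[0])):
        sig = []
        for row, seq in enumerate(alignment):
            if seq[col] == "-":
                sig.append(None)
            else:
                sig.append(counters[row])
                counters[row] += 1
        columns.append(tuple(sig))
    return columns


def column_score(test, reference):
    """Fraction of reference columns reproduced exactly in the test alignment."""
    test_cols = set(column_signature(test))
    ref_cols = column_signature(reference)
    return sum(c in test_cols for c in ref_cols) / len(ref_cols)


reference = ["AC-GT",
             "ACAGT"]
test = ["ACG-T",
        "ACAGT"]
print(column_score(test, reference))
```

Here one misplaced gap breaks two of the five reference columns, so the score drops to 0.6 even though most residues are still paired correctly.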


Problems with benchmarking databases

DBs include questionable alignments

DBs have biased coverage of fold families and kinds of proteins

Test results may be biased if similar methods used to construct DB and in MSA tool (e.g., pairwise alignment method)

Focus on core blocks over-estimates accuracy because these regions are more easily aligned

Including gaps is problematic: Gap position is not considered, and a misplaced gap can improve the accuracy score.

Amount of sequence divergence in DB alignments: twilight zone (20-35% identity) vs. higher or lower

Sum-of-pairs vs. column scores
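To illustrate the last point, a sum-of-pairs score counts correctly aligned residue pairs rather than whole columns. The Python sketch below is a minimal version, assuming the test and reference alignments hold the same sequences in the same row order; the function names are hypothetical and no gap handling beyond skipping gap characters is attempted.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Set of ((row_a, idx_a), (row_b, idx_b)) residue pairs sharing a column."""
    counters = [0] * len(alignment)      # next residue index for each sequence
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for row, seq in enumerate(alignment):
            if seq[col] != "-":
                present.append((row, counters[row]))
                counters[row] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs that are also aligned in the test."""
    ref = aligned_pairs(reference)
    return len(ref & aligned_pairs(test)) / len(ref)


reference = ["AC-GT",
             "ACAGT",
             "ACAGT"]
test = ["ACG-T",
        "ACAGT",
        "ACAGT"]
print(sum_of_pairs_score(test, reference))
```

For this example 11 of 13 reference pairs are recovered (about 0.85), although only 3 of the 5 reference columns are reproduced exactly, so the two measures can rank the same alignment quite differently.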


Problems with benchmarking databases, cont’d.

How representative is the benchmark?

• Method may behave well on benchmark, not in real world

• Method may behave well in real world, not on benchmark

Conclusion of Edgar (2010): “protein alignment assessment is more challenging than generally realized, and skepticism is appropriate for claims that method rankings or advances can be reliably measured by current benchmarks.”


BAliBASE benchmark


Prefab benchmark


HomFam benchmark


Scalability of running time


Additional references

Notredame et al. (2000), “T-Coffee: A novel method for fast and accurate multiple sequence alignment,” J Mol Biol 302:205. [Introduced notion of consistency]

Blackshields et al. (2010), “Sequence embedding for fast construction of guide trees for multiple sequence alignment,” Algorithms for Mol Biol 5:21. [mBed algorithm]

Söding (2005), “Protein homology detection by HMM-HMM comparison,” Bioinformatics 21:951. [HHalign algorithm]

Thompson et al. (2005), “BAliBASE 3.0: Latest developments of the multiple sequence alignment benchmark,” Proteins 61:127.


Additional references, cont’d.

Mizuguchi et al. (1998), “HOMSTRAD: A database of protein structure alignments for homologous families,” Protein Sci 7:2469.

Edgar (2004), “MUSCLE: Multiple sequence alignment with high accuracy and high throughput,” Nucleic Acids Res 32:1792. [Introduced PREFAB benchmarking DB]

Edgar (2010), “Quality measures for protein alignment benchmarks,” Nucleic Acids Res 38:2145.

Aniba et al. (2010), “Issues in bioinformatics benchmarking: The case study of multiple sequence alignment,” Nucleic Acids Res 38:7353.

