Scoring a multiple alignment
Sum of pairs Star Tree
A
A
C
CA
AA
A
A
A
A
C C
CC
Sum of Pairs
AAAAAAAAAAACACC
A
A
A
AA
10α
A
A
A
CA
+ (6α - 4β)
A
A
C
CA
+ (4α - 6β)
= 20α - 10β
Sum-of-Pairs Scoring Function
Score of multiple alignment
= ∑i <j score(Si,Sj)
where score(Si,Sj) = score of induced pairwise alignment
Induced Pairwise AlignmentS1 S - T I S C T G - S - N IS2 L - T I – C N G S S - N IS3 L R T I S C S G F S Q N I
Induced pairwise alignment of S1, S2:
S1 S T I S C T G - S N IS2 L T I – C N G S S N I
Star alignment• Heuristic method for multiple sequence
alignments• Select a sequence c as the center of the star• For each sequence x1, …, xk such that index i
c, perform a Needleman-Wunsch global alignment
• Aggregate alignments with the principle “once a gap, always a gap.”
Choosing a center• Try them all and pick the one which is most similar
to all of the sequences• Let S(xi,xj) be the optimal score between
sequences xi and xj.• Calculate all O(k2) alignments, and choose as xc
the sequence xi that maximizes the following Σ S(xi,xj)j ≠ i
Star alignment example
s2
s1 s3
s4
S1: MPES2: MKES3: MSKES4: SKE
MPE
| |
MKE
MSKE
| ||
M-KE
SKE
||
MKE MPEMKE
M-PEM-KEMSKE
M-PEM-KEMSKES-KE
Analysis• Assuming all sequences have length n• O(k2n2) to calculate center• Step i of iterative pairwise alignment takes
O((i·n)·n) time• two strings of length n and i·n
• O(k2n2) overall cost
ClustalW• Most popular multiple alignment tool today• ‘W’ stands for ‘weighted’ (different parts of
alignment are weighted differently).• Three-step process
1.) Construct pairwise alignments2.) Build Guide Tree (by Neighbor Joining method)3.) Progressive Alignment guided by the tree
- The sequences are aligned progressively according to the branching order in the guide tree
Step 1: Pairwise Alignment• Aligns each sequence again each other
giving a similarity matrix• Similarity = exact matches / sequence length
(percent identity) v1 v2 v3 v4
v1 -v2 .17 -v3 .87 .28 -v4 .59 .33 .62 -
(.17 means 17 % identical)
Step 2: Guide Tree• Create Guide Tree using the similarity matrix
• ClustalW uses the neighbor-joining method
• Guide tree roughly reflects evolutionary relations
Step 2: Guide Tree (cont’d)
v1
v3
v4 v2
Calculate:v1,3 = alignment (v1, v3)v1,3,4 = alignment((v1,3),v4)v1,2,3,4 = alignment((v1,3,4),v2)
v1 v2 v3 v4
v1 -v2 .17 -v3 .87 .28 -v4 .59 .33 .62 -
Step 3: Progressive Alignment• Start by aligning the two most similar
sequences• Following the guide tree, add in the next
sequences, aligning to the existing alignment• Insert gaps as necessaryFOS_RAT PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFDFOS_MOUSE PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFDFOS_CHICK SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFDFOSB_MOUSE PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQFOSB_HUMAN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ . . : ** . :.. *:.* * . * **:
Dots and stars show how well-conserved a column is.
ClustalW: another exampleS1 ALSK
S2 TNSDS3 NASKS4 NTSD
ClustalW exampleS1 ALSKS2 TNSDS3 NASKS4 NTSD
S1 S2 S3 S4
S1 0 9 4 7
S2 0 8 3
S3 0 7
S4 0
Distance Matrix
All pairwisealignments
ClustalW exampleS1 ALSKS2 TNSDS3 NASKS4 NTSD
S1 S2 S3 S4
S1 0 9 4 7
S2 0 8 3
S3 0 7
S4 0
Distance Matrix
S3
S1
S2
S4
Rooted Tree
All pairwisealignments
NeighborJoining
ClustalW exampleS1 ALSKS2 TNSDS3 NASKS4 NTSD
S1 S2 S3 S4
S1 0 9 4 7
S2 0 8 3
S3 0 7
S4 0
1. Align S1 with S32. Align S2 with S43. Align (S1, S3) with (S2, S4)
Distance Matrix
S3
S1
S2
S4
Rooted Tree
Multiple Alignment Steps
All pairwisealignments
NeighborJoining
ClustalW exampleS1 ALSKS2 TNSDS3 NASKS4 NTSD
S1 S2 S3 S4
S1 0 9 4 7
S2 0 8 3
S3 0 7
S4 0
1. Align S1 with S3
2. Align S2 with S4
3. Align (S1, S3) with (S2, S4)
Distance Matrix
Multiple Alignment Steps
All pairwisealignments
NeighborJoining
-ALSKNA-SK
-TNSDNT-SD
-ALSK-TNSDNA-SKNT-SDMultiple
Alignment
S3
S1
S2
S4
Rooted Tree
Progressive alignment
Find seq Build guide tree Create intmdt alimost similarto each other
Other progressive approaches
• PILEUP• Similar to CLUSTALW• Uses UPGMA to produce tree
Problems with progressive alignments
• Depend on pairwise alignments• If sequences are very distantly related,
much higher likelihood of errors• Care must be made in choosing scoring
matrices and penalties
Iterative refinement in progressive alignmentAnother problem of progressive alignment:• Initial alignments are “frozen” even when new
evidence comes
Example:x: GAAGTTy: GAC-TT
z: GAACTGw: GTACTG
Frozen!
Now clear that correct y = GA-CTT
Evaluating multiple alignments• Balibase benchmark (Thompson, 1999)
• De-facto standard for assessing the quality of a multiple alignment tool
• Manually refined multiple sequence alignments• Quality measured by how good it matches the
core blocks• Another benchmark: SABmark benchmark
• Based on protein structural families
Scoring multiple alignments• Ideally, a scoring scheme should
• Penalize variations in conserved positions higher• Relate sequences by a phylogenetic tree
• Tree alignment• Usually assume
• Independence of columns• Quality computation
• Entropy-based scoring• Compute the Shannon entropy of each column
• Sum-of-pairs (SP) score
Multiple Alignments: Scoring • Number of matches (multiple longest
common subsequence score)
• Entropy score
• Sum of pairs (SP-Score)
Multiple LCS Score• A column is a “match” if all the letters in the
column are the same
• Only good for very similar sequences
AAAAAAAATATC
Entropy• Define frequencies for the occurrence of each
letter in each column of multiple alignment• pA = 1, pT=pG=pC=0 (1st column)• pA = 0.75, pT = 0.25, pG=pC=0 (2nd column)• pA = 0.50, pT = 0.25, pC=0.25 pG=0 (3rd column)
• Compute entropy of each column
CGTAX
XX pp,,,
log
AAAAAAAATATC
Entropy: Example
0
AAAA
entropy
2)241(4
41log
41
CGTA
entropy
Best case
Worst case
CGTAX
XNx
X ppS,,,
1log
Multiple Alignment: Entropy Score Entropy for a multiple alignment is the sum of entropies of its columns:
over all columns - X=A,T,G,C pX logpX
Entropy of an Alignment: Example
column entropy: -( pAlogpA + pClogpC + pGlogpG + pTlogpT)
•Column 1 = -[1*log(1) + 0*log0 + 0*log0 +0*log0] = 0
•Column 2 = -[(1/4)*log(1/4) + (3/4)*log(3/4) + 0*log0 + 0*log0] = -[ (1/4)*(-2) + (3/4)*(-.415) ] = +0.811
•Column 3 = -[(1/4)*log(1/4)+(1/4)*log(1/4)+(1/4)*log(1/4) +(1/4)*log(1/4)] = 4* -[(1/4)*(-2)] = +2.0
•Alignment Entropy = 0 + 0.811 + 2.0 = +2.811
A A A
A C C
A C G
A C T
Representing a profile as a logo• Contribution of each residue to a position
Phosphate binding pattern from Prosite:
[LIVMF]-[GSA]-x(5)-P-x(4)-[LIVMFYW]-x-[LIVMF]-x-G-D-[GSA]-[GSAC]
Multiple Alignment Induces Pairwise AlignmentsEvery multiple alignment induces pairwise alignments
x: AC-GCGG-C y: AC-GC-GAG z: GCCGC-GAG
Induces:
x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAGy: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG
Sum of Pairs (SP) Scoring• SP scoring is the standard method for scoring
multiple sequence alignments.• Columns are scored by a ‘sum of pairs’
function using a substitution matrix (PAM or BLOSUM)
• Assumes statistical independence for the columns, does not use a phylogenetic tree.
Sum of Pairs Score(SP-Score)
• Consider pairwise alignment of sequences ai and aj
imposed by a multiple alignment of k sequences • Denote the score of this suboptimal (not
necessarily optimal) pairwise alignment as s*(ai, aj)• Sum up the pairwise scores for a multiple
alignment:s(a1,…,ak) = Σi,j s*(ai, aj)
Computing SP-Score
Aligning 4 sequences: 6 pairwise alignments
Given a1,a2,a3,a4:
s(a1…a4) = s*(ai,aj) = s*(a1,a2) + s*(a1,a3) + s*(a1,a4) + s*(a2,a3) + s*(a2,a4) + s*(a3,a4)
SP-Score: Examplea1
.ak
ATG-C-AATA-G-CATATATCCCATTT
ji
jik aaSaaS,
*1 ),()...(
2n
Pairs of Sequences
A
A A11
1
G
C G1
Score=3 Score = 1 –
Column 1 Column 3
May also calculate the scores column by column:
Example• Compute Sum of Pairs Score of the following
multiple alignment with match = 3, mismatch = -1, S(X,-) = -1, S(-,-) = 0
X: G T A C GY: T G C C GZ: C G G C CW: C G G A C -2 6-2 6 2
Sum of pairs = -2+6-2+6+2 = 10
Multiple alignment tools• Clustal W (Thompson, 1994)
• Most popular• PRRP (Gotoh, 1993)• HMMT (Eddy, 1995)• DIALIGN (Morgenstern, 1998)• T-Coffee (Notredame, 2000)• MUSCLE (Edgar, 2004)• Align-m (Walle, 2004)• PROBCONS (Do, 2004)
from: C. Notredame, “Recent progresses in multiple alignment: a survey”,Pharmacogenomics (2002) 3(1)
Useful links http://cnx.org/content/m11036/latest/
http://www.biokemi.uu.se/Utbildning/Exercises/ClustalX/index.shtm
http://bioinformatics.weizmann.ac.il/~pietro/Making_and_using_protein_MA/
http://homepage.usask.ca/~ctl271/857/paper1_overview.shtml
http://journal-ci.csse.monash.edu.au/ci/vol04/mulali/mulali.html
Was the quagga (now extinct) more like a zebra or a horse?
The quagga was an African animal that is now extinct. It looked partly like a horse, and partly like a zebra. In 1872, the last living quagga was photographed (above). Mitochondrial DNA was obtained from a museum specimen of a quagga and sequenced. Perform a multiple sequence alignment of quagga (Equus quagga boehmi), horse (Equus caballus), and zebra (Equus burchelli) mitochondrial DNA. To which animal was the quagga more closely related?
• For the Entrez search Equus quagga boehmi, there is only one mitochondrial DNA sequence (mitochondrial D-loop, AF499309). From Entrez nucleotide the accession numbers are AY246194 (horse) and AF499309 (zebra). The sequences are:
• >gi|29650801|gb|AY246194.1| Equus caballus haplotype be1 mitochondrial D-loop, complete sequenceGATTTCTTCCCCTAAACGACAACAATTTACCCTCATGTGCTATGTCAGTATCAGATTATACCCCCACATAACACCATACCCACCTGACATGCAATATCTTATGAATGGCCTATGTACGTCGTGCATTAAATTGTCTGCCCCATGAATAATAAGCATGTACATAATATCATTTATCTTACATAAGTACATTATATTATTGATCGTGCATACCCCATCCAAGTCAAATCATTTCCAGTCAACACGCATATCACAGCCCATGTTCCACGAGCTTAATCACCAAGCCGCGGGAAATCAGCAACCCTCCCAACTACGTGTCCCAATCCTCGCTCCGGGCCCATCCAAACGTGGGGGTTTCTACAATGAAACTATACCTGGCATCTGGTTCTTTCTTCAGGGCCATTCCCACCCAACCTCGCCCATTCTTTCCCCTTAAATAAGACATCTCGATGGACTAATGACTAATCAGCCCATGCTCACACATAACTGTGATTTCATGCATTTGGTATCTTTTTATATTTGGGGATGCTATGACTCAGCTATGGCCGTCAAAGGCCTCGACGCAGTCAATTAAATTGAAGCTGGACTTAAATTGAACGTTATTCCTCCGCATCAGCAACCATAAGGTGTTATTCAGTCCATGGTAGCGGGACATAGGAAACAAGTGCACCTGTGCACCTACCCGCGCAGTAAGCAAGTAATATAGCTTTCTTAATCAAACCCCCCCTACCCCCCATTAAACTCCACATATGTACATTCAACACAATCTTGCCAAACCCCAAAAACAAGACTAAACAATGCACAATACTTCATGAAGCTTAACCCTCGCATGCCAACCATAATAACTCAACACACCTAACAATCTTAACAGAACTTTCCCCCCGCCATTAATACCAACATGCTACTTTAATCAATAAAATTTCCATAGACAGGCATCCCCCTAGATCTAATTTTCTAAATCTGTCAACCCTTCTTCCCC
• >gi|20335096|gb|AF499310.1| Equus burchellii isolate Be1 mitochondrial D-loop, partial sequenceGCTCCACCGTCAACACCCAAAGCTGAAATTCTACTTAAACTATTCCTTGATTTCCTCCCCTAAACGACAACAATTCACCCTCATGTACTATGTCAGTATTAAAATACATCCTATGTAGCATTATACAGTTCAACATATAATACCCTGTTAACATCCTATGTACATCGTGCATTAAATTGTT
• >gi|20335095|gb|AF499309.1| Equus quagga boehmi isolate Bo1 mitochondrial D-loop, partial sequenceGCTCCACCGTCAACACCCAAAGCTGAAATTCTACTTAAACTATTCCTTGATTTCCTCCCCTAAACGACAACAGTTCACCCTCATGTACTATGTCAGTATTAAAATACATCCTATGTAGTATTATACAGTTCAACATATAATACCCTGTTAACATCCTATGTACGTCGTGCATTAGATTGTT
• The portion of the multiple sequence alignment in (excluding additional nonoverlapping horse sequence) is as follows:
• gi|20335096|gb|AF499310.1| GCTCCACCGTCAACACCCAAAGCTGAAATTCTACTTAAACTATTCCTTGA 50 [zebra]gi|20335095|gb|AF499309.1| GCTCCACCGTCAACACCCAAAGCTGAAATTCTACTTAAACTATTCCTTGA 50 [quagga]gi|29650801|gb|AY246194.1| ------------------------------------------------GA 2 [horse] **
• gi|20335096|gb|AF499310.1| TTTCCTCCCCTAAACGACAACAATTCACCCTCATGTACTATGTCAGTATT 100gi|20335095|gb|AF499309.1| TTTCCTCCCCTAAACGACAACAGTTCACCCTCATGTACTATGTCAGTATT 100gi|29650801|gb|AY246194.1| TTTCTTCCCCTAAACGACAACAATTTACCCTCATGTGCTATGTCAGTATC 52 **** ***************** ** ********** ************
• gi|20335096|gb|AF499310.1| AAAATACATCCT-ATGTAGCATTATACA-GTTCAACATATAATACCCTGT 148gi|20335095|gb|AF499309.1| AAAATACATCCT-ATGTAGTATTATACA-GTTCAACATATAATACCCTGT 148gi|29650801|gb|AY246194.1| AGATTATACCCCCACATAACACCATACCCACCTGACATGCAATATCTTAT 102 * * ** * ** * ** * **** **** **** * * *
• gi|20335096|gb|AF499310.1| TAACATCCTATGTACATCGTGCATTAAATTGTT----------------- 181gi|20335095|gb|AF499309.1| TAACATCCTATGTACGTCGTGCATTAGATTGTT----------------- 181gi|29650801|gb|AY246194.1| GAATGGCCTATGTACGTCGTGCATTAAATTGTCTGCCCCATGAATAATAA 152 ** ********* ********** *****
Visual inspection of the alignment can give you a clue that the quagga DNA is closely related to zebra DNA: [1] the internal gaps in the alignment suggest that horse DNA (on the bottom row) is an outlier, and [2] positions that are not conserved (i.e. positions lacking an asterisk) also consistently show that horse DNA differs while quagga and zebra sequences match each other. The pairwise alignment scores also show clearly that the quagga is closer to a zebra:Sequence 1: gi|29650801|gb|AY246194.1| 976 bp [horse]Sequence 2: gi|20335096|gb|AF499310.1| 181 bp [zebra]Sequence 3: gi|20335095|gb|AF499309.1| 181 bp [quagga]Start of Pairwise alignmentsAligning...Sequences (1:2) Aligned. Score: 56Sequences (1:3) Aligned. Score: 55Sequences (2:3) Aligned. Score: 97
Finally a PubMed search with the terms quagga, horse and zebra links to an article suggesting that zebra and quagga shared a common ancestor several million years ago.
Top Related