Scoring a multiple alignment

[Figure: the same column (A, A, A, C, C) scored under three schemes: sum of pairs, star, and tree.]

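Sum of Pairs

Alignment of five sequences (α = match score, β = mismatch penalty):

AAA
AAA
AAA
AAC
ACC

Column 1 (A, A, A, A, A): 10α
Column 2 (A, A, A, A, C): + (6α - 4β)
Column 3 (A, A, A, C, C): + (4α - 6β)
Total: = 20α - 10β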
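The per-column terms above come from counting pairs. As a sketch, write k for the number of sequences and n_x for the number of times residue x occurs in the column (both symbols are introduced here for illustration), with α the match score and β the mismatch penalty:

```latex
\begin{aligned}
M &= \sum_{x} \binom{n_x}{2} && \text{(matching pairs in the column)} \\
\text{score} &= \alpha M - \beta\left[\binom{k}{2} - M\right] \\
\text{A,A,A,A,C } (k=5):\quad M &= \binom{4}{2} + \binom{1}{2} = 6,
  &&\text{score} = 6\alpha - 4\beta \\
\text{A,A,A,C,C } (k=5):\quad M &= \binom{3}{2} + \binom{2}{2} = 4,
  &&\text{score} = 4\alpha - 6\beta
\end{aligned}
```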

Sum-of-Pairs Scoring Function

Score of multiple alignment = ∑i<j score(Si, Sj)

where score(Si, Sj) = score of the induced pairwise alignment

Induced Pairwise Alignment

S1  S - T I S C T G - S - N I
S2  L - T I - C N G S S - N I
S3  L R T I S C S G F S Q N I

Induced pairwise alignment of S1, S2 (columns in which both rows have a gap are dropped):

S1  S T I S C T G - S N I
S2  L T I - C N G S S N I
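A minimal sketch of how an induced pairwise alignment can be read off a multiple alignment: keep the two rows and drop every column where both have a gap. The function name `induced_pairwise` is hypothetical and used only for illustration.

```python
def induced_pairwise(row_a, row_b, gap="-"):
    """Project two rows of a multiple alignment onto a pairwise alignment.

    Columns in which both rows contain a gap carry no information for
    this pair, so they are removed; all other columns are kept as-is.
    """
    if len(row_a) != len(row_b):
        raise ValueError("rows of one alignment must have equal length")
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


# Rows S1 and S2 from the example above.
s1 = "S-TISCTG-S-NI"
s2 = "L-TI-CNGSS-NI"
print(induced_pairwise(s1, s2))  # ('STISCTG-SNI', 'LTI-CNGSSNI')
```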

Star alignment
• Heuristic method for multiple sequence alignments
• Select a sequence c as the center of the star
• For each sequence x1, …, xk such that index i ≠ c, perform a Needleman-Wunsch global alignment with c
• Aggregate alignments with the principle “once a gap, always a gap.”

Choosing a center
• Try them all and pick the one which is most similar to all of the sequences
• Let S(xi,xj) be the optimal score between sequences xi and xj.
• Calculate all O(k²) alignments, and choose as xc the sequence xi that maximizes Σj≠i S(xi,xj)
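The center choice can be sketched as follows. The scoring values (match = 2, mismatch = -1, gap = -2) and the function names are assumptions made for illustration; the slides do not fix a scoring scheme.

```python
def nw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Optimal global alignment score (Needleman-Wunsch, linear gap cost).

    Only the score is needed to pick a center, so a single DP row is kept.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def choose_center(seqs):
    """Return (index of center, list of summed scores), following
    'choose xi maximizing the sum over j != i of S(xi, xj)'."""
    totals = [
        sum(nw_score(x, y) for j, y in enumerate(seqs) if j != i)
        for i, x in enumerate(seqs)
    ]
    center = max(range(len(seqs)), key=totals.__getitem__)
    return center, totals


seqs = ["MPE", "MKE", "MSKE", "SKE"]
center, totals = choose_center(seqs)
print(seqs[center], totals)  # MKE [4, 10, 9, 7]
```

Under these assumed scores the second sequence, MKE, has the largest total. A different scoring scheme could change the totals or produce ties.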

Star alignment example

S1: MPE
S2: MKE
S3: MSKE
S4: SKE

Pairwise alignments with MKE (S2):

MPE
| |
MKE

MSKE
| ||
M-KE

SKE
 ||
MKE

Merging the pairwise alignments one at a time:

MPE
MKE

M-PE
M-KE
MSKE

M-PE
M-KE
MSKE
S-KE
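The merging step can be sketched as below. It assumes the pairwise alignments against the center have already been computed and are passed in as `(center_row, other_row)` pairs; `merge_star` is a hypothetical name. A gap introduced into the center row is never removed, which is the "once a gap, always a gap" rule.

```python
def merge_star(pairwise):
    """Merge pairwise alignments that share one center sequence.

    pairwise: list of (center_row, other_row) strings. Returns the rows of
    the multiple alignment, center first, then the others in input order.
    """
    center, first = pairwise[0]
    msa = [list(center), list(first)]
    for c_aln, x_aln in pairwise[1:]:
        new_row = []
        i = j = 0  # i walks the MSA columns, j walks the pairwise columns
        while i < len(msa[0]) or j < len(c_aln):
            msa_gap = i < len(msa[0]) and msa[0][i] == "-"
            pw_gap = j < len(c_aln) and c_aln[j] == "-"
            if msa_gap and not pw_gap:
                # Gap already in the center row: the new sequence gets one too.
                new_row.append("-")
                i += 1
            elif pw_gap and not msa_gap:
                # New gap in the center: open a gap column in every existing row.
                for row in msa:
                    row.insert(i, "-")
                new_row.append(x_aln[j])
                i += 1
                j += 1
            else:
                # Same center residue (or a gap in both): columns line up.
                new_row.append(x_aln[j])
                i += 1
                j += 1
        msa.append(new_row)
    return ["".join(row) for row in msa]


# Pairwise alignments from the example, center MKE listed first in each pair.
rows = merge_star([("MKE", "MPE"), ("M-KE", "MSKE"), ("MKE", "SKE")])
print(rows)  # ['M-KE', 'M-PE', 'MSKE', 'S-KE']
```

The result contains the same four rows as the example, with the center row listed first.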

Analysis
• Assuming all sequences have length n
• O(k²n²) to calculate center
• Step i of iterative pairwise alignment takes O((i·n)·n) time
  • two strings of length n and i·n
• O(k²n²) overall cost
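The overall bound follows from summing the per-step cost over the merge steps. This is a sketch that assumes, as the bullets do, that every sequence has length n:

```latex
\sum_{i=1}^{k-1} O\big((i \cdot n) \cdot n\big)
  = O\!\Big(n^2 \sum_{i=1}^{k-1} i\Big)
  = O\!\Big(n^2 \cdot \frac{k(k-1)}{2}\Big)
  = O(k^2 n^2)
```

Choosing the center costs the same order: there are k(k-1)/2 pairwise alignments, each taking O(n²) time.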

ClustalW
• Most popular multiple alignment tool today
• ‘W’ stands for ‘weighted’ (different parts of alignment are weighted differently).
• Three-step process
  1.) Construct pairwise alignments
  2.) Build Guide Tree (by Neighbor Joining method)
  3.) Progressive Alignment guided by the tree
     - The sequences are aligned progressively according to the branching order in the guide tree

Step 1: Pairwise Alignment
• Aligns each sequence against each other, giving a similarity matrix
• Similarity = exact matches / sequence length (percent identity)

      v1    v2    v3    v4
v1     -
v2   .17     -
v3   .87   .28     -
v4   .59   .33   .62     -

(.17 means 17% identical)
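As a small sketch of the similarity measure, the function below counts identical, non-gap columns in two aligned rows. Dividing by the alignment length is an assumption; the slide says only "sequence length", and real tools differ in the denominator they use.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Fraction of alignment columns holding the same residue in both rows.

    Assumption: the denominator is the alignment length (gap columns included).
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    return matches / len(row_a)


# Hypothetical input: two aligned rows of length 4 with three identical columns.
print(percent_identity("MSKE", "M-KE"))  # 0.75
```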

Step 2: Guide Tree
• Create Guide Tree using the similarity matrix
• ClustalW uses the neighbor-joining method
• Guide tree roughly reflects evolutionary relations

Step 2: Guide Tree (cont’d)

[Figure: guide tree in which v1 and v3 are joined first, then v4, then v2.]

      v1    v2    v3    v4
v1     -
v2   .17     -
v3   .87   .28     -
v4   .59   .33   .62     -

Calculate:
v1,3 = alignment(v1, v3)
v1,3,4 = alignment((v1,3), v4)
v1,2,3,4 = alignment((v1,3,4), v2)

Step 3: Progressive Alignment
• Start by aligning the two most similar sequences
• Following the guide tree, add in the next sequences, aligning to the existing alignment
• Insert gaps as necessary

FOS_RAT     PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE   PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK   SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE  PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN  PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ
            . . : ** . :.. *:.* * . * **:

Dots and stars show how well-conserved a column is.

ClustalW: another example

S1 ALSK
S2 TNSD
S3 NASK
S4 NTSD

All pairwise alignments → Distance Matrix

      S1   S2   S3   S4
S1     0    9    4    7
S2          0    8    3
S3               0    7
S4                    0

Neighbor Joining → Rooted Tree

[Figure: rooted tree in which S1 groups with S3 and S2 groups with S4.]

Multiple Alignment Steps

1. Align S1 with S3
2. Align S2 with S4
3. Align (S1, S3) with (S2, S4)

S1 -ALSK
S3 NA-SK

S2 -TNSD
S4 NT-SD

Multiple Alignment

S1 -ALSK
S2 -TNSD
S3 NA-SK
S4 NT-SD

Progressive alignment

Find sequences most similar to each other → Build guide tree → Create intermediate alignments

Other progressive approaches

• PILEUP
  • Similar to CLUSTALW
  • Uses UPGMA to produce tree
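A minimal UPGMA sketch is shown here, applied to the S1..S4 distance matrix from the ClustalW example. This is plain average-linkage clustering written for illustration, not PILEUP's or ClustalW's actual code, and it is not neighbor joining. The function name and the dictionary layout are assumptions.

```python
def upgma(names, dist):
    """Build a tree by repeatedly merging the two closest clusters.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    The distance between clusters is the average over all leaf pairs.
    Returns the tree as nested tuples.
    """
    # Each cluster is (subtree, tuple of leaf names in that subtree).
    clusters = [(name, (name,)) for name in names]

    def average_distance(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        a, b = clusters[i], clusters[j]
        merged = ((a[0], b[0]), a[1] + b[1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


names = ["S1", "S2", "S3", "S4"]
dist = {
    frozenset({"S1", "S2"}): 9, frozenset({"S1", "S3"}): 4, frozenset({"S1", "S4"}): 7,
    frozenset({"S2", "S3"}): 8, frozenset({"S2", "S4"}): 3, frozenset({"S3", "S4"}): 7,
}
print(upgma(names, dist))  # (('S2', 'S4'), ('S1', 'S3'))
```

On this matrix UPGMA merges S2 with S4 first (distance 3), then S1 with S3 (distance 4). That gives the same grouping as the alignment steps listed in the ClustalW example, although the merge order differs.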

Problems with progressive alignments

• Depend on pairwise alignments
• If sequences are very distantly related, much higher likelihood of errors
• Care must be taken in choosing scoring matrices and penalties

Iterative refinement in progressive alignment

Another problem of progressive alignment:
• Initial alignments are “frozen” even when new evidence comes

Example:

x: GAAGTT
y: GAC-TT   (frozen!)

z: GAACTG
w: GTACTG

Now clear that correct y = GA-CTT

Evaluating multiple alignments
• BAliBASE benchmark (Thompson, 1999)
  • De-facto standard for assessing the quality of a multiple alignment tool
  • Manually refined multiple sequence alignments
  • Quality measured by how well an alignment matches the core blocks
• Another benchmark: SABmark benchmark
  • Based on protein structural families

Scoring multiple alignments
• Ideally, a scoring scheme should
  • Penalize variations in conserved positions higher
  • Relate sequences by a phylogenetic tree
    • Tree alignment
• Usually assume
  • Independence of columns
• Quality computation
  • Entropy-based scoring
    • Compute the Shannon entropy of each column
  • Sum-of-pairs (SP) score

Multiple Alignments: Scoring
• Number of matches (multiple longest common subsequence score)
• Entropy score
• Sum of pairs (SP-Score)

Multiple LCS Score
• A column is a “match” if all the letters in the column are the same
• Only good for very similar sequences

AAA
AAA
AAT
ATC

Entropy
• Define frequencies for the occurrence of each letter in each column of multiple alignment
  • pA = 1, pT = pG = pC = 0 (1st column)
  • pA = 0.75, pT = 0.25, pG = pC = 0 (2nd column)
  • pA = 0.50, pT = 0.25, pC = 0.25, pG = 0 (3rd column)
• Compute entropy of each column: -Σ X=A,T,G,C pX log pX

AAA
AAA
AAT
ATC

Entropy: Example

Best case: column A, A, A, A
entropy = 0

Worst case: column A, T, G, C
entropy = -Σ (1/4) log(1/4) = -4 · ((1/4) · (-2)) = 2

Multiple Alignment: Entropy Score

Entropy for a multiple alignment is the sum of entropies of its columns:

Σ over all columns of ( -Σ X=A,T,G,C pX log pX )

Entropy of an Alignment: Example

A A A
A C C
A C G
A C T

column entropy: -( pA log pA + pC log pC + pG log pG + pT log pT ), with logarithms taken base 2 and 0*log0 taken as 0

• Column 1 = -[1*log(1) + 0*log0 + 0*log0 + 0*log0] = 0

• Column 2 = -[(1/4)*log(1/4) + (3/4)*log(3/4) + 0*log0 + 0*log0] = -[(1/4)*(-2) + (3/4)*(-.415)] = +0.811

• Column 3 = -[(1/4)*log(1/4) + (1/4)*log(1/4) + (1/4)*log(1/4) + (1/4)*log(1/4)] = 4 * -[(1/4)*(-2)] = +2.0

• Alignment Entropy = 0 + 0.811 + 2.0 = +2.811
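The entropy calculation can be checked with a short sketch. It uses base-2 logarithms, which matches the worked values (log(1/4) = -2). Counting a gap as one more symbol is an assumption; the slides do not say how gaps are treated, and the function names are illustrative.

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Shannon entropy (base 2) of the symbol frequencies in one column.

    Symbols that do not occur contribute nothing, matching 0*log0 = 0.
    """
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def alignment_entropy(rows):
    """Sum of column entropies over all columns of the alignment."""
    return sum(column_entropy(col) for col in zip(*rows))


rows = ["AAA", "ACC", "ACG", "ACT"]
print(round(alignment_entropy(rows), 3))  # 2.811
```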

Representing a profile as a logo
• Contribution of each residue to a position

Phosphate binding pattern from Prosite:

[LIVMF]-[GSA]-x(5)-P-x(4)-[LIVMFYW]-x-[LIVMF]-x-G-D-[GSA]-[GSAC]

Multiple Alignment Induces Pairwise Alignments

Every multiple alignment induces pairwise alignments

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:

x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG

Sum of Pairs (SP) Scoring
• SP scoring is the standard method for scoring multiple sequence alignments.
• Columns are scored by a ‘sum of pairs’ function using a substitution matrix (PAM or BLOSUM)
• Assumes statistical independence for the columns, does not use a phylogenetic tree.

Sum of Pairs Score (SP-Score)

• Consider the pairwise alignment of sequences ai and aj imposed by a multiple alignment of k sequences
• Denote the score of this pairwise alignment (not necessarily optimal) as s*(ai, aj)
• Sum up the pairwise scores for a multiple alignment: s(a1,…,ak) = Σi<j s*(ai, aj)

Computing SP-Score

Aligning 4 sequences: 6 pairwise alignments

Given a1,a2,a3,a4:

s(a1…a4) = Σi<j s*(ai,aj) = s*(a1,a2) + s*(a1,a3) + s*(a1,a4) + s*(a2,a3) + s*(a2,a4) + s*(a3,a4)

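SP-Score: Example

a1  ATG-C-AAT
.   A-G-CATAT
ak  ATCCCATTT

S(a1…ak) = Σi<j S*(ai, aj), summed over the C(n, 2) pairs of sequences

May also calculate the scores column by column:

Column 1 (A, A, A): each of the three pairs scores 1, so Score = 3
Column 3 (G, G, C): the one matching pair scores 1, so Score = 1 minus the penalty for the two mismatched pairs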

Example
• Compute Sum of Pairs Score of the following multiple alignment with match = 3, mismatch = -1, S(X,-) = -1, S(-,-) = 0

X:              G   T   A   C   G
Y:              T   G   C   C   G
Z:              C   G   G   C   C
W:              C   G   G   A   C
Column score:  -2   6  -2   6   2

Sum of pairs = -2+6-2+6+2 = 10
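The column-by-column calculation can be reproduced with a short sketch using the parameters stated in the example (match = 3, mismatch = -1, S(X,-) = -1, S(-,-) = 0). The function names are illustrative; a real tool would look up a substitution matrix such as PAM or BLOSUM instead of a fixed match/mismatch score.

```python
from itertools import combinations


def pair_score(a, b, match=3, mismatch=-1, gap=-1):
    """Score one pair of symbols taken from the same column."""
    if a == "-" and b == "-":
        return 0  # S(-,-) = 0
    if a == "-" or b == "-":
        return gap  # S(X,-) = -1
    return match if a == b else mismatch


def sp_column_scores(rows):
    """Sum-of-pairs score of each column: sum over all pairs of rows."""
    return [
        sum(pair_score(a, b) for a, b in combinations(col, 2))
        for col in zip(*rows)
    ]


rows = ["GTACG", "TGCCG", "CGGCC", "CGGAC"]  # X, Y, Z, W
scores = sp_column_scores(rows)
print(scores, sum(scores))  # [-2, 6, -2, 6, 2] 10
```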

Multiple alignment tools
• Clustal W (Thompson, 1994)
  • Most popular
• PRRP (Gotoh, 1993)
• HMMT (Eddy, 1995)
• DIALIGN (Morgenstern, 1998)
• T-Coffee (Notredame, 2000)
• MUSCLE (Edgar, 2004)
• Align-m (Walle, 2004)
• PROBCONS (Do, 2004)

from: C. Notredame, “Recent progresses in multiple alignment: a survey”, Pharmacogenomics (2002) 3(1)

Useful links

http://cnx.org/content/m11036/latest/

http://www.biokemi.uu.se/Utbildning/Exercises/ClustalX/index.shtm

http://bioinformatics.weizmann.ac.il/~pietro/Making_and_using_protein_MA/

http://homepage.usask.ca/~ctl271/857/paper1_overview.shtml

http://journal-ci.csse.monash.edu.au/ci/vol04/mulali/mulali.html

Was the quagga (now extinct) more like a zebra or a horse?

The quagga was an African animal that is now extinct. It looked partly like a horse, and partly like a zebra. In 1872, the last living quagga was photographed (above). Mitochondrial DNA was obtained from a museum specimen of a quagga and sequenced. Perform a multiple sequence alignment of quagga (Equus quagga boehmi), horse (Equus caballus), and zebra (Equus burchelli) mitochondrial DNA. To which animal was the quagga more closely related?

• For the Entrez search Equus quagga boehmi, there is only one mitochondrial DNA sequence (mitochondrial D-loop, AF499309). From Entrez nucleotide the accession numbers are AY246194 (horse) and AF499310 (zebra). The sequences are:

• >gi|29650801|gb|AY246194.1| Equus caballus haplotype be1 mitochondrial D-loop, complete sequenceGATTTCTTCCCCTAAACGACAACAATTTACCCTCATGTGCTATGTCAGTATCAGATTATACCCCCACATAACACCATACCCACCTGACATGCAATATCTTATGAATGGCCTATGTACGTCGTGCATTAAATTGTCTGCCCCATGAATAATAAGCATGTACATAATATCATTTATCTTACATAAGTACATTATATTATTGATCGTGCATACCCCATCCAAGTCAAATCATTTCCAGTCAACACGCATATCACAGCCCATGTTCCACGAGCTTAATCACCAAGCCGCGGGAAATCAGCAACCCTCCCAACTACGTGTCCCAATCCTCGCTCCGGGCCCATCCAAACGTGGGGGTTTCTACAATGAAACTATACCTGGCATCTGGTTCTTTCTTCAGGGCCATTCCCACCCAACCTCGCCCATTCTTTCCCCTTAAATAAGACATCTCGATGGACTAATGACTAATCAGCCCATGCTCACACATAACTGTGATTTCATGCATTTGGTATCTTTTTATATTTGGGGATGCTATGACTCAGCTATGGCCGTCAAAGGCCTCGACGCAGTCAATTAAATTGAAGCTGGACTTAAATTGAACGTTATTCCTCCGCATCAGCAACCATAAGGTGTTATTCAGTCCATGGTAGCGGGACATAGGAAACAAGTGCACCTGTGCACCTACCCGCGCAGTAAGCAAGTAATATAGCTTTCTTAATCAAACCCCCCCTACCCCCCATTAAACTCCACATATGTACATTCAACACAATCTTGCCAAACCCCAAAAACAAGACTAAACAATGCACAATACTTCATGAAGCTTAACCCTCGCATGCCAACCATAATAACTCAACACACCTAACAATCTTAACAGAACTTTCCCCCCGCCATTAATACCAACATGCTACTTTAATCAATAAAATTTCCATAGACAGGCATCCCCCTAGATCTAATTTTCTAAATCTGTCAACCCTTCTTCCCC

• >gi|20335096|gb|AF499310.1| Equus burchellii isolate Be1 mitochondrial D-loop, partial sequenceGCTCCACCGTCAACACCCAAAGCTGAAATTCTACTTAAACTATTCCTTGATTTCCTCCCCTAAACGACAACAATTCACCCTCATGTACTATGTCAGTATTAAAATACATCCTATGTAGCATTATACAGTTCAACATATAATACCCTGTTAACATCCTATGTACATCGTGCATTAAATTGTT

• >gi|20335095|gb|AF499309.1| Equus quagga boehmi isolate Bo1 mitochondrial D-loop, partial sequenceGCTCCACCGTCAACACCCAAAGCTGAAATTCTACTTAAACTATTCCTTGATTTCCTCCCCTAAACGACAACAGTTCACCCTCATGTACTATGTCAGTATTAAAATACATCCTATGTAGTATTATACAGTTCAACATATAATACCCTGTTAACATCCTATGTACGTCGTGCATTAGATTGTT

• The relevant portion of the multiple sequence alignment (excluding additional nonoverlapping horse sequence) is as follows:

gi|20335096|gb|AF499310.1| GCTCCACCGTCAACACCCAAAGCTGAAATTCTACTTAAACTATTCCTTGA 50 [zebra]
gi|20335095|gb|AF499309.1| GCTCCACCGTCAACACCCAAAGCTGAAATTCTACTTAAACTATTCCTTGA 50 [quagga]
gi|29650801|gb|AY246194.1| ------------------------------------------------GA 2 [horse]
                                                                           **

gi|20335096|gb|AF499310.1| TTTCCTCCCCTAAACGACAACAATTCACCCTCATGTACTATGTCAGTATT 100
gi|20335095|gb|AF499309.1| TTTCCTCCCCTAAACGACAACAGTTCACCCTCATGTACTATGTCAGTATT 100
gi|29650801|gb|AY246194.1| TTTCTTCCCCTAAACGACAACAATTTACCCTCATGTGCTATGTCAGTATC 52
                           **** ***************** ** ********** ************

gi|20335096|gb|AF499310.1| AAAATACATCCT-ATGTAGCATTATACA-GTTCAACATATAATACCCTGT 148
gi|20335095|gb|AF499309.1| AAAATACATCCT-ATGTAGTATTATACA-GTTCAACATATAATACCCTGT 148
gi|29650801|gb|AY246194.1| AGATTATACCCCCACATAACACCATACCCACCTGACATGCAATATCTTAT 102
                           * * ** * **  *  **  *  ****       ****  **** * * *

gi|20335096|gb|AF499310.1| TAACATCCTATGTACATCGTGCATTAAATTGTT----------------- 181
gi|20335095|gb|AF499309.1| TAACATCCTATGTACGTCGTGCATTAGATTGTT----------------- 181
gi|29650801|gb|AY246194.1| GAATGGCCTATGTACGTCGTGCATTAAATTGTCTGCCCCATGAATAATAA 152
                            **   ********* ********** *****

Visual inspection of the alignment can give you a clue that the quagga DNA is closely related to zebra DNA: [1] the internal gaps in the alignment suggest that horse DNA (on the bottom row) is an outlier, and [2] at most positions that are not conserved (i.e. positions lacking an asterisk), horse DNA differs while the quagga and zebra sequences match each other. The pairwise alignment scores also show clearly that the quagga is closer to a zebra:

Sequence 1: gi|29650801|gb|AY246194.1| 976 bp [horse]
Sequence 2: gi|20335096|gb|AF499310.1| 181 bp [zebra]
Sequence 3: gi|20335095|gb|AF499309.1| 181 bp [quagga]
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 56
Sequences (1:3) Aligned. Score: 55
Sequences (2:3) Aligned. Score: 97

Finally a PubMed search with the terms quagga, horse and zebra links to an article suggesting that zebra and quagga shared a common ancestor several million years ago.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=6504142&dopt=Abstract
