Scoring a multiple alignment: Sum of pairs, Star, Tree


Scoring a multiple alignment

Sum of pairs, Star, Tree

[Figure: the residues of one alignment column (A's and C's) joined in three ways: all pairs, a star, and a tree]


Sum of Pairs

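Alignment of five sequences (three columns):

AAA
AAA
AAA
AAC
ACC

With α the score of a matching pair and β the penalty of a mismatching pair:

Column 1 (A A A A A): 10 matches = 10α
Column 2 (A A A A C): 6 matches, 4 mismatches = 6α - 4β
Column 3 (A A A C C): 4 matches, 6 mismatches = 4α - 6β

Total = 10α + (6α - 4β) + (4α - 6β) = 20α - 10β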
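The pair counts in this example can be checked mechanically. The sketch below is only an illustration, not part of the original slides: it assumes the alignment rows are AAA, AAA, AAA, AAC, ACC, and the function name `sp_match_mismatch_counts` is made up for this example. It counts, column by column, how many of the pairs match and how many mismatch, which gives the coefficients of α and β.

```python
from collections import Counter
from math import comb

def sp_match_mismatch_counts(rows):
    """Return (matching pairs, mismatching pairs) summed over all columns."""
    matches = mismatches = 0
    for column in zip(*rows):
        counts = Counter(column)
        # Pairs of identical letters inside this column.
        same = sum(comb(c, 2) for c in counts.values())
        # Every remaining pair in the column is a mismatch.
        matches += same
        mismatches += comb(len(column), 2) - same
    return matches, mismatches

rows = ["AAA", "AAA", "AAA", "AAC", "ACC"]
n_match, n_mismatch = sp_match_mismatch_counts(rows)
print(f"SP score = {n_match}α - {n_mismatch}β")
```

Counting letters per column avoids enumerating all pairs explicitly, which matters only when the number of sequences is large.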


Sum-of-Pairs Scoring Function

Score of multiple alignment = Σ_{i<j} score(Si, Sj)

where score(Si, Sj) = score of the induced pairwise alignment of Si and Sj


Induced Pairwise Alignment

S1  S - T I S C T G - S - N I
S2  L - T I - C N G S S - N I
S3  L R T I S C S G F S Q N I

Induced pairwise alignment of S1, S2 (columns in which both sequences have a gap are dropped):

S1  S T I S C T G - S N I
S2  L T I - C N G S S N I
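A minimal sketch of how an induced pairwise alignment can be read off a multiple alignment: take the two rows and drop every column in which both have a gap. The function name `induced_pairwise` and the representation of rows as plain strings with "-" for gaps are assumptions made for this illustration.

```python
def induced_pairwise(msa, i, j, gap="-"):
    """Project rows i and j out of a multiple alignment.

    Columns where both rows hold a gap carry no information about the
    pair, so they are removed.
    """
    kept = [(a, b) for a, b in zip(msa[i], msa[j])
            if not (a == gap and b == gap)]
    row_i = "".join(a for a, _ in kept)
    row_j = "".join(b for _, b in kept)
    return row_i, row_j

msa = ["S-TISCTG-S-NI",
       "L-TI-CNGSS-NI",
       "LRTISCSGFSQNI"]
s1, s2 = induced_pairwise(msa, 0, 1)
print(s1)
print(s2)
```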


Star alignment

• Heuristic method for multiple sequence alignments

• Select a sequence xc as the center of the star

• For each sequence xi among x1, …, xk with index i ≠ c, perform a Needleman-Wunsch global alignment of xi with xc

• Aggregate the alignments with the principle “once a gap, always a gap.”


Choosing a center

• Try them all and pick the one which is most similar to all of the sequences

• Let S(xi,xj) be the optimal score between sequences xi and xj.

• Calculate all O(k²) alignments, and choose as xc the sequence xi that maximizes

Σ_{j ≠ i} S(xi, xj)
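The center selection can be sketched as follows. This is an illustration under stated assumptions, not the slides' own implementation: the scoring values (match = 2, mismatch = -1, linear gap = -2) and the function names `nw_score` and `choose_center` are chosen here for the example, and the four sequences are the ones used in the star alignment example.

```python
def nw_score(x, y, match=2, mismatch=-1, gap=-2):
    """Optimal Needleman-Wunsch global alignment score (linear gap penalty)."""
    # prev[j] holds the best score of aligning the processed prefix of x
    # with the first j letters of y.
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * gap]
        for j in range(1, len(y) + 1):
            diag = prev[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def choose_center(seqs):
    """Index of the sequence with the largest summed score to all others."""
    def total(i):
        return sum(nw_score(seqs[i], seqs[j])
                   for j in range(len(seqs)) if j != i)
    return max(range(len(seqs)), key=total)

seqs = ["MPE", "MKE", "MSKE", "SKE"]
print(seqs[choose_center(seqs)])
```

With these assumed scores the summed totals are 4 (MPE), 10 (MKE), 9 (MSKE) and 7 (SKE), so MKE is picked. A different scoring scheme could change the totals and produce ties, in which case `max` returns the first candidate.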


Star alignment example

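[Figure: star diagram of the sequences s1, s2, s3, s4]

S1: MPE
S2: MKE
S3: MSKE
S4: SKE

Pairwise alignments with the center S2 (MKE):

MPE
| |
MKE

MSKE
| ||
M-KE

SKE
 ||
MKE

Merging them one at a time:

MPE
MKE

M-PE
M-KE
MSKE

M-PE
M-KE
MSKE
S-KE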
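The merging step ("once a gap, always a gap") can be sketched as below. This is a simplified illustration: the function name `add_to_star` is hypothetical, the pairwise alignments against the center MKE are supplied by hand rather than computed, and row 0 of the growing alignment is assumed to be the center.

```python
def add_to_star(msa, center_aln, other_aln):
    """Merge one pairwise alignment (center vs. other) into msa.

    msa[0] is the center row of the alignment built so far; center_aln and
    other_aln are the two rows of the new pairwise alignment.
    """
    merged = [""] * len(msa)
    new_row = ""
    i = j = 0
    while i < len(msa[0]) or j < len(center_aln):
        a = msa[0][i] if i < len(msa[0]) else None
        b = center_aln[j] if j < len(center_aln) else None
        if a is not None and b is not None and (a == "-") == (b == "-"):
            # Both residues or both gaps: the columns coincide, copy them.
            merged = [m + row[i] for m, row in zip(merged, msa)]
            new_row += other_aln[j]
            i += 1
            j += 1
        elif a == "-":
            # Gap already present in the center: keep it, pad the new row.
            merged = [m + row[i] for m, row in zip(merged, msa)]
            new_row += "-"
            i += 1
        else:
            # New gap in the center: open a gap column in every existing row.
            merged = [m + "-" for m in merged]
            new_row += other_aln[j]
            j += 1
    return merged + [new_row]

msa = ["MKE"]  # start with the center alone
for center_aln, other_aln in [("MKE", "MPE"), ("M-KE", "MSKE"), ("MKE", "SKE")]:
    msa = add_to_star(msa, center_aln, other_aln)
print("\n".join(msa))
```

The rows come out with the center first (M-KE, M-PE, MSKE, S-KE), the same alignment as in the example up to row order.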


Analysis

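• Assuming all sequences have length n

• O(k²n²) to calculate the center: O(k²) pairwise alignments, each taking O(n²)

• Step i of the iterative pairwise alignment takes O((i·n)·n) time
  • two strings of length n and i·n

• Summing over the k steps gives O(n²)·(1 + 2 + … + k) = O(k²n²), so O(k²n²) overall cost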


ClustalW

• Most popular multiple alignment tool today

• ‘W’ stands for ‘weighted’ (different parts of the alignment are weighted differently).

• Three-step process

1.) Construct pairwise alignments

2.) Build Guide Tree (by Neighbor Joining method)

3.) Progressive Alignment guided by the tree

- The sequences are aligned progressively according to the branching order in the guide tree


Step 1: Pairwise Alignment

• Aligns each sequence against every other, giving a similarity matrix

• Similarity = exact matches / sequence length (percent identity)

     v1   v2   v3   v4
v1   -
v2   .17  -
v3   .87  .28  -
v4   .59  .33  .62  -

(.17 means 17% identical)
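As a small sketch of the similarity measure, the function below computes percent identity for one pairwise alignment. It is an illustration only: the name `percent_identity` and the toy input are made up, and the denominator is assumed to be the alignment length, since the slide does not say which "sequence length" is meant.

```python
def percent_identity(a, b, gap="-"):
    """Fraction of alignment columns holding the same residue in both rows."""
    if len(a) != len(b):
        raise ValueError("rows must be aligned to the same length")
    # A column counts as a match only if both rows hold the same residue;
    # gap characters never count as matches.
    matches = sum(1 for x, y in zip(a, b) if x == y and x != gap)
    return matches / len(a)

# Hypothetical pairwise alignment, used only to exercise the function.
print(round(percent_identity("GAAGTT", "GAC-TT"), 2))
```

Filling the similarity matrix then amounts to calling this function once for every pair of sequences after aligning that pair.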


Step 2: Guide Tree

• Create Guide Tree using the similarity matrix

• ClustalW uses the neighbor-joining method

• Guide tree roughly reflects evolutionary relations


Step 2: Guide Tree (cont’d)

[Figure: guide tree with leaves v1, v3, v4, v2]

Calculate:
v1,3 = alignment(v1, v3)
v1,3,4 = alignment((v1,3), v4)
v1,2,3,4 = alignment((v1,3,4), v2)

     v1   v2   v3   v4
v1   -
v2   .17  -
v3   .87  .28  -
v4   .59  .33  .62  -


Step 3: Progressive Alignment

• Start by aligning the two most similar sequences

• Following the guide tree, add in the next sequences, aligning to the existing alignment

• Insert gaps as necessary

FOS_RAT     PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE   PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK   SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE  PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN  PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ
            . . : ** . :.. *:.* * . * **:

Dots and stars show how well-conserved a column is.


ClustalW: another example

S1 ALSK
S2 TNSD
S3 NASK
S4 NTSD


ClustalW example

S1 ALSK
S2 TNSD
S3 NASK
S4 NTSD

All pairwise alignments give the distance matrix:

     S1  S2  S3  S4
S1   0   9   4   7
S2       0   8   3
S3           0   7
S4               0

Neighbor joining on the distance matrix gives the rooted tree ((S1, S3), (S2, S4)).

Multiple alignment steps:

1. Align S1 with S3

2. Align S2 with S4

3. Align (S1, S3) with (S2, S4)

-ALSK
NA-SK

-TNSD
NT-SD

Multiple alignment:

-ALSK
-TNSD
NA-SK
NT-SD


Progressive alignment

Find the sequences most similar to each other → Build guide tree → Create intermediate alignments


Other progressive approaches

• PILEUP
  • Similar to CLUSTALW
  • Uses UPGMA to produce tree
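To make the tree-building step concrete, here is a minimal UPGMA (average-linkage) sketch. It is an illustration under stated assumptions: the function name `upgma` and the nested-tuple tree representation are choices made here, and the input is the 4×4 distance matrix from the ClustalW example (S1 to S4). ClustalW itself uses neighbor joining, which is a different algorithm; UPGMA is shown because it is the simpler of the two.

```python
def upgma(names, dist):
    """Average-linkage clustering; returns the tree as nested tuples of names."""
    # Each cluster id maps to (subtree, number of leaves in it).
    clusters = {i: (names[i], 1) for i in range(len(names))}
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair of clusters
        (ta, sa), (tb, sb) = clusters.pop(a), clusters.pop(b)
        for c in clusters:
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            # Size-weighted average distance to the merged cluster.
            d[(c, next_id)] = (sa * dac + sb * dbc) / (sa + sb)
        # Drop every distance that involves the two merged clusters.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        clusters[next_id] = ((ta, tb), sa + sb)
        next_id += 1
    return next(iter(clusters.values()))[0]

names = ["S1", "S2", "S3", "S4"]
dist = [[0, 9, 4, 7],
        [9, 0, 8, 3],
        [4, 8, 0, 7],
        [7, 3, 7, 0]]
print(upgma(names, dist))
```

On this matrix the first merge is S2 with S4 (distance 3) and the second is S1 with S3 (distance 4), so the resulting topology pairs the same sequences as the tree in the ClustalW example.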


Problems with progressive alignments

• Depend on pairwise alignments

• If sequences are very distantly related, much higher likelihood of errors

• Care must be taken in choosing scoring matrices and penalties


Iterative refinement in progressive alignment

Another problem of progressive alignment:

• Initial alignments are “frozen” even when new evidence comes

Example:

x: GAAGTT
y: GAC-TT   (frozen!)

z: GAACTG
w: GTACTG

Now clear that correct y = GA-CTT


Evaluating multiple alignments

• Balibase benchmark (Thompson, 1999)
  • De facto standard for assessing the quality of a multiple alignment tool
  • Manually refined multiple sequence alignments
  • Quality measured by how well the alignment matches the core blocks

• Another benchmark: SABmark
  • Based on protein structural families


Scoring multiple alignments

• Ideally, a scoring scheme should
  • Penalize variations in conserved positions higher
  • Relate sequences by a phylogenetic tree
    • Tree alignment

• Usually assume
  • Independence of columns

• Quality computation
  • Entropy-based scoring
    • Compute the Shannon entropy of each column
  • Sum-of-pairs (SP) score


Multiple Alignments: Scoring

• Number of matches (multiple longest common subsequence score)

• Entropy score

• Sum of pairs (SP-Score)


Multiple LCS Score

• A column is a “match” if all the letters in the column are the same

• Only good for very similar sequences

AAA
AAA
AAT
ATC
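A minimal sketch of the multiple LCS score: count the columns in which every row has the same letter. The function name `count_match_columns` is made up for this illustration, the rows are read as AAA, AAA, AAT, ATC (four sequences, three columns), and columns consisting only of gaps are assumed not to count as matches.

```python
def count_match_columns(rows, gap="-"):
    """Number of columns in which every row holds the same residue."""
    return sum(1 for column in zip(*rows)
               if len(set(column)) == 1 and column[0] != gap)

rows = ["AAA", "AAA", "AAT", "ATC"]
print(count_match_columns(rows))
```

Only the first column is a match here, which illustrates why the score is informative only for very similar sequences: a single differing letter removes the whole column from the count.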


Entropy

• Define frequencies for the occurrence of each letter in each column of the multiple alignment
  • pA = 1, pT = pG = pC = 0 (1st column)
  • pA = 0.75, pT = 0.25, pG = pC = 0 (2nd column)
  • pA = 0.50, pT = 0.25, pC = 0.25, pG = 0 (3rd column)

• Compute the entropy of each column:

  -Σ_{X=A,T,G,C} pX log pX

AAA
AAA
AAT
ATC


Entropy: Example

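Best case: all letters in the column are the same (A, A, A, A)

entropy = 0

Worst case: all letters in the column differ (A, T, G, C)

entropy = -Σ (1/4) log(1/4) = -4 (1/4 × (-2)) = 2

where the column entropy is -Σ_{X=A,T,G,C} pX log pX and log is taken base 2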


Multiple Alignment: Entropy Score

Entropy for a multiple alignment is the sum of entropies of its columns:

Σ_{over all columns} ( -Σ_{X=A,T,G,C} pX log pX )


Entropy of an Alignment: Example

column entropy: -(pA log pA + pC log pC + pG log pG + pT log pT), with log taken base 2 and 0*log0 taken as 0

•Column 1 = -[1*log(1) + 0*log0 + 0*log0 +0*log0] = 0

•Column 2 = -[(1/4)*log(1/4) + (3/4)*log(3/4) + 0*log0 + 0*log0] = -[ (1/4)*(-2) + (3/4)*(-.415) ] = +0.811

•Column 3 = -[(1/4)*log(1/4)+(1/4)*log(1/4)+(1/4)*log(1/4) +(1/4)*log(1/4)] = 4* -[(1/4)*(-2)] = +2.0

•Alignment Entropy = 0 + 0.811 + 2.0 = +2.811

A A A

A C C

A C G

A C T
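The column-by-column calculation above can be reproduced with a short sketch. The function names `column_entropy` and `alignment_entropy` are assumptions made for this illustration; the logarithm is base 2, as in the worked numbers, and letters that do not occur in a column simply contribute nothing to its sum.

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (base 2) of the letter frequencies in one column."""
    n = len(column)
    return -sum((c / n) * log2(c / n) for c in Counter(column).values())

def alignment_entropy(rows):
    """Entropy of an alignment: the sum of the entropies of its columns."""
    return sum(column_entropy(column) for column in zip(*rows))

rows = ["AAA", "ACC", "ACG", "ACT"]
print(f"{alignment_entropy(rows):.3f}")
```

Lower values mean better-conserved columns, so unlike the sum-of-pairs score this quantity is minimized rather than maximized.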


Representing a profile as a logo

• Contribution of each residue to a position

Phosphate binding pattern from Prosite:

[LIVMF]-[GSA]-x(5)-P-x(4)-[LIVMFYW]-x-[LIVMF]-x-G-D-[GSA]-[GSAC]


Multiple Alignment Induces Pairwise Alignments

Every multiple alignment induces pairwise alignments

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:

x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG


Sum of Pairs (SP) Scoring

• SP scoring is the standard method for scoring multiple sequence alignments.

• Columns are scored by a ‘sum of pairs’ function using a substitution matrix (PAM or BLOSUM)

• Assumes statistical independence for the columns, does not use a phylogenetic tree.


Sum of Pairs Score(SP-Score)

• Consider the pairwise alignment of sequences ai and aj imposed by a multiple alignment of k sequences

• Denote the score of this pairwise alignment, which is not necessarily optimal, as s*(ai, aj)

• Sum up the pairwise scores for a multiple alignment:

s(a1,…,ak) = Σ_{i<j} s*(ai, aj)


Computing SP-Score

Aligning 4 sequences: 6 pairwise alignments

Given a1,a2,a3,a4:

s(a1…a4) = Σ_{i<j} s*(ai,aj) = s*(a1,a2) + s*(a1,a3) + s*(a1,a4) + s*(a2,a3) + s*(a2,a4) + s*(a3,a4)


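SP-Score: Example

a1  ATG-C-AAT
 .  A-G-CATAT
ak  ATCCCATTT

S(a1…ak) = Σ_{i<j} S*(ai, aj), summed over the (n choose 2) pairs of sequences

May also calculate the scores column by column:

Column 1 (A, A, A): three matching pairs, each scoring 1. Score = 3

Column 3 (G, G, C): one matching pair scoring 1, minus the penalty for the two mismatching G/C pairs. Score = 1 – (2 × mismatch penalty)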


Example

• Compute Sum of Pairs Score of the following multiple alignment with match = 3, mismatch = -1, S(X,-) = -1, S(-,-) = 0

X: G T A C G

Y: T G C C G

Z: C G G C C

W: C G G A C

Column scores: -2, 6, -2, 6, 2

Sum of pairs = -2+6-2+6+2 = 10
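The worked example can be checked with a short sketch that scores the alignment both column by column and pair by pair. The function names `pair_score` and `sp_column_scores` are made up for this illustration; the scoring values are the ones stated in the exercise (match = 3, mismatch = -1, S(X,-) = -1, S(-,-) = 0).

```python
from itertools import combinations

def pair_score(a, b, match=3, mismatch=-1, gap=-1):
    """Score of one aligned pair of symbols under the exercise's scheme."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sp_column_scores(rows):
    """Sum-of-pairs score of each column of the alignment."""
    return [sum(pair_score(a, b) for a, b in combinations(column, 2))
            for column in zip(*rows)]

rows = ["GTACG", "TGCCG", "CGGCC", "CGGAC"]  # X, Y, Z, W
columns = sp_column_scores(rows)

# The same total, obtained by summing the induced pairwise alignment scores.
by_pairs = sum(sum(pair_score(a, b) for a, b in zip(r1, r2))
               for r1, r2 in combinations(rows, 2))

print(columns, sum(columns), by_pairs)
```

Both orders of summation visit exactly the same symbol pairs, which is why the column-wise total and the pair-wise total agree.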


Multiple alignment tools

• Clustal W (Thompson, 1994)
  • Most popular
• PRRP (Gotoh, 1993)
• HMMT (Eddy, 1995)
• DIALIGN (Morgenstern, 1998)
• T-Coffee (Notredame, 2000)
• MUSCLE (Edgar, 2004)
• Align-m (Walle, 2004)
• PROBCONS (Do, 2004)


from: C. Notredame, “Recent progresses in multiple alignment: a survey”,Pharmacogenomics (2002) 3(1)


Useful links

http://cnx.org/content/m11036/latest/

http://www.biokemi.uu.se/Utbildning/Exercises/ClustalX/index.shtm

http://bioinformatics.weizmann.ac.il/~pietro/Making_and_using_protein_MA/

http://homepage.usask.ca/~ctl271/857/paper1_overview.shtml

http://journal-ci.csse.monash.edu.au/ci/vol04/mulali/mulali.html


Was the quagga (now extinct) more like a zebra or a horse?

The quagga was an African animal that is now extinct. It looked partly like a horse, and partly like a zebra. In 1872, the last living quagga was photographed (above). Mitochondrial DNA was obtained from a museum specimen of a quagga and sequenced. Perform a multiple sequence alignment of quagga (Equus quagga boehmi), horse (Equus caballus), and zebra (Equus burchelli) mitochondrial DNA. To which animal was the quagga more closely related?


• For the Entrez search Equus quagga boehmi, there is only one mitochondrial DNA sequence (mitochondrial D-loop, AF499309). From Entrez nucleotide, the accession numbers for the other two species are AY246194 (horse) and AF499310 (zebra). The sequences are:

• >gi|29650801|gb|AY246194.1| Equus caballus haplotype be1 mitochondrial D-loop, complete sequence
GATTTCTTCCCCTAAACGACAACAATTTACCCTCATGTGCTATGTCAGTATCAGATTATACCCCCACATAACACCATACCCACCTGACATGCAATATCTTATGAATGGCCTATGTACGTCGTGCATTAAATTGTCTGCCCCATGAATAATAAGCATGTACATAATATCATTTATCTTACATAAGTACATTATATTATTGATCGTGCATACCCCATCCAAGTCAAATCATTTCCAGTCAACACGCATATCACAGCCCATGTTCCACGAGCTTAATCACCAAGCCGCGGGAAATCAGCAACCCTCCCAACTACGTGTCCCAATCCTCGCTCCGGGCCCATCCAAACGTGGGGGTTTCTACAATGAAACTATACCTGGCATCTGGTTCTTTCTTCAGGGCCATTCCCACCCAACCTCGCCCATTCTTTCCCCTTAAATAAGACATCTCGATGGACTAATGACTAATCAGCCCATGCTCACACATAACTGTGATTTCATGCATTTGGTATCTTTTTATATTTGGGGATGCTATGACTCAGCTATGGCCGTCAAAGGCCTCGACGCAGTCAATTAAATTGAAGCTGGACTTAAATTGAACGTTATTCCTCCGCATCAGCAACCATAAGGTGTTATTCAGTCCATGGTAGCGGGACATAGGAAACAAGTGCACCTGTGCACCTACCCGCGCAGTAAGCAAGTAATATAGCTTTCTTAATCAAACCCCCCCTACCCCCCATTAAACTCCACATATGTACATTCAACACAATCTTGCCAAACCCCAAAAACAAGACTAAACAATGCACAATACTTCATGAAGCTTAACCCTCGCATGCCAACCATAATAACTCAACACACCTAACAATCTTAACAGAACTTTCCCCCCGCCATTAATACCAACATGCTACTTTAATCAATAAAATTTCCATAGACAGGCATCCCCCTAGATCTAATTTTCTAAATCTGTCAACCCTTCTTCCCC

• >gi|20335096|gb|AF499310.1| Equus burchellii isolate Be1 mitochondrial D-loop, partial sequence
GCTCCACCGTCAACACCCAAAGCTGAAATTCTACTTAAACTATTCCTTGATTTCCTCCCCTAAACGACAACAATTCACCCTCATGTACTATGTCAGTATTAAAATACATCCTATGTAGCATTATACAGTTCAACATATAATACCCTGTTAACATCCTATGTACATCGTGCATTAAATTGTT

• >gi|20335095|gb|AF499309.1| Equus quagga boehmi isolate Bo1 mitochondrial D-loop, partial sequence
GCTCCACCGTCAACACCCAAAGCTGAAATTCTACTTAAACTATTCCTTGATTTCCTCCCCTAAACGACAACAGTTCACCCTCATGTACTATGTCAGTATTAAAATACATCCTATGTAGTATTATACAGTTCAACATATAATACCCTGTTAACATCCTATGTACGTCGTGCATTAGATTGTT


• The portion of the multiple sequence alignment (excluding additional nonoverlapping horse sequence) is as follows:

gi|20335096|gb|AF499310.1| GCTCCACCGTCAACACCCAAAGCTGAAATTCTACTTAAACTATTCCTTGA 50 [zebra]
gi|20335095|gb|AF499309.1| GCTCCACCGTCAACACCCAAAGCTGAAATTCTACTTAAACTATTCCTTGA 50 [quagga]
gi|29650801|gb|AY246194.1| ------------------------------------------------GA 2  [horse]
                                                                           **

gi|20335096|gb|AF499310.1| TTTCCTCCCCTAAACGACAACAATTCACCCTCATGTACTATGTCAGTATT 100
gi|20335095|gb|AF499309.1| TTTCCTCCCCTAAACGACAACAGTTCACCCTCATGTACTATGTCAGTATT 100
gi|29650801|gb|AY246194.1| TTTCTTCCCCTAAACGACAACAATTTACCCTCATGTGCTATGTCAGTATC 52
                           **** ***************** ** ********** ************

gi|20335096|gb|AF499310.1| AAAATACATCCT-ATGTAGCATTATACA-GTTCAACATATAATACCCTGT 148
gi|20335095|gb|AF499309.1| AAAATACATCCT-ATGTAGTATTATACA-GTTCAACATATAATACCCTGT 148
gi|29650801|gb|AY246194.1| AGATTATACCCCCACATAACACCATACCCACCTGACATGCAATATCTTAT 102
                           * * ** * **  *  **  *  ****       ****  **** * * *

gi|20335096|gb|AF499310.1| TAACATCCTATGTACATCGTGCATTAAATTGTT----------------- 181
gi|20335095|gb|AF499309.1| TAACATCCTATGTACGTCGTGCATTAGATTGTT----------------- 181
gi|29650801|gb|AY246194.1| GAATGGCCTATGTACGTCGTGCATTAAATTGTCTGCCCCATGAATAATAA 152
                            **   ********* ********** *****

Visual inspection of the alignment can give you a clue that the quagga DNA is closely related to zebra DNA: [1] the internal gaps in the alignment suggest that horse DNA (on the bottom row) is an outlier, and [2] positions that are not conserved (i.e. positions lacking an asterisk) also consistently show that horse DNA differs while quagga and zebra sequences match each other. The pairwise alignment scores also show clearly that the quagga is closer to a zebra:

Sequence 1: gi|29650801|gb|AY246194.1| 976 bp [horse]
Sequence 2: gi|20335096|gb|AF499310.1| 181 bp [zebra]
Sequence 3: gi|20335095|gb|AF499309.1| 181 bp [quagga]
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 56
Sequences (1:3) Aligned. Score: 55
Sequences (2:3) Aligned. Score: 97

Finally a PubMed search with the terms quagga, horse and zebra links to an article suggesting that zebra and quagga shared a common ancestor several million years ago.