Scoring a multiple alignment


Page 1: Scoring a multiple alignment

Scoring a multiple alignment

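Sum of pairs, Star, Tree

[Figure: the same column of five letters (A, A, A, C, C) drawn three times, connected as a sum of pairs, as a star, and as a tree]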

Page 2: Scoring a multiple alignment

Sum of Pairs

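Five sequences, three columns:

AAA
AAA
AAA
AAC
ACC

Scored column by column, with α counted for each matching pair and β for each mismatching pair:

Column 1 (A A A A A): 10α
Column 2 (A A A A C): + (6α - 4β)
Column 3 (A A A C C): + (4α - 6β)

Sum of pairs = 20α - 10β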
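The pair counts behind 10α, 6α - 4β and 4α - 6β can be checked mechanically. The sketch below is only an illustration: the function name `column_pair_counts` is made up here, and the rows are the five-sequence example from this slide (AAA, AAA, AAA, AAC, ACC).

```python
from itertools import combinations

def column_pair_counts(column):
    """Return (matching pairs, mismatching pairs) over all unordered pairs in one column."""
    matches = sum(1 for a, b in combinations(column, 2) if a == b)
    total_pairs = len(column) * (len(column) - 1) // 2
    return matches, total_pairs - matches

# Rows of the slide's example alignment (five sequences, three columns).
rows = ["AAA", "AAA", "AAA", "AAC", "ACC"]
columns = ["".join(col) for col in zip(*rows)]

counts = [column_pair_counts(col) for col in columns]
total_matches = sum(m for m, _ in counts)
total_mismatches = sum(x for _, x in counts)

print(counts)                            # [(10, 0), (6, 4), (4, 6)]
print(total_matches, total_mismatches)   # 20 10, i.e. 20α - 10β
```

Each tuple is the coefficient of α and of β for one column, so the totals reproduce 20α - 10β.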

Page 3: Scoring a multiple alignment

Sum-of-Pairs Scoring Function

Score of multiple alignment = $\sum_{i<j} \mathrm{score}(S_i, S_j)$

where $\mathrm{score}(S_i, S_j)$ = score of induced pairwise alignment

Page 4: Scoring a multiple alignment

Induced Pairwise Alignment

S1 S - T I S C T G - S - N I
S2 L - T I - C N G S S - N I
S3 L R T I S C S G F S Q N I

Induced pairwise alignment of S1, S2:

S1 S T I S C T G - S N I
S2 L T I - C N G S S N I
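An induced pairwise alignment is obtained by taking two rows of the multiple alignment and dropping every column in which both rows have a gap. A minimal sketch, assuming rows are plain strings with "-" as the gap character (the helper name `induced_pairwise` is hypothetical):

```python
def induced_pairwise(row_a, row_b, gap="-"):
    """Keep only the columns where at least one of the two rows has a residue."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

# Rows S1 and S2 of the three-sequence alignment on this slide.
s1 = "S-TISCTG-S-NI"
s2 = "L-TI-CNGSS-NI"

print(induced_pairwise(s1, s2))  # ('STISCTG-SNI', 'LTI-CNGSSNI')
```

Columns where only one row has a gap are kept, which is why the induced alignment is not necessarily the optimal pairwise alignment of the two sequences.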

Page 5: Scoring a multiple alignment

Star alignment

• Heuristic method for multiple sequence alignments
• Select a sequence c as the center of the star
• For each sequence xi in x1, …, xk with index i ≠ c, perform a Needleman-Wunsch global alignment of xi against the center
• Aggregate alignments with the principle “once a gap, always a gap.”

Page 6: Scoring a multiple alignment

Choosing a center

• Try them all and pick the one which is most similar to all of the sequences
• Let S(xi,xj) be the optimal score between sequences xi and xj.
• Calculate all O(k^2) alignments, and choose as xc the sequence xi that maximizes $\sum_{j \neq i} S(x_i, x_j)$
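The center choice can be sketched as follows. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, since the slides do not fix them, and `nw_score` and `choose_center` are hypothetical names. The input is the set of four short sequences used in the star alignment example (MPE, MKE, MSKE, SKE).

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score (Needleman-Wunsch) with a linear gap cost."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def choose_center(seqs):
    """Return (index of the center, list of summed scores per sequence)."""
    totals = [
        sum(nw_score(x, y) for j, y in enumerate(seqs) if j != i)
        for i, x in enumerate(seqs)
    ]
    # max() returns the first index that reaches the maximum, so ties go to the earlier sequence.
    center = max(range(len(seqs)), key=totals.__getitem__)
    return center, totals

seqs = ["MPE", "MKE", "MSKE", "SKE"]
center, totals = choose_center(seqs)
print(center, totals)  # 1 [0, 4, 4, 2]
```

Under these assumed scores MKE and MSKE tie at 4, and the tie-break picks MKE; a different scoring scheme could select a different center.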

Page 7: Scoring a multiple alignment

Star alignment example

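[Figure: star diagram over s1, s2, s3, s4]

S1: MPE
S2: MKE
S3: MSKE
S4: SKE

Pairwise alignments (each against MKE):

MPE
| |
MKE

MSKE
| ||
M-KE

SKE
 ||
MKE

Merged one sequence at a time:

MPE
MKE

M-PE
M-KE
MSKE

M-PE
M-KE
MSKE
S-KE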
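The merge step ("once a gap, always a gap") can be sketched as below. This is an illustrative implementation under assumed scores (match +1, mismatch -1, gap -1), not the code behind the slides; `nw_align` and `star_align` are hypothetical names, and the center is passed in by index.

```python
def nw_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:  # trace back from the bottom-right corner
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

def star_align(seqs, center):
    """Merge pairwise alignments against seqs[center]; gaps are never removed."""
    rows = {center: seqs[center]}  # gapped rows built so far, keyed by sequence index
    for k, seq in enumerate(seqs):
        if k == center:
            continue
        c_aln, s_aln = nw_align(seqs[center], seq)
        old_c = rows[center]
        new_rows = {idx: [] for idx in rows}
        new_seq = []
        i = j = 0
        while i < len(old_c) or j < len(c_aln):
            old_gap = i < len(old_c) and old_c[i] == "-"
            new_gap = j < len(c_aln) and c_aln[j] == "-"
            if old_gap and not new_gap:
                # Gap already present in the center row: keep it, pad the new sequence.
                for idx in rows:
                    new_rows[idx].append(rows[idx][i])
                new_seq.append("-"); i += 1
            elif new_gap and not old_gap:
                # New gap in the center: pad every existing row.
                for idx in rows:
                    new_rows[idx].append("-")
                new_seq.append(s_aln[j]); j += 1
            else:
                # Same center letter on both sides (or a gap on both sides).
                for idx in rows:
                    new_rows[idx].append(rows[idx][i])
                new_seq.append(s_aln[j]); i += 1; j += 1
        rows = {idx: "".join(chars) for idx, chars in new_rows.items()}
        rows[k] = "".join(new_seq)
    return [rows[k] for k in range(len(seqs))]

msa = star_align(["MPE", "MKE", "MSKE", "SKE"], center=1)
print("\n".join(msa))
```

With MKE as the center, this reproduces the rows M-PE, M-KE, MSKE, S-KE from the example.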

Page 8: Scoring a multiple alignment

Analysis

• Assuming all sequences have length n
• O(k^2 n^2) to calculate center
• Step i of iterative pairwise alignment takes O((i·n)·n) time
  • two strings of length n and i·n
• O(k^2 n^2) overall cost
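The overall bound follows from summing the per-step costs. A short derivation, assuming k sequences of length n, one Needleman-Wunsch run of cost O(n^2) per pair for the center choice, and k - 1 merge steps:

```latex
% Center choice: one O(n^2) alignment for each of the \binom{k}{2} pairs
\binom{k}{2}\, O(n^2) = O(k^2 n^2)

% Merging: step i aligns a string of length n against one of length i \cdot n
\sum_{i=1}^{k-1} O\bigl((i \cdot n) \cdot n\bigr)
  = O\Bigl(n^2 \sum_{i=1}^{k-1} i\Bigr)
  = O\Bigl(n^2 \cdot \tfrac{k(k-1)}{2}\Bigr)
  = O(k^2 n^2)
```

Both phases are O(k^2 n^2), so the total is as well.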

Page 9: Scoring a multiple alignment

ClustalW

• Most popular multiple alignment tool today
• ‘W’ stands for ‘weighted’ (different parts of alignment are weighted differently).
• Three-step process
  1.) Construct pairwise alignments
  2.) Build Guide Tree (by Neighbor Joining method)
  3.) Progressive Alignment guided by the tree
    - The sequences are aligned progressively according to the branching order in the guide tree

Page 10: Scoring a multiple alignment

Step 1: Pairwise Alignment

• Aligns each sequence against each other, giving a similarity matrix
• Similarity = exact matches / sequence length (percent identity)

     v1   v2   v3   v4
v1    -
v2  .17    -
v3  .87  .28    -
v4  .59  .33  .62    -

(.17 means 17% identical)
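A percent-identity matrix of this shape can be sketched in a few lines. The sketch assumes the sequences are already aligned to equal length and compares them position by position; it does not run a pairwise alignment first. The function names are hypothetical, and the input is the four short sequences from the later ClustalW example (ALSK, TNSD, NASK, NTSD), not the v1..v4 sequences, which the slides do not show.

```python
def percent_identity(a, b):
    """Fraction of positions with the same residue in two equal-length aligned strings."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches / len(a)

def similarity_matrix(seqs):
    """Lower-triangular similarity matrix as a dict keyed by (row name, column name)."""
    names = list(seqs)
    return {
        (p, q): percent_identity(seqs[p], seqs[q])
        for i, p in enumerate(names)
        for q in names[:i]
    }

seqs = {"S1": "ALSK", "S2": "TNSD", "S3": "NASK", "S4": "NTSD"}
sim = similarity_matrix(seqs)
for (p, q), value in sim.items():
    print(p, q, value)
```

For these inputs S1/S3, S2/S4 and S3/S4 share 2 of 4 positions (0.5) and the remaining pairs share 1 of 4 (0.25).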

Page 11: Scoring a multiple alignment

Step 2: Guide Tree

• Create Guide Tree using the similarity matrix

• ClustalW uses the neighbor-joining method

• Guide tree roughly reflects evolutionary relations

Page 12: Scoring a multiple alignment

Step 2: Guide Tree (cont’d)

[Figure: guide tree joining v1 and v3 first, then v4, then v2]

Calculate:
v1,3 = alignment(v1, v3)
v1,3,4 = alignment((v1,3), v4)
v1,2,3,4 = alignment((v1,3,4), v2)

     v1   v2   v3   v4
v1    -
v2  .17    -
v3  .87  .28    -
v4  .59  .33  .62    -

Page 13: Scoring a multiple alignment

Step 3: Progressive Alignment

• Start by aligning the two most similar sequences
• Following the guide tree, add in the next sequences, aligning to the existing alignment
• Insert gaps as necessary

FOS_RAT    PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE  PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK  SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ
. . : ** . :.. *:.* * . * **:

Dots and stars show how well-conserved a column is.

Page 14: Scoring a multiple alignment

ClustalW: another example

S1 ALSK
S2 TNSD
S3 NASK
S4 NTSD

Page 15: Scoring a multiple alignment

ClustalW example

S1 ALSK
S2 TNSD
S3 NASK
S4 NTSD

All pairwise alignments → Distance Matrix

     S1  S2  S3  S4
S1    0   9   4   7
S2        0   8   3
S3            0   7
S4                0


Page 18: Scoring a multiple alignment

ClustalW example

S1 ALSK
S2 TNSD
S3 NASK
S4 NTSD

All pairwise alignments → Distance Matrix

     S1  S2  S3  S4
S1    0   9   4   7
S2        0   8   3
S3            0   7
S4                0

Neighbor Joining → Rooted Tree

[Figure: rooted tree with leaves S3, S1, S2, S4]

Multiple Alignment Steps

1. Align S1 with S3
2. Align S2 with S4
3. Align (S1, S3) with (S2, S4)

-ALSK
NA-SK

-TNSD
NT-SD

Multiple Alignment

-ALSK
-TNSD
NA-SK
NT-SD

Page 19: Scoring a multiple alignment

Progressive alignment

Find sequences most similar to each other → Build guide tree → Create intermediate alignments

Page 20: Scoring a multiple alignment

Other progressive approaches

• PILEUP
  • Similar to CLUSTALW
  • Uses UPGMA to produce tree
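To make the UPGMA idea concrete, here is a minimal average-linkage sketch. It is an illustration only, not PILEUP's implementation, and the `upgma` function and its input format are assumptions. The distances are taken from the ClustalW example's distance matrix (S1..S4); note that the slides build that example's tree with neighbor joining, not UPGMA.

```python
from itertools import combinations

def upgma(names, dist):
    """Average-linkage clustering; dist[(a, b)] is the distance between leaves a and b.

    Returns the tree as nested tuples of leaf names.
    """
    size = {name: 1 for name in names}               # cluster -> number of leaves
    d = {frozenset(k): v for k, v in dist.items()}   # unordered-pair distances
    while len(size) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        for c in size:
            if c not in (a, b):
                # Size-weighted average keeps the distance equal to the mean over all leaf pairs.
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))

dist = {
    ("S1", "S2"): 9, ("S1", "S3"): 4, ("S1", "S4"): 7,
    ("S2", "S3"): 8, ("S2", "S4"): 3, ("S3", "S4"): 7,
}
tree = upgma(["S1", "S2", "S3", "S4"], dist)
print(tree)  # (('S2', 'S4'), ('S1', 'S3'))
```

On this matrix UPGMA groups S2 with S4 and S1 with S3, the same grouping as the alignment steps in the ClustalW example; on other data the two tree-building methods can disagree.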

Page 21: Scoring a multiple alignment

Problems with progressive alignments

• Depend on pairwise alignments
• If sequences are very distantly related, much higher likelihood of errors
• Care must be taken in choosing scoring matrices and penalties

Page 22: Scoring a multiple alignment
Page 23: Scoring a multiple alignment

Iterative refinement in progressive alignment

Another problem of progressive alignment:
• Initial alignments are “frozen” even when new evidence comes

Example:
x: GAAGTT
y: GAC-TT   Frozen!
z: GAACTG
w: GTACTG

Now clear that correct y = GA-CTT

Page 24: Scoring a multiple alignment

Evaluating multiple alignments

• Balibase benchmark (Thompson, 1999)
  • De-facto standard for assessing the quality of a multiple alignment tool
  • Manually refined multiple sequence alignments
  • Quality measured by how well an alignment matches the core blocks
• Another benchmark: SABmark benchmark
  • Based on protein structural families

Page 25: Scoring a multiple alignment

Scoring multiple alignments

• Ideally, a scoring scheme should
  • Penalize variations in conserved positions more heavily
  • Relate sequences by a phylogenetic tree
    • Tree alignment
• Usually assume
  • Independence of columns
• Quality computation
  • Entropy-based scoring
    • Compute the Shannon entropy of each column
  • Sum-of-pairs (SP) score

Page 26: Scoring a multiple alignment

Multiple Alignments: Scoring

• Number of matches (multiple longest common subsequence score)

• Entropy score

• Sum of pairs (SP-Score)

Page 27: Scoring a multiple alignment

Multiple LCS Score

• A column is a “match” if all the letters in the column are the same
• Only good for very similar sequences

AAA
AAA
AAT
ATC
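Counting match columns is a one-liner once the alignment is read column by column. A small sketch on the slide's four-row example (rows AAA, AAA, AAT, ATC); the function name is hypothetical, and treating an all-gap column as a non-match is an assumption:

```python
def count_match_columns(rows, gap="-"):
    """Number of columns in which every row has the same (non-gap) letter."""
    return sum(1 for col in zip(*rows) if len(set(col)) == 1 and col[0] != gap)

rows = ["AAA", "AAA", "AAT", "ATC"]
print(count_match_columns(rows))  # 1
```

Only the first column is a match here; a single differing letter in a column is enough to score it zero, which is why this measure only works for very similar sequences.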

Page 28: Scoring a multiple alignment

Entropy

• Define frequencies for the occurrence of each letter in each column of multiple alignment
  • pA = 1, pT=pG=pC=0 (1st column)
  • pA = 0.75, pT = 0.25, pG=pC=0 (2nd column)
  • pA = 0.50, pT = 0.25, pC=0.25 pG=0 (3rd column)
• Compute entropy of each column: $-\sum_{X=A,T,G,C} p_X \log p_X$

AAA
AAA
AAT
ATC

Page 29: Scoring a multiple alignment

Entropy: Example

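Column entropy: $-\sum_{X=A,T,G,C} p_X \log p_X$

Best case: a column of identical letters (A, A, A, A)

entropy = 0

Worst case: all four letters equally frequent

entropy = $-\sum_{X=A,T,G,C} \frac{1}{4}\log\frac{1}{4} = -4\left(\frac{1}{4}\cdot(-2)\right) = 2$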

Page 30: Scoring a multiple alignment

Multiple Alignment: Entropy Score

Entropy for a multiple alignment is the sum of entropies of its columns:

$\sum_{\text{all columns}} \left( -\sum_{X=A,T,G,C} p_X \log p_X \right)$

Page 31: Scoring a multiple alignment

Entropy of an Alignment: Example

column entropy: -( pAlogpA + pClogpC + pGlogpG + pTlogpT)

•Column 1 = -[1*log(1) + 0*log0 + 0*log0 +0*log0] = 0

•Column 2 = -[(1/4)*log(1/4) + (3/4)*log(3/4) + 0*log0 + 0*log0] = -[ (1/4)*(-2) + (3/4)*(-.415) ] = +0.811

•Column 3 = -[(1/4)*log(1/4)+(1/4)*log(1/4)+(1/4)*log(1/4) +(1/4)*log(1/4)] = 4* -[(1/4)*(-2)] = +2.0

•Alignment Entropy = 0 + 0.811 + 2.0 = +2.811

A A A

A C C

A C G

A C T
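The column entropies in this example can be reproduced with a short script. This is a sketch using base-2 logarithms (consistent with log(1/4) = -2 in the worked numbers); letters that do not occur in a column are simply skipped, which matches the convention 0·log 0 = 0. The function names are hypothetical.

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (base 2) of the letter frequencies in one column."""
    n = len(column)
    # log2(n / count) equals -log2(p); writing it this way avoids a "-0.0" result.
    return sum((count / n) * log2(n / count) for count in Counter(column).values())

def alignment_entropy(rows):
    """Sum of the column entropies of an alignment given as equal-length rows."""
    return sum(column_entropy(col) for col in zip(*rows))

rows = ["AAA", "ACC", "ACG", "ACT"]
per_column = [round(column_entropy(col), 3) for col in zip(*rows)]
print(per_column)                          # [0.0, 0.811, 2.0]
print(round(alignment_entropy(rows), 3))   # 2.811
```

Lower entropy means a better-conserved column, so unlike the sum-of-pairs score this quantity is minimized rather than maximized.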

Page 32: Scoring a multiple alignment

Representing a profile as a logo

• Contribution of each residue to a position

Phosphate binding pattern from Prosite:

[LIVMF]-[GSA]-x(5)-P-x(4)-[LIVMFYW]-x-[LIVMF]-x-G-D-[GSA]-[GSAC]

Page 33: Scoring a multiple alignment

Multiple Alignment Induces Pairwise Alignments

Every multiple alignment induces pairwise alignments

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:

x: ACGCGG-C;  x: AC-GCGG-C;  y: AC-GCGAG
y: ACGC-GAG;  z: GCCGC-GAG;  z: GCCGCGAG

Page 34: Scoring a multiple alignment

Sum of Pairs (SP) Scoring

• SP scoring is the standard method for scoring multiple sequence alignments.
• Columns are scored by a ‘sum of pairs’ function using a substitution matrix (PAM or BLOSUM)
• Assumes statistical independence for the columns, does not use a phylogenetic tree.

Page 35: Scoring a multiple alignment

Sum of Pairs Score (SP-Score)

• Consider pairwise alignment of sequences ai and aj imposed by a multiple alignment of k sequences
• Denote the score of this suboptimal (not necessarily optimal) pairwise alignment as s*(ai, aj)
• Sum up the pairwise scores for a multiple alignment:

$s(a_1, \dots, a_k) = \sum_{i<j} s^*(a_i, a_j)$

Page 36: Scoring a multiple alignment

Computing SP-Score

Aligning 4 sequences: 6 pairwise alignments

Given a1,a2,a3,a4:

s(a1…a4) = Σ s*(ai,aj) = s*(a1,a2) + s*(a1,a3) + s*(a1,a4) + s*(a2,a3) + s*(a2,a4) + s*(a3,a4)

Page 37: Scoring a multiple alignment

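SP-Score: Example

a1  ATG-C-AAT
 .  A-G-CATAT
ak  ATCCCATTT

$S(a_1, \dots, a_k) = \sum_{i<j} S^*(a_i, a_j)$, summed over the $\binom{n}{2}$ pairs of sequences

May also calculate the scores column by column:

Column 1 (A, A, A): each of the three pairs scores 1, Score = 3
Column 3 (G, G, C): only the G-G pair scores 1, Score = 1 –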

Page 38: Scoring a multiple alignment

Example

• Compute Sum of Pairs Score of the following multiple alignment with match = 3, mismatch = -1, S(X,-) = -1, S(-,-) = 0

X: G  T  A  C  G
Y: T  G  C  C  G
Z: C  G  G  C  C
W: C  G  G  A  C
  -2  6 -2  6  2

Sum of pairs = -2+6-2+6+2 = 10
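The column scores above can be verified with a short script. This sketch uses the scores stated in the example (match = 3, mismatch = -1, S(X,-) = -1, S(-,-) = 0) in place of a PAM or BLOSUM matrix; the function names are hypothetical.

```python
from itertools import combinations

def pair_score(a, b, match=3, mismatch=-1, gap=-1):
    """Score of one aligned pair of symbols; a gap-gap pair scores 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sp_column_scores(rows):
    """Sum-of-pairs score of each column of an alignment given as equal-length rows."""
    return [
        sum(pair_score(a, b) for a, b in combinations(col, 2))
        for col in zip(*rows)
    ]

rows = ["GTACG", "TGCCG", "CGGCC", "CGGAC"]  # X, Y, Z, W
scores = sp_column_scores(rows)
print(scores)       # [-2, 6, -2, 6, 2]
print(sum(scores))  # 10
```

Summing over columns first and over sequence pairs first give the same total, because both enumerate every (pair of rows, column) combination exactly once.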

Page 39: Scoring a multiple alignment

Multiple alignment tools

• Clustal W (Thompson, 1994)
  • Most popular
• PRRP (Gotoh, 1993)
• HMMT (Eddy, 1995)
• DIALIGN (Morgenstern, 1998)
• T-Coffee (Notredame, 2000)
• MUSCLE (Edgar, 2004)
• Align-m (Walle, 2004)
• PROBCONS (Do, 2004)

Page 40: Scoring a multiple alignment

from: C. Notredame, “Recent progresses in multiple alignment: a survey”,Pharmacogenomics (2002) 3(1)

Page 41: Scoring a multiple alignment

Useful links

http://cnx.org/content/m11036/latest/

http://www.biokemi.uu.se/Utbildning/Exercises/ClustalX/index.shtm

http://bioinformatics.weizmann.ac.il/~pietro/Making_and_using_protein_MA/

http://homepage.usask.ca/~ctl271/857/paper1_overview.shtml

http://journal-ci.csse.monash.edu.au/ci/vol04/mulali/mulali.html

Page 42: Scoring a multiple alignment

Was the quagga (now extinct) more like a zebra or a horse?

The quagga was an African animal that is now extinct. It looked partly like a horse, and partly like a zebra. In 1872, the last living quagga was photographed (above). Mitochondrial DNA was obtained from a museum specimen of a quagga and sequenced. Perform a multiple sequence alignment of quagga (Equus quagga boehmi), horse (Equus caballus), and zebra (Equus burchelli) mitochondrial DNA. To which animal was the quagga more closely related?

Page 43: Scoring a multiple alignment

• For the Entrez search Equus quagga boehmi, there is only one mitochondrial DNA sequence (mitochondrial D-loop, AF499309). From Entrez nucleotide the accession numbers are AY246194 (horse) and AF499310 (zebra). The sequences are:

• >gi|29650801|gb|AY246194.1| Equus caballus haplotype be1 mitochondrial D-loop, complete sequenceGATTTCTTCCCCTAAACGACAACAATTTACCCTCATGTGCTATGTCAGTATCAGATTATACCCCCACATAACACCATACCCACCTGACATGCAATATCTTATGAATGGCCTATGTACGTCGTGCATTAAATTGTCTGCCCCATGAATAATAAGCATGTACATAATATCATTTATCTTACATAAGTACATTATATTATTGATCGTGCATACCCCATCCAAGTCAAATCATTTCCAGTCAACACGCATATCACAGCCCATGTTCCACGAGCTTAATCACCAAGCCGCGGGAAATCAGCAACCCTCCCAACTACGTGTCCCAATCCTCGCTCCGGGCCCATCCAAACGTGGGGGTTTCTACAATGAAACTATACCTGGCATCTGGTTCTTTCTTCAGGGCCATTCCCACCCAACCTCGCCCATTCTTTCCCCTTAAATAAGACATCTCGATGGACTAATGACTAATCAGCCCATGCTCACACATAACTGTGATTTCATGCATTTGGTATCTTTTTATATTTGGGGATGCTATGACTCAGCTATGGCCGTCAAAGGCCTCGACGCAGTCAATTAAATTGAAGCTGGACTTAAATTGAACGTTATTCCTCCGCATCAGCAACCATAAGGTGTTATTCAGTCCATGGTAGCGGGACATAGGAAACAAGTGCACCTGTGCACCTACCCGCGCAGTAAGCAAGTAATATAGCTTTCTTAATCAAACCCCCCCTACCCCCCATTAAACTCCACATATGTACATTCAACACAATCTTGCCAAACCCCAAAAACAAGACTAAACAATGCACAATACTTCATGAAGCTTAACCCTCGCATGCCAACCATAATAACTCAACACACCTAACAATCTTAACAGAACTTTCCCCCCGCCATTAATACCAACATGCTACTTTAATCAATAAAATTTCCATAGACAGGCATCCCCCTAGATCTAATTTTCTAAATCTGTCAACCCTTCTTCCCC

• >gi|20335096|gb|AF499310.1| Equus burchellii isolate Be1 mitochondrial D-loop, partial sequenceGCTCCACCGTCAACACCCAAAGCTGAAATTCTACTTAAACTATTCCTTGATTTCCTCCCCTAAACGACAACAATTCACCCTCATGTACTATGTCAGTATTAAAATACATCCTATGTAGCATTATACAGTTCAACATATAATACCCTGTTAACATCCTATGTACATCGTGCATTAAATTGTT

• >gi|20335095|gb|AF499309.1| Equus quagga boehmi isolate Bo1 mitochondrial D-loop, partial sequenceGCTCCACCGTCAACACCCAAAGCTGAAATTCTACTTAAACTATTCCTTGATTTCCTCCCCTAAACGACAACAGTTCACCCTCATGTACTATGTCAGTATTAAAATACATCCTATGTAGTATTATACAGTTCAACATATAATACCCTGTTAACATCCTATGTACGTCGTGCATTAGATTGTT

Page 44: Scoring a multiple alignment

• The portion of the multiple sequence alignment (excluding additional nonoverlapping horse sequence) is as follows:

gi|20335096|gb|AF499310.1| GCTCCACCGTCAACACCCAAAGCTGAAATTCTACTTAAACTATTCCTTGA 50 [zebra]
gi|20335095|gb|AF499309.1| GCTCCACCGTCAACACCCAAAGCTGAAATTCTACTTAAACTATTCCTTGA 50 [quagga]
gi|29650801|gb|AY246194.1| ------------------------------------------------GA 2  [horse]
                                                                           **

gi|20335096|gb|AF499310.1| TTTCCTCCCCTAAACGACAACAATTCACCCTCATGTACTATGTCAGTATT 100
gi|20335095|gb|AF499309.1| TTTCCTCCCCTAAACGACAACAGTTCACCCTCATGTACTATGTCAGTATT 100
gi|29650801|gb|AY246194.1| TTTCTTCCCCTAAACGACAACAATTTACCCTCATGTGCTATGTCAGTATC 52
                           **** ***************** ** ********** ************

gi|20335096|gb|AF499310.1| AAAATACATCCT-ATGTAGCATTATACA-GTTCAACATATAATACCCTGT 148
gi|20335095|gb|AF499309.1| AAAATACATCCT-ATGTAGTATTATACA-GTTCAACATATAATACCCTGT 148
gi|29650801|gb|AY246194.1| AGATTATACCCCCACATAACACCATACCCACCTGACATGCAATATCTTAT 102
                           * * ** * **  *  **  *  ****       ****  **** * * *

gi|20335096|gb|AF499310.1| TAACATCCTATGTACATCGTGCATTAAATTGTT----------------- 181
gi|20335095|gb|AF499309.1| TAACATCCTATGTACGTCGTGCATTAGATTGTT----------------- 181
gi|29650801|gb|AY246194.1| GAATGGCCTATGTACGTCGTGCATTAAATTGTCTGCCCCATGAATAATAA 152
                            **   ********* ********** *****

Visual inspection of the alignment can give you a clue that the quagga DNA is closely related to zebra DNA: [1] the internal gaps in the alignment suggest that horse DNA (on the bottom row) is an outlier, and [2] positions that are not conserved (i.e. positions lacking an asterisk) also consistently show that horse DNA differs while quagga and zebra sequences match each other. The pairwise alignment scores also show clearly that the quagga is closer to a zebra:

Sequence 1: gi|29650801|gb|AY246194.1| 976 bp [horse]
Sequence 2: gi|20335096|gb|AF499310.1| 181 bp [zebra]
Sequence 3: gi|20335095|gb|AF499309.1| 181 bp [quagga]
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 56
Sequences (1:3) Aligned. Score: 55
Sequences (2:3) Aligned. Score: 97

Finally a PubMed search with the terms quagga, horse and zebra links to an article suggesting that zebra and quagga shared a common ancestor several million years ago.