HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05...

28
1/20/05 CAP5510/CGS5166 (Lec 4) 1 Types of Sequence Alignments Global HIV Strain 1 HIV Strain 2 Local Semi-Global Strain 1 Strain 2 Strain 3 Strain 4 Multiple

Transcript of HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05...

Page 1: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 1

Types of Sequence Alignments

GlobalHIV Strain 1

HIV Strain 2

Local

Semi-Global

Strain 1

Strain 2

Strain 3

Strain 4

Multiple

Page 2: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 2

Global Alignment: An example

V: G A A T T C A G T T A W: G G A T C G A

δ[I, J] = Score of Matching the Ith character of sequence V &the Jth character of sequence W

Given

S[I, J] = Score of Matching First I characters of sequence V &First J characters of sequence W

Compute

Given

S[I, J] = MAXIMUM S[I-1, J-1] + δ(V[I], W[J]),S[I-1, J] + δ(V[I], ⎯),S[I, J-1] + δ(⎯ , W[J])

Recurrence Relation

Page 3: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 3

V: G A A T T C A G T T A W: G G A T C G A

S[I, J] = MAXIMUM S[I-1, J-1] + δ(V[I], W[J]),S[I-1, J] + δ(V[I], ⎯),S[I, J-1] + δ(⎯ , W[J])

Global Alignment: An example

Page 4: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 4

Traceback

V: G A A T T C A G T T A| | | | | |

W: G G A – T C – G - - A

Page 5: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 5

Alternative Traceback

V: G - A A T T C A G T T A| | | | | |

W: G G - A – T C – G - - A

V: G A A T T C A G T T A| | | | | |

W: G G A – T C – G - - APrevious

Page 6: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 6

Improved TracebackG A A T T C A G T T A

0 0 0 0 0 0 0 0 0 0 0 0

G 0 ×1 ←1 ← 1 ← 1 ← 1 ← 1 ← 1 ×1 ← 1 ← 1 ← 1

G 0 ×1 ↑1 ↑1 ↑1 ↑1 ↑1 ↑1 ×2 ← 2 ← 2 ← 2

A 0 ↑1 ↑1 ×2 ← 2 ← 2 ← 2 ×2 ↑2 ↑2 ↑2 ×3

T 0 ↑1 ← 2 ↑2 ×3 ×3 ← 3 ← 3 ← 3 ×3 ×3 ↑3

C 0 ↑1 ↑2 ↑2 ↑3 ↑3 ×4 ← 4 ← 4 ← 4 ← 4 ← 4

G 0 ↑1 ↑2 ↑2 ↑3 ↑3 ↑4 ↑4 ×5 ← 5 ← 5 ← 5

A 0 ↑1 ↑2 ×3 ↑3 ↑3 ↑4 ×5 ↑5 ↑5 ↑5 ×6

Page 7: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 7

Improved TracebackG A A T T C A G T T A

0 0 0 0 0 0 0 0 0 0 0 0

G 0 ×1 ←1 ← 1 ← 1 ← 1 ← 1 ← 1 ×1 ← 1 ← 1 ← 1

G 0 ×1 ↑1 ↑1 ↑1 ↑1 ↑1 ↑1 ×2 ← 2 ← 2 ← 2

A 0 ↑1 ↑1 ×2 ← 2 ← 2 ← 2 ×2 ↑2 ↑2 ↑2 ×3

T 0 ↑1 ← 2 ↑2 ×3 ×3 ← 3 ← 3 ← 3 ×3 ×3 ↑3

C 0 ↑1 ↑2 ↑2 ↑3 ↑3 ×4 ← 4 ← 4 ← 4 ← 4 ← 4

G 0 ↑1 ↑2 ↑2 ↑3 ↑3 ↑4 ↑4 ×5 ← 5 ← 5 ← 5

A 0 ↑1 ↑2 ×3 ↑3 ↑3 ↑4 ×5 ↑5 ↑5 ↑5 ×6

Page 8: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 8

Improved TracebackG A A T T C A G T T A

0 0 0 0 0 0 0 0 0 0 0 0

G 0 ×1 ←1 ← 1 ← 1 ← 1 ← 1 ← 1 ×1 ← 1 ← 1 ← 1

G 0 ×1 ↑1 ↑1 ↑1 ↑1 ↑1 ↑1 ×2 ← 2 ← 2 ← 2

A 0 ↑1 ↑1 ×2 ← 2 ← 2 ← 2 ×2 ↑2 ↑2 ↑2 ×3

T 0 ↑1 ← 2 ↑2 ×3 ×3 ← 3 ← 3 ← 3 ×3 ×3 ↑3

C 0 ↑1 ↑2 ↑2 ↑3 ↑3 ×4 ← 4 ← 4 ← 4 ← 4 ← 4

G 0 ↑1 ↑2 ↑2 ↑3 ↑3 ↑4 ↑4 ×5 ← 5 ← 5 ← 5

A 0 ↑1 ↑2 ×3 ↑3 ↑3 ↑4 ×5 ↑5 ↑5 ↑5 ×6

V: G A - A T T C A G T T A| | | | | |

W: G - G A – T C – G - - A

Page 9: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 9

Subproblems• Optimally align V[1..I] and W[1..J] for every

possible values of I and J.• Having optimally aligned

– V[1..I-1] and W[1..J-1]– V[1..I] and W[1..J-1]– V[1..I-1] and W[1, J]

it is possible to optimally align V[1..I] and W[1..J]

• O(mn),where m = length of V,and n = length of W.

Page 10: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 10

Generalizations of Similarity Function• Mismatch Penalty = α• Spaces (Insertions/Deletions, InDels) = β• Affine Gap Penalties:

(Gap open, Gap extension) = (γ,δ)• Weighted Mismatch = Φ(a,b)• Weighted Matches = Ω(a)

Page 11: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 11

Alternative Scoring SchemesG A A T T C A G T T A

0 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12

G -2 × 1 ← -1 ← -2 ← -3 ← -4 ← -5 ← -6 ← -7 ← -8 ← -9 ← -10

G -3 ↑-1 × -1 ← -3 ← -4 ← -5 ← -6 ← -7 × -5 ← -7 ← -8 ← -9

A -4 ↑-2 × 0 × 0 ← -2 ← -3 ← -4 ← -5 ← -6 ← -7 ← -8 × -7

T -5 ↑-3 ↑ -2 ↑-2 × 1 ← -1 ← -2 ← -3 ← -4 ← -5 ← -6 ← -7

C -6 ↑-4 ↑ -3 ↑-3 ↑-1 × -1 × 0 ← -2 ← -3 ← -4 ← -5 ← -6

G -7 ↑-5 ↑-4 ↑-4 ↑-2 ↑-3 ↑-2 × -2 × -1 ← -3 ← -4 ← -5

A -8 ↑-6 ↑-5 ↑-5 ↑-3 ↑-4 ↑-3 × -1 ↑-3 × -3 × -5 × -3

V: G A A T T C A G T T A| | | | | |

W: G G A T - C – G - - A

Match +1Mismatch –2Gap (-2, -1)

Page 12: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 12

Local Sequence Alignment• Example: comparing long stretches of anonymous DNA; aligning

proteins that share only some motifs or domains. • Smith-Waterman Algorithm

Page 13: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 13

Recurrence Relations (Global vs Local Alignments)

• S[I, J] = MAXIMUM S[I-1, J-1] + δ(V[I], W[J]),S[I-1, J] + δ(V[I], ⎯),S[I, J-1] + δ(⎯ , W[J])

----------------------------------------------------------------------• S[I, J] = MAXIMUM 0,

S[I-1, J-1] + δ(V[I], W[J]),S[I-1, J] + δ(V[I], ⎯),S[I, J-1] + δ(⎯ , W[J])

Global Alignment

Local Alignment

Page 14: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 14

Local Alignment: ExampleG A A T T C A G T T A

0 0 0 0 0 0 0 0 0 0 0 0

G 0 × 1 0 0 0 0 0 0 0 0 0 0

G 0 × 1 ← 0 0 0 0 0 0 × 1 0 0 0

A 0 0 × 2 × 1 0 0 0 × 1 0 0 0 × 1

T 0 0 ↑ 0 × 1 × 2 ← 1 0 0 0 × 1 × 1 0

C 0 0 0 0 ↑ 0 × 0 × 2 0 0 0 0 0

G 0 0 0 0 0 0 0 0 × 1 0 0 0

A 0 0 × 1 × 1 0 0 0 × 1 0 0 0 × 1

V: - G A A T T C A G T T A| | | |

W: G G - A T - C – G - - A

Match +1Mismatch –1Gap (-1, -1)

Page 15: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 15

Properties of Smith-Waterman Algorithm

• How to find all regions of “high similarity”?– Find all entries above a threshold score and traceback.

• What if: Matches = 1 & Mismatches/spaces = 0?– Longest Common Subsequence Problem

• What if: Matches = 1 & Mismatches/spaces = -∝?– Longest Common Substring Problem

• What if the average entry is positive?– Global Alignment

Page 16: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 16

How to score mismatches?

Page 17: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 17

BLOSUM n Substitution Matrices

• For each amino acid pair a, b– For each BLOCK

• Align all proteins in the BLOCK• Eliminate proteins that are more than n%

identical• Count F(a), F(b), F(a,b)• Compute Log-odds Ratio

⎟⎟⎠

⎞⎜⎜⎝

⎛)()(

),(logbFaF

baF

Page 18: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 18

String Matching ProblemString Matching Problem

Pattern PSet of Locations L

Text T

Page 19: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 19

(Approximate) String Matching(Approximate) String MatchingInput: Text T , Pattern PQuestion(s):

Does P occur in T? Find one occurrence of P in T.Find all occurrences of P in T.Count # of occurrences of P in T.Find longest substring of P in T.Find closest substring of P in T.Locate direct repeats of P in T.Many More variants

Input: Text T , Pattern PQuestion(s):

Does P occur in T? Find one occurrence of P in T.Find all occurrences of P in T.Count # of occurrences of P in T.Find longest substring of P in T.Find closest substring of P in T.Locate direct repeats of P in T.Many More variants

Applications: Is P already in the database T?Locate P in T.Can P be used as a primer for T?Is P homologous to anything in T?Has P been contaminated by T?Is prefix(P) = suffix(T)?Locate tandem repeats of P in T.

Applications: Is P already in the database T?Locate P in T.Can P be used as a primer for T?Is P homologous to anything in T?Has P been contaminated by T?Is prefix(P) = suffix(T)?Locate tandem repeats of P in T.

Page 20: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 20

Input: Input: Text T; Pattern PText T; Pattern P

All occurrences of P in T.All occurrences of P in T.Output: Output:

Methods:Methods:

• Naïve Method• Rabin-Karp Method• FSA-based method• Knuth-Morris-Pratt algorithm• Boyer-Moore• Suffix Tree method• Shift-And method

Page 21: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 21

Naive StrategyNaive Strategy

ATAQAANANASPVANAGVERANANESISITALVDANANANANASATAQAANANASPVANAGVERANANESISITALVDANANANANAS

ANANASANANASANANASANANASANANASANANASANANASANANASANANASANANASANANASANANASANANASANANAS ANANASANANAS ANANASANANAS ANANASANANASANANASANANASANANASANANAS

Page 22: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 22

Finite State AutomatonFinite State Automaton

ANANASANANAS

4

5

3

6

1

7

A N A

A A

A

2

AN

FiniteState Automaton

S A

N

ATAQAANANASPVANAGVERANANESISITALVDANANANANASATAQAANANASPVANAGVERANANESISITALVDANANANANAS

Page 23: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 23

State Transition DiagramState Transition Diagram

A N S *0 1 0 0 01 1 2 0 02 3 0 0 03 1 4 0 04 5 0 0 05 1 4 6 06 1 0 0 0

-A

ANANA

ANANANANA

ANANAS

Page 24: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 24

Input: Input: Text T; Pattern PText T; Pattern P

All occurrences of P in T.All occurrences of P in T.Output: Output:

Sliding Window Strategy:Sliding Window Strategy:

Initialize window on T;

While (window within T) do

endwhile;

Shift: shift window to right

Scan: if (window = P) then report it;

(by ?? positions)

Page 25: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 25

Tries

Storing:BIGBIGGERBILLGOODGOSH

Page 26: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 26

Suffix Tries & Compact Suffix TriesStore all suffixes of GOOGOL$

Suffix Trie

Compact Suffix Trie

Page 27: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 27

Suffix Tries to Suffix Trees

Compact Suffix Trie

Suffix Tree

Page 28: HIV Strain 1 Global HIV Strain 2users.cis.fiu.edu/~giri/teach/Bioinf/S05/Lec4.pdf · 1/20/05 CAP5510/CGS5166 (Lec 4) 2 Global Alignment: An example V: G A A T T C A G T T A W: G G

1/20/05 CAP5510/CGS5166 (Lec 4) 28

Suffix Trees• Linear-time construction!• String Matching, Substring matching, substring common to k of n

strings• All-pairs prefix-suffix problem• Repeats & Tandem repeats• Approximate string matching