Nick Heppenstall (biology) Michal Dvir (Mathematics/CS) Andrew Dittmore (physics) Under guidance of...

Post on 22-Dec-2015



Nick Heppenstall (biology)

Michal Dvir (Mathematics/CS)

Andrew Dittmore (physics)

Under guidance of Dr. Yung-Pin Chen (Mathematics)

DNA Sequence Comparison and Alignment

Outline

• Pi in base 4

• DNA Overview

• Markov Chains & Models

• Sequence Alignment

• Future Plans

π in base 10: 3.14159…

π in base 4: 3.021003331…

Looking at π in base 4, if every digit is equally likely, the chance of seeing

• 2 is 1:4

• 22 is 1:16

• 222 is 1:64

• 2222 is 1:256
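The base conversion and the run odds above can be checked with a short Python sketch. It is an illustration only: the 50 decimal places of π are hard-coded, `to_base4` and `run_probability` are hypothetical helper names, and the odds assume the digits are uniform and independent, which has not been proven for π.

```python
from fractions import Fraction

# First 50 decimal places of pi, hard-coded so no extra library is needed.
PI_DECIMAL = "3.14159265358979323846264338327950288419716939937510"

def to_base4(decimal_string, n_digits):
    """Return the first n_digits base-4 digits after the point."""
    x = Fraction(decimal_string)
    x -= int(x)  # keep only the fractional part
    digits = []
    for _ in range(n_digits):
        x *= 4
        d = int(x)  # the integer part is the next base-4 digit
        digits.append(d)
        x -= d
    return digits

def run_probability(k, base=4):
    """Chance that k given positions all show one chosen digit,
    assuming uniform, independent digits."""
    return Fraction(1, base ** k)

pi_base4 = to_base4(PI_DECIMAL, 30)
print("3." + "".join(str(d) for d in pi_base4))
for k in range(1, 5):
    print("2" * k, "has odds", run_probability(k))
```

Exact fractions avoid floating-point rounding, so the leading digits depend only on the accuracy of the hard-coded decimal string.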

The Normality of Pi

3, 0, 2, 1, 0, 0, 3, 3, 3, 1, 2, 2, 2, 2, 0, 2, 0, 2, 0, 1, 1, 2, 2, 0, 3, 0, 0, 2, 0, 3, 1, 0, 3, 0, 1, 0, 3, 0, 1, 2, 1, 2, 0, 2, 2, 0, 2, 3, 2, 0, 0, 0, 3, 1, 3, 0, 0, 1, 3, 0, 3, 1, 0, 1, 0, 2, 2, 1, 0, 0, 0, 2, 1, 0, 3, 2, 0, 0, 2, 0, 2, 0, 2, 2, 1, 2, 1, 3, 3, 0, 3, 0, 1, 3, 1, 0, 0, 0, 0, 2,0, 0, 2, 3, 2, 3, 3, 2, 2, 2, 1, 2, 0, 3, 2, 3, 0, 1, 0, 3, 2, 1, 2, 3, 0, 2, 0, 2, 1, 1, 0, 1, 1, 0, 2, 2, 0, 0, 2, 0, 1, 3, 2, 1, 2, 0, 3, 2, 0, 3, 1, 0, 0, 0, 1, 0, 3, 1, 3, 1, 3, 2, 3, 3, 2, 1, 1, 1,0, 1, 2, 1, 2, 3, 0, 3, 3, 0, 3, 1, 0, 3, 2, 2, 1, 0, 0, 3, 0, 1, 2, 3, 0, 3, 0, 0, 0, 2, 2, 3, 0, 0, 2, 2, 1, 2, 3, 1, 3, 3, 0, 2, 1, 1, 3, 3, 0, 1, 1, 0, 0, 3, 1, 3, 1, 0, 3, 3, 3, 2, 0, 1, 0, 3, 1, 1, 1, 2, 3, 1, 1, 2, 3, 1, 1, 1, 0, 1, 3, 0, 0, 2, 1, 0, 1, 1, 3, 2, 1, 0, 2, 0, 1, 1, 2, 3, 1, 1, 1, 3, 1, 2, 1, 2, 0, 2, 1, 1, 3, 2, 1, 3, 3, 2, 3, 0, 1, 2, 3, 3, 1, 0, 1, 0, 3, 0, 1, 0, 0, 2, 3, 2, 2, 1, 2, 2, 1, 2, 0, 3, 1, 3, 3, 2, 3, 1, 1, 2, 2, 3, 0, 0, 2, 3, 3, 3, 3, 3, 1, 1, 3, 0, 2, 3, 1, 2, 3, 3, 1, 0, 0, 0, 1, 2, 2, 3, 1, 3, 3, 2, 3, 1, 3, 2, 3, 2, 0, 3, 2, 0, 1, 2, 2, 3, 3, 3, 2, 3, 1, 1, 2, 2, 2, 0, 2, 1, 2, 1, 3, 3, 2, 2, 1, 1, 2, 2, 3, 2, 2, 1, 3, 3, 0, 2, 1, 0, 0, 1, 0, 1, 1, 3, 3, 0, 1, 0, 2, 3, 0, 1, 3, 3, 3, 2, 1, 2, 1, 0, 2, 1, 0, 2, 2, 0, 1, 2, 1, 2, 1, 1, 0, 1, 3, 2, 3, 0, 3, 2, 1, 0, 1, 1, 2, 3, 0, 3, 3, 1, 3, 0, 0, 2, 0, 0, 0, 0, 1, 3, 3, 0, 2, 3, 2, 0, 2, 2, 0, 1, 1, 2, 0, 3, 2, 3, 3, 3, 0, 0, 1, 1, 2, 1, 2, 0, 3, 1, 2, 2, 1, 0, 2, 0, 0, 3, 1, 2, 0, 1, 3, 0 . . .

Digits of π in base 4:

First 5000 digits of π in base 4.

t,a,g,t,a,a,a,a,t,t,a,a,a,t,t,a,a,t,t,a,t,a,a,a,a,t,t,a,t,a,t,a,t,a,t,a,a,t,t,t,a,c,t,a,a,c,t,t,t,a,g,t,t,a,g,a,t,a,a,a,t,t,a,a,t,a,a,t,a,t,a,t,a,a,g,t,t,t,t,a,g,t,a,c,a,t,t,a,a,t,a,t,t,a,t,a,t,t,t,t,a,a,a,t,a,t,t,t,t,a,t,t,t,a,g,t,g,t,c,t,a,g,a,a,a,a,a,a,a,t,g,t,g,t,a,a,c,c,c,a,t,g,a,c,t,g,t,a,g,g,a,a,a,c,t,c,t,a,ga,g,g,g,t,a,a,g,a,a,a,g,a,t,c,g,a,t,c,g,c,t,t,t,a,t,a,g,a,g,a,c,c,a,t,c,a,g,a,a,a,g,a,g,g,t,t,t,a,a,t,a,t,t,t,t,t,g,t,g,a,g,a,c,c,a,t,t,g,a,a,g,a,g,a,g,a,a,a,g,a,g,a,a,a,g,a,g,a,a,t,a,a,a,a,a,t,a,t,t,t,t,a,g,t,g,a,c,t,c,c,a,tc,a,g,a,a,a,g,a,g,g,t,t,t,a,a,t,a,t,t,t,t,t,g,t,g,a,g,a,c,c,a,t,t,g,a,a,g,a,g,a,g,a,a,a,g,a,g,a,a,a,g,a,g,a,a,t,a,a,a,a,a,t,a,t,t,t,t,a,g,t,g,a,c,t,c,ca,t,c,a,g,a,a,a,g,a,g,g,t,t,t,a,a,t,a,t,t,t,t,t,g,t,g,a,g,a,c,c,a,t,c,g,a,a,g,a,g,a,g,a,a,a,g,a,g,a,a,t,a,a,a,a,a,t,a,t,t,t,t,t,g,t,a,a,a,a,c,t,t,t,t,t,t,a,t,g,a,g,a,c,c,a,t,t,g,a,a,g,a,g,a,g,a,a,a,g,a,g,a,a,t,a,a,a,a,a,t,a,t,t,tt,t,g,t,a,a,a,a,c,t,t,t,t,t,t,a,t,g,a,g,a,c,c,a,t,t,g,a,a,g,a,g,a,g,a,a,a…

Bases of the cowpox genome:

First 5000 bases of the cowpox genome.

Three-Dimensional Trajectories*

[Figure: two three-dimensional trajectory plots, labeled "Pi (random)" and "DNA"]

*H.T. Chang, N. Lo, W. Lu, C.J. Kuo, “Visualization and Comparison of DNA Sequences by Use of Three-Dimensional Trajectories.”

DNA

• Deoxyribonucleic Acid

• Double helix

• Chain of nucleotide subunits

• Four bases in DNA (A,T,C,G)

• Hold information for maintaining life

• Passed from parent(s) to offspring

Mutations

• Single base substitutions

• Insertions/Deletions

• Duplications

• Translocations

• Inversions

…ACT CCT GAG GAG… → …ACT CCT GTG GAG…

Thr Pro Glu Glu → Thr Pro Val Glu

…ACT CCT GAG GAG… → …ACT CCT GAG TAG

Thr Pro Glu Glu → Thr Pro Glu STOP

…ACT CCT GAG GAG… → …ACT CCT GAG GAA…

Thr Pro Glu Glu → Thr Pro Glu Glu

• Environmental factors

• Copying errors
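The three single-base substitution examples above can be reproduced with a small Python sketch. The codon table is deliberately partial (only the codons shown on the slide), and the labels "missense", "nonsense", and "silent" are standard genetics terms that do not appear on the slides; `translate` and `classify` are hypothetical helper names.

```python
# Partial codon table: only the codons that appear in the slide examples.
CODON_TABLE = {
    "ACT": "Thr", "CCT": "Pro", "GAG": "Glu",
    "GAA": "Glu", "GTG": "Val", "TAG": "STOP",
}

def translate(dna):
    """Translate space-separated codons into amino acid names."""
    return [CODON_TABLE[codon] for codon in dna.split()]

def classify(original, mutated):
    """Label a substitution by its effect on the translated protein."""
    before, after = translate(original), translate(mutated)
    if before == after:
        return "silent"      # same amino acids despite the base change
    if "STOP" in after and "STOP" not in before:
        return "nonsense"    # a premature stop codon appears
    return "missense"        # one amino acid is replaced by another

original = "ACT CCT GAG GAG"
for mutated in ("ACT CCT GTG GAG", "ACT CCT GAG TAG", "ACT CCT GAG GAA"):
    print(mutated, translate(mutated), classify(original, mutated))
```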

DNA sequence comparison

• Homologous genes

• Conserved sequences

• Identify mutations

• Forensics

• Evolution

QUANTITATIVE!

Markov Chain

Definition: A collection of random variables having the property that, given the present, the future is conditionally independent of the past.

[Diagram: two states, City and Country. Probabilities of staying: 0.95 and 0.97. Probabilities of moving: 0.05 and 0.03.]

Example: Annual percentage migration between city and country
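A minimal Python sketch of this chain follows. The transcript does not make clear which state carries which probability, so assigning 0.95/0.05 to City and 0.97/0.03 to Country is an assumption, as is the 60/40 starting population; `step` and `long_run` are hypothetical helper names.

```python
STATES = ["City", "Country"]

# Assumption: 5% of City residents move out each year, 3% of Country residents move in.
P = {
    "City":    {"City": 0.95, "Country": 0.05},
    "Country": {"City": 0.03, "Country": 0.97},
}

def step(dist):
    """Advance a population distribution by one year."""
    return {s: sum(dist[r] * P[r][s] for r in STATES) for s in STATES}

def long_run(dist, years=1000):
    """Apply the yearly step repeatedly to approach the steady state."""
    for _ in range(years):
        dist = step(dist)
    return dist

start = {"City": 0.6, "Country": 0.4}  # hypothetical starting split
after_one_year = step(start)
steady = long_run(start)
print(after_one_year)
print(steady)
```

Because next year's distribution depends only on this year's, the update needs no history, which is the Markov property in the definition above.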

Hidden Markov Model

A Hidden Markov Model is a Markov chain, where each state (City/Country) generates an observation or emission (Pet). The state can be predicted by observing emissions.

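[Diagram: the City and Country states with the same transition probabilities as the Markov chain example (0.95 and 0.97 for staying, 0.05 and 0.03 for moving), each state emitting a pet]

Emissions from one state: Cow 0.5, Dog 0.3, Cat 0.1, None 0.1

Emissions from the other state: Cow 0.0, Dog 0.1, Cat 0.4, None 0.5

Example: Annual percentage migration between city and country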
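How hidden states can be predicted from emissions is sketched below with the Viterbi algorithm, a standard HMM decoding method that the slides do not name. Several details are assumptions: the cow-heavy emission set is assigned to Country, the 0.95 self-transition to City, and both starting states are taken as equally likely.

```python
import math

STATES = ["City", "Country"]
START = {"City": 0.5, "Country": 0.5}  # assumption: not given on the slide
TRANS = {
    "City":    {"City": 0.95, "Country": 0.05},
    "Country": {"City": 0.03, "Country": 0.97},
}
EMIT = {
    "City":    {"Cow": 0.0, "Dog": 0.1, "Cat": 0.4, "None": 0.5},
    "Country": {"Cow": 0.5, "Dog": 0.3, "Cat": 0.1, "None": 0.1},
}

def safe_log(p):
    """Natural log that maps probability 0 to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(observations):
    """Return the most probable state sequence for the observed pets."""
    # best[s] = log-probability of the best path so far that ends in state s
    best = {s: safe_log(START[s]) + safe_log(EMIT[s][observations[0]])
            for s in STATES}
    back = []
    for obs in observations[1:]:
        prev, best, pointers = best, {}, {}
        for s in STATES:
            r = max(STATES, key=lambda q: prev[q] + safe_log(TRANS[q][s]))
            best[s] = prev[r] + safe_log(TRANS[r][s]) + safe_log(EMIT[s][obs])
            pointers[s] = r
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

print(viterbi(["Cow", "Cow", "Cow"]))
print(viterbi(["Cat", "Cat", "Cat"]))
```

Working in log space keeps long observation sequences from underflowing to zero.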

HMM: State Transitions

Match

Mismatch

InDel

States: Match, Mismatch and Indel

HMM: Emissions

Match: A, C, G, T

Mismatch: A/C, A/G, A/T, C/G, C/T, G/T

InDel: A/-, C/-, G/-, T/-

Emissions: A, C, G and T

Alignment/Comparison

Types of alignment

• Local

• Global

• Gapped

• Ungapped

Mutations are recorded in DNA

• Allow for comparison/alignment

Scoring matrices

    A  C  G  T
A   1  0  0  0
C   0  1  0  0
G   0  0  1  0
T   0  0  0  1

Gap = -1
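As a sketch of how this matrix and gap penalty turn an alignment into a number, the Python function below sums one score per aligned column. The name `score_alignment` and the use of "-" as the gap symbol are illustrative choices, not taken from the slides.

```python
BASES = "ACGT"

# Identity scoring matrix from the slide: 1 on the diagonal, 0 elsewhere.
SCORE = {(x, y): int(x == y) for x in BASES for y in BASES}
GAP = -1

def score_alignment(top, bottom):
    """Sum the column scores of two aligned strings ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        total += GAP if "-" in (x, y) else SCORE[(x, y)]
    return total

print(score_alignment("GAGCAAA", "GAGCAAA"))
print(score_alignment("GA-CA", "GAGCT"))
```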

Local alignment

Human: TATGGTGGCGAGCAAACGTTGCGTGCGTA

Mouse: GAGCAAA

Local alignment

Human: TATGGTGGCGAGCAAACGTTGCGTGCGTA
        |
Mouse: GAGCAAA

Score: 0+1+0+0+0+0+0 = 1

Local alignment

H: TATGGTGGCGAGCAAACGTTGCGTGCGTA
          |
Mouse:  GAGCAAA

Score: 0+0+1+0+0+0+0 = 1

Local alignment

H: TATGGTGGCGAGCAAACGTTGCGTGCGTA
           |
Mouse:   GAGCAAA

Score: 0+0+1+0+0+0+0 = 1

Local alignment

H: TATGGTGGCGAGCAAACGTTGCGTGCGTA
                |||||||
Mouse:          GAGCAAA

Score: 1+1+1+1+1+1+1 = 7
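The sliding comparison in these slides can be sketched as an ungapped search: place the short sequence at every offset along the long one, count matching positions, and keep the best. The function name `best_ungapped_local` is hypothetical, and real local aligners (for example Smith-Waterman) also allow gaps, which this sketch does not.

```python
def best_ungapped_local(long_seq, short_seq):
    """Slide short_seq along long_seq; return (best_offset, best_score)."""
    best_offset, best_score = 0, -1
    for offset in range(len(long_seq) - len(short_seq) + 1):
        window = long_seq[offset:offset + len(short_seq)]
        # Identity scoring: 1 per matching position, 0 per mismatch.
        score = sum(1 for x, y in zip(window, short_seq) if x == y)
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset, best_score

human = "TATGGTGGCGAGCAAACGTTGCGTGCGTA"
mouse = "GAGCAAA"
offset, score = best_ungapped_local(human, mouse)
print(offset, score)
```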

Global alignment

Human: TATGGTGGCGAGCAAACGTTGCGTGCGTA

Mouse: CATTGTGGTGAGCAAAGCGGTGGGCGGGTA

Global alignment

Human: TATGGTGGCGAGCAAACGTTGCGCGTGTA
        || |||| |||||||      |
Mouse: CATTGTGGTGAGCAAAGCGGTGGGCGTGTA

14 matches

16 mismatches

Score: 14(1)+16(0) = 14

Global alignment

Human: TATGGTGGCGAGCAAA-CGTTGCGCGTGTA
        || |||| ||||||| || || |||||||
Mouse: CATTGTGGTGAGCAAAGCGGTGGGCGTGTA

24 Matches

5 Mismatches

1 Indel

Score: 24(1)+5(0)+1(-1) = 23
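Gapped global alignments like the one above are usually found by dynamic programming. The sketch below uses the Needleman-Wunsch algorithm, which the slides do not name, with the same scoring (match 1, mismatch 0, gap -1). It returns one optimal alignment, which may differ from the one drawn on the slide.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover the alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

human = "TATGGTGGCGAGCAAACGTTGCGCGTGTA"
mouse = "CATTGTGGTGAGCAAAGCGGTGGGCGTGTA"
best, aligned_human, aligned_mouse = needleman_wunsch(human, mouse)
print(best)
print(aligned_human)
print(aligned_mouse)
```

Because the table holds the best score for every pair of prefixes, the result is at least as good as any hand-made alignment under the same scoring.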

The scoring problem

Alignment

Scoring matrix

What if we align the DNA sequence to a model, instead of another sequence?

Our Solution

Why is this a solution?

Start with an initial model in which all probabilities are equally likely. Then modify the model recursively using one or more parent sequences. Each modification replaces the uniform starting probabilities with updated values.

Initial model:

1/3   1/3   1/3
1/3   1/3   1/3
1/3   1/3   1/3

Modification (recursive)

Modified model:

0.92  0.03  0.05
0.18  0.69  0.13
0.14  0.19  0.67
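One simple way to move from a uniform matrix toward data-driven probabilities is to count the state transitions seen in aligned parent sequences and renormalize each row. The Python sketch below shows that idea only. The slides do not say how the modification is computed, so the counting scheme, the pseudo-counts, and all function names here are assumptions.

```python
STATES = ["Match", "Mismatch", "InDel"]

def column_state(x, y):
    """Classify one aligned column as Match, Mismatch, or InDel."""
    if "-" in (x, y):
        return "InDel"
    return "Match" if x == y else "Mismatch"

def update_transitions(counts, aligned_a, aligned_b):
    """Add the state transitions seen in one alignment to the running counts."""
    path = [column_state(x, y) for x, y in zip(aligned_a, aligned_b)]
    for prev, nxt in zip(path, path[1:]):
        counts[prev][nxt] += 1
    return counts

def normalize(counts):
    """Turn each row of counts into a row of probabilities."""
    return {s: {t: counts[s][t] / sum(counts[s].values()) for t in STATES}
            for s in STATES}

# One pseudo-count per cell reproduces the uniform 1/3 starting model.
counts = {s: {t: 1 for t in STATES} for s in STATES}
initial = normalize(counts)
counts = update_transitions(counts,
                            "TATGGTGGCGAGCAAA-CGTTGCGCGTGTA",
                            "CATTGTGGTGAGCAAAGCGGTGGGCGTGTA")
updated = normalize(counts)
for s in STATES:
    print(s, [round(updated[s][t], 2) for t in STATES])
```

Calling `update_transitions` again with further alignments keeps refining the same counts, which is one way to read "recursive" modification.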

How does it score?

1. Modification number

2. Length of original sequence

3. Transition matrix

4. Each emission matrix

The Model: ACTGTGTAG

1. Match/Match

2. Match/Mismatch

3. Match/Indel

4. Mismatch/Match

…

Without knowing the initial state, the algorithm checks all possible state transitions and emissions for a best fit to the model.

How does it score?

ACTGTGTAG

1. Modification number

2. Length of original sequence

3. Transition matrix

4. Each emission matrix

The Model:

Now the previous state is defined, so we have only 3 possible transitions to consider.

1. Match/Match

2. Match/Mismatch

3. Match/Indel

How does it score?

1. Modification number

2. Length of original sequence

3. Transition matrix

4. Each emission matrix

The Model: ACTGTGTAG

This process will continue through the sequence, calculating the score and remembering the best fit to the model.

1. Mismatch/Match

2. Mismatch/Mismatch

3. Mismatch/Indel
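Scoring against a model can be illustrated by summing log-probabilities along an alignment, one transition and one emission per column. The Python sketch below is a simplification under stated assumptions: the modified matrix from the earlier slide is read with rows and columns in the order Match, Mismatch, InDel; emissions are uniform within each state; and every starting state is equally likely. Each column here determines its own state, so there is no search over transitions as in the slides' algorithm.

```python
import math

STATES = ["Match", "Mismatch", "InDel"]

# Modified matrix from the slides; the Match, Mismatch, InDel ordering is an assumption.
TRANS = {
    "Match":    dict(zip(STATES, [0.92, 0.03, 0.05])),
    "Mismatch": dict(zip(STATES, [0.18, 0.69, 0.13])),
    "InDel":    dict(zip(STATES, [0.14, 0.19, 0.67])),
}
# Assumption: uniform emissions (4 match symbols, 6 mismatch pairs, 4 gap symbols).
EMIT = {"Match": 1 / 4, "Mismatch": 1 / 6, "InDel": 1 / 4}
START = 1 / 3  # assumption: initial state unknown, all three equally likely

def column_state(x, y):
    """Classify one aligned column as Match, Mismatch, or InDel."""
    if "-" in (x, y):
        return "InDel"
    return "Match" if x == y else "Mismatch"

def log_score(aligned_a, aligned_b):
    """Log-probability of an alignment's state path and emissions under the model."""
    path = [column_state(x, y) for x, y in zip(aligned_a, aligned_b)]
    total = math.log(START) + math.log(EMIT[path[0]])
    for prev, nxt in zip(path, path[1:]):
        total += math.log(TRANS[prev][nxt]) + math.log(EMIT[nxt])
    return total

print(log_score("ACTGTGTAG", "ACTGTGTAG"))
print(log_score("ACTGTGTAG", "ACTG-GTAC"))
```

Higher (less negative) values mean a better fit to the model, so scores of different alignments of the same length can be compared directly.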

Future Plans

Create working Hidden Markov Model.

Find convergence as the Model is modified.

Apply similar model to codon analysis.

Develop DNA trajectories as an alternative approach to sequence comparison.

Modeling DNA with a Tetrahedron

[Figure: tetrahedron with vertices labeled G, C, T, and A]

Directional Vectors

[Animation frames: the trajectory for the sequence AGTTCG drawn one base at a time, using direction vectors labeled G, C, T, and A]

Change Points

Approximate Vectors Between Change Points

Quantify Regions Between Change Points

• Trajectory Length

– Tells the base count

• Vector Direction

– Tells the relative frequencies of each base

• Vector Length vs. Trajectory Length

– Tells how much the trajectory deviates from a straight line
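The three quantities above can be sketched in Python by giving each base a unit step toward one vertex of a regular tetrahedron. The specific coordinates and the base-to-vertex assignment below are assumptions for illustration, and `trajectory` and `summarize` are hypothetical helper names.

```python
import math

# Assumption: one unit vector per base, pointing at the vertices of a regular tetrahedron.
S = 1 / math.sqrt(3)
DIRECTION = {
    "A": ( S,  S,  S),
    "C": ( S, -S, -S),
    "G": (-S,  S, -S),
    "T": (-S, -S,  S),
}

def trajectory(seq):
    """Return the list of 3-D points visited, starting at the origin."""
    point = (0.0, 0.0, 0.0)
    points = [point]
    for base in seq:
        point = tuple(p + d for p, d in zip(point, DIRECTION[base]))
        points.append(point)
    return points

def summarize(seq):
    """Trajectory length, net vector, and straightness ratio for one region."""
    end = trajectory(seq)[-1]
    vector_length = math.sqrt(sum(c * c for c in end))
    return {
        "trajectory_length": len(seq),  # unit steps, so this equals the base count
        "vector": end,                  # its direction reflects base frequencies
        "straightness": vector_length / len(seq) if seq else 0.0,
    }

print(summarize("AGTTCG"))
print(summarize("AAAA"))
```

A straightness of 1 means the region is a run of a single base, while a value near 0 means the four bases occur about equally often and the steps cancel.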

DNA trajectories can be used to

• Match patterns by grouping similar vectors

• Find conserved regions (vectors that do not change from sequence to sequence)

• Perform many local alignments to assemble global alignments

Thanks!

• Kellar Autumn

• Jeff Ely

• Amanda Gassett

• Deborah Lycan

• Harvey Schmidt

• Collin Trail

• Greg Hermann

• Matt Wilkinson

Work supported by

John S. Rogers Science Research Program