Stochastic differential equations - » Department of Mathematics
Nick Heppenstall (biology)
Michal Dvir (Mathematics/CS)
Andrew Dittmore (physics)
Under guidance of Dr. Yung-Pin Chen (Mathematics)
DNA Sequence Comparison and Alignment
Outline
• Pi in base 4
• DNA Overview
• Markov Chains & Models
• Sequence Alignment
• Future Plans
π in base 10: 3.14159…
π in base 4: 3.021003331…
Looking at π in base 4, the chance of seeing 2 is 1:4, 22 is 1:16, 222 is 1:64, and 2222 is 1:256.
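The base-4 expansion and the run odds above can be checked with a short sketch. It assumes double-precision math.pi is accurate enough for the first few digits (it is only good for roughly 25 base-4 digits), and the odds assume the digits behave as uniform and independent. The function names base4_digits and run_odds are illustrative only.

```python
import math

def base4_digits(x, n):
    """Return the integer part and the first n fractional base-4 digits of x >= 0."""
    whole = int(x)
    frac = x - whole
    digits = []
    for _ in range(n):
        frac *= 4          # shift one base-4 place; exact in binary floating point
        d = int(frac)
        digits.append(d)
        frac -= d
    return whole, digits

def run_odds(k, base=4):
    """Odds 1:base**k of one specific k-digit string at a given position,
    ASSUMING uniform, independent digits."""
    return base ** k

whole, digits = base4_digits(math.pi, 10)
print(whole, digits)                        # 3 [0, 2, 1, 0, 0, 3, 3, 3, 1, 2]
print([run_odds(k) for k in range(1, 5)])   # [4, 16, 64, 256]
```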
The Normality of Pi
3, 0, 2, 1, 0, 0, 3, 3, 3, 1, 2, 2, 2, 2, 0, 2, 0, 2, 0, 1, 1, 2, 2, 0, 3, 0, 0, 2, 0, 3, 1, 0, 3, 0, 1, 0, 3, 0, 1, 2, 1, 2, 0, 2, 2, 0, 2, 3, 2, 0, 0, 0, 3, 1, 3, 0, 0, 1, 3, 0, 3, 1, 0, 1, 0, 2, 2, 1, 0, 0, 0, 2, 1, 0, 3, 2, 0, 0, 2, 0, 2, 0, 2, 2, 1, 2, 1, 3, 3, 0, 3, 0, 1, 3, 1, 0, 0, 0, 0, 2,0, 0, 2, 3, 2, 3, 3, 2, 2, 2, 1, 2, 0, 3, 2, 3, 0, 1, 0, 3, 2, 1, 2, 3, 0, 2, 0, 2, 1, 1, 0, 1, 1, 0, 2, 2, 0, 0, 2, 0, 1, 3, 2, 1, 2, 0, 3, 2, 0, 3, 1, 0, 0, 0, 1, 0, 3, 1, 3, 1, 3, 2, 3, 3, 2, 1, 1, 1,0, 1, 2, 1, 2, 3, 0, 3, 3, 0, 3, 1, 0, 3, 2, 2, 1, 0, 0, 3, 0, 1, 2, 3, 0, 3, 0, 0, 0, 2, 2, 3, 0, 0, 2, 2, 1, 2, 3, 1, 3, 3, 0, 2, 1, 1, 3, 3, 0, 1, 1, 0, 0, 3, 1, 3, 1, 0, 3, 3, 3, 2, 0, 1, 0, 3, 1, 1, 1, 2, 3, 1, 1, 2, 3, 1, 1, 1, 0, 1, 3, 0, 0, 2, 1, 0, 1, 1, 3, 2, 1, 0, 2, 0, 1, 1, 2, 3, 1, 1, 1, 3, 1, 2, 1, 2, 0, 2, 1, 1, 3, 2, 1, 3, 3, 2, 3, 0, 1, 2, 3, 3, 1, 0, 1, 0, 3, 0, 1, 0, 0, 2, 3, 2, 2, 1, 2, 2, 1, 2, 0, 3, 1, 3, 3, 2, 3, 1, 1, 2, 2, 3, 0, 0, 2, 3, 3, 3, 3, 3, 1, 1, 3, 0, 2, 3, 1, 2, 3, 3, 1, 0, 0, 0, 1, 2, 2, 3, 1, 3, 3, 2, 3, 1, 3, 2, 3, 2, 0, 3, 2, 0, 1, 2, 2, 3, 3, 3, 2, 3, 1, 1, 2, 2, 2, 0, 2, 1, 2, 1, 3, 3, 2, 2, 1, 1, 2, 2, 3, 2, 2, 1, 3, 3, 0, 2, 1, 0, 0, 1, 0, 1, 1, 3, 3, 0, 1, 0, 2, 3, 0, 1, 3, 3, 3, 2, 1, 2, 1, 0, 2, 1, 0, 2, 2, 0, 1, 2, 1, 2, 1, 1, 0, 1, 3, 2, 3, 0, 3, 2, 1, 0, 1, 1, 2, 3, 0, 3, 3, 1, 3, 0, 0, 2, 0, 0, 0, 0, 1, 3, 3, 0, 2, 3, 2, 0, 2, 2, 0, 1, 1, 2, 0, 3, 2, 3, 3, 3, 0, 0, 1, 1, 2, 1, 2, 0, 3, 1, 2, 2, 1, 0, 2, 0, 0, 3, 1, 2, 0, 1, 3, 0 . . .
Digits of π in base 4:
First 5000 digits of π in base 4.
t,a,g,t,a,a,a,a,t,t,a,a,a,t,t,a,a,t,t,a,t,a,a,a,a,t,t,a,t,a,t,a,t,a,t,a,a,t,t,t,a,c,t,a,a,c,t,t,t,a,g,t,t,a,g,a,t,a,a,a,t,t,a,a,t,a,a,t,a,t,a,t,a,a,g,t,t,t,t,a,g,t,a,c,a,t,t,a,a,t,a,t,t,a,t,a,t,t,t,t,a,a,a,t,a,t,t,t,t,a,t,t,t,a,g,t,g,t,c,t,a,g,a,a,a,a,a,a,a,t,g,t,g,t,a,a,c,c,c,a,t,g,a,c,t,g,t,a,g,g,a,a,a,c,t,c,t,a,ga,g,g,g,t,a,a,g,a,a,a,g,a,t,c,g,a,t,c,g,c,t,t,t,a,t,a,g,a,g,a,c,c,a,t,c,a,g,a,a,a,g,a,g,g,t,t,t,a,a,t,a,t,t,t,t,t,g,t,g,a,g,a,c,c,a,t,t,g,a,a,g,a,g,a,g,a,a,a,g,a,g,a,a,a,g,a,g,a,a,t,a,a,a,a,a,t,a,t,t,t,t,a,g,t,g,a,c,t,c,c,a,tc,a,g,a,a,a,g,a,g,g,t,t,t,a,a,t,a,t,t,t,t,t,g,t,g,a,g,a,c,c,a,t,t,g,a,a,g,a,g,a,g,a,a,a,g,a,g,a,a,a,g,a,g,a,a,t,a,a,a,a,a,t,a,t,t,t,t,a,g,t,g,a,c,t,c,ca,t,c,a,g,a,a,a,g,a,g,g,t,t,t,a,a,t,a,t,t,t,t,t,g,t,g,a,g,a,c,c,a,t,c,g,a,a,g,a,g,a,g,a,a,a,g,a,g,a,a,t,a,a,a,a,a,t,a,t,t,t,t,t,g,t,a,a,a,a,c,t,t,t,t,t,t,a,t,g,a,g,a,c,c,a,t,t,g,a,a,g,a,g,a,g,a,a,a,g,a,g,a,a,t,a,a,a,a,a,t,a,t,t,tt,t,g,t,a,a,a,a,c,t,t,t,t,t,t,a,t,g,a,g,a,c,c,a,t,t,g,a,a,g,a,g,a,g,a,a,a…
Bases of the cowpox genome:
First 5000 bases of the cowpox genome.
Three-Dimensional Trajectories*
[Figure: three-dimensional trajectories for Pi (random) and DNA]
*H.T. Chang, N. Lo, W. Lu, C.J. Kuo, “Visualization and Comparison of DNA Sequences by Use of Three-Dimensional Trajectories.”
DNA
• Deoxyribonucleic Acid
• Double helix
• Chain of nucleotide subunits
• Four bases in DNA (A,T,C,G)
• Hold information for maintaining life
• Passed from parent(s) to offspring
Mutations
• Single base substitutions
• Insertions/Deletions
• Duplications
• Translocations
• Inversions
…ACT CCT GAG GAG…  →  …ACT CCT GTG GAG…
Thr Pro Glu Glu    →  Thr Pro Val Glu
…ACT CCT GAG GAG…  →  …ACT CCT GAG TAG
Thr Pro Glu Glu    →  Thr Pro Glu STOP
…ACT CCT GAG GAG…  →  …ACT CCT GAG GAA…
Thr Pro Glu Glu    →  Thr Pro Glu Glu
• Environmental factors
• Copying errors
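The three single-base substitutions shown under Mutations can be replayed with a small sketch. The codon table is deliberately partial (only the codons on the slide), and the labels "missense", "nonsense" and "silent" are standard genetics terms added here, not wording from the slides.

```python
# Partial codon table: only the codons that appear in the slide examples.
CODON_TABLE = {
    "ACT": "Thr", "CCT": "Pro", "GAG": "Glu",
    "GTG": "Val", "GAA": "Glu", "TAG": "STOP",
}

def translate(dna):
    """Translate a space-separated string of codons into amino-acid names."""
    return [CODON_TABLE[codon] for codon in dna.split()]

def classify(before, after):
    """Label a substitution by its effect on the translated protein."""
    old, new = translate(before), translate(after)
    if old == new:
        return "silent"       # same amino acids despite the base change
    if "STOP" in new and "STOP" not in old:
        return "nonsense"     # a premature stop codon appears
    return "missense"         # one amino acid is replaced by another

original = "ACT CCT GAG GAG"
for mutated in ("ACT CCT GTG GAG", "ACT CCT GAG TAG", "ACT CCT GAG GAA"):
    print(mutated, translate(mutated), classify(original, mutated))
```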
DNA sequence comparison
• Homologous genes
• Conserved sequences
• Identify mutations
• Forensics
• Evolution
QUANTITATIVE!
Markov Chain
Definition: A collection of random variables having the property that, given the present, the future is conditionally independent of the past.
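[Diagram: two states, City and Country, with self-transition probabilities 0.95 and 0.97 and cross-transition probabilities 0.05 and 0.03]
Example: Annual percentage migration between city and country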
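A minimal sketch of the migration chain follows. The extracted diagram does not say which probability belongs to which arrow, so the assignment below (City to Country 0.05, Country to City 0.03) is an assumption, as is the 50/50 starting split.

```python
STATES = ["City", "Country"]

# ASSUMPTION: which state gets 0.05 and which gets 0.03 is not recoverable
# from the slide; this assignment is for illustration only.
P = {
    "City":    {"City": 0.95, "Country": 0.05},
    "Country": {"City": 0.03, "Country": 0.97},
}

def step(dist):
    """Advance the population distribution by one year."""
    return {t: sum(dist[s] * P[s][t] for s in STATES) for t in STATES}

def evolve(dist, years):
    """Apply the yearly transition repeatedly."""
    for _ in range(years):
        dist = step(dist)
    return dist

start = {"City": 0.5, "Country": 0.5}   # hypothetical starting split
after_one = evolve(start, 1)
long_run = evolve(start, 1000)
print(after_one)
print(long_run)
```

Because next year's split depends only on this year's split, the chain settles to a fixed long-run distribution regardless of the starting point.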
Hidden Markov Model
A Hidden Markov Model is a Markov chain in which each state (City/Country) generates an observation, or emission (Pet). The hidden state can be inferred from the observed emissions.
Emission probabilities (one row per state):
Cow  Dog  Cat  None
0.5  0.3  0.1  0.1
0.0  0.1  0.4  0.5
Transition probabilities: 0.95 and 0.97 (stay), 0.05 and 0.03 (move)
States: City, Country
Example: Annual percentage migration between city and country
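One standard way to infer the hidden states from the pets observed is the Viterbi algorithm; the slides do not name a decoding method, so this is a swapped-in illustration. The mapping of the cow-heavy emission row to Country, the assignment of transition values to arrows, and the uniform starting probabilities are all assumptions.

```python
STATES = ["City", "Country"]
START = {"City": 0.5, "Country": 0.5}          # ASSUMPTION: uniform start
TRANS = {                                       # ASSUMPTION: arrow assignment
    "City":    {"City": 0.95, "Country": 0.05},
    "Country": {"City": 0.03, "Country": 0.97},
}
EMIT = {                                        # ASSUMPTION: cow-heavy row is Country
    "Country": {"Cow": 0.5, "Dog": 0.3, "Cat": 0.1, "None": 0.1},
    "City":    {"Cow": 0.0, "Dog": 0.1, "Cat": 0.4, "None": 0.5},
}

def viterbi(observations):
    """Return the most probable state path for a list of pet observations."""
    best = {s: START[s] * EMIT[s][observations[0]] for s in STATES}
    back = []                                   # back-pointers, one dict per step
    for obs in observations[1:]:
        prev, best, pointers = best, {}, {}
        for s in STATES:
            source = max(STATES, key=lambda r: prev[r] * TRANS[r][s])
            best[s] = prev[source] * TRANS[source][s] * EMIT[s][obs]
            pointers[s] = source
        back.append(pointers)
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):             # walk the back-pointers
        state = pointers[state]
        path.append(state)
    return path[::-1]

print(viterbi(["Cow", "Cow", "Cat", "None"]))
print(viterbi(["None", "Cat", "None", "Cat"]))
```

Plain probabilities are fine for short inputs; for long sequences, log-probabilities would avoid underflow.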
HMM: State Transitions
Match
Mismatch
InDel
States: Match, Mismatch and Indel
HMM: Emissions
Match: A, C, G, T
Mismatch: A/C, A/G, A/T, C/G, C/T, G/T
InDel: A/-, C/-, G/-, T/-
Emissions: A, C, G and T
Alignment/Comparison
Types of alignment
• Local
• Global
• Gapped
• Ungapped
Mutations are recorded in DNA
• Allow for comparison/alignment
Scoring matrices
A C G T
A 1 0 0 0
C 0 1 0 0
G 0 0 1 0
T 0 0 0 1
Gap = -1
Local alignment
Human: TATGGTGGCGAGCAAACGTTGCGTGCGTA
Mouse: GAGCAAA
Local alignment
Human: TATGGTGGCGAGCAAACGTTGCGTGCGTA
        |
Mouse: GAGCAAA
Score: 0+1+0+0+0+0+0 = 1
Local alignment
Human: TATGGTGGCGAGCAAACGTTGCGTGCGTA
          |
Mouse:  GAGCAAA
Score: 0+0+1+0+0+0+0 = 1
Local alignment
Human: TATGGTGGCGAGCAAACGTTGCGTGCGTA
           |
Mouse:   GAGCAAA
Score: 0+0+1+0+0+0+0 = 1
Local alignment
Human: TATGGTGGCGAGCAAACGTTGCGTGCGTA
                |||||||
Mouse:          GAGCAAA
Score: 1+1+1+1+1+1+1 = 7
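The sliding comparison in the local alignment slides can be sketched as an ungapped scan: place the short sequence at every offset along the long one and score each position with the identity matrix (match 1, mismatch 0). The function names are illustrative.

```python
def match_score(a, b):
    """Identity scoring matrix from the slides: 1 for a match, 0 otherwise."""
    return 1 if a == b else 0

def slide(long_seq, short_seq):
    """Score the short sequence at every ungapped offset along the long one."""
    scores = []
    for offset in range(len(long_seq) - len(short_seq) + 1):
        window = long_seq[offset:offset + len(short_seq)]
        scores.append(sum(match_score(a, b) for a, b in zip(window, short_seq)))
    return scores

human = "TATGGTGGCGAGCAAACGTTGCGTGCGTA"
mouse = "GAGCAAA"
scores = slide(human, mouse)
best_offset = max(range(len(scores)), key=scores.__getitem__)
print(scores[:3])                         # [1, 1, 1], the first three slides
print(best_offset, scores[best_offset])   # 9 7, the full match
```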
Global alignment
Human: TATGGTGGCGAGCAAACGTTGCGTGCGTA
Mouse: CATTGTGGTGAGCAAAGCGGTGGGCGGGTA
Global alignment
Human: TATGGTGGCGAGCAAACGTTGCGCGTGTA
        || |||| |||||||      |
Mouse: CATTGTGGTGAGCAAAGCGGTGGGCGTGTA
14 matches
16 mismatches
Score: 14(1)+16(0) = 14
Global alignment
Human: TATGGTGGCGAGCAAA-CGTTGCGCGTGTA
        || |||| ||||||| || || |||||||
Mouse: CATTGTGGTGAGCAAAGCGGTGGGCGTGTA
24 Matches
5 Mismatches
1 Indel
Score: 24(1)+5(0)+1(-1) = 23
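The gapped score above can be recomputed column by column, as in the sketch below. It also includes a standard dynamic-programming search (Needleman-Wunsch) for the best global score under the same scoring; the slides do not name a search method, so that part is a swapped-in technique shown for illustration, and the function names are assumptions.

```python
GAP = -1   # gap penalty from the scoring slide

def column_score(a, b):
    """Score one alignment column: gap -1, match 1, mismatch 0."""
    if a == "-" or b == "-":
        return GAP
    return 1 if a == b else 0

def score_alignment(top, bottom):
    """Score two already-aligned (equal-length, gapped) sequences."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    return sum(column_score(a, b) for a, b in zip(top, bottom))

def best_global_score(x, y):
    """Best global alignment score by dynamic programming (Needleman-Wunsch)."""
    prev = [j * GAP for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * GAP]
        for j in range(1, len(y) + 1):
            curr.append(max(
                prev[j - 1] + column_score(x[i - 1], y[j - 1]),  # align x[i-1] with y[j-1]
                prev[j] + GAP,                                   # gap in y
                curr[j - 1] + GAP,                               # gap in x
            ))
        prev = curr
    return prev[-1]

human_gapped = "TATGGTGGCGAGCAAA-CGTTGCGCGTGTA"
mouse = "CATTGTGGTGAGCAAAGCGGTGGGCGTGTA"
print(score_alignment(human_gapped, mouse))                      # 23, as on the slide
print(best_global_score(human_gapped.replace("-", ""), mouse))   # at least 23
```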
The scoring problem
Alignment
Scoring matrix
What if we align the DNA sequence to a model, instead of another sequence?
Our Solution
Why is this a solution?
Start with an initial model in which all probabilities are equally likely. Then modify the model recursively using one or more parent sequences. Each modification replaces the uninformed initial probabilities with values learned from the sequences.
Initial model:
1/3 1/3 1/3
1/3 1/3 1/3
1/3 1/3 1/3
After recursive modification:
0.92 0.03 0.05
0.18 0.69 0.13
0.14 0.19 0.67
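The slides do not state the update rule, so the following is only one simple possibility: treat the current matrix as prior pseudocounts, add the transitions counted along a labeled state path, and renormalize each row. The state order (Match, Mismatch, InDel), the weight parameter, and the example path are all hypothetical; standard HMM training would more likely use Baum-Welch.

```python
STATES = ["Match", "Mismatch", "InDel"]   # ASSUMPTION: row/column order

def uniform_model():
    """Initial model: every transition has probability 1/3."""
    return {s: {t: 1 / 3 for t in STATES} for s in STATES}

def update(model, path, weight=1.0):
    """Blend the current model (as pseudocounts of total mass `weight` per row)
    with transition counts observed along `path`, then renormalize rows."""
    counts = {s: {t: weight * model[s][t] for t in STATES} for s in STATES}
    for a, b in zip(path, path[1:]):
        counts[a][b] += 1
    return {
        s: {t: counts[s][t] / sum(counts[s].values()) for t in STATES}
        for s in STATES
    }

# Hypothetical state path from one parent sequence.
path = ["Match"] * 4 + ["Mismatch", "Match"]
model = update(uniform_model(), path)
for s in STATES:
    print(s, [round(model[s][t], 3) for t in STATES])
```

Calling update again with further parent sequences applies the modification repeatedly, each time starting from the previous result.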
How does it score?
The Model:
1. Modification number
2. Length of original sequence
3. Transition matrix
4. Each emission matrix
ACTGTGTAG
1. Match/Match
2. Match/Mismatch
3. Match/Indel
4. Mismatch/Match
…
Without knowing the initial state, the algorithm checks all possible state transitions and emissions for a best fit to the model.
How does it score?
The Model:
1. Modification number
2. Length of original sequence
3. Transition matrix
4. Each emission matrix
ACTGTGTAG
Now the previous state is defined, so we have only 3 possible transitions to consider.
1. Match/Match
2. Match/Mismatch
3. Match/Indel
How does it score?
The Model:
1. Modification number
2. Length of original sequence
3. Transition matrix
4. Each emission matrix
ACTGTGTAG
This process will continue through the sequence, calculating the score and remembering the best fit to the model.
1. Mismatch/Match
2. Mismatch/Mismatch
3. Mismatch/Indel
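A rough sketch of the scoring walk: with no previous state there are 9 candidate transitions, and once the previous state is known only 3 remain. The score of a whole state path is taken here as the sum of log transition probabilities. Emissions are ignored for brevity, the row/column order of the matrix (Match, Mismatch, InDel) is an assumption, and the query sequence ACTGAGTAG is hypothetical.

```python
import math

STATES = ["Match", "Mismatch", "InDel"]   # ASSUMPTION: matrix row/column order
UNIFORM = {s: {t: 1 / 3 for t in STATES} for s in STATES}
MODIFIED = {                               # values from the modified-matrix slide
    "Match":    {"Match": 0.92, "Mismatch": 0.03, "InDel": 0.05},
    "Mismatch": {"Match": 0.18, "Mismatch": 0.69, "InDel": 0.13},
    "InDel":    {"Match": 0.14, "Mismatch": 0.19, "InDel": 0.67},
}

def candidates(previous=None):
    """Transitions to check: all 9 if the previous state is unknown, else 3."""
    sources = STATES if previous is None else [previous]
    return [(a, b) for a in sources for b in STATES]

def states_from_alignment(top, bottom):
    """Label each column of an aligned pair as Match, Mismatch or InDel."""
    labels = []
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            labels.append("InDel")
        else:
            labels.append("Match" if a == b else "Mismatch")
    return labels

def log_score(path, trans):
    """Sum of log transition probabilities along a state path."""
    return sum(math.log(trans[a][b]) for a, b in zip(path, path[1:]))

path = states_from_alignment("ACTGTGTAG", "ACTGAGTAG")   # hypothetical query
print(len(candidates()), len(candidates("Match")))        # 9 3
print(log_score(path, UNIFORM), log_score(path, MODIFIED))
```

Under these assumptions, a mostly matching path scores higher against the modified matrix than against the uniform one.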
Future Plans
Create working Hidden Markov Model.
Find convergence as the Model is modified.
Apply similar model to codon analysis.
Develop DNA trajectories as an alternative approach to sequence comparison.
Modeling DNA with a Tetrahedron
[Figure: tetrahedron with vertices labeled G, C, T and A]
Directional Vectors
[Figure: directional vectors pointing toward G, C, T and A]
[Animation: the trajectory for the sequence AGTTCG, traced one base at a time]
Change Points
Approximate Vectors Between Change Points
Quantify Regions Between Change Points
• Trajectory Length
  – Tells the base count
• Vector Direction
  – Tells the relative frequencies of each base
• Vector Length vs. Trajectory Length
  – Tells how much the trajectory deviates from a straight line
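The three quantities above can be sketched for a single region. Each base steps one unit toward a tetrahedron vertex; the coordinates and the assignment of bases to vertices below are assumptions for illustration and may differ from the cited trajectory method.

```python
import math

_U = 1 / math.sqrt(3)
# ASSUMPTION: regular-tetrahedron unit vectors; base-to-vertex mapping is hypothetical.
DIRECTIONS = {
    "A": ( _U,  _U,  _U),
    "C": ( _U, -_U, -_U),
    "G": (-_U,  _U, -_U),
    "T": (-_U, -_U,  _U),
}

def trajectory(seq):
    """Return the list of 3-D points visited, starting at the origin."""
    points = [(0.0, 0.0, 0.0)]
    for base in seq:
        x, y, z = points[-1]
        dx, dy, dz = DIRECTIONS[base]
        points.append((x + dx, y + dy, z + dz))
    return points

def summarize(seq):
    """Trajectory length, net vector, vector length, and their ratio."""
    end = trajectory(seq)[-1]
    trajectory_length = len(seq)              # unit steps, so equals the base count
    vector_length = math.sqrt(sum(c * c for c in end))
    straightness = vector_length / trajectory_length if seq else 0.0
    return {
        "trajectory_length": trajectory_length,
        "vector": end,                        # direction reflects base frequencies
        "vector_length": vector_length,
        "straightness": straightness,         # 1.0 means a perfectly straight path
    }

print(summarize("AGTTCG"))
```

A run of one base gives a straightness of 1, while a region with all four bases in equal amounts returns to the origin.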
DNA trajectories can be used to
• Match patterns by grouping similar vectors
• Find conserved regions (vectors that do not change from sequence to sequence)
• Perform many local alignments to assemble global alignments
Thanks!
• Kellar Autumn
• Jeff Ely
• Amanda Gassett
• Deborah Lycan
• Harvey Schmidt
• Collin Trail
• Greg Hermann
• Matt Wilkinson
Work supported by
John S. Rogers Science Research Program