Post on 20-Apr-2020
An exploration of the size of the neighborhood of a word
Yoann Dufresne, Hélène Touzet
Seqbio 2017
Levenshtein distance
• Alphabet (Σ): {A, B}
• Words: AAABA, ABA, AAA
• Edit operations: substitution, insertion, deletion
• Levenshtein distance: smallest number of edit operations needed to transform one word into another

Substitution   Insertion   Deletion
ABL            A-L         ABL
| |            | |         | |
ALL            ALL         A-L

BALLAD-
 || ||
SAL-ADS
distance = 3
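The definition above can be checked with the classic dynamic-programming computation of the edit distance. This is a minimal sketch, not code from the talk; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(u, v):
    """Unit-cost edit distance (substitution, insertion, deletion)."""
    # prev[j] holds the distance between the current prefix of u and v[:j].
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, start=1):
        curr = [i]  # distance from u[:i] to the empty word: i deletions
        for j, b in enumerate(v, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a
                curr[j - 1] + 1,         # insert b
                prev[j - 1] + (a != b),  # substitute a by b (free if equal)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("BALLAD", "SALADS"))  # 3
print(levenshtein("ABL", "ALL"))        # 1
```

The two calls reproduce the slide's examples: one substitution turns ABL into ALL, and BALLAD/SALADS needs a substitution, a deletion and an insertion.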
Neighborhood of a word
• a fixed number k
• a word P over Σ
• what is the number of words over Σ whose distance to P is bounded by k?

Lev(AAA,1) → 17 elements
AAA, AA, AAB, AAL, ABA, ALA, BAA, LAA, AAAA, BAAA, LAAA, ABAA, ALAA, AABA, AALA, AAAB, AAAL

Lev(LAB,1) → 19 elements
LAB, LA, AB, LB, AAB, BAB, LBB, LLB, LAA, LAL, ALAB, BLAB, LLAB, LAAB, LBAB, LALB, LABA, LABB, LABL
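For small cases the neighborhood can be enumerated directly by applying every single edit operation k times. The sketch below is a brute-force illustration, not the automaton-based method of the talk; `one_edit` and `neighborhood` are names chosen here, and the alphabet {A, B, L} is taken from the two examples above.

```python
def one_edit(word, alphabet):
    """All words obtained from `word` by exactly one edit operation."""
    out = set()
    for i in range(len(word) + 1):
        for c in alphabet:
            out.add(word[:i] + c + word[i:])          # insertion at position i
        if i < len(word):
            out.add(word[:i] + word[i + 1:])          # deletion of letter i
            for c in alphabet:
                out.add(word[:i] + c + word[i + 1:])  # substitution of letter i
    return out


def neighborhood(word, k, alphabet):
    """All words within Levenshtein distance k of `word`."""
    seen = {word}
    frontier = {word}
    for _ in range(k):
        frontier = {w for u in frontier for w in one_edit(u, alphabet)} - seen
        seen |= frontier
    return seen


print(len(neighborhood("AAA", 1, "ABL")))  # 17
print(len(neighborhood("LAB", 1, "ABL")))  # 19
```

Using sets removes the words that several different edits produce (for example, inserting A anywhere in AAA always gives AAAA), which is exactly why the counting is not trivial.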
Levenshtein automaton
Universal automaton
Deterministic universal automaton
Wordborhood
Exhaustive executions (|Σ| = 4, n = 3, k = 2)
AAA 218   AAC 240   AAG 240   AAT 240   ACA 256   ACC 240   ACG 258   ACT 258
AGA 256   AGC 258   AGG 240   AGT 258   ATA 256   ATC 258   ATG 258   ATT 240
CAA 240   CAC 256   CAG 258   CAT 258   CCA 240   CCC 218   CCG 240   CCT 240
CGA 258   CGC 256   CGG 240   CGT 258   CTA 258   CTC 256   CTG 258   CTT 240
GAA 240   GAC 258   GAG 256   GAT 258   GCA 258   GCC 240   GCG 256   GCT 258
GGA 240   GGC 240   GGG 218   GGT 240   GTA 258   GTC 258   GTG 256   GTT 240
TAA 240   TAC 258   TAG 258   TAT 256   TCA 258   TCC 240   TCG 258   TCT 256
TGA 258   TGC 258   TGG 240   TGT 256   TTA 240   TTC 240   TTG 240   TTT 218
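An exhaustive run of this kind can be sketched by filtering every candidate word of length n−k to n+k by its distance to P. This is only an illustrative brute force under that assumption (the talk's own computations rely on automata); `distance` and `neighborhood_sizes` are hypothetical helper names.

```python
from itertools import product


def distance(u, v):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, 1):
        curr = [i]
        for j, b in enumerate(v, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]


def neighborhood_sizes(alphabet, n, k):
    """Map every word of length n to the size of its distance-k neighborhood."""
    # A neighbor differs in length from P by at most k.
    candidates = ["".join(t)
                  for m in range(max(0, n - k), n + k + 1)
                  for t in product(alphabet, repeat=m)]
    sizes = {}
    for t in product(alphabet, repeat=n):
        word = "".join(t)
        sizes[word] = sum(distance(word, w) <= k for w in candidates)
    return sizes


sizes = neighborhood_sizes("ACGT", 3, 2)
print(sizes["AAA"])  # 218
print(sorted(set(sizes.values())))
```

The second line prints the distinct sizes that occur, which can be compared with the values listed in the table.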
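k=1
Example word: A C C T T A A G C C C C T
n: length of the word, p: number of blocks, σ = |Σ|, xi: letter of the ith block

dels: p
subs: (σ−1)n
inss:
• xi inserted in the ith block: p
• letter ≠ xi inserted inside the ith block: (n−p)(σ−1)
• letter ≠ xi inserted before the 1st or after the last letter: 2(σ−1)
• letter ≠ xi and ≠ xj inserted between the ith and jth blocks: (p−1)(σ−2)

Total: (σ−1)(2n+1)+2+p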
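The case analysis above translates directly into a few lines of code. The sketch below is illustrative: the names `count_blocks` and `k1_neighborhood_size` are made up here, the word is assumed non-empty, and the extra `1` accounts for the word itself, which belongs to its own neighborhood.

```python
from itertools import groupby


def count_blocks(word):
    """Number p of maximal runs of identical letters, e.g. ACCTT -> 3."""
    return sum(1 for _ in groupby(word))


def k1_neighborhood_size(word, sigma):
    """Size of the distance-1 neighborhood of a non-empty word."""
    n, p = len(word), count_blocks(word)
    dels = p
    subs = (sigma - 1) * n
    inss = (p                          # block letter inserted in its own block
            + (n - p) * (sigma - 1)    # other letter strictly inside a block
            + 2 * (sigma - 1)          # other letter before first / after last
            + (p - 1) * (sigma - 2))   # third letter between two adjacent blocks
    total = 1 + dels + subs + inss     # 1 for the word itself
    assert total == (sigma - 1) * (2 * n + 1) + 2 + p
    return total


print(k1_neighborhood_size("AAA", 3))            # 17
print(k1_neighborhood_size("LAB", 3))            # 19
print(k1_neighborhood_size("ACCTTAAGCCCCT", 4))  # 90
```

The first two values match Lev(AAA,1) and Lev(LAB,1) over a three-letter alphabet; the third is the slide's example word, with n = 13 and p = 7.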
k=1 conclusion
• Proportional to σn
• Size: (σ−1)(2n+1)+2+p. For fixed n and σ, all values between (σ−1)(2n+1)+2+1 (p=1) and (σ−1)(2n+1)+2+n (p=n) are reached.
• Min values for p=1: AAAA, AA, TTTTTT
• Max values for p=n: ACTAGC, ACACACA, ACTGACTGAC
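As a worked check, the k=1 total can be recovered by summing the individual cases (deletions, substitutions, the four kinds of insertions) plus one for the word P itself. The notation |Lev(P,1)| for the neighborhood size is assumed here for readability.

```latex
\begin{align*}
|Lev(P,1)|
  &= \underbrace{1}_{P}
   + \underbrace{p}_{\text{dels}}
   + \underbrace{(\sigma-1)\,n}_{\text{subs}}
   + \underbrace{p + (n-p)(\sigma-1) + 2(\sigma-1) + (p-1)(\sigma-2)}_{\text{inss}} \\
  &= 1 + 2p + (\sigma-1)(2n-p+2) + (p-1)(\sigma-1) - (p-1) \\
  &= (\sigma-1)(2n+1) + 2 + p .
\end{align*}
```

Since 1 ≤ p ≤ n, only the last term depends on the shape of the word, which is why the minimum is reached for a single block and the maximum when no two adjacent letters are equal.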
Focus on minimum
k \ σ        2         3         4           5           6
1           16        29        42          55          68
2           95       313       659       1 133       1 735
3          351     2 085     6 379      14 403      27 327
4        1 003    10 653    47 325     140 329     329 919
5        2 503    47 161   303 875   1 183 039   3 446 311
Patterns: AAAAAA, BBBBBB, CCCCCC, …
Minimum: a^n (a single letter repeated n times)
We have a proof!
Calculate the minimum
A A A A A A A A A A A
dels, inss, subs
Neighborhood of a^n: every neighbor is made of x copies of the letter a and p other letters, with
• n−k ≤ x
• p ≤ k
• x+p ≤ n+k
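One way to turn this characterization into a number is to count, for each admissible pair (x, p), the words with x copies of a and p other letters: there are C(x+p, p) ways to place the other letters and (σ−1)^p ways to choose them. The sketch below assumes the third constraint is x+p ≤ n+k (a neighbor is at most k letters longer than a^n); the function name is made up for illustration.

```python
from math import comb


def min_neighborhood_size(n, k, sigma):
    """Number of words within distance k of a^n over an alphabet of size sigma."""
    total = 0
    for p in range(k + 1):                               # p <= k other letters
        for x in range(max(0, n - k), n + k - p + 1):    # n-k <= x, x+p <= n+k
            total += comb(x + p, p) * (sigma - 1) ** p
    return total


# Values for n = 6, to compare with the "Focus on minimum" table.
for k in range(1, 6):
    print(k, [min_neighborhood_size(6, k, sigma) for sigma in range(2, 7)])
```

With n = 6 this gives 16, 95, 351, 1 003 and 2 503 for σ = 2, in agreement with the first column of the table, and with n = 3, k = 2, σ = 4 it gives the 218 listed for AAA in the exhaustive run.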
And the maximum?
k \ σ        2         3         4           5           6
1           21        34        47          60          73
2          146       463       924       1 543       2 320
3          498     3 443    10 486      23 465      44 234
4        1 257    17 378    83 039     252 425     601 336
5        2 883    72 691   532 653   2 194 036   6 601 814
Patterns (σ = 2 to 6): ABABAB, ABCABC, ABCDAB, ABCDEA, ABCDEF
Maximum: each window of size k with different letters
k=2: ABCB, ABABA
k=3: ABCA, CABDAC
No formal proof yet :(
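The window criterion can be tested mechanically. The sketch below takes the criterion literally, as "every window of k consecutive letters contains k different letters"; that reading is an assumption, and the function name is made up for illustration. It says nothing about whether such words really maximize the neighborhood size, which the slide leaves unproved.

```python
def windows_all_distinct(word, k):
    """True if every window of k consecutive letters has k different letters."""
    return all(len(set(word[i:i + k])) == k
               for i in range(len(word) - k + 1))


for word, k in [("ABCB", 2), ("ABABA", 2), ("ABCA", 3), ("CABDAC", 3)]:
    print(word, k, windows_all_distinct(word, k))  # True for all four examples
```

A word such as ABCB passes for k = 2 but fails for k = 3, because the window BCB repeats B.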
[Figure: long words over {A, B, C} made of runs of A interrupted by BC blocks. Labels: Block positions, Positions]
Perspectives
• Use the automata to prove the upper bound
• Find a formula to compute the upper bound of the neighborhood size without the DULA (deterministic universal Levenshtein automaton) for k>7
• Discover the influence of each parameter (n, p, σ, …) in the general case
• Create a method to generate formulas for any kind of word
Questions?