Post on 20-Apr-2020

An exploration of the size of the neighborhood of a word

Yoann Dufresne, Hélène Touzet

Seqbio 2017

Levenshtein distance

• Alphabet (Σ): {A, B, L}

• Words: AAABA, ABA, AAA

• Edit operations: substitution, insertion, deletion

• Levenshtein distance: smallest number of operations needed to transform one word into another

Substitution    Insertion    Deletion
ABL             A-L          ABL
| |             | |          | |
ALL             ALL          A-L

BALLAD-
 || ||
SAL-ADS

distance = 3
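The definition above can be checked with the classic row-by-row dynamic programme. The sketch below is only an illustration: the function name `levenshtein` is ours and is not part of the talk.

```python
def levenshtein(u: str, v: str) -> int:
    """Smallest number of substitutions, insertions and deletions turning u into v."""
    # previous[j] is the distance between the prefix of u read so far and v[:j].
    previous = list(range(len(v) + 1))
    for i, a in enumerate(u, start=1):
        current = [i]  # turning u[:i] into the empty word takes i deletions
        for j, b in enumerate(v, start=1):
            current.append(min(
                previous[j] + 1,             # delete a
                current[j - 1] + 1,          # insert b
                previous[j - 1] + (a != b),  # keep a, or substitute it by b
            ))
        previous = current
    return previous[-1]


print(levenshtein("ABL", "ALL"))        # 1 (a single substitution)
print(levenshtein("BALLAD", "SALADS"))  # 3 (the alignment shown above)
```

Only two rows of the table are kept in memory, which is enough when the distance alone is needed and not the alignment.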

Neighborhood of a word

• a fixed number k

• a word P over Σ

• what is the number of words over Σ whose distance to P is bounded by k?

Lev(AAA,1) → 17 elements
AAA, AA, AAB, AAL, ABA, ALA, BAA, LAA, AAAA, BAAA, LAAA, ABAA, ALAA, AABA, AALA, AAAB, AAAL

Lev(LAB,1) → 19 elements
LAB, LA, AB, LB, AAB, BAB, LBB, LLB, LAA, LAL, ALAB, BLAB, LLAB, LAAB, LBAB, LALB, LABA, LABB, LABL
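The two counts above can be reproduced by brute force: apply every possible edit k times and collect the distinct words. This is a minimal sketch, assuming the alphabet {A, B, L} used in the examples. The helper names `one_edit` and `neighborhood` are ours, and this is not the automaton-based method of the talk.

```python
def one_edit(word: str, alphabet: str) -> set:
    """All words obtained from `word` by exactly one edit operation."""
    results = set()
    for i in range(len(word) + 1):
        for c in alphabet:
            results.add(word[:i] + c + word[i:])          # insertion at position i
    for i in range(len(word)):
        results.add(word[:i] + word[i + 1:])              # deletion of letter i
        for c in alphabet:
            if c != word[i]:
                results.add(word[:i] + c + word[i + 1:])  # substitution of letter i
    return results


def neighborhood(word: str, k: int, alphabet: str) -> set:
    """Words over `alphabet` whose Levenshtein distance to `word` is at most k."""
    seen = {word}
    frontier = {word}
    for _ in range(k):
        # Words first reached after one more edit.
        frontier = {w for u in frontier for w in one_edit(u, alphabet)} - seen
        seen = seen | frontier
    return seen


print(len(neighborhood("AAA", 1, "ABL")))  # 17
print(len(neighborhood("LAB", 1, "ABL")))  # 19
```

Storing the words in a set is what removes duplicates, for example the four insertions of A into AAA that all produce AAAA.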

Levenshtein automaton

Universal automaton

Deterministic universal automaton

Wordborhood

Exhaustive executions

AAA 218    CAA 240    GAA 240    TAA 240
AAC 240    CAC 256    GAC 258    TAC 258
AAG 240    CAG 258    GAG 256    TAG 258
AAT 240    CAT 258    GAT 258    TAT 256
ACA 256    CCA 240    GCA 258    TCA 258
ACC 240    CCC 218    GCC 240    TCC 240
ACG 258    CCG 240    GCG 256    TCG 258
ACT 258    CCT 240    GCT 258    TCT 256
AGA 256    CGA 258    GGA 240    TGA 258
AGC 258    CGC 256    GGC 240    TGC 258
AGG 240    CGG 240    GGG 218    TGG 240
AGT 258    CGT 258    GGT 240    TGT 256
ATA 256    CTA 258    GTA 258    TTA 240
ATC 258    CTC 256    GTC 258    TTC 240
ATG 258    CTG 258    GTG 256    TTG 240
ATT 240    CTT 240    GTT 240    TTT 218

σ=4, n=3, k=2
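An exhaustive run of this kind can be sketched by testing every candidate word against every pattern. The code below is an assumption about how such a table could be recomputed, not the authors' program. It relies on the fact that a neighbor of a word of length n has length between n-k and n+k, and the names `distance`, `neighborhood_size` and `sizes` are ours.

```python
from collections import Counter
from itertools import product


def distance(u, v):
    """Levenshtein distance by row-by-row dynamic programming."""
    row = list(range(len(v) + 1))
    for i, a in enumerate(u, start=1):
        new_row = [i]
        for j, b in enumerate(v, start=1):
            new_row.append(min(row[j] + 1, new_row[j - 1] + 1, row[j - 1] + (a != b)))
        row = new_row
    return row[-1]


def neighborhood_size(pattern, k, alphabet):
    """Count the words within distance k of `pattern`, one candidate length at a time."""
    n = len(pattern)
    count = 0
    for length in range(max(0, n - k), n + k + 1):
        for letters in product(alphabet, repeat=length):
            if distance(pattern, "".join(letters)) <= k:
                count += 1
    return count


alphabet, n, k = "ACGT", 3, 2
sizes = {"".join(p): neighborhood_size("".join(p), k, alphabet)
         for p in product(alphabet, repeat=n)}

print(sizes["AAA"])             # 218, the first entry of the table
print(Counter(sizes.values()))  # how many patterns share each size
```

The size depends only on the shape of the pattern, not on which letters are used: renaming letters or reversing the word leaves the count unchanged, which is why so few distinct values appear in the table.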

k=1

A C C T T A A G C C C C T        n letters, p blocks, σ = |Σ|

dels: p

subs: (σ−1)n

inss (xi: letter of the ith block):

• xi inserted in the ith block: p

• letter ≠ xi inserted inside the ith block: (n−p)(σ−1)

• letter ≠ xi inserted before the 1st or after the last letter: 2(σ−1)

• letter ≠ xi and ≠ xj inserted between the ith and jth blocks: (p−1)(σ−2)

Total: (σ−1)(2n+1)+p+2 = (σ−1)(2n+1)+2 + p
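The case analysis for k=1 can be written as a short function that counts blocks and adds up the cases. This is a sketch under one assumption that the slide does not spell out: the final `1 +` counts the word itself, which is what makes the sum equal to (σ−1)(2n+1)+2+p. The names `count_blocks` and `size_k1` are ours.

```python
from itertools import groupby


def count_blocks(word: str) -> int:
    """Number of maximal runs of identical letters (p on the slide)."""
    return sum(1 for _ in groupby(word))


def size_k1(word: str, sigma: int) -> int:
    """Size of the distance-1 neighborhood, adding up the cases of the slide."""
    n, p = len(word), count_blocks(word)
    deletions = p
    substitutions = (sigma - 1) * n
    insertions = (
        p                        # the block's own letter inserted into a block
        + (n - p) * (sigma - 1)  # another letter inserted inside a block
        + 2 * (sigma - 1)        # another letter before the first or after the last letter
        + (p - 1) * (sigma - 2)  # a letter differing from both neighbors, between two blocks
    )
    return 1 + deletions + substitutions + insertions  # assumption: 1 for the word itself


print(size_k1("AAA", 3))            # 17
print(size_k1("LAB", 3))            # 19
print(size_k1("ACCTTAAGCCCCT", 4))  # 90, with n = 13 and p = 7
```

The four insertion cases add up to (σ−1)n + σ whatever p is, so the number of blocks affects the total only through the deletions.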

k=1 conclusion

• Grows proportionally to σn

• For fixed n and σ: all values of the form (σ−1)(2n+1)+2+p, with 1 ≤ p ≤ n

• Min values for p=1: AAAA, AA, TTTTTT

• Max values for p=n: ACTAGC, ACACACA, ACTGACTGAC

(σ−1)(2n+1)+2+p

Focus on minimum

k \ σ          2          3          4            5            6
1             16         29         42           55           68
2             95        313        659        1 133        1 735
3            351      2 085      6 379       14 403       27 327
4          1 003     10 653     47 325      140 329      329 919
5          2 503     47 161    303 875    1 183 039    3 446 311

Patterns: AAAAAA, BBBBBB, CCCCCC

Minimum: Aⁿ. We have a proof!

Calculate the minimum

A A A A A A A A A A A

dels    inss    subs

• n−k ≤ x

• p ≤ k

• x+p ≤ n+k

Neighborhood of Aⁿ: every neighbor is made of x copies of the letter A and p other letters.
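The characterization above leads to a direct count. The sketch below assumes the three constraints read n−k ≤ x, p ≤ k and x+p ≤ n+k, and sums, for each admissible pair (x, p), the number of ways to place the p other letters among x+p positions and to pick each of them among the σ−1 letters other than A. The function name is ours.

```python
from math import comb


def min_neighborhood_size(n: int, k: int, sigma: int) -> int:
    """Number of words within distance k of A^n over an alphabet of sigma letters."""
    total = 0
    for p in range(k + 1):                             # p <= k other letters
        for x in range(max(0, n - k), n + k - p + 1):  # n-k <= x and x+p <= n+k
            # positions of the p other letters, then the value of each one
            total += comb(x + p, p) * (sigma - 1) ** p
    return total


for k in range(1, 6):
    print(k, [min_neighborhood_size(6, k, sigma) for sigma in range(2, 7)])
```

With n = 6 this loop prints one row per k for σ = 2 to 6, which can be compared with the minimum table. The same function gives 218 for n = 3, k = 2, σ = 4, the value listed for AAA in the exhaustive run.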

And the maximum?

k \ σ          2          3          4            5            6
1             21         34         47           60           73
2            146        463        924        1 543        2 320
3            498      3 443     10 486       23 465       44 234
4          1 257     17 378     83 039      252 425      601 336
5          2 883     72 691    532 653    2 194 036    6 601 814

Patterns (one per column): ABABAB, ABCABC, ABCDAB, ABCDEA, ABCDEF

Maximum: each window of size k with different letters. No formal proof yet :(

k=2: ABCB, ABABA

k=3: ABCA, CABDAC
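Since the maximum is only conjectured, small cases can at least be explored by exhaustive search. The sketch below enumerates every word of a given length, computes its neighborhood size by repeated single edits, and reports the extremes. It is an illustration of that idea with names of our own (`one_edit`, `neighborhood_size`, `extremes`), not the tool used for the table.

```python
from itertools import product


def one_edit(word, alphabet):
    """Words reachable from `word` with a single insertion, deletion or substitution."""
    out = set()
    for i in range(len(word) + 1):
        out.update(word[:i] + c + word[i:] for c in alphabet)  # insertions
    for i in range(len(word)):
        out.add(word[:i] + word[i + 1:])                       # deletion
        out.update(word[:i] + c + word[i + 1:]
                   for c in alphabet if c != word[i])          # substitutions
    return out


def neighborhood_size(word, k, alphabet):
    """Number of words within Levenshtein distance k of `word`."""
    seen = {word}
    frontier = {word}
    for _ in range(k):
        frontier = {w for u in frontier for w in one_edit(u, alphabet)} - seen
        seen = seen | frontier
    return len(seen)


def extremes(n, k, alphabet):
    """Smallest and largest (size, word) pairs over all words of length n."""
    sized = [(neighborhood_size("".join(w), k, alphabet), "".join(w))
             for w in product(alphabet, repeat=n)]
    return min(sized), max(sized)


print(extremes(6, 1, "AB"))
```

For n = 6, k = 1 and two letters, the search returns sizes 16 and 21, the first entries of the minimum and maximum tables, reached by a single-letter word and by an alternating word. Larger k and σ can be passed the same way, at a cost that grows quickly with both.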

[Figure: a long example word in which BC recurs between long runs of A]

Block positions

Positions

Perspectives

• Use the automata to prove the upper bound

• Find a formula to compute the upper bound of the neighborhood size without the DULA (deterministic universal Levenshtein automaton) for k>7

• Understand the influence of each parameter (n, p, σ, …) in the general case

• Create a method to generate formulas for any kind of word

Questions ?