On the Levenshtein Automaton and the Size of the Neighbourhood of a Word
Yoann Dufresne, Helene Touzet
Levenshtein distance
• alphabet Σ
• two words over Σ
• three edit operations
substitution    insertion    deletion
A B L A         - L          A B L
|   | |           |          |   |
A L L A         B L          A - L
• Levenshtein distance : smallest number of operations needed to transform one word into another
B A L L A D -
  | |   | |
S A L - A D S
distance = 3
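The definition above can be checked with the classical dynamic-programming recurrence. This is a minimal sketch for illustration, not part of the slides; the function name `levenshtein` is an assumption.

```python
def levenshtein(p: str, v: str) -> int:
    """Edit distance with unit-cost substitution, insertion and deletion."""
    # prev[j] is the distance between the current prefix of p and v[:j].
    prev = list(range(len(v) + 1))
    for i, a in enumerate(p, start=1):
        curr = [i]  # distance between p[:i] and the empty word
        for j, b in enumerate(v, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from p
                curr[j - 1] + 1,         # insert b into p
                prev[j - 1] + (a != b),  # identity or substitution
            ))
        prev = curr
    return prev[-1]


print(levenshtein("BALLAD", "SALADS"))  # 3, as in the alignment above
```

The three terms of the minimum correspond to the three edit operations of the slide.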
Size of the neighbourhood of a word
• a fixed number k
• a word P over Σ
• what is the number of words over Σ whose distance to P is bounded by k ?
Lev(P, k) = {V ∈ Σ* ; Lev(P, V) ≤ k}
Lev(AAA,1) → 17 elements
AAA, AA, AAB, AAL, ABA, ALA, BAA, LAA, AAAA, BAAA, LAAA, ABAA, ALAA, AABA, AALA, AAAB, AAAL
Lev(LAB,1) → 19 elements
LAB, LA, AB, LB, AAB, BAB, LBB, LLB, LAA, LAL, ALAB, BLAB, LLAB, LAAB, LBAB, LALB, LABA, LABB, LABL
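The two counts can be reproduced by brute force straight from the definition: enumerate every word of admissible length and keep those within distance k. This is a sketch for small inputs only; the helper names are assumptions, and the alphabet Σ = {A, B, L} is read off the examples.

```python
from itertools import product


def levenshtein(p, v):
    """Unit-cost edit distance (row-by-row dynamic programming)."""
    prev = list(range(len(v) + 1))
    for i, a in enumerate(p, start=1):
        curr = [i]
        for j, b in enumerate(v, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]


def neighbourhood(p, k, sigma):
    """All words over sigma at distance at most k from p (exponential time)."""
    words = []
    # A word within distance k has a length between |p| - k and |p| + k.
    for n in range(max(0, len(p) - k), len(p) + k + 1):
        for letters in product(sigma, repeat=n):
            w = "".join(letters)
            if levenshtein(p, w) <= k:
                words.append(w)
    return words


print(len(neighbourhood("AAA", 1, "ABL")))  # 17
print(len(neighbourhood("LAB", 1, "ABL")))  # 19
```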
Idea #1 : Enumerating all possible alignments
[Figure: several alignments leading to the same word, e.g. AAA aligned with AA in three different ways (A-A, AA-, -AA)]
... not such a good idea : distinct alignments can produce the same word, so it would be counted several times
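The over-counting can be made concrete for k = 1. The sketch below lists one result per choice of operation, position and letter, then compares that list with the set of distinct words; `single_edits` is a hypothetical helper, and Σ = {A, B, L} is assumed as in the earlier examples.

```python
def single_edits(p, sigma):
    """One result per (operation, position, letter) choice, duplicates kept."""
    results = [p]  # the alignment without any error
    for i in range(len(p)):
        results.append(p[:i] + p[i + 1:])  # deletion of p[i]
        # substitution of p[i] by a different letter
        results += [p[:i] + c + p[i + 1:] for c in sigma if c != p[i]]
    for i in range(len(p) + 1):
        results += [p[:i] + c + p[i:] for c in sigma]  # insertion before position i
    return results


edits = single_edits("AAA", "ABL")
print(len(edits), len(set(edits)))  # 22 17
```

For AAA, 22 single-edit alignments collapse to the 17 distinct words of Lev(AAA, 1).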
Idea #2
• Characterize the k-neighbourhood of P
Levenshtein Automaton
Universal Levenshtein Automaton [Mihov 2004]
Here : new route to build the Universal Levenshtein Automaton
• Count the number of words within this language
Levenshtein automaton
[Figure: Levenshtein automaton with states i#e, from 0#0 to 5#2 (0 ≤ i ≤ 5, 0 ≤ e ≤ 2); horizontal transitions labelled with the letters S, A, L, A, D, vertical transitions labelled Σ, diagonal transitions labelled Σ,ε]
→ : identity ↑ : insertion ↗ : substitution (Σ), deletion (ε)
P=SALAD and k = 2
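The automaton can be simulated without building it explicitly. The sketch below assumes that a state i#e means "i letters of P consumed, e errors made", which is how the state names of the figure read; the function name is hypothetical.

```python
def lev_nfa_accepts(p, v, k):
    """Simulate the Levenshtein automaton of p with at most k errors on v."""

    def closure(states):
        # Follow the ε-transitions (deletions): i#e -> (i+1)#(e+1).
        out, stack = set(states), list(states)
        while stack:
            i, e = stack.pop()
            if i < len(p) and e < k and (i + 1, e + 1) not in out:
                out.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return out

    current = closure({(0, 0)})
    for c in v:
        nxt = set()
        for i, e in current:
            if i < len(p) and p[i] == c:
                nxt.add((i + 1, e))          # identity (→)
            if e < k:
                nxt.add((i, e + 1))          # insertion (↑)
                if i < len(p):
                    nxt.add((i + 1, e + 1))  # substitution (↗)
        current = closure(nxt)
    # Accept when all of p has been consumed, whatever the error count.
    return any(i == len(p) for i, _ in current)


print(lev_nfa_accepts("SALAD", "BALLAD", 2))   # True  (distance 2)
print(lev_nfa_accepts("SALAD", "BALLADS", 2))  # False (distance 3)
```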
Bit vector representation
[Figure: grid comparing the padded pattern $ $ B A L L A D $ $ $ $ (horizontal) with the letters S, A, L, A, D, S (vertical)]
S  0 0 0 0 0
A  0 0 1 0 0
L  0 0 1 1 0
A  1 0 0 1 0
D  0 0 0 1 0
S  0 0 0 0 0
$  0 0 1 1 1
$  0 1 1 1 1
• P=BALLAD, V=SALADS, k = 2
• pattern P → $^k P $^(2k)
• word V → V $^(|P|−|V|+k)
• sequence of |P|+k bit vectors of length 2k+1
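The encoding can be sketched as follows: the i-th bit vector marks where the i-th letter of the padded word occurs in the window of length 2k+1 starting at position i of the padded pattern. The function name `encode` and the string representation of vectors are assumptions. The example uses P = BALLAD and the word SALADS spelled by the rows of the table.

```python
def encode(p, v, k, pad="$"):
    """Bit vectors of v with respect to p; assumes len(v) <= len(p) + k."""
    padded_p = pad * k + p + pad * (2 * k)      # $^k P $^(2k)
    padded_v = v + pad * (len(p) - len(v) + k)  # V $^(|P|-|V|+k)
    vectors = []
    for i, c in enumerate(padded_v):
        window = padded_p[i:i + 2 * k + 1]
        vectors.append("".join("1" if a == c else "0" for a in window))
    return vectors


for letter, bits in zip("SALADS$$", encode("BALLAD", "SALADS", 2)):
    print(letter, bits)
```

The bit vectors no longer mention the letters of P or V, which is what allows a single automaton per value of k.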
Nondeterministic Universal Levenshtein Automaton
[Figure, built up over three slides: grid of bits for P = BALLAD (horizontal) against V = SALADS (vertical), on which an alignment is traced move by move. The moves are labelled identity, substitution, insertion and deletion (1 del + id, 2 del + id). From a state (x, y), the transitions shown lead to (x+1, y−1), (x+1, y), (x+1, y+1) and (x+2, y+2).]
state (x, y) : "I am in lane y and have made x errors so far"
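A possible reading of these states as code is sketched below. The mapping of moves to targets is an assumption reconstructed from the figure labels: identity stays in (x, y), substitution goes to (x+1, y), insertion to (x+1, y−1), and d deletions followed by an identity to (x+d, y+d). Lane y is assumed to read bit k+y of the vector, and every state reached after the full encoded word is treated as accepting. The function names are hypothetical.

```python
def encode(p, v, k, pad="$"):
    """Bit vectors (tuples of 0/1) of v with respect to p."""
    padded_p = pad * k + p + pad * (2 * k)
    padded_v = v + pad * (len(p) - len(v) + k)
    return [tuple(int(a == c) for a in padded_p[i:i + 2 * k + 1])
            for i, c in enumerate(padded_v)]


def nula_moves(state, u, k):
    """Successors of state (x, y) on bit vector u (assumed transition rules)."""
    x, y = state
    targets = set()
    if x < k:
        targets.add((x + 1, y))      # substitution
        targets.add((x + 1, y - 1))  # insertion
    for d in range(k - x + 1):       # d deletions, then an identity
        if y + d <= k and u[k + y + d] == 1:
            targets.add((x + d, y + d))
    return targets


def nula_accepts(p, v, k):
    """True when the encoding of v drives NULA(k) to at least one state."""
    if abs(len(p) - len(v)) > k:
        return False  # such a word cannot be within distance k
    states = {(0, 0)}
    for u in encode(p, v, k):
        states = {t for s in states for t in nula_moves(s, u, k)}
    return bool(states)


print(nula_accepts("BALLAD", "SALADS", 2))  # False (distance 3)
print(nula_accepts("BALLAD", "SALADS", 3))  # True
```

Note that `nula_moves` never looks at P or V, only at k and the bit vector, which is the sense in which the automaton is universal.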
Nondeterministic Universal Levenshtein Automaton
[Figure: NULA(2) (k = 2), with states (0,0), (1,−1), (1,0), (1,1), (2,−2), (2,−1), (2,0), (2,1), (2,2) and transitions labelled by bit-vector patterns of length 5 over {0, 1, ◦}]
Deterministic Universal Levenshtein Automaton
Add {(0, 0)} to DULA(k) as an unmarked state
While DULA(k) contains an unmarked state do
    Let T be that unmarked state
    Mark T
    For each bit vector u ∈ {0, 1}^(2k+1) do
        S = {q′ ∈ Q_k ; ∃q ∈ T, q′ ∈ Δ(q, u)}
        S′ = Reduced(S)
        Define δ(T, u) = S′
        If S′ is not in DULA(k) already then
            Add S′ to DULA(k) as an unmarked state
        EndIf
    EndFor
EndWhile
Determinization of NULA(k) to obtain DULA(k)
S′ = Reduced(S)
[Figure: subsumption triangle below state (x, y), covering (x+1, y−1), (x+1, y), (x+1, y+1) and (x+2, y−2), ..., (x+2, y+2)]
• subsumption triangle : inclusion of right languages
• Reduced(S) : largest subset of S in which no element is subsumed by another
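The determinization with reduction can be sketched as a subset construction. Everything below is an illustration under assumptions: the NULA transition rules are the reconstructed ones (substitution to (x+1, y), insertion to (x+1, y−1), d deletions plus identity to (x+d, y+d)), subsumption is taken to be the triangle |y′ − y| ≤ x′ − x with x′ > x, and the empty set is not counted as a state.

```python
from itertools import product


def nula_moves(state, u, k):
    """Successors of (x, y) on bit vector u (assumed transition rules)."""
    x, y = state
    targets = set()
    if x < k:
        targets.update({(x + 1, y), (x + 1, y - 1)})  # substitution, insertion
    for d in range(k - x + 1):                        # d deletions + identity
        if y + d <= k and u[k + y + d] == 1:
            targets.add((x + d, y + d))
    return targets


def subsumes(p, q):
    """True when q lies in the subsumption triangle strictly below p."""
    (x, y), (x2, y2) = p, q
    return x2 > x and abs(y2 - y) <= x2 - x


def reduced(states):
    """Drop every element that is subsumed by another element."""
    return frozenset(q for q in states
                     if not any(subsumes(p, q) for p in states))


def build_dula(k):
    """Subset construction over all bit vectors of length 2k+1."""
    start = frozenset({(0, 0)})
    delta, seen, unmarked = {}, {start}, [start]
    while unmarked:
        t = unmarked.pop()  # mark T
        for u in product((0, 1), repeat=2 * k + 1):
            s = reduced({q2 for q in t for q2 in nula_moves(q, u, k)})
            if not s:
                continue  # dead end: no transition recorded
            delta[t, u] = s
            if s not in seen:
                seen.add(s)
                unmarked.append(s)
    return seen, delta


states, delta = build_dula(1)
print(len(states))  # 8
```

For k = 1 this sketch produces 8 states: the initial state {(0, 0)} and the seven non-empty subsets of {(1, −1), (1, 0), (1, 1)}.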
Deterministic Universal Levenshtein Automaton
[Figure: DULA(1) (k = 1), with 8 states numbered 0 to 7 and transitions labelled by bit-vector patterns of length 3 over {0, 1, ◦}]
Ula hoop
development in C++11 with the OpenFst library
output format : dot (visualisation), FSM, FST
freely available at https://github.com/yoann-dufresne/ula
k   number of states   time (sec)   RAM (kB)   fsmFile
1   8                  0.003        3364       134
2   50                 0.006        3456       2.1kB
3   322                0.044        4200       39kB
4   2 187              0.738        16064      730kB
5   15 510             16.231       218548     14MB
6   113 633            399          3753340    257MB
• Characterize the k-neighbourhood of P
Levenshtein Automaton
Universal Levenshtein Automaton
Here : new route to build the Universal Levenshtein Automaton
• Count the number of words within this language
Counting the number of words
[Figure: DULA(1) shown together with the automaton of encodings for P = AAA, whose positions are numbered 0 to 4 and whose transitions carry bit vectors of length 3]
Counting the number of words over bit vectors
• DULA(k) : set of sequences of bit vectors with distance bounded by k
• Encod(P, k) : set of sequences of bit vectors that are a valid encoding for some string V with respect to P
• DULA(k) ∩ Encod(P, k) : product automaton
• Number of accepted sequences over bit vectors : number of paths in the product automaton starting from the initial state
• Dynamic programming
Counting the number of words over Σ
From the sequences of bit vectors to words over Σ
• DULA(k) ∩ Encod(P, k) : product automaton
• add weights to the transitions of the automaton
bit vector   000            001  010  011  100  101  110  111
weight       |Σ| − |w|_Σ    1    1    1    1    1    1    1
• |Lev(P, k)| : sum of the weights of all paths in the product automaton
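The weighted counting can be sketched as a dynamic programme over the positions of the encoding. Several choices here are assumptions rather than the slides' construction: the DULA is replaced by an on-the-fly subset simulation of the nondeterministic automaton (with reconstructed transition rules), |w|_Σ is read as the number of distinct letters of Σ in the current window, and all names are hypothetical.

```python
from collections import defaultdict


def nula_moves(state, u, k):
    """Successors of (x, y) on bit vector u (assumed transition rules)."""
    x, y = state
    targets = set()
    if x < k:
        targets.update({(x + 1, y), (x + 1, y - 1)})  # substitution, insertion
    for d in range(k - x + 1):                        # d deletions + identity
        if y + d <= k and u[k + y + d] == 1:
            targets.add((x + d, y + d))
    return targets


def neighbourhood_size(p, k, sigma, pad="$"):
    """Number of words over sigma within distance k of p."""
    sigma = set(sigma)
    padded = pad * k + p + pad * (2 * k)
    # (set of automaton states, padding started?) -> number of words so far
    layer = {(frozenset({(0, 0)}), False): 1}
    for i in range(len(p) + k):
        window = padded[i:i + 2 * k + 1]

        def vec(c):
            return tuple(int(a == c) for a in window)

        present = sigma & set(window)
        # (bit vector, is padding symbol, weight = number of letters producing it)
        options = [(vec(c), False, 1) for c in present]
        if len(sigma) > len(present):
            # Letters absent from the window all give the zero vector.
            options.append((vec(None), False, len(sigma) - len(present)))
        options.append((vec(pad), True, 1))
        nxt = defaultdict(int)
        for (states, padding), n in layer.items():
            for u, is_pad, weight in options:
                if padding and not is_pad:
                    continue  # after the first $, only $ may follow
                succ = frozenset(t for s in states for t in nula_moves(s, u, k))
                if succ:
                    nxt[succ, is_pad] += n * weight
        layer = nxt
    return sum(layer.values())


print(neighbourhood_size("AAA", 1, "ABL"))  # 17
print(neighbourhood_size("LAB", 1, "ABL"))  # 19
```

Grouping the absent letters under the zero vector is what keeps the number of transitions per step bounded by 2k+3, independently of |Σ|.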
[Figure: weighted product automata DULA(1) ∩ Encod(P, 1) for P = AAA and P = LAB, with transitions labelled by a bit vector and the symbols (A, B, L or $) that produce it]