
On the Levenshtein Automaton and the Size of the Neighbourhood of a Word

Yoann Dufresne, Hélène Touzet

Levenshtein distance

• alphabet Σ

• two words over Σ

• three edit operations

A B L A         - L          A B L
|   | |           |          |   |
A L L A         B L          A - L
substitution    insertion    deletion

• Levenshtein distance: smallest number of operations needed to transform one word into another

B A L L A D -
  | |   | |
S A L - A D S

distance = 3
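The definition above can be checked with the classical dynamic-programming recurrence. This is a minimal sketch, not taken from the slides; the function name `levenshtein` is only illustrative.

```python
def levenshtein(p, v):
    """Smallest number of substitutions, insertions and deletions turning p into v."""
    # prev[j] = distance between the prefix of p read so far and v[:j]
    prev = list(range(len(v) + 1))
    for i, a in enumerate(p, start=1):
        cur = [i]  # turning p[:i] into the empty word takes i deletions
        for j, b in enumerate(v, start=1):
            cur.append(min(
                prev[j] + 1,             # delete a
                cur[j - 1] + 1,          # insert b
                prev[j - 1] + (a != b),  # identity (cost 0) or substitution (cost 1)
            ))
        prev = cur
    return prev[-1]


print(levenshtein("BALLAD", "SALADS"))
```

On the aligned pair BALLAD / SALADS it should return the distance 3 shown in the alignment.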

Size of the neighbourhood of a word

• a fixed number k

• a word P over Σ

• what is the number of words over Σ whose distance to P is bounded by k?

Lev(P, k) = {V ∈ Σ*; Lev(P, V) ≤ k}

Lev(AAA, 1) → 17 elements

AAA, AA, AAB, AAL, ABA, ALA, BAA, LAA, AAAA, BAAA, LAAA, ABAA, ALAA, AABA, AALA, AAAB, AAAL

Lev(LAB, 1) → 19 elements

LAB, LA, AB, LB, AAB, BAB, LBB, LLB, LAA, LAL, ALAB, BLAB, LLAB, LAAB, LBAB, LALB, LABA, LABB, LABL
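The two counts can be reproduced by brute force: apply every single edit k times and collect the distinct words. This is only a sketch for small inputs. It assumes the alphabet Σ = {A, B, L}, which the slides do not state but which matches the listed words; the name `neighbourhood` is illustrative.

```python
def neighbourhood(pattern, k, alphabet):
    """Set of words reachable from pattern with at most k edit operations."""
    words = {pattern}
    frontier = {pattern}
    for _ in range(k):
        reached = set()
        for w in frontier:
            for i in range(len(w) + 1):
                for c in alphabet:
                    reached.add(w[:i] + c + w[i:])          # insertion before position i
                if i < len(w):
                    reached.add(w[:i] + w[i + 1:])          # deletion of w[i]
                    for c in alphabet:
                        reached.add(w[:i] + c + w[i + 1:])  # substitution of w[i]
        frontier = reached - words  # only new words need expanding further
        words |= reached
    return words


print(len(neighbourhood("AAA", 1, "ABL")), len(neighbourhood("LAB", 1, "ABL")))
```

The cost grows quickly with k and |P|, which is what motivates the automaton-based approach in the rest of the talk.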

Idea #1 : Enumerating all possible alignments

A A A    A A A    A A A
|   |    | |        | |
A - A    A A -    - A A

L B L B    L B L B
|     |    |     |
L - A B    L A - B

A A A A L A A A    A A A A L A A A -
| | |     | | |      | | | | | | |
A A A L A A A A    - A A A L A A A A

... not such a good idea

Idea #2

• Characterize the k-neighbourhood of P

Levenshtein Automaton

Universal Levenshtein Automaton [Mihov 2004]

Here: new route to build the Universal Levenshtein Automaton

• Count the number of words within this language

Levenshtein automaton

[Figure: nondeterministic Levenshtein automaton for P = SALAD and k = 2. States are written i#e, for i = 0..5 and e = 0..2; horizontal edges are labelled S, A, L, A, D, the other edges Σ or Σ,ε.]

→ : identity   ↑ : insertion   ↗ : substitution (Σ), deletion (ε)

P = SALAD and k = 2
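A direct simulation of this automaton may help when reading the figure. The sketch below assumes that a state i#e means "i letters of P consumed, e errors made" and that a state is accepting once all of P is consumed; both are the usual reading of such a diagram rather than something the slide spells out. The name `lev_nfa_accepts` is illustrative.

```python
def lev_nfa_accepts(pattern, word, k):
    """Simulate the Levenshtein NFA of pattern with at most k errors on word."""
    n = len(pattern)

    def closure(states):
        # Follow epsilon edges: a deletion skips one letter of the pattern.
        seen = set(states)
        stack = list(states)
        while stack:
            i, e = stack.pop()
            if i < n and e < k and (i + 1, e + 1) not in seen:
                seen.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return seen

    current = closure({(0, 0)})
    for c in word:
        nxt = set()
        for i, e in current:
            if i < n and pattern[i] == c:
                nxt.add((i + 1, e))          # identity
            if e < k:
                nxt.add((i, e + 1))          # insertion
                if i < n:
                    nxt.add((i + 1, e + 1))  # substitution
        current = closure(nxt)
    return any(i == n for i, _ in current)


print(lev_nfa_accepts("SALAD", "BALLAD", 2))
```

This automaton depends on P: a new one has to be built for each pattern, which is the limitation the universal automaton removes.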

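Bit vector representation

• P = BALLAD, V = SALADS, k = 2

• pattern P → $^k P $^(2k), that is $ $ B A L L A D $ $ $ $

• word V → V $^(|P| − |V| + k), that is S A L A D S $ $

• sequence of |P| + k bit vectors of length 2k + 1: the i-th vector compares the i-th letter of the padded word with the window of 2k + 1 letters of the padded pattern starting at position i

letter   window      bit vector
S        $ $ B A L   0 0 0 0 0
A        $ B A L L   0 0 1 0 0
L        B A L L A   0 0 1 1 0
A        A L L A D   1 0 0 1 0
D        L L A D $   0 0 0 1 0
S        L A D $ $   0 0 0 0 0
$        A D $ $ $   0 0 1 1 1
$        D $ $ $ $   0 1 1 1 1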
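The encoding can be sketched in a few lines. The function name `bit_vectors` and the `pad` parameter are illustrative, and the symbol $ is assumed not to occur in P or V.

```python
def bit_vectors(pattern, word, k, pad="$"):
    """Encode word against pattern as |P| + k bit vectors of length 2k + 1."""
    fill = len(pattern) - len(word) + k
    if fill < 0:
        raise ValueError("word longer than |P| + k: it cannot be within distance k")
    padded_pattern = pad * k + pattern + pad * (2 * k)
    padded_word = word + pad * fill
    vectors = []
    for i, letter in enumerate(padded_word):
        window = padded_pattern[i:i + 2 * k + 1]
        # Bit j is 1 exactly when the letter equals the j-th letter of the window.
        vectors.append("".join("1" if c == letter else "0" for c in window))
    return vectors


for vector in bit_vectors("BALLAD", "SALADS", 2):
    print(vector)
```

With P = BALLAD, k = 2 and the word SALADS, this yields the eight vectors listed above (rows S, A, L, A, D, S, $, $).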


Nondeterministic Universal Levenshtein Automaton

[Figure: grid with the letters of BALLAD on one axis and those of SALADS on the other, cells marked 0 or 1, illustrating the four kinds of moves: identity, substitution, insertion, and deletion (1 del + id).]

• state (x, y): "I am in lane y and have made x errors so far"

• moves out of state (x, y):

identity       → (x, y)
substitution   → (x + 1, y)
insertion      → (x + 1, y − 1)
1 del + id     → (x + 1, y + 1)
2 del + id     → (x + 2, y + 2)
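The moves of a state (x, y) can be simulated directly on a sequence of bit vectors. The sketch below rests on several assumptions of mine: lane y reads bit k + y of the current vector, a run ends successfully as soon as any state survives the last vector, and, for simplicity, a "d del + id" move only checks that the target bit is 1 (the figure's labels are more restrictive, but this should only add redundant paths). The names `nula_step` and `nula_accepts` are illustrative.

```python
def nula_step(states, bits, k):
    """All states (errors x, lane y) reachable from `states` on one bit vector."""
    nxt = set()
    for x, y in states:
        c = k + y                        # bit seen by lane y
        if bits[c]:
            nxt.add((x, y))              # identity
        if x < k:
            nxt.add((x + 1, y))          # substitution
            nxt.add((x + 1, y - 1))      # insertion
        for d in range(1, k - x + 1):    # d deletions followed by an identity
            if bits[c + d]:
                nxt.add((x + d, y + d))
    return nxt


def nula_accepts(pattern, word, k, pad="$"):
    """True when the encoding of word against pattern survives every step."""
    fill = len(pattern) - len(word) + k
    if fill < 0:
        return False                     # word too long to be within distance k
    padded_pattern = pad * k + pattern + pad * (2 * k)
    states = {(0, 0)}
    for i, letter in enumerate(word + pad * fill):
        window = padded_pattern[i:i + 2 * k + 1]
        bits = tuple(int(c == letter) for c in window)
        states = nula_step(states, bits, k)
        if not states:
            return False
    return True


print(nula_accepts("BALLAD", "SALAD", 2), nula_accepts("BALLAD", "SALADS", 2))
```

Note that `nula_step` never looks at P itself, only at bit vectors: this is what makes the automaton universal for a given k.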

[Figure: NULA(2) (k = 2), with nine states (0,0), (1,−1), (1,0), (1,1), (2,−2), (2,−1), (2,0), (2,1), (2,2); transitions are labelled by 5-bit patterns such as ◦◦1◦◦, ◦◦0◦◦, ◦◦01◦ and ◦◦001.]

NULA(2) (k = 2)

Deterministic Universal Levenshtein Automaton

Add {(0, 0)} to DULA(k) as an unmarked state
While DULA(k) contains an unmarked state do
    Let T be that unmarked state
    Mark T
    For each bit vector u ∈ {0, 1}^(2k+1) do
        S = {q′ ∈ Q_k ; ∃q ∈ T, q′ ∈ ∆(q, u)}
        S′ = Reduced(S)
        Define δ(T, u) = S′
        If S′ is not in DULA(k) already then
            Add S′ to DULA(k) as an unmarked state
        EndIf
    EndFor
EndWhile

Determinization of NULA(k) to obtain DULA(k)

S′ = Reduced(S)

                            (x, y)
             (x+1, y−1)    (x+1, y)    (x+1, y+1)
(x+2, y−2)   (x+2, y−1)    (x+2, y)    (x+2, y+1)   (x+2, y+2)

• subsumption triangle: inclusion of right languages

• Reduced(S): largest subset of S in which no element is subsumed by another
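The determinization loop and Reduced(S) can be sketched as follows. Two points are assumptions of mine: the subsumption test (x, y) subsumes (x′, y′) when x′ > x and |y′ − y| ≤ x′ − x, which is how I read the triangle, and the choice to leave the empty set out of the state count. The transition function is a simplified one written for this example, and the names `nula_step`, `reduced` and `build_dula` are illustrative.

```python
from itertools import product


def nula_step(states, bits, k):
    """All states (errors x, lane y) reachable from `states` on one bit vector."""
    nxt = set()
    for x, y in states:
        c = k + y                        # bit seen by lane y
        if bits[c]:
            nxt.add((x, y))              # identity
        if x < k:
            nxt.add((x + 1, y))          # substitution
            nxt.add((x + 1, y - 1))      # insertion
        for d in range(1, k - x + 1):    # d deletions followed by an identity
            if bits[c + d]:
                nxt.add((x + d, y + d))
    return nxt


def reduced(states):
    """Drop every state lying in the subsumption triangle of another state."""
    return frozenset(
        (x, y) for x, y in states
        if not any(x2 < x and abs(y - y2) <= x - x2 for x2, y2 in states)
    )


def build_dula(k):
    """Subset construction with reduction; returns (states, transition table)."""
    start = frozenset({(0, 0)})
    seen, todo, delta = {start}, [start], {}
    while todo:
        t = todo.pop()                   # take an unmarked state and mark it
        for u in product((0, 1), repeat=2 * k + 1):
            s = reduced(nula_step(t, u, k))
            if not s:
                continue                 # dead end: the empty set is not kept
            delta[(t, u)] = s
            if s not in seen:
                seen.add(s)
                todo.append(s)
    return seen, delta


states, delta = build_dula(1)
print(len(states))
```

For k = 1 this gives 8 states, the number the talk reports for DULA(1); larger values of k were not checked against the reported figures.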

Deterministic Universal Levenshtein Automaton

[Figure: DULA(1) (k = 1), with eight states numbered 0 to 7; transitions are labelled by 3-bit patterns such as ◦11, 100, 1◦1 and ◦◦1.]

DULA(1) (k = 1)

Ula hoop

developed in C++11 with the OpenFst library

output formats: dot (visualisation), FSM, FST

freely available at https://github.com/yoann-dufresne/ula

k   number of states   time (sec)   RAM (kB)   fsmFile
1                  8        0.003       3364   134
2                 50        0.006       3456   2.1kB
3                322        0.044       4200   39kB
4              2 187        0.738      16064   730kB
5             15 510       16.231     218548   14MB
6            113 633      399        3753340   257MB

• Characterize the k-neighbourhood of P

Levenshtein Automaton

Universal Levenshtein Automaton

Here: new route to build the Universal Levenshtein Automaton

• Count the number of words within this language

Counting the number of words

[Figure: DULA(1) shown next to the automaton of bit-vector encodings for P = AAA, with labels 0 to 4, $3 and $4.]

Counting the number of words over bit vectors

• DULA(k): set of sequences of bit vectors with distance bounded by k

• Encod(P, k): set of sequences of bit vectors that are a valid encoding for some string V with respect to P

• DULA(k) ∩ Encod(P, k): product automaton

• Number of accepted sequences over bit vectors: number of paths in the product automaton starting from the initial state

• Dynamic programming

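Counting the number of words over Σ

From the sequences of bit vectors to words over Σ

• DULA(k) ∩ Encod(P, k): product automaton

• add weights to the transitions of the automaton

bit vector   000           001   010   011   100   101   110   111
weight       |Σ| − |w|_Σ   1     1     1     1     1     1     1

(|w|_Σ presumably denotes the number of distinct letters of Σ occurring in the current window w.)

• size of Lev(P, k): sum, over all paths in the product automaton, of the product of the transition weights along the path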
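The weighted path count can be sketched as a dynamic programme over the |P| + k window positions. This is not the authors' implementation: instead of building DULA(k) and Encod(P, k) explicitly, it determinizes on the fly by keeping sets of NULA states, and it tracks whether the $ padding of V has started. The zero vector gets weight "number of letters of Σ absent from the window", every other vector weight 1. The names `nula_step` and `neighbourhood_size` are illustrative.

```python
from collections import defaultdict


def nula_step(states, bits, k):
    """All states (errors x, lane y) reachable from `states` on one bit vector."""
    nxt = set()
    for x, y in states:
        c = k + y                        # bit seen by lane y
        if bits[c]:
            nxt.add((x, y))              # identity
        if x < k:
            nxt.add((x + 1, y))          # substitution
            nxt.add((x + 1, y - 1))      # insertion
        for d in range(1, k - x + 1):    # d deletions followed by an identity
            if bits[c + d]:
                nxt.add((x + d, y + d))
    return nxt


def neighbourhood_size(pattern, k, alphabet, pad="$"):
    """Number of words over alphabet within distance k of pattern."""
    sigma = set(alphabet)
    padded = pad * k + pattern + pad * (2 * k)
    # (set of NULA states, has the $ padding of V started?) -> weighted path count
    table = {(frozenset({(0, 0)}), False): 1}
    for i in range(len(pattern) + k):
        window = padded[i:i + 2 * k + 1]
        choices = []                     # (bit vector, weight, is it the pad symbol?)
        for c in sigma & set(window):    # letters present in the window: weight 1
            choices.append((tuple(int(w == c) for w in window), 1, False))
        absent = len(sigma - set(window))
        if absent:                       # all absent letters share the zero vector
            choices.append(((0,) * (2 * k + 1), absent, False))
        choices.append((tuple(int(w == pad) for w in window), 1, True))
        new_table = defaultdict(int)
        for (states, in_pad), count in table.items():
            for bits, weight, is_pad in choices:
                if in_pad and not is_pad:
                    continue             # only $ may follow $
                target = frozenset(nula_step(states, bits, k))
                if target:
                    new_table[(target, is_pad)] += count * weight
        table = new_table
    return sum(table.values())


print(neighbourhood_size("AAA", 1, "ABL"), neighbourhood_size("LAB", 1, "ABL"))
```

With Σ assumed to be {A, B, L}, the two calls should give the 17 and 19 elements listed for Lev(AAA, 1) and Lev(LAB, 1).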

[Figure: weighted product automata DULA(1) ∩ Encod(P, 1) for P = AAA and P = LAB; states are written as pairs such as 0,0 or 6,$4, and transitions carry a bit vector, the corresponding letters and a weight.]

Conclusion

• Effective algorithm to build the Universal Levenshtein Automaton

LATA 2016 + ula hoop

• Application to the neighbourhood problem (algorithm linear in the size of P)

• Perspectives: generation, sampling, alternative patterns, alternative distances...