Average-Optimal Multiple Approximate String Matching

Kimmo Fredriksson, Gonzalo Navarro

ACM Journal of Experimental Algorithmics, Vol 9, Article No. 1.4, 2004, Pages 1-47

Professor R.C.T. Lee

Speaker K.W. Liu

The Problem

• The approximate string matching problem:

Given a text T[1...n] and a pattern P[1...m] over a finite alphabet ∑ of size σ, find the approximate occurrences of P in T, allowing at most k differences (insertions, deletions, substitutions).
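As a point of reference, the number of differences between two strings can be computed with the standard dynamic programming recurrence for edit distance. The sketch below is illustrative only; the function name `edit_distance` is an assumption and does not come from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] = distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]
```

An occurrence of P in T with at most k differences is then a substring S of T with `edit_distance(S, P) <= k`.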

For a window of size m-k, if there exists a substring S1 in this window such that its edit distance with every substring of P is greater than k, we move P.

Our algorithm scans the window from the right:

[Fig. 3: text T with the substring S1 marked inside a window of size m - k, aligned above pattern P.]

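For a window of size m-k, if there exists a suffix S1 of the window such that its edit distance with every substring of P is greater than k, we move P so that the window starts just after the first character of S1.

Our algorithm scans the window from the right:

[Fig. 3: text T with the suffix S1 at the right end of the window of size m - k, aligned above pattern P.]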

But how do we know that ED(S1,S2) > k for every substring S2 of P?

We use a very useful lemma.

Lemma

Consider strings Q and P. Let Q be divided into q1, q2, …, qn as shown below:

qn … q2 q1

For each qi, let pi be the substring of P such that ED(qi,pi) is the smallest among all substrings of P.

If $\sum_{i=1}^{n} ED(q_i, p_i) > k$, then $ED(Q, P) > k$.
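The lemma says that the sum of the per-piece minima is a lower bound on ED(Q, P). A minimal brute-force sketch can make this concrete; the helper names (`substrings`, `min_ed_to_substrings`, `lemma_lower_bound`) and the fixed piece length are assumptions chosen for illustration, and the enumeration of all substrings is only practical for short strings.

```python
from itertools import product


def edit_distance(a, b):
    """Standard edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def substrings(p):
    """All substrings of p, including the empty string."""
    return {p[i:j] for i in range(len(p) + 1) for j in range(i, len(p) + 1)}


def min_ed_to_substrings(piece, p):
    """ED(q_i, p_i): the smallest distance between a piece and any substring of p."""
    return min(edit_distance(piece, s) for s in substrings(p))


def lemma_lower_bound(q, p, piece_len=2):
    """Sum of the per-piece minima after cutting q into pieces of piece_len."""
    pieces = [q[i:i + piece_len] for i in range(0, len(q), piece_len)]
    return sum(min_ed_to_substrings(piece, p) for piece in pieces)


# Exhaustive check on small strings: the bound never exceeds the true distance.
bound_holds = all(
    lemma_lower_bound("".join(q), "".join(p)) <= edit_distance("".join(q), "".join(p))
    for q in product("ac", repeat=4)
    for p in product("ac", repeat=3)
)
```

Whenever `lemma_lower_bound(Q, P) > k`, the lemma allows us to conclude ED(Q, P) > k without aligning Q and P as a whole.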

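Proof: Divide P into n pieces as shown below:

Q: qn … q2 q1

P: p'n … p'2 p'1

Suppose $ED(Q, P) \le k$, and let $p'_i$ be the piece of P aligned with $q_i$. We then have $\sum_{i=1}^{n} ED(q_i, p'_i) \le k$. Therefore, $\sum_{i=1}^{n} ED(q_i, p_i) \le k$, because $ED(q_i, p_i)$ is the smallest. This contradicts the hypothesis of the lemma.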

To determine whether ED(S1,S2) > k, we may use the lemma.

We divide the window into small pieces: t1, t2, …, ta.

For each ti, we find the substring pi in P where ED(pi,ti) is the smallest.

[Fig. 7: window W of text T divided from the right into pieces t1, t2, …, with the closest substrings p1, p2, … of pattern P shown below.]

As soon as $\sum_{i} ED(t_i, p_i) > k$, according to Lemma 1, we know that $ED(S_1, S_2) > k$. Now we can move P.

If, at the end of the window, $\sum_{i} ED(t_i, p_i) \le k$, we cannot make any conclusion. We have to use dynamic programming to find whether $ED(W, P) \le k$.
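The scan of one window can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function name `scan_window`, its return convention, and the brute-force search over all substrings of P are assumptions made to keep the example short.

```python
def edit_distance(a, b):
    """Standard edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def min_ed_to_substrings(piece, p):
    """Smallest edit distance between piece and any substring of p (brute force)."""
    subs = {p[i:j] for i in range(len(p) + 1) for j in range(i, len(p) + 1)}
    return min(edit_distance(piece, s) for s in subs)


def scan_window(window, p, k, piece_len=2):
    """Read pieces t1, t2, ... from the right end of the window.

    Returns the 0-based index in the window where the suffix S1 starts as soon
    as the accumulated sum exceeds k, or None if the whole window was read
    without exceeding k (no conclusion, so verification is needed).
    """
    total = 0
    end = len(window)
    while end - piece_len >= 0:
        start = end - piece_len
        total += min_ed_to_substrings(window[start:end], p)
        if total > k:
            return start  # window[start:] cannot lie inside any occurrence
        end = start
    return None
```

For instance, with P = ttaatatat and k = 1, the window ctagggaa is rejected after reading “aa” and “gg”, while the window gaataatt is read completely without exceeding k.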

In general, to find such a pi, we may use Dynamic programming [Sellers 1980].

But, we may use a special kind of small pieces.

It is customary to call a small piece of size L an L-gram.

Let us use the 2-gram.

Note that for two substrings P and Q which are of length 2, the edit distance between them is equal to the Hamming distance between them.

Thus, we may use 2-grams in our algorithm.
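The claim about strings of length 2 can be checked exhaustively. The sketch below assumes the DNA alphabet {a, c, g, t} used in the later example; the helper names are illustrative.

```python
from itertools import product


def edit_distance(a, b):
    """Standard edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


grams = ["".join(g) for g in product("acgt", repeat=2)]
# True when both distances agree on every pair of 2-grams.
all_equal = all(edit_distance(a, b) == hamming(a, b) for a in grams for b in grams)
```

The equality is specific to such short strings: for length 3, “acg” and “cga” have Hamming distance 3 but edit distance 2 (delete the leading a, insert an a at the end).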

Our algorithm

• Make a table D to store the smallest edit distance between each possible 2-gram over the finite alphabet and all substrings of the pattern P.

• The above is done in the preprocessing stage.
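One way to fill such a table, sketched under assumptions: each 2-gram is aligned against P with a dynamic programming matrix whose first row is all zeros and whose last row is minimized, so the match may start and end anywhere in P. The function names and the default alphabet `"acgt"` are illustrative and not taken from the slides.

```python
from itertools import product


def min_ed_to_any_substring(gram, p):
    """Smallest edit distance between gram and any substring of p."""
    # prev[j]: best distance between the current gram prefix and a substring of p ending at j
    prev = [0] * (len(p) + 1)  # the empty gram prefix matches the empty substring anywhere
    for i, g in enumerate(gram, start=1):
        curr = [i]  # gram[:i] against the empty substring
        for j, c in enumerate(p, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (g != c)))
        prev = curr
    return min(prev)  # the substring may end at any position of p


def build_d_table(p, alphabet="acgt", gram_len=2):
    """Table D: every gram over the alphabet mapped to its smallest distance to p."""
    grams = ("".join(g) for g in product(alphabet, repeat=gram_len))
    return {gram: min_ed_to_any_substring(gram, p) for gram in grams}
```

With P = ttaatatat, `build_d_table(P)["aa"]` is 0 and `build_d_table(P)["gg"]` is 2, the two values used in the example that follows.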

Example: T = ctagggaataatttacaatt, P = ttaatatat, k = 1

c t a g g g a a t a a t t t a c a a t t

← m-k → (window: ctagggaa)

Smallest edit distance between “aa” and all substrings of P = 0

Smallest edit distance between “gg” and all substrings of P = 2

∴ ∑ > k

Example: T = ctagggaataatttacaatt, P = ttaatatat, k = 1

c t a g g g a a t a a t t t a c a a t t

← m-k → (window: gaataatt)

Smallest edit distance between “tt” and all substrings of P = 0

Smallest edit distance between “aa” and all substrings of P = 0

Smallest edit distance between “at” and all substrings of P = 0

Smallest edit distance between “ga” and all substrings of P = 1

∴ ∑ = k

[Figure: T shown again with the window of size m-k at position i, a region labelled m+2k, and the next window position i+1.]

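Example: T = ctagggaataatttacaatt, P = ttaatatat, k = 1

c t a g g g a a t a a t t t a c a a t t

[Figure: the window of size m-k and the area of size m+k (gaataattta) marked on T; i+1 marks the next window position.]

Because ∑ ≤ k, we use dynamic programming to find the edit distance between “gaataattta” and P.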
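The verification step can be sketched with the classical dynamic programming search in the style of Sellers, where the first row of the matrix is zero so that an occurrence may start at any text position. The function name `approx_end_positions` and the choice to report 1-based end positions are assumptions made for illustration.

```python
def approx_end_positions(text, p, k):
    """1-based end positions of substrings of text within edit distance k of p."""
    # col[i]: smallest distance between p[:i] and some suffix of the text read so far
    col = list(range(len(p) + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        new = [0]  # an occurrence may start at any position of the text
        for i, pc in enumerate(p, start=1):
            new.append(min(
                col[i] + 1,            # text character c is an extra character
                new[i - 1] + 1,        # pattern character pc is missing from the text
                col[i - 1] + (pc != c),  # match or substitution
            ))
        col = new
        if col[-1] <= k:
            ends.append(j)
    return ends
```

On the area from this slide, `approx_end_positions("gaataattta", "ttaatatat", 1)` returns an empty list, so this area contains no occurrence of P with at most one difference and the window moves on.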

Example: T = ctagggaataatttacaatt, P = ttaatatat, k = 1

[Figure: T shown twice, each time with the window of size m-k at a different position.]
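Putting the pieces together, the whole search over T might be sketched as below. This is an assumption-laden illustration rather than the authors' code: the shift rule (restart just after the first character of the last 2-gram read, or advance by one position after a verification) is inferred from the window positions in the example, the alphabet is taken from the characters actually present, and k < m is assumed.

```python
from itertools import product


def min_ed_to_any_substring(gram, p):
    """Smallest edit distance between gram and any substring of p."""
    prev = [0] * (len(p) + 1)
    for i, g in enumerate(gram, start=1):
        curr = [i]
        for j, c in enumerate(p, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (g != c)))
        prev = curr
    return min(prev)


def approx_end_positions(text, p, k):
    """1-based end positions of substrings of text within distance k of p."""
    col = list(range(len(p) + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        new = [0]
        for i, pc in enumerate(p, start=1):
            new.append(min(col[i] + 1, new[i - 1] + 1, col[i - 1] + (pc != c)))
        col = new
        if col[-1] <= k:
            ends.append(j)
    return ends


def filtered_search(text, p, k, gram_len=2):
    """Filter windows of size m-k with the D table, verify the survivors."""
    m, n = len(p), len(text)
    alphabet = sorted(set(text) | set(p))
    d = {"".join(g): min_ed_to_any_substring("".join(g), p)
         for g in product(alphabet, repeat=gram_len)}
    w = m - k  # window length; assumes k < m
    found = set()
    i = 0
    while i + w <= n:
        total, end, shift_to = 0, i + w, None
        while end - gram_len >= i:  # read grams from the right end of the window
            start = end - gram_len
            total += d[text[start:end]]
            if total > k:
                shift_to = start + 1  # no occurrence can start at or before `start`
                break
            end = start
        if shift_to is None:
            # No conclusion: verify the area of size m+k starting at the window.
            area = text[i:i + m + k]
            found.update(i + e for e in approx_end_positions(area, p, k))
            i += 1
        else:
            i = shift_to
    return sorted(found)
```

On T = ctagggaataatttacaatt with P = ttaatatat and k = 1, the first window (ctagggaa) is skipped after two table lookups and the next window starts at gaataatt, which agrees with the positions shown in the example. The result should equal that of running `approx_end_positions` on the whole text, only with fewer characters inspected on average.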

In the preprocessing stage, we make a table D to record the smallest edit distance between each possible l-gram (a string of length l over the alphabet) and all substrings of P.

D table: example (step by step)

For example, P = aacaccgaa

2-gram  aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
Dp       2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2

For P = a a c a c c g a a a vs “aa”

For P = a a c a c c g a a a vs “ac”

2-gram  aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
Dp       0  0  2  2  2  2  2  2  2  2  2  2  2  2  2  2

For P = a a c a c c g a a a vs “ag”

For P = a a c a c c g a a a vs “at”

For P = a a c a c c g a a a vs “ca”

For P = a a c a c c g a a a vs “cc”

2-gram  aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
Dp       0  0  1  1  0  0  2  2  2  2  2  2  2  2  2  2

Time complexity

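The average complexity of the algorithm is

$O\left(\frac{n\,(k + \log_\sigma(rm))}{m}\right)$ for $\alpha < \frac{1}{2} - O\left(\frac{1}{\sqrt{\sigma}}\right)$,

where r is the number of patterns and $\alpha = k/m$ is the difference ratio.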

The end

[BYN2000] New models and algorithms for multidimensional approximate pattern matching. BAEZA-YATES, R. AND NAVARRO, G. 2000. Journal of Discrete Algorithms 1, 1, 21–49. Special issue on Matching Patterns.

[BYN2002] New and faster filters for multiple approximate string matching. BAEZA-YATES, R. AND NAVARRO, G. 2002. Random Structures and Algorithms 20, 23–49.

[BYR99] Modern Information Retrieval. BAEZA-YATES, R. AND RIBEIRO-NETO, B. 1999. Addison-Wesley, Reading, MA.

[BYN99] Faster approximate string matching. BAEZA-YATES, R. A. AND NAVARRO, G. 1999. Algorithmica 23, 2, 127–158.

[CL94] Sublinear approximate string matching and biological applications. CHANG, W. AND LAWLER, E. 1994. Algorithmica 12, 4/5, 327–344.

[CM94] Approximate string matching and local similarity. CHANG, W. AND MARR, T. 1994. In Proceedings of 5th Combinatorial Pattern Matching (CPM’94). LNCS, vol. 807. Springer-Verlag, Berlin, 259–273.

[CCGJLPR94] Speeding up two string matching algorithms. CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI, W., AND RYTTER, W. 1994. Algorithmica 12, 4/5, 247–267.

[CR94] Text Algorithms. CROCHEMORE, M. AND RYTTER, W. 1994. Oxford University Press, Oxford, UK.

[DM79] Automatic Speech and Speaker Recognition. DIXON, R. AND MARTIN, T., Eds. 1979. IEEE Press.

[EL90] A review of segmentation and contextual analysis techniques for text recognition. ELLIMAN, D. AND LANCASTER, I. 1990. Pattern Recogn. 23, 3/4, 337–346.

[F2003] Row-wise tiling for the Myers’ bit-parallel approximate string matching algorithm. FREDRIKSSON, K. 2003. In Proceedings of 10th Symposium on String Processing and Information Retrieval (SPIRE’03). LNCS, vol. 2857. Springer-Verlag, Berlin, 66–79.

[FN2003] Average-optimal multiple approximate string matching. FREDRIKSSON, K. AND NAVARRO, G. 2003. In Proceedings of 14th Combinatorial Pattern Matching (CPM’03). LNCS, vol. 2676. 109–128.

[FN2004] Improved single and multiple approximate string matching. FREDRIKSSON, K. AND NAVARRO, G. 2004. In Proceedings of 15th Combinatorial Pattern Matching (CPM’04). LNCS, vol. 3109. Springer-Verlag, Berlin, 457–471.

[GL89] Simple and efficient string matching with k mismatches. GROSSI, R. AND LUCCIO, F. 1989. Information Processing Letters 33, 3, 113–120.

[H80] Practical fast searching in strings. HORSPOOL, R. 1980. Software Practice and Experience 10, 501–506.

[HFN2004] Increased bit-parallelism for approximate string matching. HYYRÖ, H., FREDRIKSSON, K., AND NAVARRO, G. 2004. In Proceedings of 3rd Workshop on Efficient and Experimental Algorithms (WEA’04). LNCS, vol. 3059. Springer-Verlag, Berlin, 285–298.

[HN2002] Faster bit-parallel approximate string matching. HYYRÖ, H. AND NAVARRO, G. 2002. In Proceedings of 13th Combinatorial Pattern Matching (CPM’02). LNCS, vol. 2373. Springer-Verlag, Berlin, 203–224. Extended version to appear in Algorithmica.

[JTU96] A comparison of approximate string matching algorithms. JOKINEN, P., TARHIO, J., AND UKKONEN, E. 1996. Software Practice and Experience 26, 12, 1439–1458.

[K92] Techniques for automatically correcting words in text. KUKICH, K. 1992. ACM Computing Surveys 24, 4, 377–439.

[KS94] A pattern-matching model for intrusion detection. KUMAR, S. AND SPAFFORD, E. 1994. In Proceedings of National Computer Security Conference. 11–21.

[LT94] On the searchability of electronic ink. LOPRESTI, D. AND TOMKINS, A. 1994. In Proceedings of 4th International Workshop on Frontiers in Handwriting Recognition. 156–165.

[MM96] Approximate multiple string search. MUTH, R. AND MANBER, U. 1996. In Proceedings of 7th Combinatorial Pattern Matching (CPM’96). LNCS, vol. 1075. Springer-Verlag, Berlin, 75–86.

[M99] A fast bit-vector algorithm for approximate string matching based on dynamic programming. MYERS, E.W. 1999. J. ACM 46, 3, 395–415.

[N2001] A guided tour to approximate string matching. NAVARRO, G. 2001. ACM Computing Surveys 33, 1, 31–88.

[NB99] Very fast and simple approximate string matching. NAVARRO, G. AND BAEZA-YATES, R. 1999. Inf. Process. Lett. 72, 65–70.

[NB2001] Improving an algorithm for approximate pattern matching. NAVARRO, G. AND BAEZA-YATES, R. 2001. Algorithmica 30, 4, 473–502.

[NF2004] Average complexity of exact and approximate multiple string matching. NAVARRO, G. AND FREDRIKSSON, K. 2004. Theor. Comput. Sci. 321, 2-3, 283–290.

[NR2000] Fast and flexible string matching by combining bitparallelism and suffix automata. NAVARRO, G. AND RAFFINOT, M. 2000. ACM J. Exp. Algorithmics 5, 4.

[NR2002] Flexible Pattern Matching in Strings—Practical on-line Search Algorithms for Texts and Biological Sequences. NAVARRO, G. AND RAFFINOT, M. 2002. Cambridge University Press, Cambridge, UK.

[NSTT2000] Indexing text with approximate q-grams. NAVARRO, G., SUTINEN, E., TANNINEN, J., AND TARHIO, J. 2000. In Proceedings of 11th Combinatorial Pattern Matching (CPM’00). LNCS, vol. 1848. Springer-Verlag, Berlin, 350–363.

[PS80] Decision trees and random access machines. PAUL, W. AND SIMON, J. 1980. In Proceedings of International Symposium on Logic and Algorithmic (Zurich). 331–340.

[SK83] Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. SANKOFF, D. AND KRUSKAL, J., Eds. 1983. Addison-Wesley, Reading, MA.

[S80] The theory and computation of evolutionary distances: Pattern recognition. SELLERS, P. 1980. J. Algorithms 1, 359–373.

[ST96] Filtration with q-samples in approximate string matching. SUTINEN, E. AND TARHIO, J. 1996. In Proceedings of 7th Combinatorial Pattern Matching. LNCS, vol. 1075. Springer-Verlag, Berlin, 50–63.

[TU93] Approximate Boyer–Moore string matching. TARHIO, J. AND UKKONEN, E. 1993. SIAM J. Comput. 22, 2, 243–260.

[U85] Finding approximate patterns in strings. UKKONEN, E. 1985. J. Algorithms 6, 132–137.

[W95] Introduction to Computational Biology. WATERMAN, M. 1995. Chapman and Hall, London.

[Y79] The complexity of pattern matching for a random string. YAO, A. C. 1979. SIAM J. Comput. 8, 3, 368–387.