1 Average-Optimal Multiple Approximate String Matching Kimmo Fredriksson, Gonzalo Navarro ACM...

Post on 27-Mar-2015


1

Average-Optimal Multiple Approximate String Matching

Kimmo Fredriksson, Gonzalo Navarro

ACM Journal of Experimental Algorithmics, Vol. 9, Article No. 1.4, 2004, Pages 1-47

Professor R.C.T. Lee

Speaker K.W. Liu

2

The Problem

• The approximate string matching problem:

Given a text T[1...n] and a pattern P[1...m] over some finite alphabet Σ of size σ, find the approximate occurrences of P in T, allowing at most k differences (insertions, deletions, substitutions).

3

For a window of size m-k, if there exists a substring S1 in this window such that its edit distance to every substring of P is greater than k, we can move P.

Our algorithm scans the window from the right, as shown below:

[Fig. 3: text T with a window of size m - k aligned over pattern P; S1 is a substring inside the window.]

4

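For a window of size m-k, if there exists a suffix S1 of the window such that its edit distance to every substring of P is greater than k, we shift P past the first character of S1.

Our algorithm scans the window from the right, as shown below:

[Fig. 3: text T with a window of size m - k aligned over pattern P; S1 is a suffix of the window.]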

5

But how do we know that ED(S1,S2) > k for every substring S2 of P?

We use a very useful lemma.

6

Lemma

Consider strings Q and P. Let Q be divided into pieces q1, q2, …, qn as shown below:

Q: qn … q2 q1

For each qi, let pi be the substring of P such that ED(qi,pi) is the smallest among all substrings of P.

If $\sum_{i=1}^{n} ED(q_i, p_i) > k$, then $ED(Q, P) > k$.

7

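Proof: Consider an optimal alignment of Q with P. It divides P into n pieces p'1, p'2, …, p'n (some possibly empty), with qi aligned to p'i, as shown below:

Q: qn … q2 q1

P: p'n … p'2 p'1

The cost of this alignment satisfies $ED(Q, P) \ge \sum_{i=1}^{n} ED(q_i, p'_i)$.

We assume that $\sum_{i=1}^{n} ED(q_i, p_i) > k$.

Therefore, $\sum_{i=1}^{n} ED(q_i, p'_i) \ge \sum_{i=1}^{n} ED(q_i, p_i) > k$, because each $ED(q_i, p_i)$ is the smallest. Hence $ED(Q, P) > k$.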
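The lemma can be checked empirically. The following is a minimal sketch, not taken from the slides: the helper names (`edit_distance`, `min_dist_to_substrings`, `piece_sum`, `check_lemma`) and the choice of random DNA-like strings are assumptions made for illustration. It cuts Q into pieces of length 2 and confirms that the sum of the per-piece minima never exceeds ED(Q,P).

```python
import random


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def min_dist_to_substrings(q: str, p: str) -> int:
    """Smallest ED(q, s) over all substrings s of p (empty substring included)."""
    best = len(q)  # distance to the empty substring
    for start in range(len(p)):
        for end in range(start + 1, len(p) + 1):
            best = min(best, edit_distance(q, p[start:end]))
    return best


def piece_sum(q: str, p: str, piece_len: int = 2) -> int:
    """Sum over the pieces q_i of the smallest ED(q_i, p_i)."""
    pieces = [q[i:i + piece_len] for i in range(0, len(q), piece_len)]
    return sum(min_dist_to_substrings(piece, p) for piece in pieces)


def check_lemma(trials: int = 200, seed: int = 0) -> bool:
    """Return True if the piece sum never exceeds ED(Q, P) on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        q = "".join(rng.choice("acgt") for _ in range(rng.randint(1, 8)))
        p = "".join(rng.choice("acgt") for _ in range(rng.randint(1, 8)))
        if piece_sum(q, p) > edit_distance(q, p):
            return False
    return True
```

Because the piece sum is a lower bound on ED(Q,P), a piece sum above k is enough to conclude ED(Q,P) > k without ever computing ED(Q,P) itself.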

8

To determine whether ED(S1,S2) > k, we may use the lemma.

We divide the window into small pieces: t1, t2, …,ta.

For each ti, we find the substring pi in P where ED(pi,ti) is the smallest.

[Fig. 7: window W of text T divided from the right into pieces t1, t2, …; each piece ti is matched to its closest substring pi of P.]

9

As soon as $\sum_{i} ED(t_i, p_i) > k$, according to Lemma 1, we know that $ED(S_1, S_2) > k$.

Now we can move P.

If, at the end of the window, $\sum_{i} ED(t_i, p_i) \le k$, we cannot draw any conclusion. We have to use dynamic programming to find whether $ED(W, P) \le k$.

10

In general, to find such a pi, we may use dynamic programming [Sellers 1980].

However, we may instead use a special kind of small piece.

It is customary to call a small piece of size L an L-gram.

Let us use the 2-gram.

11

Note that for two strings x and y that are both of length 2, the edit distance between them is equal to the Hamming distance between them.

Thus, we may use 2-grams in our algorithm.
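The claim about strings of length 2 can be confirmed exhaustively. The sketch below is an illustration only: the four-letter alphabet "acgt" and the function names are assumptions, not part of the slides. It compares the edit distance and the Hamming distance for every pair of 2-grams.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions; both strings must have equal length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_equals_hamming_on_2grams(alphabet: str = "acgt") -> bool:
    """Check ED(x, y) == Hamming(x, y) for every pair of 2-grams."""
    grams = ["".join(g) for g in product(alphabet, repeat=2)]
    return all(edit_distance(x, y) == hamming(x, y) for x in grams for y in grams)
```

The equality holds because a single insertion or deletion changes the length, so two distinct strings of the same length at edit distance 1 must differ by one substitution.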

12

Our algorithm

• Make a table D that stores, for each possible 2-gram over the finite alphabet, the smallest edit distance between that 2-gram and all substrings of the pattern P.

• The above is done in the preprocessing stage.
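The preprocessing step can be sketched directly from its definition. This is a brute-force illustration rather than the authors' implementation: it enumerates every substring of P, and the names `build_d_table_bruteforce` and `edit_distance` as well as the alphabet "acgt" are assumptions.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def build_d_table_bruteforce(pattern: str, alphabet: str = "acgt") -> dict:
    """Map every 2-gram to its smallest edit distance to any substring of pattern."""
    # Collect all non-empty substrings of the pattern once.
    substrings = {pattern[s:e]
                  for s in range(len(pattern))
                  for e in range(s + 1, len(pattern) + 1)}
    table = {}
    for gram in ("".join(g) for g in product(alphabet, repeat=2)):
        # Start from the distance to the empty substring, which is len(gram).
        table[gram] = min([len(gram)] + [edit_distance(gram, s) for s in substrings])
    return table


# Pattern used in the slides' running example.
D = build_d_table_bruteforce("ttaatatat")
```

With P = ttaatatat the table has 16 entries, one per 2-gram over the four-letter alphabet; for instance `D["aa"]` is 0 and `D["gg"]` is 2.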

13

Example: T = ctagggaataatttacaatt, P = ttaatatat, k = 1

c t a g g g a a t a a t t t a c a a t t

← m-k →

Smallest edit distance between “aa” and all substrings of P = 0

Smallest edit distance between “gg” and all substrings of P = 2

∴ Σ = 0 + 2 = 2 > k

14

Example: T = ctagggaataatttacaatt, P = ttaatatat, k = 1

c t a g g g a a t a a t t t a c a a t t

← m-k →

Smallest edit distance between “tt” and all substrings of P = 0

Smallest edit distance between “aa” and all substrings of P = 0

Smallest edit distance between “at” and all substrings of P = 0

Smallest edit distance between “ga” and all substrings of P = 1

∴ Σ = 0 + 0 + 0 + 1 = 1 = k, so the sum does not exceed k and the window cannot be skipped

← m+2k →

c t a g g g a a t a a t t t a c a a t t

← m-k →

i

i+1
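The two windows of the example can be replayed with a short sketch. It is an illustration under stated assumptions: the function names are hypothetical, the alphabet is taken to be "acgt", and the shift rule (restart the window just past the first character of the last 2-gram read) is inferred from the example rather than quoted from the slides.

```python
from itertools import product


def min_dist_to_any_substring(gram: str, pattern: str) -> int:
    """Smallest edit distance between gram and any substring of pattern."""
    col = list(range(len(gram) + 1))
    best = col[-1]                      # distance to the empty substring
    for c in pattern:
        new = [0]                       # a substring may start anywhere
        for i in range(1, len(gram) + 1):
            new.append(min(col[i - 1] + (gram[i - 1] != c), col[i] + 1, new[i - 1] + 1))
        col = new
        best = min(best, col[-1])
    return best


def build_d_table(pattern: str, alphabet: str = "acgt", l: int = 2) -> dict:
    """Table D: every l-gram mapped to its smallest distance to a substring."""
    return {"".join(g): min_dist_to_any_substring("".join(g), pattern)
            for g in product(alphabet, repeat=l)}


def scan_window(text: str, start: int, window_len: int, k: int, d_table: dict, l: int = 2):
    """Read l-grams from the right end of the window.

    Returns the next window start if the sum exceeds k, or None if the
    whole window was read with a sum of at most k (verification needed).
    """
    total = 0
    pos = start + window_len - l        # start of the rightmost l-gram
    while pos >= start:
        total += d_table[text[pos:pos + l]]
        if total > k:
            return pos + 1              # assumed shift rule, see lead-in
        pos -= l
    return None


T, P, k = "ctagggaataatttacaatt", "ttaatatat", 1
D = build_d_table(P)
first = scan_window(T, 0, len(P) - k, k, D)        # reads "aa", then "gg"
second = scan_window(T, first, len(P) - k, k, D)   # reads "tt", "aa", "at", "ga"
```

Here `first` is 5 (0-based), so the second window is "gaataatt", and `second` is `None`, which signals that this window has to be verified by dynamic programming.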

15

c t a g g g a a t a a t t t a c a a t t

← m-k →

← m+k →i+1

Example: T = ctagggaataatttacaatt, P = ttaatatat, k = 1

We now need to find the edit distance between “gaataattta” and P.
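The verification step can be sketched with the column-wise dynamic programming usually attributed to Sellers. This is a generic textbook formulation, not code from the paper; the name `sellers_search` and the convention of returning exclusive 0-based end positions are assumptions.

```python
def sellers_search(pattern: str, text: str, k: int) -> list:
    """Return the end positions (exclusive, 0-based) in text where some
    substring ending there matches pattern with at most k differences."""
    m = len(pattern)
    col = list(range(m + 1))            # column for the empty text prefix
    ends = []
    for j, c in enumerate(text, start=1):
        new = [0]                       # an occurrence may start at any text position
        for i in range(1, m + 1):
            new.append(min(col[i - 1] + (pattern[i - 1] != c),  # match or substitute
                           col[i] + 1,                          # extra text character
                           new[i - 1] + 1))                     # missing text character
        col = new
        if col[m] <= k:
            ends.append(j)
    return ends


# Verify the area from the example against P with k = 1.
area_hits = sellers_search("ttaatatat", "gaataattta", 1)
```

An empty `area_hits` list means the area contains no occurrence of P with at most k differences, and the scan can continue with the next window.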

16

Example: T = ctagggaataatttacaatt, P = ttaatatat, k = 1

c t a g g g a a t a a t t t a c a a t t

← m-k →

c t a g g g a a t a a t t t a c a a t t

← m-k →

17

In the preprocessing stage, we build a table D that records, for each possible l-gram (a string of length l over the alphabet), the smallest edit distance between that l-gram and all substrings of P.

18

D table : example ( step by step )

aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt

Dp 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

For example P = aacaccgaa

For P = a a c a c c g a a a

For P = a a c a c c g a a a vs “aa”

19

For P = a a c a c c g a a a vs “ac”

aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt

Dp 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2

20

For P = a a c a c c g a a a vs “ag”

For P = a a c a c c g a a a vs “at”

For P = a a c a c c g a a a vs “ca”

For P = a a c a c c g a a a vs “cc”

aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt

Dp 0 0 1 1 0 0 2 2 2 2 2 2 2 2 2 2
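The step-by-step filling of the table can be sketched for a general l. The code below is an illustration, not the authors' implementation: it assumes P = aacaccgaa and the alphabet "acgt", starts every entry at l as in the first table, and then updates each entry with a dynamic programming pass over P in which a substring may start and end anywhere. The function names are hypothetical.

```python
from itertools import product


def min_dist_to_any_substring(gram: str, pattern: str) -> int:
    """Smallest edit distance between gram and any substring of pattern,
    computed in O(len(gram) * len(pattern)) time."""
    l = len(gram)
    col = list(range(l + 1))
    best = col[l]                       # the empty substring costs l deletions
    for c in pattern:
        new = [0]                       # free start: the substring may begin here
        for i in range(1, l + 1):
            new.append(min(col[i - 1] + (gram[i - 1] != c), col[i] + 1, new[i - 1] + 1))
        col = new
        best = min(best, col[l])        # free end: the substring may stop here
    return best


def build_d_table(pattern: str, alphabet: str = "acgt", l: int = 2) -> dict:
    """Initialise every l-gram to l, then lower each entry to its true minimum."""
    table = {"".join(g): l for g in product(alphabet, repeat=l)}
    for gram in table:
        table[gram] = min_dist_to_any_substring(gram, pattern)
    return table


D = build_d_table("aacaccgaa")
row = list(D.values())                  # same column order as the table: aa, ac, ag, ...
```

Under these assumptions the finished row is 0 0 1 1 0 0 0 1 0 1 1 1 1 1 1 2, whose first six entries agree with the partially filled table above.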

21

Time complexity

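The average complexity of the algorithm is

$O\left(\frac{n\,(k + \log_\sigma(rm))}{m}\right)$ for $\alpha < 1/2 - O(1/\sqrt{\sigma})$, where $\alpha = k/m$ is the difference ratio and r is the number of patterns.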

22

The end

23

[BYN2000] New models and algorithms for multidimensional approximate pattern matching. BAEZA-YATES, R. AND NAVARRO, G. 2000. Journal of Discrete Algorithms 1, 1, 21–49. Special issue on Matching Patterns.

[BYN2002] New and faster filters for multiple approximate string matching. BAEZA-YATES, R. AND NAVARRO, G. 2002. Random Structures and Algorithms 20, 23–49.

[BYR99] Modern Information Retrieval. BAEZA-YATES, R. AND RIBEIRO-NETO, B. 1999. Addison-Wesley, Reading, MA.

[BYN99] Faster approximate string matching. BAEZA-YATES, R. A. AND NAVARRO, G. 1999. Algorithmica 23, 2, 127–158.

[CL94] Sublinear approximate string matching and biological applications. CHANG, W. AND LAWLER, E. 1994. Algorithmica 12, 4/5, 327–344.

[CM94] Approximate string matching and local similarity. CHANG, W. AND MARR, T. 1994. In Proceedings of 5th Combinatorial Pattern Matching (CPM’94). LNCS, vol. 807. Springer-Verlag, Berlin, 259–273.

[CCGJLPR94] Speeding up two string matching algorithms. CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI, W., AND RYTTER, W. 1994. Algorithmica 12, 4/5, 247–267.

24

[CR94] Text Algorithms. CROCHEMORE, M. AND RYTTER, W. 1994. Oxford University Press, Oxford, UK.

[DM79] Automatic Speech and Speaker Recognition. DIXON, R. AND MARTIN, T., Eds. 1979. IEEE Press.

[EL90] A review of segmentation and contextual analysis techniques for text recognition. ELLIMAN, D. AND LANCASTER, I. 1990. Pattern Recogn. 23, 3/4, 337–346.

[F2003] Row-wise tiling for the Myers’ bit-parallel approximate string matching algorithm. FREDRIKSSON, K. 2003. In Proceedings of 10th Symposium on String Processing and Information Retrieval (SPIRE’03). LNCS, vol. 2857. Springer-Verlag, Berlin, 66–79.

[FN2003] Average-optimal multiple approximate string matching. FREDRIKSSON, K. AND NAVARRO, G. 2003. In Proceedings of 14th Combinatorial Pattern Matching (CPM’03). LNCS, vol. 2676. 109–128.

[FN2004] Improved single and multiple approximate string matching. FREDRIKSSON, K. AND NAVARRO, G. 2004. In Proceedings of 15th Combinatorial Pattern Matching (CPM’04). LNCS, vol. 3109. Springer-Verlag, Berlin, 457–471.

25

[GL89] Simple and efficient string matching with k mismatches. GROSSI, R. AND LUCCIO, F. 1989. Information Processing Letters 33, 3, 113–120.

[H80] Practical fast searching in strings. HORSPOOL, R. 1980. Software Practice and Experience 10, 501–506.

[HFN2004] Increased bit-parallelism for approximate string matching. HYYRÖ, H., FREDRIKSSON, K., AND NAVARRO, G. 2004. In Proceedings of 3rd Workshop on Efficient and Experimental Algorithms (WEA’04). LNCS, vol. 3059. Springer-Verlag, Berlin, 285–298.

[HN2002] Faster bit-parallel approximate string matching. HYYRÖ, H. AND NAVARRO, G. 2002. In Proceedings of 13th Combinatorial Pattern Matching (CPM’02). LNCS, vol. 2373. Springer-Verlag, Berlin, 203–224. Extended version to appear in Algorithmica.

[JTU96] A comparison of approximate string matching algorithms. JOKINEN, P., TARHIO, J., AND UKKONEN, E. 1996. Software Practice and Experience 26, 12, 1439–1458.

[K92] Techniques for automatically correcting words in text. KUKICH, K. 1992. ACM Computing Surveys 24, 4, 377–439.

[KS94] A pattern-matching model for intrusion detection. KUMAR, S. AND SPAFFORD, E. 1994. In Proceedings of National Computer Security Conference. 11–21.

[LT94] On the searchability of electronic ink. LOPRESTI, D. AND TOMKINS, A. 1994. In Proceedings of 4th International Workshop on Frontiers in Handwriting Recognition. 156–165.

26

[MM96] Approximate multiple string search. MUTH, R. AND MANBER, U. 1996. In Proceedings of 7th Combinatorial Pattern Matching (CPM’96). LNCS, vol. 1075. Springer-Verlag, Berlin, 75–86.

[M99] A fast bit-vector algorithm for approximate string matching based on dynamic programming. MYERS, E.W. 1999. J. ACM 46, 3, 395–415.

[N2001] A guided tour to approximate string matching. NAVARRO, G. 2001. ACM Computing Surveys 33, 1, 31–88.

[NB99] Very fast and simple approximate string matching. NAVARRO, G. AND BAEZA-YATES, R. 1999. Inf. Process. Lett. 72, 65–70.

[NB2001] Improving an algorithm for approximate pattern matching. NAVARRO, G. AND BAEZA-YATES, R. 2001. Algorithmica 30, 4, 473–502.

[NF2004] Average complexity of exact and approximate multiple string matching. NAVARRO, G. AND FREDRIKSSON, K. 2004. Theor. Comput. Sci. 321, 2-3, 283–290.

[NR2000] Fast and flexible string matching by combining bit-parallelism and suffix automata. NAVARRO, G. AND RAFFINOT, M. 2000. ACM J. Exp. Algorithmics 5, 4.

[NR2002] Flexible Pattern Matching in Strings—Practical on-line Search Algorithms for Texts and Biological Sequences. NAVARRO, G. AND RAFFINOT, M. 2002. Cambridge University Press, Cambridge, UK.

[NSTT2000] Indexing text with approximate q-grams. NAVARRO, G., SUTINEN, E., TANNINEN, J., AND TARHIO, J. 2000. In Proceedings of 11th Combinatorial Pattern Matching (CPM’00). LNCS, vol. 1848. Springer-Verlag, Berlin, 350–363.

27

[PS80] Decision trees and random access machines. PAUL, W. AND SIMON, J. 1980. In Proceedings of International Symposium on Logic and Algorithmic (Zurich). 331–340.

[SK83] Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. SANKOFF, D. AND KRUSKAL, J., Eds. 1983. Addison-Wesley, Reading, MA.

[S80] The theory and computation of evolutionary distances: Pattern recognition. SELLERS, P. 1980. J. Algorithms 1, 359–373.

[ST96] Filtration with q-samples in approximate string matching. SUTINEN, E. AND TARHIO, J. 1996. In Proceedings of 7th Combinatorial Pattern Matching. LNCS, vol. 1075. Springer-Verlag, Berlin, 50–63.

[TU93] Approximate Boyer–Moore string matching. TARHIO, J. AND UKKONEN, E. 1993. SIAM J. Comput. 22, 2, 243–260.

[U85] Finding approximate patterns in strings. UKKONEN, E. 1985. J. Algorithms 6, 132–137.

[W95] Introduction to Computational Biology. WATERMAN, M. 1995. Chapman and Hall, London.

[Y79] The complexity of pattern matching for a random string. YAO, A. C. 1979. SIAM J. Comput. 8, 3, 368–387.