Average-Optimal Multiple Approximate String Matching

Kimmo Fredriksson, Gonzalo Navarro

ACM Journal of Experimental Algorithmics, Vol. 9, Article No. 1.4, 2004, Pages 1-47

Professor R.C.T. Lee, Speaker K.W. Liu

The Problem

• The approximate string matching problem:

Given a text T[1...n] and a pattern P[1...m] over some finite alphabet ∑ of size σ, find the approximate occurrences of P in T, allowing at most k differences (insertions, deletions, substitutions).
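As a baseline for the problem statement above, here is a minimal sketch of the classical dynamic programming approach (the one the slides later attribute to Sellers 1980). The function name `approximate_occurrences` and the choice to report 1-based end positions are assumptions made for illustration; this is not the filtering algorithm of the paper.

```python
def approximate_occurrences(text, pattern, k):
    """Return the 1-based positions j in text where some substring
    ending at j matches pattern with at most k differences."""
    m = len(pattern)
    # col[i] = smallest edit distance between pattern[:i] and any
    # substring of the text ending at the current position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        new = [0]  # an occurrence may start anywhere in the text
        for i in range(1, m + 1):
            new.append(min(
                col[i - 1] + (pattern[i - 1] != c),  # match or substitution
                col[i] + 1,                          # extra text character
                new[i - 1] + 1,                      # missing pattern character
            ))
        col = new
        if col[m] <= k:
            ends.append(j)
    return ends


print(approximate_occurrences("xxabcxx", "abc", 1))
```

This costs O(mn) time, which is the cost the filtering approach in the following slides tries to avoid on most of the text.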

For a window of size m-k, if there exists a substring S1 in this window such that its edit distance to every substring of P is greater than k, we can move P.

Our algorithm scans from the right as shown below:

Fig. 3

For a window of size m-k, if there exists a suffix S1 such that its edit distance to every substring of P is greater than k, we can move P to S1.

But, how do we know that ED(S1,S2) > k?

We use a very useful lemma.

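Lemma 1. Consider strings Q and P. Let Q be divided into q1, q2, …, qn as shown below:

qn … q2 q1

For each qi, let pi be the substring of P such that ED(qi, pi) is the smallest among all substrings of P.

If $\sum_{i=1}^{n} \mathrm{ED}(q_i, p_i) > k$, then $\mathrm{ED}(Q, P) > k$.

Proof: Divide P into n pieces as shown below, where p'i is the piece aligned with qi:

qn … q2 q1

p'n … p'2 p'1

We assume that $\mathrm{ED}(Q, P) = \sum_{i=1}^{n} \mathrm{ED}(q_i, p'_i)$. Therefore, $\mathrm{ED}(Q, P) \ge \sum_{i=1}^{n} \mathrm{ED}(q_i, p_i) > k$, because $\mathrm{ED}(q_i, p_i)$ is the smallest.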

To determine whether ED(S1,S2) > k, we may use the lemma.

We divide the window into small pieces: t1, t2, …, ta.

For each ti, we find the substring pi in P where ED(pi,ti) is the smallest.

Fig. 7: Window W divided into pieces … t2 t1.

As soon as $\sum_i \mathrm{ED}(p_i, t_i) > k$, according to Lemma 1, we know that ED(S1,S2) > k. Now we can move P.

If, at the end of the window, $\sum_i \mathrm{ED}(p_i, t_i) \le k$, we cannot make any conclusion. We have to use dynamic programming to find whether ED(S1,S2) > k.

In general, to find such a pi, we may use dynamic programming [Sellers 1980].

But we may use a special kind of small piece.

It is customary to call a small piece of size L an L-gram.

Let us use the 2-gram.

Note that for two substrings P and Q which are of length 2, the edit distance between them is equal to the Hamming distance between them.

Thus, we may use 2-grams in our algorithm.
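The claim about strings of length 2 can be checked exhaustively. The sketch below compares edit distance and Hamming distance for every pair of 2-grams; the DNA alphabet "acgt" is taken from the example in these slides, and the helper names are assumptions for illustration.

```python
from itertools import product


def edit_distance(a, b):
    """Standard edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (ca != cb),  # match or substitution
                           prev[j] + 1,               # delete ca
                           cur[j - 1] + 1))           # insert cb
        prev = cur
    return prev[-1]


def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


grams = ["".join(g) for g in product("acgt", repeat=2)]
# Collect every pair of 2-grams on which the two distances disagree.
mismatches = [(x, y) for x in grams for y in grams
              if edit_distance(x, y) != hamming_distance(x, y)]
print(mismatches)
```

The equality is specific to short pieces: for length 3 it already fails, for example "acg" and "cga" have Hamming distance 3 but edit distance 2.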

Our algorithm

• Make a table D to store the smallest edit distance between each possible 2-gram over the finite alphabet and all substrings of the pattern P.

• The above is done in the preprocessing stage.

Example: T = ctagggaataatttacaatt, P = ttaatatat, k = 1

First window (size m-k = 8): c t a g g g a a

Smallest edit distance between “aa” and all substrings of P = 0

Smallest edit distance between “gg” and all substrings of P = 2

∴ ∑ > k, so we can move the window.

Second window (size m-k = 8): g a a t a a t t

Smallest edit distance between “tt” and all substrings of P = 0

Smallest edit distance between “aa” and all substrings of P = 0

Smallest edit distance between “at” and all substrings of P = 0

Smallest edit distance between “ga” and all substrings of P = 1

∴ ∑ = k, so we cannot make any conclusion and have to verify.

Verification: find the edit distance between “gaataattta” (length m+k) and P.

(Figure: the window of size m-k, with the surrounding text areas of size m+k and m+2k marked.)
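The two window scans in the example above can be replayed with a short sketch. The D values are the five entries quoted in the example. The function name `scan_window` is hypothetical, and the shift rule (the next window starts one position after the first character of the last 2-gram read) is an assumption inferred from where the second window of the example begins, not something the slides state explicitly.

```python
def scan_window(text, start, window_len, D, k, l=2):
    """Read l-grams of the window from right to left, summing D values.

    Returns (verdict, grams_read, total, next_start). The verdict is
    "shift" when the sum exceeds k, and "verify" when the whole window
    was read without exceeding k.
    """
    total = 0
    grams = []
    pos = start + window_len - l  # rightmost l-gram of the window
    while pos >= start:
        gram = text[pos:pos + l]
        grams.append(gram)
        total += D[gram]
        if total > k:
            # Assumed shift rule: restart just after the first
            # character of the last l-gram read.
            return "shift", grams, total, pos + 1
        pos -= l
    return "verify", grams, total, start


T = "ctagggaataatttacaatt"
P = "ttaatatat"
k = 1
# D entries taken from the example (only the 2-grams it reads).
D = {"aa": 0, "gg": 2, "tt": 0, "at": 0, "ga": 1}

first = scan_window(T, 0, len(P) - k, D, k)
second = scan_window(T, first[3], len(P) - k, D, k)
print(first)
print(second)
```

Positions here are 0-based, so the second window starting at index 5 corresponds to the text "gaataatt".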

In the preprocessing stage, we make a table D that records, for each possible l-gram over the alphabet, the smallest edit distance between that l-gram and all substrings of P.

D table: example (step by step)

For example, P = aacaccgaa. Initially, every entry is set to 2:

     aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
Dp    2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2

After comparing P with “aa” and “ac”:

     aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
Dp    0  0  2  2  2  2  2  2  2  2  2  2  2  2  2  2

After comparing P with “ag”, “at”, “ca” and “cc”:

     aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
Dp    0  0  1  1  0  0  2  2  2  2  2  2  2  2  2  2
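The table-filling steps above can be sketched as follows. The helper names `min_ed_to_substrings` and `build_d_table` are assumptions for illustration, and the sketch uses a generic dynamic programming pass per 2-gram rather than whatever preprocessing the paper actually uses.

```python
from itertools import product


def min_ed_to_substrings(gram, pattern):
    """Smallest edit distance between gram and any substring of pattern."""
    # col[i] = smallest edit distance between gram[:i] and a substring
    # of pattern ending at the current pattern position.
    col = list(range(len(gram) + 1))
    best = col[-1]  # distance to the empty substring
    for ch in pattern:
        new = [0]  # a substring of pattern may start anywhere
        for i, g in enumerate(gram, start=1):
            new.append(min(col[i - 1] + (g != ch),  # match or substitution
                           col[i] + 1,              # skip a pattern character
                           new[i - 1] + 1))         # skip a gram character
        col = new
        best = min(best, col[-1])
    return best


def build_d_table(pattern, alphabet="acgt", l=2):
    """Map every l-gram over alphabet to its smallest distance to pattern."""
    return {"".join(g): min_ed_to_substrings("".join(g), pattern)
            for g in product(alphabet, repeat=l)}


D = build_d_table("aacaccgaa")
print([D[g] for g in ("aa", "ac", "ag", "at", "ca", "cc")])
```

For the six 2-grams processed in the step-by-step table, this reproduces the values 0, 0, 1, 1, 0, 0.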

Time complexity

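The average complexity of the algorithm is

$O\left(\frac{n\,(k + \log_\sigma(rm))}{m}\right)$ for $k/m < 1/2 - O(1/\sqrt{\sigma})$, where r is the number of patterns.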

The end

[BYN2000] New models and algorithms for multidimensional approximate pattern matching. BAEZA-YATES, R. AND NAVARRO, G. 2000. Journal of Discrete Algorithms 1, 1, 21–49. Special issue on Matching Patterns.

[BYN2002] New and faster filters for multiple approximate string matching. BAEZA-YATES, R. AND NAVARRO, G. 2002. Random Structures and Algorithms 20, 23–49.

[BYR99] Modern Information Retrieval. BAEZA-YATES, R. AND RIBEIRO-NETO, B. 1999. Addison-Wesley, Reading, MA.

[BYN99] Faster approximate string matching. BAEZA-YATES, R. A. AND NAVARRO, G. 1999. Algorithmica 23, 2, 127–158.

[CL94] Sublinear approximate string matching and biological applications. CHANG, W. AND LAWLER, E. 1994. Algorithmica 12, 4/5, 327–344.

[CM94] Approximate string matching and local similarity. CHANG, W. AND MARR, T. 1994. In Proceedings of 5th Combinatorial Pattern Matching (CPM’94). LNCS, vol. 807. Springer-Verlag, Berlin, 259–273.

[CCGJLPR94] Speeding up two string matching algorithms. CROCHEMORE, M., CZUMAJ, A., GASIENIEC, L., JAROMINEK, S., LECROQ, T., PLANDOWSKI, W., AND RYTTER, W. 1994. Algorithmica 12, 4/5, 247–267.

[CR94] Text Algorithms. CROCHEMORE, M. AND RYTTER, W. 1994. Oxford University Press, Oxford, UK.

[DM79] Automatic Speech and Speaker Recognition. DIXON, R. AND MARTIN, T., Eds. 1979. IEEE Press.

[EL90] A review of segmentation and contextual analysis techniques for text recognition. ELLIMAN, D. AND LANCASTER, I. 1990. Pattern Recogn. 23, 3/4, 337–346.

[F2003] Row-wise tiling for the Myers’ bit-parallel approximate string matching algorithm. FREDRIKSSON, K. 2003. In Proceedings of 10th Symposium on String Processing and Information Retrieval (SPIRE’03). LNCS, vol. 2857. Springer-Verlag, Berlin, 66–79.

[FN2003] Average-optimal multiple approximate string matching. FREDRIKSSON, K. AND NAVARRO, G. 2003. In Proceedings of 14th Combinatorial Pattern Matching (CPM’03). LNCS, vol. 2676. 109–128.

[FN2004] Improved single and multiple approximate string matching. FREDRIKSSON, K. AND NAVARRO, G. 2004. In Proceedings of 15th Combinatorial Pattern Matching (CPM’04). LNCS, vol. 3109. Springer-Verlag, Berlin, 457–471.

[GL89] Simple and efficient string matching with k mismatches. GROSSI, R. AND LUCCIO, F. 1989. Information Processing Letters 33, 3, 113–120.

[H80] Practical fast searching in strings. HORSPOOL, R. 1980. Software Practice and Experience 10, 501–506.

[HFN2004] Increased bit-parallelism for approximate string matching. HYYRÖ, H., FREDRIKSSON, K., AND NAVARRO, G. 2004. In Proceedings of 3rd Workshop on Efficient and Experimental Algorithms (WEA’04). LNCS, vol. 3059. Springer-Verlag, Berlin, 285–298.

[HN2002] Faster bit-parallel approximate string matching. HYYRÖ, H. AND NAVARRO, G. 2002. In Proceedings of 13th Combinatorial Pattern Matching (CPM’02). LNCS, vol. 2373. Springer-Verlag, Berlin, 203–224. Extended version to appear in Algorithmica.

[JTU96] A comparison of approximate string matching algorithms. JOKINEN, P., TARHIO, J., AND UKKONEN, E. 1996. Software Practice and Experience 26, 12, 1439–1458.

[K92] Techniques for automatically correcting words in text. KUKICH, K. 1992. ACM Computing Surveys 24, 4, 377–439.

[KS94] A pattern-matching model for intrusion detection. KUMAR, S. AND SPAFFORD, E. 1994. In Proceedings of National Computer Security Conference. 11–21.

[LT94] On the searchability of electronic ink. LOPRESTI, D. AND TOMKINS, A. 1994. In Proceedings of 4th International Workshop on Frontiers in Handwriting Recognition. 156–165.

[MM96] Approximate multiple string search. MUTH, R. AND MANBER, U. 1996. In Proceedings of 7th Combinatorial Pattern Matching (CPM’96). LNCS, vol. 1075. Springer-Verlag, Berlin, 75–86.

[M99] A fast bit-vector algorithm for approximate string matching based on dynamic programming. MYERS, E.W. 1999. J. ACM 46, 3, 395–415.

[N2001] A guided tour to approximate string matching. NAVARRO, G. 2001. ACM Computing Surveys 33, 1, 31–88.

[NB99] Very fast and simple approximate string matching. NAVARRO, G. AND BAEZA-YATES, R. 1999. Inf. Process. Lett. 72, 65–70.

[NB2001] Improving an algorithm for approximate pattern matching. NAVARRO, G. AND BAEZA-YATES, R. 2001. Algorithmica 30, 4, 473–502.

[NF2004] Average complexity of exact and approximate multiple string matching. NAVARRO, G. AND FREDRIKSSON, K. 2004. Theor. Comput. Sci. 321, 2-3, 283–290.

[NR2000] Fast and flexible string matching by combining bitparallelism and suffix automata. NAVARRO, G. AND RAFFINOT, M. 2000. ACM J. Exp. Algorithmics 5, 4.

[NR2002] Flexible Pattern Matching in Strings—Practical on-line Search Algorithms for Texts and Biological Sequences. NAVARRO, G. AND RAFFINOT, M. 2002. Cambridge University Press, Cambridge, UK.

[NSTT2000] Indexing text with approximate q-grams. NAVARRO, G., SUTINEN, E., TANNINEN, J., AND TARHIO, J. 2000. In Proceedings of 11th Combinatorial Pattern Matching (CPM’00). LNCS, vol. 1848. Springer-Verlag, Berlin, 350–363.

[PS80] Decision trees and random access machines. PAUL, W. AND SIMON, J. 1980. In Proceedings of International Symposium on Logic and Algorithmic (Zurich). 331–340.

[SK83] Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. SANKOFF, D. AND KRUSKAL, J., Eds. 1983. Addison-Wesley, Reading, MA.

[S80] The theory and computation of evolutionary distances: Pattern recognition. SELLERS, P. 1980. J. Algorithms 1, 359–373.

[ST96] Filtration with q-samples in approximate string matching. SUTINEN, E. AND TARHIO, J. 1996. In Proceedings of 7th Combinatorial Pattern Matching. LNCS, vol. 1075. Springer-Verlag, Berlin, 50–63.

[TU93] Approximate Boyer–Moore string matching. TARHIO, J. AND UKKONEN, E. 1993. SIAM J. Comput. 22, 2, 243–260.

[U85] Finding approximate patterns in strings. UKKONEN, E. 1985. J. Algorithms 6, 132–137.

[W95] Introduction to Computational Biology. WATERMAN, M. 1995. Chapman and Hall, London.

[Y79] The complexity of pattern matching for a random string. YAO, A. C. 1979. SIAM J. Comput. 8, 3, 368–387.
