Text Indexing and Dictionary Matching with One Error

Amir, A., Keselman, D., Landau, G. M., et al., Journal of Algorithms, 37, 2000, pp. 309-325. Adviser: R. C. T. Lee. Speaker: C. W. Cheng.


  • Slide 1

Text Indexing and Dictionary Matching with One Error. Amir, A., Keselman, D., Landau, G. M., et al., Journal of Algorithms, 37, 2000, pp. 309-325. Adviser: R. C. T. Lee. Speaker: C. W. Cheng.

  • Slide 2

Problem Definition: The Indexing Problem
Input: A text T of length n over alphabet Σ, a pattern P of length m over alphabet Σ, and an integer k.
Output: All occurrences of P in T with at most k mismatches.

  • Slide 3

Main idea
In this algorithm, we construct a suffix tree and a prefix tree for the text T. For each integer j = 1, 2, ..., m, we look up the prefix P_1 ... P_{j-1} in the prefix tree and the suffix P_{j+1} ... P_m in the suffix tree. If both exist and they occur in T separated by exactly one character, an approximate match with one error occurs, with the possible mismatch at position j of P.

  • Slide 4

Preprocessing
1. Construct a suffix tree S_T of the text T and a suffix tree S_{T^R} of the string T^R, where T^R is the reversed text, T^R = t_n ... t_1.

  • Slide 5

Example: T = AGCAGAT, T^R = TAGACGA

  • Slide 6

Example: T = AGCAGAT, T^R = TAGACGA

  • Slide 7

Preprocessing
2. For each of the suffix trees, link all leaves of the suffix tree in left-to-right order.

  • Slide 8

Example: T = AGCAGAT, T^R = TAGACGA

  • Slide 9

Preprocessing
3. For each of the suffix trees, set pointers from each tree node v to its leftmost leaf v_l and its rightmost leaf v_r in the linked list.

  • Slide 10

Example: T = AGCAGAT, T^R = TAGACGA

  • Slide 11

Preprocessing
4. Designate each leaf in S_T by the starting location of its suffix. Designate each leaf in S_{T^R} by n - i + 3, where i is the starting position of the leaf's suffix in T^R.

  • Slide 12

Example: T = AGCAGAT, T^R = TAGACGA

  • Slide 13

Query Processing
For j = 1, ..., m do:
1. Find node v, the location of P_{j+1} ... P_m in S_T, if such a node exists.
2. Find node w, the location of P_{j-1} ... P_1 in S_{T^R}, if such a node exists.
3. If v and w both exist, the values of the leaves under v and w are V[v_l .. v_r] and W[w_l .. w_r]. Find the intersection I of V[v_l .. v_r] and W[w_l .. w_r]. If I is nonempty, an approximate match occurs at T_{i-j} ... T_{i-j+m-1} for every i ∈ I.
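The query loop above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the two suffix trees are replaced by plain linear scans over T and T^R, and the range query by a set intersection, so it shows the leaf labelling (i for S_T, n - i + 3 for S_{T^R}) rather than the running time. The names `starts_with_at` and `one_mismatch_occurrences` are hypothetical helpers introduced for illustration.

```python
def starts_with_at(s, piece):
    """1-based positions i in 1..len(s)+1 where s[i-1:] begins with piece.

    Stands in for "all leaves under the node reached by piece"
    (assumption: a linear scan replaces the suffix-tree lookup).
    """
    return [i for i in range(1, len(s) + 2) if s.startswith(piece, i - 1)]


def one_mismatch_occurrences(text, pattern):
    """1-based start positions where pattern matches text with at most one mismatch."""
    n, m = len(text), len(pattern)
    reversed_text = text[::-1]
    found = set()
    for j in range(1, m + 1):                # j = position in P allowed to mismatch
        suffix = pattern[j:]                 # P_{j+1} ... P_m
        reversed_prefix = pattern[:j - 1][::-1]   # P_{j-1} ... P_1
        # Leaves of S_T are labelled by the start position of their suffix.
        V = set(starts_with_at(text, suffix))
        # Leaves of S_{T^R} are labelled n - i + 3, so both label sets refer
        # to the same thing: the text position right after the mismatch.
        W = {n - i + 3 for i in starts_with_at(reversed_text, reversed_prefix)}
        for i in V & W:
            found.add(i - j)                 # the occurrence starts at T_{i-j}
    return sorted(found)


print(one_mismatch_occurrences("actgacctcagctta", "ctaa"))
```

Because position j is simply skipped, exact occurrences are reported as well; collecting the start positions in a set removes the duplicates that arise when the same occurrence is found for several values of j.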
  • Slide 14

Example: T = actgacctcagctta, P = ctaa, k = 1

  • Slide 15

Example: T = actgacctcagctta, T^R = attcgactccagtca, P = ctaa
Suffix Tree of T

  • Slide 16

Example: T = actgacctcagctta, T^R = attcgactccagtca, P = ctaa
Suffix Tree of T^R

  • Slide 17

Suffix Tree of T
j = 1: v = P_{j+1} ... P_m = taa, w = P_{j-1} ... P_1 = (empty)
V[v_l .. v_r] = {}

  • Slide 18

Suffix Tree of T^R
j = 1: v = P_{j+1} ... P_m = taa, w = P_{j-1} ... P_1 = (empty)
V[v_l .. v_r] = {}, W[w_l .. w_r] = {3, 12, ..., 14}, I = {}

  • Slide 21

Suffix Tree of T
j = 3: v = P_{j+1} ... P_m = a, w = P_{j-1} ... P_1 = tc
V[v_l .. v_r] = {15, 5, 1, 10}

  • Slide 22

Suffix Tree of T^R
j = 3: v = P_{j+1} ... P_m = a, w = P_{j-1} ... P_1 = tc
V[v_l .. v_r] = {15, 5, 1, 10}, W[w_l .. w_r] = {5, 10, 15}, I = {15, 5, 10}

  • Slide 23

When j = 3, the intersection of V = {15, 5, 1, 10} and W = {5, 10, 15} is I = {5, 10, 15}. Therefore approximate matches occur at T_{i-j} ... T_{i-j+m-1} for all i ∈ I, that is, at T_2 ... T_5 (ctga), T_7 ... T_10 (ctca), and T_12 ... T_15 (ctta).
T = actgacctcagctta, P = ctaa

  • Slide 24

Suffix Tree of T
j = 4: v = P_{j+1} ... P_m = (empty), w = P_{j-1} ... P_1 = atc
V[v_l .. v_r] = {15, 5, ..., 13}

  • Slide 25

Suffix Tree of T^R
j = 4: v = P_{j+1} ... P_m = (empty), w = P_{j-1} ... P_1 = atc
V[v_l .. v_r] = {15, 5, ..., 13}, W[w_l .. w_r] = {}, I = {}

  • Slide 26

Range Query Problem
In step 3, given nodes v and w, we want to find the leaves that appear both in the interval [v_l .. v_r] and in the interval [w_l .. w_r], where the four end points of the two intervals are defined in step 3 of the preprocessing. Thus, we are seeking a solution to the range query problem.

  • Slide 27

Problem Definition of Range Query
Input: Two permutation arrays V = [v_1, v_2, ..., v_n] and W = [w_1, w_2, ..., w_n], where n is the number of elements, and four constants i, j, k and l, where both i + k < n and j + l < n.
Output: The intersection of the elements of V[i .. i+k] and W[j .. j+l].
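The range query just defined can be illustrated with a short Python sketch. It is a simple stand-in, not Overmars's grid structure: it precomputes where each value sits in W and then scans the chosen range of V, so a query costs time proportional to the length of that range. The function name `range_intersection` and its 1-based inclusive bounds are assumptions made for this illustration; the arrays are the ones from the example on Slide 28.

```python
def range_intersection(V, W, v_lo, v_hi, w_lo, w_hi):
    """Values that appear in both V[v_lo..v_hi] and W[w_lo..w_hi].

    V and W are permutations of the same values; all bounds are
    1-based and inclusive (an assumption of this sketch).
    """
    # Inverse permutation: value -> 1-based position in W.
    position_in_w = {value: index for index, value in enumerate(W, start=1)}
    # Keep the values of the V range whose W position falls in the W range.
    return [x for x in V[v_lo - 1:v_hi] if w_lo <= position_in_w[x] <= w_hi]


V = [8, 5, 1, 4, 3, 7, 6, 2]
W = [3, 6, 4, 7, 2, 1, 5, 8]
print(range_intersection(V, W, 3, 6, 2, 6))  # V[3..6] against W[2..6]
```

Viewing each value as the point (position in V, position in W) turns the query into reporting the points inside an axis-parallel rectangle, which is the form in which grid range-searching structures answer it.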
  • Slide 28

Example
V = [8, 5, 1, 4, 3, 7, 6, 2], W = [3, 6, 4, 7, 2, 1, 5, 8]
i = 3, k = 4, j = 2, l = 5
Output: the intersection of V[v_3, v_4, v_5, v_6] and W[w_2, w_3, w_4, w_5, w_6]. In this example k and l count the elements in each range.

  • Slide 29

Preprocessing (grid figure for V and W)
Position: 1 2 3 4 5 6 7 8
V:        8 5 1 4 3 7 6 2
W:        3 6 4 7 2 1 5 8

  • Slide 30

Preprocessing (grid figure for V and W, continued)

  • Slide 31

Preprocessing (grid figure for V and W, continued)
The intersection of V[v_3, v_4, v_5, v_6] and W[w_2, w_3, w_4, w_5, w_6] is {1, 4, 7}.

  • Slide 32

Time Complexity of Range Query Problem
By using Overmars's algorithm [O88], the range query problem can be solved with preprocessing time … and …, where k is the number of points in the range.
[O88] Overmars, M. H., Efficient data structures for range searching on a grid, J. Algorithms 9, 1988, pp. 254-275.

  • Slide 33

Time Complexity
For the indexing problem, the preprocessing time is … and the query can be implemented in …, where tocc is the number of occurrences of the pattern in the text with one error.

  • Slide 34

The Dictionary Matching Problem

  • Slide 35

Problem Definition: The Dictionary Matching Problem
Input:
1. A dictionary P = {p_1, ..., p_s}, where each p_i, i = 1, ..., s, is a pattern over alphabet Σ; the dictionary size is the sum of the lengths of all the dictionary patterns.
2. A text T of length n over alphabet Σ.
3. An integer k.
Output: All occurrences of any dictionary pattern in T with at most k mismatches.

  • Slide 36

Main idea
In this algorithm, we construct a suffix tree and a prefix tree for D, the concatenation of all patterns in the dictionary. For each integer j = 1, 2, ..., n, we look up T_{j-1} ... T_1 in the prefix tree and T_{j+1} ... T_n in the suffix tree. If both lookups reach markings that belong to the same position of the same pattern, an approximate match with one error occurs.

  • Slide 37

Preprocessing
1. Construct a suffix tree S_D of the string D and a suffix tree S_{D^R} of the string D^R, where D is the concatenation of all dictionary patterns, with a separator $ at the end of each pattern, and D^R is the reversal of D.

  • Slide 38

Example: P = {tca, gctga, gca}, D = TCA$GCTGA$GCA$, D^R = ACG$AGTCG$ACT$

  • Slide 39

Example: P = {tca, gctga, gca}, D = TCA$GCTGA$GCA$, D^R = ACG$AGTCG$ACT$
Suffix Tree of D (S_D)

  • Slide 40

Example: P = {tca, gctga, gca}, D = TCA$GCTGA$GCA$, D^R = ACG$AGTCG$ACT$
Suffix Tree of D^R (S_{D^R})

  • Slide 41

Preprocessing
2. Modify suffix trees S_D and S_{D^R} as follows. For each separator which is treefirst but not edgefirst, i.e., it appears on an edge (u,v) whose label has a nonempty string before the $, break (u,v) into (u,w) and (w,v). Label (u,w) with the part of the label before the $, and (w,v) with the part that starts at the $.

  • Slide 42

Example: P = {tca, gctga, gca}, D = TCA$GCTGA$GCA$, D^R = ACG$AGTCG$ACT$
Suffix Tree of D (S_D)

  • Slide 43

Example: P = {tca, gctga, gca}, D = TCA$GCTGA$GCA$, D^R = ACG$AGTCG$ACT$
Suffix Tree of D^R (S_{D^R})

  • Slide 44

Preprocessing
3. Scan suffix tree S_D, respectively S_{D^R}, and modify it as follows. For each vertex v, consider the associated string L(v), i.e., the string from the root to v. Label v with all the locations of the pattern suffixes, respectively prefixes, that are equal to L(v). To implement this, note that all the relevant suffixes share the prefix L(v)$. So go to the edge (v,w) whose label begins with $, if such an edge exists, and scan the subtree rooted at w to find all relevant suffixes.

  • Slide 45

Example: P = {tca, gctga, gca}, D = TCA$GCTGA$GCA$, D^R = ACG$AGTCG$ACT$
Suffix Tree of D (S_D)

  • Slide 46

Example: P = {tca, gctga, gca}, D = TCA$GCTGA$GCA$, D^R = ACG$AGTCG$ACT$
Suffix Tree of D^R (S_{D^R})

  • Slide 47

Query Processing
For j = 1, ..., n do:
1. Find node v, the location of the longest prefix of t_{j+1} ... t_n in S_D.
2. Find node w, the location of the longest prefix of t_{j-1} ... t_1 in S_{D^R}.
3. Find the intersection of the markings of the nodes on the path from the root to v in S_D and on the path from the root to w in S_{D^R}.
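The dictionary query above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the construction from the slides: two hash maps keyed by pattern suffixes and reversed pattern prefixes stand in for the marked suffix trees S_D and S_{D^R}, and a marking is the pair (pattern index, skipped position) instead of a location in D. The names `build_marks` and `dictionary_matches` are hypothetical.

```python
from collections import defaultdict


def build_marks(patterns):
    """Map each pattern suffix / reversed prefix to its markings.

    A marking (p, q) says: pattern p with its character at 0-based
    position q skipped (assumption: stands in for the node labels
    produced by preprocessing step 3).
    """
    suffix_marks = defaultdict(set)
    prefix_marks = defaultdict(set)
    for p, pattern in enumerate(patterns):
        for q in range(len(pattern)):
            suffix_marks[pattern[q + 1:]].add((p, q))
            prefix_marks[pattern[:q][::-1]].add((p, q))
    return suffix_marks, prefix_marks


def dictionary_matches(text, patterns):
    """Pairs (1-based start, pattern) matching text with at most one mismatch."""
    suffix_marks, prefix_marks = build_marks(patterns)
    longest = max(len(pattern) for pattern in patterns)
    found = set()
    for j in range(len(text)):               # 0-based text position allowed to mismatch
        right = text[j + 1:j + 1 + longest]  # t_{j+1} ... read left to right
        left = text[max(0, j - longest):j][::-1]  # t_{j-1} ... t_1 read right to left
        # Collect markings along the "path from the root", i.e. for every prefix.
        V, W = set(), set()
        for length in range(len(right) + 1):
            V |= suffix_marks.get(right[:length], set())
        for length in range(len(left) + 1):
            W |= prefix_marks.get(left[:length], set())
        for p, q in V & W:
            found.add((j - q + 1, patterns[p]))
    return sorted(found)


print(dictionary_matches("acagccga", ["tca", "gctga", "gca"]))
```

Intersecting on (pattern, position) pairs plays the role of step 3: a pattern is reported only when the part before its skipped character and the part after it are both found around the same text position.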
  • Slide 48

Example: T = acagccga, P = {tca, gctga, gca}, k = 1

  • Slide 49

Example: P = {tca, gctga, gca}, D = tca$gctga$gca$, D^R = acg$agtcg$act$, T = acagccga
Suffix Tree of D (S_D)

  • Slide 50

Example: P = {tca, gctga, gca}, D = tca$gctga$gca$, D^R = acg$agtcg$act$, T = acagccga
Suffix Tree of D^R (S_{D^R})

  • Slide 51

Suffix Tree of D (S_D)
j = 1: v = T_{j+1} ... T_n = cagccga, w = T_{j-1} ... T_1 = (empty)
V = {10, 2}

  • Slide 54

Suffix Tree of D (S_D)
j = 2: v = T_{j+1} ... T_n = agccga, w = T_{j-1} ... T_1 = a
V = {11, 8, 3}

  • Slide 56

Suffix Tree of D (S_D)
j = 3: v = T_{j+1} ... T_n = gccga, w = T_{j-1} ... T_1 = ca
V = {}

  • Slide 57

Suffix Tree of D^R (S_{D^R})
j = 3: v = T_{j+1} ... T_n = gccga, w = T_{j-1} ... T_1 = ca
V = {}, W = {}, I = {}

  • Slide 58

Suffix Tree of D (S_D)
j = 4: v = T_{j+1} ... T_n = ccga, w = T_{j-1} ... T_1 = aca
V = {}

  • Slide 59

Suffix Tree of D^R (S_{D^R})
j = 4: v = T_{j+1} ... T_n = ccga, w = T_{j-1} ... T_1 = aca
V = {}, W = {}, I = {}

  • Slide 60

Suffix Tree of D (S_D)
j = 5: v = T_{j+1} ... T_n = cga, w = T_{j-1} ... T_1 = gaca
V = {}

  • Slide 61

Suffix Tree of D^R (S_{D^R})
j = 5: v = T_{j+1} ... T_n = cga, w = T_{j-1} ... T_1 = gaca
V = {}, W = {6, 11}, I = {}

  • Slide 62

Suffix Tree of D (S_D)
j = 6: v = T_{j+1} ... T_n = ga, w = T_{j-1} ... T_1 = cgaca
V = {7}

  • Slide 63

Suffix Tree of D^R (S_{D^R})
j = 6: v = T_{j+1} ... T_n = ga, w = T_{j-1} ... T_1 = cgaca
V = {7}, W = {7, 12}, I = {7}

  • Slide 66

Suffix Tree of D^R (S_{D^R})
j = 8: v = T_{j+1} ... T_n = (empty), w = T_{j-1} ... T_1 = gccgaca
V = {11, 8, ..., 6}, W = {}, I = {}

  • Slide 67