of 69

• date post

12-Feb-2016
• Category

## Documents

• view

26

0

Embed Size (px)

description

Text Indexing and Dictionary Matching with One Error. Amir, A., KeseIman, D., Landau, G., M. and etc, Journal of Algorithm, 37, 2000, pp. 309-325. Adviser: R. C. T. Lee Speaker: C. W. Cheng. Problem Definition. The Indexing Problem ： - PowerPoint PPT Presentation

### Transcript of Text Indexing and Dictionary Matching with One Error

• Text Indexing and Dictionary Matching with One ErrorAmir, A., KeseIman, D., Landau, G., M. and etc, Journal of Algorithm, 37, 2000, pp. 309-325Adviser: R. C. T. LeeSpeaker: C. W. Cheng

• Problem DefinitionThe Indexing ProblemInputA Text T of length n over alphabet , a pattern P of length m over alphabet and an integer k.

Output All occurrences of P in T with at most k mismatches.

• Main ideaIn this algorithm, we construct suffix tree and prefix tree with text T. We set an integer j, j=1,2m. Then we find the prefix P1,j-1 in prefix tree and the suffix Pj+1,m in suffix tree. If both of them exist, an approximation string matching with one error occurs.

• Processing1.Construct a suffix tree ST of the text string T and suffix tree STR of the string TR is the reversed text TR = tn t1.

• Ex T=AGCAGAT TR=TAGACGA

• Ex T=AGCAGAT TR=TAGACGA

• Processing2. For each of the suffix trees, link all leaves of the suffix tree in a left-to-right order.

• Ex T=AGCAGAT TR=TAGACGA

• Processing3. For each of the suffix trees, set pointers from each tree node v to its left most leaf vl and rightmost leave vr in the linked list.

• Ex T=AGCAGAT TR=TAGACGA

• Processing4. Designate each leaf in ST by the starting location of its suffix. Designate each leaf in STR by n i + 3, where i is the starting position of the leafs suffix in TR.

• Ex T=AGCAGAT TR=TAGACGA

• Query ProcessingFor j = 1, ., m do1. Find node v, the location of Pj+1 Pm in ST, if such a node exists.2. Find node w, the location of Pj-1 .. P1 in STR, if such a node exist.3. If v and w exist, the values of leaves under v and w are V[vl.vr] and W[wlwr], to find the intersections I of V[vl.vr] and W[wlwr]. If the intersections exist, the approximate string matching occurs on Ti-3Ti-3+m, for all i I.

• ExampleEx T=actgacctcagctta P=ctga k=1

• Ex T=actgacctcagctta TR=attcgactccagtca P=ctaaSuffix Tree of T

• Ex T=actgacctcagctta TR=attcgactccagtca P=ctaaSuffix Tree of TR

• Ex T=actgacctcagctta TR=attcgactccagtca P=ctaaSuffix Tree of T j=1v=Pj+1Pm=taaw=Pj-1P1=

V[vl.vr]={}

• Suffix Tree of TREx T=actgacctcagctta TR=attcgactccagtca P=ctaa

j=1v=Pj+1Pm=taaw=Pj-1P1=

V[vl.vr]={}W[vl.vr]={3,12,,14}

I={}

• Suffix Tree of TEx T=actgacctcagctta TR=attcgactccagtca P=ctaa j=2v=Pj+1Pm=aaw=Pj-1P1=c

V[vl.vr]={}

• Suffix Tree of TREx T=actgacctcagctta TR=attcgactccagtca P=ctaa j=2v=Pj+1Pm=aaw=Pj-1P1=c

V[vl.vr]={}W[vl.vr]={4,8,9,14,11}

I={}

• Suffix Tree of TEx T=actgacctcagctta TR=attcgactccagtca P=ctaa j=3v=Pj+1Pm=aw=Pj-1P1=tc

V[vl.vr]={15,5,1,10}

• Suffix Tree of TREx T=actgacctcagctta TR=attcgactccagtca P=ctaa j=3v=Pj+1Pm=aw=Pj-1P1=tc

V[vl.vr]={15,5,1,10}W[vl.vr]={5,10,15}

I={15,5,10}

• When j=3, the intersection of V[15,5,1,10] and W[5,10,15] is I={5,10,15}. Therefore approximate string matching occurs on Ti-jTi-j+m, for all i I.

T2T6T7T11T12T15T=actgacctcagcttaP=ctaa

• Suffix Tree of TEx T=actgacctcagctta TR=attcgactccagtca P=ctaa j=4v=Pj+1Pm=w=Pj-1P1=atc

V[vl.vr]={15,5,,13}

• Suffix Tree of TREx T=actgacctcagctta TR=attcgactccagtca P=ctaa j=3v=Pj+1Pm=w=Pj-1P1=atc

V[vl.vr]={15,5,,13}W[vl.vr]={}

I={}

• Range Query ProblemIn step 3, given nodes v and w, we want to find the leaves that appear both in interval [vl vr] and in the interval [wl wr], where the four end points of the two intervals are defined in step P.3 of the preprocessing. Thus, we are seeking a solution to the range query problem.

• Problem Definition of Range QueryInput Let V=[v1,v2 vn] and W=[w1,w2 wn] be two permutation arrays, where n is the number of elements. Four constants i,j,k and l, where both i+k < n and j+l < n.Output Find the intersection of elements of V[i i+k] and W[j j+l].

• ExampleV=[8,5,1,4,3,7,6,2]W=[3,6,4,7,2,1,5,8]i=3,k=4j=2,l=5

Output: the intersection of V[v3,v4,v5,v6] and W[w2,w3,w4,w5,w6]

• PreprocessingV=

W=1 2 3 4 5 6 7 82 3 4 5 6 7 81 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8

85143762

36472158

• PreprocessingV=

W=1 2 3 4 5 6 7 82 3 4 5 6 7 81 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 85143672

85143762

36472158

• PreprocessingV=

W=1 2 3 4 5 6 7 82 3 4 5 6 7 81 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 85143762The intersection of V[v3,v4,v5,v6] and W[w2,w3,w4,w5,w6] is {1,4,7}.

85143762

36472158

• Time Complexity of Range Query ProblemBy using Overmars algorithm, the range query problem can be solved with preprocessing time and , where k is the number of points in the range.

[O88] Overmars, M. H., Efficient data structures for range searching on a grid, J. Algorithms 9, 1988,pp. 254-275.

• Time ComplexityFor the indexing problem, the preprocessing time is and the query can be implemented in , where tocc is the number of occurrences of the pattern in the text with one error.

• The Dictionary Matching Problem

• Problem DefinitionThe Dictionary Matching ProblemInput1. A dictionary P = {p1,., ps}, where pi, i = 1,., s, are patterns over alphabet , and is the sum of the lengths of all the dictionary patterns.2. A Text T of length n over alphabet .3. An integer k.OutputAll occurrences of any dictionary patterns in T with at most k mismatches.

• Main ideaIn this algorithm, we construct suffix tree and prefix tree with D which is concatenation of all patterns in dictionary. We set an integer j, j=1,2n. Then we find the prefix T1,j-1 in prefix tree and the suffix Tj+1,m in suffix tree. If both of them exist, an approximation string matching with one error occurs.

• Processing1. Construct a suffix tree SD of string D and suffix tree SDR of the string DR, where D is the concatenation of all dictionary patterns, with a separator at the end of each pattern, and where DR is the reversal of string D.

• ExampleP={tca,gctga,gca}D=TCA\$GCTGA\$GCA\$DR=ACG\$AGTCG\$ACT\$

• ExampleP={tca,gctga,gca}D=TCA\$GCTGA\$GCA\$DR=ACG\$AGTCG\$ACT\$

Suffix Tree of D (SD)

• ExampleP={tca,gctga,gca}D=TCA\$GCTGA\$GCA\$DR=ACG\$AGTCG\$ACT\$

Suffix Tree of DR (SDR)

• Processing2. Modify suffix tree SD, and SDR respectively, as follows. For each separator which is treefirst but not edgefirst, i.e., it appears on an edge (u,v) labeled \$, where , break (u,v) into (u,w) and (w,v). Label (u,v) with and (w,v) with \$.

• ExampleP={tca,gctga,gca}D=TCA\$GCTGA\$GCA\$DR=ACG\$AGTCG\$ACT\$

Suffix Tree of D (SD)

• ExampleP={tca,gctga,gca}D=TCA\$GCTGA\$GCA\$DR=ACG\$AGTCG\$ACT\$

Suffix Tree of DR (SDR)

• Preprocessing3. Scan suffix tree SD, respectively SDR, and modify as follows. For each vertex v consider the associated string L(v), i.e., the string from the root to v. Label v with all the locations of the pattern suffixes, resp. prefixes, that are equal to L(v). To implement this note that all the relevant suffixes share a prefix of L(v)\$. So, go to edge (v,w) with label beginning with \$, assuming such exists, and scan the subtree rooted at w to find all relevant suffixes.

• ExampleP={tca,gctga,gca}D=TCA\$GCTGA\$GCA\$DR=ACG\$AGTCG\$ACT\$

Suffix Tree of D (SD)

• ExampleP={tca,gctga,gca}D=TCA\$GCTGA\$GCA\$DR=ACG\$AGTCG\$ACT\$

Suffix Tree of DR (SDR)

• Query ProcessingFor j = 1,., n do1. Find node v, the location of the longest prefix of tj+1 tn in SD.2. Find node w, the location of the longest prefix of tj-1 t1 in SDR.3. Find intersection of markings of nodes on the path from the root to v in SD and on the path from the root to w in SDR.

• Suffix Tree of D (SD)ExampleP={tca,gctga,gca}D=tca\$gctga\$gca\$DR=acg\$agtcg\$act\$T=acagccga

• Suffix Tree of DR (SDR)ExampleP={tca,gctga,gca}D=tca\$gctga\$gca\$DR=acg\$agtcg\$act\$T=acagccga

• Suffix Tree of D (SD)ExampleP={tca,gctga,gca}D=tca\$gctga\$gca\$DR=acg\$agtcg\$act\$T=acagccga

j=1v=Tj+1Tm=cagccgaw=Tj-1T1=

V={10,2}

• Suffix Tree of DR (SDR)ExampleP={tca,gctga,gca}D=tca\$gctga\$gca\$DR=acg\$agtcg\$act\$T=acagccga

j=1v=Tj+1Tm=cagccgaw=Tj-1T1=

V={10,2}W={13,5,, 8}I={10,2}

• When j=1, the intersection of V[10,2] and W[13,5,,8] is I={10,2}. Therefore approximate string matching occurs on T with PP.

T=acagccgaP={tca,gctga,gca}

• Suffix Tree of D (SD)ExampleP={tca,gctga,gca}D=tca\$gctga\$gca\$DR=acg\$agtcg\$act\$T=acagccga

j=2v=Tj+1Tm=agccgaw=Tj-1T1=a

V={11,8,3}

• Suffix Tree of DR (SDR)ExampleP={tca,gctga,gca}D=tca\$gctga\$gca\$DR=acg\$agtcg\$act\$T=acagccga

j=2v=Tj+1Tm=agccgaw=Tj-1T1=a

V={11,8,3}W={}I={}

• Suffix Tree of D (SD)ExampleP={tca,gctga,gca}D=tca\$gctga\$gca\$DR=acg\$agtcg\$act\$T=acagccga

j=3v=Tj+1Tm=gccgaw=Tj-1T1=ca

V={}

• Suffix Tree of DR (SDR)ExampleP={tca,gctga,gca}D=tca\$gctga\$gca\$DR=acg\$agtcg\$act\$T=acagccga

j=3v=Tj+1Tm=gccgaw=Tj-1T1=ca

V={}W={}I={}

• Suffix Tree of D (SD)ExampleP={tca,gctga,gca}D=tca\$gctga\$gca\$DR=acg\$agtcg\$act\$T=acagccga

j=4v=Tj+1Tm=ccgaw=Tj-1T1=aca

V={}