Text Indexing and Dictionary Matching with One Error


Text Indexing and Dictionary Matching with One Error. Amir, A., Keselman, D., Landau, G. M., et al., Journal of Algorithms, 37, 2000, pp. 309-325. Adviser: R. C. T. Lee. Speaker: C. W. Cheng.


  • Text Indexing and Dictionary Matching with One Error. Amir, A., Keselman, D., Landau, G. M., et al., Journal of Algorithms, 37, 2000, pp. 309-325. Adviser: R. C. T. Lee. Speaker: C. W. Cheng.

  • Problem Definition: The Indexing Problem

    Input: A text T of length n over alphabet Σ, a pattern P of length m over alphabet Σ, and an integer k.

    Output: All occurrences of P in T with at most k mismatches.
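As a reference point for the problem statement, here is a minimal brute-force sketch that compares every window of T against P. It is not the indexing method of the paper; the function name `mismatch_occurrences` is made up for illustration, and positions are reported 1-based to match the slides.

```python
def mismatch_occurrences(text: str, pattern: str, k: int) -> list[int]:
    """Return the 1-based start positions where pattern matches text
    with at most k mismatches (naive O(n * m) scan, no index)."""
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):
        window = text[start:start + m]
        # Count positions where the window and the pattern disagree.
        mismatches = sum(1 for a, b in zip(window, pattern) if a != b)
        if mismatches <= k:
            hits.append(start + 1)
    return hits


# Text and pattern taken from the worked example later in the slides.
print(mismatch_occurrences("actgacctcagctta", "ctaa", 1))
```

Such a scan is useful mainly as a correctness check for an indexed solution, since it does no preprocessing of T at all.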

  • Main idea: In this algorithm, we construct a suffix tree and a prefix tree (the suffix tree of the reversed text) for the text T. For each mismatch position j = 1, 2, ..., m, we look up the prefix P_1 ... P_{j-1} in the prefix tree and the suffix P_{j+1} ... P_m in the suffix tree. If both occur, and they occur in T at positions separated by exactly one character, an approximate match with one error occurs.
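The split of P around a candidate mismatch position can be sketched as follows. The helper name `split_at` is hypothetical; the prefix is returned reversed because it is looked up in the tree built over the reversed text.

```python
def split_at(pattern: str, j: int) -> tuple[str, str]:
    """For a 1-based mismatch position j, return (w, v) where
    w = P_{j-1} ... P_1 (the prefix before j, reversed) and
    v = P_{j+1} ... P_m (the suffix after j)."""
    prefix_reversed = pattern[:j - 1][::-1]
    suffix = pattern[j:]
    return prefix_reversed, suffix


# The four splits of P = ctaa used in the worked example.
for j in range(1, 5):
    print(j, split_at("ctaa", j))
```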

  • Preprocessing 1. Construct a suffix tree S_T of the text string T and a suffix tree S_{T^R} of the string T^R, where T^R = t_n ... t_1 is the reversed text.

  • Example: T = AGCAGAT, T^R = TAGACGA

  • Preprocessing 2. For each of the suffix trees, link all leaves of the suffix tree in left-to-right order.

  • Preprocessing 3. For each of the suffix trees, set pointers from each tree node v to its leftmost leaf vl and its rightmost leaf vr in the linked list.

  • Preprocessing 4. Designate each leaf in S_T by the starting location of its suffix. Designate each leaf in S_{T^R} by n - i + 3, where i is the starting position of the leaf's suffix in T^R.
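The leaf order and leaf labels from the four preprocessing steps can be approximated without building the trees. The sketch below assumes that the children of each node are ordered by their first character, so that reading the leaves left to right gives the suffixes in lexicographic order; it sorts the suffixes directly, which is quadratic and only meant for small examples. The name `leaf_labels` is made up here.

```python
def leaf_labels(text: str) -> tuple[list[int], list[int]]:
    """Return the leaf labels of S_T and S_{T^R} in left-to-right order.

    Assumption: leaves read left to right correspond to the suffixes in
    lexicographic order, so a plain sort stands in for the suffix tree.
    """
    n = len(text)
    rev = text[::-1]
    # 1-based starting positions, ordered by the suffix that starts there.
    order_t = sorted(range(1, n + 1), key=lambda i: text[i - 1:])
    order_r = sorted(range(1, n + 1), key=lambda i: rev[i - 1:])
    labels_t = order_t                          # label = starting location
    labels_r = [n - i + 3 for i in order_r]     # label = n - i + 3
    return labels_t, labels_r


labels_t, labels_r = leaf_labels("AGCAGAT")
print(labels_t)
print(labels_r)
```

With this labeling, a leaf of S_{T^R} carries the position in T that lies two places after the end of the matched prefix, which is exactly where the suffix must start if one character is skipped.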

  • Query Processing: For j = 1, ..., m do:
    1. Find node v, the location of P_{j+1} ... P_m in S_T, if such a node exists.
    2. Find node w, the location of P_{j-1} ... P_1 in S_{T^R}, if such a node exists.
    3. If both v and w exist, let V[vl..vr] and W[wl..wr] be the values of the leaves under v and w, and find the intersection I of V[vl..vr] and W[wl..wr]. If I is not empty, an approximate match occurs at T_{i-j} ... T_{i-j+m-1} for every i in I.
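The query loop can be sketched end to end under simplifying assumptions: sorted suffix lists stand in for the two suffix trees, a binary search stands in for locating a node, and a plain set intersection stands in for the range query. The empty suffix is included as an extra leaf (as a terminator symbol would provide) so that mismatches in the first or last pattern position are found. All function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

SENTINEL = "\U0010FFFF"  # assumed to be larger than any character in the text


def build_index(text: str):
    """Sorted (suffix, label) lists standing in for S_T and S_{T^R}."""
    n = len(text)
    rev = text[::-1]
    # range(n + 1) also adds the empty suffix, i.e. the terminator leaf.
    st = sorted((text[i:], i + 1) for i in range(n + 1))
    st_r = sorted((rev[i:], n - (i + 1) + 3) for i in range(n + 1))
    return st, st_r


def leaf_labels_under(index, query: str) -> set[int]:
    """Labels of all leaves whose suffix starts with query
    (the interval [vl, vr] in the linked list of leaves)."""
    keys = [suffix for suffix, _ in index]
    lo = bisect_left(keys, query)
    hi = bisect_right(keys, query + SENTINEL)
    return {label for _, label in index[lo:hi]}


def one_mismatch_query(text: str, pattern: str) -> list[int]:
    """1-based start positions of pattern in text with at most one mismatch."""
    st, st_r = build_index(text)
    m = len(pattern)
    starts = set()
    for j in range(1, m + 1):
        v_labels = leaf_labels_under(st, pattern[j:])                # P_{j+1}..P_m
        w_labels = leaf_labels_under(st_r, pattern[:j - 1][::-1])    # P_{j-1}..P_1
        for i in v_labels & w_labels:
            starts.add(i - j)  # the match occupies T_{i-j} .. T_{i-j+m-1}
    return sorted(starts)


print(one_mismatch_query("actgacctcagctta", "ctaa"))
```

Because position j is skipped rather than compared, exact occurrences are reported as well; the sketch therefore answers the "at most one mismatch" version of the query.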

  • Example: T = actgacctcagctta, P = ctaa, k = 1

  • Example: T = actgacctcagctta, T^R = attcgactccagtca, P = ctaa. (Figures: suffix tree of T; suffix tree of T^R.)

  • j = 1: v = P_{j+1} ... P_m = taa, w = P_{j-1} ... P_1 = (empty string).

    V[vl..vr] = {}, W[wl..wr] = {3, 12, ..., 14}

    I = {}

  • j = 2: v = P_{j+1} ... P_m = aa, w = P_{j-1} ... P_1 = c.

    V[vl..vr] = {}, W[wl..wr] = {4, 8, 9, 14, 11}

    I = {}

  • j = 3: v = P_{j+1} ... P_m = a, w = P_{j-1} ... P_1 = tc.

    V[vl..vr] = {15, 5, 1, 10}, W[wl..wr] = {5, 10, 15}

    I = {5, 10, 15}

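  • When j = 3, the intersection of V = {15, 5, 1, 10} and W = {5, 10, 15} is I = {5, 10, 15}. Therefore approximate matches occur at T_{i-j} ... T_{i-j+m-1} for all i in I, that is, at T_2 ... T_5 = ctga, T_7 ... T_10 = ctca, and T_12 ... T_15 = ctta, each of which differs from P in its third character only.

    T = actgacctcagctta, P = ctaa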

  • j = 4: v = P_{j+1} ... P_m = (empty string), w = P_{j-1} ... P_1 = atc.

    V[vl..vr] = {15, 5, ..., 13}, W[wl..wr] = {}

    I = {}

  • Range Query Problem: In step 3, given nodes v and w, we want to find the leaf values that appear both in the interval [vl, vr] and in the interval [wl, wr], where the four endpoints of the two intervals are the leftmost and rightmost leaf pointers set in step 3 of the preprocessing. Thus, we are seeking a solution to the range query problem.

  • Problem Definition of Range Query

    Input: Let V = [v1, v2, ..., vn] and W = [w1, w2, ..., wn] be two permutation arrays, where n is the number of elements, and let i, j, k and l be four constants with i + k < n and j + l < n.

    Output: The intersection of the elements of V[i..i+k] and W[j..j+l].

  • Example: V = [8,5,1,4,3,7,6,2], W = [3,6,4,7,2,1,5,8], i = 3, k = 3, j = 2, l = 4

    Output: the intersection of V[v3,v4,v5,v6] and W[w2,w3,w4,w5,w6]
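One way to see why this is a range searching problem on a grid: each value x becomes the point (position of x in V, position of x in W), and the query asks for the points inside an axis-parallel rectangle. The sketch below uses that view with a simple linear scan in place of Overmars' data structure; the function name and the explicit inclusive bounds are assumptions made for illustration.

```python
def range_intersection(V, W, v_lo, v_hi, w_lo, w_hi):
    """Values that occur in V[v_lo..v_hi] and also in W[w_lo..w_hi].

    Positions are 1-based and inclusive. Each value x is treated as the
    grid point (position in V, position in W); the query keeps the points
    inside the rectangle [v_lo, v_hi] x [w_lo, w_hi].
    """
    pos_in_w = {value: idx for idx, value in enumerate(W, start=1)}
    return sorted(
        value
        for idx, value in enumerate(V, start=1)
        if v_lo <= idx <= v_hi and w_lo <= pos_in_w[value] <= w_hi
    )


V = [8, 5, 1, 4, 3, 7, 6, 2]
W = [3, 6, 4, 7, 2, 1, 5, 8]
# V[v3..v6] and W[w2..w6], as in the example output line above.
print(range_intersection(V, W, 3, 6, 2, 6))
```

The scan costs time linear in n per query; a grid range searching structure replaces it when many queries are asked against the same pair of arrays.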

  • Preprocessing (figure): the two arrays, each indexed 1 to 8.

    V = 8 5 1 4 3 7 6 2

    W = 3 6 4 7 2 1 5 8

    The intersection of V[v3,v4,v5,v6] and W[w2,w3,w4,w5,w6] is {1,4,7}.

  • Time Complexity of Range Query Problem: By using Overmars' algorithm [O88], the range query problem can be solved with preprocessing and query time bounds in which the query time depends on k, the number of points in the range.

    [O88] Overmars, M. H., Efficient data structures for range searching on a grid, J. Algorithms 9, 1988, pp. 254-275.

  • Time ComplexityFor the indexing problem, the preprocessing time is and the query can be implemented in , where tocc is the number of occurrences of the pattern in the text with one error.

  • The Dictionary Matching Problem

  • Problem Definition: The Dictionary Matching Problem

    Input:
    1. A dictionary P = {p1, ..., ps}, where pi, i = 1, ..., s, are patterns over alphabet Σ, and d is the sum of the lengths of all the dictionary patterns.
    2. A text T of length n over alphabet Σ.
    3. An integer k.

    Output: All occurrences of any dictionary pattern in T with at most k mismatches.

  • Main idea: In this algorithm, we construct a suffix tree and a prefix tree (the suffix tree of the reversed string) for D, the concatenation of all patterns in the dictionary. For each text position j = 1, 2, ..., n, we look up the text prefix t_1 ... t_{j-1}, read backwards, in the prefix tree and the text suffix t_{j+1} ... t_n in the suffix tree. If a prefix and a suffix of the same pattern are found on either side of position j, an approximate match with one error occurs.

  • Preprocessing 1. Construct a suffix tree S_D of the string D and a suffix tree S_{D^R} of the string D^R, where D is the concatenation of all dictionary patterns, with a separator at the end of each pattern, and where D^R is the reversal of string D.

  • Example: P = {tca, gctga, gca}, D = tca$gctga$gca$, D^R = acg$agtcg$act$. (Figures: suffix tree of D (S_D); suffix tree of D^R (S_{D^R}).)
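The two strings in this example can be produced as follows. Note that the D^R shown on the slide keeps the separator at the end of each reversed pattern, so it equals the literal reversal of D with the leading separator moved to the end; the sketch follows the slide. The function name is hypothetical.

```python
def concatenate_dictionary(patterns: list[str], sep: str = "$") -> tuple[str, str]:
    """Return (D, D^R) for a list of patterns.

    D is every pattern followed by the separator. D^R is every pattern
    reversed, taken in reverse order, each again followed by the separator.
    """
    d = "".join(p + sep for p in patterns)
    d_rev = "".join(p[::-1] + sep for p in reversed(patterns))
    return d, d_rev


d, d_rev = concatenate_dictionary(["tca", "gctga", "gca"])
print(d)
print(d_rev)
```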

  • Preprocessing 2. Modify suffix trees S_D and S_{D^R}, respectively, as follows. For each separator which is tree-first but not edge-first, i.e., it appears on an edge (u,v) labeled σ$τ, where σ is not empty, break (u,v) into (u,w) and (w,v). Label (u,w) with σ and (w,v) with $τ.

  • Preprocessing 3. Scan suffix tree S_D, respectively S_{D^R}, and modify it as follows. For each vertex v consider the associated string L(v), i.e., the string spelled on the path from the root to v. Label v with all the locations of the pattern suffixes, resp. prefixes, that are equal to L(v). To implement this, note that all the relevant suffixes share the prefix L(v)$. So, go to the edge (v,w) whose label begins with $, if such an edge exists, and scan the subtree rooted at w to find all relevant suffixes.

  • Query Processing: For j = 1, ..., n do:
    1. Find node v, the location of the longest prefix of t_{j+1} ... t_n in S_D.
    2. Find node w, the location of the longest prefix of t_{j-1} ... t_1 in S_{D^R}.
    3. Find the intersection of the markings of the nodes on the path from the root to v in S_D and on the path from the root to w in S_{D^R}.
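The effect of this loop can be illustrated with a naive stand-in that tries every text position j as the mismatch and every pattern position q aligned with it, comparing the two sides directly instead of walking S_D and S_{D^R}. This is not the paper's data structure; the function name is made up, and the sketch reports exactly-one-mismatch occurrences as (start, pattern, j) triples with 1-based positions.

```python
def dictionary_one_mismatch(text: str, patterns: list[str]) -> list[tuple[int, str, int]]:
    """Occurrences of dictionary patterns in text with exactly one mismatch.

    Returns sorted (start, pattern, j) triples, where j is the text position
    of the mismatch. Naive comparison replaces the suffix-tree lookups.
    """
    n = len(text)
    found = set()
    for j in range(1, n + 1):                  # text position of the mismatch
        for p in patterns:
            for q in range(1, len(p) + 1):     # pattern position aligned with j
                start = j - q + 1
                if start < 1 or start + len(p) - 1 > n:
                    continue
                # Pattern prefix must match the text just before position j.
                left_ok = text[start - 1:j - 1] == p[:q - 1]
                # Pattern suffix must match the text just after position j.
                right_ok = text[j:start - 1 + len(p)] == p[q:]
                if left_ok and right_ok and text[j - 1] != p[q - 1]:
                    found.add((start, p, j))
    return sorted(found)


print(dictionary_one_mismatch("acagccga", ["tca", "gctga", "gca"]))
```

The prefix and suffix tests correspond to the two tree lookups in steps 1 and 2; requiring both for the same pattern and alignment corresponds to the intersection of markings in step 3.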

  • Example: T = acagccga, P = {tca, gctga, gca}, k = 1

  • Example: P = {tca, gctga, gca}, D = tca$gctga$gca$, D^R = acg$agtcg$act$, T = acagccga. (Figures: suffix tree of D (S_D); suffix tree of D^R (S_{D^R}).)

  • j = 1: v = t_{j+1} ... t_n = cagccga, w = t_{j-1} ... t_1 = (empty string).

    V = {10, 2}, W = {13, 5, ..., 8}, I = {10, 2}

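  • When j = 1, the intersection of V = {10, 2} and W = {13, 5, ..., 8} is I = {10, 2}. Therefore approximate matches of dictionary patterns occur in T: t_1 t_2 t_3 = aca matches both tca and gca with one mismatch in the first position.

    T = acagccga, P = {tca, gctga, gca}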

  • j = 2: v = t_{j+1} ... t_n = agccga, w = t_{j-1} ... t_1 = a.

    V = {11, 8, 3}, W = {}, I = {}

  • j = 3: v = t_{j+1} ... t_n = gccga, w = t_{j-1} ... t_1 = ca.

    V = {}, W = {}, I = {}

  • j = 4: v = t_{j+1} ... t_n = ccga, w = t_{j-1} ... t_1 = aca.

    V = {}