Text Indexing and Dictionary Matching with One Error
description
Transcript of Text Indexing and Dictionary Matching with One Error
1
Text Indexing and Dictionary Matching with One ErrorAmir, A., KeseIman, D., Landau, G., M. and etc, Journal of Algorithm, 37, 2000,
pp. 309-325
Adviser: R. C. T. LeeSpeaker: C. W. Cheng
2
Problem Definition
• The Indexing Problem :– Input : A Text T of length n over alphabet Σ,
a pattern P of length m over alphabet Σ and an integer k.
– Output : All occurrences of P in T with at most k mismatches.
3
Main idea
• In this algorithm, we construct suffix tree and prefix tree with text T. We set an integer j, j=1,2…m. Then we find the prefix P1,j-1 in prefix tree and the suffix Pj+1,m in suffix tree. If both of them exist, an approximation string matching with one error occurs.
4
Processing
• 1.Construct a suffix tree ST of the text string T and suffix tree ST
R of the string TR is the reversed text TR = tn … t1.
5
Ex : T=AGCAGAT
TR=TAGACGA
6
g
a
t$
tagacga$
g
at
$
cagat$
t$
cagat$
at$
cagat$
a g
ga$
acga$
cga$
$cga$
gacga$
Ex : T=AGCAGAT
TR=TAGACGA
7
Processing
• 2. For each of the suffix trees, link all leaves of the suffix tree in a left-to-right order.
8
g
a
t$
tagacga$
g
at
$
cagat$
t$
cagat$
at$
cagat$
a g
ga$
acga$
cga$
$cga$
gacga$
Ex : T=AGCAGAT
TR=TAGACGA
9
Processing
• 3. For each of the suffix trees, set pointers from each tree node v to its left most leaf vl and rightmost leave vr in the linked list.
10
g
a
t$
tagacga$
g
at
$
cagat$
t$
cagat$
at$
cagat$
a g
ga$
acga$
cga$
$cga$
gacga$
Ex : T=AGCAGAT
TR=TAGACGA
11
Processing
• 4. Designate each leaf in ST by the starting location of its suffix. Designate each leaf in ST
R by n – i + 3, where i is the starting position of the leaf’s suffix in TR.
12
4 1 6 3 5 2 7 3 6 8 5 4 7 9
g
a
t$
tagacga$
g
at
$
cagat$
t$
cagat$
at$
cagat$
a g
ga$
acga$
cga$
$cga$
gacga$
Ex : T=AGCAGAT
TR=TAGACGA
13
Query Processing
• For j = 1, …., m do– 1. Find node v, the location of Pj+1 … Pm in ST, if such a
node exists.– 2. Find node w, the location of Pj-1 .. P1 in ST
R, if such a node exist.
– 3. If v and w exist, the values of leaves under v and w are V[vl….vr] and W[wl…wr], to find the intersections I of V[vl….vr] and W[wl…wr]. If the intersections exist, the approximate string matching occurs on Ti-3…Ti-3+m, for all i I.
14
ExampleEx : T=actgacctcagctta P=ctga k=1
1515 5
a
c
g tc
1 10 9 6 7 2 12 4 11 14 8 3 13
tgacctcagctta$
ctcagctta$
$gctta$
ag
ct
ta
$
ctcagctta$
cagctta$
gacctcagctta$
ta$
a$
acctcagctta$
ctta$
cagctta$
gacctcagctta$
ta$
Ex : T=actgacctcagctta TR=attcgactccagtca P=ctaa
Suffix Tree of T
163 12
a g tc
7 17 4 8 9 14 11 13 6 5 10 15 14
ttcgactccagtca$
a$
ctccagtca$
gtca$
a
$gtca$
cagtca$
gactccagtca$
tccagtca$
actccagtca$
tca$
c
a$
cagtca$
gactccagtca$
tcgactccagtca$
Ex : T=actgacctcagctta TR=attcgactccagtca P=ctaa
Suffix Tree of TR
1715 5
a
c
g tc
1 10 9 6 7 2 12 4 11 14 8 3 13
tgacctcagctta$
ctcagctta$
$gctta$
ag
ct
ta
$
ctcagctta$
cagctta$
gacctcagctta$
ta$
a$
acctcagctta$
ctta$
cagctta$
gacctcagctta$
ta$
Ex : T=actgacctcagctta TR=attcgactccagtca P=ctaa
Suffix Tree of T
j=1v=Pj+1…Pm=taaw=Pj-1…P1=ε
V[vl….vr]={ε}
183 12
a g tc
7 17 4 8 9 14 11 13 6 5 10 15 14
ttcgactccagtca$
a$
ctccagtca$
gtca$
a
$gtca$
cagtca$
gactccagtca$
tccagtca$
actccagtca$
tca$
c
a$
cagtca$
gactccagtca$
tcgactccagtca$
Suffix Tree of TREx : T=actgacctcagctta TR=attcgactccagtca P=ctaa
j=1v=Pj+1…Pm=taaw=Pj-1…P1=ε
V[vl….vr]={ε}W[vl….vr]={3,12,…,14}
I={ε}
1915 5
a
c
g tc
1 10 9 6 7 2 12 4 11 14 8 3 13
tgacctcagctta$
ctcagctta$
$gctta$
ag
ct
ta
$
ctcagctta$
cagctta$
gacctcagctta$
ta$
a$
acctcagctta$
ctta$
cagctta$
gacctcagctta$
ta$
Suffix Tree of TEx : T=actgacctcagctta TR=attcgactccagtca P=ctaa
j=2v=Pj+1…Pm=aaw=Pj-1…P1=c
V[vl….vr]={ε}
203 12
a g tc
7 17 4 8 9 14 11 13 6 5 10 15 14
ttcgactccagtca$
a$
ctccagtca$
gtca$
a
$gtca$
cagtca$
gactccagtca$
tccagtca$
actccagtca$
tca$
c
a$
cagtca$
gactccagtca$
tcgactccagtca$
Suffix Tree of TREx : T=actgacctcagctta TR=attcgactccagtca P=ctaa
j=2v=Pj+1…Pm=aaw=Pj-1…P1=c
V[vl….vr]={ε}W[vl….vr]={4,8,9,14,11}
I={ε}
2115 5
a
c
g tc
1 10 9 6 7 2 12 4 11 14 8 3 13
tgacctcagctta$
ctcagctta$
$gctta$
ag
ct
ta
$
ctcagctta$
cagctta$
gacctcagctta$
ta$
a$
acctcagctta$
ctta$
cagctta$
gacctcagctta$
ta$
Suffix Tree of TEx : T=actgacctcagctta TR=attcgactccagtca P=ctaa
j=3v=Pj+1…Pm=aw=Pj-1…P1=tc
V[vl….vr]={15,5,1,10}
223 12
a g tc
7 17 4 8 9 14 11 13 6 5 10 15 14
ttcgactccagtca$
a$
ctccagtca$
gtca$
a
$gtca$
cagtca$
gactccagtca$
tccagtca$
actccagtca$
tca$
c
a$
cagtca$
gactccagtca$
tcgactccagtca$
Suffix Tree of TREx : T=actgacctcagctta TR=attcgactccagtca P=ctaa
j=3v=Pj+1…Pm=aw=Pj-1…P1=tc
V[vl….vr]={15,5,1,10}W[vl….vr]={5,10,15}
I={15,5,10}
23
When j=3, the intersection of V[15,5,1,10] and W[5,10,15] is I={5,10,15}. Therefore approximate string matching occurs on Ti-j…Ti-j+m, for all i I.
T2…T6 ,T7…T11 ,T12…T15 。
T=actgacctcagcttaP=ctaa
2415 5
a
c
g tc
1 10 9 6 7 2 12 4 11 14 8 3 13
tgacctcagctta$
ctcagctta$
$gctta$
ag
ct
ta
$
ctcagctta$
cagctta$
gacctcagctta$
ta$
a$
acctcagctta$
ctta$
cagctta$
gacctcagctta$
ta$
Suffix Tree of TEx : T=actgacctcagctta TR=attcgactccagtca P=ctaa
j=4v=Pj+1…Pm=εw=Pj-1…P1=atc
V[vl….vr]={15,5,…,13}
253 12
a g tc
7 17 4 8 9 14 11 13 6 5 10 15 14
ttcgactccagtca$
a$
ctccagtca$
gtca$
a
$gtca$
cagtca$
gactccagtca$
tccagtca$
actccagtca$
tca$
c
a$
cagtca$
gactccagtca$
tcgactccagtca$
Suffix Tree of TREx : T=actgacctcagctta TR=attcgactccagtca P=ctaa
j=3v=Pj+1…Pm=εw=Pj-1…P1=atc
V[vl….vr]={15,5,…,13}W[vl….vr]={ε}
I={ε}
26
Range Query Problem
• In step 3, given nodes v and w, we want to find the leaves that appear both in interval [vl … vr] and in the interval [wl … wr], where the four end points of the two intervals are defined in step P.3 of the preprocessing. Thus, we are seeking a solution to the range query problem.
27
Problem Definition of Range Query
• Input : Let V=[v1,v2 … vn] and W=[w1,w2 … wn] be two permutation arrays, where n is the number of elements. Four constants i,j,k and l, where both i+k < n and j+l < n.
• Output : Find the intersection of elements of V[i … i+k] and W[j … j+l].
28
Example :V=[8,5,1,4,3,7,6,2]W=[3,6,4,7,2,1,5,8]i=3,k=4j=2,l=5
Output: the intersection of V[v3,v4,v5,v6] and W[w2,w3,w4,w5,w6]
29
Preprocessing
V=
W=
1 2 3 4 5 6 7 8
2 3 4 5 6 7 8
8 5 1 4 3 7 6 2
3 6 4 7 2 1 5 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
30
Preprocessing
V=
W=
1 2 3 4 5 6 7 8
2 3 4 5 6 7 8
8 5 1 4 3 7 6 2
3 6 4 7 2 1 5 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
8
5
1
4
3
6
7
2
31
Preprocessing
V=
W=
1 2 3 4 5 6 7 8
2 3 4 5 6 7 8
8 5 1 4 3 7 6 2
3 6 4 7 2 1 5 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
8
5
1
4
3
7
6
2
The intersection of V[v3,v4,v5,v6] and W[w2,w3,w4,w5,w6] is {1,4,7}.
32
Time Complexity of Range Query Problem
• By using Overmars’ algorithm, the range query problem can be solved with preprocessing time and , where k is the number of points in the range.
[O88] Overmars, M. H., Efficient data structures for range searching on a grid, J. Algorithms 9, 1988,pp. 254-275.
)log( nkO )log( nnO
33
Time Complexity
• For the indexing problem, the preprocessing time is and the query can be implemented in , where tocc is the number of occurrences of the pattern in the text with one error.
)log( nnO)log( nmtoccO
34
The Dictionary Matching Problem
35
Problem Definition
• The Dictionary Matching Problem– Input :
• 1. A dictionary P = {p1,…., ps}, where pi, i = 1,…., s, are patterns over alphabet Σ, and is the sum of the lengths of all the dictionary patterns.
• 2. A Text T of length n over alphabet Σ.• 3. An integer k.
– Output :• All occurrences of any dictionary patterns in T with a
t most k mismatches.
isi pd 1
36
Main idea
• In this algorithm, we construct suffix tree and prefix tree with D which is concatenation of all patterns in dictionary. We set an integer j, j=1,2…n. Then we find the prefix T1,j-1 in prefix tree and the suffix Tj+1,m in suffix tree. If both of them exist, an approximation string matching with one error occurs.
37
Processing
• 1. Construct a suffix tree SD of string D and suffix tree SD
R of the string DR, where D is the concatenation of all dictionary patterns, with a separator at the end of each pattern, and where DR is the reversal of string D.
38
Example :P={tca,gctga,gca}D=TCA$GCTGA$GCA$DR=ACG$AGTCG$ACT$
39
a$ g tc
gc
tga$gca$
a$
gctga$gca$
a$
tga$gca$
a$gca$
a$
c
tga$gca$
ca$gctga$gca$
ga$gca$
Example :P={tca,gctga,gca}D=TCA$GCTGA$GCA$DR=ACG$AGTCG$ACT$
Suffix Tree of D (SD)
40
Example :P={tca,gctga,gca}D=TCA$GCTGA$GCA$DR=ACG$AGTCG$ACT$
a g tc
g$a
gt
cg
$a
ct
$
cg
$$
t$
gtcg$act$
agtcg$act$
act$
act$
agtcg$act$
tcg$act$
cg$act$
$t$
Suffix Tree of DR (SDR)
41
Processing
• 2. Modify suffix tree SD, and SDR respectivel
y, as follows. For each separator which is treefirst but not edgefirst, i.e., it appears on an edge (u,v) labeled σ$σ”, where σ≠ε, break (u,v) into (u,w) and (w,v). Label (u,v) with σ and (w,v) with $σ’.
42
a$ g tc
a$
tga$
a$
a$
c
tga$
ca$
ga$
Example :P={tca,gctga,gca}D=TCA$GCTGA$GCA$DR=ACG$AGTCG$ACT$
Suffix Tree of D (SD)
43
Example :P={tca,gctga,gca}D=TCA$GCTGA$GCA$DR=ACG$AGTCG$ACT$
a g tc
g$
cg
$$
t$
gtcg$
tcg$
cg$$t
$
Suffix Tree of DR (SDR)
44
Preprocessing
• 3. Scan suffix tree SD, respectively SDR, and modi
fy as follows. For each vertex v consider the associated string L(v), i.e., the string from the root to v. Label v with all the locations of the pattern suffixes, resp. prefixes, that are equal to L(v). To implement this note that all the relevant suffixes share a prefix of L(v)$. So, go to edge (v,w) with label beginning with $, assuming such exists, and scan the subtree rooted at w to find all relevant suffixes.
4511
a$ g tc
8 3 10 2 5 7 9 4 1 6
a$
tga$
a$
a$
c
tga$
ca$
ga$
Example :P={tca,gctga,gca}D=TCA$GCTGA$GCA$DR=ACG$AGTCG$ACT$
Suffix Tree of D (SD)
46
Example :P={tca,gctga,gca}D=TCA$GCTGA$GCA$DR=ACG$AGTCG$ACT$
13
a g tc
5 10 7 12 4 6 11 9 3 8
g$
cg
$$
t$
gtcg$
tcg$
cg$$
t$
Suffix Tree of DR (SDR)
47
Query Processing
• For j = 1,…., n do– 1. Find node v, the location of the longest pref
ix of tj+1 … tn in SD.– 2. Find node w, the location of the longest pre
fix of tj-1 … t1 in SDR.
– 3. Find intersection of markings of nodes on the path from the root to v in SD and on the path from the root to w in SD
R.
48
Example
T=acagccgaD={tca,gctga,gca}K=1
4911
a$ g tc
8 3 10 2 5 7 9 4 1 6
a$
tga$
a$
a$
c
tga$
ca$
ga$
Suffix Tree of D (SD)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
50
13
a g tc
5 10 7 12 4 6 11 9 3 8
g$
cg
$$
t$
gtcg$
tcg$
cg$$
t$
Suffix Tree of DR (SDR)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
5111
a$ g tc
8 3 10 2 5 7 9 4 1 6
a$
tga$
a$
a$
c
tga$
ca$
ga$
Suffix Tree of D (SD)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=1v=Tj+1…Tm=cagccgaw=Tj-1…T1=ε
V={10,2}
52
13
a g tc
5 10 7 12 4 6 11 9 3 8
g$
cg
$$
t$
gtcg$
tcg$
cg$$t
$
Suffix Tree of DR (SDR)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=1v=Tj+1…Tm=cagccgaw=Tj-1…T1=ε
V={10,2}W={13,5,…, 8}I={10,2}
53
When j=1, the intersection of V[10,2] and W[13,5,…,8] is I={10,2}. Therefore approximate string matching occurs on T with P…P.
T=acagccgaP={tca,gctga,gca}
5411
a$ g tc
8 3 10 2 5 7 9 4 1 6
a$
tga$
a$
a$
c
tga$
ca$
ga$
Suffix Tree of D (SD)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=2v=Tj+1…Tm=agccgaw=Tj-1…T1=a
V={11,8,3}
55
13
a g tc
5 10 7 12 4 6 11 9 3 8
g$
cg
$$
t$
gtcg$
tcg$
cg$$
t$
Suffix Tree of DR (SDR)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=2v=Tj+1…Tm=agccgaw=Tj-1…T1=a
V={11,8,3}W={ε}I={ε}
5611
a$ g tc
8 3 10 2 5 7 9 4 1 6
a$
tga$
a$
a$
c
tga$
ca$
ga$
Suffix Tree of D (SD)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=3v=Tj+1…Tm=gccgaw=Tj-1…T1=ca
V={ε}
57
13
a g tc
5 10 7 12 4 6 11 9 3 8
g$
cg
$$
t$
gtcg$
tcg$
cg$$
t$
Suffix Tree of DR (SDR)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=3v=Tj+1…Tm=gccgaw=Tj-1…T1=ca
V={ε}W={ε}I={ε}
5811
a$ g tc
8 3 10 2 5 7 9 4 1 6
a$
tga$
a$
a$
c
tga$
ca$
ga$
Suffix Tree of D (SD)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=4v=Tj+1…Tm=ccgaw=Tj-1…T1=aca
V={ε}
59
13
a g tc
5 10 7 12 4 6 11 9 3 8
g$
cg
$$
t$
gtcg$
tcg$
cg$$
t$
Suffix Tree of DR (SDR)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=4v=Tj+1…Tm=ccgaw=Tj-1…T1=aca
V={ε}W={ε}I={ε}
6011
a$ g tc
8 3 10 2 5 7 9 4 1 6
a$
tga$
a$
a$
c
tga$
ca$
ga$
Suffix Tree of D (SD)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=5v=Tj+1…Tm=cgaw=Tj-1…T1=gaca
V={ε}
61
13
a g tc
5 10 7 12 4 6 11 9 3 8
g$
cg
$$
t$
gtcg$
tcg$
cg$$
t$
Suffix Tree of DR (SDR)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=5v=Tj+1…Tm=cgaw=Tj-1…T1=gaca
V={ε}W={6,11}I={ε}
6211
a$ g tc
8 3 10 2 5 7 9 4 1 6
a$
tga$
a$
a$
c
tga$
ca$
ga$
Suffix Tree of D (SD)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=6v=Tj+1…Tm=gaw=Tj-1…T1=cgaca
V={7}
63
13
a g tc
5 10 7 12 4 6 11 9 3 8
g$
cg
$$
t$
gtcg$
tcg$
cg$$
t$
Suffix Tree of DR (SDR)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=6v=Tj+1…Tm=gaw=Tj-1…T1=cgaca
V={7}W={7,12}I={7}
64
When j=6, the intersection of V[7] and W[7,12] is I={7}. Therefore approximate string matching occurs on T with P.
T=acagccgaP={tca,gctga,gca}
6511
a$ g tc
8 3 10 2 5 7 9 4 1 6
a$
tga$
a$
a$
c
tga$
ca$
ga$
Suffix Tree of D (SD)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=7v=Tj+1…Tm=εw=Tj-1…T1=ccgaca
V={11,8,…,6}
66
13
a g tc
5 10 7 12 4 6 11 9 3 8
g$
cg
$$
t$
gtcg$
tcg$
cg$$
t$
Suffix Tree of DR (SDR)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga
j=1v=Tj+1…Tm=εw=Tj-1…T1=ccgaca
V={11,8,…,6}W={ε}I={ε}
67
Time Complexity
• For the indexing problem, the preprocessing time is and the query can be implemented in , where tocc is the number of occurrences of the dictionary in the text with one error.
)log( ddO)log( dntoccO
68
References[AC75] Efficient string matching, A. V. Aho and M. J. Corasick, Comm. Assoc. Comput. Mach. 18, No.
6, 1975, pp.333-340.[AF91] Adaptive dictionary matching, A. Amir and M. Farach, Proc. 32nd IEEE FOCS, 1991, pp.760-7
66.[AFI95] A. Amir, M. Farach, R. M. Idury and etc, Improved dynamic dictionary matching , Inform. And
Comput. 119, 1995, pp.258-282.[AKL99] A. Amir, D. Keselman, G. M. Landau and etc, Indexing and dictionary matching with one erro
r, Proc. 1999 Workshop on Algorithms and Data Structures (WADS), 1999, pp.181-192.[AG97] Pattern Matching Algorithms, Oxford University Press, 1997.[BM77] A fast string searching algorithm, Comm. Assoc. Comput. Mach. 20, 1977, pp.762-772.[BG96] G. S. Brodal and L. Gasieniec, Approximate Dictionary queries, Proc. 7th Annual Symposium
on Combinatorial Pattern Matching (CPM96), 1996, pp.65-74.[O88] M. H. Overmars, Efficient data structures for range searching on a grid, J. Algorithms 9, 1988,p
p. 254-275.
69
Thanks for your listening