Text Indexing and Dictionary Matching with One Error

69
1 Text Indexing and Dictionary Matching with One Error Amir, A., KeseIman, D., Landa u, G., M. and etc, Journal of Algorithm, 37, 2000, pp. 309- 325 Adviser: R. C. T. Lee Speaker: C. W. Cheng

description

Text Indexing and Dictionary Matching with One Error. Amir, A., KeseIman, D., Landau, G., M. and etc, Journal of Algorithm, 37, 2000, pp. 309-325. Adviser: R. C. T. Lee Speaker: C. W. Cheng. Problem Definition. The Indexing Problem : - PowerPoint PPT Presentation

Transcript of Text Indexing and Dictionary Matching with One Error

Page 1: Text Indexing and Dictionary Matching with One Error

1

Text Indexing and Dictionary Matching with One ErrorAmir, A., KeseIman, D., Landau, G., M. and etc, Journal of Algorithm, 37, 2000,

pp. 309-325

Adviser: R. C. T. LeeSpeaker: C. W. Cheng

Page 2: Text Indexing and Dictionary Matching with One Error

2

Problem Definition

• The Indexing Problem :– Input : A Text T of length n over alphabet Σ,

a pattern P of length m over alphabet Σ and an integer k.

– Output : All occurrences of P in T with at most k mismatches.

Page 3: Text Indexing and Dictionary Matching with One Error

3

Main idea

• In this algorithm, we construct suffix tree and prefix tree with text T. We set an integer j, j=1,2…m. Then we find the prefix P1,j-1 in prefix tree and the suffix Pj+1,m in suffix tree. If both of them exist, an approximation string matching with one error occurs.

Page 4: Text Indexing and Dictionary Matching with One Error

4

Processing

• 1.Construct a suffix tree ST of the text string T and suffix tree ST

R of the string TR is the reversed text TR = tn … t1.

Page 5: Text Indexing and Dictionary Matching with One Error

5

Ex : T=AGCAGAT

TR=TAGACGA

Page 6: Text Indexing and Dictionary Matching with One Error

6

g

a

t$

tagacga$

g

at

$

cagat$

t$

cagat$

at$

cagat$

a g

ga$

acga$

cga$

$cga$

gacga$

Ex : T=AGCAGAT

TR=TAGACGA

Page 7: Text Indexing and Dictionary Matching with One Error

7

Processing

• 2. For each of the suffix trees, link all leaves of the suffix tree in a left-to-right order.

Page 8: Text Indexing and Dictionary Matching with One Error

8

g

a

t$

tagacga$

g

at

$

cagat$

t$

cagat$

at$

cagat$

a g

ga$

acga$

cga$

$cga$

gacga$

Ex : T=AGCAGAT

TR=TAGACGA

Page 9: Text Indexing and Dictionary Matching with One Error

9

Processing

• 3. For each of the suffix trees, set pointers from each tree node v to its left most leaf vl and rightmost leave vr in the linked list.

Page 10: Text Indexing and Dictionary Matching with One Error

10

g

a

t$

tagacga$

g

at

$

cagat$

t$

cagat$

at$

cagat$

a g

ga$

acga$

cga$

$cga$

gacga$

Ex : T=AGCAGAT

TR=TAGACGA

Page 11: Text Indexing and Dictionary Matching with One Error

11

Processing

• 4. Designate each leaf in ST by the starting location of its suffix. Designate each leaf in ST

R by n – i + 3, where i is the starting position of the leaf’s suffix in TR.

Page 12: Text Indexing and Dictionary Matching with One Error

12

4 1 6 3 5 2 7 3 6 8 5 4 7 9

g

a

t$

tagacga$

g

at

$

cagat$

t$

cagat$

at$

cagat$

a g

ga$

acga$

cga$

$cga$

gacga$

Ex : T=AGCAGAT

TR=TAGACGA

Page 13: Text Indexing and Dictionary Matching with One Error

13

Query Processing

• For j = 1, …., m do– 1. Find node v, the location of Pj+1 … Pm in ST, if such a

node exists.– 2. Find node w, the location of Pj-1 .. P1 in ST

R, if such a node exist.

– 3. If v and w exist, the values of leaves under v and w are V[vl….vr] and W[wl…wr], to find the intersections I of V[vl….vr] and W[wl…wr]. If the intersections exist, the approximate string matching occurs on Ti-3…Ti-3+m, for all i I.

Page 14: Text Indexing and Dictionary Matching with One Error

14

ExampleEx : T=actgacctcagctta P=ctga k=1

Page 15: Text Indexing and Dictionary Matching with One Error

1515 5

a

c

g tc

1 10 9 6 7 2 12 4 11 14 8 3 13

tgacctcagctta$

ctcagctta$

$gctta$

ag

ct

ta

$

ctcagctta$

cagctta$

gacctcagctta$

ta$

a$

acctcagctta$

ctta$

cagctta$

gacctcagctta$

ta$

Ex : T=actgacctcagctta TR=attcgactccagtca P=ctaa

Suffix Tree of T

Page 16: Text Indexing and Dictionary Matching with One Error

163 12

a g tc

7 17 4 8 9 14 11 13 6 5 10 15 14

ttcgactccagtca$

a$

ctccagtca$

gtca$

a

$gtca$

cagtca$

gactccagtca$

tccagtca$

actccagtca$

tca$

c

a$

cagtca$

gactccagtca$

tcgactccagtca$

Ex : T=actgacctcagctta TR=attcgactccagtca P=ctaa

Suffix Tree of TR

Page 17: Text Indexing and Dictionary Matching with One Error

1715 5

a

c

g tc

1 10 9 6 7 2 12 4 11 14 8 3 13

tgacctcagctta$

ctcagctta$

$gctta$

ag

ct

ta

$

ctcagctta$

cagctta$

gacctcagctta$

ta$

a$

acctcagctta$

ctta$

cagctta$

gacctcagctta$

ta$

Ex : T=actgacctcagctta TR=attcgactccagtca P=ctaa

Suffix Tree of T

j=1v=Pj+1…Pm=taaw=Pj-1…P1=ε

V[vl….vr]={ε}

Page 18: Text Indexing and Dictionary Matching with One Error

183 12

a g tc

7 17 4 8 9 14 11 13 6 5 10 15 14

ttcgactccagtca$

a$

ctccagtca$

gtca$

a

$gtca$

cagtca$

gactccagtca$

tccagtca$

actccagtca$

tca$

c

a$

cagtca$

gactccagtca$

tcgactccagtca$

Suffix Tree of TREx : T=actgacctcagctta TR=attcgactccagtca P=ctaa

j=1v=Pj+1…Pm=taaw=Pj-1…P1=ε

V[vl….vr]={ε}W[vl….vr]={3,12,…,14}

I={ε}

Page 19: Text Indexing and Dictionary Matching with One Error

1915 5

a

c

g tc

1 10 9 6 7 2 12 4 11 14 8 3 13

tgacctcagctta$

ctcagctta$

$gctta$

ag

ct

ta

$

ctcagctta$

cagctta$

gacctcagctta$

ta$

a$

acctcagctta$

ctta$

cagctta$

gacctcagctta$

ta$

Suffix Tree of TEx : T=actgacctcagctta TR=attcgactccagtca P=ctaa

j=2v=Pj+1…Pm=aaw=Pj-1…P1=c

V[vl….vr]={ε}

Page 20: Text Indexing and Dictionary Matching with One Error

203 12

a g tc

7 17 4 8 9 14 11 13 6 5 10 15 14

ttcgactccagtca$

a$

ctccagtca$

gtca$

a

$gtca$

cagtca$

gactccagtca$

tccagtca$

actccagtca$

tca$

c

a$

cagtca$

gactccagtca$

tcgactccagtca$

Suffix Tree of TREx : T=actgacctcagctta TR=attcgactccagtca P=ctaa

j=2v=Pj+1…Pm=aaw=Pj-1…P1=c

V[vl….vr]={ε}W[vl….vr]={4,8,9,14,11}

I={ε}

Page 21: Text Indexing and Dictionary Matching with One Error

2115 5

a

c

g tc

1 10 9 6 7 2 12 4 11 14 8 3 13

tgacctcagctta$

ctcagctta$

$gctta$

ag

ct

ta

$

ctcagctta$

cagctta$

gacctcagctta$

ta$

a$

acctcagctta$

ctta$

cagctta$

gacctcagctta$

ta$

Suffix Tree of TEx : T=actgacctcagctta TR=attcgactccagtca P=ctaa

j=3v=Pj+1…Pm=aw=Pj-1…P1=tc

V[vl….vr]={15,5,1,10}

Page 22: Text Indexing and Dictionary Matching with One Error

223 12

a g tc

7 17 4 8 9 14 11 13 6 5 10 15 14

ttcgactccagtca$

a$

ctccagtca$

gtca$

a

$gtca$

cagtca$

gactccagtca$

tccagtca$

actccagtca$

tca$

c

a$

cagtca$

gactccagtca$

tcgactccagtca$

Suffix Tree of TREx : T=actgacctcagctta TR=attcgactccagtca P=ctaa

j=3v=Pj+1…Pm=aw=Pj-1…P1=tc

V[vl….vr]={15,5,1,10}W[vl….vr]={5,10,15}

I={15,5,10}

Page 23: Text Indexing and Dictionary Matching with One Error

23

When j=3, the intersection of V[15,5,1,10] and W[5,10,15] is I={5,10,15}. Therefore approximate string matching occurs on Ti-j…Ti-j+m, for all i I.

T2…T6 ,T7…T11 ,T12…T15 。

T=actgacctcagcttaP=ctaa

Page 24: Text Indexing and Dictionary Matching with One Error

2415 5

a

c

g tc

1 10 9 6 7 2 12 4 11 14 8 3 13

tgacctcagctta$

ctcagctta$

$gctta$

ag

ct

ta

$

ctcagctta$

cagctta$

gacctcagctta$

ta$

a$

acctcagctta$

ctta$

cagctta$

gacctcagctta$

ta$

Suffix Tree of TEx : T=actgacctcagctta TR=attcgactccagtca P=ctaa

j=4v=Pj+1…Pm=εw=Pj-1…P1=atc

V[vl….vr]={15,5,…,13}

Page 25: Text Indexing and Dictionary Matching with One Error

253 12

a g tc

7 17 4 8 9 14 11 13 6 5 10 15 14

ttcgactccagtca$

a$

ctccagtca$

gtca$

a

$gtca$

cagtca$

gactccagtca$

tccagtca$

actccagtca$

tca$

c

a$

cagtca$

gactccagtca$

tcgactccagtca$

Suffix Tree of TREx : T=actgacctcagctta TR=attcgactccagtca P=ctaa

j=3v=Pj+1…Pm=εw=Pj-1…P1=atc

V[vl….vr]={15,5,…,13}W[vl….vr]={ε}

I={ε}

Page 26: Text Indexing and Dictionary Matching with One Error

26

Range Query Problem

• In step 3, given nodes v and w, we want to find the leaves that appear both in interval [vl … vr] and in the interval [wl … wr], where the four end points of the two intervals are defined in step P.3 of the preprocessing. Thus, we are seeking a solution to the range query problem.

Page 27: Text Indexing and Dictionary Matching with One Error

27

Problem Definition of Range Query

• Input : Let V=[v1,v2 … vn] and W=[w1,w2 … wn] be two permutation arrays, where n is the number of elements. Four constants i,j,k and l, where both i+k < n and j+l < n.

• Output : Find the intersection of elements of V[i … i+k] and W[j … j+l].

Page 28: Text Indexing and Dictionary Matching with One Error

28

Example :V=[8,5,1,4,3,7,6,2]W=[3,6,4,7,2,1,5,8]i=3,k=4j=2,l=5

Output: the intersection of V[v3,v4,v5,v6] and W[w2,w3,w4,w5,w6]

Page 29: Text Indexing and Dictionary Matching with One Error

29

Preprocessing

V=

W=

1 2 3 4 5 6 7 8

2 3 4 5 6 7 8

8 5 1 4 3 7 6 2

3 6 4 7 2 1 5 8

1 2 3 4 5 6 7 8

1 2 3 4 5 6 7 8

Page 30: Text Indexing and Dictionary Matching with One Error

30

Preprocessing

V=

W=

1 2 3 4 5 6 7 8

2 3 4 5 6 7 8

8 5 1 4 3 7 6 2

3 6 4 7 2 1 5 8

1 2 3 4 5 6 7 8

1 2 3 4 5 6 7 8

8

5

1

4

3

6

7

2

Page 31: Text Indexing and Dictionary Matching with One Error

31

Preprocessing

V=

W=

1 2 3 4 5 6 7 8

2 3 4 5 6 7 8

8 5 1 4 3 7 6 2

3 6 4 7 2 1 5 8

1 2 3 4 5 6 7 8

1 2 3 4 5 6 7 8

8

5

1

4

3

7

6

2

The intersection of V[v3,v4,v5,v6] and W[w2,w3,w4,w5,w6] is {1,4,7}.

Page 32: Text Indexing and Dictionary Matching with One Error

32

Time Complexity of Range Query Problem

• By using Overmars’ algorithm, the range query problem can be solved with preprocessing time and , where k is the number of points in the range.

[O88] Overmars, M. H., Efficient data structures for range searching on a grid, J. Algorithms 9, 1988,pp. 254-275.

)log( nkO )log( nnO

Page 33: Text Indexing and Dictionary Matching with One Error

33

Time Complexity

• For the indexing problem, the preprocessing time is and the query can be implemented in , where tocc is the number of occurrences of the pattern in the text with one error.

)log( nnO)log( nmtoccO

Page 34: Text Indexing and Dictionary Matching with One Error

34

The Dictionary Matching Problem

Page 35: Text Indexing and Dictionary Matching with One Error

35

Problem Definition

• The Dictionary Matching Problem– Input :

• 1. A dictionary P = {p1,…., ps}, where pi, i = 1,…., s, are patterns over alphabet Σ, and is the sum of the lengths of all the dictionary patterns.

• 2. A Text T of length n over alphabet Σ.• 3. An integer k.

– Output :• All occurrences of any dictionary patterns in T with a

t most k mismatches.

isi pd 1

Page 36: Text Indexing and Dictionary Matching with One Error

36

Main idea

• In this algorithm, we construct suffix tree and prefix tree with D which is concatenation of all patterns in dictionary. We set an integer j, j=1,2…n. Then we find the prefix T1,j-1 in prefix tree and the suffix Tj+1,m in suffix tree. If both of them exist, an approximation string matching with one error occurs.

Page 37: Text Indexing and Dictionary Matching with One Error

37

Processing

• 1. Construct a suffix tree SD of string D and suffix tree SD

R of the string DR, where D is the concatenation of all dictionary patterns, with a separator at the end of each pattern, and where DR is the reversal of string D.

Page 38: Text Indexing and Dictionary Matching with One Error

38

Example :P={tca,gctga,gca}D=TCA$GCTGA$GCA$DR=ACG$AGTCG$ACT$

Page 39: Text Indexing and Dictionary Matching with One Error

39

a$ g tc

gc

tga$gca$

a$

gctga$gca$

a$

tga$gca$

a$gca$

a$

c

tga$gca$

ca$gctga$gca$

ga$gca$

Example :P={tca,gctga,gca}D=TCA$GCTGA$GCA$DR=ACG$AGTCG$ACT$

Suffix Tree of D (SD)

Page 40: Text Indexing and Dictionary Matching with One Error

40

Example :P={tca,gctga,gca}D=TCA$GCTGA$GCA$DR=ACG$AGTCG$ACT$

a g tc

g$a

gt

cg

$a

ct

$

cg

$$

t$

gtcg$act$

agtcg$act$

act$

act$

agtcg$act$

tcg$act$

cg$act$

$t$

Suffix Tree of DR (SDR)

Page 41: Text Indexing and Dictionary Matching with One Error

41

Processing

• 2. Modify suffix tree SD, and SDR respectivel

y, as follows. For each separator which is treefirst but not edgefirst, i.e., it appears on an edge (u,v) labeled σ$σ”, where σ≠ε, break (u,v) into (u,w) and (w,v). Label (u,v) with σ and (w,v) with $σ’.

Page 42: Text Indexing and Dictionary Matching with One Error

42

a$ g tc

a$

tga$

a$

a$

c

tga$

ca$

ga$

Example :P={tca,gctga,gca}D=TCA$GCTGA$GCA$DR=ACG$AGTCG$ACT$

Suffix Tree of D (SD)

Page 43: Text Indexing and Dictionary Matching with One Error

43

Example :P={tca,gctga,gca}D=TCA$GCTGA$GCA$DR=ACG$AGTCG$ACT$

a g tc

g$

cg

$$

t$

gtcg$

tcg$

cg$$t

$

Suffix Tree of DR (SDR)

Page 44: Text Indexing and Dictionary Matching with One Error

44

Preprocessing

• 3. Scan suffix tree SD, respectively SDR, and modi

fy as follows. For each vertex v consider the associated string L(v), i.e., the string from the root to v. Label v with all the locations of the pattern suffixes, resp. prefixes, that are equal to L(v). To implement this note that all the relevant suffixes share a prefix of L(v)$. So, go to edge (v,w) with label beginning with $, assuming such exists, and scan the subtree rooted at w to find all relevant suffixes.

Page 45: Text Indexing and Dictionary Matching with One Error

4511

a$ g tc

8 3 10 2 5 7 9 4 1 6

a$

tga$

a$

a$

c

tga$

ca$

ga$

Example :P={tca,gctga,gca}D=TCA$GCTGA$GCA$DR=ACG$AGTCG$ACT$

Suffix Tree of D (SD)

Page 46: Text Indexing and Dictionary Matching with One Error

46

Example :P={tca,gctga,gca}D=TCA$GCTGA$GCA$DR=ACG$AGTCG$ACT$

13

a g tc

5 10 7 12 4 6 11 9 3 8

g$

cg

$$

t$

gtcg$

tcg$

cg$$

t$

Suffix Tree of DR (SDR)

Page 47: Text Indexing and Dictionary Matching with One Error

47

Query Processing

• For j = 1,…., n do– 1. Find node v, the location of the longest pref

ix of tj+1 … tn in SD.– 2. Find node w, the location of the longest pre

fix of tj-1 … t1 in SDR.

– 3. Find intersection of markings of nodes on the path from the root to v in SD and on the path from the root to w in SD

R.

Page 48: Text Indexing and Dictionary Matching with One Error

48

Example

T=acagccgaD={tca,gctga,gca}K=1

Page 49: Text Indexing and Dictionary Matching with One Error

4911

a$ g tc

8 3 10 2 5 7 9 4 1 6

a$

tga$

a$

a$

c

tga$

ca$

ga$

Suffix Tree of D (SD)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga

Page 50: Text Indexing and Dictionary Matching with One Error

50

13

a g tc

5 10 7 12 4 6 11 9 3 8

g$

cg

$$

t$

gtcg$

tcg$

cg$$

t$

Suffix Tree of DR (SDR)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga

Page 51: Text Indexing and Dictionary Matching with One Error

5111

a$ g tc

8 3 10 2 5 7 9 4 1 6

a$

tga$

a$

a$

c

tga$

ca$

ga$

Suffix Tree of D (SD)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga

j=1v=Tj+1…Tm=cagccgaw=Tj-1…T1=ε

V={10,2}

Page 52: Text Indexing and Dictionary Matching with One Error

52

13

a g tc

5 10 7 12 4 6 11 9 3 8

g$

cg

$$

t$

gtcg$

tcg$

cg$$t

$

Suffix Tree of DR (SDR)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga

j=1v=Tj+1…Tm=cagccgaw=Tj-1…T1=ε

V={10,2}W={13,5,…, 8}I={10,2}

Page 53: Text Indexing and Dictionary Matching with One Error

53

When j=1, the intersection of V[10,2] and W[13,5,…,8] is I={10,2}. Therefore approximate string matching occurs on T with P…P.

T=acagccgaP={tca,gctga,gca}

Page 54: Text Indexing and Dictionary Matching with One Error

5411

a$ g tc

8 3 10 2 5 7 9 4 1 6

a$

tga$

a$

a$

c

tga$

ca$

ga$

Suffix Tree of D (SD)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga

j=2v=Tj+1…Tm=agccgaw=Tj-1…T1=a

V={11,8,3}

Page 55: Text Indexing and Dictionary Matching with One Error

55

13

a g tc

5 10 7 12 4 6 11 9 3 8

g$

cg

$$

t$

gtcg$

tcg$

cg$$

t$

Suffix Tree of DR (SDR)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga

j=2v=Tj+1…Tm=agccgaw=Tj-1…T1=a

V={11,8,3}W={ε}I={ε}

Page 56: Text Indexing and Dictionary Matching with One Error

5611

a$ g tc

8 3 10 2 5 7 9 4 1 6

a$

tga$

a$

a$

c

tga$

ca$

ga$

Suffix Tree of D (SD)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga

j=3v=Tj+1…Tm=gccgaw=Tj-1…T1=ca

V={ε}

Page 57: Text Indexing and Dictionary Matching with One Error

57

13

a g tc

5 10 7 12 4 6 11 9 3 8

g$

cg

$$

t$

gtcg$

tcg$

cg$$

t$

Suffix Tree of DR (SDR)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga

j=3v=Tj+1…Tm=gccgaw=Tj-1…T1=ca

V={ε}W={ε}I={ε}

Page 58: Text Indexing and Dictionary Matching with One Error

5811

a$ g tc

8 3 10 2 5 7 9 4 1 6

a$

tga$

a$

a$

c

tga$

ca$

ga$

Suffix Tree of D (SD)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga

j=4v=Tj+1…Tm=ccgaw=Tj-1…T1=aca

V={ε}

Page 59: Text Indexing and Dictionary Matching with One Error

59

13

a g tc

5 10 7 12 4 6 11 9 3 8

g$

cg

$$

t$

gtcg$

tcg$

cg$$

t$

Suffix Tree of DR (SDR)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga

j=4v=Tj+1…Tm=ccgaw=Tj-1…T1=aca

V={ε}W={ε}I={ε}

Page 60: Text Indexing and Dictionary Matching with One Error

6011

a$ g tc

8 3 10 2 5 7 9 4 1 6

a$

tga$

a$

a$

c

tga$

ca$

ga$

Suffix Tree of D (SD)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga

j=5v=Tj+1…Tm=cgaw=Tj-1…T1=gaca

V={ε}

Page 61: Text Indexing and Dictionary Matching with One Error

61

13

a g tc

5 10 7 12 4 6 11 9 3 8

g$

cg

$$

t$

gtcg$

tcg$

cg$$

t$

Suffix Tree of DR (SDR)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga

j=5v=Tj+1…Tm=cgaw=Tj-1…T1=gaca

V={ε}W={6,11}I={ε}

Page 62: Text Indexing and Dictionary Matching with One Error

6211

a$ g tc

8 3 10 2 5 7 9 4 1 6

a$

tga$

a$

a$

c

tga$

ca$

ga$

Suffix Tree of D (SD)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga

j=6v=Tj+1…Tm=gaw=Tj-1…T1=cgaca

V={7}

Page 63: Text Indexing and Dictionary Matching with One Error

63

13

a g tc

5 10 7 12 4 6 11 9 3 8

g$

cg

$$

t$

gtcg$

tcg$

cg$$

t$

Suffix Tree of DR (SDR)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga

j=6v=Tj+1…Tm=gaw=Tj-1…T1=cgaca

V={7}W={7,12}I={7}

Page 64: Text Indexing and Dictionary Matching with One Error

64

When j=6, the intersection of V[7] and W[7,12] is I={7}. Therefore approximate string matching occurs on T with P.

T=acagccgaP={tca,gctga,gca}

Page 65: Text Indexing and Dictionary Matching with One Error

6511

a$ g tc

8 3 10 2 5 7 9 4 1 6

a$

tga$

a$

a$

c

tga$

ca$

ga$

Suffix Tree of D (SD)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga

j=7v=Tj+1…Tm=εw=Tj-1…T1=ccgaca

V={11,8,…,6}

Page 66: Text Indexing and Dictionary Matching with One Error

66

13

a g tc

5 10 7 12 4 6 11 9 3 8

g$

cg

$$

t$

gtcg$

tcg$

cg$$

t$

Suffix Tree of DR (SDR)Example :P={tca,gctga,gca}D=tca$gctga$gca$DR=acg$agtcg$act$T=acagccga

j=1v=Tj+1…Tm=εw=Tj-1…T1=ccgaca

V={11,8,…,6}W={ε}I={ε}

Page 67: Text Indexing and Dictionary Matching with One Error

67

Time Complexity

• For the indexing problem, the preprocessing time is and the query can be implemented in , where tocc is the number of occurrences of the dictionary in the text with one error.

)log( ddO)log( dntoccO

Page 68: Text Indexing and Dictionary Matching with One Error

68

References[AC75] Efficient string matching, A. V. Aho and M. J. Corasick, Comm. Assoc. Comput. Mach. 18, No.

6, 1975, pp.333-340.[AF91] Adaptive dictionary matching, A. Amir and M. Farach, Proc. 32nd IEEE FOCS, 1991, pp.760-7

66.[AFI95] A. Amir, M. Farach, R. M. Idury and etc, Improved dynamic dictionary matching , Inform. And

Comput. 119, 1995, pp.258-282.[AKL99] A. Amir, D. Keselman, G. M. Landau and etc, Indexing and dictionary matching with one erro

r, Proc. 1999 Workshop on Algorithms and Data Structures (WADS), 1999, pp.181-192.[AG97] Pattern Matching Algorithms, Oxford University Press, 1997.[BM77] A fast string searching algorithm, Comm. Assoc. Comput. Mach. 20, 1977, pp.762-772.[BG96] G. S. Brodal and L. Gasieniec, Approximate Dictionary queries, Proc. 7th Annual Symposium

on Combinatorial Pattern Matching (CPM96), 1996, pp.65-74.[O88] M. H. Overmars, Efficient data structures for range searching on a grid, J. Algorithms 9, 1988,p

p. 254-275.

Page 69: Text Indexing and Dictionary Matching with One Error

69

Thanks for your listening