• date post

02-Jan-2016

Dynamic Text and Static Pattern Matching. Amihood Amir, Gad M. Landau, Moshe Lewenstein, Dina Sokol. Bar-Ilan University.

### Transcript of Dynamic Text and Static Pattern Matching

• Dynamic Text and Static Pattern Matching

Amihood Amir, Gad M. Landau, Moshe Lewenstein, Dina Sokol

Bar-Ilan University

• Classical Pattern Matching

Input:
- Pattern P = p1 p2 . . . pm
- Text T = t1 t2 t3 . . . tn
over alphabet Σ. m is the pattern size; n is the text size.

Output: locations of T where P appears.

• Pattern Matching (e.g.)

Input: P = agca, Σ = {a,g,c,t}

T = aaagcattagctagcagcat

Output: 3, 13, 16
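The example above can be reproduced with a naive scan. This is only a minimal sketch for checking the output by hand; the function name `find_occurrences` is an assumption made for illustration, and positions are 1-indexed as on the slide.

```python
def find_occurrences(pattern, text):
    """Return the 1-indexed start positions of pattern in text (naive O(nm) scan)."""
    m = len(pattern)
    return [i + 1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


print(find_occurrences("agca", "aaagcattagctagcagcat"))  # [3, 13, 16]
```

Note that the occurrences at 13 and 16 overlap, so a matcher that skips past each match would miss one of them.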

• Dynamic Pattern Matching

A. Static Text and Dynamic Pattern.

B. Dynamic Text and Dynamic Pattern.

C. Dynamic Text and Static Pattern.

• Dynamic Pattern Matching

A. Static Text and Dynamic Pattern, a.k.a. the indexing problem.

Solution: preprocess the text and answer pattern queries.
Preprocessing data structure: suffix trees [Wei73, McC75, Ukk95, Far97].
Time: O(n) preprocessing, O(m) query time.

• Dynamic Pattern Matching

A. Static Text and Dynamic Pattern. Time: O(n) preprocessing, O(m) query time.

B. Dynamic Text and Dynamic Pattern, a.k.a. the dynamic indexing problem.

Solution: sophisticated data structures [SV96, ABR00].
Time: query O(m + log² n), change O(log² n).

• Dynamic Pattern Matching

A. Static Text and Dynamic Pattern. Time: O(n) preprocessing, O(m) query time.

B. Dynamic Text and Dynamic Pattern. Time: query O(m + log² n), change O(log² n).

C. Dynamic Text and Static Pattern?

• Dynamic Text and Static Pattern Matching

The pattern is non-changing. The text changes over time.

Goal: report new occurrences of the pattern without performing a new search.

• Motivation

Intrusion detection systems

FAX: run-length compressed text, e.g. a14 a4 b2 c3 d5 c8 a6

Two-dimensional run-length compressed matching problem [ALS03]

• Problem Definition

Input: T and P over Σ = {1, . . . , m}.

Output:
1. At start: all occurrences of P in T.
2. After each change operation:
a. report all new occurrences of P in T;
b. discard all old occurrences of P in T that no longer match.

Change operation: change one character in the text, e.g. location 5 from a to b.

• Example

Input: P = agagagc = (ag)³c, Σ = {a,g,c,t}

T = g a g a g c t a g c g a g c a t

Change: location 10 from c to a, giving T = g a g a g c t a g a g a g c a t

Output: {8}
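As a baseline for the problem definition, here is a minimal sketch that handles one replacement by rechecking every window that contains the changed location. It is not the algorithm of the talk (it costs O(m²) per change in this naive form); the function name and the set-based bookkeeping are assumptions made for illustration.

```python
def replace_and_report(text, pattern, occurrences, pos, ch):
    """Set text[pos] (1-indexed) to ch, update occurrences in place, return new matches."""
    m, n = len(pattern), len(text)
    text[pos - 1] = ch
    new = set()
    # Only windows that contain pos can gain or lose a match.
    for s in range(max(1, pos - m + 1), min(pos, n - m + 1) + 1):
        if "".join(text[s - 1:s - 1 + m]) == pattern:
            if s not in occurrences:
                new.add(s)
        else:
            occurrences.discard(s)
    occurrences |= new
    return new


text = list("gagagctagcgagcat")
occurrences = set()  # P = agagagc does not occur in the initial text
print(replace_and_report(text, "agagagc", occurrences, 10, "a"))  # {8}
```

The rest of the talk is about avoiding exactly this re-scan.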

• Results

After O(n log log m + m √(log m)) preprocessing time,

O(log log m) time per replacement.

• Dynamic Pattern Matching

A. Static Text and Dynamic Pattern. Time: O(n) preprocessing, O(m) query time.

B. Dynamic Text and Dynamic Pattern. Time: query O(m + log² n), change O(log² n).

C. Dynamic Text and Static Pattern. Time: change and announce O(log log m).

• Static Stage

To initially find all occurrences of P in T, use KMP [Knuth-Morris-Pratt 77].

All pattern occurrences in a text of length 2m can be stored in O(1) space.
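For reference, the static stage can be sketched with a textbook KMP matcher. This is a standard formulation rather than code from the talk; the names `failure_function` and `kmp_search` are assumptions, the pattern is assumed non-empty, and positions are reported 1-indexed.

```python
def failure_function(p):
    """fail[k] = length of the longest proper border of the prefix p[:k]."""
    fail = [0] * (len(p) + 1)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k]
        if p[i] == p[k]:
            k += 1
        fail[i + 1] = k
    return fail


def kmp_search(p, t):
    """Return the 1-indexed start positions of p in t in O(n + m) time."""
    fail = failure_function(p)
    out, k = [], 0
    for i, c in enumerate(t):
        while k > 0 and c != p[k]:
            k = fail[k]
        if c == p[k]:
            k += 1
        if k == len(p):
            out.append(i - len(p) + 2)  # convert 0-indexed end to 1-indexed start
            k = fail[k]                 # keep going so overlapping matches are found
    return out


print(kmp_search("agca", "aaagcattagctagcagcat"))  # [3, 13, 16]
```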

• Succinct Output

Assumption: the text is of size 2m. (Break the text T into overlapping strings of length 2m-1.)

[Figure: text T with positions 1, m, 2m, 3m, 4m marked, and pattern P]

• Succinct Output (cont.)

P is periodic: a string P is periodic if it matches itself before position |P|/2, e.g. P = abcabcabca (it matches itself when shifted by 3 positions).

Store the output as a chain of pattern occurrences.

P is non-periodic: by definition, no more than two occurrences.
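One way to picture the chain is as a run of occurrences spaced exactly one period apart, stored as a (first position, count) pair. The sketch below is an illustration under that assumption, not the authors' representation; `period`, `find_occurrences`, and `compress` are hypothetical helper names, and the occurrences are found naively.

```python
def period(p):
    """Smallest d >= 1 such that p[i] == p[i + d] wherever both are defined."""
    return next(d for d in range(1, len(p) + 1) if p[d:] == p[:len(p) - d])


def find_occurrences(pattern, text):
    """Naive 1-indexed occurrence list, used only to feed the example."""
    m = len(pattern)
    return [i + 1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


def compress(occurrences, step):
    """Pack sorted positions into runs (first, count) whose members are step apart."""
    runs = []
    for pos in occurrences:
        if runs and pos == runs[-1][0] + runs[-1][1] * step:
            runs[-1][1] += 1
        else:
            runs.append([pos, 1])
    return [tuple(r) for r in runs]


p = "abcabcabca"          # periodic: period 3 <= |P|/2
t = "abc" * 6 + "a"       # a window of length 2m - 1 = 19
print(period(p))                                    # 3
print(compress(find_occurrences(p, t), period(p)))  # [(1, 4)]
```

Here the four occurrences 1, 4, 7, 10 collapse into a single pair, which is what makes truncating the chain after a replacement cheap.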

• On-line Algorithm

Following each replacement:

Delete old matches that are no longer pattern occurrences.

Find new matches.

• Delete Old Matches

Deleting is trivial since we store the matches in constant space:

P is periodic: truncate the chain of pattern occurrences.

P is non-periodic: discard all matches that are within distance m of the replacement.

• Find New Matches

Challenge: how can we locate occurrences of P, following each replacement, without actually searching for P?

• Main Idea - Text Covers

We cover the text with substrings of the pattern, i.e. store the text in terms of P.

Pattern = a g a g a g c (positions 1 2 3 4 5 6 7)

Text = g a g a g c t a g c g a g c a t

Cover: g a g a g c [2,7], t, a g c [5,7], g a g c [4,7], a [1,1], t

• Text Cover (cont.)

The text cover must satisfy two properties:

Substring Property: each element of the cover is a substring of P, or a character not included in P.

Maximality Property: no two adjacent elements can concatenate to form a substring of P.
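A cover with both properties can be built greedily, left to right, by always taking the longest piece that is still a substring of P. The sketch below is a simple illustration using Python's substring test, not the construction used in the talk; `build_cover` and `as_intervals` are hypothetical names, and pieces are reported as 1-indexed intervals of P (or as the bare character when it does not occur in P).

```python
def build_cover(text, pattern):
    """Greedy cover: each piece is the longest substring of pattern starting there."""
    cover, i = [], 0
    while i < len(text):
        j = i + 1
        # Extend the piece while text[i:j+1] is still a substring of the pattern.
        while j < len(text) and text[i:j + 1] in pattern:
            j += 1
        cover.append(text[i:j])
        i = j
    return cover


def as_intervals(cover, pattern):
    """Map each piece to a 1-indexed interval of pattern, or keep a foreign character."""
    out = []
    for piece in cover:
        k = pattern.find(piece)
        out.append((k + 1, k + len(piece)) if k >= 0 else piece)
    return out


cover = build_cover("gagagctagcgagcat", "agagagc")
print(cover)                           # ['gagagc', 't', 'agc', 'gagc', 'a', 't']
print(as_intervals(cover, "agagagc"))  # [(2, 7), 't', (5, 7), (4, 7), (1, 1), 't']
```

Greedy choice gives maximality for free: if two adjacent pieces concatenated to a substring of P, the first piece could have been extended by at least one more character.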

• Text Cover (cont.)

How does a replacement in the text affect the text cover?

Initially, in the static stage, we construct a text cover for T. We ensure that the cover satisfies both the substring and maximality properties.

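• Text Cover Following Replacement

Pattern = a g a g a g c (positions 1 2 3 4 5 6 7)

Text = g a g a g c t a g c g a g c a t, with the c at location 10 replaced by a

Cover before the replacement: (2,7) - (5,7) (4,7) (1,1) -, i.e. g a g a g c, t, a g c, g a g c, a, t

After splitting the affected piece: (2,7) - (5,6) (1,1) (4,7) (1,1) -

After merging (5,6)(1,1) into (1,3), and then (1,3)(4,7) into (1,7): (2,7) - (1,7) (1,1) -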

• Updating the Text Cover

At most 5 pieces can violate the maximality property.

• Substring Concatenation Query

Query: given two substrings of P, P[i,j] and P[k,l], is their concatenation also a substring of P?

Query time: O(log log m).

Preprocessing time: (also uses [BG00])

Hence, in O(log log m) time we can update the cover so that it satisfies both properties.
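The local repair can be sketched as split-then-merge: split the piece that contains the changed location, then merge adjacent pieces around it while their concatenation is a substring of P. In this sketch the concatenation query is a naive O(m) substring test standing in for the O(log log m) structure, the piece is located by a linear scan, and all function names are assumptions made for illustration.

```python
def concat_is_substring(a, b, pattern):
    """Naive stand-in for the substring concatenation query."""
    return (a + b) in pattern


def replace_in_cover(cover, pattern, pos, ch):
    """Replace the character at 1-indexed text position pos by ch and repair the cover."""
    # Locate the piece containing pos (linear scan, a simplification).
    k, start = 0, 1
    while start + len(cover[k]) <= pos:
        start += len(cover[k])
        k += 1
    piece, off = cover[k], pos - start
    # Split into left part, new character, right part; drop empty parts.
    parts = [s for s in (piece[:off], ch, piece[off + 1:]) if s]
    cover[k:k + 1] = parts
    # Only the previous piece, the parts, and the next piece can violate maximality.
    i = max(k - 1, 0)
    hi = min(k + len(parts), len(cover) - 1)
    while i < hi:
        if concat_is_substring(cover[i], cover[i + 1], pattern):
            cover[i:i + 2] = [cover[i] + cover[i + 1]]
            hi -= 1
        else:
            i += 1
    return cover


cover = ["gagagc", "t", "agc", "gagc", "a", "t"]
print(replace_in_cover(cover, "agagagc", 10, "a"))
# ['gagagc', 't', 'agagagc', 'a', 't']
```

On the slide's example the merges go ag + a = aga, then aga + gagc = agagagc, matching the intervals (1,3) and (1,7).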

• Find New Matches

Given: a text cover which satisfies both the substring and maximality properties.

Find: all new locations of the pattern in the text.

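• Key Observations

A new match must begin within distance m of the change.

A new match can include at most one entire piece of the cover (two adjacent entire pieces inside a match would concatenate to a substring of P, violating maximality).

It can span at most three pieces of the cover.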

• Furthermore

A new match can begin in one of at most three pieces of the cover:
- the piece with the change
- the previous piece
- the one previous to that

[Figure: pattern P aligned against text T]

• Simplified Problem

The search starts within a piece X of the cover.

Simple O(m) time algorithm:
- Check each location in X for a pattern start.
- Use suffix trees and LCA queries to compare substrings in constant time.

• Improved Algorithm

Really, we only have to check each suffix of X that is a pattern prefix, e.g. X = a g a g a.

The KMP automaton can give the necessary information. However, the time is still O(m)!

• Improved Algorithm

We can group the prefixes of P by their periods.

Each group of prefixes can be checked in constant time!

There are at most O(log m) groups.

• Groups (e.g.)

Pattern = a g a g a g c (positions 1 2 3 4 5 6 7)

X = a g a g a

There are three suffixes of X that are also pattern prefixes: { agaga, aga } { a }

Prefixes with the same period fall into a single group.
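The grouping in this example can be reproduced from the KMP failure function: run the automaton over X, follow failure links to list every suffix of X that is a pattern prefix, then bucket those prefixes by their period k - fail[k]. This is a minimal sketch under that reading of "period"; the function names are assumptions, and the pattern is assumed non-empty.

```python
def failure_function(p):
    """fail[k] = length of the longest proper border of the prefix p[:k]."""
    fail = [0] * (len(p) + 1)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k]
        if p[i] == p[k]:
            k += 1
        fail[i + 1] = k
    return fail


def pattern_prefix_suffixes(x, p):
    """Lengths of all suffixes of x that are prefixes of p, longest first."""
    fail = failure_function(p)
    k = 0
    for c in x:
        while k > 0 and (k == len(p) or c != p[k]):
            k = fail[k]
        if c == p[k]:
            k += 1
    lengths = []
    while k > 0:          # walk the failure links from the longest candidate
        lengths.append(k)
        k = fail[k]
    return lengths


def group_by_period(lengths, p):
    """Bucket prefix lengths by the period of the corresponding prefix of p."""
    fail = failure_function(p)
    groups = {}
    for k in lengths:
        groups.setdefault(k - fail[k], []).append(k)
    return groups


lengths = pattern_prefix_suffixes("agaga", "agagagc")
print(lengths)                              # [5, 3, 1]
print(group_by_period(lengths, "agagagc"))  # {2: [5, 3], 1: [1]}
```

The two buckets correspond to the slide's groups { agaga, aga } (period 2) and { a } (period 1).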

• Checking a Group in Constant Time

Pattern = a g a g a g c (positions 1 2 3 4 5 6 7)

X = a g a g a

[Figure: the text following X, aligned against copies of the pattern a g a g a g c]

Idea: match the period ag as far as possible. As soon as (ag)* doesn't match, check for a c.

• Groups

A string cannot have more than O(log m) border groups. Hence, the time of the algorithm is O(log m).

[Intuition: each new group has a new period, which has to be at least double the size of the old period, e.g. aagaagaa.]

• Even Better...

We check only a constant number of groups. Choosing these O(1) groups takes O(log log m) time. Hence, our algorithm takes O(log log m) time per replacement.

• Open Problems

Allowing insertions and deletions to the text.

Searching for a set of multiple static patterns.
