# Dynamic Text and Static Pattern Matching

Date posted: 02-Jan-2016


### Transcript of Dynamic Text and Static Pattern Matching

Dynamic Text and Static Pattern Matching

Amihood Amir, Gad M. Landau, Moshe Lewenstein, Dina Sokol

Bar-Ilan University

Classical Pattern Matching

Input:
- Pattern P = p1 p2 . . . pm
- Text T = t1 t2 t3 . . . tn

over alphabet Σ. m is the pattern size; n is the text size.

Output: locations of T where P appears.

Pattern Matching (e.g.)

Input: P = agca, Σ = {a,g,c,t}

T = aaagcattagctagcagcat

Output: 3, 13, 16

Dynamic Pattern Matching

A. Static Text and Dynamic Pattern.

B. Dynamic Text and Dynamic Pattern.

C. Dynamic Text and Static Pattern.

Dynamic Pattern Matching

A. Static Text and Dynamic Pattern.

a.k.a. the indexing problem.

Solution: preprocess the text and answer pattern queries.

Preprocessing data structure: suffix trees [Wei73, McC75, Ukk95, Far97].

Time: O(n) preprocessing, O(m) query time.
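The indexing idea can be illustrated with a much simpler structure than the suffix trees cited on the slide. The sketch below swaps in a naively built suffix array (a sorted list of suffix start positions) and answers a query by binary search. It does not reach the O(n) preprocessing or O(m) query bounds; the function names `build_suffix_array` and `find_occurrences` are made up for this example.

```python
def build_suffix_array(text):
    """Start indices of all suffixes of text, in sorted order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Sorted 1-based positions where pattern occurs in text."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    # Binary search for the first suffix whose length-m prefix is >= pattern.
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # All suffixes that begin with pattern are contiguous from here on.
    hits = []
    while lo < len(suffix_array):
        start = suffix_array[lo]
        if text[start:start + m] != pattern:
            break
        hits.append(start + 1)
        lo += 1
    return sorted(hits)


text = "aaagcattagctagcagcat"
index = build_suffix_array(text)       # preprocess the static text once
print(find_occurrences(text, index, "agca"))  # query with a changing pattern
```

The text is preprocessed once, and each new pattern is answered from the index without rescanning the text, which is the point of case A.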

Dynamic Pattern Matching

A. Static Text and Dynamic Pattern. Time: O(n) preprocessing, O(m) query time.

B. Dynamic Text and Dynamic Pattern.

a.k.a. the dynamic indexing problem.

Solution: sophisticated data structures [SV96, ABR00].

Time: query O(m + log² n), change O(log² n).

Dynamic Pattern Matching

A. Static Text and Dynamic Pattern. Time: O(n) preprocessing, O(m) query time.

B. Dynamic Text and Dynamic Pattern. Time: query O(m + log² n), change O(log² n).

C. Dynamic Text and Static Pattern?

Dynamic Text and Static Pattern Matching

Pattern is non-changing.

Text changes over time.

Goal: report new occurrences of the pattern without performing a new search.

Motivation

1. Intrusion detection systems

2. Info alerts

3. Two-dimensional run-length compressed matching problem [ALS03]

[Figure: run-length encoded string a14a4b2c3d5c8a6, labeled FAX.]

Problem Definition

Input: T and P over Σ = {1, . . . , m}.

Output:
1. At start: all occurrences of P in T.
2. After each change operation:
   a. report all new occurrences of P in T;
   b. discard all old occurrences of P in T.

Change operation: change one character in the text, e.g. location 5 from a to b.

Example

Input: P = agagagc = (ag)³c, Σ = {a,g,c,t}

T = g a g a g c t a g c g a g c a t

Change: location 10 from c to a.

T = g a g a g c t a g a g a g c a t

Output: {8}
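To make the input/output contract of the problem concrete, here is a naive baseline, not the algorithm of the talk: after each change it rechecks every start position whose window covers the changed location, which costs O(m²) character comparisons per replacement rather than O(log log m). The names `occurrences` and `replace_and_report` are hypothetical.

```python
def occurrences(text, pattern):
    """All 1-based start positions of pattern in text (naive scan)."""
    m = len(pattern)
    return {i + 1 for i in range(len(text) - m + 1)
            if "".join(text[i:i + m]) == pattern}


def replace_and_report(text, pattern, matches, pos, ch):
    """Set text[pos] (1-based) to ch and update the set `matches` in place.

    `text` is a list of characters. Returns (new, dead): the occurrences
    created by the change and the old occurrences it destroyed.
    """
    m = len(pattern)
    text[pos - 1] = ch
    # Only starts in [pos - m + 1, pos] have a window that covers pos.
    lo = max(1, pos - m + 1)
    hi = min(pos, len(text) - m + 1)
    new, dead = set(), set()
    for start in range(lo, hi + 1):
        is_match = "".join(text[start - 1:start - 1 + m]) == pattern
        if is_match and start not in matches:
            new.add(start)
        elif not is_match and start in matches:
            dead.add(start)
    matches -= dead
    matches |= new
    return new, dead


text = list("gagagctagcgagcat")
pattern = "agagagc"
matches = occurrences(text, pattern)          # static stage
new, dead = replace_and_report(text, pattern, matches, 10, "a")
print(new, dead)
```

On the example above, changing location 10 from c to a reports the single new occurrence at position 8 and discards nothing.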

Results

After O(n log log m + …) preprocessing time,

O(log log m) time per replacement.

Dynamic Pattern Matching

A. Static Text and Dynamic Pattern. Time: O(n) preprocessing, O(m) query time.

B. Dynamic Text and Dynamic Pattern. Time: query O(m + log² n), change O(log² n).

C. Dynamic Text and Static Pattern. Time: change and announce O(log log m).

Static Stage

To initially find all occurrences of P in T, use KMP [Knuth-Morris-Pratt 77].

All pattern occurrences in a text of length 2m can be stored in O(1) space.
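For reference, a compact version of the KMP search used in the static stage might look like the following. It is a textbook-style sketch that assumes a non-empty pattern and reports 1-based positions to match the slides; the function names are chosen for this example.

```python
def failure_function(pattern):
    """fail[i] = length of the longest proper border of pattern[:i + 1]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """1-based start positions of all occurrences of pattern in text."""
    fail = failure_function(pattern)
    hits = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 2)   # 0-based start is i - k + 1
            k = fail[k - 1]          # keep going to find overlapping matches
    return hits


print(kmp_search("aaagcattagctagcagcat", "agca"))
```

The same failure function is what the later slides mean by the KMP automaton: it describes which pattern prefixes are also suffixes of what has been read so far.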

Succinct Output

Assumption: the text is of size 2m. (Break the text T into overlapping strings of length 2m-1.)

[Figure: text T with positions 1, m, 2m, 3m, 4m marked, and the pattern P.]

Succinct Output (cont.)

P is periodic: a string P is periodic if it matches itself before position |P|/2, e.g. P = abcabcabca, which matches a copy of itself shifted by 3.

Store the output as a chain of pattern occurrences.

P is non-periodic: by definition, no more than two occurrences.
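The chain representation can be sketched as follows. The code computes the shortest period from the KMP failure function, tests periodicity under the assumption that "periodic" means shortest period at most |P|/2 (the slide's exact boundary convention is not stated), and packs a sorted list of occurrences into (first, step, count) runs. All function names are made up for this illustration, and a non-empty pattern is assumed.

```python
def shortest_period(s):
    """Length of the shortest period of s, via the KMP failure function."""
    fail = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = fail[k - 1]
        if s[i] == s[k]:
            k += 1
        fail[i] = k
    return len(s) - fail[-1]


def is_periodic(pattern):
    """Assumed convention: periodic means shortest period <= |P| / 2."""
    return 2 * shortest_period(pattern) <= len(pattern)


def compress_occurrences(positions):
    """Greedily pack sorted positions into (first, step, count) runs."""
    runs = []
    i = 0
    while i < len(positions):
        if i + 1 == len(positions):
            runs.append((positions[i], 0, 1))
            break
        step = positions[i + 1] - positions[i]
        j = i + 1
        while j + 1 < len(positions) and positions[j + 1] - positions[j] == step:
            j += 1
        runs.append((positions[i], step, j - i + 1))
        i = j + 1
    return runs


pattern = "abcabcabca"                      # m = 10, shortest period 3
window = "abcabcabcabcabcabca"              # length 2m - 1 = 19
starts = [i + 1 for i in range(len(window) - len(pattern) + 1)
          if window[i:i + len(pattern)] == pattern]
print(is_periodic(pattern), compress_occurrences(starts))
```

For this window the four occurrences at 1, 4, 7, 10 collapse into the single run (1, 3, 4), which is the constant-size chain the slide refers to.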

On-line Algorithm

Following each replacement:

Delete old matches that are no longer pattern occurrences.

Find new matches.

Delete Old Matches

Deleting is trivial since we store the matches in constant space:

P is periodic: truncate the chain of pattern occurrences.

P is non-periodic: discard all matches that are within distance m of the replacement.

Find New Matches

Challenge: how can we locate occurrences of P, following each replacement, without actually searching for P?

Main Idea - Text Covers

We cover the text with substrings of the pattern, i.e. store the text in terms of P.

Pattern = a g a g a g c (positions 1 2 3 4 5 6 7)

Text = g a g a g c t a g c g a g c a t

Cover: g a g a g c [2,7], t (not in P), a g c [5,7], g a g c [4,7], a [1,1], t (not in P)

Text Cover (cont.)

The text cover must satisfy two properties:

Substring Property: each element of the cover is a substring of P, or a character not included in P.

Maximality Property: no two adjacent elements can concatenate to form a substring of P.
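One simple way to obtain a cover with both properties is a greedy left-to-right scan that always takes the longest substring of P starting at the current text position. This is a naive sketch for illustration, not the construction used in the talk; `build_cover` and `satisfies_maximality` are hypothetical names, and spans are reported as 1-based (i, j) of the first occurrence of the piece in P.

```python
def build_cover(text, pattern):
    """Greedy cover of text by maximal substrings of pattern.

    Returns a list of (piece, span); span is the 1-based (i, j) of the first
    occurrence of piece in pattern, or None for a character not in pattern.
    """
    cover = []
    pos = 0
    while pos < len(text):
        length = 1
        # Extend the piece while one more character still gives a substring of P.
        while (pos + length < len(text)
               and text[pos:pos + length + 1] in pattern):
            length += 1
        piece = text[pos:pos + length]
        start = pattern.find(piece)
        span = (start + 1, start + length) if start >= 0 else None
        cover.append((piece, span))
        pos += length
    return cover


def satisfies_maximality(cover, pattern):
    """True if no two adjacent pieces concatenate to a substring of pattern."""
    return all((a + b) not in pattern
               for (a, _), (b, _) in zip(cover, cover[1:]))


cover = build_cover("gagagctagcgagcat", "agagagc")
for piece, span in cover:
    print(piece, span)
```

Greedy extension gives maximality for free: if two adjacent pieces could be concatenated into a substring of P, the first piece could have been extended by at least one more character.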

Text Cover (cont.)

How does a replacement in the text affect the text cover?

Initially, in the static stage, we construct a text cover for T. We ensure that the cover satisfies both the substring and maximality properties.

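Text Cover following replacement

Pattern = a g a g a g c (positions 1 2 3 4 5 6 7)

Text = g a g a g c t a g c g a g c a t

Pieces: g a g a g c, t, a g c, g a g c, a, t

Cover: (2,7) - (5,7) (4,7) (1,1) -

(Here - marks the character t, which does not occur in P.)

Replacing the c at location 10 with a splits the piece (5,7):

(2,7) - (5,6) (1,1) (4,7) (1,1) -

Merging adjacent pieces restores maximality: (5,6) and (1,1) concatenate to (1,3), which in turn concatenates with (4,7) to give (1,7):

(2,7) - (1,7) (1,1) -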

Updating the Text Cover

At most 5 pieces can violate the maximality property.

Substring Concatenation Query

Query: given two substrings of P, P[i,j] and P[k,l], is their concatenation also a substring of P?

Query time: O(log log m).

Preprocessing time: (also uses - [BG00])

Hence, in O(log log m) time we can update the cover so that it satisfies both properties.
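The split-and-merge update can be sketched with a naive stand-in for the concatenation query. Here `concat_is_substring` simply tests string containment, so it costs O(m) or more per query instead of O(log log m), and the merge loop scans to the end of the piece list rather than stopping after the few pieces near the change. Pieces are kept as plain strings, positions are 0-based, and both function names are hypothetical.

```python
def concat_is_substring(pattern, left, right):
    """Naive stand-in for the substring concatenation query."""
    return (left + right) in pattern


def update_cover(pieces, pattern, pos, ch):
    """Replace the character at 0-based text position pos by ch.

    `pieces` is a list of strings whose concatenation is the text and which
    already forms a valid cover. Returns the repaired list of pieces.
    """
    # Locate the piece that contains pos.
    offset = 0
    for idx, piece in enumerate(pieces):
        if pos < offset + len(piece):
            break
        offset += len(piece)
    else:
        raise IndexError("pos is outside the text")
    cut = pos - offset
    # Split into: part before the change, the new character, part after.
    split = [part for part in (piece[:cut], ch, piece[cut + 1:]) if part]
    pieces = pieces[:idx] + split + pieces[idx + 1:]
    # Re-merge neighbours, starting just left of the change.
    i = max(idx - 1, 0)
    while i + 1 < len(pieces):
        if concat_is_substring(pattern, pieces[i], pieces[i + 1]):
            pieces[i:i + 2] = [pieces[i] + pieces[i + 1]]
            i = max(i - 1, 0)   # the merged piece may now join its left neighbour
        else:
            i += 1
    return pieces


cover = ["gagagc", "t", "agc", "gagc", "a", "t"]
print(update_cover(cover, "agagagc", 9, "a"))
```

On the running example, changing text location 10 (index 9) from c to a splits agc into ag and a, and the merges rebuild the full pattern agagagc as a single piece.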

Find New Matches

Given: a text cover which satisfies both the substring and maximality properties.

Find: all new locations of the pattern in the text.

Key Observations

A new match must begin within distance m of the change.

A new match can include at most one entire piece of the cover.

It can span at most three pieces of the cover.

Furthermore

A new match can begin in one of at most three pieces of the cover:
- the piece with the change
- the previous piece
- the one previous to that

[Figure: pattern P aligned against text T.]

Simplified Problem

The search starts within a piece X of the cover.

Simple O(m) time algorithm:

Check each location in X for a pattern start.

Use suffix trees and LCA queries to compare substrings in constant time.

[Figure: pattern P aligned against text T, starting inside piece X.]

Improved Algorithm

Really, we only have to check each suffix of X that is a pattern prefix, e.g. X = a g a g a.

The KMP automaton can give the necessary information.

However, the time is still O(m)!

Improved Algorithm

We can group the prefixes of P by their periods.

Each group of prefixes can be checked in constant time!

There are at most O(log m) groups.

Groups (e.g.)

Pattern = a g a g a g c (positions 1 2 3 4 5 6 7)

X = a g a g a

There are three suffixes of X that are also pattern prefixes: { agaga, aga } { a }

Prefixes with the same period fall into a single group.
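The grouping in this example can be reproduced with a few lines of code. The sketch below finds every suffix of X that is also a prefix of P by direct comparison (the talk would read this off the KMP automaton instead) and groups those prefixes by their shortest period, computed from the failure function. The names `prefix_periods` and `group_candidate_prefixes` are invented for this illustration.

```python
def prefix_periods(pattern):
    """Map k -> shortest period of the prefix pattern[:k], for k >= 1."""
    fail = [0] * len(pattern)
    b = 0
    for i in range(1, len(pattern)):
        while b > 0 and pattern[i] != pattern[b]:
            b = fail[b - 1]
        if pattern[i] == pattern[b]:
            b += 1
        fail[i] = b
    return {k: k - fail[k - 1] for k in range(1, len(pattern) + 1)}


def group_candidate_prefixes(pattern, x):
    """Group the suffixes of x that are prefixes of pattern by period.

    Returns a list of groups, longest prefixes first.
    """
    periods = prefix_periods(pattern)
    groups = {}
    for k in range(min(len(pattern), len(x)), 0, -1):
        if x.endswith(pattern[:k]):
            groups.setdefault(periods[k], []).append(pattern[:k])
    return list(groups.values())


print(group_candidate_prefixes("agagagc", "agaga"))
```

For P = agagagc and X = agaga this yields the two groups from the slide: agaga and aga share period 2, while a has period 1.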

Checking a group in Constant Time

Pattern = a g a g a g c (positions 1 2 3 4 5 6 7)

X = a g a g a

[Figure: alignments of the pattern a g a g a g c against X = a g a g a followed by further text.]

Idea: match the period ag as far as possible. As soon as (ag)* doesn't match, check for a c.

Groups

A string cannot have more than O(log m) border groups.

Hence, the time of the algorithm is O(log m).

[Intuition: each new group has a new period, which has to be at least double the size of the old period, e.g. aagaagaa.]

Even Better...

We check only a constant number of groups.

Choosing these O(1) groups takes O(log log m) time.

Hence, our algorithm takes O(log log m) time per replacement.

Open Problems

Allowing insertions and deletions to the text.

Searching for a set of multiple static patterns.
