Dynamic Text and Static Pattern Matching

Amihood Amir, Gad M. Landau

Moshe Lewenstein, Dina Sokol

Bar-Ilan University

Classical Pattern Matching

Input:

- Pattern P = p1 p2 … pm

- Text T = t1 t2 t3 … tn

over alphabet Σ.

• m is the PATTERN size.

• n is the TEXT size.

Output: locations of T where P appears.

Pattern Matching (eg.)

Input: P = agca, Σ = {a,g,c,t}

T = aaagcattagctagcagcat (locations numbered from 1)

Output: 3, 13, 16, …

“Dynamic” Pattern Matching

A. Static Text and Dynamic Pattern.

B. Dynamic Text and Dynamic Pattern.

C. Dynamic Text and Static Pattern.

“Dynamic” Pattern Matching

A. Static Text and Dynamic Pattern.

a.k.a. - the indexing problem

Solution: Preprocess the text and answer pattern queries.

Preprocessing Data Structure: Suffix trees [Wei73, McC75, Ukk95, Far97].

Time: O(n) preprocessing, O(m) query time.

“Dynamic” Pattern Matching

A. Static Text and Dynamic Pattern.

Time: O(n) preprocessing, O(m) query time.

B. Dynamic Text and Dynamic Pattern.

a.k.a. the dynamic indexing problem.

Solution: sophisticated data structures [SV96, ABR00].

Time: query O(m + log² n), change O(log² n).

“Dynamic” Pattern Matching

A. Static Text and Dynamic Pattern.

Time: O(n) preprocessing, O(m) query time.

B. Dynamic Text and Dynamic Pattern.

Time: query O(m + log² n), change O(log² n).

C. Dynamic Text and Static Pattern?

Dynamic Text and Static Pattern Matching

The pattern is non-changing. The text changes over time.

Goal: report new occurrences of the pattern without performing a new search.

Motivation

[Figure: FAX image with run-length encoded rows a14, a4b2c3d5, c8a6]

1. Intrusion detection systems

2. Info alerts

3. Two-dimensional run-length compressed matching problem, [ALS03]

Problem Definition

Input: T and P over Σ = {1, …, m}.

Output:

1. At start: all occurrences of P in T.

2. After each change operation:

a. report all new occurrences of P in T.

b. discard all old occurrences of P in T.

Change Operation: change one character in the text, e.g. location 5 from a to b.

Example

Input: P = agagagc = (ag)³c, Σ = {a,g,c,t}

T = g a g a g c t a g c g a g c a t

Change: location 10 from c to a.

T = g a g a g c t a g a g a g c a t

Output: {8} (the new occurrence of P begins at location 8)

Results

O(log log m) time per replacement,

after O(n log log m + m √(log m)) preprocessing time.

“Dynamic” Pattern Matching

A. Static Text and Dynamic Pattern.

Time: O(n) preprocessing, O(m) query time.

B. Dynamic Text and Dynamic Pattern.

Time: query O(m + log² n), change O(log² n).

C. Dynamic Text and Static Pattern.

Time: change and announce O(log log m).

Static Stage

To initially find all occurrences of P in T, use KMP [Knuth-Morris-Pratt ‘77].

All pattern occurrences in a text of length 2m can be stored in O(1) space.
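The static stage can be sketched with a textbook KMP search. This is a minimal illustration, not the authors' code; the function names `failure_function` and `kmp_search` are chosen here, and locations are reported 1-indexed to match the slides.

```python
def failure_function(pattern):
    """fail[i] = length of the longest proper border of pattern[:i + 1]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the 1-indexed start locations of a non-empty pattern in text."""
    fail = failure_function(pattern)
    hits = []
    q = 0  # number of pattern characters currently matched
    for pos, ch in enumerate(text):
        while q > 0 and pattern[q] != ch:
            q = fail[q - 1]
        if pattern[q] == ch:
            q += 1
        if q == len(pattern):
            hits.append(pos - q + 2)  # convert 0-indexed end to 1-indexed start
            q = fail[q - 1]           # keep going to find overlapping matches
    return hits


print(kmp_search("aaagcattagctagcagcat", "agca"))
```

On the earlier example text this reports locations 3, 13 and 16.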

Succinct Output

Assumption: the text is of size 2m. (Break the text T into overlapping strings of length 2m-1.)

[Figure: the text T marked at locations 1, m, 2m, 3m, 4m, with the pattern P aligned beneath it]

Succinct Output (cont.)

P is periodic: a string P is periodic if it matches itself at a shift of at most |P|/2.

e.g. P = abcabcabca matches itself at a shift of 3:

abcabcabca

   abcabcabca

Store the output as a ‘chain’ of pattern occurrences.

P is non-periodic: by definition, no more than two occurrences.
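One way to picture the constant-space ‘chain’ is as an arithmetic progression (first location, step, count). The sketch below assumes that reading; `period`, `is_periodic` and `as_chain` are names introduced here, and the threshold "period at most |P|/2" is my interpretation of the slide's definition.

```python
def period(p):
    """Smallest shift at which p matches itself (via the KMP failure function)."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return len(p) - fail[-1]


def is_periodic(p):
    # Assumption: "periodic" means the period is at most half the length.
    return 2 * period(p) <= len(p)


def as_chain(occurrences):
    """Compress sorted, equally spaced occurrences into (first, step, count)."""
    if len(occurrences) < 2:
        return (occurrences[0] if occurrences else None, 0, len(occurrences))
    step = occurrences[1] - occurrences[0]
    # The chain representation only applies when the spacing is uniform.
    assert all(b - a == step for a, b in zip(occurrences, occurrences[1:]))
    return (occurrences[0], step, len(occurrences))


p = "abcabcabca"
t = "abc" * 6 + "a"  # length 19 = 2m - 1
occ = [s + 1 for s in range(len(t) - len(p) + 1) if t[s:s + len(p)] == p]
print(period(p), is_periodic(p), as_chain(occ))
```

Truncating such a chain after a replacement then amounts to lowering the count, which is why deletion stays cheap.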

On-line Algorithm

Following each replacement:

Delete old matches that are no longer pattern occurrences.

Find new matches.
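As a reference point, the two steps can be written as a naive baseline that simply rechecks every start location overlapping the replaced character. This costs O(m²) character comparisons per replacement, so it is not the O(log log m) algorithm of the talk; the class name `NaiveDynamicMatcher` and its interface are hypothetical.

```python
class NaiveDynamicMatcher:
    """Maintains the set of occurrences of a fixed pattern under replacements."""

    def __init__(self, text, pattern):
        self.text = list(text)
        self.pattern = pattern
        m = len(pattern)
        # 0-indexed start locations of all current occurrences.
        self.matches = {s for s in range(len(text) - m + 1)
                        if text[s:s + m] == pattern}

    def replace(self, location, ch):
        """Replace the character at a 1-indexed location; return new matches (1-indexed)."""
        c = location - 1
        if self.text[c] == ch:
            return []  # nothing changed, so nothing is new
        self.text[c] = ch
        m = len(self.pattern)
        # Only starts in [c - m + 1, c] can overlap the replaced character.
        window = range(max(0, c - m + 1), min(c, len(self.text) - m) + 1)
        self.matches.difference_update(window)          # delete old matches
        new = [s for s in window
               if "".join(self.text[s:s + m]) == self.pattern]
        self.matches.update(new)                        # find new matches
        return [s + 1 for s in new]


dm = NaiveDynamicMatcher("gagagctagcgagcat", "agagagc")
print(dm.replace(10, "a"))
```

With the example from the problem definition, replacing location 10 by a reports the new occurrence at location 8.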

Delete Old Matches

Deleting is trivial since we store the matches in constant space:

P is periodic: Truncate the chain of pattern occurrences.

P is non-periodic: Discard all matches that begin at the replacement or fewer than m locations before it, i.e. all matches that overlap the replaced character.

Find New Matches

Challenge: How can we locate occurrences of P, following each replacement, without actually searching for P?

Main Idea - Text Covers

We ‘cover’ the text with substrings of the pattern, i.e. store the text in terms of P.

Pattern = a g a g a g c (positions 1 2 3 4 5 6 7)

Text = g a g a g c t a g c g a g c a t

Cover:

g a g a g c = [2,7]

t (not in P)

a g c = [5,7]

g a g c = [4,7]

a = [1,1]

t (not in P)

Text Cover (cont.)

The text cover must satisfy two properties:

Substring Property: each element of the cover is a substring of P, or a character not included in P.

Maximality Property: no two adjacent elements can concatenate to form a substring of P.
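A cover with both properties can be obtained greedily: at each text location take the longest substring of P that starts there. The sketch below is a naive version of that idea (substring tests via Python's `in`, so far slower than a suffix-tree construction); `text_cover` is a name introduced here, pieces are 1-indexed (start, end) pairs into P, and characters outside P are kept as single characters.

```python
def text_cover(text, pattern):
    """Greedy cover of text by maximal substrings of pattern."""
    cover = []
    i = 0
    while i < len(text):
        if text[i] not in pattern:
            cover.append(text[i])  # character not included in P
            i += 1
            continue
        j = i + 1
        # Extend the piece while it is still a substring of the pattern.
        while j < len(text) and text[i:j + 1] in pattern:
            j += 1
        start = pattern.find(text[i:j])  # leftmost occurrence in P
        cover.append((start + 1, start + (j - i)))
        i = j
    return cover


print(text_cover("gagagctagcgagcat", "agagagc"))
```

Greedy extension gives maximality for free: if two adjacent pieces could be concatenated into a substring of P, the left piece would not have been the longest one starting at its location.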

Text Cover (cont.)

How does a replacement in the text affect the text cover?

• Initially, in the static stage, we construct a text cover for T.

• We ensure that the cover satisfies both the substring and maximality properties.

Text Cover following replacement

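Pattern = a g a g a g c (positions 1 2 3 4 5 6 7)

Text = g a g a g c t a g c g a g c a t

Pieces: g a g a g c, t, a g c, g a g c, a, t

Cover: (2,7) - (5,7) (4,7) (1,1) -

After replacing the c at location 10 with a, the piece (5,7) splits into (5,6) and (1,1):

(2,7) - (5,6) (1,1) (4,7) (1,1) -

Restoring maximality: (5,6) (1,1) concatenate to (1,3), and (1,3) (4,7) concatenate to (1,7):

(2,7) - (1,7) (1,1) -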

Updating the Text Cover

At most 5 pieces can violate the maximality property.

Substring Concatenation Query

Query: Given two substrings of P, P[i,j] and P[k,l]. Is their concatenation also a substring of P?

Query time: O(log log m).

Preprocessing time: O(m √(log m)) (also uses [BG00]).

Hence, in O(log log m) time we can update the cover so that it satisfies both properties.
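To illustrate how concatenation queries restore maximality, here is a minimal sketch in which the query is answered naively with `str.find` rather than in O(log log m) time. The names `concat_query` and `restore_maximality` are hypothetical; pieces are 1-indexed (start, end) pairs into P, and characters outside P are plain strings.

```python
def concat_query(pattern, left, right):
    """If P[i,j] + P[k,l] is a substring of P, return its (start, end); else None."""
    (i, j), (k, l) = left, right
    joined = pattern[i - 1:j] + pattern[k - 1:l]
    pos = pattern.find(joined)  # naive stand-in for the fast query
    return None if pos < 0 else (pos + 1, pos + len(joined))


def restore_maximality(cover, pattern):
    """Merge adjacent pieces until no two neighbours concatenate to a substring of P."""
    out = []
    for piece in cover:
        # Keep merging the incoming piece with the last kept piece while possible.
        while out and isinstance(out[-1], tuple) and isinstance(piece, tuple):
            merged = concat_query(pattern, out[-1], piece)
            if merged is None:
                break
            out.pop()
            piece = merged
        out.append(piece)
    return out


# Cover of the example text right after the replacement at location 10.
split_cover = [(2, 7), "t", (5, 6), (1, 1), (4, 7), (1, 1), "t"]
print(restore_maximality(split_cover, "agagagc"))
```

This sweep rechecks the whole cover for simplicity; since only a constant number of pieces around the replacement can violate maximality, the real update needs only a constant number of queries.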

Find New Matches

Given: a text cover which satisfies both the substring and maximality properties.

Find: all new locations of the pattern in the text.

Key Observations

A new match must begin at the change or fewer than m locations before it.

A new match can include at most one entire piece of the cover.

It can span at most three pieces of the cover.

Furthermore

A new match can begin in one of at most three pieces of the cover:

– the piece with the change

– the previous piece

– the one previous to that

Simplified Problem

The search starts within a piece X of the cover.

Simple O(m) time algorithm:

– Check each location in X for a pattern start.

– Use suffix trees and LCA queries to compare substrings in constant time.

Improved Algorithm

Really, we only have to check each suffix of X that is a pattern prefix.

e.g. X = a g a g a

The KMP automaton can give the necessary information.

However, the time is still O(m) !
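The candidate start locations can be listed by running the KMP automaton of P over X and then following failure links. The sketch below assumes a non-empty pattern; `suffix_prefix_lengths` is a name introduced here, and it returns the lengths of all suffixes of X that are prefixes of P, longest first.

```python
def failure_function(pattern):
    """fail[i] = length of the longest proper border of pattern[:i + 1]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def suffix_prefix_lengths(x, pattern):
    """Lengths of all suffixes of x that are also prefixes of pattern."""
    fail = failure_function(pattern)
    q = 0  # length of the longest pattern prefix that is a suffix of the scanned part
    for ch in x:
        while q > 0 and (q == len(pattern) or pattern[q] != ch):
            q = fail[q - 1]
        if pattern[q] == ch:
            q += 1
    lengths = []
    while q > 0:  # every shorter candidate is a border of the longest one
        lengths.append(q)
        q = fail[q - 1]
    return lengths


print(suffix_prefix_lengths("agaga", "agagagc"))
```

For X = a g a g a this lists lengths 5, 3 and 1, i.e. the suffixes agaga, aga and a. The list can still contain up to O(m) entries, which is the cost noted above.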

Improved Algorithm

We can group the prefixes of P by their periods.

Each group of prefixes can be checked in constant time!

There are at most O(log m) groups.

Groups (eg.)

Pattern = a g a g a g c (positions 1 2 3 4 5 6 7)

X = a g a g a

There are three suffixes of X that are also pattern prefixes:

{ agaga, aga } { a }

Prefixes with the same period fall into a single group.
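The grouping can be sketched from the failure function of P, since the period of the prefix of length L is L minus its longest border. `group_by_period` is a hypothetical helper that takes the candidate prefix lengths (for example 5, 3, 1 for X = a g a g a) and buckets them by period.

```python
def failure_function(pattern):
    """fail[i] = length of the longest proper border of pattern[:i + 1]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def group_by_period(lengths, pattern):
    """Map each period to the candidate prefix lengths that have that period."""
    fail = failure_function(pattern)
    groups = {}
    for length in lengths:
        per = length - fail[length - 1]  # period of pattern[:length]
        groups.setdefault(per, []).append(length)
    return groups


print(group_by_period([5, 3, 1], "agagagc"))
```

Here agaga and aga share period 2 and a has period 1, matching the two groups listed above.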

Checking a group in Constant Time

Pattern = a g a g a g c (positions 1 2 3 4 5 6 7)

X = a g a g a

[Figure: the pattern aligned at the suffixes of X in one group, against the text that follows X]

Idea: Match the period ‘ag’ as far as possible. As soon as (ag)* doesn’t match, check for a ‘c.’
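The idea can be sketched as follows, with plain character scans standing in for the constant-time substring comparisons. `check_group` is a name introduced here; `starts` holds the 0-indexed text locations of the candidates in one group and `per` is their common period. This is my reading of the slide, not the authors' procedure.

```python
def check_group(text, pattern, starts, per):
    """Return the candidate starts (0-indexed) in one period group where P occurs."""
    m = len(pattern)
    # r = length of the longest prefix of P with period per, e.g. (ag)^3 for agagagc.
    r = per
    while r < m and pattern[r] == pattern[r - per]:
        r += 1
    # e = end of the periodic run in the text, extended from the earliest candidate.
    e = min(starts) + per
    while e < len(text) and text[e] == text[e - per]:
        e += 1
    if r == m:
        # P is entirely periodic: any candidate with enough run left is a match.
        return [s for s in sorted(starts) if e - s >= m]
    # Otherwise the run must break in the text exactly where it breaks in P,
    # so at most one candidate survives; then the rest of P is compared.
    s = e - r
    if s in starts and text[s + r:s + m] == pattern[r:]:
        return [s]
    return []


# X = text[1:6] = "agaga"; the period-2 candidates start at 0-indexed locations 1 and 3.
print(check_group("tagagagagcat", "agagagc", [1, 3], 2))
```

Only the candidate whose remaining ‘ag’ run has exactly the length of the pattern's run can match, so a whole group is settled by one run-length comparison plus one check of the remainder.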

Groups

A string cannot have more than O(log m) border groups.

Hence, the time of the algorithm is O(log m).

[Intuition: each new group has a new period which has to be at least double the size of the old period. e.g. aagaagaa]

Even Better...

We check only a constant number of groups.

Choosing these O(1) groups takes O(log log m) time.

Hence, our algorithm takes O(log log m) time per replacement.

Open Problems

Allowing insertions and deletions to the text.

Searching for a set of multiple static patterns.