Dynamic Text and Static Pattern Matching



Page 1: Dynamic Text and Static Pattern Matching

Dynamic Text and Static Pattern Matching

Amihood Amir, Gad M. Landau

Moshe Lewenstein, Dina Sokol

Bar-Ilan University

Page 2: Dynamic Text and Static Pattern Matching

Classical Pattern Matching

Input: a pattern P = p1 p2 … pm and a text T = t1 t2 … tn over alphabet Σ.

• m is the PATTERN size.

• n is the TEXT size.

Output: locations of T where P appears.

Page 3: Dynamic Text and Static Pattern Matching

Pattern Matching (e.g.)

Input: P = agca, Σ = {a,g,c,t}

T=aaagcattagctagcagcat

Page 4: Dynamic Text and Static Pattern Matching

Pattern Matching (e.g.)

Input: P = agca, Σ = {a,g,c,t}

T = aaagcattagctagcagcat

Output: 3, 13, 16 (1-indexed start positions of P in T)
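The example can be checked with a few lines of Python. This is only a minimal sketch of the problem statement, not an efficient matcher; the function name find_occurrences is an illustrative choice and does not come from the slides.

```python
def find_occurrences(text, pattern):
    """Return the 1-indexed start positions of pattern in text (overlaps allowed)."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start + 1)            # convert 0-indexed to 1-indexed
        start = text.find(pattern, start + 1)  # step by one so overlaps are found
    return positions


print(find_occurrences("aaagcattagctagcagcat", "agca"))  # [3, 13, 16]
```

Stepping the search by one position (rather than by the pattern length) is what lets the overlapping occurrences at 13 and 16 both be reported.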

Page 5: Dynamic Text and Static Pattern Matching

“Dynamic” Pattern Matching

A. Static Text and Dynamic Pattern.

B. Dynamic Text and Dynamic Pattern.

C. Dynamic Text and Static Pattern.

Page 6: Dynamic Text and Static Pattern Matching

“Dynamic” Pattern Matching

A. Static Text and Dynamic Pattern.

a.k.a. the indexing problem

Solution: preprocess the text, then answer pattern queries.

Preprocessing data structure: suffix trees [Wei73, McC75, Ukk95, Far97]

Time: O(n) preprocessing, O(m) query time
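To make the "preprocess once, query many times" idea concrete, here is a small sketch that swaps the suffix tree for a naively built sorted list of suffixes (a simple suffix array). It does not achieve the O(n) and O(m) bounds quoted above; building is roughly O(n^2 log n) and a query is O(m log n) plus the number of hits. The names build_index and query are hypothetical.

```python
def build_index(text):
    """Preprocessing: suffix start offsets sorted by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def query(text, index, pattern):
    """Return the sorted 1-indexed occurrences of pattern, using binary search."""
    m = len(pattern)
    lo, hi = 0, len(index)
    while lo < hi:  # find the first suffix whose first m characters are >= pattern
        mid = (lo + hi) // 2
        if text[index[mid]:index[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(index) and text[index[lo]:index[lo] + m] == pattern:
        hits.append(index[lo] + 1)  # report 1-indexed positions
        lo += 1
    return sorted(hits)


T = "aaagcattagctagcagcat"
index = build_index(T)           # done once
print(query(T, index, "agca"))   # [3, 13, 16]
print(query(T, index, "cat"))    # [5, 18]
```

The index is built once for the static text; each new pattern then costs only a binary search, which is the behaviour the suffix tree provides with better bounds.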

Page 7: Dynamic Text and Static Pattern Matching

“Dynamic” Pattern Matching

A. Static Text and Dynamic Pattern.

Time: O(n) preprocessing, O(m) query time

B. Dynamic Text and Dynamic Pattern.

a.k.a. the dynamic indexing problem

Solution: sophisticated data structures [SV96, ABR00]

Time: query O(m + log2n), change O(log2n)

Page 8: Dynamic Text and Static Pattern Matching

“Dynamic” Pattern Matching

A. Static Text and Dynamic Pattern.

Time: O(n) preprocessing, O(m) query time

B. Dynamic Text and Dynamic Pattern.

Time: query O(m + log2n), change O(log2n)

C. Dynamic Text and Static Pattern?

Page 9: Dynamic Text and Static Pattern Matching

Dynamic Text and Static Pattern Matching

The pattern does not change; the text changes over time.

Goal: report new occurrences of the pattern without performing a new search.

Page 10: Dynamic Text and Static Pattern Matching

Motivation

[Figure: a fax image shown as run-length encoded rows, e.g. a14, a4b2c3d5, c8a6]

1. Intrusion detection systems

2. Info alerts

3. Two-dimensional run-length compressed matching problem, [ALS03]

Page 11: Dynamic Text and Static Pattern Matching

Problem Definition

Input: T and P over Σ = {1, …, m}.

Output:

1. At start: all occurrences of P in T.

2. After each change operation:

a. report all new occurrences of P in T.

b. discard all old occurrences of P in T.

Change operation: change one character in the text, e.g. location 5 from a to b.

Page 12: Dynamic Text and Static Pattern Matching

Example

Input: P = agagagc = (ag)^3 c, Σ = {a,g,c,t}

T = g a g a g c t a g c g a g c a t

Page 13: Dynamic Text and Static Pattern Matching

Example

Input: P = agagagc = (ag)^3 c, Σ = {a,g,c,t}

T = g a g a g c t a g a g a g c a t

(position 10 has changed from c to a)

Page 14: Dynamic Text and Static Pattern Matching

Example

Input: P = agagagc = (ag)^3 c, Σ = {a,g,c,t}

T = g a g a g c t a g a g a g c a t

(position 10 has changed from c to a; a new occurrence of P starts at position 8)

Output: {8}
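A brute-force baseline shows what a change operation has to report. This sketch simply rechecks every start position whose occurrence would cover the changed location, so it costs O(m^2) per replacement and is not the algorithm of the talk. The function name replace_and_report and its return shape are assumptions made for illustration.

```python
def replace_and_report(text, pattern, pos, ch):
    """Set text[pos] = ch (pos is 1-indexed); return (new_text, new, discarded)."""
    m = len(pattern)
    new_text = text[:pos - 1] + ch + text[pos:]
    # Only occurrences that cover position pos can appear or disappear.
    first = max(1, pos - m + 1)
    last = min(pos, len(text) - m + 1)

    def matches(t):
        return {s for s in range(first, last + 1) if t[s - 1:s - 1 + m] == pattern}

    before, after = matches(text), matches(new_text)
    return new_text, after - before, before - after


text, new, discarded = replace_and_report("gagagctagcgagcat", "agagagc", 10, "a")
print(new, discarded)  # {8} set()
```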

Page 15: Dynamic Text and Static Pattern Matching

Results

After O(n log log m + m √(log m)) preprocessing time, O(log log m) time per replacement.

Page 16: Dynamic Text and Static Pattern Matching

“Dynamic” Pattern Matching

A. Static Text and Dynamic Pattern.

Time: O(n) preprocessing, O(m) query time

B. Dynamic Text and Dynamic Pattern.

Time: query O(m + log2n), change O(log2n)

C. Dynamic Text and Static Pattern.

Time: change and announce O(log log m)

Page 17: Dynamic Text and Static Pattern Matching

Static Stage

To initially find all occurrences of P in T, use KMP [Knuth-Morris-Pratt ‘77].

All pattern occurrences in a text of length 2m can be stored in O(1) space.
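For reference, the static stage can be sketched with a textbook KMP matcher. This is the standard formulation rather than code from the talk; the names failure_function and kmp_search are illustrative, and the pattern is assumed to be non-empty.

```python
def failure_function(p):
    """fail[q] = length of the longest proper border of p[:q]."""
    fail = [0] * (len(p) + 1)
    k = 0
    for q in range(2, len(p) + 1):
        while k and p[q - 1] != p[k]:
            k = fail[k]
        if p[q - 1] == p[k]:
            k += 1
        fail[q] = k
    return fail


def kmp_search(text, p):
    """Return the 1-indexed start positions of p in text in O(n + m) time."""
    fail, q, out = failure_function(p), 0, []
    for i, c in enumerate(text):
        while q and c != p[q]:
            q = fail[q]
        if c == p[q]:
            q += 1
        if q == len(p):                 # a full match ends at index i
            out.append(i - len(p) + 2)  # 1-indexed start position
            q = fail[q]                 # keep going to find overlapping matches
    return out


print(kmp_search("aaagcattagctagcagcat", "agca"))  # [3, 13, 16]
```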

Page 18: Dynamic Text and Static Pattern Matching

Succinct Output

Assumption: the text is of size 2m. (Break the text T into overlapping strings of length 2m-1.)

[Figure: the text T marked at positions 1, m, 2m, 3m, 4m, with the pattern P below it]

Page 19: Dynamic Text and Static Pattern Matching

Succinct Output (cont.)

P is periodic: a string p is periodic if it matches itself before position |p|/2.

e.g. p = abcabcabca matches itself at a shift of 3:

abcabcabca
   abcabcabca

Store the output as a ‘chain’ of pattern occurrences.

P is non-periodic: by definition, no more than two occurrences.
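One way to picture the ‘chain’ is as a run of start positions spaced by the pattern's period. The sketch below computes the smallest period from the KMP failure function and packs a sorted list of occurrences into (first, count) runs. The run representation and the names smallest_period and compress are assumptions for illustration, not the paper's exact encoding.

```python
def smallest_period(p):
    """Smallest period of p, computed as len(p) minus its longest proper border."""
    fail = [0] * (len(p) + 1)
    k = 0
    for q in range(2, len(p) + 1):
        while k and p[q - 1] != p[k]:
            k = fail[k]
        if p[q - 1] == p[k]:
            k += 1
        fail[q] = k
    return len(p) - fail[len(p)]


def compress(occurrences, period):
    """Pack sorted start positions into (first, count) runs with step `period`."""
    runs = []
    for s in occurrences:
        if runs and s == runs[-1][0] + runs[-1][1] * period:
            runs[-1][1] += 1        # s continues the current chain
        else:
            runs.append([s, 1])     # s starts a new chain
    return [tuple(r) for r in runs]


P = "abcabcabca"
period = smallest_period(P)               # 3
print(compress([1, 4, 7, 10], period))    # [(1, 4)]
```

Here the four occurrences of P in the 19-character text abcabcabcabcabcabca collapse into one pair: first position 1, four occurrences, step 3.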

Page 20: Dynamic Text and Static Pattern Matching

On-line Algorithm

Following each replacement:

Delete old matches that are no longer pattern occurrences.

Find new matches.

Page 21: Dynamic Text and Static Pattern Matching

Delete Old Matches

Deleting is trivial since we store the matches in constant space:

P is periodic: Truncate the chain of pattern occurrences.

P is non-periodic: Discard all matches that are within distance m of the replacement.

Page 22: Dynamic Text and Static Pattern Matching

Find New Matches

Challenge: How can we locate occurrences of P, following each replacement, without actually searching for P?

Page 23: Dynamic Text and Static Pattern Matching

Main Idea - Text Covers

We ‘cover’ the text with substrings of the pattern, i.e. store the text in terms of P.

Pattern = a g a g a g c (positions 1 2 3 4 5 6 7)

Text = g a g a g c t a g c g a g c a t

Cover: g a g a g c = [2,7], t (not in P), a g c = [5,7], g a g c = [4,7], a = [1,1], t (not in P)

Page 24: Dynamic Text and Static Pattern Matching

Text Cover (cont.)

The text cover must satisfy two properties:

Substring Property: each element of the cover is a substring of P, or a character not included in P.

Maximality Property: no two adjacent elements can concatenate to form a substring of P.
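A cover with both properties can be built greedily: starting at each uncovered text position, extend the current piece while it is still a substring of P. Because each piece is extended as far as possible, appending even the first character of the next piece would no longer give a substring of P, so maximality holds. The sketch below uses plain substring tests, so it is far slower than the preprocessing described in the talk; build_cover is a hypothetical name, and each piece is reported with the first place it occurs in P.

```python
def build_cover(text, P):
    """Return a list of (piece, span); span is a 1-indexed (i, j) with P[i..j] == piece,
    or None when the piece is a single character that does not occur in P."""
    cover, i = [], 0
    while i < len(text):
        j = i + 1
        if text[i] in P:
            # Extend the piece while it remains a substring of P.
            while j < len(text) and text[i:j + 1] in P:
                j += 1
            k = P.find(text[i:j])
            cover.append((text[i:j], (k + 1, k + j - i)))
        else:
            cover.append((text[i], None))
        i = j
    return cover


for piece, span in build_cover("gagagctagcgagcat", "agagagc"):
    print(piece, span)
# gagagc (2, 7)
# t None
# agc (5, 7)
# gagc (4, 7)
# a (1, 1)
# t None
```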

Page 25: Dynamic Text and Static Pattern Matching

Text Cover (cont.)

How does a replacement in the text affect the text cover?

• Initially, in the static stage, we construct a text cover for T.

• We ensure that the cover satisfies both the substring and maximality property.

Page 26: Dynamic Text and Static Pattern Matching

Text Cover following replacement

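Pattern = a g a g a g c (positions 1 2 3 4 5 6 7)

Text = g a g a g c t a g c g a g c a t, with pieces g a g a g c, a g c, g a g c, a

Cover: (2,7) - (5,7) (4,7) (1,1) - (a dash marks a character not in P)

Replacement: the c at text position 10 becomes a.

After splitting the affected piece: (2,7) - (5,6) (1,1) (4,7) (1,1) -

After merging (5,6) (1,1) into (1,3), and then (1,3) (4,7) into (1,7): (2,7) - (1,7) (1,1) -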

Page 27: Dynamic Text and Static Pattern Matching

Updating the Text Cover

At most 5 pieces can violate the maximality property.
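The split-then-merge update can be sketched as follows. For simplicity the cover is kept as a list of strings and the merge loop scans from the left, so this version takes time proportional to the number of pieces, whereas the talk only touches a constant number of pieces around the change. The name update_cover and the list-of-strings representation are assumptions.

```python
def update_cover(pieces, pos, ch, P):
    """pieces: strings whose concatenation is the text. Replace the character at
    1-indexed position pos by ch and restore the substring and maximality properties."""
    out, start = [], 1
    for piece in pieces:
        end = start + len(piece) - 1
        if start <= pos <= end:
            k = pos - start
            # Split into left part, new character, right part (dropping empty parts).
            out.extend(s for s in (piece[:k], ch, piece[k + 1:]) if s)
        else:
            out.append(piece)
        start = end + 1
    i = 0
    while i + 1 < len(out):
        if out[i] + out[i + 1] in P:            # maximality violated: merge the pair
            out[i:i + 2] = [out[i] + out[i + 1]]
            i = max(i - 1, 0)                   # the merged piece may merge leftwards
        else:
            i += 1
    return out


cover = ["gagagc", "t", "agc", "gagc", "a", "t"]
print(update_cover(cover, 10, "a", "agagagc"))
# ['gagagc', 't', 'agagagc', 'a', 't']
```

The merged piece agagagc is the whole pattern, which corresponds to the new occurrence at text position 8.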

Page 28: Dynamic Text and Static Pattern Matching

Substring Concatenation Query

Query: given two substrings of P, P[i,j] and P[k,l], is their concatenation also a substring of P?

Query time: O(log log m).

Preprocessing time: O(m √(log m)) (also uses [BG00]).

Hence, in O(log log m) we can update the cover satisfying both properties.
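The query itself is easy to state in code. The version below is a linear-time stand-in that makes the semantics explicit; it is not the O(log log m) data structure referred to above, and the name concat_is_substring is hypothetical.

```python
def concat_is_substring(P, i, j, k, l):
    """Is P[i..j] + P[k..l] a substring of P? Indices are 1-indexed and inclusive."""
    return (P[i - 1:j] + P[k - 1:l]) in P


P = "agagagc"
print(concat_is_substring(P, 5, 6, 1, 1))  # ag + a = aga: True
print(concat_is_substring(P, 1, 3, 4, 7))  # aga + gagc = agagagc: True
print(concat_is_substring(P, 5, 7, 4, 7))  # agc + gagc = agcgagc: False
```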

Page 29: Dynamic Text and Static Pattern Matching

Find New Matches

Given: a text cover which satisfies both the substring and maximality properties.

Find: all new locations of the pattern in the text.

Page 30: Dynamic Text and Static Pattern Matching

Key Observations

A new match must begin within distance m of the change.

A new match can include at most one entire piece of the cover.

It can span at most three pieces of the cover.

Page 31: Dynamic Text and Static Pattern Matching

Furthermore

A new match can begin in one of at most three pieces of the cover:

– the piece with the change

– the previous piece

– the one previous to that

[Figure: the pattern P aligned against the text T]

Page 32: Dynamic Text and Static Pattern Matching

Simplified Problem

Search starts within a piece X of the cover.

Simple O(m) time algorithm:

– Check each location in X for a pattern start.

– Use suffix trees and LCA queries to compare substrings in constant time.

[Figure: the pattern P, the text T, and the cover piece X]

Page 33: Dynamic Text and Static Pattern Matching

Improved Algorithm

Really, we only have to check each suffix of X that is a pattern prefix.

e.g. X = a g a g a

The KMP automaton can give the necessary information.

However, the time is still O(m)!
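The candidates can be listed by running the KMP automaton of P over X and then following failure links from the final state. This is a sketch of that standard technique under the assumption that P is non-empty; the function names are illustrative.

```python
def failure_function(p):
    """fail[q] = length of the longest proper border of p[:q]."""
    fail = [0] * (len(p) + 1)
    k = 0
    for q in range(2, len(p) + 1):
        while k and p[q - 1] != p[k]:
            k = fail[k]
        if p[q - 1] == p[k]:
            k += 1
        fail[q] = k
    return fail


def suffixes_that_are_prefixes(X, P):
    """Lengths (longest first) of every suffix of X that is also a prefix of P."""
    fail, q = failure_function(P), 0
    for c in X:
        # Fall back on a mismatch, or when the whole pattern has just been matched.
        while q and (q == len(P) or c != P[q]):
            q = fail[q]
        if c == P[q]:
            q += 1
    lengths = []
    while q:                # q is the longest such suffix; its borders are the rest
        lengths.append(q)
        q = fail[q]
    return lengths


print(suffixes_that_are_prefixes("agaga", "agagagc"))  # [5, 3, 1]
```

For X = a g a g a the lengths 5, 3, 1 correspond to agaga, aga and a, so there can still be a linear number of candidates in general.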

Page 34: Dynamic Text and Static Pattern Matching

Improved Algorithm

We can group the prefixes of P by their periods.

Each group of prefixes can be checked in constant time!

There are at most O(log m) groups.

Page 35: Dynamic Text and Static Pattern Matching

Groups (e.g.)

Pattern = a g a g a g c (positions 1 2 3 4 5 6 7)

X = a g a g a

There are three suffixes of X that are also pattern prefixes:

{ agaga, aga } { a }

Prefixes with the same period fall into a single group.
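The grouping in this example can be reproduced by taking the period of a prefix to be its length minus the length of its longest proper border. That definition of "period" is an assumption here (it is the usual smallest period), and the name group_prefixes_by_period is hypothetical.

```python
def failure_function(p):
    """fail[q] = length of the longest proper border of p[:q]."""
    fail = [0] * (len(p) + 1)
    k = 0
    for q in range(2, len(p) + 1):
        while k and p[q - 1] != p[k]:
            k = fail[k]
        if p[q - 1] == p[k]:
            k += 1
        fail[q] = k
    return fail


def group_prefixes_by_period(P, lengths):
    """Map each period to the prefix lengths of P that have that smallest period."""
    fail = failure_function(P)
    groups = {}
    for n in lengths:
        groups.setdefault(n - fail[n], []).append(n)
    return groups


# Prefix lengths 5, 3, 1 are the candidates agaga, aga, a from the example.
print(group_prefixes_by_period("agagagc", [5, 3, 1]))  # {2: [5, 3], 1: [1]}
```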

Page 36: Dynamic Text and Static Pattern Matching

Checking a group in Constant Time

Pattern = a g a g a g c (positions 1 2 3 4 5 6 7)

X = a g a g a

[Figure: X = a g a g a followed by further text, with candidate alignments of the pattern a g a g a g c]

Idea: Match the period ‘ag’ as far as possible. As soon as (ag)* doesn’t match, check for a ‘c’.

Page 37: Dynamic Text and Static Pattern Matching

Groups

A string cannot have more than O(log m) border groups.

Hence, the time of the algorithm is O(log m).

[Intuition: each new group has a new period which has to be at least double the size of the old period. e.g. aagaagaa]

Page 38: Dynamic Text and Static Pattern Matching

Even Better...

We check only a constant number of groups.

Choosing these O(1) groups takes O(log log m) time.

Hence, our algorithm takes O(log log m) time per replacement.

Page 39: Dynamic Text and Static Pattern Matching

Open Problems

Allowing insertions and deletions to the text.

Searching for a set of multiple static patterns.