Dynamic Text and Static Pattern Matching
Amihood Amir, Gad M. Landau,
Moshe Lewenstein, Dina Sokol
Bar-Ilan University
Classical Pattern Matching
Input: pattern P = p1 p2 … pm and text T = t1 t2 … tn over alphabet Σ.
• m is the PATTERN size.
• n is the TEXT size.
Output: locations of T where P appears.
Pattern Matching (e.g.)
Input: P = agca, Σ = {a,g,c,t}
T = aaagcattagctagcagcat
Output: 3, 13, 16, …
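The example above can be checked with a naive scan. This is a minimal sketch, assuming 1-based positions as on the slide; `find_occurrences` is a hypothetical helper name, not something from the talk.

```python
def find_occurrences(text, pattern):
    """Return the 1-based start positions of pattern in text (naive scan)."""
    m = len(pattern)
    return [i + 1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


print(find_occurrences("aaagcattagctagcagcat", "agca"))  # [3, 13, 16]
```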
“Dynamic” Pattern Matching
A. Static Text and Dynamic Pattern.
B. Dynamic Text and Dynamic Pattern.
C. Dynamic Text and Static Pattern.
“Dynamic” Pattern Matching
A. Static Text and Dynamic Pattern.
a.k.a. the indexing problem.
Solution: preprocess the text and answer pattern queries.
Preprocessing data structure: suffix trees [Wei73, McC75, Ukk95, Far97].
Time: O(n) preprocessing, O(m) query time.
“Dynamic” Pattern Matching
A. Static Text and Dynamic Pattern.
Time: O(n) preprocessing, O(m) query time.
B. Dynamic Text and Dynamic Pattern.
a.k.a. the dynamic indexing problem.
Solution: sophisticated data structures [SV96, ABR00].
Time: query O(m + log² n), change O(log² n).
“Dynamic” Pattern Matching
A. Static Text and Dynamic Pattern.
Time: O(n) preprocessing, O(m) query time.
B. Dynamic Text and Dynamic Pattern.
Time: query O(m + log² n), change O(log² n).
C. Dynamic Text and Static Pattern?
Dynamic Text and Static Pattern Matching
Pattern is non-changing.
Text changes over time.
Goal: report new occurrences of the pattern without performing a new search.
Motivation
[Figure: FAX image with run-length compressed rows, e.g. a14, a4b2c3d5, c8a6]
1. Intrusion detection systems
2. Info alerts
3. Two-dimensional run-length compressed matching problem, [ALS03]
Problem Definition
Input: T and P over Σ = {1, …, m}.
Output:
1. At start: all occurrences of P in T.
2. After a change operation:
a. report all new occurrences of P in T.
b. discard all old occurrences of P in T.
Change Operation: change one character in the text, e.g. location 5 from a to b.
“Dynamic” Pattern Matching
A. Static Text and Dynamic Pattern.
Time: O(n) preprocessing, O(m) query time.
B. Dynamic Text and Dynamic Pattern.
Time: query O(m + log² n), change O(log² n).
C. Dynamic Text and Static Pattern.
Time: change and announce O(log log m).
Static Stage
To initially find all occurrences of P in T, use KMP [Knuth-Morris-Pratt ‘77].
All pattern occurrences in a text of length 2m can be stored in O(1) space.
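For reference, the static stage can be sketched with a textbook KMP matcher. This is a minimal version, assuming a non-empty pattern and 1-based output positions; the function names are illustrative and do not come from the slides.

```python
def failure_function(pattern):
    """fail[i] = length of the longest proper border of pattern[:i + 1]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the 1-based start positions of pattern in text in O(n + m) time."""
    fail = failure_function(pattern)
    matches, k = [], 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 2)  # i - k + 1 is the 0-based start
            k = fail[k - 1]            # keep going to catch overlapping matches
    return matches


print(kmp_search("aaagcattagctagcagcat", "agca"))  # [3, 13, 16]
```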
Succinct Output
Assumption: the text is of size 2m. (Break the text T into overlapping strings of length 2m-1.)
[Figure: T marked at positions 1, m, 2m, 3m, 4m, with P aligned below.]
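One way to realize the splitting is to start a block of length 2m-1 at every multiple of m, so that each occurrence lies entirely inside exactly one block. The sketch below assumes that layout (the slides do not spell out the block offsets) and uses a naive scan inside each block; both helper names are hypothetical.

```python
def overlapping_blocks(text, m):
    """Yield (offset, block): blocks of length at most 2m-1 starting at multiples of m."""
    for offset in range(0, len(text), m):
        yield offset, text[offset:offset + 2 * m - 1]


def occurrences_by_block(text, pattern):
    """Collect the 1-based occurrences of pattern block by block."""
    m = len(pattern)
    found = []
    for offset, block in overlapping_blocks(text, m):
        # A block of length 2m-1 can only hold starts at offsets 0..m-1,
        # so no occurrence is reported by two different blocks.
        for i in range(len(block) - m + 1):
            if block[i:i + m] == pattern:
                found.append(offset + i + 1)
    return found


print(occurrences_by_block("aaagcattagctagcagcat", "agca"))  # [3, 13, 16]
```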
Succinct Output (cont.)
P is periodic: a string P is periodic if it matches itself when shifted by at most |P|/2 positions.
e.g. P = abcabcabca matches itself at a shift of 3:
abcabcabca
   abcabcabca
Store the output as a ‘chain’ of pattern occurrences.
P is non-periodic: by definition, no more than two occurrences in a text of size 2m.
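A chain can be stored as a triple (first start, period, count). The sketch below is an illustration under that assumption; it computes the period naively and groups sorted 0-based start positions into runs whose consecutive starts differ by the period. The names are hypothetical.

```python
def smallest_period(s):
    """Smallest p >= 1 such that s matches itself when shifted by p."""
    for p in range(1, len(s) + 1):
        if s[p:] == s[:len(s) - p]:
            return p


def is_periodic(pattern):
    """Periodic in the sense of the slide: the period is at most |P|/2."""
    return 2 * smallest_period(pattern) <= len(pattern)


def compress_chain(starts, period):
    """Group sorted start positions into (first, period, count) runs."""
    runs = []
    for s in starts:
        if runs and s == runs[-1][0] + runs[-1][1] * period:
            runs[-1][1] += 1
        else:
            runs.append([s, 1])
    return [(first, period, count) for first, count in runs]


pattern = "abcabcabca"
text = "abcabcabcabcabcabcab"  # length 2m
starts = [i for i in range(len(text) - len(pattern) + 1) if text.startswith(pattern, i)]
print(is_periodic(pattern), compress_chain(starts, smallest_period(pattern)))
# True [(0, 3, 4)]
```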
On-line Algorithm
Following each replacement:
Delete old matches that are no longer pattern occurrences.
Find new matches.
Delete Old Matches
Deleting is trivial since we store the matches in constant space:
P is periodic: Truncate the chain of pattern occurrences.
P is non-periodic: Discard all matches that overlap the replacement, i.e. that begin within distance m before it.
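As a rough illustration of the deletion step, the sketch below works on an explicit list of 0-based match starts rather than on the constant-space representation used in the talk; `delete_old_matches` is a hypothetical name.

```python
def delete_old_matches(matches, pos, m):
    """Drop every match (given by its 0-based start) that covers text position pos."""
    return [s for s in matches if not (s <= pos < s + m)]


# Matches of a length-4 pattern at starts 2, 12, 15; position 15 is replaced.
print(delete_old_matches([2, 12, 15], 15, 4))  # [2]
```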
Find New Matches
Challenge: How can we locate occurrences of P, following each replacement, without actually searching for P?
Main Idea - Text Covers
We ‘cover’ the text with substrings of the pattern, i.e. store the text in terms of P.
Pattern = a g a g a g c
          1 2 3 4 5 6 7
Text = g a g a g c t a g c g a g c a t
Pieces: g a g a g c = [2,7], a g c = [5,7], g a g c = [4,7], a = [1,1]
Cover: [2,7] - [5,7] [4,7] [1,1] -
Text Cover (cont.)
The text cover must satisfy two properties:
Substring Property: each element of the cover is a substring of P, or a character not included in P.
Maximality Property: no two adjacent elements can concatenate to form a substring of P.
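A cover with both properties can be obtained greedily: at each text position take the longest substring of P that starts there. The sketch below assumes this greedy construction (the slides do not say how the initial cover is built) and uses naive substring tests. Pieces are reported as 1-based (i, j) pairs into P, and `None` marks a character that does not occur in P.

```python
def build_cover(text, pattern):
    """Greedy text cover: a list of (i, j) pairs into pattern, or None."""
    cover, pos = [], 0
    while pos < len(text):
        length = 0
        # Extend the piece while it is still a substring of the pattern.
        while pos + length < len(text) and text[pos:pos + length + 1] in pattern:
            length += 1
        if length == 0:
            cover.append(None)  # character not included in P
            pos += 1
        else:
            start = pattern.find(text[pos:pos + length])
            cover.append((start + 1, start + length))
            pos += length
    return cover


print(build_cover("gagagctagcgagcat", "agagagc"))
# [(2, 7), None, (5, 7), (4, 7), (1, 1), None]
```

Maximality holds because each piece is already the longest match at its start, so appending even the first character of the next piece falls outside P.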
Text Cover (cont.)
How does a replacement in the text affect the text cover?
• Initially, in the static stage, we construct a text cover for T.
• We ensure that the cover satisfies both the substring and maximality property.
Text Cover following replacement
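Pattern = a g a g a g c
          1 2 3 4 5 6 7
Text = g a g a g c t a g c g a g c a t
Pieces: g a g a g c, a g c, g a g c, a
Cover: (2,7) - (5,7) (4,7) (1,1) -
Replacement: the c of the piece a g c becomes a.
Cover after splitting: (2,7) - (5,6) (1,1) (4,7) (1,1) -
Merging (5,6) (1,1) gives (1,3); merging (1,3) (4,7) gives (1,7).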
Substring Concatenation Query
Query: Given two substrings of P, P[i,j] and P[k,l], is their concatenation also a substring of P?
Query time: O(log log m).
Preprocessing time: O(m √log m) (also uses [BG00]).
Hence, in O(log log m) time we can update the cover so that it satisfies both properties.
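The cover update can be sketched as split-then-merge. This is an illustration only: the concatenation query is a naive substring test standing in for the O(log log m) structure, pieces are kept as plain strings, and the merge loop scans the whole cover instead of touching just the pieces around the change. All names are hypothetical.

```python
def concat_is_substring(pattern, a, b):
    """Naive stand-in for the substring concatenation query."""
    return (a + b) in pattern


def update_cover(cover, pattern, pos, new_char):
    """Replace text position pos (0-based) by new_char and repair the cover."""
    # Locate the piece that contains pos.
    idx, start = 0, 0
    while start + len(cover[idx]) <= pos:
        start += len(cover[idx])
        idx += 1
    piece, off = cover[idx], pos - start
    # Split it into left part, new character, right part (empty parts dropped).
    parts = [p for p in (piece[:off], new_char, piece[off + 1:]) if p]
    cover = cover[:idx] + parts + cover[idx + 1:]
    # Restore maximality: merge adjacent pieces while the result stays inside P.
    i = 0
    while i + 1 < len(cover):
        if concat_is_substring(pattern, cover[i], cover[i + 1]):
            cover[i:i + 2] = [cover[i] + cover[i + 1]]
            i = max(i - 1, 0)  # the merged piece may now join its left neighbour
        else:
            i += 1
    return cover


old = ["gagagc", "t", "agc", "gagc", "a", "t"]
print(update_cover(old, "agagagc", 9, "a"))
# ['gagagc', 't', 'agagagc', 'a', 't']
```

On the running example this reproduces the merges (5,6)(1,1) to (1,3) and (1,3)(4,7) to (1,7).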
Find New Matches
Given: a text cover which satisfies both the substring and maximality properties.
Find: all new locations of the pattern in the text.
Key Observations
A new match must begin within distance m before the change.
A new match can include at most one entire piece of the cover.
It can span at most three pieces of the cover.
Furthermore
A new match can begin in one of at most three pieces of the cover:
– the piece with the change
– the previous piece
– the one previous to that
Simplified Problem
The search starts within a piece X of the cover.
Simple O(m) time algorithm:
– Check each location in X for a pattern start.
– Use suffix trees and LCA queries to compare substrings in constant time.
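As a baseline for comparison, the sketch below checks every start within distance m before the change by direct string comparison. It is not the algorithm of the talk: each check costs O(m) here, whereas the slides compare substrings in constant time with suffix trees and LCA queries. It assumes 0-based positions and that `new_char` differs from the old character; the function name is hypothetical.

```python
def new_matches_after_change(text, pattern, pos, new_char):
    """Return (updated text, 0-based starts of matches that cover position pos)."""
    m = len(pattern)
    updated = text[:pos] + new_char + text[pos + 1:]
    first = max(pos - m + 1, 0)          # earliest start that still covers pos
    last = min(pos, len(updated) - m)    # latest start that fits in the text
    found = [s for s in range(first, last + 1) if updated[s:s + m] == pattern]
    return updated, found


print(new_matches_after_change("gagagctagcgagcat", "agagagc", 9, "a"))
# ('gagagctagagagcat', [7])
```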
Improved Algorithm
Really, we only have to check each suffix of X that is a pattern prefix.
e.g. X = a g a g a
The KMP automaton can give the necessary information.
However, the time is still O(m)!
Improved Algorithm
We can group the prefixes of P by their periods.
Each group of prefixes can be checked in constant time!
There are at most O(log m) groups.
Groups (eg.)
Pattern = a g a g a g c
          1 2 3 4 5 6 7
X = a g a g a
There are three suffixes of X that are also pattern prefixes:
{ agaga, aga } { a }
Prefixes with the same period fall into a single group.
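The grouping in this example can be reproduced from the KMP failure function: run the automaton over X, walk the failure links to list every suffix of X that is a pattern prefix, and use length minus longest border as the period of each prefix. This is a minimal sketch with hypothetical names, assuming a non-empty pattern.

```python
def failure_function(pattern):
    """fail[i] = length of the longest proper border of pattern[:i + 1]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def border_groups(x, pattern):
    """Group the suffixes of x that are prefixes of pattern by their period."""
    fail = failure_function(pattern)
    k = 0
    for ch in x:  # run the KMP automaton over x
        while k > 0 and (k == len(pattern) or ch != pattern[k]):
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
    groups = {}
    while k > 0:  # follow failure links: longest candidate first
        period = k - fail[k - 1]
        groups.setdefault(period, []).append(pattern[:k])
        k = fail[k - 1]
    return groups


print(border_groups("agaga", "agagagc"))
# {2: ['agaga', 'aga'], 1: ['a']}
```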
Checking a group in Constant Time
Pattern = a g a g a g c
          1 2 3 4 5 6 7
X = a g a g a
[Figure: text continuations after X, aligned against the pattern.]
Idea: Match the period ‘ag’ as far as possible. As soon as (ag)* doesn’t match, check for a ‘c’.
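The idea can be sketched as follows. Every candidate in a group shares the same period, so the periodic stretch of the text is extended once, and its end fixes the only start that can line up with the place where the pattern breaks its period. The extension below is a plain loop standing in for the constant-time LCA query of the talk. The function name, the 0-based `starts` list, and the assumption that each start already begins a pattern prefix with period `p` are all illustrative.

```python
def check_group(text, pattern, starts, p):
    """Return the starts in one period group that are full pattern matches."""
    m = len(pattern)
    # q: length of the longest prefix of the pattern that has period p.
    q = p
    while q < m and pattern[q] == pattern[q - p]:
        q += 1
    # e: end of the periodic stretch of the text beginning at the first start.
    s0 = min(starts)
    e = min(s0 + p, len(text))
    while e < len(text) and text[e] == text[e - p]:
        e += 1
    if q == m:
        # The whole pattern is periodic: any start that fits in the stretch matches.
        return [s for s in sorted(starts) if s + m <= e]
    # Otherwise the pattern breaks its period at q, so text and pattern must
    # break at the same place: one candidate start, one comparison of the rest.
    s = e - q
    if s in starts and text[s + q:s + m] == pattern[q:]:
        return [s]
    return []


# X = agaga gives candidate starts 0 (agaga) and 2 (aga), both with period 2.
print(check_group("agagagagc", "agagagc", [0, 2], 2))  # [2]
print(check_group("agagagc", "agagagc", [0, 2], 2))    # [0]
print(check_group("agagagt", "agagagc", [0, 2], 2))    # []
```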
Groups
A string cannot have more than O(log m) border groups.
Hence, the time of the algorithm is O(log m).
[Intuition: each new group has a new period which has to be at least double the size of the old period. e.g. aagaagaa]
Even Better...
We check only a constant number of groups.
Choosing these O(1) groups takes O(log log m) time.
Hence, our algorithm takes O(log log m) time per replacement.