Dynamic Text and Static Pattern Matching



  • Dynamic Text and Static Pattern Matching

    Amihood Amir, Gad M. Landau, Moshe Lewenstein, Dina Sokol

    Bar-Ilan University

  • Classical Pattern Matching

    Input:
    - Pattern P = p1 p2 ... pm
    - Text T = t1 t2 t3 ... tn
    over alphabet Σ. m is the pattern size; n is the text size.

    Output: locations of T where P appears.

  • Pattern Matching (e.g.)

    Input: P = agca, Σ = {a,g,c,t}

    T = aaagcattagctagcagcat

    Output: 3, 13, 16
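
The example above can be reproduced with a direct scan. This is a minimal sketch, not the method of the talk; the function name `find_occurrences` is an assumption made for illustration, and positions are reported 1-indexed as on the slide.

```python
def find_occurrences(pattern: str, text: str) -> list[int]:
    """Return every 1-indexed position of text at which pattern occurs.

    Naive O(n * m) scan: compare the pattern against each window of the text.
    """
    m, n = len(pattern), len(text)
    return [i + 1 for i in range(n - m + 1) if text[i:i + m] == pattern]


print(find_occurrences("agca", "aaagcattagctagcagcat"))  # [3, 13, 16]
```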

  • Dynamic Pattern Matching

    A. Static Text and Dynamic Pattern.

    B. Dynamic Text and Dynamic Pattern.

    C. Dynamic Text and Static Pattern.

  • Dynamic Pattern Matching

    A. Static Text and Dynamic Pattern.

    a.k.a. the indexing problem
    Solution: preprocess the text and answer pattern queries.
    Data structure: suffix trees [Wei73, McC75, Ukk95, Far97]
    Time: O(n) preprocessing, O(m) query time

  • Dynamic Pattern Matching

    A. Static Text and Dynamic Pattern.
       Time: O(n) preprocessing, O(m) query time

    B. Dynamic Text and Dynamic Pattern.
       a.k.a. the dynamic indexing problem
       Solution: sophisticated data structures [SV96, ABR00]
       Time: query O(m + log^2 n), change O(log^2 n)

  • Dynamic Pattern Matching

    A. Static Text and Dynamic Pattern.
       Time: O(n) preprocessing, O(m) query time

    B. Dynamic Text and Dynamic Pattern.
       Time: query O(m + log^2 n), change O(log^2 n)

    C. Dynamic Text and Static Pattern?

  • Dynamic Text and Static Pattern Matching

    Pattern is non-changing.
    Text changes over time.

    Goal: report new occurrences of the pattern without performing a new search.

  • Motivation

    1. Intrusion detection systems

    2. Info alerts

    3. Two-dimensional run-length compressed matching problem [ALS03]

    [Figure: FAX example, a14a4b2c3d5c8a6]

  • Problem Definition

    Input: T and P over Σ = {1, ..., m}.

    Output:
    1. At start: all occurrences of P in T.
    2. After each change operation:
       a. report all new occurrences of P in T;
       b. discard all old occurrences of P in T.

    Change operation: change one character in the text, e.g. location 5 from a to b.

  • Example

    Input: P = agagagc = (ag)^3 c, Σ = {a,g,c,t}
    T = g a g a g c t a g c g a g c a t

  • Example

    Input: P = agagagc = (ag)^3 c, Σ = {a,g,c,t}
    T = g a g a g c t a g a g a g c a t   (position 10 changed from c to a)

  • Example

    Input: P = agagagc = (ag)^3 c, Σ = {a,g,c,t}
    T = g a g a g c t a g a g a g c a t   (position 10 changed; P now occurs at position 8)
    Output: {8}
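
To make the input/output contract of the problem concrete, here is a brute-force baseline, assuming nothing beyond the problem definition. It rechecks every start position whose window covers the changed character, so each change costs O(m^2) rather than the O(log log m) of the talk. The class name `DynamicTextBaseline` and the helper `occurs_at` are hypothetical names chosen for this sketch.

```python
def occurs_at(pattern, text, start):
    """True if pattern occurs in text (a list of characters) at 0-indexed start."""
    m = len(pattern)
    if start < 0 or start > len(text) - m:
        return False
    return all(text[start + k] == pattern[k] for k in range(m))


class DynamicTextBaseline:
    """Brute-force reference: static pattern, text changed one character at a time."""

    def __init__(self, pattern, text):
        self.pattern = pattern
        self.text = list(text)
        # Occurrences are stored as 1-indexed start positions.
        self.matches = {i + 1 for i in range(len(self.text))
                        if occurs_at(pattern, self.text, i)}

    def change(self, position, char):
        """Set text[position] (1-indexed) to char; return (new, discarded) occurrences."""
        m = len(self.pattern)
        idx = position - 1
        # Only occurrences starting in this window can contain the changed character.
        window = range(max(0, idx - m + 1), idx + 1)
        old = {s + 1 for s in window if s + 1 in self.matches}
        self.text[idx] = char
        now = {s + 1 for s in window if occurs_at(self.pattern, self.text, s)}
        self.matches = (self.matches - old) | now
        return now - old, old - now


d = DynamicTextBaseline("agagagc", "gagagctagcgagcat")
print(d.change(10, "a"))  # ({8}, set())
```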

  • Results

    After O(n log log m + ) preprocessing time,

    O(log log m) time per replacement.

  • Dynamic Pattern Matching

    A. Static Text and Dynamic Pattern.
       Time: O(n) preprocessing, O(m) query time

    B. Dynamic Text and Dynamic Pattern.
       Time: query O(m + log^2 n), change O(log^2 n)

    C. Dynamic Text and Static Pattern.
       Time: change and announce O(log log m)

  • Static Stage

    To initially find all occurrences of P in T, use KMP [Knuth-Morris-Pratt 77].

    All pattern occurrences in a text of length 2m can be stored in O(1) space.
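
For reference, a compact sketch of the KMP search used in the static stage, assuming a non-empty pattern. The names `failure_function` and `kmp_search` are illustrative; `fail[k]` holds the length of the longest proper border of the pattern prefix of length k.

```python
def failure_function(pattern: str) -> list[int]:
    """fail[k] = length of the longest proper border of pattern[:k]."""
    fail = [0] * (len(pattern) + 1)
    k = 0
    for i in range(1, len(pattern)):
        # Fall back through shorter borders until one can be extended.
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i + 1] = k
    return fail


def kmp_search(pattern: str, text: str) -> list[int]:
    """Return the 1-indexed start of every occurrence of pattern, in O(n + m) time."""
    fail = failure_function(pattern)
    matches, k = [], 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 2)  # 0-indexed end i -> 1-indexed start
            k = fail[k]                # continue, allowing overlapping matches
    return matches


print(kmp_search("agagagc", "gagagctagagagcat"))  # [8]
```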

  • Succinct Output

    Assumption: the text is of size 2m. (Break the text T into overlapping strings of length 2m-1.)

    [Figure: text T with positions 1, m, 2m, 3m, 4m marked, and pattern P]

  • Succinct Output (cont.)

    P is periodic: a string p is periodic if it matches itself before position |P|/2, e.g. p = abcabcabca, which matches itself when shifted by 3. Store the output as a chain of pattern occurrences.

    P is non-periodic: by definition, no more than two occurrences.
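
One plausible reading of "chain" is an arithmetic progression of start positions whose step is the pattern's period. The sketch below works under that assumption; the names `smallest_period` and `compress_occurrences` and the `(first, step, count)` triple are hypothetical, not taken from the talk.

```python
def smallest_period(s: str) -> int:
    """Smallest period of s, computed as len(s) minus its longest proper border."""
    fail = [0] * (len(s) + 1)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = fail[k]
        if s[i] == s[k]:
            k += 1
        fail[i + 1] = k
    return len(s) - fail[len(s)]


def compress_occurrences(occurrences, period):
    """Pack sorted start positions into (first, step, count) chains.

    Consecutive occurrences exactly `period` apart extend the current chain.
    """
    chains = []
    for pos in occurrences:
        if chains and pos == chains[-1][0] + chains[-1][2] * period:
            first, step, count = chains[-1]
            chains[-1] = (first, step, count + 1)
        else:
            chains.append((pos, period, 1))
    return chains


pattern = "abcabcabca"                      # periodic: period 3 < |P|/2
text = "abc" * 6 + "ab"                     # length 2m = 20
starts = [i + 1 for i in range(len(text) - len(pattern) + 1)
          if text[i:i + len(pattern)] == pattern]
print(compress_occurrences(starts, smallest_period(pattern)))  # [(1, 3, 4)]
```

With a periodic pattern the four occurrences collapse into a single triple, which is what makes constant-space storage per length-2m block possible.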

  • On-line Algorithm

    Following each replacement:

    Delete old matches that are no longer pattern occurrences.

    Find new matches.

  • Delete Old Matches

    Deleting is trivial since we store the matches in constant space:

    P is periodic: truncate the chain of pattern occurrences.

    P is non-periodic: discard all matches that are within distance m of the replacement.

  • Find New Matches

    Challenge: How can we locate occurrences of P, following each replacement, without actually searching for P?

  • Main Idea - Text Covers

    We cover the text with substrings of the pattern, i.e. store the text in terms of P.

    Pattern = a g a g a g c   (positions 1 2 3 4 5 6 7)
    Text    = g a g a g c t a g c g a g c a t
    Cover:    g a g a g c [2,7], t, a g c [5,7], g a g c [4,7], a [1,1], t

  • Text Cover (cont.)

    The text cover must satisfy two properties:

    Substring Property: each element of the cover is a substring of P, or a character not included in P.

    Maximality Property: no two adjacent elements can concatenate to form a substring of P.

  • Text Cover (cont.)

    How does a replacement in the text affect the text cover?

    Initially, in the static stage, we construct a text cover for T. We ensure that the cover satisfies both the substring and maximality properties.
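
A cover with both properties can be built greedily: at each text position take the longest substring of P that starts there (or a single character if it does not occur in P). The sketch below is an assumption about one simple way to do this, using Python's `in` for substring tests rather than the suffix-tree machinery of the talk; `build_cover` and `is_valid_cover` are hypothetical names.

```python
def build_cover(pattern: str, text: str) -> list[str]:
    """Greedy text cover: each element is a longest substring of pattern
    starting at the current position, or a single character not in pattern."""
    cover, i = [], 0
    while i < len(text):
        j = i + 1  # a single character always forms an element
        while j < len(text) and text[i:j + 1] in pattern:
            j += 1
        cover.append(text[i:j])
        i = j
    return cover


def is_valid_cover(pattern: str, text: str, cover: list[str]) -> bool:
    """Check that cover spells text and has the substring and maximality properties."""
    spells_text = "".join(cover) == text
    substring_ok = all(piece in pattern or len(piece) == 1 for piece in cover)
    maximal_ok = all(a + b not in pattern for a, b in zip(cover, cover[1:]))
    return spells_text and substring_ok and maximal_ok


cover = build_cover("agagagc", "gagagctagcgagcat")
print(cover)  # ['gagagc', 't', 'agc', 'gagc', 'a', 't']
```

Greedy extension gives maximality for free: if an element could be joined with its right neighbour, it could also have been extended by that neighbour's first character.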

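  • Text Cover following replacement

    Pattern = a g a g a g c   (positions 1 2 3 4 5 6 7)
    Text    = g a g a g c t a g c g a g c a t   (the c of the piece a g c is replaced by a)

    Cover before:      (2,7) - (5,7) (4,7) (1,1) -
    After replacement: (2,7) - (5,6) (1,1) (4,7) (1,1) -
    Merging: (5,6) (1,1) becomes (1,3); (1,3) (4,7) becomes (1,7)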

  • Updating the Text Cover

    At most 5 pieces can violate the maximality property.
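
The repair step can be sketched as "split the affected piece, then re-merge neighbours". This is a simplified illustration, assuming pieces are stored as plain strings and substring tests use Python's `in`; it scans to the end of the cover for simplicity instead of stopping after the few pieces that can be affected. The name `replace_and_repair` is hypothetical.

```python
def replace_and_repair(pattern, cover, position, char):
    """Replace the text character at 1-indexed position and restore both
    cover properties. Returns a new cover; the input list is not modified."""
    # 1. Find the piece that contains the position and split it around it.
    offset = position - 1
    for idx, piece in enumerate(cover):
        if offset < len(piece):
            break
        offset -= len(piece)
    else:
        raise IndexError("position outside the text")
    parts = [piece[:offset], char, piece[offset + 1:]]
    cover = cover[:idx] + [p for p in parts if p] + cover[idx + 1:]

    # 2. Merge adjacent pieces whose concatenation is a substring of pattern.
    i = max(idx - 1, 0)
    while i < len(cover) - 1:
        joined = cover[i] + cover[i + 1]
        if joined in pattern:
            cover[i:i + 2] = [joined]
            i = max(i - 1, 0)  # the merged piece may now join its left neighbour
        else:
            i += 1
    return cover


before = ["gagagc", "t", "agc", "gagc", "a", "t"]
print(replace_and_repair("agagagc", before, 10, "a"))
# ['gagagc', 't', 'agagagc', 'a', 't']
```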

  • Substring Concatenation Query

    Query: Given two substrings of P, P[i,j] and P[k,l]. Is their concatenation also a substring of P?

    Query time: O(log log m).

    Preprocessing time: (also uses - [BG00])

    Hence, in O(log log m) time we can update the cover satisfying both properties.
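
The semantics of the query can be shown with a direct search. This sketch only illustrates what is being asked; it takes time linear or worse in m per query and is not the O(log log m) data structure referred to above. The function name `concat_query` is an assumption, and indices are 1-indexed and inclusive as on the slides.

```python
def concat_query(pattern: str, i: int, j: int, k: int, l: int):
    """Is P[i..j] + P[k..l] a substring of P?

    Returns the 1-indexed inclusive (start, end) of one occurrence, or None.
    """
    joined = pattern[i - 1:j] + pattern[k - 1:l]
    at = pattern.find(joined)
    return None if at < 0 else (at + 1, at + len(joined))


P = "agagagc"
print(concat_query(P, 5, 6, 1, 1))  # (1, 3)   ag + a   = aga
print(concat_query(P, 1, 3, 4, 7))  # (1, 7)   aga + gagc = agagagc
print(concat_query(P, 5, 7, 4, 7))  # None     agc + gagc is not in P
```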

  • Find New Matches

    Given: a text cover which satisfies both the substring and maximality properties.

    Find: all new locations of the pattern in the text.

  • Key Observations

    A new match must begin within distance m of the change.
    A new match can include at most one entire piece of the cover.
    It can span at most three pieces of the cover.

  • Furthermore

    A new match can begin in one of at most three pieces of the cover:
    - the piece with the change
    - the previous piece
    - the one previous to that

    [Figure: pattern P aligned against text T]

  • Simplified Problem

    Search starts within a piece X of the cover.

    Simple O(m) time algorithm:
    Check each location in X for a pattern start.
    Use suffix trees and LCA queries to compare substrings in constant time.

    [Figure: pattern P aligned against text T at a start inside piece X]

  • Improved Algorithm

    Really, we only have to check each suffix of X that is a pattern prefix, e.g. X = a g a g a.

    The KMP automaton can give the necessary information. However, the time is still O(m)!
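
The candidate suffixes can be read off the KMP failure links: run the automaton over X to get the longest suffix of X that is a prefix of P, then follow failure links to enumerate the shorter ones. A minimal sketch, assuming a non-empty pattern; `failure_function` and `prefix_suffix_lengths` are names chosen for this example.

```python
def failure_function(pattern: str) -> list[int]:
    """fail[k] = length of the longest proper border of pattern[:k]."""
    fail = [0] * (len(pattern) + 1)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i + 1] = k
    return fail


def prefix_suffix_lengths(pattern: str, x: str) -> list[int]:
    """Lengths of all suffixes of x that are prefixes of pattern, longest first."""
    fail = failure_function(pattern)
    k = 0
    for ch in x:
        # Fall back on a mismatch, or after a full match of the pattern.
        while k > 0 and (k == len(pattern) or ch != pattern[k]):
            k = fail[k]
        if ch == pattern[k]:
            k += 1
    lengths = []
    while k > 0:  # every border of the longest candidate is also a candidate
        lengths.append(k)
        k = fail[k]
    return lengths


print(prefix_suffix_lengths("agagagc", "agaga"))  # [5, 3, 1]
```

The three lengths correspond to the suffixes agaga, aga and a of X.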

  • Improved Algorithm

    We can group the prefixes of P by their periods.

    Each group of prefixes can be checked in constant time!

    There are at most O(log m) groups.

  • Groups (e.g.)

    Pattern = a g a g a g c   (positions 1 2 3 4 5 6 7)
    X = a g a g a

    There are three suffixes of X that are also pattern prefixes: { agaga, aga } { a }

    Prefixes with the same period fall into a single group.
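
The grouping in this example can be reproduced by computing the smallest period of each candidate prefix as its length minus its longest border. This sketch follows the slide's description ("same period, same group"); the exact group definition in the underlying work may be more refined, and the names `failure_function` and `group_by_period` are assumptions.

```python
def failure_function(pattern: str) -> list[int]:
    """fail[k] = length of the longest proper border of pattern[:k]."""
    fail = [0] * (len(pattern) + 1)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i + 1] = k
    return fail


def group_by_period(pattern: str, lengths: list[int]) -> dict[int, list[str]]:
    """Group the pattern prefixes of the given lengths by their smallest period."""
    fail = failure_function(pattern)
    groups: dict[int, list[str]] = {}
    for k in lengths:
        period = k - fail[k]  # smallest period of pattern[:k]
        groups.setdefault(period, []).append(pattern[:k])
    return groups


# Candidate prefix lengths for X = agaga are 5, 3 and 1.
print(group_by_period("agagagc", [5, 3, 1]))
# {2: ['agaga', 'aga'], 1: ['a']}
```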

  • Checking a group in Constant Time

    Pattern = a g a g a g c   (positions 1 2 3 4 5 6 7)
    X = a g a g a

    [Figure: X = a g a g a followed by further text, with the pattern a g a g a g c aligned against it]

    Idea: match the period ag as far as possible. As soon as (ag)* doesn't match, check for a c.

  • Groups

    A string cannot have more than O(log m) border groups. Hence, the time of the algorithm is O(log m).

    [Intuition: each new group has a new period which has to be at least double the size of the old period, e.g. aagaagaa]

  • Even Better...

    We check only a constant number of groups.
    Choosing these O(1) groups takes O(log log m) time.
    Hence, our algorithm takes O(log log m) time per replacement.

  • Open Problems

    Allowing insertions and deletions to the text.
    Searching for a set of multiple static patterns.