Second Exam


description

Algorithm pattern matching


  • Pattern Matching Algorithms: An Overview
    Shoshana Neuburger
    The Graduate Center, CUNY

    9/15/2009

    Overview
      • Pattern Matching in 1D
      • Dictionary Matching
      • Pattern Matching in 2D
      • Indexing
          Suffix Tree
          Suffix Array
      • Research Directions

    What is Pattern Matching?
    Given a pattern and a text, find the pattern in the text.

    What is Pattern Matching?
    $\Sigma$ is an alphabet.
    Input:
      • Text $T = t_1 t_2 \cdots t_n$
      • Pattern $P = p_1 p_2 \cdots p_m$
    Output: All $i$ such that $t_i t_{i+1} \cdots t_{i+m-1} = p_1 p_2 \cdots p_m$.

    Pattern Matching - Example
    Input: P = cagc, $\Sigma$ = {a, g, c, t}, T = acagcatcagcagctagcat

    Output: {2, 8, 11}

      1 2 3 4 5 6 7 8 . . 11
      a c a g c a t c a g c a g c t a g c a t

    Pattern Matching Algorithms
    Naive approach:
      • Compare the pattern to the text at each location.
      • O(mn) time.

    More efficient algorithms utilize information from previous comparisons.
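    The naive approach can be sketched as follows. This is a minimal illustration rather than code from the slides; the function name `naive_match` is an assumption, and positions are reported 1-indexed to agree with the example output {2, 8, 11}.

```python
def naive_match(text, pattern):
    """Return all 1-indexed positions where pattern occurs in text."""
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):
        # Compare the pattern against the text window starting at i.
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            matches.append(i + 1)  # report 1-indexed, as on the slides
    return matches


print(naive_match("acagcatcagcagctagcat", "cagc"))  # [2, 8, 11]
```

    Each of the n - m + 1 windows may cost up to m comparisons, which is where the O(mn) bound comes from.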

    Pattern Matching Algorithms
    Linear-time methods have two stages:
      • Preprocess the pattern in O(m) time and space.
      • Scan the text in O(n) time and space.

    Knuth, Morris, Pratt (1977): automata method
    Boyer, Moore (1977): can be sublinear

    KMP Automaton
    P = ababcb
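    A minimal sketch of the KMP preprocessing and scan for this pattern, written with a failure table instead of an explicit automaton drawing. The names `failure_table` and `kmp_search` are assumptions made for illustration.

```python
def failure_table(pattern):
    """fail[j] = length of the longest proper border of pattern[:j + 1]."""
    fail = [0] * len(pattern)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]  # fall back to the next shorter border
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    return fail


def kmp_search(text, pattern):
    """Return 1-indexed start positions of pattern in text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    matches, k = [], 0
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 2)  # convert 0-indexed end to 1-indexed start
            k = fail[k - 1]            # keep going to find overlapping matches
    return matches


print(failure_table("ababcb"))                     # [0, 0, 1, 2, 0, 0]
print(kmp_search("acagcatcagcagctagcat", "cagc"))  # [2, 8, 11]
```

    The text pointer never moves backward, so the scan is O(n) after O(m) preprocessing.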

    Dictionary Matching
    $\Sigma$ is an alphabet.
    Input:
      • Text $T = t_1 t_2 \cdots t_n$
      • Dictionary of patterns $D = \{P_1, P_2, \ldots, P_k\}$
    All characters in patterns and text belong to $\Sigma$.
    Output: All $i, j$ such that $t_i t_{i+1} \cdots t_{i+m_j-1} = P_j$, where $m_j = |P_j|$.

    Dictionary Matching Algorithms
    Naive approach:
      • Use an efficient pattern matching algorithm for each pattern in the dictionary.
      • O(kn) time.

    More efficient algorithms process the text once.

    AC Automaton
    Aho and Corasick extended the KMP automaton to dictionary matching.

    Preprocessing time: O(d)
    Matching time: $O(n \log |\Sigma| + k)$
    Independent of dictionary size!

    AC Automaton
    D = {ab, ba, bab, babb, bb}
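    A minimal sketch of an Aho-Corasick automaton for this dictionary. The names `build_ac` and `ac_search` are assumptions, and goto transitions are stored in Python dicts (hashing) rather than in the ordered branching structure behind the log factor on the previous slide.

```python
from collections import deque


def build_ac(patterns):
    """Build goto, failure, and output tables for a list of patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # indices of the patterns recognized at each state
    for idx, p in enumerate(patterns):
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].append(idx)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for c, t in goto[s].items():
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(c, 0)
            out[t] = out[t] + out[fail[t]]  # inherit matches ending here
            queue.append(t)
    return goto, fail, out


def ac_search(text, patterns):
    """Return (1-indexed start, pattern) for every occurrence in text."""
    goto, fail, out = build_ac(patterns)
    s, hits = 0, []
    for i, c in enumerate(text):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        for idx in out[s]:
            hits.append((i - len(patterns[idx]) + 2, patterns[idx]))
    return hits


D = ["ab", "ba", "bab", "babb", "bb"]
print(sorted(ac_search("babb", D)))
# [(1, 'ba'), (1, 'bab'), (1, 'babb'), (2, 'ab'), (3, 'bb')]
```

    The text is read once, regardless of how many patterns the dictionary holds.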

    Dictionary Matching
      • The KMP automaton does not depend on alphabet size, while the AC automaton does, because of branching.
      • Dori, Landau (2006): the AC automaton can be built in linear time for integer alphabets.
      • Breslauer (1995) eliminates the log factor in the text scanning stage.

    Periodicity
    A crucial task in the preprocessing stage of most pattern matching algorithms: computing periodicity.

    Many forms:
      • failure table
      • witnesses

    Periodicity
    A periodic pattern can be superimposed on itself without mismatch before its midpoint.

    Why is periodicity useful?
    It can quickly eliminate many candidates for pattern occurrence.

    Periodicity
    Definition:
      • S is periodic if S = […] and […] is a proper suffix of […].
      • S is periodic if its longest prefix that is also a suffix is at least half |S|.
      • The shortest period corresponds to the longest border.

    Periodicity - Example
    S = abcabcabcab, |S| = 11
    Longest border of S: b = abcabcab; |b| = 8, so S is periodic.
    Shortest period of S: abc, of length 3, so S is periodic.
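    The example can be checked with a short sketch that derives the longest border from a KMP-style failure table. The function names are assumptions; the periodicity test is the "border at least half |S|" criterion from the definition.

```python
def longest_border(s):
    """Length of the longest proper prefix of s that is also a suffix."""
    fail = [0] * len(s)
    k = 0
    for j in range(1, len(s)):
        while k > 0 and s[j] != s[k]:
            k = fail[k - 1]
        if s[j] == s[k]:
            k += 1
        fail[j] = k
    return fail[-1] if s else 0


def shortest_period(s):
    """The shortest period corresponds to the longest border."""
    return len(s) - longest_border(s)


def is_periodic(s):
    """Periodic if the longest border is at least half the length of s."""
    return 2 * longest_border(s) >= len(s)


S = "abcabcabcab"
print(longest_border(S), shortest_period(S), is_periodic(S))  # 8 3 True
```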

    Witnesses
    Popular paradigm in pattern matching:
      • find consistent candidates
      • verify candidates

    With consistent candidates, verification is linear.

    Witnesses
    Vishkin introduced the duel to choose between two candidates by checking the value of a witness.
    Alphabet-independent method.

    Witnesses
    Preprocess pattern:
      • Compute a witness for each location of self-overlap.
      • Size of witness table: […] if P is periodic, […] otherwise.

    Witnesses
    WIT[i] = any k such that $P[k] \neq P[k-i+1]$.
    WIT[i] = 0 if there is no such k.

    k is a witness against i being a period of P.

    Example:

    Position        1  2  3  4
    Pattern         a  a  a  b
    Witness Table   0  4  4  4

    Witnesses
    Let j > i. Candidates i and j are consistent if they are sufficiently far from each other OR WIT[j-i] = 0.

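    Duel
    Scan text:
      • If a pair of candidates is close and inconsistent, perform a duel to eliminate one (or both).
      • It is sufficient to identify pairwise consistent candidates, by transitivity of consistent positions.

    [Figure: pattern P = aaab aligned at candidate positions i and j of text T; the text character at the witness position (a or b?) decides the duel.]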
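    A duel can be sketched as below. This is an illustration under stated assumptions, not the slides' exact formulation: indices are 0-based, the table is indexed by the shift d = j - i rather than by the slides' 1-indexed location, the witness table is computed naively in O(m^2), and both candidates are assumed to leave room for a full pattern in the text. The names `witness_table` and `duel` are hypothetical.

```python
def witness_table(p):
    """wit[d] = some k with p[k] != p[k - d], or None if shift d is a period."""
    m = len(p)
    wit = [None] * m
    for d in range(1, m):
        for k in range(d, m):
            if p[k] != p[k - d]:
                wit[d] = k
                break
    return wit


def duel(text, p, wit, i, j):
    """Duel between candidates i < j with j - i < len(p).

    Returns the candidates that survive. One text character is inspected.
    """
    d = j - i
    k = wit[d]
    if k is None:
        return [i, j]  # consistent candidates: nothing to eliminate
    c = text[i + k]    # candidate i expects p[k] here, candidate j expects p[k - d]
    survivors = []
    if c == p[k]:
        survivors.append(i)
    if c == p[k - d]:
        survivors.append(j)
    return survivors   # p[k] != p[k - d], so at most one candidate survives


P = "aaab"
wit = witness_table(P)
print(wit)                          # [None, 3, 3, 3]
print(duel("aaaab", P, wit, 0, 1))  # [1]
```

    With 0-based indices the witness 3 corresponds to position 4 in the 1-indexed table on the slide.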

    2D Pattern Matching
    $\Sigma$ is an alphabet.
    Input:
      • Text $T[1 \ldots n, 1 \ldots n]$
      • Pattern $P[1 \ldots m, 1 \ldots m]$

    Output: All $(i, j)$ such that $T[i+k-1, j+l-1] = P[k, l]$ for all $1 \le k, l \le m$.

    2D Pattern Matching - Example
    Input: $\Sigma$ = {A, B}

    Pattern:
      ABA
      ABA
      AAB

    Text:
      AABABAA
      BABABAB
      AABAABB
      BAABAAA
      ABABAAA
      BBAABAB
      BBBABAB

    Output: {(1,4), (2,2), (4,3)}

    Bird / Baker
      • First linear-time 2D pattern matching algorithm.
      • View each pattern row as a metacharacter to linearize the problem.
      • Convert 2D pattern matching to 1D.

    Bird / Baker
    Preprocess pattern:
      • Name rows of pattern using AC automaton.
      • Using names, pattern has 1D representation.
      • Construct KMP automaton of pattern.

    Identical rows receive identical names.

    Bird / Baker
    Scan text:
      • Name positions of text that match a row of pattern, using AC automaton within each row.
      • Run KMP on named columns of text.

    Since the 1D names are unique, only one name can be given to a text location.

    Bird / Baker - Example
    Preprocess pattern:
      • Name rows of pattern using AC automaton.
      • Using names, pattern has 1D representation.
      • Construct KMP automaton of pattern.

    Pattern rows and their names:
      ABA   1
      ABA   1
      AAB   2

    Bird / Baker - Example
    Scan text:
      • Name positions of text that match a row of pattern, using AC automaton within each row.
      • Run KMP on named columns of text.

    Text        Named text
      AABABAA     0021010
      BABABAB     0001010
      AABAABB     0021020
      BAABAAA     0002100
      ABABAAA     0010100
      BBAABAB     0000210
      BBBABAB     0000010
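    The two stages can be sketched on this example as follows. To stay short, the sketch swaps in a dictionary lookup of each length-m window for the AC automaton when naming text positions, so it is not linear time; the column stage is plain KMP over integer names. All function names are assumptions, and the pattern is assumed to be square and non-empty.

```python
def name_rows(pattern_rows):
    """Give identical pattern rows identical names (1, 2, ...)."""
    names = {}
    for row in pattern_rows:
        names.setdefault(row, len(names) + 1)
    return names


def kmp_find(seq, pat):
    """KMP over any sequence; returns 0-indexed end positions of matches."""
    fail, k = [0] * len(pat), 0
    for j in range(1, len(pat)):
        while k and pat[j] != pat[k]:
            k = fail[k - 1]
        if pat[j] == pat[k]:
            k += 1
        fail[j] = k
    ends, k = [], 0
    for i, x in enumerate(seq):
        while k and x != pat[k]:
            k = fail[k - 1]
        if x == pat[k]:
            k += 1
        if k == len(pat):
            ends.append(i)
            k = fail[k - 1]
    return ends


def bird_baker(text_rows, pattern_rows):
    """Return 1-indexed (row, column) of the top-left corner of each match."""
    m = len(pattern_rows)
    names = name_rows(pattern_rows)
    pattern_1d = [names[row] for row in pattern_rows]
    # named[r][c] = name of the pattern row ending at column c of row r, else 0.
    named = [[names.get(row[c - m + 1:c + 1], 0) if c >= m - 1 else 0
              for c in range(len(row))] for row in text_rows]
    matches = []
    for c in range(len(text_rows[0])):
        column = [named[r][c] for r in range(len(text_rows))]
        for end_row in kmp_find(column, pattern_1d):
            matches.append((end_row - m + 2, c - m + 2))
    return sorted(matches)


pattern = ["ABA", "ABA", "AAB"]
text = ["AABABAA", "BABABAB", "AABAABB", "BAABAAA",
        "ABABAAA", "BBAABAB", "BBBABAB"]
print(bird_baker(text, pattern))  # [(1, 4), (2, 2), (4, 3)]
```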

    Bird / Baker
      • Complexity of Bird / Baker algorithm: […] time and space.
      • Alphabet-dependent.
      • Real-time, since it scans text characters once.
      • Can be used for dictionary matching: replace KMP with AC automaton.

    2D Witnesses
      • Amir et al.: a 2D witness table can be used for linear time and space alphabet-independent 2D matching.
      • The order of duels is significant.
      • Duels are performed in 2 waves over the text.

    Indexing
    Index text:
      • Suffix Tree
      • Suffix Array
    Find pattern in O(m) time.

    Useful paradigm when the text will be searched for several patterns.

    Suffix Trie
    T = banana$
      • One leaf per suffix.
      • An edge represents one character.
      • Concatenation of edge-labels on the path from the root to leaf i spells the suffix that starts at position i.

    [Figure: suffix trie of T with leaves suf1 through suf7.]
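    A suffix trie for this text can be sketched with nested dictionaries. The representation and the names `build_suffix_trie` and `occurrences` are assumptions chosen for brevity; the structure takes quadratic space, which is what the compact suffix tree avoids.

```python
def build_suffix_trie(text):
    """Nested-dict trie; each leaf stores the 1-indexed start of its suffix."""
    root = {}
    for i in range(len(text)):
        node = root
        for c in text[i:]:
            node = node.setdefault(c, {})  # one edge per character
        node["leaf"] = i + 1               # text must end in a unique symbol
    return root


def occurrences(trie, pattern):
    """Walk down the pattern, then collect every leaf below that node."""
    node = trie
    for c in pattern:
        if c not in node:
            return []
        node = node[c]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("banana$")
print(occurrences(trie, "ana"))  # [2, 4]
```

    Walking down the pattern takes O(m) steps; collecting the leaves adds time proportional to the size of the subtrie below.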

    Suffix Tree
    T = banana$
      • Compact representation of the trie.
      • A node with one child is merged with its parent.
      • Up to n internal nodes.
      • O(n) space by using indices to label edges.

    [Figure: suffix tree of T with leaves suf1 through suf7 and edges labeled by index pairs such as [1,7], [2,2], [3,4], [5,7], [7,7].]

    Suffix Tree Construction
    Naive approach: $O(n^2)$ time

    Linear-time algorithms:

    Author       | Date | Innovation                                                   | Scan Direction
    Weiner       | 1973 | First linear-time algorithm, alphabet-dependent suffix links | Right to left
    McCreight    | 1976 | Alphabet-independent suffix links, more efficient            | Left to right
    Ukkonen      | 1995 | Online linear-time construction, represents current end      | Left to right
    Amir and Nor | 2008 | Real-time construction                                       | Left to right

    Suffix Tree Construction
      • Linear-time suffix tree construction algorithms rely on suffix links to facilitate traversal of the tree.
      • A suffix link is a pointer from a node labeled xS to a node labeled S; x is a character and S a possibly empty substring.
      • Alphabet-dependent suffix links point from a node labeled S to a node labeled xS, for each character x.

    Index of Patterns
      • Can answer Lowest Common Ancestor (LCA) queries in constant time if the tree is preprocessed accordingly.
      • In a suffix tree, the LCA corresponds to the Longest Common Prefix (LCP) of the strings represented by the leaves.

    Index of Patterns
    To index several patterns:
      • Concatenate the patterns with unique characters separating them and build a suffix tree.
        Problem: inserts meaningless suffixes that span several patterns.
      • OR build a generalized suffix tree: a single structure for suffixes of individual patterns.
        Can be constructed with Ukkonen's algorithm.

    Suffix Array
      • The suffix array stores the lexicographic order of suffixes.
      • More space efficient than a suffix tree.
      • Can locate all occurrences of a substring by binary search.
      • With the Longest Common Prefix (LCP) array, can perform even more efficient searches.
      • The LCP array stores the longest common prefix between two adjacent suffixes in the suffix array.

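    Suffix Array
    Suffixes in text order (left); sort suffixes alphabetically (right):

    Index | Suffix      |  Index | Suffix      | LCP
    1     | mississippi |  11    | i           | 0
    2     | ississippi  |  8     | ippi        | 1
    3     | ssissippi   |  5     | issippi     | 1
    4     | sissippi    |  2     | ississippi  | 4
    5     | issippi     |  1     | mississippi | 0
    6     | ssippi      |  10    | pi          | 0
    7     | sippi       |  9     | ppi         | 1
    8     | ippi        |  7     | sippi       | 0
    9     | ppi         |  4     | sissippi    | 2
    10    | pi          |  6     | ssippi      | 1
    11    | i           |  3     | ssissippi   | 3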

    Suffix Array
    T = mississippi
    [Figure: suffix array of T with its LCP array.]
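    The suffix array and LCP array for this text can be reproduced with a naive sketch: sort the suffixes directly, then compare adjacent suffixes character by character. The function names are assumptions, and this is the quadratic-or-worse approach, not one of the direct linear-time constructions.

```python
def suffix_array(text):
    """1-indexed start positions of the suffixes in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def lcp_array(text, sa):
    """lcp[r] = longest common prefix of the suffixes at ranks r - 1 and r."""
    def common_prefix(a, b):
        k = 0
        while k < len(a) and k < len(b) and a[k] == b[k]:
            k += 1
        return k

    lcp = [0]  # the first suffix has no predecessor
    for r in range(1, len(sa)):
        lcp.append(common_prefix(text[sa[r - 1] - 1:], text[sa[r] - 1:]))
    return lcp


T = "mississippi"
sa = suffix_array(T)
print(sa)                # [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(lcp_array(T, sa))  # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
```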

    Search in Suffix Array
    O(m log n):
    Idea: two binary searches
      • search for leftmost position of X
      • search for rightmost position of X
    In between are all suffixes that begin with X.

    With LCP array: O(m + log n) search.
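    The two binary searches can be sketched as follows. The names are assumptions, the suffix array is built by naive sorting only to keep the example self-contained, and the LCP-accelerated O(m + log n) variant is not shown.

```python
def suffix_array(text):
    """1-indexed suffix start positions in lexicographic order (naive sort)."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def find_occurrences(text, sa, x):
    """Return the sorted 1-indexed positions where x occurs in text."""
    m = len(x)

    def prefix(rank):
        start = sa[rank] - 1
        return text[start:start + m]  # first m characters of that suffix

    # Leftmost rank whose m-character prefix is >= x.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < x:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    # First rank past the last suffix whose m-character prefix is <= x.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= x:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[left:lo])


T = "mississippi"
sa = suffix_array(T)
print(find_occurrences(T, sa, "ssi"))  # [3, 6]
```

    Each of the O(log n) probes compares at most m characters, which gives the O(m log n) bound.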

    Suffix Array Construction
    Naive approach: $O(n^2)$ time

    Indirect construction:
      • preorder traversal of suffix tree
      • LCA queries for LCP
    Problem: does not achieve better space efficiency.

    Suffix Array Construction
    Direct construction algorithms:

    Author                 | Date | Complexity | Innovation
    Manber, Myers          | 1993 | O(n log n) | Sort and search, KMR renaming
    Kärkkäinen and Sanders | 2003 | O(n)       | Linear-time
    Ko and Aluru           | 2003 | O(n)       | Linear-time
    Kim et al.             | 2003 | O(n)       | Linear-time

    LCP array construction: range-minima queries.

    Compressed Indices
    Suffix tree: O(n) words = O(n log n) bits

    Compressed suffix tree:
      • Grossi and Vitter (2000): O(n) space.
      • Sadakane (2007): $O(n \log |\Sigma|)$ space.
      • Supports all suffix tree operations efficiently.
      • Slowdown of only polylog(n).

    Compressed Indices
    A suffix array is an array of n indices, which is stored in O(n) words = O(n log n) bits.

    Compressed Suffix Array (CSA):
      • Grossi and Vitter (2000)
          $O(n \log |\Sigma|)$ bits
          Access time increased from O(1) to O(log n)
      • Sadakane (2003)
          Pattern matching as efficient as in uncompressed SA.
          $O(n \log H_0)$ bits
          Compressed self-index

    Compressed Indices
    FM index: Ferragina and Manzini (2005)
      • Self-indexing data structure.
      • First compressed suffix array that respects the high-order empirical entropy.
      • Size relative to compressed text length.
      • Improved by Navarro and Mäkinen (2007).

    Dynamic Suffix Tree
    Choi and Lam (1997)
      • Strings can be inserted or deleted efficiently.
      • Update time proportional to the length of the string inserted/deleted.
      • No edges labeled by a deleted string.
      • Two-way pointer for each edge, which can be done in space linear in the size of the tree.

    Dynamic Suffix Array
    Recent work by Salson et al.
      • Can update the suffix array after construction if the text changes.
      • More efficient than rebuilding the suffix array.
    Open problems:
      • Worst case O(n log n).
      • No online algorithm yet.

    Word-Based Index
      • Text of size n contains k distinct words.
      • Index a subset of positions that correspond to word beginnings.
      • With O(n) working space, can index the entire text and discard unnecessary positions.
    Desired complexity:
      • O(k) space.
      • Will always need O(n) time.
    Problem: missing suffix links.

    Word-Based Suffix Tree
    Construction algorithms:

    Author                 | Date | Results
    Kärkkäinen and Ukkonen | 1996 | O(n) time and O(n/j) space construction of sparse suffix tree (every jth suffix)
    Andersson et al.       | 1999 | Expected linear-time and k-space construction of word-based suffix tree for k words
    Inenaga and Takeda     | 2006 | Online, O(n) time and k-space construction of word-based suffix tree for k words

    Word-Based Suffix Array
    Ferragina and Fischer (2007): word-based suffix array construction algorithm
      • Time and space optimal construction.
      • Computation of word-based LCP array in O(n) time and O(k) space.
      • Alternative algorithm for construction of word-based suffix tree.
      • Searching as efficient as in an ordinary suffix array.

    Research Directions
    Problems we are considering:
      • Small space dictionary matching.
      • Time-space optimal 2D compressed dictionary matching algorithm.
      • Compressed parameterized matching.
      • Self-indexing word-based data structure.
      • Dynamic suffix array in O(n) construction time.

    Small-Space
    Applications arise in which storage space is limited.
    Many innovative algorithms exist for single pattern matching using small additional space:
      • Galil and Seiferas (1981) developed the first time-space optimal algorithm for pattern matching.
      • Rytter (2003) adapted the KMP algorithm to work in O(1) additional space, O(n) time.

    Research Directions
      • Fast dictionary matching algorithms exist for 1D and 2D. They achieve expected sublinear time.
      • No deterministic dictionary matching method works in linear time and small space.
    We believe that recent results in compressed self-indexing will facilitate the development of a solution to the small space dictionary matching problem.

    Compressed Matching
      • Data is compressed to save space.
      • Lossless compression schemes can be reversed without loss of data.
      • Pattern matching cannot be done directly in compressed text: a pattern can span a compressed character.
      • LZ78: data can be uncompressed in time and space proportional to the uncompressed data.
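    A minimal sketch of LZ78 on a 1D string may help show why a pattern can span compressed characters: each output pair stands for a whole phrase. The pair encoding (phrase index, next character) and the function names are assumptions for illustration; the 2D setting discussed in the talk is not covered.

```python
def lz78_compress(data):
    """Return (phrase_index, char) pairs; index 0 is the empty phrase."""
    dictionary = {"": 0}
    pairs, phrase = [], ""
    for c in data:
        if phrase + c in dictionary:
            phrase += c  # keep extending the longest known phrase
        else:
            pairs.append((dictionary[phrase], c))
            dictionary[phrase + c] = len(dictionary)
            phrase = ""
    if phrase:
        # Input ended inside a known phrase: emit it as prefix + last char.
        pairs.append((dictionary[phrase[:-1]], phrase[-1]))
    return pairs


def lz78_decompress(pairs):
    """Rebuild the text in time proportional to its uncompressed length."""
    phrases, pieces = [""], []
    for index, c in pairs:
        phrase = phrases[index] + c
        phrases.append(phrase)
        pieces.append(phrase)
    return "".join(pieces)


pairs = lz78_compress("abababab")
print(pairs)  # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (0, 'b')]
print(lz78_decompress(pairs))  # abababab
```

    Here the pattern "bab" occurs in the text but straddles the phrases b, ab, and aba, so it cannot be found by inspecting any single pair.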

    Research Directions
      • Amir et al. (2003) devised an algorithm for 2D LZ78 compressed matching.
      • They define strongly inplace as a criterion for the algorithm: that the extra space is proportional to the optimal compression of all strings of the given length.
      • We are seeking a time-space optimal solution to 2D compressed dictionary matching.

    Thank you!