String Algorithms

Jaehyun Park

CS 97SI, Stanford University

June 30, 2015

Outline

String Matching Problem

Hash Table

Knuth-Morris-Pratt (KMP) Algorithm

Suffix Trie

Suffix Array

String Matching Problem

◮ Given a text T and a pattern P, find all occurrences of P within T

◮ Notation:

– n and m: lengths of T and P, respectively

– Σ: the alphabet (of constant size)

– $P_i$: the ith letter of P (1-indexed)

– a, b, c: single letters in Σ

– x, y, z: strings

Example

◮ T = AGCATGCTGCAGTCATGCTTAGGCTA

◮ P = GCT

◮ P appears three times in T

◮ A naive method takes O(mn) time

– Initiate string comparison at every starting point

– Each comparison takes O(m) time

◮ We can do much better!
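
For reference, the naive method described above can be sketched as follows. This is only an illustration: the function name `naive_find_all` is an assumption, and positions are reported 0-indexed, whereas the slides count from 1.

```cpp
#include <string>
#include <vector>

// Naive matcher: try every starting position in T and compare up to m letters.
// Returns 0-indexed starting positions of P in T.
std::vector<int> naive_find_all(const std::string& T, const std::string& P) {
    std::vector<int> hits;
    const int n = T.size(), m = P.size();
    for (int s = 0; s + m <= n; s++) {
        int j = 0;
        while (j < m && T[s + j] == P[j]) j++;  // O(m) comparison
        if (j == m) hits.push_back(s);          // all m letters matched
    }
    return hits;
}
```

On the example above, `naive_find_all("AGCATGCTGCAGTCATGCTTAGGCTA", "GCT")` reports the three occurrences at 0-indexed positions 5, 16, and 22.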

Hash Function

◮ A function that takes a string and outputs a number

◮ A good hash function has few collisions

– i.e., if x ≠ y, then H(x) ≠ H(y) with high probability

◮ An easy and powerful hash function is a polynomial mod some prime p

– Consider each letter as a number (ASCII value is fine)

– $H(x_1 \ldots x_k) = x_1 a^{k-1} + x_2 a^{k-2} + \cdots + x_{k-1} a + x_k \pmod{p}$

– How do we find $H(x_2 \ldots x_{k+1})$ from $H(x_1 \ldots x_k)$?
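
One common answer to the last question is the rolling update $H(x_2 \ldots x_{k+1}) = (H(x_1 \ldots x_k) - x_1 a^{k-1}) \cdot a + x_{k+1} \pmod{p}$. The sketch below applies it to every length-k window of a string. The constants `MOD` and `BASE` and the function names are assumptions chosen for illustration; the slides only ask for some prime p.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Assumed constants for illustration only.
const std::uint64_t MOD = 1000000007ULL;  // the prime p
const std::uint64_t BASE = 257ULL;        // the value a

// Direct evaluation of the polynomial hash (Horner's rule).
std::uint64_t hash_of(const std::string& x) {
    std::uint64_t h = 0;
    for (unsigned char c : x) h = (h * BASE + c) % MOD;
    return h;
}

// Hashes of all length-k windows of s, in O(|s|) total time.
std::vector<std::uint64_t> window_hashes(const std::string& s, std::size_t k) {
    std::vector<std::uint64_t> out;
    if (k == 0 || s.size() < k) return out;
    std::uint64_t top = 1;  // a^(k-1) mod p
    for (std::size_t i = 1; i < k; i++) top = top * BASE % MOD;
    std::uint64_t h = hash_of(s.substr(0, k));
    out.push_back(h);
    for (std::size_t i = k; i < s.size(); i++) {
        std::uint64_t lead = (unsigned char)s[i - k] * top % MOD;
        h = (h + MOD - lead) % MOD;                  // drop the leading letter
        h = (h * BASE + (unsigned char)s[i]) % MOD;  // shift and append the new one
        out.push_back(h);
    }
    return out;
}
```

Each update costs O(1), so sliding the window across a text of length n takes O(n) time instead of O(nk).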

Hash Table

◮ Main idea: preprocess T to speed up queries

– Hash every substring of length k

– k is a small constant

◮ For each query P, hash the first k letters of P to retrieve all the occurrences of it within T

◮ Don’t forget to check collisions!
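
A minimal sketch of this idea, assuming every query has at least k letters. The type name `KGramIndex` is hypothetical, and `std::unordered_map` stands in for a hand-written hash table, so collisions between different k-grams are resolved by the library. Each candidate position is still verified against all of P, since agreeing on the first k letters is not enough.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Index of every length-k substring of T (illustrative names).
struct KGramIndex {
    std::string text;
    std::size_t k;
    std::unordered_map<std::string, std::vector<int>> buckets;

    KGramIndex(const std::string& T, std::size_t k_) : text(T), k(k_) {
        for (std::size_t i = 0; i + k <= text.size(); i++)
            buckets[text.substr(i, k)].push_back((int)i);
    }

    // 0-indexed occurrences of P in the text; assumes |P| >= k.
    std::vector<int> find_all(const std::string& P) const {
        std::vector<int> hits;
        if (P.size() < k) return hits;
        auto it = buckets.find(P.substr(0, k));
        if (it == buckets.end()) return hits;
        for (int s : it->second)  // candidates sharing the first k letters
            if (text.compare(s, P.size(), P) == 0) hits.push_back(s);
        return hits;
    }
};
```

The verification loop is where the worst case comes from: if many positions share the same first k letters, a query still compares P against each of them.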

Hash Table

◮ Pros:

– Easy to implement

– Significant speedup in practice

◮ Cons:

– Doesn’t help the asymptotic efficiency: can still take Θ(nm) time if hashing is terrible or data is difficult

– A lot of memory consumption

Knuth-Morris-Pratt (KMP) Matcher

◮ A linear time (!) algorithm that solves the string matching problem by preprocessing P in Θ(m) time

– Main idea is to skip some comparisons by using the previous comparison result

◮ Uses an auxiliary array π that is defined as follows:

– $\pi[i]$ is the largest integer smaller than $i$ such that $P_1 \ldots P_{\pi[i]}$ is a suffix of $P_1 \ldots P_i$

◮ ... It’s better to see an example than the definition

π Table Example (from CLRS)

◮ $\pi[i]$ is the largest integer smaller than $i$ such that $P_1 \ldots P_{\pi[i]}$ is a suffix of $P_1 \ldots P_i$

– e.g., π[6] = 4 since abab is a suffix of ababab

– e.g., π[9] = 0 since no prefix of length ≤ 8 ends with c

◮ Let’s see why this is useful

Using the π Table

◮ T = ABC ABCDAB ABCDABCDABDE

◮ P = ABCDABD

◮ π = (0, 0, 0, 0, 1, 2, 0)

◮ Start matching at the first position of T:

◮ Mismatch at the 4th letter of P!

Using the π Table

◮ We matched k = 3 letters so far, and π[k] = 0

– Thus, there is no point in starting the comparison at $T_2$ or $T_3$ (crucial observation)

◮ Shift P by k − π[k] = 3 letters

◮ Mismatch at $T_4$ again!

Using the π Table

◮ We matched k = 0 letters so far

◮ Shift P by k − π[k] = 1 letter (we define π[0] = −1)

◮ Mismatch at $T_{11}$!

Using the π Table

◮ π[6] = 2 means $P_1 P_2$ is a suffix of $P_1 \ldots P_6$

◮ Shift P by 6 − π[6] = 4 letters

◮ Again, no point in shifting P by 1, 2, or 3 letters

Using the π Table

◮ Mismatch at $T_{11}$ again!

◮ Currently 2 letters are matched

◮ Shift P by 2 − π[2] = 2 letters

Using the π Table

◮ Mismatch at $T_{11}$ yet again!

◮ Currently no letters are matched

◮ Shift P by 0 − π[0] = 1 letter

Using the π Table

◮ Mismatch at $T_{18}$

◮ Currently 6 letters are matched

◮ Shift P by 6 − π[6] = 4 letters

Using the π Table

◮ Finally, there it is!

◮ Currently all 7 letters are matched

◮ After recording this match (at $T_{16} \ldots T_{22}$), we shift P again in order to find other matches

– Shift by 7 − π[7] = 7 letters

Computing π

◮ Observation 1: if $P_1 \ldots P_{\pi[i]}$ is a suffix of $P_1 \ldots P_i$, then $P_1 \ldots P_{\pi[i]-1}$ is a suffix of $P_1 \ldots P_{i-1}$

– Well, obviously...

◮ Observation 2: all the prefixes of P that are a suffix of $P_1 \ldots P_i$ can be obtained by recursively applying π to i

– e.g., $P_1 \ldots P_{\pi[i]}$, $P_1 \ldots P_{\pi[\pi[i]]}$, and $P_1 \ldots P_{\pi[\pi[\pi[i]]]}$ are all suffixes of $P_1 \ldots P_i$

Computing π

◮ A non-obvious conclusion:

– First, let’s write $\pi^{(k)}[i]$ as $\pi[\cdot]$ applied $k$ times to $i$

– e.g., $\pi^{(2)}[i] = \pi[\pi[i]]$

– $\pi[i]$ is equal to $\pi^{(k)}[i-1] + 1$, where $k$ is the smallest integer that satisfies $P_{\pi^{(k)}[i-1]+1} = P_i$

◮ If there is no such k, π[i] = 0

◮ Intuition: we look at all the prefixes of P that are suffixes of $P_1 \ldots P_{i-1}$, and find the longest one whose next letter matches $P_i$

Implementation

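// P[1..m] is 1-indexed; pi[] needs m + 1 entries.
pi[0] = -1;                  // sentinel so the loop can fall off the front
int k = -1;                  // k equals pi[i-1] at the top of each iteration
for (int i = 1; i <= m; i++) {
    // Follow pi, pi(pi), ... until the next letter matches P[i]
    while (k >= 0 && P[k+1] != P[i])
        k = pi[k];
    pi[i] = ++k;             // k == -1 here gives pi[i] = 0
}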
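
The same computation can be written for 0-indexed `std::string`s, which avoids the `pi[0] = -1` sentinel. This is a sketch under that assumption, not the slides' own code: here `pi[i]` refers to the prefix `P[0..i]`, so it corresponds to π[i+1] in the 1-indexed notation, and the name `prefix_function` is illustrative.

```cpp
#include <string>
#include <vector>

// pi[i] = length of the longest proper prefix of P[0..i]
// that is also a suffix of P[0..i].
std::vector<int> prefix_function(const std::string& P) {
    const int m = P.size();
    std::vector<int> pi(m, 0);
    int k = 0;  // length of the currently matched prefix
    for (int i = 1; i < m; i++) {
        while (k > 0 && P[k] != P[i]) k = pi[k - 1];  // try the next shorter candidate
        if (P[k] == P[i]) k++;                        // extend it by one letter
        pi[i] = k;
    }
    return pi;
}
```

For P = ABCDABD this returns (0, 0, 0, 0, 1, 2, 0), the table used in the walk-through, and for ababab the last entry is 4, matching the π[6] = 4 example.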

Pattern Matching Implementation

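// T[1..n] and P[1..m] are 1-indexed; pi[] is the table computed above.
int k = 0;                   // number of letters of P currently matched
for (int i = 1; i <= n; i++) {
    while (k >= 0 && P[k+1] != T[i])
        k = pi[k];           // fall back to a shorter matched prefix
    k++;
    if (k == m) {
        // P matches T[i-m+1..i]
        k = pi[k];           // continue, to find later (overlapping) matches
    }
}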
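
Putting the two loops together gives a complete matcher. The version below is a 0-indexed sketch with an illustrative name (`kmp_find_all`); it builds the π table inline so that the block stands on its own, and it returns 0-indexed starting positions.

```cpp
#include <string>
#include <vector>

// All 0-indexed starting positions of P in T, in O(n + m) time.
std::vector<int> kmp_find_all(const std::string& T, const std::string& P) {
    std::vector<int> hits;
    const int n = T.size(), m = P.size();
    if (m == 0) return hits;  // an empty pattern is treated as "no matches" here

    // Preprocess P: pi[i] is the longest proper prefix of P[0..i] that is also its suffix.
    std::vector<int> pi(m, 0);
    for (int i = 1, k = 0; i < m; i++) {
        while (k > 0 && P[k] != P[i]) k = pi[k - 1];
        if (P[k] == P[i]) k++;
        pi[i] = k;
    }

    // Scan T once; k is the number of letters of P currently matched.
    for (int i = 0, k = 0; i < n; i++) {
        while (k > 0 && P[k] != T[i]) k = pi[k - 1];
        if (P[k] == T[i]) k++;
        if (k == m) {
            hits.push_back(i - m + 1);
            k = pi[k - 1];  // allow overlapping matches
        }
    }
    return hits;
}
```

On the walk-through text, `kmp_find_all("ABC ABCDAB ABCDABCDABDE", "ABCDABD")` finds the single match at 0-indexed position 15, which is $T_{16} \ldots T_{22}$ in the slides' 1-indexed notation.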

Suffix Trie

◮ Suffix trie of a string T is a rooted tree that stores all the suffixes (thus all the substrings)

◮ Each node corresponds to some substring of T

◮ Each edge is associated with a letter of the alphabet

◮ For each node that corresponds to ax, there is a special pointer called suffix link that leads to the node corresponding to x

◮ Surprisingly easy to implement!

Example

(Figure modified from Ukkonen’s original paper)

Incremental Construction

◮ Given the suffix trie for $T_1 \ldots T_n$

– Then we append $T_{n+1} = a$ to T, creating necessary nodes

◮ Start at node u corresponding to $T_1 \ldots T_n$

– Create an a-transition to a new node v

◮ Take the suffix link at u to go to u′, corresponding to $T_2 \ldots T_n$

– Create an a-transition to a new node v′

– Create a suffix link from v to v′

Incremental Construction

◮ Repeat the previous process:

– Take the suffix link at the current node

– Make a new a-transition there

– Create the suffix link from the previously created node to the new one

◮ Stop if the node already has an a-transition

– Because from this point, all nodes that are reachable via suffix links already have an a-transition
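
The incremental construction can be sketched as below. All names (`TrieNode`, `SuffixTrie`, `append`, `contains`, `build_suffix_trie`) are illustrative assumptions, as is the use of `std::map` for transitions. The sketch also handles one case the steps above leave implicit: if the walk passes the root without finding an a-transition, the last new node (the one for the single letter a) gets a suffix link to the root.

```cpp
#include <map>
#include <string>
#include <vector>

struct TrieNode {
    std::map<char, int> next;  // a-transitions, keyed by letter
    int link = -1;             // suffix link; -1 only at the root
};

struct SuffixTrie {
    std::vector<TrieNode> nodes;
    int deepest = 0;  // node for the whole string appended so far

    SuffixTrie() : nodes(1) {}  // node 0 is the root (empty string)

    void append(char a) {
        int cur = deepest;
        int last_new = -1;  // most recently created node, awaiting its suffix link
        while (cur != -1 && nodes[cur].next.count(a) == 0) {
            int v = (int)nodes.size();
            nodes.emplace_back();
            nodes[cur].next[a] = v;                        // new a-transition
            if (last_new != -1) nodes[last_new].link = v;  // link previous new node
            last_new = v;
            cur = nodes[cur].link;                         // follow the suffix link
        }
        if (last_new != -1)  // stopped at an existing a-transition, or passed the root
            nodes[last_new].link = (cur == -1) ? 0 : nodes[cur].next.at(a);
        deepest = nodes[deepest].next.at(a);
    }

    // P occurs in the text iff its letters can be followed from the root.
    bool contains(const std::string& P) const {
        int cur = 0;
        for (char c : P) {
            auto it = nodes[cur].next.find(c);
            if (it == nodes[cur].next.end()) return false;
            cur = it->second;
        }
        return true;
    }
};

SuffixTrie build_suffix_trie(const std::string& T) {
    SuffixTrie trie;
    for (char c : T) trie.append(c);
    return trie;
}
```

Building the trie for abac this way creates 10 nodes, one for each distinct substring plus the root.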

Construction Example

Given the suffix trie for aba

We want to add a new letter c

[Figures: the suffix trie after each step of adding c]

Construction Example

◮ Construction time is linear in the tree size

◮ But the tree size can be quadratic in n

– e.g., T = aa . . . abb . . . b

Construction Example

◮ To find P, start at the root and keep following edges labeled with $P_1$, $P_2$, etc.

◮ Got stuck? Then P doesn’t exist in T

Suffix Array

◮ Memory usage is O(n)

◮ Has the same computational power as suffix trie

◮ Can be constructed in O(n) time (!)

– But it’s hard to implement

◮ There is an approachable $O(n \log^2 n)$ algorithm

– If you want to see how it works, read the paper on the course website

– http://cs97si.stanford.edu/suffix-array.pdf
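
One well-known way to reach $O(n \log^2 n)$ is prefix doubling: sort the suffixes by their first 1, 2, 4, ... letters, reusing the ranks from the previous round as sort keys. The sketch below uses that technique with `std::sort`; it is an assumption that this matches the algorithm in the linked paper, and the function name is illustrative.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Suffix array by prefix doubling: sa[i] is the 0-indexed start of the
// i-th smallest suffix of s. O(log n) rounds, each an O(n log n) sort.
std::vector<int> build_suffix_array(const std::string& s) {
    const int n = s.size();
    std::vector<int> sa(n), rank(n), tmp(n);
    if (n == 0) return sa;
    std::iota(sa.begin(), sa.end(), 0);
    for (int i = 0; i < n; i++) rank[i] = (unsigned char)s[i];
    for (int len = 1;; len *= 2) {
        // Key of suffix i: (rank of its first len letters, rank of the next len).
        auto key = [&](int i) {
            return std::make_pair(rank[i], i + len < n ? rank[i + len] : -1);
        };
        auto cmp = [&](int a, int b) { return key(a) < key(b); };
        std::sort(sa.begin(), sa.end(), cmp);
        // Re-rank: equal keys share a rank, larger keys get the next one.
        tmp[sa[0]] = 0;
        for (int i = 1; i < n; i++)
            tmp[sa[i]] = tmp[sa[i - 1]] + (cmp(sa[i - 1], sa[i]) ? 1 : 0);
        rank = tmp;
        if (rank[sa[n - 1]] == n - 1) break;  // all ranks distinct: fully sorted
    }
    return sa;
}
```

For the string banana the sorted suffixes are a, ana, anana, banana, na, nana, so the result is (5, 3, 1, 0, 4, 2).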

Notes on String Problems

◮ Always be aware of the null-terminators

◮ Simple hashing works well in many problems

◮ If a problem involves rotations of some string, consider concatenating it with itself and see if it helps

◮ Stanford team notebook has implementations of suffix arrays and the KMP matcher
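
As a small illustration of the rotation tip: b is a rotation of a exactly when the two have the same length and b occurs inside a concatenated with itself. The function name `is_rotation` is an assumption, and `std::string::find` is used for brevity; a linear-time matcher such as KMP could take its place.

```cpp
#include <string>

// True if b can be obtained by rotating a (moving a prefix of a to its end).
bool is_rotation(const std::string& a, const std::string& b) {
    return a.size() == b.size() && (a + a).find(b) != std::string::npos;
}
```

For instance, cdeab is a rotation of abcde because it appears inside abcdeabcde, while abced is not.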
