Post on 03-Feb-2022
String Matching
Algorithm Design and Analysis (Week 7)
Battle Plan
• String matching problem
• Notation and terminology
• Four different algorithms
Algorithm            Preprocessing Time    Matching Time
Naïve                0                     $O((n-m+1)m)$
Rabin-Karp           $\Theta(m)$           $O((n-m+1)m)$
Finite automaton     $O(m\,|\Sigma|)$      $\Theta(n)$
Knuth-Morris-Pratt   $\Theta(m)$           $\Theta(n)$
String-Matching Problem
• "Where's the hotel in idahotelescope?"
• Formalization of the string-matching problem
– Text is an array $T[1..n]$ of length $n$
– Pattern is an array $P[1..m]$ of length $m \le n$
– The elements of $T$ and $P$ are drawn from a finite alphabet $\Sigma$; the arrays are often called strings of characters
– $P$ occurs with shift $s$ in $T$ if $0 \le s \le n - m$ and $T[s+1..s+m] = P[1..m]$
[Figure: text $T$ = idahotelescope with pattern $P$ = hotel aligned at shift $s = 3$]
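The definition of "occurs with shift $s$" can be sketched directly in Python. This is only an illustration: the function name `occurs_with_shift` is hypothetical, and Python strings are 0-based, so the slide's $T[s+1..s+m]$ corresponds to `text[s:s + m]`.

```python
def occurs_with_shift(text: str, pattern: str, s: int) -> bool:
    """Return True if pattern occurs with shift s in text."""
    n, m = len(text), len(pattern)
    # A shift is only meaningful in the range 0 <= s <= n - m.
    if not 0 <= s <= n - m:
        return False
    # T[s+1..s+m] in 1-based slide notation is text[s:s+m] in Python.
    return text[s:s + m] == pattern


print(occurs_with_shift("idahotelescope", "hotel", 3))  # True
print(occurs_with_shift("idahotelescope", "hotel", 4))  # False
```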
Notation and Terminology
• Strings
– $\Sigma^*$: set of all finite-length strings with characters from $\Sigma$
– $\varepsilon$: the zero-length empty string, which also belongs to $\Sigma^*$
– $|x|$: length of a string $x$
– $xy$: concatenation of strings $x$ and $y$, which has length $|x| + |y|$
• Prefix and suffix
– string $w$ is a prefix of a string $x$, denoted as $w \sqsubset x$, if $x = wy$ for some string $y \in \Sigma^*$
– string $w$ is a suffix of a string $x$, denoted as $w \sqsupset x$, if $x = yw$ for some string $y \in \Sigma^*$
– $P_k$ denotes the $k$-character prefix $P[1..k]$ of the string $P[1..m]$, and thus $P_0 = \varepsilon$ and $P_m = P = P[1..m]$
Observations
• Strings
– $|\varepsilon| = 0$
• Prefix and suffix
– for any string $x$, $\varepsilon \sqsubset x$ and $\varepsilon \sqsupset x$
– if $w \sqsubset x$ or $w \sqsupset x$, then $|w| \le |x|$
– for any two strings $x$ and $y$ and any character $a$, $x \sqsubset y \Leftrightarrow ax \sqsubset ay$ and $x \sqsupset y \Leftrightarrow xa \sqsupset ya$
– both $\sqsubset$ and $\sqsupset$ are transitive relations
• Reformulated string-matching problem
– finding all shifts $s$ in the range $0 \le s \le n - m$ such that $P \sqsupset T_{s+m}$
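Examples
• Assume $\Sigma = \{a, b, c\}$
– $\Sigma^* = \{\varepsilon, a, b, c, aa, ab, ac, ba, bb, bc, ca, cb, cc, \dots\}$
– $x = ab$ and $y = ba$
– $|x| = |y| = 2$
– $xy = abba$
– $|xy| = |x| + |y| = 4$
– $\varepsilon \sqsubset abba$ and $\varepsilon \sqsupset abba$
– $a \sqsubset abba$ and $a \sqsupset abba$
– $ab \sqsubset abba$ and $ba \sqsupset abba$
– $abb \sqsubset abba$ and $bba \sqsupset abba$
– $abba \sqsubset abba$ and $abba \sqsupset abba$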
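The prefix and suffix relations map onto Python's built-in `str.startswith` and `str.endswith`. The helper names below (`is_prefix`, `is_suffix`, `prefix`) are hypothetical and chosen only for this sketch; the last line checks the reformulated matching condition $P \sqsupset T_{s+m}$ for one shift.

```python
def is_prefix(w: str, x: str) -> bool:
    """w is a prefix of x, i.e. x = w y for some string y."""
    return x.startswith(w)


def is_suffix(w: str, x: str) -> bool:
    """w is a suffix of x, i.e. x = y w for some string y."""
    return x.endswith(w)


def prefix(p: str, k: int) -> str:
    """The k-character prefix P_k of p (P_0 is the empty string)."""
    return p[:k]


print(is_prefix("ab", "abba"), is_suffix("ba", "abba"))  # True True
print(is_prefix("", "abba"), is_suffix("", "abba"))      # True True
print(is_prefix("ba", "abba"))                           # False
# Reformulated matching: "hotel" is a suffix of T_{s+m} for s = 3, m = 5.
print(is_suffix("hotel", prefix("idahotelescope", 3 + 5)))  # True
```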
Overlapping-Suffix Lemma
• Assume $x$, $y$, and $z$ are strings such that $x \sqsupset z$ and $y \sqsupset z$
– if $|x| \le |y|$, then $x \sqsupset y$
– if $|x| \ge |y|$, then $y \sqsupset x$
– if $|x| = |y|$, then $x = y$
• Proof
[Figure: three panels, one per case, each showing $x$ and $y$ aligned with the right end of $z$]
Naïve String-Matching Algorithm
NAÏVE-STRING-MATCHER(T, P)
1  n ← length[T]
2  m ← length[P]
3  for s ← 0 to n − m
4      do if P[1..m] = T[s+1..s+m]
5          then print "Pattern occurs with shift" s
• Comparing two strings (line 4) takes time $\Theta(t + 1)$
– $t$ denotes the number of matching characters
– the "+1" accounts for strings that mismatch immediately, since even that comparison does not take time $\Theta(0)$
• Naïve algorithm takes time $O((n - m + 1)m)$
– the bound is tight in the worst case: $\Theta((n - m + 1)m)$
– consider matching the text $\mathrm{a}^n$ against the pattern $\mathrm{a}^m$
– if $m = \lfloor n/2 \rfloor$, the worst-case running time is $\Theta(n^2)$
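A minimal Python sketch of the naïve matcher follows. It returns the list of valid shifts instead of printing them, which is a choice made here for easier testing; the function name is hypothetical.

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try each shift s = 0 .. n - m (the range is empty when m > n).
    for s in range(n - m + 1):
        # Slice comparison plays the role of P[1..m] = T[s+1..s+m].
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


print(naive_string_matcher("acaabc", "aab"))  # [2]
print(naive_string_matcher("aaaa", "aa"))     # [0, 1, 2]
```

The second call shows the worst-case flavor: with a text and pattern made of the same repeated character, every shift is valid and every comparison runs to the end of the pattern.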
Example
• Graphical interpretation
– sliding the pattern over the text in steps of length 1
– noting for which shifts all pattern characters equal the corresponding text characters
[Figure: pattern aab slid over text acaabc at shifts $s = 0, 1, 2, 3$; all pattern characters match only at $s = 2$]
Rabin-Karp Algorithm
• Motivation
– comparing numbers is "cheaper" than matching strings
– represent text and pattern as numbers
– use number-theoretic notions to match strings
• Assumptions and notation
– $\Sigma_{10} = \{0, 1, 2, \dots, 9\}$, but in the general case each character will be a digit in radix-$d$ notation where $d = |\Sigma|$
– $p$ denotes the value corresponding to $P[1..m]$
– given $T[1..n]$, $t_s$ denotes the value of the length-$m$ substring $T[s+1..s+m]$, for $s = 0, 1, \dots, n - m$
– $t_s = p \Leftrightarrow T[s+1..s+m] = P[1..m]$
Rabin-Karp Algorithm
• Goal
– compute $p$ in time $\Theta(m)$
– compute all $t_s$ values in a total time $\Theta(n - m + 1)$
– get all valid shifts in time $\Theta(m) + \Theta(n - m + 1) = \Theta(n)$
• Computing $p$ from $P[1..m]$
– can be done in $\Theta(m)$ using Horner's rule
– $p = P[m] + d\,(P[m-1] + d\,(P[m-2] + \cdots + d\,(P[2] + d\,P[1]) \cdots))$
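Horner's rule amounts to a single left-to-right pass with one multiplication and one addition per digit. A small sketch, assuming the pattern is given as a list of digit values in radix `d` (the name `horner_value` is hypothetical):

```python
def horner_value(digits: list[int], d: int = 10) -> int:
    """Evaluate a digit sequence in radix d using Horner's rule."""
    value = 0
    for digit in digits:
        # Shift the value accumulated so far by one radix position,
        # then add the next digit.
        value = d * value + digit
    return value


print(horner_value([1, 4, 1]))        # 141
print(horner_value([3, 1, 4, 1, 5]))  # 31415
```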
Rabin-Karp Algorithm
• Computing $t_0$ from $T[1..m]$
– use Horner's rule to compute $t_0$ in time $\Theta(m)$
• Computing $t_1, t_2, \dots, t_{n-m}$ from $T[1..n]$
– can be done in time $\Theta(n - m)$ since $t_{s+1}$ can be computed from $t_s$ in constant time
– $t_{s+1} = d\,(t_s - d^{m-1}\,T[s+1]) + T[s+m+1]$
• Example
– assume $\Sigma_{10}$, $T = [3, 1, 4, 1, 5, 9, 2, 6]$, $P = [1, 4, 1]$, and $m = 3$
– $p = 141$, $t_0 = 314$
– $t_1 = 10\,(t_0 - 10^2\,T[1]) + T[4] = 10\,(314 - 300) + 1 = 141$
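The constant-time update can be sketched as a loop that removes the high-order digit, shifts, and appends the next low-order digit. The sketch below works on exact integers (no modulus yet) and assumes digit lists in radix `d`; `window_values` is a hypothetical name.

```python
def window_values(text_digits: list[int], m: int, d: int = 10) -> list[int]:
    """Return t_0, t_1, ..., t_{n-m} for all length-m windows."""
    n = len(text_digits)
    if m == 0 or m > n:
        return []
    # t_0 via Horner's rule over the first m digits.
    t = 0
    for digit in text_digits[:m]:
        t = d * t + digit
    high = d ** (m - 1)  # weight of the high-order digit in a window
    values = [t]
    for s in range(n - m):
        # Drop digit T[s+1] (index s here), shift, add digit T[s+m+1].
        t = d * (t - high * text_digits[s]) + text_digits[s + m]
        values.append(t)
    return values


values = window_values([3, 1, 4, 1, 5, 9, 2, 6], 3)
print(values)             # [314, 141, 415, 159, 592, 926]
print(values.index(141))  # 1, the shift at which p = 141 occurs
```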
All's Well That Ends Well
• Yes, Bill! But we're not done yet...
– $p$ and $t_s$ may be too large to work with conveniently
– assuming arithmetic operations on these numbers take "constant time" is unreasonable
• Simple solution
– compute $p$ and all $t_s$ modulo a suitable modulus $q$
– the extra modulo operation per step does not change the asymptotic running time
– $q$ is typically chosen as a prime such that $dq$ fits within one computer word
– $t_{s+1} = (d\,(t_s - T[s+1]\,h) + T[s+m+1]) \bmod q$, where $h \equiv d^{m-1} \pmod{q}$
Make It As Simple As Possible But Not Simpler
• Okay, Al! Maybe we went too far this time...
– $t_s \equiv p \pmod{q}$ does not imply $t_s = p$
– $t_s \not\equiv p \pmod{q}$ does imply $t_s \ne p$
• Example
– assume $\Sigma_{10}$, $P = 31415$, and $q = 13$
– $31415 \equiv 7 \pmod{13}$
– $67399 \equiv 7 \pmod{13}$
• Solution
– use the negative test as a fast heuristic to rule out invalid shifts
– a positive test must be validated to sort out spurious hits
– if $q$ is large, spurious hits are likely to occur less frequently
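The two-stage test can be illustrated on the numbers from the example. In this sketch `classify_window` is a hypothetical helper, and the window value 14152 is an extra illustrative input that does not appear on the slide.

```python
def classify_window(t_value: int, p_value: int, q: int) -> str:
    """Apply the fast modular test first, then the exact comparison."""
    if t_value % q != p_value % q:
        # Different residues guarantee different values.
        return "ruled out"
    # Equal residues only suggest a match; it must be verified.
    return "valid shift" if t_value == p_value else "spurious hit"


print(31415 % 13, 67399 % 13)               # 7 7
print(classify_window(67399, 31415, 13))    # spurious hit
print(classify_window(31415, 31415, 13))    # valid shift
print(classify_window(14152, 31415, 13))    # ruled out
```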
Rabin-Karp Algorithm
RABIN-KARP-MATCHER(T, P, d, q)
n ← length[T]
m ← length[P]
h ← d^(m−1) mod q
p ← 0
t_0 ← 0
for i ← 1 to m                        // Preprocessing
    do p ← (d·p + P[i]) mod q
        t_0 ← (d·t_0 + T[i]) mod q
for s ← 0 to n − m                    // Matching
    do if p = t_s
        then if P[1..m] = T[s+1..s+m]
            then print "Pattern occurs with shift" s
    if s < n − m
        then t_(s+1) ← (d·(t_s − T[s+1]·h) + T[s+m+1]) mod q
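The pseudocode translates almost line by line into Python. The sketch below is one possible rendering under stated assumptions: characters are mapped to digits with `ord`, the defaults `d=256` and `q=1_000_003` are illustrative choices rather than values from the slides, an empty pattern is treated as having no matches, and shifts are returned instead of printed.

```python
def rabin_karp_matcher(text: str, pattern: str,
                       d: int = 256, q: int = 1_000_003) -> list[int]:
    """Return all valid shifts, using a rolling hash modulo q."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)  # d^(m-1) mod q, weight of the leading digit
    p = t = 0
    # Preprocessing: hash the pattern and the first text window.
    for i in range(m):
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    # Matching: compare hashes, verify hits, then roll the window.
    for s in range(n - m + 1):
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Python's % always yields a non-negative result here.
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


print(rabin_karp_matcher("idahotelescope", "hotel"))  # [3]
print(rabin_karp_matcher("31415926", "141", q=13))    # [1]
```

Passing a small modulus such as `q=13` makes spurious hits more likely, which exercises the verification step without changing the result.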
Run-Time Analysis
• Worst case
– $\Theta(m)$ to preprocess and $\Theta((n - m + 1)m)$ to match
• Heuristic analysis of average case
– "modulo $q$" acts as a random mapping from $\Sigma^*$ to $\mathbb{Z}_q$
– number of spurious hits expected to be $O(n/q)$ since the probability of $t_s \equiv p \pmod{q}$ can be estimated as $1/q$
– expected matching time of the Rabin-Karp algorithm is $O(n) + O(m\,(v + n/q))$, where $v$ is the number of valid shifts
– if $v = O(1)$ and $q \ge m$, the running time is $O(n + m)$, and since $m \le n$ it is even expected to be $O(n)$!
String Matching with Finite Automata
• Idea
– build a finite automaton to scan $T$ for all occurrences of $P$
– examine each text character exactly once, in constant time per character
– matching time $\Theta(n)$, but preprocessing time can be large
• A finite automaton $M$ is a 5-tuple $(Q, q_0, A, \Sigma, \delta)$
– $Q$ is a finite set of states
– $q_0 \in Q$ is the start state
– $A \subseteq Q$ is a distinguished set of accepting states
– $\Sigma$ is a finite input alphabet
– $\delta$ is a function from $Q \times \Sigma$ into $Q$, called the transition function of $M$
String Matching with Finite Automata
• Finite automaton
– begins in state $q_0$ and reads one input character $a$ at a time
– transitions from state $q$ into state $\delta(q, a)$
– accepts the string read so far if the current state $q \in A$
– rejects the string read so far if the current state $q \notin A$
• A finite automaton $M$ induces a final-state function $\phi$
– $\phi: \Sigma^* \to Q$, such that $q = \phi(w)$ is the state $M$ is in after scanning the string $w$
– $M$ accepts a string $w$ if and only if $\phi(w) \in A$
– recursive definition of $\phi$:
  $\phi(\varepsilon) = q_0$
  $\phi(wa) = \delta(\phi(w), a)$ for $w \in \Sigma^*$, $a \in \Sigma$
String-Matching Automata
• For every pattern $P[1..m]$, we need to construct a string-matching automaton in preprocessing
– the state set $Q$ is $\{0, 1, \dots, m\}$, where the start state $q_0$ is state 0 and state $m$ is the only accepting state
– the transition function is defined as $\delta(q, a) = \sigma(P_q a)$ for any state $q$ and character $a$
• Suffix function $\sigma$ for a given pattern $P[1..m]$
– $\sigma: \Sigma^* \to \{0, 1, \dots, m\}$ such that $\sigma(x) = \max\{k : P_k \sqsupset x\}$ is the length of the longest prefix of $P$ that is a suffix of $x$
– for a pattern $P$ of length $m$, $\sigma(x) = m$ if and only if $P \sqsupset x$
– if $x \sqsupset y$, then $\sigma(x) \le \sigma(y)$
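The suffix function can be computed by brute force straight from its definition: try the longest candidate prefix first and shorten it until it is a suffix of $x$. The sketch below uses the hypothetical name `suffix_function` and is meant to illustrate the definition, not to be efficient.

```python
def suffix_function(pattern: str, x: str) -> int:
    """Length of the longest prefix of pattern that is a suffix of x."""
    # No prefix longer than x (or than the pattern) can qualify.
    for k in range(min(len(pattern), len(x)), -1, -1):
        if x.endswith(pattern[:k]):
            return k
    return 0  # not reached: k = 0 (the empty prefix) always succeeds


print(suffix_function("ab", "ccaca"))        # 1
print(suffix_function("ab", "ccab"))         # 2
print(suffix_function("ababaca", "ababab"))  # 4
print(suffix_function("ababaca", ""))        # 0
```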
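Example
• Assume pattern $P$ = ababaca
– 8 states and a "spine" of forward transitions
– $\delta(1, a) = 1$, since $P_1 a$ = aa and $\sigma(P_1 a) = 1$
– $\delta(3, a) = 1$, since $P_3 a$ = abaa and $\sigma(P_3 a) = 1$
– $\delta(5, a) = 1$, since $P_5 a$ = ababaa and $\sigma(P_5 a) = 1$
– $\delta(5, b) = 4$, since $P_5 b$ = ababab and $\sigma(P_5 b) = 4$
– $\delta(7, a) = 1$, since $P_7 a$ = ababacaa and $\sigma(P_7 a) = 1$
– $\delta(7, b) = 2$, since $P_7 b$ = ababacab and $\sigma(P_7 b) = 2$
[Figure: state-transition diagram with states 0 to 7, forward "spine" edges labeled a, b, a, b, a, c, a, and backward edges labeled a and b]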
String-Matching Automata
FINITE-AUTOMATON-MATCHER(T, P, Σ, m)
n ← length[T]
δ ← COMPUTE-TRANSITION-FUNCTION(P, Σ)
q ← 0
for i ← 1 to n
    do q ← δ(q, T[i])
        if q = m
            then print "Pattern occurs with shift" i − m
• Matching time on a text of length $n$ is $\Theta(n)$
– simple loop structure with $n$ iterations
– does not account for the time required to compute the transition function $\delta$
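A runnable sketch of the matcher needs a transition table, so the code below builds one directly from the definition $\delta(q, a) = \sigma(P_q a)$ by trying candidate prefix lengths from longest to shortest. Several choices here are assumptions for illustration: the table is a list of dictionaries, the alphabet is assumed to contain every pattern character, text characters outside the alphabet send the automaton to state 0, and an empty pattern yields no matches.

```python
def compute_transition_function(pattern: str, alphabet: str) -> list[dict]:
    """Build delta[q][a] = sigma(P_q a) for q = 0..m and a in alphabet."""
    m = len(pattern)
    delta = []
    for q in range(m + 1):
        row = {}
        for a in alphabet:
            candidate = pattern[:q] + a  # the string P_q a
            k = min(m, q + 1)            # longest prefix worth trying
            while not candidate.endswith(pattern[:k]):
                k -= 1                   # stops at k = 0 at the latest
            row[a] = k
        delta.append(row)
    return delta


def finite_automaton_matcher(text: str, pattern: str, alphabet: str) -> list[int]:
    """Return all valid shifts by running the automaton over text."""
    m = len(pattern)
    if m == 0:
        return []
    delta = compute_transition_function(pattern, alphabet)
    q = 0
    shifts = []
    for i, ch in enumerate(text, start=1):  # i is 1-based, as on the slide
        q = delta[q].get(ch, 0)             # unknown characters reset to 0
        if q == m:
            shifts.append(i - m)
    return shifts


delta = compute_transition_function("ababaca", "abc")
print(delta[5]["b"], delta[7]["b"], delta[1]["a"])             # 4 2 1
print(finite_automaton_matcher("abababacaba", "ababaca", "abc"))  # [2]
```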
Computing the Transition Function δ
COMPUTE-TRANSITION-FUNCTION(P, Σ)
m ← length[P]
for q ← 0 to m
    do for each character a ∈ Σ
        do k ← min(m + 1, q + 2)
            repeat k ← k − 1
            until P_k ⊐ P_q a
            δ(q, a) ← k
return δ
• Computing the transition function takes time $O(m^3 |\Sigma|)$
– the outer two for loops contribute a factor of $m\,|\Sigma|$
– the inner repeat loop can run at most $m + 1$ times
– the test $P_k \sqsupset P_q a$ can require up to $m$ comparisons
Knuth-Morris-Pratt Algorithm
• Idea
– avoid both computing the transition function $\delta$ in time $O(m\,|\Sigma|)$ and testing useless shifts as in the naïve algorithm
– use an auxiliary function $\pi[1..m]$ that can be pre-computed from the pattern in time $\Theta(m)$
– the array $\pi$ allows $\delta$ to be computed efficiently "on the fly" as needed, in the amortized sense
• Prefix function $\pi$ for a pattern $P[1..m]$
– $\pi: \{1, 2, \dots, m\} \to \{0, 1, \dots, m - 1\}$ such that $\pi[q] = \max\{k : k < q \text{ and } P_k \sqsupset P_q\}$
– $\pi[q]$ is the length of the longest prefix of $P$ that is a proper suffix of $P_q$
Example
• What's the next possible shift that should be tested?
[Figure: three alignments of pattern $P$ = ababaca against text $T$ = bacbababaabcbab, one of them labeled "Bad Idea!", with the already matched characters marked as the "Knowledge Horizon"; after $q$ characters match at shift $s$, the next shift to test is $s + (q - \pi[q])$]
Prefix function for $P$ = ababaca:
$i$        1  2  3  4  5  6  7
$P[i]$     a  b  a  b  a  c  a
$\pi[i]$   0  0  1  2  3  0  1
Knuth-Morris-Pratt Algorithm
KMP-MATCHER(T, P)
n ← length[T]
m ← length[P]
π ← COMPUTE-PREFIX-FUNCTION(P)
q ← 0                                  // Number of characters matched
for i ← 1 to n                         // Scan the text from left to right
    do while q > 0 and P[q+1] ≠ T[i]
        do q ← π[q]                    // Next character does not match
    if P[q+1] = T[i]
        then q ← q + 1                 // Next character matches
    if q = m                           // Is all of P matched?
        then print "Pattern occurs with shift" i − m
            q ← π[q]                   // Look for the next match
Computing the Prefix Function π
COMPUTE-PREFIX-FUNCTION(P)
m ← length[P]
π[1] ← 0
k ← 0
for q ← 2 to m
    do while k > 0 and P[k+1] ≠ P[q]
        do k ← π[k]
    if P[k+1] = P[q]
        then k ← k + 1
    π[q] ← k
return π
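Both KMP procedures can be sketched in Python with 0-based indexing, so `pi[q - 1]` here plays the role of $\pi[q]$ in the 1-based pseudocode. As assumptions of this sketch, shifts are returned rather than printed and an empty pattern yields no matches.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """pi[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    m = len(pattern)
    pi = [0] * m
    k = 0  # length of the currently matched prefix
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]  # fall back to the next shorter candidate
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_matcher(text: str, pattern: str) -> list[int]:
    """Return all valid shifts of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    pi = compute_prefix_function(pattern)
    q = 0  # number of pattern characters matched so far
    shifts = []
    for i, ch in enumerate(text):
        while q > 0 and pattern[q] != ch:
            q = pi[q - 1]  # next character does not match
        if pattern[q] == ch:
            q += 1         # next character matches
        if q == m:         # all of the pattern matched
            shifts.append(i - m + 1)
            q = pi[q - 1]  # look for the next match
    return shifts


print(compute_prefix_function("ababaca"))     # [0, 0, 1, 2, 3, 0, 1]
print(kmp_matcher("abababacaba", "ababaca"))  # [2]
print(kmp_matcher("aaaa", "aa"))              # [0, 1, 2]
```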
Run-Time Analysis
• Computing the prefix function takes time $\Theta(m)$
– outer for loop takes time $\Theta(m)$
– amortized cost of the for loop body is $O(1)$
  • amortized analysis with a potential of $k$, corresponding to the current value of $k$ in the algorithm
  • in each iteration of the for loop, $k$ increases by at most 1
  • since $\pi[k] < k$, every iteration of the while loop decreases $k$, so the total number of decreases cannot exceed the total number of increases
• String-matching takes time $\Theta(n)$
– with $q$ as the potential function, the same amortized argument as above can be made for the matching time