String Matching - Portland State University

Post on 03-Feb-2022


String Matching

Algorithm Design and Analysis (Week 7)


Battle Plan

• String matching problem
• Notation and terminology
• Four different algorithms


Algorithm            Preprocessing time   Matching time
Naïve                $0$                  $O((n-m+1)m)$
Rabin-Karp           $\Theta(m)$          $O((n-m+1)m)$
Finite automaton     $O(m|\Sigma|)$       $\Theta(n)$
Knuth-Morris-Pratt   $\Theta(m)$          $\Theta(n)$

String-Matching Problem

• “Where’s the hotel in idahotelescope?”
• Formalization of the string-matching problem
– Text is an array $T[1..n]$ of length $n$
– Pattern is an array $P[1..m]$ of length $m \le n$
– The elements of $T$ and $P$ are drawn from a finite alphabet $\Sigma$; $T$ and $P$ are often called strings of characters
– $P$ occurs with shift $s$ in $T$ if $0 \le s \le n - m$ and $T[s+1..s+m] = P[1..m]$

Text $T$:     i d a h o t e l e s c o p e
Pattern $P$:        h o t e l   ($s = 3$)
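The definition of a valid shift can be checked directly. This is a minimal sketch, assuming Python strings with 0-based indexing, so that $T[s+1..s+m]$ becomes the slice `T[s:s + m]`; the helper name `occurs_with_shift` is hypothetical.

```python
def occurs_with_shift(T: str, P: str, s: int) -> bool:
    """Return True if pattern P occurs with shift s in text T."""
    n, m = len(T), len(P)
    # A valid shift must satisfy 0 <= s <= n - m.
    if not 0 <= s <= n - m:
        return False
    # T[s+1..s+m] in the 1-based slide notation is T[s:s+m] in Python.
    return T[s:s + m] == P


print(occurs_with_shift("idahotelescope", "hotel", 3))  # True
print(occurs_with_shift("idahotelescope", "hotel", 4))  # False
```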

Notation and Terminology

• Strings
– $\Sigma^*$: the set of all finite-length strings with characters from $\Sigma$
– $\epsilon$: the zero-length empty string, which also belongs to $\Sigma^*$
– $|x|$: the length of a string $x$
– $xy$: the concatenation of strings $x$ and $y$, which has length $|x| + |y|$
• Prefix and suffix
– string $w$ is a prefix of a string $x$, denoted as $w \sqsubset x$, if $x = wy$ for some string $y \in \Sigma^*$
– string $w$ is a suffix of a string $x$, denoted as $w \sqsupset x$, if $x = yw$ for some string $y \in \Sigma^*$
– $S_k$ denotes the $k$-character prefix $S[1..k]$ of the string $S[1..n]$, and thus $S_0 = \epsilon$ and $S_n = S = S[1..n]$

Observations

• Strings
– $|\epsilon| = 0$
• Prefix and suffix
– for any string $x$, $\epsilon \sqsubset x$ and $\epsilon \sqsupset x$
– if $w \sqsubset x$ or $w \sqsupset x$, then $|w| \le |x|$
– for any two strings $x$ and $y$ and any character $a$, $x \sqsubset y \rightarrow ax \sqsubset ay$ and $x \sqsupset y \rightarrow xa \sqsupset ya$
– both $\sqsubset$ and $\sqsupset$ are transitive relations
• Reformulated string-matching problem
– finding all shifts $s$ in the range $0 \le s \le n - m$ such that $P \sqsupset T_{s+m}$

Examples

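• Assume $\Sigma = \{\mathtt{a}, \mathtt{b}, \mathtt{c}\}$
– $\Sigma^* = \{\epsilon, \mathtt{a}, \mathtt{b}, \mathtt{c}, \mathtt{aa}, \mathtt{ab}, \mathtt{ac}, \mathtt{ba}, \mathtt{bb}, \mathtt{bc}, \mathtt{ca}, \mathtt{cb}, \mathtt{cc}, \ldots\}$
– $x$ = ab and $y$ = ba
– $|x| = |y| = 2$
– $xy$ = abba
– $|xy| = |x| + |y| = 4$
– $\epsilon \sqsubset$ abba and $\epsilon \sqsupset$ abba
– a $\sqsubset$ abba and a $\sqsupset$ abba
– ab $\sqsubset$ abba and ba $\sqsupset$ abba
– abb $\sqsubset$ abba and bba $\sqsupset$ abba
– abba $\sqsubset$ abba and abba $\sqsupset$ abba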
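The prefix and suffix relations above map onto Python's built-in `str.startswith` and `str.endswith`. The following sketch re-creates the abba example; the helper names `is_prefix`, `is_suffix`, and `prefix` are hypothetical.

```python
def is_prefix(w: str, x: str) -> bool:
    """w is a prefix of x: x = w y for some string y."""
    return x.startswith(w)


def is_suffix(w: str, x: str) -> bool:
    """w is a suffix of x: x = y w for some string y."""
    return x.endswith(w)


def prefix(S: str, k: int) -> str:
    """S_k, the k-character prefix of S (S_0 is the empty string)."""
    return S[:k]


x, y = "ab", "ba"
z = x + y                                    # concatenation xy
prefixes = [prefix(z, k) for k in range(len(z) + 1)]
suffixes = [z[len(z) - k:] for k in range(len(z) + 1)]
print(prefixes)  # ['', 'a', 'ab', 'abb', 'abba']
print(suffixes)  # ['', 'a', 'ba', 'bba', 'abba']
```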

Overlapping-Suffix Lemma

• Assume $x$, $y$, and $z$ are strings such that $x \sqsupset z$ and $y \sqsupset z$
– if $|x| \le |y|$, then $x \sqsupset y$
– if $|x| \ge |y|$, then $y \sqsupset x$
– if $|x| = |y|$, then $x = y$
• Proof

[Figure: graphical proof in three panels, one per case of the lemma, each showing $x$ and $y$ as suffixes of $z$]

Naïve String-Matching Algorithm

NAÏVE-STRING-MATCHER(T, P)
1  n ← length[T]
2  m ← length[P]
3  for s ← 0 to n − m
4      do if P[1..m] = T[s+1..s+m]
5            then print “Pattern occurs with shift” s

• Comparing two strings (line 4) takes time $\Theta(t + 1)$
– $t$ denotes the number of matching characters
– the “+1” caters for strings whose first characters already differ ($t = 0$): the comparison still costs constant time, not zero
• Naïve algorithm takes time $O((n - m + 1)m)$
– the bound is tight in the worst case: $\Theta((n - m + 1)m)$
– consider matching the text $\mathtt{a}^n$ and the pattern $\mathtt{a}^m$
– if $m = \lfloor n/2 \rfloor$, the worst-case running time is $\Theta(n^2)$
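The naïve matcher and its worst case can be sketched in Python as follows. This is an illustrative version, not the slide's code: it uses 0-based indexing, returns the shifts instead of printing them, and adds a character-comparison counter (the name `comparisons` is an assumption) so that the $(n - m + 1)m$ bound can be observed.

```python
def naive_string_matcher(T: str, P: str):
    """Return (valid shifts, number of character comparisons)."""
    n, m = len(T), len(P)
    shifts, comparisons = [], 0
    for s in range(n - m + 1):          # try every shift 0..n-m
        t = 0                           # characters matched at this shift
        while t < m:
            comparisons += 1
            if T[s + t] != P[t]:
                break
            t += 1
        if t == m:                      # all m characters matched
            shifts.append(s)
    return shifts, comparisons


print(naive_string_matcher("acaabc", "aab"))    # ([2], 8)
print(naive_string_matcher("a" * 8, "a" * 4))   # ([0, 1, 2, 3, 4], 20)
```

The second call is the worst case from the analysis: every one of the $n - m + 1 = 5$ shifts needs all $m = 4$ comparisons.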

Example

• Graphical interpretation
– sliding the pattern over the text in steps of length 1
– noting for which shifts all pattern characters equal the corresponding text characters

a c a a b c
a a b            $s = 0$

a c a a b c
  a a b          $s = 1$

a c a a b c
    a a b        $s = 2$ (match)

a c a a b c
      a a b      $s = 3$

Rabin-Karp Algorithm

• Motivation
– comparing numbers is “cheaper” than matching strings
– represent text and pattern as numbers
– use number-theoretic notions to match strings
• Assumptions and notation
– $\Sigma_{10} = \{0, 1, 2, \ldots, 9\}$, but in the general case each character will be a digit in radix-$d$ notation, where $d = |\Sigma|$
– $p$ denotes the value corresponding to $P[1..m]$
– given $T[1..n]$, $t_s$ denotes the value of the length-$m$ substring $T[s+1..s+m]$, for $s = 0, 1, \ldots, n - m$
– $t_s = p \Leftrightarrow T[s+1..s+m] = P[1..m]$

Rabin-Karp Algorithm

• Goal
– compute $p$ in time $\Theta(m)$
– compute all $t_s$ values in a total time $\Theta(n - m + 1)$
– get all valid shifts in time $\Theta(m) + \Theta(n - m + 1) = \Theta(n)$
• Computing $p$ from $P[1..m]$
– can be done in $\Theta(m)$ using Horner’s rule
– $p = P[m] + d(P[m-1] + d(P[m-2] + \cdots + d(P[2] + dP[1]) \cdots))$

Rabin-Karp Algorithm

• Computing $t_0$ from $T[1..n]$
– use Horner’s rule to compute $t_0$ in time $\Theta(m)$
• Computing $t_1, t_2, \ldots, t_{n-m}$ from $T[1..n]$
– can be done in time $\Theta(n - m)$ since $t_{s+1}$ can be computed from $t_s$ in constant time
– $t_{s+1} = d(t_s - d^{m-1}T[s+1]) + T[s+m+1]$
• Example
– assume $\Sigma_{10}$, $T = [3, 1, 4, 1, 5, 9, 2, 6]$, $P = [1, 4, 1]$, and $m = 3$
– $p = 141$, $t_0 = 314$
– $t_1 = 10(t_0 - 10^2 T[1]) + T[4] = 10(314 - 300) + 1 = 141$
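Horner's rule and the constant-time update of $t_{s+1}$ from $t_s$ can be traced on the example above. This is a sketch without the modulus (exact integers), using 0-based lists; the function names `horner` and `window_values` are hypothetical.

```python
def horner(digits, d=10):
    """Value of a digit sequence in radix d, computed by Horner's rule."""
    value = 0
    for digit in digits:
        value = d * value + digit
    return value


def window_values(T, m, d=10):
    """Return [t_0, t_1, ..., t_{n-m}] using the rolling update."""
    t = horner(T[:m], d)                 # t_0 in Theta(m)
    values = [t]
    high = d ** (m - 1)                  # weight of the leading digit
    for s in range(len(T) - m):
        # drop the leading digit T[s], shift left, append T[s + m]
        t = d * (t - high * T[s]) + T[s + m]
        values.append(t)
    return values


T = [3, 1, 4, 1, 5, 9, 2, 6]
P = [1, 4, 1]
p = horner(P)
ts = window_values(T, len(P))
print(p)                                           # 141
print(ts)                                          # [314, 141, 415, 159, 592, 926]
print([s for s, t in enumerate(ts) if t == p])     # [1]
```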

All’s Well That Ends Well

• Yes, Bill! But we’re not done yet...
– $p$ and $t_s$ may be too large to work with conveniently
– assuming arithmetic operations on these numbers take “constant time” is unreasonable
• Simple solution
– compute $p$ and all $t_s$ modulo a suitable modulus $q$
– the one extra modulo operation per step does not change the asymptotic compute time
– $q$ is typically chosen as a prime such that $dq$ fits within one computer word
– $t_{s+1} = (d(t_s - T[s+1]h) + T[s+m+1]) \bmod q$, where $h \equiv d^{m-1} \pmod{q}$

Make It As Simple As Possible But Not Simpler

• Okay, Al! Maybe we went too far this time...
– $t_s \equiv p \pmod{q}$ does not imply $t_s = p$
– $t_s \not\equiv p \pmod{q}$ does imply $t_s \ne p$
• Example
– assume $\Sigma_{10}$, $p = 31415$, and $q = 13$
– $31415 \equiv 7 \pmod{13}$
– $67399 \equiv 7 \pmod{13}$
• Solution
– use negative test as a fast heuristic to rule out invalid shifts
– positive test must be validated to sort out spurious hits
– if $q$ is large, spurious hits are likely to occur less frequently
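To see the negative and positive tests at work, the sketch below reduces every length-5 window of a text modulo $q = 13$ and classifies the hits for $p = 31415$. The text `2359023141526739921` is an assumed illustration (it is not given on the slide), chosen because it contains both 31415 and 67399; the function name `classify_hits` is hypothetical.

```python
def classify_hits(T: str, P: str, q: int):
    """Split shifts with matching residues into valid shifts and spurious hits."""
    m = len(P)
    p_mod = int(P) % q
    valid, spurious = [], []
    for s in range(len(T) - m + 1):
        window = T[s:s + m]
        if int(window) % q != p_mod:
            continue                     # negative test: shift is invalid
        # positive test: residues agree, so compare the strings themselves
        (valid if window == P else spurious).append(s)
    return valid, spurious


valid, spurious = classify_hits("2359023141526739921", "31415", 13)
print(31415 % 13, 67399 % 13)  # 7 7
print(valid, spurious)         # [6] [12]
```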

Rabin-Karp Algorithm

RABIN-KARP-MATCHER(T, P, d, q)
    n ← length[T]
    m ← length[P]
    h ← d^(m−1) mod q
    p ← 0
    t_0 ← 0
    for i ← 1 to m                          // Preprocessing
        do p ← (d·p + P[i]) mod q
           t_0 ← (d·t_0 + T[i]) mod q
    for s ← 0 to n − m                      // Matching
        do if p = t_s
              then if P[1..m] = T[s+1..s+m]
                      then print “Pattern occurs with shift” s
           if s < n − m
              then t_{s+1} ← (d(t_s − T[s+1]·h) + T[s+m+1]) mod q
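A Python rendering of the matcher might look like the sketch below. Assumptions not taken from the slides: characters are mapped to numbers with `ord`, the defaults `d = 256` and `q = 101` are illustrative only, shifts are returned as a list rather than printed, and an empty pattern yields no shifts.

```python
def rabin_karp_matcher(T: str, P: str, d: int = 256, q: int = 101):
    """Return all valid shifts of P in T (0-based), verifying every hit."""
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                 # d^(m-1) mod q
    p = t = 0
    for i in range(m):                   # preprocessing via Horner's rule
        p = (d * p + ord(P[i])) % q
        t = (d * t + ord(T[i])) % q
    shifts = []
    for s in range(n - m + 1):           # matching
        # equal residues are only a candidate; compare to rule out spurious hits
        if p == t and T[s:s + m] == P:
            shifts.append(s)
        if s < n - m:
            # Python's % always returns a value in 0..q-1, even if the
            # intermediate difference is negative
            t = (d * (t - ord(T[s]) * h) + ord(T[s + m])) % q
    return shifts


print(rabin_karp_matcher("idahotelescope", "hotel"))  # [3]
print(rabin_karp_matcher("acaabc", "aab", q=1))       # [2]
```

With `q=1` every shift is a hit, so the second call degenerates to the naïve algorithm but still returns only valid shifts.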

Run-Time Analysis

• Worst case
– $\Theta(m)$ to preprocess and $\Theta((n - m + 1)m)$ to match
• Heuristic analysis of average case
– “modulo $q$” acts as a random mapping from $\Sigma^*$ to $\mathbb{Z}_q$
– number of spurious hits expected to be $O(n/q)$, since the probability of $t_s \equiv p \pmod{q}$ can be estimated as $1/q$
– expected matching time of the Rabin-Karp algorithm is $O(n) + O(m(v + n/q))$, where $v$ is the number of valid shifts; the first term covers the shifts without a hit, the second the explicit verification of every hit
– if $v = O(1)$ and $q \ge m$, the running time is $O(n + m)$, and since $m \le n$ the expected matching time is even $O(n)$

String Matching with Finite Automata

• Idea
– build a finite automaton to scan $T$ for all occurrences of $P$
– examine each text character exactly once and in constant time
– matching time $\Theta(n)$, but preprocessing time can be large
• A finite automaton $M$ is a 5-tuple $(Q, q_0, A, \Sigma, \delta)$
– $Q$ is a finite set of states
– $q_0 \in Q$ is the start state
– $A \subseteq Q$ is a distinguished set of accepting states
– $\Sigma$ is a finite input alphabet
– $\delta$ is a function from $Q \times \Sigma$ into $Q$, called the transition function of $M$

String Matching with Finite Automata

• Finite automaton
– begins in state $q_0$, reads one input character $a$ at a time
– transitions from state $q$ into state $\delta(q, a)$
– accepts the string read so far if current state $q \in A$
– rejects the string read so far if current state $q \notin A$
• A finite automaton induces a final-state function $\phi$
– $\phi: \Sigma^* \rightarrow Q$, such that $q = \phi(w)$ is the state $M$ is in after scanning the string $w$
– $M$ accepts a string $w$ if and only if $\phi(w) \in A$
– recursive definition of $\phi$:
  $\phi(\epsilon) = q_0$
  $\phi(wa) = \delta(\phi(w), a)$ for $w \in \Sigma^*$, $a \in \Sigma$
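The final-state function can be written as a loop that unrolls the recursive definition. The two-state automaton below is a hypothetical example, not one from the slides: it accepts exactly the strings over {a, b} that end in an odd number of a's. The names `final_state` and `accepts` are likewise assumptions.

```python
def final_state(delta, q0, w):
    """phi(w): the state reached from q0 after scanning w."""
    q = q0                        # phi(epsilon) = q0
    for a in w:
        q = delta[(q, a)]         # phi(wa) = delta(phi(w), a)
    return q


def accepts(delta, q0, A, w):
    """M accepts w if and only if phi(w) is an accepting state."""
    return final_state(delta, q0, w) in A


# Hypothetical automaton: state 1 means "an odd number of trailing a's".
delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 0}
q0, A = 0, {1}

print(accepts(delta, q0, A, "abaaa"))  # True
print(accepts(delta, q0, A, "abbaa"))  # False
```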

String-Matching Automata

• For every pattern $P[1..m]$, we need to construct a string-matching automaton in preprocessing
– the state set $Q$ is $\{0, 1, \ldots, m\}$, where start state $q_0$ is state 0 and state $m$ is the only accepting state
– the transition function is defined as $\delta(q, a) = \sigma(P_q a)$ for any state $q$ and character $a$
• Suffix function $\sigma$ for a given pattern $P[1..m]$
– $\sigma: \Sigma^* \rightarrow \{0, 1, \ldots, m\}$ such that $\sigma(x) = \max\{k : P_k \sqsupset x\}$ is the length of the longest prefix of $P$ that is a suffix of $x$
– for a pattern $P$ of length $m$, $\sigma(x) = m$ if and only if $P \sqsupset x$
– if $x \sqsupset y$, then $\sigma(x) \le \sigma(y)$
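The suffix function translates almost literally into Python: try the candidate lengths $k$ from largest to smallest and return the first $P_k$ that is a suffix of $x$. This is a direct, unoptimized sketch; the names `sigma` and `delta` mirror the symbols above, and the pattern `ababaca` is used for illustration.

```python
def sigma(P: str, x: str) -> int:
    """Length of the longest prefix of P that is a suffix of x."""
    for k in range(min(len(P), len(x)), -1, -1):
        if x.endswith(P[:k]):     # P_k is a suffix of x
            return k              # k = 0 always succeeds (empty prefix)


def delta(P: str, q: int, a: str) -> int:
    """Transition of the string-matching automaton: sigma(P_q a)."""
    return sigma(P, P[:q] + a)


P = "ababaca"
print(sigma(P, "ababab"))   # 4
print(delta(P, 5, "b"))     # 4
print(delta(P, 7, "b"))     # 2
```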

Example

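• Assume pattern $P$ = ababaca
– 8 states and a “spine” of forward transitions
– $\delta(1, \mathtt{a}) = 1$, since $P_1\mathtt{a}$ = aa and $\sigma(P_1\mathtt{a}) = 1$
– $\delta(3, \mathtt{a}) = 1$, since $P_3\mathtt{a}$ = abaa and $\sigma(P_3\mathtt{a}) = 1$
– $\delta(5, \mathtt{a}) = 1$, since $P_5\mathtt{a}$ = ababaa and $\sigma(P_5\mathtt{a}) = 1$
– $\delta(5, \mathtt{b}) = 4$, since $P_5\mathtt{b}$ = ababab and $\sigma(P_5\mathtt{b}) = 4$
– $\delta(7, \mathtt{a}) = 1$, since $P_7\mathtt{a}$ = ababacaa and $\sigma(P_7\mathtt{a}) = 1$
– $\delta(7, \mathtt{b}) = 2$, since $P_7\mathtt{b}$ = ababacab and $\sigma(P_7\mathtt{b}) = 2$

[Figure: state-transition diagram with states 0 to 7; the spine edges carry the pattern characters a, b, a, b, a, c, a, and the six transitions listed above are drawn as additional edges labeled a or b]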

String-Matching Automata

FINITE-AUTOMATON-MATCHER(T, P, Σ, m)
    n ← length[T]
    δ ← COMPUTE-TRANSITION-FUNCTION(P, Σ)
    q ← 0
    for i ← 1 to n
        do q ← δ(q, T[i])
           if q = m
              then print “Pattern occurs with shift” i − m

• Matching time on a text of length $n$ is $\Theta(n)$
– simple loop structure with $n$ iterations
– does not account for the time required to compute the transition function $\delta$

Computing the Transition Function δ

COMPUTE-TRANSITION-FUNCTION(P, Σ)
    m ← length[P]
    for q ← 0 to m
        do for each character a ∈ Σ
              do k ← min(m + 1, q + 2)
                 repeat k ← k − 1
                    until P_k ⊐ P_q a
                 δ(q, a) ← k
    return δ

• Computing the transition function takes time $O(m^3|\Sigma|)$
– outer two for loops contribute a factor of $m|\Sigma|$
– inner repeat loop can run at most $m + 1$ times
– test $P_k \sqsupset P_q a$ can require up to $m$ comparisons
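Both procedures can be combined into a short Python sketch. Assumptions beyond the slides: the transition function is stored in a dictionary keyed by `(state, character)`, the alphabet is passed as a string that contains every character of the pattern, text characters outside the alphabet send the automaton back to state 0, and shifts are returned as a 0-based list.

```python
def compute_transition_function(P: str, alphabet: str):
    """Build delta[(q, a)] = sigma(P_q a) by brute force, O(m^3 |alphabet|)."""
    m = len(P)
    delta = {}
    for q in range(m + 1):
        for a in alphabet:
            k = min(m, q + 1)            # value after the first "k <- k - 1"
            while not (P[:q] + a).endswith(P[:k]):
                k -= 1                   # stops at k = 0 at the latest
            delta[(q, a)] = k
    return delta


def finite_automaton_matcher(T: str, P: str, alphabet: str):
    """Scan T once; report a shift whenever the accepting state m is reached."""
    m = len(P)
    delta = compute_transition_function(P, alphabet)
    q, shifts = 0, []
    for i, c in enumerate(T, start=1):   # i is the 1-based text position
        q = delta.get((q, c), 0)         # unknown characters reset to state 0
        if q == m:
            shifts.append(i - m)
    return shifts


print(finite_automaton_matcher("abababacaba", "ababaca", "abc"))  # [2]
```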

Knuth-Morris-Pratt Algorithm

• Idea
– avoid both computing the transition function $\delta$ in time $O(m|\Sigma|)$ and testing useless shifts as in the naïve algorithm
– use auxiliary function $\pi[1..m]$ that can be pre-computed from the pattern in time $\Theta(m)$
– array $\pi$ allows $\delta$ to be computed efficiently “on the fly” as needed, in the amortized sense
• Prefix function $\pi$ for a pattern $P[1..m]$
– $\pi: \{1, 2, \ldots, m\} \rightarrow \{0, 1, \ldots, m - 1\}$ such that $\pi[q] = \max\{k : k < q \text{ and } P_k \sqsupset P_q\}$
– $\pi[q]$ is the length of the longest prefix of $P$ that is a proper suffix of $P_q$
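The definition of $\pi$ can be evaluated literally, without any cleverness, which is useful as a reference when checking a faster implementation. In this sketch (the function name `prefix_function_bruteforce` is hypothetical) the returned list is 0-based, so `pi[q - 1]` holds $\pi[q]$.

```python
def prefix_function_bruteforce(P: str):
    """pi[q-1] = max{k : k < q and P_k is a suffix of P_q}, by definition."""
    pi = []
    for q in range(1, len(P) + 1):
        Pq = P[:q]
        # k = 0 always qualifies, so the max is taken over a non-empty set
        pi.append(max(k for k in range(q) if Pq.endswith(P[:k])))
    return pi


print(prefix_function_bruteforce("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```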

Example

• What’s the next possible shift that should be tested?

[Figure: three alignments of the pattern $P$ = ababaca against the text $T$ = bacbababaabcbab, annotated “Bad Idea!” and “Knowledge Horizon”, with the current shift $s$, the number of matched characters $q$, and the next shift $s + q - \pi[q]$ marked]

$i$        1 2 3 4 5 6 7
$P[i]$     a b a b a c a
$\pi[i]$   0 0 1 2 3 0 1

Knuth-Morris-Pratt Algorithm

KMP-MATCHER(T, P)
    n ← length[T]
    m ← length[P]
    π ← COMPUTE-PREFIX-FUNCTION(P)
    q ← 0                                   // Number of characters matched
    for i ← 1 to n                          // Scan the text from left to right
        do while q > 0 and P[q + 1] ≠ T[i]
              do q ← π[q]                   // Next character does not match
           if P[q + 1] = T[i]
              then q ← q + 1                // Next character matches
           if q = m                         // Is all of P matched?
              then print “Pattern occurs with shift” i − m
                   q ← π[q]                 // Look for the next match
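The assignment q ← π[q] after a mismatch corresponds to sliding the pattern from shift $s$ to shift $s + q - \pi[q]$. The sketch below illustrates this for the text bacbababaabcbab and the pattern ababaca with the $\pi$ values 0 0 1 2 3 0 1. The starting shift $s = 4$ and the helper name `matched_prefix_length` are assumptions made for the illustration, and Python lists are 0-based, so `pi[q - 1]` holds $\pi[q]$.

```python
def matched_prefix_length(T: str, P: str, s: int) -> int:
    """Number q of leading pattern characters that match T at shift s."""
    q = 0
    while q < len(P) and s + q < len(T) and T[s + q] == P[q]:
        q += 1
    return q


T = "bacbababaabcbab"
P = "ababaca"
pi = [0, 0, 1, 2, 3, 0, 1]        # prefix function of P, pi[q-1] = pi[q]

s = 4                              # assumed current shift
q = matched_prefix_length(T, P, s)
next_shift = s + q - pi[q - 1]     # requires q > 0
print(q, next_shift)               # 5 6
```

Shift $s + 1 = 5$ can be skipped because the first pattern character a would be aligned with a text character that is already known to be b.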

Computing the Prefix Function π

COMPUTE-PREFIX-FUNCTION(P)
    m ← length[P]
    π[1] ← 0
    k ← 0
    for q ← 2 to m
        do while k > 0 and P[k + 1] ≠ P[q]
              do k ← π[k]
           if P[k + 1] = P[q]
              then k ← k + 1
           π[q] ← k
    return π
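The two KMP procedures can be sketched in Python as follows. Differences from the pseudocode are assumptions of this sketch: indexing is 0-based (so `pi[q - 1]` plays the role of $\pi[q]$ and `P[q]` the role of $P[q+1]$), shifts are collected in a list, and an empty pattern returns no shifts.

```python
def compute_prefix_function(P: str):
    """Return pi as a 0-based list: pi[q-1] is the slide's pi[q]."""
    m = len(P)
    pi = [0] * m
    k = 0                                  # length of the current border
    for q in range(1, m):
        while k > 0 and P[k] != P[q]:
            k = pi[k - 1]                  # fall back to a shorter border
        if P[k] == P[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_matcher(T: str, P: str):
    """Return all valid shifts (0-based) of P in T."""
    n, m = len(T), len(P)
    if m == 0:
        return []
    pi = compute_prefix_function(P)
    q, shifts = 0, []                      # q = number of characters matched
    for i in range(n):                     # scan the text left to right
        while q > 0 and P[q] != T[i]:
            q = pi[q - 1]                  # next character does not match
        if P[q] == T[i]:
            q += 1                         # next character matches
        if q == m:                         # all of P matched
            shifts.append(i - m + 1)
            q = pi[q - 1]                  # look for the next match
    return shifts


print(compute_prefix_function("ababaca"))       # [0, 0, 1, 2, 3, 0, 1]
print(kmp_matcher("abababacaba", "ababaca"))    # [2]
```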

Run-Time Analysis

• Computing the prefix function takes time $\Theta(m)$
– the outer for loop runs $\Theta(m)$ iterations
– the amortized cost of the for loop body is $O(1)$
  • amortized analysis with a potential of $k$, corresponding to the current value of $k$ in the algorithm
  • in each iteration of the for loop, $k$ increases by at most 1
  • since $\pi[k] < k$, every iteration of the while loop decreases $k$, and $k$ never becomes negative, so the total number of decreases is bounded by the total number of increases
• String matching takes time $\Theta(n)$
– with $q$ as the potential function, the same amortized argument as above can be made for the matching time
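The amortized argument can be observed by instrumenting the prefix-function computation. The sketch below is an illustration, not part of the slides: it counts how often $k$ is increased and how often the while loop decreases it. The name `count_prefix_function_steps` is hypothetical.

```python
def count_prefix_function_steps(P: str):
    """Return (increases, decreases) of k while computing the prefix function."""
    m = len(P)
    pi = [0] * m
    k = 0
    increases = decreases = 0
    for q in range(1, m):
        while k > 0 and P[k] != P[q]:
            k = pi[k - 1]          # strict decrease, since pi[k] < k
            decreases += 1
        if P[k] == P[q]:
            k += 1                 # at most one increase per iteration
            increases += 1
        pi[q] = k
    return increases, decreases


inc, dec = count_prefix_function_steps("ababaca")
print(inc, dec)                    # 4 2
```

Because $k$ starts at 0 and never goes negative, the decreases can never outnumber the increases, and there are at most $m - 1$ increases, which gives the $\Theta(m)$ total.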