
University of Macedonia

Department of Applied Informatics

Doctoral Thesis

Parallel and Distributed Implementations of Multiple and Two-Dimensional Pattern Matching Algorithms

Charalampos S. Kouzinopoulos

September, 2013


Abstract

Pattern matching is a fundamental problem in the field of Computer Science. This thesis examines the problems of multiple and two-dimensional pattern matching. Multiple pattern matching can be broadly defined as the location of all the positions in a one-dimensional string, also called the text, where one or more strings, also called patterns, from a finite pattern set occur. Two-dimensional pattern matching can be broadly defined as the location of all the positions in a two-dimensional text where a two-dimensional pattern occurs.

This thesis was written with three different goals. The first goal involves the classification and survey of the algorithms that solve the multiple and two-dimensional pattern matching problems and the evaluation of the performance of some of the most well known algorithms of each category for different parameters, including the size of the text and of the alphabet as well as the length of the patterns. Although a purely theoretical study of the algorithms can be useful, it is known that algorithms with exceptional theoretical performance can, in several cases, be outperformed by algorithms of a simpler nature, depending on the implementation or optimization techniques, the system architecture and the type of data used.

The second goal of the thesis is to improve the performance of some of the most well known two-dimensional pattern matching algorithms, Baker and Bird and Baeza-Yates and Regnier. Both algorithms use the Aho-Corasick multiple pattern matching algorithm to essentially transform the two-dimensional pattern matching problem into a multiple pattern matching problem. This thesis presents efficient variants of the two algorithms and evaluates their performance for different types of data.

The third goal also concerns increasing the performance of the algorithms that solve the multiple and two-dimensional pattern matching problems, but through a different approach. As the sizes of the data that the algorithms presented in this thesis are called to process have been growing at an almost exponential rate in recent years, executing them sequentially on conventional computers is often infeasible. In recent years, various tools and APIs have been created to help programmers exploit the enormous processing power of a Graphics Processing Unit (GPU). The use of graphics processing units in place of CPUs to perform general purpose computations is known as General Purpose computing on Graphics Processing Units (GPGPU). The multiple and two-dimensional pattern matching algorithms presented in this thesis are implemented in parallel on different architectures, including graphics processing units and hybrid computer clusters. In addition, different implementation and optimization techniques are analyzed and the performance of the implementations is evaluated for different parameters.



Acknowledgements

I would like to thank my parents for the love they have shown me and for their support throughout my studies, and my friends who stood by me whenever I needed them.

I would especially like to thank my supervisor, Professor Konstantinos Margaritis, for his guidance, his important remarks and his help throughout the preparation of this doctoral thesis, as well as Lecturer Panagiotis Michailidis, Professor Georgios Evangelidis and Associate Professor Nikolaos Samaras for their valuable comments, which helped improve the text of the thesis.

I would like to thank by name everyone who worked at the Parallel and Distributed Processing Laboratory for their support and encouragement all these years: Themis Pyrgiotis, Thaleia Karydi, Sandy Manousaridou, Foteini Trimmi, Vasia Mole, Anna Krassa, Giorgos Fotis, Vangelis Grigoropoulos and Giorgos Kalpakis.

Finally, I would like to warmly thank everyone who contributed in any way to the completion of this research effort.


Contents

Contents vii

1 Introduction 1

1.1 Results and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Experimental Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2.1 Multiple Pattern Matching Algorithm Implementations . . . . . . . . . . . . 3

1.2.2 Two-dimensional Pattern Matching Algorithm Implementations . . . . . . . . 4

1.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Background 9

2.1 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.1.1 Alphabets and Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.1.2 Bit Vectors and Bitwise Operations . . . . . . . . . . . . . . . . . . . . . . . . 9

2.1.3 Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.2 String Matching Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.2.1 Knuth-Morris-Pratt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.2.2 Boyer-Moore . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.2.3 Backward Oracle Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.2.4 Shift-Or . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3 Multiple Pattern Matching 21

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.3 Multiple Pattern Matching Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.3.1 Aho-Corasick . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.3.2 Set Horspool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.3.3 Set Backward Oracle Matching . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.3.4 Wu-Manber . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.3.5 SOG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.4.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.4.2 Searching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.4.3 Overall Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.5 Appendix: Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57


4 Two-dimensional Matching using Multiple Pattern Matching Algorithms 63

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.3 Two-dimensional Pattern Matching Algorithms . . . . . . . . . . . . . . . . . . . . . 64

4.3.1 Baker and Bird . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

4.3.2 Baeza-Yates and Regnier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

4.4 Algorithm Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4.4.1 Two-dimensional Pattern Matching using Set Horspool . . . . . . . . . . . . 73

4.4.2 Two-dimensional Pattern Matching using Set Backward Oracle Matching . . 73

4.4.3 Two-dimensional Pattern Matching using Wu-Manber . . . . . . . . . . . . . 74

4.4.4 Two-dimensional Pattern Matching using SOG . . . . . . . . . . . . . . . . . 75

4.5 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

4.5.1 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

4.5.2 Algorithm Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

4.6 Appendix: Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5 Modern Parallel Architectures 89

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

5.2 Classification of Computer Architectures . . . . . . . . . . . . . . . . . . . . . . . . . 89

5.3 Distributed Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

5.3.1 Distributed Memory Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 90

5.3.2 The Master-Worker Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

5.3.3 The MPI Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

5.4 Graphics Processing Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

5.4.1 GPU Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

5.4.2 Memory Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

5.4.3 GPU Program Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

5.4.4 The CUDA Programming Model . . . . . . . . . . . . . . . . . . . . . . . . . 101

5.5 Hybrid Parallel Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

6 Multiple Pattern Matching on a GPU using CUDA 107

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

6.3 GPU implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

6.3.1 Basic Parallelization Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

6.3.2 Implementation Limitations and Optimization Techniques . . . . . . . . . . . 113

6.4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

6.5 Appendix: Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

7 Two-dimensional Pattern Matching on a GPU using CUDA 129

7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

7.2 GPU implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

7.2.1 Basic Parallelization Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

7.2.2 Implementation Limitations and Optimization Techniques . . . . . . . . . . . 131

7.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

7.4 Appendix: Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140


8 Implementation of Pattern Matching Algorithms on a Hybrid Architecture 143

8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

8.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

8.3 Hybrid Parallel Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

8.3.1 Parallel Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

8.3.2 Performance Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

8.4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

8.5 Appendix: Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

9 Conclusions 155

9.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

9.2 Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157


List of Figures

2.1 An example of a labeled rooted tree . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.2 Shifting the pattern using the Knuth-Morris-Pratt algorithm . . . . . . . . . . . . . 11

2.3 The automaton of the Knuth-Morris-Pratt algorithm . . . . . . . . . . . . . . . . . . 12

2.4 Shifting the pattern when a mismatch occurs using the Boyer-Moore algorithm . . . 16

2.5 The factor oracle of the Backward Oracle Matching algorithm . . . . . . . . . . . . . 16

3.1 The automaton of the Aho-Corasick algorithm . . . . . . . . . . . . . . . . . . . . . 24

3.2 The automaton of the Set Horspool algorithm . . . . . . . . . . . . . . . . . . . . . . 27

3.3 The automaton of the Set Backward Oracle Matching algorithm . . . . . . . . . . . 29

3.4 Comparing the suffix and prefix of the search window of the Wu-Manber algorithm . 32

3.5 Preprocessing time of the algorithms for randomly generated data with |Σ| = 2 . . . 37

3.6 Preprocessing time of the algorithms for the E.coli genome with |Σ| = 4 . . . . . . . 38

3.7 Preprocessing time of the algorithms for the Swiss Prot. Amino Acid database . . . 38

3.8 Preprocessing time of the algorithms for English alphabet data . . . . . . . . . . . . 39

3.9 Searching time of the algorithms for randomly generated input strings . . . . . . . . 41

3.10 Searching time of the algorithms for the E.coli genome . . . . . . . . . . . . . . . . . 41

3.11 Searching time of the algorithms for the FASTA Nucleic Acid database . . . . . . . . 42

3.12 Searching time of the algorithms for the Swiss Prot. Amino Acid database . . . . . . 42

3.13 Searching time of the algorithms for the FASTA Amino Acid database . . . . . . . . 43

3.14 Searching time of the algorithms for English alphabet input strings . . . . . . . . . . 43

3.15 Experimental map of the multiple pattern matching implementations for m = 8 . . . 45

3.16 Experimental map of the multiple pattern matching implementations for m = 32 . . 46

3.17 Comparison between the searching times of the Aho-Corasick algorithm for m = 8 . 47

3.18 Comparison between the searching times of the Aho-Corasick algorithm for m = 32 . 48

3.19 Comparison between the searching times of the Set Horspool algorithm for m = 8 . . 49

3.20 Comparison between the searching times of the Set Horspool algorithm for m = 32 . 50

3.21 Comparison between the searching times of the SBOM algorithm for m = 8 . . . . . 51

3.22 Comparison between the searching times of the SBOM algorithm for m = 32 . . . . 52

3.23 Comparison between the searching times of the Wu-Manber algorithm for m = 8 . . 53

3.24 Comparison between the searching times of the Wu-Manber algorithm for m = 32 . 54

3.25 Comparison between the searching times of the SOG algorithm for m = 8 . . . . . . 55

3.26 Comparison between the searching times of the SOG algorithm for m = 32 . . . . . 56

4.1 The automaton of the Aho-Corasick algorithm for the example pattern P . . . . . . 68

4.2 The automaton of the Knuth-Morris-Pratt algorithm for the row indices of P ′ . . . 68

4.3 Preprocessing time of the algorithm implementations . . . . . . . . . . . . . . . . . . 78

4.4 Searching time of the Baker and Bird implementations when n = 1, 000 . . . . . . . 79


4.5 Searching time of the Baker and Bird implementations when n = 10, 000 . . . . . . . 80

4.6 Searching time of the Baeza-Yates and Regnier implementations when n = 1, 000 . . 81

4.7 Searching time of the Baeza-Yates and Regnier implementations when n = 10, 000 . 82

4.8 Experimental map of the implementations of the Baker and Bird algorithm . . . . . 84

4.9 Experimental map of the implementations of the Baeza-Yates and Regnier algorithm 85

5.1 The NVIDIA GTX 280 GPU architecture . . . . . . . . . . . . . . . . . . . . . . . . 98

5.2 The NVIDIA GT 240 GPU architecture . . . . . . . . . . . . . . . . . . . . . . . . . 99

5.3 An overview of a hybrid GPU cluster architecture . . . . . . . . . . . . . . . . . . . 105

6.1 Pseudocode of the host function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

6.2 Speedup of different Aho-Corasick implementations for all types of data . . . . . . . 122

6.3 Speedup of different Set Horspool implementations for all types of data . . . . . . . 122

6.4 Speedup of different SBOM implementations for all types of data . . . . . . . . . . . 123

6.5 Speedup of different Wu-Manber implementations for all types of data . . . . . . . . 123

6.6 Speedup of different SOG implementations for all types of data . . . . . . . . . . . . 124

7.1 Speedup of the parallel implementation of the Baker and Bird algorithm . . . . . . . 138

7.2 Speedup of the parallel implementation of the Baeza-Yates and Regnier algorithm . 139

8.1 Pseudocode of the hybrid implementation . . . . . . . . . . . . . . . . . . . . . . . . 146

8.2 The searching time of the trie-based multiple pattern matching algorithms . . . . . . 148

8.3 The searching time of the hash-based multiple pattern matching algorithms . . . . . 149

8.4 The searching time of the two-dimensional pattern matching algorithms . . . . . . . 149

8.5 The time to export data to the nodes and to copy preprocessing arrays to the GPU . 150


List of Tables

2.1 Known theoretical time complexity of string matching algorithms . . . . . . . . . . . 20

3.1 Known theoretical time complexity of multiple pattern matching algorithms . . . . . 36

3.2 Preprocessing and searching time for a randomly generated input string with m = 8 57

3.3 Preprocessing and searching time for a randomly generated input string with m = 32 57

3.4 Preprocessing and searching time for the E.coli genome with m = 8 . . . . . . . . . . 58

3.5 Preprocessing and searching time for the E.coli genome with m = 32 . . . . . . . . . 58

3.6 Preprocessing and searching time for the FNA database with m = 8 . . . . . . . . . 59

3.7 Preprocessing and searching time for the FNA database with m = 32 . . . . . . . . . 59

3.8 Preprocessing and searching time for the Swiss Prot. database with m = 8 . . . . . . 60

3.9 Preprocessing and searching time for the Swiss Prot. database with m = 32 . . . . . 60

3.10 Preprocessing and searching time for the FAA database with m = 8 . . . . . . . . . 61

3.11 Preprocessing and searching time for the FAA database with m = 32 . . . . . . . . . 61

3.12 Preprocessing and searching time for an English alphabet input string with m = 8 . 62

3.13 Preprocessing and searching time for an English alphabet input string with m = 32 . 62

4.1 Known theoretical time complexity of the Baker and Bird implementations . . . . . 76

4.2 Known theoretical time complexity of the Baeza-Yates and Regnier implementations 76

4.3 Preprocessing time of the Baker and Bird implementations . . . . . . . . . . . . . . 86

4.4 Preprocessing time of the Baeza-Yates and Regnier implementations . . . . . . . . . 86

4.5 Searching time of the Baker and Bird implementations when n = 1, 000 . . . . . . . 87

4.6 Searching time of the Baker and Bird implementations when n = 10, 000 . . . . . . . 87

4.7 Searching time of the Baeza-Yates and Regnier implementations when n = 1, 000 . . 88

4.8 Searching time of the Baeza-Yates and Regnier implementations when n = 10, 000 . 88

5.1 Essential functions of the MPI library . . . . . . . . . . . . . . . . . . . . . . . . . . 92

5.2 MPI Datatypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

5.3 Useful information of the MPI Status function . . . . . . . . . . . . . . . . . . . . . 94

5.4 Operators of the MPI Reduce function . . . . . . . . . . . . . . . . . . . . . . . . . . 96

5.5 Flags of the cudaHostAlloc function . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

5.6 GPGPU implementation notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

6.1 Register usage per thread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

6.2 Searching time of the parallel implementation of the Aho-Corasick algorithm . . . . 126

6.3 Searching time of the parallel implementation of the Set Horspool algorithm . . . . . 126

6.4 Searching time of the parallel implementation of the SBOM algorithm . . . . . . . . 127

6.5 Searching time of the parallel implementation of the Wu-Manber algorithm . . . . . 127


6.6 Searching time of the parallel implementation of the SOG algorithm . . . . . . . . . 128

7.1 Register usage per thread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

7.2 Searching time of the Baker and Bird implementation for |Σ| = 2 . . . . . . . . . . . 140

7.3 Searching time of the Baker and Bird implementation for |Σ| = 256 . . . . . . . . . . 140

7.4 Searching time of the Baker and Bird implementation for |Σ| = 1024 . . . . . . . . . 141

7.5 Searching time of the Baeza-Yates and Regnier implementation for |Σ| = 2 . . . . . . 141

7.6 Searching time of the Baeza-Yates and Regnier implementation for |Σ| = 256 . . . . 142

7.7 Searching time of the Baeza-Yates and Regnier implementation for |Σ| = 1024 . . . . 142

8.1 Execution time of the parallel implementation of Aho-Corasick . . . . . . . . . . . . 151

8.2 Execution time of the parallel implementation of Set Horspool . . . . . . . . . . . . 151

8.3 Execution time of the parallel implementation of Set Backward Oracle Matching . . 152

8.4 Execution time of the parallel implementation of Wu-Manber . . . . . . . . . . . . . 152

8.5 Execution time of the parallel implementation of SOG . . . . . . . . . . . . . . . . . 153

8.6 Execution time of the parallel implementation of Baker and Bird . . . . . . . . . . . 153

8.7 Execution time of the parallel implementation of Baeza-Yates and Regnier . . . . . . 154


Chapter 1

Introduction

String matching is a fundamental problem in the area of scientific computing. When two different one-dimensional strings are taken as an input, the so-called "input string" and the so-called "pattern", the string matching problem involves the location of all the positions in the input string where the pattern appears. As there has been an increased interest in some form of string matching in order to solve needs very different in nature, different variations of the string matching problem have been introduced over the years. Two such variations are multiple pattern matching and two-dimensional pattern matching.

The multiple pattern matching problem involves the location of all the positions in a one-dimensional input string where one or more patterns from a finite pattern set occur and is commonly used by intrusion detection systems, virus scanners and spam filters as well as in the area of computational biology. The two-dimensional pattern matching problem is used to locate all the positions inside a two-dimensional input string where a two-dimensional pattern of a smaller or equal size occurs. It is used in some form by OCR systems and for information retrieval from image databases. As textual, biological and image databases have continued to grow at an exponential rate over the past years, there is an increased cost involved in terms of execution time and memory resources for the algorithms to process them. There is not a single solution to this problem; instead it can be addressed from different perspectives.

This thesis is written with three goals in mind. The first goal involves the classification and survey of the algorithms that solve the multiple and two-dimensional pattern matching problems and the evaluation of the performance of some of the most well known algorithms for different problem parameters, including the size of the input string, the pattern and the alphabet used as well as the size of the pattern set in the case of multiple pattern matching algorithms. Although a theoretical-only study of the algorithms might be appropriate, it is well known that algorithms with exceptional theoretical performance (i.e. algorithms with a linear or sub-linear theoretical searching phase) can, in many cases, be outperformed by simpler algorithms, depending on the implementation or optimization techniques used, the system architecture and the type of data. The next few chapters, with the help of experimental maps, can help the reader identify the most practical algorithms for either multiple or two-dimensional pattern matching.

The second goal of this thesis is to improve the performance of some of the most well known two-dimensional pattern matching algorithms, Baker and Bird and Baeza-Yates and Regnier. Both algorithms, as described in later chapters, use the Aho-Corasick multiple pattern matching algorithm to essentially transform the two-dimensional problem into a multiple pattern matching problem. The same paper that introduced the Baeza-Yates and Regnier algorithm [11] also suggested that the use of a different multiple pattern matching algorithm instead of Aho-Corasick should result in the


improvement of the algorithm's performance. This motivated us to introduce efficient variants of Baker and Bird and Baeza-Yates and Regnier in chapter 4 and to evaluate their performance for different types of data.

The third goal also concerns improving the performance of multiple and two-dimensional pattern matching algorithms, but following a different approach. It can often be impractical to execute the algorithms sequentially on conventional computer systems when large data set sizes are involved. In the past years, different tools and APIs were introduced to help the programmer tap the sheer processing power of a Graphics Processing Unit (GPU). The use of a GPU instead of a CPU to perform general purpose computations is known as General Purpose computing on Graphics Processing Units (GPGPU). To exploit the massively parallel nature of GPUs in order to achieve a significant performance increase, the algorithms of chapters 3 and 4 are implemented in parallel on different architectures, including GPUs and a hybrid cluster that uses message-passing to synchronize homogeneous computer nodes of CPUs and GPUs. Then, different implementation and optimization techniques are presented and the performance of the implementations is evaluated for different problem parameters.

1.1 Results and Contributions

The main contributions of this thesis are as follows:

• We classify and survey different multiple pattern matching algorithms, we present an extensive experimental analysis of the preprocessing and searching phases of the Aho-Corasick, Set Horspool, Set Backward Oracle Matching, Wu-Manber and SOG algorithms and we identify the most practical algorithms for several problem parameters.

• We introduce efficient variants of the Baker and Bird and the Baeza-Yates and Regnier two-dimensional pattern matching algorithms and evaluate their performance in terms of preprocessing and searching time. Additionally, we identify the most practical algorithm implementations for several problem parameters.

• We implement the Set Horspool, Set Backward Oracle Matching and SOG multiple pattern matching algorithms in parallel on a GPU using the CUDA API. For the implementation of the algorithms we describe different optimization techniques and we elaborate on the impact they have on the performance of the algorithms. The speedup of the parallel implementations over the sequential ones when executed on an NVIDIA GTX 280 GPU is also given for different types of data.

• We present efficient parallel implementations of the Baker and Bird and the Baeza-Yates and Regnier two-dimensional pattern matching algorithms on a GPU using the CUDA API. The different optimization techniques that were used to implement the algorithms are also described, while the speedup of the parallel implementations over the sequential ones is given for different types of data when executed on an NVIDIA GTX 280 GPU.

• We introduce a hybrid parallel architecture and we implement in parallel the Aho-Corasick, Set Horspool, Set Backward Oracle Matching, Wu-Manber and SOG multiple pattern matching algorithms and the Baker and Bird and the Baeza-Yates and Regnier two-dimensional pattern matching algorithms using the MPI and CUDA APIs. The parallel architecture is evaluated on a homogeneous cluster of 11 computer nodes, each with an Intel Core i3 530 CPU and an NVIDIA GT 240 GPU.


1.2 Experimental Methodology

This section describes the experimental methodology used for the experiments of this thesis. All algorithms were implemented using the ANSI C programming language and were compiled using the GCC compiler with the "-O2" and "-funroll-loops" optimization flags. The sequential implementations of chapters 3 to 4 and 6 to 7 were compiled with version 4.4.3 of GCC while the sequential implementations of chapter 8 were compiled with version 4.6.3. The parallel GPU implementations of the algorithms of chapters 6, 7 and 8 were compiled using the NVCC 5.0 compiler of the CUDA API with the "-O2" optimization flag. Finally, the hybrid parallel architecture of chapter 8 was compiled using MPICC version 3.0.2 as included in the MPICH2 implementation of the Message Passing Interface (MPI) API.

All computer systems used to evaluate the algorithm implementations presented in this thesis had the Ubuntu Linux operating system installed. During the experiments only the typical background processes ran. The experiments of chapters 3 and 4 were executed locally on an Intel Core 2 Duo E8400 CPU with a 3GHz clock speed, 2GB of memory, 64KB of L1 cache and 6MB of L2 cache. For the experiments of chapters 6 and 7, an Intel Xeon CPU with a 2.40GHz clock speed and 2GB of memory was used as the host and a GTX 280 GPU was used as the device. To evaluate the performance of the hybrid parallel architecture of chapter 8, a homogeneous cluster of 11 nodes was used. Each node had an Intel Core i3 530 CPU with a 3GHz clock speed and 2GB of memory, an NVIDIA GT215 GPU and a Realtek RTL8111/8168B PCI Express Gigabit Ethernet Controller. The nodes were interconnected using a 3Com OfficeConnect Gigabit Switch. One node was designated as the master of the cluster and was responsible for the transmission of data from and to the worker nodes, while the workers handled the actual data processing. To improve the performance of the NFS filesystem, the NFS exports were mounted by the clients using the options rsize=32768, wsize=32768, intr, noatime.

Preprocessing time is the time in seconds an algorithm uses to preprocess the pattern or pattern set, while searching time is the total time in seconds an algorithm uses to locate all occurrences of any pattern in the input string. Running time is the total execution time of an algorithm. All sequential times were measured using the MPI_Wtime function of MPI since it has a better resolution than the standard clock() function of time.h. The searching time of the CUDA implementations was measured using the CUDA event API. To decrease random variation, all time results were averages of 100 runs.
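As an illustration of this timing scheme, the following C sketch averages a measured phase over 100 runs with MPI_Wtime; the search() stub and the driver around it are hypothetical placeholders and not the actual benchmark code of the thesis.

#include <stdio.h>
#include <mpi.h>

/* Hypothetical placeholder for the phase being measured (searching here). */
static void search(void) { }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double total = 0.0;
    for (int run = 0; run < 100; run++) {
        double start = MPI_Wtime();    /* wall-clock time, in seconds */
        search();
        total += MPI_Wtime() - start;
    }
    printf("mean searching time: %f seconds\n", total / 100.0);

    MPI_Finalize();
    return 0;
}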

1.2.1 Multiple Pattern Matching Algorithm Implementations

The parameters that describe the performance of multiple pattern matching algorithms are the size of the input string n, the size of the pattern set d, the size of the patterns m and the size |Σ| of the alphabet used.

The data set used for the experiments of chapters 3 and 6 was similar to the sets used in [44, 59, 79]. It consisted of randomly generated input strings of a binary alphabet, the genome of Escherichia coli from the Large Canterbury Corpus, the Swiss Prot. Amino Acid sequence database, the FASTA Amino Acid (FAA) and FASTA Nucleic Acid (FNA) sequences of the A-thaliana genome and natural language input strings of the English alphabet:

• Randomly generated input strings of size n = 4,000,000 with a binary alphabet. The alphabet used was Σ = {0, 1}.

• The genome of Escherichia coli from the Large Canterbury Corpus with a size of n = 4,638,690 characters and the FASTA Nucleic Acid (FNA) of the A-thaliana genome with


a size of n = 116,237,486 characters. The alphabet Σ = {a, c, g, t} of both genomes consisted of the four nucleobases of Deoxyribonucleic Acid (DNA).

• The FASTA Amino Acid (FAA) of the A-thaliana genome with a size of n = 10,830,882 characters and the Swiss Prot. Amino Acid sequence database with a size of n = 177,660,096 characters. The alphabet Σ = {a, c, d, e, f, g, h, i, k, l, m, n, p, q, r, s, t, v, w, y} used by the databases consisted of 20 different characters.

• The CIA World Fact Book from the Large Canterbury Corpus. The input string had a size of n = 1,914,500 characters and an alphabet of size |Σ| = 128 characters.

For the experiments on the multiple pattern matching algorithm implementations of chapter 8, a snapshot of the Expressed Sequence Tags (ESTs) from the February 2009 assembly of the human genome was used, as produced by the Genome Reference Consortium and retrieved from the archives of the University of California Santa Cruz [71]. The snapshot, which was first converted to a one-dimensional input string and had any comments removed, consisted of 2,000,000,000 characters and had an alphabet of Σ = {a, c, g, t}.

The pattern set for the experiments of chapter 3 was created from subsequences of the corresponding input string, consisting of 100 to 100,000 patterns with each pattern having a size of m = 8 and m = 32 characters. The subsequences were chosen so that there would be at least min(d, ⌊n/m⌋) matches.

The pattern sets for the experiments of chapters 6 and 8 were similar; the pattern set of chapter 6 consisted of 1,000 and 8,000 patterns, while for chapter 8 the pattern set consisted of 8,000 patterns. For both sets, each pattern had a size of m = 8 characters.

1.2.2 Two-dimensional Pattern Matching Algorithm Implementations

The parameters that describe the performance of two-dimensional pattern matching algorithms are the size n² of the input string, the size m² of the pattern and the size |Σ| of the alphabet used.

The data set used for the experiments of chapters 4 and 7 was a superset of the sets used in [11, 85, 100]. It consisted of randomly generated input strings with sizes of n = 1,000 and n = 10,000 and three alphabets of size 2, 256 and 1024 to simulate bitmaps with different color depths. The pattern had a size of m = 4, 8, 16, 32, 64, 128 and 256 characters.

For the experiments on the two-dimensional pattern matching algorithm implementations of chapter 8, a two-dimensional uncompressed BMP image of the Orion Nebula was used, as captured by the Hubble Space Telescope, with a size of 18,000 × 18,000 pixels. The image was taken from the ESA/Hubble webpage [32] and was converted to grayscale using the 8-bit per pixel (8bpp) format to support 256 distinct colors. In that case, the two-dimensional pattern was taken directly from the reference two-dimensional image with a size of 8 × 8 pixels for a single match.

1.3 Overview

This thesis is structured as follows:

Chapter 2: Background

In chapter 2 we give the notions of alphabet, string, prefix, suffix and substring, we define the concepts of bit vectors and trees and we present the string matching problem. Then, we elaborate on some of the most well known algorithms that solve the string matching problem as they form the basis of the algorithms that are presented in chapters 3 and 4.


Knuth-Morris-Pratt [48] is a prefix searching string matching algorithm that computes the longest suffix of the input string that is also a prefix of the pattern. Boyer-Moore [16] is a suffix-based algorithm that computes the longest suffix of the input string that is also a suffix of the pattern. Then, in the case of a mismatch or a complete match, it uses two shift functions, computed from the pattern during the preprocessing phase, to shift the search window to the right by one or more positions. Horspool [40] is a simplification of the Boyer-Moore algorithm that only uses one of the two shift functions. Backward Oracle Matching [3] uses a factor oracle that recognizes at least every factor of the pattern. That way, when a mismatch occurs at a position of the input string, the search window can be safely skipped past that position. Finally, Shift-Or [13] is a bit-parallel algorithm. For each position of the input string, it maintains, using a bit vector, the set of prefixes of the pattern that match a suffix of the scanned input string.
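To make the bit-parallel idea concrete, the following C sketch shows a minimal Shift-Or search, assuming the pattern length does not exceed the machine word size; bit j of the state vector D is 0 exactly when p_0 ... p_j matches a suffix of the text read so far. This is only an illustrative sketch, not the implementation evaluated in later chapters.

#include <stdio.h>

/* Minimal Shift-Or sketch: assumes m <= number of bits in an unsigned long. */
void shift_or(const char *t, int n, const char *p, int m)
{
    unsigned long B[256], D = ~0UL;

    for (int c = 0; c < 256; c++)               /* character masks */
        B[c] = ~0UL;
    for (int j = 0; j < m; j++)                 /* clear bit j for p[j] */
        B[(unsigned char)p[j]] &= ~(1UL << j);

    for (int i = 0; i < n; i++) {
        D = (D << 1) | B[(unsigned char)t[i]];  /* extend matching prefixes */
        if ((D & (1UL << (m - 1))) == 0)        /* bit m-1 clear: full match */
            printf("match ending at position %d\n", i);
    }
}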

Our evaluation of the performance of the Knuth-Morris-Pratt and Horspool algorithms when implemented in parallel on a Graphics Processing Unit was published in [51].

Chapter 3: Multiple Pattern Matching

In chapter 3 we present a survey of algorithms for multiple pattern matching. Additionally, we elaborate on different well known multiple pattern matching algorithms and we give pseudocode of their preprocessing and search phases, together with examples of the way the algorithms work. Then we evaluate the performance of the algorithms in terms of preprocessing and searching time for different types of data, including randomly generated sets, biological sequence databases and natural language input strings. Finally, we give experimental maps to assist the reader in determining the most practical algorithm for a particular type of data.

Aho-Corasick [2] is an extension of Knuth-Morris-Pratt for multiple pattern matching that uses a deterministic finite state pattern matching machine, constructed from the set of the patterns. The Set Horspool [67] algorithm uses a trie created from the set of the patterns in reverse. When a mismatch or a complete match occurs, a number of input string positions can be skipped based on the shift function of the Horspool algorithm generalized for a set of patterns. Set Backward Oracle Matching [67] extends Backward Oracle Matching to multiple pattern matching. It uses a factor oracle, a trie with additional external transitions, such that at least every factor of any pattern can be recognized. Wu-Manber [96] scans the characters of the input string backwards for the occurrences of the patterns, shifting the search window to the right using the shift function of the Horspool algorithm generalized for multiple pattern matching when a mismatch or a complete match occurs. Finally, the Multipattern Shift-Or with Q-Grams (SOG) [78] algorithm is an extension of Shift-Or for multiple pattern matching that combines bit-parallelism, hashing and a technique called q-grams to essentially increase the alphabet size.

The analysis and experimental results we conducted on different algorithms for multiple pattern matching were published in [52, 53, 54].

Chapter 4: Two-Dimensional Pattern Matching using Multiple Pattern Matching Algorithms

In chapter 4 we present a survey of algorithms that solve the two-dimensional pattern matching problem. Then, we introduce efficient variants of the Baker and Bird [14, 15] and the Baeza-Yates and Regnier [11] algorithms. The variants use the Set Horspool, Wu-Manber, Set Backward Oracle Matching and SOG multiple pattern matching algorithms in place of the Aho-Corasick algorithm to perform two-dimensional pattern matching. Different experimental maps are given at the end of the chapter to assist the reader in identifying the most efficient implementation of Baker and Bird


and Baeza-Yates and Regnier for different pattern, alphabet and input string sizes.

Baker and Bird combines the Aho-Corasick and Knuth-Morris-Pratt algorithms. The idea behind the algorithm is to construct the trie of Aho-Corasick from the pattern rows and then assign an index to each distinct row. This way, the two-dimensional pattern can be reduced to a one-dimensional array of indices, which is subsequently preprocessed using the Knuth-Morris-Pratt algorithm. Baeza-Yates and Regnier is also based on the Aho-Corasick algorithm to assign an index to each distinct row of the pattern. Some of the rows of the input string are then scanned horizontally for the occurrences of the pattern indices, selected in such a way that they cover all possible positions where a pattern row may occur.

We first introduced variants of the Baker and Bird and the Baeza-Yates and Regnier algorithms in [49] that substitute the Aho-Corasick algorithm with a simpler multiple pattern matching algorithm. The variants showed a performance improvement compared to the original algorithms for randomly generated two-dimensional input strings and patterns. The research of chapter 4 was published in [55].

Chapter 5: Modern Parallel Architectures

Chapter 5 gives an insight into modern parallel architectures, namely the distributed memory and GPU architectures. These architectures are utilized in chapters 6 to 8 to speed up the execution times of multiple and two-dimensional pattern matching algorithms. In this chapter we give a classification of computer architectures using Flynn's taxonomy and briefly discuss different parallel concepts, including data parallelism, message-passing and the master-worker model. We then describe the distributed memory architecture, give details on the modern GPU architecture, focusing specifically on the NVIDIA GTX 280 and GT 240 GPUs and the way they can be used for General Purpose computing on Graphics Processing Units, and present the MPI and CUDA APIs with an emphasis on the functions that were used for the parallel implementation of the algorithms of this thesis.

Chapter 6: Multiple Pattern Matching on a GPU using CUDA

In chapter 6 we introduce parallel implementations of the multiple pattern matching algorithms of chapter 3 on a GPU using the CUDA API and we discuss the way their searching time is affected when different optimization techniques are used. The performance of the parallel implementations of the algorithms is evaluated on a GTX 280 GPU for different types of data.

For the algorithm implementations, the following optimization techniques were used. The preprocessing arrays, and the pattern set array in the case of the Set Backward Oracle Matching, Wu-Manber and SOG algorithms, were bound to the texture memory of the GPU, the characters of the input string were copied to shared memory by all threads of the device and, finally, each thread of the device read 16 input string characters simultaneously from the global memory with each memory access by using uint4 vectors.

Our evaluation of the speedup of the parallel implementation of the Wu-Manber multiple pattern matching algorithm on a GTX 280 GPU was published in [76], using the OpenCL API for GPGPU.

Chapter 7: Two-dimensional Pattern Matching on a GPU using CUDA

Chapter 7 explores a similar concept to chapter 6. Given the architecture of the GTX 280 GPU, as presented in chapter 5, we introduce parallel implementations of the Baker and Bird and the Baeza-Yates and Regnier two-dimensional pattern matching algorithms on a GPU using the


CUDA API. The implementation and optimization techniques of the previous chapter are adapted for a two-dimensional data set and the performance of the algorithm implementations is evaluated for different alphabet, pattern and input string sizes.

The optimization techniques that were used for the implementation of each of the two-dimensional pattern matching algorithms were as follows. The preprocessing arrays were bound to the texture memory of the device. For the Baker and Bird implementation, the pattern array was collectively copied by all the threads of a thread block to the shared memory of the device, while for the Baeza-Yates and Regnier implementation, the threads stored input string characters in the shared memory of the device to work around the coalescence requirements of the global memory.

Chapter 8: Implementation of Multiple and Two-Dimensional Pattern Matching Algorithms on a Hybrid Parallel Architecture

In chapter 8 we present a 3-tier modular hybrid parallel architecture to take advantage of the flexibility of a CPU, the efficiency of a GPU and the scalability, transparency, reconfigurability, availability and reliability that a distributed memory architecture can offer. The presented architecture combines distributed memory message-passing using the MPI API with GPGPU using the CUDA API to implement in parallel the multiple and two-dimensional algorithms of chapters 3 and 4. The performance of the algorithm implementations using the presented hybrid parallel architecture is then evaluated on a homogeneous cluster of 11 computer nodes.

The work of this chapter is based on our previous research, as published in [50, 56, 57].

Chapter 9: Conclusions

In chapter 9 we present the conclusions of this thesis and we give ideas for future work.


Chapter 2

Background

2.1 Basic Concepts

2.1.1 Alphabets and Strings

In this section, the notions of alphabet, string, prefix, suffix and substring are defined.

Definition. An alphabet Σ is a finite set of characters. The size of the alphabet is denoted by |Σ|.

A string s is a series of characters from an alphabet Σ. The size of the string is defined as the number of characters that it consists of and is denoted as |s|. The empty string is a special-case string with a size |s| = 0 and is denoted as ε.

Definition. Given a string T = t_0 ... t_{n-1} of size n, string T_pref = t_0 ... t_j is a prefix of T such that j ≤ n - 1.

Definition. Given a string T = t_0 ... t_{n-1} of size n, string T_suf = t_i ... t_{n-1} is a suffix of T such that i ≥ 0.

Definition. Given a string T = t_0 ... t_{n-1} of size n, string T_sub = t_i ... t_j is a substring of T such that i ≥ 0 and j ≤ n - 1.
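As a small illustration of the three definitions, the following C helpers (a sketch assuming NUL-terminated strings, not part of the thesis code) test whether a string s is a prefix, a suffix or a substring of T.

#include <string.h>

int is_prefix(const char *t, const char *s)      /* s = t_0 ... t_j */
{
    return strncmp(t, s, strlen(s)) == 0;
}

int is_suffix(const char *t, const char *s)      /* s = t_i ... t_{n-1} */
{
    size_t n = strlen(t), k = strlen(s);
    return k <= n && strcmp(t + n - k, s) == 0;
}

int is_substring(const char *t, const char *s)   /* s = t_i ... t_j */
{
    return strstr(t, s) != NULL;
}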

2.1.2 Bit Vectors and Bitwise Operations

The notion of bit vector is defined in [77] as follows:

Definition. A bit vector is a sequence of bits. A bit vector of width w is defined as E = e_{w-1} ... e_0, where e_0 is the least significant bit (LSB) and e_{w-1} the most significant bit (MSB).

In this thesis, the following bitwise operators are used, mainly for the Shift-Or, the Wu-Manber and the SOG algorithms. Using the '<<' operator, the bits of a bit vector are shifted by one or more positions to the left; the bits that are shifted out from the left are discarded and one or more zeros are inserted from the right. With the '>>' operator, the bits of a bit vector are shifted by one or more positions to the right; the bits that are shifted out from the right are discarded and


one or more zeros are inserted from the left. The bitwise OR operator is denoted as '|', the bitwise AND operator is denoted as '&' while the bitwise XOR operator is denoted as '⊻'. The '∼' operator complements all bits.
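The following small C example (an illustration only; note that in C the XOR operator is written '^' and the complement operator '~') applies these operators to an 8-bit wide bit vector E = e_7 ... e_0 with the initial value 0x29 (0010 1001).

#include <stdio.h>

int main(void)
{
    unsigned char e = 0x29;                       /* 0010 1001 */

    printf("%02x\n", (unsigned char)(e << 1));    /* shift left:  0101 0010 = 52 */
    printf("%02x\n", (unsigned char)(e >> 1));    /* shift right: 0001 0100 = 14 */
    printf("%02x\n", (unsigned char)(e | 0x0f));  /* OR:          0010 1111 = 2f */
    printf("%02x\n", (unsigned char)(e & 0x0f));  /* AND:         0000 1001 = 09 */
    printf("%02x\n", (unsigned char)(e ^ 0x0f));  /* XOR:         0010 0110 = 26 */
    printf("%02x\n", (unsigned char)(~e));        /* complement:  1101 0110 = d6 */
    return 0;
}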

2.1.3 Trees

A tree is a data structure that consists of a set of nodes. A rooted tree is a tree with a single node designated as the root. The nodes are connected with links. The source node of a node q is called the parent of q, while q is called the child of its parent node. All nodes have a parent node, except the root node of the tree. A node without a child node is called a leaf of the tree. Each link can have a label which, for this thesis, will be some character σ ∈ Σ. According to [67], a trie is a labeled rooted tree that is associated with a set of strings. Figure 2.1 is an example of a labeled rooted tree.

Figure 2.1: An example of a labeled rooted tree
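A labeled rooted tree of this kind can be represented in C as sketched below; this is only an illustration, assuming the alphabet is mapped to the indices 0 to SIGMA - 1, and it is not the trie construction used by the algorithms of the following chapters.

#include <stdlib.h>

#define SIGMA 256                    /* assumed alphabet size */

struct trie_node {
    struct trie_node *child[SIGMA];  /* link labeled by each character, or NULL */
    int terminal;                    /* 1 if a string of the set ends here */
};

static struct trie_node *new_node(void)
{
    return calloc(1, sizeof(struct trie_node));
}

/* Insert string s into the trie rooted at root, creating nodes as needed. */
static void trie_insert(struct trie_node *root, const char *s)
{
    struct trie_node *q = root;
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        if (q->child[c] == NULL)
            q->child[c] = new_node();
        q = q->child[c];
    }
    q->terminal = 1;
}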

2.2 String Matching Algorithms

String matching is an important problem in text processing and is commonly used to locate all the occurrences of a one-dimensional keyword. A survey and experimental results on some well known string matching algorithms are presented in [62]. This section elaborates on some of the most well known algorithms that solve the string matching problem, the Knuth-Morris-Pratt [48], Boyer-Moore [16], Backward Oracle Matching [3] and Shift-Or [13] algorithms, as they form the basis of the algorithms presented in chapters 3 and 4.

The string matching problem can be defined as:

Definition. Given an input string T = t_0 ... t_{n-1} of size n and a pattern P = p_0 ... p_{m-1} of size m over an alphabet Σ, the task is to find all occurrences of the pattern in the input string. More formally, find all i where 0 ≤ i < n - m + 1 such that for all j where 0 ≤ j < m it holds that t_{i+j} = p_j.
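A direct, brute-force transcription of this definition in C is shown below; it is quadratic in the worst case and given only to make the definition concrete, while the algorithms that follow improve on it.

#include <stdio.h>

/* Report every i with 0 <= i < n - m + 1 such that t[i+j] == p[j] for all j. */
void naive_match(const char *t, int n, const char *p, int m)
{
    for (int i = 0; i + m <= n; i++) {
        int j = 0;
        while (j < m && t[i + j] == p[j])
            j++;
        if (j == m)
            printf("occurrence at position %d\n", i);
    }
}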

2.2.1 Knuth-Morris-Pratt

Knuth-Morris-Pratt is a linear time prefix searching string matching algorithm. During the search phase and for each position i of the input string, the algorithm computes the longest suffix of t_0 ... t_i


that is also a prefix of the pattern until a mismatch or a complete match is found. In the case of a mismatch, a number of input string positions can potentially be skipped based on information obtained during the preprocessing phase.

Figure 2.2: Shifting the pattern when a mismatch occurs using the Knuth-Morris-Pratt algorithm

Assume that the pattern is aligned with the input string and that characters t_0 ... t_{i-1} have already been scanned. As depicted in Figure 2.2 A, the longest suffix of the input string at position i - 1 that is also a prefix of the pattern is denoted as u, while the longest prefix of the pattern that is both a prefix and a suffix of u is denoted as v and is called the border of u. The character σ located at position i of the input string is then compared with the pattern character α. If σ = α then uσ will be the longest suffix of the input string at position i that is also a prefix of the pattern. A complete match of the pattern is found when |uσ| = m. If σ ≠ α, the longest border v of u will be aligned with the input string at position i - 1 and σ will be compared with β, as shown in Figure 2.2 B. The above is repeated until either σ matches a pattern character or no more borders of u exist; in that case the pattern is shifted past the input string position where the mismatch occurred.

ALGORITHM 2.1: The preprocessing phase of the Knuth-Morris-Pratt algorithm

i := 0; j := -1; next[0] := -1
while i < m do
    while j ≥ 0 and p_i ≠ p_j do
        j := next[j]
    end
    i := i + 1; j := j + 1
    if i < m and p_i = p_j then
        next[i] := next[j]
    else
        next[i] := j
    end
end

The preprocessing phase is performed in practice using the same technique as the one described above, by shifting two copies of the pattern against each other. During preprocessing, the longest border v of each pattern prefix is computed and is stored in a next table. Knuth proposed the following improvement: when a mismatch occurs during the preprocessing or the searching phase, it holds that σ ≠ α and therefore a match between σ and β can only exist when β ≠ α. Algorithm listings 2.1 and 2.2 give details on the preprocessing and the search phase of the


Knuth-Morris-Pratt algorithm respectively, including the improvement proposed by Knuth.

ALGORITHM 2.2: The search phase of the Knuth-Morris-Pratt algorithm

i := 0; j := 0
while i < n do
    while j ≥ 0 and p_j ≠ t_i do
        j := next[j]
    end
    i := i + 1; j := j + 1
    if j ≥ m then
        report match at i - j
        j := next[j]
    end
end
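For reference, a compact C transcription of Algorithm listings 2.1 and 2.2 could look as follows; this is a sketch rather than the thesis implementation, and next is assumed to have m + 1 entries. Run on the example used later in this section, it reports a single match at position 4.

#include <stdio.h>
#include <string.h>

void kmp_search(const char *t, int n, const char *p, int m, int *next)
{
    int i, j;

    /* Preprocessing (Algorithm 2.1), including the improvement proposed by Knuth. */
    i = 0; j = -1; next[0] = -1;
    while (i < m) {
        while (j >= 0 && p[i] != p[j])
            j = next[j];
        i++; j++;
        next[i] = (i < m && p[i] == p[j]) ? next[j] : j;
    }

    /* Searching (Algorithm 2.2). */
    i = 0; j = 0;
    while (i < n) {
        while (j >= 0 && p[j] != t[i])
            j = next[j];
        i++; j++;
        if (j >= m) {
            printf("match at position %d\n", i - j);
            j = next[j];
        }
    }
}

int main(void)
{
    const char *t = "TAATAACGTAAC", *p = "AACGTAAC";
    int next[9];                                  /* m + 1 entries for m = 8 */
    kmp_search(t, (int)strlen(t), p, (int)strlen(p), next);
    return 0;
}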

Knuth-Morris-Pratt can also be represented using a deterministic finite state automaton. Each pattern character labels a state q. The state labeled after the m-th character of the pattern is called the terminal state, while L(q) is the label of the path between the initial state and a given state q. The states are connected with each other using two different links, a success link and a supply link. The success link is used to visit the next node of the automaton after a successful match of a pattern character, until the terminal state is reached; in that case a complete match of the pattern occurs in the input string. After a mismatch, the automaton returns to a previous state based on the supply link as determined by a supply function Supply. Supply is the equivalent of the next table as presented above. It associates each state q with another state, called the supply state of q. The link between q and Supply(q) is then called the supply link. A Knuth-Morris-Pratt automaton based on the example pattern "AACGTAAC" is depicted in Figure 2.3. The circles represent the states of the automaton, each dashed line is the supply link between a state and its supply state, while the double circle indicates the terminal state of the automaton.

Figure 2.3: The automaton of the Knuth-Morris-Pratt algorithm for the pattern “AACGTAAC”

The Knuth-Morris-Pratt algorithm uses O(m) space to store the next table. The time to determine the longest border of each pattern prefix during the preprocessing phase is O(m) while the time to compute the longest suffix of the input string that is also a prefix of the pattern is O(n).
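As an illustration, the preprocessing and search phases of Algorithm listings 2.1 and 2.2 can be condensed into the following C sketch; the function name, the variable-length array for the next table and the direct printing of match positions are assumptions of this sketch rather than details of the thesis implementation.

#include <stdio.h>
#include <string.h>

/* Knuth-Morris-Pratt with Knuth's improvement, following Algorithm
 * listings 2.1 and 2.2.  Matches are reported 0-indexed. */
void kmp_search(const char *p, int m, const char *t, int n) {
    int next[m + 1];
    int i = 0, j = -1;

    /* Preprocessing: compute the next table of borders. */
    next[0] = -1;
    while (i < m) {
        while (j >= 0 && p[i] != p[j])
            j = next[j];
        i++; j++;
        if (i < m && p[i] == p[j])
            next[i] = next[j];   /* Knuth's improvement */
        else
            next[i] = j;
    }

    /* Search: scan the input string from left to right. */
    i = 0; j = 0;
    while (i < n) {
        while (j >= 0 && p[j] != t[i])
            j = next[j];
        i++; j++;
        if (j >= m) {
            printf("match at %d\n", i - j);
            j = next[j];
        }
    }
}

int main(void) {
    const char *t = "TAATAACGTAAC", *p = "AACGTAAC";
    kmp_search(p, (int)strlen(p), t, (int)strlen(t));
    return 0;
}

Running the sketch on the pattern and input string of the example that follows reports a single match starting at position 4.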

Consider the following example for the pattern “AACGTAAC” and the input string “TAATAACGTAAC”. The arrows represent the comparisons between the pattern and the input string. After preprocessing, the next table will be:

i       =  0  1  2  3  4  5  6  7  8
P       =  A  A  C  G  T  A  A  C
next[i] = -1 -1  1  0  0 -1 -1  1  3

The pattern is aligned with the input string, starting from the leftmost character:


j = 0 1 2 3 4 5 6 7 8 9 10 11
    T A A T A A C G T A A C
    A A C G T A A C
i = 0 1 2 3 4 5 6 7
    ↑

A mismatch occurs between the first character of the pattern and the first character of the input string and thus the pattern is shifted to the right by one position:

j = 0 1 2 3 4 5 6 7 8 9 10 11
    T A A T A A C G T A A C
      A A C G T A A C
i =   0 1 2 3 4 5 6 7
      ↑ ↑ ↑

The first two characters of the pattern now match with the input string but a mismatch occurs at position j = 3. It can be seen that the substring “AA” is the longest suffix u of the input string that is also a prefix of the pattern at that position. During the preprocessing phase it was determined that pattern character p0 is the longest border v of u and this information was stored in the next table. Since a mismatch occurred between pattern character |u| + 1 and t3, the pattern can be shifted to the right so that pattern character |v| + 1 will be aligned with the same input string character. As next[2] = 1, p1 is then compared with the input string character at position j = 3. Note that p0 is not compared with the input string at position j = 2; the algorithm reuses the information that these characters already match:

j = 0 1 2 3 4 5 6 7 8 9 10 11
    T A A T A A C G T A A C
        A A C G T A A C
i =     0 1 2 3 4 5 6 7

A new mismatch exists between characters ‘T’ and ‘A’. Because no border of u exists (next[1] = −1), the pattern is shifted past the position where the mismatch occurred. This time the pattern is completely matched in the input string:

j = 0 1 2 3 4 5 6 7 8 9 10 11
    T A A T A A C G T A A C
            A A C G T A A C
i =         0 1 2 3 4 5 6 7
            ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑

2.2.2 Boyer-Moore

Boyer-Moore is considered one of the most efficient algorithms for string matching. It scans the characters of the input string backwards for the occurrences of the pattern in quadratic time in the worst case but sub-linear time on average. For each position i of the input string where m − 1 ≤ i ≤ n − 1, the algorithm computes the longest suffix of t0 . . . ti that is also a suffix of the pattern. In the case of a mismatch or a complete match, the algorithm uses two precomputed functions to shift the pattern to the right, the good suffix shift and the bad character shift.

Both shift functions are created during the preprocessing phase from the pattern. For each suffix u of the pattern, the good suffix shift is the distance to the next factor of the pattern that matches u (Figure 2.4 A). If no such factor exists, the good suffix shift is the length of the longest prefix of the pattern that is also a suffix of u (Figure 2.4 B). The good suffix shift is stored in a table gs of size m. The bad character shift is calculated for each different character σ ∈ Σ as the distance of the rightmost occurrence of σ to the end of the pattern. If no such character exists, its bad character shift is set to m (Figure 2.4 C). For the bad character shift, a table bc of size |Σ| is used.

ALGORITHM 2.3: The computation of the bad character shift of the Boyer-Moore algo-rithm

for i := 0; i < |Σ|; i := i + 1 do
    bc[i] := m
end
for i := 0; i < m − 1; i := i + 1 do
    bc[pi] := m − i − 1
end

Assume that the pattern is aligned with the input string, a suffix u of the input string matches with a suffix of the pattern and a mismatch occurs at position i between characters σ of the input string and α of the pattern. The pattern will then be safely shifted to the right, based on the maximum value between the good suffix shift and the bad character shift functions. The computation of the bad character shift is given in Algorithm listing 2.3 and of the good suffix shift in Algorithm listings 2.4 and 2.5. The search phase of the Boyer-Moore algorithm is presented in Algorithm listing 2.6. To calculate the values of gs, a table suf of size m is used; suf[i] is defined as the maximum k for each position 0 ≤ i ≤ m − 1 of the pattern such that pi−k+1 . . . pi = pm−k . . . pm−1.

ALGORITHM 2.4: Determining the suffixes for the good suffix shift of the Boyer-Moorealgorithm

suf[m − 1] := m; g := m − 1; f := 0
for i := m − 2; i ≥ 0; i := i − 1 do
    if i > g AND suf[i + m − 1 − f] < i − g then
        suf[i] := suf[i + m − 1 − f]
    else
        if i < g then
            g := i
        end
        f := i
        while g ≥ 0 AND pg = pg+m−1−f do
            g := g − 1
        end
        suf[i] := f − g
    end
end

The Boyer-Moore algorithm uses O(m) space to store the gs table, O(m) space for the suf table and O(|Σ|) for the bc table, for a total space of O(m + |Σ|). The time to preprocess the pattern is O(m + |Σ|). The search phase of Boyer-Moore is O(nm) in the worst case, although variants exist that improve this time to linear (one such variant is presented for example in [35]). On average the algorithm is sub-linear. In the best case, only one in every m characters has to be checked, for a total of n/m comparisons.

ALGORITHM 2.5: The computation of the good suffix shift of the Boyer-Moore algorithm

j := 0
Determine suffixes using the code from Algorithm listing 2.4
for i := 0; i < m; i := i + 1 do
    gs[i] := m
end
for i := m − 1; i ≥ 0; i := i − 1 do
    if suf[i] = i + 1 then
        for ; j < m − i − 1; j := j + 1 do
            if gs[j] = m then
                gs[j] := m − i − 1
            end
        end
    end
end
for i := 0; i ≤ m − 2; i := i + 1 do
    gs[m − suf[i] − 1] := m − i − 1
end

Horspool is a simplification of the original Boyer-Moore algorithm introduced by R. N. Horspool. It is based on the observation that the bad character shift leads to the longest safe shift in most cases, while the good suffix shift function is only relevant in cases where there are repeating substrings in the pattern. Using Boyer-Moore with only the bad character shift produces a very efficient algorithm in practice, especially when used on data with a relatively large alphabet size. In [40] it is stated that Horspool has a similar performance to the Boyer-Moore algorithm with the advantage of a simpler preprocessing phase. The experiments in [67] also confirm that Horspool is faster than the Boyer-Moore algorithm when randomly generated patterns and input strings were used. The preprocessing phase of the Horspool algorithm is O(m + |Σ|) time and O(|Σ|) space. The search phase complexity is the same as Boyer-Moore with O(nm) time in the worst case, sub-linear on average and O(n/m) in the best case.
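Because only the bad character shift is kept, Horspool fits in a few lines of C; the sketch below is an illustration of this simplification, with the function name, the fixed 256-entry alphabet and the use of memcmp() for the final comparison being assumptions of the sketch.

#include <stdio.h>
#include <string.h>

#define SIGMA 256   /* assume an alphabet of unsigned char values */

/* Horspool: Boyer-Moore reduced to the bad character shift only. */
void horspool_search(const char *p, int m, const char *t, int n) {
    int bc[SIGMA];

    /* Bad character shift: distance of the rightmost occurrence of each
     * character (excluding the last pattern position) to the end of p. */
    for (int c = 0; c < SIGMA; c++)
        bc[c] = m;
    for (int i = 0; i < m - 1; i++)
        bc[(unsigned char)p[i]] = m - i - 1;

    int i = 0;
    while (i <= n - m) {
        if (memcmp(p, t + i, (size_t)m) == 0)
            printf("match at %d\n", i);
        /* Shift according to the input string character aligned with p[m-1]. */
        i += bc[(unsigned char)t[i + m - 1]];
    }
}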

2.2.3 Backward Oracle Matching

The Backward Oracle Matching algorithm uses a factor oracle, a deterministic acyclic automaton created from the pattern in reverse. The factor oracle is based on the notion of weak factor recognition; it recognizes at least every factor of the pattern, so it cannot guarantee that a recognized string is a factor of the pattern, but it can determine when a string is not. The automaton has m + 1 states, a goto function g, a supply function and at most m − 1 additional external transitions. Each pattern character labels a state q.

The goto function and the external transitions are constructed during the preprocessing phase. The automaton is extended for each character of the pattern in reverse. For each state q then, the external transitions are created from the states between the initial and q. For this reason, a supply function Supply is used to specify a supply state for each state q. The supply state of the initial state is set to ∅.


Figure 2.4: Shifting the pattern when a mismatch occurs using the Boyer-Moore algorithm

ALGORITHM 2.6: The search phase of the Boyer-Moore algorithm

i := 0
while i ≤ n − m do
    j := m − 1
    while j ≥ 0 AND pj = ti+j do
        j := j − 1
    end
    if j < 0 then
        report match at i
        i := i + gs[0]
    else
        i := i + MAX(gs[j], bc[ti+j] − m + j + 1)
    end
end

Figure 2.5: The factor oracle of the Backward Oracle Matching algorithm for the pattern “AACGTAAC”


Assume that the goto and supply functions for all states up to q − 1 were already computed and that g(q, σ) = q + 1. Then, the supply state of q is visited to determine if there is an outgoing transition by σ. If there is no such transition, then g(Supply(q), σ) is set to q + 1 as an external transition. In that case, Supply(Supply(q)) is visited next and so on, until a supply state with an outgoing transition by σ is found or until there are no more states to visit. If, on the other hand, there is a transition g(Supply(q), σ) = l, then Supply(q + 1) is set to l. An example of a Backward Oracle Matching oracle for the reversed pattern “AACGTAAC” is depicted in Figure 2.5.

ALGORITHM 2.7: The construction of the factor oracle of the Backward Oracle Matching algorithm

create state q0
set Supply(q0) := fail
forall the α ∈ Σ do
    g(q0, α) := fail
end
newState := 0
j := m − 1; state := q0
while j ≥ 0 do
    create state qcurrent
    forall the α ∈ Σ do
        g(qcurrent, α) := fail
    end
    newState := qcurrent
    g(state, pj) := newState
    k := Supply(state)
    while k ≠ fail AND g(k, pj) = fail do
        g(k, pj) := newState
        k := Supply(k)
    end
    if k ≠ fail then
        Supply(newState) := g(k, pj)
    else
        Supply(newState) := q0
    end
    state := newState
    j := j − 1
end

During the search phase, the algorithm reads backwards the longest suffix u of the input string that is also a suffix of the pattern. Assume that a mismatch occurs at position i of the input string when reading character σ of the input string. The factor oracle cannot determine with certainty that a substring is a factor of the pattern but can recognize if a substring is not, therefore σu is not a factor of the pattern. The factor oracle can then be safely shifted past i. The oracle recognizes only one m-sized string, the pattern. Therefore, when m consecutive characters of the input string were scanned without a mismatch and the last state of the automaton has been reached, a match of the pattern has been found in the input string. Then, the factor oracle is shifted by one position to the right.


Backward Oracle Matching requires O(m|Σ|) space if implemented using a linked list of arrays to represent the transitions of the goto function, as described later in section 3.3.3. The searching time of the algorithm is O(nm) in the worst case and O(n log|Σ|(m)/m) on average. The preprocessing phase of the Backward Oracle Matching algorithm is presented in Algorithm listing 2.7 while the search phase is given in Algorithm listing 2.8.

ALGORITHM 2.8: The search phase of the Backward Oracle Matching algorithm

i := m − 1
while i < n do
    state := q0; j := 0
    while j < m AND (newState := g(state, ti−j) ≠ fail) do
        state := newState; j := j + 1
    end
    if j = m then
        report match at i − m + 1
        i := i + 1
    else
        i := i + m − j
    end
end

Consider again the earlier example for the pattern “AACGTAAC” and the input string “TAATAACGTAAC”. The arrows represent the comparisons between the pattern and the input string. The pattern is aligned with the input string, with the search starting from position m − 1 of the input string backwards:

j = 0 1 2 3 4 5 6 7 8 9 10 11
    T A A T A A C G T A A C
    A A C G T A A C
i = 0 1 2 3 4 5 6 7
                  ↑

Although the characters at t7 and p7 are different, the factor oracle, based on the notion of weak factor recognition, reports that a factor of the pattern was found. The search continues until position 4 of the input string, since there are transitions in the automaton via the goto function for input string characters t7 . . . t4:

j = 0 1 2 3 4 5 6 7 8 9 10 11
    T A A T A A C G T A A C
    A A C G T A A C
i = 0 1 2 3 4 5 6 7
            ↑ ↑ ↑ ↑

Since there is no outgoing transition from state 8 of the automaton and less than m characters were checked, a mismatch is reported at position 3 of the input string, and thus the pattern is shifted past that position:

j = 0 1 2 3 4 5 6 7 8 9 10 11
    T A A T A A C G T A A C
            A A C G T A A C
i =         0 1 2 3 4 5 6 7


The search resumes from position 11 of the input string. Since there are outgoing transitions between the states of the automaton for characters t11 . . . t4 of the input string and since the last state of the automaton was reached, the algorithm reports a match starting at position 4 of the input string. Recall that the automaton recognizes only one m-sized string, the pattern:

j = 0 1 2 3 4 5 6 7 8 9 10 11
    T A A T A A C G T A A C
            A A C G T A A C
i =         0 1 2 3 4 5 6 7
            ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑

2.2.4 Shift-Or

Bit-parallelism is a technique to effectively group several variables inside a single computer word. Then, by using the bitwise operations provided by a CPU, the value of a computer word and subsequently of all the variables can be updated with a single processor instruction. An important constraint to the use of bit-parallelism is the number of bits w in a computer word¹. Shift-Or is one such efficient bit-parallel algorithm that solves the string matching problem, essentially simulating a nondeterministic automaton that can recognize the pattern.

ALGORITHM 2.9: The preprocessing phase of the Shift-Or algorithm

for i := 0; i < |Σ|; i := i + 1 do
    V[i] := ∼0
end
for i := 0, j := 1; i < m; i := i + 1, j := j ≪ 1 do
    V[pi] := V[pi] & ∼j
end

During the preprocessing phase, a bit vector V with a size of |Σ| is computed from the pattern, storing one bit mask per character of the alphabet. The ith bit of the mask for a character σ is set to 0 if pi = σ or to 1 otherwise, simulating a transition from state i − 1 of the automaton to state i over σ.

ALGORITHM 2.10: The search phase of the Shift-Or algorithm

for E := ∼0, i := 0; i < n; i := i + 1 do
    E := (E ≪ 1) | V[ti]
    if bit em−1 of E is 0 then
        report match at i − m + 1
    end
end

The basic idea behind the algorithm is to use a bit mask E to determine, for a given position i of the input string, all the prefixes of the pattern that match a suffix of t0 . . . ti. Initially, the value of E is set to 1^m. Then for each position of the input string, E is updated with the following formula:

E = (E ≪ 1) | V[σ]

¹Modern processors have computer words with a size w of 32 or 64 bits.

Using again the earlier example for the pattern “AACGTAAC”, the four different bit masks of the bit vector V will be computed as follows:

V[A] = 10011100
V[C] = 01111011
V[G] = 11110111
V[T] = 11101111

Initially, the value of E is set to 11111111. The algorithm starts scanning the input string from character σ located at position t0. By shifting E to the left, an additional 0 is inserted to the right of the bit mask, simulating the first state of the nondeterministic automaton. Then, with the bitwise OR'ing of E with V[σ], the value of E will be set to 11111110 only if bit e0 of the respective bit mask is 0, indicating that there is an occurrence of that character. In other words, the algorithm will move to the next state only if there is an outgoing transition by σ. If after any update of E its most significant bit em−1 is set to 0, a complete match of the pattern in the input string was found, simulating that the terminal state of the automaton was reached.
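The whole procedure can be written compactly in C; the sketch below assumes that m ≤ 64 so the state fits in a single 64-bit word, and the function name and the direct printing of matches are illustrative rather than part of the thesis implementation.

#include <stdio.h>
#include <stdint.h>

#define SIGMA 256   /* assume an alphabet of unsigned char values */

/* Shift-Or, following Algorithm listings 2.9 and 2.10 (m <= 64). */
void shift_or_search(const char *p, int m, const char *t, int n) {
    uint64_t V[SIGMA], E;

    /* Preprocessing: bit i of V[c] is 0 when c occurs at pattern position i. */
    for (int c = 0; c < SIGMA; c++)
        V[c] = ~UINT64_C(0);
    for (int i = 0; i < m; i++)
        V[(unsigned char)p[i]] &= ~(UINT64_C(1) << i);

    /* Search: a 0 at bit m-1 of E signals that the terminal state was reached. */
    E = ~UINT64_C(0);
    for (int i = 0; i < n; i++) {
        E = (E << 1) | V[(unsigned char)t[i]];
        if ((E & (UINT64_C(1) << (m - 1))) == 0)
            printf("match at %d\n", i - m + 1);
    }
}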

Table 2.1: Known theoretical space, preprocessing and worst and average searching time complexity of the Knuth-Morris-Pratt, Horspool, Backward Oracle Matching and Shift-Or string matching algorithms

Algorithm                    Extra space   Preprocessing   Worst case   Average case
Knuth-Morris-Pratt           m             m               n            n
Horspool                     m + |Σ|       m + |Σ|         nm           sub-linear
Backward Oracle Matching     m|Σ|          m|Σ|            nm           n log|Σ|(m)/m
Shift-Or                     m + |Σ|       m + |Σ|         n            n

The theoretical space and preprocessing time complexity of Shift-Or is O(m + |Σ|) while the worst and average searching phase complexity of the algorithm is O(n) when the size of the pattern is less than or equal to w. The preprocessing and search phases of Shift-Or are given in Algorithm listings 2.9 and 2.10 respectively.

Table 2.1 summarizes the theoretical space, preprocessing and searching time complexity of the Knuth-Morris-Pratt, Horspool, Backward Oracle Matching and Shift-Or string matching algorithms.


Chapter 3

Multiple Pattern Matching

3.1 Introduction

Multiple pattern matching is a variant of string matching that involves the location of all the positions of an input string where one or more patterns from a finite pattern set occur. It is the computationally intensive kernel of many security and network applications including information retrieval, web filtering, intrusion detection systems, virus scanners and spam filters. Moreover, in recent years there is an interest in string matching problems as a powerful tool to locate nucleotide or Amino Acid sequence patterns in biological sequence databases. The multiple pattern matching problem can be defined as:

Definition. Given an input string T = t0t1 . . . tn−1 of size n and a finite set of d patterns P = p0, p1, . . . , pd−1, where each pattern pr is a string pr = pr0 pr1 . . . prm−1 of size m over a finite character set Σ and the total size of all patterns is denoted as |P|, the task is to find all occurrences of any of the patterns in the input string. More formally, for each pr find all positions i where 0 ≤ i < n − m + 1 such that for all j where 0 ≤ j < m it holds that ti+j = prj.

This chapter elaborates on the Aho-Corasick [2], Set Horspool [67], Set Backward Oracle Matching [67], Wu-Manber [96] and SOG [78] multiple pattern matching algorithms and presents experimental results in terms of preprocessing and searching time for different types of data. The algorithms were chosen since they are efficient, practical and are frequently encountered in other research papers. Aho-Corasick is a classic multiple pattern matching algorithm with a search phase that is linear in n in the worst and average case. Set Horspool has a sub-linear search phase in the average case. Commentz-Walter [21], the algorithm upon which Set Horspool is based, is substantially faster in practice than the Aho-Corasick algorithm, particularly when long patterns are involved [93, 96]. Set Backward Oracle Matching also has a sub-linear search phase in the average case. It appears to be very efficient when used on large pattern sets and has the same worst case complexity as Set Backward Dawg Matching but uses a much simpler automaton and is faster in all cases [67]. The Wu-Manber algorithm was chosen as it is considered a very fast algorithm in practice [67]. Finally, Salmela-Tarhio-Kytojoki is a recently introduced family of algorithms that has a reportedly good performance on specific types of data [52]. SOG has a linear search phase in the average case.

The lack of previously published work with extensive experimental analysis on multiple pattern matching algorithms motivated us to compare the performance of the Aho-Corasick, Set Horspool, Set Backward Oracle Matching, Wu-Manber and SOG algorithms in terms of preprocessing and searching time and to identify a suitable and preferably fast algorithm for randomly generated data with a binary alphabet, biological sequence databases containing the building blocks of nucleotides and Amino Acids and natural language data of the English alphabet, and for several problem parameters such as the total size of the pattern set, the alphabet size and the size of the input string and the patterns. To the best of our knowledge, this is the first time that the cost of the preprocessing phase is studied based on experimental results and is compared to the running time of the specific algorithms.

3.2 Related Work

Experimental results on multiple pattern matching algorithms have been reported in the past. The performance of a number of algorithms including Aho-Corasick, Set Horspool, Set Backward Oracle Matching and Wu-Manber was evaluated in [67] for a randomly generated data set. The input string had a size of approximately 10 MB while the pattern set consisted of 5 to 1,000 patterns where each pattern had a size of m = 5 to 100 characters. There, it was concluded that for relatively small pattern set sizes, Aho-Corasick had the best performance when patterns of size m = 5 to 15 characters and an alphabet of size |Σ| = 2 to 4 were used, Set Backward Oracle Matching was the fastest algorithm for patterns of size m = 15 to 100 characters and an alphabet of size |Σ| = 2 to 8, while Wu-Manber outperformed the other algorithms when the alphabet of the data set used had a size |Σ| = 8 to 64. For larger pattern sets, the Set Backward Oracle Matching algorithm was more attractive; it was the fastest algorithm when patterns of a size m = 20 to 100 were used and for all alphabet sizes, while the Aho-Corasick and Wu-Manber algorithms were faster for patterns of a size m = 5 to 20 characters.

A variant of the Wu-Manber algorithm called QWM was presented in [29] and its performance was compared to the Aho-Corasick, Commentz-Walter and the original Wu-Manber algorithm for randomly generated data with a binary alphabet and an alphabet of size |Σ| = 4 as well as for data with an English and a Chinese language alphabet. The pattern sets used for the experiments of that paper consisted of 100 to 2,000 patterns. It was shown that Aho-Corasick was the fastest algorithm for data with a binary alphabet while Wu-Manber outperformed the Aho-Corasick and Commentz-Walter algorithms on randomly generated data with an alphabet of size |Σ| = 4 and on data with an English and a Chinese language alphabet. HMA, a hierarchical multiple pattern matching algorithm, was introduced in [80] and its performance was compared among others to the performance of a compressed version of the Aho-Corasick algorithm on data for Intrusion Detection Systems.

The Aho-Corasick, Set Horspool, Set Backward Oracle Matching and Wu-Manber algorithms were compared in [78] in terms of searching time for biological sequence databases and random input strings for sets consisting of 100 to 100,000 patterns. Each pattern had a size of m = 8 and 32 characters while the input string had a size of approximately 32 MB. It was shown that for the specific problem parameters, Wu-Manber outperformed the Aho-Corasick, Set Horspool and Set Backward Oracle Matching algorithms. Moreover, the performance of the search phase of the Aho-Corasick, Set Horspool, Set Backward Oracle Matching, Wu-Manber and SOG algorithms was evaluated. For m = 8 and |Σ| = 256, Wu-Manber was the fastest algorithm when up to 5,000 patterns were used while SOG outperformed the rest of the algorithms for more than 5,000 patterns. When m = 32 and |Σ| = 4 on the other hand, the Set Backward Oracle Matching algorithm had the best performance when up to 1,000 patterns were used while SOG was the fastest for larger pattern set sizes.

Finally, a performance study of the Commentz-Walter, Wu-Manber, Set Backward Oracle Matching and the Salmela-Tarhio-Kytojoki algorithms for biological sequence databases was presented in [54], while the preprocessing phase of the same algorithms was evaluated in [53] for randomly generated data, English alphabet data and biological sequence databases.

3.3 Multiple Pattern Matching Algorithms

A naive solution to the multiple pattern matching problem is to perform d separate searches in the input string with a string matching algorithm, leading to a worst-case complexity of O(|P|) for the preprocessing phase and O(n|P|) for the search phase (a short sketch of this baseline is given after the following list). While frequently used in the past, this technique is not efficient, especially when a large pattern set is involved. Modern multiple pattern matching algorithms can scan the input string in a single pass to locate all occurrences of the patterns. These algorithms are often based on string matching algorithms, with some of their functions generalized to process multiple patterns simultaneously during the preprocessing phase. Based on the way the multiple patterns are represented and the search is performed, the algorithms can generally be classified into one of the four following approaches.

• Prefix algorithms The prefix searching algorithms use a trie to store the patterns, a data structure where each node represents a prefix u of one of the patterns. For a given position i of the input string, the algorithms traverse the trie looking for the longest possible suffix u of t0 . . . ti that is a prefix of one of the patterns. One of the most well known prefix multiple pattern matching algorithms is Aho-Corasick, an efficient algorithm based on Knuth-Morris-Pratt that preprocesses the pattern set in time linear in |P| and searches the input string in time linear in n in the worst case. Multiple Shift-And, a bit-parallel generalization of the Shift-And algorithm for multiple pattern matching, was introduced in [67] but is only useful for a small size of |P| since the pattern set must fit in a few computer words.

• Suffix algorithms The suffix algorithms store the patterns backwards in a suffix trie, a rooted directed tree that represents the suffixes of all patterns. At each position i of the input string the algorithms compute the longest suffix u of the input string that is a suffix of one of the patterns. Commentz-Walter combines a suffix trie with the good suffix and bad character shift functions of the Boyer-Moore algorithm. A simpler variant of Commentz-Walter is Set Horspool, an extension of the Horspool algorithm that uses only the bad character shift function. Suffix searching is generally considered to be more efficient than prefix searching since on average more input string positions are skipped following each mismatch.

• Factor algorithms The factor searching algorithms build a factor oracle, a trie with additional transitions that can recognize any substring (or factor) of the patterns. Dawg-Match [25] and MultiBDM [27] were the first two factor algorithms, both complicated and with a poor performance in practice [67]. The Set Backward Oracle Matching and the Set Backward Dawg Matching algorithms [67] are natural extensions of the Backward Oracle Matching and the Backward Dawg Matching [24] algorithms respectively for multiple pattern matching.

• Hashing algorithms The algorithms following this approach use hashing to reduce their memory footprint, usually in conjunction with other techniques. Wu-Manber is based on the Horspool algorithm. It reads the input string in blocks to effectively increase the size of the alphabet and then applies a hashing technique to reduce the necessary memory space. Zhou et al. [99] proposed an algorithm called MDH, a variant of Wu-Manber for large-scale pattern sets. Kim and Kim introduced in [47] a multiple pattern matching algorithm that also takes the hashing approach. The Salmela-Tarhio-Kytojoki variants of the Horspool, Shift-Or and BNDM [66] algorithms can locate candidate matches by excluding positions of the input string that do not match any of the patterns. They combine hashing and a technique called q-grams to increase the alphabet size, similar to the method used by Wu-Manber.
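For reference, the naive baseline mentioned at the beginning of this section simply repeats a single pattern search once per pattern; the following C sketch illustrates it, with the callback type and all identifiers being assumptions of the sketch.

/* Naive multiple pattern matching: d independent single-pattern searches
 * over the same input string, giving O(n|P|) worst-case search time. */
typedef void (*single_matcher)(const char *pattern, int m,
                               const char *text, int n);

void naive_multi_search(single_matcher search, const char **patterns,
                        const int *lengths, int d, const char *t, int n) {
    for (int r = 0; r < d; r++)
        search(patterns[r], lengths[r], t, n);   /* the input string is rescanned d times */
}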

Many well known tools utilize multiple pattern matching algorithms: Snort [82] uses a variant of the Aho-Corasick algorithm to perform intrusion detection, the Wu-Manber algorithm is utilized by Agrep [95] to perform approximate string searching, while Commentz-Walter is used in GNU Grep [38] when searching for the occurrences of multiple patterns in an input string.

3.3.1 Aho-Corasick

Aho-Corasick is an extension of the Knuth-Morris-Pratt algorithm for a set of patterns P. It uses a deterministic finite state pattern matching machine; a rooted directed tree or trie of P with a goto function g and an additional supply function Supply. The goto function maps a pair consisting of an existing state q and a symbol character into the next state. It is a generalization of the next table or the success link of the Knuth-Morris-Pratt algorithm for a set of patterns where a parent state can lead to its child states by σ where σ is a matching character. Each state of the trie is labeled after a single character of a pattern pr ∈ P. If L(q) denotes the label of the path between the initial state and a state q, then L(q) is also a prefix of one of the patterns. For each pattern pr there is a state q such that L(q) = pr. This state is marked as terminal and when visited during the search phase indicates that a complete match of pr was found. The supply function of Aho-Corasick is based on the supply function of the Knuth-Morris-Pratt algorithm. It is used to visit a previous state of the automaton when there is no transition from the current state to a child state via the goto function.

Figure 3.1: The automaton of the Aho-Corasick algorithm for the pattern set “AAC”, “AGT” and “GTA”

The goto function and the supply function are constructed during the preprocessing phase. To build the goto function, the trie is depth-first traversed and extended for each character of the patterns from a finite pattern set P while at the same time the outgoing transitions to each state are created. The supply function is built in transversal order from the trie until it has been computed for all states. For each state q, the supply link can be determined based on the longest suffix of L(q) that is also a prefix of any pattern from P. Assume that for the parent state qparent of q, g(qparent, σ) = q. If Supply(qparent) also has an outgoing transition to a state h by σ then the supply state of q can be set to h. In any other case, Supply(Supply(qparent)) must be checked for a transition to a state by σ and so on, until one such state is found or it is determined that no such state exists; in that case the supply state of q is set to the initial state.

ALGORITHM 3.1: The construction of the goto function g of the Aho-Corasick automaton

create state q0
forall the α ∈ Σ do
    g(q0, α) := fail
end
for i := 0; i < d; i := i + 1 do
    j := 0; state := q0
    while newState := g(state, pij) ≠ fail do
        state := newState; j := j + 1
    end
    for l := j; l < m; l := l + 1 do
        create state qcurrent
        forall the α ∈ Σ do
            g(qcurrent, α) := fail
        end
        newState := qcurrent
        g(state, pil) := newState
        state := newState
    end
    Output(qcurrent) := Output(qcurrent) ∪ pi
    Add terminal state on qcurrent
end

Let u be the longest suffix of the input string t0 . . . ti−1 that is also a prefix of any pattern ∈ P. The character σ located at position i of the input string is scanned next. If there is an outgoing transition from the current state q to another state f as indicated by the goto function, then L(f) = uσ is the new longest suffix of the input string at position i that is a prefix of one of the patterns. A match of a pattern exists in the input string if |uσ| = m. If on the other hand g(q, σ) = fail, then g(Supply(q), σ) is checked for an outgoing transition by σ. If g(Supply(q), σ) leads to a state f′ then u = L(f′). If g(Supply(q), σ) = fail then g(Supply(Supply(q)), σ) is considered and so on, until an outgoing transition by σ is found or until the supply state of the initial state is reached; in that case, the search will start again from the initial state. The construction of the goto function of the Aho-Corasick automaton is given in Algorithm listing 3.1, the computation of the supply function is presented in Algorithm listing 3.2 while the search phase of the Aho-Corasick algorithm is given in Algorithm listing 3.3. The output function returns L(q) for each terminal state q and is denoted as Output(). A transition that does not point to a state is denoted as fail.

ALGORITHM 3.2: The construction of the supply function Supply of the Aho-Corasick automaton

forall the α ∈ Σ do
    if g(q0, α) = fail then
        g(q0, α) := q0
    else
        Supply(g(q0, α)) := q0
    end
end
forall the currentState ∈ trie states in transversal order do
    forall the α ∈ Σ do
        s := g(currentState, α)
        if s ≠ fail then
            state := Supply(currentState)
            while g(state, α) = fail do
                state := Supply(state)
            end
            Supply(s) := g(state, α)
        end
    end
end

The goto function can be implemented using any of the following data structures: an array of size |Σ| where each state has an outgoing transition for every character of the alphabet, by precomputing all the transitions simulated by the supply function [67]; a linked list that is space efficient but not time efficient; or a balanced search tree that is considered a heavy-duty compromise and often not practical [30]. The implementation used for the experiments of this chapter was based on code from the Streamline system I/O software layer [83]. It uses a linked list for the supply function and a linked list of arrays to represent the transitions of the goto function, with each cell of the arrays potentially containing a pointer to the next node. Each list node corresponds to a different state of the trie and has an array of size |Σ| with an outgoing transition for every character of Σ. The trie of P can then be built for all d patterns in O(|Σ||P|) time, with a total size of O(|Σ||P|). The time to pass through a transition of the goto function is O(1) in the worst and average case while the search phase has a cost of O(n) in the worst and average case.
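The linked list of arrays described above can be expressed with a node type like the one sketched below in C; the field and function names are illustrative assumptions, not the Streamline-based implementation itself.

#include <stdlib.h>

#define SIGMA 256   /* assume an alphabet of unsigned char values */

/* One node per trie state: an array of |Σ| goto transitions, a supply
 * link followed after a mismatch, and the index of a matched pattern. */
struct ac_state {
    struct ac_state *next[SIGMA];   /* goto function; NULL plays the role of fail */
    struct ac_state *supply;        /* supply link of the state */
    int output;                     /* pattern index for terminal states, -1 otherwise */
};

static struct ac_state *ac_new_state(void) {
    struct ac_state *q = calloc(1, sizeof *q);  /* all transitions start as fail */
    if (q != NULL)
        q->output = -1;
    return q;
}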

ALGORITHM 3.3: The search phase of the Aho-Corasick automaton

state := q0
for i := 0; i < n; i := i + 1 do
    while newState := g(state, ti) = fail do
        state := Supply(state)
    end
    state := newState
    if Output(state) is not empty then
        report match at i − m + 1
    end
end

An example of a complete Aho-Corasick automaton for the pattern set “AAC”, “AGT” and “GTA” is presented in Figure 3.1. Assume that the goto function of the trie is already constructed and that the supply function for states 0 − 4 has been computed. The supply state of state 5 is determined next. State 4 is the parent state of state 5 since g(4, “T”) = 5 and Supply(4) = 6, therefore the goto function of state 6 is considered next. Since g(6, “T”) = 7, Supply(5) can be set to 7. If there was no outgoing transition from state 6 by “T” then Supply(6) would be checked next for an outgoing transition to another state by “T” and so on, until one such state is found or it is determined that no such state exists.

3.3.2 Set Horspool

The Set Horspool algorithm combines a deterministic finite state pattern matching machine with the shift function of the Horspool algorithm to search for the occurrence of multiple patterns in the input string in sub-linear time on average. The pattern matching machine used is a trie with a goto function g, created from each pattern pr ∈ P in reverse. The search for the occurrences of the patterns is then performed backwards similar to Horspool. When a mismatch or a complete match occurs, a number of input string positions can be safely skipped based on the bad character shift of the Horspool algorithm generalized for a set of patterns.

Figure 3.2: The automaton of the Set Horspool algorithm for the reversed pattern set “AAC”, “AGT” and “GTA”

The goto function and the shift function are computed during the preprocessing phase. To build the goto function, the trie is depth-first traversed and extended for each character of the patterns ∈ P while at the same time the outgoing transitions to each state q of the trie are created. The bad character shift is denoted as bc(σ) and is computed for each different character σ ∈ Σ as the maximum |L(q)| where q is a state labeled by σ. If no such character exists in any pattern, the bad character shift for σ is set to m.

Scanning the input string for the occurrences of the patterns is performed backwards, starting from character tm−1. For each position i of the input string, the algorithm computes the longest suffix u of t0 . . . ti that is also a suffix of any pattern. When a complete match is found or a mismatch occurs when reading character σ of the input string, the trie is shifted to the right according to character β at position i of the input string until β is aligned with the next state of the trie that is labeled after β. If no such β exists, the trie is shifted to the right by m positions. The computation of the bad character shift function of the Set Horspool algorithm is depicted in Algorithm listing 3.4 while the search phase is presented in Algorithm listing 3.5. The computation of the goto function is identical to that of the Aho-Corasick algorithm, as already given in Algorithm listing 3.1.


ALGORITHM 3.4: The construction of the bad character shift of the Set Horspool algorithm

for i := 0; i < |Σ|; i := i + 1 do
    bc(i) := m
end
for i := 0; i < d; i := i + 1 do
    for j := 0; j < m − 1; j := j + 1 do
        bc(pij) := MIN(m − j − 1, bc(pij))
    end
end

An example of a complete Set Horspool automaton for the reversed pattern set “AAC”, “AGT” and “GTA” is presented in Figure 3.2.

ALGORITHM 3.5: The search phase of the Set Horspool algorithm

i := m − 1
while i < n do
    j := 0; state := q0
    while j < m AND (newState := g(state, ti−j) ≠ fail) do
        state := newState; j := j + 1
        if Output(state) is not empty then
            report match at i
        end
    end
    i := i + bc(ti)
end

The implementation of Set Horspool uses a linked list of arrays to represent the transitions of the goto function, with each cell of the arrays potentially containing a pointer to the next node. Each node of the list corresponds to a different state of the trie and has an array of size |Σ| with an outgoing transition for every character of Σ, with O(1) time in the worst and average case to pass through a transition of the goto function. The construction of the trie and the shift function requires O(|Σ||P|) time and space while the search phase of the algorithm is O(nm) worst case time or sub-linear on average.

3.3.3 Set Backward Oracle Matching

The Set Backward Oracle Matching algorithm extends the Backward Oracle Matching string matching algorithm to search for the occurrence of multiple patterns in the input string in sub-linear time on average. It uses a factor oracle, a deterministic acyclic automaton created from each pattern pr ∈ P in reverse that is based on the notion of weak factor recognition. The automaton consists of a goto function g and at most m² additional external transitions such that at least any factor of a pattern can be recognized, similar to the factor oracle of Backward Oracle Matching, as presented in section 2.2.3. The goto function is constructed during the preprocessing phase from the set of the reversed patterns, similar to the automaton of the Set Horspool algorithm. For each character σ at position i of a pattern pr, the trie is depth-first traversed. If u is the suffix pri+1 . . . prm−1 of a pattern pr and σu does not exist as a label L(q) of a path of the trie, then the trie is extended; a new state q is created and labeled by σ and at the same time the outgoing transitions to q are constructed from the states at all levels between the initial and q.

Figure 3.3: The automaton of the Set Backward Oracle Matching algorithm for the reversed pattern set “AAC”, “AGT” and “GTA”

To construct the external transitions, a supply function Supply is used to specify a supply state for each state q, with the supply state of the initial state set to ∅. Assume that the goto and supply functions for all states up to the parent state qparent of a newly created state q were already computed and that g(qparent, σ) = q. Then, the supply state of qparent is visited to determine if there is an outgoing transition by σ. If there is no such transition, then g(Supply(qparent), σ) is set to q as an external transition. In that case, Supply(Supply(qparent)) is visited next and so on, until a supply state with an outgoing transition by σ is found or until there are no more states to visit. If, on the other hand, a supply state with a transition by σ to a state l is found, then Supply(q) is set to l.


ALGORITHM 3.6: The construction of the factor oracle of the Set Backward Oracle Matching algorithm

create state q0
set Supply(q0) := fail
forall the α ∈ Σ do
    g(q0, α) := fail
end
newState := 0
for i := 0; i < d; i := i + 1 do
    j := m − 1; state := q0
    while newState := g(state, pij) ≠ fail do
        state := newState; j := j − 1
    end
    for l := j; l ≥ 0; l := l − 1 do
        create state qcurrent
        forall the α ∈ Σ do
            g(qcurrent, α) := fail
        end
        newState := qcurrent
        g(state, pil) := newState
        k := Supply(state)
        while k ≠ fail AND g(k, pil) = fail do
            g(k, pil) := newState
            k := Supply(k)
        end
        if k ≠ fail then
            Supply(newState) := g(k, pil)
        else
            Supply(newState) := q0
        end
        state := newState
    end
    if terminal state on qcurrent does not exist then
        F(qcurrent) := ∅
        Add terminal state on qcurrent
    end
    F(qcurrent) := F(qcurrent) ∪ i
end


During the search phase, the algorithm reads backwards the longest suffix u of the input string that is also a suffix of any pattern. Assume that a mismatch occurs at position i of the input string when reading character σ. The factor oracle cannot determine with certainty that a substring is a factor of one of the patterns but can recognize if a substring is not, therefore σu is not a factor of any pr ∈ P. The oracle can then be safely shifted past i. As opposed to the Backward Oracle Matching algorithm, when m consecutive characters of the input string have been scanned without a mismatch and a terminal state is reached, a match of some pattern in the input string has only potentially been found, since there could be terminal states in the oracle that do not correspond to any pattern. Additionally, terminal states could exist that correspond to more than one pattern. For this reason, each terminal state q holds a set of indices F(q) to the patterns to which it corresponds. Then, all the patterns in F(q) are compared directly with the input string to determine if a complete match is found and the factor oracle is shifted by one position to the right. The preprocessing phase of the Set Backward Oracle Matching algorithm is given in Algorithm listing 3.6 while the search phase is presented in Algorithm listing 3.7. An example of a complete Set Backward Oracle Matching oracle for the reversed pattern set “AAC”, “AGT” and “GTA” is depicted in Figure 3.3. The substring “CAT” would be recognized by the oracle although it is not a factor of any pattern.

ALGORITHM 3.7: The search phase of the Set Backward Oracle Matching algorithm

i := m − 1
while i < n do
    state := q0; j := 0
    while j < m AND (newState := g(state, ti−j) ≠ fail) do
        state := newState; j := j + 1
    end
    if F(state) is not empty AND j = m then
        if Verify all patterns in F(state) against the input string then
            report match at i − m + 1
        end
        i := i + 1
    else
        i := i + m − j
    end
end

The implementation of Set Backward Oracle Matching uses a linked list of arrays to represent the transitions of the goto function, with each cell of the arrays potentially containing a pointer to a different node. Each node of the list corresponds to a different state of the trie and has an array of size |Σ| with an outgoing transition for every character of Σ, with O(1) time in the worst and average case to pass through a transition of the goto function. The set of indices F is stored in each terminal state using an array of size d. An auxiliary table of size d is also maintained to map the patterns to their corresponding indices. The oracle is created during the preprocessing phase in O(|Σ||P|) time for all d patterns of the pattern set, using O(|Σ||P|) space. The search phase complexity of the algorithm is O(n|P|) worst case time or sub-linear on average.

3.3.4 Wu-Manber

Wu-Manber is a generalization of the Horspool algorithm for multiple pattern matching. It scans the characters of the input string backwards for the occurrences of the patterns, shifting the search window to the right when a mismatch or a complete match occurs. To perform the shift, the bad character shift function of the Horspool algorithm is used. As previously discussed, the bad character shift for a character σ determines the safe number of shifts based on the position of the rightmost occurrence of σ in any pattern. The probability of σ existing in one of the patterns increases with the size of the pattern set and is inversely proportional to the alphabet size, and thus the maximum possible shift is decreased. To improve the efficiency of the algorithm, Wu-Manber considers the characters of the patterns and the input string as blocks of size B instead of single characters, essentially enlarging the alphabet size to |Σ|^B.

Figure 3.4: Comparing the suffix and prefix of the search window of the Wu-Manber algorithm

During the preprocessing phase, three tables are built from the patterns, the SHIFT, HASH and PREFIX tables. SHIFT is the equivalent of the bad character shift of the Horspool algorithm for blocks of characters, generalized for multiple patterns. If B does not appear in any pattern, the search window can be safely shifted by m − B + 1 positions to the right. Let h be the hash value of a block of B characters as determined by a hash function h1(). Then, SHIFT[h] is the distance of the rightmost occurrence of B to the end of any pattern. The HASH and PREFIX tables are only used when the shift value stored in SHIFT[h] is equal to 0. HASH[h] contains an ordered list of pattern indices whose B-character suffix has a hash value of h. For each of these patterns, let h′ be the hash value of their B′-character prefix as determined by a hash function h2(). The hash value h′ for each pattern p is stored in PREFIX[p]. That way, a potential match of the B-character suffix of a pattern can be verified first with the B′-character prefix of the pattern before comparing the patterns directly with the input string. The complexity of Wu-Manber was not given in the original paper, since hash functions h1() and h2() were not specified and the size of the SHIFT, HASH and PREFIX tables was not given [67]. For the experiments of this chapter, the algorithm was implemented with a block size of B = 3 and B′ = 2 while hash values h and h′ were calculated by shift and add; shifting the hash value to the left by bitshift positions and then adding in the ASCII value of a pattern or input string character. The value of bitshift was set to 2. Finally, the verification of the patterns against the input string was performed using the memcmp() function of string.h.
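The shift-and-add hashing described above can be sketched as follows in C for the block size B = 3 and bitshift = 2 used in the experiments; the function name is an illustrative assumption of this sketch.

/* Shift-and-add hash over a block of B characters: shift the running
 * value left by bitshift positions and add the ASCII value of the
 * next pattern or input string character. */
static unsigned int wm_hash(const unsigned char *block, int B, int bitshift) {
    unsigned int h = 0;
    for (int i = 0; i < B; i++)
        h = (h << bitshift) + block[i];
    return h;
}

With B = 3 and bitshift = 2, the resulting values are used to index the SHIFT and HASH tables directly.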

Assume that the search window is aligned with the input string at position i and that h is the hash value of the B-character suffix of t0 . . . ti. Then the SHIFT table is used to determine the number of safe shift positions. If SHIFT[h] > 0 then the search window is shifted by SHIFT[h] positions. If, on the other hand, SHIFT[h] = 0, the suffix of the input string potentially matches the suffix of some patterns of the pattern set and thus it must be determined if a complete match occurs at that position. The hash value h′ of the B′-character prefix of the input string starting at position i − m + 1 is then computed. For each pattern pr with the same hash value h of its B-character suffix, it is checked if PREFIX[p] matches with h′. If both the prefix and the suffix of the search window match with the prefix and suffix of some pr ∈ P, then the corresponding patterns are compared directly with the input string. The preprocessing phase of the Wu-Manber algorithm is given in Algorithm listing 3.8 while the search phase is presented in Algorithm listing 3.9.


ALGORITHM 3.8: The preprocessing phase of the Wu-Manber algorithm

Initialize all elements of SHIFT to m − B + 1
for i := 0; i < d; i := i + 1 do
    for q := m; q ≥ B; q := q − 1 do
        h := h1(piq−B . . . piq−1)
        shiftlen := m − q
        SHIFT[h] := MIN(SHIFT[h], shiftlen)
        if shiftlen = 0 then
            h′ := h2(pi0 . . . piB′−1)
            HASH[h] := HASH[h] ∪ i
            PREFIX[i] := h′
        end
    end
end


ALGORITHM 3.9: The search phase of the Wu-Manber algorithm

i := m − 1
while i < n do
    h := h1(ti−B+1 . . . ti)
    if SHIFT[h] > 0 then
        i := i + SHIFT[h]
    else
        h′ := h2(ti−m+1 . . . ti−m+B′)
        forall the pattern indices r stored in HASH[h] do
            if PREFIX[r] = h′ then
                Verify the pattern corresponding to r directly against the input string
            end
        end
        i := i + 1
    end
end

The cost of the implementation used in the experiments of this chapter is as follows. To calculate the values of the SHIFT, HASH and PREFIX tables during the preprocessing phase, the algorithm requires O(|P|) time. The space of Wu-Manber depends on the size of SHIFT, HASH and PREFIX. The space needed for the SHIFT table is ∑_{i=0}^{B−1} |Σ| × (2^bitshift)^i. In the worst case there could be d patterns with the same hash value h or h′ for their B-character suffix or B′-character prefix respectively, therefore HASH and PREFIX require d × ∑_{i=0}^{B−1} |Σ| × (2^bitshift)^i space, for a space complexity of O(d × ∑_{i=0}^{B−1} |Σ| × (2^bitshift)^i).

In the worst case for the searching phase of the Wu-Manber algorithm, the input string and m − 1 characters of all d patterns consist of the same repeating character σ, with the character at position m − B − 1 of each pattern being different. The algorithm will then encounter a potential match on every position of the input string since SHIFT[h] will constantly be 0. Therefore, as hash values h and h′ of the patterns will be identical, the m − B characters of all d patterns will be compared directly with the input string using the memcmp() function. The worst case searching time of Wu-Manber is given in [20] as O(n|P| log|Σ| |P|). In [65] the lower bound for the average time complexity of exact multiple pattern matching algorithms is given as Ω(n log|Σ|(|P|)/m) and according to [20] the searching phase of the Wu-Manber algorithm is optimal in the average case with a time complexity of O(n log|Σ|(|P|)/m). In [61] the average time complexity of Wu-Manber was also estimated as O(n / ((m − B + 1) × (1 − (m − B + 1) × d / (2 × |Σ|^B)))).

3.3.5 SOG

SOG extends the Shift-Or string matching algorithm to perform multiple pattern matching in linear time on average. Similar to Shift-Or, SOG is a bit-parallel algorithm that simulates a non-deterministic automaton. Moreover, it acts as a character class filter; it constructs a generalized pattern that can simultaneously match all patterns from a finite set. The generalized pattern accepts classes of characters based on the actual position of the characters in the patterns. The searching phase of the algorithm consists of a filtering phase and a verification phase. When a candidate match is found at a given position of the input string during the filtering phase, the patterns are verified using a combination of hashing and binary search to determine if a complete match of a pattern occurs. When the pattern set has a relatively large size, every position of the generalized pattern will accept most characters of the alphabet. In that case, false candidate matches will occur in most positions of the input string. To overcome this problem, SOG increases the alphabet size to |Σ|^B by processing the characters in blocks of size B, similar to the methodology used by the Wu-Manber algorithm.

ALGORITHM 3.10: The preprocessing phase of the SOG algorithm

for i := 0; i < d; i := i + 1 do
    sv := 0
    Create hs′ for pattern pi
    for j := 0; j < m − B; j := j + 1 do
        h := h1(pij . . . pij+B−1)
        V[h] := V[h] & (2^m − 1 − (1 ≪ sv))
        sv := sv + 1
    end
end

During the preprocessing phase, a hash value h is assigned to each different B-character block of the patterns using a hash function h1() by bit-shifting the ASCII values of the characters. For each different h, a bit vector V[h] is initialized by setting the ith bit of V[h] to 0 if the B-character block corresponding to h is found in the ith position of any pattern or to 1 otherwise. For the verification phase, a hash value hs is computed for each pattern by forming a 32-bit integer of every four bytes of the pattern and then using the XOR logical operation between the integers. For a pattern pr of size m = 8, its hash value hs will be:

hs = p^r_0 p^r_1 p^r_2 p^r_3 ⊻ p^r_4 p^r_5 p^r_6 p^r_7

To improve the efficiency of the hashing method described above, a two-level hashing technique [64] is used. This technique involves creating a 16-bit hash value hs′ by using the XOR logical operation between the lower and the upper 16 bits of hs; the value hs′ is stored in an ordered table. In [78] it is stated that the use of two-level hashing significantly improves the performance of the algorithm when less than 100,000 patterns are involved.
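As an illustration of the two hash levels described above, the following C sketch derives hs and hs′ for a pattern of size m = 8; the function names are hypothetical and the byte order of the 32-bit words is left as an implementation detail, so this is a sketch of the idea rather than the implementation used for the experiments.

#include <stdint.h>
#include <string.h>

/* Illustrative sketch: hs is formed by XOR-ing the two 32-bit words of an
   8-character pattern, hs' by XOR-ing the upper and lower 16 bits of hs.
   The names hash32() and hash16() are hypothetical. */
static uint32_t hash32(const unsigned char *p)       /* pattern of m = 8 bytes */
{
    uint32_t w0, w1;
    memcpy(&w0, p, 4);                                /* p0 p1 p2 p3 as a 32-bit integer */
    memcpy(&w1, p + 4, 4);                            /* p4 p5 p6 p7 as a 32-bit integer */
    return w0 ^ w1;                                   /* hs */
}

static uint16_t hash16(uint32_t hs)
{
    return (uint16_t)((hs >> 16) ^ (hs & 0xFFFFu));   /* hs' */
}

The 16-bit values hs′ of all patterns can then be kept in an ordered table so that a candidate match is verified with a binary search, as described above.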

ALGORITHM 3.11: The search phase of the SOG algorithm

E := 2^m
for i := 0; i < n − B + 1; i := i + 1 do
    E := (E ≪ 1) | V[h1(t_i . . . t_{i+B−1})]
    if E & 2^{m−B} then
        continue
    end
    Verify the patterns using two-level hashing
end

To search for the occurrence of the patterns in the input string, a bit mask E is used, consisting of m − B + 1 bits where each bit is initialized to 1. For each position 0 ≤ i < n − B + 1 of the input string, a hash value h is assigned to the character block t_i . . . t_{i+B−1} using function h1() again. E is then updated with the following formula:

E = (E ≪ 1) | V[h]

If the (m − B)th bit of E is equal to 0, then a candidate match of one of the patterns in the pattern set occurs starting from position i − m + B of the input string. To verify the candidate match, the ordered table is examined using binary search for the hash value hs′ of the input string characters t_{i−m+B} . . . t_{i+B−1}. For the experiments of this chapter, SOG was implemented using a block size of B = 3. The preprocessing phase of the SOG algorithm is presented in Algorithm listing 3.10 while the search phase is given in Algorithm listing 3.11.
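Assuming a bit vector V built as in Algorithm listing 3.10, the filtering loop can be sketched in C as follows; the block hash h1() shown here is only one possible way of bit-shifting the ASCII values of a B = 3 block, and verify_candidate() is a placeholder for the two-level hashing verification, so neither is taken from the thesis implementation.

#include <stdint.h>
#include <stdio.h>

#define B 3                                   /* block size used in this sketch */

/* One possible block hash over B = 3 characters; V is assumed to hold 1 << 15 entries. */
static unsigned h1(const unsigned char *b)
{
    return (((unsigned)b[0] << 6) ^ ((unsigned)b[1] << 3) ^ (unsigned)b[2]) & 0x7FFFu;
}

/* Placeholder for the two-level hashing / binary search verification. */
static void verify_candidate(const unsigned char *t, long pos)
{
    printf("candidate match starting at position %ld\n", pos);
}

/* Filtering phase for patterns of length m <= 32. */
void sog_filter(const unsigned char *t, long n, int m, const uint32_t *V)
{
    uint32_t E = ~0u;                         /* every bit initialized to 1 */
    for (long i = 0; i + B <= n; i++) {
        E = (E << 1) | V[h1(t + i)];          /* advance all generalized-pattern prefixes */
        if (E & (1u << (m - B)))              /* (m-B)-th bit still 1: no candidate here */
            continue;
        verify_candidate(t, i - m + B);       /* candidate starting at position i - m + B */
    }
}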

The preprocessing time of the SOG algorithm is O(|P|) with O(|Σ|^B + |P|) space. For the verification of the patterns using two-level hashing and binary search, no checking of candidate matches is needed in the best case. In the worst case, and assuming that all patterns and input string positions have the same hash value, the verification time is O(n|P|). If the patterns have different hash values, the verification time is O(n(log m + m)) in the worst case. The filtering phase is linear in n in the worst and average case. The combined filtering and verification phase is then O(n|P|) when all patterns have the same hash value and O(nm) otherwise in the worst case, and linear in n in the average case.

Table 3.1 summarizes the theoretical space, preprocessing and searching time complexity of the presented multiple pattern matching algorithms. The searching time of SOG corresponds to the time of the filtering and verification phases of the algorithm combined.


Table 3.1: Known theoretical space, preprocessing and worst and average searching time complexity of the Aho-Corasick, Set Horspool, Set Backward Oracle Matching, Wu-Manber and SOG multiple pattern matching algorithms

Algorithm                       Extra space                               Preprocessing   Worst case            Average case
Aho-Corasick                    |Σ||P|                                    |Σ||P|          n                     n
Set Horspool                    |Σ||P|                                    |Σ||P|          nm                    sub-linear
Set Backward Oracle Matching    |Σ||P|                                    |Σ||P|          n|P|                  sub-linear
Wu-Manber                       m × Σ_{i=0}^{B−1} |Σ| × (2^bitshift)^i    |P|             n|P| log_{|Σ|}|P|     n log_{|Σ|}|P| / m
SOG                             |Σ|^B + |P|                               |P|             n|P|                  n

3.4 Analysis

In the previous sections, details on the Aho-Corasick, Set Horspool, Set Backward Oracle Matching, Wu-Manber and SOG multiple pattern matching algorithms were given and the experimental methodology was discussed. In this section, the performance of the algorithms is evaluated in terms of preprocessing and searching time for different sets of data.

Figures 3.5 to 3.8 present the preprocessing time of the algorithms. Each pattern had a size of m = 8 and m = 32 characters and an alphabet size |Σ| of 2, 4, 20 and 128 characters while the sets consisted of 100 to 100,000 distinct patterns. When a pattern of size m = 8 and a binary alphabet were used, there were only 2^8 = 256 different patterns while for a pattern of size m = 8 and an alphabet of |Σ| = 4, there was a maximum of 4^8 = 65,536 different patterns. The time of the algorithms to preprocess the pattern set of the FASTA Nucleic Acid and the FASTA Amino Acid database was similar to that of the E.coli genome and the Swiss Prot. Amino Acid database respectively and thus the corresponding Figures were omitted.

The time of the multiple pattern matching algorithms to locate all the occurrences of any pattern from the pattern set in randomly generated binary alphabet input strings, the E.coli genome, the FASTA Nucleic Acid of the A-thaliana genome, the Swiss Prot. Amino Acid sequence database, the FASTA Amino Acid of the A-thaliana genome as well as in input strings with an English alphabet is depicted in Figures 3.9 to 3.14. Finally, Figures 3.17 to 3.26 present a comparison between the preprocessing, searching and running time of the presented algorithms for all types of data. The bottom graphs of the same Figures depict the contribution of the preprocessing and the searching time of each algorithm to its total running time. All Figures have a logarithmic horizontal axis and a linear vertical axis, while both the horizontal and vertical axes of the bottom graphs of Figures 3.17 to 3.26 are logarithmic.

As can be seen from the Figures, varying parameters such as the size of the pattern set and the alphabet as well as the size of the patterns and the input string can affect the preprocessing, searching and running times of the multiple pattern matching algorithms in different ways. The experimental study showed that no algorithm is the best for all values of the problem parameters.

The presented multiple pattern matching algorithms can generally be classified into two categories based on their experimentally obtained performance, as depicted in Figures 3.5 to 3.14, and their theoretical time complexity as reported in the original papers and summarized in Table 3.1; the trie-based category of the Aho-Corasick, Set Horspool and Set Backward Oracle Matching algorithms and the hash-based category of the Wu-Manber and SOG algorithms. All algorithms within the same category share common characteristics, as discussed below.


3.4.1 Preprocessing

The time of the Aho-Corasick, Set Horspool and Set Backward Oracle Matching algorithms to preprocess the pattern set was expected theoretically to increase linearly in the size d of the pattern set, the size m of each pattern and the alphabet size |Σ|. The experimental results confirmed that the time to preprocess the pattern set increased linearly in d for patterns of a size m = 8 and m = 32 with an alphabet of size |Σ| of 2, 4, 20 and 128 characters. It is worth noting the significant increase in the time to construct the supply function Supply of Aho-Corasick when sets of more than 2,000 patterns were used for all alphabet sizes and pattern sizes, which resulted in a significantly higher increase of the preprocessing time of the algorithm compared to Set Horspool and Set Backward Oracle Matching.

Figures 3.5 to 3.8 depict the dependent relationship between the size m of the patterns and the preprocessing time of the Aho-Corasick, Set Horspool and Set Backward Oracle Matching algorithms. Quadrupling m from 8 to 32 characters resulted in a proportional increase in the time to preprocess the pattern set. The highest increase rate of the preprocessing time of the algorithms was observed when Aho-Corasick, Set Horspool and Set Backward Oracle Matching were used on patterns with a small alphabet size, including randomly generated sets with a binary alphabet and sets constructed from the genome of E.coli. Generally it can be concluded that for all types of pattern sets, the preprocessing time of Aho-Corasick increased in m at a higher rate compared to the Set Horspool and Set Backward Oracle Matching algorithms.

The time of the Aho-Corasick, Set Horspool and Set Backward Oracle Matching algorithms to preprocess the pattern set also increased linearly in |Σ| for most types of pattern sets. It is worth noting that the preprocessing time of Aho-Corasick actually decreased when used on sets with an English alphabet, contradicting the theoretical preprocessing time of the algorithm.

Figure 3.5: Preprocessing time of the algorithms for randomly generated data with |Σ| = 2 (two panels, m = 8 and m = 32; preprocessing time in seconds against pattern set size; curves: AC, SH, SBOM, WM, SOG)


Figure 3.6: Preprocessing time of the algorithms for the E.coli genome with |Σ| = 4 (two panels, m = 8 and m = 32; preprocessing time in seconds against pattern set size; curves: AC, SH, SBOM, WM, SOG)

Figure 3.7: Preprocessing time of the algorithms for the Swiss Prot. Amino Acid database with |Σ| = 20 (two panels, m = 8 and m = 32; preprocessing time in seconds against pattern set size; curves: AC, SH, SBOM, WM, SOG)


Figure 3.8: Preprocessing time of the algorithms for English alphabet data with |Σ| = 128 (two panels, m = 8 and m = 32; preprocessing time in seconds against pattern set size; curves: AC, SH, SBOM, WM, SOG)


Wu-Manber had the fastest preprocessing phase among the multiple pattern matching algorithms of this chapter, as depicted in Figures 3.5 to 3.8. The time spent during the preprocessing phase of the Wu-Manber and SOG algorithms was expected to increase linearly in the size d of the pattern set and the size m of the patterns, since both have a theoretical preprocessing time of O(|P|). The experimental results confirmed that the time of the Wu-Manber algorithm to preprocess the pattern set increased linearly in d for all types of data. The preprocessing phase of SOG appears in the Figures to be roughly constant in d when used on sets of up to 2,000 patterns and linear in d for larger pattern set sizes. This behavior was due to the relatively high cost, in terms of preprocessing time, of setting up the necessary data structures of the algorithm.

The performance of the preprocessing phase of the hash-based algorithms was also affected by the size m of the patterns used. The preprocessing time of both Wu-Manber and SOG algorithms increased linearly in m for all types of data. Finally, the time of Wu-Manber and SOG to preprocess the pattern set was generally constant in |Σ|, confirming in practice the theoretical preprocessing time of the algorithms.

3.4.2 Searching

Table 3.1 summarizes the theoretical time of the presented multiple pattern matching algorithms to scan the input string; O(n) in the worst and average case for Aho-Corasick, O(nm) and sub-linear in the worst and average case respectively for Set Horspool and O(n|P|) and sub-linear for Set Backward Oracle Matching. In practice, the time spent during the search phase of the three trie-based algorithms increased linearly in the size of the pattern set, as depicted in Figures 3.9 to 3.14, although for Aho-Corasick and Set Horspool it was expected to be constant in d.

The time of Aho-Corasick and Set Horspool to scan the input string increased linearly in m. When data with a binary alphabet and an alphabet of size |Σ| = 4 as well as English alphabet data were used, the increase rate in m was notably higher, especially for sets of more than 10,000 patterns. It is interesting that the performance in terms of searching time of the Set Backward Oracle Matching algorithm improved when used on patterns with a size m = 32 as opposed to m = 8 for most types of data, while on English alphabet data the searching time of Set Backward Oracle Matching actually increased linearly in m, similar to the Aho-Corasick and Set Horspool algorithms.

The time of the Aho-Corasick algorithm to search the input string for all occurrences of any pattern from the pattern set was roughly constant in the size |Σ| of the alphabet for most types of data. The performance of Set Horspool and Set Backward Oracle Matching on the other hand improved when used to search data sets with a larger alphabet size, although this was not expected by their theoretical average searching time complexity. This is clearly shown in Figures 3.11 and 3.12; even though the input string of the Swiss Prot. Amino Acid database had significantly more characters than the input string of the FASTA Nucleic Acid database, the searching time of the Set Horspool algorithm decreased for a set of 20,000 patterns with m = 32 from over 12 to 7 seconds while the searching time of Set Backward Oracle Matching decreased for the same pattern set from 1.14 to 0.58 seconds.


Figure 3.9: Searching time of the algorithms for randomly generated input strings with |Σ| = 2 (two panels, m = 8 and m = 32; searching time in seconds against pattern set size; curves: AC, SH, SBOM, WM, SOG)

Figure 3.10: Searching time of the algorithms for the E.coli genome with |Σ| = 4 (two panels, m = 8 and m = 32; searching time in seconds against pattern set size; curves: AC, SH, SBOM, WM, SOG)


Figure 3.11: Searching time of the algorithms for the FASTA Nucleic Acid database with |Σ| = 4 (two panels, m = 8 and m = 32; searching time in seconds against pattern set size; curves: AC, SH, SBOM, WM, SOG)

Figure 3.12: Searching time of the algorithms for the Swiss Prot. Amino Acid database with |Σ| = 20 (two panels, m = 8 and m = 32; searching time in seconds against pattern set size; curves: AC, SH, SBOM, WM, SOG)


Figure 3.13: Searching time of the algorithms for the FASTA Amino Acid database with |Σ| = 20 (two panels, m = 8 and m = 32; searching time in seconds against pattern set size; curves: AC, SH, SBOM, WM, SOG)

Figure 3.14: Searching time of the algorithms for English alphabet input strings with |Σ| = 128 (two panels, m = 8 and m = 32; searching time in seconds against pattern set size; curves: AC, SH, SBOM, WM, SOG)


The theoretical time complexity of the Wu-Manber algorithm to locate all the occurrences of the patterns in the input string is O(n|P| log_{|Σ|}|P|) in the worst and O(n log_{|Σ|}|P| / m) in the average case while the theoretical searching time of SOG is O(n|P|) in the worst and O(n) in the average case. Based on the theoretical time of Wu-Manber, it was expected that the time of the algorithm to scan the input string would increase linearly in the size n of the input string and the size d of the pattern set and decrease in the size m of the patterns and |Σ| of the alphabet in the average case. The searching time of the SOG algorithm on the other hand was expected to increase linearly in n and be constant in d and |Σ| in the average case.

Figures 3.9 to 3.14 confirm that the searching time of the Wu-Manber algorithm increased linearly in the size d of the pattern set for all alphabet and pattern sizes used. The searching time of the SOG algorithm also increased linearly in d, close to the worst case searching complexity of the algorithm. As a general remark, it can be observed that the rate of increase of the searching time of the hash-based algorithms in the size d of the pattern set was slightly higher than the corresponding rate of increase of the trie-based algorithms for all types of data.

Although it was not expected by the average theoretical time of the Wu-Manber algorithm, its searching time was roughly constant in the size m of the patterns when data with an alphabet of size |Σ| = 20 and 128 were used, while it actually increased when used on randomly generated binary alphabet data, the E.coli genome and the FASTA Nucleic Acid database. The time of the SOG algorithm to search the input string for all occurrences of any pattern from the pattern set increased significantly when patterns of a size m = 32 were used as opposed to patterns of a size m = 8, also close to its worst case theoretical searching complexity. The increase in the searching time was higher when data with a larger alphabet size were used, including the Swiss Prot. Amino Acid database, the FASTA Amino Acid database as well as English alphabet data.

Finally, similar to Set Horspool and Set Backward Oracle Matching, the searching time of the Wu-Manber and SOG algorithms decreased in the alphabet size |Σ|. This observation confirms the theoretical average time complexity of the Wu-Manber algorithm but contradicts the theoretical complexity of the SOG algorithm.

3.4.3 Overall Comparison

As can generally be seen from Figures 3.9 to 3.11, for randomly generated data with a binary alphabet, the E.coli genome and the FASTA Nucleic Acid database, the trie-based algorithms outperformed the hash-based algorithms in terms of searching time for all sizes of the pattern set and the patterns. For these types of data, Aho-Corasick was the fastest algorithm when patterns of a size m = 8 were used while Set Backward Oracle Matching outperformed the rest of the algorithms for patterns of a size m = 32. For the Swiss Prot. and the FASTA Amino Acid databases, SOG was the fastest algorithm when sets of up to 2,000 patterns were used, together with patterns of a size m = 8. For the same data and for sets of more than 2,000 patterns or when patterns with a size of m = 32 characters were used, Set Backward Oracle Matching was the fastest among the presented algorithms. Finally, when English alphabet data were used as depicted in Figure 3.14, the performance in terms of searching time of the algorithms converged, with the SOG algorithm being slightly faster when patterns of a size m = 8 were used and the Set Backward Oracle Matching, Wu-Manber and Aho-Corasick algorithms outperforming the rest of the algorithms when patterns of a size m = 32 were used.

Aho-Corasick was the fastest algorithm to scan the input string when data sets with an alphabet of a size |Σ| = 2 and 4 together with patterns of a size m = 8 were used; randomly generated input strings with a binary alphabet, the E.coli genome and the FASTA Nucleic Acid database. The


Figure 3.15: Experimental map of the multiple pattern matching algorithm implementations for m = 8 (fastest implementation per region of pattern set size d and alphabet size |Σ|; regions: AC, SBOM, SOG)

inadequate performance of Aho-Corasick on data with a large alphabet size or on patterns with a size of m = 32 characters was generally due to the slow preprocessing phase of the algorithm on these types of data, as already discussed. The fast running time of Aho-Corasick on the other hand on data sets with a small alphabet size and a pattern size of m = 8 was expected from the experimental maps presented in [67], where Aho-Corasick was faster than the Wu-Manber and the Set Backward Oracle Matching algorithms when data with a binary alphabet size were used. Similar results for multiple pattern matching were reported in [29], where for a binary alphabet data set Aho-Corasick outperformed the Commentz-Walter and Wu-Manber algorithms.

The Set Horspool algorithm had a generally moderate performance. Its searching time had similar characteristics to Aho-Corasick although it was slightly slower on each type of data. Set Backward Oracle Matching on the other hand was one of the fastest algorithms in terms of searching time. It outperformed the Aho-Corasick, Set Horspool, Wu-Manber and SOG algorithms on most types of data where patterns of a size m = 32 were used. It was also the fastest algorithm when patterns of a size m = 8 were used on the Swiss Prot. and FASTA Amino Acid databases for sets of more than 5,000 patterns. In the experiments presented in [67], Set Backward Oracle Matching was the fastest algorithm for 1,000 patterns and for a pattern size m of more than 20 characters, which also holds true for the experiments presented in this chapter.

Although Wu-Manber had the fastest preprocessing phase, it was one of the slowest algorithms to search the input string for the occurrences of any pattern from the pattern set on most types of data. It was faster than the rest of the algorithms in terms of searching time only when English alphabet data sets were used together with patterns of a size m = 32. Based on the average searching time complexity of the algorithms as given in Table 3.1, it was expected for Wu-Manber to outperform the SOG algorithm on most types of data, with SOG being faster only when patterns of a size m = 8 and an alphabet size of |Σ| = 2 and 4 were used. In practice though, Wu-Manber was faster than SOG for data with an alphabet of size |Σ| = 2 and 4 or when patterns of a size m = 32 were used. Finally, SOG outperformed the Aho-Corasick, Set Horspool, Set Backward Oracle Matching and Wu-Manber algorithms on the Swiss Prot. and FASTA Amino Acid databases and on data sets with an English alphabet when sets of up to 2,000 patterns were used


Figure 3.16: Experimental map of the multiple pattern matching algorithm implementations for m = 32 (fastest implementation per region of pattern set size d and alphabet size |Σ|; regions: SBOM, WM, AC)

with a size of m = 8 characters.

The experimental maps that are depicted in Figures 3.15 and 3.16 summarize the most efficient implementations of the Aho-Corasick, Set Horspool, Set Backward Oracle Matching, Wu-Manber and SOG algorithms for different pattern, alphabet and pattern set sizes.

Figures 3.17 to 3.26 depict a comparison between the preprocessing, searching and overall running time of the presented algorithms. As can be seen, the running time of the Aho-Corasick algorithm was affected more by the performance of its preprocessing phase than by the performance of its searching phase for most types of data when sets of more than 2,000 patterns were used. This can be attributed to the time to construct the supply function of the algorithm, as already discussed. Figure 3.20 clearly depicts that when patterns with a size m = 32 together with data with a relatively small input string size n were used, including the E.coli genome, the FASTA Amino Acid database and data sets with an English alphabet, the preprocessing phase of the Set Backward Oracle Matching algorithm was also slower than its searching phase.

As opposed to the Aho-Corasick and Set Backward Oracle Matching algorithms, the time of Wu-Manber to preprocess the available pattern sets was only a fraction of its searching time for all types of data used. The preprocessing time of the SOG algorithm on the other hand was similar to its searching time on the FASTA Amino Acid database and on English alphabet data sets when up to 2,000 patterns were used with a size of m = 8, as depicted in Figure 3.25. SOG was slower than the Wu-Manber algorithm to preprocess all types of pattern sets. Moreover, for sets of up to 1,000 patterns where the patterns had a size of m = 8, or sets of up to 500 patterns where the patterns had a size of m = 32, SOG was the slowest algorithm in terms of preprocessing time compared to the Aho-Corasick, Set Horspool, Set Backward Oracle Matching and Wu-Manber algorithms while for larger pattern set sizes, the Aho-Corasick algorithm was the slowest.


Figure 3.17: Comparison between the preprocessing, searching and running time of the Aho-Corasick algorithm for all types of data for m = 8 (panels: Random 2, E.coli, FASTA Nucleic Acid, Swiss Prot., FASTA Amino Acid, English; time in seconds against pattern set size; curves: preprocessing, searching, running)


Figure 3.18: Comparison between the preprocessing, searching and running time of the Aho-Corasick algorithm for all types of data for m = 32 (panels: Random 2, E.coli, FASTA Nucleic Acid, Swiss Prot., FASTA Amino Acid, English; time in seconds against pattern set size; curves: preprocessing, searching, running)


Figure 3.19: Comparison between the preprocessing, searching and running time of the Set Horspool algorithm for all types of data for m = 8 (panels: Random 2, E.coli, FASTA Nucleic Acid, Swiss Prot., FASTA Amino Acid, English; time in seconds against pattern set size; curves: preprocessing, searching, running)


Figure 3.20: Comparison between the preprocessing, searching and running time of the Set Horspool algorithm for all types of data for m = 32 (panels: Random 2, E.coli, FASTA Nucleic Acid, Swiss Prot., FASTA Amino Acid, English; time in seconds against pattern set size; curves: preprocessing, searching, running)


Figure 3.21: Comparison between the preprocessing, searching and running time of the Set Backward Oracle Matching algorithm for all types of data for m = 8 (panels: Random 2, E.coli, FASTA Nucleic Acid, Swiss Prot., FASTA Amino Acid, English; time in seconds against pattern set size; curves: preprocessing, searching, running)


Figure 3.22: Comparison between the preprocessing, searching and running time of the Set Backward Oracle Matching algorithm for all types of data for m = 32 (panels: Random 2, E.coli, FASTA Nucleic Acid, Swiss Prot., FASTA Amino Acid, English; time in seconds against pattern set size; curves: preprocessing, searching, running)


Figure 3.23: Comparison between the preprocessing, searching and running time of the Wu-Manber algorithm for all types of data for m = 8 (panels: Random 2, E.coli, FASTA Nucleic Acid, Swiss Prot., FASTA Amino Acid, English; time in seconds against pattern set size; curves: preprocessing, searching, running)


Figure 3.24: Comparison between the preprocessing, searching and running time of the Wu-Manber algorithm for all types of data for m = 32 (panels: Random 2, E.coli, FASTA Nucleic Acid, Swiss Prot., FASTA Amino Acid, English; time in seconds against pattern set size; curves: preprocessing, searching, running)


Figure 3.25: Comparison between the preprocessing, searching and running time of the SOG algorithm for all types of data for m = 8 (panels: Random 2, E.coli, FASTA Nucleic Acid, Swiss Prot., FASTA Amino Acid, English; time in seconds against pattern set size; curves: preprocessing, searching, running)


Figure 3.26: Comparison between the preprocessing, searching and running time of the SOG algorithm for all types of data for m = 32 (panels: Random 2, E.coli, FASTA Nucleic Acid, Swiss Prot., FASTA Amino Acid, English; time in seconds against pattern set size; curves: preprocessing, searching, running)


3.5 Appendix: Experimental Results

Table 3.2: Preprocessing and searching time of the algorithms for a randomly generated input string with |Σ| = 2 and m = 8 (sec)

AC SH SBOM WM SOG

d Preproc. Searching Preproc. Searching Preproc. Searching Preproc. Searching Preproc. Searching

100     0.000  0.029   0.000  0.098   0.000  0.167   0.000  0.400   0.007  0.603
200     0.000  0.028   0.000  0.094   0.000  0.211   0.000  0.653   0.007  0.892

Table 3.3: Preprocessing and searching time of the algorithms for a randomly generated input string with |Σ| = 2 and m = 32 (sec)

AC SH SBOM WM SOG

d Preproc. Searching Preproc. Searching Preproc. Searching Preproc. Searching Preproc. Searching

100     0.000  0.033   0.000  0.130   0.000  0.013   0.000  0.505   0.023  0.756
200     0.002  0.042   0.000  0.172   0.000  0.016   0.000  0.944   0.023  1.264
500     0.009  0.050   0.000  0.230   0.001  0.021   0.000  2.276   0.023  2.823
1000    0.034  0.056   0.001  0.288   0.001  0.026   0.000  4.660   0.023  5.423
2000    0.133  0.063   0.002  0.362   0.003  0.032   0.000  9.547   0.023  10.691
5000    1.866  0.070   0.007  0.468   0.009  0.045   0.000  24.524  0.024  26.727
10000   8.451  0.103   0.017  0.649   0.019  0.073   0.001  49.447  0.026  53.450
20000   34.336  0.216  0.042  1.163   0.038  0.112   0.003  99.006  0.026  103.294
50000   205.383  0.338  0.119  2.047  0.099  0.188   0.009  253.267  0.026  201.234
100000  412.340  0.418  0.253  2.765  0.197  0.277   0.018  545.401  0.027  385.392


Table 3.4: Preprocessing and searching time of the algorithms for the E.coli genome with |Σ| = 4 and m = 8 (sec)

AC SH SBOM WM SOG

d Preproc. Searching Preproc. Searching Preproc. Searching Preproc. Searching Preproc. Searching

100     0.000  0.038   0.000  0.082   0.000  0.044   0.000  0.085   0.007  0.118
200     0.000  0.046   0.000  0.092   0.000  0.058   0.000  0.113   0.007  0.303
500     0.001  0.069   0.000  0.122   0.000  0.083   0.000  0.198   0.007  0.524
1000    0.004  0.084   0.000  0.146   0.000  0.111   0.000  0.326   0.007  0.729
2000    0.014  0.095   0.000  0.168   0.001  0.147   0.000  0.571   0.008  1.077
5000    0.137  0.103   0.001  0.195   0.002  0.214   0.000  1.276   0.009  2.082
10000   0.447  0.112   0.002  0.224   0.003  0.303   0.000  2.338   0.010  3.570
20000   1.202  0.124   0.006  0.348   0.006  0.459   0.000  4.131   0.014  6.106
50000   3.284  0.147   0.017  0.569   0.014  0.712   0.002  7.822   0.026  11.344

Table 3.5: Preprocessing and searching time of the algorithms for the E.coli genome with |Σ| = 4 and m = 32 (sec)

AC SH SBOM WM SOG

d Preproc. Searching Preproc. Searching Preproc. Searching Preproc. Searching Preproc. Searching

100     0.000  0.043   0.000  0.085   0.000  0.006   0.000  0.081   0.023  0.349
200     0.002  0.051   0.000  0.100   0.000  0.007   0.000  0.114   0.023  0.429
500     0.012  0.070   0.000  0.126   0.001  0.009   0.000  0.201   0.023  0.579
1000    0.042  0.088   0.001  0.157   0.002  0.011   0.000  0.337   0.023  0.789
2000    0.179  0.104   0.004  0.193   0.005  0.016   0.000  0.592   0.023  1.174
5000    2.355  0.124   0.011  0.289   0.015  0.031   0.000  1.382   0.024  2.275
10000   10.929  0.159  0.023  0.455   0.033  0.050   0.001  2.679   0.026  4.073
20000   45.544  0.264  0.050  0.698   0.069  0.072   0.003  5.362   0.030  7.682
50000   285.623  0.471  0.130  1.045  0.187  0.128   0.009  13.602  0.043  18.436
100000  519.345  0.889  0.210  2.123  0.312  0.178   0.017  25.403  0.049  35.883


Table 3.6: Preprocessing and searching time of the algorithms for the FNA database with |Σ| = 4 and m = 8 (sec)

AC SH SBOM WM SOG

d Preproc. Searching Preproc. Searching Preproc. Searching Preproc. Searching Preproc. Searching

100     0.000  0.981   0.000  2.081   0.000  1.158    0.000  2.141    0.007  2.495
200     0.000  1.166   0.000  2.378   0.000  1.550    0.000  3.354    0.007  7.023
500     0.001  1.699   0.000  3.131   0.000  2.275    0.000  6.382    0.007  12.796
1000    0.004  2.088   0.000  3.703   0.000  2.977    0.000  10.254   0.007  18.341
2000    0.013  2.354   0.000  4.249   0.000  3.946    0.000  18.400   0.007  27.734
5000    0.119  2.567   0.001  4.874   0.001  5.681    0.000  40.347   0.008  52.706
10000   0.394  2.762   0.002  5.459   0.003  7.475    0.000  71.900   0.010  88.603
20000   1.072  3.030   0.004  6.157   0.006  9.749    0.000  120.144  0.014  147.272
50000   2.981  3.140   0.008  6.423   0.013  13.981   0.002  212.313  0.026  267.074
100000  4.916  3.102   0.016  6.440   0.022  19.625   0.004  291.762  0.048  377.561

Table 3.7: Preprocessing and searching time of the algorithms for the FNA database with |Σ| = 4 and m = 32 (sec)

AC SH SBOM WM SOG

d Preproc. Searching Preproc. Searching Preproc. Searching Preproc. Searching Preproc. Searching

100     0.000  1.085    0.000  2.225   0.000  0.173   0.000  2.142     0.023  8.605
200     0.002  1.304    0.000  2.607   0.000  0.202   0.000  3.211     0.023  10.751
500     0.011  1.759    0.000  3.256   0.001  0.244   0.000  6.290     0.023  14.380
1000    0.042  2.188    0.001  3.950   0.002  0.287   0.000  10.856    0.023  19.777
2000    0.173  2.599    0.003  4.754   0.005  0.338   0.000  19.782    0.023  29.278
5000    2.359  3.026    0.009  6.039   0.014  0.501   0.000  45.090    0.024  56.887
10000   10.906  3.621   0.021  7.873   0.031  0.782   0.001  87.036    0.026  102.457
20000   45.405  6.318   0.048  12.039  0.067  1.144   0.003  175.350   0.030  192.723
50000   89.241  9.345   0.130  22.898  0.182  1.702   0.009  453.176   0.042  466.449
100000  170.324  17.399  0.268  33.011  0.388  2.195  0.018  1136.009  0.063  931.688


Table 3.8: Preprocessing and searching time of the algorithms for the Swiss Prot. database with |Σ| = 20 and m = 8 (sec)

AC SH SBOM WM SOG

d Preproc. Searching Preproc. Searching Preproc. Searching Preproc. Searching Preproc. Searching

100     0.000  1.131    0.000  1.804   0.000  0.389   0.000  0.624   0.007  0.459
200     0.000  1.433    0.000  2.191   0.000  0.458   0.000  1.287   0.007  0.459
500     0.002  1.743    0.000  2.302   0.000  0.599   0.000  2.443   0.007  0.460
1000    0.008  2.004    0.000  2.461   0.001  0.705   0.000  3.430   0.007  0.462
2000    0.034  2.510    0.001  3.077   0.002  0.794   0.000  5.002   0.008  0.522
5000    0.329  3.669    0.002  4.129   0.005  0.953   0.000  8.244   0.009  1.583
10000   0.974  4.281    0.004  4.543   0.008  1.098   0.000  12.254  0.010  4.312
20000   3.263  5.070    0.011  5.130   0.014  1.421   0.001  19.793  0.014  10.890
50000   25.811  8.250   0.043  8.285   0.047  2.458   0.002  43.000  0.027  21.749
100000  94.405  11.298  0.090  11.545  0.096  3.284   0.005  81.116  0.050  26.223

Table 3.9: Preprocessing and searching time of the algorithms for the Swiss Prot. database with |Σ| = 20 and m = 32 (sec)

AC SH SBOM WM SOG

d Preproc. Searching Preproc. Searching Preproc. Searching Preproc. Searching Preproc. Searching

100     0.001  1.221     0.000  1.889   0.000  0.108   0.000  0.466   0.023  4.690
200     0.002  1.517     0.000  2.218   0.000  0.122   0.000  1.043   0.023  5.849
500     0.013  1.914     0.001  2.443   0.002  0.138   0.000  2.210   0.023  8.726
1000    0.046  2.358     0.002  2.807   0.003  0.163   0.000  3.359   0.023  12.473
2000    0.171  2.995     0.005  3.450   0.007  0.188   0.000  4.968   0.023  16.581
5000    2.122  4.511     0.016  4.798   0.016  0.285   0.001  8.265   0.025  20.912
10000   10.485  5.928    0.036  5.851   0.041  0.424   0.002  12.538  0.027  22.882
20000   43.883  7.318    0.075  7.131   0.088  0.588   0.003  20.578  0.031  24.776
50000   274.946  10.481  0.194  10.785  0.245  0.887   0.009  44.447  0.048  27.745
100000  1200.034  13.597  0.399  14.289  0.499  1.196  0.019  88.054  0.075  30.304


Table 3.10: Preprocessing and searching time of the algorithms for the FAA database with |Σ| = 20 and m = 8 (sec)

AC SH SBOM WM SOG

d Preproc. Searching Preproc. Searching Preproc. Searching Preproc. Searching Preproc. Searching

100     0.000  0.075    0.000  0.109  0.000  0.026   0.000  0.040  0.007  0.028
200     0.000  0.093    0.000  0.139  0.000  0.031   0.000  0.082  0.007  0.028
500     0.002  0.113    0.000  0.144  0.000  0.041   0.000  0.160  0.007  0.028
1000    0.009  0.132    0.000  0.158  0.001  0.049   0.000  0.223  0.007  0.029
2000    0.036  0.172    0.001  0.204  0.002  0.058   0.000  0.32   0.008  0.036
5000    0.389  0.242    0.003  0.278  0.005  0.102   0.000  0.513  0.009  0.156
10000   1.692  0.330    0.009  0.396  0.011  0.158   0.000  0.773  0.010  0.508
20000   6.646  0.594    0.020  0.578  0.023  0.218   0.001  1.266  0.014  0.957
50000   39.104  1.031   0.057  0.937  0.072  0.317   0.002  2.771  0.027  1.391
100000  161.813  1.307  0.118  1.243  0.126  0.443   0.005  5.286  0.052  1.631

Table 3.11: Preprocessing and searching time of the algorithms for the FAA database with |Σ| = 20 and m = 32 (sec)

AC SH SBOM WM SOG

d Preproc. Searching Preproc. Searching Preproc. Searching Preproc. Searching Preproc. Searching

100     0.001  0.080     0.000  0.116  0.000  0.007   0.000  0.032  0.023  0.289
200     0.002  0.098     0.000  0.137  0.000  0.008   0.000  0.073  0.023  0.360
500     0.013  0.127     0.001  0.159  0.002  0.011   0.000  0.151  0.023  0.552
1000    0.049  0.159     0.002  0.186  0.004  0.017   0.000  0.220  0.023  0.768
2000    0.210  0.213     0.006  0.238  0.009  0.028   0.000  0.323  0.023  1.015
5000    2.309  0.351     0.017  0.346  0.026  0.044   0.000  0.519  0.025  1.287
10000   10.441  0.475    0.037  0.436  0.063  0.058   0.002  0.786  0.027  1.398
20000   43.032  0.603    0.077  0.603  0.124  0.088   0.004  1.282  0.031  1.515
50000   270.579  0.963   0.201  1.117  0.368  0.165   0.009  2.824  0.048  1.692
100000  1194.956  1.278  0.417  1.659  0.777  0.288   0.019  5.934  0.075  1.854


Table 3.12: Preprocessing and searching time of the algorithms for an English alphabet input string with |Σ| = 128 and m = 8 (sec)

AC SH SBOM WM SOG

d Preproc. Searching Preproc. Searching Preproc. Searching Preproc. Searching Preproc. Searching

100     0.000  0.014   0.000  0.014  0.000  0.005   0.000  0.004  0.007  0.005
200     0.000  0.017   0.000  0.020  0.000  0.006   0.000  0.007  0.007  0.005
500     0.003  0.021   0.000  0.026  0.000  0.008   0.000  0.015  0.007  0.005
1000    0.009  0.027   0.000  0.032  0.001  0.011   0.000  0.023  0.007  0.006
2000    0.034  0.039   0.002  0.049  0.004  0.024   0.000  0.040  0.008  0.014
5000    0.217  0.069   0.005  0.097  0.011  0.062   0.000  0.074  0.009  0.044
10000   0.688  0.125   0.011  0.186  0.016  0.112   0.000  0.106  0.011  0.078
20000   2.089  0.167   0.022  0.243  0.037  0.160   0.001  0.150  0.015  0.121
50000   8.489  0.226   0.050  0.350  0.085  0.247   0.003  0.258  0.028  0.177
100000  25.820  0.262  0.096  0.437  0.158  0.345   0.006  0.413  0.052  0.226

Table 3.13: Preprocessing and searching time of the algorithms for an English input string with |Σ| = 128 and m = 32 (sec)

AC SH SBOM WM SOG

d Preproc. Searching Preproc. Searching Preproc. Searching Preproc. Searching Preproc. Searching

100     0.001  0.014    0.000  0.011  0.000  0.001   0.000  0.001  0.023  0.042
200     0.004  0.018    0.000  0.016  0.001  0.002   0.000  0.003  0.023  0.043
500     0.018  0.024    0.002  0.034  0.005  0.006   0.000  0.011  0.024  0.047
1000    0.058  0.035    0.006  0.054  0.011  0.018   0.000  0.022  0.024  0.054
2000    0.220  0.054    0.013  0.094  0.024  0.038   0.000  0.039  0.024  0.069
5000    1.520  0.089    0.033  0.233  0.057  0.078   0.001  0.073  0.025  0.109
10000   5.533  0.138    0.065  0.605  0.117  0.130   0.002  0.108  0.028  0.158
20000   20.077  0.190   0.128  1.028  0.230  0.218   0.004  0.165  0.032  0.218
50000   111.898  0.230  0.313  1.533  0.566  0.421   0.010  0.331  0.049  0.290
100000  159.867  0.298  0.453  1.636  1.034  0.819   0.022  0.456  0.067  0.360


Chapter 4

Two-dimensional Pattern Matching

using Multiple Pattern Matching

Algorithms

4.1 Introduction

The expansion of imaging, graphics and multimedia often requires the extension of string matching to more than one dimension, leading to the two- and multi-dimensional pattern matching problems where input strings and patterns of two or more dimensions are involved. The two-dimensional pattern matching problem can be defined as:

Definition. Given an input string T = t_{0,0} . . . t_{n1−1,n2−1} of size n1 × n2 and a pattern P = p_{0,0} . . . p_{m1−1,m2−1} of size m1 × m2 over a finite character set Σ, the task is to find all occurrences of the pattern in the input string. More formally, find all pairs (i1, i2) where 0 ≤ i1 < n1 − m1 + 1 and 0 ≤ i2 < n2 − m2 + 1 such that for all (j1, j2) where 0 ≤ j1 < m1 and 0 ≤ j2 < m2 it holds that t_{i1+j1,i2+j2} = p_{j1,j2}.
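The definition translates directly into the following naive check, given here only as a minimal C sketch with row-major arrays and an illustrative function name; the algorithms discussed in this chapter improve on this straightforward approach.

#include <stdio.h>

/* Report every position (i1, i2) where the m1 x m2 pattern p occurs in the
   n1 x n2 input string t, by direct character comparison. */
void match2d_naive(const char *t, int n1, int n2,
                   const char *p, int m1, int m2)
{
    for (int i1 = 0; i1 + m1 <= n1; i1++)
        for (int i2 = 0; i2 + m2 <= n2; i2++) {
            int ok = 1;
            for (int j1 = 0; ok && j1 < m1; j1++)
                for (int j2 = 0; ok && j2 < m2; j2++)
                    if (t[(i1 + j1) * n2 + (i2 + j2)] != p[j1 * m2 + j2])
                        ok = 0;
            if (ok)
                printf("occurrence at (%d, %d)\n", i1, i2);
        }
}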

The two-dimensional pattern matching problem has many applications, especially in image processing. It is used for content based information retrieval from image databases, image analysis and medical diagnostics. It is also used by some methods of detecting edges, where a set of edge detectors is matched against a picture, and by some OCR systems [74]. The use of pattern matching is also important in areas like motion analysis, cartography, aerial photography, remote sensing, document analysis and object tracking.

The solution to the two-dimensional pattern matching problem differs if the algorithm used is on-line, where the input string is not known in advance and only the pattern is preprocessed, or off-line, with both the pattern and the input string being preprocessed. This chapter focuses on on-line algorithms and on the exact case where the pattern must be located as is (i.e. without approximation) in the input string. The algorithms presented in this chapter can easily handle rectangular arrays where each dimension has a different size, but for the sake of simplicity only square arrays are considered for both input string and pattern, where n1 = n2 = n and m1 = m2 = m. For data that involve irregular shapes or for problems that include approximation [10], rotation [7, 34] or scaling [8], different algorithms can be used.

Baker and Bird [14, 15] and Baeza-Yates and Regnier [11] are regarded to be among the most


efficient two-dimensional pattern matching algorithms. They both handle the two-dimensional pattern matching problem as a special case of multiple pattern matching, where the two-dimensional pattern array is considered as a finite set of different patterns. To preprocess the pattern and locate all of its occurrences in the input string, the automaton of the Aho-Corasick algorithm is used, a data structure that is considered by many as inefficient (i.e. [27] refers to the efficiency of Aho-Corasick as not entirely satisfactory). This chapter presents efficient variants of the Baker and Bird [14, 15] and the Baeza-Yates and Regnier [11] algorithms that use the Set Horspool, Wu-Manber, Set Backward Oracle Matching and SOG multiple pattern matching algorithms in place of the Aho-Corasick algorithm to perform two-dimensional pattern matching. The Karp-Rabin [46] and the Zhu and Takaoka [100] algorithms were also considered but were omitted as previous studies (i.e. [50]) showed that they were not as efficient for two-dimensional pattern matching when compared to Baker and Bird and Baeza-Yates and Regnier.

4.2 Related Work

A survey of algorithms for exact, approximate, scaled and compressed two-dimensional pattern matching was presented in [5, 28]. The Baeza-Yates and Regnier algorithm was introduced in [11] and its running time was compared to the Brute Force, Zhu and Takaoka and Baker and Bird algorithms for a randomly generated pattern and input string with a binary alphabet. The pattern had a size of m = 2 to 20 while the input string had a size of n = 1,000. For the specific problem parameters it was shown that Baeza-Yates and Regnier was the fastest algorithm for all pattern sizes except for m = 2, where the Brute Force algorithm had the best performance. The same paper also suggested that the use of a multiple pattern matching algorithm based on Boyer-Moore instead of Aho-Corasick should result in the improvement of the searching phase of the Baeza-Yates and Regnier algorithm. This motivated us to introduce the variants of the Baker and Bird and the Baeza-Yates and Regnier algorithms presented in this chapter. Details on the Baker and Bird algorithm were given in [27] while Tarhio introduced in [85] a two-dimensional pattern matching algorithm based on Boyer-Moore and compared it to the Naive and the Baeza-Yates and Regnier algorithms for a binary alphabet data set where the pattern had a size of m = 2 to 64 and the input string had a size of n = 1,000. In that paper it was shown that the Brute Force algorithm was the best for a pattern of a size up to m = 5 while for larger pattern sizes, the newly introduced algorithm was preferable. The subject of exact, scaled and approximate two-dimensional pattern matching was discussed in [9]. The Baker and Bird and the Baeza-Yates and Regnier algorithms were discussed in [84]. Finally, a variant of the Baker and Bird algorithm that uses solely finite automata was introduced in [97].

4.3 Two-dimensional Pattern Matching Algorithms

Based on the way the search is performed, the two-dimensional algorithms can generally be classified into one of the four following approaches.

• Classical algorithms The classical algorithms use a search window that slides through successive positions of the input string from the top left to the bottom right corner, comparing the pattern to the input string on a character by character basis. The Brute Force or Naive algorithm can locate all the occurrences of a two-dimensional pattern in a two-dimensional input string in O(m^2 n^2) time in the worst case since it has to scan every possible position of the input string to locate all the occurrences of the pattern, although in


practice it is much faster. The Knuth-Morris-Pratt algorithm, as presented in section 2.2.1, can easily be adapted to handle two-dimensional text and pattern arrays.

• Automaton algorithms The automaton-based algorithms use a trie data structure during the preprocessing phase in order to efficiently store and retrieve information on the pattern. Baker and Bird independently gave a linear time algorithm with O(n^2 + m^2) worst and average theoretical time that uses the automaton of the Aho-Corasick algorithm. The Amir-Landau-Vishkin [8] algorithm uses a suffix tree to achieve the same theoretical performance. Baeza-Yates and Regnier [11] presented the first on-line sub-linear algorithm that also uses Aho-Corasick, for an O(2n^2/m) worst and O(n^2/m) average theoretical search time. Polcar [74]

introduced an algorithm based on two-dimensional online tessellation automata with a linearmatching phase complexity that does not depend on the size of the pattern but has anexponential preprocessing phase in O(|Σ|23m

2).

• Hashing algorithms The algorithms following this approach use hashing on both the input string and the pattern to speed up the searching phase and reduce their memory footprint. Karp and Rabin use a hashing technique called rolling hash to successively calculate the hash value at each position of the input string to achieve an O(n² + m²) running time on average. Zhu and Takaoka introduced two algorithms that use the hash function method proposed by Karp and Rabin vertically and then scan the input string for the occurrences of the pattern using a search function that is implemented with either the Knuth-Morris-Pratt or the Boyer-Moore algorithm in sub-linear time.

• Periodicity algorithms The algorithms that take advantage of the notion of periodicity construct a witness table during the preprocessing phase. This table is then used during the search phase to eliminate a number of positions of the input string as possible matches. [27] contains information on the concepts of periodicity, sampling and duels. Amir, Benson and Farach [6] presented an alphabet independent algorithm that uses the concept of two-dimensional periodicity for the input processing. Galil and Park [36] developed a linear time algorithm which is alphabet independent and is based on two-dimensional periodicity for both text and pattern processing. Crochemore et al. [26] presented a sampling technique that works in linear time for almost all patterns. Ukkonen and Karkkainen [45] showed that the lower bound for expected running time is O(n² log_|Σ|(m²)/m²) and presented an optimal algorithm which achieves this bound with an additional O(m²) for preprocessing.

4.3.1 Baker and Bird

Baker and Bird utilizes the Aho-Corasick and Knuth-Morris-Pratt algorithms to perform two-dimensional pattern matching in worst and average-case linear time. The general idea behind the algorithm is to construct the trie of Aho-Corasick from the pattern rows and then assign an index to each distinct row. During the preprocessing phase of the algorithm, the trie of the Aho-Corasick algorithm is built from each pattern row p0, p1, . . . , pm−1 ∈ P and a unique index is assigned to each terminal state q. Consequently, the two-dimensional pattern P can be reduced to a one-dimensional array P′ of indices. Array P′ is then preprocessed using the Knuth-Morris-Pratt algorithm so that the longest border v of each suffix of P′ can be computed. This information is stored in a next table.



ALGORITHM 4.1: The row-matching step of the Baker and Bird algorithm

    head := q0
    for k := 0; k < n; k := k + 1 do
        a[k] := 0
    end
    for j := 0; j < n; j := j + 1 do
        r := head
        for k := 0; k < n; k := k + 1 do
            while (s := g(r, t[j, k])) = fail do
                r := Supply(r)
            end
            r := s
            if Output(r) is not empty then
                Perform the column-matching step using the code from Algorithm listing 4.2
            else
                a[k] := 0
            end
        end
    end

To locate all the occurrences of the pattern in the input string, two distinct steps are used: row-matching and column-matching. During the row-matching step, the input string is scanned horizontally. For each position j, k the state of the Aho-Corasick trie that corresponds to the longest suffix uk of characters tj,0 . . . tj,k is determined. If that state is terminal, then a pattern row pr matches with uk. For every such position it must be determined if pattern rows p0, . . . , pr−1 also occur at positions tj−r,k−m+1 . . . tj−r,k, . . . , tj−1,k−m+1 . . . tj−1,k. This is done in the column-matching step with the use of the Knuth-Morris-Pratt algorithm vertically. For this purpose, a table a of size n is maintained. The table is used to store values for each of the n columns of the input string, such that when a[k] = r, pattern rows 0, . . . , r − 1 exist immediately above the current. If a[k] = r and the suffix of the input string at position j, k matches with pattern row pr then a[k] is set to r + 1. If the input string suffix does not match with pr, a[k] is set to s + 1 where s is the longest border of P′0, . . . , P′r−1 as determined based on the values of the next table. A complete match of the pattern is found with its top left corner at position j − m + 1, k − m + 1 of the input string when a[k] ≥ m − 1 at the end of the column-matching step. If the row-matching step does not locate an occurrence of any pattern row at a column k of the input string then a[k] is set to 0. Algorithm listings 4.1 and 4.2 depict the row-matching and column-matching steps of the search phase of Baker and Bird. The implementation used for the experiments of this chapter is based on code from [86].



ALGORITHM 4.2: The column-matching step of the Baker and Bird algorithm

    i := a[k]
    while i > 0 AND pi[0 . . . m−1] ≠ Output(r) do
        i := next[i]
    end
    if i + 1 < m then
        a[k] := i + 1
    else
        a[k] := 0
    end
    if i >= m − 1 then
        report match at j − m + 1, k − m + 1
    end

Baker and Bird uses O(n + |Σ|m²) extra space to store tables a and P′ and the trie of the Aho-Corasick algorithm. The trie is built in O(|Σ|m²) time during the preprocessing phase, as discussed previously. Array P′ is preprocessed using the Knuth-Morris-Pratt algorithm in O(m) time. The algorithm requires an O(n²) theoretical searching time in the worst and average case to scan the input string.
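To make this preprocessing step concrete, the following C sketch assigns an index to each distinct pattern row and builds the Knuth-Morris-Pratt next table over the resulting index array P′. It is only an illustration: rows are compared directly with memcmp instead of through the Aho-Corasick trie, and the names build_row_indices, build_next and pidx are assumptions of this sketch rather than identifiers from the thesis implementation.

    #include <string.h>

    /* Assign an index to each of the m pattern rows (m characters each):
     * identical rows receive the same index, distinct rows receive new ones. */
    void build_row_indices(char **p, int m, int *pidx)
    {
        int next_index = 0;
        for (int r = 0; r < m; r++) {
            pidx[r] = -1;
            for (int q = 0; q < r; q++)
                if (memcmp(p[q], p[r], (size_t)m) == 0) {
                    pidx[r] = pidx[q];      /* identical rows share an index */
                    break;
                }
            if (pidx[r] == -1)
                pidx[r] = next_index++;     /* first occurrence of this row  */
        }
    }

    /* Optimized KMP next table over the index array P'; 'next' needs m + 1
     * entries and follows the convention next[0] = -1 used in the example. */
    void build_next(const int *pidx, int m, int *next)
    {
        next[0] = -1;
        int i = 0, v = -1;
        while (i < m) {
            while (v > -1 && pidx[i] != pidx[v])
                v = next[v];
            i++; v++;
            next[i] = (i < m && pidx[i] == pidx[v]) ? next[v] : v;
        }
    }

For the example pattern used below, build_row_indices produces P′ = [0, 1, 0, 1, 2] and build_next produces −1, 0, −1, 0, 2, 0, matching the next table of the worked example.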

Consider the following example; pattern P is used during the preprocessing phase to create the trie of the Aho-Corasick algorithm:

P =

A A C C A

A A A G G

A A C C A

A A A G G

A A A A C

As can be seen from Figure 4.1, L(5) = p0, L(8) = p1 and L(10) = p4. Index 0 can be assigned to pattern p0, index 1 can be assigned to pattern p1 and index 2 can be assigned to pattern p4. Note that p2 = p0 and p3 = p1 and therefore they share the same index. Hence, the two-dimensional pattern P can be reduced to a one-dimensional array of indices P′:

P′ = [0, 1, 0, 1, 2]

The array P′ is then preprocessed using the Knuth-Morris-Pratt algorithm to compute the next table:

    i       =  0   1   2   3   4   5
    P′      =  0   1   0   1   2
    next[i] = -1   0  -1   0   2   0

The next table can also be represented as a finite automaton, as depicted in Figure 4.2. State 0 is the initial state, each character of P′ labels the path between two different states of the automaton while L(q) is the label of the path between the initial and a given state q.


Figure 4.1: The automaton of the Aho-Corasick algorithm for the example pattern P

The solid lines represent the success links that are used to visit the next node of the automaton after a successful match of a pattern character, similar to the goto function of the Aho-Corasick algorithm. The dashed lines represent the supply links that are used to return to a previous state of the automaton after a mismatch between a pattern character and an input string character.

Figure 4.2: The automaton of the Knuth-Morris-Pratt algorithm for the row indices of P ′

During the row-matching step, each row of the input string is scanned horizontally, starting from t0,0. At each position j, k the state of the trie that corresponds to the longest suffix of tj,0 . . . tj,k is determined. If this state is terminal, the index that corresponds to the matching pattern row is assigned to the suffix of the input string. In any other case, −1 is assigned. Subsequently, the input string T can be converted to T′:

T =

A A C C A A A C C

A A A G G A A A G

A A C C A A A C C

A A A G G A A A G

A A C C A A A C C

A A A G G A A A G

A A C C A A A C C

A A A G G A A A G

A A A A C A A A A

T ′ =

-1 -1 -1 -1 0 -1 -1 -1 -1

-1 -1 -1 -1 1 -1 -1 -1 -1

-1 -1 -1 -1 0 -1 -1 -1 -1

-1 -1 -1 -1 1 -1 -1 -1 -1

-1 -1 -1 -1 0 -1 -1 -1 -1

-1 -1 -1 -1 1 -1 -1 -1 -1

-1 -1 -1 -1 0 -1 -1 -1 -1

-1 -1 -1 -1 1 -1 -1 -1 -1

-1 -1 -1 -1 2 -1 -1 -1 -1

Note that the index for each suffix of the input string is calculated on-line and thus table T′ is not stored in memory. At position t0,4 of the input string, the Baker and Bird algorithm determines that a pattern row occurs as a suffix of t0,0 . . . t0,4 since index 0 was encountered. As the value of a[4] was initially set to 0 and an occurrence of a pattern row was found, the value of a[4] is incremented to 1. By using the goto and the supply functions of the trie, the indices for the subsequent input string positions are calculated until index 1 is encountered at position t1,4. Since the value of a[k] was previously set to 1 and p1 is equal to the m-character suffix of t1,0 . . . t1,4, the value of a[4] is incremented to 2.


The above procedure is repeated for positions t2,4 and t3,4 of the input string where the value of a[4] is set to 3 and 4 respectively. At position t4,4, index 0 is encountered but p4 does not match with the suffix of t4,0 . . . t4,4. Based on the information of the next table, the longest border v of P′0, . . . , P′3 is P′0P′1, therefore P′2 can be aligned with the input string at position t4,4. The algorithm continues until it is determined that a complete match of the pattern occurs ending at position t8,4 of the input string. The values of table a for column 4 of the specific input string will be:

a[4] = 1, 2, 3, 4, 3, 4, 3, 4, 5 (one value per input string row, from row 0 to row 8)

Since the value of a[4] = 5 at row 8 of the input string, a complete match of P is reported at t4,0 . . . t4,4, . . . , t8,0 . . . t8,4.
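A compact C rendering of the column-matching update described above is sketched below; it processes a single text position (j, k), assuming pidx and next were computed as in the previous sketch and that row_index holds the index of the pattern row matched at (j, k) by the row-matching step, or −1 if none. The bookkeeping after a complete match is simplified with respect to Algorithm listing 4.2, so this is an illustration rather than the thesis code.

    #include <stdio.h>

    /* a[k] counts how many pattern rows already match vertically, ending at
     * the row above the current one, for column k of the input string. */
    void column_match(int *a, const int *pidx, const int *next, int m,
                      int j, int k, int row_index)
    {
        if (row_index < 0) {                 /* no pattern row ends at (j, k) */
            a[k] = 0;
            return;
        }
        int i = a[k];
        while (i > 0 && pidx[i] != row_index)
            i = next[i];                     /* fall back to the longest border */
        if (i < 0)
            i = 0;
        if (pidx[i] == row_index)
            i++;                             /* extend the vertical match      */
        if (i == m) {                        /* all m pattern rows matched     */
            printf("match at (%d, %d)\n", j - m + 1, k - m + 1);
            i = next[m];                     /* continue with the longest border */
            if (i < 0)
                i = 0;
        }
        a[k] = i;
    }

Feeding it the row indices of column 4 of the example (0, 1, 0, 1, 0, 1, 0, 1, 2) reproduces the successive values of a[4] listed above, apart from the final bookkeeping after the match, and reports a match with its top left corner at (4, 0).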

4.3.2 Baeza-Yates and Regnier

Baeza-Yates and Regnier is similar to the Baker and Bird algorithm. It uses the Aho-Corasick algorithm to create a trie from the pattern rows and assign an index to each distinct row. The input string is then scanned horizontally for the occurrences of the pattern indices. A significant difference though is that during the search phase only ⌊n/m⌋ primary rows of the input string are scanned, since they cover all possible positions where a pattern row may occur. If a match is found on a primary row, then m characters on each of the m − 1 secondary rows are also scanned for the pattern indices using the Aho-Corasick algorithm, to determine if a complete match occurs.

During the preprocessing phase of the algorithm, the trie of the Aho-Corasick algorithm is built from each pattern row p0, p1, . . . , pm−1 ∈ P and a unique index is assigned to each terminal state q. Consequently, the two-dimensional pattern P can be reduced to a one-dimensional array P′ of indices. The indices assigned to the terminal states are accessible through an Id() function.

For each position j, k of a primary row of the input string, the algorithm uses the Aho-Corasick trie to determine the state that corresponds to the longest suffix uk of tj,0 . . . tj,k as depicted in Algorithm listing 4.3. Note that the search of the primary rows of the input string is identical to the row-matching step of the Baker and Bird algorithm, as shown in Algorithm listing 4.1, with the difference that it is performed on only ⌊n/m⌋ rows. If this state is terminal, pattern row pr occurs in the input string at that position. Then it must be determined if pattern rows p0, . . . , pr−1 occur at positions tj−r,k−m+1 . . . tj−r,k, . . . , tj−1,k−m+1 . . . tj−1,k and pr+1, . . . , pm−1 at positions tj+1,k−m+1 . . . tj+1,k, . . . , tj+m−r−1,k−m+1 . . . tj+m−r−1,k of the input string using again the Aho-Corasick trie, as presented in Algorithm listing 4.4. When two or more rows of the pattern are identical (for example pr = pl), scanning on the secondary rows of the input string should be repeated for each of the identical rows. In the worst case (where p0 = . . . = pm−1), 2m − 1 possible rows must be scanned (the primary row, m − 1 secondary rows above and m − 1 secondary rows below the primary). In that case, the search on each position of the input string would be O(m³) for a total cost of O(n²m²) for the algorithm. To improve the efficiency of the vertical search, a table b of size 2m − 1 is used to store and reuse the indices of the primary and the 2m − 2 secondary rows.


ALGORITHM 4.3: The search of the Baeza-Yates and Regnier algorithm on the primary rows

    head := q0
    for j := m − 1; j < n; j := j + m do
        r := head
        for k := 0; k < n; k := k + 1 do
            while (s := g(r, t[j, k])) = fail do
                r := Supply(r)
            end
            r := s
            if Output(r) is not empty then
                if P′[m − 1] = m − 1 then
                    Search on the secondary rows when all pattern rows are unique using the code from Algorithm listing 4.4
                else
                    Search on the secondary rows when duplicate pattern rows exist using the code from Algorithm listing 4.5
                end
            end
        end
    end

ALGORITHM 4.4: The search of the Baeza-Yates and Regnier algorithm on the secondary rows when all pattern rows are unique

    head := q0
    for tt := j − index, pp := 0; pp < m; tt := tt + 1, pp := pp + 1 do
        if pp = index then
            continue
        end
        r := head
        for c := k; c ≤ k + m − 1; c := c + 1 do
            while (s := g(r, t[tt, c])) = fail do
                r := Supply(r)
            end
            r := s
        end
        if Id(r) ≠ P′[pp] then
            return
        end
    end
    report match at j − index + 1, k


That way, the input string is scanned only for the occurrence of the distinct rows of the pattern, as can be seen in Algorithm listing 4.5.

ALGORITHM 4.5: The search of the Baeza-Yates and Regnier algorithm on the secondary rows when duplicate pattern rows exist

    head := q0
    for i := 0; i < 2m − 1; i := i + 1 do
        b[i] := −1
    end
    b[m − 1] := index
    for i := 0; i < m; i := i + 1 do
        if P′[i] ≠ index then
            continue
        end
        for d := j − i, w := m − 1 − i; d ≤ j − i + m − 1 AND d < n; d := d + 1, w := w + 1 do
            if b[w] = P′[d − j + i] then
                continue
            end
            r := head
            for c := k; c ≤ k + m − 1; c := c + 1 do
                while (s := g(r, t[d, c])) = fail do
                    r := Supply(r)
                end
                r := s
            end
            b[w] := Output(r)
            if b[w] ≠ P′[d − j + i] then
                break
            end
        end
        if d = j − i + m then
            report match at d − m, k
        end
    end

Baeza-Yates and Regnier uses O(|Σ|m²) extra space to store the trie of the Aho-Corasick algorithm, O(m) extra space for table b and O(m) extra space for array P′. The trie is built in O(|Σ|m²) time during the preprocessing phase. The algorithm uses O(n²/m) time in the worst case to scan all characters of the primary rows and O(m²) for all characters of the secondary rows, for a searching cost of O(n²m). On average, the probability that a row of the pattern is matched at a given position of the input string is given by m/|Σ|^m, with an upper bound of n²/m for the scanning of the primary rows and a total search phase cost of n²m²/|Σ|^m. Since m²/|Σ|^m < 4/m for each m ≥ 1 and |Σ| ≥ 2, the search phase complexity of the Baeza-Yates and Regnier algorithm is O(n²/m) in the average case. This chapter uses an implementation based on code from the Xaa project by Baeza-Yates and Fuentes [12].

Consider the following example with the same pattern P as the one used in Figure 4.1. As can be seen from the Figure, L(5) = p0, L(8) = p1 and L(10) = p4.


Index 0 can be assigned to pattern p0, index 1 can be assigned to pattern p1 and index 2 can be assigned to pattern p4. Remember that since p2 = p0 and p3 = p1 they share the same index. Hence, the two-dimensional pattern P can be reduced to a one-dimensional array of indices P′ and subsequently the input string T can be converted to T′. As with the Baker and Bird example, the index for each suffix of the input string is calculated on-line and therefore table T′ is not stored in memory. The values of P, P′, T and T′ are identical to the respective tables as given in section 4.3.1.

The Baeza-Yates and Regnier algorithm starts scanning from position t4,0 of the input string. At t4,4, index 0 is encountered. Pattern rows p0 and p2 share the same index since they are identical and thus table b is used. That way it is ensured that pattern rows will be compared to substrings of the input string at most 2m − 1 times. The initial value of b[4] is set to the index of the matching pattern row. Pattern row p0 is aligned next with the 4th row of the input string. In that case, the indices of the input string suffixes t4,0 . . . t4,4, . . . , t8,0 . . . t8,4 are calculated and compared with the indices of pattern rows p0 to p4. Since b[4] = P′[0], the index of the suffix of the input string at positions t4,0 . . . t4,4 can be reused. The indices of the input string suffixes at positions t5,0 . . . t5,4, . . . , t8,0 . . . t8,4 are calculated and stored in b[5], . . . , b[8]. Since all indices of b now match with the indices stored in P′ it is determined that a complete match occurs:

b = [−1, −1, −1, −1, 0, −1, −1, −1, −1] initially, and b = [−1, −1, −1, −1, 0, 1, 0, 1, 2] after the secondary rows of this alignment have been scanned, so that b[4], . . . , b[8] match P′ = [0, 1, 0, 1, 2].

Pattern row p2 is then aligned with the fifth row of the input string. The indices of the input string suffixes t2,0 . . . t2,4, . . . , t6,0 . . . t6,4 are calculated and compared with the indices of pattern rows p0 to p4. The indices of input string suffixes t2,0 . . . t2,4 and t3,0 . . . t3,4 were not determined before, therefore they are calculated and stored in b[2] and b[3]. The values of b[4] and b[5] were calculated in the previous step and since b[4] = P′[2] and b[5] = P′[3] the indices of input string suffixes t4,0 . . . t4,4 and t5,0 . . . t5,4 can be reused. The value of b[6] was previously set but is different from P′[4], therefore it is calculated again using the trie of Aho-Corasick. Since the new value of b[6] is equal to 0 and P′[4] = 2, it is determined that a mismatch occurs between the pattern and the input string:

b = [−1, −1, −1, −1, 0, 1, 0, 1, 2] from the previous step; b[2] and b[3] are now set to 0 and 1, while the recomputed value b[6] = 0 differs from P′[4] = 2.


4.4 Algorithm Variants

This section discusses implementation details for the variants of the Baker and Bird and the Baeza-Yates and Regnier two-dimensional pattern matching algorithms and highlights their differences to the original implementations. The analysis of the algorithms assumes that the pattern set consists of m patterns where each pattern has a size of m characters for a total size of m² characters.

4.4.1 Two-dimensional Pattern Matching using Set Horspool

Using the Set Horspool trie instead of the Aho-Corasick trie in the Baker and Bird algorithm is fairly straightforward. During the preprocessing phase, the trie is built from each pattern row p0, p1, . . . , pm−1 ∈ P in reverse and a unique index is assigned to each terminal state. Similar to the original implementation, the indices are used by the Knuth-Morris-Pratt algorithm to create a next table that is then utilized during the column-matching step of Baker and Bird. Moreover, the bad character shift is computed for each different character σ ∈ Σ. In the row-matching step, each row j of the input string is scanned backwards from character tj,m−1 to tj,n−1. For each position k of the input string row, the algorithm computes the longest suffix u of tj,0 . . . tj,k that is also a suffix of any pattern. When a complete match is found or a mismatch occurs between character σ of the input string and α of the trie, the trie is shifted to the right according to character β at position k of the input string until β is aligned with the next state of the trie that is labeled after β. If no such β exists, the trie is shifted to the right by m positions. The search phase of the Set Horspool variant of the Baker and Bird algorithm is O(n²m) in the worst case. Since on average the searching phase of the Set Horspool algorithm is sub-linear when used for multiple pattern matching, the Set Horspool variant of the Baker and Bird algorithm is expected to be O(Kn), where K is a parameter such that K < n.

Replacing the Aho-Corasick trie with the Set Horspool trie in the Baeza-Yates and Regnier algorithm is also straightforward. The trie is built during the preprocessing phase from each pattern row p0, p1, . . . , pm−1 ∈ P in reverse and the two-dimensional pattern P is reduced to a one-dimensional array P′ of indices. Additionally, the bad character shift is computed for each different character σ ∈ Σ. The scanning on each primary row j of the input string is performed backwards, starting from character tj,m−1. Then, for each position k of the primary row, the Set Horspool trie is used to determine the longest suffix u of tj,0 . . . tj,k that is also a suffix of any pattern from the pattern set. When a complete match or a mismatch occurs, the trie is shifted to the right by a number of positions based on the bad character shift function. If pattern row pr occurs in a primary row of the input string, the trie of Set Horspool is also used backwards in the secondary rows to determine if pattern rows p0, . . . , pr−1 and pr+1, . . . , pm−1 occur immediately above and below the current row. The search phase of the Set Horspool variant of the Baeza-Yates and Regnier algorithm is O(n²m²) in the worst case and O(Kn/m) in the average.
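As an illustration of the bad character shift used by the Set Horspool variants, the following C sketch builds the shift table for a set of m pattern rows of m characters each; it assumes a byte alphabet, and the names build_bad_char_shift, rows and shift are illustrative rather than the thesis' own.

    /* Bad character shift for a set of pattern rows: for every alphabet symbol,
     * store how far the search window may safely be shifted when that symbol is
     * read. A symbol that does not occur in any row allows a shift of m. */
    void build_bad_char_shift(unsigned char **rows, int m, int shift[256])
    {
        for (int c = 0; c < 256; c++)
            shift[c] = m;                        /* symbol absent from all rows */
        for (int r = 0; r < m; r++)
            for (int i = 0; i < m - 1; i++) {    /* the last character keeps m  */
                int s = m - 1 - i;               /* distance to the window end  */
                if (s < shift[rows[r][i]])
                    shift[rows[r][i]] = s;       /* keep the smallest safe shift */
            }
    }

With a binary alphabet almost every symbol occurs one position before the end of some row, so the computed shifts collapse towards 1 and the sub-linear behavior is largely lost, which is consistent with the observations of section 4.5.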

4.4.2 Two-dimensional Pattern Matching using Set Backward Oracle Matching

To adapt the Baker and Bird algorithm so that it can use the factor oracle of the Set Backward Oracle Matching algorithm, it must be taken into consideration that a terminal state q could correspond to no pattern or to multiple patterns at once. During the preprocessing phase, the factor oracle is created from the patterns in P in reverse and a unique index is assigned to each distinct pattern pr. The indices are processed by the Knuth-Morris-Pratt algorithm to create a next table. In the row-matching step, each row j of the input string is scanned backwards, starting from character tj,m−1.


If during row-matching a terminal state q of the oracle is reached at position j, k of the input string, all patterns that F(q) points to are compared directly with the input string to determine the index to be assigned to the suffix of tj,0 . . . tj,k that corresponds to the matching pattern and the oracle is shifted by one position to the right. If no pattern corresponds to the suffix of the input string or a mismatch occurs between a character σ of the input string and α of the oracle at column k, the factor oracle is shifted past k. The search phase of the Set Backward Oracle Matching variant of the Baker and Bird algorithm is O(n²m²) in the worst case and O(Kn) in the average.

Baeza-Yates and Regnier was implemented using the Set Backward Oracle Matching algorithm as follows. During the preprocessing phase, the factor oracle is constructed from the pattern rows in reverse and the two-dimensional pattern is reduced to a one-dimensional array of indices. For each position k of a primary row j of the input string, the factor oracle is used to read backwards the longest suffix u of tj,0 . . . tj,k that is also a suffix of a pattern row. When a mismatch occurs at position j, k, the oracle can be safely skipped to the right past k. If a pattern row is matched in a primary row then the secondary rows are also scanned backwards using the factor oracle to determine if a complete match occurs and the factor oracle is then shifted by one position to the right. Since a terminal state q of the oracle could exist that does not correspond to any pattern or that corresponds to more than one pattern, any potential matches in the primary or the secondary rows must be verified directly with the input string using the set of indices F(q), similar to the Set Backward Oracle Matching implementation of the Baker and Bird algorithm. The theoretical searching time complexity of the Set Backward Oracle Matching implementation of the Baeza-Yates and Regnier algorithm is O(n²m³) in the worst case and O(Kn/m) in the average.

4.4.3 Two-dimensional Pattern Matching using Wu-Manber

Implementing Baker and Bird with the Wu-Manber algorithm requires significantly more effort than the Set Horspool and Set Backward Oracle Matching variants presented above. Rather than using a trie, the Wu-Manber variant of Baker and Bird uses a search window that slides along each row j of the input string. During the preprocessing phase, a hash value hs is computed for each pattern row p0, p1, . . . , pm−1 using the hash function of the Wu-Manber algorithm. Both the hash values and the pointers to the pattern rows they correspond to are stored in a hashmap that is then used to assign a unique index to each different pattern row in O(m²) time. Pattern rows with the same hash value are compared on a character by character basis to determine if they are identical, before they are assigned the same index. That way, the two-dimensional pattern P is reduced to a one-dimensional array P′ of indices. Moreover, the SHIFT, HASH and PREFIX tables of the Wu-Manber algorithm are built from the pattern rows. The array P′ is then preprocessed using the Knuth-Morris-Pratt algorithm to compute the values of the next table. During the row-matching step and for each position j, k of the input string, a hash value h of the B-character suffix of the search window at tj,k−m+1 . . . tj,k is computed backwards using the h1() function. If SHIFT[h] = 0, the hash value h′ of the B′-character prefix of the search window is also computed. If both the prefix and suffix of the search window match the prefix and suffix of one or more pattern rows, the corresponding rows are compared directly with the input string to determine if a match of one of the pattern rows occurs. The search window is then shifted by one position to the right. If, on the other hand, SHIFT[h] ≠ 0, the search window is subsequently shifted by SHIFT[h] positions. The Wu-Manber variant of the Baker and Bird algorithm is O(n²m² log_|Σ|(m²)) in the worst case and O(n² log_|Σ|(m²)/m) in the average.

The Baeza-Yates and Regnier algorithm also requires a unique index to be assigned to each different pattern row.


During the preprocessing phase, a hash value hs is calculated for each p ∈ P and, together with a pointer to the pattern it corresponds to, is stored in a hashmap. Each pattern row is then assigned an index in O(m²) time using the hashmap as already discussed. Finally, the SHIFT, HASH and PREFIX tables are computed. The search phase is performed backwards only on the primary rows of the input string. Starting from position m − 1 of a primary row j and for each character tj,k, the hash value h of the B-character suffix of the search window at tj,k−m+1 . . . tj,k is calculated. Based on the value of SHIFT[h], the search window is either shifted to the right by SHIFT[h] positions or its B′-character prefix is calculated. If both the prefix and suffix of the search window match the prefix and suffix of some pattern rows then the corresponding rows are compared directly with the input string. If there is a match on a primary row, then a hash value hs is computed for each of the m-character substrings of the at most 2m − 2 secondary rows and based on the hashmap it is determined if they correspond to the appropriate indices of the pattern rows. The search window is then shifted by one position to the right. The theoretical search phase complexity of the Baeza-Yates and Regnier algorithm when implemented using the Wu-Manber algorithm is O(n²m³ log_|Σ|(m²)) in the worst case and O(n² log_|Σ|(m²)/m²) in the average.
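The following C sketch illustrates the Wu-Manber filtering loop on a single row, assuming blocks of B = 2 characters, the simple block hash h1 shown below and a SHIFT table indexed by that hash; the verification of candidate positions (PREFIX and HASH tables, direct comparison of rows) is abstracted behind a verify callback. All names and the hash function are assumptions of this sketch, not the thesis implementation.

    #define B 2

    /* Hash of a B-character block; the real h1() of Wu-Manber may differ. */
    static unsigned h1(const unsigned char *x)
    {
        return ((unsigned)x[0] << 5) + x[1];
    }

    /* Scan one row of length n for pattern rows of length m. */
    void wm_scan_row(const unsigned char *row, int n, int m,
                     const int *SHIFT, void (*verify)(int))
    {
        int k = m - 1;                       /* right end of the search window */
        while (k < n) {
            unsigned h = h1(&row[k - B + 1]);
            if (SHIFT[h] == 0) {
                verify(k);                   /* check prefix and full rows     */
                k += 1;                      /* after a candidate, shift by one */
            } else {
                k += SHIFT[h];               /* no pattern row can end here    */
            }
        }
    }

The loop advances by SHIFT[h] positions whenever the block hash guarantees that no pattern row can end at the current column, which is where the sub-linear average behavior comes from.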

4.4.4 Two-dimensional Pattern Matching using SOG

The SOG variant of the Baker and Bird algorithm uses a search window that slides along each row of the input string instead of a trie, similar to the Wu-Manber variant presented above. During the preprocessing phase, a hash value h is assigned to each different B-character block of the patterns using the hash function h1() of the SOG algorithm. Additionally, a hash value hs is computed from each pattern row and a two-level hash value hs′ is created from hs and stored in an ordered table. The hash value hs for each pattern row and a pointer to the pattern row it corresponds to are stored in a hashmap that is then used to assign a unique index to each different pattern row in O(m²) time. To search for the occurrence of the patterns in the input string during the row-matching step, an m-bit variable E is used that is updated based on the hash value h of the B-character substrings of each row j of the input string. If the (m − B)th bit of E is equal to 0, a candidate match occurs that is verified by calculating the two-level hash of the search window and then searching the ordered table for its occurrence using binary search. When the Baker and Bird algorithm is implemented using the SOG algorithm, it has an O(n²m²) theoretical search phase complexity in the worst case and O(n²) in the average case.

The SOG variant of the Baeza-Yates and Regnier algorithm can be implemented in a similar way. During the preprocessing phase, the two-dimensional pattern P is reduced to a one-dimensional array P′ of indices by calculating the hash value hs of each pattern row and storing it in a hashmap together with a pointer to the pattern row it corresponds to. Moreover, a two-level hash value hs′ is created from hs and stored in an ordered table. Finally, a hash value h is assigned to each different B-character block of the pattern rows using a hash function h1(). The search phase is performed on only the primary rows as per the original Baeza-Yates and Regnier algorithm. For each position k of a primary row j, an (m − B + 1)-bit variable E is used that is updated based on the hash value h of each B-character substring of the input string. If the (m − B)th bit of E is equal to 0, a candidate match occurs that is verified by calculating the two-level hash of the search window and then searching the ordered table for its occurrence using binary search. If a match is found on a primary row, then E is updated based on the hash value h of each B-character substring for each of the secondary rows in a similar way. The searching phase complexity of the SOG variant of the Baeza-Yates and Regnier algorithm is the same as that of its original implementation: O(n²m³) in the worst case and O(n²/m) in the average.
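A minimal sketch of a SOG-style bit-parallel filter on one row is given below, assuming B = 3, m ≤ 32 and a precomputed table H in which bit b of H[h] is 0 if and only if some pattern row contains a block with hash h at offset b. The block hash and all names are assumptions of this sketch, and the verification (two-level hash plus binary search in the ordered table) is abstracted behind a verify callback.

    #include <stdint.h>

    void sog_scan_row(const unsigned char *row, int n, int m,
                      const uint32_t *H, void (*verify)(int))
    {
        const int B = 3;
        uint32_t E = ~0u;                          /* no candidate positions yet */
        for (int k = B - 1; k < n; k++) {
            /* illustrative 16-bit hash of the current B-character block */
            unsigned h = ((row[k - 2] * 31u + row[k - 1]) * 31u + row[k]) & 0xFFFFu;
            E = (E << 1) | H[h];                   /* advance the filter by one  */
            if (((E >> (m - B)) & 1u) == 0)        /* candidate row ending at k  */
                verify(k);                         /* two-level hash + bin. search */
        }
    }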

Tables 4.1 and 4.2 summarize the theoretical extra space, preprocessing and worst and average searching time complexity of the presented Baker and Bird and Baeza-Yates and Regnier algorithm implementations. The parameter K is defined such that K < n in the average case. The searching time of SOG corresponds to the time of the filtering and verification phases of the algorithm combined.


Table 4.1: Known theoretical extra space, preprocessing, worst and average searching time complexity of the Baker and Bird algorithm implementations

    Implementation                  Extra space                                      Preprocessing   Worst case                 Average case
    Aho-Corasick                    n + |Σ|m²                                        |Σ|m²           n²                         n²
    Set Horspool                    n + |Σ|m²                                        |Σ|m²           n²m                        Kn
    Set Backward Oracle Matching    n + |Σ|m²                                        |Σ|m²           n²m²                       Kn
    Wu-Manber                       n + m × sum_{i=0}^{B−1} |Σ| × (2^bitshift)^i     m²              n² log_|Σ|(m²) m(m−1)      n² log_|Σ|(m²)/m
    SOG                             n + |Σ|^B + m²                                   m²              n²m²                       n²


Table 4.2: Known theoretical extra space, preprocessing, worst and average searching time complexity of the Baeza-Yates and Regnier algorithm implementations

    Implementation                  Extra space                                  Preprocessing   Worst case                   Average case
    Aho-Corasick                    |Σ|m²                                        |Σ|m²           2n²m                         n²/m
    Set Horspool                    |Σ|m²                                        |Σ|m²           2n²m²                        Kn/m
    Set Backward Oracle Matching    |Σ|m²                                        |Σ|m²           2n²m³                        Kn/m
    Wu-Manber                       m × sum_{i=0}^{B−1} |Σ| × (2^bitshift)^i     m²              2n² log_|Σ|(m²) m²(m−1)      n² log_|Σ|(m²)/m²
    SOG                             |Σ|^B + m²                                   m²              2n²m³                        n²/m

4.5 Analysis

This section discusses the efficiency and presents a performance evaluation in terms of preprocessing and searching time of the original Baker and Bird and Baeza-Yates and Regnier two-dimensional pattern matching algorithms and their Set Horspool, Set Backward Oracle Matching, Wu-Manber and SOG variants for different problem parameters including the pattern, input string and alphabet sizes used.

Figure 4.3 depicts the preprocessing time of the different implementations of the Baker and Bird algorithm. The preprocessing time of the Baeza-Yates and Regnier algorithm was similar and thus is omitted. The searching time of the original Baker and Bird and the Baeza-Yates and Regnier algorithms and their variants is presented in Figures 4.4 to 4.7. All Figures have a linear horizontal axis and a logarithmic vertical axis. Since the SOG algorithm uses bitwise operations to create the hash values for the verification phase, it is not trivial to implement it on 32-bit hardware for a pattern where the size m of each dimension is larger than 32 characters, although it could potentially be used as a filter to search for the occurrence of subpatterns with a size of up to 32 characters.


For this reason, the searching time of the Baker and Bird and the Baeza-Yates and Regnier algorithms is given only for patterns of size m = 8, 16 and 32 characters when implemented using the SOG algorithm, similar to the methodology presented in [78]. The experimental results with only a few exceptions are consistent with the theoretical time complexity as presented in Tables 4.1 and 4.2.

4.5.1 Performance Evaluation

The Baker and Bird and the Baeza-Yates and Regnier implementations presented in this chapter use the Aho-Corasick, Set Horspool, Set Backward Oracle Matching, Wu-Manber and the SOG multiple pattern matching algorithms to preprocess the pattern as discussed in previous sections. As shown in Table 3.1, the theoretical time to construct the trie of Aho-Corasick and Set Horspool and the factor oracle of Set Backward Oracle Matching is O(|Σ|m²) while the theoretical preprocessing time of the Wu-Manber and SOG algorithms is O(m²). If the theoretical time complexity of an algorithm can be used as a means to predict its performance, the preprocessing time of the implementations was expected to increase linearly in m². The experimental results, as presented in Figure 4.3, confirm that the preprocessing time of the original implementation of the Baker and Bird and Baeza-Yates and Regnier algorithms and of their Set Horspool, Set Backward Oracle Matching and Wu-Manber variants increased linearly in the size of the pattern. The preprocessing time of the SOG variant of both the Baker and Bird and the Baeza-Yates and Regnier algorithms increased proportionally to the size of the pattern when used on a pattern with m = 16 as opposed to m = 8 characters but it was constant in m² as the size increased from 16 to 32 characters. This irregularity was caused by the aggressive optimization of the code by the GCC compiler when the “O2” flag was used and was not observed with the “O0” flag.

From Figure 4.3 it can also be seen that the preprocessing time of all algorithm implementations, regarding the alphabet size, was in line with their theoretical complexity. More specifically, the time of the Aho-Corasick, Set Horspool and Set Backward Oracle Matching implementations of the Baker and Bird and Baeza-Yates and Regnier algorithms to preprocess the pattern increased linearly in |Σ| while the time of the Wu-Manber and SOG variants of both two-dimensional algorithms was constant in |Σ|.

The experiments confirm in practice that the time of the Aho-Corasick implementation of the Baker and Bird and the Baeza-Yates and Regnier algorithms to locate all occurrences of the pattern in the input string increased linearly in the size n² of the input string. The pattern size and the alphabet size are two parameters that are closely correlated as to the manner in which they affect the performance of the algorithm implementations. The searching time of the original Baker and Bird algorithm was roughly constant in m², as expected by its average case theoretical time, although an increase in the searching time of the implementation is apparent for large pattern sizes, especially when data with alphabets of size |Σ| = 256 and |Σ| = 1024 were used. A similar behavior of the original Baeza-Yates and Regnier algorithm can be observed to a much greater extent in Figures 4.6 and 4.7. In that case and although the searching time of the implementation had a tendency to decline, in accordance with its theoretical complexity, a sharp increase can be observed, especially on data with alphabets of size |Σ| = 256 and |Σ| = 1024. The increase in the searching time of the implementation is related to the rate of memory cache misses as trie nodes are visited during the searching phase. The CPU used for the experiments of this chapter had a 64 KB L1 cache per core and a 6 MB L2 cache. All data read or written by the CPU are stored in the L1 cache in the form of 64-byte cache lines, namely blocks of contiguous data words. The limited size of the cache results in the eviction of the cache lines to the L2 cache and from there to main memory to make room in the caches for new cache lines.


Figure 4.3: Preprocessing time of the algorithm implementations (running time in seconds, logarithmic scale, versus pattern length 4 to 256 for the AC, SH, SBOM, WM and SOG implementations; one panel per alphabet size |Σ| = 2, 256 and 1024)


Figure 4.4: Searching time of the Baker and Bird implementations for an input string with n = 1,000 (running time in seconds, logarithmic scale, versus pattern length for the AC, SH, SBOM, WM and SOG implementations; one panel per alphabet size |Σ| = 2, 256 and 1024)


Figure 4.5: Searching time of the Baker and Bird implementations for an input string with n = 10,000 (running time in seconds, logarithmic scale, versus pattern length for the AC, SH, SBOM, WM and SOG implementations; one panel per alphabet size |Σ| = 2, 256 and 1024)


Figure 4.6: Searching time of the Baeza-Yates and Regnier implementations for an input string with n = 1,000 (running time in seconds, logarithmic scale, versus pattern length for the AC, SH, SBOM, WM and SOG implementations; one panel per alphabet size |Σ| = 2, 256 and 1024)


Figure 4.7: Searching time of the Baeza-Yates and Regnier implementations for an input string with n = 10,000 (running time in seconds, logarithmic scale, versus pattern length for the AC, SH, SBOM, WM and SOG implementations; one panel per alphabet size |Σ| = 2, 256 and 1024)


Each trie node uses a table of size |Σ| to store the transitions to other nodes and since, in principle, the frequency of cache misses is determined by the size of the data, increasing the size of the alphabet resulted in the degradation of the implementations' performance. To measure the effect of the memory caches on the performance of the algorithm implementations, Callgrind, an execution-driven cache simulator from the Valgrind suite [68], was used. A thorough analysis of the memory architecture of modern processors can be found in [31]. Finally, it can be seen that the searching time of the original Baker and Bird and Baeza-Yates and Regnier algorithms was constant in |Σ|, if the sharp increase in time due to the cache misses is taken into consideration.
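The memory behavior described above can be seen from the layout of a single trie node; the structure below is only illustrative of the implementations discussed here, not their exact data types.

    /* With |Σ| = 1024, the transition table of one node alone occupies 8 KB of
     * pointers on a 64-bit machine (128 cache lines of 64 bytes), so following
     * goto and supply links touches many cache lines; with |Σ| = 2 a node fits
     * in a single cache line. */
    #define SIGMA 1024

    struct trie_node {
        struct trie_node *next[SIGMA];   /* goto transitions, one per symbol */
        struct trie_node *supply;        /* failure (supply) link            */
        int output;                      /* index of the matching row, or -1 */
    };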

The performance in terms of searching time of the Set Horspool and Set Backward Oracle Matching variants of the Baker and Bird and the Baeza-Yates and Regnier algorithms has similar characteristics to the original implementations. The time to locate all occurrences of the pattern in the input string increased linearly in n² while a sharp increase in m² of their searching time is evident when data with alphabets of size |Σ| = 256 and |Σ| = 1024 were used, due to cache misses. Contrary to the original Baker and Bird algorithm, the searching time of its Set Horspool and Set Backward Oracle Matching variants actually decreased in the size of the alphabet although it was expected to be constant in |Σ| in the average case. Recall that when a mismatch occurs at a position i of the input string, Set Horspool uses the bad character shift of the Horspool algorithm to shift the trie while the oracle of the Set Backward Oracle Matching algorithm is shifted past that position. As the maximum possible shift of the trie and the factor oracle respectively is greatly reduced on data with a small |Σ|, this explains the reduced efficiency of the specific algorithm implementations when used on data with a binary alphabet. The searching time of the Set Horspool and the Set Backward Oracle Matching variants of the Baeza-Yates and Regnier algorithm on the other hand was roughly constant in |Σ|; although the maximum shift was reduced in that case as well, both the trie and the factor oracle were only shifted along the primary rows of the input string, thereby not affecting significantly the performance of the implementations.

The searching time of the Wu-Manber variant of the Baker and Bird and the Baeza-Yates and Regnier algorithms also increased linearly in n². The performance in terms of searching time of the Wu-Manber implementation of the Baker and Bird algorithm was influenced by the size of the pattern in different ways, based on the alphabet size used; the time to scan the input string increased in m² for data with a binary alphabet while it decreased when alphabets of size |Σ| = 256 and |Σ| = 1024 were used. Wu-Manber, similar to the Set Horspool algorithm, uses the bad character shift of the Horspool algorithm to shift the search window when a mismatch or a complete match occurs and thus, for the reasons described above, the algorithm is not efficient when data with a small alphabet size are used. In practice, for data with a binary alphabet together with patterns where m ≥ 16, the Wu-Manber variant of Baker and Bird performed only single-character shifts during the row-matching step of the algorithm. As with the Set Horspool and the Set Backward Oracle Matching variants of the Baeza-Yates and Regnier algorithm, the searching time of the Wu-Manber variant was not significantly affected by the reduction of the maximum shift when binary alphabet data were used and thus it decreased in m² for all alphabet sizes, as expected by the average theoretical complexity of the algorithm. The time of the Wu-Manber variant of both the Baker and Bird and the Baeza-Yates and Regnier algorithms to locate all occurrences of the pattern in the input string decreased in |Σ|.

Finally, the searching time of the SOG variant of both the Baker and Bird and the Baeza-Yates and Regnier algorithms increased linearly in n² while it was roughly constant in m² and |Σ|, as expected by the average theoretical searching time of the algorithm implementations.


4.5.2 Algorithm Comparison

As can be seen from Figures 4.4 to 4.7, the searching time of the two-dimensional algorithm implementations was affected in various ways by different problem parameters and as such, different implementations are preferred for different types of data. The implementations of the presented two-dimensional pattern matching algorithms had similar characteristics for input strings with sizes of n = 1,000 and n = 10,000.

Figure 4.8: Experimental map of the implementations of the Baker and Bird algorithm for n = 1,000 and n = 10,000 (fastest implementation per pattern size m = 4 to 256 and alphabet size |Σ| = 2, 256 and 1024; the regions correspond to AC, SH and WM)

When comparing the implementations of the Baker and Bird algorithm in terms of searching time, it is clear that the original algorithm was faster than the presented variants for data with a binary alphabet size. The Set Horspool variant of Baker and Bird outperformed the rest of the implementations when data with alphabets of size |Σ| = 256 and |Σ| = 1024 were used together with patterns with a size m between 4 and 16 characters. Finally, the Wu-Manber variant had the fastest searching time for patterns with a size of m ≥ 32 characters when the same alphabet sizes were used. Although the Set Backward Oracle Matching and SOG variants of the Baker and Bird algorithm had a comparable performance in terms of searching time to the rest of the implementations, they were consistently slower for all types of data.

The original implementation of the Baeza-Yates and Regnier algorithm had the fastest searching time among the presented implementations for patterns with a size between m = 4 and m = 16 when data with a binary alphabet were used. Moreover, the Aho-Corasick together with the Set Horspool implementation of Baeza-Yates and Regnier were faster than the Set Backward Oracle Matching, Wu-Manber and SOG variants for data with an alphabet of size |Σ| = 256 and 1024 and for patterns with a size of m = 4. The Wu-Manber variant outperformed the rest of the algorithm implementations for patterns with a size of m ≥ 32 and a binary alphabet and when data with an alphabet of size |Σ| = 256 and 1024 together with patterns of a size m ≥ 8 were used.

The presented implementations of the Baeza-Yates and Regnier algorithm had a consistently better performance than the corresponding implementations of the Baker and Bird algorithm when used on data with a binary alphabet.


Figure 4.9: Experimental map of the implementations of the Baeza-Yates and Regnier algorithm for n = 1,000 and n = 10,000 (fastest implementation per pattern size m = 4 to 256 and alphabet size |Σ| = 2, 256 and 1024; the regions correspond to AC, AC/SH and WM)

It is interesting to note that the Wu-Manber variant of the Baeza-Yates and Regnier algorithm was two orders of magnitude faster than the respective variant of the Baker and Bird algorithm for data with a binary alphabet and for all pattern and input string sizes, as can be seen in Figures 4.4 to 4.7. On data with alphabets of size |Σ| = 256 and |Σ| = 1024 characters, the Aho-Corasick, Set Horspool and Set Backward Oracle Matching implementations of the Baker and Bird algorithm were faster than the Aho-Corasick, Set Horspool and Set Backward Oracle Matching implementations of Baeza-Yates and Regnier while the Wu-Manber variant was up to three times slower. For the same types of data, the SOG variant of both the Baker and Bird and Baeza-Yates and Regnier algorithms had a roughly equal performance.

The experimental maps that are depicted in Figures 4.8 and 4.9 summarize the most efficient implementations of the Baker and Bird and the Baeza-Yates and Regnier algorithms for different pattern, alphabet and input string sizes.


4.6 Appendix: Experimental Results

Table 4.3: Preprocessing time of the Baker and Bird implementations (sec)

           |Σ| = 2                                     |Σ| = 256                                   |Σ| = 1024
    m      AC      SH      SBOM    WM      SOG      AC      SH      SBOM    WM      SOG      AC      SH      SBOM    WM      SOG
    4      0.0000  0.0000  0.0000  0.0000  -        0.0000  0.0000  0.0000  0.0000  -        0.0000  0.0000  0.0112  0.0000  -
    8      0.0000  0.0000  0.0000  0.0000  0.0075   0.0000  0.0000  0.0000  0.0000  0.0075   0.0001  0.0000  0.0082  0.0000  0.0075
    16     0.0000  0.0000  0.0000  0.0000  0.0318   0.0002  0.0000  0.0001  0.0000  0.0318   0.0005  0.0001  0.0073  0.0000  0.0318
    32     0.0001  0.0001  0.0001  0.0000  0.0242   0.0007  0.0002  0.0005  0.0000  0.0241   0.0023  0.0006  0.0072  0.0000  0.0242
    64     0.0007  0.0002  0.0003  0.0000  -        0.0030  0.0008  0.0021  0.0000  -        0.0123  0.0050  0.0080  0.0001  -
    128    0.0047  0.0009  0.0014  0.0001  -        0.0173  0.0064  0.0106  0.0002  -        0.0548  0.0235  0.0130  0.0002  -
    256    0.0317  0.0035  0.0058  0.0004  -        0.0869  0.0284  0.0406  0.0006  -        0.2339  0.1004  0.0149  0.0006  -

Table 4.4: Preprocessing time of the Baeza-Yates and Regnier implementations (sec)

           |Σ| = 2                                     |Σ| = 256                                   |Σ| = 1024
    m      AC      SH      SBOM    WM      SOG      AC      SH      SBOM    WM      SOG      AC      SH      SBOM    WM      SOG
    4      0.0000  0.0000  0.0000  0.0000  -        0.0000  0.0000  0.0000  0.0000  -        0.0000  0.0000  0.0081  0.0000  -
    8      0.0000  0.0000  0.0000  0.0000  0.0075   0.0000  0.0000  0.0000  0.0000  0.0077   0.0001  0.0000  0.0061  0.0000  0.0076
    16     0.0000  0.0000  0.0000  0.0000  0.0318   0.0002  0.0001  0.0001  0.0000  0.0318   0.0005  0.0001  0.0111  0.0000  0.0318
    32     0.0001  0.0001  0.0001  0.0000  0.0241   0.0007  0.0002  0.0005  0.0000  0.0241   0.0022  0.0006  0.0133  0.0000  0.0242
    64     0.0007  0.0002  0.0003  0.0000  -        0.0029  0.0008  0.0021  0.0000  -        0.0122  0.0050  0.0157  0.0001  -
    128    0.0048  0.0009  0.0014  0.0001  -        0.0213  0.0064  0.0106  0.0002  -        0.0547  0.0235  0.0177  0.0002  -
    256    0.0311  0.0035  0.0058  0.0005  -        0.0875  0.0285  0.0407  0.0006  -        0.1340  0.0651  0.0193  0.0006  -


Table 4.5: Searching time of the Baker and Bird implementations for an input string with n = 1, 000 (sec)

           |Σ| = 2                                     |Σ| = 256                                   |Σ| = 1024
    m      AC      SH      SBOM    WM      SOG      AC      SH      SBOM    WM      SOG      AC      SH      SBOM    WM      SOG
    4      0.0073  0.0145  0.0257  0.0209  -        0.0062  0.0053  0.0112  0.0119  -        0.0062  0.0053  0.0112  0.0119  -
    8      0.0066  0.0154  0.0156  0.0193  0.0117   0.0054  0.0046  0.0081  0.0073  0.0084   0.0054  0.0045  0.0082  0.0073  0.0084
    16     0.0057  0.0192  0.0152  0.0287  0.0145   0.0045  0.0042  0.0068  0.0050  0.0082   0.0045  0.0045  0.0073  0.0050  0.0082
    32     0.0048  0.0220  0.0154  0.0443  0.0166   0.0045  0.0041  0.0064  0.0038  0.0082   0.0048  0.0048  0.0072  0.0038  0.0083
    64     0.0045  0.0269  0.0158  0.0778  -        0.0046  0.0056  0.0065  0.0032  -        0.0054  0.0076  0.0080  0.0032  -
    128    0.0044  0.0345  0.0168  0.1399  -        0.0083  0.0115  0.0096  0.0029  -        0.0093  0.0163  0.0130  0.0029  -
    256    0.0044  0.0405  0.0140  0.2255  -        0.0093  0.0147  0.0109  0.0026  -        0.0145  0.0209  0.0149  0.0018  -

Table 4.6: Searching time of the Baker and Bird implementations for an input string with n = 10, 000 (sec)

           |Σ| = 2                                      |Σ| = 256                                   |Σ| = 1024
    m      AC      SH      SBOM    WM       SOG      AC      SH      SBOM    WM      SOG      AC      SH      SBOM    WM      SOG
    4      0.7378  1.4830  2.5807  2.0947   -        0.6243  0.5465  1.1186  1.1928  -        0.6234  0.5278  1.1170  1.1912  -
    8      0.6797  1.6082  1.7295  1.8555   1.0351   0.5512  0.5410  0.8097  0.7292  0.8371   0.5410  0.4561  0.8213  0.7291  0.8392
    16     0.5661  1.9054  1.5287  2.9011   1.4453   0.4437  0.4258  0.6749  0.5071  0.8309   0.4453  0.4521  0.7226  0.5068  0.8218
    32     0.4811  2.2097  1.5446  4.5357   1.6617   0.4380  0.3913  0.6180  0.3851  0.8089   0.4438  0.4996  0.7005  0.3852  0.8117
    64     0.4484  2.8085  1.7788  8.3362   -        0.4326  0.5259  0.6190  0.3327  -        0.4368  0.6803  0.7241  0.3329  -
    128    0.4298  3.6996  1.8871  15.8425  -        0.5015  0.8849  0.7469  0.3155  -        0.5200  0.7592  0.7400  0.2584  -
    256    0.4189  5.1251  1.8810  29.5086  -        0.5750  1.0678  0.8553  0.3254  -        0.5899  0.8211  0.7992  0.2091  -


Table 4.7: Searching time of the Baeza-Yates and Regnier implementations for an input string with n = 1, 000 (sec)

           |Σ| = 2                                     |Σ| = 256                                   |Σ| = 1024
    m      AC      SH      SBOM    WM      SOG      AC      SH      SBOM    WM      SOG      AC      SH      SBOM    WM      SOG
    4      0.0054  0.0070  0.0132  0.0080  -        0.0040  0.0040  0.0079  0.0054  -        0.0043  0.0043  0.0081  0.0053  -
    8      0.0028  0.0038  0.0061  0.0035  0.0099   0.0029  0.0028  0.0057  0.0027  0.0091   0.0033  0.0036  0.0061  0.0027  0.0089
    16     0.0024  0.0033  0.0049  0.0029  0.0106   0.0062  0.0056  0.0100  0.0018  0.0079   0.0077  0.0083  0.0111  0.0019  0.0076
    32     0.0036  0.0046  0.0056  0.0028  0.0077   0.0089  0.0096  0.0126  0.0016  0.0082   0.0134  0.0129  0.0133  0.0017  0.0084
    64     0.0039  0.0051  0.0053  0.0025  -        0.0092  0.0103  0.0124  0.0014  -        0.0133  0.0135  0.0157  0.0015  -
    128    0.0036  0.0041  0.0042  0.0021  -        0.0126  0.0135  0.0146  0.0013  -        0.0152  0.0160  0.0177  0.0013  -
    256    0.0027  0.0028  0.0032  0.0016  -        0.0161  0.0152  0.0169  0.0009  -        0.0168  0.0188  0.0193  0.0011  -

Table 4.8: Searching time of the Baeza-Yates and Regnier implementations for an input string with n = 10, 000 (sec)

           |Σ| = 2                                     |Σ| = 256                                   |Σ| = 1024
    m      AC      SH      SBOM    WM      SOG      AC      SH      SBOM    WM      SOG      AC      SH      SBOM    WM      SOG
    4      0.5439  0.7037  1.3139  0.8023  -        0.3997  0.4006  0.7895  0.5264  -        0.4203  0.4231  0.8242  0.5264  -
    8      0.2854  0.3920  0.6764  0.4491  1.0829   0.2935  0.2951  0.7097  0.2702  0.8961   0.3295  0.3622  0.6001  0.2686  0.8842
    16     0.2508  0.3342  0.5067  0.2982  1.0602   0.6342  0.5824  1.0713  0.1826  0.8140   0.8582  0.8697  1.2107  0.1826  0.7865
    32     0.3637  0.4793  0.5790  0.2861  0.7748   0.9349  1.0182  1.3138  0.1590  0.8400   1.3481  1.3334  1.6672  0.1596  0.8366
    64     0.4268  0.5805  0.5688  0.2745  -        0.9903  1.1287  1.3358  0.1522  -        1.3919  1.4199  1.6447  0.1521  -
    128    0.4702  0.4128  0.5297  0.2676  -        1.0743  1.1784  1.3466  0.1464  -        1.4205  1.4787  1.7271  0.1369  -
    256    0.4422  0.3725  0.5214  0.2543  -        1.0951  1.2021  1.3507  0.1420  -        1.4526  1.5299  1.8860  0.1278  -


Chapter 5

Modern Parallel Architectures

5.1 Introduction

Very often, scientific and commercial applications need more processing power than a sequential computer can provide. One way of overcoming this limitation is to improve the operating speed of processors and other components so that they can offer the power required by computationally intensive applications. Even though this is currently possible to a certain extent, future improvements are constrained by the speed of light, the laws of thermodynamics, and the high financial cost of processor fabrication. A viable and cost-effective alternative is to coordinate the efforts of multiple interconnected processors and share computational tasks among them. The simultaneous execution of a task by different processing elements is called parallel computing.

This chapter presents two modern parallel architectures: the distributed memory architecture and the GPU architecture. Moreover, a classification of computer architectures is given using Flynn's taxonomy and different parallel concepts are briefly discussed, including data parallelism, message-passing and the master-worker model. Finally, the MPI and CUDA APIs are presented.

The Aho-Corasick, Set Horspool, Set Backward Oracle Matching, Wu-Manber and SOG multiple pattern matching algorithms and the Baker and Bird and Baeza-Yates and Regnier two-dimensional pattern matching algorithms that were discussed in the previous chapters are implemented in parallel on a GPU in chapters 6 and 7. The same algorithms are implemented in chapter 8 on a parallel hybrid architecture and validated on a homogeneous cluster of computer nodes with GPUs.

5.2 Classification of Computer Architectures

Based on the number of instruction streams and the number of data streams that can concurrently be processed, computer architectures can be classified into the following categories using Flynn's taxonomy [33].

• Single Instruction, Single Data stream (SISD) A SISD computer consists of a processor that executes an instruction from the instruction stream sequentially to operate on a single data stream. This way, only a single instruction can be executed at any given time. This type of computer is known as a sequential or serial computer.

• Single Instruction, Multiple Data streams (SIMD) A SIMD computer consists of multiple processors, or processors with multiple cores, that direct a common instruction to operate on different data streams. Modern CPUs include SIMD instructions, such as the MMX instruction set and its successor, the SSE2 instruction set, to help the programmer exploit data-level parallelism. Modern GPUs use SIMD on a grand scale to perform GPGPU. NVIDIA refers to SIMD for GPUs as Single Instruction, Multiple Threads (SIMT), as discussed later in this chapter.

• Multiple Instruction, Single Data stream (MISD) A MISD computer consists of multiple or multi-core processors that direct multiple instructions to operate on a single data stream. This type of redundancy, where multiple processors reach the same result, is commonly used for fault tolerance.

• Multiple Instruction, Multiple Data streams (MIMD) A MIMD computer consists of multiple or multi-core processors that operate asynchronously. The processors direct multiple instructions on different data streams. The processors in this category can generally operate autonomously and have their own control unit and local memory. Computer clusters typically follow MIMD.

MIMD computers can be further classified into two categories based on the way they access memory.

• Shared Memory MIMD Shared memory computers use processors that have access to a common, shared memory. This memory is used for communication and for synchronization between the processors. All data is stored in the shared memory while memory coherence is maintained automatically by the operating system. This way, data changed by a single processor becomes directly accessible to the rest.

• Distributed Memory MIMD Each processor of a distributed memory computer has access to its own local memory, since no processor has direct access to the local memory of other processors. Communication and synchronization between them is usually performed by encapsulating data into messages that are subsequently passed over a common network interface.

5.3 Distributed Memory

5.3.1 Distributed Memory Architecture

A computer cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected stand-alone computer nodes working together as a single, integrated computing resource. A computer node can be a single or multiprocessor system (PCs, workstations, SMPs and multi-core processors) with memory, I/O facilities, and an operating system. The nodes can exist in a single cabinet or be physically separated and connected via a common network. An interconnected cluster of computers can appear as a single system to users and applications. Such a system can provide a cost-effective way to gain features and benefits (fast and reliable services) that have historically been found only on more expensive proprietary shared memory systems.

Clusters can be classified into two categories based on the architecture of the computer nodes. With homogeneous clusters, all nodes have a similar architecture with equal performance characteristics and memory capacity. Heterogeneous clusters, on the other hand, consist of computer nodes with different architectures. Clusters are often referred to as distributed memory parallel systems since the nodes of the cluster are loosely coupled; each node has its own local memory and the communication between them takes place through a network, usually with the use of message-passing. Clusters commonly use the master-worker model.

5.3.2 The Master-Worker Model

The master-worker model [17] consists of two entities: a master and multiple workers. The master is responsible for decomposing the problem into small tasks and for distributing these tasks among the worker processes, as well as for gathering the partial results in order to produce the final result of the computation. The worker processes execute in a very simple cycle: get a message with the task, process the task, and send the result back to the master. Usually, the communication takes place only between the master and the workers, while in certain scenarios the workers also communicate with each other.

Master-worker may use either static or dynamic load-balancing. With static load-balancing, the distribution of tasks is performed at the beginning of the computation, which allows the master to participate in the computation after each worker has been allocated a fraction of the work. The allocation of tasks can be done once or in a cyclic way. The other way is to use a dynamic load-balancing master-worker paradigm, which can be more suitable in the following cases: when the number of tasks exceeds the number of available processors, when the number of tasks is unknown at the start of the application, when the execution times are not predictable, or when the problems are not balanced.

The master-worker model can be generalized to the hierarchical or multi-level master-worker model, in which the top-level master feeds large chunks of tasks to second-level masters, who further subdivide the tasks among their own workers and may perform part of the work themselves. This model is generally equally suitable to shared memory or message-passing paradigms since the interaction is naturally two-way; i.e., the master knows that it needs to give out work and workers know that they need to get work from the master.

While using the master-worker model, care should be taken to ensure that the master does not become a bottleneck, which may happen if the tasks are too small (or the workers are relatively fast). The granularity of tasks should be chosen such that the cost of doing work dominates the cost of transferring work and the cost of synchronization. Asynchronous interaction may help overlap interaction and the computation associated with work generation by the master. It may also reduce waiting times if the nature of requests from workers is non-deterministic.
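To make the dynamic load-balancing variant concrete, the following sketch in C uses the MPI routines that are introduced in the next section: the master hands out one task at a time and a worker loops, requesting work until it receives a termination tag. The task representation, the process_task() function and the tag values are hypothetical and only illustrate the control flow; they are not part of any particular implementation discussed in this thesis.

#include <stdio.h>
#include "mpi.h"

#define TAG_WORK 1                 /* message carries a task index    */
#define TAG_DONE 2                 /* termination signal for a worker */

/* Hypothetical task: any independent unit of work would do. */
static double process_task(int task) {
    return (double)task * (double)task;
}

int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                               /* master */
        int n_tasks = 100, next = 0, active = 0, w;
        double total = 0.0, partial;
        MPI_Status status;

        /* seed each worker with one task (assumes n_tasks >= size - 1) */
        for (w = 1; w < size; w++) {
            MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
            next++;
            active++;
        }

        /* collect a partial result and immediately hand out the next task */
        while (active > 0) {
            MPI_Recv(&partial, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            total += partial;
            if (next < n_tasks) {
                MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, TAG_DONE, MPI_COMM_WORLD);
                active--;
            }
        }
        printf("final result: %f\n", total);
    } else {                                       /* worker */
        int task;
        double result;
        MPI_Status status;

        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            if (status.MPI_TAG == TAG_DONE)
                break;
            result = process_task(task);
            MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}

Because a worker asks for new work only by returning a result, the distribution automatically adapts to workers that run at different speeds, at the cost of one extra message per task.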

5.3.3 The MPI Library

Message-passing libraries allow efficient parallel programs to be written for distributed memory systems. They provide routines to initiate and configure the messaging environment as well as for sending and receiving packets of data. Currently, the most popular high-level message-passing API for scientific and engineering applications is the Message Passing Interface (MPI), as defined by the MPI Forum [81]. MPI defines both the syntax and the semantics of a core set of over 125 library functions. These functions provide support for starting and terminating the MPI library, getting information about the parallel computing environment, and for point-to-point as well as collective communications.

Parallel applications using the MPI library are designed mainly for the MIMD architecture. One or more tasks are assigned to processes executed on each node of a computer cluster, each with access to a separate address space. The communication between the processes is performed by passing messages via a common interconnected network. The programmer can call functions to send and receive messages and to synchronize the processes from within the application, while most details are abstracted by MPI, simplifying the process. The message-passing model has become extremely popular due to its ease of use and the large number of platforms that support it. Programs developed in this way can run on different computer architectures, while the executable code produced is quite efficient.

This section discusses some of the most essential functions of the MPI library. Most of these functions were used to develop the hybrid parallel architecture of chapter 8.

Basic Functions

The MPI library consists of numerous functions that can be used to implement message-passing parallel applications. The most essential functions of MPI, which are summarized in Table 5.1, are as follows.

Table 5.1: Essential functions of the MPI library

Function name Usage

MPI_Init() Initialization of MPI

MPI_Comm_rank() Retrieve the identification number of a process

MPI_Comm_size() Retrieve the total number of processes

MPI_Recv() Receive messages over a common communication network

MPI_Send() Send messages over a common communication network

MPI_Finalize() Finalization of MPI

Each application implemented using the MPI library must include the header file of MPI that contains the necessary declarations of the library's functions:

#include "mpi.h"

The first MPI function that should be called inside the application is MPI_Init(), in order to initialize MPI. MPI_Init() gets as arguments pointers to the arguments of the main function, argc and argv:

MPI_Init(int *argc, char ***argv)

Before the execution of an MPI application is terminated, the MPI_Finalize() function must be called in order to free the resources allocated by MPI. As soon as MPI_Finalize() is called, no other MPI function can be used:

MPI_Finalize()

Each process of MPI has its own unique identification number, rank, based on which it can be distinguished from other processes. The MPI_Comm_rank() function returns the unique identification number of a process and its syntax is as follows:

MPI_Comm_rank(MPI_Comm comm, int *rank)


Apart from a way to uniquely identify each process of a parallel application, it is often important to determine the total number of active processes. The MPI_Comm_size() function returns that information:

MPI_Comm_size(MPI_Comm comm, int *size)

The first argument, comm, of the MPI_Comm_rank() and MPI_Comm_size() functions is called the communicator and is a way to group processes. To achieve this, a process group can be created by associating a common communicator with the processes. The communicator can be used for either point-to-point or collective communication between these processes. The set of processes with a common associated communicator form a communication domain. Initially, all processes are part of a common communicator, MPI_COMM_WORLD.
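A minimal program that ties these basic functions together might look as follows (a sketch; the printed message is arbitrary):

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);                   /* start the MPI environment  */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* rank of this process       */
    MPI_Comm_size(MPI_COMM_WORLD, &size);     /* total number of processes  */

    printf("Process %d of %d\n", rank, size);

    MPI_Finalize();                           /* release the MPI resources  */
    return 0;
}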

Passing messages between the processes is performed using the functions MPI_Send() and MPI_Recv(). The MPI_Send() function transmits a message to a designated process while the MPI_Recv() function receives a message from a process. To successfully pass a message, MPI must be aware of the following information:

• The identification number of the recipient

• The identification number of the sender

• A tag

• The communicator

The tag can be used to mark different messages. If the recipient process requests messages of only a given tag, the rest of the messages will be buffered by the network. The MPI_ANY_TAG constant can be used in place of a specific tag in the MPI_Recv() function to indicate that a message with any tag value is accepted. The syntax of the MPI_Send() and MPI_Recv() functions is as follows:

MPI_Send(void *message, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

MPI_Recv(void *message, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

The message pointer points to the memory segment where the relevant data is stored. The next two arguments, count and datatype, are used to determine the end of the transmission; the message contains a number of count values, each of type datatype. The correlation between the datatypes that MPI uses and the corresponding C types is given in Table 5.2. As can be seen from the Table, the MPI_BYTE and MPI_PACKED datatypes have no equivalent types in C. MPI_BYTE corresponds to a byte, while MPI_PACKED is used to pack multiple variables of a single type into an expanded datatype in order to reduce the number of transmitted messages and thus the total communication overhead.

The dest and source arguments indicate the identification numbers of the recipient and sender processes. In place of the source argument, the MPI_ANY_SOURCE constant can be used. In that case, the recipient process will receive messages from any process. The status argument returns information on the message received. Some possible types of information are given in Table 5.3.


Table 5.2: MPI Datatypes

MPI datatype C datatype

MPI_CHAR signed char

MPI_SHORT signed short int

MPI_INT signed int

MPI_LONG signed long int

MPI_UNSIGNED_CHAR unsigned char

MPI_UNSIGNED_SHORT unsigned short int

MPI_UNSIGNED unsigned int

MPI_UNSIGNED_LONG unsigned long int

MPI_FLOAT float

MPI_DOUBLE double

MPI_LONG_DOUBLE long double

MPI_BYTE

MPI_PACKED

Table 5.3: Useful fields of the MPI_Status structure

Field Purpose

MPI_TAG Returns the tag of the received message

MPI_SOURCE Returns the process ID of the sender

MPI_ERROR Returns any error that occurred during the transmission of the message
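As an illustration of the functions above, the following sketch sends an integer from the process with rank 0 to the process with rank 1 using a tag of 0 (the transmitted value is arbitrary):

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv) {
    int rank, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* send one MPI_INT to process 1 with tag 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* receive one MPI_INT from process 0 with tag 0 */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}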

Point-to-point Communication

Point-to-point is the simplest form of communication; a message is transmitted from one process and is received by a different process, ignoring the rest. There are two types of point-to-point communication in MPI, blocking and non-blocking. With the blocking type, the sender process blocks during the transmission of a message and can resume operation only when receipt of the message is acknowledged by the recipient process. When the non-blocking type is used, the sender can resume operation immediately after sending a message and can check the status of the transmission at a later time.

There are three additional communication modes: buffered, synchronous and ready. With the buffered mode, the sender process can send a message even when there is no matching receive at the other end. In that case, MPI buffers the message to allow the send call to complete. When the sender performs a synchronous send, the transmission will start even when there is no matching receive at the other end but will only complete when the message is received by the recipient. Finally, a ready send will start transmitting a message only if there is a matching receive at the other end. In any other case, an error occurs.

MPI_Send() and MPI_Recv() are blocking functions. MPI also provides non-blocking versions of these two functions, MPI_Isend() and MPI_Irecv():


MPI_Isend(void *message, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)

MPI_Irecv(void *message, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)

MPI_Isend() initiates the transmission process without waiting for its completion. Thus, the function will immediately return control to the application, which can later confirm the successful reception of the message. Likewise, MPI_Irecv() starts the reception process without waiting for the message to actually arrive and can later check the status of the transmission. The request argument is a handle that can subsequently be used by the processes to check the progress of the transmission.

To actually test the transmission progress, the MPI_Test() function can be used:

MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)

MPI sets the flag argument to TRUE if the transmission process identified by request has completed, or FALSE otherwise. The status argument contains information about the transmission, such as the identification number of the sender process, the message tag and an error status. The MPI_Wait() function can also be used to check the progress of a transmission. It blocks and subsequently returns control to the application only when the transmission is completed:

MPI_Wait(MPI_Request *request, MPI_Status *status)
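The fragment below is a sketch of the usual overlap pattern, assuming rank has been obtained as in the earlier listings; do_independent_work() is a hypothetical placeholder for computation that does not depend on the incoming message:

MPI_Request request;
MPI_Status status;
int buffer[1024];

if (rank == 1) {
    /* post the receive; control returns immediately */
    MPI_Irecv(buffer, 1024, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);

    do_independent_work();   /* computation overlapped with the transfer */

    /* block until the message has actually arrived */
    MPI_Wait(&request, &status);
}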

Collective Communication

Point-to-point communication is used to exchange information between two processes. When more processes are involved, the functions for collective communication that MPI provides can be used. Using the MPI_Barrier() function is the simplest form of collective communication. It is not used for data exchange but provides instead a mechanism for synchronizing different processes. When some process calls MPI_Barrier(), it blocks until all other processes associated with the same communicator comm call the same function:

MPI_Barrier(MPI_Comm comm)

The most straightforward function for collective communication is MPI_Bcast(). Using the MPI_Bcast() function, a process root can broadcast a copy of the count elements of type datatype stored in the local buf buffer to all the other processes within the same communicator comm:

MPI_Bcast(void *buf, int count, MPI_Datatype datatype, int root, MPI_Comm comm)

The MPI_Scatter() and MPI_Gather() functions of MPI can also be used for collective communication. MPI_Scatter() is similar to the MPI_Bcast() function, with the difference that each process within the same communicator receives different data. When using MPI_Scatter(), process root distributes the data stored in buffer send_buf among the processes of communicator comm.


The contents of send_buf are split into p chunks, where p is the number of processes within the same communicator. Then, each process will receive send_count elements of type send_type. The recv_count and recv_type arguments correspond to the number and type of elements respectively that are stored in the receive buffer recv_buf:

MPI_Scatter(void *send_buf, int send_count, MPI_Datatype send_type, void *recv_buf, int recv_count, MPI_Datatype recv_type, int root, MPI_Comm comm)

When using the MPI_Gather() function, each process sends send_count elements of type send_type stored in the buffer send_buf to process root. The root process stores the elements in recv_buf in rank order:

MPI_Gather(void *send_buf, int send_count, MPI_Datatype send_type, void *recv_buf, int recv_count, MPI_Datatype recv_type, int root, MPI_Comm comm)

MPI_Reduce() is a very useful function for collective communication. It combines the data stored in the buffer input of every process within a communicator comm using an operator op and stores the result in the buffer output of process root:

MPI_Reduce(void *input, void *output, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)

The most important operators that can be used with MPI_Reduce() are given in Table 5.4.

Table 5.4: Operators of the MPI Reduce function

Operator Effect

MPI_MAX Maximum

MPI_MIN Minimum

MPI_SUM Summation

MPI_LAND Logical AND

MPI_LOR Logical OR
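A sketch of how these collective functions can cooperate: the root scatters equal-sized chunks of an array to all processes, each process computes a partial sum on its own chunk, and MPI_Reduce() combines the partial sums at the root. The data array (significant only at the root), the chunk size and the use of a sum are arbitrary choices for illustration; the fragment assumes MPI has been initialized as in the earlier listings.

#define CHUNK 1000

int chunk[CHUNK];
long local_sum = 0, total_sum = 0;

/* the root scatters CHUNK integers from the data array to every process */
MPI_Scatter(data, CHUNK, MPI_INT, chunk, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);

/* each process computes a partial result on its own chunk */
for (int i = 0; i < CHUNK; i++)
    local_sum += chunk[i];

/* the partial results are combined with MPI_SUM at process 0 */
MPI_Reduce(&local_sum, &total_sum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);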

5.4 Graphics Processing Units

Graphics Processing Units are graphics rendering devices introduced back in the 1980s to offload graphics-related processing from CPUs. Unlike CPUs, which are optimized for sequential code by dedicating more transistors to flow control and data caching, all commodity GPUs follow a streaming, data parallel programming model. This model is structured around a graphics pipeline that consists of a number of computation stages that vertices have to pass through to undergo transformation and lighting. Initially, GPUs used two fixed-function processor classes to process data: vertex transform and lighting processors, which allowed the programmer to alter per-vertex attributes, and pixel-fragment processors, which were used to calculate the color of a fragment. Although this model worked perfectly for graphics processing, it offered little to no flexibility for general-purpose programming.

The architecture of GPUs evolved in 2001, when for the first time programmers were given the opportunity to program individual functions in the graphics pipeline by using programmable vertex and fragment shaders [72]. Later GPUs (such as the G80 series by NVIDIA) substituted the vertex and fragment processors with a unified set of generic processors that could be used as either vertex or fragment processors depending on the programming needs. With each new generation, additional features are introduced that move GPUs one step closer to wider use for general purpose computations. CUDA [101] is a high-level abstraction API that was released in 2007 to enable programmers to write code for execution on GPUs and to use the available thread processors of a GPU without the need to write explicitly threaded code.

5.4.1 GPU Architecture

A GPU is a hardware device that acts as a separate co-processor to the host. It is based on a scalable array of multithreaded streaming multiprocessors (SMs) that have a different design than CPU cores; they target lower clock rates; they support instruction-level parallelism but not branch prediction or speculative execution; and they have smaller caches [94]. Each SM consists of a number of stream processors (SPs), special function units for floating-point arithmetic operations and mathematical functions, one or more warp schedulers and an independent instruction decoder, so the SMs can run different instructions. The SPs, or CUDA Cores, are lightweight in-order processors that are used for arithmetic operations. Moreover, each SM has a number of resources, including a 32-bit register file and a shared memory area that, as discussed later in this section, are distributed among the available threads and thread blocks respectively. On compute capability 1.x GPUs, SMs are grouped into Texture Processor Clusters (TPCs) that additionally contain a texture unit and a texture memory cache.

The threads of a CUDA GPU are very lightweight compared to the threads of multi-core CPUs, with a very small creation overhead involved. Although they are lightweight in the sense that they operate on small pieces of data, they are fully fledged in the conventional sense, since each thread has its own stack, register file, program counter and local memory [39]. The threads are organized into thread blocks, with each block containing a maximum of 512 or 1024 threads, depending on the compute capability of the device. The thread blocks are then assigned to SMs, with each SM being capable of executing multiple thread blocks. The thread blocks can be further organized into a one-dimensional, two-dimensional or three-dimensional grid. The thread blocks can be identified, or in other words indexed, in one, two or three dimensions using the blockIdx vector. The threads of each block are also identifiable through a unique per-block thread ID using the threadIdx vector.

The SMs divide the thread blocks into warps of threads that are queued for work. A warp typically consists of 32 threads, with a half-warp consisting of 16 threads. Threads are grouped into warps in a deterministic way; for warps with a size of 32 threads, the first warp will always contain the threads with a thread ID between 0 and 31. Stream processors follow a MIMD model for warps since different warps can have different execution paths and process different data. Threads within the same warp, though, follow a SIMD or, as called by NVIDIA, Single Instruction, Multiple Thread (SIMT) model. The instruction decoder of an SM fetches a common, single instruction that will be executed by all the SPs at the same time, forming the basis of SIMT execution [58]. Only one instruction is fetched every fetch cycle for a warp that is selected in a round robin fashion from among the active, ready warps of the SM. The instructions are then placed in a common issue queue from where they are dispatched for execution. The threads of a warp start at the same program address and execute concurrently a common instruction at every cycle. Individual threads can branch out and execute independently but this comes with a performance penalty. If the threads of a warp diverge via a data-dependent conditional branch, such as an if-else statement, the warp executes sequentially each branch path taken, disabling threads that are not on that path to ensure the correctness of the results [70]. The compiler inserts the reconvergence point using a field in the instruction encoding and also inserts an instruction before a diverging branch that provides the hardware with the location of the reconvergence point [73]. Using this information, the threads converge back to the same execution path when all paths are complete.
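In code, a thread typically combines blockIdx, blockDim and threadIdx to derive the portion of data it is responsible for. A common one-dimensional indexing sketch (the kernel name and arguments are placeholders) is:

__global__ void kernel(const char *text, int n) {
    /* global index of this thread within the whole grid */
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid < n) {
        /* process the character (or chunk of characters) assigned to tid */
    }
}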

There are two sources of latency that affect the performance of a kernel, instruction latency and memory latency. Instruction latency is the number of cycles between issuing an arithmetic instruction and the processing of the instruction by the arithmetic pipeline of the GPU. Memory latency is the number of cycles needed to access a memory address. To hide both arithmetic and memory latency, there should be a number of warps resident, or in other words maintained on-chip during their entire lifetime on an SM, so that the SM can choose instructions to issue between them. The number of resident warps required to completely hide latency depends on the architecture of the device. A warp may not be ready to execute due to different factors: waiting on register dependencies; accessing off-chip memory; waiting on some synchronization point; or waiting to finish executing the previous instruction. At every instruction issue time, a warp scheduler switches from one warp to another and switches contexts between threads. Because the execution context, including program counters and registers, for each warp processed by a multiprocessor is resident on the SM, context switching is very fast.

Figure 5.1: The NVIDIA GTX 280 GPU architecture

The NVIDIA GTX 280 is a compute capability 1.3 GPU from the GT200 series of NVIDIA's Tesla hardware. It has 1GB of GDDR3 global memory, a 602MHz graphics clock rate, a 1.3GHz processor clock rate and a 1.1GHz memory clock rate. As shown in Figure 5.1, which is based on information from [22], the GPU consists of 30 SMs, with 3 SMs per TPC. Each SM has 8 SPs, for a total of 240 SPs, 16KB of on-chip shared memory, and a 16KB 32-bit register file. Each thread block can have a maximum of 512 threads while each SM supports up to 1024 active threads and up to 8 active blocks. Each thread can use between 16 registers (when all 1024 threads are used per SM) and 128 registers (the maximum per-thread register usage). Since every SM contains one warp scheduler, one instruction is issued per warp over 4 cycles. The latency of the arithmetic pipeline is 24 cycles, therefore it can be completely hidden by having 6 active warps at any one time. A request for any words within the same memory segment of the global memory for correctly aligned addresses is coalesced, using one memory transaction per half-warp. The size of each global memory segment is 32 bytes for 1-byte words, 64 bytes for 2-byte words or 128 bytes for words of 4, 8 and 16 bytes. The maximum memory throughput of the global memory can then be 128 bytes per transaction. The GTX 280 contains two levels of texture caches: a 24KB L1 cache within each TPC, partitioned in 3×8KB caches, and 8 L2 texture caches with a size of 32KB each, visible to all SMs. The maximum size for a one-dimensional texture reference is 8,192B while the maximum size for two-dimensional texture references is 65,536 × 32,768 texels. The shared memory consists of 16 banks organized in such a way that successive 32-bit words are mapped into successive banks [70]. Each bank has a bandwidth of 32 bits over two clock cycles and therefore the bandwidth of the shared memory is 64 bytes over two cycles when all banks are accessed simultaneously. A request for shared memory addresses by a warp is split into two different requests, one for each half-warp. Finally, if the threads of the same half-warp access different words in the same bank or different addresses within the same word, bank conflicts occur, resulting in the serialization of the accesses.

Figure 5.2: The NVIDIA GT 240 GPU architecture

The NVIDIA GT 240 GPU is very similar to the GTX 280, as can be seen from Figure 5.2, also based on information from [22]. The different characteristics of the GT 240 are as follows. The GT 240 is a compute capability 1.2 GPU from the GT200 series of NVIDIA's GeForce graphics processing units. It has 512MB of GDDR3 global memory, a 1.34GHz GPU clock rate and a 900MHz memory clock rate. It consists of 12 SMs, with 3 SMs per TPC. Each SM has 8 SPs, for a total of 96 SPs.

5.4.2 Memory Hierarchy

GPUs have different memories, both on-chip and off-chip, each with its own advantages and limitations. Unlike the memory architecture of traditional computer systems, where the compiler is mainly responsible for distributing data between an off-chip RAM and different levels of cache, the programmer of a GPU application must, in most cases, explicitly copy data between the memory areas of the device, trying to find a balance between the size of each memory area and the cost to access it in terms of clock cycles. Although this model offers the potential for high performance gains, it also comes with an increased coding complexity.

The fastest and at the same time most limited memory area of a GPU is the register file, a highly banked array of processor registers built out of dense SRAM arrays [37]. As already discussed, each SM has its own register file. The threads of each SM use dedicated registers to store their register context in order to perform context switching very fast. To accommodate all the resident threads per SM, GPU devices have large register files, with their size depending on the compute capability of the device: 8KB per SM for compute capability 1.0 and 1.1 devices, 16KB per SM for compute capability 1.2 and 1.3 GPUs, 32KB per SM for compute capability 2.0 devices and 64KB for compute capability 3.0 and 3.5 devices.

Shared memory is a small on-chip per-SM memory space that has a low-latency access cost and can also be used to bypass the coalescing requirements of the global memory. Because it is shared between all threads of a thread block, it is usually used for synchronization as well as for data exchange between them. To enable concurrent accesses to it, shared memory is divided into 32-bit memory banks. The maximum bandwidth of the shared memory is then utilized when a single memory request accesses one address from each different bank. If addresses from the same memory bank are accessed, the accesses are serialized to avoid conflicts, with a significant degradation of the shared memory's bandwidth. Compute capability 1.0 to 1.3 GPUs have 16KB of shared memory per SM while compute capability 2.0 and newer GPUs have 48KB per SM.

Texture memory is a cached read-only memory space with a two-dimensional locality that is initialized host-side. Compute capability 1.x GPUs have an L1 texture cache per TPC and an L2 texture cache that is accessible by all SMs. It is often used to work around the coalescing requirements of the global memory and to increase the memory throughput of the kernel. The first time that an address of the texture memory is accessed, it is fetched from the global memory with the high latency that it entails. In that case though, the texture caches of the device are used to cache the data, therefore minimizing the latency when cache hits occur. Unlike traditional CPU caches, the texture caches are not coherent; writing data to texture memory either host- or device-side actually results in the invalidation of all caches.

The GPU has its own off-chip device memory, the global memory. If data resides in pageable host memory, a memory area that is usually allocated using malloc(), it can be transferred to the GPU device explicitly before the launch of the kernel. Page-locked or pinned memory, memory that always resides in physical memory as it cannot be paged out to disk, can also be allocated on the host. The transfer of data between page-locked host memory and the global memory of the device is performed concurrently with kernel execution using DMA to hide part of the latency involved. The global memory is accessible by all SMs using memory transactions of 32, 64 and 128 bytes with a high latency, usually between 400 and 800 clock cycles.

Local memory is a special type of memory that is a cacheable part of the global memory of the GPU. It is used when the registers of an SM are spilled. This can occur due to register pressure, when for example the execution context in terms of registers of a thread is higher than the hardware limit of the device. The term “local” refers to the fact that each thread has its own private area where its execution context is spilled, resolved at compilation time by the NVCC compiler.

5.4.3 GPU Program Optimization

Although the straightforward port of a sequential application to a GPU can often lead to significant speedups, the performance of parallel implementations can be further improved when specific characteristics of the GPU architecture are considered. The fundamental techniques that a programmer can use to optimize a parallel CUDA application when executed on a GPU can be summarized as follows.

The first step when implementing a CUDA application is to ensure that as much parallelism as possible is exposed. As discussed previously, to hide instruction and memory latency, a number of warps should be resident during their entire lifetime on an SM, so that the SM can choose instructions to issue between them. Moreover, it is important for the threads of a warp to execute in a lockstep fashion. In other words, the threads within a warp should follow the SIMT model as much as possible, without branching out and taking different execution paths.

The second step is to efficiently use the available memory areas of the device. In practice, this step requires more effort from the programmer since, unlike traditional CPUs, the different memory areas must be explicitly utilized. Recall that accesses to the global memory are a major source of latency on the GPU, usually requiring between 400 and 800 clock cycles for a single memory transaction. It is therefore important to reduce redundant accesses to global memory as much as possible, by storing frequently used data in faster memory areas, including the texture memory and the shared memory spaces. To maximize the memory bandwidth of the device, it is also essential to access the global memory using correct patterns to achieve the coalescence of the memory accesses. The exact criteria differ for devices of different compute capabilities, but in general fewer memory transactions are needed when accesses to global memory are correctly aligned and sequential or within the same memory segment. A memory transaction of 32, 64 or 128 bytes is aligned when the first address of every 32, 64 or 128 byte memory segment respectively is a multiple of its size.

Finally, an important fact that must be considered during the implementation of a parallel CUDA application is that the resources of each SM that are allocated to a thread block are finite. Subsequently, when too many resources are utilized per thread, the occupancy, or the number of threads and blocks that an SM can maintain, can be reduced. The occupancy of a GPU can be affected by different factors: the block size, the shared memory usage and the register usage. An SM can only support a few concurrent resident thread blocks (usually 8 or 16), therefore it is important to use blocks with a sufficiently large block size in threads. The amount of shared memory per block and the number of registers per thread used by a kernel are two factors that also significantly affect the occupancy of an SM. To ensure maximum occupancy, a kernel should use up to

(Total amount of shared memory per block) / (Maximum number of active thread blocks per SM)    (5.1)

shared memory per block and

(Total number of registers available per block) / (Maximum number of threads per SM)    (5.2)

registers per thread, although as discussed in [92], maximizing occupancy does not always result in better performance.
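As a worked example based on the GTX 280 figures given in section 5.4.1 (16KB of shared memory per SM, up to 8 active thread blocks and up to 1024 active threads per SM, and a register file that, as the per-thread figures quoted earlier imply, holds 16K 32-bit registers), these bounds work out to 16KB / 8 = 2KB of shared memory per block and 16,384 / 1,024 = 16 registers per thread, matching the per-thread register budget quoted for that device.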

5.4.4 The CUDA Programming Model

The CUDA API provides a set of system calls for NVIDIA GPUs that enables programmers to write massively parallel GPGPU applications. CUDA applications consist of functions that are executed sequentially on the host system and functions named kernels that are called from the host. When a kernel is called from the host, it is executed in parallel on the GPU by each of the threads of the device.

Each application implemented using the CUDA API must include the header file of CUDA:

#include "cuda.h"


Host functions and kernels are distinguished from each other using function type qualifiers. The __host__ qualifier declares a function as a host function that is callable only by other host functions. It is the equivalent of not using any qualifier. The __global__ qualifier declares a function as a kernel function. It is executed in parallel on the device and it is callable only by host functions. Finally, the __device__ qualifier declares a function that is executed in parallel on the device and is callable only by other kernel and device functions.
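The following sketch illustrates the three qualifiers: a __device__ helper callable only from device code, a __global__ kernel launched from the host, and an ordinary host function that performs the launch (the function names and the grid configuration are arbitrary):

/* __device__: runs on the GPU, callable only from kernels and other device functions */
__device__ int square(int x) {
    return x * x;
}

/* __global__: a kernel, launched from the host and executed in parallel on the device */
__global__ void squareAll(int *data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        data[tid] = square(data[tid]);
}

/* __host__ (or no qualifier): an ordinary CPU function that launches the kernel */
__host__ void launchSquareAll(int *d_data, int n) {
    squareAll<<<(n + 255) / 256, 256>>>(d_data, n);
}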

The device memory is managed from the host. To allocate a linear memory space of size size in the global memory, the cudaMalloc() function is used:

cudaMalloc(void **devPtr, size_t size)

The cudaMalloc() function is usually used to allocate one-dimensional arrays. For two-dimensional arrays, the use of the cudaMallocPitch() function is preferable:

cudaMallocPitch(void **devPtr, size_t *pitch, size_t width, size_t height)

When cudaMallocPitch() is used to allocate linear device memory, each row of the array is padded to meet the alignment requirements of the device. Recall from the previous section that a memory transaction of 32, 64 or 128 bytes is aligned when the first address of every 32, 64 or 128 byte memory segment respectively is a multiple of its size. A memory space of pitch × height bytes is then allocated, where pitch is at least as big in size as width. The pitch argument is subsequently used, together with the height argument, to access array elements.
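A sketch of this usage, assuming a width × height array of unsigned char; the kernel, the helper function and the thresholding operation are illustrative placeholders:

__global__ void threshold(unsigned char *image, size_t pitch, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (row < height && col < width) {
        /* a new row starts every pitch bytes, not every width bytes */
        unsigned char *p = image + row * pitch + col;
        *p = (*p > 128) ? 255 : 0;
    }
}

void allocateImage(int width, int height, unsigned char **d_image, size_t *pitch) {
    /* the runtime fills in the pitch that satisfies the alignment requirements */
    cudaMallocPitch((void **)d_image, pitch, width * sizeof(unsigned char), height);
}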

To free the memory space allocated by the cudaMalloc() and cudaMallocPitch() functions, as pointed to by the pointer devPtr, the cudaFree() function is used on the host:

cudaFree(void *devPtr)

The programmer can also allocate pinned host memory by using the cudaMallocHost() and cudaHostAlloc() functions, as opposed to regular pageable host memory allocated by the malloc() function:

cudaMallocHost(void **pHost, size_t size)

cudaHostAlloc(void **pHost, size_t size, unsigned int flags)

The allocation of pinned host memory is required to use the zero-copy technique to transfer data from host memory to the device, a technique that is used extensively in chapter 8. When using cudaHostAlloc(), the allocation of the memory can be controlled with the use of the flags argument. The different flags of the cudaHostAlloc() function are summarized in Table 5.5. The allocated pinned host memory can be freed using cudaFreeHost():

cudaFreeHost(void *ptr)

The cudaHostAllocDefault flag on the cudaHostAlloc() function is the equivalent of using the cudaMallocHost() function. When the cudaHostAllocPortable flag is used, the memory allocated by cudaHostAlloc() will be considered as pinned by all CUDA contexts. Using the cudaHostAllocMapped flag is a prerequisite to take advantage of the zero-copy technique of CUDA.


Table 5.5: Flags of the cudaHostAlloc function

Flag Effect

cudaHostAllocDefault The equivalent of using the cudaMallocHost function

cudaHostAllocPortable The allocated memory is accessible by multiple CUDA contexts

cudaHostAllocMapped The allocated memory is directly accessible from a kernel

cudaHostAllocWriteCombined Allocates the memory as write-combined

The cudaHostAllocMapped flag allows the allocated memory to be directly accessible from a kernel without the need to explicitly copy the data from the host to the device. Instead of allocating pinned memory as cacheable, it can be allocated as write-combined with the use of the cudaHostAllocWriteCombined flag. In that case, it can be transferred to the device faster, with the disadvantage that the performance of read accesses to pinned memory by the CPU degrades rapidly.
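A host-side fragment sketching the zero-copy setup: the buffer is allocated with the cudaHostAllocMapped flag, its device-side alias is obtained with the CUDA runtime function cudaHostGetDevicePointer(), and the kernel then reads the host memory directly. The buffer size n, the kernel name and the launch configuration are placeholders; device mapping of host memory must also be enabled with cudaSetDeviceFlags(cudaDeviceMapHost) before any CUDA allocation is made.

char *h_text, *d_text;

/* allow page-locked host allocations to be mapped into the device address space */
cudaSetDeviceFlags(cudaDeviceMapHost);

/* pinned host buffer that kernels can access directly */
cudaHostAlloc((void **)&h_text, n, cudaHostAllocMapped);

/* obtain the device-side pointer that aliases the same physical memory */
cudaHostGetDevicePointer((void **)&d_text, h_text, 0);

/* the kernel reads the host buffer through d_text; no cudaMemcpy() call is needed */
searchKernel<<<numBlocks, blockDim>>>(d_text, n);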

To transfer an array of size count between the host and the device, the cudaMemcpy() function is used:

cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind)

The kind argument is used to control whether the array src is copied to the array dst from the host to the device or from the device to the host, using the flags cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost respectively.
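Putting the allocation and transfer functions together, a typical host-side sequence looks like the following sketch; myKernel and the buffer size are placeholders and error checking is omitted:

int n = 1 << 20;
size_t bytes = n * sizeof(int);
int *h_data = (int *)malloc(bytes);
int *d_data;

for (int i = 0; i < n; i++)
    h_data[i] = i;

cudaMalloc((void **)&d_data, bytes);                         /* device buffer   */
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   /* host to device  */

myKernel<<<(n + 255) / 256, 256>>>(d_data, n);               /* process on GPU  */

cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   /* device to host  */
cudaFree(d_data);
free(h_data);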

Allocating shared memory is straightforward; the __shared__ qualifier can be used before an array declaration in a device function. Binding an array to the texture memory of the device, so that a kernel can cache it to texture memory and then use a texture reference to read from it, is a more complicated process. For a one-dimensional array allocated with cudaMalloc(), size bytes of the linear memory area pointed to by the devPtr pointer can be bound to the texture reference texref using the cudaBindTexture() function:

cudaBindTexture(size_t *offset, const struct textureReference *texref, const void *devPtr, size_t size)

The offset argument is an optional byte offset parameter. For a two-dimensional array, a cudaChannelFormatDesc structure should be created first:

struct cudaChannelFormatDesc {
    int x, y, z, w;
    enum cudaChannelFormatKind f;
};

The cudaChannelFormatDesc structure is created using the cudaCreateChannelDesc() function to describe the format of the texture elements. The function returns a channel descriptor desc with format f and the number of bits of each component x, y, z, and w:

cudaCreateChannelDesc(int x, int y, int z, int w, enum cudaChannelFormatKind f)

Then, a two-dimensional memory area with a size of width by height texels and a pitch of pitch bytes, as pointed to by the devPtr pointer, is bound to the texture reference texref:


cudaBindTexture2D(size_t *offset, const struct textureReference *texref, const void *devPtr, const struct cudaChannelFormatDesc *desc, size_t width, size_t height, size_t pitch)

The texture bound to the texture reference texref can be unbound using the cudaUnbindTexture() function:

cudaUnbindTexture(const struct textureReference *texref)

To fetch the linear memory bound to the texture reference texRef and access the word at texture coordinate x, the tex1Dfetch() function can be used:

tex1Dfetch(texture<DataType, cudaTextureType1D, cudaReadModeElementType> texRef, int x)

For linear memory bound to a two-dimensional texture reference tex, the tex2D() function can be used instead, for texture coordinates x and y:

tex2D(texture<DataType, cudaTextureType2D, readMode> tex, float x, float y)
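The steps above can be combined as in the following sketch, where a one-dimensional character array is bound to a texture reference (declared at file scope) and read inside a kernel with tex1Dfetch(). The kernel merely counts occurrences of a character; the names, the input buffer h_text, its length n and the launch configuration are placeholders, and atomicAdd() requires compute capability 1.1 or higher.

texture<unsigned char, cudaTextureType1D, cudaReadModeElementType> texText;

__global__ void countChar(int n, unsigned char c, int *count) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    /* reads go through the texture cache instead of plain global memory loads */
    if (tid < n && tex1Dfetch(texText, tid) == c)
        atomicAdd(count, 1);
}

/* host side: copy the text to global memory, bind it and launch the kernel */
unsigned char *d_text;
int *d_count;
cudaMalloc((void **)&d_text, n);
cudaMalloc((void **)&d_count, sizeof(int));
cudaMemset(d_count, 0, sizeof(int));
cudaMemcpy(d_text, h_text, n, cudaMemcpyHostToDevice);
cudaBindTexture(NULL, texText, d_text, n);
countChar<<<(n + 255) / 256, 256>>>(n, 'A', d_count);
cudaUnbindTexture(texText);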

Table 5.6 lists the notation for the parallel implementation of the algorithms with the CUDA API that is used for the rest of this thesis.

Table 5.6: GPGPU implementation notation

numBlocks The number of thread blocks
blockDim The size in threads of each block
threadId The unique ID of each thread of a block
blockId The unique ID of each block
Schunk The size in characters of each chunk
Sthread The number of characters that each thread processes
Smemsize The size in bytes of the shared memory per thread block

5.5 Hybrid Parallel Architecture

The use of Graphics Processing Units for general purpose computations by a wide range of applications is extremely popular due to their architecture that enables highly parallel processing. For applications that require even more processing power, the efficiency and the high performance of GPUs can be combined with the benefits of scalability, transparency, reconfigurability, availability and reliability that, according to [69], the distributed memory architecture has. This way, hybrid GPU computer clusters emerge as viable high performance computing platforms. The parallel implementation of existing applications is supported with the use of high-level APIs, such as the MPI and CUDA APIs that were previously presented. On the first level, MPI can be used for communication and synchronization between the nodes of the cluster while on the second level, CUDA can be used for the synchronization of the GPU threads and for the actual parallel processing.



Figure 5.3: An overview of a hybrid GPU cluster architecture

In contrast to traditional computer clusters, a hybrid cluster uses a layered communication model. The nodes are interconnected using a common network in order to synchronize and to exchange information, usually with the use of the message-passing model. In order to transmit data between the host memory and the memory of the Graphics Processing Units, typically through the PCI Express bus, the data can be explicitly copied using the cudaMemcpy() function or, alternatively, the zero-copy technique of CUDA can be used, as was previously discussed. Figure 5.3 depicts a typical layout of a GPU computer cluster with a single GPU per node.

Hybrid GPU clusters are complex in nature and as such it is not trivial to take advantage of the increased performance that they offer. The programmer must often deal with problems of communication latency, data distribution and scaling. Communication latency arises directly from the distributed nature of the cluster, stemming from the delays encountered by communication over an increasing spatial domain [19]. This form of latency can subsequently have a significant impact on the consistency and the overall performance of the system. The efficient distribution of data across different nodes, the RAM of the CPUs and the different memory areas of the GPUs is challenging, especially in heterogeneous clusters. The way a cluster scales is also important; additional computer nodes or more processing elements per node can result in a performance increase, with the disadvantage of an increased complexity and the waste of computing resources.
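A sketch of the two levels in code: MPI distributes a chunk of the input to every node, each node launches a CUDA kernel on its chunk, and the partial results are reduced back at the master. The fragment assumes MPI and CUDA have been initialized; the buffers, matchKernel, chunkSize and the launch configuration are placeholders and do not correspond to any particular implementation in this thesis.

/* first level: MPI distributes a chunk of the input string to every node */
MPI_Scatter(text, chunkSize, MPI_CHAR, localChunk, chunkSize, MPI_CHAR,
            0, MPI_COMM_WORLD);

/* second level: each node offloads its chunk to its local GPU */
cudaMemcpy(d_chunk, localChunk, chunkSize, cudaMemcpyHostToDevice);
matchKernel<<<numBlocks, blockDim>>>(d_chunk, chunkSize, d_matches);
cudaMemcpy(&localMatches, d_matches, sizeof(int), cudaMemcpyDeviceToHost);

/* the partial results are combined at the master process */
MPI_Reduce(&localMatches, &totalMatches, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);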


Chapter 6

Multiple Pattern Matching on a GPU

using CUDA

6.1 Introduction

As multiple pattern matching algorithms are often used to process huge amounts of data with a size that is increasing almost exponentially in time, their sequential execution on conventional computer systems is often impractical. Multiple pattern matching algorithms have nearly all the characteristics that make them suitable for parallel execution on GPUs; they involve sequential accesses to the memory, parallelism can easily be exposed through data partitioning, while the device on-chip shared memory and caches can be utilized to further decrease their execution time. This chapter focuses on the acceleration of the Aho-Corasick [2], Set Horspool [67], Set Backward Oracle Matching [67], Wu-Manber [96] and SOG [78] multiple pattern matching algorithms through their implementation on a GPU using the CUDA API. The performance of the parallel implementations is evaluated after applying different optimization techniques and their searching time is compared to the time of the sequential implementations. To the best of our knowledge, the Set Horspool, Set Backward Oracle Matching and SOG algorithms have never been implemented before on a GPU using the CUDA API.

6.2 Related Work

Multiple pattern matching algorithms can take advantage of modern parallel systems like GPUs to improve their performance, especially when large data sets are involved (e.g. aerial imaging) or in time-critical applications. The use of GPUs to accelerate the performance of multiple pattern matching algorithms has been studied in the past.

The Aho-Corasick algorithm was implemented in [89] using the CUDA API to perform network intrusion detection. The Aho-Corasick trie was represented by a two-dimensional state transition array; each row of the array corresponded to a state of the trie, each column to a different character of the alphabet Σ, while the cells of the array represented the next state. The array was precomputed by the CPU and was then stored to the texture memory of the GPU. The input string (in the form of network packets) was stored in buffers allocated using pinned host memory with a double buffering scheme and was copied to the global memory of the GPU as soon as each buffer was full. Since the input string was in the form of network packets, two different parallelization approaches were considered; with the first, each packet was processed by a different warp of threads while with the second, each packet was processed by a different thread. A speedup of 3.2 over the respective sequential implementation was reported for the parallel implementation of Aho-Corasick when executed using an NVIDIA GeForce G80 GPU. A similar implementation of the Aho-Corasick algorithm was used in [90] to perform heavy-duty anti-malware operations.

The Parallel Failureless Aho-Corasick algorithm (PFAC), an interesting variant of the Aho-Corasick algorithm on a GPU, was presented in [60]. It is significantly different from the rest of the parallel implementations of the same algorithm in the sense that each character of the input stream was assigned to a different thread of the GPU. Since each thread needs only to examine whether a pattern exists starting at the specific character and no back-tracking takes place, the supply function of Aho-Corasick was removed. The goto function was mapped into a two-dimensional state transition array. The state transition array was then stored in shared memory by grouping the patterns based on their prefixes and distributing these groups into different multiprocessors. The original paper claimed a speedup of 4000 for the PFAC algorithm when executed using an NVIDIA GeForce GTX 295 GPU, compared to the sequential implementation of Aho-Corasick.

In [87], the Aho-Corasick algorithm was implemented using the CUDA API. The Aho-Corasick trie was represented using a two-dimensional state transition array that was precomputed on the CPU and stored to the texture memory of the GPU. Instead of storing a map of the final states in another array, the final states that corresponded to a complete pattern were flagged directly in the goto array by reserving 1 bit. Bitwise operations were then used to check its value. The parallelization of the algorithm was achieved by assigning different characters of the input string to different threads of the GPU and letting them perform the matching by accessing the shared state transition array.

The work presented in [98] focused on the implementation of the Aho-Corasick algorithm on a GPU. The Aho-Corasick trie was precomputed on the CPU and was stored in the texture memory of the GPU, although it is not clear how it was represented. The input string was stored in the global memory of the GPU, partitioned into blocks and assigned to different threads in order to achieve parallelization. The threads were then responsible for processing the input string and creating an output array of the states of the Aho-Corasick trie that corresponded to each character position. The paper utilized a number of optimizations to further improve the performance of the algorithm implementation: the input string was cast from unsigned char to uint4 to ensure that each thread reads 16 input string characters from the global memory instead of 1. To further improve the bandwidth utilization, the accesses of multiple threads inside the same half-warp were coalesced by reading the data required to process an input string block into the shared memory of the device. Finally, to avoid shared memory bank conflicts, the threads of a half-warp accessed memory from different banks. The experimental results for the parallel implementation of the Aho-Corasick algorithm on an NVIDIA Tesla GT200 GPU reported a speedup of up to 9.5 relative to the sequential implementation of the same algorithm.

An implementation of the Aho-Corasick algorithm was also presented in [42] using the CUDA API. Similar to the methodology used in previously published research papers, the trie of the algorithm was represented using a two-dimensional state transition array and was stored compressed in the texture memory of the GPU, while the input string was stored in global memory. Kargus was introduced in [43], a software system that utilizes the Aho-Corasick algorithm to perform intrusion detection on a hybrid system consisting of multi-core CPUs and heterogeneous GPUs. The trie of Aho-Corasick was represented in the GPU using a two-dimensional state transition array. The state transition array was created on the CPU and was stored in the GPU. The network packets that comprise the input string were stored in texture memory. To increase the utilization of the memory bandwidth of the GPU, the input string was cast to uint4 using a technique similar to [98].


In [75], the Aho-Corasick and Commentz-Walter algorithms were used to perform virus scanning accelerated through a GPU. The Aho-Corasick and Commentz-Walter tries were represented in the GPU using stacks. The goto and supply functions were substituted with offsets, essentially serializing the trie in a continuous memory block. The stacks were precomputed on the CPU and were then transferred to the GPU. Parallelization was achieved using a data-parallel approach.

Simplified Wu-Manber (SWM), a variant of the Wu-Manber algorithm, was implemented in [91] using the OpenCL API for network intrusion detection. The algorithm was modified by using smaller values of B as well as avoiding hashing and the direct comparison of the patterns to the input string. With these changes, the shift tables became smaller and thus were able to fit in the shared memory of the GPU cores. Moreover, multiple pattern comparisons did not occur and thus thread divergence between cores was avoided. Finally, the hash calculations were entirely removed, further improving the performance of the algorithm implementation. The shift tables of the Wu-Manber variant were calculated by the CPU and were then transferred to the shared memory of the GPU. The input string, in the form of network packets, was stored in pinned host memory and was mapped to the global memory of the GPU. To achieve parallelization, different threads were assigned to non-adjacent sections of the network packets.

The Wu-Manber algorithm was implemented in [76] using the OpenCL SDK to perform pattern matching on biological sequence databases. The available pattern set was preprocessed by the CPU and the SHIFT, HASH and PREFIX tables were then transferred to the GPU. The SHIFT table was compressed to fit into the shared memory of the GPU while the HASH and PREFIX tables as well as the input string were stored in the global memory. The experimental results for the parallel implementation of the Wu-Manber algorithm on an NVIDIA GTX 280 GPU reported a speedup of up to 31.27 in the best case.

6.3 GPU implementation

This section presents a basic data-parallel implementation strategy for the Aho-Corasick, Set Horspool, Set Backward Oracle Matching, Wu-Manber and SOG multiple pattern matching algorithms, analyzes the characteristics of the algorithm implementations that leverage the capabilities of the device, discusses the flaws that affect their performance and addresses them using different optimization techniques.

6.3.1 Basic Parallelization Strategy

Recall from section 5.4 that the threads of each thread block are identifiable through a unique per-block thread ID using the threadIdx vector. As there is no built-in CUDA vector to universally index a thread, the tid variable was used instead. Moreover, to expose data parallelism, two auxiliary variables, start and stop, were used to indicate the input string positions where the search phase begins and ends respectively.

To expose the parallelism of the multiple pattern matching algorithms, the following basic data-parallel implementation strategy was used. The preprocessing phase of the algorithms was performed sequentially on the host CPU. The input string and all preprocessing arrays, including the arrays that represent the trie of the Aho-Corasick, Set Horspool and Set Backward Oracle Matching algorithms as discussed later in this chapter, were copied to the global memory of the device. The input string was subsequently partitioned into numBlocks character chunks, each with a size of Schunk = n/numBlocks characters. The chunks were then assigned to the numBlocks thread blocks. Each chunk was further partitioned into blockDim sub-chunks, which in turn were assigned to each of the blockDim threads of a thread block. To ensure the correctness of the results, m − 1 overlapping characters were used per thread. These characters were added to the stop variable of Aho-Corasick and SOG, algorithms that perform forward searching, and to the start variable of Set Horspool, Set Backward Oracle Matching and Wu-Manber, algorithms that scan the input string backwards. Therefore, each thread processed Sthread = n/(numBlocks × blockDim) + m − 1 characters, for a total of (m − 1)(numBlocks × blockDim − 1) additional characters. An array out with a size of numBlocks × blockDim integers was used to store the number of matches per thread. To avoid extra coding complexity it is assumed that n is divisible by both numBlocks and blockDim. Since the character chunks have to overlap, the fewest possible thread blocks should be used to reduce the redundant characters, as long as the maximum possible occupancy level is maintained per SM. A pseudocode of the host function for the parallel implementation of the algorithms is given in Figure 6.1.

Main procedure main()

1. Execute the preprocessing phase of the algorithms;
2. Where relevant, represent tries using arrays;
3. Allocate one-dimensional and two-dimensional arrays in the global memory of the device using the cudaMalloc() and cudaMallocPitch() functions respectively;
4. Copy the data from the host arrays to the device arrays using the cudaMemcpy() and cudaMemcpy2D() functions;
5. Launch CUDA kernel;
6. Copy the results array back to host memory;
7. Calculate the results;

Figure 6.1: Pseudocode of the host function
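For illustration, a minimal host-side sketch of these steps for the Aho-Corasick kernel is shown below; the function and array names (search_on_gpu, ac_kernel, h_state_transition and so on) are assumptions of this sketch and not the identifiers of the actual thesis code.

#include <cuda_runtime.h>

// Hypothetical host-side wrapper following the steps of Figure 6.1.
extern __global__ void ac_kernel(const unsigned char *text, const int *state_transition,
                                 size_t pitch, const int *state_supply,
                                 const int *state_final, int *out, int n, int m);

int search_on_gpu(const unsigned char *h_text, int n, int m,
                  const int *h_state_transition, const int *h_state_supply,
                  const int *h_state_final, int num_states, int alphabet_size,
                  int num_blocks, int threads_per_block, int *h_out)
{
    unsigned char *d_text;
    int *d_state_transition, *d_state_supply, *d_state_final, *d_out;
    size_t pitch;

    // Step 3: allocate 1D arrays with cudaMalloc() and the 2D array with cudaMallocPitch().
    cudaMalloc((void **)&d_text, n);
    cudaMalloc((void **)&d_state_supply, num_states * sizeof(int));
    cudaMalloc((void **)&d_state_final, num_states * sizeof(int));
    cudaMalloc((void **)&d_out, num_blocks * threads_per_block * sizeof(int));
    cudaMallocPitch((void **)&d_state_transition, &pitch,
                    alphabet_size * sizeof(int), num_states);

    // Step 4: copy host data to the device with cudaMemcpy() and cudaMemcpy2D().
    cudaMemcpy(d_text, h_text, n, cudaMemcpyHostToDevice);
    cudaMemcpy(d_state_supply, h_state_supply, num_states * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_state_final, h_state_final, num_states * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy2D(d_state_transition, pitch, h_state_transition,
                 alphabet_size * sizeof(int), alphabet_size * sizeof(int),
                 num_states, cudaMemcpyHostToDevice);

    // Steps 5-6: launch the kernel and copy the per-thread match counts back.
    ac_kernel<<<num_blocks, threads_per_block>>>(d_text, d_state_transition, pitch,
                                                 d_state_supply, d_state_final, d_out, n, m);
    cudaMemcpy(h_out, d_out, num_blocks * threads_per_block * sizeof(int),
               cudaMemcpyDeviceToHost);

    // Step 7: the total number of matches is the sum of the per-thread counters.
    int matches = 0;
    for (int i = 0; i < num_blocks * threads_per_block; i++)
        matches += h_out[i];
    return matches;
}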

Three tables were constructed during the preprocessing phase of the Aho-Corasick algorithm implementation. State transition is a two-dimensional array where each row corresponds to a state of the trie, each column to a different character of the alphabet Σ and the cells of the array represent the next state. To ensure that alignment requirements are met on each row, state transition was allocated as pitched linear device memory using the cudaMallocPitch() function. State supply and state final are one-dimensional arrays, allocated using the cudaMalloc() function. Each column of the arrays corresponds to a different state of the trie, while the cells represent the supply state of a given state and the information whether that state is final or not respectively. Since the Aho-Corasick trie can have a maximum of m × d + 1 trie states, a value that for the experiments of this chapter typically exceeded 65,536, each state was represented using a 4-byte integer. Algorithm listing 6.1 presents the basic data-parallel strategy used for the implementation of the Aho-Corasick algorithm using the CUDA API. As can be seen from the pseudocode, there can be divergence among the execution paths of the threads when previous states are visited using the supply function of the algorithm.
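Because state transition is allocated as pitched memory, a row must be addressed through the pitch returned by cudaMallocPitch() rather than the logical row width. A minimal device-side sketch of the next-state lookup, with illustrative names, could be:

// Device-side lookup in the pitched state_transition array (names are illustrative).
// pitch is the row stride in bytes returned by cudaMallocPitch() on the host.
__device__ int next_state(const int *state_transition, size_t pitch,
                          int state, unsigned char c)
{
    const int *row = (const int *)((const char *)state_transition + state * pitch);
    return row[c];   // column c of row `state`: the next state for character c
}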

The Set Horspool implementation used identical versions of the state transition and state final tables of the Aho-Corasick implementation to locate all the occurrences of any pattern in the input string during the search phase in reverse. Moreover, state shift was used, a one-dimensional array of size |Σ| allocated using the cudaMalloc() function that represents the bad character shift of the Horspool algorithm. Each cell of the array corresponds to a different character of the alphabet Σ and holds the number of input string character positions that can be safely skipped during the search phase.


ALGORITHM 6.1: A basic parallel implementation of the Aho-Corasick algorithm

tid := blockId × blockDim + threadId
start := (blockId × n)/numBlocks + (n × threadId)/(numBlocks × blockDim)
stop := start + n/(numBlocks × blockDim) + m − 1
state := 0
for i := start; i < stop; i := i + 1 do
    while (newState := state_transition[state, t_i]) = fail do
        state := state_supply[state]
    end
    state := newState
    out[tid] := out[tid] + state_final[state]
end
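A possible CUDA rendering of Algorithm listing 6.1 is sketched below; the pitched-array addressing, the use of -1 as the fail value and the identifier names are assumptions of this sketch rather than the exact thesis code.

// Sketch of the basic data-parallel Aho-Corasick kernel (Algorithm listing 6.1).
// state_transition is a pitched [numStates x |Sigma|] array; -1 denotes fail.
__global__ void ac_basic_kernel(const unsigned char *text, int n, int m,
                                const int *state_transition, size_t pitch,
                                const int *state_supply, const int *state_final,
                                int *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int start = (blockIdx.x * n) / gridDim.x + (n * threadIdx.x) / (gridDim.x * blockDim.x);
    int stop = start + n / (gridDim.x * blockDim.x) + m - 1;
    int state = 0, matches = 0;

    for (int i = start; i < stop && i < n; i++) {
        unsigned char c = text[i];
        const int *row = (const int *)((const char *)state_transition + state * pitch);
        int newState = row[c];
        // Follow supply links until a valid transition for c exists.
        while (newState == -1) {
            state = state_supply[state];
            row = (const int *)((const char *)state_transition + state * pitch);
            newState = row[c];
        }
        state = newState;
        matches += state_final[state];
    }
    out[tid] = matches;
}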

The basic data-parallel strategy used for the implementation of the Set Horspool algorithm is given in Algorithm listing 6.2. There can be divergence in the execution path of the threads in the basic implementation of Set Horspool when the next states are visited using the goto function of the algorithm.

ALGORITHM 6.2: A basic parallel implementation of the Set Horspool algorithm

tid := blockId × blockDim + threadId
start := (blockId × n)/numBlocks + (n × threadId)/(numBlocks × blockDim) + m − 1
stop := start + n/(numBlocks × blockDim)
i := start
while i < stop do
    j := 0; state := 0
    while j < m AND (newState := state_transition[state, t_{i−j}]) ≠ fail do
        state := newState; j := j + 1
    end
    i := i + state_shift[t_i]
    out[tid] := out[tid] + state_final[state]
end

To represent the factor oracle of the Set Backward Oracle Matching algorithm, a modified version of the state transition array was used, with a size equal to the versions used by the Aho-Corasick and Set Horspool implementations. The cells of state transition correspond not only to the next state for each character of the alphabet Σ but also contain information for the external transitions of the trie. Moreover, since terminal states can exist that correspond to more than one pattern, the state final table that was used by the Aho-Corasick and Set Horspool algorithms was replaced by state final multi. State final multi is a two-dimensional array where each row represents a state of the trie, each column a different pattern from the pattern set, while the cells represent the information whether that state is final for the given pattern. Both state transition and state final multi arrays were allocated as pitched linear device memory using the cudaMallocPitch() function to ensure that alignment requirements are met in the global memory of the device. The basic data-parallel strategy for the implementation of Set Backward Oracle Matching is depicted in Algorithm listing 6.3. There is the possibility for the threads of every half-warp to diverge from their execution path, causing individual threads to branch out and execute independently. This


can happen when there are potential matches of some patterns from the pattern set, since they have to be verified directly against the input string.

ALGORITHM 6.3: A basic parallel implementation of the Set Backward Oracle Matching algorithm

tid := blockId × blockDim + threadId
start := (blockId × n)/numBlocks + (n × threadId)/(numBlocks × blockDim) + m − 1
stop := start + n/(numBlocks × blockDim)
i := start
while i < stop do
    j := 0; state := 0
    while j < m AND (newState := state_transition[state, t_{i−j}]) ≠ fail do
        state := newState; j := j + 1
    end
    if state_final_multi[state] > 0 AND j = m then
        if Verify all patterns in state_final_multi[state] against the input string then
            out[tid] := out[tid] + 1
        end
        i := i + 1
    else
        i := i + m − j
    end
end

The Wu-Manber algorithm uses the one-dimensional SHIFT table, allocated using the cudaMalloc() function, and the two-dimensional HASH and PREFIX tables, allocated as pitched linear memory using cudaMallocPitch(). The space needed for the SHIFT table is \sum_{i=0}^{B-1} |Σ| × (2^{bitshift})^i. Each row of the HASH and PREFIX tables represents a different pattern from the pattern set while the columns represent different hash values. In the worst case, there could be d patterns with the same hash value h or h′ for their B-character suffix or B′-character prefix respectively, therefore HASH and PREFIX require a space of d × \sum_{i=0}^{B-1} |Σ| × (2^{bitshift})^i each. The relevant data structures were copied to the global memory of the device with no modifications. The basic data-parallel strategy for the implementation of Wu-Manber is depicted in Algorithm listing 6.4. As with the implementations of the Aho-Corasick, Set Horspool and Set Backward Oracle Matching algorithms, there can be divergence among the threads of the same half-warp during the verification of potential matches. The value of the out array is set during the verification phase.
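For concreteness, the table sizes given by the formulas above can be computed as in the following sketch; the helper name and the example values of |Σ|, B and bitshift are illustrative assumptions.

#include <stddef.h>

// Number of entries of the SHIFT table: sum_{i=0}^{B-1} |Sigma| * (2^bitshift)^i.
// Each of HASH and PREFIX needs d times this many entries in the worst case.
static size_t wm_shift_entries(size_t sigma, unsigned B, unsigned bitshift)
{
    size_t entries = 0, factor = 1;
    for (unsigned i = 0; i < B; i++) {
        entries += sigma * factor;   // |Sigma| * (2^bitshift)^i
        factor <<= bitshift;         // advance to the next power of 2^bitshift
    }
    return entries;
}

// Example: for |Sigma| = 256, B = 3 and bitshift = 2, SHIFT has
// 256 * (1 + 4 + 16) = 5376 entries, and HASH and PREFIX have d * 5376 entries each.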

Finally, SOG computes the V, hs array and hs′ array one-dimensional tables during the preprocessing phase, allocated as linear memory using the cudaMalloc() function. V is a bit vector that stores the hash value h for each different B-character block, hs array holds the hash value hs for each pattern from the pattern set, while hs′ array stores the two-level hs′ hash values. The basic data-parallel strategy for the implementation of the SOG algorithm is given in Algorithm listing 6.5. The implementation of the SOG algorithm exhibits a significant level of thread divergence that is expected to affect its performance. As with Wu-Manber, the value of the out array is also set during the verification phase.


ALGORITHM 6.4: A basic parallel implementation of the Wu-Manber algorithm

tid := blockId × blockDim + threadId
start := (blockId × n)/numBlocks + (n × threadId)/(numBlocks × blockDim) + m − 1
stop := start + n/(numBlocks × blockDim)
i := start
while i < stop do
    h := h1(t_{i−B+1} . . . t_i)
    if SHIFT[h] > 0 then
        i := i + SHIFT[h]
    else
        h′ := h2(t_{i−m+1} . . . t_{i−m+1+B′−1})
        forall the pattern indices r stored in HASH[h] do
            if PREFIX[r] = h′ then
                Verify the pattern corresponding to r directly against the input string
            end
        end
        i := i + 1
    end
end

6.3.2 Implementation Limitations and Optimization Techniques

As discussed in section 5.4, accesses to global memory for compute capability 1.3 GPUs by all threads of a half-warp are coalesced into a single memory transaction when all the requested words are within the same memory segment. The segment size is 32 bytes when 1-byte words are accessed, 64 bytes for 2-byte words and 128 bytes for words of 4, 8 and 16 bytes. With the basic implementation strategy, each thread reads a single 1-byte character on each iteration of the search loop; in this case the memory segment has a size of 32 bytes. When Sthread > 32, each thread accesses a word from a different memory segment of the global memory. This results in uncoalesced memory transactions, with one memory transaction for every access of a thread. Since the maximum memory throughput of the global memory is 128 bytes per transaction, the access pattern of the threads results in the utilization of only 1/128 of the available bandwidth.

Coalescing Memory Accesses

To work around the coalescing requirements of the global memory and increase the utilization of the memory bandwidth, it is important to change the memory access pattern by reading words from the same memory segment and subsequently storing them in the shared memory of the device. This involves the partition of the input string into n/Smemsize chunks and the collective read of Smemsize characters from the global into the shared memory by all blockDim threads of a thread block. For every 16 successive characters from the same segment, only a single memory transaction is then used. This technique results in the utilization of 1/8 of the global memory bandwidth, an improvement by a factor of 16. The threads can subsequently access the characters stored in shared memory in any order with a very low latency. The change in the memory access of the parallel algorithm implementations is evident in the 5th row of Algorithm listings 6.6 to 6.10.
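A minimal sketch of such a cooperative load is given below; the kernel name and the assumption that Smemsize is a compile-time constant are illustrative simplifications, and the m − 1 overlapping characters are omitted here.

#define SMEM_SIZE 16128   // the Smemsize value used in this chapter, in bytes

__global__ void coalesced_load_sketch(const unsigned char *text, int n)
{
    __shared__ unsigned char sArray[SMEM_SIZE];

    // Each thread block walks over the input string in SMEM_SIZE-character chunks.
    for (int chunk = blockIdx.x * SMEM_SIZE; chunk < n; chunk += gridDim.x * SMEM_SIZE) {
        // Adjacent threads read adjacent bytes, so the accesses of a half-warp fall
        // inside the same 32-byte segment and are coalesced into one transaction.
        for (int j = threadIdx.x; j < SMEM_SIZE && chunk + j < n; j += blockDim.x)
            sArray[j] = text[chunk + j];
        __syncthreads();

        // ... the search phase of the algorithm then reads characters from sArray ...

        __syncthreads();   // all threads finish before the next chunk is loaded
    }
}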

Using the shared memory to increase the utilization of the memory bandwidth has two disadvantages. First, a total of n/Smemsize × (blockDim − 1) × (m − 1) redundant characters are used, which introduce significantly more work overhead when compared to the basic data-parallel implementation strategy.


ALGORITHM 6.5: A basic parallel implementation of the SOG algorithm

tid := blockId × blockDim + threadId
start := (blockId × n)/numBlocks + (n × threadId)/(numBlocks × blockDim)
stop := start + n/(numBlocks × blockDim) + m − 1
E := 2^m
for i := start; i < stop; i := i + 1 do
    E := (E ≪ 1) ∨ V[h1(t_i . . . t_{i+B−1})]
    if E ∧ 2^{m−B} then
        continue
    end
    Verify the patterns using two-level hashing
end

Second, using the shared memory effectively reduces the occupancy of the SMs. As the size of the shared memory for each SM of the GTX 280 GPU is 16KB, using the whole shared memory would reduce the occupancy to one thread block per SM. Partitioning the shared memory is not an efficient option since it was determined experimentally that it further increases the total work overhead.

Packing Input String Characters

The utilization of the global memory bandwidth can also increase when the threads read 16-byte words instead of single characters on every memory transaction. For that, the built-in uint4 vector can be used, a C structure with members x, y, z, and w that is derived from the basic integer type, as can be seen in rows 6−12 of Algorithm listings 6.6 to 6.10. This way, each thread accesses a 128-bit uint4 word that corresponds to 16 characters of the input string with a single memory transaction, while at the same time the memory segment size increases from 32 to 128 bytes. Having each thread read 128-bit uint4 words from different memory segments results in the utilization of 1/8 of the global memory bandwidth, similar to the coalescing technique above. The input string array stored in global memory can be cast to uint4 as follows:

// d_text is the input string array resident in the global memory of the device.
uint4 *uint4_text = reinterpret_cast<uint4 *>(d_text);

The two previous techniques can be combined; reading 16 successive 128-bit words, or 256 bytes, in the form of 16 uint4 vectors from global to shared memory can be done with just two memory transactions, fully utilizing the global memory bandwidth. The input string characters are then extracted from the uint4 vectors as retrieved from the global memory and are subsequently stored in shared memory on a character-by-character basis. To access the characters inside a uint4 vector, the vector can be recast to uchar4:

uint4 uint4_var = uint4_text[i];

uchar4 uchar4_var0 = *reinterpret_cast<uchar4 *>(&uint4_var.x);
uchar4 uchar4_var1 = *reinterpret_cast<uchar4 *>(&uint4_var.y);
uchar4 uchar4_var2 = *reinterpret_cast<uchar4 *>(&uint4_var.z);
uchar4 uchar4_var3 = *reinterpret_cast<uchar4 *>(&uint4_var.w);

The drawback of casting input string characters to uint4 vectors and recasting them to uchar4 vectors is that it can be expensive in terms of processing power.
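Combining the two techniques, a thread can fetch one uint4 word and unpack its 16 characters into the shared-memory buffer as in the following sketch; the kernel name, the fixed buffer size and the omission of the m − 1 overlapping characters are simplifying assumptions.

// Each thread fetches one 128-bit uint4 word (16 characters) from global memory and
// unpacks it into the shared-memory buffer; names are illustrative.
__global__ void uint4_load_sketch(const uint4 *uint4_text, int n16 /* = n/16 */)
{
    __shared__ unsigned char sArray[16128];

    for (int i = blockIdx.x * (16128 / 16) + threadIdx.x, j = threadIdx.x;
         j < 16128 / 16 && i < n16;
         i += blockDim.x, j += blockDim.x) {
        uint4 v = uint4_text[i];                         // one 128-bit transaction
        uchar4 c0 = *reinterpret_cast<uchar4 *>(&v.x);   // 4 characters per member
        uchar4 c1 = *reinterpret_cast<uchar4 *>(&v.y);
        uchar4 c2 = *reinterpret_cast<uchar4 *>(&v.z);
        uchar4 c3 = *reinterpret_cast<uchar4 *>(&v.w);

        sArray[j * 16 +  0] = c0.x; sArray[j * 16 +  1] = c0.y;
        sArray[j * 16 +  2] = c0.z; sArray[j * 16 +  3] = c0.w;
        sArray[j * 16 +  4] = c1.x; sArray[j * 16 +  5] = c1.y;
        sArray[j * 16 +  6] = c1.z; sArray[j * 16 +  7] = c1.w;
        sArray[j * 16 +  8] = c2.x; sArray[j * 16 +  9] = c2.y;
        sArray[j * 16 + 10] = c2.z; sArray[j * 16 + 11] = c2.w;
        sArray[j * 16 + 12] = c3.x; sArray[j * 16 + 13] = c3.y;
        sArray[j * 16 + 14] = c3.z; sArray[j * 16 + 15] = c3.w;
    }
    __syncthreads();
    // ... the search phase then reads characters from sArray ...
}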


Texture Binding

The preprocessing arrays of the algorithms are relatively small in size while at the same time they are frequently accessed by the threads. In the case of the Set Backward Oracle Matching, Wu-Manber and SOG algorithms, where a character-by-character verification of the patterns against the input string is required, the pattern set array is also accessed often. The performance of the parallel implementation of the algorithms should then benefit from the binding of the relevant arrays to the texture memory of the device. The linear memory regions were bound to texture references using cudaBindTexture() for one-dimensional arrays and cudaBindTexture2D() for two-dimensional arrays. The textures were then accessed in-kernel using the tex1Dfetch() and tex2D() functions. Arrays accessed via textures not only take advantage of the texture caches to minimize the memory latency when cache hits occur but also bypass the coalescing requirements of the global memory.
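A minimal sketch of binding a one-dimensional preprocessing array to a texture reference and fetching it in-kernel is shown below; the array and function names are illustrative, not the thesis' actual identifiers.

#include <cuda_runtime.h>

// Texture reference for the one-dimensional state_final array (illustrative name).
texture<int, 1, cudaReadModeElementType> tex_state_final;

// Host side: bind the linear device allocation d_state_final to the texture reference.
void bind_state_final(const int *d_state_final, int num_states)
{
    cudaBindTexture(0, tex_state_final, d_state_final, num_states * sizeof(int));
}

// Device side: reads go through the texture cache and bypass the coalescing rules.
__device__ int is_final(int state)
{
    return tex1Dfetch(tex_state_final, state);
}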

The tex_state_transition, tex_state_supply, tex_state_final, tex_state_shift, tex_state_final_multi, tex_SHIFT, tex_HASH, tex_PREFIX and tex_V arrays in Algorithm listings 6.6 to 6.10 correspond to state_transition, state_supply, state_final, state_shift, state_final_multi, SHIFT, HASH, PREFIX and V respectively when bound to the texture memory of the device.

Avoiding Bank Conflicts

The shared memory of the GTX 280 GPU consists of 16 memory banks numbered 0−15. The banks are organized in such a way that successive 32-bit words are mapped into successive banks, with word i being stored in bank i mod 16. Bank conflicts occur when two or more threads of the same half-warp try to simultaneously access words i, j, . . . , z when i mod 16 = j mod 16 = . . . = z mod 16. When the memory coalescence optimizations described above are used, it is challenging to avoid bank conflicts when the 16 characters of a uint4 vector are successively stored to the shared memory and when they are retrieved from shared memory by the threads of the same half-warp during the search phase. Storing the input string characters in shared memory results in a 4-way bank conflict. An alternative would be to cast each uint4 vector to 4 uchar4 vectors and store them in shared memory in a round-robin fashion:

int tid16 = threadIdx.x % 16;

if ( tid16 < 4 ) {
    uchar4_s_array[threadIdx.x * 4 + 0] = uchar4_var0;
    uchar4_s_array[threadIdx.x * 4 + 1] = uchar4_var1;
    uchar4_s_array[threadIdx.x * 4 + 2] = uchar4_var2;
    uchar4_s_array[threadIdx.x * 4 + 3] = uchar4_var3;
} else if ( tid16 < 8 ) {
    uchar4_s_array[threadIdx.x * 4 + 1] = uchar4_var1;
    uchar4_s_array[threadIdx.x * 4 + 2] = uchar4_var2;
    uchar4_s_array[threadIdx.x * 4 + 3] = uchar4_var3;
    uchar4_s_array[threadIdx.x * 4 + 0] = uchar4_var0;
} else if ( tid16 < 12 ) {
    uchar4_s_array[threadIdx.x * 4 + 2] = uchar4_var2;
    uchar4_s_array[threadIdx.x * 4 + 3] = uchar4_var3;
    uchar4_s_array[threadIdx.x * 4 + 0] = uchar4_var0;
    uchar4_s_array[threadIdx.x * 4 + 1] = uchar4_var1;
} else {
    uchar4_s_array[threadIdx.x * 4 + 3] = uchar4_var3;
    uchar4_s_array[threadIdx.x * 4 + 0] = uchar4_var0;
    uchar4_s_array[threadIdx.x * 4 + 1] = uchar4_var1;
    uchar4_s_array[threadIdx.x * 4 + 2] = uchar4_var2;
}

This technique was not used since in practice the performance of the implementations did not improve. Although it was conflict-free when storing the vectors, it resulted in a 4-way thread divergence that caused the serialization of accesses to shared memory, the same effect that the example code was trying to avoid. Moreover, the modulo operator is very expensive when used inside CUDA kernels and therefore had a significant impact on the implementations' performance.
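As an aside, when the divisor is a power of two, as is the case for the half-warp size, the modulo can be replaced by a bitwise AND with identical results for non-negative operands, e.g.:

// Equivalent to threadIdx.x % 16 for the non-negative thread indices, but much cheaper.
__device__ int half_warp_lane() { return threadIdx.x & 15; }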

Table 6.1 lists the register usage per thread for the basic and optimized versions of the algorithm implementations, as reported by the NVCC compiler using the --ptxas-options=-v flag. Algorithm listings 6.6 to 6.10 depict the pseudocode of the optimized kernels for the Aho-Corasick, Set Horspool, Set Backward Oracle Matching, Wu-Manber and SOG algorithms. sArray represents an array stored in the shared memory of the device with a size of Smemsize = 16,128 bytes.

Table 6.1: Register usage per thread

Algorithm                        Basic implementation    Optimized implementation
Aho-Corasick                     9                       18
Set Horspool                     9                       19
Set Backward Oracle Matching     13                      22
Wu-Manber                        15                      22
SOG                              13                      21


ALGORITHM 6.6: An optimized parallel implementation of the Aho-Corasick algorithm

tid := blockId × blockDim + threadId
start := Smemsize × threadId/blockDim
stop := start + Smemsize/blockDim + m − 1
matches := 0
for gMem := blockId × Smemsize; gMem < n; gMem := gMem + numBlocks × Smemsize do
    for i := gMem/16 + threadId, j := threadId; j < Smemsize/16; i := i + blockDim, j := j + blockDim do
        if i < n/16 then
            read a uint4 vector from t_i
            unpack the uint4 vector and store the 16 characters to sArray_{j×16+0 ... j×16+15}
        end
    end
    Add m − 1 redundant characters to the end of the shared memory
    syncthreads()
    state := 0
    for i := start; i < stop AND gMem + i < n; i := i + 1 do
        while (newState := tex_state_transition[state, sArray_i]) = fail do
            state := tex_state_supply[state]
        end
        state := newState
        matches := matches + tex_state_final[state]
    end
    syncthreads()
end
out[tid] := matches


ALGORITHM 6.7: An optimized parallel implementation of the Set Horspool algorithm

tid := blockId × blockDim + threadId
start := Smemsize × threadId/blockDim + m − 1
stop := start + Smemsize/blockDim
matches := 0
for gMem := blockId × Smemsize; gMem < n; gMem := gMem + numBlocks × Smemsize do
    for i := gMem/16 + threadId, j := threadId; j < Smemsize/16; i := i + blockDim, j := j + blockDim do
        if i < n/16 then
            read a uint4 vector from t_i
            unpack the uint4 vector and store the 16 characters to sArray_{j×16+0 ... j×16+15}
        end
    end
    Add m − 1 redundant characters to the end of the shared memory
    syncthreads()
    i := start
    while i < stop AND gMem < n − i do
        j := 0; state := 0
        while j < m AND (newState := tex_state_transition[state, sArray_{i−j}]) ≠ fail do
            state := newState; j := j + 1
        end
        i := i + tex_state_shift[sArray_i]
        matches := matches + tex_state_final[state]
    end
    syncthreads()
end
out[tid] := matches


ALGORITHM 6.8: An optimized parallel implementation of the Set Backward Oracle Matching algorithm

tid := blockId × blockDim + threadId
start := Smemsize × threadId/blockDim + m − 1
stop := start + Smemsize/blockDim
matches := 0
for gMem := blockId × Smemsize; gMem < n; gMem := gMem + numBlocks × Smemsize do
    for i := gMem/16 + threadId, j := threadId; j < Smemsize/16; i := i + blockDim, j := j + blockDim do
        if i < n/16 then
            read a uint4 vector from t_i
            unpack the uint4 vector and store the 16 characters to sArray_{j×16+0 ... j×16+15}
        end
    end
    Add m − 1 redundant characters to the end of the shared memory
    syncthreads()
    i := start
    while i < stop do
        j := 0; state := 0
        while j < m AND (newState := tex_state_transition[state, sArray_{i−j}]) ≠ fail do
            state := newState; j := j + 1
        end
        if tex_state_final_multi[state] > 0 AND j = m then
            if Verify all patterns in tex_state_final_multi[state] against the input string then
                matches := matches + 1
            end
            i := i + 1
        else
            i := i + m − j
        end
    end
    syncthreads()
end
out[tid] := matches


ALGORITHM 6.9: An optimized parallel implementation of the Wu-Manber algorithm

tid := blockId × blockDim + threadId
start := Smemsize × threadId/blockDim + m − 1
stop := start + Smemsize/blockDim
matches := 0
for gMem := blockId × Smemsize; gMem < n; gMem := gMem + numBlocks × Smemsize do
    for i := gMem/16 + threadId, j := threadId; j < Smemsize/16; i := i + blockDim, j := j + blockDim do
        if i < n/16 then
            read a uint4 vector from t_i
            unpack the uint4 vector and store the 16 characters to sArray_{j×16+0 ... j×16+15}
        end
    end
    Add m − 1 redundant characters to the end of the shared memory
    syncthreads()
    i := start
    while i < stop do
        h := h1(sArray_{i−B+1} . . . sArray_i)
        if tex_SHIFT[h] > 0 then
            i := i + tex_SHIFT[h]
        else
            h′ := h2(sArray_{i−m+1} . . . sArray_{i−m+1+B′−1})
            forall the pattern indices r stored in tex_HASH[h] do
                if tex_PREFIX[r] = h′ then
                    Verify the pattern corresponding to r directly against the input string
                end
            end
            i := i + 1
        end
    end
    syncthreads()
end
out[tid] := matches


ALGORITHM 6.10: An optimized parallel implementation of the SOG algorithm

tid := blockId × blockDim + threadId
start := Smemsize × threadId/blockDim
stop := start + Smemsize/blockDim + m − 1
matches := 0
for gMem := blockId × Smemsize; gMem < n; gMem := gMem + numBlocks × Smemsize do
    for i := gMem/16 + threadId, j := threadId; j < Smemsize/16; i := i + blockDim, j := j + blockDim do
        if i < n/16 then
            read a uint4 vector from t_i
            unpack the uint4 vector and store the 16 characters to sArray_{j×16+0 ... j×16+15}
        end
    end
    Add m − 1 redundant characters to the end of the shared memory
    syncthreads()
    E := 2^m
    for i := start; i < stop; i := i + 1 do
        E := (E ≪ 1) ∨ tex_V[h1(sArray_i . . . sArray_{i+B−1})]
        if E ∧ 2^{m−B} then
            continue
        end
        Verify the patterns using two-level hashing
    end
    syncthreads()
end
out[tid] := matches

6.4 Analysis

The previous sections gave details on the design and the implementation of the algorithms using the CUDA API. In this section, the performance of the algorithm implementations is evaluated when a number of optimizations are applied for different sets of data, and is compared to the performance of the sequential implementations. Each implementation stage also incorporates the optimizations of the previous stages.

1. The first stage of the implementation was unoptimized. The input string, the pattern set, the preprocessing arrays of the algorithms and the out array that holds the number of matches per thread were stored in the global memory of the device. Each thread accessed input string characters directly from the global memory as needed and stored the number of matches directly to out. At the end of the algorithms' search phase, the results array was transferred to the host memory and the total number of matches was calculated by the CPU.

2. The second stage of the implementation involved the binding of the preprocessing arrays, and the pattern set array in the case of the Set Backward Oracle Matching, Wu-Manber and SOG algorithms, to the texture memory of the device.

3. In the third stage, the threads worked around the coalescing requirements of the global memory by collectively reading input string characters and storing them to the shared memory of the device.


4. The fourth stage of the implementation was similar to the third but in addition, each thread read simultaneously 16 input string characters from the global memory by using a uint4 vector. The characters were extracted from the vectors using uchar4 vectors and were subsequently stored to the shared memory.

Figure 6.2: Speedup of different Aho-Corasick implementations for randomly generated input strings, biological sequence databases and English alphabet data (two panels: 1,000-pattern and 8,000-pattern sets; curves: unoptimized, texture cache, coalesced reads, reads of 16 characters).

Figure 6.3: Speedup of different Set Horspool implementations for randomly generated input strings, biological sequence databases and English alphabet data (two panels: 1,000-pattern and 8,000-pattern sets; curves: unoptimized, texture cache, coalesced reads, reads of 16 characters).

For the experiments of this chapter, 30 thread blocks were used, one per SM, with 256 threads per block. As discussed in section 6.3, the shared memory was utilized to work around the coalescing requirements of the global memory during the third implementation stage, and it was determined experimentally that further partitioning the shared memory was not an efficient option as the total work overhead increases.
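A minimal sketch of the corresponding launch configuration (the kernel name and argument list are placeholders):

extern __global__ void ac_optimized_kernel(const unsigned char *text, int n, int m, int *out);

// 30 thread blocks (one per SM of the GTX 280) with 256 threads each.
void launch_search(const unsigned char *d_text, int n, int m, int *d_out)
{
    dim3 grid(30), block(256);
    ac_optimized_kernel<<<grid, block>>>(d_text, n, m, d_out);
}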

Figures 6.2 to 6.6 depict the speedup of different implementations of the Aho-Corasick, Set Horspool, Set Backward Oracle Matching, Wu-Manber and SOG algorithms for randomly generated input strings, biological sequence databases and English alphabet data on a GTX 280 GPU.


Figure 6.4: Speedup of different Set Backward Oracle Matching implementations for randomly generated input strings, biological sequence databases and English alphabet data (two panels: 1,000-pattern and 8,000-pattern sets; curves: unoptimized, texture cache, coalesced reads, reads of 16 characters).

Figure 6.5: Speedup of different Wu-Manber implementations for randomly generated input strings, biological sequence databases and English alphabet data (two panels: 1,000-pattern and 8,000-pattern sets; curves: unoptimized, texture cache, coalesced reads, reads of 16 characters).

As can generally be seen, the performance of the algorithms increased significantly when implemented in parallel on a GPU using the CUDA API, even before applying any optimization. When the presented optimization techniques were used, though, the implementations had a significantly higher speedup over the sequential versions, although different algorithms were affected in different ways.

When the preprocessing arrays, and the pattern set in the case of some algorithms, were bound to the texture memory of the device, the performance of the implementations increased considerably. Also, in most cases the implementations exhibited a faster running time when the accesses to the input string for the threads of the same half-warp were coalesced with the use of the shared memory. On the other hand, it can be observed that the performance of the algorithm implementations did not improve significantly in the fourth implementation stage due to the high cost involved in casting the input string characters to uint4 vectors and recasting them to uchar4 vectors, as discussed in section 6.3.

For the Aho-Corasick algorithm, the experimentally measured speedup of the basic parallel implementation over the sequential implementation ranged from 3.2 to 10.9.


Figure 6.6: Speedup of different SOG implementations for randomly generated input strings, biological sequence databases and English alphabet data (two panels: 1,000-pattern and 8,000-pattern sets; curves: unoptimized, texture cache, coalesced reads, reads of 16 characters).

For the second and third implementation stages, the overhead caused by the frequent accesses of the threads to the global memory of the device was partially reduced, with an additional performance increase of up to 3.5 times. The speedup of the final, optimized kernel ranged between 10 and 18.5 compared to the sequential implementation, slightly faster than the speedups reported in [89, 98]. This was expected, since a more recent GPU was used to evaluate the performance of the algorithm implementations in the experiments of this chapter. The highest speedup was generally observed when the E.coli genome and the FASTA Nucleic Acid (FNA) of the A-thaliana genome were used, together with sets of 1,000 patterns.

The unoptimized implementation of the Set Horspool algorithm had a speedup between 2.5 and 7.5 over the sequential version. As with the implementation of the Aho-Corasick algorithm, the parallel searching time of Set Horspool also benefited when the preprocessing arrays were bound to the texture memory of the device in the second stage of the implementation, although to a lesser degree; the speedup increased between 1.2 and 1.9 times compared to the basic parallel implementation for sets of 1,000 patterns and between 1.1 and 1.4 times for sets of 8,000 patterns. The performance of the implementation increased to a much greater extent during the third implementation stage, with a speedup increase over the second implementation stage of up to 8.7. The speedup of the optimized kernel was between 22.1 and 33.7, significantly higher than the corresponding speedup of the Aho-Corasick algorithm. Again, the highest speedup was achieved when the E.coli genome and the FASTA Nucleic Acid of the A-thaliana genome were used, together with sets of 1,000 patterns.

As can be seen in Figure 6.4, the performance of the Set Backward Oracle Matching algorithm had similar characteristics to that of Set Horspool. The speedup of the implementation ranged between 2.6 and 6.3 over the sequential version of the algorithm and improved further by up to 1.6 times when the preprocessing arrays and the pattern set were bound to the texture memory of the device. When the accesses for the threads of the same half-warp were coalesced, the performance of the algorithm implementation improved by up to an additional 8.2 times. The speedup of the optimized kernel of the algorithm ranged between 16.1 and 33.7. The highest speedup was achieved when the Swiss Prot. Amino Acid sequence database and the FASTA Amino Acid (FAA) of the A-thaliana genome were used, for sets of either 1,000 or 8,000 patterns.

The unoptimized implementation of Wu-Manber was up to 5.9 times faster than its sequential implementation. When the pattern set as well as the SHIFT, HASH and PREFIX tables that were created during the preprocessing phase of the algorithm were bound to the texture memory of the device, and the shared memory was used to bypass the coalescing requirements of the global memory


during the second and third implementation stages, the speedup of the implementation increased by up to an additional 6.1 times. The performance of the final, optimized implementation of Wu-Manber was up to 11.8 times faster than the sequential implementation and was not significantly affected by the type of data set used. Although Wu-Manber achieves a higher best-case speedup in [76], that result only holds for data sets of 100 patterns. For bigger pattern sets, the reported speedup is comparable to the speedup of this chapter.

The parallel implementation of SOG exhibited the lowest speedup compared to the rest of the presented multiple pattern matching algorithms. The performance of the unoptimized kernel increased by up to 5.7 times compared to the sequential implementation. As can be seen in Figure 6.6, the algorithm implementation had a significant speedup increase when the preprocessing arrays were bound to the texture memory and the input string was copied to the shared memory of the device. The increase was more evident for the Swiss Prot. Amino Acid sequence database, the FASTA Amino Acid of the A-thaliana genome and the English language data sets, all of them data with a large alphabet size. The optimized algorithm implementation had a speedup of up to 8.9, observed when the Swiss Prot. Amino Acid sequence database was used together with sets of 1,000 patterns.


6.5 Appendix: Experimental Results

Table 6.2: Searching time of the sequential implementation and the different parallel implementation stages of the Aho-Corasick algorithm (sec)

                            d = 1,000                                                d = 8,000
Implementation              Random   E.coli   FNA      Swiss Prot.  FAA     English  Random   E.coli   FNA      Swiss Prot.  FAA     English
Sequential                  0.0241   0.0914   2.2674   2.4360       0.1658  0.0414   0.0239   0.1139   2.8369   4.3975       0.3133  0.1086
Parallel unoptimized        0.0069   0.0110   0.3804   0.5549       0.0286  0.0063   0.0074   0.0157   0.4997   0.7806       0.0424  0.0100
Parallel texture memory     0.0026   0.0062   0.2415   0.3078       0.0125  0.0028   0.0026   0.0130   0.4123   0.5682       0.0319  0.0079
Parallel coalesced access   0.0026   0.0053   0.1280   0.1607       0.0107  0.0025   0.0026   0.0117   0.2835   0.3778       0.0283  0.0073
Parallel uint4 vector       0.0024   0.0051   0.1245   0.1579       0.0103  0.0025   0.0024   0.0115   0.2809   0.3739       0.0279  0.0072

Table 6.3: Searching time of the sequential implementation and the different parallel implementation stages of the Set Horspool algorithm (sec)

                            d = 1,000                                                d = 8,000
Implementation              Random   E.coli   FNA      Swiss Prot.  FAA     English  Random   E.coli   FNA      Swiss Prot.  FAA     English
Sequential                  0.1033   0.1821   4.6413   3.4116       0.2189  0.0483   0.1021   0.2403   5.9968   5.2887       0.3616  0.1585
Parallel unoptimized        0.0215   0.0312   1.5021   1.3816       0.0594  0.0111   0.0216   0.0448   1.9654   1.7766       0.0790  0.0212
Parallel texture memory     0.0151   0.0214   1.2506   1.0659       0.0372  0.0058   0.0151   0.0354   1.7346   1.5136       0.0630  0.0166
Parallel coalesced access   0.0046   0.0060   0.1437   0.1468       0.0093  0.0023   0.0045   0.0138   0.3285   0.2888       0.0221  0.0100
Parallel uint4 vector       0.0043   0.0058   0.1392   0.1427       0.0090  0.0022   0.0044   0.0137   0.3268   0.2853       0.0219  0.0100


Table 6.4: Searching time of the sequential implementation and the different parallel implementation stages of the Set Backward Oracle Matching algorithm (sec)

                            d = 1,000                                                d = 8,000
Implementation              Random   E.coli   FNA      Swiss Prot.  FAA     English  Random   E.coli   FNA       Swiss Prot.  FAA     English
Sequential                  0.4275   0.1641   4.3900   1.0480       0.0731  0.0166   0.4268   0.3847   10.1298   1.5408       0.1434  0.1087
Parallel unoptimized        0.1109   0.0339   1.3970   0.2499       0.0151  0.0038   0.1123   0.1073   3.9412    0.5145       0.0227  0.0181
Parallel texture memory     0.0704   0.0288   1.2582   0.2268       0.0141  0.0034   0.0709   0.0978   3.7105    0.5106       0.0220  0.0146
Parallel coalesced access   0.0294   0.0103   0.2848   0.0384       0.0031  0.0013   0.0296   0.0487   1.2691    0.0625       0.0057  0.0088
Parallel uint4 vector       0.0291   0.0101   0.2811   0.0313       0.0026  0.0012   0.0299   0.0486   1.2667    0.0573       0.0053  0.0088

Table 6.5: Searching time of the sequential implementation and the different parallel implementation stages of the Wu-Manber algorithm (sec)

                            d = 1,000                                                 d = 8,000
Implementation              Random   E.coli   FNA       Swiss Prot.  FAA     English  Random   E.coli   FNA       Swiss Prot.  FAA     English
Sequential                  1.6960   0.5216   16.4850   5.6569       0.3769  0.0384   1.7048   3.0657   95.2652   17.4932      1.1070  0.1500
Parallel unoptimized        0.4025   0.2274   7.2427    4.1271       0.2264  0.0216   0.2890   0.6879   21.4159   5.4839       0.3182  0.0553
Parallel texture memory     0.2390   0.0635   2.3763    1.9797       0.0927  0.0125   0.2417   0.4130   15.1495   2.0579       0.1286  0.0279
Parallel coalesced access   0.1786   0.0470   1.7517    0.6768       0.0454  0.0053   0.1811   0.3331   12.1671   1.7273       0.0940  0.0224
Parallel uint4 vector       0.1778   0.0466   1.7407    0.6730       0.0450  0.0052   0.1803   0.3319   12.1329   1.7194       0.0937  0.0223


Table 6.6: Searching time of the sequential implementation and the different parallel implementation stages of the SOG algorithm (sec)

                            d = 1,000                                                 d = 8,000
Implementation              Random   E.coli   FNA       Swiss Prot.  FAA     English  Random   E.coli   FNA        Swiss Prot.  FAA     English
Sequential                  2.2144   1.1008   27.7066   1.7657       0.0486  0.0103   4.2572   5.6846   141.7921   3.7539       0.5055  0.0797
Parallel unoptimized        0.5209   0.1941   5.1279    0.9521       0.0416  0.0088   1.7864   1.1362   29.4094    1.0390       0.1331  0.0206
Parallel texture memory     0.5300   0.1868   4.9387    0.8763       0.0373  0.0076   1.5444   1.0854   24.1431    0.9605       0.1278  0.0205
Parallel coalesced access   0.4813   0.1762   4.5012    0.2093       0.0082  0.0026   1.4487   1.0298   22.7312    0.7003       0.0890  0.0173
Parallel uint4 vector       0.4446   0.1630   4.1641    0.1991       0.0072  0.0023   1.3377   1.0186   21.4480    0.6423       0.0823  0.0159


Chapter 7

Two-dimensional Pattern Matching

on a GPU using CUDA

7.1 Introduction

Two-dimensional pattern matching algorithms are frequently used to process huge image databases. As the size of image databases has rapidly increased over the past years, it is expected that the algorithms should benefit when implemented in parallel on a GPU. This chapter introduces parallel implementations of the Baker and Bird [14, 15] and the Baeza-Yates and Regnier [11] two-dimensional pattern matching algorithms on a GPU using the CUDA API. The implementation methodology and the optimization techniques introduced in chapter 6 are adapted here for a two-dimensional data set and the performance of the algorithm implementations when executed in parallel on a GTX 280 GPU is evaluated. To the best of our knowledge, this is the first time that two-dimensional pattern matching algorithms are implemented in parallel on a GPU using different optimization techniques.

7.2 GPU implementation

This section presents data-parallel implementations of the Baker and Bird and the Baeza-Yates and Regnier two-dimensional algorithms, analyzes the specific characteristics that leverage the capabilities of the device, discusses the flaws that affect their performance and addresses them using different optimization techniques.

7.2.1 Basic Parallelization Strategy

To expose the parallelism of Baker and Bird and Baeza-Yates and Regnier, different implementation approaches were considered. A different approach was chosen for each of the algorithms, as each yielded the best performance results.

The efficient execution of Baker and Bird in parallel is not trivial, due to the data dependencies introduced during the column-matching step of the algorithm. In practice, these dependencies inhibit the parallelism level of the implementation. As discussed in section 4.3.1, the Knuth-Morris-Pratt algorithm is used to determine all the positions of the input string where a complete match of the pattern occurs. This is achieved by maintaining a table a with a size of n. Then, for each position j, k of the input string, the value of a[k] depends on the previous value of a[k], as calculated in position j − 1, k, and affects the value of a[k], as determined in position


j + 1, k. For this reason, the rows of the input string should be partitioned in such a way that all characters of a single column are processed by the same thread.

ALGORITHM 7.1: A parallel implementation of the row-matching step of Baker and Bird for GPU architectures

tid := blockId × blockDim + threadId
a := 0; matches := 0
for j := 0; j < n; j := j + 1 do
    state := 0
    for k := tid; k < tid + m; k := k + 1 do
        if (state := state_transition[state, t_{j,k}]) = fail then
            break
        end
    end
    if state ≠ fail AND state_final[state] > 0 then
        Perform the column-matching step using the code from Algorithm listing 7.2
    else
        a := 0
    end
end
out[tid] := matches

To implement the Baker and Bird algorithm in parallel using the CUDA API, the following parallelization approach was used. Each thread received a single character, starting from the first row of the input string and moving to each subsequent row when all threads had completed their execution paths. Since each active thread was responsible for a different column of the input string, table a can be substituted with a variable a. Consider the following example; assume that character σ, located at position j, k of the input string, is assigned to a thread i of the device. If there is a transition between the initial state of the trie and a state q labeled by σ, the thread reads the character at position j, k + 1 next and so on, until a mismatch or a terminal state is found via the goto function of the Aho-Corasick algorithm. When a terminal state is encountered, or in the case of a mismatch, the value of variable a is updated according to the methodology described in section 4.3.1. As no backward movement in the trie takes place, the supply function of the Aho-Corasick algorithm is no longer required and can be removed, similar to the PFAC variant of Aho-Corasick presented in [60] and summarized in section 6.2. Algorithm listing 7.1 presents the implementation of the row-matching step of the Baker and Bird algorithm, while the column-matching step of the algorithm for the presented parallelization approach is given in Algorithm listing 7.2.
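A possible CUDA rendering of the two steps is sketched below; the row-major layout of the input string, the use of -1 as the fail value, the direct row comparison and all identifier names are assumptions of this sketch rather than the thesis' actual code.

// Compare pattern row i against the text segment t[j][k .. k+m-1] (illustrative helper).
__device__ bool row_equal(const unsigned char *pattern, int i,
                          const unsigned char *text, int j, int k, int m, int n)
{
    for (int c = 0; c < m; c++)
        if (pattern[i * m + c] != text[j * n + k + c]) return false;
    return true;
}

__global__ void baker_bird_kernel(const unsigned char *text, int n, int m,
                                  const int *state_transition, size_t pitch,
                                  const int *state_final, const int *next,
                                  const unsigned char *pattern, int *out)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;      // column handled by this thread
    if (k >= n - m + 1) return;                          // only n - m + 1 columns are active

    int a = 0, matches = 0;                              // per-column KMP state

    for (int j = 0; j < n; j++) {                        // scan the rows top to bottom
        // Row-matching step: follow the Aho-Corasick goto function on t[j][k .. k+m-1].
        int state = 0;
        for (int c = 0; c < m && state != -1; c++) {
            const int *row = (const int *)((const char *)state_transition + state * pitch);
            state = row[text[j * n + k + c]];
        }

        if (state != -1 && state_final[state] > 0) {
            // Column-matching step: extend the partial column match with KMP (listing 7.2).
            int i = a;
            while (i > 0 && !row_equal(pattern, i, text, j, k, m, n))
                i = next[i];
            if (row_equal(pattern, i, text, j, k, m, n)) {
                if (i == m - 1) { matches++; a = 0; }    // a complete m x m occurrence ends at row j
                else            { a = i + 1; }
            } else {
                a = 0;
            }
        } else {
            a = 0;                                       // the column match is broken
        }
    }
    out[k] = matches;
}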

Since there are no dependencies between characters of the input string for the Baeza-Yates and Regnier algorithm, different rows of the input string can be simultaneously processed by the threads of the GPU. Thus, it is expected that a greater speedup can be achieved compared to Baker and Bird. To implement Baeza-Yates and Regnier in parallel, the following parallelization approach was used. The first numBlocks primary rows of the input string were assigned to thread blocks 0, . . . , numBlocks − 1. The n characters of each primary row were further divided into chunks, with each chunk having a size of Schunk = m characters, and were subsequently assigned to threads 0, . . . , blockDim − 1 of each block. To ensure the correctness of the results, additional m − 1 characters were used per chunk, for a total of approximately n redundant characters per primary row of the input string. If m × (blockDim − 1) < n, then the next blockDim chunks were assigned to the threads, and so on, until the end of the row was reached.


ALGORITHM 7.2: A parallel implementation of the column-matching step of the Baker and Bird algorithm for GPU architectures

i := a
while i > 0 AND p_{i,0...m−1} ≠ t_{j,k−m...k−1} do
    i := next[i]
end
if i + 1 < m then
    a := i + 1
else
    a := 0
end
if i ≥ m − 1 then
    matches := matches + 1
end

Likewise, if m − 1 + (numBlocks − 1) × m < n, then the next numBlocks primary rows of the input string were assigned to the thread blocks, until the end of the input string was reached. When a potential match was found on a primary row, the m − 1 secondary rows were also scanned by the corresponding thread to determine whether a complete match of the pattern occurs. The improvement for the secondary rows of the input string when duplicate pattern rows exist, as described in section 4.3.2, was omitted, as it is inefficient to allocate a copy of table b for each of the numBlocks × blockDim threads of the device. Therefore, in the worst case where p0 = . . . = pm−1, 2m − 2 secondary rows will be scanned. The parallel implementation of the search of the Baeza-Yates and Regnier algorithm on the primary and the secondary rows of the input string is given in Algorithm listings 7.3 to 7.5.

The preprocessing phase of the Baker and Bird and the Baeza-Yates and Regnier algorithms was performed sequentially by the host CPU. The following tables, which were stored in global memory and therefore were accessible by all threads of the device, were constructed during the preprocessing phase. State transition is a two-dimensional array where each row corresponds to one of the m × m + 1 states of the Aho-Corasick trie, each column to a different character of the alphabet Σ, while the cells of the array represent the next state. To ensure that alignment requirements are met on each row, state transition was allocated as pitched linear device memory using the cudaMallocPitch() function. State final is a one-dimensional array, where each column corresponds to a different state of the trie while the cells indicate whether that state is final or not. Finally, an array out with a size of ⌈n/m⌉ integers was used to determine the number of matches per thread. Baker and Bird also utilized the next one-dimensional array of the Knuth-Morris-Pratt algorithm that consists of m + 1 integers. Baeza-Yates and Regnier used the state supply array for the supply function of the Aho-Corasick algorithm. The Id() function of the algorithm, which is used to access the indices assigned to the terminal states of the trie, was represented by an id array of size m. A pseudocode of the host function for the parallel implementation of the algorithms was given in Figure 6.1.

7.2.2 Implementation Limitations and Optimization Techniques

Although the presented parallel implementations can be over an order of magnitude faster than the corresponding sequential implementations, this section discusses a number of limitations that are apparent in Algorithm listings 7.1 to 7.5.


ALGORITHM 7.3: The parallel implementation of the search of the Baeza-Yates and Regnier algorithm on the primary rows of the input string

tid := blockId × blockDim + threadId
startRow := m − 1 + m × (tid / (n/m))
startColumn := (tid % (n/m)) × m
jumpRow := numBlocks × blockDim / (n/m)
matches := 0
for j := startRow; j < n; j := j + m × jumpRow do
    state := 0
    for k := startColumn; k < startColumn + m + m − 1; k := k + 1 do
        if k < n then
            while (newState := state_transition[state, t_{j,k}]) = fail do
                state := state_supply[state]
            end
            state := newState
            if state_final[state] > 0 then
                if P′_{m−1} = m − 1 then
                    Search on the secondary rows when all pattern rows are unique using the code from Algorithm listing 7.4
                else
                    Search on the secondary rows when duplicate pattern rows exist using the code from Algorithm listing 7.5
                end
            end
        end
    end
end
out[tid] := matches

The parallel implementation of the Baker and Bird algorithm has a reduced efficiency due to its data dependencies. Recall from section 5.4 that the GT200 GPU that is used for the experiments of this chapter has 30 SMs and each SM supports up to 1024 resident threads, for a maximum hardware support of 30720 resident threads. Based on the way the parallelism of Baker and Bird is exposed though, only a maximum of n − m + 1 threads can be active at any time. This results in the underutilization of the hardware resources and a reduced occupancy of the SMs, even when an input string with a size of n = 10, 000 was used. It is worth noting though that, as discussed in [92], maximizing occupancy does not always result in a better performance. For the chosen implementation approach of the Baeza-Yates and Regnier algorithm on the other hand, a total of ⌊n/m⌋ threads can be simultaneously active at each of the ⌊n/m⌋ primary rows of the input string.

Coalescing Memory Accesses

The performance of the algorithm implementations depends largely on the access pattern of the threads to the global memory of the device. For the implementation of the Baker and Bird algorithm, the threads access consecutive characters of the input string. To expand on the example


ALGORITHM 7.4: The search of the Baeza-Yates and Regnier algorithm on the secondary rows of the input string when all pattern rows are unique

for tt := j − index, pp := 0; pp < m; tt := tt + 1, pp := pp + 1 do
    if pp = index then
        continue
    end
    state := 0
    for c := k; c ≤ k + m − 1; c := c + 1 do
        while (newState := state_transition[state, t_{tt,c}]) = fail do
            state := state_supply[state]
        end
        state := newState
    end
    if id[state] ≠ P′_{pp} then
        return
    end
    matches := matches + 1
end

of section 7.2.1, assume that thread i is the first thread of a half-warp¹ and that the warp scheduler assigns to it the character σ at position j, k of the input string. Recall that accesses to global memory for compute capability 1.3 GPUs by all threads of a half-warp are coalesced into a single memory transaction when the requested words are within the same memory segment, as discussed in section 5.4. The segment size is 32 bytes when 1-byte words are accessed, 64 bytes for 2-byte words and 128 bytes for words of 4, 8 and 16 bytes. If the input string resides in a correctly aligned area of the global memory, then threads i, . . . , i + 15 will access the input string characters at positions (j, k), . . . , (j, k + 15) inside the same memory segment. Then, all 16 accesses will be coalesced into a single memory access. Since the maximum memory throughput of the global memory is 128 bytes per transaction, the access pattern of the threads results in the utilization of 1/8 of the available bandwidth.

For the implementation of the Baeza-Yates and Regnier algorithm, the memory accesses of the threads to the global memory for characters of a primary row of the input string are strided with a stride of size m. Since 1-byte words are accessed, the memory segment size is 32 bytes. Then, when patterns with a size m = 4 are used, accesses to global memory are coalesced into groups of 4, for patterns with a size m = 8 the accesses are coalesced into groups of 2, while for patterns with a size m ≥ 16, accesses to global memory are serialized. To work around the coalescing requirements of the global memory and increase the utilization of the memory bandwidth, it is important to change the memory access pattern by reading words from the same memory segment and subsequently storing them in the shared memory of the device. This involves the partition of the primary rows of the input string into n/Smemsize chunks and the collective read of Smemsize characters from the global into the shared memory by all blockDim threads of a thread block. Then, for each 16 successive characters from the same segment, only a single memory transaction will be used. This technique results in the improvement of the global memory bandwidth utilization by a factor of 16. The threads can subsequently access the characters stored in shared memory in any order with a very low latency. Using the shared memory to increase the utilization of the memory bandwidth has

¹ Since threads are grouped in a deterministic way, and a half-warp of a compute capability 1.3 GPU consists of 16 threads, it holds that i mod 16 = 0.


ALGORITHM 7.5: The search of the Baeza-Yates and Regnier algorithm on the secondary rows of the input string when duplicate pattern rows exist

state := 0
for i := 0; i < m; i := i + 1 do
    if P′_i ≠ index then
        continue
    end
    for d := j − i; d ≤ j − i + m − 1 AND d < n; d := d + 1 do
        state := 0
        for c := k; c ≤ k + m − 1; c := c + 1 do
            while (newState := state_transition[state, t_{d,c}]) = fail do
                state := state_supply[state]
            end
            state := newState
        end
        if id[state] ≠ P′_{d−j+i} then
            break
        end
    end
    if d = j − i + m then
        matches := matches + 1
    end
end

two disadvantages. First, a total of n/Smemsize × (m − 1) redundant characters are used per primary row, which introduces significantly more work overhead when compared to the basic data-parallel implementation strategy. Second, using the shared memory effectively reduces the occupancy of the SMs. As the size of the shared memory for each SM of the GTX 280 GPU is 16KB, using the whole shared memory would reduce the occupancy to one thread block per SM. Partitioning the shared memory is not an efficient option since it would further increase the total work overhead. For the optimized implementation of the algorithms, Smemsize = 16,128 bytes.
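The following sketch illustrates the staging pattern described above; it is not the thesis kernel, and names such as d_text, SMEM_SIZE and the omitted per-thread matching loop are assumptions used only for illustration.

#define SMEM_SIZE 16128   // S_memsize bytes of shared memory used per thread block

__global__ void byr_primary_row_search(const unsigned char *d_text, int n, int m, int row)
{
    __shared__ unsigned char s_text[SMEM_SIZE];

    // Process the primary row in chunks; consecutive chunks overlap by m - 1 characters.
    for (int base = 0; base < n; base += SMEM_SIZE - (m - 1)) {
        int len = min(SMEM_SIZE, n - base);

        // All threads of the block cooperatively copy the chunk with consecutive,
        // coalesced reads from global memory into shared memory.
        for (int i = threadIdx.x; i < len; i += blockDim.x)
            s_text[i] = d_text[row * n + base + i];
        __syncthreads();

        // ... each thread now scans its own m-character sub-chunk (plus the m - 1
        //     overlapping characters) out of s_text instead of global memory ...
        __syncthreads();
    }
}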

Texture Binding

The preprocessing arrays of the algorithms are relatively small in size while at the same time they are frequently accessed by the threads. The performance of their parallel implementation should then benefit when the relevant arrays are bound to the texture memory of the device. The texture reference was bound to the device memory using cudaBindTexture() for one-dimensional arrays and cudaBindTexture2D() for two-dimensional arrays allocated as pitched linear device memory. The textures were then accessed in-kernel using the tex1Dfetch() and tex2D() functions. Arrays accessed via textures not only take advantage of the texture caches to minimize the memory latency when cache hits occur but also bypass the coalescing requirements of the global memory. In the case of the Baker and Bird algorithm implementation, the pattern is accessed directly during the column-matching step to perform character-by-character verification against the input string. To improve the performance of the implementation, the pattern array can either be bound to the texture memory or it can be copied at the start of the search phase directly to the shared memory of the device by all threads of each thread block. As binding would increase the pressure on the texture


caches, the pattern array was copied to the shared memory for the optimized implementation of Baker and Bird. Recall also that the first dimension of the state_transition array corresponds to the m × m + 1 states of the Aho-Corasick trie and that the maximum size for two-dimensional texture references is 65, 536 × 32, 768 texels. Then, for pattern sizes of m = 256 or more, m × m + 1 > 65, 536 and therefore the state_transition array cannot be bound to the texture memory of the device.
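As a rough illustration of the binding described above, using the legacy texture-reference API of that CUDA generation (the array names used here are assumptions):

// File-scope texture reference for the one-dimensional state_final array.
texture<unsigned int, 1, cudaReadModeElementType> tex_state_final;

// Host side: bind the array that already resides in global memory to the texture.
void bind_state_final(const unsigned int *d_state_final, int numStates)
{
    cudaBindTexture(0, tex_state_final, d_state_final, numStates * sizeof(unsigned int));
}

// Device side: fetch through the texture cache instead of an uncoalesced global load.
__device__ unsigned int state_is_final(int state)
{
    return tex1Dfetch(tex_state_final, state);
}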

Table 7.1 lists the register usage per thread for the basic and optimized versions of the algorithm implementations as reported by the NVCC compiler using the --ptxas-options=-v flag.

Table 7.1: Register usage per thread

Algorithm                    Basic implementation   Optimized implementation
Baker and Bird               13                     12
Baeza-Yates and Regnier      25                     24

7.3 Analysis

In the previous sections, details on the design and the implementation of the algorithms using the CUDA API were given. In this section, the performance of the algorithm implementations is evaluated when a number of optimizations are applied for different sets of data, and it is compared to the performance of the sequential implementations. Each implementation stage also incorporates the optimizations of the previous stages. The optimization techniques used can be summarized as follows:

1. The first stage of the implementation was unoptimized. The input string, the pattern, the preprocessing arrays of the algorithms as well as the out array were stored in the global memory of the device. Each thread accessed input string characters directly from the global memory as needed and stored any number of matches found to out. At the end of the algorithms' search phase, the results array was transferred to the host memory and the total number of matches was calculated by the CPU.

2. The second stage of the implementation involved the binding of the preprocessing arrays to the texture memory of the device.

3. In the third stage, the pattern array was collectively copied by all the threads of a thread block to the shared memory of the device. This stage only applied to the implementation of the Baker and Bird algorithm.

4. With the fourth stage of the implementation, the threads worked around the coalescing requirements of the global memory by collectively reading input string characters and storing them to the shared memory of the device. As already discussed, this optimization was only required for the implementation of the Baeza-Yates and Regnier algorithm.

Figures 7.1 and 7.2 depict the speedup of the presented implementations of the Baker and Bird and the Baeza-Yates and Regnier algorithms for different types of data on a GTX 280 GPU. Recall from the previous section that the optimizations were valid only for patterns with a size m ≤ 128, since the maximum size for two-dimensional texture references is 65, 536 × 32, 768 texels. As can generally be seen, the speedup of the implementations increased with each optimization made to the code, although the algorithms were affected in different ways.


As discussed in section 7.2, the occupancy of each SM of the GPU for the parallel implementation of the Baker and Bird algorithm depends on n − m + 1. It is not surprising then that the speedup of the unoptimized implementation of the Baker and Bird algorithm over the equivalent CPU-bound version was heavily affected by the size n of the input string and the size m of the pattern and less by the alphabet size |Σ|. As can be seen from Figure 7.1, for n = 1, 000 and m = 4, the speedup of the unoptimized implementation of the Baker and Bird algorithm was up to 1.97 over the sequential implementation when data with an alphabet of size |Σ| = 2 were used and between 1.44 and 1.47 for data with an alphabet of size |Σ| = 256 and |Σ| = 1024. The speedup of the implementation decreased as the pattern size increased, down to 0.03 over the sequential implementation for m = 256 and |Σ| = 2 and approximately 0.11 for m = 256 when alphabets of size |Σ| = 256 or |Σ| = 1024 were used.

When an input string of a size n = 10, 000 was used, the speedup of the unoptimized parallel implementation of Baker and Bird increased to 16.5 for m = 4 and |Σ| = 2, to 11.1 for m = 4 and |Σ| = 256 and to 5.7 for m = 4 and |Σ| = 1024. Similar to when an input string of a size n = 1, 000 was used, the performance of the parallel implementation of the algorithm decreased as the pattern size increased, due to the decreased occupancy of the device, with the speedup ranging between 0.05 and 0.2 over the sequential implementation. The performance of the algorithm implementation increased when the preprocessing arrays were bound to the texture memory of the device, with a speedup of up to 1.2 compared to the unoptimized version. The speedup due to that specific optimization stage was more evident when data with a large alphabet size were used. The increased efficiency was caused mainly by the frequent uncoalesced accesses of the threads to the global memory for the data stored in the preprocessing arrays.

When the pattern was collectively copied to the shared memory of the GPU by all threads of a thread block, the speedup increased slightly, by an additional factor of up to 1.03 compared to the second optimization stage, especially when larger pattern sizes were used. In practice, the initial cost to copy the pattern array from the global to the shared memory of the device was substantial, while the thread accesses to the pattern during the column-matching step of the Baker and Bird algorithm were not frequent enough to justify it. Although this behavior is not evident in the Figures due to their scale, it can be observed in Tables 7.2 to 7.4 of Appendix 7.4. The speedup of the final, optimized kernel ranged between 0.06 and 2.37 for n = 1, 000 and between 0.58 and 21.88 for n = 10, 000.

The parallel implementation of Baeza-Yates and Regnier had a significantly higher speedup compared to Baker and Bird. This can be attributed to the fact that the level of the algorithm's parallelism was improved due to the absence of data dependencies. Similar to the implementation of the Baker and Bird algorithm, the speedup of Baeza-Yates and Regnier was mainly affected by the size n of the input string and the size m of the pattern and to a lesser degree by the size |Σ| of the alphabet. For n = 1, 000 and m = 4, the speedup of the unoptimized implementation of Baeza-Yates and Regnier was 22.19 over the sequential implementation for |Σ| = 2 and approximately 21.5 for |Σ| = 256 and |Σ| = 1024. When an input string of size n = 10, 000 was used and for m = 4, the performance of the parallel implementation improved further, with a speedup between 29.7 and 38.6, depending on the alphabet size.

For most types of data, the performance of the parallel implementation of the Baeza-Yates and Regnier algorithm decreased as the pattern size increased. A discrepancy can be observed though in Figure 7.2 when an input string with a size of n = 10, 000 was used together with alphabets of size |Σ| = 256 and |Σ| = 1024. In these cases, the speedup of both the unoptimized and the optimized implementations of the algorithm actually increased with m. This behavior was caused by the sharp increase in the searching time of the sequential implementation of the algorithm and was unrelated


to any specific characteristics of the parallel implementation. When the second implementation stage was used, the speedup of the parallel implementations improved by up to a factor of 2, especially when an input string with a size of n = 10, 000 was used. Finally, when the shared memory was used to work around the coalescing requirements of the device, the speedup of the parallel implementation of the Baeza-Yates and Regnier algorithm increased further by up to 1.14 times over the second implementation stage. The speedup of the final, optimized kernel ranged between 0.35 and 27.55 for n = 1, 000 and between 12.49 and 113.62 when an input string with a size n = 10, 000 was used.


Figure 7.1: Speedup of different optimization techniques for the parallel implementation of the Baker and Bird algorithm. (Six panels plot speedup against pattern sizes 4 to 256, one panel for each combination of alphabet size 2, 256 and 1024 and input string size n = 1000 and n = 10000, with curves for the unoptimized, texture cache and pattern-to-shared-memory stages.)


Figure 7.2: Speedup of different optimization techniques for the parallel implementation of the Baeza-Yates and Regnier algorithm. (Six panels plot speedup against pattern sizes 4 to 256, one panel for each combination of alphabet size 2, 256 and 1024 and input string size n = 1000 and n = 10000, with curves for the unoptimized, texture cache and coalesced reads stages.)


7.4 Appendix: Experimental Results

Table 7.2: Searching time of the sequential implementation and the different parallel implementation stages of the Baker and Bird algorithm for |Σ| = 2 (sec)

n = 1,000
Implementation / m         4       8       16      32      64      128     256
Sequential                 0.0132  0.0124  0.0101  0.0092  0.0089  0.0085  0.0082
Parallel unoptimized       0.0067  0.0125  0.0245  0.0481  0.0955  0.1898  0.2528
Parallel texture memory    0.0056  0.0106  0.0206  0.0427  0.0893  0.1801  0.2519
Parallel shared memory     0.0056  0.0105  0.0205  0.0428  0.0889  0.1470  0.2503

n = 10,000
Implementation / m         4       8       16      32      64      128     256
Sequential                 1.3352  1.1478  1.0116  0.9211  0.8861  0.8673  0.8497
Parallel unoptimized       0.0809  0.1417  0.2798  0.5431  1.0439  2.1487  4.2653
Parallel texture memory    0.0622  0.1150  0.2225  0.4736  0.9129  1.8361  3.9821
Parallel shared memory     0.0610  0.1138  0.2202  0.4724  0.8986  1.4883  3.7192

Table 7.3: Searching time of the sequential implementation and the different parallel implementation stages of the Baker and Bird algorithm for |Σ| = 256 (sec)

n = 1,000
Implementation / m         4       8       16      32      64      128     256
Sequential                 0.0098  0.0089  0.0087  0.0086  0.0088  0.0120  0.0378
Parallel unoptimized       0.0066  0.0129  0.0257  0.0494  0.0958  0.1896  0.5496
Parallel texture memory    0.0056  0.0108  0.0224  0.0444  0.0900  0.1804  0.5423
Parallel shared memory     0.0055  0.0108  0.0224  0.0444  0.0895  0.1469  0.5395

n = 10,000
Implementation / m         4       8       16      32      64      128     256
Sequential                 0.9865  0.9004  0.8546  0.8421  0.8379  0.9125  1.6439
Parallel unoptimized       0.0895  0.2642  0.7332  0.9174  1.1948  2.0360  4.1930
Parallel texture memory    0.0617  0.1417  0.2702  0.5012  0.9318  1.8272  3.7183
Parallel shared memory     0.0607  0.1417  0.2675  0.4989  0.9163  1.4902  3.5623


Table 7.4: Searching time of the sequential implementation and the different parallel implementation stages of the Baker and Bird algorithm for |Σ| = 1024 (sec)

n = 1,000
Implementation / m         4       8       16      32      64      128     256
Sequential                 0.0098  0.0090  0.0087  0.0089  0.0097  0.0139  0.0497
Parallel unoptimized       0.0066  0.0129  0.0263  0.0506  0.0963  0.1903  0.5745
Parallel texture memory    0.0055  0.0108  0.0228  0.0457  0.0902  0.1805  0.5683
Parallel shared memory     0.0055  0.0108  0.0228  0.0458  0.0898  0.1472  0.5632

n = 10,000
Implementation / m         4       8       16      32      64      128     256
Sequential                 0.9893  0.9017  0.8587  0.8511  0.8496  0.7787  1.9631
Parallel unoptimized       0.1725  0.3483  1.1735  1.7200  1.6432  1.1099  2.2532
Parallel texture memory    0.0614  0.1554  0.2879  0.5132  0.9321  0.6755  2.1029
Parallel shared memory     0.0604  0.1541  0.2861  0.5104  0.9167  0.6706  1.9830

Table 7.5: Searching time of the sequential implementation and the different parallel implementation stages of the Baeza-Yates and Regnier algorithm for |Σ| = 2 (sec)

n = 1,000
Implementation / m         4       8       16      32      64      128     256
Sequential                 0.0080  0.0042  0.0037  0.0054  0.0054  0.0052  0.0040
Parallel unoptimized       0.0004  0.0005  0.0009  0.0022  0.0040  0.0153  0.0366
Parallel texture memory    0.0003  0.0004  0.0008  0.0021  0.0039  0.0150  0.0354
Parallel coalesced access  0.0003  0.0004  0.0008  0.0020  0.0039  0.0149  0.0351

n = 10,000
Implementation / m         4       8       16      32      64      128     256
Sequential                 0.7974  0.4293  0.4754  0.5545  0.5886  0.6438  0.8100
Parallel unoptimized       0.0215  0.0180  0.0243  0.0337  0.0470  0.0791  0.0984
Parallel texture memory    0.0135  0.0134  0.0207  0.0302  0.0331  0.0519  0.0952
Parallel coalesced access  0.0135  0.0119  0.0192  0.0291  0.0327  0.0515  0.0921


Table 7.6: Searching time of the sequential implementation and the different parallel implementation stages of the Baeza-Yates and Regnier algorithm for |Σ| = 256 (sec)

n = 1,000
Implementation / m         4       8       16      32      64      128     256
Sequential                 0.0061  0.0043  0.0093  0.0134  0.0168  0.0289  0.0616
Parallel unoptimized       0.0003  0.0005  0.0009  0.0022  0.0040  0.0153  0.0427
Parallel texture memory    0.0002  0.0004  0.0008  0.0021  0.0040  0.0151  0.0411
Parallel coalesced access  0.0002  0.0004  0.0008  0.0021  0.0040  0.0150  0.0402

n = 10,000
Implementation / m         4       8       16      32      64      128     256
Sequential                 0.6002  0.4388  0.7408  1.4054  2.4909  5.6789  10.6820
Parallel unoptimized       0.0167  0.0162  0.0242  0.0340  0.0467  0.0779  0.1102
Parallel texture memory    0.0108  0.0123  0.0206  0.0304  0.0333  0.0522  0.1021
Parallel coalesced access  0.0108  0.0108  0.0191  0.0289  0.0329  0.0519  0.0992

Table 7.7: Searching time of the sequential implementation and the different parallel implementation stages of the Baeza-Yates and Regnier algorithm for |Σ| = 1024 (sec)

n = 1,000
Implementation / m         4       8       16      32      64      128     256
Sequential                 0.0064  0.0048  0.0119  0.0168  0.0222  0.0256  0.0300
Parallel unoptimized       0.0003  0.0004  0.0009  0.0022  0.0041  0.0153  0.0328
Parallel texture memory    0.0002  0.0004  0.0008  0.0022  0.0040  0.0151  0.0321
Parallel coalesced access  0.0002  0.0004  0.0008  0.0021  0.0040  0.0150  0.0310

n = 10,000
Implementation / m         4       8       16      32      64      128     256
Sequential                 0.6400  0.4842  0.7050  1.2382  2.1006  4.8641  9.3390
Parallel unoptimized       0.0215  0.0193  0.0244  0.0360  0.0505  0.0657  0.0849
Parallel texture memory    0.0108  0.0123  0.0206  0.0301  0.0333  0.0432  0.0742
Parallel coalesced access  0.0108  0.0108  0.0196  0.0288  0.0329  0.0428  0.0703


Chapter 8

Implementation of Multiple and Two-Dimensional Pattern Matching Algorithms on a Hybrid Parallel Architecture

8.1 Introduction

Over the last decades, as text and image databases grow at an exponential rate, it is essential to develop faster and more efficient search algorithms. Sequential computers cannot easily deal with the execution time and the memory requirements of multiple and two-dimensional pattern matching algorithms for large-scale data sets. The use of data parallelism, a form of parallelization that can be achieved through the partitioning of a given data set into smaller parts, is a viable and cost-effective solution. A common trend in parallel computing today is to use commodity off-the-shelf components to create clusters of nodes: several computer workstations featuring Graphics Processing Units (GPUs) that in turn are interconnected using a common network infrastructure. The abundant processing power that is provided by these types of clusters can be useful in computational problems that involve the processing of very large data sets and that traditional sequential computers struggle to deal with.

This chapter presents a 3-tier modular hybrid parallel architecture that combines the flexibility of a CPU with the efficiency of a GPU and the scalability that a distributed memory architecture can offer. The presented hybrid architecture uses different technologies and APIs to implement in parallel the Aho-Corasick [2], Set Horspool [67], Set Backward Oracle Matching [3], Wu-Manber [96] and the Multipattern Shift-Or with Q-Grams (SOG) [78] multiple pattern matching algorithms as well as the Baker and Bird [14, 15] and the Baeza-Yates and Regnier [11] two-dimensional pattern matching algorithms. To the best of our knowledge, no attempt has been made in the past to implement the specific pattern matching algorithms on hybrid computing systems of CPUs and GPUs using the MPI and CUDA APIs.

8.2 Related Work

The performance evaluation of different multiple and two-dimensional pattern matching algorithms when implemented in parallel using the MPI and CUDA APIs is a research subject that has been


studied in the past. The performance of the Baker and Bird and Baeza-Yates and Regnier two-dimensional pattern matching algorithms was evaluated in [50] along with the Karp and Rabin, the Zhu and Takaoka and a brute force two-dimensional pattern matching algorithm on a homogeneous computer cluster of 15 Intel Xeon CPU workstations with a 3.00GHz clock rate for randomly generated input strings and a collection of uncompressed satellite images. The Wu-Manber and the SOG multiple pattern matching algorithms were implemented on a hybrid homogeneous computer cluster consisting of 10 multi-core nodes with an Intel Core i3 CPU with Hyper-Threading that had a 2.93GHz clock rate, with the use of the MPI and OpenMP APIs. The performance of the hybrid cluster, which combined the advantages of both shared and distributed memory parallelization, was evaluated in [56].

An analytical performance model of the parallel implementation of string matching algorithms was introduced in [63] and was used to validate the experimental results of the same algorithms on a cluster of 9 distributed workstations, each one having a 100MHz Intel Pentium CPU and 64MB of memory. A performance model of a hybrid MPI and OpenMP architecture was presented in [1], based on a combination of static analysis and feedback from a runtime benchmark for both communication and multithreading efficiency measurement. Finally, the Aho-Corasick algorithm was implemented in [88] for a heterogeneous computer cluster with 5 NVIDIA Tesla S1070 boxes, each box being the equivalent of 4 C1060 GPUs, for a total of 20 GPUs.

8.3 Hybrid Parallel Architecture

As discussed in chapters 6 and 7, both multiple and two-dimensional pattern matching algorithms generally take advantage of the massively parallel nature of modern GPUs, with high speedups of their searching time reported over the sequential implementations of the same algorithms. A number of disadvantages, though, can be identified.

There is a performance penalty involved when applications that introduce data dependencies are executed in parallel; a significant latency is introduced during the launch of the GPU kernel and also when copying data between the host and the device memory, due to the bandwidth of the system bus¹; and trie-based algorithms are handled inefficiently due to the high cost of in-kernel dynamic memory allocation. To address some of these deficiencies, this section presents a hybrid parallel architecture that combines distributed memory message-passing and General-Purpose computing on Graphics Processing Units to implement the Aho-Corasick, Set Horspool, Set Backward Oracle Matching, Wu-Manber and SOG multiple pattern matching algorithms and the Baker and Bird and Baeza-Yates and Regnier two-dimensional pattern matching algorithms on a homogeneous cluster of GPU nodes.

In section 5.5, an overview of the hybrid parallel architecture was given. In the next pages, the parallel implementation of the algorithms is discussed and a performance model is presented.

8.3.1 Parallel Implementation

The hybrid parallel architecture was implemented as follows. The master-worker model was used, as it was concluded in [23] and confirmed in practice that it is the most appropriate for pattern matching on message-passing systems. A node of the computer cluster was designated as the master while the rest of the nodes were set as workers. The data set was initially stored in the local storage of the master and was subsequently exported to the worker nodes using the NFS protocol. At the first level, the nodes loaded the pattern set, in the case of multiple pattern

¹ This cost can be in part hidden when zero-copying data from pinned host memory.


matching algorithms, or the pattern, in the case of two-dimensional pattern matching algorithms, to the host memory via NFS. At the second level, the preprocessing phase of each algorithm was executed sequentially by the CPU of each node. The Aho-Corasick, Set Horspool and Set Backward Oracle Matching multiple pattern matching algorithms and the Baker and Bird and Baeza-Yates and Regnier two-dimensional pattern matching algorithms use trie structures that are implemented using dynamically allocated linked lists of arrays. Since dynamic memory allocated from within CUDA kernels resides in a heap that is stored in global memory, it is highly inefficient and was subsequently avoided. For this reason, the necessary data structures were precomputed sequentially by the CPU of each node in the form of arrays, similar to the methodology followed in the previous chapters and, among others, in [42, 60, 87, 89].
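As a rough sketch of the precomputation step just described (the node structure and array layout here are assumptions, not the thesis code), a dynamically allocated trie can be flattened on the CPU into the state_transition array that is later copied to the GPU:

#include <stdlib.h>

#define SIGMA 256                                 /* assumed alphabet size */

struct trie_node {
    struct trie_node *next[SIGMA];                /* children, NULL when no edge exists */
    int id;                                       /* state number assigned during construction */
};

/* Record every edge of the trie into a flat numStates x SIGMA table. Entries without
   an outgoing edge are assumed to have been initialised to a FAIL value beforehand. */
static void flatten_trie(const struct trie_node *node, unsigned int *state_transition)
{
    for (int c = 0; c < SIGMA; c++) {
        if (node->next[c] != NULL) {
            state_transition[node->id * SIGMA + c] = (unsigned int)node->next[c]->id;
            flatten_trie(node->next[c], state_transition);
        }
    }
}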

At the third level, the resulting preprocessing arrays were copied to the global memory of the GPUs and were then bound to the texture cache of the device. At the fourth level, each node calculated the size of the input string chunk based on the size of the cluster. The input string chunk was then retrieved via NFS to a pinned memory area of the host and was subsequently set for copying to the global memory of the device with the use of the zero-copy technique of the CUDA API. With that technique, data is transferred automatically between host and device memory over the PCI Express bus as needed, in such a way that copy and computation partially overlap. In the early stages of the implementation, each node used the standard I/O fopen, fclose, fseek and getc functions of the stdio.h library to load the chunk. In practice though, when the specific functions were used to load data exported from NFS, a significant performance penalty was observed due to their line-buffering behavior. Instead, the POSIX API open, close, lseek and read system calls were used, resulting in a significantly improved performance. At the fifth level, the searching phase of the algorithms was executed in parallel by the threads of the GPU. At the sixth and final level, the number of matches that each node computed was gathered by the master node using the MPI_Reduce() function of MPI and its CPU was used to determine the total number of matches of the pattern or the pattern set in the input string. Figure 8.1 depicts a pseudocode of the presented hybrid parallel architecture.
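A minimal sketch of the chunk loading with POSIX system calls mentioned above (the function and parameter names are illustrative assumptions):

#include <fcntl.h>
#include <unistd.h>

/* Read chunkSize bytes of the NFS-exported input string, starting at this worker's
   offset, into a preallocated (pinned) host buffer. Returns the bytes actually read. */
ssize_t load_chunk(const char *path, off_t offset, size_t chunkSize, unsigned char *buf)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    lseek(fd, offset, SEEK_SET);
    ssize_t total = 0;
    while ((size_t)total < chunkSize) {           /* read() may return short counts */
        ssize_t r = read(fd, buf + total, chunkSize - total);
        if (r <= 0)
            break;
        total += r;
    }
    close(fd);
    return total;
}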

The following optimization techniques were utilized when the searching phase of the multiple and two-dimensional pattern matching algorithms was implemented in parallel with the use of the CUDA API. All preprocessing arrays were bound to the texture memory of the device to work around the coalescing requirements of the global memory and to increase the memory throughput of the implementations. To partially hide the latency of the data copy from the host to the global memory of the device, the input string array was allocated on pinned host memory and was then zero-copied to the GPU. In the case of the Set Backward Oracle Matching, Wu-Manber and SOG algorithms, the pattern set array was bound to the texture memory of the device, while in the case of the Baker and Bird algorithm, the pattern array was collectively copied by all the threads of each thread block to the shared memory of the device to reduce the pressure on the texture caches. For all multiple pattern matching algorithms as well as for the Baeza-Yates and Regnier two-dimensional pattern matching algorithm, the threads collectively accessed input string characters and stored them to the shared memory of the device to work around the coalescing requirements of the global memory. Finally, in the case of the Aho-Corasick, Set Horspool, Set Backward Oracle Matching, Wu-Manber and SOG algorithms, the input string characters were retrieved as groups of 16 characters with the use of uchar4 vectors to further improve the utilization of the global memory bandwidth and were subsequently stored to the shared memory of the device by all threads of each thread block.
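A compact sketch of the zero-copy set-up mentioned above (function and variable names are assumptions); the mapped pinned buffer is filled from NFS on the host and the kernel reads it through a device-visible alias:

#include <cuda_runtime.h>

/* Allocate the input string chunk as mapped pinned host memory and obtain the device
   pointer that the search kernel will dereference directly over the PCI Express bus.
   cudaSetDeviceFlags() must be called once, before the CUDA context is created. */
unsigned char *alloc_zero_copy_chunk(size_t chunkSize, unsigned char **d_alias)
{
    unsigned char *h_chunk = NULL;
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void **)&h_chunk, chunkSize, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)d_alias, h_chunk, 0);
    return h_chunk;                               /* fill this buffer from NFS, then launch */
}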


Main procedure
main()
  1. Initialize MPI routines;
  2. If (process == master) then call master(); else call worker();
  3. Exit message-passing operations;

Master sub-procedure
master()
  1. Receive the results (i.e. matches) from all workers; (MPI_Reduce)
  2. Print the total results;

Worker sub-procedure
worker()
  1. Load pattern / pattern set to the host memory;
  2. Preprocess the pattern / pattern set;
  3. Load the preprocessing arrays to the global memory of the GPU; (cudaMemcpy/cudaMemcpy2D)
  4. Calculate the input string chunk based on the size of the cluster;
  5. Set up zero-copying of the input string; (cudaHostAlloc)
  6. Execute the search phase of the algorithm;
  7. Copy the results to the memory of the host; (cudaMemcpy)
  8. Send the results to master; (MPI_Reduce)

Figure 8.1: Pseudocode of the hybrid implementation
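The flow of Figure 8.1 can be sketched with the MPI calls shown below; worker_search() is a placeholder assumption standing in for the per-node preprocessing, zero-copy set-up and GPU search phase:

#include <mpi.h>
#include <stdio.h>

/* Placeholder for the CUDA search phase executed on this node's input string chunk. */
static unsigned long worker_search(int rank, int numProcs)
{
    (void)rank; (void)numProcs;
    return 0UL;
}

int main(int argc, char **argv)
{
    int rank, size;
    unsigned long matches = 0UL, total = 0UL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank != 0)                                /* rank 0 acts as the master node */
        matches = worker_search(rank, size);

    /* Gather the per-node match counts on the master, as in step 8 of Figure 8.1. */
    MPI_Reduce(&matches, &total, 1, MPI_UNSIGNED_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Total matches: %lu\n", total);

    MPI_Finalize();
    return 0;
}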

8.3.2 Performance Model

Let N be the number of available computer nodes of the cluster. In the case of multiple pattern matching algorithms, each node receives a chunk consisting of ⌈n/N⌉ input string characters plus additional m − 1 characters to ensure the correctness of the results. For two-dimensional pattern matching algorithms, a column-based distribution of the input string was chosen, with each node receiving a chunk of n × (n/N + m − 1) input string characters. The total execution time of the presented hybrid architecture can be resolved into the following:

• Tinit is the time of the nodes to allocate and load from disk the necessary data structures for the pattern / pattern set.

• TCPU preproc is the average time of each algorithm to preprocess the pattern / pattern set.

• TNFS comm is the cost in terms of time for NFS to export the input string and the pattern / pattern set to the nodes of the cluster. The local caching capability of the Linux kernel for network filesystems, as presented in [41], mitigates this cost on subsequent accesses of the same data.

• TCUDA copy preproc is the time to copy the preprocessing arrays to the global memory of the GPU using the cudaMemcpy()/cudaMemcpy2D() functions of the CUDA API.

• TCUDA copy input is the time of the CUDA function cudaMallocHost() to initialize the input string array as pinned host memory and the time to copy the input string to the global memory using the zero-copy technique of CUDA.


• TCUDA search includes the latency involved with the launch of the computational kernel and the time to execute the search phase of the algorithms in parallel on the GPU.

• TMPI comm is the communication time required by the master node to gather the results from the N worker nodes of the cluster using the MPI_Reduce() function of the MPI API.

The total execution time of the parallel implementation of the pattern matching algorithms using the presented hybrid parallel architecture is then equal to the summation of the above:

Tparallel = Tinit + TCPU preproc + TNFS comm + TCUDA copy preproc + TCUDA copy input + TCUDA search + TMPI comm    (8.1)

Due to the use of the zero-copy technique, TNFS comm, TCUDA copy input and TCUDA search cannot be measured directly. To estimate the cost of the NFS exports, the experiments of this chapter were executed twice. The first time, the Linux kernel caches were "cold" since they were flushed via the /proc/sys/vm/drop_caches system file, while the second time the caches were "hot" and thus data communication between the master and the workers was kept to a minimum. TNFS comm was then estimated as the difference between the two execution times. When the caches were "hot", the sum of TCUDA copy input + TCUDA search could be easily measured.
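A small sketch of the cache-flushing step used for the "cold" runs (it requires root privileges and is equivalent to writing 3 to the file named above; the helper name is an assumption):

#include <stdio.h>
#include <unistd.h>

/* Flush the Linux page cache, dentries and inodes before a "cold cache" measurement. */
static void drop_caches(void)
{
    sync();                                       /* write dirty pages back first */
    FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
    if (f != NULL) {
        fputs("3\n", f);
        fclose(f);
    }
}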

8.4 Analysis

This section discusses the efficiency of the presented hybrid parallel architecture when used for the implementation of different pattern matching algorithms. Figures 8.2 and 8.3 present the searching times of the parallel implementations of the Aho-Corasick, Set Horspool, Set Backward Oracle Matching, Wu-Manber and SOG multiple pattern matching algorithms, including the time to zero-copy the input string, for the snapshot of the Expressed Sequence Tags from the February, 2009 assembly of the human genome. Figure 8.4 depicts the searching time, including TCUDA copy input, of the Baker and Bird and the Baeza-Yates and Regnier two-dimensional pattern matching algorithm implementations for the two-dimensional uncompressed BMP image of the Orion Nebula, while Figure 8.5 shows the time of NFS to export the input string and the pattern/pattern set to the cluster nodes and the time to copy the preprocessing arrays from the host to the device memory. Figures 8.2 to 8.4 have a linear horizontal and vertical axis while Figure 8.5 has a linear horizontal and a logarithmic vertical axis.

To implement an existing application in parallel, the parallel parts it contains must be identified; the speed-up of the execution will then be proportional to the level of the algorithm's parallelism. The identification of parallelism includes finding instructions, sequences of instructions, or even large regions of code that may be executed concurrently by different processors [18]. According to Amdahl's Law [4], the maximum speed-up that can be achieved on the nodes of the cluster is then equal to

    1 / (1 − R + R/N)    (8.2)

where N is the number of processing elements and R is the proportion of the application that can be parallelized. The preprocessing phase of the algorithms is executed sequentially by the CPU of each cluster node. It is not surprising then that the preprocessing time TCPU preproc is constant in the size of the cluster. The time to allocate the relevant data structures and to load the pattern /


Figure 8.2: The searching time of the trie-based multiple pattern matching algorithms (time in seconds plotted against the number of nodes, 1 to 10, for the AC, SH and SBOM implementations)

pattern set into host memory is not affected by the size of the cluster and is therefore also constant in N.

As depicted in the Figures, the searching time of the algorithm implementations decreased proportionally to the size of the computer cluster. With 10 worker nodes and compared to a cluster configuration with a single worker node, the parallel implementation of the Aho-Corasick algorithm had a speedup of 9.8, the Set Horspool algorithm exhibited a speedup of 9.1, the Set Backward Oracle Matching was 7.5 times faster, the execution time of the Wu-Manber algorithm was reduced by up to a factor of 9.8, while the SOG algorithm had a speedup of 9.9. For the two-dimensional pattern matching algorithms, Baker and Bird had a speedup of 3.5 while Baeza-Yates and Regnier had a speedup of 2.5. It can generally be concluded that the two-dimensional pattern matching algorithms had a lower speedup due to the chosen column-based distribution of the input string.

When considering the total execution time of the algorithm implementations, the experimentally obtained speedup results are quite different. Due to a combination of a fast execution time and a slow TNFS comm, the speedup of the Aho-Corasick, Set Horspool and Set Backward Oracle Matching algorithms ranged between 1.3 and 1.8 when all 10 worker nodes were used as compared to just a single node. On subsequent executions with the same input string, with the help of the local caches for network filesystems of the Linux kernel, as mentioned in section 8.3, the speedup increased up to 8.7. The hash-based Wu-Manber and SOG algorithms on the other hand, with their excessively high searching times, had speedups of 7.5 and 9 that increased close to 10 on subsequent executions. Finally, for the Baker and Bird and the Baeza-Yates and Regnier algorithms, their performance actually decreased when the Linux kernel caches were "cold", as opposed to an increase of approximately 2 times for "hot" caches.

As can be seen from Figure 8.5, the time of NFS to export the data set to the cluster nodes increased with the size of the computer cluster. For the multiple pattern matching dataset, that time ranged approximately between 10 and 17 seconds for 1 to 10 nodes, while for the smaller


Figure 8.3: The searching time of the hash-based multiple pattern matching algorithms (time in seconds plotted against the number of nodes, 1 to 10, for the WM and SOG implementations)

Figure 8.4: The searching time of the two-dimensional pattern matching algorithms (time in seconds plotted against the number of nodes, 1 to 10, for the Baker and Bird and the Baeza-Yates and Regnier implementations)


Figure 8.5: The time of NFS to export the different data sets to the cluster nodes and the time to copy the preprocessing arrays to the global memory of the GPU (logarithmic time in seconds plotted against the number of nodes; curves show the export of the human genome and of the Orion nebula image over NFS and CUDA_copy_preproc for AC, SH, SBOM, WM, SOG, Baker and Bird and Baeza-Yates and Regnier)

two-dimensional dataset TNFS comm ranged between 3 and 10 seconds. For most algorithms, the NFS export time was actually comparable to the searching time of the algorithms as measured by TCUDA copy input and TCUDA search combined. On configurations with additional worker nodes, a saturation point exists after which the NFS server becomes a bottleneck that can greatly limit the performance of the cluster. For this reason, when setting up larger computer clusters, a hierarchical configuration should be used with multiple NFS servers to avoid such possible sources of congestion.

The time TMPI comm of MPI to transmit the results from the worker nodes to the master is independent of the algorithm. It depends instead on a number of different characteristics, such as the cluster topology, the underlying network infrastructure and the number of nodes used. For the specific cluster configuration, TMPI comm was almost negligible compared to the total execution time of the parallel implementation of the algorithms, since all cluster nodes were interconnected via the same network switch and there was no transmission of data from the master to the workers. On larger sized clusters this time is not expected to increase significantly, except in the case where a dynamic input string pointer distribution is chosen to achieve data-parallelism.


8.5 Appendix: Experimental Results

Table 8.1: Individual times and total execution time of the parallel implementation of the Aho-Corasick algorithm on the reference computer cluster (sec)

Nodes  Tinit   TNFS comm (approx.)  TCPU preproc  TCUDA copy preproc  TCUDA copy input + TCUDA search  TMPI comm  Tparallel
1      0.0307  10.0865              0.1715        0.010934            16.3598                          0.0000     26.6486
2      0.0320  12.2022              0.1711        0.010934            10.5364                          0.0000     22.9418
3      0.0349  15.7000              0.1709        0.010934            5.3660                           0.0002     21.2718
4      0.0340  15.0284              0.1707        0.010934            4.0425                           0.0001     19.2757
5      0.0334  14.1339              0.1707        0.010934            3.2514                           0.0001     17.5895
6      0.0321  12.1772              0.1709        0.010934            2.7351                           0.0001     15.1152
7      0.0350  13.9781              0.1706        0.010934            2.3492                           0.0004     16.5333
8      0.0343  13.9284              0.1705        0.010934            2.0732                           0.0004     16.2067
9      0.0344  15.1101              0.1706        0.010934            1.8481                           0.0001     17.1633
10     0.0375  17.8298              0.1705        0.010934            1.6735                           0.0002     19.7114

Table 8.2: Individual times and total execution time of the parallel implementation of the Set Horspool algorithm on the reference computer cluster (sec)

Nodes  Tinit   TNFS comm (approx.)  TCPU preproc  TCUDA copy preproc  TCUDA copy input + TCUDA search  TMPI comm  Tparallel
1      0.0341  10.0865              0.0045        0.010732            25.0579                          0.0000     35.1829
2      0.0365  12.2022              0.0036        0.010732            14.7828                          0.0000     27.0252
3      0.0331  15.7000              0.0036        0.010732            9.3424                           0.0000     25.0790
4      0.0380  15.0284              0.0037        0.010732            6.7205                           0.0001     21.7906
5      0.0336  14.1339              0.0037        0.010732            5.4223                           0.0001     19.5936
6      0.0364  12.1772              0.0036        0.010732            4.5103                           0.0001     16.7275
7      0.0378  13.9781              0.0036        0.010732            3.8969                           0.0012     17.9175
8      0.0340  13.9284              0.0036        0.010732            3.4000                           0.0007     17.3667
9      0.0343  15.1101              0.0037        0.010732            3.0602                           0.0001     18.2084
10     0.0379  17.8298              0.0038        0.010732            2.7562                           0.0001     20.6278


Table 8.3: Individual times and total execution time of the parallel implementation of the Set Backward Oracle Matching algorithm on the reference computer cluster (sec)

Nodes  Tinit   TNFS comm (approx.)  TCPU preproc  TCUDA copy preproc  TCUDA copy input + TCUDA search  TMPI comm  Tparallel
1      0.0376  10.0865              0.0046        0.023426            29.9088                          0.0000     40.0375
2      0.0367  12.2022              0.0047        0.023426            19.7305                          0.0004     31.9744
3      0.0345  15.7000              0.0047        0.023426            13.3086                          0.0002     29.0479
4      0.0362  15.0284              0.0050        0.023426            9.7745                           0.0000     24.8441
5      0.0383  14.1339              0.0048        0.023426            7.8740                           0.0001     22.0511
6      0.0322  12.1772              0.0049        0.023426            6.5938                           0.0003     18.8083
7      0.0341  13.9781              0.0049        0.023426            5.6599                           0.0002     19.6772
8      0.0359  13.9284              0.0048        0.023426            4.9661                           0.0004     18.9357
9      0.0347  15.1101              0.0049        0.023426            4.4201                           0.0002     19.5700
10     0.0344  17.8298              0.0049        0.023426            3.9987                           0.0003     21.8681

Table 8.4: Individual times and total execution time of the parallel implementation of the Wu-Manber algorithm on the reference computer cluster (sec)

Nodes  Tinit   TNFS comm (approx.)  TCPU preproc  TCUDA copy preproc  TCUDA copy input + TCUDA search  TMPI comm  Tparallel
1      0.0361  10.0865              0.0007        0.004541            562.2405                         0.0000     572.3638
2      0.0381  12.2022              0.0008        0.004541            283.3387                         0.0001     295.5799
3      0.0382  15.7000              0.0007        0.004541            190.1184                         0.0001     205.8573
4      0.0356  15.0284              0.0007        0.004541            142.7560                         0.0006     157.8212
5      0.0373  14.1339              0.0007        0.004541            114.8453                         0.0001     129.0171
6      0.0408  12.1772              0.0007        0.004541            95.7736                          0.0002     107.9924
7      0.0380  13.9781              0.0007        0.004541            82.0668                          0.0002     96.0838
8      0.0389  13.9284              0.0008        0.004541            71.8891                          0.0004     85.8576
9      0.0374  15.1101              0.0007        0.004541            63.8871                          0.0001     79.0354
10     0.0388  17.8298              0.0008        0.004541            57.3456                          0.0001     75.2151


Table 8.5: Individual times and total execution time of the parallel implementation of the SOG algorithm on the reference computer cluster (sec)

Nodes  Tinit   TNFS comm (approx.)  TCPU preproc  TCUDA copy preproc  TCUDA copy input + TCUDA search  TMPI comm  Tparallel
1      0.0337  10.0865              0.0137        0.008093            1434.4464                        0.0000     1444.5804
2      0.0417  12.2022              0.0140        0.008093            714.2239                         0.0001     726.4819
3      0.0370  15.7000              0.0137        0.008093            476.0502                         0.0000     491.8009
4      0.0382  15.0284              0.0137        0.008093            357.2848                         0.0007     372.3658
5      0.0320  14.1339              0.0137        0.008093            285.9384                         0.0002     300.1182
6      0.0423  12.1772              0.0137        0.008093            238.3942                         0.0001     250.6275
7      0.0396  13.9781              0.0138        0.008093            204.3533                         0.0002     218.3849
8      0.0349  13.9284              0.0136        0.008093            178.8752                         0.0004     192.8524
9      0.0390  15.1101              0.0137        0.008093            159.0432                         0.0003     174.2064
10     0.0416  17.8298              0.0137        0.008093            143.1170                         0.0001     161.0021

Table 8.6: Individual times and total execution time of the parallel implementation of the Baker and Bird algorithm on the reference computer cluster (sec)

Nodes  Tinit   TNFS comm (approx.)  TCPU preproc  TCUDA copy preproc  TCUDA copy input + TCUDA search  TMPI comm  Tparallel
1      0.0007  2.8772               0.0002        0.000177            2.8800                           0.0000     5.7580
2      0.0007  5.8296               0.0002        0.000177            1.9197                           0.0001     7.7503
3      0.0006  5.8310               0.0002        0.000177            1.5367                           0.0001     7.3686
4      0.0006  5.9439               0.0002        0.000177            1.4786                           0.0000     7.4233
5      0.0003  6.6701               0.0002        0.000177            1.3032                           0.0001     7.9739
6      0.0005  7.4553               0.0002        0.000177            1.1318                           0.0001     8.5878
7      0.0006  8.1830               0.0002        0.000177            1.0042                           0.0005     9.1885
8      0.0005  9.1426               0.0002        0.000177            0.9848                           0.0019     10.1300
9      0.0006  9.5517               0.0002        0.000177            0.9006                           0.0000     10.4531
10     0.0005  10.3336              0.0002        0.000177            0.8259                           0.0001     11.1603


Table 8.7: Individual times and total execution time of the parallel implementation of the Baeza-Yates and Regnier algorithm on the reference computer cluster (sec)

Nodes  Tinit   TNFS comm (approx.)  TCPU preproc  TCUDA copy preproc  TCUDA copy input + TCUDA search  TMPI comm  Tparallel
1      0.0006  2.8772               0.0002        0.000163            0.4606                           0.0000     3.3386
2      0.0007  5.8296               0.0002        0.000163            0.3260                           0.0001     6.1565
3      0.0007  5.8310               0.0002        0.000163            0.2604                           0.0000     6.0923
4      0.0006  5.9439               0.0002        0.000163            0.2370                           0.0001     6.1818
5      0.0006  6.6701               0.0002        0.000163            0.2255                           0.0001     6.8964
6      0.0006  7.4553               0.0002        0.000163            0.2065                           0.0000     7.6625
7      0.0006  8.1830               0.0002        0.000163            0.2021                           0.0003     8.3861
8      0.0006  9.1426               0.0002        0.000163            0.1948                           0.0006     9.3389
9      0.0006  9.5517               0.0002        0.000163            0.1916                           0.0001     9.7442
10     0.0006  10.3336              0.0002        0.000163            0.1872                           0.0002     10.5218


Chapter 9

Conclusions

This chapter summarizes the results of this thesis and presents ideas for future work.

9.1 Results

One of the aims of this thesis was to classify and survey the algorithms that solve the multiple and two-dimensional pattern matching problems and to evaluate the performance of some of the most well known algorithms for different problem parameters, including the size of the input string, the pattern and the alphabet used, as well as the size of the pattern set in the case of multiple pattern matching algorithms. Moreover, a significant goal was to improve the performance of the Baker and Bird and Baeza-Yates and Regnier two-dimensional pattern matching algorithms with the introduction of efficient variants that use the Set Horspool, Wu-Manber, Set Backward Oracle Matching and SOG multiple pattern matching algorithms in place of the Aho-Corasick algorithm. Finally, an objective of this thesis was to increase the performance of different multiple and two-dimensional pattern matching algorithms with their implementation on different parallel architectures, including GPUs and clusters of nodes.

The main contributions and results of this thesis were as follows:

1. We surveyed and categorized many well known algorithms for multiple pattern matching and presented experimental results of the well-known Aho-Corasick, Set Horspool, Set Backward Oracle Matching, Wu-Manber and SOG algorithms. The performance of the algorithms was evaluated in terms of preprocessing and searching time for randomly generated data, biological sequence databases and English alphabet data for pattern sets of different sizes. We elaborated on the factors that affect the performance of the preprocessing and the searching phase of the algorithms and, with the use of experimental maps, we indicated the preferred algorithms for each different problem parameter. We discussed that for randomly generated data with a binary alphabet, the E.coli genome and the FASTA Nucleic Acid database, Aho-Corasick was the fastest algorithm when patterns of a length m = 8 were used while Set Backward Oracle Matching outperformed the rest of the algorithms for patterns of a length m = 32. For the Swiss Prot. and the FASTA Amino Acid databases, SOG was the fastest algorithm when sets of up to 2, 000 patterns were used, with the patterns having a length of m = 8 characters. For the same data and for sets of more than 2, 000 patterns, or when patterns with a length of m = 32 characters were used, Set Backward Oracle Matching was the fastest among the presented algorithms. Finally, when English alphabet data were used, the performance in terms of searching time of the algorithms converged, with the SOG algorithm being slightly

155

Page 169: Parallel and Distributed Implementations of Multiple and ...users.uom.gr/~ckouz/papers/thesis.pdf · Parallel and Distributed Implementations of Multiple and Two-Dimensional Pattern

156 CHAPTER 9. CONCLUSIONS

faster when patterns of a length m = 8 were used and the Set Backward Oracle Matching,Wu-Manber and the Aho-Corasick algorithms outperforming the rest of the algorithms whenpatterns of a length m = 32 were used.

2. We presented efficient variants of the Baker and Bird and the Baeza-Yates and Regnier algorithms that use the trie of the Set Horspool algorithm, the factor oracle of the Set Backward Oracle Matching algorithm and the hashing functions of the Wu-Manber and SOG algorithms to preprocess a two-dimensional pattern and then locate all of its occurrences in a two-dimensional input string; the underlying reduction of the two-dimensional pattern to a set of row patterns is sketched after this list. The performance of the original algorithms and their variants was compared in terms of preprocessing and searching time, and the way it was affected by different problem parameters was discussed, including the size of the pattern, the input string and the alphabet, since these are the parameters that generally affect the performance of two-dimensional pattern matching algorithms. Using the Set Horspool, Wu-Manber, Set Backward Oracle Matching and SOG data structures in place of the trie of Aho-Corasick improved the searching time of the Baker and Bird and the Baeza-Yates and Regnier algorithms for some types of data. More specifically, when a pattern and an input string with alphabets of size |Σ| = 256 and |Σ| = 1024 were used, the Set Horspool variant outperformed the rest of the Baker and Bird implementations for a pattern of a size up to m = 16, while Wu-Manber was the fastest implementation for bigger pattern sizes. In a similar fashion, the original implementation of the Baeza-Yates and Regnier algorithm had the fastest searching time among the presented implementations when data with a binary alphabet were used together with a pattern of a size up to m = 16, as well as for data with an alphabet of size |Σ| = 256 and 1024 when a pattern with a size of m = 4 was used, together with the Set Horspool implementation. Finally, for a pattern with a size m between 32 and 256 characters when a binary alphabet was used, and for a pattern with a size m between 8 and 256 when an alphabet of size |Σ| = 256 and 1024 was used, the Wu-Manber variant outperformed the Aho-Corasick, Set Horspool, Set Backward Oracle Matching and SOG implementations of the Baeza-Yates and Regnier algorithm.

3. We presented details on the design and the implementation of the Aho-Corasick, Set Horspool, Set Backward Oracle Matching, Wu-Manber and SOG multiple pattern matching algorithms in parallel on a GPU using the CUDA API. Different optimizations were used to further increase the performance of the parallel implementations: binding frequently used arrays to the texture memory of the device; coalescing read accesses to the global memory of the GPU and storing the retrieved data to shared memory; and finally packing the input string characters into uint4 vectors to further increase the utilization of the global memory bandwidth (a brief sketch of these memory optimizations is given after this list). Based on the experimental results, it was concluded that the speedup of the basic implementation of the algorithms ranged between 2.5 and 10.9 over the equivalent sequential version of the algorithms. It was also shown that by applying the optimizations discussed in this thesis, the parallel implementation of the Aho-Corasick algorithm was up to 18.5 times faster than the corresponding sequential implementation, the implementations of the Set Horspool and the Set Backward Oracle Matching algorithms were up to 33.7 times faster, the parallel implementation of the Wu-Manber algorithm was up to 11.8 times faster, while the implementation of SOG was up to 8.9 times faster.

4. We also gave details on the design and the implementation of the Baker and Bird and Baeza-Yates and Regnier algorithms on a GPU using the CUDA API and evaluated their performance after applying different optimization techniques. Their parallel searching time was compared to the searching time of the sequential implementations for different problem parameters. The optimization techniques used to further increase the performance of the implementations included the following: binding frequently used arrays to the texture memory of the device; copying the pattern to the shared memory of the device; and coalescing read accesses to the global memory and storing the retrieved data to shared memory (two of these techniques are sketched after this list). It was concluded that the speedup of the basic implementation of the Baker and Bird algorithm was up to 16.5 times over the sequential implementation, while the final, optimized kernel was up to 21.88 times faster. It was also discussed that the basic implementation of the Baeza-Yates and Regnier algorithm was up to 38.6 times faster than the sequential implementation, while the speedup of the optimized version of the same algorithm was up to 113.62.

5. Finally, we presented a 3-tier modular hybrid parallel architecture that combines the flexibility of a CPU with the efficiency of a GPU and the scalability that a distributed memory architecture can offer; a simplified outline of this organization is given after this list. The presented hybrid architecture used different technologies to implement in parallel the Aho-Corasick, Set Horspool, Wu-Manber, Set Backward Oracle Matching and SOG multiple pattern matching algorithms and the Baker and Bird and Baeza-Yates and Regnier two-dimensional pattern matching algorithms. The experimentally obtained speedup results of the Aho-Corasick, Set Horspool and Set Backward Oracle Matching algorithms ranged between 1.3 and 1.8 when all 10 worker nodes were used, while the Wu-Manber and SOG algorithms had a speedup of 7.5 and 9 respectively. Finally, for the Baker and Bird and the Baeza-Yates and Regnier algorithms, their performance actually decreased, due to the significant time needed by NFS to export the data to the nodes in conjunction with the fast searching time of the algorithms.
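
The reduction referred to in the second contribution, treating the rows of an m x m pattern as a finite set of one-dimensional patterns and replacing each row by a label shared by identical rows, can be pictured with the following host-side sketch. It is only an illustration of the idea and not code from this thesis; the function name and the use of std::map are assumptions. Any of the multiple pattern matching structures mentioned above (the Aho-Corasick trie, the Set Horspool trie, the factor oracle or the Wu-Manber and SOG hash tables) can then be built over the distinct rows, while the label sequence drives the column matching of Baker and Bird or the row verification of Baeza-Yates and Regnier.

    #include <map>
    #include <string>
    #include <vector>

    // Assign one label per pattern row; identical rows receive the same label.
    std::vector<int> label_pattern_rows(const std::vector<std::string>& pattern_rows)
    {
        std::map<std::string, int> row_id;          // distinct row -> label
        std::vector<int> labels;
        labels.reserve(pattern_rows.size());
        for (const std::string& row : pattern_rows) {
            auto it = row_id.find(row);
            if (it == row_id.end())
                it = row_id.emplace(row, static_cast<int>(row_id.size())).first;
            labels.push_back(it->second);
        }
        return labels;   // e.g. {0, 1, 0} when the first and third rows coincide
    }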
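
The memory optimizations listed in the third contribution can be illustrated with the kernel fragment below. It is a minimal sketch under assumed names and sizes, not one of the thesis kernels: a fixed 4-character pattern held in constant memory stands in for the real searching logic. The point of the fragment is only the mechanism described above, reading the input string from global memory as uint4 vectors (16 characters per 128-bit transaction, coalesced across a warp), staging the retrieved data in shared memory and scanning it from there.

    #include <cuda_runtime.h>

    #define THREADS_PER_BLOCK 256
    #define CHARS_PER_THREAD   16        // one uint4 (16 bytes) per thread

    __constant__ char d_pattern[4];      // toy pattern, set by the host with cudaMemcpyToSymbol

    // n is the length of the input string in characters (assumed to be a multiple of 16 here).
    __global__ void search_sketch(const uint4* text4, int n, int* match_count)
    {
        __shared__ uint4 chunk4[THREADS_PER_BLOCK];
        const char* chunk = reinterpret_cast<const char*>(chunk4);

        int vec_idx = blockIdx.x * blockDim.x + threadIdx.x;

        // Coalesced 128-bit load from global memory, staged in shared memory.
        if (vec_idx * CHARS_PER_THREAD < n)
            chunk4[threadIdx.x] = text4[vec_idx];
        __syncthreads();

        if (vec_idx * CHARS_PER_THREAD >= n)
            return;

        // Each thread scans its 16-character slice out of fast shared memory.
        // Windows that straddle slice or block boundaries are ignored to keep the
        // sketch short; a real kernel overlaps the chunks by m - 1 characters.
        int base = threadIdx.x * CHARS_PER_THREAD;
        for (int i = 0; i + 4 <= CHARS_PER_THREAD; ++i) {
            bool hit = true;
            for (int j = 0; j < 4; ++j)
                hit = hit && (chunk[base + i + j] == d_pattern[j]);
            if (hit)
                atomicAdd(match_count, 1);
        }
    }

Reading 16 characters per thread as a single uint4 keeps the global memory transactions of a warp fully coalesced, which is the effect of the packing optimization described above.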
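
Two of the optimizations named in the fourth contribution, binding a frequently used array to texture memory and keeping the small two-dimensional pattern in shared memory, are sketched below using the legacy CUDA texture-reference API that was current for the GTX280 generation (recent CUDA releases replace it with texture objects). This is an illustrative fragment rather than the thesis code: the table bound to the texture stands in for whatever preprocessing table a variant uses, and all names and sizes are assumptions.

    #include <cuda_runtime.h>

    #define M 16                                  // illustrative m x m pattern size

    // Legacy 1D texture reference over a preprocessed per-character table; the host
    // would bind it before the launch, e.g. cudaBindTexture(0, tex_table, d_table, 256 * sizeof(int)).
    texture<int, 1, cudaReadModeElementType> tex_table;

    __global__ void bb_search_sketch(const unsigned char* text, int n,
                                     const unsigned char* pattern, int* hits)
    {
        // Stage the whole m x m pattern in shared memory once per block.
        __shared__ unsigned char s_pat[M * M];
        for (int i = threadIdx.x; i < M * M; i += blockDim.x)
            s_pat[i] = pattern[i];
        __syncthreads();

        int pos = blockIdx.x * blockDim.x + threadIdx.x;
        if (pos + M > n) return;

        // Cached lookup through the texture unit: a toy table mapping a text
        // character to a candidate pattern row (or -1), compared against the
        // copy of that row held in shared memory.
        int row = tex1Dfetch(tex_table, text[pos]);
        if (row < 0) return;
        bool match = true;
        for (int j = 0; j < M; ++j)
            match = match && (text[pos + j] == s_pat[row * M + j]);
        if (match)
            atomicAdd(hits, 1);
    }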
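
The 3-tier organization of the fifth contribution, MPI across the nodes of the cluster, a host process per node and a CUDA kernel on each node's GPU, can be outlined as follows. This is a deliberately simplified, self-contained sketch under assumed names: the toy kernel merely counts occurrences of a single character and stands in for any of the multiple pattern matching kernels, and the input chunk is generated locally instead of being read from the NFS-exported file used in the experiments of this thesis.

    #include <cstdio>
    #include <vector>
    #include <mpi.h>
    #include <cuda_runtime.h>

    // Toy kernel: counts occurrences of one character in the local chunk.
    __global__ void count_kernel(const char* text, long n, char target, int* count)
    {
        long i = blockIdx.x * (long)blockDim.x + threadIdx.x;
        if (i < n && text[i] == target)
            atomicAdd(count, 1);
    }

    // Host wrapper: copy the chunk to the device, launch the kernel, return the count.
    static int gpu_count(const char* buf, long len)
    {
        char* d_text; int* d_count; int h_count = 0;
        cudaMalloc(&d_text, len);
        cudaMalloc(&d_count, sizeof(int));
        cudaMemcpy(d_text, buf, len, cudaMemcpyHostToDevice);
        cudaMemcpy(d_count, &h_count, sizeof(int), cudaMemcpyHostToDevice);
        count_kernel<<<(int)((len + 255) / 256), 256>>>(d_text, len, 'a', d_count);
        cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
        cudaFree(d_text);
        cudaFree(d_count);
        return h_count;
    }

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // Each worker would normally receive its share of the input string (e.g.
        // from the shared NFS export); a local dummy chunk keeps the sketch runnable.
        std::vector<char> chunk(1 << 20, (char)('a' + rank % 4));

        int local = gpu_count(chunk.data(), (long)chunk.size());

        int total = 0;
        MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            std::printf("total matches: %d\n", total);

        MPI_Finalize();
        return 0;
    }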

9.2 Future Research

The work presented in this thesis can be considered complete, in the sense that the problem of pattern matching was thoroughly examined. As a first step, it was discussed that the most widely used algorithms for multiple pattern matching are in fact generalizations of some of the most well known string matching algorithms. The string matching algorithms were discussed in chapter 3 while the multiple pattern matching algorithms were presented in chapter 2. Then, it was argued that two of the most well known algorithms for two-dimensional pattern matching, Baker and Bird and Baeza-Yates and Regnier, handle the two-dimensional pattern matching problem as a special case of multiple pattern matching, where the two-dimensional pattern array is considered as a finite set of different patterns. Based upon the fact that in [11] it was suggested that the use of a multiple pattern matching algorithm based on Boyer-Moore instead of Aho-Corasick should result in the improvement of the searching phase of the Baeza-Yates and Regnier algorithm, efficient variants of the Baker and Bird and the Baeza-Yates and Regnier algorithms were introduced in chapter 4. Finally, since both the multiple and the two-dimensional pattern matching algorithms that were studied in this thesis have nearly all the characteristics that make them suitable for parallel execution on GPUs, their performance was further improved with their implementation on Graphics Processing Units and on a hybrid cluster of nodes.

This section gives a number of interesting suggestions for further expansion of the topics discussed in this thesis. The experiments on the multiple pattern matching algorithms could be extended with additional parameters, including patterns of varying length, larger pattern sets and data with different alphabets. The performance of the introduced variants of the Baker and Bird and the Baeza-Yates and Regnier algorithms could also be evaluated for data sets with larger input string and pattern sizes and for additional types of data, including photo archives and satellite imagery. The way the performance of the different implementations of the Baker and Bird and the Baeza-Yates and Regnier algorithms is affected by the way they access memory is a very interesting subject that should be studied further. Although this thesis focuses on exact multiple and two-dimensional algorithms, algorithms from different categories of pattern matching, including pattern matching with scaling, compression or rotation, have similar characteristics and are thus worth studying further.

The discussion of the parallel implementation details of the algorithms in chapters 6 and 7 was focused on a specific GPU, the NVIDIA GTX280. It would be valuable to implement the algorithms in parallel on more modern GPUs and examine the way their performance is affected by the characteristics of more recent hardware. Finally, the hybrid parallel architecture of chapter 8 could also be adapted for heterogeneous GPU clusters, it could be further tested on larger computer clusters and it could be expanded for multi-core processing using the OpenMP API.


Bibliography

[1] L. Adhianto and B. Chapman. Performance Modeling of Communication and Computation in Hybrid MPI and OpenMP Applications. In Proceedings of the 12th International Conference on Parallel and Distributed Systems, 2006.

[2] A.V. Aho and M.J. Corasick. Efficient String Matching: An Aid to Bibliographic Search. Communications of the ACM, 18(6):333–340, 1975.

[3] C. Allauzen, M. Crochemore, and M. Raffinot. Factor Oracle: A New Structure for Pattern Matching. SOFSEM’99: Theory and Practice of Informatics, 1725:758–758, 1999.

[4] G.M. Amdahl. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. In Proceedings of the April 18-20, 1967, spring joint computer conference, AFIPS ’67 (Spring), pages 483–485. ACM, 1967.

[5] A. Amir. Multidimensional Pattern Matching: A Survey. 1992. Technical report GIT-CC-92/29.

[6] A. Amir, G. Benson, and M. Farach. Alphabet Independent Two-dimensional Matching. In Proceedings of the Twenty-fourth Annual ACM Symposium on Theory of Computing, pages 59–68. ACM, 1992.

[7] A. Amir, O. Kapah, and D. Tsur. Faster Two-dimensional Pattern Matching with Rotations. Theoretical Computer Science, 368(3):196–204, 2006. Combinatorial Pattern Matching.

[8] A. Amir, G.M. Landau, and U. Vishkin. Efficient Pattern Matching with Scaling. Journal of Algorithms, 13(1):2–32, 1992.

[9] A. Apostolico and Z. Galil. Pattern Matching Algorithms. Oxford University Press, 1997.

[10] R. Baeza-Yates and C. Perleberg. Fast and Practical Approximate String Matching. In Combinatorial Pattern Matching, pages 185–192. Springer, 1992.

[11] R. Baeza-Yates and M. Regnier. Fast Two Dimensional Pattern Matching. Information Processing Letters, 45(1):51–57, 1993.

[12] R.A. Baeza-Yates and L.O. Fuentes. A Framework to Animate String Algorithms. Information Processing Letters, 59(5):241–244, 1996.

[13] R.A. Baeza-Yates and G.H. Gonnet. A New Approach to Text Searching. Communications of the ACM, 35(10):74–82, 1992.

[14] T.P. Baker. A Technique for Extending Rapid Exact-Match String Matching to Arrays of More than One Dimension. SIAM Journal on Computing, 7(4):533–541, 1978.

[15] R.S. Bird. Two Dimensional Pattern Matching. Information Processing Letters, 6(5):168–170, 1977.

[16] R.S. Boyer and J.S. Moore. A Fast String Searching Algorithm. Communications of the ACM, 20(10):762–772, 1977.

[17] R. Buyya. High Performance Cluster Computing: Programming and Applications, volume 2. Prentice Hall, 1999.

[18] B. Chapman, G. Jost, and R. Van Der Pas. Using OpenMP: Portable Shared Memory Parallel Programming, volume 10. MIT Press, 2008.


[19] A.Y. Chen. A Hybrid Architecture for Massively Scaled Distributed Virtual Environments. UCLA Multimedia Systems Laboratory, 7, 2006.

[20] X. Chen, B. Fang, L. Li, and Y. Jiang. WM+: An Optimal Multi-pattern String Matching Algorithm Based on the WM Algorithm. Advanced Parallel Processing Technologies, pages 515–523, 2005.

[21] B. Commentz-Walter. A String Matching Algorithm Fast on the Average. In Proceedings of the 6th Colloquium on Automata, Languages and Programming, pages 118–132, 1979.

[22] NVIDIA Corporation. GeForce GTX 200 GPU Technical Brief, 2008. Available online at: http://www.nvidia.com/docs/IO/55506/GeForce GTX 200 GPU Technical Brief.pdf.

[23] J.K. Cringean, G.A. Manson, P. Willett, and G.A. Wilson. Efficiency of Text Scanning in Bibliographic Databases using Microprocessor-based, Multiprocessor Networks. Journal of Information Science, 14(6):335–345, 1988.

[24] M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter. Speeding Up Two String-matching Algorithms. Algorithmica, 12(4):247–267, 1994.

[25] M. Crochemore, A. Czumaj, L. Gasieniec, T. Lecroq, W. Plandowski, and W. Rytter. Fast Practical Multi-pattern Matching. Information Processing Letters, 71(3-4):107–113, 1999.

[26] M. Crochemore, L. Gasieniec, and W. Rytter. Two-dimensional Pattern Matching by Sampling. Information Processing Letters, 46:159–162, June 1993.

[27] M. Crochemore and W. Rytter. Text Algorithms. Oxford University Press, Inc., 1994.

[28] M. Crochemore and W. Rytter. Jewels of Stringology. World Scientific Publishing Company, 1st edition, 2002.

[29] Y. Dong-hong, X. Ke, and C. Yong. An Improved Wu-Manber Multiple Patterns Matching Algorithm. In Proceedings of the 25th IEEE International Performance, Computing and Communications Conference, pages 675–680, 2006.

[30] S. Dori and G.M. Landau. Construction of Aho Corasick Automaton in Linear Time for Integer Alphabets. Information Processing Letters, 98(2):66–72, 2006.

[31] U. Drepper. What Every Programmer Should Know About Memory. 2007.

[32] ESA/Hubble. Hubble’s Sharpest View of the Orion Nebula. Website, 2013. http://www.spacetelescope.org/images/heic0601a/.

[33] M.J. Flynn. Very High-speed Computing Systems. Proceedings of the IEEE, 54(12):1901–1909, 1966.

[34] K. Fredriksson, G. Navarro, and E. Ukkonen. Optimal Exact and Fast Approximate Two Dimensional Pattern Matching Allowing Rotations. Combinatorial Pattern Matching, 2373:235–248, 2002.

[35] Z. Galil. On Improving the Worst Case Running Time of the Boyer-Moore String Matching Algorithm. Communications of the ACM, 22(9):505–508, 1979.

[36] Z. Galil and K. Park. Truly Alphabet-independent Two-dimensional Pattern Matching. In Proceedings of the 33rd IEEE Annual Symposium on Foundations of Computer Science, IEEE, pages 247–256, 1992.

[37] M. Gebhart, S.W. Keckler, B. Khailany, R. Krashinsky, and W.J. Dally. Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor. In Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012.

[38] GNU Grep. Webpage containing information about the GNU grep search utility. Website, 2012. http://www.gnu.org/software/grep/.

[39] T.R. Halfhill. Parallel Processing with CUDA. Microprocessor Report, pages 01–08, 2008.

[40] R.N. Horspool. Practical Fast Searching in Strings. Software: Practice and Experience, 10(6):501–506, 1980.


[41] D. Howells. Fs-cache: A Network Filesystem Caching Facility. In Proceedings of the Linux Symposium, volume 1. Citeseer, 2006.

[42] L. Hu, Z. Wei, F. Wang, X. Zhang, and K. Zhao. An Efficient AC Algorithm with GPU. Procedia Engineering, 29:4249–4253, 2012.

[43] M. Jamshed, J. Lee, S. Moon, I. Yun, D. Kim, S. Lee, Y. Yi, and K.S. Park. Kargus: a Highly-scalable Software-based Intrusion Detection System. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), 2012.

[44] P. Kalsi, H. Peltola, and J. Tarhio. Comparison of Exact String Matching Algorithms for Biological Sequences. Communications in Computer and Information Science, 13:417–426, 2008.

[45] J. Karkkainen and E. Ukkonen. Two- and Higher-Dimensional Pattern Matching in Optimal Expected Time. SIAM Journal on Computing, 29(2):571–589, 1999.

[46] R.M. Karp and M.O. Rabin. Efficient Randomized Pattern-matching Algorithms. IBM Journal of Research and Development, 31(2):249–260, 1987.

[47] S. Kim and Y. Kim. A Fast Multiple String-pattern Matching Algorithm. Proceedings of the 17th AoM/IAoM International Conference on Computer Science, pages 1–6, 1999.

[48] D.E. Knuth, J.H. Morris, and V.R. Pratt. Fast Pattern Matching in Strings. SIAM Journal on Computing, 6(2):323–350, 1977.

[49] C.S. Kouzinopoulos and K.G. Margaritis. Improving the Efficiency of Exact Two Dimensional On-Line Pattern Matching Algorithms. In Proceedings of the 12th Panhellenic Conference on Informatics, pages 232–236, 2008.

[50] C.S. Kouzinopoulos and K.G. Margaritis. Parallel Implementation of Exact Two Dimensional Pattern Matching Algorithms using MPI and OpenMP. In Proceedings of the 9th Hellenic European Research on Computer Mathematics and its Applications Conference, 2009.

[51] C.S. Kouzinopoulos and K.G. Margaritis. String Matching on a Multicore GPU using CUDA. In Proceedings of the 13th Panhellenic Conference on Informatics, pages 14–18, 2009.

[52] C.S. Kouzinopoulos and K.G. Margaritis. Experimental Results on Algorithms for Multiple Keyword Matching. In Proceedings of the IADIS International Conference on Informatics, pages 129–133, 2010.

[53] C.S. Kouzinopoulos and K.G. Margaritis. A Performance Evaluation of the Preprocessing Phase of Multiple Keyword Matching Algorithms. In Proceedings of the 15th Panhellenic Conference on Informatics, 2011.

[54] C.S. Kouzinopoulos and K.G. Margaritis. Experimental Results on Multiple Pattern Matching Algorithms for Biological Sequences. In Proceedings of the International Conference on Bioinformatics - Models, Methods and Algorithms, pages 274–277, 2011.

[55] C.S. Kouzinopoulos and K.G. Margaritis. Exact On-Line Two-Dimensional Pattern Matching using Multiple Pattern Matching Algorithms. Journal of Experimental Algorithmics, 2013.

[56] C.S. Kouzinopoulos, P.D. Michailidis, and K.G. Margaritis. Parallel Processing of Multiple Pattern Matching Algorithms for Biological Sequences: Methods and Performance Results. InTech, 2011.

[57] C.S. Kouzinopoulos, P.D. Michailidis, and K.G. Margaritis. Performance Study of Parallel Hybrid Multiple Pattern Matching Algorithms for Biological Sequences. In International Conference on Bioinformatics - Models, Methods and Algorithms, pages 182–187, 2012.

[58] N.B. Lakshminarayana and H. Kim. Effect of Instruction Fetch and Memory Scheduling on GPU Performance. In Workshop on Language, Compiler, and Architecture Support for GPGPU, 2010.

[59] T. Lecroq. Fast Exact String Matching Algorithms. Information Processing Letters, 102(6):229–235, 2007.


[60] C. Lin, S. Tsai, C. Liu, S. Chang, and J. Shyu. Accelerating String Matching using Multi-Threaded Algorithm on GPU. In Global Telecommunications Conference (GLOBECOM 2010), 2010 IEEE, pages 1–5, 2010.

[61] P. Liu, Y. Liu, and J. Tan. A Partition-based Efficient Algorithm for Large Scale Multiple-strings Matching. In String Processing and Information Retrieval, pages 399–404. Springer, 2005.

[62] P.D. Michailidis and K.G. Margaritis. On-line String Matching Algorithms: Survey and Experimental Results. International Journal of Computer Mathematics, 76(4):411–434, 2001.

[63] P.D. Michailidis and K.G. Margaritis. Parallel Implementations for String Matching Problem on a Cluster of Distributed Workstations. Neural, Parallel & Scientific Computations, 10(3):287–312, 2002.

[64] R. Muth and U. Manber. Approximate Multiple String Search. In Combinatorial Pattern Matching, pages 75–86. Springer, 1996.

[65] G. Navarro and K. Fredriksson. Average Complexity of Exact and Approximate Multiple String Matching. Theoretical Computer Science, 321(2-3):283–290, 2004.

[66] G. Navarro and M. Raffinot. A Bit-parallel Approach to Suffix Automata: Fast Extended String Matching. Lecture Notes in Computer Science, 1448:14–33, 1998.

[67] G. Navarro and M. Raffinot. Flexible Pattern Matching in Strings: Practical On-line Search Algorithms for Texts and Biological Sequences. Cambridge University Press, 2002.

[68] N. Nethercote and J. Seward. Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation. Proceedings of the ACM SIGPLAN conference on Programming language design and implementation, pages 89–100, 2007.

[69] E. Niewiadomska-Szynkiewicz, M. Marks, J. Jantura, and M. Podbielski. A Hybrid CPU/GPU Cluster for Encryption and Decryption of Large Amounts of Data. Journal of Telecommunications and Information Technology, (3):32–39, 2012.

[70] NVIDIA. NVIDIA CUDA Compute Unified Device Architecture Programming Guide, version 5.5, 2013. http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA C Programming Guide.p

[71] University of California Santa Cruz. Jack Baskin School of Engineering. Website, 2013. http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/.

[72] J.D. Owens. Streaming Architectures and Technology Trends. In ACM SIGGRAPH 2005 Courses. ACM, 2005.

[73] M. Papadopoulou, M. Sadooghi-Alvandi, and H. Wong. Micro-benchmarking the GT200 GPU. Computer Group, ECE, University of Toronto, 2009. Technical report.

[74] T. Polcar and B. Melichar. A Two-dimensional Online Tessellation Automata Approach to Two-dimensional Pattern Matching. In Proceedings of the Eindhoven FASTAR Days, 2004. invited talk.

[75] C. Pungila and V. Negru. A Highly-Efficient Memory-Compression Approach for GPU-Accelerated Virus Signature Matching. Information Security, pages 354–369, 2012.

[76] T.K. Pyrgiotis, C.S. Kouzinopoulos, and K.G. Margaritis. Parallel Implementation of the Wu-Manber Algorithm using the OpenCL Framework. In Artificial Intelligence Applications and Innovations, volume 382 of IFIP Advances in Information and Communication Technology, pages 576–583. Springer, 2012.

[77] L. Salmela. Improved Algorithms for String Searching Problems. PhD thesis, Helsinki University of Technology, 2009.

[78] L. Salmela, J. Tarhio, and J. Kytojoki. Multipattern String Matching with q-grams. Journal of Experimental Algorithmics, 11:1–19, 2006.

[79] S.S. Sheik, S.K. Aggarwal, A. Poddar, B. Sathiyabhama, N. Balakrishna, and K. Sekar. Analysis of String-searching Algorithms on Biological Sequence Databases. Current Science, 89(2):368–374, 2005.


[80] T.F. Sheu, N.F. Huang, and H.P. Lee. Hierarchical Multi-pattern Matching Algorithm for Network Content Inspection. Information Sciences, 178:2880–2898, 2008.

[81] M. Snir, S. Otto, S. Huss-Lederman, D.W. Walker, and J. Dongarra. MPI: The Complete Reference. The MIT Press, 1996.

[82] Snort. Webpage containing information on the Snort intrusion prevention and detection system. Website, 2010. http://www.snort.org/.

[83] Streamline. The official webpage of the streamline project. Website, 2011. http://netstreamline.org/.

[84] T. Tao. Compressed Pattern Matching for Text and Images. PhD thesis, University of Central Florida, 2005.

[85] J. Tarhio. A Sublinear Algorithm for Two-dimensional String Matching. Pattern Recognition Letters, 17(8):833–838, 1996.

[86] A.B. Tucker. Computer Science Handbook. CRC Press, 2004.

[87] A. Tumeo, S. Secchi, and O. Villa. Experiences with String Matching on the Fermi Architecture. Architecture of Computing Systems - ARCS 2011, pages 26–37, 2011.

[88] A. Tumeo, O. Villa, and D.G. Chavarría-Miranda. Aho-Corasick String Matching on Shared and Distributed-memory Parallel Architectures. Parallel and Distributed Systems, IEEE Transactions on, 23(3):436–443, 2012.

[89] G. Vasiliadis, S. Antonatos, M. Polychronakis, E.P. Markatos, and S. Ioannidis. Gnort: High Performance Network Intrusion Detection using Graphics Processors. Proceedings of RAID, 5230:116–134, 2008.

[90] G. Vasiliadis and S. Ioannidis. Gravity: A Massively Parallel Antivirus Engine. In Recent Advances in Intrusion Detection, pages 79–96. Springer, 2010.

[91] L. Vespa and N. Weng. SWM: Simplified Wu-Manber for GPU-based Deep Packet Inspection. In Proceedings of the 2012 International Conference on Security and Management, 2012.

[92] V. Volkov. Better Performance at Lower Occupancy. In Proceedings of the GPU Technology Conference, GTC, volume 10, 2010.

[93] B.W. Watson. Taxonomies and Toolkits of Regular Language Algorithms. PhD thesis, Eindhoven University of Technology, 1995.

[94] N. Wilt. The CUDA Handbook: A Comprehensive Guide to GPU Programming. Addison-Wesley Professional, 2013.

[95] S. Wu and U. Manber. Agrep - A Fast Approximate Pattern-Matching Tool. In Proceedings of USENIX Technical Conference, pages 153–162, 1992.

[96] S. Wu and U. Manber. A Fast Algorithm for Multi-pattern Searching. pages 1–11, 1994. Technical report TR-94-17.

[97] J. Zdarek. Two-dimensional Pattern Matching using Automata Approach. PhD thesis, Czech Technical University, 2010.

[98] X. Zha and S. Sahni. Multipattern String Matching on a GPU. In Computers and Communications (ISCC), 2011 IEEE Symposium on, pages 277–282. IEEE, 2011.

[99] Z. Zhou, Y. Xue, J. Liu, W. Zhang, and J. Li. MDH: A High Speed Multi-phase Dynamic Hash String Matching Algorithm for Large-Scale Pattern Set. Information and Communications Security, 4861:201–215, 2007.

[100] R.F. Zhu and T. Takaoka. A Technique for Two-dimensional Pattern Matching. Communications of the ACM, 32(9):1110–1120, September 1989.

[101] CUDA Zone. Official webpage of the NVIDIA CUDA API. Website, 2012. http://www.nvidia.com/object/cuda home.html.