SCHOOLOFSCIENCE … › 2015 › 05 › thesis.pdfFINALTHESIS...

NATIONAL AND KAPODISTRIAN UNIVERSITY OF ATHENS

SCHOOL OF SCIENCE

DEPARTMENT OF INFORMATICS & TELECOMMUNICATIONS

FINAL THESIS

High-dimensional approximate nearestneighbor:

k-d Generalized Randomized Forests

Georgios Samaras

Advisor: Ioannis Emiris, Professor, DIT

ATHENSFEBRUARY 2015

ΕΘΝΙΚΟ ΚΑΙ ΚΑΠΟΔΙΣΤΡΙΑΚΟ ΠΑΝΕΠΙΣΤΗΜΙΟ ΑΘΗΝΩΝ

ΣΧΟΛΗ ΘΕΤΙΚΩΝ ΕΠΙΣΤΗΜΩΝ

ΤΜΗΜΑ ΠΛΗΡΟΦΟΡΙΚΗΣ & ΤΗΛΕΠΙΚΟΙΝΩΝΙΩΝ

ΠΤΥΧΙΑΚΗ ΕΡΓΑΣΙΑ

Κατά προσέγγιση πλησιέστερος γείτονας σευψηλές διαστάσεις:

k-d Generalized Randomized Forests

Γεώργιος Σαμαράς

Επιβλέπων: Ιώαννης Εμίρης, Καθηγητής, ΕΚΠΑ

ΑΘΗΝΑΦΕΒΡΟΥΑΡΙΟΣ 2015

FINAL THESIS

High-dimensional approximate nearest neighbor:k-d Generalized Randomized Forests

Georgios Samaras

1115 2010 00093

ADVISOR:

Ioannis Emiris, Professor, DIT

ΠΤΥΧΙΑΚΗ ΕΡΓΑΣΙΑ

Κατά προσέγγιση πλησιέστερος γείτονας σε υψηλές διαστάσεις:k-d Generalized Randomized Forests

Γεώργιος Σαμαράς

1115 2010 00093

ΕΠΙΒΛΕΠΩΝ:

Ιώαννης Εμίρης, Καθηγητής, ΕΚΠΑ

ABSTRACTWe propose a new data-structure, the generalized randomized k-d forest, or k-d

GeRaF, for approximate nearest neighbor searching in high dimensions. In partic-ular, we introduce new randomization techniques to specify a set of independentlyconstructed trees where search is performed simultaneously, hence increasing accu-racy. We omit backtracking, and we optimize distance calculations, thus accelerat-ing queries. We compare our public domain software to state-of-the-art methodsincluding BBD-trees implemented in ANN, Locality Sensitive Hashing in E2LSHand randomized k-d forests implemented in FLANN. Experimental results indicatethat our method would be the method of choice in dimensions around 1,000, andprobably up to 10,000, and pointsets of cardinality up to a few hundred thousandsor even one million; this range of inputs is encountered in many critical applica-tions today. We handle GIST datasets of 106 images in 960 dimensions in < 1secwith about 90% outputs being true nearest neighbors.

SUBJECT AREA: Computational GeometryKEYWORDS: Nearest Neighbour, Data-structure, Randomized tree,Space partition, Geometric search, Open software, Practical complexity,High dimensions

ΠΕΡΙΛΗΨΗ

Προτείνουμε μια καινούργια δομή δεδομένων, το generalized randomized k-d forest,ή kd-GeRaF, για προσεγγιστική αναζήτηση κοντινότερου γείτονα σε υψηλές διαστάσεις.Για την ακρίβεια, εισαγάγουμε νέες τυχαιοποιημένες τεχνικές ώστε να ορίσουμε ένασύνολο από ανεξάρτητα κατασκευασμένα δέντρα που η αναζήτηση γίνεται ταυτόχρονα,πραμα που αυξάνει την ακρίβεια. Παραλείπουμε τις υπαναχωρήσεις, και βελτιστοποιούμετον υπολογισμό αποστάσεων, κάτι που κάνει την αναζήτηση γρηγορότερη. Συγκρίνουμετο λογισμικό μας, ανοιχτού κώδικα, με σύχρονες μεθόδους, όπως τα BBD δέντρα,υλοποιημένα στοANN, το Τοπικά ΕυαίσθητοΚατακερματισμό, υλοποιημένο στο E2LSHκαι randomized k-d forests, υλοποιημένα στο FLANN. Πειραματικά αποτελέσματαυποδεικνύουν ότι η μέθοδος μας θα ήταν αυτή που θα επέλεγε κανείς στη περίπτωσητων 900-1000 διαστάσεων, και ίσως μέχρι και 10.000, για σημειοσύνολα πληθικότηταςλίγων εκατοντάδων μέχρι ακόμα και ένα εκατομμύριο. Χειριζόμαστε GIST δεδομένα106 εικόνων στις 960 διαστάσεις σε < 1 δευτ., όπου 90% των περιπτώσεν βρίσκουμετους πραγματικούς κοντινότερους γείτονες.

ΘΕΜΑΤΙΚΗ ΠΕΡΙΟΧΗ: Υπολογιστική ΓεωμετρίαΛΕΞΕΙΣΚΛΕΙΔΙΑ:Κοντινότερος Γείτονας, Δομήδεδομένων, Υποδιαίρεσηχώρου,Γεωμετρικήαναζήτηση,Ανοιχτό λογισμικό,Πρακτικήπολυπλοκότητα, ΥψηλέςΔιαστάσεις

Acknowledgements

I would like to thank my advisor, I. Emiris, who assisted me throughout myundergraduate years. Moreover, I should acknowledge all my professors and theDepartment in general, for its high level of quality, which had a great impacton the outcome of my undergraduate studies, both as a scientist and a person.Moreover, I would like to thank I. Psarros, V. Anagnostopoulos and Y. Avrithis,members of the Laboratory of Algebraic and Geometric ALgorithms (EρΓA1), fortheir support.

Travelling to Switzerland and France was something that really broadened myhorizons. Especially the internship in INRIA(Institut national de recherche eninformatique et en automatique), in GUDHI project [1], helped me a lot in devel-oping the RKD forest project, which is the ancestor of kd-GeRaF software. Some ofthe data used in the experiments below were kindly provided by Clément Maria.The rest were provided by Dimitri Nicolopoulos. Also the Winter School in Paris,supported by IHP(Institut Henri Poincaré), gave me the opportunity to meet S.Vempala and discuss about the kd-GeRaF.

Furthermore, my family and my friends played a key role in supporting me.Last, but certaintly not least, I would like to thank the players of the “Cage”, who

1http://erga.di.uoa.gr/

http://erga.di.uoa.gr/

contributed in achieving the balance between a healthy mind and body for me.

Contents

1 Introduction 19

1.1 Problem description . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

1.2 Previous work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

1.3 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

1.4 Structure of dissertation . . . . . . . . . . . . . . . . . . . . . . . . . 23

2 Why a kd-GeRaF? 24

2.1 Outline of KD tree search . . . . . . . . . . . . . . . . . . . . . . . . 24

2.1.1 Building process . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.1.2 Search process . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.1.3 Why Approximate Nearest Neighbour Search (ANNS)? . . . . . 27

2.2 Why a kd-GeRaF instead of a KD tree? . . . . . . . . . . . . . . . . . 27

2.2.1 Diminshed returns . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.2.2 How to avoid diminshed returns? . . . . . . . . . . . . . . . . . 29

2.2.3 How to build different trees? . . . . . . . . . . . . . . . . . . . . 29

2.2.4 Randomly rotate the point set . . . . . . . . . . . . . . . . . . . 30

2.2.5 Randomly choose a cutting dimension . . . . . . . . . . . . . . 31

2.2.6 Add a random factor to the cutting value . . . . . . . . . . . . . 31

3 How to build a kd-GeRaF? 33

3.1 Building process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.2 Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3.2.1 Deterministic rotation . . . . . . . . . . . . . . . . . . . . . . . 36

3.2.2 Rotation using Householder matrices . . . . . . . . . . . . . . . 37

3.2.2.1Definition . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.2.2.2A more efficient method . . . . . . . . . . . . . . . . . . 38

3.3 Computing variances . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.4 How to find the t max elements? . . . . . . . . . . . . . . . . . . . . 41

3.5 Where to split? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.6 Find the median . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4 Technical remarks on kd-GeRaF 44

4.1 How to store the tree . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4.2 Exploit caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.2.1 t_dims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.2.2 Convert 2D to 1D . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.2.3 Combination of these two methods . . . . . . . . . . . . . . . . 49

4.2.4 Which method to use? . . . . . . . . . . . . . . . . . . . . . . . 49

4.3 RKD vs KD tree in the building process . . . . . . . . . . . . . . . . 51

4.4 How much does rotation affect the building time? . . . . . . . . . . . 52

4.5 Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.5.1 Building in parallel . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.5.2 Searching in parallel . . . . . . . . . . . . . . . . . . . . . . . . 54

5 How to search into a kd-GeRaF? 56

5.1 Exact searching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

5.2 Exact radius searching . . . . . . . . . . . . . . . . . . . . . . . . . . 58

5.3 Fast searching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

5.4 k-NN searching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

6 Time measurements and comparisons 61

6.1 Data and software used . . . . . . . . . . . . . . . . . . . . . . . . . 61

6.1.1 SIFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

6.1.2 GIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

6.1.3 Klein bottle & sphere . . . . . . . . . . . . . . . . . . . . . . . 62

6.1.4 Software used . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

6.2 Time measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

6.2.1 Automatic configuration of kd-GeRaFt . . . . . . . . . . . . . . 65

6.3 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

6.3.1 kd-GeRaF vs BBD . . . . . . . . . . . . . . . . . . . . . . . . . 100

6.3.1.1Parameters . . . . . . . . . . . . . . . . . . . . . . . . . 100

6.3.1.2Build & Search . . . . . . . . . . . . . . . . . . . . . . . 101

6.3.2 kd-GeRaF vs LSH . . . . . . . . . . . . . . . . . . . . . . . . . . 101

6.3.2.1Parameters . . . . . . . . . . . . . . . . . . . . . . . . . 102

6.3.2.2Build & Search . . . . . . . . . . . . . . . . . . . . . . . 102

6.3.3 kd-GeRaF vs FLANN . . . . . . . . . . . . . . . . . . . . . . . . 103

6.3.3.1Parameters . . . . . . . . . . . . . . . . . . . . . . . . . 103

6.3.3.2Build & Search . . . . . . . . . . . . . . . . . . . . . . . 103

7 Conclusion and future work 105

References 107

List of Figures

1.1 The input and output of the Approximate Nearest Neighbour Search

(ANNS) problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.1 [2]: Priority search. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.2 [2]: Diminished returns. . . . . . . . . . . . . . . . . . . . . . . . . . 28

2.3 A nice case for adding δ . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.1 An RKD tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.1 The kd-GeRaF project . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4.2 Memory access for computing variance . . . . . . . . . . . . . . . . . 48

6.1 The building times of BBD, LSH, FLANN and kd-GeRaF . . . . . . . 66

6.2 The search times of BBD, LSH, FLANN andkd-GeRaF for 10.000 points 67

6.3 The search times of BBD, LSH, FLANN and kd-GeRaF for 1.000.000

points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

6.4 The building times of BBD, LSH, FLANN and kd-GeRaF . . . . . . . 70

6.5 The search times of BBD, LSH, FLANN and kd-GeRaF for 100.000

points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

6.6 The search times of BBD, LSH, FLANN and kd-GeRaF for 1.000.000

points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

6.7 The building times of BBD, LSH, FLANN and kd-GeRaF forD = 10.000 75

6.8 The building times of BBD, LSH, FLANN and kd-GeRaF for D = 100 76

6.9 The search times of BBD, LSH, FLANN and kd-GeRaF in 10.000

dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77


points in 100 dimensions . . . . . . . . . . . . . . . . . . . . . . . . . 78



6.12 The building times of BBD, LSH, FLANN and kd-GeRaF for D =

10.000 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

6.13 The building times of BBD, LSH, FLANN and kd-GeRaF for D = 100 83

6.14 The search times of BBD, LSH, FLANN and kd-GeRaF in 10.000

dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84





6.17 The building times of kd-GeRaF, BBD, LSH and FLANN . . . . . . 89

6.18 The search times of kd-GeRaF, BBD, LSH and FLANN for 10.000

points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

6.19 The search times of kd-GeRaF, BBD, LSH and FLANN for 1.000.000

points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

6.20 The speedup over brute force, of kd-GeRaF and FLANN, for N =

1.000, D = 10.000 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95


10.000, D = 10.000 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96


10.000, D = 100 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

List of Tables

4.1 Approach: 1D, time in seconds. . . . . . . . . . . . . . . . . . . . . . 49

4.2 Approach: 1D+t_dims . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4.3 Approach: 2D+t_dims . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4.4 RKD vs KD in the building process . . . . . . . . . . . . . . . . . . . 51

4.5 Building times with rotation on and off . . . . . . . . . . . . . . . . . 53

4.6 Building times with parallelization on and off . . . . . . . . . . . . . 54

6.1 SIFT: BBD vs LSH vs FLANN vskd-GeRaF . . . . . . . . . . . . . . 68

6.2 GIST: BBD vs LSH vs FLANN vs kd-GeRaF . . . . . . . . . . . . . . 71

6.3 GIST, N = 106: BBD vs LSH vs FLANN vs kd-GeRaF . . . . . . . . 73

6.4 Sphere: BBD vs LSH vs FLANN vs kd-GeRaF . . . . . . . . . . . . . 80

6.5 Klein bottle: BBD vs LSH vs FLANN vs kd-GeRaF . . . . . . . . . . 86

6.6 kd-GeRaF vs BBD, optimal parameters . . . . . . . . . . . . . . . . . 91

6.7 kd-GeRaF vs LSH, optimal parameters . . . . . . . . . . . . . . . . . 92

6.8 kd-GeRaF vs FLANN, optimal parameters . . . . . . . . . . . . . . . 93

6.9 kd-GeRaF vs FLANN, optimal parameters . . . . . . . . . . . . . . . 98

6.10 Exact NN of the kd-GeRaF . . . . . . . . . . . . . . . . . . . . . . . 100

List of Algorithms

1 Build RKD tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

2 kd GeRaF building . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3 Euler rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4 Knuth’s online variance algorithm . . . . . . . . . . . . . . . . . . 41

5 Find_t_max with min-heap . . . . . . . . . . . . . . . . . . . . . . 42

6 Quickselect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

7 Build KD tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

8 Find an exact NN of query q . . . . . . . . . . . . . . . . . . . . . . 57

9 kd GeRaF building . . . . . . . . . . . . . . . . . . . . . . . . . . . 60


Chapter 1

INTRODUCTION

1.1 Problem description

Consider a set S of points in a real D-dimensional space RD, where distancesare defined using a function ∆ : RD × RD → R (the Euclidean metric). NearestNeighbour Search (NNS) is an optimization problem for finding the closest pointsin S to a given query point q ∈ RD.

High-dimensional nearest neighbour problems arise naturally when complexobjects are represented by vectors of D numeric features. For example, one canrepresent a 1000× 1000 image as a vector in a 1.000.000-dimensional space, onedimension per pixel. The performance of existed solutions for finding the NearestNeighbour (NN), drops rapidly as the dimensions increases [5]. The source of thisproblem lies in the curse of dimensionality [13]; that is, either the running time orthe space requirement grows exponentially in D [6]. In practise, the reason thatthe performance is getting worse, for example in a KD tree, is mainly because ofthe time-consuming procedure of backtracking in the tree.

Georgios Samaras 19


Improving the efficiency of Nearest Neighbour search in high dimensions is veryimportant, since it is a task that tends to be one of the most computationally ex-pensive parts of many algorithms in a variety of applications, including computervision [3], knowledge discovery and data mining, pattern recognition and clas-sification, machine learning, data compression, multimedia databases, documentretrieval, and statistics [7].

Research has shown that by taking advantage of the trade-off between accu-racy and speed, we can achieve remarkable improvements in the performance ofthe NNS problem in higher dimensions, by sacrificing a bit of the accuracy inpractise. As a result, we are going to perform Approximate Nearest NeighbourSearch (ANNS) in higher dimensions.

Definition 1 Given a positive real ε > 0, then a point p ∈ S is a (1+ε)-approximatenearest neighbour of the query point q ∈ RD, if dist(q, p) ≤ (1 + ε)dist(q, pnn),where pnn ∈ S is the true nearest neighbour to q.

In other words, the search algorithm may not return the exact nearest neighbour,but a neighbour that is relatively near to our query. How close will it be? Thisdepends on the approximation factor ε. Note that in the special case of ε = 0, weare actually performing an Exact NNS and not an Approximate.

In the picture below, we show the input of the problem, a point set and a querypoint q. Moreover, we show the output of the problem, with pnn being the exactNearest Neighbour(NN) and p a possible result that we would acquire from anANNS.

Georgios Samaras 20


Figure 1.1: The input and output of the Approximate Nearest Neighbour Search(ANNS) problem

The question that arises in ANNS is how to perform the search fast, whilekeeping sure that the output point p (shown above) is going to be a point closeenough to the actual NN (pnn point above), if not the actual NN itself.

Georgios Samaras 21


1.2 Previous work

There are many proposed approaches to solve the ANNS problem efficiently.Instead of listing most of them, we will briefly present Locality Sensitive Hash-ing (LSH) and the Balanced Box Decomposition (BBD) tree. The RandomizedK-Dimensional (RKD) trees approach will be discussed later.

The basic idea of LSH is to hash the points of the data set so as to ensure thatthe probability of collision is much higher for objects that are close to each otherthan for those that are far apart [12]. The BBD tree is a variant of the quadtreeand octree [14] but is most closely related to the fair-split tree, see [15]. This treehas O(logN) height, and subdivides space into regions of O(D) complexity definedby axis-aligned hyperrectangles that are fat, meaning that the ratio between thelongest and shortest sides is bounded [16].

1.3 Contribution

In this dissertation we attempt to analyze the theoretical and practical aspectsof the approach regarding the RKD trees, in order to efficiently perform ANNS.We call this approach kd-GeRaF.

Moreover, we developed state of the art software in C++, called kd-GeRaF,which allowed us to test the ideas presented in several papers and conduct a num-ber of experiments, in order to see how these ideas perform in practice. We alsoattempted to optimize every single step of the existed ideas and to combine them,in order to see what happens.

Georgios Samaras 22


Furthermore, we are going to present a comparison with other approaches, suchas LSH, BBD tree and FLANN (a library that uses several RKD trees too), throughexperimental results. We also studied the parameters of every approach and howhard was to find the optimum ones for every such method. Our results show thatkd-GeRaF is faster, but a bit less accurate. Also notice that the experiments showthat our approach scales well, even for 10.000 dimensions. kd-GeRaF seems to beunambiguously better, in terms of overall performance, when the dimension valuelies around 1000. Notice that we wrote a paper, based on this dissertation, whichcontains more interesting experiments, such an (1+ε) guarantee, as well as someimprovements to the performance of kd-GeRaF.

1.4 Structure of dissertation

The sections are organized in the following way:

• Why a kd-GeRaF?In the 2nd chapter we present why kd-GeRaF is interesting.

• How to build a kd-GeRaF?In the 3rd chapter we present every step of the building process.

• Technical remarks on kd-GeRaF

In the 4th chapter we present technical details regarding the kd-GeRaF.

• How to search into a kd-GeRaF?In the 5th chapter we present every step of the search process.

• Conclusion and future workIn the last chapter we conclude and mention the future steps of kd-GeRaF.

Georgios Samaras 23


Chapter 2

Why a kd-GeRaF?

The purpose of this chapter is to present the main problem that comes with apriority search in a KD tree and why we turn our interest towards the kd-GeRaF

approach.

2.1 Outline of KD tree search

2.1.1 Building process

When building the KD tree, we halve the data set at every level, by splittingat the median, at a chosen dimension. There are many ways to pick the splittingdimension. The simplest one is to pick all the dimensions in a cyclical order; thatis picking dimension x while we are in the root level, then dimension y in thesecond level and so on, until we reach the final dimension (if we do), where westart again from dimension x.

By recursively splitting the data set, we end up with a fully balanced binarytree, of height logN , where N is the number of points in the data set. It is imple-

Georgios Samaras 24


mentation dependent whether the leaves of the tree contain one or more points.The more points a leaf of the tree contains, the less the height of the tree will be,since the recursion will stop earlier, than in the case of having one point per leaf,which is a good approach when N is big, because in the searching algorithm, wedo not want the height of the tree to be too big, in order to be able to performfast descents to the leaves of the tree.

2.1.2 Search process

Given a query point q ∈ RD, where D is the dimension of the data set, wedescent the tree to a leaf. The descent requires only one comparison at every level,in order to decide which branch to follow. Note that this leaf is the one (and it isunique) that the query point q would be put into, if it was a point of the data set.Moreover notice that every node in the tree corresponds to a cell in RD, as shownin figure 2.1. The leaf we found are selves now contains the first candidate for theNN of q.

It is quite common however, that the actual NN of q lies in another leaf, differ-ent from the one we have just descented. For that reason we have to check othercells, that their distance is less than the one between q and the first NN candidate.Notice that we do not know whether a cell that has to be visited, contains a bettercandidate or not, we find out only by the time we visit it and examine the point(s)that lie into it. This process is called backtracking, a process in which we searchother cells for better candidates.

We perform backtracking by priority search [7], in which the cells are searchedin order of their distance from the query point q. This may be accomplished

Georgios Samaras 25


efficiently using a priority tree for ordering the cells; see figure 2.1 in which thecells are numbered in the order of their distance from the query vector. The searchterminates when there are no more cells within the distance defined by the bestpoint found so far. Note that the priority queue is a dynamic structure that isbuilt while the tree is being searched.

Figure 2.1: [2]: Priority search.The points of the data set are represented as black dots that lie inside cells in

RD, which are created by the KD tree. Given q, we descent to the cell labeled 1,where we find our first NN candidate. Then we need to examine only the cellsthat are located closer to q, than the first candidate, i.e., in increasing distancefrom q, cells 2 to 5. After visiting these cells, we are sure that the exact NN is

the point that lies into cell 3, which we visited via backtracking.

Georgios Samaras 26


2.1.3 Why Approximate Nearest Neighbour Search (ANNS)?

In order to find the exact NN in high dimensions, we might have to do a lotof backtracking, thus visit a large number of nodes and leaves. For that reason,existed solutions for low dimensions are as fast as the brute force search. This iswhere Approximate Nearest Neighbour search comes into play, in order to allowus to overcome this difficulty, in the expense of finding the exact NN.

The point returned by the algorithm as the approximate NN, may be the ex-act NN with some acceptable probability, that depends on the number of cellssearched. This happens because we set a maximum number of cells to be visited.If we reach this threshold value, we terminate the search and return the best can-didate found so far.

As a result, the more cells we visit, the more likely is to find the exact NNand have a more time consuming execution of our search algorithm. Of course, asmentioned in section 1.1, the approximation factor ε can also play a key role onwhen the algorithm will stop.

2.2 Why a kd-GeRaF instead of a KD tree?

2.2.1 Diminshed returns

As we saw above, priority search with Approximate Nearest Neighbour Searchseems a good approach, so why bother using an RKD tree over a KD tree andmoreover, consume more memory?

The problem lies in the phrase of section 2.1.3, ”the more cells we visit, the

Georgios Samaras 27


more likely is to find the exact NN”. One would hope to almost maximize theprobability of finding the exact Nearest Neighbour, as (s)he would increase thenumber of cells visited. However, this is not the case after a threshold value of thenumber of cells we visit, as shown in figure 2.2.

Figure 2.2: [2]: Diminished returns.The trade-off between accuracy and speed here is not fair enough for us, after a

threshold value of cells visited.

The core of this problematic, for us, behaviour, lies at the backtracking process,which we described in section 2.1.2. As this process progresses, we are visiting cellsthat are further and further than the original cell that the query point drived usinto. In other words, every visit we perform is not independent from the previousones, a fact that leads to the diminished returns described in section 2.2.1.

Georgios Samaras 28


2.2.2 How to avoid diminshed returns?

We could rephrase the question of this subsection as such: How to make thesame effort (i.e. to visit the same number of cells) and get better results than theones described in section 2.2.1 (diminished returns).

In order to answer this question, we do the following:

1. We create m different KD-trees, each with a different structure in such a waythat searches in the different trees will be (largely) independent.

2. With a limit of n nodes to be searched, we break the search into simultaneoussearches among all the m trees. On the average, n

m nodes will be searched ineach of the trees.

That way, every search attempt is not going to go as far from the cell the querypoint lies into, as it would go with one KD tree. By following this approach, theproblem that arises is that the our multiple now searches, return the same results,since every KD tree is the same.

In order to solve this problem, we have to build every tree in a different waythan every other tree in our forest. Notice that the more different every tree is,the better our search performance will be, since this means that we minimize theprobability of checking a point of the data set two or more times.

2.2.3 How to build different trees?

By adding randomization! In our approach, we investigated three randomiza-tion factors, which we combined or used alone and experimented.

1. Randomly rotate the point set, before the building process of a tree begins.

Georgios Samaras 29


2. Randomly choose a cutting dimension.

3. Add a random factor to the cutting value.

By applying randomization to the building of the tree, we increase the probabilityof having different trees, thus better results from our search attempts. Before clos-ing this chapter, we are going to explain the three randomization factors presentedabove and why they are interesting.

2.2.4 Randomly rotate the point set

In our attempt of doing independent multiple searches, we create multiple KDtrees with different orientations [2]. That way, our building process is going todeal with a different point set per tree. In other words, every tree would have theillusion that it handles a different point set. Of course, every point set is just arotated version of the original point set.

Suppose we have a data set X = {xi}. Creating KD trees with different ori-entations simply means creating KD trees from rotated data Rxi, where R is arotation matrix. A regular KD tree is one created without any rotation, i.e. R =I. By rotating the data set, the resulting KD tree has a different structure andcovers a different set of dimensions compared with the regular tree.

The searching algorithm does not need to be changed, since we are just goingto use as a query point the rotated query point, i.e. Rq, where q is the originalquery point, since dist(q, xi) = dist(Rq,Rxi). Notice that the rotated point setdoes not need to be stored, since the rotation matrix and the original point setmaintains all the information we need.

Georgios Samaras 30


2.2.5 Randomly choose a cutting dimension

Another method for ending up with different trees in our forest, is to performthe split at a random dimension at every step. Recall the building process of a KDtree, presented in section 2.1.1, where at every step of the recursion, we would halvethe point set, by cutting at the median, at a chosen dimension (in a cyclical order).

Instead of this deterministic approach, we compute the variance at every di-mension and we take into account, for splitting, only the t dimensions, that havethe greatest variances. Then, every time we need to pick a dimension to performa split, we choose randomly amongst these t dimensions. As a result, the chancethat the resulted trees are different is really high.

Notice that after following this approach, we no longer build several KD trees,but Randomly oriented KD (RKD) trees, which we like to call an RKD forest.Moreover, take into account that the next subsection is directly connected to anRKD tree.

2.2.6 Add a random factor to the cutting value

One more method for ending up with different trees in our forest, is to add arandom factor to the cutting value [17]. When we perform the actions describedin the previous section, we end up with a cutting value, which is the median inthe selected dimension. In order to icnrease the probability of having completelydifferent trees in our forest, we apply this method too [4].

So, we take that cutting value mentioned in the previous paragraph, and weadd a factor δ, which is randomly selected (in a uniform way) in [−−3∆√

D, −3∆√

D],

Georgios Samaras 31


where ∆ is the diameter of the current point set, since δ is computed at every stepof the recursion that takes place during the building process.

Figure 2.3: A nice case for adding δOn the left of this figure we have an RKD tree and on the right another one.

Notice that the same query point (red dot) lies into a different cell in every tree,because we added δ to the corresponding cutting value, when building the trees.

Georgios Samaras 32


Chapter 3

How to build a kd-GeRaF?

The purpose of this chapter is to analyze the steps we have to make towardsbuilding a kd-GeRaF, trying to optimize every step of the process, in order toproduce state of the art software, since the methods presented below take fulladvantage of the latest version of C++, i.e. C++11.

3.1 Building process

We build a number (input parameter) of trees, where every tree is build withalgorithm 1. Algorithm 2 is a more detailed version of algorithm 1.

The bottleneck of the algorithm is finding the median, when the number ofpoints is much bigger than the number of dimensions. Otherwise, computing thevariance is the bottleneck, even though the procedure runs only once. Recall thatour goal is to build a forest with different trees, as described in section 2.2.3. Belowwe are going to analyze every step of the algorithm. However, as one may havenoticed already, the δ factor described in 2.2.6 is omitted and it will be discussedlater.

Georgios Samaras 33


Algorithm 1 Build RKD treeprocedure RKD(t). t is the number of dimensions used for building the tree

Compute variance in every dimensionFind the t coordinates that have the maximum variancesAt every level do:

Select randomly among the t coordinatesFind median on the selected coordinateSplit at the median

end procedure

Figure 3.1: An RKD treeOne RKD tree of 64 points in 8D. Max points per leaf is set to 4. A node

contains the splitting dimension and the splitting value, while a leaf contains theindex of a point. The other trees that would be built, would have (some)

different cuts, thus the leaves would store (some) different points.

Georgios Samaras 34


Algorithm 2: kd GeRaF buildinginput : pointset X, #trees m, #split-dimensions t, max #points per leaf poutput: randomized kd forest F

1 begin2 V ← 〈variance of X in every dimension〉3 D ← 〈t dimensions of maximum variance V 〉4 F ← ∅ . forest5 for i← 1 to m do6 f ← 〈random transformation〉 . isometry, shuffling7 F ← F ∪ (f,build(f(X))) . build on transformed X, store f8 return F

9 function build(X) . recursively build tree (node/leaf)10 if |X| ≤ p then . termination reached11 return leaf (X)

12 else . split points and recurse13 c← 〈one of dimensions D at random〉14 v ← 〈median of X on coordinate c〉15 (L,R)← 〈split of X on coordinate c at value v〉16 return node(c, v,build(L),build(R)) . build children on L,R

3.2 Rotation

The goal is to randomly rotate the dataset before building every tree, so thatevery tree will handle a different dataset (due to the rotation). We do this inorder to make every tree as different as possible from the others. Also, rotatingthe dataset can make us avoid bad cases of datasets for our tree. However, as D.Nicolopoulos stated in his final Master thesis, one rotation for all the trees can beenough to avoid the bad cases.

Rotating the dataset for every tree can take significantly more time for the

Georgios Samaras 35


building process, which would be hard to pay off during the search procedure. Atsection 4.4, we compare the building times with and without rotation.

3.2.1 Deterministic rotation

Before going to the random rotation, we thought that it would be nice to use adeterministic rotation. For this, we used Euler rotation, which allows us to selecttwo dimensions and rotate by an angle. For simplicity, an angle of 45 degrees isused.

It is not obvious which dimensions should be used for rotation, but a pair ofhigh and low variance rotation may produce good trees.

The main loop for this kind of operation is

Algorithm 3: Euler rotationinput : vector src, rotation angle θ, d1, d2

. d1 and d2 are the dimensions used for rotationoutput: vector dest

1 begin2 if v.size() < 2 then return return 0;3 µ← 0; s← 04 for i← 0 to src.size() do5 for p← 0 to no_planes do6 dest[i][d1]← cos_theta ∗ src[i][d1]− sin_theta ∗ src[i][d2];7 dest[i][d2]← sin_theta ∗ src[i][d1] + cos_theta ∗ src[i][d2];

Georgios Samaras 36


3.2.2 Rotation using Householder matrices

3.2.2.1 Definition

A Householder matrix P is defined as:

P = I − 2 · (v · vH)

For every value of the matrix the following holds true:

Pij =

1− 2 · vi · vi if i = j

−2 · vi · vi otherwise

Consider some point

x =

x1

x2

...

xn

To apply the Householder transformation matrix P, we calculate P · x. Theresult of this is the following:

y = P · x =

x1 − 2 · v1 · v1 · x1 − 2 · v1 · v2 · x2 + ..− 2 · v1 · vn · xnx2 − 2 · v2 · v1 · x1 − 2 · v2 · v2 · x2 + ..− 2 · v2 · vn · xn

...

xn − 2 · vn · v1 · x1 − 2 · vn · v2 · x2 + ..− 2 · vn · vn · xn

If we have precalculated P, then calculating y is O(N 2).

Georgios Samaras 37


3.2.2.2 A more efficient method

According to vector calculus, a hyperplane that passes through the origin isdefined by its normal vector, i.e. a vector perpendicular to the surface. By con-vention, the normal points to the positive half-space.

We define v as the unit normal vector describing the Householder reflectionhyperplane. Consider some point x and an unit normal vector v:

x =

x1

x2

...

xn

, v =

v1

v2

...

vn

By vector calculus we have that dot product

v · x = x1 · v1 + x2 · v2 + ...+ vn · xn

is the signed distance from point x to the hyperplane v, if v is the unit normalvector of the hyperplane. Signed distance is positive in the halfspace where theunit normal points towards, and negative in the other halfspace.

In order to reflect point x about the hyperplane, we simply subtract twice thesigned distance to the hyperplane along the hyperplane surface normal. In otherwords, using basic vector calculus:

y = x− 2 · v(v · x)

The above is the general formula for reflection about a plane, according to ba-sic vector calculus. The parenthesized part is the dot product of the two original

Georgios Samaras 38


vectors, and thus scalar.

Reflection

If we plug in the values, the components of y are:

y1 = x1 − 2 · v1 · (v1 · x1 + v2 · x2 + v3 · x3 + ...+ vn · xn =

x1 − 2 · v1 · v1 · x1 − 2 · v1 · v2 · x2 − 2 · v1 · v3 · x3 + ...+ 2 · v1 · vn · xn

y2 = x2 − 2 · v2 · (v1 · x1 + v2 · x2 + v3 · x3 + ...+ vn · xn =

x2 − 2 · v2 · v1 · x1 − 2 · v2 · v2 · x2 − 2 · v2 · v3 · x3 + ...+ 2 · v2 · vn · xn

. . .

y2 = xn − 2 · vn · (v1 · x1 + v2 · x2 + v3 · x3 + ...+ vn · xn =

xn − 2 · vn · v1 · x1 − 2 · vn · v2 · x2 − 2 · vn · v3 · x3 + ...+ 2 · vn · vn · xn

Georgios Samaras 39


The vector y is exactly the same as in the matrix transformation case! Thismeans that:

P · x = (I − 2(v · vH))x = x− 2 · v(v · x)

where v · vH is matrix by matrix multiplication, yielding a N × N matrix, and(v · x) is vector dot product, yielding a scalar.

Since calculating

y = x− 2 · v(v · x)

is obviously an O(N) operation, it means that we now have an O(N) algorithm totransform a point using the Householder transformation, which is better than theO(N 2) method described in the previous section and provides us the same results.

3.3 Computing variances

We use the incremental algorithm of Knuth [8], which is numerically stableand performs one pass to compute mean and variance. The time complexity ofthis algorithm is O(N).

We need to compute the variances of all dimensions, so Knuth’s algorihm willrun in a loop. A critical improvement to the algorithm was that we moved divisionin the outer loop and replace it with a multiplication. Below we present Knuth’salgorithm:

Georgios Samaras 40


Algorithm 4: Knuth’s online variance algorithminput : real vector voutput: variance of v

1 begin2 if v.size() < 2 then return return 0;3 µ← 0; s← 04 for n← 1 to v.size() do5 δ ← x− µ6 µ← µ+ δ/n7 s← s+ δ(x− µ)

8 return s/(n− 1)

3.4 How to find the t max elements?

As we mentioned in 2.2.5, we are interested in the t dimensions with the highestvariances. In order to do this, we use a min-heap, with complexity O(t+ (n− t) ∗logt). This can reduce to O((n− t) ∗ logt), since the O(t) is an intialization part,which may be skipped by using iterators. However, t is small and this operationwill run only once, thus reducing the complexity here won’t have any impact tothe overall project.

The idea is to find the k-th largest element using a min-heap and then iterateover the array and collect all the elements that are greater or equal to this element.That way, we obtain the k maximum elements in the array.

The first step costs O(t), while the second and the third cost O((n− t)∗ logk).We compared this method with std::priority_queue of the standard library ofC++ and our heap was faster for this particular case.

Georgios Samaras 41


Algorithm 5: Find_t_max with min-heapinput : vector v, t requested max elementsoutput: the t max elements

1 begin. Create a min_heap of the t first elements in the array

2 for i← t+ 1 to a.size() do3 if a[i] > min_heap.root then4 min_heap.root← a[i]5 heapify(min_heap)

. Finally, min_heap has the t largest elements and root of themin_heap is the t-th largest element.

3.5 Where to split?

The idea is to split the dataset in half at every step. FLANN (Fast Libraryfor Approximate Nearest Neighbors) library splits at the mean. The final Masterthesis of Nicolopoulos suggests splitting at the mean, by taking only a 10% sampleinto account. In FLANN’s studies, they assume that the tree is balanced, thus theyassume that splitting at the mean will halve the point set. Obviously, whether thesplitting in mean, rather than median, is a good choice, depends heavily on thedataset.

At this point we should consider how a bad splitting decision would affect ourtree. Bad splitting decisions mean that we will receive an unbalanced tree, whichwill affect badly the search procedure and the method that stores the tree into ar-rays, since unbalanced trees will lead to waste of space, as mentioned in section 4.1.

Moreover, note that we can not perform rotations of the tree, in order to obtaina balanced tree, like the case of an AVL tree. The reason behind this, is that if

Georgios Samaras 42


we rotate the tree, we will break the invariance of our tree, which is that everyelement that is less of the splitting point, lies on the left subtree and likewise forthe right subtree.

Splitting at the median does guarantee that we actually halve the dataset. Asa result, we split on the median. We randomly select a dimension of the oneswith the highest variance. Random selection is done with std::mt19937 (Mersennenumbers, provided by the standard library of C++), in a uniform distribution.

3.6 Find the median

For this operation, we use Quickselect algorithm, which has O(n) time com-plexity for best and average cases, where the complexity for worst case scenario isO(n2). However, as with the case of quicksort (developed by Tony Hoare too), inpractice the complexity is the one of the average case.

Algorithm 6 Quickselectprocedure Quickselect(v) . v is a vector of numbers

Choose a pivotPartition the data in two, based on the pivotRecurse in the partition with the median

end procedure

In the algorithm above, vector v is re-arranged so that every element that is lessthan the median, lies left of it and all the other elements lie right of the median.v will be partially sorted after the execution of Quickselect. Time measurementshave shown that Median of medians algorithm is not faster.

Georgios Samaras 43


Chapter 4

Technical remarks onkd-GeRaF

Purpose of this chapter is to present how kd-GeRaF can take advantage ofcaching and parallelization, since not only we implemented all the above, but wetried to optimized every step of the process or/and come up with new ideas thatimproved performance.

Georgios Samaras 44


Figure 4.1: The kd-GeRaF projectThe main page of the documentation of the ancestor of kd-GeRaF. kd-GeRaF’s

documentation soon to be created.

The project is hosted in github: kd-GeRaF (https://github.com/gsamaras/

kd_GeRaF).

4.1 How to store the tree

We store the tree in arrays, one for the inner nodes and one for the leaves. If anode has index i, then the left child lies in the 2i+1 cell and the right one, in the2i + 2 cell. The parent (if any) is found at the b i−12 c, assuming that the root liesat index 0.

Georgios Samaras 45

https://github.com/gsamaras/kd_GeRaF




This method benefits from more compact storage and better locality of refer-ence, particularly during a preorder traversal. However, it is expensive to growand wastes space proportional to 2h − n for a tree of depth h with n nodes.

Tree stored in array

In our case, the tree is perfect, thus:

1. no waste of space

2. no need to build (and then discard) the actual tree and then put it in thearray

3. we know the size of the tree before starting the building process, thus nore-allocations needed

The goal of this method is to exploit caching, which will provide fast lookups.The reason is that all nodes of some level are consecutive in the array. Note thatleaves contain more than one points, because, for large datasets, we don’t wantthe depth of the tree to be very big.

Another aspect to take into account for large datasets, is that the memorycomplexity shall grow as the number of trees grow. For this reason, every leaf ofevery tree indexes the points that it contains, which is far less costly in terms ofmemory usage for large datasets in high dimensions.

Georgios Samaras 46


4.2 Exploit caching

Since we deal with high dimensions and datasets that may contain many points,exploiting caching is something that helped us improve the performance of ourproject. As mentioned in section 2.2.5, we need to compute the variance of everypoint, in all dimensions. That means, that if we want to compute the variance ofthe 1st coordinate for example, then we should access the first coordinate of thefirst point, then the first coordinate of the second point and so on.

4.2.1 t_dims

Our first approach was to use a 2D std::vector, provided by the STL of C++.This means that we may have to deal with several data misses in the cache, sincethe coordinates of every point would be guaranteed to be consecutive in memory,but this does not hold true for the points. Since we explore high dimensions, it islikely that the OS will not place the points in consecutive memory cells.

As a result, we decided to gather the t dimensions with the max variances inan array of std::vectors, so that when we would like to compute the variance in aspecific dimension, then all the coordinates of all the points, would lie in consec-utive memory cells. This array of vectors would be discarded after the buildingprocess.

Below, we present figure 4.2, where red is for the memory we access and greenfor the memory that we really need, when the splitting dimension is 1. Our mea-surements showed that this method gives 1 second speedup, when the number ofpoints is big enough (in our measurements 100.000). Otherwise, it doesn’t makeany difference.

Georgios Samaras 47


(i) Initial approach of fetching dataThe first approach, which pollutes ourcache with unnecessary memory cells.

(ii) Gathering the t dimensionsWe do not to loop over anything, red

vector contains what we need.

Figure 4.2: Memory access for computing variance

4.2.2 Convert 2D to 1D

Section 4.2.1 made clear that we are not sure whether every coordinate lies inconsecutive memory cells. As a result, converting our 2D vector into an 1D, wouldreasure us for this property, which would offer great locality, thus faster executionof our program.

The trade-off is that we have to pay a multiplication for every index we perform(but one load less). So if we want to access [i][j] element in a NxM matrix, afterthe conversion, we will do that as [i * M + j].

Georgios Samaras 48


4.2.3 Combination of these two methods

We combined sections 2.6.1 and 2.6.2, so that the gathering of the t dimensionswould run on an 1D vector.

4.2.4 Which method to use?

We compared these methods, with these parameters:

1. points_per_leaf = 16

2. trees_no = 5

3. split_dims = 3 // the t dimensions described in section 2.2.5

The methods yielded similar time measurements. We decided to use the onedesrcibed in section 4.2.2, i.e. converting the 2D vector to 1D, since it was a littlefaster. Below we present the building times for our forest.

shape N dim TimeKlein bottle 1.000 10.000 0.14Klein bottle 10.000 10.000 1.25Klein bottle 10.000 100 0.022Klein bottle 100.000 100 0.22

sphere 1.000 10.000 0.12sphere 10.000 10.000 1.34sphere 10.000 100 0.023sphere 100.000 100 0.23

Table 4.1: Approach: 1D, time in seconds.

Georgios Samaras 49


For the history, we present the rest of the time measurements.



Table 4.2: Approach: 1D+t_dims



Table 4.3: Approach: 2D+t_dims

Georgios Samaras 50


4.3 RKD vs KD tree in the building process

We build RKD trees and KD, in order to compare the building times. Webuild the KD tree in the same same way as the RKD, except for the part ofrandomization, instead of the approach described in section 2.1.1. The methodsare almost identical, but we present the building process for a KD tree (RKD’sbuild is algorithm 1):

Algorithm 7 Build KD treeprocedure KD

At every level do:Find coordinate with max varianceFind median on this coordinateSplit at the median

end procedure

We built 6 trees per forest. Below is the speedup of the RKD tree:

shape N dim Speedup of RKDKlein bottle 1.000 10.000 6.3Klein bottle 10.000 10.000 11Klein bottle 100.000 100 10.1

sphere 1.000 10.000 6.2sphere 10.000 10.000 10.3sphere 100.000 100 9.2

Table 4.4: RKD vs KD in the building process

Georgios Samaras 51


4.4 How much does rotation affect the buildingtime?

We did some experiments in order to have a clear view for how much the ro-tation process increases the building time. Notice, that as presented in table ??,kd-GeRaF has a really fast, in comparison to other relative softwares, buildingtime, so we can easily allow our selves to spend more time in the building process,if needed.

By needed, we mean that if the extra work we do in the building process hasan improvement in the search performance of kd-GeRaF. Our experiments showedthat rotating the point set or not, makes practically no difference. Our softwareallows the user to select if (s)he wishes rotation or not, since (s)he knows betterhis/her data set.

Georgios Samaras 52


shape N dim Rotation TimeKlein bottle 1.000 10.000 yes 0.13Klein bottle 1.000 10.000 no 0.029Klein bottle 10.000 10.000 yes 1.35Klein bottle 10.000 10.000 no 0.26Klein bottle 10.000 100 yes 0.025Klein bottle 10.000 100 no 0.011Klein bottle 100.000 100 yes 0.27Klein bottle 100.000 100 no 0.38

sphere 1.000 10.000 yes 0.12sphere 1.000 10.000 no 0.08sphere 10.000 10.000 yes 1.8spheree 10.000 10.000 no 0.28sphere 10.000 100 yes 0.026sphere 10.000 100 no 0.011sphere 100.000 100 yes 0.91sphere 100.000 100 no 0.11

Table 4.5: Building times with rotation on and off

4.5 Parallelization

kd-GeRaF allows parallelization, since every tree is independent from the others.Notice that we take advantage of the newest version of C++, i.e. C++11, whichsupports multithreading by default.

4.5.1 Building in parallel

In the third version of our software, we construct the forest in parallel. De-pending on the number of the trees we want to build, we can assign one thread

Georgios Samaras 53


per tree, or a number of trees per thread, keeping in mind that every thread hasthe same amount of work to do, except from the last one, which might have todo a bit more. As expected, parallelization gave us a significant speedup in thebuilding process.

data N dim Parallel TimeSIFT 10.000 128 no 0.015SIFT 10.000 128 yes 0.009SIFT 1.000.000 128 no 3.32SIFT 1.000.000 128 yes 1.48

Table 4.6: Building times with parallelization on and off

4.5.2 Searching in parallel

The searching process in a kd-GeRaF can also been performed in parallel, butwe have not implemented it yet. Two straightforward approaches are in progress;searching every tree independently and then combine the results, or search everytree and dynamically update the best candidate(s) found so far in the whole forest.

Searching the kd-GeRaF is explained in section ??, thus the above paragraphmay seem a bit obscure at the moment. Which of the two methods mentionedabove is better? We can not tell for sure and experimenting is needed, in order tobe sure for which approach to follow.

Searching every tree independently means that every thread runs without anyneed for communication, which means maximum speed. On the other hand, updat-

Georgios Samaras 54


ing the best candidate(s) found so far in the whole forest, require communication,thus possible blocking, but promise possible shorter searching process.

Georgios Samaras 55


Chapter 5

How to search into akd-GeRaF?

The purpose of this chapter is to analyze the steps we have to make in order tosearch into a kd-GeRaF, for k Nearest Neighbours, where k is an input parameter.Recall the description we gave in section 1.1.

We perform three kind of search operations:

1. Exact searching (if parameters allow so)

2. Exact radius searching (if parameters allow so)

3. Fast searching (pruning assistance)

5.1 Exact searching

This search approach can guarantee the return of the exact NN. This means,that it will explore as many leaves as it has too, in order to find the exact NN.

Georgios Samaras 56


In other words, this approach can be raletively slow, especially if the query is toofar away from the set of points, or equidistant from most or all the points of our set.

The Approximate Nearest Neighbour search has two parameters associatedwith it. In order to prevent searching the entire tree, we can limit the searchingexpanse. The parameters ε and max_leaves_check may be altered, affecting theaccuracy of the approximation. The greater the first parameter is, the greater thedistance from the true nearest neighbour. The value of ε limits backtracking. Thesecond parameter limits the number of leaves explored, thus the number of pointschecked is limited by this parameter’s value.

Algorithm 8: Find an exact NN of query qinput : node, query point qoutput: variance of v

1 begin2 if node is a leaf then3 Search all points, in node, update current_best_distance4 else5 node is internal6 if cut_coord(q) ≤ node.cut_value then7 NN(left_child)8 if cut_coor(q) + current_best_distance > node.cut_value then9 NN(right_child)

10 else11 cut_coord(q) > node.cut_value12 NN(right_child)13 if cut_coor(q)− current_best_distance ≤ node.cut_value then14 NN(left_child)

Georgios Samaras 57


When no approximation parameters are given, this algorithm should be runwith one tree only, because after searching the first tree, we already have the ex-act NN. Searching more than one tree would be a waste of time.

The number of trees plays a significant role in reducing backtracking, whenapproximate search is performed. Multiple trees are created during build time.Every tree is searched and its results are collected within a structure. The resultsare compared and the best are extracted. Due to the randomized processes in-troduced while building the trees (see section 2.2.3), each tree is different, thusproviding different points in the extraction process. Several good candidate pointsare collected after the traversal of each tree, thus limiting the backtracking neededin order to return the best approximate nearest neighbour, given the parametervalues.

5.2 Exact radius searching

This kind of search is similar to the exact search of a NN. The difference isthat we are not looking for the closest point to the query point, but given a radiusr, we are interested in all the points that lie within the given radius r, while thequery point q plays the role of the center.

The algorithm is very similar to the one presented above. Approximation pa-rameters can be given, as above. The parameters affect the backtracking procedure.Notice that radius searching can be relatively slow if many points lie withing thegiven radius and the query point. The number of trees play the same role as inthe Exact searching presented in section 5.1.

Georgios Samaras 58


5.3 Fast searching

This procedure offers faster execution, but not the same accuracy as the searchroutine presented in section 5.1. It offers approximation parameters, ε andmax_leaves_check,similar to the (Approximate) Exact Search. However, in this approach, we searchevery tree in a different way than in section 5.1.

1. We build a number of trees

2. We search every tree and collect the results, without backtracking

3. Then we visit a number of the unvisited best branches, without backtracking

Note that in step 3, the set of branches is updated, thus we search among the cur-rent best branches from all the trees, in every search we perform. Moreover, noticethat backtracking is indirectly happening in step 3. In algorithm 9, we provide amore detailed version of the algorithm:

The purpose of this approach is to effectively prune the trees we build and con-duct few and meaningful explorations. Another method of improving performance,is to keep an internal data structure, which will disallow us to check points thatare already taken into account. This costs us a little more memory, but makes usavoiding computing distances of points that have already been checked.

5.4 k-NN searching

All the procedures described above, can be modified in order to search for the knearest neighbours. We simply maintain a heap, which keeps track of the k current

Georgios Samaras 59


Algorithm 9: kd GeRaF buildinginput : pointset X, #trees m, #split-dimensions t, max #points per leaf poutput: randomized kd forest F

1 begin2 V ← 〈variance of X in every dimension〉3 D ← 〈t dimensions of maximum variance V 〉4 F ← ∅ . forest5 for i← 1 to m do6 f ← 〈random transformation〉 . isometry, shuffling7 F ← F ∪ (f,build(f(X))) . build on transformed X, store f8 return F

9 function build(X) . recursively build tree (node/leaf)10 if |X| ≤ p then . termination reached11 return leaf (X)

12 else . split points and recurse13 c← 〈one of dimensions D at random〉14 v ← 〈median of X on coordinate c〉15 (L,R)← 〈split of X on coordinate c at value v〉16 return node(c, v,build(L),build(R)) . build children on L,R

best points checked so far. At the end of the procedure, the heap will contain the kbest points. Approximation can be applied with the same logic as when we searchfor one NN.

Georgios Samaras 60


Chapter 6

Time measurements andcomparisons

The purpose of this chapter is to take time measurements of the software“kd-GeRaF” in order to observe its performance. Moreover, we compare it withmultiple RKD trees (which is basically the same as the term “RKD forest”), LSHandt the BBD tree.

6.1 Data and software used

We used the following four data categories:

1. SIFT

2. GIST

3. Klein bottle

4. Sphere

Georgios Samaras 61


6.1.1 SIFT

We use the SIFT descriptor (Scale invariant feature) [10], which is of particularinterest because it performs well compared with other types of image descriptorsin the same class [11]. A SIFT descriptor is a 128-dimensional vector normalisedto length one. It characterises a local image patch by capturing local gradientsinto a set of histograms which are then compacted into one descriptor vector. Ina typical application, a large number of SIFT descriptors extracted from one ormany images are stored in a database. A query usually involves finding the bestmatched descriptor vector(s) in the database to a SIFT descriptor extracted froma query image. A useful data-structure for finding nearest neighbour queries forimage descriptors is the KD-tree [9].

6.1.2 GIST

The GIST descriptor was initially proposed in [19]. The idea is to develop alow dimensional representation of the scene, which does not require any form ofsegmentation. The authors propose a set of perceptual dimensions (naturalness,openness, roughness, expansion, ruggedness) that represent the dominant spatialstructure of a scene. They show that these dimensions may be reliably estimatedusing spectral and coarsely localized information. The image is divided into a4-by-4 grid for which orientation histograms are extracted [?]. Note that the de-scriptor is similar in spirit to the local SIFT descriptor [10].

6.1.3 Klein bottle & sphere

We generate points on a Klein bottle embedded in D-dimensional Euclideanspace. We also use a noise factor, which is the amplitude of the gaussian noiseadded to each coordinate. We generate the sphere in a similar way.

Georgios Samaras 62


6.1.4 Software used

Below we list the software we used, in order to perform experiments, time mea-surements and comparisons. Every library below is used for Approximate NearestNeighbour Search.

1. kd-GeRaF, by the author of this dissertation, representing the k-d GeRaFapproach

2. FLANN (Fast Library for Approximate Nearest Neighbors), by MariusMuja and David G. Lowe, representing the multiple RKD trees approach

3. E2LSH (Exact Euclidean LSH), by Alex Andoni, representing the Locality-Sensitive Hashing (LSH) approach

4. ANN , by David M. Mount and Sunil Arya, representing the BBD tree

6.2 Time measurements

In this section we present the time measurements of the software mentioned insection 6.1.4.

Notice that the approach described in section 2.2.6 is not used in the presentedresults, since it was proven to be not helpful, since the accuracy dropped a lot morethan the speed. However, the approach itself might be good, but the kd-GeRaF

software was not initially designed for supporting this feature, which may lead tounbalanced trees, whereas the software intends to build balanced trees.

In the tables below, N is the number of points the data set has, while D isthe dimension of the data set. The search times are the average ones and they

Georgios Samaras 63


are measured in seconds (same goes for the build ones) . Moreover, we define asa miss to be the case where the point returned is not exact the Nearest Neigh-bour. Also notice that the graphs below are a more abstract view of the tables. Thestructure is that the graphs are presented first and then the tables, per experiment.

The task of finding the optimal parameters for each data set is rather hardand varies from algorithm to algorithm. For this reason most packages provide anautomatic method for choosing these parameters or/and default values for theseparameters. The only parameter that lacks a default value or an automatic methodfor estimating that parameter, is the parameter R of LSH, which is also the hard-est to estimate manually and probably the most critical for the LSH algorithm.

E2LSH solves a randomized version of the R-near neighbour problem, whichthey call a (R, 1−δ)-near neighbour problem. In that case, each point p satisfying‖p− q‖2 ≤ R, where q is the query point, has to be reported with a probabil-ity at least 1−δ (thus, δ is the probability that a near neighbour p is not reported).

Unless specified, the experiments bellow use the default/automatic values pro-vided by the packages we used. However, notice that LSH’s R parameter hadto be specified in all the experiments, which is a severe drawback in comparisonto the other methods. By optimal parameters we mean the parameters we foundmanually and observed experimentally that work well for a particular data set.We consider the time a package takes to estimate the parameters to be includedin the building process.

Georgios Samaras 64


6.2.1 Automatic configuration of kd-GeRaFt

As explained above, configuring the optimal parameters is a difficult task, sinceevery data set may require fine tuning of the parameters. For that reason, thekd-GeRaF project provides a routine that estimates the optimal parameters forthe particular data set. However, this parameters may differ from the exact opti-mal parameters.

This routine takes into account the size and the dimension of the point set,as well as the five greatest variances that the dimensions of the data set have.Notice that this step is performed by the building process, regardless of whetherthe project looks for the optimal parameters itslef or not, thus it will not slowdown at all the overall performance of the project.

In combination with the experience we have acquired by running numerousexperiments, we can estimate the optimal parameters. Notice that the more ourexperience grows, the more likely is for the project to estimate better the optimalparameters.

Georgios Samaras 65


Figure 6.1: The building times of BBD, LSH, FLANN and kd-GeRaFFor the case of N = 1.000.000, BBD runs out of memory and FLANN did not

terminate execution after at least 10 hours, for any ε.

Georgios Samaras 66


Figure 6.2: The search times of BBD, LSH, FLANN andkd-GeRaF for 10.000 points

Georgios Samaras 67


Figure 6.3: The search times of BBD, LSH, FLANN and kd-GeRaF for 1.000.000pointsBBD runs out of memory and FLANN did not terminate execution after at least

10 hours, for any epsilon. Moreover, LSH went out of memory for ε = 0 andε = 0.1.

Table 6.1: SIFT: BBD vs LSH vs FLANN vskd-GeRaF

Method N D ε Build miss% SearchBBD 0.47 0 0.0051LSH 7.44 0 0.0034

FLANN 69.7 27 0.000034Continued on next page

Georgios Samaras 68


Table 6.1 – Continued from previous pageMethod N D ε Build miss% Searchkd-GeRaF 10.000 128 0 0.03 0 0.00036

BBD 0.49 0 0.0047LSH 7.42 0 0.0012

FLANN 70.0 27 0.000034kd-GeRaF

10.000 128 0.1

0.03 5 0.00012

BBD 0.47 0 0.0036LSH 2.26 0 0.00099


10.000 128 0.5

0.03 12 0.00011

BBD 0.51 1 0.0027LSH 0.68 13 0.00028


10.000 128 0.9

0.03 15 0.000097

BBD mem - -LSH mem - -

FLANN time - -kd-GeRaF

1.000.000 128 0

79.1 1 0.019

BBD mem - -LSH mem - -


1.000.000 128 0.1

82.4 4 0.012

BBD mem - -LSH 170 1.34 0.12

FLANN time - -Continued on next page

Georgios Samaras 69


Table 6.1 – Continued from previous pageMethod N D ε Build miss% Searchkd-GeRaF 1.000.000 128 0.5 84.8 8 0.009

BBD mem - -LSH 145 8.61 0.035


1.000.000 128 0.9

89.9 22 0.003

Figure 6.4: The building times of BBD, LSH, FLANN and kd-GeRaFBBD runs out of memory and FLANN did not terminate execution after at least

10 hours, for any ε.

Georgios Samaras 70


Figure 6.5: The search times of BBD, LSH, FLANN and kd-GeRaF for 100.000pointsBBD runs out of memory and FLANN did not terminate execution after at least

10 hours, for any epsilon.

Table 6.2: GIST: BBD vs LSH vs FLANN vs kd-GeRaF

Method N D ε Build miss% SearchBBD mem - -LSH 3.67 0.5 0.24

FLANN time - -Continued on next page

Georgios Samaras 71


Table 6.2 – Continued from previous pageMethod N D ε Build miss% Searchkd-GeRaF 100.000 960 0.0 80.1 0.8 0.057

BBD mem - -LSH 1.81 0.5 0.21


100.000 960 0.1109.7 1 0.025

BBD mem - -LSH 1.91 0.5 0.18


100.000 960 0.5

117.6 2 0.024

BBD mem - -LSH 1.57 0.5 0.21


100.000 960 0.9

108.9 14 0.017

Georgios Samaras 72


Figure 6.6: The search times of BBD, LSH, FLANN and kd-GeRaF for 1.000.000pointsBBD runs out of memory and FLANN did not terminate execution after at least

10 hours, for any epsilon.

Table 6.3: GIST, N = 106: BBD vs LSH vs FLANN vs kd-GeRaF

Method N D ε miss% SearchBBD mem -LSH 100 1.3

FLANN time -Continued on next page

Georgios Samaras 73


Table 6.3 – Continued from previous pageMethod N D ε miss% Searchkd-GeRaF 1.000.000 960 0.0 3.6 0.25

BBD mem -LSH 100 1.2

FLANN time -kd-GeRaF

1.000.000 960 0.111.5 0.11



1.000.000 960 0.5

13.6 0.11



1.000.000 960 0.9

22.5 0.06

Georgios Samaras 74


Figure 6.7: The building times of BBD, LSH, FLANN and kd-GeRaF for D =10.000

Georgios Samaras 75


Figure 6.8: The building times of BBD, LSH, FLANN and kd-GeRaF for D = 100

Georgios Samaras 76


Figure 6.9: The search times of BBD, LSH, FLANN and kd-GeRaF in 10.000dimensions

Georgios Samaras 77


Figure 6.10: The search times of BBD, LSH, FLANN and kd-GeRaF for 10.000points in 100 dimensions

Georgios Samaras 78



As one can see, the miss rate is maximized in certain cases. This happens be-cause the queries in the Sphere data sets are equidistant from the points. Thedistance of the exact NN to the query and the distance of the NN an algorithmhas returned to the query differ in the first, the second or the third decimal digit.

That means that the algorithms find a really good candidate, but not the exact.For example, LSH returned a point that has distance 32.163 from the query point,where the exact NN has distance 32.162. In some applications the really good

Georgios Samaras 79


candidates the algorithms return are enough to carry on, while there are someothers that strictly need to find the exact NN.

Table 6.4: Sphere: BBD vs LSH vs FLANN vs kd-GeRaF

Method N D ε Build miss% SearchBBD 1.25 0 0.0091LSH 0.21 45 0.016


1000 10.000 0.0

0.06 0 0.00040

BBD 1.26 100 0.00021LSH 0.24 45 0.02


1000 10.000 0.1

0.06 24 0.00020

BBD 1.30 100 0.00022LSH 0.71 45 0.017


1000 10.000 0.5

0.06 100 0.00021

BBD 1.25 100 0.00020LSH 0.17 45 0.016


1000 10.000 0.9

0.034 24 0.00010

BBD 0.14 0 0.0014LSH 0.01 100 0.000072


10.000 100 0.0

0.09 0 0.0020

BBD 0.14 99 0.000002Continued on next page

Georgios Samaras 80


Table 6.4 – Continued from previous pageMethod N D ε Build miss% Search

LSH 0.1 100 0.000024FLANN 44.6 52 0.000012

kd-GeRaF

10.000 100 0.1

0.11 6 0.0016

BBD 0.14 99 0.000002LSH 0.002 100 0.000017


10.000 100 0.5

0.14 12 0.0015

BBD 0.14 99 0.000002LSH 0.002 100 0.000002


10.000 100 0.9

0.09 1 0.0013

BBD 2.73 1 0.012LSH 1.2 2 0.028

FLANN 100.8 100 0.000021kd-GeRaF

100.000 100 0.0

26.6 2 0.0039

BBD 2.83 100 0.000024LSH 0.4 2 0.024

FLANN 103.7 100 0.000021kd-GeRaF

100.000 100 0.1

24.2 26 0.0029

BBD 2.65 100 0.000028LSH 0.39 2 0.022

FLANN 101.5 100 0.000020kd-GeRaF

100.000 100 0.5

24.4 38 0.0058


Georgios Samaras 81



LSH 0.27 2 0.022FLANN 104.6 100 0.000021

kd-GeRaF

100.000 100 0.9

26.3 24 0.0028

Figure 6.12: The building times of BBD, LSH, FLANN and kd-GeRaF for D =10.000

Georgios Samaras 82


Figure 6.13: The building times of BBD, LSH, FLANN and kd-GeRaF for D =100

FLANN did not terminate execution after at least 10 hours, for any ε.

Georgios Samaras 83


Figure 6.14: The search times of BBD, LSH, FLANN and kd-GeRaF in 10.000dimensions

Georgios Samaras 84




Georgios Samaras 85




As one can see, the miss rate is maximized in certain cases. This happens forthe same reason with the Sphere experiments.

Table 6.5: Klein bottle: BBD vs LSH vs FLANN vs kd-GeRaF

Method N D ε Build miss% SearchBBD 1.49 1 0.0047

Continued on next page

Georgios Samaras 86



LSH 0.21 37 0.016FLANN 20.5 100 0.00034

kd-GeRaF

1000 10.000 0.0

0.058 3 0.00056

BBD 1.49 1 0.00027LSH 0.16 37 0.017


1000 10.000 0.1

0.06 2 0.00052

BBD 1.46 1 0.00072LSH 0.19 37 0.017


1000 10.000 0.5

0.058 100 0.00026

BBD 1.33 1 0.00027LSH 0.16 37 0.016


1000 10.000 0.9

0.06 100 0.00013

BBD 0.13 0 0.0004LSH 0.11 1 0.0027


10.000 100 0.0

0.06 2 0.00010

BBD 0.14 59 0.000043LSH 0.07 1 0.0024


10.000 100 0.1

0.06 3 0.000083


Georgios Samaras 87



LSH 0.03 20 0.0019FLANN time - -

kd-GeRaF

10.000 100 0.5

0.06 3 0.000083

BBD 0.14 59 0.000052LSH 0.05 63 0.00085


10.000 100 0.9

0.08 5 0.000070

BBD 2.15 0 0.0027LSH 0.99 100 0.0099


100.000 100 0.0

20.6 3 0.0031

BBD 2.10 74 0.00021LSH 0.54 100 0.0049


100.000 100 0.1

21.6 1 0.0032

BBD 2.04 74 0.00014LSH 0.99 100 0.0091


100.000 100 0.5

21.9 3 0.0079

BBD 2.01 74 0.00013LSH 0.54 100 0.0049


100.000 100 0.9

27.6 4 0.0080

Georgios Samaras 88


Figure 6.17: The building times of kd-GeRaF, BBD, LSH and FLANN

Georgios Samaras 89


Figure 6.18: The search times of kd-GeRaF, BBD, LSH and FLANN for 10.000pointsNotice that the increment of execution time for LSH in the case of ε = 0.1 doesnot depend on the algorithm, but on how the OS scheduled the execution at thetime. The performace of LSH for ε = 0 and ε = 0.1 should be considered the

same in this case.

Georgios Samaras 90


Figure 6.19: The search times of kd-GeRaF, BBD, LSH and FLANN for 1.000.000points

Notice that LSH runs out of memory in the case of ε = 0.

Table 6.6: kd-GeRaF vs BBD, optimal parameters

Method N D ε Build Speedup miss% Search Speedupkd-GeRaF 0.02 0 0.0006

BBD10.000 128 0

0.248.3

0 0.00193.2

kd-GeRaF 0.02 0 0.0006BBD

10.000 128 0.10.26

130 0.0018

3

kd-GeRaF 0.02 1 0.0005Continued on next page

Georgios Samaras 91


Table 6.6 – Continued from previous pageMethod N D ε Build Speedup miss% Search SpeedupBBD 10.000 128 0.5 0.25 12.5 0 0.0016 3.2

kd-GeRaF 0.07 9 0.00018BBD

10.000 128 0.90.26

3.70 0.0014

7.8

kd-GeRaF 62.6 3.2 0.03BBD

1.000.000 128 0119.6

1.90.66 0.22

7.3

kd-GeRaF 62.6 3.6 0.03BBD

1.000.000 128 0.1120.7

1.90.66 0.22

7.3

kd-GeRaF 62.5 5.2 0.024BBD

1.000.000 128 0.5110.5

1.70.66 0.15

6.25

kd-GeRaF 62 6.5 0.02BBD

1.000.000 128 0.9131.8

2.10.74 0.11

5.5

Table 6.7: kd-GeRaF vs LSH, optimal parameters

Method N D ε Build Speedup miss% Search Speedupkd-GeRaF 0.02 0 0.0006

LSH10.000 128 0

7.44372

0 0.00345.6

kd-GeRaF 0.02 0 0.0006LSH 10.000 128 0.1 6.44 322 0 0.0039 6.5

kd-GeRaF 0.02 1 0.0005LSH

10.000 128 0.52.26

1130 0.00099

1.98

kd-GeRaF 0.07 9 0.00018LSH

10.000 128 0.90.68

9.70 0.00028

1.6

Continued on next page

Georgios Samaras 92


Table 6.7 – Continued from previous pageMethod N D ε Build Speedup miss% Search Speedup

kd-GeRaF 62.6 3.2 0.03LSH

1.000.000 128 0- Out of Mem - -

-

kd-GeRaF 62.6 3.6 0.03LSH

1.000.000 128 0.1196.4

3.130.74 0.31

10.3

kd-GeRaF 62.5 5.2 0.024LSH

1.000.000 128 0.5170.1

2.721.34 0.12

5

kd-GeRaF 62 6.5 0.02LSH

1.000.000 128 0.9145.5

2.38.61 0.03

1.5

Table 6.8: kd-GeRaF vs FLANN, optimal parameters

Method N D ε Build Speedup miss% Search Speedupkd-GeRaF 0.02 0 0.0006FLANN

10.000 128 00.11

5.50 0.0007

1.16

kd-GeRaF 0.02 0 0.00062FLANN

10.000 128 0.10.11

5.50 0.00063

1.01

kd-GeRaF 0.02 1 0.00051FLANN

10.000 128 0.50.11

5.51 0.00056

1.09

kd-GeRaF 0.07 9 0.00018FLANN 10.000 128 0.9 0.11 1.6 1 0.00044 2.4

kd-GeRaF 62.6 3.2 0.034FLANN

1.000.000 128 014.1

0.20.76 0.084

2.47

kd-GeRaF 62.6 3.6 0.031Continued on next page

Georgios Samaras 93


Table 6.8 – Continued from previous pageMethod N D ε Build Speedup miss% Search SpeedupFLANN 1.000.000 128 0.1 21.4 0.34 0.81 0.062 2

kd-GeRaF 62.5 5.2 0.024FLANN

1.000.000 128 0.514.7

0.230.82 0.05

2.08

kd-GeRaF 62 6.5 0.02FLANN

1.000.000 128 0.915.1

0.240.85 0.046

2.3

Georgios Samaras 94


Figure 6.20: The speedup over brute force, of kd-GeRaF and FLANN, for N =1.000, D = 10.000The rest of the approximation parameters probably make FLANN to increase its

execution time in the case of ε = 0.5, in order to keep the miss rate in anacceptable level (see table below).

Georgios Samaras 95


Figure 6.21: The speedup over brute force, of kd-GeRaF and FLANN, for N =10.000, D = 10.000

Georgios Samaras 96


Figure 6.22: The speedup over brute force, of kd-GeRaF and FLANN, for N =10.000, D = 100

We define as a miss to be the case where the point returned is not in the topfive Nearest Neighbours. As error, we define the distance from the exact NN tothe point found. These values are the average ones in the table below. p_l isthe points_per_leaf parameter and max_l the max_leaves_check parameter.Speedup is measured against brute force, while the building process is also takeninto account. FLANN’s parameters are chosen to be the default ones, since theyseem to give the best results.

Georgios Samaras 97


Table 6.9: kd-GeRaF vs FLANN, optimal parameters

Software N D p_l #trees max_l ε miss% error Speedup2 4 2 2.2 0.001 218.65 4 2 1.2 0.001 113.3kd-GeRaF

5 4 20 0.2 0.0001 29.91 7.8 60.06 60.4

FLANN1.000 10.000

auto5

auto0

6.7 57.12 20.6

2 4 2 67.3 1055 251.95 4 2 40.2 1055 130.2

kd-GeRaF 5 4 20 5.2 1124 39.81 7.8 60.06 86.6

FLANN1.000 10.000

auto5

auto0.1

6.7 57.12 21.5

2 4 2 32.1 3869 253.55 4 2 59 3869 133.7kd-GeRaF

5 4 20 7.9 3869 34.21 7.8 60.06 48.9

FLANN1.000 10.000

auto5

auto0.5

6.7 57.12 21.2

2 4 2 49.2 0.0015 3489.65 4 2 30.2 0.0013 912.3kd-GeRaF

2 4 20 14.2 0.0003 1024.21 64.6 95.8 180.4

FLANN10.000 10.000

auto5

auto0

16.1 73.66 26.8

2 4 2 72.4 1058 3674.55 4 2 57.9 922.6 1108kd-GeRaF

2 4 20 41.5 1058 271.6Continued on next page

Georgios Samaras 98


Table 6.9 – Continued from previous pageSoftware N D p_l #trees max_l ε miss% error Speedup

1 64.6 95.8 101.8FLANN

10.000 10.000auto

5auto

0.116.1 73.66 23.3

2 4 2 62.3 3880 3908.65 4 2 10.2 3383 1210.8kd-GeRaF

2 4 20 7.9 3880 269.31 64.6 95.84 126.7

FLANN10.000 10.000

auto5

auto0.5

16.1 73.66 27.1

50 1 2 56.8 0.0068 836.8kd-GeRaF

50 1 3 51.2 0.0052 752.71 77.1 13.3 149.9

FLANN10.000 100

auto5

auto0

62.3 9.57 26.3

50 1 2 64.1 89.9 1061.9kd-GeRaF

50 1 3 74.2 89.9 836.21 77.1 13.3 162.2

FLANN

10.000 100auto

5auto

0.1

62.3 9.57 25.2

50 1 2 83.8 329.9 1124kd-GeRaF

50 1 3 85.7 392.4 814.21 77.1 13.3 168.1

FLANN10.000 100

auto5

auto0.5

62.3 9.57 25.7

Georgios Samaras 99


N D Speedup1.000 10.000 2.7410.000 10.000 2.69100.000 100 1.5

Table 6.10: Exact NN of the kd-GeRaF

6.3 Comparisons

In this section we will present what the measurements in section 6.2 tell usabout the comparison amongst the libraries we tested for ANNS. There are plentyof factors that we can use for comparison, but we will constraint ourselves to theones regarding the parameters and the performance of every method.

6.3.1 kd-GeRaF vs BBD

As mentioned in the section 6.1.4, we used the ANN library, which providedthe BBD tree.

6.3.1.1 Parameters

Both methods provide the same, in theory, flexibility, since they are both tree-based approaches. The task of finding the optimal parameters for our data setswas of the same difficulty in both methods. However, as table 6.1 shows, the pa-rameters did not have the same impact in both cases.

In detail, we can see that when the approximation parameters, such ε, maxnumber of leaves to check (which is not presented in table 6.1 for simplicity) and

Georgios Samaras 100


others are tuned in a way that states that we are interested in speed rather inaccuracy, the kd-GeRaF effectively increase its speed, while loosing some of theaccuracy. On the other hand, the BBD tree is not affected much by the approxi-mation parameters, leaving the user with no actual tool to alter the behaviour ofthe ANNS library (s)he uses.

In short, kd-GeRaF allows the user to choose between maximum accuracy andspeed, while the BBD tree gives the illusion it does so too, but in practice, it isnot affected much by the parameters the user passes.

6.3.1.2 Build & Search

In both methods, building time is affected by the parameters the user chooses,such as the maximum number of points a leaf can store and others. kd-GeRaF

is about 55 times faster than the BBD in the building process, despite the factthat the first is building a number of trees and the second only one. Moreover,the measurements were taken, with parallelization not enabled for the kd-GeRaF,something that would boost the speed of the building process.

The kd-GeRaF was 4.3 times faster in the searching process than the BBD tree,while the BBD tree was more accurate. Of course, the parameters discussed insection 6.3.1.1 play a key role in that behaviour.

6.3.2 kd-GeRaF vs LSH

As mentioned in the section 6.1.4, we used the E2LSH library, which providedthe LSH method.



6.3.2.1 Parameters

The task of finding the optimal parameters for our data sets was really hard forthe LSH case, since by nature, LSH solves the R-NN problem, which means thatnot only we have to search for the optimal parameters in terms of performance,but for the optimal parameter for R, which is essentially the radius.

If R is bigger than the optimal value, LSH will do more work in the build andsearch process. If R is less than the optimal value, LSH will ommit points thatmay be the actual nearest neighbours. Finding the parameter R was such a strug-gle, that I had to develop an independent program, which would find the optimalvalue for R. Of course, the time this program took to execute was not taken intoaccount in the building time of LSH, but it is a big disadvantage of the LSH andshould be mentioned.

Both methods allow the user to choose the desired approximation parametersand the impact is visible in table 6.2. However, the problem of finding the optimalR parameter for LSH is shadowing everything else in this section.


In both methods, building time is affected by the parameters the user chooses.kd-GeRaF is about 252 times faster than the LSH in the building process, despitethe fact that the measurements were taken, with parallelization not enabled forthe kd-GeRaF, something that would boost the speed of the building process.

The kd-GeRaF was 3.4 times faster than LSH, while LSH was more accurate(but less accurate than the BBD tree). Of course, the parameters discussed in



section 6.3.2.1 play a key role in that behaviour.

6.3.3 kd-GeRaF vs FLANN

As mentioned in the section 6.1.4, we used the FLANN library, which providedalmost the same method as the kd-GeRaF project.

6.3.3.1 Parameters

The task of finding the optimal parameters for our data sets was of equal diffi-culty in both cases, as expected, since in both cases several RKD trees are used.Both libraries allowed the user to choose the parameters in such a way that woulddefine the trade-off between speed and accuracy. However, as with the BBD tree,the kd-GeRaF was affected more by the approximation parameters, in comparisonto FLANN.

Note that FLANN would fail to find the exact nearest neighbours in severaloccasions, even when zero approximation was requested by the user. Probablythis happens because the library is designed for ANNS. However, the kd-GeRaF,if requested, can find the exact nearest neighbours.


The comparison of the building process is identical to the one of the kd-GeRaF

vs the BBD tree, thus it is ommitted from this section and from table 6.3.

The kd-GeRaF was faster than FLANN. The accuracy was affected by the pa-rameters discussed in section 6.3.3.1, but if requested, the kd-GeRaF can be moreaccurate, while maintaining a significant speedup against FLANN. However, for



the particular case of SIFT data FLANN behaves equally good with the kd-GeRaF.



Chapter 7

Conclusion and future work

The kd-GeRaF is an approach that seems to be very interesting for performingApproximate Nearest Neighbour Search. The theoretical aspect of this approachis really appealing. Moreover the thorough experiments we conducted presenteda promising behaviour of the kd-GeRaF, since they prove that this approach isflexible.

Moreover, the results showed that the approach can serve the trade off be-tween speed and accuracy as requested, by giving the user the power to define thebehaviour of the method, which is something that is critical, since the optimalparameters are dependent on the data set.

Furthermore, our experiments showed that only the random factor that is crit-ical for the performace of the kd-GeRaF, is the on of choosing at random thesplitting dimension. The random rotation of the data set and the addition of therandom factor δ, as discussed in section 2.2.3 do not improve the performance ofour approach.



The results presented above are encouraging and can be improven. Paral-lelization can play a key role in increasing the efficiency of the approach, sincefrom nature the kd-GeRaF allows parallel execution. There are many parallel ap-proaches that would fit with this approach and further research and experimentingis needed, in order to determine which is the optimal one.

Finally, the fact that there are numerous parameters that we can use to alterthe behaviour of the kd-GeRaF is pretty intriguing and additional research needsto be done. Moreover, this aspect provide us with a better understanding of howApproximate Nearest Neighbour Search behaves in higher dimensions, somethingthat may lead us to new ideas.



References

[1] GUDHI, Geometric Understanding in Higher Dimensions, https://project.in-ria.fr/gudhi/

[2] Chanop Silpa-Anan and Richard Hartley. Optimised KD-trees for fast imagedescriptor matching. Computer Vision and Pattern Recognition, 2008. CVPR2008. IEEE.

[3] Marius Muja and David G. Lowe. Fast Approximate Nearest Neighbors withautomatic algorithm configuration. VISAPP (1), 2009.

[4] Santosh S. Vempala. Randomly-oriented k-d Trees Adapt to Intrinsic Dimen-sion. IARCS Annual Conference on Foundations of Software Technology andTheoretical Computer Science (FSTTCS 2012).

[5] Sariel Har-Peled, Piotr Indyk and Rajeev Motwani. Approximate nearestneighbors: Towards removing the curse of dimensionality. In STOC, 1998.

[6] Rina Panigrahy. An Improved Algorithm Finding Nearest Neighbor UsingKd-trees. Microsoft Research, 2008.

[7] Sunil Arya, David M. Mount, Nathan S. Netanyahu, Ruth Silverman and An-gela Y. Wu. An Optimal Algorithm for Approximate Nearest Neighbor Search-ing in Fixed Dimensions. Journal of the ACM, 45(6):891–923, 1998.



[8] Donald E. Knuth. Art of Computer Programming. Volume 2: SeminumericalAlgorithms, 3rd edn., p. 232. Boston: Addison-Wesley, 1998.

[9] Jeffrey S. Beis and David G. Lowe. Shape indexing using approximate near-est–neighbour search in high–dimensional spaces. In Proceedings of computervision and pattern recognition, pages 1000–1006, Puerto Rico, June 1997.

[10] David G. Lowe. Distictive image features from scale-invariant keypoints. In-ternational journal of computer vision, 60(2):91–110, 2004.

[11] Krystian Mikolajczyk and Cordelia Schmid. A performance evaluation of localdescriptors. In Proceedings of computer vision and pattern recognition, 2003.

[12] Aristides Gionis, Piotr Indyk and Rajeev Motwani. Similarity Search in HighDimensions via Hashing. GIM 1999.

[13] Beyer, K. S., Goldstein, J., Ramakrishnan, R., and Shaft, U. 1999. When Is”Nearest Neighbor” Meaningful? Pages 217-235 of: Proceedings of the Inter-national Conference on Database Theory (ICDT). Springer-Verlag.

[14] R. A. Finkel and J. L. Bentley. Quad trees: a data structure for retrieval oncomposite keys. Acta Inform., 4:1–9, 1974.

[15] P. B. Callahan and S. R. Kosaraju. A decomposition of multidimensional pointsets with applications to k-nearest-neighbors and n-body potential fields. J.ACM, 42:67–90, 1995.

[16] C. Duncan and M.T. Goodrich. Ch-7, Approximate Geometric Query Struc-tures, Handbook of Data Structures and Applications, Chapman and Hall.CRC Press, 26-1–26-17, 2005.



[17] Philbin, J. and Chum, O. and Isard, M. and Sivic, J. and Zisserman, A. ObjectRetrieval with Large Vocabularies and Fast Spatial Matching. CVPR 2007.

[18] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic repre-sentation of the spatial envelope. IJCV, 42(3):145–175, 2001.

[19] M. Douze, H. Jégou, H. Sandhawalia, L. Amsaleg and C. Schmid. Evaluationof GIST descriptors for web-scale image search. CIVR 09, July 8-10, 2009Santorini, GR.


SCHOOLOFSCIENCE … › 2015 › 05 › thesis.pdfFINALTHESIS...

Documents

Transcript of SCHOOLOFSCIENCE … › 2015 › 05 › thesis.pdfFINALTHESIS...