Efficient algorithms for ( δ , γ , α )-matching

Efficient algorithms for (δ,γ,α)-matching. Szymon Grabowski, Computer Engineering Dept., Tech. Univ. of Łódź, Poland, [email protected]. Kimmo Fredriksson, Dept. of Computer Science, Univ. of Joensuu, Finland, [email protected]. PSC, Prague, August 2006.


Page 1: Efficient algorithms for ( δ , γ , α )-matching

Efficient algorithms for (δ,γ,α)-matching

Szymon Grabowski, Computer Engineering Dept., Tech. Univ. of Łódź, Poland

[email protected]

PSC, Prague, August 2006

Kimmo Fredriksson, Dept. of Computer Science, Univ. of Joensuu, Finland

[email protected]

Page 2: Efficient algorithms for ( δ , γ , α )-matching

Problem setting

String matching in its classic form: given a text T = t_0 t_1 ... t_{n–1} and a pattern P = p_0 p_1 ... p_{m–1} over a finite alphabet Σ of size σ, report all occurrences of P in T.

Such a simple problem variant (exact matching) is not very useful for many applications.

For example, it is not adequate for music information retrieval (MIR) and molecular biology.

Several approximate matching models have thus been developed...

K. Fredriksson & Sz. Grabowski, Efficient algorithms for (δ,γ,α)-matching

Page 3: Efficient algorithms for ( δ , γ , α )-matching

Models & applications – music information retrieval

We allow classes of characters: the classes are continuous intervals (of equal width, 2δ+1, for all pattern positions).

This corresponds to handling small distortions of the melody (singer / whistler unskilled or under influence...).

A limit γ (< mδ) on the sum of the individual errors.

Gaps are also allowed – this is to skip ornamentation (esp. in classical music). We assume all gaps are in the [0, α] range.

Transposition invariance – the key of the melody can be arbitrary, i.e. everything can be shifted up or down by a fixed value. (Future work, hopefully...)

Page 4: Efficient algorithms for ( δ , γ , α )-matching


Problem we consider here

(δ,γ,α)-matching

Two symbols a, b ∈ Σ δ-match (we write a =_δ b) iff |a – b| ≤ δ.

We say that a pattern P (δ,γ,α)-matches the text substring t_{i_0} t_{i_1} ... t_{i_{m–1}} if p_j =_δ t_{i_j} for j ∈ {0 ... m–1}, where 0 < i_{j+1} – i_j ≤ α+1, and

$$\sum_{j=0}^{m-1} |p_j - t_{i_j}| \le \gamma.$$
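The definition can be checked mechanically. Below is a minimal Python sketch that tests whether a given sequence of text positions witnesses a (δ,γ,α)-match. The function name `is_dga_match` and the representation of P and T as lists of integers are assumptions made for illustration only; they do not come from the slides.

```python
def is_dga_match(P, T, idx, delta, gamma, alpha):
    """Return True if text positions idx witness a (delta, gamma, alpha)-match of P in T.

    idx is assumed to hold valid indices into T (hypothetical helper, not from the slides).
    """
    if len(idx) != len(P):
        return False
    # Gap condition: 0 < i_{j+1} - i_j <= alpha + 1
    for a, b in zip(idx, idx[1:]):
        if not (0 < b - a <= alpha + 1):
            return False
    errors = [abs(p - T[i]) for p, i in zip(P, idx)]
    # Every symbol must delta-match and the error sum must not exceed gamma
    return all(e <= delta for e in errors) and sum(errors) <= gamma


T = [60, 62, 70, 64, 65]
P = [60, 63, 65]
print(is_dga_match(P, T, [0, 1, 4], delta=1, gamma=1, alpha=2))
```

With α = 2 the gap between positions 1 and 4 is still allowed; lowering α to 1 or γ to 0 makes the same position sequence fail.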

Page 5: Efficient algorithms for ( δ , γ , α )-matching

Previous work on similar models

(δ,α)-matching:

Crochemore et al., 2002: O(mn) time (worst, avg, and best case).

Cantone et al., 2005a: also O(mn) in every case, but finds not only the end positions of the occurrences but also all the matching sequences.

Cantone et al., 2005b: O(n) on avg (for constant α), retaining O(mn) in the worst case.

Navarro & Raffinot, 2003; Cantone et al., 2005b: nondeterministic finite automaton with O(n mα / w) worst case time.

Along these lines, Fredriksson & Grabowski, 2006: more compact automaton with O(n m log(α) / w) worst case time.

Fredriksson & Grabowski, 2006: bit-parallel alg with O(nδ + (n / w) m) worst case time.

Page 6: Efficient algorithms for ( δ , γ , α )-matching

Surprisingly little work specifically on the (δ,γ,α)-matching problem...

Crochemore et al., 2002: dynamic programming alg, runs in O(mn) worst-case time. Uses a min-queue.

Of course, a brute-force DP alg is also possible: O(mn α) time, but it may be faster in practice than the more sophisticated alg above (as α is usually small).

Page 7: Efficient algorithms for ( δ , γ , α )-matching

Our contributions

We improve the basic dynamic programming based algorithm to run in O(nα δ/σ) average time.

We propose a simple sparse DP alg with O(n) avg time and O(min(mn, |M|α)) worst-case time, where M = { (i,j) | p_i =_δ t_j }.

We develop a bit-parallel algorithm that runs in O(nδ + mn log γ / w) worst case time.

Its avg time complexity is close to O((n log γ / w) α δ/σ + n), assuming small α.

Page 8: Efficient algorithms for ( δ , γ , α )-matching

Basic dynamic programming

Let us have a matrix D, with each cell (i, j) corresponding to the search state of pattern prefix p_0 ... p_i in text T.

More precisely, a γ-bounded value of D_{i,j} (i.e. D_{i,j} ≤ γ) will denote that p_0 ... p_i matches T with end position j.

Brute-force computation takes O(mn α) time and O(n) space (it is enough to store only the current and previous row).

We can also proceed column-wise: same time but O(αm) space instead.
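The slides do not spell out the recurrence, so the following row-wise Python sketch is an assumption derived from the problem definition: D_{i,j} = |p_i – t_j| plus the smallest D_{i–1,j’} among the α+1 positions preceding j, provided p_i δ-matches t_j and the total stays within γ. The name `dga_brute_force` and the use of infinity for "no match" are illustrative choices.

```python
INF = float("inf")


def dga_brute_force(P, T, delta, gamma, alpha):
    """Return the end positions j of (delta, gamma, alpha)-matches of P in T.

    Row-wise DP in O(m * n * alpha) time, keeping only the previous row.
    """
    n = len(T)
    # Row 0: any delta-matching text position may start a match
    prev = [abs(P[0] - t) if abs(P[0] - t) <= delta else INF for t in T]
    for i in range(1, len(P)):
        cur = [INF] * n
        for j in range(n):
            err = abs(P[i] - T[j])
            if err > delta:
                continue
            # Best predecessor among the alpha+1 positions right before j
            lo = max(0, j - alpha - 1)
            best = min(prev[lo:j], default=INF)
            if best + err <= gamma:
                cur[j] = best + err
        prev = cur
    return [j for j in range(n) if prev[j] <= gamma]


print(dga_brute_force([60, 63, 65], [60, 62, 70, 64, 65], delta=1, gamma=1, alpha=2))
```

Only two rows are alive at any time, which is the O(n) space mentioned above.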

Page 9: Efficient algorithms for ( δ , γ , α )-matching

Cut-off trick for improving the avg time (Ukkonen, 1985; Cantone et al., 2005)

Usually, calculating all the matrix cells is overkill.

Observation: if D_{i...m–1, j–α...j} > γ then D_{i+1...m–1, j+1} > γ.

Read: it’s not so easy to get out of a ‘dead zone’.
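One way to exploit the observation is sketched below in Python: proceed column-wise and, in each column, compute only the rows up to one past the highest row that was still alive in the previous α+1 columns. This is a simplified illustration of the cut-off idea, not the authors' exact bookkeeping; the name `dga_cutoff` and the explicit list of stored columns are assumptions.

```python
INF = float("inf")


def dga_cutoff(P, T, delta, gamma, alpha):
    """Column-wise DP with cut-off; returns end positions of matches."""
    m = len(P)
    window = []  # the last alpha+1 columns, each a list of m error sums
    tops = []    # highest active row (value <= gamma) of each stored column, -1 if none
    ends = []
    for j, t in enumerate(T):
        # Row i can be active only if row i-1 is active somewhere in the window
        top = min(m - 1, 1 + max(tops, default=-1))
        col = [INF] * m
        last_active = -1
        for i in range(top + 1):
            err = abs(P[i] - t)
            if err > delta:
                continue
            best = 0 if i == 0 else min((c[i - 1] for c in window), default=INF)
            if best + err <= gamma:
                col[i] = best + err
                last_active = i
        if col[m - 1] <= gamma:
            ends.append(j)
        window.append(col)
        tops.append(last_active)
        if len(window) > alpha + 1:  # keep O(alpha * m) space
            window.pop(0)
            tops.pop(0)
    return ends


print(dga_cutoff([60, 63, 65], [60, 62, 70, 64, 65], delta=1, gamma=1, alpha=2))
```

On text regions where few prefixes survive, `top` stays small and most cells of a column are never touched.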

Page 10: Efficient algorithms for ( δ , γ , α )-matching


DP-CO, cont’d

The avg time is O(n (αδ/σ)2). (Pessimistic analysis, we weren’t able to take the gamma restriction

into account.)

The worst case remains O(mn α), but as in (Crochemore et al., 2002) it can be improved to O(mn). The difference is that we handle m queues as we proceed column-wise.

Page 11: Efficient algorithms for ( δ , γ , α )-matching

Simple algorithm (ingenious name, eh?)

In a few words: naïve brute force DP algorithm but applied only locally.

We work on lists L_i, corresponding to individual rows.

We start with L_0 = { j | t_j =_δ p_0 } (obtained in O(n) time).

For i = 1...m–1: L_i = { j | t_j =_δ p_i AND D_{i–1,j’} + |p_i – t_j| ≤ γ AND 0 < j – j’ ≤ α+1 }

We put each j only once into L_i (if there are many j’ that can cause it, we choose the one that minimizes the new D_{i,j}).

Obtaining list L_i takes O(α|L_{i–1}|) time.
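The list-based scheme translates almost directly into code. The Python sketch below stores each list L_i as a dictionary from text position to the smallest error sum; that container choice and the name `dga_simple` are assumptions for illustration, not details taken from the slides.

```python
def dga_simple(P, T, delta, gamma, alpha):
    """Sparse DP: return end positions of (delta, gamma, alpha)-matches of P in T."""
    n = len(T)
    # L_0: positions that delta-match p_0 (and do not already exceed gamma)
    L = {j: abs(P[0] - t) for j, t in enumerate(T)
         if abs(P[0] - t) <= min(delta, gamma)}
    for i in range(1, len(P)):
        nxt = {}
        for jp, d in L.items():
            # Only the alpha+1 positions right after j' can extend this prefix match
            for j in range(jp + 1, min(n, jp + alpha + 2)):
                err = abs(P[i] - T[j])
                if err <= delta and d + err <= gamma:
                    # Each j is stored once, with the minimal error sum
                    if d + err < nxt.get(j, gamma + 1):
                        nxt[j] = d + err
        L = nxt
    return sorted(L)


print(dga_simple([60, 63, 65], [60, 62, 70, 64, 65], delta=1, gamma=1, alpha=2))
```

The work for row i is proportional to α times the size of the previous list, so rows with few surviving prefix matches cost almost nothing.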

Page 12: Efficient algorithms for ( δ , γ , α )-matching


Simple algorithm, cont’d

Complexity

All lists have length |M| in total in the worst case, which implies O(|M|α) worst case time.

But: (i) on average this is much better, (ii) we can somewhat improve the worst case.

Page 13: Efficient algorithms for ( δ , γ , α )-matching


Simple algorithm, cont’d

Average case analysis

The length of list L_0 is O(n δ/σ) on avg. Hence L_1 is computed in O(n α δ/σ) avg time. But its avg length is only O(n δ/σ · α δ/σ).

In general, computing L_i takes O(n (α δ/σ)^i) avg time.

The total time is the sum of m such components.

Note that α, δ, σ are fixed for a given problem instance. In other words, α δ/σ can be considered a constant.

If the constant α (2δ+1)/σ is less than 1, we have a geometric series with O(n) sum.
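Written out, the summation is the standard geometric-series bound. The shorthand c for the constant is introduced here for illustration only:

$$\sum_{i=0}^{m-1} n\,c^{i} \;=\; n\,\frac{1-c^{m}}{1-c} \;\le\; \frac{n}{1-c} \;=\; O(n), \qquad c = \frac{\alpha(2\delta+1)}{\sigma} < 1.$$

For instance, with α = 2, δ = 4 and σ = 128 we get c = 18/128 ≈ 0.14, so the bound is roughly 1.16 n.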

Page 14: Efficient algorithms for ( δ , γ , α )-matching


Simple algorithm, cont’d

Improving the worst case

Idea: avoid brute-force handling of overlapping windows of α+1 size.

We make use of a min-queue (Gajewska & Tarjan, 1986), similarly to the concept from (Crochemore et al., 2002).

The queue always keeps up to α+1 integers, namely the error sums corresponding to the sliding window area in the previous row. For each processed cell, 0 or 1 values are inserted at the front of the queue (O(1) time) and from 0 to α+1 values are deleted from the tail. But we can’t remove more than we’ve inserted. Hence O(1) amortized cost per cell.

This improves the worst-case time complexity to O(min(mn, |M|α)).
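The amortized argument is easiest to see on a generic sliding-window minimum. The Python sketch below uses a monotone deque, which is one common way to realize a min-queue; it is not claimed to be the Gajewska & Tarjan structure itself, the insert and delete ends are mirrored relative to the description above, and the name `sliding_window_min` is hypothetical. In the matching algorithm the window width would be α+1 and the values would be the error sums of the previous row.

```python
from collections import deque


def sliding_window_min(values, width):
    """For each k, return the minimum of values[max(0, k - width + 1) : k + 1]."""
    q = deque()  # indices into values; their values increase from front to back
    out = []
    for k, v in enumerate(values):
        # Entries at the back that are no smaller than v can never be the minimum again
        while q and values[q[-1]] >= v:
            q.pop()
        q.append(k)
        # Drop the front entry once it has left the window
        if q[0] <= k - width:
            q.popleft()
        out.append(values[q[0]])
    return out


print(sliding_window_min([3, 1, 4, 1, 5, 9, 2, 6], 3))
```

Every index is appended once and removed at most once, which gives the O(1) amortized cost per cell.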

Page 15: Efficient algorithms for ( δ , γ , α )-matching

Bit-parallelism technique (in stringology)

Baeza-Yates (1989) noticed that CPU registers are usually longer than 1 bit...

And he made use of this fact.

In O(1) time we can perform operations like logical and (&), or (|), shifts (<<, >>), etc. on a whole machine word (usu. 32 or 64 bits).

Nowadays, bit-parallelism is a very popular technique in string matching algorithms, in theory and in practice.

Also useful for many approximate matching variants.

Page 16: Efficient algorithms for ( δ , γ , α )-matching

Bit-parallel dynamic programming

Modified DP alg: let the cells of D be chunks of O(log γ) bits. We’ll be able to compute O(w / log γ) cells in parallel.

More precisely, each cell will use l + 1 bits, where l = ⌈log_2(2γ+1)⌉.

Error sum zero will be encoded as 2^{l–1} – (γ+1); γ+1 (the lowest ‘illegal’ value) will thus be 2^{l–1} (old trick, e.g., Fredriksson & Navarro, 2004; Crochemore et al., 2005).

This representation can solve 3 issues: (i) checking in parallel if some counters exceed γ, (ii) parallel handling of counter overflows, (iii) computing pairwise minima over two sets of counters in parallel.
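To make the encoding concrete, here is a small Python sketch that packs several biased counters into one integer, adds per-cell errors with a single addition, and tests all counters against γ with one mask. Several details are assumptions rather than facts from the slides: stored sums and added errors are clamped to γ+1, γ ≥ 1 is assumed, a counter is taken to exceed γ when bit l–1 or bit l of its field is set, and all function names are hypothetical.

```python
def layout(gamma):
    """Return (l, field width, bias) for the biased counter encoding (assumes gamma >= 1)."""
    l = (2 * gamma).bit_length()         # equals ceil(log2(2*gamma + 1))
    bias = (1 << (l - 1)) - (gamma + 1)  # code that represents error sum 0
    return l, l + 1, bias


def pack(sums, gamma):
    """Pack error sums (clamped to gamma+1) into one integer, counter 0 in the low bits."""
    _, width, bias = layout(gamma)
    word = 0
    for k, s in enumerate(sums):
        word |= (min(s, gamma + 1) + bias) << (k * width)
    return word


def add_errors(word, errors, gamma):
    """Add one error value (clamped to gamma+1) to every counter with a single addition."""
    _, width, _ = layout(gamma)
    inc = 0
    for k, e in enumerate(errors):
        inc |= min(e, gamma + 1) << (k * width)
    return word + inc  # the spare top bit of each field absorbs any carry


def exceeds_gamma(word, count, gamma):
    """Return a list of booleans: does counter k hold an error sum > gamma?"""
    l, width, _ = layout(gamma)
    flag = sum(1 << (k * width + l - 1) for k in range(count))  # bit l-1 of each field
    hits = (word | (word >> 1)) & flag  # folds bit l of each field onto its bit l-1
    return [bool((hits >> (k * width + l - 1)) & 1) for k in range(count)]


w = pack([0, 4, 5, 2], gamma=4)
print(exceeds_gamma(w, 4, gamma=4))
print(exceeds_gamma(add_errors(w, [4, 1, 0, 3], gamma=4), 4, gamma=4))
```

Because γ+1 maps to the power of two 2^{l–1}, the comparison against γ needs no per-counter subtraction, only a mask.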

Page 17: Efficient algorithms for ( δ , γ , α )-matching

BP-DP, cont’d

Tiling the DP matrix with C × 1 vectors, where C = w / (l+1) (here C = 8). The dark gray cell of the current tile depends on the light gray cells of the two tiles in the previous row (α = 4).

We are in row i. Thanks to preprocessing, we know the delta-errors between all chars in the current tile (C cells) and P[i].

Problem: how to calculate the new values of D_{i,*}?

Page 18: Efficient algorithms for ( δ , γ , α )-matching

BP-DP, cont’d

Solution #1. Naïve shifts (chunk by chunk) and minimizations, with an O(α) factor.

Solution #2. Similar but with a halving technique: first shift by α / 2 counter positions, then by α / 4 etc., performing the minimization at each step. It yields an O(log α) time factor.

Solution #3. Use a precomputed function. This is the one we choose, as it gives O(1) time for an O(w)-bit chunk (in practice some w’, e.g. w’ = w / 4).

Page 19: Efficient algorithms for ( δ , γ , α )-matching

Pre-empting the computation in the BP-DP search

The cut-off trick can again be used, with some modification since now we calculate C cells in parallel. (Read: the picture at slide 9 will be less jagged and the trick is somewhat less efficient here.)

Avg search time is (upper bound estimation, maybe not tight):

O((n / C) α δ/σ + n).

Page 20: Efficient algorithms for ( δ , γ , α )-matching

How to find minima in parallel for the O(w / log γ)-sized chunks

Precomputing as usual (ugly...) or an old trick (Paul & Simon, 1980)
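The slide does not show the trick itself. The Python sketch below gives one common formulation of a field-wise minimum over packed counters, relying on a guard bit at the top of every field; whether it coincides with the authors' implementation is an assumption, and all names are hypothetical.

```python
def pack_fields(vals, width):
    """Pack small non-negative integers into one word, field 0 in the low bits."""
    return sum(v << (k * width) for k, v in enumerate(vals))


def unpack_fields(word, count, width):
    """Inverse of pack_fields."""
    return [(word >> (k * width)) & ((1 << width) - 1) for k in range(count)]


def packed_min(x, y, count, width):
    """Field-wise minimum of x and y.

    Assumes the top (guard) bit of every field is 0 in both inputs.
    """
    high = sum(1 << (k * width + width - 1) for k in range(count))  # guard bits
    # A guard bit survives the subtraction iff x_k >= y_k; no borrow crosses a field
    ge = ((x | high) - y) & high
    # Turn every surviving guard bit into a mask over the field's low width-1 bits
    take_y = ge - (ge >> (width - 1))
    return (y & take_y) | (x & ~take_y & ~high)


x = pack_fields([3, 7, 8, 5], 5)
y = pack_fields([4, 7, 2, 15], 5)
print(unpack_fields(packed_min(x, y, 4, 5), 4, 5))
```

The whole computation uses a constant number of word operations regardless of how many counters fit into the word.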

Page 21: Efficient algorithms for ( δ , γ , α )-matching

Preprocessing in BP-DP

Preprocessing is simple. We build a helper bit-matrix V such that V_{i,j} = |p_i – t_j| if p_i =_δ t_j, and γ+1 otherwise.

Note that the number of rows in V can be reduced to the number of unique symbols in P (why store identical rows more than once?), which is σ_P. We call this terse representation V’.

First we fill V’ with γ+1 values in O((n / C) σ_P) time. Then we scan T and set 0..δ in at most 2δ+1 rows of V’ (those that δ-match the current char from T). Worst case time of the latter phase: O(nδ). Less on avg.
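The two preprocessing phases can be sketched as follows in Python. For readability this version keeps V’ as plain lists rather than bit-packed words, so it illustrates the access pattern but not the O((n / C) σ_P) fill time; the name `build_v` and the dictionary keyed by pattern symbol are assumptions.

```python
def build_v(P, T, delta, gamma):
    """Return {c: row} for each distinct symbol c of P.

    row[j] is |c - T[j]| if c delta-matches T[j], and gamma + 1 otherwise.
    """
    # Phase 1: one row per distinct pattern symbol, filled with gamma + 1
    rows = {c: [gamma + 1] * len(T) for c in set(P)}
    # Phase 2: scan T; only symbols t-delta .. t+delta can delta-match t,
    # so at most 2*delta + 1 rows are touched per text position
    for j, t in enumerate(T):
        for c in range(t - delta, t + delta + 1):
            if c in rows:
                rows[c][j] = abs(c - t)
    return rows


print(build_v([60, 63, 60], [60, 62, 64], delta=1, gamma=4))
```

Repeated pattern symbols share one row, which is what makes the number of rows σ_P rather than m.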

Page 22: Efficient algorithms for ( δ , γ , α )-matching

Lazy preprocessing

Note that in the previous scheme (with cut-off) the avg search time may even be O(n), but the preprocessing is typically superlinear (even if not by much).

To avoid costly preprocessing in the case when the search will be fast (i.e. the cut-off trick works efficiently), we can interweave the preprocessing and search phases.

This leads to O((n / C) α δ/σ + n) avg preprocessing time (pessimistic analysis), i.e. it matches the avg search time.

Page 23: Efficient algorithms for ( δ , γ , α )-matching

Multiple patterns

The bit-parallel alg has a relatively high preprocessing cost: O(nδ + σ_P n / (w / log γ)) in the worst case.

If, however, we are about to search for r patterns, the search time is multiplied by r, but the good news is that the preprocessing cost grows much more mildly: to O(nδ + σ_P n / (w / log γ) + rm), where σ_P is now the number of distinct symbols in the whole pattern set.

Practical (well-known) trick for r patterns if r is small compared to σ / δ: superimpose the patterns (then verify).

Page 24: Efficient algorithms for ( δ , γ , α )-matching


Test methodology

All algorithms implemented in C, compiled with icc 9.0.

Test machine: P4 2.4 GHz, 512 MB, running GNU/Linux 2.4.20.

Avg times reported over 100 trials (randomly extracted patterns).

Text files:

1. Concatenation of 7543 music pieces (MIDI, stripped of everything except pitch values), totalling 1.8 MB. Alphabet: [0..127] range, but far from random: only 55 values actually occur, and the 6 most frequent symbols alone cover ~50% of the whole text.

2. Uniformly random data in the 0..127 range.

Page 25: Efficient algorithms for ( δ , γ , α )-matching

Compared algorithms

BP Cut-off: bit-parallel dynamic programming with cut-off (without the lazy preprocessing).

BP Filter: the (δ,α)-matching version of BP Cut-off (Fredriksson & Grabowski, 2006) used as a filter, with DP-CO used for verifications.

DP Cut-off: dynamic programming with cut-off.

Simple: simple sparse DP (in the O(|M|α) worst case time version).

Page 26: Efficient algorithms for ( δ , γ , α )-matching

Experimental results, MIDI, δ = 1, γ = 4, α = 1

Page 27: Efficient algorithms for ( δ , γ , α )-matching

Experimental results, MIDI, δ = 4, γ = 16, α = 2

Page 28: Efficient algorithms for ( δ , γ , α )-matching

Experimental results, random, δ = 4, γ = 16, α = 2

Page 29: Efficient algorithms for ( δ , γ , α )-matching


Conclusions

Bit-parallelism works well also for the (δ,γ,α) search problem...

...But it works even better if regions of text where matches cannot be extended are quickly discarded.

Still, BP-DP for (δ,γ,α) disappoints compared to BP-DP for (δ,α) used as a filter.

(Problem: the γ counters need many bits...)

The consistently best alg in the tests was a simple heuristic (called the Simple alg). Fortunately, it doesn’t have a competitive worst-case time.

Page 30: Efficient algorithms for ( δ , γ , α )-matching


Future plans

Research on extended models: most importantly with transposition invariance.

Some purely theoretical variants (e.g., better complexity for large alpha).

Injecting compression to represent bit vectors more succinctly and thus speed up the search?

Can we replace the log γ factor in the bit-par alg with log δ?

(Hint: in each step we increase the counters by at most δ only.)