80
SUFFIX ARRAYS musings on the data structure, applications and implementation in Java Michal Nowak, Dawid Weiss IDSS Seminar, 11/2009
• Upload

others
• Category

## Documents

• view

7
• download

0

### Transcript of SUFFIX ARRAYS musings on the data structure, applications...

SUFFIX ARRAYSmusings on the data structure, applications and implementation in Java

Michał Nowak, Dawid Weiss

IDSS Seminar, 11/2009

Notation

• Let Σ be an alphabet;• Σ is finite and non-empty;

• symbols xa ∈ Σ, where a = 1 . . . |Σ| are ordered;

• O(xb − xc) = 1;

• \$ is a special symbol smaller than any x ∈ Σ.

• Let X be a sequence of n symbols, where X [i] ∈ Σ andi = 0 . . . n − 1.

• Let S(i) or i be a sub-sequence of X starting at positioni and ending at n − 1.

Notation

• Let Σ be an alphabet;• Σ is finite and non-empty;

• symbols xa ∈ Σ, where a = 1 . . . |Σ| are ordered;

• O(xb − xc) = 1;

• \$ is a special symbol smaller than any x ∈ Σ.

• Let X be a sequence of n symbols, where X [i] ∈ Σ andi = 0 . . . n − 1.

• Let S(i) or i be a sub-sequence of X starting at positioni and ending at n − 1.

Examples

• Zero-terminated US-ASCII strings (Latin letters).

• DNA sequences.

• Integer-coded sequences of words.

• . . .

(recall)

Suffix Trees

suffix S(i) ic a c a o \$ 0

a c a o \$ 1c a o \$ 2

a o \$ 3o \$ 4

\$ 5

suffix S(i) ic a c a o \$ 0

a c a o \$ 1c a o \$ 2

a o \$ 3o \$ 4

\$ 5

Suffixes of “cacao”.

Building the trie.

Suffix trie for “cacao”.

Compacting (pocket trees)

1 Move labels to edges,

2 collapse nodes with a single descendant.

Suffix tree for “cacao”.

Properties of suffix trees

• Maximum 2n nodes (!).

• Elegant to program with.

• Construction in O(n) time (Esko Ukkonen, 1995).

Applications

P1: Search for a substring m in Θ(|m|) steps.

P2: The longest repeated sequence.

?P3: The longest common substring of m strings.

alibaba.taliban.

P3: The longest common substring of m strings.

P3: The longest common substring of m strings.

What do you think,do geese see God?

P4: The longest palindrome.

whatdoyouthinkdogeeseseegod.dogeeseseegodkniht...

P4: The longest palindrome.

There are many more ST applications.

Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Scienceand Computational Biology. Cambridge University Press.

Problems with suffix trees

• Alphabet size.

• Processing locality.

Suffix Arrays

Ordered depth-first ST traversal.

\$a · c a o \$a · o \$c a · c a o \$c a · o \$o \$

suffix S(i) i

c a c a o \$ 0a c a o \$ 1

c a o \$ 2a o \$ 3

o \$ 4\$ 5

→ sort →

suffix S(i) SA[j]

\$ 5a · c a o \$ 1a · o \$ 3c a · c a o \$ 0c a · o \$ 2o \$ 4

Conclusions:

• A suffix array is an array of indices of sorted suffixes.

• A suffix array is always a permutation of suffix indices.

• Memory consumption: sizeof (index)× n.

• Suffix arrays and suffix trees can simulate each other.(Abouelhoda et al., 2004).

• Suffix sorting can be done in O(n);naïve: suffix tree + traversal, SA-specific not until 2003.

suffix S(i) i

c a c a o \$ 0a c a o \$ 1

c a o \$ 2a o \$ 3

o \$ 4\$ 5

→ sort →

suffix S(i) SA[j]

\$ 5a · c a o \$ 1a · o \$ 3c a · c a o \$ 0c a · o \$ 2o \$ 4

Conclusions:

• A suffix array is an array of indices of sorted suffixes.

• A suffix array is always a permutation of suffix indices.

• Memory consumption: sizeof (index)× n.

• Suffix arrays and suffix trees can simulate each other.(Abouelhoda et al., 2004).

• Suffix sorting can be done in O(n);naïve: suffix tree + traversal, SA-specific not until 2003.

suffix S(i) i

c a c a o \$ 0a c a o \$ 1

c a o \$ 2a o \$ 3

o \$ 4\$ 5

→ sort →

suffix S(i) SA[j]

\$ 5a · c a o \$ 1a · o \$ 3c a · c a o \$ 0c a · o \$ 2o \$ 4

Conclusions:

• A suffix array is an array of indices of sorted suffixes.

• A suffix array is always a permutation of suffix indices.

• Memory consumption: sizeof (index)× n.

• Suffix arrays and suffix trees can simulate each other.(Abouelhoda et al., 2004).

• Suffix sorting can be done in O(n);naïve: suffix tree + traversal, SA-specific not until 2003.

suffix S(i) i

c a c a o \$ 0a c a o \$ 1

c a o \$ 2a o \$ 3

o \$ 4\$ 5

→ sort →

suffix S(i) SA[j]

\$ 5a · c a o \$ 1a · o \$ 3c a · c a o \$ 0c a · o \$ 2o \$ 4

Conclusions:

• A suffix array is an array of indices of sorted suffixes.

• A suffix array is always a permutation of suffix indices.

• Memory consumption: sizeof (index)× n.

• Suffix arrays and suffix trees can simulate each other.(Abouelhoda et al., 2004).

• Suffix sorting can be done in O(n);naïve: suffix tree + traversal, SA-specific not until 2003.

suffix S(i) i

c a c a o \$ 0a c a o \$ 1

c a o \$ 2a o \$ 3

o \$ 4\$ 5

→ sort →

suffix S(i) SA[j]

\$ 5a · c a o \$ 1a · o \$ 3c a · c a o \$ 0c a · o \$ 2o \$ 4

Conclusions:

• A suffix array is an array of indices of sorted suffixes.

• A suffix array is always a permutation of suffix indices.

• Memory consumption: sizeof (index)× n.

• Suffix arrays and suffix trees can simulate each other.(Abouelhoda et al., 2004).

• Suffix sorting can be done in O(n);naïve: suffix tree + traversal, SA-specific not until 2003.

suffix S(i) i

c a c a o \$ 0a c a o \$ 1

c a o \$ 2a o \$ 3

o \$ 4\$ 5

→ sort →

suffix S(i) SA[j]

\$ 5a · c a o \$ 1a · o \$ 3c a · c a o \$ 0c a · o \$ 2o \$ 4

Conclusions:

• A suffix array is an array of indices of sorted suffixes.

• A suffix array is always a permutation of suffix indices.

• Memory consumption: sizeof (index)× n.

• Suffix arrays and suffix trees can simulate each other.(Abouelhoda et al., 2004).

• Suffix sorting can be done in O(n);naïve: suffix tree + traversal, SA-specific not until 2003.

suffix S(i) i

c a c a o \$ 0a c a o \$ 1

c a o \$ 2a o \$ 3

o \$ 4\$ 5

→ sort →

suffix S(i) SA[j]

\$ 5a · c a o \$ 1a · o \$ 3c a · c a o \$ 0c a · o \$ 2o \$ 4

Conclusions:

• A suffix array is an array of indices of sorted suffixes.

• A suffix array is always a permutation of suffix indices.

• Memory consumption: sizeof (index)× n.

• Suffix arrays and suffix trees can simulate each other.(Abouelhoda et al., 2004).

• Suffix sorting can be done in O(n);naïve: suffix tree + traversal, SA-specific not until 2003.

suffix S(i) i

c a c a o \$ 0a c a o \$ 1

c a o \$ 2a o \$ 3

o \$ 4\$ 5

→ sort →

suffix S(i) SA[j]

\$ 5a · c a o \$ 1a · o \$ 3c a · c a o \$ 0c a · o \$ 2o \$ 4

Conclusions:

• A suffix array is an array of indices of sorted suffixes.

• A suffix array is always a permutation of suffix indices.

• Memory consumption: sizeof (index)× n.

• Suffix arrays and suffix trees can simulate each other.(Abouelhoda et al., 2004).

• Suffix sorting can be done in O(n);naïve: suffix tree + traversal, SA-specific not until 2003.

(selected)

Algorithms

Naïve suffix sortingAlgorithms:

• Three-way quicksort (Bentley, McIlroy).

• Radix sort and combinations.

Problems due to:

• suffix comparisons not of O(1) (average LCP),“aaaaaaa...a\$”.

• potentially large and sparse alphabets.

Naïve suffix sortingAlgorithms:

• Three-way quicksort (Bentley, McIlroy).

• Radix sort and combinations.

Problems due to:

• suffix comparisons not of O(1) (average LCP),“aaaaaaa...a\$”.

• potentially large and sparse alphabets.

SACA goals

• Possibly Θ(n).• Fast in practice (real data).

• Lightweight (computation memory).

• Cache and IO-aware (linear scans).

Source: Puglisi et al., ACM Comput. Surveys.

O(n log n): qsufsort

• Jesper Larsson, Kunihiko Sadakane; 1999.

• Prefix doubling.

• Quick sort within buckets.

0 1 2 3 4 5 6 7 8 9 10 11

x = [ a b e a c a d a b e a \$ ]SA1 = [ 11 (0 3 5 7 10) (1 8) 4 6 (2 9) ]

ISA1 = [ 5 7 11 5 8 5 9 5 7 11 5 0 ]

↑ ↑ ↑ ↑ ↑1 4 6 8 11

i S(i)

0 abeacadabea\$3 acadabea\$5 adabea\$7 abea\$

10 a\$

→ ordered by

i + h S(i)

0 + 1 b·eacadabea\$3 + 1 c·adabea\$5 + 1 d·abea\$7 + 1 b·ea\$

10 + 1 \$

0 1 2 3 4 5 6 7 8 9 10 11

x = [ a b e a c a d a b e a \$ ]SA1 = [ 11 (0 3 5 7 10) (1 8) 4 6 (2 9) ]

ISA1 = [ 5 7 11 5 8 5 9 5 7 11 5 0 ]

↑ ↑ ↑ ↑ ↑1 4 6 8 11

i S(i)

0 abeacadabea\$3 acadabea\$5 adabea\$7 abea\$

10 a\$

→ ordered by

i + h S(i)

0 + 1 b·eacadabea\$3 + 1 c·adabea\$5 + 1 d·abea\$7 + 1 b·ea\$

10 + 1 \$

0 1 2 3 4 5 6 7 8 9 10 11

x = [ a b e a c a d a b e a \$ ]SA1 = [ 11 (0 3 5 7 10) (1 8) 4 6 (2 9) ]

ISA1 = [ 5 7 11 5 8 5 9 5 7 11 5 0 ]

↑ ↑ ↑ ↑ ↑1 4 6 8 11

i S(i)

0 abeacadabea\$3 acadabea\$5 adabea\$7 abea\$

10 a\$

→ ordered by

i + h S(i)

0 + 1 b·eacadabea\$3 + 1 c·adabea\$5 + 1 d·abea\$7 + 1 b·ea\$

10 + 1 \$

0 1 2 3 4 5 6 7 8 9 10 11

x = [ a b e a c a d a b e a \$ ]SA1 = [ 11 (0 3 5 7 10) (1 8) 4 6 (2 9) ]

ISA1 = [ 5 7 11 5 8 5 9 5 7 11 5 0 ]

↑ ↑ ↑ ↑ ↑1 4 6 8 11

i S(i)

0 abeacadabea\$3 acadabea\$5 adabea\$7 abea\$

10 a\$

→ ordered by

i + h S(i)

0 + 1 b·eacadabea\$3 + 1 c·adabea\$5 + 1 d·abea\$7 + 1 b·ea\$

10 + 1 \$

0 1 2 3 4 5 6 7 8 9 10 11

x = [ a b e a c a d a b e a \$ ]SA1 = [ 11 (0 3 5 7 10) (1 8) 4 6 (2 9) ]

ISA1 = [ 5 7 11 5 8 5 9 5 7 11 5 0 ]L = [ -1 5 2 -2 2 ]

SA2 = [ 11 10 (0 7) 3 5 (1 8) 4 6 (2 9) ]ISA2 = [ 3 7 11 4 8 5 9 3 7 11 2 0 ]

L = [ -2 2 -2 2 -2 2 ]

SA4 = [ 11 10 (0 7) 3 5 8 1 4 6 9 2 ]ISA4 = [ 3 7 11 4 8 5 9 3 6 10 1 0 ]

L = [ -2 2 -8 ]

SA8 = [ 11 10 7 0 3 5 8 1 4 6 9 2 ]ISA8 = [ 3 7 11 4 8 5 9 2 6 10 1 0 ]

L = [ -12 ]

O(n) solution: skewDivide-and-conquer:

• Bucket-sorting.

• Problem splitting and recursion.

• Cheap merge from partial SAs.

12-page paper, C source code included. . .

Source: MN thesis.

Source: Karkkainen/Sanders.

Induced copying

• Determine different types of suffixes.

• Sort a single type of suffixes.

• Induce the order of remaining suffixes.

jsuffixarrays.org

About the project

• Michał Nowak.

• Suffix arrays in Java.

• Different algorithms.

• JVM benchmark.

Algorithms:

Algorithm Authors Complexity Memory

skew P. Sanders, J. Kärkkäinen O(n) 10-13nqsufsort J. Larrson, K. Sadakane O(n log n) 8ndeep shallow G. Manzini O(n2 log n) 5ntwo-stage H. Itoh, H. Tanaka O(n2 log n) 5nimpr. two-stage S. Puglisi, M. Maniscalco O(n2 log n) 5nbpr K. B. Schürmann O(n2(log n)−1) 9-10n

quicksort (naïve algorithm)

JVMs (in their newest versions):

• SUN

• IBM

• JRockit (Oracle)

• Apache Harmony

• gcj, initially only.

Code conversion problems

• Pointers→ indexed arrays.

• Memory reused for different types→ φ.

• Stack-allocated structures and arrays→ φ.

• Boundary checks penalty.

• Language constraints (byte arrays).

Code conversion problems

• Pointers→ indexed arrays.

• Memory reused for different types→ φ.

• Stack-allocated structures and arrays→ φ.

• Boundary checks penalty.

• Language constraints (byte arrays).

Code conversion problems

• Pointers→ indexed arrays.

• Memory reused for different types→ φ.

• Stack-allocated structures and arrays→ φ.

• Boundary checks penalty.

• Language constraints (byte arrays).

Code conversion problems

• Pointers→ indexed arrays.

• Memory reused for different types→ φ.

• Stack-allocated structures and arrays→ φ.

• Boundary checks penalty.

• Language constraints (byte arrays).

Code conversion problems

• Pointers→ indexed arrays.

• Memory reused for different types→ φ.

• Stack-allocated structures and arrays→ φ.

• Boundary checks penalty.

• Language constraints (byte arrays).

Code conversion problems

• Pointers→ indexed arrays.

• Memory reused for different types→ φ.

• Stack-allocated structures and arrays→ φ.

• Boundary checks penalty.

• Language constraints (byte arrays).

Engineering fun

JIT code dumping1 private static byte bump(byte v) {2 return (byte) (v + 1);3 }4

5 public static void doLoop(byte [] array) {6 for (int i = 0; i < array.length; i++) {7 array[i] = bump(array[i]);8 }9 }

Equivalent C code (gcc -S -O3):1 ...2 .L4:3 leaq 1048576(%rsp), %rdx // The loop’s end address.4 addb \$1, (%rax) // bump byte5 addq \$1, %rax // increase loop counter.6 cmpq %rdx, %rax7 jne .L4 // repeat until cond. true.

Java code compiled with gcj -O3 -S Test.java:1 ...2 .L20:3 .p2align 4,,54 jbe .L31 // AOOB if max <= i5 .L23:6 movslq %edi,%rax // store current i7 addl \$1, %edi // increase loop counter.8 addb \$1, 12(%rbx,%rax) // bump byte inside the array9 cmpl %edi, %edx // loop condition check.

10 .p2align 4,,211 jg .L20

Java code; JIT-compiled, SUN’s HotSpot, -server:1 mov 0x10(%rsi),%r9d // get array length2 test %r9d,%r9d // check if empty array.3 jle L_OUT4 xor %r11d,%r11d // i = 05 L1: cmp %r9d,%r11d6 jae L_OUT // L_OUT if max >= i7 movslq %r11d,%r10 // store current i8 movsbl 0x18(%rsi,%r10,1),%r8d9 inc %r11d // bump i

10 inc %r8d // bump value at array11 mov %r8b,0x18(%rsi,%r10,1)12 cmp \$0x1,%r11d // jump always (?)13 jl L1

But also:1 movslq %ecx,%rcx // (Unfolded loop)2 movsbl 0x19(%rbx,%rcx,1),%r9d3 inc %r9d4 mov %r9b,0x19(%rbx,%rcx,1)5 movsbl 0x1a(%rbx,%rcx,1),%r9d6 inc %r9d7 mov %r9b,0x1a(%rbx,%rcx,1)8 movsbl 0x1b(%rbx,%rcx,1),%r9d9 inc %r9d

10 ...

Quiz1 /** Do I look evil to you? */2 public class Example10 {3 private static boolean ready;4

5 public static void startThread() {6 new Thread() {7 public void run() {8 try { sleep(2000); } catch (Exception e) { /* ignore */ }9 ready = true;

10 System.out.println("Setting ready to true.");11 }12 }.start();13 }14

15 public static void main(String [] args) {16 startThread();17 while (!ready) {18 // Do nothing.19 }20 System.out.println("I’m ready.");21 }22 }

1 > gcj -O3 -S Example10.java

1 ...2 cmpb \$0, _ZN3com10dawidweiss9debugging6simple9Example105readyE(%rip)3 jne .L134 .L16: // buhu! :)5 jmp .L166 .p2align 4,,77 .L13:8 ...

1 > gcj -O3 -S Example10.java

1 ...2 cmpb \$0, _ZN3com10dawidweiss9debugging6simple9Example105readyE(%rip)3 jne .L134 .L16: // buhu! :)5 jmp .L166 .p2align 4,,77 .L13:8 ...

Other things of interest• Minimizing GC activity.

• Memory tracing aspects (AspectJ).

Evaluation

Evaluation

• Random input (alphabets 4, 100, 255, variable length).

• Gauntlet and Manzini’s corpora.

• Multiple runs, warmup rounds removed.

• Wall time measured (no parallelism other than the GC).

• Memory peaks measured.

1

2

3

4

5

6

7

8

0 50 100 150 200 250 300

time

[s]

alphabet size [symbols]

time on constant input, varying alphabet

BPRDEEP-SHALLOW

DIVSUFSORTNS

QSUFSORTSKEW

0

100

200

300

400

500

600

0 5 10 15 20 25

mem

ory

[MB

]

input size [millions elements]

memory on random input, alphabet size = 255

BPRDEEP-SHALLOWDIVSUFSORTNSQSUFSORTSKEW

0

5

10

15

20

25

30

35

40

45

0 5 10 15 20 25

time

[s]

input size [millions elements]

time on random input, alphabet size = 255

BPRDEEP-SHALLOWDIVSUFSORTNSQSUFSORTSKEW

0

5

10

15

20

25

30

35

40

45

50

0 5 10 15 20 25

time

[s]

input size [millions elements]

time on random input, alphabet size = 4

BPRDEEP-SHALLOWDIVSUFSORTNSQSUFSORTSKEW

In reality, data is hardly ever random.

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

SKEW DIVSUFSORT BPR QSUFSORT

tim

e [

s]

ababab(ab...)ac, 196KB

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

SKEW DIVSUFSORT BPR QSUFSORT

tim

e [

s]

Where is naïve sort? ababab(ab...)ac, 196KB

0

20

40

60

80

100

120

140

sun ibm jrockit harmony

time

[s]

BPRDIVSUFSORT

QSUFSORTSKEW

Summary

• Suffix arrays are worth remembering.

• Fundamental and short algorithms developed now.

• http://jsuffixarrays.org.