SUFFIX ARRAYS musings on the data structure, applications...

of 80/80
SUFFIX ARRAYS musings on the data structure, applications and implementation in Java Michal Nowak, Dawid Weiss IDSS Seminar, 11/2009
  • date post

    15-Jul-2020
  • Category

    Documents

  • view

    2
  • download

    0

Embed Size (px)

Transcript of SUFFIX ARRAYS musings on the data structure, applications...

  • SUFFIX ARRAYSmusings on the data structure, applications and implementation in Java

    Michał Nowak, Dawid Weiss

    IDSS Seminar, 11/2009

  • Notation

    • Let Σ be an alphabet;• Σ is finite and non-empty;• symbols xa ∈ Σ, where a = 1 . . . |Σ| are ordered;• O(xb − xc) = 1;• $ is a special symbol smaller than any x ∈ Σ.

    • Let X be a sequence of n symbols, where X [i] ∈ Σ andi = 0 . . . n − 1.

    • Let S(i) or i be a sub-sequence of X starting at positioni and ending at n − 1.

  • Notation

    • Let Σ be an alphabet;• Σ is finite and non-empty;• symbols xa ∈ Σ, where a = 1 . . . |Σ| are ordered;• O(xb − xc) = 1;• $ is a special symbol smaller than any x ∈ Σ.

    • Let X be a sequence of n symbols, where X [i] ∈ Σ andi = 0 . . . n − 1.

    • Let S(i) or i be a sub-sequence of X starting at positioni and ending at n − 1.

  • Examples

    • Zero-terminated US-ASCII strings (Latin letters).• DNA sequences.• Integer-coded sequences of words.• . . .

  • (recall)

    Suffix Trees

  • suffix S(i) ic a c a o $ 0

    a c a o $ 1c a o $ 2

    a o $ 3o $ 4

    $ 5

  • suffix S(i) ic a c a o $ 0

    a c a o $ 1c a o $ 2

    a o $ 3o $ 4

    $ 5

  • Suffixes of “cacao”.

  • Building the trie.

  • Suffix trie for “cacao”.

  • Compacting (pocket trees)

    1 Move labels to edges,

    2 collapse nodes with a single descendant.

  • Suffix tree for “cacao”.

  • Properties of suffix trees

    • Maximum 2n nodes (!).• Elegant to program with.• Construction in O(n) time (Esko Ukkonen, 1995).

  • Applications

  • P1: Search for a substring m in Θ(|m|) steps.

  • P2: The longest repeated sequence.

  • ?P3: The longest common substring of m strings.

  • alibaba.taliban.

    P3: The longest common substring of m strings.

  • P3: The longest common substring of m strings.

  • What do you think,do geese see God?

    P4: The longest palindrome.

  • whatdoyouthinkdogeeseseegod.dogeeseseegodkniht...

    P4: The longest palindrome.

  • There are many more ST applications.

    Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Scienceand Computational Biology. Cambridge University Press.

  • Problems with suffix trees

    • Alphabet size.• Processing locality.

  • Suffix Arrays

  • Ordered depth-first ST traversal.

    $a · c a o $a · o $c a · c a o $c a · o $o $

  • suffix S(i) i

    c a c a o $ 0a c a o $ 1

    c a o $ 2a o $ 3

    o $ 4$ 5

    → sort →

    suffix S(i) SA[j]

    $ 5a · c a o $ 1a · o $ 3c a · c a o $ 0c a · o $ 2o $ 4

    Conclusions:

    • A suffix array is an array of indices of sorted suffixes.• A suffix array is always a permutation of suffix indices.• Memory consumption: sizeof (index)× n.• Suffix arrays and suffix trees can simulate each other.

    (Abouelhoda et al., 2004).

    • Suffix sorting can be done in O(n);naïve: suffix tree + traversal, SA-specific not until 2003.

  • suffix S(i) i

    c a c a o $ 0a c a o $ 1

    c a o $ 2a o $ 3

    o $ 4$ 5

    → sort →

    suffix S(i) SA[j]

    $ 5a · c a o $ 1a · o $ 3c a · c a o $ 0c a · o $ 2o $ 4

    Conclusions:

    • A suffix array is an array of indices of sorted suffixes.• A suffix array is always a permutation of suffix indices.• Memory consumption: sizeof (index)× n.• Suffix arrays and suffix trees can simulate each other.

    (Abouelhoda et al., 2004).

    • Suffix sorting can be done in O(n);naïve: suffix tree + traversal, SA-specific not until 2003.

  • suffix S(i) i

    c a c a o $ 0a c a o $ 1

    c a o $ 2a o $ 3

    o $ 4$ 5

    → sort →

    suffix S(i) SA[j]

    $ 5a · c a o $ 1a · o $ 3c a · c a o $ 0c a · o $ 2o $ 4

    Conclusions:

    • A suffix array is an array of indices of sorted suffixes.• A suffix array is always a permutation of suffix indices.• Memory consumption: sizeof (index)× n.• Suffix arrays and suffix trees can simulate each other.

    (Abouelhoda et al., 2004).

    • Suffix sorting can be done in O(n);naïve: suffix tree + traversal, SA-specific not until 2003.

  • suffix S(i) i

    c a c a o $ 0a c a o $ 1

    c a o $ 2a o $ 3

    o $ 4$ 5

    → sort →

    suffix S(i) SA[j]

    $ 5a · c a o $ 1a · o $ 3c a · c a o $ 0c a · o $ 2o $ 4

    Conclusions:

    • A suffix array is an array of indices of sorted suffixes.

    • A suffix array is always a permutation of suffix indices.• Memory consumption: sizeof (index)× n.• Suffix arrays and suffix trees can simulate each other.

    (Abouelhoda et al., 2004).

    • Suffix sorting can be done in O(n);naïve: suffix tree + traversal, SA-specific not until 2003.

  • suffix S(i) i

    c a c a o $ 0a c a o $ 1

    c a o $ 2a o $ 3

    o $ 4$ 5

    → sort →

    suffix S(i) SA[j]

    $ 5a · c a o $ 1a · o $ 3c a · c a o $ 0c a · o $ 2o $ 4

    Conclusions:

    • A suffix array is an array of indices of sorted suffixes.• A suffix array is always a permutation of suffix indices.

    • Memory consumption: sizeof (index)× n.• Suffix arrays and suffix trees can simulate each other.

    (Abouelhoda et al., 2004).

    • Suffix sorting can be done in O(n);naïve: suffix tree + traversal, SA-specific not until 2003.

  • suffix S(i) i

    c a c a o $ 0a c a o $ 1

    c a o $ 2a o $ 3

    o $ 4$ 5

    → sort →

    suffix S(i) SA[j]

    $ 5a · c a o $ 1a · o $ 3c a · c a o $ 0c a · o $ 2o $ 4

    Conclusions:

    • A suffix array is an array of indices of sorted suffixes.• A suffix array is always a permutation of suffix indices.• Memory consumption: sizeof (index)× n.

    • Suffix arrays and suffix trees can simulate each other.(Abouelhoda et al., 2004).

    • Suffix sorting can be done in O(n);naïve: suffix tree + traversal, SA-specific not until 2003.

  • suffix S(i) i

    c a c a o $ 0a c a o $ 1

    c a o $ 2a o $ 3

    o $ 4$ 5

    → sort →

    suffix S(i) SA[j]

    $ 5a · c a o $ 1a · o $ 3c a · c a o $ 0c a · o $ 2o $ 4

    Conclusions:

    • A suffix array is an array of indices of sorted suffixes.• A suffix array is always a permutation of suffix indices.• Memory consumption: sizeof (index)× n.• Suffix arrays and suffix trees can simulate each other.

    (Abouelhoda et al., 2004).

    • Suffix sorting can be done in O(n);naïve: suffix tree + traversal, SA-specific not until 2003.

  • suffix S(i) i

    c a c a o $ 0a c a o $ 1

    c a o $ 2a o $ 3

    o $ 4$ 5

    → sort →

    suffix S(i) SA[j]

    $ 5a · c a o $ 1a · o $ 3c a · c a o $ 0c a · o $ 2o $ 4

    Conclusions:

    • A suffix array is an array of indices of sorted suffixes.• A suffix array is always a permutation of suffix indices.• Memory consumption: sizeof (index)× n.• Suffix arrays and suffix trees can simulate each other.

    (Abouelhoda et al., 2004).

    • Suffix sorting can be done in O(n);naïve: suffix tree + traversal, SA-specific not until 2003.

  • (selected)

    Algorithms

  • Naïve suffix sortingAlgorithms:

    • Three-way quicksort (Bentley, McIlroy).• Radix sort and combinations.

    Problems due to:

    • suffix comparisons not of O(1) (average LCP),“aaaaaaa...a$”.

    • potentially large and sparse alphabets.

  • Naïve suffix sortingAlgorithms:

    • Three-way quicksort (Bentley, McIlroy).• Radix sort and combinations.

    Problems due to:

    • suffix comparisons not of O(1) (average LCP),“aaaaaaa...a$”.

    • potentially large and sparse alphabets.

  • SACA goals

    • Possibly Θ(n).• Fast in practice (real data).• Lightweight (computation memory).• Cache and IO-aware (linear scans).

  • Source: Puglisi et al., ACM Comput. Surveys.

  • O(n log n): qsufsort• Jesper Larsson, Kunihiko Sadakane; 1999.• Prefix doubling.• Quick sort within buckets.

  • 0 1 2 3 4 5 6 7 8 9 10 11

    x = [ a b e a c a d a b e a $ ]SA1 = [ 11 (0 3 5 7 10) (1 8) 4 6 (2 9) ]

    ISA1 = [ 5 7 11 5 8 5 9 5 7 11 5 0 ]

    ↑ ↑ ↑ ↑ ↑1 4 6 8 11

    i S(i)

    0 abeacadabea$3 acadabea$5 adabea$7 abea$

    10 a$

    → ordered by

    i + h S(i)

    0 + 1 b·eacadabea$3 + 1 c·adabea$5 + 1 d·abea$7 + 1 b·ea$

    10 + 1 $

  • 0 1 2 3 4 5 6 7 8 9 10 11

    x = [ a b e a c a d a b e a $ ]SA1 = [ 11 (0 3 5 7 10) (1 8) 4 6 (2 9) ]

    ISA1 = [ 5 7 11 5 8 5 9 5 7 11 5 0 ]

    ↑ ↑ ↑ ↑ ↑1 4 6 8 11

    i S(i)

    0 abeacadabea$3 acadabea$5 adabea$7 abea$

    10 a$

    → ordered by

    i + h S(i)

    0 + 1 b·eacadabea$3 + 1 c·adabea$5 + 1 d·abea$7 + 1 b·ea$

    10 + 1 $

  • 0 1 2 3 4 5 6 7 8 9 10 11

    x = [ a b e a c a d a b e a $ ]SA1 = [ 11 (0 3 5 7 10) (1 8) 4 6 (2 9) ]

    ISA1 = [ 5 7 11 5 8 5 9 5 7 11 5 0 ]

    ↑ ↑ ↑ ↑ ↑1 4 6 8 11

    i S(i)

    0 abeacadabea$3 acadabea$5 adabea$7 abea$

    10 a$

    → ordered by

    i + h S(i)

    0 + 1 b·eacadabea$3 + 1 c·adabea$5 + 1 d·abea$7 + 1 b·ea$

    10 + 1 $

  • 0 1 2 3 4 5 6 7 8 9 10 11

    x = [ a b e a c a d a b e a $ ]SA1 = [ 11 (0 3 5 7 10) (1 8) 4 6 (2 9) ]

    ISA1 = [ 5 7 11 5 8 5 9 5 7 11 5 0 ]

    ↑ ↑ ↑ ↑ ↑1 4 6 8 11

    i S(i)

    0 abeacadabea$3 acadabea$5 adabea$7 abea$

    10 a$

    → ordered by

    i + h S(i)

    0 + 1 b·eacadabea$3 + 1 c·adabea$5 + 1 d·abea$7 + 1 b·ea$

    10 + 1 $

  • 0 1 2 3 4 5 6 7 8 9 10 11

    x = [ a b e a c a d a b e a $ ]SA1 = [ 11 (0 3 5 7 10) (1 8) 4 6 (2 9) ]

    ISA1 = [ 5 7 11 5 8 5 9 5 7 11 5 0 ]L = [ -1 5 2 -2 2 ]

    SA2 = [ 11 10 (0 7) 3 5 (1 8) 4 6 (2 9) ]ISA2 = [ 3 7 11 4 8 5 9 3 7 11 2 0 ]

    L = [ -2 2 -2 2 -2 2 ]

    SA4 = [ 11 10 (0 7) 3 5 8 1 4 6 9 2 ]ISA4 = [ 3 7 11 4 8 5 9 3 6 10 1 0 ]

    L = [ -2 2 -8 ]

    SA8 = [ 11 10 7 0 3 5 8 1 4 6 9 2 ]ISA8 = [ 3 7 11 4 8 5 9 2 6 10 1 0 ]

    L = [ -12 ]

  • O(n) solution: skewDivide-and-conquer:

    • Bucket-sorting.• Problem splitting and recursion.• Cheap merge from partial SAs.

    12-page paper, C source code included. . .

  • Source: MN thesis.

  • Source: Karkkainen/Sanders.

  • Induced copying

    • Determine different types of suffixes.• Sort a single type of suffixes.• Induce the order of remaining suffixes.

  • jsuffixarrays.org

  • About the project

    • Michał Nowak.• Suffix arrays in Java.• Different algorithms.• JVM benchmark.

  • Algorithms:

    Algorithm Authors Complexity Memory

    skew P. Sanders, J. Kärkkäinen O(n) 10-13nqsufsort J. Larrson, K. Sadakane O(n log n) 8ndeep shallow G. Manzini O(n2 log n) 5ntwo-stage H. Itoh, H. Tanaka O(n2 log n) 5nimpr. two-stage S. Puglisi, M. Maniscalco O(n2 log n) 5nbpr K. B. Schürmann O(n2(log n)−1) 9-10n

    quicksort (naïve algorithm)

    JVMs (in their newest versions):

    • SUN• IBM• JRockit (Oracle)• Apache Harmony• gcj, initially only.

  • Code conversion problems

    • Pointers→ indexed arrays.• Memory reused for different types→ φ.• Stack-allocated structures and arrays→ φ.• Boundary checks penalty.• Language constraints (byte arrays).

  • Code conversion problems

    • Pointers→ indexed arrays.

    • Memory reused for different types→ φ.• Stack-allocated structures and arrays→ φ.• Boundary checks penalty.• Language constraints (byte arrays).

  • Code conversion problems

    • Pointers→ indexed arrays.• Memory reused for different types→ φ.

    • Stack-allocated structures and arrays→ φ.• Boundary checks penalty.• Language constraints (byte arrays).

  • Code conversion problems

    • Pointers→ indexed arrays.• Memory reused for different types→ φ.• Stack-allocated structures and arrays→ φ.

    • Boundary checks penalty.• Language constraints (byte arrays).

  • Code conversion problems

    • Pointers→ indexed arrays.• Memory reused for different types→ φ.• Stack-allocated structures and arrays→ φ.• Boundary checks penalty.

    • Language constraints (byte arrays).

  • Code conversion problems

    • Pointers→ indexed arrays.• Memory reused for different types→ φ.• Stack-allocated structures and arrays→ φ.• Boundary checks penalty.• Language constraints (byte arrays).

  • Engineering fun

  • JIT code dumping1 private static byte bump(byte v) {2 return (byte) (v + 1);3 }4

    5 public static void doLoop(byte [] array) {6 for (int i = 0; i < array.length; i++) {7 array[i] = bump(array[i]);8 }9 }

  • Equivalent C code (gcc -S -O3):1 ...2 .L4:3 leaq 1048576(%rsp), %rdx // The loop’s end address.4 addb $1, (%rax) // bump byte5 addq $1, %rax // increase loop counter.6 cmpq %rdx, %rax7 jne .L4 // repeat until cond. true.

    Java code compiled with gcj -O3 -S Test.java:1 ...2 .L20:3 .p2align 4,,54 jbe .L31 // AOOB if max

  • Java code; JIT-compiled, SUN’s HotSpot, -server:1 mov 0x10(%rsi),%r9d // get array length2 test %r9d,%r9d // check if empty array.3 jle L_OUT4 xor %r11d,%r11d // i = 05 L1: cmp %r9d,%r11d6 jae L_OUT // L_OUT if max >= i7 movslq %r11d,%r10 // store current i8 movsbl 0x18(%rsi,%r10,1),%r8d9 inc %r11d // bump i

    10 inc %r8d // bump value at array11 mov %r8b,0x18(%rsi,%r10,1)12 cmp $0x1,%r11d // jump always (?)13 jl L1

    But also:1 movslq %ecx,%rcx // (Unfolded loop)2 movsbl 0x19(%rbx,%rcx,1),%r9d3 inc %r9d4 mov %r9b,0x19(%rbx,%rcx,1)5 movsbl 0x1a(%rbx,%rcx,1),%r9d6 inc %r9d7 mov %r9b,0x1a(%rbx,%rcx,1)8 movsbl 0x1b(%rbx,%rcx,1),%r9d9 inc %r9d

    10 ...

  • Quiz1 /** Do I look evil to you? */2 public class Example10 {3 private static boolean ready;4

    5 public static void startThread() {6 new Thread() {7 public void run() {8 try { sleep(2000); } catch (Exception e) { /* ignore */ }9 ready = true;

    10 System.out.println("Setting ready to true.");11 }12 }.start();13 }14

    15 public static void main(String [] args) {16 startThread();17 while (!ready) {18 // Do nothing.19 }20 System.out.println("I’m ready.");21 }22 }

  • 1 > gcj -O3 -S Example10.java

    1 ...2 cmpb $0, _ZN3com10dawidweiss9debugging6simple9Example105readyE(%rip)3 jne .L134 .L16: // buhu! :)5 jmp .L166 .p2align 4,,77 .L13:8 ...

  • 1 > gcj -O3 -S Example10.java

    1 ...2 cmpb $0, _ZN3com10dawidweiss9debugging6simple9Example105readyE(%rip)3 jne .L134 .L16: // buhu! :)5 jmp .L166 .p2align 4,,77 .L13:8 ...

  • Other things of interest• Minimizing GC activity.• Memory tracing aspects (AspectJ).

  • Evaluation

  • Evaluation

    • Random input (alphabets 4, 100, 255, variable length).• Gauntlet and Manzini’s corpora.

    • Multiple runs, warmup rounds removed.• Wall time measured (no parallelism other than the GC).• Memory peaks measured.

  • 1

    2

    3

    4

    5

    6

    7

    8

    0 50 100 150 200 250 300

    time

    [s]

    alphabet size [symbols]

    time on constant input, varying alphabet

    BPRDEEP-SHALLOW

    DIVSUFSORTNS

    QSUFSORTSKEW

  • 0

    100

    200

    300

    400

    500

    600

    0 5 10 15 20 25

    mem

    ory

    [MB

    ]

    input size [millions elements]

    memory on random input, alphabet size = 255

    BPRDEEP-SHALLOWDIVSUFSORTNSQSUFSORTSKEW

  • 0

    5

    10

    15

    20

    25

    30

    35

    40

    45

    0 5 10 15 20 25

    time

    [s]

    input size [millions elements]

    time on random input, alphabet size = 255

    BPRDEEP-SHALLOWDIVSUFSORTNSQSUFSORTSKEW

  • 0

    5

    10

    15

    20

    25

    30

    35

    40

    45

    50

    0 5 10 15 20 25

    time

    [s]

    input size [millions elements]

    time on random input, alphabet size = 4

    BPRDEEP-SHALLOWDIVSUFSORTNSQSUFSORTSKEW

  • In reality, data is hardly ever random.

  • 0

    0.2

    0.4

    0.6

    0.8

    1

    1.2

    1.4

    1.6

    SKEW DIVSUFSORT BPR QSUFSORT

    tim

    e [

    s]

    ababab(ab...)ac, 196KB

  • 0

    0.2

    0.4

    0.6

    0.8

    1

    1.2

    1.4

    1.6

    SKEW DIVSUFSORT BPR QSUFSORT

    tim

    e [

    s]

    Where is naïve sort? ababab(ab...)ac, 196KB

  • 0

    20

    40

    60

    80

    100

    120

    140

    sun ibm jrockit harmony

    time

    [s]

    BPRDIVSUFSORT

    QSUFSORTSKEW

  • Summary

    • Suffix arrays are worth remembering.• Fundamental and short algorithms developed now.• http://jsuffixarrays.org.

    http://jsuffixarrays.org

    NotationSuffix TreesDescriptionApplicationsProblems

    Suffix ArraysAlgorithmsProject jsuffixarraysAbout the ProjectEngineering FunEvaluation

    Summary