Suffix Arrays

Suffix Arrays A New Method for Online String Searches U.Manber and G.Myers


Page 1: Suffix Arrays

Suffix Arrays

A New Method for Online String Searches

U.Manber and G.Myers

Page 2: Suffix Arrays

Introduction - String matching

Let A = a_0 a_1 ... a_{N-1} be a large text of length N

Let W = w_0 w_1 ... w_{P-1} be a word of length P

Is W a substring of A?

Page 3: Suffix Arrays

Introduction - Suffix Trees

Build time O(N)

Search time O(P)

Structure space O(N), with a big constant

Dependent on |Σ|

Page 4: Suffix Arrays

Suffix Arrays

An array of all the suffixes of A, sorted in lexicographical order

A = aababa

Sorted suffixes: a, aababa, aba, ababa, ba, baba

Page 5: Suffix Arrays

A = aababa

A_i = a_i a_{i+1} ... a_{N-1}: the suffix of A that starts at position i.

Position array (Pos):
Pos[k] is the start position of the kth smallest suffix
A_Pos[k] is the suffix pointed to from Pos[k], i.e. the kth smallest suffix

k:      0 1 2 3 4 5
Pos[k]: 5 0 3 1 4 2
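
The Pos array for the example can be reproduced with a few lines of Python. This is only a naive sketch for checking small examples (it sorts whole suffix strings, so it is far slower than the method in these slides); the function name `suffix_array_naive` is an assumption made for illustration.

```python
def suffix_array_naive(text):
    """Return Pos: the start positions of all suffixes of `text`,
    listed in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


text = "aababa"
pos = suffix_array_naive(text)
for k, start in enumerate(pos):
    # k, Pos[k], and the suffix A_Pos[k]
    print(k, start, text[start:])
```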

Page 6: Suffix Arrays

Searching

“Is W a substring of A?”

W is a substring of A ⇔ some suffix A_i starts with W; i is then W’s location

All the instances of W match consecutive suffixes in the array

Find the array interval that contains those suffixes

Page 7: Suffix Arrays

Searching - Definitions

For a string u: u^p = u_0 u_1 ... u_{p-1} (the first p symbols of u)

For strings u, v: u ≤_p v ⇔ u^p ≤ v^p

Same for ≠, =, >, …

For any p, Pos is ordered according to ≤_p

Page 8: Suffix Arrays

Searching - Definitions

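W = w_0 w_1 … w_{P-1} (so p = P in the comparisons below)

L_W = min (k : W ≤_p A_Pos[k] or k = N): the first suffix that is ≥_p W

R_W = max (k : A_Pos[k] ≤_p W or k = -1): the last suffix that is ≤_p W

[Figure: the sorted array split at L_W and R_W. For k < L_W, W >_p A_Pos[k]; for L_W ≤ k ≤ R_W, W =_p A_Pos[k]; for k > R_W, W <_p A_Pos[k]]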

Page 9: Suffix Arrays

Search Algorithm

k ∈ [L_W, R_W] ⇔ W =_p A_Pos[k]

To find W’s instances, find [L_W, R_W]

Number of W’s occurrences is (R_W - L_W + 1)

Matches are A_Pos[L_W], …, A_Pos[R_W]

Suffix array is sorted - use binary search

Page 10: Suffix Arrays

Binary Search

Search interval [L, R], midpoint M

Compare W to A_Pos[M] and decide where to search next:
W ≤_p A_Pos[M] - search in left half (R = M)
W >_p A_Pos[M] - search in right half (L = M)

O(P log N)

[Figure: binary search for W = abc over a sorted array of suffixes, with L, M and R marked]
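
The plain O(P log N) search can be sketched as two binary searches, one for L_W and one for R_W. This is a minimal illustration rather than the authors' code: the helper names are assumptions, and Python string slicing is used to take the first P symbols of each suffix.

```python
def suffix_array_naive(text):
    """Naive Pos array, good enough for small examples."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_interval(text, pos, w):
    """Return (L_W, R_W). The suffixes starting with w are exactly
    A_Pos[L_W], ..., A_Pos[R_W]; the interval is empty if R_W < L_W."""
    n, p = len(pos), len(w)

    # L_W: smallest k with W <=_p A_Pos[k], or n if there is none.
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if w <= text[pos[mid]:pos[mid] + p]:
            hi = mid
        else:
            lo = mid + 1
    left = lo

    # R_W: largest k with A_Pos[k] <=_p W, or -1 if there is none.
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if text[pos[mid]:pos[mid] + p] <= w:
            lo = mid + 1
        else:
            hi = mid
    right = lo - 1
    return left, right


text = "aababa"
pos = suffix_array_naive(text)
left, right = find_interval(text, pos, "ab")
print(right - left + 1, sorted(pos[left:right + 1]))
```

Each of the O(log N) steps compares up to P symbols, which is where the O(P log N) bound comes from.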

Page 11: Suffix Arrays

Search Algorithm

Observation: We can use information from one comparison to speed up the next comparisons

Use additional information: lcp = longest common prefix

Page 12: Suffix Arrays

Search Algorithm - lcp

lcp(v,w) = the length of the longest common prefix of v and w

Obtained by comparing v and w and stopping at the first unequal symbol

Use precomputed lcp information to reduce the number of comparisons to O(P + logN)

Page 13: Suffix Arrays

Search Algorithm

Consider all possible midpoints M = 1…N-2

Every midpoint corresponds to a triplet [L_M, M, R_M]

Suppose we precomputed two arrays:
Llcp[M] = lcp(A_Pos[L_M], A_Pos[M])
Rlcp[M] = lcp(A_Pos[M], A_Pos[R_M])

Page 14: Suffix Arrays

Search Algorithm

Maintain two more variables:
l = lcp(A_Pos[L], W)
r = lcp(W, A_Pos[R])

[Figure: example with W = abcd, l = 2, r = 1, Llcp[M] = 1, Rlcp[M] = 1; positions L_M, M and R_M are marked in the sorted array]

Page 15: Suffix Arrays

Search Algorithm

Assume l ≥ r; compare l with Llcp[M]

If l < Llcp[M]:
W >_{l+1} A_Pos[L_M]
A_Pos[L_M] =_{l+1} A_Pos[M]
therefore W >_{l+1} A_Pos[M]

Go right! l remains unchanged

[Figure: example with W = abcd, l = 2, r = 1, Llcp[M] = 3, Rlcp[M] = 1]

Page 16: Suffix Arrays

Search Algorithm

If l > Llcp[M]:
A_Pos[L_M] <_l A_Pos[M]
W =_l A_Pos[L_M]
therefore W <_l A_Pos[M]

Go left! r = Llcp[M]

[Figure: example with W = abcd, l = 2, r = 1, Llcp[M] = 1, Rlcp[M] = 1]

Page 17: Suffix Arrays

Search Algorithm

If l = Llcp[M]:
W can be in either half
Start comparing W and A_Pos[M] from the (l+1)th symbol
First unequal symbol determines whether to go right or left
r or l will be updated to l+j, at a cost of j+1 comparisons

[Figure: example with W = abcd, l = 2, r = 1, Llcp[M] = 2, Rlcp[M] = 1]
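
The three cases can be put together into a search for L_W (the search for R_W is symmetric). The sketch below is one way to write it under stated assumptions: Llcp and Rlcp are filled in naively by walking the same midpoints the binary search visits, a suffix that runs out of symbols is treated as smaller than W, and all function names are hypothetical.

```python
def lcp(a, i, b, j):
    """Length of the longest common prefix of a[i:] and b[j:]."""
    k = 0
    while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
        k += 1
    return k


def build_lcp_tables(text, pos):
    """Naively fill Llcp[M], Rlcp[M] for every midpoint M of the binary search."""
    n = len(pos)
    llcp, rlcp = [0] * n, [0] * n
    stack = [(0, n - 1)]
    while stack:
        lo, hi = stack.pop()
        if hi - lo <= 1:
            continue
        mid = (lo + hi) // 2
        llcp[mid] = lcp(text, pos[lo], text, pos[mid])
        rlcp[mid] = lcp(text, pos[mid], text, pos[hi])
        stack += [(lo, mid), (mid, hi)]
    return llcp, rlcp


def w_le_suffix(text, start, w, m):
    """True if W <=_p A_start, given that m = lcp(W, A_start)."""
    if m == len(w):
        return True            # W is a prefix of the suffix
    if start + m == len(text):
        return False           # suffix ran out first, so it is smaller than W
    return w[m] <= text[start + m]


def find_left(text, pos, llcp, rlcp, w):
    """Return L_W using l, r and the precomputed Llcp/Rlcp tables."""
    n = len(pos)
    if n == 0:
        return 0
    l = lcp(text, pos[0], w, 0)
    r = lcp(text, pos[n - 1], w, 0)
    if w_le_suffix(text, pos[0], w, l):
        return 0
    if not w_le_suffix(text, pos[n - 1], w, r):
        return n
    lo, hi = 0, n - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if l >= r:
            # Only compare symbols when Llcp[M] >= l; otherwise the answer is known.
            m = l + lcp(text, pos[mid] + l, w, l) if llcp[mid] >= l else llcp[mid]
        else:
            m = r + lcp(text, pos[mid] + r, w, r) if rlcp[mid] >= r else rlcp[mid]
        if w_le_suffix(text, pos[mid], w, m):
            hi, r = mid, m     # go left
        else:
            lo, l = mid, m     # go right
    return hi


text = "assassin"
pos = sorted(range(len(text)), key=lambda i: text[i:])
llcp, rlcp = build_lcp_tables(text, pos)
print(find_left(text, pos, llcp, rlcp, "ss"))
```

When Llcp[M] > l the symbol comparison fails on its first symbol, which reproduces the "go right, l unchanged" case; when Llcp[M] < l no symbols of W are compared at all.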

Page 18: Suffix Arrays

Search Algorithm - Complexity

In each iteration:
Let h = max(l, r)
We start comparing from the hth symbol up to the (h+j+1)th: j+1 symbol comparisons
Next time we will start from the (h+j)th symbol
j symbols out of the j+1 will not be compared again

Page 19: Suffix Arrays

Search Algorithm - Complexity

Every symbol in W will be successfully matched at most once: O(P) successful comparisons

At most one symbol will be unsuccessfully matched in each iteration: O(log N) unsuccessful comparisons

Total: O(P + log N) comparisons

Page 20: Suffix Arrays

Build Suffix Array

So far…
An O(P + log N) search algorithm
Given a sorted suffix array
Given lcp information (Llcp, Rlcp)

Next…
Sort the suffix array in O(N log N)
Compute the lcp’s while sorting the array

Page 21: Suffix Arrays

Sort Algorithm

First stage:
Sort the suffixes into buckets, according to first symbol

Inductive stage:
Assume array is bucket sorted according to first H symbols
Every H-bucket holds suffixes with the same H first symbols
Buckets are ordered according to the ≤_H relation
Sort according to 2H first symbols

Page 22: Suffix Arrays

Sort Algorithm – Intuition

Let A_i, A_j be two suffixes in the same H-bucket

A_i =_H A_j

Next H symbols of A_i and A_j are the first H symbols of A_{i+H} and A_{j+H}

In order to determine the ≤_2H order of A_i and A_j, look at the ≤_H order of A_{i+H} and A_{j+H}

[Figure: example with A = aababaa, H = 2, showing A_i, A_j, A_{i+H} and A_{j+H} in the array]

Page 23: Suffix Arrays

Sort Algorithm – Main Idea

Let A_i be a suffix in the first H-bucket

A_i starts with the smallest H-symbol string

A_{i-H} should be the first in its 2H-bucket

[Figure: example with A = aababa, H = 1]

Page 24: Suffix Arrays

Sort Algorithm

In stage H:
Go over all the suffixes in the ≤_H order
For each A_i, move A_{i-H} to the next available place in its H-bucket

The suffixes are now sorted according to the ≤_2H order

Go on to stage 2H to produce ≤_4H order
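
The doubling idea (the ≤_2H order of A_i is decided by the ≤_H ranks of A_i and A_{i+H}) can be sketched in a few lines. Note that this is a simplified variant, not the bucket-moving pass described above: it re-sorts rank pairs with Python's built-in sort, so each stage costs O(N log N) instead of O(N). The function name is an assumption.

```python
def suffix_array_doubling(text):
    """Sort suffixes by their first 1, 2, 4, ... symbols until all ranks differ."""
    n = len(text)
    pos = list(range(n))
    rank = [ord(c) for c in text]   # rank by first symbol (H = 1)
    h = 1
    while True:
        # Key for the <=_2H order: (H-rank of A_i, H-rank of A_{i+H});
        # -1 stands for "no symbols left" and sorts before every real rank.
        def key(i):
            return (rank[i], rank[i + h] if i + h < n else -1)

        pos.sort(key=key)
        new_rank = [0] * n
        for k in range(1, n):
            # A suffix opens a new 2H-bucket when its key differs from its predecessor's.
            new_rank[pos[k]] = new_rank[pos[k - 1]] + (key(pos[k]) != key(pos[k - 1]))
        rank = new_rank
        if n == 0 or rank[pos[-1]] == n - 1:
            return pos              # every suffix is alone in its bucket
        h *= 2


print(suffix_array_doubling("assassin"))
```

Because the number of compared symbols doubles each stage, there are at most O(log N) stages, as in the slides.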

Page 25: Suffix Arrays

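Sort Algorithm - Example

A = assassin (positions 0 to 7)

H = 1 order: A3 A0 A6 A7 A1 A5 A4 A2
(assin, assassin, in, n, ssassin, sin, ssin, sassin)

[Figure: stage H = 1 moves each A_{i-H} to the next available place in its bucket, producing the H = 2 order]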

Page 26: Suffix Arrays

Sort Algorithm - Example

A = assassin

H = 2 order: A0 A3 A6 A7 A2 A5 A4 A1
(assassin, assin, in, n, sassin, sin, ssin, ssassin)

H = 4 order: A0 A3 A6 A7 A2 A5 A1 A4
(assassin, assin, in, n, sassin, sin, ssassin, ssin)

Page 27: Suffix Arrays

Sort Algorithm - Complexity

First stage:
Bucket sort according to first symbol: O(N log N)

Inductive stages:
O(log N) stages, O(N) per stage

Total: O(N log N)

Space:
Can be implemented using two N-sized integer arrays

Page 28: Suffix Arrays

Finding Longest Common Prefixes

The search algorithm uses lcp information:
Llcp[M] = lcp(A_Pos[L_M], A_Pos[M])
Rlcp[M] = lcp(A_Pos[M], A_Pos[R_M])

We want to compute this information while we are sorting the array

Page 29: Suffix Arrays

Finding Longest Common Prefixes

Show how to compute lcp’s for suffixes in adjacent H-buckets during the sort algorithm

Use that to compute the lcp’s of all the suffixes that are consecutive in the sorted suffix array

Show how to compute lcps for all the necessary suffixes

Page 30: Suffix Arrays

Finding LCP for adjacent buckets

After the first sort stage, the lcp of suffixes in adjacent buckets is 0

Assume after stage H we know the lcps between suffixes in adjacent H-buckets

Suppose A_p and A_q are in the same H-bucket but not in the same 2H-bucket:
H ≤ lcp(A_p, A_q) < 2H
lcp(A_p, A_q) = H + lcp(A_{p+H}, A_{q+H})
lcp(A_{p+H}, A_{q+H}) < H

Page 31: Suffix Arrays

Finding LCP for adjacent buckets

Let i, j be the positions of A_{p+H}, A_{q+H} in the suffix array

Assume i < j

Array is ordered according to the <_H order

lcp(A_Pos[i], A_Pos[j]) = min over k ∈ [i+1, j] of lcp(A_Pos[k-1], A_Pos[k])

[Figure: example with A = aababa, H = 1, with positions i and j marked]

Page 32: Suffix Arrays

LCP Data Structures – Hgt[]

We need a data structure that will allow us to:
get the lcp’s of consecutive suffixes
get their minimum

Hgt[] – an N-1 sized array
Hgt[i] = lcp(A_Pos[i-1], A_Pos[i])
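
To make the role of Hgt[] concrete, the sketch below computes it naively for a sorted suffix array and checks that the lcp of any two suffixes equals the minimum of the Hgt values between them. It is an illustration under assumptions: the names are hypothetical, Hgt is stored in an N-sized list with index 0 unused so that indices match the definition above, and the values are computed by direct comparison rather than during the sort.

```python
def lcp(u, v):
    """Length of the longest common prefix of strings u and v."""
    k = 0
    while k < len(u) and k < len(v) and u[k] == v[k]:
        k += 1
    return k


def height_array(text, pos):
    """hgt[i] = lcp(A_Pos[i-1], A_Pos[i]) for i = 1..N-1; hgt[0] is unused."""
    hgt = [0] * len(pos)
    for i in range(1, len(pos)):
        hgt[i] = lcp(text[pos[i - 1]:], text[pos[i]:])
    return hgt


def lcp_of_positions(hgt, i, j):
    """lcp(A_Pos[i], A_Pos[j]) for i < j, as the minimum of hgt[i+1..j]."""
    return min(hgt[i + 1:j + 1])


text = "assassin"
pos = [0, 3, 6, 7, 2, 5, 1, 4]          # sorted suffix array of "assassin"
hgt = height_array(text, pos)
print(hgt[1:])                          # lcp's of consecutive suffixes
print(lcp_of_positions(hgt, 4, 7))      # lcp(sassin, ssin)
```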

Page 33: Suffix Arrays

LCP Data Structures – Hgt[]

Hgt will be computed inductively throughout the sort
Initialized to N+1
Hgt[i] is updated in stage 2H if A_Pos[i] started a new 2H-bucket

To update Hgt[i]:
Let a, b be the array positions of A_{Pos[i-1]+H} and A_{Pos[i]+H}
Assume a ≤ b
Hgt[i] = H + min over k ∈ [a+1, b] of Hgt[k]

Page 34: Suffix Arrays

Finding LCP - Example

A = assassin; entries not yet set hold N+1 = 9

H = 1: assin, assassin, in, n, ssassin, sin, ssin, sassin
Hgt:   9 0 0 0 9 9 9

H = 2: assassin, assin, in, n, sassin, sin, ssin, ssassin
Hgt:   9 0 0 0 1 1 9

H = 4: assassin, assin, in, n, sassin, sin, ssassin, ssin
Hgt:   3 0 0 0 1 1 2

lcp(sin, ssin) = 1 + lcp(in, sin) = 1 + min(lcp(in, n), lcp(n, sassin), lcp(sassin, sin)) = 1 + 0 = 1

lcp(sassin, sin) = 1 + lcp(assin, in) = 1

Page 35: Suffix Arrays

LCP Data Structures - Interval Tree

We need the following operations for Hgt[]:
Set(i, h) – sets Hgt[i] to h
Min_height(i, j) – determines min(Hgt[k]) over k ∈ [i, j]

We need to find a way to find the lcp’s for all the necessary suffixes – not just the ones in consecutive positions

Page 36: Suffix Arrays

LCP Data Structures - Interval Tree

A full and balanced binary tree
N-1 leaves, corresponding to Hgt[]
O(log N) height, N-2 interior vertices
Keep a Hgt value for each interior vertex as well:
Hgt[v] = min(Hgt[left(v)], Hgt[right(v)])

Page 37: Suffix Arrays

LCP Data Structures - Interval Tree

Operations implementation:

Set(i, h): set Hgt[i] to h and update the Hgt values on the path from i to the root

Min_height(i, j): finds the minimal Hgt value by scanning O(log N) vertices in the tree

Operations complexity – O(log N)
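
A tree with these two operations can be sketched as an array-based minimum tree. This is a standard bottom-up layout chosen for brevity, not necessarily the exact tree shape used in the slides; the class name `MinTree` and its method names are assumptions.

```python
class MinTree:
    """Array-based tree over `size` leaves; every interior vertex stores
    the minimum of its two children."""

    def __init__(self, size, initial):
        self.size = size
        # tree[size + i] is leaf i; tree[1] is the root; tree[0] is unused.
        self.tree = [initial] * (2 * size)

    def set(self, i, h):
        """Set leaf i to h and refresh the minima on the path to the root."""
        k = i + self.size
        self.tree[k] = h
        k //= 2
        while k >= 1:
            self.tree[k] = min(self.tree[2 * k], self.tree[2 * k + 1])
            k //= 2

    def min_height(self, i, j):
        """Minimum over leaves i..j (inclusive), visiting O(log N) vertices."""
        lo, hi = i + self.size, j + self.size + 1
        best = float("inf")
        while lo < hi:
            if lo & 1:
                best = min(best, self.tree[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                best = min(best, self.tree[hi])
            lo //= 2
            hi //= 2
        return best


# Seven leaves for the lcp's of consecutive suffixes of "assassin",
# initialised to N + 1 = 9 and then set to their final values.
tree = MinTree(7, 9)
for leaf, value in enumerate([3, 0, 0, 0, 1, 1, 2]):
    tree.set(leaf, value)
print(tree.min_height(4, 6))
```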

Page 38: Suffix Arrays

Finding LCP – Interval Tree

[Figure: interval tree over the leaves (0,1), (1,2), (2,3), (3,4), (4,5), (5,6), (6,7), with the Hgt value stored at each vertex]

Page 39: Suffix Arrays

Finding LCP - Complexity

In stage 2H we update Hgt[i] for all the leaves that started new buckets
Each update is one Set operation and one Min_height - O(log N)
Throughout the algorithm every leaf is updated exactly once - O(N) updates
Updates complexity: O(N log N)

In each stage we scan the array to see which suffixes opened new buckets
Scans complexity: O(N log N)

Total LCP complexity: O(N log N)

Page 40: Suffix Arrays

Finding LCP - Llcp[] and Rlcp[]

We want Llcp[] and Rlcp[] to be available directly from the interval tree at the end of the sort

Use an interval tree that represents a binary search
Each interior node corresponds to (L_M, R_M) for some M
For each interior node (L_M, R_M):
Left(L_M, R_M) = (L_M, M)
Right(L_M, R_M) = (M, R_M)
N-2 interior nodes
Leaves correspond to (i-1, i)
Leaf(i-1, i) = Hgt[i]

Page 41: Suffix Arrays

Finding LCP - Llcp[] and Rlcp[]

According to the interval tree structure:
Hgt[(L, R)] = min over k ∈ [L+1, R] of Hgt[k]
Hgt[(L, R)] = lcp(A_Pos[L], A_Pos[R])

Llcp[M] = Hgt[(L_M, M)]
Rlcp[M] = Hgt[(M, R_M)]
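
Given a finished Hgt[], both tables follow from one bottom-up pass over the binary-search intervals. The sketch below assumes Hgt is stored in an N-sized list with index 0 unused and that midpoints are chosen as M = (L + R) // 2; the function name is hypothetical.

```python
def build_llcp_rlcp(hgt):
    """Fill Llcp[M] and Rlcp[M] from hgt, where hgt[i] = lcp(A_Pos[i-1], A_Pos[i])."""
    n = len(hgt)
    llcp, rlcp = [0] * n, [0] * n

    def interval_lcp(lo, hi):
        # Value of the tree node (lo, hi): lcp(A_Pos[lo], A_Pos[hi]).
        if hi - lo == 1:
            return hgt[hi]                      # leaf (hi-1, hi)
        mid = (lo + hi) // 2
        llcp[mid] = interval_lcp(lo, mid)       # left child (L_M, M)
        rlcp[mid] = interval_lcp(mid, hi)       # right child (M, R_M)
        return min(llcp[mid], rlcp[mid])

    if n >= 2:
        interval_lcp(0, n - 1)
    return llcp, rlcp


# Hgt for "assassin" (sorted order: assassin, assin, in, n, sassin, sin, ssassin, ssin)
hgt = [0, 3, 0, 0, 0, 1, 1, 2]
llcp, rlcp = build_llcp_rlcp(hgt)
print(llcp, rlcp)
```

Each node is visited once, so the pass takes O(N) time, and the recursion depth is only O(log N).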

Page 42: Suffix Arrays

Worst Case Complexity

                  Suffix Array              Suffix Tree
Build time        O(N log N)                O(N)
Search time       O(P + log N)              O(P)
Structure space   O(N), 2N - 3N integers    O(N), big constant
Alphabet          Independent of |Σ|        Dependent on |Σ|

Page 43: Suffix Arrays

Expected Time Improvements

Improve the expected case time of:
Search Algorithm
Sort Algorithm
LCP computation

Use the following assumption:
All N-symbol strings are equally likely

Under this assumption:
Expected length of longest repeated substring of A is O(log_|Σ| N)

Page 44: Suffix Arrays

Expected Case Improvements - Main Idea

Let T = ⌊log_|Σ| N⌋

Let Int_T(u) = integer encoding in base |Σ| of the T-symbol prefix of u

Example: T = 3, Σ = {a, b}, u = abaa: Int_T(u) = 010 (base 2) = 2

There are |Σ|^T ≤ N possible T-symbol prefixes
Int_T(u) is a number in [0, N-1]

Map each suffix A_p to Int_T(A_p)
Can be done in O(N) time
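
The encoding and its O(N) computation over all positions can be sketched as follows. This is an illustration under assumptions: symbols are mapped to digits by their sorted order in the alphabet, only positions with a full T-symbol window are encoded (how shorter suffixes are padded is not specified here), and the function names are hypothetical.

```python
def int_t(u, t, alphabet):
    """Base-|alphabet| integer encoding of the first t symbols of u."""
    digit = {c: d for d, c in enumerate(sorted(alphabet))}
    value = 0
    for c in u[:t]:
        value = value * len(digit) + digit[c]
    return value


def all_int_t(text, t, alphabet):
    """Int_T of every length-t window of text, computed with a rolling update."""
    digit = {c: d for d, c in enumerate(sorted(alphabet))}
    base = len(digit)
    if t < 1 or len(text) < t:
        return []
    top = base ** (t - 1)                  # weight of the leading digit
    value = int_t(text, t, alphabet)
    codes = [value]
    for i in range(1, len(text) - t + 1):
        # Drop the leading digit, shift, and append the next symbol's digit.
        value = (value % top) * base + digit[text[i + t - 1]]
        codes.append(value)
    return codes


print(int_t("abaa", 3, "ab"))          # the example above: 010 in base 2
print(all_int_t("aababa", 2, "ab"))
```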

Page 45: Suffix Arrays

Expected Case Improvements - Search Algorithm

Use an additional array Buck[]
Think of the sorted array as buckets, based on the Int_T encoding

Buck[k] = min{ i | Int_T(A_Pos[i]) = k }
The first position that contains a suffix that’s mapped to k

Compute Buck[] at the end of the sort algorithm
O(N) additional time

Page 46: Suffix Arrays

Expected Case Improvements - Search Algorithm

Given a word W, we need to find L_W and R_W

Let k = Int_T(W)

L_W and R_W must be in k’s bucket (Buck[k], Buck[k+1])

We only need to search one bucket

Page 47: Suffix Arrays

Expected Case Improvements - Search Algorithm

Number of buckets = |Σ|^T ≤ N
Average number of elements in a bucket = O(1)

In the binary search for W:
Expected size of bucket to search = O(1)
Expected number of search steps: O(1)

Expected case time: O(P)

Page 48: Suffix Arrays

Expected Case Improvements - Sort Algorithm

First stage of sort: sort according to first symbol

Replace first stage with sort according to Int_T
Equivalent to sort according to first T symbols
Can be done in O(N) time

We changed the base case of the sort from H=1 to H=T

Page 49: Suffix Arrays

Expected Case Improvements - Sort Algorithm

Observation:
Let C be the length of the longest repeated substring of A
Sort is in fact complete once we have reached (C+1)-buckets

Suppose some (C+1)-bucket contains more than one suffix
Then we have two suffixes with lcp > C
This prefix is a repeated substring longer than C - contradiction

Page 50: Suffix Arrays

Expected Case Improvements - Sort Algorithm

Expected case: C = O(log_|Σ| N) = O(T)
Number of stages: O(1)

Expected case time: O(N)

Page 51: Suffix Arrays

Expected Case Improvements - LCP Computation

Replace interval tree with sort history:
Binary tree
Models the refinement of buckets during the sort
A vertex for each H-bucket
Each vertex holds the stage number at which its bucket was split

Page 52: Suffix Arrays

Expected Case Improvements - LCP Computation

Leaves correspond to suffixes and are arranged in an N element array

Each vertex has at least two children

O(N) nodes

Can be built with O(N) additional time during the sort

Page 53: Suffix Arrays

Expected Case Improvements - LCP Computation

Given the sort history we can compute lcp(A_p, A_q):
Find the nca (nearest common ancestor) of A_p and A_q
Let H be the nca’s stage number
lcp(A_p, A_q) = H + lcp(A_{p+H}, A_{q+H})
Recursively compute lcp(A_{p+H}, A_{q+H})
Stop when the nca is the root

Page 54: Suffix Arrays

Expected Case Improvements - LCP Computation

Each step is O(1)
At each step the stage number of the nca is at least halved

Suppose we stop the recursion when H < T’ = ⌊(1/2) log_|Σ| N⌋

Expected length of longest repeated substring is O(T)
Expected case lcp is O(T) = O(log_|Σ| N)

Page 55: Suffix Arrays

Expected Case Improvements - LCP Computation

O(1) recursive steps in the expected case

Expected case time for one lcp: O(1)

Expected case time for computing Llcp[], Rlcp[]: O(N)

Page 56: Suffix Arrays

Expected Case Improvements - LCP Computation

We need a way to find lcp’s that are known to be less than T’

Build a |Σ|^T’ x |Σ|^T’ array:
Lookup[Int_T’(x), Int_T’(y)] = lcp(x, y) for all T’-symbol strings x, y
Max N entries (|Σ|^T’ ≤ √N)
Compute incrementally in O(N)
Final recursion steps are replaced by O(1) lookup

Page 57: Suffix Arrays

Expected Time Complexity

Search time O(P)

Sort + LCP computation time O(N)