56 M A C E D O N I A - ª ∞ ∫ ∂ ¢ √ ¡ π ∞ - M A K E D O N I J A
Tries and Suffix Trees - Technical University of · PDF filereport starting positions of all...
Transcript of Tries and Suffix Trees - Technical University of · PDF filereport starting positions of all...
Tries and Suffix Trees Inge Li Gørtz
• String matching problem. Given strings T (text) and P (pattern) over an alphabet Σ, report starting positions of all occurrences of P in T. • Finite automaton: O(mΣ + n) time and space• KMP: O(m+n) time and space
• String indexing problem. Given a string S of characters from an alphabet Σ. Preprocess S into a data structure to support• Search(P): Return starting position of all occurrences of P in S.
• Today: Data structure using O(n) space and supporting Search(P) in O(m) time.
• Applications: • Search engines, e.g. prefix searches.• Finding common substrings of many biological strings• Finding repeating substructures in biological strings• Detecting DNA contamination
String indexing problem
• Tries• Compressed tries• Suffix trees• Applications of suffix trees
Outline
Tries
• Text retrieval
• Trie over the strings: sells, by, the, sea, shells, tea.
Tries
b
yS2
S1
s
s
s
S4 S6 S3
e
e
e
ea al
l
l
l
h h
t
S5
• Text retrieval • Prefix-free?
• Trie over the strings: sells, by, the, sea, shells, tea, she.
Tries
b
yS2
S1
s
s
s
S4 S6 S3
e
e
e
ea al
l
l
l
h h
t
S5
• Text retrieval • Prefix-free?
• Trie over the strings: sells, by, the, sea, shells, tea, she.
Tries
b
y
S2
S1
s
s
sS4 S6 S3
e
e
e
ea al
l
l
l
h h
t
S5
$
$
$
$
$ $S7
$
Tries
b
y
S2
S1
s
s
sS4 S6 S3
e
e
e
ea al
l
l
l
h h
t
S5
$
$
$
$
$ $S7
$
• Text retrieval • Search for “sea”
• Trie over the strings: sells$, by$, the$, sea$, shells$, tea$, she$.
• Text retrieval • Search for “sea”
• Trie over the strings: sells$, by$, the$, sea$, shells$, tea$, she$.
Tries
b
y
S2
S1
s
s
sS4 S6 S3
e
e
e
ea al
l
l
l
h h
t
S5
$
$
$
$
$ $S7
$
• Text retrieval • Search for “sea”
• Trie over the strings: sells$, by$, the$, sea$, shells$, tea$, she$.
Tries
b
y
S2
S1
s
s
sS4 S6 S3
e
e
e
ea al
l
l
l
h h
t
S5
$
$
$
$
$ $S7
$
• Text retrieval • Search for “sea”
• Trie over the strings: sells$, by$, the$, sea$, shells$, tea$, she$.
Tries
b
y
S2
S1
s
s
sS4 S6 S3
e
e
e
ea al
l
l
l
h h
t
S5
$
$
$
$
$ $S7
$
• Text retrieval • Search for “sea”
• Trie over the strings: sells$, by$, the$, sea$, shells$, tea$, she$.
Tries
b
y
S2
S1
s
s
sS4 S6 S3
e
e
e
ea al
l
l
l
h h
t
S5
$
$
$
$
$ $S7
$
• Text retrieval • Search for “sea”
• Trie over the strings: sells$, by$, the$, sea$, shells$, tea$, she$.
Tries
b
y
S2
S1
s
s
sS4 S6 S3
e
e
e
ea al
l
l
l
h h
t
S5
$
$
$
$
$ $S7
$
• Text retrieval • Search for “sea”
• Trie over the strings: sells$, by$, the$, sea$, shells$, tea$, she$.
Tries
b
y
S2
S1
s
s
sS4 S6 S3
e
e
e
ea al
l
l
l
h h
t
S5
$
$
$
$
$ $S7
$
• Text retrieval • Search for “sea”
• Trie over the strings: sells$, by$, the$, sea$, shells$, tea$, she$.
Tries
b
y
S2
S1
s
s
sS4 S6 S3
e
e
e
ea al
l
l
l
h h
t
S5
$
$
$
$
$ $S7
$
• Text retrieval • Search for “sea”
• Trie over the strings: sells$, by$, the$, sea$, shells$, tea$, she$.
Tries
b
y
S2
S1
s
s
sS4 S6 S3
e
e
e
ea al
l
l
l
h h
t
S5
$
$
$
$
$ $S7
$
• Text retrieval • Search for “sea”
• Trie over the strings: sells$, by$, the$, sea$, shells$, tea$, she$.
Tries
b
y
S2
S1
s
s
sS4 S6 S3
e
e
e
ea al
l
l
l
h h
t
S5
$
$
$
$
$ $S7
$
• Text retrieval • Search for “sea”
• Trie over the strings: sells$, by$, the$, sea$, shells$, tea$, she$.
Tries
b
y
S2
S1
s
s
sS4 S6 S3
e
e
e
ea al
l
l
l
h h
t
S5
$
$
$
$
$ $S7
$
• Text retrieval • Search for “sea”
• Trie over the strings: sells$, by$, the$, sea$, shells$, tea$, she$.
S4
Tries
b
y
S2
S1
s
s
sS6 S3
e
e
e
ea al
l
l
l
h h
t
S5
$
$
$
$
$ $S7
$
• Text retrieval • Search for “short”
• Trie over the strings: sells$, by$, the$, sea$, shells$, tea$, she$.
Tries
b
y
S2
S1
s
s
sS4 S6 S3
e
e
e
ea al
l
l
l
h h
t
S5
$
$
$
$
$ $S7
$
• Text retrieval • Search for “short”
• Trie over the strings: sells$, by$, the$, sea$, shells$, tea$, she$.
Tries
b
y
S2
S1
s
s
sS4 S6 S3
e
e
e
ea al
l
l
l
h h
t
S5
$
$
$
$
$ $S7
$
• Text retrieval • Search for “short”
• Trie over the strings: sells$, by$, the$, sea$, shells$, tea$, she$.
Tries
b
y
S2
S1
s
s
sS4 S6 S3
e
e
e
ea al
l
l
l
h h
t
S5
$
$
$
$
$ $S7
$
• Text retrieval • Search for “short”
• Trie over the strings: sells$, by$, the$, sea$, shells$, tea$, she$.
Tries
b
y
S2
S1
s
s
sS4 S6 S3
e
e
e
ea al
l
l
l
h h
t
S5
$
$
$
$
$ $S7
$
• Text retrieval • Search for “short”
• Trie over the strings: sells$, by$, the$, sea$, shells$, tea$, she$.
Tries
b
y
S2
S1
s
s
sS4 S6 S3
e
e
e
ea al
l
l
l
h h
t
S5
$
$
$
$
$ $S7
$
• Build a trie over the strings: by$, sells$, sea$.
Tries
b
y
S2
S1
s
sS4
e
a l
l
$
$
$
• Properties of the trie. A trie T storing a collection S of s strings of total length n from an alphabet of size d has the following properties:
• How many children can a node have?
• How many leaves does T have?
• What is the height of T?
• What is the number of nodes in T?
Trie
• Search time: O(d) in each node => O(dm).• O(m) if d constant. • d not constant: use dictionary
• Hashing O(1)• Balanced BST: O(log d)
• Time and space for a trie (for small/constant d): • O(m) for searching for a string of length m. • O(n) space.• Preprocessing: O(n)
Trie
• Prefix search: return all words in the trie starting with “se”
Tries
b
y
S2
S1
s
s
sS4 S6 S3
e
e
e
ea al
l
l
l
h h
t
S5
$
$
$
$
$ $S7
$
Tries
b
y
S2
S1
s
s
sS4 S6 S3
e
e
e
ea al
l
l
l
h h
t
S5
$
$
$
$
$ $S7
$
• Prefix search: return all words in the trie starting with “se”
• Prefix search: return all words in the trie starting with “se”
Tries
b
y
S2
S1
s
s
sS4 S6 S3
e
e
e
ea al
l
l
l
h h
t
S5
$
$
$
$
$ $S7
$
• Time for prefix search: O(m) + time to report all occurrences.
• Solution: compact tries.
Trie
Could be large!!
Compact tries
• Compact trie: Chains of nodes with a single child is merged into a single node.
Tries
b
y
S2
S1
s
s
sS4 S6 S3
e
e
e
ea al
l
l
l
h h
t
S5
$
$
$
$
$ $S7
$
• Compact trie: Chains of nodes with a single child is merged into a single node.
Tries
b ty
a
se
ll
e
s
ahe
hel
ls
$$
$$
$ $ $
by$
S2
s
e
a$
S4
lls$S1
he
lls$
S6 S3
S5 S7
$
t
ea$
he$
• Properties of the compact trie. A compact trie T storing a collection S of s strings of total length n from an alphabet of size d has the following properties:• Every internal node of T has at least 2 and at most d children.• T has s leaves• The number of nodes in T is < 2s.
• Time and space for a compact trie (constant d): • O(m) for searching for a string of length m. • O(m + occ) for prefix search, where occ = #occurrences• O(s) space.• Preprocessing: O(n)
Trie
Suffix trees
• String indexing problem. Given a string S of characters from an alphabet Σ. Preprocess S into a data structure to support• Search(P): Return starting position of all occurrences of P in S.
• Build a compressed trie over all suffixes of S (suffix tree). Label leaves with index of suffix.
• Observation: An occurrence of P is a prefix of a suffix of S.
Suffix tree
occurrence of P
Suffix of S
• String indexing problem. Given a string S of characters from an alphabet Σ. Preprocess S into a data structure to support• Search(P): Return starting position of all occurrences of P in S.
• Build a compressed trie over all suffixes of S (suffix tree). Label leaves with index of suffix.
• Observation: An occurrence of P is a prefix of a suffix of S.
• Example: P = ana.
Suffix tree
occurrence of P
Suffix of S
b a n a n a s t r i n g s s a l a d s
Suffix of S
Suffix of S
• Suffix tree: over the string banana$
Suffix Tree
$
b
1
an
a
an
a
$
6
na
$
4
na$
2
na
na$
3 5
$
$
7
• Suffix tree: over the string banana$
Suffix Tree
$
b
1
an
a
an
a
$
6
na
$
4
na$
2
na
na$
3 5
$
$
7
• Search for P.• Report labels of all leaves
below final node
• Suffix tree: over the string banana$• Find all occurrences of P=“an”
Suffix Tree
$
b
1
an
a
an
a
$
6
na
$
4
na$
2
na
na$
3 5
$
$
7
• Search for P.• Report labels of all leaves
below final node
• Suffix tree: over the string banana$
Suffix Tree
$
b
1
an
a
an
a
$
6
na
$
4
na$
2
na
na$
3 5
$
$
7
1 2 3 4 5 6 7b a n a n a $
• Store S and store node labels by reference to S.
• Suffix tree: over the string banana$
Suffix Tree
1
[2,2]
6
42
3 5
7
1 2 3 4 5 6 7b a n a n a $
• Store S and store node labels by reference to S.
[3,4] [7,7]
[5,7]
[7,7]
[7,7]
[7,7]
[1,7][3,4]
[5,7]
Suffix trees and common substrings
918
15 12
716 3 6
14 11
2
1013
5 1
17 8 4
$2 $1
$2 $2
$2 $1 $1
$2 $2
$1 $1
$2 $2
$1 $1
$2 $1 $1
• Suffix tree of a string S: Compact trie over all suffixes of S.
• Space and time:• Space: O(n)• Search time: O(m) + time to report all occurrences = O(m+occ)• Preprocessing: Can be done in O(sort(n,|Σ|)) time, where sort(n,|Σ|) is the time it
takes to sort n characters from an alphabet Σ.
• Suffix trees can be used to solve the String indexing problem in:• Space: O(n)• Search time: O(m+occ)• Preprocessing: O(sort(n,|Σ|)) time
Suffix tree
Applications of suffix trees
• Find longest common substring of strings S1 and S2. • Construct the suffix tree over S1$1S2$2.• Example: Find longest common substring of piespies and piepiees:
• Construct suffix tree of piespies$1piepiees$2.
Longest common substring
• Suffix tree of piespies$1piepiees$2.
Generalized suffix tree
$1
918
15 12
716 3 6
14 11
2
1013
5 1
17 8 4
$2
$1
$1
$1
$2
$2
$2
$2
$2
p
p p
pp
i
i
i eee
ee
e
e
e
ss
s
s
s
s
i
i
i
s s
$1
$2
$2$2
s
s
e
e
...
...
......
...
e e
$2
$2
$2 $2
$1
$2
...
e
$2s p
s
ie
$1
$2
...
p
si
$1
$2
e
...
p
si
$1
$2
e
...
• Suffix tree of piespies$1piepiees$2.• Mark leaf with if $1 suffix starts in S1.
Generalized suffix tree
918
15 12
716 3 6
14 11
2
1013
5 1
17 8 4
$2 $1
$2 $2
$2 $1 $1
$2 $2
$1 $1
$2 $2
$1 $1
$2 $1 $1
• Suffix tree of piespies$1piepiees$2.• Mark leaf with if $1 suffix starts in S1.
Generalized suffix tree
918
15 12
716 3 6
14 11
2
1013
5 1
17 8 4
$2 $1
$2 $2
$2 $1 $1
$2 $2
$1 $1
$2 $2
$1 $1
$2 $1 $1
• Suffix tree of piespies$1piepiees$2.• Mark leaf with if $1 suffix starts in S1.• Add string-depth.
15111
Generalized suffix tree
918
15 12
716 3 6
14 11
2
1013
5 1
17 8 4
$2 $1
$2 $2
$2 $1 $1
$2 $2
$1 $1
$2 $2
$1 $1
$2 $1 $1
[18,18] [9,18] [15,15]
[16,18]
[13,18]
[17,17]
[18,18]
[9,18]
[5,18]
[14,15]
[16,18]
[13,18]
[8,8]
[9,18] [5,18]
[13,15]
[16,18]
[13,18]
[8,8]
[9,18] [5,18]
[18,18]
[17,17]
[9,18]
[5,18]
1 10 1
4 7 2
3 12 16
2
5 8 3
13 17
3
6 9 4
14 18
13
3
• Suffix tree of piespies$1piepiees$2.• Mark leaf with if $1 suffix starts in S1.• Add string-depth.
15111
Generalized suffix tree
918
15 12
716 3 6
14 11
2
1013
5 1
17 8 4
$2 $1
$2 $2
$2 $1 $1
$2 $2
$1 $1
$2 $2
$1 $1
$2 $1 $1
[18,18] [9,18] [15,15]
[16,18]
[13,18]
[17,17]
[18,18]
[9,18]
[5,18]
[14,15]
[16,18]
[13,18]
[8,8]
[9,18] [5,18]
[13,15]
[16,18]
[13,18]
[8,8]
[9,18] [5,18]
[18,18]
[17,17]
[9,18]
[5,18]
1 11 1
4 7 2
3 12 16
2
5 8 3
17
6 9 4
14 18
S[13,15] = “pie” is the longest common substring.
• Using a suffix tree we can solve the longest common substring problem in linear time (for a constant size alphabet).
Longest common substring