Suffix array
Suffix_array | efficient data structure for indexing large texts |
---|---|
.size | give number of suffixes |
summarize suffix array representation | |
.suffix | give index of ith smallest suffix |
.common_prefix_len | give length of longest common prefix between adjacent suffixes |
.occurrance | list positions in the original text where a substring occurred |
Suffix array ¶
Declaration
Parameters
text | A non-temporary text that the suffix array keeps a reference of; should not be modified afterwards |
Example
Discussion
The algorithm used for suffix array construction is currently suboptimal at O(n(lgn)^2)
and
could be improved to O(n)
. For a more efficient (but much less readable) implementation,
see Yuta Mori's use of SAIS.
A suffix array is a very efficient data structure for indexing text for future searches.
It allows for O(lg(n))
queries on the number and position of substrings.
It is a recent (1990) invention as a space efficient alternative to suffix trees, which have
the same applications.
Variations of suffix array are usually behind the convenient ctrl+f
features in editors and browsers.
Number of suffixes | Suffix_array::
size ¶
Declaration
Discussion
The number of suffixes is equal to the length of the original text (new suffixs for each character).
Suffix array summarization | Suffix_array::
print ¶
Declaration
Discussion
Summarizes internal representation; for example with text="consternation":
8-0: ation
0-0: consternation
5-0: ernation
10-0: ion
12-1: n
7-1: nation
2-0: nsternation
11-2: on
1-0: onsternation
6-0: rnation
3-0: sternation
4-1: ternation
9-0: tion
The suffixes are sorted by lexicographic (alphabetical) order. The first number is the starting index of the suffix in the original text. The second number is the common prefix length with the next ordered suffix, so "ternation" and "tion" both share 't' as a prefix.
Suffix array indices | Suffix_array::
suffix ¶
Declaration
Parameters
ith_suffix | lexicographic rank of suffix with 0 being the smallest |
Example
Discussion
Requires the original text to not have been modified.
The actual suffix is the substring from the index returned to text.end()
.
Longest common prefix | Suffix_array::
common_prefix_len
¶
Declaration
Parameters
ith_suffix | lexicographic rank of suffix with 0 being the smallest |
Example
Discussion
The function is a simple access to the LCP array, which is built right after the suffix array is built. This array enables a lot of applications to be efficient, such as search queries for occurrances.
All occurrances of substring | Suffix_array::
occurrance ¶
Declaration
Parameters
target | substring to search for |
Example
Discussion
Basically performs the function of ctrl+f
in text editors and browsers.
It can do this in O(lgn)
time since it is a single binary search followed
by looking at following suffixes to see if they share a prefix as large as the target.
An alternative without building the LCP is to use two binary searches to find the first
and last occurrance of it and taking the difference.