Suffix array
Suffix_array | efficient data structure for indexing large texts |
---|---|
.size | give number of suffixes |
summarize suffix array representation | |
.suffix | give index of ith smallest suffix |
.common_prefix_len | give length of longest common prefix between adjacent suffixes |
.occurrance | list positions in the original text where a substring occurred |
Suffix array ¶
Declaration
template <typename Sequence = std::string>
class Suffix_array;
// constructor
Suffix_array(const Sequence& text);
Parameters
text | A non-temporary text that the suffix array keeps a reference of; should not be modified afterwards |
Example
std::ifstream text_file("mobydick.txt");
text_file.seekg(0, std::ios::end);
size_t size {text_file.tellg()};
std::string text(size, ' ');
// go back to beginning
text_file.seekg(0);
// read into string
text_file.read(&text[0], size);
// index the file
sal::Suffix_array<> sa {text};
Discussion
The algorithm used for suffix array construction is currently suboptimal at O(n(lgn)^2)
and
could be improved to O(n)
. For a more efficient (but much less readable) implementation,
see Yuta Mori's use of SAIS.
A suffix array is a very efficient data structure for indexing text for future searches.
It allows for O(lg(n))
queries on the number and position of substrings.
It is a recent (1990) invention as a space efficient alternative to suffix trees, which have
the same applications.
Variations of suffix array are usually behind the convenient ctrl+f
features in editors and browsers.
Number of suffixes | Suffix_array::
size ¶
Declaration
size_t size() const;
Discussion
The number of suffixes is equal to the length of the original text (new suffixs for each character).
Suffix array summarization | Suffix_array::
print ¶
Declaration
void print() const;
Discussion
Summarizes internal representation; for example with text="consternation":
8-0: ation
0-0: consternation
5-0: ernation
10-0: ion
12-1: n
7-1: nation
2-0: nsternation
11-2: on
1-0: onsternation
6-0: rnation
3-0: sternation
4-1: ternation
9-0: tion
The suffixes are sorted by lexicographic (alphabetical) order. The first number is the starting index of the suffix in the original text. The second number is the common prefix length with the next ordered suffix, so "ternation" and "tion" both share 't' as a prefix.
Suffix array indices | Suffix_array::
suffix ¶
Declaration
size_t suffix(size_t ith_suffix) const;
Parameters
ith_suffix | lexicographic rank of suffix with 0 being the smallest |
Example
std::string text {"consternation"};
Suffix_array<> sa {text};
sa.suffix(2);
// 5 index in text that starts the suffix "ernation", 3rd smallest suffix
Discussion
Requires the original text to not have been modified.
The actual suffix is the substring from the index returned to text.end()
.
Longest common prefix | Suffix_array::
common_prefix_len
¶
Declaration
size_t common_prefix_len(size_t ith_suffix) const;
Parameters
ith_suffix | lexicographic rank of suffix with 0 being the smallest |
Example
// using same sa as above with text = "consternation"
sa.common_prefix_len(2);
// 0 shared suffix between "ernation" and "ion", 3rd and 4th smallest suffixes
Discussion
The function is a simple access to the LCP array, which is built right after the suffix array is built. This array enables a lot of applications to be efficient, such as search queries for occurrances.
All occurrances of substring | Suffix_array::
occurrance ¶
Declaration
std::vector<size_t> occurrance(const Sequence& target) const;
Parameters
target | substring to search for |
Example
std::ifstream text_file("mobydick.txt");
text_file.seekg(0, std::ios::end);
size_t size {text_file.tellg()};
std::string text(size, ' ');
// go back to beginning
text_file.seekg(0);
// read into string
text_file.read(&text[0], size);
// index the file
sal::Suffix_array<> sa {text};
std::vector<size_t> whales {sa.occurrance("whales")};
whales.size();
// 1323 occurrances of the word "whale"
for (size_t starting_location : whales) std::cout << starting_location << ' ';
// prints out all the starting indices of where "whale" occurs
Discussion
Basically performs the function of ctrl+f
in text editors and browsers.
It can do this in O(lgn)
time since it is a single binary search followed
by looking at following suffixes to see if they share a prefix as large as the target.
An alternative without building the LCP is to use two binary searches to find the first
and last occurrance of it and taking the difference.