Suffix array

Suffix_arrayefficient data structure for indexing large texts
.sizegive number of suffixes
.printsummarize suffix array representation
.suffixgive index of ith smallest suffix
.common_prefix_lengive length of longest common prefix between adjacent suffixes
.occurrancelist positions in the original text where a substring occurred

Suffix array

Declaration

template <typename Sequence = std::string>
class Suffix_array;

// constructor
Suffix_array(const Sequence& text);

Parameters

textA non-temporary text that the suffix array keeps a reference of; should not be modified afterwards

Example

std::ifstream text_file("mobydick.txt");
text_file.seekg(0, std::ios::end);
size_t size {text_file.tellg()};
std::string text(size, ' ');
// go back to beginning
text_file.seekg(0);
// read into string
text_file.read(&text[0], size); 

// index the file
sal::Suffix_array<> sa {text};

Discussion

The algorithm used for suffix array construction is currently suboptimal at O(n(lgn)^2) and could be improved to O(n). For a more efficient (but much less readable) implementation, see Yuta Mori's use of SAIS.

A suffix array is a very efficient data structure for indexing text for future searches. It allows for O(lg(n)) queries on the number and position of substrings. It is a recent (1990) invention as a space efficient alternative to suffix trees, which have the same applications. Variations of suffix array are usually behind the convenient ctrl+f features in editors and browsers.

Number of suffixes | Suffix_array::size

Declaration

size_t size() const;

Discussion

The number of suffixes is equal to the length of the original text (new suffixs for each character).

Suffix array summarization | Suffix_array::print

Declaration

void print() const;

Discussion

Summarizes internal representation; for example with text="consternation":

    8-0: ation
    0-0: consternation
    5-0: ernation
   10-0: ion
   12-1: n
    7-1: nation
    2-0: nsternation
   11-2: on
    1-0: onsternation
    6-0: rnation
    3-0: sternation
    4-1: ternation
    9-0: tion

The suffixes are sorted by lexicographic (alphabetical) order. The first number is the starting index of the suffix in the original text. The second number is the common prefix length with the next ordered suffix, so "ternation" and "tion" both share 't' as a prefix.

Suffix array indices | Suffix_array::suffix

Declaration

size_t suffix(size_t ith_suffix) const;

Parameters

ith_suffixlexicographic rank of suffix with 0 being the smallest

Example

std::string text {"consternation"};
Suffix_array<> sa {text};

sa.suffix(2);
// 5 index in text that starts the suffix "ernation", 3rd smallest suffix

Discussion

Requires the original text to not have been modified. The actual suffix is the substring from the index returned to text.end().

Longest common prefix | Suffix_array::common_prefix_len

Declaration

size_t common_prefix_len(size_t ith_suffix) const;

Parameters

ith_suffixlexicographic rank of suffix with 0 being the smallest

Example

// using same sa as above with text = "consternation"
sa.common_prefix_len(2);
// 0 shared suffix between "ernation" and "ion", 3rd and 4th smallest suffixes

Discussion

The function is a simple access to the LCP array, which is built right after the suffix array is built. This array enables a lot of applications to be efficient, such as search queries for occurrances.

All occurrances of substring | Suffix_array::occurrance

Declaration

std::vector<size_t> occurrance(const Sequence& target) const;

Parameters

targetsubstring to search for

Example

std::ifstream text_file("mobydick.txt");
text_file.seekg(0, std::ios::end);
size_t size {text_file.tellg()};
std::string text(size, ' ');
// go back to beginning
text_file.seekg(0);
// read into string
text_file.read(&text[0], size); 

// index the file
sal::Suffix_array<> sa {text};
std::vector<size_t> whales {sa.occurrance("whales")};

whales.size();
// 1323 occurrances of the word "whale" 

for (size_t starting_location : whales) std::cout << starting_location << ' ';
// prints out all the starting indices of where "whale" occurs

Discussion

Basically performs the function of ctrl+f in text editors and browsers. It can do this in O(lgn) time since it is a single binary search followed by looking at following suffixes to see if they share a prefix as large as the target. An alternative without building the LCP is to use two binary searches to find the first and last occurrance of it and taking the difference.