Suffix array

Suffix_array	efficient data structure for indexing large texts
.size	give number of suffixes
.print	summarize suffix array representation
.suffix	give index of ith smallest suffix
.common_prefix_len	give length of longest common prefix between adjacent suffixes
.occurrance	list positions in the original text where a substring occurred

Suffix array ¶

Declaration

template <typename Sequence = std::string>
class Suffix_array;

// constructor
Suffix_array(const Sequence& text);

Parameters

text	A non-temporary text that the suffix array keeps a reference of; should not be modified afterwards

Example

std::ifstream text_file("mobydick.txt");
text_file.seekg(0, std::ios::end);
size_t size {text_file.tellg()};
std::string text(size, ' ');
// go back to beginning
text_file.seekg(0);
// read into string
text_file.read(&text[0], size); 

// index the file
sal::Suffix_array<> sa {text};

Discussion

The algorithm used for suffix array construction is currently suboptimal at O(n(lgn)^2) and could be improved to O(n). For a more efficient (but much less readable) implementation, see Yuta Mori's use of SAIS.

A suffix array is a very efficient data structure for indexing text for future searches. It allows for O(lg(n)) queries on the number and position of substrings. It is a recent (1990) invention as a space efficient alternative to suffix trees, which have the same applications. Variations of suffix array are usually behind the convenient ctrl+f features in editors and browsers.

Number of suffixes | `Suffix_array::`size ¶

Declaration

size_t size() const;

Discussion

The number of suffixes is equal to the length of the original text (new suffixs for each character).

Suffix array summarization | `Suffix_array::`print ¶

Declaration

void print() const;

Discussion

Summarizes internal representation; for example with text="consternation":

    8-0: ation
    0-0: consternation
    5-0: ernation
   10-0: ion
   12-1: n
    7-1: nation
    2-0: nsternation
   11-2: on
    1-0: onsternation
    6-0: rnation
    3-0: sternation
    4-1: ternation
    9-0: tion

The suffixes are sorted by lexicographic (alphabetical) order. The first number is the starting index of the suffix in the original text. The second number is the common prefix length with the next ordered suffix, so "ternation" and "tion" both share 't' as a prefix.

Suffix array indices | `Suffix_array::`suffix ¶

Declaration

size_t suffix(size_t ith_suffix) const;

Parameters

ith_suffix

lexicographic rank of suffix with 0 being the smallest

Example

std::string text {"consternation"};
Suffix_array<> sa {text};

sa.suffix(2);
// 5 index in text that starts the suffix "ernation", 3rd smallest suffix

Discussion

Requires the original text to not have been modified. The actual suffix is the substring from the index returned to text.end().

Longest common prefix | `Suffix_array::common_prefix_len`¶

Declaration

size_t common_prefix_len(size_t ith_suffix) const;

Parameters

ith_suffix

lexicographic rank of suffix with 0 being the smallest

Example

// using same sa as above with text = "consternation"
sa.common_prefix_len(2);
// 0 shared suffix between "ernation" and "ion", 3rd and 4th smallest suffixes

Discussion

The function is a simple access to the LCP array, which is built right after the suffix array is built. This array enables a lot of applications to be efficient, such as search queries for occurrances.

All occurrances of substring | `Suffix_array::`occurrance ¶

Declaration

std::vector<size_t> occurrance(const Sequence& target) const;

Parameters

target

substring to search for

Example

std::ifstream text_file("mobydick.txt");
text_file.seekg(0, std::ios::end);
size_t size {text_file.tellg()};
std::string text(size, ' ');
// go back to beginning
text_file.seekg(0);
// read into string
text_file.read(&text[0], size); 

// index the file
sal::Suffix_array<> sa {text};
std::vector<size_t> whales {sa.occurrance("whales")};

whales.size();
// 1323 occurrances of the word "whale" 

for (size_t starting_location : whales) std::cout << starting_location << ' ';
// prints out all the starting indices of where "whale" occurs

Discussion

Basically performs the function of ctrl+f in text editors and browsers. It can do this in O(lgn) time since it is a single binary search followed by looking at following suffixes to see if they share a prefix as large as the target. An alternative without building the LCP is to use two binary searches to find the first and last occurrance of it and taking the difference.

Simple Algorithms Library

Suffix array

Suffix array ¶

Number of suffixes | Suffix_array::size ¶

Suffix array summarization | Suffix_array::print ¶

Suffix array indices | Suffix_array::suffix ¶

Longest common prefix | Suffix_array::common_prefix_len¶

All occurrances of substring | Suffix_array::occurrance ¶

Simple Algorithms

Library

Number of suffixes | `Suffix_array::`size ¶

Suffix array summarization | `Suffix_array::`print ¶

Suffix array indices | `Suffix_array::`suffix ¶

Longest common prefix | `Suffix_array::common_prefix_len`¶

All occurrances of substring | `Suffix_array::`occurrance ¶