Suffix Arrays

December 29, 2017

Suffix arrays are a data structure which allow for efficient string matching and other operations. In this tutorial, we will describe what they are and why they are useful. We will also describe algorithms for generating suffix arrays, starting with a trivial one and moving onto faster ones. Lets get started.

What are suffix arrays?

Consider any string, say s = "random". It is customary to include a symbol at the end which is smaller than any character. Let us include ’$’, which is **smaller than any Latin character in ASCII**, so s = "random$".

Consider all suffixes of this string, and name the suffix that starts at the i-th position as z_i.

Thus:

z_0 = random$

z_1 = andom$

z_2 = ndom$

z_3 = dom$

z_4 = om$

z_5 = m$

z_6 = $

and lets now sort them by lexicographic order (check footnotes for definition):

z_6 = $

z_1 = andom$

z_3 = dom$

z_5 = m$

z_2 = ndom$

z_4 = om$

z_0 = random$

and write down the indices in that order

[6, 1, 3, 5, 2, 4, 0]

This is the suffix array of the given string s = "random$".

Try: What is the suffix array of "aaaa$"? What is the suffix array of "suffix$"?

Naive algorithm

A naive algorithm for generating a suffix array works as follows. We first find out all suffixes of the given string and put them in an array. Then we sort the array.

Instead of actually creating all the suffixes, a better way to implement this would be to have an array with the numbers 0...n where n = length of string s. Then we sort these numbers using a custom comparison function define as follows: a < b, if z_a < z_b, where z_j is the suffix that starts from at the jth position in the string. This directly gives us the suffix array.

Complexity

The above algorithm has a run-time complexity of O(n log n * q) where q = time for a single comparison. Since the average time for performing a comparison in our case is O(n), our algorithm is O(n² log n).

Suppose we just implemented the above algorithm and ran it for a string of length 1,000,000. How long does it take to execute? Well, before it stops, you will have finished reading this article. It would take about 2.5 days on a standard personal computer.

O(n log² n) algorithm

The idea behind the O(n log² n) algorithm is very elegant. The essence is to note that the expensive O(n) comparison in the previous algorithm can be replaced with something much more efficient, because we are dealing with strings which have a lot of redundancy (they are all suffixes of the same string).

The algorithm performs roughly log n steps. After step k, the suffixes are sorted by the first 2^k characters only, k goes from 0 to log n. When performing step k, we’ll re-use the results from step k-1.

The first step, is step 0, and in this step the suffixes are sorted only by the first character. Let’s consider s = banana$. This is the state after Step 0

z_6 = $, rank 0

z_1 = anana$, rank 1

z_3 = ana$, rank 1

z_5 = a$, rank 1

z_0 = banana$, rank 2

z_2 = nana$, rank 3

z_4 = na$, rank 3

Note that we’ve also attached ranks of the elements in the above sorting. anana$, ana$ and a$ all have the same rank as the comparison is done just using the first character which is the same for the three of them ("a").

For the next step, we need to sort using 2 characters. We already know the ordering using 1 character. Moreover, the first 2 characters of z_k, is simply first 1 character of z_k + first 1 character of z_(k+1). Hence, to sort all z_k using 2 characters, we replace the z_k with the pair (rank of z_k using 1 character, rank of z_(k + 1) using 1 character) and sort according to that.

This is what the result looks like.

z_0 → (2, 1)

z_1 → (1, 3)

z_2 → (3, 1)

z_3 → (1, 3)

z_4 → (3, 1)

z_5 → (1, 0)

z_6 → (0, 0)

Note: If something has no next character, we use a value <= every possible rank, i.e. 0.

To take an example, z1 = first two characters of anana$ = "an" = `first character of z1 + first character of z2=“a” + “n”. Moreover, **since string comparison is more expensive, and we have already sorted single characters, we replace the characters with their ranks**, i.e.z1 → (rank of z1 in previous step, rank of z2 in previous step) → (1, 3)`.

Once we do the replacement, we sort these pairs. (pair-wise (x, y) comparisons are defined as “sort by x, break ties by y”).

z_6 → (0, 0), rank 0

z_5 → (1, 0), rank 1

z_1 → (1, 3), rank 2

z_3 → (1, 3), rank 2

z_0 → (2, 1), rank 3

z_2 → (3, 1), rank 4

z_4 → (3, 1), rank 4

Step 1 is complete.

Lets go on to Step 2 and sort these suffixes by the first 2² = 4 characters. We again build pairs, this time using ranks we computed just now, in Step 1. We have first 4 characters of z_k = first 2 character of z_k + first 2 character of **z_(k+2)**. Hence, the pairs will be (rank of z_k in Step 1, rank of z_(k + 2) in Step 1).

For example z_0 is replaced with (rank of z_0 in Step 1, rank of z_2 in Step 1). Again, this works because rank of z_0 in Step 1 is nothing but the rank of "ba" among all 2 length sub-strings, and rank of z_2 in Step 1 is the rank of "na" among all 2 length sub-strings. The said pair thus represents the entire string "bana" which is what we intend to use for comparison in this step.

We carry on this process for 2 more steps, since the largest suffix must have its entire length used for comparison. That happens in Step k where 2^k >= 7, or k = 3. And finally end up with the suffix array

z_6 = rank 0, $

z_5 = rank 1, a$

z_3 = rank 2, ana$

z_1 = rank 3, anana$

z_0 = rank 4, banana$

z_4 = rank 5, na$

z_2 = rank 6, nana$

which is [6, 5, 3, 1, 0, 4, 2].

Complexity

We perform roughly log n steps, and each step takes O(n log n) time because of the sorting. Hence, this algorithm’s complexity is O(n log² n).

Interlude: Radix sort for our case

General purpose sorting is O(n log n), and that is the best possible (based on information theoretic bound), when the only information we know about our <= operator is that it “doesn’t contradict itself” (formally, induces a total ordering on the elements). In particular, O(n log n) is the best possible only if we don’t know anything about the values we would like to sort.

Better sorting algorithms exist when we know more. In this case, we know that everything we sort is a pair of two integers with each integer in the range 0 to n. With this information, we can perform sorting in O(n) using radix sort.

It works like this. Let us consider an example. Say we want to sort (2,3), (4,5), (2,5), (4,7), (1,2). We’ll only look at the first elements of each pair for now. Using first element only, we can infer that (1,2) < (2,3) and (2, 5) < (4, 5) and (4, 7).

Call a collection of elements with same first element, a bucket. So (2,3) and (2,5) form a bucket. (4,5) and (4,7) form a bucket and (1,2) forms a bucket by itself.

From this information we can conclude that elements of bucket (4, 5), (4, 7) starts filling the sorted array from position 3 (0-indexed), because there are 3 elements before it. And (2,3), (2,5) starts filling from 1.

To calculate the start positions of each of the buckets - first we store the sizes of the buckets in an array. This array is [0, 1, 2, 0, 2] for our example. The i-th element (0-indexed) is the number of pairs with first element = i.

Then, we perform a cumulative sum of this array, i.e. array whose x’th element is the sum of all elements in the above array from 0..x-1. For this above array, it is [0, 0, 1, 3, 3, 5]. This array indicates from where elements of first element x start filling in the sorted array (referred to as bucstart). For more explanation of this array see code below.

Then, we create a list of lists. In position x, we have the elements which have second element x. For our example, this would be

[ [], 
  [], 
  [(1, 2)], 
  [(2, 3)], 
  [], 
  [(2, 5), (4, 5)], 
  [], 
  [(4, 7)] ]

Now traverse the list in this order, and when you get an element, put it in an array as follows:

elem <- get current element
key <- elem.first
place elem in bucstart[key]
bucstart[key] <- bucstart[key] + 1 // position the next element with this key occupies

When the process completes you get the sorted array, in complexity O(n).

Improving to O(n log n)

Using radix sort algorithm described above, we can reduce the sorting to O(n), and consequently the algorithm is sped up to O(n log n).

Note: Although the theoretical complexity is better, it is hard to actually make this work faster than O(n log² n) for practical inputs (say n <= 500000). This is because inbuilt sort functions in programming languages like C++ are very optimized, and an easy implementation of radix sort will use vectors and vector operations.

However, optimized implementations of the radix sort should lead to faster execution than the normal sort.

Implementation

Here is the implementation of the above algorithms. Just enable/disable the RADIXSORT to use O(n log n) instead of O(n log² n). Note: In this implementation, O(n log n) is slower than O(n log² n).

#include <bits/stdc++.h>
//#define DEBUG                      // enable to see verbose output of radixsort working
//#define RADIXSORT                  // enable radix sort instead of c++ inbuilt sort 
//#define LARGETEST                  // enable execution using random large case (no input) instead of inputting string
using namespace std;
const int maxn = 6e5 + 7;
pair< pair<int, int>, int > Z[maxn];   // array of z_k and k
pair< pair<int, int>, int > P[maxn];   // used as copy of Z
int rnk[maxn];                         // ranks of the elements 
vector<int> B[maxn];                   // list of lists as mentioned in the article
int bucsz[maxn];                       // sizes of the buckets
int bucstart[maxn];                    // current start position of the buckets
inline void radsort(int buckets, int n) { // assume buckets start from 0
    #ifdef RADIXSORT
    fill(B, B + buckets, *new vector<int>());
    fill(bucsz, bucsz + buckets, 0);
    fill(bucstart, bucstart + buckets, 0);
    #ifdef DEBUG
    cout<<"Call ================="<<endl;
    #endif
    for (int i = 0; i < n; i++) {             // calculate sizes of buckets
        auto elem = Z[i];
        #ifdef DEBUG
        cout<<elem.first.first<<","<<elem.first.second<<" : "<<elem.second<<endl;
        #endif
        int key = elem.first.first; 
        bucsz[key]++;                       
    }
    for (int i = 0; i < buckets; i++) {        // calculate initial position of the elements with first element i
        if (i > 0) bucstart[i] = bucstart[i - 1] + bucsz[i - 1];
        else bucstart[i] = 0;
    }
    for (int i = 0; i < n; i++) {              // build the list of lists
        auto elem = Z[i];
        int key = elem.first.second; 
        B[key].push_back(i);
    }
    for (int i = 0; i < buckets; i++) {        // implementation of the pseudocode snippet in the article in the Radix sort section
        for (int j = 0; j < B[i].size(); j++) {
            int k = B[i][j]; int fst = Z[k].first.first; 
            int pos = bucstart[fst];                        
            P[pos] = Z[k]; bucstart[fst] = pos + 1;
        }
    }
    for (int i = 0; i < n; i++) Z[i] = P[i];  // copy back to Z
    #endif
    #ifndef RADIXSORT
    sort(Z, Z + n);                           // this is used if we do not do RADIXSORT. This makes the algorithm O(n log^2 n)
    #endif
    fill(rnk, rnk + n, 0);
    for (int i = 0; i < n; i++) {
        int idx = Z[i].second;
        if (i == 0) { rnk[idx] = 0; continue; }
        rnk[idx] = rnk[ Z[i - 1].second ]; // rnk of previous element
        if (Z[i].first != Z[i - 1].first) rnk[idx]++; // if you're worse then you're of worse (higher) rnk
    }
    #ifdef DEBUG
    for (int i = 0; i < n; i++) {
        cout<<rnk[i]<<" ";
    }
    cout<<endl;
    #endif
}
vector<int> SA(string s) {
    int n = s.size(); s += "$";         // add dummy character. Now s has valid positions 0..n.
    for (int i = 0; i <= n; i++) {
        Z[i] = make_pair( make_pair(s[i], 0), i );
    }               
    radsort(256, n + 1);                // maximum number of buckets is 256 (ASCII characters in general)
    for (int j = 0; (1<<j) <= n; j++) { // pairs of strings of length 2^j each
        for (int i = 0; i <= n; i++) {
            int next = i + (1<<j);
            Z[i] = make_pair( make_pair(rnk[i], (next > n) ? 0 : rnk[next]), i );
        }
        radsort(n + 1, n + 1);          // here maximum number of buckets is n + 1 (0 .. n)
    }
    vector<int> S;
    for (int i = 0; i <= n; i++) S.push_back(Z[i].second);      // construct the suffix array from the final Z array.
    return S;
}
int main() {
    ios::sync_with_stdio(false); cin.tie(0);
    string s;
    #ifdef LARGETEST
    for (int i = 0; i < 500000; i++) {              // generate large testcase of n = 500000 when LARGETEST is enabled
        s += (char)('a' + rand()%26);               // there is no srand(time(NULL)), so this produces the same string everytime.
    }
    #endif
    #ifndef LARGETEST
    cin>>s;
    #endif
    vector<int> S = SA(s);
    for (int i = 0; i < S.size(); i++) {           // output the suffix array
        cout<<S[i]<<" ";
    } 
    cout<<endl;
}

Practice

The following problem just requires applying the Suffix array, so you use it to test your implementation and understanding: Problem SARRAY.

To solve this problem, you’ll need to extend the above implementation to a bigger alphabet and also define the null character accordingly.

This problem grades according to speed. You’ll probably score 60 with the above approach (which is good enough for most competitive programming sites like Codeforces, CodeChef, etc).

Here are some more problems for practice from SPOJ:

Footnotes

Lexicographic Order

Lexicographic order is a generalized version of alphabetical order. If we are working in the realm of strings derived from Latin (English) characters, then we have: a < b < c < d < ... < z.

How do we compare two strings? Say we have the strings "suffix" and "suffice". We go character by character, and stop at the first position they differ. In this case, it is the 5th (0-indexing) position, where "suffix" has x and "suffice" has c. Since, c < x, we have "suffice" < "suffix".

What happens when one of the string ends before there is any different character, i.e., the smaller string is a prefix of the larger one? In that case, we say the smaller string is smaller. For example, "ab" < "abcd".

This order is already implemented in most programming languages (C++, Python, etc). So when we do s < t for two strings s, t, the language automatically does checking based on lexicographic order as described above.

What are suffix arrays?

Naive algorithm

Complexity

O(n log2 n) algorithm

Complexity

Interlude: Radix sort for our case

Improving to O(n log n)

Implementation

Practice

Footnotes

Lexicographic Order

O(n log² n) algorithm