# Suffix Arrays

December 29, 2017

Suffix arrays are a data structure which allow for efficient string matching and other operations. In this tutorial, we will describe what they are and why they are useful. We will also describe algorithms for generating suffix arrays, starting with a trivial one and moving onto faster ones. Lets get started.

# What are suffix arrays?

Consider any string, say * s = "random"*. It is customary to include a symbol at the end which is smaller than any character. Let us include ’$’, which is **smaller than any Latin character in ASCII**, so

`s = "random$"`

.Consider all *suffixes* of this string, and name the suffix that starts at the `i-th`

position as `z_i`

.

Thus:

`z_0 = random$`

`z_1 = andom$`

`z_2 = ndom$`

`z_3 = dom$`

`z_4 = om$`

`z_5 = m$`

`z_6 = $`

and lets now sort them by *lexicographic order* (check footnotes for definition):

`z_6 = $`

`z_1 = andom$`

`z_3 = dom$`

`z_5 = m$`

`z_2 = ndom$`

`z_4 = om$`

`z_0 = random$`

and write down the indices in that order

`[6, 1, 3, 5, 2, 4, 0]`

This is the *suffix array* of the given string `s = "random$"`

.

Try: What is the suffix array of `"aaaa$"`

? What is the suffix array of `"suffix$"`

?

# Naive algorithm

A naive algorithm for generating a suffix array works as follows. We first find out all suffixes of the given string and put them in an array. Then we sort the array.

Instead of actually creating all the suffixes, a better way to implement this would be to have an array with the numbers `0...n`

where `n = length of string s`

. Then we sort these numbers using a custom **comparison function** define as follows: `a < b, if z_a < z_b`

, where `z_j`

is the suffix that starts from at the `j`

th position in the string. This directly gives us the suffix array.

## Complexity

The above algorithm has a run-time complexity of O(n log n * q) where `q = time for a single comparison`

. Since the average time for performing a comparison in our case is O(n), our algorithm is O(n^{2} log n).

Suppose we just implemented the above algorithm and ran it for a string of length 1,000,000. How long does it take to execute? Well, before it stops, you will have finished reading this article. It would take about 2.5 days on a standard personal computer.

# O(n log^{2} n) algorithm

The idea behind the O(n log^{2} n) algorithm is very elegant. The essence is to note that the expensive O(n) comparison in the previous algorithm can be replaced with something much more efficient, because we are dealing with *strings which have a lot of redundancy* (they are all suffixes of the *same* string).

The algorithm performs roughly *log n* steps. After step *k*, the suffixes are **sorted by the first 2 ^{k} characters only**, k goes from 0 to

*log n*. When performing step

*k*, we’ll re-use the results from step

*k-1*.

The first step, is step 0, and in this step the suffixes are sorted only by the first character. Let’s consider `s = banana$`

. This is the state after Step 0

`z_6 = $, rank 0`

`z_1 = anana$, rank 1`

`z_3 = ana$, rank 1`

`z_5 = a$, rank 1`

`z_0 = banana$, rank 2`

`z_2 = nana$, rank 3`

`z_4 = na$, rank 3`

Note that we’ve also attached **ranks** of the elements in the above sorting. `anana$`

, `ana$`

and `a$`

all have the same rank as the comparison is done just using the first character which is the same for the three of them (`"a"`

).

For the next step, we need to sort using 2 characters. We * already know* the ordering using 1 character.

**Moreover, the**. Hence, to sort all

`first 2 characters of z_k`

, is simply `first 1 character of z_k + first 1 character of z_(k+1)`

`z_k`

using 2 characters, we replace the `z_k`

with the pair `(rank of z_k using 1 character, rank of z_(k + 1) using 1 character)`

and sort according to that.This is what the result looks like.

`z_0 → (2, 1)`

`z_1 → (1, 3)`

`z_2 → (3, 1)`

`z_3 → (1, 3)`

`z_4 → (3, 1)`

`z_5 → (1, 0)`

`z_6 → (0, 0)`

Note: If something has no next character, we use a value <= *every possible rank*, i.e. 0.

**To take an example**, z*1 = first two characters of anana$ = "an" = `first character of z*1 + first character of z

*2*1 → (rank of z

`=`

“a” + “n”`. Moreover, **since string comparison is more expensive, and we have already sorted single characters, we replace the characters with their ranks**, i.e.`

z*1 in previous step, rank of z*2 in previous step) → (1, 3)`.

Once we do the replacement, we sort these pairs. (pair-wise (x, y) comparisons are defined as “sort by x, break ties by y”).

`z_6 → (0, 0), rank 0`

`z_5 → (1, 0), rank 1`

`z_1 → (1, 3), rank 2`

`z_3 → (1, 3), rank 2`

`z_0 → (2, 1), rank 3`

`z_2 → (3, 1), rank 4`

`z_4 → (3, 1), rank 4`

Step 1 is complete.

Lets go on to Step 2 and sort these suffixes by the first 2^{2} = 4 characters. We again build pairs, this time using *ranks we computed just now*, in Step 1. We have `first 4 characters of z_k = first 2 character of z_k + first 2 character of **z_(k+2)**`

. Hence, the pairs will be (rank of `z_k`

in Step 1, rank of `z_(k + 2)`

in Step 1).

For example `z_0`

is replaced with (rank of `z_0`

in Step 1, rank of `z_2`

in Step 1). Again, this works because rank of `z_0`

in Step 1 is nothing but the rank of `"ba"`

among all 2 length sub-strings, and rank of `z_2`

in Step 1 is the rank of `"na"`

among all 2 length sub-strings. The said pair thus represents the entire string `"bana"`

which is what we intend to use for comparison in this step.

We carry on this process for 2 more steps, since the largest suffix must have its entire length used for comparison. That happens in Step k where 2^{k} >= 7, or k = 3. And finally end up with the suffix array

`z_6 = rank 0, $`

`z_5 = rank 1, a$`

`z_3 = rank 2, ana$`

`z_1 = rank 3, anana$`

`z_0 = rank 4, banana$`

`z_4 = rank 5, na$`

`z_2 = rank 6, nana$`

which is `[6, 5, 3, 1, 0, 4, 2]`

.

## Complexity

We perform roughly *log n* steps, and each step takes O(n log n) time because of the sorting. Hence, this algorithm’s complexity is O(n log^{2} n).

# Interlude: Radix sort for our case

General purpose sorting is O(n log n), and that is the *best possible* (based on information theoretic bound), when the only information we know about our <= operator is that it “doesn’t contradict itself” (formally, induces a total ordering on the elements). In particular, O(n log n) is the best possible only if we don’t know anything about the values we would like to sort.

Better sorting algorithms *exist* when we know more. In this case, we know that everything we sort is a pair of two integers with each integer in the range 0 to n. With this information, we can perform sorting in O(n) using **radix sort**.

It works like this. Let us consider an example. Say we want to sort `(2,3), (4,5), (2,5), (4,7), (1,2)`

. We’ll only look at the first elements of each pair for now. Using first element only, we can infer that `(1,2) < (2,3) and (2, 5) < (4, 5) and (4, 7)`

.

Call a collection of elements with same first element, a **bucket**. So (2,3) and (2,5) form a bucket. (4,5) and (4,7) form a bucket and (1,2) forms a bucket by itself.

From this information we can conclude that elements of bucket (4, 5), (4, 7) starts filling the sorted array from position 3 (0-indexed), because there are 3 elements before it. And (2,3), (2,5) starts filling from 1.

To calculate the start positions of each of the buckets - first we store the *sizes of the buckets* in an array. This array is `[0, 1, 2, 0, 2]`

for our example. The i-th element (0-indexed) is the number of pairs with first element = i.

Then, we perform a **cumulative sum** of this array, i.e. array whose x’th element is the sum of all elements in the above array from `0..x-1`

. For this above array, it is `[0, 0, 1, 3, 3, 5]`

. This array indicates from **where elements of first element x start filling in the sorted array** (referred to as `bucstart`

). For more explanation of this array see code below.

Then, we create a list of lists. In position *x*, we have the elements which have second element *x*. For our example, this would be

```
[ [],
[],
[(1, 2)],
[(2, 3)],
[],
[(2, 5), (4, 5)],
[],
[(4, 7)] ]
```

Now traverse the list in this order, and when you get an element, put it in an array as follows:

```
elem <- get current element
key <- elem.first
place elem in bucstart[key]
bucstart[key] <- bucstart[key] + 1 // position the next element with this key occupies
```

When the process completes you get the sorted array, in complexity O(n).

## Improving to O(n log n)

Using radix sort algorithm described above, we can reduce the sorting to O(n), and consequently the algorithm is sped up to O(n log n).

**Note**: Although the theoretical complexity is better, it is hard to *actually* make this work faster than O(n log^{2} n) for practical inputs (say n <= 500000). This is because inbuilt sort functions in programming languages like C++ are very optimized, and an easy implementation of radix sort will use vectors and vector operations.

However, optimized implementations of the radix sort should lead to faster execution than the normal sort.

# Implementation

Here is the implementation of the above algorithms. Just enable/disable the `RADIXSORT`

to use O(n log n) instead of O(n log^{2} n). Note: In this implementation, O(n log n) is *slower* than O(n log^{2} n).

```
#include <bits/stdc++.h>
//#define DEBUG // enable to see verbose output of radixsort working
//#define RADIXSORT // enable radix sort instead of c++ inbuilt sort
//#define LARGETEST // enable execution using random large case (no input) instead of inputting string
using namespace std;
const int maxn = 6e5 + 7;
pair< pair<int, int>, int > Z[maxn]; // array of z_k and k
pair< pair<int, int>, int > P[maxn]; // used as copy of Z
int rnk[maxn]; // ranks of the elements
vector<int> B[maxn]; // list of lists as mentioned in the article
int bucsz[maxn]; // sizes of the buckets
int bucstart[maxn]; // current start position of the buckets
inline void radsort(int buckets, int n) { // assume buckets start from 0
#ifdef RADIXSORT
fill(B, B + buckets, *new vector<int>());
fill(bucsz, bucsz + buckets, 0);
fill(bucstart, bucstart + buckets, 0);
#ifdef DEBUG
cout<<"Call ================="<<endl;
#endif
for (int i = 0; i < n; i++) { // calculate sizes of buckets
auto elem = Z[i];
#ifdef DEBUG
cout<<elem.first.first<<","<<elem.first.second<<" : "<<elem.second<<endl;
#endif
int key = elem.first.first;
bucsz[key]++;
}
for (int i = 0; i < buckets; i++) { // calculate initial position of the elements with first element i
if (i > 0) bucstart[i] = bucstart[i - 1] + bucsz[i - 1];
else bucstart[i] = 0;
}
for (int i = 0; i < n; i++) { // build the list of lists
auto elem = Z[i];
int key = elem.first.second;
B[key].push_back(i);
}
for (int i = 0; i < buckets; i++) { // implementation of the pseudocode snippet in the article in the Radix sort section
for (int j = 0; j < B[i].size(); j++) {
int k = B[i][j]; int fst = Z[k].first.first;
int pos = bucstart[fst];
P[pos] = Z[k]; bucstart[fst] = pos + 1;
}
}
for (int i = 0; i < n; i++) Z[i] = P[i]; // copy back to Z
#endif
#ifndef RADIXSORT
sort(Z, Z + n); // this is used if we do not do RADIXSORT. This makes the algorithm O(n log^2 n)
#endif
fill(rnk, rnk + n, 0);
for (int i = 0; i < n; i++) {
int idx = Z[i].second;
if (i == 0) { rnk[idx] = 0; continue; }
rnk[idx] = rnk[ Z[i - 1].second ]; // rnk of previous element
if (Z[i].first != Z[i - 1].first) rnk[idx]++; // if you're worse then you're of worse (higher) rnk
}
#ifdef DEBUG
for (int i = 0; i < n; i++) {
cout<<rnk[i]<<" ";
}
cout<<endl;
#endif
}
vector<int> SA(string s) {
int n = s.size(); s += "$"; // add dummy character. Now s has valid positions 0..n.
for (int i = 0; i <= n; i++) {
Z[i] = make_pair( make_pair(s[i], 0), i );
}
radsort(256, n + 1); // maximum number of buckets is 256 (ASCII characters in general)
for (int j = 0; (1<<j) <= n; j++) { // pairs of strings of length 2^j each
for (int i = 0; i <= n; i++) {
int next = i + (1<<j);
Z[i] = make_pair( make_pair(rnk[i], (next > n) ? 0 : rnk[next]), i );
}
radsort(n + 1, n + 1); // here maximum number of buckets is n + 1 (0 .. n)
}
vector<int> S;
for (int i = 0; i <= n; i++) S.push_back(Z[i].second); // construct the suffix array from the final Z array.
return S;
}
int main() {
ios::sync_with_stdio(false); cin.tie(0);
string s;
#ifdef LARGETEST
for (int i = 0; i < 500000; i++) { // generate large testcase of n = 500000 when LARGETEST is enabled
s += (char)('a' + rand()%26); // there is no srand(time(NULL)), so this produces the same string everytime.
}
#endif
#ifndef LARGETEST
cin>>s;
#endif
vector<int> S = SA(s);
for (int i = 0; i < S.size(); i++) { // output the suffix array
cout<<S[i]<<" ";
}
cout<<endl;
}
```

# Practice

The following problem just requires applying the Suffix array, so you use it to test your implementation and understanding: Problem SARRAY.

To solve this problem, you’ll need to extend the above implementation to a bigger alphabet and also define the *null character* accordingly.

This problem grades according to speed. You’ll probably score 60 with the above approach (which is good enough for most competitive programming sites like Codeforces, CodeChef, etc).

Here are some more problems for practice from SPOJ:

# Footnotes

## Lexicographic Order

**Lexicographic order** is a generalized version of *alphabetical order*. If we are working in the realm of strings derived from **Latin (English) characters**, then we have: `a < b < c < d < ... < z`

.

How do we compare two strings? Say we have the strings `"suffix"`

and `"suffice"`

. We go character by character, and stop at the first position they differ. In this case, it is the 5th (0-indexing) position, where `"suffix"`

has `x`

and `"suffice"`

has `c`

. Since, `c < x`

, we have `"suffice" < "suffix"`

.

What happens when one of the string ends *before* there is any different character, i.e., the smaller string is a prefix of the larger one? In that case, we say the smaller string is smaller. For example, `"ab" < "abcd"`

.

This order is already implemented in most programming languages (C++, Python, etc). So when we do `s < t`

for two strings `s, t`

, the language automatically does checking based on lexicographic order as described above.