Suffix arrays are a data structure which allow for efficient string matching and other operations. In this tutorial, we will describe what they are and why they are useful. We will also describe algorithms for generating suffix arrays, starting with a trivial one and moving onto faster ones. Lets get started.

# What are suffix arrays?

Consider any string, say *s = "random"*. It is customary to include a symbol at the end which is smaller than any character. Let us include '$', which is **smaller than any Latin character in ASCII**, so s = "random$".

Consider all *suffixes* of this string, and name the suffix that starts at the i-th position as z_i.

Thus:

z_0 = random$

z_1 = andom$

z_2 = ndom$

z_3 = dom$

z_4 = om$

z_5 = m$

z_6 = $

and lets now sort them by *lexicographic order *(check footnotes for definition):

z_6 = $

z_1 = andom$

z_3 = dom$

z_5 = m$

z_2 = ndom$

z_4 = om$

z_0 = random$

and write down the indices in that order

[6, 1, 3, 5, 2, 4, 0]

This is the *suffix array* of the given string s = "random$".

Try: What is the suffix array of "aaaa$"? What is the suffix array of "suffix$"?

# Naive algorithm

A naive algorithm for generating a suffix array works as follows. We first find out all suffixes of the given string and put them in an array. Then we sort the array.

Instead of actually creating all the suffixes, a better way to implement this would be to have an array with the numbers 0...n where n = length of string s. Then we sort these numbers using a custom **comparison function** define as follows: a < b, if z_a < z_b, where z_j is the suffix that starts from at the jth position in the string. This directly gives us the suffix array.

## Complexity

The above algorithm has a run-time complexity of O(n log n * q) where q = time for a single comparison. Since the average time for performing a comparison in our case is O(n), our algorithm is O(n^{2} log n).

Suppose we just implemented the above algorithm and ran it for a string of length 1,000,000. How long does it take to execute? Well, before it stops, you will have finished reading this article. It would take about 2.5 days on a standard personal computer.

# O(n log^{2} n) algorithm

The idea behind the O(n log^{2} n) algorithm is very elegant. The essence is to note that the expensive O(n) comparison in the previous algorithm can be replaced with something much more efficient, because we are dealing with *strings which have a lot of redundancy* (they are all suffixes of the *same* string).

The algorithm performs roughly *log n *steps. After step *k*, the suffixes are **sorted by the first 2 ^{k} characters only**, k goes from 0 to

*log n*. When performing step

*k*, we'll re-use the results from step

*k-1*.

The first step, is step 0, and in this step the suffixes are sorted only by the first character. Let's consider s = banana$. This is the state after Step 0

z_6 = $, rank 0

z_1 = anana$, rank 1

z_3 = ana$, rank 1

z_5 = a$, rank 1

z_0 = banana$, rank 2

z_2 = nana$, rank 3

z_4 = na$, rank 3

Note that we've also attached **ranks** of the elements in the above sorting. anana$, ana$ and a$ all have the same rank as the comparison is done just using the first character which is the same for the three of them ("a").

For the next step, we need to sort using 2 characters. We * already know* the ordering using 1 character.

**Moreover, the first 2 characters of z_k, is simply first 1 character of z_k + first 1 character of z_(k+1)**. Hence, to sort all z_k using 2 characters, we replace the z_k with the pair (rank of z_k using 1 character, rank of z_(k + 1) using 1 character) and sort according to that.

This is what the result looks like.

z_0 → (2, 1)

z_1 → (1, 3)

z_2 → (3, 1)

z_3 → (1, 3)

z_4 → (3, 1)

z_5 → (1, 0)

z_6 → (0, 0)

Note: If something has no next character, we use a value <= *every possible rank*, i.e. 0.

**To take an example**, z_1 = first two characters of anana$ = "an" = first character of z_1 + first character of z_2 = "a" + "n". Moreover, **since string comparison is more expensive, and we have already sorted single characters, we replace the characters with their ranks**, i.e. z_1 → (rank of z_1 in previous step, rank of z_2 in previous step) → (1, 3).

Once we do the replacement, we sort these pairs. (pair-wise (x, y) comparisons are defined as "sort by x, break ties by y").

z_6 → (0, 0), rank 0

z_5 → (1, 0), rank 1

z_1 → (1, 3), rank 2

z_3 → (1, 3), rank 2

z_0 → (2, 1), rank 3

z_2 → (3, 1), rank 4

z_4 → (3, 1), rank 4

Step 1 is complete.

Lets go on to Step 2 and sort these suffixes by the first 2^{2} = 4 characters. We again build pairs, this time using *ranks we computed just now*, in Step 1. We have first 4 characters of z_k = first 2 character of z_k + first 2 character of **z_(k+2)**. Hence, the pairs will be (rank of z_k in Step 1, rank of z_(k + 2) in Step 1).

For example z_0 is replaced with (rank of z_0 in Step 1, rank of z_2 in Step 1). Again, this works because rank of z_0 in Step 1 is nothing but the rank of "ba" among all 2 length sub-strings, and rank of z_2 in Step 1 is the rank of "na" among all 2 length sub-strings. The said pair thus represents the entire string "bana" which is what we intend to use for comparison in this step.

We carry on this process for 2 more steps, since the largest suffix must have its entire length used for comparison. That happens in Step k where 2^{k} >= 7, or k = 3. And finally end up with the suffix array

z_6 = rank 0, $

z_5 = rank 1, a$

z_3 = rank 2, ana$

z_1 = rank 3, anana$

z_0 = rank 4, banana$

z_4 = rank 5, na$

z_2 = rank 6, nana$

which is [6, 5, 3, 1, 0, 4, 2].

## Complexity

We perform roughly *log n* steps, and each step takes O(n log n) time because of the sorting. Hence, this algorithm's complexity is O(n log^{2} n).

# Interlude: Radix sort for our case

General purpose sorting is O(n log n), and that is the *best possible* (based on information theoretic bound), when the only information we know about our <= operator is that it "doesn't contradict itself" (formally, induces a total ordering on the elements). In particular, O(n log n) is the best possible only if we don't know anything about the values we would like to sort.

Better sorting algorithms* exist *when we know more. In this case, we know that everything we sort is a pair of two integers with each integer in the range 0 to n. With this information, we can perform sorting in O(n) using **radix sort**.

It works like this. Let us consider an example. Say we want to sort (2,3), (4,5), (2,5), (4,7), (1,2). We'll only look at the first elements of each pair for now. Using first element only, we can infer that (1,2) < (2,3) and (2, 5) < (4, 5) and (4, 7).

Call a collection of elements with same first element, a **bucket**. So (2,3) and (2,5) form a bucket. (4,5) and (4,7) form a bucket and (1,2) forms a bucket by itself.

From this information we can conclude that elements of bucket (4, 5), (4, 7) starts filling the sorted array from position 3 (0-indexed), because there are 3 elements before it. And (2,3), (2,5) starts filling from 1.

To calculate the start positions of each of the buckets - first we store the *sizes of the buckets* in an array. This array is [0, 1, 2, 0, 2] for our example. The i-th element (0-indexed) is the number of pairs with first element = i.

Then, we perform a **cumulative sum** of this array, i.e. array whose x'th element is the sum of all elements in the above array from 0..x-1. For this above array, it is [0, 0, 1, 3, 3, 5]. This array indicates from** where elements of first element x start filling in the sorted array** (referred to as bucstart). For more explanation of this array see code below.

Then, we create a list of lists. In position *x*, we have the elements which have second element *x*. For our example, this would be

[ [],[],[(1, 2)],[(2, 3)],[],[(2, 5), (4, 5)],[],[(4, 7)] ]

Now traverse the list in this order, and when you get an element, put it in an array as follows:

elem <- get current elementkey <- elem.firstplace elem in bucstart[key]bucstart[key] <- bucstart[key] + 1 // position the next element with this key occupies

When the process completes you get the sorted array, in complexity O(n).

## Improving to O(n log n)

Using radix sort algorithm described above, we can reduce the sorting to O(n), and consequently the algorithm is sped up to O(n log n).

**Note**: Although the theoretical complexity is better, it is hard to *actually* make this work faster than O(n log^{2} n) for practical inputs (say n <= 500000). This is because inbuilt sort functions in programming languages like C++ are very optimized, and an easy implementation of radix sort will use vectors and vector operations.

However, optimized implementations of the radix sort should lead to faster execution than the normal sort.

# Implementation

Here is the implementation of the above algorithms. Just enable/disable the RADIXSORT to use O(n log n) instead of O(n log^{2} n). Note: In this implementation, O(n log n) is *slower* than O(n log^{2} n).

#include <bits/stdc++.h>//#define DEBUG // enable to see verbose output of radixsort working//#define RADIXSORT // enable radix sort instead of c++ inbuilt sort//#define LARGETEST // enable execution using random large case (no input) instead of inputting stringusing namespace std;const int maxn = 6e5 + 7;pair< pair<int, int>, int > Z[maxn]; // array of z_k and kpair< pair<int, int>, int > P[maxn]; // used as copy of Zint rnk[maxn]; // ranks of the elementsvector<int> B[maxn]; // list of lists as mentioned in the articleint bucsz[maxn]; // sizes of the bucketsint bucstart[maxn]; // current start position of the bucketsinline void radsort(int buckets, int n) { // assume buckets start from 0#ifdef RADIXSORTfill(B, B + buckets, *new vector<int>());fill(bucsz, bucsz + buckets, 0);fill(bucstart, bucstart + buckets, 0);#ifdef DEBUGcout<<"Call ================="<<endl;#endiffor (int i = 0; i < n; i++) { // calculate sizes of bucketsauto elem = Z[i];#ifdef DEBUGcout<<elem.first.first<<","<<elem.first.second<<" : "<<elem.second<<endl;#endifint key = elem.first.first;bucsz[key]++;}for (int i = 0; i < buckets; i++) { // calculate initial position of the elements with first element iif (i > 0) bucstart[i] = bucstart[i - 1] + bucsz[i - 1];else bucstart[i] = 0;}for (int i = 0; i < n; i++) { // build the list of listsauto elem = Z[i];int key = elem.first.second;B[key].push_back(i);}for (int i = 0; i < buckets; i++) { // implementation of the pseudocode snippet in the article in the Radix sort sectionfor (int j = 0; j < B[i].size(); j++) {int k = B[i][j]; int fst = Z[k].first.first;int pos = bucstart[fst];P[pos] = Z[k]; bucstart[fst] = pos + 1;}}for (int i = 0; i < n; i++) Z[i] = P[i]; // copy back to Z#endif#ifndef RADIXSORTsort(Z, Z + n); // this is used if we do not do RADIXSORT. This makes the algorithm O(n log^2 n)#endiffill(rnk, rnk + n, 0);for (int i = 0; i < n; i++) {int idx = Z[i].second;if (i == 0) { rnk[idx] = 0; continue; }rnk[idx] = rnk[ Z[i - 1].second ]; // rnk of previous elementif (Z[i].first != Z[i - 1].first) rnk[idx]++; // if you're worse then you're of worse (higher) rnk}#ifdef DEBUGfor (int i = 0; i < n; i++) {cout<<rnk[i]<<" ";}cout<<endl;#endif}vector<int> SA(string s) {int n = s.size(); s += "$"; // add dummy character. Now s has valid positions 0..n.for (int i = 0; i <= n; i++) {Z[i] = make_pair( make_pair(s[i], 0), i );}radsort(256, n + 1); // maximum number of buckets is 256 (ASCII characters in general)for (int j = 0; (1<<j) <= n; j++) { // pairs of strings of length 2^j eachfor (int i = 0; i <= n; i++) {int next = i + (1<<j);Z[i] = make_pair( make_pair(rnk[i], (next > n) ? 0 : rnk[next]), i );}radsort(n + 1, n + 1); // here maximum number of buckets is n + 1 (0 .. n)}vector<int> S;for (int i = 0; i <= n; i++) S.push_back(Z[i].second); // construct the suffix array from the final Z array.return S;}int main() {ios::sync_with_stdio(false); cin.tie(0);string s;#ifdef LARGETESTfor (int i = 0; i < 500000; i++) { // generate large testcase of n = 500000 when LARGETEST is enableds += (char)('a' + rand()%26); // there is no srand(time(NULL)), so this produces the same string everytime.}#endif#ifndef LARGETESTcin>>s;#endifvector<int> S = SA(s);for (int i = 0; i < S.size(); i++) { // output the suffix arraycout<<S[i]<<" ";}cout<<endl;}

# Practice

The following problem just requires applying the Suffix array, so you use it to test your implementation and understanding: Problem SARRAY.

To solve this problem, you'll need to extend the above implementation to a bigger alphabet and also define the *null character *accordingly.

This problem grades according to speed. You'll probably score 60 with the above approach (which is good enough for most competitive programming sites like Codeforces, CodeChef, etc).

Here are some more problems for practice from SPOJ:

# Footnotes

## Lexicographic Order

**Lexicographic order** is a generalized version of *alphabetical order*. If we are working in the realm of strings derived from **Latin (English) characters**, then we have: a < b < c < d < ... < z.

How do we compare two strings? Say we have the strings "suffix" and "suffice". We go character by character, and stop at the first position they differ. In this case, it is the 5th (0-indexing) position, where "suffix" has x and "suffice" has c. Since, c < x, we have "suffice" < "suffix".

What happens when one of the string ends *before* there is any different character, i.e., the smaller string is a prefix of the larger one? In that case, we say the smaller string is smaller. For example, "ab" < "abcd".

This order is already implemented in most programming languages (C++, Python, etc). So when we do s < t for two strings s, t, the language automatically does checking based on lexicographic order as described above.