DOI: 10.1007/s00453-004-1094-1
Algorithmica (2004) 40: 33–50
Algorithmica
©
2004 Springer-Verlag New York, LLC
Engineering a Lightweight Suffix Array
Construction Algorithm
1
Giovanni Manzini
2
and Paolo Ferragina
3
Abstract. In this paper we describe a new algorithm for building the suffix array of a string. This task is
equivalent to the problem of lexicographically sorting all the suffixes of the input string. Our algorithm is
based on a new approach called deep–shallow sorting: we use a “shallow” sorter for the suffixes with a short
common prefix, and a “deep” sorter for the suffixes with a long common prefix.
All the knownalgorithmsforbuilding the suffixarrayeitherrequirealarge amount of space or are inefficient
when the input string contains many repeated substrings. Our algorithm has been designed to overcome this
dichotomy. Our algorithm is “lightweight” in the sense that it uses very small space in addition to the space
required by the suffix array itself. At the same time our algorithm is fast even when the input contains many
repetitions: this has been shown by extensive experiments with inputs of size up to 110 Mb.
The source code of our algorithm, as well as a C library providing a simple API, is available under the
GNU GPL [26].
Key Words. Suffix array, Algorithmic engineering, Space-economical algorithms, Full-text index, Suffix
tree.
1. Introduction. In this paper we consider the problem of computing the suffix array
of a text string T [1, n]. This problem consists in sorting the suffixes of T in lexicographic
order. The suffix array [24] (or
PAT array [10]) is a simple, easy to code, and elegant data
structure used for severalfundamental stringmatching problemsinvolvingboth linguistic
texts and biological data [5], [13]. Recently, interest in this data structure has been revi-
talized by its use as a building block for two novel applications: (1) the Burrows–Wheeler
compression algorithm [4], which is a provably [25] and practically [29] effective com-
pression tool; and (2) the construction of succinct [12], [28] or compressed [8], [9], [11]
indexes. In these applications the construction of the suffix array is the computational
bottleneck both in time and space. This motivated our interest in designing yet another
suffix array construction algorithm which is fast and lightweight in the sense that it uses
small working space.
The suffix array consists of n integers in the range [1, n]. This means that in principle
it uses (n log n) bits of storage. However, in most applications the size of the text
is smaller than 2
32
and it is customary to store each integer in a 4 byte word; this
1
This research was partially supported by the Italian MIUR projects “Algorithmics for Internet and the Web
(ALINWEB)” and “Technologies and Services for Enhanced Content Delivery (ECD)”. A preliminary version
of this work has appeared in Proceedings of the 10th European Symposium on Algorithms (ESA ’02).
2
Dipartimento di Informatica, Universit`a del Piemonte Orientale, Alessandria, Italy, and IIT-CNR, Pisa, Italy.
manzini@mfn.unipmn.it.
3
Dipartimento di Informatica, Universit`a di Pisa, Pisa, Italy. ferragina@di.unipi.it.
Received December 16, 2002; revised October 12, 2003. Communicated by R. Sedgewick.
Online publication April 26, 2004.