Simple Linear Work Suffix Array Construction
Juha K¨arkk¨ainen and Peter Sanders
Max-Planck-Institut f¨ur Informatik
Stuhlsatzenhausweg 85, 66123 Saarbr¨ucken, Germany
[juha,sanders]@mpi-sb.mpg.de.
Abstract. A suffix array represents the suffixes of a string in sorted
order. Being a simpler and more compact alternative to suffix trees, it
is an important tool for full text indexing and other string processing
tasks. We introduce the skew algorithm for suffix array construction over
integer alphabets that can be implemented to run in linear time using
integer sorting as its only nontrivial subroutine:
1. recursively sort suffixes beginning at positions i mod 3 =0.
2. sort the remaining suffixes using the information obtained in step one.
3. merge the two sorted sequences obtained in steps one and two.
The algorithm is much simpler than previous linear time algorithms
that are all based on the more complicated suffix tree data structure.
Since sorting is a well studied problem, we obtain optimal algorithms
for several other models of computation, e.g. external memory with par-
allel disks, cache oblivious, and parallel. The adaptations for BSP and
EREW-PRAM are asymptotically faster than the best previously known
algorithms.
1 Introduction
The suffix tree [39] of a string is a compact trie of all the suffixes of the string. It
is a powerful data structure with numerous applications in computational biol-
ogy [21] and elsewhere [20]. One of the important properties of the suffix tree is
that it can be constructed in linear time in the length of the string. The classical
linear time algorithms [32,36,39] require a constant alphabet size, but Farach’s
algorithm [11,14] works also for integer alphabets, i.e., when characters are poly-
nomially bounded integers. There are also efficient construction algorithms for
many advanced models of computation (see Table 1).
The suffix array [18,31] is a lexicographically sorted array of the suffixes of a
string. For several applications, the suffix array is a simpler and more compact
alternative to the suffix tree [2,6,18,31]. The suffix array can be constructed in
linear time by a lexicographic traversal of the suffix tree, but such a construction
loses some of the advantage that the suffix array has over the suffix tree. The
fastest direct suffix array construction algorithms that do not use suffix trees
require O(n log n) time [5,30,31]. Also under other models of computation, direct
Partially supported by the Future and Emerging Technologies programme of the EU
under contract number IST-1999-14186 (ALCOM-FT).
J.C.M. Baeten et al. (Eds.): ICALP 2003, LNCS 2719, pp. 943–955, 2003.
c
Springer-Verlag Berlin Heidelberg 2003