Journal of Discrete Algorithms 27 (2014) 21–28

Contents lists available at ScienceDirect

Journal of Discrete Algorithms

www.elsevier.com/locate/jda

An elegant algorithm for the construction of suffix arrays

Sanguthevar Rajasekaran, Marius Nicolae ∗

Dept. of Computer Science and Engineering, Univ. of Connecticut, Storrs, CT, USA

Article info

Article history:
Received 4 October 2012
Received in revised form 28 November 2013
Accepted 10 March 2014
Available online 19 April 2014

Keywords:
Suffix array construction algorithm
Parallel algorithm
High probability bounds

Abstract

The suffix array is a data structure that finds numerous applications in string processing problems for both linguistic texts and biological data. It has been introduced as a memory efficient alternative for suffix trees. The suffix array consists of the sorted suffixes of a string. There are several linear time suffix array construction algorithms (SACAs) known in the literature. However, one of the fastest algorithms in practice has a worst case run time of $O(n^2)$. The problem of designing practically and theoretically efficient techniques remains open. In this paper we present an elegant algorithm for suffix array construction which takes linear time with high probability; the probability is on the space of all possible inputs. Our algorithm is one of the simplest of the known SACAs and it opens up a new dimension of suffix array construction that has not been explored until now. Our algorithm is easily parallelizable. We offer parallel implementations on various parallel models of computing. We prove a lemma on the $\ell$-mers of a random string which might find independent applications. We also present another algorithm that utilizes the above algorithm. This algorithm is called RadixSA and has a worst case run time of $O(n \log n)$. RadixSA introduces an idea that may find independent applications as a speedup technique for other SACAs. An empirical comparison of RadixSA with other algorithms on various datasets reveals that our algorithm is one of the fastest algorithms to date. The C++ source code is freely available at http://www.engr.uconn.edu/~man09004/radixSA.zip.

CC BY-NC-SA license (http://creativecommons.org/licenses/by-nc-sa/3.0/).

1. Introduction

The suffix array is a data structure that finds numerous applications in string processing problems for both linguistic texts and biological data. It was introduced in [17] as a memory efficient alternative to suffix trees. The suffix array of a string $T$ is an array $A$ ($|T| = |A| = n$) which gives the lexicographic order of all the suffixes of $T$. Thus, $A[i]$ is the starting position of the lexicographically $i$-th smallest suffix of $T$.
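To make the definition concrete, the following minimal Python sketch builds a suffix array by directly sorting all suffixes. It is only an illustration of the definition, not one of the algorithms discussed in this paper: the function name `naive_suffix_array` is hypothetical, positions are 0-based (an assumption made for convenience), and the running time is far from linear because every comparison may scan a whole suffix.

```python
def naive_suffix_array(text: str) -> list[int]:
    """Return the start positions (0-based) of the suffixes of `text`,
    listed in lexicographic order of the suffixes."""
    # Sorting the positions by the suffix that starts there is exactly
    # the definition of the suffix array; it is simple but slow.
    return sorted(range(len(text)), key=lambda i: text[i:])


if __name__ == "__main__":
    # Suffixes of "banana" in order: a, ana, anana, banana, na, nana
    print(naive_suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```

Such a direct construction is mainly useful as a reference against which faster implementations can be tested.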

The original suffix array construction algorithm [17] runs in $O(n \log n)$ time. It is based on a technique called prefix doubling: assume that the suffixes are grouped into buckets such that suffixes in the same bucket share the same prefix of length $k$. Let $b_i$ be the bucket number for suffix $i$. Let $q_i = (b_i, b_{i+k})$. Sort the suffixes with respect to $q_i$ using radix sort. As a result, the suffixes become sorted by their first $2k$ characters. Update the bucket numbers and repeat the process until all the suffixes are in buckets of size 1. This process takes no more than $\log n$ rounds. The idea of sorting suffixes in one bucket based on the bucket information of nearby suffixes is called induced copying. It appears in some form or another in many of the algorithms for suffix array construction.
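The prefix doubling rounds described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation of [17]: the name `prefix_doubling_sa` is hypothetical, positions are 0-based, a suffix shorter than $k$ gets $-1$ as the second component of its key, and Python's built-in comparison sort stands in for radix sort, so each round costs $O(n \log n)$ rather than $O(n)$.

```python
def prefix_doubling_sa(text: str) -> list[int]:
    """Suffix array by prefix doubling (0-based positions)."""
    n = len(text)
    if n == 0:
        return []
    # Initial buckets: suffixes that share their first character (k = 1).
    bucket = [ord(c) for c in text]
    order = list(range(n))
    k = 1
    while True:
        def key(i: int) -> tuple[int, int]:
            # q_i = (b_i, b_{i+k}); -1 marks "no characters left" and
            # sorts before every real bucket number.
            return (bucket[i], bucket[i + k] if i + k < n else -1)

        # Stand-in for radix sort on the pairs q_i (assumption).
        order.sort(key=key)

        # Update bucket numbers: equal keys stay in the same bucket.
        new_bucket = [0] * n
        for prev, cur in zip(order, order[1:]):
            new_bucket[cur] = new_bucket[prev] + (key(cur) != key(prev))
        bucket = new_bucket

        # All buckets have size 1 once the largest bucket number is n - 1.
        if bucket[order[-1]] == n - 1:
            return order
        k *= 2


if __name__ == "__main__":
    print(prefix_doubling_sa("banana"))  # [5, 3, 1, 0, 4, 2]
```

After the round with parameter $k$, two suffixes share a bucket exactly when their first $2k$ characters agree, which is why the number of rounds is bounded by $\log n$.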

∗ Corresponding author.

E-mail addresses: rajasek@engr.uconn.edu (S. Rajasekaran), marius.nicolae@engr.uconn.edu (M. Nicolae).

http://dx.doi.org/10.1016/j.jda.2014.03.001

1570-8667/

(http://creativecommons.org/licenses/by-nc-sa/3.0/).


Numerous papers have been written on suffix arrays. A survey on some of these algorithms can be found in [22]. The authors of [22] categorize suffix array construction algorithms (SACAs) into five classes based on the main techniques employed: 1) Prefix Doubling (examples include [17], run time $O(n \log n)$; [15], run time $O(n \log n)$); 2) Recursive (examples include [11], run time $O(n \log \log n)$); 3) Induced Copying (examples include [1], run time $O(n \sqrt{\log n})$); 4) Hybrid (examples include [7] and [12], run time $O(n^2 \log n)$); and 5) Suffix Tree (examples include [13], run time $O(n \log \sigma)$, where $\sigma$ is the size of the alphabet).

In 2003, three independent groups [12,9,10] found the first linear time suffix array construction algorithms which do not require building a suffix tree beforehand. For example, in [12] the suffixes are classified as either L or S. Suffix $i$ is an L suffix if it is lexicographically larger than suffix $i + 1$; otherwise it is an S suffix. Assume that the number of L suffixes is less than $n/2$; if not, do this for S suffixes. Create a new string where the segments of text in between L suffixes are renamed to single characters. The new text has length no more than $n/2$ and we recursively find its suffix array. This suffix array gives the order of the L suffixes in the original string. This order is used to induce the order of the remaining suffixes.
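The L/S classification used in [12] can be computed in a single right-to-left scan, as the Python sketch below illustrates. It covers only the classification step, not the renaming, recursion or induction steps. The name `classify_suffixes` is hypothetical, positions are 0-based, and the treatment of the last suffix is an assumption: it is compared against the empty suffix and therefore labelled L, whereas implementations that append a sentinel character may label it differently.

```python
def classify_suffixes(text: str) -> str:
    """Return a string of 'L'/'S' labels, one per suffix of `text`.

    Suffix i is L if it is lexicographically larger than suffix i + 1,
    otherwise it is S.
    """
    n = len(text)
    types = [""] * n
    for i in range(n - 1, -1, -1):
        if i == n - 1:
            # Assumption: the last suffix is compared with the empty
            # suffix, which is smaller than any non-empty string.
            types[i] = "L"
        elif text[i] != text[i + 1]:
            # The first characters already decide the comparison.
            types[i] = "L" if text[i] > text[i + 1] else "S"
        else:
            # Equal first characters: the comparison of suffix i with
            # suffix i + 1 reduces to that of suffix i + 1 with i + 2.
            types[i] = types[i + 1]
    return "".join(types)


if __name__ == "__main__":
    print(classify_suffixes("banana"))  # LSLSLL
```

The scan takes linear time because each label depends only on two adjacent characters and the label to its right.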

Another linear time algorithm, called skew, is given in [9]. It first sorts those suffixes $i$ with $i \bmod 3 \neq 0$ using a recursive procedure. The order of these suffixes is then used to infer the order of the suffixes with $i \bmod 3 = 0$. Once these two groups are determined we can compare one suffix from the first group with one from the second group in constant time. The last step is to merge the two sorted groups, in linear time.
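The sketch below illustrates one level of the skew idea in Python, with emphasis on the constant-time comparison used in the merge. It is not the algorithm of [9] as published: the recursive sort of the first group is replaced by a plain comparison sort (an assumption made to keep the example short), the second group is sorted with Python's built-in sort instead of radix sort, positions are 0-based, and the names `skew_one_level`, `sample` and `rest` are hypothetical.

```python
def skew_one_level(text: str) -> list[int]:
    """Suffix array via one level of the skew scheme (0-based positions)."""
    n = len(text)

    # Group 1: suffixes i with i mod 3 != 0. The real algorithm sorts
    # them recursively; a direct sort is used here as a stand-in.
    sample = sorted((i for i in range(n) if i % 3 != 0), key=lambda i: text[i:])
    rank = [0] * (n + 3)  # rank 0 means "past the end of the text"
    for r, i in enumerate(sample, start=1):
        rank[i] = r

    def char(i: int) -> str:
        return text[i] if i < n else ""  # "" sorts before any character

    # Group 2: suffixes i with i mod 3 == 0, ordered by their first
    # character and the already known rank of suffix i + 1.
    rest = sorted(range(0, n, 3), key=lambda i: (text[i], rank[i + 1]))

    def sample_first(i: int, j: int) -> bool:
        """Constant-time test: is group-1 suffix i smaller than group-2 suffix j?"""
        if i % 3 == 1:
            # i + 1 and j + 1 are both in group 1, so their ranks are known.
            return (char(i), rank[i + 1]) < (char(j), rank[j + 1])
        # i mod 3 == 2: i + 2 and j + 2 are both in group 1.
        return (char(i), char(i + 1), rank[i + 2]) < (char(j), char(j + 1), rank[j + 2])

    # Merge the two sorted groups in linear time.
    merged, a, b = [], 0, 0
    while a < len(sample) and b < len(rest):
        if sample_first(sample[a], rest[b]):
            merged.append(sample[a])
            a += 1
        else:
            merged.append(rest[b])
            b += 1
    merged.extend(sample[a:])
    merged.extend(rest[b:])
    return merged


if __name__ == "__main__":
    print(skew_one_level("banana"))  # [5, 3, 1, 0, 4, 2]
```

The comparison looks at no more than two characters and one pair of ranks, which is what makes the merge linear.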

Several other SACAs have been proposed in the literature in recent years (e.g., [20,25]). Some of the algorithms with superlinear worst case run times perform better in practice than the linear ones. One of the currently best performing algorithms in practice is the BPR algorithm of [25] which has an asymptotic worst-case run time of $O(n^2)$. BPR first sorts all the suffixes up to a certain depth, then focuses on one bucket at a time and repeatedly refines it into sub-buckets.

In this paper we present an elegant algorithm for suffix array construction. This algorithm takes linear time with high probability. Here the probability is on the space of all possible inputs. Our algorithm is one of the simplest algorithms known for constructing suffix arrays. It opens up a new dimension in suffix array construction, i.e., the development of algorithms with provable expected run times. This dimension has not been explored before. We prove a lemma on the $\ell$-mers of a random string which might find independent applications. Our algorithm is also nicely parallelizable. We offer parallel implementations of our algorithm on various parallel models of computing.

We also present another algorithm for suffix array construction that utilizes the above algorithm. This algorithm, called RadixSA, is based on bucket sorting and has a worst case run time of $O(n \log n)$. It employs an idea which, to the best of our knowledge, has not been directly exploited until now. RadixSA selects the order in which buckets are processed based on a heuristic such that, downstream, they impact as many other buckets as possible. This idea may find independent application as a standalone speedup technique for other SACAs based on bucket sorting. RadixSA also employs a generalization of Seward's copy method [26] (initially described in [4]) to detect and handle repeats of any length. We compare RadixSA with other algorithms on various datasets.

2. A useful lemma

Let $\Sigma$ be an alphabet of interest and let $S = s_1 s_2 \cdots s_n \in \Sigma^*$. Consider the case when $S$ is generated randomly, i.e., each $s_i$ is picked uniformly randomly from $\Sigma$ ($1 \le i \le n$). Let $L$ be the set of all $\ell$-mers of $S$. Note that $|L| = n - \ell + 1$. What can we say about the independence of these $\ell$-mers? In several papers analyses have been done assuming that these $\ell$-mers are independent (see e.g., [2]). These authors point out that this assumption may not be true but these analyses have proven to be useful in practice. In this section we prove the following lemma on these $\ell$-mers.

Lemma 1. Let $L$ be the set of all $\ell$-mers of a random string generated from an alphabet $\Sigma$. Then, the $\ell$-mers in $L$ are pairwise independent. These $\ell$-mers need not be $k$-way independent for $k \ge 3$.

Proof. Let $A$ and $B$ be any two $\ell$-mers in $L$. If $A$ and $B$ are non-overlapping, clearly, $\mathrm{Prob}[A = B] = (1/\sigma)^{\ell}$, where $\sigma = |\Sigma|$. Thus, consider the case when $A$ and $B$ are overlapping.

Let $P_i = s_i s_{i+1} \cdots s_{i+\ell-1}$, for $1 \le i \le (n - \ell + 1)$. Let $A = P_i$ and $B = P_j$ with $i < j$ and $j \le (i + \ell - 1)$. Also let $j = i + k$ where $1 \le k \le (\ell - 1)$.

Consider the special case when $k$ divides $\ell$. If $A = B$, then it should be the case that $s_i = s_{i+k} = s_{i+2k} = \cdots = s_{i+\ell}$; $s_{i+1} = s_{i+k+1} = s_{i+2k+1} = \cdots = s_{i+\ell+1}$; $\cdots$; and $s_{i+k-1} = s_{i+2k-1} = s_{i+3k-1} = \cdots = s_{i+\ell+k-1}$. In other words, we have $k$ series of equalities. Each series is of length $(\ell/k) + 1$. The probability of all of these equalities is $(1/\sigma)^{\ell/k} (1/\sigma)^{\ell/k} \cdots (1/\sigma)^{\ell/k} = (1/\sigma)^{\ell}$.



As an example, let $S = abcdefghi$, $\ell = 4$, $k = 2$, $A = P_1$, and $B = P_3$. In this case, the following equalities should hold: $a = c = e$ and $b = d = f$. The probability of all of these equalities is $(1/\sigma)^2 (1/\sigma)^2 = (1/\sigma)^4$.
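For small parameters, the quantity computed in the proof, $\mathrm{Prob}[A = B]$ for two overlapping $\ell$-mers, can be checked exactly by enumerating every string. The Python sketch below does this with exact rational arithmetic. It is an illustrative check, not part of the paper: the function name `prob_all_equal` is hypothetical, start positions are 0-based, and only the shortest string that contains all chosen $\ell$-mers is enumerated, which is sufficient because the remaining characters are chosen independently.

```python
from fractions import Fraction
from itertools import product


def prob_all_equal(sigma: int, ell: int, starts: tuple[int, ...]) -> Fraction:
    """Exact probability that the ell-mers starting at `starts` (0-based)
    are all equal, for a uniformly random string over `sigma` symbols."""
    n = max(starts) + ell  # shortest string containing every chosen ell-mer
    hits = 0
    for s in product(range(sigma), repeat=n):
        # The chosen ell-mers coincide iff they collapse to one distinct value.
        distinct = {s[i:i + ell] for i in starts}
        hits += len(distinct) == 1
    return Fraction(hits, sigma ** n)


if __name__ == "__main__":
    # The worked example: ell = 4, k = 2, over an alphabet of size 3.
    print(prob_all_equal(3, 4, (0, 2)))   # 1/81, i.e. (1/3)^4
    # Three mutually overlapping 2-mers over a binary alphabet.
    print(prob_all_equal(2, 2, (0, 1, 2)))  # 1/8
```

For every overlap tried, the pairwise value equals $(1/\sigma)^{\ell}$. The last call is our own illustration of how three overlapping $\ell$-mers can behave differently: all three coincide with probability $1/8$, whereas multiplying two pairwise probabilities would give $1/16$.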



Now consider the general case (where $k$ may not divide $\ell$). Let $\ell = qk + r$ for some integers $q$ and $r$ where $r < k$. If $A = B$, the following equalities will hold: $s_i = s_{i+k} = s_{i+2k} = \cdots = s_{i+\lfloor(\ell+k-1)/k\rfloor k}$; $s_{i+1} = s_{i+1+k} = s_{i+1+2k} = \cdots = s_{i+1+\lfloor(\ell+k-2)/k\rfloor k}$; $\cdots$; and $s_{i+k-1} = s_{i+k-1+k} = s_{i+k-1+2k} = \cdots = s_{i+k-1+\lfloor \ell/k \rfloor k}$. Here again we have $k$ series of equalities. The number of elements in the $q$-th series is $1 + \lfloor(\ell+k-q)/k\rfloor$, for $1 \le q \le k$. The probability of all of these equalities is $(1/\sigma)^x$ where $x = \sum_{q=1}^{k} \lfloor(\ell+k-q)/k\rfloor$.
