没有合适的资源?快使用搜索试试~ 我知道了~
温馨提示
Journal of Discrete Algorithms 27 (2014) 21–28Contents lists available at ScienceDirectJournal of Discrete Algorithmswww.elsevier.com/locate/jdaAn elegant algorithm for the construction of suffix arraysSanguthevar Rajasekaran, Marius Nicolae ∗Dept. of Computer Science and Engineering, Univ. of Connecticut, Storrs, CT, USAa r t i c l e i n f o a b s t r a c tArticle history: Received 4 October 2012 Received in revised form 28 November 2013 Accepted 10 March 2014 Available online 19 April 2014Keyw
资源推荐
资源详情
资源评论
Journal of Discrete Algorithms 27 (2014) 21–28
Contents lists available at ScienceDirect
Journal of Discrete Algorithms
www.elsevier.com/locate/jda
An elegant algorithm for the construction of suffix arrays
Sanguthevar Rajasekaran, Marius Nicolae
∗
Dept. of Computer Science and Engineering, Univ. of Connecticut, Storrs, CT, USA
article info abstract
Article history:
Received 4 October 2012
Received in revised form 28 November 2013
Accepted 10 March 2014
Available online 19 April 2014
Keywords:
Suffix array construction algorithm
Parallel algorithm
High probability bounds
The suffix array is a data structure that finds numerous applications in string processing
problems for both linguistic texts and biological data. It has been introduced as a memory
efficient alternative for suffix trees. The suffix array consists of the sorted suffixes of a
string. There are several linear time suffix array construction algorithms (SACAs) known
in the literature. However, one of the fastest algorithms in practice has a worst case run
time of O
(n
2
). The problem of designing practically and theoretically efficient techniques
remains open.
In this paper we present an elegant algorithm for suffix array construction which takes
linear
time with high probability; the probability is on the space of all possible inputs. Our
algorithm is one of the simplest of the known SACAs and it opens up a new dimension
of suffix array construction that has not been explored until now. Our algorithm is easily
parallelizable. We offer parallel implementations on various parallel models of computing.
We prove a lemma on the
-mers of a random string which might find independent
applications. We also present another algorithm that utilizes the above algorithm. This
algorithm is called RadixSA and has a worst case run time of O
(n logn). RadixSA introduces
an idea that may find independent applications as a speedup technique for other SACAs. An
empirical comparison of RadixSA with other algorithms on various datasets reveals that our
algorithm is one of the fastest algorithms to date. The C++ source code is freely available
at http://www.engr.uconn.edu/~man09004/radixSA.zip.
© 2014 The Authors. Published by Elsevier B.V. This is an open access article under the
CC BY-NC-SA license (http://creativecommons.org/licenses/by-nc-sa/3.0/).
1. Introduction
The suffix array is a data structure that finds numerous applications in string processing problems for both linguistic
texts and biological data. It has been introduced in [17] as a memory efficient alternative to suffix trees. The suffix array of
astringT is an array A,(
|T |=|A|=n) which gives the lexicographic order of all the suffixes of T .Thus,A[i] is the starting
position of the lexicographically i-th smallest suffix of T .
The original suffix array construction algorithm [17] runs
in O (n log n) time. It is based on a technique called prefix
doubling: assume that the suffixes are grouped into buckets such that suffixes in the same bucket share the same prefix of
length k.Letb
i
be the bucket number for suffix i.Letq
i
= (b
i
, b
i+k
). Sort the suffixes with respect to q
i
using radix sort. As
a result, the suffixes become sorted by their first 2k characters. Update the bucket numbers and repeat the process until all
thesuffixesareinbucketsofsize1.Thisprocesstakesnomorethanlogn rounds. The idea of sorting suffixes in one bucket
based on the bucket information of nearby suffixes is called induced copying. It appears in some form or another in many of
the algorithms for suffix array construction.
*
Corresponding author.
E-mail addresses: r
ajasek@engr.uconn.edu (S. Rajasekaran), marius.nicolae@engr.uconn.edu (M. Nicolae).
http://dx.doi.org/10.1016/j.jda.2014.03.001
1570-8667/
© 2014 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-SA license
(http://creativecommons.org/licenses/by-nc-sa/3.0/).
22 S. Rajasekaran, M. Nicolae / Journal of Discrete Algorithms 27 (2014) 21–28
Numerous papers have been written on suffix arrays. A survey on some of these algorithms can be found in [22].The
authors of [22] categorize suffix array construction algorithms (SACA) into five based on the main techniques employed:
1) Prefix Doubling (examples include [17] –runtime
= O(n log n); [15] –runtime= O (n log n)); 2) Recursive (examples
include [11] –runtime
= O(n log logn)); 3) Induced Copying (examples include [1] –runtime= O (n
logn)); 4) Hybrid
(examples include [7] and [12] –runtime
= O (n
2
logn)); and 5) Suffix Tree (examples include [13] –runtime= O (n log σ )
where σ is the size of the alphabet).
In 2003, three independent groups [12,9,10] found the first linear time suffix array construction algorithms which do
not require building a suffix tree beforehand. For example, in [12] thesuffixesareclassifiedaseitherL or S.Suffixi is an
L suffix if it is lexicographically larger than suffix i
+ 1, otherwise it is an S suffix. Assume that the number of L suffixes
is less than n
/2, if not, do this for S suffixes. Create a new string where the segments of text in between L suffixes are
renamed to single characters. The new text has length no more than n
/2 and we recursively find its suffix array. This suffix
array gives the order of the L suffixes in the original string. This order is used to induce the order of the remaining suffixes.
Another linear time algorithm, called skew,i
sgivenin[9].Itfirstsortsthosesuffixesi with i mod 3 = 0 using a recursive
procedure. The order of these suffixes is then used to infer the order of the suffixes with i mod 3
= 0. Once these two groups
are determined we can compare one suffix from the first group with one from the second group in constant time. The last
step is to merge the two sorted groups, in linear time.
Several other SACAs have been proposed in the literature in recent years (e.g., [20,25]).
Some of the algorithms with
superlinear worst case run times perform better in practice than the linear ones. One of the currently best performing
algorithms in practice is the BPR algorithm of [25] which has an asymptotic worst-case run time of O
(n
2
). BPR first sorts
all the suffixes up to a certain depth, then focuses on one bucket at a time and repeatedly refines it into sub-buckets.
In this paper we present an elegant algorithm for suffix array construction. This algorithm takes linear time with high
pr
obability. Here the probability is on the space of all possible inputs. Our algorithm is one of the simplest algorithms
known for constructing suffix arrays. It opens up a new dimension in suffix array construction, i.e., the development of
algorithms with provable expected run times. This dimension has not been explored before. We prove a lemma on the
-mers of a random string which might find independent applications. Our algorithm is also nicely parallelizable. We offer
parallel implementations of our algorithm on various parallel models of computing.
We also present another algorithm for suffix array construction that utilizes the above algorithm. This algorithm, called
R
adixSA, is based on bucket sorting and has a worst case run time of O
(n log n). It employs an idea which, to the best of our
knowledge, has not been directly exploited until now. RadixSA selects the order in which buckets are processed based on a
heuristic such that, downstream, they impact as many other buckets as possible. This idea may find independent application
as a standalone speedup technique for other SACAs based on bucket sorting. RadixSA also employs a generalization of
Seward’s copy method [26] (initially described in [4]) to detect and handle repeats of any length. We compare RadixSA with
other algorithms on various datasets.
2. A useful lemma
Let Σ be an alphabet of interest and let S = s
1
s
2
...s
n
∈ Σ
∗
. Consider the case when S is generated randomly, i.e., each
s
i
is picked uniformly randomly from Σ (1 ≤ i ≤n). Let L be the set of all -mers of S. Note that |L|=n − +1. What can
we say about the independence of these
-mers? In several papers analyses have been done assuming that these -mers are
independent (see e.g., [2]). These authors point out that this assumption may not be true but these analyses have proven to
be useful in practice. In this section we prove the following lemma on these
-mers.
Lemma 1. Let
L be the set of all -mers of a random string generated from an alphabet Σ .Then,the-mers in L are pairwise indepen-
dent. These
-mers need not be k-way independent for k ≥ 3.
Proof. Le
t A and B be any two -mers in L.Ifx and y are non-overlapping, clearly, Prob[A = B]=(1/σ )
, where σ =|Σ|.
Thus, consider the case when x and y are overlapping.
Let P
i
= s
i
s
i+1
...s
i+−1
,for1≤ i ≤ (n − + 1).LetA = P
i
and B = P
j
with i < j and j ≤ (i + − 1).Alsolet j = i + k
where 1
≤k ≤ ( −1).
Consider the special case when k divides .IfA = B, then it should be the case that s
i
= s
i+k
= s
i+2k
=···=s
i+
;
s
i+1
= s
i+k+1
= s
i+2k+1
=···=s
i++1
; ···; and s
i+k−1
= s
i+2k−1
= s
i+3k−1
=···=s
i++k−1
. In other words, we have k series
of equalities. Each series is of length
(/k) + 1. The probability of all of these equalities is (
1
σ
)
/k
(
1
σ
)
/k
···(
1
σ
)
/k
= (
1
σ
)
.
As an example, let S = abcdef ghi, = 4, k = 2, A = P
1
, and B = P
3
. In this case, the following equalities should hold:
a
=c = e and b = d = f . The probability of all of these equalities is (1/σ )
2
(1/σ )
2
= (1/σ )
4
= (1/σ )
.
Now consider the general case (where k ma
y not divide ). Let = qk +r for some integers q and r where r < k.If A = B,
the following equalities will hold: s
i
= s
i+k
= s
i+2k
=···=s
i+(+k−1)/kk
; s
i+1
= s
i+1+k
= s
i+1+2k
=···=s
i+1+(+k−2)/kk
;
···; and s
i+k−1
= s
i+k−1+k
= s
i+k−1+2k
=···=s
i+k−1+(/k)k
.
Here again we have k series of equalities. The number of elements in the q-th series is 1 +
+k−q
k
,for1≤q ≤k.The
probability of all of these equalities is
(1/σ )
x
where x =
k
q
=1
+k−q
k
.
剩余7页未读,继续阅读
资源评论
weixin_38732425
- 粉丝: 6
- 资源: 942
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
最新资源
- 人脸检测-yolov8.zip
- 为 YOLOv3 框架实现了多主干和多 gpu 模型,从 qqwwee 分叉而来 .zip
- 一种强大的鱼类检测模型,可在任何海洋环境中实时检测水下鱼类 .zip
- 一个关于如何使用yolov5转化的openvino模型的SDK.zip
- 蓝桥杯历届单片机国赛编程题
- 使用内容提供者共享数据(利用记事本项目)
- 计算机课程设计基于SpringBoot的酒店管理系统项目带答辩ppt+数据库.zip
- IT桔子:中国智能电视市场研究报告
- [MICCAI'24]“BGF-YOLO通过多尺度注意力特征融合增强型YOLOv8用于脑肿瘤检测”的官方实现 .zip
- CB Insights:智能汽车才是未来-信息图
资源上传下载、课程学习等过程中有任何疑问或建议,欢迎提出宝贵意见哦~我们会及时处理!
点击此处反馈
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功