MF Algorithm: said to be the fastest suffix array construction method (CSDN library resource)

28 files in total

cpp: 15

h: 6

sln: 1

Tags: MF algorithm; SA array; fast DC3, Ka; string matching

Uploaded 2009-08-26 19:43:56 · 765KB ZIP


MF串行.zip (28 files)

problem1

problem1.suo 33KB

problem1

bwt_aux.h 912B

problem1.vcproj 5KB

test1.cpp 2KB

bwt.cpp 7KB

suftest2.cpp 10KB

targetver.h 498B

unbwt.cpp 8KB

ds.cpp 9KB

common.h 2KB

problem1.cpp 4KB

globals.cpp 4KB

lcp_aux.h 556B

shallow.cpp 17KB

problem1.vcproj.WWW-83C14170147.Administrator.user 1KB

helped.cpp 21KB

rmq.cpp 2KB

rmq.h 553B

lcp_aux.cpp 11KB

problem1.vcproj.ampl.cfg 198B

bwt_aux.cpp 8KB

ds_ssort.h 285B

blind2.cpp 12KB

testlcp.cpp 9KB

deep2.cpp 7KB

problem1.ncb 1.24MB

problem1.sln 890B

MF算法.pdf 376KB

DOI: 10.1007/s00453-004-1094-1

Algorithmica (2004) 40: 33–50

2004 Springer-Verlag New York, LLC

Engineering a Lightweight Suffix Array Construction Algorithm

Giovanni Manzini and Paolo Ferragina

Abstract. In this paper we describe a new algorithm for building the suffix array of a string. This task is equivalent to the problem of lexicographically sorting all the suffixes of the input string. Our algorithm is based on a new approach called deep–shallow sorting: we use a “shallow” sorter for the suffixes with a short common prefix, and a “deep” sorter for the suffixes with a long common prefix.

All the known algorithms for building the suffix array either require a large amount of space or are inefficient when the input string contains many repeated substrings. Our algorithm has been designed to overcome this dichotomy. Our algorithm is “lightweight” in the sense that it uses very small space in addition to the space required by the suffix array itself. At the same time our algorithm is fast even when the input contains many repetitions: this has been shown by extensive experiments with inputs of size up to 110 Mb.

The source code of our algorithm, as well as a C library providing a simple API, is available under the GNU GPL [26].

Key Words. Suffix array, Algorithmic engineering, Space-economical algorithms, Full-text index, Suffix tree.

1. Introduction. In this paper we consider the problem of computing the suffix array of a text string T[1, n]. This problem consists in sorting the suffixes of T in lexicographic order. The suffix array [24] (or PAT array [10]) is a simple, easy to code, and elegant data structure used for several fundamental string matching problems involving both linguistic texts and biological data [5], [13]. Recently, interest in this data structure has been revitalized by its use as a building block for two novel applications: (1) the Burrows–Wheeler compression algorithm [4], which is a provably [25] and practically [29] effective compression tool; and (2) the construction of succinct [12], [28] or compressed [8], [9], [11] indexes. In these applications the construction of the suffix array is the computational bottleneck both in time and space. This motivated our interest in designing yet another suffix array construction algorithm which is fast and lightweight in the sense that it uses small working space.

The suffix array consists of n integers in the range [1, n]. This means that in principle it uses $\Theta(n \log n)$ bits of storage. However, in most applications the size of the text is smaller than $2^{32}$ and it is customary to store each integer in a 4 byte word; this yields a total space occupancy of 4n bytes. As for the cost of constructing the suffix array, the theoretically best known algorithms run in $\Theta(n)$ time [6]. These algorithms work by first building the suffix tree and then obtaining the sorted suffixes via an in-order traversal of the tree. However, suffix tree construction algorithms are both complex and space consuming since they occupy at least 15n bytes of working space (or even more, depending on the text structure [22]). This makes their use impractical even for moderately large texts. For this reason, suffix arrays are usually built directly using algorithms which run in O(n log n) time but have a smaller space occupancy. Among these algorithms the current “leader” is the qsufsort algorithm by Larsson and Sadakane [23]. qsufsort uses 8n bytes and it is much faster in practice than the algorithms based on suffix tree construction.

This research was partially supported by the Italian MIUR projects “Algorithmics for Internet and the Web (ALINWEB)” and “Technologies and Services for Enhanced Content Delivery (ECD)”. A preliminary version of this work has appeared in Proceedings of the 10th European Symposium on Algorithms (ESA ’02).

Dipartimento di Informatica, Università del Piemonte Orientale, Alessandria, Italy, and IIT-CNR, Pisa, Italy. manzini@mfn.unipmn.it.

Dipartimento di Informatica, Università di Pisa, Pisa, Italy. ferragina@di.unipi.it.

Received December 16, 2002; revised October 12, 2003. Communicated by R. Sedgewick. Online publication April 26, 2004.

Unfortunately, the size of our documents has grown much more quickly than the main memory of our computers. Thus, it is desirable to build a suffix array using as little space as possible. Recently, Itoh and Tanaka [15] and Seward [30] have proposed two new algorithms which only use 5n bytes. We call these algorithms lightweight algorithms to stress their (relatively) small space occupancy. From the theoretical point of view these algorithms have a $\Theta(n^2 \log n)$ worst-case time complexity. In practice they are faster than qsufsort when the average LCP (Longest Common Prefix) is small. However, for texts with a large average LCP these algorithms can be slower than qsufsort by a factor of 100 or more.

In this paper we describe and extensively test a new lightweight suffix sorting algorithm. Our main idea is to use a very small amount of extra memory, in addition to 5n bytes, to avoid any degradation in performance when the average LCP is large. To achieve this goal we make use of engineered algorithms and ad hoc data structures. Our algorithm uses 5n + cn bytes, where c can be chosen by the user at run time; in our tests c was at most 0.03. The theoretical worst-case time complexity of our algorithm is still $\Theta(n^2 \log n)$, but its behavior in practice is quite good. Extensive experiments, carried out on four different architectures, show that our algorithm is faster than any other tested algorithm. Only on a single instance (a single file on a single architecture) was our algorithm outperformed by qsufsort.

2. Definitions and Previous Results. Let T[1, n] denote a text over the alphabet $\Sigma$. The suffix array [24] (or PAT array [10]) for T is an array SA[1, n] such that T[SA[1], n], T[SA[2], n], etc. is the list of suffixes of T sorted in lexicographic order. For example, for T = babcc we have SA = [2, 1, 3, 5, 4], since T[2, 5] = abcc is the suffix with the lowest lexicographic rank, followed by T[1, 5] = babcc, followed by T[3, 5] = bcc, and so on.
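The definition can be checked with a few lines of Python. This is only a naive sketch for small inputs: the function name `naive_suffix_array` is an assumption made for illustration, and sorting whole suffixes by direct comparison is far slower than any algorithm discussed in the paper.

```python
def naive_suffix_array(text: str) -> list[int]:
    """Return the suffix array of `text` using 1-based positions, as in the paper."""
    n = len(text)
    # Sort the start positions by the suffix beginning there. Plain string
    # comparison ranks a suffix before any longer suffix it is a prefix of,
    # which matches appending an end-of-text symbol smaller than every letter.
    return sorted(range(1, n + 1), key=lambda i: text[i - 1:])


print(naive_suffix_array("babcc"))  # [2, 1, 3, 5, 4]
```

The output reproduces the SA given for T = babcc in the example.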

Here and in the following the space occupancy figures include the space for the input text, for the suffix array, and for any auxiliary data structure used by the algorithm.

Note that to define the lexicographic order of the suffixes it is customary to append at the end of T a special end-of-text symbol which is smaller than any symbol in $\Sigma$.

Given two strings v, w we write LCP(v, w) to denote the length of their longest common prefix. The average LCP of a text T is defined as the average length of the LCP between two consecutive suffixes, that is,

$$\text{average LCP} = \frac{1}{n-1} \sum_{i=1}^{n-1} \mathrm{LCP}\bigl(T[SA[i], n],\; T[SA[i+1], n]\bigr).$$

The average LCP is a rough measure of the difficulty of sorting the suffixes: if the average LCP is large we need, in principle, to examine “many” characters in order to establish the relative order of two suffixes. Note however that most suffix sorting algorithms do not compare suffixes with a simple character-by-character comparison, thus the average LCP is not the only parameter which plays a role in this problem.
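As a small illustration of the average LCP, the sketch below computes it directly from the definition. The helper names `lcp` and `average_lcp` are assumptions for this example, and the suffixes are sorted naively, so it is only suitable for short strings.

```python
def lcp(v: str, w: str) -> int:
    """Length of the longest common prefix of v and w."""
    k = 0
    while k < len(v) and k < len(w) and v[k] == w[k]:
        k += 1
    return k


def average_lcp(text: str) -> float:
    """Average LCP between lexicographically consecutive suffixes of `text`."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    if len(suffixes) < 2:
        return 0.0  # the average over zero pairs is taken to be 0 here
    total = sum(lcp(a, b) for a, b in zip(suffixes, suffixes[1:]))
    return total / (len(suffixes) - 1)


print(average_lcp("babcc"))  # 0.5
```

For T = babcc the sorted suffixes are abcc, babcc, bcc, c, cc, whose consecutive LCPs are 0, 1, 0, 1, giving 2 / 4 = 0.5.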

In the rest of the paper we make the following assumptions which correspond to the situation most often faced in practice. We assume $|\Sigma| \le 256$ and that each alphabet symbol is stored in 1 byte. Hence, the text T[1, n] occupies precisely n bytes. Furthermore, we assume that $n \le 2^{32}$ and that the starting position of each suffix is stored in a 4 byte word. Hence, the suffix array SA[1, n] occupies precisely 4n bytes. In the following we use the term “lightweight” to denote a suffix sorting algorithm which uses 5n bytes plus some small amount of extra memory (we are intentionally giving an informal definition). Note that 5n bytes are just enough to store the input text T and the suffix array SA. Although we do not claim that 5n bytes are indeed required, we do not know of any algorithm using less space.

To test the suffix array construction algorithms we use the collection of files shown in Table 1. These files contain different kinds of data in different formats; they also display a wide range of sizes and of average LCPs.

2.1. The Larsson–Sadakane qsufsort Algorithm. The qsufsort algorithm [23] is based on the doubling technique introduced in [18] and first used for the construction of the suffix array in [24]. Given two strings v, w and t > 0 we write $v <_t w$ if the length-t prefix of v is lexicographically smaller than the length-t prefix of w. Similarly we define the symbols $\le_t$ and $=_t$. Let $s_1$, $s_2$ denote two suffixes and assume $s_1 =_t s_2$ (that is, $T[s_1, n]$ and $T[s_2, n]$ have a length-t common prefix). Let $\hat{s}_1 = s_1 + t$ denote the suffix $T[s_1 + t, n]$ and similarly let $\hat{s}_2 = s_2 + t$. The fundamental observation of the doubling technique is that

$$s_1 \le_{2t} s_2 \iff \hat{s}_1 \le_t \hat{s}_2. \tag{1}$$

In other words, we can derive the $\le_{2t}$ order between $s_1$ and $s_2$ by looking at the rank of $\hat{s}_1$ and $\hat{s}_2$ in the $\le_t$ order.

Table 1. Files used in our experiments sorted in order of increasing average LCP.

| Name | Ave. LCP | Max. LCP | File size (bytes) | Description |
|---|---|---|---|---|
| sprot | 89.08 | 7,373 | 109,617,186 | Swiss prot database (original file name sprot34.dat) |
| rfc | 93.02 | 3,445 | 116,421,901 | Concatenation of RFC text files |
| howto | 267.56 | 70,720 | 39,422,105 | Concatenation of Linux Howto text files |
| reuters | 282.07 | 26,597 | 114,711,151 | Reuters news in XML format |
| linux | 479.00 | 136,035 | 116,254,720 | Tar archive containing the Linux kernel 2.4.5 source files |
| jdk13 | 678.94 | 37,334 | 69,728,899 | Concatenation of html and java files from the JDK 1.3 doc. |
| etext99 | 1,108.63 | 286,352 | 105,277,340 | Concatenation of Project Gutenberg etext99/*.txt files |
| chr22 | 1,979.25 | 199,999 | 34,553,758 | Genome assembly of human chromosome 22 |
| gcc | 8,603.21 | 856,970 | 86,630,400 | Tar archive containing the gcc 3.0 source files |
| w3c | 42,299.75 | 990,053 | 104,201,579 | Concatenation of html files from www.w3c.org |

The algorithm qsufsort works in rounds. At the beginning of the ith round the suffixes are already sorted according to the $\le_{2^i}$ ordering. In the ith round the algorithm looks at groups of suffixes sharing the first $2^i$ characters and sorts them according to the $\le_{2^{i+1}}$ ordering using the Bentley–McIlroy ternary quicksort [1]. Because of (1) each comparison in the quicksort algorithm takes O(1) time. After at most log n rounds all the suffixes are sorted. Thanks to a very clever data organization qsufsort only uses 8n bytes. Even more surprisingly, the whole algorithm fits in two pages of clean and elegant C code.
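The doubling idea behind observation (1) can be sketched in a few lines of Python. This is not qsufsort itself: it re-sorts the whole array in every round with Python's built-in sort rather than refining groups with a ternary quicksort, it makes no attempt at the 8n-byte data organization, and the name `suffix_array_doubling` is an assumption for illustration. Positions are 0-based.

```python
def suffix_array_doubling(text: str) -> list[int]:
    """Suffix array (0-based positions) by plain prefix doubling."""
    n = len(text)
    if n == 0:
        return []
    # rank[i] is the rank of suffix i under the <=_t ordering; initially t = 1.
    rank = [ord(c) for c in text]
    sa = list(range(n))
    t = 1
    while True:
        def key(i: int) -> tuple[int, int]:
            # By observation (1), the pair (rank of suffix i, rank of suffix i + t)
            # orders suffixes by their first 2t characters. -1 stands for
            # "past the end of the text" and is smaller than every rank.
            return (rank[i], rank[i + t] if i + t < n else -1)

        sa.sort(key=key)
        # Re-rank: suffixes with equal keys share their first 2t characters.
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: the order is final
            return sa
        t *= 2


print(suffix_array_doubling("babcc"))  # [1, 0, 2, 4, 3]
```

Each comparison looks only at two precomputed ranks, which is the O(1) comparison cost mentioned above; the number of rounds is at most about log n.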

The experiments reported in [23] show that qsufsort outperforms other suffix sorting algorithms based on either the doubling technique or the suffix tree construction. The only algorithm which runs faster than qsufsort, but only for files with average LCP less than 20, is the Bentley–Sedgewick multikey quicksort [2]. Multikey quicksort is a direct comparison algorithm since it considers the suffixes as ordinary strings and sorts them via a character-by-character comparison without taking advantage of their special structure. In this paper we did not consider multikey quicksort since it is well known that it is inefficient when the average LCP is large. However, for inputs with a small average LCP it is one of the fastest algorithms: see [21] for an efficient suffix sorting algorithm based on multikey quicksort.
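To make “direct comparison” concrete, here is a minimal sketch of multikey quicksort applied to suffixes. It is written from the general description of the technique (three-way partitioning on one character at a time), not from the code of [2] or [21]; the function name and the use of -1 as an end-of-text marker are assumptions. Positions are 0-based.

```python
def multikey_quicksort_suffixes(text: str) -> list[int]:
    """Sort suffix start positions (0-based) one character at a time."""
    n = len(text)
    sa = list(range(n))

    def char_at(i: int, depth: int) -> int:
        # -1 acts as an end-of-text symbol smaller than every character.
        return ord(text[i + depth]) if i + depth < n else -1

    def sort(lo: int, hi: int, depth: int) -> None:
        # Sort sa[lo:hi]; all these suffixes agree on their first `depth` characters.
        while hi - lo > 1:
            pivot = char_at(sa[(lo + hi) // 2], depth)
            lt, i, gt = lo, lo, hi
            while i < gt:  # three-way partition on the character at `depth`
                c = char_at(sa[i], depth)
                if c < pivot:
                    sa[lt], sa[i] = sa[i], sa[lt]
                    lt += 1
                    i += 1
                elif c > pivot:
                    gt -= 1
                    sa[gt], sa[i] = sa[i], sa[gt]
                else:
                    i += 1
            sort(lo, lt, depth)  # suffixes whose character is smaller than the pivot
            sort(gt, hi, depth)  # suffixes whose character is larger than the pivot
            # Suffixes equal to the pivot at `depth` are compared on the next character.
            lo, hi, depth = lt, gt, depth + 1

    sort(0, n, 0)
    return sa


print(multikey_quicksort_suffixes("banana"))  # [5, 3, 1, 0, 4, 2]
```

The loop over the “equal” part advances `depth` by one per iteration, so two suffixes with a long common prefix are separated only after that whole prefix has been scanned. This is why the method degrades when the average LCP is large.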

2.2. The Itoh–Tanaka two-stage Algorithm. In [15] Itoh and Tanaka describe a suffix sorting algorithm called two-stage suffix sort (two-stage from now on). two-stage only uses the text T and the suffix array SA for a total space occupancy of 5n bytes. To describe how it works, we assume $\Sigma = \{a, b, \ldots, z\}$ and let SA be initialized as SA[i] = i. Using counting sort, two-stage initially sorts the array SA according to the $\le_1$ ordering. Then it logically partitions SA into $|\Sigma|$ buckets $B_a, \ldots, B_z$. A bucket is a set of consecutive entries of SA containing the suffixes which start with the same character, from a to z in our illustrative example. Within each bucket two-stage distinguishes between two types of suffixes: Type A suffixes, in which the second character of the suffix is smaller than the first, and Type B suffixes, in which the second character is larger than or equal to the first suffix character. Within each bucket two-stage stores Type A suffixes first, followed by Type B suffixes. This is correct since Type A suffixes lexicographically precede Type B suffixes.

The crucial observation of algorithm two-stage is that when all Type B suffixes are sorted, we can easily derive the ordering of the Type A suffixes. This can be done with a single pass over the array SA: when we meet suffix $s_i = T[i, n]$ we look at suffix $s_{i-1} = T[i-1, n]$; if $s_{i-1}$ is a Type A suffix we move it to the first empty position of bucket $B_{T[i-1]}$.
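The single pass that derives Type A suffixes from sorted Type B suffixes can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: Type B suffixes are sorted with Python's built-in sort as a stand-in for the string sorters, the last suffix is treated as Type A because its second character is the end-of-text symbol, positions are 0-based, and the name `two_stage_sketch` is hypothetical.

```python
def two_stage_sketch(text: str) -> list[int]:
    """Suffix array (0-based): sort Type B suffixes, then derive Type A ones."""
    n = len(text)
    if n == 0:
        return []

    def is_type_a(i: int) -> bool:
        # Type A: second character smaller than the first. The last suffix has
        # the (virtual) end-of-text symbol as second character, so it is Type A.
        return i == n - 1 or text[i + 1] < text[i]

    # Bucket layout: for each character, Type A slots first, then Type B slots.
    size = {c: 0 for c in text}
    size_a = {c: 0 for c in text}
    for i, c in enumerate(text):
        size[c] += 1
        size_a[c] += is_type_a(i)
    next_a, next_b, start = {}, {}, 0
    for c in sorted(size):
        next_a[c] = start              # first empty Type A position of bucket c
        next_b[c] = start + size_a[c]  # first Type B position of bucket c
        start += size[c]

    sa = [-1] * n
    # Stage 1: sort the Type B suffixes by direct comparison (stand-in sorter).
    type_b = sorted((i for i in range(n) if not is_type_a(i)), key=lambda i: text[i:])
    for i in type_b:
        sa[next_b[text[i]]] = i
        next_b[text[i]] += 1

    def place_type_a(i: int) -> None:
        sa[next_a[text[i]]] = i
        next_a[text[i]] += 1

    # Stage 2: the last suffix is the smallest one in its bucket.
    place_type_a(n - 1)
    # Single left-to-right pass: each suffix we meet places its Type A predecessor.
    for j in range(n):
        i = sa[j]
        if i > 0 and is_type_a(i - 1):
            place_type_a(i - 1)
    return sa


print(two_stage_sketch("babcc"))  # [1, 0, 2, 4, 3]
```

The pass works because a Type A suffix starting at i − 1 has a successor suffix i in a strictly earlier bucket, so that successor is always met before the scan reaches the slot being filled.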

Type B suffixes are sorted using textbook string sorting algorithms: in their implementation the authors use MSD radix sort [27] for sorting large groups of suffixes, Bentley–Sedgewick multikey quicksort for medium size groups, and insertion sort for small groups. Summing up, two-stage can be considered an “advanced” direct comparison algorithm since Type B suffixes are sorted by direct comparison whereas Type A suffixes are sorted by a much faster procedure which takes advantage of the special structure of the suffixes.

In [15] the authors compare two-stage with three direct-comparison algorithms (quicksort, multikey quicksort, and MSD radix sort) and with an earlier version of qsufsort. two-stage turns out to be roughly four times faster than quicksort and MSD radix sort, and from two to three times faster than multikey quicksort and qsufsort. However, the files used for the experiments have an average LCP of at most 31, and we know that the advantage of doubling algorithms (like qsufsort) with respect to direct comparison algorithms becomes apparent for much larger average LCPs.

Some improvements to algorithm two-stage have been recently described in [16]. Although these improvements are based on some interesting algorithmic ideas, we do not describe them here since they lead to an algorithm which is not lightweight, its space requirement being 9n bytes.

2.3. Seward copy Algorithm. Independently of Itoh and Tanaka, Seward describes in [30] a lightweight algorithm, called copy, which is based on a concept similar to the Type A/Type B suffixes used by algorithm two-stage.

Using counting sort, copy initially sorts the array SA according to the $\le_2$ ordering. As before we use the term bucket to denote the contiguous portion of SA containing a set of suffixes sharing the same first character. We use the term sub-bucket to denote the contiguous portion of SA containing suffixes sharing the first two characters. There are $|\Sigma|$ buckets, each one consisting of $|\Sigma|$ sub-buckets. One or more (sub-)buckets can be empty. In the following we use the symbol $B_\alpha$ to denote the bucket containing the suffixes starting with character α, and we use the symbol $b_{\alpha\beta}$ to denote the sub-bucket containing the suffixes starting with the character-pair αβ.

copy sorts the buckets one at a time starting with the one containing the fewest suffixes, and proceeding up to the largest one. Assume for simplicity that $\Sigma = \{a, b, \ldots, z\}$. To sort a bucket, say $B_p$, copy sorts the sub-buckets $b_{pa}, b_{pb}, \ldots, b_{pz}$ individually. The crucial point of algorithm copy is that when bucket $B_p$ is completely sorted, with a simple pass over it copy sorts all the sub-buckets $b_{ap}, b_{bp}, \ldots, b_{zp}$. These sub-buckets are marked as sorted and copy skips them when their “parent” bucket is sorted. In other words, assuming $B_q$ is sorted after $B_p$, when we sort $B_q$ we skip $b_{qp}$ and any other already sorted sub-bucket within $B_q$.

As a further improvement, Seward shows that even the sorting of the sub-bucket $b_{pp}$ can be avoided since its ordering can be derived from the ordering of the sub-buckets $b_{pa}, \ldots, b_{po}$ and $b_{pq}, \ldots, b_{pz}$. This trick, first suggested in [4], is extremely effective when working on files containing long runs of identical characters.
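The “simple pass” over a sorted bucket can be illustrated with a short sketch. It assumes bucket $B_p$ has already been sorted (here with Python's built-in sort as a stand-in) and derives the order of every sub-bucket $b_{\alpha p}$ with α ≠ p; the special handling of $b_{pp}$ described above is deliberately left out. The function name `derive_sub_buckets` is hypothetical and positions are 0-based.

```python
def derive_sub_buckets(text: str, p: str) -> dict[str, list[int]]:
    """For each character alpha != p, return sub-bucket b_{alpha p} in sorted order."""
    n = len(text)
    # Bucket B_p: all suffixes starting with p, sorted by direct comparison.
    bucket_p = sorted((i for i in range(n) if text[i] == p), key=lambda i: text[i:])
    derived: dict[str, list[int]] = {}
    for i in bucket_p:
        if i > 0 and text[i - 1] != p:
            # Suffix i - 1 starts with the pair (text[i - 1], p). Two suffixes in
            # the same sub-bucket are ordered exactly like their successors in
            # B_p, so appending in scan order yields the sub-bucket already sorted.
            derived.setdefault(text[i - 1], []).append(i - 1)
    return derived


print(derive_sub_buckets("mississippi", "s"))  # {'i': [4, 1]}
```

In the example, the suffixes starting with "is" are issippi (position 4) and ississippi (position 1), and the pass returns them in that order without comparing them directly.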

Algorithm copy sorts the sub-buckets using the Bentley–McIlroy ternary quicksort. During this sorting the suffixes are considered atomic, that is, each comparison consists of the scanning of two entire suffixes. The standard trick of sorting the largest side of the partition last and eliminating tail recursion ensures that the amount of space required by the recursion stack grows, in the worst case, logarithmically with the size of the input text.

In [30] Seward compares a tuned implementation of copy with the qsufsort algorithm on a set of files with average LCP up to 400. In these tests copy outperforms qsufsort for all files but one. However, Seward reports that copy is much slower than qsufsort when


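zhitenglin

2017-10-15

Understanding this algorithm is really painful.

yexin16

2013-05-02

A suffix array built using BWT; very fast.

JersonMa

2013-11-07

It is quite fast, not bad.

The_Loser

2015-07-28

I can't understand it.

mengxuming1

2012-05-09

It is a fairly fast suffix array construction algorithm, but it still feels hard to understand.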

tiandyoin
