没有合适的资源?快使用搜索试试~ 我知道了~
温馨提示
Title On the construction and application of compressed text indexesAuthor(s) Hon, Wing-kai.; 韓永楷.CitationIssued Date 2004URL http://hdl.handle.net/10722/31962Rights The author retains all proprietary rights, (such as patent rights)and the right to use in future works.iBibliographic NotesChapter 3 to Chapter 8 consist of the main technical part of this thesis, whosematerials are respectively extracted and revised from the following papers:• Chapter 3 – “Constructing Compressed Suffix Arrays with
资源推荐
资源详情
资源评论
Title On the construction and application of compressed text indexes
Author(s) Hon, Wing-kai.; 韓永楷.
Citation
Issued Date 2004
URL http://hdl.handle.net/10722/31962
Rights
The author retains all proprietary rights, (such as patent rights)
and the right to use in future works.
i
Bibliographic Notes
Chapter 3 to Chapter 8 consist of the main technical part of this thesis, whose
materials are respectively extracted and revised from the following papers:
• Chapter 3 – “Constructing Compressed Suffix Arrays with Large Alpha-
bets”, in Proceedings of the 14th International Conference on Algorithms
and Computation (ISAAC’03), pp. 240–249, 2003. (Co-authored with T. W.
Lam, K. Sadakane and W. K. Sung.)
• Chapter 4 – “Breaking a Time-and-Space Barrier in Constructing Full-Text
Indices”, in Proceedings of the 44th Annual IEEE Symposium on Founda-
tions of Computer Science (FOCS’03), pp. 251–260, 2003. (Co-authored
with K. Sadakane and W. K. Sung.)
• Chapter 5 – “Space-Economical Algorithms for Finding Maximal Unique
Matches”, in Proceedings of the 13th Annual Symposium on Combinato-
rial Pattern Matching (CPM’02), pp. 144–152, 2002. (Co-authored with
K. Sadakane.)
• Chapter 6 – “Compressed Index for a Dynamic Collection of Texts”, in Pro-
ceedings of the 15th Annual Symposium on Combinatorial Pattern Match-
ing (CPM’04), pp. 445–456, 2004. (Co-authored with H. L. Chan and
T. W. Lam.)
• Chapter 7 – “Compressed Index for Dynamic Text”, in Proceedings of the
14th IEEE Data Compression Conference (DCC’04), pp. 102–111, 2004.
(Co-authored with T. W. Lam, K. Sadakane, W. K. Sung and S. M. Yiu.)
• Chapter 8 – “Practical Aspects of Compressed Suffix Arrays and FM-index
in Searching DNA Sequences”, in Proceedings of the 5th Workshop on Algo-
rithm Engineering and Experimentations (ALENEX’04), pp. 31–38, 2004.
(Co-authored with T. W. Lam, W. K. Sung, W. L. Tse, C. K. Wong and
S. M. Yiu.)
Chapter 1
Introduction
This thesis studies the theoretical and practical aspects of maintaining indexes
for long texts, especially the DNA sequences which find most applications nowa-
days, that are minute in space and exhibit competent text searching ability when
compared to the uncompressed counterparts. Particularly, two recent indexes,
namely the CSA and the FM-index, are studied. Our results include time-and-
space efficient algorithms that construct these indexes, experimentation for the
practical performance of these indexes, and extension of these indexes in solving
more complicated text queries. A brief introduction is shown as follows.
1.1 Survey on Traditional Text Indexes
Text searching is a classical problem in computer science. Let T be a text of
length n, and P be a pattern of length p. The basic problem is to locate all
occurrences of the pattern P in the text T efficiently. Using Knuth-Morris-Pratt
[49] or Boyer-Moore [12] algorithms, this can be solved in the optimal O(n + p)
time. In most applications, the text is often given in advance, and will be searched
against different patterns. It thus pays some extra memory to pre-process the
text and create an index on it, so as to facilitate the subsequent searching process.
Text indexes can be classified into two categories, namely, the word-based
indexes and the full-text indexes. Word-based indexes, such as inverted lists
[40] and signature files [21], are used for texts with word boundaries. They are
constructed based on the distinct words that appear in the texts. For instance,
the inverted lists of a text store, for each distinct word, the list of positions that
the word occurs in the text; and, the set of distinct words are stored via a trie or
1
CHAPTER 1. INTRODUCTION 2
hashing to allow efficient access to a particular list. Subsequent searching for a
word can be done efficiently by locating the corresponding list and enumerating
all the positions of occurrence. For signature files, it is built by first partitioning
the text into pages; then, each word is hashed into a fixed length s-bit value, and
a page is represented by a signature that forms by ‘OR’ing all the hash values of
the words appearing in that page. Subsequent searching for a word can often be
confined to a small number of pages whose signature ‘agrees’ with the hash value
of the word. In general, word-based indexes are able to support fast word queries;
in addition, they are small in size, occupying about 20-30% of the original text
size [82].
Unfortunately, word-based indexes are not suitable to handle texts without
clear-cut word boundaries like DNA sequences, Chinese texts and Japanese texts.
In these circumstances, full-text indexes, which are indexes that make no assump-
tion on the word boundary, are relied upon, at the cost of increasing the space
occupancy. Basically, these indexes are constructed by storing information on
all the substrings occurring in the texts. Suffix trees [57, 79] and suffix arrays
[56] are two fundamental full-text indexes in the literature, which find numerous
applications in areas including biological research (e.g., gene hunting, promoter
consensus identification, and motif finding) [38], data mining [41] and text com-
pression [28]. Suffix tree is a compact version of the trie that stores all suffixes
of the given text. Each suffix is represented by a unique leaf storing its start-
ing position. Based on the suffix tree, any substring of the text can be found
by following some path descending from the root. The importance of the suffix
tree is underlined by the fact that it has been rediscovered many times in the
scientific literature, disguised under different names [34]. Some examples include
the compact bi-tree [79], the prefix tree [15], the PAT tree [30], the position tree
[3, 47, 53], the repetition finder [67], and the subword tree [8, 15]. On the other
hand, suffix array is a reduced form of a suffix tree, which is obtained by visiting
the leaves of the corresponding suffix tree from left to right. More precisely, it
is an array of positions, sorted in the lexicographic order of the corresponding
suffixes.
Both suffix tree and suffix array exhibit superb searching performance. Given
the suffix tree of a text T whose characters are from an alphabet Σ, we can search
for a pattern P within T using O(p log |Σ| + occ) time,
∗
where occ denotes the
∗
We use the notation log
c
b
n to denote (log n/ log b)
c
, which is the c-th power of the base-b
logarithm of n. Unless specified, we use b = 2.
CHAPTER 1. INTRODUCTION 3
number of occurrences of P in T . Note that the time is independent of the text
size. For suffix array, its searching time is O(p + log n + occ), which is only a bit
slower. For the space concern, both of them require O(n log n) bits; suffix array
is associated with a smaller constant, though.
In the literature, there is another full-text index called directed acyclic word
graph (DAWG) [10], which is the smallest finite state automaton that recognizes
all the substrings appearing in the given text [17]. Thus, based on DAWG, the
existence of a pattern P in T can be determined in O(p) time. By compacting
the edges of the DAWG and augmenting additional information in the nodes,
we obtain the labeled compact DAWG (CDAWG) of [11], which is equivalently
obtained from the suffix tree of the text by merging its edge-isomorphic subtrees
and deleting part of the resulting structure [34]. CDAWG provides significant
reductions of the memory space required by suffix trees and DAWGs [18]; never-
theless, in order to support locating all the occurrences of a pattern in the text,
the space requirement is still O(n log n) bits.
With the advance in bio-technology, the complete DNA sequences for a num-
ber of living organisms have been known. Some examples of these DNA sequences
are depicted in Table 1.1. The size of the DNA sequences can be much longer
than the traditional texts. For instance, the human DNA comprises about 3.3 bil-
lion characters. For this DNA sequence, the best known implementation of suffix
tree and suffix array require 40 Gigabytes [29, 50] and 14 Gigabytes, respectively.
Such memory requirement far exceeds the capacity of ordinary computers. Ex-
isting approaches for indexing human DNA include (1) using supercomputers
with large main memory [75] and (2) storing the indexing data structure in the
secondary storage [16, 41]. The first approach is expensive and inflexible, while
the second one is slow. As more and more DNA are decoded, it is vital that
individual biologists can eventually analyze different DNA sequences efficiently
with their ordinary PCs. In the next section, we show the recent trend in tackling
the problem—by compressing the index.
1.2 Survey on Compressed Text Indexes
To overcome the memory requirement, we need to construct indexes that require
considerably less space. Perhaps the most effective way to start with is to reduce
the space of the existing indexes, while maintaining their searching powers. This
idea has stimulated many work in the last decade. Firstly, K¨ark¨ainen (1995) [43]
剩余97页未读,继续阅读
资源评论
weixin_38747025
- 粉丝: 129
- 资源: 1108
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
最新资源
资源上传下载、课程学习等过程中有任何疑问或建议,欢迎提出宝贵意见哦~我们会及时处理!
点击此处反馈
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功