On the Construction and Application of Compressed Text Indexes (2004)


Title On the construction and application of compressed text indexes

Author(s) Hon, Wing-kai.; 韓永楷.

Citation

Issued Date 2004

URL http://hdl.handle.net/10722/31962

Rights

The author retains all proprietary rights (such as patent rights) and the right to use in future works.

CHAPTER 1. INTRODUCTION 2

hashing to allow efficient access to a particular list. Subsequent searching for a word can be done efficiently by locating the corresponding list and enumerating all the positions of occurrence. A signature file is built by first partitioning the text into pages; each word is then hashed into a fixed-length s-bit value, and a page is represented by a signature formed by OR-ing the hash values of all the words appearing in that page. Subsequent searching for a word can often be confined to the small number of pages whose signature 'agrees' with the hash value of the word, i.e., has a 1 in every bit position where the hash value does. In general, word-based indexes support fast word queries; in addition, they are small, occupying about 20-30% of the original text size [82].
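The signature-file scheme described above can be sketched in a few lines of Python. This is a minimal illustration only: the signature width S_BITS, the number of bits set per word (BITS_PER_WORD), the use of SHA-256 as the hash, and all function names are assumptions made for the example and are not taken from the text.

```python
import hashlib

S_BITS = 64        # signature width s (assumed value)
BITS_PER_WORD = 3  # bits set in each word's hash value (assumed value)


def word_hash(word: str) -> int:
    """Hash a word into an s-bit value with a few bits set."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    value = 0
    for i in range(BITS_PER_WORD):
        value |= 1 << (digest[i] % S_BITS)
    return value


def build_signatures(pages: list[str]) -> list[int]:
    """One signature per page: the OR of the hash values of its words."""
    signatures = []
    for page in pages:
        signature = 0
        for word in page.split():
            signature |= word_hash(word)
        signatures.append(signature)
    return signatures


def candidate_pages(signatures: list[int], word: str) -> list[int]:
    """Pages whose signature 'agrees' with the word's hash value."""
    h = word_hash(word)
    return [i for i, sig in enumerate(signatures) if (sig & h) == h]


def search(pages: list[str], signatures: list[int], word: str) -> list[int]:
    """Filter by signature, then verify to discard false positives."""
    return [i for i in candidate_pages(signatures, word)
            if word in pages[i].split()]
```

A signature can agree with a word that the page does not contain (a false positive), which is why the sketch verifies each candidate page; a page that does contain the word is never filtered out.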

Unfortunately, word-based indexes are not suitable for texts without clear-cut word boundaries, such as DNA sequences, Chinese texts and Japanese texts. In these circumstances, full-text indexes, which make no assumption about word boundaries, are relied upon, at the cost of increased space occupancy. Basically, these indexes are constructed by storing information on all the substrings occurring in the text. Suffix trees [57, 79] and suffix arrays [56] are two fundamental full-text indexes in the literature, which find numerous applications in areas including biological research (e.g., gene hunting, promoter consensus identification, and motif finding) [38], data mining [41] and text compression [28]. A suffix tree is a compact version of the trie that stores all suffixes of the given text. Each suffix is represented by a unique leaf storing its starting position. Based on the suffix tree, any substring of the text can be found by following some path descending from the root. The importance of the suffix tree is underlined by the fact that it has been rediscovered many times in the scientific literature, disguised under different names [34]. Some examples include the compact bi-tree [79], the prefix tree [15], the PAT tree [30], the position tree [3, 47, 53], the repetition finder [67], and the subword tree [8, 15]. On the other hand, a suffix array is a reduced form of a suffix tree, obtained by visiting the leaves of the corresponding suffix tree from left to right. More precisely, it is an array of positions, sorted in the lexicographic order of the corresponding suffixes.
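The relationship between the two structures can be illustrated with a small Python sketch. It builds an uncompacted suffix trie (not a true suffix tree, which would merge unary paths and needs a more careful construction) and a suffix array by direct sorting, then checks that reading the trie leaves from left to right yields the suffix array. The function names, the 0-based positions and the terminator '$' are assumptions of the example; both constructions are quadratic and intended only for tiny inputs.

```python
LEAF = None  # special key under which a leaf stores its suffix start position


def suffix_array(text: str) -> list[int]:
    """Positions sorted by the lexicographic order of their suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def suffix_trie(text: str) -> dict:
    """Uncompacted trie of all suffixes; text must end with a unique terminator."""
    root: dict = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[LEAF] = start
    return root


def leaves_left_to_right(node: dict) -> list[int]:
    """Visit children in character order and collect the leaf positions."""
    positions = []
    if LEAF in node:
        positions.append(node[LEAF])
    for ch in sorted(key for key in node if key is not LEAF):
        positions.extend(leaves_left_to_right(node[ch]))
    return positions


def has_substring(root: dict, pattern: str) -> bool:
    """A substring corresponds to a path descending from the root."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True
```

For the text "banana$" the suffix array is [6, 5, 3, 1, 0, 4, 2], and the left-to-right leaf order of the trie is the same list.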

Both the suffix tree and the suffix array exhibit superb searching performance. Given the suffix tree of a text T whose characters are from an alphabet Σ, we can search for a pattern P within T using O(p log |Σ| + occ) time,* where occ denotes the number of occurrences of P in T. Note that the time is independent of the text size. For the suffix array, the searching time is O(p + log n + occ), which is only a bit slower. In terms of space, both require O(n log n) bits, though the suffix array is associated with a smaller constant.

* We use the notation log_b^c n to denote (log n / log b)^c, which is the c-th power of the base-b logarithm of n. Unless specified, we use b = 2.
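As an illustration of searching with a suffix array, the following Python sketch locates all occurrences of a pattern by two binary searches over the sorted suffixes. It is the plain variant, which compares up to p characters per step and therefore runs in O(p log n + occ) time; the O(p + log n + occ) bound quoted above requires additional precomputed information that is not shown here. Reading p and n as the lengths of P and T, as well as the function names, are assumptions of the example.

```python
def build_suffix_array(text: str) -> list[int]:
    """Naive construction by sorting suffixes (for small inputs only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return the starting positions of pattern in text, in increasing order."""
    p = len(pattern)

    # Lower bound: first suffix whose p-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + p] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose p-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + p] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All matching suffixes are contiguous in the array: sa[start:lo].
    return sorted(sa[start:lo])
```

The matching suffixes form one contiguous range of the array, so once the range is found the occ positions are simply read off; for "banana", the pattern "ana" occupies the range holding positions 3 and 1.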

In the literature, there is another full-text index called the directed acyclic word graph (DAWG) [10], which is the smallest finite state automaton that recognizes all the substrings appearing in the given text [17]. Thus, based on the DAWG, the existence of a pattern P in T can be determined in O(p) time. By compacting the edges of the DAWG and storing additional information in the nodes, we obtain the labeled compact DAWG (CDAWG) of [11], which can equivalently be obtained from the suffix tree of the text by merging its edge-isomorphic subtrees and deleting part of the resulting structure [34]. The CDAWG significantly reduces the memory space required by suffix trees and DAWGs [18]; nevertheless, in order to support locating all the occurrences of a pattern in the text, the space requirement is still O(n log n) bits.
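The O(p) existence test can be sketched with the standard online suffix-automaton construction, in which every state is treated as accepting so that the automaton recognizes exactly the substrings of the text. This is offered as an illustration of the idea rather than as the construction used in [10] or [17]; the class and attribute names are assumptions of the example, and the compaction step that yields the CDAWG is not shown.

```python
class SuffixAutomaton:
    """Automaton whose paths from state 0 spell exactly the substrings of text."""

    def __init__(self, text: str) -> None:
        self.next = [{}]    # next[s][ch] -> target state
        self.link = [-1]    # suffix links
        self.length = [0]   # length of the longest string reaching each state
        last = 0
        for ch in text:
            cur = self._new_state(self.length[last] + 1)
            p = last
            # Add the new transition along the suffix-link chain.
            while p != -1 and ch not in self.next[p]:
                self.next[p][ch] = cur
                p = self.link[p]
            if p == -1:
                self.link[cur] = 0
            else:
                q = self.next[p][ch]
                if self.length[p] + 1 == self.length[q]:
                    self.link[cur] = q
                else:
                    # Split q: the clone takes over the shorter strings.
                    clone = self._new_state(self.length[p] + 1)
                    self.next[clone] = dict(self.next[q])
                    self.link[clone] = self.link[q]
                    while p != -1 and self.next[p].get(ch) == q:
                        self.next[p][ch] = clone
                        p = self.link[p]
                    self.link[q] = clone
                    self.link[cur] = clone
            last = cur

    def _new_state(self, length: int) -> int:
        self.next.append({})
        self.link.append(-1)
        self.length.append(length)
        return len(self.next) - 1

    def contains(self, pattern: str) -> bool:
        """Follow one transition per pattern character: O(p) dictionary lookups."""
        state = 0
        for ch in pattern:
            state = self.next[state].get(ch)
            if state is None:
                return False
        return True
```

The test walks one transition per character of the pattern, independently of the text length. It answers only whether the pattern occurs; reporting where it occurs needs extra per-state information.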

With the advance of biotechnology, the complete DNA sequences of a number of living organisms are now known. Some examples of these DNA sequences are listed in Table 1.1. DNA sequences can be much longer than traditional texts. For instance, the human DNA comprises about 3.3 billion characters. For this DNA sequence, the best known implementations of the suffix tree and the suffix array require 40 Gigabytes [29, 50] and 14 Gigabytes, respectively. Such memory requirements far exceed the capacity of ordinary computers. Existing approaches for indexing human DNA include (1) using supercomputers with large main memory [75] and (2) storing the indexing data structure in secondary storage [16, 41]. The first approach is expensive and inflexible, while the second one is slow. As more and more DNA is decoded, it is vital that individual biologists can eventually analyze different DNA sequences efficiently on their ordinary PCs. In the next section, we show the recent trend in tackling the problem: compressing the index.
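To see why an O(n log n)-bit index is so much larger than the text itself, the following back-of-the-envelope calculation may help. It assumes n = 3.3 billion, one suffix-array entry per text position stored in the minimum whole number of bits, a 2-bit code for the four DNA letters, and decimal gigabytes; these are assumptions of the example, and the text does not state how its 14 Gigabyte figure was obtained.

```python
N = 3_300_000_000  # approximate length of the human DNA sequence


def bits_per_position(n: int) -> int:
    """Bits needed to address any of n text positions (ceil(log2 n) for n > 1)."""
    return (n - 1).bit_length()


def plain_suffix_array_bytes(n: int) -> int:
    """n entries of bits_per_position(n) bits each, rounded up to whole bytes."""
    return (n * bits_per_position(n) + 7) // 8


def packed_dna_bytes(n: int) -> int:
    """The text itself at 2 bits per character (alphabet A, C, G, T)."""
    return (2 * n + 7) // 8


if __name__ == "__main__":
    print(bits_per_position(N), "bits per entry")
    print(plain_suffix_array_bytes(N) / 1e9, "GB for the suffix array")
    print(packed_dna_bytes(N) / 1e9, "GB for the 2-bit text")
```

Under these assumptions each entry takes 32 bits, so the array alone occupies about 13.2 GB, which is in the same range as the 14 Gigabyte figure quoted above, while the text packed at 2 bits per character needs under 1 GB.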

1.2 Survey on Compressed Text Indexes

To overcome the memory requirement, we need to construct indexes that require considerably less space. Perhaps the most effective way to start is to reduce the space of the existing indexes while maintaining their searching power. This idea has stimulated much work in the last decade. Firstly, Kärkkäinen (1995) [43]
