At one end of this spectrum we have constant-time queries (for prefixes that fit in O(1) words), and
still asymptotically vanishing space usage for the index. At the other end, space is optimal and the
query time grows logarithmically with the length of the prefix. Precise statements can be found in the
technical overview below.
Technical overview. For simplicity we consider strings over a binary alphabet, but our methods
generalise to larger alphabets. Our main result is that weak prefix search needs just O(|p|/w + log |p|)
time and O(n log ℓ) space, where ℓ is the average length of the strings, p is the query string, and w is
the machine word size. In the cache-oblivious model [12], we use O(|p|/B + log |p|) I/Os. For strings
of fixed length w, this reduces to query time O(log w) and space O(n log w), and we show that the
latter is optimal regardless of query time. Throughout the paper we strive to state all space results in
terms of `, and time results in terms of the length of the actual query string p, because in a realistic
setting (e.g., term dictionaries of a search engine) string lengths might vary wildly, and queries might
be issued that are significantly shorter than the average (let alone maximum) string length. In fact,
the size of the data structure depends on the hollow trie size of the set S, a data-aware measure
related to the trie size [13] that is much more precise than the bound O(n log ℓ).
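To make the query semantics concrete, the following sketch answers a weak prefix search on a plain sorted array. This is a naive baseline for illustration only, not the paper's data structure: it stores the whole set S and takes O(|p| log n) time, whereas the structure above answers in O(|p|/w + log |p|) time without storing S. The function name and the empty-range convention for absent prefixes are our own choices; a weak prefix search is free to return an arbitrary answer when no string has p as a prefix.

```python
import bisect

def weak_prefix_search(sorted_strings, p):
    """Return the half-open rank range (lo, hi) of strings with prefix p.

    Naive baseline: binary search over the sorted set itself.
    When no string has prefix p, the range is simply empty here,
    although a weak prefix search is allowed any answer in that case.
    """
    lo = bisect.bisect_left(sorted_strings, p)
    if not p:
        return 0, len(sorted_strings)
    # Successor of every string with prefix p: p with its last
    # character "incremented" (fine for short ASCII demos).
    succ = p[:-1] + chr(ord(p[-1]) + 1)
    hi = bisect.bisect_left(sorted_strings, succ)
    return lo, hi

S = sorted(["aardvark", "abacus", "abbey", "abbot", "badge"])
print(weak_prefix_search(S, "ab"))  # (1, 4): abacus, abbey, abbot
```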
Building on ideas from [1], we then give an O(|p|/w + 1) solution (i.e., constant time for prefixes
of length O(w)) that uses space O(nℓ^{1/c} log ℓ) (for any c > 0). This structure shows that weak
prefix search is possible in constant time using sublinear space; queries require O(|p|/B + 1) I/Os
in the cache-oblivious model.
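The trade-off behind a constant-time solution can be illustrated with a toy construction that is emphatically not the paper's structure: hash every prefix occurring in S to its rank range. A single hash probe then answers any query, but the table holds Θ(total length of S) entries, far above the O(nℓ^{1/c} log ℓ) bits achieved above. The helper name below is ours.

```python
def build_prefix_index(sorted_strings):
    """Map every prefix occurring in S to its half-open rank range.

    Toy illustration: one hash lookup answers a weak prefix search,
    but space is proportional to the total length of all strings,
    which is what the paper's construction avoids.
    """
    index = {}
    for rank, s in enumerate(sorted_strings):
        for i in range(1, len(s) + 1):
            pre = s[:i]
            if pre in index:
                lo, _ = index[pre]
                index[pre] = (lo, rank + 1)  # extend range to this rank
            else:
                index[pre] = (rank, rank + 1)  # first string with this prefix
    return index

S = sorted(["aardvark", "abacus", "abbey", "abbot", "badge"])
idx = build_prefix_index(S)
print(idx["ab"])  # (1, 4): ranks of abacus, abbey, abbot
```

Note that a query for a prefix not occurring in S simply misses the table; as with any weak prefix search, no guarantee is required in that case.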
Comparison to related results. If we study the same problem in the I/O model or in the cache-oblivious
model, the nearest competitors are the String B-tree [10] and its cache-oblivious version [4],
although both require access to the set S. The static String B-tree can be modified to use
space O(n log n + n log ℓ); it has very good search performance with O(|p|/B + log_B n) I/Os per
query (supporting all query types discussed in this paper), and its cache-oblivious version guarantees
the same bounds with high probability. However, a search for p inside the String B-tree may involve
Ω(|p| + log n) RAM operations (Ω(|p|/w + log n) for the cache-oblivious version), so it may be
too expensive for intensive computations. Our first method, which achieves the optimal space usage
of O(n log ℓ) bits, uses O(|p|/w + log |p|) RAM operations and O(|p|/B + log |p|) I/Os instead.
The number of RAM operations is a strict improvement over String B-trees, while the I/O bound is
better for large enough sets. Our second method uses slightly more space (O(nℓ^{1/c} log ℓ) bits
for any c > 0) but features O(|p|/w + 1) RAM operations and O(|p|/B + 1) I/Os.
In [11], the authors discuss very succinct static data structures for the same purposes (on a generic
alphabet), decreasing the space to a lower bound that is, in the binary case, the trie size. The search
time is logarithmic in the number of strings. As in the previous case, we improve on RAM operations
and on I/Os for large enough sets.
The first cache-oblivious dictionary supporting prefix search was devised by Brodal et al. [5]
achieving O(|p|) RAM operations and O(|p|/B + log_B n) I/Os. We note that the result in [5] is
optimal in a comparison-based model, where we have a lower bound of Ω(log_B n) I/Os per query.
By contrast, our result, like those in [4, 11], assumes an integer alphabet where there is no such lower
bound.
Implicit in the paper of Alstrup et al. [1] on range queries is a linear-space structure for constant-
time weak prefix search on fixed-length bit strings. Our constant-time data structure, instead, uses
sublinear space and allows for variable-length strings.