Monotone Minimal Perfect Hashing:
Searching a Sorted Table with O(1) Accesses
Djamal Belazzougui∗  Paolo Boldi†  Rasmus Pagh‡  Sebastiano Vigna†
Abstract
A minimal perfect hash function maps a set S of n keys into
the set { 0, 1, . . . , n − 1 } bijectively. Classical results state
that minimal perfect hashing is possible in constant time us-
ing a structure occupying space close to the lower bound
of log e bits per element. Here we consider the problem of
monotone minimal perfect hashing, in which the bijection is
required to preserve the lexicographical ordering of the keys.
A monotone minimal perfect hash function can be seen as a
very weak form of index that provides ranking just on the
set S (and answers randomly outside of S). Our goal is to
minimise the description size of the hash function: we show
that, for a set S of n elements out of a universe of 2^w
elements, O(n log log w) bits are sufficient to hash
monotonically with evaluation time O(log w). Alternatively, we can
get space O(n log w) bits with O(1) query time. Both of
these data structures improve a straightforward construction
with O(n log w) space and O(log w) query time. As a con-
sequence, it is possible to search a sorted table with O(1)
accesses to the table (using additional O(n log log w) bits).
Our results are based on a structure (of independent interest)
that represents a trie in a very compact way, but admits er-
rors. As a further application of the same structure, we show
how to compute the predecessor (in the sorted order of S) of
an arbitrary element, using O(1) accesses in expectation and
an index of O(n log w) bits, improving the trivial result of
O(nw) bits. This implies an efficient index for searching a
blocked memory.
1 Introduction
This paper addresses a series of problems that lie at the con-
fluence of two streams of research: the study of minimal per-
fect hash functions, and the analysis of indexing structures.
A minimal perfect hash function bijectively maps a set S
of n keys into the set { 0, 1, . . . , n − 1 }. The construction of
such functions has been widely studied in recent years,
leading to fundamental theoretical results such as [12, 13, 15].
From an application-oriented viewpoint, order-
preserving minimal perfect hash functions have been used to
∗ Institut National d’Informatique, Oued Smar, Algiers, Algeria
† Università degli Studi di Milano, Italy
‡ IT University of Copenhagen, Denmark
retrieve the position of a key in a given list of keys [11, 20].
We start from the observation that all existing techniques
for this task assume that keys can be provided in any order,
incurring an unavoidable Ω(n log n) lower bound on the
number of bits required to store the function. However,
very frequently the keys to be hashed are sorted in their
intrinsic (i.e., lexicographical) order. This is typically the
case of dictionaries of search engines, lists of URLs of web
graphs, etc. We call the problem of mapping each key of a
lexicographically sorted set to its ordinal position monotone
minimal perfect hashing. This problem has received, to
the best of our knowledge, no attention in the literature.
However, as we will shortly explain, it is tightly connected
with other classical problems. It is, in a way, a very weak
form of ranking: for instance, partial ranking on a set S is
given by a function that returns the lexicographical position
of an element x of S, but returns −1 if x is not in S. Instead,
a monotone minimal perfect hash function is allowed to
return any result on elements not in S.
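To make the distinction concrete, here is a toy rendering of the two contracts in Python. A plain dictionary stands in for the compact structures (so the space bounds are not reproduced, only the interfaces), and all function names are ours, not the paper's:

```python
def make_partial_rank(sorted_keys):
    """Partial ranking: returns the rank of x in S, or -1 if x is not in S."""
    table = {k: i for i, k in enumerate(sorted_keys)}
    return lambda x: table.get(x, -1)

def make_monotone_mphf(sorted_keys):
    """Monotone MPHF contract: correct rank on S, *any* answer outside S."""
    table = {k: i for i, k in enumerate(sorted_keys)}
    # Outside S we may return anything; here we arbitrarily return 0.
    return lambda x: table.get(x, 0)

S = ["ant", "bee", "cat", "dog"]
rank = make_partial_rank(S)
h = make_monotone_mphf(S)
assert rank("cat") == 2 and rank("cow") == -1
assert h("cat") == 2
# h("cow") may be any value: this weaker contract is what saves space.
```

The freedom to answer arbitrarily outside S is exactly what allows the description size to drop below the Ω(n log n) bits needed for order-preserving hashing of arbitrarily ordered keys.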
In a classical paper, Yao [27] showed that if one uses
no extra space in addition to a table of n > 2 keys from
an ordered universe u, and if u is sufficiently large, any
organization of the table requires log(n + 1) worst case
search time. In particular, sorting the table yields the best
possible search time. The lower bound holds even if we
are interested only in membership queries (“is x in the table
or not?”), and it extends to more general data structures
allowing pointers and repeated keys in n^{O(1)} space. A
number of researchers have investigated the amount of extra
space needed to break Yao’s lower bound, using hashing
techniques to provide membership queries with O(1) table
accesses [9, 10, 25]. However, these schemes do not support
range queries (“Which are the keys in the range [ℓ . . r]?”)
beyond the trivial reduction to membership queries that
requires linear time in the size of the range.
Here we consider the scenario where the table is sorted,
and ask the question: What is the space usage of index
structures that make it possible to search the table using O(1)
accesses in expectation? We consider two kinds of search for
a key x:
1. Membership searches where we must find the position
of x in the table, or report that x is not in the table, and
2. Predecessor searches where we must return the position
of the largest key not greater than x.
The second type immediately implies efficient range queries:
Find the predecessor of ℓ and scan the table until a key
greater than r is found. This kind of scanning is very efficient
on disks, as well as modern blocked memory architectures.
Indeed, several recent papers on range queries propose to
keep keys in a sorted table (with gaps), rather than in a
normal search tree, exactly for this reason [24, 2, 1]. (We
note that these papers are concerned with the dynamic case
where the key set is changing, while we consider only the
static case.)
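The reduction from range queries to predecessor search can be sketched as follows. Ordinary binary search stands in for the O(1)-access predecessor index, so only the query pattern, not the access bound, is faithful; the names are ours:

```python
import bisect

def predecessor_index(table, x):
    """Position of the largest key <= x, or -1 if every key exceeds x.
    Binary search here is a stand-in for the constant-access index."""
    return bisect.bisect_right(table, x) - 1

def range_query(table, lo, hi):
    """Report all keys in [lo, hi]: find the predecessor of lo, then scan."""
    i = predecessor_index(table, lo)
    if i < 0 or table[i] < lo:
        i += 1  # advance to the first key >= lo
    out = []
    while i < len(table) and table[i] <= hi:
        out.append(table[i])
        i += 1
    return out

T = [2, 3, 5, 7, 11, 13, 17]
assert range_query(T, 4, 12) == [5, 7, 11]
```

The sequential scan after the initial lookup is the part that is cheap on disks and blocked memories, which is why the sorted-table layout is attractive.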
The first type of search allows point queries, but also
range queries of a special kind: If we know some element
in the range, its position can be found in O(1) accesses,
after which reporting all elements in the range is trivial. This
may be relevant for example in database applications, where
referential integrity constraints ensure that the elements in
one relation also exist in another relation.
Our results. Suppose that our universe has size 2^w,
and n ≥ log w. Without loss of generality we assume that
w is a power of two (if not, round it up). Our model of
computation is a unit-cost word RAM with word size w. We
describe a monotone minimal perfect hash function of size
O(n log log w) bits and query time O(log w), and one of size
O(n log w) bits and constant query time. In both cases, this
implies that we can find an element in a sorted table using
just one access to the table. Both constructions are based on
novel ways of separating a set of keys into O(n/k) ordered
groups of size O(k). This improves on the space and time,
respectively, of the classic solution that stores every k-th key
explicitly in a predecessor data structure (e.g. [26]). It is
known that Ω(n) bits is a lower bound, even when the set is
not required to be stored in sorted order [21], so these results
are (at least) close to best possible.
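The classic baseline mentioned above, storing every k-th key explicitly and finishing the search inside a single bucket of size at most k, can be sketched in a few lines of Python. This is only an illustration of the scheme being improved on; `build_index` and `search` are our own names, and binary search again stands in for the predecessor structure:

```python
import bisect

def build_index(sorted_keys, k):
    """Sample every k-th key; these samples delimit O(n/k) buckets."""
    return sorted_keys[::k]

def search(sorted_keys, samples, k, x):
    """Locate x: first find its bucket among the samples, then search
    only within that bucket of at most k keys. Returns the position of
    x in the table, or -1 if x is absent."""
    b = bisect.bisect_right(samples, x) - 1  # bucket containing x, if any
    if b < 0:
        return -1  # x precedes every key
    lo, hi = b * k, min(b * k + k, len(sorted_keys))
    j = bisect.bisect_left(sorted_keys, x, lo, hi)
    return j if j < hi and sorted_keys[j] == x else -1

S = list(range(0, 100, 3))  # 0, 3, 6, ..., 99
samples = build_index(S, 8)
assert search(S, samples, 8, 27) == 9
assert search(S, samples, 8, 28) == -1
```

Storing the samples costs O((n/k) w) bits and the in-bucket search costs O(log k) probes; the paper's contribution is to separate the keys into ordered groups far more cheaply than this explicit sampling.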
The main tool in the more space efficient solution for
membership search is an approximate trie representation that
allows us to store a set S of n keys in space O(n log w)
so that for every element y its rank relative to S can be
approximately determined in the following sense: The data
structure returns a set of two integers in { 0, 1, . . . , n } such
that with probability 1 − 1/w one of them is the position of
the predecessor of y. The data structure is an approximate
version of a new van Emde Boas tree-like data structure
that we call z-fast trie (named after its relationship to y-fast
tries [26]).
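The way a client consumes the two-candidate answer can be sketched as follows. The approximate structure itself is the paper's contribution, so we simulate its guarantee directly (the true rank is placed in the returned pair); only the verification pattern, one table access per candidate, is the point of the sketch:

```python
import bisect
import random

def approximate_rank(sorted_keys, y):
    """Stand-in for the approximate z-fast trie: returns a set of at most
    two integers in {0, ..., n} such that, with probability 1 - 1/w in
    the real structure, one of them is the rank of the predecessor of y.
    Here the guarantee is simulated rather than computed compactly."""
    true_rank = bisect.bisect_right(sorted_keys, y) - 1
    other = random.randint(0, len(sorted_keys))
    return {true_rank, other}

def predecessor(sorted_keys, y):
    # Verifying a candidate costs one table access, so at most two
    # accesses decide which (if either) candidate is correct.
    for r in approximate_rank(sorted_keys, y):
        if 0 <= r < len(sorted_keys) and sorted_keys[r] <= y and \
                (r + 1 == len(sorted_keys) or sorted_keys[r + 1] > y):
            return r
    return None  # the rare failure case

T = [2, 5, 9]
assert predecessor(T, 6) == 1 and predecessor(T, 100) == 2
```

Because at most one position can pass the verification test, the two candidates can also be checked in parallel, which is the scenario mentioned below for hardware routing and B-trees on parallel disks.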
We believe that our approximate trie result is of indepen-
dent interest, and could potentially find applications in set-
tings where the two possible predecessors could be searched
in parallel (e.g. in hardware solutions for routing, and in B-
trees on parallel disk systems.) We also show a lower bound
implying that the approximate trie representation is close to
optimal in the following sense: Every data structure that de-
termines the correct rank of each element in the universe with
probability more than 1/2 must use space close to the space
required for storing S itself. Our data structure allows the
rank to be determined with probability slightly below 1/2.
An implementation of the data structures presented in
this paper is distributed under the GNU LGPL as part of
the Sux4J project (http://sux4j.dsi.unimi.it/). We
study carefully the implementation problems and the practi-
cal behaviour of our algorithms in a forthcoming paper.
Related work. Mehlhorn [21] showed that minimal
perfect hashing requires space Θ(n + log w) bits. The lower
bound holds even in the more general case where the range
of the function is of size O(n) rather than exactly n, and h is
required to be injective. A succinct data structure represent-
ing a minimal perfect hash function, with O(1) evaluation
time, was constructed by Hagerup and Tholey [15].
Schmidt and Siegel [25] considered a generalization of
perfect hashing where the hash function h returns a set of at
most k values from {0, . . . , n − 1}. The requirement is that
there should exist a matching between S and {0, . . . , n − 1}
such that every key x ∈ S is matched to an element of
h(x). For constant k this still requires Ω(n) bits of space.
Specifically, upper and lower bounds of O(ne^{-k} + log log m)
and Ω((n/k^2)e^{-k} + log log m) bits were shown.
Mairson [18, 19] considered a related generalization of
minimal perfect hashing where the range is split into “pages”
of size k, and the k possible positions for a key are always
the positions of a single page. In other words, at most k
keys of S should map to a single page. Mairson showed
that Θ(n log(k)/k) bits are necessary and sufficient for this
problem. Allowing pages that have only Θ(k) keys on
average does not help: also in this case there is a lower
bound of Ω(n log(k)/k) [19]. All these results are for the
case where n is not bounded by a function of k; indeed, for
k = ω(log n) a “paged” perfect hash function requires only
O(log n + log w) bits of space.
Fiat et al. [10] considered searching a table that may
be organized as any permutation of S. They showed that
O(log n + log w) bits of additional storage are sufficient to
achieve constant-time membership search. A subset of the
authors [9] later showed, by a nonconstructive argument, that
in theory O(log w) additional bits are sufficient. These
methods make use of the fact that information can be encoded as
permutations of elements, which is not possible in our set-
ting.
2 Definitions, notation, tools
Sets and integers. We use von Neumann’s definition and
notation for natural numbers: n = { 0, 1, . . . , n − 1 }. We
thus freely write f : m → n for a function from the first
m natural numbers to the first n natural numbers. We do
the same with real numbers, with a slight abuse of notation,
understanding a ceiling operator.
In the following, we will always assume that a universe