Monotone Minimal Perfect Hashing:
Searching a Sorted Table with O(1) Accesses
Djamal Belazzougui∗  Paolo Boldi†  Rasmus Pagh‡  Sebastiano Vigna†
Abstract
A minimal perfect hash function maps a set S of n keys into
the set { 0, 1, . . . , n − 1 } bijectively. Classical results state
that minimal perfect hashing is possible in constant time us-
ing a structure occupying space close to the lower bound
of log e bits per element. Here we consider the problem of
monotone minimal perfect hashing, in which the bijection is
required to preserve the lexicographical ordering of the keys.
A monotone minimal perfect hash function can be seen as a
very weak form of index that provides ranking just on the
set S (and answers randomly outside of S). Our goal is to
minimise the description size of the hash function: we show
that, for a set S of n elements out of a universe of 2^w
elements, O(n log log w) bits are sufficient to hash
monotonically with evaluation time O(log w). Alternatively, we can
get space O(n log w) bits with O(1) query time. Both of
these data structures improve a straightforward construction
with O(n log w) space and O(log w) query time. As a con-
sequence, it is possible to search a sorted table with O(1)
accesses to the table (using additional O(n log log w) bits).
Our results are based on a structure (of independent interest)
that represents a trie in a very compact way, but admits er-
rors. As a further application of the same structure, we show
how to compute the predecessor (in the sorted order of S) of
an arbitrary element, using O(1) accesses in expectation and
an index of O(n log w) bits, improving the trivial result of
O(nw) bits. This implies an efficient index for searching a
blocked memory.
1 Introduction
This paper addresses a series of problems that lie at the con-
fluence of two streams of research: the study of minimal per-
fect hash functions, and the analysis of indexing structures.
A minimal perfect hash function bijectively maps a set S
of n keys into the set { 0, 1, . . . , n − 1 }. The construction of
such functions has been widely studied in recent years,
leading to fundamental theoretical results such as [12, 13, 15].
From an application-oriented viewpoint, order-
preserving minimal perfect hash functions have been used to
∗ Institut National d’Informatique, Oued Smar, Algiers, Algeria
† Università degli Studi di Milano, Italy
‡ IT University of Copenhagen, Denmark
retrieve the position of a key in a given list of keys [11, 20].
We start from the observation that all existing techniques
for this task assume that keys can be provided in any order,
incurring an unavoidable Ω(n log n) lower bound on the
number of bits required to store the function. However,
very frequently the keys to be hashed are sorted in their
intrinsic (i.e., lexicographical) order. This is typically the
case of dictionaries of search engines, lists of URLs of web
graphs, etc. We call the problem of mapping each key of a
lexicographically sorted set to its ordinal position monotone
minimal perfect hashing. This problem has received, to
the best of our knowledge, no attention in the literature.
However, as we will shortly explain, it is tightly connected
with other classical problems. It is, in a way, a very weak
form of ranking: for instance, partial ranking on a set S is
given by a function that returns the lexicographical position
of an element x of S, but returns −1 if x is not in S. Instead,
a monotone minimal perfect hash function is allowed to
return any result on elements not in S.
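To make the distinction concrete, here is a toy rendering of the two contracts in Python. A plain dictionary stands in for the compact structures (so the space bounds are not reproduced, only the interfaces), and all function names are ours, not the paper's:

```python
def make_partial_rank(sorted_keys):
    """Partial ranking: returns the rank of x in S, or -1 if x is not in S."""
    table = {k: i for i, k in enumerate(sorted_keys)}
    return lambda x: table.get(x, -1)

def make_monotone_mphf(sorted_keys):
    """Monotone MPHF contract: correct rank on S, *any* answer outside S."""
    table = {k: i for i, k in enumerate(sorted_keys)}
    # Outside S we may return anything; here we arbitrarily return 0.
    return lambda x: table.get(x, 0)

S = ["ant", "bee", "cat", "dog"]
rank = make_partial_rank(S)
h = make_monotone_mphf(S)
assert rank("cat") == 2 and rank("cow") == -1
assert h("cat") == 2
# h("cow") may be any value: this weaker contract is what saves space.
```

The freedom to answer arbitrarily outside S is exactly what allows the description size to drop below the Ω(n log n) bits needed for order-preserving hashing of arbitrarily ordered keys.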
In a classical paper, Yao [27] showed that if one uses
no extra space in addition to a table of n > 2 keys from
an ordered universe u, and if u is sufficiently large, any
organization of the table requires log(n + 1) worst case
search time. In particular, sorting the table yields the best
possible search time. The lower bound holds even if we
are interested only in membership queries (“is x in the table
or not?”), and it extends to more general data structures
allowing pointers and repeated keys in n^{O(1)} space. A
number of researchers have investigated the amount of extra
space needed to break Yao’s lower bound, using hashing
techniques to provide membership queries with O(1) table
accesses [9, 10, 25]. However, these schemes do not support
range queries (“Which are the keys in the range [ℓ . . r]?”)
beyond the trivial reduction to membership queries that
requires linear time in the size of the range.
Here we consider the scenario where the table is sorted,
and ask the question: What is the space usage of index
structures that make it possible to search the table using O(1)
accesses in expectation? We consider two kinds of search for
a key x:
1. Membership searches where we must find the position
of x in the table, or report that x is not in the table, and
2. Predecessor searches where we must return the position
of the largest key not greater than x.
The second type immediately implies efficient range queries:
Find the predecessor of ℓ and scan the table until a key
greater than r is found. This kind of scanning is very efficient
on disks, as well as modern blocked memory architectures.
Indeed, several recent papers on range queries propose to
keep keys in a sorted table (with gaps), rather than in a
normal search tree, exactly for this reason [24, 2, 1]. (We
note that these papers are concerned with the dynamic case
where the key set is changing, while we consider only the
static case.)
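The reduction from range queries to predecessor search can be sketched as follows. Ordinary binary search stands in for the O(1)-access predecessor index, so only the query pattern, not the access bound, is faithful; the names are ours:

```python
import bisect

def predecessor_index(table, x):
    """Position of the largest key <= x, or -1 if every key exceeds x.
    Binary search here is a stand-in for the constant-access index."""
    return bisect.bisect_right(table, x) - 1

def range_query(table, lo, hi):
    """Report all keys in [lo, hi]: find the predecessor of lo, then scan."""
    i = predecessor_index(table, lo)
    if i < 0 or table[i] < lo:
        i += 1  # advance to the first key >= lo
    out = []
    while i < len(table) and table[i] <= hi:
        out.append(table[i])
        i += 1
    return out

T = [2, 3, 5, 7, 11, 13, 17]
assert range_query(T, 4, 12) == [5, 7, 11]
```

The sequential scan after the initial lookup is the part that is cheap on disks and blocked memories, which is why the sorted-table layout is attractive.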
The first type of search allows point queries, but also
range queries of a special kind: If we know some element
in the range, its position can be found in O(1) accesses,
after which reporting all elements in the range is trivial. This
may be relevant for example in database applications, where
referential integrity constraints ensure that the elements in
one relation also exist in another relation.
Our results. Suppose that our universe has size 2^w,
and n ≥ log w. Without loss of generality we assume that
w is a power of two (if not, round it up). Our model of
computation is a unit-cost word RAM with word size w. We
describe a monotone minimal perfect hash function of size
O(n log log w) bits and query time O(log w), and one of size
O(n log w) bits and constant query time. In both cases, this
implies that we can find an element in a sorted table using
just one access to the table. Both constructions are based on
novel ways of separating a set of keys into O(n/k) ordered
groups of size O(k). This improves on the space and time,
respectively, of the classic solution that stores every k-th key
explicitly in a predecessor data structure (e.g. [26]). It is
known that Ω(n) bits is a lower bound, even when the set is
not required to be stored in sorted order [21], so these results
are (at least) close to best possible.
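The classic baseline mentioned above, storing every k-th key explicitly and finishing the search inside a single bucket of size at most k, can be sketched in a few lines of Python. This is only an illustration of the scheme being improved on; `build_index` and `search` are our own names, and binary search again stands in for the predecessor structure:

```python
import bisect

def build_index(sorted_keys, k):
    """Sample every k-th key; these samples delimit O(n/k) buckets."""
    return sorted_keys[::k]

def search(sorted_keys, samples, k, x):
    """Locate x: first find its bucket among the samples, then search
    only within that bucket of at most k keys. Returns the position of
    x in the table, or -1 if x is absent."""
    b = bisect.bisect_right(samples, x) - 1  # bucket containing x, if any
    if b < 0:
        return -1  # x precedes every key
    lo, hi = b * k, min(b * k + k, len(sorted_keys))
    j = bisect.bisect_left(sorted_keys, x, lo, hi)
    return j if j < hi and sorted_keys[j] == x else -1

S = list(range(0, 100, 3))  # 0, 3, 6, ..., 99
samples = build_index(S, 8)
assert search(S, samples, 8, 27) == 9
assert search(S, samples, 8, 28) == -1
```

Storing the samples costs O((n/k) w) bits and the in-bucket search costs O(log k) probes; the paper's contribution is to separate the keys into ordered groups far more cheaply than this explicit sampling.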
The main tool in the more space efficient solution for
membership search is an approximate trie representation that
allows us to store a set S of n keys in space O(n log w)
so that for every element y its rank relative to S can be
approximately determined in the following sense: The data
structure returns a set of two integers in { 0, 1, . . . , n } such
that with probability 1 − 1/w one of them is the position of
the predecessor of y. The data structure is an approximate
version of a new van Emde Boas tree-like data structure
that we call z-fast trie (named after its relationship to y-fast
tries [26]).
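The way a client consumes the two-candidate answer can be sketched as follows. The approximate structure itself is the paper's contribution, so we simulate its guarantee directly (the true rank is placed in the returned pair); only the verification pattern, one table access per candidate, is the point of the sketch:

```python
import bisect
import random

def approximate_rank(sorted_keys, y):
    """Stand-in for the approximate z-fast trie: returns a set of at most
    two integers in {0, ..., n} such that, with probability 1 - 1/w in
    the real structure, one of them is the rank of the predecessor of y.
    Here the guarantee is simulated rather than computed compactly."""
    true_rank = bisect.bisect_right(sorted_keys, y) - 1
    other = random.randint(0, len(sorted_keys))
    return {true_rank, other}

def predecessor(sorted_keys, y):
    # Verifying a candidate costs one table access, so at most two
    # accesses decide which (if either) candidate is correct.
    for r in approximate_rank(sorted_keys, y):
        if 0 <= r < len(sorted_keys) and sorted_keys[r] <= y and \
                (r + 1 == len(sorted_keys) or sorted_keys[r + 1] > y):
            return r
    return None  # the rare failure case

T = [2, 5, 9]
assert predecessor(T, 6) == 1 and predecessor(T, 100) == 2
```

Because at most one position can pass the verification test, the two candidates can also be checked in parallel, which is the scenario mentioned below for hardware routing and B-trees on parallel disks.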
We believe that our approximate trie result is of indepen-
dent interest, and could potentially find applications in set-
tings where the two possible predecessors could be searched
in parallel (e.g. in hardware solutions for routing, and in B-
trees on parallel disk systems.) We also show a lower bound
implying that the approximate trie representation is close to
optimal in the following sense: Every data structure that de-
termines the correct rank of each element in the universe with
probability more than 1/2 must use space close to the space
required for storing S itself. Our data structure allows the
rank to be determined with probability slightly below 1/2.
An implementation of the data structures presented in
this paper is distributed under the GNU LGPL as part of
the Sux4J project (http://sux4j.dsi.unimi.it/). We
study carefully the implementation problems and the practi-
cal behaviour of our algorithms in a forthcoming paper.
Related work. Mehlhorn [21] showed that minimal
perfect hashing requires space Θ(n + log w) bits. The lower
bound holds even in the more general case where the range
of the function is of size O(n) rather than exactly n, and h is
required to be injective. A succinct data structure represent-
ing a minimal perfect hash function, with O(1) evaluation
time, was constructed by Hagerup and Tholey [15].
Schmidt and Siegel [25] considered a generalization of
perfect hashing where the hash function h returns a set of at
most k values from {0, . . . , n − 1}. The requirement is that
there should exist a matching between S and {0, . . . , n − 1}
such that every key x ∈ S is matched to an element of
h(x). For constant k this still requires Ω(n) bits of space.
Specifically, upper and lower bounds of O(ne^{-k} + log log m)
and Ω((n/k^2)e^{-k} + log log m) bits were shown.
Mairson [18, 19] considered a related generalization of
minimal perfect hashing where the range is split into “pages”
of size k, and the k possible positions for a key are always
the positions of a single page. In other words, at most k
keys of S should map to a single page. Mairson showed
that Θ(n log(k)/k) bits are necessary and sufficient for this
problem. Allowing pages that have only Θ(k) keys on
average does not help: also in this case there is a lower
bound of Ω(n log(k)/k) [19]. All these results are for the
case where n is not bounded by a function of k; indeed, for
k = ω(log n) a “paged” perfect hash function requires only
O(log n + log w) bits of space.
Fiat et al. [10] considered searching a table that may
be organized as any permutation of S. They showed that
O(log n + log w) bits of additional storage are sufficient to
achieve constant-time membership search. A subset of the
authors [9] later showed, by a nonconstructive argument, that
in theory O(log w) additional bits are sufficient. These
methods make use of the fact that information can be encoded as
permutations of elements, which is not possible in our set-
ting.
2 Definitions, notation, tools
Sets and integers. We use von Neumann’s definition and
notation for natural numbers: n = { 0, 1, . . . , n − 1 }. We
thus freely write f : m → n for a function from the first
m natural numbers to the first n natural numbers. We do
the same with real numbers, with a slight abuse of notation,
understanding a ceiling operator.
In the following, we will always assume that a universe