没有合适的资源?快使用搜索试试~ 我知道了~
温馨提示
A Seven-Dimensional Analysis of Hashing Methods and its Implications on Query ProcessingStefan Richter Information Systems GroupSaarland Universitystefan.richter@infosys.uni- saarland.deVictor Alvarez ∗Dept. Computer Science TU Braunschweigalvarez@ibr.cs.tu-bs.deJens Dittrich Information Systems GroupSaarland Universitydittrich@cs.uni- saarland.deABSTRACT Hashing is a solved problem. It allows us to get constant time ac- cess for lookups. Hashing is also simple. It is safe to use an arbi- trary
资源推荐
资源详情
资源评论
A Seven-Dimensional Analysis of Hashing Methods and its
Implications on Query Processing
Stefan Richter
Information Systems Group
Saarland University
stefan.richter@infosys.uni-
saarland.de
Victor Alvarez
∗
Dept. Computer Science
TU Braunschweig
alvarez@ibr.cs.tu-bs.de
Jens Dittrich
Information Systems Group
Saarland University
dittrich@cs.uni-
saarland.de
ABSTRACT
Hashing is a solved problem. It allows us to get constant time ac-
cess for lookups. Hashing is also simple. It is safe to use an arbi-
trary method as a black box and expect good performance, and opti-
mizations to hashing can only improve it by a negligible delta. Why
are all of the previous statements plain wrong? That is what this pa-
per is about. In this paper we thoroughly study hashing for integer
keys and carefully analyze the most common hashing methods in a
five-dimensional requirements space: () data-distribution, () load
factor, () dataset size, () read/write-ratio, and () un/successful-
ratio. Each point in that design space may potentially suggest a dif-
ferent hashing scheme, and additionally also a different hash func-
tion. We show that a right or wrong decision in picking the right
hashing scheme and hash function combination may lead to sig-
nificant difference in performance. To substantiate this claim, we
carefully analyze two additional dimensions: () five representa-
tive hashing schemes (which includes an improved variant of Robin
Hood hashing), () four important classes of hash functions widely
used today. That is, we consider 20 different combinations in total.
Finally, we also provide a glimpse about the effect of table mem-
ory layout and the use of SIMD instructions. Our study clearly
indicates that picking the right combination may have considerable
impact on insert and lookup performance, as well as memory foot-
print. A major conclusion of our work is that hashing should be
considered a white box before blindly using it in applications, such
as query processing. Finally, we also provide a strong guideline
about when to use which hashing method.
1. INTRODUCTION
In recent years there has been a considerable amount of research
on tree-structured main-memory indexes, e.g. [17, 13, 21]. How-
ever, it is hard to find recent database literature thoroughly exam-
ining the effects of different hash tables in query processing. This
is unfortunate for at least two reasons: First, hashing has plenty of
applications in modern database systems, including join process-
ing, grouping, and accelerating point queries. In those applica-
tions, hash tables serve as a building block. Second, there is strong
∗
Work done while at Saarland University.
This work is licensed under the Creative Commons Attribution-
NonCommercial-NoDerivatives 4.0 International License. To view a copy
of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For
any use beyond those covered by this license, obtain permission by emailing
info@vldb.org.
Proceedings of the VLDB Endowment, Vol. 9, No. 3
Copyright 2015 VLDB Endowment 2150-8097/15/11.
evidence that hash tables are much faster than even the most re-
cent and best tree-structured indexes. For instance, in our recent
experimental analysis [1] we carefully compared the performance
of modern tree-structured indexes for main-memory databases like
ARTful [17] with a selection of different hash tables
1
. A central
lesson learned from our work [1] was that a carefully and well-
chosen hash table is still considerably faster (up to factor 4-5x)
for point queries than any of the aforementioned tree-structured in-
dexes. However, our previous work also triggered some nagging
research questions: () When exactly should we choose which hash
table? () What are the most efficient hashing methods that should
be considered for query processing? () What other dimensions af-
fect the choice of “the right” hash table? and finally () What is the
performance impact of those factors. While investigating answers
to these questions we stumbled over interesting results that greatly
enriched our knowledge, and that could greatly help practitioners,
and potentially also the optimizer, to take well-informed decisions
as of when to use what hash table.
1.1 Our Contributions
We carefully study single-threaded hashing for 64-bit integer
keys and values in a five-dimensional requirements space:
1. Data distribution. Three different data distributions: dense,
sparse, and a grid-like distribution (think of IP addresses).
2. Load factor. Six different load factors between 25- and 90%.
3. Dataset size. We consider a variety of sizes for the hash ta-
bles to observe performance when they are rather small (they
fit in cache), and when they are of medium and large sizes
(outside cache but still addressable by TLB using huge pages
or not respectively).
4. Read/write-ratio. We consider whether the hash tables are
to be used under a static workload (OLAP-like) or a dynamic
workload (OLTP-like). For both we simulate an indexing
workload — which in turn captures the essence of other im-
portant operations such as joins or aggregates.
5. Un/successful lookup ratio. We study the performance of
the hash tables when the amount of lookups (probes) varies
from all successful to all unsuccessful.
Each point in that design space may potentially suggest a differ-
ent hash table. We show that a right/wrong decision in picking the
1
We use the term hash table throughout the paper to indicate that
both the hashing scheme (say linear probing) and the hash function
(say Murmur) are chosen.
right combination hhashing scheme, hash functioni may lead to an
order of magnitude difference in performance. To substantiate this
claim, we carefully analyze two additional dimensions:
6. Hashing scheme. We consider linear probing, quadratic prob-
ing, Robin Hood hashing as described in [5] but carefully en-
gineered, Cuckoo hashing [19], and two different variants of
chained hashing.
7. Hash function. We integrate each hashing scheme with four
different hash functions: Multiply-shift [8], Multiply-add-
shift [7], Tabulation hashing [20], and Murmur hashing [2],
which is widely used in practice. This gives 24 different
combinations (hash tables).
Therefore, we study in total a set of seven different dimen-
sions that are key parameters to the overall performance of a hash
table. We shed light on these seven dimensions focusing on one of
the most important use-cases in query processing: indexing. This
in turn resembles very closely other important operations such as
joins and aggregates — like SUM, MIN, etc. Additionally, we also
offer a glimpse about the effect of different table layout and the use
of SIMD instructions. Our main goal is to produce enough results
that can guide practitioners, and potentially the optimizer, towards
choosing the most appropriate hash table for their use case at hand.
To the best of our knowledge, no work in the literature has con-
sidered such a thorough set of experiments on hash tables.
Our study clearly indicates that picking the right configuration
may have considerable impact on standard query processing tasks
such as main-memory indexing as well as join processing, which
heavily rely on hashing. Hence, hashing should be considered as a
white box method in query processing and query optimization.
We decided to focus on studying hash tables in a single-threaded
context to isolate the impact of the aforementioned dimensions. We
believe that a thorough evaluation of concurrency in hash tables is a
research topic in its own and beyond the scope of this paper. How-
ever, our observations still play an important role for hash maps
in multi-threaded algorithms. For partitioning-based parallelism
— which has recently been considered in the context of (partition-
based hash) joins [3, 4, 16] — single-threaded performance is still
a key parameter: each partition can be considered an isolated unit
of work that is only accessed by exactly one thread at a time, and
therefore concurrency control inside the hash tables is not needed.
Furthermore, all hash tables we present in the paper can be ex-
tended for thread safety through well-known techniques such as
striped locking or compare-and-swap. Here, the dimensions we
discuss still impact the performance of the underlying hash table.
This paper is organized as follows: In Sections 2 and 3 we briefly
describe each of the five considered hashing schemes and the four
considered hash functions respectively. In Section 4 we describe
our methodology, setup, measurements, and the three data distribu-
tions used. We also discuss why we have narrowed down our result
set — we present in this paper what we consider the most relevant
results. In Sections 5, 6, and 7 we present all our experiments along
with their corresponding discussion.
2. HASHING SCHEMES
In this paper, we study the performance of five different hash-
ing schemes: () chained hashing, () linear probing, () quadratic
probing, () Robin Hood hashing on linear probing, and () Cuckoo
hashing — the last four belong to the so-called open-addressing
schemes, in which every slot of the hash table stores exactly one
element, or stores special values denoting whether the correspond-
ing slot is free. For open-addressing schemes we assume that the
tables have l slots (l is called capacity of the table). Let 0 ≤ n ≤ l
be the number of occupied slots (we call n the size of the table)
and consider the ratio α =
n
l
as the load factor of the table. For
chained hashing, the concept of load factor makes in general lit-
tle sense since it can store more than one element in the same slot
using a linked list, and thus we could obtain α > 1. Hence, when-
ever we discuss chained hashing for a load factor α, we mean that
the presented chained hash tables are memory-wise comparable to
open-addressing hash tables at load factor α — in particular, the
hash tables contain the same number n of elements, but their direc-
tory size can differ. We elaborate on this in Section 4.5.
Finally, one fundamental question in open-addressing is whether
to organize the table as array-of-structs (AoS) or as a struct-of-
arrays (SoA). In AoS, the table is stored in one (or more in case of
Cuckoo hashing) arrays of key-value pairs, similar to a row layout.
In contrast to that, SoA representation keeps keys and correspond-
ing values separated in two corresponding, aligned arrays - similar
to column layout. We found in a micro-benchmark that AoS is su-
perior to SoA in most relevant cases for our setup and hence apply
this organization in all open-addressing schemes in this paper. For
more details on this micro-benchmark see Section 7. We now pro-
ceed to briefly describe each considered hashing scheme in turn.
2.1 Chained Hashing
Standard chained hashing is a very simple approach for collision
handling, where each slot of table T (the directory) is a pointer to a
linked list of entries. On inserts, entries are appended to the list that
corresponds to their key k under hash function h, i.e., T [h(k)]. In
case of lookups, the linked list under T [h(k)] is searched for the en-
try with key k. Chained hashing is a simple and robust method that
is widely used in practice, e.g., in the current implementations of
std::unordered map in C++ STL or java.util.HashMap
in Java. However, compared to open-addressing methods, chained
hashing has typically sub-optimal performance for integer keys w.r.t.
runtime and memory footprint. Two main reasons for this are:
() the pointers used by the linked lists lead to a high memory
overhead and () using linked lists leads to additional cache misses
(even for slots with one element and no collisions). This situation
brings different opportunities for optimizing a traditional chained
hash table. For example, we can reduce cache misses by mak-
ing the directory wide enough (say 24-byte entries for key-value-
pointer triplets) so that we can always store one element directly in
the directory and avoid following the corresponding pointer. Colli-
sions are then stored in the corresponding linked list. In this version
we potentially achieved the latency of open-addressing schemes
(if collisions are rare) at the cost of space. Throughout the paper
we denote the two versions of chained hashing we mentioned by
ChainedH8, and ChainedH24 respectively.
In the very first set of experiments we studied the performance of
ChainedH8, and ChainedH24 under a variety of factors, as to better
understand the trade-offs they offer. One key observation that we
would like to point out at this point is: We observed that entry al-
location in the linked lists is a key factor for insert performance in
all our variants of chained hashing. For example, a naive approach
with dynamic allocation, i.e., using one malloc call per insertion,
and one free call per delete, lead to a significant overhead. For
most use cases, an alternative allocation strategy provides a consid-
erable performance benefit. That is, for both chained hashing meth-
ods in our indexing experiments, Sections 5 and 6, we use a slab
allocator. The idea is to bulk-allocate many (or up to all) entries
in one large array and store all map entries consecutively in this ar-
rays. This strategy is very efficient in all scenarios where the size of
the hash table is either known in advance or only growing. We ob-
剩余11页未读,继续阅读
资源评论
weixin_38672840
- 粉丝: 9
- 资源: 893
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
最新资源
- Linux内核5.0基础架构解析: ARM64架构、内存管理及进程管理
- 【java毕业设计】员工在线知识培训考试平台源码(ssm+mysql+说明文档).zip
- 【java毕业设计】演出道具租赁管理系统源码(ssm+mysql+说明文档).zip
- ScanMaster RPP3 脉冲放大器手册
- 【java毕业设计】社区医院儿童预防接种管理系统源码(ssm+mysql+说明文档).zip
- 【java毕业设计】企业台账管理平台源码(ssm+mysql+说明文档+LW).zip
- 【java毕业设计】面向品牌会员的在线商城源码(ssm+mysql+说明文档).zip
- 【java毕业设计】消防物资存储系统源码(ssm+mysql+说明文档+LW).zip
- 【java毕业设计】高校课程评价系统源码(ssm+mysql+说明文档+LW).zip
- 【java毕业设计】大健康老年公寓管理系统源码(ssm+mysql+说明文档).zip
资源上传下载、课程学习等过程中有任何疑问或建议,欢迎提出宝贵意见哦~我们会及时处理!
点击此处反馈
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功