ASeven-DimensionalAnalysisofHashingMethodsanditsImplicationsonQueryProcessing-2015(p249-richter)-计算机科学资源-CSDN文库

122 浏览量 2021-04-22 15:57:15 上传评论收藏 440KB PDF 举报

资源推荐

资源详情

资源评论

A Seven-Dimensional Analysis of Hashing Methods and its

Implications on Query Processing

Stefan Richter

Information Systems Group

Saarland University

stefan.richter@infosys.uni-

saarland.de

Victor Alvarez

∗

Dept. Computer Science

TU Braunschweig

alvarez@ibr.cs.tu-bs.de

Jens Dittrich

Information Systems Group

Saarland University

dittrich@cs.uni-

saarland.de

ABSTRACT

Hashing is a solved problem. It allows us to get constant time ac-

cess for lookups. Hashing is also simple. It is safe to use an arbi-

trary method as a black box and expect good performance, and opti-

mizations to hashing can only improve it by a negligible delta. Why

are all of the previous statements plain wrong? That is what this pa-

per is about. In this paper we thoroughly study hashing for integer

keys and carefully analyze the most common hashing methods in a

ﬁve-dimensional requirements space: () data-distribution, () load

factor, () dataset size, () read/write-ratio, and () un/successful-

ratio. Each point in that design space may potentially suggest a dif-

ferent hashing scheme, and additionally also a different hash func-

tion. We show that a right or wrong decision in picking the right

hashing scheme and hash function combination may lead to sig-

niﬁcant difference in performance. To substantiate this claim, we

carefully analyze two additional dimensions: () ﬁve representa-

tive hashing schemes (which includes an improved variant of Robin

Hood hashing), () four important classes of hash functions widely

used today. That is, we consider 20 different combinations in total.

Finally, we also provide a glimpse about the effect of table mem-

ory layout and the use of SIMD instructions. Our study clearly

indicates that picking the right combination may have considerable

impact on insert and lookup performance, as well as memory foot-

print. A major conclusion of our work is that hashing should be

considered a white box before blindly using it in applications, such

as query processing. Finally, we also provide a strong guideline

about when to use which hashing method.

1. INTRODUCTION

In recent years there has been a considerable amount of research

on tree-structured main-memory indexes, e.g. [17, 13, 21]. How-

ever, it is hard to ﬁnd recent database literature thoroughly exam-

ining the effects of different hash tables in query processing. This

is unfortunate for at least two reasons: First, hashing has plenty of

applications in modern database systems, including join process-

ing, grouping, and accelerating point queries. In those applica-

tions, hash tables serve as a building block. Second, there is strong

∗

Work done while at Saarland University.

This work is licensed under the Creative Commons Attribution-

NonCommercial-NoDerivatives 4.0 International License. To view a copy

of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For

any use beyond those covered by this license, obtain permission by emailing

info@vldb.org.

Proceedings of the VLDB Endowment, Vol. 9, No. 3

evidence that hash tables are much faster than even the most re-

cent and best tree-structured indexes. For instance, in our recent

experimental analysis [1] we carefully compared the performance

of modern tree-structured indexes for main-memory databases like

ARTful [17] with a selection of different hash tables

. A central

lesson learned from our work [1] was that a carefully and well-

chosen hash table is still considerably faster (up to factor 4-5x)

for point queries than any of the aforementioned tree-structured in-

dexes. However, our previous work also triggered some nagging

research questions: () When exactly should we choose which hash

table? () What are the most efﬁcient hashing methods that should

be considered for query processing? () What other dimensions af-

fect the choice of “the right” hash table? and ﬁnally () What is the

performance impact of those factors. While investigating answers

to these questions we stumbled over interesting results that greatly

enriched our knowledge, and that could greatly help practitioners,

and potentially also the optimizer, to take well-informed decisions

as of when to use what hash table.

1.1 Our Contributions

We carefully study single-threaded hashing for 64-bit integer

keys and values in a ﬁve-dimensional requirements space:

1. Data distribution. Three different data distributions: dense,

sparse, and a grid-like distribution (think of IP addresses).

2. Load factor. Six different load factors between 25- and 90%.

3. Dataset size. We consider a variety of sizes for the hash ta-

bles to observe performance when they are rather small (they

ﬁt in cache), and when they are of medium and large sizes

(outside cache but still addressable by TLB using huge pages

or not respectively).

4. Read/write-ratio. We consider whether the hash tables are

to be used under a static workload (OLAP-like) or a dynamic

workload (OLTP-like). For both we simulate an indexing

workload — which in turn captures the essence of other im-

portant operations such as joins or aggregates.

5. Un/successful lookup ratio. We study the performance of

the hash tables when the amount of lookups (probes) varies

from all successful to all unsuccessful.

Each point in that design space may potentially suggest a differ-

ent hash table. We show that a right/wrong decision in picking the

We use the term hash table throughout the paper to indicate that

both the hashing scheme (say linear probing) and the hash function

(say Murmur) are chosen.

right combination hhashing scheme, hash functioni may lead to an

order of magnitude difference in performance. To substantiate this

claim, we carefully analyze two additional dimensions:

6. Hashing scheme. We consider linear probing, quadratic prob-

ing, Robin Hood hashing as described in [5] but carefully en-

gineered, Cuckoo hashing [19], and two different variants of

chained hashing.

7. Hash function. We integrate each hashing scheme with four

different hash functions: Multiply-shift [8], Multiply-add-

shift [7], Tabulation hashing [20], and Murmur hashing [2],

which is widely used in practice. This gives 24 different

combinations (hash tables).

Therefore, we study in total a set of seven different dimen-

sions that are key parameters to the overall performance of a hash

table. We shed light on these seven dimensions focusing on one of

the most important use-cases in query processing: indexing. This

in turn resembles very closely other important operations such as

joins and aggregates — like SUM, MIN, etc. Additionally, we also

offer a glimpse about the effect of different table layout and the use

of SIMD instructions. Our main goal is to produce enough results

that can guide practitioners, and potentially the optimizer, towards

choosing the most appropriate hash table for their use case at hand.

To the best of our knowledge, no work in the literature has con-

sidered such a thorough set of experiments on hash tables.

Our study clearly indicates that picking the right conﬁguration

may have considerable impact on standard query processing tasks

such as main-memory indexing as well as join processing, which

heavily rely on hashing. Hence, hashing should be considered as a

white box method in query processing and query optimization.

We decided to focus on studying hash tables in a single-threaded

context to isolate the impact of the aforementioned dimensions. We

believe that a thorough evaluation of concurrency in hash tables is a

research topic in its own and beyond the scope of this paper. How-

ever, our observations still play an important role for hash maps

in multi-threaded algorithms. For partitioning-based parallelism

— which has recently been considered in the context of (partition-

based hash) joins [3, 4, 16] — single-threaded performance is still

a key parameter: each partition can be considered an isolated unit

of work that is only accessed by exactly one thread at a time, and

therefore concurrency control inside the hash tables is not needed.

Furthermore, all hash tables we present in the paper can be ex-

tended for thread safety through well-known techniques such as

striped locking or compare-and-swap. Here, the dimensions we

discuss still impact the performance of the underlying hash table.

This paper is organized as follows: In Sections 2 and 3 we brieﬂy

describe each of the ﬁve considered hashing schemes and the four

considered hash functions respectively. In Section 4 we describe

our methodology, setup, measurements, and the three data distribu-

tions used. We also discuss why we have narrowed down our result

set — we present in this paper what we consider the most relevant

results. In Sections 5, 6, and 7 we present all our experiments along

with their corresponding discussion.

2. HASHING SCHEMES

In this paper, we study the performance of ﬁve different hash-

ing schemes: () chained hashing, () linear probing, () quadratic

probing, () Robin Hood hashing on linear probing, and () Cuckoo

hashing — the last four belong to the so-called open-addressing

schemes, in which every slot of the hash table stores exactly one

element, or stores special values denoting whether the correspond-

ing slot is free. For open-addressing schemes we assume that the

tables have l slots (l is called capacity of the table). Let 0 ≤ n ≤ l

be the number of occupied slots (we call n the size of the table)

and consider the ratio α =

as the load factor of the table. For

chained hashing, the concept of load factor makes in general lit-

tle sense since it can store more than one element in the same slot

using a linked list, and thus we could obtain α > 1. Hence, when-

ever we discuss chained hashing for a load factor α, we mean that

the presented chained hash tables are memory-wise comparable to

open-addressing hash tables at load factor α — in particular, the

hash tables contain the same number n of elements, but their direc-

tory size can differ. We elaborate on this in Section 4.5.

Finally, one fundamental question in open-addressing is whether

to organize the table as array-of-structs (AoS) or as a struct-of-

arrays (SoA). In AoS, the table is stored in one (or more in case of

Cuckoo hashing) arrays of key-value pairs, similar to a row layout.

In contrast to that, SoA representation keeps keys and correspond-

ing values separated in two corresponding, aligned arrays - similar

to column layout. We found in a micro-benchmark that AoS is su-

perior to SoA in most relevant cases for our setup and hence apply

this organization in all open-addressing schemes in this paper. For

more details on this micro-benchmark see Section 7. We now pro-

ceed to brieﬂy describe each considered hashing scheme in turn.

2.1 Chained Hashing

Standard chained hashing is a very simple approach for collision

handling, where each slot of table T (the directory) is a pointer to a

linked list of entries. On inserts, entries are appended to the list that

corresponds to their key k under hash function h, i.e., T [h(k)]. In

case of lookups, the linked list under T [h(k)] is searched for the en-

try with key k. Chained hashing is a simple and robust method that

is widely used in practice, e.g., in the current implementations of

std::unordered map in C++ STL or java.util.HashMap

in Java. However, compared to open-addressing methods, chained

hashing has typically sub-optimal performance for integer keys w.r.t.

runtime and memory footprint. Two main reasons for this are:

() the pointers used by the linked lists lead to a high memory

overhead and () using linked lists leads to additional cache misses

(even for slots with one element and no collisions). This situation

brings different opportunities for optimizing a traditional chained

hash table. For example, we can reduce cache misses by mak-

ing the directory wide enough (say 24-byte entries for key-value-

pointer triplets) so that we can always store one element directly in

the directory and avoid following the corresponding pointer. Colli-

sions are then stored in the corresponding linked list. In this version

we potentially achieved the latency of open-addressing schemes

(if collisions are rare) at the cost of space. Throughout the paper

we denote the two versions of chained hashing we mentioned by

ChainedH8, and ChainedH24 respectively.

In the very ﬁrst set of experiments we studied the performance of

ChainedH8, and ChainedH24 under a variety of factors, as to better

understand the trade-offs they offer. One key observation that we

would like to point out at this point is: We observed that entry al-

location in the linked lists is a key factor for insert performance in

all our variants of chained hashing. For example, a naive approach

with dynamic allocation, i.e., using one malloc call per insertion,

and one free call per delete, lead to a signiﬁcant overhead. For

most use cases, an alternative allocation strategy provides a consid-

erable performance beneﬁt. That is, for both chained hashing meth-

ods in our indexing experiments, Sections 5 and 6, we use a slab

allocator. The idea is to bulk-allocate many (or up to all) entries

in one large array and store all map entries consecutively in this ar-

rays. This strategy is very efﬁcient in all scenarios where the size of

the hash table is either known in advance or only growing. We ob-

剩余11页未读，继续阅读

评论收藏

内容反馈

weixin_38672840

粉丝: 9
资源: 893

A Seven-Dimensional Analysis of Hashing Methods and its Implicat...

最新资源

A Seven-Dimensional Analysis of Hashing Methods and its Implicat...

dimensional-analysis

Three-Dimensional Model Analysis and Processing【三维模型分析与处理】

（part3）Two-dimensional phase unwrapping: theory, algorithms, and software3

High-dimensional data analysis with low-dimensional models-2020.pdf

Analysis of Multivariate and High-Dimensional Data 532pages

Applied functional analysis - Numerical Methods, Wavelet Methods, and Image Processing (Abul Hasan Siddiqi) (M.Dekker 2004 569s)

High-dimensional controlled-phase gate between a 2N-dimensional photon and N three-level artificial atoms

Two-Dimensional Signal and Image Processing Part 2

Two-Dimensional Signal and Image Processing Part 1

Two-dimensional signal and image processing.zip

MUSIC and maximum likelihood techniques on two-dimensional DOA estimation

Three-dimensional simulation

Research on Three-dimensional Reconstruction of the Liver CT Images

Low-Dimensional Models for High-Dimensional Data From Linear to

A two-dimensional hybrid for elastic-plastic stress analysis

High-Dimensional Probability: An Introduction with Applications in Data Science

Big Data in Omics and Imaging_Association Analysis-CRC (2018).pdf

Finite-Dimensional Variational Inequalities and Complementarity(Volume II)

High-Dimensional Probability.pdf

Analytical Studies of the (2+1)-Dimensional Negative Gardner-KP Equation and Its Application on Internal Solitary Waves

Analysis of Dimensional Homogeniety in Fractal Elasticity And Its Simplest Uniaxial Model

Two Dimension Signal and image processing MIT Textbook part 2

Two-Dimensional Phase Unwrapping Problem

Theoretical and experimental study on two-dimensional binary locally resonant phononic crystals

最新资源