Concurrent Hash Tables: Fast and General(?)!
TOBIAS MAIER and PETER SANDERS, Karlsruhe Institute of Technology, Germany
ROMAN DEMENTIEV, Intel Deutschland GmbH, Germany
Concurrent hash tables are one of the most important concurrent data structures, which are used in numerous applications. For some applications, it is common that hash table accesses dominate the execution time. To efficiently solve these problems in parallel, we need implementations that achieve speedups in highly concurrent scenarios. Unfortunately, currently available concurrent hashing libraries are far away from this requirement, in particular, when adaptively sized tables are necessary or contention on some elements occurs.
Our starting point for better performing data structures is a fast and simple lock-free concurrent hash table based on linear probing that is, however, limited to word-sized key-value types and does not support dynamic size adaptation. We explain how to lift these limitations in a provably scalable way and demonstrate that dynamic growing has a performance overhead comparable to the same generalization in sequential hash tables.
We perform extensive experiments comparing the performance of our implementations with six of the most widely used concurrent hash tables. Ours are considerably faster than the best algorithms with similar restrictions and an order of magnitude faster than the best more general tables. In some extreme cases, the difference even approaches four orders of magnitude.
All our implementations discussed in this publication can be found on github [17].
CCS Concepts: • Theory of computation → Sorting and searching; Shared memory algorithms; Bloom filters and hashing; Data structures and algorithms for data management; • Information systems → Data structures; • Software and its engineering → Data types and structures; • Computer systems organization → Multicore architectures;
Additional Key Words and Phrases: Concurrency, dynamic data structures, experimental analysis, hash table,
lock-freedom, transactional memory
ACM Reference format:
Tobias Maier, Peter Sanders, and Roman Dementiev. 2019. Concurrent Hash Tables: Fast and General(?)!. ACM
Trans. Parallel Comput. 5, 4, Article 16 (February 2019), 32 pages.
https://doi.org/10.1145/3309206
Authors’ addresses: T. Maier and P. Sanders, Karlsruhe Institute of Technology, Kaiserstraße 12, Karlsruhe, 76131, Germany; emails: {t.maier, sanders}@kit.edu; R. Dementiev, Intel Deutschland GmbH, Am Campeon 10-12, Neubiberg, 85579, Germany; email: roman.dementiev@intel.com.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
2329-4949/2019/02-ART16 $15.00
https://doi.org/10.1145/3309206

1 INTRODUCTION
A hash table is a dynamic data structure that stores a set of elements that are accessible by their key. It supports insertion, deletion, find, and update in constant expected time. In a concurrent hash table, multiple threads have access to the same table. This allows threads to share information in a flexible and efficient way. Therefore, concurrent hash tables are one of the most important concurrent data structures. See Section 4 for a more detailed discussion of concurrent hash table functionality.
To show the ubiquity of hash tables, we give a short list of example applications: A very simple use case is storing sparse sets of precomputed solutions (e.g., References [3, 30]). A more complicated scenario for hash tables is at the heart of many database-like applications (for example, in the case of aggregations SELECT FROM...COUNT...GROUP BY x [26]). Such a query selects rows from one or several relations and counts for every key x how many rows have been found (similar queries work with SUM, MIN, or MAX). Hashing can also be used for a database join [5]. Another group of examples is the exploration of a large combinatorial search space where a hash table is used to remember the already explored elements (e.g., in dynamic programming [39], itemset mining [31], a chess program, or when exploring an implicitly defined graph in model checking [40]). Similarly, a hash table can maintain a set of cached objects to save I/Os [27]. Further examples are duplicate removal, storing the edge set of a sparse graph to support edge queries [24], maintaining the set of nonempty cells in a grid-data structure used in geometry processing (e.g., [7]), or maintaining the children in tree data structures such as van Emde Boas search trees [6] or suffix trees [22].
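To make the aggregation example concrete, the following minimal sequential sketch shows the core of such a COUNT ... GROUP BY query; it uses std::unordered_map purely as a stand-in hash table (the function name and types are illustrative, not part of the paper's code). Each input row triggers exactly one hash table access, which is why such queries are dominated by hash table performance.

    // Illustrative sequential core of SELECT x, COUNT(*) ... GROUP BY x:
    // one hash table access (find-or-insert plus increment) per input row.
    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <vector>

    std::unordered_map<std::string, uint64_t>
    group_by_count(const std::vector<std::string>& rows) {
        std::unordered_map<std::string, uint64_t> counts;
        for (const std::string& x : rows)
            ++counts[x];   // creates the entry with count 0 on first sight, then increments
        return counts;
    }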
Many of these applications have in common that—even in the sequential version of the program—hash table accesses constitute a significant fraction of the running time. Thus, it is essential to have highly scalable concurrent hash tables that actually deliver significant speedups to parallelize these applications. Unfortunately, currently available general purpose concurrent hash tables do not offer the needed scalability (see Section 8 for concrete numbers). However, it seems to be folklore that a lock-free linear probing hash table can be constructed using atomic compare and swap (CAS) of key-value-pairs. Find-operations can even proceed naively and without any write operations. This implementation, which we call folkloreHT [39], has several restrictions that keep it from becoming the standard for parallel hash tables: keys and values have to be machine words to enable the necessary CAS operation (two-word CAS), the table cannot grow over its preallocated size, and it does not support true deletion (in a way that reclaims memory). In Section 4, we explain our own implementation of folkloreHT in detail, after elaborating on some related work and introducing the necessary notation (in Sections 2 and 3, respectively). To see the potential large performance differences, consider a situation with mostly read-only access to the table and heavy contention for a small number of elements that are accessed again and again by all threads. folkloreHT actually profits from this situation, because the contended elements are likely to be replicated into local caches. However, any implementation that needs locks or CAS instructions for find-operations will become much slower than the sequential code on current machines. The purpose of our article is to document and explain performance differences and, more importantly, to explore to what extent we can make folkloreHT more general with an acceptable deterioration in performance. These generalizations are discussed in Section 5. We explain how to grow (and shrink) such a table, and how to support deletions and more general data types. In Section 6, we explain how hardware transactional memory synchronization can be used to speed up insertions and updates and how it may help to handle more general data types.
After describing implementation details in Section 7, Section 8 experimentally compares our hash tables with six of the most widely used concurrent hash tables for microbenchmarks including insertion, finding, and aggregating data. We look at both uniformly distributed and skewed input distributions. Section 9 summarizes the results and discusses possible lines of future research.
2 RELATED WORK
This publication follows up on our previous findings about generalizing fast concurrent hash tables [19, 20]. In addition to describing how to generalize a fast linear probing hash table, we offer an extensive experimental analysis comparing many concurrent hash tables from several libraries.
There has been extensive previous work on concurrent hashing. The widely used textbook “The Art of Multiprocessor Programming” [11] devotes an entire chapter to concurrent hashing and gives an overview over previous work. However, it seems to us that a lot of previous work focuses more on concepts and correctness but surprisingly little on scalability. For example, most of the discussed growing mechanisms in Reference [11] assume that the size of the hash table is known exactly, without a discussion that this introduces a performance bottleneck limiting the speedup to a constant. In practice, the actual migration is often done sequentially.
Stivala et al. [39] describe a bounded concurrent linear probing hash table specialized for dynamic programming that only supports insert and find. Their insert operation starts from scratch when the CAS fails, which seems suboptimal in the presence of contention. An interesting point is that they need only word size CAS instructions at the price of reserving a special empty value. This technique could also be adapted to port our code to machines without 128-bit CAS. Kim and Kim [13] compare this table with a cache-optimized lock-free implementation of hashing with chaining and with hopscotch hashing [12]. The experiments use only uniformly distributed keys, i.e., there is little contention. Both linear probing and hashing with chaining perform well in that case. The evaluation of find-performance is a bit inconclusive: Chaining wins but uses more space than linear probing. Moreover, it is not specified whether this is for successful (search for keys of inserted elements) or mostly unsuccessful (search for not inserted keys) accesses. We suspect that varying these parameters could reverse the result.
Gao et al. [9] present a theoretical dynamic linear probing hash table that is lock-free. The main contribution is a formal correctness proof. Not all details of the algorithm, or even an implementation, are given. There is also no analysis of the complexity of the growing procedure.
Shun and Blelloch [37] propose phase concurrent hash tables, which are allowed to use only a single operation within a globally synchronized phase. They show how phase concurrency helps to implement some operations more efficiently and even deterministically in a linear probing context. For example, deletions can adapt the approach from Reference [14] and rearrange elements. This is not possible in a general hash table, since this might cause find-operations to report false negatives. They also outline an elegant growing mechanism, albeit without implementing it and without filling in all the details like how to initialize newly allocated tables. They propose to trigger a growing operation when any operation has to scan more than k log n elements, where k is a tuning parameter. This approach is tempting, since it is somewhat faster than the approximate size estimator we use. We actually tried that but found that this trigger has a very high variance—sometimes it triggers late, making operations rather slow, sometimes it triggers early, wasting a lot of space. We also have theoretical concerns, since the bound k log n on the length of the longest probe sequence implies strong assumptions on certain properties of the hash function. Shun and Blelloch make extensive experiments including applications from the problem based benchmark suite [38].
Li et al. [16] use the bucket cuckoo-hashing method by Dietzfelbinger and Weidling [8] and develop a concurrent implementation. They use fine-grained per-bucket locks, which can sometimes be avoided using transactional memory, e.g., Intel Transactional Synchronization Extensions (Intel TSX). To further reduce the number of acquired locks per insertion, they use a BFS-based insertion algorithm to find a minimal displacement path to an empty bucket (in contrast to the commonly used random walk). As a result of their work, they implemented the small open source library libcuckoo, which we measure against (which does not use Intel TSX). This approach has the potential to achieve very good space efficiency. However, our measurements indicate that the performance penalty is high.
The practical importance of concurrent hash tables also leads to new and innovative implementations outside of the scientific community. A good example of this is the Junction library [34], which was published by Preshing in the beginning of 2016, shortly after our initial publication [19].
3 PRELIMINARIES
We assume that each application thread has its own designated hardware thread or processing core and denote the number of these threads with p. A data structure is non-blocking if no blocked thread currently accessing this data structure can block an operation on the data structure by another thread. A data structure is lock-free if it is non-blocking and guarantees global progress, i.e., there must always be at least one thread finishing its operation in a finite number of steps.
Hash Tables store a set of ⟨Key, Value⟩ pairs (elements).¹ A hash function h maps each key to a cell of a table (an array). The number of elements in the hash table is denoted n and the number of operations is m. For the purpose of algorithm analysis, we assume that n and m are ≫ p²—this allows us to simplify algorithm complexities by hiding O(p) terms that are independent of n and m in the overall cost. Sequential hash tables support the insertion of elements, and finding, updating, or deleting an element with given key—all of this in constant expected time. Further operations compute n (size), build a table with a given number of initial elements, and iterate over all elements (forall).
¹Much of what is said here can be generalized to the case when Elements are black boxes from which keys are extracted by an accessor function.
Linear Probing is one of the most popular sequential hash table schemes used in practice. An element ⟨x, a⟩ is stored at the first free table entry following position h(x) (wrapping around when the end of the table is reached). Linear probing is at the same time simple and efficient—if the table is not too full, a single cache line access will be enough most of the time. Deletion can be implemented by rearranging the elements locally [14] to avoid holes violating the invariant mentioned above. When the table becomes too full or too empty, the elements can be migrated to a larger or smaller table, respectively. The migration cost can be charged to insertions and deletions causing amortized constant overhead.
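As an illustration of the scheme just described, here is a minimal sketch of a sequential linear probing table; the class and constant names are ours, a reserved empty key is assumed, deletion and migration are omitted, and the table is assumed never to run completely full.

    #include <cstdint>
    #include <functional>
    #include <optional>
    #include <vector>

    class LinearProbingTable {
        static constexpr uint64_t kEmptyKey = 0;   // reserved value, must never be used as a key
        struct Cell { uint64_t key; uint64_t value; };
        std::vector<Cell> table_;

        size_t hash(uint64_t k) const { return std::hash<uint64_t>{}(k) % table_.size(); }

    public:
        explicit LinearProbingTable(size_t capacity)
            : table_(capacity, Cell{kEmptyKey, 0}) {}

        // Store <k,v> in the first free cell at or after h(k), wrapping around at the end.
        bool insert(uint64_t k, uint64_t v) {
            for (size_t i = hash(k);; i = (i + 1) % table_.size()) {
                if (table_[i].key == k) return false;                       // key already present
                if (table_[i].key == kEmptyKey) { table_[i] = Cell{k, v}; return true; }
            }
        }

        // Scan from h(k) until the key or a free cell is found.
        std::optional<uint64_t> find(uint64_t k) const {
            for (size_t i = hash(k);; i = (i + 1) % table_.size()) {
                if (table_[i].key == k) return table_[i].value;
                if (table_[i].key == kEmptyKey) return std::nullopt;
            }
        }
    };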
4 CONCURRENT HASH TABLE INTERFACE AND FOLKLORE IMPLEMENTATION
Although it seems quite clear what a hash table is and how this generalizes to concurrent hash tables, there is a surprising number of details to consider. Therefore, we will quickly go over some of our interface decisions and detail how this interface can be implemented in a simple, fast, lock-free concurrent linear probing hash table.
This hash table will have a bounded capacity c that has to be specified when the table is constructed. It is the basis for all other hash table variants presented in this publication. We call this table folkloreHT, because variations of it are used in many publications and it is not clear to us by whom it was first published.
The most important requirement for concurrent data structures is that they should be linearizable, i.e., it must be possible to order the hash table operations in some sequence—without reordering two operations of the same thread—so that executing them sequentially in that order yields the same results as the concurrent processing. For a hash table data structure, this basically means that all operations should be executed atomically some time between their invocation and their return. For example, a find should never return an inconsistent state, e.g., a half-updated data field that was never actually stored at the corresponding key.
Our variant of the folkloreHT ensures the atomicity of operations using two-word atomic CAS operations for all changes of the table. As long as the key and the value each only use one machine word, we can use two-word CAS operations to atomically manipulate a stored key together with the corresponding value. There are other variants that avoid needing two-word compare and swap operations, but they often need a designated empty value (see [34]). Since the corresponding machine instructions are widely available on modern hardware, using them should not be a problem. If the target architecture does not support the needed instructions, then the implementation could in theory be switched to use a variant of concurrent linear probing without two-word CAS. As it can easily be deduced from the context, we will usually omit the “two-word” prefix and use the abbreviation CAS for both single and double word CAS operations.
ALGORITHM 1: Pseudocode for the insertOrUpdate operation
Input: Key k, Data Element d, Update Function up : Key × Val × Val → Val
    default: atomicUpdate(·) = CAS(current, up(k, current.data, d))
Output: Boolean true when a new key was inserted, false if an update occurred
 1  i = h(k);
 2  while true do
 3      i = i % c;
 4      current = table[i];
 5      if current.key == empty_key then              // Key is not present yet ...
 6          if table[i].CAS(current, ⟨k, d⟩) then
 7              return true;
 8          else
 9              continue;                              // retry at same position
10      else if current.key == k then                  // Same key already present ...
11          if table[i].atomicUpdate(current, d, up) then
12              return false;
13          else
14              continue;                              // retry at same position
15      i++;
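For concreteness, the following C++ sketch mirrors Algorithm 1 under the assumptions stated above: word-sized keys and values packed into one 16-byte cell that is manipulated with std::atomic<Cell>::compare_exchange_strong. Whether this compiles down to a true 128-bit CAS (e.g., cmpxchg16b) depends on the platform and compiler flags, and the names (FolkloreHT, Cell, kEmptyKey) as well as the choice of 0 as the reserved empty key are illustrative assumptions, not taken from the authors' implementation.

    #include <atomic>
    #include <cstdint>
    #include <functional>
    #include <vector>

    struct alignas(16) Cell { uint64_t key; uint64_t data; };   // one 128-bit aligned table cell

    class FolkloreHT {
        static constexpr uint64_t kEmptyKey = 0;   // reserved empty key value
        size_t c_;                                 // fixed (bounded) capacity
        std::vector<std::atomic<Cell>> table_;

        size_t hash(uint64_t k) const { return std::hash<uint64_t>{}(k) % c_; }

    public:
        explicit FolkloreHT(size_t capacity) : c_(capacity), table_(capacity) {
            // std::atomic<Cell>::is_always_lock_free reports whether 128-bit CAS is available.
            for (auto& cell : table_) cell.store(Cell{kEmptyKey, 0}, std::memory_order_relaxed);
        }

        // Returns true if a new key was inserted, false if an existing value was updated.
        template <class UpdateFn>   // UpdateFn: (key, current value, d) -> new value
        bool insertOrUpdate(uint64_t k, uint64_t d, UpdateFn up) {
            size_t i = hash(k);
            while (true) {
                i = i % c_;
                Cell current = table_[i].load(std::memory_order_acquire);
                if (current.key == kEmptyKey) {                              // key not present yet
                    if (table_[i].compare_exchange_strong(current, Cell{k, d})) return true;
                    continue;                                                // CAS failed: retry same cell
                }
                if (current.key == k) {                                      // same key already present
                    Cell updated{k, up(k, current.data, d)};
                    if (table_[i].compare_exchange_strong(current, updated)) return false;
                    continue;                                                // CAS failed: retry same cell
                }
                ++i;                                                         // occupied by another key: probe on
            }
        }
    };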
Initialization. The constructor allocates an array of size c consisting of 128-bit aligned cells whose keys are initialized to the empty value.
Modifications. We categorize all changes to the hash table’s content into one of the following three functions that can be implemented similarly (excluding deletions).
insert(k, d): Returns true if the insert succeeds (false if an element with the specific key is already present). Only one operation should succeed if multiple threads are inserting the same key at the same time. The implementation is similar to Algorithm 1, with lines 11–14 replaced by return false.
update(k, d, up): Returns false, if there is no value stored at the specified key, otherwise this function atomically updates the stored value to new = up(current, d). Notice that the resulting value can be dependent on both the current value and the input parameter d. The implementation is similar to Algorithm 1, with lines 6–9 replaced by return false and line 12 replaced by return true.
insertOrUpdate(k, d, up): This operation updates the current value, if one is present, otherwise the given data element is inserted as the new value. The function returns true, if insertOrUpdate performed an insert (key was not present), and false if an update was executed.
We choose this interface for two main reasons. It allows applications to quickly differentiate between inserting and changing an element—this is especially useful, since the thread that first inserted a key can be identified uniquely. Additionally, it allows transparent, lock-free updates that can be more complex than just replacing the current value (think of CAS or Fetch-and-Add).
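As a usage illustration of this interface, the following sketch (building on the hypothetical FolkloreHT class shown after Algorithm 1, so it is not self-contained) aggregates a count per key, i.e., the GROUP BY pattern from the introduction; the update function receives the key, the current value, and the operand d and returns the new value, here a Fetch-and-Add-style increment.

    #include <cstdint>
    #include <vector>

    // Count how often each key occurs: insert <k,1> for new keys, otherwise add 1 atomically.
    void count_keys(FolkloreHT& counts, const std::vector<uint64_t>& keys) {
        for (uint64_t k : keys) {
            counts.insertOrUpdate(k, 1,
                [](uint64_t /*key*/, uint64_t current, uint64_t d) { return current + d; });
        }
    }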