Concurrent Hash Tables: Fast and General(?)!
TOBIAS MAIER and PETER SANDERS, Karlsruhe Institute of Technology, Germany
ROMAN DEMENTIEV, Intel Deutschland GmbH, Germany
Concurrent hash tables are one of the most important concurrent data structures, which are used in numerous applications. For some applications, it is common that hash table accesses dominate the execution time. To efficiently solve these problems in parallel, we need implementations that achieve speedups in highly concurrent scenarios. Unfortunately, currently available concurrent hashing libraries are far away from this requirement, in particular, when adaptively sized tables are necessary or contention on some elements occurs.
Our starting point for better performing data structures is a fast and simple lock-free concurrent hash table based on linear probing that is, however, limited to word-sized key-value types and does not support dynamic size adaptation. We explain how to lift these limitations in a provably scalable way and demonstrate that dynamic growing has a performance overhead comparable to the same generalization in sequential hash tables.
We perform extensive experiments comparing the performance of our implementations with six of the most widely used concurrent hash tables. Ours are considerably faster than the best algorithms with similar restrictions and an order of magnitude faster than the best more general tables. In some extreme cases, the difference even approaches four orders of magnitude.
All our implementations discussed in this publication can be found on github [17].
CCS Concepts: • Theory of computation → Sorting and searching; Shared memory algorithms; Bloom filters and hashing; Data structures and algorithms for data management; • Information systems → Data structures; • Software and its engineering → Data types and structures; • Computer systems organization → Multicore architectures;
Additional Key Words and Phrases: Concurrency, dynamic data structures, experimental analysis, hash table,
lock-freedom, transactional memory
ACM Reference format:
Tobias Maier, Peter Sanders, and Roman Dementiev. 2019. Concurrent Hash Tables: Fast and General(?)!. ACM
Trans. Parallel Comput. 5, 4, Article 16 (February 2019), 32 pages.
https://doi.org/10.1145/3309206
Authors’ addresses: T. Maier and P. Sanders, Karlsruhe Institute of Technology, Kaiserstraße 12, Karlsruhe, 76131, Germany; emails: {t.maier, sanders}@kit.edu; R. Dementiev, Intel Deutschland GmbH, Am Campeon 10-12, Neubiberg, 85579, Germany; email: roman.dementiev@intel.com.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
2329-4949/2019/02-ART16 $15.00
https://doi.org/10.1145/3309206

1 INTRODUCTION
A hash table is a dynamic data structure that stores a set of elements that are accessible by their key. It supports insertion, deletion, find, and update in constant expected time. In a concurrent hash table, multiple threads have access to the same table. This allows threads to share information in a flexible and efficient way. Therefore, concurrent hash tables are one of the most important concurrent data structures. See Section 4 for a more detailed discussion of concurrent hash table functionality.
To show the ubiquity of hash tables, we give a short list of example applications: A very simple use case is storing sparse sets of precomputed solutions (e.g., References [3, 30]). A more complicated scenario for hash tables is at the heart of many database-like applications (for example, in the case of aggregations SELECT FROM...COUNT...GROUP BY x [26]). Such a query selects rows from one or several relations and counts for every key x how many rows have been found (similar queries work with SUM, MIN, or MAX). Hashing can also be used for a database join [5]. Another group of examples is the exploration of a large combinatorial search space where a hash table is used to remember the already explored elements (e.g., in dynamic programming [39], itemset mining [31], a chess program, or when exploring an implicitly defined graph in model checking [40]). Similarly, a hash table can maintain a set of cached objects to save I/Os [27]. Further examples are duplicate removal, storing the edge set of a sparse graph to support edge queries [24], maintaining the set of nonempty cells in a grid-data structure used in geometry processing (e.g., [7]), or maintaining the children in tree data structures such as van Emde Boas search trees [6] or suffix trees [22].
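To make the aggregation example concrete, the following minimal sequential sketch shows the core of such a COUNT ... GROUP BY query; it uses std::unordered_map purely as a stand-in hash table (the function name and types are illustrative, not part of the paper's code). Each input row triggers exactly one hash table access, which is why such queries are dominated by hash table performance.

    // Illustrative sequential core of SELECT x, COUNT(*) ... GROUP BY x:
    // one hash table access (find-or-insert plus increment) per input row.
    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <vector>

    std::unordered_map<std::string, uint64_t>
    group_by_count(const std::vector<std::string>& rows) {
        std::unordered_map<std::string, uint64_t> counts;
        for (const std::string& x : rows)
            ++counts[x];   // creates the entry with count 0 on first sight, then increments
        return counts;
    }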
Many of these applications have in common that—even in the sequential version of the program—hash table accesses constitute a significant fraction of the running time. Thus, it is essential to have highly scalable concurrent hash tables that actually deliver significant speedups to parallelize these applications. Unfortunately, currently available general purpose concurrent hash tables do not offer the needed scalability (see Section 8 for concrete numbers). However, it seems to be folklore that a lock-free linear probing hash table can be constructed using atomic compare and swap (CAS) of key-value-pairs. Find-operations can even proceed naively and without any write operations. This implementation, which we call folkloreHT [39], has several restrictions that keep it from becoming the standard for parallel hash tables: keys and values have to be machine words to enable the necessary CAS operation (two-word CAS), the table cannot grow over its preallocated size, and it does not support true deletion (in a way that reclaims memory). In Section 4, we explain our own implementation of folkloreHT in detail, after elaborating on some related work and introducing the necessary notation (in Sections 2 and 3, respectively). To see the potential large performance differences, consider a situation with mostly read-only access to the table and heavy contention for a small number of elements that are accessed again and again by all threads. folkloreHT actually profits from this situation, because the contended elements are likely to be replicated into local caches. However, any implementation that needs locks or CAS instructions for find-operations will become much slower than the sequential code on current machines. The purpose of our article is to document and explain performance differences and, more importantly, to explore to what extent we can make folkloreHT more general with an acceptable deterioration in performance. These generalizations are discussed in Section 5. We explain how to grow (and shrink) such a table, and how to support deletions and more general data types. In Section 6, we explain how hardware transactional memory synchronization can be used to speed up insertions and updates and how it may help to handle more general data types.
After describing implementation details in Section 7, Section 8 experimentally compares our hash tables with six of the most widely used concurrent hash tables for microbenchmarks including insertion, finding, and aggregating data. We look at both uniformly distributed and skewed input distributions. Section 9 summarizes the results and discusses possible lines of future research.
2 RELATED WORK
This publication follows up on our previous findings about generalizing fast concurrent hash tables [19, 20]. In addition to describing how to generalize a fast linear probing hash table, we offer an extensive experimental analysis comparing many concurrent hash tables from several libraries.
There has been extensive previous work on concurrent hashing. The widely used textbook “The Art of Multiprocessor Programming” [11] devotes an entire chapter to concurrent hashing and gives an overview over previous work. However, it seems to us that a lot of previous work focuses more on concepts and correctness but surprisingly little on scalability. For example, most of the discussed growing mechanisms in Reference [11] assume that the size of the hash table is known exactly, without a discussion that this introduces a performance bottleneck limiting the speedup to a constant. In practice, the actual migration is often done sequentially.
Stivala et al. [39] describe a bounded concurrent linear probing hash table specialized for dynamic programming that only supports insert and find. Their insert operation starts from scratch when the CAS fails, which seems suboptimal in the presence of contention. An interesting point is that they need only word size CAS instructions at the price of reserving a special empty value. This technique could also be adapted to port our code to machines without 128-bit CAS. Kim and Kim [13] compare this table with a cache-optimized lock-free implementation of hashing with chaining and with hopscotch hashing [12]. The experiments use only uniformly distributed keys, i.e., there is little contention. Both linear probing and hashing with chaining perform well in that case. The evaluation of find-performance is a bit inconclusive: Chaining wins but uses more space than linear probing. Moreover, it is not specified whether this is for successful (search for keys of inserted elements) or mostly unsuccessful (search for not inserted keys) accesses. We suspect that varying these parameters could reverse the result.
Gao et al. [9] present a theoretical dynamic linear probing hash table that is lock-free. The main contribution is a formal correctness proof. Not all details of the algorithm, or even an implementation, are given. There is also no analysis of the complexity of the growing procedure.
Shun and Blelloch [37] propose phase concurrent hash tables, which are allowed to use only a single operation within a globally synchronized phase. They show how phase concurrency helps to implement some operations more efficiently and even deterministically in a linear probing context. For example, deletions can adapt the approach from Reference [14] and rearrange elements. This is not possible in a general hash table, since this might cause find-operations to report false negatives. They also outline an elegant growing mechanism, albeit without implementing it and without filling in all the details like how to initialize newly allocated tables. They propose to trigger a growing operation when any operation has to scan more than k log n elements, where k is a tuning parameter. This approach is tempting, since it is somewhat faster than the approximate size estimator we use. We actually tried that but found that this trigger has a very high variance—sometimes it triggers late, making operations rather slow, sometimes it triggers early, wasting a lot of space. We also have theoretical concerns, since the bound k log n on the length of the longest probe sequence implies strong assumptions on certain properties of the hash function. Shun and Blelloch make extensive experiments including applications from the problem based benchmark suite [38].
Li et al. [16] use the bucket cuckoo-hashing method by Dietzfelbinger and Weidling [8] and develop a concurrent implementation. They use fine-grained per-bucket locks, which can sometimes be avoided using transactional memory, e.g., Intel Transactional Synchronization Extensions (Intel TSX). To further reduce the number of acquired locks per insertion, they use a BFS-based insertion algorithm to find a minimal displacement path to an empty bucket (in contrast to the commonly used random walk). As a result of their work, they implemented the small open source library libcuckoo, which we measure against (which does not use Intel TSX). This approach has the potential to achieve very good space efficiency. However, our measurements indicate that the performance penalty is high.
The practical importance of concurrent hash tables also leads to new and innovative implementations outside of the scientific community. A good example of this is the Junction library [34], which was published by Preshing in the beginning of 2016, shortly after our initial publication [19].
3 PRELIMINARIES
We assume that each application thread has its own designated hardware thread or processing core and denote the number of these threads with p. A data structure is non-blocking if no blocked thread currently accessing this data structure can block an operation on the data structure by another thread. A data structure is lock-free if it is non-blocking and guarantees global progress, i.e., there must always be at least one thread finishing its operation in a finite number of steps.
Hash Tables store a set of ⟨Key, Value⟩ pairs (elements).¹ A hash function h maps each key to a cell of a table (an array). The number of elements in the hash table is denoted n and the number of operations is m. For the purpose of algorithm analysis, we assume that n and m are ≫ p²—this allows us to simplify algorithm complexities by hiding O(p) terms that are independent of n and m in the overall cost. Sequential hash tables support the insertion of elements, and finding, updating, or deleting an element with given key—all of this in constant expected time. Further operations compute n (size), build a table with a given number of initial elements, and iterate over all elements (forall).
¹Much of what is said here can be generalized to the case when Elements are black boxes from which keys are extracted by an accessor function.
Linear Probing is one of the most popular sequential hash table schemes used in practice. An element ⟨x, a⟩ is stored at the first free table entry following position h(x) (wrapping around when the end of the table is reached). Linear probing is at the same time simple and efficient—if the table is not too full, a single cache line access will be enough most of the time. Deletion can be implemented by rearranging the elements locally [14] to avoid holes violating the invariant mentioned above. When the table becomes too full or too empty, the elements can be migrated to a larger or smaller table, respectively. The migration cost can be charged to insertions and deletions causing amortized constant overhead.
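As an illustration of the scheme just described, here is a minimal sketch of a sequential linear probing table; the class and constant names are ours, a reserved empty key is assumed, deletion and migration are omitted, and the table is assumed never to run completely full.

    #include <cstdint>
    #include <functional>
    #include <optional>
    #include <vector>

    class LinearProbingTable {
        static constexpr uint64_t kEmptyKey = 0;   // reserved value, must never be used as a key
        struct Cell { uint64_t key; uint64_t value; };
        std::vector<Cell> table_;

        size_t hash(uint64_t k) const { return std::hash<uint64_t>{}(k) % table_.size(); }

    public:
        explicit LinearProbingTable(size_t capacity)
            : table_(capacity, Cell{kEmptyKey, 0}) {}

        // Store <k,v> in the first free cell at or after h(k), wrapping around at the end.
        bool insert(uint64_t k, uint64_t v) {
            for (size_t i = hash(k);; i = (i + 1) % table_.size()) {
                if (table_[i].key == k) return false;                       // key already present
                if (table_[i].key == kEmptyKey) { table_[i] = Cell{k, v}; return true; }
            }
        }

        // Scan from h(k) until the key or a free cell is found.
        std::optional<uint64_t> find(uint64_t k) const {
            for (size_t i = hash(k);; i = (i + 1) % table_.size()) {
                if (table_[i].key == k) return table_[i].value;
                if (table_[i].key == kEmptyKey) return std::nullopt;
            }
        }
    };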
4 CONCURRENT HASH TABLE INTERFACE AND FOLKLORE IMPLEMENTATION
Although it seems quite clear what a hash table is and how this generalizes to concurrent hash tables, there is a surprising number of details to consider. Therefore, we will quickly go over some of our interface decisions and detail how this interface can be implemented in a simple, fast, lock-free concurrent linear probing hash table.
This hash table will have a bounded capacity c that has to be specified when the table is constructed. It is the basis for all other hash table variants presented in this publication. We call this table folkloreHT, because variations of it are used in many publications and it is not clear to us by whom it was first published.
The most important requirement for concurrent data structures is that they should be linearizable, i.e., it must be possible to order the hash table operations in some sequence—without reordering two operations of the same thread—so that executing them sequentially in that order yields the same results as the concurrent processing. For a hash table data structure, this basically means that all operations should be executed atomically some time between their invocation and their return. For example, a find should never return an inconsistent state, e.g., a half-updated data field that was never actually stored at the corresponding key.
Our variant of the folkloreHT ensures the atomicity of operations using two-word atomic CAS operations for all changes of the table. As long as the key and the value each only use one machine word, we can use two-word CAS operations to atomically manipulate a stored key together with the corresponding value. There are other variants that avoid needing two-word compare and swap operations, but they often need a designated empty value (see [34]). Since the corresponding machine instructions are widely available on modern hardware, using them should not be a problem. If the target architecture does not support the needed instructions, then the implementation could in theory be switched to use a variant of concurrent linear probing without two-word CAS. As it can easily be deduced from the context, we will usually omit the “two-word” prefix and use the abbreviation CAS for both single and double word CAS operations.
ALGORITHM 1: Pseudocode for the insertOrUpdate operation
Input: Key k, Data Element d, Update Function up : Key × Val × Val → Val
    default: atomicUpdate(·) = CAS(current, up(k, current.data, d))
Output: Boolean true when a new key was inserted, false if an update occurred
 1  i = h(k);
 2  while true do
 3      i = i % c;
 4      current = table[i];
 5      if current.key == empty_key then              // Key is not present yet ...
 6          if table[i].CAS(current, ⟨k, d⟩) then
 7              return true;
 8          else
 9              continue;                              // retry at same position
10      else if current.key == k then                  // Same key already present ...
11          if table[i].atomicUpdate(current, d, up) then
12              return false;
13          else
14              continue;                              // retry at same position
15      i++;
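For concreteness, the following C++ sketch mirrors Algorithm 1 under the assumptions stated above: word-sized keys and values packed into one 16-byte cell that is manipulated with std::atomic<Cell>::compare_exchange_strong. Whether this compiles down to a true 128-bit CAS (e.g., cmpxchg16b) depends on the platform and compiler flags, and the names (FolkloreHT, Cell, kEmptyKey) as well as the choice of 0 as the reserved empty key are illustrative assumptions, not taken from the authors' implementation.

    #include <atomic>
    #include <cstdint>
    #include <functional>
    #include <vector>

    struct alignas(16) Cell { uint64_t key; uint64_t data; };   // one 128-bit aligned table cell

    class FolkloreHT {
        static constexpr uint64_t kEmptyKey = 0;   // reserved empty key value
        size_t c_;                                 // fixed (bounded) capacity
        std::vector<std::atomic<Cell>> table_;

        size_t hash(uint64_t k) const { return std::hash<uint64_t>{}(k) % c_; }

    public:
        explicit FolkloreHT(size_t capacity) : c_(capacity), table_(capacity) {
            // std::atomic<Cell>::is_always_lock_free reports whether 128-bit CAS is available.
            for (auto& cell : table_) cell.store(Cell{kEmptyKey, 0}, std::memory_order_relaxed);
        }

        // Returns true if a new key was inserted, false if an existing value was updated.
        template <class UpdateFn>   // UpdateFn: (key, current value, d) -> new value
        bool insertOrUpdate(uint64_t k, uint64_t d, UpdateFn up) {
            size_t i = hash(k);
            while (true) {
                i = i % c_;
                Cell current = table_[i].load(std::memory_order_acquire);
                if (current.key == kEmptyKey) {                              // key not present yet
                    if (table_[i].compare_exchange_strong(current, Cell{k, d})) return true;
                    continue;                                                // CAS failed: retry same cell
                }
                if (current.key == k) {                                      // same key already present
                    Cell updated{k, up(k, current.data, d)};
                    if (table_[i].compare_exchange_strong(current, updated)) return false;
                    continue;                                                // CAS failed: retry same cell
                }
                ++i;                                                         // occupied by another key: probe on
            }
        }
    };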
Initialization. The constructor allocates an array of size c consisting of 128-bit aligned cells whose keys are initialized to the empty value.
Modifications. We categorize all changes to the hash table’s content into one of the following three functions that can be implemented similarly (excluding deletions).
insert(k, d): Returns true if the insert succeeds (false if an element with the specific key is already present). Only one operation should succeed if multiple threads are inserting the same key at the same time. The implementation is similar to Algorithm 1, with lines 11–14 replaced by return false.
update(k, d, up): Returns false, if there is no value stored at the specified key, otherwise this function atomically updates the stored value to new = up(current, d). Notice that the resulting value can be dependent on both the current value and the input parameter d. The implementation is similar to Algorithm 1, with lines 6–9 replaced by return false and line 12 replaced by return true.
insertOrUpdate(k, d, up): This operation updates the current value, if one is present, otherwise the given data element is inserted as the new value. The function returns true, if insertOrUpdate performed an insert (key was not present), and false if an update was executed.
We choose this interface for two main reasons. It allows applications to quickly differentiate between inserting and changing an element—this is especially useful, since the thread that first inserted a key can be identified uniquely. Additionally, it allows transparent, lock-free updates that can be more complex than just replacing the current value (think of CAS or Fetch-and-Add).
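As a usage illustration of this interface, the following sketch (building on the hypothetical FolkloreHT class shown after Algorithm 1, so it is not self-contained) aggregates a count per key, i.e., the GROUP BY pattern from the introduction; the update function receives the key, the current value, and the operand d and returns the new value, here a Fetch-and-Add-style increment.

    #include <cstdint>
    #include <vector>

    // Count how often each key occurs: insert <k,1> for new keys, otherwise add 1 atomically.
    void count_keys(FolkloreHT& counts, const std::vector<uint64_t>& keys) {
        for (uint64_t k : keys) {
            counts.insertOrUpdate(k, 1,
                [](uint64_t /*key*/, uint64_t current, uint64_t d) { return current + d; });
        }
    }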