Rethinking SIMD Vectorization for In-Memory Databases

Orestis Polychroniou∗, Columbia University, orestis@cs.columbia.edu
Arun Raghavan, Oracle Labs, arun.raghavan@oracle.com
Kenneth A. Ross†, Columbia University, kar@cs.columbia.edu
ABSTRACT
Analytical databases are continuously adapting to the un-
derlying hardware in order to saturate all sources of par-
allelism. At the same time, hardware evolves in multiple
directions to explore different trade-offs. The MIC architec-
ture, one such example, strays from the mainstream CPU
design by packing a larger number of simpler cores per chip,
relying on SIMD instructions to fill the performance gap.
Databases have been attempting to utilize the SIMD ca-
pabilities of CPUs. However, mainstream CPUs have only
recently adopted wider SIMD registers and more advanced
instructions, since they do not rely primarily on SIMD for
efficiency. In this paper, we present novel vectorized designs
and implementations of database operators, based on ad-
vanced SIMD operations, such as gathers and scatters. We
study selections, hash tables, and partitioning; and com-
bine them to build sorting and joins. Our evaluation on the
MIC-based Xeon Phi co-processor as well as the latest main-
stream CPUs shows that our vectorization designs are up to
an order of magnitude faster than the state-of-the-art scalar
and vector approaches. Also, we highlight the impact of ef-
ficient vectorization on the algorithmic design of in-memory
database operators, as well as the architectural design and
power efficiency of hardware, by making simple cores com-
parably fast to complex cores. This work is applicable to
CPUs and co-processors with advanced SIMD capabilities,
using either many simple cores or fewer complex cores.
1. INTRODUCTION
Real time analytics are the steering wheels of big data
driven business intelligence. Database customer needs have
extended beyond OLTP with high ACID transaction through-
put, to interactive OLAP query execution across the entire
database. As a consequence, vendors offer fast OLAP solu-
tions, either by rebuilding a new DBMS for OLAP [28, 37],
or by improving within the existing OLTP-focused DBMS.
∗ Work partly done while the first author was at Oracle Labs.
† Supported by NSF grant IIS-1422488 and an Oracle gift.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full cita-
tion on the first page. Copyrights for components of this work owned by others than
ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re-
publish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
SIGMOD ’15, May 31 - June 04, 2015, Melbourne, VIC, Australia.
Copyright 2015 ACM 978-1-4503-2758-9/15/05 ...$15.00.
http://dx.doi.org/10.1145/2723372.2747645.
The advent of large main-memory capacity is one of the
reasons that blink-of-an-eye analytical query execution has
become possible. Query optimization used to measure blocks
fetched from disk as the primary unit of query cost. Today,
the entire database can often remain main-memory resident
and the need for efficient in-memory algorithms is apparent.
The prevalent shift in database design for the new era
is toward column stores [19, 28, 37]. They allow for higher data
compression in order to reduce the data footprint, minimize
the number of columns accessed per tuple, and use column
oriented execution coupled with late materialization [9] to
eliminate unnecessary accesses to RAM resident columns.
Hardware provides performance through three sources of
parallelism: thread parallelism, instruction level parallelism,
and data parallelism. Analytical databases have evolved to
take advantage of all sources of parallelism. Thread par-
allelism is achieved, for individual operators, by splitting
the input equally among threads [3, 4, 5, 8, 14, 31, 40],
and in the case of queries that combine multiple operators,
by using the pipeline breaking points of the query plan to
split the materialized data in chunks that are distributed to
threads dynamically [18, 28]. Instruction level parallelism is
achieved by applying the same operation to a block of tu-
ples [6] and by compiling into tight machine code [16, 22].
Data parallelism is achieved by implementing each operator
to use SIMD instructions effectively [7, 15, 26, 30, 39, 41].
The different sources of parallelism were developed as a
means to deliver more performance within the same power
budget available per chip. Mainstream CPUs have evolved
on all sources of parallelism, featuring massively superscalar
pipelines, out-of-order execution of tens of instructions, and
advanced SIMD capabilities, all replicated on multiple cores
per CPU chip. For example, the latest Intel Haswell ar-
chitecture issues up to 8 micro-operations per cycle with
192 reorder buffer entries for out-of-order execution, 256-bit
SIMD instructions, two levels of private caches per core and
a large shared cache, and scales up to 18 cores per chip.
Concurrently with the evolution of mainstream CPUs, a
new approach to processor design has surfaced. The design,
named the many-integrated-cores (MIC) architecture, uses
cores with a smaller area (transistors) and power footprint
by removing the massively superscalar pipeline, out-of-order
execution, and the large L3 cache. Each core is based on a
Pentium 1 processor with a simple in-order pipeline, but
is augmented with large SIMD registers, advanced SIMD
instructions, and simultaneous multithreading to hide load
and instruction latency. Since each core has a smaller area
and power footprint, more cores can be packed in the chip.
The MIC design was originally intended as a GPU [33],
but now targets high performance computing applications.
Using a high FLOPS machine to execute compute-intensive
algorithms with superlinear complexity is self-evident. Ex-
ecuting analytical queries in memory, however, consists of
data-intensive linear algorithms that mostly “move” rather
than “process” data. Previous work to add SIMD in databases
has optimized sequential access operators such as index [41]
or linear scans [39], built multi-way trees with nodes that
match the SIMD register layout [15, 26], and optimized spe-
cific operators, such as sorting [7, 11, 26], by using ad-hoc
vectorization techniques, useful only for a specific problem.
In this paper, we present good design principles for SIMD
vectorization of main-memory database operators, without
modifying the logic or the data layout of the baseline scalar
algorithm. The baseline algorithm is defined here as the
most straightforward scalar implementation. Formally, as-
sume an algorithm that solves a problem with optimal com-
plexity, its simplest scalar implementation, and a vectorized
implementation. We say that the algorithm can be fully vec-
torized, if the vector implementation executes O(f(n)/W)
vector instructions instead of O(f(n)) scalar instructions
where W is the vector length, excluding random memory
accesses that are by definition not data-parallel. We define
fundamental vector operations that are frequently reused
in the vectorizations and are implemented using advanced
SIMD instructions, such as non-contiguous loads (gathers)
and stores (scatters). The fundamental operations that are
not directly supported by specific instructions can be imple-
mented using simpler instructions at a performance penalty.
We implement vectorized operators in the context of main-
memory databases: selection scans, hash tables, and parti-
tioning, which are combined to build more advanced opera-
tors: sorting and joins. These operators cover a large por-
tion of the time needed to execute analytical queries in main
memory. For selection scans, we show branchless tuple selec-
tion and in-cache buffering. For hash tables, we study both
building and probing across multiple hashing schemes.
For partitioning, we describe histogram generation, includ-
ing all partitioning function types: radix, hash, and range.
We also describe data shuffling, including inputs larger than
the cache. All of the above are combined to build radixsort
and multiple hash join variants that highlight the impact of
vectorization on determining the best algorithmic design.
We compare our vectorized implementations of in-memory
database operators against the respective state-of-the-art
scalar and vector techniques, by evaluating on the Intel Xeon
Phi co-processor and on the latest mainstream CPUs. Xeon
Phi is currently the only available hardware based on the
MIC architecture and is also the only generally available
hardware that supports gathers and scatters, while the lat-
est mainstream CPUs (Intel Haswell) support gathers.
We use the sorting and join operators to compare a Xeon
Phi 7120P co-processor (61 P54C cores at 1.238 GHz, 300
Watts TDP) against four high-end Sandy Bridge CPUs (4×8
Sandy Bridge cores at 2.2 GHz, 4 × 130 Watts TDP), and
found that they have similar performance, but on a different
power budget, since Xeon Phi spends almost half the energy.
The next generation of Xeon Phi will also be available as
a standalone CPU,¹ even if the current generation is only
available as a co-processor, and will support more advanced
SIMD instructions (AVX 3), also supported by the next gen-
eration of mainstream CPUs.² Our work does not focus on
evaluating Xeon Phi as a co-processing accelerator, such as
GPUs, that would also be bound by the PCI-e bandwidth,
but as an alternative CPU design that is suitable and more
power efficient for executing analytical database workloads.
¹ newsroom.intel.com/community/intel_newsroom/blog/2014/06/23/intel-re-architects-the-fundamental-building-block-for-high-performance-computing
We summarize our contributions:
• We introduce design principles for efficient vectoriza-
tion of in-memory database operators and define fun-
damental vector operations that are frequently reused.
• We design and implement vectorized selection scans,
hash tables, and partitioning, that are combined to
design and build sorting and multiple join variants.
• We compare our implementations against state-of-the-
art scalar and vectorized techniques. We achieve up
to an order of magnitude speedups by evaluating on
Xeon Phi as well as on the latest mainstream CPUs.
• We show the impact of vectorization on the algorith-
mic design of in-memory operators, as well as the ar-
chitectural design and power efficiency of hardware, by
making simple cores comparably fast to complex cores.
The rest of the paper is organized as follows. Section 2
presents related work. Sections 4, 5, 6, and 7 discuss the
vectorization of selection scans, hash tables, Bloom filters,
and partitioning. Sections 8 and 9 discuss algorithmic de-
signs for sorting and hash join. We present our experimental
evaluation in Section 10, we discuss how SIMD vectorization
relates to GPUs in Section 11, and conclude in Section 12.
Implementation details are provided in the Appendix.
2. RELATED WORK
Previous work that added SIMD instructions in database
operators is briefly summarized. Zhou et al. used SIMD for
linear scans, index scans, and nested loop joins [41]. Ross
proposed probing multiple keys per bucket for cuckoo hash-
ing [30]. Willhalm et al. optimized decompression of bit
packed data [39]. Inoue et al. proposed data-parallel comb-
sort and merging [11]. Chhugani et al. optimized bitonic
merging for mergesort [7]. Kim et al. designed multi-way
trees tailored to the SIMD layout [15]. Polychroniou et
al. discussed trade-offs for updating heavy hitter aggregates
[25], fast range partitioning via a range function index [26],
and Bloom filter probing using gathers [27]. Schlegel et al.
described scalable frequent itemset mining on Xeon Phi [32].
Database operators have been extensively optimized for
modern processors. Manegold et al. introduced hash joins
with cache-conscious partitioning [19], which Kim et al. ex-
tended to multi-core CPUs and compared against SIMD
sort-merge joins [14]. Cieslewicz et al. and Ye et al. studied
contention-free aggregation [8, 40]. Blanas et al. and Balke-
sen et al. evaluated hardware-tuned hash joins [3, 5]. Albu-
tiu et al. introduced NUMA-aware joins and Balkesen et al.
evaluated join variants on multiple CPUs [1, 4]. Satish et al.
and Wassenberg et al. introduced buffered partitioning for
radixsort [31, 38]. Polychroniou et al. introduced in-place
and range partitioning for sorting variants [26]. Jha et al.
optimized hash joins on Xeon Phi, but used SIMD partially
only to compute hash functions and load hash buckets [12].
² software.intel.com/blogs/2013/avx-512-instructions
Compilers partially transform scalar to SIMD code based
on loop unrolling [17] and control flow predication [34]. Our
designs are far more complicated and not strictly equivalent.
Database operators can be compiled directly [22], thus man-
ual vectorization is also desirable to maximize performance.
GPUs have also been used in databases for generic queries
[2] and to accelerate operators, such as indexes [15], sorting
[31], joins [13, 24], and selections [36]. SIMT GPU threads
are organized in warps that execute the same scalar instruc-
tion per cycle by predicating all control flow. In SIMD code,
control flow is converted to data flow in software. We discuss
the similarities between SIMD and SIMT code and to what
extent our work is applicable to SIMT GPUs in Section 11.
3. FUNDAMENTAL OPERATIONS
In this section we define the fundamental vector opera-
tions that we will need to implement vectorized database
operators. The first two operations, termed selective load
and selective store, are spatially contiguous memory accesses
that load or store values using a subset of vector lanes. The
last two operations are spatially non-contiguous memory
loads and stores, termed gathers and scatters respectively.
[diagram omitted]
Figure 1: Selective store operation
Selective stores write a specific subset of the vector lanes
to a memory location contiguously. The subset of vector
lanes to be written is decided using a vector or scalar register
as the mask, which must not be limited to a constant.
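The paper implements this operation with the Xeon Phi instruction set described in Appendix C (not reproduced in this excerpt). As a rough illustration only, on an AVX-512 CPU the same operation maps onto a masked compress-store; the sketch below assumes 16 lanes of 32-bit values and is not the paper's code.

#include <immintrin.h>
#include <cstdint>

// Selective store sketch (AVX-512, 16 x 32-bit lanes): write only the
// lanes whose mask bit is set, contiguously, and advance the output
// cursor by the number of active lanes.
int32_t* selective_store(int32_t* out, __m512i values, __mmask16 mask) {
    _mm512_mask_compressstoreu_epi32(out, mask, values);
    return out + _mm_popcnt_u32(mask);
}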
[diagram omitted]
Figure 2: Selective load operation
Selective loads are the symmetric operation that involves
loading from a memory location contiguously to a subset of
vector lanes based on a mask. The lanes that are inactive
in the mask retain their previous values in the vector.
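Again assuming an AVX-512 target rather than the paper's Xeon Phi instructions, a selective load corresponds to a masked expand-load; a minimal sketch:

#include <immintrin.h>
#include <cstdint>

// Selective load sketch (AVX-512): fill the active lanes of `values`
// with consecutive elements from `in`; inactive lanes keep their old
// contents. Returns the advanced input cursor.
const int32_t* selective_load(const int32_t* in, __m512i& values,
                              __mmask16 mask) {
    values = _mm512_mask_expandloadu_epi32(values, mask, in);
    return in + _mm_popcnt_u32(mask);
}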
[diagram omitted]
Figure 3: Gather operation
Gather operations load values from non-contiguous loca-
tions. The inputs are a vector of indexes and an array
pointer. The output is a vector with the values of the respec-
tive array cells. By adding a mask as an input operand, we
define the selective gather that operates on a subset of lanes.
The inactive mask lanes retain their previous contents.
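On AVX-512 hardware both the plain and the selective (masked) gather are single intrinsics; a minimal sketch for 32-bit values follows (AVX2 offers a narrower 8-lane equivalent, _mm256_i32gather_epi32):

#include <immintrin.h>
#include <cstdint>

// Gather sketch: values[i] = array[index[i]] for all 16 lanes.
__m512i gather(const int32_t* array, __m512i index) {
    return _mm512_i32gather_epi32(index, array, 4);   // scale = 4 bytes
}

// Selective gather: only active lanes load; the rest keep `src`.
__m512i selective_gather(__m512i src, __mmask16 mask,
                         const int32_t* array, __m512i index) {
    return _mm512_mask_i32gather_epi32(src, mask, index, array, 4);
}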
[diagram omitted]
Figure 4: Scatter operation
Scatter operations execute stores to multiple locations.
The input is a vector of indexes, an array pointer, and a
vector of values. If multiple vector lanes point to the same
location, we assume that the rightmost value will be written.
By adding a mask as an input we can store lanes selectively.
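A corresponding AVX-512 sketch; the architecture writes conflicting lanes in lane order, so the highest (rightmost) lane wins, matching the semantics assumed above:

#include <immintrin.h>
#include <cstdint>

// Scatter sketch: array[index[i]] = values[i] for all 16 lanes.
void scatter(int32_t* array, __m512i index, __m512i values) {
    _mm512_i32scatter_epi32(array, index, values, 4);  // scale = 4 bytes
}

// Selective scatter: store only the lanes whose mask bit is set.
void selective_scatter(int32_t* array, __mmask16 mask,
                       __m512i index, __m512i values) {
    _mm512_mask_i32scatter_epi32(array, mask, index, values, 4);
}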
Gathers and scatters are not really executed in parallel
because the (L1) cache allows one or two distinct accesses
per cycle. Executing W cache accesses per cycle is an im-
practical hardware design. Thus, random memory accesses
have to be excluded from the O(f(n)/W) vectorization rule.
Gathers are supported on the latest mainstream CPUs
(Haswell) but scatters are not. Older mainstream CPUs
(e.g., Sandy Bridge) support neither. Emulating gathers is
possible at a performance penalty, which is small if done
carefully. We discuss more hardware details in Appendix B.
Selective loads and stores are also not supported on the
latest mainstream CPUs, but can be emulated using vector
permutations. The lane selection mask is extracted as a
bitmask and is used as an array index to load a permutation
mask from a pre-generated table. The data vector is then
permuted in a way that splits the active lanes of the mask
to the one side of the register and the inactive lanes to the
other side. In case of a selective store we can store the vector
(unaligned) and in case of a selective load, we load a new
vector (unaligned) and blend the two vectors to replace the
inactive lanes. This technique was first used in vectorized
Bloom filters [27] on CPUs, without defining the operations.
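A sketch of that emulation for a selective store on AVX2 (8 x 32-bit lanes), with a hypothetical pre-generated permutation table; the mask vector is assumed to come from a vector comparison, so each lane is all-ones or all-zeros:

#include <immintrin.h>
#include <cstdint>

// For each possible 8-bit lane mask, the permutation that moves the
// active lanes to the front of the register (inactive lanes follow).
static int32_t perm_table[256][8];

void init_perm_table() {
    for (int m = 0; m < 256; m++) {
        int j = 0;
        for (int i = 0; i < 8; i++)
            if (m & (1 << i)) perm_table[m][j++] = i;    // active lanes first
        for (int i = 0; i < 8; i++)
            if (!(m & (1 << i))) perm_table[m][j++] = i; // inactive lanes last
    }
}

// Emulated selective store: extract the bitmask, permute the active
// lanes to one side, then issue one unaligned store. All 8 lanes are
// written, but the cursor advances only past the active ones, so the
// trailing garbage is overwritten by the next store (the output buffer
// needs one vector of slack at its end).
int32_t* selective_store_avx2(int32_t* out, __m256i values, __m256i mask) {
    int m = _mm256_movemask_ps(_mm256_castsi256_ps(mask));
    __m256i perm = _mm256_loadu_si256((const __m256i*)perm_table[m]);
    __m256i packed = _mm256_permutevar8x32_epi32(values, perm);
    _mm256_storeu_si256((__m256i*)out, packed);
    return out + _mm_popcnt_u32(m);
}

The selective load is emulated symmetrically: load a new unaligned vector, permute, and blend it with the old vector to fill only the inactive lanes.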
We describe the Xeon Phi instructions in Appendix C.
4. SELECTION SCANS
Selection scans have re-emerged for main-memory query
execution and are replacing traditional unclustered indexes
in modern OLAP DBMSs [28]. Advanced optimizations in-
clude lightweight bit compression [39] to reduce the RAM
bandwidth, generation of statistics to skip data regions [28],
and scanning of bitmaps-zonemaps to skip cache lines [35].
Linear selective scan performance has been associated with
branch mispredictions, if the operator is implemented as
shown in Algorithm 1. Previous work has shown that con-
verting control flow to data flow can affect performance,
making different approaches optimal per selectivity rate [29].
Branches can be eliminated as shown in Algorithm 2 to avoid
misprediction penalties, at the expense of accessing all pay-
load columns and eagerly evaluating all selective predicates.
Algorithm 1 Selection Scan (Scalar - Branching)
j ← 0                                            ▷ output index
for i ← 0 to |T_keys_in| − 1 do
    k ← T_keys_in[i]                             ▷ access key columns
    if (k ≥ k_lower) && (k ≤ k_upper) then       ▷ short circuit and
        T_payloads_out[j] ← T_payloads_in[i]     ▷ copy all columns
        T_keys_out[j] ← k
        j ← j + 1
    end if
end for
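Algorithm 2 itself is not included in this excerpt; the following C++ sketch captures the branchless design the text describes, with the predicate result added to the output cursor so that no branch depends on the data (column and bound names are illustrative):

#include <cstdint>
#include <cstddef>

// Branchless selection scan sketch: every tuple is copied to the output
// unconditionally, and the predicate result (0 or 1) advances the output
// cursor. No branch depends on the data, so mispredictions are avoided,
// at the cost of eagerly evaluating both predicates and touching all
// payload columns.
size_t select_branchless(const int32_t* keys_in, const int32_t* payloads_in,
                         int32_t* keys_out, int32_t* payloads_out,
                         size_t n, int32_t k_lower, int32_t k_upper) {
    size_t j = 0;                                  // output index
    for (size_t i = 0; i < n; i++) {
        int32_t k = keys_in[i];
        keys_out[j] = k;                           // copy unconditionally
        payloads_out[j] = payloads_in[i];
        j += (k >= k_lower) & (k <= k_upper);      // 0 or 1, no branch
    }
    return j;
}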