been implemented in the Apache AsterixDB system and evaluated
on different storage types, including HDD, SSD, and Amazon EBS.
The remainder of the paper is organized as follows: Section 2
provides background information on Apache AsterixDB and the
workflow of the HHJ and Dynamic HHJ operators. Section 3 dis-
cusses previous work related to this study. In Section 4, we discuss
the lower bound on the number of partitions and the suggested
default number to use in practice. Section 5 introduces and evaluates different parti-
tion insertion algorithms. In Section 6, two policies for the growth
of spilled partitions are discussed and evaluated. Section 7 discusses
and evaluates various destaging partition selection policies. In Sec-
tion 8, some optimization techniques in AsterixDB are discussed
before Section 9 concludes the paper.
2 BACKGROUND
2.1 Hybrid Hash Join
Like other hash-based join algorithms, HHJ uses hashing to stage
large inputs to reduce record comparisons during the join. HHJ has
been shown to outperform other join types in computing equijoins
of two datasets. It was designed as a hybrid version of the Grace
Hash Join and Simple Hash Join algorithms [19, 50]. All three men-
tioned hash join algorithms consist of two phases, namely "build"
and "probe". During the build phase, they partition the smaller input,
which we refer to as "build input", into disjoint subsets. Similarly,
the probe phase divides the larger input, which we refer to as "probe
input", into the same number of partitions as the build input. While
all three algorithms share a similar high-level design, they differ in
their details, making each of them suitable for a specific scenario.
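To make this shared design concrete, the following is a minimal sketch (hypothetical names, records reduced to their join keys, not code from any of the systems above) of partitioning both inputs with the same split function, so that a probe record can only match build records in the partition with the same index:

import java.util.*;

/** Sketch of the partitioning step shared by Grace, Simple, and Hybrid Hash Join:
 *  both inputs are split by the same function of the join key, so matching records
 *  always land in build/probe partitions with the same index. Illustrative only. */
public class HashPartitioningSketch {

    // The "split function": maps a join-key value to one of numPartitions partitions.
    static int split(int joinKey, int numPartitions) {
        return Math.floorMod(Integer.hashCode(joinKey), numPartitions);
    }

    static List<List<Integer>> partition(List<Integer> joinKeys, int numPartitions) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) parts.add(new ArrayList<>());
        for (int key : joinKeys) parts.get(split(key, numPartitions)).add(key);
        return parts;
    }

    public static void main(String[] args) {
        int k = 4;
        // Smaller (build) and larger (probe) inputs, reduced to their join keys.
        List<List<Integer>> build = partition(List.of(3, 17, 42, 8, 25), k);
        List<List<Integer>> probe = partition(List.of(42, 9, 17, 100, 8, 64, 12), k);
        for (int i = 0; i < k; i++) {
            System.out.println("pair " + i + ": build=" + build.get(i) + " probe=" + probe.get(i));
        }
    }
}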
Grace Hash Join partitions the build and probe inputs consec-
utively, writing each partition back to disk in a separate file. This
partitioning process continues for each partition until it fits into
memory. A hash table is created to process the join once a parti-
tion is small enough to fit in memory. Grace Hash Join performs
best when the smaller dataset is significantly larger than the main
memory.
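This recursive control flow can be sketched as follows; the memory capacity and fanout are toy values invented for illustration (a real system measures partition sizes in bytes or frames), and in-memory lists stand in for the per-partition files on disk:

import java.util.*;

/** Sketch of the Grace Hash Join control flow: partition both inputs with the same
 *  split function, recurse on each (build, probe) partition pair until the build
 *  side fits in memory, then build a hash table over it and probe. Illustrative only. */
public class GraceHashJoinSketch {

    static final int MEMORY_CAPACITY = 4; // build records that "fit in memory" (toy value)
    static final int FANOUT = 8;          // partitions produced per partitioning pass (toy value)

    // A mixing hash; each recursion level reads a different group of bits,
    // so a partition that is repartitioned actually splits further.
    static long mix(long x) {
        x ^= x >>> 33; x *= 0xff51afd7ed558ccdL;
        x ^= x >>> 33; x *= 0xc4ceb9fe1a85ec53L;
        x ^= x >>> 33;
        return x;
    }

    static int split(int joinKey, int level) {
        return (int) ((mix(joinKey) >>> (3 * level)) & (FANOUT - 1));
    }

    static void join(List<Integer> buildKeys, List<Integer> probeKeys, int level) {
        if (buildKeys.size() <= MEMORY_CAPACITY) {
            // The partition is small enough: build an in-memory hash table and probe it.
            Set<Integer> hashTable = new HashSet<>(buildKeys);
            for (int key : probeKeys) {
                if (hashTable.contains(key)) System.out.println("match: " + key);
            }
            return;
        }
        // Otherwise partition both sides consecutively and recurse on each pair.
        List<List<Integer>> build = partition(buildKeys, level);
        List<List<Integer>> probe = partition(probeKeys, level);
        for (int i = 0; i < FANOUT; i++) join(build.get(i), probe.get(i), level + 1);
    }

    static List<List<Integer>> partition(List<Integer> joinKeys, int level) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < FANOUT; i++) parts.add(new ArrayList<>());
        for (int key : joinKeys) parts.get(split(key, level)).add(key);
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> build = new ArrayList<>(), probe = new ArrayList<>();
        for (int i = 0; i < 100; i++) { build.add(i); probe.add(i * 2); }
        join(build, probe, 0);
    }
}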
In Simple Hash Join, records are hashed into two partitions: a
memory-resident and a disk (spilled) partition. A portion of memory
is used for a hash table to hold the memory-resident partition’s
records. Simple Hash Join performs well when memory is large
enough to hold most of the smaller dataset. In Grace Hash Join, the
idea is to use memory to divide a large amount of data into smaller
partitions that fit into memory, while Simple Hash Join focuses on
the idea of keeping some portion of data in memory to reduce the
total amount of I/O, considering that a large amount of memory
is available. Next, we discuss the details of the HHJ operator and
compare its design with its parent algorithms.
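Before turning to HHJ, the Simple Hash Join idea just described can be sketched as follows; the hash-based memory-residence test and the in-memory list standing in for the spilled file are illustrative simplifications, not an actual implementation:

import java.util.*;

/** Sketch of Simple Hash Join: records are hashed into just two partitions, one kept
 *  in memory behind a hash table and one spilled to disk. The spilled build/probe
 *  pair would be processed again by the same procedure afterwards. Illustrative only. */
public class SimpleHashJoinSketch {

    // Hash-based test deciding whether a record belongs to the memory-resident
    // partition; a real system would size this partition to the available memory.
    static boolean staysInMemory(int joinKey) {
        return Math.floorMod(Integer.hashCode(joinKey), 4) != 0; // ~3/4 of the key space
    }

    public static void main(String[] args) {
        List<Integer> buildKeys = List.of(2, 5, 7, 8, 11, 12);
        List<Integer> probeKeys = List.of(5, 6, 12, 20);

        Set<Integer> inMemoryTable = new HashSet<>(); // hash table over the memory-resident partition
        List<Integer> spilledBuild = new ArrayList<>();
        for (int key : buildKeys) {
            if (staysInMemory(key)) inMemoryTable.add(key);
            else spilledBuild.add(key);
        }

        List<Integer> spilledProbe = new ArrayList<>();
        for (int key : probeKeys) {
            if (staysInMemory(key)) {
                if (inMemoryTable.contains(key)) System.out.println("match: " + key);
            } else {
                spilledProbe.add(key); // joined against spilledBuild in a later pass
            }
        }
    }
}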
Like Grace Hash Join, HHJ uses hash partitioning to group each
input’s records into "join-able" partitions to avoid unnecessary
record comparisons. Like Simple Hash Join, HHJ uses a portion of
memory to keep one of the partitions and its hash table in memory,
while the rest are written to disk. Keeping data in memory reduces the
total amount of I/O, and utilizing a hash table lowers the number
of record comparisons. The overall workflow of Hybrid Hash Join is shown
in Figure 1.
As mentioned earlier, the HHJ operator consists of two consecu-
tive phases of build and probe. During the build phase, the records
of the smaller input are scanned and hash-partitioned based on
the values of the join attributes. We call the hash function used for
partitioning a "split function." The records mapped to the memory-
resident partition remain in memory, while the rest of the partitions
are written (frame by frame) to disk. Pointers to the records of the
memory-resident partition are inserted into a hash table at the end
of the build phase.
After the build phase ends, the probe phase starts by scanning
and hash-partitioning the records of the larger input. The same
split function used during the build phase is used for this step. The
records that map to the memory-resident partition are hashed using
the same hash function used in the build phase to probe the hash
table. All other records belong to spilled partitions and are written
(frame by frame) to the corresponding partition’s probe file on disk.
After all records of the probe input have been processed, the
pairs of spilled partitions from the build phase and probe phase are
processed as inputs to the next rounds of HHJ.
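Putting the two phases together, the following is a minimal single-round sketch of the workflow just described (hypothetical names; in-memory lists stand in for the on-disk build and probe files that would be written frame by frame, and records are reduced to their join keys):

import java.util.*;

/** Minimal sketch of one round of Hybrid Hash Join: partition 0 is memory-resident
 *  and gets a hash table at the end of the build phase; partitions 1..K-1 are
 *  "spilled", and their build/probe pairs feed the next rounds. Illustrative only. */
public class HybridHashJoinSketch {

    static final int K = 4; // number of partitions chosen for this round (toy value)

    // The split function: maps a join-key value to a partition id.
    static int split(int joinKey) {
        return Math.floorMod(Integer.hashCode(joinKey), K);
    }

    public static void main(String[] args) {
        List<Integer> buildKeys = List.of(1, 5, 9, 12, 18, 21);
        List<Integer> probeKeys = List.of(5, 7, 12, 40, 21);

        // ----- Build phase -----
        List<Integer> memoryResidentBuild = new ArrayList<>();      // partition 0 stays in memory
        Map<Integer, List<Integer>> spilledBuild = new HashMap<>(); // partitions 1..K-1 on "disk"
        for (int key : buildKeys) {
            int p = split(key);
            if (p == 0) memoryResidentBuild.add(key);
            else spilledBuild.computeIfAbsent(p, x -> new ArrayList<>()).add(key);
        }
        // Hash table over the memory-resident partition, built at the end of the build phase.
        Set<Integer> hashTable = new HashSet<>(memoryResidentBuild);

        // ----- Probe phase -----
        Map<Integer, List<Integer>> spilledProbe = new HashMap<>();
        for (int key : probeKeys) {
            int p = split(key); // same split function as in the build phase
            if (p == 0) {
                if (hashTable.contains(key)) System.out.println("match in memory: " + key);
            } else {
                spilledProbe.computeIfAbsent(p, x -> new ArrayList<>()).add(key);
            }
        }

        // ----- Next rounds -----
        // Each spilled (build, probe) pair becomes the input of another HHJ round.
        for (Map.Entry<Integer, List<Integer>> e : spilledBuild.entrySet()) {
            List<Integer> nextProbe = spilledProbe.getOrDefault(e.getKey(), List.of());
            System.out.println("next round for partition " + e.getKey()
                    + ": build=" + e.getValue() + " probe=" + nextProbe);
        }
    }
}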
2.2 Apache AsterixDB
Apache AsterixDB [3, 8, 37] is an open-source, parallel, shared-
nothing big data management system (BDMS) built to support
the storage, indexing, modification, analysis, and querying of large
volumes of semi-structured data.
The unit of data that is transferred within AsterixDB, as well
as between AsterixDB and disk, is called a "frame". A frame is a
fixed-size, configurable set of contiguous bytes. AsterixDB uses
Dynamic HHJ, whose design and optimization are the main topics
of this paper. AsterixDB supports different join algorithms such
as Block Nested Loop Join, Dynamic HHJ, Broadcast Join, and
Indexed Nested Loop Join. However, Dynamic HHJ is the default
and primary join type in AsterixDB for processing equi-joins due
to its superior performance.
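Returning to the notion of a frame, it can be roughly modeled as a fixed-size block of bytes that records are appended to and that is handed off, or written to disk, as a unit once it is full. The sketch below is only such a model; it does not reflect the actual AsterixDB frame layout or API:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Rough model of a frame: a fixed-size block of contiguous bytes that serves as the
 *  unit of data transfer and is flushed (shipped or written to disk) once it is full.
 *  Illustrative only; AsterixDB's real frame structure differs. */
public class FrameSketch {

    static final int FRAME_SIZE = 128; // fixed, configurable size (toy value)

    private final ByteBuffer buffer = ByteBuffer.allocate(FRAME_SIZE);

    /** Append a length-prefixed record; if it does not fit, flush the frame first. */
    void append(byte[] record) {
        if (Integer.BYTES + record.length > buffer.remaining()) {
            flush();
        }
        buffer.putInt(record.length).put(record);
    }

    void flush() {
        // Here the frame's bytes would be passed to another operator or written to disk.
        System.out.println("flushing frame with " + buffer.position() + " bytes");
        buffer.clear();
    }

    public static void main(String[] args) {
        FrameSketch frame = new FrameSketch();
        for (int i = 0; i < 20; i++) {
            frame.append(("record-" + i).getBytes(StandardCharsets.UTF_8));
        }
        frame.flush(); // flush the last, partially filled frame
    }
}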
AsterixDB currently does not support statistics, so users may
provide hints to guide AsterixDB at execution time by selecting
an alternative type of join operator or by providing dataset size
information. For example, a user may use the Indexed Nested Loop
Join hint to request this join algorithm instead of a Dynamic HHJ.
AsterixDB follows this hint whenever possible; otherwise, it utilizes
Dynamic HHJ (by default). In addition, a hint to use a Broadcast Join
might be advantageous when the build dataset is small enough to
be sent to all nodes instead of using hash partitioning. The current
release of AsterixDB follows the join order in a query’s FROM
clause for determining the build and probe inputs. The first input
in the FROM clause will serve as the probe relation; the rest will be
build inputs.
We chose Apache AsterixDB as our primary platform for im-
plementing and evaluating our proposed techniques for several
reasons. First, it is an open-source platform that allows us to share
our techniques and their evaluations with the community. More
importantly, AsterixDB is a parallel big data management system
for managing and processing large amounts of semi-structured
data with a declarative language. Finally, its similarity in structure
and design to other NoSQL and NewSQL database systems and
query engines makes our results and techniques applicable to other
systems as well.