in Hadoop, which determines which join plan should be used to achieve better performance.
2. CLASSIC JOIN IN HADOOP
In this section, we briefly introduce the implementation of a typical join approach, which is named “default join” for the rest of the paper, and then move on to “map join”.
2.1 Default/reduce Join
Default join is a 2-way join widely used in MapReduce that complies with the MapReduce spirit fairly well. This strategy is the most intuitive one and is the default strategy for tools such as Pig and Hive. It is called “reduce join” later in the paper because the join is performed in the reduce phase, in contrast with map join.
Given two tables, default join first reads different parts of the two tables into different Mappers, where for each record the join attribute is extracted as the key of an intermediate record, with a file tag (to indicate which table the record comes from) and other necessary attributes as the value. The intermediate results are shuffled and then sent to Reducers, so that each reduce call receives only the tuples sharing one join key. These records are grouped by tag, so that records from the two tables can be told apart, and the Cartesian product of the two groups forms the final result.
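As an illustration, the following is a minimal sketch of such a reduce-side join written against the Hadoop MapReduce Java API. The class names, the assumption that the join key is the first tab-separated field of each line, and the convention that input file names start with “R” or “S” are ours, introduced only for this example.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Mapper: tag each record with the table it came from and emit
// (join key, tagged record) as the intermediate result.
public class ReduceJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String tag;   // "R" or "S", derived from the input file name (assumed convention)

    @Override
    protected void setup(Context context) {
        String file = ((FileSplit) context.getInputSplit()).getPath().getName();
        tag = file.startsWith("R") ? "R" : "S";
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // assumed layout: join key is the first tab-separated field
        String[] fields = line.toString().split("\t", 2);
        context.write(new Text(fields[0]), new Text(tag + "\t" + fields[1]));
    }
}

// Reducer: for one join key, separate the records by tag and
// output the Cartesian product of the two groups.
class ReduceJoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> fromR = new ArrayList<>();
        List<String> fromS = new ArrayList<>();
        for (Text v : values) {
            String[] parts = v.toString().split("\t", 2);
            (parts[0].equals("R") ? fromR : fromS).add(parts[1]);
        }
        for (String r : fromR) {
            for (String s : fromS) {
                context.write(key, new Text(r + "\t" + s));
            }
        }
    }
}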
Default join works well in most situations. One exception is when both tables are huge, so that a large amount of data is transferred over the network from Mappers to Reducers. Without considering early projection or any selection based on user-given predicates, the amount of data transferred over the network is the sum of the sizes of R and S. Since network transfer is the bottleneck in this case, one potential solution is to first filter the source datasets in the map phase and get rid of the records that cannot possibly be joined in the reduce phase. This approach is exactly the “advanced join” that we propose, and we introduce this join method later in this section.
2.2 Map join
Map join improves on default join by eliminating the reduce phase, and thus the transfer of data over the network between the map phase and the reduce phase. The gain from this becomes obvious when one of the tables is small.
Map join aims to use only the map phase so that no data is transferred over the network. The “input file” to Hadoop for this job is only one table (the fragment side), and one map task is initialized for each “split” of that table, which completes the “fragmenting” step. Those map tasks then read the other table (the duplicate side) as a whole and perform the join locally. The name “fragment-replicate join” comes from fragmenting one table to be processed in a distributed fashion and replicating the other side for each fragment; the name “map join” indicates that the join is performed using only the map phase, thus eliminating the cost of sorting and shuffling over the network.
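A minimal sketch of such a map-only join, using the same illustrative record layout as before and assuming the duplicate side fits in memory (the limitation discussed next), might look like this; the HDFS path of the small table is a placeholder.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only (fragment-replicate) join: the fragment side is the job input,
// the duplicate side is loaded whole into an in-memory hash table.
public class MapJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> duplicateSide = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Read the duplicate side from HDFS; the path is an assumption.
        Path dup = new Path("/data/small_table.txt");
        FileSystem fs = FileSystem.get(context.getConfiguration());
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(dup)))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split("\t", 2);   // key <TAB> rest of record
                duplicateSide.put(fields[0], fields[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t", 2);
        String match = duplicateSide.get(fields[0]);     // probe the hash table
        if (match != null) {
            context.write(new Text(fields[0]), new Text(fields[1] + "\t" + match));
        }
    }
}

The driver of such a job would additionally call job.setNumReduceTasks(0) so that it runs as a map-only job.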
Because map join has to read the whole table S into every Mapper, it is very efficient for the pattern of one huge table joined with one small table, but disastrous when both tables are huge, since reading the “duplicate side” into every map task imposes much more cost than the savings gained from not transferring the other table over the network. One “original” implementation of map join reads the duplicate table, loads it into a hash table built in memory, and then probes this hash table to do the join. As can be observed, this kind of implementation will not work if the duplicate table cannot be loaded into memory. This issue will be explored and answered later in this paper.
2.3 Distributed Cache
Distributed Cache is one way of distributing common data that is shared and accessed by all the map tasks in a Hadoop job. Before launching the job, we assign the “duplicate side” to Distributed Cache, so that it is copied to the local file system of each node that has tasks to run. Then, in the map phase, a Mapper only needs to read from the local file system, rather than using an HDFS call to read the data (most likely from another node).
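As a sketch, a driver can register the duplicate side with Distributed Cache as shown below; the paths, the “small” symlink name, and MapJoinMapper (the mapper sketched in Section 2.2) are assumptions made for the example.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapJoinDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "map join with distributed cache");
        job.setJarByClass(MapJoinDriver.class);

        // Copy the duplicate side to the local file system of every node
        // that runs a task; "#small" creates a local symlink named "small".
        job.addCacheFile(new URI("/data/small_table.txt#small"));

        job.setMapperClass(MapJoinMapper.class);   // mapper from the earlier sketch
        job.setNumReduceTasks(0);                  // map-only job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // fragment side
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

In the Mapper’s setup() method, the duplicate table would then be read with ordinary local file I/O, e.g. new BufferedReader(new FileReader("small")), instead of the HDFS call used in the earlier sketch.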
The merit of Distributed Cache is that it reduces multiple accesses to the duplicate table to a single access per node: Distributed Cache copies the duplicate table only ONCE per node, whereas, with several map tasks running on each node, reading the duplicate table directly through HDFS calls clearly causes more network communication instead of local I/O. Therefore, using Distributed Cache is one way of improving the performance of map join; it can also be handy in the case of the advanced join discussed in the next part. However, the real benefit from Distributed Cache remains to be analyzed with controlled experiments.
2.4 Advanced join
Advanced join is based on the idea of filtering the tables before transferring them over the network. It relies on one prerequisite operation, “semi join”, which performs a normal join only on the “join key”. In other words, semi join extracts only the join key in the map phase, and the output of semi join is a table of keys.
Semi join aims to find all the distinct join key values that are shared by both tables. The process of semi join is as follows: First, the two tables are read by map tasks, where the join key value of each record is extracted as the key of the intermediate result, with a tag (to indicate which table the record comes from) as the value. After shuffling, all the records from both tables with the same join key value will be sent to the same Reducer. Each Reducer scans its input and tries to find different tags. If it finds different tags, which means both tables contain records with that join key value, the Reducer outputs the key. If all the tags are the same after scanning all the input for a certain key, which means only one table contains records with this join key value, the Reducer does nothing for this key, since it will not be used in the join later. After all the Reducers finish their work, we have all the distinct join key values shared by the two tables. This step is useful for filtering out the records that are useless for this join.
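A minimal sketch of such a semi-join job, under the same illustrative assumptions about record layout and file naming as the earlier sketches, could be:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Mapper: emit (join key, table tag) for every record of both tables.
public class SemiJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text tag = new Text();

    @Override
    protected void setup(Context context) {
        String file = ((FileSplit) context.getInputSplit()).getPath().getName();
        tag.set(file.startsWith("R") ? "R" : "S");    // assumed naming convention
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String joinKey = line.toString().split("\t", 2)[0];
        context.write(new Text(joinKey), tag);
    }
}

// Reducer: output the join key only if records from BOTH tables carry it.
class SemiJoinReducer extends Reducer<Text, Text, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> tags, Context context)
            throws IOException, InterruptedException {
        boolean seenR = false, seenS = false;
        for (Text t : tags) {
            if (t.toString().equals("R")) seenR = true; else seenS = true;
            if (seenR && seenS) break;   // both tags found, no need to scan further
        }
        if (seenR && seenS) {
            context.write(key, NullWritable.get());
        }
    }
}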
Semi join makes sense even if the join key is a Primary key and Foreign key pair. Consider the following scenario: if the join key is a Primary key in one table, and in the other table