in Hadoop, which determines which join plan should be used to achieve better performance.
2. CLASSIC JOIN IN HADOOP
In this section, we briefly introduce the implementation of a typical join approach, which is named “default join” for the rest of the paper, and then move on to “map join”.
2.1 Default/reduce Join
Default join is a 2-way join widely used in MapReduce that complies with the MapReduce spirit fairly well. This strategy is the most intuitive one and is the default strategy for tools such as Pig and Hive. It is called “reduce join” later in the paper because the join is performed in the reduce phase, in contrast with map join.
Given two tables, default join first reads different parts of the two tables into different Mappers, where for each record the join attribute is extracted as the key of an intermediate record, with a file tag (to indicate which table the record comes from) and other necessary attributes as the value. The intermediate results are shuffled and then sent to Reducers, so that each reduce call receives only the tuples sharing one join key. These records are grouped by tag, so that records from the two tables can be told apart, and the Cartesian product of the two groups forms the final result.
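As an illustration, the following is a minimal sketch of such a reduce-side join written against the Hadoop MapReduce Java API. The class names, the assumption that the join key is the first tab-separated field of each line, and the convention that input file names start with “R” or “S” are ours, introduced only for this example.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Mapper: tag each record with the table it came from and emit
// (join key, tagged record) as the intermediate result.
public class ReduceJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String tag;   // "R" or "S", derived from the input file name (assumed convention)

    @Override
    protected void setup(Context context) {
        String file = ((FileSplit) context.getInputSplit()).getPath().getName();
        tag = file.startsWith("R") ? "R" : "S";
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // assumed layout: join key is the first tab-separated field
        String[] fields = line.toString().split("\t", 2);
        context.write(new Text(fields[0]), new Text(tag + "\t" + fields[1]));
    }
}

// Reducer: for one join key, separate the records by tag and
// output the Cartesian product of the two groups.
class ReduceJoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> fromR = new ArrayList<>();
        List<String> fromS = new ArrayList<>();
        for (Text v : values) {
            String[] parts = v.toString().split("\t", 2);
            (parts[0].equals("R") ? fromR : fromS).add(parts[1]);
        }
        for (String r : fromR) {
            for (String s : fromS) {
                context.write(key, new Text(r + "\t" + s));
            }
        }
    }
}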
Default join works well in most situations. One exception is when both tables are huge, so that a large amount of data is transferred over the network from Mappers to Reducers. Without considering early projection or any selection based on user-given predicates, the amount of data transferred over the network is the sum of the sizes of R and S. Since network transfer is the bottleneck in this case, one potential solution is to first filter the source datasets in the map phase and get rid of the records that cannot possibly be joined in the reduce phase. This approach is exactly the “advanced join” that we propose, and we introduce this join method later in this section.
2.2 Map join
Map join improves on default join by eliminating the reduce phase, and thus the transfer of data over the network between the map phase and the reduce phase. The gain from this becomes obvious when one of the tables is small.
Map join aims to use only the map phase so that no data is transferred over the network. The “input file” to Hadoop for this job is only one table (the fragment side), and one map task is initialized for each “split” of that table, which completes the “fragmenting” step. Those map tasks then read the other table (the duplicate side) as a whole and perform the join locally. The name “fragment-replicate join” comes from fragmenting one table to be processed in a distributed fashion and replicating the other side for each fragment; the name “map join” indicates that the join is performed using only the map phase, thus eliminating the cost of sorting and shuffling over the network.
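A minimal sketch of such a map-only join, using the same illustrative record layout as before and assuming the duplicate side fits in memory (the limitation discussed next), might look like this; the HDFS path of the small table is a placeholder.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only (fragment-replicate) join: the fragment side is the job input,
// the duplicate side is loaded whole into an in-memory hash table.
public class MapJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> duplicateSide = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Read the duplicate side from HDFS; the path is an assumption.
        Path dup = new Path("/data/small_table.txt");
        FileSystem fs = FileSystem.get(context.getConfiguration());
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(dup)))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split("\t", 2);   // key <TAB> rest of record
                duplicateSide.put(fields[0], fields[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t", 2);
        String match = duplicateSide.get(fields[0]);     // probe the hash table
        if (match != null) {
            context.write(new Text(fields[0]), new Text(fields[1] + "\t" + match));
        }
    }
}

The driver of such a job would additionally call job.setNumReduceTasks(0) so that it runs as a map-only job.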
Because map join has to read the whole table S into every Mapper, it is very efficient for the pattern of one huge table joined with one small table, but disastrous when both tables are huge, since reading the “duplicate side” into every map task imposes much more cost than the savings gained from not transferring the other table over the network. One “original” implementation of map join reads the duplicate table, loads it into a hash table built in memory, and then probes this hash table to do the join. As can be observed, this kind of implementation will not work if the duplicate table cannot be loaded into memory. This issue will be explored and answered later in this paper.
2.3 Distributed Cache
Distributed Cache is one way of distributing common data that is shared and accessed by all the map tasks in a Hadoop job. Before launching the job, we assign the “duplicate side” to Distributed Cache, so that it is copied to the local file system of each node that has tasks to run. Then, in the map phase, a Mapper only needs to read from the local file system, rather than using an HDFS call to read the data (most likely from another node).
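As a sketch, a driver can register the duplicate side with Distributed Cache as shown below; the paths, the “small” symlink name, and MapJoinMapper (the mapper sketched in Section 2.2) are assumptions made for the example.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapJoinDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "map join with distributed cache");
        job.setJarByClass(MapJoinDriver.class);

        // Copy the duplicate side to the local file system of every node
        // that runs a task; "#small" creates a local symlink named "small".
        job.addCacheFile(new URI("/data/small_table.txt#small"));

        job.setMapperClass(MapJoinMapper.class);   // mapper from the earlier sketch
        job.setNumReduceTasks(0);                  // map-only job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // fragment side
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

In the Mapper’s setup() method, the duplicate table would then be read with ordinary local file I/O, e.g. new BufferedReader(new FileReader("small")), instead of the HDFS call used in the earlier sketch.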
The merit of Distributed Cache is that it reduces multiple accesses to the duplicate table to a single access per node: Distributed Cache copies the duplicate table only ONCE per node, whereas, with several map tasks running on each node, reading the duplicate table directly through HDFS calls clearly causes more network communication instead of local I/O. Therefore, using Distributed Cache is one way of improving the performance of map join; it can also be handy in the case of the advanced join discussed in the next part. However, the real benefit from Distributed Cache remains to be analyzed with controlled experiments.
2.4 Advanced join
Advanced join is based on the idea of filtering the tables before transferring them over the network. It relies on one prerequisite operation, “semi join”, which performs a normal join only on the “join key”. In other words, semi join extracts only the join key in the map phase, and the output of semi join is a table of keys.
Semi join aims to find all the distinct join key values that are shared by both tables. The process of semi join is as follows: First, the two tables are read by map tasks, where the join key value of each record is extracted as the key of the intermediate result, with a tag (to indicate which table the record comes from) as the value. After shuffling, all the records from both tables with the same join key value will be sent to the same Reducer. Each Reducer scans its input and tries to find different tags. If it finds different tags, which means both tables contain records with that join key value, the Reducer outputs the key. If all the tags are the same after scanning all the input for a certain key, which means only one table contains records with this join key value, the Reducer does nothing for this key, since it will not be used in the join later. After all the Reducers finish their work, we have all the distinct join key values shared by the two tables. This step is useful for filtering out the records that are useless for this join.
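A minimal sketch of such a semi-join job, under the same illustrative assumptions about record layout and file naming as the earlier sketches, could be:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Mapper: emit (join key, table tag) for every record of both tables.
public class SemiJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text tag = new Text();

    @Override
    protected void setup(Context context) {
        String file = ((FileSplit) context.getInputSplit()).getPath().getName();
        tag.set(file.startsWith("R") ? "R" : "S");    // assumed naming convention
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String joinKey = line.toString().split("\t", 2)[0];
        context.write(new Text(joinKey), tag);
    }
}

// Reducer: output the join key only if records from BOTH tables carry it.
class SemiJoinReducer extends Reducer<Text, Text, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> tags, Context context)
            throws IOException, InterruptedException {
        boolean seenR = false, seenS = false;
        for (Text t : tags) {
            if (t.toString().equals("R")) seenR = true; else seenS = true;
            if (seenR && seenS) break;   // both tags found, no need to scan further
        }
        if (seenR && seenS) {
            context.write(key, NullWritable.get());
        }
    }
}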
Semi join makes sense even if the join key is a Primary key and Foreign key pair. Consider the following scenario: if the join key is a Primary key in one table, and in the other table