Hadoop Map-Reduce Tutorial
Table of contents
1 Purpose
2 Pre-requisites
3 Overview
4 Inputs and Outputs
5 Example: WordCount v1.0
  5.1 Source Code
  5.2 Usage
  5.3 Walk-through
6 Map-Reduce - User Interfaces
  6.1 Payload
  6.2 Job Configuration
  6.3 Task Execution & Environment
  6.4 Job Submission and Monitoring
  6.5 Job Input
  6.6 Job Output
  6.7 Other Useful Features
7 Example: WordCount v2.0
  7.1 Source Code
  7.2 Sample Runs
  7.3 Highlights
Copyright © 2007 The Apache Software Foundation. All rights reserved.
1. Purpose
This document comprehensively describes all user-facing facets of the Hadoop Map-Reduce
framework and serves as a tutorial.
2. Pre-requisites
Ensure that Hadoop is installed, configured and running. More details:
• Hadoop Quickstart for first-time users.
• Hadoop Cluster Setup for large, distributed clusters.
3. Overview
Hadoop Map-Reduce is a software framework for easily writing applications which process
vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of
nodes) of commodity hardware in a reliable, fault-tolerant manner.
A Map-Reduce job usually splits the input data-set into independent chunks which are
processed by the map tasks in a completely parallel manner. The framework sorts the outputs
of the maps, which are then input to the reduce tasks. Typically both the input and the output
of the job are stored in a file-system. The framework takes care of scheduling tasks,
monitoring them and re-executing the failed tasks.
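The flow described above can be imitated in a single process. The following is a minimal sketch in plain Java, not Hadoop code: the class name `LocalMapReduceSketch`, its methods and the sample input are all made up for illustration, and a `TreeMap` stands in for the framework's sorting of map outputs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a single-process imitation of split -> map -> sort -> reduce.
// None of these names come from Hadoop.
class LocalMapReduceSketch {

    // "Map" step: turn one independent chunk (here, one line) into <word, 1> pairs.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : chunk.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(Map.entry(token, 1));
            }
        }
        return out;
    }

    // "Reduce" step: fold all values that share one key into a single value.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> chunks) {
        // Sort and group the map outputs by key; TreeMap keeps the keys ordered.
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String chunk : chunks) {
            for (Map.Entry<String, Integer> pair : map(chunk)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        // Call reduce once per key, in key order.
        TreeMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
        System.out.println(run(List.of("Hello World Bye World",
                                       "Hello Hadoop Goodbye Hadoop")));
    }
}
```

In the real framework the chunks are processed by separate map tasks on different nodes; the sketch only shows the order of the steps.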
Typically the compute nodes and the storage nodes are the same, that is, the Map-Reduce
framework and the Distributed FileSystem are running on the same set of nodes. This
configuration allows the framework to effectively schedule tasks on the nodes where data is
already present, resulting in very high aggregate bandwidth across the cluster.
The Map-Reduce framework consists of a single master JobTracker and one slave
TaskTracker per cluster-node. The master is responsible for scheduling the jobs'
component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves
execute the tasks as directed by the master.
Minimally, applications specify the input/output locations and supply map and reduce
functions via implementations of appropriate interfaces and/or abstract-classes. These, and
other job parameters, comprise the job configuration. The Hadoop job client then submits the
job (jar/executable etc.) and configuration to the JobTracker which then assumes the
responsibility of distributing the software/configuration to the slaves, scheduling tasks and
monitoring them, providing status and diagnostic information to the job-client.
Although the Hadoop framework is implemented in Java™, Map-Reduce applications need
not be written in Java.
• Hadoop Streaming is a utility which allows users to create and run jobs with any
executables (e.g. shell utilities) as the mapper and/or the reducer.
• Hadoop Pipes is a SWIG-compatible C++ API to implement Map-Reduce applications
(non-JNI™ based).
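To make the Streaming item concrete, here is a rough sketch of a program that could act as a mapper executable. It assumes the usual Streaming convention, which this section does not state: the mapper reads input lines on standard input and writes one `key<TAB>value` line per record to standard output. The class name `StreamingStyleMapper` is hypothetical, and Java is used only to stay in the tutorial's language; any executable with the same behaviour would do.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.UncheckedIOException;

// Hypothetical mapper executable: emits "word<TAB>1" for every token read.
class StreamingStyleMapper {

    // Reads lines from 'in' and writes one tab-separated pair per token to 'out'.
    static void mapLines(BufferedReader in, PrintWriter out) {
        try {
            String line;
            while ((line = in.readLine()) != null) {
                for (String token : line.trim().split("\\s+")) {
                    if (!token.isEmpty()) {
                        out.println(token + "\t1");
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        out.flush();
    }

    public static void main(String[] args) {
        mapLines(new BufferedReader(new InputStreamReader(System.in)),
                 new PrintWriter(System.out));
    }
}
```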
4. Inputs and Outputs
The Map-Reduce framework operates exclusively on <key, value> pairs, that is, the
framework views the input to the job as a set of <key, value> pairs and produces a set of
<key, value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to
implement the Writable interface. Additionally, the key classes have to implement the
WritableComparable interface to facilitate sorting by the framework.
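As an illustration of what these two requirements mean for a key class, the sketch below defines a hypothetical composite key, `YearMonthKey`. So that it compiles without Hadoop on the classpath, it declares local stand-in interfaces named `Writable` and `WritableComparable`; they are intended to mirror the shape of the Hadoop interfaces (a `write` and a `readFields` method, plus `Comparable`), and the generic parameter on the stand-in is an assumption that may not match every Hadoop release.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical key made of two ints, serializable and sortable.
class YearMonthKey implements WritableComparable<YearMonthKey> {
    int year;
    int month;

    YearMonthKey() {}  // empty instance to be filled in by readFields

    YearMonthKey(int year, int month) {
        this.year = year;
        this.month = month;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);
        out.writeInt(month);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Fields must be read in the same order they were written.
        year = in.readInt();
        month = in.readInt();
    }

    @Override
    public int compareTo(YearMonthKey other) {
        int c = Integer.compare(year, other.year);
        return c != 0 ? c : Integer.compare(month, other.month);
    }

    // Serializes a key to bytes and reads it back into a fresh instance.
    static YearMonthKey roundTrip(YearMonthKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            YearMonthKey copy = new YearMonthKey();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        YearMonthKey copy = roundTrip(new YearMonthKey(2007, 11));
        System.out.println(copy.year + "-" + copy.month);  // prints 2007-11
    }
}

// Local stand-ins (assumptions) for the Hadoop interfaces of the same names.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

interface WritableComparable<T> extends Writable, Comparable<T> {}
```

Value classes only need the serialization half; the ordering defined by `compareTo` matters for keys because the framework sorts by key.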
Input and Output types of a Map-Reduce job:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
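The type flow above can be traced with ordinary Java collections. In this sketch, which is not Hadoop code, `<k1, v1>` is `<Long, String>` (a line offset and a line), `<k2, v2>` is `<String, Integer>`, and `<k3, v3>` is `<String, Long>`. The class `TypedFlowSketch` and its methods are hypothetical, each input line is treated as the output of one map task, and `reduce` here folds the framework's grouping by key and the per-key reduce call into one method for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: traces <k1,v1> -> map -> <k2,v2> -> combine -> <k2,v2> -> reduce -> <k3,v3>.
class TypedFlowSketch {

    // map: <Long offset, String line> -> list of <String word, Integer 1>
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(Map.entry(token, 1));
            }
        }
        return out;
    }

    // combine: pre-aggregates the output of ONE map task; the types stay <k2, v2>.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> partial = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapOutput) {
            partial.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return partial;
    }

    // reduce: merges the partial results of all map tasks into <String, Long>.
    static Map<String, Long> reduce(List<Map<String, Integer>> partials) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                totals.merge(e.getKey(), e.getValue().longValue(), Long::sum);
            }
        }
        return totals;
    }

    static Map<String, Long> run(List<String> lines) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            partials.add(combine(map(offset, line)));
            offset += line.length() + 1;  // +1 for the assumed newline
        }
        return reduce(partials);
    }

    public static void main(String[] args) {
        // prints {a=2, b=2, c=1}
        System.out.println(run(List.of("a b a", "b c")));
    }
}
```

For the first line, `map` emits three pairs but `combine` hands only two (`a=2`, `b=1`) on to the reduce side, which is the point of the combine step.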
5. Example: WordCount v1.0
Before we jump into the details, let's walk through an example Map-Reduce application to get
a flavour for how such applications work.
WordCount is a simple application that counts the number of occurrences of each word in a
given input set.
This works with a local-standalone, pseudo-distributed or fully-distributed Hadoop
installation.
5.1. Source Code
WordCount.java
    package org.myorg;

    import java.io.IOException;
    import java.util.*;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.*;

    public class WordCount {

        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output,
                            Reporter reporter) throws IOException {
                String line = value.toString();
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    output.collect(word, one);
                }
            }
        }

        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output,
                               Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            conf.setMapperClass(Map.class);