MapReduce custom delimiter source code (resource, CSDN Library)

48 files in total

class: 28

java: 13

prefs: 2

mapreduce

hadoop

Uploaded 2013-11-12 13:33:41, 59 KB ZIP

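In the MapReduce programming model, input data is usually stored in text files line by line, with the items on each line separated by a specific delimiter such as a tab or a comma. By default, Hadoop's `LineRecordReader` treats each line as one record; splitting the fields inside a line is left to the user. For log files with complex formats or delimiters that are not fixed, `LineRecordReader` can be extended to support a custom delimiter.

It helps to understand first how `LineRecordReader` works. Its main job is to read lines from an input split (`InputSplit`) and return each one as a key-value pair: the key is the byte offset at which the line starts in the file, and the value is the content of the line. The value is a `Text` object, Hadoop's class for text data, which stores its content as UTF-8.

To support a custom delimiter, create a new `RecordReader` class, for example `CustomDelimiterRecordReader`, that extends `LineRecordReader` and overrides its key method, `nextKeyValue()`. In that method the line is parsed, split on the custom delimiter, and the resulting fields are stored in a `Text` object. An outline of a simplified `CustomDelimiterRecordReader`:

1. Accept the custom delimiter as a constructor argument.
2. Override `nextKeyValue()`:
   - Call the parent's `nextKeyValue()` to read one line of text.
   - Split the line on the custom delimiter. Java's `String.split()` works; because it treats its argument as a regular expression, a precompiled `Pattern` avoids rebuilding that expression for every record.
   - Store the resulting fields in an array or list.
   - Set the key to the current input position, and set the value to a `Text` object holding the field list converted to a string.
3. Optionally provide a helper method such as `getFields()` that returns the split fields.

To use `CustomDelimiterRecordReader` in a MapReduce job, the input format has to be set on the job in the driver with `job.setInputFormatClass()`, not in the `Mapper`: the mapper's `Context` has no method for changing the input format, and the record reader has already been created by the time `setup()` runs. The mapper's `setup()` method can still read the delimiter from the job configuration:

```java
public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private String delimiter;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Read the custom delimiter that the driver stored in the job configuration.
        Configuration conf = context.getConfiguration();
        delimiter = conf.get("custom.delimiter", "\t");
    }
    // ...
}

// In the driver, before the job is submitted:
// job.getConfiguration().set("custom.delimiter", "|");
// job.setInputFormatClass(CustomInputFormat.class); // custom input format that uses CustomDelimiterRecordReader
```

A custom `InputFormat` class is also needed, for example `CustomInputFormat`, which extends `FileInputFormat` and returns a `CustomDelimiterRecordReader` instance from its `createRecordReader()` method.

With these pieces in place, MapReduce can process log data whose format is more complex than tab- or comma-separated lines. A real job also has to deal with error handling (malformed lines, for instance) and with performance.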
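The field-splitting step in the outline above can be tried out without a Hadoop cluster. The following is a minimal sketch in plain Java; `FieldSplitter` is a hypothetical name chosen for illustration (it is not a class from the archive or from Hadoop), and it assumes the delimiter is a literal string rather than a regular expression.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: splits one record into fields on a literal delimiter. */
final class FieldSplitter {
    private final Pattern pattern;

    FieldSplitter(String delimiter) {
        // Quote the delimiter so regex metacharacters such as "|" or "." are taken literally.
        // The pattern is compiled once and reused for every record.
        this.pattern = Pattern.compile(Pattern.quote(delimiter));
    }

    /** Returns all fields of the line; the -1 limit keeps trailing empty fields. */
    List<String> getFields(String line) {
        return Arrays.asList(pattern.split(line, -1));
    }

    public static void main(String[] args) {
        FieldSplitter splitter = new FieldSplitter("|");
        for (String field : splitter.getFields("2013-11-12|GET|/index.html")) {
            System.out.println(field);
        }
    }
}
```

Inside a record reader or mapper, the same object would be created once (for example in `setup()`) and `getFields()` called with `value.toString()` for each record. Quoting the delimiter matters: passing `"|"` unquoted to `String.split()` would split between every character, because `|` is the regex alternation operator.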


UserDefinedSeparatorRecordReader.zip (48 files)

test/
    src/
        test/
            TestSplit.java 2KB
            IsearchRecordReader.java 9KB
            UserDefinedSeparatorReader.java 8KB
            MaxTemp3.java 3KB
            LineInputFormat.java 4KB
            MaxTemp.java 3KB
            FileSplit.java 3KB
            UserDefinedSeparatorRecordReader.java 5KB
            ValidateLineNumber.java 3KB
            MaxTemp2.java 2KB
            test.java 4KB
            WebLogInputFormat.java 626B
            UDFProcess.java 2KB
    bin/
        test/
            TestSplit$FileSplitMapper.class 2KB
            WebLogInputFormat.class 1KB
            ValidateLineNumber$FileSplitReducer.class 2KB
            TestSplit.class 2KB
            UDFProcess$UDFReducer.class 2KB
            MaxTemp2.class 1KB
            MaxTemp$MaxTempMapper.class 2KB
            UDFProcess.class 2KB
            MaxTemp3$MaxTempMapper.class 3KB
            LineInputFormat.class 5KB
            FileSplit$FileSplitPartitioner.class 1KB
            MaxTemp3$MaxTempCombine.class 2KB
            FileSplit$FileSplitMapper.class 2KB
            UserDefinedSeparatorReader.class 4KB
            MaxTemp.class 2KB
            MaxTemp$MaxTempReducer.class 2KB
            UserDefinedSeparatorRecordReader.class 5KB
            test.class 4KB
            FileSplit$FileSplitReducer.class 2KB
            MaxTemp3.class 2KB
            FileSplit.class 2KB
            UDFProcess$UDFMapper.class 2KB
            MaxTemp3$MaxTempReducer.class 2KB
            IsearchRecordReader.class 5KB
            MaxTemp2$MaxTempMapper.class 2KB
            ValidateLineNumber.class 2KB
            ValidateLineNumber$FileSplitMapper.class 2KB
            MaxTemp2$MaxTempReducer.class 2KB
    .classpath 391B
    .settings/
        org.eclipse.wst.jsdt.ui.superType.container 49B
        org.eclipse.core.resources.prefs 98B
        org.eclipse.wst.jsdt.ui.superType.name 6B
        .jsdtscope 454B
        org.eclipse.jdt.core.prefs 629B
    .project 568B

```java
package test;

import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

public class IsearchRecordReader extends RecordReader<LongWritable, Text> {
    private static final Log LOG = LogFactory.getLog(IsearchRecordReader.class);
    private CompressionCodecFactory compressionCodecs = null;
    private long start;
    private long pos;
    private long end;
    // Note: in this preview the field is org.apache.hadoop.util.LineReader (see the
    // imports), which splits on CR/LF. The separator-aware LineReader is the
    // commented-out class at the bottom of the file.
    private LineReader in;
    private int maxLineLength;
    private LongWritable key = null;
    private Text value = null;
    // Line separator, i.e. the separator between two records
    private byte[] separator = { '\b' };
    private int sepLength = 1;

    public IsearchRecordReader() {
    }

    public IsearchRecordReader(String seps) {
        this.separator = seps.getBytes();
        sepLength = separator.length;
    }

    public void initialize(InputSplit genericSplit, TaskAttemptContext context)
            throws IOException {
        FileSplit split = (FileSplit) genericSplit;
        Configuration job = context.getConfiguration();
        this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength",
                Integer.MAX_VALUE);
        this.start = split.getStart();
        this.end = (this.start + split.getLength());
        Path file = split.getPath();
        this.compressionCodecs = new CompressionCodecFactory(job);
        CompressionCodec codec = this.compressionCodecs.getCodec(file);

        // open the file and seek to the start of the split
        FileSystem fs = file.getFileSystem(job);
        FSDataInputStream fileIn = fs.open(split.getPath());
        boolean skipFirstLine = false;
        if (codec != null) {
            this.in = new LineReader(codec.createInputStream(fileIn), job);
            this.end = Long.MAX_VALUE;
        } else {
            if (this.start != 0L) {
                skipFirstLine = true;
                this.start -= sepLength;
                fileIn.seek(this.start);
            }
            this.in = new LineReader(fileIn, job);
        }
        if (skipFirstLine) {
            // skip first line and re-establish "start".
            int newSize = in.readLine(new Text(), 0,
                    (int) Math.min((long) Integer.MAX_VALUE, end - start));
            if (newSize > 0) {
                start += newSize;
            }
        }
        this.pos = this.start;
    }

    public boolean nextKeyValue() throws IOException {
        if (this.key == null) {
            this.key = new LongWritable();
        }
        this.key.set(this.pos);
        if (this.value == null) {
            this.value = new Text();
        }
        int newSize = 0;
        while (this.pos < this.end) {
            newSize = this.in.readLine(this.value, this.maxLineLength,
                    Math.max((int) Math.min(Integer.MAX_VALUE, this.end - this.pos),
                            this.maxLineLength));
            if (newSize == 0) {
                break;
            }
            this.pos += newSize;
            if (newSize < this.maxLineLength) {
                break;
            }
            LOG.info("Skipped line of size " + newSize + " at pos "
                    + (this.pos - newSize));
        }
        if (newSize == 0) {
            // read the next buffer
            this.key = null;
            this.value = null;
            return false;
        }
        // read the next record in the same buffer
        return true;
    }

    public LongWritable getCurrentKey() {
        return this.key;
    }

    public Text getCurrentValue() {
        return this.value;
    }

    public float getProgress() {
        if (this.start == this.end) {
            return 0.0F;
        }
        return Math.min(1.0F, (float) (this.pos - this.start)
                / (float) (this.end - this.start));
    }

    public synchronized void close() throws IOException {
        if (this.in != null)
            this.in.close();
    }

    /*class LineReader {
        // carriage return (Hadoop default)
        //private static final byte CR = 13;
        // line feed (Hadoop default)
        //private static final byte LF = 10;

        // read the file one buffer at a time
        private static final int DEFAULT_BUFFER_SIZE = 32 * 1024 * 1024;
        private int bufferSize = DEFAULT_BUFFER_SIZE;
        private byte[] buffer;
        private int bufferLength = 0;
        private int bufferPosn = 0;

        LineReader(InputStream in, int bufferSize) {
            this.bufferLength = 0;
            this.bufferPosn = 0;
            // this.in = in;
            this.bufferSize = bufferSize;
            this.buffer = new byte[this.bufferSize];
        }

        public LineReader(InputStream in, Configuration conf) throws IOException {
            this(in, conf.getInt("io.file.buffer.size", DEFAULT_BUFFER_SIZE));
        }

        public void close() throws IOException {
            in.close();
        }

        public int readLine(Text str, int maxLineLength) throws IOException {
            return readLine(str, maxLineLength, Integer.MAX_VALUE);
        }

        public int readLine(Text str) throws IOException {
            return readLine(str, Integer.MAX_VALUE, Integer.MAX_VALUE);
        }

        // Start of the part that needs to be rewritten (core code)
        public int readLine(Text str, int maxLineLength, int maxBytesToConsume)
                throws IOException {
            str.clear();
            Text record = new Text();
            int txtLength = 0;
            long bytesConsumed = 0L;
            boolean newline = false;
            int sepPosn = 0;
            do {
                // reached the end of the buffer: read the next buffer
                if (this.bufferPosn >= this.bufferLength) {
                    bufferPosn = 0;
                    bufferLength = in.read(buffer);
                    // reached the end of the file: break out and move on to the next file
                    if (bufferLength <= 0) {
                        break;
                    }
                }
                int startPosn = this.bufferPosn;
                for (; bufferPosn < bufferLength; bufferPosn++) {
                    // handle a separator that was cut in half at the tail of the previous buffer
                    // (this can go wrong if the separator contains many repeated characters)
                    if (sepPosn > 0 && buffer[bufferPosn] != separator[sepPosn]) {
                        sepPosn = 0;
                    }
                    // found the first character of the line separator
                    if (buffer[bufferPosn] == separator[sepPosn]) {
                        bufferPosn++;
                        int i = 0;
                        // check whether the following characters also belong to the separator
                        for (++sepPosn; sepPosn < sepLength; i++, sepPosn++) {
                            // the separator sits right at the end of the buffer and has been cut in half
                            if (bufferPosn + i >= bufferLength) {
                                bufferPosn += i - 1;
                                break;
                            }
                            // as soon as one character differs, this is not a separator
                            if (this.buffer[this.bufferPosn + i] != separator[sepPosn]) {
                                sepPosn = 0;
                                break;
                            }
                        }
                        // this really is a line separator
                        if (sepPosn == sepLength) {
                            bufferPosn += i;
                            newline = true;
                            sepPosn = 0;
                            break;
                        }
                    }
                }
                int readLength = this.bufferPosn - startPosn;
                bytesConsumed += readLength;
                // the line separator is not stored in the record
                //int appendLength = readLength - newlineLength;
                if (readLength > maxLineLength - txtLength) {
                    readLength = maxLineLength - txtLength;
                }
                if (readLength > 0) {
                    record.append(this.buffer, startPosn, readLength);
                    txtLength += readLength;
                    // strip the separator from the record
                    if (newline) {
                        str.set(record.getBytes(), 0, record.getLength() - sepLength);
                    }
                }
            } while (!newline && (bytesConsumed < maxBytesToConsume));
            if (bytesConsumed > (long) Integer.MAX_VALUE) {
                throw new IOException("Too many bytes before newline: " + bytesConsumed);
            }
            return (int) bytesConsumed;
        }
        // End of the part that needs to be rewritten

        // Start of the LineReader source from hadoop-core
        public int readLine(Text str, int maxLineLength, int maxBytesToConsume)
                throws IOException {
            str.clear();
            int txtLength = 0;
            int newlineLength = 0;
            // (the source preview is cut off here; the rest of the file is not shown)
```
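A comment in the commented-out reader above warns that its handling of a separator cut in half at a buffer boundary can go wrong when the separator contains repeated characters. For comparison, here is a minimal stand-alone sketch in plain Java (no Hadoop) that avoids that case by checking, after every byte, whether the bytes accumulated so far end with the separator. `SeparatorRecordScanner`, `readRecord()` and `readAll()` are hypothetical names introduced for illustration; this is not the archive's implementation, it ignores input splits and `maxLineLength`, and it favors simplicity over speed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-alone reader: records are separated by an arbitrary byte sequence. */
final class SeparatorRecordScanner {
    private final InputStream in;
    private final byte[] separator;
    private final byte[] buffer;
    private int bufferLength = 0;
    private int bufferPosn = 0;

    SeparatorRecordScanner(InputStream in, byte[] separator, int bufferSize) {
        if (separator.length == 0 || bufferSize <= 0) {
            throw new IllegalArgumentException("separator and buffer must be non-empty");
        }
        this.in = in;
        this.separator = separator.clone();
        this.buffer = new byte[bufferSize];
    }

    /** Returns the next record without its separator, or null at end of stream. */
    String readRecord() throws IOException {
        byte[] record = new byte[64];
        int length = 0;
        while (true) {
            if (bufferPosn >= bufferLength) {
                // Buffer exhausted: refill it. The partial record is kept in "record",
                // so a separator that straddles two buffers is still detected.
                bufferLength = in.read(buffer);
                bufferPosn = 0;
                if (bufferLength < 0) { // end of stream
                    bufferLength = 0;
                    return length == 0 ? null
                            : new String(record, 0, length, StandardCharsets.UTF_8);
                }
                continue;
            }
            if (length == record.length) {
                record = Arrays.copyOf(record, length * 2);
            }
            record[length++] = buffer[bufferPosn++];
            if (endsWith(record, length, separator)) {
                return new String(record, 0, length - separator.length,
                        StandardCharsets.UTF_8);
            }
        }
    }

    private static boolean endsWith(byte[] data, int length, byte[] suffix) {
        if (length < suffix.length) {
            return false;
        }
        int offset = length - suffix.length;
        for (int i = 0; i < suffix.length; i++) {
            if (data[offset + i] != suffix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Convenience wrapper for in-memory input: returns every record in order. */
    static List<String> readAll(String input, String separator, int bufferSize) {
        SeparatorRecordScanner scanner = new SeparatorRecordScanner(
                new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8)),
                separator.getBytes(StandardCharsets.UTF_8), bufferSize);
        List<String> records = new ArrayList<>();
        try {
            for (String r = scanner.readRecord(); r != null; r = scanner.readRecord()) {
                records.add(r);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        // A 4-byte buffer forces the first "||" to straddle two reads.
        for (String record : readAll("a|b||c|||d", "||", 4)) {
            System.out.println(record);
        }
    }
}
```

Checking the tail of the accumulated record costs up to one separator-length comparison per byte, which is acceptable for short separators. A production reader would more likely scan the buffer in bulk, as the archive's code attempts to do.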


Comment from u200917908, 2015-05-13:

I wasn't able to use this; I ended up using a different approach.
