
22. Writing and Reading Files in MapReduce with Gzip, Snappy, and LZO Compression

Tags: hadoop, mapreduce. Uploaded 2023-05-29 14:16:32 (PDF, 759 KB, 25 pages).

Source: https://blog.csdn.net/chenwewi520feng/article/details/130456088

Copyright notice: this is an original article by CSDN blogger 「一瓢一瓢的饮 alanchan」, published under the CC 4.0 BY-SA license. Reproductions must include a link to the original source and this notice.


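This article assumes a working Hadoop environment.

It is best read together with the article on working with common file types in MapReduce, because writing files and compressing them usually go hand in hand.

For an introduction to the compression algorithms themselves, see the article "HDFS File Types and Compression Algorithms".

This article covers the compression algorithms used when writing files: Gzip, Snappy, and LZO.

It has three parts: writing and reading Gzip-compressed files, writing and reading Snappy-compressed files, and writing and reading LZO-compressed files.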

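I. Source File: TextFile

All of the examples below use this file as the source and vary only the compression algorithm.

Number of source records: 12,606,948

Storage size in ClickHouse: 50.43 MB

Size when read record by record and saved as a text file: 1.08 G (uncompressed)

Size when read record by record and saved as an ORC file: 105 M (the default compression algorithm is ZLIB)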
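To put the sizes above in perspective, the following sketch derives compression ratios and the average record size from the reported figures. It assumes that "G" and "M" are binary units (1 G = 1024 M), which the article does not state; the class and method names are hypothetical.

```java
import java.util.Locale;

/** Derives compression ratios from the file sizes reported for the source data. */
class CompressionRatios {
    // Assumption: the article's "G" and "M" are binary units (1 G = 1024 M).
    static final double MB_PER_GB = 1024.0;

    /** How many times smaller the compressed form is than the raw text. */
    static double ratio(double rawMb, double compressedMb) {
        return rawMb / compressedMb;
    }

    /** Average size of one record in bytes for a file of the given size. */
    static double bytesPerRecord(double sizeMb, long records) {
        return sizeMb * 1024 * 1024 / records;
    }

    public static void main(String[] args) {
        double textMb = 1.08 * MB_PER_GB;   // uncompressed text file
        long records = 12_606_948L;         // source record count

        System.out.printf(Locale.ROOT, "ORC (ZLIB) vs text: %.2fx smaller%n", ratio(textMb, 105.0));
        System.out.printf(Locale.ROOT, "ClickHouse vs text: %.2fx smaller%n", ratio(textMb, 50.43));
        System.out.printf(Locale.ROOT, "Average text record: %.1f bytes%n", bytesPerRecord(textMb, records));
    }
}
```

Under that assumption the ORC file is roughly a tenth of the text file, and each text record averages a little over 90 bytes.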

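II. Writing and Reading Gzip-Compressed Files

1. Writing a Gzip file

Read a Text file and write it back out as a compressed Text file.

    // Compress the job output as Gzip. A reducer is optional, but the input file is
    // large enough to need 9 map tasks, so a map-only job would write 9 output files.
    // This example uses a reducer.
    conf.set("mapreduce.output.fileoutputformat.compress", "true");
    conf.set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.GzipCodec");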
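The figure of 9 map tasks mentioned above is consistent with a 1.08 G input divided into 128 MB splits. The sketch below is a simplified model of the split loop in Hadoop's `FileInputFormat`, which lets the final split run up to 10% over the split size. The 128 MB split size is an assumption (the article does not state it), and the class and method names are hypothetical.

```java
/** Simplified model of how FileInputFormat divides one splittable file into input splits. */
class SplitCount {
    // FileInputFormat allows the last split to be up to 10% larger than the split size.
    static final double SPLIT_SLOP = 1.1;

    /** Number of input splits (and therefore map tasks) for a single splittable file. */
    static int countSplits(long fileBytes, long splitBytes) {
        int splits = 0;
        long remaining = fileBytes;
        // Carve off full-size splits while more than 1.1 split sizes remain.
        while ((double) remaining / splitBytes > SPLIT_SLOP) {
            splits++;
            remaining -= splitBytes;
        }
        // Whatever is left becomes the final split.
        if (remaining > 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long fileBytes = (long) (1.08 * 1024 * 1024 * 1024); // approximate size of the text file
        long splitBytes = 128L * 1024 * 1024;                 // assumed split size
        System.out.println("map tasks: " + countSplits(fileBytes, splitBytes));
    }
}
```

With a reducer in place, those map outputs are merged, so the number of output files follows the number of reduce tasks rather than the number of splits.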

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    import org.springframework.util.StopWatch;

    /**
     * Reads a text file and writes it back out as Gzip-compressed text.
     *
     * @author alanchan
     */
    public class WriteFromTextFileToTextFileByGzip extends Configured implements Tool {
        // Sample local paths. The job itself takes its input and output paths from args[0] and args[1].
        static String in = "D:/workspace/bigdata-component/hadoop/test/in/seq";
        static String out = "D:/workspace/bigdata-component/hadoop/test/out/compress/gzip";
        // Compression is enabled only when args[2] equals this flag.
        static String flag = "1";

        @Override
        public int run(String[] args) throws Exception {
            Configuration conf = getConf();
            Job job = Job.getInstance(conf, this.getClass().getSimpleName());
            job.setJarByClass(this.getClass());

            FileInputFormat.addInputPath(job, new Path(args[0]));
            Path outputDir = new Path(args[1]);
            // Delete any previous output so the job does not fail on an existing directory.
            outputDir.getFileSystem(this.getConf()).delete(outputDir, true);
            FileOutputFormat.setOutputPath(job, outputDir);

            job.setMapperClass(WriteFromTextFileToTextFileByGzipMapper.class);
            job.setMapOutputKeyClass(NullWritable.class);
            job.setMapOutputValueClass(Text.class);

            job.setReducerClass(WriteFromTextFileToTextFileByGzipReducer.class);
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            // Uncomment to run a map-only job (one output file per map task).
            // job.setNumReduceTasks(0);

            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            StopWatch clock = new StopWatch();
            clock.start(WriteFromTextFileToTextFileByGzip.class.getSimpleName());

            Configuration conf = new Configuration();
            // Compress the job output as Gzip when the third argument matches the flag.
            if (flag.equals(args[2])) {
                conf.set("mapreduce.output.fileoutputformat.compress", "true");
                conf.set("mapreduce.output.fileoutputformat.compress.codec",
                        "org.apache.hadoop.io.compress.GzipCodec");
            }

            int status = ToolRunner.run(conf, new WriteFromTextFileToTextFileByGzip(), args);

            clock.stop();
            System.out.println(clock.prettyPrint());
            System.exit(status);
        }

        // Passes every input line through unchanged, with a null key.
        static class WriteFromTextFileToTextFileByGzipMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                context.write(NullWritable.get(), value);
            }
        }

        // Writes every value back out, so all records pass through the reduce phase.
        static class WriteFromTextFileToTextFileByGzipReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
            @Override
            protected void reduce(NullWritable key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                for (Text value : values) {
                    context.write(NullWritable.get(), value);
                }
            }
        }
    }

2. Reading a Gzip file

Reading a Gzip file is no different from reading an ordinary text file.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Counter;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.Reducer.Context;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    import org.springframework.util.StopWatch;

    public class ReadFromGzipFileToTextFile extends Configured implements Tool {
        // Sample local paths. The job itself takes its input and output paths from args[0] and args[1].
        static String out = "D:/workspace/bigdata-component/hadoop/test/out/compress/gzipread";
        static String in = "D:/workspace/bigdata-component/hadoop/test/out/compress/gzip";

        @Override
        public int run(String[] args) throws Exception {
            Configuration conf = getConf();
            Job job = Job.getInstance(conf, this.getClass().getSimpleName());
            job.setJarByClass(this.getClass());

            FileInputFormat.addInputPath(job, new Path(args[0]));
            Path outputDir = new Path(args[1]);
            // Delete any previous output so the job does not fail on an existing directory.
            outputDir.getFileSystem(this.getConf()).delete(outputDir, true);
            FileOutputFormat.setOutputPath(job, outputDir);

            job.setMapperClass(ReadFromGzipFileToTextFileMapper.class);
            job.setMapOutputKeyClass(NullWritable.class);
            job.setMapOutputValueClass(Text.class);

            job.setReducerClass(ReadFromGzipFileToTextFileReducer.class);
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            // job.setNumReduceTasks(0);

            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            StopWatch clock = new StopWatch();
            clock.start(ReadFromGzipFileToTextFile.class.getSimpleName());

            // No compression settings are needed to read the Gzip input.
            Configuration conf = new Configuration();
            int status = ToolRunner.run(conf, new ReadFromGzipFileToTextFile(), args);

            clock.stop();
            System.out.println(clock.prettyPrint());
            System.exit(status);
        }

        // ReadFromGzipFileToTextFileMapper and ReadFromGzipFileToTextFileReducer are
        // referenced above, but their definitions are not part of this excerpt.
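The read job above sets no compression properties because Hadoop's text input path picks a decompression codec from the file name extension (for example `.gz`), so the mapper receives plain lines. A Gzip file is not splittable, so each `.gz` file is processed by a single map task. As a minimal standalone illustration of the same round trip, the sketch below uses only the JDK's `java.util.zip` classes rather than Hadoop's `GzipCodec`; the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Writes text lines through Gzip and reads them back as ordinary lines. */
class GzipRoundTrip {
    /** Compresses the lines, one per row, into an in-memory Gzip stream. */
    static byte[] compressLines(List<String> lines) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(new GZIPOutputStream(buffer), StandardCharsets.UTF_8)) {
            for (String line : lines) {
                writer.write(line);
                writer.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Decompresses the bytes and returns the text lines they contain. */
    static List<String> decompressLines(byte[] gzipped) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(gzipped)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    /** Checks for the two-byte Gzip header (0x1f 0x8b). */
    static boolean hasGzipMagic(byte[] data) {
        return data.length >= 2 && (data[0] & 0xff) == 0x1f && (data[1] & 0xff) == 0x8b;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("1,alan,2023-05-29", "2,chan,2023-05-30");
        byte[] gzipped = compressLines(records);
        System.out.println("gzip header present: " + hasGzipMagic(gzipped));
        System.out.println("records unchanged:   " + decompressLines(gzipped).equals(records));
    }
}
```

Once the stream is decompressed, the reader sees the same line-oriented records that were written, which is why the mapper logic does not change between compressed and uncompressed input.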
