CLASSIFICATION ALGORITHMS FOR BIG DATA ANALYSIS, A MAP REDUCE APPROACH
V. A. Ayma a,*, R. S. Ferreira a, P. Happ a, D. Oliveira a, R. Feitosa a,b, G. Costa a, A. Plaza c, P. Gamba d
a Dept. of Electrical Engineering, Pontifical Catholic University of Rio de Janeiro, Brazil – (vaaymaq, rsilva, patrick, raul, gilson)@ele.puc-rio.br
b Dept. of Computer and Systems, Rio de Janeiro State University, Brazil
c Dept. of Technology of Computers and Communications, University of Extremadura, Spain - aplaza@unex.es
d Dept. of Electronics, University of Pavia, Italy - paolo.gamba@unipv.it
KEY WORDS: Big Data, MapReduce Framework, Hadoop, Classification Algorithms, Cloud Computing
ABSTRACT:
For many years the scientific community has been concerned with how to increase the accuracy of different classification methods, and major achievements have been made so far. Besides this issue, the increasing amount of data being generated every day by remote sensors raises further challenges to be overcome. In this work, a tool within the scope of the InterIMAGE Cloud Platform (ICP), an open-source, distributed framework for automatic image interpretation, is presented. The tool, named ICP: Data Mining Package, is able to perform supervised classification procedures on huge amounts of data, usually referred to as big data, on a distributed infrastructure using Hadoop MapReduce. The tool has four classification algorithms implemented, taken from WEKA’s machine learning library, namely: Decision Trees, Naïve Bayes, Random Forest and Support Vector Machines (SVM). The results of an experimental analysis using an SVM classifier on data sets of different sizes and for different cluster configurations demonstrate the potential of the tool, as well as aspects that affect its performance.
* Corresponding author
1. INTRODUCTION
The amount of data generated in all fields of science is
increasing extremely fast (Sagiroglu et al., 2013) (Zaslavsky et
al., 2012) (Suthaharan, 2014) (Kishor, 2013). MapReduce
frameworks (Dean et al., 2004), such as Hadoop (Apache
Hadoop, 2014), are becoming a common and reliable choice to
tackle the so-called big data challenge.
Due to its nature and complexity, the analysis of big data raises
new issues and challenges (Li et al., 2014) (Suthaharan, 2014).
Although many machine learning approaches have been
proposed so far to analyse small to medium size data sets, in a
supervised or unsupervised way, just a few of them have been
properly adapted to handle large data sets (Yadav et al., 2013)
(Dhillon et al., 2014) (Pakize et al., 2014). An overview of
some data mining approaches for very large data sets can be
found in (He et al., 2010) (Bekkerman et al., 2012)
(Nandakumar et al., 2014).
There are two main steps in the supervised classification
process. The first is the training step where the classification
model is built. The second is the classification itself, which
applies the trained model to assign unknown data to one out of
a given set of class labels. Although the training step is the one that draws more scientific attention (Liu et al., 2013) (Dai et al., 2014) (Kiran et al., 2013) (Han et al., 2013), it usually relies on a small, representative data set whose size does not pose an issue for big data applications. Thus, the big data challenge affects mostly the classification step.
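As an illustration of this two-step split (a minimal sketch only, not the ICP implementation), the following Java fragment uses WEKA, the library the tool draws its classifiers from: a model is built once on a small labelled set and then applied instance by instance to unlabelled data. The file names are hypothetical, and J48 (WEKA’s decision tree implementation) stands in for whichever of the four algorithms is chosen.

    import weka.classifiers.Classifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TrainThenClassify {
        public static void main(String[] args) throws Exception {
            // Training step: build the model from a small, labelled data set.
            Instances train = new DataSource("training.arff").getDataSet();   // hypothetical file
            train.setClassIndex(train.numAttributes() - 1);
            Classifier model = new J48();                                     // WEKA decision tree
            model.buildClassifier(train);

            // Classification step: apply the trained model to the (potentially huge) unlabelled data.
            Instances unlabeled = new DataSource("unlabeled.arff").getDataSet(); // hypothetical file
            unlabeled.setClassIndex(unlabeled.numAttributes() - 1);
            for (int i = 0; i < unlabeled.numInstances(); i++) {
                double label = model.classifyInstance(unlabeled.instance(i));
                System.out.println(unlabeled.classAttribute().value((int) label));
            }
        }
    }

In this sketch only the loop over unlabelled instances grows with the data volume, which is precisely the part that must be distributed when the input reaches big data scale.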
This work introduces the ICP: Data Mining Package, an open-
source, MapReduce-based tool for the supervised classification
of large amounts of data. The remainder of the paper is
organized as follows: Section 2 presents a brief overview of
Hadoop; the tool is presented in Section 3; a case study is
presented in Section 4 and, finally, the conclusions are
discussed in Section 5.
2. HADOOP OVERVIEW
Apache Hadoop is an open-source implementation of the
MapReduce framework, proposed by Google (Intel IT Center,
2012). It allows the distributed processing of data sets in the
order of petabytes across hundreds or thousands of commodity
computers connected to a network (Kiran et al., 2013). As
presented in (Dean et al., 2004), it has been commonly used to
run parallel applications for big data processing and analysis
(Pakize et al., 2014) (Liu et al., 2013). The next two sections
present Hadoop’s two main components: HDFS and
MapReduce.
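To make the programming model concrete before the two components are described, the sketch below shows the skeleton of a minimal, map-only Hadoop job in Java. It is an illustrative example under assumed class and path names, not part of the paper or of ICP: each mapper receives a split of a text file, is called once per line, and emits a key/value pair that Hadoop writes back to the distributed file system.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Map-only job: each mapper echoes its input records, keyed by byte offset.
    public class PassThroughJob {
        public static class PassThroughMapper
                extends Mapper<LongWritable, Text, LongWritable, Text> {
            @Override
            protected void map(LongWritable offset, Text record, Context context)
                    throws IOException, InterruptedException {
                // A real application would transform or classify the record here.
                context.write(offset, record);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "pass-through");
            job.setJarByClass(PassThroughJob.class);
            job.setMapperClass(PassThroughMapper.class);
            job.setNumReduceTasks(0);                       // no reduce phase
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Hadoop schedules one such mapper per input split, so the same code runs unchanged whether the input occupies one machine or a cluster of thousands.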
2.1 Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is the storage
component of Hadoop. It is designed to reliably store very large
data sets on clusters, and to stream those data at high
throughput to user applications (Shvachko et al., 2010). HDFS
stores file system metadata and application data separately. By
default, it stores three independent copies of each data block
(replication) to ensure reliability, availability and performance
(Kiran et al., 2013).
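As a small, hedged illustration of the replication behaviour described above (again an assumed example, not taken from the paper), the replication factor of a file already stored in HDFS can be read and changed through Hadoop’s Java FileSystem API; the path used here is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/samples.csv");     // hypothetical HDFS path
            FileStatus status = fs.getFileStatus(file);
            System.out.println("Current replication: " + status.getReplication());

            // Request three copies of each block of this file (the HDFS default).
            fs.setReplication(file, (short) 3);
        }
    }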