CLASSIFICATION ALGORITHMS FOR BIG DATA ANALYSIS, A MAP REDUCE APPROACH
V. A. Ayma a,*, R. S. Ferreira a, P. Happ a, D. Oliveira a, R. Feitosa a,b, G. Costa a, A. Plaza c, P. Gamba d
a Dept. of Electrical Engineering, Pontifical Catholic University of Rio de Janeiro, Brazil – (vaaymaq, rsilva, patrick, raul, gilson)@ele.puc-rio.br
b Dept. of Computer and Systems, Rio de Janeiro State University, Brazil
c Dept. of Technology of Computers and Communications, University of Extremadura, Spain - aplaza@unex.es
d Dept. of Electronics, University of Pavia, Italy - paolo.gamba@unipv.it
KEY WORDS: Big Data, MapReduce Framework, Hadoop, Classification Algorithms, Cloud Computing
ABSTRACT:
For many years the scientific community has been concerned with how to increase the accuracy of different classification methods, and major achievements have been made so far. Besides this issue, the increasing amount of data being generated every day by remote sensors raises further challenges to be overcome. In this work, a tool within the scope of the InterIMAGE Cloud Platform (ICP), an open-source, distributed framework for automatic image interpretation, is presented. The tool, named ICP: Data Mining Package, is able to perform supervised classification procedures on huge amounts of data, usually referred to as big data, on a distributed infrastructure using Hadoop MapReduce. The tool has four classification algorithms implemented, taken from WEKA’s machine learning library, namely: Decision Trees, Naïve Bayes, Random Forest and Support Vector Machines (SVM). The results of an experimental analysis using an SVM classifier on data sets of different sizes and for different cluster configurations demonstrate the potential of the tool, as well as aspects that affect its performance.
* Corresponding author
1. INTRODUCTION
The amount of data generated in all fields of science is
increasing extremely fast (Sagiroglu et al., 2013) (Zaslavsky et
al., 2012) (Suthaharan, 2014) (Kishor, 2013). MapReduce
frameworks (Dean et al., 2004), such as Hadoop (Apache
Hadoop, 2014), are becoming a common and reliable choice to
tackle the so-called big data challenge.
Due to its nature and complexity, the analysis of big data raises
new issues and challenges (Li et al., 2014) (Suthaharan, 2014).
Although many machine learning approaches have been
proposed so far to analyse small to medium size data sets, in a
supervised or unsupervised way, just a few of them have been
properly adapted to handle large data sets (Yadav et al., 2013)
(Dhillon et al., 2014) (Pakize et al., 2014). An overview of
some data mining approaches for very large data sets can be
found in (He et al., 2010) (Bekkerman et al., 2012)
(Nandakumar et al., 2014).
There are two main steps in the supervised classification
process. The first is the training step where the classification
model is built. The second is the classification itself, which
applies the trained model to assign unknown data to one out of
a given set of class labels. Although the training step is the one that draws more scientific attention (Liu et al., 2013) (Dai et al., 2014) (Kiran et al., 2013) (Han et al., 2013), it usually relies on a small, representative data set whose size does not pose an issue for big data applications. Thus, the big data challenge affects mostly the classification step.
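As an illustration of this two-step split (a minimal sketch only, not the ICP implementation), the following Java fragment uses WEKA, the library the tool draws its classifiers from: a model is built once on a small labelled set and then applied instance by instance to unlabelled data. The file names are hypothetical, and J48 (WEKA’s decision tree implementation) stands in for whichever of the four algorithms is chosen.

    import weka.classifiers.Classifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TrainThenClassify {
        public static void main(String[] args) throws Exception {
            // Training step: build the model from a small, labelled data set.
            Instances train = new DataSource("training.arff").getDataSet();   // hypothetical file
            train.setClassIndex(train.numAttributes() - 1);
            Classifier model = new J48();                                     // WEKA decision tree
            model.buildClassifier(train);

            // Classification step: apply the trained model to the (potentially huge) unlabelled data.
            Instances unlabeled = new DataSource("unlabeled.arff").getDataSet(); // hypothetical file
            unlabeled.setClassIndex(unlabeled.numAttributes() - 1);
            for (int i = 0; i < unlabeled.numInstances(); i++) {
                double label = model.classifyInstance(unlabeled.instance(i));
                System.out.println(unlabeled.classAttribute().value((int) label));
            }
        }
    }

In this sketch only the loop over unlabelled instances grows with the data volume, which is precisely the part that must be distributed when the input reaches big data scale.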
This work introduces the ICP: Data Mining Package, an open-
source, MapReduce-based tool for the supervised classification
of large amounts of data. The remainder of the paper is
organized as follows: Section 2 presents a brief overview of
Hadoop; the tool is presented in Section 3; a case study is
presented in Section 4 and, finally, the conclusions are
discussed in Section 5.
2. HADOOP OVERVIEW
Apache Hadoop is an open-source implementation of the
MapReduce framework, proposed by Google (Intel IT Center,
2012). It allows the distributed processing of data sets in the
order of petabytes across hundreds or thousands of commodity
computers connected to a network (Kiran et al., 2013). As
presented in (Dean et al., 2004), it has been commonly used to
run parallel applications for big data processing and analysis
(Pakize et al., 2014) (Liu et al., 2013). The next two sections
present Hadoop’s two main components: HDFS and
MapReduce.
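To make the programming model concrete before the two components are described, the sketch below shows the skeleton of a minimal, map-only Hadoop job in Java. It is an illustrative example under assumed class and path names, not part of the paper or of ICP: each mapper receives a split of a text file, is called once per line, and emits a key/value pair that Hadoop writes back to the distributed file system.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Map-only job: each mapper echoes its input records, keyed by byte offset.
    public class PassThroughJob {
        public static class PassThroughMapper
                extends Mapper<LongWritable, Text, LongWritable, Text> {
            @Override
            protected void map(LongWritable offset, Text record, Context context)
                    throws IOException, InterruptedException {
                // A real application would transform or classify the record here.
                context.write(offset, record);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "pass-through");
            job.setJarByClass(PassThroughJob.class);
            job.setMapperClass(PassThroughMapper.class);
            job.setNumReduceTasks(0);                       // no reduce phase
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Hadoop schedules one such mapper per input split, so the same code runs unchanged whether the input occupies one machine or a cluster of thousands.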
2.1 Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is the storage
component of Hadoop. It is designed to reliably store very large
data sets on clusters, and to stream those data at high
throughput to user applications (Shvachko et al., 2010). HDFS
stores file system metadata and application data separately. By
default, it stores three independent copies of each data block
(replication) to ensure reliability, availability and performance
(Kiran et al., 2013).
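As a small, hedged illustration of the replication behaviour described above (again an assumed example, not taken from the paper), the replication factor of a file already stored in HDFS can be read and changed through Hadoop’s Java FileSystem API; the path used here is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/samples.csv");     // hypothetical HDFS path
            FileStatus status = fs.getFileStatus(file);
            System.out.println("Current replication: " + status.getReplication());

            // Request three copies of each block of this file (the HDFS default).
            fs.setReplication(file, (short) 3);
        }
    }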