Computer Research - Clustering Research Based on the Hadoop Cloud Computing Platform.pdf
Uploaded 2022-06-29 06:45:14; 974 KB PDF; 60 pages
Abstract
With the rapid development of Internet technology, dataset sizes are growing
dramatically. A typical dataset may contain terabytes or even petabytes of data.
Extracting meaningful and useful information quickly and efficiently from such vast
amounts of data is a new challenge for data mining. Hadoop is an open-source cloud
computing platform that provides users with powerful capabilities for data collection,
storage, and computing. Cluster analysis is an important field of data mining, and
clustering on Hadoop has become a new research hotspot.
In this thesis, we study cloud computing and cluster analysis. Our focus is the design
and implementation of clustering algorithms on the Hadoop platform.
Our main contributions are as follows:
(1) We survey cloud computing, including its characteristics and business models.
In particular, we examine the key technologies behind Hadoop's distributed file
system (HDFS) and its programming framework (MapReduce). We also review the basic
ideas of different kinds of clustering methods and analyze several classical
algorithms in depth. The purpose of this part is to build a solid foundation for
designing and implementing clustering algorithms on Hadoop.
(2) We propose bigKClustering, which can be easily parallelized in MapReduce and
completes in only a few MapReduce rounds, whereas many existing clustering algorithms
cannot be migrated effectively to a cloud computing platform. Based on the idea of
micro-clusters, bigKClustering divides the dataset into many groups and constructs one
micro-cluster for each group; each micro-cluster is then treated as a single point. All
micro-clusters that are close enough to one another are connected and placed in the
same group by an equivalence relation. The center of each resulting group is then
calculated and taken as the center of a real cluster in the dataset. Experiments show
that bigKClustering runs fast, achieves high clustering quality, and scales well.
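The micro-cluster and equivalence-relation idea in (2) can be illustrated with a small single-machine sketch. This is not the thesis's bigKClustering implementation: the abstract does not say how the dataset is divided or what "close enough" means, so the grid-cell grouping (`cell_size`), the distance threshold (`eps`), and all function names below are assumptions made purely for illustration. The equivalence relation is realized here with a union-find structure.

```python
import math
from collections import defaultdict


def micro_clusters(points, cell_size):
    """Assumed grouping rule: bucket points by grid cell.

    Returns one (center, count) micro-cluster per non-empty cell.
    """
    cells = defaultdict(list)
    for p in points:
        key = tuple(math.floor(x / cell_size) for x in p)
        cells[key].append(p)
    result = []
    for members in cells.values():
        n = len(members)
        center = tuple(sum(xs) / n for xs in zip(*members))
        result.append((center, n))
    return result


def find(parent, i):
    """Union-find lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def cluster_centers(points, cell_size, eps):
    """Connect micro-clusters whose centers lie within eps (an assumed
    closeness test), then return the weighted center of each group."""
    mcs = micro_clusters(points, cell_size)
    parent = list(range(len(mcs)))
    for i in range(len(mcs)):
        for j in range(i + 1, len(mcs)):
            if math.dist(mcs[i][0], mcs[j][0]) <= eps:
                parent[find(parent, i)] = find(parent, j)

    # Each union-find root identifies one equivalence class of micro-clusters.
    groups = defaultdict(list)
    for i, mc in enumerate(mcs):
        groups[find(parent, i)].append(mc)

    centers = []
    for members in groups.values():
        total = sum(n for _, n in members)
        dim = len(members[0][0])
        centers.append(
            tuple(sum(c[d] * n for c, n in members) / total for d in range(dim))
        )
    return sorted(centers)
```

In a MapReduce setting one might expect the micro-cluster construction to map naturally onto a map and reduce step keyed by group, but that mapping is also an assumption; the abstract only states that the algorithm needs few rounds.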
(3) We propose SnIClustering, which aims to reduce the number of intermediate values
produced by the MapReduce framework on which the algorithm runs. Based on the ideas of
sampling and filtering, SnIClustering seeks to minimize the amount of data involved in
the final clustering process. Using a particular sampling technique, it selects a small
number of samples that represent the whole original dataset well. It then filters the
original dataset according to the distribution characteristics of these samples. The
data left after filtering, together with the samples, are clustered by an existing
algorithm on a single node. Experiments show that SnIClustering runs very fast, obtains
relatively good clustering results, and scales well.
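The sample-then-filter pipeline in (3) can be sketched on a single machine as follows. This is only an illustration under stated assumptions, not the thesis's SnIClustering: the abstract names neither the sampling technique nor the filtering criterion, so the sketch substitutes uniform random sampling, a hypothetical rule that drops every point within `radius` of a sample (treating it as already represented), and a plain k-means as the "existing algorithm". All function and parameter names are invented for the example.

```python
import math
import random


def sample_and_filter(points, n_samples, radius, seed=0):
    """Assumed rules: uniform sampling, then drop points near any sample."""
    rng = random.Random(seed)
    samples = rng.sample(points, n_samples)
    kept = [
        p for p in points
        if all(math.dist(p, s) > radius for s in samples)
    ]
    return samples, kept


def kmeans(points, k, iters=20):
    """Plain Lloyd iteration, standing in for the 'existing algorithm'."""
    centers = list(points[:k])
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            buckets[nearest].append(p)
        # An empty bucket keeps its previous center.
        centers = [
            tuple(sum(xs) / len(b) for xs in zip(*b)) if b else centers[i]
            for i, b in enumerate(buckets)
        ]
    return centers


def sni_sketch(points, k, n_samples, radius, seed=0):
    """Cluster only the samples plus the data that survived filtering."""
    samples, kept = sample_and_filter(points, n_samples, radius, seed)
    reduced = samples + kept
    return kmeans(reduced, k), len(reduced)
```

The point of the sketch is the data reduction: the final clustering step sees only `samples + kept`, which is what would let it run on a single node when the reduced set is small enough.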
Keywords: large datasets; data mining; cloud computing; Hadoop; clustering
Contents
Abstract (in Chinese)
Abstract
Contents
Chapter 1 Introduction
§1.1 Research Background and Significance
§1.2 State of Research
§1.3 Research Content
§1.4 Thesis Organization
Chapter 2 Cloud Computing and the Hadoop Architecture
§2.1 Cloud Computing
§2.1.1 Basic Concepts
§2.1.2 Technology Platforms
§2.2 Hadoop Architecture
§2.2.1 Technical Background of Hadoop
§2.2.2 Hadoop HDFS (Distributed File System)
§2.2.3 Hadoop MapReduce (Programming Model)
§2.3 Chapter Summary
Chapter 3 Overview of Clustering Research
§3.1 Basic Concepts
§3.2 Related Measures
§3.2.1 Similarity Measures
§3.2.2 Evaluation Measures
§3.3 Clustering Algorithms
§3.3.1 Hierarchical Methods
§3.3.2 Partitioning Methods
§3.3.3 Density-Based Methods
§3.3.4 High-Dimensional Data Clustering
§3.4 Chapter Summary
Chapter 4 A Clustering Algorithm Based on Micro-Clusters and Equivalence Relations
§4.1 Problem Statement
§4.2 Related Work
§4.3 The bigKClustering Algorithm
§4.3.1 Definitions
§4.3.2 Algorithm Idea
§4.3.3 Algorithm Description
§4.3.4 Parallelizing the Algorithm on MapReduce: pbigKClustering
§4.4 Experimental Evaluation
§4.4.1 Datasets
§4.4.2 Experimental Results and Analysis
§4.4.3 Parameter Discussion
§4.5 Chapter Summary
Chapter 5 A Sampling- and Filtering-Based Clustering Algorithm on the MapReduce Framework
§5.1 Problem Statement
§5.2 Related Work
§5.3 The SnIClustering Algorithm
§5.3.1 Algorithm Idea
§5.3.2 Algorithm Description
§5.4 Experimental Evaluation
§5.4.1 Datasets and Parameter Settings
§5.4.2 Experimental Results and Analysis
§5.5 Chapter Summary
Chapter 6 Conclusions and Future Work
§6.1 Summary of the Thesis
§6.2 Future Work
References
Acknowledgements
Papers Published and Research Projects Participated in During the Master's Program