Computer Research: Research on Clustering Based on the Hadoop Cloud Computing Platform (PDF, CSDN Library)

Uploaded 2022-06-29 06:45:14 · PDF, 974 KB

Abstract

With the rapid development of Internet technology, datasets are growing dramatically in size. A typical dataset may involve terabytes or even petabytes of data. Extracting meaningful and useful information quickly and efficiently from such vast amounts of data is a new challenge for data mining. Hadoop is an open-source cloud computing platform that provides users with powerful capabilities for data collection, storage, and computing. Cluster analysis is an important field of data mining, and clustering on Hadoop has become a new research hotspot.

In this thesis, we study cloud computing and cluster analysis. Our focus is on the design and implementation of clustering algorithms on the Hadoop platform. Our main contributions are as follows:

(1) We survey cloud computing, including its characteristics and business models. In particular, we examine in depth the key technologies of Hadoop's distributed file system (HDFS) and its programming framework (MapReduce). We also review the basic ideas behind different kinds of clustering methods and analyze several classical algorithms in depth. The purpose of this part is to build a solid foundation for designing and implementing clustering algorithms on Hadoop.

(2) We propose bigKClustering, which can be easily parallelized in MapReduce and completes in only a few MapReduce rounds, whereas many existing clustering algorithms cannot be migrated effectively to the cloud computing platform. Based on the idea of micro-clusters, bigKClustering divides the dataset into many groups and constructs one micro-cluster for each group; each micro-cluster is then treated as a single point. All micro-clusters that are close enough to one another are connected and placed in the same group by an equivalence relation. The center of each resulting group is then calculated and taken as the center of a real cluster in the dataset. Experiments show that bigKClustering not only runs fast and achieves high clustering quality but also scales well.
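The merging step described for bigKClustering can be illustrated with a small single-machine sketch. It is not the thesis's implementation: the representation of a micro-cluster as a center plus a point count, the Euclidean threshold `eps` used for "close enough", the weighted-mean group center, and the names `find` and `merge_micro_clusters` are all assumptions made for illustration. The equivalence relation is modeled with a union-find structure, so micro-clusters that are only transitively connected still end up in the same group.

```python
import math
from collections import defaultdict


def find(parent, i):
    """Return the representative of i's group, compressing the path."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def merge_micro_clusters(centers, counts, eps):
    """Group micro-clusters whose centers lie within eps of each other.

    centers: list of coordinate tuples, one per micro-cluster
    counts:  number of original points summarized by each micro-cluster
    eps:     hypothetical distance threshold for "close enough"
    Returns the sorted, count-weighted center of each merged group.
    """
    n = len(centers)
    parent = list(range(n))
    # Connect every pair of close micro-clusters; union-find takes the
    # transitive closure, which yields the equivalence classes.
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(centers[i], centers[j]) <= eps:
                parent[find(parent, i)] = find(parent, j)

    dim = len(centers[0]) if n else 0
    sums = defaultdict(lambda: [0.0] * dim)
    totals = defaultdict(int)
    for i in range(n):
        root = find(parent, i)
        totals[root] += counts[i]
        for d in range(dim):
            sums[root][d] += centers[i][d] * counts[i]

    return sorted(tuple(s / totals[r] for s in sums[r]) for r in sums)


if __name__ == "__main__":
    centers = [(0, 0), (1, 0), (2, 0), (10, 10)]
    counts = [1, 2, 1, 4]
    # The first three micro-clusters chain together; the fourth stands alone.
    print(merge_micro_clusters(centers, counts, eps=1.5))
```

In this example the micro-clusters at (0, 0) and (2, 0) are farther apart than `eps`, yet they fall into one group because both are close to (1, 0). The pairwise comparison is quadratic in the number of micro-clusters, which is tolerable only because micro-clusters are far fewer than the original points.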

(3) We propose SnIClustering, which aims to reduce the number of intermediate values produced by the MapReduce framework on which the algorithm runs. Based on the ideas of sampling and filtering, SnIClustering seeks to minimize the amount of data involved in the final clustering process. Using a particular sampling technique, it selects a small number of samples that represent the whole original dataset well. It then filters the original dataset according to the distribution characteristics of these samples. The data left after filtering

Wanfang Data

Abstract (Chinese)
Abstract (English)
Contents

Chapter 1 Introduction
§1.1 Research Background and Significance
§1.2 State of Research
§1.3 Research Content
§1.4 Thesis Organization

Chapter 2 Cloud Computing and the Hadoop Architecture
§2.1 Cloud Computing
§2.1.1 Basic Concepts
§2.1.2 Technology Platforms
§2.2 Hadoop Architecture
§2.2.1 Hadoop Technical Background
§2.2.2 Hadoop HDFS (Distributed File System)
§2.2.3 Hadoop MapReduce (Programming Model)
§2.3 Chapter Summary

Chapter 3 Overview of Clustering Research
§3.1 Basic Concepts
§3.2 Related Measures
§3.2.1 Similarity Measures
§3.2.2 Evaluation Measures
§3.3 Clustering Algorithms
§3.3.1 Hierarchical Methods
§3.3.2 Partitioning Methods
§3.3.3 Density-Based Methods
§3.3.4 Clustering High-Dimensional Data
§3.4 Chapter Summary

Chapter 4 A Clustering Algorithm Based on Micro-Clusters and Equivalence Relations
§4.1 Problem Statement
§4.2 Related Work
§4.3 The bigKClustering Algorithm
§4.3.1 Definitions
§4.3.2 Algorithm Idea
§4.3.3 Algorithm Description
§4.3.4 pbigKClustering: Parallelizing the Algorithm on MapReduce
§4.4 Experimental Evaluation
§4.4.1 Datasets
§4.4.2 Experimental Results and Analysis
§4.4.3 Parameter Discussion
§4.5 Chapter Summary

Chapter 5 A Sampling- and Filtering-Based Clustering Algorithm on the MapReduce Framework
§5.1 Problem Statement
§5.2 Related Work
§5.3 The SnIClustering Algorithm
§5.3.1 Algorithm Idea
§5.3.2 Algorithm Description
§5.4 Experimental Evaluation
§5.4.1 Datasets and Parameter Settings
§5.4.2 Experimental Results and Analysis
§5.5 Chapter Summary

Chapter 6 Conclusions and Future Work
§6.1 Summary of the Thesis
§6.2 Future Work

References
Acknowledgements
Papers Published and Research Projects Participated in During the Master's Program
