Research on Subspace Clustering Algorithms for High-Dimensional Data (高维数据子空间聚类算法研究.pdf) - CSDN文库


Uploaded 2022-07-12 13:02:10 · 892 KB · PDF

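Processing high-dimensional data has become a key challenge, particularly in big data analysis. Research on subspace clustering for high-dimensional data addresses the failure of traditional clustering methods when the number of dimensions is large. In such data, traditional similarity measures such as Euclidean distance lose their effectiveness, and the true structure of the data is often hidden in lower-dimensional subspaces. Subspace clustering therefore looks for cluster structure inside those low-dimensional subspaces.

The thesis focuses on bottom-up subspace clustering algorithms, which typically start from single features and progressively merge similar subspaces into more complex cluster structures. These algorithms have two notable shortcomings: the grid partition problem and the density divergence problem. Grid partitioning can split the data too finely or too coarsely, which degrades clustering quality. Density divergence arises because local density is hard to assess accurately in high-dimensional space, which also hurts the clustering result.

To address these problems, the thesis proposes a subspace clustering algorithm based on kernel density estimation, a nonparametric statistical method for estimating the density function of a data distribution. The algorithm uses it to assess local density and so overcome the limitations of the traditional methods. The thesis first introduces the relevant techniques and concepts, including the choice of kernel function and parameter settings, and defines the key terms the algorithm requires. It then describes the steps of the algorithm in detail: data preprocessing, subspace exploration, density estimation, and cluster formation.

Experiments on a large number of synthetic and real datasets show that the algorithm has good scalability, high clustering accuracy, and good runtime efficiency, and that it outperforms traditional subspace clustering algorithms on large-scale high-dimensional data.

The thesis also outlines future work: adapting the algorithm to a distributed concurrent architecture to process large-scale data more efficiently, and extending it to mixed-type attributes, that is, datasets containing both continuous and discrete features.

Overall, the thesis offers a kernel-density-based algorithm that addresses the difficulties of bottom-up subspace clustering and improves clustering quality and efficiency for high-dimensional data.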
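As a rough illustration of how kernel density estimation can replace a fixed grid when looking for dense regions along one dimension, here is a minimal sketch. It is not the thesis's algorithm: the Gaussian kernel, the fixed `bandwidth`, the `threshold` value, and the helper names `gaussian_kde_1d` and `dense_intervals` are all assumptions made for this example.

```python
import math

def gaussian_kde_1d(samples, bandwidth):
    """Return a function x -> estimated density, using a Gaussian kernel."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

def dense_intervals(samples, bandwidth, threshold, steps=200):
    """Evaluate the density at evenly spaced points between min and max and
    return the intervals (lo, hi) where it stays at or above the threshold."""
    density = gaussian_kde_1d(samples, bandwidth)
    lo, hi = min(samples), max(samples)
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    intervals, start, end = [], None, None
    for x in xs:
        if density(x) >= threshold:
            if start is None:
                start = x          # a dense run begins here
            end = x                # extend the current dense run
        elif start is not None:
            intervals.append((start, end))
            start = None           # the dense run has ended
    if start is not None:
        intervals.append((start, end))
    return intervals

# Two tight groups around 1.0 and 5.0, plus one isolated value at 3.0.
values = [1.0, 1.1, 0.9, 1.2, 1.05, 5.0, 5.1, 4.9, 5.2, 4.95, 3.0]
regions = dense_intervals(values, bandwidth=0.2, threshold=0.3)
for lo, hi in regions:
    print(f"dense interval: [{lo:.2f}, {hi:.2f}]")
```

The boundaries of each dense interval follow the data rather than a predefined cell width, which is the property that makes a density estimate attractive when a fixed grid would cut a cluster in two. The result still depends on the bandwidth and threshold chosen here by hand.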


Abstract (摘要)

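With the continued development of information technology and the Internet, high-dimensional data on the Internet is growing exponentially. Document data, multimedia data, and gene expression data, for example, can have hundreds or thousands of attribute dimensions, or even more, and clustering is an important means of analyzing such data. High-dimensional data differs from low-dimensional data in many respects: commonly used similarity measures such as Euclidean distance become ineffective in high-dimensional space, and clusters often exist only in certain low-dimensional subspaces. These properties pose major challenges for high-dimensional clustering. Developing efficient high-dimensional clustering techniques that can effectively guide practical applications is of great significance for the development of the Internet.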
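The claim that Euclidean distance loses its usefulness in high-dimensional space can be checked with a small experiment. The sketch below is an illustration only, under assumptions of its own: points are drawn uniformly from the unit hypercube, and "relative contrast" (a name chosen for this example) is the gap between the farthest and nearest neighbor of a query point, divided by the nearest distance.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max distance - min distance) / min distance from one random query
    point to n_points random points, all uniform in the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

contrast_low = relative_contrast(dim=2)
contrast_high = relative_contrast(dim=500)
print(f"relative contrast in   2 dimensions: {contrast_low:.2f}")
print(f"relative contrast in 500 dimensions: {contrast_high:.2f}")
```

In 2 dimensions the nearest point is far closer than the farthest one, so the contrast is large. In 500 dimensions all distances fall in a narrow band and the contrast shrinks, so "nearest" and "farthest" carry little information. This is one reason to search for clusters in low-dimensional subspaces instead of the full space.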

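This thesis studies clustering analysis of high-dimensional data. It first surveys the main methods and the current state of high-dimensional clustering, then narrows its focus to subspace clustering algorithms. It examines the shortcomings of bottom-up subspace clustering algorithms and improves on them by proposing an efficient subspace clustering algorithm based on kernel density estimation, whose effectiveness is verified through extensive experiments. The main contents and contributions are as follows: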

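1. The thesis introduces the subspace clustering problem for high-dimensional data, studies bottom-up subspace clustering algorithms in depth, and then identifies the density divergence problem that is common to subspace clustering algorithms.

2. To address the grid partition dilemma and the density divergence problem in subspace clustering, it proposes an efficient subspace clustering algorithm based on kernel density estimation, introduces the techniques the algorithm relies on, defines the required terms and concepts, and describes the concrete steps of the implementation in detail.

3. It evaluates the proposed algorithm experimentally on synthetic and real datasets. The results show that the algorithm improves considerably on traditional subspace clustering algorithms in scalability, clustering accuracy, and runtime efficiency.

4. Finally, it looks ahead to a distributed concurrent architecture for the algorithm and to an extension to mixed-type attributes.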

Keywords: high-dimensional data, clustering analysis, subspace clustering, kernel density estimation
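The bottom-up idea in the abstract, starting from dense regions in single dimensions and combining them into higher-dimensional subspaces, can be sketched as an itemset-counting problem. The thesis's outline mentions frequent itemset mining and FP-tree construction, but this page does not show that method, so the sketch below substitutes a plain level-wise (Apriori-style) search. The function names, the example intervals, and the `min_support` value are all hypothetical.

```python
from itertools import combinations

def to_transactions(points, dense_regions):
    """dense_regions maps a dimension index to a list of (lo, hi) intervals.
    Each point becomes the set of (dimension, interval index) items whose
    interval contains the point's coordinate in that dimension."""
    transactions = []
    for p in points:
        items = set()
        for dim, intervals in dense_regions.items():
            for idx, (lo, hi) in enumerate(intervals):
                if lo <= p[dim] <= hi:
                    items.add((dim, idx))
        transactions.append(frozenset(items))
    return transactions

def frequent_itemsets(transactions, min_support):
    """Level-wise search: count k-item candidates, keep those with enough
    supporting points, then join the survivors into (k+1)-item candidates."""
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]
    result = {}
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        k = len(next(iter(frequent))) + 1 if frequent else 0
        candidates = set()
        for a, b in combinations(frequent, 2):
            union = a | b
            # Keep only unions that add one item and use distinct dimensions.
            if len(union) == k and len({d for d, _ in union}) == k:
                candidates.add(union)
        current = list(candidates)
    return result

# Four points cluster in dimensions 0 and 1; dimension 2 is noise.
points = [
    (1.0, 1.0, 0.1), (1.1, 0.9, 5.0), (0.9, 1.1, 9.0), (1.0, 1.2, 3.0),
    (5.0, 1.0, 7.0), (9.0, 5.0, 2.0),
]
dense_regions = {0: [(0.8, 1.3)], 1: [(0.8, 1.3)]}  # assumed 1-D dense regions
result = frequent_itemsets(to_transactions(points, dense_regions), min_support=3)
for itemset, support in sorted(result.items(), key=lambda kv: len(kv[0])):
    print(sorted(itemset), support)
```

Each frequent itemset names a combination of one-dimensional dense regions shared by at least `min_support` points, which makes it a candidate dense subspace. An FP-tree would reach the same itemsets without repeatedly rescanning the transactions, which matters on large datasets.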

ABSTRACT

With the development of information technology and the Internet, high-dimensional data on the Internet, such as multimedia data and gene microarray data, is growing exponentially, and the number of attributes (dimensions) can reach several hundred. In such circumstances, high-dimensional data clustering is one of the most important methods for analyzing high-dimensional data.

The characteristics of high-dimensional data differ greatly from those of low-dimensional data. For instance, the similarity measures commonly used in low-dimensional clustering no longer yield good clustering results in high-dimensional space; some attributes are correlated with each other to some extent; and the subspaces may be spanned by different combinations of attributes. These features make high-dimensional data clustering a challenging task. Developing high-dimensional clustering techniques on the well-developed theory of data mining is critically important for effectively guiding new directions in Internet development.

This thesis focuses on high-dimensional data clustering techniques. We first summarize the prevalent methods and the current state of high-dimensional data analysis and categorize the existing clustering techniques, such as dimension reduction, manifold learning, distance metric learning, and subspace clustering. We then concentrate on subspace clustering methods. After studying and improving bottom-up subspace clustering methods in depth, we propose a novel subspace clustering method based on kernel density estimation, and extensive experiments show its superior effectiveness and efficiency. The main contents and contributions can be summarized as follows:

1. We first introduce the subspace clustering problem for high-dimensional data and then study bottom-up subspace clustering algorithms in depth. At the end of Chapter 2, the density divergence problem is introduced for further study.

2. We propose a subspace clustering algorithm based on kernel density estimation to effectively address the grid partition dilemma and the density divergence problem. Related techniques are introduced first, and the basic terms and definitions are given. The detailed algorithm is then described explicitly at the end of Chapter 3.

3. We conduct extensive experiments on both synthetic and real datasets. Comparisons with existing subspace clustering algorithms on scalability, accuracy, and efficiency show the superiority of the proposed algorithm.

Abstract (摘要)
ABSTRACT
Contents
Chapter 1 Introduction
1.1 Research Background and Significance
1.1.1 Clustering Analysis and Its Applications
1.1.2 High-Dimensional Data and Its Clustering Analysis
1.2 State of Research at Home and Abroad and Main Open Problems
1.2.1 Clustering Methods for High-Dimensional Data
1.2.2 Applications of High-Dimensional Data Clustering
1.2.3 Main Open Problems
1.3 Main Contents and Structure of This Thesis
Chapter 2 Subspace Clustering Algorithms
2.1 Introduction to the Subspace Problem
2.2 Overview of Subspace Clustering Algorithms
2.3 Bottom-Up Subspace Clustering Algorithms
2.3.1 CLIQUE
2.3.2 ENCLUS
2.3.3 MAFIA
2.3.4 CLTree
2.4 The Density Divergence Problem in Subspace Clustering
Chapter 3 A Subspace Clustering Algorithm Based on Kernel Density Estimation
3.1 Motivation for the Kernel Density Estimation Based Subspace Clustering Algorithm
3.2 Related Techniques
3.2.1 Kernel Density Estimation
3.2.2 Frequent Itemset Mining
3.3 Basic Definitions
3.3.1 One-Dimensional Dense Regions
3.3.2 Subspace Density Threshold
3.4 The Kernel Density Estimation Based Subspace Clustering Algorithm
3.4.1 Data Transformation
3.4.2 FP-tree Construction
3.4.3 Dense Subspace Mining
3.4.4 Time Complexity Analysis
Chapter 4 Experiments
4.1 Datasets
4.1.1 Synthetic Datasets
4.1.2 Real Datasets
4.2 Experimental Results
4.2.1 Scalability of the Algorithm
4.2.2 Results on Synthetic Datasets
4.2.3 Results on Real Datasets
4.3 Parameter Sensitivity Analysis
Chapter 5 Conclusion and Outlook
5.1 Conclusion
5.2 Outlook
References
Publications and Research Participation
Acknowledgments
