ABSTRACT
With the development of information technology and the Internet, high-dimensional
data such as multimedia data and gene microarray data is growing exponentially,
and the number of attributes (dimensions) can reach several hundred. Under these
circumstances, clustering is one of the most important methods for analyzing
high-dimensional data.
The characteristics of high-dimensional data differ greatly from those of
low-dimensional data. For instance, the similarity measures commonly used in
low-dimensional data clustering no longer yield good clustering results in
high-dimensional space. In addition, some attributes are correlated with each
other to some extent, and the subspaces may be spanned by different combinations
of attributes. These features make high-dimensional data clustering a challenging
task. Studying high-dimensional data clustering techniques on the basis of the
well-developed theory of data mining is therefore critically important for
effectively guiding new directions of Internet development.
This thesis focuses on high-dimensional data clustering techniques. We first
summarize the prevalent methods and the current state of high-dimensional data
analysis, and categorize the existing high-dimensional data clustering techniques,
such as dimension reduction, manifold learning, distance metric learning, and
subspace clustering. We then concentrate on subspace clustering methods. After
studying and improving the bottom-up subspace clustering methods in depth, we
propose a novel subspace clustering method based on kernel density estimation.
Extensive experiments show that the proposed method is more effective and more
efficient than existing methods. The main contents and contributions can be
summarized as follows:
1. We first introduce the subspace clustering problem for high-dimensional data
and then study the bottom-up subspace clustering algorithms in depth. At the end
of Chapter 2, the density divergence problem is introduced for further study.
2. We propose a subspace clustering algorithm based on kernel density estimation
to effectively address the dilemma of grid partition and the density divergence
problem. Some related techniques are introduced first, and the basic terms and
definitions are given. The algorithm is then described in detail at the end of
Chapter 3.
3. We conduct extensive experiments on both synthetic and real datasets. The
comparisons of scalability, accuracy, and efficiency with existing subspace
clustering algorithms show the superiority of the proposed algorithm.