没有合适的资源?快使用搜索试试~ 我知道了~
欧几里得最小生成树算法通常以二次计算复杂性运行,这对于大规模的高维数据集不切实际。 在本文中,我们针对高维数据提出了一种新的两级近似欧几里德最小生成树算法。 在第一级中,我们对给定的数据集执行离群值检测,以识别少量边界点,然后在简化的数据集上运行标准的Prim算法。 在第二级中,我们进行k近邻搜索以完成近似的欧几里德最小生成树构造过程。 在样本数据集上的实验结果证明了该方法的有效性,同时保持了较高的近似精度。
资源推荐
资源详情
资源评论
A Fast Two-Level Approximate Euclidean
Minimum Spanning Tree Algorithm
for High-Dimensional Data
Xia Li Wang
1(&)
, Xiaochun Wang
2
, and Xiaqiong Li
2
1
School of Information Engineering, Changan University, Xi’an 710061, China
xlwang@chd.edu.cn
2
School of Software Engineering,
Xi’an Jiaotong University, Xi’an 710049, China
xiaocchunwang@mail.xjtu.edu.cn,
xiaqiongli@stu.xjtu.edu.cn
Abstract. Euclidean minimum spanning tree algorithms run typically with
quadratic computational complexity, which is not practical for large scale high
dimensional datasets. In this paper, we propose a new two-level approximate
Euclidean minimum spanning tree algorithm for high dimensional data. In the
first level, we perform outlier detection for a given data set to identify a small
amount of boundary points and run standard Prim’s algorithm on the reduced
dataset. In the second level, we conduct a k-nearest neighbors search to com-
plete an approximate Euclidean Minimum Spanning Tree construction process.
Experimental results on sample data sets demonstrate the efficiency of the
proposed method while keeping high approximate precision.
Keywords: Euclidean minimum spanning tree
Minimum spanning tree
Approximate minimum spanning tree
Nearest neighbor search
1 Introduction
Finding a minimum spanning tree (MST) for a given connected graph is a fundamental
problem with diverse application domains and many efficient MST algorithms have
been developed. In today’s MST tasks, usually, a set of N d-dimensional data points is
given and the problem is commonly solved in the Euclidean setting, giving rise to the
so-called Euclidean minimum spanning tree (EMST) problem. In this case, there are
V ¼ N vertices and E ¼ NN 1ðÞ=2 edges in the complete graph, and standard EMST
algorithms, such as Kruskal’s[1] and Prim’s[2], have a time complexity roughly equal
to O dN
2
. For large-scale high- dimensional datasets, standard EMST algorithms will
lose its time performance. Fortunately, in many practical applications, an exact EMST
can be generally replaced by an approximate one without degrading the quality of the
final application.
Being a compact data representation of a given data set, EMST has been exten-
sively used in image segmentation [3, 4], cluster analysis [5–7], classification [8], and
manifold learning [9]. In particular, we are interested in EMST construction for cases
© Springer International Publishing AG, part of Springer Nature 2018
P. Perner (Ed.): MLDM 2018, LNAI 10935, pp. 273–287, 2018.
https://doi.org/10.1007/978-3-319-96133-0_21
where data consist of well separated clusters and the number of clusters is significantly
smaller than the number of data points [3, 10, 11]. In these situations, in-cluster edges
have weights that are significantly smaller than those longest edges corresponding to
cluster breakers and can be found quickly by fast k-nearest neighbors search algo-
rithms. However, the location of the longest edges in such EMSTs can not be deter-
mined quickly and is often the bottle neck for fast EMST algorithms.
In this paper, we propose a new clustering-inspired EMST algorithm that is both
computationally efficient and competent with the state-of-the-art EMST algorithms.
Basically, our novel data-dependent clustering-inspired EMST algorithm tries to
identify the relatively small number of longest edges in an EMST by finding and
retaining the relatively small number of boundary points and outliers before the
complete EMST is constructed. Being composed of two-level structures, the working
principle behind our novel data-dependent EMST algorithm is illustrated in Fig. 1.
During the first level, a relatively small group of the boundary points (including
outliers) are identified using angle based outlier factors (ABOF) and retained, upon
which the Prim’s EMST algorithm is applied to locate the small number of longest
edges. During the second level, we perform an index-based fast search algorithm to
find k-nearest neighbors (kNN) for each data point, based on which cluster-wise
EMSTs are formed. To be as general as possible, our proposed algorithm has no
specific requirements on the dimensionality of the data sets and the format of the
distance measure, although Euclidean distance is used as the edge weight in our
experiments. Experiments conducted on sample datasets show that our proposed
algorithm can obviously improve the run time performance of EMST while keeping
high precision.
Fig. 1. An illustration of the simple idea.
274 X. L. Wang et al.
The rest of this paper is organized as follows. Section 2 gives a review of some
related work. Section 3 presents the proposed data- dependent EMST method. Sec-
tion 4 shows the experimental comparisons. In Sect. 5, we give the conclusion and
future work.
2 Related Work
Work related to the method presented in this paper falls into two main categories:
EMST algorithms and density-based outlier detection methods.
2.1 MST Algorithms
For a given connected and weighted graph G ¼ E, VðÞ, Bor°uvka’s algorithm begins
with each vertex of a graph being a tree, and for each consecutive iteration, it selec ts the
shortest edge from a tree to another tree and combines them. This process continues
until all the trees are combined into one tree [12]. Proposed independently by Jarn´ık
[13], Prim [2] and Dijkstra [14] in 1930, 1957 and 1959, respectively, the famous
Prim’s algorithm first arbitrarily selects a vertex as a tree, and then repeatedly adds the
shortest edge that connects a new vertex to the tree, until all the vertices are included.
Proposed in 1956, Kruskal’s algorithm starts with sorting all the edges by their weights
in a non-decreasing order, treats each vertex as a tree, and iteratively combines the trees
by adding edges in the sorted order excluding those leading to a cycle until all the trees
are combined into one tree [1]. The time complexity of these classic MST algorithms is
O ElogVðÞ.
To construct an MST in the Euclidean setting, standard Prim’s algorithm requires a
quadratic running time. To be more efficient, in 1978, Bentley and Friedman [15]
proposed to use a kd-tree in Prim’s algorithm to enhance the search for the next edge to
add to the tree, which can reach an O NlogNðÞrunning time for most data distributions.
In 1985, Preparata and Shamos [16] gave a lower bound for the EMST problem of
NlogNðÞ, which has been the tightest known lower bound. In 1993, Callahan and
Kosaraju’s proposed Well-Separated Pair Decomposition (WSPD) [17] which forms
the basis of most recent EMST algorithms. The WSPD partitions data points into a set
of pairs of tree nodes such that the nodes in any pair are farther apart than the diameter
of either n ode. It can be shown that the WSPD has O NðÞpairs of nodes, and that the
MST is a subset of the edges formed between the closest pair of points in each pair of
nodes. In 2000, Narasimhan and Zachariasen applied WSPD to compute neighbors of
components for Boruvka’s algorithm to find edges of the MST [18]. However, the
constant in the O NðÞsize of the WSPD grows exponentially with the data dimension
and is often very large in practice. In 2010, March et al. presented a new dual-tree
algorithm for efficiently computing the EMST [19], which is superficially similar to the
method in [18] except that the WSPD is replaced by the new dual-tree data structure
and referred to in the following as FEMST algorithm. They used adaptive algorithm
analysis to prove the tightest (and possibly optimal) runtime bound for the EMST
problem to-date. Experiments conducted demonstrated the scalability of their met hod
on astronomical data sets.
A Fast Two-Level Approximate Euclidean Minimum Spanning Tree Algorithm 275
剩余14页未读,继续阅读
资源评论
weixin_38719578
- 粉丝: 6
- 资源: 928
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
最新资源
资源上传下载、课程学习等过程中有任何疑问或建议,欢迎提出宝贵意见哦~我们会及时处理!
点击此处反馈
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功