scale data set. Thus, $e^{-scatter}$ will tend to zero and often underflow in the implementation of the algorithm.
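A minimal NumPy sketch of this numerical issue (the value of `scatter` here is an arbitrary illustration, not taken from the paper):

```python
import numpy as np

# In IEEE double precision, exp(-x) underflows to 0.0 once x exceeds
# roughly 745. On a large-scale data set the within-cluster scatter
# easily passes this threshold, so the exponential weight vanishes.
scatter = 1000.0
print(np.exp(-scatter))  # prints 0.0
```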
In this paper, we propose a k-means type clustering framework that weights features in a new fashion with an $l_2$-norm regularization. Based on this method of weighting features, a new clustering algorithm, named $l_2$-Wkmeans, is proposed by extending the W-k-means algorithm [1,8] for clustering numerical data sets, and two new algorithms, named $l_2$-NOF and $l_2$-NDM, are proposed by extending MKM_NOF and MKM_NDM [14], respectively, for clustering categorical data sets. By combining the $l_2$-norm regularization with a nonnegativity constraint on the feature weights, our proposed methods can effectively select discriminative features and reduce the effects of noisy features in the clustering process. In addition, unlike MKM_NOF and MKM_NDM, which have no feature selection capability, and W-k-means, which calculates a single weight for every feature over the whole data set, our proposed methods calculate a weight for every feature in each cluster, which makes them more robust to noise than the original methods. We then derive the iterative updating rules of the three algorithms by minimizing the corresponding objective functions. Extensive experiments on both numerical and categorical data sets corroborate that our proposed methods improve clustering results through more effective feature selection with an $l_2$-norm regularization.
The main contributions of this work are twofold:
• We propose a k-means type clustering framework that weights features in a new fashion with an $l_2$-norm regularization. On the basis of this framework, we propose three clustering algorithms: the $l_2$-Wkmeans algorithm for clustering numerical data sets, and the $l_2$-NOF and $l_2$-NDM algorithms for clustering categorical data sets.
• We develop an objective function for the weighting k-means type clustering framework and give a complete proof of the convergence of our proposed algorithms by optimizing the corresponding objective functions.
The rest of the paper is organized as follows: a brief review of k-means type clustering algorithms on numerical and categorical data sets is presented in Section 2. Section 3 introduces the details of our proposed algorithms. Experiments on both numerical and categorical data sets are presented in Section 4. Finally, we discuss the features of our proposed algorithms in Section 5 and conclude the paper in Section 6.
2. Related work
This section gives a brief survey of previous work on k-means type clustering algorithms for numerical and categorical data sets.
2.1. k-means type clustering on numerical data sets
The last decades have witnessed the rapid development of k-means type clustering algorithms, which range from no weighting k-means type algorithms to various weighting k-means algorithms.
2.1.1. No weighting k-means type clustering algorithms
The basic k-means algorithm aims at finding a partition such that the sum of the squared distances between the empirical means of the clusters and the objects in the clusters is minimized. The basic k-means algorithm has been improved in different aspects to overcome its weaknesses. Since k-means type algorithms are sensitive to the initial centroids, Arthur and Vassilvitskii proposed the k-means++ algorithm [15], which chooses each new initial centroid with probability proportional to its squared distance from the centroids already chosen. Another weakness of k-means type algorithms is the need to manually tune the parameter k (the number of clusters). Pelleg and Moore proposed X-means [16] to automatically seek the number of clusters by optimizing a criterion such as the Akaike Information Criterion or the Bayesian Information Criterion. The basic k-means algorithm and most of its variations usually get stuck at local optima. Therefore, Bagirov proposed a global k-means algorithm [17,18], which dynamically adds one cluster center at a time and uses each data point as a candidate for the k-th cluster center. Since the global k-means algorithm mentioned above incurs a large computational cost, Bai et al. [19] proposed an acceleration mechanism for the production of new cluster centroids by applying local geometrical information to approximately describe the set of objects.
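To make the seeding rule of k-means++ concrete, here is a minimal NumPy sketch (our own simplified illustration, not the implementation from [15]):

```python
import numpy as np

def kmeans_pp_seeds(X, k, rng=None):
    """Choose k initial centroids with the k-means++ rule: each new
    centroid is sampled with probability proportional to its squared
    distance from the nearest centroid chosen so far."""
    rng = rng or np.random.default_rng(0)
    n = X.shape[0]
    centroids = [X[rng.integers(n)]]  # first centroid: uniform at random
    while len(centroids) < k:
        # squared distance of every point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        centroids.append(X[rng.choice(n, p=d2 / d2.sum())])  # D^2 sampling
    return np.array(centroids)
```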
Moreover, in order to meet the demands of clustering data sets in different applications, several variations [20–25] of the k-means algorithm have been proposed. Bradley et al. [20] presented a fast, scalable, single-pass version of k-means that is able to solve the clustering problem for data streams. Dhillon and Modha studied a spherical k-means algorithm [21] for clustering document data sets. To cluster large-scale data sets, the triangle inequality has been employed to accelerate the standard k-means algorithm [22–24]. Similar to our proposed method, Regularized k-means [25] aims at improving clustering results on high-dimensional data sets by using a penalty term to eliminate the effects of redundant features. However, Regularized k-means [25] applies an $l_1$-norm penalty to the centroids, whereas our proposed method applies an $l_2$-norm penalty to the feature weights.
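Schematically, in our own illustrative notation (not the exact objectives of [25] or of Section 3), the two penalties attach to different quantities:

$$\min_{U,Z} \sum_{p=1}^{k} \sum_{i=1}^{n} u_{ip} \|X_i - Z_p\|^2 + \lambda \sum_{p=1}^{k} \|Z_p\|_1 \quad \text{(penalize centroids, as in [25])}$$

$$\min_{U,W,Z} \sum_{p=1}^{k} \sum_{i=1}^{n} u_{ip} \sum_{j=1}^{m} w_{pj} (x_{ij} - z_{pj})^2 + \lambda \sum_{p=1}^{k} \sum_{j=1}^{m} w_{pj}^2 \quad \text{(penalize feature weights)}$$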
This type of k-means algorithm treats all the features equally and has no capability for feature selection. Therefore, the clustering results produced by this type of algorithm are usually not promising when a data set includes noisy features.
2.1.2. Weighting k-means type clustering algorithms
All the features are employed equally in the clustering process of no weighting k-means type clustering algorithms. In practice, however, a useful clustering pattern usually hides in a subspace defined by a subset of all the features. Therefore, many researchers [1,7,8] have devoted themselves to studying various ways of weighting features to find the hidden subspaces. Huang et al. proposed the W-k-means [1] algorithm, which uses the β-th power of the feature weights to tune their distribution. In W-k-means, the same feature shares a single weight value across all clusters. As a matter of fact, the same feature usually has different importance in different clusters in real applications. To solve this problem, Chan et al. proposed the Attributes-Weighting clustering Algorithm (AWA) [8], which uses a weighting technique similar to that of W-k-means to calculate a weight for every feature in each cluster.
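To make this per-cluster weighting concrete, the following minimal NumPy sketch (our own illustration, not the authors' code) evaluates the weighted within-cluster dispersion that AWA minimizes; the formal objective is given as Eq. (1) below:

```python
import numpy as np

def awa_objective(X, U, Z, W, beta=2.0):
    """Eq. (1): sum over clusters p and their members i of
    sum_j W[p, j]**beta * (X[i, j] - Z[p, j])**2."""
    # X: (n, m) data; U: (n, k) binary memberships;
    # Z: (k, m) centroids; W: (k, m) per-cluster feature weights
    total = 0.0
    for p in range(Z.shape[0]):
        members = X[U[:, p] == 1]            # objects assigned to cluster p
        sq = (members - Z[p]) ** 2           # squared deviations, (|C_p|, m)
        total += np.sum(W[p] ** beta * sq)   # weight each feature dimension
    return total
```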
Let $X = \{X_1, X_2, \ldots, X_n\}$ be a set of $n$ objects. Object $X_i = \{x_{i1}, x_{i2}, \ldots, x_{im}\}$ is characterized by a set of $m$ features. The membership matrix $U$ is an $n \times k$ binary matrix, where $u_{ip} = 1$ indicates that object $i$ is assigned to cluster $p$; otherwise, it is not assigned to cluster $p$. $Z = \{Z_1, Z_2, \ldots, Z_k\}$ is a set of $k$ vectors representing the centroids of the $k$ clusters. $W$ is a weighting matrix in which each row $W_p = \{w_{p1}, w_{p2}, \ldots, w_{pm}\}$ denotes the weight vector of all the features in cluster $p$. The attributes-weighting clustering algorithm can be formulated as
$$P(U, W, Z) = \sum_{p=1}^{k} \sum_{i=1}^{n} u_{ip} \sum_{j=1}^{m} w_{pj}^{\beta} \, (x_{ij} - z_{pj})^2, \qquad (1)$$

subject to

$$u_{ip} \in \{0, 1\}, \quad \sum_{p=1}^{k} u_{ip} = 1, \quad \sum_{j=1}^{m} w_{pj} = 1, \quad 0 \le w_{pj} \le 1, \qquad (2)$$
where $\beta$ is a parameter that tunes the distribution of feature weights; $\beta > 1$, and the larger the value of $\beta$, the more similar the weights of different features within a cluster become. The AWA algorithm can