没有合适的资源?快使用搜索试试~ 我知道了~
K-匿名性是一种在发布微数据以进行数据挖掘时保护隐私的好方法。 微聚合和泛化是实现k-匿名性的两种典型方法。 但是它们两者在匿名化混合微数据上都有一些缺陷。 为了解决该问题,我们提出了一种新的匿名化方法,称为MAGE,该方法在处理混合微数据时可以保留比泛化和微聚合更多的语义。 MAGE的思想是将数值数据的均值向量与分类数据的泛化值组合为聚类质心,并将其用作相应聚类中元组的化身。 我们还提出了一种有效的TSCKA算法来匿名化混合数据。 实验结果表明,与两种著名的匿名算法Incognito和KACA相比,MAGE可以有效地对混合微数据进行匿名处理,而TSCKA算法可以在数据质量和算法效率之间取得更好的折衷。
资源推荐
资源详情
资源评论
MAGE: A semantics retaining K-anonymization method for mixed data
Jianmin Han
a,
⇑
, Juan Yu
b
, Yuchang Mo
a
, Jianfeng Lu
a
, Huawen Liu
a
a
Department of Computer Science and Technology, Zhejiang Normal University, Jinhua 321004, China
b
Department of Computer Science and Technology, Fudan University, Shanghai 200433, China
article info
Article history:
Received 15 July 2011
Received in revised form 5 October 2013
Accepted 6 October 2013
Available online 17 October 2013
Keywords:
K-anonymity
Generalization
Microaggregation
Privacy preservation
abstract
K-anonymity is a fine approach to protecting privacy in the release of microdata for data mining.
Microaggregation and generalization are two typical methods to implement k-anonymity. But both of
them have some defects on anonymizing mixed microdata. To address the problem, we propose a novel
anonymization method, named MAGE, which can retain more semantics than generalization and
microaggregation in dealing with mixed microdata. The idea of MAGE is to combine the mean vector
of numerical data with the generalization values of categorical data as a clustering centroid and to use
it as incarnation of the tuples in the corresponding cluster. We also propose an efficient TSCKA algorithm
to anonymize mixed data. Experimental results show that MAGE can anonymize mixed microdata
effectively and the TSCKA algorithm can achieve better trade-off between data quality and algorithm
efficiency comparing with two well-known anonymization algorithms, Incognito and KACA.
Ó 2013 Elsevier B.V. All rights reserved.
1. Introduction
Microdata, i.e., data containing individual-specific information,
play an increasingly important role in data analysis and scientific
research. Therefore many organizations are increasingly collecting
and publishing microdata. However, publishing microdata may
threaten individuals’ privacy, because it is still possible to link
the released data with other public or easy-to-access database to
re-identify individuals’ identities, although attributes which can
obviously identify individuals’ identities have been removed be-
fore publication. Sweeney [1] pointed out that there were about
87% of Americans can be uniquely identified through the combina-
tion of their gender, birthday and postcode. If an inpatient dataset
containing these three attributes are released by medical organiza-
tions, adversaries can re-identify the disease of a specific inpatient
by linking with other external table, though the identity attributes
have been removed before publication.
To address the problem, Samarati [2] proposed the k-anonymity
model. It requires that each tuple has at least k indistinguishable
tuples with respect to quasi-identifier in the released data. The
quasi-identifier is a combination of some attributes which also ex-
ist in other available public microdata, and can be used to re-iden-
tify individuals’ identities by linking the public microdata with the
released data. The k-anonymity model can protect individuals’
identities from being re-identified with high probability by adver-
saries. Due to the practicability of k-anonymity, it has been studied
extensively in recent years [3–16].
Generalization and microaggregation are two typical
techniques to achieve k-anonymity. Generalization was proposed
by Samarati [2]. The idea of generalization is to replace real value
of quasi-identifier with less specific but semantically consistent
value. Microaggregation is a statistical disclosure control (SDC)
method. The idea of microaggregation is to construct small clusters
from the data (each cluster should have at least k elements), and
then to replace each original data with the centroid of its corre-
sponding cluster. Generalization is good at dealing with categorical
data, as it generally generalizes the original data into more seman-
tics-retaining data. Microaggregation is suitable to k-anonymize
numerical data, because it adopts mean vector to replace numeri-
cal values, which can preserve more numerical semantic informa-
tion than generalization. However, when it comes to mixed data
consisting of numerical data and categorical data, neither general-
ization nor microaggregation can effectively k-anonymize them.
1.1. Limitations of generalization on k-anonymizing mixed data
For numerical data, generalization is not an appropriate
k-anonymizing method [4].Asitk-anonymizes numerical data in
the same way as categorical data, i.e., it replaces a numerical value
x with a continuous range [a,b] contained x, labeled by a L
[a,b]
,
generalization would lose much semantic information of numeri-
cal data. For example, analysts cannot infer from L
[a,b]
whether
the original numerical values are mostly in the lower half of [a,b]
or in its upper half. In addition, generalization is time-consuming
and loses considerable information for dealing with microdata
with large size quasi-identifier microdata. Aggarwal [3] claimed
that generalization may suffer from the curse of dimensionality.
0950-7051/$ - see front matter Ó 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.knosys.2013.10.009
⇑
Corresponding author. Tel.: +86 579 82283528.
E-mail address: hanjm@zjnu.cn (J. Han).
Knowledge-Based Systems 55 (2014) 75–86
Contents lists available at ScienceDirect
Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys
1.2. Limitations of microaggregation on k-anonymizing mixed data
For categorical data, microaggregation is not well suitable. As it
replaces categorical data with mode in its correspondent cluster in
the k-anonymizing procedure, microaggregation may result in
eliminating some low frequent values of microdata. For example,
considering a dataset which has two attributes: illness and gender,
the illness’s value is ‘‘adiposity’’. Suppose there are 70% females and
30% males in the dataset. Microaggregation k-partitions the origi-
nal dataset into some equivalence classes whose size is between
k and 2k-1. If the distribution of gender values in all these equiva-
lence classes is proportional, the gender’s values are all changed
into ‘‘female’’. Thus, analyst would obtain wrong information from
the anonymous dataset that all the ‘‘adiposity’’ patients are
‘‘female’’.
In addition, distance measurement used by existing microag-
gregation algorithms cannot capture the practical semantics of cat-
egorical data. For example, the distance of two zip code{4661,4663}
is the same as the distance of two zip code{4661,4331} in microag-
gregation, and both of them are 1. But in terms of their practical
semantics, the distance of the two zip code{4661,4663} should be
more closer than that of {4661,4331}.
We have adopted microaggregation technology to anonymize
mixed data in literature [5]. From the work we know that in anon-
ymous dataset by microaggregation, some important values are
eliminated and distributions of values are not consistent with that
in original data.
1.3. Contributions and paper outline
As we mentioned above, both generalization and microaggrega-
tion have some limitations on k-anonymizing mixed data. To ad-
dress this problem, we propose the MAGE (MicroAggregation and
GEneralization) method by combining the two methods. We use
generalization to anonymize categorical data and use microaggre-
gation to anonymize numerical data. Our contributions include: (1)
we define distance measurement for mixed data, which is general-
ization distance for categorical data and Euclidean distance for
numerical data; (2) we propose a method to anonymize microdata
which integrates the mean vector of numerical data and the closest
common generalization of categorical data as group centroid and
to use it as incarnation of tuples in the corresponding cluster;
and (3) we propose an efficient TSCKA (Two Steps Clustering
K-anonymization Algorithm) to k-anonymize mixed data.
The rest of the paper is organized as follows. Section 2 intro-
duces preliminary concepts. Section 3 proposes an MAGE method
combining generalization and microaggregation oriented to mixed
data, and presents an evaluation method for K-anonymization
quality of mixed data. Section 4 proposes a TSCKA to achieve
k-anonymity for mixed data. Section 5 makes some comparisons
and analyses on experimental results. Section 6 briefly discusses
related work. Section 7 concludes the work.
2. Preliminaries
2.1. K-anonymity
Attributes in microdata can be divided into three categories: ex-
plicit identifiers, quasi-identifier and sensitive attributes. Explicit
identifiers such as name or social security number are attributes that
can uniquely identify individuals, which should be removed or
encrypted prior to publishing. Quasi-identifier cannot identify
individuals by themselves, but can disclose individuals’ privacy
by linking with external table. Sensitive attributes are a set of
attributes which contain individuals’ sensitive information, e.g.,
salary, religion, health status and so on.
Definition 1 (Anonymous Equivalence Class). An anonymous equiv-
alence class of a table with respect to an attribute set is a set of
tuples in which all tuples contain identical values for the attribute
set.
For example, tuples 1, 2, 3 in Table 1(b) form an equivalence
class with respect to the attribute set {Gender, Age,Pcode}, whose
corresponding values are identical. We generally name the
attribute set as quasi-identifier.
Definition 2 (K-anonymity). Let T(A
1
,A
2
,...,A
n
) be a table and QI
be the quasi-identifier of table T. We call that table T satisfies
k-anonymity if and only if each sequence of values in T[QI] appears
with at least k occurrences, where T[QI] denotes the projection of
attributes of QI in T, maintaining duplicate tuples.
K-anonymity requires that the size of each equivalence class
with respect to quasi-identifier is at least k. For example,
Table 1(b and c) are 2-anonymity tables of Table 1(a), where qua-
si-identifier is {Gender, Age,Pcode}. Generally, a table has more than
one k-anonymous views, but some are better than others. For
example, Table 1(b and c) are 2-anonymous views of Table 1(a).
Obviously, Table 1(c) loses much more details than
Table 1(b), i.e., the data utility of Table 1(b) is better than that of
Table 1(c). It has been proved that the size of each equivalence
class should be in [k,2k 1] in order to retain as more data utility
as possible.
2.2. Generalization
2.2.1. Basic concepts
Generalization is one of the most popular approaches for
k-anonymizing microdata. The main idea of generalization is to re-
place the original values of quasi-identifier with less specific values
which are semantically consistent with the original values, and the
procedure is called generalization. Quasi-identifier could consist of
both numerical attributes and categorical attributes. Generally, we
use different generalization methods to generalize different kinds
of attributes, i.e., generalizing categorical attribute values to a set
of distinct values or a single less specific value, whereas generaliz-
ing numerical attribute values to intervals which contain the
original values. For example, by stripping the rightmost digit, the
three categorical attribute values {4661, 4663, 4663} for zip code
can be generalized to 466 which semantically represents a larger
geographical area, whereas the three numerical attribute values
{35,36,37} are generally generalized to an interval [30,39].
Actually, the generalization operation can be done in two
different ways, namely global recoding and local recoding. Global
recoding, also called domain generalization, is processed at domain
level based on a predefined domain generalization hierarchy tree
denoted by DGHT. For global recording, once an attribute value is
generalized, each occurrence of the value should be replaced by
the new generalized value. Table 1(c) gives a global recoding ver-
sion of the original table. It is easy to see that global recoding
may over-generalize microdata. Differently, local recoding, also
called value generalization, generalizes attribute values at cell level
based on a predefined value generalization hierarchy tree denoted
by VGHT. Obviously, local recoding does not need all occurrences of
a value to be changed into the same generalized value, whereas it
allows a generalized attribute value co-exists with the original
value within the same generalized microdata. Table 1(b) gives a
local recoding version. Different from global recoding, local recod-
ing does not over-generalize a table, hence it may minimize the
76 J. Han et al. / Knowledge-Based Systems 55 (2014) 75–86
剩余11页未读,继续阅读
资源评论
weixin_38506798
- 粉丝: 4
- 资源: 937
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
最新资源
资源上传下载、课程学习等过程中有任何疑问或建议,欢迎提出宝贵意见哦~我们会及时处理!
点击此处反馈
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功