MAGE：一种用于混合数据的语义保留K-匿名方法资源-CSDN文库

Privacy

50 浏览量 2021-03-16 12:41:58 上传评论收藏 1.05MB PDF 举报

资源推荐

资源详情

资源评论

MAGE: A semantics retaining K-anonymization method for mixed data

Jianmin Han

⇑

, Juan Yu

, Yuchang Mo

, Jianfeng Lu

, Huawen Liu

Department of Computer Science and Technology, Zhejiang Normal University, Jinhua 321004, China

Department of Computer Science and Technology, Fudan University, Shanghai 200433, China

article info

Article history:

Received 15 July 2011

Received in revised form 5 October 2013

Accepted 6 October 2013

Available online 17 October 2013

Keywords:

K-anonymity

Generalization

Microaggregation

Privacy preservation

abstract

K-anonymity is a ﬁne approach to protecting privacy in the release of microdata for data mining.

Microaggregation and generalization are two typical methods to implement k-anonymity. But both of

them have some defects on anonymizing mixed microdata. To address the problem, we propose a novel

anonymization method, named MAGE, which can retain more semantics than generalization and

microaggregation in dealing with mixed microdata. The idea of MAGE is to combine the mean vector

of numerical data with the generalization values of categorical data as a clustering centroid and to use

it as incarnation of the tuples in the corresponding cluster. We also propose an efﬁcient TSCKA algorithm

to anonymize mixed data. Experimental results show that MAGE can anonymize mixed microdata

effectively and the TSCKA algorithm can achieve better trade-off between data quality and algorithm

efﬁciency comparing with two well-known anonymization algorithms, Incognito and KACA.

1. Introduction

Microdata, i.e., data containing individual-speciﬁc information,

play an increasingly important role in data analysis and scientiﬁc

research. Therefore many organizations are increasingly collecting

and publishing microdata. However, publishing microdata may

threaten individuals’ privacy, because it is still possible to link

the released data with other public or easy-to-access database to

re-identify individuals’ identities, although attributes which can

obviously identify individuals’ identities have been removed be-

fore publication. Sweeney [1] pointed out that there were about

87% of Americans can be uniquely identiﬁed through the combina-

tion of their gender, birthday and postcode. If an inpatient dataset

containing these three attributes are released by medical organiza-

tions, adversaries can re-identify the disease of a speciﬁc inpatient

by linking with other external table, though the identity attributes

have been removed before publication.

To address the problem, Samarati [2] proposed the k-anonymity

model. It requires that each tuple has at least k indistinguishable

tuples with respect to quasi-identiﬁer in the released data. The

quasi-identiﬁer is a combination of some attributes which also ex-

ist in other available public microdata, and can be used to re-iden-

tify individuals’ identities by linking the public microdata with the

released data. The k-anonymity model can protect individuals’

identities from being re-identiﬁed with high probability by adver-

saries. Due to the practicability of k-anonymity, it has been studied

extensively in recent years [3–16].

Generalization and microaggregation are two typical

techniques to achieve k-anonymity. Generalization was proposed

by Samarati [2]. The idea of generalization is to replace real value

of quasi-identiﬁer with less speciﬁc but semantically consistent

value. Microaggregation is a statistical disclosure control (SDC)

method. The idea of microaggregation is to construct small clusters

from the data (each cluster should have at least k elements), and

then to replace each original data with the centroid of its corre-

sponding cluster. Generalization is good at dealing with categorical

data, as it generally generalizes the original data into more seman-

tics-retaining data. Microaggregation is suitable to k-anonymize

numerical data, because it adopts mean vector to replace numeri-

cal values, which can preserve more numerical semantic informa-

tion than generalization. However, when it comes to mixed data

consisting of numerical data and categorical data, neither general-

ization nor microaggregation can effectively k-anonymize them.

1.1. Limitations of generalization on k-anonymizing mixed data

For numerical data, generalization is not an appropriate

k-anonymizing method [4].Asitk-anonymizes numerical data in

the same way as categorical data, i.e., it replaces a numerical value

x with a continuous range [a,b] contained x, labeled by a L

[a,b]

generalization would lose much semantic information of numeri-

cal data. For example, analysts cannot infer from L

[a,b]

whether

the original numerical values are mostly in the lower half of [a,b]

or in its upper half. In addition, generalization is time-consuming

and loses considerable information for dealing with microdata

with large size quasi-identiﬁer microdata. Aggarwal [3] claimed

that generalization may suffer from the curse of dimensionality.

http://dx.doi.org/10.1016/j.knosys.2013.10.009

⇑

Corresponding author. Tel.: +86 579 82283528.

E-mail address: hanjm@zjnu.cn (J. Han).

Knowledge-Based Systems 55 (2014) 75–86

Contents lists available at ScienceDirect

Knowledge-Based Systems

journal homepage: www.elsevier.com/locate/knosys

1.2. Limitations of microaggregation on k-anonymizing mixed data

For categorical data, microaggregation is not well suitable. As it

replaces categorical data with mode in its correspondent cluster in

the k-anonymizing procedure, microaggregation may result in

eliminating some low frequent values of microdata. For example,

considering a dataset which has two attributes: illness and gender,

the illness’s value is ‘‘adiposity’’. Suppose there are 70% females and

30% males in the dataset. Microaggregation k-partitions the origi-

nal dataset into some equivalence classes whose size is between

k and 2k-1. If the distribution of gender values in all these equiva-

lence classes is proportional, the gender’s values are all changed

into ‘‘female’’. Thus, analyst would obtain wrong information from

the anonymous dataset that all the ‘‘adiposity’’ patients are

‘‘female’’.

In addition, distance measurement used by existing microag-

gregation algorithms cannot capture the practical semantics of cat-

egorical data. For example, the distance of two zip code{4661,4663}

is the same as the distance of two zip code{4661,4331} in microag-

gregation, and both of them are 1. But in terms of their practical

semantics, the distance of the two zip code{4661,4663} should be

more closer than that of {4661,4331}.

We have adopted microaggregation technology to anonymize

mixed data in literature [5]. From the work we know that in anon-

ymous dataset by microaggregation, some important values are

eliminated and distributions of values are not consistent with that

in original data.

1.3. Contributions and paper outline

As we mentioned above, both generalization and microaggrega-

tion have some limitations on k-anonymizing mixed data. To ad-

dress this problem, we propose the MAGE (MicroAggregation and

GEneralization) method by combining the two methods. We use

generalization to anonymize categorical data and use microaggre-

gation to anonymize numerical data. Our contributions include: (1)

we deﬁne distance measurement for mixed data, which is general-

ization distance for categorical data and Euclidean distance for

numerical data; (2) we propose a method to anonymize microdata

which integrates the mean vector of numerical data and the closest

common generalization of categorical data as group centroid and

to use it as incarnation of tuples in the corresponding cluster;

and (3) we propose an efﬁcient TSCKA (Two Steps Clustering

K-anonymization Algorithm) to k-anonymize mixed data.

The rest of the paper is organized as follows. Section 2 intro-

duces preliminary concepts. Section 3 proposes an MAGE method

combining generalization and microaggregation oriented to mixed

data, and presents an evaluation method for K-anonymization

quality of mixed data. Section 4 proposes a TSCKA to achieve

k-anonymity for mixed data. Section 5 makes some comparisons

and analyses on experimental results. Section 6 brieﬂy discusses

related work. Section 7 concludes the work.

2. Preliminaries

2.1. K-anonymity

Attributes in microdata can be divided into three categories: ex-

plicit identiﬁers, quasi-identiﬁer and sensitive attributes. Explicit

identiﬁers such as name or social security number are attributes that

can uniquely identify individuals, which should be removed or

encrypted prior to publishing. Quasi-identiﬁer cannot identify

individuals by themselves, but can disclose individuals’ privacy

by linking with external table. Sensitive attributes are a set of

attributes which contain individuals’ sensitive information, e.g.,

salary, religion, health status and so on.

Deﬁnition 1 (Anonymous Equivalence Class). An anonymous equiv-

alence class of a table with respect to an attribute set is a set of

tuples in which all tuples contain identical values for the attribute

set.

For example, tuples 1, 2, 3 in Table 1(b) form an equivalence

class with respect to the attribute set {Gender, Age,Pcode}, whose

corresponding values are identical. We generally name the

attribute set as quasi-identiﬁer.

Deﬁnition 2 (K-anonymity). Let T(A

,...,A

) be a table and QI

be the quasi-identiﬁer of table T. We call that table T satisﬁes

k-anonymity if and only if each sequence of values in T[QI] appears

with at least k occurrences, where T[QI] denotes the projection of

attributes of QI in T, maintaining duplicate tuples.

K-anonymity requires that the size of each equivalence class

with respect to quasi-identiﬁer is at least k. For example,

Table 1(b and c) are 2-anonymity tables of Table 1(a), where qua-

si-identiﬁer is {Gender, Age,Pcode}. Generally, a table has more than

one k-anonymous views, but some are better than others. For

example, Table 1(b and c) are 2-anonymous views of Table 1(a).

Obviously, Table 1(c) loses much more details than

Table 1(b), i.e., the data utility of Table 1(b) is better than that of

Table 1(c). It has been proved that the size of each equivalence

class should be in [k,2k  1] in order to retain as more data utility

as possible.

2.2. Generalization

2.2.1. Basic concepts

Generalization is one of the most popular approaches for

k-anonymizing microdata. The main idea of generalization is to re-

place the original values of quasi-identiﬁer with less speciﬁc values

which are semantically consistent with the original values, and the

procedure is called generalization. Quasi-identiﬁer could consist of

both numerical attributes and categorical attributes. Generally, we

use different generalization methods to generalize different kinds

of attributes, i.e., generalizing categorical attribute values to a set

of distinct values or a single less speciﬁc value, whereas generaliz-

ing numerical attribute values to intervals which contain the

original values. For example, by stripping the rightmost digit, the

three categorical attribute values {4661, 4663, 4663} for zip code

can be generalized to 466 which semantically represents a larger

geographical area, whereas the three numerical attribute values

{35,36,37} are generally generalized to an interval [30,39].

Actually, the generalization operation can be done in two

different ways, namely global recoding and local recoding. Global

recoding, also called domain generalization, is processed at domain

level based on a predeﬁned domain generalization hierarchy tree

denoted by DGHT. For global recording, once an attribute value is

generalized, each occurrence of the value should be replaced by

the new generalized value. Table 1(c) gives a global recoding ver-

sion of the original table. It is easy to see that global recoding

may over-generalize microdata. Differently, local recoding, also

called value generalization, generalizes attribute values at cell level

based on a predeﬁned value generalization hierarchy tree denoted

by VGHT. Obviously, local recoding does not need all occurrences of

a value to be changed into the same generalized value, whereas it

allows a generalized attribute value co-exists with the original

value within the same generalized microdata. Table 1(b) gives a

local recoding version. Different from global recoding, local recod-

ing does not over-generalize a table, hence it may minimize the

76 J. Han et al. / Knowledge-Based Systems 55 (2014) 75–86

剩余11页未读，继续阅读

评论收藏

内容反馈

weixin_38506798

粉丝: 4
资源: 937

MAGE：一种用于混合数据的语义保留K-匿名方法

基于聚类的数据敏感属性匿名保护算法 (2012年)

基于语义保留CEP的业务数据处理方法

半自动图像注释工具：Anno-Mage：一种半自动图像注释工具，它通过使用预先训练的模型为80个对象类建议注释来帮助您注释图像

法师：:mage:法师

使我成为编程专家的文件:mage:‍:mage:-Linux开发

sagamatron:使用TypeScript的魔力将样板从ReactReduxSaga堆栈中取出！ :mage:

The Ice Mage:用于休闲跑步游戏的 Android 应用-开源

The Fire Mage:用于休闲跑步游戏的 Android 应用-开源

grasshopper:一个非常严肃的游戏，涉及两个足球，飞盘和魔术:mage:‍:mage:

django-data-wizard：:mage::gear:通过基于Web的交互式向导将结构化数据（例如Excel，CSV，XML，JSON）导入一个或多个Django模型中

MAGE:MAGE 是一个用 JavaScript 制作浏览器游戏的引擎

InvisibleJs:使用密码安全地在纯文本中使用零宽度字符隐藏秘密:mage:‍:mage::star:

pyWhat::frog: 识别任何东西。 pyWhat 可让您轻松识别电子邮件、IP 地址等。 给它一个 .pcap 文件或一些文本，它会告诉你它是什么！ :mage:‍​​:female_sign:

leetcode题库-Leetcode::man_technologist:采访_EXP---:man_mage:坚持解决Leetcode问题，拒绝阿尔茨海默病。访问我的leetcod

lambda-lantern：:mage:关于功能编程模式的3D游戏。 使用PureScript Native，C ++和Panda3D

terraria-player-parser：:mage:用JavaScript编写的Terraria播放器文件解析器

3D-Mage-Studio.zip

alohomora:使得使用AWS Parameter Store的cli就像一根魔杖一样简单:mage:

pagemaster：Jekyll插件，用于从CSVYAMLJSON记录生成Markdown收集页面:mage::open_book:

Mage2Gen:用于生成Magento 2模块的Python库

开源项目-magefile-mage.zip

【Python实战】-Python+Opencv是实现车牌自动识别（源码+数据+字符匹配模板）

Python基于机器学习实现的股票价格预测、股票预测源码+数据集，机器学习大作业

时间序列数据集TSdatasets.rar

Fragstats V4.2 软件计算景观指数的参数文件示例

抖音用户浏览行为数据集

5-8抖音数据分析数据集

两阶段鲁棒优化/综合能源系统/需求响应/微电网/多目标优化/优化调度matlab-yalmip-cplex/gurobi文章复现

新闻数据集（对应新闻文本分类案例）

最新资源

pyWhat::frog: 识别任何东西。 pyWhat 可让您轻松识别电子邮件、IP 地址等。给它一个 .pcap 文件或一些文本，它会告诉你它是什么！ :mage:‍:female_sign:

lambda-lantern：:mage:关于功能编程模式的3D游戏。使用PureScript Native，C ++和Panda3D