data to the service provider. Since these service providers are not within the trust boundary, the privacy of the outsourced data has become one of the top-priority problems (Armbrust et al., 2009; Bruening and Treacy, 2009). As data mining is one of the most popular data-intensive tasks, privacy-preserving data mining on outsourced data has become an important enabling technology for utilizing public computing resources.
Unlike other settings of privacy-preserving data mining, such as collaboratively mining private datasets held by multiple parties (Lindell and Pinkas, 2000; Vaidya and Clifton, 2003; Luo, Fan, Lin, Zhou and Bertino, 2009; Teng and Du, 2009), this paper focuses on the following setting: the data owner exports data to the service provider and then receives a mined model (together with a quality description, such as the accuracy of a classifier) in return. This setting also covers the situation in which the data owner uses public cloud resources for large-scale mining and the service provider merely supplies the computing infrastructure.
In this paper, we present a new data perturbation technique for privacy-preserving outsourced data mining (Aggarwal and Yu, 2004; Chen and Liu, 2005). A data perturbation procedure can be described simply as follows: before the data owners publish their data, they change the data in a certain way to disguise the sensitive information while preserving the particular data properties that are critical for building meaningful data mining models. Perturbation techniques have to handle the intrinsic tradeoff between preserving data privacy and preserving data utility, as perturbing data usually reduces data utility. Several perturbation techniques have been proposed for data mining recently, but none balances these two factors satisfactorily. For example, the random noise addition approach (Agrawal and Srikant, 2000; Evfimievski, Srikant, Agrawal and Gehrke, 2002) is vulnerable to data reconstruction attacks and suitable for only a few specific data mining models. The condensation approach (Aggarwal and Yu, 2004) cannot effectively protect data privacy from naive estimation. Rotation perturbation (Chen and Liu, 2005; Oliveira and Zaiane, 2010) and random projection perturbation (Liu, Kargupta and Ryan, 2006) are both threatened by prior-knowledge-enabled Independent Component Analysis (Hyvarinen, Karhunen and Oja, 2001). Multidimensional k-anonymization (LeFevre, DeWitt and Ramakrishnan, 2006) is designed only for general-purpose utility preservation and may result in low-quality data mining models. In this paper, we propose a new multidimensional data perturbation technique, geometric data perturbation, that can be applied to several categories of popular data mining models with better utility preservation and privacy preservation.
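To convey the flavor of such a multidimensional perturbation, the following is a minimal sketch of a rotation-plus-translation-plus-noise transformation in Python. The particular choice of random orthogonal matrix, translation vector, and Gaussian noise level is an illustrative assumption, not the exact construction defined later in this paper.

```python
import numpy as np

def random_orthogonal(d, rng):
    # QR decomposition of a Gaussian matrix yields a random orthogonal
    # matrix; the sign correction makes the distribution uniform.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def geometric_perturb(X, noise_sigma, rng):
    """Rotate, translate, and noise a d-by-n dataset (one record per column)."""
    d = X.shape[0]
    R = random_orthogonal(d, rng)        # multiplicative (rotation) component
    t = rng.standard_normal((d, 1))      # translation component
    noise = noise_sigma * rng.standard_normal(X.shape)  # additive noise
    return R @ X + t + noise

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 100))        # toy data: 3 attributes, 100 records
Y = geometric_perturb(X, noise_sigma=0.1, rng=rng)
```

Because R is orthogonal, Euclidean distances between records survive the transformation up to the additive noise, which is why distance-based models can still be trained on the perturbed data.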
1.1. Data Privacy vs. Data Utility
Perturbation techniques are often evaluated with two basic metrics: the level of privacy guarantee and the level of preserved data utility. Data utility is often task- or model-specific and is measured by the quality of the learned models. The ultimate goal for all data perturbation algorithms is to maximize both data privacy and data utility, although the two typically represent conflicting goals in most existing perturbation techniques.
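To make the two metrics concrete, below is a hedged sketch of how they might be proxied in practice: privacy by the per-attribute error an attacker incurs when naively reading the perturbed values as estimates of the originals, and utility by how well pairwise Euclidean distances (on which distance-based models such as kNN classifiers rely) survive the perturbation. Both proxies and the function names are illustrative assumptions, not the evaluation protocol of this paper.

```python
import numpy as np

def naive_privacy(X, Y):
    # Privacy proxy: per-attribute RMSE when the perturbed values Y are
    # naively taken as estimates of the original values X.
    return np.sqrt(np.mean((X - Y) ** 2, axis=1))

def distance_utility(X, Y):
    # Utility proxy: relative preservation of pairwise Euclidean distances
    # (1.0 means distances are perfectly preserved).
    dX = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    dY = np.linalg.norm(Y[:, :, None] - Y[:, None, :], axis=0)
    return 1.0 - np.abs(dX - dY).mean() / dX.mean()
```

Applied to the sketch above, an orthogonal transformation keeps distance_utility near 1, while naive_privacy grows with the strength of the rotation, translation, and noise.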
Level of Privacy Guarantee: Data privacy is commonly measured by the difficulty of estimating the original data from the perturbed data. Given a data perturbation technique, the more difficult it is to estimate the original values from the perturbed data, the higher the level of data privacy the technique provides. In (Agrawal and Srikant, 2000), the variance of the added random noise is used as the measure of difficulty in estimating the original values. However, recent research (Evfimievski, Gehrke and Srikant, 2003; Agrawal and Aggarwal, 2002) reveals that the variance of the added noise alone is