data to the service provider. Since these service providers are not within the trust boundary, the privacy of the outsourced data has become one of the top-priority problems (Armbrust et al., 2009; Bruening and Treacy, 2009). As data mining is one of the most popular data-intensive tasks, privacy-preserving data mining on outsourced data has become an important enabling technology for utilizing public computing resources.
Unlike other settings of privacy-preserving data mining, such as collaboratively mining private datasets held by multiple parties (Lindell and Pinkas, 2000; Vaidya and Clifton, 2003; Luo, Fan, Lin, Zhou and Bertino, 2009; Teng and Du, 2009), this paper focuses on the following setting: the data owner exports data to the service provider and then receives a mined model (together with a quality description, such as the accuracy of a classifier) in return. This setting also covers the situation in which the data owner uses public cloud resources for large-scale mining and the service provider merely supplies the computing infrastructure.
In this paper, we present a new data perturbation technique for privacy-preserving outsourced data mining (Aggarwal and Yu, 2004; Chen and Liu, 2005). A data perturbation procedure can be described simply as follows: before the data owners publish their data, they change the data in a certain way to disguise the sensitive information while preserving the particular data properties that are critical for building meaningful data mining models. Perturbation techniques have to handle the intrinsic tradeoff between preserving data privacy and preserving data utility, as perturbing data usually reduces data utility. Several perturbation techniques have been proposed for data mining recently, but none balances these two factors satisfactorily. For example, the random noise addition approach (Agrawal and Srikant, 2000; Evfimievski, Srikant, Agrawal and Gehrke, 2002) is vulnerable to data reconstruction attacks and suitable for only a few specific data mining models. The condensation approach (Aggarwal and Yu, 2004) cannot effectively protect data privacy from naive estimation. Rotation perturbation (Chen and Liu, 2005; Oliveira and Zaiane, 2010) and random projection perturbation (Liu, Kargupta and Ryan, 2006) are both threatened by prior-knowledge-enabled Independent Component Analysis (Hyvarinen, Karhunen and Oja, 2001). Multidimensional k-anonymization (LeFevre, DeWitt and Ramakrishnan, 2006) is designed only for general-purpose utility preservation and may result in low-quality data mining models. In this paper, we propose a new multidimensional data perturbation technique, geometric data perturbation, that can be applied to several categories of popular data mining models with better utility preservation and privacy preservation.
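To convey the flavor of such a multidimensional perturbation, the following is a minimal sketch of a rotation-plus-translation-plus-noise transformation in Python. The particular choice of random orthogonal matrix, translation vector, and Gaussian noise level is an illustrative assumption, not the exact construction defined later in this paper.

```python
import numpy as np

def random_orthogonal(d, rng):
    # QR decomposition of a Gaussian matrix yields a random orthogonal
    # matrix; the sign correction makes the distribution uniform.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def geometric_perturb(X, noise_sigma, rng):
    """Rotate, translate, and noise a d-by-n dataset (one record per column)."""
    d = X.shape[0]
    R = random_orthogonal(d, rng)        # multiplicative (rotation) component
    t = rng.standard_normal((d, 1))      # translation component
    noise = noise_sigma * rng.standard_normal(X.shape)  # additive noise
    return R @ X + t + noise

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 100))        # toy data: 3 attributes, 100 records
Y = geometric_perturb(X, noise_sigma=0.1, rng=rng)
```

Because R is orthogonal, Euclidean distances between records survive the transformation up to the additive noise, which is why distance-based models can still be trained on the perturbed data.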
1.1. Data Privacy vs. Data Utility
Perturbation techniques are often evaluated with two basic metrics: the level of privacy guarantee and the level of preserved data utility. Data utility is often task- or model-specific and is measured by the quality of the learned models. The ultimate goal for all data perturbation algorithms is to maximize both data privacy and data utility, although the two typically represent conflicting goals in most existing perturbation techniques.
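To make the two metrics concrete, below is a hedged sketch of how they might be proxied in practice: privacy by the per-attribute error an attacker incurs when naively reading the perturbed values as estimates of the originals, and utility by how well pairwise Euclidean distances (on which distance-based models such as kNN classifiers rely) survive the perturbation. Both proxies and the function names are illustrative assumptions, not the evaluation protocol of this paper.

```python
import numpy as np

def naive_privacy(X, Y):
    # Privacy proxy: per-attribute RMSE when the perturbed values Y are
    # naively taken as estimates of the original values X.
    return np.sqrt(np.mean((X - Y) ** 2, axis=1))

def distance_utility(X, Y):
    # Utility proxy: relative preservation of pairwise Euclidean distances
    # (1.0 means distances are perfectly preserved).
    dX = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    dY = np.linalg.norm(Y[:, :, None] - Y[:, None, :], axis=0)
    return 1.0 - np.abs(dX - dY).mean() / dX.mean()
```

Applied to the sketch above, an orthogonal transformation keeps distance_utility near 1, while naive_privacy grows with the strength of the rotation, translation, and noise.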
Level of Privacy Guarantee: Data privacy is commonly measured by the difficulty of estimating the original data from the perturbed data. Given a data perturbation technique, the more difficult it is to estimate the original values from the perturbed data, the higher the level of data privacy the technique provides. In (Agrawal and Srikant, 2000), the variance of the added random noise is used as the measure of difficulty in estimating the original values. However, recent research (Evfimievski, Gehrke and Srikant, 2003; Agrawal and Aggarwal, 2002) reveals that the variance of the added noise alone is