Abstract
In the age of big data, exchanging and sharing data among different parties is
important. De-identification policies protect privacy by releasing an abstracted
description of the data. However, the number of candidate de-identification policies is
exponentially large because of the broad domains of the attributes, and reducing the
number of policies is difficult. Skyline computation gives better control over the
trade-off between data utility and data privacy: it filters a set of interesting policies out
of a potentially large set of policies. A policy is interesting if it is not dominated by any
other policy; that is, a policy that is filtered out is better than the policy that dominates it
in neither data utility nor data privacy. However, efficient skyline processing over a large
number of policies remains challenging.
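The dominance test described above can be illustrated with a minimal sketch. It assumes each policy is scored by a hypothetical pair (utility, risk), where higher utility and lower risk are better; the names `Policy`, `dominates`, and `skyline` are illustrative and are not taken from the thesis.

```python
from typing import List, Tuple

# Hypothetical scoring of one policy: (utility, risk).
# Assumption: higher utility is better, lower risk is better.
Policy = Tuple[float, float]


def dominates(p: Policy, q: Policy) -> bool:
    """Return True if p is no worse than q on both criteria and strictly better on one."""
    no_worse = p[0] >= q[0] and p[1] <= q[1]
    strictly_better = p[0] > q[0] or p[1] < q[1]
    return no_worse and strictly_better


def skyline(policies: List[Policy]) -> List[Policy]:
    """Keep every policy that no other policy dominates (quadratic-time baseline)."""
    return [p for p in policies if not any(dominates(q, p) for q in policies)]


policies = [(0.9, 0.5), (0.7, 0.2), (0.6, 0.4), (0.4, 0.1), (0.8, 0.6)]
result = skyline(policies)
print(result)
```

Here (0.6, 0.4) is dropped because (0.7, 0.2) has both higher utility and lower risk, and (0.8, 0.6) is dropped because of (0.9, 0.5). The pairwise comparison is quadratic in the number of policies, which is why such a baseline becomes costly on large policy sets.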
Skyline computing over the universal policy set (SKY-FILTER-MR) provides an
effective, extensible, and high-precision method for skyline processing over large-scale
policy sets. First, applying the MapReduce programming model to traditional skyline
computation over policies greatly reduces the execution time, so that skyline queries
over large-scale policy sets can be answered efficiently. Second, the approximate skyline
introduces a parameter ε on top of the skyline. It requires that the policies filtered out be
better than those that remain, beyond a certain range, in neither data utility nor data
privacy. With the approximate skyline, the filtering power is greatly strengthened, which
effectively decreases the cost of skyline computation over the alternative policies.
Meanwhile, the parameter can be varied to trade off the near-optimality guarantee against
lower risk and higher data utility.
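The two ideas above can be sketched together under stated assumptions. The abstract does not define the ε relaxation precisely, so this sketch assumes an additive one: a retained policy covers another if it is within ε of it, or better, on both criteria. The map and reduce phases are simulated with plain Python functions rather than a real MapReduce runtime, policies are hypothetical (utility, risk) integer pairs on a 0-100 scale, and all names are illustrative rather than the SKY-FILTER-MR implementation.

```python
from typing import List, Tuple

# Hypothetical scoring of one policy: (utility, risk) on a 0-100 scale.
# Assumption: higher utility is better, lower risk is better.
Policy = Tuple[int, int]


def eps_covers(p: Policy, q: Policy, eps: int) -> bool:
    """Assumed additive relaxation: p is within eps of q, or better, on both criteria."""
    return p[0] + eps >= q[0] and p[1] - eps <= q[1]


def approx_skyline(policies: List[Policy], eps: int) -> List[Policy]:
    """Greedy filter: keep a policy only if no already kept policy eps-covers it."""
    kept: List[Policy] = []
    # Visit high-utility, low-risk policies first so they become the representatives.
    for p in sorted(policies, key=lambda t: (-t[0], t[1])):
        if not any(eps_covers(k, p, eps) for k in kept):
            kept.append(p)
    return kept


def skyline_mr(partitions: List[List[Policy]], eps: int) -> List[Policy]:
    """Simulated map and reduce phases over pre-partitioned policies."""
    # "Map": exact local skyline per partition (eps = 0), computed independently.
    local = [approx_skyline(part, 0) for part in partitions]
    # "Reduce": merge the local results and apply the eps filter once.
    merged = [p for part in local for p in part]
    return approx_skyline(merged, eps)


partitions = [
    [(90, 50), (70, 20), (60, 40)],
    [(40, 10), (80, 60), (88, 48)],
]
exact = skyline_mr(partitions, eps=0)
approx = skyline_mr(partitions, eps=5)
coarse = skyline_mr(partitions, eps=10)
print(exact, approx, coarse)
```

With ε = 0 the sketch returns the exact skyline of four policies; with ε = 5 the policy (88, 48) is absorbed by (90, 50), and with ε = 10 the policy (40, 10) is also absorbed by (70, 20). Applying ε only in the reduce step is a design choice of this sketch: it keeps the slack at ε, whereas filtering with ε in both phases could compound it.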
Extensive experiments demonstrate that SKY-FILTER-MR substantially outperforms
the baseline approach: it runs up to four times faster, and the number of alternative
policies decreases by a factor of up to 732 in the best case. It also scales well over
large policy sets, and its running time decreases as ε increases.
SKY-FILTER-MR thus reduces the number of alternative policies and improves efficiency
while guaranteeing the accuracy of the skyline computation.
Keywords: de-identification policy, skyline, data privacy, MapReduce