L. Sweeney. k-anonymity: a model for protecting privacy. International Journal on Uncertainty,
Fuzziness and Knowledge-based Systems, 10 (5), 2002; 557-570.
So a common practice is for organizations to release and receive person-
specific data with all explicit identifiers, such as name, address and telephone
number, removed on the assumption that anonymity is maintained because the
resulting data look anonymous. However, in most of these cases, the remaining
data can be used to re-identify individuals by linking or matching the data to other
data or by looking at unique characteristics found in the released data.
In an earlier work, experiments using 1990 U.S. Census summary data were
conducted to determine how many individuals within geographically situated
populations had combinations of demographic values that occurred infrequently
[1]. A few characteristics often combine in populations to uniquely
or nearly uniquely identify some individuals. For example, a finding in that study
was that 87% (216 million of 248 million) of the population in the United States
had reported characteristics that likely made them unique based only on {5-digit
ZIP², gender, date of birth}. Clearly, data released containing such information
about these individuals should not be considered anonymous. Yet, health and
other person-specific data are often publicly available in this form. Below is a
demonstration of how such data can be re-identified.
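As a brief aside, the kind of uniqueness measurement described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used in the cited study: the records are invented toy values, and the helper name `fraction_unique` is hypothetical. It counts how often each {ZIP, gender, date of birth} combination occurs and reports the share of records whose combination occurs exactly once.

```python
from collections import Counter

# Hypothetical toy records: (5-digit ZIP, gender, date of birth).
# These values are invented for illustration; they are not Census data.
records = [
    ("02138", "F", "1965-03-12"),
    ("02138", "F", "1965-03-12"),
    ("02138", "M", "1971-07-04"),
    ("02139", "F", "1980-11-30"),
    ("02139", "M", "1958-01-22"),
]


def fraction_unique(rows):
    """Return the share of rows whose attribute combination occurs exactly once."""
    if not rows:
        return 0.0
    counts = Counter(rows)  # occurrences of each attribute combination
    unique_rows = sum(1 for row in rows if counts[row] == 1)
    return unique_rows / len(rows)


share = fraction_unique(records)
print(f"{share:.0%} of the toy records are unique on (ZIP, gender, date of birth)")
```

In this toy data three of the five records have a combination shared by no other record, so those three would be the ones most exposed to re-identification.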
Example 1. Re-identification by linking
The National Association of Health Data Organizations (NAHDO) reported
that 37 states in the USA have legislative mandates to collect hospital level
data and that 17 states have started collecting ambulatory care data from
hospitals, physicians' offices, clinics, and so forth [2]. The leftmost circle in
Figure 1 contains a subset of the fields of information, or attributes, that
NAHDO recommends these states collect; these attributes include the
patient’s ZIP code, birth date, gender, and ethnicity.
In Massachusetts, the Group Insurance Commission (GIC) is responsible
for purchasing health insurance for state employees. GIC collected
patient-specific data with nearly one hundred attributes per encounter along the
lines of those shown in the leftmost circle of Figure 1 for approximately
135,000 state employees and their families. Because the data were believed to
be anonymous, GIC gave a copy of the data to researchers and sold a copy to
industry [3].
For twenty dollars I purchased the voter registration list for Cambridge,
Massachusetts and received the information on two diskettes [4]. The
rightmost circle in Figure 1 shows that these data included the name, address,
ZIP code, birth date, and gender of each voter. This information can be linked
using ZIP code, birth date and gender to the medical information, thereby
² In the United States, a ZIP code refers to the postal code assigned by the U.S. Postal Service.
Typically 5-digit ZIP codes are used, though 9-digit ZIP codes have been assigned. A 5-digit code is
the first 5 digits of the 9-digit code.