Anonymized Data: Think Again (And Again)
Another month, another study shows anonymization failing to protect personal identities in big data sets used widely for analytics and machine learning. What's to be done?
- By Barry Devlin
- September 10, 2019
In a paper published in Nature Communications in July, Rocher et al. find that 99.98 percent of Americans (in a sample size of the population of Massachusetts) would be correctly re-identified in any dataset using as few as 15 demographic attributes. They conclude that "even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model."
A set of 15 demographic attributes is minuscule compared to the number used in most analytic datasets. The authors compare it to the 248 data fields on each of 120 million American households in the file that marketing analytics firm Alteryx left unencrypted and unprotected on AWS in 2017. The comparison is perhaps misleading given that the Alteryx file's "anonymization" seemed to extend only as far as removing actual people's names, leaving addresses, phone numbers, banking details, and ethnicity among the leaked fields. The main point of the Rocher et al. article is that even after removing or obscuring such obvious personally identifiable information (PII), there remains an enormous number of seemingly innocuous data fields in big data sets that can be used to easily deanonymize them.
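To make the effect concrete, here is a minimal sketch of how quickly records become unique as "innocuous" fields are combined. It is not the generative model Rocher et al. use; it is a much simpler uniqueness count, and the function name, field names, and toy records are all hypothetical.

```python
from collections import Counter


def unique_fraction(records, attributes):
    """Fraction of records whose values on `attributes` match no other record."""
    if not records:
        return 0.0
    # Build one key per record from the chosen attributes only.
    keys = [tuple(r[a] for a in attributes) for r in records]
    counts = Counter(keys)
    # A record is "unique" if nobody else shares its combination of values.
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


# Hypothetical toy data: no names, only seemingly harmless fields.
records = [
    {"zip": "02139", "birth_year": 1980, "gender": "F", "vehicles": 1},
    {"zip": "02139", "birth_year": 1980, "gender": "M", "vehicles": 1},
    {"zip": "02139", "birth_year": 1980, "gender": "F", "vehicles": 2},
    {"zip": "02139", "birth_year": 1975, "gender": "F", "vehicles": 1},
    {"zip": "02140", "birth_year": 1980, "gender": "F", "vehicles": 1},
    {"zip": "02140", "birth_year": 1980, "gender": "F", "vehicles": 1},
]

fields = ["zip", "birth_year", "gender", "vehicles"]
for n in range(1, len(fields) + 1):
    # Add one attribute at a time and watch the unique share grow.
    print(fields[:n], round(unique_fraction(records, fields[:n]), 2))
```

In this toy set no record is unique on ZIP code alone, but two-thirds are unique once all four fields are combined. Anyone holding a second dataset with the same fields plus a name could link those unique records back to a person.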
How many people were left at risk for invasion of privacy, identity theft, or fraud? The Alteryx file was sourced from Irish-based Experian, one of a few major data aggregation and consumer credit reporting companies worldwide. According to its fact sheet, the company maintains credit information on over 220 million U.S. consumers and 40 million active U.S. businesses, as well as demographic information on approximately 300 million consumers in 126 million households across the U.S. Other players in these industries hold similar types and amounts of information.
A 2018 infographic from Visual Capitalist provides a useful overview of the net of data collection that feeds this enormous ecosystem of our personal reference and behavioral data. Anonymization, together with encryption and other security tools, is supposed to protect such data from abuse while allowing it to be used for analytics in areas as diverse as customer upselling and socioeconomic research. A smaller, but more sensitive, set of big data circulates for medical research, allegedly protected by similar means.
What to Do?
As often happens when such research papers are published, software vendors take the opportunity to launch campaigns along the lines of "My product's feature X addresses this problem." In some cases, it appears true. In others, the marketing team has missed the line in the paper that says feature X is particularly useless. Advice on the veracity of such claims should come from an expert in the field of security and privacy rather than a data generalist like me. Nonetheless, Rocher et al. do validate a view that I have supported for some time -- that a holistic approach is required. The Privacy by Design (PbD) framework, although now two decades old, can form a strong basis for decisions and action when acquiring and using any data with PII characteristics.
PbD -- initially proposed in the 1990s, updated this decade, and incorporated into the GDPR -- covers a "trilogy of encompassing applications": IT systems, accountable business practices, and physical design and networked infrastructure. In 2010, its inventor, Ann Cavoukian, then Information and Privacy Commissioner for the Canadian province of Ontario, published a set of seven foundational principles that should be followed when dealing with sensitive data:
- Proactive, not reactive; [preventive], not remedial
- Privacy as the default setting
- Privacy embedded into design
- Full functionality -- positive-sum, not zero-sum
- End-to-end security -- full life cycle protection
- Visibility and transparency -- keep it open
- Respect for user privacy -- keep it user-centric
It is not without its critics in areas such as vagueness and difficulty. Nonetheless, its enterprise (indeed, supra-enterprise) scope is, in my opinion, the level at which concerns about privacy and PII protection must be addressed. Encryption and anonymization software and techniques are too limited in scope to apply anything more than sticking plasters to what is an arterial wound in data systems that are fundamental to the working of society today.
A Last Thought That Should Be First
In addition to these principles, I would add an eighth: minimize, and then minimize again, the amount of data you acquire and use. The less data you have, the less the impact of data breaches or deanonymization.
Such an admonition may ring strange in the data-driven world of digital transformation. However, experience over three decades of data warehousing and business intelligence has shown that businesspeople, when asked what data they need, will most likely say "all of it" for fear that IT may be too busy or slow to ask again, especially if they have little idea of what exactly its use might be. In reality, they often use no more than a relatively small percentage of what data they have.
Even in big data analytics, return on investment begins to plateau as more detailed and granular data is analyzed. Increasing data acquisition volumes gradually lets you recognize early when that plateau is approaching and wind down the volume of additional data you acquire. Furthermore, when existing data has served its purpose, consider carefully whether privacy protection dictates it should be deleted.
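One way to put the plateau advice into practice is sketched below, under stated assumptions: you measure some value metric after each acquisition step and stop once the gain per added unit of data drops below a threshold you choose. The function name, the accuracy figures, and the threshold are all hypothetical and purely illustrative.

```python
def first_plateau_step(volumes, values, min_gain_per_unit):
    """Index of the first step whose gain per added unit of data is below the threshold.

    Returns None if every step still clears the threshold.
    """
    for i in range(1, len(volumes)):
        added = volumes[i] - volumes[i - 1]
        if added <= 0:
            raise ValueError("volumes must be strictly increasing")
        # Marginal return: how much the metric improved per extra unit of data.
        gain = (values[i] - values[i - 1]) / added
        if gain < min_gain_per_unit:
            return i
    return None


# Hypothetical measurements: model accuracy after each acquisition step,
# with volumes in millions of rows.
volumes = [1, 2, 4, 8, 16]
accuracy = [0.70, 0.78, 0.83, 0.85, 0.855]

step = first_plateau_step(volumes, accuracy, min_gain_per_unit=0.01)
if step is not None:
    print(f"Returns flatten at {volumes[step]}M rows; consider acquiring no more.")
```

With these made-up numbers, the step from 4 to 8 million rows is the first to fall below the threshold, which suggests that the later doubling to 16 million rows adds privacy exposure for very little analytic gain.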
Dr. Barry Devlin defined the first data warehouse architecture in 1985 and is among the world’s foremost authorities on BI, big data, and beyond. His 2013 book, Business unIntelligence, offers a new architecture for modern information use and management.