Protecting Personal Information: Differential Privacy and Data Science
In many cases, it's still possible to extract personal info from anonymized data. Thanks to changing regulations, differential privacy and other new techniques for obscuring sensitive data are going to start getting a lot more attention.
- By Steve Swoyer
- September 12, 2016
The basic techniques data scientists and statisticians use to anonymize their data sets are insufficient. In many cases, it's still possible to extract personally identifiable information (PII) from anonymized data sets. Thanks to a changing regulatory climate, differential privacy and other new techniques for obscuring sensitive data are going to start getting a lot more attention.
Take the European Union's General Data Protection Regulation (GDPR), which restricts the transfer of PII outside of the EU. What actually constitutes PII? Names, addresses, telephone numbers, and government-issued identification numbers, certainly, but also credit card numbers, postal codes, email addresses, social media identities, and a host of other personal identifiers.
The thing is, none of this information is essential for data science. In most cases, it just isn't important for a data scientist to know which specific person purchased which items -- or when and where they purchased them, for that matter. Data scientists are only rarely interested in single individuals. They're after larger patterns, correlations, or anomalies. They're interested in finding similar people; they're on the hunt for things most people don't even know about themselves.
In common practice, data scientists and statisticians make use of data anonymization techniques (data masking, hashing, and so on) to anonymize PII. So long as a data scientist knows what kind of data they're working with -- names and addresses, credit card numbers, etc. -- the content of this data is often immaterial.
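In Python, for instance, this kind of hashing-based pseudonymization might look something like the following minimal sketch (the record fields and salt are hypothetical):

```python
# Minimal sketch of hash-based pseudonymization; the field names and salt are illustrative.
import hashlib

def pseudonymize(value: str, salt: str = "example-salt") -> str:
    """Replace a PII value with a 256-bit (64-hex-character) SHA-256 digest."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

record = {"name": "Marie", "city": "Brussels", "purchase": "espresso machine"}
masked = dict(record, name=pseudonymize(record["name"]))
print(masked)  # the name is now an opaque token; the city and purchase are untouched
```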
Is this enough, however? Is a 256-bit hash (i.e., a long string of seemingly random characters) that stands in for a woman named "Marie" who lives in Brussels sufficiently anonymized? You'd think so. Believe it or not, you'd be dead wrong.
Anonymization Not Anonymous Enough
A decade ago, Netflix announced its now-famous Netflix Prize, offering a $1 million bounty to anyone who could develop a filtering algorithm that would predict subscribers' film preferences with more accuracy than its own Cinematch algorithm. To help contestants, Netflix released two notionally anonymized data sets containing ratings from approximately 500,000 of its subscribers.
You can guess where this is going, can't you?
A team of researchers at the University of Texas at Austin was able to partially de-anonymize the Netflix data sets. Arvind Narayanan and Vitaly Shmatikov used statistical and mathematical techniques to cross-correlate the ratings "anonymous" Netflix users had given certain movies with ratings on the Internet Movie Database (IMDB).
In this way, the duo wrote, they were able to uncover the "apparent political preferences and other potentially sensitive information" of Netflix users.
"[A]n adversary who knows a little bit about some subscriber can easily identify her record if it is present in the dataset, or, at the very least, identify a small set of records which include the subscriber's record," Narayanan and Shmatikov wrote. "The adversary's background knowledge need not be precise, e.g., the dates may only be known ... with a 14-day error, the ratings may be known only approximately, and some of the ratings and dates may even be completely wrong."
Data Can't Help Being Leaky
De-anonymization -- closely related to what's known as "entity resolution" -- was an issue even prior to the Netflix Prize.
In their paper, Narayanan and Shmatikov cite several other cases, including a successful effort (in the 1990s) that de-anonymized a publicly available healthcare database in Massachusetts by linking it to a database of registered voters. Narayanan and Shmatikov have also used their technique to extract personally identifiable information from anonymized social media data sets.
The core problem is that information is (for lack of a better word) "leaky."
Anonymization alone is insufficient: it doesn't matter what you call an entity if the relationships or correspondences that entity has with non-anonymized data remain accurate or truthful. An attacker or adversary can still glean useful information about the entity.
Furthermore, in the age of social media, it's much easier to resolve entities and relationships and to derive PII. It's possible to resolve entity "X" in one database or data set to entity "Marie" in another.
The essential takeaway is that when only basic anonymization techniques are used, personally identifiable information can and will leak out.
Differential Privacy: A New Hope
What if not all of the information in an anonymized data set can be resolved to PII? What if a mathematically determined degree of error -- i.e., noise -- could be injected into the data set? What if this noise consisted of spurious purchases, incorrect dates and times, phony order numbers, random ZIP codes, and so on?
This noise would still permit data scientists and statisticians to extract valuable signal from their working data sets. They would still be able to identify useful patterns, establish significant correlations, and highlight promising or vexing anomalies. It would just be difficult, if not impossible, to extract PII from the data sets they're working with. Win-win, right?
The scenario I've just described is the idea behind a technique called differential privacy. It's one of the most promising techniques in the field of digital privacy.
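In its most common form, differential privacy adds this calibrated noise to the answers of aggregate queries rather than (or in addition to) the raw records themselves. Here is a minimal sketch of the Laplace mechanism, the standard building block, applied to a count query; the epsilon value and sample data are purely illustrative:

```python
# Minimal sketch of the Laplace mechanism for a counting query.
import numpy as np

def noisy_count(records, predicate, epsilon=0.5):
    """Differentially private count: the true count plus Laplace noise.

    A counting query has sensitivity 1 (adding or removing one person changes
    the result by at most 1), so Laplace noise with scale 1/epsilon suffices
    for epsilon-differential privacy.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

purchases = [
    {"city": "Brussels", "item": "espresso machine"},
    {"city": "Antwerp", "item": "kettle"},
    {"city": "Brussels", "item": "grinder"},
]

# Analysts still get a useful aggregate; no individual record can be pinned down.
print(noisy_count(purchases, lambda r: r["city"] == "Brussels"))
```

Smaller epsilon values mean more noise and stronger privacy; larger values mean more accuracy and weaker privacy.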
Major Players Embrace Differential Privacy
Recently, it's had a few very public, very promising wins. In June, Apple announced it would use differential privacy to anonymize the data its macOS and iOS devices transmit back to it. Apple was following in the footsteps of other tech giants, including Google.
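Apple and Google apply the idea in its "local" form, perturbing data on the device before anything is transmitted. A classic building block for that approach is randomized response, sketched below; this is a simplified illustration, not either company's actual implementation:

```python
# Simplified randomized response: each device lies with a known probability.
import random

P_TRUTH = 0.75  # probability a device reports its true answer

def randomized_response(truthful_answer: bool) -> bool:
    """Report the true answer with probability P_TRUTH, otherwise flip a fair coin."""
    if random.random() < P_TRUTH:
        return truthful_answer
    return random.random() < 0.5

# Each device perturbs its own answer before it ever leaves the device.
true_answers = [random.random() < 0.3 for _ in range(100_000)]  # 30% "yes" in truth
reports = [randomized_response(a) for a in true_answers]

# The collector never sees any individual's true answer, but because the bias
# of the coin flips is known, the population-level rate can still be recovered.
observed = sum(reports) / len(reports)
estimate = (observed - (1 - P_TRUTH) * 0.5) / P_TRUTH
print(round(estimate, 3))  # close to 0.30
```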
Microsoft was actually out in front of differential privacy: as far back as 2006, Cynthia Dwork, a distinguished scientist at Microsoft Research, published seminal work on the subject.
More recently, some of the best minds in the space -- such as Arvind Narayanan, part of the duo that de-anonymized the Netflix Prize data set -- have argued that differential privacy is a superior alternative to basic anonymization (aka "de-identification") techniques. As Narayanan puts it in the title of his paper: "De-identification still doesn't work."
Smaller Implementations
More tellingly, several start-ups now specialize in differential privacy, including LeapYear Technologies, formerly known as Shroudbase. (LeapYear is currently in stealth mode, so little information is available.)
Because the math and statistics behind differential privacy are so complex, there's a dearth of workable free solutions: one of the earliest, Privacy Integrated Queries (PINQ), is no longer maintained; another, DualQuery, has to be compiled from source.
Another proposed implementation, the Multiplicative Weights Exponential Mechanism (MWEM), isn't yet publicly available. DualQuery, at least, is available via Git, and most data scientists aren't chary about using Git.
Future of Personal Privacy
In time, differential privacy will probably be widely used -- not just by companies (such as Apple, Google, and Microsoft) that have the in-house know-how to build their own implementations, but by organizations of all kinds. Think of the possibility of incorporating differential privacy technology into relational database systems, NoSQL databases, file systems, and so on.
In this scheme, differential privacy could be applied at the query level. Microsoft even has a Web portal dedicated to the discussion of differential privacy and database systems.
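At the query level, that might look roughly like the sketch below: a thin wrapper around the data adds calibrated noise to every aggregate answer and debits a finite privacy budget as it goes. The interface is hypothetical, loosely inspired by systems such as PINQ:

```python
# Hypothetical query-level wrapper: noisy aggregates plus privacy-budget accounting.
import numpy as np

class PrivateTable:
    def __init__(self, rows, total_budget=1.0):
        self._rows = rows
        self._budget = total_budget  # total epsilon this table is allowed to spend

    def count(self, predicate, epsilon=0.1):
        if epsilon > self._budget:
            raise RuntimeError("privacy budget exhausted")
        self._budget -= epsilon
        true_count = sum(1 for r in self._rows if predicate(r))
        return true_count + np.random.laplace(scale=1.0 / epsilon)

table = PrivateTable([{"zip": "1000"}, {"zip": "1000"}, {"zip": "2000"}])
print(table.count(lambda r: r["zip"] == "1000"))  # noisy answer; budget is debited
```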
This is just one example. The larger point is that basic data anonymization technologies are insufficient and that new techniques -- such as differential privacy -- must be used to protect PII. This problem will only become more pressing over time.