Protecting Personal Information: Differential Privacy and Data Science
In many cases, it's still possible to extract personal info from anonymized data. Thanks to changing regulations, differential privacy and other new techniques for obscuring sensitive data are going to start getting a lot more attention.
- By Steve Swoyer
- September 12, 2016
The basic techniques data scientists and statisticians use to anonymize their data sets are insufficient. In many cases, it's still possible to extract personally identifiable information (PII) from anonymized data sets. Thanks to a changing regulatory climate, differential privacy and other new techniques for obscuring sensitive data are going to start getting a lot more attention.
Take the European Union's General Data Protection Regulation (GDPR), which restricts the transfer of PII outside of the EU. What actually constitutes PII? Names, addresses, telephone numbers, and government-issued identification numbers, certainly, but also credit card numbers, postal codes, email addresses, social media identities, and a host of other personal identifiers.
The thing is, none of this information is essential for data science. In most cases, it just isn't important for a data scientist to know which person purchased which items -- or when and where they purchased them, for that matter. Data scientists are only rarely interested in single individuals. They're after larger patterns, correlations, or anomalies. They're interested in finding similar people; they're on the hunt for things most people don't even know about themselves.
In common practice, data scientists and statisticians make use of data anonymization techniques (data masking, hashing, and so on) to anonymize PII. So long as a data scientist knows what kind of data they're working with -- names and addresses, credit card numbers, etc. -- the content of this data is often immaterial.
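A hash-based masking step of the kind described above can be sketched as follows. This is a minimal illustration, not any particular tool's implementation, and the record fields are hypothetical:

```python
import hashlib

def pseudonymize(value: str) -> str:
    """Replace a PII value with its 256-bit (SHA-256) hash."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

record = {"name": "Marie", "city": "Brussels", "total": 42.50}
masked = {**record, "name": pseudonymize(record["name"])}

# The analyst still knows the *kind* of field (a name) without seeing
# its content. Note, though, that the mapping is deterministic: "Marie"
# always yields the same token, so the token is really a pseudonym --
# one that can be linked across data sets or attacked by hashing guesses.
assert pseudonymize("Marie") == masked["name"]
```

The determinism is exactly what makes the masked data still useful for joins and aggregation -- and, as the next section shows, exactly what makes it vulnerable.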
Is this enough, however? Is a 256-bit hash (i.e., a long string of seemingly random characters) that represents a woman named "Marie" who lives in Brussels sufficiently anonymized? You'd think so. Believe it or not, you'd be dead wrong.
Anonymization Not Anonymous Enough
A decade ago, Netflix announced its now-famous Netflix Prize, offering a $1 million bounty to anyone who could develop a collaborative filtering algorithm that would predict subscribers' film preferences with more accuracy than its own Cinematch algorithm. To help contestants, Netflix released two notionally anonymized data sets containing ratings from approximately 500,000 of its subscribers.
You can guess where this is going, can't you?
A team of researchers at the University of Texas at Austin was able to partially de-anonymize the Netflix data sets. Arvind Narayanan and Vitaly Shmatikov used statistical and mathematical techniques to cross-correlate the ratings "anonymous" Netflix users had given certain movies with ratings on the Internet Movie Database (IMDb).
In this way, the duo wrote, they were able to uncover the "apparent political preferences and other potentially sensitive information" of Netflix users.
"[A]n adversary who knows a little bit about some subscriber can easily identify her record if it is present in the dataset, or, at the very least, identify a small set of records which include the subscriber's record," Narayanan and Shmatikov wrote. "The adversary's background knowledge need not be precise, e.g., the dates may only be known ... with a 14-day error, the ratings may be known only approximately, and some of the ratings and dates may even be completely wrong."
Data Can't Help Being Leaky
De-anonymization -- closely related to what database researchers call "entity resolution" -- was an issue even before the Netflix Prize.
In their paper, Narayanan and Shmatikov cite several other cases, including a successful effort (by Latanya Sweeney in the 1990s) that de-anonymized a publicly available healthcare database in Massachusetts by linking it to a database of registered voters. Narayanan and Shmatikov have also used their technique to extract personally identifiable information from anonymized social media data sets.
The core problem is that information is (for lack of a better word) "leaky."
Anonymization alone is insufficient: it doesn't matter what you call an entity if the relationships or correspondences that entity has with non-anonymized data remain accurate. An attacker or adversary can still glean useful information about the entity.
Furthermore, in the age of social media, it's much easier to resolve entities and relationships and to derive PII. It's possible to resolve entity "X" in one database or data set to entity "Marie" in another.
The essential takeaway is that when only basic anonymization techniques are used, personally identifiable information can and will leak out.
Differential Privacy: A New Hope
What if not all of the information in an anonymized data set can be resolved to PII? What if a mathematically determined degree of error -- i.e., noise -- could be injected into the data set? What if this noise consisted of spurious purchases, incorrect dates and times, phony order numbers, random ZIP codes, and so on?
This noise would still permit data scientists and statisticians to extract valuable signal from their working data sets. They would still be able to identify useful patterns, establish significant correlations, and highlight promising or vexing anomalies. It would just be difficult, if not impossible, to extract PII from the data sets they're working with. Win-win, right?
The scenario I've just described is the idea behind a technique called differential privacy. It's one of the most promising techniques in the field of digital privacy.
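The canonical way to inject that mathematically determined noise is the Laplace mechanism. The sketch below applies it to a simple counting query; the function names, toy data, and choice of epsilon are illustrative assumptions, not a production implementation:

```python
import random

def laplace_noise(scale: float) -> float:
    # The difference of two i.i.d. exponential draws is Laplace-distributed.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def private_count(records, predicate, epsilon=0.5):
    """Answer a count query with noise calibrated to sensitivity 1.

    Adding or removing any single record changes the true count by at
    most 1, so Laplace noise with scale 1/epsilon masks any individual's
    contribution while leaving the aggregate signal largely intact.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

purchases = [{"city": "Brussels"}] * 120 + [{"city": "Ghent"}] * 80
noisy = private_count(purchases, lambda r: r["city"] == "Brussels")
print(round(noisy))  # close to the true count of 120
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon means a more accurate answer. Tuning that trade-off is where most of the hard math lives.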
Major Players Embrace Differential Privacy
Recently, it's had a few very public, very promising wins. In June, Apple announced it would use differential privacy to anonymize the data its macOS and iOS devices transmit back to it. Apple was following in the footsteps of other tech giants, including Google.
Microsoft was actually out in front of differential privacy: in 2006, Cynthia Dwork, a distinguished scientist at Microsoft Research, co-authored the seminal paper on the subject.
More recently, some of the best minds in the space -- such as Arvind Narayanan, part of the duo that de-anonymized the Netflix Prize data set -- have argued that differential privacy is a superior alternative to basic anonymization (aka "de-identification") techniques. As Narayanan puts it in the title of his paper: "De-identification still doesn't work."
More tellingly, several start-ups now specialize in differential privacy, including LeapYear Technologies, formerly known as Shroudbase. (LeapYear is currently in stealth mode, so little information is available.)
Because the math and statistics behind differential privacy are so complex, there's a dearth of workable free solutions: one of the earliest, Privacy Integrated Queries (PINQ), is no longer maintained; another, DualQuery, has to be compiled from source.
Another proposed implementation, the Multiplicative Weights Exponential Mechanism (MWEM), isn't yet publicly available. That said, DualQuery is available via Git, and most data scientists aren't chary about using Git.
Future of Personal Privacy
In time, differential privacy will probably be widely used -- not just by companies (such as Apple, Google, and Microsoft) that have the in-house know-how to build their own implementations, but by organizations of all kinds. Think of the possibility of incorporating differential privacy technology into relational database systems, NoSQL databases, file systems, and so on.
In this scheme, differential privacy could be applied at the query level. Microsoft even has a Web portal dedicated to the discussion of differential privacy and database systems.
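A query-level enforcement layer might look something like the following sketch, in which each query spends part of a finite privacy budget and the system refuses to answer once the budget is exhausted. The class, parameters, and budget values are hypothetical:

```python
import random

class PrivateTable:
    """Answers count queries under a finite differential-privacy budget."""

    def __init__(self, rows, total_epsilon=1.0):
        self.rows = rows
        self.remaining = total_epsilon

    def count(self, predicate, epsilon=0.25):
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted; refusing to answer")
        self.remaining -= epsilon
        true_count = sum(1 for r in self.rows if predicate(r))
        # Laplace noise with scale 1/epsilon, sampled as the difference
        # of two i.i.d. exponential draws.
        noise = random.expovariate(epsilon) - random.expovariate(epsilon)
        return true_count + noise

table = PrivateTable([{"zip": "1000"}] * 50 + [{"zip": "9000"}] * 30)
print(round(table.count(lambda r: r["zip"] == "1000")))  # roughly 50
```

Budget accounting like this is what lets a database answer many queries without letting an analyst average away the noise and reconstruct individual records.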
This is just one example. The larger point is that basic data anonymization technologies are insufficient and that new techniques -- such as differential privacy -- must be used to protect PII. This problem will only become more pressing over time.