TDWI Upside - Where Data Means Business

GDPR and Tokenizing Data (Part 3 in a Series)

You need to protect any personal data your enterprise collects. Tokenizing data is one way to stay in compliance with GDPR.

In the first two parts of this series we examined the six principles of the GDPR. In this final article, we'll look at how enterprises are protecting personal data to meet the GDPR's requirements, including through tokenization.


When the GDPR is mentioned, the first thing that comes to mind may be "personal data" because that's what the GDPR is all about: protecting personal data. It's important to understand what, exactly, is meant by personal data.

We have to be very clear here because there is "personal data" and then there is "personally identifiable information" (popularly referred to as PII). Let's take them in reverse order.

  • Personally identifiable information is data that can be used to directly or indirectly identify a particular person. This consists of such data items as a person's name, address, email address, or phone number. For example:

    John Doe, 27 First St., NY 12345

  • Personal data is data about the person and contains PII data. For example, this record is personal data:

    John Doe, 27 First St., NY 12345, occupation bus driver, salary $41,000

    If we remove the PII data from this record, it no longer contains personal data. Instead, it becomes anonymous:

    Occupation bus driver, salary $41,000

This anonymous (or de-personalized) data could be used to analyze occupations and salaries while remaining GDPR-compliant.
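As a minimal sketch of the bus driver example, the following Python snippet removes the PII items from a record and keeps the rest. The field names and the choice of which fields count as PII are assumptions made for illustration only.

```python
# Hypothetical field names; which fields are treated as PII is an assumption.
PII_FIELDS = {"name", "address"}


def depersonalize(record: dict) -> dict:
    """Return a copy of the record without its PII fields."""
    return {key: value for key, value in record.items() if key not in PII_FIELDS}


record = {
    "name": "John Doe",
    "address": "27 First St., NY 12345",
    "occupation": "bus driver",
    "salary": 41000,
}

anonymous = depersonalize(record)
print(anonymous)  # {'occupation': 'bus driver', 'salary': 41000}
```

The original record is left untouched; only the copy handed to analysts loses the name and address.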

In addition, you must consider sensitive personal information (SPI), often simply known as sensitive data. As the name implies, SPI data are facts about a person considered private. The GDPR lists specific items that are sensitive:

  • Racial or ethnic origin
  • Political opinions
  • Religious or philosophical beliefs
  • Trade union membership
  • Health/medical information
  • Sexual orientation
  • Genetic data
  • Biometric data that uniquely identifies an individual

Protecting Data: Two Approaches

What can you do if you want to collect and use personal data, and a group within your organization has a legitimate purpose to work with all data while another group only requires access to de-personalized information? There are two approaches. You can split the personal data and de-personalized data into two data marts, thus "anonymizing" the sensitive data for those who don't need it. Imagine splitting the bus driver data from our earlier example into a data mart with PII data and a different data mart with (now de-personalized) occupation and salary data.
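The split into two data marts can be sketched with Python's built-in sqlite3 module. The table names, column names, and the second sample row are made up for this illustration, and the sketch assumes the two marts deliberately share no key.

```python
import sqlite3

# Sample source rows; the second person is invented for the illustration.
source_rows = [
    ("John Doe", "27 First St., NY 12345", "bus driver", 41000),
    ("Jane Roe", "3 Second Ave., NY 12345", "bus driver", 43000),
]

conn = sqlite3.connect(":memory:")
# Hypothetical marts: one holds PII, the other only de-personalized facts.
conn.execute("CREATE TABLE pii_mart (name TEXT, address TEXT)")
conn.execute("CREATE TABLE analytics_mart (occupation TEXT, salary INTEGER)")

for name, address, occupation, salary in source_rows:
    conn.execute("INSERT INTO pii_mart VALUES (?, ?)", (name, address))
    conn.execute("INSERT INTO analytics_mart VALUES (?, ?)", (occupation, salary))

# Analysts query the de-personalized mart without ever seeing names or addresses.
avg_salary = conn.execute(
    "SELECT occupation, AVG(salary) FROM analytics_mart GROUP BY occupation"
).fetchall()
print(avg_salary)  # [('bus driver', 42000.0)]
```

Because the analytics mart carries no name, address, or linking key, the group that only needs occupation and salary figures never touches PII.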

The other option for complying with the GDPR is to de-personalize individual data elements so an enterprise can keep all its data in a single data mart. Because such a data mart still contains personal data, albeit in de-personalized form, it falls under the GDPR's rules.

How to De-Personalize Data

Obfuscation (or obscuring) is a term used quite often in relation to the GDPR and "hiding" personal data. Obfuscation is a technique that makes data unrecognizable. "John Doe" becomes "hejn fed," for example. Obfuscation can be destructive, meaning the obfuscated data is unrecoverable (that is, the transformation is permanent), or non-destructive, meaning the data can be recovered using a key or conversion table.
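The difference between the two kinds of obfuscation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular product works: the function names are hypothetical, the destructive variant simply swaps letters for random ones, and the non-destructive variant keeps an in-memory conversion table.

```python
import secrets
import string


def scramble(value: str) -> str:
    """Destructive: swap every letter for a random one and keep no mapping."""
    return "".join(
        secrets.choice(string.ascii_lowercase) if ch.isalpha() else ch
        for ch in value
    )


# Conversion table for the non-destructive variant (hypothetical, in memory).
_original_to_obscured = {}
_obscured_to_original = {}


def obscure_recoverably(value: str) -> str:
    """Non-destructive: scramble, but remember how to get back."""
    if value in _original_to_obscured:
        return _original_to_obscured[value]
    for _ in range(100):  # retry if the scrambled value is already in use
        obscured = scramble(value)
        if obscured not in _obscured_to_original:
            break
    else:
        raise RuntimeError("could not find an unused obscured value")
    _original_to_obscured[value] = obscured
    _obscured_to_original[obscured] = value
    return obscured


def recover(obscured: str) -> str:
    """Look the original value up in the conversion table."""
    return _obscured_to_original[obscured]


lost = scramble("John Doe")             # cannot be turned back into "John Doe"
kept = obscure_recoverably("John Doe")  # can be recovered via the table
print(lost, kept, recover(kept))
```

Whoever holds the conversion table can reverse the non-destructive variant, so the table itself needs the same protection as the original data.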

Tokenization is a form of non-destructive obfuscation. With tokenization, data is obscured but is recoverable via a special secure key. An example of this is credit card processing, where the credit card number is replaced in your data mart with a set of (seemingly) meaningless numbers and characters. The "real" credit card number is only available when the card processing company talks to the bank, at which point it is de-tokenized.

To tokenize or not to tokenize, that is the question. Excuse my corruption of Will Shakespeare, but I once worked on a BI project in Stratford-upon-Avon (Shakespeare's birthplace) and a muse of fire hast upon me descended.

"Tokenizing" PII data items renders them into meaningless groups of seemingly random characters that cannot be linked to an individual. The tokenized values can be converted back to the original values for those people with a legitimate interest in the PII data while keeping the PII data useless to everyone else.
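A minimal sketch of that idea follows, assuming a hypothetical in-memory TokenVault class and made-up role names. Commercial tokenization tools store the mapping securely and integrate with real access control; this only shows the shape of the technique.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory vault mapping random tokens to original values."""

    def __init__(self, authorized_roles):
        self._authorized_roles = set(authorized_roles)
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        """Return the token for a value, creating a random one on first use."""
        if value not in self._value_to_token:
            token = secrets.token_hex(8)  # 16 seemingly random hex characters
            while token in self._token_to_value:  # extremely unlikely collision
                token = secrets.token_hex(8)
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def detokenize(self, token: str, role: str) -> str:
        """Return the original value, but only for an authorized role."""
        if role not in self._authorized_roles:
            raise PermissionError(f"role {role!r} may not de-tokenize data")
        return self._token_to_value[token]


# Role names are assumptions for the example.
vault = TokenVault(authorized_roles={"claims_handler"})
token = vault.tokenize("John Doe")
print(token)                                      # random, differs on every run
print(vault.detokenize(token, "claims_handler"))  # John Doe
```

A caller with any other role, such as an analyst, gets a PermissionError and sees only the token.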

There are a variety of tools that will tokenize personal data while maintaining referential integrity. That is, "John Doe" will tokenize to a value common throughout your database systems, maintaining any primary key/foreign key links.
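To illustrate how referential integrity can survive tokenization, the sketch below derives tokens deterministically with a keyed hash (HMAC-SHA256 from Python's standard library). This is a stand-in chosen for the example, not the method of any specific tool: a keyed hash is one-way, so a real tokenization product would also keep a secure mapping for de-tokenization. The key, table contents, and policy number are invented.

```python
import hashlib
import hmac

# Assumption: a demo key. A real key would live in a key store, not in code.
SECRET_KEY = b"demo-key-not-for-production"


def tokenize(value: str) -> str:
    """Deterministic token: the same input always yields the same token."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# Two hypothetical tables linked by the customer name.
customers = [{"customer": "John Doe", "occupation": "bus driver"}]
policies = [{"customer": "John Doe", "policy": "P-1001"}]

tok_customers = [{**row, "customer": tokenize(row["customer"])} for row in customers]
tok_policies = [{**row, "customer": tokenize(row["customer"])} for row in policies]

# The join on the tokenized key still works because both tokens are identical.
joined = [
    (c["occupation"], p["policy"])
    for c in tok_customers
    for p in tok_policies
    if c["customer"] == p["customer"]
]
print(joined)  # [('bus driver', 'P-1001')]
```

Neither tokenized table contains "John Doe", yet the primary key/foreign key link between them is intact.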

Tokenization Pros and Cons

Tokenization of personal and sensitive data items would seem a logical way to satisfy the GDPR. After all:

  • It provides a clean method to hide personal and sensitive data (both PII and SPI)
  • It provides another layer of data security in case of a data breach (personal data is hidden)
  • It maintains referential integrity
  • Good test data can be extracted from production without fear of exposing personal data
  • Once the personal data is tokenized at the source, it will flow to all downstream systems seamlessly

Unfortunately, there are drawbacks:

  • There is a price to pay for the tokenization tool, supporting hardware, and configuration.
  • Performance takes a hit. Tokenizing and de-tokenizing will have an impact on your application's performance. Whether this is significant will depend on the tool, hardware performance, etc.

Conclusion

The GDPR is not the end but the beginning, and the regulations will surely continue to change as BI evolves. The GDPR requires ongoing monitoring of your use of individuals' personal data along with understanding where that data is stored and who has access to it. Together with the growing use of artificial intelligence and robots for analysis and decision making, this heralds a new era for BI.

 

About the Author

Rod Welch is a BI consultant with breadth and depth of experience gained over more than 15 years in the BI environment, from agile requirements gathering and dimensional modeling to ETL programming. In addition, he has a keen interest in agile and automated data warehouse development and the move to cloud storage. He is currently contracted to a U.K. insurance company to assess the impact of, and define the detailed requirements for, implementing GDPR. You can contact the author via email or via LinkedIn.

