TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing


Hidden Data Combinations and Facing the Challenge of CCPA

Do you know whether your data violates CCPA? Integris Software CEO Kristina Bergman recently shared with Upside some best practices regarding sensitive data.

By James E. Powell
March 20, 2020

The California Consumer Privacy Act (CCPA) is now in effect. Enterprises that do business in California must consider whether the data they collect is subject to the new regulation and how to update their privacy and data handling policies. Kristina Bergman, CEO of Integris Software, explains why CCPA is important to your enterprise.

Upside: What are some of the little-known requirements of CCPA?

Kristina Bergman: With companies' data privacy practices being scrutinized more than ever before, it's important to know that the definition of personal information (PI) has evolved rapidly under CCPA. Beyond obvious personal information such as names and addresses, according to CCPA, PI also includes household and inferred personal data.

CCPA doesn't provide a terribly clear definition of PI, but even data such as how often you open and close your refrigerator could tell someone when you're at home. GPS tracking data can show where someone lives and works -- and even what religion they practice. Most organizations don't realize these pieces of data are sensitive and are surprised by all the personal information they have in their inventory from data sharing agreements and acquisitions where unknown data entered their systems.

What are some of the problems you expect organizations will have in meeting CCPA requirements?

There are many items that could be classified as PI under CCPA but are not clearly defined, which makes it difficult for businesses to know what data is affected. Additionally, organizations are viewing CCPA as a manual workflow problem, but it's more than that. Manual surveys and processes become immediately outdated and don't account for the data flowing through organizations.

Historically, most people used static spreadsheets to identify where personal information existed in their organization. However, this wasn't sufficient because it relied on someone's best guess of what was in a massive data set that no human could possibly read through. With manual processes, there's also no real way to know whether data-handling policies and procedures are working; they need to be tied back to requirements in real time. Manual processes of this nature also aren't defensible in a court of law or in the court of public opinion.

Organizations will experience new challenges from mergers and acquisitions. The market will face increased pressure under new regulations, and businesses will need to prove compliance before an acquisition to avoid inheriting the other company's data risk. For example, they'll need to prove that data was collected in a lawful manner, with a legitimate business purpose, and in the right format. One of the largest problems will surround the cleanliness and validity of the data, which could potentially decrease the value of the acquisition.

Given the overwhelming issue of having so much data that needs to be identified and organized, businesses will have difficulty discovering what data they have, where it's located, and how it's being used. This isn't a one-time task. It needs to happen on a continual basis as data changes to ensure ongoing compliance. Data is always on the move -- whether it's going out to third parties, purchased from others, or combined through mergers and acquisitions. Data is not a static thing and can't be treated as such.

What are toxic combinations of data? How are they a hidden challenge for CCPA and can you give a few examples?

Most organizations don't realize that even the anonymized and "harmless" data in their repositories can create toxic combinations that reveal identities and create compliance risks. For example, just by combining gender, ZIP code, and date of birth, you can identify 87 percent of the U.S. population. Even seemingly anonymous data points such as cable purchase history and donations to a church or nonprofit can imply religion, political affiliation, or sexual orientation.
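As an illustration (not part of the interview), the gender, ZIP code, and date of birth example can be checked on any data set by counting how many records have a combination of values that no other record shares. The sketch below is a minimal version of that idea; the records, field names, and the `unique_fraction` helper are all hypothetical and made up for this example.

```python
from collections import Counter

# Hypothetical records: no names, only "harmless" attributes.
RECORDS = [
    {"gender": "F", "zip": "98101", "dob": "1980-02-14"},
    {"gender": "F", "zip": "98101", "dob": "1980-02-14"},
    {"gender": "M", "zip": "98101", "dob": "1975-07-01"},
    {"gender": "F", "zip": "98052", "dob": "1991-11-30"},
    {"gender": "M", "zip": "98052", "dob": "1975-07-01"},
]


def unique_fraction(records, fields):
    """Return the fraction of records whose values for `fields`
    appear exactly once, i.e. records that the combination singles out."""
    if not records:
        return 0.0
    keys = [tuple(record[field] for field in fields) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)


if __name__ == "__main__":
    # Each field alone singles out few or no records...
    for field in ("gender", "zip", "dob"):
        print(field, unique_fraction(RECORDS, [field]))
    # ...but the combination singles out most of them.
    print("combined", unique_fraction(RECORDS, ["gender", "zip", "dob"]))
```

In this toy data, no single record is unique by gender or ZIP code alone, yet three of the five records are unique once the three fields are combined. A high fraction on real data would suggest the combination deserves the same handling as directly identifying information.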

Innocent events such as someone joining a Facebook group for breast cancer survivors with their real name could provide identifiable health condition data. There's a lot of non-HIPAA-compliant healthcare data out in the world from situations like this. Vacation day requests are another great example. If an employee takes the same days off each year that coincide with a religious holiday, someone could infer the religion or ethnicity of that person.

This issue is compounded by data sharing agreements that continually add volumes of new data into systems. Our recent Integris Software Data Privacy Maturity Study found that 40 percent of respondents had 50 or more data-sharing agreements in place. Altogether, this shows that it's more critical than ever for organizations to take real-time inventory of their data, know what they have, and know how to protect that information.

How does an enterprise find out if it has such a toxic combination among its data?

For enterprises to do a deep dive and figure out if they have toxic combinations of data across their systems, they'd have to use automated privacy tools and machine learning to crawl through everything. There is no way humans can go through the enormous amounts of data businesses store to read every line in every database and identify possible toxic combinations. Businesses also need to know that these combinations can be found in unstructured and semistructured data such as blog posts, which people can't keep up with scanning. People also don't go down to the data-element level to examine correlations -- they stop at the metadata.
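To give a rough sense of what scanning at the data-element level means for unstructured text, here is a minimal sketch (not from the interview, and far simpler than the machine learning tools described above). It assumes a few hand-written regular expressions are enough to spot candidate personal information; the pattern set and the `scan_text` helper are hypothetical, and real values such as phone numbers come in many more formats than these patterns cover.

```python
import re

# Hypothetical, deliberately simple patterns. A production scanner would
# need many more formats plus validation to reduce false positives.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_text(text):
    """Return a dict mapping each pattern label to the matches found in
    `text`. Labels with no matches are omitted."""
    hits = {}
    for label, pattern in PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[label] = found
    return hits


if __name__ == "__main__":
    post = "Contact jane.doe@example.com or 206-555-0100; ID 123-45-6789."
    print(scan_text(post))
```

Running a scan like this over every document and free-text column is what makes the work impractical by hand, and it only finds individual elements. Correlating those elements into combinations is a further step.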

What best practices can you recommend an enterprise follow to ensure they avoid (or eliminate existing) combinations -- and for CCPA in general?

In today's challenging and evolving data privacy landscape, organizations must understand and protect against toxic combinations of data. It all starts with knowing what data they have and keeping a real-time inventory of the data flowing throughout the organization. Then, they need to make sure to constantly monitor the data and tie it back to their data handling policies and obligations.

Enterprise leaders also need to work with their security teams to ensure that data susceptible to creating toxic combinations is protected with appropriate policies and controls. There may be a business use case for having these combinations of data, but there are also always scenarios where the data was mistakenly leaked or found to be misused.
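One way to picture tying a data inventory back to policy is sketched below (an illustration, not the approach of any particular product). It assumes the inventory is a simple mapping from data set name to the fields it holds, and that policy lists the field combinations considered toxic. The names `TOXIC_COMBINATIONS`, `flag_datasets`, and the sample inventory are hypothetical.

```python
# Hypothetical policy: field combinations that must not sit together
# without appropriate controls.
TOXIC_COMBINATIONS = [
    frozenset({"gender", "zip", "dob"}),
]


def flag_datasets(inventory, combos=TOXIC_COMBINATIONS):
    """Return {dataset_name: [sorted field lists]} for every data set
    whose fields include all members of at least one toxic combination."""
    flagged = {}
    for name, fields in inventory.items():
        field_set = set(fields)
        present = [sorted(combo) for combo in combos if combo <= field_set]
        if present:
            flagged[name] = present
    return flagged


if __name__ == "__main__":
    # Hypothetical inventory, e.g. produced by an automated scan.
    inventory = {
        "crm": ["email", "gender", "zip", "dob"],
        "web_logs": ["zip", "user_agent"],
    }
    print(flag_datasets(inventory))
```

Because the check is cheap, it could be rerun whenever the inventory changes, for example after a new data-sharing agreement adds fields to an existing data set. It does not catch combinations that only arise when separate data sets are joined, which would require checking the union of fields across linkable data sets.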

About the Author

James E. Powell is the editorial director of TDWI, including research reports, the Business Intelligence Journal, and Upside newsletter.
