
Executive Q&A: Data Governance and Compliance

What's ahead for data governance and compliance when moving enterprise data to the cloud? We spoke to Balaji Ganesan, CEO and co-founder of Privacera and XA Secure, for answers.

The more data you have, the more you have to protect. What are the challenges driving the need for data governance as more enterprises move to the cloud? How can you mitigate risk and still drive data analytics?

For Further Reading:

Can Hadoop Replace a Data Warehouse?

The Essential Role Data Quality Plays in Compliance

Data in the Cloud: The Truth Every IT Professional Needs to Know

Balaji Ganesan is CEO and co-founder of both Privacera and XA Secure. At the latter he co-created a security framework for the Hadoop platform that, along with the company, was acquired in 2014 by Hortonworks and contributed to open source as Apache Ranger. In this Q&A, Ganesan shares tangible ways to ensure data governance and compliance when handling customers' sensitive digital data and migrating huge data sets to the cloud.

Upside: What broad trends are driving the need for increased data governance when moving to the cloud?

Balaji Ganesan: There are multiple trends driving the need for increased governance, especially when moving data to the cloud. The first is the inherent characteristics of the cloud services themselves. Public cloud services are easy for individual teams to launch and use, but that very ease of adoption becomes a handicap for data governance. Each public cloud service is disparate and siloed from the others, making it extremely difficult to verify where sensitive data is stored and time-consuming to secure it. This is in contrast to an on-premises environment, where data is managed through a set of centralized policies with consistent perimeters and access controls.

At the same time, increasing legal regulation and growing consumer awareness are making enterprises more cautious about accidental or malicious data leaks, and data spread across cloud services presents a larger attack surface than on-premises data.

The third trend is an increased demand for data-driven analysis. Data scientists and business analysts need the most comprehensive data set to generate the most impactful business outcome. However, this requires IT teams to democratize data and make it available to analysts without compromising privacy.

How have enterprises gone about meeting these challenges? What's gone wrong with their approach?

There are a few different methods, each with its own advantages and disadvantages. The first and simplest approach is to use a single cloud vendor only. This assumes the enterprise has enough discipline and control to ensure that no shadow IT is signing up for services beyond the approved ones and that all data is secured through the single vendor's control mechanisms. The advantage of this approach is system uniformity; the disadvantage is significant vendor lock-in.

The next approach is allowing multiple public cloud services across the enterprise and manually managing data governance within each service or provider. This option requires enterprise IT teams to be experts in each service and to manually implement and maintain consistent (or at least comparable) data access control across the organization. This is more common than the single-vendor approach, but in most cases the result is an intensive, taxing manual process that does not scale.
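To illustrate the problem, consider how one simple rule -- analysts may read customer data -- has to be re-expressed in each service's native idiom. The sketch below is purely illustrative; the names, roles, and ARNs are hypothetical.

```python
# Illustrative sketch only: the same access rule expressed twice,
# once per service. All names, roles, and ARNs are hypothetical.

# In a cloud data warehouse, the rule is a SQL grant:
warehouse_grant = "GRANT SELECT ON sales.customers TO ROLE analysts;"

# In object storage, the same rule becomes an IAM-style policy statement:
object_store_statement = {
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::123456789012:group/analysts"},
    "Action": ["s3:GetObject"],
    "Resource": "arn:aws:s3:::datalake/sales/customers/*",
}

# Every additional service adds another dialect to keep in sync by hand.
```

Keeping such fragments semantically equivalent across every service, by hand, is exactly the manual process described above.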

It is at this point that many enterprises realize they have a data governance problem and start looking to adopt a centralized framework for policies and automation of data access control. This foundation can be built through custom scripts and processes, but that requires significant effort to build and maintain. A more practical option is to leverage a centralized framework purpose-built for data governance, one that takes advantage of existing data infrastructure. For instance, my partner and I created Apache Ranger, an open-source project used for on-premises Hadoop data lakes that can be extended to the cloud specifically for this purpose.
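As a concrete example of what centralized policy management looks like, here is a minimal sketch that creates one access policy through Apache Ranger's public REST API. The admin host, credentials, and the service name "hive_prod" are hypothetical placeholders; real deployments would front the admin with SSO or Kerberos rather than basic auth.

```python
# A minimal sketch of defining one centralized access policy in Apache
# Ranger via its public REST API. Host, credentials, and the service
# name "hive_prod" are hypothetical placeholders.
import requests

RANGER_URL = "https://ranger.example.com:6182"  # hypothetical admin host

policy = {
    "service": "hive_prod",  # the Ranger service this policy governs
    "name": "analysts-read-customers",
    "description": "Analysts may SELECT from sales.customers",
    "isEnabled": True,
    "resources": {
        "database": {"values": ["sales"]},
        "table": {"values": ["customers"]},
        "column": {"values": ["*"]},
    },
    "policyItems": [
        {
            "groups": ["analysts"],
            "accesses": [{"type": "select", "isAllowed": True}],
        }
    ],
}

resp = requests.post(
    f"{RANGER_URL}/service/public/v2/api/policy",
    json=policy,
    auth=("admin", "changeme"),  # demo credentials only
)
resp.raise_for_status()
print("Created policy id:", resp.json().get("id"))
```

Because the policy lives in one place, the engines that enforce Ranger policies (Hive, HDFS, and so on) pick it up through their plugins without per-service scripting.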

There are also enterprises willing to modernize with alternative architectures. Here, the concept is that along with the migration to the cloud, the enterprise should leave all legacy frameworks behind. This approach is often very appealing in the early stages of implementation, but without years of proven scalability in production environments it quickly becomes untenable. When that becomes apparent to the teams left holding the bag -- typically the IT and data architects in charge of a scalable, long-term data governance framework -- many enterprises re-evaluate and begin trying to build or find a more robust product designed for the true diversity and scale of their enterprise environments.

How can enterprises balance the dual mandate of data governance and security with access control?


The dual mandate describes the tension between analysts and data scientists, who want to leverage and share vast amounts of data, and IT teams, who are tasked with adhering to legal and compliance regulations that limit data sharing in order to ensure privacy and security.

The balance between these two opposing forces must be consistently implemented and enforced across a diverse landscape of cloud services. This means that IT teams or data administrators need to become experts in the intricacies of each cloud service and know how to secure data within each. Then they have to manually execute piecemeal processes to implement appropriate data governance policies and security settings within -- and across -- the different services.

Data access governance automates those piecemeal processes and abstracts away the intricacies of each individual cloud service, which enables enterprises to implement data governance consistently and efficiently. This also accelerates data sharing and democratization without unnecessarily tying up limited administrative resources.
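One way to picture this kind of automation: a single abstract policy is compiled into each service's native controls, so administrators maintain the rule once. The sketch below is hypothetical -- the dataclass, translator functions, and targets are invented for illustration, not any vendor's API -- but note how the two hand-written fragments shown earlier become outputs of one policy object.

```python
# Hypothetical sketch: one abstract policy, mechanically translated into
# each cloud service's native idiom. Nothing here is a real vendor API.
from dataclasses import dataclass

@dataclass
class AccessPolicy:
    principal: str  # group or role receiving access
    dataset: str    # logical dataset name, e.g. "sales.customers"
    action: str     # "read" or "write"

def to_warehouse_sql(p: AccessPolicy) -> str:
    """Render the policy as a SQL grant for a cloud data warehouse."""
    verb = "SELECT" if p.action == "read" else "INSERT, UPDATE"
    return f"GRANT {verb} ON TABLE {p.dataset} TO ROLE {p.principal};"

def to_object_store(p: AccessPolicy) -> dict:
    """Render the same policy as an IAM-style statement for object storage."""
    action = "s3:GetObject" if p.action == "read" else "s3:PutObject"
    path = p.dataset.replace(".", "/")
    return {
        "Effect": "Allow",
        "Action": [action],
        "Resource": f"arn:aws:s3:::datalake/{path}/*",
    }

rule = AccessPolicy(principal="analysts", dataset="sales.customers", action="read")
print(to_warehouse_sql(rule))  # the warehouse sees a grant
print(to_object_store(rule))   # object storage sees an IAM statement
```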

What are the implications for analytics? That is, if you strike this balance, how will analytics be simplified?

There are many instances where data science teams are at a standstill while waiting for data access to be reviewed, approved, and processed. This limits the productivity of these highly paid resources because they are forced to wait for IT and data administrators to fulfill the IT tickets needed to grant them access.

A centralized framework automates the manual tasks involved in securing data for privacy and compliance and simplifies and streamlines how data is made available for analytics. It also enables data democratization and provides a consistent reference for data usage and data transformation. Besides accelerating access to the data itself, it allows analysts to easily and consistently use and verify the integrity of data sets, resulting in better business outcomes.

How has your experience building and working with the Apache Ranger community influenced how you support data governance? What tips can you offer organizations to avoid compliance complications when migrating large datasets to the cloud?

With Apache Ranger, we solved the problem of governance and compliance for big data because Ranger managed the heterogeneous systems that made up the big data (Hadoop) ecosystem. In traditional databases and data warehouses, data was locked in and there were limited ways to access it. Big data and Hadoop were a step toward the promise of data democratization, where data could be stored in one place and made available for multiple use cases, whether SQL, NoSQL, or streaming.

However, enabling security and governance in this paradigm required a different approach -- which was the driver for creating Apache Ranger. We leveraged what worked in traditional environments and applied it to big data while balancing ease of use with performance and scalability. Thanks to the multitude of cloud services, similar problems exist within enterprises today. In fact, the sheer diversity that the public cloud offers is creating an even more complex maze in which to identify and properly secure sensitive data.

When migrating large datasets to the cloud, organizations should leverage their existing infrastructure as much as possible. There is enough to do already, and there is no need (or benefit) to reinvent the wheel. Governance and compliance policies for an enterprise typically don't change simply because the infrastructure is moving from on premises to the cloud. For example, personally identifiable information (PII) is sensitive data regardless of where it is stored or used.
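That location-independence is straightforward to express with tag-based classification: a column is tagged as PII once in a central catalog, and the same rule follows it anywhere. The sketch below is a hypothetical illustration; the catalog, column names, and masking rule are invented for the example.

```python
# Hypothetical sketch of tag-based PII handling: classify once in a
# central catalog, enforce the same rule wherever the data lives.
PII_TAGS = {
    "customers.email": "PII",
    "customers.ssn": "PII",
    "orders.total": None,  # not sensitive
}

def mask_if_pii(column: str, value: str) -> str:
    """Apply one uniform masking rule to any column tagged as PII."""
    if PII_TAGS.get(column) == "PII":
        return value[:2] + "*" * (len(value) - 2) if len(value) > 4 else "****"
    return value

print(mask_if_pii("customers.email", "jane@example.com"))  # ja**************
print(mask_if_pii("orders.total", "42.50"))                # 42.50
```

The same lookup applies whether the query runs against an on-premises Hive table or a cloud warehouse, which is the point: the classification, not the location, drives the control.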

At the same time, migration to the cloud is seldom a "one and done" effort because data is not suddenly moved to the cloud and then magically removed from all on-premises systems. Most enterprises run in hybrid environments, with a mix of both on-premises and cloud services. This makes it even more important to leverage existing infrastructure to ensure consistency and to reduce the burden on IT teams forced to administer and manage multiple sets of data infrastructure. Data compliance should not become an obstacle to cloud migration, and with the right tools protecting PII, it can be a seamless experience.
