Three Ingredients of Innovative Data Governance
Data governance doesn’t have to be the enemy of innovation. The two concepts can exist in harmony, but some important features must be in place for them to succeed.
- By Troy Hiltbrand
- October 17, 2022
When you hear the term data governance, is your first thought one of draconian policies that put security and regulations above business value? Unfortunately, this is the approach many organizations have taken with data governance. They focus so heavily on restricting data to meet security and regulatory requirements that they eliminate the ability to generate business value from it. The future of data governance must include finding ways to continue to protect the data while doing so in a way that enables organizational innovation.
Even though a strong data governance policy and a strong culture of innovation may seem contradictory, there are constructs that can be put in place to make both feasible. Three of the most important practices and processes for enabling innovative data governance are synthetic data, DataOps, and a walled garden for your citizen data scientists.
Synthetic Data
The first important feature of innovative data governance is providing a data set that is statistically similar to the real data set without exposing private or confidential data. This can be accomplished using synthetic data.
Synthetic data is created using real data to seed a process that can then generate data that appears real but is not. Variational autoencoders (VAEs), generative adversarial networks (GANs), and real-world simulation create data that can provide a basis for experimentation without leaking real data and exposing the organization to untenable risk.
VAEs are neural networks composed of encoders and decoders. During the encoding process, the data is transformed in such a way that its feature set is compressed. During this compression, features are transformed and combined, removing the details of the original data. During the decoding process, the compression of the feature set is reversed, resulting in a data set that is like the original data but different. The purpose of this process is to identify a set of encoders and decoders that generate output data that is not directly attributable to the initial data source.
Consider an analogy of this process: taking a book and running it through a language translator (encoder) and then running it through a language translator in reverse (decoder). The resulting text would be similar but different.
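To make the encode-compress-decode idea concrete, here is a minimal sketch of a VAE for a table of numeric features. It assumes PyTorch; the layer sizes, feature count, and the generate helper are illustrative choices, and the training loop (fitting the model to scaled real data) is omitted for brevity.

```python
import torch
import torch.nn as nn

class TabularVAE(nn.Module):
    """Minimal VAE for a table of numeric features (illustrative only)."""
    def __init__(self, n_features: int, latent_dim: int = 4):
        super().__init__()
        # Encoder: compress the feature set into a small latent code.
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU())
        self.to_mu = nn.Linear(32, latent_dim)
        self.to_logvar = nn.Linear(32, latent_dim)
        # Decoder: reverse the compression back into feature values.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, n_features)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization: sample a latent point near the encoding.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

    def generate(self, n_samples: int):
        # Sample from the latent prior to emit brand-new synthetic rows.
        z = torch.randn(n_samples, self.to_mu.out_features)
        return self.decoder(z)

# After fitting the model to scaled real data (training loop omitted),
# generate() produces rows that resemble, but are not, the originals.
model = TabularVAE(n_features=8)
synthetic_rows = model.generate(100)   # tensor of shape (100, 8)
```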
GANs are a more complex construct consisting of a pair of neural networks. One network is the generator and the other is the discriminator. The generator uses seed data to create new data sets. The discriminator then determines whether a given data set is real or synthetic. Over an iterative process, the generator improves its output to the point where the discriminator can no longer differentiate the real data set from the synthetic one. At this point, the generator can create data sets that appear indistinguishable from the real data and can be used for data experimentation.
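Below is a similarly stripped-down sketch of the generator/discriminator loop, again assuming PyTorch. The real_batch stand-in, network sizes, and number of training steps are placeholders; in practice the real batch would be drawn from the sensitive data set being protected.

```python
import torch
import torch.nn as nn

n_features, noise_dim, batch_size = 8, 16, 64

# Generator: turns random noise into candidate synthetic rows.
generator = nn.Sequential(
    nn.Linear(noise_dim, 32), nn.ReLU(), nn.Linear(32, n_features)
)
# Discriminator: scores rows as real (close to 1) or synthetic (close to 0).
discriminator = nn.Sequential(
    nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid()
)

loss_fn = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

# Stand-in for a batch of (scaled) real data drawn from the protected source.
real_batch = torch.randn(batch_size, n_features)

for step in range(1000):
    # 1) Train the discriminator to separate real rows from generated ones.
    fake_batch = generator(torch.randn(batch_size, noise_dim)).detach()
    d_loss = (loss_fn(discriminator(real_batch), torch.ones(batch_size, 1))
              + loss_fn(discriminator(fake_batch), torch.zeros(batch_size, 1)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) Train the generator to fool the discriminator.
    fake_batch = generator(torch.randn(batch_size, noise_dim))
    g_loss = loss_fn(discriminator(fake_batch), torch.ones(batch_size, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

# Once the discriminator can no longer tell the two apart, the generator
# emits data sets that look real but contain no actual records.
synthetic_rows = generator(torch.randn(100, noise_dim))
```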
In addition to these two methods, some organizations are using gaming engines and physics-based engines to simulate data sets based on scientific principles (e.g., physics, chemistry, biology) and how objects in the real world interact according to them. As these virtual simulations run, the resulting data set, which is representative of the actual data, can be collected for analysis and experimentation.
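A full gaming or physics engine is out of scope for a short example, but the underlying idea can be shown with a toy, hand-rolled simulation: apply known physics to randomized inputs and collect the results as a synthetic data set. The simulate_projectile function and the parameter ranges below are invented purely for illustration.

```python
import math
import random

def simulate_projectile(v0: float, angle_deg: float, dt: float = 0.01):
    """Record the trajectory of a projectile under gravity (toy physics model)."""
    g = 9.81
    vx = v0 * math.cos(math.radians(angle_deg))
    vy = v0 * math.sin(math.radians(angle_deg))
    x = y = t = 0.0
    rows = []
    while y >= 0.0:
        rows.append({"t": round(t, 3), "x": round(x, 3), "y": round(y, 3)})
        x += vx * dt
        vy -= g * dt
        y += vy * dt
        t += dt
    return rows

# Run many randomized simulations and pool the output into one synthetic
# data set that can be analyzed without touching any real measurements.
dataset = []
for _ in range(500):
    dataset.extend(simulate_projectile(v0=random.uniform(5, 50),
                                       angle_deg=random.uniform(20, 70)))
```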
DataOps
DataOps is an extension of the practices and processes of DevOps but focused on the field of data engineering instead of software engineering. The automation of data governance that comes from DataOps practices allows for the right balance of restrictive governance to be married to the desired speed of innovation.
DataOps focuses on increased deployment frequency, automated testing, metadata and version control, monitoring, and collaboration.
- Increased deployment frequency. Instead of relying on large, infrequent rollouts of code to support data engineering and data science efforts, DataOps focuses on being able to promote incremental updates to the system. The concept that incremental functionality leads to incremental business value guides practice and process development.
- Automated testing. To increase deployment frequency, you cannot ignore testing and data quality; these elements are still critical. It is important to automate testing so that data quality and code quality can be evaluated as part of each rollout without requiring human-centered, manual testing processes (see the sketch after this list).
- Metadata and version control. Data engineering code might look different from application code, but ensuring that a chain of ownership and a record of changes exist allows for more frequent incremental system changes. Version control is also needed to communicate changes effectively to relevant stakeholders. This includes the code associated with the engineering processes as well as solid traceability of the data lineage as it flows through the process, which makes recovery easier if something goes awry.
- Monitoring. As with automated testing, DataOps requires telemetry for the data pipelines and for the systems running those pipelines. You need processes that check both code quality and data quality and can raise an alert if either falls outside defined thresholds.
- Collaboration. As in all agile practices, collaboration is extremely important. Success with DataOps requires high levels of collaboration among the parties involved. This includes personnel from areas of data management, data engineering, data science, and analytics.
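As one hedged example of what automated testing and threshold-based monitoring might look like in a DataOps pipeline, the sketch below expresses data quality checks as pytest-style tests, assuming pandas. The table path, column names, and thresholds are hypothetical.

```python
import pandas as pd

def load_orders() -> pd.DataFrame:
    # Hypothetical loader for the pipeline output being validated.
    return pd.read_parquet("warehouse/orders.parquet")

def test_no_null_order_ids():
    orders = load_orders()
    assert orders["order_id"].notna().all(), "order_id contains nulls"

def test_amounts_within_expected_range():
    orders = load_orders()
    assert orders["amount"].between(0, 1_000_000).all(), "order amounts out of range"

def test_row_count_above_threshold():
    # Guards against an upstream extract that silently returned too few rows.
    orders = load_orders()
    assert len(orders) > 1_000, "suspiciously small row count"
```

Run as part of the promotion pipeline, a failing check blocks the deployment; run on a schedule against production tables, the same assertions can feed an alerting channel.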
Citizen Data Scientists (and the Walled Garden)
It is becoming standard for organizations to want to employ the public in data science and analytics efforts, which can greatly accelerate innovation. This is where the role of the citizen data scientist is important.
However, citizen data scientists create a particular challenge when it comes to data governance. Now, efforts shift from trying to control employees with contractual obligations to trying to control the public, where recourse for wrongdoing is limited. This puts a larger burden on systematic controls to keep all parties safe and secure.
Enterprises must create walled gardens if citizen data scientists are to succeed. A walled garden, also referred to as a closed ecosystem, is a system where the service provider has control over applications and content and gives users controlled access to just what they need to be successful, but in a way that makes them feel they are in control and free to experiment.
These gardens include systems that are logically or physically air-gapped from the company’s other systems. This can include machines or virtual machines that allow for tasks to run without opening the general network to intrusion. This also goes back to the concept of synthetic data which can be used to perform analysis without exposing the underlying sensitive data.
Providing citizen data scientists with an environment that is easily accessible and sufficiently powerful allows them to experiment and be creative and innovative. Balancing that with the correct security practices and processes is the goal. Within the walled garden environment, the security practices cannot appear onerous. The goal is to allow users to feel free to do what they want once they enter the walls of the garden, while also making sure that those walls are durable and keep the citizen data scientists securely inside.
Next Steps
As you evolve your data governance road map, determine where these three concepts -- synthetic data, DataOps, and walled gardens for your citizen data scientists -- fit. With these three key ingredients, you have the potential to create a truly innovative data governance experience that enables the business to flourish without taking on unacceptable risk.
The future of data governance will require balance, but a data professional who can balance security, privacy, innovation, and creativity will be a hero to the organization.