CEO Q&A: Data Quality Problems Will Still Haunt Your Analytics Future
Data quality issues become even more important as machine learning use grows. DataOps and data wrangling help enterprises address this vital problem.
- By James E. Powell
- April 16, 2019
In our continuing series of CEO Q&As, we spoke to Adam Wilson, CEO of data-prep specialist Trifacta, about the age-old problem of data quality in the context of today's drive toward AI and machine learning.
Upside: What technology or methodology must be part of an enterprise's data strategy if it wants to be competitive today?
Adam Wilson: As organizations collect more and more data, the potential for dirty data increases. Dirty data is the single greatest threat to success with analytics and machine learning and can be the result of duplicate data, human error, and nonstandard formats, just to name a few factors. Studies have shown that, on its own, dirty data costs companies 12 percent of overall revenue.
To remain competitive, organizations need a strong enterprise data strategy that manages the end-to-end life cycle of data efficiently and securely. Many organizations have turned to modern data preparation or data wrangling tools, which sift through masses of data and use algorithms to detect anomalies and identify outliers resulting from human error and duplicate data.
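As a rough illustration of the kind of checks such tools automate, here is a minimal Python sketch. The field names, the sample records, and the choice of a median-based outlier test (the modified z-score) are assumptions made for illustration; they do not describe how any particular product works.

```python
from statistics import median


def find_duplicates(records, key_fields):
    """Return indices of records whose normalized key has already been seen."""
    seen = set()
    duplicates = []
    for i, rec in enumerate(records):
        # Normalize case and whitespace so near-identical entries match.
        key = tuple(str(rec[f]).strip().lower() for f in key_fields)
        if key in seen:
            duplicates.append(i)
        else:
            seen.add(key)
    return duplicates


def find_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        return []  # no spread to measure against
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical sample data: one duplicate e-mail, one mistyped amount.
records = [
    {"email": "ann@example.com", "amount": 120.0},
    {"email": "bob@example.com", "amount": 95.0},
    {"email": " Ann@Example.com ", "amount": 110.0},
    {"email": "cy@example.com", "amount": 105.0},
    {"email": "di@example.com", "amount": 10500.0},
]

duplicate_rows = find_duplicates(records, ["email"])
outlier_rows = find_outliers([r["amount"] for r in records])
```

The median-based test is used here because a single extreme value inflates the mean and standard deviation, which can hide the very error being looked for.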
What emerging technologies are you most excited about and think have the greatest potential? What's so special about them?
AI, and especially its machine learning component, has huge potential to make business operations more efficient, with a correspondingly large effect on the economy as inefficient work tasks get streamlined and industries benefit from increased automation.
However, advanced machine learning algorithms are only as good as the training data from which they learn. Unfortunately, as machine learning initiatives scale, it becomes harder to clean and prepare diverse data to support increasingly complex machine learning models. In order for machine learning to make an impact at scale, organizations will need to first accelerate their data preparation processes.
What is the single biggest challenge enterprises face today? How do most enterprises respond (and is it working)?
About 27 percent of business leaders aren't sure how much of their data is accurate. In many cases, that can be costly.
Poor data quality is Enemy #1 to the widespread, profitable use of machine learning, and for this reason, the growth of machine learning increases the importance of data cleansing and preparation. The quality demands of machine learning are steep, and bad data can backfire twice -- first when training predictive models and second in the new data used by that model to inform future decisions.
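One way to picture the "backfire twice" point is to run the same validation rules over both the training data and the new data a model scores later. The sketch below is a minimal Python illustration; the rule set, field names, and sample rows are hypothetical.

```python
def validate(row, rules):
    """Return the names of fields in `row` that are missing or break their rule."""
    return [field for field, ok in rules.items()
            if field not in row or not ok(row[field])]


def split_clean(rows, rules):
    """Separate rows that pass every rule from rows that fail at least one."""
    clean, rejected = [], []
    for row in rows:
        (rejected if validate(row, rules) else clean).append(row)
    return clean, rejected


# Hypothetical rules shared by the training and scoring stages.
RULES = {
    "age": lambda v: isinstance(v, (int, float)) and 0 <= v <= 120,
    "country": lambda v: isinstance(v, str) and len(v) == 2 and v.isupper(),
}

training_rows = [{"age": 34, "country": "US"}, {"age": -1, "country": "US"}]
scoring_rows = [{"age": 51, "country": "usa"}, {"age": 28, "country": "DE"}]

# First exposure: bad rows are kept out of the data the model learns from.
train_clean, train_rejected = split_clean(training_rows, RULES)
# Second exposure: bad rows are kept away from the model's live decisions.
score_clean, score_rejected = split_clean(scoring_rows, RULES)
```

Keeping one shared rule set means a record that would have been rejected during training is also rejected when it arrives later as new input.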
One of the biggest challenges with preparing data is that historically it has been time-consuming -- 80 percent of the overall analysis process is spent cleaning or preparing data. As data has increased in size and complexity in recent years, data preparation has only grown more demanding and is often relegated to the organization's most technical employees.
Still, most data fails to meet basic quality standards, for reasons ranging from a misunderstanding of what's expected to poorly calibrated measurement gear to overly complex processes to, perhaps most simply, human error.
To compensate, data scientists cleanse the data before training predictive models. It is time-consuming, tedious work. Even with such efforts, cleaning neither detects nor corrects all the errors. These concerns must be met with an aggressive, well-executed quality program that typically follows these four steps:
- Perform an audit
- Establish standardization rules
- Update data as frequently as possible
- Establish data quality responsibilities across teams
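The first two steps lend themselves to a small sketch; the last two are largely matters of process and ownership. The Python example below is illustrative only: the field names, the sample records, and the list of accepted date formats are assumptions, and a real rule set would have to decide how to treat ambiguous day/month orderings.

```python
from datetime import datetime

# Assumed input formats; dates are standardized to ISO 8601 (YYYY-MM-DD).
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def audit(records, fields):
    """Step 1: report the fraction of missing values for each field."""
    total = len(records)
    report = {}
    for field in fields:
        missing = sum(1 for r in records if r.get(field) in (None, ""))
        report[field] = missing / total if total else 0.0
    return report


def standardize_date(value):
    """Step 2: rewrite a date string as ISO 8601, or return None if unparseable."""
    if not value:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


# Hypothetical records with mixed date formats and missing values.
customers = [
    {"signup": "2019-04-16", "email": "a@example.com"},
    {"signup": "", "email": "b@example.com"},
    {"signup": "04/16/2019", "email": None},
    {"signup": "16.04.2019", "email": "c@example.com"},
]

audit_report = audit(customers, ["signup", "email"])
standardized = [standardize_date(c["signup"]) for c in customers]
```

Rerunning the audit on a schedule and assigning each field an owner would correspond to the third and fourth steps.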
Is there a new technology in data and analytics that is creating more challenges than most people realize? How should enterprises adjust their approach to it?
Successful analysis relies upon accurate, well-structured data that has been formatted for the specific needs of the task at hand. Yet, today's data is bigger and more complex than ever before. It's resource-intensive and technically challenging to wrangle it into a format for analysis.
Data wrangling is needed to transform raw source data into prepared outputs for analysis and a variety of other business purposes. Within a single project, there could be dozens of models and iterations. Many data science projects are at risk of failing because they take too long to produce results.
To optimize your chances of success, it is critical that you reduce the overall iteration time and adopt a "fail fast" approach. Businesses have turned to accelerating data wrangling and integrating it with a machine learning framework, allowing for faster time-to-value and more opportunities to engage with key stakeholders.
What initiative is your organization spending the most time/resources on today?
The past several years have seen huge growth in cloud computing environments. At the same time, data is moving faster. By 2020, analysts predict that the amount of data generated will have grown 50-fold since 2010, and this rate will only continue to accelerate. The true competitive value of data doesn't necessarily lie in how much data organizations can leverage but rather in how quickly they can respond to it.
That kind of fast response doesn't always happen, because deriving value from data also means making it accessible to those who have the right context for the data. This means cleaning and preparing that data for analysis, which is still widely reported as the biggest bottleneck in any analytics project, often accounting for up to 80 percent of the time and resources. Trifacta is focused on improving the data preparation process by turning it from a siloed process into one that is self-service and empowers users across an organization to unlock the power of their data.
Where do you see analytics and data management headed in 2019 and beyond? What's just over the horizon that we haven't heard much about yet?
DataOps will be the new DevOps. As organizations have shifted toward self-service processes, data analysts now have the right tools to wrangle and analyze their own data instead of endlessly iterating with IT. After this shift occurs, the question becomes how do you make such operations scalable, efficient, and repeatable?
Enter DataOps. As an adaptation of the software development methodology DevOps, DataOps refers to the tools, methodologies, and organizational structures that businesses must adopt to improve the velocity, quality, and reliability of analytics. Data engineers fill the critical roles powering DataOps, and as these practices become commonplace, data engineers will become critical resources.
In fact, 73 percent of organizations polled said they planned to invest in DataOps this year. Just as DevOps engineers are highly sought after today, we predict that data engineers will be in the near future.
Autoscaling serverless solutions will become increasingly common. Data is everywhere, and even small businesses and individuals want to roll up their sleeves and wrangle data sets alongside the Fortune 500. One size doesn't fit all, however, which means serverless, pay-as-you-go solutions for DataOps will become a hot commodity for fledgling companies that are uninterested in setting up their own DataOps infrastructure right away.
Larger companies will seek out technologies with costs that scale automatically, allowing for surges at usage peaks and lower maintenance-level fees during idle periods. Above all else, convenience and flexibility will be key selection factors regardless of company size.
Self-service technologies without governance will hit their limits. As self-service solutions grow and adoption is no longer the primary metric of success, organizations will increasingly question whether these solutions are efficient, scalable, and secure. Without governance in place, IT organizations in particular will feel increasing pressure as the number of technologies to maintain and processes to schedule multiplies unchecked. Heightened DataOps practices will offer new guidance on self-service technologies, and we predict that in 2019, self-service products without governance will hit their limits.
Tell us more about your product/solution and the problem it solves for enterprises.
Trifacta is an industry pioneer and established leader of the global market for data preparation technology. The company draws on decades of academic research in machine learning and data visualization to make the data preparation process faster and more intuitive.