Q&A: An Introduction to Self-Service Data Prep (Part 1 of 2)
What you need to know about the movement toward self-service data preparation.
- By James E. Powell
- June 11, 2018
Data cleansing and preparation is a challenge for many enterprises as they transform raw data into the ready information needed to increase profits and operate more efficiently. Today's business users access increasingly complex data from first-, second-, and third-party sources in a mix of data types, driving an increased desire for self-service data prep.
To learn more, we spoke with Piet Loubser, SVP global head of marketing at Paxata.
TDWI: Until recently, most self-service data prep was done in Excel. Now it seems as though everyone is claiming to have a data-prep solution. How are these new self-service tools improving on what data professionals have done before, and will they ever truly replace Excel?
Piet Loubser: Excel is a great personal productivity tool, but it lacks the capabilities needed to be an enterprise data preparation solution. Today's self-service data prep requires the ability to handle much bigger volumes of data and more variations in the structure of that data. Recent research indicates that 68 percent of organizations are still using Excel but find it is not a great solution for data preparation tasks.
Modern self-service data-prep tools provide guides and guardrails to help business users. For instance, some tools use artificial intelligence (AI) to profile the data continuously and recommend ways to rapidly enrich or combine data sets so businesspeople can use them. Don't forget the need for enterprise-grade governance and security. Many modern self-service data-prep tool vendors highlight their solutions' governance and security capabilities.
One has to imagine that many CIOs, CDOs, and data architects cringe when they think of the risks and governance nightmares that can occur with self-service data prep. How do you reconcile the demand for self-service data access and data prep with the need for security and governance?
Data is the fuel for a data-driven world, but it must be managed as a strategic asset. Governance and data security are no longer optional. Traditional approaches to delivering data for analytics were managed strictly by highly skilled IT developers, resulting in strong governance and control. In the new world of self-service data prep, we need different capabilities.
Rather than being seen as restrictive, proper governance and security should empower end users to safely find, shape, and publish their own data. New self-service data-prep platforms provide this at scale for all users regardless of which BI tool they use. If you take a platform approach to data governance rather than applying multiple desktop tools, you can effectively maintain control of your data while still empowering and enabling users at scale.
What do organizations need to know before adopting self-service data prep?
When you embark on a self-service data-prep initiative, you presumably have a specific use case in mind. Unfortunately, a data preparation solution is often seen as a tool for only that single use case. Instead of viewing self-service data prep as a narrow solution for a specific user, effective organizations are now adopting this technology as a broad solution across the entire enterprise.
In this way, getting access to data and preparing it for any use case can become a way of life for every department or function. Becoming data-driven is a priority for survival, but achieving this is as much about the cultural change as it is about adopting new technologies.
Many buzzwords today have to do with AI or machine learning (ML). What is the role of AI and ML in self-service data prep?
Organizations are struggling to keep pace with the demand for business insights without enough expert IT personnel to generate the data needed for these analytics projects. Eighty percent of the effort in a typical analytics project is finding and preparing the data. This burdens an organization's already scarce and expensive resources with repetitive data-prep tasks rather than letting them spend time generating valuable insight and building models.
Artificial intelligence and machine learning can bring sophisticated data-prep activities to the data engineer or speed up the process for these scarce resources. For example, AI can be applied to make recommendations about improving or joining data sets or assist with understanding the stored data. This provides guidance while empowering less-technical users to accomplish powerful data-prep tasks.
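To make the idea of a join recommendation concrete, here is a minimal sketch in Python. It is not how any particular vendor's product works, and it uses a plain value-overlap heuristic rather than machine learning. The function names, the sample data, and the 0.5 threshold are all assumptions chosen for illustration.

```python
from itertools import product


def column_values(rows, column):
    """Collect one column's distinct non-empty values as trimmed, lowercased strings."""
    return {
        str(row[column]).strip().lower()
        for row in rows
        if row.get(column) not in (None, "")
    }


def suggest_join_keys(left, right, min_overlap=0.5):
    """Rank column pairs by the Jaccard overlap of their distinct values.

    Hypothetical helper: a high overlap hints that two columns may hold
    the same identifiers and could serve as a join key.
    """
    left_cols = left[0].keys() if left else []
    right_cols = right[0].keys() if right else []
    suggestions = []
    for left_col, right_col in product(left_cols, right_cols):
        left_vals = column_values(left, left_col)
        right_vals = column_values(right, right_col)
        union = left_vals | right_vals
        if not union:
            continue
        score = len(left_vals & right_vals) / len(union)
        if score >= min_overlap:
            suggestions.append((left_col, right_col, round(score, 2)))
    # Best candidates first
    return sorted(suggestions, key=lambda s: s[2], reverse=True)


# Made-up sample data: note the inconsistent casing and stray space in "cust"
customers = [
    {"customer_id": "C1", "name": "Acme"},
    {"customer_id": "C2", "name": "Globex"},
    {"customer_id": "C3", "name": "Initech"},
]
orders = [
    {"order_id": "O1", "cust": "c1 ", "total": 120},
    {"order_id": "O2", "cust": "C2", "total": 75},
    {"order_id": "O3", "cust": "C3", "total": 310},
]

suggestions = suggest_join_keys(customers, orders)
print(suggestions)  # [('customer_id', 'cust', 1.0)]
```

Even this simple profile surfaces the likely join between `customer_id` and `cust` despite the differing column names and messy values, which is the kind of guidance a less-technical user would otherwise need an expert to provide.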
What is next for self-service data prep? Does it stop at data prep for analytics?
I mentioned that many organizations still view data prep through a narrow lens focused on one specific use case or project. Data prep needs to become a core competency within an organization, applied across all functions and used by businesspeople and data scientists alike. The self-service data-prep initiative also needs to span the enterprise to power all kinds of analysis, whether the data resides in the cloud, in traditional data sources, or in a data lake.
In addition, the self-service data-prep solution needs to power your next-gen data-driven applications and any new products or services you create. Only when you have a common way to access and provision your data can you really start accelerating data-driven initiatives.
The Conversation Continues
In Part 2 of this conversation, we'll discuss how enterprises can get started with self-service data prep.
James E. Powell is the editorial director of TDWI, including the Business Intelligence Journal and Upside newsletter.