Overcoming Patchwork Data Prep
In their work, many data and business analysts spend the bulk of their time on data preparation tasks using a patchwork of tools. I visited with Manan Goel of Paxata to discuss how to reduce this overhead, how to weave together basic and advanced data preparation capabilities, and how analysts can be more efficient and effective in their work.
- By Jake Dolezal
- November 4, 2016
At the recent Strata+Hadoop conference in New York City, I had the opportunity to sit down with Manan Goel, senior director of product marketing at Paxata, which had just announced the release of its Connect addition to its data preparation platform. In our session, we discussed the challenges that analysts face in preparing data and how his company's product addresses them.
Analytics is so much more than analyzing data. It is commonly estimated that analysts, whether business analysts or data scientists, spend some 80 percent of their time preparing data and only 20 percent actually analyzing it. Often, this is because analysts do not have a comprehensive tool for data preparation. They are left to patch together different tools for sourcing and integrating the data they need and other tools for cleaning, standardizing, and enhancing it. Goel said that analysts should spend less time "stitching together" tools to prepare data and more time cultivating insights for their businesses.
Given the ever-increasing complexity in today's analytical environment and the varying types of information required by modern use cases and business questions, analysts need to be able to quickly discover and explore new data, as well as to put aside data they don't currently need but will likely need later. The tools they use for this work can be anything from a sophisticated data-integration tool to simple copying and pasting of datasets into Excel. This is only the beginning, though.
Once that work is complete, they will likely have data cleaning to do. Depending upon the requirements, this may be as simple as performing basic transformations, such as splitting columns, de-duplicating, removing white space, and standardizing columns (for example, replacing all instances of "northeast" and "N.E." with "NE"). However, in many cases the work requires a higher degree of sophistication, such as chaining multiple "if-then-else" statements or applying fingerprinting, n-gram matching, and other techniques for grouping values that differ only superficially.
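To make the cleaning techniques named above more concrete, here is a minimal Python sketch of a standardization lookup, a token fingerprint, and an n-gram fingerprint. It is an illustration only and does not describe how Paxata or any other product implements these steps. The function names, the REGION_MAP table, and the sample values are all hypothetical.

```python
import re
import string
import unicodedata
from collections import defaultdict

# Hypothetical lookup table based on the "northeast" / "N.E." example above.
REGION_MAP = {"northeast": "NE", "n.e.": "NE"}


def standardize_region(value: str) -> str:
    """Trim and collapse white space, then map known variants to one label."""
    cleaned = " ".join(value.split())
    return REGION_MAP.get(cleaned.lower(), cleaned)


def _to_ascii_lower(value: str) -> str:
    """Drop accents and lowercase so cosmetic differences do not matter."""
    decomposed = unicodedata.normalize("NFKD", value)
    return decomposed.encode("ascii", "ignore").decode("ascii").lower()


def fingerprint(value: str) -> str:
    """Token fingerprint: strip punctuation, then sort and de-duplicate words."""
    text = _to_ascii_lower(value).strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def ngram_fingerprint(value: str, n: int = 2) -> str:
    """N-gram fingerprint: sorted, de-duplicated character n-grams."""
    text = re.sub(r"[^a-z0-9]", "", _to_ascii_lower(value))
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    return "".join(sorted(grams))


def cluster(values, key_func):
    """Group values whose keys collide; return only groups with 2+ members."""
    groups = defaultdict(list)
    for value in values:
        groups[key_func(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]
```

With these definitions, cluster(["North East", "east north", "SW"], fingerprint) groups the first two values as likely duplicates, while ngram_fingerprint also treats "Northeast" and "North East" as the same because it ignores spacing. Candidate groups like these would normally still be reviewed by an analyst before any values are merged.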
They may also need to enrich their data by adding data from related sources, triangulating this information with or without keys, and doing so with minimal referential integrity. This enriched data gives analysts access to possible factors and influences outside their companies' information ecosystems, which can provide additional insight.
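As a rough illustration of enrichment without a shared key, the sketch below attaches attributes from an outside reference set by approximate name matching, using Python's standard difflib module. It is one possible approach under stated assumptions, not a description of any vendor's method. The function name, the field names, the 0.8 similarity cutoff, and the sample data are all hypothetical.

```python
import difflib


def enrich_without_keys(records, reference, field, cutoff=0.8):
    """Attach reference attributes to records by approximate name match.

    records:   list of dicts that contain `field`
    reference: dict mapping a name to a dict of extra attributes
    cutoff:    minimum similarity (0 to 1) to accept a match (assumed value)
    """
    # Normalize reference names once so matching ignores case and padding.
    lookup = {name.strip().lower(): attrs for name, attrs in reference.items()}
    names = list(lookup)

    enriched = []
    for record in records:
        candidate = record[field].strip().lower()
        matches = difflib.get_close_matches(candidate, names, n=1, cutoff=cutoff)
        extra = lookup[matches[0]] if matches else {}
        # Keep unmatched records and flag them, since no key guarantees a hit.
        enriched.append({**record, **extra, "matched": bool(matches)})
    return enriched


# Hypothetical sample data.
customers = [{"company": "Acme Corp."}, {"company": "Zeta LLC"}]
industry_data = {"Acme Corp": {"industry": "Manufacturing"}}
result = enrich_without_keys(customers, industry_data, field="company")
```

Here the first record picks up the industry attribute despite the trailing period in its name, and the second is kept but flagged as unmatched. The flag matters because, without enforced referential integrity, some records will inevitably find no counterpart in the outside source.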
Finally, successful analysts often need to collaborate, which requires transparency into every step of their process -- integration, transformation, standardization, and so on -- so their coworkers can follow their path to the answer. "Transparency" often means "data governance." Depending upon the maturity of their data governance processes, organizations may still need to get their arms around how their data is used and changed, as well as the constant flux of business rules that govern it, before they can provide the necessary transparency.
What makes all of this such a challenge is the lack of a single, simple integrated tool to accomplish all of it. Most analysts have to patch together a number of tools to complete all this data prep, Excel being most commonly used and abused for it. Many of the existing integrated tools are too technical for business analysts, and even the technologically astute often have a steep learning curve before they can fully utilize such a high-end tool.
According to Goel, this is where Paxata fits in. He said their new Paxata Connect serves as an "end-to-end data-to-information pipeline" by providing a connectivity framework that creates a nexus to acquire, shape, and publish meaningful data for faster time to value. The core Paxata solution already gives analysts four key capabilities in data prep -- data integration, quality, enrichment, and governance -- which the Connect addition expands with a number of out-of-the-box connectors. Goel also said that the Paxata tool addresses the need for transparency by creating a "library of published data sets" that enables the sharing of "tribal knowledge," as well as transparent data lineage and a "bottoms-up" approach to maintaining evolving business data rules.
All in all, I left the conversation with the sense that Paxata has thought through many of the issues and pain points that data and business analysts face with regard to data prep. Its product seems well positioned to expand its reach in the market, and I look forward to getting hands-on experience with the tool. I'll share more of my impressions then.
Dr. Jake Dolezal is practice leader of Analytics in Action at McKnight Consulting Group Global Services, where he is responsible for helping clients build programs around data and analytics. You can contact the author at firstname.lastname@example.org