An Introduction to Data Wrangling
What is data wrangling and why does it matter? Think of it as data preparation taken to the next level. To learn more, we turned to data-wrangling upstart Trifacta.
- By Stephen Swoyer
- January 13, 2015
What is data wrangling and why does it matter to BI professionals? Think of data wrangling as data preparation taken to the next, or to the nth, level. Better still, ask data-wrangling upstart Trifacta Inc. about it.
Trifacta -- or, more precisely, the academic brain trust that helped found it -- basically invented the term, which it describes as a kind of hybrid of data integration, data engineering, and data science.
"Trifacta as a company actually came out of a research initiative between Stanford and Cal [the University of California, Berkeley] called 'Data Wrangler.' The work that you do with data wrangling others would call 'data plumbing' or even janitorial work, but when you have somebody who knows how to wrangle data and gets into a flow of data wrangling, it's an elegant dance to watch," says Stephanie Langenfeld McReynolds, vice president of marketing with Trifacta.
According to Langenfeld McReynolds, data wrangling outstrips traditional data integration (DI), which tends to focus primarily on the selection, movement, and transformation of strictly-structured data from OLTP or other, similar sources. Think of it as similar to what happens when you go into one of those artisan or mash-up ice cream parlors. Should you desire, say, chocolate cheesecake praline ice cream, your ice cream and pralines are mashed together, by hand, right in front of you.
Trifacta's aim is to simplify or automate this process. The company wants to make it easier -- by exposing GUI acceleration features -- to join, split, blend, or even fold together data from strictly-structured and poly-structured sources. Think of it as analogous to an artisan ice cream parlor staffed with robots.
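To make "folding together" concrete, here is a minimal Python sketch -- the file names, field names, and join key are illustrative assumptions, not anything Trifacta exposes -- that blends a strictly-structured CSV of customers with a poly-structured file of JSON events:

```python
import csv
import json

# Hypothetical inputs: a strictly-structured CSV and a poly-structured
# JSON-lines file. File names and schemas are assumptions for illustration.
with open("customers.csv", newline="") as f:
    customers = {row["customer_id"]: row for row in csv.DictReader(f)}

blended = []
with open("events.jsonl") as f:
    for line in f:
        event = json.loads(line)                 # nested, loosely structured record
        cust = customers.get(event.get("customer_id"))
        if cust is None:
            continue                             # drop events with no matching customer
        blended.append({
            "customer_id": event["customer_id"],
            "segment": cust["segment"],          # field from the structured side
            "event_type": event.get("type"),     # fields from the poly-structured side
            "page": event.get("page", {}).get("url"),
        })

print(f"blended {len(blended)} records")
```

Every line of glue code like this is exactly what a GUI-accelerated wrangling tool tries to generate or hide.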
The site in which these robots do their mashing up is the Hadoop platform. There are two reasons for this, says Langenfeld McReynolds. First, it's widely accepted that an (R)DBMS, which is designed to be programmed and manipulated via SQL, can't cost-effectively or practicably manage certain kinds of data -- basically, any information that can't be represented in tabular form or as name-value pairs.
In the same way, a growing class of analytics -- from Markov chains and random forests to multipath correlations and graph analyses -- can't be performed using SQL, either. Hence, Hadoop is a natural platform in which to do all of this work. To the extent that Hadoop is also used as a long-term repository for strictly-structured data -- e.g., as an archive for operational or warehouse data -- this makes even more sense.
"You need to rethink how you're working with schema in your environment. In the old world of [the] very traditional data warehouse, you would define the schema in advance. There were good reasons for defining that schema in advance [having to do with] performance and scalability ... [and] that kind of structure still has a role," she argues. "There's a big advantage in leveraging Hadoop to not only store data -- to store all of your data -- but also to leverage the ability of Hadoop to recreate schema [in conjunction] with different types of analysis that you want to run."
The second reason Trifacta focuses on Hadoop is that Hadoop's data preparation feature set is primitive, to say the least. To be sure, it's possible to do any conceivable kind of data preparation (even Trifacta-style "data wrangling") in Hadoop -- if one codes it by hand. There are comparatively few tools that, like Trifacta, purport to automate or accelerate the grunt work of data preparation.
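"Coding it by hand" typically looks something like the following Hadoop Streaming-style mapper -- a sketch in which the delimiter, column order, and cleansing rules are all assumptions made for illustration:

```python
#!/usr/bin/env python
# A hand-rolled cleansing mapper in the Hadoop Streaming style: records arrive
# on stdin, cleaned records leave on stdout. The delimiter, column order, and
# rules below are illustrative assumptions, not part of any product.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        continue                      # silently drop malformed rows
    user_id, state, amount, ts = fields
    state = state.strip().upper()     # normalize state codes
    try:
        amount = "%.2f" % float(amount.replace("$", "").replace(",", ""))
    except ValueError:
        continue                      # drop rows with unparseable amounts
    print("\t".join([user_id, state, amount, ts]))
```

Multiply that by dozens of sources and hundreds of rules, and the appeal of tooling that automates the grunt work becomes obvious.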
In fact, there's arguably no tool just like Trifacta, in much the same way that there are (arguably) no tools just like those marketed by Alpine Data Labs Inc. -- which exposes a one-click Hadoop-based data preparation feature -- or Metanautix Inc., which touts similar capabilities. (Both companies style themselves as analytic platforms first, data prep environments second.)
Paxata is similar to Trifacta in that it comes at data integration (DI) from a non-traditional orientation, but it focuses more explicitly on self-service DI, or the equivalent of DI-for-non-coders. (Would-be data wranglers might not be coding ninjas, but they do understand code.)
Finally, Langenfeld McReynolds argues, traditional DI products from IBM Corp., Informatica Corp., Syncsort Inc., and Talend (among others) don't address the complexities of data wrangling, particularly with respect to manipulating poly-structured data.
"The traditional toolsets are not yet being used with Hadoop for the most part. When you scratch the surface of what the requirements are for wrangling data on top of Hadoop, you very quickly understand why," she comments. "A big part of that has to do with Hadoop's ability to generate schema-on-read as the last step in that data access process. If you think about traditional data integration or ETL workflows, they're all about getting data into an RDBMS, and it's all about refining that data before it hits the storage layer. As a result, a lot of data is thrown overboard before it hits that ETL process." The thing is, Langenfeld McReynolds points out, this throwaway data is of great interest to data scientists, along with less-savvy analytic types such as business analysts.
One big knock against Hadoop is that its data management feature set is shockingly primitive, at least relative to that of a mature (R)DBMS platform. Marc Demarest, a principal with information management consultancy Noumenal Inc., notes that neither the Hadoop distributed file system (HDFS) nor the Hadoop platform has a built-in understanding of data lineage or data quality. For this reason, it's impossible to answer any of several critical questions -- e.g., Where did this file come from? Who put it there? What is it used for? How accurate is the data stored in this file compared to this other data over here? How many other files of the same sort are in here? -- with the same degree of accuracy as with an (R)DBMS.
The upshot, Demarest argues, is that Hadoop has a pretty serious metadata problem. Langenfeld McReynolds concedes that Hadoop's out-of-the-box data management feature set leaves a lot to be desired but claims that Trifacta helps address this.
"As users are building up their script in Trifacta, metadata is being created through that process by the Trifacta product. What we then do with that metadata that's created [is] typing, [that is,] very simple metadata creation. This data type is time-oriented data, this data type is a string, this data type is a state, which we can tell because we understand the semantics of the data in Trifacta," she says. "We also integrate with some of those [BI] analytics tools as well [such as] Tableau. We're very cognizant about the need for that metadata [and] about how important that metadata is."
Trifacta, like Paxata, positions itself as a complement to the self-service discovery world of Tableau.
"Our part of the value proposition in the [analytic] stack is to be the data transformation platform. We sit in the middle, with the Hadoop distribution [beneath] and a tool like Tableau on top," she says, noting that Trifacta's isn't a completely GUI-fied environment. Some scripting is required, Langenfeld McReynolds concedes.
"Our focus is how can we take folks who understand SQL and give them access to Hadoop? SQL is not fantastic for [all] data transformation workloads. It's really good for BI-type queries, not so good for behavioral analysis, pattern matching, or path analysis on Web sites," she explains. "We kind of had to get a language that was flexible enough that the right level of abstraction was introduced. So what we came up with is a small, domain-specific language -- [called] "Wrangle" -- that is fit for the purpose of data transformation."
Right now, Trifacta tries to accelerate the extraction, loading, and preparation (in Hadoop) of relevant data from dispersed data sources. The long-term goal, says Langenfeld McReynolds, is that Trifacta will be able to push down the data processing workload to those dispersed data sources. In other words, instead of extracting data from a SQL database and moving it en bloc into Hadoop to be processed, Trifacta will instead push the data preparation workload down/out to the source database. (This is analogous to ETL push-down in traditional data management.)
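Push-down can be illustrated with a hedged sketch in which SQLite stands in for the source system; the orders table and its columns are invented for the example:

```python
import sqlite3

# Stand-in for a remote operational database, assumed to contain an
# orders(region, amount) table.
conn = sqlite3.connect("source.db")

# Without push-down: ship every row out of the source and reduce it elsewhere.
all_rows = conn.execute("SELECT region, amount FROM orders").fetchall()
totals = {}
for region, amount in all_rows:
    totals[region] = totals.get(region, 0) + amount

# With push-down: express the same preparation step as SQL so the source
# database does the work and only the small result set moves.
pushed_down = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region"
).fetchall()
```

The second query moves only the aggregated result, not the raw rows -- which is the whole point of pushing preparation work down to the source.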
This might be a moot capability, however, especially if Hadoop itself becomes more central to data preparation. In that scenario, data will be prepared in Hadoop (which functions as a kind of ETL platform) and moved or circulated outward -- i.e., to data warehouse platforms, analytic platforms, BI servers, or other consumers. Langenfeld McReynolds thinks this outcome is especially likely owing to Hadoop's capacity to generate schema-on-read, as distinct from the rigid schema of the old-school DM model.
"One of the biggest changes that we see is a transition in thinking from the idea that schema should be governed top-down to [the idea that] schema is something that develops grassroots from the bottom up and should be reusable. I think that's the triggering change of thinking that happens as organizations try to figure out how to enable a business analyst, a data scientist, a business engineer to do the work that they need to be able to do," she tells BI This Week.