Some People Call Me the Data Wrangler, Some Call Me the Gangster of Prep
By Meighan Berberich, President, TDWI
Data prep. Wonderful, terrible data prep. According to John Akred of Silicon Valley Data Science, “it’s a law of nature that 80% of data science” is data prep. Although our surveys average closer to 60%, even that’s an awful lot of time to spend not analyzing data, interpreting results, and delivering business value—the real purpose of data science.
Unfortunately, real-world data doesn’t come neatly prepackaged and ready to use. It’s raw, messy, and sparse, and it exists across a million disparate sources. It can be dirty, poorly formatted, unclear, undocumented, or just plain wrong. One can easily see what makes Exaptive data scientist Frank Evans ask, “Are we data scientists or data janitors?”
The news isn’t all bleak, though. If there’s one thing we know, it’s that the data scientist’s mindset is perfectly suited to grappling with a seemingly intractable problem and coming up with answers. For example, even Evans’ cynical-seeming question isn’t offered without some solutions.
“Most projects are won or lost at the wrangling and feature engineering stage,” Evans says. “The right tools can make all the difference.” We have a collection of best practices and methods for wrangling data, he offers, such as reformatting it to make it more flexible and easier to work with. There are also methods for feature engineering to derive the exact elements and structures that you want to test from raw data.
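To make that a little more concrete, here is a minimal sketch of what "reformatting" and "feature engineering" can look like in practice. It is not Evans' method, just an illustration under stated assumptions: the sample records, the column names, and the helper functions `to_long` and `add_features` are all hypothetical. It reshapes a wide table (one column per month) into one record per store and month, then derives two simple features from the raw values.

```python
from datetime import date

# Hypothetical raw records in a "wide" layout: one column per month.
# Values arrive as strings, with blanks and "n/a" standing in for gaps.
raw_rows = [
    {"store": "A", "2017-01": "120", "2017-02": "", "2017-03": "135"},
    {"store": "B", "2017-01": "80", "2017-02": "95", "2017-03": "n/a"},
]


def to_long(rows, id_key="store"):
    """Reshape wide rows into one record per (store, month)."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue
            year, month = (int(part) for part in key.split("-"))
            try:
                sales = float(value)
            except ValueError:
                sales = None  # keep the gap visible instead of guessing
            long_rows.append(
                {"store": row[id_key], "month": date(year, month, 1), "sales": sales}
            )
    return long_rows


def add_features(long_rows):
    """Derive two illustrative features: a missing-value flag and the quarter."""
    for rec in long_rows:
        rec["is_missing"] = rec["sales"] is None
        rec["quarter"] = (rec["month"].month - 1) // 3 + 1
    return long_rows


tidy = add_features(to_long(raw_rows))
for rec in tidy:
    print(rec)
```

The long layout is easier to filter, group, and join than the wide one, and flagging gaps explicitly keeps the "just plain wrong" values from slipping silently into an analysis.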
Akred is similarly solutions-oriented. His many years of experience applying data science in industry have allowed him to develop a framework for evaluating data.
“You have data in your organization. So you need to locate it, determine if it’s fit for purpose, and decide how to fill any gaps,” he says. He is equally pragmatic about the need to navigate the political and technical aspects of sourcing your data, something that is often neglected.
Wes Bernegger of Periscopic takes a somewhat more playful tack.
“The road to uncovering insight and narratives in your data begins with exploration,” he says. “But though there are all kinds of tools to help you analyze and visualize your data, it’s still mostly an undefined process.” Bernegger suggests coming to the task with the attitude of an old-fashioned explorer.
“If you plan your voyage and are prepared to improvise with relentless curiosity,” he says, “you can often come away with unexpected discoveries and have more fun along the way.” Bernegger advises laying out a system for the data exploration practice, from wrangling and tidying to visualization, through many rounds of iteration, and stocking up on some tools (such as continuing education) to help you find your way in unfamiliar terrain.
Build your expertise. Drive your organization’s success. Advance your career. Join us at TDWI Accelerate, October 16-18 in Seattle, WA.
Posted on July 19, 2017