Repeatability: The Key to Scaling Data Science
When it comes to embedding analytics insights into your business, technology is the easy part. Figuring out what to do and how to do it is much harder.
Like most organizations, you want to embed analytics insights in your operational processes and promote a culture of analytical decision making. You want to use machine learning, deep learning, and related technologies to automate decision making when and where it makes sense.
These goals might seem both realistic and attainable. After all, software and cloud vendors are pitching you easy-to-use, quasi-automated, self-service tools, and consultants promise to help you bridge the gap between the skills you have and the skills they say you'll need. Piece of cake, right?
Far from it, says Mark Madsen, a research analyst with information management consultancy Third Nature. Between the idea and the reality of embedded analytics falls the shadow of significant people, process, and methodological issues. "Technology is the easy part. Figuring out what to do and how to do it is a lot harder. In spite of this, there are lots of shiny new tools that promise to make all of these problems go away. Not surprisingly, they're attracting a lot of attention," he says.
The use and abuse of data science is a topic that's near and dear to Madsen's heart. He'll be speaking about this issue at TDWI's upcoming Accelerate conference, held in Boston April 3-5, 2017, which TDWI describes as "the leading conference for analytics and data science training." It features deep-dive tutorials, networking opportunities, and presentations from Michael Li, Claudia Perlich, Eduardo de la Rubia, and other luminaries. It will also be a forum for insightful and provocative content, including presentations such as Madsen's, which addresses the issues vendors, consultants, and would-be adopters are keen to minimize.
"People and data are the truly hard parts. People can be problematic because many believe data is absolute rather than relative, that analytics models produce a single, definitive 'answer' rather than a range of answers with varying degrees of truth, accuracy, and applicability," Madsen says.
"Data is a problem because managing data for analytics is a nuanced, detail-oriented, seemingly dull task left to back-office IT. It's precisely the kind of thing [business] people want to blow right past."
In the legacy business intelligence (BI) model, ETL was the chief problem. In data science, the problem area shifts to data engineering -- the preparation, integration, and (if necessary) management of data. The capabilities of commercial self-service tools are insufficient, Madsen argues, because a huge portion of the work that data scientists and statisticians do can't be automated or even accelerated, let alone tightly controlled or managed.
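To see why, consider what even a routine slice of data preparation looks like in code. The following is a minimal sketch in Python with pandas; the file name, column names, and business rule are hypothetical. The point is that each step encodes a judgment call no generic self-service tool can make on its own.

```python
import pandas as pd

# Hypothetical raw extract; file name and columns are illustrative only.
orders = pd.read_csv("customer_orders.csv")

# Each step below encodes a judgment call a tool can't make for you:
# which timestamp format is authoritative, what counts as a "duplicate,"
# and whether a negative amount is a refund or a data-entry error.
orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")
orders = orders.dropna(subset=["order_date", "customer_id"])
orders = orders.drop_duplicates(subset=["order_id"], keep="last")
orders = orders[orders["amount"] > 0]  # assumed business rule: exclude refunds

print(f"{len(orders)} usable rows after cleaning")
```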
He points to data in motion -- so-called "streaming" data -- as an example.
"We've solved data extraction and capture problems for historical data in databases but not for streaming data. Capturing streaming data is easy; managing and analyzing it is much harder," he argues. "All of this ignores the problem of actually making use of insights. Say you add a new stream of data. How does anybody know it's out there, ready to be used? How do you add or adjust a model to use it? How do you embed that streaming insight as a reusable, repeatable service?"
There's another wrinkle here, too. Data science doesn't always lend itself to reuse and repeatability. On the one hand, machine learning, deep learning, neural networks, and other kinds of advanced analytics are iterative: they require multiple passes over a data set, punctuated (after each iteration) by the collection and analysis of results and the (usually manual) tweaking of models or algorithms.
On the other hand, data science is a kind of experimental computing. Not all of these highly iterative experiments will pan out. As Madsen aptly puts it, "Every problem is its own special snowflake. This is the nature of analytics problems because you start with unknown data and every organization's data is unique."
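The iterative cycle Madsen describes can be made concrete in a few lines of code. Here is a minimal sketch using scikit-learn on a synthetic data set; the hyperparameter grid simply stands in for the "tweak" step, which in practice is often a human judgment call rather than a loop.

```python
# Iterate, evaluate, tweak: one pass per candidate setting.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

best_auc, best_c = 0.0, None
for c in (0.01, 0.1, 1.0, 10.0):          # one "iteration" per setting
    model = LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"C={c}: AUC={auc:.3f}")         # collect and analyze results...
    if auc > best_auc:                     # ...then adjust and go again
        best_auc, best_c = auc, c

print(f"best C={best_c} (AUC={best_auc:.3f})")
```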
Madsen calls this experimental, iterative way of working the "craft model" of information production. Some elements of craft are unavoidable in data science, he concedes, but that's no reason to give up on reuse, repeatability, and manageability. Madsen recommends enlisting business stakeholders in a bottom-up approach to clearly identify and prioritize goals. Orienting questions include:
- Can you measure the value of achieving the goal versus the cost of doing the work?
- Can you establish precise, measurable bounds? "Selling more stuff" is insufficient, he argues. You need a way to empirically determine whether you're improving things; the sketch at the end of this article shows what such a bound might look like. This isn't just a question of economic utility, either. In machine learning, you have to be able to evaluate the performance of a model as you refine it.
- Do you have the data to build a model, evaluate its performance, and measure the results on an ongoing basis? "Creating a data science team is about the repeatable production of new information and insights," he explains.
"The infrastructure focus needs to be on repeatability -- at least to the degree repeatability is possible."