Data Science at the Frontier: Here Be Dragons
To use advanced analytics insights, we must identify and control for our biases. In all likelihood, we're going to be surprised by what we find.
Roman and medieval maps sometimes displayed a three-word phrase -- "Here be lions" or "Here be dragons" -- in the blank space outside the limits of the known or explored world. When it comes to surveying the largely uncharted world of data science, this warning is as apt as ever.
Lions and dragons, in the form of unknown or unidentified biases, inhabit the frontiers of data science. To use advanced analytics insights, we'll have to identify and control for these biases. In all likelihood, we're going to be surprised by what we find.
Check Your Biases
For Angela Bassa, director of data science with a prominent robotics and artificial intelligence specialist, doing good data science is first and foremost a question of checking your biases -- starting with the human predisposition to assign cause and effect to disconnected events.
Bassa will be presenting a session on "quantitative storytelling" at TDWI's upcoming Accelerate conference in Boston. She plans to devote some portion of her presentation to the issue of causality.
"The risk when you're telling a quantitative story and you're building a narrative is that you're building in terms of causality -- 'because of this then that.' You're ascribing a causality that maybe isn't true. It's one of those natural biases we have. We see an effect and we think 'This happened because.' However, the use of 'because' is dangerous because data doesn't really tell you a 'because.'"
According to Bassa, millions of years of evolution have encouraged human beings to think and understand deterministically. The upshot, she argues, is that we're predisposed to assume causality where none exists. We think of causation as a kind of real, independent relationship that exists between two real things: a cause and an effect.
Quantitatively speaking, however, this isn't the case, she explains. Math doesn't distinguish between cause and effect.
Extraordinary Claims Require Extraordinary Evidence
There's another wrinkle, says Bassa: the data scientist isn't always the one who presents the results of an analysis or creates the visualizations used to explain an analysis. If organizations are going to make effective use of data science insights, they're going to have to promote a new kind of statistical numeracy.
Business analysts and management types alike -- up to and including C-level officers -- must become conversant with the concepts and implications of statistics and probability, she argues. Until that happens -- or as that's happening -- the data scientists, statisticians, analysts, and storytellers who communicate the results of an analysis must adapt their presentations to the simplest of understandings.
"A lot of times, the [analyst] who makes the visualization is not the person who presents it or consumes it or who first generated the data. That analyst has a really difficult job in that they have to prepare a visualization yet have no control over its delivery," she points out. "If the people you're speaking with, if the [finding] you're presenting to them is extraordinary, you're going to need to meet a level of scrutiny that other audiences may not require."
Predictive Technologies Embody Human Biases
Assumed causality is a known bias. In some cases, says Claudia Perlich, chief data scientist with marketing analytics specialist Dstillery, we discover our biases only after they show up in our models. In a worst-case scenario, our models propagate biases we aren't even aware of and can't necessarily detect.
Perlich, who famously won the Association for Computing Machinery's Knowledge Discovery and Data Mining Cup three years running from 2007 to 2009, uses the example of a machine learning algorithm that screens job applications to cull the pool of prospective candidates.
"If you use a machine learning system to automatically screen job candidates, there is a chance your predictive model may propagate historical biases. From a societal perspective, we would prefer certain outcomes, but if this model makes predictions [based on] what has happened in the past, it is bounded by [the selection criteria of] the past," Perlich says.
"People like me are enthusiastic about what we do. It is quite exciting, but all of us who are enthusiastically building these models need to develop a moral sense of responsibility ... about how and when they are put to use." At the very least, Perlich argues, we have to recognize that predictive models embody the acknowledged and unacknowledged biases of the people who created them.
From unknown biases to unidentified methodological errors to a host of other problems, advanced analytics is very difficult to do well. Perlich and other advocates believe the potential rewards far outweigh the risks, however. "What we need to come to grips with and collaborate on are better options to do these things the right way -- from a performance perspective, from a societal perspective, from a privacy perspective," she concludes.