
Big Databuse

With so many election predictions gone wrong, what are the lessons for business use of big data?

A lot of data -- big data, for sure -- was collected in the months leading up to Election Day. Nobody saw all of it. Few people saw more than a tiny fraction of it before it was fed into sophisticated analytic models that disgorged predictions of a likely Clinton win. As we know, those predictions were spectacularly wrong.

I assume the models were sophisticated based on the advances that have been made in data science in the past four to eight years, since Nate Silver correctly predicted 49 of 50 states in the 2008 U.S. Presidential election and topped that in 2012 by calling all 50 states and the District of Columbia correctly. He also had a pretty good batting average as a baseball analyst.

Being a data guy, I chose to follow Silver’s fivethirtyeight.com in the months before the election. It was a bit of a rollercoaster ride, but right at the end it was showing Clinton with a 72 percent chance of winning and a forecast of 301.6 electoral college votes. Even the 80 percent chance range showed her squeaking through. With two states still to declare at press time, it looks like Silver got 5 states wrong and will be off by some 70 electoral college votes this time around. He was far from alone; almost every poll and prediction showed Clinton as the likely winner.
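For readers wondering how a forecast arrives at a fractional figure such as 301.6 electoral college votes: the expected total is simply the sum, over all states, of each state's electoral votes weighted by the modeled probability of winning that state. Here is a minimal sketch in Python; the handful of state probabilities are illustrative placeholders, not FiveThirtyEight's actual numbers or methodology.

```python
# Minimal sketch of how a probabilistic forecast yields a fractional
# expected electoral-vote total such as 301.6: each state's electoral
# votes are weighted by the modeled probability of winning the state.
# The figures below are illustrative placeholders, not FiveThirtyEight's.

state_forecasts = {
    # state: (electoral votes, modeled probability Clinton wins)
    "Florida":      (29, 0.55),
    "Pennsylvania": (20, 0.77),
    "Ohio":         (18, 0.35),
    "Michigan":     (16, 0.79),
}

expected_ev = sum(ev * p for ev, p in state_forecasts.values())
print(f"Expected electoral votes from these states: {expected_ev:.1f}")
```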

Of course, the journalists, pollsters, and big data analysts are now engaged in navel-gazing over where they failed. Journalist Jim Rutenberg offered up bigotry among journalists as a possible answer. Technology contributor Tom Foremski called it cultural bias in the analysis. The pollsters are poring through the data yet again, in ever deeper demographic detail, to see what went wrong.

I’ve previously asked what went wrong for data-driven predictions in the Brexit vote. There, polls were fairly evenly balanced throughout most of the campaign. Most observers’ predictions of a “Remain” vote were poorly supported by the data and could well be ascribed to some form of bias on the part of the analysts.

Wikipedia offers a list of some 175 cognitive biases, identified by psychologists and behavioral economists, to choose from. Cultural bias is not among them. Furthermore, given the (summary) data published, the problem with the Presidential election predictions lies less in interpretation than in the data itself and the models used to analyze it.

In the case of polling data from experienced companies, it is fair to assume that sufficient thought and statistical knowledge have gone into designing the survey instrument and selecting participants to minimize measurement or coverage errors. If this is true, we are left with the likelihood that non-response and false-response errors may be the cause of the false predictions. Given the febrile nature of campaigning, this possibility should not be discounted.
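To see how differential non-response alone can tip a prediction, consider a small simulation. The response rates below are assumptions chosen purely for illustration, not estimates from any actual poll: even with a perfectly representative sample and a genuinely tied race, a modest gap in willingness to respond produces a clear (and false) leader.

```python
# Illustrative simulation (assumed response rates, not from any actual
# poll) of how differential non-response skews a survey estimate even
# when the underlying sample is perfectly representative.
import random

random.seed(42)

population = 100_000
true_support = 0.50          # the race is actually tied

# True preference of each potential respondent (True = supports A).
voters = [random.random() < true_support for _ in range(population)]

# Assumption: A's supporters answer the pollster 60% of the time,
# opponents only 50% of the time.
responses = [v for v in voters if random.random() < (0.6 if v else 0.5)]

observed = sum(responses) / len(responses)
print(f"True support: {true_support:.1%}, poll says: {observed:.1%}")
```

Run as written, the poll reports roughly 54 to 55 percent support for a candidate who in truth has 50 percent -- a larger error than the margins in several of the decisive states.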

There are lessons here for business use of big data. Much of the big data used by business comes not from well-designed surveys but from social media, where there is no basis for assuming statistically valid coverage given that the data pre-exists any query made of it. Furthermore, much of the discourse on social media is driven by aggressive posturing, trolling, and other extreme behaviors. Minority and alternative viewpoints may thus be widely self-censored without any way for the analysts to know what viewpoints are missing. Such issues are likely to be particularly prevalent when sensitive topics such as race, religion, or sexual orientation play a role in the information being analyzed.

In the U.S. election polls, demographic change models have also been cited as possibly affecting the predictions. This points to a particularly insidious problem, as the analytics models now in use are increasingly created by algorithmic or “artificial intelligence” means. Such models are based not on rational, cause-and-effect reasoning but on correlation. In this case, there may be no knowable foundation for the model and thus no easy way to discover, prove, or disprove any embedded bias. Nicholas Diakopoulos’ discussion of some of these issues in “Accountability in Algorithmic Decision-making” makes for unsettling reading.
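A toy example makes the point about embedded, hard-to-inspect bias. In the sketch below (assumed data, not anything from Diakopoulos’ paper), a model is fitted only on a proxy variable that happens to correlate with a sensitive attribute; the attribute never enters the model, yet its effect is faithfully reproduced in the predictions, and nothing in the fitted coefficients reveals that fact.

```python
# Toy illustration (assumed data) of how a purely correlation-based
# model can embed bias invisibly: the sensitive attribute never enters
# the model, yet a correlated proxy carries its effect into predictions.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

group = rng.integers(0, 2, n)                 # sensitive attribute (hidden)
proxy = group + rng.normal(0.0, 0.5, n)       # e.g., location correlated with group
outcome = 2.0 * group + rng.normal(0.0, 1.0, n)

# Fit a least-squares line on the proxy alone -- `group` is never an input.
slope, intercept = np.polyfit(proxy, outcome, 1)
pred = slope * proxy + intercept

print(f"Mean prediction, group 0: {pred[group == 0].mean():+.2f}")
print(f"Mean prediction, group 1: {pred[group == 1].mean():+.2f}")
```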

One final consideration is a version of the Heisenberg Uncertainty Principle. The question is often asked: to what extent do polls actually influence the outcome of voting? Similar considerations apply to the use of social media in the business world. Do people change their behavior when they know their posts and messages are being analyzed and used to influence them? Intuitively, the answer is yes. However, where is the data to prove or disprove this hypothesis?

The use of big data analytics in business is widespread and widely promoted as fundamental to success. Two different modes of failure in the political sphere over the past half year should give pause for thought, not just in polling but in all aspects of big data's use and abuse.

 

About the Author

Dr. Barry Devlin is among the foremost authorities on business insight and one of the founders of data warehousing in 1988. With over 40 years of IT experience, including 20 years with IBM as a Distinguished Engineer, he is a widely respected analyst, consultant, lecturer, and author of "Data Warehouse: From Architecture to Implementation" and "Business unIntelligence: Insight and Innovation Beyond Analytics and Big Data" as well as numerous white papers. As founder and principal of 9sight Consulting, Devlin develops new architectural models and provides international, strategic thought leadership from Cornwall. His latest book, "Cloud Data Warehousing, Volume I: Architecting Data Warehouse, Lakehouse, Mesh, and Fabric," is now available.

