The Data-Driven Fascination with Now
Human perception is easily skewed by newer data because it makes up an ever-larger proportion of the data landscape.
- By Barry Devlin
- January 27, 2017
In today's IT world, the phrase "the amount of data has been doubling every x" has become so common as to often pass unnoticed. A quick search of Google News turned up four top stories from November and December 2016 repeating the mantra.
However, the authors offered four different values for "x" -- three years, two years, eighteen months, and one year. The smallest value was offered by Advertising Age, whereas the largest came from ex-CTO of IBM and now strategic advisor to MasterCard and Citi, Irving Wladawsky-Berger, in the Wall Street Journal.
The exact rate of growth is arguably irrelevant. The base data in many cases comes from "The Digital Universe" series of studies published by EMC (now Dell EMC) and IDC since 2007. The last version was in 2014; it's getting a bit stale, but it claimed a doubling every two years from 4.4 zettabytes in 2013 to 44 in 2020. (One zettabyte equals one million petabytes.)
Effects of Data Growth
Details aside, the graph is a classic hockey stick, and we are now accelerating rapidly up the handle. The volumes and growth rate are awe-inspiring, but most of us miss the subtle effect of a geometric growth rate.
Roughly speaking, the digital universe reached 1 million petabytes in 2008, when approximately 34,000 petabytes were being created per month. In 2016, the amount of data created in one month alone was approximately 500,000 petabytes, or half of all the data that had ever been created by 2008. By 2020, the monthly growth will be on the order of 2 million petabytes (2 zettabytes) -- equal to the total volume reached in mid-2010.
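The arithmetic behind these figures can be approximated with a short sketch. It assumes a clean doubling every two years, anchored at 1 million petabytes at the start of 2008, and it averages each year's growth over twelve months. The function names and the anchoring are illustrative assumptions, not taken from the Digital Universe studies.

```python
def total_petabytes(year, base_year=2008, base_pb=1_000_000, doubling_years=2.0):
    """Estimated size of the digital universe at the start of `year`.

    Assumption: volume doubles every `doubling_years`, starting from
    `base_pb` petabytes at the start of `base_year`.
    """
    return base_pb * 2 ** ((year - base_year) / doubling_years)


def avg_monthly_creation(year):
    """Average petabytes created per month during `year` under the same model."""
    return (total_petabytes(year + 1) - total_petabytes(year)) / 12


for year in (2008, 2016, 2020):
    print(f"{year}: about {avg_monthly_creation(year):,.0f} PB created per month")
```

Under these assumptions the model gives roughly 34,500, 552,000, and 2.2 million petabytes per month for 2008, 2016, and 2020, which is in line with the rounded figures quoted above.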
The consequence is that newer data consistently comprises an ever-larger proportion of the data landscape and attracts a correspondingly growing proportion of human attention. With such a temporal imbalance in information availability, human perception is easily skewed.
The effect is known as recency bias or the availability heuristic, and it influences everything from stock trading to public polls on "best romantic movies of all time." In all cases, our most recent and available experiences form the baseline against which we judge reality or make predictions about the future.
This and other human cognitive biases have featured heavily in some experts' assertions of the superiority of algorithms and data science in decision making. Rita Sallam, research VP at Gartner, stated baldly in her March 2016 BI Summit keynote that "Algorithms will eliminate the last weak link in decision making ... us."
The point she and other experts miss is that both the algorithms and the data sets they use are subject to the biases of the humans who build them. Data science does not eliminate bias. On the contrary, it promotes the biases of a small subset of the population: the data industry.
Myths about Decision Making in the Data Industry
Two dangerous biases about decision making are common in IT culture. The first is a belief that the more data available, the better the decision making. This springs from a gross overestimation of the rationality of decision makers, as I described in a previous series of articles.
The second bias is the idea that faster decision making is the foundation for market success. This thinking is pervasive in customer-facing organizations: consumers have shorter attention spans, so we must focus on instant gratification and even anticipation of unstated needs. The end result is a fascination with the moment of now in both data collection and analysis.
As shown in the calculation above, the overall shape of the data set currently used for analytics increasingly favors the more recent past and present moments. With such a preponderance of recent data, algorithms can easily discount the historical record unless carefully designed to weight its contribution properly.
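To make the skew concrete, the sketch below estimates what share of all data is recent under a steady doubling, and shows one hypothetical way to weight records so that each year carries equal influence. Both functions and the sample volumes are illustrative assumptions. Equal weighting per year is only one possible design choice and may not suit every analysis.

```python
def recent_share(window_years, doubling_years=2.0):
    """Fraction of all data ever created that is at most `window_years` old.

    Assumption: volume has doubled every `doubling_years` throughout the past.
    """
    return 1 - 2 ** (-window_years / doubling_years)


def balanced_weights(volumes_by_year):
    """Per-record weights that give every year the same total influence.

    `volumes_by_year` maps a year to the amount of data from that year.
    A record from a high-volume year gets a proportionally smaller weight.
    """
    n_years = len(volumes_by_year)
    return {year: 1 / (n_years * volume) for year, volume in volumes_by_year.items()}


print(f"Share of all data from the last two years: {recent_share(2):.0%}")
print(f"Share of all data from the last four years: {recent_share(4):.0%}")

# Hypothetical volumes (arbitrary units) that double every two years.
volumes = {2012: 1.0, 2014: 2.0, 2016: 4.0}
for year, weight in balanced_weights(volumes).items():
    print(f"{year}: weight per unit of data = {weight:.4f}")
```

With a two-year doubling time, half of all data is less than two years old and three quarters is less than four years old, so an unweighted analysis is dominated by the recent past.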
Recency bias on the part of both designers and users of these algorithms operates along with these cultural biases to exacerbate the underlying problem of data skew. Add the current drive toward self-service analytics by statistically uninformed business people and we can see the strong possibility of a perfect storm brewing with predictably chaotic results.
Use Your Historical Data Properly
The challenge for every business in the coming year -- particularly in consumer-facing industries -- is therefore to rebalance the emphasis between historical and recent or current data. The key to addressing this challenge is to refocus on the governance and use of the traditional data warehouse environment as a repository of historical truth.
This is not to say that all data must pass through the warehouse, nor that the warehouse must be a "single version of the truth." Rather, the warehouse must stand beside and be closely integrated with the more recent data science and analytics environments. It is no longer a question of the volume of data (if it ever truly was), but of the value of both historical and ongoing data in discovering the answers you need.
Dr. Barry Devlin defined the first data warehouse architecture in 1985 and is among the world’s foremost authorities on BI, big data, and beyond. His 2013 book, Business unIntelligence, offers a new architecture for modern information use and management.