What Spreadsheets Can Teach Us About Analytics
With the current explosion of analytics, the data management lessons we didn't learn with spreadsheets have returned with a vengeance. Help is at hand.
- By Barry Devlin
- July 1, 2019
Spreadsheets "proliferate like poisonous vines, slowly strangling organizations by depriving them of a single consistent set of information ..." This simile, coined nearly two decades ago by Wayne Eckerson, then director of education and research at TDWI, still elicits a knowing smile among data management professionals. Despite Eckerson's seven tips to address the problem, the challenge persists.
In fact, it has grown immeasurably worse in recent years with an explosion of data sources of questionable quality, the emergence of Hadoop and data lakes, a plethora of overlapping and competing analytical tools, and a pivot to cloud-based, self-service offerings, often with limited governance. Eckerson's poisonous vines have evolved into terrifying triffids that can rampage through the data fields of digital businesses leaving behind a trail of devastated information assets.
Science fiction aside, the parallels between both the drivers and challenges of the spreadsheet and analytics environments offer important lessons for today's IT.
The Joy of Spreadsheets
Since their invention in 1979 -- check out Visicalc on the Apple -- spreadsheets have become by far the most popular data exploration tool for businesspeople of all analytical skill levels. Their wide availability, ease of use, and support for simply playing with the data have made them indispensable to users but unloved by IT departments trying to build data warehouses or promote carefully selected BI tools.
Following years of trying to "kill the spreadsheets," most data warehouse and BI professionals now simply shake their heads sadly at the mention of the subject. Few recognize even today that spreadsheets themselves are not the problem. The real issue is their promotion from individual use (where they excel -- pun intended -- at promoting innovation) to departmental and, most dangerously, corporate use. Once unleashed to broad, shared use, the limitations in data quality management and governance become immediately apparent, leading to a rich history of spreadsheet horror stories.
In addition to human error and poor quality-management practices within spreadsheets, businesses seldom conduct due diligence when promoting spreadsheets to broader use. Furthermore, the major spreadsheet vendors have done little to create an underpinning foundation of data and process management within or around the base function to enable auditability, track source lineage, and manage distributed and wider usage.
A Pain in the Analytics
Leaders in analytics and machine learning have focused strongly on data exploration and insight discovery, where data scientists take a leading role. However, more recently, data concerns have come to the fore as data scientists have realized the hard work and time involved in data acquisition and preparation. In this sense, the exploratory work of a data scientist exactly parallels that of a spreadsheet user. The added challenge for data scientists is the variety of sources, data volumes, and variable data quality they encounter.
The process of putting analytics into production is conceptually similar to that for spreadsheets, requiring a shift of focus from creativity and innovation to engineering, and -- ideally -- a shift in organizational responsibility from business to IT. However, the process for analytics is technically more complex because production analytical models must link directly to operational data -- both traditional transactions and sensor or clickstream events in near real time. Furthermore, model deployment and ongoing model management demand more technical knowledge and system access than simply sharing spreadsheets does.
Spread the Learning
Putting spreadsheets into production often amounts to no more than sharing the spreadsheet and its data with colleagues who run it -- and often modify it -- locally. Because of the limited function within spreadsheets to manage "productionization," IT and business together must build bespoke tools, processes, and organizational workarounds.
I have described this adaptive decision cycle in my book Business unIntelligence and formalized the three steps involved. These start from honoring the value of individual innovation and using collaborative thinking and tools to evaluate the usability and correctness of the spreadsheet in all circumstances before handing it over to IT to address production issues such as sourcing and performance. As a largely organizational approach, it is easily extended to analytics.
From the architectural and technological viewpoints, the lesson to be learned from spreadsheets is that software vendors must take appropriate and increased responsibility for data management. This involves embedding stronger data governance function in their analytics offerings and enabling more flexible and powerful data and process linkages between the analytical and operational environments. The result is a significant realignment of the traditional data warehouse and lake architectures to a production analytics platform as I described in a recent Business Intelligence Journal article.
In contrast to the spreadsheet situation, vendors across all aspects of data management -- database management systems; data integration and preparation tools; and metadata, catalog, and governance tools -- are delivering the functionality required. Although still at an early stage, leaders in analytics are turning their attention from data science (with its focus on algorithms) to data engineering that creates an overall data and process management environment where algorithms are more simply and efficiently developed, run, deployed, and managed.
About the Author
Dr. Barry Devlin defined the first data warehouse architecture in 1985 and is among the world’s foremost authorities on BI, big data, and beyond. His 2013 book, Business unIntelligence, offers a new architecture for modern information use and management.