Are Big Data Frameworks Accelerating to a Dead End?
Data storage and analytics frameworks aren’t keeping up with the growth in data creation today.
- By Jonathan Friedmann
- June 6, 2022
With data volume growing exponentially every year, the amount of digital data created over the next five years is expected to be more than double everything produced since the advent of digital storage.
There is promise in these lofty projections because big data has proven key to progress and innovation for countless industries in the digital age. For healthcare organizations, the ability to collect and analyze vast swaths of patient records has streamlined hospital management and catalyzed breakthrough discovery of cures. By leveraging big data, insurance companies can analyze beneficiary behavior to detect fraud; financial institutions have harnessed big data to anticipate behaviors and subsequently create more efficient strategies. Airlines such as Etihad plan to utilize big data analytics to improve fuel economy, minimize maintenance costs, optimize flight scheduling, and improve safety. The list goes on.
The feasibility of a future empowered by big data depends solely on industry leaders who set the tone and pace of data-driven innovation.
When we talk about “big data,” data itself is only half the equation. It can be easy to overlook the colossal storage and analytics frameworks needed to process that information and actually turn it into something usable. Big data frameworks such as Spark, Presto, BigQuery, Amazon Redshift, and others are rapidly evolving to address skyrocketing computing demands. Given big data’s staggering growth, are our processing technologies keeping pace, or are we losing the race to keep up ... big-time?
Bursting the Big Data Bubble
Key players in the space are offering insights into this question.
Databricks, a software company at the cutting edge of high-performance big data frameworks, recently charted the performance of analytics frameworks over the last decade. The company found that performance improved two to four times from 2016 to 2021, translating roughly to a 25 percent increase in performance year over year.
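As a quick sanity check on that arithmetic (a sketch of my own, not from the Databricks material; the `cagr` helper is illustrative), a two- to four-fold speedup over five years works out to roughly 15 to 32 percent per year, bracketing the 25 percent figure:

```python
# Annualized growth rate implied by a total speedup over a span of years
# (compound annual growth rate, CAGR).

def cagr(total_speedup: float, years: int) -> float:
    """Return the per-year growth rate that compounds to total_speedup."""
    return total_speedup ** (1 / years) - 1

low = cagr(2.0, 5)   # ~0.149 -> about 15% per year
high = cagr(4.0, 5)  # ~0.320 -> about 32% per year

print(f"2x over 5 years: {low:.1%} per year")
print(f"4x over 5 years: {high:.1%} per year")

# Conversely, 25% per year compounds to 1.25**5, i.e. about 3.05x over
# five years, squarely inside the reported 2x-4x range.
print(f"25% per year over 5 years: {1.25 ** 5:.2f}x total")
```

Either direction of the calculation confirms the article’s rough figure.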
Furthermore, judging by the trends I see in software tools, it is reasonable to assume that the key improvements driving software performance are drying up. For example, Databricks recently re-wrote its analytics engine in C++, moving from a high-level language to a low-level one in an effort to scrape out the last optimizations by getting closer to the hardware. This is a difficult undertaking, and it signals that there is not much headroom left to improve.
If data keeps growing much faster than our software processing capabilities, the industry will reach a critical pain point: the amount of data in the ether will far surpass our means to do anything with it. On current trends, processing capacity is already falling behind the exploding growth of data, opening an alarming gap in computing resources.
Mind the Gap
This impending “computation gap” is no secret in the industry. Multiple innovators have risen to the challenge and are already making progress bridging the void. Databricks and Meta Platforms, Inc., for example, both recently released new C++ libraries (Photon and Velox, respectively) designed to improve query performance and upgrade analytics processing.
However, this progress can also be viewed in a less-positive light: if the industry has reached a point where it must clamber to squeeze out any additional optimization, could this signify that we have all but exhausted our capabilities to maximize our software?
In response, some industry players are trying to re-engineer the lower tiers of their C++ libraries’ stack to upgrade performance -- in essence, scraping the bottom of the barrel to get the most out of an overwhelming amount of data.
Big data analytics has been essential to innovation in the 21st century. Unfortunately, if current trends continue, the growth of these capabilities will continue to be eclipsed by the exponentially greater growth of data itself. Without the capacity to take advantage of this data, countless businesses will miss out on critical advances.
If we hope to continue benefiting from all that big data has to offer, then it is high time for our industry to rethink the approach to both the hardware and software of analytics frameworks. Only by striving for new and unprecedented processing capabilities that evolve hand in hand with (if not even more rapidly than) the data they are tasked to assess will we be able to close the widening computing gap and usher in a new age.
Jonathan Friedmann is the co-founder and CEO at Speedata. Previously, Friedmann was CEO and co-founder of Centipede, which developed IP for general-purpose processors. He also served as COO and VP R&D at Provigent, a cellular infrastructure semiconductor company acquired by Broadcom. You can contact the author via LinkedIn.