Enabling Citizen Data Analysts in the Post-Hadoop World
The New Year will feature the fallout from the void left by Hadoop's decline.
- By Luke Han
- January 14, 2022
Many of us will view 2022 as the year we crawl out from under the dread and turmoil of the COVID-19 crisis to view a new horizon. The profound feeling that the world has changed forever provides a more dramatic backdrop than usual when making predictions about technology and data in the coming year.
The analytics world has not stood still in the past year. The outlook for 2022 suggests that the post-Hadoop world will offer more refined cloud data services to a growing array of citizen participants, and that these data citizens will need low-code/no-code building blocks in the form of data APIs.
Filling the Hadoop Void
It is not difficult to predict the continued decline of the Hadoop ecosystem. As much fun as it is to talk about the dramatic decline of a technology that drove multibillion-dollar mergers and an improbable amount of investment, next year will feature the void left by Hadoop's decline.
As soon as MapReduce became widely used and understood, Spark arrived and made it irrelevant. Brute-force data ingestion into Hadoop was replaced by the superior Kafka platform. One by one, each component of the Hadoop ecosystem has been marginalized by a better, more refined alternative.
What about all that data? Much of the dark data that has been collected for a rainy day has been dumped into Hadoop, more specifically, into the Hadoop Distributed File System (HDFS). Cloud vendors in the meantime have had no problem providing object storage and cloud data warehouses that render Hadoop irrelevant in the cloud. People moving wholesale to the cloud for big data analytics will have many good options, but very large data sets that reside on premises may prove too expensive or impractical to move.
Best Practice: Look before you leap off Hadoop
Although Hadoop doesn't appear to have much of a future, data teams must consider that moving data out of Hadoop is not without cost. There are ample options to store and manage data in the cloud (AWS S3, Azure Data Lake Storage, etc.), but the cloud isn't the best option for all your data. Reimplementing your current HDFS-based data lake on another storage platform or distributed file system also has associated costs, and perhaps fewer compelling benefits than keeping the data where it is (for now).
When endeavoring to avoid major data migrations, data teams should also consider data virtualization techniques as well as query precomputation and results caching so that back-end data systems -- Hadoop or otherwise -- are not overwhelmed by analytical queries.
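As a rough illustration of the results-caching idea, the sketch below wraps a back-end query function so that repeated analytical queries are answered from memory for a limited time. The class name QueryResultCache, the run_query callable, and the five-minute default lifetime are assumptions made for this example; they are not part of Hadoop or of any particular product.

```python
import time
from typing import Any, Callable, Dict, Tuple


class QueryResultCache:
    """Answer repeated queries from memory so the back end sees fewer hits.

    Hypothetical helper for illustration only.
    """

    def __init__(
        self,
        run_query: Callable[[str], Any],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._run_query = run_query      # the expensive back-end call
        self._ttl = ttl_seconds          # how long a cached result stays valid
        self._clock = clock              # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, Any]] = {}
        self.backend_calls = 0           # counts real back-end executions

    def query(self, sql: str) -> Any:
        # Naive cache key: the query text with surrounding whitespace removed.
        key = sql.strip()
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]              # fresh cached result, back end untouched
        self.backend_calls += 1
        result = self._run_query(sql)
        self._entries[key] = (now, result)
        return result
```

A production system would also need cache invalidation when the underlying data changes, which this sketch leaves out.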
Low Code/No Code for Analytics
Visual development platforms -- having now morphed into low-code/no-code platforms -- have long been a model for shortening development cycles, increasing developer productivity, and empowering semi-technical power users to contribute to solving a particular technical or business challenge.
Following the push to grow the enterprise pool of citizen developers, many organizations are now turning their attention to citizen data scientists and analysts. Although data science, data engineering, and data analytics workloads have lagged behind other low-code/no-code applications such as website development, chatbots, and interactive voice response, the dramatic rise in data engineering and data science use cases makes this an attractive area in which to realize the same multiplying effects.
Because low-code/no-code platforms enable non-coders to build their own data apps organically, look for low-code/no-code tools and libraries for data prep, predictive analytics, and even machine learning. Making this a reality for citizen data analysts will require the creation of modular workflows for data management, for populating data pipeline templates, and for the accelerated adoption of data services. This will be the essential next wave of automation for analytics, machine learning, and AI.
Best Practice: Seek out multiplying effects for data engineering
Data engineering now takes up the lion's share of the development effort to enable low-code/no-code analytics for the masses. Data teams should seek out low-code/no-code solutions to tackle the fast-growing number of data engineering tasks such as data cleansing, data deduplication, and data enrichment. Part of that process will be for development teams to search out or create a new set of data-oriented microservices and data APIs.
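One way to picture such building blocks is as small, single-purpose functions that a low-code tool could chain together. The sketch below is a minimal example under that assumption; the record fields (name, email) and the function names are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Step = Callable[[List[Record]], List[Record]]


def cleanse(records: List[Record]) -> List[Record]:
    """Trim whitespace, lowercase emails, and drop records with no email."""
    cleaned = []
    for record in records:
        email = record.get("email", "").strip().lower()
        if email:
            name = record.get("name", "").strip()
            cleaned.append({**record, "email": email, "name": name})
    return cleaned


def deduplicate(records: List[Record]) -> List[Record]:
    """Keep the first record seen for each email (expects cleansed input)."""
    seen = set()
    unique = []
    for record in records:
        if record["email"] not in seen:
            seen.add(record["email"])
            unique.append(record)
    return unique


def enrich(records: List[Record]) -> List[Record]:
    """Add a 'domain' field derived from the email address."""
    return [{**r, "domain": r["email"].rpartition("@")[2]} for r in records]


def run_pipeline(records: List[Record], steps: List[Step]) -> List[Record]:
    """Apply each step in order; a visual tool could assemble this list."""
    for step in steps:
        records = step(records)
    return records
```

Calling run_pipeline(raw_records, [cleanse, deduplicate, enrich]) applies the three steps in sequence. Because every step shares the same signature, steps can be reordered or swapped without touching the others.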
Low-code/no-code initiatives by their nature require high-code developers to do a great deal of work shielding low-code/no-code users from the complexity and pitfalls of the development processes they want to automate. At the same time, the greatly increased use of analytics has caused dramatic growth in the number of meaningful key performance indicators (KPIs) that are now tracked for virtually all business processes.
The massive shift in development strategies toward a microservices model means that new API libraries proliferate and form an API economy and marketplace. The monetization of APIs and the pursuit of this API economy will increasingly affect data applications, data engineering, analytics, and machine learning. These systems will require not only their own set of APIs but also a more sophisticated approach to getting the right data to the right people at the right time. In 2022, this will require the creation of a data-as-a-service (DaaS) mentality in data teams that will shape how APIs will be organized for analytics internally and externally. As part of a new trend -- the data mesh -- data APIs will be essential to build domain data products.
The systematic implementation of APIs that deliver data, metadata, and essential intelligence will be used not only for public, customer-facing processes but also for internal purposes. This will put DaaS APIs in the hands of developers who can then make them available to the low-code/no-code platforms used by citizen data analysts.
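To make the idea of an API that delivers both data and metadata more concrete, here is a minimal sketch of a data endpoint written as a plain WSGI application using only the Python standard library. The KPI name, its values, the metadata fields, and the /kpi/ path are all invented for illustration and do not describe any real service.

```python
import json
from typing import Any, Callable, Dict, List, Optional

# Hypothetical in-memory store standing in for a governed data product.
KPI_STORE: Dict[str, Dict[str, Any]] = {
    "monthly_active_users": {
        "metadata": {"unit": "users", "owner": "growth-team", "grain": "month"},
        "data": [
            {"period": "2021-11", "value": 1200},  # illustrative values only
            {"period": "2021-12", "value": 1350},
        ],
    }
}


def kpi_payload(name: str) -> Optional[Dict[str, Any]]:
    """Return the data and metadata for a KPI, or None if it is unknown."""
    return KPI_STORE.get(name)


def app(environ: Dict[str, Any], start_response: Callable) -> List[bytes]:
    """WSGI entry point: GET /kpi/<name> returns the KPI as JSON."""
    name = environ.get("PATH_INFO", "").rsplit("/", 1)[-1]
    payload = kpi_payload(name)
    if payload is None:
        status = "404 Not Found"
        body = json.dumps({"error": f"unknown KPI: {name}"}).encode("utf-8")
    else:
        status = "200 OK"
        body = json.dumps(payload).encode("utf-8")
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]
```

For local experimentation, the app could be served with wsgiref.simple_server.make_server from the standard library. Returning metadata alongside the values lets a low-code tool label and validate a KPI without hard-coding knowledge about it.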
Best Practice: Create an actionable strategy for data-as-a-service
Data teams must view the entirety of their analytics and machine learning workloads to best understand how to deliver data with requisite quality, high performance, low latency, and high user concurrency. These are the essential elements of delivering DaaS for the new analytics and AI ecosystems.
Luke Han is co-founder and CEO at Kyligence as well as the co-creator and Project Management Committee chair of the Apache Kylin project. As head of Kyligence, he has been working to grow the Apache Kylin community, expand its adoption, and build out a commercial software ecosystem through the Kyligence Cloud product. Prior to Kyligence, Han was the big data product lead at eBay and chief consultant at Actuate China. You can contact the author on Twitter or LinkedIn.