Trends in Data Management for 2019
TDWI analyst Philip Russom looks back at this year’s most important events and highlights five trends that will likely affect how you manage data in the coming year.
- By Philip Russom
- December 21, 2018
The leading trends in data management involve substantive changes that take years to play out. Some are moving faster than others, and some move in fits and starts. For example, the assimilation of Hadoop and data lakes shot forward, only to stall once users discovered platform and design problems that will require backtracking to fix. The slowest but most broadly influential trend is toward using clouds as the preferred compute platform -- for everything!
Other trends are more localized, such as the evolution of data semantics beyond metadata toward data cataloging. As another example: consider big data, which recently achieved mainstream status and enterprise assimilation in many organizations, only to be churned up anew by the onslaught of new big data coming from the Internet of Things (IoT).
Let’s look at these and other trends.
2018 Top Events
Hadoop Hype Halted
One of the biggest recent surprises is how quickly the air deflated from the hype around Hadoop. Hadoop came out of nowhere around 2013 and was in production in roughly one fifth of data warehouse environments by 2016. However, this brisk adoption started slowing in 2017 -- and almost halted in 2018 -- once Hadoop had saturated the kind of progressive warehouse programs that would need it. At the same time, users grew dissatisfied with Hadoop’s immaturity in relational functions, metadata management, and platform maintenance, as well as the high costs of large on-premises clusters. Hadoop is here to stay, but it will need extensive modernization to address these problems.
The Sudden Adoption of Data Lakes
Similar to Hadoop’s sudden adoption, data lakes appeared around 2016 and were adopted briskly. Note that Hadoop is a data platform, whereas a data lake is a design pattern and method for managing data on big data platforms (such as Hadoop). In fact, many Hadoop users retrofitted data lake methods onto their Hadoop implementations to better manage Hadoop data and to get more business value from it. Today most data lakes are on Hadoop. However, as with Hadoop, data lake owners feel a growing need for relational techniques with big data -- hence the trend toward the relational data lake, which is deployed on a relational database.
Modern Metadata Management Improved
Metadata management continues to be a powerful enabler for mission-critical, data-driven business activities, from operations to analytics. To keep pace with changing business requirements and to leverage new technologies, modern metadata management tools now include better automation (based on machine learning) and smarter data scanning (to deduce data structures and source-to-target mappings automatically). Likewise, these tools support new cloud-based applications, data platforms, and virtualization techniques across multiple clouds, on-premises systems, and software-as-a-service applications. Metadata is still the most common approach to data semantics, but it will soon be joined by a new approach: the data catalog.
Big Data Goes Mainstream
When the term “big data” originated in the 1990s, it usually referred to the mass of data pouring in and out of websites. The data wasn’t just big; it was also new in terms of its data structures (or lack thereof), low latency, and the innovative business practices it could enable, from ecommerce to online customer analytics. The size, traits, and opportunities of big data led to a new generation of big data platforms, tools, analytics, and business practices. It took years to figure all that out, but now big data, its platforms, and the valuable practices we learned from it are commonplace. In other words, big data has gone mainstream. In fact, many organizations just call it “data” and freely integrate it with other information assets. It’s not over, though. The next wave of big data is coming from IoT, and it’s just as new and hard to figure out as the first wave.
IT Becomes Cloudier
The most inevitable trend in all of IT -- not just data management -- is the migration of applications and data to clouds. TDWI anticipates a day several years from now when more applications will run on cloud platforms than on premises. Anecdotal evidence suggests a recent upswing in the acquisition of SaaS applications, each of which typically displaces an on-premises application, thereby migrating many users and their data to a cloud. Plus, many organizations have a cloud-first mandate, which leads to more SaaS apps. With more SaaS apps and other cloud-based systems in the enterprise mix, managing data has become an inherently hybrid affair that reaches across multiple clouds and on-premises systems.
2019 Anticipated Hot Spots
Hadoop Modernization
To get more value from Hadoop in data warehouse and analytics use cases, many users are modernizing certain aspects of their implementations. For example, to lower the cost and simplify the maintenance of Hadoop clusters on premises, some organizations are migrating to Hadoop in the cloud. Similarly, some will continue to use Hadoop tools from the Apache ecosystem but will swap out the Hadoop Distributed File System for some form of cloud-based object storage. A popular modernization among all Hadoop users (whether for analytics use cases or not) is to migrate from MapReduce to Spark for better resource management, microbatch processing, and low-latency performance. Finally, users are carefully selecting tools -- from software vendors or the open source community -- that correct Hadoop’s limitations with relational functions and latency for data access and processing.
The Data Lake Remix
To give data the best possible home, the trend with data lakes (as with data warehouses) is toward a multiplatform architecture. That’s where the lake is physically distributed across Hadoop clusters and relational databases, which may be deployed on premises, on cloud platforms, or in a hybrid of the two. In that spirit, data lake owners are revisiting their choice of data platforms as they move deeper into the so-called hybrid data lake. To make disparate and diverse data seem simpler than it is, virtualization techniques are handy (even required), which is why a hybrid data lake is also a virtual data lake, akin to the virtual data warehouse.
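As a rough illustration of the virtual data lake idea, the sketch below models a thin query layer that routes each request to whichever physical platform holds the dataset. All class, dataset, and field names here are invented for the example; a real deployment would use a data virtualization or federation tool rather than hand-written routing.

```python
# Minimal sketch of a virtual (hybrid) data lake: one query interface
# over multiple physical platforms. Names are hypothetical stand-ins.

class HadoopStore:
    """Stands in for a Hadoop cluster holding raw big data."""
    def __init__(self):
        self.datasets = {"clickstream_raw": [{"user": "u1", "page": "/home"}]}

    def query(self, name):
        return self.datasets[name]

class RelationalStore:
    """Stands in for a cloud relational database holding curated data."""
    def __init__(self):
        self.datasets = {"customers": [{"id": 1, "name": "Acme Corp"}]}

    def query(self, name):
        return self.datasets[name]

class VirtualDataLake:
    """Routes each query to the platform that physically holds the dataset."""
    def __init__(self, *stores):
        self.routing = {}
        for store in stores:
            for name in store.datasets:
                self.routing[name] = store

    def query(self, name):
        return self.routing[name].query(name)

lake = VirtualDataLake(HadoopStore(), RelationalStore())
print(lake.query("customers"))        # served by the relational platform
print(lake.query("clickstream_raw"))  # served by the Hadoop platform
```

The point of the design is that analysts see one lake, while the routing table hides which data lives on Hadoop, a relational database, on premises, or in a cloud.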
The Data Catalog
Usually when we say “metadata” we mean the technical metadata that highly technical tools use to interface with many types of applications and data platforms. Technical metadata describes data structures, data types, and protocol parameters. As a complement, data cataloging usually describes the traits of data, such as the data domain, quality, lineage, and profile statistics of a specific dataset. The traits can be subjective, as when users tag data to score its trust level, compliance sensitivity, or usability. This way, analysts and similar users can focus their data searches and queries by data domain (e.g., only look at customer data), compliance risk (to avoid data about EU residents), and trustworthiness or usefulness (as rated by other users). In many ways, the future of data semantics beyond technical metadata is the modern data catalog.
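To make the catalog-driven search described above concrete, here is a minimal sketch. The catalog entries, field names, and trust scores are all hypothetical examples invented for illustration, not the schema of any particular catalog product.

```python
# Minimal sketch of a data catalog: each entry records business-facing
# traits of a dataset, including subjective user ratings. All entries
# and field names are hypothetical.

catalog = [
    {"name": "crm_contacts", "domain": "customer",
     "contains_eu_residents": True, "trust_score": 4.5},
    {"name": "web_orders", "domain": "customer",
     "contains_eu_residents": False, "trust_score": 4.1},
    {"name": "supplier_master", "domain": "supplier",
     "contains_eu_residents": False, "trust_score": 3.2},
]

def search(catalog, domain=None, exclude_eu=False, min_trust=0.0):
    """Focus a data search by domain, compliance risk, and trust rating."""
    return [entry["name"] for entry in catalog
            if (domain is None or entry["domain"] == domain)
            and not (exclude_eu and entry["contains_eu_residents"])
            and entry["trust_score"] >= min_trust]

# Only customer data, avoiding datasets with EU-resident records:
print(search(catalog, domain="customer", exclude_eu=True))  # ['web_orders']
```

This is the kind of filtering an analyst performs through a catalog's search interface rather than in code, but the traits involved are the same: domain, compliance sensitivity, and user-rated trust.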
IoT as the Next Wave of Big Data
IoT is a computing paradigm in which a widening range of physical devices -- including smartphones, vehicles, shipping pallets, kitchen appliances, manufacturing robots, and anything fitted with a sensor -- can transmit data about their location, state, activity, and surroundings. The rise of the IoT paradigm is profound because it enables an organization to monitor, measure, and adjust a lengthening list of processes, customers, partners, facilities, and so on with unprecedented levels of detail, accuracy, and speed. Achieving these levels, however, requires a considerable investment in information technology, with a focus on fully modern data management infrastructure. The challenge is to create solutions that can scale to the big data nature of IoT while capturing, processing, and responding to data in real time or close to it.
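As a small sketch of what "responding to data in real time or close to it" can look like, the following computes a sliding-window average per sensor and flags out-of-range values as readings arrive. The device IDs, threshold, and window size are invented for the example; a production system would run this logic in a streaming engine rather than an in-process loop.

```python
# Minimal sketch of near-real-time IoT processing: a per-device
# sliding-window average with an alert threshold. All parameters
# and device names are hypothetical.
from collections import deque

WINDOW = 3          # readings per device kept in the window
THRESHOLD = 75.0    # alert when the windowed average exceeds this

windows = {}        # device id -> recent readings

def ingest(device_id, value):
    """Process one reading; return the windowed average and an alert flag."""
    w = windows.setdefault(device_id, deque(maxlen=WINDOW))
    w.append(value)
    avg = sum(w) / len(w)
    return avg, avg > THRESHOLD

# Simulated stream of sensor readings:
stream = [("sensor-1", 70.0), ("sensor-1", 74.0), ("sensor-1", 84.0)]
for device, reading in stream:
    avg, alert = ingest(device, reading)
    print(device, round(avg, 1), "ALERT" if alert else "ok")
```

The pattern matters because IoT data arrives continuously and at scale: the decision (alert or not) is made per event, on a bounded window, without waiting for a batch load.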
Hybrid Data Management
Among TDWI members, we see more organizations migrating their data warehouses to various cloud-based systems. Sometimes they migrate to the same database vendor brand they’ve always used -- but it’s now on a cloud. Other times, user organizations select one of the new databases purpose-built for cloud data warehousing. Similarly, some Hadoop or data lake users are migrating their on-premises clusters to Hadoop on a cloud. Warehouses aside, TDWI also sees users depending more on cloud-based tools for data integration, data quality, reporting, and especially advanced analytics. Obviously, on-premises systems are not going away any time soon. Many enterprises keep their core report-oriented data warehouses on premises while migrating data staging, special data for analytics, and other pieces to a cloud. Put it all together, and the future of data management promises to be hybrid with the cloud side growing aggressively.