Data Management: 2016’s Hot Trends and What to Watch in 2017
The leading 2016 trends included Hadoop adoption, data lakes, and data warehouse modernization. In 2017 we'll see new activity around the SQL-ization of Hadoop, orchestrated data hubs, and managing IoT sensor data.
- By Philip Russom
- December 16, 2016
The leading trends in enterprise data management in 2016 carried over from recent years: Hadoop adoption and data warehouse modernization. However, the real surprise in 2016 was the data lake, which user organizations suddenly began taking seriously as the preferred design pattern for organizing data sets in Hadoop.
All these will continue into 2017 and be joined by new activity around the SQL-ization of Hadoop, orchestrated data hubs, and managing sensor data from the industrial Internet of Things (IoT). Let's look at each of these in detail.
2016 Top Trends
Increased Adoption of Hadoop
TDWI surveys in recent years have shown that Hadoop is making steady progress as a platform well suited to many purposes in data warehousing and analytics. Many early adopters have already integrated Hadoop clusters and tools into the architectures of their data warehouse environments.
Hadoop's massive, cheap storage offloads older systems by taking responsibility for data staging, ELT push down, and archiving of detailed source data (typically in the data lake design pattern). Hadoop also serves as a massively parallel execution engine for a wide variety of set-based and algorithmic analytics methods. These valuable use cases are driving the adoption of Hadoop.
TDWI has seen a giant step forward in adoption starting in late 2015 and continuing into 2016. The survey from TDWI Best Practices Report: Data Warehouse Modernization shows that 17 percent of data warehouse programs surveyed already have Hadoop in production in their data warehouse environment. This is up from earlier surveys, which showed 10 to 12 percent.
Even more dramatic, the survey shows that the percentage of organizations integrating Hadoop with a data warehouse will more than double within three years (up to 36 percent). In short, Hadoop is here to stay and will soon become common in data warehouse programs.
Hadoop-based Data Lakes
First and foremost, a data lake is a repository for raw data. A data lake tends to manage highly diverse data types and can scale to handle tens or hundreds of terabytes -- sometimes petabytes. It is optimized to ingest raw data quickly as received from both new and traditional sources.
The point of managing data in its original raw state is that its details can be repurposed repeatedly as new business requirements and opportunities for new applications arise. After all, once data is remodeled, standardized, and otherwise transformed (as is required for report-oriented data warehousing), its applicability to other, unforeseen use cases is greatly narrowed.
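To make the repurposing idea concrete, here is a minimal sketch in Python of applying a schema at read time rather than at load time. It is an illustration only, not a Hadoop implementation: the sample records, the field names, and the `read_with_schema` helper are all hypothetical.

```python
import json

# Hypothetical raw events, kept exactly as received (one JSON document per line).
RAW_EVENTS = [
    '{"device": "pump-1", "ts": "2016-12-01T00:00:00", "temp_c": 71.5, "note": "ok"}',
    '{"device": "pump-2", "ts": "2016-12-01T00:00:00", "temp_c": 88.0, "vibration": 0.4}',
    '{"device": "pump-1", "ts": "2016-12-01T00:05:00", "temp_c": 73.0}',
]


def read_with_schema(raw_lines, fields):
    """Project each raw record onto the requested fields at read time.

    Fields missing from a record come back as None; the raw lines are never modified.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


# Use case 1: a temperature report needs only the device and its temperature.
temperature_view = list(read_with_schema(RAW_EVENTS, ["device", "temp_c"]))

# Use case 2: a later vibration analysis reads the same raw data with a different schema.
vibration_view = [
    row
    for row in read_with_schema(RAW_EVENTS, ["device", "vibration"])
    if row["vibration"] is not None
]
```

Because the raw lines are stored untouched, the second use case needs no reload or remodeling. Had the data been reduced to device and temperature on the way in, the vibration detail would already be lost.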
With that in mind, you can see that analytics is the primary driver behind data lakes. For example, certain forms of advanced analytics work best with data in its original state with all its original details. These include analytics based on mining, statistics, predictive algorithms, and natural language processing.
Hadoop has become an important enabling platform for data lakes because it scales linearly, supports a wide range of processing techniques, and costs a fraction of similar relational configurations. For these reasons, Hadoop is now the preferred data platform for data lakes.
Data Warehouse Modernization
This was the hottest area for data management in 2015. Most approaches to data warehouse modernization are large, multiphase projects that take months or years to complete, so the heat has continued through 2016 and will linger into 2017.
The main drivers for this trend are to extend the warehouse (sometimes with a complementary Hadoop cluster) to accommodate big data and other new data (especially sensor data), to update the warehouse architecture, to add more real-time functionality, to enable logical data warehousing, and to modernize related systems (such as those for reporting, analytics, and data integration).
2017 Anticipated Hot Spots
The SQL-ization of Hadoop
Hadoop was originally designed for Internet environments that had no relational requirements. As we increasingly employ Hadoop in mainstream use cases, however, relational requirements are becoming pressing, especially the need for ANSI-standard SQL. In fact, SQL support for Hadoop is a "must have" for emerging practices that involve Hadoop data, such as data exploration, data prep, and SQL-based analytics.
A number of open source tools -- including Impala, Drill, Presto, and Spark -- seek to add ANSI SQL to Hadoop, and a few mature vendor tools (for reporting, analytics, and data integration) have been updated to do the same. It's still early days with SQL engines for Hadoop, however, so we're waiting for more functionality, performance, and interoperability. In 2017 we will witness improvements in these areas for both open source and vendor-built SQL for Hadoop capabilities.
Orchestrated Data Hubs
Instead of a Hadoop-based data lake, some organizations prefer to build a large relational data hub to achieve similar goals -- namely to provide a governable home for big data, analytics sandboxes, and collaborative data practices. Ambitious organizations are building data hubs that mix relational and Hadoop technologies such that it is hard to tell a data hub from a data lake.
Even so, here's a key differentiator: a true data hub is more than just another database. It also has significant tooling around it for data orchestration, publish and subscribe, security, auditing, and data integration and quality. The uptick in data hub deployments that started in 2016 will continue to grow in 2017.
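As a rough illustration of the publish-and-subscribe and auditing functions mentioned above, the following Python sketch models a hub as a router that delivers each published record to the subscribers of a topic and logs every delivery. It is a toy in-memory model under stated assumptions, not a description of any vendor's product: the `DataHub` class, the topic names, and the sample records are hypothetical.

```python
class DataHub:
    """Hypothetical in-memory hub: routes records to topic subscribers and keeps an audit log."""

    def __init__(self):
        self._subscribers = {}  # topic name -> list of handler callables
        self.audit_log = []     # (topic, number of subscribers notified) per publish

    def subscribe(self, topic, handler):
        """Register a callable that receives every record published to the topic."""
        self._subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic, record):
        """Deliver the record to all current subscribers and log the delivery."""
        handlers = self._subscribers.get(topic, [])
        self.audit_log.append((topic, len(handlers)))
        for handler in handlers:
            handler(record)


hub = DataHub()
analytics_inbox = []
quality_inbox = []

# Two independent consumers subscribe to the same topic.
hub.subscribe("orders", analytics_inbox.append)
hub.subscribe("orders", quality_inbox.append)

hub.publish("orders", {"order_id": 1, "amount": 25.0})
hub.publish("returns", {"order_id": 1})  # no subscribers yet, but still audited
```

The producer never addresses a consumer directly, which is what separates a hub from a plain shared database. New consumers can subscribe without changes on the publishing side, and the audit log records what was distributed.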
Sensors in the Industrial Internet of Things (IoT)
TDWI sees IoT as emerging from its "hype cycle" earlier than anticipated. In particular, the industrial side of IoT (but not the consumer side) is ramping up aggressively, driven by an explosion of enterprise sensors in 2015 and 2016.
For example, utility companies (and other firms that monitor facilities closely) have long had many sensors; these firms have recently quadrupled their sensors so they can track processes in a more granular fashion and turn more manual tasks into digital ones. Manufacturing is a similar case; it has had robots for decades, but now the robots have more sensors for finer control -- now they can perform quality assurance, not just assembly.
As another example, multiple truck and rail freight companies have spoken at TDWI conferences about how sensor data (from vehicles and shipping containers) helps them make logistical operations and routes more efficient, keep customers happy with fast, auditable service, and reduce insurance rates by proving that vehicles and cargos are handled legally and safely.
Philip Russom is director of TDWI Research for data management and oversees many of TDWI’s research-oriented publications, services, and events. He is a well-known figure in data warehousing and business intelligence, having published over 600 research reports, magazine articles, opinion columns, speeches, Webinars, and more. Before joining TDWI in 2005, Russom was an industry analyst covering BI at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and BI consultant and was a contributing editor with leading IT magazines. Before that, Russom worked in technical and marketing positions for various database vendors. You can reach him at firstname.lastname@example.org, @prussom on Twitter, and on LinkedIn at linkedin.com/in/philiprussom.