
What's Ahead for Data Teams in 2021

Apache Iceberg dead ahead. What to look for -- and look out for -- this year.

Unprecedented market conditions have emphasized the importance of implementing modern data architectures that both accelerate analytics and keep costs under control. In 2020 this led to the cloud becoming the cornerstone of data innovation. In 2021 three major trends will emerge that will truly leverage the benefits of the cloud to make modern cloud data lakes the center of gravity for data architectures.

Trend #1: Separation of compute and data becomes the new architectural paradigm

The concept of separating compute and storage to easily and economically scale storage capacity independently of the compute resources needed for analytics has been around for several years. However, it wasn't until the widespread migration to the public cloud that the concept became a reality.

In the cloud, the separation of compute and storage provides efficiencies that are not possible on premises, including:

  • Raw storage is so inexpensive and accessible that data teams can easily and economically scale storage to match rapidly growing data volumes
  • Compute capacity is available on demand, so organizations pay only for what their workloads need
  • Compute clusters are isolated from one another, so different workloads don't impact each other

In the coming year, however, another paradigm for fully leveraging cloud infrastructure resources will emerge -- one that puts data at the center of the architecture: the separation of compute and data.

Cloud object storage -- such as Amazon S3 and Microsoft Azure Data Lake Storage (ADLS) -- has increasingly become the default bit bucket in the cloud. New open source projects such as Apache Iceberg (created by Netflix, Apple, and other tech companies) and Project Nessie make it possible for a variety of systems to select, insert, update, and delete records in S3 and ADLS as if they were infinitely scalable databases. The tables can be processed and queried directly by decoupled and elastic compute engines such as Apache Spark (batch), Dremio (SQL), and Apache Kafka (streaming). As a result, data essentially becomes its own tier, enabling us to think about data architectures in a completely different way.
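
To make the idea concrete, here is a minimal sketch of Spark acting as one decoupled engine over an Iceberg table stored in S3. It assumes the Iceberg Spark runtime is available to the session; the bucket, catalog, and table names (lake, db.events) are hypothetical.

    from pyspark.sql import SparkSession

    # Sketch only: assumes the Apache Iceberg Spark runtime jar is on the
    # classpath and that the session has credentials for the (hypothetical)
    # S3 bucket below.
    spark = (
        SparkSession.builder
        .appName("iceberg-data-tier-sketch")
        # Register an Iceberg catalog whose tables live directly in object storage
        .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.lake.type", "hadoop")
        .config("spark.sql.catalog.lake.warehouse", "s3a://example-bucket/warehouse")
        .getOrCreate()
    )

    # Create a table in S3 that any Iceberg-aware engine (Spark, Dremio, etc.)
    # can read and write
    spark.sql("""
        CREATE TABLE IF NOT EXISTS lake.db.events (
            event_id BIGINT,
            user_id  BIGINT,
            ts       TIMESTAMP
        ) USING iceberg
    """)

    # Insert and query records as if the object store were a database
    spark.sql("INSERT INTO lake.db.events VALUES (1, 100, TIMESTAMP '2021-01-01 00:00:00')")
    spark.sql("SELECT count(*) FROM lake.db.events").show()

Any other Iceberg-aware engine pointed at the same warehouse location sees the same table, which is what makes the data itself -- rather than any single engine -- the shared tier.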

Trend #2: Hidden costs associated with cloud data warehouses decrease their appeal

Cloud data warehouse vendors leverage the separation of storage from compute to provide offerings with improved scalability and a lower initial cost than traditional data warehouses. However, to analyze the data, it must first be loaded into the data warehouse, and it can only be accessed through the data warehouse -- the data itself isn't decoupled from compute. This means organizations must pay the data warehouse vendor to get data both into and out of the system, so although upfront expenses for a cloud data warehouse may be lower, the costs at the end of the year are often significantly higher than expected.

In the meantime, with its low-cost cloud object storage, the cloud data lake is increasingly becoming the center of gravity for many organizations' data architectures. Although both traditional SQL query engines and data warehouses provide mechanisms to query data in the data lake directly, the performance isn't sufficient to meet the needs of analytics teams. Therefore, data teams still need to copy and move data from their data lake to their data warehouse and incur the associated data ingest cost.

However, by leveraging open source table formats such as Iceberg, along with projects such as Nessie and modern cloud data lake engines, data teams can implement a data architecture that enables data consumers to query and manipulate data in the data lake directly without sacrificing performance. The result is a dramatic reduction in complexity as well as in the costs associated with data copies and with ingesting data into the data warehouse.

Trend #3: Cloud data lake capabilities will exceed those of the data warehouse

Data warehouses offer several key capabilities for analytics workloads beyond data queries, including data mutations, transactions, and time travel (access to historical data from a point in time even if it has since been changed or deleted). These capabilities are delivered through proprietary, vertically integrated systems that require all access to go through and be processed by the database. This single-system approach simplifies concurrency management and updates, but it also increases cost and limits flexibility.

Apache Iceberg, a new open source table format, addresses these challenges and is rapidly becoming an industry standard for managing data in data lakes. Iceberg introduces capabilities that enable multiple engines to work together on the same data in a transactionally consistent manner, and it tracks additional information about the state of datasets as they evolve over time. With Iceberg, data lake tables are no longer limited to SELECT queries; they can now support record-level mutations (insert, update, delete), time travel, and transactions.
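
As a rough illustration -- reusing the hypothetical lake catalog and events table from the earlier sketch, and assuming Spark 3 with the Iceberg SQL extensions enabled -- these capabilities look like ordinary SQL and DataFrame calls against the lake:

    # Record-level mutations directly against the Iceberg table in object
    # storage (requires the Iceberg Spark SQL extensions to be enabled)
    spark.sql("UPDATE lake.db.events SET user_id = 101 WHERE event_id = 1")
    spark.sql("DELETE FROM lake.db.events WHERE user_id IS NULL")

    # Time travel: read the table as it existed at an earlier snapshot
    # (the snapshot ID below is hypothetical; real IDs are recorded in the
    # table's metadata)
    old_version = (
        spark.read
        .option("snapshot-id", 8744736658442914487)
        .table("lake.db.events")
    )
    old_version.show()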

Another new open source project, Project Nessie, builds on the capabilities of table formats such as Iceberg and Delta Lake by providing Git-like semantics for data lakes. With Nessie, users can take advantage of branches to experiment with or prepare data without impacting the live view of the data. For the first time, loosely coupled transactions become a reality, enabling a single transaction to span operations from multiple users and engines (Spark, Dremio, Hive, etc.). In addition, the ability to query data from consistent points in time, and across different points in time, makes it easier to reproduce results, understand changes, and support compliance requirements.
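
A sketch of how this might look from Spark, assuming a reachable Nessie server and the Nessie Spark SQL extensions on the classpath; the server URI, branch name, and table names are hypothetical:

    from pyspark.sql import SparkSession

    # Sketch only: assumes the Iceberg and Nessie runtime jars plus the Nessie
    # Spark SQL extensions are configured for this session.
    spark = (
        SparkSession.builder
        .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
        .config("spark.sql.catalog.nessie.uri", "http://nessie.example.com:19120/api/v1")
        .config("spark.sql.catalog.nessie.warehouse", "s3a://example-bucket/warehouse")
        .config("spark.sql.catalog.nessie.ref", "main")  # branch this session starts on
        .getOrCreate()
    )

    # Create an experimental branch, prepare data on it without touching the
    # live view, then merge it back so the changes land on main atomically
    spark.sql("CREATE BRANCH IF NOT EXISTS etl_experiment IN nessie")
    spark.sql("USE REFERENCE etl_experiment IN nessie")
    spark.sql("""
        CREATE TABLE IF NOT EXISTS nessie.db.events (
            event_id BIGINT,
            user_id  BIGINT,
            ts       TIMESTAMP
        ) USING iceberg
    """)
    spark.sql("INSERT INTO nessie.db.events VALUES (2, 200, TIMESTAMP '2021-01-02 00:00:00')")
    spark.sql("MERGE BRANCH etl_experiment INTO main IN nessie")

Until the final merge, readers on the main branch never see the in-progress data, which is what makes branch-based experimentation and multi-step loads safe.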

About the Author

Tomer Shiran is co-founder and CEO of Dremio. Previously he headed the product management team at MapR and was responsible for product strategy, road maps, and requirements. Prior to MapR, Shiran held numerous product management and engineering roles at IBM and Microsoft. He holds an MS in Computer Engineering from Carnegie Mellon University and a BS in Computer Science from Technion - Israel Institute of Technology and is the author of five U.S. patents.

