What's Ahead for Data in 2019
These three data-related trends are worth watching this year.
- By Tomer Shiran
- January 4, 2019
Companies love the "as-a-service" model for all layers of the technology stack, from infrastructure provided by cloud vendors to full SaaS applications. When it comes to data, however, companies still operate under IT-owned, IT-controlled models, where users of data wait in line for their turn.
This year we will see the ongoing adoption of open source technologies, methodologies, and cloud services that move companies closer to an "as-a-service" model for their data, making their data scientists, data consumers, and data engineers more productive than ever.
Trend #1: The Rise of Apache Arrow and Arrow Flight
Over the past three years, a new standard for in-memory analytics has emerged called Apache Arrow. Arrow is not an application or runtime process. Instead, it is an open source project that specifies a columnar, in-memory format for processing data, along with software libraries that perform low-level operations on that format.
Today, Arrow is used in many types of software applications, including SQL engines (such as Dremio's Sabot), data frames (for example, Python pandas), distributed processing (e.g., Spark), databases (such as InfluxDB), machine learning environments (RAPIDS, for example), and several visualization systems. Adoption of Arrow has increased dramatically in the past six months, with over 1 million downloads a month in the Python community alone.
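To make the columnar format concrete, here is a minimal sketch using PyArrow, the Arrow project's Python library; the sample data and column names are invented for illustration.

```python
# A minimal sketch of the Arrow columnar format via PyArrow.
# The DataFrame contents are invented for illustration.
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"city": ["Oslo", "Lima"], "temp_c": [4.5, 22.1]})

# Each column becomes a contiguous Arrow array; the table is just
# the schema plus those columnar buffers.
table = pa.Table.from_pandas(df)
print(table.schema)
print(table.column("temp_c"))
```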
The reason for this adoption is clear: the developers of analytics applications want to maximize the efficiency of their systems to improve the user experience and to lower the costs of operating these systems in cloud runtime environments. It is not uncommon for developers to see speed and efficiency improvements on the order of 100x by moving to an Arrow-based architecture.
In 2019, adoption of Arrow will continue to spread across more software applications, including machine learning, data science, statistical packages, and business intelligence. Part of the drive comes from the benefits of speed and efficiency, but adoption is also driven by the ability of systems that implement Arrow to exchange data essentially for free. When two systems both implement Arrow, data can be exchanged without serializing and de-serializing it and without making unnecessary copies, freeing CPU, GPU, and memory resources for more important work.
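One way to see the "exchange data essentially for free" point: Arrow's IPC format lets one process write a table and another memory-map the same buffers without deserializing them. A minimal sketch using a recent PyArrow release, with an arbitrary file name:

```python
import pyarrow as pa

table = pa.table({"id": [1, 2, 3], "amount": [10.0, 20.0, 5.5]})

# Write the table in Arrow's IPC file format.
with pa.OSFile("shared.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Another Arrow-aware process can memory-map the file and read the
# column buffers in place -- no serialization, no extra copies.
with pa.memory_map("shared.arrow", "rb") as source:
    shared = pa.ipc.open_file(source).read_all()
print(shared.column("amount"))
```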
This brings us to Arrow Flight, a new way for applications to interact with Arrow. You can think of Flight as an alternative to ODBC/JDBC for in-memory analytics. Now that we have an established way of representing data in memory, Flight defines a standardized way to exchange that data between systems.
For example, for client applications interacting with Dremio (the company I co-founded), today we serialize the data into a common structure. When Tableau queries Dremio via ODBC, we process the query and stream the results as Arrow buffers all the way to the ODBC client before serializing to the cell-based protocol that ODBC expects. Once Arrow Flight is generally available, applications that implement Arrow will be able to consume the Arrow buffers directly. In our internal tests, we observe 10x-100x efficiency improvements with this approach compared to ODBC/JDBC interfaces.
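Arrow Flight was not yet generally available at the time of writing, but the sketch below shows the shape of the interface as it later appears in PyArrow's Flight bindings; the port, ticket, and dataset here are hypothetical placeholders.

```python
import threading

import pyarrow as pa
import pyarrow.flight as flight

class DemoFlightServer(flight.FlightServerBase):
    def do_get(self, context, ticket):
        # Results stream back as Arrow record batches -- no
        # row-by-row serialization as with ODBC/JDBC.
        table = pa.table({"id": [1, 2, 3], "amount": [10.0, 20.0, 5.5]})
        return flight.RecordBatchStream(table)

server = DemoFlightServer("grpc://0.0.0.0:8815")
threading.Thread(target=server.serve, daemon=True).start()

# A client consumes the Arrow buffers directly.
client = flight.FlightClient("grpc://localhost:8815")
reader = client.do_get(flight.Ticket(b"demo"))
print(reader.read_all())
```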
Trend #2: Data-as-a-Service
We are now more than 10 years into the AWS era, which began with on-demand infrastructure billed by the hour. The "as-a-service" model has since moved up through the entire stack to include full applications and every building block in between. Now companies want the same kind of on-demand experience for their data: provisioned for the specific needs of an individual user, instantly, with great performance, ease of use, and compatibility with their favorite tools, and without waiting months for IT.
Data-as-a-service comprises several distinct capabilities:
- Data catalog: A comprehensive inventory of data assets that makes it easy for data consumers both to find data across different systems and sources and to describe data in ways that are meaningful to the business.
- Data curation: Tools to filter, blend, and transform data for a specific job. Reusable datasets can be added to the data catalog for discovery by other users. Some deployments may implement data curation in a virtual context, to minimize data copies.
- Data lineage: The ability to track the provenance and lineage of datasets as they are accessed from different systems and as new datasets are created.
- Data acceleration: Fast, interactive access to large datasets. Data consumers need to work at the speed of thought; if queries take minutes to process, users cannot perform their jobs effectively.
- Data virtualization: Enterprise data exists in many different systems, including data warehouses, data lakes, and operational systems. Data-as-a-service provides a uniform way to access data in situ without copying all data into a new silo.
- SQL execution: SQL remains the de facto standard for data analytics. Every BI tool and every data science platform supports SQL as the primary means of accessing data from different sources. Data-as-a-service provides SQL as the interface for these tools and systems, as sketched in the example after this list.
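As an illustration of SQL as the uniform interface, here is a minimal sketch of a client querying a data-as-a-service layer through a generic ODBC connection via the pyodbc library; the DSN, schema, and dataset names are hypothetical.

```python
import pyodbc  # assumes an ODBC driver and DSN for the service are configured

conn = pyodbc.connect("DSN=data_service", autocommit=True)
cursor = conn.cursor()

# The same virtual dataset might be backed by a warehouse, a data
# lake, or an operational system; the client only sees SQL.
cursor.execute("""
    SELECT region, SUM(revenue) AS total_revenue
    FROM curated.sales
    GROUP BY region
""")
for region, total in cursor.fetchall():
    print(region, total)
```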
Companies are now building data-as-a-service by combining these functional abilities to improve the productivity of their data consumers. Using open source projects, open standards, and cloud services, companies will deliver their first iterations of data-as-a-service to data consumers across critical lines of business.
Trend #3: Cloud Data Lakes
As companies re-platform to cloud services from AWS, Azure, and Google, data analytics tends to be the most challenging transition. Each vendor provides an alternative for the data warehouse and data marts: Redshift on AWS, SQL Data Warehouse on Azure, and BigQuery on Google. There are also independent offerings such as Snowflake that support multiple cloud platforms.
In addition to the data warehouse, companies have options for their data science workloads, including native Spark offerings on each of the cloud vendors, as well as a range of data science platforms from different vendors such as Databricks.
The cloud data lake will emerge as the common platform that underlies the cloud data warehouse and cloud data science environments. As companies move their analytics workloads to the cloud, the cloud data lake is where:
- Data first lands in its raw form, from legacy applications as well as streaming data
- Data is transformed, enriched, and blended for different needs
- Data is served for data science use cases
- Data is loaded into cloud data warehouses
Companies are building cloud data lakes with a combination of technologies:
- Storage: S3 on AWS, ADLS on Azure, and Google Cloud Storage
- Data processing: a number of options, including Spark, Hive, AWS Glue, Azure Data Factory, and Google Cloud Dataflow

Other functionality will continue to emerge, such as tighter integration with streaming platforms such as Kafka, as well as data catalogs and data prep tools. Even in its most basic form, the cloud data lake will become a foundational system for companies moving to the cloud.
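As a small illustration of the data lake as a direct query target, here is a sketch that reads raw Parquet files from S3 using the dataset API from a later PyArrow release; the bucket and paths are hypothetical, and AWS credentials are assumed to be available in the environment.

```python
import pyarrow.dataset as ds

# Point directly at files landed in the lake in their raw form --
# no load into a warehouse required.
events = ds.dataset("s3://example-data-lake/raw/events/", format="parquet")

# Read only the columns a downstream job needs, as Arrow data.
table = events.to_table(columns=["user_id", "event_type"])
print(table.num_rows)
```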
About the Author
Tomer Shiran is co-founder and CEO of Dremio. Previously he headed the product management team at MapR and was responsible for product strategy, road maps, and requirements. Prior to MapR, Shiran held numerous product management and engineering roles at IBM and Microsoft. He holds an MS in Computer Engineering from Carnegie Mellon University and a BS in Computer Science from Technion - Israel Institute of Technology and is the author of five U.S. patents.