Ten Mistakes to Avoid When Querying Your Data Lake
TDWI Member Exclusive
August 14, 2020
A data lake is a centralized repository that can store all of an enterprise's structured and unstructured data at any scale. Modern cloud data lakes are implemented with a loosely coupled architecture and have numerous benefits over tightly coupled data warehouses. Data lakes are less expensive, highly scalable, and do not require a rigorous schema to be defined before data is loaded, which makes it easier for companies to offload raw data directly from various systems.
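To make the point about schemas concrete, here is a minimal sketch in Python of applying a schema at read time rather than at load time. The sample records, field names, and helper functions are hypothetical and chosen only for illustration; a real data lake engine would do the equivalent over files in cloud storage.

```python
import json

# Hypothetical raw events as they might land in lake storage.
# No schema is enforced on write: types and fields vary by record.
RAW_EVENTS = """\
{"user": "a", "amount": "19.99", "ts": "2020-08-01"}
{"user": "b", "amount": 5, "channel": "web"}
{"user": "a", "amount": "3.01"}
"""


def read_with_schema(lines, schema):
    """Apply a schema at read time.

    Keeps only the fields named in `schema`, casts each one with the
    given callable, and fills fields missing from a record with None.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield {
            name: (cast(record[name]) if name in record else None)
            for name, cast in schema.items()
        }


def total_by_user(rows):
    """Sum the `amount` field per `user` over already-typed rows."""
    totals = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals


# The schema lives with the query, not with the stored data.
SCHEMA = {"user": str, "amount": float}
totals = total_by_user(read_with_schema(RAW_EVENTS.splitlines(), SCHEMA))
print(totals)
```

The stored records are never rewritten. A different analysis could read the same raw data with a different schema, for example one that also selects the `channel` field.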
The historical challenges around querying data lakes for analytics are rooted in the complex data pipelines that must be put in place to gain insights and value from the data. Fortunately, modern approaches are available that are much simpler and more effective.
Next-generation data lake engines enable easy, fast, and secure provisioning of data sets to end users, along with swift processing and queries, all directly from cloud data lake storage. Modern data lake processing engines are often used for batch analytics and machine learning, whereas modern data lake query engines are used to analyze structured and semistructured data from business intelligence and data science tools. These engines challenge the long-held assumption that data must be moved out of the data lake and into a data warehouse via complex pipelines before analysts can access it.
This article explains ten mistakes to avoid when querying data lakes, focusing on best practices for keeping data in data lake storage and querying it directly, thereby raising productivity and efficiency while lowering costs and complexity.