The Data Lake Manifesto: 10 Best Practices
You need these best practices to define the data lake and its methods.
- By Philip Russom
- October 16, 2017
The data lake has come on strong in recent years as a modern design pattern that fits today's data and the way many users want to organize and use their data. For example, many users want to ingest data into the lake quickly so it's immediately available for operations and analytics. They want to store data in its original raw state so they can process it many different ways as their requirements for business analytics and operations evolve.
They need to capture -- in a single pool -- big data, unstructured data, and data from new sources such as the Internet of Things (IoT), social media, customer channels, and external sources such as partners and data aggregators. Furthermore, users are under pressure to develop business value and organizational advantage from all these data collections, often via discovery-oriented analytics.
A data lake, especially when deployed atop Hadoop, can assist with all of these trends and requirements -- if users can get past the lake's challenges. In particular, the data lake is still very new, so its best practices and design patterns are just now coalescing. Most data lakes are on Hadoop, which itself is immature; a data lake can bring much-needed methodology to Hadoop. To the uninitiated, data lakes appear to have no methods or rules, but that is not true. Best practices for the data lake do exist, and a lake built without them is likely to fail.
To help data management professionals and their business counterparts get past these challenges and get the most from data lakes, the remainder of this article explains "The Data Lake Manifesto," a list of the top 10 best practices for data lake design and use, each stated as an actionable recommendation.
The Data Lake Manifesto
1. Onboard and ingest data quickly with little or no up-front improvement.
One of the innovations of the data lake is early ingestion and late processing. This is similar to ELT (extract, load, transform), except that the transform step comes far later and is sometimes defined on the fly as data is read. Adopting the practice of early ingestion and late processing will allow integrated data to be available ASAP for operations, reporting, and analytics. This demands diverse ingestion methods to handle diverse data structures, interfaces, and container types; to scale to large data volumes and real-time latencies; and to simplify the onboarding of new data sources and data sets.
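As a rough illustration of early ingestion and late processing, the following Python sketch lands each payload untouched and defers parsing and standardization until the data is read. The names `ingest_raw`, `read_orders`, and `landing_zone`, and the sample order and sensor payloads, are hypothetical and chosen only for this example; an in-memory list stands in for real lake storage.

```python
import json
from datetime import datetime, timezone


def ingest_raw(landing_zone, source, payload):
    """Land the payload as-is, tagged only with its source and arrival time."""
    landing_zone.append({
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "raw": payload,  # original string; no parsing or cleansing on ingest
    })


def read_orders(landing_zone):
    """Apply the transform late, as the data is read."""
    for record in landing_zone:
        if record["source"] != "orders":
            continue
        row = json.loads(record["raw"])
        # The target structure is decided here, at read time (assumed fields).
        yield {"order_id": str(row["id"]), "amount": float(row["amount"])}


# An in-memory list stands in for the lake's landing area (an assumption).
landing_zone = []
ingest_raw(landing_zone, "orders", '{"id": 17, "amount": "19.90"}')
ingest_raw(landing_zone, "sensors", '{"device": "t-1", "temp_c": 21.5}')

orders = list(read_orders(landing_zone))
```

Because nothing is transformed on the way in, a new source such as the sensor feed can be onboarded without first agreeing on a target schema; a different reader can later interpret the same raw records another way.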
2. Control who loads which data into the lake and when or how it is loaded.
Without this control, a data lake can easily turn into a data swamp, which is a disorganized and undocumented data set that's difficult to navigate, govern, and leverage. Establish control via policy-based data governance. A data steward or curator should enforce a data lake's anti-dumping policies. Even so, the policies should allow exceptions -- as when a data analyst or data scientist dumps data into analytics sandboxes.
Document data as it enters the lake using metadata, an information catalog, business glossary, or other semantics so users can find data, optimize queries, govern data, and reduce data redundancy.
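One way to picture policy-based control combined with documentation on entry is the minimal Python sketch below. The zone names, user names, and the `POLICY` and `CATALOG` structures are assumptions made for illustration; a real lake would rely on its platform's governance and catalog tools rather than in-memory dictionaries.

```python
# Hypothetical policy: which loaders may write into which zone of the lake.
POLICY = {
    "raw": {"etl_service"},
    # Exception to the anti-dumping policy: analysts may load into sandboxes.
    "sandbox": {"etl_service", "analyst_ana"},
}

# Hypothetical catalog: one documented entry per data set in the lake.
CATALOG = {}


def load_dataset(user, zone, name, description, owner):
    """Admit a data set only if policy allows it and it arrives documented."""
    if user not in POLICY.get(zone, set()):
        raise PermissionError(f"{user} may not load into zone '{zone}'")
    if not description.strip():
        raise ValueError("data sets must be documented as they enter the lake")
    key = f"{zone}/{name}"
    CATALOG[key] = {"description": description, "owner": owner, "loaded_by": user}
    return key


sandbox_key = load_dataset(
    "analyst_ana", "sandbox", "churn_sample",
    "Sample of customer records for churn exploration", "analytics team",
)

try:
    load_dataset("analyst_ana", "raw", "churn_sample", "Same sample", "analytics team")
    dump_blocked = False
except PermissionError:
    dump_blocked = True
```

The same analyst who is refused in the governed zone is admitted to the sandbox, which mirrors the exception described above, and every admitted data set leaves a catalog entry that others can search.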
3. Persist data in a raw state to preserve its original details and schema.
Detailed source data is preserved in storage so it can be repurposed repeatedly as new business requirements emerge for the lake's data. Furthermore, raw data is great for exploration and discovery-oriented analytics (e.g., mining, clustering, and segmentation), which work well with large samples, detailed data, and data anomalies (outliers, nonstandard data).
As users work with lake data over time, they sometimes break this rule to apply light data standardization when required for reporting, complete customer views, recurring queries, and general data exploration.
4. Improve data at read time as lake data is accessed and processed.
This is common with self-service user practices, namely data exploration and discovery, coupled with data prep and visualization. Data is modeled and standardized as it is queried iteratively, and metadata may also be developed during exploration. Note that these data improvements should be applied to copies of data so that the raw detailed source remains intact. As an alternative, some users improve lake data on the fly with virtualization, metadata management, and other semantics.
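The point about improving copies rather than the raw source can be sketched in Python as follows. The customer records and the standardization rules (trimmed, title-cased names and upper-cased country codes) are invented for illustration and are not taken from any particular tool.

```python
import copy

# Raw records as they arrived; a tuple signals that this set is not edited.
RAW_CUSTOMERS = (
    {"name": " Ada Lovelace ", "country": "uk"},
    {"name": "grace hopper", "country": "US"},
)


def standardized_view(raw_rows):
    """Return improved copies of the rows; the raw rows are never modified."""
    improved = []
    for row in raw_rows:
        clean = copy.deepcopy(row)  # work on a copy, not the source record
        clean["name"] = clean["name"].strip().title()
        clean["country"] = clean["country"].upper()
        improved.append(clean)
    return improved


view = standardized_view(RAW_CUSTOMERS)
```

After the call, `view` holds the standardized rows while `RAW_CUSTOMERS` still contains the original spacing and casing, so a later analysis can apply different rules to the same source.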
5. Capture big data and other new data sources in the data lake.
TDWI survey data shows that over half of data lakes are deployed exclusively on Hadoop, with another quarter deployed partially on Hadoop and partially on traditional systems. Many data lakes are deployed to handle big data (e.g., large volumes of Web data), so Hadoop is a good fit. Hadoop-based data lakes are increasingly capturing large data collections from new sources, especially the IoT (machines, sensors, devices, vehicles), social media, and marketing channels.
6. Integrate data of diverse sources, structures, and vintages.
Data lakes aren't just for IoT and big data. Many users blend traditional enterprise data and modern big data on a Hadoop-based lake to enable advanced analytics, extend customer views with big data, enlarge data samples of existing fraud and risk analytics, and enrich cross-source correlations for more insightful clusters and segments. In addition, TDWI has seen blended lake data enable logistics optimization, sentiment analysis, near-time business monitoring, patient outcome analytics in healthcare, and predictive maintenance.
7. Extend and improve enterprise data architectures, both old and new.
Data lakes are rarely siloed. Most are integral parts of a larger data architecture or multiplatform data ecosystem -- common examples being the multiplatform data warehouse environment, omnichannel marketing, and the digital supply chain. A lake can also extend traditional applications -- such as those for multimodule ERP, financials, content management, and data or document archiving. Hence, a data lake can be a modernization strategy that extends the useful life and functionality of an existing application or data environment.
8. Make each data lake serve multiple technical and architectural purposes.
A single lake typically fulfills multiple architectural purposes, such as data landing and staging, archiving for detailed source data, sandboxing for analytics data sets, and managing operational data sets (especially complete views and data masters). Even so, when a single data lake plays this many architectural roles, it may need to be distributed over multiple data platforms, each with unique storage or processing characteristics. For example, TDWI surveys show that a quarter of data lakes are on both Hadoop and multiple instances of relational databases.
9. Enable new self-service data-driven business best practices.
These include data exploration, prep, visualization, and some kinds of analytics. Nowadays, savvy users (both business and technical) expect self-service access to lake data, and they will consider the lake a failure without it. Note that self-service functionality depends on two key components: tools built for the ease of use that business users need, and business metadata and other specialized semantics.
10. Select data management platforms that satisfy data lake requirements.
Hadoop is the preferred data platform for most lakes due to its low price, linear scalability, and powerful in situ processing for analytics. However, some users implement a massively parallel processing (MPP) relational database when the lake's data is relational and/or requires relational processing (complex SQL, OLAP, materialized views).
Hybrid platforms are on the rise with data lakes; they may combine Hadoop and relational systems or on-premises and on-cloud systems. Across many kinds of data collections (data lakes, warehouses, big data, analytics, etc.), TDWI sees increasing use of cloud storage, whether file/folder, object, or block.