TDWI Checklist Report | Designing and Operating a Scalable Enterprise Data Lake
March 21, 2019
Data lakes were originally created on the premise that modern distributed computing and storage architectures based on Hadoop would give organizations greater flexibility, more cost-effective computing power, and economical storage, enabling big data analytics use cases that traditional enterprise data warehouses (EDWs) could not handle.
However, early data lakes allowed self-service users to add new data sources with no governance mechanism, and they quickly became little more than glorified dumping grounds.
In recent years, though, innovations have allowed the data lake to evolve into a coordinated, governed environment for accumulating shared data resources that can be used for competitive advantage. Yet designing an enterprise data lake that is scalable, sustainable, and governable, while still maintaining flexibility and agility, poses many challenges. As a combination of static and real-time data sources is fed into the data lake, its management and operation become even more complex.
This checklist examines these issues and provides guidance on overcoming the challenges. It also offers a number of recommendations to support the design and development of an enterprise data lake that is not only sustainable but will also scale as data volumes continue to explode.