RESEARCH & RESOURCES

New Book Explains How to Keep Your Data Lake Clean

"We have millions to spend building data lakes wrong but not a dime to spend to build them right,” according to author Bill Inmon.

Organizations invest incredible amounts of time and money procuring and storing big data in data stores called data lakes. How many of these organizations can get their data out of those lakes in a usable form?

“Very few can turn the data lake into an information gold mine. Most wind up with garbage dumps,” claims author Bill Inmon, a pioneer who explained the architecture and benefits of the data warehouse and a regular contributor to Upside.com.

Data Lake Architecture explains how enterprises can build a useful data lake “where data scientists and data analysts can solve business challenges and identify new business opportunities.” Readers will learn how to structure data lakes as well as analog, application, and text-based data ponds to provide maximum business value. The book also explains the role of the raw data pond and suggests when to use an archival data pond.

As the author points out, people have been building big data projects for years. A by-product of these big data initiatives is the data lake.

“For the most part the data lake has been an afterthought. The idea is that data scientists will come along and look through the data lake and find treasures of information,” Inmon notes. “The results have been disappointing to say the least. Corporation after corporation has found that their experience with the data scientists has not lived up to expectations.”

The book explains that to get your money's worth out of big data and data lakes, a more rigorous and disciplined approach is needed. Instead of just piling raw data into the data lake, you need to take an architected approach.

“Big data and data lakes can be divided into three types of data: analog data, application data, and textual data,” Inmon explains. “These three types of data need to be organized into separate data ponds. Each pond holds a different kind of data.”
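As a rough illustration of that division (not taken from the book; the pond names and routing rules below are hypothetical), incoming data might be steered into one of three ponds based on its source type:

```python
from enum import Enum


class Pond(Enum):
    """The three data ponds Inmon describes."""
    ANALOG = "analog"            # machine and sensor readings
    APPLICATION = "application"  # transactional/application records
    TEXTUAL = "textual"          # free-form text such as email or call notes


def route_to_pond(source_type: str) -> Pond:
    """Pick a pond for an incoming data set (illustrative rules only)."""
    if source_type in {"sensor", "telemetry", "iot"}:
        return Pond.ANALOG
    if source_type in {"erp", "crm", "transactions"}:
        return Pond.APPLICATION
    return Pond.TEXTUAL  # treat unstructured sources as text by default


print(route_to_pond("sensor"))  # Pond.ANALOG
```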

Once you’ve divided your data into these ponds, you need to integrate and organize it into a cohesive whole.

“Analog data needs to be treated with data reduction technology,” says Inmon. “Application data needs to be treated with classical ETL. Textual data needs to be passed through textual disambiguation. Once the data in the data ponds has been conditioned, then it is ready for analysis by the data scientists or the business analyst.”
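A minimal sketch of that per-pond conditioning flow might look like the following; the function names and rules are placeholders for illustration, not techniques or APIs from the book:

```python
def reduce_analog(readings: list[float], window: int = 10) -> list[float]:
    """Data reduction: keep a rolling average instead of every raw reading."""
    return [
        sum(readings[i:i + window]) / len(readings[i:i + window])
        for i in range(0, len(readings), window)
    ]


def etl_application(record: dict) -> dict:
    """Classical ETL: normalize keys and types so records can be integrated."""
    return {k.lower().strip(): str(v).strip() for k, v in record.items()}


def disambiguate_text(text: str, taxonomy: dict[str, str]) -> list[str]:
    """Textual disambiguation: map raw words to standard business terms."""
    return [taxonomy[w] for w in text.lower().split() if w in taxonomy]


# Each pond gets its own conditioning step before analysts touch the data.
print(reduce_analog([1.0] * 25, window=10))
print(etl_application({" Customer_ID ": 42}))
print(disambiguate_text("claim denied", {"claim": "CLAIM", "denied": "STATUS_DENIED"}))
```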

The book can be ordered from https://technicspub.com/bidw/.
