Sifting Through the Garbage
Is your data lake just a garbage dump?
- By Bill Inmon
- April 14, 2016
Quick now! Which respected scientific discipline makes its living sifting through garbage?
If you answered archaeology, you know your science.
To explain this phenomenon, let me take you to my father's ranch in West Texas. About a mile from the ranch house -- high up on a bluff -- is a cave. The cave faces south so that it catches the winter sun and the summer shade. My brother and I have been up to the cave many times. It is a good hike and a brisk but doable mountain climb.
The ceiling of the cave is covered in black soot, which indicates that someone once made a fire in the cave. In this case, a lot of fires. The cave is big enough to fit about three people comfortably.
In the summertime you always have to check for rattlesnakes before you go in because occasionally snakes like to lie in the shade of the cave. Once my brother found three really nice arrowheads in the cave. One of those arrowheads was a Pandale (you can look it up) and it's a family treasure to this day, but we never excavated the cave. We are not archaeologists.
At the foot of the cave, just outside the front of the cave, is the midden. The midden is where the inhabitants of the cave threw their trash. If you were to look at the midden you would find lots of things -- more black carbon-stained rock, old bones, maybe some teeth, the occasional broken arrow point, and scrapers. Scrapers were the flint implements they used to scrape flesh off of hides. I once took a scraper home and washed it off. I then used it as a knife to cut steaks and meat with. It worked better than any knife I had. If the inhabitants had made pottery, you would find broken pottery shards in the midden.
The midden was the garbage pile of the cave's inhabitants. An archaeologist would have made the garbage pile tell a story -- what food the inhabitants ate, what culture the inhabitants were from, how long ago they had been there, and so forth. Occasionally the inhabitants of similar caves would bury a body in the midden. (We never found one at our cave.)
An archaeologist would have deconstructed the midden and would have found out a whole lot about who lived there and how they lived.
Learning about early man is a fascinating subject, but just imagine if we could actually talk with early man. We could sit down and say: we -- the inhabitants of the 21st century -- want to know all about you. Could we please get you to organize the information about your life a little better than just throwing it all away in a pile? Our archaeologists are tired of studying you by looking at your garbage.
We could tell early man that we are interested in their religion and their customs, how they mate and marry, how they hunt, and how they get along with their neighbors. There is so much we could learn from early man if we could just have a conversation with them, but of course we can never have a conversation with early man.
Whom we can have a conversation with is the modern-day computer technician. Modern-day technicians are creating their own modern midden. Modern-day technicians are creating the data lake and corporations are hiring expensive data scientists to sift through the midden/data lake to find precious information.
Can't we do better than this? We may not be able to talk with the early caveman, but we can and should talk with today's technician because the data lakes being created today are just like the midden of early man.
What would we say to the data technicians today who are creating their midden-like data lakes?
The conversation might go something like this:
Don't just indiscriminately dump some data in the data lake and expect a data scientist to spin the data into gold. Give the data scientist some help.
It would really help the data scientist if he/she knew where the data came from and when it came into the data lake.
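As a rough illustration of what "where it came from and when" could look like in practice, here is a minimal Python sketch that stamps each item with its origin and arrival time before it goes into the lake. The names `LakeRecord`, `ingest`, and `source_system` are hypothetical, chosen only for this example, and do not come from any particular data lake product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass(frozen=True)
class LakeRecord:
    """A payload plus the two facts the analyst needs first."""
    payload: Any             # the raw data, left untouched
    source_system: str       # where the data came from (hypothetical label)
    ingested_at: datetime    # when the data entered the lake


def ingest(payload: Any, source_system: str,
           now: Optional[datetime] = None) -> LakeRecord:
    """Wrap a payload with provenance instead of dumping it bare."""
    # Use the supplied timestamp when given, otherwise the current UTC time.
    timestamp = now if now is not None else datetime.now(timezone.utc)
    return LakeRecord(payload=payload,
                      source_system=source_system,
                      ingested_at=timestamp)
```

A call such as `ingest({"order_id": 17}, "billing_app")` would return a record that still holds the original payload but can now answer both questions without any digging.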
It would really help if the data scientist knew all about the metadata that described the data in the data lake. It would help if the data scientist knew about relationships in the data, and relationships between different types of data.
It would really help if textual data were reduced to the form of a database and the context of the text was identified as well as the text. It would really help if analog data were edited into a form that was convenient to analyze. It would really help if analog data had descriptor information and metaprocess information as well as metadata information.
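To make the idea of reducing text to database form a little more concrete, here is a deliberately simple Python sketch. It assumes a tiny hand-made taxonomy (`TAXONOMY`, invented for this example) and treats the sentence a word appears in as its context. Real textual disambiguation is far more involved than this, and the sketch is not a description of any vendor's method.

```python
import re
import sqlite3

# Hypothetical taxonomy: maps a word of interest to a business category.
TAXONOMY = {"invoice": "billing", "refund": "billing", "outage": "operations"}


def text_to_rows(doc_id, text):
    """Turn free text into (doc_id, term, category, context) rows."""
    rows = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    for sentence in sentences:
        for word in re.findall(r"[a-z]+", sentence.lower()):
            if word in TAXONOMY:
                # Keep the whole sentence as the context of the term.
                rows.append((doc_id, word, TAXONOMY[word], sentence))
    return rows


def load(conn, rows):
    """Store the rows in an ordinary relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS text_terms "
        "(doc_id TEXT, term TEXT, category TEXT, context TEXT)"
    )
    conn.executemany("INSERT INTO text_terms VALUES (?, ?, ?, ?)", rows)
    conn.commit()
```

Once the rows are in a table, the text can be queried with plain SQL, for example by counting terms per category, rather than being reread document by document.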
It would really help if data from different applications were integrated. It would really help if log tape data were deciphered and if the contents of the log tape were clearly described.
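Integration across applications can be pictured with a small Python sketch as well. It assumes two made-up applications, `crm` and `billing`, that name the same customer fields differently; the field maps and function names are hypothetical and only illustrate the general idea of mapping application data onto one common layout.

```python
# Hypothetical per-application maps: local field name -> common field name.
FIELD_MAPS = {
    "crm": {"cust_no": "customer_id", "nm": "name"},
    "billing": {"customerId": "customer_id", "full_name": "name"},
}


def to_common(app, record):
    """Rename one application's fields to the common names."""
    mapping = FIELD_MAPS[app]
    return {common: record[local]
            for local, common in mapping.items()
            if local in record}


def integrate(tagged_records):
    """Merge (app, record) pairs into one view keyed by customer_id.

    Assumes every record carries its application's customer key;
    later records overwrite earlier values for the same field.
    """
    merged = {}
    for app, record in tagged_records:
        common = to_common(app, record)
        merged.setdefault(common["customer_id"], {}).update(common)
    return merged
```

With the mapping written down once, the analyst sees a single customer record rather than two unrelated ones that happen to describe the same person.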
When you stop and think of it, there is a whole lot of infrastructure information that could be added to the data lake, and the effect of adding to the data lake is to emancipate the data.
Heck, we might even streamline and demystify the data in the data lake so that an ordinary business user could actually make sense of the data. What a concept! If that were to happen, we would not need expensive and erudite data scientists (who are hard to find, expensive to talk to, and never have any time for you and your mundane problems). But the data technicians of today are too busy to think about such things.
The technicians of today are busy making true the axiom:
We can spend millions of dollars and many man-years building it wrong, but we don't have a dime or a minute to spend to build it right.
Maybe the technician of today is a lot more like the early caveman than we had ever thought.
Bill Inmon has written 54 books published in 9 languages. Bill’s company -- Forest Rim Technology -- reads textual narrative, disambiguates the text, and places the output in a standard database. Once in the standard database, the text can be analyzed using standard analytical tools such as Tableau, QlikView, Concurrent Technologies, SAS, and many more analytical technologies. His latest book is Data Lake Architecture.