Don’t Let Your Data Lake Become a Swamp (Part 1 of 2)
How do you prevent your data lake from becoming a data dump? It’ll take a new set of infrastructures and technologies, especially one called textual disambiguation.
- By Bill Inmon
- March 23, 2016
It is a favorite subject of the big data crowd: creating the data lake. The data lake is what you create when you fire up Hadoop and start throwing data into it. You throw in data from all sorts of places, and Hadoop gratefully accepts and stores it all.
One day you look back and see how much data you have in Hadoop and you see how little it has cost you (in terms of the price per byte stored). You sit back, have a glass of wine, and congratulate yourself on your success.
Your self-congratulatory mood is shattered the day you actually have to go and find something stored in your data lake. You discover that locating data in your data lake is quite a tussle. It is more than a tussle. It is full-scale war.
If you are not careful, you have created a one-way data lake, where you are eternally placing data into your data lake but never getting anything out of it. When you create a one-way data lake, in no time at all it turns into a garbage dump.
What do you need to make your data lake a two-way street? How do you avoid creating a garbage dump? Interestingly, you need a whole different set of infrastructures.
To understand the infrastructure you need, you need to look at your data a little differently. One way to look at the data in your data lake is in terms of repetitive data and non-repetitive data. Repetitive data is data whose records share the same structure, occurring over and over. Repetitive records might be records of telephone calls, ATM transactions, click-stream records, metering records, and so forth. There are many kinds of repetitive records.
The other kind of data in your data lake is non-repetitive data -- data whose record content does not repeat. There are many forms of non-repetitive data: textual data, email messages, telephone conversation transcripts, doctor's notes, and so forth.
What You Need
To keep your data lake from becoming a garbage dump, you need a whole new type of technology.
For repetitive data, you need the most basic kind of information: a metadata description that defines the contents of the repetitive record. Without the metadata definition, you cannot read and interpret the record. However, merely having the metadata definition is not enough. You also need the metadata definition over time, because the definition of a record's contents has the nasty habit of changing. Even that is not enough. If you ever want to combine data from different files and applications, you need an integration map -- a detailed description of how metadata over time must be transformed into other forms in order to create an integrated view of the data found in the data lake.
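To make the idea of "metadata over time" concrete, here is a minimal sketch in Python. The record layout, field names, and dates are all hypothetical, invented for illustration; the point is simply that each record must be parsed with the metadata definition that was in force when the record was written.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical versioned metadata: each schema version records the date
# from which records were written with that layout.
@dataclass
class SchemaVersion:
    effective_from: date
    fields: dict  # field name -> (offset, length) in the fixed-width record

schema_history = [
    SchemaVersion(date(2014, 1, 1), {"phone": (0, 10), "duration_sec": (10, 4)}),
    # A field was widened mid-2015 -- the "nasty habit of changing over time."
    SchemaVersion(date(2015, 6, 1), {"phone": (0, 10), "duration_sec": (10, 6)}),
]

def schema_for(record_date: date) -> SchemaVersion:
    """Pick the schema version that was in force when the record was written."""
    applicable = [s for s in schema_history if s.effective_from <= record_date]
    return max(applicable, key=lambda s: s.effective_from)

def parse(raw: str, record_date: date) -> dict:
    """Interpret a raw record using the metadata definition valid at its date."""
    schema = schema_for(record_date)
    return {name: raw[off:off + ln].strip()
            for name, (off, ln) in schema.fields.items()}
```

An integration map extends this idea one step further: on top of the per-date definitions, it states how each historical form must be transformed into a single common form before data from different files and applications can be combined.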
If you are serious about doing analytical processing of your data in your data lake, you need an elaborate and complex infrastructure.
Data warehouse people will quickly recognize the integrated metadata-over-time data map as a transformation map. Data warehouse people have been creating and using such a map for many years.
As interesting and important as repetitive data is, it is only one type of data found in the data lake. The other type is non-repetitive data. Much (but not all) of non-repetitive data is unstructured, textual data. To read and analyze non-repetitive data, you must have a different kind of technology. In some circles, the technology you need is called "textual disambiguation."
The elements of textual disambiguation for non-repetitive data are very different from the elements of classical transformational mapping for repetitive data.
Some of the interesting elements of textual disambiguation include:
- A taxonomy/ontology mapping classifies text in order to do analytical processing against it. A simple example of a taxonomy used for classification is car: Ford, Honda, Toyota, Volkswagen, etc.
- Stop-word processing removes extraneous words (“a”, “and”, “the”, “to”, “for”, etc.) from the database of words to be analyzed
- Inline contextualization infers the meaning of a word or words from the words that immediately surround it
- Homographic resolution infers the meaning of a word or words by knowing who wrote them
- Custom variable recognition identifies a variable by its format. Simple examples include a U.S. telephone number (999 999 9999) or a Social Security number (999 99 9999).
- Proximity resolution can change how words are interpreted based on the closeness of other words
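Three of the elements above -- stop-word processing, taxonomy mapping, and custom variable recognition -- can be sketched in a few lines of Python. This is only a toy illustration with made-up word lists and patterns; real textual disambiguation products are far more elaborate.

```python
import re

# Hypothetical stop-word list: extraneous words removed before analysis.
STOP_WORDS = {"a", "and", "the", "to", "for", "of", "in"}

# Hypothetical taxonomy: these words all classify as "car".
CAR_TAXONOMY = {"ford", "honda", "toyota", "volkswagen"}

# Custom variable recognition: identify a variable purely by its format.
PATTERNS = {
    "us_phone": re.compile(r"\b\d{3} \d{3} \d{4}\b"),  # 999 999 9999
    "ssn": re.compile(r"\b\d{3} \d{2} \d{4}\b"),       # 999 99 9999
}

def disambiguate(text: str) -> dict:
    # Stop-word processing: drop extraneous words from the word database.
    tokens = [w for w in re.findall(r"[a-z']+", text.lower())
              if w not in STOP_WORDS]
    # Taxonomy/ontology mapping: classify words so they can be analyzed.
    classes = [("car", w) for w in tokens if w in CAR_TAXONOMY]
    # Custom variable recognition: find variables by their format alone.
    variables = [(name, m.group()) for name, pat in PATTERNS.items()
                 for m in pat.finditer(text)]
    return {"tokens": tokens, "classes": classes, "variables": variables}
```

Inline contextualization, homographic resolution, and proximity resolution require considerably more machinery -- context windows, author profiles, and word-distance analysis -- which is exactly why a dedicated technology is needed here.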
Furthermore, after text has been passed through textual disambiguation, it is often necessary to perform what are called "post-processing" activities.
You need an entire technology for processing non-repetitive data before you can perform analytical processing against it.
The problem is that technicians are paying 100 percent of their attention to creating the data lake and 0 percent to accessing and integrating the data found in it. When they do pay attention to access, they treat the data lake as if it were a standard database, to be queried through a simple SQL structure.
Long ago the IT community learned that analysis of the data in a database was far more complex than merely issuing a simple query against the data. The data needed to be integrated before it could be meaningfully analyzed. However, the vendors of big data seem to have never learned the lessons of years ago (or else they just ignored those lessons).
Those who ignore history are doomed to repeat it, so don’t be surprised at the odor of your data as it sits there unused in your data lake. Data -- like garbage -- starts to smell after a while when it just sits and stagnates.
Read part 2 of this article here.
Bill Inmon has written 54 books published in 9 languages. Bill's company -- Forest Rim Technology -- reads textual narrative, disambiguates the text, and places the output in a standard database. Once in the standard database, the text can be analyzed using standard analytical tools such as Tableau, Qlikview, Concurrent Technologies, SAS, and many others. His latest book is Data Lake Architecture.