Why Do We Call Text "Unstructured"?
Text is commonly referred to as unstructured data, but it clearly has structure. What does "unstructured" mean in a data context?
- By Bill Inmon
- June 28, 2016
Structured data is repetitive data that occurs over and over. Banking transactions, airline reservations, retail sales, and telephone call detail records are all classical examples of what is known as structured data. In most cases this data is created as a result of the execution of a transaction.
Structured data fits nicely and neatly inside a standard database management system.
Then there is text. Text is commonly referred to as unstructured data. Prior to textual disambiguation, text did not fit comfortably into a standard database management system.
Is text really unstructured?
What does "unstructured" truly mean? In general, "unstructured" refers to a lack of structure. If text were really unstructured, we wouldn't be able to hold a conversation, but we understand each other when we speak. People understand books when they read them. What is going on here?
There is definitely structure behind text. There is proper spelling, punctuation, proper sentence construction, and proper thought development. Ask any English teacher and you will find out just how much structure is behind the text we write and speak.
Of course, the structure behind text is quite complex. Language is taught in school from the first grade on. Parents start teaching their children language at a very young age. It takes a long time for a human to learn how to speak properly and also to learn to understand speech, and the deeper you go into language, the more arcane and complex it becomes. Indeed, you can get a Ph.D. in language and make it your life's work.
There is the dictionary meaning of unstructured and there is the computer professional's meaning, and these two definitions are very different.
There really is structure behind text, but that doesn't allow the text to be considered structured in the eyes of the computer. That structure is so vast, so complex, and so arcane that the computer cannot understand it. The computer is capable of understanding only the simplest structures, and language is simply beyond the pale. Therefore, in the eyes of the computer, text is unstructured.
To make matters even more complex, unstructured data (in the computer sense) includes a lot more than text. Unstructured data includes all sorts of other data -- image data, sound data, log tape data, and meteorological data, to name a few.
Why does the computer's definition of what is structured and what is unstructured make a difference? The computer was made to handle structured data and not unstructured data. The computer expects data to be in nice, neat little piles called records. Each record has a key and other attributes. Once data is organized into a structured format, the computer speeds through it, much like bullets fly through a machine gun. If there is a bullet that is out of place, the machine gun jams.
The structure and organization of the data make a big difference when it comes to efficient processing inside the computer.
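To make the idea of a record with a key and other attributes concrete, here is a minimal sketch in Python. The `Transaction` type, its field names, and the comma-separated input format are assumptions chosen for illustration; they do not come from any particular database system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Transaction:
    """A hypothetical structured record: one key plus other attributes."""
    account_id: str    # the key
    date: str          # attribute
    amount_cents: int  # attribute


def parse_record(line: str) -> Transaction:
    """Parse one 'account_id,date,amount_cents' line into a record."""
    account_id, date, amount = line.split(",")
    return Transaction(account_id.strip(), date.strip(), int(amount))


def try_parse(line: str) -> Optional[Transaction]:
    """Return a record, or None when the line does not fit the layout."""
    try:
        return parse_record(line)
    except ValueError:
        return None


def total_by_account(records):
    """Sum amounts per key; easy because every record has the same shape."""
    totals = {}
    for rec in records:
        totals[rec.account_id] = totals.get(rec.account_id, 0) + rec.amount_cents
    return totals


rows = [
    "A-100,2016-06-01,2500",
    "A-200,2016-06-02,-1200",
    "A-100,2016-06-03,4000",
]
records = [parse_record(r) for r in rows]
totals = total_by_account(records)

# A line of free text does not fit the record layout at all.
jammed = try_parse("The customer called about a late fee.")
```

Every well-formed row flows through the same three-field layout, so totaling by key takes only a few lines. The free-text sentence cannot be split into those three fields, so `try_parse` returns `None` for it. That is the out-of-place bullet from the analogy above.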
One of the interesting questions becomes: if the computer cannot handle unstructured data efficiently, then can unstructured data be translated into a structured format?
You can use textual disambiguation to ingest raw, unstructured text and transform the important parts of unstructured text into a structured format while maintaining the essence of the unstructured data. It is like riding a bicycle across a tightrope stretched across Niagara Falls while juggling as monkeys dash about. Not for the faint of heart.
The strategic value of textual disambiguation is that it enables text to be placed into a standard database so it can be used for corporate decision making.
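This article does not describe how textual disambiguation works internally, so the sketch below is not that method. It is a deliberately simple stand-in that uses regular expressions to show the general shape of the idea: pull a few recognizable facts out of a narrative and store them as rows in a standard database. The table name `text_facts`, the column names, the patterns, and the sample note are all assumptions made for illustration.

```python
import re
import sqlite3

# Hypothetical patterns for two kinds of facts worth keeping.
DATE_PATTERN = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")
AMOUNT_PATTERN = re.compile(r"\$(\d+(?:\.\d{2})?)")


def extract_facts(doc_id, text):
    """Return (doc_id, kind, value, char_offset) rows found in the text."""
    rows = []
    for m in DATE_PATTERN.finditer(text):
        rows.append((doc_id, "date", m.group(1), m.start()))
    for m in AMOUNT_PATTERN.finditer(text):
        rows.append((doc_id, "amount_usd", m.group(1), m.start()))
    # Keep the rows in the order the facts appear in the source text.
    return sorted(rows, key=lambda r: r[3])


def load(conn, rows):
    """Store extracted rows in an ordinary relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS text_facts ("
        "doc_id TEXT, kind TEXT, value TEXT, char_offset INTEGER)"
    )
    conn.executemany("INSERT INTO text_facts VALUES (?, ?, ?, ?)", rows)
    conn.commit()


note = "Customer called on 2016-06-28 to dispute a $35.00 late fee."
conn = sqlite3.connect(":memory:")
load(conn, extract_facts("note-1", note))

facts = conn.execute(
    "SELECT kind, value FROM text_facts ORDER BY char_offset"
).fetchall()
```

Once the facts are rows in a table, any standard SQL or reporting tool can query them alongside transaction data. Keeping the document identifier and character offset with each row is one way to preserve a link back to the original narrative.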
If you don't grasp the strategic importance of being able to make decisions based on text, think about this. An estimated 80 percent to 90 percent of the data in an enterprise is text. However, most corporate decisions are made on the basis of reading and analyzing only the remaining 10 percent to 20 percent of the data, the part that is structured. Does this make sense?
It is like saying that only men over 65 who have college educations should make all the political decisions for the entire population. What about women? What about people younger than 65? What about people who do not have a college education?
We would never stand for a political system that was so misshapen and so elitist, but that is exactly what we do to the text and data found in our corporations.
Workers of the world, unite. Start making corporate and management decisions based on your unstructured data.
Bill Inmon has written 54 books published in 9 languages. Bill’s company -- Forest Rim Technology -- reads textual narrative, disambiguates the text, and places the output in a standard database. Once in the standard database, the text can be analyzed using standard analytical tools such as Tableau, QlikView, Concurrent Technologies, SAS, and many more analytical technologies. His latest book is Data Lake Architecture.