TDWI Upside - Where Data Means Business

Moving the Textual Analytics Cheese

There are plenty of natural obstacles to dealing with text, but those who feel threatened by advances in textual processing are inventing artificial obstacles as well.

Text is hard enough as it is. Text has double meanings, innuendo, improper syntax, and slang to deal with. Text has antecedents and precedents, and it comes in many languages. There is enough chaos and complexity found in text to challenge any system of textual analysis. We don’t need any more obstacles, but some people are trying to deny progress in text processing.

For years text has been a great challenge for data analytics. Text exists in the enterprise in vast supply, but text is almost perfectly resistant to any form of computerized analysis.

The computer has proven to be amazingly effective when it comes to structured data, but when it comes to text, the computer shrivels and hides in a corner. The unpredictable and imperfect nature of text remains impervious to most attempts to computerize it effectively, despite the many applications and opportunities for using text.

New technology now exists -- textual disambiguation -- that promises to bring text into the world of normal computer processing. The computer demands that data be structured, and with textual disambiguation, unstructured text can be turned into ordinary structured data that the great and powerful computer can handle.

What do we find accompanies this turn of events? We find people are uncomfortable with the thought that there has been a sea change in their industry. People resist change, whatever the change might be. People don’t like to have their cheese moved. People think their cheese belongs where it always has been.

Having text and doing nothing with it was good enough for my father, so it is good enough for me. Don’t bother me with reality and facts and progress. I don’t want to hear it. Leave my cheese where it is!

Years ago when ETL came onto the scene, programmers of the day resisted the idea that one could automate writing transformation code. To the programmer of the day, humans wrote code and that was that. Programmers were threatened by the thought that a machine could write accurate and useful transformation code (and transformation is precisely what ETL does).

Some programmers gave transformations to the vendor that were so complex and so horrendous that humans had to write specialized code to perform the transformation. In doing so, programmers proved to management that ETL could not work. The strange thing was that management believed them, at least at first.

What the programmers didn’t tell management was that ETL solutions could easily write 99 percent of transformations. Of course, there are always some transformations so complex that they cannot be written automatically, but using that 1 percent of complex transformations as proof that ETL would not work was a stupid thing to do. Programmers who misrepresented their transformations just didn’t want someone coming in and moving their cheese.

The same thing is happening with text and text analytics today. The possibilities are changing and some humans are reacting predictably. They don’t like their cheese moved.

The other day I was at a conference and a speaker said he could prove that textual analytical processing does not work. The gentleman offered a sentence that was very complex and full of ambiguities and said, “See, the computer cannot understand or make sense of this sentence. Therefore textual disambiguation does not work.”

You know what? The gentleman was partially right. There are sentences that are so obscure, so devious, so twisted that no amount of textual disambiguation will ever unravel them. Does that mean that disambiguation does not work? Not at all.

You see, most sentences are straightforward. In normal conversation, it is rare to find sentences that are tortuously complex. Consider writers. Yes, there were William Faulkner and Henry James, both famous for their difficult and obscure form of writing. On the other hand, for every Faulkner and James there is an Ernest Hemingway, Danielle Steel, John Grisham, Mark Twain, and a whole host of other writers who write in an understandable, clear, concise fashion. Most writers actually want you to understand what is being said.

It is disingenuous and artificial for the world to offer examples of strange and uncommon speech and use them to prove textual disambiguation does not work. Time will prove that the textual analytics cheese has been moved despite the artificial arguments offered up by those who would stand in the way.

About the Author

Bill Inmon has written 54 books published in 9 languages. Bill’s company -- Forest Rim Technology -- reads textual narrative, disambiguates the text, and places the output in a standard database. Once in the standard database, the text can be analyzed using standard analytical tools such as Tableau, QlikView, Concurrent Technologies, SAS, and many more analytical technologies. His latest book is Data Lake Architecture.

