A Peek into the Future: The Next Wave of Data Warehousing
Data found in warehouses is mostly transaction-oriented -- data from repetitive activities that are an integral part of every organization's business. Bill Inmon and Geno Valente look at what's ahead: textual data.
by Bill Inmon and Geno Valente
In the early days of computing, there were simple applications. With lightning speed, the world progressed from mainframes and batch applications to personal computers and online applications. At the same time, there emerged a class of data applications that are now known as business intelligence and data warehousing.
Data warehousing lets people look at corporate data as never before. Patterns, trends, seeing the forest and the trees -- all became possible with data warehousing. Business people were able to look at their enterprises as no one had been able to previously, and entire industries and multi-billion-dollar companies grew out of this love of knowing more. Companies saw that business intelligence and data warehousing allowed them to make important corporate decisions based on new perspectives gained from data gathered from all over the corporation.
However, the thirst for deeper insight that warehousing created led, in turn, to a virtual flood of data. Warehouses stored detailed data, historical data, and data integrated from a wide variety of sources. An inevitable formula applies to every data warehouse that has ever been built --
Detail x History x Many Sources = Lots of Data
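To make the formula concrete, consider a rough back-of-envelope calculation. This is only a sketch: every figure below is a hypothetical assumption, chosen to show how the three terms multiply rather than to describe any real warehouse.

```python
# Back-of-envelope illustration of Detail x History x Many Sources.
# Every figure here is hypothetical, chosen only to show how volume compounds.
bytes_per_row = 500            # detail: full records, not summaries
rows_per_day = 10_000_000      # e.g., daily transactions per source
days_of_history = 5 * 365      # history: five years retained
sources = 12                   # many sources: operational feeds

total_bytes = bytes_per_row * rows_per_day * days_of_history * sources
print(f"{total_bytes / 1e12:.0f} TB")  # roughly 110 TB from modest inputs
```

Even with these modest assumptions, a single feed profile lands in the hundreds of terabytes; multiply by real-world row widths and feed counts and the flood arrives quickly.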
Fortunately, technology grew to the extent that large volumes of data could be handled. However, even today only a few solutions offer true unrestricted ad hoc access to this class of data.
Operational business intelligence (BI) -- the domain of pre-defined queries used for operational reporting and management -- has been firmly established for over a decade and is successfully utilized in most enterprises, but the challenges of true ad hoc analytics on large data sets are still formidable. Legacy systems work well for pre-defined queries but quickly break down when analysts try to explore large data sets looking for insights. This has spawned a support industry of consultants and service providers who offer workarounds. They create application-specific data models, data de-normalization and partitioning schemes, and other patches in an attempt to improve performance.
For most practical purposes, this is where most enterprises are today. However, there is innovative technology emerging that could propel business intelligence into an utterly new and far more powerful position.
To date, much of the data found in warehouses is transaction-oriented. Transactions are repetitive activities that are an integral part of every organization's business. There are repetitive sales. There are repetitive phone calls. There are repetitive ATM activities and credit card swipes, and so on. In short, one way or the other, day-to-day business activities run on repetitive transactions. Some businesses have more transactions than others, but in almost every business, the fundamental activities of the organization are captured in the form of transactions. These transactions feed data warehouses, which in turn enable and feed the business intelligence process.
The Next Wave of Data
However, there is an entirely different class of data that exists today that has not yet found its way into business decision-making: textual data. Textual data is found in e-mail messages, contracts, warranty claims, chat logs, and even help desk reports. In short, textual data is the most voluminous and the most common data found in enterprises today. Unfortunately, it plays almost no part in corporate decision-making. It sits there nonetheless, consuming paper or disk space, but no one accesses it to drive real corporate action.
There are many reasons why textual information is not found in the databases of most enterprises. The single biggest obstacle to including textual data in decision-making is that basic database technology is designed for repetitive events such as transactions. Text, by contrast, is freeform, conforming to no set of computer-based rules. Simply stated, text is unstructured and not repetitive; databases are built for repetitive, structured data. Many other challenges await those who would try to use text to make decisions, but the visceral mismatch between the structure of a database and the lack of structure of text is the biggest obstacle. This is the key reason why no one is successfully using textual data in corporate decision-making today.
Are text files left out of corporate decision-making because no important information is wrapped up in text? No. Some of the most important data that exists happens to be found in the form of text. Consider the following text files and their business importance:
- Contracts: A contract represents an important obligation between the organization and another entity, and it is always in the form of text. Taken collectively, top management has no idea what is in its corporate contracts. Ask a manager how many contracts will expire in six months and see what management says. Ask an executive how many contracts are with institutions that have changed their name in the past year and see what management says. Ask a DW/BI manager how many contracts with a single outside entity could sensibly be consolidated. In case after case, there is a wealth of information locked up in contracts, and -- taken collectively -- enterprises have no idea what is in their body of contracts.
- Warranty Claims: Warranty claims are also in the form of text. They hold a wealth of information about products, product failures, patterns of failures, and customer attitudes -- a feast of information about how the customer is interacting with the company at the most basic level: the customer's experience with the product. Yet many warranty claims are processed by hand. In fact, there are so many claims that an analyst cannot even take the time to create a database that could subsequently be analyzed.
- Loan Portfolios: It is common practice for organizations to lump many loans together into a portfolio. Going through every loan to assess its creditworthiness is a painful experience when done by hand. In fact, the process of manually analyzing a loan portfolio is so intimidating that enterprises simply don't attempt it.
- Doctors' Notes: Doctors make notes about a variety of medical incidents and activities. There is a wealth of information locked up in these notes, and as long as they are processed manually, no form of automated analysis is possible.
This is just the short list. Everywhere there is valuable information locked up in text that is crying out for automated analysis.
Textual Extract, Transform, and Load (ETL) to the Rescue
One of the seminal innovations now appearing on the market is "textual ETL." Textual ETL reads text, performs the myriad tasks needed to put the text into a form and structure that is useful to a database, and then places the resulting structured tables into a database system. In short, textual ETL opens up worlds of old and new data exploration that were never before possible. With textual ETL, organizations can now start to include text in their corporate databases.
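As a concrete illustration, here is a minimal sketch of what one pass of textual ETL might look like. This is not any vendor's actual product; the input directory, the controlled vocabulary, and the doc_terms table are all assumptions made for the example.

```python
# A minimal textual-ETL sketch (illustrative only): read free-form notes,
# extract recognizable terms, and load them as structured rows into SQLite.
import re
import sqlite3
from pathlib import Path

# Hypothetical controlled vocabulary: raw variants -> standardized term.
TERMS = {"bleeding": "hemorrhage", "hemorrhage": "hemorrhage",
         "laceration": "laceration", "contusion": "contusion"}

con = sqlite3.connect("textual_dw.db")
con.execute("""CREATE TABLE IF NOT EXISTS doc_terms
               (doc_id TEXT, raw_term TEXT, std_term TEXT)""")

for path in Path("notes").glob("*.txt"):          # assumed input directory
    text = path.read_text(encoding="utf-8").lower()
    for raw in re.findall(r"[a-z]+", text):       # crude tokenization
        if raw in TERMS:                          # keep only vocabulary hits
            con.execute("INSERT INTO doc_terms VALUES (?, ?, ?)",
                        (path.stem, raw, TERMS[raw]))
con.commit()
```

A production tool does far more (synonym taxonomies, stemming, proximity analysis), but the shape is the same: unstructured text in, relational rows out.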
Another innovation required to manage this ever-larger wave of data is "unrestricted ad hoc" access. The forms, notes, and claim files mentioned above were not created with pull-down lists, structured forms, or keyed data. Where one doctor wrote "bleeding," another might have written "laceration," and yet another might have penciled in "contusion." All of these terms might be relevant to one kind of injury but not to another (a knife wound versus an ulcer, for example). To analyze such data successfully, pre-defined queries cannot be applied -- it is ad hoc access 100 percent of the time.
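Continuing the sketch above: once variants have been mapped to a standard term, an analyst can pose a question no one anticipated without knowing which word each doctor happened to use (again assuming the hypothetical doc_terms table and the connection from the previous sketch).

```python
# Ad hoc query against the hypothetical doc_terms table sketched above:
# count documents per standardized term, regardless of the synonym used.
for term, n in con.execute("""SELECT std_term, COUNT(DISTINCT doc_id)
                              FROM doc_terms GROUP BY std_term"""):
    print(term, n)
```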
Lastly, if you thought the volume of data that came with transaction data warehouses was large, you will be stunned by the volume that comes with bringing text into analytic environments. It is estimated that for the average corporation, 80 percent of its data is in text form. In other words, as large as corporate databases are with transaction-based data, they are going to grow much, much larger with textual data. As companies begin incorporating textual data into their databases and data warehouses, new challenges will arise that only a select few will be ready for.
Making "real decisions" on massive volumes of data is going to be one of the next major challenges facing enterprises. Many of today's solutions struggle with the volume of data already available for analysis. With the wave of textual data, IT departments and BI users will be stressed further and must adapt to handle the growing challenge.
Petabyte Data Environments for the Masses
Let's look at some of the challenges of managing far greater volumes of data as a basis for decision making.
Challenge #1: Cost. Given today's standard data storage technology and current prices, it quickly becomes clear that the price of storage, and of the technology that houses and manages it, must decline. Unless a "petabyte for the masses" becomes affordable, organizations will not be able to afford the technology regardless of the business benefits. RDBMS and appliance providers must come in at or below $20,000 per terabyte of user data (uncompressed), or $5,000 per TB (compressed).
Challenge #2: Loading the data. It is one thing to receive and integrate a trickle of data. It is quite another to receive and integrate a flood of data -- a flood that never abates. Load times will be key, and IT will expect performance of about 5 TB per hour per rack.
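A quick calculation shows why these two targets matter at petabyte scale. This is only a sketch, using the figures just cited and taking 1 PB as 1,000 TB for round numbers.

```python
# Rough arithmetic behind Challenges #1 and #2, using the figures above.
petabyte_tb = 1000                        # 1 PB, in terabytes

print(f"Storage: ${petabyte_tb * 20_000:,} uncompressed, "
      f"${petabyte_tb * 5_000:,} compressed")   # $20M vs. $5M per PB

hours = petabyte_tb / 5                   # at 5 TB per hour per rack
print(f"Load: {hours:.0f} rack-hours, about {hours / 24:.1f} rack-days")
```

Even at the target price points, a petabyte is a $5 million to $20 million commitment, and a full reload of it occupies a rack for more than a week.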
Challenge #3: Unrestricted Ad Hoc Access and Analysis. Simple indexes were once adequate for smaller amounts of data, but in today's world data must be organized in a much more sophisticated manner. This usually means that a parallel approach to data management is essential. A system must be data-model agnostic and data-schema agnostic, and it must deliver ad hoc query performance regardless of the JOIN key.
Challenge #4: Scalability. Users need to recognize the age of data and move or archive it accordingly -- or maintain longer time series so that such efforts are avoided altogether. The choices should be either "keep it all" or proper partitioning that allows a year of data more than 10 years old to be dropped outright.
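As one illustration of the partitioning option, here is a sketch in which each year of data lives in its own partition, modeled here as a per-year directory. The warehouse/sales layout and the 10-year retention window are assumptions for the example; the point is that expiring old data becomes a single cheap drop rather than a row-by-row delete.

```python
# Sketch of age-based partitioning: data stored in per-year partitions
# (here, directories named year=YYYY) so that expiring old data is a
# cheap drop of one partition rather than a row-by-row delete.
import shutil
from datetime import date
from pathlib import Path

RETENTION_YEARS = 10
cutoff = date.today().year - RETENTION_YEARS

for part in Path("warehouse/sales").glob("year=*"):   # hypothetical layout
    year = int(part.name.split("=")[1])
    if year < cutoff:
        shutil.rmtree(part)                           # drop the whole partition
        print(f"dropped partition {part.name}")
```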
Summary
The importance of incorporating textual data into the decision-making infrastructure is matched by the challenge of storing and analyzing it. Bringing this data into the analytic environment is hard, and far greater challenges arise once it is there. The benefit of textual data inside the decision-making infrastructure is easy to recognize, but it means nothing if the volume of data it creates overwhelms the entire environment.
These are some of the exciting opportunities and technology challenges now emerging. Analytic professionals want more data in play, and textual data is at the top of their list. Emerging textual ETL tools and analytic environments purpose-built for ad hoc access are coming over the horizon, creating BI analysis opportunities once only dreamed of.
Bill Inmon is considered the father of data warehousing. He is the author of nearly 50 books, including his latest, "Tapping into Unstructured Data." His newest company, Forest Rim Technology, is focused on accessing unstructured data and integrating it into the data warehouse environment. http://textualetl.com/.
Geno Valente is vice president at XtremeData, Inc., maker of very high-performance database decision support systems (DSS) and other acceleration appliances. He has spent over 13 years helping support, sell, and market FPGA technology into markets such as financial services, bioinformatics, high-performance computing, and WiMAX/LTE while working for Altera Corporation and XtremeData, Inc. http://www.xtremedata.com/.