Why All Content Matters: Getting Started with Unstructured Data
After users become familiar with your BI project's benefits, they'll likely want more. Be prepared to provide analysis of unstructured data. We'll show you how to begin.
- By Kirby Lunger
- May 28, 2008
You've spent the last five years defining, establishing, and building an analytical environment for your organization. You received accolades for finally providing access to structured information from your company's transactional systems through a business intelligence (BI) tool with underlying data marts, a data warehouse, and a data integration tool. Now -- all of a sudden, it seems -- your colleagues are asking for access to other kinds of content such as e-mail, documents, and audio-visual media through your analytical architecture so they can use this content for predictive analytics in the BI application. Where should you start?
To help your coworkers access this "unstructured" content, you first need to understand the types of data they want to retrieve. You probably have a good handle on the traditional transactional data that is housed in your analytical environment, especially the information in your databases (stored using multidimensional, relational, and other legacy formats). What your colleagues are asking for is pretty much all of the other content in the organization, which could make up as much as 80 percent of all corporate information assets. Users also want the ability to analyze content about the organization that is traditionally available only "outside the firewall."
This unstructured content is less ordered in terms of its information hierarchy, but the information is just as valuable for your performance management application. TDWI has described two types of data sources: semi-structured and unstructured. Semi-structured data includes spreadsheets, flat files, XML documents, and RSS feeds. Unstructured data inside your organization is everything else you can imagine, such as e-mail, word processing files, audio-visual content, Web pages, and text fields in all of your organization's applications. You may also be asked to provide access to, and analysis of, information outside your organization. This content comes in similar formats but is used for different business purposes.
Your colleagues need access to this semi-structured and unstructured content to answer questions such as: What are our company's contractual obligations across the enterprise? Is our organization meeting its compliance reporting requirements?
Your coworkers may also want to gain access to content created outside of your company to address questions such as: What do our customers think of our products and services? What are our competitors doing? What trends and buzz in the marketplace could influence our organization?
The most commonly accepted approach today is to use textual analytics and/or extract, transform, and load (ETL) software to impose order on a data set that may be composed of many different types of data. These tools deconstruct textual content (often using natural language processing) into data about specific, defined items such as customers or products. These items are then translated into a traditional data structure, such as rows in a database table or entities in a hierarchy.
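As a rough illustration of this deconstruction step, the sketch below scans free text for known customer and product names and emits one structured row per mention. It is a minimal sketch, not how any particular text analytics product works: the lookup lists, the `Mention` record, and the `extract_mentions` function are all hypothetical, and simple dictionary matching stands in for real natural language processing.

```python
import re
from dataclasses import dataclass

# Hypothetical lookup lists (assumption); a real system would draw these
# from master data or use natural language processing instead.
KNOWN_CUSTOMERS = {"acme corp", "globex"}
KNOWN_PRODUCTS = {"widget pro", "gadget x"}


@dataclass(frozen=True)
class Mention:
    """One structured row derived from unstructured text."""
    doc_id: str
    entity_type: str
    entity: str


def extract_mentions(doc_id, text):
    """Return one Mention row for each known entity found in the text."""
    lowered = text.lower()
    rows = []
    for entity_type, names in (("customer", KNOWN_CUSTOMERS),
                               ("product", KNOWN_PRODUCTS)):
        for name in sorted(names):
            # Whole-word match so "globex" does not match "globexport".
            if re.search(r"\b" + re.escape(name) + r"\b", lowered):
                rows.append(Mention(doc_id, entity_type, name))
    return rows


rows = extract_mentions("email-1", "Acme Corp reported a defect in Widget Pro.")
for row in rows:
    print(row.doc_id, row.entity_type, row.entity)
```

Each resulting row could then be loaded into a table alongside existing structured data, which is what makes the content usable from a conventional BI tool.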
This approach provides some clear advantages. The most obvious is that this content can be integrated into your current BI environment for presentation and analysis. This is a good incremental step if you are trying to get a sense of what value you can derive from this kind of analysis. However, it is only a step toward a more fully featured solution. This approach addresses only a subset of the semi-structured and unstructured data, and it does not provide new analytical tools for exploring combinations of structured, semi-structured, and unstructured content. A more radical and interesting approach to this problem is to apply techniques that are not commonly used in a more structured environment, primarily from the search software field.
Today, we generally assume that when a user accesses the BI platform, that user has a precise query in mind, such as when your sales manager asks, "What is the most recent Region One sales forecast and how does it compare to actual sales?" In contrast, most search platform users are not totally sure what they are looking for. They may have a few parameters in mind (e.g., my new car should be red and have a high safety rating), and they need the search platform to help them find information that is as relevant as possible to their query. The sales manager using a search paradigm might ask, "I see that performance was off in Region One last quarter. What were the causes of performance decline in Region One last quarter?"
Such "fuzzy" search logic implies a different approach to integrating your semi-structured and unstructured content into your analytic platform. Rather than looking at text as your only data source, you need to provide access to all types of content. This means that rather than folding this content into your data warehouse through an ETL process, you may need to consider some of the newer content ETL products just introduced for this type of initiative.
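To make the contrast with a precise BI query concrete, the sketch below ranks a handful of documents by relevance to a loosely worded question instead of returning an exact match. It is only an illustration under stated assumptions: the scoring is a basic term-frequency and inverse-document-frequency weighting chosen for brevity, the sample documents are invented, and `tokenize` and `rank` are hypothetical helper names, not part of any search product.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, documents):
    """Return (doc_id, score) pairs, most relevant first.

    Documents that share no terms with the query are omitted.
    """
    doc_tokens = {doc_id: Counter(tokenize(text))
                  for doc_id, text in documents.items()}
    n_docs = len(documents)
    scores = {}
    for term in set(tokenize(query)):
        # Number of documents containing the term.
        df = sum(1 for counts in doc_tokens.values() if term in counts)
        if df == 0:
            continue
        # Terms that appear in fewer documents carry more weight.
        idf = math.log(1 + n_docs / df)
        for doc_id, counts in doc_tokens.items():
            if counts[term]:
                scores[doc_id] = scores.get(doc_id, 0.0) + counts[term] * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented sample content (assumption) standing in for e-mail and documents.
documents = {
    "memo": "Region One performance declined after a supplier delay",
    "news": "Competitor opens a new store",
    "forecast": "Region One sales forecast",
}
for doc_id, score in rank("causes of performance decline in Region One", documents):
    print(doc_id, round(score, 2))
```

Note that the user never names the memo or specifies a field to filter on; the platform surfaces the most relevant content first and leaves the user to explore from there.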
Data visualization and presentation also evolve when you take a more "search"-focused approach to such content. In its simplest application, this means offering search within your BI application, optimized for the content your users will be looking for. In a more complex scenario, you could offer analysis techniques such as a content terrain (or "heat" mapping), which is similar to a regular topographical map: this visual technique shows content clusters based on particular areas of concentration within your enterprise.
To start down this path, you will obviously need to take a more holistic view of your organization's information and technology architecture to learn what data is available to your end users. You also need to spend time learning what is missing today from the BI environment. Don't be surprised if people at first cannot articulate their needs in this arena -- most people do not believe current tools can support this analysis!
In conjunction with this internal fact-finding, stay abreast of the evolution of "unstructured" content software and service solutions. Although these concepts have been around for some time, the technology that enables the more interesting analysis and integration opportunities in this area has emerged only recently.
Finally, keep experimenting! The BI market has grown and matured substantially in the last several years, and this is an exciting new area where we can all stretch and investigate. As famous engineer Richard Buckminster Fuller once quipped, "There is no such thing as a failed experiment -- only experiments with unexpected outcomes."
Kirby Lunger is senior vice president of corporate development at Attivio, Inc. in Newton, Mass. She can be reached at firstname.lastname@example.org.