Unstructured Data: Attacking a Myth
- By Stephen Swoyer
- September 5, 2007
While business intelligence (BI) search may not radically change how users work, there is a clear and growing sense in which search—especially when tapped as a complement to existing BI and performance management (PM) infrastructure investments—will make a decisive difference.
At the heart of the discussion is one important myth: that unstructured data is as pervasive as vendors of search, data integration, and related technologies claim it is. Some search proponents maintain that as much as 80 percent of actual or potentially mission-critical enterprise information takes the form of unstructured or semi-structured data.
That’s contrary to the facts, according to TDWI Research, the research arm of The Data Warehousing Institute (TDWI). In an Internet survey conducted late last year, TDWI Research asked respondents to estimate approximate percentages for structured, semi-structured, and unstructured data across their organizations.
Structured data topped the list, according to respondents, accounting for 47 percent of all enterprise information. Respondents put unstructured data next, with 31 percent, followed by semi-structured data (at 22 percent). “Even if we fold semi-structured data into the unstructured data category, the sum … falls far short of the 80 [to] 85 percent mark claimed by other research organizations,” writes TDWI Research senior manager Philip Russom. “The discrepancy is probably due to the fact that TDWI surveyed data management professionals who deal mostly with structured data and rarely with unstructured data.”
Granted, TDWI Research caters largely to data warehousing (DW) and BI technologists—i.e., folks who concern themselves primarily with structured data—but Russom says this doesn’t discount his organization’s findings: “All survey populations have a bias, as this one does from daily exposure to structured data. Yet, the message from TDWI’s survey is that unstructured data is not as voluminous as some claim.”
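The arithmetic behind Russom’s comparison is easy to check. The short Python sketch below sums the survey shares reported above and compares the combined semi-structured and unstructured share with the 80 percent figure cited by search proponents. The variable names are illustrative and do not come from the TDWI report.

```python
# Shares of enterprise information as estimated by TDWI survey respondents (percent).
shares = {"structured": 47, "unstructured": 31, "semi_structured": 22}

# Low end of the 80 to 85 percent range claimed by other research organizations.
claimed_non_structured = 80

# The three shares should account for all enterprise information.
total = sum(shares.values())

# Fold semi-structured data into the unstructured category, as Russom does.
non_structured = shares["unstructured"] + shares["semi_structured"]

# Distance between the survey result and the commonly cited claim.
shortfall = claimed_non_structured - non_structured

print(f"total: {total} percent")
print(f"semi-structured plus unstructured: {non_structured} percent")
print(f"shortfall against the 80 percent claim: {shortfall} points")
```

Even with the two categories combined, the survey figure sits well below the low end of the claimed range.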
On the other hand, as Russom and others point out, unstructured data—which comprises nearly one-third of all enterprise data—is nothing to sneeze at. “We should all pare down our claims about unstructured data volumes, but we should not change our conclusions about what needs to be done,” he notes. “In other words, regardless of how the numbers add up, we all know that the average user organization has a mass of textual information that BI and DW technologies and business processes are ignoring.”
How large is this “mass of textual information,” and what kind of data is it? According to TDWI Research, the overwhelming majority (77 percent) of the data consumed by data warehouses or BI processes is structured. Semi-structured data accounts for just 14 percent of the data fed, on average, to DW or BI processes. This means that less than 10 percent of the data fed to DW and BI processes is purely unstructured. That’s a far cry from unstructured data’s share of enterprise information overall: it accounts for nearly one-third of all enterprise data, according to respondents.
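A rough sketch of that gap, in Python, follows. It assumes the three categories are exhaustive, so that whatever is not structured or semi-structured is unstructured; the article gives only “less than 10 percent” for that remainder, so the derived figure is an inference, and the variable names are illustrative.

```python
# Shares of the data fed to DW and BI processes, per TDWI Research (percent).
dw_structured = 77
dw_semi_structured = 14

# Assumption: the three categories are exhaustive, so the remainder is unstructured.
dw_unstructured = 100 - dw_structured - dw_semi_structured

# Unstructured share of all enterprise data, per survey respondents (percent).
enterprise_unstructured = 31

# Gap between how much unstructured data exists and how much reaches DW and BI.
gap = enterprise_unstructured - dw_unstructured

print(f"unstructured share of DW and BI inputs: {dw_unstructured} percent")
print(f"unstructured share of enterprise data: {enterprise_unstructured} percent")
print(f"gap: {gap} points")
```

Under that assumption, the unstructured share of warehouse inputs is less than a third of its share of enterprise data as a whole.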
Clearly, this is a status quo that’s going to change. As enterprises continue to un-silo and expose once-isolated information assets, they’re going to need tools and technologies to reach into and intelligently catalogue this information, Russom argues.
“The view of corporate performance seen from a data warehouse is incomplete unless it represents—in a structured way—facts discovered in unstructured and semi-structured data,” he points out. There’s a reporting hook, too: “BI platforms today commonly manage thousands of reports, and techniques borrowed from unstructured data management—i.e., search—can make reports a lot more accessible.”
For this reason, Russom and TDWI expect BI search and, to a lesser extent, text analytics to mushroom in popularity over the next few years.
“TDWI suspects that adoption of both BI search and text analytics—though rarely deployed today—will increase over five years, until they are as commonplace as Web GUIs and dashboards are today,” he argues, noting that both technologies were comparatively rare half a decade ago.
For example, TDWI Research says, organizations are increasingly feeding unstructured data into their DW and BI processes. Typical unstructured data sources include voice recognition, wikis, RSS feeds, instant messaging transcripts, and document management systems, Russom indicates.
Russom and TDWI also anticipate a “moderate” increase in the amount of semi-structured data feeding into data warehouses and BI processes. This goes beyond XML and electronic data interchange (EDI) data, too, he suggests.
“The new kid on the block is the RSS feed, which contains both semi- and unstructured data. Most RSS feeds transport prose—unstructured data as text—but are beginning to carry transactions as semi-structured data in markup documents,” he says. “Either way, 22 percent of survey respondents claim that their data warehouse accepts RSS feeds today, and 90 percent anticipate integrating data from RSS feeds in three years. This makes sense, because RSS feeds operate in near-real time over the Web, and many organizations are looking for faster and broader ways to deliver alerts, time-sensitive data, and transactions.”
Finally, organizations are already feeding miscellaneous unstructured data into their DW and BI processes, too. This category includes sources such as e-mails, word processing documents, Web pages, and Web logs (or blogs). The amount of e-mail data fed into DW and BI processes grew by nearly one-half (47 percent) over the last three years, followed by word processing documents (35 percent), Web pages (35 percent), and Web logs (27 percent).
Why such modest gains for such important data sources? “Their increase will be moderate because they’re already established,” Russom indicates.