Text Analytics: Beyond Parsing
Text-analytics and natural-language processing technologies can help with the NoSQL tsunami -- but not all such products are the same.
- By Stephen Swoyer
- July 7, 2015
Traditionally, a huge chunk of business data was inaccessible, "locked" away in otherwise siloed log files, application messages, social communications, and other poly-structured sources. Until recently, data management (DM) practitioners didn't pay much attention to it.
They're starting to, thanks in no small part to the NoSQL tsunami, which has forced DM practitioners to come to terms with a universe of data that's not-so-strictly structured.
By all indications, they might be better off paying obsessive attention to the not-so-strictly-structured stuff. According to Howard Lau, CEO of text-analytics specialist Attensity Inc., semi-structured or poly-structured information accounts for as much as 85 percent of business data today. As human beings become more connected -- as devices or interfaces play an increasingly fundamental role in mediating connection between and among human beings -- poly-structured data will become even more ubiquitous and its value will likewise continue to increase.
"You have this tsunami of data that's not just social but internal data, too. This is data that's growing by leaps and bounds." Much of the data that companies have access to but don't analyze -- because they can't -- is unstructured, Lau says.
"Increasingly, the problem is about using the data that you have access to and making the most informed business decision you possibly can. Although social data is a part of that, you have to have a holistic view. This means you need to combine both social and business intelligence technologies," he continues. "In the world of social and other kinds of unstructured data, you have [concepts such as] 'dictionaries.' If you have disparate systems, they're all operating against their own versions of a dictionary. What's needed is an enterprise view of the semantics of an organization ... so you can process all of your data sets through a common set of reference points."
Not surprisingly, Lau and Attensity claim to provide just this. However, in a context in which the market for text-analytics "solutions" seems to have become commoditized, Attensity has its work cut out for it to get its message out. This is one reason it makes intellectual property (IP) a big part of that message.
Many vendors claim to market text-analytics "solutions," Lau observes, but comparatively few actually develop their own IP. To wit: many products use open source software (OSS) text-analytics and natural language processing (NLP) technologies, while still others license text-analytics technology from best-of-breed or OEM vendors. More to the point, Lau contends, few -- if any -- vendors encourage their customers to market (let alone to patent) proprietary products and solutions based on their own IP. "We actually have over half a dozen patents related to text analytics, so we made a concerted investment in the science behind that," Lau explains.
"We enable partners and customers to create their own proprietary solutions using our own technology: they can use our NLP capabilities to create their own products. Some of our customers have even received patents on our technology, which we're fine with," he says. "We provide both a complete out-of-the-box solution and we provide an SDK for customers to use to build their own proprietary solutions. They can say, 'I'll use what you have off the shelf, but I want to build my own.' They build their own products on top of our technology."
When it comes to IP, Lau contends, not all text-analytics technologies are alike.
"A lot of things out there, they look at nouns, [which] they call 'entities,' but what makes us really unique is that we are able to surface relationships between [entities]," he comments.
These relationships -- while implicit in the data -- are abstractions: i.e., they're generalizations, synthetic determinations, and the like. Think back to Logic 101 (Plato is a philosopher, all philosophers are bachelors, therefore ...) for an idea of what's involved here. Even though the available information doesn't explicitly establish a relationship between Plato and bachelorhood, Attensity's technology uses the available information to determine that there's a relationship.
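The Logic 101 example can be made concrete with a toy sketch. The Python below is not Attensity's technology; it is a minimal forward-chaining illustration, assuming that facts are stored as (entity, category) pairs and that rules take the form "every X is a Y." The names `infer`, `facts`, and `rules` are hypothetical and chosen only for this illustration.

```python
def infer(facts, rules):
    """Apply 'every X is a Y' rules repeatedly until no new facts appear."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow during the pass.
        for entity, category in list(known):
            for premise, conclusion in rules:
                derived = (entity, conclusion)
                if category == premise and derived not in known:
                    known.add(derived)
                    changed = True
    return known


# Explicit information: Plato is a philosopher.
facts = {("Plato", "philosopher")}
# General rule from the example: all philosophers are bachelors.
rules = [("philosopher", "bachelor")]

derived = infer(facts, rules)
# The link between Plato and bachelorhood was never stated, only derived.
print(("Plato", "bachelor") in derived)
```

The relationship between Plato and bachelorhood appears in the output even though no input states it directly, which is the sense in which such relationships are implicit in the data.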
There's even more to it, says Lau. He describes an example involving Elton Simpson, one of the two terrorists who attempted to attack a Prophet Muhammad cartoon contest in Garland, Texas.
"It isn't just that we capture these additional facts, [it's that] we link these facts together so that we can say [for example] that [Elton] Simpson 'lied to' authorities. 'Lied to' is a verb between entities. We can say that Simpson lied regarding his planned trip to Africa … even though other technologies would have completely missed that. It's all about relationships: we tag and identify all of the relationship[s] that [are] communicated in a piece of text. As more information becomes available, [Attensity's technology is] able to draw stronger conclusions about relationships."
There's another wrinkle here, too, says Lau: even if you don't market your own IP as a product or service, you're still developing it. Business information is inherently valuable. Data warehouse systems and BI tools give business users a means to extract value (i.e., facts and insights) from strictly-structured business information; text-analytics and NLP technologies do the same thing with semi-structured information. "A lot of organizations haven't embraced that their IP is an asset that needs to be managed. If that [is true], who's overseeing that? [Businesses] should embrace that the value of the organization is really in the IP, as an asset," he comments.
"You need a domain, a repository, where you build over time your IP," Lau concludes, likening "basic" text-analytics or NLP technologies to parsing engines. "Any type of parsing engine that you have should be able to reference this domain because it is a unique asset for an organization."