Issues and Techniques in Text Analytics Implementation, Part 1 of 2
How to streamline the information extraction process
by Victoria Loewengart
Text analytics is a process of extracting information from unstructured or semi-structured machine-readable documents. Text analytics software delivers entity extraction and relationship discovery based on collections of documents, thus helping end users (typically analysts) glean necessary information and make decisions.
No matter how good the text analytics software is, however, it is the system administrator, the knowledge worker, and the software system engineer who make the vast numbers of documents "consumable" by the text analytic software. They make the results relevant to the end user and the workflow flawless.
For the knowledge management professional familiar with text analytics and information extraction concepts, we will explore implementation techniques. We will also point out the problems and pitfalls that can hinder the implementation of the information extraction workflow and offer solutions.
The first step in the information extraction workflow is compiling and standardizing a collection of documents that hold the information you want to extract.
For text analytics software to analyze collections of documents efficiently and consistently, the documents must be in a standard format. Most of the text analytics community has adopted XML (Extensible Markup Language) as the standard document format. Since XML format is ASCII text with tags, the documents must be ASCII text before XML tags are added.
To extract entities, relationships and facts from the document, text analytics software uses extraction rules (discussed later) that heavily rely on language grammar. The ideal document for information extraction software is a grammatically correct ASCII text narrative document.
Unfortunately, the majority of documents do not come this way. Documents are stored in a variety of formats, such as Adobe PDF, MS Word, HTML, Excel, and PowerPoint. Many documents are created as a result of cutting and pasting from different sources, including Web sites. Large numbers of documents are poorly scanned, which results in poor OCR (Optical Character Recognition) processing. These documents must be converted to ASCII text and then to XML before they can be submitted to text analytics tagging engines.
Even though many of the text analytics software suites come with their own text converters, these converters are not always optimal. Often, unreadable characters or meaningless text strings result from converting documents to ASCII text and some additional "clean-up" might be necessary.
Further "batch editing" to find and replace unwanted characters or text strings in the document collection is advisable. Some text analytic software suites provide the tools to preprocess documents by specifying "find and replace" rules on the whole collection prior to entity tagging. If this capability is absent, global "find and replace" in a text editor may help on a modestly-sized text collection.
The likely candidates for the "find and replace" operation are "<" or ">" characters that are not part of a document but appear in the text because of bad scanning/OCR processing. They often cause errors in XML conversions, because they are interpreted as the "<" and ">" characters that enclose XML tags.
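A minimal sketch of such a clean-up pass, written in Python for illustration. The rule list, the function names, and the "document" element name are assumptions made for this example; they are not taken from any particular text analytics suite.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical "find and replace" rules, applied in order to every document.
RULES = [
    (re.compile(r"[<>]"), " "),                      # stray angle brackets from bad OCR
    (re.compile(r"[^\x09\x0A\x0D\x20-\x7E]"), " "),  # anything outside printable ASCII
    (re.compile(r"[ \t]{2,}"), " "),                 # collapse the gaps left behind
]

def clean_text(text: str) -> str:
    """Apply every find-and-replace rule to one document's text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text.strip()

def to_xml(text: str, tag: str = "document") -> str:
    """Wrap cleaned text in a single XML element, escaping '&' and friends."""
    return f"<{tag}>{escape(clean_text(text))}</{tag}>"

raw = "Smith met < Jones in > Paris & London"
xml_doc = to_xml(raw)
ET.fromstring(xml_doc)  # raises ParseError if the result is not well-formed
print(xml_doc)
```

Parsing the result before handing it to the tagging engine is a cheap way to catch any document that the rules failed to clean. Removing every angle bracket is only safe when the source documents are known not to use them legitimately.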
Sometimes it is necessary to process scanned documents of 100-200 MB. (In comparison, War and Peace is only 6 MB in PDF format or 3 MB in ASCII text.) One solution is to convert the whole non-ASCII file into ASCII text. The ASCII text file is usually one-half the size of a PDF file with the same content.
The next step in the workflow is extracting entities and facts from a document or a collection of documents using extraction software. The challenge at this point is to extract information that is pertinent to the needs of the end user or analyst.
The quality of the information extraction process is measured by precision and recall.
Precision is the proportion of relevant entities retrieved to all of the entities retrieved in a document or a collection of documents. Recall is the proportion of relevant entities retrieved to all of the relevant entities in a document or a collection of documents. The higher these metrics, the more useful the results are to end users.
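The two definitions can be sketched in a few lines of Python. The entity names and the function name below are invented for illustration, and the sketch assumes that a hand-checked set of relevant entities is available to compare against.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one extraction run."""
    hits = retrieved & relevant  # relevant entities that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: the engine retrieved four entities, three of them relevant,
# while the document actually contains five relevant entities.
retrieved = {"John Smith", "Acme Corp", "Paris", "Brown"}
relevant = {"John Smith", "Acme Corp", "Paris", "Mary Jones", "London"}

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.60
```

Here one wrong entity lowers precision to 3/4, and two missed entities lower recall to 3/5.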
Most analysts prefer precision over recall because they feel it is better not to get a piece of information at all than to get it wrong.
Text analytic software uses information extraction rules to extract information from a document. Information extraction rules are algorithms used to search for and extract information based on language grammar, text patterns, specific constraints, and lexicons.
There are times when the extraction rules included "out-of-the-box" in a text analytic software suite are not enough. The subject matter of interest may be unique and thus no information extraction rules for this subject exist. When the content of a document consists of disjointed phrases or words, such as text in PowerPoint slides or Excel spreadsheets, grammar-based information extraction rules are ineffective. These situations are partially solved by using expanded lexicons, exclusion lists, dictionary tagging techniques, and thesauri.
A lexicon is a list of words that provides a vocabulary for a simple entity type. For example, the lexicon for the entity type "fruit" could be a list such as "apple, banana, orange." The tagging engine sees the word "apple" in a document and tags it as entity type "fruit."
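As a rough illustration, lexicon-based tagging can be sketched in Python as a whole-word search. This is a simplified stand-in for a real tagging engine; the data layout and the function name are assumptions made for this example.

```python
import re

# Hypothetical lexicons: entity type -> vocabulary for that type.
LEXICONS = {"fruit": ["apple", "banana", "orange"]}

def tag_entities(text: str, lexicons: dict[str, list[str]]) -> list[tuple[str, str, int]]:
    """Return (matched text, entity type, character offset) for every lexicon hit."""
    tags = []
    for entity_type, words in lexicons.items():
        # \b keeps "apple" from matching inside "pineapple".
        pattern = re.compile(
            r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE
        )
        for match in pattern.finditer(text):
            tags.append((match.group(0), entity_type, match.start()))
    return sorted(tags, key=lambda tag: tag[2])

print(tag_entities("She ate an apple and a banana.", LEXICONS))
```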
Custom-developed lexicons of specific subject matter entities will enhance both precision and recall of the extracted information. However, they may cause problems as well.
Sometimes problems arise when using a list of people's last names as a lexicon to enhance name extraction. If the lexicon supersedes the pre-defined "people" extraction rules, the entity extraction software may mistake common words for the people's last names. For example, nouns such as "fox" or "brown" may be interpreted as last names if there are people with last names of Fox or Brown on the list of known people.
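The effect is easy to reproduce with a toy example in Python. The surname list and the function are hypothetical, and the case-sensitive variant is shown only as one possible mitigation, not as how any particular product resolves the conflict.

```python
import re

SURNAMES = ["Fox", "Brown"]  # hypothetical lexicon of known last names

def tag_surnames(text: str, case_sensitive: bool) -> list[str]:
    """Return every whole-word occurrence of a lexicon surname in the text."""
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, SURNAMES)) + r")\b", flags
    )
    return [match.group(0) for match in pattern.finditer(text)]

text = "Agent Brown saw a brown fox near the office of Mr. Fox."
print(tag_surnames(text, case_sensitive=False))  # ['Brown', 'brown', 'fox', 'Fox']
print(tag_surnames(text, case_sensitive=True))   # ['Brown', 'Fox']
```

Case-sensitive matching removes the false hits in this sentence, but it would still mistag a common word capitalized at the start of a sentence, so it does not replace grammar-based rules.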
Tagging relevance refers to how well the results meet the information needs of the user. The definition of what constitutes a "relevant" entity or fact is subjective. If an extraction engine retrieves a "partial" entity (e.g., last and first name of a person, but not the middle name), does it constitute a hit? It may if the analyst just wants to get an idea of where the entity of interest is located within the document, but it may not if the analyst must correct this occurrence to include the entire entity in a database or other repository.
Some entities, even though correctly identified and extracted by the information extraction engine, are not of interest to the user. What is worse, sometimes having them in the same document as the relevant entities provides erroneous relationships. For example, a document has a recurring reference to the Department of Homeland Security investigating Al-Qaeda. Even though information on Al-Qaeda may be of relevance to an analyst, its "link" to the Department of Homeland Security is not.
What is "relevant" must be decided prior to designing the information extraction process, so that document processing is consistent.
One way to help resolve relevancy issues is to create exclusion lists; i.e., the lists of entities that are not to be tagged in a document. These exclusion lists are used by customized information extraction rules during the information extraction process.
There must be one exclusion list per entity type (person, organization), since what is relevant for one type of entity (person) may not be relevant to another (organization). For example, Thompson (person) could be relevant to the end user; Thompson (organization) could be irrelevant.
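A sketch of per-type exclusion lists in Python, applied here as a filter over already-tagged entities. The list contents, the tuple layout, and the function name are assumptions made for illustration; a real suite would apply the lists inside its own extraction rules.

```python
# Hypothetical exclusion lists, one per entity type (stored in lowercase).
EXCLUSIONS = {
    "person": {"john doe"},
    "organization": {"thompson"},
}

def apply_exclusions(
    entities: list[tuple[str, str]], exclusions: dict[str, set[str]]
) -> list[tuple[str, str]]:
    """Drop (text, entity type) pairs whose text is excluded for that type."""
    kept = []
    for text, entity_type in entities:
        if text.lower() in exclusions.get(entity_type, set()):
            continue  # excluded for this entity type only
        kept.append((text, entity_type))
    return kept

tagged = [
    ("Thompson", "person"),
    ("Thompson", "organization"),
    ("Acme Corp", "organization"),
]
print(apply_exclusions(tagged, EXCLUSIONS))
# [('Thompson', 'person'), ('Acme Corp', 'organization')]
```

Keying the lists by entity type is what lets Thompson the person survive while Thompson the organization is dropped.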
Another dilemma needs to be resolved: if the entity itself is not relevant, could its relationship with other entities be relevant? If the answer is yes, this entity should not be in the exclusion list, so as not to eliminate a relevant relationship.
For further fine tuning, exclusion lists may be created for specific groups of users (even for individual users) and possibly per collection of documents.
There are more techniques that can be used for fine-tuning the information extraction process. In the second part of this article, we will explore dictionary tagging and the use of thesauri for better precision and recall. We will also discuss post processing of the end product.
- - -
Victoria Loewengart is the principal research scientist at Battelle Memorial Institute where she researches and implements new technologies and methods to enhance analyst/system effectiveness. You can reach her at firstname.lastname@example.org