TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

RESEARCH & RESOURCES

Issues and Techniques in Text Analytics Implementation, Part 2 of 2

How to streamline the information extraction process

January 16, 2008

by Victoria Loewengart

Last week, in the first part of this two-part series, we discussed the potential pitfalls of configuring the information extraction process, including document formatting and improving precision and recall. This week we continue with fine-tuning precision and recall, and we also discuss some post-processing and integration issues.

Dictionary Tagging

Sometimes it is necessary to tag only the entities that match specific data from a repository of known entities. This approach, known as dictionary tagging, yields high-precision extraction but not necessarily high recall.
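To make the precision/recall trade-off concrete, here is a minimal sketch in Python. The dictionary, the document, and the hand-marked "gold" entities are all hypothetical, and plain substring matching stands in for whatever matching a real extraction engine performs.

```python
def precision_recall(tagged, gold):
    """Return (precision, recall) for tagged entities against a gold set."""
    tagged, gold = set(tagged), set(gold)
    true_positives = len(tagged & gold)
    precision = true_positives / len(tagged) if tagged else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical: entities an analyst marked by hand in one document.
gold_entities = {"Acme Corp", "Jane Smith", "Globex", "John Doe"}

# Hypothetical: known entities exported from an existing repository.
dictionary = {"Acme Corp", "Jane Smith", "Initech"}

document = "Jane Smith of Acme Corp met John Doe, a Globex executive."

# Dictionary tagging: tag only dictionary entries that appear in the text.
tagged_entities = {name for name in dictionary if name in document}

precision, recall = precision_recall(tagged_entities, gold_entities)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every tagged entity is correct, but the entities missing from the dictionary ("Globex" and "John Doe") are never found, so recall is lower than precision.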

Tagging entities that are already known and recorded somewhere helps analysts identify documents rich in specific information. When entering data from a document into a database, it helps to know which entities already exist in that database to avoid entering them multiple times. Dictionary tagging can also be used to establish new relationships among the known entities.

One approach to dictionary tagging is to use lexicons created by exporting data from existing repositories, such as Excel spreadsheets or database tables.

Exporting names of people and organizations from a database presents unique problems. People’s names are usually stored in a specific format or order, such as last name, first name, middle initial. However, these names may not appear in a text document in this way. Care must be taken to use entity extraction rules in conjunction with the dictionary list; otherwise, many erroneous “hits” and “misses” may occur.
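One possible way to combine a dictionary with simple rules is to expand each stored name into the forms it might take in running text. The sketch below assumes names are stored as "Last, First M."; the function names `name_variants` and `tag_names` and the chosen variants are illustrative, not taken from any particular product.

```python
import re


def name_variants(stored_name):
    """Expand a 'Last, First M.' record into forms likely to appear in text."""
    last, _, rest = stored_name.partition(",")
    last = last.strip()
    parts = rest.split()
    first = parts[0] if parts else ""
    middle = parts[1].rstrip(".") if len(parts) > 1 else ""
    variants = {stored_name.strip()}
    if first:
        variants.add(f"{first} {last}")        # Jane Smith
        variants.add(f"{first[0]}. {last}")    # J. Smith
    if first and middle:
        variants.add(f"{first} {middle}. {last}")  # Jane Q. Smith
        variants.add(f"{first} {middle} {last}")   # Jane Q Smith
    return variants


def tag_names(text, stored_names):
    """Return {stored_name: [matched strings]} using whole-token matching."""
    hits = {}
    for stored in stored_names:
        # Try longer variants first so the fullest form of a name wins.
        ordered = sorted(name_variants(stored), key=len, reverse=True)
        alternatives = "|".join(re.escape(v) for v in ordered)
        pattern = r"(?<!\w)(?:" + alternatives + r")(?!\w)"
        found = re.findall(pattern, text)
        if found:
            hits[stored] = found
    return hits


text = "Jane Q. Smith chaired the meeting. J. Smith later met Robert Jones."
hits = tag_names(text, ["Smith, Jane Q.", "Jones, Robert"])
print(hits)
```

Note that a variant such as "J. Smith" could equally refer to a different person, which is exactly the kind of erroneous hit that loose rules introduce; how many variants to generate is a tuning decision.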

Thesauri

A properly built thesaurus (or synonym list) can increase the precision and recall of the extracted information. Most information extraction engines can use well-tuned internal thesauri that encapsulate common words.

For a specialized system, however, analysts may want to use a specialized thesaurus, such as a thesaurus of country names and acronyms. To use an external thesaurus effectively, take care not to define the same synonym for multiple head terms. Beware of using short abbreviations in a thesaurus. Anything of four characters or fewer is a candidate for overlap; for example, CA may be a chemical (calcium) or a state (California).
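Both checks described above can be automated before a thesaurus is loaded. The following Python sketch assumes the thesaurus is a simple mapping from head term to synonyms and that matching is case-insensitive; the function name `audit_thesaurus` and the sample entries are hypothetical.

```python
from collections import defaultdict


def audit_thesaurus(thesaurus, max_short_len=4):
    """Report synonyms shared by several head terms and short abbreviations."""
    heads_by_synonym = defaultdict(set)
    for head, synonyms in thesaurus.items():
        for synonym in synonyms:
            # Assumption: the engine matches synonyms case-insensitively.
            heads_by_synonym[synonym.casefold()].add(head)

    shared = {
        synonym: sorted(heads)
        for synonym, heads in heads_by_synonym.items()
        if len(heads) > 1
    }
    short = sorted(s for s in heads_by_synonym if len(s) <= max_short_len)
    return shared, short


# Hypothetical thesaurus entries for illustration only.
thesaurus = {
    "California": ["CA", "Calif."],
    "Calcium": ["Ca"],
    "United States": ["USA", "United States of America"],
}

shared, short = audit_thesaurus(thesaurus)
print("Synonyms with multiple head terms:", shared)
print("Short abbreviations to review:", short)
```

A short abbreviation is not necessarily wrong, so the second list is best treated as a review queue rather than a list of errors.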

Post-Processing Issues

The final step in the information extraction workflow is production of an end product that meets the users’ requirements, such as a report, a diagram, or a set of data imported into a database. The end users must have control in customizing their final product.

Even with all of the pitfalls considered and the best efforts made to remedy them, end users (analysts) always need to add, remove, or correct entities and facts. A text analytic software suite should provide a manual tagging capability (for example, the ability to visually “un-tag” or “un-highlight” entities and facts that are not wanted in the final product).

The system engineer must plan for manual tagging capabilities before implementing the information extraction workflow, because this requirement affects the choice of information extraction software and supporting tools. Additional software tools may be needed to import the results of the information extraction process into an editor so users can modify results visually by highlighting or un-highlighting entities and facts, diagramming, or other means.

Also, serious consideration must be given to integrating the text analytics software with other information management tools. Information extraction is most effective when used in conjunction with other applications. For example, the results of tagging could be visualized as a link diagram among the entities. Sometimes it is desirable to import tagged entities into a database.
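As one illustration of feeding tagging results into a link diagram, the Python sketch below derives weighted links from entities that appear in the same document. Treating co-occurrence as a relationship is an assumption made for this example, and the function name and sample data are hypothetical; a real system might instead link entities through extracted facts.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(entities_by_document):
    """Count how often each pair of entities is tagged in the same document."""
    edges = Counter()
    for entities in entities_by_document.values():
        # Sort so each pair is always stored in the same order.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical tagging results keyed by document ID.
tagged = {
    "doc-1": ["Jane Smith", "Acme Corp"],
    "doc-2": ["Jane Smith", "Acme Corp", "Globex"],
}

edges = cooccurrence_edges(tagged)
for (a, b), weight in edges.most_common():
    print(f"{a} -- {b} (weight {weight})")
```

The resulting edge list can be handed to whichever diagramming or graph tool the analysts already use.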

Most text analytics tools provide visualization applications and database utilities; however, end users often prefer their own tools and familiar third-party applications. The text analytics software must therefore produce XML output with tagged entities, and the XML schema must be well documented so that the output can be transformed for use by other applications.
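A transformation of this kind can be quite small when the schema is documented. The Python sketch below assumes a hypothetical output format with `<entity>` elements carrying `type`, `start`, and `end` attributes; real tools define their own schemas, so the element and attribute names here are placeholders.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical extraction output; real tools define their own schema.
SAMPLE_XML = """<document id="doc-1">
  <entity type="PERSON" start="0" end="10">Jane Smith</entity>
  <entity type="ORGANIZATION" start="14" end="23">Acme Corp</entity>
</document>"""

FIELDS = ["doc_id", "type", "text", "start", "end"]


def entities_to_rows(xml_text):
    """Flatten tagged entities into dictionaries ready for a database load."""
    root = ET.fromstring(xml_text)
    doc_id = root.get("id", "")
    rows = []
    for entity in root.iter("entity"):
        rows.append({
            "doc_id": doc_id,
            "type": entity.get("type", ""),
            "text": (entity.text or "").strip(),
            "start": int(entity.get("start", -1)),
            "end": int(entity.get("end", -1)),
        })
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text, a format most databases can import."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = entities_to_rows(SAMPLE_XML)
csv_text = rows_to_csv(rows)
print(csv_text)
```

CSV is used here only because it is widely importable; the same rows could be written directly to a database table or passed to a visualization tool.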

Conclusion

In setting up an information extraction process, it is difficult to get everything right the first time. As with any complex system, there are always unforeseen pitfalls and problems. With careful planning it is possible to avoid some of them. The information extraction process must evolve over time, with adjustments and tuning, until it meets end users’ productivity needs.

- - -

Victoria Loewengart is the principal research scientist at Battelle Memorial Institute where she researches and implements new technologies and methods to enhance analyst/system effectiveness. You can reach her at [email protected]
