TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

TDWI Articles

Using Text Analytics and NLP: An Introduction

How the power of text analytics and natural language processing can extract actionable insights from your unstructured text data.

By Dheeraj Nallagatla
June 3, 2019

Every business wants to get the most from its data, but unlike legacy data types, today's rising volume of data is not well structured -- especially text data, which includes conversations, social posts, surveys, product reviews, documents, and customer feedback.

For Further Reading:

3 Use Cases for Unstructured Data

Natural Language Generation: 3 Reasons It's the Next Wave of BI

Using OCR: How Accurate is Your Data?

Businesses can tap into the power of text analytics and natural language processing (NLP) to extract actionable insights from text data. Here's how it works.

Text Analytics Basics

Text analytics (also known as text mining or text data mining) is the process of extracting information and uncovering actionable insights from unstructured text.

Text analytics allows data scientists and analysts to evaluate content to determine its relevancy to a specific topic. Researchers mine and analyze text by leveraging sophisticated software developed by computer scientists.

Example business use cases for text analytics include:

Customer 360. Analyzing customer email, surveys, call center logs, and social media streams such as blogs, tweets, forum posts, and newsfeeds to understand customers better.

Warranty analysis. Understanding text from dealer service professionals, warranty claims, orders, and similar sources.

Product or service reviews. Analysis of customer reviews of products or services helps enterprises understand user sentiment or common issues customers are talking about.

Recruitment. Keyword analysis (comparing profiles with job descriptions) helps in short-listing suitable candidates.
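
As a rough illustration of the keyword analysis mentioned under recruitment, the sketch below scores candidate profiles by the share of job-description keywords they contain. The stop-word list, the sample texts, and the function names are assumptions made up for this example; they do not describe any particular product.

```python
import re

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "with", "for"}


def keywords(text):
    """Lowercase the text and return its set of non-stop-word tokens."""
    found = re.findall(r"[a-z0-9+#]+", text.lower())
    return {word for word in found if word not in STOP_WORDS}


def match_score(profile, job_description):
    """Fraction of job-description keywords that also appear in the profile."""
    wanted = keywords(job_description)
    if not wanted:
        return 0.0
    return len(wanted & keywords(profile)) / len(wanted)


# Sample data invented for illustration.
job = "Data analyst with SQL and Python experience"
profiles = {
    "candidate_a": "Analyst experienced in Python, SQL, and dashboards",
    "candidate_b": "Graphic designer with branding experience",
}

# Rank candidates from best to worst keyword match.
ranked = sorted(profiles, key=lambda name: match_score(profiles[name], job), reverse=True)
print(ranked)
```

Exact keyword matching is deliberately naive: "experienced" does not match "experience" here, which is one reason the preparation steps described later in this article matter.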

The Text Analytics Process

There are many ways text analytics can be implemented depending on the business needs, data types, and data sources. All share four key steps.

Step 1: Data Acquisition

Text analytics begins with collecting the text to be analyzed -- defining, selecting, acquiring, and storing raw data. This data can include text documents, web pages (blogs, news, etc.), and online reviews, among other sources. Data sources can be internal or external to an organization.

Step 2: Data Preparation

Once data is acquired, the enterprise must prepare it for analysis. The data must be in the proper form to work with machine learning models that will be used for data analysis. There are four stages in data preparation:

Text cleansing removes any unnecessary or unwanted information, such as ads from web pages. The text is then restructured so that it can be read the same way across the system, which improves data integrity (this restructuring is also known as "text normalization").

Tokenization breaks up a sequence of characters into pieces (such as words, keywords, phrases, symbols, and other elements) called tokens. Semantically meaningful pieces (such as words) will be used for analysis.

Part-of-speech tagging (also referred to as "PoS tagging") assigns a grammatical category to each identified token. Familiar grammatical categories include noun, verb, adjective, and adverb.

Parsing creates syntactic structures from the text based on the tokens and PoS models. Parsing algorithms consider the text's grammar for syntactic structuring. Sentences with the same meaning but different grammatical structures will result in different syntactic structures.
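
The first three stages can be sketched in a few lines of Python using only the standard library. This is a minimal sketch under stated assumptions, not a production pipeline: the function names, the sample sentence, and the tiny tag lexicon are all invented for illustration, and real part-of-speech taggers are trained statistical models rather than lookup tables. Parsing is left out because it needs a dedicated NLP library.

```python
import html
import re
import unicodedata


def cleanse(raw):
    """Strip markup and normalize the text so it reads the same everywhere."""
    text = re.sub(r"<[^>]+>", " ", raw)           # drop HTML tags
    text = html.unescape(text)                     # decode entities such as &nbsp;
    text = unicodedata.normalize("NFKC", text)     # unify equivalent Unicode forms
    return re.sub(r"\s+", " ", text).strip().lower()


def tokenize(text):
    """Split cleansed, lowercase text into word tokens and punctuation tokens."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?|[^\sa-z0-9]", text)


# Toy lexicon (an assumption for this sketch), mapping known words to categories.
TOY_LEXICON = {
    "the": "DET",
    "new": "ADJ",
    "battery": "NOUN",
    "drains": "VERB",
    "quickly": "ADV",
}


def tag(tokens):
    """Attach a grammatical category to each token; unknown words get UNK."""
    tagged = []
    for token in tokens:
        if not token[0].isalnum():
            tagged.append((token, "PUNCT"))
        else:
            tagged.append((token, TOY_LEXICON.get(token, "UNK")))
    return tagged


raw = "<p>The new battery&nbsp;drains   QUICKLY.</p>"
tokens = tokenize(cleanse(raw))
print(tag(tokens))
```

Each function takes the previous stage's output as its input, which mirrors the order in which the stages are described above.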

Step 3: Data Analysis

Data analysis is the process of analyzing the prepared text data. Machine learning models can be used to analyze huge volumes of data, and the results are typically delivered through an API in JSON format or exported as a CSV/Excel file. There are many ways data can be analyzed; two popular approaches are text extraction and text tagging.
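
To make the output formats concrete, here is a small sketch that writes the same analysis results as JSON and as CSV. The record fields (doc_id, tag, score) and their values are hypothetical; an actual service would define its own schema.

```python
import csv
import io
import json

# Hypothetical analysis results: one record per analyzed document.
results = [
    {"doc_id": 1, "tag": "billing", "score": 0.91},
    {"doc_id": 2, "tag": "shipping", "score": 0.78},
]

# JSON, the shape an API response body might take.
as_json = json.dumps({"results": results}, indent=2)

# CSV, written to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["doc_id", "tag", "score"])
writer.writeheader()
writer.writerows(results)
as_csv = buffer.getvalue()

print(as_json)
print(as_csv)
```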

Simply stated, text extraction is the process of identifying structured information from unstructured text. Text tagging is the process of assigning tags to text data based on its content and relevance.
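
The difference between the two approaches can be shown with a short sketch. Extraction pulls structured fields out of a message, while tagging labels the message as a whole. The order-number format, the tag vocabulary, and the sample message are assumptions invented for this example, and the simple keyword rules stand in for the machine learning models a real system would likely use.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
ORDER_ID = re.compile(r"\bORD-\d{5}\b")  # hypothetical order-number format

# Hypothetical tag vocabulary: each tag maps to the words that trigger it.
TAG_KEYWORDS = {
    "refund": {"refund", "money", "charged"},
    "delivery": {"late", "shipping", "delivery"},
}


def extract(text):
    """Text extraction: pull structured fields out of unstructured text."""
    return {
        "emails": EMAIL.findall(text),
        "order_ids": ORDER_ID.findall(text),
    }


def assign_tags(text):
    """Text tagging: label the text based on the words it contains."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(tag for tag, triggers in TAG_KEYWORDS.items() if words & triggers)


message = "Order ORD-48213 arrived late. Please refund me at jane.doe@example.com."
print(extract(message))
print(assign_tags(message))
```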

Two common models for text tagging are "bag of words" and "Word2Vec."

The bag-of-words method is the easiest method to understand, but it is dated and has largely been superseded where context matters. This method simply counts how often each word occurs within the text content, regardless of location and context. The disadvantage of this technique is that it does not offer a way to understand context from words -- content that repeats a word more often is given a higher (and, falsely, more relevant) score.
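
A few lines of Python are enough to see both the method and its weakness. In this sketch, the relevance function and the sample texts are made up for illustration; the scoring rule (raw count of a term) is the simplest possible reading of bag of words.

```python
from collections import Counter
import re


def bag_of_words(text):
    """Count how often each word occurs, ignoring order and context."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def relevance(text, term):
    """Naive relevance score: the number of times the term appears."""
    return bag_of_words(text)[term]


genuine = "The battery lasts two days and charges quickly."
stuffed = "battery battery battery cheap battery buy battery"

# The keyword-stuffed text outscores the genuinely informative one.
print(relevance(genuine, "battery"), relevance(stuffed, "battery"))

# Word order is lost, so sentences with opposite meanings look identical.
print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))
```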

Word2Vec has become the preferred method of text tagging. Word2Vec uses a shallow neural network to turn each word in the collected text into a vector that captures how the word is used, so words that appear in similar contexts (including synonyms) receive similar vectors. For example, the terms "man" and "boy" can be closely related. Spelling variants such as "humor" and "humour" also end up with similar vectors, so they can be treated the same way. The result is a vector space of related words: the closer two words are to each other in that space, the stronger their relationship. These vectors allow algorithms to better understand the context of words, so data scientists can generate better analysis of content relevancy.
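
Closeness between word vectors is commonly measured with cosine similarity. The sketch below computes it over a handful of hand-made three-dimensional vectors. The numbers are invented purely for illustration; a trained Word2Vec model would learn vectors with many more dimensions from a large corpus, and nothing here performs that training.

```python
import math

# Hand-made vectors, invented for this example only.
VECTORS = {
    "man":    [0.9, 0.1, 0.0],
    "boy":    [0.8, 0.3, 0.0],
    "humor":  [0.0, 0.2, 0.9],
    "humour": [0.0, 0.25, 0.9],
}


def cosine(a, b):
    """Cosine similarity: near 1.0 for vectors pointing the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def most_similar(word):
    """Return the other word whose vector is closest to the given word's."""
    others = (w for w in VECTORS if w != word)
    return max(others, key=lambda w: cosine(VECTORS[word], VECTORS[w]))


print(most_similar("man"))
print(most_similar("humor"))
```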

Step 4: Data Visualization

Visualization is the process of transforming analysis into actionable insights, representing the data in graphs, tables, and other easy-to-understand representations. Organizations can use a wide variety of commercial and open source visualization tools.
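
As a plain-text stand-in for a charting tool, the following sketch renders tag counts as a horizontal bar chart. The counts and the bar_chart helper are hypothetical, and a real deployment would more likely hand the same numbers to a dedicated visualization product.

```python
from collections import Counter


def bar_chart(counts, width=20):
    """Render counts as horizontal text bars scaled to the largest value."""
    if not counts:
        return ""
    largest = max(counts.values())
    label_width = max(len(label) for label in counts)
    lines = []
    # Largest count first; ties are ordered alphabetically by label.
    for label, value in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
        bar = "#" * max(1, round(width * value / largest))
        lines.append(f"{label.ljust(label_width)} | {bar} {value}")
    return "\n".join(lines)


# Hypothetical tag counts from an analysis run.
tag_counts = Counter({"delivery": 40, "refund": 25, "praise": 10})
print(bar_chart(tag_counts))
```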

The Role of Natural Language Processing

NLP is a component of text analytics. Most advanced text analytics platforms and products use NLP algorithms for linguistic (language-driven) analysis that helps machines read text. NLP analyzes words for relevancy, including related words that should be considered equivalent, even if they are expressed differently (e.g., humor vs. humour). It's the workhorse behind steps 2 and 3 described above.

One popular application of NLP is identifying relevant, quality content for search engines. For example, Google uses NLP in several ways, the most prominent of which is in search engine organization and categorization.

Long ago, a webmaster could achieve a higher rank in Google search results just by stuffing keywords into web content, so Google revised how its search engine processed content using numerous algorithms and NLP. NLP helps Google identify "spammy" content and categorize it. Google may de-index this content, penalize it, or simply rank it much lower than other content.

NLP is also used in email spam filters. Spammers try their best to evade such filters by changing words around, purposely misspelling words, or using synonyms. Email spam filters use a variety of factors to identify and block spam, phishing, and malicious content. Gmail's filter, for example, incorporates machine learning and NLP to analyze message content. If a message is judged likely to be spam, it is sent to the user's junk folder. For some content, Gmail deletes the message.
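
One classic text-classification technique often used to teach spam filtering is the naive Bayes classifier, sketched below. This is not a description of how Gmail or any other provider works; production filters combine many more signals. The class name, the training messages, and the two labels are assumptions for this example, which also assumes each label has at least one training message.

```python
from collections import Counter
import math
import re


def tokens(text):
    """Lowercase word tokens; '$' is kept because it often appears in spam."""
    return re.findall(r"[a-z0-9$]+", text.lower())


class NaiveBayes:
    """Minimal multinomial naive Bayes over two labels: 'spam' and 'ham'."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.word_counts[label].update(tokens(text))
        self.doc_counts[label] += 1

    def score(self, text, label):
        """Log-probability of the label given the text, up to a shared constant."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total = sum(self.word_counts[label].values())
        log_prob = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in tokens(text):
            # Add-one smoothing keeps unseen words from zeroing out the score.
            count = self.word_counts[label][word] + 1
            log_prob += math.log(count / (total + len(vocab)))
        return log_prob

    def classify(self, text):
        return max(("spam", "ham"), key=lambda label: self.score(text, label))


# Tiny training set invented for illustration.
model = NaiveBayes()
model.train("win free money now", "spam")
model.train("free prize claim now", "spam")
model.train("meeting agenda for monday", "ham")
model.train("lunch on monday", "ham")

print(model.classify("claim your free money"))
print(model.classify("agenda for lunch meeting"))
```

A filter this simple is easy to evade with the misspellings and synonyms described above, because it only recognizes exact tokens it has already seen.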

A decade ago, applying NLP was comparatively complicated. AI-based technologies (including NLP and text analytics) have evolved considerably, and there are many cloud services, commercial products, and open source platforms businesses can leverage. Here are a few open source NLP tools:

Stanford CoreNLP
Natural Language Toolkit
Apache Lucene and Solr
Apache OpenNLP
GATE and Apache UIMA

A Final Word

Text analytics isn't new, but it is still unfamiliar to many organizations. With APIs, cloud-based AI services, and open source platforms available today, your business can leverage the power of text analytics to get a competitive edge by better understanding your customers and improving your brand's value.

About the Author

Dheeraj Nallagatla is the founder of Dataflix. He is a serial entrepreneur with a history of building customer-centric solutions. Prior to his entrepreneurial journey, he consulted for tech companies including Google, Yahoo, CNET, and YouTube. You can reach the author here or via LinkedIn.
