Executive Perspective: Text, Voice, and Facial Recognition

Natural language processing is helping enterprises analyze their data, but its complexities extend beyond understanding inputs and outputs. Behavioral Signals’ CEO, Rana Gujral, explains NLP’s strengths and weaknesses and compares text, voice, and facial recognition technologies.

Upside: How has NLP been employed to help enterprises analyze their data?

Rana Gujral: NLP (natural language processing) and NLU (natural language understanding) are used wherever we need to analyze speech data. That can include analyzing words and sentences in text format as well as measuring speaking rate, overlap, and many more attributes of speech, such as tonality and other vocal cues that give us information about the speaker’s emotions and behaviors. Typical industries that use NLP are contact centers and other customer-facing ventures that collect a lot of voice data for analytics.
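As a small, hedged illustration of the text side of this, the snippet below uses spaCy (a common open-source NLP library, chosen here purely as an example rather than anything Behavioral Signals is confirmed to use) to tokenize a customer sentence and tag parts of speech and entities. It assumes the small English model has been installed with: python -m spacy download en_core_web_sm

# Illustrative text-side NLP with spaCy.
# Assumes the model is installed: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The agent resolved my billing issue in ten minutes.")

# Tokens with part-of-speech tags, plus any named entities spaCy finds.
print([(token.text, token.pos_) for token in doc])
print([(ent.text, ent.label_) for ent in doc.ents])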

What is the relationship among NLP, ML, and neural networks?

NLP combines linguistics, computer engineering, and artificial intelligence to help machines understand human language. Machine learning, deep learning, and neural networks are all subfields of artificial intelligence that learn from data in different ways.

The purpose of NLP is to develop machines that can read, understand, and derive meaning from human languages. Once that is achieved at a significant level, it will permit us to take a step further and enable machines to have meaningful conversations with humans, where they can understand context, comprehend the emotional state of the human speaker, and respond with empathy.

What are the strengths and weaknesses of NLP? What does it do well and where does it need improvement?

NLP is constantly evolving. The more research conducted (by hundreds of research teams from all over the world), the better it gets. Its boundaries will keep being pushed as science and technology evolve and gain a greater ability to understand language, vocal cues, and cultural differences. NLP is complex; it carries all the limitations of linguistics, computer engineering, and AI.

Right now, NLP can -- at a very high level compared to previous years -- recognize speech and convert it to text, which allows for further analysis. Understanding and learning to deduce meaning from what it analyzes is the new hilltop it needs to conquer, and we’re confident it will. It’s only a matter of time and data.

NLP can be tricky. Traditionally it doesn’t do well assessing irony and sarcasm and other complex features of language. How is your AI processing able to get these things right?

NLP is science; with the right tools and methodology, it can analyze any data and capture sarcasm or irony, but it needs to be trained first. The differences between irony and sarcasm are so subtle that not even the human brain is always capable of capturing them or even explaining why one sentence is ironic while the other is sarcastic.

When people say NLP can’t capture sarcasm, they’re talking about accuracy. A 50 percent accuracy rate, where a machine understands sarcasm half the time, might be considered low. What everyone desires is higher accuracy. Everything boils down to data and human understanding.

For a machine to learn what sarcasm and irony are, you need to collect a sizable amount of speech data (in this case, dialogue that includes sarcasm or irony) and have human annotators capture it and label it correctly. Then you can build ML models to train the machine to recognize these aspects when they happen in a new conversation. Finding open speech data with irony or sarcasm is not that easy. Human annotators need to go through tons of data to find those unique dialogues and pick them out by hand.

Movies or YouTube videos are also not the most appropriate sort of data because most of them are staged. Although it might be easy to vocally show satisfaction or anger, sarcasm and irony often require context and purpose to be properly expressed, making such dialogue difficult to find. Having said that, it’s just a matter of time and effort to collect this data to train an NLP model, not a matter of its ability. We will have enough data to train computers to capture sarcasm and irony when they happen in conversations with pretty high accuracy.
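To make the annotate-then-train step described above concrete, here is a minimal sketch using scikit-learn on a handful of hypothetical, annotator-labeled utterances. A production sarcasm detector would need far more data and would likely add acoustic features as well; this is only the shape of the workflow.

# Minimal sketch: train a sarcasm classifier on annotator-labeled text.
# The utterances and labels below are hypothetical illustrations.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

utterances = [
    "Oh great, another meeting that could have been an email.",
    "Thanks for the quick reply, that really helped.",
    "Wow, I just love waiting on hold for an hour.",
    "The package arrived on time, thank you.",
]
labels = [1, 0, 1, 0]  # 1 = sarcastic, 0 = literal

# TF-IDF n-gram features feeding a simple linear classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(utterances, labels)

print(model.predict(["Sure, take your time, it's not like I'm in a hurry."]))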

What are the benefits of using NLP over facial recognition?

Speech recognition combined with facial recognition can be very powerful for analyzing human identity and emotional states. Using them individually, however, allows us to capture different sorts of data. Facial recognition uses expressions produced by muscles in the face, whereas voice captures data produced by muscles starting from the mouth all the way down to the lungs and stomach.

The data extracted from voice is richer and can convey a whole range of emotions with every single sentence spoken. It is also much harder to manipulate. Although we might be able to control our face (and some people are masters at that), it is not so easy to control our voice precisely because it is so complicated. Our brain not only has to produce a series of words and sentences that include language, grammar, context, and meaning, it also has to enrich it with tones to express feelings -- something that is very difficult to control. It might be easy to convince ourselves to remain calm and collected in a difficult situation, but we all know it’s much harder to control the words we say or how we say them compared to our faces.

Why is voice/speech recognition more private and more accurate than facial recognition?

Going back to the previous question, it’s all a matter of control and how much we can manipulate our face and voice. However, accuracy and privacy are two different things.

Regarding accuracy, it is easier to show happiness or anger with our face than it is to show it with our voice. Voice is very difficult for humans to manipulate emotionally: not only do they have to control what they say but also how they say it. This “how” includes, but is not limited to, volume, speed, arousal, and tone. Voice is more accurate in revealing what a person really feels than his or her face because it is harder to manipulate. Obviously, not everyone is manipulating their face or voice or always intentionally hiding what they are feeling, so both can be effective for emotion recognition. A combination of face and voice can yield even more accurate results.

What our technology can do is capture these subtleties in voice and measure them, and we usually only need a few seconds to understand the intention of the speaker.

As for privacy, our technology on the one hand analyzes anonymized voice recordings, and on the other hand, it works on how something is being said, not what is actually being said. This means we can analyze a voice recording in any language without knowledge or translation of that language -- we don't have to know what is being said. There is especially great value in the “unsaid,” and that is where our data lives.

I think it’s interesting that your product can handle audio without converting it to text first. What’s the advantage of skipping that step?

User privacy is the first advantage. That allows us to analyze voice data such as financial or medical conversations without actually capturing personal or sensitive data. We can deduce emotions or behaviors, such as customer satisfaction or propensity to buy, without analyzing the text. Being language agnostic is the second advantage. We don’t need to know the actual language to analyze a dialogue. Vocal cues for emotions are universal for almost all languages. We say “almost” because we want to leave room for cultural differences where some expressions may differ.

As mentioned, it is human nature to enrich our speech with vocal cues that express more than the words say. Even silence or long gaps have meaning when we talk. Our brain is trained from a very young age to capture the subtleties in dialogue. To paint you a picture, imagine how many different ways your spouse can ask “Did you remember to buy milk?” when you return home late from work. It can range from neutral or pleasant right up to frustrated or angry. They are the same words but with a very different meaning. The takeaway from that example is that it is all in how those words were said.
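As a rough illustration of analyzing the “how” without the “what,” the sketch below pulls a few prosodic measurements (loudness, pitch, and the proportion of voiced speech) straight from audio with the open-source librosa library. The file name and parameter choices are only examples, not Behavioral Signals’ actual pipeline.

# Sketch: extract prosodic cues directly from audio, with no transcription step.
# The clip name and parameter values are illustrative only.
import numpy as np
import librosa

y, sr = librosa.load("call_segment.wav", sr=16000)  # hypothetical audio clip

# Loudness proxy: short-term RMS energy per frame.
rms = librosa.feature.rms(y=y)[0]

# Pitch (fundamental frequency) contour via probabilistic YIN.
f0, voiced_flag, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)

# Crude pacing proxy: fraction of frames that contain voiced speech.
voiced_ratio = float(np.mean(voiced_flag))

print(f"mean energy:  {rms.mean():.4f}")
print(f"median pitch: {np.nanmedian(f0):.1f} Hz")
print(f"voiced ratio: {voiced_ratio:.2f}")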

How would companies integrate this type of capability/software into their business model to measure success?

There are two main ways to consume our technology -- either through integration in existing software platforms or via our API. We are already working with major software providers for call centers to integrate our capabilities into their offerings. That includes specific outcomes, such as agent alerts or behavioral profile pairing, that map to highly targeted business KPIs in a contact-center environment.

Regarding our API -- it’s very robust and constantly evolving to give a wider range of emotions with higher accuracy. A company could connect and develop their own software utilizing any of the outputs we produce. For example, a robotic home assistant for the elderly or a smart toy that can play with a child, have a meaningful dialogue, and understand what the child is feeling.
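For a sense of what consuming such an API could look like, here is a purely hypothetical sketch of posting a recording to an emotion-analytics endpoint over HTTP. The URL, authentication scheme, and response fields are illustrative assumptions, not Behavioral Signals’ documented interface.

# Hypothetical example of calling an emotion-analytics REST API.
# The endpoint, fields, and response shape are assumptions for illustration.
import requests

API_URL = "https://api.example.com/v1/analyze"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"

with open("sales_call.wav", "rb") as audio_file:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"audio": audio_file},
    )

result = response.json()
# e.g. {"anger": 0.12, "satisfaction": 0.81, "propensity_to_buy": 0.67}
print(result)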

How can companies identify a need for this solution in their specific data set?

It’s not easy, is the simple answer, and that’s because companies don’t realize this technology is feasible and can produce dependable results with very good accuracy. That is a major learning curve -- introducing capabilities they didn’t know existed. Once we show them it is feasible, and once we start analyzing their own voice data and they see the results it can actually produce, they get hooked. They see the possibilities. A good example is how we can predict with 86 percent accuracy whether a customer is going to buy within the first 30 seconds of a sales call. That gets them interested.

Another example is customer satisfaction. It is usually deduced from surveys (which are cost-ineffective) or by having a dedicated team listen to a very small number of calls (approximately 1 percent) after the fact. When companies see how AI can actually “listen” to 100 percent of calls in real time, flag disgruntled customers, and allow supervisors to take action before the caller has even hung up, it disrupts their whole business process. It changes their costs and their actionable insights.
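A hedged sketch of that real-time flagging pattern -- stream per-segment emotion scores and alert a supervisor while the call is still live -- might look like the following. The score source, threshold, and alert function are hypothetical stand-ins.

# Sketch of real-time call flagging. Scores, threshold, and alerting are
# hypothetical; in practice the scores would stream from an emotion model.
from typing import Iterable

ANGER_THRESHOLD = 0.7  # illustrative cutoff


def alert_supervisor(call_id: str, segment_index: int, score: float) -> None:
    # In practice this would push to a supervisor dashboard or paging system.
    print(f"Call {call_id}: segment {segment_index} scored {score:.2f} -- escalate now")


def monitor_call(call_id: str, anger_scores: Iterable[float]) -> None:
    """Flag the call the first time a segment's anger score crosses the threshold."""
    for segment_index, score in enumerate(anger_scores):
        if score >= ANGER_THRESHOLD:
            alert_supervisor(call_id, segment_index, score)
            break


# Example: per-segment scores produced every few seconds during a live call.
monitor_call("call-1042", [0.20, 0.35, 0.52, 0.82, 0.40])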
