TDWI Upside - Where Data Means Business

Q&A: As Emoji Use Grows, So Does Need for Analysis

A long-time expert in legal investigations discusses the growth of emojis and the need for tools to separate emojis from messages for analysis.

Whether we use them or disdain them, we're all familiar with emojis -- those little shorthand digital images for expressing an emotion or idea. More than six billion emojis are sent each day, according to Joe Sremack, CEO of Boxer Analytics and author of the book Big Data Forensics -- Learning Hadoop Investigations.

Given that growing volume, investigators and others delving into message content need a way to search and manage this new form of communication. "It's amazing how pervasive emojis have become," says Sremack, who is a fifteen-year veteran of the IT and investigations industry, where he has served as an analyst and expert witness on high-profile corporate crime investigations and civil litigation. "This is no longer the domain of teenagers. ... You can't look at social media and mobile device data without seeing emojis."

Because emojis can contain critical information -- indicating ideas ranging from affirmation to sarcasm to violent threats -- Sremack's company offers attorneys and investigators, his primary market, the ability to pull emojis from messages for sorting and analysis.

"Old methods for finding critical evidence in data sets of tens and hundreds of millions of documents, such as keyword searches, do not yield results when working with emojis," Sremack says.

Fifteen years ago, most of his work dealt with analyzing emails and document files captured from laptops, desktop computers, and servers. Now, he focuses instead on large-scale data sets for investigating Ponzi schemes, stock option backdating, mortgage-backed securities fraud, and intellectual property theft.

UPSIDE: How do you search out emojis in messages and what are the challenges?

Joe Sremack: First of all, there's the technical side of dealing with emojis. You're usually looking through a communications database, typically with one record per message. Inside the message body there may be no emojis, an emoji at the end of the message, or maybe a string of different emojis. Our software goes through and analyzes those -- it treats them as if they were keywords and allows users to search them.

Once our tool, UniSearch Pro, returns those results, the attorneys or investigators can review the data and annotate the records, flag them, identify them as privileged, and so forth. They can work with them later and potentially use them as evidence.

One difficulty is that there are more than 1,850 emojis currently defined by Unicode, and multiple emojis can represent the same concept. We group emojis together to allow users to search for a concept or type of idea.
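As an illustration of concept grouping, here is a minimal Python sketch. The group names, the emojis assigned to each group, and the function search_by_concept are all hypothetical; the interview does not describe how UniSearch Pro builds or stores its groups.

```python
# Hypothetical concept groups, chosen only for illustration.
CONCEPT_GROUPS = {
    # U+1F600 grinning face, U+1F603 smiling face with open mouth,
    # U+1F642 slightly smiling face
    "happiness": {"\U0001F600", "\U0001F603", "\U0001F642"},
    # U+1F52B pistol, U+1F52A kitchen knife, U+1F4A3 bomb
    "weapon": {"\U0001F52B", "\U0001F52A", "\U0001F4A3"},
}


def search_by_concept(messages, concept):
    """Return the (id, body) records whose body contains any emoji in the group."""
    group = CONCEPT_GROUPS[concept]
    return [
        (msg_id, body)
        for msg_id, body in messages
        if any(ch in group for ch in body)
    ]


# One record per message, as in the communications database described above.
messages = [
    (1, "See you at noon"),
    (2, "Great news \U0001F603"),
    (3, "Don't make me come over there \U0001F52A"),
]

for msg_id, body in search_by_concept(messages, "weapon"):
    print(msg_id, body)
```

A reviewer searching for "weapon" gets message 3 without needing to know which specific emoji the sender chose.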

There are also many differing interpretations of emojis. What does a wink mean? This has already come up in case law -- there have been more than 12 cases in the last year where emojis were key evidence. There were multiple arguments about whether an emoji indicated sarcasm or a threat -- things like that.

Another issue is how emojis are displayed on devices. An emoji on an iPhone might look different from the same emoji on an Android device.

There are many different ways to interpret emojis, and they create issues that require searching and deep analysis by the attorneys.

Your tool is used to find and pull out emojis in messages, then allow users to run analytics on the data using another product?

Correct. That could be done in either of two ways: One, you can use an official media-specific tool to collect all of the message data and build a database, then apply our tool to it [to find the emojis]. Two, someone could use UniSearch Pro in conjunction with their own tool for deeper analytics.

Part of our duty to our clients is to present the data in a way that they can search it, traditionally by keywords. With emojis, that presents a new type of problem, and that's what my software is designed to address.

You mentioned that emojis can appear differently on different devices. Is that a challenge as you're sorting through messages?

Yes, that's a challenge on a couple of levels. Emojis are defined by Unicode and have a standard representation. However, they can be implemented in any way that a hardware or software provider wants. Apple has its own set of emojis, Android has its own set of emojis, Twitter has its own set of emojis, and so forth. The depiction of each emoji can vary widely depending on the platform, so if I send an emoji from my Android to someone's iPhone, it can appear very differently to me than it does to them.

As an example, an emoji that looks like a real gun on my Android device is a squirt gun on an Apple device. That presents a challenge to attorneys, because they want to represent the evidence in the way most favorable to their case. They need to understand how the emoji was presented not only to the sender but also to the recipient, then figure out the correct emoji representation for their case.

There's a display problem here -- you need to know how the data was acquired, from what device, and who received it. Then you need to provide those emojis graphically to a reviewer so they can make a determination.

You're focused on the legal industry?

Primarily legal, yes -- really we're focused on investigations. Sometimes organizations want to do an internal investigation, for example. We've talked to several potential customers about social media analytics, but our focus has been on the legal industry to date.

Are there really more than 1,850 defined emojis out there?

Yes. I'm also a member of The Unicode Consortium [an international nonprofit corporation that standardizes how computers represent text]. In a given year, we release about 80 new emojis, and that number is going to continue to grow. Currently, there are 1,852 defined emojis; that number will increase early next year.

In addition to those, there are also platform-specific emojis -- for Twitter, for example. You can view them and can paste them in Twitter, but once you leave Twitter, they aren't available. There's a whole universe of emojis, and it's impossible to calculate just how many are out there across all platforms.

Your software focuses on emojis in text messages?

Not just text messages. You'll see emojis in text messages, chat messages, and social media. We see them in collaboration tools such as Slack. We're also seeing emojis in email now that Outlook and Gmail both fully support Unicode emojis. It's a matter of going through the messages and identifying which contain emojis. That's one of the big benefits of our tool -- you can search your entire repository for them and segregate those messages for a separate review process.

Your software translates the emojis into Unicode, but does it try to detect the sender's intent in an emoji string -- sarcasm or a joke, for example?

No, our software doesn't do that. We do offer the ability to cluster related emojis, though, which is useful. Typically, certain kinds and combinations of emojis are used to indicate sarcasm -- if you're searching for sarcasm, you can search for those emojis occurring as a group. Currently, however, we don't offer any kind of analysis regarding what any particular emoji might mean in the context of the message.
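Searching for emojis that occur together can be sketched in a few lines of Python. The function names and the sample combination below are hypothetical; the interview does not say which emojis signal sarcasm or how the product matches groups, so this only shows the general idea of requiring every emoji in a set to appear in the same message.

```python
def contains_all(body, emojis):
    """True if every emoji in the set appears somewhere in the message body."""
    return all(e in body for e in emojis)


def search_combination(messages, emojis):
    """Return the ids of messages that contain the whole combination."""
    return [msg_id for msg_id, body in messages if contains_all(body, emojis)]


# Assumed combination for illustration only:
# U+1F643 upside-down face together with U+1F44F clapping hands.
combination = {"\U0001F643", "\U0001F44F"}

messages = [
    (1, "Nice work \U0001F44F"),
    (2, "Oh, great plan \U0001F643\U0001F44F"),
    (3, "Running late"),
]

print(search_combination(messages, combination))
```

Only message 2 contains both emojis, so it is the only one returned. Deciding what the combination actually means is still left to the reviewer.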

In an investigation, how many messages might you be dealing with?

We're typically looking at anywhere from several hundred thousand up to about 10 million messages. When you look at a typical cell phone, based on the averages we've seen in our investigations, you're going to see about 100,000 messages if the user has WhatsApp installed or is a heavy SMS user. If you think about the number of users or mobile devices needed to get up to 10 million, we're obviously talking about a large-scale investigation.

How many of those messages contain emojis?

On average, roughly 10 to 15 percent of the messages we see contain emojis. It's a subset, but it's a large enough subset that you cannot review them all manually.

Is there a technical challenge to decoding the emojis, or is it a relatively simple matter of breaking each emoji down into Unicode?

The technical challenge isn't great except when we encounter large data stores. Emojis can be either two bytes or four bytes, so to tokenize them -- to view them as a single emoji -- you have to parse through all the data at the byte level. When you're dealing with tens of millions or hundreds of millions of records, things can slow down a bit. There needs to be some optimization, which we have focused on extensively in our software.
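The two-byte and four-byte sizes mentioned here match how UTF-16 stores characters, which is presumably the encoding in question: a code point in the Basic Multilingual Plane takes one 16-bit unit, and anything above it takes a surrogate pair. The Python sketch below shows that difference and a simplified tokenizer. The code point ranges and function names are assumptions for illustration, and the sketch ignores multi-code-point sequences such as skin tone modifiers and joined emojis, which a real tool would need to handle.

```python
def utf16_byte_length(ch):
    """Number of bytes a single character occupies in UTF-16."""
    return len(ch.encode("utf-16-le"))


# Rough, assumed ranges. They cover many common emojis but also include
# some symbols that are not emojis, and they miss others.
EMOJI_RANGES = [
    (0x2600, 0x27BF),    # Miscellaneous Symbols and Dingbats
    (0x1F300, 0x1FAFF),  # main pictograph, emoticon, and transport blocks
]


def is_emoji(ch):
    """Approximate check based only on the code point ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in EMOJI_RANGES)


def tokenize_emojis(raw, encoding="utf-16-le"):
    """Decode raw message bytes and return each emoji as its own token."""
    text = raw.decode(encoding)
    return [ch for ch in text if is_emoji(ch)]


print(utf16_byte_length("\u2764"))      # U+2764 heavy black heart: 2 bytes
print(utf16_byte_length("\U0001F600"))  # U+1F600 grinning face: 4 bytes

raw = "ok \u2764 then \U0001F600".encode("utf-16-le")
print([hex(ord(ch)) for ch in tokenize_emojis(raw)])
```

Decoding first and then walking code points avoids splitting a four-byte emoji in half, which is the risk when scanning raw bytes in fixed two-byte steps.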

Remember that we're combining this with other sets of data. You're not just looking at communications with mobile data. You're also looking at complex financial data; you might be looking at emails, other document files, and so forth. Our software allows you to segment the data that contains emojis so that you can search and review those alongside your current workflow and other processes designed for email or non-emoji messages. Most law firms, investigators, and law enforcement already have established products that work very well for email and document files.

Are you pretty early to the game in terms of teasing out emojis so analytics can be run on them, or are there others doing this?

We're early in terms of the legal profession. There are several companies right now doing emoji analytics for sentiment analysis of social media data. You'll see a lot of companies that search Twitter feeds and can tell you, say, "Here's the most common emoji we're seeing in response to your recent post," or, "Here's what customers are saying about your new campaign." Social media is probably the biggest place so far where emojis are being used for market analysis.

What's the potential for emojis and analytics?

For the legal industry, this will become a bigger and bigger issue over the next few years -- case law tends to lag a few years behind technology. The increased amount of mobile communication data that's finding its way into cases and investigations will certainly require the courts to understand the value of this. We will need to ensure that there's a way to present emojis accurately and make sure that attorneys are producing the data to one another and reviewing it appropriately. I believe that eventually, in civil litigation, emojis will be as common as email -- you'll see just as many messages with emojis as with only text.

Beyond the legal sphere, I think this is going to be an extraordinarily large market. Using social media analytics to understand what your customers are saying through emojis will be immense. It will be just as valuable as running keyword searches. Companies are going to come up with their own sets of emojis, and it's going to become an important language for modern communication -- something we haven't seen before.

With more than six billion sent per day -- and that number is only going to increase -- emojis are omnipresent.
