TDWI Articles

Q&A: Mining Social Media Content

What kind of data can you mine from social media and what does it take to do the job right? We asked Aaron Williams, VP of global community at OmniSci, for his perspective.

Upside: What kind of information can be gleaned from social media?

Aaron Williams: Social media is now used far beyond its original purposes, whether that was keeping in touch with friends and family, staying in touch with business contacts, or simply sharing photos. Because social media is instant, conversations can be monitored in real time as they happen. These conversations often originate from mobile devices, which means the data has a spatiotemporal component, and merging that data with pop culture topics, political events, or even natural disaster headlines can yield new insights.

Using location intelligence, these social media topics can be tracked as they spread from neighborhood to neighborhood and around the world. More companies are looking to tap into the power of this data. Brands can now see which other brands their loyal customers use, which opens up co-marketing opportunities. Sentiment analysis of customer posts is being used to identify products whose sales lag behind what customer appreciation would suggest, helping companies identify marketing or sales gaps.

Let's also not forget that data is being collected by the social networks themselves on our activity, including likes, posts, and images. Social media companies make billions by being able to target us with the right ad for the right product at the right time, and that's only possible because our social media data is such a rich representation of us.

Which social media sites (Facebook, Instagram, LinkedIn, etc.) provide the best info (or best info depending on what you're looking for)?

Data from any of the social media sites can provide information that can inform strategic business decisions. The "best" information is often the information driving an organization forward. For instance, Twitter data is almost peerless among social media data in its ability to provide a glimpse into the human experience -- revealing what people are saying when and where. The ability to monitor hashtags on Twitter is an important part of the power of that platform and a fantastic tool for brands looking to find and engage their customers in real time.

How much time does it take to do this analysis?

Data from social media channels can be analyzed in near real time if an accelerated analytics tool is used. A good platform should allow users to search for specific hashtags, filter the information by location, and even drill down to individual tweets almost instantly.
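As a rough illustration of the three operations described above (hashtag search, location filter, drill-down to individual tweets), here is a minimal Python sketch. It is not OmniSci's implementation and does not use any OmniSci or Twitter API; the `Tweet` class, the helper functions, the sample tweets, and the bounding box are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    # Hypothetical record; a real feed would carry many more fields.
    text: str
    lat: float  # latitude in degrees
    lon: float  # longitude in degrees


def has_hashtag(tweet, tag):
    """Return True if the tweet text contains the hashtag (case-insensitive)."""
    wanted = "#" + tag.lstrip("#").lower()
    # Strip trailing punctuation so "#coffee!" still matches "#coffee".
    return any(tok.rstrip(".,!?;:").lower() == wanted
               for tok in tweet.text.split())


def in_box(tweet, box):
    """Return True if the tweet lies inside (min_lat, min_lon, max_lat, max_lon)."""
    min_lat, min_lon, max_lat, max_lon = box
    return min_lat <= tweet.lat <= max_lat and min_lon <= tweet.lon <= max_lon


def search(tweets, tag, box):
    """Drill down to the individual tweets matching both hashtag and location."""
    return [t for t in tweets if has_hashtag(t, tag) and in_box(t, box)]


# Made-up sample data for illustration only.
SAMPLE = [
    Tweet("Best espresso in town #coffee", 37.77, -122.42),
    Tweet("Morning #Coffee run!", 40.71, -74.01),
    Tweet("Foggy again #weather", 37.78, -122.41),
]
SF_BOX = (37.70, -122.52, 37.83, -122.35)  # approximate box, an assumption

matches = search(SAMPLE, "coffee", SF_BOX)
for t in matches:
    print(t.text)
```

A linear scan like this is fine for a handful of records; the point of an accelerated analytics platform is to keep the same kind of query interactive when the data set grows to hundreds of millions of rows.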

What's the risk of false conclusions (a problem of natural language processing)?

False conclusions are a problem, but using a faster platform mitigates the danger. When queries take milliseconds, there is no cost to analysts being curious with their data -- why not dig deeper, look at the data from a different angle, or combine it with another data set that adds more context? Traditional technologies force users to make a difficult decision: do I really want to wait around minutes or hours to get an additional insight and try to reduce the likelihood of that false positive?

What do hashtags tell us?

Hashtags can tell us what topics are trending, and when geolocation is turned on, they can tell us where the topics are trending. They can also help group conversations together, such as in Tweet chats. People can participate in an online "conversation," and their messages are threaded together by using a common hashtag.
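To make the "what is trending, and where" idea concrete, the following Python sketch counts hashtags per coarse geographic grid cell. It is a simplified illustration under stated assumptions, not how any particular platform computes trends: the `trending_by_cell` function, the one-degree grid, and the sample tweets are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Matches a "#" followed by word characters; a simplification of real hashtag rules.
HASHTAG = re.compile(r"#(\w+)")


def trending_by_cell(tweets, cell_deg=1.0):
    """Count hashtags per grid cell.

    `tweets` is an iterable of (text, lat, lon) tuples. Each tweet is assigned
    to a square cell of `cell_deg` degrees, and hashtags are counted
    case-insensitively within that cell.
    """
    counts = defaultdict(Counter)
    for text, lat, lon in tweets:
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        for tag in HASHTAG.findall(text):
            counts[cell][tag.lower()] += 1
    return counts


# Made-up geotagged tweets for illustration only.
SAMPLE = [
    ("Great game tonight #WorldCup", 51.5, -0.1),
    ("#worldcup fever in London", 51.6, -0.2),
    ("Rain again #weather", 51.5, -0.1),
    ("Sunny day #weather", 40.7, -74.0),
]

result = trending_by_cell(SAMPLE)
for cell, tags in sorted(result.items()):
    print(cell, tags.most_common(1))
```

Grouping by a fixed grid is the simplest possible choice; adding a time bucket to the cell key would extend the same approach to show when a topic is trending as well as where.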

Given all the controversy about social media sites sharing data, how is social media data obtained (for example, are you screen scraping or getting feeds from the social media companies themselves) and are there privacy concerns about accessing it?

Because our platform can handle all kinds of structured data, we remain agnostic to how our users get the data they load into OmniSci. We get the data for our TweetMap demo directly from the Twitter API, and we have a license from Twitter to use that data in our demo. All of the social media APIs are getting more restrictive though, which is a shame, because it does encourage the kind of data grab you're talking about.

What is the primary advantage provided by OmniSci's service?

The speed of the platform. The very first instance of OmniSci's TweetMap was created during OmniSci CEO and cofounder Todd Mostak's graduate work at Harvard. It was there, while he was trying to analyze hundreds of millions of tweets and hashtags related to the Arab Spring, that he discovered there wasn't a practical way to interactively explore large data sets without waiting overnight (or longer) for query results. This prompted him to start building the database that would eventually become OmniSci.

Our TweetMap demo has the most recent 400 million tweets loaded into an OmniSciDB instance, and it can be queried and visualized in 200 milliseconds, with no indexing or pre-aggregation on the data. Traditional database technologies could take minutes for the same queries; that's orders of magnitude slower.

About the Author

James E. Powell is the editorial director of TDWI, overseeing research reports, the Business Intelligence Journal, and the Upside newsletter.

