Q&A: Getting a Handle on Big Data and Hadoop
The processing framework known as Hadoop is one of a number of challenges companies face in working with big data, says the CEO of start-up Cirro.
- By Linda L. Briggs
- December 11, 2012
Cirro CEO Mark Theissen has spent over 22 years in the BI, analytics, and data warehousing industry in a variety of roles. Previously, he was worldwide data warehousing technical lead at Microsoft, following its acquisition of DATAllegro in 2008, where Theissen served as COO and was a member of the board of directors. Prior to that, he was a VP and research lead at META Group (acquired by Gartner Group), covering data warehousing, BI, and data integration markets.
In this interview, the first of two parts, Theissen discusses the challenges of successfully working with big data, including the fast-moving market, lack of use cases, and dealing with Hadoop.
BI This Week: Is getting a handle on big data proving to be a challenge for some companies, and why?
Mark Theissen: There are several challenges involved. First, there’s no long history of use cases and successes with big data, so the people reaping advantages from big data are really spearheading the movement and driving business value.
Another challenge for companies is that when you talk about big data, you pretty much include Hadoop in the conversation. Hadoop represents a different set of skills for people who are traditionally responsible for analytics within an organization. Typically, the people responsible for analytics are more DBA types -- people who have been working on data warehouses. They tend to be SQL experts. They have some good BI tools, data visualization tool capabilities, and they know how to design and do their models, but big data is a different set of skills. It’s Java. It’s Hadoop. It’s other NoSQL data sources. There are cloud technologies involved. It’s a whole new world for them.
Yet Hadoop is not an easy tool to work with, and Hadoop skills can be hard to find.
Right. Hadoop is a processing framework, and there’s a lot of work to be done to make that processing framework do the things you want to do.
Let’s talk about vendors. How are they doing in terms of dealing with big data from your vantage point?
Part of what I see is some of the same challenges customers face, including the one that you brought up, which is that finding really good resources with Hadoop skills is a challenge not only for customers but also for vendors. You can invest as a vendor and you can invest as a customer to create those skills within the company, of course, but then retaining those people is always going to be a challenge as well.
The other aspect for vendors is that this is a hypermarket, if you will. This market is moving faster than any market I’ve seen in the past. It’s a combination of things. Take data warehousing and analytics or BI markets. Those are strong markets that are growing quite nicely in their own right, but then add big data and you have multiple markets converging on big data.
Big data itself includes aspects of cloud computing and mobile computing -- all of it seems to be feeding on each other and accelerating the market to a pace that we’ve not seen before.
We can add social media into that mix as well.
Exactly. It’s not just one thing. There are so many moving parts right now -- they feed on each other. It all creates what I see as hyper-acceleration in terms of the speed at which things are moving.
When you talk to customers, what do companies typically want to do with big data -- and what are they doing?
Their first challenge often lies in just being able to explore big data. It’s not just, “Oh, this is Twitter data and I want to look at Twitter data,” but being able to explore that data and provide context. They might have semi-structured or unstructured data in Hadoop, for example, that they want to explore. Typically, one approach is to join that data to other data sources, often in data marts or data warehouses. Combined, those sources can provide some context to what you’re trying to explore. That’s one of the first issues that people run into -- where am I going to get the value out of my Hadoop implementation?
The other thing companies struggle with is technology decisions. Are you going to be able to use your existing BI and data visualization tools and extend them to be used against big data, or do you have to buy a completely new set of tools that can do analytics on just Hadoop? Customers struggle with that because they’ve made significant investments in different platforms and technologies, and they are looking to leverage those investments as they go after the riches of big data. They don’t want to be told they have to completely retool to be able to get value out of big data.
In terms of use cases, we see some common patterns. For example, you might have data in a data warehouse -- customer data, sales data -- and have other information in Hadoop, maybe customer survey data, or data from multiple pharmacy outlets. The point is, if you have that data in Hadoop, you want to be able to combine it with the data in your data warehouse. You really don’t want to have to move all of the data from Hadoop into your data warehouse to be able to query it.
I think people look at that and they say, “Well, can you do it?” “Yes.” “Can you do it in a timely and efficient manner?” “No.” Most of the time, the customer will say, “We’re not even sure if we have to move it all there anyway because we’re not exactly sure which queries are the high-value ones that we want to run.”
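To make the pattern above concrete, here is a minimal sketch of joining warehouse data to Hadoop-resident data without first copying everything into the warehouse. It is an illustration only, not Cirro's method or any vendor's API. An in-memory SQLite database stands in for the data warehouse, a list of JSON lines stands in for survey records stored in Hadoop, and the table, field, and function names are all hypothetical.

```python
import json
import sqlite3

# Stand-in for a data warehouse table (hypothetical schema).
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT)"
)
warehouse.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "West"), (2, "East"), (3, "West")],
)

# Stand-in for semi-structured survey records kept in Hadoop,
# one JSON document per line (hypothetical format).
SURVEY_LINES = [
    '{"customer_id": 1, "score": 9}',
    '{"customer_id": 2, "score": 4}',
    '{"customer_id": 3, "score": 7}',
    '{"customer_id": 1, "score": 8}',
]


def survey_scores_by_region(region):
    """Join warehouse customers in one region to their survey scores."""
    # Step 1: ask the warehouse only for the keys this query needs.
    rows = warehouse.execute(
        "SELECT customer_id FROM customers WHERE region = ?", (region,)
    ).fetchall()
    wanted = {customer_id for (customer_id,) in rows}

    # Step 2: scan the survey records and keep only matching keys,
    # instead of loading every record into the warehouse first.
    joined = []
    for line in SURVEY_LINES:
        record = json.loads(line)
        if record["customer_id"] in wanted:
            joined.append((record["customer_id"], region, record["score"]))
    return joined
```

Calling `survey_scores_by_region("West")` returns only the survey rows for customers 1 and 3. The design point is that each query pulls just the records it needs from each source, which matters when it is not yet clear which queries are the high-value ones.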
What other ways do you see customers using Hadoop?
In some cases, we see customers who want to use Hadoop as an operational data store (ODS). That’s a case where people are running queries and they want to place those results into Hadoop, another relational database, a BI server, or something similar. In other cases, people want to make Hadoop a destination platform, but they still have a requirement to join to other data sources. That’s usually where the challenge lies for them. It’s a challenge even to do joins sometimes between multiple Hadoop clusters or between Hadoop and HBase. Making Hadoop a destination platform -- that’s definitely a use case.
There’s also exploration and analysis -- you have data, you want to combine things, and you want to explore it. You also want to share results downstream, whether you’re sharing the query so other people can also run it or you’re sharing the results of those queries.
Another use case is application enrichment. You might say, “I’d like to do some processing with Hadoop. I’d like to combine the data with other data, then pump that back into something, maybe HBase, where real-time applications can access that information” -- all to improve the customer experience.