Q&A: Addressing Big Data Performance
Achieving good performance with big data requires processing the data as close to the source as possible, says Cirro CEO Mark Theissen in this interview.
- By Linda L. Briggs
- December 18, 2012
To achieve performance at scale with big data, the best approach is to process the data as close to the source as possible, says Cirro CEO Mark Theissen. "We don't believe that trying to ingest everything into one stack of software products is the way to go," he adds in this interview.
Theissen has spent over 22 years in BI and data warehousing, in a variety of roles. Prior to Cirro, he was worldwide data warehousing technical lead at Microsoft, following Microsoft's acquisition of DATAllegro, where Theissen served as COO and was a member of the board of directors. Prior to that, he was a VP and research lead at META Group (acquired by Gartner Group), covering data warehousing, BI, and data integration markets.
In this interview, the second of two parts, Theissen talks about different approaches vendors are taking to deal with response times and other big data issues.
BI This Week: One issue with big data can be query performance. Can you talk about some of the challenges there, and what sorts of approaches are being taken?
Mark Theissen: There's one group of companies that are after real-time analytics. That means ingesting thousands of data pieces -- or transactions or whatever you want to call them -- per second, and providing real-time analytics on that. There are specialized databases for that… Lots of companies are trying to do some kind of consolidation and indexing to improve analytics, and to make response times faster.
Other vendors will take an approach much like what Cirro does. We believe that the real strength -- the real key to performance at scale with big data -- is to take the processing to the data, to process the data as close to the source as possible. There are others taking that approach as well. We don't believe that trying to ingest everything into one stack of software products is the way to go.
There are also vendors that provide true Hadoop-based analytics. That's all they provide; all their analytic processing occurs in Hadoop. They have a user interface and a BI interface, but at the end of the day, they can only run against data that resides in Hadoop. When it comes to performance, Hadoop does have overhead associated with it in terms of spinning up a MapReduce job. Running on Hadoop in general, you're not going to get sub-second, immediate response time on queries, and, of course, not all queries require sub-second response times.
Other issues come into play with high-volume queries and data that you're hitting on a consistent basis. Solutions such as what Cirro offers take advantage of caching mechanisms, allowing high-frequency data to be served from a cache rather than firing up another MapReduce job. That certainly speeds up query performance.
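To make that caching idea concrete, here is a minimal illustrative sketch, not Cirro's implementation: repeated queries are answered from an in-memory cache keyed by the query text, and only cache misses fall through to the expensive backend job (the backend call below simply stands in for something like a MapReduce run). All names are hypothetical.

```python
import hashlib
import time

class QueryResultCache:
    """Toy result cache: serve repeated queries from memory instead of
    re-running an expensive backend job (e.g., a MapReduce run)."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._store = {}  # query hash -> (timestamp, rows)

    def _key(self, sql):
        return hashlib.sha256(sql.strip().lower().encode()).hexdigest()

    def run(self, sql, execute_on_backend):
        key = self._key(sql)
        hit = self._store.get(key)
        if hit and time.time() - hit[0] < self.ttl:
            return hit[1]                      # cache hit: no backend job
        rows = execute_on_backend(sql)         # cache miss: pay the full cost
        self._store[key] = (time.time(), rows)
        return rows

# Usage: the second identical query is answered from the cache.
cache = QueryResultCache(ttl_seconds=60)
slow_backend = lambda sql: [("2012-12", 1_234_567)]   # stand-in for a slow job
cache.run("SELECT month, SUM(clicks) FROM logs GROUP BY month", slow_backend)
cache.run("SELECT month, SUM(clicks) FROM logs GROUP BY month", slow_backend)
```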
You've said, "Our approach is to take the processing to the data, not the data to the processing." That sounds like what you're describing here.
Yes. If you look at what some of these pure Hadoop-query tools do, they have all this data in Hadoop. If you want to join it with other data sources, whether they're data warehouse sources or others, you have to physically move or copy all of that data from the data warehouse, or wherever it is, and place it into Hadoop or into Hive tables so that you can actually do the queries, joins, and analysis you want. Typically that works fine for proof-of-concept and pilot projects, but it's very difficult to do with large numbers of files and large volumes of data in a true production environment.
We believe that a better approach is to take the processing to the data. If you have a data ecosystem defined within a technology like Cirro, you can say, "This data ecosystem includes data in Teradata. I have SQL Server data marts. I also have some Oracle databases I'm interested in. I have some things in the cloud. I have MySQL. I have multiple Hadoop clusters. I want to be able to run queries across all of the items in my data ecosystem."
So you have data all over the place that you're drawing on.
This whole idea of trying to have everything in one place is working less and less in today's real world. The data world going forward is distributed, with a lot of best-of-breed products for analytics. You want to be able to run across those products. The best answer, we think, is a good federation approach. That allows you to federate the processing of a query and orchestrate its execution across the multiple participating systems. The idea is that you're trying to do as much processing as you can as close to the data as possible, and you're trying to minimize the amount of data that you may have to move between systems for any final join-type processing.
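As a rough illustration of "taking the processing to the data" -- not Cirro's actual mechanics -- the sketch below pushes the heavy aggregation down to each source system and moves only the small intermediate results for the final join. The two in-memory SQLite databases simply stand in for remote systems.

```python
import sqlite3

def run_local(conn, sql):
    """Execute a query where the data lives and return only the result rows."""
    return conn.execute(sql).fetchall()

# Pretend these are remote systems (a warehouse and a data mart).
warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
    CREATE TABLE sales(region TEXT, amount REAL);
    INSERT INTO sales VALUES ('west', 100), ('west', 250), ('east', 75);
""")
mart = sqlite3.connect(":memory:")
mart.executescript("""
    CREATE TABLE targets(region TEXT, target REAL);
    INSERT INTO targets VALUES ('west', 300), ('east', 200);
""")

# Step 1: push the heavy aggregation down to each source.
sales_by_region = run_local(
    warehouse,
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region")
targets = run_local(mart, "SELECT region, target FROM targets")

# Step 2: only these small result sets cross the wire for the final join.
target_map = dict(targets)
report = [(region, total, target_map.get(region))
          for region, total in sales_by_region]
print(report)   # [('east', 75.0, 200.0), ('west', 350.0, 300.0)]
```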
Cirro is fairly new on the BI scene; several members of your executive team were key players at DATAllegro. You offer a federated solution, and you're billing Cirro as purpose-built for big data.
Cirro as a company is fairly new on the scene, but our management and engineering teams are well-versed in the world of data warehousing and analytics, and with Hadoop.
We provide this single point of access -- a way to access all the data within your overall data ecosystem. You can submit a query to our Cirro Data Hub, and that Cirro Data Hub will orchestrate the execution of that query across all of the participating systems that are defined in that data ecosystem. I may take that query, break it down into a series of steps, and execute step 1a and 1b on two different systems. I may then execute step 2 in a different place. I may have to do a move, finish up, then do step 3, but the data hub is orchestrating the execution and taking care of all the different platform peculiarities. You speak one language to the Cirro Data Hub.
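One way to picture the kind of step plan described above -- purely as an illustration, with invented step names and systems rather than anything from Cirro -- is a small dependency list that a hub could schedule and then translate into each target platform's own dialect:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    step_id: str
    system: str                 # which participating system runs this step
    action: str                 # e.g. "aggregate", "move", "join"
    depends_on: list = field(default_factory=list)

# A hypothetical federated plan: 1a and 1b run on different systems,
# step 2 moves a small intermediate result, step 3 finishes the join.
plan = [
    Step("1a", "teradata", "aggregate sales by region"),
    Step("1b", "hadoop",   "filter clickstream to last 30 days"),
    Step("2",  "hadoop",   "move 1a result into a temp table", ["1a"]),
    Step("3",  "hadoop",   "join 1b with the moved 1a result", ["1b", "2"]),
]

# A real hub would schedule steps as their dependencies complete and
# handle each platform's peculiarities; here we just print the plan.
for step in plan:
    print(f"step {step.step_id} on {step.system}: {step.action}")
```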
This is transparent to the user?
It's all transparent to the user. They just have to know what data they want and where it is, in terms of what system it's on. From there, we take over the execution. Users don't have to worry about different syntaxes for different platforms. They don't have to worry about data movement or the best way to go about executing this. A key part of the data hub is figuring out the best and most efficient way to execute this particular query.
So the single point of entry is a key concept.
If you have a private cloud and you have Teradata, you have Hadoop, you have Greenplum, you have SQL Server, and you have HBase, you could have a single point of entry through Cirro. You could access and query all of those different data sources.
What's really important in that concept is that your existing BI tools -- your Tableau, BusinessObjects, or other types of BI and data visualization tools -- connect to the Cirro Data Hub as well. Using those BI tools, you can submit queries to the Data Hub and get the benefit of true federated processing. You also get the benefit then of being able to fully take advantage of the processing capabilities of Hadoop.
With Cirro, we're not limited by the functionality of Hive. Most of the tools out there that people are using now use a Hive connector, and Hive really covers only a subset of what MapReduce can do. One of the things we've done with Cirro is to build out the capabilities in our function library, so that if you can execute something in SQL, you can also execute it in MapReduce. When we generate the code to run those queries, we might generate an HQL (Hive) query. Equally, we might generate MapReduce code that runs directly on Hadoop, so we have a lot of processing capability, and you can use the existing tools you already have on your desktop.
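A simplified way to picture one function library with two back ends -- the function names here are illustrative, not Cirro's API -- is the same logical operation emitted either as HiveQL text or as a mapper/reducer pair:

```python
def count_by_key_hql(table, key_col):
    """Emit HiveQL text for a grouped count."""
    return f"SELECT {key_col}, COUNT(*) FROM {table} GROUP BY {key_col}"

def count_by_key_mapreduce():
    """Emit mapper/reducer callables for the same grouped count."""
    def mapper(record, key_col):
        yield record[key_col], 1            # one (key, 1) pair per record
    def reducer(key, counts):
        return key, sum(counts)             # total count per key
    return mapper, reducer

# A planner could pick whichever form suits the target system.
print(count_by_key_hql("weblogs", "country"))
mapper, reducer = count_by_key_mapreduce()
```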
That's an interesting approach -- this idea that the semantic layer is already there in most BI tools. Why reinvent the wheel?
Right, and I think people have struggled with that with other data virtualization technologies, where you have a semantic layer in the BI or analytic tools and a semantic layer at the database level. Trying to create a semantic layer that sits between those two to create this uber-view, if you will -- well, it sounds great architecturally, but from an implementation perspective it's very time consuming. It's complex and it takes a lot of resources, and guess what -- it doesn't work so well once you've implemented it, because the BI semantic layer is constantly changing, as is the database layer. If you create a semantic layer that sits between those two, you will always struggle to keep them all in sync with each other.
That's how we're different. People sometimes say, "Aren't you kind of like data virtualization?" Well, in a way we are because we both do federated processing. That's what data virtualization does. However, we've taken what we feel is a much better approach to it -- an approach that's simpler, easier to implement, and much lower in overhead and administration.