Treasure Data: Not Your Typical Take on Hadoop or Data Warehouses
Think of SaaS newcomer Treasure Data as a kind of Hadoop-as-a-service offering, albeit one with an emphasis on data warehousing workloads.
- By Stephen Swoyer
- November 5, 2013
Think of SaaS newcomer Treasure Data Inc. as a kind of Hadoop-as-a-service offering, albeit one with an emphasis on data warehousing (DW) workloads.
Treasure Data's SaaS offering mixes open source software (OSS) technologies (including Hadoop's MapReduce compute engine) with some special sauce of its own.
In place of the Hadoop Distributed File System (HDFS), for example, Treasure Data substitutes its own object-based columnar file system, which it calls "Plazma." Plazma decouples MapReduce from HDFS, Hadoop's baked-in storage layer. Treasure Data claims that Plazma's columnar object storage format is better suited to analytical queries, enabling superior I/O performance.
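The row-versus-column distinction is easiest to see in miniature. The sketch below is purely illustrative and makes no assumptions about Plazma's internals; it simply shows how pivoting row-oriented records into per-column arrays lets an analytical query scan only the fields it needs.

```python
# Illustrative only: a toy row-to-columnar pivot, not Plazma's actual format.
rows = [
    {"video_id": 101, "country": "JP", "watch_seconds": 312},
    {"video_id": 102, "country": "SG", "watch_seconds": 95},
    {"video_id": 101, "country": "JP", "watch_seconds": 47},
]

# Row storage: every query deserializes whole records.
# Columnar storage: each field becomes its own contiguous array.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An analytical query such as "average watch time" now scans one array
# instead of reading every full record.
avg_watch = sum(columns["watch_seconds"]) / len(columns["watch_seconds"])
print(avg_watch)  # 151.33...
```

Values of the same type and domain also sit next to each other on disk, which tends to compress better, another reason columnar formats suit scan-heavy analytical workloads.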
In addition, argues CTO and co-founder Kaz Ohta, Plazma itself incorporates I/O optimizations such as parallel pre-fetch and background decompression. Ohta says Plazma is likewise a better fit for fluentd, the OSS technology Treasure Data uses to power its td-agent software. Td-agent is a tool that collects, transforms, and repackages relational data in JSON format before replicating it to the Treasure Data service, which uses Amazon S3 for storage. Once in the cloud, Plazma extracts the row-based JSON data and saves it as columnar objects.
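As a rough picture of the collection side, the fragment below uses the open source fluent-logger package to emit one JSON event to a locally running td-agent/fluentd process. The tag, field names, and the assumption that td-agent is listening on fluentd's default forward port (24224) are illustrative, not taken from Treasure Data's documentation.

```python
# Sketch: push one JSON event into a local td-agent (fluentd) process,
# which buffers it and forwards it on to the Treasure Data service.
# The tag and field names here are made up for illustration.
from fluent import sender  # pip install fluent-logger

logger = sender.FluentSender("td.demo_db", host="localhost", port=24224)

logger.emit("video_events", {
    "video_id": 101,
    "country": "JP",
    "event": "session_start",
})

logger.close()
```

From there, td-agent batches and uploads the events; per the article, the service lands them in Amazon S3, where Plazma rewrites the row-oriented JSON into columnar objects.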
In place of Hive, Hadoop's MapReduce-powered query engine, Treasure Data uses another OSS offering: Impala, which was largely developed by Cloudera Inc. This gives it what Ohta calls a "responsive" interactive query facility, which he distinguishes from the batch-centric Hive.
"Hive is more robust [than Impala] in terms of batch processing. If you have a nightly batch, it takes five hours or six hours, and if you run [with that much utilization for] six hours, there's a high probability of node failure. Hive is more tolerant of that [kind of failure] than Impala is," says Ohta, "but Impala supports the interactive query."
Industry veteran Rich Ghiossi, who signed on with Treasure Data as its vice president of marketing in July, says Treasure Data addresses a pair of use cases: first, greenfield Hadoop adopters -- for which its SaaS model is ideal, Ghiossi argues -- and second, underperforming, problematic, or floundering Hadoop deployments. With respect to greenfield Hadoopers, Ghiossi says most of them aren't interested in the technicalities of HDFS or MapReduce, let alone the phenomenon of Hadoop itself. They have nontraditional problems that they're trying to solve.
"The people who are coming up to speak with us don't care about that," he says. "Their motto is, if you have a method and mechanism to get data into the cloud and you're going to charge a relatively inexpensive fee compared to doing it onsite, and if you're going to allow me to tie Tableau or another reporting environment to it, why should I care how it works?'"
With regard to existing Hadoop deployments, Ghiossi argues, Treasure Data principals Ohta and Hiro Yoshikawa -- its co-founder and CEO -- have been working with Hadoop for more than half a decade. Both are self-described OSS advocates: Ohta helped found Japan's Hadoop User Group, which claims to be the world's largest; Yoshikawa worked for Red Hat Inc. for almost six years.
Nevertheless, Ghiossi says, Ohta and Yoshikawa recognize that vanilla Hadoop has a variety of shortcomings, especially from a data management perspective; they founded Treasure Data specifically to address these. "What they saw was that Hadoop was just way too difficult for most people to successfully deploy because of all of the expertise required," he explains.
"That's why [Yoshikawa] said 'Let's build this company around a service, but let's hide all of the complexity [involved] in making it work.' Their idea was to make [Hadoop] really work like MPP [i.e., a data warehousing platform], make it easily scalable, and make it truly multi-tenant. They figured there would be lot of frustration out there [among Hadoop adopters]."
Viki, a video streaming website based in Singapore, was one such frustrated Hadooper. Jason Grendus, director of analytics for Viki, describes his company as a kind of "Asia-Pacific Hulu."
When Grendus came onboard at Viki, he says he inherited a Hadoop project that was going nowhere. "When I came in, the preexisting team had built its own Hadoop cluster. I was brought in because I had a background in analytics and [because] they were getting some of what they needed [from Hadoop], but not to the extent that they wanted. They were having problems with instability in reporting -- [with] inconsistency in reporting. The numbers were unreliable," says Grendus.
"I found Treasure Data because we were already using fluentd as the end point for [data] collection. Treasure Data offered us a free trial where we could try them out, and at that point, I was just trying to get things working. So I figured, 'Why not give it a try?' I started using Treasure Data more and more and more because it was reliable, and [Viki's existing Hadoop solution] wasn't. For the first time, I was getting consistent numbers out of it."
Viki continuously tracks which videos are watched; when each session begins and ends and which country it originates from; which ads run; and so on. Grendus says it uses Treasure Data-powered MapReduce to extract all of this information from system and event logs. "If we tried to take all of our logging data and put it in a database directly rather than pre-formatting it and summarizing it -- it just wouldn't be economical to do [on a conventional RDBMS]," he explains.
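Grendus's point about pre-formatting and summarizing can be made concrete with a toy example. The snippet below (hypothetical field names, not Viki's schema) collapses raw per-event log records into a small daily summary, the kind of compact result that is cheap to load into a conventional RDBMS while the raw events stay in the MapReduce-backed service.

```python
# Toy illustration: reduce raw viewing events to a per-day, per-video summary.
# Field names are hypothetical; a real pipeline would run this aggregation
# as a query inside the service rather than client-side Python.
from collections import defaultdict

raw_events = [
    {"date": "2013-11-01", "video_id": 101, "country": "SG", "watch_seconds": 312},
    {"date": "2013-11-01", "video_id": 101, "country": "JP", "watch_seconds": 47},
    {"date": "2013-11-01", "video_id": 102, "country": "SG", "watch_seconds": 95},
]

summary = defaultdict(lambda: {"views": 0, "watch_seconds": 0})
for ev in raw_events:
    key = (ev["date"], ev["video_id"])
    summary[key]["views"] += 1
    summary[key]["watch_seconds"] += ev["watch_seconds"]

# A handful of summary rows per day gets loaded into the reporting
# database, instead of millions of raw log lines.
for (date, video_id), agg in sorted(summary.items()):
    print(date, video_id, agg["views"], agg["watch_seconds"])
```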
He particularly praises Treasure Data's capacity licensing scheme, which exploits multi-tenancy and time-zone differences to offer free capacity headroom. (For example, Ohta says that Treasure Data co-locates Japanese and EU customers on the same Hadoop cluster. When one region is at peak, the other is off-peak.)
"One thing we really like is that we have a dedicated number of cores -- [this means] a guaranteed number of cores that [Treasure Data gives] us. This is a measure of how many processors you have for Hadoop [workloads]. Right now we're on 24, but it can scale up to four times this number because of the way they balance their [Hadoop] clusters," he comments.
"They guarantee you a minimum, but they're also able to give you excess [capacity] when you need it -- up to a certain point -- at no additional charge."