Teradata's SQL-on-Hadoop Strategy Begins with Presto
The wait is over: Teradata has itself a SQL-on-Hadoop strategy. In June, the high-end data warehousing specialist announced its support for "Presto," an open source software (OSS) SQL-on-Hadoop query engine first developed by Facebook.
Teradata's commitment to Presto is important because SQL itself is still the most efficient and -- above all -- widely supported tool for describing and manipulating relational data. From the perspective of Teradata customers, for example, a transparent SQL-on-Hadoop query mechanism such as Presto would open up data stored in Hadoop to a universe of SQL-speaking BI tools. The key caveat here is the word "would": Teradata's come-to-Presto move is a "multi-year commitment to contribute to Presto's open source development and [to] provide the industry's first commercial support."
Multi-year means precisely that, according to Dan Graham, technical marketing specialist with Teradata, who says the Presto announcement represents a "dramatic shift" for his company.
"We've never really been part of open source software before. The good news is that we had Hadapt [an OSS Hadoop specialist that Teradata acquired in 2014] out in Boston and all of these people who have grown up with open source, but it's a shift of thinking for Teradata. Now Microsoft has recently done that same shift and it was a lot harder for them because they disliked open source for decades and they made it very clear. We're making the same kind of change, but we're not rushing headlong into a lot of code, we're focusing on what we as a company can do very well, and that's SQL, and that's database access," says Graham.
"That's something that the Hadapt people are going to be doing for us. They're going to start out being the level 1-4 support on Presto because Facebook doesn't have this. We're going to help them get that stuff under control, add periphery things, and help [Presto developers] as they continue to generate more ANSI SQL functions."
ANSI SQL is the key. There's no shortage of SQL-like query engines for Hadoop, from Hive -- a SQL-like interpreter that compiles Hive Query Language (HiveQL) queries into MapReduce or Tez jobs -- to Impala (an in-memory, SQL-like query engine for Hadoop) to Spark SQL, a SQL-like query facility for the Apache Spark cluster computing framework. None of these offerings claims to comply with more recent revisions of the ANSI SQL standards, at least not yet. As Graham and others -- including Philip Russom, research director for data management -- have argued, however, ANSI SQL compliance is more than a checklist item: to the extent that HiveQL or Impala SQL depart from (or fail to implement) ANSI SQL features or functions, tools that expect standard SQL can run into trouble.
"If a SQL query [engine] is not ANSI-compliant, there are typically enough changes that it will cause heartburn for [third-party] BI tools," says Graham, who praises the efforts of Hortonworks (with both Hive and Tez) and Cloudera (with Impala) to improve ANSI SQL compliance. "In the past, we've had to deal with database vendors that are not SQL-compliant, which is frustrating. If you [as a customer] follow the ANSI format, our system will behave and be completely compatible [with SQL code]. Our goal with our commitment to Presto is to bring the same experience to [querying data in] Hadoop."
Ultimately, Graham says, the plan is to work with Facebook and the Presto team to improve Presto's support for SQL-based data description, manipulation, and (a function of both) query. On Teradata's end, it will work to incorporate support for Presto into its QueryGrid initiative. To the extent that QueryGrid comprises the equivalent of a data abstraction layer for Teradata environments, users of BI tools will ultimately be able to query and analyze data in Hadoop much like they query and analyze data in Teradata Warehouse or Aster Discovery. That's the goal, with Teradata aiming to permit bi-directional query between and among Teradata Database, Aster Discovery, and Hadoop, along with more sophisticated workload provisioning.
In fact, BI tools could actually query against the Teradata database, which -- via QueryGrid -- could schedule them to run in Hadoop. In other words, a user wouldn't even need to know if she was interacting with Hadoop.
A robust, ANSI SQL-compliant Presto would also simplify the preparation and extraction of data from Hadoop: data transformations (i.e., ETL workloads) could be expressed in SQL and pushed down to Hadoop; Presto itself could schedule and execute them. (It would likewise be possible to use Presto and SQL to schedule the extraction of prepared data from Hadoop and into another repository.) It's possible to some extent to do all of this today using native Hadoop tools (Hive, Impala, Pig, or MapReduce itself, among others), but an ANSI-SQL-fluent Hadoop would be that much more open.
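To make the idea of an ETL step "expressed in SQL" concrete, here is a minimal sketch. It is an illustration only: Python's built-in SQLite module stands in for a SQL-on-Hadoop engine such as Presto, and the table names (raw_sales, daily_sales), columns, and sample rows are all hypothetical. The point is simply that the transformation is a single standard INSERT ... SELECT statement that runs where the data lives, so no rows are pulled out to a separate ETL tool.

```python
import sqlite3

# Assumption: SQLite stands in for a SQL-on-Hadoop engine such as Presto.
# The table names, columns, and sample rows below are hypothetical.
TRANSFORM_SQL = """
INSERT INTO daily_sales (sale_date, region, total_amount)
SELECT sale_date, region, SUM(amount)
FROM raw_sales
WHERE amount IS NOT NULL
GROUP BY sale_date, region
"""


def build_demo_connection():
    """Create an in-memory database holding a few raw, unprepared rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE raw_sales (sale_date TEXT, region TEXT, amount REAL)"
    )
    conn.execute(
        "CREATE TABLE daily_sales "
        "(sale_date TEXT, region TEXT, total_amount REAL)"
    )
    conn.executemany(
        "INSERT INTO raw_sales VALUES (?, ?, ?)",
        [
            ("2015-06-01", "east", 10.0),
            ("2015-06-01", "east", 5.0),
            ("2015-06-01", "west", 7.5),
            ("2015-06-02", "east", None),  # incomplete record, filtered out
        ],
    )
    return conn


def run_transform(conn):
    """Run the transformation inside the engine and return the prepared rows."""
    conn.execute(TRANSFORM_SQL)
    return conn.execute(
        "SELECT sale_date, region, total_amount FROM daily_sales "
        "ORDER BY sale_date, region"
    ).fetchall()


if __name__ == "__main__":
    for row in run_transform(build_demo_connection()):
        print(row)
```

Because the statement sticks to standard SQL, the same text could in principle be handed to any engine that honors the standard, which is the portability argument Graham is making.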
Of Commitments, "Major" and Otherwise
What will Teradata's "commitment" to Presto actually mean? Graham contrasts his company's approach with that of IBM Corp., which late last month announced a "major" commitment to Apache Spark, a cluster computing framework that can run in Hadoop, Cassandra, and other contexts.
Big Blue pledged to assign more than 3,500 researchers to "Spark-related projects" at any of 12 global labs. It also committed to donate its SystemML language to the Apache Spark ecosystem and to help train "more than one million data scientists and data engineers" on Spark.
Graham stops short of likening IBM's support for Spark to an attempt to co-opt the development, and, with it, the direction, of Spark itself. After all, he notes, the Apache Foundation works carefully to control for (and to safeguard against) the influence or dominance of any vendor or vendors in driving a project. However, IBM's commitment is at least as PR-driven as it is substantive, Graham argues. What's more, it could be interpreted as a slap in the face by some in the open source community.
"They've pledged to put 3,500 researchers on Spark, which is about 100 times more than anybody else [has assigned to work on the Spark project]. We're trying to be very respectful, very cautious, [with respect to] how we work with the open source community. We're trying not to alienate anyone, and we're emphasizing that we want to help with the stuff we know best -- and that's SQL and database access. We're not announcing 'X' number of researchers or anything like that. We're trying to contribute our own expertise to help where we can."
If anything, IBM's come-to-Spark move isn't in any sense unprecedented, at least by Big Blue's standards. IBM typically emphasizes its commitment to any technology, be it open source, cloud computing, virtualization, or even its System z mainframe, by putting money and bodies behind it, as well as by setting ambitious training or education goals. Ultimately -- with respect to open source, cloud computing, or the revival of its mainframe platform, to say nothing of its initial foray into Hadoop (with its own, "enterprise-grade" distribution of Hadoop, viz., "Big Insights") -- such announcements tend to garner lots of press and generate a great deal of good will for IBM. In the end, however, they tend to position Big Blue as a strong competitive player -- and no more -- in the markets in which they're applicable.
The same could be said about Teradata's approach with Presto, Graham concedes. It's a question of playing to one's strengths. In IBM's case, that means money, bodies, and IP. In Teradata's case, it means IP -- i.e., expertise -- and a much smaller commitment of bodies.
"We wanted to be careful about not alienating the open source community or our partners," he says, referring to Cloudera Inc., Hortonworks Inc., and MapR Technologies Inc., which market the three most popular commercial distributions of Hadoop. "The thing about Presto is they actually have a fairly strong ANSI SQL basis. They don't have all of the SQL functions, but what they do have is compatible with ANSI. As they evolve this into something, we felt, 'This is something we can definitely help with.'"