
Agony and Ecstasy at Hadoop Summit

Hadoop all but went supernova at last month's Hadoop Summit event.

Hadoop all but went supernova last month, although its luminosity -- as more of a metaphorical than a physical phenomenon -- wasn't visible in the night sky but was instead on display at the San Jose Convention Center in downtown San Jose. The 2013 Hadoop Summit was hosted by Yahoo Inc. and Hortonworks Inc., an open source software-oriented, Yahoo-founded Hadoop vendor.

If this year's Strata conference (hosted in nearby Santa Clara) produced a deluge of Hadoop-related news items, Hadoop Summit upped the ante with dozens of vendors teeing up announcements to coincide with the two-day event. Here's a sampling of the news.

Kognitio 8 Has Landed

Kognitio announced version 8 of its Kognitio Analytic Platform.

One of the most striking new features in Kognitio 8 is an ability to invoke scripts or to run binary code -- for example, jobs written in R, Python, or Java -- such that they can execute across Kognitio's massively parallel processing (MPP) architecture. Kognitio isn't the first vendor to tout an ability to parallelize jobs or queries written in languages other than SQL -- Teradata Corp. (via its acquisition of the former Aster Data Systems Inc.), EMC Corp. (via its acquisition of the former Greenplum Software Inc.), and Actian Inc. (via its acquisition of the former ParAccel Inc.) boast a similar capability -- but it might be the most extensible such implementation to date. From within Kognitio 8, a user can invoke any language for which there's a Linux interpreter: this includes R, Python, and Java as well as C, C++, LISP, and a host of 4GLs, including SAS.

Whether the output from the interpreter (or the algorithms being invoked) can be effectively or meaningfully parallelized is another question, of course. (Parallelization is hard. That's the raison d'etre for MPP, which -- as honed and perfected by vendors such as Kognitio, Teradata, and others -- focuses on parallelizing SQL workloads.) Elsewhere, Kognitio 8 boasts improved Hadoop connectivity (via its Hadoop Connector) and implements a new "external tables" feature that purports to facilitate access to information in external sources -- chiefly, non-relational (or semi-structured) data.
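To make the parallelization caveat concrete, here is a minimal Python sketch. It is an illustration only and does not use Kognitio's API: the helper names (split, parallel_sum, naive_parallel_median) and the sample data are hypothetical, and a thread pool stands in for MPP nodes. It shows that a sum computed per partition merges cleanly, whereas a median computed per partition does not merge into the true median.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def split(rows, parts):
    """Deal rows round-robin into `parts` partitions (a stand-in for MPP data distribution)."""
    return [rows[i::parts] for i in range(parts)]


def parallel_sum(rows, parts=4):
    """Sum each partition independently, then merge the partial sums."""
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partials = list(pool.map(sum, split(rows, parts)))
    return sum(partials)  # partial sums combine exactly


def naive_parallel_median(rows, parts=4):
    """Take the median of each partition, then the median of those medians."""
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partials = list(pool.map(median, split(rows, parts)))
    return median(partials)  # NOT the true median in general


# Hypothetical sample data, chosen so the two approaches disagree.
rows = [1, 2, 3, 4, 100, 200, 300, 400, 500]

print(parallel_sum(rows), sum(rows))
print(naive_parallel_median(rows), median(rows))
```

The sum decomposes because addition is associative; the median does not, so a script that computes one would need a different algorithm (or a final single-node pass) to give a correct answer when spread across nodes.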

In addition to connectivity into Hadoop, Kognitio 8 supports high-speed connectivity into Amazon's S3 cloud storage service.

Informatica

Informatica Corp., which wasn't a sponsor, announced a partnership with Zettaset Inc., which bills itself as a specialist in "secure big data management." Under the terms of the accord, Zettaset will embed Informatica's PowerCenter Big Data Edition in its Zettaset Orchestrator cluster management software for Hadoop. Zettaset says that its Orchestrator product can be used to simplify or automate aspects of Hadoop management, including installation and provisioning, cluster configuration, and Hadoop security. (Hadoop management is likely to be an ongoing pain point for some time to come.) By embedding PowerCenter Big Data Edition into Orchestrator, Zettaset says, customers can drag and drop data from the multi-structured world of Hadoop into the more rigorously structured -- i.e., SQL-centric -- realm of business intelligence (BI) and decision support. Apache Sqoop, which uses a command-line interface (CLI), is the default means by which to move data between RDBMS platforms and Hadoop.

Simba

Simba, a specialist in data connectivity software, announced a bevy of partnerships -- including one with Informatica. The Simba-Informatica partnership aims to provide "improved" (i.e., more BI-friendly) read and write access to MongoDB and Cassandra, two prominent NoSQL repositories. The focus is on bringing structure -- let's call it "tablature" -- to the semi-structured or unstructured world of NoSQL. Both Simba and Informatica note that popular BI tools such as Tableau, Crystal Reports, and even Microsoft's Excel expect to consume structured data -- if only in the form of comma-separated values.

Elsewhere, Simba announced version 9.1 of its SimbaEngine SDK, which is a toolkit for creating drivers or adapters to connect traditional relational data sources with Web or big data sources. Simba says its SimbaEngine SDK 9.1 offers limited support for the data definition language (DDL) via support for CREATE/DROP statements for tables and indices. Other enhancements include support for dynamic schemas, Visual Studio 2012 interoperability, a "collaborative query execution" feature -- which is able to intelligently push queries down into a data store instead of performing them in the Simba engine itself -- and support for CASE expressions (SQL's conditional construct) in SQL queries. The SimbaEngine SDK 9.1 also supports 13 languages.
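For readers unfamiliar with these SQL features, the following minimal sketch uses Python's built-in sqlite3 module as a stand-in data store. It does not use the SimbaEngine SDK, and the table, index, and helper names are hypothetical. It shows CREATE/DROP for a table and an index, a CASE expression, and the general idea behind pushing a query down to the data store rather than evaluating it in the client layer.

```python
import sqlite3

# In-memory SQLite database standing in for an arbitrary data store.
conn = sqlite3.connect(":memory:")

# DDL: create a table and an index (hypothetical names).
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("CREATE INDEX idx_orders_amount ON orders (amount)")
conn.executemany(
    "INSERT INTO orders (id, amount) VALUES (?, ?)",
    [(1, 20.0), (2, 250.0), (3, 1200.0)],
)

# "Pushed down": the store evaluates the filter and the CASE expression itself.
pushed_down = conn.execute(
    """
    SELECT id,
           CASE WHEN amount >= 1000 THEN 'large'
                WHEN amount >= 100  THEN 'medium'
                ELSE 'small'
           END AS size
    FROM orders
    WHERE amount >= 100
    ORDER BY id
    """
).fetchall()


def label(amount):
    """Client-side equivalent of the CASE expression above."""
    if amount >= 1000:
        return "large"
    if amount >= 100:
        return "medium"
    return "small"


# "Client side": fetch every row, then filter and label in the client layer.
all_rows = conn.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()
client_side = [(i, label(a)) for i, a in all_rows if a >= 100]

# DDL: drop the index and the table again.
conn.execute("DROP INDEX idx_orders_amount")
conn.execute("DROP TABLE orders")
conn.close()

print(pushed_down)
print(client_side)
```

Both paths return the same rows; the difference is where the work happens. Pushing the query down means only matching rows leave the store, which matters when the store holds far more data than the client can comfortably pull across the wire.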

Simba's biggest announcement was arguably the connectivity deal that it signed with Microsoft to support Windows Azure HDInsight, Redmond's Apache-compatible Hadoop distribution. The Microsoft-Simba agreement gives users of Microsoft's BI stack and its Windows Azure HDInsight service access rights to Simba's Apache Hive ODBC Driver with SQL Connector. (Hive is a SQL-like interpreter for Hadoop that compiles queries formatted in Hive Query Language -- HQL, a variant of SQL -- into Hadoop jobs.) Basically, the connector gives users of Microsoft's on-premises and cloud-based BI tools a way to get data out of Hive via ODBC.

Hortonworks

Hadoop Summit co-host Hortonworks touted several announcements, starting with the release of a Community Preview (CP) of version 2.0 of its Hortonworks Data Platform (HDP). HDP 2.0 CP is the first Hortonworks distribution to include a beta release of Apache YARN (YARN is a backronym for "Yet Another Resource Negotiator"), a project that promises to decouple Hadoop from its MapReduce execution engine. YARN aims to replace the less flexible Hadoop JobTracker, which currently performs two functions: managing cluster resources and distributing MapReduce jobs.

YARN promises to open up or democratize Hadoop by making it easier to parallelize non-MapReduce jobs. In a post-YARN Hadoop, vendors can more easily write applications that run inside Hadoop -- i.e., exploiting Hadoop services and the Hadoop Distributed File System (HDFS) for storage -- while using their own libraries or logic in place of Hadoop's de rigueur MapReduce execution engine. It's the difference, argues Hortonworks founder Arun Murthy, between running applications "on" Hadoop and running them "in" Hadoop.

Also at Hadoop Summit, Hortonworks announced a certification program for YARN. The idea is to encourage application developers to build and certify their apps for YARN; participating members get access to the YARN development team, access to a YARN implementation guide and YARN textbook, input into YARN development, and promotional bonuses, including a YARN certification for compliant apps. The news came with a caveat: the YARN API will soon be frozen, pending general availability of Hadoop 2.0.

Plenty of other vendors teed up announcements or promotions at Hadoop Summit. Teradata, for example, unveiled its Teradata Portfolio for Hadoop, a product and services bundle that it's offering in conjunction with Hortonworks. Just two days prior to the Hadoop Summit, Talend announced a connector for Neo4j, the graph database marketed by Neo Technology. At Hadoop Summit, Talend discussed "pitfalls" that companies should be mindful of as they seek to expand their big data projects. Cloudera Inc. announced a new high-speed connector for Teradata. Its "Cloudera Connector Powered by Teradata" is actually based on Apache Sqoop, a CLI tool for moving data between Hadoop and RDBMSs.
