Amazon and Cloudera Make Big Moves

The performances of Amazon and Cloudera stand out from the pack in Gartner's new data warehousing Magic Quadrant. Both are oh-so-close to breaking through into leadership roles.

The "Leaders" quadrant of Gartner Inc.'s newest "Magic Quadrant for Data Warehouse and Data Management Solutions for Analytics" report looks a lot like you'd imagine. It's populated by the data warehousing (DW) industry's lineup of the usual suspects: IBM Corp., Microsoft Corp., Oracle Corp., and Teradata Corp. lead the pack, joined by SAP AG.

Several other players stand out, however. Consider the curious case of Cloudera Inc., which Gartner plots as a "Visionary" on the Cartesian grid of its Magic Quadrant. The thing is, Cloudera's getting awfully close to cracking the "Leaders" quadrant: right now, it's the equivalent of about one-third of a plot-point -- i.e., the diameter of one of the blue dots Gartner uses to represent individual vendors -- from Leaders-ship. Cloudera's position has shifted significantly since last year, when Gartner had it as a "Challenger" vendor. As Gartner sees it, Cloudera's "Ability to Execute" (the Y axis on the grid) declined year-over-year, even as its "Completeness of Vision" (X axis) improved markedly.

How does Gartner explain this? It doesn't, exactly. Arguably, Cloudera's completeness of vision has improved over the last 24 months: in 2014, for example, it introduced new metadata management and data lineage features with version 2.0 of its Cloudera Navigator governance product.

"Cloudera differentiates itself from other Hadoop distribution vendors by continuing to invest in specific capabilities, such as further improvements to Cloudera Navigator ... which provides metadata management, lineage and auditing ... [while] keeping up with the Hadoop open-source project," Gartner analysts Mark Beyer and Roxane Edjlali write.

The upshot is that even if Cloudera's metadata management story isn't as robust as, say, Oracle's or Teradata's, it (1) is a vast improvement over straight-from-Git Apache Hadoop and (2) has a full two years of maturation under its belt, during which time Cloudera has continued to enhance it.

If nothing else, it's on the right track. Metadata management, lineage, and governance are hot tickets in big data management. Teradata, for example, recently cozied up to Alation, a start-up that specializes in metadata management for Hadoop. (Two years ago, Teradata also purchased the former Revelytix Inc., presumably in a bid to redress the Hadoop platform's metadata shortcomings.) Informatica Corp. last year announced a new product -- Big Data Management -- that it says addresses governance, security, metadata, lineage, and other amenities. Atlas, an Apache Software Foundation "Incubator" project, aims to provide governance services for Hadoop. Finally, upstart Diyotta touts its own offering as a solution for big data management, governance, and related issues.

What's to account for Cloudera's slippage on the "Ability to Execute" axis? One explanation is that the market as a whole continues to evolve and that Cloudera's own evolution isn't tracking (closely enough) with that of its competitors -- or with the needs of the market as a whole.

One hint of this comes via Gartner's critique of Cloudera's cloud strategy. "[O]rganizations have a growing interest in cloud deployments, [but] Cloudera mainly addresses the cloud using an infrastructure-as-a-service approach that does not offer scalable, elastic and managed service support," Beyer and Edjlali write, noting that "Cloudera is addressing these needs with enhancements to Cloudera Director ... to ease deployment of elastic clusters in the cloud."

Cloudera's version 2.0 release of Director arrived too late (January 21st of this year) to factor into Gartner's "Magic Quadrant for Data Warehouse and Data Management Solutions for Analytics." That's unfortunate, because Director 2.0 boasts a few new features that seem to address Gartner's criticism. Cloudera has long supported IaaS cloud platforms such as Amazon's AWS and the Google Cloud Platform. In the same way, Cloudera's Director Service Provider Interface first debuted with Director 1.5. It permits organizations to deploy the company's Cloudera Distribution of Hadoop (CDH) on other cloud platforms. Nothing new there.

However, Director 2.0 should support a more resilient elastic cloud experience. For example, when growing (or shrinking) an instance, the revamped Director can now roll back to a prior (known-good) state if it detects an error. Director 2.0 can also automatically spin up short-lived "spot" instances that exploit unused compute capacity at low cost. (These are "spot" instances in Amazon's EC2 parlance; in Google's vernacular, they're called "preemptible" instances.) Because they're running in unused compute capacity, spot instances usually cost much less -- as much as 70 percent less, according to Google -- than do regular compute instances. Elsewhere, Director can now dynamically spin up new clusters to process queued Spark or Impala jobs. According to Cloudera, this helps automate job queues: Director handles the provisioning (spinning up) to run queued jobs as well as the termination (spinning down) of clusters once jobs have completed.
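The two behaviors described above, rolling back a failed resize and spinning clusters up and down around a job queue, can be illustrated with a minimal Python sketch. It is not Cloudera Director's API or implementation: `Cluster`, `ResizeError`, `resize_with_rollback`, `run_queue`, and the callbacks passed to them are hypothetical names invented for illustration, and the "cluster" is just an in-memory worker count.

```python
from dataclasses import dataclass


class ResizeError(Exception):
    """Raised when a simulated resize step fails (hypothetical)."""


@dataclass
class Cluster:
    # Hypothetical stand-in for a managed cluster; not a Cloudera Director type.
    name: str
    workers: int


def resize_with_rollback(cluster, target_workers, apply_step):
    """Grow or shrink one worker at a time; restore the prior size on error."""
    known_good = cluster.workers          # snapshot of the last known-good state
    step = 1 if target_workers > cluster.workers else -1
    try:
        while cluster.workers != target_workers:
            apply_step(cluster, step)     # caller-supplied; may raise ResizeError
    except ResizeError:
        cluster.workers = known_good      # roll back to the known-good size
        return False
    return True


def run_queue(jobs, provision, terminate):
    """Provision a cluster only when jobs are queued, then tear it down."""
    if not jobs:
        return []                         # nothing queued: never spin up
    cluster = provision()
    try:
        return [job(cluster) for job in jobs]
    finally:
        terminate(cluster)                # spin down even if a job fails
```

In this sketch the caller supplies `apply_step`, `provision`, and `terminate`, so the same control flow could in principle wrap any cloud provider's calls. The point is only the shape of the logic: snapshot the state before changing it, and tie cluster lifetime to the queue.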

If anything, Cloudera is a victim of its own growth and success, Gartner suggests.

"Although Cloudera has expanded into new geographies and added new clients, reference customers consider that the availability of support or professional service resources is becoming constrained. Cloudera has recognized this as an issue, and worked to address these points in 2015 by, for example, expanding its support team in Europe," Beyer and Edjlali point out.

Amazon Agonistes

Amazon Inc. turned in a no less impressive performance in this year's "Magic Quadrant for Data Warehouse and Data Management Solutions for Analytics." In 2015, Gartner plotted Amazon in its "Visionaries" quadrant, far to the left on the quadrant's Cartesian plane. In 2016, Amazon is nigh on nicking the vertical "Y" axis that separates the "Visionaries" and "Leaders" quadrants. In the intervening year, then, the retail and cloud giant managed (1) to hold its own on the Y ("Ability to Execute") axis -- Amazon finished fifth in "Ability to Execute," ahead of SAP and trailing only Microsoft, Oracle, IBM, and Teradata -- and (2) to dramatically improve its vision. Its completeness of vision, that is.

The centerpiece of Amazon's analytics data warehousing strategy is Redshift, the massively parallel processing (MPP) data warehousing service it first introduced in late 2012. Redshift wasn't in any sense a homegrown database, either: Amazon acquired most of its pieces, including its core MPP database engine, from the former ParAccel Inc., an analytics database pure-play that was in turn acquired by Actian in 2013. The thing is, Redshift is one of several weapons in Amazon's AWS arsenal: it also offers a streaming service (AWS Kinesis), a scalable cloud storage service (Amazon S3), and, of course, its seminal Amazon Elastic MapReduce (EMR) service.

There's also Amazon's newer Amazon Relational Database Service, or RDS. Customers can spin up instances of Amazon's Aurora RDBMS, MariaDB, Microsoft SQL Server, MySQL, Oracle, and PostgreSQL. Amazon doesn't market RDS as an analytics database service, of course. Then again, MySQL wasn't designed for analytics query processing or decision support workloads. That didn't stop it from being used that way, however.

In AWS, then, Amazon has itself a fairly well-rounded data management stack. Gartner thinks so, noting that not only can Redshift credibly claim to be the leading data warehouse-as-a-service offering, but that Amazon's S3 service is increasingly employed as a storage sink for the data lake use case, too. "AWS ... continues to achieve strong adoption, driven by its broad acceptance of the cloud, flexibility, and agility from both a technical and a financial standpoint," Beyer and Edjlali write.

"AWS supports a wide variety of use cases when its offerings are combined with other data management solutions. For example, our client interactions indicate adoption of S3 in support of data lakes, in combination with Redshift for analytics."

One of the perceived strengths of the cloud model from a buyer's perspective is that it's as easy to pull the plug on it as it is to get started. (This is never true, of course. Depending on the circumstances -- size, concurrency, number of applications -- it can be far from a trivial matter to move from one data warehouse platform to another.)

According to Gartner, AWS customers seem pretty satisfied, however. Most plan to increase their use of Redshift and other services. "The vast majority of reference clients indicate that they plan to invest more in Redshift, which demonstrates continued satisfaction with this product," the analyst duo report.

"Strong scores for customer experience and rapid, significant market penetration are major contributors to AWS's position on the Ability to Execute axis."

Should we expect to see Amazon as a data warehousing market Leader in 2017? Possibly, although Gartner notes that the cloud data warehouse market has become increasingly crowded. In addition to powerhouses IBM, Microsoft, Oracle, SAP, and Teradata, pure-plays such as Snowflake Computing Inc. are contesting this space, too. All of these vendors (Snowflake excepted) have one notional edge on Amazon and AWS, Gartner notes: an on-premises option.

"As AWS is a pure-play cloud vendor, Redshift lacks support for the hybrid cloud-and-on-premises data warehousing combinations that Gartner predicts will be the norm for most organizations by the end of 2018," Beyer and Edjlali write. There's another wrinkle here, too. Yes, Redshift is based on best-of-breed MPP technology -- albeit technology that isn't as mature as, say, Teradata's database. (Teradata's workload management facility is arguably the envy of the industry.)

Why does this matter? Because as AWS subscribers tap Redshift to support mixed, high-concurrency workloads, they're running up against inherent limitations in that service, Gartner asserts: "As AWS's reference clients mature in their use of Redshift, they are starting to report limitations in relation to their expectations for complex, mixed-workload management."

Whether these limitations are endemic to Redshift itself or, just as likely, a function of the constraints of the cloud model (e.g., resource virtualization) is an open question, however.

One that might be resolved, one way or another, in 2017.