TDWI Articles

Google's New Dataproc Service Facilitates Big Data Management

Google officially unveiled its long-incubating Google Cloud Dataproc, a PaaS offering the company says takes most of the responsibility -- and a big share of the cost -- of deploying and managing Hadoop and Spark clusters out of the equation.

Do we really need another new Hadoop or Spark platform-as-a-service (PaaS) offering?

If the service is being offered under the aegis of Google Inc.'s Cloud Platform, yes, we probably do.

Google has officially unveiled its long-incubating Google Cloud Dataproc, a PaaS Hadoop and Spark offering. (Dataproc is a shortening of "data processing.")

Its pitch is simple: big data technologies such as Hadoop and Spark aren't just difficult to set up, manage, and operate; they're costly, too. Cloud Dataproc helpfully proposes to take most of the responsibility -- and a big share of the cost -- of deploying and managing Hadoop and Spark clusters out of the equation.

In this respect, it's a lot like Amazon's Elastic MapReduce (EMR), one of the PaaS products Amazon offers under the auspices of its Amazon Web Services (AWS) brand. It's also similar to Microsoft Corp.'s Azure HDInsight, which supports both Hadoop and Spark.

Truth be told, it's similar to a slew of Hadoop/Spark PaaS offerings from providers such as Cloud Foundry and IBM Corp. These and other platforms tout managed Hadoop and Spark service offerings, but unlike Amazon, Microsoft, and Google, they don't lead the overall market for cloud services, where those three rank first, second, and third, respectively. That's what makes the official availability of Cloud Dataproc a very big deal.

Google hopes to draw on some of the same advantages that distinguish Amazon's AWS and Microsoft's Azure managed Hadoop/Spark services. Both vendors tout integration and interoperability with a broad portfolio of managed PaaS services, including streaming analytics (Amazon Kinesis, Azure Stream Analytics); a massively parallel processing, or MPP, data warehouse service (Amazon Redshift, Azure SQL Data Warehouse); machine learning (Amazon Machine Learning, Azure Machine Learning); scalable cloud storage (Amazon S3, Azure Blob Storage); and so on.

True, Google doesn't offer an explicit MPP data warehouse service, although it positions Google BigQuery as something similar. Google can nonetheless claim to offer PaaS products (Cloud Dataflow, for streaming analytics; Prediction API, a cloud machine learning service; Cloud Bigtable, a NoSQL PaaS database) that rival those of its two biggest competitors.

Dataproc also taps into the cost-friendly PaaS cloud model, which combines low-cost processing, cheap storage, and, most of all, convenience.

"With integrations to Google BigQuery, Google Cloud Bigtable, and Google Cloud Storage, which provide reliable storage independent from Dataproc clusters, customers have created clusters only when they need them, saving time and money, without losing data. Cloud Dataproc can also be used in conjunction with Google Cloud Dataflow for real-time batch and stream processing," product manager James Malone wrote on Google's Cloud Platform blog.
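The cost argument behind "clusters only when they need them" comes down to simple arithmetic. The sketch below compares a cluster left running all month with one that exists only while jobs run. Every number in it (node count, hourly rate, job length) is a hypothetical placeholder rather than a Google price, and storage charges, which continue whether or not a cluster exists, are left out.

```python
def cluster_cost(nodes: int, hourly_rate_per_node: float, hours: float) -> float:
    """Compute cost of running `nodes` machines for `hours` at a flat hourly rate."""
    return nodes * hourly_rate_per_node * hours


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
NODES = 10
RATE = 0.10            # assumed dollars per node-hour, not a real price
HOURS_IN_MONTH = 720   # 30 days x 24 hours
JOB_HOURS = 2          # assumed length of one nightly batch job
JOBS_PER_MONTH = 30

# Cluster that is never shut down.
always_on = cluster_cost(NODES, RATE, HOURS_IN_MONTH)

# Cluster that is created for each job and deleted afterward.
on_demand = cluster_cost(NODES, RATE, JOB_HOURS * JOBS_PER_MONTH)

print(f"always-on: ${always_on:.2f}")
print(f"on-demand: ${on_demand:.2f}")
```

Under these assumed inputs the on-demand cluster is billed for 60 hours instead of 720. The pattern works only because the data lives in storage that outlasts the cluster, which is the point of the integrations Malone describes.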

Malone could easily have invoked Google Prediction, or Google Cloud Datalab -- Google's cloud-based data visualization and discovery tool -- or Google Cloud Pub/Sub, a competitor to Apache Kafka and similar publish-subscribe messaging middleware technologies.

With Dataproc, Google credibly offers what Malone called a "complete data platform." As noted, Google doesn't provide an MPP data warehouse service, but BigQuery is arguably the next best thing -- possibly even better. It draws on Google's Bigtable, Spanner, and F1 technologies.

Spanner is Google's distributed NewSQL database; F1 is the relational DBMS that sits on top of it and achieves OLTP-like ACID compliance. F1 is predicated on a strong, not an eventual, consistency model. NoSQL's single biggest cost, from a data management perspective, is its lack of atomic, consistent, isolated, and durable -- i.e., truly ACID -- transaction guarantees.
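To make the ACID guarantee concrete, here is a minimal sketch of the atomicity part: either every statement in a transaction takes effect, or none does. It uses Python's built-in `sqlite3` module purely as a stand-in. SQLite is a single-node database, so this shows nothing about how Spanner or F1 work internally or about distributed consistency. The `accounts` table and the `transfer` and `balances` helpers are hypothetical names invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"  # overdrafts are rejected
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; return True on success."""
    try:
        # `with conn` commits if the block succeeds and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint, so the credit is undone too.
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
print(ok, failed, balances(conn))
```

The second transfer credits bob before the debit fails, yet bob's balance is unchanged afterward because the whole transaction was rolled back. A store without transactional guarantees would leave the application to detect and repair that half-finished update itself.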

To say that the publication of Google's Spanner and F1 technical papers got database geeks excited is to drastically understate the case. As a combined data processing and data management platform, Spanner/F1 achieves most, if not all, of the benefits of the NoSQL model -- data type flexibility, relaxed or non-existent schema requirements, massive parallelism, transparent data distribution, and synchronous data replication -- without the associated costs. In other words, Google can plausibly position BigQuery as a scalable alternative to MPP cloud data warehouse services from Amazon and Microsoft.

Finally, database geeks aren't the only ones excited. Google can claim its "complete data platform" has partners worked up, too. In his blog post, Malone cited a partner ecosystem that includes several business intelligence (BI) industry familiars, among them Attunity, Looker, and Zoomdata.

About the Author

Stephen Swoyer is a technology writer with 20 years of experience. His writing has focused on business intelligence, data warehousing, and analytics for almost 15 years. Swoyer has an abiding interest in tech, but he’s particularly intrigued by the thorny people and process problems technology vendors never, ever want to talk about.

