Google has announced another addition to its Cloud Platform line: Google Cloud Dataproc. The new service is aimed at making Hadoop and Spark easier to deploy and manage within Google Cloud Platform. Much like the recent announcement from Dell and Cloudera, it allows the use of Hadoop without the high training costs usually involved.
As datasets continue to grow in size and complexity, more powerful tools will be needed to analyze them. While the tools exist, they often add another layer of complexity, and it can be costly to train administrators on new technologies or bring in consultants. Google is introducing Dataproc, an automated, managed service for Hadoop and Spark. With Dataproc, users can take advantage of open source data tools for batch processing, querying, streaming, and machine learning while using its automation to quickly create and manage clusters. Dataproc also allows clusters to be turned off when not in use, which helps save costs because billing is minute-by-minute.
Benefits include:
- Cloud Dataproc is priced at only 1 cent per virtual CPU in a customer’s cluster per hour, on top of the other Cloud Platform resources used. Cloud Dataproc clusters can include preemptible instances that have lower compute prices, reducing costs further. Instead of rounding usage up to the nearest hour, Cloud Dataproc charges customers only for what is used with minute-by-minute billing and a ten-minute-minimum billing period.
- Without Dataproc, it can take anywhere from 5 to 30 minutes to create Spark and Hadoop clusters on-premises or through IaaS providers. By comparison, Cloud Dataproc clusters are quick to start, scale, and shut down, with each of these operations taking 90 seconds or less on average. This means users can spend less time waiting for clusters and more hands-on time working with their data.
- Cloud Dataproc has built-in integration with other Google Cloud Platform services, such as BigQuery, Cloud Storage, Cloud Bigtable, Cloud Logging, and Cloud Monitoring, so customers have more than just a Spark or Hadoop cluster; they have a complete data platform. For example, they can use Cloud Dataproc to extract, transform, and load (ETL) terabytes of raw log data directly into BigQuery for business reporting.
- Customers can easily interact with clusters and Spark or Hadoop jobs through the Google Developers Console, the Google Cloud SDK, or the Cloud Dataproc REST API. When they're done with a cluster, they can simply turn it off so money isn’t wasted on an idle cluster. There is no need to worry about losing data, because Cloud Dataproc is integrated with Cloud Storage, BigQuery, and Cloud Bigtable.
- There is no need to learn new tools or APIs to use Cloud Dataproc, making it easy to move existing projects into Cloud Dataproc without redevelopment. Spark, Hadoop, Pig, and Hive are frequently updated, so users can be productive faster.
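To make the pricing model above concrete, here is a minimal Python sketch of the Cloud Dataproc fee described in the first bullet: 1 cent per virtual CPU per hour, billed by the minute with a ten-minute minimum. The function names are hypothetical, and rounding partial minutes up to the next whole minute is an assumption rather than something the announcement states. The figures cover only the Dataproc fee, not the other Cloud Platform resources a cluster uses.

```python
import math
from decimal import Decimal

# Figures taken from the announcement.
RATE_PER_VCPU_HOUR = Decimal("0.01")  # USD per virtual CPU per hour
MINIMUM_BILLED_MINUTES = 10           # ten-minute minimum billing period


def dataproc_fee(vcpus: int, runtime_minutes: float) -> Decimal:
    """Estimate the Dataproc fee (USD) with minute-by-minute billing.

    Assumption: a partial minute is rounded up to a whole minute.
    The result excludes Compute Engine and other platform charges.
    """
    if vcpus < 0 or runtime_minutes < 0:
        raise ValueError("vcpus and runtime_minutes must be non-negative")
    billed_minutes = max(MINIMUM_BILLED_MINUTES, math.ceil(runtime_minutes))
    return RATE_PER_VCPU_HOUR * vcpus * billed_minutes / 60


def hourly_rounded_fee(vcpus: int, runtime_minutes: float) -> Decimal:
    """Same rate, but usage rounded up to the nearest hour, for comparison."""
    if vcpus < 0 or runtime_minutes < 0:
        raise ValueError("vcpus and runtime_minutes must be non-negative")
    billed_hours = math.ceil(runtime_minutes / 60)
    return RATE_PER_VCPU_HOUR * vcpus * billed_hours


if __name__ == "__main__":
    # Hypothetical cluster: 24 virtual CPUs running a 25-minute job.
    print("per-minute billing:", dataproc_fee(24, 25))
    print("hourly rounding:   ", hourly_rounded_fee(24, 25))
```

For a short job, per-minute billing charges only for the minutes used (subject to the ten-minute minimum), whereas hourly rounding would charge for the full hour.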
Availability and pricing
Google Cloud Dataproc is available now as a beta service and starts at $0.01 per virtual CPU per hour.