Google Cloud Serverless Spark

What is Google Cloud Serverless Spark?

Apache Spark has become a popular platform for data workers (engineers, analysts, and scientists) running streaming, machine learning (ML), or SQL workloads that require fast, iterative access to data sets. As a fast in-memory data processing engine, it is well suited to exploring and engineering both simple and intricate data. However, running Spark yourself has drawbacks that eat into its efficiency. Clusters have to be provisioned and managed, and the infrastructure has to be adjusted to the needs of each job. That, in turn, calls for highly competent staff who are proficient in operating data workloads as well as writing them. With this in mind, is there a way to benefit from Apache Spark without dealing with these limitations?

Enter Spark on Google Cloud, which Google describes as the first autoscaling serverless Spark. It is cloud-native and fully integrated with Google Cloud Platform (GCP) services, though not limited to them. With serverless Spark, you can power ETL, data science, and data analytics workloads at scale. For the past half-decade, Google has been running mission-critical Spark workloads for businesses at scale using Dataproc, its managed service for open-source Spark and Hadoop. Serverless Spark extends these capabilities so that customers can run Spark workloads in a cloud-native way without managing clusters themselves.

Serverless Spark Capabilities

Serverless Spark draws on Dataproc, BigQuery, Dataplex, and Vertex AI to enable users to:

  • Cut down the time spent managing Spark clusters.

  • Execute Spark jobs from the interface of the user’s choice.

  • Deploy workloads flexibly on different clusters depending on the requirements.

Just like other services on GCP, the infrastructure that powers serverless Spark is fully managed by Google Cloud. Developers don't have to be infrastructure experts as well; they can focus on code and logic. The service is integrated with Dataproc to help users accelerate their open-source data and analytics processing. Dataproc supports popular open-source software, including Apache Spark and Hadoop. If you are running OSS clusters on-premises, you can migrate these workloads to the cloud and benefit from autoscaling and managed infrastructure.
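To make the "no cluster to manage" idea concrete, here is a minimal sketch of how a PySpark script might be submitted as a serverless batch from the command line. It only assembles the arguments for `gcloud dataproc batches submit pyspark` and prints them; it does not run anything. The helper name `build_batch_submit_command`, the bucket path, the batch ID, and the script arguments are all hypothetical and chosen purely for illustration.

```python
import shlex
from typing import List, Optional, Sequence


def build_batch_submit_command(
    main_python_file: str,
    region: str,
    batch_id: Optional[str] = None,
    job_args: Sequence[str] = (),
) -> List[str]:
    """Return the argv for submitting a PySpark script as a serverless batch.

    Hypothetical helper: it only builds the command, it never executes it.
    """
    command = [
        "gcloud", "dataproc", "batches", "submit", "pyspark",
        main_python_file,
        f"--region={region}",
    ]
    if batch_id is not None:
        command.append(f"--batch={batch_id}")
    if job_args:
        # Everything after "--" is handed to the script, not to gcloud.
        command.append("--")
        command.extend(job_args)
    return command


if __name__ == "__main__":
    argv = build_batch_submit_command(
        "gs://example-bucket/jobs/etl.py",  # hypothetical script location
        region="us-central1",
        batch_id="nightly-etl",             # hypothetical batch ID
        job_args=["--date", "2024-01-01"],  # hypothetical script arguments
    )
    # Print a shell-safe version of the command for review.
    print(shlex.join(argv))
```

Note that the command names a script and a region but no cluster, which is the point of the serverless option. Real submissions usually need further settings (networking, service account, dependencies) that are left out here.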

Serverless Spark Integrations

Users can also write SQL or PySpark code in BigQuery, and serverless Spark will execute it. Data can stay in BigQuery while serverless Spark runs analytics on it, all in a unified platform. Vertex AI Workbench lets developers connect their notebooks to Spark with a single click for interactive development. Vertex AI also makes it possible to use other ML frameworks, such as TensorFlow and PyTorch, with Spark. You can build and deploy ML models as Spark jobs on Dataproc to analyze large volumes of data automatically.
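As a rough illustration of Spark working on data that lives in BigQuery, the sketch below reads a table through the spark-bigquery connector and aggregates it with Spark SQL. It assumes the connector is available on the Spark runtime and that the table has `word` and `word_count` columns, as the public sample table `bigquery-public-data.samples.shakespeare` does. The function names `word_count_sql` and `top_words` are hypothetical.

```python
def word_count_sql(view_name: str, limit: int) -> str:
    """Build the aggregation query run against a temporary view."""
    if not view_name.isidentifier():
        raise ValueError(f"not a valid view name: {view_name!r}")
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        f"SELECT word, SUM(word_count) AS total "
        f"FROM {view_name} GROUP BY word ORDER BY total DESC LIMIT {limit}"
    )


def top_words(spark, table: str, limit: int = 10):
    """Return a DataFrame with the most frequent words in a BigQuery table.

    `spark` is expected to be a pyspark.sql.SparkSession. The "bigquery"
    data source is provided by the spark-bigquery connector, which is
    assumed to be present on the runtime.
    """
    words = spark.read.format("bigquery").option("table", table).load()
    # Register the table as a temporary view so it can be queried with SQL.
    words.createOrReplaceTempView("words")
    return spark.sql(word_count_sql("words", limit))
```

Inside a Spark job or notebook session, this could be called as `top_words(spark, "bigquery-public-data.samples.shakespeare").show()`. The data stays in BigQuery, and Spark reads it only when the job runs.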

You can also centrally and natively manage data distributed across multiple data lakes, silos, or warehouses using Spark through Dataplex. Dataplex provides a unified analytics interface integrated with Spark SQL, PySpark, and notebooks, just one click away. Here you can save, share, and search notebooks and scripts alongside data.

Depending on your needs and preferences, Spark on Google Cloud can be used in three different ways. If you prefer Kubernetes for infrastructure management, you can run it on Google Kubernetes Engine (GKE); if you prefer the Hadoop style of infrastructure, you can run it on Google Compute Engine (GCE). Serverless Spark is made primarily for customers who prefer a no-ops Spark deployment. With the serverless option, you are charged for the resources each Spark job consumes while it runs rather than for an always-on cluster.
