Running Apache Spark on Kubernetes

For the last few weeks, I’ve been deploying a Spark cluster on Kubernetes (K8s). I want to share the challenges, architecture, and solution details I’ve discovered with you.

Challenges

At Empathy, all code running in production must be cloud-agnostic. Until recently, Empathy depended on a different managed Spark solution for each cloud provider: EMR (AWS scenario), Dataproc (GCP scenario), and HDInsight (Azure scenario).

The managed solutions from these cloud providers offer a simple way to deploy Spark on the cloud. However, some limitations arise when a company scales up, leading to several key questions:

These are common questions when trying to execute Spark jobs. Solving them with Kubernetes can save effort and provide a better experience.

Running Apache Spark on K8s offers us the following benefits:

The benefits are the same as Empathy’s solution for Apache Flink running on Kubernetes, as I explored in my previous article.

Apache Spark on Kubernetes

Apache Spark is a unified analytics engine for big data processing, and it is particularly handy for distributed processing. Spark is also widely used for machine learning, and it is currently one of the most popular technologies in the data space.

Apache Spark Architecture

Spark Submit can be used to submit a Spark Application directly to a Kubernetes cluster. The flow would be as follows:

  1. The client sends a Spark Submit request to the Kubernetes API server on the master (control plane) node.
  2. Kubernetes schedules a new Spark Driver pod.
  3. The Spark Driver pod communicates with Kubernetes to request Spark executor pods.
  4. Kubernetes schedules the new executor pods.
  5. Once the new executor pods are running, Kubernetes notifies the Spark Driver pod that they are ready.
  6. The Spark Driver pod schedules tasks on the new Spark executor pods.
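
As a rough illustration of step 1, the sketch below assembles the kind of `spark-submit` command a client could send to the Kubernetes API server. The flags and configuration keys (`--master k8s://...`, `--deploy-mode cluster`, `spark.executor.instances`, `spark.kubernetes.container.image`) come from Spark's Kubernetes documentation. The API server URL, image name, and helper function name are hypothetical placeholders, not values from Empathy's setup.

```python
import shlex


def build_spark_submit_argv(api_server, image, main_class, app_jar,
                            name="spark-pi", executors=2):
    """Return the argv list for a cluster-mode spark-submit against Kubernetes.

    The helper name and default values are illustrative assumptions.
    """
    return [
        "spark-submit",
        # The k8s:// prefix tells Spark to talk to a Kubernetes API server.
        "--master", f"k8s://{api_server}",
        # In cluster mode the driver itself runs in a pod (step 2 of the flow).
        "--deploy-mode", "cluster",
        "--name", name,
        "--class", main_class,
        # Number of executor pods the driver requests (steps 3 and 4).
        "--conf", f"spark.executor.instances={executors}",
        # Container image used for the driver and executor pods.
        "--conf", f"spark.kubernetes.container.image={image}",
        # local:// means the jar is already present inside the image.
        app_jar,
    ]


if __name__ == "__main__":
    argv = build_spark_submit_argv(
        api_server="https://kubernetes.example.com:6443",  # hypothetical
        image="registry.example.com/spark:3.1.1",          # hypothetical
        main_class="org.apache.spark.examples.SparkPi",
        app_jar="local:///path/to/examples.jar",
    )
    # Print a copy-pasteable shell command instead of running it.
    print(shlex.join(argv))
```

Printing the command rather than executing it keeps the sketch safe to run on a machine without Spark installed. To actually submit, the same list could be passed to `subprocess.run`.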

Spark Submit Flowchart

You can schedule a Spark Application using Spark Submit (the vanilla way) or using the Spark Operator.

Spark Submit

Spark Submit is a script used to submit a Spark Application and launch the application on the Spark cluster. Some nice features include:

Spark Operator

The Spark Operator project was developed by Google and is now an open-source project. It uses Kubernetes custom resources for specifying, running, and surfacing the status of Spark Applications. Some nice features include:

Spark Submit vs Spark Operator

The image above shows the main commands of Spark Submit vs Spark Operator.

Empathy’s solution prefers Spark Operator because it allows for faster iterations than Spark Submit, where you have to create custom Kubernetes manifests for each use case.
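
To make the difference concrete, here is a minimal sketch of the custom resource that the Spark Operator consumes, built as a Python dictionary and printed as JSON (which `kubectl apply -f` also accepts). The field names follow the operator's `SparkApplication` examples (`sparkoperator.k8s.io/v1beta2`). The image, service account, resource sizes, and function name are assumptions for illustration only.

```python
import json


def spark_application(name, image, main_class, app_file,
                      executors=2, namespace="default"):
    """Build a SparkApplication manifest for the Spark Operator.

    All concrete values below (versions, sizes, service account) are
    illustrative assumptions, not Empathy's production settings.
    """
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Scala",
            "mode": "cluster",
            "image": image,
            "mainClass": main_class,
            "mainApplicationFile": app_file,
            "sparkVersion": "3.1.1",
            "restartPolicy": {"type": "Never"},
            # The driver needs a service account allowed to create executor pods.
            "driver": {"cores": 1, "memory": "512m", "serviceAccount": "spark"},
            "executor": {"cores": 1, "instances": executors, "memory": "512m"},
        },
    }


if __name__ == "__main__":
    manifest = spark_application(
        name="spark-pi",
        image="registry.example.com/spark:3.1.1",  # hypothetical image
        main_class="org.apache.spark.examples.SparkPi",
        app_file="local:///path/to/examples.jar",
    )
    print(json.dumps(manifest, indent=2))
```

Because the whole job is one declarative resource, it can live in Git and be versioned like any other Kubernetes manifest.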

Solution Details

To answer the questions posed in the Challenges section, ArgoCD and Argo Workflows can help, with support from other CNCF projects. For instance, you can use ArgoCD to deploy Argo Workflows that run your favorite Spark Application workloads on Kubernetes as sequential jobs.

The flowchart would be as follows:

Solution flowchart

ArgoCD

ArgoCD is a GitOps continuous delivery tool for Kubernetes. The main benefits are:

More detailed information can be found in their official documentation.
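
As a hedged sketch of what the GitOps side could look like, the snippet below builds an ArgoCD `Application` manifest that points at a Git repository and syncs it automatically. The field names (`source`, `destination`, `syncPolicy.automated`) are part of ArgoCD's `Application` resource. The repository URL, path, namespaces, and function name are hypothetical.

```python
import json


def argocd_application(name, repo_url, path, dest_namespace, revision="HEAD"):
    """Build an ArgoCD Application that keeps a namespace in sync with Git.

    Repository, path, and namespace values are placeholders for illustration.
    """
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            # Where the desired state lives.
            "source": {
                "repoURL": repo_url,
                "targetRevision": revision,
                "path": path,
            },
            # Where the manifests are applied (the in-cluster API server here).
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": dest_namespace,
            },
            # Automated sync: apply Git changes, prune removed resources,
            # and revert manual drift in the cluster.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


if __name__ == "__main__":
    app = argocd_application(
        name="spark-workflows",
        repo_url="https://git.example.com/data/spark-workflows.git",  # hypothetical
        path="workflows",
        dest_namespace="spark",
    )
    print(json.dumps(app, indent=2))
```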

Argo Workflows

Argo Workflows is a workflow solution for Kubernetes. The main benefits are:

More detailed information can be found in their official documentation.
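
To illustrate how sequential Spark jobs could be expressed, the sketch below generates an Argo `Workflow` in which each step is a `resource` template that creates a `SparkApplication` and waits for it to finish. The `steps`, `resource.action`, `successCondition`, and `failureCondition` fields are Argo Workflows features. The conditions on `status.applicationState.state`, the job names, and the stub manifests are assumptions, so check them against the operator version you run.

```python
import json


def sequential_workflow(name_prefix, jobs):
    """Build an Argo Workflow that runs the given manifests one after another.

    jobs: ordered list of (step_name, manifest_dict) pairs. The manifests are
    expected to be SparkApplication resources; this is an illustrative sketch.
    """
    main = {
        "name": "main",
        # Each inner list is one step group; groups run sequentially.
        "steps": [[{"name": step, "template": step}] for step, _ in jobs],
    }
    templates = [main]
    for step, manifest in jobs:
        templates.append({
            "name": step,
            "resource": {
                "action": "create",
                # Argo expects the manifest as a string; JSON is valid YAML.
                "manifest": json.dumps(manifest),
                # Assumed status field set by the Spark Operator.
                "successCondition": "status.applicationState.state == COMPLETED",
                "failureCondition": "status.applicationState.state == FAILED",
            },
        })
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": f"{name_prefix}-"},
        "spec": {"entrypoint": "main", "templates": templates},
    }


if __name__ == "__main__":
    # Stub manifests keep the example short; real ones need a full spec.
    def stub(name):
        return {
            "apiVersion": "sparkoperator.k8s.io/v1beta2",
            "kind": "SparkApplication",
            "metadata": {"generateName": f"{name}-"},
        }

    wf = sequential_workflow("spark-etl", [("extract", stub("extract")),
                                           ("load", stub("load"))])
    print(json.dumps(wf, indent=2))
```

With this layout, the second job only starts once the first `SparkApplication` reaches the assumed success state, which is what gives the workflow its sequential behavior.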

Monitoring

Once Prometheus scrapes the metrics, some Grafana Dashboards are needed. The custom Grafana Dashboards for Apache Spark are based on the following community dashboards:

To Sum Up

Empathy chose Spark Operator, ArgoCD, and Argo Workflows to create a Spark Application Workflow solution on Kubernetes, and uses GitOps to propagate the changes. The setup illustrated in this article has been used in production environments for about one month, and the feedback is great! Everyone is happy with having a single workflow that is valid for any cloud provider, which lets us get rid of the individual cloud provider solutions.

To test it for yourself, follow these hands-on samples and enjoy deploying some Spark Applications from localhost, with all the setup described in this guide: Hands-on Empathy Repo.

I’ve also drawn upon my presentation for Kubernetes Days Spain 2021.

Though the journey was long, we’ve learned a lot along the way. I hope our innovations will help you become more cloud-agnostic too.
