Right-Sizing GPU and CPU Resources for Training and Inferencing Using Kubernetes

The rapid rise of AI services has created a massive demand for computing resources, making efficient management of those resources a critical challenge. While running AI workloads with Kubernetes has come a long way, optimizing scheduling based on dynamic demand continues to be an area for improvement. Many organizations face constraints related to the cost and availability of GPU clusters worldwide and often rely on the same compute clusters for inference workloads and continuous model training and fine-tuning.  

AI Model Training and Model Inferencing in Kubernetes

Training typically requires far more computational power than inferencing. On the other hand, inferencing runs far more often than training, since it is used to make predictions repeatedly across many applications. Let’s explore how we can combine what the cloud has to offer with recent advances in Kubernetes to optimize resource allocation by prioritizing workloads dynamically and efficiently based on need.

The diagram below shows the process of training versus inferencing. Training workloads may run less frequently but need more resources, as we essentially “teach” the model how to respond to new data. Once trained, a model is deployed and will often run on GPU compute instances to provide the best results with low latency. Inferencing will thus run more frequently, but not as intensely. All the while, we may go back and retrain a model to accommodate new data or even try other models that need to be trained before deployment.

 

AI Model Training vs. AI Model Inferencing


AI workloads, especially training, closely resemble High Performance Computing (HPC) workloads. Kubernetes wasn’t designed for HPC, but because Kubernetes is open source and largely led by the community, there has been rapid innovation in this space. The need for optimization has led to the development of tools like Kubeflow and Kueue.

AI Workloads for Kubernetes

Kubeflow uses pipelines to organize the steps of a data science workflow into logical blocks of operation and offers numerous libraries that plug into these steps so you can get up and running quickly.

Kueue provides resource “flavors” that allow it to tailor workloads to the hardware provisioning available at the time and schedule the correct workloads accordingly (there’s much more to it, of course). The community has done an outstanding job of addressing issues of scaling, efficiency, distribution, and scheduling with these tools and more.    
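
To illustrate the “flavors” idea, here is a minimal sketch of what a Kueue ResourceFlavor and ClusterQueue might look like for the fictional gpuconfig.com GPU resource used throughout this article. The names gpu-flavor and gpu-cluster-queue are placeholders, and the fields shown are only a small subset of the Kueue API.

YAML
 
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-flavor                  # placeholder name
spec:
  nodeLabels:
    gpu-type: <gpu name>            # same node label used by the pod specs below
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-cluster-queue           # placeholder name
spec:
  namespaceSelector: {}             # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["gpuconfig.com/gpu"]
    flavors:
    - name: gpu-flavor
      resources:
      - name: "gpuconfig.com/gpu"
        nominalQuota: 8             # total GPUs this queue may hand out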

Below is an example of how we can use Kubernetes to schedule and prioritize training and inference jobs on GPU clusters backed with Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCEv2). Let's create some sample code to demonstrate this concept. Note: In the code, we use a fictional domain, gpuconfig.com, for the GPU manufacturer. Also, <gpu name> is a placeholder for the specific GPU you wish to target.

YAML
 
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-<gpu name>
value: 1000000
globalDefault: false
description: "This priority class should be used for high priority <GPU NAME> GPU jobs only."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: medium-priority-<gpu name>
value: 100000
globalDefault: false
description: "This priority class should be used for medium priority <GPU NAME> GPU jobs."
---
apiVersion: v1
kind: Pod
metadata:
  name: high-priority-gpu-job
spec:
  priorityClassName: high-priority-<gpu name>
  containers:
  - name: gpu-container
    image: gpu/<gpu image>
    command: ["<gpu vendor>-smi"]
    resources:
      limits:
        gpuconfig.com/gpu: 1
  nodeSelector:
    gpu-type: <gpu name>
    rdma: "true"
---
apiVersion: v1
kind: Pod
metadata:
  name: medium-priority-gpu-job
spec:
  priorityClassName: medium-priority-<gpu name>
  containers:
  - name: gpu-container
    image: gpu/<gpu image>
    command: ["<gpu vendor>-smi"]
    resources:
      limits:
        gpuconfig.com/gpu: 1
  nodeSelector:
    gpu-type: <gpu name>
    rdma: "true"

This Kubernetes configuration demonstrates how to prioritize jobs on our GPU nodes using an RDMA backbone. Let's break down the key components: 

1. PriorityClasses: We've defined two priority classes for our GPU jobs, high-priority-<gpu name> (value 1000000) and medium-priority-<gpu name> (value 100000).

2. Pod Specifications: We've created two sample pods, high-priority-gpu-job and medium-priority-gpu-job, to show how to use these priority classes.

3. Node Selection: Both pods use nodeSelector to ensure they're scheduled on nodes with the specified GPU type and RDMA enabled:

YAML
 
  nodeSelector:
    gpu-type: <gpu name>
    rdma: "true"


4. Resource Requests: Each pod requests one GPU:

YAML
 
    resources:
      limits:
        gpuconfig.com/gpu: 1


Kubernetes uses priority classes to determine the order in which pods are scheduled and which pods are evicted if resources are constrained. Here's an example of how you might create a CronJob that uses a high-priority class:

YAML
 
apiVersion: batch/v1
kind: CronJob
metadata:
  name: high-priority-ml-training
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          name: ml-training-job
        spec:
          priorityClassName: high-priority-<gpu name>
          containers:
          - name: ml-training
            image: your-ml-image:latest
            resources:
              limits:
                gpuconfig.com/gpu: 2
          restartPolicy: OnFailure
          nodeSelector:
            gpu-type: <gpu name>
            rdma: "true"

 

GPU Resource Management in Kubernetes

Below are some examples of GPU resource management in Kubernetes: a namespace-level ResourceQuota, a LimitRange with per-container defaults, and a PriorityClass for burstable GPU jobs, along with a pod that uses it.

YAML
 
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-workloads
spec:
  hard:
    requests.gpuconfig.com/gpu: 8
---
apiVersion: v1
kind: LimitRange
metadata:
  name: gpu-limits
  namespace: ml-workloads
spec:
  limits:
  - default:
      gpuconfig.com/gpu: 1
    defaultRequest:
      gpuconfig.com/gpu: 1
    max:
      gpuconfig.com/gpu: 4
    min:
      gpuconfig.com/gpu: 1
    type: Container
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-burst
value: 1000000
globalDefault: false
description: "This priority class allows for burst GPU usage, but may be preempted."
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-burst-job
  namespace: ml-workloads
spec:
  priorityClassName: gpu-burst
  containers:
  - name: gpu-job
    image: gpu/<gpu image>
    command: ["<gpu vendor>-smi"]
    resources:
      limits:
        gpuconfig.com/gpu: 2
  nodeSelector:
    gpu-type: <gpu name>

 

In the past, it was a challenge to know the current state of the hardware in order to prioritize workloads, but open-source tools now provide solutions. To monitor GPU utilization, we use tools like Prometheus and Grafana. Here's a sample Prometheus configuration to scrape GPU metrics:

YAML
 
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'gpu_gpu_exporter'
    static_configs:
      - targets: ['localhost:9835']
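
If you also want to be notified (or trigger automation) when GPUs sit idle, a Prometheus alerting rule is one option. Below is a minimal sketch, assuming the exporter exposes the gpu_gpu_utilization metric queried by the script that follows; the group and alert names are placeholders.

YAML
 
groups:
  - name: gpu-utilization            # placeholder rule group name
    rules:
      - alert: LowGpuUtilization     # placeholder alert name
        expr: avg_over_time(gpu_gpu_utilization[30m]) < 30
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "GPU utilization has been below 30% for 30 minutes"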


And here's a simple Python script that we are using to optimize GPU allocation based on utilization metrics:

Python
 
import kubernetes
from prometheus_api_client import PrometheusConnect


def get_gpu_utilization(prometheus_url, pod_name):
    """Return the GPU utilization reported for a pod, or 0 if no data is available."""
    prom = PrometheusConnect(url=prometheus_url, disable_ssl=True)
    query = f'gpu_gpu_utilization{{pod="{pod_name}"}}'
    result = prom.custom_query(query)
    return float(result[0]['value'][1]) if result else 0


def optimize_gpu_allocation():
    kubernetes.config.load_kube_config()
    v1 = kubernetes.client.CoreV1Api()

    # Only consider pods that are labeled as GPU workloads
    pods = v1.list_pod_for_all_namespaces(label_selector='gpu=true').items
    for pod in pods:
        utilization = get_gpu_utilization('http://prometheus:9090', pod.metadata.name)
        if utilization < 30:  # If GPU utilization is less than 30%
            # Reduce GPU allocation
            patch = {
                "spec": {
                    "containers": [{
                        "name": pod.spec.containers[0].name,
                        "resources": {
                            "limits": {
                                "gpuconfig.com/gpu": "1"
                            }
                        }
                    }]
                }
            }
            v1.patch_namespaced_pod(name=pod.metadata.name, namespace=pod.metadata.namespace, body=patch)
            print(f"Reduced GPU allocation for pod {pod.metadata.name}")


if __name__ == "__main__":
    optimize_gpu_allocation()

 

This script checks GPU utilization for each labeled pod and reduces the pod's GPU allocation when utilization is low. We run it as a scheduled function to optimize resource usage.
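
For example, one way to run it on a schedule inside the cluster is a Kubernetes CronJob. The sketch below assumes the script has been packaged into a hypothetical container image, gpu-allocation-optimizer:latest, and that a service account (here called gpu-optimizer) exists with permission to list and patch pods; when running in-cluster, the script would also load its configuration with kubernetes.config.load_incluster_config() instead of load_kube_config().

YAML
 
apiVersion: batch/v1
kind: CronJob
metadata:
  name: gpu-allocation-optimizer        # placeholder name
spec:
  schedule: "*/15 * * * *"              # every 15 minutes (placeholder schedule)
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: gpu-optimizer              # hypothetical service account with pod patch rights
          containers:
          - name: optimizer
            image: gpu-allocation-optimizer:latest       # hypothetical image containing the script above
            command: ["python", "optimize_gpu_allocation.py"]   # hypothetical script filename
          restartPolicy: OnFailure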

Leveraging Kubernetes to Manage GPU and CPU Resources

Thus, we leveraged Kubernetes with OCI Kubernetes Engine (OKE) to dynamically manage GPU and CPU resources across training and inference workloads for AI models. Specifically, we focused on right-sizing GPU allocations on clusters with RDMA (RoCEv2) capabilities. We developed Kubernetes configurations and Helm charts, including custom priority classes, node selectors, and resource quotas, to ensure optimal scheduling and resource prioritization for both high-priority and medium-priority AI tasks.

By utilizing Kubernetes' flexibility and OKE's management capabilities on Oracle Cloud Infrastructure (OCI), we balanced the heavy compute demands of training with the lighter demands of inferencing. This ensured that resources were dynamically allocated, reducing waste while maintaining high performance for critical tasks. Additionally, we integrated monitoring tools like Prometheus to track GPU utilization and adjust allocations automatically using a Python script. This automation helped optimize performance while managing costs and availability.

In Conclusion

The solutions we outlined here apply universally across cloud and on-premises platforms using Kubernetes for AI/ML workloads. Whatever the hardware or compute platform, the key principles of using Kubernetes for dynamic scheduling and resource management remain the same. Kubernetes allows organizations to prioritize their workloads efficiently, optimizing their use of any available hardware resources. By applying the same approach, enterprises can fine-tune their infrastructure, reduce bottlenecks, and cut down on underutilized resources, leading to more efficient and cost-effective operations.


This article was shared as part of DZone's media partnership with KubeCon + CloudNativeCon.
