Unlocking AI Potential: Deploying NVIDIA NIM on Amazon EKS
Today’s technological landscape is buzzing with the rapid advancements in artificial intelligence (AI), and if you’ve ever thought about harnessing this power for your projects, you’re in the right place. This article, co-authored by a team from NVIDIA and AWS, builds on our previous discussion about utilizing Amazon Elastic Compute Cloud (Amazon EC2) G5 instances for AI model deployment. Here, we’ll dive deeper into setting up an efficient cluster that leverages NVIDIA’s NIM microservices for AI inference at scale.
What’s New in AWS
Before we jump in, let’s highlight some exciting news! AWS has announced the general availability of Amazon EC2 G6e instances, equipped with NVIDIA L40S Tensor Core GPUs. If you’re eager to explore this option, head over to the AWS site. Moreover, Amazon EC2 P5e instances, powered by NVIDIA H200 Tensor Core GPUs, are now available for your machine learning needs. These updates underscore AWS’s commitment to providing powerful infrastructure for AI developers, giving you the tools to scale your models efficiently.
Getting Started with Amazon EKS
Amazon Elastic Kubernetes Service (Amazon EKS) is a robust managed service that simplifies running Kubernetes workloads on AWS. With EKS, you can deploy NVIDIA NIM pods across multiple nodes while the service handles the complexities of managing the Kubernetes control plane and delivers the performance and scalability of the AWS infrastructure.
Our previous article covered the foundations and initial setup—if you haven’t read it yet, you can catch up here. As a quick recap, we provisioned essential EKS resources and set the stage for deploying our AI inference solution.
Deploying Customized NIM
So, let’s get into the nitty-gritty! Customizing your NIM deployment is simple: the key lies in the values.yaml file, which holds all of the configuration. In our first setup, we quickly deployed the Llama3-8B-Instruct model. This time, for a more tailored approach, either tweak the default values.yaml or create your own.
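For illustration, here’s a minimal sketch of what a custom values file and install command might look like. The chart location, key names, and image tag below are assumptions, so check the chart you used in the previous walkthrough for its exact schema.

# Illustrative custom values file (key names and tag are assumptions; verify against your chart)
cat > custom-values.yaml <<EOF
image:
  repository: nvcr.io/nim/meta/llama3-8b-instruct   # NIM container to serve
  tag: "1.0.0"                                      # pin a specific NIM release
imagePullSecrets:
  - name: registry-secret                           # pull secret created in the next step
model:
  ngcAPISecret: ngc-api                             # secret holding the NGC API key
resources:
  limits:
    nvidia.com/gpu: 1                               # one GPU per NIM pod
EOF

# Install (or upgrade) the NIM chart with the custom values
helm upgrade --install my-nim <nim-chart-path-or-repo> -f custom-values.yaml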
Setting up Kubernetes Secrets is essential for pulling images from NGC using your API key. It’s straightforward:
# Store the NGC API key as a Kubernetes secret and create an image pull secret for nvcr.io
export NGC_CLI_API_KEY="your_ngc_api_key"
kubectl create secret generic ngc-api --from-literal=NGC_CLI_API_KEY="$NGC_CLI_API_KEY"
# The registry username is the literal string $oauthtoken, so keep it in single quotes
kubectl create secret docker-registry registry-secret --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password="$NGC_CLI_API_KEY"
Additionally, to circumvent common issues with Kubernetes Persistent Volume Claims, we’ve introduced a convenient shell script that provisions the required Amazon EBS CSI driver.
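If you’d rather set this up by hand, the script’s effect is roughly equivalent to the following sketch (cluster name, account ID, and role name are placeholders):

# Create an IAM role for the EBS CSI driver, then install it as an EKS add-on
eksctl create iamserviceaccount \
  --name ebs-csi-controller-sa --namespace kube-system \
  --cluster <your-cluster-name> \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --approve --role-only --role-name AmazonEKS_EBS_CSI_DriverRole

eksctl create addon --name aws-ebs-csi-driver --cluster <your-cluster-name> \
  --service-account-role-arn arn:aws:iam::<account-id>:role/AmazonEKS_EBS_CSI_DriverRole --force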
Monitoring with Prometheus
Monitoring is a critical aspect of any deployment. To derive insights from your NIM pods, set up Prometheus. After installing the required stack, create a YAML manifest for custom metrics, particularly to scrape the num_requests_running metric, which tells you how many requests your model is processing in real time.
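With the kube-prometheus-stack, one way to express that manifest is a ServiceMonitor along these lines; the release label, service labels, and port name are assumptions about how your NIM service is exposed, so adjust them to match your deployment.

cat <<EOF | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nim-llm-metrics
  labels:
    release: prometheus                  # must match your kube-prometheus-stack release name
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: nim-llm    # label on the NIM service (adjust to your deployment)
  endpoints:
    - port: http-openai                  # service port name that exposes /metrics
      path: /metrics
EOF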
To confirm that the metric is being collected, open the Prometheus user interface and check that everything is running smoothly; you can also use tools like Grafana to build richer graphs and dashboards.
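For a quick check, you can port-forward to the Prometheus service and query the metric directly; the service name below assumes a kube-prometheus-stack release named prometheus in the prometheus namespace.

# Forward the Prometheus UI to your local machine
kubectl port-forward -n prometheus svc/prometheus-kube-prometheus-prometheus 9090:9090
# Then open http://localhost:9090 and run the query: num_requests_running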
Scaling Your NIM Workload
Now, scaling effectively makes all that hard work feel worthwhile. You have two primary options:
Option 1: Horizontal Pod Autoscaler (HPA) + Cluster Autoscaler (CAS)
This duo offers a streamlined scaling solution. While CAS adjusts the number of nodes, HPA scales the pods based on real-time metrics. After setting up the Metrics Server, along with an adapter such as the Prometheus Adapter to expose custom metrics, you can configure HPA to scale on the num_requests_running values coming from Prometheus.
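As a rough sketch, an HPA driven by that metric could look like the manifest below, assuming a Prometheus Adapter exposes num_requests_running through the custom metrics API and your NIM workload is a Deployment named nim-llm.

cat <<EOF | kubectl apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nim-llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment          # use StatefulSet here if that is how your chart deploys NIM
    name: nim-llm
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: num_requests_running
        target:
          type: AverageValue
          averageValue: "10"  # add a pod once each replica averages ~10 in-flight requests
EOF

Once the adapter is serving the metric, kubectl get hpa shows the current and target values so you can verify scaling decisions.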
Option 2: KEDA + Karpenter
For a more innovative approach, consider pairing KEDA with Karpenter. KEDA scales your pods in response to events and external metrics, while Karpenter provisions nodes dynamically. Together they create a highly responsive infrastructure, ideal for applications with unpredictable demand.
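Here’s a hedged sketch of the KEDA side; the Prometheus address, query, and workload name are assumptions for illustration, and Karpenter’s node provisioning configuration is defined separately.

cat <<EOF | kubectl apply -f -
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: nim-llm-scaledobject
spec:
  scaleTargetRef:
    name: nim-llm                        # the NIM workload to scale
  minReplicaCount: 1
  maxReplicaCount: 4
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-kube-prometheus-prometheus.prometheus.svc:9090
        query: sum(num_requests_running) # same metric scraped earlier
        threshold: "10"                  # target value per replica
EOF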
Load Balancing Across NIM Pods
Once your workloads are scaled, ensuring effective traffic distribution is vital. Create a Kubernetes Ingress resource backed by an Application Load Balancer (ALB), provisioned by the AWS Load Balancer Controller, to manage this seamlessly. This setup helps ensure that no single pod is overwhelmed, keeping the service responsive and stable.
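Assuming the AWS Load Balancer Controller is installed in your cluster, a minimal Ingress could look like the sketch below; the service name and port are placeholders for how your NIM service is exposed.

cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nim-llm-ingress
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: nim-llm        # ClusterIP service in front of the NIM pods
                port:
                  number: 8000       # port serving the NIM OpenAI-compatible API
EOF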
Cleanup and Best Practices
As you finish with your environment, don’t forget to clean up to avoid unnecessary charges. A quick command using eksctl or the AWS Management Console will help you delete your resources.
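For example, if you created the cluster with eksctl, something like this removes it (the name and region are placeholders):

# Delete the EKS cluster and its managed node groups
eksctl delete cluster --name <your-cluster-name> --region <your-region>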
Conclusion
We’ve explored a powerful way to leverage Amazon EKS and NVIDIA NIM microservices for optimized AI inference workflows. By monitoring and scaling your deployments, you can unlock incredible performance from your AI models while maintaining operational efficiency.
The AI Buzz Hub team is excited to see where these breakthroughs take us. Want to stay in the loop on all things AI? Subscribe to our newsletter or share this article with your fellow enthusiasts.