Overview
NVIDIA NIM inference microservices integrate closely with AWS managed services such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon SageMaker to enable the deployment of generative AI models at scale. Part of NVIDIA AI Enterprise, available in AWS Marketplace, NVIDIA NIM is a set of easy-to-use microservices designed to accelerate the deployment of generative AI. These prebuilt containers support a broad spectrum of generative AI models, from open-source community models to NVIDIA AI Foundation models and custom models. NIM microservices are deployed with a single command and integrate into generative AI applications through industry-standard APIs and just a few lines of code. Engineered for seamless generative AI inferencing at scale, NIM ensures generative AI applications can be deployed anywhere.
Benefits
Performance
As part of the NVIDIA AI Enterprise suite of software, NIM goes through exhaustive tuning to ensure the highest-performance configuration for each model. With NIM, throughput and latency improve significantly. For example, the NVIDIA Llama 3.1 8B Instruct NIM has achieved 2.5x higher throughput, 4x faster time to first token (TTFT), and 2.2x faster inter-token latency (ITL) compared to the best open-source alternatives.
[Charts: faster TTFT and faster ITL on Llama 3.1 8B Instruct with NIM on versus NIM off]
Features
Prebuilt containers
NIM offers a variety of prebuilt containers and Helm charts, which include optimized generative AI models. NIM seamlessly integrates with Amazon EKS to deliver a high-performance and cost-optimized model serving infrastructure.
Standardized APIs
Simplify the development, deployment, and scaling of generative AI-based applications with industry-standard APIs for building powerful copilots, chatbots, and generative AI assistants on AWS. These APIs are compatible with standard deployment processes, so teams can update applications quickly and easily.
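As an illustration, the sketch below sends an OpenAI-compatible chat completions request to a self-hosted NIM endpoint. The base URL, port, model identifier, and placeholder API key are assumptions for a locally deployed Llama 3.1 8B Instruct NIM; substitute the values for your own deployment.

    # Minimal sketch: call a self-hosted NIM through its OpenAI-compatible API.
    # The base_url, model name, and placeholder API key are assumptions for a
    # local Llama 3.1 8B Instruct NIM; adjust them for your environment.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",  # assumed address of the running NIM container
        api_key="not-used",                   # assumption: a self-hosted NIM does not require a key
    )

    response = client.chat.completions.create(
        model="meta/llama-3.1-8b-instruct",   # assumed model identifier served by this NIM
        messages=[{"role": "user", "content": "Summarize what NVIDIA NIM provides on AWS."}],
        max_tokens=128,
    )
    print(response.choices[0].message.content)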
Model support
Deploy custom generative AI models that are fine-tuned to specific industries or use cases. NIM supports generative AI use cases across multiple domains including LLMs, vision language models (VLMs), and models for speech, images, video, 3D, drug discovery, medical imaging, and more.
Domain-specific
NIM includes domain-specific NVIDIA CUDA libraries and specialized code, covering areas such as speech, language, and video processing.
Inference engines
Optimized with NVIDIA Triton Inference Server, TensorRT, TensorRT-LLM, and PyTorch, NIM maximizes throughput and decreases latency, reducing the cost of running inference workloads as they scale.
How to get started with NVIDIA NIM on AWS
Deploy production-grade NIM microservices with NVIDIA AI Enterprise running on AWS
Fast and easy generative AI deployment
To get started, users can set up optimized inference workloads on AWS with accelerated generative AI models in NVIDIA’s API catalog at ai.nvidia.com. When ready to deploy, organizations can self-host models with NVIDIA NIM and run them securely on AWS, giving them ownership of their customizations and full control of their intellectual property (IP) and generative AI applications.
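A hedged sketch of that transition: because the interface is OpenAI-compatible, the same client code can point first at NVIDIA's hosted API catalog and later at a self-hosted NIM endpoint on AWS by changing only the base URL and credentials. The catalog endpoint URL, the self-hosted host placeholder, and the NVIDIA_API_KEY environment variable below are assumptions.

    # Sketch: the same client code works against the hosted API catalog and a
    # self-hosted NIM on AWS; only base_url and api_key change. The URLs, model
    # name, and NVIDIA_API_KEY environment variable are assumptions.
    import os
    from openai import OpenAI

    USE_SELF_HOSTED = False  # flip to True once the NIM is running on your own AWS infrastructure

    if USE_SELF_HOSTED:
        client = OpenAI(base_url="http://<your-nim-host>:8000/v1", api_key="not-used")
    else:
        client = OpenAI(
            base_url="https://integrate.api.nvidia.com/v1",  # assumed API catalog endpoint
            api_key=os.environ["NVIDIA_API_KEY"],            # key generated at ai.nvidia.com
        )

    response = client.chat.completions.create(
        model="meta/llama-3.1-8b-instruct",
        messages=[{"role": "user", "content": "Hello from AWS."}],
    )
    print(response.choices[0].message.content)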
Customers can purchase the NVIDIA AI Enterprise license from AWS Marketplace, then go to NVIDIA NGC to access the NIM catalog, download the containers, and bring them to AWS. Deploy NIM on Amazon EC2, Amazon EKS, and Amazon SageMaker using AWS Batch, AWS ParallelCluster, Amazon FSx for Lustre, and Amazon Simple Storage Service (Amazon S3).
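For the Amazon SageMaker path, a minimal sketch using the SageMaker Python SDK is shown below. The ECR image URI, IAM role, instance type, endpoint name, and NGC_API_KEY environment variable are assumptions; in this flow the NIM container is pulled from NVIDIA NGC and pushed to your own Amazon ECR registry before deployment.

    # Minimal sketch: deploy a NIM container as a SageMaker real-time endpoint.
    # Assumptions: the NIM image has been pushed to your ECR registry, the
    # execution role has SageMaker and ECR permissions, and ml.g5.2xlarge has
    # enough GPU memory for the chosen model. Replace all placeholder values.
    import sagemaker
    from sagemaker.model import Model

    session = sagemaker.Session()
    role = "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"  # placeholder IAM role

    nim_model = Model(
        image_uri="<account-id>.dkr.ecr.<region>.amazonaws.com/nim/meta-llama-3.1-8b-instruct:latest",  # placeholder ECR image
        role=role,
        env={"NGC_API_KEY": "<your-ngc-api-key>"},  # assumed variable the container uses to pull model assets
        sagemaker_session=session,
    )

    predictor = nim_model.deploy(
        initial_instance_count=1,
        instance_type="ml.g5.2xlarge",  # example GPU instance type
        endpoint_name="nim-llama-3-1-8b-instruct",
    )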