Steps for Deploying Models with LMI Containers on AWS SageMaker

The following document provides a step-by-step guide for deploying LLMs using LMI Containers on AWS SageMaker. It is an in-depth guide covering all phases of a deployment, from preparing model artifacts through benchmarking your endpoint. If you are new to LMI, we recommend starting with the example notebooks and the starting guide.

Before starting this tutorial, you should have your model artifacts ready. If you are deploying a model directly from the HuggingFace Hub, you will need the model id (e.g. TheBloke/Llama-2-7B-fp16). If you are deploying a model stored in S3, you will need the S3 URI pointing to your model artifacts (e.g. s3://my-bucket/my-model/). If you have a custom model, it must be saved in the HuggingFace Transformers Pretrained format. You can read this guide to verify your model is saved in the correct format for LMI.
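
As an illustration, the sketch below shows how a model id is typically passed to an LMI container when deploying through the SageMaker Python SDK. The image URI is a placeholder, and the HF_MODEL_ID environment variable, instance type, and instance count are example values; confirm the exact configuration keys in the configuration sections later in this guide.

```python
# Minimal sketch using the SageMaker Python SDK.
# The image URI is a placeholder; look up the LMI container image for your region and backend.
import sagemaker
from sagemaker.model import Model

role = sagemaker.get_execution_role()      # IAM role used by the endpoint (works inside SageMaker)
image_uri = "<lmi-container-image-uri>"    # placeholder: LMI DLC image URI

model = Model(
    image_uri=image_uri,
    role=role,
    env={
        # Point LMI at your model artifacts: a HuggingFace Hub model id ...
        "HF_MODEL_ID": "TheBloke/Llama-2-7B-fp16",
        # ... or, alternatively, an S3 URI such as "s3://my-bucket/my-model/"
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",         # see the instance type selection guidance
)
```

The same settings can alternatively be supplied through a serving.properties file packaged with your model artifacts, as described in the Configuration section below.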

This guide is organized as a sequence of steps: selecting an instance type, selecting a backend, configuring the container and model, deploying your endpoint, and benchmarking it.

Next: Selecting an Instance Type

Below, we provide an overview of the various components of LMI containers. We recommend reading this overview to become familiar with LMI-specific terminology like Backends and Built-In Handlers.

Components of LMI

LMI containers bundle together a model server, LLM inference libraries, and inference handler code to deliver a batteries-included LLM serving solution. The model server, inference libraries, and default inference handler code are brought together through a unified configuration that specifies your deployment setup. Brief overviews of the components relevant to LMI are presented below.

Model Server

LMI containers use DJL Serving as the Model Server. A full architectural overview of DJL Serving is available here. At a high level, DJL Serving consists of a Netty front-end that routes requests to the backend workers, which execute inference for the target model. The backend manages model worker scaling, with each model being executed in an Engine. The Engine is an abstraction provided by DJL that you can think of as the interface that allows DJL to run inference for a model with a specific deep learning framework. In LMI, we use the Python Engine, as it allows us to directly leverage the growing Python ecosystem of LLM inference libraries.

Python Engine and Inference Backends

The Python Engine allows LMI containers to leverage many Python-based inference libraries like lmi-dist, vLLM, TensorRT-LLM, and Transformers NeuronX. These libraries expose Python APIs for loading and executing models with optimized inference on accelerators like GPUs and AWS Inferentia. LMI containers integrate the front-end model server with backend workers running Python processes to provide high-performance inference of LLMs.

To support multi-GPU inference of large models using model parallelism techniques like tensor parallelism, many of the inference libraries support distributed inference through MPI. LMI supports running the Python Engine in mpi mode (referred to as the MPI Engine) to leverage tensor parallelism in MPI-aware libraries like LMI-Dist and TensorRT-LLM.

Throughout the LMI documentation, we will use the term backend to refer to a combination of Engine and Inference Library (e.g., MPI Engine + LMI-Dist library).

You can learn more about the Python and MPI engines in the engine conceptual guide.
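
As a sketch of how the engine is selected (the configuration keys follow LMI's serving.properties conventions covered later in this guide; the model id and tensor parallel degree are arbitrary example values):

```properties
# Python Engine with the vLLM backend
engine=Python
option.model_id=TheBloke/Llama-2-7B-fp16
option.rolling_batch=vllm
option.tensor_parallel_degree=4
```

```properties
# MPI Engine with the LMI-Dist backend (tensor parallelism coordinated through MPI)
engine=MPI
option.model_id=TheBloke/Llama-2-7B-fp16
option.rolling_batch=lmi-dist
option.tensor_parallel_degree=4
```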

Built-In Handlers

LMI provides built-in inference handlers for all supported backends. These handlers take care of parsing configurations, loading the model onto accelerators, applying optimizations, and executing inference. The underlying inference libraries expose features and capabilities through different APIs and mechanisms, so switching between frameworks to maximize performance can be challenging: you need to learn each framework individually. With LMI's built-in handlers, there is no need to learn each library or write custom code to leverage the features and optimizations they offer. We expose a unified configuration format that allows you to easily switch between libraries as they evolve and improve over time. As the ecosystem grows and new libraries become available, LMI can integrate them and offer the same consistent experience.
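
For example, switching between backends is typically a one-line change to the configuration, while model loading, optimization, and inference are handled by the corresponding built-in handler (a sketch; the rolling_batch values shown follow the LMI configuration conventions):

```properties
# The same configuration surface works across backends; only the backend selection changes
engine=MPI
option.model_id=TheBloke/Llama-2-7B-fp16
option.tensor_parallel_degree=4
option.rolling_batch=lmi-dist   # e.g. change to trtllm (in the TensorRT-LLM container) to switch backends
```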

Configuration

The configuration provided to LMI specifies your entire setup. The configuration covers many aspects including:

  • Where your model artifacts are stored (HuggingFace Model ID, S3 URI)
  • Model Server Configurations like job/request queue size, auto-scaling behavior for model workers, which engine to use (either Python or MPI for LMI)
  • Engine/Backend Configurations like whether to use quantization, input sequence limits, continuous batching size, tensor parallel degree, and more depending on the specific backend you use

Configurations can be provided either through a serving.properties file or through environment variables passed to the container. A more in-depth explanation of configurations is presented in the Container and Model Configurations section of the deployment guide.
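
As a sketch of the two equivalent forms (the property-to-environment-variable mapping shown is an assumption to confirm in the Container and Model Configurations section; the values are arbitrary examples):

```properties
# serving.properties packaged alongside your model artifacts
engine=MPI
option.model_id=s3://my-bucket/my-model/
option.rolling_batch=lmi-dist
option.tensor_parallel_degree=4
option.max_rolling_batch_size=32
```

```
# Equivalent environment variables passed to the container
# (assumes the option.<key> -> OPTION_<KEY> mapping described in the configuration section)
OPTION_MODEL_ID=s3://my-bucket/my-model/
OPTION_ROLLING_BATCH=lmi-dist
OPTION_TENSOR_PARALLEL_DEGREE=4
OPTION_MAX_ROLLING_BATCH_SIZE=32
```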

Feature Matrix

| Feature | HuggingFace Accelerate | LMI_dist (9.0.0) | TensorRTLLM (0.8.0) | TransformersNeuronX (2.18.0) | vLLM (0.3.3) |
|---|---|---|---|---|---|
| DLC | LMI | LMI | LMI TRTLLM | LMI Neuron | LMI |
| Default handler | huggingface | huggingface | tensorrt-llm | transformers-neuronx | huggingface |
| Quantization support | BitsandBytes, GPTQ | GPTQ, AWQ | SmoothQuant, AWQ, GPTQ | INT8 | GPTQ, AWQ |
| Supported AWS instances | G4/G5/G6/P4D/P5 | G5/G6/P4D/P5 | G5/G6/P4D/P5 | INF2/TRN1 | G4/G5/G6/P4D/P5 |
| Execution mode | Python | MPI | MPI | Python | Python |
| Multi-accelerator weight loading | Yes | Yes | Yes | Yes | Yes |
| Tensor parallelism | No | Yes | Yes | Yes | Yes |
| Continuous batching & streaming | Yes | Yes | Yes | Yes | Yes |
| Compilation required | No | No | Yes | Yes | No |
| SageMaker Inference Component support | Yes | Yes | Yes | Yes | Yes |
| Logprob support | Yes | Yes | Yes | No | Yes |