Overview - Large Model Inference (LMI) Containers

LMI containers are a set of high-performance Docker containers purpose-built for large language model (LLM) inference. With these containers, you can leverage high-performance open-source inference libraries like vLLM, TensorRT-LLM, DeepSpeed, and Transformers NeuronX to deploy LLMs on Amazon SageMaker endpoints. These containers bundle a model server with open-source inference libraries to deliver an all-in-one LLM serving solution. We provide quick-start notebooks that get you deploying popular open-source models in minutes, and advanced guides to help you maximize the performance of your endpoint.
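
As a quick illustration, the sketch below deploys an open-source model to a SageMaker endpoint with an LMI container using the SageMaker Python SDK. The image URI is the us-east-1 example from the backend table at the end of this page; the model id, instance type, and request payload are illustrative assumptions rather than prescribed values.

```python
# Minimal sketch: deploying an LLM with an LMI container via the SageMaker Python SDK.
# The image URI is the us-east-1 example listed later on this page; the model id,
# instance type, and payload below are illustrative assumptions.
import sagemaker
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

role = sagemaker.get_execution_role()  # IAM role the endpoint will use (requires a SageMaker environment)

# LMI container image for the vLLM/lmi-dist backend family; see the backend table below.
image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121"

model = Model(
    image_uri=image_uri,
    role=role,
    env={"HF_MODEL_ID": "mistralai/Mistral-7B-v0.1"},  # model to download and serve (placeholder)
    predictor_cls=Predictor,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

print(predictor.predict({"inputs": "What is Amazon SageMaker?",
                         "parameters": {"max_new_tokens": 64}}))
```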

LMI containers provide many features, including:

  • Optimized inference performance for popular model architectures like Llama, Bloom, Falcon, T5, Mixtral, and more
  • Integration with open source inference libraries like vLLM, TensorRT-LLM, DeepSpeed, and Transformers NeuronX
  • Continuous Batching for maximizing throughput at high concurrency
  • Token Streaming (see the invocation sketch following this list)
  • Quantization through AWQ, GPTQ, and SmoothQuant
  • Multi-GPU inference using Tensor Parallelism
  • Serving LoRA fine-tuned models
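
For the Token Streaming feature above, responses can be consumed incrementally through the SageMaker runtime streaming API. The sketch below is hedged: the endpoint name is a placeholder, and the request fields and per-chunk format are assumptions that depend on how the container is configured for streaming (see the LMI user guides).

```python
# Sketch: reading a streamed response from an LMI endpoint with boto3.
# "my-lmi-endpoint" is a placeholder; the request fields ("stream", "parameters")
# and the per-chunk payload format are assumptions -- check the LMI streaming docs
# for the configuration your container version expects.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint_with_response_stream(
    EndpointName="my-lmi-endpoint",
    ContentType="application/json",
    Body=json.dumps({
        "inputs": "Write a haiku about inference.",
        "parameters": {"max_new_tokens": 128},
        "stream": True,  # assumption: streaming enabled per the container's configuration
    }),
)

# The response body is an event stream; each event may carry a chunk of bytes.
for event in response["Body"]:
    chunk = event.get("PayloadPart", {}).get("Bytes")
    if chunk:
        print(chunk.decode("utf-8"), end="", flush=True)
```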

LMI containers provide these features through integrations with popular inference libraries. A unified configuration format enables users to easily leverage the latest optimizations and technologies across libraries. We will refer to each of these libraries as backends throughout the documentation. The term backend refers to a combination of Engine (LMI uses the Python and MPI Engines) and inference library. You can learn more about the components of LMI here.
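
To make the unified configuration concrete, the snippet below shows the kind of environment-variable configuration an LMI container accepts, expressed as the env dictionary you would pass when creating the SageMaker model (as in the deployment sketch above). The specific keys and values are illustrative assumptions; consult the starting guide and backend user guides for the configurations supported by your container version.

```python
# Sketch of LMI's unified configuration, passed as container environment variables.
# Keys and values are illustrative assumptions; see the LMI starting guide for the
# configurations supported by your container version and backend.
lmi_env = {
    "HF_MODEL_ID": "TheBloke/Llama-2-13B-AWQ",   # model to serve (placeholder)
    "TENSOR_PARALLEL_DEGREE": "4",               # shard the model across 4 GPUs
    "OPTION_ROLLING_BATCH": "vllm",              # backend selection / continuous batching
    "OPTION_MAX_ROLLING_BATCH_SIZE": "64",       # max concurrent requests in a batch
    "OPTION_QUANTIZE": "awq",                    # quantization scheme (if the model supports it)
}

# Pass this dictionary as `env=lmi_env` when constructing the SageMaker Model.
```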

QuickStart

Our recommended progression for the LMI documentation is as follows:

  1. Sample Notebooks: We provide sample notebooks for popular models that can be run as-is. This is the quickest way to start deploying models with LMI.
  2. Starting Guide: The starting guide describes a simplified UX for configuring LMI containers. This UX is applicable across all LMI containers and focuses on the most important configurations for tuning performance.
  3. Deployment Guide: The deployment guide is an advanced guide tailored for users that want to squeeze the most performance out of LMI. It is intended for users aiming to deploy LLMs in a production setting, using a specific backend.

Sample Notebooks

The following table provides notebooks that demonstrate how to deploy popular open-source LLMs using LMI containers on SageMaker. If this is your first time using LMI, or you want a starting point for deploying a specific model, we recommend starting with the notebooks below. All of the samples below are hosted in the SageMaker Generative AI Hosting Examples repository, which is continuously updated with new examples.

Model          | Instance Type   | Sample Notebook
Llama-2-7b     | ml.g5.2xlarge   | notebook
Llama-2-13b    | ml.g5.12xlarge  | notebook
Llama-2-70b    | ml.p4d.24xlarge | notebook
Mistral-7b     | ml.g5.2xlarge   | notebook
Mixtral-8x7b   | ml.p4d.24xlarge | notebook
Flan-T5-XXL    | ml.g5.12xlarge  | notebook
CodeLlama-13b  | ml.g5.48xlarge  | notebook
Falcon-7b      | ml.g5.2xlarge   | notebook
Falcon-40b     | ml.g5.48xlarge  | notebook

Note: Some models in the table above are available from multiple providers. We link to the specific model we tested with, but we expect the same model from a different provider (or a fine-tuned variant) to work as well.

Starting Guide

The starting guide is our recommended introduction for all users. This guide provides a simplified UX through a reduced set of configurations that are applicable to all LMI containers starting from v0.27.0.

Advanced Deployment Guide

We have put together a comprehensive deployment guide that takes you through the steps needed to deploy a model using LMI containers on SageMaker. The document covers the phases from storing your model artifacts through benchmarking your SageMaker endpoint. It is intended for users moving towards deploying LLMs in production settings.

Supported LMI Inference Libraries

LMI containers provide integration with multiple inference libraries. You can learn more about each integration in the respective backend user guides.

LMI provides access to multiple libraries so that users can find the best stack for their model and use case. Each inference library offers a unique set of features and optimizations that can be tuned for your model and use case. With LMI's built-in inference handlers and unified configuration, experimenting with different stacks is as simple as changing a few configurations. Refer to the backend-specific user guides and the LMI deployment guide to learn more; an overview of the different LMI components is also provided in the deployment guide.

The following table shows which SageMaker DLC (deep learning container) to use for each backend. This information is also available on the SageMaker DLC GitHub repository.

Backend              | SageMaker DLC   | Example URI
vLLM                 | djl-deepspeed   | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121
lmi-dist             | djl-deepspeed   | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121
hf-accelerate        | djl-deepspeed   | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121
deepspeed            | djl-deepspeed   | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121
tensorrt-llm         | djl-tensorrtllm | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-tensorrtllm0.8.0-cu122
transformers-neuronx | djl-neuronx     | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-neuronx-sdk2.18.0
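
As an example of switching backends, moving from the vLLM backend to the TensorRT-LLM backend is primarily a matter of selecting the corresponding DLC from the table above and updating the backend-specific configuration. The sketch below is illustrative; the configuration keys and values are assumptions to be checked against the backend user guides.

```python
# Sketch: selecting a backend by pairing the matching DLC image (from the table above)
# with the corresponding rolling-batch configuration. Model id and config values are
# illustrative assumptions.
vllm_stack = {
    "image_uri": "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121",
    "env": {"HF_MODEL_ID": "mistralai/Mistral-7B-v0.1", "OPTION_ROLLING_BATCH": "vllm"},
}

trtllm_stack = {
    "image_uri": "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-tensorrtllm0.8.0-cu122",
    "env": {"HF_MODEL_ID": "mistralai/Mistral-7B-v0.1", "OPTION_ROLLING_BATCH": "trtllm"},
}
```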