Release Notes
Below are the release notes for recent Large Model Inference (LMI) images for use on SageMaker. For details on historical releases, refer to the GitHub Releases page.
LMI V20 (DJL-Serving 0.36.0)
Meet your brand new image! 💿
LMI (vLLM) Image – 2-9-2026
763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.36.0-lmi20.0.0-cu128-v1.0
What's New
- vLLM has been upgraded to 0.15.1
- DeepSeek R1 0528 Regression Fix
LMI V19 (DJL-Serving 0.36.0)
LMI (vLLM) Image – 2-2-2026
763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.36.0-lmi19.0.0-cu128
What's New
- vLLM has been upgraded to 0.14.0
- LMCache auto configuration feature: LMI can now provide automatic LMCache configuration for CPU and local storage offloading based on your instance resources and model size. You can enable this feature by setting OPTION_LMCACHE_AUTO_CONFIG=True as an environment variable.
- Custom output formatter fix: Resolved an issue where user-specified @output_formatter was not being applied as the final formatter. Previously, DJL applied an additional LMI output formatter after the custom formatter, preventing users from fully controlling the response shape. Users can now implement custom response envelopes and alternate schemas as expected. (#2986)
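As a minimal sketch of the LMCache opt-in described above, the snippet below builds a container definition (the dict shape accepted as PrimaryContainer by the SageMaker CreateModel API) with OPTION_LMCACHE_AUTO_CONFIG set. The helper name build_container_definition and the example model ID are hypothetical, and the use of HF_MODEL_ID to select the model is an assumption about your deployment; only the image URI and the OPTION_LMCACHE_AUTO_CONFIG=True setting come from these notes. No AWS calls are made.

```python
# Image URI as listed for LMI V19 in these release notes.
LMI_V19_IMAGE = (
    "763104351884.dkr.ecr.us-west-2.amazonaws.com/"
    "djl-inference:0.36.0-lmi19.0.0-cu128"
)


def build_container_definition(model_id, enable_lmcache_auto_config=True):
    """Return a container definition dict for a SageMaker model.

    Hypothetical helper: it only assembles a dict and never contacts AWS.
    """
    # Assumption: the model is selected with the HF_MODEL_ID variable.
    environment = {"HF_MODEL_ID": model_id}
    if enable_lmcache_auto_config:
        # Opt in to automatic LMCache configuration (from the notes above).
        # SageMaker environment values must be strings.
        environment["OPTION_LMCACHE_AUTO_CONFIG"] = "True"
    return {"Image": LMI_V19_IMAGE, "Environment": environment}


if __name__ == "__main__":
    # "my-org/my-model" is a placeholder, not a real model ID.
    print(build_container_definition("my-org/my-model"))
```

The resulting dict could be passed on to whichever deployment tooling you already use; leaving the flag off keeps the previous behavior, since the feature is opt-in.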
Considerations
- Our benchmarks demonstrate consistent performance of LMI V19 compared to V18 for most models tested. However, GPT-OSS 120B with EAGLE speculative decoding shows a performance regression at higher concurrency levels. We are actively working on a fix and expect to address this in an upcoming patch release.
LMI V18 (DJL-Serving 0.36.0)
LMI (vLLM) Image – 12-15-2025
763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.36.0-lmi18.0.0-cu128
- vLLM has been upgraded to 0.12.0
- LMCache support for on-host caching of the KV cache delivers up to a 20x improvement in request latency for long-context requests. Refer to LMCache Performance Benefits for LMI Customers for more details.
- Added support for adapter-scoped custom code (e.g., model.py) that can be registered dynamically via the adapter management APIs, enabling per-adapter input/output formatting for multi-tenant LoRA deployments
Key Features
Enhanced Adapter Management with Custom Code Support
* On adapter registration, DJL Serving now checks the adapter directory for model.py and (if present) loads the adapter's custom formatters before registering adapter weights
* If adapter custom code loading fails, registration fails fast (adapter weights are not registered) and returns an error response (code 424)
* During inference, adapter-specific formatters override base model formatters when the inference targets an adapter
* On adapter unregistration, the adapter's custom code is unloaded/cleaned up
* Enables per-adapter input/output formatting for multi-tenant LoRA deployments
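The adapter lifecycle above can be illustrated with a small, self-contained simulation. This is a sketch under stated assumptions, not DJL Serving's implementation: the AdapterRegistry class, the AdapterRegistrationError exception, and the format_output function name looked up in model.py are all hypothetical. Only the behavior (look for model.py, fail fast with code 424, adapter formatters override the base formatter, cleanup on unregistration) follows the notes.

```python
import importlib.util
from pathlib import Path


class AdapterRegistrationError(Exception):
    """Hypothetical error for failed custom code loading (code 424)."""

    code = 424


class AdapterRegistry:
    """Toy model of adapter registration with optional custom code."""

    def __init__(self, base_output_formatter):
        self.base_output_formatter = base_output_formatter
        self.weights = {}     # adapter name -> adapter directory
        self.formatters = {}  # adapter name -> adapter-specific formatter

    def register(self, name, adapter_dir):
        custom_code = Path(adapter_dir) / "model.py"
        formatter = None
        if custom_code.is_file():
            try:
                # Load model.py before touching the adapter weights.
                spec = importlib.util.spec_from_file_location(
                    f"adapter_{name}", custom_code
                )
                module = importlib.util.module_from_spec(spec)
                spec.loader.exec_module(module)
            except Exception as exc:
                # Fail fast: the weights below are never registered.
                raise AdapterRegistrationError(
                    f"failed to load custom code for adapter {name!r}"
                ) from exc
            # Assumption: the custom formatter is named format_output.
            formatter = getattr(module, "format_output", None)
        self.weights[name] = str(adapter_dir)
        if formatter is not None:
            self.formatters[name] = formatter

    def unregister(self, name):
        # Drop both the weights entry and any adapter custom code.
        self.weights.pop(name, None)
        self.formatters.pop(name, None)

    def format_output(self, text, adapter=None):
        # Adapter-specific formatters take precedence over the base one.
        formatter = self.formatters.get(adapter, self.base_output_formatter)
        return formatter(text)
```

Loading the custom code before recording the weights is what makes the failure path clean in this sketch: a broken model.py leaves no partially registered adapter behind.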
LMCache Performance Improvements
* Up to 28x speedup in Time to First Token (TTFT) with CPU offloading (achieved with Qwen 2.5-7B at 2M token context length)
* Up to 16x speedup in TTFT with NVMe-based offloading (achieved with Qwen 2.5-72B at 1M token context length using O_DIRECT)
* Comprehensive benchmarking suite across different storage backends (CPU RAM, NVMe, Redis, S3, EBS)
Security & Stability
* Enhanced security validation for adapters in Secure Mode plugin
* Improved multimodal integration test stability with vLLM 0.12.0
* Updated CI/CD pipeline to use serving version consistently across workflows
LMI V17 (DJL-Serving 0.35.0)
LMI (vLLM) Image – 9-30-2025
763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.35.0-lmi17.0.0-cu128
- vLLM has been upgraded to 0.11.1
- Going forward, async mode is the default configuration for the vLLM handler
- New models supported - DeepSeek V3.2, Qwen 3 VL, Minimax-M2
- LoRA supported in Async mode for MoE models - Llama4 Scout, Qwen3, DeepSeek, GPT-OSS
- EAGLE 3 support added for GPT-OSS Models
- Support for on-host KV Cache offloading with LMCache (LMCache v1 is in experimental phase).
Considerations
- Our benchmarks demonstrate improved performance of LMI V17 compared to V16 for all models benchmarked (DeepSeek R1 Distill Llama, Llama 3.1 8B Instruct, Mistral 7B Instruct v0.3) except for the Qwen3 Coder 30B A3B model at a concurrency of 128. We are working with the vLLM community to understand the root cause and potential fixes.
LMI V16 (DJL-Serving 0.34.0)
LMI (vLLM) Image – 9-30-2025
763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.34.0-lmi16.0.0-cu128
- vLLM version upgraded to 0.10.2
- Going forward, async mode is the officially recommended configuration for the vLLM handler
- Async vLLM handler now supports custom input and output formatters
- Async vLLM handler now supports multi-adapter (LoRA) serving
- Async vLLM handler now supports session-based sticky routing
LMI V15 (DJL-Serving 0.33.0)
LMI (vLLM) Image – 4-17-2025
763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.33.0-lmi15.0.0-cu128
- vLLM version upgraded to 0.8.4
- Llama4 model support
- Updated async implementation; see the vLLM async user guide for details
TensorRT-LLM Image – 6-24-2025
763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.33.0-tensorrtllm0.21.0-cu128
- TensorRT-LLM version upgraded to 0.21.0rc1