LMI V10 containers release¶
This document contains the latest releases of our LMI containers. For details on previous releases, please refer to our GitHub release page
Release Notes¶
Release date: August 16, 2024¶
Check out our latest Large Model Inference Containers.
Key Features¶
DJL Serving Changes (applicable to all containers)¶
- Allows configuring health checks to fail based on various types of error rates
- When not streaming responses, all invocation errors will respond with the appropriate 4xx or 5xx HTTP response code
  - Previously, for some inference backends (vllm, lmi-dist, tensorrt-llm) the behavior was to return 2xx HTTP responses when errors occurred during inference
- HTTP response codes are now configurable if you require a specific 4xx or 5xx status to be returned in certain situations
- Introduced annotations `@input_formatter` and `@output_formatter` to bring your own script for pre- and post-processing; see the sketch after this list
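As a rough illustration of the new annotations, the sketch below shows a custom output formatter script. The import path, file layout, and the attributes used on the request output object are assumptions for illustration only; consult the LMI documentation for the exact supported signatures.

```python
# model.py placed alongside your model artifacts (the file name/layout is an assumption)
import json

# Assumed import path for the new decorator; verify against the LMI documentation
from djl_python.output_formatter import output_formatter


@output_formatter
def custom_output_formatter(request_output) -> str:
    """Reshape the generation result into the JSON payload your clients expect."""
    # The attribute names below (sequences, best_sequence_index, get_text) are
    # assumptions about the request_output object, shown only to sketch the idea.
    best = request_output.sequences[request_output.best_sequence_index]
    return json.dumps({"generated_text": best.get_text()}) + "\n"
```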
LMI Container (vllm, lmi-dist)¶
- vLLM updated to version 0.5.3.post1
- Added multimodal support for Vision Language Models using the OpenAI Chat Completions schema; see the example request after this list
  - More details available here
- Supports Llama 3.1 models
- Supports beam search, `best_of`, and `n` with non-streaming output
- Supports chunked prefill in both vllm and lmi-dist
TensorRT-LLM Container¶
- TensorRT-LLM updated to version 0.11.0
- [Breaking change] Flan-T5 is now supported with the C++ Triton backend; Flan-T5 support for the TRT-LLM Python backend has been removed
Transformers NeuronX Container¶
- Upgraded to Transformers NeuronX 2.19.1
Text Embedding (using the LMI container)¶
- Various performance improvements
Breaking Changes¶
- In the TensorRT-LLM container, Flan-T5 is now supported with the C++ Triton backend; Flan-T5 support for the TRT-LLM Python backend has been removed
Known Issues¶
- Running Gemma and Phi models with TensorRT-LLM is currently only viable at TP=1, because of an issue in TensorRT-LLM where only one engine is built even when TP > 1.
- When using lmi-dist, in the rare case that the machine has a broken CUDA driver, inference can hang. In that case, set LMI_USE_VLLM_GPU_P2P_CHECK=1 to prompt LMI to use a fallback option compatible with the broken CUDA driver; see the sketch below.
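As a rough illustration of passing that flag, the sketch below sets the environment variable when deploying an LMI container with the SageMaker Python SDK. The image URI, model id, and instance type are placeholders.

```python
import sagemaker
from sagemaker.model import Model

# Placeholders: substitute your LMI container image URI, model id, and instance type
model = Model(
    image_uri="<lmi-container-image-uri>",
    role=sagemaker.get_execution_role(),
    env={
        # Fallback option for machines with a broken CUDA driver (see the known issue above)
        "LMI_USE_VLLM_GPU_P2P_CHECK": "1",
        # Example model id; any model supported by lmi-dist works here
        "HF_MODEL_ID": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    },
)

predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.12xlarge")
```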