# LMI Breaking Changes
This document details breaking changes for LMI container version releases.
## LMI Container v12 (0.30.0)
### Changes to model loading/warmup behavior
#### Summary
Starting in LMI v12 (0.30.0), the default behavior for `option.max_rolling_batch_prefill_tokens` when using the lmi-dist rolling-batch backend has changed.
This configuration maps to vLLM's `max_num_batched_tokens` configuration, and will be updated to match that name in a future version.
In previous versions (v11 and earlier), this value defaulted to 4096 tokens.
Starting in v12, lmi-dist relies on vLLM's default behavior, which is described here.
The previous behavior was implemented to improve the default experience for model loading, especially on memory-constrained instances.
vLLM's default behavior is to set this value to `option.max_model_len` and use it as part of the warmup phase.
In previous versions, lmi-dist instead used the 4096-token default during warmup and, after the warmup phase, set this value back to `option.max_model_len`.
This allowed users to deploy models with long context lengths on smaller instances without needing to configure `option.max_rolling_batch_prefill_tokens` or `option.max_model_len`.
However, if users attempted requests with more than 4096 tokens of prefill, there was a chance this could lead to out-of-memory scenarios.
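If you prefer to keep the pre-v12 behavior, you can pin the old default explicitly. Below is a minimal serving.properties sketch under the assumption of an lmi-dist deployment; the `option.rolling_batch` and `option.model_id` entries are illustrative placeholders, and only `option.max_rolling_batch_prefill_tokens` comes from the change described above.

```
# serving.properties sketch: restore the pre-v12 default explicitly
# (option.rolling_batch and option.model_id are illustrative; the model id is a placeholder)
option.rolling_batch=lmi-dist
option.model_id=my-org/my-model
# Pin the previous v11 default instead of relying on vLLM's default behavior
option.max_rolling_batch_prefill_tokens=4096
```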
#### What to expect
We have decided to make this change because of chunked prefill, our observation that more users are leveraging longer prompts, and our belief that the previous motivation no longer reflects the right default experience.
This change means that if `option.max_rolling_batch_prefill_tokens` is not specified, some deployments that worked with v11 and earlier may start failing in v12 due to higher memory requirements during model loading and warmup for models with a supported sequence length greater than 4096 tokens.
In v12, the default behavior is to allocate enough memory to store the KV-cache for `option.max_model_len` tokens.
For models with large context lengths (or any context length above 4096 tokens), more memory will be required by default during model loading and startup. As a result, some configurations that previously deployed successfully will require additional memory by default with v12, which can lead to failures at startup.
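One hedged mitigation that follows from this behavior is to cap the context length directly, so that the KV-cache is only sized for the sequence lengths you actually need. The sketch below assumes an lmi-dist deployment with a placeholder model id; the 4096 value simply mirrors the old prefill default and should be replaced with the real maximum sequence length you intend to serve.

```
# serving.properties sketch: bound KV-cache allocation by capping the context length
# (the model id is a placeholder; 4096 simply mirrors the old prefill default)
option.rolling_batch=lmi-dist
option.model_id=my-org/my-model
option.max_model_len=4096
```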
#### Guidance
As a result of this change, we recommend you take the following actions:
- Enable chunked prefill by setting `option.enable_chunked_prefill=true` (a sample configuration is shown after this list).
    - You can learn more about chunked prefill here: https://docs.vllm.ai/en/latest/models/performance.html#chunked-prefill.
    - We expect that chunked prefill will be beneficial for most users and use-cases, but you should confirm this for your specific use-case.
    - With chunked prefill, the default value for `option.max_rolling_batch_prefill_tokens` is 512, which is set to optimize inter-token latency (ITL).
    - Starting in LMI v12 (0.30.0), the default value for `option.enable_chunked_prefill` is `None` rather than `False`. This brings the LMI default in line with the OSS vLLM default, which results in chunked prefill being enabled by default for models with a context length greater than 32k.
        - The default logic in vLLM is described here.
- Tune `option.max_rolling_batch_prefill_tokens` and `option.max_model_len` based on available accelerator memory (instance type), use-case, and model.
    - `option.max_model_len` should be set to the maximum input + output token count you expect to support at inference time.
        - If you are unable to use a certain `option.max_model_len` for a model on an instance, you will need to reduce this value, use an instance type with more GPU memory, or explore techniques like quantization to reduce the memory required for the model.
    - `option.max_rolling_batch_prefill_tokens` depends on whether you use chunked prefill.
        - With chunked prefill, you can typically expect higher throughput with larger values, and better latency with lower values.
        - Without chunked prefill, you should set this to the maximum input length you aim to support.
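Putting this guidance together, a tuned configuration might look like the following sketch. The property names come from the guidance above, but the specific values and the model id are illustrative assumptions for a hypothetical long-context deployment, not recommendations.

```
# serving.properties sketch combining the guidance above (all values are illustrative)
option.rolling_batch=lmi-dist
# Placeholder model id
option.model_id=my-org/my-long-context-model
# Opt in explicitly rather than relying on the default of None
option.enable_chunked_prefill=true
# Maximum input + output tokens you expect to serve
option.max_model_len=16384
# Larger than the 512 default to trade some ITL for throughput
option.max_rolling_batch_prefill_tokens=1024
```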
## LMI-TensorrtLLM v12 (0.30.0)
### T5 model family memory increase
We have observed slightly different behavior for the T5 model family compared to our previous v11 release (0.29.0). We have raised an issue with TensorRT-LLM here. The change in behavior is likely due to how TensorRT-LLM handles memory for the KV-cache.
#### Guidance
As a result, we recommend that you carefully tune the batch size and sequence length for your model. Configurations that previously worked on v11 may lead to out-of-memory scenarios with v12.
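As a hedged illustration of that tuning, the sketch below lowers the batch size and caps the input/output lengths for a hypothetical T5 deployment. The property names `option.max_rolling_batch_size`, `option.max_input_len`, and `option.max_output_len` are assumptions based on common LMI TensorRT-LLM configurations and may differ for your container version; the values are placeholders to adjust against your instance's memory.

```
# serving.properties sketch for a T5 model on the TensorRT-LLM container
# (property names below are assumptions; tune the values to your instance)
# Placeholder model id
option.model_id=google/flan-t5-xl
# Smaller batch size reduces peak memory
option.max_rolling_batch_size=8
# Cap input and output sequence lengths
option.max_input_len=1024
option.max_output_len=512
```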
### T5 model family batch scheduler policy
We have noticed runtime issues when using the default `batch_scheduler_policy` of `max_utilization` for T5 models.
When this is combined with in-flight batching (the default), it leads to inference errors.
#### Guidance
We recommend that users deploying T5 model variants specify `option.batch_scheduler_policy=guaranteed_no_evict` to ensure proper inference execution.
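For example, a minimal serving.properties sketch for such a deployment might look like the following; the model id is a placeholder, and only `option.batch_scheduler_policy=guaranteed_no_evict` comes from the guidance above.

```
# serving.properties sketch for a T5 variant on the TensorRT-LLM container
# Placeholder model id
option.model_id=google/flan-t5-large
# Avoid the inference errors seen with the default max_utilization policy
option.batch_scheduler_policy=guaranteed_no_evict
```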
## LMI-NeuronX v12 (0.30.0)
### Legacy partition is removed
Before Neuron SDK version 2.18, it was necessary to load a safetensors model using Hugging Face on the CPU, save it as a pre-sharded version, and then load the model using `NeuronAutoModel`. Starting with Neuron SDK version 2.18, this step is no longer needed. Although our code continued to load the safetensors weights on the CPU, the `split_model_path` is not used anywhere when the safetensors format is in use. Since we moved to Neuron SDK version 2.20.1, legacy partitioning is no longer needed.
#### Guidance
You no longer need to specify a partition schema, as the LMI container can deduce it for you using our auto-detection partition schema. If you don't use `option.partition_schema`, this change does not affect you. If you do, consider making use of our auto-detection partition schema, safetensors, or Optimum.
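In practice, this means a typical serving.properties for the Neuron container can simply omit `option.partition_schema` and let the container detect the right approach. The sketch below is illustrative: the model id is a placeholder, and `option.tensor_parallel_degree` is assumed to be a commonly used Neuron setting that you should adjust or omit for your deployment.

```
# serving.properties sketch for LMI-NeuronX
# No option.partition_schema is set, so the container's auto-detection is used
# Placeholder model id
option.model_id=my-org/my-model
# Assumed common Neuron setting; adjust or omit for your deployment
option.tensor_parallel_degree=2
```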