Transformers-NeuronX Engine in LMI¶
Model Artifacts Structure¶
LMI Transformers-NeuronX expects the model to be in standard HuggingFace format for runtime compilation.
For loading pre-compiled models, both Optimum compiled models and split models with a separate NEFF cache are supported (compiled models must have been compiled with the same Neuron compiler version and model settings).
The source of the model can be:
- a model_id string from the HuggingFace Hub
- an S3 URL that stores artifacts following the HuggingFace model repo structure, the Optimum compiled model repo structure, or a split model with a second URL for the NEFF cache
- a local path to a folder that follows the HuggingFace model repo structure, the Optimum compiled model repo structure, or a split model with a second directory for the NEFF cache
More detail on the model artifact options available for the LMI Transformers-NeuronX container is available here.
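As an illustration of these options, the sketch below shows how each source might be referenced in serving.properties. The bucket names and paths are placeholders, and the option.load_split_model / option.compiled_graph_path settings (described in the advanced configuration table later on this page) are only needed when pointing at split or pre-compiled artifacts.
# 1) HuggingFace Hub model id (model is compiled at runtime)
option.model_id=meta-llama/Llama-2-70b-hf

# 2) S3 URL following the HuggingFace model repo structure (placeholder bucket/prefix)
# option.model_id=s3://my-model-bucket/llama-2-70b-hf/

# 3) Split model plus a separate NEFF cache (placeholder paths)
# option.model_id=s3://my-model-bucket/llama-2-70b-split/
# option.load_split_model=True
# option.compiled_graph_path=s3://my-model-bucket/llama-2-70b-neff-cache/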
Supported Model Architectures¶
The following model architectures are tested daily for LMI Transformers-NeuronX (in CI):
- LLAMA
- Mistral
- Mixtral
- GPT-NeoX
- GPT-J
- Bloom
- GPT2
- OPT
Complete model set¶
- BLOOM (bigscience/bloom, bigscience/bloomz, etc.)
- GPT-2 (gpt2, gpt2-xl, etc.)
- GPT BigCode (bigcode/starcoder, bigcode/gpt_bigcode-santacoder, etc.)
- GPT-J (EleutherAI/gpt-j-6b, nomic-ai/gpt4all-j, etc.)
- GPT-NeoX (EleutherAI/gpt-neox-20b, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc.)
- LLaMA, LLaMA-2, LLaMA-3 (meta-llama/Llama-2-70b-hf, lmsys/vicuna-13b-v1.3, meta-llama/Meta-Llama-3-70B, openlm-research/open_llama_13b, etc.)
- Mistral (mistralai/Mistral-7B-v0.1, mistralai/Mistral-7B-Instruct-v0.1, etc.)
- Mixtral (mistralai/Mixtral-8x7B-Instruct-v0.1)
- OPT (facebook/opt-66b, facebook/opt-iml-max-30b, etc.)
Models supporting Transformers NeuronX continuous batching¶
- LLaMA, LLaMA-2, LLaMA-3 (meta-llama/Llama-2-70b-hf, lmsys/vicuna-13b-v1.3, meta-llama/Meta-Llama-3-70B, openlm-research/open_llama_13b, etc.)
- Mistral v2 (mistralai/Mistral-7B-v0.2, mistralai/Mistral-7B-Instruct-v0.2, etc.)
We will add and test support for more models in future versions. Please feel free to file an issue if you would like additional model coverage in CI.
Quick Start Configurations¶
Most LMI Transformers-NeuronX models can be deployed using the following template:
serving.properties¶
engine=Python
option.model_id=<your model>
option.entryPoint=djl_python.transformers_neuronx
option.rolling_batch=auto
# Adjust the following based on model size and instance type
option.tensor_parallel_degree=4
option.max_rolling_batch_size=8
option.model_loading_timeout=1600
You can follow this example to deploy a model with serving.properties configuration on SageMaker.
environment variables¶
HF_MODEL_ID=<your model>
OPTION_ENTRYPOINT=djl_python.transformers_neuronx
OPTION_ROLLING_BATCH=auto
# Adjust the following based on model size and instance type
TENSOR_PARALLEL_DEGREE=4
OPTION_MAX_ROLLING_BATCH_SIZE=8
OPTION_MODEL_LOADING_TIMEOUT=1600
You can follow this example to deploy a model with environment variable configuration on SageMaker.
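For reference, a minimal sketch of that flow with the SageMaker Python SDK is shown below. The container image URI, IAM role, instance type, and endpoint name are placeholders (assumptions, not values from this page); consult the LMI documentation for the correct Transformers-NeuronX container URI for your region and SDK version.
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Placeholders: the LMI Transformers-NeuronX container URI for your region,
# an IAM role with SageMaker permissions, and a Neuron (Inf2/Trn1) instance type.
image_uri = "<lmi-transformers-neuronx-container-uri>"
role = "<your-sagemaker-execution-role-arn>"
endpoint_name = "lmi-neuronx-demo"

model = Model(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "<your model>",
        "OPTION_ENTRYPOINT": "djl_python.transformers_neuronx",
        "OPTION_ROLLING_BATCH": "auto",
        "TENSOR_PARALLEL_DEGREE": "4",
        "OPTION_MAX_ROLLING_BATCH_SIZE": "8",
        "OPTION_MODEL_LOADING_TIMEOUT": "1600",
    },
)

# Neuron compilation can take a while; allow a generous startup health-check timeout.
model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.24xlarge",
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=1800,
)

predictor = Predictor(
    endpoint_name=endpoint_name,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)
print(predictor.predict({"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 64}}))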
Quantization¶
Currently, we allow customers to use option.quantize=static_int8 (or the OPTION_QUANTIZE=static_int8 environment variable) to load the model with int8 weight quantization.
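For example, adding one line to the quick-start template above enables int8 weight quantization (shown for both configuration styles):
# serving.properties: add to the template above
option.quantize=static_int8

# environment variables: add to the template above
OPTION_QUANTIZE=static_int8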
Advanced Transformers NeuronX Configurations¶
The following table lists the advanced configurations that are available with the Transformers NeuronX backend.
There are two types of advanced configurations: LMI and Pass Through.
LMI configurations are processed by LMI and translated into configurations that Transformers NeuronX uses.
Pass Through configurations are passed directly to the backend library. These are opaque configurations from the perspective of the model server and LMI.
We recommend that you file an issue for any issues you encounter with configurations.
For LMI configurations, if we determine an issue with the configuration, we will attempt to provide a workaround for the current released version, and attempt to fix the issue for the next release.
For Pass Through configurations, it is possible that our investigation reveals an issue with the backend library. In that situation, there is nothing LMI can do until the issue is fixed in the backend library.
Item | LMI Version | Configuration Type | Description | Example value |
---|---|---|---|---|
option.n_positions | >= 0.26.0 | Pass Through | Total sequence length (input sequence length + output sequence length). | Default: 128 |
option.load_in_8bit | >= 0.26.0 | Pass Through | Specify this option to quantize your model using the supported quantization methods in Transformers NeuronX. | False, True. Default: None |
option.unroll | >= 0.26.0 | Pass Through | Unroll the model graph for compilation. With unroll=None, the compiler has more opportunities to apply optimizations across layers. | Default: None |
option.neuron_optimize_level | >= 0.26.0 | Pass Through | Neuron runtime compiler optimization level, which determines the type of optimizations applied during compilation. Higher optimization levels take longer to compile but yield better latency/throughput. When not set, the default (optimization level 2) balances compilation time and performance. | 1, 2, 3. Default: 2 |
option.context_length_estimate | >= 0.26.0 | Pass Through | Estimated context input length for Llama models. You can specify multiple bucket sizes to increase KV cache re-usability, which helps improve latency. | Example: 256,512,1024 (integers separated by commas for multiple values). Default: None |
option.low_cpu_mem_usage | >= 0.26.0 | Pass Through | Reduce CPU memory usage when loading models. | Default: False |
option.load_split_model | >= 0.26.0 | Pass Through | Set to True when using model artifacts that have already been split for Neuron compilation/loading. | False, True. Default: None |
option.compiled_graph_path | >= 0.26.0 | Pass Through | Provide an S3 URI or a local directory that stores the pre-compiled graph for your model (NEFF cache) to skip runtime compilation. | Default: None |
option.group_query_attention | >= 0.26.0 | Pass Through | Enable K/V cache sharding for Llama and Mistral model types based on various strategies. | shard-over-heads. Default: None |
option.enable_mixed_precision_accumulation | >= 0.26.0 | Pass Through | Turn this on for the LLaMA 70B model to achieve better accuracy. | True. Default: None |
option.enable_saturate_infinity | >= 0.27.0 | Pass Through | Turn this on for the LLaMA 13B model to correct for accuracy issues. | True. Default: None |
option.speculative_draft_model | >= 0.28.0 | Pass Through | Model id or path to the speculative decoding draft model. | Default: None |
option.speculative_length | >= 0.28.0 | Pass Through | Determines the number of tokens the draft model generates before verifying against the target model. | Default: 5 |
option.draft_model_compiled_path | >= 0.28.0 | Pass Through | Provide an S3 URI or a local directory that stores the pre-compiled graph for your draft model (NEFF cache) to skip runtime compilation. | Default: None |
option.attention_layout | >= 0.28.0 | Pass Through | Layout to be used for attention computation. Select from ["HSB", "BSH"]. | Default: None |
option.collectives_layout | >= 0.28.0 | Pass Through | Layout to be used for collectives within attention. Select from ["HSB", "BSH"]. | Default: None |
option.cache_layout | >= 0.28.0 | Pass Through | Layout to be used for storing the KV cache. Select from ["SBH", "BSH"]. | Default: None |
option.on_device_embedding | >= 0.30.0 | Pass Through | Enables the input embedding to be performed on Neuron. By default, the embedding is computed on CPU. | Default: None |
option.on_device_generation | >= 0.30.0 | Pass Through | Enables token generation to be performed on Neuron hardware with the given configuration. By default, token generation is computed on CPU. Because this is configured at compilation time, generation configurations cannot be changed dynamically during inference. | Default: None |
option.shard_over_sequence | >= 0.30.0 | Pass Through | Enables flash decoding / sequence-parallel attention for token generation models (default=False). Recommended to set this option to True when batch size * sequence length > 16k. | Default: None |
option.fuse_qkv | >= 0.30.0 | Pass Through | Fuses the QKV projection into a single matrix multiplication. | Default: None |
option.qkv_tiling | >= 0.30.0 | Pass Through | Splits attention QKV to introduce "free" 128 dimensions. | Default: None |
option.weight_tiling | >= 0.30.0 | Pass Through | Splits model MLP to introduce "free" 128 dimensions. | Default: None |
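To make the table concrete, here is a hedged serving.properties sketch that combines a few of the pass-through options above for a LLaMA-style model. The specific values (tensor parallel degree, sequence length, context buckets, and the S3 path to a pre-compiled NEFF cache) are illustrative placeholders, not tuned recommendations:
engine=Python
option.model_id=<your model>
option.entryPoint=djl_python.transformers_neuronx
option.rolling_batch=auto
option.tensor_parallel_degree=8
# Total sequence length (input + output tokens)
option.n_positions=2048
# KV cache bucket sizes to improve re-usability across prompt lengths
option.context_length_estimate=256,512,1024
# Fuse the QKV projection into a single matrix multiplication
option.fuse_qkv=True
# Skip runtime compilation by pointing at a pre-compiled NEFF cache (placeholder path)
option.compiled_graph_path=s3://my-model-bucket/my-model-neff-cache/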
Advanced Multi-Model Inference Considerations¶
When using LMI Transformers-NeuronX for multi-model inference endpoints, you may need to limit the number of threads available to each model.
Follow this guide when setting the correct number of threads to avoid race conditions. In its standard configuration, LMI Transformers-NeuronX sets OMP_NUM_THREADS to two times the tensor parallel degree.
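As an illustration only: with the default behavior described above, a model using tensor parallel degree 4 would receive OMP_NUM_THREADS=8, so two such models on the same host could contend for CPU threads. Assuming your deployment allows this value to be overridden per model, one way to bound the contention is to set it explicitly alongside the other environment variables:
TENSOR_PARALLEL_DEGREE=4
# Override the default of 2 * tensor_parallel_degree (8 here) with a smaller,
# explicit value so each model on the host uses fewer CPU threads.
OMP_NUM_THREADS=4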