Transformers-NeuronX Engine in LMI¶
Model Artifacts Structure¶
LMI Transformers-NeuronX expects the model to be in standard HuggingFace format for runtime compilation.
For loading pre-compiled models, both Optimum compiled models and split models with a separate NEFF cache are supported (compiled models must have been compiled with the same Neuron compiler version and model settings).
The source of the model can be:
- a model_id string from the HuggingFace Hub
- an S3 URL that stores artifacts following the HuggingFace model repo structure, the Optimum compiled model repo structure, or a split model with a second URL for the NEFF cache
- a local path to a folder that follows the HuggingFace model repo structure, the Optimum compiled model repo structure, or a split model with a second directory for the NEFF cache
More detail on the model artifact options available for the LMI Transformers-NeuronX container is available here.
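As an illustration of these options, the sketch below shows how each source might be referenced in serving.properties. The bucket names and paths are placeholders, and the option.load_split_model / option.compiled_graph_path settings (described in the advanced configuration table later on this page) are only needed when pointing at split or pre-compiled artifacts.
# 1) HuggingFace Hub model id (model is compiled at runtime)
option.model_id=meta-llama/Llama-2-70b-hf

# 2) S3 URL following the HuggingFace model repo structure (placeholder bucket/prefix)
# option.model_id=s3://my-model-bucket/llama-2-70b-hf/

# 3) Split model plus a separate NEFF cache (placeholder paths)
# option.model_id=s3://my-model-bucket/llama-2-70b-split/
# option.load_split_model=True
# option.compiled_graph_path=s3://my-model-bucket/llama-2-70b-neff-cache/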
Supported Model Architectures¶
The following model architectures are tested daily for LMI Transformers-NeuronX (in CI):
- LLAMA
- Mistral
- Mixtral
- GPT-NeoX
- GPT-J
- Bloom
- GPT2
- OPT
Complete model set¶
- BLOOM (bigscience/bloom, bigscience/bloomz, etc.)
- GPT-2 (gpt2, gpt2-xl, etc.)
- GPT BigCode (bigcode/starcoder, bigcode/gpt_bigcode-santacoder, etc.)
- GPT-J (EleutherAI/gpt-j-6b, nomic-ai/gpt4all-j, etc.)
- GPT-NeoX (EleutherAI/gpt-neox-20b, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc.)
- LLaMA, LLaMA-2, LLaMA-3 (meta-llama/Llama-2-70b-hf, lmsys/vicuna-13b-v1.3, meta-llama/Meta-Llama-3-70B, openlm-research/open_llama_13b, etc.)
- Mistral (mistralai/Mistral-7B-v0.1, mistralai/Mistral-7B-Instruct-v0.1, etc.)
- Mixtral (mistralai/Mixtral-8x7B-Instruct-v0.1)
- OPT (facebook/opt-66b, facebook/opt-iml-max-30b, etc.)
Models supporting Transformers NeuronX continuous batching¶
- LLaMA, LLaMA-2, LLaMA-3 (meta-llama/Llama-2-70b-hf, lmsys/vicuna-13b-v1.3, meta-llama/Meta-Llama-3-70B, openlm-research/open_llama_13b, etc.)
- Mistral v2 (mistralai/Mistral-7B-v0.2, mistralai/Mistral-7B-Instruct-v0.2, etc.)
We will add and test support for more models in future versions. Please feel free to file an issue if you would like additional model coverage in CI.
Quick Start Configurations¶
Most LMI Transformers-NeuronX models can be deployed using the following template:
serving.properties¶
engine=Python
option.model_id=<your model>
option.entryPoint=djl_python.transformers_neuronx
option.rolling_batch=auto
# Adjust the following based on model size and instance type
option.tensor_parallel_degree=4
option.max_rolling_batch_size=8
option.model_loading_timeout=1600
You can follow this example to deploy a model with serving.properties configuration on SageMaker.
environment variables¶
HF_MODEL_ID=<your model>
OPTION_ENTRYPOINT=djl_python.transformers_neuronx
OPTION_ROLLING_BATCH=auto
# Adjust the following based on model size and instance type
TENSOR_PARALLEL_DEGREE=4
OPTION_MAX_ROLLING_BATCH_SIZE=8
OPTION_MODEL_LOADING_TIMEOUT=1600
You can follow this example to deploy a model with environment variable configuration on SageMaker.
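For reference, a minimal sketch of that flow with the SageMaker Python SDK is shown below. The container image URI, IAM role, instance type, and endpoint name are placeholders (assumptions, not values from this page); consult the LMI documentation for the correct Transformers-NeuronX container URI for your region and SDK version.
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Placeholders: the LMI Transformers-NeuronX container URI for your region,
# an IAM role with SageMaker permissions, and a Neuron (Inf2/Trn1) instance type.
image_uri = "<lmi-transformers-neuronx-container-uri>"
role = "<your-sagemaker-execution-role-arn>"
endpoint_name = "lmi-neuronx-demo"

model = Model(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "<your model>",
        "OPTION_ENTRYPOINT": "djl_python.transformers_neuronx",
        "OPTION_ROLLING_BATCH": "auto",
        "TENSOR_PARALLEL_DEGREE": "4",
        "OPTION_MAX_ROLLING_BATCH_SIZE": "8",
        "OPTION_MODEL_LOADING_TIMEOUT": "1600",
    },
)

# Neuron compilation can take a while; allow a generous startup health-check timeout.
model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.24xlarge",
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=1800,
)

predictor = Predictor(
    endpoint_name=endpoint_name,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)
print(predictor.predict({"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 64}}))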
Quantization¶
Currently, we allow customers to use option.quantize=static_int8 (or the OPTION_QUANTIZE=static_int8 environment variable) to load the model with int8 weight quantization.
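For example, adding one line to the quick-start template above enables int8 weight quantization (shown for both configuration styles):
# serving.properties: add to the template above
option.quantize=static_int8

# environment variables: add to the template above
OPTION_QUANTIZE=static_int8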
Advanced Transformers NeuronX Configurations¶
The following table lists the advanced configurations that are available with the Transformers NeuronX backend.
There are two types of advanced configurations: LMI and Pass Through.
LMI configurations are processed by LMI and translated into configurations that Transformers NeuronX uses.
Pass Through configurations are passed directly to the backend library. These are opaque configurations from the perspective of the model server and LMI.
We recommend that you file an issue for any issues you encounter with configurations.
For LMI configurations, if we determine an issue with the configuration, we will attempt to provide a workaround for the current released version, and attempt to fix the issue for the next release.
For Pass Through configurations, it is possible that our investigation reveals an issue with the backend library. In that situation, there is nothing LMI can do until the issue is fixed in the backend library.
Item | LMI Version | Configuration Type | Description | Example value |
---|---|---|---|---|
option.n_positions | >= 0.26.0 | Pass Through | Total sequence length (input sequence length + output sequence length). | Default: 128 |
option.load_in_8bit | >= 0.26.0 | Pass Through | Specify this option to quantize your model using the supported quantization methods in Transformers NeuronX. | False, True. Default: None |
option.unroll | >= 0.26.0 | Pass Through | Unroll the model graph for compilation. With unroll=None, the compiler has more opportunities to apply optimizations across layers. | Default: None |
option.neuron_optimize_level | >= 0.26.0 | Pass Through | Neuron runtime compiler optimization level, which determines the type of optimizations applied during compilation. Higher optimization levels take longer to compile but yield better latency/throughput. When not set, the default (optimization level 2) balances compilation time and performance. | 1, 2, 3. Default: 2 |
option.context_length_estimate | >= 0.26.0 | Pass Through | Estimated context input length for Llama models. You can specify multiple bucket sizes to increase KV cache re-usability, which helps improve latency. | Example: 256,512,1024 (integers separated by commas for multiple values). Default: None |
option.low_cpu_mem_usage | >= 0.26.0 | Pass Through | Reduce CPU memory usage when loading models. | Default: False |
option.load_split_model | >= 0.26.0 | Pass Through | Set to True when using model artifacts that have already been split for Neuron compilation/loading. | False, True. Default: None |
option.compiled_graph_path | >= 0.26.0 | Pass Through | Provide an S3 URI or a local directory that stores the pre-compiled graph for your model (NEFF cache) to skip runtime compilation. | Default: None |
option.group_query_attention | >= 0.26.0 | Pass Through | Enable K/V cache sharding for Llama and Mistral model types based on various strategies. | shard-over-heads. Default: None |
option.enable_mixed_precision_accumulation | >= 0.26.0 | Pass Through | Turn this on for the LLaMA 70B model to achieve better accuracy. | True. Default: None |
option.enable_saturate_infinity | >= 0.27.0 | Pass Through | Turn this on for the LLaMA 13B model to correct for accuracy issues. | True. Default: None |
option.speculative_draft_model | >= 0.28.0 | Pass Through | Model id or path to the speculative decoding draft model. | Default: None |
option.speculative_length | >= 0.28.0 | Pass Through | Determines the number of tokens the draft model generates before verifying against the target model. | Default: 5 |
option.draft_model_compiled_path | >= 0.28.0 | Pass Through | Provide an S3 URI or a local directory that stores the pre-compiled graph for your draft model (NEFF cache) to skip runtime compilation. | Default: None |
option.attention_layout | >= 0.28.0 | Pass Through | Layout to be used for attention computation. Select from ["HSB", "BSH"]. | Default: None |
option.collectives_layout | >= 0.28.0 | Pass Through | Layout to be used for collectives within attention. Select from ["HSB", "BSH"]. | Default: None |
option.cache_layout | >= 0.28.0 | Pass Through | Layout to be used for storing the KV cache. Select from ["SBH", "BSH"]. | Default: None |
option.on_device_embedding | >= 0.30.0 | Pass Through | Enables the input embedding to be performed on Neuron. By default, the embedding is computed on CPU. | Default: None |
option.on_device_generation | >= 0.30.0 | Pass Through | Enables token generation to be performed on Neuron hardware with the given configuration. By default, token generation is computed on CPU. Because this is configured at compilation time, generation configurations cannot be changed dynamically during inference. | Default: None |
option.shard_over_sequence | >= 0.30.0 | Pass Through | Enables flash decoding / sequence-parallel attention for token generation models (default=False). Recommended to set this option to True when batch size * sequence length > 16k. | Default: None |
option.fuse_qkv | >= 0.30.0 | Pass Through | Fuses the QKV projection into a single matrix multiplication. | Default: None |
option.qkv_tiling | >= 0.30.0 | Pass Through | Splits attention QKV to introduce "free" 128 dimensions. | Default: None |
option.weight_tiling | >= 0.30.0 | Pass Through | Splits model MLP to introduce "free" 128 dimensions. | Default: None |
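To make the table concrete, here is a hedged serving.properties sketch that combines a few of the pass-through options above for a LLaMA-style model. The specific values (tensor parallel degree, sequence length, context buckets, and the S3 path to a pre-compiled NEFF cache) are illustrative placeholders, not tuned recommendations:
engine=Python
option.model_id=<your model>
option.entryPoint=djl_python.transformers_neuronx
option.rolling_batch=auto
option.tensor_parallel_degree=8
# Total sequence length (input + output tokens)
option.n_positions=2048
# KV cache bucket sizes to improve re-usability across prompt lengths
option.context_length_estimate=256,512,1024
# Fuse the QKV projection into a single matrix multiplication
option.fuse_qkv=True
# Skip runtime compilation by pointing at a pre-compiled NEFF cache (placeholder path)
option.compiled_graph_path=s3://my-model-bucket/my-model-neff-cache/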
Advanced Multi-Model Inference Considerations¶
When using LMI Transformers-NeuronX for multi-model inference endpoints, you may need to limit the number of threads available to each model.
Follow this guide when setting the correct number of threads to avoid race conditions. In its standard configuration, LMI Transformers-NeuronX sets OMP_NUM_THREADS to two times the tensor parallel degree.
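As an illustration only: with the default behavior described above, a model using tensor parallel degree 4 would receive OMP_NUM_THREADS=8, so two such models on the same host could contend for CPU threads. Assuming your deployment allows this value to be overridden per model, one way to bound the contention is to set it explicitly alongside the other environment variables:
TENSOR_PARALLEL_DEGREE=4
# Override the default of 2 * tensor_parallel_degree (8 here) with a smaller,
# explicit value so each model on the host uses fewer CPU threads.
OMP_NUM_THREADS=4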