TensorRT-LLM (TRT-LLM) Engine User Guide
Model Artifacts Structure
TRT-LLM LMI supports two options for model artifacts:
- Standard HuggingFace model format: In this case, TRT-LLM LMI builds TRT-LLM engines from the HuggingFace model at load time and packages them with the HuggingFace model config files.
- Custom TRT-LLM LMI model format: In this case, pre-compiled artifacts are loaded directly without the need for model compilation, resulting in faster load times. (A rough sketch of the typical layout is shown below.)
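As a rough sketch (file names are illustrative and vary by model), a standard HuggingFace model directory typically contains the model config, tokenizer files, and weight shards:

config.json
generation_config.json
tokenizer_config.json
tokenizer.json
model-00001-of-00002.safetensors
model-00002-of-00002.safetensors

A custom TRT-LLM LMI artifact directory instead contains pre-built TRT-LLM engine files alongside the tokenizer and config files; see the manual preparation tutorial referenced later in this guide for the exact layout.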
Supported Model Architectures
The following model architectures are supported for just-in-time (JIT) model compilation and are tested in our CI.
- LLaMA (since LMI V7 0.25.0)
- Falcon (since LMI V7 0.25.0)
- InternLM (since LMI V8 0.26.0)
- Baichuan (since LMI V8 0.26.0)
- ChatGLM (since LMI V8 0.26.0)
- GPT-J (since LMI V8 0.26.0)
- Mistral (since LMI V8 0.26.0)
- Mixtral (since LMI V8 0.26.0)
- Qwen (since LMI V8 0.26.0)
- GPT2/SantaCoder/StarCoder/GPTBigCode (since LMI V8 0.26.0)
- Phi2 (since LMI V9 0.27.0)
- OPT (since LMI V9 0.27.0)
- Gemma (since LMI V10 0.28.0)
TRT-LLM LMI v11 (0.29.0) containers ship with TRT-LLM 0.11.0. For models that are not listed here but are supported by TRT-LLM with tensorrtllm_backend, you can use this tutorial to prepare the model manually.
We will add support for more models in future versions and in our CI. Please feel free to file an issue if you are looking for support for a specific model.
Quick Start Configurations
You can leverage tensorrtllm with LMI using the following starter configurations:
serving.properties
engine=MPI
option.tensor_parallel_degree=max
option.model_id=<your model id>
# Adjust the following based on model size and instance type
option.max_num_tokens=50000
You can follow this example to deploy a model with a serving.properties configuration on SageMaker.
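If you need more control than the starter configuration, a slightly fuller serving.properties might look like the following sketch. The values below are illustrative placeholders, not recommendations; each option is described in the advanced configuration table later in this guide.

engine=MPI
option.model_id=<your model id>
option.tensor_parallel_degree=max
# Illustrative values; tune for your model and instance type
option.max_rolling_batch_size=64
option.max_input_len=2048
option.max_output_len=1024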
environment variables
HF_MODEL_ID=<your model id>
TENSOR_PARALLEL_DEGREE=max
# Adjust the following based on model size and instance type
OPTION_MAX_NUM_TOKENS=50000
You can follow this example to deploy a model with an environment variable configuration on SageMaker.
Where to find the max_num_tokens number?
To simplify your building experience, we have pre-compiled and tested some models to help you verify the value you can use. You can follow this link to find the table of precompiled models.
Quantization Support
We support three methods of quantization when using TensorRT-LLM with LMI: SmoothQuant, AWQ, and FP8.
You can enable these quantization strategies using option.quantize=<smoothquant|awq|fp8> in serving.properties, or the OPTION_QUANTIZE=<smoothquant|awq|fp8> environment variable.
More details about additional (optional) quantization configurations are available in the advanced configuration table below.
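For example, a hedged serving.properties sketch that enables AWQ quantization might look like the following. The optional AWQ settings are described in the table below; the values shown are illustrative, not recommendations.

engine=MPI
option.model_id=<your model id>
option.tensor_parallel_degree=max
option.quantize=awq
# Optional AWQ tuning knobs (see the advanced configuration table)
option.q_format=int4_awq
option.calib_size=512

The equivalent environment variable for the quantization strategy itself is OPTION_QUANTIZE=awq.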
Advanced TensorRT-LLM Configurations
The following table lists the advanced configurations that are available with the TensorRT-LLM backend.
There are two types of advanced configurations: LMI and Pass Through.
LMI configurations are processed by LMI and translated into configurations that TensorRT-LLM uses.
Pass Through configurations are passed directly to the backend library. These are opaque configurations from the perspective of the model server and LMI.
We recommend that you file an issue for any problems you encounter with these configurations.
For LMI configurations, if we determine there is an issue with the configuration, we will attempt to provide a workaround for the current released version and fix the issue in the next release.
For Pass Through configurations, it is possible that our investigation reveals an issue with the backend library.
In that situation, there is nothing LMI can do until the issue is fixed in the backend library.
Item | LMI Version | Configuration Type | Description | Example value |
---|---|---|---|---|
option.max_input_len | >= 0.25.0 | LMI | Maximum input token size you expect the model to receive per request. This is a compilation parameter passed to the model for just-in-time compilation. If you set this value too low, the model will be unable to consume long inputs. LMI also validates this at runtime for each request. | Default: 1024 |
option.max_output_len | >= 0.25.0 | LMI | Maximum output token size you expect the model to produce per request. This is a compilation parameter passed to the model for just-in-time compilation. If you set this value too low, the model will be unable to produce tokens beyond the value you set. | Default: 512 |
option.max_num_tokens | >= 0.27.0 | LMI | Maximum total number of tokens the TRT-LLM engine will use. Internally, if you set this value, we extend max_input_len, max_output_len, and the batch size to what the model can actually support. This allows the model to handle more arbitrary traffic patterns. | Default: 16384 |
option.use_custom_all_reduce | >= 0.25.0 | Pass Through | A custom all-reduce kernel is used for GPUs that have NVLink enabled. This can help speed up model inference through better communication. We will determine this automatically for customers, but you can turn it on explicitly by setting it to true on P4D, P4De, P5, and other NVLink-connected GPU instances. | true, false. Default is false |
option.tokens_per_block | >= 0.25.0 | Pass Through | Number of tokens per block used by the paged attention algorithm. | Default value is 128 |
option.batch_scheduler_policy | >= 0.25.0 | Pass Through | Scheduler policy of the TensorRT-LLM batch manager. | max_utilization, guaranteed_no_evict. Default value is max_utilization |
option.kv_cache_free_gpu_mem_fraction | >= 0.25.0 | Pass Through | Fraction of free GPU memory allocated for the KV cache. The larger the value, the more GPU memory the model will try to claim. The more memory reserved, the larger the KV cache can be, which means longer input+output sequences or larger batch sizes. | Float between 0 and 1. Default is 0.95 |
option.max_num_sequences | >= 0.25.0 | Pass Through | Maximum number of input requests processed in a batch. If you don't set this, max_rolling_batch_size is used as the value. Generally you don't need to change it unless you want the model compiled with a batch size different from the one the model server uses. | Integer greater than 0. Default value is the batch size set while building the TensorRT engine |
option.enable_trt_overlap | >= 0.25.0 | Pass Through | Parameter to overlap the execution of batches of requests. It may have a negative impact on performance when the number of requests is too small. In our experiments, turning this on was more often detrimental than beneficial. | true, false. Default is false |
option.enable_kv_cache_reuse | >= 0.26.0 | Pass Through | This feature is only supported for GPT-like models on TRT-LLM (as of 0.7.1) and requires compiling the model with --use_paged_context_fmha. It lets the model remember the most recently used input KV cache and try to reuse it in the next run. An immediate benefit is much faster first-token latency. This is typically helpful for document understanding and chat applications that usually share the same input prefix. The TRT-LLM backend remembers the prefix tree of the input and reuses most of it for the next generation. However, this comes at the cost of extra GPU memory. | true, false. Default is false |
option.baichuan_model_version | >= 0.26.0 | Pass Through | Parameter exclusive to Baichuan models to specify the model version. You also need to specify the HF Baichuan checkpoint path. For v1_13b, use either baichuan-inc/Baichuan-13B-Chat or baichuan-inc/Baichuan-13B-Base. For v2_13b, use either baichuan-inc/Baichuan2-13B-Chat or baichuan-inc/Baichuan2-13B-Base. More Baichuan models can be found under baichuan-inc. | v1_7b, v1_13b, v2_7b, v2_13b. Default is v1_13b |
option.chatglm_model_version | >= 0.26.0 | Pass Through | Parameter exclusive to ChatGLM models to specify the exact model type. Required for ChatGLM models. | chatglm_6b , chatglm2_6b , chatglm2_6b_32k , chatglm3_6b , chatglm3_6b_base , chatglm3_6b_32k , glm_10b . Default is unspecified , which will throw an error. |
option.gpt_model_version | >= 0.26.0 | Pass Through | Parameter exclusive to GPT2 models to specify the exact model type. Required for GPT2 models. | gpt2 , santacoder , starcoder . Default is gpt2 . |
option.multi_block_mode | >= 0.26.0 | Pass Through | Split a long KV sequence into multiple blocks (applied to generation MHA kernels). This is beneficial when batch x num_heads cannot fully utilize the GPU. Not supported for the qwen model type. | true, false. Default is false |
option.use_fused_mlp | >= 0.26.0 | Pass Through | Enable horizontal fusion in GatedMLP, which reduces layer input traffic and can improve performance for large Llama models (e.g. llama-2-70b). This option is only supported for the Llama model type. | true, false. Default is false |
option.rotary_base | >= 0.26.0 | Pass Through | Rotary base parameter for the RoPE embedding. This is supported for the llama, internlm, and qwen model types. | Float value. Default is 10000.0 |
option.rotary_dim | >= 0.26.0 | Pass Through | Rotary dimension parameter for the RoPE embedding. This is supported only for the gptj model type. | Integer value. Default is 64 |
option.rotary_scaling_type option.rotary_scaling_factor | >= 0.26.0 | Pass Through | Rotary scaling parameters. These two options should always be set together to prevent errors. They are supported for the llama, qwen, and internlm models. | rotary_scaling_type can be either linear or dynamic. rotary_scaling_factor can be any value larger than 1.0. Default is None. |
option.logits_dtype | >= 0.28.0 | Pass Through | Datatype of logits; only applies for the T5 model. | fp16 , fp32 . Default is fp32 |
option.trtllm_checkpoint_path | >= 0.28.0 | Pass Through | Specifies the location where checkpoint artifacts are placed. Setting this is not recommended. | Default is /tmp/trtllm_{model_name}_ckpt/ |
option.num_checkpoint_workers | >= 0.27.0 | Pass Through | Specifies the number of workers used to perform checkpoint conversion. | Default is tensor_parallel_degree * pipeline_parallel_degree |
option.num_engine_workers | >= 0.27.0 | Pass Through | Specifies the number of workers used to perform engine build. | Default is number of available CUDA devices |
option.load_by_shard | >= 0.28.0 | Pass Through | Sharding during compilation; only for the Falcon 40B model. | true, false. Default is false |
Advanced parameters: SmoothQuant | | | | |
option.quantize | >= 0.26.0 | Pass Through | Currently only supports smoothquant for Llama, Mistral, InternLM, and Baichuan models with just-in-time compilation mode. | smoothquant |
option.smoothquant_alpha | >= 0.26.0 | Pass Through | SmoothQuant alpha parameter. | Default value is 0.8 |
option.smoothquant_per_token | >= 0.26.0 | Pass Through | This is only applied when option.quantize is set to smoothquant. It enables choosing a custom SmoothQuant scaling factor for each token at runtime. This is usually slightly slower and more accurate. | true, false. Default is false |
option.smoothquant_per_channel | >= 0.26.0 | Pass Through | This is only applied when option.quantize is set to smoothquant. It enables choosing a custom SmoothQuant scaling factor for each channel at runtime. This is usually slightly slower and more accurate. | true, false. Default is false |
option.multi_query_mode | >= 0.26.0 | Pass Through | This is only needed when option.quantize is set to smoothquant. It should be set for models that use multi-query attention, e.g. llama-70b. | true, false. Default is false |
Advanced parameters: AWQ | | | | |
option.quantize | >= 0.26.0 | Pass Through | Currently only supports awq for Llama and Mistral models with just-in-time compilation mode. | awq |
option.awq_format | == 0.26.0 | Pass Through | This is only applied when option.quantize is set to awq. The AWQ format you want to use. Currently only int4_awq is supported. | Default value is int4_awq |
option.awq_calib_size | == 0.26.0 | Pass Through | This is only applied when option.quantize is set to awq. Number of samples used for calibration. | Default is 32 |
option.q_format | >= 0.27.0 | Pass Through | This is only applied when option.quantize is set to awq. The AWQ format you want to use. Currently only int4_awq is supported. | Default value is int4_awq |
option.calib_size | >= 0.27.0 | Pass Through | This is applied when option.quantize is set to awq. Number of samples used for calibration. | Default is 512 |
option.calib_batch_size | >= 0.28.0 | Pass Through | This is applied when option.quantize is set to awq. Batch size used for calibration. | Default is 32 |
option.awq_block_size | >= 0.28.0 | Pass Through | This is applied when option.quantize is set to awq. Block (group) size for AWQ quantization. | Default is 128 |
Advanced parameters: FP8 | | | | |
option.quantize | >= 0.26.0 | Pass Through | Currently only supports fp8 for Llama, Mistral, Mixtral, Baichuan, Gemma, and GPT2 models with just-in-time compilation mode. | fp8 |
option.use_fp8_context_fmha | >= 0.28.0 | Pass Through | Paged attention for FP8; should only be turned on for P5 instances. | true, false. Default is false |
option.calib_size | >= 0.27.0 | Pass Through | This is applied when option.quantize is set to fp8. Number of samples used for calibration. | Default is 512 |
option.calib_batch_size | >= 0.28.0 | Pass Through | This is applied when option.quantize is set to fp8. Batch size used for calibration. | Default is 32 |
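To illustrate how these options fit together, below is a hedged serving.properties sketch that combines a few of the settings from the table above. The values are placeholders to show the syntax and should be tuned (or omitted) for your model and instance type.

engine=MPI
option.model_id=<your model id>
option.tensor_parallel_degree=max
# Compilation-time token budget (see option.max_num_tokens above)
option.max_num_tokens=50000
# Runtime KV cache and scheduling behavior
option.kv_cache_free_gpu_mem_fraction=0.9
option.batch_scheduler_policy=guaranteed_no_evict
# Reuse KV cache for shared prompt prefixes (GPT-like models only, see option.enable_kv_cache_reuse above)
option.enable_kv_cache_reuse=true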