
TensorRT-LLM (TRT-LLM) Engine User Guide

Model Artifacts Structure

TRT-LLM LMI supports two options for model artifacts:

  1. Standard HuggingFace model format: In this case, TRT-LLM LMI builds TRT-LLM engines from the HuggingFace model and packages them with the HuggingFace model config files at model load time.
  2. Custom TRT-LLM LMI model format: In this case, artifacts are loaded directly without model compilation, resulting in faster load times.

Supported Model Architectures

The model architectures below are supported for JIT model compilation and tested in our CI.

  • LLaMA (since LMI V7 0.25.0)
  • Falcon (since LMI V7 0.25.0)
  • InternLM (since LMI V8 0.26.0)
  • Baichuan (since LMI V8 0.26.0)
  • ChatGLM (since LMI V8 0.26.0)
  • GPT-J (since LMI V8 0.26.0)
  • Mistral (since LMI V8 0.26.0)
  • Mixtral (since LMI V8 0.26.0)
  • Qwen (since LMI V8 0.26.0)
  • GPT2/SantaCoder (since LMI V8 0.26.0)

TRT-LLM LMI v9 0.27.0 containers come with TRT-LLM 0.8.0. For models that are not listed here but are supported by TRT-LLM with tensorrtllm_backend, you can use this tutorial instead to prepare the model manually.

We will add support for more models to our CI in future versions. Please feel free to file an issue if you are looking for support for a specific model.

Quick Start Configurations

You can leverage tensorrtllm with LMI using the following starter configurations:

serving.properties

engine=MPI
option.tensor_parallel_degree=max
option.rolling_batch=trtllm
option.model_id=<your model id>
# Adjust the following based on model size and instance type
option.max_rolling_batch_size=64
option.max_input_len=1024
option.max_output_len=512

You can follow this example to deploy a model with serving.properties configuration on SageMaker.

environment variables

HF_MODEL_ID=<your model id>
TENSOR_PARALLEL_DEGREE=max
OPTION_ROLLING_BATCH=trtllm
# Adjust the following based on model size and instance type
OPTION_MAX_ROLLING_BATCH_SIZE=64
OPTION_MAX_INPUT_LEN=1024
OPTION_MAX_OUTPUT_LEN=512

You can follow this example to deploy a model with environment variable configuration on SageMaker.

Quantization Support

We support two methods of quantization when using TensorRT-LLM with LMI: SmoothQuant and AWQ. You can enable these quantization strategies using option.quantize=<smoothquant|awq> in serving.properties, or the OPTION_QUANTIZE=<smoothquant|awq> environment variable. More details about additional (optional) quantization configurations are available in the advanced configuration table below.

Advanced TensorRT-LLM Configurations

The following table lists the advanced configurations that are available with the TensorRT-LLM backend. There are two types of advanced configurations: LMI and Pass Through. LMI configurations are processed by LMI and translated into configurations that TensorRT-LLM uses. Pass Through configurations are passed directly to the backend library. These are opaque configurations from the perspective of the model server and LMI. We recommend that you file an issue for any problems you encounter with configurations. For LMI configurations, if we determine there is an issue with the configuration, we will attempt to provide a workaround for the current released version and fix the issue in the next release. For Pass Through configurations, our investigation may reveal an issue with the backend library. In that situation, there is nothing LMI can do until the issue is fixed in the backend library.

| Item | LMI Version | Configuration Type | Description | Example value |
|---|---|---|---|---|
| option.max_input_len | >= 0.25.0 | LMI | Maximum number of input tokens you expect per request. This is a compilation parameter applied to the model during just-in-time compilation. If you set this value too low, the model will be unable to consume longer inputs. LMI also validates this at runtime for each request. | Default: 1024 |
| option.max_output_len | >= 0.25.0 | LMI | Maximum number of output tokens you expect per request. This is a compilation parameter applied to the model during just-in-time compilation. If you set this value too low, the model will be unable to produce tokens beyond the value you set. | Default: 512 |
| option.use_custom_all_reduce | >= 0.25.0 | Pass Through | Enables the custom all-reduce kernel for GPUs that have NVLink enabled. This can speed up model inference through better communication. Set this to true on P4D, P4De, P5, and other instances whose GPUs are NVLink connected. | true, false. Default is false |
| option.tokens_per_block | >= 0.25.0 | Pass Through | Number of tokens per block used in the paged attention algorithm. | Default value is 128 |
| option.batch_scheduler_policy | >= 0.25.0 | Pass Through | Scheduler policy of the TensorRT-LLM batch manager. | max_utilization, guaranteed_no_evict. Default value is max_utilization |
| option.kv_cache_free_gpu_mem_fraction | >= 0.25.0 | Pass Through | Fraction of free GPU memory allocated to the KV cache. The larger the value, the more GPU memory the model will try to claim. The more memory is reserved, the larger the KV cache can be, which allows longer input+output sequences or a larger batch size. | Float between 0 and 1. Default is 0.95 |
| option.max_num_sequences | >= 0.25.0 | Pass Through | Maximum number of input requests processed in a batch. If you don't set this, max_rolling_batch_size is used as its value. You generally don't need to change it unless you want the model compiled for a batch size different from the one the model server uses. | Integer greater than 0. Default value is the batch size set while building the TensorRT engine |
| option.enable_trt_overlap | >= 0.25.0 | Pass Through | Overlaps the execution of batches of requests. It may have a negative impact on performance when the number of requests is too small. In our experiments, turning this on had more negative impact than leaving it off. | true, false. Default is false |
| option.enable_kv_cache_reuse | >= 0.26.0 | Pass Through | Only supported for GPT-like models on TRT-LLM (as of 0.7.1), and the model must be compiled with --use_paged_context_fmha. Lets the model remember the last used input KV cache and try to reuse it in the next run. The immediate benefit is much lower first-token latency. This is typically helpful for document understanding and chat applications, which usually share the same input prefix. The TRT-LLM backend remembers the prefix tree of the input and reuses most of it for the next generation. However, this comes at the cost of extra GPU memory. | true, false. Default is false |
| option.baichuan_model_version | >= 0.26.0 | Pass Through | Exclusive to Baichuan models; specifies the version of the model. You need to specify the HF Baichuan checkpoint path. For v1_13b, use either baichuan-inc/Baichuan-13B-Chat or baichuan-inc/Baichuan-13B-Base. For v2_13b, use either baichuan-inc/Baichuan2-13B-Chat or baichuan-inc/Baichuan2-13B-Base. More Baichuan models can be found under baichuan-inc. | v1_7b, v1_13b, v2_7b, v2_13b. Default is v1_13b |
| option.chatglm_model_version | >= 0.26.0 | Pass Through | Exclusive to ChatGLM models; specifies the exact model type. Required for ChatGLM models. | chatglm_6b, chatglm2_6b, chatglm2_6b_32k, chatglm3_6b, chatglm3_6b_base, chatglm3_6b_32k, glm_10b. Default is unspecified, which will throw an error. |
| option.gpt_model_version | >= 0.26.0 | Pass Through | Exclusive to GPT2 models; specifies the exact model type. Required for GPT2 models. | gpt2, santacoder, starcoder. Default is gpt2. |
| option.multi_block_mode | >= 0.26.0 | Pass Through | Splits a long KV sequence into multiple blocks (applied to generation MHA kernels). It is beneficial when batch x num_heads cannot fully utilize the GPU. This is not supported for the qwen model type. | true, false. Default is false |
| option.use_fused_mlp | >= 0.26.0 | Pass Through | Enables horizontal fusion in GatedMLP, which reduces layer input traffic and potentially improves performance for large Llama models (e.g. llama-2-70b). This option is only supported for the Llama model type. | true, false. Default is false |
| option.rotary_base | >= 0.26.0 | Pass Through | Rotary base parameter for RoPE embedding. Supported for the llama, internlm, and qwen model types. | Float value. Default is 10000.0 |
| option.rotary_dim | >= 0.26.0 | Pass Through | Rotary dimension parameter for RoPE embedding. Supported only for the gptj model type. | Int value. Default is 64 |
| option.rotary_scaling_type, option.rotary_scaling_factor | >= 0.26.0 | Pass Through | Rotary scaling parameters. These two options should always be set together to prevent errors. Supported for llama, qwen, and internlm models. | rotary_scaling_type can be either linear or dynamic. rotary_scaling_factor can be any value larger than 1.0. Default is None. |
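
As a rough illustration of how the options above combine, the following serving.properties sketch extends the quick start configuration for a Llama-family model on an NVLink-connected instance. All option names come from the table above; the specific values (0.9, guaranteed_no_evict, a linear scaling factor of 2.0) are assumptions chosen for illustration, not tuned recommendations.

```properties
engine=MPI
option.tensor_parallel_degree=max
option.rolling_batch=trtllm
option.model_id=<your model id>
option.max_rolling_batch_size=64
option.max_input_len=1024
option.max_output_len=512

# Pass Through options; the values below are illustrative assumptions.
# Custom all-reduce only helps on NVLink-connected GPUs (P4D, P4De, P5).
option.use_custom_all_reduce=true
# Leave slightly more headroom than the 0.95 default.
option.kv_cache_free_gpu_mem_fraction=0.9
option.batch_scheduler_policy=guaranteed_no_evict

# Rotary scaling: set both options together (llama, qwen, internlm only).
# Only meaningful if your checkpoint expects this scaling.
option.rotary_scaling_type=linear
option.rotary_scaling_factor=2.0
```

Because Pass Through values are handed to the backend library as-is, it is probably safest to change one option at a time and confirm the model still loads before tuning further.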

Advanced parameters: SmoothQuant

| Item | LMI Version | Configuration Type | Description | Example value |
|---|---|---|---|---|
| option.quantize | >= 0.26.0 | Pass Through | Currently only supports smoothquant for Llama, Mistral, InternLM and Baichuan models with just-in-time compilation mode. | smoothquant |
| option.smoothquant_alpha | >= 0.26.0 | Pass Through | SmoothQuant alpha parameter. | Default value is 0.8 |
| option.smoothquant_per_token | >= 0.26.0 | Pass Through | Only applied when option.quantize is set to smoothquant. Chooses a custom SmoothQuant scaling factor for each token at run time. This is usually a little slower and more accurate. | true, false. Default is false |
| option.smoothquant_per_channel | >= 0.26.0 | Pass Through | Only applied when option.quantize is set to smoothquant. Chooses a custom SmoothQuant scaling factor for each channel at run time. This is usually a little slower and more accurate. | true, false. Default is false |
| option.multi_query_mode | >= 0.26.0 | Pass Through | Only needed when option.quantize is set to smoothquant. This should be set for models that support multi-query attention, e.g. llama-70b. | true, false. Default is false |
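
A minimal sketch of a SmoothQuant configuration, assuming a Llama, Mistral, InternLM, or Baichuan model compiled just in time. The per-token and per-channel flags are turned on here only to show where they go; whether the accuracy gain is worth the slowdown is something to measure for your own model.

```properties
engine=MPI
option.tensor_parallel_degree=max
option.rolling_batch=trtllm
option.model_id=<your model id>

# SmoothQuant is applied during just-in-time compilation.
option.quantize=smoothquant
# 0.8 is the documented default, shown explicitly for clarity.
option.smoothquant_alpha=0.8
# Optional: usually a little slower and more accurate (illustrative choice).
option.smoothquant_per_token=true
option.smoothquant_per_channel=true
# Uncomment for models that use multi-query attention, e.g. llama-70b.
# option.multi_query_mode=true
```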

Advanced parameters: AWQ

| Item | LMI Version | Configuration Type | Description | Example value |
|---|---|---|---|---|
| option.quantize | >= 0.26.0 | Pass Through | Currently only supports awq for Llama and Mistral models with just-in-time compilation mode. | awq |
| option.awq_format | >= 0.26.0 | Pass Through | Only applied when option.quantize is set to awq. The AWQ format to use. Currently only int4_awq is supported. | Default value is int4_awq |
| option.awq_calib_size | >= 0.26.0 | Pass Through | Only applied when option.quantize is set to awq. Number of samples used for calibration. | Default is 32 |
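
For comparison, an AWQ configuration might look like the sketch below, assuming a Llama or Mistral model compiled just in time. The format and calibration size shown are simply the documented defaults written out explicitly.

```properties
engine=MPI
option.tensor_parallel_degree=max
option.rolling_batch=trtllm
option.model_id=<your model id>

# AWQ is applied during just-in-time compilation (Llama and Mistral only).
option.quantize=awq
# int4_awq is currently the only supported format.
option.awq_format=int4_awq
# Number of calibration samples; 32 is the documented default.
option.awq_calib_size=32
```

When configuring through environment variables instead, the quantization method itself is selected with OPTION_QUANTIZE=awq.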