TensorRT-LLM (TRT-LLM) Engine User Guide

Model Artifacts Structure

TRT-LLM LMI supports two options for model artifacts:

Standard HuggingFace model format: In this case, TRT-LLM LMI builds TRT-LLM engines from the HuggingFace model and packages them with the HuggingFace model config files at model load time.
Custom TRT-LLM LMI model format: In this case, artifacts are loaded directly without the need for model compilation, resulting in faster load times.

Supported Model Architectures

The following model architectures are supported for JIT model compilation and are tested in our CI.

Llama 2/3/3.1
Falcon
InternLM
Baichuan
ChatGLM
GPT-J
Mistral
Mixtral
Qwen
GPT2/SantaCoder/StarCoder/GPTBigCode
Phi2
OPT
Gemma
T5

TRT-LLM LMI v12 (0.30.0) containers come with TRT-LLM 0.12.0. For models that are not listed here but are supported by TRT-LLM with tensorrtllm_backend, you can use this tutorial instead to prepare the model manually.

Quick Start Configurations

You can leverage tensorrtllm with LMI using the following starter configurations:

serving.properties

engine=Python
option.mpi_mode=true
option.tensor_parallel_degree=max
option.model_id=<your model id>
# Adjust the following based on model size and instance type
option.max_num_tokens=50000

You can follow this example to deploy a model with serving.properties configuration on SageMaker.

environment variables

HF_MODEL_ID=<your model id>
TENSOR_PARALLEL_DEGREE=max
# Adjust the following based on model size and instance type
OPTION_MAX_NUM_TOKENS=50000

You can follow this example to deploy a model with environment variable configuration on SageMaker.
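
The two starter configurations above express the same settings in different forms. As a minimal sketch, assuming the naming pattern visible in those examples (an `option.<name>` key becomes `OPTION_<NAME>`, while `option.model_id` and `option.tensor_parallel_degree` map to `HF_MODEL_ID` and `TENSOR_PARALLEL_DEGREE`), a small helper can translate one form into the other. The function name `properties_to_env` is hypothetical and not part of LMI. Keys without the `option.` prefix, such as `engine`, are skipped because the environment variable example does not show a counterpart for them.

```python
# Special cases taken from the two starter configurations above.
SPECIAL_KEYS = {
    "option.model_id": "HF_MODEL_ID",
    "option.tensor_parallel_degree": "TENSOR_PARALLEL_DEGREE",
}


def properties_to_env(text):
    """Translate serving.properties text into a dict of environment variables.

    Assumption: every other `option.<name>` key maps to `OPTION_<NAME>`.
    """
    env = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key in SPECIAL_KEYS:
            env[SPECIAL_KEYS[key]] = value
        elif key.startswith("option."):
            env["OPTION_" + key[len("option."):].upper()] = value
        # Keys without the `option.` prefix (for example `engine`) are skipped.
    return env


example = """
engine=Python
option.tensor_parallel_degree=max
option.model_id=<your model id>
# Adjust the following based on model size and instance type
option.max_num_tokens=50000
"""
for name, value in properties_to_env(example).items():
    print(f"{name}={value}")
```

Treat the output as a starting point and compare it against the environment variable example before deploying.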

Where to find the `max_num_tokens` number?

To simplify your building experience, we have precompiled and tested some models to help you verify the value you can use. You can follow this link to find the table of precompiled models.

Quantization Support

We support three methods of quantization when using TensorRT-LLM with LMI: SmoothQuant, AWQ, and FP8. You can enable these quantization strategies using option.quantize=<smoothquant|awq|fp8> in serving.properties, or the OPTION_QUANTIZE=<smoothquant|awq|fp8> environment variable. More details about additional (optional) quantization configurations are available in the advanced configuration table below.
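
As a minimal sketch of enabling quantization, the hypothetical helpers below (not part of LMI) add `option.quantize` to a set of serving.properties entries after checking the value against the three methods listed above. The model id in the example is a placeholder.

```python
SUPPORTED_QUANTIZATION = ("smoothquant", "awq", "fp8")


def with_quantization(properties, method):
    """Return a copy of `properties` with option.quantize set to `method`."""
    if method not in SUPPORTED_QUANTIZATION:
        raise ValueError(
            f"unsupported quantization {method!r}; "
            f"expected one of {', '.join(SUPPORTED_QUANTIZATION)}"
        )
    updated = dict(properties)
    updated["option.quantize"] = method
    return updated


def render_properties(properties):
    """Render a dict as serving.properties text, one key=value per line."""
    return "\n".join(f"{key}={value}" for key, value in properties.items())


base = {
    "engine": "Python",
    "option.mpi_mode": "true",
    "option.tensor_parallel_degree": "max",
    "option.model_id": "<your model id>",
}
print(render_properties(with_quantization(base, "awq")))
```

Rejecting unknown values early surfaces a typo before the container starts building the model.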

Advanced TensorRT-LLM Configurations

The following table lists the advanced configurations that are available with the TensorRT-LLM backend. There are two types of advanced configurations: LMI, and Pass Through. LMI configurations are processed by LMI and translated into configurations that TensorRT-LLM uses. Pass Through configurations are passed directly to the backend library. These are opaque configurations from the perspective of the model server and LMI. We recommend that you file an issue for any issues you encounter with configurations. For LMI configurations, if we determine an issue with the configuration, we will attempt to provide a workaround for the current released version, and attempt to fix the issue for the next release. For Pass Through configurations it is possible that our investigation reveals an issue with the backend library. In that situation, there is nothing LMI can do until the issue is fixed in the backend library.

Item	LMI Version	Configuration Type	Description	Example value
option.max_input_len	>= 0.25.0	LMI	Maximum input token size you expect the model to have per request. This is a compilation parameter that is set on the model for Just-in-Time compilation. If you set this value too low, the model will be unable to consume long inputs. LMI also validates this at runtime for each request.	Default: `1024`
option.max_output_len	>= 0.25.0	LMI	Now maps to max_seq_len in TRT-LLM. For backwards compatibility, it is still called max_output_len here. Maximum output token size you expect the model to have per request. This is a compilation parameter that is set on the model for Just-in-Time compilation. If you set this value too low, the model will be unable to produce tokens beyond the value you set.	Default: `512`
option.max_num_tokens	>= 0.27.0	LMI	Maximum total number of tokens the TRT-LLM engine will use. If you set this value, we internally extend the maximum input length, maximum output length, and batch size to what the model can actually support. This allows the model to handle more arbitrary traffic.	Default: `16384`
option.tokens_per_block	>= 0.25.0	Pass Through	Tokens per block to be used in the paged attention algorithm.	Default value is `128`
option.batch_scheduler_policy	>= 0.25.0	Pass Through	Scheduler policy of the TensorRT-LLM batch manager.	`max_utilization`, `guaranteed_no_evict`. Default value is `max_utilization`
option.kv_cache_free_gpu_mem_fraction	>= 0.25.0	Pass Through	Fraction of free GPU memory allocated for the KV cache. The larger the value you set, the more GPU memory the model will try to take. The more memory reserved, the larger the KV cache can be, which allows longer input+output sequences or a larger batch size.	Float number between 0 and 1. Default is `0.95`
option.max_num_sequences	>= 0.25.0	Pass Through	Maximum number of input requests processed in the batch. If you don't set this, we apply max_rolling_batch_size as its value. Generally you don't need to change it unless you want the model to be compiled with a batch size that differs from the one set by the model server.	Integer greater than 0. Default value is the batch size set while building the TensorRT engine
option.enable_trt_overlap	>= 0.25.0	Pass Through	Parameter to overlap the execution of batches of requests. It may have a negative impact on performance when the number of requests is too small. In our experiments, we saw more negative impact with this turned on than off.	`true`, `false`. Default is `false`
option.enable_kv_cache_reuse	>= 0.26.0	Pass Through	This feature is only supported for GPT-like models on TRT-LLM (as of 0.7.1) and requires compiling the model with `--use_paged_context_fmha`. It lets the model remember the most recently used input KV cache and try to reuse it in the next run. The immediate benefit is much lower first-token latency. This is typically helpful for document understanding and chat applications, which usually share the same input prefix. The TRT-LLM backend remembers the prefix tree of the input and reuses most of it for the next generation. However, this comes at the cost of extra GPU memory.	`true`, `false`. Default is `false`
option.baichuan_model_version	>= 0.26.0	Pass Through	Parameter exclusive to Baichuan LLM models to specify the version of the model. You need to specify the HF Baichuan checkpoint path. For v1_13b, use either baichuan-inc/Baichuan-13B-Chat or baichuan-inc/Baichuan-13B-Base. For v2_13b, use either baichuan-inc/Baichuan2-13B-Chat or baichuan-inc/Baichuan2-13B-Base. More Baichuan models can be found on baichuan-inc.	`v1_7b`, `v1_13b`, `v2_7b`, `v2_13b`. Default is `v1_13b`
option.chatglm_model_version	>= 0.26.0	Pass Through	Parameter exclusive to ChatGLM models to specify the exact model type. Required for ChatGLM models.	`glm`, `chatglm`, `chatglm2`, `chatglm3`, `glm4`. Default is `None`, which will let TensorRT-LLM automatically infer the model type.
option.gpt_model_version	>= 0.26.0	Pass Through	Parameter exclusive to GPT2 models to specify the exact model type. Required for GPT2 models.	`gpt2`, `santacoder`, `starcoder`. Default is `gpt2`.
option.use_fused_mlp	>= 0.26.0	Pass Through	Enables horizontal fusion in GatedMLP, which reduces layer input traffic and potentially improves performance for large Llama models (e.g. llama-2-70b). This option is only supported for the Llama model type.	`true`, `false`. Default is `false`
option.rotary_base	>= 0.26.0	Pass Through	Rotary base parameter for RoPE embedding. This is supported for the llama, internlm, and qwen model types.	`float` value. Default is `10000.0`
option.rotary_dim	>= 0.26.0	Pass Through	Rotary dimension parameter for RoPE embedding. This is supported only for the gptj model type.	`int` value. Default is `64`
option.rotary_scaling_type option.rotary_scaling_factor	>= 0.26.0	Pass Through	Rotary scaling parameters. These two options should always be set together to prevent errors. They are supported for llama, qwen, and internlm models.	The value of `rotary_scaling_type` can be either `linear` or `dynamic`. The value of `rotary_scaling_factor` can be any value larger than 1.0. Default is `None`.
option.logits_dtype	>= 0.28.0	Pass Through	Datatype of logits; only applies for the T5 model.	`fp16`, `fp32`. Default is `fp32`
option.trtllm_checkpoint_path	>= 0.28.0	Pass Through	Specifies the location where the checkpoint artifacts are placed. Setting this is not recommended.	Default is `/tmp/trtllm_{model_name}_ckpt/`
option.num_checkpoint_workers	>= 0.27.0	Pass Through	Specifies the number of workers used to perform checkpoint conversion.	Default is `tensor_parallel_degree * pipeline_parallel_degree`
option.num_engine_workers	>= 0.27.0	Pass Through	Specifies the number of workers used to perform engine build.	Default is `number of available CUDA devices`
option.load_by_shard	>= 0.28.0	Pass Through	Sharding during compilation - only for Falcon 40B model	`true`, `false`. Default is `false`
Advanced parameters: SmoothQuant
option.quantize	>= 0.26.0	Pass Through	Currently only supports `smoothquant` for Llama, Mistral, InternLM and Baichuan models with just in time compilation mode.	`smoothquant`
option.smoothquant_alpha	>= 0.26.0	Pass Through	smoothquant alpha parameter	Default value is `0.8`
option.smoothquant_per_token	>= 0.26.0	Pass Through	This is only applied when `option.quantize` is set to `smoothquant`. This enables choosing at run time a custom smoothquant scaling factor for each token. This is usually a little slower and more accurate.	`true`, `false`. Default is `false`
option.smoothquant_per_channel	>= 0.26.0	Pass Through	This is only applied when `option.quantize` is set to `smoothquant`. This enables choosing at run time a custom smoothquant scaling factor for each channel. This is usually a little slower and more accurate.	`true`, `false`. Default is `false`
option.multi_query_mode	>= 0.26.0	Pass Through	This is only needed when `option.quantize` is set to `smoothquant`. It should be set for models that support multi-query attention, e.g. llama-70b.	`true`, `false`. Default is `false`
Advanced parameters: AWQ
option.quantize	>= 0.26.0	Pass Through	Currently only supports `awq` for Llama and Mistral models with just in time compilation mode.	`awq`
option.awq_format	== 0.26.0	Pass Through	This is only applied when `option.quantize` is set to `awq`. The AWQ format to use. Currently only `int4_awq` is supported.	Default value is `int4_awq`
option.awq_calib_size	== 0.26.0	Pass Through	This is only applied when `option.quantize` is set to `awq`. Number of samples for calibration.	Default is `32`
option.q_format	>= 0.27.0	Pass Through	This is only applied when `option.quantize` is set to `awq`. The AWQ format to use. Currently only `int4_awq` is supported.	Default value is `int4_awq`
option.calib_size	>= 0.27.0	Pass Through	This is applied when `option.quantize` is set to `awq`. Number of samples for calibration.	Default is `512`
option.calib_batch_size	>= 0.28.0	Pass Through	This is applied when `option.quantize` is set to `awq`. Batch size for calibration.	Default is `32`
option.awq_block_size	>= 0.28.0	Pass Through	This is applied when `option.quantize` is set to `awq`. Block (group) size for AWQ quantization.	Default is `128`
Advanced parameters: FP8
option.quantize	>= 0.26.0	Pass Through	Currently only supports `fp8` for Llama, Mistral, Mixtral, Baichuan, Gemma, and GPT2 models with just in time compilation mode.	`fp8`
option.use_fp8_context_fmha	>= 0.28.0	Pass Through	Paged attention for fp8; should only be turned on for p5 instances	`true`, `false`. Default is `false`
option.calib_size	>= 0.27.0	Pass Through	This is applied when `option.quantize` is set to `fp8`. Number of samples for calibration.	Default is `512`
option.calib_batch_size	>= 0.28.0	Pass Through	This is applied when `option.quantize` is set to `fp8`. Batch size for calibration.	Default is `32`
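
Several of the value constraints in the table can be checked before deployment. The sketch below is a hypothetical pre-flight check, not part of LMI: it takes options as a plain dict of strings and reports violations of the ranges stated in the table (`kv_cache_free_gpu_mem_fraction` between 0 and 1, `max_num_sequences` greater than 0, the allowed `batch_scheduler_policy` values, the two rotary scaling options set together, and `rotary_scaling_factor` larger than 1.0). The table does not say whether the endpoints 0 and 1 are accepted for the memory fraction; the sketch assumes they are.

```python
def check_options(options):
    """Return a list of human-readable problems found in `options`."""
    problems = []

    fraction = options.get("option.kv_cache_free_gpu_mem_fraction")
    if fraction is not None and not 0.0 <= float(fraction) <= 1.0:
        problems.append("kv_cache_free_gpu_mem_fraction must be between 0 and 1")

    sequences = options.get("option.max_num_sequences")
    if sequences is not None and int(sequences) <= 0:
        problems.append("max_num_sequences must be an integer greater than 0")

    policy = options.get("option.batch_scheduler_policy")
    if policy is not None and policy not in ("max_utilization", "guaranteed_no_evict"):
        problems.append("batch_scheduler_policy must be max_utilization or guaranteed_no_evict")

    scaling_type = options.get("option.rotary_scaling_type")
    scaling_factor = options.get("option.rotary_scaling_factor")
    # The table says these two options should always be set together.
    if (scaling_type is None) != (scaling_factor is None):
        problems.append("rotary_scaling_type and rotary_scaling_factor must be set together")
    if scaling_type is not None and scaling_type not in ("linear", "dynamic"):
        problems.append("rotary_scaling_type must be linear or dynamic")
    if scaling_factor is not None and float(scaling_factor) <= 1.0:
        problems.append("rotary_scaling_factor must be larger than 1.0")

    return problems


candidate = {
    "option.kv_cache_free_gpu_mem_fraction": "0.95",
    "option.rotary_scaling_type": "linear",
}
for problem in check_options(candidate):
    print(problem)
```

Because Pass Through options are handed directly to the backend library, a local check like this only catches the constraints documented here; the backend may enforce others.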