TensorRT-LLM manual model compilation tutorial¶
Overview¶
With the LMI TensorRT-LLM container, you can manually compile models and quantized variants that LMI does not support for just-in-time (JIT) compilation, and then load the compiled model and run inference.
The goal of this document is for the user to be able to:
- Convert any TensorRT-LLM supported model into a format that LMI can load and run inference with.
- Upload the compiled model to S3 so it can be used at runtime.
Step by step tutorial¶
In this tutorial, we will convert the Baichuan model to the TensorRT-LLM model format on a p4d.24xlarge instance.
Step 1: Choose your instance¶
To compile the model, you need to use an instance with the same GPU architecture as the one you will use for inference deployment (e.g., a model compiled on a g5 instance can only be deployed on g5 instances).
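A quick way to confirm which GPU (and therefore which architecture) an instance has is to query nvidia-smi; the compute_cap field requires a reasonably recent driver, and the output below is only illustrative for a p4d.24xlarge:
nvidia-smi --query-gpu=name,compute_cap --format=csv
# name, compute_cap
# NVIDIA A100-SXM4-40GB, 8.0
# ... (one line per GPU)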
Step 2: Pull the docker image¶
Refer here for the latest TensorRT-LLM DLC and pull the image.
For example:
docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.26.0-tensorrtllm0.7.1-cu122
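Pulling from the DLC ECR registry requires logging in to it first. Assuming your AWS CLI credentials are already configured, something like the following works for the example image above (adjust the region and account ID if you use a different image):
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com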
You can also pull the container from DockerHub:
docker pull deepjavalibrary/djl-serving:0.26.0-tensorrt-llm
Step 3: Log in to the container and prepare the environment¶
You need to manually log in to the container to proceed with the conversion:
docker run -it --runtime=nvidia --gpus all \
--shm-size 12g \
deepjavalibrary/djl-serving:0.26.0-tensorrt-llm \
/bin/bash
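If you want the converted engines to survive after you exit the container, you may prefer to mount a host directory into it. A variant of the command above (the host path /home/ubuntu/trtllm is just an example):
docker run -it --runtime=nvidia --gpus all \
--shm-size 12g \
-v /home/ubuntu/trtllm:/workspace/trtllm \
deepjavalibrary/djl-serving:0.26.0-tensorrt-llm \
/bin/bash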
Then clone the TensorRT-LLM repository inside the container for the manual conversion. Make sure the version of TensorRT-LLM you clone matches the one installed in the container.
You can check the installed version with:
pip show tensorrt-llm
# Output
# Name: tensorrt-llm
# Version: 0.7.1
Then clone the TensorRT-LLM Triton backend for model preparation. If the installed version is 0.5.0, you need to check out the tag for that version (v0.5.0):
git clone https://github.com/triton-inference-server/tensorrtllm_backend -b v0.5.0
cd tensorrtllm_backend && rm -rf tensorrt_llm
git clone https://github.com/NVIDIA/TensorRT-LLM -b v0.5.0 tensorrt_llm
Step 4: Build TensorRT-LLM compatible model¶
The following steps are very similar to Triton Server model preparation; you can find more information in the official preparation doc. Our environment comes pre-configured with all the necessary packages, so we don't need to pip install anything.
Here we just need to:
cd tensorrt_llm/examples/baichuan
# or just tensorrt_llm/examples/<model you like>
Checking the README.md, we will find the following instructions. We just need to add a few more parameters to build a ready-to-go model:
# Build the Baichuan V1 13B model using 2-way tensor parallelism and FP16.
python build.py --model_version v1_13b \
--model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype float16 \
--use_gemm_plugin float16 \
--use_gpt_attention_plugin float16 \
--world_size 2 \
--output_dir baichuan_v1_13b/trt_engines/fp16/2-gpu/ \
--use_inflight_batching
We added the --use_inflight_batching parameter; this allows us to do continuous batching with LMI.
Generally speaking, we recommend turning on the following flags if the conversion script supports them (see the example build command after this list):
- --enable_context_fmha: speeds up inference with fused multi-head attention during the context phase
- --remove_input_padding: removes input padding to reduce memory cost
- --parallel_build: builds in parallel to speed up compilation
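For example, the Baichuan build above with these flags turned on might look like the following; flag availability differs by model and TensorRT-LLM version, so check python build.py --help in the example folder you are using:
python build.py --model_version v1_13b \
--model_dir baichuan-inc/Baichuan-13B-Chat \
--dtype float16 \
--use_gemm_plugin float16 \
--use_gpt_attention_plugin float16 \
--enable_context_fmha \
--remove_input_padding \
--parallel_build \
--world_size 2 \
--output_dir baichuan_v1_13b/trt_engines/fp16/2-gpu/ \
--use_inflight_batching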
After the conversion, you will find the model in baichuan_v1_13b/trt_engines/fp16/2-gpu/ (you can change the output directory with --output_dir). Also remember that we are using a world size of 2, which means we are sharding the model across 2 GPUs; this corresponds to the tensor_parallel_degree we specify in LMI.
Next, let's arrange the model into the expected repository layout:
# cd to tensorrtllm_backend folder level
cd ../../../
mkdir -p triton_model_repo/tensorrt_llm/
cp -r all_models/inflight_batcher_llm/tensorrt_llm/* triton_model_repo/tensorrt_llm/
# copy the converted model to the repo
cp tensorrt_llm/examples/baichuan/baichuan_v1_13b/trt_engines/fp16/2-gpu/* triton_model_repo/tensorrt_llm/1
Then, let's configure the model settings. Here we use the template-filling tool that is available in the repo:
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
"enable_trt_overlap:False,batch_scheduler_policy:max_utilization,kv_cache_free_gpu_mem_fraction:0.95,max_num_sequences:64"
This tool fills in most of the config we need. Next, we need to set the following property manually:
vi triton_model_repo/tensorrt_llm/config.pbtxt
'''
model_transaction_policy {
decoupled: True
}
'''
Set decoupled to True as shown above.
We also need to delete or change the parameters for which we did not define a value, such as:
parameters: {
key: "max_tokens_in_paged_kv_cache"
value: {
string_value: "${max_tokens_in_paged_kv_cache}"
}
}
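For a parameter you do want to keep, replace the ${...} placeholder with a concrete value. As an illustration, the exclude_input_in_output setting recommended in the table below would end up looking like this:
parameters: {
key: "exclude_input_in_output"
value: {
string_value: "true"
}
}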
Here is a table of the TRT-LLM supported parameters with our suggested settings (up to 0.6.1):
Name | Settings | Meaning |
---|---|---|
enable_trt_overlap | False | Allows TRT to overlap computation, which is meant to improve performance at higher concurrency. However, during our tests we found that setting this to False gave better performance. |
max_tokens_in_paged_kv_cache | Remove this | For PagedAttention, TensorRT-LLM will calculate this value from the batch size and the model compilation settings. |
batch_scheduler_policy | max_utilization | Maximizes GPU utilization in most cases |
kv_cache_free_gpu_mem_fraction | 0.95 | The fraction of GPU memory the model can use for KV cache management. Here we set an aggressive value of 0.95 to max out the GPU |
max_num_sequences | 64 | The maximum concurrent batch size (number of sequences) you can send as input |
max_beam_width | 1 | Use beam size 1 for generation. Beam sizes > 1 are currently not supported by LMI |
gpt_model_type | inflight_fused_batching | Uses the continuous (in-flight) batching mechanism |
gpt_model_path | /tmp | LMI will change this value at runtime to the correct model path after download. Just leave it there and do not remove it |
max_kv_cache_length | Remove this | For PagedAttention, TensorRT-LLM will calculate this value from the batch size and the model compilation settings. |
exclude_input_in_output | True | Makes the output follow LMI's generation convention by removing the input prefix from the result |
(0.6.1) You also need to remove the config block used for dynamic batching:
dynamic_batching {
preferred_batch_size: [ ${triton_max_batch_size} ]
max_queue_delay_microseconds: ${max_queue_delay_microseconds}
}
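If you prefer to do this from the command line instead of editing the file in vi, a sed one-liner along these lines can remove the block; it assumes the block's closing brace starts at the beginning of a line, so double-check the file afterwards:
# delete the dynamic_batching { ... } block in place
sed -i '/dynamic_batching {/,/^}/d' triton_model_repo/tensorrt_llm/config.pbtxt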
Step 5: Prepare HuggingFace configs and tokenizers¶
Besides the standard Triton components, we also need to add the tokenizer files from HuggingFace.
Create a file called build_tokenizer.py with the following contents:
from huggingface_hub import snapshot_download
from pathlib import Path
local_model_path = Path("./triton_model_repo/tensorrt_llm/")
model_name = "baichuan-inc/Baichuan-13B-Chat"
# Only download the config and tokenizer files (skip the model weights)
allow_patterns = ["*.json", "*.txt", "*.model"]
snapshot_download(
repo_id=model_name,
local_dir=local_model_path,
allow_patterns=allow_patterns,
local_dir_use_symlinks=False
)
Run with:
python build_tokenizer.py
It will download all necessary components to the folder.
Finally, let's see what's inside:
ls triton_model_repo/tensorrt_llm/
# 1 config.json config.pbtxt generation_config.json pytorch_model.bin.index.json requirements.txt special_tokens_map.json tokenizer.model tokenizer_config.json
ls triton_model_repo/tensorrt_llm/1/
# baichuan_float16_tp2_rank0.engine baichuan_float16_tp2_rank1.engine config.json model.cache
Step 6: Upload the compiled model to S3¶
Prepare the necessary credentials for your model and upload it:
aws s3 sync triton_model_repo/tensorrt_llm/ s3://lmi-llm/trtllm/0.5.0/baichuan-13b-tp2/baichuan-13b-chat/
Note: We always need two levels of folders here for the model to load correctly. That is, if you store the model like the following:
aws s3 sync triton_model_repo/tensorrt_llm/ s3://<some-bucket>/...<some_folders>../folder1/folder2/
The S3 URL used by LMI to load the model will be
s3://<some-bucket>/...<some_folders>../folder1/
and in our case is:
s3://lmi-llm/trtllm/0.5.0/baichuan-13b-tp2/
Check that the files are there:
aws s3 ls s3://lmi-llm/trtllm/0.5.0/baichuan-13b-tp2/baichuan-13b-chat/
PRE 1/
2023-12-19 01:58:03 733 config.json
2023-12-19 01:58:03 4425 config.pbtxt
2023-12-19 01:58:02 284 generation_config.json
2023-12-19 01:58:02 23274 pytorch_model.bin.index.json
2023-12-19 01:58:03 56 requirements.txt
2023-12-19 01:58:03 544 special_tokens_map.json
2023-12-19 01:58:03 1136765 tokenizer.model
2023-12-19 01:58:03 954 tokenizer_config.json
Load on SageMaker LMI container¶
Finally, you can use one of the following configurations to load your model on SageMaker:
1. Environment variables:¶
HF_MODEL_ID=s3://lmi-llm/trtllm/0.5.0/baichuan-13b-tp2/
OPTION_TENSOR_PARALLEL_DEGREE=2
OPTION_MAX_ROLLING_BATCH_SIZE=64
2. serving.properties:¶
engine=MPI
option.model_id=s3://lmi-llm/trtllm/0.5.0/baichuan-13b-tp2/
option.tensor_parallel_degree=2
option.max_rolling_batch_size=64
3. extracted model artifacts:¶
serving.properties:
engine=MPI
option.rolling_batch=trtllm
option.dtype=fp16
option.tensor_parallel_degree=2
The artifacts need to be in the following structure, mounted to /opt/ml/model/:
├── serving.properties
└── tensorrt_llm
├── 1
│ ├── baichuan_float16_tp2_rank0.engine
│ ├── baichuan_float16_tp2_rank1.engine
│ ├── config.json
│ └── model.cache
├── config.json
├── config.pbtxt
├── configuration_baichuan.py
├── generation_config.json
├── pytorch_model.bin.index.json
├── requirements.txt
├── special_tokens_map.json
├── tokenization_baichuan.py
├── tokenizer_config.json
└── tokenizer.model
config.pbtxt:
Make sure to update gpt_model_path to the correct path, including the parent folder name (/opt/ml/model/tensorrt_llm/1):
parameters: {
key: "gpt_model_path"
value: {
string_value: "/opt/ml/model/tensorrt_llm/1"
}
}