Model Artifacts for LMI¶
LMI Containers support deploying models with artifacts stored in either the HuggingFace Hub or AWS S3.
For models stored in the HuggingFace Hub, you will need the model_id (e.g. meta-llama/Llama-2-13b-hf).
For models stored in S3, you will need the S3 URI of the object prefix containing your model artifacts (e.g. s3://my-bucket/my-model-artifacts/).
Model artifacts must be in the HuggingFace Transformers pretrained format.
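For example, the model location is typically passed to the container through the option.model_id setting in a serving.properties file, or the equivalent OPTION_-prefixed environment variable. The snippet below is a minimal sketch with placeholder values; see the configuration section for the full set of options.
serving.properties
# Point LMI at a model on the HuggingFace Hub...
option.model_id=meta-llama/Llama-2-13b-hf
# ...or at an S3 prefix containing the same artifacts
# option.model_id=s3://my-bucket/my-model-artifacts/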
HuggingFace Transformers Pretrained Format¶
LMI Containers support loading models saved in the HuggingFace Transformers pretrained format.
This means that the model has been saved using the save_pretrained method from HuggingFace Transformers, and is loadable using the from_pretrained method.
Most open source LLMs available on the HuggingFace Hub have been saved using this format and are compatible with LMI containers.
LMI Containers only support loading HuggingFace model weights from PyTorch (pickle) or SafeTensor checkpoints.
Most open source models available on the HuggingFace Hub offer checkpoints in at least one of these formats.
In addition to the model weights, we expect that the tokenizer has been saved as part of the model artifacts and is loadable using the AutoTokenizer.from_pretrained method.
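As a quick compatibility check, the standard HuggingFace Transformers APIs named above are enough to produce (or verify) artifacts in this format. The model id and output directory below are placeholders, and safe_serialization=True is the Transformers flag that writes SafeTensor checkpoints; treat this as a sketch rather than an LMI-specific requirement.
save_model.py
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<namespace>/<model>"  # placeholder Hub id or local path

# Load with the same APIs LMI relies on
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Saving both into one directory yields the layout shown below
model.save_pretrained("./model", safe_serialization=True)  # SafeTensor shards
tokenizer.save_pretrained("./model")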
A sample of what the model and tokenizer artifacts look like is shown below:
model/
|- config.json [Required] (model configuration file with architecture details)
|- model-000X-of-000Y.safetensors (safetensor checkpoint shard - large models will have multiple checkpoint shards)
|- model.safetensors.index.json (safetensor weight mapping)
|- pytorch_model-000X-of-000Y.bin (PyTorch pickle checkpoint shard - large models will have multiple checkpoint shards)
|- pytorch_model.bin.index.json (PyTorch weight mapping)
|- tokenizer_config.json [Required] (tokenizer configuration)
|- special_tokens_map.json (special token mapping)
|- *modelling.py (custom modelling files)
|- *tokenizer.py (custom tokenizer files)
|- tokenizer.json (tokenizer model)
|- tokenizer.model (tokenizer model)
Please remember to turn on option.trust_remote_code=true or OPTION_TRUST_REMOTE_CODE=true if you have custom modelling and/or tokenizer files.
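Both forms are shown below; set whichever matches how you configure the container. The values are exactly the ones named above.
# In serving.properties:
option.trust_remote_code=true

# Or as a container environment variable:
OPTION_TRUST_REMOTE_CODE=true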
TensorRT-LLM (TRT-LLM) LMI model format¶
TRT-LLM LMI supports loading models in a custom format that includes compiled TRT-LLM engine files and Hugging Face model config files.
Users can create these artifacts for model architectures that are supported for JIT compilation by following this tutorial.
For model architectures that are not supported by TRT-LLM LMI for JIT compilation, follow this tutorial to create the model artifacts.
Users can specify the path to the resulting artifacts as option.model_id during deployment for faster loading compared to providing a raw Hugging Face model to TRT-LLM LMI.
The directory structure below is an example of TensorRT-LLM LMI model artifacts.
trt_llm_model_repo
└── tensorrt_llm
├── 1
│ ├── trt_llm_model_float16_tp2_rank0.engine # trt-llm engine
│ ├── trt_llm_model_float16_tp2_rank1.engine # trt-llm engine
│ ├── config.json # trt-llm config file
│ └── model.cache
├── config.pbtxt # trt-llm triton backend config
├── config.json # Below are HuggingFace model config files and may vary per model
├── pytorch_model.bin.index.json
├── requirements.txt
├── special_tokens_map.json
├── tokenizer_config.json
└── tokenizer.model
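For illustration, once a repository like the one above has been uploaded to S3, a serving.properties along these lines could point TRT-LLM LMI at the precompiled engines. The bucket path is a placeholder, and the tensor parallel degree should match what the engines were compiled with (tp2 in the example above); treat this as a sketch, not a complete configuration.
serving.properties
option.model_id=s3://my-bucket/trt_llm_model_repo/
option.tensor_parallel_degree=2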
Neuron Pretrained Model Formats¶
For pretrained models that will be compiled at runtime, the HuggingFace Transformers pretrained format is preferred.
Model compile time can quickly become an issue for larger models, so compiled models are accepted in the following formats.
Standard Optimum-Neuron model artifacts (2.16.0 SDK)¶
Under the same folder level, we expect:
- config.json: Stores the model architecture, structure information, and Neuron compiler configuration
- tokenizer_config.json: Stores the tokenizer config information
- modelling files (*.py): If your model has custom modelling or tokenizer files, please remember to turn on option.trust_remote_code=true or OPTION_TRUST_REMOTE_CODE=true
- checkpoint directory: Directory containing the split-weights model
  - other files that are needed for split model loading
- compiled directory: Directory containing the neff files
  - other files that are needed for model loading and inference
A sample of what the model and tokenizer artifacts look like is shown below:
model/
|- checkpoint/
|- - pytorch_model.bin/ (directory containing the split model weights)
|- - config.json (model configuration of the model before compilation)
|- compiled/
|- - *.neff (files containing the serialization of the compiled model graph)
|- config.json [Required] (model configuration file with architecture details)
|- tokenizer_config.json [Required] (tokenizer configuration)
|- special_tokens_map.json (special token mapping)
|- *modelling.py (custom modelling files)
|- *tokenizer.py (custom tokenizer files)
|- tokenizer.json (tokenizer model)
|- tokenizer.model (tokenizer model)
Split Model and Compiled Graph (2.16.0 SDK)¶
Split Model: Under the same folder level, we expect:
- config.json: Stores the model architecture, structure information, and Neuron compiler configuration
- tokenizer_config.json: Stores the tokenizer config information
- modelling files (*.py): If your model has custom modelling or tokenizer files, please remember to turn on option.trust_remote_code=true or OPTION_TRUST_REMOTE_CODE=true
- pytorch_model.bin: Directory containing the split-weights model (this is not a typo; it is a directory)
- other files that are needed for split model loading
Compiled Graph: Under the same folder level, we expect:
- The files specifying the compiled graph. These can be .neff files, or a dump of the neff cache.
A sample of what the model and tokenizer artifacts look like is shown below:
model/
|- pytorch_model.bin/ (directory containing the split model weights)
|- config.json [Required] (model configuration file with architecture details)
|- tokenizer_config.json [Required] (tokenizer configuration)
|- special_tokens_map.json (special token mapping)
|- *modelling.py (custom modelling files)
|- *tokenizer.py (custom tokenizer files)
|- tokenizer.json (tokenizer model)
|- tokenizer.model (tokenizer model)
compiled/
|- *.neff (files containing the serialization, or dumped NEFF cache, of the compiled model graph)
To use this format when loading in LMI, a few advanced configuration options are required.
The first is option.load_split_model, which indicates that the model has already been split and is ready for loading on Neuron devices.
The second is option.compiled_graph_path, which allows the user to specify either the *.neff files compiled for a serialized model, or a Neuron cache directory containing the compiled graph.
This provides a workaround for models that do not support serialization, as well as for other advanced use cases.
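For illustration, these options might be combined in serving.properties as follows, assuming the artifacts were uploaded to S3 under the layout shown above. The bucket and paths are placeholders; treat this as a sketch.
serving.properties
option.model_id=s3://my-bucket/neuron-model/model/
option.load_split_model=true
option.compiled_graph_path=s3://my-bucket/neuron-model/compiled/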
Note: Compiled model artifacts must be built with the same compiler version as the container being used; if the precompiled model's compiler version does not match the image, the model will fail to load.
Storing models in S3¶
For custom models and production use cases, we recommend that you store model artifacts in S3.
If you want to use a model available from the HuggingFace Hub, you can download the files locally with git:
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/<namespace>/<model>
Alternatively, you can use a download script to download the model from the HuggingFace Hub:
download.py
from huggingface_hub import snapshot_download
from pathlib import Path
# - This will download the model into the ./model directory relative to wherever the script is running
local_model_path = Path("./model")
local_model_path.mkdir(exist_ok=True)
model_name = "<namespace>/<model>"
# Only download config, tokenizer, and checkpoint files (the .bin weights are skipped in favor of safetensors)
allow_patterns = ["*.json", "*.safetensors", "*.pt", "*.txt", "*.model", "*.tiktoken"]
# - Leverage the snapshot library to download the model since the model is stored in repository using LFS
snapshot_download(
    repo_id=model_name,
    local_dir=local_model_path,
    local_dir_use_symlinks=False,
    allow_patterns=allow_patterns,
    token="YOUR_HF_TOKEN_VALUE",  # Optional: If you need a token to download your model
)
With the model saved locally (either downloaded from the hub, or your own pretrained/fine-tuned model), upload it to S3:
# Assuming the model artifacts are stored in a directory called model/
aws s3 cp model s3://my-model-bucket/model/ --recursive
When specifying configurations for the LMI container, you can also upload the serving.properties file to this directory. See the configuration section for more details.
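For example, a serving.properties prepared locally can be copied into the same prefix as the model files:
# Upload the optional serving.properties next to the model artifacts
aws s3 cp serving.properties s3://my-model-bucket/model/serving.properties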
Compiled models (TensorRT-LLM, Transformers NeuronX)¶
We recommend that you precompile models when using TensorRT-LLM or Transformers NeuronX in production to reduce the endpoint startup time. If HuggingFace Pretrained Model artifacts are provided to these backends, they will just-in-time (JIT) compile the model at runtime before it can be used for inference. This compilation process increases the endpoint startup time, especially as the model size grows. Please see the respective compilation guides for steps on how to compile your model for the given framework.