Large model inference¶
DJLServing can host large language models and foundation models that do not fit into a single GPU. We maintain a collection of deep learning containers (DLC) specifically designed for large model inference, and you can explore the available deep learning containers here. The AWS DLC for LMI documentation describes the libraries available in these DLCs.
LMI Container Configurations¶
Beyond the standard DJLServing configurations, large model inference involves additional settings. The LMI configuration document organizes these configurations by the engines present in our DLCs.
These configurations can be specified in two ways: through the serving.properties file, or via environment variables in the Docker environment. For a comprehensive guide on specifying these configurations, refer to the LMI environment variable instruction document.
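As an illustrative sketch (the exact keys depend on the engine and container version; consult the LMI configuration document for the authoritative list), a serving.properties file typically pairs an engine selection with `option.*` entries, and each `option.x` key generally has an uppercase `OPTION_X` environment-variable equivalent:

```properties
# serving.properties — illustrative example; verify keys against the
# LMI configuration document for your container version.
engine=Python
option.model_id=TheBloke/Llama-2-7B-fp16
option.tensor_parallel_degree=4
```

The same settings could instead be supplied to the container as environment variables (e.g. `OPTION_MODEL_ID`, `OPTION_TENSOR_PARALLEL_DEGREE`), which is convenient when you cannot bake a properties file into the model artifact.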
Depending on your model architecture, model size, and the instance type in use, you may need to adjust certain configurations to optimize instance resource utilization and avoid Out of Memory (OOM) errors. The documents below provide recommended configurations for popular models, tailored to the specific library you are using, such as DeepSpeed, TensorRT-LLM, Transformers-NeuronX, or LMI Dist.
- LMI Dist tuning guide
- TensorRT-LLM tuning guide
- Transformers-NeuronX tuning guide
- DeepSpeed tuning guide
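To make the OOM trade-off concrete, a hypothetical tuning sketch is shown below: the parameter names are illustrative (the tuning guides above list the exact keys per engine), but the pattern is common — spread the model across more GPUs with a higher tensor-parallel degree, and cap concurrent requests to bound activation memory:

```properties
# Illustrative tuning sketch only — parameter names and values are
# assumptions; consult the engine-specific tuning guide for real keys.

# Shard model weights across 4 GPUs to fit a model too large for one device.
option.tensor_parallel_degree=4

# Limit the number of concurrently batched requests; lowering this reduces
# peak activation/KV-cache memory at the cost of throughput.
option.max_rolling_batch_size=8
```

If you still hit OOM at these settings, the usual levers are a larger instance type, a higher tensor-parallel degree, a smaller batch size, or a quantized variant of the model.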