DeepSpeed Engine User Guide

Model Artifacts Structure

DeepSpeed expects the model to be in the standard HuggingFace format.
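
For reference, a standard HuggingFace model directory typically contains files like the following (exact file names vary by model and serialization format):

model/
├── config.json
├── generation_config.json
├── tokenizer.json
├── tokenizer_config.json
├── special_tokens_map.json
└── model-00001-of-00002.safetensors (or pytorch_model.bin)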

Supported Model Architectures

The LMI 0.27.0 container currently ships DeepSpeed 0.12.6. DeepSpeed supports two modes for launching models: Kernel Injection and Auto Tensor-Parallelism. Architectures that support Kernel Injection have been optimized for inference through custom CUDA kernel implementations. Auto Tensor-Parallelism is used to host models across GPUs when an architecture does not have custom CUDA kernel support. Both sets of supported architectures are defined in the DeepSpeed source.

When using LMI's integration with DeepSpeed, we will automatically apply kernel injection if the model architecture is supported.
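
To illustrate the difference between the two modes, below is a minimal sketch of how they map onto DeepSpeed's init_inference API. The model id and tensor-parallel size are placeholder assumptions, and LMI performs this wiring for you automatically:

import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Placeholder model; substitute any architecture from the lists below.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Kernel Injection: DeepSpeed replaces supported modules with custom CUDA kernels.
# Passing replace_with_kernel_inject=False instead uses Auto Tensor-Parallelism,
# which shards unmodified modules across GPUs.
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 2},  # placeholder: shard across 2 GPUs
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)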

The following model architectures have been carefully tested with LMI's DeepSpeed integration:

  • GPT2
  • Bloom
  • GPT-J
  • GPT-Neo
  • GPT-Neo-X
  • OPT
  • Stable Diffusion
  • Llama

Complete Model Set

Kernel Injection:

  • GPT2
  • Bert
  • Bloom
  • GPT-J
  • GPT-Neo
  • GPT-Neo-X
  • OPT
  • Megatron
  • DistilBert
  • Stable Diffusion
  • Llama
  • Llama2
  • InternLM

Auto Tensor-Parallelism:

  • albert
  • baichuan
  • bert
  • bigbird_pegasus
  • bloom
  • camembert
  • codegen
  • codellama
  • deberta_v2
  • electra
  • ernie
  • esm
  • falcon
  • glm
  • gpt-j
  • gpt-neo
  • gpt-neox
  • longt5
  • luke
  • llama
  • llama2
  • m2m_100
  • marian
  • mistral
  • mpt
  • mvp
  • nezha
  • openai
  • opt
  • pegasus
  • perceiver
  • plbart
  • qwen
  • reformer
  • roberta
  • roformer
  • splinter
  • starcoder
  • t5
  • xglm
  • xlm_roberta
  • yoso

Quick Start Configurations

You can leverage DeepSpeed with LMI using the following starter configurations:

serving.properties

engine=MPI
option.entryPoint=djl_python.deepspeed
option.tensor_parallel_degree=max
option.rolling_batch=deepspeed
option.model_id=<your model id>
# Adjust the following based on model size and instance type
option.max_rolling_batch_size=64

You can follow this example to deploy a model with serving.properties configuration on SageMaker.

environment variables

HF_MODEL_ID=<your model id>
OPTION_ENTRYPOINT=djl_python.deepspeed
TENSOR_PARALLEL_DEGREE=max
OPTION_ROLLING_BATCH=deepspeed
# Adjust the following based on model size and instance type
OPTION_MAX_ROLLING_BATCH_SIZE=64

You can follow this example to deploy a model with environment variable configuration on SageMaker.
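
As an illustrative sketch of what the linked example covers, deploying this environment-variable configuration through the SageMaker Python SDK might look like the following. The image URI, role, and instance type are placeholder assumptions:

import sagemaker
from sagemaker.model import Model

# Placeholder: use the LMI container image URI for your region and version.
image_uri = "<lmi container image uri>"

model = Model(
    image_uri=image_uri,
    role=sagemaker.get_execution_role(),
    env={
        "HF_MODEL_ID": "<your model id>",
        "OPTION_ENTRYPOINT": "djl_python.deepspeed",
        "TENSOR_PARALLEL_DEGREE": "max",
        "OPTION_ROLLING_BATCH": "deepspeed",
        "OPTION_MAX_ROLLING_BATCH_SIZE": "64",
    },
)

# Deploy to a multi-GPU instance sized for your model.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",  # placeholder instance type
)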

Quantization Support

We support two methods of runtime quantization when using DeepSpeed with LMI: SmoothQuant and Dynamic Int8.

If you are enabling quantization, we recommend SmoothQuant for the GPT2, GPT-J, GPT-Neo, GPT-Neo-X, Bloom, and Llama model architectures. You can enable SmoothQuant using option.quantize=smoothquant in serving.properties, or the OPTION_QUANTIZE=smoothquant environment variable. You can optionally specify option.smoothquant_alpha in serving.properties, or the OPTION_SMOOTHQUANT_ALPHA environment variable, to control the quantization behavior. We recommend that you do not provide this value and let LMI determine it.
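
For example, building on the quick start configuration above, a serving.properties that enables SmoothQuant might look like this (the model id is a placeholder):

engine=MPI
option.entryPoint=djl_python.deepspeed
option.tensor_parallel_degree=max
option.rolling_batch=deepspeed
option.model_id=<your model id>
option.quantize=smoothquant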

For models that SmoothQuant does not support, you can enable Dynamic Int8 quantization using option.quantize=dynamic_int8 in serving.properties, or the OPTION_QUANTIZE=dynamic_int8 environment variable. The Dynamic Int8 method uses DeepSpeed's Mixture-of-Quantization algorithm.

Advanced DeepSpeed Configurations

The following table lists the advanced configurations that are available with the DeepSpeed backend. There are two types of advanced configurations: LMI and Pass Through. LMI configurations are processed by LMI and translated into configurations that DeepSpeed uses. Pass Through configurations are passed directly to the backend library; they are opaque to the model server and LMI. We recommend that you file an issue for any problems you encounter with these configurations. For LMI configurations, if we determine there is an issue with the configuration, we will attempt to provide a workaround for the currently released version and fix the issue for the next release. For Pass Through configurations, our investigation may reveal an issue with the backend library itself; in that situation, there is nothing LMI can do until the issue is fixed in the backend library.

| Item | LMI Version | Configuration Type | Description | Example value |
|------|-------------|--------------------|-------------|---------------|
| option.task | >= 0.25.0 | LMI | The Hugging Face task used to select the pipeline. Default is text-generation. | text-generation |
| option.quantize | >= 0.25.0 | LMI | Quantize the model using one of the quantization methods supported in DeepSpeed. SmoothQuant typically provides better quality. | dynamic_int8, smoothquant |
| option.max_tokens | >= 0.25.0 | LMI | Total number of tokens (input and output) with which DeepSpeed can work. The number of output tokens is the difference between the total number of tokens and the number of input tokens. Defaults to 1024; for long sequence generation, set this to a higher value (2048, 4096, ...). | 1024 |
| option.low_cpu_mem_usage | >= 0.25.0 | Pass Through | Reduce CPU memory usage when loading models. We recommend setting this to true. | Default: true |
| option.enable_cuda_graph | >= 0.25.0 | Pass Through | Capture the CUDA graph of the forward pass to accelerate inference. | Default: false |
| option.triangular_masking | >= 0.25.0 | Pass Through | Whether to use triangular masking for the attention mask. This is application or model specific. | Default: true |
| option.return_tuple | >= 0.25.0 | Pass Through | Whether transformer layers return a tuple or a tensor. | Default: true |
| option.training_mp_size | >= 0.25.0 | Pass Through | If the model was trained with DeepSpeed, the tensor parallelism degree used during training. This can differ from the tensor parallel degree used for inference. | Default: 1 |
| option.checkpoint | >= 0.25.0 | Pass Through | Path to a DeepSpeed-compatible checkpoint file. | ds_inference_checkpoint.json |
| option.smoothquant_alpha | >= 0.25.0 | LMI | If smoothquant is set in option.quantize, you can optionally provide this alpha value. If not provided, DeepSpeed will choose one for you. | Any float between 0 and 1 |
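
As an illustration, a serving.properties that combines several of the advanced configurations above might look like the following. The specific values are placeholder assumptions to tune for your model and instance type:

engine=MPI
option.entryPoint=djl_python.deepspeed
option.model_id=<your model id>
option.tensor_parallel_degree=4
option.task=text-generation
option.max_tokens=2048
option.low_cpu_mem_usage=true
option.enable_cuda_graph=false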