HuggingFace Accelerate User Guide

The HuggingFace Accelerate backend is only recommended when the model you are deploying is not supported by the other backends. It is typically less performant than the other available options. You should confirm that your model is not supported by other backends before using HuggingFace Accelerate.

Model Artifact Structure

HuggingFace Accelerate expects the model to be in the standard HuggingFace format.
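For reference, a typical model directory in this format (local, in S3, or on the HuggingFace Hub) contains the model configuration, tokenizer files, and weights; the exact file names vary by model and the listing below is only illustrative:

config.json
generation_config.json
tokenizer.json
tokenizer_config.json
special_tokens_map.json
model.safetensors (or pytorch_model.bin, possibly sharded across multiple files)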

Supported Model Architectures

LMI's HuggingFace Accelerate backend supports most models that are supported by HuggingFace Transformers.

For text-generation models (i.e. *ForCausalLM, *LMHeadModel, and *ForConditionalGeneration architectures), LMI provides continuous batching support. This significantly increases throughput compared to the base Transformers and Accelerate library implementations, and is the recommended operating mode for such models.

For non-text-generation models, LMI creates the model via the Transformers pipeline API. These models are not compatible with continuous batching and have not been extensively tested with LMI.

The supported tasks in LMI, and their corresponding model architectures, are:

  • text-generation (*ForCausalLM, *LMHeadModel)
  • text2text-generation (*ForConditionalGeneration)
  • table-question-answering (*TapasForQuestionAnswering)
  • question-answering (*ForQuestionAnswering)
  • token-classification (*ForTokenClassification)
  • text-classification (*ForSequenceClassification)
  • multiple-choice (*ForMultipleChoice)
  • fill-mask (*ForMaskedLM)

Quick Start Configurations

You can leverage HuggingFace Accelerate with LMI using the following configurations:

serving.properties

engine=Python
# use "scheduler" if deploying a text-generation model, and "disable" for other tasks (can also the config omit entirely)
option.rolling_batch=scheduler
option.model_id=<your model id>
# use "max" to partition the model across all GPUs. This is naive sharding, where the model is sharded vertically (as opposed to horizontally with tensor parallelism)
option.tensor_parallel_degree=max

You can follow this example to deploy a model with a serving.properties configuration on SageMaker.
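As a minimal sketch of such a deployment using the SageMaker Python SDK: the image URI, S3 location, instance type, and prompt below are placeholders, and the model.tar.gz is assumed to contain the serving.properties shown above along with any required model files.

import sagemaker
from sagemaker.model import Model
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

role = sagemaker.get_execution_role()                    # IAM role with SageMaker permissions
image_uri = "<LMI container image URI for your region>"  # choose from the published LMI container images
model_data = "s3://my-bucket/my-model/model.tar.gz"      # archive containing serving.properties

model = Model(image_uri=image_uri, model_data=model_data, role=role)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

# Text-generation payload: an "inputs" string plus optional generation "parameters"
print(predictor.predict({"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 64}}))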

environment variables

HF_MODEL_ID=<your model id>
TENSOR_PARALLEL_DEGREE=max
OPTION_ROLLING_BATCH=scheduler

You can follow this example to deploy a model with environment variable configuration on SageMaker.
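A corresponding sketch with the SageMaker Python SDK: no model artifact upload is needed because the container downloads the model specified by HF_MODEL_ID, and the image URI, model id, and instance type are placeholders.

import sagemaker
from sagemaker.model import Model

role = sagemaker.get_execution_role()
image_uri = "<LMI container image URI for your region>"

# Configuration is passed entirely through environment variables; no model_data is required
model = Model(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "<your model id>",
        "TENSOR_PARALLEL_DEGREE": "max",
        "OPTION_ROLLING_BATCH": "scheduler",
    },
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")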

Quantization Support

The HuggingFace Accelerate backend supports quantization via bitsandbytes. Both 8-bit and 4-bit quantization are supported.

You can enable 8-bit quantization via option.quantize=bitsandbytes8 in serving.properties, or OPTION_QUANTIZE=bitsandbytes8 environment variable. You can enable 4-bit quantization via option.quantize=bitsandbytes4 in serving.properties, or OPTION_QUANTIZE=bitsandbytes4 environment variable.
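For example, a serving.properties that combines the quick start configuration above with 4-bit quantization might look like the following (the model id is a placeholder):

engine=Python
option.model_id=<your model id>
option.rolling_batch=scheduler
option.tensor_parallel_degree=max
option.quantize=bitsandbytes4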

Advanced HuggingFace Accelerate Configurations

The following table lists the advanced configurations that are available with the HuggingFace Accelerate backend. There are two types of advanced configurations: LMI, and Pass Through. LMI configurations are processed by LMI and translated into configurations that Accelerate uses. Pass Through configurations are passed directly to the backend library; these are opaque configurations from the perspective of the model server and LMI. We recommend that you file an issue for any problems you encounter with these configurations. For LMI configurations, if we determine that the configuration is at fault, we will attempt to provide a workaround for the currently released version and fix the issue in the next release. For Pass Through configurations, our investigation may reveal an issue with the backend library itself. In that situation, there is nothing LMI can do until the issue is fixed in the backend library.

| Item | LMI Version | Configuration Type | Description | Example value |
| --- | --- | --- | --- | --- |
| option.task | >= 0.25.0 | LMI | The task used in Hugging Face for different pipelines. Default is text-generation. | text-generation |
| option.quantize | >= 0.25.0 | Pass Through | Specify this option to quantize your model using the supported quantization methods. | bitsandbytes4, bitsandbytes8 |
| option.low_cpu_mem_usage | >= 0.25.0 | Pass Through | Reduce CPU memory usage when loading models. We recommend that you set this to true. | Default: true |
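As an illustration of these options, a sketch of a serving.properties for a non-text-generation model (here a fill-mask model, with the model id as a placeholder) that sets the task explicitly, disables continuous batching, and reduces CPU memory usage during loading:

engine=Python
option.model_id=<your model id>
option.task=fill-mask
option.rolling_batch=disable
option.low_cpu_mem_usage=true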