vLLM Engine User Guide¶
Model Artifacts Structure¶
vLLM expects the model artifacts to be in the standard HuggingFace format.
Supported Model architecture¶
Text Generation Models
Here is the list of text generation models supported in vLLM 0.6.3.post1.
Multi Modal Models
Here is the list of multi-modal models supported in vLLM 0.6.3.post1.
Model Coverage in CI¶
The following models are tested in our nightly tests:
- GPT NeoX 20b
- Mistral 7b
- Phi2
- Starcoder2 7b
- Gemma 2b
- Llama2 7b
- Qwen2 7b - fp8
- Llama3 8b
- Falcon 11b
- Llava-v1.6-mistral (multi modal)
- Phi3v (multi modal)
- Pixtral 12b (multi modal)
- Llama3.2 11b (multi modal)
Quantization Support¶
The quantization techniques supported in vLLM 0.6.3.post1 are listed here.
We recommend that, regardless of which quantization technique you use, you pre-quantize the model. Runtime quantization adds overhead to endpoint startup time, and depending on the quantization technique this overhead can be significant. If you are using a pre-quantized model, you should not set any quantization-specific configurations: vLLM will deduce the quantization from the model config and apply optimizations at runtime. If you explicitly set the quantization configuration for a pre-quantized model, it limits the optimizations that vLLM can apply.
Runtime Quantization¶
The following quantization techniques are supported for runtime quantization:
- fp8
- bitsandbytes
You can leverage these techniques by specifying option.quantize=<fp8|bitsandbytes> in serving.properties, or the OPTION_QUANTIZE=<fp8|bitsandbytes> environment variable.
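For example, a minimal serving.properties that enables runtime fp8 quantization might look like the following sketch (the model id is a placeholder, and the batch size should be tuned to your model and instance):
engine=Python
option.model_id=<your model>
option.rolling_batch=vllm
option.tensor_parallel_degree=max
option.quantize=fp8
option.max_rolling_batch_size=64
The same can be achieved with the OPTION_QUANTIZE=fp8 environment variable.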
Other quantization techniques supported by vLLM require ahead-of-time quantization in order to be served with LMI. You can find details on how to leverage those quantization techniques in the vLLM docs here.
Ahead of time (AOT) quantization¶
If you bring a pre-quantized model to LMI, you should not set the option.quantize configuration.
The engine will directly parse the quantization configuration from the model and load it for inference.
This is especially important if you are using a technique that has a Marlin variant, such as GPTQ, AWQ, or FP8.
The engine determines at runtime whether it can use the Marlin kernels, and uses them if the hardware supports them.
For example, suppose you are deploying an AWQ quantized model on a g6.12xlarge instance (whose GPUs support Marlin).
If you explicitly specify option.quantize=awq, the engine will not apply Marlin because it has been explicitly instructed to use only AWQ.
If you omit the option.quantize configuration, the engine will determine that it can use Marlin and leverage it for optimized performance.
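As an illustrative sketch, a pre-quantized AWQ model could be deployed with a configuration like the one below (the model id is a placeholder); note that option.quantize is intentionally omitted so that the engine is free to select the Marlin kernels when the hardware supports them:
engine=Python
option.model_id=<your pre-quantized awq model>
option.rolling_batch=vllm
option.tensor_parallel_degree=max
option.max_rolling_batch_size=64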
Quick Start Configurations¶
You can leverage vLLM with LMI using the following starter configurations:
serving.properties¶
engine=Python
option.tensor_parallel_degree=max
option.model_id=<your model>
option.rolling_batch=vllm
# Adjust the following based on model size and instance type
option.max_rolling_batch_size=64
You can follow this example to deploy a model with serving.properties configuration on SageMaker.
environment variables¶
HF_MODEL_ID=<your model>
TENSOR_PARALLEL_DEGREE=max
OPTION_ROLLING_BATCH=vllm
# Adjust the following based on model size and instance type
OPTION_MAX_ROLLING_BATCH_SIZE=64
You can follow this example to deploy a model with environment variable configuration on SageMaker.
LoRA Adapter Support¶
vLLM has support for LoRA adapters using the adapters API.
To use adapters, you must first enable them by setting option.enable_lora=true.
Following that, you can configure the LoRA support through these additional settings:
- option.max_loras
- option.max_lora_rank
- option.lora_extra_vocab_size
- option.max_cpu_loras
- If you run into OOM after enabling adapter support, reduce option.gpu_memory_utilization.
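As an example, a serving.properties that enables LoRA adapter support might look like the following sketch (the values shown are illustrative and should be tuned to your adapters and instance):
engine=Python
option.model_id=<your model>
option.rolling_batch=vllm
option.tensor_parallel_degree=max
option.enable_lora=true
option.max_loras=4
option.max_lora_rank=64
option.max_cpu_loras=8
# Illustrative value: lower this if you hit OOM after enabling adapters
option.gpu_memory_utilization=0.85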
Advanced vLLM Configurations¶
The following table lists the advanced configurations that are available with the vLLM backend.
There are two types of advanced configurations: LMI, and Pass Through.
LMI configurations are processed by LMI and translated into configurations that vLLM uses.
Pass Through configurations are passed directly to the backend library. These are opaque configurations from the perspective of the model server and LMI.
We recommend that you file an issue for any problem you encounter with configurations.
For LMI configurations, if we determine there is an issue with the configuration, we will attempt to provide a workaround for the currently released version, and attempt to fix the issue for the next release.
For Pass Through configurations, it is possible that our investigation reveals an issue with the backend library.
In that situation, there is nothing LMI can do until the issue is fixed in the backend library.
The following table lists the set of LMI configurations. In some situations, the equivalent vLLM configuration can be used interchangeably. Those situations will be called out specifically.
| Item | LMI Version | vLLM alias | Example Value | Default Value | Description |
|---|---|---|---|---|---|
| option.quantize | >= 0.23.0 | option.quantization | awq | None | The quantization algorithm to use. See "Quantization Support" for more details. |
| option.max_rolling_batch_prefill_tokens | >= 0.24.0 | option.max_num_batched_tokens | 32768 | None | Maximum number of tokens that the engine can process in a single batch iteration (includes prefill and decode). |
| option.cpu_offload_gb_per_gpu | >= 0.29.0 | option.cpu_offload_gb | 4 (GiB) | 0 | The space in GiB to offload to CPU, per GPU. Default is 0, which means no offloading. |
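For example, the prefill token limit from the table above can be set either with the LMI name or with its vLLM alias; a sketch assuming the two are interchangeable for this setting:
option.max_rolling_batch_prefill_tokens=32768
# or, equivalently, via the vLLM alias
option.max_num_batched_tokens=32768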
In addition to the configurations specified in the table above, LMI supports all additional vLLM EngineArguments in Pass-Through mode. Pass-Through configurations are not processed or validated by LMI. You can find the set of EngineArguments supported by vLLM here.
You can specify these pass-through configurations in the serving.properties file by prefixing the configuration with option.<config>, or as environment variables by prefixing the configuration with OPTION_<CONFIG>.
We will consider two examples: a boolean configuration, and a string configuration.
Boolean Configuration
If you want to set the Engine Argument enable_prefix_caching, you can do:
- option.enable_prefix_caching=true in serving.properties
- OPTION_ENABLE_PREFIX_CACHING=true as an environment variable
String Configuration
If you want to set the Engine Argument tokenizer_mode, you can do:
- option.tokenizer_mode=mistral in serving.properties
- OPTION_TOKENIZER_MODE=mistral as an environment variable
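Putting it together, a serving.properties that combines the quick start configuration with a few pass-through EngineArguments might look like the following sketch (the model id is a placeholder; option.enable_prefix_caching and option.max_model_len are pass-through vLLM arguments shown with illustrative values):
engine=Python
option.model_id=<your model>
option.rolling_batch=vllm
option.tensor_parallel_degree=max
option.max_rolling_batch_size=64
# Pass-through vLLM EngineArguments (illustrative values)
option.enable_prefix_caching=true
option.max_model_len=8192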