# All DJL configuration options
DJL Serving is highly configurable. This document captures the available configuration options in a single place.

Note: For tunable parameters for Large Language Models, please refer to this guide.
## DJL settings
DJLServing is built on top of the Deep Java Library (DJL). Here is a list of settings for DJL:
Key | Type | Description |
---|---|---|
DJL_DEFAULT_ENGINE | env var/system prop | The preferred engine for DJL if there are multiple engines, default: MXNet |
ai.djl.default_engine | system prop | The preferred engine for DJL if there are multiple engines, default: MXNet |
DJL_CACHE_DIR | env var/system prop | The cache directory for DJL, default: $HOME/.djl.ai/ |
ENGINE_CACHE_DIR | env var/system prop | The cache directory for engine native libraries, default: $DJL_CACHE_DIR |
ai.djl.dataiterator.autoclose | system prop | Automatically close data set iterator, default: true |
ai.djl.repository.zoo.location | system prop | global model zoo search locations, not recommended |
offline | system prop | Don't access network for downloading engine's native library and model zoo metadata |
collect-memory | system prop | Enable memory metric collection, default: false |
disableProgressBar | system prop | Disable progress bar, default: false |
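Depending on the Type column above, these settings are supplied as environment variables or as JVM system properties (which can be passed via `SERVING_OPTS`, described under Global Model Server settings). A minimal sketch with illustrative values:

```
# environment variable form (illustrative values)
export DJL_DEFAULT_ENGINE=PyTorch
export DJL_CACHE_DIR=/opt/djl/cache

# system property form, passed to the JVM at startup
export SERVING_OPTS="-Dai.djl.default_engine=PyTorch -Doffline=true"
```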
### PyTorch
Key | Type | Description |
---|---|---|
PYTORCH_LIBRARY_PATH | env var/system prop | User provided custom PyTorch native library |
PYTORCH_VERSION | env var/system prop | PyTorch version to load |
PYTORCH_EXTRA_LIBRARY_PATH | env var/system prop | Custom pytorch library to load (e.g. torchneuron/torchvision/torchtext) |
PYTORCH_PRECXX11 | env var/system prop | Load precxx11 libtorch |
PYTORCH_FLAVOR | env var/system prop | To force override auto detection (e.g. cpu/cpu-precxx11/cu102/cu116-precxx11) |
PYTORCH_JIT_LOG_LEVEL | env var | Enable JIT logging |
ai.djl.pytorch.native_helper | system prop | A user provided custom loader class to help locate pytorch native resources |
ai.djl.pytorch.num_threads | system prop | Override OMP_NUM_THREAD environment variable |
ai.djl.pytorch.num_interop_threads | system prop | Set PyTorch interop threads |
ai.djl.pytorch.graph_optimizer | system prop | Enable/Disable JIT execution optimize, default: true. See: https://github.com/deepjavalibrary/djl/blob/master/docs/development/inference_performance_optimization.md#graph-optimizer |
ai.djl.pytorch.cudnn_benchmark | system prop | To speed up ConvNN related model loading, default: false |
ai.djl.pytorch.use_mkldnn | system prop | Enable MKLDNN, default: false, not recommended, use at your own risk |
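For example, to force the CPU precxx11 flavor of libtorch and lower PyTorch's thread counts (values are illustrative; the system properties are typically passed through `SERVING_OPTS`):

```
# force the native library flavor instead of relying on auto detection
export PYTORCH_FLAVOR=cpu-precxx11
# override PyTorch compute and interop thread counts (illustrative values)
export SERVING_OPTS="-Dai.djl.pytorch.num_threads=2 -Dai.djl.pytorch.num_interop_threads=2"
```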
### TensorFlow
Key | Type | Description |
---|---|---|
TENSORFLOW_LIBRARY_PATH | env var/system prop | User provided custom TensorFlow native library |
TENSORRT_EXTRA_LIBRARY_PATH | env var/system prop | Extra TensorFlow custom operators library to load |
TF_CPP_MIN_LOG_LEVEL | env var | TensorFlow log level |
ai.djl.tensorflow.debug | env var | Enable devicePlacement logging, default: false |
### MXNet
Key | Type | Description |
---|---|---|
MXNET_LIBRARY_PATH | env var/system prop | User provided custom MXNet native library |
MXNET_VERSION | env var/system prop | The version of custom MXNet build |
MXNET_EXTRA_LIBRARY_PATH | env var/system prop | Load extra MXNet custom libraries, e.g. Elastic Inference |
MXNET_EXTRA_LIBRARY_VERBOSE | env var/system prop | Set verbosity for MXNet custom library |
ai.djl.mxnet.static_alloc | system prop | CachedOp options, default: true |
ai.djl.mxnet.static_shape | system prop | CachedOp options, default: true |
ai.djl.use_local_parameter_server | system prop | Use java parameter server instead of MXNet native implementation, default: false |
### PaddlePaddle
Key | Type | Description |
---|---|---|
PADDLE_LIBRARY_PATH | env var/system prop | User provided custom PaddlePaddle native library |
ai.djl.paddlepaddle.disable_alternative | system prop | Disable alternative engine |
### Huggingface tokenizers
Key | Type | Description |
---|---|---|
TOKENIZERS_CACHE | env var | User provided custom Huggingface tokenizer native library |
### Python
Key | Type | Description |
---|---|---|
PYTHON_EXECUTABLE | env var | The location of the python executable, default: python |
DJL_ENTRY_POINT | env var | The entrypoint python file or module, default: model.py |
MODEL_LOADING_TIMEOUT | env var | Python worker model loading timeout, default: 240 seconds |
PREDICT_TIMEOUT | env var | Python predict call timeout, default: 120 seconds |
DJL_VENV_DIR | env var/system prop | The venv directory, default: $DJL_CACHE_DIR/venv |
ai.djl.python.disable_alternative | system prop | Disable alternative engine |
TENSOR_PARALLEL_DEGREE | env var | Set tensor parallel degree. For mpi mode, the default is number of accelerators. Use "max" for non-mpi mode to use all GPUs for tensor parallel. |
DJLServing provides a few aliases for the Python engine to make common LLM configurations easier:

- `engine=DeepSpeed` is equivalent to `engine=Python` with `option.mpi_mode=true` and `option.entryPoint=djl_python.deepspeed`
- `engine=FasterTransformer` is equivalent to `engine=Python` with `option.mpi_mode=true` and `option.entryPoint=djl_python.fastertransformer`
- `engine=MPI` is equivalent to `engine=Python` with `option.mpi_mode=true` and `option.entryPoint=djl_python.huggingface`
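For example, a `serving.properties` that relies on the `engine=MPI` alias might look like this (the model id and tensor parallel degree are illustrative placeholders):

```
# expands to engine=Python with option.mpi_mode=true and
# option.entryPoint=djl_python.huggingface (see the aliases above)
engine=MPI
# illustrative model and tensor parallel settings
option.model_id=gpt2
option.tensor_parallel_degree=2
```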
## Global Model Server settings
Global settings are configured at the model server level. Changes to these settings usually require a model server restart to take effect.

Most of the model server specific configuration can be set in the `conf/config.properties` file. You can find the configuration keys in ConfigManager.java.

Each configuration key can also be overridden by an environment variable with the `SERVING_` prefix, for example:
export SERVING_JOB_QUEUE_SIZE=1000 # This will override JOB_QUEUE_SIZE in the config
Key | Type | Description |
---|---|---|
MODEL_SERVER_HOME | env var | DJLServing home directory, default: Installation directory (e.g. /usr/local/Cellar/djl-serving/0.19.0/) |
DEFAULT_JVM_OPTS | env var | default: -Dlog4j.configurationFile=${APP_HOME}/conf/log4j2.xml. Overrides the default JVM startup options and system properties. |
JAVA_OPTS | env var | default: -Xms1g -Xmx1g -XX:+ExitOnOutOfMemoryError. Adds extra JVM options. |
SERVING_OPTS | env var | default: N/A. Adds serving related JVM options. Some DJL configuration can only be set via JVM system properties; use this environment variable to pass them, for example: -Dai.djl.pytorch.num_interop_threads=2 overrides the interop threads for PyTorch, -Dai.djl.pytorch.num_threads=2 overrides OMP_NUM_THREADS for PyTorch, -Dai.djl.logging.level=debug changes the DJL logging level |
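For example, a typical way to combine these at server startup (illustrative values only):

```
# raise the JVM heap limits while keeping the exit-on-OOM behavior
export JAVA_OPTS="-Xms2g -Xmx8g -XX:+ExitOnOutOfMemoryError"
# pass DJL settings that are only available as JVM system properties
export SERVING_OPTS="-Dai.djl.pytorch.num_interop_threads=2 -Dai.djl.pytorch.num_threads=2"
```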
## Model specific settings
You can set per-model settings by adding a `serving.properties` file in the root folder of your model directory (or .zip file).

Some of the options can be overridden by an environment variable with the `OPTION_` prefix, for example:
# to enable rolling batch with only environment variable:
export OPTION_ROLLING_BATCH=auto
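Setting `OPTION_ROLLING_BATCH=auto` in the environment generally corresponds to the lower-cased `option.` key in `serving.properties` (assumed mapping, shown here only for illustration):

```
# serving.properties equivalent of OPTION_ROLLING_BATCH=auto (assumed mapping)
option.rolling_batch=auto
```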
You can set the number of workers for each model (see this example): https://github.com/deepjavalibrary/djl-serving/blob/master/serving/src/test/resources/identity/serving.properties#L4-L8

For example, set the minimum and maximum number of workers for your model:
minWorkers=32
maxWorkers=64
Or you can configure minimum workers and maximum workers differently for GPU and CPU:
gpu.minWorkers=2
gpu.maxWorkers=3
cpu.minWorkers=2
cpu.maxWorkers=4
Job queue size, batch size, max batch delay, and max worker idle time can be configured per model; these settings override the global settings:
job_queue_size=10
batch_size=2
max_batch_delay=1
max_idle_time=120
You can configure which devices to load the model on; the default is `*`:
load_on_devices=gpu4;gpu5
# or simply:
load_on_devices=4;5
### Python (DeepSpeed)
For the Python (DeepSpeed) engine, DJL loads multiple workers sequentially by default to avoid running out of memory. You can reduce model loading time by loading workers in parallel if you know the peak memory usage won't cause an out-of-memory error:
# Allows to load DeepSpeed workers in parallel
option.parallel_loading=true
# specify tensor parallel degree (number of partitions)
option.tensor_parallel_degree=2
# specify per model timeout
option.model_loading_timeout=600
option.predict_timeout=240
# mark the model as failed after the python process crashes 10 times
retry_threshold=0
# enable virtual environment
option.enable_venv=true
# use built-in DeepSpeed handler
option.entryPoint=djl_python.deepspeed
# passing extra options to model.py or built-in handler
option.model_id=gpt2
option.data_type=fp32
option.max_new_tokens=50
# defines custom environment variables
env=LARGE_TENSOR=1
# specify the path to the python executable
option.pythonExecutable=/usr/bin/python3
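Putting several of the options above together, a minimal `serving.properties` for a DeepSpeed-served model might look like this (values are illustrative, not recommendations):

```
engine=DeepSpeed
option.model_id=gpt2
option.tensor_parallel_degree=2
option.parallel_loading=true
option.model_loading_timeout=600
option.predict_timeout=240
```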
## Engine specific settings
DJL supports 12 deep learning frameworks, and each framework has its own settings. Please refer to each framework's documentation for details.

A common setting for most of the engines is `OMP_NUM_THREADS`. For the best throughput, DJLServing sets this to 1 by default (for some engines, e.g. MXNet, this value must be one). Since this is a global environment variable, setting this value will impact all other engines.

The following table shows some engine-specific environment variables that are overridden by default by DJLServing:
Key | Engine | Description |
---|---|---|
TF_NUM_INTEROP_THREADS | TensorFlow | default 1, OMP_NUM_THREADS will override this value |
TF_NUM_INTRAOP_THREADS | TensorFlow | default 1 |
TF_CPP_MIN_LOG_LEVEL | TensorFlow | default 1 |
MXNET_ENGINE_TYPE | MXNet | this value must be NaiveEngine |
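For example, to raise TensorFlow's intra-op parallelism while keeping the other DJLServing defaults (an illustrative value, not a recommendation):

```
# OMP_NUM_THREADS (set to 1 by DJLServing) overrides TF_NUM_INTEROP_THREADS,
# so only the intra-op thread count is changed here
export TF_NUM_INTRAOP_THREADS=4
```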
## Appendix

### How to configure logging

#### Option 1: enable debug log
export SERVING_OPTS="-Dai.djl.logging.level=debug"
#### Option 2: use your log4j2.xml
export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=/MY_CONF/log4j2.xml
DJLServing provides a few built-in `log4j2-XXX.xml` files in DJLServing containers.

Use the following environment variable to print the HTTP access log to the console:
export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=/usr/local/djl-serving-0.23.0/conf/log4j2-access.xml
Use the following environment variable to print the access log, server metrics, and model metrics to the console:
export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=/usr/local/djl-serving-0.23.0/conf/log4j2-console.xml
### How to download uncompressed model from S3
To enable fast model downloading, you can store your model artifacts (weights) in an S3 bucket, and only keep the model code and metadata in the `model.tar.gz` (.zip) file. DJL can leverage s5cmd to download uncompressed files from S3 at extremely high speed.

To enable `s5cmd` downloading, configure `serving.properties` as follows:
option.model_id=s3://YOUR_BUCKET/...
### How to resolve python package conflict between models
If you want to deploy multiple python models but their dependencies conflict, you can enable a python virtual environment for your model:
option.enable_venv=true