DJL Serving - WorkLoadManager

DJL Serving is divided into a frontend and a backend. The frontend is a Netty web server that handles incoming requests and operates the control plane. The backend, the WorkLoadManager, handles model batching, workers, and threading for high-performance inference.

If you already have web server infrastructure but want high-performance inference, you can use the WorkLoadManager on its own. For this reason, it is packaged as a separate module.
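When the WorkLoadManager runs behind your own web server, the integration point is small: a request handler turns the request body into a job, submits it, and waits on the returned future (the Job and runJob API is introduced below). The following is a minimal sketch, assuming the JDK's built-in `com.sun.net.httpserver.HttpServer` stands in for your existing server. The `runJob` function and the `/predict` path are hypothetical placeholders for building a Job from a ModelInfo and calling `wlm.runJob(job)`; inputs and outputs are plain strings here rather than DJL's Input and Output types.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.function.Function;

public class ExistingServerBridge {

    /**
     * Starts an HTTP server whose /predict handler forwards the request body to runJob.
     * runJob is a hypothetical stand-in for "new Job(modelInfo, input)" + "wlm.runJob(job)".
     */
    public static HttpServer start(int port, Function<String, CompletableFuture<String>> runJob)
            throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/predict", exchange -> {
            try {
                String input = new String(
                        exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
                int status;
                byte[] body;
                try {
                    // Submit the work, then block this handler thread until it completes.
                    body = runJob.apply(input).join().getBytes(StandardCharsets.UTF_8);
                    status = 200;
                } catch (CompletionException e) {
                    // join() wraps failures from the async work in a CompletionException.
                    body = String.valueOf(e.getCause()).getBytes(StandardCharsets.UTF_8);
                    status = 500;
                }
                exchange.sendResponseHeaders(status, body.length);
                try (OutputStream os = exchange.getResponseBody()) {
                    os.write(body);
                }
            } finally {
                exchange.close();
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in "model": upper-cases its input on a background thread.
        start(8080, input -> CompletableFuture.supplyAsync(input::toUpperCase));
        System.out.println("POST text to http://localhost:8080/predict");
    }
}
```

The handler blocks on `join()` for simplicity, and the server keeps its default executor, so requests are handled one at a time; a real deployment would likely pass an executor to `setExecutor` or complete responses asynchronously.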

Using the WorkLoadManager is simple. First, create one with the constructor:

WorkLoadManager wlm = new WorkLoadManager();

You can also configure the WorkLoadManager by using the static WlmConfigManager.

Then, construct a ModelInfo for each model you want to run through the WorkLoadManager. With a ModelInfo, you can build a Job once you receive input:

ModelInfo modelInfo = new ModelInfo(...);
Job job = new Job(modelInfo, input);

Once you have a job, submit it to the WorkLoadManager. It automatically spins up workers if none exist yet and manages the number of workers. It returns a CompletableFuture<Output> for the result.

CompletableFuture<Output> futureResult = wlm.runJob(job);
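To make the submission flow concrete, here is a toy, self-contained sketch of the same pattern. It is not DJL's implementation: `ToyWorkLoadManager`, `ToyModel`, and `ToyJob` are hypothetical stand-ins for WorkLoadManager, ModelInfo, and Job, and strings replace Input and Output. It lazily creates a fixed-size thread pool per model on that model's first job and returns a CompletableFuture immediately. The real WorkLoadManager also batches requests and adjusts worker counts, which this sketch omits. It assumes Java 16+ for records.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical, simplified illustration of the WorkLoadManager pattern; not the DJL API. */
public class ToyWorkLoadManager implements AutoCloseable {

    /** Hypothetical model descriptor: a name, a worker count, and the inference function. */
    public record ToyModel(String name, int workers, Function<String, String> predictor) {}

    /** Hypothetical job pairing a model with one input, like Job(modelInfo, input). */
    public record ToyJob(ToyModel model, String input) {}

    // One worker pool per model name, created on the first job for that model.
    private final Map<String, ExecutorService> pools = new ConcurrentHashMap<>();

    public CompletableFuture<String> runJob(ToyJob job) {
        ToyModel model = job.model();
        ExecutorService pool = pools.computeIfAbsent(
                model.name(), name -> Executors.newFixedThreadPool(model.workers()));
        // The caller gets a future right away; a worker thread completes it later.
        return CompletableFuture.supplyAsync(() -> model.predictor().apply(job.input()), pool);
    }

    /** Number of models that currently have a worker pool. */
    public int poolCount() {
        return pools.size();
    }

    @Override
    public void close() {
        // Let queued jobs finish, then release the worker threads.
        pools.values().forEach(ExecutorService::shutdown);
    }

    public static void main(String[] args) {
        ToyModel upper = new ToyModel("upper", 2, String::toUpperCase);
        try (ToyWorkLoadManager wlm = new ToyWorkLoadManager()) {
            CompletableFuture<String> future = wlm.runJob(new ToyJob(upper, "hello"));
            // Chain further work without blocking, then wait for the final values.
            int length = future.thenApply(String::length).join();
            System.out.println(future.join() + " has " + length + " characters");
            // prints "HELLO has 5 characters"
        }
    }
}
```

Because `runJob` returns a standard CompletableFuture, the same consumption patterns apply to the real `CompletableFuture<Output>`: chain work with `thenApply` or `thenAccept`, or block with `join()` or `get()`.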

View the javadocs for the WorkLoadManager for more options.


The latest javadocs can be found on the

You can also build the latest javadocs locally using the following command:

# for Linux/macOS:
./gradlew javadoc

# for Windows:
..\..\gradlew javadoc

The javadocs output is built in the build/doc/javadoc folder.


You can pull the WorkLoadManager module from the central Maven repository by including the following dependency: