DJL Serving - WorkLoadManager¶
DJL Serving can be divided into a frontend and backend. The frontend is a netty webserver that manages incoming requests and operates the control plane. The backend WorkLoadManager handles the model batching, workers, and threading for high-performance inference.
For those who already have a web server infrastructure but want to operate high-performance inference, it is possible to use only the WorkLoadManager. For this reason, we have it split apart into a separate module.
Using the WorkLoadManager is quite simple. First, create a new one through the constructor:
WorkLoadManager wlm = new WorkLoadManager();
You can also configure the WorkLoadManager by using the static WlmConfigManager
.
Then, you can construct a ModelInfo
for each model you will want to run through wlm
.
With the ModelInfo
, you are able to build a Job
once you receive input:
ModelInfo modelInfo = new ModelInfo(...);
Job job = new Job(modelInfo, input);
Once you have your job, it can be submitted to the WorkLoadManager.
It will automatically spin up workers if none are created and manage worker numbers.
Then, it returns a CompletableFuture<Output>
for the result.
CompletableFuture<Output> futureResult = wlm.runJob(job);
View the javadocs for the WorkLoadManager
for more options.
Documentation¶
The latest javadocs can be found on the javadoc.io.
You can also build the latest javadocs locally using the following command:
# for Linux/macOS:
./gradlew javadoc
# for Windows:
..\..\gradlew javadoc
The javadocs output is built in the build/doc/javadoc
folder.
Installation¶
You can pull the server from the central Maven repository by including the following dependency:
<dependency>
<groupId>ai.djl.serving</groupId>
<artifactId>wlm</artifactId>
<version>0.28.0</version>
</dependency>