# DJL Serving Management API
DJL Serving provides a set of APIs that allow users to manage models at runtime:
- Register a model
- Increase/decrease the number of workers for a specific model
- Describe a model's status
- Unregister a model
- List registered models
In addition, there is an adapter management API for managing adapters.
The Management API listens on port 8080 and is only accessible from localhost by default. To change this default, see DJL Serving Configuration.
Similar to the Inference API, the Management API also provides an API description that describes the management APIs with the OpenAPI 3.0 specification.
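For example, a minimal sketch of exposing the Management API on all network interfaces by overriding the management_address property in config.properties (the property name is documented in DJL Serving Configuration; the bind address and file path shown here are illustrative):

# Append to conf/config.properties so the management endpoint binds to all interfaces
echo "management_address=http://0.0.0.0:8080" >> conf/config.properties

Restart DJL Serving for the change to take effect.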
## Management APIs
### Register a model
Registers a new model as a single-model workflow. The workflow name and version match the model name and version.
POST /models
- url - Model url.
- model_name - the name of the model and workflow; this name will be used as {workflow_name} in other API paths. If this parameter is not present, the model name will be inferred from the url.
- model_version - the version of the model.
- engine - the name of the engine used to load the model. DJL will try to infer the engine if not specified.
- device - the device to load the model on. DJL will pick the optimal device if not specified. The device value can be:
- CPU device: cpu or simply -1
- GPU device: gpu0, gpu1, ... or simply 0, 1, 2, 3, ...
- Neuron core: nc1, nc2, ...
- job_queue_size - the request job queue size, the default is 1000.
- batch_size - the inference batch size, the default is 1.
- max_batch_delay - the maximum delay for batch aggregation in milliseconds, the default is 100.
- max_idle_time - the maximum idle time in seconds before the worker thread is scaled down, the default is 60.
- min_worker - the minimum number of worker processes. DJL will auto-detect the minimum number of workers if not specified.
- max_worker - the maximum number of worker processes. DJL will auto-detect the maximum number of workers if not specified.
- synchronous - whether the creation of workers is synchronous, the default is true.
curl -X POST "http://localhost:8080/models?url=https%3A%2F%2Fresources.djl.ai%2Ftest-models%2Fmlp.zip"
{
"status": "Model \"mlp\" registered."
}
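All of these parameters are passed as query string values. As a sketch, the same model could be registered with explicit device, batching, and worker settings (the parameter names are those listed above; the values are illustrative):

curl -X POST "http://localhost:8080/models?url=https%3A%2F%2Fresources.djl.ai%2Ftest-models%2Fmlp.zip&model_name=mlp&device=cpu&batch_size=4&max_batch_delay=200&min_worker=1&max_worker=2"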
Downloading and loading the model may take some time, so you can choose an asynchronous call and check the status later.
The asynchronous call returns with HTTP code 202 before the workers are created:
curl -v -X POST "http://localhost:8080/models?url=https%3A%2F%2Fresources.djl.ai%2Ftest-models%2Fmlp.zip&synchronous=false"
< HTTP/1.1 202 Accepted
< content-type: application/json
< x-request-id: bf998daa-892f-482b-a660-6d0447aa5a7a
< Pragma: no-cache
< Cache-Control: no-cache; no-store, must-revalidate, private
< Expires: Thu, 01 Jan 1970 00:00:00 UTC
< content-length: 56
< connection: keep-alive
<
{
"status": "Model \"mlp\" registration scheduled."
}
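Once the registration has been scheduled, you can poll the Describe Model API (documented below) until the workers report READY:

curl http://localhost:8080/models/mlp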
### Register a workflow
POST /workflows
- url - Workflow url.
- template - A workflow template to use.
- engine - the name of the engine used to load the model. DJL will try to infer the engine if not specified.
- device - the device to load the model on. DJL will pick the optimal device if not specified. The device value can be:
- CPU device: cpu or simply -1
- GPU device: gpu0, gpu1, ... or simply 0, 1, 2, 3, ...
- Neuron core: nc1, nc2, ...
- min_worker - the minimum number of worker processes. The default value is 1.
- max_worker - the maximum number of worker processes. The default is the same as the setting for min_worker.
- synchronous - whether the creation of workers is synchronous. The default value is true.
Either a url or template is required.
curl -X POST "http://localhost:8080/workflows?url=https%3A%2F%2Fresources.djl.ai%2Ftest-workflows%2Fmlp.zip"
{
"status": "Workflow \"mlp\" registered."
}
Downloading and loading the workflow may take some time, so you can choose an asynchronous call and check the status later.
The asynchronous call returns with HTTP code 202 before the workers are created:
curl -v -X POST "http://localhost:8080/workflows?url=https%3A%2F%2Fresources.djl.ai%2Ftest-workflows%2Fmlp.zip&synchronous=false"
< HTTP/1.1 202 Accepted
< content-type: application/json
< x-request-id: bf998daa-892f-482b-a660-6d0447aa5a7a
< Pragma: no-cache
< Cache-Control: no-cache; no-store, must-revalidate, private
< Expires: Thu, 01 Jan 1970 00:00:00 UTC
< content-length: 56
< connection: keep-alive
<
{
"status": "Workflow \"mlp\" registration scheduled."
}
### Scale workers
PUT /models/{model_name}
PUT /models/{model_name}/{version}
PUT /workflows/{workflow_name}
PUT /workflows/{workflow_name}/{version}
- min_worker - the minimum number of worker processes. The default value is 1.
- max_worker - the maximum number of worker processes. The default is the same as the setting for min_worker.
- synchronous - whether the creation of workers is synchronous. The default value is true.
Use the Scale Worker API to dynamically adjust the number of workers to better serve different inference request loads.
There are two flavors of this API: synchronous and asynchronous.
The asynchronous call will return immediately with HTTP code 202:
curl -v -X PUT "http://localhost:8080/workflows/mlp?min_worker=3&synchronous=false"
< HTTP/1.1 202 Accepted
< content-type: application/json
< x-request-id: 74b65aab-dea8-470c-bb7a-5a186c7ddee6
< content-length: 33
< connection: keep-alive
<
{
"status": "Worker updated"
}
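A synchronous call (the default) does not return until the workers have been created. As a sketch, to keep a model between 2 and 4 workers using the parameters listed above:

curl -X PUT "http://localhost:8080/models/mlp?min_worker=2&max_worker=4"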
### Describe model or workflow
GET /models/{model_name}
GET /workflows/{workflow_name}
Use the Describe Model API to get the detailed runtime status of a model or workflow:
curl http://localhost:8080/models/mlp
[
{
"workflowName": "mlp",
"models": [
{
"modelName": "mlp",
"modelUrl": "https://resources.djl.ai/test-models/mlp.zip",
"batchSize": 1,
"maxBatchDelayMillis": 100,
"maxIdleSeconds": 60,
"queueSize": 1000,
"requestInQueue": 0,
"status": "Healthy",
"loadedAtStartup": true,
"workerGroups": [
{
"device": {
"deviceType": "cpu",
"deviceId": -1
},
"minWorkers": 1,
"maxWorkers": 12,
"workers": [
{
"id": 1,
"startTime": "2023-06-08T08:14:16.999Z",
"status": "READY"
}
]
}
]
}
]
}
]
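For scripting, individual fields can be pulled out of this response with a JSON processor such as jq; a sketch assuming the response shape shown above:

# Print the status of every worker of the "mlp" model (e.g. READY)
curl -s http://localhost:8080/models/mlp | jq -r '.[0].models[0].workerGroups[].workers[].status'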
### Unregister a model or workflow
DELETE /models/{model_name}
DELETE /workflows/{workflow_name}
Use the Unregister Model or Workflow API to free up system resources:
curl -X DELETE http://localhost:8080/models/mlp
{
"status": "Workflow \"mlp\" unregistered"
}
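Listing the registered models afterwards confirms the removal:

curl http://localhost:8080/models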
### List models and workflows
GET /models
GET /workflows
- limit - (optional) the maximum number of items to return. It is passed as a query parameter. The default value is 100.
- next_page_token - (optional) queries for the next page. It is passed as a query parameter. This value is returned by a previous API call.
Use this API to query the currently registered models and workflows:
curl "http://localhost:8080/workflows"
This API supports pagination:
curl "http://localhost:8080/models?limit=2&next_page_token=0"
{
"models": [
{
"modelName": "mlp",
"modelUrl": "https://resources.djl.ai/test-models/mlp.zip",
"status": "READY"
}
]
}
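To walk through all pages, pass the token from each response into the next request. A sketch in shell, assuming jq is installed and that the response carries a nextPageToken field when more pages are available (the field name is an assumption, not confirmed by the example above):

#!/bin/sh
# Page through all registered models, 2 per page.
url="http://localhost:8080/models?limit=2"
resp=$(curl -s "$url")
echo "$resp" | jq -r '.models[].modelName'
token=$(echo "$resp" | jq -r '.nextPageToken // empty')
while [ -n "$token" ]; do
  resp=$(curl -s "$url&next_page_token=$token")
  echo "$resp" | jq -r '.models[].modelName'
  token=$(echo "$resp" | jq -r '.nextPageToken // empty')
done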
### Configure server logging
POST /server/logging
- level - the logging level to set: TRACE, DEBUG, INFO, WARN, ERROR, FATAL, or OFF. If no level is specified, INFO will be set.
curl -X POST "http://localhost:8080/server/logging?level=debug"
{
"status": "OK"
}
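Since INFO is the default when no level parameter is given, the log level can be reset with a bare call:

curl -X POST "http://localhost:8080/server/logging"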
### Get metrics
GET /server/metrics
- name[] - metric name filters. Multiple name[] HTTP parameters are allowed.
Use the Get Metrics API to get metrics in Prometheus format:
curl "http://localhost:8080/server/metrics?name[]=DJLServingStart&name[]=StartupLatency"
# HELP DJLServingStart_total : prometheus counter metric, unit: Count
# TYPE DJLServingStart_total counter
DJLServingStart_total{Version="0.27.0-SNAPSHOT"} 1.0
# HELP StartupLatency : prometheus gauge metric, unit: Microseconds
# TYPE StartupLatency gauge
StartupLatency{Host="localhost"} 425941.0
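Since name[] acts as a filter, omitting it should return all available metrics:

curl "http://localhost:8080/server/metrics"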