Custom output formatter schema

This document describes the output formatter schema, which you can use to write your own custom output formatter.

Custom output formatters can be defined at two levels:

  • Base Model Level: Place model.py in the base model directory. These formatters apply to all responses by default.
  • Adapter Level: Place model.py in an adapter directory (e.g., adapters/my_adapter/model.py). These formatters apply only when that specific adapter is used, overriding the base model formatter.

When a request specifies an adapter, DJL Serving will use the adapter's custom formatter if available, otherwise falling back to the base model's formatter.

Signature of your own output formatter

To write your own custom output formatter, follow the signature below:

For vLLM and TensorRT-LLM backends:

from djl_python.output_formatter import output_formatter

@output_formatter
def custom_output_formatter(response_data: dict) -> dict:
    # your implementation here
    return response_data

For other backends:

from djl_python.output_formatter import RequestOutput, output_formatter

@output_formatter
def custom_output_formatter(request_output: RequestOutput) -> str:
    # your implementation here

You can write this function in your model.py. You don't need to write a handle function in your entry point Python file; DJL Serving searches for the @output_formatter decorator and applies the decorated function as the output formatter (see the Examples section below).

Arguments and Return Types for vLLM and TensorRT-LLM backends:

Arguments:

  • response_data (dict): The final response data dictionary containing fields like generated_text, details, etc.

Return Type: dict - Modified response data dictionary with custom fields added at the top level of the response.

Arguments and Return Types for other backends:

Arguments:

  • request_output (RequestOutput): The request output object containing sequences, tokens, and other generation details.

Return Type: str - Formatted response string.

RequestOutput schema

The RequestOutput class is designed to encapsulate the output of a request in a structured format. Here is an in-depth look at its structure and the related classes:

classDiagram
    RequestOutput <|-- TextGenerationOutput
    RequestOutput *-- RequestInput
    TextGenerationOutput "1" --> "1..*" Sequence
    Sequence "1" --> "1..*" Token
    RequestInput <|-- TextInput
    class RequestOutput{
        +int request_id
        +bool finished
        +RequestInput input
    }
    class TextGenerationOutput{
        +map[int, Sequence] sequences
        +int best_sequence_index
        +list[Token] prompt_tokens_details
    }
    class Sequence{
        +list[Token] tokens
        +float cumulative_log_prob
        +string finish_reason
        +has_next_token()
        +get_next_token()
    }

    class RequestInput{
        +int request_id
        +map[str, any] parameters
        +Union[str, Callable] output_formatter
    }
    class TextInput{
        +str input_text
        +list[int] input_ids
        +any adapters
        +any tokenizer
    }
    class Token{
        +int id
        +string text
        +float log_prob
        +bool special_token
        +as_dict()
    }

Detailed Description

  • RequestOutput: This is the main class that encapsulates the output of a request.
  • TextGenerationOutput: This subclass of RequestOutput is specific to text generation tasks, which is currently the only task supported by custom output formatters. Each text generation task can generate multiple sequences.
      • best_sequence_index: the index of the best sequence, i.e. the one with the highest cumulative log probability. Use this index when looking up the output sequence.
      • sequences (map[int, Sequence]): maps each sequence index to its respective Sequence.
      • Note that only one sequence is generated per request at the moment; support for generating multiple sequences is planned for a future release.
  • Sequence: Represents a sequence of generated tokens and its details.
      • has_next_token() and get_next_token() function like an iterator. In iterative generation, each step produces a single token.
      • get_next_token() advances the iterator to the next token and returns a Token instance along with flags indicating whether it is the first token (is_first_token) and whether it is the last token (is_last_token). A usage sketch follows this list.
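
To make the iterator behavior concrete, here is a minimal sketch of draining a sequence into a plain string. It assumes request_output is a populated TextGenerationOutput; everything else (the function name, the pieces list) is illustrative.

from djl_python.request_io import TextGenerationOutput

def collect_text(request_output: TextGenerationOutput) -> str:
    # Look up the best sequence via its documented index
    best_sequence = request_output.sequences[request_output.best_sequence_index]
    pieces = []
    while best_sequence.has_next_token():
        token, is_first_token, is_last_token = best_sequence.get_next_token()
        pieces.append(token.text)
        if is_last_token:
            # finish_reason is populated once the sequence has ended
            print("finish_reason:", best_sequence.finish_reason)
    return "".join(pieces)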

How will your custom output formatter be called?

It's crucial to understand how your custom output formatter will be called before implementing it. The pseudocode after this list illustrates the call pattern.

  • Upon receiving requests, DJL Serving batches them together, performs preprocessing, and starts the inference process.
  • Inference may involve multiple token-generation steps, depending on the max_new_tokens parameter.
  • At each generation step, your output formatter is invoked once for each request individually.
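
The following pseudocode gives a simplified, illustrative view of this call pattern; it is not DJL Serving's actual serving loop, and backend, batch, and send_chunk are hypothetical stand-ins.

# Illustrative pseudocode only -- not DJL Serving's real serving loop.
# `backend`, `batch`, and `send_chunk` are hypothetical stand-ins.
for step in range(max_new_tokens):          # one iteration per generation step
    batch = backend.generate_next_tokens(batch)
    for request_output in batch:
        # The output formatter runs once per request at every step
        chunk = custom_output_formatter(request_output)
        send_chunk(request_output.request_id, chunk)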

Examples

Example 1: Base Model Formatter (vLLM/TensorRT-LLM)

# model.py (in base model directory)
from djl_python.output_formatter import output_formatter
import time

@output_formatter
def custom_output_formatter(response_data: dict) -> dict:
    """
    Base model output formatter - applies to all responses by default.

    Args:
        response_data (dict): The final response data containing 'generated_text', 'details', etc.

    Returns:
        (dict): Modified response data with custom fields
    """
    if 'generated_text' in response_data:
        # Add custom processing timestamp
        response_data['custom_formatter_applied'] = True
        response_data['formatter_timestamp'] = int(time.time())

        # Example: Transform the generated text
        generated_text = response_data['generated_text']
        response_data['original_length'] = len(generated_text)
        response_data['generated_text'] = f"[BASE] {generated_text}"

    return response_data

Example 2: Adapter-Specific Formatter (vLLM/TensorRT-LLM)

# adapters/my_adapter/model.py
from djl_python.output_formatter import output_formatter
import json

@output_formatter
def custom_output_formatter(response_data: dict) -> dict:
    """
    Adapter-specific output formatter - only applies when this adapter is used.
    Overrides the base model formatter for this adapter.

    Args:
        response_data (dict): The final response data containing 'generated_text', 'details', etc.

    Returns:
        (dict): Modified response data with custom fields
    """
    if 'generated_text' in response_data:
        # Add adapter-specific metadata
        response_data['adapter_name'] = 'my_adapter'
        response_data['adapter_version'] = '1.0'

        # Adapter-specific text transformation
        generated_text = response_data['generated_text']
        response_data['generated_text'] = f"[ADAPTER] {generated_text}"

        # Add custom metrics
        response_data['custom_metrics'] = {
            'text_length': len(generated_text),
            'word_count': len(generated_text.split())
        }

    return response_data

Example 3: Base Model Formatter (Other Backends)

# model.py (in base model directory)
from djl_python.request_io import TextGenerationOutput
from djl_python.output_formatter import output_formatter
import json

@output_formatter
def custom_output_formatter(request_output: TextGenerationOutput) -> str:
    """
    Base model output formatter for other backends.

    Args:
        request_output (TextGenerationOutput): The request output

    Returns:
        (str): Response string
    """
    best_sequence = request_output.sequences[request_output.best_sequence_index]
    next_token, is_first_token, is_last_token = best_sequence.get_next_token()
    result = {}
    if next_token:
        result = {"token_id": next_token.id, "token_text": next_token.text, "token_log_prob": next_token.log_prob}
    if is_last_token:
        result["finish_reason"] = best_sequence.finish_reason
    return json.dumps(result) + "\n"
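
With this formatter, a streamed response body consists of one JSON object per line. The token ids, texts, and log probabilities below are purely illustrative:

{"token_id": 464, "token_text": " The", "token_log_prob": -0.12}
{"token_id": 3280, "token_text": " answer", "token_log_prob": -0.54}
{"token_id": 2, "token_text": "</s>", "token_log_prob": -0.03, "finish_reason": "eos_token"}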

Adapter-Specific Formatters

When using adapters, you can place a model.py file in the adapter directory with the same decorator-based formatters. The adapter's formatter will be used when that adapter is specified in the request, otherwise the base model's formatter is used.

Directory Structure:

/opt/ml/model/
  model.py                    # Base model formatter (optional)
  adapters/
    adapter1/
      adapter_model.safetensors
      adapter_config.json
      model.py                # Adapter-specific formatter (optional)
    adapter2/
      adapter_model.safetensors
      adapter_config.json
      # No model.py - uses base model formatter

Formatter Resolution:

  1. Request with adapter1 → Uses the adapters/adapter1/model.py formatter
  2. Request with adapter2 → Uses the base model /opt/ml/model/model.py formatter
  3. Request without an adapter → Uses the base model /opt/ml/model/model.py formatter
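
As an illustration of this resolution order, a client could select adapter1 (and therefore its formatter) with a payload like the one below. The endpoint path and the adapters field are assumptions based on common DJL Serving setups; the exact request schema depends on your deployment.

import json
import urllib.request

# Hypothetical client call; adjust host, port, and payload schema to your deployment.
payload = {"inputs": "What is Deep Java Library?", "adapters": "adapter1"}
request = urllib.request.Request(
    "http://localhost:8080/invocations",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    # This response passed through the adapters/adapter1/model.py formatter
    print(response.read().decode("utf-8"))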

Streaming Support: Both base model and adapter-specific formatters work with streaming responses. The formatter is applied to each chunk as it's generated.