Custom output formatter schema

This document describes the output formatter schema, which you can use to write your own custom output formatter.

Custom output formatters can be defined at two levels:

  • Base Model Level: Place model.py in the base model directory. These formatters apply to all responses by default.
  • Adapter Level: Place model.py in an adapter directory (e.g., adapters/my_adapter/model.py). These formatters apply only when that specific adapter is used, overriding the base model formatter.

When a request specifies an adapter, DJL Serving will use the adapter's custom formatter if available, otherwise falling back to the base model's formatter.

Signature of your own output formatter

To write your own custom output formatter, follow the signature below:

For vLLM and TensorRT-LLM backends:

from djl_python.output_formatter import output_formatter

@output_formatter
def custom_output_formatter(response_data: dict) -> dict:
    # your implementation here
    return response_data

For other backends:

from djl_python.output_formatter import RequestOutput, output_formatter

@output_formatter
def custom_output_formatter(request_output: RequestOutput) -> str:
    # your implementation here

You can write this function in your model.py. You don't need to write a handle function in your entry point Python file; DJL Serving searches for the @output_formatter decorator and applies the decorated function as the output formatter (see the Examples section below).

Arguments and Return Types for vLLM and TensorRT-LLM backends:

Arguments:

  • response_data (dict): The final response data dictionary containing fields like generated_text, details, etc.

Return Type: dict - Modified response data dictionary with custom fields added at the top level of the response.

Arguments and Return Types for other backends:

Arguments:

  • request_output (RequestOutput): The request output object containing sequences, tokens, and other generation details.

Return Type: str - Formatted response string.

RequestOutput schema

The RequestOutput class is designed to encapsulate the output of a request in a structured format. Here is an in-depth look at its structure and the related classes:

classDiagram
    RequestOutput <|-- TextGenerationOutput
    RequestOutput *-- RequestInput
    TextGenerationOutput "1" --> "1..*" Sequence
    Sequence "1" --> "1..*" Token
    RequestInput <|-- TextInput
    class RequestOutput{
        +int request_id
        +bool finished
        +RequestInput input
    }
    class TextGenerationOutput{
        +map[int, Sequence] sequences
        +int best_sequence_index
        +list[Token] prompt_tokens_details
    }
    class Sequence{
        +list[Token] tokens
        +float cumulative_log_prob
        +string finish_reason
        +has_next_token()
        +get_next_token()
    }

    class RequestInput{
        +int request_id
        +map[str, any] parameters
        +Union[str, Callable] output_formatter
    }
    class TextInput{
        +str input_text
        +list[int] input_ids
        +any adapters
        +any tokenizer
    }
    class Token{
        +int id
        +string text
        +float log_prob
        +bool special_token
        +as_dict()
    }

Detailed Description

  • RequestOutput: This is the main class that encapsulates the output of a request.
  • TextGenerationOutput: This subclass of RequestOutput is specific to text generation tasks, which is currently the only task supported by custom output formatters. Each text generation task can generate multiple sequences.
      • best_sequence_index: the index of the best sequence, i.e. the one with the highest cumulative log probability. Use this index when looking up the output sequence.
      • sequences (map[int, Sequence]): maps each sequence index to its respective Sequence.
      • Note that only one sequence is generated per request at the moment; support for generating multiple sequences is planned for a future release.
  • Sequence: Represents a sequence of generated tokens and its details.
      • has_next_token() and get_next_token() function like an iterator. In iterative generation, each step produces a single token.
      • get_next_token() advances the iterator to the next token and returns a Token instance along with flags indicating whether it is the first token (is_first_token) and whether it is the last token (is_last_token). A usage sketch follows this list.
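
To make the iterator behavior concrete, here is a minimal sketch of draining a sequence into a plain string. It assumes request_output is a populated TextGenerationOutput; everything else (the function name, the pieces list) is illustrative.

from djl_python.request_io import TextGenerationOutput

def collect_text(request_output: TextGenerationOutput) -> str:
    # Look up the best sequence via its documented index
    best_sequence = request_output.sequences[request_output.best_sequence_index]
    pieces = []
    while best_sequence.has_next_token():
        token, is_first_token, is_last_token = best_sequence.get_next_token()
        pieces.append(token.text)
        if is_last_token:
            # finish_reason is populated once the sequence has ended
            print("finish_reason:", best_sequence.finish_reason)
    return "".join(pieces)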

How will your custom output formatter be called?

It's crucial to understand how your custom output formatter will be called before implementing it. The pseudocode after this list illustrates the call pattern.

  • Upon receiving requests, DJL Serving batches them together, performs preprocessing, and starts the inference process.
  • Inference may involve multiple token-generation steps, depending on the max_new_tokens parameter.
  • At each generation step, your output formatter is invoked once for each request individually.
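
The following pseudocode gives a simplified, illustrative view of this call pattern; it is not DJL Serving's actual serving loop, and backend, batch, and send_chunk are hypothetical stand-ins.

# Illustrative pseudocode only -- not DJL Serving's real serving loop.
# `backend`, `batch`, and `send_chunk` are hypothetical stand-ins.
for step in range(max_new_tokens):          # one iteration per generation step
    batch = backend.generate_next_tokens(batch)
    for request_output in batch:
        # The output formatter runs once per request at every step
        chunk = custom_output_formatter(request_output)
        send_chunk(request_output.request_id, chunk)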

Examples

Example 1: Base Model Formatter (vLLM/TensorRT-LLM)

# model.py (in base model directory)
from djl_python.output_formatter import output_formatter
import time

@output_formatter
def custom_output_formatter(response_data: dict) -> dict:
    """
    Base model output formatter - applies to all responses by default.

    Args:
        response_data (dict): The final response data containing 'generated_text', 'details', etc.

    Returns:
        (dict): Modified response data with custom fields
    """
    if 'generated_text' in response_data:
        # Add custom processing timestamp
        response_data['custom_formatter_applied'] = True
        response_data['formatter_timestamp'] = int(time.time())

        # Example: Transform the generated text
        generated_text = response_data['generated_text']
        response_data['original_length'] = len(generated_text)
        response_data['generated_text'] = f"[BASE] {generated_text}"

    return response_data

Example 2: Adapter-Specific Formatter (vLLM/TensorRT-LLM)

# adapters/my_adapter/model.py
from djl_python.output_formatter import output_formatter
import json

@output_formatter
def custom_output_formatter(response_data: dict) -> dict:
    """
    Adapter-specific output formatter - only applies when this adapter is used.
    Overrides the base model formatter for this adapter.

    Args:
        response_data (dict): The final response data containing 'generated_text', 'details', etc.

    Returns:
        (dict): Modified response data with custom fields
    """
    if 'generated_text' in response_data:
        # Add adapter-specific metadata
        response_data['adapter_name'] = 'my_adapter'
        response_data['adapter_version'] = '1.0'

        # Adapter-specific text transformation
        generated_text = response_data['generated_text']
        response_data['generated_text'] = f"[ADAPTER] {generated_text}"

        # Add custom metrics
        response_data['custom_metrics'] = {
            'text_length': len(generated_text),
            'word_count': len(generated_text.split())
        }

    return response_data

Example 3: Base Model Formatter (Other Backends)

# model.py (in base model directory)
from djl_python.request_io import TextGenerationOutput
from djl_python.output_formatter import output_formatter
import json

@output_formatter
def custom_output_formatter(request_output: TextGenerationOutput) -> str:
    """
    Base model output formatter for other backends.

    Args:
        request_output (TextGenerationOutput): The request output

    Returns:
        (str): Response string
    """
    best_sequence = request_output.sequences[request_output.best_sequence_index]
    next_token, is_first_token, is_last_token = best_sequence.get_next_token()
    result = {}
    if next_token:
        result = {"token_id": next_token.id, "token_text": next_token.text, "token_log_prob": next_token.log_prob}
    if is_last_token:
        result["finish_reason"] = best_sequence.finish_reason
    return json.dumps(result) + "\n"
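
With this formatter, a streamed response body consists of one JSON object per line. The token ids, texts, and log probabilities below are purely illustrative:

{"token_id": 464, "token_text": " The", "token_log_prob": -0.12}
{"token_id": 3280, "token_text": " answer", "token_log_prob": -0.54}
{"token_id": 2, "token_text": "</s>", "token_log_prob": -0.03, "finish_reason": "eos_token"}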

Adapter-Specific Formatters

When using adapters, you can place a model.py file in the adapter directory with the same decorator-based formatters. The adapter's formatter will be used when that adapter is specified in the request, otherwise the base model's formatter is used.

Directory Structure:

/opt/ml/model/
  model.py                    # Base model formatter (optional)
  adapters/
    adapter1/
      adapter_model.safetensors
      adapter_config.json
      model.py                # Adapter-specific formatter (optional)
    adapter2/
      adapter_model.safetensors
      adapter_config.json
      # No model.py - uses base model formatter

Formatter Resolution:

  1. Request with adapter1 → Uses the adapters/adapter1/model.py formatter
  2. Request with adapter2 → Uses the base model /opt/ml/model/model.py formatter
  3. Request without an adapter → Uses the base model /opt/ml/model/model.py formatter
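
As an illustration of this resolution order, a client could select adapter1 (and therefore its formatter) with a payload like the one below. The endpoint path and the adapters field are assumptions based on common DJL Serving setups; the exact request schema depends on your deployment.

import json
import urllib.request

# Hypothetical client call; adjust host, port, and payload schema to your deployment.
payload = {"inputs": "What is Deep Java Library?", "adapters": "adapter1"}
request = urllib.request.Request(
    "http://localhost:8080/invocations",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    # This response passed through the adapters/adapter1/model.py formatter
    print(response.read().decode("utf-8"))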

Streaming Support: Both base model and adapter-specific formatters work with streaming responses. The formatter is applied to each chunk as it's generated.