Mistral 7B deployment guide

In this tutorial, you will deploy the LMI (Large Model Inference) container from the AWS Deep Learning Containers (DLC) to SageMaker and run inference with it.

Make sure the following permission is granted before running the notebook:

  • S3 bucket push access

Step 1: Upgrade the SageMaker SDK and import dependencies

%pip install sagemaker --upgrade  --quiet
import os
import sagemaker
from sagemaker.djl_inference.model import DJLModel

role = sagemaker.get_execution_role()  # execution role for the endpoint
session = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs

Step 2: Build the SageMaker endpoint

In this step, we will build a SageMaker endpoint from scratch.

Getting the container image URI (optional)

Check out the available images: Large Model Inference available DLC

# Choose a specific version of LMI image directly:
# image_uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124"
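
The commented-out URI above follows a regular pattern, so it can be assembled from its parts when you want to pin a version per region. The following is a minimal sketch, not part of the SageMaker SDK: `lmi_image_uri` is a hypothetical helper, and its default account ID and tag are simply copied from the example URI. Some AWS regions use a different registry account, so check the list of available images before relying on it.

```python
# Hypothetical helper: assembles an LMI image URI from its parts.
# The default account ID and tag are copied from the example URI above;
# some AWS regions use a different registry account, so verify before use.
def lmi_image_uri(region, tag="0.28.0-lmi10.0.0-cu124", account="763104351884"):
    return f"{account}.dkr.ecr.{region}.amazonaws.com/djl-inference:{tag}"


image_uri = lmi_image_uri("us-west-2")
print(image_uri)
```

The resulting string plays the same role as the `image_uri` variable in the commented-out line.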

Create SageMaker model

Here we use the LMI PySDK to create the model.

Check out more configuration options.

model_id = "mistralai/Mistral-7B-v0.1" # model will be downloaded from the Hugging Face Hub
hf_token = os.getenv("HF_TOKEN", "hf_XXXXXXXXXXX")    # use your HF_TOKEN to access this model

env = {
    "TENSOR_PARALLEL_DEGREE": "2",            # use 2 GPUs
    "HF_TOKEN": hf_token,
    "OPTION_ROLLING_BATCH": "vllm",           # use vllm for rolling batching
    "OPTION_TRUST_REMOTE_CODE": "true",
}

model = DJLModel(
    model_id=model_id,
    env=env,
    role=role)
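
Container environment variables are passed as strings, and the placeholder token used as the default above would only fail later, when the container tries to download the model. One way to catch both issues early is sketched below. `build_lmi_env` and `PLACEHOLDER_PREFIX` are hypothetical names introduced for illustration; the keys and default values are the ones used in the `env` dictionary above.

```python
PLACEHOLDER_PREFIX = "hf_XXXX"  # matches the dummy default token used above


def build_lmi_env(hf_token, tensor_parallel_degree=2, rolling_batch="vllm",
                  trust_remote_code=True):
    """Return a str -> str env dict, rejecting a missing or placeholder token."""
    if not hf_token or hf_token.startswith(PLACEHOLDER_PREFIX):
        raise ValueError("Set HF_TOKEN to a real Hugging Face access token")
    return {
        "TENSOR_PARALLEL_DEGREE": str(tensor_parallel_degree),
        "HF_TOKEN": hf_token,
        "OPTION_ROLLING_BATCH": rolling_batch,
        "OPTION_TRUST_REMOTE_CODE": str(trust_remote_code).lower(),
    }
```

Calling `build_lmi_env(os.getenv("HF_TOKEN"))` would then raise immediately if the variable is unset, instead of surfacing as a failed deployment.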

Create SageMaker endpoint

You need to specify the instance type and the endpoint name.

instance_type = "ml.g5.12xlarge"
endpoint_name = sagemaker.utils.name_from_base("lmi-model")

predictor = model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
)
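
The tensor parallel degree set in Step 2 cannot exceed the number of GPUs on the chosen instance. A quick sanity check before deploying might look like the sketch below. `GPUS_PER_INSTANCE` is a hand-written, hypothetical lookup table covering only a few G5 sizes, and `check_tensor_parallel` is an illustrative helper, not an SDK function.

```python
# Hypothetical lookup table: GPU counts for a few G5 instance sizes.
GPUS_PER_INSTANCE = {
    "ml.g5.2xlarge": 1,
    "ml.g5.12xlarge": 4,
    "ml.g5.48xlarge": 8,
}


def check_tensor_parallel(instance_type, tensor_parallel_degree):
    """Return the instance's GPU count, or raise if the degree does not fit."""
    if instance_type not in GPUS_PER_INSTANCE:
        raise KeyError(f"GPU count unknown for {instance_type}")
    gpus = GPUS_PER_INSTANCE[instance_type]
    if tensor_parallel_degree > gpus:
        raise ValueError(
            f"TENSOR_PARALLEL_DEGREE={tensor_parallel_degree} exceeds "
            f"the {gpus} GPU(s) on {instance_type}"
        )
    return gpus


print(check_tensor_parallel("ml.g5.12xlarge", 2))
```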

Step 3: Test and benchmark the inference

predictor.predict(
    {"inputs": "tell me a story of the little red riding hood", "parameters": {}}
)

%%timeit -n3 -r1
predictor.predict(
    {"inputs": "tell me a story of the little red riding hood", "parameters": {}}
)
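
The `%%timeit` magic only works inside a notebook cell. Outside a notebook, a rough equivalent can be written with `time.perf_counter`, as in the sketch below. `time_calls` is a hypothetical helper, and `fake_predict` is a stand-in used so that the example runs without a live endpoint; in practice you would pass `predictor.predict` instead.

```python
import statistics
import time


def time_calls(fn, payload, runs=3):
    """Call fn(payload) `runs` times and return per-call latencies in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(payload)
        latencies.append(time.perf_counter() - start)
    return latencies


def fake_predict(payload):
    # Stand-in for predictor.predict so the sketch runs without an endpoint.
    return {"generated_text": payload["inputs"][::-1]}


payload = {"inputs": "tell me a story of the little red riding hood", "parameters": {}}
latencies = time_calls(fake_predict, payload, runs=3)
print(f"mean {statistics.mean(latencies):.6f}s, max {max(latencies):.6f}s")
```

Keeping the individual latencies, rather than only the mean, makes it easier to spot a slow first request.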

Clean up the environment

session.delete_endpoint(endpoint_name)
session.delete_endpoint_config(endpoint_name)
model.delete_model()
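
If one of the delete calls above fails, the remaining resources are left running and keep incurring cost. A best-effort variant that attempts every step and reports failures afterwards is sketched below. `run_cleanup` is a hypothetical helper, and the stub steps stand in for the three delete calls so that the example runs without AWS access.

```python
def run_cleanup(steps):
    """Run every (name, callable) step; collect failures instead of stopping."""
    failures = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:  # keep going so later resources still get removed
            failures.append((name, exc))
    return failures


def failing_step():
    # Stub that simulates a delete call raising an error.
    raise RuntimeError("endpoint config not found")


done = []
failures = run_cleanup([
    ("endpoint", lambda: done.append("endpoint")),
    ("endpoint_config", failing_step),
    ("model", lambda: done.append("model")),
])
print(done, [name for name, _ in failures])
```

With real resources, each step would wrap one of the calls above, for example `("endpoint", lambda: session.delete_endpoint(endpoint_name))`.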