
PySDK instructions for using the LMI container on SageMaker

In this tutorial, you will deploy an LMI container from the AWS Deep Learning Containers (DLC) collection to SageMaker and run inference with it.

Please make sure the following permissions are granted before running the notebook:

  • S3 bucket push access
  • SageMaker access

Step 1: Let's upgrade SageMaker and import the required packages

%pip install sagemaker boto3 awscli --upgrade --quiet
import boto3
import sagemaker
from sagemaker import image_uris, serializers, deserializers
from sagemaker.djl_inference.model import DeepSpeedModel, DJLModel, HuggingFaceAccelerateModel

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment

(Remove if not needed) Upload a Hugging Face model to an S3 bucket

LMI can download a model directly from an S3 bucket. This step is recommended if you would like to speed up the model loading process.

%pip install huggingface_hub --upgrade --quiet
from huggingface_hub import snapshot_download
from pathlib import Path
import os

# - Download the model snapshot into a local directory (here /tmp) on the machine running this notebook
local_model_path = Path("/tmp")
local_model_path.mkdir(exist_ok=True)
model_name = "facebook/opt-6.7b"
# Only download the PyTorch checkpoints plus the tokenizer/config files
allow_patterns = ["*.json", "*.pt", "*.bin", "*.txt", "*.model"]

# - Use snapshot_download to fetch the model, since the weights are stored in the repository using Git LFS
model_download_path = snapshot_download(
    repo_id=model_name,
    cache_dir=local_model_path,
    allow_patterns=allow_patterns,
)
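
To confirm that only the expected files were pulled, you can list the snapshot directory. This is just a sanity check and not part of the original flow:

# List the files that snapshot_download placed in the local snapshot directory
for p in sorted(Path(model_download_path).rglob("*")):
    if p.is_file():
        print(p.relative_to(model_download_path), f"{p.stat().st_size / 1e6:.1f} MB")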

(Remove if not needed) DeepSpeed faster loading for Hugging Face models

DeepSpeed offers a way to speed up model loading while keeping CPU memory usage low. This has only been tested with OPT, GPT-NeoX, and BLOOM.

To enable it, we place a checkpoints.json file alongside the model artifacts.

import io
import json

checkpoints_json = os.path.join(model_download_path, "checkpoints.json")
tensor_parallel_degree = 4
weight_dtype = "float16"

with io.open(checkpoints_json, "w", encoding="utf-8") as f:
    # Collect the checkpoint file names (*.bin and *.pt) found in the snapshot directory
    file_list = [entry.name for entry in Path(model_download_path).rglob("*")
                 if entry.is_file() and entry.suffix in (".bin", ".pt")]
    data = {"type": "ds_model", "checkpoints": file_list, "version": 1.0,
            "parallelization": "tp", "tp_size": tensor_parallel_degree, "dtype": weight_dtype}
    json.dump(data, f)
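
You can print the generated file to verify its contents; the checkpoint file names you see will depend on how the Hugging Face repository shards its weights:

# Inspect the generated checkpoints.json
with open(checkpoints_json) as f:
    print(json.dumps(json.load(f), indent=2))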

Upload the model to an S3 bucket

key_prefix = "lmi_models/mymodel"
model_artifact = sess.upload_data(path=model_download_path, key_prefix=key_prefix)
print(f"Model uploaded to --- > {model_artifact}")
print(f"You can set option.s3url={model_artifact}")

Step 2: Start building SageMaker endpoint

In this step, we will build the SageMaker endpoint from scratch.

Getting the container image URI

Available frameworks are:

  • djl-deepspeed (0.20.0, 0.21.0)
  • djl-fastertransformer (0.21.0)

image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region=sess.boto_session.region_name,
    version="0.21.0",
)
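
If you want to try the FasterTransformer backend instead, the same retrieve() call works with the other framework name from the list above:

# Alternative: retrieve the djl-fastertransformer container image
ft_image_uri = image_uris.retrieve(
    framework="djl-fastertransformer",
    region=sess.boto_session.region_name,
    version="0.21.0",
)
print(f"FasterTransformer image: {ft_image_uri}")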

Create SageMaker model

Here we use the SageMaker Python SDK's DeepSpeedModel class to create the model.

# If you skipped the S3 upload above, point model_artifact at your own artifact location
if not model_artifact:
    model_artifact = "s3://my_bucket/my_saved_model_artifacts/"
model = DeepSpeedModel(
    model_artifact,
    role,
    data_type="fp16",
    task="text-generation",
    tensor_parallel_degree=2, # number of gpus to partition the model across using tensor parallelism
)
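
If you would rather shard the model with Hugging Face Accelerate instead of DeepSpeed, the SDK also provides the HuggingFaceAccelerateModel class imported above. The sketch below mirrors the DeepSpeedModel call; the number_of_partitions argument (assumed here to control how many GPUs the model is sharded across) takes the place of tensor_parallel_degree:

# Sketch only: argument names other than the model location and role are assumptions
hf_model = HuggingFaceAccelerateModel(
    model_artifact,
    role,
    task="text-generation",
    number_of_partitions=2,  # assumed: number of GPUs to shard the model across
)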

Create SageMaker endpoint

You need to specify the instance type to use and the endpoint name.

instance_type = "ml.g4dn.12xlarge"  # must provide at least tensor_parallel_degree GPUs (g4dn.12xlarge has 4)
endpoint_name = sagemaker.utils.name_from_base("lmi-model")

model.deploy(initial_instance_count=1,
             instance_type=instance_type,
             endpoint_name=endpoint_name,
             # container_startup_health_check_timeout=3600
            )
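
Deployment can take several minutes while the container downloads and partitions the weights. If you want to check the endpoint from another session, you can query its status with the SageMaker boto3 client:

# Check the endpoint status (InService means it is ready to serve requests)
sm_client = boto3.client("sagemaker")
status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
print(f"Endpoint {endpoint_name} status: {status}")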

# our requests and responses will be in json format so we specify the serializer and the deserializer
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)

Step 3: Test and benchmark the inference

%%timeit -n3 -r1
predictor.predict(
    {"inputs": "Large model inference is", "parameters": {}}
)
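
The parameters dictionary is forwarded to the model's generation call by the default handlers. A request with explicit generation settings might look like the sketch below; parameter names such as max_new_tokens are assumptions about what the handler forwards, not something documented in this tutorial:

# Sketch: pass generation parameters along with the request
predictor.predict(
    {
        "inputs": "Large model inference is",
        "parameters": {"max_new_tokens": 64},  # assumed to be forwarded to generate()
    }
)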

Clean up the environment

sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()
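
If you also uploaded model artifacts to S3 earlier, you can remove them as well. This is a sketch that assumes the default SageMaker bucket used by sess.upload_data:

# Remove the uploaded model artifacts from the default SageMaker bucket
s3 = boto3.resource("s3")
s3.Bucket(sess.default_bucket()).objects.filter(Prefix=key_prefix).delete()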