Instance Type Selection

While there are many open source LLMs and architectures available, most models tend to fall within a few common parameter-count ranges. The following table provides instance type recommendations for common model parameter counts using half-precision (fp16/bf16) weights.

| Model Parameter Count | Instance Type | Accelerators | Aggregate Accelerator Memory | Sample Models | Estimated Max Batch Size Range |
|---|---|---|---|---|---|
| ~7 billion | ml.g5.2xlarge | 1 x A10G | 24 GB | Llama2-7b, Falcon-7b, GPT-J-6B, Mistral-7b | 32-64 |
| ~13 billion | ml.g5.12xlarge | 4 x A10G | 96 GB | Llama2-13b, CodeLlama-13b, Flan-T5-XXL | 32-64 |
| ~20 billion | ml.g5.12xlarge | 4 x A10G | 96 GB | GPT-NeoX-20b, Flan-UL2 | 16-32 |
| ~35 billion | ml.g5.48xlarge | 8 x A10G | 192 GB | CodeLlama-34b, Falcon-40b | 32-64 |
| ~70 billion | ml.g5.48xlarge | 8 x A10G | 192 GB | Llama2-70b, CodeLlama-70b | 1-8 |
| ~70 billion | ml.p4d.24xlarge | 8 x A100 | 320 GB | Llama2-70b, CodeLlama-70b | 32-64 |
| ~180 billion | ml.p4de.24xlarge | 8 x A100 | 640 GB | Falcon-180b, Bloom-176B | 32-64 |

We recommend starting with the guidance above based on the model parameter count. The estimated max batch size range is conservative; you will likely be able to increase the batch size beyond the recommendation, but the achievable value depends on your model and the maximum sequence length you expect (prompt + generation tokens).

For a more in-depth instance type sizing guide, you can follow the steps below.

Selecting an instance is based on a few factors:

  • Model Size
  • Desired Accelerators (A10, A100, H100, AWS Inferentia, etc.)
    • We recommend using instances with at least A-series GPUs (g5/p4); their performance is much greater than that of older T-series GPUs
    • You should select an instance that has sufficient aggregate memory (across all GPUs) both to load the model and to serve requests at runtime
  • Desired Concurrency/Batch Size
    • Increasing the batch size allows more concurrent requests but requires additional accelerator memory (VRAM)

We will walk through a sizing example using the Llama2-13b model.

Model Size

We can establish a lower bound for the required memory based on the model size. The model size is determined by the number of parameters and the data type. We can quickly estimate the model size in GB from the parameter count (in billions) and the data type like this:

  • Half precision data type (fp16, bf16): Size in GB ≈ Parameters (in billions) * 2 bytes / param
  • Full precision data type (fp32): Size in GB ≈ Parameters (in billions) * 4 bytes / param
  • 8-bit quantized data type (int8, fp8): Size in GB ≈ Parameters (in billions) * 1 byte / param

We recommend using a half precision data type, as it requires less memory than full precision with little to no accuracy loss in most cases.

We estimate the Llama2-13b model to take 13 billion params * 2 bytes / param = 26GB of memory. This is just the memory required to load the model. To execute inference, additional memory is required at runtime.
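
As a quick check on this arithmetic, here is a minimal Python sketch of the same estimate. The function and dtype table are illustrative only and not part of any SageMaker SDK.

```python
# Minimal sketch: estimate weight memory from parameter count and data type,
# using the bytes-per-parameter values listed above. Illustrative only.

BYTES_PER_PARAM = {
    "fp32": 4,  # full precision
    "fp16": 2,  # half precision
    "bf16": 2,  # half precision
    "int8": 1,  # 8-bit quantized
    "fp8": 1,   # 8-bit quantized
}

def model_size_gb(params_in_billions: float, dtype: str = "fp16") -> float:
    """Approximate memory (GB) needed just to load the model weights."""
    return params_in_billions * BYTES_PER_PARAM[dtype]

print(model_size_gb(13, "fp16"))  # Llama2-13b in half precision -> 26.0 GB
```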

Additional Runtime Memory

To estimate the additional memory required at runtime, we need to estimate the size of the KV cache. The KV cache can be thought of as the state of your model for a given generation loop. It stores the Key (K) and Value (V) states for each attention layer in your model.

To estimate the size of the KV cache, we will use the following formula.

KV-Cache Size (bytes / token) = 2 * n_dtype * n_layers * n_hidden_size

Breaking down this formula:

  • 2 comes from the two matrices we need to cache: Key (K), and Value (V)
  • n_dtype represents the number of bytes per value, which is based on the data type (4 for fp32, 2 for fp16, 1 for int8/fp8)
  • n_layers represents the number of transformer blocks in the model (usually num_hidden_layers in the model's config.json)
  • n_hidden_size represents the hidden dimension of the attention block (num_attention_heads * head dimension; usually hidden_size in the model's config.json)

For the Llama2-13b model (fp16 weights, 40 layers, hidden size of 5120), we get the KV-cache size per token as:

  • 2 * 2 * 40 * 5120 = 819,200 bytes / token ≈ 0.00082 GB / token

For a single max-length sequence (4096 tokens for the Llama2-13b model), we require 4096 tokens * 0.00082 GB / token ≈ 3.36 GB.
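
The same calculation as a short sketch (the helper is illustrative; the config values come from Llama2-13b's config.json, where hidden_size is 5120 and num_hidden_layers is 40):

```python
# Minimal sketch of the KV-cache estimate above. Illustrative only.

def kv_cache_bytes_per_token(dtype_bytes: int, n_layers: int, hidden_size: int) -> int:
    # The leading 2 accounts for the two cached matrices: Key (K) and Value (V).
    return 2 * dtype_bytes * n_layers * hidden_size

per_token = kv_cache_bytes_per_token(dtype_bytes=2, n_layers=40, hidden_size=5120)
print(per_token)               # 819200 bytes/token (~0.00082 GB/token)
print(4096 * per_token / 1e9)  # ~3.36 GB for one 4096-token sequence
```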

Selecting an Instance Type

Now that we know the memory required to load the model and have an estimate of the runtime memory required per token, we can figure out which instance type to use. We recommend that you have an understanding of the max sequence length you will be operating with (prompt tokens + generation tokens). Alternatively, you can select an instance type and calculate a max batch size estimate based on the available memory.

Please refer to the SageMaker documentation for an up-to-date list of available instance types.

For our Llama2-13b model, we need a minimum of 26 GB. Let's consider two instance types: ml.g5.12xlarge and ml.g5.48xlarge.

On the ml.g5.12xlarge instance, we will have 96 GB - 26 GB = 70 GB available for batches of sequences at runtime. This would provide us enough memory for roughly 85,300 tokens at a time. This means we can have:

  • batch size of ~20 for maximum sequence length of 4096 tokens
  • batch size of ~40 for maximum sequence length of 2048 tokens
  • And so on

On an ml.g5.48xlarge instance, we will have 192 GB - 26 GB = 166 GB available for batches of sequences at runtime. This would provide us enough memory for roughly 202,400 tokens at runtime. This means we can have:

  • batch size of ~49 for maximum sequence length of 4096 tokens
  • batch size of ~98 for maximum sequence length of 2048 tokens
  • And so on
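
The whole exercise can be folded into one small sketch that reproduces the numbers above (the helper and its argument names are illustrative only, not part of any SageMaker API):

```python
# Minimal sketch: estimate a max batch size from instance memory, model size,
# KV-cache cost per token, and the expected max sequence length. Illustrative only.

def estimate_max_batch_size(accelerator_memory_gb: float,
                            model_size_gb: float,
                            kv_gb_per_token: float,
                            max_seq_len: int) -> int:
    runtime_gb = accelerator_memory_gb - model_size_gb  # memory left for the KV cache
    total_tokens = runtime_gb / kv_gb_per_token         # tokens we can hold at once
    return int(total_tokens // max_seq_len)             # concurrent max-length sequences

kv_gb_per_token = 819_200 / 1e9  # ~0.00082 GB/token for Llama2-13b in fp16

print(estimate_max_batch_size(96, 26, kv_gb_per_token, 4096))   # ml.g5.12xlarge -> 20
print(estimate_max_batch_size(192, 26, kv_gb_per_token, 2048))  # ml.g5.48xlarge -> 98
```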

This exercise demonstrates how to estimate the memory requirements of your model and use case in order to select an instance type. The calculations are estimates, but they should serve as a good starting point for testing. Memory allocation behavior and memory optimization features differ between backends, so actual runtime memory usage will differ from the values derived above. We recommend testing your expected traffic against your setup on a specific instance type to determine the proper configuration.

Next: Backend Selection