Exploring the Transformer Series (16) --- Resource Consumption
0x00 Overview
For standard Transformer models, whether the Encoder-only BERT series or the Decoder-only GPT series, the number of parameters and the computational cost are similar under the same configuration. One key point is that the input, output, and intermediate hidden dims of a standard Transformer block (layer) remain unchanged; they are always the hidden dim of the token embedding, so all Transformer blocks have a very regular structure.
As shown in the figure below, the main parameters of the Encoder come from a handful of weight matrices used in matrix multiplications. Here $d$ is the hidden dim of the token embedding, $l$ is the number of tokens, $h$ is the number of heads in MHA, and $d_{FFN}$ is the intermediate dim after the FFN layer's up-projection. The parameters of the main modules are as follows.
- MHA: $W_Q$, $W_K$, and $W_V$ each have size $d \times d$. Viewed per head, with $h$ heads, each head's $W_Q$, $W_K$, and $W_V$ are $d \times d/h$. There is also a matrix multiplication at the end of MHA, corresponding to $W_O$, whose size is also $d \times d$. Therefore, the weight matrices in MHA contain $4d^2$ parameters.
- FFN: The standard Transformer’s FFN has two Linear layers (first increasing the dimension, then decreasing it), corresponding to the weight matrices $W_1$ and $W_2$. Each has $d \times d_{FFN}$ entries, and the standard value of $d_{FFN}$ is $4d$, so the two FFN weight matrices contain $8d^2$ parameters.

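In summary, in standard Transformer models or the LLaMA series (MHA), if parameters such as vocabulary, embedding, and LayerNorm are ignored, the total number of parameters (for all Transformer blocks) is:

$$N \approx n_{layer} \times (4d^2 + 8d^2) = 12\, n_{layer}\, d^2$$

where $n_{layer}$ is the number of Transformer blocks.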
Note: This chapter references multiple papers, which have different definitions of terms. Because the model structures are also different, the calculation results may differ from those in other sources.
0x01 Background Knowledge
1.1 Data Types
In deep learning, the naming convention for numerical types is generally TypeNum, such as Int64, Float32, and Double64.
- Type: Includes Int, Float, Double, etc.
- Num: Typically 8, 16, 32, 64, 128, indicating the number of bits occupied by this type.
Commonly used numeric data types are shown in the table below.
| Type | Size (in bytes) |
|---|---|
| int4 | 0.5 |
| int8 | 1 |
| int16 | 2 |
| int32 | 4 |
| int64 | 8 |
| float32 | 4 |
| float16 | 2 |
1.2 Number Systems & Conversion
Let’s start with a question: how many gigabytes of video memory does a model with 1B parameters need? Both B and G stand for roughly a billion (1000M or 1024M), but they measure different things: B counts parameters, while G counts bytes of storage.
Number base
B (billion) belongs to the counting units commonly used in English, for example:
1K = 1000, one thousand; 1M = 1000K, one million; 1B = 1000M, one billion.
As you can see, this number system uses 1000 as its base. Taking Qwen-7B as an example, 7B means that this LLM model has 7 billion parameters.
Storage metrics
G is a unit of computer memory/disk storage. The basic unit is the byte, and the units KB / MB / GB / TB each step up by a factor of 1024. When we say a certain number of G or M of video memory, we mean that many gigabytes or megabytes. One byte = 8 bits. For example, a 1000x1000 float32 matrix occupies 1000x1000x4 bytes, approximately 4MB of video memory.
Conversion
It can be seen that $1\text{B} = 10^9$ and $1\text{G} = 2^{30} \approx 1.07 \times 10^9$ bytes. The sizes of 1B and 1G are basically the same, so we treat B and G as equal in magnitude. However, the number of gigabytes of memory corresponding to 1B model parameters depends on the precision of the parameters. For full-precision training (fp32), one parameter corresponds to 32 bits, or 4 bytes. When converting parameters to GPU memory, the value is multiplied by 4, meaning 1B model parameters correspond to 4GB of GPU memory. For fp16 or bf16, the value is multiplied by 2, meaning 1B model parameters correspond to 2GB of GPU memory. See the table below for details.
| Data type | Memory required per 1B parameters |
|---|---|
| fp32 | 4G |
| fp16/bf16 | 2G |
| int8 | 1G |
| int4 | 0.5G |
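The conversion in the table above can be sketched as a small helper. This is a minimal sketch: `BYTES_PER_PARAM` and `param_memory_gb` are illustrative names rather than part of any library, and the result is expressed in decimal gigabytes (1 GB = 1e9 bytes), which the text treats as interchangeable with 2^30 bytes.

```python
# Bytes occupied by one parameter for each data type in the table above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def param_memory_gb(num_params: float, dtype: str) -> float:
    """Memory (in GB, 1 GB = 1e9 bytes) needed just to store the weights."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # Example: a hypothetical 7B-parameter model at each precision.
    for dtype in BYTES_PER_PARAM:
        print(f"7B parameters in {dtype}: {param_memory_gb(7e9, dtype):.1f} GB")
```

This only covers the weights; gradients, optimizer states, and activations come on top of it during training.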
1.3 Parameter Memory Usage
Only modules with parameters use GPU memory for weights. This portion of GPU memory is independent of the input and is occupied as soon as the model is loaded. Typical convolutional layers use GPU memory in this way, while the ReLU activation layers we frequently use have no parameters and therefore occupy no parameter memory.
Parameterized layers
Commonly used modules with parameters include:
- Convolutional layers, typically conv2d.
- Fully connected layer, also known as Linear layer.
- BatchNorm layer.
- Embedding layer.
Parameterless layer
Common parameterless modules include:
- Most activation layers, such as Sigmoid/ReLU.
- Pooling layer.
- Dropout.
Required resources
We can calculate the memory usage of a neural network using the following formula: Memory Usage = Model Memory Usage + Input/Output Related Memory.
Model memory usage refers to the memory usage of the model that is independent of the input, and mainly includes:
- Model weight parameters.
- Gradients (usually the same size as the parameters).
- Optimizer states (closely tied to the specific optimizer; for example, plain SGD keeps no extra state, momentum-SGD keeps one momentum buffer the same size as the gradient, and Adam keeps two buffers, the first and second moments, i.e. twice the size of the gradient).
The main video memory usage related to input/output is as follows:
- batch_size × memory usage per sample.
- Each layer’s feature map, because activations must be saved for backpropagation.
Due to factors such as backpropagation, Adam optimization, and Transformer architecture, training generally requires 3-4 times more GPU memory than inference of the same scale.
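The input-independent part of the formula above can be sketched as follows. The sketch assumes that weights, gradients, and optimizer states are all stored at the same precision (fp32 by default) and counts zero, one, or two extra buffers for SGD, momentum-SGD, and Adam, as described in the list; `OPTIMIZER_STATES` and `model_state_bytes` are illustrative names.

```python
# Number of extra per-parameter optimizer buffers, following the list above.
OPTIMIZER_STATES = {"sgd": 0, "momentum_sgd": 1, "adam": 2}


def model_state_bytes(num_params: int, optimizer: str, bytes_per_value: int = 4) -> int:
    """Weights + gradients + optimizer states, all at the same precision."""
    copies = 1 + 1 + OPTIMIZER_STATES[optimizer]  # weights, gradients, states
    return num_params * copies * bytes_per_value


if __name__ == "__main__":
    n = 1_000_000_000  # a hypothetical 1B-parameter model
    for name in OPTIMIZER_STATES:
        print(f"{name}: {model_state_bytes(n, name) / 1e9:.0f} GB")
```

Activation memory is deliberately left out here because it depends on the batch size and the input shape.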
1.4 Computational complexity
The above text mentions that the computational complexity of the Transformer's self-attention is $O(n^2 d)$, where $n$ is the sequence length and $d$ is the hidden dimension. Big O notation describes how the amount of computation grows with the input size, not the actual computational cost. The actual computational cost is usually expressed in FLOPs. Here are some common units:
- FLOPs: short for floating point operations, i.e. the number of floating-point operations, usually counting multiplications and additions. It can be understood as the amount of computation and is used to measure the complexity of an algorithm/model.
- One GFLOPS (gigaFLOPS) = one billion (= $10^9$) floating-point operations per second. Note that FLOPS with a capital S is a rate (hardware throughput), whereas FLOPs is an amount of computation.
- One TFLOPS (teraFLOPS) = one trillion (= $10^{12}$) floating-point operations per second.
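To make the difference between an amount of computation and a throughput concrete, the sketch below counts the FLOPs of a dense matrix multiplication using the common convention of 2 FLOPs (one multiply and one add) per multiply-accumulate, then divides by a hardware rate. The function names are illustrative, and the 100 TFLOPS figure is a made-up example value rather than the specification of any particular GPU.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for multiplying an [m, k] matrix by a [k, n] matrix."""
    return 2 * m * k * n  # one multiply + one add per inner-product term


def ideal_seconds(flops: int, tflops: float) -> float:
    """Lower bound on runtime at a sustained throughput given in TFLOPS."""
    return flops / (tflops * 1e12)


if __name__ == "__main__":
    work = matmul_flops(4096, 4096, 4096)
    print(f"{work / 1e9:.1f} GFLOPs of work")
    print(f"{ideal_seconds(work, tflops=100.0) * 1e3:.3f} ms at 100 TFLOPS")
```

Real kernels rarely reach peak throughput, so the time computed this way is only a lower bound.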
0x02 Transformer Parameter Count
Taking the decoder-only model as an example, it mainly consists of three parts: embedding, decoder, and head. The most important part is the decoder, which is composed of several decoder-layers. Each decoder-layer is further divided into two parts: MHA and FFN. We will now examine the number of parameters in each of these modules.
2.1 Terminology
Let’s first give the terminology used in this section.
| Symbol | Meaning |
|---|---|
| d | The model’s word embedding size (hidden state dimension / positional encoding size) |
| h | Attention head count |
| s | Total text length (prompt + decoder output) |
| b | Data batch size |
| l | Number of Transformer layers |
| v | Vocabulary size |
2.2 Embedding Layer
The input shape of the embedding layer is [b, s, v], the output shape is [b, s, d], and the number of parameters is $vd$.
If trainable positional encoding is used, there will be some trainable model parameters, but their number is relatively small. If relative positional encoding is used, such as RoPE and ALiBi, there are no trainable model parameters. Therefore, we ignore the parameters of positional encoding.
2.3 Transformer Layer
The Transformer model consists of l identical layers, each mainly divided into two parts: MHA and FFN. Since multi-head is only a logical partition and does not physically add modules, multi-head will be omitted in the following discussion (if multi-head is discussed in some papers, we will follow the paper’s interpretation). And since the Decoder-only model uses self-attention, we will assume that the dimensions of Q, K, V, and O are equal.
MHA
MHA contains four weight matrices $W_Q$, $W_K$, $W_V$, and $W_O$, plus their biases (some models may not have biases). Each of the four weight matrices has shape [d, d], and each of the four biases has shape [d]. Therefore, the number of parameters of the multi-head attention layer is:

$$4d^2 + 4d$$
FFN
FFN consists of two linear layers.
- The first layer maps the original dimension to four times its size, that is, from $d$ to $4d$. The weight matrix $W_1$ has a shape of [d, 4d], and the bias has a shape of [4d]. The number of parameters is $4d^2 + 4d$.
- The second layer reduces the dimension from four times the original back to the original, that is, from $4d$ to $d$. The weight matrix $W_2$ has a shape of [4d, d], and the bias has a shape of [d]. The number of parameters is $4d^2 + d$.
The final parameter count of FFN is:

$$8d^2 + 5d$$
LayerNorm
For LayerNorm, the scaling parameter $\gamma$ and the shift parameter $\beta$ both have dimension $d$, so one LayerNorm has $2d$ parameters. Because both MHA and FFN have a LayerNorm, the total number of parameters is $4d$.
Summary
In summary, the number of parameters in a single Transformer layer is:

$$(4d^2 + 4d) + (8d^2 + 5d) + 4d = 12d^2 + 13d$$
2.4 lm_head
lm_head is a component in a natural language processing model. Its main function is to transform the model’s output (usually the hidden state after being processed by the Transformer encoder) into a probability distribution for predicting the next word.
The head and the embedding have the same number of parameters, $vd$. With tied embeddings (i.e., the head weight matrix and the word embedding matrix share parameters), the two share a single parameter matrix.
2.5 Final Parameter Quantity
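Ultimately, the number of trainable model parameters for the l-layer transformer model is:

$$l(12d^2 + 13d) + 2vd$$

(or $l(12d^2 + 13d) + vd$ when the head and the embedding are tied).
When d is large, the first-order term can be ignored, and the number of model parameters is approximately:

$$12ld^2$$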
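The per-module counts from sections 2.2 to 2.4 can be added up in a few lines. This is a minimal sketch under the same assumptions as the text (biases on every linear layer, two LayerNorms per layer, FFN width 4d, positional-encoding parameters ignored); the function name `transformer_param_count` and the `tied` flag are illustrative.

```python
def transformer_param_count(d: int, l: int, v: int, tied: bool = False) -> int:
    """Trainable parameters of an l-layer decoder-only Transformer."""
    mha = 4 * d * d + 4 * d                      # W_Q, W_K, W_V, W_O + biases
    ffn = (d * 4 * d + 4 * d) + (4 * d * d + d)  # up-projection + down-projection
    layer_norm = 2 * (2 * d)                     # two LayerNorms, scale + shift
    per_layer = mha + ffn + layer_norm           # = 12d^2 + 13d
    embedding = v * d
    head = 0 if tied else v * d                  # lm_head shares weights if tied
    return l * per_layer + embedding + head


if __name__ == "__main__":
    # Hypothetical configuration, chosen only to compare exact and approximate counts.
    d, l, v = 4096, 32, 32000
    exact = transformer_param_count(d, l, v)
    approx = 12 * l * d * d
    print(f"exact: {exact:,}  approx 12*l*d^2: {approx:,}")
```

For large d the two numbers differ mainly by the embedding and head terms, which is why $12ld^2$ is a convenient rule of thumb.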

2.6 LLaMA3
Next, let’s look at what sets LLaMA 3, a model widely used in industry, apart from the standard Transformer.
SwiGLU
Models like LLaMA use the SwiGLU activation in their FFN, which adds a third weight matrix. The LLaMA paper mentions that with SwiGLU, $d_{FFN}$ is reduced from $4d$ to $8d/3$. This way, the three weight matrices together still contain $3 \times d \times 8d/3 = 8d^2$ parameters, so the total number of parameters can still be estimated with:

$$12ld^2$$
GQA
The formula above corresponds to MHA (Multi-Head Attention), which is also the standard implementation of the LLaMA-1 series models. However, the 34B and 70B models of LLaMA-2, as well as all models of LLaMA-3, use GQA (Grouped Query Attention). With GQA, multiple attention heads share one key and value: $W_K$ and $W_V$ shrink from $d \times d$ to $d \times d/g$, where $g$ is the number of heads that share the same key and value. The LLaMA 2 paper mentions that, in order to keep the total number of parameters the same with GQA as with MHA, LLaMA 2 multiplies the FFN dim by 1.3 for the GQA models.
After the above adjustments, LLaMA 3 is no longer a standard Transformer block, and using $12ld^2$ to estimate the parameter count is no longer very accurate. However, the count can still be tallied in four parts: $W_Q$ and $W_O$, $W_K$ and $W_V$, the FFN matrices $W_{FFN}$, and the embedding plus lm_head. For example, for the LLaMA 3 model, we can estimate its parameter count as follows:

$$N \approx l \times \left(2d^2 + \frac{2d^2}{g} + 3\, d\, d_{FFN}\right) + 2vd$$
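As a rough check of this kind of estimate, the sketch below tallies the attention projections, the three SwiGLU matrices, and the embedding plus lm_head. The configuration values (d = 4096, 32 layers, 32 query heads, 8 key/value heads, FFN dim 14336, vocabulary 128256) are an assumption taken from the publicly released LLaMA 3 8B configuration rather than from the text above; biases and norm parameters are ignored, and the embedding and head are assumed to be untied. The function name is illustrative.

```python
def gqa_swiglu_param_count(d: int, l: int, n_heads: int, n_kv_heads: int,
                           d_ffn: int, v: int) -> int:
    """Estimate parameters of a GQA + SwiGLU decoder (no biases, norms ignored)."""
    g = n_heads // n_kv_heads        # query heads sharing one key/value head
    w_q_o = 2 * d * d                # W_Q and W_O keep the full [d, d] shape
    w_k_v = 2 * d * (d // g)         # W_K and W_V shrink to [d, d/g]
    w_ffn = 3 * d * d_ffn            # gate, up, and down projections
    w_emb_head = 2 * v * d           # untied embedding + lm_head
    return l * (w_q_o + w_k_v + w_ffn) + w_emb_head


if __name__ == "__main__":
    total = gqa_swiglu_param_count(d=4096, l=32, n_heads=32, n_kv_heads=8,
                                   d_ffn=14336, v=128256)
    print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```

Under these assumptions the estimate lands close to the advertised 8B, while $12ld^2$ with the same d and l would give roughly 6.4B, which illustrates why the per-part tally is preferable here.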
0x03 Transformer Memory Usage
3.1 Training
During neural network training, GPU memory is mainly consumed by four parts: model parameters, intermediate activations generated during forward computation, gradients obtained from backpropagation, and optimizer states. The latter three can be several times larger than the model parameters themselves, so training needs far more memory than the weights alone.
When training large models, the AdamW optimizer is often used, along with mixed-precision training to accelerate the process. Based on this premise, we analyze memory usage. In a single training iteration, each trainable model parameter needs to store the parameter itself, its corresponding gradient, and the optimizer’s two states for that parameter (the first and second moments in Adam). Let the number of model parameters be $\Phi$; then the number of gradient elements is $\Phi$, and the number of AdamW optimizer state elements is $2\Phi$. In mixed-precision training, half precision is used for the forward and backward computations, while single precision is used for the optimizer states, gradients, and parameters when the optimizer updates the model. Each parameter and each gradient is therefore held twice, once in half precision (2 bytes) and once in single precision (4 bytes), and the two optimizer states are held in single precision (4 bytes each). Thus, when using the AdamW optimizer and mixed-precision training, the training phase occupies (2+4)+(2+4)+(4+4)=20 bytes for each trainable model parameter. For a large model with $\Phi$ parameters, the memory occupied by the model parameters, gradients, and optimizer states is $20\Phi$ bytes.
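The 20-bytes-per-parameter accounting can be written out explicitly. This is a minimal sketch of the bookkeeping described above (fp16 plus fp32 copies of weights and gradients, two fp32 AdamW states); the function name is illustrative, and actual frameworks may keep fewer copies than this.

```python
def mixed_precision_state_bytes(num_params: int) -> int:
    """Bytes for weights, gradients, and AdamW states under mixed precision."""
    weights = 2 + 4    # fp16 working copy + fp32 master copy
    gradients = 2 + 4  # fp16 gradient + fp32 gradient
    adam = 4 + 4       # fp32 first moment + fp32 second moment
    return num_params * (weights + gradients + adam)


if __name__ == "__main__":
    phi = 7_000_000_000  # a hypothetical 7B-parameter model
    print(f"{mixed_precision_state_bytes(phi) / 1e9:.0f} GB before activations")
```

Even before counting activations, the model states of a 7B-parameter model exceed the memory of a single typical accelerator, which is what motivates sharding and offloading techniques.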

The space requirements for model parameters, gradients, and optimizer states have been calculated. The next step is to determine the space requirements for the intermediate activations during forward propagation. We will analyze this in a later section.
Model training involves Forward and Backward processes. The Backward process actually consists of two parts: gradients to the input (using the chain rule) and gradients to the weights. The main computational cost of these two parts is matrix multiplication, and the magnitudes are the same as in the Forward process. Therefore, it is often directly approximated that the computational cost of Backward is twice that of Forward.
3.2 Inference
The inference phase typically requires less GPU memory than the training phase because it doesn’t involve extensive computations such as gradient calculations and parameter updates. Without gradients, optimizer states, and intermediate activations, the model’s inference phase consumes significantly less GPU memory than the training phase.
If a key-value cache is used to accelerate the inference process, the key-value cache also needs to occupy video memory. The video memory occupied by the key-value cache will be described in detail below, so it will be omitted here. In addition, the input data also needs to be placed on the GPU, as well as some intermediate results (intermediate results during the inference process will be released as soon as they are used up). However, the video memory occupied by this part is very small and can be ignored.
Ultimately, the main memory usage during the inference phase is for the model’s parameters, where model parameter memory = n × p. n is the total number of model parameters, and p is the number of bytes occupied by each parameter. If half-precision inference is used, and each parameter occupies 2 bytes, then the memory usage of the model during inference is approximately:

$$2n \text{ bytes}$$

The following are some key factors that determine the GPU memory required for model inference:
- Model structure: The model structure includes the number of layers, the number of neurons in each layer, and the size of the convolutional kernel. Deeper models typically require more GPU memory because each layer produces intermediate computation results.
- Input data: The amount of video memory required for inference depends on the size of the input data. Larger input data will require more video memory.
- Batch Size: Batch size refers to the number of samples processed in a single inference iteration. A larger batch size may increase GPU memory usage because the computation results for multiple samples need to be stored simultaneously.
- Data type: The data type used (such as single-precision floating-point number, half-precision floating-point number) also affects video memory requirements. Lower precision data types generally reduce video memory requirements.
- Intermediate calculations: During the inference process of the model, some intermediate calculation results may be generated, which will also occupy a certain amount of GPU memory.
3.3 Activation
Activations during training refer to all tensors computed during forward propagation and used during backpropagation. These activations do not include model parameters and optimizer states, but they do include the mask matrix used in the dropout operation.
In a single training iteration, the memory usage of model parameters (or gradients) depends only on the number and data type of the parameters, and is independent of the input data size. Similarly, the memory usage of the optimizer state depends on the optimizer type and the number of model parameters, but is independent of the input data size. Intermediate activation values, however, are positively correlated with the input data size (batch size b and sequence length s). As both batch size b and sequence length s increase, the memory usage of intermediate activations increases proportionally. When encountering an Out Of Memory (OOM) error during neural network training, we typically try to reduce the batch size to avoid this problem. However, this approach actually reduces the memory usage of intermediate activations, not the memory used for model parameters, gradients, or the optimizer.
Next, we will take Megatron from the paper “Reducing Activation Recomputation in Large Transformer Models” as an example to calculate the memory usage of intermediate activation step by step.
Architecture
The image below shows the architecture of Megatron.

The code is shown below. It specifies that core_attention is submodules.core_attention and linear_proj is submodules.linear_proj.
class Attention(MegatronModule, ABC):
    """Attention layer abstract class.

    This layer only contains common modules required for the "self attn" and
    "cross attn" specializations.
    """

    def __init__(
        self,
        config: TransformerConfig,
        submodules: Union[SelfAttentionSubmodules, CrossAttentionSubmodules],
        layer_number: int,
        attn_mask_type: AttnMaskType,
        attention_type: str,
    ):
        super().__init__(config=config)

        self.config = config
        self.layer_number = layer_number
        self.attn_mask_type = attn_mask_type
        self.attention_type = attention_type

        # For normal attention without groups, num_query_groups == num_attention_heads,
        # so these two will be the same
        self.query_projection_size = self.config.kv_channels * self.config.num_attention_heads
        self.kv_projection_size = self.config.kv_channels * self.config.num_query_groups

        # Per attention head and per partition values.
        world_size = parallel_state.get_tensor_model_parallel_world_size()
        self.hidden_size_per_attention_head = divide(
            self.query_projection_size, self.config.num_attention_heads
        )
        self.num_attention_heads_per_partition = divide(self.config.num_attention_heads, world_size)
        self.num_query_groups_per_partition = divide(self.config.num_query_groups, world_size)

        self.core_attention = build_module(
            submodules.core_attention,
            config=self.config,
            layer_number=self.layer_number,
            attn_mask_type=self.attn_mask_type,
            attention_type=self.attention_type,
        )

        self.checkpoint_core_attention = self.config.recompute_granularity == 'selective'

        # Output.
        self.linear_proj = build_module(
            submodules.linear_proj,
            self.query_projection_size,
            self.config.hidden_size,
            config=self.config,
            init_method=self.config.output_layer_init_method,
            bias=self.config.add_bias_linear,
            input_is_parallel=True,
            skip_bias_add=True,
            is_expert=False,
            tp_comm_buffer_name='proj',
        )

    def forward(
        self,
        hidden_states,
        attention_mask,
        key_value_states=None,
        inference_params=None,
        rotary_pos_emb=None,
        packed_seq_params=None,
    ):
        # hidden_states: [sq, b, h]

        # For self attention we just duplicate the rotary_pos_emb if it isn't already
        if rotary_pos_emb is not None and not isinstance(rotary_pos_emb, tuple):
            rotary_pos_emb = (rotary_pos_emb,) * 2

        # =====================
        # Query, Key, and Value
        # =====================
        # Get the query, key and value tensors based on the type of attention -
        # self or cross attn.
        query, key, value = self.get_query_key_value_tensors(hidden_states, key_value_states)

        # ===================================================
        # Adjust key, value, and rotary_pos_emb for inference
        # ===================================================
        key, value, rotary_pos_emb, attn_mask_type = self._adjust_key_value_for_inference(
            inference_params, key, value, rotary_pos_emb
        )

        if packed_seq_params is not None:
            query = query.squeeze(1)
            key = key.squeeze(1)
            value = value.squeeze(1)

        # ================================================
        # relative positional embedding (rotary embedding)
        # ================================================
        if rotary_pos_emb is not None:
            q_pos_emb, k_pos_emb = rotary_pos_emb

            if packed_seq_params is not None:
                cu_seqlens_q = packed_seq_params.cu_seqlens_q
                cu_seqlens_kv = packed_seq_params.cu_seqlens_kv
            else:
                cu_seqlens_q = cu_seqlens_kv = None
            query = apply_rotary_pos_emb(
                query, q_pos_emb, config=self.config, cu_seqlens=cu_seqlens_q
            )
            key = apply_rotary_pos_emb(key, k_pos_emb, config=self.config, cu_seqlens=cu_seqlens_kv)

            # TODO, can apply positional embedding to value_layer so it has
            # absolute positional embedding.
            # otherwise, only relative positional embedding takes effect
            # value_layer = apply_rotary_pos_emb(value_layer, k_pos_emb)

        # ==================================
        # core attention computation
        # ==================================
        if self.checkpoint_core_attention and self.training:
            core_attn_out = self._checkpointed_attention_forward(
                query,
                key,
                value,
                attention_mask,
                attn_mask_type=attn_mask_type,
                packed_seq_params=packed_seq_params,
            )
        else:
            core_attn_out = self.core_attention(
                query,
                key,
                value,
                attention_mask,
                attn_mask_type=attn_mask_type,
                packed_seq_params=packed_seq_params,
            )

        if packed_seq_params is not None:
            # reshape to same output shape as unpacked case
            # (t, np, hn) -> (t, b=1, h=np*hn)
            # t is the pack size = sum (sq_i)
            # note that batch is a dummy dimension in the packed case
            core_attn_out = core_attn_out.reshape(core_attn_out.size(0), 1, -1)

        # =================
        # Output. [sq, b, h]
        # =================
        output, bias = self.linear_proj(core_attn_out)  # this is the linear layer

        return output, bias
The final attention code is:
class DotProductAttention(MegatronModule):
    """
    Region where selective activation recomputation is applied.
    This region is memory intensive but less compute intensive which
    makes activation checkpointing more efficient for LLMs (20B+).
    See Reducing Activation Recomputation in Large Transformer Models:
    https://arxiv.org/abs/2205.05198 for more details.

    We use the following notation:
     h: hidden size
     n: number of attention heads
     p: number of tensor model parallel partitions
     b: batch size
     s: sequence length
    """

    def __init__(
        self,
        config: TransformerConfig,
        layer_number: int,
        attn_mask_type: AttnMaskType,
        attention_type: str,
        attention_dropout: float = None,
    ):
        super().__init__(config=config)

        self.config: TransformerConfig = config

        assert (
            self.config.context_parallel_size == 1
        ), "Context parallelism is only supported by TEDotProductAttention!"

        assert (
            self.config.window_size is None
        ), "Sliding Window Attention is only supported by TEDotProductAttention!"

        self.layer_number = max(1, layer_number)
        self.attn_mask_type = attn_mask_type
        self.attention_type = attention_type  # unused for now

        projection_size = self.config.kv_channels * self.config.num_attention_heads

        # Per attention head and per partition values.
        world_size = parallel_state.get_tensor_model_parallel_world_size()
        self.hidden_size_per_partition = divide(projection_size, world_size)
        self.hidden_size_per_attention_head = divide(projection_size, config.num_attention_heads)
        self.num_attention_heads_per_partition = divide(self.config.num_attention_heads, world_size)
        self.num_query_groups_per_partition = divide(self.config.num_query_groups, world_size)

        coeff = None
        self.norm_factor = math.sqrt(self.hidden_size_per_attention_head)
        if self.config.apply_query_key_layer_scaling:
            coeff = self.layer_number
            self.norm_factor *= coeff

        self.scale_mask_softmax = FusedScaleMaskSoftmax(
            input_in_fp16=self.config.fp16,
            input_in_bf16=self.config.bf16,
            attn_mask_type=self.attn_mask_type,
            scaled_masked_softmax_fusion=self.config.masked_softmax_fusion,
            mask_func=attention_mask_func,
            softmax_in_fp32=self.config.attention_softmax_in_fp32,
            scale=coeff,
        )

        # Dropout. Note that for a single iteration, this layer will generate
        # different outputs on different number of parallel partitions but
        # on average it should not be partition dependent.
        self.attention_dropout = torch.nn.Dropout(
            self.config.attention_dropout if attention_dropout is None else attention_dropout
        )

    def forward(
        self,
        query: Tensor,
        key: Tensor,
        value: Tensor,
        attention_mask: Tensor,
        attn_mask_type: AttnMaskType = None,
        packed_seq_params: Optional[PackedSeqParams] = None,
    ):
        assert packed_seq_params is None, (
            "Packed sequence is not supported by DotProductAttention."
            "Please use TEDotProductAttention instead."
        )

        # ===================================
        # Raw attention scores. [b, n/p, s, s]
        # ===================================

        # expand the key and value [sk, b, ng, hn] -> [sk, b, np, hn]
        # This is a noop for normal attention where ng == np. When using group query attention this
        # creates a view that has the keys and values virtually repeated along their dimension to
        # match the number of queries.

        # attn_mask_type is not used.
        if self.num_attention_heads_per_partition // self.num_query_groups_per_partition > 1:
            key = key.repeat_interleave(
                self.num_attention_heads_per_partition // self.num_query_groups_per_partition, dim=2
            )
            value = value.repeat_interleave(
                self.num_attention_heads_per_partition // self.num_query_groups_per_partition, dim=2
            )

        # [b, np, sq, sk]
        output_size = (query.size(1), query.size(2), query.size(0), key.size(0))

        # [sq, b, np, hn] -> [sq, b * np, hn]
        # This will be a simple view when doing normal attention, but in group query attention
        # the key and value tensors are repeated to match the queries so you can't use
        # simple strides to extract the queries.
        query = query.reshape(output_size[2], output_size[0] * output_size[1], -1)
        # [sk, b, np, hn] -> [sk, b * np, hn]
        key = key.view(output_size[3], output_size[0] * output_size[1], -1)

        # preallocting input tensor: [b * np, sq, sk]
        matmul_input_buffer = parallel_state.get_global_memory_buffer().get_tensor(
            (output_size[0] * output_size[1], output_size[2], output_size[3]), query.dtype, "mpu"
        )

        # Raw attention scores. [b * np, sq, sk]
        matmul_result = torch.baddbmm(
            matmul_input_buffer,
            query.transpose(0, 1),  # [b * np, sq, hn]
            key.transpose(0, 1).transpose(1, 2),  # [b * np, hn, sk]
            beta=0.0,
            alpha=(1.0 / self.norm_factor),
        )

        # change view to [b, np, sq, sk]
        attention_scores = matmul_result.view(*output_size)

        # ===========================
        # Attention probs and dropout ----------------- the softmax dropout happens here
        # ===========================

        # attention scores and attention mask [b, np, sq, sk]
        attention_probs: Tensor = self.scale_mask_softmax(attention_scores, attention_mask)

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        if not self.config.sequence_parallel:
            with tensor_parallel.get_cuda_rng_tracker().fork():
                attention_probs = self.attention_dropout(attention_probs)
        else:
            attention_probs = self.attention_dropout(attention_probs)

        # =========================
        # Context layer. [sq, b, hp]
        # =========================

        # value -> context layer.
        # [sk, b, np, hn] --> [b, np, sq, hn]

        # context layer shape: [b, np, sq, hn]
        output_size = (value.size(1), value.size(2), query.size(0), value.size(3))

        # change view [sk, b * np, hn]
        value = value.view(value.size(0), output_size[0] * output_size[1], -1)

        # change view [b * np, sq, sk]
        attention_probs = attention_probs.view(output_size[0] * output_size[1], output_size[2], -1)

        # matmul: [b * np, sq, hn]
        context = torch.bmm(attention_probs, value.transpose(0, 1))

        # change view [b, np, sq, hn]
        context = context.view(*output_size)

        # [b, np, sq, hn] --> [sq, b, np, hn]
        context = context.permute(2, 0, 1, 3).contiguous()

        # [sq, b, np, hn] --> [sq, b, hp]
        new_context_shape = context.size()[:-2] + (self.hidden_size_per_partition,)
        context = context.view(*new_context_shape)

        return context
Terminology Explanation
Let’s first look at the terminology in the paper.
- a: the number of attention heads in the transformer model.
- b: the batch size on each GPU.
- h: the hidden dimension of each transformer layer.
- L: the number of layers in the Transformer.
- p: the number of parallel machines in pipeline parallelism.
- s: the length of the sentence, that is, the number of tokens in the sequence.
- t: the number of parallel machines in tensor parallelism.
- v: the size of the vocabulary.
We assume the activation data type is fp16.
Data volume
Each Transformer layer consists of an attention layer and an MLP, together with two LayerNorms. Below, we derive the memory required to store the activations of each component. Several points should be noted in the following analysis:
- The unit is bytes, not the number of elements.
- Large models typically employ mixed precision training. Therefore, when analyzing the memory usage of intermediate activations, we assume that the intermediate activation values are stored in float16 or bfloat16 format, with each element occupying 2 bytes. The only exception is the mask matrix used in dropout operations, where each element occupies only 1 byte.
- When analyzing the memory usage of intermediate activations, only the major contributors are considered, ignoring smaller buffers. For example, for layer normalization, calculating the gradient requires the layer’s input, mean, and variance. The input contains bsh elements, while the mean and variance each contain bs elements. Since h is typically quite large (on the order of thousands), bsh ≫ bs. Therefore, for layer normalization, the intermediate activation is approximately estimated as bsh, not bsh + 2bs.
attention block
The activation of the attention block is as follows.
| Saved content | Operation | Activation size | Module | Reason for storage |
|---|---|---|---|---|
| X | Matrix multiplications that produce Query (Q), Key (K), Value (V) | 2bsh | self attention | Save the common input X of Q/K/V |
| Q, K | Matrix multiplication $QK^T$ | 4bsh | self attention | Keep the inputs of the matrix multiplication |
| $QK^T$ | Softmax | $2bs^2a$ | self attention | Store the input to Softmax, of shape [b, a, s, s] |
| Mask | Softmax dropout | $bs^2a$ | self attention | Save the Softmax dropout mask; same shape as $QK^T$, but only one byte per element is needed |
| V | Attention computation | 2bsh | self attention | Keep input V |
| Score | Attention computation | $2bs^2a$ | self attention | Keep the input score matrix |
| Linear | Output projection | 2bsh | linear projection | The output projection needs to save its input |
| Mask | Attention dropout | bsh | attention dropout | Dropout after attention needs to save the mask matrix, which needs only one byte per element |
| Total | | $11bsh + 5bs^2a$ | | |
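Let’s review the calculation logic of MHA as follows:

$$Q = XW_Q,\quad K = XW_K,\quad V = XW_V$$

$$\text{score} = \text{dropout}\left(\text{softmax}\left(\frac{QK^T}{\sqrt{h/a}}\right)\right),\quad Z = \text{score} \cdot V$$

$$X_{out} = \text{dropout}(ZW_O) + X$$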
The calculations in the table above are explained below.
- Input X. X is used to calculate Q, K, and V. The shape of X is [b, s, h], and the number of elements is bsh. FP16 occupies two bytes per element, so the video memory is 2bsh.
- Intermediate activations Q and K are used to calculate $QK^T$. Both Q and K have shape [b, s, h] and element type FP16, and together they occupy 4bsh of video memory.
- Intermediate activation $QK^T$. $QK^T$ is the input to softmax, the element type is FP16, and the size of the video memory it occupies is $2bs^2a$, where a is the number of attention heads. The shape of Q is [b, a, s, h/a], $K^T$ has the shape [b, a, h/a, s], and $QK^T$ has the shape [b, a, s, s].
- The mask matrix used in dropout. After the softmax operation is complete, the dropout operation is performed, and a mask matrix needs to be saved. The mask matrix has the same shape as $QK^T$, its elements are one-byte integers, and it occupies $bs^2a$ of video memory.
- The score weight matrix and V are used to calculate Z.
  - After softmax and dropout, the score weight matrix is obtained, which has a size of $2bs^2a$.
  - The shape of V is [b, s, h], the element type is FP16, and the size of the video memory occupied is 2bsh.
- The output stage includes an output projection and a dropout operation. The output projection needs to store its input, which is 2bsh in size; the dropout needs to store a mask matrix, which is bsh in size. The total memory usage for both is 3bsh.
Therefore, the memory size occupied by the intermediate activations of the self-attention block obtained by adding the above intermediate activations is:
MLP
The two linear layers of FFN store their inputs in sizes 2sbh and 8sbh, respectively. The GeLU nonlinearity also requires an input of size 8sbh for backpropagation. Finally, dropout stores its mask in sbh size. In total, the MLP block requires 19sbh bytes of storage.
| Module | Action | Activation size |
|---|---|---|
| linear 1 | The first linear layer needs to save its input. | 2bsh |
| GeLU | The activation function needs to save its input. | 8bsh |
| linear 2 | The second linear layer needs to save its input. | 8bsh |
| dropout | Finally, there is a dropout operation that needs to save the mask matrix. | bsh |
| total | | 19bsh |
Let’s review the calculation logic of MLP:

$$x = f_{gelu}(x_{out} W_1) W_2 + x_{out}$$

The entries in the table above are explained below.
- The first linear layer needs to store its input, occupying `2bsh` of GPU memory.
- The activation function needs to save its input, which is four times wider and occupies `8bsh` of GPU memory.
- The second linear layer needs to store its input, occupying `8bsh` of GPU memory.
- Finally, there is a dropout operation that needs to save the mask matrix, which occupies `bsh` of GPU memory.

Therefore, for MLP blocks, the intermediate activations that need to be saved total `19bsh`.
LayerNorm
In addition, the self-attention block and the MLP block each correspond to a layer normalization. Each layer normalization needs to store its input, which is 2sbh in size. The intermediate activations of the two layer normalizations need to be stored, which is 4sbh in size.
Summary
In summary, the amount of GPU memory required to store intermediate activations in each transformer layer is the sum of the attention block, the MLP block, and the two layer normalizations:

$$(11bsh + 5bs^2a) + 19bsh + 4bsh = 34bsh + 5bs^2a$$

For an l-layer transformer model, there are also embedding layers, the final LayerNorm, and the output layer. When the hidden dimension h is large and the number of layers l is large, the intermediate activations in this part are comparatively small and can be ignored. Therefore, for an l-layer transformer model, the memory occupied by intermediate activations can be approximated as:

$$(34bsh + 5bs^2a) \cdot l$$
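The bookkeeping above can be checked with a few lines of Python. This is a minimal sketch under the same assumptions as the text (FP16 activations at 2 bytes each, dropout masks at 1 byte each, FFN width 4h); the function names and the example configuration are illustrative and not taken from any library.

```python
def attention_activation_bytes(b: int, s: int, h: int, a: int) -> int:
    """Activation bytes saved by one self-attention block."""
    linear_terms = (
        2 * b * s * h    # X, shared input of the Q/K/V projections
        + 4 * b * s * h  # Q and K, inputs of Q @ K^T
        + 2 * b * s * h  # V, input of scores @ V
        + 2 * b * s * h  # input of the output projection
        + 1 * b * s * h  # dropout mask after the output projection
    )
    quadratic_terms = (
        2 * b * a * s * s    # Q @ K^T, input of softmax
        + 1 * b * a * s * s  # dropout mask after softmax
        + 2 * b * a * s * s  # score matrix, input of scores @ V
    )
    return linear_terms + quadratic_terms


def mlp_activation_bytes(b: int, s: int, h: int) -> int:
    """Inputs of linear 1 (2bsh), GeLU (8bsh), linear 2 (8bsh) plus dropout mask (bsh)."""
    return (2 + 8 + 8 + 1) * b * s * h


def layernorm_activation_bytes(b: int, s: int, h: int) -> int:
    """Two layer normalizations per layer, each saving a 2bsh input."""
    return 2 * (2 * b * s * h)


def layer_activation_bytes(b: int, s: int, h: int, a: int) -> int:
    return (attention_activation_bytes(b, s, h, a)
            + mlp_activation_bytes(b, s, h)
            + layernorm_activation_bytes(b, s, h))


def model_activation_bytes(b: int, s: int, h: int, a: int, l: int) -> int:
    """Approximation that ignores embeddings, the final LayerNorm and the output layer."""
    return l * layer_activation_bytes(b, s, h, a)


# Hypothetical configuration, chosen only for illustration.
total = model_activation_bytes(b=1, s=2048, h=4096, a=32, l=32)
print(f"{total / 2**30:.1f} GiB of activations")
```

Because the quadratic term scales with `s * s * a`, it dominates the linear term once the sequence is long relative to `h / a`.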
In contrast, the image below shows the activation status of the decoder in the Harvard code, which includes the shapes of the various tensors.

Studies have shown that during inference with a 13B LLM, each token consumes approximately 1 MB of GPU memory.
In addition, different sources often report different results for computational cost and memory usage. This is mainly due to different assumptions, for example whether gradients are stored in FP16 while parameters are stored in FP32, and whether recomputation is used.
Parallelism
In practice, LLMs are almost always trained or served using various parallel strategies, which changes the activation memory held on each device. The following figure shows the activation size (in bytes) of each Transformer layer under various parallel strategies.

Let’s take a look at the activations of the embedding layer, the final LayerNorm, and the output layer for an l-layer transformer model under these parallel strategies.
- Position and word embeddings do not require storing a large number of activations for backpropagation. However, dropout does require storage. Dropout in the embedding layer is also parallelized along the sequence dimension (sequence parallelism). Therefore, its storage will occupy `sbhp/t`. Note that the coefficient `p` is there because in pipeline parallelism we need to store `p` microbatches.
- The LayerNorm preceding the output layer also uses sequence parallelism, thus requiring `2sbh/t` of storage. The output layer projects to the vocabulary dimension, which requires storing its input of size `2sbh/t`. Finally, the cross-entropy loss requires storing the logits computed in 32-bit floating-point format, thus requiring `4sbv/t` of storage. Note that since we only consider the activations of the first stage of the pipeline, the above activations, totaling $\frac{4sbh}{t}\left(1 + \frac{v}{h}\right)$, are only included in the absence of pipeline parallelism (p = 1).

The total additional memory required for input embedding, the last LayerNorm, and the output layer is the sum of the two items above, where $\delta_{p=1}$ is 1 when p = 1 and 0 otherwise:

$$\frac{sbhp}{t} + \delta_{p=1}\,\frac{4sbh}{t}\left(1 + \frac{v}{h}\right)$$

0x04 Transformer computational complexity
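In a broad sense, when processing a token, the model performs two types of operations: attention computation and matrix-vector multiplication. In the list below, as in the overview, d is the hidden dimension, l is the number of tokens, and $d_{FFN}$ is the intermediate FFN dimension.
- MHA (red box): $W_Q$, $W_K$, and $W_V$. The corresponding computational cost is $3 \times 2 \times l \times d \times d = 6ld^2$, where 2 represents a multiplication and an addition.
- MHA (blue box): $W_{out}$. The corresponding computational cost is $2ld^2$.
- MHA Attention (green rounded corner cube): The computational cost of $QK^T$ plus the multiplication by V is $2 \times 2 \times l \times l \times d = 4l^2d$. For a Decoder (LLM), due to the presence of the Causal Mask, the computational cost here should be halved, i.e., $2l^2d$.
- FFN (green box): The computational cost corresponding to W1 and W2 is $2ld\,d_{FFN}$ each, i.e. $8ld^2$ each when $d_{FFN} = 4d$. LLaMA’s SwiGLU is similar.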

From here on we follow the terminology of the Megatron paper (batch size b, sequence length s, hidden dimension h, number of layers l) and ignore multiple heads, i.e., we treat the head count as 1.
4.1 Matrix Multiplication
The decoding stage primarily involves matrix-vector multiplication. A large matrix is multiplied by a vector to obtain another vector.
Therefore, let’s first look at the computational characteristics of matrix multiplication. Arithmetic intensity is defined as the ratio FLOP : I/O. When an $m \times k$ matrix is multiplied by a $k \times n$ matrix to produce an $m \times n$ matrix, each output element needs k multiply-add operations; in the matrix-vector case this means one multiply-add per matrix element. The FLOPs (floating-point operations, i.e., computational cost) are:

$$2mkn$$

The I/O (data transfer from GPU memory to GPU registers) count, in elements, is:

$$mk + kn + mn$$
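To make the arithmetic-intensity argument concrete, here is a minimal sketch that uses the common convention of an `m x k` matrix times a `k x n` matrix (2mkn FLOPs, `mk + kn + mn` elements moved). The dimension names and helper names are assumptions made for this example.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """One multiply and one add per (row, inner, column) triple."""
    return 2 * m * k * n


def matmul_io_elements(m: int, k: int, n: int) -> int:
    """Elements read for both operands plus elements written for the result."""
    return m * k + k * n + m * n


def arithmetic_intensity(m: int, k: int, n: int) -> float:
    """FLOPs per element moved between GPU memory and registers."""
    return matmul_flops(m, k, n) / matmul_io_elements(m, k, n)


# Decode-style vector-matrix product versus a large square matmul.
print(arithmetic_intensity(1, 4096, 4096))
print(arithmetic_intensity(4096, 4096, 4096))
```

For the vector-matrix case the intensity stays just below 2 FLOPs per element no matter how large the matrix is, which is why decoding is limited by memory bandwidth rather than compute.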
4.2 Forward Propagation Computational Cost
Embedding
The input to the embedding operation is [b, s]. In the actual matrix-vector multiplication computation, the embedding operation does not use this entire embedding matrix; each token only reads one row from this matrix, which is a table lookup operation. The final output tensor becomes [b, s, h]. Therefore, the computational cost is relatively small, and we will ignore this part later.
MHA
In standard Transformer computation, it is assumed that $Q, K, V \in \mathbb{R}^{s \times h}$. The calculation is as follows (details omitted). $s$ is the sequence length, and h is the dimension.
- Obtain attention scores: $S = QK^T$. For each query vector, compute the dot product between it and the key vectors at all positions.
- Obtain attention weights: $P = \mathrm{softmax}(S)$. That is, a set of scalars obtained by normalization.
- Calculate the final output: $O = PV$. A vector `o` is computed by weighting all the previous value vectors using the attention weights.
Therefore, we can see that calculating S and O is the main part.
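The three steps can be written out directly. The sketch below is a plain-Python, single-head illustration with no scaling, masking or batching, mirroring the "details omitted" form above; it is not an efficient implementation, and the helper names are made up for this example.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(q, k, v):
    """q, k, v have shape [s, h]; returns S [s, s], P [s, s] and O [s, h]."""
    k_t = [list(col) for col in zip(*k)]  # K^T with shape [h, s]
    s_mat = matmul(q, k_t)                # S = Q K^T, about 2*s*s*h FLOPs
    p_mat = softmax_rows(s_mat)           # P = softmax(S)
    o_mat = matmul(p_mat, v)              # O = P V, about 2*s*s*h FLOPs
    return s_mat, p_mat, o_mat


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # s = 3, h = 2
S, P, O = attention(x, x, x)
```

Both matrix products touch every (query, key) pair, which is where the quadratic dependence on the sequence length comes from.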
Calculate Q, K, V
A single matrix multiplication is: [b, s, h] × [h, h] to obtain [b, s, h], therefore its computational cost is:

$$2bsh^2$$

The computational cost of the three matrices is:

$$3 \times 2bsh^2 = 6bsh^2$$
QK^T
At this stage, for each query element, the attention computation performs a multiply-add operation on each key element to calculate the dot product. The overall operation is: [b, s, h] × [b, h, s] = [b, s, s], and its computational cost is:

$$2bs^2h$$

The softmax function does not change the dimension of the input matrix, i.e., [s, s] → [s, s]. The native softmax has FLOPs value (4/5)sh. Because it is relatively small, it can be ignored. Scaling operates on an element-by-element basis and can be ignored.
Multiply by V
The multiplication by V (attention over values) stage performs a multiply-add operation once for each value element to calculate the weighted sum. The overall operation is: [b, s, s] × [b, s, h] = [b, s, h], and the computational cost is:

$$2bs^2h$$

linear mapping
The linear projection step (post-attention linear projection) is related to $W_O$. For multi-head fusion, the input and output shapes of matrix multiplication are [b,s,h] × [h,h] → [b,s,h]. The computational cost is:

$$2bsh^2$$
MLP
This step involves two operations.
- In the first linear layer, the input and output shapes for matrix multiplication are `[b,s,h] × [h,4h] → [b,s,4h]`. The computational cost is $8bsh^2$.
- In the second linear layer, the input and output shapes for matrix multiplication are `[b,s,4h] × [4h,h] → [b,s,h]`. The computational cost is $8bsh^2$.
LayerNorm
LayerNorm operates element-wise, so there is no universal formula. The two weights of the layer are both vectors of length h, and FLOPs can be estimated as 2h, but are usually ignored.
single layer
Adding the above computational costs together, the computational cost of each transformer layer in the forward propagation phase is approximately:

$$6bsh^2 + 2bs^2h + 2bs^2h + 2bsh^2 + 16bsh^2 = 24bsh^2 + 4bs^2h$$
It can be observed that:
- The number of parameters and computational cost are unrelated to the number of heads. Head partitioning is more about improving accuracy through feature subspace partitioning than about saving on the number of parameters or computational cost.
- Recall that the number of parameters per layer is $12h^2$. Therefore, given a fixed sequence length, the computational cost increases linearly with the number of parameters.
- The computational complexity increases quadratically with the increase of sequence length.
| Attention | Computational cost | FFN | Computational cost |
|---|---|---|---|
| Calculate Q, K, V | $6bsh^2$ | First linear layer | $8bsh^2$ |
| QK^T | $2bs^2h$ | Second linear layer | $8bsh^2$ |
| Multiply by V | $2bs^2h$ | | |
| linear mapping | $2bsh^2$ | | |
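The per-layer forward cost can be tallied term by term. This is a small sketch under the assumptions used in this section (single head, FFN width 4h, softmax and LayerNorm ignored); the function name is illustrative.

```python
def layer_forward_flops(b: int, s: int, h: int) -> dict:
    """Forward FLOPs of one transformer layer, broken down by operation."""
    flops = {
        "qkv_projection": 3 * 2 * b * s * h * h,     # three [b,s,h] x [h,h] matmuls
        "qk_scores": 2 * b * s * s * h,              # [b,s,h] x [b,h,s]
        "weighted_values": 2 * b * s * s * h,        # [b,s,s] x [b,s,h]
        "output_projection": 2 * b * s * h * h,      # [b,s,h] x [h,h]
        "ffn_linear_1": 2 * b * s * h * (4 * h),     # [b,s,h] x [h,4h]
        "ffn_linear_2": 2 * b * s * (4 * h) * h,     # [b,s,4h] x [4h,h]
    }
    flops["total"] = sum(flops.values())
    return flops


# Share of the layer cost that grows quadratically with the sequence length.
for seq_len in (1024, 8192, 65536):
    f = layer_forward_flops(b=1, s=seq_len, h=4096)
    quadratic = f["qk_scores"] + f["weighted_values"]
    print(seq_len, round(quadratic / f["total"], 3))
```

The ratio of the quadratic term to the whole layer is `s / (6h + s)`, so the attention matmuls only start to dominate once `s` exceeds roughly `6h`.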
4.3 Comprehensive Consideration
Model training involves forward propagation and backpropagation. The above only considers the computational cost of the Transformer during the forward propagation stage. We will now consider it in conjunction with backpropagation.
Backpropagation
We’ll continue our analysis by incorporating backpropagation.
single layer
The computational cost of a single Transformer layer is now as follows:
- Floating-point operations required for forward propagation: $24bsh^2 + 4bs^2h$.
- For backward propagation, gradients need to be calculated for the weights and inputs in the neural network, so backpropagation requires twice the number of FLOPs.
- If activation checkpointing is used: during backwards, additional forward calculations are required for each layer.
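Therefore, with activation checkpointing (one forward pass, a backward pass costing two forward passes, and one recomputed forward pass), the total number of floating-point operations required for each layer is:

$$4 \times (24bsh^2 + 4bs^2h) = 96bsh^2\left(1 + \frac{s}{6h}\right)$$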
logits
Another computationally intensive part is the calculation of logits: mapping the hidden vectors to the vocabulary size to obtain the logits vector for each token. For a batch, the input and output shapes of the matrix multiplication are [b,s,h] × [h,V] → [b,s,V]; for a single sequence they are [s,h] × [h,V] → [s,V].
Therefore, forward propagation requires 2bshV, backward propagation requires 4bshV, and the total computational cost is 6bshV.
Total computational load
The classic Megatron-LM paper, “Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM,” provides a formula for computing floating-point operations on a standard Transformer-decoder structure. For an l-layer Transformer model with an input shape of [b, s], the computational complexity is as follows.
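- The floating-point operations required for a single inference and forward propagation:

  $$l\,(24bsh^2 + 4bs^2h) + 2bshV = 24bslh^2\left(1 + \frac{s}{6h} + \frac{V}{12lh}\right)$$

- In a single training iteration with activation checkpointing, the forward and backward propagation requires the following floating-point operations:

  $$4l\,(24bsh^2 + 4bs^2h) + 6bshV = 96bslh^2\left(1 + \frac{s}{6h} + \frac{V}{16lh}\right)$$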
If activation checkpointing is not used, then…
In the Megatron-DeepSpeed code, we can also see this formula used to calculate the achieved TFLOPS (tera floating-point operations per second) per GPU:

```python
# Excerpt: args, batch_size, seq_len, num_layers, hidden_size, vocab_size and
# elapsed_time_per_iteration are defined by the surrounding training loop.
# General TFLOPs formula (borrowed from Equation 3 in Section 5.1 of
# https://arxiv.org/pdf/2104.04473.pdf).
# The factor of 4 is when used with activation check-pointing,
# otherwise it will be 3, but for 200B model, activation check-pointing will always be on.
checkpoint_activations_factor = 4 if args.checkpoint_activations else 3
# GLU activations double the hidden states in the upscaling feed-forward in each transformer layer
# This leads to 16bsh^2 instead of 8bsh^2 per first feed-forward layer in MLP, thus we increase the coefficient by 8.
# Refer to https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/283#issue-1260805063 for more details.
coefficient = 32 if args.glu_activation else 24
flops_per_iteration = (
    coefficient * checkpoint_activations_factor
    * batch_size * seq_len * num_layers * (hidden_size**2)
) * (
    1.
    + (seq_len / (6. * hidden_size))
    + (vocab_size / (16. * num_layers * hidden_size))
)
tflops = flops_per_iteration / (elapsed_time_per_iteration * args.world_size * (10**12))
```
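As a sanity check, the closed-form expression can be compared with the term-by-term count derived earlier. The sketch below is a standalone re-statement of the formula for the non-GLU case with activation checkpointing switched on; the function names are hypothetical and are not part of Megatron-DeepSpeed.

```python
import math


def closed_form_flops(b: int, s: int, l: int, h: int, v: int) -> float:
    """96*b*s*l*h^2 * (1 + s/(6h) + V/(16*l*h)), i.e. coefficient 24 times factor 4."""
    return 24 * 4 * b * s * l * h**2 * (1.0 + s / (6.0 * h) + v / (16.0 * l * h))


def summed_flops(b: int, s: int, l: int, h: int, v: int) -> int:
    """Same quantity built from the per-layer and logits costs."""
    layer_forward = 24 * b * s * h**2 + 4 * b * s**2 * h
    logits_forward = 2 * b * s * h * v
    # Layers: forward + backward (2x forward) + recomputed forward = 4x forward.
    # Logits: forward + backward (2x forward) = 3x forward, no recomputation.
    return 4 * l * layer_forward + 3 * logits_forward


def tflops_per_gpu(flops_per_iteration: float, seconds_per_iteration: float,
                   world_size: int) -> float:
    """Achieved TFLOPS per GPU for one training iteration."""
    return flops_per_iteration / (seconds_per_iteration * world_size * 1e12)
```

The two functions agree up to floating-point rounding, which confirms that the `V / (16 * l * h)` term is the logits cost `6bshV` divided by `96bslh^2`.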
4.4 Calculation Characteristics
Relationship with parameter quantity
Our conclusion is that the computational cost is mainly related to the model parameters and the number of tokens. Assuming the dataset contains a total of D tokens and the model has N parameters, then for scenarios where the sequence is not particularly long, the computational cost of all token forwards can be approximated as 2ND.
Single reasoning
During a single inference iteration, the relationship between computational cost and parameter count is as follows (ignoring the attention term that depends on $s^2$ and the logits):

$$\frac{24bsh^2 l}{12lh^2} = 2bs$$
Since the number of input tokens in a single inference iteration is bs, it can be approximated that in one forward propagation, for each token, each model parameter requires two floating-point operations (one multiplication and one addition). That is, from the perspective of a single token multiplying a single matrix, it can be approximated that the computational cost in a single inference iteration (containing only forward propagation) is twice the number of parameters, which is the computational cost of processing each token through all parameters.
A single training iteration includes forward propagation and backpropagation, with the computational cost of backpropagation being twice that of forward propagation. Therefore, in a single training iteration, for each token and each model parameter, six floating-point operations are required.
A similar calculation formula can be found in the paper “Scaling Laws for Neural Language Models”, as shown in the figure below.

Single training session
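A single training iteration includes forward and backward propagation. Because backward propagation is twice as computationally expensive as forward propagation, each training iteration requires 6 floating-point operations per token and per model parameter. Total training computation (FLOPs) = 6 * number of model parameters * number of tokens in the training data. This is the computation required to run through all the training data once. With activation checkpointing, the extra forward pass raises the factor from 6 to 8. The training time can be approximated using the following formula:

$$\text{training time} \approx \frac{6 \times \text{number of tokens} \times \text{number of parameters}}{\text{number of GPUs} \times \text{peak FLOPS per GPU} \times \text{GPU utilization}}$$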
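A back-of-the-envelope estimator for this rule of thumb might look as follows. All numbers in the usage example (model size, token count, GPU count, peak throughput, utilization) are hypothetical inputs chosen for illustration, and the function names are not from any library.

```python
def training_flops(num_params: float, num_tokens: float,
                   activation_checkpointing: bool = False) -> float:
    """6 FLOPs per parameter per token, or 8 when the forward pass is recomputed."""
    factor = 8 if activation_checkpointing else 6
    return factor * num_params * num_tokens


def training_days(num_params: float, num_tokens: float, num_gpus: int,
                  peak_flops_per_gpu: float, utilization: float,
                  activation_checkpointing: bool = False) -> float:
    """Wall-clock estimate in days for one pass over the training data."""
    total = training_flops(num_params, num_tokens, activation_checkpointing)
    seconds = total / (num_gpus * peak_flops_per_gpu * utilization)
    return seconds / 86400


# Hypothetical run: 175B parameters, 300B tokens, 1024 GPUs at 312 TFLOPS peak, 45% utilization.
days = training_days(175e9, 300e9, 1024, 312e12, 0.45, activation_checkpointing=True)
print(f"about {days:.0f} days")
```

Under these assumed inputs the estimate comes out at roughly a month; the result is very sensitive to the utilization figure, which in practice has to be measured rather than guessed.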
Bandwidth limited
Computing power does not tell the whole story; the model also needs to access GPU memory, and memory bandwidth can become a bottleneck, because every parameter has to be read from memory on each forward pass. With FP16 weights, memory access volume = number of parameters * 2 bytes. Regarding memory bandwidth, the computation in large language models has some distinct characteristics. We will analyze them one by one.
Attention computation
In large language model inference, attention computation is memory-intensive, and its time consumption is limited by the memory access bandwidth of the hardware, rather than the computation speed.
The matrix multiplication operator has the following characteristics.
- A large number of parameters can cause matrix multiplication operators to become memory-intensive. When the input data is not large enough and the computation is not large enough, these operators will be limited by memory access bandwidth due to excessive parameter memory access.
- The computational cost increases rapidly with batch size. When batch size is less than 16, we can consider matrix multiplication operators to be memory-intensive. Only when batch size is sufficiently large will matrix multiplication operators become computationally intensive, and their properties will change with batch size.
FFN calculation
For the FFN operator, in most edge applications, we call the large language model with Batchsize=1. In this case, most of the computation and memory access in the network are concentrated in the FFN. The overall computation-to-memory ratio of the large language model is extremely low, and the entire network will be memory-intensive. Its running time is entirely limited by memory bandwidth rather than hardware computing power.
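The bandwidth argument can be turned into a simple lower bound on the time of one decoding step. The sketch below assumes FP16 weights that must be read once per step and about 2 FLOPs per parameter per token; the hardware figures in the example are hypothetical placeholders, not measurements of a specific device.

```python
def weight_read_time_s(num_params: float, bytes_per_param: float,
                       mem_bandwidth_bytes_per_s: float) -> float:
    """Time to stream all weights from GPU memory once."""
    return num_params * bytes_per_param / mem_bandwidth_bytes_per_s


def compute_time_s(num_params: float, batch_size: int, peak_flops: float) -> float:
    """Time for roughly 2 FLOPs per parameter for each sequence in the batch."""
    return 2 * num_params * batch_size / peak_flops


def break_even_batch(bytes_per_param: float, mem_bandwidth_bytes_per_s: float,
                     peak_flops: float) -> float:
    """Batch size at which compute time catches up with weight-read time."""
    return peak_flops * bytes_per_param / (2 * mem_bandwidth_bytes_per_s)


# Hypothetical accelerator: 2 TB/s memory bandwidth, 312 TFLOPS peak, 13B FP16 model.
params, bandwidth, peak = 13e9, 2e12, 312e12
print(weight_read_time_s(params, 2, bandwidth))   # memory-bound floor per step
print(compute_time_s(params, 1, peak))            # compute time at batch size 1
print(break_even_batch(2, bandwidth, peak))
```

At batch size 1 the weight-read time is far larger than the compute time, so the step is memory-bound; only batches well above the break-even point shift the matrix multiplications into the compute-bound regime.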
Impact of KV Cache
KV Cache is an important approach to attention optimization. It is essentially a set of key and value vectors for each previous position in the text. This technique significantly reduces the computational cost of Self Attention. This allows the inference process to be divided into prefill and decode stages (which we will analyze in detail later). The following diagram illustrates the decode stage. In summary, it includes two operators:
- Self-attention (highlighted in yellow) involves matrix-matrix multiplication.
- Dense projection (highlighted in green) involves vector-matrix multiplication.
The Self Attention operator has a very significant computational characteristic: it is a memory-intensive operator with a computation-to-memory ratio close to 1:1. Theoretical estimates for a decoding step show that both its memory access and its computation are $O(bsh)$, because every cached key and value is read once and used once. In comparison, similar estimates for MatMul and FeedForward (both matrix multiplication operators) give a memory access of $O(h^2)$ and a computation of $O(bh^2)$, so their computation-to-memory ratio grows with the batch size.

prefill
MHA block FLOPs per layer: $8bsh^2 + 4bs^2h$. FFN is $16bsh^2$.
decode
MHA FLOPs per layer in each decoding round, where s is the current context length held in the KV cache: $8bh^2 + 4bsh$.
FFN is $16bh^2$.
overall
When the input data is of shape [b,s], a single inference iteration is as follows:
- `prefill` total computational cost per round: $l\,(24bsh^2 + 4bs^2h) + 2bshV$
- `decode` total computational cost per round: $l\,(24bh^2 + 4bsh) + 2bhV$
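The two phases can be compared numerically with a short sketch. It assumes the per-layer costs derived in section 4.2, a KV cache during decoding, and logits computed for every processed token; `s_kv` (the number of cached positions attended to, including the new token) is a name introduced here for illustration.

```python
def prefill_flops(b: int, s: int, h: int, l: int, v: int) -> int:
    """All s prompt tokens are processed in one pass."""
    return l * (24 * b * s * h * h + 4 * b * s * s * h) + 2 * b * s * h * v


def decode_step_flops(b: int, s_kv: int, h: int, l: int, v: int) -> int:
    """One new token per sequence attends to s_kv cached positions."""
    return l * (24 * b * h * h + 4 * b * s_kv * h) + 2 * b * h * v


# Hypothetical model shape, for illustration only.
b, h, l, v = 1, 4096, 32, 32000
print(prefill_flops(b, 2048, h, l, v) / decode_step_flops(b, 2048, h, l, v))
```

A prefill over s tokens costs on the order of s decode steps, but it runs as large matrix-matrix products, while each decode step is a set of matrix-vector products that must reread all the weights.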
How much computation does the kv cache save?
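For a context length s, generating tokens without a KV cache recomputes self-attention over the whole prefix at every step, so the total attention computation is on the order of $O(s^3)$ (a sum of per-step costs proportional to $t^2$). With a KV cache, each step only attends from the new token to the cached keys and values, and the total is approximately $O(s^2)$. The computational savings rate is therefore about $1 - O(s^2)/O(s^3)$, i.e. on the order of $1 - 1/s$.
The computational complexity is reduced from $O(s^3)$ to $O(s^2)$: with a key-value cache only about a $1/s$ fraction of the original attention computation remains. When s is large, 1/s is close to 0 and the savings approach 100%. The more tokens output, the more significant the computational savings.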
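The order-of-magnitude claim can be checked by summing per-step attention costs. This sketch makes simplifying assumptions: a single layer and head, generation starting from an empty context, and a per-step attention cost of `4 * t * t * h` without a cache versus `4 * t * h` with one. Under this particular count the remaining fraction is exactly `3 / (2s + 1)`, which has the same 1/s order as the estimate in the text.

```python
def attention_flops_without_cache(s: int, h: int) -> int:
    """Step t recomputes Q K^T and P V for all t tokens of the prefix."""
    return sum(4 * t * t * h for t in range(1, s + 1))


def attention_flops_with_cache(s: int, h: int) -> int:
    """Step t computes one query row against t cached keys and values."""
    return sum(4 * t * h for t in range(1, s + 1))


def savings_rate(s: int, h: int) -> float:
    """Fraction of attention FLOPs avoided by using the KV cache."""
    return 1 - attention_flops_with_cache(s, h) / attention_flops_without_cache(s, h)


for length in (1, 10, 100, 1000):
    print(length, round(savings_rate(length, h=128), 4))
```

The savings are zero for a single token and climb toward 100% as the generated sequence grows, matching the observation that longer outputs benefit more.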
0x05 Optimization Directions
The biggest drawback of autoregressive large language models in terms of operational efficiency is that the decoding process is serial and variable-length, making it difficult to efficiently utilize parallel computing and memory bandwidth resources, which in turn leads to memory management and reclamation issues. To address this, the industry has developed numerous system optimization solutions, each of which can significantly improve the speed and performance of model inference.
5.1 Extrapolation Techniques Based on Modified Attention Mechanisms
The blog post “How Do Language Models put Attention Weights over Long Context” mentions that the attention distribution differs significantly across different layers:
- The initial layers mainly mix word embeddings, with attention distributed roughly evenly.
- The attention patterns in the middle layers become more complex, with most of the probability mass concentrated on the initial tokens (attention sink) and the most recent tokens (recency bias).
- The final layer reveals all the attention patterns.
As can be seen from the above, the middle layer mostly exhibits a “V-shaped” attention distribution, meaning that many tokens in the middle layer are actually not very effective. Therefore, we can consider accelerating inference and increasing extrapolation capabilities by reducing the number of tokens for different layers.
Next, we’ll look at how to increase extrapolation capabilities based on attention mechanisms.
| name | Main idea |
|---|---|
| StreamingLLM | When assembling the KV cache, the initial tokens are always kept (attention sinks), and a window attention mechanism is introduced to improve computational efficiency. |
| LM-Infinite | A V-shaped attention mechanism is employed. Because the attention distribution of the middle tokens is relatively small, a Λ-shaped attention mask is introduced, and a distance upper limit is set to restrict the “effective distance”. At the same time, it is possible to selectively focus on the k middle tokens with the largest attention logits. |
| SirLLM | Key phrases are filtered by measuring token entropy and using a memory decay mechanism. Tokens with higher entropy values are considered to contain more information. The memory decay mechanism works by multiplying each entropy value in the token entropy cache by a decay rate less than 1. Over time, earlier information is gradually forgotten, while more recent key information is retained. |
| SparQ Attention | Tokens typically only require attention to a small portion of the sequence. If it’s possible to efficiently predict which tokens will achieve high attention scores, then only the keys and values of the high-scoring tokens need to be fetched, improving memory bandwidth efficiency. Therefore, a compression approach is proposed: r components are selected to estimate the largest attention scores, and then the top-k key and value vectors are determined. |
| Dynamic Memory Compression | DMC fine-tunes pre-trained LLMs to learn a compression strategy, and then performs online compression of the key-value cache during inference. DMC introduces decision variables $\alpha_t$ and importance variables $\omega_t$, which at each time step determine whether to append the current key-value representation to the cache or perform a weighted average with the top element in the cache. |
| Infini-attention | Compressive memory is integrated into the standard attention mechanism, and masked local attention and long-term linear attention mechanisms are built in a single Transformer block. |
| LongLoRA | We introduce Shifted Sparse Attention to fine-tune the model and extend the context length. The model fine-tuned with Shifted Sparse Attention retains the original standard self-attention architecture during inference. This means that the model can use the unmodified attention mechanism during inference, allowing most of the existing optimizations and infrastructure to be reused. |
| self-extend Attention | Self Extend uses a simple floor division operation to map large, unseen relative positions to relative positions encountered during pre-training. To address the issues of long-distance and proximity dependencies, Self Extend introduces a two-layer attention mechanism: Grouped Attention and Neighbor Attention. |
| Dual Chunk Attention | By decomposing the attention computation of long sequences into block-based modules, the model can effectively capture relative positional information within the same block (Intra-Chunk) and between different blocks (Inter-Chunk). The attention outputs of intra-block, inter-block, and consecutive blocks are then merged to obtain the final output representation. This representation considers both local and global information in the sequence, enabling the model to effectively handle long sequences. |
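Several of the methods in the table keep the first few tokens plus a recent window, which gives the attention mask a Λ shape. The sketch below builds such a mask as a simplified illustration only; it is not the implementation of StreamingLLM or LM-Infinite, and the parameter names `num_sink` and `window` are chosen for this example.

```python
def sink_window_mask(seq_len: int, num_sink: int, window: int) -> list:
    """mask[i][j] is True when query i may attend to key j.

    A key is visible if it is causal (j <= i) and is either one of the first
    num_sink tokens or within the last `window` positions up to and including i.
    """
    mask = []
    for i in range(seq_len):
        row = [
            j <= i and (j < num_sink or i - j < window)
            for j in range(seq_len)
        ]
        mask.append(row)
    return mask


# Print the pattern: '#' marks a visible key, '.' a masked one.
for row in sink_window_mask(seq_len=8, num_sink=2, window=3):
    print("".join("#" if visible else "." for visible in row))
```

Each query sees at most `num_sink + window` keys, so the per-token attention cost and the KV cache size stay bounded regardless of how long the sequence grows.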
5.2 Extrapolation Technique Based on Memory Mechanism
The extrapolation technology based on the memory mechanism actually follows the compression idea. It uses external storage to store historical information and then uses the most recent token to query and retrieve some historically important tokens.
| name | Main idea |
|---|---|
| InfLLM | InfLLM builds an additional context memory module to store context information far from the current processing location, and uses an efficient lookup mechanism to find the units related to the currently processed token for use in attention computation. |
| Recurrent Memory Transformer (RMT) | Context extension is achieved by combining the recurrent mechanism of recurrent neural networks (RNNs) with the memory augmentation capabilities of the Transformer model. RMT introduces a memory mechanism on top of the Transformer model, which consists of a set of trainable real-valued vectors (called memory tokens). These memory vectors can store and process local and global information and pass information between different segments of a long sequence through the recurrent mechanism. |
0xFF Reference
The potential for parallel inference across multiple large language fine-tuning models
Contiguous Batching/Inflight Batching
Full Stack Transformer Inference Optimization Season 2: Deploying Long-Context Models Yao Fu Paper version
GPTQ / AWQ
How Do Language Models put Attention Weights over Long Context Yao Fu
HunYuan MoE: A casual chat about AI, including LLM parameter count, computational cost, and MFU.
llm Parameter Quantity, Computational Cost, and Memory Usage Analysis (Zhang)
LLM Large Model Training-Inference Memory Usage Analysis (chaofa adds a touch of coding)
LLM (23): The Long Text Problem in LLM (Auspicious Purple Qi from the East)
Notion – The all-in-one workspace for your notes, tasks, wikis, and databases.
OpenPPL-LLM | OpenPPL’s Large Language Model Inference Engine is Here !
PagedAttention
Towards 100x Speedup: Full Stack Transformer Inference Optimization Yao Fu
Transformer estimate 101
Transformer Data Estimation - Memory Usage Bruce’s Wanderings
Analyze the number of parameters, computational cost, intermediate activations, KV cache, and spintronics of the transformer model.
Analyzing the Batch Processing Effect in GPT Inference ( Lequn Chen || abcdabcd987)
The Potential of Parallel Inference for Multiple Large Language Fine-tuning Models (Lequn Chen || abcdabcd987)
Large Model-Deployment-Capacity Estimation Ideas (Lancet)
Analysis of Bottlenecks and Limiting Theoretical Values in Large Model Inference (by WALL-E, who likes curly hair)
Activating Memory: How Much Memory Does Model Inference Require? (Wei Xinyu [Da Wei Sharing])