Exploring the Transformer Series (16) --- Resource Consumption

0x00 Overview

For standard Transformer models, whether the Encoder-only BERT series or the Decoder-only GPT series, the parameter count and computational cost are similar under the same configuration. One key reason is that the input, output, and intermediate hidden dimensions of a standard Transformer block (layer) never change: they always equal the hidden dimension of the token embedding, so every Transformer block has the same regular shape.

As shown in the figure below, most of the Encoder's parameters sit in a handful of weight matrices used in matrix multiplications. Here d is the hidden dimension of the token embedding, l is the number of tokens, h is the number of heads in MHA, and $d_{FFN}$ is the intermediate dimension of the FFN layer after it is expanded. The parameters of the main modules are as follows.

MHA: $W_Q$, $W_K$, and $W_V$ each have size $d \times d$. Viewed per head, each of the h heads has its own $W_Q$, $W_K$, and $W_V$ of size $d \times (d / h)$. MHA also ends with a matrix multiplication by $W_O$, whose size is again $d \times d$. Therefore, the weight matrices in MHA hold $3d \times d + d \times d$ parameters.
FFN: The standard Transformer's FFN has two Linear layers (first increasing the dimension, then decreasing it), corresponding to the weight matrices $W_1$ and $W_2$. Each has size $d_{FFN} \times d$, and the standard value of $d_{FFN}$ is 4d, so the two FFN weight matrices hold $8d \times d$ parameters.

[Figure: weight matrices of the main Encoder modules]

In summary, in standard Transformer models or the LLaMA series (MHA), if parameters such as vocabulary, embedding, and LayerNorm are ignored, the total number of parameters (for all Transformer Blocks) is:

$N = n_{layer} \times (n_{mha} + n_{ffn}) = n_{layer} \times (3d \times d + d \times d + 8d \times d) = 12 \times n_{layer} \times d \times d$

Note: This chapter references multiple papers, which have different definitions of terms. Because the model structures are also different, the calculation results may differ from those in other sources.

0x01 Background Knowledge

1.1 Data Types

In deep learning, the naming convention for numerical types is generally TypeNum, such as Int64, Float32, and Double64.

Type: Includes Int, Float, Double, etc.
Num: Typically 8, 16, 32, 64, 128, indicating the number of bits occupied by this type.

Commonly used numeric data types are shown in the table below.

Type	Size (in bytes)
int4	0.5
int8	1
int16	2
int32	4
int64	8
float32	4
float16	2

1.2 Number Systems & Conversion

Let's start with a question: how many gigabytes of video memory does a model with 1B parameters need? Both B and G stand for a billion (1000M or 1024M), but they measure two different things.

Number base

B is a counting unit commonly used in English, for example:

1K = 1000, one thousand;
1M = 1000 K, one million;
1B = 1000 M, one billion;

As you can see, this number system uses 1000 as its base. Taking Qwen-7B as an example, 7B means that this LLM model has 7 billion parameters.

Storage metrics

G is a unit of computer memory/disk storage. The basic unit is the byte, and the system is base-1024, with units KB / MB / GB / TB. When we say a certain number of G or M of video memory, we mean that many gigabytes or megabytes. One byte equals 8 bits. For example, a 1000x1000 float32 matrix occupies 1000x1000x4 bytes, which is approximately 4MB of video memory.

Conversion

It can be seen that $1B = 10^9$ and $1GB \approx 10^9$ bytes. The magnitudes of 1B and 1G are basically the same, so we treat B and G as equal when converting. However, the number of gigabytes of memory needed for 1B model parameters depends on the precision of the parameters. For full-precision training (fp32), one parameter takes 32 bits, or 4 bytes, so the parameter count is multiplied by 4: 1B model parameters correspond to 4GB of GPU memory. For fp16 or bf16, the count is multiplied by 2: 1B model parameters correspond to 2GB of GPU memory. See the table below for details.

Data type	Memory required per 1B parameters
fp32	4G
fp16/bf16	2G
int8	1G
int4	0.5G
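
The conversion in the table can be sketched in a few lines of Python. The dictionary and helper names here are illustrative and not taken from any library. The sketch counts weights only, and it reports both decimal GB ($10^9$ bytes) and binary GiB ($2^{30}$ bytes) to show how small the gap between the two conventions is.

```python
# Bytes per parameter for the data types listed in the table.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_bytes(n_params: float, dtype: str) -> float:
    """Memory needed to hold the weights alone (no gradients, no activations)."""
    return n_params * BYTES_PER_PARAM[dtype]


def to_gb(n_bytes: float) -> float:
    """Decimal gigabytes: 1 GB = 10**9 bytes."""
    return n_bytes / 10**9


def to_gib(n_bytes: float) -> float:
    """Binary gibibytes: 1 GiB = 2**30 bytes."""
    return n_bytes / 2**30


n_params = 7e9  # a hypothetical 7B-parameter model
for dtype in BYTES_PER_PARAM:
    size = weight_memory_bytes(n_params, dtype)
    print(f"{dtype:>5}: {to_gb(size):6.1f} GB  ({to_gib(size):6.2f} GiB)")
```

With the decimal convention, a 7B model in fp16 comes out at 14 GB, which matches the "multiply by 2" rule above.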

1.3 Parameter Memory Usage

Only modules with parameters contribute to this part of GPU memory. This usage is independent of the input and is incurred as soon as the model is loaded. A typical convolutional layer occupies GPU memory for its weights, while the ReLU activation layers we use so often have no parameters and therefore occupy none.

Parameterized layers

Commonly used modules with parameters include:

Convolutional layers, typically conv2d.
Fully connected layer, also known as Linear layer.
BatchNorm layer.
Embedding layer.

Parameterless layers

Common parameterless modules include:

Most activation layers, such as Sigmoid/ReLU.
Pooling layer.
Dropout.

Required resources

We can calculate the memory usage of a neural network using the following formula: Memory Usage = Model Memory Usage + Input/Output Related Memory.

Model memory usage refers to the memory usage of the model that is independent of the input, and mainly includes:

Model weight parameters.
Gradients (usually the same size as the parameters).
Optimizer states (these depend on the specific optimizer; for example, plain SGD keeps no state, SGD with momentum keeps one momentum value per parameter, the same size as the gradient, and Adam keeps two values per parameter, twice the size of the gradient).

The main video memory usage related to input/output is as follows:

batch_size × memory usage per sample.
Each layer’s feature map needs to save activations for backpropagation.

Due to factors such as backpropagation, Adam optimization, and Transformer architecture, training generally requires 3-4 times more GPU memory than inference of the same scale.

1.4 Computational complexity

The above text mentions that the computational complexity of Transformer is $O(dN^2)$ . Big O notation focuses on the relationship between the computational magnitude and the input size, not the actual computational cost. The actual computational cost is usually represented by FLOPs. Here are some common units:

FLOPs (lowercase s): short for floating point operations, that is, the number of floating-point operations, usually multiplications and additions. It measures the amount of computation and can be used to gauge the complexity of an algorithm/model. FLOPS (uppercase S) instead means floating-point operations per second and measures hardware speed, as in the two units below.
One GFLOPS (gigaFLOPS) = one billion (= $10^9$ ) floating-point operations per second.
One TFLOPS (teraFLOPS) = one trillion (= $10^{12}$ ) floating-point operations per second.
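
To make the difference between an amount of work (FLOPs) and a speed (FLOPS) concrete, here is a minimal sketch. The helper names, the 100 TFLOPS peak, and the 50% utilization are assumptions chosen for illustration, not measurements of any real device.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for [m, k] @ [k, n].

    Each of the m*n output elements needs k multiplications and about
    k additions, which is conventionally counted as 2*k operations.
    """
    return 2 * m * k * n


def seconds_at(flops: float, peak_tflops: float, utilization: float = 1.0) -> float:
    """Time to run `flops` operations at peak_tflops * utilization TFLOPS."""
    return flops / (peak_tflops * 1e12 * utilization)


work = matmul_flops(4096, 4096, 4096)  # amount of computation, in FLOPs
# Hypothetical accelerator: 100 TFLOPS peak, 50% of peak actually achieved.
elapsed = seconds_at(work, peak_tflops=100.0, utilization=0.5)
print(f"{work / 1e9:.1f} GFLOPs of work, about {elapsed * 1e3:.2f} ms")
```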

0x02 Transformer Parameter Count

Taking the decoder-only model as an example, it mainly consists of three parts: embedding, decoder, and head. The most important part is the decoder, which is composed of several decoder-layers. Each decoder-layer is further divided into two parts: MHA and FFN. We will now examine the number of parameters in each of these modules.

2.1 Terminology

Let’s first give the terminology used in this section.

Symbol	Meaning
$d$	The model's word embedding size (hidden state dimension / positional encoding size)
$h$	Number of attention heads
$s$	Total text length (prompt + decoder output)
$b$	Batch size
$l$	Number of Transformer layers
$v$	Vocabulary size

2.2 Embedding Layer

The input shape of the embedding layer is [b, s, v], the output shape is [b, s, d], and the number of parameters is $v \times d$ .

If trainable positional encoding is used, there will be some trainable model parameters, but their number is relatively small. If relative positional encoding is used, such as RoPE and ALiBi, there are no trainable model parameters. Therefore, we ignore the parameters of positional encoding.

2.3 Transformer Layer

The Transformer model consists of l identical layers, each mainly divided into two parts: MHA and FFN. Since multi-head is only a logical partition and does not physically add modules, multi-head will be omitted in the following discussion (if multi-head is discussed in some papers, we will follow the paper’s interpretation). And since the Decoder-only model uses self-attention, we will assume that the dimensions of Q, K, V, and O are equal.

MHA

MHA contains four weight matrices, $W_Q$, $W_K$, $W_V$, and $W_O$, plus their biases (some models have no biases). Each of the four weight matrices has shape [d, d] and each of the four biases has shape [d], where $d = h \times d_{head}$. Therefore, the number of parameters in the multi-head attention layer is:

$4 \times (d \times d + d) = 4d^2 + 4d$

FFN

FFN consists of two linear layers.

The first layer expands the dimension to four times the original, mapping $d$ to $4d$. The weight matrix has shape [d, 4d], and the bias vector has shape [4d]. The number of parameters is: $d \times 4d + 4d$.
The second layer projects the expanded dimension back to the original one, mapping $4d$ to $d$. The weight matrix has shape [4d, d], and the bias vector has shape [d]. The number of parameters is: $4d \times d + d$.

The final parameters of FFN are:

$8d^2 + 5d$

LayerNorm

For LayerNorm, the scale parameter $\gamma$ and the shift parameter $\beta$ each have dimension $d$, so one LayerNorm has $2 \times d$ parameters. Because MHA and FFN each have their own LayerNorm, the total is $4 \times d$.

Summary

In summary, the number of parameters in a single Transformer layer is:

$12d^2 + 13d$

2.4 lm_head

lm_head is the output head of a language model. It maps the model's output (usually the hidden state produced by the last Transformer layer) to a probability distribution over the vocabulary for predicting the next token.

The head and the embedding have the same number of parameters, $v \times d$ each. With tied embeddings (i.e., the head weight matrix and the word embedding matrix share parameters), the two use a single shared weight matrix.

2.5 Final Parameter Quantity

Ultimately, the number of trainable model parameters for the l-layer transformer model is:

$l(12d^2 + 13d) + 2vd$

When d is large, the first-order term can be ignored, and the number of model parameters is approximately:

$12ld^2$
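
The per-module counts from sections 2.2 to 2.5 can be checked with a short script. This is a sketch under the assumptions of this section (biases everywhere, a 4x FFN expansion, two LayerNorms per layer, positional encoding ignored). The function names and the GPT-2-small-sized configuration are illustrative choices, not something this article prescribes.

```python
def mha_params(d: int) -> int:
    """W_Q, W_K, W_V, W_O of shape [d, d], each with a bias of shape [d]."""
    return 4 * (d * d + d)


def ffn_params(d: int) -> int:
    """Two linear layers, d -> 4d and 4d -> d, each with a bias."""
    up = d * 4 * d + 4 * d
    down = 4 * d * d + d
    return up + down


def layernorm_params(d: int) -> int:
    """Two LayerNorms per layer, each with gamma and beta of shape [d]."""
    return 2 * (2 * d)


def layer_params(d: int) -> int:
    """One Transformer layer: 12*d^2 + 13*d."""
    return mha_params(d) + ffn_params(d) + layernorm_params(d)


def model_params(n_layers: int, d: int, v: int, tied: bool = False) -> int:
    """All layers plus embedding and lm_head (one shared matrix if tied)."""
    embedding_and_head = v * d if tied else 2 * v * d
    return n_layers * layer_params(d) + embedding_and_head


n_layers, d, v = 12, 768, 50257  # illustrative, roughly GPT-2-small sized
print(f"per layer:        {layer_params(d):,}")
print(f"exact (untied):   {model_params(n_layers, d, v):,}")
print(f"exact (tied):     {model_params(n_layers, d, v, tied=True):,}")
print(f"approx 12*l*d^2:  {12 * n_layers * d * d:,}")
```

For a model this small the embedding term is a large share of the total, so the $12ld^2$ approximation is best read as an estimate of the Transformer layers alone. It becomes a good estimate of the whole model only once $l$ and $d$ are large.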


2.6 LLaMA3

Next, let's look at what sets LLaMA 3, a model widely used in industry, apart from the standard design.

SwiGLU

Models like LLaMA use the SwiGLU activation in their FFN, which adds a third weight matrix. The LLaMA paper mentions that, to compensate, it reduces $d_{FFN}$ from 4d to 8d/3. The three weight matrices then hold $3 \times d \times 8d/3 = 8d^2$ parameters, the same as the standard FFN, so the total parameter count can still be estimated with:

$12 \times n_{layer} \times d \times d$

GQA

The formula above corresponds to MHA (Multi-Head Attention), which is also the standard implementation in the LLaMA-1 series. However, the 34B and 70B models of LLaMA-2, as well as all LLaMA-3 models, use GQA (Grouped Query Attention). With GQA, multiple attention heads share one key and value. $W_K$ and $W_V$ each shrink to $d \times d / g$, where g is the number of heads that share the same Key and Value. The LLaMA 2 paper mentions that, to keep the total parameter count the same under GQA as under MHA, the FFN dimension is multiplied by 1.3 for the GQA models.

After these adjustments, LLaMA 3 no longer uses a standard Transformer block, so $N = 12 \times n_{layer} \times d^2$ is no longer a very accurate estimate. The count can still be split into four parts: $(W_Q, W_O)$, $(W_K, W_V)$, $W_{FFN}$, and $W_{emb}$. For example, for the LLaMA 3 model, we can estimate the parameter count as follows, where $k_v$ is the number of key-value heads and $n_{vocab}$ is the vocabulary size:

$N = n_{layer} \times (2d^2 + 2d \times d \times k_{v} / h + 3d \times d_{FFN}) + 2 \times n_{vocab} \times d$
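
As a sanity check, the four-part estimate can be evaluated directly. The function name is hypothetical, and the configuration values are what we believe the released LLaMA 3 8B model uses (32 layers, hidden size 4096, 32 heads, 8 key-value heads, FFN dimension 14336, vocabulary 128256). They do not come from this article, so treat them as assumptions.

```python
def gqa_model_params(
    n_layer: int, d: int, n_heads: int, n_kv_heads: int, d_ffn: int, n_vocab: int
) -> int:
    """Four-part estimate: (W_Q, W_O), (W_K, W_V), W_FFN, W_emb.

    Norm weights and biases are ignored, as in the formula in the text.
    """
    d_head = d // n_heads
    q_and_o = 2 * d * d                      # W_Q and W_O are [d, d]
    k_and_v = 2 * d * (n_kv_heads * d_head)  # W_K and W_V shrink under GQA
    ffn = 3 * d * d_ffn                      # SwiGLU uses three matrices
    embeddings = 2 * n_vocab * d             # input embedding + lm_head
    return n_layer * (q_and_o + k_and_v + ffn) + embeddings


# Assumed LLaMA 3 8B configuration (see the note above).
total = gqa_model_params(
    n_layer=32, d=4096, n_heads=32, n_kv_heads=8, d_ffn=14336, n_vocab=128256
)
print(f"{total:,} parameters, about {total / 1e9:.2f}B")
```

Under these assumed values the estimate lands at roughly 8.03B. By comparison, the plain $12 \times n_{layer} \times d^2$ rule gives about 6.4B for the same depth and width.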

0x03 Transformer Memory Usage

3.1 Training

During neural network training, GPU memory is mostly consumed by four things: model parameters, intermediate activations produced in the forward pass, gradients computed in the backward pass, and optimizer states. The latter three can be larger than the model parameters themselves, so training needs far more memory than the weights alone.

When training large models, the AdamW optimizer is commonly used together with mixed-precision training to speed things up. We analyze memory usage under this premise. In a single training iteration, each trainable parameter needs storage for the parameter itself, its gradient, and the optimizer's two states for that parameter (Adam's first-order and second-order moments). Let the number of model parameters be $\Phi$; then there are $\Phi$ gradient elements and $2\Phi$ AdamW optimizer state elements. In mixed-precision training, the forward and backward passes run in half precision, while the optimizer step updates its states, the gradients, and the parameters in single precision. Each parameter and each gradient is therefore kept twice, once as a half-precision copy (2 bytes) and once as a single-precision copy (4 bytes), and the two optimizer states are kept in single precision (4 bytes each). Thus, with AdamW and mixed-precision training, each trainable parameter occupies (2+4)+(2+4)+(4+4)=20 bytes during training. For a large model with $\Phi$ parameters, the model parameters, gradients, and optimizer states together occupy $20\Phi$ bytes.
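
A small sketch of this accounting follows. The dictionary keys and the function name are ours, and the 7B model is a hypothetical example. Activations are deliberately left out.

```python
# Bytes kept per trainable parameter under AdamW + mixed precision,
# following the (2+4)+(2+4)+(4+4) breakdown in the text.
TRAINING_BYTES_PER_PARAM = {
    "parameter, half-precision copy": 2,
    "parameter, single-precision copy": 4,
    "gradient, half-precision copy": 2,
    "gradient, single-precision copy": 4,
    "Adam first-order moment (fp32)": 4,
    "Adam second-order moment (fp32)": 4,
}


def model_state_bytes(n_params: int) -> int:
    """Memory for parameters, gradients, and optimizer states (no activations)."""
    return n_params * sum(TRAINING_BYTES_PER_PARAM.values())


phi = 7_000_000_000  # hypothetical 7B-parameter model
print(f"bytes per parameter: {sum(TRAINING_BYTES_PER_PARAM.values())}")
print(f"model state: {model_state_bytes(phi) / 1e9:.0f} GB")
```

Even before counting activations, a model of this size does not fit on a single 80 GB device under this scheme, which is one reason model state is usually sharded across GPUs.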


The space requirements for model parameters, gradients, and optimizer states have been calculated. The next step is to determine the space requirements for the intermediate activations during forward propagation. We will analyze this in a later section.

Model training involves Forward and Backward processes. The Backward process actually consists of two parts: gradients to the input (using the chain rule) and gradients to the weights. The main computational cost of these two parts is matrix multiplication, and the magnitudes are the same as in the Forward process. Therefore, it is often directly approximated that the computational cost of Backward is twice that of Forward.
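
The "Backward is twice Forward" approximation can be seen on a single linear layer. The sketch below counts only matrix-multiplication FLOPs and ignores biases and element-wise operations. The function names and the example sizes are illustrative.

```python
def linear_forward_flops(tokens: int, d_in: int, d_out: int) -> int:
    """Y = X @ W with X: [tokens, d_in] and W: [d_in, d_out]."""
    return 2 * tokens * d_in * d_out


def linear_backward_flops(tokens: int, d_in: int, d_out: int) -> int:
    """Two matmuls of the same size as the forward one."""
    grad_input = 2 * tokens * d_out * d_in   # dX = dY @ W^T
    grad_weight = 2 * d_in * tokens * d_out  # dW = X^T @ dY
    return grad_input + grad_weight


tokens, d_in, d_out = 8 * 2048, 4096, 4 * 4096  # e.g. b=8, s=2048, FFN up-projection
fwd = linear_forward_flops(tokens, d_in, d_out)
bwd = linear_backward_flops(tokens, d_in, d_out)
print(f"forward: {fwd:.3e} FLOPs, backward: {bwd:.3e} FLOPs, ratio: {bwd / fwd:.1f}")
```

One full training step on this layer therefore costs about three times the forward pass.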

3.2 Inference

The inference phase typically requires less GPU memory than the training phase because it doesn’t involve extensive computations such as gradient calculations and parameter updates. Without gradients, optimizer states, and intermediate activations, the model’s inference phase consumes significantly less GPU memory than the training phase.

If a key-value cache is used to accelerate the inference process, the key-value cache also needs to occupy video memory. The video memory occupied by the key-value cache will be described in detail below, so it will be omitted here. In addition, the input data also needs to be placed on the GPU, as well as some intermediate results (intermediate results during the inference process will be released as soon as they are used up). However, the video memory occupied by this part is very small and can be ignored.

Ultimately, the main memory usage during the inference phase is for the model’s parameters, where model parameter memory = n × p. n is the total number of model parameters, and p is the number of bytes occupied by each parameter. If half-precision inference is used, and each parameter occupies 2 bytes, then the memory usage of the model during inference is approximately:

$m_{inference} = 2 \times n_{params}$

The following are the key factors to consider when estimating the GPU memory required for model inference:

Model structure: The model structure includes the number of layers, the number of neurons in each layer, and the size of the convolutional kernel. Deeper models typically require more GPU memory because each layer produces intermediate computation results.
Input data: The amount of video memory required for inference depends on the size of the input data. Larger input data will require more video memory.
Batch Size: Batch size refers to the number of samples processed in a single inference iteration. A larger batch size may increase GPU memory usage because the computation results for multiple samples need to be stored simultaneously.
Data type: The data type used (such as single-precision floating-point number, half-precision floating-point number) also affects video memory requirements. Lower precision data types generally reduce video memory requirements.
Intermediate calculations: During the inference process of the model, some intermediate calculation results may be generated, which will also occupy a certain amount of GPU memory.

3.3 Activation

Activations during training refer to all tensors computed during forward propagation and used during backpropagation. These activations do not include model parameters and optimizer states, but they do include the mask matrix used in the dropout operation.

In a single training iteration, the memory usage of model parameters (or gradients) depends only on the number and data type of the parameters, and is independent of the input data size. Similarly, the memory usage of the optimizer state depends on the optimizer type and the number of model parameters, but is independent of the input data size. Intermediate activation values, however, are positively correlated with the input data size (batch size b and sequence length s). As both batch size b and sequence length s increase, the memory usage of intermediate activations increases proportionally. When encountering an Out Of Memory (OOM) error during neural network training, we typically try to reduce the batch size to avoid this problem. However, this approach actually reduces the memory usage of intermediate activations, not the memory used for model parameters, gradients, or the optimizer.

Next, we will take Megatron from the paper “Reducing Activation Recomputation in Large Transformer Models” as an example to calculate the memory usage of intermediate activation step by step.

Architecture

The image below shows the architecture of Megatron.

[Figure: Megatron architecture]

The code is shown below. It specifies that core_attention is submodules.core_attention and linear_proj is submodules.linear_proj.

class Attention(MegatronModule, ABC):
    """Attention layer abstract class.
    This layer only contains common modules required for the "self attn" and
    "cross attn" specializations.
    """

    def __init__(
        self,
        config: TransformerConfig,
        submodules: Union[SelfAttentionSubmodules, CrossAttentionSubmodules],
        layer_number: int,
        attn_mask_type: AttnMaskType,
        attention_type: str,
    ):
        super().__init__(config=config)

        self.config = config
        self.layer_number = layer_number
        self.attn_mask_type = attn_mask_type
        self.attention_type = attention_type

        # For normal attention without groups, num_query_groups == num_attention_heads,
        # so these two will be the same
        self.query_projection_size = self.config.kv_channels * self.config.num_attention_heads
        self.kv_projection_size = self.config.kv_channels * self.config.num_query_groups

        # Per attention head and per partition values.
        world_size = parallel_state.get_tensor_model_parallel_world_size()
        self.hidden_size_per_attention_head = divide(
            self.query_projection_size, self.config.num_attention_heads
        )
        self.num_attention_heads_per_partition = divide(self.config.num_attention_heads, world_size)
        self.num_query_groups_per_partition = divide(self.config.num_query_groups, world_size)

        self.core_attention = build_module(
            submodules.core_attention,
            config=self.config,
            layer_number=self.layer_number,
            attn_mask_type=self.attn_mask_type,
            attention_type=self.attention_type,
        )

        self.checkpoint_core_attention = self.config.recompute_granularity == 'selective'

        # Output.
        self.linear_proj = build_module(
            submodules.linear_proj,
            self.query_projection_size,
            self.config.hidden_size,
            config=self.config,
            init_method=self.config.output_layer_init_method,
            bias=self.config.add_bias_linear,
            input_is_parallel=True,
            skip_bias_add=True,
            is_expert=False,
            tp_comm_buffer_name='proj',
        )

    def forward(
        self,
        hidden_states,
        attention_mask,
        key_value_states=None,
        inference_params=None,
        rotary_pos_emb=None,
        packed_seq_params=None,
    ):
        # hidden_states: [sq, b, h]

        # For self attention we just duplicate the rotary_pos_emb if it isn't already
        if rotary_pos_emb is not None and not isinstance(rotary_pos_emb, tuple):
            rotary_pos_emb = (rotary_pos_emb,) * 2

        # =====================
        # Query, Key, and Value
        # =====================
        # Get the query, key and value tensors based on the type of attention -
        # self or cross attn.
        query, key, value = self.get_query_key_value_tensors(hidden_states, key_value_states)

        # ===================================================
        # Adjust key, value, and rotary_pos_emb for inference
        # ===================================================
        key, value, rotary_pos_emb, attn_mask_type = self._adjust_key_value_for_inference(
            inference_params, key, value, rotary_pos_emb
        )

        if packed_seq_params is not None:
            query = query.squeeze(1)
            key = key.squeeze(1)
            value = value.squeeze(1)

        # ================================================
        # relative positional embedding (rotary embedding)
        # ================================================
        if rotary_pos_emb is not None:
            q_pos_emb, k_pos_emb = rotary_pos_emb

            if packed_seq_params is not None:
                cu_seqlens_q = packed_seq_params.cu_seqlens_q
                cu_seqlens_kv = packed_seq_params.cu_seqlens_kv
            else:
                cu_seqlens_q = cu_seqlens_kv = None
            query = apply_rotary_pos_emb(
                query, q_pos_emb, config=self.config, cu_seqlens=cu_seqlens_q
            )
            key = apply_rotary_pos_emb(key, k_pos_emb, config=self.config, cu_seqlens=cu_seqlens_kv)

            # TODO, can apply positional embedding to value_layer so it has
            # absolute positional embedding.
            # otherwise, only relative positional embedding takes effect
            # value_layer = apply_rotary_pos_emb(value_layer, k_pos_emb)

        # ==================================
        # core attention computation
        # ==================================

        if self.checkpoint_core_attention and self.training:
            core_attn_out = self._checkpointed_attention_forward(
                query,
                key,
                value,
                attention_mask,
                attn_mask_type=attn_mask_type,
                packed_seq_params=packed_seq_params,
            )
        else:
            core_attn_out = self.core_attention(
                query,
                key,
                value,
                attention_mask,
                attn_mask_type=attn_mask_type,
                packed_seq_params=packed_seq_params,
            )

        if packed_seq_params is not None:
            # reshape to same output shape as unpacked case
            # (t, np, hn) -> (t, b=1, h=np*hn)
            # t is the pack size = sum (sq_i)
            # note that batch is a dummy dimension in the packed case
            core_attn_out = core_attn_out.reshape(core_attn_out.size(0), 1, -1)

        # =================
        # Output. [sq, b, h]
        # =================

        output, bias = self.linear_proj(core_attn_out)  # this is the linear layer

        return output, bias

The final attention code is:

class DotProductAttention(MegatronModule):
    """
    Region where selective activation recomputation is applied.
    This region is memory intensive but less compute intensive which
    makes activation checkpointing more efficient for LLMs (20B+).
    See Reducing Activation Recomputation in Large Transformer Models:
    https://arxiv.org/abs/2205.05198 for more details.

    We use the following notation:
     h: hidden size
     n: number of attention heads
     p: number of tensor model parallel partitions
     b: batch size
     s: sequence length
    """

    def __init__(
        self,
        config: TransformerConfig,
        layer_number: int,
        attn_mask_type: AttnMaskType,
        attention_type: str,
        attention_dropout: float = None,
    ):
        super().__init__(config=config)

        self.config: TransformerConfig = config

        assert (
            self.config.context_parallel_size == 1
        ), "Context parallelism is only supported by TEDotProductAttention!"

        assert (
            self.config.window_size is None
        ), "Sliding Window Attention is only supported by TEDotProductAttention!"

        self.layer_number = max(1, layer_number)
        self.attn_mask_type = attn_mask_type
        self.attention_type = attention_type  # unused for now

        projection_size = self.config.kv_channels * self.config.num_attention_heads

        # Per attention head and per partition values.
        world_size = parallel_state.get_tensor_model_parallel_world_size()
        self.hidden_size_per_partition = divide(projection_size, world_size)
        self.hidden_size_per_attention_head = divide(projection_size, config.num_attention_heads)
        self.num_attention_heads_per_partition = divide(self.config.num_attention_heads, world_size)
        self.num_query_groups_per_partition = divide(self.config.num_query_groups, world_size)

        coeff = None
        self.norm_factor = math.sqrt(self.hidden_size_per_attention_head)
        if self.config.apply_query_key_layer_scaling:
            coeff = self.layer_number
            self.norm_factor *= coeff

        self.scale_mask_softmax = FusedScaleMaskSoftmax(
            input_in_fp16=self.config.fp16,
            input_in_bf16=self.config.bf16,
            attn_mask_type=self.attn_mask_type,
            scaled_masked_softmax_fusion=self.config.masked_softmax_fusion,
            mask_func=attention_mask_func,
            softmax_in_fp32=self.config.attention_softmax_in_fp32,
            scale=coeff,
        )

        # Dropout. Note that for a single iteration, this layer will generate
        # different outputs on different number of parallel partitions but
        # on average it should not be partition dependent.
        self.attention_dropout = torch.nn.Dropout(
            self.config.attention_dropout if attention_dropout is None else attention_dropout
        )

    def forward(
        self,
        query: Tensor,
        key: Tensor,
        value: Tensor,
        attention_mask: Tensor,
        attn_mask_type: AttnMaskType = None,
        packed_seq_params: Optional[PackedSeqParams] = None,
    ):
        assert packed_seq_params is None, (
            "Packed sequence is not supported by DotProductAttention."
            "Please use TEDotProductAttention instead."
        )

        # ===================================
        # Raw attention scores. [b, n/p, s, s]
        # ===================================

        # expand the key and value [sk, b, ng, hn] -> [sk, b, np, hn]
        # This is a noop for normal attention where ng == np. When using group query attention this
        # creates a view that has the keys and values virtually repeated along their dimension to
        # match the number of queries.

        # attn_mask_type is not used.
        if self.num_attention_heads_per_partition // self.num_query_groups_per_partition > 1:
            key = key.repeat_interleave(
                self.num_attention_heads_per_partition // self.num_query_groups_per_partition, dim=2
            )
            value = value.repeat_interleave(
                self.num_attention_heads_per_partition // self.num_query_groups_per_partition, dim=2
            )

        # [b, np, sq, sk]
        output_size = (query.size(1), query.size(2), query.size(0), key.size(0))

        # [sq, b, np, hn] -> [sq, b * np, hn]
        # This will be a simple view when doing normal attention, but in group query attention
        # the key and value tensors are repeated to match the queries so you can't use
        # simple strides to extract the queries.
        query = query.reshape(output_size[2], output_size[0] * output_size[1], -1)
        # [sk, b, np, hn] -> [sk, b * np, hn]
        key = key.view(output_size[3], output_size[0] * output_size[1], -1)

        # preallocting input tensor: [b * np, sq, sk]
        matmul_input_buffer = parallel_state.get_global_memory_buffer().get_tensor(
            (output_size[0] * output_size[1], output_size[2], output_size[3]), query.dtype, "mpu"
        )

        # Raw attention scores. [b * np, sq, sk]
        matmul_result = torch.baddbmm(
            matmul_input_buffer,
            query.transpose(0, 1),  # [b * np, sq, hn]
            key.transpose(0, 1).transpose(1, 2),  # [b * np, hn, sk]
            beta=0.0,
            alpha=(1.0 / self.norm_factor),
        )

        # change view to [b, np, sq, sk]
        attention_scores = matmul_result.view(*output_size)

        # ===========================
        # Attention probs and dropout (the softmax dropout is applied here)
        # ===========================

        # attention scores and attention mask [b, np, sq, sk]
        attention_probs: Tensor = self.scale_mask_softmax(attention_scores, attention_mask)

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.

        if not self.config.sequence_parallel:
            with tensor_parallel.get_cuda_rng_tracker().fork():
                attention_probs = self.attention_dropout(attention_probs)
        else:
            attention_probs = self.attention_dropout(attention_probs)

        # =========================
        # Context layer. [sq, b, hp]
        # =========================

        # value -> context layer.
        # [sk, b, np, hn] --> [b, np, sq, hn]

        # context layer shape: [b, np, sq, hn]
        output_size = (value.size(1), value.size(2), query.size(0), value.size(3))

        # change view [sk, b * np, hn]
        value = value.view(value.size(0), output_size[0] * output_size[1], -1)

        # change view [b * np, sq, sk]
        attention_probs = attention_probs.view(output_size[0] * output_size[1], output_size[2], -1)

        # matmul: [b * np, sq, hn]
        context = torch.bmm(attention_probs, value.transpose(0, 1))

        # change view [b, np, sq, hn]
        context = context.view(*output_size)

        # [b, np, sq, hn] --> [sq, b, np, hn]
        context = context.permute(2, 0, 1, 3).contiguous()

        # [sq, b, np, hn] --> [sq, b, hp]
        new_context_shape = context.size()[:-2] + (self.hidden_size_per_partition,)
        context = context.view(*new_context_shape)

        return context

Terminology Explanation

Let's first look at the terminology in the paper. Note that these symbols differ from section 2.1: here h is the hidden dimension (d above) and a is the number of attention heads (h above).

a is the number of attention heads in the Transformer model;
b is the batch size on each GPU;
h is the hidden dimension of each Transformer layer;
L is the number of Transformer layers;
p is the pipeline-parallel size, that is, the number of devices used for pipeline parallelism;
s is the sequence length, that is, the number of tokens in the sequence;
t is the tensor-parallel size, that is, the number of devices used for tensor parallelism;
v is the vocabulary size.

We assume the activation data type is fp16.

Data volume

Each Transformer layer consists of an attention block and an MLP block, plus two LayerNorms. Below, we derive the memory required to store the activations of each part. Several points should be noted in the following analysis:

The unit is bytes, not the number of elements.
Large models typically employ mixed precision training during the training process. Therefore, when analyzing the memory usage of intermediate activations, we assume that the intermediate activation values are stored in float16 or bfloat16 data format, with each element occupying 2 bytes. The only exception is the mask matrix used in dropout operations, where each element occupies only 1 byte.
When analyzing the memory usage of intermediate activations, only the majority of memory usage by activations is considered, ignoring smaller buffers. For example, for layer normalization, calculating the gradient requires the layer’s input, mean, and variance. The input contains bsh elements, while the mean and variance each contain bs elements. Since h is typically quite large (on the order of thousands), bsh ≫ bs. Therefore, for layer normalization, the intermediate activation is approximately estimated as bsh, not bsh + 2bs.

attention block

The activation of the attention block is as follows.

Saved tensor	Operation	Activation size	Module	Reason for storage
X	Q/K/V projection matmuls	`2bsh`	self attention	Save X, the shared input of the Q/K/V projections
Q, K	$QK^T$ matmul	`4bsh`	self attention	Save the two inputs of the $QK^T$ matmul
$QK^T$	Softmax	$2bas^2$	self attention	Save the input to softmax, of shape `[b, a, s, s]`
Mask	Softmax dropout	$bas^2$	self attention	Save the softmax dropout mask; it has the same shape as $QK^T$ but needs only one byte per element
V	Attention computation	`2bsh`	self attention	Save V, one input of the $softmax(\frac{QK^T}{\sqrt{d}})V$ matmul
Score	Attention computation	$2bas^2$	self attention	Save $softmax(\frac{QK^T}{\sqrt{d}})$ after dropout, the other input of that matmul
Linear	Output projection	`2bsh`	linear projection	The output projection needs to save its input
Mask	Output dropout	`bsh`	attention dropout	The dropout after the output projection saves its mask, which needs only one byte per element
total		$11bsh + 5bas^2$

Let’s review the calculation logic of MHA as follows:

$MultiHead(Q, K, V) = Concat(head_1, head_2, \ldots, head_{n_{heads}})W_O$ $head_i = Attention(QW^Q_i, KW^K_i, VW^V_i) = softmax\left(\frac{QW^Q_i (KW^K_i)^T}{\sqrt{d_{head}}}\right)VW^V_i$

The calculations in the table above are explained below.

Input X. X is used to calculate Q, K, and V. The shape of X is [b, s, h], so it has bsh elements. Each FP16 element occupies two bytes, so X takes 2bsh of GPU memory.
Intermediate activations Q and K are used to calculate $QK^T$ . Both Q and K have shape [b,s,h] and element type FP16, so together they occupy 4bsh of GPU memory.
Intermediate activation $QK^T$ . $QK^T$ is the input to softmax, its element type is FP16, and it occupies $2bs^2a$ of GPU memory, where a is the number of attention heads.

The shape of Q is [b, a, s, h/a]. $K^T$ has the shape [b,a,h/a,s]. $QK^T$ has the shape [b, a, s, s]. The calculation formula is as follows:

$score = softmax(QK^T / \sqrt{d_k})$

The mask matrix used in dropout. After the softmax operation, dropout is applied and its mask matrix must be saved. The mask has the same shape as $QK^T$ , but each element takes only one byte, so it occupies $bs^2a$ of GPU memory.
The score weight matrix and V are used to calculate Z.
- After softmax and dropout, the score weight matrix is obtained, which occupies $2bs^2a$ .
- The shape of V is [b,s,h], the element type is FP16, and it occupies 2bsh of GPU memory.
The last step consists of an output projection and a dropout operation. The output projection needs to store its input, which takes 2bsh; the dropout needs to store its mask matrix, which takes bsh. Together they take 3bsh.

Therefore, the memory size occupied by the intermediate activations of the self-attention block obtained by adding the above intermediate activations is:

$11bsh + 5bas^2$
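
The per-tensor sizes in the table can be tallied with a small helper. This is a minimal sketch, assuming FP16 activations (2 bytes per element) and 1-byte dropout masks as stated above; the function name and dictionary keys are illustrative and do not come from any library.

```python
def attention_activation_bytes(b: int, s: int, h: int, a: int) -> dict:
    """Bytes of activations kept by one attention block for the backward pass."""
    parts = {
        "qkv_input_x": 2 * b * s * h,            # shared input X of the Q/K/V projections
        "q_and_k": 4 * b * s * h,                # the two inputs of the QK^T matmul
        "softmax_input": 2 * b * a * s * s,      # QK^T, shape [b, a, s, s]
        "softmax_dropout_mask": b * a * s * s,   # 1 byte per element
        "softmax_output": 2 * b * a * s * s,     # scores multiplied by V
        "v": 2 * b * s * h,                      # V, the other input of that matmul
        "out_proj_input": 2 * b * s * h,         # input of the output projection
        "out_dropout_mask": b * s * h,           # 1 byte per element
    }
    parts["total"] = sum(parts.values())         # equals 11bsh + 5bas^2
    return parts


if __name__ == "__main__":
    # Hypothetical shapes, chosen only for illustration.
    for name, size in attention_activation_bytes(b=1, s=2048, h=4096, a=32).items():
        print(f"{name:>22}: {size / 2**20:10.1f} MiB")
```

Keeping the components separate makes it easy to see which terms scale with `s * h` and which with `a * s * s`.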

MLP

The two linear layers of FFN store their inputs in sizes 2sbh and 8sbh, respectively. The GeLU nonlinearity also requires an input of size 8sbh for backpropagation. Finally, dropout stores its mask in sbh size. In total, the MLP block requires 19sbh bytes of storage.

Module	Action	Activation size
linear 1	The first linear layer needs to save its input.	`2 bsh`
GeLU	The activation function needs to save its input.	`8 bsh`
linear 2	The second linear layer needs to save its input.	`8 bsh`
dropout	Finally, there is a dropout operation that needs to save the mask matrix.	`bsh`
total		$19sbh$

Let’s review the calculation logic of MLP as follows:

$FFN(x) = f_{gelu}(xW_1 + b_1)W_2 + b_2$

The calculations described above are as follows.

The first linear layer needs to store its input, which occupies 2bsh of GPU memory.
The activation function needs to save its input, which occupies 8bsh of GPU memory.
The second linear layer needs to store its input, which occupies 8bsh of GPU memory.
Finally, there is a dropout operation that needs to save the mask matrix, which occupies bsh of GPU memory.

Therefore, for MLP blocks, the intermediate activation value that needs to be saved is 19bsh.
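
The 19bsh figure follows from the FFN expansion factor of 4. The sketch below makes that dependence explicit; `ffn_mult` is a hypothetical parameter name introduced here for illustration, and FP16 activations with 1-byte masks are assumed.

```python
def mlp_activation_bytes(b: int, s: int, h: int, ffn_mult: int = 4) -> int:
    """Bytes of activations kept by one MLP block for the backward pass."""
    hidden = ffn_mult * h               # intermediate FFN width (4h in the standard setup)
    linear1_input = 2 * b * s * h       # input of the first linear layer
    gelu_input = 2 * b * s * hidden     # input of the activation function
    linear2_input = 2 * b * s * hidden  # input of the second linear layer
    dropout_mask = b * s * h            # 1 byte per element
    return linear1_input + gelu_input + linear2_input + dropout_mask
```

With the default `ffn_mult=4` the result reduces to 2bsh + 8bsh + 8bsh + bsh = 19bsh.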

LayerNorm

In addition, the self-attention block and the MLP block each correspond to a layer normalization. Each layer normalization needs to store its input, which is 2sbh in size. The intermediate activations of the two layer normalizations need to be stored, which is 4sbh in size.

Summarize

In summary, the amount of GPU memory required to store intermediate activations in each transformer layer is:

$34bsh + 5bas^2$

For an l-layer transformer model, there are also embedding layers, the final LayerNorm, and the output layer. When the hidden dimension h is large and the number of layers l is deep, the intermediate activations in this part are few and can be ignored. Therefore, for an l-layer transformer model, the memory occupied by intermediate activations can be approximated as:

$(34bsh + 5bas^2) \times l$
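
To get a feeling for the magnitudes, the formula can be evaluated directly. This is a rough sketch under the same assumptions as the derivation (FP16 activations, no recomputation, no parallelism); the configuration in the example is hypothetical and not taken from any specific model.

```python
def layer_activation_bytes(b: int, s: int, h: int, a: int) -> int:
    """Activation bytes of one Transformer layer: 34bsh + 5bas^2."""
    attention = 11 * b * s * h + 5 * b * a * s * s
    mlp = 19 * b * s * h
    layernorms = 4 * b * s * h
    return attention + mlp + layernorms


def model_activation_bytes(b: int, s: int, h: int, a: int, num_layers: int) -> int:
    """Activation bytes of all layers, ignoring embeddings and the output head."""
    return num_layers * layer_activation_bytes(b, s, h, a)


if __name__ == "__main__":
    # Hypothetical configuration, for illustration only.
    cfg = dict(b=1, s=2048, h=4096, a=32)
    linear_term = 34 * cfg["b"] * cfg["s"] * cfg["h"]
    quadratic_term = 5 * cfg["b"] * cfg["a"] * cfg["s"] ** 2
    total = model_activation_bytes(num_layers=32, **cfg)
    print(f"34bsh term : {linear_term / 2**20:.0f} MiB per layer")
    print(f"5bas^2 term: {quadratic_term / 2**20:.0f} MiB per layer")
    print(f"all layers : {total / 2**30:.1f} GiB")
```

For these example values the quadratic $5bas^2$ term is the larger of the two, which is why long sequences make activation memory grow so quickly.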

In contrast, the image below shows the activation status of the decoder in the Harvard code, which includes the shapes of the various tensors.

[Figure: activations of the decoder in the Harvard code, annotated with tensor shapes]

Studies have shown that, during inference with a 13B LLM, each token consumes approximately 1 MB of GPU memory.

In addition, different sources often report different results for computational cost and memory usage. This is mainly because they rest on different assumptions: for example, gradients may be stored in FP16 while parameters are stored in FP32, and recomputation may or may not be used.

parallel

In practice, LLMs are always trained or served with various parallel strategies, which changes the amount of activation memory. The following figure shows the activation size (in bytes) of each Transformer layer under various parallel strategies.

[Figure: activation size (in bytes) of each Transformer layer under various parallel strategies]

Let’s take a look at the activations output by the embedding layer, the final LayerNorm, and the output layer for an l-layer transformer model under the parallel strategy.

Position and word embeddings do not require storing large activations for backpropagation, but the dropout that follows them does. Dropout in the embedding layer is also parallelized along the sequence dimension (sequence parallelism), so its storage is sbhp/t. The factor p appears because, with pipeline parallelism, p microbatches need to be stored.
The LayerNorm preceding the output layer also uses sequence parallelism and therefore requires 2sbh/t of storage. The output layer projects to the vocabulary dimension, which requires storing its input of size 2sbh/t. Finally, the cross-entropy loss requires storing the logits, computed in 32-bit floating point, which takes 4sbv/t. Note that since we only consider the activations of the first stage of the pipeline, these activations, totaling $4sbh/t(1+v/h)$ , are only counted when there is no pipeline parallelism (p = 1).

The total additional memory required for input embedding, the last LayerNorm, and the output layer is:

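$\frac{sbhp}{t} + \delta_{p=1} \frac{4sbh}{t}\left(1 + \frac{v}{h}\right)$

where $\delta_{p=1}$ is 1 when p = 1 and 0 otherwise.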
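
The two contributions described above (sbhp/t for the embedding dropout, plus $4sbh/t(1+v/h)$ when p = 1) can be written as a small function. This is a sketch of that bookkeeping only; the function name is illustrative and the symbols follow the terminology list.

```python
def extra_activation_bytes(s: int, b: int, h: int, v: int, t: int, p: int) -> float:
    """Activation bytes outside the Transformer layers (first pipeline stage)."""
    embedding_dropout = s * b * h * p / t      # 1-byte mask, stored for p microbatches
    if p == 1:
        final_layernorm = 2 * s * b * h / t    # FP16 input of the last LayerNorm
        output_layer_input = 2 * s * b * h / t # FP16 input of the vocabulary projection
        logits_fp32 = 4 * s * b * v / t        # logits kept in 32-bit floats
        tail = final_layernorm + output_layer_input + logits_fp32
    else:
        tail = 0.0                             # these live on the last stage, not the first
    return embedding_dropout + tail
```

With `p > 1` only the embedding dropout term remains, since the final LayerNorm and output layer sit on a different pipeline stage.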

0x04 Transformer computational complexity

In a broad sense, when processing a token, the model performs two types of operations: attention computation and matrix-vector multiplication.

MHA (red box): $W_Q$ , $W_K$ , and $W_V$ . Each projection costs $2 \times (d \times d \times l)$ , where the factor 2 accounts for one multiplication and one addition, so the three projections together cost $3 \times 2 \times (d \times d \times l)$ .
MHA (blue box): $W_{out}$ . The corresponding computational cost is $2 \times (d \times d \times l)$ .
MHA Attention (green rounded box): The computational cost is $2 \times (l \times d/h \times l + l \times d/h \times l) \times h = 4 \times d \times l \times l$ . For a Decoder (LLM), due to the presence of the Causal Mask, the computational cost here should be halved, i.e., $2 \times d \times l \times l$ .
FFN (green box): The computational costs of $W_1$ and $W_2$ are $2 \times (d_{FFN} \times d \times l)$ and $2 \times (d \times d_{FFN} \times l)$ , respectively. LLaMA’s SwiGLU is similar.

[Figure: Transformer block with the computations above marked by colored boxes]

In the analysis below, we follow the terminology of the Megatron paper and ignore multiple heads, i.e., we take the head count to be 1.

4.1 Matrix Multiplication

The decoding stage primarily involves matrix-vector multiplication. A large matrix is multiplied by a vector to obtain another vector.

Therefore, let’s first look at the computational characteristics of matrix multiplication. Arithmetic intensity is defined as the ratio of FLOPs to I/O. When an $N \times M$ matrix is multiplied by an $M \times P$ matrix to produce an $N \times P$ matrix, each output element requires M multiply-add operations, so the FLOPs (floating-point operations, i.e., the computational cost) are:

$2M \times P \times N$

The I/O (data transfer from GPU memory to GPU registers) count is:

$M \times N + M \times P + N \times P$
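
The two expressions above give the arithmetic intensity directly. The sketch below counts I/O in elements rather than bytes, exactly as in the formula; the function names are illustrative.

```python
def matmul_flops(n: int, m: int, p: int) -> int:
    """FLOPs of [n, m] x [m, p]: one multiply and one add per inner-product term."""
    return 2 * n * m * p


def matmul_io_elements(n: int, m: int, p: int) -> int:
    """Elements moved: read both inputs, write the output."""
    return n * m + m * p + n * p


def arithmetic_intensity(n: int, m: int, p: int) -> float:
    """FLOPs per element moved."""
    return matmul_flops(n, m, p) / matmul_io_elements(n, m, p)


if __name__ == "__main__":
    # Vector-matrix product (decode-like) versus a square matrix product.
    print("vector x matrix:", arithmetic_intensity(1, 4096, 4096))
    print("matrix x matrix:", arithmetic_intensity(4096, 4096, 4096))
```

For a vector-matrix product (n = 1) the intensity is always below 2 FLOPs per element, whereas for square matrices of size k it grows as 2k/3. This is the reason the decoding stage tends to be limited by memory traffic.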

4.2 Forward Propagation Computational Cost

Embedding

The input to the embedding operation is [b, s]. The embedding operation does not actually multiply by the entire embedding matrix; each token only reads one row from this matrix, which is a table lookup operation. The final output tensor becomes [b, s, h]. Therefore, the computational cost is relatively small, and we will ignore this part later.

MHA

In standard Transformer computation, it is assumed that $Q, K, V \in R^{s \times h}$ , where s is the sequence length and h is the dimension. The calculation is as follows (details omitted).

Obtain attention score: $S = QK^T \in R^{s \times s}$ . For each query vector, compute the dot product between it and the key vector at all positions.
Obtain attention weights: $P = softmax(S) \in R^{s \times s}$ . That is, a set of scalars obtained by normalization.
Calculate the final output: $O = PV \in R^{s \times h}$ . A vector o is computed by weighting all the previous value vectors using attention weights.

Therefore, we can see that calculating S and O is the main part.

Calculate Q, K, V

A single matrix multiplication is: [b, s, h] × [h, h] to obtain [b, s, h], therefore its computational cost is:

$2bsh^2$

The computational cost of the three matrices is:

$3 \times 2bsh^2 = 6bsh^2$

QK^T

At this stage, for each query element, the attention computation performs a multiply-add operation on each key element to calculate the dot product. The overall operation is: [b, s, h] × [b, h, s] = [b, s, s], and its computational complexity is:

$2bs^2h$

The softmax function does not change the dimension of the input matrix, i.e., [s, s] → [s, s]. A naive softmax takes about (4/5)sh FLOPs, which is relatively small and can be ignored. The scaling by $\sqrt{d}$ is an element-wise operation and can also be ignored.

Multiply by V

The multiplication by V (attention over values) stage performs a multiply-add operation once for each value element to calculate the weighted sum. The overall operation is: [b, s, s] × [b, s, h] = [b, s, h], and the computational complexity is:

$2bs^2h$

linear mapping

The linear projection step (post-attention linear projection) is related to $W_O$ . For multi-head fusion, the input and output shapes of matrix multiplication are [b,s,h] × [h,h] → [b,s,h]. The computational complexity is:

$2bsh^2$

MLP

This step involves two operations.

In the first linear layer, the input and output shapes for matrix multiplication are [b,s,h] × [h,4h] → [b,s,4h]. The computational cost is:
$8bsh^2$
In the second linear layer, the input and output shapes for matrix multiplication are [b,s,4h] × [4h,h] → [b,s,h]. The computational cost is:
$8bsh^2$

LayerNorm

LayerNorm operates element-wise, so there is no universal formula. The two weights of the layer are both vectors of length h, and FLOPs can be estimated as 2h, but are usually ignored.

single layer

Adding the above computational costs together, the computational cost of each transformer layer in the forward propagation phase is approximately:

$24bsh^2 + 4bs^2h$

It can be observed that:

The number of parameters and the computational cost are unrelated to the number of heads. Head partitioning is more about improving accuracy through feature subspace partitioning than about saving parameters or computation.
Recall that the number of parameters is $12lh^2$ . Therefore, given a fixed sequence length, the computational cost increases linearly with the number of parameters.
The attention term $4bs^2h$ grows quadratically with the sequence length.

Attention	computational load	FFN	computational load
Calculate Q, K, V	$6bsh^2$	First linear layer	$8bsh^2$
QK^T	$2bs^2h$	Second linear layer	$8bsh^2$
Multiply by V	$2bs^2h$
linear mapping	$2bsh^2$
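
The per-operation costs in the table can be summed programmatically. This is a minimal sketch that assumes the standard FFN width of 4h and ignores softmax, scaling, and LayerNorm, as the text does; the dictionary keys are illustrative.

```python
def layer_forward_flops(b: int, s: int, h: int) -> dict:
    """Forward FLOPs of one Transformer layer, split by operation."""
    parts = {
        "qkv_projection": 6 * b * s * h * h,   # three [b,s,h] x [h,h] matmuls
        "qk_scores": 2 * b * s * s * h,        # [b,s,h] x [b,h,s]
        "scores_times_v": 2 * b * s * s * h,   # [b,s,s] x [b,s,h]
        "output_projection": 2 * b * s * h * h,
        "ffn_up": 8 * b * s * h * h,           # [b,s,h] x [h,4h]
        "ffn_down": 8 * b * s * h * h,         # [b,s,4h] x [4h,h]
    }
    parts["total"] = sum(parts.values())       # equals 24bsh^2 + 4bs^2h
    return parts


if __name__ == "__main__":
    # Hypothetical shapes: how large is the share of the s^2 terms?
    flops = layer_forward_flops(b=1, s=2048, h=4096)
    quadratic = flops["qk_scores"] + flops["scores_times_v"]
    print(f"share of s^2 terms: {quadratic / flops['total']:.1%}")
```

The s² share equals s / (6h + s), so it only becomes dominant once the sequence length is several times the hidden dimension.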

4.3 Comprehensive Consideration

Model training involves forward propagation and backpropagation. The above only considers the computational cost of the Transformer during the forward propagation stage. We will now consider it in conjunction with backpropagation.

Backpropagation

We’ll continue our analysis by incorporating backpropagation.

single layer

The computational cost of a single Transformer layer is now as follows:

Floating-point operations required for forward propagation: $24bsh^2 + 4bs^2h$ .
For backward propagation, gradients must be computed with respect to both the weights and the inputs, so it requires twice the FLOPs of forward propagation.
If activation checkpointing is used, an additional forward pass is required for each layer during the backward pass.

Therefore, with activation checkpointing, the total number of floating-point operations required for each layer is:

$4 \times (24bsh^2 + 4bs^2h) = 96bsh^2(1 + s/6h)$
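
The factor of 4 is the sum of one forward pass, a backward pass that costs twice as much, and one recomputed forward pass. A short sketch of that accounting, with an illustrative function name:

```python
def layer_training_flops(b: int, s: int, h: int, activation_checkpointing: bool = True) -> int:
    """FLOPs of one Transformer layer in one training iteration."""
    forward = 24 * b * s * h * h + 4 * b * s * s * h
    backward = 2 * forward                                  # gradients w.r.t. weights and inputs
    recompute = forward if activation_checkpointing else 0  # extra forward during backward
    return forward + backward + recompute
```

Without checkpointing the multiplier drops from 4 to 3.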

logits

Another computationally intensive part is the calculation of logits: the hidden vectors are projected to the vocabulary size to obtain the logits vector for each token. The input and output shapes of the matrix multiplication are [b,s,h] × [h,v] → [b,s,v].

Therefore, forward propagation requires 2bshv, backward propagation requires 4bshv, and the total computational cost is 6bshv.

Total computational load

The classic Megatron-LM paper, “Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM,” provides a formula for computing floating-point operations on a standard Transformer-decoder structure. For an l-layer Transformer model with an input shape of [b, s], the computational complexity is as follows.

The floating-point operations required for a single inference pass (forward propagation only):
$l \times (24bsh^2 + 4bs^2h) + 2bshv$
In a single training iteration with activation checkpointing, forward and backward propagation require the following floating-point operations (the $\frac{v}{16lh}$ term already contains the $6bshv$ of the logits):
$96blsh^2 \left(1 + \frac{s}{6h} + \frac{v}{16lh}\right)$

If activation checkpointing is not used, the per-layer factor drops from 4 to 3 while the logits term stays at $6bshv$ , which gives:

$72blsh^2 \left(1 + \frac{s}{6h} + \frac{v}{12lh}\right)$

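In the Megatron-DeepSpeed code, we can also see this formula used to calculate the achieved TFLOPS (tera floating-point operations per second) per GPU. Note that the code keeps the $v/(16lh)$ term for both checkpointing factors, so with factor 3 it counts the logits as $4.5bshv$ rather than $6bshv$ :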

# General TFLOPs formula (borrowed from Equation 3 in Section 5.1 of
# https://arxiv.org/pdf/2104.04473.pdf).
# The factor of 4 is when used with activation check-pointing,
# otherwise it will be 3, but for 200B model, activation check-pointing will always be on.
checkpoint_activations_factor = 4 if args.checkpoint_activations else 3
# GLU activations double the hidden states in the upscaling feed-forward in each transformer layer
# This leads to 16bsh^2 instead of 8bsh^2 per first feed-forward layer in MLP, thus we increase the coefficient by 8.
# Refer to https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/283#issue-1260805063 for more details.
coefficient = 32 if args.glu_activation else 24
flops_per_iteration = (coefficient * checkpoint_activations_factor * batch_size * seq_len * num_layers * (hidden_size**2)) * (1. + (seq_len / (6. * hidden_size)) + (vocab_size / (16. * num_layers * hidden_size)))
tflops = flops_per_iteration / (elapsed_time_per_iteration * args.world_size * (10**12))
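
The closed form $96blsh^2(1 + s/6h + v/16lh)$ can be checked against the explicit sum of the per-layer and logits costs derived earlier. The following is a sanity-check sketch; the function names are illustrative and not part of Megatron-DeepSpeed.

```python
def training_flops_explicit(b, s, h, l, v, checkpointing=True):
    """Sum of layer FLOPs (factor 4 or 3) and logits FLOPs (forward 2bshv + backward 4bshv)."""
    factor = 4 if checkpointing else 3
    layers = factor * l * (24 * b * s * h * h + 4 * b * s * s * h)
    logits = 6 * b * s * h * v
    return layers + logits


def training_flops_closed_form(b, s, h, l, v):
    """Closed form for the activation-checkpointing case."""
    return 96 * b * l * s * h * h * (1 + s / (6 * h) + v / (16 * l * h))


if __name__ == "__main__":
    # Hypothetical configuration, for illustration only.
    cfg = dict(b=4, s=2048, h=4096, l=32, v=50000)
    explicit = training_flops_explicit(**cfg)
    closed = training_flops_closed_form(**cfg)
    print(f"relative difference: {abs(explicit - closed) / explicit:.2e}")
```

The two agree up to floating-point rounding, which confirms that the $v/(16lh)$ term is just the $6bshv$ logits cost expressed relative to $96blsh^2$.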

4.4 Calculation Characteristics

Relationship with parameter quantity

Our conclusion is that the computational cost is mainly related to the model parameters and the number of tokens. Assuming the dataset contains a total of D tokens and the model has N parameters, then for scenarios where the sequence is not particularly long, the computational cost of all token forwards can be approximated as 2ND.

Single inference

During a single inference iteration, the relationship between computational complexity and parameter count is as follows:

$\frac{\text{compute}}{\text{params}} = \frac{l \times (24bsh^2 + 4bs^2h) + 2bshv}{l(12h^2 + 13h) + 2vh} \approx \frac{24lbsh^2}{12lh^2} = 2bs$

Since the number of input tokens in a single inference pass is bs, each model parameter requires approximately two floating-point operations (one multiplication and one addition) per token in one forward propagation. In other words, viewed as a single token being multiplied by each weight matrix, the computational cost of a forward pass per token is about twice the number of parameters.

A single training iteration includes forward propagation and backpropagation, and the computational cost of backpropagation is twice that of forward propagation. Therefore, in a single training iteration, six floating-point operations are required for each token and each model parameter.

A similar calculation can be found in the paper “Scaling Laws for Neural Language Models”, as shown in the figure below.

[Figure: calculation formula from “Scaling Laws for Neural Language Models”]
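
The "two FLOPs per token per parameter" rule can be checked numerically from the compute and parameter expressions in the ratio above. The configuration in the example is hypothetical, and the function names are illustrative.

```python
def forward_flops(b: int, s: int, h: int, l: int, v: int) -> int:
    """Forward FLOPs of the whole model, including the logits projection."""
    return l * (24 * b * s * h * h + 4 * b * s * s * h) + 2 * b * s * h * v


def param_count(h: int, l: int, v: int) -> int:
    """Parameter count used in the ratio above: l(12h^2 + 13h) + 2vh."""
    return l * (12 * h * h + 13 * h) + 2 * v * h


def flops_per_token_per_param(b: int, s: int, h: int, l: int, v: int) -> float:
    """Forward FLOPs divided by (number of tokens x number of parameters)."""
    return forward_flops(b, s, h, l, v) / (param_count(h, l, v) * b * s)


if __name__ == "__main__":
    # Hypothetical configuration, for illustration only.
    ratio = flops_per_token_per_param(b=8, s=2048, h=4096, l=32, v=32000)
    print(f"forward FLOPs per token per parameter: {ratio:.2f}")
```

The value lands slightly above 2 because the $4bs^2h$ attention term is not tied to any parameters; it drifts further from 2 as the sequence gets longer.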

Single training session

A single training iteration includes forward and backward propagation. Because backward propagation is twice as computationally expensive as forward propagation, each training iteration requires 6 floating-point operations per token and per model parameter. Total training computation (Flops) = 6 * number of model parameters * number of tokens in the training data. This is the computational power required to run through all the training data once. The training time can be approximated using the following formula:

$\frac{6 \times \text{model params} \times \text{data volume}}{\text{GPU count} \times \text{GPU FLOPS} \times \text{GPU utilization}}$
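
The estimate is a one-line calculation. The sketch below assumes the 6 FLOPs per token per parameter rule derived above; all hardware and model numbers in the example are hypothetical placeholders, not measurements.

```python
def training_time_seconds(
    num_params: float,
    num_tokens: float,
    num_gpus: int,
    peak_flops_per_gpu: float,
    utilization: float,
) -> float:
    """Approximate wall-clock time for one pass over the training data."""
    total_flops = 6 * num_params * num_tokens
    effective_flops_per_second = num_gpus * peak_flops_per_gpu * utilization
    return total_flops / effective_flops_per_second


if __name__ == "__main__":
    # Hypothetical: 7e9 parameters, 1e12 tokens, 256 GPUs at 312 TFLOPS peak, 40% utilization.
    seconds = training_time_seconds(7e9, 1e12, 256, 312e12, 0.4)
    print(f"{seconds / 86400:.1f} days")
```

If activation checkpointing is enabled, the factor 6 becomes 8 under the accounting used earlier (one extra forward pass), so the estimate should be scaled accordingly.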

Bandwidth limited

Compute does not tell the whole story; the model also needs to access GPU memory, and memory bandwidth can become the bottleneck, because the parameters have to be read from memory. Memory access volume = number of parameters * 2 bytes (for FP16 parameters). With respect to memory bandwidth, the computation in large language models has some distinct characteristics, which we analyze one by one.

Attention computation

In large language model inference, attention computation is memory-bound: its running time is limited by the hardware’s memory bandwidth rather than by its compute throughput.

The matrix multiplication operator has the following characteristics.

A large number of parameters can make matrix multiplication operators memory-bound. When the input is small, the amount of computation is small as well, and the time spent reading the parameters from memory dominates, so these operators are limited by memory bandwidth.
The computational cost increases rapidly with batch size. When the batch size is less than 16, matrix multiplication operators can be considered memory-bound. Only when the batch size is sufficiently large do they become compute-bound, so their behavior changes with batch size.
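
Whether a weight matmul is limited by memory or by compute can be estimated by comparing its arithmetic intensity with the hardware ratio of peak FLOPS to memory bandwidth. This is a simplified roofline-style sketch, assuming FP16 storage and a single `[batch, h] x [h, h]` product; the hardware figures in the example are hypothetical, and the exact crossover batch size depends on the device.

```python
def linear_arithmetic_intensity(batch: int, h: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for x[batch, h] @ W[h, h]."""
    flops = 2 * batch * h * h
    # Read the input and the weights, write the output.
    bytes_moved = bytes_per_elem * (batch * h + h * h + batch * h)
    return flops / bytes_moved


def is_memory_bound(batch: int, h: int, peak_flops: float, mem_bandwidth: float) -> bool:
    """True if the matmul's intensity is below the hardware ridge point."""
    ridge_point = peak_flops / mem_bandwidth  # FLOPs the device can do per byte it can move
    return linear_arithmetic_intensity(batch, h) < ridge_point


if __name__ == "__main__":
    # Hypothetical device: 312 TFLOPS peak, 2 TB/s memory bandwidth.
    for batch in (1, 16, 256, 4096):
        print(batch, is_memory_bound(batch, 4096, 312e12, 2e12))
```

For small batches the intensity is roughly equal to the batch size, because the weight matrix dominates the bytes moved.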

FFN calculation

For the FFN operator, most edge applications call the large language model with a batch size of 1. In this case, most of the computation and memory access in the network is concentrated in the FFN. The overall compute-to-memory-access ratio of the model is extremely low, so the entire network is memory-bound: its running time is limited almost entirely by memory bandwidth rather than by hardware compute.
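
In the memory-bound regime, a simple lower bound on the time of one decode step is the time needed to read all parameters once. The sketch below ignores KV cache reads and any overlap effects; the model size and bandwidth in the example are hypothetical.

```python
def decode_step_time_lower_bound(
    num_params: float,
    mem_bandwidth_bytes_per_s: float,
    bytes_per_param: int = 2,
) -> float:
    """Seconds needed just to stream the weights from GPU memory once."""
    return num_params * bytes_per_param / mem_bandwidth_bytes_per_s


if __name__ == "__main__":
    # Hypothetical: 7e9 FP16 parameters on a device with 1 TB/s memory bandwidth.
    t = decode_step_time_lower_bound(7e9, 1e12)
    print(f"at least {t * 1e3:.1f} ms per token, at most {1 / t:.0f} tokens/s")
```

Because the weights are read once per step regardless of batch size, batching several requests together amortizes this cost across them.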

Impact of KV Cache

KV Cache is an important approach to attention optimization. It is essentially a set of key and value vectors for each previous position in the text. This technique significantly reduces the computational cost of Self Attention. This allows the inference process to be divided into prefill and decode stages (which we will analyze in detail later). The following diagram illustrates the decode stage. In summary, it includes two operators:

Self-attention (highlighted in yellow) involves matrix-matrix multiplication.
Dense projection (highlighted in green) involves vector-matrix multiplication.

The Self Attention operator has a very significant computational characteristic: it is a memory-intensive operator with a computation-to-memory ratio close to 1:1. Theoretical estimates of its memory access and computational complexity reveal that both its memory access and computational complexity are $O(batch\ size \times sequence\ length \times hidden\ dimension)$ . In comparison, similar estimates for MatMul and FeedForward (both matrix multiplication operators) lead to the conclusion that their memory access and computational complexity are both $O(batch\ size \times hidden\ dimension)$ .

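[Figure: the decode stage with KV cache, with self-attention highlighted in yellow and dense projection highlighted in green]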

prefill

MHA block FLOPs: $8sh^2 + 4s^2h$ . FFN is $16sh^2$ .

decode

FLOPs of the MHA in each layer for each decode step:

$8h^2 + 4(s + 1)h$

FFN is $16h^2$ .

overall

When the input data has shape [b,s], the costs of a single inference pass are as follows:

Total computational cost of the prefill phase:
$b \times (24lh^2s + 4lhs^2) + 2bshv = 24lh^2bs + 4lhbs^2 + 2bshv$
Total computational cost of each decode step:
$b \times (8lh^2 + 4lh(s + 1) + 16lh^2) + 2bhv = 24lh^2b + 4lhb(s + 1) + 2bhv$
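
The two costs can be compared side by side. This sketch follows the left-hand sides of the two formulas above; in the decode step only one new token per sequence passes through the output layer, so the logits term is 2bhv. The function names are illustrative.

```python
def prefill_flops(b: int, s: int, h: int, l: int, v: int) -> int:
    """FLOPs to process a prompt of length s for b sequences."""
    return b * (24 * l * h * h * s + 4 * l * h * s * s) + 2 * b * s * h * v


def decode_step_flops(b: int, s: int, h: int, l: int, v: int) -> int:
    """FLOPs to generate one token per sequence with s tokens already cached."""
    return b * (8 * l * h * h + 4 * l * h * (s + 1) + 16 * l * h * h) + 2 * b * h * v


if __name__ == "__main__":
    # Hypothetical configuration, for illustration only.
    cfg = dict(b=1, h=4096, l=32, v=32000)
    ratio = prefill_flops(s=2048, **cfg) / decode_step_flops(s=2048, **cfg)
    print(f"prefill / one decode step: {ratio:.0f}x")
```

A prefill over a single token costs exactly as much as a decode step with an empty cache, which is a quick consistency check on the two formulas.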

How much computation does the kv cache save?

For a context length s, the total computational cost of self-attention without a KV cache is $O(s^3h)$ ; with a KV cache it is approximately $O(s^2h)$ . The savings ratio is:

$\text{savings ratio} = \frac{O(s^3h) - O(s^2h)}{O(s^3h)} = 1 - \frac{1}{s}$

The computational cost drops from $O(s^3h)$ to $O(s^2h)$ , i.e., to roughly 1/s of the original. When s is large, 1/s is close to 0 and the savings ratio approaches 1: the more tokens are generated, the more significant the computational savings.
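
The orders of magnitude can be made concrete by summing the attention cost over all generation steps. This is a simplified sketch that counts only the $QK^T$ and score-times-V products (4 FLOPs per query-key pair per hidden unit) and ignores the causal mask and the linear projections; these counting choices are assumptions made for illustration.

```python
def attention_flops_without_cache(s: int, h: int) -> int:
    """Step t recomputes attention for all t tokens: 4 * t^2 * h FLOPs."""
    return sum(4 * t * t * h for t in range(1, s + 1))


def attention_flops_with_cache(s: int, h: int) -> int:
    """Step t only attends from the new token to t cached tokens: 4 * t * h FLOPs."""
    return sum(4 * t * h for t in range(1, s + 1))


def savings_ratio(s: int, h: int) -> float:
    """Fraction of attention FLOPs saved by the KV cache."""
    return 1 - attention_flops_with_cache(s, h) / attention_flops_without_cache(s, h)


if __name__ == "__main__":
    for s in (1, 10, 100, 1000):
        print(s, f"{savings_ratio(s, h=64):.4f}")
```

Under these counting assumptions the exact ratio is 1 - 3/(2s + 1), which has the same 1 - O(1/s) behavior as the estimate above and does not depend on h.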

0x05 Optimization Directions

The biggest drawback of autoregressive large language models in terms of operational efficiency is that the decoding process is serial and variable-length, making it difficult to efficiently utilize parallel computing and memory bandwidth resources, which in turn leads to memory management and reclamation issues. To address this, the industry has developed numerous system optimization solutions, each of which can significantly improve the speed and performance of model inference.

5.1 Modifying Extrapolation Techniques Based on Attention Mechanisms

The blog post “How Do Language Models put Attention Weights over Long Context” mentions that the attention distribution differs significantly across different layers:

The initial layers mainly mix word embeddings, with attention distributed roughly evenly.
The attention patterns in the middle layers become more complex, with most of the probability mass concentrated on the initial tokens (attention sink) and the most recent tokens (recency bias).
The final layer reveals all the attention patterns.

As can be seen from the above, the middle layer mostly exhibits a “V-shaped” attention distribution, meaning that many tokens in the middle layer are actually not very effective. Therefore, we can consider accelerating inference and increasing extrapolation capabilities by reducing the number of tokens for different layers.

Next, we’ll look at how to increase extrapolation capabilities based on attention mechanisms.

name	Main idea
StreamingLLM	When assembling the KV cache, the initial tokens are always kept (attention sinks), and window attention over the most recent tokens is used to improve computational efficiency.
LM-Infinite	Builds on the V-shaped attention distribution. Because the middle tokens receive relatively little attention, a Λ-shaped attention mask is introduced, and a distance upper limit is set to restrict the “effective distance”. Optionally, attention can also be paid to the k middle tokens with the largest attention logits.
SirLLM	Key phrases are filtered by measuring token entropy and using a memory decay mechanism. Tokens with higher entropy values are considered to contain more information. The memory decay mechanism works by multiplying each entropy value in the token entropy cache by a decay rate less than 1. Over time, earlier information is gradually forgotten, while more recent key information is retained.
SparQ Attention	A token typically needs to attend to only a small portion of the sequence. If the tokens that will receive high attention scores can be predicted efficiently, only the keys of those high-scoring tokens need to be kept, which improves memory bandwidth efficiency. Therefore, a compression approach is proposed: `r` components are selected to estimate the attention scores, and the top-k key and value vectors are then determined.
Dynamic Memory Compression	DMC fine-tunes on pre-trained LLMs to learn a compression strategy, and then performs online compression of the key-value cache during inference. DMC introduces decision variables $\alpha$ and importance variables $\omega$ , which at each time step determine whether to append the current key-value representation to the cache or perform a weighted average with the top element in the cache.
Infini-attention	Compressive memory is integrated into the standard attention mechanism, and masked local attention and long-term linear attention mechanisms are built in a single Transformer block.
LongLoRA	Introduces Shifted Sparse Attention to fine-tune the model and extend the context length. The model fine-tuned with Shifted Sparse Attention retains the original standard self-attention architecture during inference. This means that the model can use the unmodified attention mechanism during inference, allowing most of the existing optimizations and infrastructure to be reused.
self-extend Attention	Self Extend uses a simple floor division operation to map large, unseen relative positions to relative positions encountered during pre-training. To address the issues of long-distance and proximity dependencies, Self Extend introduces a two-layer attention mechanism: Grouped Attention and Neighbor Attention.
Dual Chunk Attention	By decomposing the attention computation of long sequences into block-based modules, the model can effectively capture relative positional information within the same block (Intra-Chunk) and between different blocks (Inter-Chunk). The attention outputs of intra-block, inter-block, and consecutive blocks are then merged to obtain the final output representation. This representation considers both local and global information in the sequence, enabling the model to effectively handle long sequences.
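
As an illustration of the sink-plus-window idea in the first row of the table, the function below selects which token positions stay in the KV cache. It is a minimal sketch of the eviction rule only, not the implementation from any of the listed papers; `num_sink` and `window` are hypothetical parameter names.

```python
def sink_window_positions(seq_len: int, num_sink: int, window: int) -> list[int]:
    """Positions kept in the KV cache: the first num_sink tokens plus the last window tokens."""
    if seq_len <= num_sink + window:
        return list(range(seq_len))                    # nothing has to be evicted yet
    sinks = list(range(num_sink))                      # initial tokens that attract attention
    recent = list(range(seq_len - window, seq_len))    # sliding window of recent tokens
    return sinks + recent


if __name__ == "__main__":
    print(sink_window_positions(seq_len=10, num_sink=2, window=3))
```

The cache size is bounded by `num_sink + window` regardless of how long the sequence grows, which is what keeps the per-step attention cost constant.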

5.2 Extrapolation Technique Based on Memory Mechanism

The extrapolation technology based on the memory mechanism actually follows the compression idea. It uses external storage to store historical information and then uses the most recent token to query and retrieve some historically important tokens.

name	Main idea
InfLLM	Constructs an additional context memory module to store context information far from the current processing position, and designs an efficient mechanism to look up the units relevant to the token currently being processed for use in attention computation.
Recurrent Memory Transformer (RMT)	Context extension is achieved by combining the recurrent mechanism of recurrent neural networks (RNNs) with the memory augmentation capabilities of the Transformer model. RMT introduces a memory mechanism on top of the Transformer model, which consists of a set of trainable real-valued vectors (called memory tokens). These memory vectors can store and process local and global information and pass information between different segments of a long sequence through the recurrent mechanism.


0xFF Reference

The potential for parallel inference across multiple large language fine-tuning models

Contiguous Batching/Inflight Batching

Full Stack Transformer Inference Optimization Season 2: Deploying Long-Context Models Yao Fu Paper version

GPTQ / AWQ

How Do Language Models put Attention Weights over Long Context Yao Fu

HunYuan MoE: A casual chat about AI, including LLM parameter count, computational cost, and MFU.

llm Parameter Quantity, Computational Cost, and Memory Usage Analysis (Zhang)

LLM Large Model Training-Inference Memory Usage Analysis (chaofa adds a touch of coding)

LLM (23): The Long Text Problem in LLM (Auspicious Purple Qi from the East)

Notion – The all-in-one workspace for your notes, tasks, wikis, and databases.

OpenPPL-LLM | OpenPPL’s Large Language Model Inference Engine is Here !

PagedAttention

Towards 100x Speedup: Full Stack Transformer Inference Optimization Yao Fu

Transformer estimate 101

Transformer Data Estimation - Memory Usage Bruce’s Wanderings

Analyze the number of parameters, computational cost, intermediate activations, KV cache, and spintronics of the transformer model.

Analyzing the Batch Processing Effect in GPT Inference ( Lequn Chen || abcdabcd987)

The Potential of Parallel Inference for Multiple Large Language Fine-tuning Models (Lequn Chen || abcdabcd987)

Large Model-Deployment-Capacity Estimation Ideas (Lancet)

Analysis of Bottlenecks and Limiting Theoretical Values in Large Model Inference (by WALL-E, who likes curly hair)

Activating Memory: How Much Memory Does Model Inference Require? (Wei Xinyu [Da Wei Sharing])
