Exploring the Transformer Series (35) --- Fundamentals of Large Model Quantization

0x00 Overview

Applying existing quantization techniques directly to large models presents challenges, leading to significant quantization errors and decreased accuracy. This is primarily due to the scale and complexity of large models. Compared to smaller models, large models typically exhibit more outliers in their weights and activations, and possess a wider distribution range. The authors of LLM.int() observed this:

Unlike smaller models, LLMs exhibit unique weight and activation distributions, characterized by a large number of outliers. Because of these outliers, most normal values will be zeroed out if we use INT8 quantization.
The phase shift in Emergent Features occurs when the number of model parameters reaches 6.7B, indicating that the model performs very differently before and after the number of parameters reaches 6.7B.

These outliers significantly impact the quantization process because they increase the quantization step size and reduce the accuracy of intermediate values. Therefore, methods applicable to transformer models with fewer than 6.7 B parameters must be generalized to models with more than 6.7 B parameters with extreme caution. For example, the best quantization clipping methods for small models are not readily available for LLMs. This necessitates developing tailored quantization techniques to handle these unique characteristics without compromising model performance or efficiency.

0x01 outlier

Internally, data is typically represented as multidimensional tensors (which can be understood as multidimensional arrays). For attention mechanisms, these tensors include dimensions such as batch size, sequence length, number of attention heads, and head dimension. Large-scale values refer to situations where, in the attention head dimension of an LLM, the values of some elements are significantly higher than the average of other dimensions of that head (usually more than 5 times).

1.1 Definition

The author of LLM.int8() provides the following definition for the outlier:

For a transformer with (l) layers, the hidden state is (X_l \in R^{s \times h}, l = 0 \ldots L), where (s) is the sequence dimension and (h) is the feature dimension. We define a feature (h_i) as a specific dimension within hidden state (X_{l_i}). We track each dimension across all layers. If (0 \le i \le h), we call a dimension an outlier if it satisfies the following condition:

At least one value has an absolute value greater than or equal to 6;
Satisfying condition 1, (h_i) must appear in at least 25% of the layers of the transformer;
Satisfying condition 1, (h_i) appears in the hidden states of at least 6% of the sequence dimension.

3501

The image above shows an example of an outlier (with four yellow outlier features). The horizontal axis represents the hidden_dim dimension, and the vertical axis represents the sequence dimension. In the image above, seq_len = 3, so an outlier feature is a 3×1 vector.

1.2 Features

Researchers have found that the flatness of the distribution of weights and activation values before quantization is a key factor affecting the quantization error of LLM. Intuitively, the flatter the distribution, the fewer outliers, and the higher the accuracy during quantization. The more uneven the distribution of activations and weights, the greater the quantization error.

The image below illustrates the characteristics of outliers in activation and weights. Red represents activation, and green represents outliers.

3502

activation

For activation, the characteristics of outliers are as follows:

After training begins, the residual activation values rapidly change from a Gaussian distribution to a logistic distribution, and a large outlier appears.
When the model parameters exceed a certain threshold, the proportion of outliers suddenly increases.
Each activated channel has a similar distribution across different tokens. For example, a channel with a large weight value on a certain token will also have a large weight value on other tokens.

Weight

For weights, the characteristics of outliers are as follows:

The weight distribution is quite uniform and flat, making it easy to quantize. Quantizing the weights of an LLM using INT8 or even INT4 does not reduce accuracy.
The weights in the FFN part are initially initialized from a randomly selected Gaussian distribution, which is relatively stable at first; after a certain training period, they begin to change drastically; subsequently, the overall distribution stabilizes again. The weights generally retain a Gaussian distribution, but there are some outliers that are not very large.
Outliers in an LLM are few and concentrated in a few specific columns, which may store context-independent information.
In addition, in models with gated linear units (GLUs), activations and weights are mostly symmetrically distributed, making symmetric quantization the best choice.

gradient

The gradient distribution shows a similar trend to the weights, and no significant outliers were observed during the training process, indicating that the gradient itself has good stability and the possibility of low-precision calculation and storage exists.

1.3 Occurrence Process

Tim Dettmers, the author of LLM.int8(), mentioned the concept of Emergent Features in his blog post “LLM.int8() and Emergent Features”. Emergent Features are Emergent outlier features.

The other pitch talks about emergent outliers in transformers and how they radically change what transformers learn and how they function.

This blog post is a more speculative version of the paper that teases out the super curious details about the fascinating properties surrounding the emergent outlier features I found.

“Emergent” describes a phenomenon where outliers gradually increase and, after experiencing a phase shift (a sudden and rapid increase in outliers), severely impact model performance. Here, “increase” refers to: the numerical value of the outliers increases, their number increases, and the number of tokens and model layers affected by these outliers also increases.

Experiments in the LLM.int8() paper revealed that when the transformer size was increased to 6 bytes, outliers first appeared in 25% of the transformer layers, then gradually spread to other layers. When the model reached 6.7 bytes, all transformer layers were affected by outliers, and the affected sequence dimensions increased from 35% to 75%. Their distribution across all transformer layers was concentrated in the six feature dimensions. The figure below shows the percentage of layers affected by outlier features and the percentage of sequence dimensions in the Transformer; these values are correlated with model size.

3503

The authors of the paper also described in their blog how Emergent Features grow as the number of model parameters increases.

Even in a relatively small Transformer model with 125M parameters, Emergent Features still exist. However, in this case, Emergent Features only exist in the output of the attention projection layer (query/key/value/output).
As the number of parameters in the Transformer model increases to 350M to 130 million, Emergent Features begin to appear in the outputs of the attention and FFN, and they appear on the same dimension, but their positions differ in different mini-batches or different layers. The original blog post argues that this consistency, to some extent, represents the collaboration between different layers of the model. At the same time, the distribution of Emergent Features begins to exhibit some patterns.
When the number of parameters in the model reaches 2.7B to 6B, Emergent Features appear on the same dimension in 60% of the layers.
Significant outlier features across all Transformer layers suddenly appear between the 6B and 6.7B parameters, indicating a phase shift. The original blog post also describes some changes that occur within the model under these circumstances:
- The percentage of layers affected by outliers increased from 65% to 100%, and the percentage of tokens affected by outliers increased from 35% to 75%. Simultaneously, quantization began to fail. The core reason for the quantization method failing starting with version 6.7B is likely that the quantization distribution range is too large, causing most quantization bins to be empty, and small quantization values to be quantized to zero, essentially eliminating useful information.
- Attention layers become very sparse; FFN layers become denser; Transformers become more stable. If outliers are separated from the model, the remaining parts can be computed with 8-bit or even lower precision.

1.4 Distribution Pattern

The distribution of newly emerging outlier features follows a pattern. The authors of LLM.int8() summarized several phenomena regarding emerging features reported in the original paper on their blog:

Emergent features are systematic in large models: they either appear in most layers or not at all. However, they appear probabilistically in small models: they only appear sometimes in certain layers.
Emergent Features are likely to appear in the following locations: attention projection layer (query/key/value/output) and the first layer of FFN.
Outliers in activations are concentrated in a small subset of channels. Typically, these outlier features are distributed across only a few dimensions of the Transformer layer. For example, for a Transformer model with 670 million parameters, if the input sentence sequence is 2048 characters long, each sequence will have approximately 150k outlier features across the entire model, but these will be concentrated in only 6 distinct feature dimensions.
These outliers in activations appear on almost all tokens, but are confined to fixed channels in the hidden layer dimension; given a token, the variance between different channels can be large, but for different tokens, the variance within the same channel is small. Considering that these outliers in activations are typically 100 times larger than other activation values, this makes activation quantization difficult.
Emergent features increase exponentially with ppl and are independent of model size.

As shown in the figure below, the emergence of numerous outlier features exhibits a smooth trend, essentially functioning as an exponential function of perplexity. This indicates that the appearance of outliers is not sudden. Furthermore, by studying the exponential trend in smaller models, the authors were able to detect outlier features before phase shift (a temporal shift of a waveform) occurs. This also suggests that the occurrence of outliers is not only related to model size but also to perplexity, as well as multiple additional factors such as the amount and quality of training data used. The authors speculate that model size is only one important covariate among many required for the emergence of discrete features.

3504

Furthermore, these large-scale values exhibit a peculiar phenomenon: they display a striking consistency across different attention heads, concentrating on similar location indices. This challenges our conventional understanding—that each attention head should operate independently, processing different types of information. Imagine 10 people considering the same problem; normally they would focus on different angles, but research now shows they all coincidentally focus on the same few points—a highly counterintuitive approach. You might notice that the distribution of these values is not random but follows a structured pattern, suggesting they play a specific and crucial role in the model’s information processing.

1.5 Reasons for occurrence

The reasons for outliers have been studied and many viewpoints have been proposed. A general summary is as follows: softmax and RoPE are the starting points for generating outliers. When tokens flow within the Transformer architecture, modules such as FFN and Norm further amplify outliers. We will analyze these in detail below.

1.5.1 softmax

One viewpoint suggests a connection between outliers and the softmax mechanism. In the context of LLM, outliers arise because an attention head doesn’t want to focus on certain semantically meaningful tokens, thus prioritizing non-semantic tokens (such as commas). How can more attention be given to non-semantic tokens? This is achieved through heavy weighting to generate outliers. We use the papers “Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing” and “StableMask: Refining Causal Masking in Decoder-only Transformer” for learning.

LLM model

The research results of the paper “Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing” are as follows:

Over 97% of the outliers are delimiter tokens, such as [SEP], commas, and periods.

3505

The attention head assigns almost all the probabilities to the [SEP] token and other tokens with less information, such as dots/commas (left side of Figure (a) below), while the values corresponding to these tokens are very small (middle side of Figure (a) below). This results in a small product between the two (right side of Figure (a) below).
In other cases (Figures b and c), we observe that a large portion of the attention probabilities are still assigned to the delimiter token. However, this results in a (soft) selective update of the hidden representation by assigning some probabilities to other tokens.

3506

The image above illustrates the “no attention/no update” mechanism of Attention, explained in detail below:

There are some irrelevant tokens in the sequence, such as the initial token or non-functional words like punctuation marks. These tokens are more often observed by other tokens, and we can call them non-semantic tokens.
Attention mechanisms typically require only a few important tokens; the others are likely just noise. The mechanism should only update the weights of these important tokens. Ideally, the attention scores of these noise tokens should be zeroed out. After all, not all tokens need to be learned; that is, not all tokens participate in updating the weights.
In some situations, the model might find that “there are few tokens worth noting.” In this case, it chooses to disproportionately focus attention on non-semantic tokens, minimizing the allocation of attention probabilities to other tokens. These non-semantic tokens will output smaller values. This prevents the accidental updating of meaningful tokens, effectively achieving the effect of “not paying attention/not updating.” The authors refer to this mechanism as the “no-op” phenomenon of attention.

The authors of the paper “Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing” present several hypotheses and inferences regarding the “no-op” phenomenon. We will analyze these in conjunction with the ideas from the paper “StableMask: Refining Causal Masking in Decoder-only Transformer”:

In order to prevent the attention module from updating the representation of a token on the residual, some attention heads want to allocate most of their attention probabilities to some fixed and common tokens with low information content (e.g., delimiter tokens or background blocks), which can produce small values after learning.
As can be easily seen from the definition of the softmax function, this gives the softmax input a relatively large dynamic range. In fact, in the extreme case where softmax is exactly zero, this dynamic range will be infinite.

3507

Because layer normalization normalizes outliers, the output of the preceding FFN layer must be very high to maintain a sufficiently large dynamic range after layer normalization. Note that this also applies to Transformer models that apply layer normalization before self-attention or linear transformations.
In the original softmax function, all inputs are mapped to the range of 0 to 1, and the sum of all output values is 1, thus inevitably requiring attention to be distributed across all visible tokens. In this case, the requirement imposed by the softmax function prevents the model from effectively zeroing out attention scores for irrelevant tokens. This means that even some very small input values will have a non-zero output value after processing by the softmax function. Since softmax never outputs an exact zero, it always backpropagates gradient signals, resulting in larger outliers. Therefore, the longer the network is trained, the larger the magnitude of outliers becomes.
In Decoder Only models, the first tokens in the sequence are more likely to become outliers. This is likely because, as the sequence is gradually masked, the first tokens receive more attention initially, as there are few tokens in the input. After softmax, the smaller the denominator, the more attention they receive. However, as the number of tokens increases, more tokens participate in the softmax operation, and even assigning a very small probability to each token can lead to significant cumulative probability. Therefore, sink tokens do not receive as much attention as they do at the beginning of the sequence. Furthermore, because the initial tokens are visible to almost all subsequent tokens, they are more likely to be trained as attention pools, attracting unnecessary attention.

The image below is a schematic diagram of the attention layer in BERT, illustrating how outliers in the previous layer affect the behavior of the attention mechanism in the next layer. The hidden activation tensor is represented by x. The output of the FFN that generates the largest outlier is highlighted in red.

3508

VIT model

The paper “Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing” analyzes the feature maps of the ViT model and finds that it also has outliers. Outliers in ViT mainly appear in the image background, and the background tokens are also largely uninformative. The following figure shows a summary of the ViT outlier analysis on random images in the ImageNet validation set. (a) Input image. (b) Outliers in the output of layer 11. (c) Cumulative attention weights spent on each patch of attention head #1 and layer #12 (attention probability matrix summed by rows). (d) Corresponding attention probability matrix. (e) Average size of outlier and non-outlier patches.

3509

The paper “Vision Transformers Need Registers” presents the following viewpoints:

Similar to the Quantizable Transformer, the authors observed several visual Transformer models and found that their feature maps contain many outlier features, which are features with a large L2 norm, and these outlier features appear more frequently in deeper layers and during long-term training of larger models.
By measuring the cosine similarity between outlier features and the features of four adjacent tokens, the authors found that these features are highly similar, suggesting that the information in outlier features is highly redundant.
The authors hypothesize that outlier features contain less location and pixel information.
The authors designed another experiment, using individual outlier features and normal features to predict the category of the image to which the token belongs. The results showed that outlier tokens had higher classification accuracy, thus demonstrating that outlier tokens may contain more global information.
The authors’ hypothesis is similar to previous research: fully trained large models identify redundant tokens during training and use these tokens to process, store, and retrieve global information. The authors assume this behavior itself is not bad, but it is undesirable for the model’s output to contain such tokens. In fact, these anomalous tokens can cause the model to discard local information, leading to a decrease in performance on dense prediction tasks.

1.5.2 RoPE

The authors of the paper “Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding” argue that RoPE is the fundamental reason for the emergence of structured “outliers” in QK representations.

Phenomenon

The paper found the following phenomena:

In models using RoPE (such as Llama, Gemma, Qwen, etc.), there are significant massive values in the query (Q) and key (K) components of the attention mechanism. These values are concentrated in specific regions, but this phenomenon is not present in the value (V) component.
The above characteristics were not observed in models that did not use RoPE (such as GPT-2, OPT, etc.).
Quantization techniques that intentionally preserve these extreme values (such as AWQ and SmoothQuant) can maintain the original performance of the model; conversely, if methods that fail to preserve these values are used (such as GPTQ), the model’s contextual reasoning ability will be severely damaged.
The authors consistently observed this phenomenon of “huge value” concentration across various Transformer architectures, including autoregressive LLM (large language model) and multimodal models.
Large-scale values appear from the first layer and maintain a relatively consistent pattern before and after applying RoPE. This indicates that the formation of large-scale values is a gradual result of model training, rather than simply caused by RoPE.

analyze

The authors argue that the RoPE mechanism, by dividing the embedding dimension into pairs and applying rotation operations at different frequencies, enables low-frequency regions to encode rich semantic content rather than positional information, thus facilitating a concentrated distribution of large-scale values. This pattern does not exist in LLMs (Large Language Models) that do not employ RoPE.

1.5.3 FFN

The following text is excerpted from “From Training Dynamics to Outlier: Numerical Characteristics Analysis of LLM Model Training Process”.

Dynamic Analysis of Matrix Multiplication

Given matrix multiplication (C = A \times B \in \mathbb{R}^{m \times n}) Its elements can be decomposed into:

$C = A \times B = \begin{bmatrix} a_0 \\ a_1 \\ \vdots \end{bmatrix} \times \begin{bmatrix} b_0 ; b_1; \dots \end{bmatrix} = \begin{bmatrix} a_0b_0 & a_0b_1 & \dots \\ a_1b_0 & a_1b_1 & \dots \\ \vdots & \vdots & \ddots \end{bmatrix} = [c_{ij}] = [a_i \cdot b_j]$

Further, there are:

$c_{ij} = \|a_i\| \|b_j\| \left[ \frac{a_i \cdot b_j}{\|a_i\| \|b_j\|} \right] = \|a_i\| \|b_j\|\cos{\theta_{ij}}$ $c_{ij} = \langle a_i, b_j \rangle = \|a_i\| \|b_j\| \left( \frac{\langle a_i, b_j \rangle}{\|a_i\| \|b_j\|} \right) = \underbrace{\|a_i\|\|b_j\|}_{\text{能量项}} \cdot \underbrace{\cos\theta_{ij}}_{\text{相关性项}}$

Let (\theta_{ij}) be the angle between two vectors. If both vectors have zero values, then (\cos{\theta_{ij}}) Linear dependence of two vectors (\rho) They are equal. Based on the above analysis, matrix multiplication can be decomposed into two parts:

$C = \underbrace{[\|a_i\| \|b_j\|]}_{能量矩阵} \odot \overbrace{[\cos \theta_{i,j}]}^{相关性矩阵} = \mathbf{E} \odot \mathbf{R}$

in (a_i) This represents the energy of each token. (b_j) This represents the inherent scaling of the weight matrix for each feature channel. The energy matrix spanned by both is (\mathbf{E}). This represents the total energy distribution input to the matrix multiplication stage, while the correlation matrix (\mathbf{R}) This represents energy transfer efficiency and information selection. Generally speaking, the energy matrix (\mathbf{E}) It has a high dynamic range, and the correlation matrix (\mathbf{R}) High computational accuracy is required.

FFN main calculation

The main calculations for FFN are as follows:

$y = \left[ \text{Swish}(xW_1)\odot xW_2 \right] W_3$

Among them, the cubic linear transformation (W_1) This is called gate projection. (W_2) It is called up projection, and (W_3) This is called down projection. The properties between the energy matrix and the correlation matrix of these three projections are as follows:

The weight matrix distribution basically follows a normal distribution and is relatively stable, while the activation input often shows significant outliers.
There is no correlation between the correlation matrix and the energy matrix;
Energy Matrix (\mathbf{E}) The correlation with the output activation value is weak, and its impact on the generation of outliers is limited.
Correlation matrix (\mathbf{R}) It shows a clear linear correlation with the output activation value and is the dominant factor in generating large outliers;

Let’s analyze the mathematical mechanism of outlier generation in SwiGLU: SwiGLU can be viewed as a gated selection amplification unit, and its calculation can be decomposed into:

$\overbrace{\text{Swish}(W_{gate}x)}^{门控向量} \odot \underbrace{W_{up}x}_{特征向量}$

This makes the gated vector a feature selector, allowing only positively correlated features to pass through. The mechanism of the gated amplification unit also leads to the following effects:

Outlier amplification effect, when the gating unit is in the linear region, i.e. (W_{\text{gate}}x > 0) hour, (\text{Swish}(z) \approx z) At this point, the gated output linearly amplifies the up projection. When both the gate value and the up projection are relatively large, the up projection “collides” with the outlier in the gate projection, and the multiplication results in a very large output value, causing the down projection input activation value distribution to typically exhibit a highly sharp characteristic.
Zero-value clustering effect, when the gated unit is in the saturation region, when (W_{\text{gate}}x < 0) hour, (\text{Swish}(z) \rightarrow 0) This causes a large number of activation values to be pulled close to zero, which in turn compresses the dynamic range of the down projection input. Thus, the down projection includes both outliers far from zero and small values very close to zero, making this input extremely difficult to handle for low-precision training.

1.5.4 LayerNorm

The scale parameter γ in the LayerNorm structure acts as an amplifier, amplifying outliers in the output.

The figure below shows the outlier representation of LayerNorm on BERT-ST-2. It can be seen that the value at dimension 308 is sharper, and the multiplier γ and output (\widetilde X) Outliers are contained within the same embedding dimension. If γ is removed, then (X’) The distribution is more even.

3510

1.5.5 RMSNorm

The RMSNorm layer is a normalization layer used in mainstream large-scale models. Compared to the traditional LayerNorm, it reduces the calculation of the mean, thus making the normalization layer computation more efficient. The RMSNorm is defined as follows:

$\bar{x_i} = \frac{x_i}{RMS(x)}w_i, \qquad RMS(x) = \sqrt{\frac{1}{H}\sum_{i=0}^{H} x_i^2}$

In the large model, RMSNorm first normalizes the representation of each token, then scales each channel. We can intuitively see the impact of the normalization and scaling stages on the numerical range. Normalization effectively suppresses outliers, while scaling amplifies some important channels, but it can also cause some values to become very large, thus exacerbating the outlier situation. In general, RMSNorm’s main function is to compress the dynamic range of outliers. This compression is successful in most cases, but some layers can only provide a small compression ratio. In such cases, it can affect outliers.

1.6 Function

1.6.1 Negative Effects

What impact do outliers have on quantization? According to findings in Llm.int8() and Smoothquant, significant outliers exist in both weights and activations in LLM. These outliers significantly impact the quantization process because they increase the quantization step size and reduce the precision of intermediate values. Retaining these sparse outliers can also lead to a substantial speed reduction (e.g., 1.5% outliers cause a speed reduction of over 30% in SpQR).

Suppose we have a vector A = [-0.10, -0.23, 0.08, -0.38, -0.28, -0.29, -2.11, 0.34, -0.53, -67.0]. We can see that outliers are harmful:

If we quantize and dequantize this vector while preserving the emergent feature -67.0, the result is: [-0.00, -0.00, 0.00, -0.53, -0.53, -0.53, -2.11, 0.53, -0.53, -67.00]. Most of the information is lost after processing.
If we remove the emergent feature -67.0 from the quantization and dequantization of vector A, the result is: [-0.10, -0.23, 0.08, -0.38, -0.28, -0.28, -2.11, 0.33, -0.53]. The only error is that one value, -0.29, becomes -0.28.

Let’s derive it using mathematical formulas.

3511

Please note that the calculation (\Delta) When the maximum value is used, the step size for outliers in X increases significantly. This causes outliers to be rounded to even further values, increasing quantization error. As the number of outliers increases, they are rounded to fewer discrete values, resulting in more quantization bins being unused. Consequently, outliers lead to decreased quantization fidelity.

1.6.2 Positive Effects

Maintain performance

The authors of LLM.int8() found that outlier features strongly impact the overall performance of Attention and Transformer. It turns out that pruning these outliers during quantization is detrimental to LLM performance. Preserving outliers is crucial for the performance of large language models. For example:

In the initial stages of training, any pruning-based method will result in an abnormally high perplexity score (i.e., > 10000), leading to a significant loss of information that is difficult to recover through fine-tuning. Moreover, for larger models, the impact of outliers on model performance is greater, and the model becomes more dependent on outliers.
Retaining sparse outliers can improve accuracy. Removing outliers, even with a maximum of seven outlier feature dimensions, reduces the top-1 softmax probability from approximately 40% to approximately 20%, while increasing validation set perplexity by 600-1000%. When seven random feature dimensions are removed instead, the top-1 probability decreases by only 0.02-0.3%, and perplexity increases by 0.1%. These results highlight the crucial nature of outlier features. The quantization accuracy of these outlier features is critical, as even small errors can significantly impact model performance.

contextual understanding

Before understanding the role of large-scale values, we need to distinguish between two types of knowledge. “Parametric knowledge” refers to the knowledge that a model learns during training and stores in its parameters, such as “Paris is the capital of France”; while “contextual knowledge” refers to the information that a model obtains from the current input context, such as understanding an article and answering questions about the details mentioned in it.

The authors of the paper “Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding” argue that massive values play a crucial role in processing contextual knowledge, while having a relatively small impact on parametric knowledge.

Such extreme values are crucial for the model’s contextual understanding ability, but rely less on parameterized knowledge. Experiments show that if such values are disturbed, the model can still recall existing facts (e.g., answering “What is the capital of China?”), but its performance will significantly decline in tasks that require context (such as the GSM8K mathematical reasoning test).
When the authors intentionally created conflicts between the contextual information and the model’s intrinsic knowledge (e.g., “Geographical knowledge has changed; New York is now a city in the United Kingdom”), they found that LLMs (Large Language Models) performed no differently than random guessing. However, when the large values were corrupted, the model’s accuracy was significantly higher than random levels. This suggests that LLMs, by default, tend to rely more on their internal knowledge, and these “huge values” play a crucial role in guiding the model’s understanding of context. When the large values are corrupted, the model loses its ability to handle misleading contextual information and instead defaults to using its parameter knowledge, effectively ignoring contradictory contexts.
These “huge values” are indispensable for tasks that require context—such as key information retrieval, text sentiment analysis, and logical reasoning—while their impact on the direct retrieval of parameterized knowledge is relatively limited.

Therefore, people often choose to retain these outliers. Naïve and efficient quantization methods (W8A8, ZeroQuant, etc.) lead to increased quantization errors and decreased accuracy. GPTQ does not specifically protect large-scale values, resulting in a significant performance drop on context knowledge understanding tasks (accuracy drops to about 75% after normalization), but it still performs well on parametric knowledge retrieval tasks. To handle outliers more effectively, LLM.int8() proposes a hybrid precision decomposition to ensure the accuracy of some outliers in activations. SmoothQuant reduces the difficulty of quantization by smoothing outliers in activations. SpQR identifies sensitive weights to ensure their accuracy while quantizing other weights to a lower bit width. AWQ maintains strong performance on all tasks by selectively protecting “important” weights during quantization to preserve large-scale values.

1.7 Difficulties

We summarize the reasons for the difficulty in quantizing large models as follows:

Activations are more difficult to quantize than weights. Weights are more uniformly distributed and easier to quantize. Studies have shown that quantizing LLM weights to INT8 or even INT4 does not affect accuracy.
The activations are difficult to quantize due to outliers. In the case of per-tensor quantization, large outliers dominate the maximum amplitude, with outliers being approximately 100 times larger than most activations. This compresses the effective quantization bits for non-outlier channels, resulting in lower effective quantization bits/levels for non-outlier channels: assuming the maximum amplitude of channel (i) is (m_i), the maximum value of the entire matrix is (m), and the effective quantization levels of channel are (2^8 \cdot m_i / m). For channels that are not outliers, the effective quantization level will be very small (2-3), resulting in a large quantization error.
Outliers can be considered as emerging features. These outliers represent important features upon which the model relies for predictions, playing a crucial role in the model’s decision-making process. They effectively serve as indicators of features the model has learned to recognize, aiding the network in processing and interpreting input data. Therefore, these outliers should be preserved at their original accuracy to allow the network to retain its learned knowledge and maintain the model’s performance and accuracy.
Outliers persist in a small subset of specific channels (outliers exist in a fixed number of channels, and the outlier channel values are relatively large). If a channel has one outlier, it persists across all tokens. For a specific token, the activation values of different channels vary greatly (a small number of channels have large activation values, while most channels have small ones), but within the same channel, the amplitude of activation values for different tokens varies little (the outlier channel amplitude remains relatively large).

Due to the persistence of outliers and the small variance within each channel, per-channel quantization (using a different quantization step for each channel) can produce much smaller quantization errors compared to per-tensor quantization, while per-token quantization offers little benefit.

In summary, outliers are channel-dependent, not token-dependent. Therefore, channel-wise quantization (using different quantization coefficients for each channel) should be applied to activations, which significantly reduces quantization errors. Token-wise quantization offers little benefit. However, channel-wise activation quantization is unsuitable for hardware-accelerated GEMM kernels (linear layer computations) because these kernels rely on high-throughput sequential operations (such as Tensor Core MMAs) and cannot tolerate the insertion of low-throughput instructions (such as conversions or CUDA Core FMAs). Furthermore, both the quantization coefficients and the floating-point to fixed-point conversion in the quantization formula are performed using CUDA Core computation units.

0x02 Outlier

In practice, researchers have discovered some unusual outliers with the following characteristics: 1. They are extremely large, usually much larger than other outliers. 2. Their numbers are very small. Some researchers refer to these problematic, few-numbered, but exceptionally large outliers as super-outliers, which include (including super-weighted and super-activated values).

2.1 Superweight

Next, let’s look at superweights. Researchers have found that in large models, there is a small subset of features that are particularly important (called “superweights”). Although they are few in number, they are crucial to the model’s performance. Removing other, less important features will only slightly affect the model’s performance. However, if these “superweights” are removed, the model will start rambling and may not even generate any text.

2.2.1 Function

To quantify the impact of “superweights” on the model, the research team pruned all outlier weights. They found that removing a single “superweight” had a more severe effect than removing the other 7000 outlier weights combined. The research team identified two main effects of superweights:

This leads to “superactivation.” Superweights amplify outliers in the activation of the input token, a phenomenon researchers call “superactivation.” The paper found that before dimensionality reduction projection, the Hadamard product of gating and upprojection produces a relatively large activation, which is further amplified by the “superweights,” creating “superactivation.” Regardless of the input cue word, “superactivation” persists with the same magnitude and location throughout the model. This stems from “cross-layer connections” in neural networks.
The generation probability of stop words was suppressed. The original model with “superweights” could correctly predict stop words with a high probability of 81.4%. However, after removing the “superweights,” the most frequently predicted word became the stop word “the,” and the probability of “the” was only 9.0%, mostly nonsensical. This shows that “superweights” are crucial for the model to correctly and confidently predict semantically meaningful words.

The research team also analyzed the impact of superweight magnitude changes on model quality by scaling the superweights by factors ranging from 0.0 to 3.0. The results showed that moderately scaling the magnitudes can improve model accuracy.

3512

2.2.2 Identification

The distribution of “superweights” in different large models is surprisingly similar; for example, they always appear in mlp.down_proj.

But how do we identify specific “hyperweights”? This paper proposes an efficient method: “Hyperweights” can be further located through activation peaks. Specifically, we can locate “hyperweights” by detecting peaks in the input and output distributions of the inter-layer dimensionality-reduced projection. This method only requires a single prompt word, making it very simple and convenient, eliminating the need for a set of validation data or specific examples.

Assume there exists a down-projection weight matrix (W \in R^{D \times H}), where D represents the dimension of the activation feature, and H is the dimension of the intermediate hidden layer. Let (X \in R^{L \times H}). Let L be the input matrix, where L represents the sequence length. Define the output matrix as (Y = XW^T). “Super Activation” is (Y_{ij}), (Y_{ij} = \sum_{k=1}^{d} x_{ik}W_{jk}). if (X_{ik}) and (W_{jk}) These are all outliers that are much larger than other values, so (Y_{ij}) The value of will be primarily determined by the product of these two outliers. In this case, j and k are determined by (X_{ik}) and (Y_{ij}). The value of the value determines the extreme outliers in the input and output activations of the mlp.down proj layer. Next, determine the layer and coordinates of the superweights. Once a superweight is detected, remove it from the model and repeat the above process until the largest maximum activation value is suppressed.

The following diagram illustrates the behavior of superweights.

I: Superweights typically appear in the downward projection of early layers of the model and are indicated by a blue-purple box. Superweights immediately produce incredibly large activation values.
II: Superactivation propagates via skip connections, indicated by blue-purple lines.
III: There is a net effect, which is that removing superweights causes the probability of generating stop words to skyrocket, represented by blue-purple stacked bars.

3513

These outliers are very important for model quality, so it is crucial to preserve them during the quantization process.

2.2 massive outlier

In fact, people more often refer to superactivation as massive activation or massive outliers.

The term “outliers” in LLM often refers to outliers of activation values, as the overall variation in weights is relatively small. Early work did not distinguish between outliers; what was generally considered outliers were normal outliers, which appeared more frequently in the channel dimension—a problem that previous channel-level quantization methods aimed to address. Recent work, such as Massive Activations in Large Language Models, classifies outliers—those with exceptionally large but very few values in the token dimension—as massive outliers. This is something previous channel-level quantization methods couldn’t solve, but it can be addressed through rotation matrices or cushioning cache (prompt) absorption.

3514

2.2.1 Definition

For quantitative analysis, researchers provided a loose but broad definition: an activation is considered massive if its magnitude exceeds 100 and is at least or approximately 1000 times the median magnitude of its hidden state.

Massive outliers are characterized by their very high values and limited occurrence within a subset of tokens. The following figure illustrates the phenomenon of Massive Outliers in different models.

The top shows the activation magnitudes (z-axis) in LLaMA2-7B. The x and y axes represent the sequence and feature dimensions. For this particular model, we observed activations with huge magnitudes appearing in two fixed feature dimensions (1415 and 2533) and in two types of tokens—the start token and the first period (.) or newline character (\n).
In the middle is the massive activation from LLaMA2-13B. In this model, they appear in two fixed feature dimensions (2100 and 4743) and are limited to the initial token.
Below is the massive activation in Mixtral-8x7B. In this model, they reside in two feature dimensions (2070 and 3398) and are located in the start token, separator token, and certain word tokens (“and” and “of”).

3515

2.2.2 Function

The paper “Massive Activations in Large Language Models” argues that massive activations act as a bias.

3516

Features

Regarding bias, massive activations exhibit the following specific characteristics:

Massive activations primarily occur in:
- The starting token. This can be attributed to the fact that the starting token is used by all tokens in the sequence, making it suitable to place more bias information here.
- Delimiter tokens (punctuation marks) and weakly semantic tokens. These tokens have relatively low semantic values, making them a low-cost option for storing bias.
Most massive activations begin to appear in the early layers and remain almost constant, eventually dissipating in the last few layers. ViT has fewer massive activations than LLM, but they still exist. Massive activations only appear after a few initial layers because LLM requires some initial layers to process the meaning of tokens associated with massive activations. At these layers, the semantics of tokens can be transferred to other tokens via self-attention mechanisms and preserved during forward propagation.
Massive activations are crucial, and most tokens focus their attention on these massive activations. Simply setting the four massive activations to zero can lead to a catastrophic collapse in model performance. Of course, setting them to their average value does not harm the model, indicating that massive activations are important, but not in their precise values; rather, they are crucial as a bias term for the overall computation.

The paper’s authors argue that massive activations are actually quite similar to Register Tokens in ViT, noting that they also possess extremely large activation values. To illustrate their practical role, they averaged out all Register Tokens to destroy the information they contained, but the performance remained unchanged.

Argument

The authors tracked the performance of these token embeddings in the attention layer. They found that extremely large activation values disappeared after being scaled by LayerNorm. Furthermore, the information provided by these tokens “looked very similar” and was relatively stable.

The diagram below illustrates the activation trajectory from the input hidden state to the query, key, and value states.

Figure a illustrates that in LLM, at each layer, the input features are processed by layer normalization 3 and then transformed into query, key, and value states through linear projection.

Figure b shows all the hidden states computed in this schematic (LLaMA2-7b, layer 3). We find that, at all stages, the features of the two tokens associated with large-scale activations are distinctly different from the other tokens. Specifically, after the first “normalization” step, the embeddings of these two tokens behave as sparse vectors with two distinct non-zero elements. Notably, the subsequent QKV states exhibit fairly small changes in each embedding.

3517

The authors hypothesize that the goal of these tokens is to provide a fixed bias in the attention process. Given that the attention also focuses on tokens associated with large-scale activations, the authors therefore isolate these tokens and investigate their impact on the attention output (attention matrix layer multiplied by value vector).

The figure below illustrates the decomposition of value updates and attention outputs in LLaMA2-7B. In the equations below, the authors decompose the attention output at each token k into two parts: value updates from the token set C, where attention is centralized; and value updates aggregated from other tokens. The input cue in the figure is “Summer is warm. Winter is cold.” In this case, set C consists of tokenSummer and the first-cycle token. We can see that the value updates of C are almost identical across tokens, meaning they act as additive bias terms, although not explicitly imposed. Furthermore, we note that this value update pattern is very similar across various inputs. Overall, the paper’s results demonstrate that LLM uses a large number of activations to distribute a large amount of attention across certain tokens. These tokens are then used to form a constant bias term when computing the attention output.

3518

2.2.3 Difficulties

A major problem in LLM quantization is the presence of activation outliers, which amplify the quantization step size and lead to significant accuracy loss. To mitigate this, current research has developed various methods to address normal outliers in activation. Unfortunately, existing LLM quantization methods struggle to effectively handle massive outliers. Channel-level quantization methods mostly perform equivalent transformations on weights and activation values along the channel dimension, thereby reducing the magnitude of outliers across channels and making quantization easier. However, they do not consider that outliers are not limited to the channel dimension, and this approach may introduce new outliers.

For example, even though SmoothQuant uses a smoothing factor to shift some activation outliers to the weights, it still cannot effectively handle massive outliers with maxima. As shown in the figure below, the corresponding weight changes in SmoothQuantit lead to the emergence of new outliers.

3519

Therefore, there is an urgent need for an LLM quantization method to effectively address the problems of normality and large-scale outliers. This is the approach of rotation matrix or CushionCache (Prompt) absorption, which we will discuss in detail later.

0x03 Transformer Quantization

3.1 Principle Analysis

We begin with an efficiency analysis to explain how quantization techniques can reduce end-to-end inference latency for large models.

3.1.1 Characteristics of Reasoning

Inference in LLMs models can be broadly divided into two stages: prefill and decoding. Each stage has its own distinct characteristics.

The behavior of the prefill phase can be compared to the forward process of training. In most cases, the prefill phase is compute bounded.
The sequence length in the decoding phase is always equal to 1, making it I/O bounded. In many cases, decoding occurs more frequently than prefilling in applications. The autoregressive phase of inference is actually limited by memory bandwidth, so faster computation doesn’t offer much value. Because inference is memory bandwidth limited, we are actually only interested in reducing memory usage, as this means less data transfer.

3.1.2 Requirements for the Framework

These characteristics require the reasoning framework to support two sets of computational logic to adapt to their different features.

Weight activation quantization. In the prefilling stage, large models typically handle long token sequences, with the primary operation being Generalized Matrix Multiplication (GEMM). The latency in the prefilling stage is mainly limited by the computational operations performed by high-precision CUDA kernels. To address this issue, existing research methods quantize both weights and activations to accelerate computation using low-precision Tensor kernels. Activation quantization is performed online before each GEMM operation, allowing computation using low-precision Tensor kernels (e.g., INT8). This quantization method is called weight activation quantization.
Weight quantization only. In the decoding phase, the large model processes only one token in each generation step, using General Matrix-Vector Multiplication (GEMV) as its core operation. The latency in the decoding phase is mainly affected by loading large weight tensors. To address this issue, existing methods primarily focus on using quantized weights to accelerate memory access. First, the weights are quantized offline, and then the low-precision weights are dequantized into FP16 format for computation. The quantized model, due to its low-bit-width weight representation, can significantly alleviate the IO-bound phenomenon.

In the diagram below, (a) is the weight-only quantization inference process, and (b) is the weight-activated quantization inference process.

3520

3.2 Quantization Module

Based on the above information, researchers made quantization modifications to the Transformer. Because for large language models, the performance bottleneck lies in matrix multiplication operators (accounting for over 80% of system runtime) and Self Attention (accounting for over 50% of system memory overhead), most current LLM model quantization algorithms can be understood as only focusing on the quantization of Linear operators, without researching other components (norm, embedding, softmax, add…).

The diagram below illustrates the quantization inference process in LLM. The purple rectangles represent the quantization modules. Quantization primarily targets weighted linear layers, specifically the Q, K, V, and O layers in the self-attention module, and the Up, Gate, and Down layers in the FFN module. The right side of the diagram shows three quantization types: weight-activation quantization, weight-only quantization, and KV-cache quantization. X, K, and V are quantized per-token. The range of W is statically calibrated before deployment. For weight-only settings, W uses per-group quantization; for weight activation quantization, W uses per-channel quantization.

Assuming the input is [batchsize, seq_len, dim], multiplying batchsize by seq_len and merging them, we get [num_tokens, dim]. Per-token quantization is row-wise quantization, meaning each token has a corresponding scale. For a weight [input_dim, output_dim], performing per-channel quantization means quantizing output_dim column-wise, with each column of the weight corresponding to a scale.

3521

The diagram below illustrates the quantized inference data flow.

3522

The diagram below illustrates the paradigm for INT quantization.

FP32 serves as the benchmark, offering the largest numerical range and zero precision loss, but also the highest storage overhead.
If efficiency is not a major concern for the user, then the INT16 format is the best choice. INT16 is the most accurate format; for FP32 conversions, INT16 is even more accurate than FP16.
For services with high real-time requirements, the INT8 quantization scheme is recommended, as it can achieve significant performance improvements while maintaining high accuracy. If certain layers in your network require even higher accuracy, W8A16 can be used to address this issue.
In scenarios where resources are limited but accuracy requirements are relatively low, the INT4 scheme can be used to achieve the greatest resource efficiency.

3523

3.3 Classification

The following classifications come from three different papers, which we can compare and learn from. As you can see, these classifications are basically distinguished by the quantization process, and they are all expanded from the two categories of QAT (Quantization-Aware Training) and PTQ (Post-Training Quantization).

3524

Further subdivision of PTQ can be categorized into three types: weight quantization only; weight and activation quantization simultaneously; and key-value cache quantization. See the diagram below for details.

3525

3.3.1 PTQ

The main advantage of PTQ lies in its simplicity and efficiency, requiring no modification or retraining of the LLM architecture and calibrating using only limited calibration data. However, it’s worth noting that PTQ may introduce some accuracy loss during quantization. PTQ is particularly suitable for scenarios requiring significant model compression. For LLMs that typically contain billions of parameters, quantization-aware training (QAT) is prohibitively expensive, making PTQ a more practical alternative. While PTQ has been well explored in smaller models, large models, characterized by their scale and complexity, require specialized methods to efficiently handle the quantization process.

We can divide the PTQ of LLM into three groups: weight-only quantization, weight activation quantization, and key-value cache quantization. The difference between these groups lies in their quantization objectives.

Weight-only quantization focuses solely on the quantization weights. Previous work has shown that activation quantization is often more susceptible to weight quantization, allowing us to achieve lower bit widths using only weight quantization. However, since the quantization weights need to be dequantized before being multiplied by the activation, weight-only quantization inevitably introduces additional computational overhead during inference and cannot benefit from hardware-specific accelerated low-bit operations.
Weighted activation quantization extends its target to both weights and activations.
Key-value cache quantization targets key-value caches. Key-value caches typically consume significant amounts of memory, becoming a memory bottleneck. Implementing key-value cache quantization can improve throughput and more efficiently accommodate inputs with longer tokens.

The following diagram shows representative algorithms, which are classified from four dimensions.

3526

Next, we will introduce some typical schemes of weight-only quantization and weight activation quantization. We will also introduce KV cache quantization in a later article.

weight-only quantization

In PTQ, some methods focus on quantizing only the weights of the LLM to improve efficiency and reduce computational requirements.

GPTQ: GPTQ is a post-training weight quantization method that uses layer-by-layer quantization based on second-order information to successfully quantize each weight to 3-4 bits with almost no precision loss. GPTQ quantizes all parameters within a block one by one. After each parameter is quantized, the other unquantized parameters within the block need to be adjusted appropriately to compensate for the precision loss caused by quantization. GPTQ quantization requires a calibration dataset.
QuantEase builds upon GPTQ. When quantizing each layer, QuantEase uses a coordinate descent-based method to more accurately compensate for unquantized weights. Furthermore, QuantEase can utilize quantized weights from GPTQ as initialization and further refine the compensation process.
LUT-GEMM: LUT-GEMM is a dequantization method that uses a look-up table (LUT). It employs a non-uniform quantization method called binary coded quantization (BCQ), which includes a learnable quantization range.
AWQ observed that different weight channels have varying degrees of importance to performance, and that retaining only 1% of significant weights can greatly reduce quantization error. Based on this observation, AWQ employs activation-aware weighted quantization to quantize LLM, specifically focusing on weight channels with larger activation values and achieving optimal quantization results through per-channel scaling.
OWQ: The authors of OWQ observed that activating outliers amplifies the loss of weight quantization. They proposed Outlier-Aware Weight Quantization (OWQ) to identify vulnerable weights with activating outliers and assign them higher precision, while quantizing the remaining weights at a lower precision level.
SpQR: During the quantization process, SpQR identifies and separates outlier weights that are prone to causing large quantization errors. These outlier weights are stored with higher precision, while the remaining weights are compressed to 3-4 bits. Furthermore, they introduce a decoding scheme specifically tailored for the SpQR format, improving the efficiency of token-by-token inference.
SqueezeLLM proposes storing outliers in a full-precision sparse matrix and applying non-uniform quantization to the remaining weights. Determining the value of the non-uniform quantization based on the quantization sensitivity can improve the performance of the quantization model.
FineQuant employs an experience-based heuristic approach, assigning different levels of granularity to different weight matrices.
LLM-MQ uses the FP16 format to store outlier weights and stores them in Compressed Sparse Rows (CSR) format to improve computational efficiency. Furthermore, LLM-MQ models the bit width allocation for each layer as an integer programming problem and uses an efficient solver to solve it in seconds.

weight-activation quantization

Many works in PTQ attempt to quantize both the weights and activations of LLMs simultaneously.

ZeroQuant employs fine-grained quantization of weights and activations (group-wise for weights, token-wise for activations), leverages kernel fusion to minimize memory access costs during quantization, and performs knowledge distillation layer by layer to restore performance.
LLM.int8() detects outliers in activations concentrated in a small subset of channels. Based on this, LLM.int8() separates activations and weights into two distinct parts according to the distribution of outliers within the input channels. The channel containing the outlier data for both activations and weights is stored in FP16 format, while the other channels are stored in INT8 format.
SmoothQuant observed that different tokens exhibited similar changes across their channels, so SmoothQuant introduced a channel-wise scaling transformation to effectively smooth the amplitude, making the model easier to quantize.
FlexGen directly quantizes weights and key-value caches into INT4 to reduce memory usage during large-scale inference.
OliVe observes that common values near outliers are less critical. Therefore, it pairs each outlier with a common value, sacrificing the common values to gain a wider range of outlier representations.
OS+ observes that outliers are asymmetrically distributed and concentrated in specific channels, posing a challenge for quantization of large models. To address this issue, OS+ introduces a channel-level shifting and scaling technique. Determining the shifting and scaling parameters during the search process effectively handles both concentrated and asymmetric outlier distributions.
LLM-FP4 strives to quantize the entire model into FP4 format and introduces a pre-shifted exponential bias technique. This method combines the scaling factor of the activation values with the weights to address the quantization problems caused by outliers.
Omniquant differs from previous methods that relied on empirical design of quantization parameters. Instead, it optimizes the boundaries of weight clipping and the scaling factor of the equivalent transformation to minimize quantization error.
QLLM addresses the impact of outliers on quantization by implementing channel reconfiguration. Furthermore, QLLM incorporates learnable low-rank parameters to reduce quantization error in post-quantized models.
RPTQ discovered that the distribution of different activation channels is actually variable, which poses a challenge to quantization. To alleviate this problem, RPTQ clusters channels with similar activation distributions and applies quantization independently within each group, effectively mitigating the differences in channel range.
Outlier Suppression+ identifies asymmetrical distributions of harmful anomalies in activations, primarily concentrated in specific channels. Therefore, Outlier introduces a novel strategy involving channel-level translation and scaling operations to correct this asymmetrical presentation of anomalies.

3.3.2 QAT

In quantization-aware training (QAT QAT), the quantization process is seamlessly integrated into the training of large language models (LLMs), enabling these models to adapt to low-precision representations and thus mitigating accuracy loss. Because this typically involves retraining the entire model, it usually requires substantial training data and computational resources, posing a potential bottleneck to QAT implementation. Unfortunately, even the best current PTQ methods below 8 bits can lead to a sharp decline in model quality. Therefore, for higher levels of quantization, quantization-aware training (QAT) has become necessary.

Current QAT methods for LLM fall into two categories: full parameter retraining and parameter-efficient retraining.

Full parameter retraining is a method for fully retraining the parameters of an LLM when quantizing it. By using data generation methods, along with QAT and distillation techniques, it can preserve the emergent capabilities of the original model while reducing memory usage and computation.
Parametric-efficient retraining refers to retraining LLMs using parametrically efficient methods. Among these, QLoRA and QA-LoRA integrate group quantization into QLoRA to alleviate the imbalance between quantization and low-rank adaptation. Other methods include freezing the quantization exponent and fine-tuning only the quantization parameters, and using binary quantization. These methods employ low-rank adaptation to retrain quantized LLMs within a relatively acceptable computational budget.

Using QAT to quantify LLMs presents challenges in several key areas:

It has high requirements for data. If the training data domain is too narrow or differs significantly from the distribution of the original pre-training data, it may impair the performance of the model.
It has high computational power requirements. Training large models requires concentrated computational resources, typically involving a large amount of training data and computing resources.
Due to the complexity of LLM training, it is difficult to accurately reproduce the original training setup.

We use optimization points for classification.

Reduce data requirements

To reduce data requirements, LLM-QAT introduces a data-free approach, generating training data from the original large FP16 model and then using the results generated by the pre-trained model to achieve data-free distillation. Specifically, LLM-QAT uses each token in the vocabulary as the starting token for generating sentences. Based on the generated training data, LLM-QAT applies a distillation-based workflow to train a quantized LLM to match the output distribution of the original large FP16 model.

Reduce computational load

To reduce computational costs, many methods employ parameter-efficient tuning (PEFT) strategies to accelerate QAT.

QLoRA quantizes the weights of a large model into 4 bits, and then LoRA is used in BF16 to fine-tune the quantized model for each 4-bit weight matrix.
LoftQ points out that initializing the LoRA matrix with zeros in QLoRA is inefficient for downstream tasks. As an alternative, LoftQ suggests initializing the LoRA matrix using Singular Value Decomposition (SVD), which is the difference between the original FP16 weights and the quantized weights. LoftQ iteratively applies quantization and SVD to obtain a more accurate approximation of the original weights.
PEQA is a novel quantization-aware PEFT technique that facilitates model compression and accelerates inference. It employs a two-stage process. In the first stage, the parameter matrix of each fully connected layer is quantized into low-bit integer matrices and scalar vectors. In the second stage, the scalar vectors are fine-tuned for each specific downstream task. This strategy significantly compresses the model size, thereby reducing inference latency at deployment and decreasing the overall memory required. Simultaneously, rapid fine-tuning and efficient task switching are enabled.

3.3.3 Common Solutions

In general, weight quantization can be differentiated by varying sensitivities. Sensitivity can be measured using activation values, Hessian matrices, or approximate Fisher information. Processing methods include smoothing, mixed-precision dense-sparse decomposition, and hardware-friendly approaches designed around mixed precision. Activation value quantization requires special handling of outliers, including smoothing, memory rearrangement, mixed-precision handling (high precision for outliers + low precision for normal values), and pruning.

Next, we will look at some common methods from the perspective of quantitative technical solutions. These methods are not orthogonal to the PTQ/QAT classification above.

optimization

The core idea of GPTQ is to achieve high-precision, low-bit quantization by minimizing the output error introduced by quantization. GPTQ uses an optimization method to analytically update subsequent elements while quantizing earlier elements, thereby minimizing the optimization objective formula for quantization.

3527

Transfer

The difficulty of quantizing large models lies in the existence of outliers. AWQ and Smooth Quant use a smoothing coefficient to transfer the outliers of activation to the weights. On the one hand, this can improve the quantization range of weights on significant channels, and on the other hand, it can reduce the quantization error of activation.

Rotation

However, simply transferring outliers doesn’t eliminate them; therefore, researchers devised rotation schemes. These methods typically utilize unit random matrices and weight transformations before quantization, making the transformed weights easier to quantize. Spinquant uses rotation matrices (Hadamard matrices or other learnable orthogonal matrices) to left-multiply and right-multiply the weight and activation matrices respectively. Due to the orthogonality of rotation matrices, this transformation is equivalent. DuQuant, by learning rotation and channel permutation transformations, transfers outliers to other channels within the activation matrix, ultimately obtaining a smooth activation matrix, thus significantly reducing the difficulty of quantization.

Learnable

Many methods involve manually designing or searching for scaling parameters, which some researchers consider suboptimal. A better solution would be to learn transformation and clipping parameters during the quantization process.

OmniQuant sets both the smoothing factor and the quantization cutoff threshold to be learnable, and trains them within the block range, achieving good results.

Table lookup and VQ

The biggest difference between the LUT (Look Up Table) method and the previous methods lies in the transformation of the mapping relationship. Because the weight matrix of a real-world model is unlikely to be uniformly distributed, a significant portion of the weights are clustered together. Therefore, we consider clustering these weights, using a cluster center to map them, and storing the cluster center (FP16) and the cluster index (INT). When we need to perform inference, we can use the FP16 values in the index table, or the vector, for calculations. This is theoretically a more efficient compression method.

In the very low-bit (2-3 bit) quantization direction of LLM, conventional scalar quantization methods often struggle to achieve acceptable accuracy due to limitations in numerical representation range. In recent years, many researchers have begun to employ Vector Quantization (VQ) for weight-only quantization of LLM. Vector Quantization is a data compression technique that maps high-dimensional vectors onto a pre-defined set of low-dimensional vectors, which are stored in a codebook. Obtaining the codebook involves a data-driven learning process aimed at finding an optimal set of codewords to efficiently represent and compress data vectors. During encoding, each data point is represented by an index of the corresponding vector in the codebook; during decoding, these indices are used to approximate the original data. This method significantly reduces data storage requirements while allowing for rapid reconstruction of the original vector through simple index references.

Theoretically, VQ (Vacuum Quenching) seeks commonalities and removes redundancy from data, thus achieving higher data compression efficiency than scalar quantization. This is the theoretical basis for using VQ in LLM weighted compression. While VQ can overcome the limitations of scalar quantization, applying the VQ method to LLM weighted compression still faces the following challenges:

Maintain compression accuracy with extremely low bit compression ratio;
Given the large number of LLM parameters, how can compression and quantization algorithms be applied efficiently to LLM?
The computational overhead of restoring the original model weights from the quantized and compressed weights;

We’ll use VPTQ as an example for our learning. VPTQ is primarily designed for extremely low bit (e.g., 2-bit) scenarios. It’s a combination of Vector Quantization and GPTQ, which can be simply understood as replacing GPTQ’s column-by-column scaling with VPTQ. The core idea of Vector Quantization is to organize and cluster data into vectors, storing the clustering results as a codebook, while the original data can be represented by an index. The compressed model weights can be compressed into the index + codebook. During LLM inference, the desired original weights can be obtained by decompressing the index and codebook before executing the current operator.

VPTQ treats the quantization problem as an optimization problem, guiding the design of the quantization algorithm through a second-order optimization method. The innovations of this method are mainly reflected in the following aspects:

Channel-independent second-order optimization: VPTQ quantizes the weight matrix of each column independently, using second-order optimization to guide the design of the quantization algorithm.
Initial codebook optimization: The codebook is initialized using a weighted K-means clustering method to ensure that errors are minimized during the quantization process.
Low dequantization overhead: VPTQ only needs to read the center point in the codebook for dequantization during inference, which significantly improves inference throughput.

3528

Quantization methods based on Attention Sink

The characteristic of massive outliers is that they are all relatively large values in the token they belong to, making it impossible to perform per-tensor quantization. Therefore, the goal of PrefixQuant is to completely remove these massive outliers.

Given that the number of outlier markers is finite and they typically appear at the beginning of the input sequence, PrefixQuant offline places Massive Tokens in the KVCache prefix after identification to prevent Massive Outliers from being generated during inference.

This method is mainly based on the following phenomenon: by adding a learnable “prefix token” or “register token” before the input of the Transformer model, these tokens can absorb excess attention. In other words, when quantizing activation values or KV-Cache, these tokens are removed, and the remaining values are easier to quantize.

Instead of learning the “prefix token,” PrefixQuant directly stores outlier tokens exceeding a set threshold in front of the KV-Cache, achieving more efficient computation. Specifically, given an LLM model, PrefixQuant first calculates the number of N outlier tokens, then selects the Top-N high-frequency outlier tokens from the KV-Cache as a prefix. Combined with classic equivalent operations such as smoothing, rotating, and reordering, the remaining outliers can be ignored. This allows per-tensor quantization to replace per-token quantization for activation, thus improving inference speed while maintaining inference accuracy.

3529

The problem with this method is that adding learnable “prefix tokens” or “register tokens” can have performance side effects if not fine-tuned on downstream datasets. Therefore, finding a good prefix is crucial. These tokens are generally special symbols such as ”.\n [BOS]”. The image below shows the prefix tokens for different models in the KV cache. [BOS] indicates a special marker for the beginning of a sequence (e.g., Llama-2 <s> and Llama-3 |begin of text|). Note that the single quotes below represent spaces.

3530

3.4 Effects

Next, we will examine the effects of quantification through several papers.

3.4.1 Comparative Experiments and Analysis

The paper “A Survey on Efficient Inference for Large Language Models” analyzes the speedup effect of weight-only quantization techniques in different scenarios. The models used LLaMA-2-7B and LLaMA-2-13B, and their weights were quantized to 4 bits using AWQ; the GPU was an NVIDIA A100; and the deployment frameworks were TensorRT-LLM and LMDeploy.

The paper evaluates the speedup achieved by these inference frameworks on different input sequences with varying batch sizes and context lengths. Experimental results show that:

Weight-only quantization can accelerate the decoding stage, thereby achieving end-to-end speedup. This improvement mainly stems from the faster loading of quantized models with low-precision weight tensors from high-bandwidth memory (HBM), as quantization significantly reduces memory access overhead.
For the prefilling stage, weight-only quantization may increase latency. This is because the bottleneck in the prefilling stage is computational cost, not memory access overhead. Therefore, quantizing only inactive weights has the least impact on latency. Furthermore, weight-only quantization requires dequantizing low-precision weights to FP16, which incurs additional computational overhead, thus slowing down prefilling.
The speedup from weight-only quantization gradually decreases as batch size and input length increase. This is primarily because, for larger batch sizes and input lengths, computational costs constitute a larger proportion of latency. While weight-only quantization mainly reduces memory access costs, its impact on latency becomes less significant as computational demands become more pronounced with larger batch sizes and input lengths.
Since memory access overhead is related to the size of the model’s parameters, weight-only quantization offers greater benefits for models with larger parameter sizes. This is because as the complexity and size of the model increase, the amount of memory required to store and access the weights also increases proportionally. By quantizing the model weights, weight-only quantization can effectively reduce memory usage and memory access overhead.

3531

3.4.2 Quantification accuracy

While LLM quantization is popular for accelerating inference, the accuracy-performance trade-offs of various quantization formats remain highly uncertain. The paper “Give Me BF16 or Give Me Death”? Accuracy-Performance Trade-Offs in LLM Quantization provides a comprehensive study of quantization accuracy, evaluating common quantization formats (FP8, INT8, INT4) across the entire LLaMA-3.1 model family in academic benchmarks and real-world tasks. Furthermore, the authors investigated the differences between text generated by quantized and unquantized models. Their experiments comprised over 500,000 individual evaluations, primarily assessing three quantization types:

W8A8-FP: The weights and activations of all linear layers are in FP8 format.
W8A8-INT8: The weights and activations of all linear layers in the Transformer Block are in INT8 format.
- For weights: a symmetric per-channel GPTQ quantization method is used.
- For activation: a dynamic quantification technique using per token is employed.
W4A16-INT4: All linear layer weights in the Transformer Block are quantized to INT4, while activation values are maintained with 16-bit precision. Weights are compressed using GPTQ quantization, applied to each group of 128 consecutive elements, and pruned using the mean squared error (MSE) optimal pruning factor.

Several key findings were made:

FP8 weights and activation quantization (W8A8-FP) are essentially lossless across all models.
With proper adjustments, the accuracy drop for INT8 weights and activation quantization (W8A8-INT) is very low, only 1%-3%.
INT4 weighted quantization (W4A16-INT4) is comparable to W8A8-INT.

To address the issue of finding the “best” format for a given environment, the authors conducted inference analysis on various GPUs using the popular open-source vLLM framework. They found that W4A16 is suitable for latency-sensitive (synchronous inference) scenarios and throughput-sensitive (asynchronous inference) scenarios on mid-range GPUs. Meanwhile, W8A8 is well-suited for throughput-sensitive scenarios on high-end GPUs.

3.4.3 QAT and PTQ

The paper “A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B” provides a comprehensive evaluation of the performance of quantization techniques in instruction-tuned large language models (LLMs). This paper addresses a key question: how to effectively deploy ultra-large-scale LLMs using quantization techniques in resource-constrained environments.

This paper analyzes two mainstream quantization methods—Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ). While PTQ can significantly reduce memory and computational requirements, it may sacrifice model accuracy. Therefore, the researchers used 13 different benchmark datasets to test the performance of the quantized models on various complex tasks, focusing particularly on the models’ reasoning, mathematical, and knowledge capabilities. Through experiments, the paper found several key conclusions:

Quantized models typically outperform smaller models, especially in tasks involving non-truth detection and instruction following. For example, the 4-bit quantized Llama-2-13B outperforms the original Llama-2-7B on most tasks.
Different quantization methods and model sizes have varying impacts on performance. Weighted quantization methods typically maintain higher accuracy, with AWQ outperforming GPTQ.
Quantization does not significantly affect the relationship between task difficulty and accuracy. Regardless of whether the task is simple or complex, the quantized model can basically maintain similar performance to the full-precision model.
The performance of quantization methods varies significantly across different models; for example, AWQ outperforms GPTQ in most tests.

These results are significant for future compression and optimization of LLMs. The paper points out that while quantization can effectively compress models and improve performance, not all tasks benefit from it. For specific tasks, such as instruction following and complex inference, some high-precision models still have advantages. This paper provides a strong experimental foundation for future quantization research and offers important references for the practical deployment of large models.

0xFF Reference

[EMNLP2023] [W8A8] Outlier Suppression+: Accurate quantization of large language models by equivalent and effective shifting and scaling

NeurIPS 2024 Oral: Implementing State-of-the-art 4-bit Quantization of Barley with DuQuant

The Art of Compressed Neural Networks: An Analysis of Yixin Based on Two Classic Papers by Professor Song Han from MIT

From Training Dynamics to Outlier: Numerical Characteristic Analysis of LLM Model Training Process (Reiase)

[Read a Paper] A Survey of Quantization Methods for Efficient Neural Network Inference

A Survey of Quantization Methods for Efficient Neural Network Inference

https://www.armcvai.cn/2023-03-05/model-quantization.html

A Review of Model Quantization Techniques: Revealing Cutting-Edge Technologies for Large Language Model Compression [DeepHub IMBA]

Large Model Performance Optimization (Part 1): Quantization Starting with Half-Precision, Understanding fp32, fp16, and bf16

A convenient post-training quantization solution: GPTQ

[AI Unraveling the Mysteries] The Principles, Current Status, and Future Prospects of Model Quantization Technology - Long Peng - Three Mottos

AWQ, Activation-aware Weight Quantization

LLM.int8()

SqueezeLLM

SmoothQuant

GPTQ

A pioneering work in large-scale model quantization perception training: LLM-QAT – eating jelly without spitting out the jelly shell.

ZeroQuant series (v1, v2)

LLM-QAT: Data-Free Quantization Aware Training for Large Language Models

(https://link.zhihu.com/?target=https%3A//github.com/facebookresearch/LLM-QAT)

Summary of Quantization Algorithms for Large Model Inference (by Sun Peiqin)

Wang Wenguang’s lengthy article reveals the secrets of the GPTQ method for large model quantization: from OBS through OBQ to GPTQ, the magic of the Hessian matrix.

Wang Wenguang’s lengthy article reveals the secrets of large-scale model quantization technology: exploring the principles and understanding the most important technology for efficient inference in large-scale models.

akaihaoshuai: Implementing LLM from Scratch: 6. Model Quantization Theory + Code Practice (LLM-QAT/GPTQ/BitNet 1.58Bits/OneBit)

akaihaoshuai: Implementing LLM from Scratch: 6.1. Model Quantization (AWQ/SqueezeLLM/Marlin)

https://zhuanlan.zhihu.com/p/703513611

A Technical Overview of Efficient Inference in Large AI Models! [AI Large Model Frontiers]

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Braverman, V., Beidi Chen, & Hu, X. (2023). KIVI : Plug-and-play 2bit KV Cache Quantization with Streaming Asymmetric Quantization.

Databricks Blog: LLM Inference Performance Engineering: Best Practices

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, & Amir Gholami. (2024). KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization.

T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.

A. Gholami, S. Kim, Z. Dong, Z. Yao, MW Mahoney, and K. Keutzer, (2021). A Survey of Quantization Methods for Efficient Neural Network Inference.

KIVI: A Tuning-Free Asymmetric 2bit Quantization for kv Cache: https://arxiv.org/abs/2402.02750

Xiao, Guangxuan, et al. “Smoothquant: Accurate and efficient post-training quantization for large language models.” International Conference on Machine Learning. PMLR, 2023.

[2] Ashkboos, Saleh, et al. “Quarot: Outlier-free 4-bit inference in rotated llms.” arXiv preprint arXiv:2404.00456 (2024).

Sun, Mingjie, et al. “Massive Activations in Large Language Models.” arXiv preprint arXiv:2402.17762 (2024).

Liu, Ruikang, et al. “IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact.” arXiv preprint arXiv:2403.01241(2024).

[5] Liu, Zechun, et al. “SpinQuant—LLM quantization with learned rotations.” arXiv preprint arXiv:2405.16406(2024).

A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B

[2411.02355] “Give Me BF16 or Give Me Death”? Accuracy-Performance Trade-Offs in LLM Quantization [1]

Integer Quantization for Deep Learning Inference Principles and Empirical Evaluation

Deployment and Inference Principles and Empirical Verification of Deep Learning Int8

Quantization and training of neural networks for efficient integer-arithmetic-only inference

Quantizing deep convolutional networks for efficient inference: A whitepaper

Discovering low-precision networks close to full-precision networks for efficient embedded inference

Pact: Parameterized clipping activation for quantized neural networks

TensorFlow Model Optimization: Model Quantization - Zhang Yixin

Practical Guide: A Comprehensive Summary of Deep Learning Model Quantization (Low-Precision Inference)

The Super Weight in Large Language Models

20,000 words of analysis on quantization technology, the most comprehensive analysis of large-scale model quantization technology on the entire internet. (Baiqi Yuewen)

A Plain Language Explanation of Scaling Laws for Precision: Justin’s Surname Isn’t Ding

Efficient Deep Learning - Study Notes - 4 - Model Quantization Backtracking Cat

LLMs Quantification Series | LLMs Quantization Need What? (The Cat of Retrospection)

LLMs Quantization Series | MiLo: How to use LoRA to compensate for the quantization loss of the MoE model (backtracking cat)

LLMs Quantitative Analysis Series | Summary of LLM Quantitative Methods (Backtracking Cat)

Xingyueye’s Years of Model Quantization

[LLM Quantitative Series] The VQ Journey: Killua’s Ascent from AQLM, GPTVQ to VPTQ

[LLM Quantitative Series] PTQ Quantitative Classic Research Analysis: Killua’s Advance

AttnSink related papers shared by Jin Qin

Introduction to decoupleQ 2-bit quantization technology: Smooth and fluid operation

Latest discovery: Large-scale values, the key to attention mechanisms. ICML2025 AI Cat Repair Prompt

Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding

ICML25 research finds: RoPE has made a significant contribution again! A small eggplant…

[ICML2024] StableMask: Refining Causal Masking in Decoder-only Transformer

Massive Activation in LLM Cainiao Poverty Alleviation Households

Transformers need glasses! Information over-squashing in language tasks

Large model quantization compression is a wrong path

Reading the paper “Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models” by Mr. Liu

massive activation

Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing

Massive Activations in Large Language Models

A Survey on Model Compression for Large Language Models

[ICLR2024] [W4A4] OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

[Killua’s Advance: LLM Quantitative Analysis Series] The Path to VQ: From AQLM, GPTVQ to VPTQ41 (2 Likes, 2 Comments)

[MLSys2024] QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models.

VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

A new approach to low-bit quantization in large models: AQLM. If you give up, you’re really done.

A staggering 90% reduction! Parameter compression can be done like this? Three amazing VQ quantization techniques, new ideas for model slimming down. Deep Blue Academy

VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs

PrefixQuant: Static Quantization Beats Dynamics through Prefixed Outliers in LLMs - A Training Tool

The use of rotation matrices in LLM quantization

| [Transformer Series] (35) Fundamentals of Large Model Quantization

Exploring the Transformer Series (35) --- Fundamentals of Large Model Quantization

Exploring the Transformer Series (35) --- Fundamentals of Large Model Quantization

0x00 Overview

0x01 outlier

1.1 Definition

1.2 Features

1.3 Occurrence Process

1.4 Distribution Pattern

1.5 Reasons for occurrence

1.5.1 softmax

LLM model

VIT model

1.5.2 RoPE

Phenomenon

analyze

1.5.3 FFN

Dynamic Analysis of Matrix Multiplication

FFN main calculation

1.5.4 LayerNorm

1.5.5 RMSNorm

1.6 Function

1.6.1 Negative Effects

1.6.2 Positive Effects

Maintain performance

contextual understanding

1.7 Difficulties

0x02 Outlier

2.1 Superweight

2.2.1 Function

2.2.2 Identification

2.2 massive outlier

2.2.1 Definition

2.2.2 Function

Features

Argument

2.2.3 Difficulties

0x03 Transformer Quantization

3.1 Principle Analysis

3.1.1 Characteristics of Reasoning

3.1.2 Requirements for the Framework

3.2 Quantization Module

3.3 Classification

3.3.1 PTQ

weight-only quantization

weight-activation quantization

3.3.2 QAT

Reduce data requirements

Reduce computational load

3.3.3 Common Solutions

optimization

Transfer

Rotation

Learnable

Table lookup and VQ

Quantization methods based on Attention Sink

3.4 Effects

3.4.1 Comparative Experiments and Analysis

3.4.2 Quantification accuracy

3.4.3 QAT and PTQ

0xFF Reference