Exploring the Transformer Series (36) --- Large Model Quantization Scheme

0x00 Overview

Following the previous article, which introduced the basics of large-model quantization, this article explores several concrete quantization schemes.

Since quantization schemes are usually characterized by the bit width they compress to, we classify and study them by bit width.

0x01 8-bit quantization

Since most hardware (such as NVIDIA GPUs, Intel CPUs, Qualcomm DSPs, etc.) currently supports INT8 GEMM, researchers have proposed a scheme to quantize weight and activation into INT8 (i.e. W8A8) in order to speed up inference.

3601

The following diagram shows a comparison of several 8-bit quantization schemes.

3602

The characteristics of the three schemes introduced in this section are summarized below.

Method	Weight/activation precision	Features
LLM.int8()	W8A8	Mixed-precision quantization: outliers are kept in FP16, while all other values are quantized to INT8.
ZeroQuant	W8A8	Applies dynamic per-token activation quantization and group-wise weight quantization.
SmoothQuant	W8A8	Proposes per-channel scaling that shifts quantization difficulty from activations to weights, thereby smoothing out activation outliers.

1.1 LLM.int8()

LLM.int8() comes from the paper “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale”, one of the early works on LLM quantization. The core idea of the paper is divide and conquer: most weights (per-channel) and activations (per-token) are quantized to 8 bits (vector-wise), while the few dimensions that hold outlier features are kept in 16 bits and not quantized.

1.1.1 Motivation

The authors were the first to propose the concept of activation outliers, which has remained a focus of subsequent large-model quantization work. In the past two years, optimization of large models has mainly focused on how to quantize while preserving the weights that correspond to activation outliers.

The activations of LLMs are extremely difficult to quantize because they contain significant outliers, leading to substantial quantization errors and a decrease in accuracy. These outliers can be hundreds of times larger than normal values, and they can easily cause quantization to fail. This is because, with per-tensor quantization, a whole tensor shares one scaling value, so even a single outlier can ruin quantization precision. Imagine a tensor where most values are between 0 and 1; if there’s an outlier equal to 10,000, then after quantization this one outlier pulls all the other values to 0, resulting in a significant loss of precision. However, research also shows that these outliers significantly impact model performance, so they must be preserved rather than simply zeroed out, which creates a seemingly irreconcilable contradiction.
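The 0-to-1 example above can be reproduced in a few lines. The following is a minimal NumPy sketch, not code from the paper; the helper names `absmax_quantize` and `dequantize` are chosen here for illustration, and symmetric per-tensor INT8 quantization is assumed.

```python
import numpy as np

def absmax_quantize(x):
    """Symmetric per-tensor INT8 quantization: one scale for the whole tensor."""
    scale = 127.0 / np.max(np.abs(x))
    q = np.round(x * scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) / scale

rng = np.random.default_rng(0)
normal = rng.uniform(0.0, 1.0, size=1000).astype(np.float32)  # values in [0, 1)

# Without an outlier, the worst-case reconstruction error is about half a step.
q, s = absmax_quantize(normal)
err_clean = np.abs(dequantize(q, s) - normal).max()

# A single outlier of 10,000 shrinks the scale to 127 / 10000 = 0.0127,
# so every value in [0, 1) is rounded to the integer 0.
with_outlier = np.append(normal, np.float32(10000.0))
q_out, s_out = absmax_quantize(with_outlier)
num_zeroed = int(np.sum(q_out[:-1] == 0))

print("max error without outlier:", err_clean)
print("normal values rounded to 0 with outlier:", num_zeroed, "of", normal.size)
```

The outlier itself is still represented exactly (it maps to 127); it is the information in all the other values that is lost.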

The authors discovered that outliers in the activations were concentrated in a small subset of channels and occupied very few dimensions (less than 1%). Therefore, the authors considered whether outliers could be processed separately, so that the remaining non-outlier values could be handled well with ordinary scaling. Furthermore, the authors observed that matrix multiplication can be viewed as a set of dot products between the rows of the left matrix and the columns of the right matrix. If we quantize the left matrix by rows and the right matrix by columns, each dot product gets its own scaling constants, which yields more precise quantized values and potentially higher accuracy.

1.1.2 Scheme

Based on the above considerations, the paper proposes a two-step quantization method, LLM.int8().

Vector-wise quantization. Specifically:
- Extract outliers (outlier features, i.e., values greater than a certain threshold) from the hidden state of the input, column by column.
- Based on the distribution of outliers within the input channels, activations and weights are divided into two distinct parts: outliers and normal values. Channels containing outlier data (activations and weights) are stored in FP16 format, while other channels are stored in INT8 format. Different normalization constants will then be applied to each vector inner product.
Mixed-precision matrix factorization. Specifically:
- Most weights and activations of normal features are still calculated using 8-bit vector-wise quantization (W8A8), and the result is then dequantized to FP16. For outlier features, the few affected dimensions stay in FP16 and go through high-precision matrix multiplication (W16A16).
- Add the matrix multiplication results of the non-outliers to the matrix multiplication results of the outliers to obtain the final FP16 result.

As shown in the figure below, outlier detection selects the columns of the input X that contain outliers and the corresponding rows of the weights W, and these are multiplied directly in FP16. The remaining normal values are quantized (each row of X and each column of W is quantized using absmax), multiplied in INT8, and dequantized back to FP16. Finally, the two partial results are summed to give the final output.

3603

8-bit data type and quantization

Let’s first look at the characteristics of 8-bit data types and quantization. Here we introduce the high-precision asymmetric quantization (Zeropoint quantization) and the symmetric quantization (Absmax quantization). While Zeropoint quantization provides high precision by using the full bit range of the data type, it’s rarely used due to practical limitations. Absmax quantization is the most commonly used technique.

Figure (a) below shows Absmax quantization, and (b) shows Zeropoint quantization. [beta, alpha] represent the minimum and maximum values of the interval, respectively.

Absmax quantization: The input is scaled into the 8-bit range [-127, 127] by dividing by the maximum absolute value of the tensor and multiplying by 127. ⌊⌉ denotes rounding to the nearest integer, and (s_{x_{f16}}) is the scaling factor. This method is a form of scale quantization.
Zeropoint quantization: For an asymmetric distribution such as ReLU outputs, Absmax quantization only uses the positive half of the range, [0, 127]. Zeropoint quantization instead scales by the normalized dynamic range (nd_x) and then shifts by the zero point (zp_x), thus utilizing the full range of values. This affine quantization is more accurate, but it is less commonly used due to practical limitations.

3604
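To make the difference concrete, here is a small NumPy sketch that applies both methods to ReLU-like data. It uses a generic affine formulation that maps [min, max] onto [-127, 127]; the function names are illustrative, and the exact rounding details may differ from the formulas in the paper.

```python
import numpy as np

def absmax_quantize(x):
    """Symmetric quantization: scale by 127 / max|x|."""
    s = 127.0 / np.max(np.abs(x))
    return np.round(x * s), s

def zeropoint_quantize(x):
    """Affine quantization: map [min(x), max(x)] onto [-127, 127]."""
    lo, hi = x.min(), x.max()
    nd = 254.0 / (hi - lo)             # normalized dynamic range
    zp = np.round(-127.0 - lo * nd)    # zero point, so that lo maps to -127
    q = np.clip(np.round(x * nd + zp), -127, 127)
    return q, nd, zp

rng = np.random.default_rng(0)
x = np.maximum(rng.normal(size=4096), 0.0)   # ReLU output: no negative values

q_abs, s = absmax_quantize(x)
err_abs = np.abs(q_abs / s - x).mean()

q_zp, nd, zp = zeropoint_quantize(x)
err_zp = np.abs((q_zp - zp) / nd - x).mean()

print("absmax integer range:   ", q_abs.min(), q_abs.max())
print("zeropoint integer range:", q_zp.min(), q_zp.max())
print("mean abs error (absmax, zeropoint):", err_abs, err_zp)
```

On this one-sided input, absmax leaves the negative half of the integer range unused, while the affine mapping spreads the same values over roughly twice as many levels.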

Vector-wise quantization

When performing Absmax quantization, a scaling constant is used for each tensor. Since only one scaling constant is used per tensor, a single outlier in the activation can cause the scaling factor to become smaller, ultimately reducing the accuracy of other non-outlier values. This problem can be alleviated by having multiple scaling constants for each tensor. Therefore, the authors used vector-wise quantization.

Vector-wise quantization treats matrix multiplication as a series of independent inner products between row vectors and column vectors, with different normalization constants applied to each inner product. For a given input (X_{f16} \in R^{s \times h}) and weights (W_{f16} \in R^{h \times o}), in an INT8 matrix multiplication with FP16 inputs, we assign a different scaling constant to each row of the input and to each column of the weights, denoted (c_{xf16}) and (c_{wf16}). During dequantization, each quantized inner product is multiplied by the reciprocal of the product of its two scaling constants, that is, divided by (c_{xf16} c_{wf16}). For the entire matrix multiplication, this is equivalent to denormalizing by the outer product (c_{xf16} \otimes c_{wf16}). The output of the matrix multiplication can thus be recovered by denormalization before the next operation.

3605
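The row-and-column scaling described above can be sketched in NumPy as follows. This is an illustrative sketch rather than the paper's implementation: float64 arrays stand in for FP16, the INT8 GEMM is emulated with INT32 arithmetic, and the function names are made up for this example.

```python
import numpy as np

def int8_matmul(X, W, c_x, c_w):
    """Quantize with the given scaling constants, multiply, then dequantize."""
    X_i8 = np.round(X * c_x).astype(np.int8)
    W_i8 = np.round(W * c_w).astype(np.int8)
    acc = X_i8.astype(np.int32) @ W_i8.astype(np.int32)   # INT32 accumulation
    return acc / (c_x * c_w)   # denormalize (outer product in the vector-wise case)

def tensorwise(X, W):
    # One scaling constant per tensor.
    return int8_matmul(X, W, 127.0 / np.abs(X).max(), 127.0 / np.abs(W).max())

def vectorwise(X, W):
    # One constant per row of X, shape (s, 1), and per column of W, shape (1, o).
    c_x = 127.0 / np.abs(X).max(axis=1, keepdims=True)
    c_w = 127.0 / np.abs(W).max(axis=0, keepdims=True)
    return int8_matmul(X, W, c_x, c_w)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 64))
X[0, 3] = 60.0                      # a single outlier in the first row
W = rng.normal(size=(64, 16))

ref = X @ W
err_tensor = np.abs(tensorwise(X, W) - ref).mean()
err_vector = np.abs(vectorwise(X, W) - ref).mean()
print("mean abs error (per-tensor, vector-wise):", err_tensor, err_vector)
```

With per-tensor scaling the outlier degrades every row; with vector-wise scaling only the row that contains it is affected.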

Mixed precision decomposition

A significant challenge for 8-bit Transformers with billions of parameters is that their outlier features require high-precision quantization. Vector-wise quantization (quantizing each row of the hidden state separately) remains insufficient for these outlier features. The authors observed that outlier features are both highly sparse and regular in practice, occupying only about 0.1% of all feature dimensions. Therefore, they developed a mixed-precision decomposition technique for matrix multiplication: feature dimensions that contain at least one value whose magnitude exceeds a threshold (\alpha) are placed into a set O. Features within this set are not quantized and are multiplied directly in FP16, while all other dimensions undergo 8-bit multiplication. Experiments show that (\alpha = 6.0) is sufficient to offset the performance degradation caused by quantization. Since the number of outlier feature dimensions is typically no more than 7 ((|O| \le 7)), the decomposition consumes very little additional memory.

3606
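A minimal sketch of the decomposition, under stated assumptions: float64 stands in for FP16, the INT8 path is emulated in NumPy, and `mixed_precision_matmul` is a hypothetical name rather than an API of any library.

```python
import numpy as np

def vectorwise_int8_matmul(X, W):
    """INT8 path: row-wise scales for X, column-wise scales for W."""
    c_x = 127.0 / np.maximum(np.abs(X).max(axis=1, keepdims=True), 1e-8)
    c_w = 127.0 / np.maximum(np.abs(W).max(axis=0, keepdims=True), 1e-8)
    X_i8 = np.round(X * c_x).astype(np.int8)
    W_i8 = np.round(W * c_w).astype(np.int8)
    acc = X_i8.astype(np.int32) @ W_i8.astype(np.int32)
    return acc / (c_x * c_w)

def mixed_precision_matmul(X, W, alpha=6.0):
    """Split feature dimensions into the outlier set O and the rest."""
    in_O = np.any(np.abs(X) >= alpha, axis=0)      # one flag per column of X
    out = np.zeros((X.shape[0], W.shape[1]))
    if in_O.any():
        out += X[:, in_O] @ W[in_O, :]             # high-precision path
    if (~in_O).any():
        out += vectorwise_int8_matmul(X[:, ~in_O], W[~in_O, :])
    return out, in_O

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 128))
X[:, 5] *= 20.0                                    # one outlier feature dimension
W = rng.normal(size=(128, 32))

ref = X @ W
out_mixed, in_O = mixed_precision_matmul(X, W)
err_mixed = np.abs(out_mixed - ref).mean()
err_int8 = np.abs(vectorwise_int8_matmul(X, W) - ref).mean()
print("outlier dimensions:", np.flatnonzero(in_O))
print("mean abs error (INT8 only, mixed):", err_int8, err_mixed)
```

Only the columns of X in O and the matching rows of W take the high-precision path, which is why the extra memory stays small when O is small.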

1.1.3 Disadvantages

The paper has the following shortcomings:

It is difficult to efficiently implement outlier decomposition on hardware accelerators, and runtime outlier detection results in inference speeds that are even slower than FP16.
The model quantization is only up to 8 bits, which is still twice as large as 4 bits.

1.2 ZeroQuant

1.2.1 Main Contributions

ZeroQuant’s main contributions are as follows:

Activation is dynamically quantized per-token, with each token having independent quantization parameters. The weight matrix is then grouped for quantization, with each group having independent quantization parameters. This approach balances performance and accuracy.
It proposes a layer-by-layer knowledge distillation (LKD) algorithm that distills the quantized network without accessing the original training data, thus reducing GPU memory overhead.

1.2.2 Scheme

Token-wise quantization for activations

In existing PTQ, the common practice for quantizing activation is static quantization, where the minimum/maximum range is calculated during the offline calibration phase.

For small models with low activation range variance, this approach might be sufficient. However, the activation ranges of LLMs vary significantly. Therefore, static quantization schemes (typically applied to all tokens/samples) will lead to a significant drop in accuracy. One way to overcome this problem is to employ finer-grained token-wise quantization and dynamically compute the minimum/maximum range for each token to reduce quantization errors caused by activation. However, directly applying token-wise quantization using existing deep learning frameworks incurs significant quantization and dequantization costs because token-wise quantization introduces additional operations, resulting in expensive data movement overhead between GPU computing units and main memory.

To address this issue, ZeroQuant has built a highly optimized inference backend for token-wise quantization of Transformer models. For example, ZeroQuant’s inference backend employs operator fusion to combine quantization operators with their preceding operators (such as Layer Norm), which mitigates the data movement costs of token-wise quantization. Similarly, the INT32 accumulation is scaled with the weight and activation quantization scales before the final FP16 result is written back to main memory for the next FP16 operator (such as GeLU), which reduces the dequantization cost of the different GeMM outputs.

Token-wise quantization can significantly reduce the representation error of quantized activations. It does not require calibration of activation ranges, and there are no quantization-related costs (e.g., activation range calibration) for ZeroQuant’s quantization scheme (INT8 weights and INT8 activations).

3607
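The contrast between static and dynamic token-wise activation quantization can be illustrated with a short NumPy sketch. It is simplified on purpose: symmetric absmax ranges are assumed instead of min/max ranges, the data is synthetic, and the function names are invented for this example.

```python
import numpy as np

def static_quantize(x, calib_absmax):
    """One range for all tokens, fixed offline from calibration data."""
    scale = calib_absmax / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale                         # quantize, then dequantize

def dynamic_per_token_quantize(x):
    """One range per token (row), computed at inference time."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
# 16 tokens whose activation ranges differ by two orders of magnitude.
token_magnitude = np.logspace(-1, 1, 16)[:, None]
calib = rng.normal(size=(16, 64)) * token_magnitude   # offline calibration batch
x = rng.normal(size=(16, 64)) * token_magnitude       # activations seen at inference

err_static = np.abs(static_quantize(x, np.abs(calib).max()) - x).mean()
err_dynamic = np.abs(dynamic_per_token_quantize(x) - x).mean()
print("mean abs error (static, dynamic per-token):", err_static, err_dynamic)
```

The per-token variant needs an extra reduction over each row at runtime, which is the overhead that the fused kernels described above are meant to hide.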

Group quantization for weight matrix

Applying INT8 PTQ to BERT/GPT-3 models leads to a significant drop in accuracy. The key challenge is that the INT8 representation cannot fully capture the different numerical ranges of different rows in the weight matrix and of different activation tokens. One way to address this is to use group-wise quantization for the weight matrix and token-wise quantization for the activations.

Grouped quantization was first proposed in Q-BERT, where the weight matrix is divided into g groups, each quantized individually. However, in Q-BERT, the authors only applied it to quantization-aware training. More importantly, they did not consider hardware efficiency constraints or provide system backend support, thus failing to reduce latency.

ZeroQuant’s design takes into account the hardware constraints of the NVIDIA Ampere GPU architecture (e.g., A100), where the computational units are based on the tiling size of Warp Matrix Multiply and Accumulate (WMMA), in order to achieve optimal acceleration. Compared to single-matrix quantization, ZeroQuant’s grouped quantization offers better accuracy due to its finer granularity, while significantly reducing latency.
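As a rough illustration of why finer groups help, the sketch below quantizes a weight matrix with one scale per group of consecutive rows. How ZeroQuant actually aligns groups with WMMA tiles is not shown; the grouping rule and the function names here are assumptions made for the example.

```python
import numpy as np

def quantize_groupwise(W, num_groups):
    """Symmetric INT8 quantization with one scale per group of consecutive rows."""
    assert W.shape[0] % num_groups == 0, "rows must divide evenly into groups"
    grouped = W.reshape(num_groups, -1)                  # flatten each group
    scales = np.abs(grouped).max(axis=1, keepdims=True) / 127.0
    q = np.round(grouped / scales).astype(np.int8)
    return q.reshape(W.shape), scales

def dequantize_groupwise(q, scales):
    grouped = q.reshape(scales.shape[0], -1).astype(np.float64)
    return (grouped * scales).reshape(q.shape)

rng = np.random.default_rng(0)
# Rows with very different numerical ranges.
W = rng.normal(size=(8, 16)) * np.logspace(-2, 0, 8)[:, None]

errors = {}
for g in (1, 2, 8):                                      # g = 1 is single-matrix quantization
    q, scales = quantize_groupwise(W, g)
    errors[g] = np.abs(dequantize_groupwise(q, scales) - W).mean()
    print(f"groups={g}: mean abs error {errors[g]:.6f}")
```

More groups mean more scales to store and apply, so the group count trades accuracy against overhead.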

Layer-by-layer knowledge distillation

Knowledge distillation (KD) is one of the most powerful methods to mitigate the accuracy degradation after model compression. However, KD has some limitations when applied to LLMs:

KD requires the teacher and student models to be held together during training, which greatly increases memory and computation costs.
KD typically requires extensive training of the student model. Therefore, multiple copies of the weight parameters need to be stored in memory to update the model.
KD typically requires the original training data, which is sometimes inaccessible due to privacy/confidentiality issues.

To address these limitations, ZeroQuant proposed the Layer-by-Layer Distillation (LKD) algorithm to mitigate accuracy loss. This algorithm uses the original model as the teacher model and the quantized model as the student model, guiding the student model to mimic the teacher model’s output by passing knowledge layer by layer. This method does not require original training data and can effectively improve the accuracy of the quantized model without increasing computational costs.

3608

ZeroQuant-V2, the second version, analyzed how sensitive accuracy is to weight quantization and activation quantization, compared the model performance of commonly used PTQ algorithms, and finally introduced an optimization technique called Low Rank Compensation (LoRC), which works in conjunction with PTQ to improve the recovery of overall model quality with a minimal increase in parameter size. However, the scalability of this method does not seem to be very good.

ZeroQuant-FP, the third version, primarily explores the feasibility of floating-point (FP) quantization, with a particular focus on the FP8 and FP4 formats. For LLMs, FP8 activations outperform INT8, while for weight quantization FP4 is comparable to, or even better than, INT4; LoRC helps improve the overall performance of W4A8. Furthermore, to address the challenges arising from the format difference between weights and activations (casting from FP4 to FP8), ZeroQuant-FP requires all scaling factors to be powers of 2 and restricts the scaling factors to a single computational group.

ZeroQuant-HERO, the fourth version, is a hardware-enhanced post-training W8A8 quantization framework that takes into account both memory-bandwidth-bound and compute-intensive operations. It is generally more engineering-oriented and quantizes more sub-modules of the Transformer, namely the embedding layer, the attention module, and the MLP module. It also allows different quantization precisions to be combined across modules, so that the quantization level can be selected as needed.

1.3 SmoothQuant

SmoothQuant comes from the paper “SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models.” It uses a bit width of 8 bits for both weights and activations (W8A8).

The core idea of SmoothQuant is that the activation X is difficult to quantize because outliers stretch the linear mapping range of quantization, leaving fewer effective bits for most values. Furthermore, the outliers persist in the same channels across different tokens. Therefore, the paper proposes a channel-wise scaling method. In the offline stage, a smoothing factor (s) is introduced to smooth outliers in the activation. Through a mathematically equivalent transformation, the quantization difficulty is shifted from the activation to the weights (W): the activation is divided by the smoothing factor channel-wise and the weights are multiplied by it, which keeps the overall output constant while making the activation values smaller overall and the weights larger overall. Both the smoothed activation (\hat{X}) and the adjusted weights (\hat{W}) are easier to quantize.

3609

1.3.1 Motivation

Status quo

Both ZeroQuant and LLM.int8() propose solutions to address outliers:

ZeroQuant applies dynamic per-token activation quantization and group-wise weight quantization to address the problem of activation outliers. Although this method is highly efficient, its accuracy is poor on the OPT model with 175B parameters.
LLM.int8() solves this accuracy problem by introducing mixed-precision decomposition (i.e., keeping FP16 for outliers and using INT8 for other activations), but this method is difficult to implement efficiently on AI accelerators in practice, which makes it a hardware-unfriendly quantization scheme.

Therefore, it is necessary to find an efficient, hardware-friendly, and training-free quantization scheme so that all computationally intensive operations in LLMs can use INT8.

Problems with Quantization Schemes

Depending on the range over which the quantization parameters s (the scale, i.e., the quantization step size) and z (the zero point, i.e., the offset) are shared, that is, depending on the quantization granularity, quantization methods can be divided into per-tensor quantization, per-channel quantization, per-token quantization, and per-group (group-wise) quantization. The authors of SmoothQuant compared these schemes and concluded the following:

Per-tensor quantization is the most efficient implementation method in hardware.
Per-channel activation quantization preserves precision, but it is incompatible with INT8 GEMM kernels (which require both weights and activations to be INT8). In other words, per-channel quantization does not map well to hardware-accelerated GEMM kernels (the hardware cannot execute it efficiently, which increases computation time). In these kernels, scaling can only be performed along the outer dimensions of the matrix multiplication (i.e., the token dimension of the activations and the output channel dimension (C_o) of the weights). See the diagram below for details.
Therefore, despite the lower precision of per-token quantization, it has been used for linear layers in previous studies to avoid reducing the throughput of the GEMM kernel itself. However, it cannot solve the problem of activation quantization.

3610

Because outliers appear in the same channels across different tokens, SmoothQuant proposed a mathematically equivalent per-channel scaling transformation that significantly smooths the magnitudes across channels, making the model easier to quantize and maintaining accuracy while also improving inference speed.

Let’s further analyze why per-channel quantization doesn’t map well to hardware-accelerated GEMM kernels. If acceleration isn’t a concern and the goal is simply parameter compression, then any quantization granularity will do. However, to accelerate the matrix multiplication itself, the integer result of the whole multiplication must be dequantizable afterwards, which means the scale factors must factor out of the accumulation. Therefore, scale factors may vary only along the outer dimensions of the matrix multiplication; otherwise, dequantization after the integer GEMM is impossible.

3611

Let’s continue our analysis using attention quantization as an example. Channel-wise quantization of K avoids sharing scale factors between different channels. However, this presents a problem in the attention computation. After the matrix multiplication (QK^T), the resulting matrix has a dimension of (N \times N). If we perform per-channel quantization on K (top of the diagram, a total of d channels, each containing N elements), each channel has an independent scale factor, resulting in a total of d scale factors. During dequantization, we need to multiply the quantized result by the corresponding scale factor. However, (QK^T) has a dimension of (N \times N) and has no axis that corresponds to the d channels, making it impossible to dequantize correctly with the scale factors of K’s channel dimension. Therefore, in the attention multiplications, each matrix can only be quantized so that a scale factor is shared along the common (contracted) dimension, that is, the scales may vary only along the outer dimension (bottom of the diagram). Based on this simple principle, the four matrices in attention can be quantized in the following combinations.

Granularity	Q	K	P	V
per-channel	N	N	N	Y
per-token	Y	Y	Y	N
per-block	Y	Y	Y	N
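The rule can be checked numerically. In the NumPy sketch below (illustrative only; the rounded float arrays stand in for INT8 tensors, and all variable names are made up), scales along the outer dimension N can be applied after the integer product, whereas scales along the inner dimension d have to be multiplied in before the accumulation, which turns an operand back into non-integer values.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 8
Q = rng.normal(size=(N, d))
K = rng.normal(size=(N, d))

# Per-token scales: one per row, i.e. along the outer dimension N.
s_q = np.abs(Q).max(axis=1, keepdims=True) / 127.0     # shape (N, 1)
s_k = np.abs(K).max(axis=1, keepdims=True) / 127.0     # shape (N, 1)
Q_int = np.round(Q / s_q)
K_int = np.round(K / s_k)

S_int = Q_int @ K_int.T                                # integer GEMM, shape (N, N)
S_outer = S_int * (s_q @ s_k.T)                        # dequantize AFTER the GEMM
ref_outer = (Q_int * s_q) @ (K_int * s_k).T            # product of dequantized matrices
outer_ok = np.allclose(S_outer, ref_outer)

# Per-channel scales for K: one per column, i.e. along the inner dimension d.
c_k = np.abs(K).max(axis=0, keepdims=True) / 127.0     # shape (1, d)
K_ch = np.round(K / c_k)
ref_inner = (Q_int * s_q) @ (K_ch * c_k).T

# c_k[c] multiplies every term inside the sum over c, so it must be folded
# into an operand BEFORE the accumulation.
folded = Q_int * c_k
S_inner = (folded @ K_ch.T) * s_q
inner_ok = np.allclose(S_inner, ref_inner)
operand_is_integer = np.allclose(folded, np.round(folded))

print("outer scales applied after integer GEMM are exact:", outer_ok)
print("inner scales folded before GEMM are exact:", inner_ok)
print("folded operand is still integer:", operand_is_integer)
```

Once an operand is no longer integer, the multiplication cannot run on an INT8 GEMM kernel, which is the practical meaning of the N entries in the table.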

1.3.2 Scheme

SmoothQuant proposed an 8-bit weight, 8-bit activation (W8A8) quantization method. This scheme can be viewed as data preprocessing that adapts the data for INT8 quantization. The main idea is that since weights are easy to quantize while activations are more difficult, the quantization difficulty is shifted from the activations to the weights of the following linear layer, and fine-grained weight quantization is then used to handle it. The scheme’s features are as follows:

A smoothing factor s is introduced to smooth out activation outliers, equivalently shifting the difficulty of quantization from activations to weights;
The smoothing factor establishes a mathematically equivalent transformation: the input is divided by the scale (which is > 1), which reduces its dynamic range and thereby improves the quantization result, and the scale is then absorbed into the weights of the next layer;
The difficulty of quantizing the weights is mitigated by fine-grained weight quantization (the input often uses token-wise quantization, while the weights usually use channel-wise or group-wise quantization).

The SmoothQuant algorithm faces two main challenges: how to shift the quantization difficulty from activations to weights through a mathematically equivalent transformation, and how to achieve channel-by-channel equivalent scaling. These are addressed by upgrading the matrix multiplication and by calculating the smoothing factors properly.

Calculate the smoothing factor

The goal of SmoothQuant is to select a per-channel smoothing factor (s) that makes (\hat{X} = X diag(s)^{-1}) easy to quantize. To reduce quantization error, the effective quantization bits of all channels should be increased. The total effective quantization bits are maximized when all channels have the same maximum magnitude.

One approach is to let (s_j = max(|X_j|), j = 1, 2, …, C_i), where j indexes the input channels and (C_i) is the number of input channels. After dividing each channel by (s_j), all activation channels have the same maximum value, making the activations relatively easy to quantize. However, this shifts the outliers of the activations into the weights, so the activations become easy to quantize but the weights become difficult.
Another approach is to let (s_j = 1 / max(|W_j|)), so that all weight channels have the same maximum value, making the weights easier to quantize. This essentially shifts all quantization difficulty from the weights to the activations: the already small variance of the weights becomes even smaller and the already large variance of the activations becomes even larger, making the activations more difficult to quantize.

Therefore, we need to balance the quantization difficulty between weights and activations so that both are easy to quantize. The authors control how much difficulty is transferred from the activations to the weights by adding a hyperparameter (\alpha) (the migration strength). A suitable value allows both weights and activations to be quantized easily. If (\alpha) is too large, the weights become difficult to quantize; if (\alpha) is too small, the activations remain difficult to quantize. See the figure below.

(\alpha) represents the migration strength, a hyperparameter that controls how much of the quantization difficulty of the activations is migrated to the weights.
(C_i) represents the number of input channels of the activations.

3612
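For reference, the smoothing factor that the parameters above belong to is written in the SmoothQuant paper as follows, where (X_j) is the j-th input channel (column) of the activations and (W_j) is the corresponding row of the weights:

$s_j = \max(|X_j|)^{\alpha} / \max(|W_j|)^{1-\alpha}, \quad j = 1, 2, \ldots, C_i$

Setting (\alpha = 1) recovers the first approach above, (s_j = max(|X_j|)), and setting (\alpha = 0) recovers the second, (s_j = 1 / max(|W_j|)); (\alpha = 0.5) splits the difficulty evenly between activations and weights.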

The scaling factors can be calculated offline using a small calibration dataset. Since X is generally produced by a preceding linear operation, the scaling of X can be merged into that preceding layer. As for the weights, the current layer’s scaling factors (and the scaling factors of X in the next layer) can be multiplied in before quantization. This avoids computing the scaling matrix multiplication during inference.

Matrix multiplication upgrade

With a smoothing factor, outlier smoothing can be achieved, which is essentially done by transferring activation difficulty to weights.

A standard matrix multiplication is (Y = XW). SmoothQuant smooths the activations by dividing each channel by its smoothing factor. To maintain the mathematical equivalence of the linear layer, the weights are adjusted in the opposite way: column j of X is divided by (s_j) and row j of W is multiplied by (s_j), so that (Y = (X diag(s)^{-1})(diag(s)W) = \hat{X}\hat{W}) is mathematically unchanged.

The specific conversion process is shown in the following figure.

3613

Considering that the input X is typically generated by preceding linear operations (such as linear layers, layer normalization, etc.), we can easily fuse the smoothing factor offline into the parameters of the previous layer without increasing kernel call overhead due to additional scaling operations. In some other cases, such as when the input comes from residual summation, we can add additional scaling in the residual branch (diag(s) \times W_{pre}).

Example

The calculation process is shown in the figure below. The smoothing factor s was obtained offline on the calibration samples.

3614

The specific steps are shown in the figure below, where the red labels 2, 4, and 5 belong to the offline calibration stage, 6 and 7 belong to the smoothing stage, and the final product Y belongs to the inference stage.

Preparation stage
- X is label 1, and W is label 3.
Calibration phase (offline)
- X is processed column by column, yielding the maximum absolute value of each column, which is label 2.
- W is processed row by row, yielding the maximum absolute value of each row, which is label 4.
- Combine labels 2 and 4 according to the formula for s to obtain label 5.
Smoothing phase (offline)
- Dividing label 1 by label 5 gives label 6. The formula for calculating (\hat{X}) is as follows: (\hat{X} = X diag(s)^{-1})
- Multiplying label 3 by label 5 gives label 7. The formula for calculating (\hat{W}) is as follows: (\hat{W} = diag(s)W)
Inference phase (online, model deployment)
- Multiplying label 6 by label 7 gives Y. The output after smoothing is calculated as follows: (Y = \hat{X}\hat{W})

3615
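The calibration, smoothing, and inference steps can be traced in NumPy. This is a sketch under assumptions: the data is synthetic, per-tensor symmetric INT8 is used for both operands, the smoothing factor follows the formula from the SmoothQuant paper with (\alpha = 0.5), and the function names are illustrative.

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    act_max = np.abs(X).max(axis=0)              # label 2: max |X| per column
    w_max = np.abs(W).max(axis=1)                # label 4: max |W| per row
    s = act_max**alpha / w_max**(1.0 - alpha)    # label 5: smoothing factor
    X_hat = X / s                                # label 6: X diag(s)^-1
    W_hat = W * s[:, None]                       # label 7: diag(s) W
    return X_hat, W_hat, s

def fake_quant(t):
    """Per-tensor symmetric INT8 quantization followed by dequantization."""
    scale = np.abs(t).max() / 127.0
    return np.round(t / scale) * scale

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 64))
X[:, [3, 10]] *= 50.0                            # two outlier channels
W = rng.normal(size=(64, 32))
Y = X @ W

X_hat, W_hat, s = smooth(X, W)
equivalent = np.allclose(X_hat @ W_hat, Y)       # the transformation keeps Y unchanged

err_plain = np.abs(fake_quant(X) @ fake_quant(W) - Y).mean()
err_smooth = np.abs(fake_quant(X_hat) @ fake_quant(W_hat) - Y).mean()
print("mathematically equivalent:", equivalent)
print("W8A8 mean abs error (plain, smoothed):", err_plain, err_smooth)
```

In a real deployment the division by s is folded into the preceding layer offline, so only the product of the already smoothed tensors is computed at inference time.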

Application

Because linear layers account for the majority of parameters and computational overhead in large language models, the authors, by default, smooth the input activations of the self-attention and feedforward layers and quantize all linear layers to W8A8. They also quantize the BMM operations within the attention mechanism. The following diagram illustrates the quantization process designed by the authors for the Transformer module:

The inputs and weights of computationally intensive operators, namely the linear layers and the BMM (Batch Matrix Multiplication) in the attention layers, are quantized using INT8 (W8A8).
For other lightweight element-wise operations such as Softmax and LayerNorm, FP16 is retained.

This design helps to balance accuracy and inference efficiency. The reason for not quantizing the Token Embedding layer is likely because it has less parameter redundancy and does not exhibit weight sparsity.

3616

1.3.3 Shortcomings

SmoothQuant has the following shortcomings:

It suits LLMs but performs poorly on Stable Diffusion;
Using small batches of data for offline calibration carries the risk of overfitting the calibration data.
(Its accuracy is less reliable than that of LLM.int8() and is easily affected by the calibration set; moreover, once the weight precision is lowered to 4 bits, the model accuracy drops significantly.)

0x02 4-bit quantization

As LLM parameter counts increase, the aforementioned INT8 quantization schemes exhibit significant performance degradation when pushed to lower bit widths. For example, SmoothQuant shows a substantial decrease in accuracy at low-bit settings such as W4A4, W3A4, W2A16, and W2A8, which hurts the practical user experience of the quantized model. Therefore, lower-bit solutions are needed, such as the 4-bit schemes proposed by GPTQ and AWQ.

GPTQ and AWQ successfully compressed the weight matrices of LLMs to 4 bits while maintaining the main capabilities of LLMs. This represents a significant advance in LLM optimization, balancing time and space efficiency against model performance. The paper “The case for 4-bit precision: k-bit inference scaling laws” examines the trade-off between model size and bit precision with respect to zero-shot performance. Its authors conducted extensive experiments across various LLM families and found that 4-bit precision is almost universally the optimal choice for balancing the total number of model bits against zero-shot accuracy.

The characteristics of the several schemes introduced in this section are as follows.

Method	Weight/activation precision	Features
GPTQ	W3A16, W4A16	Based on the Hessian matrix, the weights are quantized column by column and the not-yet-quantized weights are adjusted to compensate for the loss. Fixed-order quantization that can run in parallel across rows, lazy batch updates, and a Cholesky reformulation are proposed to accelerate the process.
AWQ	W4A16	Weight: Per-group, Activation: FP16. Important weights are selected based on activation magnitude, and these weights and activations are scaled individually using a grid search to find the optimal scaling factor.
LLM-QAT	W4A16	This is a pioneering work on quantization-aware training for large models. In addition to quantizing weights and activations, it also quantizes key-value caches, which is crucial for improving throughput and supporting long sequence dependencies at the current model scale.
QLoRA	W4A16	Combining low-rank adaptation with quantization enables efficient fine-tuning of large models while minimizing computational costs and memory usage.
FlatQuant	W4A4	Explores how flattening the weight and activation distributions reduces quantization loss.

2.1 GPTQ

GPTQ, derived from the paper “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers,” aims to correct the remaining weights using second-order Hessian information to compensate for the error introduced by quantization, thereby minimizing the difference in model output before and after quantization. In other words, GPTQ quantizes the current weights and then updates the remaining weights to minimize the loss of layer output before and after quantization.

GPTQ is built upon the traditional OBQ algorithm. OBQ achieves the optimal quantization order for each row of the weight matrix by using the reconstruction error relative to the Hessian matrix of the unquantized weights. After each quantization step, OBQ iteratively adjusts the unquantized weights to mitigate the reconstruction error. However, the frequent updates to the Hessian matrix during quantization increase computational complexity.

The authors of GPTQ argue that current models are too complex, making the calculation of the Hessian matrix for the entire model overly complicated. They suggest that if parameters within the same row of the parameter matrix are correlated, while parameters in different rows are uncorrelated, then only the intra-row Hessian matrix needs to be calculated. The GPTQ authors consider pruning a special form of quantization and propose OBQ: calculating the Hessian matrix layer by layer, sorting the weights within each layer according to their influence from smallest to largest, and then quantizing them sequentially.

The diagram below illustrates the GPTQ quantization process. In a given step, GPTQ quantizes blocks (bold) of consecutive columns using inverse Hessian information stored in the Cholesky decomposition and updates the remaining weights (blue) at the end of the step. The quantization process is applied recursively within each block. The white middle column represents the portion currently being quantized. The black box in the right-hand diagram represents a block, indicating the group currently being quantized.

3617

2.1.1 Background

Prior to GPTQ, most PTQ schemes could only achieve up to 8-bit quantization. GPTQ further explored quantization with even lower bit depths. GPTQ didn’t appear out of thin air; it’s based on two things:

Layer-by-layer quantization.
Optimal Brain Quantization (OBQ).

Layer-by-layer quantization

GPTQ uses asymmetric quantization, performed layer by layer, with each layer processed independently before moving to the next. Layer-by-layer quantization on a set of calibration samples is equivalent to solving a reconstruction problem for each layer. Given the weights (W_\ell) of a linear layer (\ell) and the layer input (X_\ell), the optimization objective is to find a quantized weight matrix (\widehat{W}) that minimizes the squared error relative to the output of the full-precision layer. In the formula below, W and (\widehat{W}) represent the original weights and the quantized weights (the pruned weights, when the same objective is used for pruning), respectively, and X represents the layer input in matrix form.

$\operatorname*{argmin}_{\widehat{W}} \lVert WX - \widehat{W}X \rVert_2^2$

If the weight matrix is decomposed row-wise, the pruning loss can be expressed as a row-wise summation. Since we only prune one weight in a row in each iteration, it only affects the pruning loss of that row. Furthermore, each iteration only affects the Hessian matrix of the current row, and the calculation of the Hessian matrix between different rows is independent of each other, so iterations between different rows can be parallelized.
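The row-wise decomposition can be checked numerically. The following is a minimal NumPy sketch under assumed toy sizes; the rounding used to build `W_hat` is only a stand-in for a real quantizer. It shows that the layer loss is a sum of independent per-row terms and that every row shares the same Hessian (H = 2XX^T).

```python
import numpy as np

rng = np.random.default_rng(0)
d_row, d_col, n_samples = 4, 6, 32
W = rng.normal(size=(d_row, d_col))        # full-precision layer weights
X = rng.normal(size=(d_col, n_samples))    # calibration inputs, one column per sample
W_hat = np.round(W * 4) / 4                # stand-in for quantized weights (grid step 0.25)

# Layer-wise objective: squared norm of the output difference.
total_loss = np.sum((W @ X - W_hat @ X) ** 2)

# The same objective written as independent per-row terms.
row_losses = np.sum(((W - W_hat) @ X) ** 2, axis=1)

# Every row shares one Hessian, which depends only on the inputs.
H = 2 * X @ X.T
delta = W - W_hat
quadratic_form = 0.5 * np.einsum("ij,jk,ik->i", delta, H, delta)

print(total_loss, row_losses.sum())
```

Because the rows do not interact, each row can be handled on its own, which is what makes row-level parallelism possible.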

OBQ

GPTQ is based on another quantization method, OBQ (Optimal Brain Quantization). OBQ is actually a modified version of OBS (Optimal Brain Surgeon, a classic pruning method), which in turn comes from OBD (Optimal Brain Damage, a pruning method proposed by Yann LeCun in 1990).

The role of OBD is to introduce second-order derivative information to perform minimum loss pruning on neural networks.
The purpose of OBS is to introduce the concept of deletion weight compensation, so as to minimize the pruning loss.

Therefore, we need to clarify the main points. Our core question is: how can we selectively remove some weights to reduce network size without introducing too much error? This leads to the question: how do we define the error introduced by removing weights? Or, in other words, how do we measure it?

OBD

OBD is a pruning scheme; model pruning essentially involves adjusting a certain value of W to 0. Therefore, we can measure the importance of a weight parameter by the change in the loss function resulting from removing it. That is, if we want to remove some parameters from the model (i.e., prune), we want to remove parameters that have a small impact on the objective function E. However, deleting a parameter and then re-evaluating the objective function is too cumbersome. Therefore, OBD establishes a local model of the error function to predict the impact of perturbation parameter vectors (removing some weight parameters) on the optimization objective.

Assume the model’s weights are W and the objective function E is the training loss, so that under the current weights the loss is (E = E(W)). The basic idea of OBD is to iteratively select for pruning the parameters whose removal increases E the least. The OBD authors performed a Taylor expansion of the objective function E to estimate the impact of weight changes. They also made some simplifying assumptions (e.g., that deleting any parameter does not change the impact of other parameters on the objective function; that is, each parameter’s impact is independent, so the Hessian is treated as diagonal). Under these assumptions, the change in the objective caused by removing weight (w_i) simplifies to (\Delta E_i = \frac{1}{2}h_{ii}w_i^2). By calculating the diagonal entries (h_{ii}) of the Hessian matrix, the impact of each parameter on the objective can be determined. The parameters can then be sorted in ascending order of influence, which gives the pruning order. See the diagram below for details.

3618

The OBD pruning algorithm is as follows:

Construct a neural network;
Train the neural network until it converges;
Calculate the second derivative (h_{ii}) for each parameter;
Calculate the increase in the loss function caused by removing each parameter, (\Delta E_i = \frac{1}{2}h_{ii}w_i^2);
Remove the parameters with the smallest (\Delta E_i);
Repeat steps 2-5 until the number of removed parameters reaches the preset target.
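Steps 3 to 5 of the list above can be sketched as follows. This is only an illustration: the function names `obd_saliency` and `obd_prune` and the toy weights and Hessian diagonal are assumptions, and the retraining loop of steps 2 and 6 is omitted.

```python
import numpy as np

def obd_saliency(weights, hessian_diag):
    """Estimated loss increase from zeroing each weight: 0.5 * h_ii * w_i^2."""
    return 0.5 * hessian_diag * weights ** 2

def obd_prune(weights, hessian_diag, n_prune):
    """Zero the n_prune weights with the smallest OBD saliency."""
    saliency = obd_saliency(weights, hessian_diag)
    order = np.argsort(saliency)            # ascending: least important first
    pruned = weights.copy()
    pruned[order[:n_prune]] = 0.0
    return pruned, order[:n_prune]

w = np.array([0.5, -2.0, 0.1, 1.0])
h = np.array([4.0, 0.1, 10.0, 1.0])         # toy diagonal Hessian entries
pruned, removed = obd_prune(w, h, n_prune=2)
print(pruned, removed)
```

In this toy example the weight with the largest magnitude (-2.0) is among those removed, because its curvature (h_{ii}) is small. Magnitude alone is not the criterion.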

OBS

OBS affirmed OBD’s approach to reducing the impact of model pruning on the loss function. However, OBS argued that the independence assumption between parameters proposed in OBD was invalid. OBS believed that there was a correlation between weight pruning, and we still needed to consider the interaction terms. Therefore, we cannot simply assume that the Hessian matrix is a diagonal matrix.

Concretely, suppose we want to remove a weight, denoted as (w_q), chosen so that the increase in overall error is as small as possible. At the same time, we calculate a compensation (\delta_q) and apply it to the remaining weights to offset the error caused by the removed weight.

The authors of OBS found a method: by solving for the inverse of the Hessian matrix, the influence of (w_q) is (\frac{1}{2}\frac{w_q^2}{[\mathbf{H^{-1}}]_{qq}}). Then, the parameters can be sorted in ascending order of influence, thus determining the order of parameter pruning. Furthermore, each time a parameter is pruned, the other parameters are updated, thereby reducing error.
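A single OBS step can be sketched like this. It is a minimal NumPy illustration under a quadratic loss model; the function name `obs_prune_one` and the random positive-definite Hessian are assumptions for the example. The compensation used is the standard OBS update, (\delta = -\frac{w_q}{[\mathbf{H^{-1}}]_{qq}} \mathbf{H^{-1}}_{:,q}).

```python
import numpy as np

def obs_prune_one(w, H_inv):
    """Remove the weight with the smallest OBS saliency and compensate the rest."""
    saliency = 0.5 * w ** 2 / np.diag(H_inv)
    q = int(np.argmin(saliency))
    # Compensation applied to all weights; it drives w[q] to zero.
    delta = -(w[q] / H_inv[q, q]) * H_inv[:, q]
    return w + delta, q, saliency[q]

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 20))
H = A @ A.T + 0.1 * np.eye(5)       # toy symmetric positive-definite Hessian
w = rng.normal(size=5)

w_new, q, predicted = obs_prune_one(w, np.linalg.inv(H))
dw = w_new - w
actual = 0.5 * dw @ H @ dw          # loss increase under the quadratic model
print(q, predicted, actual)
```

The predicted saliency and the loss increase measured under the quadratic model agree, which is the sense in which the formula above gives the influence of (w_q).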

The final OBS algorithm is shown in the figure below: the weights are flattened directly, the corresponding Hessian matrix is calculated, and then pruning is performed in sequence.

3619

Parameter sensitivity can also be viewed from a quantization perspective. Not all parameters in a neural network are equally important. Intuitively, if a weight has a large rounding error, meaning that it lies far from the nearest quantization point, it can be considered highly sensitive to quantization. However, looking only at the rounding error of individual weights ignores the fact that weights in an LLM are highly correlated, so errors can be compensated: a weight may have a large rounding error but be strongly correlated with another weight, in which case rounding one of them up can be offset by rounding the other down.

In each iteration, OBS involves calculating the inverse of the Hessian matrix. Let the total number of weight parameters be (d). The time complexity of calculating the inverse of the Hessian matrix is (O(d^3)), and this calculation is performed once in each iteration, resulting in a total time complexity of OBS of (O(d^4)). Clearly, for models with millions or even hundreds of millions of parameters, this method involves an enormous amount of computation.

OBQ

OBQ extends OBS to model quantization and adds a row-by-row computation scheme. Model pruning sets some weights to 0, while model quantization maps each weight to the nearest value on a quantization grid. Pruning can therefore be understood as a special case of quantization. OBQ processes the weight matrix W row by row, independently for each row, and uses a greedy algorithm to choose the quantization order within a row so that each step minimizes the loss increase (\Delta E). Specifically, the layer-wise quantization objective can be written as a sum of squared errors over the rows of W, so each row can be quantized greedily on its own. Each step does two things:

The Hessian information is used as the criterion for how much quantizing each weight affects the output loss. Within the current row, the weight whose quantization causes the smallest loss increase (\Delta E) is selected and quantized.
The remaining unquantized weights are updated to compensate for the quantization loss.

The detailed derivation is shown in the diagram below, where quant(w) maps w to the nearest point on the quantization grid, much like rounding to a specified number of decimal places. As you can see, if quant(w) always returns zero, the method reduces to the original OBS.

3620

This method works because the weights are usually correlated. Therefore, when a weight experiences a quantization error, the related weights are updated accordingly (via the inverse Hessian matrix). The specific process is shown in the following diagram:

First, the inverse Hessian matrix of the layer is computed from the layer inputs. The Hessian matrix is the second derivative of the loss function with respect to the weights; it tells us how sensitive the output is to changes in each weight, so its inverse shows the (inverse) importance of each weight within the layer. In the inverse Hessian matrix, lower values represent more “important” weights, because small changes to these weights can lead to significant changes in model performance. The adjustment process therefore prioritizes weights that are critical to accuracy, effectively generating weighted quantization errors that preserve the more important details.
The first weight in the first row of the weight matrix is quantized, and then dequantized. This lets us calculate the quantization error (q), which is weighted using the previously calculated inverse Hessian entry (h_1). Essentially, a weighted quantization error is created based on the importance of the weight.
OBQ doesn’t let the quantization error sit in isolation on a single weight; instead, it redistributes the weighted quantization error across the other weights in the row. This helps maintain the overall functionality and output of the network. For example, if we quantize the first weight (w1) in the first row, we multiply the weighted quantization error by the inverse Hessian entries that link w1 to the other two weights in the row, and subtract the result from those weights. That is, we adjust the other weights to compensate for the effect of quantizing w1. We repeat this process, redistributing the weighted quantization error, until all values are quantized.

3621
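The three steps described above can be sketched for a single row as follows. This is a minimal NumPy illustration, not the paper's implementation: the names `quant` and `obq_row`, the uniform grid with step 0.25, and the toy data are assumptions made for the example.

```python
import numpy as np

def quant(x, step=0.25):
    """Round to the nearest point of a uniform grid (hypothetical grid step)."""
    return np.round(x / step) * step

def obq_row(w, H, step=0.25):
    """Greedy OBQ-style quantization of one weight row with Hessian H."""
    w = w.astype(float).copy()
    H_inv = np.linalg.inv(H)
    free = np.ones(len(w), dtype=bool)           # weights not yet quantized
    for _ in range(len(w)):
        # Pick the free weight whose quantization costs the least.
        cost = (quant(w, step) - w) ** 2 / np.diag(H_inv)
        cost[~free] = np.inf
        q = int(np.argmin(cost))
        target = quant(w[q], step)
        err = (w[q] - target) / H_inv[q, q]      # weighted quantization error
        # Spread the weighted error over the weights that are still free.
        w -= err * H_inv[:, q] * free
        w[q] = target                            # exact grid value, no float residue
        free[q] = False
        # Remove index q from the inverse Hessian (one elimination step).
        H_inv -= np.outer(H_inv[:, q], H_inv[q, :]) / H_inv[q, q]
        H_inv[q, :] = 0.0
        H_inv[:, q] = 0.0
        H_inv[q, q] = 1.0                        # placeholder so the diagonal stays non-zero
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 64))
H = 2 * X @ X.T
w = rng.normal(size=6)
w_q = obq_row(w, H)
print(w_q)
```

With a diagonal Hessian there is nothing to compensate, and the sketch falls back to plain round-to-nearest. The correlation between weights is what gives the compensation step its effect.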

The algorithm presented in the paper is as follows.

The first column from the left is the dense matrix W to be pruned.
The second column from the left indicates that W will be split by row, ready to be operated on row by row.
The third and fourth columns from the left represent the row-by-row pruning performed according to the algorithm.
- Pruning Traces represent the pruning results after each iteration step in each row.
- Loss Changes represents the amount of change in loss after each iteration step; the darker the color, the greater the loss.
The fifth column from the left indicates the weights that need to be pruned based on the Loss Changes obtained in the previous step. In the figure, 6 weights need to be reset to 0, that is, the white box represents the weight at that position that needs to be pruned.
Less compute: load stored trace elements, meaning that if pruning traces are stored, the pruning process does not need to be rerun.
Less memory: resolve not pruned weights. This means that if there is not enough space to store the pruning traces, the pruning process can be rerun line by line or the parameters can be updated directly to get the pruned result for that line.

3622

In summary, the OBQ method uses a greedy strategy to find the optimal quantization parameters by iteratively updating the weights, minimizing the error introduced by quantization while controlling the model size. This approach allows the weights of the neural network to be quantized to the desired accuracy in the post-training stage without retraining the entire model. This strategy is particularly useful for large language models such as GPT, as they often require reducing model size while maintaining high accuracy.

OBQ works well, but it is too slow: for a weight matrix W of size (d_{row} \times d_{col}), its complexity is (O(d_{row} \cdot d_{col}^3)). OBQ quantizes a ResNet-50 in about an hour, but on large models (such as GPT-3) it might take several days.

2.1.2 GPTQ Scheme

Features

The features or main improvements of GPTQ are as follows:

Layer-by-layer quantization. GPTQ performs asymmetric quantization layer by layer, processing each layer independently before moving on to the next.
Quantization is performed in a fixed order. The authors of GPTQ believed that for LLMs, selecting parameters to be quantized each time is unnecessary. While a greedy algorithm to minimize error during quantization performs well, it doesn’t offer a significant improvement over a fixed-order approach, especially on large models where a fixed-order approach might be better. Therefore, GPTQ changed the quantization weight selection method from the greedy strategy of OBQ to a fixed index-based selection. This approach results in higher computational efficiency on large models because it reduces computational complexity.
Lazy Batch-Updates. If we update the weights every time we quantize a parameter, more time is spent accessing memory, which doesn’t fully utilize the GPU’s computing power. Therefore, GPTQ chooses to update in batches, updating the global matrix only after each batch is completed. This delays the update of some parameters, alleviating I/O pressure.
Cholesky decomposition. GPTQ found that weighted quantization requires frequent updates to the inverse of the entire Hessian matrix, which introduces numerical errors. Therefore, Cholesky decomposition is used to find the inverse of the Hessian matrix, enhancing numerical stability while eliminating the need for updating the Hessian matrix, further reducing computational load. Combined with Lazy Batch-Updates, the weight updates can be restricted to a block, and then all columns after this block are updated in a single update. This saves memory bandwidth, thus significantly improving algorithm efficiency.
GPTQ is the first to apply 4-bit/3-bit weight quantization to a model with 176B parameters, and it also provides the corresponding kernels.

Dropping the greedy order

OBQ employs a greedy strategy, independently quantizing each row of the weight matrix W. The quantization order within each row differs, and each row maintains its own optimal quantization order: selecting the quantization order based on minimizing (\Delta E). That is, each weight is selected in a “greedy order,” always prioritizing the weight that has the least impact on the overall error. While this strategy is theoretically superior, it is extremely inefficient in large models. The reason is that quantizing each weight requires updating the inverse Hessian matrix, and each row needs to be executed independently, resulting in extremely high total time complexity.

GPTQ found that for layers with a large number of parameters, the difference between strictly chosen quantization order and arbitrary quantization order is not significant. Even without a greedy strategy, as long as the weights in each row are quantized in any fixed order, the final error is almost identical to that of the greedy order. The GPTQ authors analyzed that a few weights incur significant quantization losses, which can be offset by the remaining unquantized weights, thus the quantization order may be unimportant. Therefore, GPTQ proposes using the same quantization order for all rows. Although theoretically not a local optimum, the overall error change is very small, especially with a large number of parameters, where it has almost no impact.

3623

The benefits of doing this are as follows:

For each row of the weight matrix, the initial Hessian matrix depends only on the input matrix X, so it is the same for all rows: (H = 2XX^T).
In OBQ, as the pruning process progresses, the optimization order of the weights for each row may differ, thus causing different changes to the Hessian matrix associated with each row. Specifically, each time a parameter is quantized, (H^{-1}) for each row will be different in each step and needs to be calculated separately.
(H^{-1}) of each row is the same in each step. All rows share a single inverse Hessian matrix, which only needs to be calculated once, saving the time of constructing H(i) each time. Furthermore, considering that all rows are quantized in the same order, all rows in the same column can be quantized in a single iteration. This improvement allows for parallel matrix computation of each row of the matrix.
Because large models are extremely redundant, the error introduced by a single weight has a negligible impact, and suboptimal choices caused by a uniform order will not accumulate into significant performance loss. The Hessian-based error compensation mechanism used in GPTQ automatically adjusts the unquantized weights at each step, further suppressing error propagation.

This improvement enables parallel matrix computation (i.e., per-channel quantization) for each row of the parameter matrix. For large model scenarios, this results in quantization speed being an order of magnitude faster.

Lazy Batch-Updates

Because the Hessian matrix of the weights in a large model is very large, OBQ needs to update the inverse of the entire Hessian matrix and the entire weight matrix in each iteration. This process involves a large number of memory accesses but relatively little computation per element. The algorithm’s compute-to-memory-access ratio is therefore low, and the performance bottleneck is actually the GPU’s memory bandwidth. Specifically, the remaining weights must be updated once for each quantized column: for a weight matrix of size (d_{row} \times d_{col}), that is (d_{col}) updates, for a total memory traffic on the order of (d_{row} \times d_{col}^2). When the dimensions are large, almost the entire runtime can be consumed by memory access. For example, each time one parameter of a weight matrix is quantized, all other unquantized parameters need to be updated according to the formula.

The authors observed that the final rounding decision for a column depends only on the updates from columns quantized before it, as detailed below:

The quantization of the current column i is affected only by the quantization of the preceding columns (later columns are updated after earlier columns are quantized).
The quantization of the current column i does not affect the previously quantized columns (their parameters are fixed once quantization is completed).
The quantization of the current column i updates the parameters of subsequent columns, but those updates do not feed back into the quantization of column i. In other words, the quantization of the current column is not affected by columns that come after it.

As shown in the diagram below, since parameter quantization is performed column by column sequentially, the quantization result of the parameter in column i is affected by the quantization of the previous i-1 columns, but the quantization result of column i does not affect the quantization of the preceding columns. Therefore, we do not need to update the parameter in column i every time we quantize the preceding columns. Instead, we can first record the update amount brought to column i by quantizing the preceding columns, and then update the parameter all at once when quantizing column i. This reduces the number of I/O operations.

3624

Therefore, the authors proposed a delayed batch processing method: first delaying the update of a portion of the parameters, and then processing multiple columns at once, thereby alleviating bandwidth pressure and significantly improving computation speed. Details are as follows:

The weight matrix is grouped into groups/blocks of B = 128 columns. After quantizing the current column, the remaining weights in the group are adjusted to compensate for the error introduced by quantization. This restricts the updates to the corresponding BxB blocks of these columns.
Once all parameters of a group have been quantized, we use the following multi-weight version of the formula to perform a global update on (H^{-1}) and W. That is, after processing all columns of the current group, we update the remaining column groups of the matrix (the later blocks of 128 columns) to further correct errors. This complete process is called “Lazy Batch Update”.

This method does not reduce the total theoretical computation, but avoids the need for frequent updates to the Hessian matrix, effectively alleviating the memory access bottleneck and thus accelerating the entire quantization process.

3625

3626
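For reference, the multi-weight update has the following form in the GPTQ paper. It is reconstructed here from the paper rather than from the figures above, so the notation should be checked against the original: Q denotes the set of column indices in the current block, F the set of remaining full-precision weights, and the subscript (-Q) means that the rows and columns in Q are removed.

```latex
\[
\delta_F = -\bigl(w_Q - \operatorname{quant}(w_Q)\bigr)\,
           \bigl([H_F^{-1}]_{QQ}\bigr)^{-1}\,(H_F^{-1})_{:,Q}
\]
\[
H_{-Q}^{-1} = \Bigl(H^{-1} - H^{-1}_{:,Q}\,
              \bigl([H^{-1}]_{QQ}\bigr)^{-1}\,H^{-1}_{Q,:}\Bigr)_{-Q}
\]
```

When Q contains a single index q, these reduce to the single-weight OBQ compensation and the one-step elimination of row and column q from (H^{-1}).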

Cholesky (decomposition)

GPTQ also applies Cholesky decomposition to address the numerical stability problem in calculating the inverse matrix (H^{-1}). During their experiments, the GPTQ authors noticed that repeatedly applying the following formula to large-scale parameter matrices accumulates errors, leading to inaccurate numerical calculations. Specifically, the matrix can become indefinite, causing the algorithm to update the remaining weights in the wrong direction, resulting in poor quantization performance.

3627

To address this issue, for small models, adding a small constant term (\lambda) to the diagonal elements of H is sufficient. However, for large models, a more stable and general approach is needed. To solve this problem, the authors use Cholesky decomposition (a method of matrix decomposition) to find the inverse of the Hessian matrix, pre-calculating all necessary information and accelerating the process with an optimized Cholesky kernel.
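Why a single Cholesky factorization can replace the repeated updates of (H^{-1}) can be checked numerically. The sketch below uses NumPy and toy sizes; the damping value of 1% of the mean diagonal is the setting reported in the GPTQ paper and is used here only as an example. It verifies that row q of the upper Cholesky factor of (H^{-1}) equals, up to a scale factor, the row of the inverse Hessian that the algorithm needs when it quantizes column q.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
X = rng.normal(size=(d, 64))
H = 2 * X @ X.T
# Dampening: add lambda to the diagonal (1% of the mean diagonal, an example setting).
lam = 0.01 * np.mean(np.diag(H))
H = H + lam * np.eye(d)

H_inv = np.linalg.inv(H)
U = np.linalg.cholesky(H_inv).T      # upper-triangular factor, H_inv = U.T @ U

max_gap = 0.0
for q in range(d):
    # Inverse Hessian over the still-unquantized set F = {q, ..., d-1}.
    H_F_inv = np.linalg.inv(H[q:, q:])
    needed_row = H_F_inv[0, :] / np.sqrt(H_F_inv[0, 0])
    max_gap = max(max_gap, float(np.max(np.abs(needed_row - U[q, q:]))))

print(max_gap)
```

All rows that will ever be needed are therefore available up front from one numerically stable factorization, and (H^{-1}) never has to be modified during quantization.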

Overall Algorithm

GPTQ quantizes the weights W based on the input Hessian inverse matrix (H^{-1}) and other parameters, and its core idea is to reduce the error by iteratively quantizing the weights. The main steps of the GPTQ algorithm are as follows.

The quantization result Q is initialized as a zero matrix, and the error matrix E is initialized as a zero matrix.
Perform Cholesky decomposition on (H^{-1})
Iterative processing of the quantization of the weight columns (quantizing the weight matrix column by column, and adjusting the unquantized parts at each step to compensate for the quantization error mentioned above):
- Quantize each column and update the corresponding column in the Q matrix.
- Calculate the quantization error and update the error matrix E.
- Update the weights in the weight matrix W to reduce the error.
- Repeat the above steps until all weight columns have been quantized.
Finally, the quantized weight matrix Q is returned.
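The steps above can be sketched in NumPy as follows. This is a simplified illustration and not the authors' kernel: the names `quant` and `gptq_sketch`, the uniform grid with step 0.25, and the small block size are assumptions, and details such as grouping, asymmetric grids, and GPU execution are left out.

```python
import numpy as np

def quant(x, step=0.25):
    """Round to the nearest point of a uniform grid (hypothetical grid step)."""
    return np.round(x / step) * step

def gptq_sketch(W, H, block_size=4, step=0.25, damp=0.01):
    """Column-by-column quantization with error compensation and lazy block updates."""
    W = W.astype(float).copy()
    d_col = W.shape[1]
    H = H + damp * np.mean(np.diag(H)) * np.eye(d_col)    # dampening
    U = np.linalg.cholesky(np.linalg.inv(H)).T            # upper factor of H^{-1}
    Q = np.zeros_like(W)

    for i in range(0, d_col, block_size):
        end = min(i + block_size, d_col)
        E = np.zeros((W.shape[0], end - i))                # scaled errors of this block
        for j in range(i, end):
            Q[:, j] = quant(W[:, j], step)                 # one column, all rows at once
            E[:, j - i] = (W[:, j] - Q[:, j]) / U[j, j]
            # Local update: only columns inside the current block.
            W[:, j:end] -= np.outer(E[:, j - i], U[j, j:end])
        # Lazy global update: all later columns, once per block.
        W[:, end:] -= E @ U[i:end, end:]
    return Q

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))
X = rng.normal(size=(8, 64))
H = 2 * X @ X.T
Q = gptq_sketch(W, H)
print(np.sum((W @ X - Q @ X) ** 2))
```

The block size changes only when the updates are applied, not what they are, so running the sketch with `block_size=1` and with `block_size=8` gives the same quantized matrix.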

To reduce computational complexity, GPTQ employs a column-by-column optimization approach. The columns of the weight matrix W are denoted (w_i), and each column is quantized while taking into account the accumulated error introduced by the quantization of previous columns. The main advantages of column-by-column optimization are:

Reduce computational complexity: Decompose the high-dimensional matrix optimization problem into multiple low-dimensional vector optimization problems.
Error accumulation is considered: In each update step, the error introduced by the previous quantization is taken into account, which ensures that the overall error is minimized.

Here are a few brief points:

The algorithm here processes each row of weights in parallel.
(H^{-1}) on the left side of line 3 of the algorithm is actually (L^T) in (H^{-1} = LL^T).
The numerator in the fifth-to-last row of the algorithm, squared and divided by 2, represents the incremental loss caused by this iteration.
The fourth-to-last line of the algorithm actually updates the quantized weight term itself, after which its weight is 0. The update terms for the weights at the beginning of the column containing the quantized weight are actually 0, so this can be written consistently.
The second-to-last line of the algorithm indicates that all weights after the block are updated at once. In effect, it is the fourth-to-last line with the column range (j:(i+B)) extended to ((i+B):), that is, to all columns after the block.

3628

While GPTQ improves quantization performance by minimizing quantization error on calibration data, it may overfit the calibration set, leading to degraded performance on out-of-domain data.

2.2 AWQ

The paper “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration” is a follow-up to SmoothQuant. The authors found that weights are not equally important for model performance: a very small fraction of parameters (0.1%-1%) can dominate the performance loss during quantization, and keeping these salient weights unquantized significantly reduces quantization error. Which weights are worth protecting? The authors suggest looking at the activation distribution rather than the weight distribution. Weights that see large activations during inference matter more because they process the more important features. In other words, activation magnitudes are used to identify important weights. Based on this observation, AWQ employs an activation-aware method: a data-driven approach that uses per-channel scaling and automatically searches for the optimal scaling factor, protecting the more “important” weights and thereby minimizing quantization error while still quantizing all weights.

2.2.1 Motivation

The authors of AWQ made the following findings:

In an LLM, weights are not all equally important. A small subset of weights is more critical to LLM performance than the rest. The authors argue that by keeping these salient weights unquantized (at FP16) and using low-bit quantization for the remaining weights, the performance degradation caused by quantization can be largely avoided without any training.
The weights corresponding to large activation values are the important ones, so salient weights can be selected based on activation magnitude. Retaining only the weight channels that correspond to the largest activations (0.1%-1%) protects the high-influence weights from degradation, preserves key knowledge in the large language model, and significantly improves quantization performance.

2.2.2 Scheme

Select weights

A common way to assess the importance of weights is to look at their magnitude or L2 norm.

However, the AWQ authors found that retaining weights with larger norms offered only a limited improvement in quantization performance, showing only a slight improvement compared to random selection. In contrast, selecting weights based on activation magnitude significantly improved the model’s quantization performance. The authors hypothesize that input features with larger magnitudes are generally more important, and retaining their corresponding weights as FP16 better protects these features, thereby improving model performance. Specifically, the authors averaged the absolute values of each column of activation values, then considered the channel corresponding to the column with the largest average as the salient channel, preserving FP16 precision.

However, a problem arose during implementation: while retaining 0.1% of the weights as FP16 can improve quantization performance without significantly increasing the model size, this mixed-precision data type presents implementation challenges (due to hardware inefficiency). Therefore, we need to devise a method that protects important weights without actually retaining them as FP16. The specific solution is illustrated in the figure below, where RTN stands for vanilla round-to-nearest baseline.

3629

Activation-aware scaling

The core idea of AWQ is to protect salient weights through activation-aware scaling. That is, it identifies the most important weights in each layer and reduces the quantization error of these salient weights by scaling them per-channel. These weights are crucial for maintaining model performance. By focusing on the weights corresponding to highly activated features, AWQ minimizes quantization errors that could lead to a significant decrease in accuracy. See the figure below for details.

3630

Derivation

A direct method to protect outlier weight channels is to multiply the channels by a scaling ratio > 1, thus enabling precise quantization. We will next analyze the weight-only quantization error before and after applying the scaling ratio. As shown in the figure below, the solution is basically feasible.

3631
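The analysis can be written out as follows. This is a sketch that follows the derivation in the AWQ paper; N is the number of quantization bits, (\Delta) the quantization step before scaling, (\Delta') the step after scaling, and RoundErr the rounding error, which is roughly uniform in [0, 0.5] and does not depend on s.

```latex
\[
Q(w) = \Delta \cdot \operatorname{Round}\!\left(\frac{w}{\Delta}\right),
\qquad \Delta = \frac{\max(|w|)}{2^{N-1}}
\]
\[
Q(w \cdot s)\cdot\frac{x}{s}
  = \Delta' \cdot \operatorname{Round}\!\left(\frac{w s}{\Delta'}\right)\cdot x \cdot \frac{1}{s}
\]
\[
\operatorname{Err}\bigl(Q(w)\,x\bigr)
  = \Delta \cdot \operatorname{RoundErr}\!\left(\frac{w}{\Delta}\right)\cdot x,
\qquad
\operatorname{Err}\!\left(Q(w \cdot s)\,\frac{x}{s}\right)
  = \Delta' \cdot \operatorname{RoundErr}\!\left(\frac{w s}{\Delta'}\right)\cdot x \cdot \frac{1}{s}
\]
\[
\frac{\operatorname{Err}\bigl(Q(w \cdot s)\,x/s\bigr)}{\operatorname{Err}\bigl(Q(w)\,x\bigr)}
  \approx \frac{\Delta'}{\Delta}\cdot\frac{1}{s}
\]
```

Scaling a single channel rarely changes the maximum of the group, so (\Delta' \approx \Delta), and with s > 1 the error on the salient weight shrinks by roughly a factor of 1/s.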

Problem

To better protect important weights, a larger scaling ratio is desirable. However, if we use a very large scaling ratio, non-salient channels will be forced to use a smaller dynamic range, which may harm the overall accuracy of the model. That is, as s increases further, the approximation (\Delta’ \approx \Delta) no longer holds, and the relative error of non-salient channels grows as (\Delta’) increases (the error of non-salient channels is amplified), reducing quantization accuracy and making the method of protecting the 1% salient weights less effective. Therefore, when protecting salient channels, it is also necessary to consider how to reduce the error of non-salient channels.

Optimal scaling factor search

To simultaneously consider both Salient Weight and Non-salient Weight, AWQ proposes a method for automatically searching for the optimal scaling ratio (for each input channel) (finding a suitable scaling factor s) that reduces the quantization loss of significant weights without increasing the quantization loss of other weights, thus minimizing the output difference after quantization of a certain layer.

The diagram below illustrates the approach. Formally, the goal is to optimize the objective function labeled 1 in the diagram. However, since the quantization function is non-differentiable, backpropagation cannot be used to directly optimize the problem. Some optimization techniques, relying on approximate gradients, still suffer from convergence instability. To make the entire process more stable, the AWQ authors defined the search space for optimal scaling by analyzing the factors influencing the choice of scaling factor.

As mentioned earlier, the saliency of the weight channels is actually determined by the scale of the activations (hence the term activation-awareness). The authors therefore use a very simple search space (as shown in Figure 2 below). The method automatically searches for an optimal scaling factor (for each input channel) and further applies weight clipping to minimize the MSE of quantization, that is, to minimize the difference in a layer’s output before and after quantization.

3632

How is the hyperparameter (\alpha) determined? The authors suggest finding the optimal (\alpha) by performing a fast grid search over the interval [0,1] (0 represents no scaling; 1 represents the most aggressive scaling in the search space). Reading the source code reveals that this method essentially takes 20 evenly spaced values over the interval (0, 0.05, 0.10, 0.15, …) and calculates the MSE loss for each of them. The value with the smallest loss is the optimal (\alpha).
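The grid search can be sketched as follows. This is a simplified NumPy illustration and not the AWQ source code: `rtn_quant` and `awq_search` are hypothetical names, the quantizer is a plain symmetric per-row round-to-nearest, and the scale normalization, weight grouping, and clipping used in practice are omitted.

```python
import numpy as np

def rtn_quant(W, n_bits=4):
    """Symmetric per-output-row round-to-nearest quantization (simplified stand-in)."""
    q_max = 2 ** (n_bits - 1) - 1
    delta = np.max(np.abs(W), axis=1, keepdims=True) / q_max
    return np.clip(np.round(W / delta), -q_max - 1, q_max) * delta

def awq_search(W, X, n_grid=20, n_bits=4):
    """Grid-search alpha in [0, 1) and return (loss, alpha, scale) of the best point."""
    s_x = np.mean(np.abs(X), axis=1)           # average activation magnitude per input channel
    reference = W @ X
    best = (np.inf, None, None)
    for k in range(n_grid):
        alpha = k / n_grid                     # 0, 0.05, 0.10, ...
        s = s_x ** alpha
        W_q = rtn_quant(W * s, n_bits)         # scale weight columns up, then quantize
        out = W_q @ (X / s[:, None])           # scale activations down to compensate
        loss = np.mean((reference - out) ** 2)
        if loss < best[0]:
            best = (loss, alpha, s)
    return best

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X = rng.normal(size=(16, 64))
X[3] *= 30.0                                   # one input channel with large activations
best_loss, best_alpha, best_scale = awq_search(W, X)
baseline = np.mean((W @ X - rtn_quant(W) @ X) ** 2)
print(best_alpha, best_loss, baseline)
```

Since (\alpha = 0) (no scaling) is part of the grid, the selected point can never be worse than plain round-to-nearest on the calibration data. How much it improves depends on how concentrated the activation magnitudes are.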

Application

In AWQ, different groups of linear layers need to be processed separately according to the module type, and the weight scaling factors of different modules need to be calculated separately. For example, an LLM can be divided into the following modules, with the scales computed separately for each.

The query, key, and value projection layers of self-attention
The output projection layer of self-attention
The first fully connected layer of MLP
The second fully connected layer of MLP
The linear layers of gate_proj and up_proj in MLP
MLP’s down_proj linear layer

Relationship with SmoothQuant

We compare the two papers here because both come from the well-known MIT HAN Lab team.

Similarities

Both are Post-Training Quantization (PTQ) methods.
Both methods scale some weights (and their corresponding input activations), that is, the weight is multiplied by a scaling factor and the corresponding input activation is divided by this scaling factor.
Both require a calibration set to determine the value of the scaling factor (no additional training required).

Differences

The quantization precision differs: SmoothQuant has a quantization precision of W8A8; AWQ has a quantization precision of W4A16.
The methods for determining the scaling factor differ: SmoothQuant computes it in closed form as (s_j = \max(|X_j|)^\alpha / \max(|W_j|)^{1-\alpha}); AWQ finds it by search, (s = s_X^{\alpha^*}, \alpha^* = \arg\min_\alpha L(s_X^\alpha)), where (s_X) is the average magnitude of the activations.
The targets of the scaling differ: SmoothQuant treats all weights alike, scaling each one (and its corresponding input activation); AWQ aims to protect only a small number (about 0.1%) of salient weights (and their corresponding input activations).
The authors of AWQ also developed TinyChat, a high-performance and flexible 4-bit on-device LLM/VLM inference framework.

2.3 LLM-QAT

The paper “LLM-QAT: Data-Free Quantization Aware Training for Large Language Models” is a pioneering work in quantization-aware training for large models. Some PTQ methods for large models have been shown to perform well down to 8-bit precision. However, the authors found that these methods break down at lower bit widths. This paper therefore investigates QAT for LLMs to push quantization further. The authors also propose a data-free distillation method that trains on text generated by the pre-trained model itself. This better preserves the original output distribution and allows any generative model to be quantized independently of its training data, similar to post-training quantization methods. In addition to quantizing weights and activations, the paper also quantizes the key-value (KV) cache, which is crucial for improving throughput and supporting long sequence dependencies at current model sizes.

2.3.1 Motivation

Compared to post-training quantization, QAT typically delivers better accuracy because the model is trained with quantization in the loop and learns to compensate for the resulting accuracy degradation. Furthermore, it allows the model to continue training or fine-tuning, which is crucial for large language models. However, using QAT to quantize LLMs faces challenges in several key areas:

Training large models is technically difficult and requires significant computational resources. Furthermore, QAT requires the introduction of simulated quantization, which further increases memory and computational costs, and introduces gradient mismatch issues, thereby increasing training costs and impacting scaling laws.
QAT requires training data, but for large models, it is difficult to obtain such training data. The sheer size and diversity of pre-training data is itself an obstacle, and data preprocessing is also very difficult.
LLM excels in zero-shot generation, and maintaining this ability after quantization is crucial. Therefore, choosing a suitable fine-tuning dataset is important. If the QAT data domain is too narrow or significantly different from the original pre-training data distribution, it may harm the model’s performance.
LLMs exhibit unique weight and activation distributions that are characterized by a large number of outliers. As a result, the clipping-based quantization methods that work best for small models do not work out of the box for LLMs.
Furthermore, it remains unclear whether quantization-aware training follows the scaling laws of the model.

2.3.2 Scheme

Data-free distillation

To address the challenges of training data, it is necessary to closely integrate the distribution of pre-trained data with a limited amount of fine-tuning data. This paper proposes a method to generate the next token data from the original pre-trained model and combines knowledge distillation to circumvent this problem. This method is called data-free knowledge distillation and is applicable to any generative model, regardless of whether the original training data is available.

As shown in Figure (a), the first token <start> is sampled at random from the vocabulary, and the pre-trained model generates the next token <out1>. The generated token is then appended to the starting token, and the extended sequence is fed back to the model to generate the next output <out2>. This iterative process is repeated until an end-of-sentence token is produced or the maximum generation length is reached. How the next token is sampled from the distribution is crucial during data generation; the authors tested three different sampling strategies.

The most straightforward approach is to always select the top-1 candidate as the next token. However, sentences generated by this strategy lack diversity and tend to repeat the same few tokens.
To address this issue, the paper uses the SoftMax output of the pre-trained model as the probability distribution from which the next token is randomly sampled. This sampling strategy produces more diverse sentences and significantly improves the accuracy of the fine-tuned student model. Note that a sampled token is not necessarily the best label for training the student model, because sampling introduces inherent noise.
Furthermore, the paper found that the initial few tokens play a crucial role in determining the prediction trend. Therefore, it is important to have higher confidence in them. During the generation process, the paper employs a hybrid sampling strategy, deterministically selecting the top-1 prediction for the first 3-5 tokens, and then randomly sampling the remaining tokens.
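The hybrid strategy above can be sketched as follows. This is an illustrative sketch only: `next_logits` stands in for a real pre-trained model, `toy_logits` is a made-up deterministic function, and the parameter names (`top1_steps`, `eos_id`) are assumptions rather than names from the paper.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(next_logits, vocab_size, max_len=16, top1_steps=3, eos_id=0, rng=None):
    """Random first token, top-1 for the first few steps, then stochastic sampling."""
    rng = rng or random.Random(0)
    tokens = [rng.randrange(vocab_size)]  # random <start> token from the vocabulary
    for step in range(max_len - 1):
        probs = softmax(next_logits(tokens))
        if step < top1_steps:
            # Deterministic: pick the most likely token for the initial steps.
            nxt = max(range(vocab_size), key=probs.__getitem__)
        else:
            # Stochastic: sample from the model's SoftMax distribution.
            nxt = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        tokens.append(nxt)
        if nxt == eos_id:
            break
    return tokens

def toy_logits(tokens):
    """Hypothetical stand-in for a pre-trained model's next-token logits."""
    return [math.sin(len(tokens) * (i + 1) + tokens[-1]) for i in range(8)]

sample = generate(toy_logits, vocab_size=8, max_len=10, top1_steps=3, eos_id=-1)
```

With `eos_id=-1` the toy example never stops early, so `sample` always has `max_len` tokens; a real tokenizer would supply its own end-of-sentence id.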

3633

Experiments show that this method preserves the output distribution of the original model better than training on a large subset of the original training set. Furthermore, the quantized model can be distilled successfully using only a small amount (100k) of sampled data, which keeps the computational cost reasonable.

Knowledge distillation

The authors use cross-entropy-based logits distillation to train a quantized student network from a full-precision pre-trained teacher network, as shown in the following formula:

3634

Where i represents the i-th sample in the current batch, and there are a total of n sentences. c represents the number of classes, which in the paper example is equal to the size of the vocabulary. T and S are the teacher network and the student network, respectively.
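As a rough illustration of this loss, the sketch below computes the cross-entropy between the teacher's and the student's SoftMax distributions, averaged over the n samples of a batch. It assumes the usual form of logits distillation; the function names are hypothetical and the logits are toy values.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def distillation_loss(teacher_logits, student_logits):
    """-(1/n) * sum_i sum_c p_c^T(X_i) * log p_c^S(X_i), with c ranging over the vocabulary."""
    n = len(teacher_logits)
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p_teacher = [math.exp(v) for v in log_softmax(t_row)]  # soft labels from the teacher
        log_p_student = log_softmax(s_row)
        total -= sum(p * lq for p, lq in zip(p_teacher, log_p_student))
    return total / n

teacher = [[2.0, 0.5, -1.0], [0.0, 1.0, 0.0]]
student = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
loss_matched = distillation_loss(teacher, teacher)
loss_mismatched = distillation_loss(teacher, student)
```

The loss is smallest when the student reproduces the teacher's distribution exactly, in which case it equals the teacher's entropy.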

Quantization function

3635

The image above is an example of quantizing a Transformer model. Based on findings from LLM.int8() and SmoothQuant, significant outliers exist in both weights and activations in LLMs. These outliers significantly impact the quantization process because they increase the quantization step size and thereby reduce the precision of the intermediate values.

However, clipping these outliers during quantization has been found to be detrimental to LLM performance. In the initial stages of training, any clipping-based method results in abnormally high perplexity scores (i.e., > 10000), and the resulting loss of information proves difficult to recover through fine-tuning. Therefore, this paper chooses to retain these outliers. Furthermore, the paper finds that in models with gated linear units (GLUs), activations and weights are mostly symmetrically distributed. Based on this analysis and on empirical observations, symmetric MinMax quantization is chosen for both weights and activations, as shown in the following formula:

3636

Here, XQ represents the quantized weights or activations, and XR represents the real-valued (full-precision) weights or activations. To ensure effective quantization, the paper employs per-token activation quantization and per-channel weight quantization, as shown in the figure below.

3637
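A minimal sketch of symmetric MinMax quantization, assuming the common form (X_Q = \alpha \lfloor X_R / \alpha \rceil) with (\alpha = \max(|X_R|) / (2^{N-1} - 1)). The function names and the row-wise layout are assumptions made for illustration: each row of the activation matrix is treated as one token and each row of the weight matrix as one output channel, so both cases reduce to quantizing rows independently.

```python
def minmax_quantize(values, bits=8):
    """Symmetric MinMax quantization of one vector; returns (dequantized values, step alpha)."""
    qmax = 2 ** (bits - 1) - 1
    alpha = max(abs(v) for v in values) / qmax
    if alpha == 0.0:
        return list(values), 0.0  # all-zero vector: nothing to quantize
    return [alpha * round(v / alpha) for v in values], alpha

def quantize_rows(matrix, bits=8):
    """Quantize every row with its own step size (per-token or per-channel)."""
    return [minmax_quantize(row, bits)[0] for row in matrix]

# Per-token activations: one step size per token (row), so the outlier token
# in the second row does not degrade the precision of the first row.
activations = [[0.5, -1.0, 0.25], [40.0, -3.0, 2.0]]
act_q = quantize_rows(activations, bits=8)

# Per-channel weights: one step size per output channel (row).
weights = [[0.02, -0.01, 0.03], [0.5, 0.4, -0.6]]
w_q = quantize_rows(weights, bits=4)
```

No value is clipped: the largest magnitude in each row maps to the largest quantization level, which matches the paper's decision to retain outliers.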

Quantization-aware training of KV Cache

Besides weights and activations, the key-value cache (KV cache) in Large Language Models (LLMs) also consumes a significant amount of memory. However, only a few previous works have addressed KV cache quantization in LLMs, and their methods are mainly limited to post-training quantization (for example, the paper “FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU”). This paper demonstrates that a quantization-aware training method similar to the one used for activation quantization can be employed to quantize the KV cache.

During training, LLM-QAT quantizes the entire activation tensor of the key and value, as shown in the figure below. By integrating the quantization function into gradient calculation, it ensures effective training using quantized key-value pairs.

3638

Conclusion

This paper uses three post-training quantization methods as baselines: round-to-nearest (RTN), GPTQ, and SmoothQuant. It compares the zero-shot performance of different quantization methods on the BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC, and OBQA datasets. It also evaluates the few-shot performance of different quantization methods on the TriviaQA and MMLU datasets, and compares the perplexity scores of different quantization methods on the WikiText2 and C4 datasets. The perplexity evaluation verifies whether the quantized model can preserve the model’s output distribution across different samples in its training domain. Zero-shot and few-shot evaluations measure whether the model’s capabilities are preserved on downstream tasks.

For practitioners, an important question is whether to choose a small full-precision model or a larger quantized model with similar inference cost. While the exact trade-off depends on a number of factors, the authors offer some suggestions based on the results of this paper.

Large models using 8-bit quantization should outperform smaller, full-precision models, and the PTQ method is sufficient to satisfy this requirement.
A 4-bit model quantized using LLM-QAT should outperform an 8-bit model of similar size.

Therefore, the authors recommend using a 4-bit LLM-QAT model to achieve the best balance between efficiency and accuracy.

2.4 QLoRA

QLoRA (Quantized Low-Rank Adapter), from the paper “QLORA: Efficient Finetuning of Quantized LLMs,” combines quantization of the base model with LoRA (Low-Rank Adapter) fine-tuning, where an adapter is attached to each network layer. Through a series of innovations, namely 4-bit NormalFloat, double quantization, and Paged Optimizers, QLoRA makes LoRA fine-tuning much more memory-efficient while maintaining performance.

During the fine-tuning phase, QLoRA first quantizes the LLM to 4 bits. It then freezes the original model parameters and excludes them from training. LoRA adapters kept at higher precision (such as BFloat16 or Float16) are attached to each 4-bit weight matrix of the quantized model, and only these adapters are trained. During the inference phase, QLoRA dequantizes the LLM to the same precision as LoRA and then adds the LoRA updates to the LLM.

3639

2.4.1 Motivation

Since there is a quantization error before and after model quantization, a natural idea is to use LoRA to learn this quantization error. QLoRA uses a full-precision LoRA matrix to learn this quantization error during PTQ. The weights are encoded in NF4 format during quantization.

However, because the LoRA parameters add an extra forward path, QLoRA must still run the full-precision LoRA branch alongside the quantized model at inference time and sum the two outputs; the LoRA weights cannot be merged into the quantized model. This lowers computational efficiency, because high-precision LoRA parameters and low-precision quantized weights cannot be merged into low-precision values. The original intention was to accelerate inference through quantization, but in practice only the memory overhead is reduced.

3640

2.4.2 Scheme

QLoRA’s main innovations are as follows:

4-bit NormalFloat (NF4) Quantization: QLoRA introduces a new 4-bit quantization data type called NormalFloat (NF4), which is information-theoretically optimal for normally distributed weights. Weights can therefore be quantized to 4 bits to reduce memory usage while maintaining model performance.
Double Quantization: QLoRA employs double quantization, which involves quantizing ordinary parameters once and quantization constants a second time, thereby further reducing average memory usage.
Paged Optimizers: To manage memory spikes, QLoRA introduced paged optimizers that use NVIDIA Unified Memory to handle memory demands when processing small batches of data with long sequence lengths.
Forward and Backward Propagation: During forward computation, QLoRA first dequantizes the parameters of the original model into fp16 using a dequantization function, and then adds a LoRA adaptation layer. The LoRA parameters are not quantized because they require backpropagation optimization. The parameters of the original model, however, are frozen and therefore can be quantized. During parameter updates, only the gradient of the LoRA adapter weights with respect to the error needs to be calculated, not the gradient of the 4-bit weights.

4-bit NormalFloat quantization

k-bit NormalFloat (NF_k) is a data type that represents zero exactly and uses all (2^k) codes of a k-bit data type. To achieve this, it is constructed asymmetrically: the quantiles (q_i) are estimated for two ranges, (2^{k-1}) values for the negative part and (2^{k-1}+1) values for the positive part. The two sets of quantiles are then merged, and one of the two zeros that appear in both sets is removed. The resulting data type has an equal expected number of values in each quantization bin and is therefore called k-bit NormalFloat ((NF_k)). This data type is information-theoretically optimal for zero-centered normally distributed data.

The NormalFloat data type is based on Block-wise k-bit Quantization and Quantile Quantization.

Block-wise k-bit quantization

Quantization is the process of converting data from a form representing more information to a form representing less information. Typically, this involves converting a data type from a form that occupies more bits to a form that occupies fewer bits, such as converting a 32-bit floating-point number to an 8-bit integer. To ensure that the data type with fewer bits makes full use of its range, the input data is usually normalized to fit the range of the target data type. For example, an FP32 tensor is quantized into an Int8 tensor with a range of [-127, 127].

3641

The above figure shows int8 quantization and dequantization. In equation (1), the scaling factor is determined by the maximum absolute value in the parameters. The problem with this method is that if the input tensor contains large-magnitude values (i.e., outliers), the resulting scaling factor is unsuitable: most values in the tensor end up near 0 after quantization, which destroys the uniformity of the quantized distribution. Block-wise k-bit quantization uses a block-based strategy instead: the tensor is divided into several blocks, and each block has an independent quantization constant c, which confines the effect of an outlier to its own block. Another advantage of block-wise quantization is that it reduces communication between cores, achieves better parallelism, and makes full use of the multi-core capabilities of the hardware.
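The effect of an outlier on absmax quantization, and how block-wise constants limit it, can be seen in a small sketch. The function names and the block size of 4 are illustrative assumptions; real implementations operate on tensors rather than Python lists.

```python
def quantize_absmax(values, qmax=127):
    """Int8 absmax quantization: c = qmax / absmax(X), X_int8 = round(c * X)."""
    c = qmax / max(abs(v) for v in values)
    return [round(c * v) for v in values], c

def dequantize(ints, c):
    return [q / c for q in ints]

def quantize_blockwise(values, block_size):
    """Quantize each block independently, keeping one constant c per block."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [quantize_absmax(block) for block in blocks]

tensor = [0.1, -0.2, 0.3, 0.05, 100.0, 0.2, -0.1, 0.15]  # 100.0 is an outlier

# One constant for the whole tensor: the outlier forces every small value to 0.
whole_ints, whole_c = quantize_absmax(tensor)

# One constant per block of 4: the first block no longer sees the outlier.
block_results = quantize_blockwise(tensor, block_size=4)
first_block = dequantize(*block_results[0])
```

In this toy example the first four entries of `whole_ints` are all zero, while `first_block` recovers them to within about 0.001.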

Quantile Quantization

QLoRA quantizes parameters to 4 bits, so only (2^4 = 16) distinct values are available. If we use the round-to-nearest (RTN) method, under certain distributions most of the original values may be quantized to the same 4-bit number. The differences between those values are lost in the process, and the available bits are not fully utilized. For example, if the original float32 values cluster around 0, RTN might quantize all of them to zero.

To make more efficient use of the 16 available values, we can use quantile quantization. Quantiles are the cut points that divide sorted data into groups of equal size. Quantile quantization is information-theoretically optimal in the sense that each quantization bin receives an equal number of input values. For QLoRA, the reference distribution is a normal distribution with a mean of 0, rescaled into the range [-1, 1].

Since pre-trained neural network weights typically follow a zero-centered normal distribution with standard deviation (\sigma), we can transform all weights to a single fixed distribution by rescaling (\sigma) so that the distribution exactly fits the range of our data type. Once the weight range and the data type range match, we can quantize as usual. For example, we can sort all numbers in ascending order, divide them into sixteen equal-sized groups, map the smallest group to the first quantized value, the second group to the second quantized value, and so on. The original data is then spread uniformly across the quantized values; this is known as quantile quantization. Because the input tensors come from a distribution that is fixed up to a quantization constant, they all share the same quantiles. This avoids expensive per-tensor quantile estimation and its approximation errors, and makes exact quantile estimation computationally feasible.

4-bit NormalFloat quantization

Quantile quantization has a problem: it is expensive. Computing the quantiles of every batch of numbers is costly. QLoRA therefore provides a faster method, NormalFloat quantization.

The pre-trained parameters generally follow a normal distribution with a mean of 0, so they can be rescaled directly into a fixed range such as [-1, 1]. For zero-mean normal distributions with an arbitrary standard deviation (\sigma), the information-theoretically optimal data type on the range [-1, 1] is computed as follows:

Estimate the (2^k + 1) quantiles of the theoretical (N(0,1)) distribution to obtain a k-bit quantile quantization data type for normal distributions;
Normalize the values of this data type into the range [-1, 1];
Quantize the input weight tensor by rescaling it into the range [-1, 1] using absolute maximum rescaling. This step is equivalent to rescaling the standard deviation of the weight tensor to match the standard deviation of the k-bit data type.

More specifically, the values (q_i) of the data type are estimated as follows, where (Q_X(\cdot)) is the quantile function of the standard normal distribution (N(0,1)):

$q_i = \frac{1}{2}\left(Q_X\left(\frac{i}{2^k+1}\right) + Q_X\left(\frac{i+1}{2^k+1}\right)\right)$

This approach has a drawback: a symmetric k-bit data type has no exact representation of 0, so 0 might be mapped to a non-zero value and lose its special properties. To address this, QLoRA estimates the quantiles of two ranges separately, (2^{k-1}) values for the negative part [-1, 0] and (2^{k-1}+1) values for the positive part [0, 1], then merges the two sets and removes the duplicated 0.

As shown in the figure below, for 4-bit quantization, we want to find 15 quantiles that divide the area under the curve (the integral) into 16 equal parts. The midpoint between two adjacent quantiles is the value (q_i) that represents the corresponding interval after quantization.

3642
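The asymmetric construction can be sketched with the standard library's normal quantile function. This is a simplified sketch, not the reference implementation: it evaluates the quantile function at evenly spaced probabilities instead of averaging adjacent quantiles, and the `offset` of 0.9677083 (the largest probability evaluated, needed because the quantile at probability 1 is infinite) is an assumed constant, as are the function names.

```python
from statistics import NormalDist

def linspace(a, b, n):
    step = (b - a) / (n - 1)
    return [a + i * step for i in range(n)]

def normal_float_levels(k=4, offset=0.9677083):
    """Build 2^k levels in [-1, 1]: 2^(k-1) positive, one exact zero, 2^(k-1)-1 negative."""
    ppf = NormalDist().inv_cdf  # quantile function Q_X of N(0, 1)
    half = 2 ** (k - 1)
    # Positive side: half + 1 probabilities down to 0.5; the last one (quantile 0) is dropped.
    pos = [ppf(p) for p in linspace(offset, 0.5, half + 1)[:-1]]
    # Negative side: one value fewer, mirrored; its zero is dropped as well.
    neg = [-ppf(p) for p in linspace(offset, 0.5, half)[:-1]]
    levels = sorted(neg + [0.0] + pos)  # the single shared zero
    scale = max(abs(v) for v in levels)
    return [v / scale for v in levels]  # normalize into [-1, 1]

def quantize_block(block, levels):
    """Absmax-rescale a block into [-1, 1] and map each value to its nearest level index."""
    absmax = max(abs(v) for v in block) or 1.0
    idx = [min(range(len(levels)), key=lambda j: abs(v / absmax - levels[j])) for v in block]
    return idx, absmax

nf4 = normal_float_levels(k=4)
codes, absmax = quantize_block([0.0, 0.02, -0.05, 0.1], nf4)
restored = [nf4[i] * absmax for i in codes]
```

Dequantization is a table lookup followed by multiplication with the block's absmax constant, and an input of exactly 0 is restored as exactly 0.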

Double Quantization

During quantization, to reduce the impact of outliers, QLoRA uses a block-based quantization approach. Specifically, every 64 parameters share a single quantization constant (absmax, 32 bits), meaning the quantization overhead for each parameter is (32/64 = 0.5) bits. For 4-bit quantization, this extra 0.5 bits equates to 12.5% more memory usage, which is a significant overall overhead.

To further reduce this overhead, QLoRA performs double quantization ((\text{Double Quantization})), which quantizes the quantization constants themselves. That is, the 32-bit quantization constants produced by the first quantization are used as the input to the second quantization. Considering that the probability of an outlier appearing in c is generally small, QLoRA uses a block size of 256 for (\text{FP8}) quantization of the quantization constants. On average, for a block size of 64, this reduces the memory overhead per parameter from (32/64 = 0.5) bits to

$8/64 + 32/(64 * 256) = 0.127\ bit$

a reduction of 0.373 bits of memory usage per parameter. Because double quantization is used, we also need to perform two dequantizations to restore the quantized value.
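The arithmetic can be checked with a short helper. The function name and its parameters are illustrative; the numbers are the ones given in the text (block size 64, 32-bit constants, 8-bit second-level constants with block size 256).

```python
def constant_overhead_bits(block_size=64, const_bits=32, dq_block_size=None, dq_bits=8):
    """Average number of extra bits per parameter spent on quantization constants."""
    if dq_block_size is None:
        # Single quantization: one const_bits constant per block of parameters.
        return const_bits / block_size
    # Double quantization: each first-level constant is stored in dq_bits, plus one
    # const_bits second-level constant shared by dq_block_size first-level constants.
    return dq_bits / block_size + const_bits / (block_size * dq_block_size)

single = constant_overhead_bits()                   # 32/64
double = constant_overhead_bits(dq_block_size=256)  # 8/64 + 32/(64*256)
saving = single - double
relative_overhead_4bit = single / 4                 # overhead relative to 4-bit weights
```

Here `single` is 0.5 bits, `double` is about 0.127 bits, and `saving` is about 0.373 bits per parameter, matching the figures above.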

Optimizer state allocation of paged memory

Gradient checkpointing is a technique used to address excessive GPU memory usage during model training. During training, we typically need to save all activation values from the forward pass for use during backpropagation, which consumes a significant amount of GPU memory. We could instead avoid saving activations and recompute them when calculating gradients; this reduces memory usage but increases the amount of computation and slows down training.

Gradient checkpointing falls between saving all activations and discarding all of them: it saves only a portion of the activation values during the forward pass. During the backward pass, if an activation was saved, it is used directly; otherwise, it is recomputed from the nearest saved checkpoint.

Paged optimizers go one step further and prevent out-of-memory (OOM) errors during the memory spikes that occur with gradient checkpointing. QLoRA's paging essentially involves automatically evicting optimizer states to CPU RAM when GPU memory is insufficient, and then paging them back into GPU memory when they are needed in the optimizer update step. This is similar to regular memory paging, where data is transferred between RAM and disk.

2.5 FlatQuant

Current W4A4 (4-bit weights, 4-bit activations) quantized models still suffer significant accuracy loss compared to full-precision models. This makes them difficult to use in practice and prevents LLM inference deployments from exploiting INT4 Tensor Cores, which offer the highest peak compute. The authors of FlatQuant found that the flatness of the weight and activation distributions before quantization is a key factor affecting the quantization error of LLMs.

Intuitively, the flatter the distribution, the fewer outliers, and the higher the accuracy of quantization. Most existing methods use pre-quantization transformations, which reduce quantization error by performing equivalent transformations on the weights and activations before quantization to obtain a flatter distribution. Commonly used transformations include per-channel scaling and the Hadamard transformation.

However, the authors of FlatQuant discovered that these transformations were not optimal, and the transformed weights and activations could still remain steep and scattered. To address this, the authors proposed FlatQuant (Fast and Learnable Affine Transformation), which aims to flatten the distribution of weights and activations as much as possible through equivalent transformations. Specifically, it learns an optimal affine transformation for each linear layer to effectively mitigate outliers in weights and activations. The values of this set of affine transformation matrices are trained using calibration data. This results in a flat distribution of weights and activations, effectively improving quantization accuracy. Furthermore, for online transformations during inference, the FlatQuant authors implemented operator fusion to further reduce memory access overhead, ensuring that online transformations introduce only minimal inference overhead.

2.5.1 Motivation

LLMs often exhibit numerous outliers in their weights and activations, particularly in the activations, where outlier channels frequently exist, making LLMs difficult to quantize. Most current methods for weight-activation quantization of LLMs apply equivalent transformations to the weights and activations before quantization so that other channels absorb the outliers, thereby achieving a flatter distribution and reducing quantization loss. For example:

Per-channel scaling uses the equivalent transformation (Y = (X\,\mathrm{diag}(c)^{-1}) \cdot (\mathrm{diag}(c)\,W^\top)). Scaling moves outliers from the activations to the same channels of the weights, making the activation distribution flatter.
The Hadamard transformation uses (Y = XW^\top = (XH)(H^\top W^\top)), where (H^\top H = I). Applying the same Hadamard matrix to both weights and activations redistributes outliers across the other channels.
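Both transformations leave the layer output unchanged while reshaping the distribution of the activations, which a small numeric sketch can confirm. The matrices, the scaling vector `c`, and the helper names below are toy assumptions chosen for illustration.

```python
def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

X = [[0.1, 8.0, -0.2, 0.3]]                       # activations: channel 1 is an outlier
W = [[0.5, 0.1, -0.4, 0.2], [-0.3, 0.2, 0.6, 0.1]]  # weights, shape [out][in]
Y = matmul(X, transpose(W))                        # reference output X W^T

# Per-channel scaling: Y = (X diag(c)^-1) (diag(c) W^T)
c = [1.0, 8.0, 1.0, 1.0]
X_scaled = [[x / cj for x, cj in zip(row, c)] for row in X]
W_scaled = [[w * cj for w, cj in zip(row, c)] for row in W]
Y_scaled = matmul(X_scaled, transpose(W_scaled))

# Hadamard transformation: Y = (X H)(H^T W^T), with a normalized 4x4 Hadamard matrix.
H = [[0.5 * s for s in row] for row in
     [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]]
X_rot = matmul(X, H)
W_rot_t = matmul(transpose(H), transpose(W))       # H^T W^T
Y_rot = matmul(X_rot, W_rot_t)

max_before = max(abs(v) for v in X[0])
max_after_scaling = max(abs(v) for v in X_scaled[0])
max_after_hadamard = max(abs(v) for v in X_rot[0])
```

In this example scaling shrinks the activation outlier but moves it into channel 1 of the weights, whereas the Hadamard rotation spreads it over all four activation channels without changing the overall magnitude of the vector.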

However, the distributions obtained by existing equivalent transformations may still be uneven. The weight and activation value distributions can be viewed as two slopes, and the transformation is similar to moving soil with a shovel. Soil cannot be added or subtracted out of thin air, so the goal is to fill the two slopes by moving the soil from the higher part (outliers) to the lower part (non-outlier channels).

Per-channel scaling is like transferring soil from one slope to the same location on another slope, which is quite limiting. Outliers are still confined to the same channels of weights and activations, and non-outlier channels are not effectively utilized. Therefore, the transformed distributions of both weights and activations are very steep, exhibiting very obvious outlier channels.
The Hadamard transformation is equivalent to moving soil from a higher point to a lower point within each slope, but it cannot transfer soil between two slopes. Furthermore, due to the different shapes of slopes, the same Hadamard transformation (soil transfer within a slope) may not be applicable to all slopes. The Hadamard transformation applies the same transformation to all weights and activation values, but the distribution of weights and activation values differs across layers. This means that the Hadamard transformation is not the optimal solution for every layer. For example, in Figure (b) below, the weights and activation values of LLaMA-3-8B remain relatively steep after the Hadamard transformation, especially since outliers in the activation values cannot be effectively smoothed. In addition, as an orthogonal transformation, the Hadamard transformation does not change the magnitude of the vector. However, a large number of outliers in the LLM activation values cause the magnitude of the activation values to be significantly greater than that of the weights. This makes the quantization difficulty of the activation values after the orthogonal transformation significantly higher than that of the weights, making it impossible to flexibly balance the quantization difficulty of weights and activation values like per-channel scaling.

Pivot tokens with massive outliers are crucial for model performance, as quantization errors on these tokens significantly impact quantization accuracy. The figure below shows the mean squared error (MSE) of quantization across Transformer layers and input-sequence positions after applying different transformations in LLaMA-3-8B. Per-channel scaling and the Hadamard transform fail to handle pivot tokens with massive outliers well, resulting in very large quantization errors on the first token. In contrast, FlatQuant exhibits a smaller MSE, significantly reducing quantization loss on pivot tokens and suppressing the layer-by-layer propagation of quantization errors, which results in a flatter quantization error landscape.

3643

2.5.2 Scheme

The key steps of the FlatQuant solution are as follows:

Lightweight affine transformation: Smooth outliers by learning the optimal affine transformation for each linear layer.
Kronecker decomposition: Decomposes a large transformation matrix into smaller matrices, reducing storage and computational overhead.
Per-channel Scaling: Provides an independent scaling factor for each channel, increasing the flexibility of transformations.
Learnable Clipping Thresholds: Further reduce the impact of outliers by using learnable clipping thresholds.

Compared to Per-channel Scaling and the Hadamard Transform, the FlatQuant method can be seen as a more refined and intelligent soil-moving strategy. In this method, we are no longer limited to moving soil within a single slope, nor are we simply transferring soil at the same location across two slopes. FlatQuant allows for customized adjustments for each slope, meaning we can design the optimal soil-moving scheme for the unique shape and requirements of each slope. This is equivalent to learning a specific affine transformation for each layer of the model, resulting not only in a flat distribution but also adaptively balancing the quantization difficulty of weights and activation values.

Lightweight affine transformation

FlatQuant smooths outliers in weights and activations with a lightweight affine transformation, learning an optimal invertible matrix P for each linear layer. Once P has been learned, the weight-side transformation can be folded into the weights offline without incurring additional inference overhead. However, XP must be computed online, and with a full-size P this doubles the storage and computational costs of the linear layer, making it impractical. To address this issue, FlatQuant applies Kronecker decomposition to P. The specific derivation is as follows.

3644
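The saving from the Kronecker structure can be illustrated numerically. Assuming (P = P_1 \otimes P_2) and a row-major reshape of the activation vector into an (n_1 \times n_2) matrix, the online transformation XP reduces to two small matrix products. The helper names and the toy matrices are assumptions; this sketch shows the identity only, not FlatQuant's fused kernel.

```python
def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def kron(A, B):
    """Kronecker product: entry [(i*n2 + j)][(k*n2 + l)] equals A[i][k] * B[j][l]."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def apply_kron(x, P1, P2):
    """Compute x (P1 kron P2) without materializing the large matrix."""
    n1, n2 = len(P1), len(P2)
    X_tilde = [x[i * n2:(i + 1) * n2] for i in range(n1)]  # reshape to n1 x n2
    Y = matmul(matmul(transpose(P1), X_tilde), P2)          # P1^T X~ P2
    return [v for row in Y for v in row]                     # flatten back to length n1*n2

P1 = [[1.0, 0.5], [-0.5, 2.0]]                                # 2 x 2
P2 = [[1.0, 0.0, 0.2], [0.3, 1.0, 0.0], [0.0, -0.4, 1.0]]     # 3 x 3
x = [0.1, 8.0, -0.2, 0.3, 0.05, -0.7]                         # one token, hidden size 6

direct = matmul([x], kron(P1, P2))[0]  # uses the full 6 x 6 matrix
fast = apply_kron(x, P1, P2)           # uses only the 2 x 2 and 3 x 3 factors

full_params = (len(P1) * len(P2)) ** 2         # 36
kron_params = len(P1) ** 2 + len(P2) ** 2      # 13
```

For hidden size (n = n_1 n_2), storage drops from (n^2) to (n_1^2 + n_2^2) entries, which is what makes the online transformation affordable.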

Per-channel Scaling

Kronecker decomposition is essentially a rank-1 approximation of P. FlatQuant further enhances the representational power of the Kronecker decomposition by adding learnable per-channel scaling. Per-channel scaling can be folded into the preceding LN/linear layers without incurring additional inference overhead.

Learnable Clipping Thresholds

FlatQuant further applies learnable clipping to the transformed weights and activation values to better eliminate outliers.

Model Architecture

The diagram below illustrates the model architecture. FlatQuant introduces five different online transformations within a single Transformer block. The loss function used is a layer-wise MSE loss.

3645

0x03 Low-bit Quantization

Reducing the quantization bit width below 8 bits has proven to be a daunting task, as the quantization error increases with each bit removed. However, pressure on memory and processing speed has pushed researchers to keep innovating, which led to the development of low-bit quantization schemes.

The characteristics of the several schemes introduced in this section are as follows.

method	Quantization weights and activation	Features
SqueezeLLM	W3~W4/A16	First, separate out the extreme values, and then apply a non-uniform quantization method of weighted k-means clustering to the remaining weights.
SpQR	W3~W4/A16	The sensitive weights are separated using a divide-and-conquer method, and then the insensitive weights are compressed using a two-stage compression method to convert them into 3-bit storage.
BitNet	W1	BitNet uses -1 or 1 to represent a single bit of the model weight and quantizes the activation into 8 bits to perform matrix multiplication. The rest of the model maintains high precision, such as FP16.
BitNet b1.58	W1.58	The main change is to change the weights values from {-1, 1} to {-1, 0, 1}, which gives the model the ability to ignore specific features, thus improving model performance.
OneBit	W1	A novel 1-bit parameter representation method and an efficient parameter initialization method based on matrix factorization are proposed to improve the convergence speed of the QAT framework.

3.1 SqueezeLLM

SqueezeLLM is a post-training quantization framework that, according to its authors, enables lossless compression down to ultra-low precision of 3 bits and achieves better quantization quality under the same memory constraints. SqueezeLLM stores outliers in a full-precision sparse matrix and applies non-uniform quantization to the remaining weights. Choosing the non-uniform quantization levels according to weight sensitivity improves the quality of the quantized model.

3.1.1 Motivation

The SqueezeLLM authors found that the main bottleneck in generative inference is memory bandwidth, i.e., the memory wall, rather than arithmetic computation, especially for single-batch inference. Since loading weights from memory is the main bottleneck, reducing the size of the weights is the solution. For example, simply reducing the weight bit width without quantizing activations can already accelerate execution: the low-bit weights are loaded and then dequantized back to FP16, which adds some computation but reduces memory traffic. Furthermore, weights have varying sensitivities; a small number of weights are highly sensitive and require targeted treatment during quantization.

3.1.2 Scheme

SqueezeLLM’s main contributions are as follows:

Sensitivity-based non-uniform quantization. Since the weight distribution is non-uniform, non-uniform quantization can be used to reach 3-bit precision.
- Sensitivity is measured with an approximation of the Fisher information, which the paper introduces as its sensitivity metric. Low-bit non-uniform quantization is then driven by this sensitivity: k-means clustering places quantization points close to the sensitive weights, and the remaining points are placed so as to minimize the MSE, which keeps the overall quantization error small.
Dense-and-sparse decomposition for outliers. The weights are split into a dense part and a sparse part; only the dense part is quantized. An efficient sparse format stores the outliers and sensitive weights in FP16, while a dense format stores the large number of regular low-bit weights. Separate kernels were developed for the dense and the sparse matrix-vector multiplications.

Sensitivity-based non-uniform quantization

The main advantage of uniform quantization is its computational efficiency, but computation is not the primary bottleneck in LLM inference, and the weights themselves are non-uniformly distributed. Therefore, the authors propose a non-uniform quantization approach to reduce the quantization error of these sensitive weights. Specifically, the quantization-sensitive weights are determined using the second-order Hessian information of the loss, and the quantization points are placed near these sensitive weights. The detailed derivation is shown in the figure below. The core point is:

With 3 bits there are only 8 quantization levels, so their values must be chosen carefully. A common approach is k-means clustering, where the key question is how the loss function is defined. The authors use the final task loss instead of the simplified single-layer reconstruction error.
Computing the Hessian H is still too expensive, so the authors approximate it with the Fisher information matrix (FIM), which requires only first-order gradients.
Therefore, the k-means algorithm (k=8 for 3 bits) can be used to choose the values (Q(w_i)) that minimize the loss (L(W_Q)) above, finally yielding k discrete one-dimensional floating-point values.

3646
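The sensitivity-weighted clustering described above can be sketched as a one-dimensional weighted k-means. This is a simplified illustration, assuming a diagonal Fisher approximation computed as the mean of squared gradients; the random `grads` stand in for real calibration gradients, and the function names are hypothetical rather than taken from the SqueezeLLM code.

```python
import numpy as np

def sensitivity_kmeans(w, fisher, bits=3, iters=25):
    """1-D weighted k-means: centroids are Fisher-weighted means of their clusters."""
    k = 2 ** bits
    centroids = np.linspace(w.min(), w.max(), k)     # start from the uniform grid
    for _ in range(iters):
        assign = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(k):
            mask = assign == c
            if mask.any():                           # empty clusters keep their centroid
                centroids[c] = np.average(w[mask], weights=fisher[mask])
    assign = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids, assign

def weighted_error(w, fisher, centroids, assign):
    """Sensitivity-weighted quantization error: sum_i F_ii * (w_i - Q(w_i))^2."""
    return float(np.sum(fisher * (w - centroids[assign]) ** 2))

rng = np.random.default_rng(0)
w = rng.standard_normal(4096) * 0.02                 # toy weights
grads = rng.standard_normal((16, w.size))            # stand-in for calibration gradients
fisher = np.mean(grads ** 2, axis=0)                 # diagonal Fisher approximation

uniform_grid = np.linspace(w.min(), w.max(), 8)
uniform_assign = np.abs(w[:, None] - uniform_grid[None, :]).argmin(axis=1)
err_uniform = weighted_error(w, fisher, uniform_grid, uniform_assign)

centroids, assign = sensitivity_kmeans(w, fisher)
err_kmeans = weighted_error(w, fisher, centroids, assign)
print(err_uniform, err_kmeans)
```

Because the clustering starts from the uniform grid and every step can only lower the weighted objective, the resulting error is never worse than that of the uniform 3-bit grid on the same data.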

Dense-sparse quantization for outliers

The authors analyzed the weight distributions of the output projection matrix of MHA and the output (down-projection) matrix of the FFN, as shown below, and found that:

99.9% of the weights are concentrated in less than 10% of the full value range.
The remaining 0.1% of the weights are spread over the rest of the range and are treated as outliers.

This indicates that removing the outliers can shrink the range that has to be quantized by roughly a factor of 10.

Therefore, the authors adopted a simple yet effective decomposition, shown in the figure below: the weight matrix W is decomposed into a sparse matrix (S) that holds the outlier weights and a dense matrix (D) that holds the majority of the remaining weights. The former is kept in FP16, while the latter is quantized. SqueezeLLM decomposes the weight matrix into the sparse and dense matrices through an iterative process, and the compression comes from quantizing the dense matrix. Furthermore, in addition to the outliers, a small portion of sensitive weights, selected with the Fisher information metric, can also be placed in the sparse matrix.

3647

Here, (T_{min/max}) is the threshold for outliers defined based on percentiles.

Importantly, because the number of outliers is small, an efficient sparse format can be used to store outliers and sensitive weights (FP16), while a dense format can be used to store a large number of low-bit regular weight values.
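As a rough illustration of the dense-and-sparse split, the sketch below picks percentile thresholds (playing the role of T_min and T_max), moves values outside them into a sparse part, and checks that the dense part spans a much narrower range. The outlier percentage, the toy weight matrix, and the function name are assumptions for the example, not values from the paper.

```python
import numpy as np

def dense_sparse_split(W, outlier_pct=0.1):
    """Split W into D (values inside [t_min, t_max]) and S (values outside), so W = D + S."""
    t_min = np.percentile(W, outlier_pct / 2)
    t_max = np.percentile(W, 100 - outlier_pct / 2)
    is_outlier = (W < t_min) | (W > t_max)
    S = np.where(is_outlier, W, 0.0)     # kept in FP16 in a sparse format in practice
    D = np.where(is_outlier, 0.0, W)     # the part that gets low-bit quantization
    return D, S, (t_min, t_max)

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)) * 0.02
flat = W.reshape(-1)                     # view into W
idx = rng.choice(flat.size, size=30, replace=False)
flat[idx] *= 20.0                        # inject a few heavy-tailed outliers

D, S, (t_min, t_max) = dense_sparse_split(W)
full_range = W.max() - W.min()
dense_range = D.max() - D.min()
outlier_fraction = np.count_nonzero(S) / W.size
print(full_range, dense_range, outlier_fraction)
```

In a real implementation S would be stored in a compressed sparse format rather than as a dense array of mostly zeros.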

3.2 SpQR

The paper “SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression” proposes SpQR. Like AWQ, SpQR also recognizes the strong imbalance in the importance of weights to the model, where 1% of the parameters can dominate the performance loss during quantization. However, its approach to identifying and protecting sensitive weights differs from AWQ. It works by identifying and individually processing outlier weights that cause significant quantization errors, storing them with higher precision, while compressing all other weights to 3-4 bits.

3.2.1 Motivation

As the parameter counts of large models increase, it is often necessary to compress such LLMs to 3-4 bits per parameter through quantization so that they fit memory-constrained devices such as laptops and mobile phones for personalized use. However, existing techniques (such as GPTQ) that quantize parameters to 3-4 bits typically result in significant accuracy loss, especially for smaller models with 1-10B parameters, which are well-suited for edge deployments. To address this accuracy issue, the authors introduce the Sparse-Quantized Representation (SpQR), a novel compression format and quantization technique that, according to the authors, achieves near-lossless LLM compression across model scales for the first time, while maintaining a similar level of compression as previous methods.

Outliers

Early work (LLM.int8(), SmoothQuant) observed “outlier” values of very large magnitude in the inputs and outputs of large language models, which lead to higher quantization errors, and proposed various mitigation strategies. This paper analyzes the phenomenon from the perspective of weight quantization. Specifically, the authors investigated outlier structure in the weight matrix beyond the outliers tied to input features. The results show that while the input feature outliers of the current layer are correlated with the outlier weights of the hidden units of the previous layer, there is no strict correspondence. According to the authors, this paper is the first to demonstrate that, unlike input feature outliers, outliers along an output hidden dimension appear only in small segments of that dimension. This partially structured outlier pattern requires a fine-grained hybrid compression format.

To address the characteristics of outliers, the SpQR authors proposed a quantization algorithm to isolate such outliers and efficiently encode a given model in the SpQR format.

Parameter sensitivity analysis

First, it calculates the sensitivity of the parameters using a method similar to OBC (Optimal Brain Compression), and the solution is shown below.

3648

Secondly, the paper analyzes the parameter sensitivity of the weights, finding that the positions of sensitive weights in the weight matrix are not random, but rather possess a specific structure (rows, columns, attention heads, unstructured elements, etc.). The figure below shows the log-sensitivities of the weights in the last self-attention layer of LLaMA-65B. Dark blue shading indicates higher sensitivity. Through sensitivity analysis, the authors observed several patterns in the weight matrix:

Row outliers: The bottom center of the graph corresponds to a high-sensitivity region of the output features. Some of these patterns span the entire row, while others occupy only a portion of it. In the attention layer, some partial row outliers correspond to a subset of the attention head.
Column outliers: The bottom right corner of the graph shows the high sensitivity of the selected input dimension (column) across all rows.
Sensitive attention heads: Regular stripes with a width of 128 appear in the center of the top of the graph, corresponding to all the weights of an attention head. These “stripes” are horizontal in the Q & K projection matrix and vertical in the output projection matrix, but they do not appear in the V projection matrix or any MLP weights. Notably, even within sensitive heads, there are significant differences in the sensitivity of individual weights.
Rotary embedding pattern: The upper right corner of the figure shows a repetitive vertical sensitivity pattern with a period of 64 units, which is characteristic of rotary position embeddings. Layers that do not use rotary embeddings do not exhibit this pattern.
Unstructured outliers: In addition, each layer has many individual sensitive weights that do not fit any of the patterns described above. These unstructured outliers appear more frequently in the columns with the largest input index (i.e., on the right side of the image). However, this effect is difficult to see on a heatmap.

Therefore, the authors want a compression scheme that preserves all of these different outlier types.

3649

3.2.2 Scheme

To address outliers, the authors propose using a quantization algorithm to isolate such outliers (sensitive value groups and individual outliers), then storing them in higher precision; all other weights are compressed to 3-4 bits. The given model is then efficiently encoded in SpQR format.

Quantization algorithm

Previous LLM quantization algorithms treated low-sensitivity and high-sensitivity weights equally, which could lead to suboptimal quantization. Ideally, more of the storage (size budget) should be allocated to the higher-sensitivity weights. However, these weights are often unstructured anomalies (individual weights) or form groups scattered throughout the weight matrix, such as partial rows or attention heads. To capture this structure, the paper divides the quantization process into two parts: capturing small groups of sensitive weights, and capturing individual outliers.

Capturing small group weights through bilevel quantization

In the figure above, several patterns show weights that behave similarly within consecutive groups but change abruptly between groups. Examples include certain attention heads and partial-row outliers. With standard methods, such weights would in many cases be grouped together and share the same quantization statistics. To reduce such cases, SpQR uses extremely small groups for grouped quantization, typically (\beta_1 = 8-32) weights. That is, every (\beta_1) consecutive weights have their own quantization scale and zero-point.

To prevent the overhead of storing these quantization statistics from outweighing the accuracy gain, the grouped statistics themselves are quantized with the same algorithm as the weights, namely asymmetric (min-max) quantization. In other words, the statistics of (\beta_2 = 16) consecutive groups are gathered and quantized together with the same number of bits, so that groups with atypical quantization parameters end up using more of the “quantization budget”.
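The two-level scheme can be sketched in NumPy as follows. This is a simplified illustration rather than the SpQR implementation: it assumes 3-bit weights, 3-bit first-level statistics, 16-bit second-level statistics, and group sizes of 16, and it stores the group minimum in place of an integer zero-point. The `average_bits` helper gives a rough per-weight storage estimate under exactly those assumptions and ignores outliers.

```python
import numpy as np

def minmax_params(x, bits):
    """Per-row asymmetric min-max scale and offset (the offset plays the zero-point role)."""
    levels = 2 ** bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels
    return scale, lo

def fake_quantize(x, scale, lo, bits):
    """Quantize to integers in [0, 2^bits - 1] and map straight back to floats."""
    levels = 2 ** bits - 1
    q = np.clip(np.round((x - lo) / scale), 0, levels)
    return q * scale + lo

def bilevel_quantize(w, beta1=16, beta2=16, w_bits=3, stat_bits=3):
    groups = w.reshape(-1, beta1)                      # first level: tiny weight groups
    scale, lo = minmax_params(groups, w_bits)

    def compress(stat):                                # second level: quantize the statistics
        blocks = stat.reshape(-1, beta2)
        s2, lo2 = minmax_params(blocks, stat_bits)
        return fake_quantize(blocks, s2, lo2, stat_bits).reshape(stat.shape)

    scale_hat = np.maximum(compress(scale), 1e-8)
    lo_hat = compress(lo)
    # Quantize the weights against the reconstructed (compressed) statistics.
    return fake_quantize(groups, scale_hat, lo_hat, w_bits).reshape(w.shape)

def average_bits(beta1=16, beta2=16, w_bits=3, stat_bits=3):
    first_level = 2 * stat_bits / beta1                # scale + zero per weight group
    second_level = 2 * 2 * 16 / (beta1 * beta2)        # 16-bit scale + zero for each statistic
    return w_bits + first_level + second_level

rng = np.random.default_rng(0)
w = rng.standard_normal(4096) * 0.02
w_hat = bilevel_quantize(w)
rel_mse = float(np.mean((w - w_hat) ** 2) / np.mean(w ** 2))
avg_bits = average_bits()
print(rel_mse, avg_bits)
```

With these settings the statistics add well under one bit per weight, which is what makes such small groups affordable.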

High sensitivity outliers

In fact, a small number of sensitive weights appear as groups (in the Self Attention layer) or as individual “outliers” (in the MLP). These outlier weights often account for only 1% of all weights, but can account for more than 75% of the total quantization error. Since these outliers are usually unstructured, we choose to keep them at high precision (16 bits) and encode them individually in a row arrangement similar to Compressed Sparse Rows (CSR) representation [5].

The algorithm for detecting highly sensitive outliers is shown in the figure below: the left segment describes the complete process, and the right segment contains subroutines for bilevel quantization and outlier detection. The left segment is detailed below:

The first step is the “outlier detection” step: outliers are found and kept as 16-bit weights, because quantizing them leads to disproportionately high errors. The sensitivity of each weight is computed following the results of OBS, using the layer-wise reconstruction error. Globally, the algorithm chooses a sensitivity threshold (\tau), applied to each matrix, so as to obtain the desired number of outliers in the entire model, typically about 1% of the weights.
The second step is the “actual compression” step: the majority ((\ge 99\%)) of the weights, which are not outliers, are quantized to 3~4 bits, and the remaining weights are the outliers kept at 16 bits.

After the first outlier detection step, SpQR excludes the outliers from each quantization group and quantizes only the non-outlier weights. On top of min-max quantization, the algorithm further applies GPTQ to quantize the remaining weights. Finally, the algorithm gathers the sparse outlier matrix and compresses the final quantization statistics through bilevel (hierarchical) quantization, returning the compressed weights and their metadata.

3650

Implementation and utilization of sparse quantization representation

SpQR converts a homogeneous weight matrix into several data structures of different sizes and precisions. Overall, the representation consists of (1) quantized weights, (2) first-level and second-level quantization statistics, and (3) CSR outlier indices and values. The following figure summarizes the overall structure of SpQR. The left side provides an overview of the SpQR representation of a single weight tensor, while the right side depicts all stored data types and their dimensions.

3651

Below is a description of each component.

Storing quantized groups. All non-outlier weights are encoded into a structure that contains:

An individual (b_w)-bit value for each weight.
A (b_q)-bit scale and zero point (first-level quantization) for each group of size B.
16-bit statistics (second-level quantization) used to quantize groups of (B_q) first-level scales and zero points.

Storing outliers. Since outliers are unstructured, SpQR sorts them by row and column, ensuring that outliers within the same row are contiguous in memory. For each outlier, this paper stores two scalars: a 16-bit weight value and a 16-bit column index. For each row, a 32-bit number is also stored to represent the total number of outliers in the row. The average storage cost per outlier weight is 32.03 to 32.1 bits.
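The storage layout described above can be sketched as follows. This is a minimal illustration, assuming a random outlier mask and a matrix narrow enough for 16-bit column indices; the function names are hypothetical. The per-outlier cost is 32 bits for the value and column index, plus the 32-bit row counters amortized over all outliers.

```python
import numpy as np

def store_outliers(W, mask):
    """CSR-like layout: outliers sorted by row, then column."""
    rows, cols = np.nonzero(mask)                   # row-major order: already row-sorted
    values = W[rows, cols].astype(np.float16)       # 16-bit weight value per outlier
    col_idx = cols.astype(np.uint16)                # 16-bit column index per outlier
    row_counts = np.bincount(rows, minlength=W.shape[0]).astype(np.uint32)  # 32 bits per row
    return values, col_idx, row_counts

def bits_per_outlier(values, col_idx, row_counts):
    total_bits = 8 * (values.nbytes + col_idx.nbytes + row_counts.nbytes)
    return total_bits / values.size

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 2048)).astype(np.float32)
mask = rng.random(W.shape) < 0.01                   # pretend about 1% of weights are outliers

values, col_idx, row_counts = store_outliers(W, mask)
cost = bits_per_outlier(values, col_idx, row_counts)
print(values.size, cost)
```

The more outliers a row holds, the closer the average cost gets to 32 bits, since the per-row counter is shared among them.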

3.3 BitNet

BitNet represents each model weight with a single bit: the weight matrix is quantized to 1 bit (values 1 or -1) and the activations to 8 bits. The product of the 1-bit weight matrix and the 8-bit activation matrix then replaces the matrix multiplication in nn.Linear of a typical neural network. The rest of the model stays in high precision, such as FP16, including gradients, optimizer states, and the operations inside the attention mechanism.

3.3.1 Scheme

Architecture

BitNet directly injects the quantization process into the Transformer architecture. The Transformer architecture is the foundation of most LLMs, which heavily rely on linear layers in computation. These linear layers are typically represented with higher precision (e.g., FP16), and most of the model’s weights reside in them. Therefore, BitNet modifies this by introducing BitLinear layers to replace these linear layers.

3652

The BitLinear layer works similarly to a regular linear layer, calculating the output by multiplying the weights by the activations. However, the BitLinear layer uses 1 bit to represent the model’s weights and INT8 to represent the activations. This approach significantly reduces the model’s storage and computational requirements, making it feasible to deploy large language models in resource-constrained environments. Simultaneously, through this extreme quantization method, BitNet significantly reduces energy consumption and operating costs while maintaining performance.

On the one hand, the authors want gradient updates to stay in high precision, so a high-precision latent weight matrix is kept during training (1-bit quantization is applied in the forward computation, but parameter updates are performed in high precision). On the other hand, the authors believe that the main workload of Transformer computation lies in matrix multiplication, so optimizing the nn.Linear operation matters most.

Process

The training process for BitLinear is as follows.

Initially, there is a high-precision input activation and a high-precision weight matrix.
The weights are centered to zero mean and then converted to 1 bit (1 or -1) using the sign function.
The input is normalized using LayerNorm, and then quantized to 8 bits.
The 1-bit weights are multiplied by the 8-bit input to obtain the quantized output.
The output is dequantized using the scaling values recorded during quantization to obtain a high-precision output.

3653

Weight quantization

During training, the weights are kept in high precision and are quantized to 1 bit with a basic strategy based on the sign function. Specifically, the weight distribution is shifted so that it is centered at 0; every value to the left of 0 is then assigned -1 and every value to the right is assigned 1.

3654

In addition, it also tracks a value (\beta) (mean absolute value), which will be used later for dequantization.

Activation quantization

To quantize the activation values, BitLinear uses absmax quantization to convert the activation values from FP16 to INT8 because they require higher precision in matrix multiplication. Additionally, it records the highest absolute value (\alpha) of the activation value, as it will be used later for dequantization.

3655

Dequantization

The steps above record (\alpha) (the maximum absolute value of the activations) and (\beta) (the mean absolute value of the weights). The output of the matrix multiplication is rescaled using ({\alpha, \beta}) to dequantize it back to the original precision (FP16).

3656
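Putting weight quantization, activation quantization, and dequantization together, a BitLinear forward pass can be sketched in NumPy as below. This is a simplified, inference-style illustration under stated assumptions: per-tensor scaling, a LayerNorm without learnable parameters, and explicit rounding and clipping to the int8 range. The function names are hypothetical, and `alpha` and `beta` follow the symbols used in the text above.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def binarize_weights(W):
    """Center the weights, take the sign, and record beta = mean absolute value."""
    beta = np.abs(W).mean()
    W_b = np.where(W - W.mean() >= 0, 1, -1).astype(np.int8)
    return W_b, beta

def quantize_activations(x, bits=8):
    """absmax quantization: scale by Q_b / alpha, then round and clip to int8."""
    Qb = 2 ** (bits - 1)
    alpha = np.abs(x).max()
    x_q = np.clip(np.round(x * Qb / alpha), -Qb + 1, Qb - 1).astype(np.int8)
    return x_q, alpha, Qb

def bitlinear_forward(x, W):
    W_b, beta = binarize_weights(W)
    x_q, alpha, Qb = quantize_activations(layer_norm(x))
    acc = x_q.astype(np.int32) @ W_b.astype(np.int32).T   # integer accumulation
    return acc * (alpha * beta / Qb)                       # dequantize with alpha and beta

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))          # (batch, in_features)
W = rng.standard_normal((32, 64)) * 0.05  # (out_features, in_features)

y = bitlinear_forward(x, W)
W_b, beta = binarize_weights(W)
x_q, alpha, Qb = quantize_activations(layer_norm(x))
y_ref = layer_norm(x) @ (beta * W_b).T    # same layer without activation quantization
print(np.max(np.abs(y - y_ref)))
```

The comparison against `y_ref` isolates the error introduced by the 8-bit activations; the error from binarizing the weights themselves is what training has to absorb.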

QAT

As in QAT, the BitLinear layer performs a form of “fake” quantization during training to simulate the effect of weight and activation quantization.

Backpropagation encounters the problem that quantization and sign operations cannot propagate gradients. The paper addresses this by using a straight-through estimator (STE) to directly propagate the gradients back. During inference, maintaining high-precision weights is unnecessary; a 1-bit quantized version suffices. Furthermore, the paper finds that the 1-bit model is more stable during training than the high-precision model. Since weights are either 1 or -1, a small gradient may result in minimal model change after a single update; therefore, the paper recommends using a larger learning rate.
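The straight-through estimator can be illustrated without an autograd framework. The sketch below is a toy example with a single output neuron and a squared-error loss, both chosen for illustration only. The forward pass uses the binarized weights, and the backward pass copies the gradient with respect to the binarized weights onto the latent high-precision weights, as if the sign function were the identity.

```python
import numpy as np

def binarize(w_latent):
    return np.where(w_latent >= 0, 1.0, -1.0)

def ste_step(w_latent, x, target, lr):
    """One training step: forward with sign(w), update the latent weights via STE."""
    w_b = binarize(w_latent)
    y = w_b @ x                              # forward pass uses 1-bit weights
    grad_wb = 2.0 * (y - target) * x         # exact gradient of (y - target)^2 w.r.t. w_b
    grad_latent = grad_wb                    # STE: pretend d(w_b)/d(w_latent) = 1
    return w_latent - lr * grad_latent, float((y - target) ** 2)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w_latent = rng.standard_normal(8) * 0.1      # high-precision latent weights
target = 1.5

w_new, loss = ste_step(w_latent, x, target, lr=0.05)
flipped = int(np.sum(binarize(w_new) != binarize(w_latent)))
print(loss, flipped)
```

A binarized weight changes only when its latent value crosses zero, which gives some intuition for why small updates often leave the 1-bit model unchanged.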

3.3.2 Results

BitNet demonstrates a good scaling law, meaning that it can accurately predict the training results of models with larger parameter counts using the training loss and parameter size of a small-parameter model, and the gap with the FP16 accuracy model decreases as the parameter size increases. Furthermore, due to the use of 1-bit weights, matrix multiplication, previously performed in nn.Linear, can now be reduced to addition, with multiplication mainly used during scaling, significantly reducing computational cost.

The main difference between BitNet and Post-training quantization is that BitNet requires training the model from scratch, which is indeed a disadvantage, but its advantages in the inference stage are very obvious, with extremely low memory usage and computational cost.

3.4 BitNet b1.58

BitNet b1.58 is an improved version of BitNet. The main change is that the weight values go from {-1, 1} to {-1, 0, 1}; three possible values carry (\log_2 3 \approx 1.58) bits of information, which is why the model is called 1.58-bit. The added 0 gives the model the ability to ignore specific features, which improves model performance, while matrix multiplication still requires only addition.

3.4.1 Scheme

1.58-bit quantization mainly requires two techniques:

Adding 0 to create a ternary representation {-1, 0, 1}.
Using absmean quantization for the weights.

This results in lightweight models, since each weight needs only about 1.58 bits while the computation stays efficient.

3657

The Power of 0

With weights restricted to -1, 0, and 1, the multiplications in a matrix product are no longer needed. Therefore, if the weights are quantized to 1.58 bits, only additions (and subtractions) are required.

Matrix multiplication calculates the output by multiplying a weight matrix by an input vector. This multiplication involves two actions: multiplying the input and the individual weights, and then summing them together.

BitNet b1.58 essentially avoids multiplication by using ternary weights, because each ternary weight conveys the following information:

1 - I want to add this value
0 - I don’t need this value.
-1 - I want to subtract this value.

Setting a weight to 0 lets the corresponding input be ignored, whereas a 1-bit representation can only add or subtract it. This not only significantly speeds up computation but also allows for feature filtering.

3658
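The add / skip / subtract reading of ternary weights can be checked directly. The sketch below is an illustrative, unoptimized loop with a hypothetical function name, not a real kernel: it computes a matrix-vector product using only additions and subtractions and compares it with the ordinary product.

```python
import numpy as np

def ternary_matvec(W_t, x):
    """Matrix-vector product for weights in {-1, 0, 1} without any multiplication."""
    out = np.zeros(W_t.shape[0], dtype=x.dtype)
    for i, row in enumerate(W_t):
        # +1: add the input, -1: subtract it, 0: skip it entirely.
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

rng = np.random.default_rng(0)
W_t = rng.integers(-1, 2, size=(5, 16))      # random ternary weights
x = rng.standard_normal(16)

y_addsub = ternary_matvec(W_t, x)
y_matmul = W_t @ x
print(np.allclose(y_addsub, y_matmul))
```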

Quantization

For weight quantization, BitNet b1.58 uses absmean quantization. The weight matrix is first scaled by its mean absolute value ((\alpha)), and each scaled value is then rounded to the nearest of -1, 0, and 1. Activation quantization is essentially the same as in BitNet, but instead of scaling to the range [0, (2^{b-1})], activations are scaled to [(-2^{b-1}), (2^{b-1})] using absmax quantization.

3659
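A minimal sketch of the two quantizers follows, assuming per-tensor absmean scaling for the weights and per-token absmax scaling for the activations. The small `eps` and the clipping to the int8-representable range are implementation details added for the example, and the function names are hypothetical.

```python
import numpy as np

def absmean_quantize(W, eps=1e-8):
    """Scale by the mean absolute value, then round and clip to {-1, 0, 1}."""
    alpha = np.abs(W).mean()
    W_t = np.clip(np.round(W / (alpha + eps)), -1, 1).astype(np.int8)
    return W_t, alpha

def absmax_quantize(x, bits=8):
    """Symmetric per-token absmax quantization of activations."""
    Qb = 2 ** (bits - 1)
    gamma = np.abs(x).max(axis=-1, keepdims=True)
    x_q = np.clip(np.round(x * Qb / gamma), -Qb + 1, Qb - 1).astype(np.int8)
    return x_q, gamma / Qb                   # second value is the dequantization scale

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 64)) * 0.05
x = rng.standard_normal((4, 64))

W_t, alpha = absmean_quantize(W)
x_q, scale = absmax_quantize(x)
zero_fraction = float(np.mean(W_t == 0))
print(alpha, zero_fraction)
```

Weights whose magnitude is below about half the mean absolute value round to 0, which is how the ternary model ends up ignoring some inputs.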

3.5 OneBit

The paper “OneBit: Towards Extremely Low-bit Large Language Models” introduces a 1-bit quantization-aware training (QAT) framework called OneBit, which includes a novel 1-bit parameter representation method and an efficient parameter initialization method based on matrix factorization to improve the convergence speed of the QAT framework.

3.5.1 Main Contributions

OneBit’s contributions are as follows:

A novel and efficient 1-bit model architecture for LLMs is proposed, which can improve time and space efficiency during model inference and is more stable when quantizing LLMs.
SVID (Sign-Value-Independent Decomposition) is proposed to decompose high-bit matrices into low-bit matrices, which is crucial for the initialization of 1-bit architectures. Experiments show that SVID-based initialization can improve model performance and convergence speed.

3.5.2 Challenges

The main idea behind model quantization is to compress each weight matrix W in the model from FP32 or FP16 format into a low-bit counterpart. Specifically, we typically quantize the weight matrices of linear layers in a Transformer to 8 bits, 4 bits, or even 2 bits. Most quantization studies primarily employ the round-to-nearest (RTN) method, which rounds the weights w to the nearest value in the quantization grid. This can be represented as:

$\hat{w} = Clip\left(\left\lfloor \frac{w}{s} \right\rceil + z, 0, 2^N - 1\right)$

Here, s is the quantization scale, z is the zero-point, and N is the quantization bit width. Clip(·) truncates the result to the range from 0 to (2^N-1).

As the bit width decreases, the quantization grid becomes increasingly sparse. When we quantize an LLM to 1 bit, only two numbers are available in the quantization model. Furthermore, when N equals 1, quantization based on the RTN method is essentially equivalent to setting a threshold, with the weights w converted to their corresponding integer values (\hat{w}). In this case, the parameters s and z in the equation practically lose their meaning. Therefore, when quantizing the weights to 1 bit, element-based RTN operations severely degrade the precision of the weight matrix W, resulting in a significant precision loss at extremely low bit widths and significantly increasing the loss of the LLM core operator, linear projection WX. This leads to a degraded quantization model performance.
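To see how quickly RTN degrades, the sketch below implements the formula above with a min-max choice of s and z (an assumption made for the example; other calibrations are possible) and applies it to toy Gaussian weights at several bit widths.

```python
import numpy as np

def rtn_quantize(w, n_bits):
    """Round-to-nearest: w_int = Clip(round(w / s) + z, 0, 2^N - 1)."""
    qmax = 2 ** n_bits - 1
    s = (w.max() - w.min()) / qmax           # min-max scale (assumed calibration)
    z = np.round(-w.min() / s)               # zero-point
    w_int = np.clip(np.round(w / s) + z, 0, qmax)
    return (w_int - z) * s, w_int            # dequantized weights and integer codes

rng = np.random.default_rng(0)
w = rng.standard_normal(4096) * 0.02

mse, levels = {}, {}
for n_bits in (8, 4, 2, 1):
    w_hat, w_int = rtn_quantize(w, n_bits)
    mse[n_bits] = float(np.mean((w - w_hat) ** 2))
    levels[n_bits] = int(np.unique(w_int).size)
    print(n_bits, levels[n_bits], mse[n_bits])
```

At N = 1 the integer codes can take at most two values, so the grid no longer follows the shape of the weight distribution at all.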

3.5.3 Scheme

OneBit improves on BitNet.

1-bit linear layer architecture

3660

Due to the significant precision loss associated with 1-bit weight quantization, directly converting the weight matrix of a linear layer from FP32/16 to 1-bit format with RTN is extremely difficult. BitNet explored this direction by studying the capability of pure 1-bit weight matrices and training a 1-bit model from scratch. In the W1A16 setting, BitNet’s linear layers are designed as shown on the left side of the diagram above. Here, W is the weight matrix to be quantized, of shape (m \times n), and (W_{\pm 1}) is the resulting 1-bit matrix. X is the input to the linear layer, and Y is the output. The Sign(·), Mean(·), and Abs(·) functions return the sign matrix, the mean, and the absolute value matrix, respectively. Unfortunately, while this approach reduces computational requirements, it also leads to significant performance degradation.

Inspired by BitNet, OneBit also uses the Sign(·) function to quantize the weight matrix and sets the elements of the quantization matrix to +1 or -1. Furthermore, we note that although (W_{\pm 1}) maintains the high rank of W, the missing floating-point precision still degrades model performance. Unlike previous work, OneBit introduces two FP16 format value vectors to compensate for the precision loss during quantization. The linear layer proposed by OneBit is designed as follows:

$W_{\pm 1} = Sign(W)$

$Y = [(X \odot g)\cdot W_{\pm 1}^T] \odot h$

$Z = LayerNorm(Y)$

Here, g and h are two FP16 value vectors. Note that parentheses are used in the above formula to specify the calculation order, which minimizes time and space costs. The main difference between BitNet and OneBit lies in the additional parameters g and h. Even with the introduction of these additional parameters, the benefits far outweigh the low cost. For example, when quantizing a weight matrix of shape 4096x4096, the average bit width of the quantized result is 1.0073.
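The layer formula can be sketched in NumPy as below. This is an illustration under assumptions: random values stand in for trained parameters, the LayerNorm has no learnable parameters, and the function names are hypothetical. It also checks that the bracketed evaluation order gives the same result as multiplying by the effective weight matrix, whose entry (i, j) is h_i · W±1[i, j] · g_j, without ever forming that matrix.

```python
import numpy as np

def layer_norm(y, eps=1e-5):
    mu = y.mean(axis=-1, keepdims=True)
    var = y.var(axis=-1, keepdims=True)
    return (y - mu) / np.sqrt(var + eps)

def onebit_pre_norm(X, W_sign, g, h):
    """Y = [(X ⊙ g) · W_sign^T] ⊙ h, evaluated in the order given by the brackets."""
    return ((X * g) @ W_sign.T) * h

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 6))                  # (batch, in_features)
W = rng.standard_normal((4, 6))                  # (out_features, in_features)
W_sign = np.where(W >= 0, 1.0, -1.0)             # 1-bit sign matrix
g = rng.uniform(0.5, 1.5, size=6)                # value vector on the input side
h = rng.uniform(0.5, 1.5, size=4)                # value vector on the output side

Y = onebit_pre_norm(X, W_sign, g, h)
Z = layer_norm(Y)

W_eff = W_sign * np.outer(h, g)                  # dense equivalent, for checking only
print(np.allclose(Y, X @ W_eff.T))
```

Scaling the input by g and the output by h costs only elementwise operations, while the large product involves nothing but the ±1 matrix.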

The diagram below illustrates the main idea of the OneBit method. The left side shows the original FP16 linear layer, where both the activation values X and the weight matrix W are in FP16 format. The right side shows the architecture proposed by OneBit, where only the value vectors g and h retain the FP16 format, while the weight matrix consists of (\pm 1)s. (\otimes): Hadamard product.

3661

Sign-Value-Independent Decomposition (SVID)

In SVID (Sign-Value-Independent Decomposition), each original high-bit weight matrix is decomposed into a sign matrix ((W_{\pm 1})) in INT1 format and two FP16 value vectors g and h. The value vectors provide the necessary floating-point precision for linear projection at a very low cost and help the model train easily. The sign matrix maintains the high rank of the original weight matrix with a small space cost, thus preserving high information capacity.

The specific derivation is as follows:

3662
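As a rough sketch of the idea, the code below splits a matrix into its sign matrix and a rank-1 approximation of its absolute values. Using SVD for the rank-1 factorization is an assumption of this example (it is one possible choice of matrix factorization), and the names `a` and `b` are illustrative. The two vectors then play the roles of h and g in the linear layer.

```python
import numpy as np

def svid(W):
    """W ≈ W_sign ⊙ (a b^T): sign matrix plus a rank-1 factorization of |W|."""
    W_sign = np.where(W >= 0, 1.0, -1.0)
    U, S, Vt = np.linalg.svd(np.abs(W), full_matrices=False)
    # |W| is non-negative, so its leading singular vectors can be taken non-negative.
    a = np.abs(U[:, 0]) * np.sqrt(S[0])
    b = np.abs(Vt[0]) * np.sqrt(S[0])
    return W_sign, a, b

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32)) * 0.05
X = rng.standard_normal((3, 32))

W_sign, a, b = svid(W)
W_approx = W_sign * np.outer(a, b)

err = np.linalg.norm(W - W_approx)                     # error of the decomposition
err_abs = np.linalg.norm(np.abs(W) - np.outer(a, b))   # error of the rank-1 fit to |W|
Y_layer = ((X * b) @ W_sign.T) * a                     # layer form with g = b, h = a
print(err, np.linalg.norm(W))
```

Because the sign matrix only flips signs elementwise, the error of the decomposition equals the error of the rank-1 fit to |W|, so all the approximation quality comes from how close |W| is to rank 1.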

Knowledge transfer

SVID provides a better parameter initialization for the 1-bit model. On top of that, OneBit uses quantization-aware knowledge distillation to transfer the capabilities of the original model (the teacher) to its 1-bit counterpart (the student). The elements of the matrix W and of the vectors g and h are trained in the student model. The student is guided by two signals from the full-precision teacher: a cross-entropy loss on the logits and a mean-squared-error loss on the hidden states.

3663

3664
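The combined distillation objective described above can be sketched as follows. This is a simplified illustration: the weighting factor `alpha`, the plain (unnormalized) MSE on hidden states, and the random tensors are assumptions of the example, not the paper's exact formulation.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)            # for numerical stability
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_loss(s_logits, t_logits, s_hidden, t_hidden, alpha=1.0):
    """Cross-entropy between teacher and student distributions plus MSE on hidden states."""
    p_teacher = np.exp(log_softmax(t_logits))
    ce = -(p_teacher * log_softmax(s_logits)).sum(axis=-1).mean()
    mse = np.mean((s_hidden - t_hidden) ** 2)
    return float(ce + alpha * mse)

rng = np.random.default_rng(0)
t_logits = rng.standard_normal((4, 10))              # teacher outputs (tokens, vocab)
t_hidden = rng.standard_normal((4, 8))               # teacher hidden states
s_logits = t_logits + 0.5 * rng.standard_normal((4, 10))
s_hidden = t_hidden + 0.5 * rng.standard_normal((4, 8))

loss_student = distillation_loss(s_logits, t_logits, s_hidden, t_hidden)
loss_matched = distillation_loss(t_logits, t_logits, t_hidden, t_hidden)
print(loss_student, loss_matched)
```

The loss reaches its minimum, the entropy of the teacher distribution, when the student reproduces the teacher exactly.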

0xFF Reference

https://arxiv.org/pdf/2402.17764

H. Wang, S. Ma, L. Dong, S. Huang, H. Wang, L. Ma, F. Yang, R. Wang, Y. Wu, and F. Wei. BitNet: Scaling 1-bit transformers for large language models. arXiv preprint arXiv:2310.11453, 2023.

Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, and Wanxiang Che. 2024. Onebit: Towards extremely low-bit large language models. CoRR, abs/2402.11295.

Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization

Large Model Quantization Technology Principle - SpQR: Eating Jelly Without Spitting Out the Jelly Shell

SqueezeLLM

A New Chapter in LLM Quantization: 4-bit Weighted Activated Quantization with Near-Loss Efficiency! FlatQuant’s Flat Path (NeuralTalk)

https://www.armcvai.cn/2024-11-01/llm-quant-awq.html

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

https://www.armcvai.cn/2024-10-30/llm-smoothquant.html

https://www.armcvai.cn/2024-10-31/smoothquant-inplement.html

https://www.armcvai.cn/2024-11-03/awq-code.html

[ Source Code] [10,000 words] In-depth Exploration of SmoothQuant Quantitative Analysis

https://github.com/mit-han-lab/smoothquant/

https://github.com/Guangxuan-Xiao/torch-int/

[LLM Quantitative Series] PTQ Quantitative Classic Research Analysis: Killua’s Advance

Lightweighting of Large Models (Part 2): AWQ: A Technological Powerhouse for Weight Quantization of 4-bit Large Language Models Suitable for End-User Devices

GPTQ & OBQ: Quantifying Your GPT ( Yang Xinyu)

QLoRA: Training a Larger GPT (Yang Xinyu)

LLM Inference Acceleration Technology Principles - GPTQ Quantization Technology Evolution Fenrier Lab

(Alpha) GPTQ Explained by buxianchen

LeCun, Yann, John Denker, and Sara Solla. “Optimal brain damage.” Advances in neural information processing systems 2 (1989).

Hassibi, Babak, David G. Stork, and Gregory J. Wolff. “Optimal brain surgeon and general network pruning.” IEEE international conference on neural networks. IEEE, 1993.

Frantar, Elias, and Dan Alistarh. “Optimal brain compression: A framework for accurate post-training quantization and pruning.” Advances in Neural Information Processing Systems 35 (2022): 4475-4488.

Frantar, Elias, et al. “Gptq: Accurate post-training quantization for generative pre-trained transformers.” arXiv preprint arXiv:2210.17323 (2022).

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers https://arxiv.org/abs/2210.17323

Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning https://arxiv.org/abs/2208.11580v2

Optimal Brain Surgeon and general network pruning https://www.babak.caltech.edu/pubs/conferences/00298572.pdf

Optimal Brain Damage https://proceedings.neurips.cc/paper/1989/file/6c9882bbac1c7093bd25041881277658-Paper.pdf

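Wu H, Judd P, Zhang X, et al. Integer quantization for deep learning inference: Principles and empirical evaluation[J]. arXiv preprint arXiv:2004.09602, 2020.

Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. arXiv preprint arXiv:2206.01861, 2022.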

Bondarenko, Y., Nagel, M., and Blankevoort, T. Understanding and overcoming the challenges of efficient transformer quantization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7947-7969, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. URL https://aclanthology.org/2021.emnlp-main.627 .

A Review of Model Quantization Techniques: Revealing Cutting-Edge Technologies for Large Language Model Compression (DeepHub IMBA)

Large Model Performance Optimization (Part 1): Quantization Starting with Half-Precision, Understanding fp32, fp16, and bf16

A convenient post-training quantization solution: GPTQ

[AI Unraveling the Mysteries] The Principles, Current Status, and Future Prospects of Model Quantization Technology - Long Peng - Three Mottos

AWQ, Activation-aware Weight Quantization

LLM.int8()

SqueezeLLM

SmoothQuant

ZeroQuant series (v1, v2)

GPTQ

A pioneering work in quantization-aware training for large models: LLM-QAT (by "Eating jelly without spitting out the jelly shell")

LLM-QAT: Data-Free Quantization Aware Training for Large Language Models (https://github.com/facebookresearch/LLM-QAT)

Summary of Quantization Algorithms for Large Model Inference (by Sun Peiqin)

Wang Wenguang’s lengthy article reveals the secrets of the GPTQ method for large model quantization: from OBS through OBQ to GPTQ, the magic of the Hessian matrix.

Wang Wenguang’s lengthy article reveals the secrets of large-scale model quantization technology: exploring the principles and understanding the most important technology for efficient inference in large-scale models.

akaihaoshuai: Implementing LLM from Scratch: 6. Model Quantization Theory + Code Practice (LLM-QAT/GPTQ/BitNet 1.58Bits/OneBit)

akaihaoshuai: Implementing LLM from Scratch: 6.1. Model Quantization (AWQ/SqueezeLLM/Marlin)

https://zhuanlan.zhihu.com/p/703513611

A Technical Overview of Efficient Inference in Large AI Models! (AI Large Model Frontiers)

[LLM Quantization Series] Killua’s Advances with DuQuant, AffineQuant, and FlatQuant

Rishabh Agarwal, Nino Vieillard, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. 2023. GKD: Generalized knowledge distillation for auto-regressive sequence models. arXiv preprint arXiv:2306.13649.

E. Beltrami. 1990. Sulle funzioni bilineari, Giornale di Matematiche ad Uso degli Studenti delle Universita, 11, 98-106. (An English translation by D. Boley is available as University of Minnesota, Department of Computer Science, Technical Report 90-37.)

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. 2023. Pythia: A suite for analyzing large language models across training and scaling. In ICML, pages 2397-2430.

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. 2020. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI, volume 34, pages 7432-7439.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in NeurIPS, 33:1877-1901.

Sebastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457.

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339.

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023a. QLoRA: Efficient finetuning of quantized LLMs. In Advances in NeurIPS.

Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. 2023b. SpQR: A sparse-quantized representation for near-lossless llm weight compression. arXiv preprint arXiv:2306.03078.

Tim Dettmers and Luke Zettlemoyer. 2023. The case for 4-bit precision: k-bit inference scaling laws. In ICML, pages 7750-7774.

Elias Frantar and Dan Alistarh. 2023. SparseGPT: Massive language models can be accurately pruned in one-shot. In ICML, pages 10323-10337.

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. In ICLR.

Cheng-Yu Hsieh, Chun-Liang Li, Chih-kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. 2023. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. In Findings of the ACL, pages 8003-8017.

Jeonghoon Kim, Jung Hyun Lee, Sungdong Kim, Joonsuk Park, Kang Min Yoo, Se Jung Kwon, and Dongsoo Lee. 2023a. Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization. arXiv preprint arXiv:2305.14152.

Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W Mahoney, and Kurt Keutzer. 2023b. SqueezeLLM: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. 2023. AWQ: Activation-aware weight quantization for llm compression and acceleration. arXiv preprint arXiv:2306.00978.

Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. 2023. LLM-QAT: Data-free quantization aware training for large language models. arXiv preprint arXiv:2305.17888.

Xinyin Ma, Gongfan Fang, and Xinchao Wang. 2023. LLM-Pruner: On the structural pruning of large language models. In Advances in NeurIPS.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

Matan Ben Noach and Yoav Goldberg. 2020. Compressing pre-trained language models by matrix decomposition. In Proceedings of the AACL-IJCNLP, pages 884-889.

Pentti Paatero and Unto Tapper. 1994. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2):111-126.

Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. 2023. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485-5551.

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99-106.

Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. 2023. OmniQuant: Omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137.

Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. 2023. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695.

Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient knowledge distillation for BERT model compression. In Proceedings of the EMNLP-IJCNLP, pages 4323-4332.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca .

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothee Lacroix, Baptiste Roziere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in NeurIPS, 30.

Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, Mosharaf Chowdhury, et al. 2023. Efficient large language models: A survey. arXiv preprint arXiv:2312.03863.

Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. 2023. BitNet: Scaling 1-bit transformers for large language models. arXiv preprint arXiv:2310.11453.

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. SmoothQuant: Accurate and efficient post-training quantization for large language models. In ICML, pages 38087-38099.

Mingxue Xu, Yao Lei Xu, and Danilo P Mandic. 2023. TensorGPT: Efficient compression of the embedding layer in llms based on the tensor-train decomposition. arXiv preprint arXiv:2307.00526.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.

Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. 2024. TinyLlama: An open-source small language model. arXiv preprint arXiv:2401.02385.

Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. 2023. A survey on model compression for large language models. arXiv preprint arXiv:2308.07633.

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits: https://arxiv.org/abs/2402.17764

BitNet: Scaling 1-bit Transformers for Large Language Models: https://arxiv.org/abs/2310.11453

The Era of 1-bit LLMs: Training Tips, Code and FAQ: https://github.com/microsoft/unilm/blob/master/bitnet/The-Era-of-1-bit-LLMs__Training_Tips_Code_FAQ.pdf

OLMo-Bitnet-1B: NousResearch/OLMo-Bitnet-1B · Hugging Face

1bitLLM/bitnet_b1_58-3B · Hugging Face

Touvron, Hugo, et al. “LLaMA: Open and efficient foundation language models.” arXiv preprint arXiv:2302.13971 (2023).

Taori, Rohan, et al. “Stanford alpaca: An instruction-following llama model.” (2023).

Dettmers, Tim, et al. “LLM.int8(): 8-bit matrix multiplication for transformers at scale.” arXiv preprint arXiv:2208.07339 (2022).

Frantar, Elias, et al. “GPTQ: Accurate post-training quantization for generative pre-trained transformers.” arXiv preprint arXiv:2210.17323 (2022).

Hoefler, Torsten, et al. “Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks.” The Journal of Machine Learning Research 22.1 (2021): 10882-11005.

QLORA: Efficient Finetuning of Quantized LLMs

20,000 words of analysis on quantization technology, the most comprehensive analysis of large-scale model quantization technology on the entire internet. (Baiqi Yuewen)

SageAttention: Plug-and-play 8-bit Attention Best Practices (by Fang Jiarui)

Introduction to decoupleQ 2-bit quantization technology: Smooth and fluid operation

[LLM Quantization Series] Analysis of Classic PTQ Research: Killua’s Advance

Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models

Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling

Large-Scale Model Quantization Techniques Demystified (by "Pandas that don’t rely on Newton’s tubes")

[ICML2023] [W8A8] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

[Frantar et al., NIPS2022] Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning

[ICLR2023] [W3A16] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

QLoRA, GPTQ: An Overview of Model Quantization

[MLSys2024] [W3A16] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

NF4 Isn’t Information Theoretically Optimal (and that’s Good)

SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression

What methods are currently available for quantizing large models? (Top Secret Ambush)

QLoRA: Efficient Finetuning of Quantized LLMs https://arxiv.org/abs/2305.14314

QLoRA: 4-bit quantization + LoRA training = instant takeoff (LokLok)