Exploring the Transformer Series (34) --- Quantization Fundamentals

0x00 Overview

While Transformer-based LLMs have made significant progress, deploying them presents major challenges because of their growing parameter counts. For example, even a medium-sized LLM such as LLaMA-13B requires approximately 26GB of memory just to load all of its parameters in FP16 format. This overhead not only increases the cost of use but also limits wider adoption.

To address these challenges, numerous specialized compression methods for LLMs have been proposed, including pruning, knowledge transfer, quantization, compact architecture design, and dynamic networks. These methods help reduce memory and computational costs during model inference, enabling models to run on resource-constrained devices. Model quantization (also known as network quantization) is a crucial method in model compression. The model quantization process consists of two parts: converting the model’s single-precision parameters (generally FP32, 32-bit floating-point) into low-precision parameters (generally INT8, 8-bit fixed-point), and converting floating-point operations during model inference into fixed-point operations. Model quantization techniques can reduce the model’s storage space, memory footprint, and computational resource requirements, thereby improving the model’s inference speed.

3401

Note: There are 5 articles on quantization in total, and this is the first one.

0x01 Background Knowledge

1.1 Requirements

In the actual operation of large language models, resource efficiency issues are mainly reflected in two aspects: memory pressure and computational efficiency. These problems become increasingly prominent as the model size increases and the application scenarios become more complex.

Memory Pressure: Large language models face severe memory pressure when processing long texts. This pressure manifests in two ways:
- The increasing size of model parameters. Large language models are named for the sheer number of parameters they contain. These models typically have billions of parameters, which can be quite expensive to store. During inference, activation values are the product of inputs and weights, which can also be enormous, often requiring GPUs with significant amounts of video memory to accelerate the inference process.
- The KV cache, whose GPU memory requirement grows linearly with the length of the input sequence. For example, when processing a 32K-token sequence, the KV cache alone may occupy 10-15GB of GPU memory, which it shares with the model’s weight parameters. This GPU memory pressure directly limits the model’s ability to process extremely long texts and also affects the overall scalability of the system.
Computational efficiency: Limited GPU memory in turn lowers computational efficiency. This manifests in three aspects:
- Limited batch processing capability: Because so much GPU memory is already in use, the system has difficulty processing large batches of requests simultaneously.
- Increased response latency: Especially with long sequences, the model’s inference time is significantly extended.
- Reduced system throughput: Because of limited GPU memory capacity, the number of requests the server can handle simultaneously is significantly reduced.

In addition, different application scenarios also bring unique challenges:

In edge computing environments, it is necessary to achieve efficient operation of models within limited computing resources.
Mobile device applications require models to be able to adapt to strict memory constraints.
Real-time interactive scenarios place higher demands on the model’s response latency.

For model miniaturization, the focus is on the model’s average inference time and power consumption. Average inference time can be measured by latency or throughput, while power consumption can be measured by the GPU power consumption used in the token generation process. Both of these metrics are closely related to the number of model parameters, especially since LLMs have a huge number of parameters, resulting in high GPU consumption during deployment.

In conclusion, how to make the model smaller and lighter during deployment while maintaining its capabilities as much as possible has become an important research topic.

1.2 Compression

Model compression algorithms can transform a large and complex pre-trained model into a smaller, more streamlined model, reducing the demands on hardware storage, bandwidth, and computation, thereby accelerating model inference and deployment. Based on the degree of disruption to the network structure caused by the compression process, researchers have divided model compression techniques into two parts: “front-end compression” and “back-end compression.”

Front-end compression refers to compression techniques that do not change the original network structure. These mainly include knowledge distillation, lightweight networks (compact model structure design), and filter-level pruning (structured pruning).
Back-end compression refers to techniques such as low-rank approximation, unrestricted pruning (unstructured pruning/sparsification), parameter quantization, and binary networks. Their goal is to minimize model size, and they can significantly alter the original network structure.

In recent years, mainstream model compression methods include: data quantization (also called model quantization), model sparsification (also called model pruning), knowledge distillation, lightweight network design, and tensor decomposition.

1.3 How to represent numerical values

Let’s first look at how numerical values are represented in computers. Because both inference and training in neural networks are computationally intensive, efficient numerical representation is particularly important.

In computer science, a given numerical value is typically represented as a floating-point number, which is a positive or negative number with a decimal point. This notation uses scientific notation to represent real numbers. Specifically, it uses a mantissa (an informal term for significant figures), a base, an exponent, and a sign to indicate positive or negative. See the diagram below for details.

Fixed-point is another way of representing numbers. The difference between fixed-point and floating-point lies in where the integer and fractional parts are separated. Fixed-point retains a specific number of digits for the integer and fractional parts, while floating-point retains a specific number of digits for the significant digits and the exponent.

3402

As shown in the diagram above, the mantissa and exponent are represented by “bits,” or binary digits. Different data types have different mantissas and exponents; the more bits we use to represent a value, the more accurate it usually is. A clever property of these bits is that we can calculate how much memory a device needs to store a given value. Since there are 8 bits in one byte of memory, we can create a basic formula for most forms of floating-point representation.

$memory = \frac{number\_of\_bits}{8} \times number\_of\_params$

Now suppose we have a model containing 70 billion parameters. Most models use 32-bit floating-point numbers (often called full precision) by default, and loading the model alone requires 280GB of memory.

$64\text{-}bits = \frac{64}{8} \times 70B \approx 560GB \\ 32\text{-}bits = \frac{32}{8} \times 70B \approx 280GB \\ 16\text{-}bits = \frac{16}{8} \times 70B \approx 140GB$
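The memory formula above can be checked with a few lines of Python. This is only a sketch: the helper name `model_memory_gb` is made up for illustration, 1 GB is taken as 10^9 bytes, and the 8-bit and 4-bit rows are added here simply to extend the same arithmetic to lower bit widths.

```python
def model_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed to hold the parameters alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 70e9  # the 70-billion-parameter model from the text

# Parameter memory for several bit widths (weights only, no activations or KV cache).
footprints = {bits: model_memory_gb(NUM_PARAMS, bits) for bits in (64, 32, 16, 8, 4)}

for bits, gb in footprints.items():
    print(f"{bits:>2}-bit: {gb:.0f} GB")
```

The figures cover the parameters only; activations and the KV cache come on top of them.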

The number of bits used to represent model parameters (including during training) becomes very important because it determines the amount of memory the model occupies. However, as numerical precision decreases, model accuracy typically also decreases. Therefore, we aim to reduce the number of bits used to represent numerical values while maintaining accuracy.

1.4 Common Data Types

Let’s take a look at common data types and the impact of using them instead of the 32-bit (called full precision or FP32) representation.

The standard-precision model represents the model weights in FP32, the 32-bit single-precision floating-point format.
Mixed precision uses both FP32 and FP16 weight formats in the model. FP16 halves memory usage, but some parameters or operators must stay in FP32 to maintain accuracy.
Low-precision models represent model weights in FP16 (half-precision floating-point) or INT8 (8-bit fixed-point integer). The industry is also working toward INT4 and INT1 precision, while INT32 and FP16 bring only limited performance improvements. Therefore, INT8 quantization is currently the primary choice.

Additionally, when performing calculations within a specific data type (such as INT8), we need a structure of another data type to store the result in order to handle overflow. This is called an accumulation data type, such as INT32 for INT8. An accumulation data type specifies the type of the result of accumulating (addition, multiplication, etc.) values of related data types. For example, consider two INT8 values A = 127 and B = 127, and define C as the sum of A and B:

C = A + B

The result C = 254 exceeds the maximum representable value of int8 (127). Therefore, we need a wider data type to hold the result and avoid overflow and a huge loss of precision. Similarly, the accumulation data type of bfloat16 is float32.
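The overflow described above can be reproduced with Python's standard `ctypes` integer types, which perform no overflow checking and simply wrap around. The helper names `store_as_int8` and `store_as_int32` are made up for this sketch; real INT8 kernels accumulate in INT32 in hardware rather than through `ctypes`.

```python
import ctypes


def store_as_int8(value: int) -> int:
    """Store an integer in an 8-bit signed slot; out-of-range values wrap around."""
    return ctypes.c_int8(value).value


def store_as_int32(value: int) -> int:
    """Store an integer in a 32-bit signed slot."""
    return ctypes.c_int32(value).value


A, B = 127, 127
sum_in_int8 = store_as_int8(A + B)    # 254 does not fit into int8 and wraps
sum_in_int32 = store_as_int32(A + B)  # 254 fits comfortably into int32

# A dot product of INT8 vectors accumulated in an INT32 register.
a = [127, 127, 127, 127]
b = [127, 127, 127, 127]
acc = 0
for x, y in zip(a, b):
    acc = store_as_int32(acc + x * y)  # each product (16129) already exceeds int8

print(sum_in_int8, sum_in_int32, acc)
```

Even a single product of two INT8 values can exceed the INT8 range, which is why the accumulator has to be wider than the operands.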

The following figure shows the accumulation data type, mathematical operations, and bandwidth optimization rate for different data types.

3403

0x02 Introduction to Quantization

The core idea of quantization is to represent the original data in a more compact data format within an acceptable range of precision loss.

The concept of quantization originates from the field of digital signal processing and refers to the process of approximating the continuous values (or a large number of possible discrete values) of a signal to a finite number (or a small number) of discrete values. Research in neuroscience, particularly on the quantization of neural networks, has shown that information stored in continuous form is inevitably susceptible to noise corruption, and that discrete representations have higher generalization ability and greater efficiency with limited resources. Therefore, it is generally believed that the human brain stores information in discrete/quantized form, rather than continuously.

In machine learning, model quantization refers to converting the floating-point arithmetic of a neural network into low-precision integers to save computational overhead. For example, using low-precision data types like 8-bit integers (int8) to represent weights and activations instead of the usual 32-bit floating-point numbers. Reducing the number of bits means a smaller model size, requiring less memory storage during inference and consuming less power. Furthermore, operations like matrix multiplication can be performed faster using integer arithmetic. Therefore, quantization is an effective technique for reducing computational and memory costs during inference.

Several key factors make neural network models amenable to quantization:

First, both inference and training in neural networks are computationally intensive. Therefore, efficient numerical representation is particularly important.
Secondly, most current neural network models are severely over-parameterized, thus providing ample opportunity to reduce bit precision without compromising accuracy.
Furthermore, neural networks are highly robust to aggressive quantization and extreme discretization. They can still achieve very good generalization performance even when the error/distance between the quantized model and the original non-quantized model is large.
Finally, the hierarchical structure of neural network models provides an additional dimension for exploration. Different layers in a neural network have different effects on the loss function, which inspires mixed-precision quantization methods.

2.1 Sparsity of Deep Neural Networks

Based on the objects that can be sparsified in deep learning models, sparsity in deep neural networks mainly includes weight sparsity, activation sparsity, and gradient sparsity.

2.1.1 Sparse Weights

In most neural networks, a histogram of the weight values in a layer (convolutional or fully connected) shows that the weights closely follow a normal distribution (or a mixture of several normal distributions): the closer a value is to 0, the more weights take it. This is the phenomenon of weight sparsity. Some researchers regard the absolute value of a weight as a measure of its importance: the larger the magnitude, the greater the weight’s contribution to the model output, and the smaller the magnitude, the less it contributes. Removing a small-magnitude weight is therefore likely to have a relatively small impact on model accuracy.

2.1.2 Activation Sparsity

Activation functions introduce sparsity into the network. Taking the ReLU activation function as an example, defined as: $ReLU(x)=max(0,x)$ , this function causes inputs along the negative half-axis to produce outputs of 0 values. This can be considered as the activation function introducing another type of sparsity to the network. That is, regardless of the input received by the network, a large portion of the neurons in a large network will output mostly zero.

2.1.3 Gradient Sparsity

The gradients of most deep neural network model parameters are also highly sparse. The paper “DEEP GRADIENT COMPRESSION: REDUCING THE COMMUNICATION BANDWIDTH FOR DISTRIBUTED TRAINING” found that in distributed SGD algorithms, 99.9% of the gradient exchange is redundant.

2.2 Quantization Mechanism

Next, we introduce the key concepts of quantization.

The principle of quantization is shown in the figure below. The main goal of quantization is to reduce the number of bits required to represent the original parameters while maintaining the accuracy of the original parameters as much as possible. Specifically, as shown in the figure below, [-R, R] is the data range before quantization, and [-127, 127] is the data range after quantization. This process reduces the precision of parameters such as model weights from high-bit-width continuous values (usually float32 or a large number of possible discrete values) to a finite number of low-bit-width discrete values (usually int8).

3404

The quantization formula is shown in the figure below. Here, X is a floating-point tensor, X¯ is the quantized integer tensor, Δ is the scaling factor, ⌊⋅⌉ represents the rounding function, and N is the number of quantization bits (in this case, 8bits). It is assumed that the tensor is symmetric around 0. This quantization method calculates the scaling factor Δ based on the maximum absolute value of the floating-point number, which preserves outliers in the activations, crucial for model accuracy. We can calculate the scaling factor Δ offline based on the activation values of a partial calibration sample set; this is called static quantization. Alternatively, we can dynamically calculate the scaling factor based on activation statistics during model runtime; this is called dynamic quantization.

3405
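As a rough illustration of the description above, the sketch below assumes the common absmax form of the formula, $\Delta = \frac{\max|X|}{2^{N-1}-1}$ and $\bar{X} = \lfloor X / \Delta \rceil$; the figure itself is not reproduced here, so this form is an assumption. The helper names and sample values are made up. The static variant fixes Δ from a calibration sample, while the dynamic variant recomputes it from the tensor seen at runtime.

```python
def absmax_scale(values, n_bits=8):
    """Scaling factor Delta = max|X| / (2^(N-1) - 1)."""
    q_max = 2 ** (n_bits - 1) - 1
    return max(abs(v) for v in values) / q_max


def quantize(values, delta, n_bits=8):
    """Round X / Delta to the nearest integer and clip to [-q_max, q_max]."""
    q_max = 2 ** (n_bits - 1) - 1
    # Note: Python's round() sends exact halves to the nearest even integer.
    return [max(-q_max, min(q_max, round(v / delta))) for v in values]


# Static quantization: Delta is fixed offline from a calibration sample.
calibration_activations = [0.5, -1.2, 2.54, 0.03]
static_delta = absmax_scale(calibration_activations)

# Dynamic quantization: Delta is recomputed from the tensor seen at runtime.
runtime_activations = [3.0, -0.5]
dynamic_delta = absmax_scale(runtime_activations)

# With the static Delta, 3.0 lies outside the calibrated range and is clipped to 127.
static_q = quantize(runtime_activations, static_delta)
dynamic_q = quantize(runtime_activations, dynamic_delta)
print(static_q, dynamic_q)
```

The trade-off shows up directly: the static scale costs nothing at runtime but clips values the calibration set never saw, while the dynamic scale always fits the data at the price of an extra pass over each tensor.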

2.2.1 Basic Operations

Quantization has two basic operations:

Quantization: Converting data to a lower precision, such as converting a real number into a quantized integer (float32 becomes int8).
Dequantization: Converting data to a higher precision. For example, converting a number from a quantized integer representation to a real number (int8 becomes float32).

The specific details are shown in the image below. We will conduct a detailed analysis later.

3406

It can be further refined. Equation (1) below is quantization, and equation (2) is dequantization. Q(·) represents the quantization operation, X represents the input tensor, S is the scale, Z is the zero-point, b is the quantization bit width, round(⋅) and clip(⋅) represent the rounding and truncation operations respectively, and $q_{min}$ and $q_{max}$ are the minimum and maximum values after quantization. S and Z are collectively referred to as quantization parameters. Most quantization algorithms can be understood as finding better S and Z so that the result of the quantized model approximates the result of the original model as closely as possible.

$Q(X) = clamp(\lfloor\frac{X}{S}\rceil + Z, 0, 2^b -1)\tag {1}$

$\hat X = (Q(X) - Z) \times S \tag{2}$

Assuming r represents the unquantized floating-point number, the quantized integer q can be represented as:

$q = clip(round(\frac{r}{s} + z), q_{min},q_{max})$
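Equations (1) and (2) translate almost directly into code. The sketch below is a minimal pure-Python version for b = 8; the way S and Z are chosen here (spreading the observed min/max over 0..255) is an assumption made for the example, since the equations themselves leave that choice open.

```python
def quantize(x, scale, zero_point, n_bits=8):
    """Equation (1): Q(X) = clamp(round(X / S) + Z, 0, 2^b - 1)."""
    q_min, q_max = 0, 2 ** n_bits - 1
    return [min(q_max, max(q_min, round(v / scale) + zero_point)) for v in x]


def dequantize(q, scale, zero_point):
    """Equation (2): X_hat = (Q(X) - Z) * S."""
    return [(v - zero_point) * scale for v in q]


x = [-1.0, -0.5, 0.0, 0.7, 1.5]

# Assumed choice of S and Z: spread the observed range [min(x), max(x)] over 0..255.
scale = (max(x) - min(x)) / 255
zero_point = round(-min(x) / scale)

q = quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(x, x_hat))
print(q, x_hat, max_error)
```

For values inside the range, the round trip is off by at most S/2, and the real value 0.0 maps exactly onto the zero-point and back.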

2.2.2 Range Mapping

The core of the quantization process is finding a suitable mapping relationship to map floating-point numbers to integer space. This mapping needs to ensure that:

Cover the entire range of the original data.
Maintain the relative relationships of the values as much as possible.
Ensure that the quantized values fall within a fixed range, such as 0-255 for unsigned 8-bit integers.

In range mapping, data within the range [A1, A2] must be converted to a range of B-bit integers. Specifically, all elements within the range [A1, A2] are mapped to the target range, with elements outside the range being clipped to the nearest boundary. For example, we might need to map FP32 to the int8 range. Int8 can only represent 256 values, while float32 can represent a wider range. However, the value distribution of weights in a typical neural network layer is narrow, but the number of numerical points within this range is large. Therefore, careful adjustment of the range mapping is necessary to better project the float32 value range [a, b] onto the int8 space.

Quantization error

Most users and researchers of model quantization care most about the accuracy of the quantized model. Quantization converts a higher-precision floating-point model into a fixed-point model with a smaller number of discrete values, which inevitably introduces error. In a neural network, this error can be amplified layer by layer and lead to a severe drop in accuracy. The quantization operation is generally defined as: $\hat{X} = Round( clamp(\frac{X}{s}, Q_{min}, Q_{max}) )$ , where Round represents rounding, and clamp truncates outliers exceeding the quantization domain $\left [ Q_{min}, Q_{max} \right ]$ . Both the Round and clamp operations cause an irreversible loss of numerical precision. Simply put, the quantization error of a tensor can be expressed as the sum of the truncation error and the rounding error. The two are related through the scaling factor $s$, defined as $s = \frac{X_{max} - X_{min} }{2^{bit}-1 }$ , which is determined by the upper and lower bounds of the truncation: a narrower truncation range clips more values but gives a smaller $s$ and therefore a smaller rounding error. Combining the two, we can obtain an empirical formula $\sigma ^{2} \sim \frac{s^{2} }{12}$ , that is, the quantization error is proportional to $s^{2}$ .
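The $s^{2}/12$ rule can be checked numerically. The sketch below measures only the rounding part of the error (no clipping) on uniformly distributed inputs in [-1, 1], which is an assumption of this example; real weight and activation distributions will deviate somewhat from the idealized figure. The function name is made up for illustration.

```python
import random


def rounding_error_variance(scale, n_samples=50_000, seed=0):
    """Mean squared error of quantize-then-dequantize, without clipping."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.uniform(-1.0, 1.0)
        x_hat = round(x / scale) * scale
        total += (x - x_hat) ** 2
    return total / n_samples


# Ratio of measured error to the s^2 / 12 prediction for two bit widths.
ratios = {}
for bits in (8, 4):
    s = 2.0 / (2 ** bits - 1)  # s = (X_max - X_min) / (2^bit - 1) with range [-1, 1]
    ratios[bits] = rounding_error_variance(s) / (s ** 2 / 12)
    print(f"{bits}-bit: measured / predicted = {ratios[bits]:.3f}")
```

Both ratios should come out close to 1. Going from 8 to 4 bits enlarges s by a factor of 17, so the error variance grows by roughly 17 squared.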

Calibration

To calculate scale and zero_point, we need to know the actual dynamic range of the FP32 weights/activations. Calibration is the process of selecting an optimal range, that is, finding a range that contains as many values as possible while minimizing quantization error. Calibration is crucial for effective quantization. During inference, a weight is a constant tensor with a fixed dynamic range, while an activation has a variable dynamic range, so its actual dynamic range must be obtained through sampling (data calibration). If every FP32 value encountered during quantization falls within this dynamic range, we generally call this an unsaturated state; conversely, if some FP32 values fall outside it, we call it a saturated state.
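A minimal sketch of min/max calibration for activations follows. The class name `MinMaxObserver`, its methods, and the sample batches are hypothetical and written for this example only; real toolkits offer more elaborate range estimators (percentile, entropy, MSE) than the plain running min/max used here.

```python
class MinMaxObserver:
    """Tracks the running min/max of activations seen during calibration."""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, batch):
        self.min_val = min(self.min_val, min(batch))
        self.max_val = max(self.max_val, max(batch))

    def qparams(self, n_bits=8):
        """Scale and zero_point for an unsigned n-bit integer range."""
        q_max = 2 ** n_bits - 1
        scale = (self.max_val - self.min_val) / q_max
        zero_point = round(-self.min_val / scale)
        return scale, zero_point

    def is_saturated(self, batch):
        """True if any value falls outside the calibrated dynamic range."""
        return any(v < self.min_val or v > self.max_val for v in batch)


# Calibration: feed a few sample batches of activations (made-up values).
observer = MinMaxObserver()
for batch in ([0.1, 0.9, 2.0], [-0.4, 1.3], [0.0, 2.6]):
    observer.observe(batch)

scale, zero_point = observer.qparams()
print(scale, zero_point)
print(observer.is_saturated([0.5, 1.0]))  # all values inside [-0.4, 2.6]
print(observer.is_saturated([0.5, 3.1]))  # 3.1 lies outside the calibrated range
```

Weights need no such sampling: their range can be read directly from the tensor once, because it does not change at inference time.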

Uniform quantization & Non-uniform quantization

From the perspective of range mapping, quantization methods can be divided into uniform quantization and non-uniform quantization based on whether the range of the original data represented by the quantized data is uniform. The weights and activation values of actual deep neural networks are usually non-uniform, so theoretically, non-linear quantization would result in less accuracy loss. However, in actual inference, non-linear quantization has higher computational complexity, so we usually use linear quantization.

Non-uniform quantization (also known as nonlinear quantization) is defined as follows: when the value of a real number r falls between the quantization steps $\Delta_i$ and $\Delta_{i+1}$, the quantizer Q projects it onto the corresponding quantization level $X_i$, and neither $X_i$ nor $\Delta_i$ is uniformly spaced. For a fixed bit width, non-uniform quantization can achieve higher precision because it can better capture the distribution by focusing more on the region of significant values or finding an appropriate dynamic range. For example, many non-uniform quantization methods are designed for bell-shaped distributions of weights and activations, which often involve long tails. A typical rule-based non-uniform quantization uses a logarithmic distribution, where the quantization step size and level grow exponentially rather than linearly.

3407

Uniform quantization (also known as linear quantization) uses uniformly distributed cluster centers, and the original floating-point data and the quantized integers have a simple linear transformation relationship. Because convolutional, fully connected, and other network layers are themselves simple linear computations, linear quantization can directly use the quantized data for calculations. Uniform quantization divides the range of real values into uniform finite intervals, and then real values within the same interval are mapped to the same integer. In fact, the quantized values (also known as quantization levels) obtained by uniform quantization are also uniform.

The figure below shows a comparison between uniform quantization (left) and non-uniform quantization (right). Real values in the continuous domain r are mapped to discrete, lower-precision values in the quantization domain Q, which are marked with orange dots. Note that in uniform quantization, the distances between quantized values (quantization levels) are the same, while in non-uniform quantization they can vary.

3408

The weights of neural networks are typically not uniformly distributed, making non-uniform quantization feasible. Generally, non-uniform quantization allows us to better capture signal information by non-uniformly distributing bits and discretizing the parameter range, providing higher accuracy and lower quantization errors. However, non-uniform quantization schemes can be inefficient due to time-consuming lookup operations, often making them difficult to deploy effectively on general-purpose computing hardware such as GPUs and CPUs. Therefore, uniform quantization has become the de facto standard method due to its simplicity and efficient hardware mapping.
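To make the contrast concrete, the sketch below builds a 4-bit uniform grid and a 4-bit power-of-two (logarithmic) grid over [-1, 1] and quantizes by nearest level. The specific level layout is an assumption chosen for illustration, and the nearest-level search stands in for the lookup that makes non-uniform schemes slower in practice.

```python
def nearest_level(x, levels):
    """Project x onto the closest quantization level."""
    return min(levels, key=lambda level: abs(level - x))


def uniform_levels(max_abs, n_bits):
    """Evenly spaced levels: i * step for i in [-q_max, q_max]."""
    q_max = 2 ** (n_bits - 1) - 1
    step = max_abs / q_max
    return [i * step for i in range(-q_max, q_max + 1)]


def log_levels(max_abs, n_bits):
    """Zero plus +/- max_abs * 2^-k: dense near zero, sparse near the extremes."""
    q_max = 2 ** (n_bits - 1) - 1
    positive = [max_abs * 2.0 ** -k for k in range(q_max)]
    return sorted([-p for p in positive] + [0.0] + positive)


uni = uniform_levels(1.0, n_bits=4)
log = log_levels(1.0, n_bits=4)

# Absolute error of both grids on a small and a large value.
small, large = 0.03, 0.8
errors = {
    "uniform": (abs(nearest_level(small, uni) - small), abs(nearest_level(large, uni) - large)),
    "log": (abs(nearest_level(small, log) - small), abs(nearest_level(large, log) - large)),
}
print(errors)
```

With the same number of levels, the logarithmic grid resolves the small value far better and the large value worse, which suits bell-shaped distributions where most values sit near zero.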

Affine Quantization & Scale Quantization

Uniform quantization consists of two steps:

Select the range of values (floating-point) to be quantized and truncate them. The meaning of truncation is: values greater than the range are made the maximum value of the range, and values less than the range are made the minimum value of the range.
Mapping the truncated values to integers involves a rounding operation.

Let $[\beta,\alpha]$ be the range of real values that can be represented for quantization, and b be the bit width for signed integer representation. Uniform quantization transforms the input value $x\in[\beta,\alpha]$ into $[-2^{b-1}, 2^{b-1}-1]$, where inputs outside the range are truncated to the nearest boundary.

Since we only consider uniform transformations, linear quantization of floating-point numbers can be divided into two categories based on whether zero in the floating-point space is still mapped to 0 after quantization: symmetric quantization and asymmetric quantization. We call these two choices scale quantization and affine quantization, respectively. The corresponding transformation function also has two forms: f(x) = s · x + z and its special case f(x) = s · x. Note that in this notation s multiplies x, so it plays the role of 1/S relative to equation (1).

Scale Quantization: The transformation function is f(x) = s · x, i.e., symmetric quantization, where s is the scaling factor. The characteristics of scale quantization are as follows:
- Symmetric quantization maps the values of an FP32 tensor in [−max(|x|), max(|x|)] to the 8-bit range [−127, 127], with intermediate values mapped linearly. Both the floating-point range and the quantized range are symmetric with respect to zero, hence the name symmetric quantization.
- In symmetric quantization, the range of the original floating-point values is mapped to a range centered at zero in the quantized space. For int8 this range is [-127, 127]; the value -128 is not used. A good example of symmetric quantization is absolute maximum (absmax) quantization: given a set of values, we take the largest absolute value (α) as the bound of the range [-α, α] over which the linear mapping is performed.
- Because the zero point is 0, symmetric quantization spares the quantized operator from computing the z-related terms during inference, which reduces computational cost; the dequantization step is likewise cheaper and simpler to implement.
- Symmetric quantization is widely used in practice for quantizing weights, because fixing the zero point at 0 reduces the computational cost of inference and simplifies the implementation.
- Symmetric quantization is a questionable fit for activations, because activation values are often non-negative. When the data falls mostly in the non-negative half, half of the quantization range goes unused, which increases quantization error.
- The quantization of weights and data then reduces to finding the scale, and improving a quantization method essentially means choosing a better scale value.
Asymmetric Quantization: The transformation function is f(x) = s · x + z, i.e., asymmetric quantization, where s is the scaling factor and z is the zero point. The characteristics of asymmetric quantization are as follows:
- The clipping range of asymmetric quantization is not necessarily symmetric with respect to the origin. It maps the minimum (β) and maximum (α) values of the floating-point range to the minimum and maximum values of the quantized range. This uses the entire quantized range, but the computation is more complex. For example, with unsigned 8-bit integers the quantized range is [0, 255].
- Asymmetric algorithms generally handle unevenly distributed data better. Compared to symmetric quantization, asymmetric quantization typically results in a narrower clipping range, which is particularly important for activations in neural networks, where the values may be severely imbalanced. For example, activations after ReLU are always non-negative; in this case, symmetric quantization would waste one bit of representation power, because only [0, 127] is ever used. Asymmetric quantization can determine the minimum and maximum values based on the actual data distribution, making fuller use of the quantized range and resulting in lower quantization loss.
- The asymmetric quantization algorithm for weight parameters can be divided into two steps:
  - The scaling factor s and the zero offset value z are determined by finding the min and max values in the weight tensor.
  - Convert each value of the weight tensor from FP32 to INT8.

See the figure below for details. round(⋅) and clip(⋅) represent rounding and truncation operations, respectively. s is the data quantization interval, and z is the zero-point offset.

3409
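The difference between the two schemes on non-negative data can be sketched as follows. The code uses the divide-by-scale convention of equations (1) and (2) rather than the f(x) = s · x + z form, and the function names and the evenly spaced "post-ReLU" values are assumptions made for the example.

```python
def symmetric_qparams(values, n_bits=8):
    """Scale quantization: zero point fixed at 0, integer range [-127, 127]."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = max(abs(v) for v in values) / q_max
    return scale, 0, -q_max, q_max


def asymmetric_qparams(values, n_bits=8):
    """Affine quantization: map [min, max] onto the integer range [0, 255]."""
    q_min, q_max = 0, 2 ** n_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (q_max - q_min)
    zero_point = round(q_min - lo / scale)
    return scale, zero_point, q_min, q_max


def quantize(values, scale, zero_point, q_min, q_max):
    return [min(q_max, max(q_min, round(v / scale) + zero_point)) for v in values]


def roundtrip_mse(values, scale, zero_point, q_min, q_max):
    codes = quantize(values, scale, zero_point, q_min, q_max)
    restored = [(c - zero_point) * scale for c in codes]
    return sum((v - r) ** 2 for v, r in zip(values, restored)) / len(values)


# Non-negative data, as produced by ReLU (made-up values for illustration).
activations = [i * 0.01 for i in range(601)]  # 0.00, 0.01, ..., 6.00

sym = symmetric_qparams(activations)
asym = asymmetric_qparams(activations)
sym_codes = set(quantize(activations, *sym))
asym_codes = set(quantize(activations, *asym))
sym_mse = roundtrip_mse(activations, *sym)
asym_mse = roundtrip_mse(activations, *asym)
print(len(sym_codes), len(asym_codes), sym_mse, asym_mse)
```

On this data the symmetric scheme never uses the negative codes, so its step size is about twice that of the affine scheme and its mean squared error is correspondingly larger.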

The image below provides an intuitive explanation of different uniform quantization methods with a bit width of 8. ‘s’ is the scaling factor, and ‘z’ is the zero point. The floating-point grid is black, and the integer quantized grid is blue.

3410

MinMax quantization & Truncated quantization

Clipping range and calibration are important factors in uniform quantization. A narrower clipping range excludes more outliers, while a wider range keeps them at the cost of a larger scaling factor and larger quantization error. Depending on whether the actual values are clipped, uniform quantization can be divided into two categories: MinMax quantization, which preserves the full value range, and clipping-based quantization.

3411

While MinMax quantization allows mapping the full range of vector values, it introduces a major drawback: outliers. If a value in the original data is significantly larger than all others, it can be considered an outlier. If we map outliers to this full range, all smaller values will be mapped to the same or similar lower-order bit representations, losing their distinguishing information. For example, in the image below, the presence of the outlier 500 causes numbers from -255 to 255 to be mapped to an even smaller range.

3412

Therefore, we can choose to clip certain values, that is, define a custom dynamic range that excludes extreme values. Clipping sets a different dynamic range for the original values, and any value outside this range is “clipped,” regardless of its actual size, and mapped to the maximum or minimum value of the target range. In the example below, the dynamic range is manually set to [-255, 255], so all values outside this range are mapped to -127 or 127, regardless of their actual value.

Compared to MinMax quantization, truncation of outliers helps improve accuracy and allocates more bits to intermediate values, significantly reducing quantization error for “non-outliers”. However, it leads to increased quantization error for outliers.

3413
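The outlier example can be reproduced numerically. The sketch below uses symmetric 8-bit quantization with made-up data: values from -255 to 255 in steps of 5 plus the single outlier 500. It compares the MinMax range (±500) with the manually clipped range (±255); the helper names are illustrative only.

```python
def quantize_symmetric(values, clip_abs, n_bits=8):
    """Symmetric quantization over [-clip_abs, clip_abs]; values outside are clipped."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = clip_abs / q_max
    codes = [max(-q_max, min(q_max, round(v / scale))) for v in values]
    return codes, scale


def mean_abs_error(values, codes, scale):
    return sum(abs(v - c * scale) for v, c in zip(values, codes)) / len(values)


inliers = list(range(-255, 256, 5))
outlier = 500
data = inliers + [outlier]

# MinMax quantization: the range is stretched to cover the outlier.
minmax_codes, minmax_scale = quantize_symmetric(data, clip_abs=max(abs(v) for v in data))
# Clipped quantization: the range is manually limited to [-255, 255].
clipped_codes, clipped_scale = quantize_symmetric(data, clip_abs=255)

minmax_inlier_err = mean_abs_error(inliers, minmax_codes[:-1], minmax_scale)
clipped_inlier_err = mean_abs_error(inliers, clipped_codes[:-1], clipped_scale)
minmax_outlier_err = abs(outlier - minmax_codes[-1] * minmax_scale)
clipped_outlier_err = abs(outlier - clipped_codes[-1] * clipped_scale)

print(max(abs(c) for c in minmax_codes[:-1]))  # largest code used by the inliers
print(minmax_inlier_err, clipped_inlier_err)
print(minmax_outlier_err, clipped_outlier_err)
```

Under MinMax the inliers are squeezed into roughly half of the available codes, whereas clipping restores their resolution and moves the entire cost onto the one outlier.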

In existing truncation algorithms, the truncation value is often based on a given numerical value and lacks the ability to learn about the actual distribution of the data. To address this, a learnable truncation method can be adopted, which involves constructing a minimization model of a single-layer quantized MSE to obtain the optimal value of the truncation parameter t.
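The idea of choosing t by minimizing the single-layer quantization MSE can be sketched with a plain grid search. Note that the grid search is a stand-in chosen here for simplicity, not the learnable method itself, and the synthetic weights (Gaussian values plus two outliers at ±20) and the 4-bit setting are assumptions picked to make the effect visible.

```python
import random


def quant_mse(values, t, n_bits=4):
    """MSE after symmetric quantization with clipping threshold t."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = t / q_max
    total = 0.0
    for v in values:
        q = max(-q_max, min(q_max, round(v / scale)))
        total += (v - q * scale) ** 2
    return total / len(values)


# Synthetic layer weights: bell-shaped bulk plus two extreme outliers.
rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(2000)] + [20.0, -20.0]

max_abs = max(abs(w) for w in weights)
# Candidate thresholds from 5% to 100% of the absolute maximum.
candidates = [max_abs * k / 100 for k in range(5, 101)]
best_t = min(candidates, key=lambda t: quant_mse(weights, t))

minmax_mse = quant_mse(weights, max_abs)
best_mse = quant_mse(weights, best_t)
print(best_t, best_mse, minmax_mse)
```

A gradient-based variant would treat t as a trainable parameter and minimize the same objective; the grid search only shows that the minimum lies well inside the MinMax range.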

2.2.3 Quantization bits

Different data types in computers occupy different numbers of bits and represent different data ranges. The original model can be quantized into models with different numbers of bits according to actual business needs. The fewer bits used for quantization, the higher the compression ratio of the quantized model. The most commonly used quantization bit width in industry is currently 8 bits; quantization with fewer than 8 bits is called low-bit quantization. 1 bit is the limit of model compression, which can compress an FP32 model to 1/32 of its size. Efficient XNOR and BitCount bitwise operations can also be used during inference to improve inference speed. The following diagram illustrates the quantization principle and the formulas for 1-bit and 2-bit quantization.

3414
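How XNOR and BitCount replace multiply-accumulate in the 1-bit case can be sketched as follows. The code assumes the common sign-plus-scale formulation, where a vector is approximated as α · sign(W) with α = mean(|W|); the figure's exact formulas are not reproduced here, so treat this as one possible instance. The function names and sample vectors are made up.

```python
def binarize(values):
    """1-bit quantization: pack the signs into an integer, keep one scale alpha."""
    alpha = sum(abs(v) for v in values) / len(values)
    bits = 0
    for i, v in enumerate(values):
        if v >= 0:
            bits |= 1 << i  # bit i set means +1, cleared means -1
    return bits, alpha


def binary_dot(bits_a, bits_b, n):
    """Dot product of two +/-1 vectors using XNOR and a bit count."""
    mask = (1 << n) - 1
    xnor = ~(bits_a ^ bits_b) & mask   # 1 wherever the signs agree
    matches = bin(xnor).count("1")     # BitCount (popcount)
    return 2 * matches - n             # agreements minus disagreements


w = [0.5, -0.3, 0.8, -0.1]
x = [1.0, -2.0, 1.5, 0.5]

w_bits, w_alpha = binarize(w)
x_bits, x_alpha = binarize(x)
sign_dot = binary_dot(w_bits, x_bits, len(w))

# Reference: the same dot product computed on explicit +/-1 values.
sign = lambda v: 1 if v >= 0 else -1
reference = sum(sign(a) * sign(b) for a, b in zip(w, x))

approx_dot = w_alpha * x_alpha * sign_dot
print(sign_dot, reference, approx_dot)
```

The bitwise result matches the explicit ±1 dot product exactly; the rescaled value `approx_dot` is only a coarse approximation of the full-precision dot product, which is the price of 1-bit compression.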

2.2.4 Classification

The paper “LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit” mathematically categorizes quantization algorithms as follows: “Trans.” indicates whether the algorithm is an equivalent transformation. γ is the scaling factor. s and Q represent the transformation vector and matrix, respectively. I is the identity matrix. L is the loss function with a learning rate η. α and β represent the minimum and maximum clipping values. H is the Hessian matrix, and E represents the quantization error calculated using H. λ is the decay coefficient.

3415

2.3 Quantization Targets

The targets of model quantization mainly include the following:

Weights: Quantizing weights is the most common and routine step. Quantizing weights can reduce model size, memory usage, and space consumption.
Activations: In practice, activations often account for a large portion of memory usage. Therefore, quantizing activations not only significantly reduces memory consumption but, more importantly, can be combined with weight quantization to fully utilize integer computation and improve performance.
Gradients: When training deep learning models, gradients are usually floating-point numbers. The main purpose of gradient quantization is to reduce communication overhead in distributed computing, and also to reduce the overhead of the backward pass. Because it is mainly used for training, gradient quantization is less common than the two targets mentioned above, so it is omitted here.
Key-Value Cache: Quantizing key-value caches is crucial for improving the throughput of long sequence generation. We will analyze this in more detail in later chapters.

2.3.1 Weights (and biases)

Weight quantization only quantizes the model’s weights to compress the model size. During inference, the weights are dequantized back to the original float32 data, and the subsequent inference process is consistent with a regular float32 model. The advantage of weight quantization is that in some applications, it eliminates the need for dataset calibration and quantization operators, and the model’s accuracy error is relatively small. Since the actual inference still uses float32 operators, inference performance is not improved. Because the number of biases (in the millions) is far less than the number of weights (in the billions), and biases typically maintain high precision (e.g., INT16), the main work of quantization is focused on the weights.

2.3.2 Activation

Because the model primarily consists of weights and biases, whose values are known before running the model, we can consider the weights and biases of an LLM as static values. Unlike weights, activations change with each input and at each layer during inference, making accurate quantization challenging. Since these values are updated after each hidden layer, they are only known once the input data passes through the model. To quantize activation values, the user needs to provide a calibration dataset to statistically analyze the distribution of activation values at each layer and calibrate the quantized operators. The calibration dataset can come from the training dataset or real-world input data, and is typically very small.

2.3.3 Calibration

The calibration techniques available for selection mainly include:

Max-Min (Maximum and Minimum): This is the simplest and most commonly used sampling method. The basic idea is to take the maximum and minimum values directly from the FP32 tensor to determine the actual dynamic range. For weights, this sampling method is unsaturated. For activations, however, outliers in the sampled data may significantly enlarge the dynamic range. For example, 99% of the data for a given tensor might be evenly distributed in [-100, 100], but if an outlier with a value of 10000 appears during sampling, the resulting dynamic range becomes [-100, 10000].
Moving Average MinMax: Unlike MinMax, which directly replaces the range, MovingAverageMinMax uses a hyperparameter c (default value 0.01 in PyTorch) to update the dynamic range progressively. The dynamic range obtained using this method is generally smaller than the actual dynamic range.
KL divergence: Minimizes the distribution difference between the original and quantized values. It is generally believed that the more similar the distribution of the quantized data is to the distribution before quantization, the less information is lost from the original data, and hence the higher the accuracy of the quantization algorithm. KL distance (also called KL divergence) is generally used to measure the similarity between two distributions. TensorRT uses the KL divergence algorithm for quantization calibration: first, FP32 inference is run on the calibration set, and then the following steps are performed for each layer of the network:
1. Collect the histogram of the activation output.
2. Generate many quantized distributions with different saturation thresholds.
3. Choose the threshold T that minimizes KL_divergence(ref_distr, quant_distr) and determine the Scale from it.
Percentile: Manually select a percentile of the input range, such as 99% or another percentage of the tensor's values, truncating the rest. That is, instead of using the maximum/minimum value, use the i-th largest/smallest value as β/α.
MSE: Optimizes the mean squared error between the original weights and the quantized weights.
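The difference between the range estimators above can be sketched with the outlier example from the Max-Min item. This is only an illustration under simplifying assumptions: the percentile variant trims the same number of values from each side, and the moving-average update rule `r = r + c * (batch_value - r)` follows the description in the text; the function names are made up for the example.

```python
def minmax_range(samples):
    return min(samples), max(samples)


def percentile_range(samples, pct=99.0):
    """Keep the central pct percent of the sorted values and truncate the rest."""
    ordered = sorted(samples)
    drop = int(len(ordered) * (100.0 - pct) / 100.0 / 2)   # values dropped per side
    return ordered[drop], ordered[len(ordered) - 1 - drop]


def moving_average_range(batches, c=0.01):
    """Start from the first batch, then move the range a fraction c toward each new batch."""
    lo, hi = min(batches[0]), max(batches[0])
    for batch in batches[1:]:
        lo += c * (min(batch) - lo)
        hi += c * (max(batch) - hi)
    return lo, hi


# Hypothetical activations: values spread over [-100, 100], plus one outlier.
normal = [float(v) for v in range(-100, 101)]
with_outlier = normal + [10000.0]

print(minmax_range(with_outlier))                      # range dominated by the outlier
print(percentile_range(with_outlier, 99.0))            # outlier truncated
print(moving_average_range([normal, with_outlier]))    # outlier only nudges the range
```

Max-Min stretches the range to 10000, while the percentile and moving-average variants stay close to the bulk of the data, which leaves more quantization levels for typical values.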

2.3.3 Quantization Granularity

A key aspect of quantization methods is the granularity at which the clipping range [α, β] is computed. Different quantization granularities mean computing the quantization scaling factor over different amounts of data; that is, how many values share the same scaling factor. Generally, finer granularity reduces quantization error but requires storing more quantization parameters and introduces higher computational overhead.

The quantization granularity corresponds to which weights/activations are quantized together and share quantization parameters. According to the quantization granularity, we can divide model quantization into per-tensor quantization, per-token quantization, per-channel quantization, and per-group quantization, etc.

Tensor-by-tensor quantization

Per-tensor quantization, also known as layer-by-layer quantization, is the simplest and coarsest-grained quantization method.

This scheme uses a single network layer as the quantization unit, with each layer having one set of quantization parameters (S and Z). For example, a single scaling factor is calculated for the entire activation matrix, and likewise for the entire weight matrix. All elements in the tensor are then quantized to a low-precision format, such as INT8, using this scaling factor. For instance, an INT8 per-tensor dynamic quantizer finds the maximum absolute value of the entire tensor, calculates a scaling factor from it, scales all elements into the INT8 representation range [-127, +127], and rounds them to integers.
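The per-tensor quantizer described above can be sketched in a few lines. This is a minimal illustration assuming symmetric absmax quantization with round-to-nearest; the function names and the sample matrix are invented for the example.

```python
def quantize_per_tensor(matrix, qmax=127):
    """Absmax per-tensor quantization: one scaling factor for the whole tensor."""
    max_abs = max(abs(v) for row in matrix for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [[max(-qmax, min(qmax, round(v / scale))) for v in row] for row in matrix]
    return q, scale


def dequantize(q, scale):
    return [[v * scale for v in row] for row in q]


x = [[0.5, -1.27, 0.02],
     [1.0, 0.64, -0.31]]
q, s = quantize_per_tensor(x)
x_hat = dequantize(q, s)
print(q)
print(s)
print(x_hat)
```

Because every element shares one scale, the reconstruction error of each value is bounded by half a quantization step, and that step is set by the single largest magnitude in the tensor.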

In most computer vision tasks, the activation inputs of a layer are convolved with many different convolutional filters, as shown in the figure below. Each convolutional filter can have a different range of values. Layer-by-layer quantization applies the same clipping range to all convolutional filters belonging to the same layer. While this approach is easy to implement, it often leads to suboptimal accuracy because the range of each convolutional filter can vary greatly. For example, a convolutional kernel with a relatively narrow parameter range (such as Filter 1 in the figure below) may lose its quantization resolution due to another kernel with a larger range in the same layer.

[Figure 3416: layer-by-layer and channel-by-channel quantization of convolutional filters]

Channel-by-channel quantization & Token-by-token quantization

Tensor-by-tensor quantization uses a single scaling factor for the entire matrix, while token-by-token quantization and channel-by-channel quantization set different scaling factors for each token or weight’s output channel.

Per-channel quantization (a channel is a dimension, similar to the embedding dimension or hidden size) is shown in the last column of the convolution example above. That is, each channel in each layer of the network is assigned its own set of quantization parameters. Per-channel quantization achieves higher quantization accuracy due to its finer granularity, but it is also more computationally complex. In LLMs, per-channel quantization is usually applied to the weights, with each column (usually a column along the hidden dimension) corresponding to one set of quantization parameters.
Per-token quantization is similar to per-channel quantization, but it applies to activations (a token here is a text token), and each row (each token) has its own independent set of quantization parameters.

When performing matrix multiplication, more accurate results can be obtained by combining techniques such as row-wise or vector-wise quantization. See the diagram below for details. The left side of the diagram quantizes both activations X and weights W per tensor, normalizing each by the maximum absolute value of the entire matrix. The right side uses a combination of token-wise and channel-wise quantization: the maximum absolute value of each row of X and of each column of W is found, X is normalized row-wise and W column-wise, and the quantized X and W are multiplied to obtain C. Finally, the outer product of the two vectors holding the maximum absolute values of X and W is computed, and a Hadamard product of this outer product with C dequantizes the result back to FP16.

[Figure 3417: matrix multiplication with combined token-wise and channel-wise quantization]
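The combined token-wise and channel-wise scheme can be sketched as follows. This is a simplified illustration assuming symmetric absmax INT8 quantization and small hand-made matrices; the helper names are made up, and dequantization multiplies each entry of the integer result by the matching row scale of X and column scale of W, which is the outer-product step described above.

```python
def absmax_rows(m, qmax=127):
    """Quantize each row with its own scale; returns (integer rows, per-row scales)."""
    scales = []
    for row in m:
        max_abs = max(abs(v) for v in row)
        scales.append(max_abs / qmax if max_abs > 0 else 1.0)
    q = [[round(v / s) for v in row] for row, s in zip(m, scales)]
    return q, scales


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


X = [[0.1, -0.2, 0.3],      # 2 tokens x 3 hidden dims
     [4.0, 5.0, -6.0]]
W = [[0.5, -10.0],          # 3 hidden dims x 2 output channels
     [0.25, 20.0],
     [-1.0, 30.0]]

Xq, sx = absmax_rows(X)                    # per-token scales (rows of X)
Wq_t, sw = absmax_rows(transpose(W))       # per-channel scales (columns of W)
Wq = transpose(Wq_t)

C_int = matmul(Xq, Wq)                     # integer accumulation
# Dequantize: elementwise product of C_int with the outer product of sx and sw.
C = [[C_int[i][j] * sx[i] * sw[j] for j in range(len(sw))] for i in range(len(sx))]
C_ref = matmul(X, W)                       # float reference
print(C)
print(C_ref)
```

The two tokens and the two output channels differ in magnitude by more than an order of magnitude, yet each keeps its own scale, so the dequantized product stays close to the float reference.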

Quantization by group

A coarser-grained variant of per-channel quantization uses different scaling factors for different groups of channels. Group-wise quantization operates on a group basis, with each group using one set of S and Z values; its granularity lies between per-tensor and per-channel. K rows (for activations) or K columns (for weights) share a single quantization factor. When group=1, group-wise quantization is equivalent to layer-wise quantization; when group=num_filters (e.g., depthwise (dw) convolution, where each kernel covers a single channel), group-wise quantization is equivalent to per-channel quantization.

In the convolution example above, multiple channels within a layer can be grouped to compute the clipping range (of the activations or the kernels). This is helpful when the distribution of parameters varies greatly within a single convolution/activation. However, this approach inevitably incurs the additional cost of handling different scaling factors.
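The effect of sharing a scaling factor within groups can be sketched as below. This is a simplified illustration that groups consecutive values of a single vector rather than whole channels, uses symmetric 4-bit quantization (integer range [-7, 7]), and relies on invented sample weights and helper names.

```python
def quantize_groups(row, group_size, qmax=7):
    """Symmetric quantization where every group_size consecutive values share one scale."""
    q, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        q.extend(round(v / scale) for v in group)
    return q, scales


def dequantize_groups(q, scales, group_size):
    return [v * scales[i // group_size] for i, v in enumerate(q)]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical weights whose magnitude differs a lot between the two halves.
w = [0.01, -0.02, 0.03, -0.04, 1.0, -2.0, 3.0, -4.0]

for g in (8, 4):   # one group covering everything vs. two groups of four
    q, s = quantize_groups(w, g)
    print(g, len(s), mse(w, dequantize_groups(q, s, g)))
```

With a single group, the small values in the first half all collapse to zero; with two groups they get their own scale and the error drops, at the cost of storing one extra scaling factor.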

The image below shows a comparison between per-token and group-wise methods.

[Figure 3418: comparison of per-token and group-wise quantization]

Let’s take another look at the OpenPPL-LLM solution and our thoughts.

OpenPPL-LLM introduces two different quantization techniques: Groupwise quantization (usually int4 quantization) and Channelwise quantization (usually int8 quantization). As shown in the diagram, for a 4x4 matrix, Groupwise quantization treats adjacent elements as a group (elements of the same color in the diagram are grouped together), and elements within a group share the quantization parameter. Due to the small number of elements within a group, this quantization scheme has high quantization accuracy, but it imposes a greater computational burden and memory access overhead on the system. Channelwise quantization treats each column of the matrix as a group, and each column shares the quantization parameter. This scheme has slightly lower quantization accuracy, but its memory access mode is beneficial for tensor core operations, fully utilizing the low-precision arithmetic units of the GPU to accelerate computation. Both quantization methods can reduce GPU memory usage.

[Figure 3419: OpenPPL-LLM groupwise and channelwise quantization on a 4x4 matrix]

For inference scenarios with batch sizes below 16, all computational tasks in large language models are memory-bound. The system's bottleneck is not computation but memory access to the parameters, which far exceeds memory access to the activations. Therefore, groupwise quantization is well-suited here: it compresses only the parameter matrices of the matrix multiplications, while computation and input/output stay in FP16. This provides higher quantization accuracy, and the additional computational cost of groupwise quantization does not slow down the overall system.

For inference scenarios with batch sizes greater than 16, groupwise quantization fails to provide acceleration. In this case, the workload gradually becomes compute-bound, requiring the GPU's high-performance low-precision arithmetic units (INT8 Tensor Cores) to accelerate network execution. Groupwise quantization is detrimental to Tensor Core operation, leading to severely non-contiguous memory access and poor performance. Therefore, channelwise quantization is recommended, quantizing both the input and the parameter matrices of each matrix multiplication.

It’s important to note that Groupwise quantization and Channelwise quantization involve different weight preprocessing steps. These two approaches must be determined at service startup and cannot be dynamically switched during model runtime. Therefore, we need to determine the quantization mode at the beginning of service deployment.

Hybrid Quantization

Mixed-precision quantization is a more flexible quantization strategy that allows different parts of the model to be quantized using different bit widths. During the quantization of large models, different linear layers have varying sensitivities to quantization. Based on this, mixed-precision quantization categorizes all linear layers by sensitivity, reverting sensitive layers to higher precision calculations while performing low-precision quantization on insensitive layers. Before quantization, a calibration set is generated through random sampling, and the activation values of all linear layers are collected to calculate the “interquartile range” statistic.

[Figure 3420]

2.3.4 Terminology

Let’s take a look at the terminology next.

Taking W1A16 as an example: W1A16 refers to a specific quantization setting, where “W1” and “A16” represent the quantization bit width of the weight matrix and the activation values, respectively.

“W1” indicates that the weight matrix is quantized to 1 bit. This means that the weights in the model only use two possible values (usually -1 and +1), which greatly reduces the model’s storage requirements and computational complexity. However, this extreme quantization can lead to information loss, requiring special design and techniques to maintain model performance.
“A16” indicates that the activation values maintain 16-bit precision. Activation values are data passed between layers in a neural network. Maintaining 16-bit precision helps reduce errors introduced by quantization and helps the model maintain a certain level of computational accuracy.
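As a quick illustration of the naming scheme, the sketch below parses a setting string and estimates weight storage for a 13B-parameter model (the LLaMA-13B example from the overview). The helper names are invented for this example, and the estimate ignores the extra storage for scaling factors and zero-points.

```python
import re


def parse_setting(name):
    """Split a setting such as 'W4A16' into (weight_bits, activation_bits)."""
    match = re.fullmatch(r"W(\d+)A(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized quantization setting: {name}")
    return int(match.group(1)), int(match.group(2))


def weight_gigabytes(num_params, weight_bits):
    """Weight storage in GB (10^9 bytes), ignoring quantization parameters."""
    return num_params * weight_bits / 8 / 1e9


params = 13e9   # 13B parameters
for setting in ("W16A16", "W8A8", "W4A16", "W1A16"):
    w_bits, a_bits = parse_setting(setting)
    print(setting, w_bits, a_bits, weight_gigabytes(params, w_bits))
```

Under these assumptions the weights shrink from about 26 GB at 16 bits to about 6.5 GB at 4 bits and about 1.6 GB at 1 bit, while the activation bit width only affects the arithmetic and the activation memory.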

2.4 Quantization Workflow

We have introduced calibration above. In practice, there are three methods to convert a floating-point model to a quantized model:

Data free: Traditional methods directly convert floating-point parameters into quantized numbers without using a calibration set. They are very simple to use but generally result in a significant loss of accuracy. However, Qualcomm’s latest paper, DFQ, demonstrates that high accuracy can be achieved without using a calibration set.
Calibration: Calibration-set-based schemes perform statistical analysis by feeding in a small amount of real data.
Finetune: Training-based finetune approaches simulate and model the quantization error during training, adjusting the weights to better suit quantization. The advantage is a greater improvement in accuracy; the disadvantage is that the model training code must be modified, resulting in a longer development cycle.

Based on the stage at which quantization is applied, quantization methods can be classified into the following three types:

Quantization-Aware-Training (QAT) incorporates pseudo-quantization operators during model training, seamlessly integrating the quantization target into the training process. It then uses training data for fine-tuning, allowing the model to adjust and learn under the constraints of the quantized values. By statistically analyzing the range of input and output data during training, it improves the accuracy of the quantized model. This approach is suitable for scenarios requiring high model accuracy.
Post-Training-Quantization (PTQ) quantizes the parameters of the model after training, requiring only a small amount of calibration data to calculate the clipping range and scaling factor. The main advantages of PTQ are its simplicity and efficiency, requiring no modification to the LLM architecture or retraining. It is suitable for scenarios where ease of use is paramount and training resources are limited. However, PTQ may introduce some accuracy loss during the quantization process.
Quantization-Aware Fine-tuning (QAF) applies quantization during the fine-tuning of a pre-trained model. The primary goal is to ensure that the fine-tuned LLM retains its performance after quantization to a lower bit width. By integrating quantization awareness into fine-tuning, a balance is struck between model compression and performance preservation.

Because QAF is relatively niche, we will focus on introducing QAT and PTQ.

The image below compares QAT and PTQ. In QAT, the pre-trained model is quantized and then fine-tuned using the training data to adjust parameters and recover from accuracy drops. In PTQ, the pre-trained model is calibrated using calibration data (e.g., a small subset of the training data) to calculate the clipping range and scaling factor. The model is then quantized based on the calibration results. Essentially, PTQ studies different metrics during the calibration process to better select the upper and lower bounds of the clipping range.

In practical applications, PTQ is more commonly used. Most chip manufacturers’ compilers have integrated basic PTQ methods, which can be combined with operator fusion graph optimization and other techniques to achieve satisfactory results in terms of both accuracy and speed for most models. PTQ also has the added advantage of being applicable to situations with limited or unlabeled data. However, compared to QAT, this usually comes at the cost of lower accuracy, especially for low-precision quantization. Furthermore, PTQ has a drawback: the quantization does not take into account the actual training process.

[Figure 3421: comparison of QAT and PTQ]

2.4.1 PTQ

PTQ involves quantizing the model’s parameters (including weights and activations) after the model has been trained.

Weight quantization can be performed before inference using either symmetric or asymmetric quantization because, in most cases, the parameters are fixed during inference, and the clipping range can be statically determined beforehand. Activation quantization, however, requires running inference on the model to observe the activation distributions, since the activations differ for each input sample and their ranges are not known beforehand.

This leads to two forms of activation quantization, static and dynamic, distinguished by whether the quantization parameters are computed during inference. Static quantization pre-calculates the quantization parameters before inference, finding typical activation statistics offline using calibration samples. Dynamic quantization calculates the quantization parameters at runtime from the statistics of the actual input, typically offering greater accuracy but requiring higher overhead to compute those statistics.

From an architectural design perspective, static quantization achieves optimal inference performance through pre-computation. This approach excels in latency stability and resource efficiency, making it particularly suitable for edge computing and large-scale server deployments. Dynamic quantization, on the other hand, offers greater flexibility through runtime adaptive mechanisms, better handling scenarios with drastically changing data distributions. This provides unique advantages in fields such as natural language processing.

Dynamic quantization

Dynamic quantization is a technique that performs quantization at runtime. It pre-quantizes only the weights; for activation values, the quantization parameters are dynamically calculated at runtime—that is, the range of each activation is dynamically calculated during runtime. The specific process is as follows:

After the data passes through the hidden layer, its activation values are collected.
The distribution of these activation values is then used to calculate the zero-point (z) and scaling factor (s) values required for the quantized output.
This process is repeated each time data passes through a new layer. Each layer has its own z and s values, and therefore a different quantization scheme.

Dynamic quantization incurs significant computational overhead per operation, making it slower than static quantization and unusable on certain hardware. However, this approach is simple, flexible, and highly effective (with higher accuracy), making it particularly suitable for scenarios where the input data distribution varies considerably.
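The runtime steps above can be sketched as follows. This is a minimal illustration that assumes unsigned 8-bit asymmetric quantization with the convention x ≈ s * (q - z) and a min-max range; the function names and sample activations are invented for the example.

```python
def compute_qparams(values, qmin=0, qmax=255):
    """Asymmetric parameters from the observed range, using x ≈ s * (q - z)."""
    lo = min(min(values), 0.0)            # extend the range so that 0.0 is representable
    hi = max(max(values), 0.0)
    s = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    z = max(qmin, min(qmax, round(qmin - lo / s)))
    return s, z


def quantize(values, s, z, qmin=0, qmax=255):
    return [max(qmin, min(qmax, round(v / s) + z)) for v in values]


def dequantize(q, s, z):
    return [s * (v - z) for v in q]


def dynamic_quantize(activations):
    """Dynamic quantization: s and z are recomputed from each incoming tensor."""
    s, z = compute_qparams(activations)
    return quantize(activations, s, z), s, z


small = [0.1, -0.2, 0.05, 0.3]
large = [10.0, -25.0, 40.0, 5.0]
for acts in (small, large):
    q, s, z = dynamic_quantize(acts)
    print(s, z, dequantize(q, s, z))
```

Each tensor gets a scale matched to its own range, which is where the accuracy benefit comes from; the calls to `min` and `max` on every tensor are the per-operation overhead mentioned above.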

Static quantization

Static quantization determines the quantization parameters before model inference, hence the name “static” quantization. This method pre-calculates the quantization parameters (z and s) of the activations before inference. To find these values, a representative dataset (the calibration dataset) is fed to the model to collect the distribution of the activation values. The actual steps are as follows:

After the model is trained, an observer is placed on the activation function to record the activation values.
Perform several forward propagations using the calibration dataset (a few hundred samples are sufficient), execute the inference process, statistically analyze the data distribution of activation values at each layer, and obtain the corresponding quantization parameters.
After collecting these values, the s and z values required for quantization during inference can be calculated and then stored as part of the parameters.
During actual inference, the s and z values are not recalculated but are used globally to quantize all activations. Because only one set of s and z values was computed during calibration, a significant change in the distribution of activation values during actual inference could lead to higher quantization errors.
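The calibration steps above can be sketched with a small observer. This is a simplified illustration assuming a min-max observer and unsigned 8-bit asymmetric quantization with x ≈ s * (q - z); `RangeObserver` and the sample batches are hypothetical and are not the API of any framework.

```python
class RangeObserver:
    """Hypothetical observer that records the min and max of the activations it sees."""

    def __init__(self):
        self.lo, self.hi = float("inf"), float("-inf")

    def observe(self, values):
        self.lo = min(self.lo, min(values))
        self.hi = max(self.hi, max(values))

    def qparams(self, qmin=0, qmax=255):
        lo, hi = min(self.lo, 0.0), max(self.hi, 0.0)
        s = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
        z = max(qmin, min(qmax, round(qmin - lo / s)))
        return s, z


def quantize_dequantize(values, s, z, qmin=0, qmax=255):
    q = [max(qmin, min(qmax, round(v / s) + z)) for v in values]
    return [s * (v - z) for v in q]


# Calibration: a few forward passes over hypothetical calibration activations.
observer = RangeObserver()
for batch in ([0.2, -0.5, 1.0], [0.8, -1.0, 0.3], [-0.7, 0.9, 0.1]):
    observer.observe(batch)
s, z = observer.qparams()        # frozen and stored with the model

# Inference: the stored s and z are reused for every input.
in_range = quantize_dequantize([0.5, -0.9], s, z)
shifted = quantize_dequantize([5.0, -0.9], s, z)   # 5.0 is outside the calibrated range
print(in_range)
print(shifted)
```

Inputs that resemble the calibration data are reconstructed closely, while the value 5.0 is clipped to roughly the calibrated maximum of 1.0, which is the distribution-shift risk described in the last step.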

Summary

Dynamic quantization is typically more accurate because it computes the s and z values for each hidden layer from the actual input. However, this increases computation time because these values must be calculated at runtime. Static quantization, while less accurate, is faster because the s and z values used for quantization are already known, and it is therefore the more commonly used of the two.

2.4.2 QAT

Sometimes, when a trained model has a poor numerical distribution and various PTQ strategies fail to achieve good results, QAT (Quantization-Aware Training) is needed. This involves introducing quantization error during training or fine-tuning to constrain the numerical distribution and obtain better quantization results. QAT may also be necessary for a few special model structures that require lower-bit quantization.

Unlike Post-Training Quantization (PTQ), which performs quantization after model training is complete, QAT learns the quantization process (simulates quantization) during training. It uses pseudo-quantization operators to incorporate the accuracy loss caused by quantization into the training error, enabling the optimizer to minimize quantization error during training and obtain higher model accuracy.

QAT is generally more accurate than PTQ because the quantization process is taken into account during training. It works as follows:

Initialization: Set the initial values $q_{min}$ and $q_{max}$ for the range of the weights and activation values.
Constructing a simulated quantization network: Insert pseudo-quantization operators after the weights and activation values that need to be quantized. A pseudo-quantization operator first quantizes the weights to, for example, INT4, and then dequantizes them back to FP32. This allows the model to take quantization into account during training, loss calculation, and weight updates. QAT attempts to find a “wide” minimum in the loss to minimize quantization error, as “narrow” minima tend to lead to larger quantization errors. Choosing updated weights within a “wide” minimum significantly reduces the quantization error.
Quantization training: Repeat the following steps until the network converges: calculate the range $q_{min}$ and $q_{max}$ of the weights and activation values of the quantized network layers, and incorporate the quantization loss into the forward inference and backward parameter update processes based on this range.
Export the quantization network: Obtain $q_{min}$ and $q_{max}$ and calculate the quantization parameters s and z; substitute the quantization parameters into the quantization formula and convert the weights in the network to quantized integer values; delete the pseudo-quantization operators and insert quantization and dequantization operators before and after the quantized network layers, respectively.

Therefore, although PTQ has lower loss in high precision (e.g., FP32), QAT will achieve lower loss in low precision (e.g., INT4).
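The pseudo-quantization operator used in QAT can be sketched as a quantize-then-dequantize function. The backward rule shown here, a straight-through estimator that passes the gradient unchanged inside the representable range and zeroes it outside, is a common choice but an assumption of this example, as are the scale and the sample weights.

```python
def fake_quantize(values, s, z, qmin, qmax):
    """Pseudo-quantization: quantize to integers, then dequantize back to float."""
    out = []
    for v in values:
        q = max(qmin, min(qmax, round(v / s) + z))
        out.append(s * (q - z))
    return out


def ste_grad(values, upstream, s, z, qmin, qmax):
    """Straight-through estimator: keep the gradient where v lies inside the range."""
    lo, hi = s * (qmin - z), s * (qmax - z)
    return [g if lo <= v <= hi else 0.0 for v, g in zip(values, upstream)]


# Signed INT4 (q in [-8, 7]), symmetric (z = 0), hypothetical scale s = 0.25.
s, z, qmin, qmax = 0.25, 0, -8, 7
w = [0.30, -0.62, 1.74, 3.0]

w_fq = fake_quantize(w, s, z, qmin, qmax)          # values the forward pass sees
grads = ste_grad(w, [1.0, 1.0, 1.0, 1.0], s, z, qmin, qmax)
print(w_fq)    # [0.25, -0.5, 1.75, 1.75]
print(grads)   # [1.0, 1.0, 1.0, 0.0]
```

The forward pass therefore computes the loss with the rounded and clipped weights, so the optimizer sees the quantization error, while the underlying FP32 weights keep receiving gradient updates.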

[Figure 3422]

2.4.3 Recommended Workflow

The paper “Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation” provides a recommended quantization workflow, as shown in the figure below.

[Figure 3423: recommended quantization workflow]

2.5 Reasons for acceleration

Why do quantized models run faster? The reasons are roughly as follows: faster memory access, faster vectorized operations, and optimized quantization operators.

2.5.1 Memory Access Acceleration

The memory usage of a quantized model is proportional to the number of quantized bits (ignoring the additional memory for quantization parameters), which is extremely beneficial for models with memory access bottlenecks. Modern high-performance computing chips are highly optimized for parallel computations such as multiplication and addition, making memory read/write operations the bottleneck. Converting a floating-point model to low-bit values through quantization significantly reduces memory access time. It’s important to note that in some schemes, quantization only applies to weights, so in actual inference, dequantization is required to obtain the floating-point weights before calculation, thus increasing the computational load. However, the speed improvement in memory access results in a considerable overall acceleration of model inference. This also demonstrates another significant advantage of model quantization: saving memory.

2.5.2 Accelerating Operations via Vectorization

In high-performance CPUs or GPU tensor cores, efficient vectorized computation instructions are typically designed to enable parallel computing for large-scale operations. Combining vectorized computation with model quantization yields excellent results. For example, a 512-bit register can process 16 32-bit float numbers simultaneously; if quantized to 8-bit fixed-point numbers, the same resources can process 64 numbers concurrently.
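The register arithmetic above is simple enough to check directly. The helper name below is made up for this example.

```python
def lanes(register_bits, element_bits):
    """Number of elements one vector register processes at once."""
    return register_bits // element_bits


fp32_lanes = lanes(512, 32)
int8_lanes = lanes(512, 8)
print(fp32_lanes, int8_lanes, int8_lanes // fp32_lanes)   # 16 64 4
```

Going from 32-bit floats to 8-bit integers quadruples the number of elements handled per instruction on the same register width.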

2.5.3 Quantization Operator Optimization

We’ll take CUTLASS’s FPA_INTB GEMM operator as an example. This operator implements efficient weight-only quantization for LLMs and is widely used. Common GEMM operators multiply tensors of equal precision, but in large-model quantization, weight-only quantization is more common because of model accuracy constraints. This raises the question of how to perform GEMM between quantized INT4/INT8 weights and floating-point activations. CUTLASS fuses the DeQuant and GEMM operations into a single operator and designs an efficient bitwise DeQuant implementation; the paper reports up to a 26x throughput speedup. For details, please refer to the paper “Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production”.
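The idea of fusing dequantization into the multiplication can be sketched on a single dot product. This is only a conceptual illustration in plain Python, not the CUTLASS kernel or its API: it assumes signed INT4 weights packed two per byte with a single per-tensor scale, and all function names are invented for the example.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first."""
    if len(values) % 2:
        values = values + [0]
    return bytes((values[i] & 0xF) | ((values[i + 1] & 0xF) << 4)
                 for i in range(0, len(values), 2))


def unpack_int4(packed, i):
    """Read the i-th signed 4-bit value using shifts and masks."""
    nibble = (packed[i // 2] >> (4 * (i % 2))) & 0xF
    return nibble - 16 if nibble >= 8 else nibble


def fused_dot(activations, packed_weights, scale):
    """Dequantize each weight on the fly inside the dot product."""
    acc = 0.0
    for i, a in enumerate(activations):
        acc += a * (unpack_int4(packed_weights, i) * scale)
    return acc


w_int4 = [3, -8, 7, -1, 0, 5]
packed = pack_int4(w_int4)               # 3 bytes for 6 weights
acts = [0.5, 1.0, -2.0, 4.0, 3.0, 0.25]
scale = 0.1

result = fused_dot(acts, packed, scale)
reference = sum(a * w * scale for a, w in zip(acts, w_int4))
print(result, reference)
```

The weights are read from memory in packed 4-bit form and only expanded to floating point in registers, so no full-precision copy of the weight matrix is ever written back, which is where the memory-access savings come from.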

0xFF Reference

NeurIPS 2024 Oral: Implementing State-of-the-art 4-bit Quantization of Barley with DuQuant

The Art of Compressed Neural Networks: An Analysis of Yixin Based on Two Classic Papers by Professor Song Han from MIT

From Training Dynamics to Outlier: Numerical Characteristic Analysis of LLM Model Training Process (Reiase)

[Read a Paper ] A Survey of Quantization Methods for Efficient Neural Network Inference

A Survey of Quantization Methods for Efficient Neural Network Inference

https://www.armcvai.cn/2023-03-05/model-quantization.html

A Review of Model Quantization Techniques: Revealing Cutting-Edge Technologies for Large Language Model Compression (DeepHub IMBA)

Large Model Performance Optimization (Part 1): Quantization Starting with Half-Precision, Understanding fp32, fp16, and bf16

A convenient post-training quantization solution: GPTQ

[AI Unraveling the Mysteries] The Principles, Current Status, and Future Prospects of Model Quantization Technology - Long Peng - Three Mottos

AWQ, Activation-aware Weight Quantization

LLM.int8()

SqueezeLLM

SmoothQuant

GPTQ

A pioneering work in large-scale model quantization perception training: LLM-QAT – eating jelly without spitting out the jelly shell.

ZeroQuant series (v1, v2)

LLM-QAT: Data-Free Quantization Aware Training for Large Language Models

https://github.com/facebookresearch/LLM-QAT

Summary of Quantization Algorithms for Large Model Inference (by Sun Peiqin)

Wang Wenguang’s lengthy article reveals the secrets of the GPTQ method for large model quantization: from OBS through OBQ to GPTQ, the magic of the Hessian matrix.

Wang Wenguang’s lengthy article reveals the secrets of large-scale model quantization technology: exploring the principles and understanding the most important technology for efficient inference in large-scale models.

akaihaoshuai: Implementing LLM from Scratch: 6. Model Quantization Theory + Code Practice (LLM-QAT/GPTQ/BitNet 1.58Bits/OneBit) akaihaoshuai: Implementing LLM from Scratch: 6.1. Model Quantization (AWQ/SqueezeLLM/Marlin)

https://zhuanlan.zhihu.com/p/703513611

A Technical Overview of Efficient Inference in Large AI Models! (AI Large Model Frontiers)

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Braverman, V., Beidi Chen, & Hu, X. (2023). KIVI: Plug-and-play 2bit KV Cache Quantization with Streaming Asymmetric Quantization.

Databricks Blog: LLM Inference Performance Engineering: Best Practices

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, & Amir Gholami. (2024). KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization.

T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.

A. Gholami, S. Kim, Z. Dong, Z. Yao, MW Mahoney, and K. Keutzer, (2021). A Survey of Quantization Methods for Efficient Neural Network Inference.

KIVI: A Tuning-Free Asymmetric 2bit Quantization for kv Cache: https://arxiv.org/abs/2402.02750

Xiao, Guangxuan, et al. “Smoothquant: Accurate and efficient post-training quantization for large language models.” International Conference on Machine Learning. PMLR, 2023.

Ashkboos, Saleh, et al. “Quarot: Outlier-free 4-bit inference in rotated llms.” arXiv preprint arXiv:2404.00456 (2024).

Sun, Mingjie, et al. “Massive Activations in Large Language Models.” arXiv preprint arXiv:2402.17762 (2024).

Liu, Ruikang, et al. “IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact.” arXiv preprint arXiv:2403.01241 (2024).

Liu, Zechun, et al. “SpinQuant: LLM quantization with learned rotations.” arXiv preprint arXiv:2405.16406 (2024).

A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B

[2411.02355] “Give Me BF16 or Give Me Death”? Accuracy-Performance Trade-Offs in LLM Quantization

Integer Quantization for Deep Learning Inference Principles and Empirical Evaluation

Deployment and Inference Principles and Empirical Verification of Deep Learning Int8

Quantization and training of neural networks for efficient integer-arithmetic-only inference

Quantizing deep convolutional networks for efficient inference: A whitepaper

Discovering low-precision networks close to full-precision networks for efficient embedded inference

Pact: Parameterized clipping activation for quantized neural networks

TensorFlow Model Optimization: Model Quantization - Zhang Yixin

Practical Guide: A Comprehensive Summary of Deep Learning Model Quantization (Low-Precision Inference)

The Super Weight in Large Language Models

20,000 words of analysis on quantization technology, the most comprehensive analysis of large-scale model quantization technology on the entire internet. (Baiqi Yuewen)

A Plain Language Explanation of Scaling Laws for Precision: Justin’s Surname Isn’t Ding

Efficient Deep Learning - Study Notes - 4 - Model Quantization Backtracking Cat

LLMs Quantification Series | LLMs Quantization Need What? (The Cat of Retrospection)

LLMs Quantization Series | MiLo: How to use LoRA to compensate for the quantization loss of the MoE model (backtracking cat)

LLMs Quantitative Analysis Series | Summary of LLM Quantitative Methods ( Backtracking Cat)

Xingyueye ‘s Years of Model Quantization

[LLM Quantitative Series] The VQ Journey: Killua’s Ascent from AQLM, GPTVQ to VPTQ

[LLM Quantitative Series] PTQ Quantitative Classic Research Analysis: Killua’s Advance

AttnSink related papers shared by Jin Qin

Introduction to decoupleQ 2-bit quantization technology: Smooth and fluid operation

Latest discovery: Large-scale values, the key to attention mechanisms. ICML2025 AI Cat Repair Prompt

Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding

ICML25 research finds: RoPE has made a significant contribution again! A small eggplant…

Robust Quantization

OptiQuant – Ascend-friendly DeepSeek model quantization technology