
Exploring the Transformer Series (22) --- LoRA

LoRA: PEFT, low-rank adaptation, rank, initialization, implementation, optimization, and continual learning.


0x00 Overview

Large Language Models (LLMs) have achieved remarkable success in various natural language processing tasks, driving breakthroughs in language understanding, generation, and reasoning capabilities. Similar to self-supervised learning methods in other fields, LLMs are typically pre-trained on large amounts of unlabeled text data and then fine-tuned for specific downstream tasks to adapt their knowledge to the target domain. However, the sheer size of LLMs, often reaching billions of parameters, presents significant challenges in terms of computational complexity and resource requirements during the fine-tuning process.

To address these challenges, Parameter-Efficient Fine-Tuning (PEFT) adapts large language models to downstream tasks while training only a small number of parameters, thereby reducing computational and memory overhead. Among PEFT methods, Low-Rank Adaptation (LoRA) has attracted considerable attention due to its effectiveness and simplicity.

2201

The core idea of LoRA is to use low-rank matrices to approximate the change in model parameters, so that a large model can be trained indirectly through a very small number of parameters. LoRA freezes the weights of the pre-trained model and updates only these low-rank matrices during fine-tuning, adapting to a specific task while leaving the pre-trained parameters unchanged. The goal is to approach the effect of full fine-tuning with minimal resources, reducing storage and computational requirements while maintaining model performance.

2202

This article is primarily based on two papers:

  • A Survey on LoRA of Large Language Models
  • Low-Rank Adaptation for Foundation Models: A Comprehensive Review

Note: The complete list of articles is here. It’s estimated to eventually have around 35 articles. The list will be updated after each subsequent article is published.
Cnblogs Exploring Transformer Series: Article List


0x01 Background Knowledge

1.1 Fine-tuning

As open-source pre-trained large language models (LLMs) become more powerful and accessible, more and more developers are incorporating them into their projects. Pre-trained LLMs are often referred to as base models (large-scale neural networks trained on diverse, large-scale datasets) because of their versatility across various tasks. However, due to the knowledge boundaries of LLMs, the capabilities of base models remain limited on certain downstream tasks. To extend these knowledge boundaries, fine-tuning of LLMs on downstream tasks is still necessary, that is, adjusting pre-trained LLMs for specific datasets or tasks. Furthermore, training large language models consumes significant computational resources and time. This creates a bottleneck for the development of artificial intelligence and raises environmental issues. To alleviate this problem, people often choose to fine-tune pre-trained models.

Fine-tuning allows models to adapt to specific domains without the need for expensive pre-training. However, traditionally, adapting a pre-trained model to a specific downstream task requires full fine-tuning of all parameters. For larger models, updating all layers remains computationally expensive, and full fine-tuning of large models can easily result in excessive memory consumption. As the complexity and scale of these models increase, this traditional fine-tuning approach becomes unfeasible in terms of computation and resources.

1.2 PEFT

To address these challenges, more parameter-efficient fine-tuning techniques have emerged, collectively known as PEFT (Parameter-Efficient Fine-Tuning). PEFT methods have become standard practice for resource-constrained institutions and researchers fine-tuning large models. The general idea is to freeze the core parameters of the large model and introduce a small set of trainable parameters as adaptation modules. By fine-tuning a small number of (additional) model parameters or reducing the number of iterations, the LLM can be adapted to downstream tasks with far lower computational requirements and, ideally, little loss in task performance. This saves GPU memory and parameter storage during fine-tuning and lowers fine-tuning costs. While these PEFT methods have great potential, they often require trade-offs between efficiency, performance, and adaptability, leaving considerable room for optimization.

There is no unified standard for classifying PEFT. Here, we adopt the approach from the paper “Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey,” which roughly divides PEFT strategies into four categories:

  • Additive PEFT modifies the model architecture by injecting new trainable modules or parameters;
  • Selective PEFT makes a subset of the existing parameters trainable during fine-tuning;
  • Reparameterized PEFT constructs a (low-dimensional) reparameterization of the original model parameters for training and then equivalently transforms it back for inference;
  • Hybrid PEFT combines the advantages of different PEFT methods to construct a unified PEFT model.

The following figure provides an overview of the different types of PEFT algorithms.

2203

The image below provides a detailed classification.

2204

Furthermore, some of the PEFT methods described above can be used in combination. For example, in the diagram below, all learnable components are red and frozen components are gray. LoRA is applied to Q, K, and V, an adapter is used on the FFN, and Soft-Prompt adjusts the input activations of each decoder layer.

2205

1.3 Rank

Rank refers to the rank of a matrix, that is, how many rows (or columns) of the matrix are linearly independent, meaning they cannot be obtained as linear combinations of the other rows (or columns). For example:

$$\begin{pmatrix} 1 & 2 & 3 \\ 3 & 6 & 9 \end{pmatrix}$$

The second row is three times the first row, so the rank of the matrix is 1. As for the columns, the second column is twice the first column, and the third column is three times the first column, so the rank is still 1.

And the following matrix:

$$\begin{pmatrix} 1 & 3 & 2 \\ 0 & 2 & 2 \\ 1 & -1 & -2 \end{pmatrix}$$

The second row cannot be formed from the first row, so the rank is at least 2. At first glance, the third row seems unrelated to the first and second rows, but upon closer inspection, the third row is the first row minus twice the second row, so the rank of this matrix is 2. The same applies to the columns: the second column is the sum of the first and third columns.

In fact, the rank of a matrix is always the same whether it is calculated based on the number of rows or columns. This also explains why the rank of a matrix is always less than or equal to the smaller of the number of rows or columns.
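The rank claims above can be checked numerically. The following is a minimal sketch using NumPy; the matrices `M1` and `M2` are illustrative values chosen to satisfy the row and column relations described above, not values taken from an external source.

```python
import numpy as np

# Illustrative matrices: in M1 the second row is 3x the first;
# in M2 the third row equals row 1 minus twice row 2.
M1 = np.array([[1, 2, 3],
               [3, 6, 9]])
M2 = np.array([[1, 3, 2],
               [0, 2, 2],
               [1, -1, -2]])

rank_m1 = np.linalg.matrix_rank(M1)
rank_m2 = np.linalg.matrix_rank(M2)

# Row rank equals column rank, so transposing does not change the rank.
rank_m2_t = np.linalg.matrix_rank(M2.T)

print(rank_m1, rank_m2, rank_m2_t)
```

Note that the rank of `M1` cannot exceed 2, the smaller of its two dimensions.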

1.4 SVD Decomposition

Since networks can always be described using the language of matrices and tensors, linear algebra provides an important tool for studying network properties. Singular Value Decomposition (SVD) is a matrix decomposition technique in linear algebra. SVD decomposes a matrix into a sum of components of varying importance. SVD is often used for dimensionality reduction and compression; by preserving larger singular values, the original matrix can be approximated.

In SVD, given a real matrix $W$ of size $m \times n$, the output of SVD on $W$ is:

$$W = U \Sigma V^T$$

That is, the original weight matrix is decomposed into three main components, which together cover the entire original matrix space.

  • $U$: the left singular vectors, forming an orthogonal basis of the column space, with a size of $m \times m$;
  • $\Sigma$: a diagonal matrix whose diagonal elements are called singular values. Singular values measure the strength or importance of each principal axis and set the scaling within a subspace; its size is $m \times n$;
  • $V$: the right singular vectors, forming an orthogonal basis of the row space, with a size of $n \times n$.

Expanding the three matrices, we get:

$$W = \begin{pmatrix} u_1 & u_2 & \cdots & u_m \end{pmatrix} \begin{pmatrix} \sigma_1 & & \\ & \sigma_2 & \\ & & \ddots \end{pmatrix} \begin{pmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_n^T \end{pmatrix}$$

Since the inverse of an orthogonal matrix is its transpose, $U^T U = I_m$ and $V^T V = I_n$. The subscripts of the two identity matrices indicate their sizes, $m \times m$ and $n \times n$, respectively.

Furthermore, $W$ can be written as a sum of rank-one terms, one per singular value and its corresponding pair of singular vectors, $W = \sum_i \sigma_i u_i v_i^T$, and this decomposition reveals the importance of each term. If a few singular values already account for 90% of the sum of all singular values, we only need to save those singular values and their singular vectors to recover roughly 90% of the matrix.

The geometric meaning of SVD is essentially a change of basis (expressing the vector in the orthogonal basis given by $V$, which corresponds to multiplying it by $V^T$), followed by stretching or compressing (multiplying by $\Sigma$), followed by a rotation (multiplying by $U$). In other words, mapping one vector to another amounts to decomposing the vector onto $V$, applying the stretching described by $\Sigma$, and then re-expressing the result in $U$, which gives the new vector.
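To make the compression idea concrete, here is a small sketch of a truncated SVD in NumPy. The matrix sizes, the noise level, and the choice `k = 2` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a 6x5 matrix that is exactly rank 2, then add a little noise.
W_clean = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 5))
W = W_clean + 1e-3 * rng.normal(size=W_clean.shape)

# Full SVD: U is 6x6, S holds the 5 singular values, Vt is V^T (5x5).
U, S, Vt = np.linalg.svd(W, full_matrices=True)

# Keep only the k largest singular values and their singular vectors.
k = 2
W_k = (U[:, :k] * S[:k]) @ Vt[:k, :]

# Relative reconstruction error in the Frobenius norm.
rel_err = np.linalg.norm(W - W_k) / np.linalg.norm(W)
# Share of the singular-value sum that the kept values account for.
kept_share = S[:k].sum() / S.sum()

print(f"relative error: {rel_err:.2e}, kept share: {kept_share:.4f}")
```

Because the matrix is almost rank 2, two singular values carry nearly all of the weight and the rank-2 reconstruction is very close to the original.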

0x02 LoRA

2.1 Definition

LoRA is an abbreviation for Low-Rank Adaptation, a fine-tuning method used to reduce memory requirements.

At an abstract level, the base model can be viewed as a function:

$$y = f(x; W)$$

The function $f$ processes the input $x$ and outputs $y$. $W$ is the model's weights, which can also be considered the large model itself. For current large language models, $W$ is a set of billions to hundreds of billions (or even trillions) of floating-point numbers, and it is where the large model exerts its "magic." Generally speaking, training a large model means learning the model's weights from a large-scale corpus. Formally, the weights are adjusted iteratively using the dataset, that is:

$$W \leftarrow W + \Delta W$$

This continues until a good $W$ is obtained. $\Delta W$ is the change in weights computed from the loss function at each training step.

LoRA freezes the model weight parameters of a base large model and introduces low-rank matrices to approximate the changes in model parameters. The logic behind LoRA is that low-rank matrices can be used to efficiently capture adaptations for a specific task. The newly added knowledge represents only a small fraction compared to the original weights. Therefore, LoRA freezes the original matrix weight parameters of a pre-trained model and uses a low-rank decomposition to simulate the weight changes of layers during training. This allows for indirect training of large models with a minimal number of parameters. That is, by constraining the update matrix to low rank, matrix decomposition reduces the amount of parameter learning. By optimizing these low-rank matrices for each task and freezing the original model parameters, LoRA achieves efficient adaptation and can combine adaptations for multiple tasks without increasing inference latency.

LoRA’s design not only reduces the number of parameters but also maintains the original structure and performance of the model. Since the original model parameters remain unchanged, and only a small number of trainable parameters are added to adapt to new tasks, LoRA can be flexibly applied to different tasks without affecting the model’s original capabilities.

The figure below illustrates the difference between full fine-tuning and LoRA. The dimensions of $A$ and $B$ are $r \times d$ and $d \times r$, respectively, where $r$ is much smaller than $d$, and $d$ is the dimension of the original weight matrix. Therefore, the number of parameters for fine-tuning is reduced from $d \times d$ to $2 \times r \times d$, significantly reducing the number of parameters and computational cost.

2206
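As a quick worked example of the parameter saving, the sketch below compares the trainable parameter counts for a square $d \times d$ weight matrix. The values `d = 4096` and `r = 8` are illustrative assumptions, not figures from the papers discussed here.

```python
def lora_param_counts(d: int, r: int) -> tuple[int, int]:
    """Trainable parameters for a square d x d weight matrix."""
    full = d * d          # full fine-tuning updates every entry
    lora = 2 * d * r      # LoRA trains one d x r and one r x d matrix
    return full, lora


full, lora = lora_param_counts(d=4096, r=8)
ratio = lora / full

print(full, lora, ratio)  # 16777216 65536 0.00390625
```

With these values LoRA trains 1/256 of the parameters that full fine-tuning of the same matrix would update.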

2.1.1 Training

Suppose we want to fine-tune a pre-trained language model for a downstream task. LoRA adds a side matrix next to the original matrix and performs a dimensionality reduction followed by a dimensionality increase on this side branch (that is, it adds the product of two low-rank matrices to the pre-trained model weights). When targeting the downstream task, only the parameters of the side matrix are updated. In other words, the core mathematical idea of LoRA is that, during fine-tuning, only the side matrix $\Delta W$ is updated, and $\Delta W$ is constrained to be a low-rank matrix. By limiting $\Delta W$ to low rank, LoRA minimizes the number of parameters that need to be learned during fine-tuning, thereby improving computational and storage efficiency. The specific operation is as follows:

  • Assume the model weight matrix is $W \in \mathbb{R}^{d \times k}$, initialized from the pre-trained parameters $W_0$. During training, the original pre-trained weights remain frozen and receive no gradient updates (that is, $W_0$ is fixed); only the change in the weights, $\Delta W$, is learned, so that $W = W_0 + \Delta W$.
  • The increment matrix $\Delta W$ is decomposed into two low-rank matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$. The low-rank approximation is obtained by multiplying these two matrices, that is, $\Delta W = BA$.
  • LoRA employs a specific initialization strategy to ensure stable and effective training. Typically, $A$ is initialized from a random Gaussian distribution and $B$ is initialized as a zero matrix. This means that $\Delta W = BA = 0$ at the start of training, so the initial model is identical to the pre-trained model.
  • Only $A$ and $B$ are trainable parameters that are adjusted for a specific task. The update during training can be represented as:

$$h = W_0 x + \Delta W x = W_0 x + BAx$$

where the rank $r \ll \min(d, k)$.

  • During training, $\Delta W$ is computed in each iteration. If, in each iteration, we do not update the model weights $W$ directly but instead accumulate the changes into a matrix $\Delta W$ and apply them to $W$ all at once after all training iterations are complete, we obtain the same model.

2207
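The training setup above can be sketched as a small NumPy layer. This is a minimal illustration, not the reference implementation: the class name `LoRALinear`, the shapes, and the `init_std` value are hypothetical, and the $\alpha / r$ scaling factor discussed later is left out for brevity.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W0 (d x k) plus a trainable low-rank update B @ A."""

    def __init__(self, W0, r, rng, init_std=0.01):
        d, k = W0.shape
        self.W0 = W0                                     # frozen, never updated
        self.A = rng.normal(0.0, init_std, size=(r, k))  # Gaussian init (trainable)
        self.B = np.zeros((d, r))                        # zero init (trainable)

    def forward(self, x):
        # h = W0 x + B (A x); A projects down to r dims, B projects back up.
        return self.W0 @ x + self.B @ (self.A @ x)


rng = np.random.default_rng(0)
W0 = rng.normal(size=(16, 32))      # stand-in for a pre-trained weight matrix
layer = LoRALinear(W0, r=4, rng=rng)

x = rng.normal(size=32)
h0 = layer.forward(x)

# Because B starts at zero, the layer initially reproduces the frozen model.
print(np.allclose(h0, W0 @ x))
print(layer.A.size + layer.B.size, "trainable vs", W0.size, "frozen")
```

Computing `B @ (A @ x)` rather than `(B @ A) @ x` keeps the intermediate result at size `r`, which is where the compute saving during training comes from.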

2.1.2 Inference

During inference, the output of the original LLM and the output of the learned low-rank branch are added together to obtain the final output. That is, we are given a pre-trained parameter matrix $W_0$ of shape $d \times k$, a small matrix $B$ of shape $d \times r$, and a matrix $A$ of shape $r \times k$, where $B$ and $A$ have been fine-tuned for a specific downstream task. After training the LoRA model, we use:

$$W = W_0 + BA$$

as the weights of the fine-tuned model.

From a matrix perspective, both $W_0$ and $BA$ are multiplied by the same input $x$, and the two products are summed to yield the final result, so the input and output dimensions of the model remain unchanged. That is,

$$h = W_0 x + \Delta W x$$

Corresponding to the image above:

$$h = W_0 x + BAx$$

Combined weights

Computing the two branches separately, as above, still adds some inference latency. To eliminate it, merge (add) the trained low-rank matrix into the frozen original model weights to obtain new weights, and then use the new weights for inference. This works because the LoRA update has the same shape as the original weight matrix, so the merged model keeps the original model structure.

Furthermore, how can we continue training on an existing LoRA model? The answer is the same. We can merge the previous LoRA model with the original model and then continue training, thus preserving the knowledge and capabilities gained from the previous LoRA model.
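The equivalence between the two-branch computation and the merged weights can be verified directly. In this sketch the matrices `B` and `A` are random stand-ins for trained adapter weights, and the shapes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 16, 32, 4

W0 = rng.normal(size=(d, k))   # frozen pre-trained weight
B = rng.normal(size=(d, r))    # stand-in for a trained B
A = rng.normal(size=(r, k))    # stand-in for a trained A
x = rng.normal(size=k)

# Two-branch inference: an extra pair of small matrix products per layer.
h_two_branch = W0 @ x + B @ (A @ x)

# Merged inference: fold the update into the weight once, then one product.
W_merged = W0 + B @ A
h_merged = W_merged @ x

print(np.allclose(h_two_branch, h_merged))
```

The merged matrix has the same shape as `W0`, so the deployed model has the same structure and cost as the original one.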

Pluggable

LoRA also has pluggability, meaning that the trained LoRA parameters can be separated from the model.

When faced with numerous downstream tasks and micro-customization requirements, interference between different tasks can negatively impact the training process. Therefore, given the same number of parameters, instead of using a single LoRA for the entire domain dataset, it is more efficient to deploy multiple smaller LoRA modules, each focusing on a specific downstream task. LoRA's pluggable nature allows us to freeze the shared model and switch between downstream tasks by replacing the matrices $A$ and $B$.

For example, instead of using $\Delta W$ to update $W_0$ permanently, we save $\Delta W$ as a separate module and add it to $W_0$ only at inference time. We can then fine-tune for different tasks, forming a task-specific update $\Delta W_i$ for each task $i$. In other words:

  • For task 1, there is $W_1 = W_0 + \Delta W_1 = W_0 + B_1 A_1$
  • For task 2, there is $W_2 = W_0 + \Delta W_2 = W_0 + B_2 A_2$
  • For task 3, there is $W_3 = W_0 + \Delta W_3 = W_0 + B_3 A_3$

When faced with a new target task $j$ while the current task is $i$, we subtract the LoRA part $B_i A_i$ from the merged weights and add $B_j A_j$ to switch tasks. We can then perform inference for task $j$. This allows us to quickly and easily switch between models for different tasks.
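Task switching by subtracting one adapter and adding another can be sketched as follows. The adapter names, shapes, and the helper `switch_task` are hypothetical, and the adapter matrices are random stand-ins for trained weights.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, r = 8, 8, 2

W0 = rng.normal(size=(d, k))  # shared frozen base weight

# One (B, A) pair per task, stored separately from the base model.
adapters = {
    name: (rng.normal(size=(d, r)), rng.normal(size=(r, k)))
    for name in ("task1", "task2")
}


def switch_task(W_current, old, new):
    """Remove the currently merged adapter and merge the new one."""
    B_old, A_old = old
    B_new, A_new = new
    return W_current - B_old @ A_old + B_new @ A_new


B1, A1 = adapters["task1"]
B2, A2 = adapters["task2"]

W_task1 = W0 + B1 @ A1
W_task2 = switch_task(W_task1, adapters["task1"], adapters["task2"])

print(np.allclose(W_task2, W0 + B2 @ A2))
```

In floating-point arithmetic repeated subtract-and-add cycles can accumulate small rounding errors, so a deployment might prefer to keep an untouched copy of `W0` and re-merge from it.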

Because LoRA adapters can be stored separately from the underlying LLM, it’s very straightforward to retain the original capabilities while adding new features. Therefore, people often combine both approaches: updating knowledge through full fine-tuning, followed by specialization using LoRA.

Combinatorial

Multiple LoRAs can also be stacked together for combined task enhancement and cross-task generalization. This means different LoRAs can be trained for different tasks, and mixing them enables knowledge and skill transfer across tasks. Of course, stacking different LoRAs requires careful selection and training; simply stacking them at random may not achieve the desired effect. For example, a text-to-image task may need to combine LoRAs for people, landscapes, and clothing to generate a single photograph.

This method of mixing multiple LoRA plugins together is called LoRA fusion, and there are several approaches. Existing LoRA fusion methods can be divided into (1) fusion with manually designed weights; (2) fusion with learned weights; and (3) mixture of LoRA experts.

  • Fusion with manually designed weights. Early methods attempted to linearly combine different LoRA plugins using manually designed weights. Some studies have shown that appropriate cross-task generalization can be achieved by simply averaging the plugins or their related outputs. Researchers have also proposed several methods to further improve performance with manually designed weights. For example:
      • Linear combination: some studies combine LoRA plugins for different tasks using simple averaging or weighted averaging, where the weights are manually designed.
      • Hyperparameter tuning: methods such as ControlPE treat the weights as hyperparameters and determine the best combination of LoRA plugins through hyperparameter search.
      • Feature similarity weights: methods such as Token-level Adaptation use the cosine similarity between the input features and the adapter dataset centers as weights.
      • Model fusion methods: methods such as BYOM apply basic model fusion techniques, such as task arithmetic, Fisher merging, and RegMean.
    Manually designed weights can quickly blend multiple LoRAs without additional training, which makes them simple and computationally efficient. However, they often fail to find the optimal weights, leading to unstable performance and limited generalization ability. Subsequently, researchers explored learning-based methods to achieve more accurate and adaptive fusion.
  • Fusion with learned weights. To learn the optimal fusion weights, researchers have proposed several methods that operate at the task, instance, and token levels. Task-level methods focus on enhancing task transferability and can be gradient-based or gradient-free. LoRAHub employs a black-box algorithm called CMA-ES to optimize the weight factors of the LoRA plugins, simplifying the training process. ComPEFT and L-LoRA use LoRAHub to mix quantized LoRA plugins and linearized LoRA plugins, respectively, further improving computational efficiency. Compared to task-level methods, instance-level and token-level methods offer greater flexibility and accuracy for complex inputs. For multimodal instruction tuning, MixLoRA dynamically selects appropriate low-rank decomposition vectors based on the input instance; these vectors are then integrated into the LoRA matrix for training. For protein mechanics analysis and design tasks, X-LoRA develops a dynamic gating mechanism to assign weights to LoRA plugins at both the token and layer granularities. These methods demonstrate better performance in specific tasks or application scenarios.
  • Mixture of LoRA experts. To jointly learn the fusion weights and the LoRA plugins, a mixture of LoRA experts (LoRA MoE) is a natural choice: each LoRA plugin acts as an expert, and a routing network assigns the fusion weights.

2208
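The simplest of these strategies, a linear combination with manually chosen weights, can be sketched in a few lines. This is an illustration of the general idea only, not the procedure of any specific method named above; the helper `mix_loras`, the shapes, and the equal weights are assumptions.

```python
import numpy as np


def mix_loras(adapters, weights):
    """Weighted sum of low-rank updates: sum_i w_i * (B_i @ A_i)."""
    return sum(w * (B @ A) for (B, A), w in zip(adapters, weights))


rng = np.random.default_rng(3)
d, k, r = 8, 8, 2

# Two independently trained adapters (random stand-ins here).
adapters = [
    (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
    (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
]
(B1, A1), (B2, A2) = adapters

# Simple averaging of the two updates.
delta_mix = mix_loras(adapters, weights=[0.5, 0.5])

# Averaging the factors separately is a different operation in general.
delta_factor_avg = (0.5 * (B1 + B2)) @ (0.5 * (A1 + A2))

print(np.linalg.matrix_rank(delta_mix))            # at most 2 * r
print(np.allclose(delta_mix, delta_factor_avg))
```

The mixed update can have rank up to the sum of the individual ranks, so it generally cannot be stored as a single rank-`r` adapter without further approximation.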

2.2 The roles of the A and B matrices

Researchers have also studied the differences between the matrices $A$ and $B$ in depth, which has provided important insights for improving parameter efficiency and effectiveness.

It has been observed that when multiple LoRA modules are trained independently on different data, the parameters of matrix $A$ from different heads tend to be consistent, while the parameters of matrix $B$ are clearly distinguishable. This phenomenon is analyzed as follows:

  • The A matrix is primarily used for dimensionality reduction, extracting features from the input and tending to capture commonalities across different domains. Therefore, the parameters of the A matrix can be shared across multiple heads, thus reducing redundancy.
  • The B matrix is primarily used for dimensionality enhancement, leveraging these features to generate the desired output (for prediction), thus better adapting to domain-specific differences. The dispersed parameters of the B matrix across different heads indicate that using a single head to adapt to multiple domains may not be as effective as using an independent head for each domain, as this minimizes interference between domains.

We need to elaborate further on matrix $B$. Researchers have found that $B$ is far more important than $A$. For example:

  • Freezing $B$ projects away most of the output, while freezing $A$ only projects away a portion of the input feature space, which usually has a smaller impact.
  • Updating only matrix $B$ consistently outperforms updating only matrix $A$. Updating only $B$ while leaving $A$ untouched yields results similar to standard LoRA, and can even improve performance, with half the number of trainable parameters.
  • Randomly initializing and freezing matrix $A$, and then updating only matrix $B$, usually yields better out-of-domain test accuracy.

This asymmetry suggests that fine-tuning $B$ alone may be more effective than fine-tuning $A$.

The paper “Asymmetry in Low-Rank Adapters of Foundation Models” presents generalization bounds for different LoRA variants. The figure below shows the generalization bounds for updating both $A$ and $B$, updating only $A$, and updating only $B$, respectively. Here, $r$ is the rank and $q$ is the number of quantization bits; $\sigma$ is related to the sub-Gaussianity of the loss, $n$ is the sample size, and $d_{\text{in}}^{(i)}$ and $d_{\text{out}}^{(i)}$ are the input and output dimensions of the $i$-th layer, respectively. It can be seen that the bound is tighter when updating only $B$ than when updating both $A$ and $B$, suggesting that freezing $A$ as a random orthogonal matrix and updating only $B$ may enhance generalization to unseen data.

2209

The FLoRA paper provides a proof that the matrix $B$ in LoRA dominates the weight updates, as shown in the figure below.

2210

2.3 Deployment Location

2.3.1 Original Paper

In the original study, LoRA was applied to the weight matrix of the attention layer. The HuggingFace PEFT library simply adds LoRA to q_proj and v_proj.

2211

However, as we can see from the above figure:

  • Putting all of the fine-tuned parameters into $W_q$ or $W_k$ alone (that is, into a single matrix within the attention mechanism) leads to a significant performance decrease, whereas adapting $W_q$ and $W_v$ together yields the best results.
  • Even with $r = 4$, $\Delta W$ still captures sufficient information during training.

In summary, it is best to distribute the fine-tunable parameters across multiple types of weight matrices, rather than using a larger rank to fine-tune a single type of weight matrix.

2.3.2 Expansion

In principle, LoRA can be integrated anywhere within a Transformer layer. Some studies, such as QLoRA, advocate for its inclusion in all dense projections.

2212

For Transformer-based large language models (LLMs), dense layers typically contain two types of weight matrices: projection matrices in the attention module and matrices in the feedforward network (FFN) module. In the Transformer architecture, self-attention has four weight matrices ($W_q$, $W_k$, $W_v$, $W_o$), and the MLP module has two. The blue areas in the diagram below represent these dense layers, so LoRA can be applied to any of them.

2213

In addition, to address the issues of excessive computation and resource consumption caused by long contexts, as well as the gap between LoRA and full parameter fine-tuning, LongLoRA enables the embedding layer and normalization layer during LoRA training to participate in weight updates.

2214

2.3.3 Dynamic Selection

In theory, the LoRA matrix can be added to any layer of a neural network, but because there is still a gap between actual performance and the theoretical optimal value, some work is also investigating whether it is possible to skip certain layers during training.

LoRA-drop introduces an algorithm to determine which layers are fine-tuned by LoRA and which are not. The LoRA-drop algorithm allows the model to be trained using only a subset of LoRA layers. According to the evidence presented by the authors, the accuracy changes only slightly compared to training all LoRA layers, but computation time is reduced due to the smaller number of parameters that must be trained.

The steps for LoRA-drop are as follows:

  • Training is performed using a subset of the dataset, and then an importance score is calculated for each LoRA adapter. A high score indicates that the adapter has a significant impact on the model, while a low score indicates that the adapter has a negligible impact.
  • LoRA-drop then accumulates the importance scores until a threshold is reached; only the most important LoRA layers, selected this way, are kept.
  • Finally, LoRA-drop performs full training on the entire dataset with the selected LoRA layers, while the other layers are fixed to a set of shared parameters that are not changed during training.

2215
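The selection step can be sketched as a cumulative-threshold rule. This is a loose illustration under stated assumptions, not the LoRA-drop authors' code: the importance scores below are made-up numbers (how they are computed is not shown), the scores are normalized to sum to 1, adapters are taken in descending order of importance, and the function name `select_layers` and the threshold `0.8` are hypothetical.

```python
def select_layers(importance: dict[str, float], threshold: float) -> list[str]:
    """Keep the highest-scoring adapters until their share reaches `threshold`."""
    total = sum(importance.values())
    # Most important adapters first.
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)

    kept, cumulative = [], 0.0
    for name, score in ranked:
        if cumulative >= threshold:
            break
        kept.append(name)
        cumulative += score / total   # normalized share of total importance
    return kept


# Made-up importance scores for five LoRA layers.
scores = {"layer0": 0.05, "layer1": 0.40, "layer2": 0.10,
          "layer3": 0.35, "layer4": 0.10}

kept = select_layers(scores, threshold=0.8)
dropped = [name for name in scores if name not in kept]

print(kept, dropped)
```

In this toy example three of the five layers are enough to cover 80% of the total importance, so the remaining two would not receive their own trainable adapter.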

XGBLoRA takes a different route and performs random layer selection. Instead of modifying all layers of the language model (LM), the authors randomly select a subset of layers and add a LoRA update to each selected layer to build the augmenters. By adapting only a subset of layers in each iteration, the augmenters' ability to change the model is limited. This intentional constraint keeps each augmenter relatively “weak” in its predictive power. However, this strategy injects randomness into the final ensemble model, creating diversity among the augmenters. Each augmenter focuses on a different part of the model, capturing different aspects of the data. This diversity is crucial for the success of the ensemble approach.

2.4 Initialization

During initialization, $A$ is typically initialized from a random Gaussian distribution and $B$ is initialized as a zero matrix. This means that $\Delta W = BA = 0$ at the start of training, so the initial model is identical to the pre-trained model.

If both $A$ and $B$ are initialized to 0, the gradients of both matrices vanish and training cannot get started. If both $A$ and $B$ are initialized from a Gaussian, there is a chance of a large initial offset, which introduces too much noise at the beginning of training and makes convergence difficult. Therefore, initializing one matrix to 0 and the other from a Gaussian keeps the network's original output at the beginning of training while still allowing good convergence once learning begins.
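The effect of each initialization choice can be seen by computing the gradients at the first step. The sketch below assumes a single layer $h = W_0 x + BAx$ with a squared-error loss $L = \tfrac{1}{2}\lVert h - y \rVert^2$; the loss, the shapes, and the function name `lora_grads` are illustrative assumptions.

```python
import numpy as np


def lora_grads(W0, A, B, x, y):
    """Gradients of L = 0.5 * ||W0 x + B A x - y||^2 with respect to A and B."""
    h = W0 @ x + B @ (A @ x)
    g = h - y                        # dL/dh
    grad_B = np.outer(g, A @ x)      # shape (d, r); zero whenever A is zero
    grad_A = np.outer(B.T @ g, x)    # shape (r, k); zero whenever B is zero
    return grad_A, grad_B


rng = np.random.default_rng(4)
d, k, r = 6, 5, 2
W0 = rng.normal(size=(d, k))
x, y = rng.normal(size=k), rng.normal(size=d)

A_gauss = rng.normal(size=(r, k))
B_gauss = rng.normal(size=(d, r))
A_zero, B_zero = np.zeros((r, k)), np.zeros((d, r))

# Standard LoRA init: A Gaussian, B zero. B gets a gradient, A does not (yet).
gA_std, gB_std = lora_grads(W0, A_gauss, B_zero, x, y)

# Both zero: neither matrix receives a gradient, so training is stuck.
gA_zz, gB_zz = lora_grads(W0, A_zero, B_zero, x, y)

# Both Gaussian: the output already deviates from the pre-trained model.
h_pre = W0 @ x
h_gg = W0 @ x + B_gauss @ (A_gauss @ x)

print(np.abs(gB_std).max() > 0, np.allclose(gA_std, 0))
print(np.allclose(gA_zz, 0), np.allclose(gB_zz, 0))
print(np.linalg.norm(h_gg - h_pre))
```

With the standard initialization, $B$ moves away from zero after the first update, and from then on $A$ receives non-zero gradients as well.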

Of course, some researchers argue that initializing either $A$ or $B$ with all zeros introduces an asymmetry problem (one is all zeros, the other is not). Therefore, both $A$ and $B$ can be given non-zero initial values $A_0$ and $B_0$, as long as their initial product is subtracted from the pre-trained weights beforehand:

$$W = (W_0 - B_0 A_0) + BA$$

This keeps the initial state consistent with the pre-trained model and also restores the symmetry.
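A short check of this symmetric scheme, assuming the formula $W = (W_0 - B_0 A_0) + BA$ with both factors drawn from a Gaussian. The shapes and the standard deviation `0.1` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
d, k, r = 6, 5, 2

W0 = rng.normal(size=(d, k))            # pre-trained weight
A0 = rng.normal(0.0, 0.1, size=(r, k))  # non-zero initial A
B0 = rng.normal(0.0, 0.1, size=(d, r))  # non-zero initial B

# Subtract the initial product once; this shifted matrix is what stays frozen.
W_frozen = W0 - B0 @ A0

# At the start of training A = A0 and B = B0, so the effective weight is W0.
A, B = A0.copy(), B0.copy()
W_effective = W_frozen + B @ A

print(np.allclose(W_effective, W0))
```

Since neither factor is zero, both $A$ and $B$ receive non-zero gradients from the first step, while the model output still starts from the pre-trained one.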

2.5 Hyperparameters

2.5.1 Rank

The rank in LoRA fine-tuning is crucial for understanding the expressiveness of adaptation and maintaining computational efficiency.

  • A smaller $r$ corresponds to a simpler low-rank matrix, resulting in a smaller LoRA model. This means fewer parameters need to be learned during adaptation, potentially leading to faster training and reduced computational requirements. However, as $r$ decreases, the amount of information that can be stored also decreases, reducing the ability of the low-rank matrix to capture task-specific information and making it difficult to capture complex patterns. It is typically suitable for tasks in very narrow domains.
  • The larger $r$ is, the larger the LoRA model, the more information it can store, and the more expressive it is; it is generally suitable for tasks in a wider range of domains. However, this increases computational and memory requirements.
  • Fine-tuning techniques like LoRA are generally considered inferior to full fine-tuning because LoRA is a low-rank approximation of full fine-tuning; by increasing the rank, LoRA can approach the effect of full fine-tuning. In practice, it is important to try different values of $r$ to find a suitable balance and achieve the desired performance on new tasks. The rank is usually chosen based on the downstream task and the amount of training data.

Intrinsic rank

If LoRA can achieve good results with a very small $r$, this indicates that the update matrix $\Delta W$ has a very small intrinsic rank. In other words, the first matrix $A$ is responsible for dimensionality reduction, the second matrix $B$ is responsible for dimensionality increase, and the intermediate layer has dimension $r$, thus simulating the so-called intrinsic rank.

The figure below shows the validation accuracy for different ranks on WikiSQL and MultiNLI. It can be observed that LoRA is already competitive at very small $r$ (adapting $W_q$ and $W_v$ together is better than adapting only $W_q$). This suggests that the update matrix $\Delta W$ may have a very small “intrinsic rank”. The LoRA authors argue that increasing $r$ does not cover a more meaningful subspace, which indicates that a low-rank adaptation matrix is sufficient for fine-tuning.

2216

However, in theory, the greater the difference between the task and the pre-training, the higher the required rank should be, because this means there should be more adjustable parameters.

Best performance

How large a rank is needed to achieve optimal LoRA performance? Some papers have explored this question in depth.

The paper “The Expressive Power of Low-Rank Adaptation” points out:

  • For a fully connected neural network, LoRA can adapt any pre-trained model $f$ to accurately match a smaller target model $\bar{f}$ if the LoRA rank satisfies: rank $\geq$ (width of $f$) $\times$ (depth of $\bar{f}$) / (depth of $f$).
  • For Transformer networks, they proved that as long as the rank of the adaptation satisfies a sufficient lower bound, any model can be adapted to a target model using LoRA.

2217

However, in practice a much smaller rank is usually used, chosen to balance performance and efficiency. The gap between the theoretically required rank and the rank used in practice leads to performance gaps. Increasing the rank to meet the above theoretical requirements increases memory usage and computational complexity, thus offsetting the advantages of LoRA and making its cost comparable to full fine-tuning.

The paper “LoRA Training in the NTK Regime has No Spurious Local Minima” analyzes the fine-tuning process of LoRA within the neural tangent kernel (NTK) framework, showing that:

  • Full fine-tuning (without LoRA) admits a low-rank solution of rank $r \lesssim \sqrt{N}$, where $N$ is the number of training data points.
  • Using LoRA with rank $r \gtrsim \sqrt{N}$ helps avoid spurious local minima and can facilitate the discovery of low-rank solutions with good generalization ability.

Distribution of rank

In the original LoRA, all adapted matrices use the same rank. AdaLoRA instead assigns different ranks to different matrices based on their importance (for example, using the singular values of the LoRA matrices as an indicator of importance). Important matrices receive a higher rank and less important ones a lower rank, while the total number of parameters stays the same.

We will explain this in detail later.
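As a rough illustration of importance-based rank allocation, the sketch below spends one global budget of singular directions across several matrices by keeping the highest-scoring ones. It is a simplified sketch and not AdaLoRA's actual algorithm: the function name `allocate_ranks`, the layer names, and the scores are hypothetical, and the importance metric is simply taken as given.

```python
def allocate_ranks(importance, budget):
    """Keep the `budget` most important singular directions across all matrices.

    importance: dict mapping a matrix name to a list of non-negative scores,
                one per candidate singular direction of that matrix.
    Returns a dict mapping each matrix name to the rank it keeps.
    """
    # Flatten to (score, name) pairs so that all matrices compete for one budget.
    scored = [(score, name) for name, scores in importance.items() for score in scores]
    # Highest scores first; ties are broken by name to keep the result deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    ranks = {name: 0 for name in importance}
    for _, name in scored[:budget]:
        ranks[name] += 1
    return ranks


# Hypothetical importance scores for three adapted matrices, four candidates each.
importance = {
    "layer0.q_proj": [0.90, 0.70, 0.10, 0.05],
    "layer0.v_proj": [0.80, 0.60, 0.50, 0.40],
    "layer1.q_proj": [0.30, 0.20, 0.02, 0.01],
}
ranks = allocate_ranks(importance, budget=6)
print(ranks)
```

With these made-up scores the budget concentrates on the two `layer0` matrices, while the total number of kept directions stays fixed at the budget.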

2.5.2 Learning Rate

In the original LoRA method, the adapter matrices $A$ and $B$ are updated with the same learning rate. However, the authors of LoRA+ show that a single learning rate for both $A$ and $B$ does not lead to efficient feature learning, so it can give suboptimal results, for example when fine-tuning networks with many layers or large widths. The reason is that updates to $A$ and $B$ contribute differently to the learning dynamics.

LoRA+ therefore uses different learning rates for the two matrices, setting the learning rate of $B$ much higher than that of $A$, which makes training more efficient. For effective learning, the magnitude of the feature updates from $A$ and $B$ should be $\Theta(1)$. This requires scaling the learning rates so that $\eta_A = \Theta(n^{-1})$ and $\eta_B = \Theta(1)$, where $n$ represents the model width. In practice, LoRA+ introduces a fixed ratio:

$$\lambda = \frac{\eta_B}{\eta_A} \gg 1$$

Users then tune a single learning rate, and the other follows automatically from the ratio.
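A minimal sketch of the two-learning-rate idea follows, assuming adapter parameters can be told apart by name. The helper `loraplus_param_groups`, the `lora_A`/`lora_B` naming, and the ratio of 16 are illustrative assumptions. The groups hold names only so that the example has no dependencies; with PyTorch, each group would instead carry the tensors under the `params` key and be passed to the optimizer.

```python
def loraplus_param_groups(param_names, lr_a, ratio):
    """Split adapter parameters into two groups with different learning rates.

    param_names: iterable of parameter names; names containing "lora_B" are
                 treated as B matrices, everything else as A matrices.
    lr_a:        learning rate for the A matrices (eta_A).
    ratio:       lambda = eta_B / eta_A, expected to be greater than 1.
    """
    group_a = {"names": [], "lr": lr_a}
    group_b = {"names": [], "lr": lr_a * ratio}
    for name in param_names:
        target = group_b if "lora_B" in name else group_a
        target["names"].append(name)
    return [group_a, group_b]


# Hypothetical parameter names for one adapted attention layer.
names = [
    "layer0.q_proj.lora_A",
    "layer0.q_proj.lora_B",
    "layer0.v_proj.lora_A",
    "layer0.v_proj.lora_B",
]
groups = loraplus_param_groups(names, lr_a=2e-4, ratio=16)
for group in groups:
    print(group["lr"], group["names"])
```

Only `lr_a` and `ratio` are exposed, which mirrors the idea of tuning one learning rate and deriving the other.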

2218

Scaling factor

LoRA’s output $\Delta W x$ is scaled by a factor $\alpha / r$, where $\alpha$ is a constant. When optimizing with Adam, adjusting $\alpha$ is roughly equivalent to adjusting the learning rate. In practice, the value of $\alpha$ can simply be set based on the rank $r$. This scaling mechanism reduces the need to retune hyperparameters when $r$ changes.

However, as the adapter rank increases, the factor $\alpha / r$ shrinks quickly, resulting in slower learning and worse performance for higher-rank adapters. To overcome this limitation, rsLoRA redefines the scaling factor as:

$$\gamma_r = \frac{\alpha}{\sqrt{r}}$$

This adjustment makes the adapter rank-stabilized: even as the rank increases, the forward and backward passes keep stable magnitudes, which prevents gradient collapse.
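To make the difference concrete, the short sketch below compares the standard LoRA factor $\alpha / r$ with the rsLoRA factor $\alpha / \sqrt{r}$ for a few ranks. The function names and the value $\alpha = 16$ are illustrative assumptions.

```python
import math


def lora_scale(alpha, r):
    """Standard LoRA scaling factor: alpha / r."""
    return alpha / r


def rslora_scale(alpha, r):
    """Rank-stabilized scaling factor: alpha / sqrt(r)."""
    return alpha / math.sqrt(r)


alpha = 16  # illustrative value
for r in (4, 16, 64, 256):
    print(f"r={r:>3}  lora={lora_scale(alpha, r):.4f}  rslora={rslora_scale(alpha, r):.4f}")
```

The standard factor falls in proportion to $1/r$, while the rank-stabilized factor falls only in proportion to $1/\sqrt{r}$, so the two differ by a factor of $\sqrt{r}$ at any given rank.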

2.5.3 Dropout

Although LoRA-based models have fewer trainable parameters, overfitting remains a problem, especially when fine-tuning on small or specialized datasets. In such cases, traditional dropout techniques may not be sufficient to mitigate overfitting.

The authors of HiddenKey highlighted this problem and proposed a comprehensive framework that addresses it along three dimensions: dropout location, structural patterns, and compensation measures. Dropout location specifies where noise is introduced, such as in attention logits, weights, or hidden representations. Structural patterns define the granularity at which units are dropped, including element-level, column-level, or span-level patterns. Compensation measures aim to minimize the discrepancy between the training and inference phases using techniques such as normalized rescaling or a Kullback-Leibler divergence loss.

BiLoRA, on the other hand, employs a bi-level optimization strategy. It alternately trains the singular vectors and singular values of the low-rank increment matrix on different subsets of the training data. This approach avoids simultaneously optimizing parameters at different levels on a single dataset, thus mitigating the overfitting problem.

2.6 Advantages

Compared to full-scale fine-tuning, LoRA has the following key advantages:

  • Parameter Efficiency: LoRA introduces very few trainable parameters through low-rank decomposition, typically reducing the number of parameters for a specific task by several orders of magnitude. This approach is particularly advantageous in resource-constrained environments and multi-task scenarios requiring multiple adaptations of the base model. It reduces the memory and computational requirements for fine-tuning without increasing inference latency.
  • Reduced Memory Usage: LoRA significantly reduces memory usage when fine-tuning large language models (LLMs), mainly by cutting optimizer and gradient memory. Although it introduces some additional “incremental parameters” that slightly increase activation and weight memory, this increase is negligible compared with the overall reduction.
  • Enhanced training efficiency: Traditional full fine-tuning updates all model parameters, while LoRA optimizes only the low-rank adaptation matrices. This approach significantly reduces computational costs and memory requirements, especially for models with billions of parameters. The reduced parameter space typically leads to faster convergence during training.
  • Latency-free inference: LoRA does not introduce additional inference latency because the update matrix can be explicitly incorporated into the original frozen weights. This integration ensures that the adapted model remains efficient during deployment and inference. Additionally, reduced memory usage also contributes to faster forward propagation.
  • Flexible modular adaptation: LoRA enables the creation of lightweight, task-specific adapters that can be interchanged without modifying the underlying model architecture. This modularity facilitates efficient multi-task learning and task switching, while minimizing storage requirements, compared to maintaining a separate model instance for each task.
  • Robust knowledge preservation: By preserving pre-trained weights, LoRA effectively mitigates catastrophic forgetting, a common challenge in traditional fine-tuning. This approach maintains the model’s fundamental knowledge while acquiring task-specific capabilities.
  • Extended Context Window: LoRA is also used to extend the context window size of large language models. For example, LongLoRA extends the context window of LLaMA2-7B from 4k to 100k tokens by combining LoRA with shifted sparse attention.
  • Other application cases (Beyond Fine-tuning): In addition to fine-tuning, LoRA can also be applied to other learning paradigms, such as pre-training and continual training. In pre-training, LoRA can be used to train high-rank networks; in continual training, LoRA can mitigate catastrophic forgetting.

Through these advantages, LoRA can effectively adapt the base model while maintaining model performance and significantly reducing computational requirements. Furthermore, the paper “LoRA Learns Less and Forgets Less” compares low-rank adaptation (LoRA) with full fine-tuning on large language models (LLMs), focusing on two domains (programming and mathematics) and two training regimes (instruction fine-tuning and continued pre-training). The paper finds that:

  • LoRA learns less. The greater the distance between the new task and the model’s pre-training data, the more pronounced the advantage of full fine-tuning in terms of learning ability.
  • LoRA forgets less. When measuring how much previously acquired knowledge is lost, LoRA consistently shows less forgetting. This is particularly evident when adapting to data that differs significantly from the source domain.

Overall, there is a trade-off: full fine-tuning is better suited for absorbing new knowledge from more distant domains, but it leads to more forgetting of previous learning tasks. LoRA learns less new information by changing fewer parameters, but retains more of the original capabilities.

0x03 Complexity & Resource Consumption

LoRA is highly parameter efficient because it only updates a small subset of model parameters, reducing memory and computational requirements during fine-tuning without increasing inference latency.
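As a quick worked example of this parameter efficiency, the sketch below counts trainable parameters for a single weight matrix, assuming the usual shapes $W \in \mathbb{R}^{d \times k}$, $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$. The sizes $d = k = 4096$ and $r = 8$ are illustrative choices.

```python
def full_param_count(d, k):
    """Trainable parameters when the whole d x k matrix is updated."""
    return d * k


def lora_param_count(d, k, r):
    """Trainable parameters of the two low-rank factors B (d x r) and A (r x k)."""
    return r * (d + k)


d = k = 4096   # illustrative layer size
r = 8          # illustrative rank

full = full_param_count(d, k)
lora = lora_param_count(d, k, r)
print(full, lora, full / lora)
```

For these sizes the adapter trains 65,536 parameters instead of 16,777,216, a 256-fold reduction for this one matrix.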

3.1 Computational complexity analysis

Theoretically, the computational cost of LoRA is comparable to that of full parameter fine-tuning.

Calculation item | Full parameter fine-tuning | LoRA
Backbone model (forward computation) | ✓ | ✓
Backbone model (gradient) | ✓ | ✓
LoRA part (forward + gradient) | × | ✓

3.1.1 Training

Many parameter-efficient fine-tuning techniques only reduce memory requirements, not computational costs. For example, Adapter and P-Tuning techniques can achieve near-full-parameter fine-tuning results by tweaking only a small number of parameters. However, these techniques are usually “parameter-efficient” rather than “training-efficient” because they still require backpropagation throughout the model to obtain gradients for a small subset of trainable parameters. In other words, while the number of trainable parameters is indeed reduced, the training speed isn’t significantly improved. The problem lies in the characteristics of backpropagation. Backpropagation, which calculates model gradients, proceeds incrementally from the output layer to the input layer. Therefore, the depth/computational cost of backpropagation primarily depends on the depth of trainable parameters closest to the input layer, and is not necessarily related to the total number of trainable parameters.

Take the diagram below as an example. Adapter inserts a small layer after each existing layer. All other parameters are fixed and only the newly inserted layers are trainable, but because every layer contains a new module, backpropagation still has to pass from the output layer all the way to the input layer. P-tuning essentially has only a few trainable parameters in the embedding layer, but since the embedding layer is the input layer, its backpropagation also has to run through the entire model. Therefore, neither of these two approaches significantly reduces the computational cost.

2219

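LoRA is no exception. During LoRA training, $W_0$ is frozen; only $A$ and $B$ are trainable parameters. Assume the model’s loss function is $\mathcal{L}$ and write the adapted weight as $W = W_0 + BA$. The gradients for $A$ and $B$ during training are:

$$\frac{\partial \mathcal{L}}{\partial A} = B^{\top} \frac{\partial \mathcal{L}}{\partial W}, \qquad \frac{\partial \mathcal{L}}{\partial B} = \frac{\partial \mathcal{L}}{\partial W} A^{\top}$$

During training, computing gradients is the main computational cost. Full parameter fine-tuning uses the gradient $\frac{\partial \mathcal{L}}{\partial W}$, while LoRA uses $\frac{\partial \mathcal{L}}{\partial A}$ and $\frac{\partial \mathcal{L}}{\partial B}$, both of which are built on the full-update gradient $\frac{\partial \mathcal{L}}{\partial W}$. Because LoRA's gradients are computed on top of the full gradient, its computational cost is, in theory, slightly greater than that of full parameter fine-tuning.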
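The chain-rule relationship can be checked numerically. The sketch below assumes the adapted weight is $W = W_0 + BA$ and uses a toy squared-error loss $\frac{1}{2}\lVert Wx - y \rVert^2$; it computes the gradient with respect to $W$, derives the gradients for $A$ and $B$ from it, and compares one entry against a finite difference. All helper names and numbers are illustrative, and plain Python lists are used to avoid dependencies.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def adapted_weight(W0, B, A):
    """W = W0 + B @ A."""
    BA = matmul(B, A)
    return [[w + u for w, u in zip(rw, ru)] for rw, ru in zip(W0, BA)]


def loss(W0, B, A, x, y):
    """Toy loss 0.5 * ||W x - y||^2 for column vectors x and y."""
    pred = matmul(adapted_weight(W0, B, A), x)
    return 0.5 * sum((p[0] - t[0]) ** 2 for p, t in zip(pred, y))


def lora_grads(W0, B, A, x, y):
    """Return dL/dW and the LoRA gradients derived from it."""
    pred = matmul(adapted_weight(W0, B, A), x)
    err = [[p[0] - t[0]] for p, t in zip(pred, y)]
    G = matmul(err, transpose(x))        # dL/dW, shape d x k
    grad_A = matmul(transpose(B), G)     # B^T dL/dW, shape r x k
    grad_B = matmul(G, transpose(A))     # dL/dW A^T, shape d x r
    return G, grad_A, grad_B


def numeric_grad(f, M, i, j, eps=1e-6):
    """Central finite difference of f() with respect to entry M[i][j]."""
    old = M[i][j]
    M[i][j] = old + eps
    up = f()
    M[i][j] = old - eps
    down = f()
    M[i][j] = old
    return (up - down) / (2 * eps)


W0 = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]   # frozen, d x k = 2 x 3
B = [[0.5], [-1.0]]                         # trainable, d x r = 2 x 1
A = [[1.0, 2.0, 0.0]]                       # trainable, r x k = 1 x 3
x = [[1.0], [1.0], [1.0]]
y = [[0.0], [0.0]]

G, grad_A, grad_B = lora_grads(W0, B, A, x, y)
num_B00 = numeric_grad(lambda: loss(W0, B, A, x, y), B, 0, 0)
num_A01 = numeric_grad(lambda: loss(W0, B, A, x, y), A, 0, 1)
print(grad_B[0][0], num_B00)
print(grad_A[0][1], num_A01)
```

Both analytic entries should agree with their finite-difference estimates up to rounding error.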

However, from a practical perspective, LoRA training is faster. Why does LoRA actually speed up training? There are several main reasons:

  • Low-precision acceleration: When using LoRA, we can employ various low-precision acceleration techniques on the backbone model, such as FP16, FP8, or INT8 quantization. This can reduce the time consumed by forward and backward propagation of the backbone model.
  • Only some parameters are updated: During training, the original parameters $W_0$ are frozen. That is, although $W_0$ participates in forward and backward propagation, its gradient is not calculated and its parameters are not updated. During fine-tuning, the model thus learns the low-rank matrices $A$ and $B$ rather than directly updating the original weight matrix $W_0$. For example, the original LoRA paper chose to adapt only the self-attention weights. In practice, we can also choose to adapt only certain layers.
  • Reduced communication time: In multi-GPU training (data parallelism), only the gradients of the LoRA parameters need to be synchronized. Because far fewer parameters are updated, much less data is transmitted between GPUs, which greatly reduces communication pressure and improves overall training speed.
  • In addition, reducing memory usage also speeds up forward propagation.

3.1.2 Inference

During inference, the low-rank update of LoRA can be merged directly into the weights of the original model, so no additional inference latency is introduced. The computational cost of the inference phase is therefore essentially the same as that of the original model.
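The merge can be illustrated on a tiny example. The sketch below assumes the adapted layer computes $W_0 x + s \cdot B(Ax)$ for some scaling factor $s$, and checks that a single merged matrix $W_0 + s \cdot BA$ produces the same output. The helper names and the numbers are illustrative.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def merge_lora(W0, B, A, scale):
    """Return W0 + scale * (B @ A) as a new matrix."""
    BA = matmul(B, A)
    return [[w + scale * u for w, u in zip(rw, ru)] for rw, ru in zip(W0, BA)]


W0 = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]   # frozen weight, d x k = 2 x 3
B = [[0.5], [-1.0]]                         # d x r = 2 x 1
A = [[1.0, 2.0, 0.0]]                       # r x k = 1 x 3
x = [[1.0], [1.0], [1.0]]                   # input column vector, k x 1
scale = 2.0                                 # illustrative scaling factor

# Unmerged: two paths, the frozen weight plus the low-rank branch.
base = matmul(W0, x)
branch = matmul(B, matmul(A, x))
h_unmerged = [[b[0] + scale * u[0]] for b, u in zip(base, branch)]

# Merged: a single matrix multiply with the same shape as the original layer.
W_merged = merge_lora(W0, B, A, scale)
h_merged = matmul(W_merged, x)

print(h_unmerged, h_merged)
```

After merging, the layer has exactly the shape of the original weight, which is why no extra inference cost remains.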

3.2 Memory Usage

In theory, LoRA uses less memory than full-parameter fine-tuning. The saving stems from the fact that, with the Adam optimizer, LoRA avoids storing the first and second moments for the full weights (both of which are typically kept in fp32 and consume significant memory). See the table below for details.

GPU memory usage | Full parameter fine-tuning | LoRA
Backbone model (model parameters) | ✓ | ✓
Backbone model (gradient) | ✓ | ✓
Backbone model (intermediate activation) | ✓ | ✓
Backbone model (optimizer) | ✓ | ×
LoRA section | × | ✓

3.2.1 Full parameter fine-tuning

Full parameter fine-tuning is a resource-intensive fine-tuning method that optimizes all parameters of the model. The downside is that the memory overhead of the optimizer state and gradients is larger than the model itself. Therefore, even for models with fewer parameters, full parameter fine-tuning consumes significant computational resources. Memory usage in large language models (LLMs) can be divided into four parts:

  • Model memory (weight memory): The memory required to store the model weights;
  • Activation memory: The memory occupied by intermediate activations during the forward propagation process, which mainly depends on factors such as batch size and sequence length;
  • Gradient memory: The memory required to store gradients during backpropagation. Gradients are calculated only for trainable parameters.
  • Optimizer memory: The memory used to store the optimizer’s state. For example, the Adam optimizer stores the “first moment” and “second moment” of the trainable parameters.

3.2.2 LoRA

Backbone model

  • First, the weights of the backbone model must be stored in GPU memory; this part cannot be omitted.
  • Second, since the gradients of the LoRA matrices depend on gradients propagated through the backbone model, backpropagation must still run through the backbone even though its weights are not optimized.
  • Third, the intermediate activations cannot be omitted.
  • Finally, since the backbone model is not optimized, no optimizer state needs to be stored for it, which saves memory (for example, Adam maintains a first and a second moment for every parameter: the exponential moving averages of the gradient and of the squared gradient, respectively).

LoRA branch

The weights, gradients, and optimizer states of the LoRA branch all need to be stored.

Conclusion

LoRA does not require saving the optimizer state of the backbone model; instead, it introduces additional incremental parameters. Although these incremental parameters cause a slight increase in activation and weight memory, this increase is far less than the memory occupied by the optimizer state of the backbone model. Therefore, considering the overall reduction in memory usage, this increase is negligible. Furthermore, in practical applications, we can leverage the fact that the backbone model does not require optimization by using fp16, or even low-precision data types such as int8 or int4, to further reduce GPU memory consumption.
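A back-of-the-envelope estimate makes the saving tangible. The sketch below is only a rough model under stated assumptions: 16-bit weights and gradients (2 bytes each), two fp32 Adam moments per trainable parameter (8 bytes), gradients stored only for trainable parameters, no separate fp32 master weights, and activations left out entirely. The model size of 7 billion parameters and the adapter size of 20 million parameters are hypothetical.

```python
def training_memory_gb(n_backbone, n_lora, full_finetune,
                       weight_bytes=2, grad_bytes=2, adam_bytes=8):
    """Rough memory for weights, gradients and Adam states, in GB (activations excluded)."""
    if full_finetune:
        trainable = n_backbone
        stored_weights = n_backbone
    else:
        trainable = n_lora
        stored_weights = n_backbone + n_lora
    parts = {
        "weights": stored_weights * weight_bytes,
        "gradients": trainable * grad_bytes,
        "optimizer": trainable * adam_bytes,   # first and second moment in fp32
    }
    return {name: size / 1e9 for name, size in parts.items()}


n_backbone = 7_000_000_000   # hypothetical backbone size
n_lora = 20_000_000          # hypothetical adapter size

full = training_memory_gb(n_backbone, n_lora, full_finetune=True)
lora = training_memory_gb(n_backbone, n_lora, full_finetune=False)
print("full fine-tuning:", full, "total", sum(full.values()))
print("LoRA:            ", lora, "total", sum(lora.values()))
```

Under these assumptions almost all of the difference comes from the optimizer state and the gradients of the backbone, while the weight memory grows only by the small adapter.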

0x04 Support Mechanism & Analysis

LoRA is built on the following assumption: During pre-training, the model needs to handle a variety of complex tasks and data, so its weight matrix is usually high-rank, possessing greater expressive power to adapt to different tasks. However, when the model is fine-tuned for a specific downstream task, it is found that its weight matrix actually performs in a low-rank manner on that particular task. In other words, although the model is high-dimensional during pre-training, it requires fewer degrees of freedom (low-rank) to perform well on a specific task; that is, weight updates during fine-tuning are typically located in a low-dimensional subspace. Based on this observation, LoRA proposes to adapt to specific downstream tasks by adding a low-rank adjustment matrix while maintaining the high-rank structure of the pre-trained model.

This section will analyze some concepts, such as intrinsic dimensions and subspace fine-tuning, to explore why LoRA is effective and how to make it more effective.

  • Intrinsic dimensions are the source of the LoRA concept and a useful tool for understanding and optimizing the complex behavior of large models.
  • Subspace fine-tuning unifies all known PEFT methods under a single theory and elucidates the mathematical principles of each method from the perspective of decomposition theory.
  • The theory of low-rank representation of complex systems innovatively verifies the ubiquitous existence of low-rank structures in complex systems and provides a feasible approach for constructing a unified dimensionality reduction theory for large-scale complex networks.

4.1 Intrinsic Dimensions

The LoRA paper mentions that its approach stems from the intrinsic dimension discussed in two earlier papers: common pre-trained models have a very low intrinsic dimension. In other words, there exists a low-dimensional reparameterization that is as effective for fine-tuning as the full parameter space.

2220

Therefore, we need to study the intrinsic dimension.

4.1.1 Definition

Intrinsic dimension refers to the number of effective dimensions in a dataset, that is, the minimum number of dimensions required to represent most of the information in the dataset. This concept is often used to address dimensionality reduction problems involving high-dimensional data.

In practical applications, many datasets appear high-dimensional, but in reality, a low-dimensional subspace exists that contains the majority of the data’s information. The intrinsic dimension of an objective function describes the minimum dimension required to solve the optimization problem it defines. Within the subspace represented by the intrinsic dimension, the original objective function can be optimized to within a certain degree of approximation error.

Determining the intrinsic dimensionality of a dataset can effectively reduce its dimensionality, leading to better understanding and analysis. Determining the intrinsic dimensionality of a dataset is a complex problem, typically requiring various mathematical methods and algorithms, such as Principal Component Analysis (PCA), Independent Component Analysis (ICA), and Multidimensional Scaling (MDS). These methods help us find an optimal low-dimensional representation that minimizes information loss and preserves as many features as possible.

4.1.2 Intrinsic Dimensions of the Model

Intrinsic dimensions are a useful tool for understanding and optimizing the complex behavior of large models.

In deep learning, overparameterization refers to the situation where the number of model parameters exceeds what is needed to interpolate the training data. The advantages of overparameterization, however, come with a significant increase in computational cost. Neural networks are in fact too large for most of the predictions they make: although the entire network is run for each prediction, only a small portion of the model is actually utilized. Therefore, when dealing with a small, specific task, such a large model is unnecessary; it may be sufficient to solve the problem within a subspace of the model parameters, without optimizing all of them.

In deep overparameterized factorizations, the learning dynamics of each weight matrix under gradient descent (GD) occur only within a roughly invariant low-dimensional subspace. It is therefore possible to obtain an end-to-end trajectory almost identical to that of the original fully parameterized factorization by running gradient descent on far fewer parameters. This lets us optimize only the low-rank core while ignoring the orthogonal components that do not change under gradient updates.

In practice, it is difficult to pinpoint the exact subspace corresponding to a problem, but we can approximate it. If optimizing only the parameters of a subspace reaches a certain fraction (for example, 90%) of the performance obtained by optimizing all parameters, then the dimension of this subspace can be called the intrinsic dimension of the problem. In other words, the intrinsic dimension is the dimension of the lowest-dimensional subspace within which the original objective function can be optimized to within a given approximation error.

For example, consider a model with parameters $\theta^D \in \mathbb{R}^D$. Training this model means searching for an effective solution in a $D$-dimensional space. Since $\theta^D$ may be redundant, optimizing only $d$ parameters (with $d < D$) may be enough to find an effective solution. That is, instead of optimizing the empirical loss over the original parameters $\theta^D$, the model is reparameterized and the loss is optimized over $\theta^d$. This can be expressed by the following formula:

$$\theta^D = \theta_0^D + P\,\theta^d$$

  • $\theta_0^D$: a randomly initialized parameter vector that is not updated during training.
  • $P \in \mathbb{R}^{D \times d}$: a randomly initialized projection matrix that is not updated during training.
  • $\theta^d$: the $d$-dimensional parameter vector to be optimized.
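The reparameterization can be sketched on a toy problem. The code below assumes the form $\theta^D = \theta_0^D + P\,\theta^d$ with a frozen random $\theta_0^D$ and a frozen random $P$, and trains only the $d$ subspace coordinates by gradient descent on a simple quadratic loss. The loss, the target vector, the sizes and the learning rate are all illustrative assumptions; measuring an actual intrinsic dimension would repeat such runs for increasing $d$ until a chosen performance threshold is reached.

```python
import random


def subspace_train(D, d, target, steps=200, lr=0.1, seed=0):
    """Minimize 0.5 * ||theta - target||^2 while updating only d subspace coordinates."""
    rng = random.Random(seed)
    theta0 = [rng.gauss(0.0, 1.0) for _ in range(D)]                         # frozen start point
    P = [[rng.gauss(0.0, 1.0) / D ** 0.5 for _ in range(d)] for _ in range(D)]  # frozen D x d projection
    theta_d = [0.0] * d                                                      # the only trainable part

    def full_params():
        # theta^D = theta_0^D + P theta^d
        return [theta0[i] + sum(P[i][j] * theta_d[j] for j in range(d)) for i in range(D)]

    def loss(theta):
        return 0.5 * sum((t - g) ** 2 for t, g in zip(theta, target))

    history = [loss(full_params())]
    for _ in range(steps):
        theta = full_params()
        grad_D = [t - g for t, g in zip(theta, target)]                      # gradient in the full space
        grad_d = [sum(P[i][j] * grad_D[i] for i in range(D)) for j in range(d)]  # P^T grad_D
        theta_d = [w - lr * g for w, g in zip(theta_d, grad_d)]
        history.append(loss(full_params()))
    return history


history = subspace_train(D=50, d=5, target=[1.0] * 50)
print(history[0], history[-1])
```

Only five numbers are trained here even though the model has fifty parameters; the loss drops, but it can only fall as far as the random five-dimensional subspace allows.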

In other words, if updating only a $d$-dimensional parameter vector during training lets the network reach its intended performance, then $d$ is referred to as the intrinsic dimension of the model. Calculating the exact intrinsic dimension of the objective function is difficult; therefore, a heuristic method is used to calculate an upper bound, as follows:

2221

For large models, testing intrinsic dimensions reveals how many parameters need to be adjusted to approximate a solution to a specific type of downstream problem. From a certain perspective, the training process is like traversing a path within an objective landscape. As long as the training dataset and network architecture are defined, the landscape is completely defined. Once the landscape is instantiated and fixed, subsequent parameter initialization, forward pass, backward pass, and gradient optimization are all explored within this space.

4.1.3 Pre-training and intrinsic dimensions

Researchers have conducted empirical studies on existing pre-training methods and their respective intrinsic dimensions, and their insights are as follows:

  • The effectiveness of pre-trained models can be explained by their intrinsic dimension. Pre-training can be interpreted as a compression framework that implicitly optimizes the average description length for natural language tasks (reducing the intrinsic dimension). In other words, there exists a low-dimensional reparameterization whose optimization is as effective as fine-tuning the full parameters of the pre-trained model.
  • Larger models tend to have lower intrinsic dimensionality. From this perspective, larger models have stronger information compression capabilities: after a period of training, the more parameters a model has, the less information is needed to represent a task (because the task can be learned in a lower-dimensional subspace). As the number of parameters of the pre-trained representation increases, the intrinsic dimensionality decreases. Simpler downstream tasks also tend to have lower intrinsic dimensionality. In the context of pre-trained representations, the intrinsic dimensionality of common NLP tasks is several orders of magnitude smaller than the full parameterization.
  • Lower intrinsic dimensionality generally results in better generalization performance. Generalization is not necessarily measured by the number of parameters or complexity of the pre-trained model; it can also be measured by how well the pre-trained model compresses downstream tasks. In a sense, if we want to compress downstream tasks better, we must expect the pre-trained representations to have considerable complexity.

4.1.4 LoRA and Intrinsic Dimension

Why is the LoRA approach effective? We can look at it from the perspective of intrinsic dimensions.

  • The intrinsic dimension of an objective function measures the minimum number of parameters required to achieve a satisfactory solution. For large models, it represents the minimum number of features you need to retain during dimensionality reduction or compression to best preserve the data’s characteristics. Measuring the intrinsic dimension tells us how many free parameters we need to adjust to approximate the optimization problem.
  • Over-parameterized models exhibit a low intrinsic dimensionality. This suggests that adequate learning performance can be achieved simply by updating parameters related to intrinsic rank, resulting in good performance on downstream tasks.

Based on these observations, LoRA proposed using low-rank matrices to update the dense layers of the model, thereby improving both parameter efficiency and computational efficiency.

4.2 Subspace Fine-tuning

The paper “See Further for Parameter Efficient Fine-tuning by Standing on the Shoulders of Decomposition” allows us to analyze LoRA from a subspace perspective. This paper proposes a new framework called subspace fine-tuning using decomposition theory, including matrix (decomposition) and subspace (decomposition) theory. This framework unifies all known PEFT methods under a single theory and elucidates the mathematical principles of each method from the perspective of decomposition theory, providing a comprehensive theoretical foundation for understanding the intrinsic dynamics of different PEFT strategies. Furthermore, the paper analyzes why these methods lead to performance differences.

4.2.1 Subspace Fine-tuning

Consider $W \in \mathbb{R}^{n \times m}$ as the frozen weight matrix of any given backbone network layer, with $n \le m$ without loss of generality. Let $\mathcal{P}(W)$ quantify model performance, where a higher value indicates better performance. For a specific task, assume that an optimal weight matrix $W^*$ exists, so that $\mathcal{P}(W^*) \ge \mathcal{P}(W)$ for all $W \in \mathbb{R}^{n \times m}$. The objective of PEFT is therefore formulated as:

$$\min_{\phi} \ \ell\big(W^*, \phi(W)\big)$$

Here, $\ell$ measures the difference between two matrices. In previous work, the function $\phi$ was conceptualized as incremental (delta) tuning, that is, a modification of each element of the matrix $W$. While this representation is accurate, it is too general to capture the underlying logic of each method.

From the perspective of decomposition theory, adjusting a matrix essentially involves modifying its corresponding subspace. Subspace fine-tuning methods mainly focus on adjusting the subspace of the original parameters, involving the reconstruction and expansion of the subspace. Therefore, all PEFT methods can be viewed as subspace fine-tuning.

4.2.2 Classification

The paper treats $\phi(\cdot)$ as a transformation function that modifies the subspace associated with the weight matrix $W$. The goal of the transformation is to find the maximum projection of $W^*$ onto the subspace spanned by the bases of $\phi(W)$, and then to align $W$ with it. There are two ways to achieve this:

  • Subspace reconstruction: directly modify the subspace of $W$ so that it aligns better with $W^*$, that is, adjust $W$ so that it approaches the projection of $W^*$. The function is $f(W)$.
  • Subspace extension: introduce a new subspace and combine it with the original one, so that the resulting subspace is close to or contains $W^*$. The function is $g(W)$.

These processes can be expressed by the following formula:

$$\phi(W) = g\big(f(W)\big)$$

Here, $f(W)$ summarizes the subspace reconstruction process, and $g(f(W))$ describes the union of subspaces. We refer to these operations as “subspace reconstruction” and “subspace expansion,” respectively, and categorize existing methods into three types: methods based on subspace reconstruction, methods based on subspace expansion, and methods based on subspace composition.

  • Subspace reconstruction: decompose the complex space associated with the original weight matrix $W$ into more intuitive, easier-to-understand subspaces, and adjust the bases of these derived subspaces.
  • Subspace expansion: introduce a new subspace and look for the maximum projection of the optimal weights $W^*$ within the space spanned by the bases of the new subspace and of the subspace of the original weight matrix $W$.
  • Subspace composition: adjust the subspace through both reconstruction and expansion at the same time.

The diagram below illustrates the framework for subspace tuning.

  • a: Subspace tuning aims to identify the maximum projection of the optimal weights $W^*$ onto the subspace spanned by the bases of $\phi(W)$. Here, $\phi(W)$ represents the subspace transformation of the original frozen weights $W$.
  • b: Subspace reconstruction involves rescaling the subspace of the original weights $W$ to approximate the optimal weights $W^*$, or constructing a new subspace derived from the original weights. Subspace expansion involves adjusting the subspace of the original weights $W$ so that it approaches or encompasses $W^*$.
  • c: Numerical perspective of subspace tuning. Reconstruction involves modifying the original frozen parameters, while expansion requires adding new adjustable parameters.

2222

4.2.3 Subspace Reconstruction

Building upon the previously outlined framework, subspace reconstruction methods first divide the model space into interpretable subspaces. These subspaces are then refined to improve model efficiency. That is, decomposition followed by optimization of certain subspaces. Many PEFT strategies focus on directly reconstructing subspaces related to the original weight matrix. Notable examples include SAM-PARSER, Diff Pruning, (IA)3, BitFit, Prefix-tuning, and Prompt-tuning.

2223

We begin our exploration with Singular Value Decomposition (SVD), which systematically divides $W = U \Sigma V^{\top}$ into three main components:

  • $U$: the left singular vectors, forming an orthonormal basis for the column space;
  • $\Sigma$: the singular values, stored as diagonal elements, which measure the strength or importance of each principal axis and adjust the dimensions and scaling within the subspace;
  • $V$: the right singular vectors, forming an orthonormal basis for the row space.

SVD exposes the fundamental subspaces on which the structure of $W$ is built. By adjusting the subspaces obtained through the decomposition, the original space can be reconstructed. The refinement of these subspaces follows three different modes:

  • Mode 1, Singular Value Adjustment: adjust the singular values in $\Sigma$, thereby adjusting the scaling within the corresponding principal subspaces. Changing these values modifies the weight of each principal component without affecting the orientation of the subspaces;
  • Mode 2, Simple Singular Vector Adjustment: make simple adjustments to the singular vectors in $U$ and $V$ by scaling the subspaces they span. This preserves the directional properties of the subspaces while adjusting their magnitudes to improve performance;
  • Mode 3, Complex Singular Vector Adjustment: apply more complex transformations to the singular vectors, reorienting or reshaping the subspaces. This affects both the orientation and the scale of the subspaces, allowing a comprehensive adjustment of the matrix structure.

2224

4.2.4 Subspace Expansion

2225

The expansion approach introduces a new subspace, which is then combined with the subspace of the original weight matrix $W$. These methods aim to find the closest projection of the optimal weights $W^*$ within the expanded space. This is achieved by introducing an additional weight matrix that expands the basis of the original subspace to cover a larger region (as shown in the figure below). Typically, the transformation function for these methods is defined as $\phi(W) = g(W) = W + s\,\Delta W$, where $s$ represents the scaling factor. $\Delta W$, the newly introduced subspace, is also called the additional term.

Consider the weight matrix $W \in \mathbb{R}^{n \times m}$. Without loss of generality, assume $n \le m$. Ideally, we have $\phi(W) = W + s\,\Delta W = W^*$. This setting means that $W^*$ and $\phi(W)$ occupy the same row and column spaces and lie within the same hyperplane. If $W$ has rank $n$, its column space has dimension $n$, which allows it to span $\mathbb{R}^n$. Since we do not know $W^*$, a conservative assumption is that the column-space bases of $W$ and $\Delta W$ together can span the entire space. In the optimal case, the column basis vectors of $\Delta W$ should complement those of $W$. Because $W^*$ and $W$ probably share a large common basis, it may be enough for $\Delta W$ to account for the small subset of bases that is missing in $W$ but present in $W^*$, which makes $\Delta W$ a low-rank matrix.

For expansion-based methods, the goal is to determine the maximum projection of $W^*$ within the hyperplane formed by $W$ and $\Delta W$, so that $\phi(W)$ aligns with $W^*$ as closely as possible. Given fixed $W$ and $\Delta W$, only one value of the scaling factor $s$ aligns the direction of $\phi(W)$ with that projection. Therefore, the value of $s$ can have a significant or even critical impact on performance.

In parameter-efficient fine-tuning, there are two main families of extension-based methods. The first family is LoRA and its derivatives, including LoRA, AdaLoRA, TriLoRA, FLoRA, VeRA, and LoTR. These methods primarily rely on low-rank matrix factorization techniques. The second family is adapter-derived methods, which introduce small-scale neural modules or adapters into existing architectures. See the figure below for details.

The figure below shows the subspace and numerical view based on the extension method. The extension-based method introduces an additional weight matrix and then attempts to find the optimal weight projection within the subspace spanned by this additional weight and the original weights. To achieve this, the basis of the subspace constructed by the additional matrix should supplement the basis of the original weights as much as possible. The right side of the figure lists some common extension-based methods and their operations on the matrices.

2226

4.2.5 Subspace Combination

Subspace combination performs both subspace reconstruction and subspace extension, combining the principles of both approaches. Furthermore, methods that could be classified as either reconstruction-based or extension-based are also classified as combination-based. Representative combination-based methods include DoRA, Spectral Adapter, and SVDiff.

4.3 Low-rank representation theory of complex systems

Complex systems are high-dimensional, nonlinear dynamical systems whose components interact heterogeneously. To make interpretable predictions of the large-scale behavior of complex systems, they are often assumed to be reducible to a small number of equations involving low-rank matrices.

The paper “The low-rank hypothesis of complex systems” explores the validity of the low-rank hypothesis, demonstrating that many complex systems can be simplified while still retaining the fundamental characteristics of the original high-dimensional network. From this perspective, dimensionality reduction techniques are based on an implicit assumption that the dynamics of high-dimensional complex systems depend on the behavior of low-rank matrices.

In the image below:

  • a shows the brain hemisphere of the fruit fly (Drosophila melanogaster), an example of a complex system.
  • b shows the network diagram of the Drosophila melanogaster connectome, where 5% of the 21,733 vertices were randomly selected for visualization.
  • c shows the singular value decomposition of a real matrix of rank $r$. The truncated SVD is the best low-rank approximation of the matrix.
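
The statement that the truncated SVD is the best low-rank approximation (the Eckart-Young theorem) can be checked numerically. The sketch below is illustrative only: the matrix is synthetic (a rank-3 signal plus small noise, both made up for this example), and `truncated_svd` is a helper name introduced here.

```python
import numpy as np


def truncated_svd(M, k):
    """Return the rank-k truncated SVD approximation of M."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]


rng = np.random.default_rng(0)
# A matrix that is approximately rank 3: low-rank signal plus small noise.
signal = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 40))
M = signal + 0.01 * rng.standard_normal((50, 40))

singular_values = np.linalg.svd(M, compute_uv=False)
M_3 = truncated_svd(M, 3)

# Eckart-Young: the Frobenius error equals the norm of the discarded singular values.
error = np.linalg.norm(M - M_3, "fro")
expected = np.sqrt(np.sum(singular_values[3:] ** 2))
rel_error = error / np.linalg.norm(M, "fro")
print(f"relative error of the rank-3 approximation: {rel_error:.4f}")
```

When the singular values decay quickly, as they do here by construction, a handful of components captures almost all of the matrix. That is the situation the low-rank hypothesis assumes for real systems.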

2227

4.4 Neural Tangent Kernel (NTK)

The paper “A kernel-based view of language model fine-tuning” investigates the role of LoRA from the perspective of NTK. The paper finds that LoRA approximately preserves the kernel of the original model during fine-tuning. See the figure below for details.

2228

Although LoRA restricts updates to a low-rank subspace, it effectively focuses on the gradients that have the greatest impact on changes in network behavior. By focusing on these critical gradients, LoRA preserves the model’s generalization ability, ensuring the network remains sensitive to changes in the basic input while maintaining high parameter efficiency. See the figure below for details.

4.5 Changes to the Model

The paper “The impact of LoRA on the emergence of clusters in transformers” analyzes the dynamic properties of the attention matrix and proves that the low-rank modification introduced by LoRA maintains short-term stability in token clustering while promoting significant long-term differences in the learned representations.

After introducing LoRA, the dynamic behavior of attention changes as follows.

2229

LoRA can maintain the short-term stability of token clustering by keeping the Wasserstein distance between perturbed and unperturbed tokens bounded.

2230

A key result is the identification of a phase transition in token clustering that occurs at a critical time. The cluster then branches into new clusters, with the critical time controlled by the eigenvalue gap of the value matrix. This demonstrates how LoRA can fine-tune the model without catastrophic forgetting, preserving token structure early in training while allowing controlled divergence.

0x05 Implementation

We use the Hugging Face PEFT code as the reference implementation.

5.1 Use

By specifying the configuration, we can use LoRA.

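from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# `model` is assumed to be a quantized (k-bit) base model that has already been loaded.
model = prepare_model_for_kbit_training(model)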
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)

5.2 Creation

5.2.1 LoraModel

The class corresponding to the LoRA fine-tuning method is LoraModel. We skip some code and look up LoraModel directly through PEFT_TYPE_TO_MODEL_MAPPING.

PEFT_TYPE_TO_MODEL_MAPPING = {
    PeftType.LORA: LoraModel,
    PeftType.LOHA: LoHaModel,
    PeftType.LOKR: LoKrModel,
    PeftType.PROMPT_TUNING: PromptEmbedding,
    PeftType.P_TUNING: PromptEncoder,
    PeftType.PREFIX_TUNING: PrefixEncoder,
    PeftType.ADALORA: AdaLoraModel,
    PeftType.BOFT: BOFTModel,
    PeftType.ADAPTION_PROMPT: AdaptionPromptModel,
    PeftType.IA3: IA3Model,
    PeftType.OFT: OFTModel,
    PeftType.POLY: PolyModel,
    PeftType.LN_TUNING: LNTuningModel,
    PeftType.VERA: VeraModel,
    PeftType.FOURIERFT: FourierFTModel,
    PeftType.XLORA: XLoraModel,
    PeftType.HRA: HRAModel,
    PeftType.VBLORA: VBLoRAModel,
    PeftType.CPT: CPTEmbedding,
    PeftType.BONE: BoneModel,
}

Since the base class of LoraModel is BaseTuner, we need to take a look at BaseTuner.

class LoraModel(BaseTuner):
    """
    Creates Low Rank Adapter (LoRA) model from a pretrained transformers model.
    The method is described in detail in https://arxiv.org/abs/2106.09685.
    """

5.2.2 BaseTuner

The base class BaseTuner mainly contains two functions:

  • inject_adapter(): first stores the name of each layer of the model in key_list, and then iterates through key_list to obtain, for the current layer, the parent module, the layer name, and the module that the layer name refers to.
  • _create_and_replace(): an abstract function in BaseTuner; its concrete implementation is in the LoraModel class.

class BaseTuner(nn.Module, ABC):
    r"""
    A base tuner model that provides the common methods and attributes for all tuners that are injectable into a torch.nn.Module
    """
    def __init__(
        self,
        model,
        peft_config: Union[PeftConfig, dict[str, PeftConfig]],
        adapter_name: str,
        low_cpu_mem_usage: bool = False,
    ) -> None:
        super().__init__()

        self.model = model
        self.targeted_module_names: list[str] = []
        self.active_adapter: str | list[str] = adapter_name
        # (Omitted in this excerpt: the code that stores peft_config in self.peft_config.)
        self._pre_injection_hook(self.model, self.peft_config[adapter_name], adapter_name)
        if peft_config != PeftType.XLORA or peft_config[adapter_name] != PeftType.XLORA:
            self.inject_adapter(self.model, adapter_name, low_cpu_mem_usage=low_cpu_mem_usage)

        # Copy the peft_config in the injected model.
        self.model.peft_config = self.peft_config

After obtaining the key LoRA parameters r and alpha, the _create_and_replace() function either updates an existing LoRA layer or uses the _create_new_module() function to create a new module that replaces the target.

def _create_and_replace(
    self,
    lora_config,
    adapter_name,
    target,
    target_name,
    parent,
    current_key,
):
    # Regexp matching - Find key which matches current target_name in patterns provided
    r_key = get_pattern_key(lora_config.rank_pattern.keys(), current_key)
    alpha_key = get_pattern_key(lora_config.alpha_pattern.keys(), current_key)
    # Get the r and alpha parameters, falling back to the global defaults
    r = lora_config.rank_pattern.get(r_key, lora_config.r)
    alpha = lora_config.alpha_pattern.get(alpha_key, lora_config.lora_alpha)

    kwargs = {
        "r": r,
        "lora_alpha": alpha,
        "lora_dropout": lora_config.lora_dropout,
        "fan_in_fan_out": lora_config.fan_in_fan_out,
        "init_lora_weights": lora_config.init_lora_weights,
        "use_rslora": lora_config.use_rslora,
        "use_dora": lora_config.use_dora,
        "ephemeral_gpu_offload": lora_config.runtime_config.ephemeral_gpu_offload,
        "lora_bias": lora_config.lora_bias,
        "loaded_in_8bit": getattr(self.model, "is_loaded_in_8bit", False),
        "loaded_in_4bit": getattr(self.model, "is_loaded_in_4bit", False),
    }

    # note: AdaLoraLayer is a subclass of LoraLayer, we need to exclude it
    from peft.tuners.adalora import AdaLoraLayer

    if isinstance(target, LoraLayer) and not isinstance(target, AdaLoraLayer):
        # If the target is already a LoraLayer (but not an AdaLoraLayer), update it in place
        target.update_layer(
            adapter_name,
            r,
            lora_alpha=alpha,
            lora_dropout=lora_config.lora_dropout,
            init_lora_weights=lora_config.init_lora_weights,
            use_rslora=lora_config.use_rslora,
            use_dora=lora_config.use_dora,
            lora_bias=lora_config.lora_bias,
        )
    else:
        # Otherwise, create a new module from the key LoRA parameters
        new_module = self._create_new_module(lora_config, adapter_name, target, **kwargs)
        if adapter_name not in self.active_adapters:
            # adding an additional adapter: it is not automatically trainable
            new_module.requires_grad_(False)
        self._replace_module(parent, target_name, new_module, target)

5.2.3 Creation of LoraModel

The _create_new_module() function calls the dispatch function for processing; let’s go into the dispatch_default() function to see what’s going on.

@staticmethod
def _create_new_module(lora_config, adapter_name, target, **kwargs):
    # Collect dispatcher functions to decide what backend to use for the replaced LoRA layer. The order matters,
    # because the first match is always used. Therefore, the default layers should be checked last.
    dispatchers = []
    if lora_config._custom_modules:
        def dynamic_dispatch_func(target, adapter_name, lora_config, **kwargs):
            new_module = None
            if isinstance(target, BaseTunerLayer):
                target_base_layer = target.get_base_layer()
            else:
                target_base_layer = target
            for key, custom_cls in lora_config._custom_modules.items():
                if isinstance(target_base_layer, key):
                    new_module = custom_cls(target, adapter_name, **kwargs)
                    break
            return new_module
        dispatchers.append(dynamic_dispatch_func)

    dispatchers.extend(
        [
            dispatch_eetq,
            dispatch_aqlm,
            dispatch_awq,
            dispatch_gptq,
            dispatch_hqq,
            dispatch_torchao,
            dispatch_megatron,
            dispatch_default,
        ]
    )
    new_module = None
    for dispatcher in dispatchers:
        new_module = dispatcher(target, adapter_name, lora_config=lora_config, **kwargs)
        if new_module is not None:  # first match wins
            break
    return new_module
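
The "first match wins" dispatch pattern used above can be shown in isolation. The following is a toy sketch with hypothetical dispatchers that return strings instead of real modules; none of these names exist in PEFT.

```python
from typing import Callable, Optional


# Hypothetical dispatchers: each returns a description of the wrapped module, or None.
def dispatch_quantized(target: dict) -> Optional[str]:
    if target.get("quantized"):
        return f"QuantizedLoraLinear({target['name']})"
    return None


def dispatch_plain(target: dict) -> Optional[str]:
    if target.get("kind") == "linear":
        return f"LoraLinear({target['name']})"
    return None


def create_new_module(
    target: dict, dispatchers: list[Callable[[dict], Optional[str]]]
) -> Optional[str]:
    """Try each dispatcher in order and return the first non-None result."""
    for dispatcher in dispatchers:
        new_module = dispatcher(target)
        if new_module is not None:
            return new_module
    return None


# Specialized dispatchers come first; the generic one is the fallback.
dispatchers = [dispatch_quantized, dispatch_plain]
quantized_layer = {"name": "q_proj", "kind": "linear", "quantized": True}
plain_layer = {"name": "v_proj", "kind": "linear"}
unsupported_layer = {"name": "norm", "kind": "layernorm"}

print(create_new_module(quantized_layer, dispatchers))
print(create_new_module(plain_layer, dispatchers))
print(create_new_module(unsupported_layer, dispatchers))
```

Because a quantized linear layer also satisfies the generic check, reversing the list would route it to the generic wrapper. This is why the default dispatcher is placed last.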

As can be seen from the dispatch_default() function, the default LoRA implementation targets Embedding, Conv2d, Conv3d, Linear, and Conv1D layers.

def dispatch_default(
    target: torch.nn.Module,
    adapter_name: str,
    lora_config: LoraConfig,
    **kwargs,
) -> Optional[torch.nn.Module]:
    new_module = None

    if isinstance(target, BaseTunerLayer):
        target_base_layer = target.get_base_layer()
    else:
        target_base_layer = target

    # Handle Embedding layers
    if isinstance(target_base_layer, torch.nn.Embedding):
        embedding_kwargs = kwargs.copy()
        embedding_kwargs.pop("fan_in_fan_out", None)
        embedding_kwargs.update(lora_config.loftq_config)
        new_module = Embedding(target, adapter_name, **embedding_kwargs)
    # Handle Conv2d layers
    elif isinstance(target_base_layer, torch.nn.Conv2d):
        kwargs.update(lora_config.loftq_config)
        new_module = Conv2d(target, adapter_name, **kwargs)
    # Handle Conv3d layers
    elif isinstance(target_base_layer, torch.nn.Conv3d):
        kwargs.update(lora_config.loftq_config)
        new_module = Conv3d(target, adapter_name, **kwargs)
    # Handle Linear layers
    elif isinstance(target_base_layer, torch.nn.Linear):
        kwargs.update(lora_config.loftq_config)
        new_module = Linear(target, adapter_name, **kwargs)
    # Handle Conv1D layers
    elif isinstance(target_base_layer, Conv1D):
        kwargs.update(lora_config.loftq_config)
        new_module = Linear(target, adapter_name, is_target_conv_1d_layer=True, **kwargs)

    return new_module

5.3 Adjusting specific modules

We’ll use Linear as an example. The Linear class inherits from both nn.Module and LoraLayer.

5.3.1 Linear

class Linear(nn.Module, LoraLayer):
    # Lora implemented in a dense layer
    def __init__(
        self,
        base_layer,
        adapter_name: str,
        r: int = 0,
        lora_alpha: int = 1,
        lora_dropout: float = 0.0,
        fan_in_fan_out: bool = False,  # Set this to True if the layer to replace stores weight like (fan_in, fan_out)
        is_target_conv_1d_layer: bool = False,
        init_lora_weights: Union[bool, str] = True,
        use_rslora: bool = False,
        use_dora: bool = False,
        lora_bias: bool = False,
        **kwargs,
    ) -> None:
        super().__init__()
        LoraLayer.__init__(self, base_layer, **kwargs)
        self.fan_in_fan_out = fan_in_fan_out

        self._active_adapter = adapter_name
        self.update_layer(
            adapter_name,
            r,
            lora_alpha=lora_alpha,
            lora_dropout=lora_dropout,
            init_lora_weights=init_lora_weights,
            use_rslora=use_rslora,
            use_dora=use_dora,
            lora_bias=lora_bias,
        )
        self.is_target_conv_1d_layer = is_target_conv_1d_layer

The key behavior of the LoraLayer initialization method, which Linear calls, is to obtain the input and output dimensions of the adaptable layers (Embedding, Conv1D, Conv2d, Conv3d, Linear), so that the new layers can be constructed.

5.3.2 LoraLayer

The LoraLayer class obtains the input and output dimensions of the adaptable layers (Embedding, Conv1D, Conv2d, Conv3d, Linear), making it convenient to construct new layers. LoraLayer has the following important functions:

  • update_layer(): stores the key LoRA parameters r and lora_alpha; decides whether to add a Dropout layer based on lora_dropout; creates the linear layers lora_A and lora_B, computes the scaling factor, and initializes the weights.
  • reset_lora_parameters(): initializes lora_A (Kaiming-uniform by default, or Gaussian) and sets lora_B to zero.
  • set_adapter(): sets the active adapter(s) and marks their parameters as trainable.

class LoraLayer(BaseTunerLayer):
    # All names of layers that may contain (trainable) adapter weights
    adapter_layer_names = ("lora_A", "lora_B", "lora_embedding_A", "lora_embedding_B")
    # All names of other parameters that may contain adapter-related parameters
    other_param_names = ("r", "lora_alpha", "scaling", "lora_dropout")

    def __init__(self, base_layer: nn.Module, ephemeral_gpu_offload: bool = False, **kwargs) -> None:
        self.base_layer = base_layer
        self.r = {}
        self.lora_alpha = {}
        self.scaling = {}
        self.lora_dropout = nn.ModuleDict({})
        self.lora_A = nn.ModuleDict({})
        self.lora_B = nn.ModuleDict({})
        # For Embedding layer
        self.lora_embedding_A = nn.ParameterDict({})
        self.lora_embedding_B = nn.ParameterDict({})
        # Mark the weight as unmerged
        self._disable_adapters = False
        self.merged_adapters = []
        self.use_dora: dict[str, bool] = {}
        self.lora_bias: dict[str, bool] = {}
        self.lora_magnitude_vector = torch.nn.ModuleDict()  # for DoRA
        self._caches: dict[str, Any] = {}
        self.ephemeral_gpu_offload: bool = ephemeral_gpu_offload
        self.kwargs = kwargs

        base_layer = self.get_base_layer()
        if isinstance(base_layer, nn.Linear):
            in_features, out_features = base_layer.in_features, base_layer.out_features
        elif isinstance(base_layer, nn.Conv2d):
            in_features, out_features = base_layer.in_channels, base_layer.out_channels
        elif isinstance(base_layer, nn.Conv3d):
            in_features, out_features = base_layer.in_channels, base_layer.out_channels
        elif isinstance(base_layer, nn.Embedding):
            in_features, out_features = base_layer.num_embeddings, base_layer.embedding_dim
        elif isinstance(base_layer, Conv1D):
            in_features, out_features = (
                base_layer.weight.ds_shape if hasattr(base_layer.weight, "ds_shape") else base_layer.weight.shape
            )
        # (other cases omitted)

        self.in_features = in_features
        self.out_features = out_features

    def update_layer(
        self,
        adapter_name,
        r,
        lora_alpha,
        lora_dropout,
        init_lora_weights,
        use_rslora,
        use_dora: bool = False,
        lora_bias: bool = False,
    ):
        # This code works for linear layers, override for other layer types
        # Store the r and alpha parameters
        self.r[adapter_name] = r
        self.lora_alpha[adapter_name] = lora_alpha
        # Add a Dropout layer if a dropout probability is given
        if lora_dropout > 0.0:
            lora_dropout_layer = nn.Dropout(p=lora_dropout)
        else:
            lora_dropout_layer = nn.Identity()

        # Register lora_dropout_layer in lora_dropout
        self.lora_dropout.update(nn.ModuleDict({adapter_name: lora_dropout_layer}))
        # Actual trainable parameters: matrices A and B
        self.lora_A[adapter_name] = nn.Linear(self.in_features, r, bias=False)
        self.lora_B[adapter_name] = nn.Linear(r, self.out_features, bias=lora_bias)
        self.lora_bias[adapter_name] = lora_bias

        if use_rslora:
            self.scaling[adapter_name] = lora_alpha / math.sqrt(r)
        else:
            self.scaling[adapter_name] = lora_alpha / r

        # for inits that require access to the base weight, use gather_param_ctx so that the weight is gathered when using DeepSpeed
        if isinstance(init_lora_weights, str) and init_lora_weights.startswith("pissa"):
            with gather_params_ctx(self.get_base_layer().weight):
                self.pissa_init(adapter_name, init_lora_weights)
        elif isinstance(init_lora_weights, str) and init_lora_weights.lower() == "olora":
            with gather_params_ctx(self.get_base_layer().weight):
                self.olora_init(adapter_name)
        elif init_lora_weights == "loftq":
            with gather_params_ctx(self.get_base_layer().weight):
                self.loftq_init(adapter_name)
        elif init_lora_weights == "eva":
            nn.init.zeros_(self.lora_B[adapter_name].weight)
        elif init_lora_weights:
            self.reset_lora_parameters(adapter_name, init_lora_weights)
        # call this before dora_init
        self._move_adapter_to_device_of_base_layer(adapter_name)

        if use_dora:
            self.dora_init(adapter_name)
            self.use_dora[adapter_name] = True
        else:
            self.use_dora[adapter_name] = False

        # Set the trainable parameters
        self.set_adapter(self.active_adapters)
        
    def reset_lora_parameters(self, adapter_name, init_lora_weights):
        if adapter_name in self.lora_A.keys():
            # If init_lora_weights is True, use Kaiming-uniform initialization
            if init_lora_weights is True:
                # initialize A the same way as the default for nn.Linear and B to zero
                nn.init.kaiming_uniform_(self.lora_A[adapter_name].weight, a=math.sqrt(5))
            # If it is "gaussian", use normal (Gaussian) initialization
            elif init_lora_weights.lower() == "gaussian":
                nn.init.normal_(self.lora_A[adapter_name].weight, std=1 / self.r[adapter_name])
            else:
                raise ValueError(f"Unknown initialization {init_lora_weights=}")
            # Initialize matrix B to all zeros
            nn.init.zeros_(self.lora_B[adapter_name].weight)
            if self.lora_bias[adapter_name]:
                nn.init.zeros_(self.lora_B[adapter_name].bias)
        if adapter_name in self.lora_embedding_A.keys():
            # Initialize A to zeros and B the same way as the default for nn.Embedding
            nn.init.zeros_(self.lora_embedding_A[adapter_name])
            nn.init.normal_(self.lora_embedding_B[adapter_name])
            if self.lora_bias[adapter_name]:
                # embeddings are not supported at the moment, but still adding this for consistency
                nn.init.zeros_(self.lora_embedding_B[adapter_name].bias)
                
    def set_adapter(self, adapter_names: str | list[str]) -> None:
        """Set the active adapter(s).

        Additionally, this function will set the specified adapters to trainable (i.e., requires_grad=True). If this is
        not desired, use the following code.

        ```py
        >>> for name, param in model_peft.named_parameters():
        ...     if ...:  # some check on name (ex. if 'lora' in name)
        ...         param.requires_grad = False
        ```

        Args:
            adapter_name (`str` or `List[str]`): Name of the adapter(s) to be activated.
        """
        if isinstance(adapter_names, str):
            adapter_names = [adapter_names]

        # Deactivate grads on the inactive adapter and activate grads on the active adapter
        for layer_name in self.adapter_layer_names:
            module_dict = getattr(self, layer_name)
            for key, layer in module_dict.items():
                # Enable gradients for layers of the adapters in adapter_names; disable them for all others
                if key in adapter_names:
                    # Note: It is possible that not a single layer is called with requires_grad_(True) here. This may
                    # happen if a completely different adapter layer is being activated.
                    layer.requires_grad_(True)
                else:
                    layer.requires_grad_(False)

        self._active_adapter = adapter_names          
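
The effect of the default initialization and of the two scaling rules in update_layer() can be checked with a few lines of NumPy. This is a minimal sketch with made-up dimensions. It uses a plain Gaussian for A for brevity, whereas the PEFT default shown above is Kaiming-uniform.

```python
import math

import numpy as np

rng = np.random.default_rng(0)
in_features, out_features, r, lora_alpha = 64, 32, 8, 16

# A gets a random init and B starts at zero, mirroring the init described above.
lora_A = rng.standard_normal((r, in_features)) / r
lora_B = np.zeros((out_features, r))

scaling = lora_alpha / r                 # standard LoRA
scaling_rs = lora_alpha / math.sqrt(r)   # rank-stabilized variant (use_rslora=True)

# Because B is zero, the low-rank update is zero, so the adapted layer starts
# out identical to the frozen base layer.
delta_W = scaling * (lora_B @ lora_A)
print("initial update is zero:", not delta_W.any())
print("scaling:", scaling, "rsLoRA scaling:", round(scaling_rs, 3))
```

Starting from a zero update means fine-tuning begins exactly at the pre-trained model. The two scaling rules only differ in how strongly the update shrinks as r grows.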

5.3.3 Forward Propagation

The forward() function of Linear shows how the LoRA output is combined with the output of the original model.

class Linear(nn.Module, LoraLayer):
    def forward(self, x: torch.Tensor, *args: Any, **kwargs: Any) -> torch.Tensor:
        self._check_forward_args(x, *args, **kwargs)
        adapter_names = kwargs.pop("adapter_names", None)

        if self.disable_adapters:
            if self.merged:
                self.unmerge()
            result = self.base_layer(x, *args, **kwargs)
        elif adapter_names is not None:
            result = self._mixed_batch_forward(x, *args, adapter_names=adapter_names, **kwargs)
        elif self.merged:
            result = self.base_layer(x, *args, **kwargs)
        else:
            # Compute the output of the original (base) layer
            result = self.base_layer(x, *args, **kwargs)
            torch_result_dtype = result.dtype
            for active_adapter in self.active_adapters:
                if active_adapter not in self.lora_A.keys():
                    continue
                lora_A = self.lora_A[active_adapter]
                lora_B = self.lora_B[active_adapter]
                dropout = self.lora_dropout[active_adapter]
                scaling = self.scaling[active_adapter]
                x = x.to(lora_A.weight.dtype)

                if not self.use_dora[active_adapter]:
                    # Base output plus the scaled output of the trainable LoRA branch
                    result = result + lora_B(lora_A(dropout(x))) * scaling
                else:
                    if isinstance(dropout, nn.Identity) or not self.training:
                        base_result = result
                    else:
                        x = dropout(x)
                        base_result = None

                    result = result + self.lora_magnitude_vector[active_adapter](
                        x,
                        lora_A=lora_A,
                        lora_B=lora_B,
                        scaling=scaling,
                        base_layer=self.get_base_layer(),
                        base_result=base_result,
                    )

            result = result.to(torch_result_dtype)

        return result
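
The non-DoRA branch of forward() and the merged path compute the same function, which can be verified numerically. The sketch below uses NumPy with made-up shapes and omits dropout and biases. It illustrates the algebra and is not the PEFT code.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, in_features, out_features, r, lora_alpha = 4, 16, 8, 2, 4
scaling = lora_alpha / r

W = rng.standard_normal((out_features, in_features))  # frozen base weight
A = rng.standard_normal((r, in_features))              # lora_A weight
B = rng.standard_normal((out_features, r))             # lora_B weight (non-zero after training)
x = rng.standard_normal((batch, in_features))

# Unmerged path: base output plus the scaled low-rank branch.
unmerged = x @ W.T + (x @ A.T @ B.T) * scaling

# Merged path: fold the update into the base weight once, then do a single matmul.
W_merged = W + scaling * (B @ A)
merged = x @ W_merged.T

print("max abs difference:", np.max(np.abs(unmerged - merged)))
```

This equivalence is why a merged adapter adds no inference latency: after merging, the layer is an ordinary dense layer whose weight differs from the original by a rank-r matrix.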

0x06 Improvement

Note: The following text is based on the paper “Low-Rank Adaptation for Foundation Models: A Comprehensive Review”.

While LoRA can achieve adequate adaptive performance on some downstream tasks, its performance still lags behind that of full-parameter fine-tuning on many downstream tasks, such as knowledge-intensive tasks like mathematical reasoning and programming. To bridge this gap, many methods have been proposed to further improve LoRA’s performance in downstream task adaptation. Existing methods typically enhance downstream adaptation performance from several angles: parameter efficiency enhancement; rank adaptation; and training process improvement.

6.1 Enhanced Parameter Efficiency

While LoRA significantly reduces the number of tunable parameters in fine-tuning LLMs through its low-rank projection matrices, it still requires a large number of trainable parameters and expensive activation memory to update those matrices. For example, applying LoRA to the LLaMA-2-70B model requires updating over 16 million parameters. Moreover, as the number of downstream tasks grows, the number of LoRA plugins also grows, making further efficiency improvements a critical issue. Current research primarily addresses this challenge through four methods: parameter decomposition, parameter pruning, parameter freezing and sharing, and parameter quantization. The figure below illustrates examples of these techniques.
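
The trainable-parameter count follows directly from the shapes of the two low-rank matrices. The sketch below counts parameters for a hypothetical model; the layer count, hidden size, rank, and number of adapted matrices are illustrative assumptions and do not describe any specific released model.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters of one LoRA pair: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters of the dense weight matrix that LoRA leaves frozen."""
    return d_in * d_out


# Hypothetical model: 32 layers, hidden size 4096, LoRA on two square projections per layer.
num_layers, hidden, r, adapted_per_layer = 32, 4096, 8, 2

lora_total = num_layers * adapted_per_layer * lora_param_count(hidden, hidden, r)
full_total = num_layers * adapted_per_layer * full_param_count(hidden, hidden)

print(f"LoRA trainable parameters: {lora_total:,}")
print(f"fraction of the adapted matrices: {lora_total / full_total:.4%}")
```

The count grows linearly in r and in the number of adapted matrices, so storing one adapter per downstream task multiplies this figure by the number of tasks.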

2231

6.1.1 Parameter Decomposition

Parameter decomposition methods improve parameter efficiency while maintaining task performance by decomposing matrices into a more compact form. In addition to reducing the number of trainable parameters, these methods also enable finer control during fine-tuning.

Method

Current methods can be broadly categorized into two main approaches: update matrix factorization and pre-trained weight factorization. Both methods offer unique advantages in terms of parameter efficiency and fine-tuning flexibility. Update matrix factorization focuses on decomposing the incremental updates applied during fine-tuning, while pre-trained weight factorization directly modifies the structure of the original model’s weights. The figure below provides a detailed comparison of these methods.

2232

Update matrix factorization

Among update matrix factorization methods, two main strategies have emerged:

  • Singular value decomposition (SVD). For example, AdaLoRA parameterizes the update weights in SVD form and then dynamically adjusts the rank of $\Delta W$ according to importance scores to achieve adaptive parameter efficiency during fine-tuning. Building on this, BiLoRA extends the framework with bi-level optimization, training the singular vectors and the singular values on different subsets of the data to mitigate overfitting.
  • Tensor train (TT) decomposition. For example, LoRETTA uses TT decomposition to represent a matrix as a series of small, low-rank three-dimensional tensors, often referred to as cores.

Let’s take AdaLoRA as an example of methods based on singular value decomposition (SVD). AdaLoRA parameterizes the update as $\Delta W = P \Lambda Q$, approximates an SVD by regularizing the orthogonality of $P$ and $Q$, and then discards unimportant singular values (filtering elements of the diagonal matrix $\Lambda$) based on a new importance scoring method. That is, AdaLoRA uses the singular values of the LoRA matrix as an importance index to decide which LoRA adapters keep which rank. Compared to standard LoRA with the same rank budget, AdaLoRA has the same total number of parameters, but they are distributed differently: in standard LoRA all matrices have the same rank, while in AdaLoRA important matrices have higher ranks and less important matrices have lower ranks. Experiments show that AdaLoRA produces better results than standard LoRA, indicating that it distributes trainable parameters better across the parts of the model that are particularly important for a given task.

2233

Research Motivation

The original LoRA method introduces a rank-4 matrix into each attention layer, employing an even parameter distribution strategy. However, this even distribution is clearly not optimal. Intuitively, some matrices in the model are more important than others and should be allocated more parameters for tuning, while others are relatively unimportant and do not require much modification. The AdaLoRA paper thus raises a core question: how can the parameter budget be allocated adaptively, based on the importance of different layers and modules of the transformer, to improve fine-tuning performance?

Approach

Importance scores can be calculated from, for example, the magnitude of singular values or the gradient contribution to the loss. This allows for more targeted parameter tuning. An intuitive approach is to use SVD (Singular Value Decomposition): each time the LoRA matrices $A$ and $B$ are updated, SVD is performed on all of the resulting update matrices. During each update, the magnitude of the singular values dynamically determines which ranks need updating and which should remain unchanged. If part of a layer’s matrix has low importance, only the high-importance parts are updated. However, this method is extremely inefficient because an SVD is required at every training step. For large models, frequently applying SVD to high-dimensional matrices is very time-consuming.

To address this issue, the paper proposes simulating the effect of SVD by parameterizing the update as $\Delta W = P \Lambda Q$. Specifically, a diagonal matrix $\Lambda$ represents the singular values, and the matrices $P$ and $Q$ represent the left and right singular vectors of $\Delta W$. To keep $P$ and $Q$ orthogonal, a regularization penalty term is added during training. This lets the model gradually approach the form of an SVD during training while avoiding the intensive computation of an explicit SVD. See the figure below for details.
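
The SVD-like parameterization and its orthogonality penalty can be sketched in NumPy. This is an illustrative sketch under stated assumptions, not the AdaLoRA implementation: the helper names are introduced here, the shapes are arbitrary, and pruning by absolute value is a simplified stand-in for the importance score that AdaLoRA actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 12, 10, 4

P = rng.standard_normal((d_out, r))  # plays the role of left singular vectors
Lam = rng.standard_normal(r)         # diagonal entries, stored as a vector
Q = rng.standard_normal((r, d_in))   # plays the role of right singular vectors


def delta_w(P, Lam, Q):
    """Update matrix in SVD-like form: P @ diag(Lam) @ Q."""
    return (P * Lam) @ Q


def orthogonality_penalty(P, Q):
    """||P^T P - I||_F^2 + ||Q Q^T - I||_F^2, zero when P and Q are orthonormal."""
    eye = np.eye(P.shape[1])
    return (
        np.linalg.norm(P.T @ P - eye, "fro") ** 2
        + np.linalg.norm(Q @ Q.T - eye, "fro") ** 2
    )


def prune_smallest(Lam, keep):
    """Zero all but the `keep` largest-magnitude entries (a simplified importance score)."""
    pruned = np.zeros_like(Lam)
    top = np.argsort(np.abs(Lam))[-keep:]
    pruned[top] = Lam[top]
    return pruned


penalty_random = orthogonality_penalty(P, Q)

# With exactly orthonormal factors the penalty vanishes.
P_orth, _ = np.linalg.qr(P)
Q_orth = np.linalg.qr(Q.T)[0].T
penalty_orth = orthogonality_penalty(P_orth, Q_orth)

Lam_pruned = prune_smallest(Lam, keep=2)
rank_after = np.linalg.matrix_rank(delta_w(P_orth, Lam_pruned, Q_orth))
print("penalty (random / orthonormal):", penalty_random, penalty_orth)
print("rank after pruning:", rank_after)
```

Zeroing entries of the diagonal lowers the rank of the update without touching P or Q, so a pruned direction can be restored later if its importance rises.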

2234

Pre-trained weight decomposition

A typical weight decomposition scheme is DoRA (Weight-Decomposed Low-Rank Adaptation), which decomposes pre-trained weights into two components, magnitude and direction, through normalization, and fine-tunes them separately. In particular, it updates the direction component efficiently using LoRA. This two-part approach gives DoRA greater flexibility than standard LoRA. Unlike LoRA, which tends to change magnitude and direction proportionally, DoRA can make subtle direction adjustments without necessarily increasing the magnitude. DoRA outperforms LoRA even with fewer parameters and is less sensitive to rank selection.

Research Motivation

2235

The authors of DoRA found that the learning and update patterns of LoRA differ markedly from those of full fine-tuning (FT), and that these differences may reflect the learning capacity of each method.

DoRA is based on the idea that any vector can be represented by its length (magnitude) and its direction (orientation). The paper first introduces a novel weight decomposition analysis to investigate the intrinsic differences in learning patterns between full-parameter fine-tuning and LoRA fine-tuning. Specifically, the paper decomposes the weight matrix into two independent parts: a magnitude vector m and a direction matrix V. Once m and V are obtained, DoRA applies a LoRA-style low-rank update only to the direction matrix V, while allowing the magnitude vector m to be trained independently. By examining how the LoRA and FT updates change magnitude and direction relative to the pre-trained weights, the analysis reveals fundamental differences in their learning behavior.

The weight decomposition of a matrix W can be expressed as shown in Figure 1 below. Using this formula, the fully fine-tuned weights W_FT and the merged LoRA weights W_LoRA are each decomposed in the same way and compared with the original weights W_0. For example, the magnitude and direction changes between W_0 and W_FT can be defined as shown in Figure 2 below. There, ΔM_FT^t and ΔD_FT^t denote the magnitude difference and the direction difference between W_0 and W_FT at training step t, and cos(·, ·) is the cosine similarity function. m_FT^{n,t} and m_0^n are the n-th scalars in their respective magnitude vectors, and V_FT^{n,t} and W_0^n are the n-th columns of V_FT^t and W_0. The magnitude and direction differences between W_LoRA and W_0 are calculated in the same way.
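For reference, the decomposition and the two difference measures can be written out as below. This follows the definitions in the DoRA paper as best understood here. The notation is an assumption of this note rather than a transcription of the figures: the weight matrix has k columns, the subscript 0 marks the pre-trained weights, and the norm with subscript c is the column-wise vector norm.

```latex
% Weight decomposition: magnitude vector m (1 x k) and direction matrix V (d x k)
W = m \, \frac{V}{\lVert V \rVert_c} = \lVert W \rVert_c \, \frac{W}{\lVert W \rVert_c}

% Average magnitude change and direction change at training step t
\Delta M_{\mathrm{FT}}^{t} = \frac{\sum_{n=1}^{k} \left| m_{\mathrm{FT}}^{n,t} - m_{0}^{n} \right|}{k},
\qquad
\Delta D_{\mathrm{FT}}^{t} = \frac{\sum_{n=1}^{k} \left( 1 - \cos\!\left( V_{\mathrm{FT}}^{n,t},\, W_{0}^{n} \right) \right)}{k}
```

Replacing the FT subscript with LoRA gives the corresponding measures for the merged LoRA weights.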

The paper selects four checkpoints from different training steps of FT and LoRA for analysis: three intermediate steps and the final checkpoint. Weight decomposition analysis is performed at these checkpoints to determine the changes in ΔM and ΔD across layers. In the lower part of the figure, the x-axis represents the change in direction, the y-axis represents the change in magnitude, and each point in the scatter plot corresponds to one layer. As can be seen:

  • FT shows a more varied learning pattern: its changes in direction and magnitude are not tied together in a fixed proportion.
  • LoRA exhibits a strong positive correlation, meaning that changes in direction and changes in magnitude are roughly proportional. This can be detrimental to more refined learning because it rules out finer adjustments. Specifically, LoRA is poor at making a small directional adjustment together with a large change in magnitude, or vice versa, a capability that is characteristic of full fine-tuning (FT).

Therefore, the paper suspects that LoRA's limitations may stem from the difficulty of learning magnitude and direction adaptation simultaneously, which may be too complex for LoRA. This observation motivates the paper's method.

2236

Approach

Based on the insights from the weight decomposition analysis, the paper introduces Weight-Decomposed Low-Rank Adaptation (DoRA). DoRA first decomposes the pre-trained weights into their magnitude and direction components, and then fine-tunes both components. See the formula in the figure below for details.

2237

Specifically, the following fine-tuning ideas are proposed.

  • First, DoRA restricts LoRA to adjusting the direction while leaving the magnitude component separately trainable. Compared to LoRA, which must learn magnitude and direction adjustments at the same time, DoRA simplifies the task. Corresponding to label 1 in the diagram below, the pre-trained weight matrix is decomposed into a magnitude vector m and a direction matrix V.
  • Second, since the direction component has a large number of parameters, the paper parameterizes its update with LoRA, which makes the directional update more stable and keeps fine-tuning efficient.

Because DoRA makes it easier to adjust the magnitude and direction components separately, or to compensate for a change in one component with an opposite change in the other, the relationship between direction and magnitude under DoRA more closely resembles that of full fine-tuning.
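The decomposition and the two-part update described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the direction update is written as `B @ A`, the magnitude is one scalar per column, and the names `column_norms` and `dora_weight` are hypothetical helpers, not an existing API.

```python
import numpy as np

def column_norms(W):
    """Column-wise L2 norms, kept as a (1, k) row so they broadcast over rows."""
    return np.linalg.norm(W, axis=0, keepdims=True)

def dora_weight(W0, m, B, A):
    """Merged weight: trainable magnitude m times the normalized direction W0 + B @ A."""
    V = W0 + B @ A
    return m * V / column_norms(V)

rng = np.random.default_rng(0)
d, k, r = 6, 4, 2
W0 = rng.normal(size=(d, k))       # frozen pre-trained weight
m = column_norms(W0)               # magnitude vector initialized from W0
A = rng.normal(size=(r, k))
B = np.zeros((d, r))               # zero init, so the update starts at zero

W_init = dora_weight(W0, m, B, A)  # reproduces W0 at initialization

# Changing only the low-rank factors changes direction but not magnitude.
B_trained = 0.1 * rng.normal(size=(d, r))
W_dir = dora_weight(W0, m, B_trained, A)
```

In this form a directional adjustment leaves every column norm equal to `m`, and scaling `m` changes magnitude without rotating any column, which is the decoupling the text describes. Since `dora_weight` returns an ordinary dense matrix, it can replace the original weight after training.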

2238

The following figure shows the magnitude and direction differences between the merged DoRA weights and W_0 under the same settings as FT and LoRA. Judging from the regression lines of (ΔD, ΔM), both DoRA and FT exhibit a negative slope. The paper hypothesizes that FT tends towards a negative slope because the pre-trained weights already contain a wealth of knowledge suitable for various downstream tasks. Therefore, with sufficient learning capacity, downstream adaptation can be achieved by a larger change in either magnitude or direction alone.

The paper also calculated the correlation between ΔD and ΔM for FT, LoRA, and DoRA, finding that the correlation values for FT and DoRA were -0.62 and -0.31, respectively, both negative. LoRA, however, showed a positive correlation of 0.83. DoRA can make significant directional adjustments with only small magnitude changes, or vice versa, and its learning pattern is closer to FT, indicating stronger learning ability than LoRA. DoRA can also be regarded as a replacement for LoRA with no added inference cost, because its decomposed magnitude and direction components can be merged back into the pre-trained weights after training.

2239

6.1.2 Parametric Pruning

Parametric pruning techniques focus on evaluating the importance of different parameters in the LoRA matrix and removing those deemed less important. These methods can be categorized into three types based on their pruning methods: importance-based pruning, regularization-based pruning, and output-based pruning.

  • Importance-based pruning: These methods use multiple metrics to evaluate parameter importance. SparseAdapter applies traditional network pruning techniques to LoRA parameters, evaluating importance through parameter magnitude, gradient information, and sensitivity analysis. RoseLoRA achieves row/column pruning through sensitivity-based scoring, enabling selective knowledge updates while preserving the advantages of low-rank adaptation. LoRA-prune, based on LoRA gradient information, jointly prunes the LoRA matrix and parameters of a large language model (LLM) to optimize the model structure.
  • Regularization-based pruning: These techniques induce sparsity through constrained optimization. SoRA places a gating mechanism between the down-projection and up-projection matrices of LoRA and optimizes it with proximal gradient descent under L1 regularization. Sparsity emerges automatically during training, and the zero-valued gate entries are removed after training.
  • Output-based pruning: Output-based methods evaluate LoRA parameters by their layer-wise impact on the output. LoRA-drop evaluates the importance of each layer's LoRA module by analyzing the distribution of its output. It keeps a separate LoRA module for each of the most important layers and shares a single LoRA module among the layers considered less important.
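The gating idea in the regularization-based bullet can be illustrated with a small sketch. It assumes a gate vector `g` applied elementwise between the down-projection and up-projection, and uses soft-thresholding as the proximal operator of the L1 penalty. The helper names and the numbers are made up for the example.

```python
import numpy as np

def soft_threshold(x, xi):
    """Proximal operator of xi * ||x||_1: shrink toward zero, clip small values to 0."""
    return np.sign(x) * np.maximum(np.abs(x) - xi, 0.0)

def proximal_gate_step(g, grad_g, lr, l1_strength):
    """One proximal gradient step on the gate: gradient step, then shrinkage."""
    return soft_threshold(g - lr * grad_g, lr * l1_strength)

def prune(W_down, W_up, g):
    """Drop ranks whose gate is exactly zero and fold the surviving gates into W_up."""
    keep = g != 0
    return W_down[keep, :], W_up[:, keep] * g[keep]

rng = np.random.default_rng(0)
W_down = rng.normal(size=(4, 5))       # (r, d_in)
W_up = rng.normal(size=(3, 4))         # (d_out, r)
x = rng.normal(size=5)

g = np.array([0.8, 0.05, -0.4, 0.01])
g_new = proximal_gate_step(g, grad_g=np.zeros(4), lr=0.1, l1_strength=1.0)

out_gated = W_up @ (g_new * (W_down @ x))
W_down_p, W_up_p = prune(W_down, W_up, g_new)
out_pruned = W_up_p @ (W_down_p @ x)   # same output with fewer ranks
```

Gates that fall inside the threshold become exactly zero rather than merely small, so the corresponding rows and columns can be removed without changing the module's output.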

2240

6.1.3 Parameter Freezing and Sharing

Parameter freezing and sharing techniques reduce the number of trainable parameters by freezing matrices and sharing parameters across layers. They can be divided into two categories: intrinsic parameter methods, which work on the LoRA parameters themselves, and extrinsic parameter methods, which add new parameters.

  • Intrinsic parameter methods tune some of the LoRA parameters while freezing the others. Matrix freezing: a portion of the LoRA parameters is frozen during fine-tuning and only the rest are updated. Research has revealed that matrices A and B play asymmetric roles in adaptation. LoRA-FA (short for LoRA with Frozen-A) freezes the down-projection matrix A of each layer after random initialization and trains only the up-projection matrix B, which starts from zero as in the original LoRA. This roughly halves the number of trainable parameters while maintaining performance comparable to ordinary LoRA. Asymmetric LoRA provides the theoretical basis for this approach, suggesting that A primarily acts as a feature extractor while B acts as a task-specific projector. AFLoRA constructs a low-rank trainable path and progressively freezes parameters during LoRA training. DropBP accelerates training by randomly dropping some LoRA gradient computations during backpropagation. Cross-layer parameter sharing: several methods share parameters across network layers. VeRA shares a pair of frozen random matrices across all layers and adapts them layer by layer through trainable scaling vectors. NOLA extends this concept, representing A and B as trainable linear combinations of shared frozen basis matrices. Tied-LoRA ties parameters across layers while keeping the shared matrices trainable. VB-LoRA (Vector Bank-based LoRA) proposes a divide-and-share paradigm that splits LoRA's low-rank factors into smaller pieces and shares them globally through an admixture model over a vector bank.
  • Extrinsic parameter methods introduce and tune a set of additional parameters while freezing the original LoRA parameters. Most of them are based on Singular Value Decomposition (SVD). LoRA-XS inserts a small trainable matrix between frozen LoRA matrices that are constructed by performing SVD on the original weight matrix; only this small matrix is tuned during fine-tuning. Similarly, BYOM-LoRA uses SVD to compress the LoRA matrices of a multi-task model.

The figure below illustrates NOLA. Its motivation stems from two limitations of LoRA: (1) the number of trainable parameters cannot go below what a rank-one decomposition requires, and (2) the achievable reduction depends heavily on the model architecture and the chosen rank. NOLA reparameterizes the low-rank matrices in LoRA as linear combinations of randomly generated basis matrices and optimizes only the mixing coefficients. This decouples the number of trainable parameters from both the choice of rank and the network architecture.

2241
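The reparameterization used by NOLA can be sketched as below. This is an illustrative NumPy version under assumptions: the bases are Gaussian and regenerated from a seed, the function name `nola_factors` is hypothetical, and the sizes are arbitrary.

```python
import numpy as np

def nola_factors(alpha, beta, seed, d_out, d_in, r):
    """Build A and B as linear combinations of frozen random bases.

    Only the coefficient vectors alpha and beta would be trained; the bases
    are regenerated from the seed and never stored as trainable tensors.
    """
    rng = np.random.default_rng(seed)
    A_basis = rng.normal(size=(len(alpha), r, d_in))
    B_basis = rng.normal(size=(len(beta), d_out, r))
    A = np.tensordot(alpha, A_basis, axes=1)   # (r, d_in)
    B = np.tensordot(beta, B_basis, axes=1)    # (d_out, r)
    return A, B

d_out, d_in, r = 6, 5, 2
alpha = np.array([0.5, -1.0, 0.25])            # mixing coefficients for A
beta = np.array([1.0, 0.0, -0.5])              # mixing coefficients for B

A, B = nola_factors(alpha, beta, seed=123, d_out=d_out, d_in=d_in, r=r)
delta_W = B @ A

n_trainable_nola = len(alpha) + len(beta)      # independent of r, d_in, d_out
n_trainable_lora = r * (d_in + d_out)
```

The number of coefficients is chosen freely, so it can be smaller than the parameter count of even a rank-one LoRA module, which is the decoupling the paragraph describes.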

By combining parametric pruning techniques, these methods can significantly reduce the number of parameters while maintaining adaptive effectiveness. The figure below provides a comprehensive comparison of these methods.

2242

6.1.4 Parameter Quantization

2243

Quantization refers to optimizing the complexity of neural networks by using lower-precision numerical representations, thereby significantly reducing storage and computational requirements. In the context of LoRA, quantization methods mainly have two dimensions: quantization timing and quantization techniques.

Quantization timing: Quantization timing refers to whether quantization is carried out before, during, or after fine-tuning.

  • Pre-fine-tuning quantization: Pre-fine-tuning quantization involves quantizing pre-trained weights before performing LoRA-based adaptation. For example, QLoRA uses a 4-bit NormalFloat (NF4) quantization method. LoftQ improves upon QLoRA by addressing the discrepancies introduced by quantizing high-precision weights.
  • Fine-tuning-time quantization: Fine-tuning-time quantization applies quantization throughout the fine-tuning process. Methods such as QA-LoRA use group-wise quantization to adjust precision during training, giving a more balanced interaction between the low-rank updates and the quantized weights.
  • Post-fine-tuning quantization: Post-fine-tuning quantization is performed after fine-tuning and focuses primarily on the quantization used for inference. LQER utilizes a low-rank SVD-based decomposition to minimize quantization error, thereby ensuring that the quantized weights closely match the original high-precision weights.

Quantization techniques: Different quantization methods have been proposed for LoRA, including uniform quantization, non-uniform quantization and mixed precision quantization.

  • Uniform quantization: Uniform quantization assigns the same bit width to all weights, regardless of their distribution. QA-LoRA applies uniform quantization with group-wise refinement, balancing precision trade-offs to improve memory efficiency and adaptability. However, uniform quantization may be less effective for non-uniformly distributed weights, in which case non-uniform quantization works better.
  • Non-uniform quantization: QLoRA employs a non-uniform quantization scheme designed for weights that follow a Gaussian distribution, allocating more precision where it is most needed (close to zero). This gives a better representation of the small weights that dominate in a pre-trained model.
  • Mixed-precision quantization: Mixed-precision quantization dynamically adjusts the bit width based on the weight matrix or layers, providing greater flexibility. Methods such as LoftQ and LQ-LoRA utilize mixed precision to optimize the quantization of different components of the model. For example, LoftQ alternately quantizes the residuals of the weight matrix and uses SVD to refine the low-rank components. By iteratively optimizing the low-rank parameters and adjusting the quantization level, LoftQ minimizes the quantization error. LQ-LoRA extends this further by employing integer linear programming to dynamically configure the bit width for each weight matrix. LQ-LoRA also introduces a data-aware mechanism that uses an approximation of the Fisher information matrix to guide the quantization process. This allows LQ-LoRA to achieve a more accurate decomposition of the weight matrix with minimal loss caused by quantization.
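The alternation described for LoftQ, quantizing the residual and then refitting the low-rank part with SVD, can be sketched as follows. This is only an illustration: the quantizer is a simple symmetric uniform one used as a stand-in (it is not NF4 and not LoftQ's actual quantizer), and the function names and step count are assumptions.

```python
import numpy as np

def uniform_quantize(W, bits=4):
    """Stand-in symmetric uniform quantizer with 2**bits - 1 evenly spaced levels."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / levels
    return np.round(W / scale) * scale

def loftq_style_init(W, rank, bits=4, steps=5):
    """Alternate between quantizing W minus the low-rank part and refitting that part."""
    low_rank = np.zeros_like(W)
    for _ in range(steps):
        Q = uniform_quantize(W - low_rank, bits)           # quantized backbone
        U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
        B = U[:, :rank] * S[:rank]                         # (d_out, rank)
        A = Vt[:rank, :]                                   # (rank, d_in)
        low_rank = B @ A                                   # best rank-r fit to the residual
    return Q, B, A

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))
Q, B, A = loftq_style_init(W, rank=2)

err_quant_only = np.linalg.norm(W - Q)
err_with_adapter = np.linalg.norm(W - Q - B @ A)   # never larger than err_quant_only
```

The low-rank factors start out absorbing part of the quantization error instead of starting at zero, so the quantized model plus adapter begins closer to the original weights.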

In summary, pre-fine-tuning quantization methods (such as QLoRA and LoftQ) typically offer greater memory savings because the frozen pre-trained weights are stored in low precision, while post-fine-tuning methods (such as LQER) focus more on preserving inference accuracy. Regarding quantization techniques, non-uniform and mixed-precision methods (as seen in QLoRA, LoftQ, and LQ-LoRA) perform better in low-bit scenarios, since they allocate precision more flexibly according to the weight distribution. Both the timing of quantization and the specific quantization technique determine the balance between memory efficiency and model performance. The figure below provides a comprehensive summary of the quantization methods discussed.

2244

6.2 Rank Adaptation

Rank is a key parameter in LoRA, directly affecting the model's adaptability and the number of trainable parameters. The original LoRA method uses a fixed low rank across all layers, which may not be optimal for different downstream tasks and model architectures. Furthermore, a higher rank is not always better: an excessively high LoRA rank can degrade both performance and efficiency. In addition, the importance of different layers in the Transformer model may vary during fine-tuning, which calls for assigning different ranks to different layers.

To address these limitations, recent work has proposed various methods to optimize rank assignment in LoRA, which can be broadly grouped into two directions: rank refinement and rank augmentation. The figure below illustrates these two directions.

2245

6.2.1 Rank Refinement

Rank refinement methods aim to improve the model's adaptability and efficiency by adaptively selecting the rank of LoRA modules during fine-tuning, rather than assigning the same rank to all modules. A key insight is that different layers may require different degrees of adaptation and thus benefit from different ranks. Rank refinement methods can be categorized into three main types: adaptive allocation, heuristic strategies, and multi-rank training.

Adaptive Allocation: Adaptive allocation methods dynamically adjust the rank of LoRA modules during training based on importance metrics derived from the data or model parameters. There are currently three families: SVD-based methods, single-rank decomposition (SRD) based methods, and rank sampling-based methods.

  • SVD-based methods: Singular Value Decomposition (SVD) decomposes a matrix and allows its singular values to be selectively truncated, which provides an effective way to control the rank of the matrix. Inspired by SVD, the LoRA parameter matrix ΔW can be written in SVD form, ΔW = PΛQ, where P and Q are orthogonal and Λ is a non-negative diagonal matrix. By controlling the elements of Λ, we can control the rank of ΔW and assign ranks to LoRA modules. Based on this idea, several rank allocation methods approximate the SVD of ΔW and assign ranks by filtering the diagonal matrix. For example, AdaLoRA introduces an adaptive rank allocation mechanism by parameterizing LoRA updates in SVD form. This mechanism dynamically prunes singular values based on their magnitude, allowing each layer to have a customized rank while respecting the global parameter budget. Similarly, SoRA employs learnable gates to control the effective rank of each LoRA module. To promote sparsity, the gates are optimized using L1-regularized proximal gradient descent. This method can automatically discover an appropriate rank for each layer, improving parameter efficiency without manual adjustment.
  • SRD-based methods: Orthogonal regularization adds significant computational cost to LoRA and reduces its efficiency. To address this, some methods drop the orthogonality requirement of SVD and decompose the matrix directly into single-rank components, then assign ranks by selecting the appropriate components. DoRA (Dynamic Low-Rank Adaptation, a different method from the weight-decomposed DoRA discussed earlier) decomposes the LoRA parameter matrix into single-rank components and prunes them based on heuristic importance scores. AutoLoRA also decomposes the LoRA parameter matrix into single-rank components, but prunes them based on meta-learning. SoRA eliminates orthogonal regularization and filters the columns and rows of P and Q by directly controlling the diagonal matrix Λ (each column-row pair can be viewed as a single-rank component). ALoRA also filters components with gating units, but learns the gating units through neural architecture search.
  • Rank sampling-based methods: In methods based on SVD parameterization or SRD, determining the appropriate rank requires extra computation through iterative search or orthogonality constraints. Rank sampling avoids this cost, making rank allocation more efficient. Random sampling also adapts flexibly to different training conditions and task requirements, and its simple process is easy to implement and integrate into existing LoRA frameworks. DyLoRA proposes direct rank allocation via random sampling. In each training step, it samples a value b from a predefined discrete distribution and assigns b as the rank. Matrices A and B are then truncated to rank b. During fine-tuning, only the parameters in the b-th row of A and the b-th column of B are trainable, while the other parameters are frozen. The distribution can be defined according to user preferences.
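The rank-sampling procedure described for DyLoRA can be sketched as follows. This is a minimal NumPy illustration, assuming the update is `B @ A` with A of shape (r_max, d_in) and B of shape (d_out, r_max); the helper names are hypothetical and any LoRA scaling factor is omitted.

```python
import numpy as np

def sample_rank(rng, ranks, probs):
    """Draw the rank b for this training step from a discrete distribution."""
    return int(rng.choice(ranks, p=probs))

def truncated_update(A, B, b):
    """Use only the first b rows of A and the first b columns of B."""
    return B[:, :b] @ A[:b, :]

def trainable_masks(A, B, b):
    """Mark only the b-th row of A and the b-th column of B as trainable."""
    mask_A = np.zeros(A.shape, dtype=bool)
    mask_B = np.zeros(B.shape, dtype=bool)
    mask_A[b - 1, :] = True
    mask_B[:, b - 1] = True
    return mask_A, mask_B

rng = np.random.default_rng(0)
d_out, d_in, r_max = 5, 7, 4
A = rng.normal(size=(r_max, d_in))
B = rng.normal(size=(d_out, r_max))

b = sample_rank(rng, ranks=[1, 2, 3, 4], probs=[0.25, 0.25, 0.25, 0.25])
delta_W = truncated_update(A, B, b)           # rank at most b
mask_A, mask_B = trainable_masks(A, B, b)     # gradients elsewhere would be ignored
```

Because every prefix of the factors is exercised during training, the module can later be truncated to any rank in the sampled range without retraining.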

2246

Heuristic Strategies: Heuristic strategies assign ranks according to predefined rules, which can come from prior knowledge or empirical observations. PRILoRA proposes a deterministic strategy where the rank of a LoRA module increases linearly from lower to higher layers. In transfer learning, higher layers typically require more adaptation, and this heuristic algorithm assigns higher ranks to higher layers.

Multi-Rank Training: Multi-rank training methods enable models to perform well across a range of ranks, providing flexibility during inference. DyLoRA trains LoRA modules simultaneously across multiple ranks. In each training iteration, it samples ranks from a predefined distribution, allowing the model to learn to perform effectively across multiple ranks. This strategy enables adaptation at inference time without additional training, which is useful for deployment scenarios with varying compute constraints.

6.2.2 Rank Augmentation

LoRA uses low-rank matrices to update model parameters. Although this helps parameter efficiency, it also limits the model’s expressive power on some knowledge-intensive tasks. Rank augmentation methods aim to achieve higher-rank model updates through a series of low-rank modifications, closing the performance gap between LoRA and full-parameter fine-tuning. These methods can be divided into two categories: matrix-merging methods and matrix-resampling methods.

Matrix-merging methods: Matrix-merging methods increase rank by merging low-rank update matrices. The key idea is that matrix rank is subadditive. That is, for matrices M_1 and M_2 of the same size, rank(M_1 + M_2) ≤ rank(M_1) + rank(M_2). Based on this property, multiple LoRA modules can be stacked together. The sum of multiple low-rank matrices can reach a higher rank than any single one, enhancing the ability to capture complex patterns without introducing large computational overhead.

  • ReLoRA introduces an iterative training framework. It repeatedly trains low-rank LoRA modules, periodically merges them into the pretrained model weights, and then reinitializes the modules. This is equivalent to stacking multiple LoRA modules across fine-tuning steps, which increases the overall update rank while remaining memory efficient.
  • COLA proposes a similar iterative optimization strategy inspired by the Frank-Wolfe algorithm. It iteratively trains LoRA modules and merges them into the model, progressively constructing higher-rank adaptation.
  • MELoRA points out that merge-and-reinitialize does not necessarily guarantee an increase in rank, because the sequence of LoRA modules during fine-tuning may overlap. To address this, MELoRA introduces a parallel rank-expansion method by decomposing LoRA modules into smaller mini-LoRAs and stacking them in parallel.
  • XGBLoRA introduces a gradient boosting framework for LoRA. It builds the final model by combining a series of rank-1 boosters, gradually improving prediction quality.
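The effect of merging several low-rank modules, and the caveat raised by MELoRA, can be checked numerically. The sketch below uses random factors as stand-ins for trained LoRA modules; it illustrates the rank arithmetic only, not any of the training procedures above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, cycles = 16, 2, 3

# Merge-and-reinitialize: each cycle contributes an independent rank-r update.
merged = np.zeros((d, d))
ranks = []
for _ in range(cycles):
    B = rng.normal(size=(d, r))      # stand-in for a trained module
    A = rng.normal(size=(r, d))
    merged += B @ A                  # merge into the weights, then start a new module
    ranks.append(int(np.linalg.matrix_rank(merged)))

# Overlapping updates: reusing the same column space does not raise the rank.
B_shared = rng.normal(size=(d, r))
overlapping = B_shared @ rng.normal(size=(r, d)) + B_shared @ rng.normal(size=(r, d))
rank_overlapping = int(np.linalg.matrix_rank(overlapping))
```

With independent random factors the accumulated rank typically grows by r per cycle, up to the subadditivity bound of `r * cycles`, whereas the overlapping case stays at rank r. Whether successive trained modules behave more like the first case or the second is the question MELoRA raises.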

Matrix-resampling methods: Matrix-resampling methods achieve higher-rank adaptation by dynamically resampling projection matrices during training. The core idea is that although each training step uses a low-rank matrix, the effect of high-rank updates can accumulate over time.

  • FLoRA reinterprets LoRA as a gradient compression and decompression mechanism. It observes that LoRA effectively performs a fixed random projection in gradient space. To break this limitation, FLoRA proposes periodically resampling the projection matrices used in LoRA, allowing the model to explore the parameter space more fully over time while keeping parameter efficiency.

In summary, rank adaptation strategies improve LoRA by tailoring the rank of adaptation matrices to the needs of different layers and tasks. The figure below provides a detailed summary of rank refinement and augmentation.

2247

6.3 Improvements to the Training Process

While LoRA has achieved significant success in parameter-efficient fine-tuning, optimizing its training dynamics remains crucial for maximizing adaptation performance. In this section, we discuss recent advances aimed at improving the training process, such as increasing LoRA's convergence speed and reducing its sensitivity to hyperparameters. These improvements include better initialization methods to accelerate convergence, optimized gradient updates to enhance training stability and reliability, and techniques to reduce overfitting.

6.3.1 Co-updating LLM and LoRA

The purpose of co-updating LLM and LoRA is to update the high-rank LLM during fine-tuning to achieve better representational capabilities than updating LoRA alone.

Delta-LoRA proposes a joint update strategy that computes the difference between the LoRA parameters of two consecutive iterations and applies this difference to the LLM weights. The method keeps the parameter efficiency of LoRA while directly updating the high-rank LLM, which gives better representational power than updating LoRA alone and improves performance on downstream tasks, without incurring additional memory cost.
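The joint update can be sketched in a few lines. This is an assumption-laden illustration: the low-rank product is written as `B @ A`, the coefficient `lam` is a hypothetical name for the strength of the weight update, and the optimizer step on the factors is replaced by a random perturbation.

```python
import numpy as np

def delta_lora_weight_step(W, A_old, B_old, A_new, B_new, lam=1.0):
    """Shift the pre-trained weight by the change in the low-rank product."""
    return W + lam * (B_new @ A_new - B_old @ A_old)

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 5, 2
W = rng.normal(size=(d_out, d_in))
A_old = rng.normal(size=(r, d_in))
B_old = rng.normal(size=(d_out, r))

# Stand-in for one optimizer step on the LoRA factors.
A_new = A_old + 0.01 * rng.normal(size=A_old.shape)
B_new = B_old + 0.01 * rng.normal(size=B_old.shape)

W_next = delta_lora_weight_step(W, A_old, B_old, A_new, B_new)
```

The weight update is derived entirely from the two low-rank products, so no gradient or optimizer state has to be kept for `W` itself, which is where the memory saving comes from.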

6.3.2 Initialization Improvement

In LoRA, the parameter matrices A and B are typically initialized with Gaussian noise and zeros, respectively.

There are two simple schemes: Init[A], which sets matrix B to zero and randomly initializes matrix A, and Init[B], which does the reverse. The paper “The Impact of Initialization on LoRA Finetuning Dynamics” compares these two schemes and concludes through theoretical analysis that Init[A] is superior. The study shows that Init[A] allows a larger learning rate without instability, making learning more efficient. However, even with Init[A], random initialization still produces small initial gradients, which slows convergence. In other words, improper initialization of these low-rank matrices can make fine-tuning inefficient and can even hurt the model's adaptability to downstream tasks.

By aligning the initialization with the important directions of the pre-trained weight matrix, improved initialization methods can accelerate the convergence of LoRA. Good initialization helps maintain model stability and performance during fine-tuning, especially when dealing with new tasks. Typical improvements are as follows:

  • PiSSA: PiSSA (Principal Singular Values and Singular Vectors Adaptation) is an initialization method that uses the principal singular values and singular vectors of a pre-trained weight matrix to initialize the parameters of LoRA. Since the principal singular components represent the most important directions in the matrix, aligning the initial weights with these components can accelerate convergence and improve performance.
  • MiLoRA: In contrast to PiSSA, MiLoRA (Minor Singular Components Adaptation) uses minor singular values and singular vectors for initialization. Considering that random initialization of low-rank matrices might interfere with important features already learned in the pre-trained matrix, MiLoRA improves overall performance by reducing this interference while adapting to new tasks.

PiSSA

We use PiSSA as a worked example.

PiSSA identifies and fine-tunes the principal components within the model: it decomposes the original weight matrix using fast SVD and initializes the adapter matrices A and B from the principal singular values and their singular vectors. The figure below, from left to right, illustrates full-parameter fine-tuning, LoRA fine-tuning, and PiSSA fine-tuning of large models. At initialization, for the same input, the outputs of all three are identical. The main difference between PiSSA and LoRA lies in how they are initialized:

  • LoRA: A is initialized from a random Gaussian distribution, and B is initialized to zero. Only the low-rank matrices A and B are trained.
  • PiSSA: also based on the low-rank assumption, but it operates directly on W instead of approximating ΔW. PiSSA uses SVD to decompose W into the product of two matrices A and B plus a residual matrix W^res. A and B are initialized using the principal singular values and singular vectors of W, while W^res is initialized using the remaining singular values and singular vectors and stays unchanged during fine-tuning. This ensures that the model at initialization is identical to the base model. As in LoRA, PiSSA trains only the low-rank matrices A and B, while W^res remains frozen.

Compared to full parameter fine-tuning, LoRA and PiSSA both save on the number of trainable parameters (indicated in orange).

2248

The paper proposes a simple approach: perform SVD on the original matrix and split it into two parts, the principal singular value part and the residual singular value part. Fine-tuning is performed on the principal part, and the residual part is frozen. Specifically, W^res consists of the components with smaller singular values, and W^pri (formed by A and B) consists of the components with larger singular values, which is the part to be trained.
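The split into a trainable principal part and a frozen residual can be sketched with NumPy. This is a minimal illustration under assumptions: it uses an exact SVD rather than a fast approximate one, divides each singular value evenly between A and B via a square root, and the function name `pissa_init` is hypothetical.

```python
import numpy as np

def pissa_init(W, r):
    """Split W into trainable factors A @ B (top-r components) and a frozen residual."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_S = np.sqrt(S[:r])
    A = U[:, :r] * sqrt_S                      # (m, r), principal left vectors
    B = sqrt_S[:, None] * Vt[:r, :]            # (r, n), principal right vectors
    W_res = (U[:, r:] * S[r:]) @ Vt[r:, :]     # remaining components, kept frozen
    return A, B, W_res

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))
A, B, W_res = pissa_init(W, r=2)

W_reconstructed = W_res + A @ B                # equals W at initialization
```

Since `W_res + A @ B` reproduces `W` exactly, the model output at the start of fine-tuning matches the base model, while the trainable factors already hold the largest singular directions.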

In fact, LoRA and PiSSA rest on different views. PiSSA holds that fine-tuning mainly adjusts the principal components with large singular values. The LoRA paper argues that ΔW primarily amplifies directions in W that were not emphasized during pre-training. In other words, LoRA amplifies features that matter for the downstream task but were not emphasized during pre-training (i.e., features with small singular values), which contradicts PiSSA's view.

  • LoRA assumes that the change ΔW in the weight matrix before and after fine-tuning has a very low intrinsic rank r. It therefore models the change as the product of A and B, which is a low-rank matrix. At initialization, A is drawn from Gaussian noise and B is set to 0, so AB = 0, which ensures that the initial capabilities of the model are unchanged. Fine-tuning A and B then updates W. However, in the early stages of fine-tuning the gradients are either very small or randomly distributed, so many gradient descent steps are wasted and LoRA often lingers near the initial point. Furthermore, poor initialization may trap the model in a suboptimal local minimum and hurt its generalization ability.
  • PiSSA is not concerned with ΔW; instead it assumes that W itself has a very low intrinsic rank r. It therefore performs singular value decomposition directly on W and uses the residual W^res to correct the error. This decomposition is effective because the principal singular values are much larger than the residual ones, so the trainable adapter contains the most important directions of the original weight matrix W. Ideally, training A and B can reproduce the effect of fine-tuning the entire model with far fewer parameters. By directly fine-tuning the most critical part of the model, PiSSA converges faster and to a better result, and avoids the slow start caused by LoRA's Gaussian-and-zero initialization.

Some researchers believe that if the task is relatively general, it may indeed be necessary to adjust the parts of the model with large singular values; however, if the task differs significantly from the pre-training, simply adjusting the principal components with large singular values may not yield the best results, and it may be more necessary to focus on those features that were not emphasized in the pre-training but are important in the new task.

MiLoRA

In contrast to PiSSA, MiLoRA (Minor Singular Component Based Low Rank Adaptation) uses the minor singular values and singular vectors for initialization. Considering that random initialization of the low-rank matrices might interfere with important features already learned in the pre-trained matrix, MiLoRA improves overall performance by reducing this interference while adapting to new tasks. PiSSA, on the other hand, updates the principal singular components.

MiLoRA’s approach is based on two observations:

  • Minor singular components correspond to noise or long-tailed information.
  • The main singular components contain important knowledge, namely the core information of the original model parameters.

Some singular values of the weight matrix remain stable throughout the optimization process, and the subspace corresponding to these singular values remains invariant throughout gradient descent. Therefore, the core idea of MiLoRA is to update only the minor singular components in the weight matrix while keeping the principal singular components unchanged. MiLoRA decomposes the weight matrix into principal and minor component matrices using Singular Value Decomposition (SVD). The principal components correspond to large singular values, and the minor components correspond to small singular values. MiLoRA keeps the principal components (pre-trained knowledge) unchanged and updates the minor singular components only during fine-tuning.
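The split described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the MiLoRA reference implementation: the function name `milora_init` is hypothetical, the adapter follows the LoRA convention `B @ A` with `B` of shape (out, r) and `A` of shape (r, in), and the square root of each singular value is shared between the two factors.

```python
import numpy as np

def milora_init(W, r):
    """Split W into a frozen principal part and a rank-r adapter built from
    the r smallest singular components (illustrative helper)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)   # S is sorted in descending order
    k = S.shape[0] - r                                  # number of principal components
    W_principal = (U[:, :k] * S[:k]) @ Vt[:k, :]        # frozen: large singular values
    sqrt_s = np.sqrt(S[k:])
    B = U[:, k:] * sqrt_s                               # trainable, shape (out, r)
    A = sqrt_s[:, None] * Vt[k:, :]                     # trainable, shape (r, in)
    return W_principal, B, A

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))
W_principal, B, A = milora_init(W, r=2)

# At initialization the layer is unchanged: frozen part + adapter equals W.
print(np.allclose(W_principal + B @ A, W))
```

Taking the first r components instead of the last r turns the same split into a PiSSA-style initialization, which makes the contrast between the two methods explicit.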


CorDA

The motivation for CorDA is that the traditional LoRA method randomly initializes the low-rank matrix during the fine-tuning process, which leads to two problems.

  • The context of the data is ignored;
  • Catastrophic forgetting occurs, so knowledge stored in the original model parameters is lost.

CorDA is a context-oriented decomposition and adaptation method.

  • First, a small number of data samples are randomly collected, and it is assumed that these samples contain representative context of the corresponding task.
  • These samples are fed into a pre-trained LLM to obtain the covariance matrix of the activations of each linear layer. The context represented by the covariance matrix can guide the decomposition direction.
  • Next, singular value decomposition (SVD) is performed on the product of the weights and the covariance matrix to construct a learnable adapter for downstream tasks or world knowledge context. Specifically, the weights of each linear layer of the pre-trained LLM are multiplied by the covariance matrix of that layer’s input activations, and SVD is applied to the product.

CorDA supports two optional modes: knowledge-preserved adaptation and instruction-previewed adaptation. The initialization strategy can be chosen according to the scenario.

  • Knowledge-preserved adaptation: the components with the smallest singular values are fine-tuned. In this mode, the components associated with large singular values are kept frozen to minimize damage to the model’s existing capabilities. This approach is suitable for applications that need a balance between learning new tasks and preserving world knowledge.
  • Instruction-previewed adaptation: the components with the largest singular values are fine-tuned. In this mode, the components associated with large singular values are trained so that the model adapts better to the new task, improving task-specific performance. This approach is suitable for scenarios with high performance requirements on a specific task and relatively low requirements for preserving all world knowledge.
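The procedure and the two modes can be sketched as follows. This is a simplified illustration, not the CorDA authors' code. It assumes the covariance matrix is built as `X @ X.T` from sampled input activations and that it is invertible, and it multiplies the right singular vectors by the inverse covariance so that the decomposition still reconstructs the original weights (as I understand the method; the text above does not spell this step out). The function name `corda_init` and the mode strings are hypothetical.

```python
import numpy as np

def corda_init(W, X, r, mode="instruction_previewed"):
    """W: (out, in) weights. X: (in, n_samples) input activations of this layer.
    Returns the frozen part and a rank-r adapter (B, A) with frozen + B @ A == W."""
    C = X @ X.T                                        # (in, in) activation covariance
    U, S, Vt = np.linalg.svd(W @ C, full_matrices=False)
    Vt_hat = np.linalg.solve(C, Vt.T).T                # Vt @ inv(C); C is symmetric
    order = np.arange(len(S))
    if mode == "instruction_previewed":
        train = order[:r]                              # largest singular values are trained
    elif mode == "knowledge_preserved":
        train = order[-r:]                             # smallest singular values are trained
    else:
        raise ValueError(f"unknown mode: {mode}")
    frozen = np.setdiff1d(order, train)
    sqrt_s = np.sqrt(S[train])
    B = U[:, train] * sqrt_s                           # (out, r)
    A = sqrt_s[:, None] * Vt_hat[train, :]             # (r, in)
    W_frozen = (U[:, frozen] * S[frozen]) @ Vt_hat[frozen, :]
    return W_frozen, B, A

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))
X = rng.standard_normal((6, 64))                       # 64 sampled activation vectors
W_kp, B_kp, A_kp = corda_init(W, X, r=2, mode="knowledge_preserved")
W_ip, B_ip, A_ip = corda_init(W, X, r=2, mode="instruction_previewed")
print(np.allclose(W_kp + B_kp @ A_kp, W), np.allclose(W_ip + B_ip @ A_ip, W))
```

In both modes the model output is unchanged at initialization. The modes differ only in which directions of the context-weighted decomposition are left free to move during training.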


6.3.3 Continual Learning

Forgetting

There is a thorny issue: when LLMs learn multiple tasks in sequence, they tend to be “forgetful” and easily lose what they have already learned. This is known as “catastrophic forgetting.” It is therefore crucial to enhance LLMs incrementally while retaining prior knowledge, that is, to engage in continual learning.

Continual learning

LoRA’s parameter efficiency allows for incremental model updates on new tasks while mitigating catastrophic forgetting. Several key advantages encourage the use of LoRA in continual learning (CL): (1) reduced computational cost compared to full fine-tuning, (2) natural isolation of task-specific knowledge, and (3) flexible combination of task-specific adapters. Existing LoRA-based continual learning methods mainly fall into two categories: regularization-based methods and ensemble-based techniques.

  • Regularization-based methods use LoRA’s parameter constraints as the primary mechanism to prevent catastrophic forgetting, focusing on preserving key model parameters. O-LoRA addresses the catastrophic forgetting problem by constraining the new task update to be orthogonal to the subspace of the previous task. O-LoRA incrementally learns new tasks in orthogonal subspaces while keeping previous LoRA parameters fixed. This approach allows for efficient knowledge accumulation without interference. Online-LoRA proposes a novel online weight regularization strategy for identifying and consolidating important model parameters. Furthermore, Online-LoRA leverages the training dynamics of the loss value to automatically identify changes in data distribution.
  • Ensemble-based methods maintain and combine multiple task-specific LoRA modules. CoLoR maintains a separate LoRA module for each task and selects the appropriate module during inference using an unsupervised method. CoLoR trains the task-specific LoRA modules sequentially and combines them using prototype-based task recognition. This isolates task knowledge while still allowing flexible composition. AM-LoRA combines multiple task-specific LoRA modules with an attention mechanism to integrate knowledge from different tasks. The key to AM-LoRA is an attention mechanism that acts as a knowledge mixing module and adaptively integrates information from each LoRA. Through the attention mechanism, AM-LoRA can use the unique contribution of each LoRA while mitigating the risk of catastrophic forgetting between them. Furthermore, AM-LoRA introduces an L1 norm during learning, which makes the attention vector sparser. The sparsity constraint pushes the model to select a few highly relevant LoRAs rather than aggregating and weighting all of them, which further reduces mutual interference.

O-LoRA

The authors of O-LoRA based their work on two observations:

  • In past methods, all tasks updated model parameters in the same vector space, which could easily destroy the previously learned task representations.
  • LoRA’s low-rank assumption states that the model’s fine-tuning process often takes place in a low-rank subspace. Therefore, the matrix parameters of LoRA not only represent the numerical updates to the original model but also capture the direction of the model parameter updates.

O-LoRA assumes that the gradient subspace of previous tasks can be represented by their LoRA parameters. This allows the model to learn each new task in a new orthogonal subspace, thereby mitigating catastrophic forgetting while it continues to learn new tasks.

From the perspective of parameter space, the assumption of orthogonal gradient descent is based on the existence of a common optimal solution for different tasks in the parameter space, as shown in Figure (a). When learning a new task, O-LoRA constrains its LoRA subspace to be orthogonal to the LoRA subspaces of past tasks. The orthogonality of the subspaces necessarily ensures that the vectors in their respective spaces are orthogonal, thus guaranteeing that the gradient update of the new task will not affect the output of past tasks.
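The orthogonality constraint can be written as a penalty that is added to the task loss. The sketch below is illustrative only and is not taken from the O-LoRA code: it assumes that the columns of each task's `A` matrix (shape (d, r)) span that task's LoRA subspace and that overlap is measured by the squared entries of `A_prev.T @ A_new`. The function name is hypothetical.

```python
import numpy as np

def orthogonality_penalty(A_new, A_prev_list):
    """Sum of squared entries of A_prev^T A_new over all previous tasks.
    It is zero exactly when the new subspace is orthogonal to every old one."""
    return sum(float(np.sum((A_prev.T @ A_new) ** 2)) for A_prev in A_prev_list)

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((16, 8)))   # 8 orthonormal columns in R^16
A_task1 = Q[:, :4]                                   # frozen LoRA subspace of an old task
A_orth = Q[:, 4:]                                    # candidate orthogonal to task 1
A_same = Q[:, :4].copy()                             # candidate that reuses task 1's subspace

penalty_orth = orthogonality_penalty(A_orth, [A_task1])
penalty_same = orthogonality_penalty(A_same, [A_task1])
print(penalty_orth, penalty_same)
```

In training, a term of this kind would be scaled by a coefficient and minimized together with the new task's loss while the earlier `A` matrices stay fixed, so the new update is pushed away from the directions used by past tasks.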

However, in LLMs with their vast parameter spaces, the optimal parameters for different tasks can be very different, and a common optimal solution may not even exist. In that case, learning a new task moves the model parameters away from those that suit the previous tasks, leading to catastrophic forgetting. Even if the parameters converge to a common optimal solution, the model gradually drifts away from the optimal parameters of each previous task, so the current parameters no longer match them. Furthermore, O-LoRA limits the model’s ability to capture cross-task heterogeneity, which affects the learning of subsequent tasks. See Figure (b) below for details.


AM-LoRA

We will use AM-LoRA as an example for analysis.

Fine-tuning large language models (LLMs) using low-rank adaptation (LoRA) is widely considered an effective method for continual learning of new tasks. However, it often leads to catastrophic forgetting when dealing with multiple tasks consecutively. To address this, the authors of AM-LoRA propose a continual learning method, Attentional Mixture of LoRAs (AM-LoRA), specifically designed for LLMs.

Specifically, AM-LoRA learns a series of LoRAs for a series of tasks so that knowledge from different tasks is acquired continually. Because the conventional approach updates the same LoRA parameters for every task, changes to that single LoRA can easily erase knowledge from previous tasks when multiple tasks are learned consecutively. Therefore, the authors employ an incremental learning approach: an independent LoRA is learned for each task, and the LoRAs of all tasks are collected into a task-specific LoRA sequence.

However, simply freezing the LoRA parameters from the previous task is insufficient. Simply adding all LoRA features during inference will lead to the loss of information from past tasks, thus degrading past task performance and easily causing catastrophic forgetting. The key to AM-LoRA lies in designing an attention mechanism as a knowledge fusion module to adaptively integrate the information from each LoRA. Through this attention mechanism, AM-LoRA can effectively utilize the unique contribution of each LoRA while mitigating the risk of catastrophic forgetting that may occur between them.


As shown in the diagram above, AM-LoRA consists of two parts: a sequence of task-specific LoRA matrices and an attention selector responsible for combining the capabilities of all LoRAs. The LoRA matrix sequence is primarily responsible for learning knowledge relevant to the new task. The attention selector focuses on learning how to filter useful knowledge from the LoRAs to better handle new tasks.

Attention Selector

The attention selector is the core component of AM-LoRA. It is based on an attention mechanism that the authors designed to integrate knowledge from the task-specific LoRAs effectively. The attention selector is added and trained together with the new task’s LoRA matrix.

As shown on the right side of the image above, the LoRA features of each task are fed into the attention selector, where each one passes through its own dense layer with a non-linear transformation. The transformed LoRA features are then combined and passed through a softmax function, which yields the attention score for each task.

Inspired by the residual connections in ResNet, the authors added a zero LoRA matrix to the model, which lets the model choose not to use knowledge from previous tasks when learning a new task. The zero matrix also takes part in the weight assignment when the LoRA for the first task is trained.
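The flow through the attention selector can be sketched as below. This is a rough illustration under several assumptions that are not confirmed by the text: each dense layer is reduced to a single weight vector that maps a LoRA feature to a scalar, `tanh` is used as the non-linearity, and the mixed output is the attention-weighted sum of the LoRA features. All names and sizes are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_select(lora_feats, dense_ws):
    """lora_feats: list of (d_out,) LoRA outputs, one per task.
    dense_ws: one (d_out,) weight vector per LoRA (assumed form of the dense layer)."""
    scores = np.array([np.tanh(w @ f) for w, f in zip(dense_ws, lora_feats)])
    attn = softmax(scores)                 # one attention score per LoRA
    mixed = attn @ np.stack(lora_feats)    # attention-weighted sum, shape (d_out,)
    return mixed, attn

rng = np.random.default_rng(0)
d_in, d_out, r = 6, 8, 2
x = rng.standard_normal(d_in)

# Index 0 is the zero LoRA; indices 1 and 2 stand for two task-specific LoRAs.
loras = [(np.zeros((d_out, r)), np.zeros((r, d_in)))]
loras += [(rng.standard_normal((d_out, r)), rng.standard_normal((r, d_in))) for _ in range(2)]

feats = [B @ (A @ x) for B, A in loras]
dense_ws = [rng.standard_normal(d_out) for _ in loras]
mixed, attn = attention_select(feats, dense_ws)
print(attn, mixed.shape)
```

Because the zero LoRA contributes a zero feature, any attention mass assigned to it simply shrinks the contribution of the other LoRAs, which is how the model can decline to use earlier knowledge.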

Loss function with sparsity constraints

The authors observed that during learning the model may retain features that are irrelevant to the current task or even harmful, leading to heterogeneous conflicts between the knowledge of different tasks. This negatively impacts the model’s generalization performance and learning effectiveness. To address these issues, the authors further introduced the L1 norm during the learning process, making the attention vectors sparser. The sparsity constraint encourages the model to select a few highly relevant LoRAs rather than aggregating and weighting all LoRAs. This allows the model to perform LoRA selection more accurately and reduces mutual interference between different tasks. The authors added the sparsity constraint only to the dense layers of the attention selector, so it does not affect the learning of the LoRA sequence.
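A loss of this shape can be written down directly. The snippet is a minimal sketch, assuming the penalty is the sum of absolute values of the selector's dense-layer weights scaled by a coefficient `lam`; the coefficient, the function names, and the toy numbers are all illustrative and not taken from the paper.

```python
import numpy as np

def l1_penalty(dense_ws, lam):
    """L1 norm of the attention selector's dense-layer weights, scaled by lam.
    The LoRA matrices themselves are deliberately left out of the penalty."""
    return lam * sum(float(np.abs(w).sum()) for w in dense_ws)

def total_loss(task_loss, dense_ws, lam=0.1):
    return task_loss + l1_penalty(dense_ws, lam)

dense_ws = [np.array([1.0, -2.0]), np.array([0.5])]   # toy selector weights
loss = total_loss(task_loss=1.0, dense_ws=dense_ws, lam=0.1)
print(loss)
```

Since only the selector weights are penalized, the gradient of the extra term never touches the LoRA parameters, which matches the statement that the constraint leaves the learning of the LoRA sequence unaffected.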

0xFF Reference

Not only does PiSSA offer fine-tuning performance superior to LoRA, it also reduces quantization error, surpassing QLoRA and LoftQ.

LoRA-FA: Zhang, L., Zhang, L., Shi, S., Chu, X., & Li, B. (2023). LoRA-FA: Memory-efficient low-rank adaptation for large language models fine-tuning. arXiv preprint arXiv:2308.03303.

[2501.00365] Low-Rank Adaptation for Foundation Models: A Comprehensive Review

A Survey on LoRA of Large Language Models

AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning

Asymmetry in Low-Rank Adapters of Foundation Models

Compressible Dynamics in Deep Overparameterized Low-Rank Learning & Adaptation

CorDA: Context-Oriented Decomposition Adaptation of Large Language Models

DoRA: Weight-Decomposed Low-Rank Adaptation

FLORA: Low-Rank Adapters Are Secretly Gradient Compressors

GPT4 Technology Principle 4: Renormalized Group Flow as Optimal Transport Wang Qingfa [Qingxi]

https://arxiv.org/abs/2208.04848

HydraLoRA: An Asymmetric LoRA Architecture for Efficient Fine-Tuning

LLM Fine-tuning Series: A Comprehensive Overview of LoRa, 7 Key Tips (CourseAI)

LoRA Learns Less and Forgets Less

How does the LoRa model work? (A single hair sticking up)

LoRA+: Efficient Low Rank Adaptation of Large Models

LoRA

A review of LoRA is here! Zhejiang University’s “LoRA Research on Large Language Models” reviews deep learning and NLP.

LoRA Overview: A Comprehensive and Systematic Understanding of Lora Fine-tuning Sherlock Ma

MiLoRA: Harnessing Minor Singular Components for Parameter-Efficient LLM Finetuning

NOLA: COMPRESSING LORA USING LINEAR COMBINATION OF RANDOM BASIS

PISSA: PRINCIPAL SINGULAR VALUES AND SINGULAR VECTORS ADAPTATION OF LARGE LANGUAGE MODELS

Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning

A Comprehensive Analysis of Shanghai Jiao Tong University’s High-Efficiency Fine-Tuning Algorithm | Standing on the shoulders of decomposition theory, we see the future of high-efficiency fine-tuning algorithms and gain insights into the underlying logic!

Write LoRA code from scratch (LitGate)

Low-rank adaptation

Nankai University proposes AM-LoRA | The LoRA family welcomes another powerful member, knowledge blending and continuous learning ensure LLM doesn’t become forgotten.

The potential for parallel inference across multiple large language fine-tuning models (Lequn Chen)

A Comprehensive Overview of Low-Rank Adaptive LoRA Techniques for Large Models: Background, Fundamentals, Frontiers, Applications, and Challenges (Wang Knowledge)

A New Paradigm for Fine-tuning Large Models: When LoRA Meets MoE (Chen Szu-shuo)

Full-scale fine-tuning! This is the most brilliant LoRA improvement I’ve ever seen. - Su Jianlin

Exploring Different Forms of Lora for Fine-tuning Large Models (Part 1): AdaLora, AsLora, PiSSA, DoRA

The most powerful LoRA method is born | XGBLoRA learns and merges a series of LoRA methods through gradient boosting to achieve efficient fine-tuning

Regarding the efficient fine-tuning of LoRA variants by observers

Intrinsic dimensions explain the effectiveness of language model fine-tuning. (Johnson7788)

LoRA from a Gradient Perspective: Introduction, Analysis, Conjecture, and Generalization by Su Jianlin

A review of LoRA, a large language model published by Zhejiang University.

Wang Wenguang reveals LoRA: Low-rank adaptation, a dimensionality reduction method for efficient training of large models

Detailed Explanation of LoRA Source Code in the PEFT Library by Yu Xing_s

Can LoRA improve further with different learning rates? (Su Jianlin)

AdaLoRA: Zhang, Q., Chen, M., Bukharin, A., He, P., Cheng, Y., Chen, W., & Zhao, T. (2023). Adaptive budget allocation for parameter-efficient fine-tuning. arXiv preprint arXiv:2303.10512.

AGHAJANYAN, A., ZETTLEMOYER, L., AND GUPTA, S. Intrinsic dimensionality explains the effectiveness of language model finetuning. arXiv preprint arXiv:2012.13255 (2020).

Delta-LoRA: Zi, B., Qi, X., Wang, L., Wang, J., Wong, K. F., & Zhang, L. (2023). Delta-lora: Fine-tuning high-rank parameters with the delta of low-rank matrices. arXiv preprint arXiv:2309.02411.

DoRA: Liu, S. Y., Wang, C. Y., Yin, H., Molchanov, P., Wang, Y. C. F., Cheng, K. T., & Chen, M. H. (2024). DoRA: Weight-Decomposed Low-Rank Adaptation. arXiv preprint arXiv:2402.09353.

H. Koubbi, M. Boussard, and L. Hernandez, “The impact of lora on the emergence of clusters in transformers,” 2024.

https://arxiv.org/abs/2202.11737

https://arxiv.org/pdf/2407.05417

https://zhuanlan.zhihu.com/p/663034986

https://zhuanlan.zhihu.com/p/687583780

https://zhuanlan.zhihu.com/p/709734296

https://zhuanlan.zhihu.com/p/719438707

HU, E. J., SHEN, Y., WALLIS, P., ALLEN-ZHU, Z., LI, Y., WANG, S., WANG, L., AND CHEN, W. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).

INTRINSIC DIMENSIONALITY EXPLAINS THE EFFECTIVENESS OF LANGUAGE MODEL FINE-TUNING

J. Liu, J. Wu, J. Liu, and Y. Duan, “Learning attentional mixture of loras for language model continual learning,” arXiv:2409.19611, 2024.

Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs.

LoRA+: Hayou, S., Ghosh, N., & Yu, B. (2024). LoRA+: Efficient Low Rank Adaptation of Large Models. arXiv preprint arXiv:2402.12354.

LoRA-drop: Zhou, H., Lu, X., Xu, W., Zhu, C., & Zhao, T. (2024). LoRA-drop: Efficient LoRA Parameter Pruning based on Output Evaluation. arXiv preprint arXiv:2402.07721.

LoRA: Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., … & Chen, W. (2021). LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

Online-LoRA: Task-free Online Continual Learning via Low Rank Adaptation.

S. Malladi, A. Wettig, D. Yu, D. Chen, and S. Arora, “A kernel-based view of language model fine-tuning,” in Proceedings of the 40th ICML, vol. 202, 23-29 Jul 2023, pp. 23610-23641.

Sander, M. E., Ablin, P., Blondel, M., and Peyré, G. Sinkformers: Transformers with doubly stochastic attention. In International Conference on Artificial Intelligence and Statistics, pp. 3515-3530. PMLR, 2022.

Thibeault, V., Allard, A. & Desrosiers, P. The low-rank hypothesis of complex systems. Nat. Phys. 20, 294-302 (2024). https://doi.org/10.1038/s41567-023-02303-0

VeRA: Kopiczko, D. J., Blankevoort, T., & Asano, Y. M. (2023). VeRA: Vector-based random matrix adaptation. arXiv preprint arXiv:2310.11454.

X. Wang, T. Chen, Q. Ge, H. Xia, R. Bao, R. Zheng, Q. Zhang, T. Gui, and X. Huang, “Orthogonal subspace learning for language model continual learning,” arXiv:2310.14152, 2023.

Y. Zeng and K. Lee, “The expressive power of low-rank adaptation,” arXiv:2310.17513, 2023.

Zhu J, Greenewald K H, Nadjahi K, Ocariz Borde H S, Gabrielsson R B, Choshen L, Ghassemi M, Yurochkin M, Solomon J. Asymmetry in low-rank adapters of foundation models. arXiv preprint arXiv:2402.16842, 2024

A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA

Are you familiar with LoRA, QLoRA, LoRA+, LongRA, DoRA, MaLoRA, and GaLore solutions? Worth a look!!!

Hayou S, Ghosh N, Yu B. The impact of initialization on lora finetuning dynamics. arXiv preprint arXiv:2406.08447, 2024