2026/04/09 16.3k words

Transformer Series #transformer#length-extrapolation#position-encoding#rope#llm#context-window

Exploring the Transformer Series (23) --- Length Extrapolation

0x00 Overview

Advances in LLMs are driving longer contexts and longer generated text, with these models trained on millions of sequences. This trend puts pressure on system memory bandwidth and raises execution costs. Using LLMs in multi-turn dialogue scenarios presents several challenges: 1. the computational complexity of the attention mechanism; 2. the key-value cache used during decoding requires a large amount of memory; 3. popular LLMs cannot generalize beyond their training length. This article discusses the third point.

Text continuation and language extension are core capabilities of human language. With limited learning resources, humans can understand potentially infinitely long discourses by understanding their components and structure. Although Transformers have achieved great success in almost all NLP tasks, language models pre-trained on finite-length texts cannot generalize to texts of arbitrary length like humans do, thus limiting their application potential.

Ensuring that a model can, at inference time, handle texts far longer than those used in pre-training has become one of the core challenges facing large models. We refer to this as the length extrapolation problem of large models. The problem arises because we always want the model to handle arbitrarily long texts, but training samples cannot be stretched to arbitrary length.

This article examines the research progress of Transformer models in length extrapolation from the perspective of position encoding (PE), and studies various methods for enhancing the length extrapolation capability of the Transformer, mainly extrapolatable position encodings and extension methods built on them.

Note: The complete list of articles is here. The list will be updated every time an article is published.

List of articles in the cnblogs series exploring Transformer

0x01 Background

1.1 Problem

Transformers have swept the NLP field since their inception. As the capabilities of LLMs have grown, so have our expectations of them, such as the desire for models to handle longer texts, because understanding and extending the context length of LLMs is crucial for improving their performance in various NLP applications.

However, increasing the context window of an LLM is not so simple, because the capacity advantage of the Transformer comes at the cost of quadratic computation and memory complexity relative to the length of the input sequence. This results in neither the Transformer nor the LLM built upon it having the ability for effective length extrapolation. This means that, limited by the context length constraint preset during training, large models cannot effectively handle sequences exceeding this length constraint. When the input exceeds this constraint, its performance will significantly degrade because the model has not seen new token positions outside the context window during pre-training.

Therefore, solving the length generalization problem has become a major challenge for LLMs.

1.2 Solution Approach

To support longer texts, the current solutions can be mainly divided into several strategies:

  • Pre-training: support text lengths that are as long as possible during the pre-training phase. To achieve this, parallelization methods are typically used to distribute memory usage across multiple devices, or the attention structure is modified to avoid a quadratic relationship between memory usage and text length.
  • Fine-tuning: for example, train the model on a large amount of data with a relatively small window (e.g., 4K tokens), and then fine-tune it with a larger window (e.g., 64K tokens).
  • Inference: extrapolate to as large a length as possible during the inference phase. To achieve this, two aspects typically need to be considered: extrapolating the positional encoding and optimizing the attention mechanism.

1.3 The Challenges of Fine-tuning

Since fine-tuning and pre-training are essentially similar, but fine-tuning is far less difficult than pre-training, let’s take a look at the challenges of fine-tuning.

Fine-tuning within the context of LLM represents a complex evolution in the field of NLP. This process involves specifically refining the existing capabilities of a model. Through fine-tuning, LLMs can understand and accurately generate text beyond the parameters of their initial training data, demonstrating remarkable flexibility in adapting to new content types and structures. Fine-tuning extrapolation focuses on improving the model’s proficiency through additional, targeted training. However, further expanding the context window (fine-tuning) presents several key challenges:

  • High fine-tuning cost: Expanding the context window of a pre-trained LLM to longer texts typically requires fine-tuning on texts of the corresponding length. However, since the space complexity of attention is $O(n^2)$ in the sequence length $n$, this is costly in both computational resources and time. As the context window keeps expanding, the computational and memory requirements of the model increase significantly, leading to extremely expensive fine-tuning time and GPU overhead.
  • Long texts are scarce: Fine-tuning requires long texts of appropriate length, but the number of long texts in current training data is limited. In current datasets, texts exceeding 1000k tokens are especially rare, which restricts context window expansion through fine-tuning.
  • Catastrophic values introduced by new positions: The new position indices have never been trained and introduce many outliers. For example, when expanding from 4k tokens to more than 1000k, more than 90% of the positions are new. These positions introduce many catastrophic values, causing out-of-distribution problems and making fine-tuning difficult to converge.
  • Attention scattering: When scaling to extremely long context windows, the attention of large models becomes scattered across a very large number of token positions because so much new positional information is introduced, which degrades performance on the original short context window. While the context length does not affect the number of model weights, it does affect how those weights encode the positional information of the tokens. Even after fine-tuning, this reduces the model’s ability to adapt to longer context windows, resulting in poor performance.
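To make the first and third points concrete, here is a small back-of-the-envelope sketch. It assumes a naive attention implementation that materializes the full score matrix, and it takes "4k" and "1000k" as the round numbers 4,000 and 1,000,000; the function names are hypothetical.

```python
def attention_score_entries(seq_len: int) -> int:
    """Entries in one full (seq_len x seq_len) attention score matrix."""
    return seq_len * seq_len


def new_position_fraction(old_len: int, new_len: int) -> float:
    """Fraction of position indices in [0, new_len) that were never trained."""
    return (new_len - old_len) / new_len


# Assumed round numbers for "4k" and "1000k".
short_len, long_len = 4_000, 1_000_000

ratio = attention_score_entries(long_len) / attention_score_entries(short_len)
fraction = new_position_fraction(short_len, long_len)

print(f"score-matrix growth: {ratio:.0f}x")    # 62500x
print(f"untrained positions: {fraction:.1%}")  # 99.6%
```

Under these assumptions the score matrix grows 62,500-fold, and 99.6% of the positions in the extended window were never seen in training, which is consistent with the "more than 90%" figure above.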

Therefore, it is generally believed that fine-tuning existing models with longer context windows is either harmful or costly.

1.4 The necessity of length extrapolation

Due to the limitations of context windows in traditional large models, the scarcity of high-quality long text data, and the high cost of fine-tuning, it is not feasible to expand the context window by training Transformers directly on long sequences.

Given the numerous challenges in fine-tuning, we might consider training on shorter context windows and performing inference on longer ones (train on short, test on long). Theoretically, this is feasible, and the space cost of inference is significantly lower than training. Therefore, length extrapolation seems to be the most suitable method to reduce training overhead while relaxing the constraints on the Transformer’s context length.

0x02 Length Extrapolation

2.1 Definition

The concept of extrapolation can be traced back to the ALiBi paper. If a model, without fine-tuning, still maintains its training performance when tested on texts exceeding the training length, we say that the model has length extrapolation capability. Later, this task was also called “Context Window Extension”. Its purpose is still to handle longer texts with an already trained model, but it no longer insists that the method be extrapolation.

As the name suggests, training-free length extrapolation means that, without additional training on long sequences, a model trained only on short sequences can handle and predict long sequences, i.e., “Train Short, Test Long”.

  • Train short: Most texts are not particularly long; exceptionally long inputs are long-tail cases, so training on exceptionally long texts brings little benefit. In addition, because of training cost limits, people usually train on short sequences, which is both realistic and significantly reduces training overhead.
  • Test long: Here, long means that the text length during inference exceeds the maximum text length during training. The goal is to achieve good results on long texts without fine-tuning.

2.2 Measurement

Extrapolation ability is generally measured on language modeling tasks: a model extrapolates well if, as the test sequence gets longer, the perplexity of the text does not increase significantly, i.e., it stays about the same or even decreases. This matters because longer texts can be hard for a model to adapt to. Take the commonly used positional encoding RoPE as an example: training on short text and inferring on long text causes the model to misinterpret the long relative distances, potentially resulting in very high perplexity. A more practical evaluation method is to input a sufficiently long context, let the model predict the answer, and compare it to the reference answer using BLEU, ROUGE, etc. LongBench is a leaderboard of this type.

However, length extrapolation should not come at the expense of long-range dependencies. Otherwise there would be no point in length extrapolation, and it would be simpler to truncate the text. This means that any approach that explicitly truncates long-range dependencies should be treated with caution. How can we determine whether long-range dependencies are lost during length extrapolation? A more rigorous approach is to prepare sufficiently long texts, but compute the metrics only on the last segment of each sample for each model.
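The last-segment evaluation can be sketched as follows. This is a minimal illustration that assumes per-token log-probabilities have already been obtained from some model; the function names and the toy numbers are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-mean log-probability) over the given tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def last_segment_perplexity(token_log_probs, segment_len):
    """Perplexity over only the final `segment_len` tokens of a long sample."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return perplexity(token_log_probs[-segment_len:])


# Toy data (assumed): the model is less certain on the last two tokens.
log_probs = [math.log(0.5)] * 6 + [math.log(0.25)] * 2

print(perplexity(log_probs))                  # whole sample
print(last_segment_perplexity(log_probs, 2))  # last segment only
```

In the toy data the last-segment perplexity is higher than the whole-sample perplexity. A gap like this is what the last-segment metric is meant to expose, since the final tokens are the ones that depend most on distant context.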

2.3 Analysis

Length extrapolation is a problem of inconsistent training and prediction lengths. The training and inference of LLMs are inherently misaligned: during training, the decoder always operates on a fixed number of tokens, such as 2048, whereas during inference the input length varies. This problem manifests in two ways:

  • Prediction uses positional encodings that were never trained (whether absolute or relative). Untrained encodings cannot be guaranteed to be handled or generalized well. This is a very real phenomenon in deep learning, even for functional positional encodings such as Sinusoidal or RoPE, because the values at the new positions were never used during training.
  • A longer prediction sequence leads to more dispersed attention than in training, because the number of tokens the attention mechanism processes during prediction far exceeds the number processed during training. What does this discrepancy between training and prediction lengths affect? The answer is entropy. Averaging attention over more tokens means a more “uniform” distribution (higher entropy), i.e., more dispersed attention. Conversely, the shorter training sequence means lower entropy and more concentrated attention. This difference between training and prediction also affects performance.
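The entropy argument can be checked numerically. The sketch below uses a toy setting that is an assumption, not a measurement from a real model: one key receives a fixed score of 2.0 and all other keys receive 0.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def attention_entropy(n_tokens, peak_score=2.0):
    """Entropy of attention when one key scores `peak_score` and the rest 0."""
    scores = [peak_score] + [0.0] * (n_tokens - 1)
    return entropy(softmax(scores))


for n in (512, 2048, 8192):
    print(n, round(attention_entropy(n), 3))
```

With the score pattern held fixed, the entropy grows as the number of keys grows, so the same query spreads its attention more thinly over a longer sequence.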

2.4 Scheme

Extrapolation refers to the process where the context length is n during LLM pre-training and m (m >> n) during prediction, while maintaining model performance. In other words, extrapolation aims to extend the model’s understanding beyond its initial observation length, employing innovative strategies to capture dependencies within this extended range.

In summary, extrapolation techniques can be categorized into three types:

  • Attention-based extrapolation techniques. Because RoPE-based self-attention cannot remain stable outside the training context and exhibits attention score explosion and monotonic entropy increase, this approach extrapolates by adjusting the scope of attention. For example:
    • Sparse attention: This approach “spotlights” only the truly important information, reducing computational complexity by limiting each token to a portion of the context. While attention is sparse in practice, the shape of that sparsity varies across models and even across layers of the same model, and is highly dynamic. Therefore, implementing a universal, training-free sparse attention mechanism is extremely difficult.
    • Global attention: In addition to the “spotlight”, a “floodlight” is added to take global information into account. That is, on top of local attention, a small number of global tokens are added to capture long-distance dependencies.
    • Dynamic attention: The “brightness” and “area” of the “spotlight” are adjusted dynamically based on the text content, i.e., the attention range is adjusted according to the context to improve computational efficiency.
  • Memory-based extrapolation techniques. These essentially follow a compression principle, using external storage to hold historical information and then querying it with the most recent tokens to retrieve the historically important ones.
  • Position-based extrapolation techniques. These extend the context window of a pre-trained LLM by adjusting its positional encodings (PEs). Unlike other techniques such as efficient Transformers and memory augmentation, PE-based methods do not require changes to the model architecture or the addition of supplementary modules. Therefore, PE-based methods are straightforward to implement and quick to adapt, making them a practical way to extend the operating range of LLMs in tasks involving larger context windows.
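As an illustration of the local-plus-global idea in the first category, here is a minimal sketch of a causal attention mask that combines a sliding window with a few global tokens. Treating the first `n_global` tokens as the global ones is an assumption made for this example, as are the function name and parameters.

```python
def local_global_mask(seq_len, window, n_global):
    """mask[i][j] is True if query i may attend to key j.

    Causal: j <= i. Allowed keys are the `window` most recent tokens
    (local "spotlight") plus the first `n_global` tokens ("floodlight").
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i
            local = i - j < window
            is_global = j < n_global
            row.append(causal and (local or is_global))
        mask.append(row)
    return mask


mask = local_global_mask(seq_len=8, window=3, n_global=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each query attends to at most `window + n_global` keys regardless of sequence length, which is what keeps the cost from growing quadratically.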

It is evident that the length extrapolation problem is not entirely equivalent to designing a good positional encoding. This article mainly focuses on how to solve or mitigate the length extrapolation problem by adjusting the positional encoding.

0x03 Position Encoding and Length Extrapolation

As text length increases, positional encoding changes accordingly. Therefore, properly handling positional encoding is a crucial step in solving long text problems. As mentioned earlier, a major challenge is how to modify or adjust positional encoding to enable models that originally lacked extrapolation capabilities to handle long documents effectively through retraining or fine-tuning.

In Transformer models, the output of the attention module does not depend on token order, so positional encoding is needed to distinguish tokens at different positions. There are two typical types of positional encoding:

  • Absolute position encoding: Incorporate positional information into the input.
  • Relative position encoding: Modify the attention structure so that it can distinguish tokens at different positions.

To address the extrapolation problem, researchers have adjusted and optimized these two position encoding methods according to their characteristics. The figure below lists different extrapolatable PEs, categorized by whether the PE is absolute or relative. Manifestation shows how position information is represented. Learnable shows whether it can be adjusted based on the input. Integration shows how the position representation is combined with the token representation. Injection Layer shows where the PE is applied.

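(Figure 2301: list of extrapolatable PEs, categorized as absolute or relative, with columns Manifestation, Learnable, Integration, and Injection Layer.)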

Note: Extrapolation schemes are classified or explained in different ways. Here, I have chosen a relatively easy-to-understand approach for study. See the diagram below for this approach.

(Figure 2302: the classification of extrapolation schemes followed in this article.)

Next, let’s see how to make the specific adjustments.

3.1 Absolute position encoding and its extrapolation

The earliest absolute positional encodings were of two types: learnable positional encoding and trigonometric (sinusoidal) positional encoding. Learnable positional encoding cannot extrapolate and will not be discussed further. Trigonometric positional encoding has explicit generation rules, so it can be expected to have some extrapolation ability. Furthermore, trigonometric functions have the following properties:

$$\sin(\alpha + \beta) = \sin\alpha\cos\beta + \cos\alpha\sin\beta$$
$$\cos(\alpha + \beta) = \cos\alpha\cos\beta - \sin\alpha\sin\beta$$

This demonstrates that sin-cos positional encoding can express relative positions, i.e., the vector for position $\alpha + \beta$ can be represented as a combination of the vectors for positions $\alpha$ and $\beta$. This opens up the possibility of extending positions.
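The relative-position property of sin-cos encoding can be checked numerically: the encoding of position p + k is computable from the encoding of position p and the offset k alone, via the angle-addition identities. The sketch below assumes the interleaved [sin, cos, ...] layout and base 10000 of the original sinusoidal encoding; `shift_pe` is a hypothetical helper name.

```python
import math


def sinusoidal_pe(pos, d_model, base=10000.0):
    """Interleaved [sin, cos, sin, cos, ...] encoding of one position."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / base ** (2 * i / d_model)
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe


def shift_pe(pe_p, k, d_model, base=10000.0):
    """Compute PE(p + k) from PE(p) only, using angle-addition identities."""
    out = []
    for i in range(d_model // 2):
        s, c = pe_p[2 * i], pe_p[2 * i + 1]
        ak = k / base ** (2 * i / d_model)
        out.append(s * math.cos(ak) + c * math.sin(ak))  # sin(a + b)
        out.append(c * math.cos(ak) - s * math.sin(ak))  # cos(a + b)
    return out


d_model, p, k = 8, 5, 37
direct = sinusoidal_pe(p + k, d_model)
shifted = shift_pe(sinusoidal_pe(p, d_model), k, d_model)
max_err = max(abs(a - b) for a, b in zip(direct, shifted))
print(max_err)  # floating-point rounding error only
```

The transformation in `shift_pe` is a per-pair rotation that depends only on k, not on p, which is the sense in which the encoding can express relative positions.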

The Transformer authors suggested that sinusoidal positional embeddings might extrapolate to sequences longer than those seen during training:

We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.

However, later research refuted this conjecture: researchers found that the sinusoidal APE extrapolates poorly. That is, while the sinusoidal APE has some extrapolation potential, it does not expose relative positional relationships to the attention computation, resulting in poor performance. This is because sinusoidal encoding adds absolute positional information to the input vector: the $i$-th input vector $x_i$ is added to the position vector $p_i$ to get $x_i + p_i$, in which $p_i$ depends only on position $i$. Therefore, the compatibility score between the query $q_i$ and the key $k_j$ is formalized as follows:

$$q_i k_j^{\top} = \big((x_i + p_i) W^{Q}\big)\big((x_j + p_j) W^{K}\big)^{\top}$$

(Figure 2303: the query-key compatibility score under absolute position encoding.)

Since absolute position encoding is ultimately composed of two independent parts, it is impossible to calculate the relative distance.

Sinusoidal position encoding is the foundation and starting point of many other positional encodings (PEs). Various absolute positional encodings (APEs) and relative positional encodings (RPEs) have therefore been proposed to improve on it and thereby improve the extrapolation of the Transformer. Subsequent absolute positional encodings mainly try to improve extrapolation in two directions:

  • Generate position embeddings that vary smoothly with position and expect the model to learn to infer this variation function.
  • Incorporate shift invariance into the sinusoidal APE by using random shifts.

3.1.1 Add smoothness

This approach attempts to directly capture the dependencies or dynamic relationships between positional representations, for example by introducing a dynamical system to model the global absolute positions of words and their order relationships. This allows positional encodings to change smoothly with the position index, and the model is expected to learn this variation during training and infer positional encodings it has never seen. The paper “Encoding word order in complex embeddings” proposes extending each word embedding to a continuous function of an independent variable (the position), instead of representing a word as the sum of a word vector and a positional encoding. The advantage of a continuous function over the position variable is that word representations move smoothly as the position increases, so representations of a word at different positions are correlated through the function.

(Figure 2304: illustration for the smoothness-based approach.)

3.1.2 Random Offset

Some researchers speculate that good extrapolation performance stems from the translation (shift) invariance of the PE: the function’s output remains unchanged when the input is shifted. They therefore address extrapolation by introducing random offsets in the position indices. This scheme shifts each position index by a random offset in the trigonometric encoding formula, which prevents the model from relying on absolute positions and encourages it to use relative positions instead. The paper “CAPE: encoding relative positions with continuous augmented positional embeddings” shifts every position index of the APE by the same random offset (global shift) and additionally introduces local shifts and global scaling. These three augmentation methods take the following forms.

(Figure 2305: the global shift, local shift, and global scaling augmentations in CAPE.)
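The three augmentations can be sketched as below. This is a hedged illustration, not the paper’s reference implementation: the uniform sampling ranges, the log-uniform scale, and the order in which the three steps are composed are assumptions based on my reading of CAPE, and all names are hypothetical.

```python
import math
import random


def cape_augment(positions, max_global_shift, max_local_shift,
                 max_global_scale, rng):
    """Apply global shift, local shift, and global scaling to positions.

    Assumed form: x' = scale * (x + delta + eps_i), where delta and scale
    are drawn once per sequence and eps_i once per position.
    """
    delta = rng.uniform(-max_global_shift, max_global_shift)
    log_max = math.log(max_global_scale)
    scale = math.exp(rng.uniform(-log_max, log_max))
    augmented = []
    for x in positions:
        eps = rng.uniform(-max_local_shift, max_local_shift)
        augmented.append(scale * (x + delta + eps))
    return augmented


rng = random.Random(0)
positions = [0.0, 1.0, 2.0, 3.0]
print(cape_augment(positions, max_global_shift=5.0, max_local_shift=0.5,
                   max_global_scale=1.03, rng=rng))
```

The augmented positions are then fed into the usual sinusoidal formula during training. At inference time one would presumably use the unmodified positions.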

3.1.3 Summary

The sinusoidal APE, as the first PE in the Transformer, has had a significant impact on subsequent PEs. However, its extrapolation ability is poor. To enhance the extrapolation ability of the Transformer, researchers have either incorporated shift invariance into the sinusoidal APE using random shifts or generated position embeddings that change smoothly with position. Methods based on these ideas extrapolate better than the sinusoidal APE, but still fall short of relative position encodings (RPEs). One reason is that an APE maps different positions to different position embeddings, so extrapolation requires the model to infer position embeddings it has never seen. This is a hard task for the model: the set of position embeddings that recur during extensive pre-training is limited, especially for LLMs, which makes the model highly susceptible to overfitting to those position encodings.

3.2 Relative Position Encoding and its Extrapolation

Relative position encoding naturally possesses translation invariance, making it easier to extrapolate. Many new RPEs have been proposed, which enhance extrapolation by characterizing the relative distances between different positions in a sequence. Since these RPEs have already been introduced earlier, they will not be repeated here.

3.3 Length Extrapolation in the LLM Era

LLMs have fundamentally changed the NLP field. They place high demands on length extrapolation in order to meet various application needs, and this has led to the emergence of many new PEs. In fact, many of the RPEs mentioned earlier are products of this trend. Building on these PEs, many methods have been proposed to further enhance the length extrapolation of LLMs. In the LLM era, there are two main optimization approaches:

  • Propose novel position encodings that generalize to longer lengths, such as ALiBi and xPos.
  • Modify existing position encodings (mainly RoPE) using methods such as interpolation and extrapolation, for example PI, YaRN, and randomized PE.

We will first introduce random PE, and then explain position interpolation in detail later.

3.4 Randomized Position Encoding

Essentially, random PE improves the exposure of all positions in a longer context window by introducing random positions during training, thus decoupling the pre-trained context window from the longer inference length.

For APEs and RPEs without clipping mechanisms, length extrapolation means that the positional representations exceed those observed during training, resulting in out-of-distribution positional representations and thus performance degradation. The key limitation on a model’s ability to handle long texts lies in the gap between training and testing lengths, i.e., “prediction uses positional encodings that haven’t been trained.” One of the most intuitive ways to address this is to have the model observe all possible positional representations during training, i.e., “train the positional encodings used for prediction during the training phase.” This is the core idea behind randomized PEs.

As a concrete manifestation of this idea, researchers proposed simulating the positions of much longer sequences and randomly selecting an ordered subset of them to fit the training context window. Across training samples, these subsets can cover the entire range of positions that may appear at test time. Specifically, let N be the training length (N=40 in the paper) and M be the prediction length (M=500 in the paper), where M is much larger than the maximum length seen during training. A large L>M is selected (this is a hyperparameter; L=2048 in the paper). During training, the position sequence for a sequence of length N would originally be [0,1,⋯,N−2,N−1]. Instead, N positions are now randomly selected from {0,1,⋯,L−2,L−1} without repetition and arranged in ascending order as the position sequence of the current sample. At each training step, the positions of a length-N sequence are thus an ascending, non-repeating subsample of the larger position range.
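The sampling step described above can be sketched in a few lines. The values N=40 and L=2048 are the ones quoted from the paper; the function name and the fixed seed are illustrative assumptions.

```python
import random


def sample_random_positions(train_len, max_pos, rng):
    """Draw `train_len` distinct positions from [0, max_pos), ascending."""
    if train_len > max_pos:
        raise ValueError("train_len must not exceed max_pos")
    return sorted(rng.sample(range(max_pos), train_len))


rng = random.Random(0)  # fixed seed only so the example is repeatable
N, L = 40, 2048
positions = sample_random_positions(N, L, rng)

print(positions[:5], "...", positions[-1])
```

A fresh subset is drawn for every training sample, and the resulting position IDs replace [0, 1, ..., N−1] wherever the positional encoding is computed. At inference time the ordinary positions [0, 1, ..., M−1] are used, all of which lie inside the range covered during training.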

However, this presents a problem: the differences between adjacent positions are different during the training and prediction phases, which can be described as some kind of inconsistency, yet the model still performs well. Why is this? We can understand it from the perspective of “order.” Since the position IDs during the training phase are randomly sampled, the differences between adjacent positions are also random. Therefore, whether it’s a relative or absolute position, the model is unlikely to obtain position information through precise position IDs. Instead, it obtains a vague position signal. More accurately, it encodes the position through the “order” of the position sequence rather than through the position IDs themselves. For example, the position sequence [1,3,5] is equivalent to [2,4,8] because they are both sequences arranged in ascending order. Random position training “forces” the model to learn an equivalence class, meaning that all ascending position sequences are equivalent and can be substituted for each other. This is the true meaning of position robustness.

Therefore, with sufficient training, the model encounters enough distinct positions and is adequately trained on all positions from 1 to M before inference, so that its performance on any sequence at inference time is consistent with its performance during training.

(Figure 2306: illustration of randomized position encoding.)

In simple terms, randomized PE decouples the pre-trained context window from the longer inference length by introducing random positions during training, thereby increasing the exposure of all positions within the longer context window. The idea behind randomized PE differs significantly from position interpolation methods. The former lets the model observe all possible positions during training, while the latter interpolates positions during inference so that they fall within the trained range. For this reason, position interpolation methods are mostly plug-and-play, while randomized PE typically requires further fine-tuning, which makes position interpolation more attractive. However, the two approaches are not mutually exclusive and can be combined to further enhance the model’s extrapolation capabilities.
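For contrast with randomized PE, the interpolation idea can be sketched as linearly rescaling inference positions so that they stay inside the trained range. This is a minimal sketch of linear position interpolation under that assumption; the function name and the example lengths are hypothetical.

```python
def interpolate_positions(seq_len, train_len):
    """Rescale positions 0..seq_len-1 so that they fall in [0, train_len)."""
    scale = min(1.0, train_len / seq_len)  # never stretch short inputs
    return [m * scale for m in range(seq_len)]


train_len, seq_len = 2048, 8192  # assumed example lengths
positions = interpolate_positions(seq_len, train_len)

print(positions[:4], "...", positions[-1])
```

The rescaled positions are fractional and denser than anything seen in training, which is why such methods are usually described as interpolating between trained positions rather than extrapolating beyond them.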

0x04 RoPE Extrapolation

RoPE (Rotary Position Embedding) is widely used in current large models, including but not limited to Llama, Baichuan, ChatGLM, and Qwen. Although RoPE can in theory encode absolute positions of arbitrary length and, through trigonometric functions, represent relative positions of arbitrary length, it still suffers from the length extrapolation problem. Specifically, for RoPE-based large language models, when the input length at inference exceeds the training length, performance deteriorates significantly, which shows up as a sharp increase in perplexity. Therefore, many methods have been proposed to enhance the extrapolation of existing LLMs pre-trained with RoPE, with position interpolation being the most popular.

4.1 Reasons

Why does the model’s performance degrade when the inference length exceeds the training length L of RoPE? This is mainly due to the frequency invariance and rigidity of the frequency distribution in RoPE (the frequency distribution of all dimensions is fixed and does not support dynamic adjustment).

From an intuitive perspective, the extrapolation problem of positional encoding comes from overfitting during training. Because the positional encoding is fixed during pre-training, it induces the model to misinterpret features on short sequences, so the learned patterns do not generalize to longer sequences or adapt to longer context lengths. In RoPE, each dimension index $i$ corresponds to a rotation angle (in radians) $\theta_i = 10000^{-2i/d}$, where $d$ is the dimension of the vector $q$. When a vector $q$ is located at position $m$, the rotation angle of its $i$-th component is $m\theta_i$. See the figure below for details. When the training length of the model is $L$, the rotation angles for positions 0 to $L-1$ are $0, \theta_i, 2\theta_i, \ldots, (L-1)\theta_i$. We can reasonably assume that during training the model has only seen rotation angles in the range $[0, (L-1)\theta_i]$ and has never seen one greater than $(L-1)\theta_i$. So when the inference length is greater than $L$, the model struggles to interpret the new rotation angles and cannot inject position information correctly, which degrades model performance.

(Figure 2307: rotation angles of RoPE components at each position.)
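The unseen-angle argument can be made concrete with a small calculation. The sketch assumes the standard RoPE frequencies with base 10000, and the head dimension 128 and training length 4096 are illustrative values, not taken from the text.

```python
import math


def rope_theta(i, d, base=10000.0):
    """Rotation angle per position step for dimension pair i."""
    return base ** (-2 * i / d)


def rotation_angle(m, i, d, base=10000.0):
    """Total rotation angle (radians) of pair i at position m."""
    return m * rope_theta(i, d, base)


d, train_len = 128, 4096  # assumed example values

slow = d // 2 - 1  # slowest-rotating pair
max_seen = rotation_angle(train_len - 1, slow, d)
beyond = rotation_angle(2 * train_len - 1, slow, d)
print(f"slowest pair: max angle seen in training = {max_seen:.4f} rad")
print(f"slowest pair: angle at position {2 * train_len - 1} = {beyond:.4f} rad")
print(f"one full turn = {2 * math.pi:.4f} rad")

fast_turns = rotation_angle(train_len - 1, 0, d) / (2 * math.pi)
print(f"fastest pair completes about {fast_turns:.0f} full turns in training")
```

Under these assumptions the slowest pair never completes a full turn during training, so every position beyond the training length produces an angle it has never seen, while the fastest pair has already wrapped around many times.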

There are other arguments regarding the model’s performance degradation or insufficient extrapolation ability, which we have excerpted below:

  • The bias curve of RoPE is inherently non-monotonic, which makes it hard for the model to learn the characteristics and patterns of position information. xPos adds an exponential correction that forces the RoPE bias at more distant positions to converge to 0, effectively improving extrapolation performance.
  • An improperly chosen rotation angle can cause fluctuations in the RoPE bias curve at nearby positions. In this case, each prediction by the language model incurs a certain loss, which increases monotonically with length. These fluctuations affect gradient backpropagation, causing the model to attribute the prediction loss to irrelevant positions and ultimately to learn a distorted, incorrect distribution over positions. It is precisely because of this “distorted awareness” that the model exhibits a collapse-like effect when predicting long sequences.
  • The limited number of dimensions in RoPE can lead to insufficient fitting accuracy; the larger the relative distance, the greater the fitting error.
  • Overfitting during training is also a contributing factor: positional encoding induces the model to misinterpret features on short sequences, which prevents the learned patterns from extending to long sequences.
  • The long-tail problem of the relative bias in RoPE may also affect its extrapolation ability.

Next, let’s look at some extrapolation properties of RoPE.

4.2 Properties

The paper “Scaling Laws of RoPE-based Extrapolation” provides a detailed analysis of RoPE. The following analysis will primarily focus on this paper, while also incorporating other related papers.

(Figure 2308: from “Scaling Laws of RoPE-based Extrapolation”.)

4.2.1 Property 1 Critical Dimension

In the original RoPE, how well a dimension is trained depends on the dimension itself. The crucial question is whether the rotation angle of each dimension has completed a full rotation cycle during the training phase.

  • The earlier a dimension appears, the larger its $\theta_i$ and the shorter its period, so the dimension sees a complete period during the training phase.
  • Conversely, the last few dimensions will not have seen the complete cos/sin range during training.

Assume the pre-training text length of the model is $T_{\text{train}}$ and the dimension of each self-attention head is $d$. For RoPE-based LLMs, there exists a feature dimension $d_{\text{extra}}$, the critical dimension. The dimensions before and after it behave very differently.

  • The first $d_{extra}$ dimensions are called “pre-critical dimensions”: feature dimensions that have already covered all possible rotation angles during the model’s pre-training phase. Their characteristics are as follows:
    • These dimensions have shorter wavelengths: the corresponding trigonometric period $T_n$ fits within the training length $T_{train}$.
    • During pre-training, the tokens at each position undergo one or more complete rotation cycles in these dimensions. All positional information is available, and these dimensions are fully trained.
    • Because the training is sufficient, extrapolation can be performed in these dimensions.
  • The dimensions after $d_{extra}$ are called “post-critical dimensions”: RoPE feature dimensions that were not fully covered during the model’s pre-training phase. Their characteristics are as follows:
    • These dimensions have longer wavelengths: the corresponding trigonometric period $T_n$ is longer than the training length $T_{train}$.
    • During pre-training, the model has no opportunity to see all possible rotation angles in these dimensions. It only perceives part of one cycle of the encoding in each of them.
    • The root cause of the extrapolation problem is insufficient training and the failure to perceive complete positional information. When a RoPE-based LLM extrapolates beyond the training length, the absolute position information of the newly added tokens in these dimensions was never seen during training and is therefore out-of-distribution (OOD). The relative position information of these new tokens with respect to earlier tokens is also out of distribution. As a result, the attention scores associated with these dimensions deviate from their expected distribution, the overall attention score becomes markedly out-of-distribution, and it collapses once the training length is exceeded.
    • When the model encounters sequences longer than the pre-training sequences during testing, the features in these dimensions meet rotation angles that were not seen during training, which makes it difficult for the model to generalize to the new positions.
  • $d_{extra}$ is the critical dimension of RoPE extrapolation: the number of dimensions of the query and key vectors that perceive a full period of the position encoding within the training length, $d_{extra} = 2\lceil \frac{d}{2}\log_{10000}\frac{T_{train}}{2\pi} \rceil$. In essence, it marks the dimensions whose cos/sin values cycle through a complete period during pre-training or fine-tuning, and it plays a key role in enhancing extrapolation.
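As a rough numerical check of the critical dimension, the sketch below evaluates the formula $d_{extra} = 2\lceil \frac{d}{2}\log_{10000}\frac{T_{train}}{2\pi} \rceil$ from the “Scaling Laws of RoPE-based Extrapolation” paper and compares it with a direct count of the dimensions whose period fits in the training length. The function names and the example values d = 128 and T_train = 4096 are assumptions chosen for illustration.

```python
import math

def rope_period(n: int, d: int, base: float = 10000.0) -> float:
    """Period in tokens of the n-th dimension pair: T_n = 2*pi * base**(2n/d)."""
    return 2 * math.pi * base ** (2 * n / d)

def critical_dimension(d: int, t_train: int, base: float = 10000.0) -> int:
    """d_extra = 2 * ceil((d/2) * log_base(T_train / (2*pi))), capped at d."""
    pairs = math.ceil((d / 2) * math.log(t_train / (2 * math.pi), base))
    return min(d, 2 * pairs)

# Assumed example values (not taken from the article).
d, t_train = 128, 4096
d_extra = critical_dimension(d, t_train)

# Cross-check: count the pairs whose full period fits inside the training length.
covered_pairs = sum(rope_period(n, d) <= t_train for n in range(d // 2))
print(d_extra, 2 * covered_pairs)
```

With these assumed values both numbers come out as 92, so 92 of the 128 dimensions see a complete period during pre-training and the remaining 36 do not.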

There is a causal relationship between the critical dimension and the extrapolation effect. When the inference length exceeds the training length, the dimensions beyond the critical dimension cause fluctuations in the attention score, which caps the model’s extrapolation. This also supports explaining and improving the extrapolation of RoPE-based large language models from the perspective of periodicity. Once the context length exceeds the extrapolation upper bound defined by the critical dimension, the dimensions beyond it encounter unseen positional information, which shows up as out-of-distribution (OOD) values in the attention score. At the same time, the perplexity begins to rise rapidly, and the model’s extrapolation fails.

4.2.2 Property 2 Critical base

In RoPE, base is a key hyperparameter that plays a crucial role in extrapolation performance.

  • A smaller base means a larger $\theta_n$ and a shorter period of the corresponding trigonometric function, so the period is more likely to fit within the training length. More dimensions of q and k then encounter the complete range of cos/sin values during training or fine-tuning and are learned more fully, so positional information is perceived in more dimensions.
  • A larger base means a smaller $\theta_n$ and a longer period of the corresponding trigonometric function, which can represent more positional information and helps the model capture the low-frequency features that correspond to long context. However, there is a limitation: when the period exceeds the training length, some dimensions will encounter, at test time, positional values outside the range seen during training. Although such a dimension cannot see the complete cos/sin range during training, it still stays within a monotonic interval during extrapolation.
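The trade-off between a small and a large base can be made concrete by counting, for several bases, how many dimension pairs complete a full period within the training length. This is a minimal sketch: the function name and the values d = 128 and T_train = 4096 are illustrative assumptions.

```python
import math

def fully_seen_pairs(base: float, d: int = 128, t_train: int = 4096) -> int:
    """Number of dimension pairs whose period 2*pi*base**(2n/d) fits in t_train."""
    periods = (2 * math.pi * base ** (2 * n / d) for n in range(d // 2))
    return sum(period <= t_train for period in periods)

for base in (500, 10_000, 1_000_000):
    print(base, fully_seen_pairs(base))
```

Under these assumptions the counts are 64, 46 and 31 pairs respectively: the small base lets every pair see a full cycle, while the large base leaves more than half of the pairs inside a partial cycle.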

Therefore, for RoPE-based LLMs, there exists a critical base $\beta_0$. The critical base is the worst base for extrapolation and also the smallest base that forces RoPE to extrapolate based on the feature dimensions within the critical dimension. It is determined jointly by the continued-training (fine-tuning) text length $T_{tune}$ and the pre-training text length $T_{train}$: $\beta_0 = 10000^{\log_{T_{train}/2\pi}(T_{tune}/2\pi)}$.

  • If $\beta > \beta_0$, the extrapolation upper bound is determined by the base $\beta$ and the critical dimension $d_{extra}$: $T_{extra} = 2\pi \cdot \beta^{d_{extra}/d}$.

    Since the base is greater than the critical base, the dimensions that can be traversed during the fine-tuning phase could already be traversed during pre-training, so the critical dimension of the model extrapolation remains unchanged.

  • If $\beta \le \beta_0$, the extrapolation upper bound is the continued-training length $T_{tune}$; however, the critical dimension is updated to $d'_{extra} = 2\lceil \frac{d}{2}\log_{\beta}\frac{T_{tune}}{2\pi} \rceil \ge d_{extra}$.

    Because the base does not exceed the critical base, the dimensions traversed during the fine-tuning phase go beyond the original critical dimension. The critical dimension is updated, but since it depends on the continued-training length, the upper bound of the model’s extrapolation is still limited by that length. However, if the base is small enough that every dimension traverses values from 0 to $\pi/2$, $\pi$, or $2\pi$ within the training length, extrapolation improves further, and the model can still extrapolate beyond $T_{tune}$. In particular, if $\beta$ is less than $\beta_1 = 2T_{tune}/\pi$, $\beta_2 = T_{tune}/\pi$, or $\beta_3 = T_{tune}/(2\pi)$, extrapolation will be significantly improved.
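To get a feel for the critical base, the following sketch evaluates $\beta_0 = 10000^{\log_{T_{train}/2\pi}(T_{tune}/2\pi)}$ and the updated critical dimension $d'_{extra} = 2\lceil \frac{d}{2}\log_{\beta}\frac{T_{tune}}{2\pi} \rceil$ as given in the “Scaling Laws of RoPE-based Extrapolation” paper. The function names and the lengths 4096 and 16384 are assumptions for illustration only.

```python
import math

TWO_PI = 2 * math.pi

def critical_base(t_train: int, t_tune: int, base0: float = 10000.0) -> float:
    """beta_0 = base0 ** log_{T_train/2pi}(T_tune/2pi)."""
    return base0 ** math.log(t_tune / TWO_PI, t_train / TWO_PI)

def updated_critical_dimension(beta: float, d: int, t_tune: int) -> int:
    """d'_extra = 2 * ceil((d/2) * log_beta(T_tune/2pi)), capped at d."""
    return min(d, 2 * math.ceil((d / 2) * math.log(t_tune / TWO_PI, beta)))

# Assumed example: pre-training at 4096 tokens, continued training at 16384.
beta_0 = critical_base(4096, 16384)
print(round(beta_0))
# A base below beta_0 (here the original 10000) enlarges the critical dimension.
print(updated_critical_dimension(10000.0, 128, 16384))
```

With these values the critical base is roughly 7.2e4, and keeping the base at 10000 while fine-tuning on 16384 tokens raises the critical dimension from 92 to 110.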

Let’s summarize: a smaller base allows more dimensions to perceive positional information; a larger base allows longer positional information to be represented. When the base of 𝜃 is set very large, every dimension rotates very slowly, like a slowly turning dial, which ensures that no matter how many tokens there are, their absolute position codes will not repeat.

The paper “Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large Language Models for Effective Tool Use” found that changing different bases for the same model and then averaging the outputs can enhance the overall performance of the model. This shows that different base sizes have their own advantages and should not be reduced simply for the purpose of extrapolation.

4.3 Rule

4.3.1 Scaling rules when shrinking the base

The scaling rule is as follows: for a RoPE-based large language model whose pre-training text length is $T_{train}$, if the base is adjusted to 𝛽<10000 during the fine-tuning phase and fine-tuning still uses the pre-training text length, the extrapolation ability of the model will be improved.

When the base is reduced, the attention scores learn more of the cos/sin fluctuations within the training length. Because these fluctuations have already been perceived during training, fewer dimensions encounter, at test time, positional values that were never seen in training than with the larger original base, which improves extrapolation. Furthermore, the smaller the base, the more complete the perception and the flatter the extrapolation curve. Specifically, if 𝛽 is smaller than $\beta_1 = 2T_{train}/\pi$, $\beta_2 = T_{train}/\pi$, or $\beta_3 = T_{train}/(2\pi)$, the cos/sin inputs in every dimension of the position encoding traverse from 0 to $\pi/2$, $\pi$, or $2\pi$ respectively, further enhancing the extrapolation capability.

Improving RoPE extrapolation by shrinking the base is a process of updating the critical dimension step by step. Each time the base shrinks by one step, one more pair of dimensions perceives complete positional information (RoPE groups dimensions in pairs, so two dimensions are updated at a time). Ultimately, once the base has shrunk to $\beta_3$, every dimension perceives all positional information, resulting in a leap in the model’s extrapolation capabilities.
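The three thresholds for a shrinking base can be checked numerically. The sketch below assumes the threshold formulas $\beta_1 = 2T_{train}/\pi$, $\beta_2 = T_{train}/\pi$ and $\beta_3 = T_{train}/(2\pi)$ from the “Scaling Laws of RoPE-based Extrapolation” paper, and verifies that at each threshold the slowest dimension pair reaches at least $\pi/2$, $\pi$ and $2\pi$ within the training length. The function names and the values d = 128 and T_train = 4096 are illustrative.

```python
import math

def base_thresholds(t_train: int) -> tuple:
    """Return (beta_1, beta_2, beta_3) for a given training length."""
    return (2 * t_train / math.pi, t_train / math.pi, t_train / (2 * math.pi))

def max_angle_slowest_pair(base: float, d: int, t_train: int) -> float:
    """Largest angle reached within t_train by the slowest pair (n = d/2 - 1)."""
    theta_last = base ** (-(d - 2) / d)
    return t_train * theta_last

d, t_train = 128, 4096  # assumed example values
for beta, target in zip(base_thresholds(t_train), (math.pi / 2, math.pi, 2 * math.pi)):
    reached = max_angle_slowest_pair(beta, d, t_train)
    print(f"beta={beta:8.1f}  angle reached={reached:6.3f}  target={target:6.3f}")
```

Because the slowest pair is the last one to complete each stage, reaching the target there implies that every other pair has reached it as well.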

2309

4.3.2 Scaling rules for RoPE extrapolation when the base is enlarged

The scaling rule is as follows: for a RoPE-based large language model whose pre-training text length is $T_{train}$, if the base is adjusted to 𝛽>10000 during the fine-tuning phase and fine-tuning still uses the pre-training text length, the extrapolation ability of the model will be improved.

The mathematical relationship between the extrapolation upper bound and the base is as follows: the period of the critical dimension $d_{extra}$ under the updated base gives the upper bound of the model’s extrapolation, $T_{extra} = 2\pi \cdot \beta^{d_{extra}/d}$.

Conversely, to make the model support a given context length $\tilde{T}_{extra}$, there exists a minimum base $\beta_0 = 10000^{\log_{T_{train}/2\pi}(\tilde{T}_{extra}/2\pi)}$.
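Both directions of this relationship can be sketched in a few lines: the upper bound $T_{extra} = 2\pi \cdot \beta^{d_{extra}/d}$ for a given base, and the smallest base for a target length. The formulas follow the “Scaling Laws of RoPE-based Extrapolation” paper; the function names and the defaults d = 128 and T_train = 4096 are assumptions for illustration.

```python
import math

TWO_PI = 2 * math.pi

def extrapolation_upper_bound(beta: float, d: int = 128, t_train: int = 4096,
                              base0: float = 10000.0) -> float:
    """T_extra = 2*pi * beta**(d_extra/d); d_extra comes from the original base."""
    d_extra = 2 * math.ceil((d / 2) * math.log(t_train / TWO_PI, base0))
    return TWO_PI * beta ** (d_extra / d)

def smallest_base_for(t_target: float, t_train: int = 4096,
                      base0: float = 10000.0) -> float:
    """Smallest base expected to support a target context length."""
    return base0 ** math.log(t_target / TWO_PI, t_train / TWO_PI)

print(round(extrapolation_upper_bound(1_000_000)))
print(round(smallest_base_for(100_000)))
```

With the assumed values, a base of one million gives an upper bound of roughly 1.3e5 tokens. Because of the ceiling in $d_{extra}$, the two functions are not exact inverses: the base returned for a target length yields an upper bound at or somewhat above that target.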

The relationship between base, dimension, and period is as follows:

  • For dimensions whose period is covered by the training length $T_{train}$, the model has seen a complete period and can adapt to the periodic change of the positional encoding in each of these dimensions. Therefore, when the base is enlarged and the period becomes longer, although these dimensions no longer see a complete period during fine-tuning, they can still represent the positional information within that period and adapt to the new periodic behavior of the positional embedding in the extended context. In other words, although scaling up the base lengthens the period, the resulting cos/sin values remain within the range seen during the original pre-training.
  • For dimensions whose period exceeds the training length $T_{train}$, parameter learning is problematic: the entire period has not been seen during training, which leads to a lack of understanding of periodicity and potential overfitting. Furthermore, scaling up the base makes it even harder to perceive positional information within a complete period. These dimensions are therefore reliable only for rotation angles that have been observed before. The updated period of the critical dimension can thus serve as the upper bound for the extrapolation of RoPE-based LLMs.

2310

0x05 Basic RoPE Extrapolation Scheme

In RoPE, positional information is represented by a trigonometric function of the product of the position index and the rotation angle. To keep this product within the pre-training range as the index increases, researchers have proposed several schemes, such as limiting index growth or reducing the rotation angle.

5.1 Direct Extrapolation

Extrapolation here refers to extending the length without making further changes to the encoding. Essentially, it means doing nothing. The diagram below shows how the range of values is extended from [1,10] to [1,17] while keeping the interval between adjacent points constant at 1.

2311

RoPE positional encoding exhibits long-range decay, which in theory allows for unlimited length. When the extension is small, direct extrapolation has little impact on performance. With large extensions, however, it typically degrades performance severely because the model cannot adapt to positions it was never trained on. Let L be the current sample length. When L significantly exceeds the training length, the extra positions are untrained or insufficiently trained in certain dimensions, so direct extrapolation often leads to a significant performance drop. To mitigate this, the model needs to be fine-tuned for a few steps on longer contexts.

5.2 Linear Interpolation

The paper “Extending Context Window of Large Language Models via Position Interpolation” proposes a scheme called “Position Interpolation” (PI), which was the first to extend the context length by linearly scaling the position index with a scaling factor. Specifically, during inference PI scales the target positions down proportionally into the range of positions supported by the model. It can expand the context window of RoPE-based pre-trained LLMs (e.g., LLaMA models) to 32768 with only a small amount of fine-tuning (within 1000 steps). However, this method is still limited by the training length and ignores the differences among the feature dimensions of the RoPE query and key vectors.

5.2.1 Approach

Since positions beyond L haven’t been trained, why not select more points at fractional positions within L? This lets us use more positions inside the already learned index range and thereby extend the length. This is positional interpolation. Compared to direct fine-tuning and length extrapolation, the key idea of positional interpolation is: instead of extrapolating beyond the context length used in training, it linearly scales down the position indices to match the original context window size from the pre-training phase. In other words, it compresses a larger context length back into the pre-training context length.

Position interpolation turns the unseen into the seen, and the out-of-distribution into the in-distribution. The method is quite simple to implement. Formally, it replaces the RoPE function f with f′, defined as $f'(x, m) = f(x, mL/L')$. By scaling the position index of RoPE by a constant, the values of the position encoding are constrained to the training length range, where L is the length limit during pre-training (the original maximum context window) and L′ is the longer context window during inference (the expanded maximum context window). PI linearly compresses the positions 0 ~ L′ into the range 0 ~ L. Since the position numbers are fed into sine and cosine functions, fractional positions are acceptable.

For example, if we want to extrapolate the position vector range [0, 2048] from the pre-training stage to [0, 4096], we only need to scale the corresponding positions to the originally supported interval ([0, 2048]), where L is the originally supported length (e.g., 2048) and L′ is the length to be expanded (e.g., 4096).

For any context length L′ > L that we want to achieve, positions are multiplied by the compression ratio L/L′ < 1; equivalently, they are divided by the scaling factor s = L′/L > 1.

Positional interpolation reduces the absolute position index from [0, L′) to [0, L) to match the original range, which also reduces the maximum relative distance from L′ to L. Therefore, by aligning the range of position indices and the relative distances before and after expansion, positional interpolation ensures that positional encodings that originally exceeded the model’s training length fall into the trained position range after interpolation. This mitigates the impact of context window expansion on attention score calculation, making the model more adaptable.
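The index compression described above can be sketched in a few lines. The helper name is hypothetical, and the lengths 2048 and 4096 follow the example used earlier in this section.

```python
def interpolated_position(m: int, train_len: int, target_len: int) -> float:
    """Map position m in [0, target_len) into [0, train_len), as PI does: m * L / L'."""
    return m * train_len / target_len

L, L_prime = 2048, 4096
positions = [interpolated_position(m, L, L_prime) for m in range(L_prime)]

# Every compressed index stays inside the pre-training range, at fractional steps.
print(positions[:4], max(positions))
```

The first few compressed indices are 0.0, 0.5, 1.0 and 1.5, and the largest is 2047.5, so both the absolute indices and the maximum relative distance stay below L.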

2312

Let’s refine this further. If we want to double the number of positions that can be applied to the position encoding, we set L=4096, L’=8192, which means expanding the model length from 4096 to 8192. Position Interpolation halves the rotation radians at each position. The following figure visually shows the change in rotation radians for the 0th component. The original rotation radian range of [0, 2047] can now represent the length range of 4096. This is equivalent to inserting more positions within the original radian range. Since the rotation radians change linearly, it is also called linear position interpolation. As can be seen from the following figure:

  • The top left corner shows the position vector range [0, 4096] during the pre-training phase, corresponding to normal use of the LLM model. The input position index (blue dot) is within the pre-training range.
  • The upper right corner shows the part of the length extrapolation (4096, 8192), where the model needs to operate on up to 4096 unseen locations (red dots).
  • The bottom left corner shows the position interpolation method. We downsample the position index (blue and green dots) from [0, 4096] to [0, 2048], which is the range supported by the pre-training stage, and downsample the position index (red dot) from [4096, 8192] to [2048, 4096], which is the range supported by the pre-training stage.

In other words, to accommodate more input tokens, the authors exploit the fact that the positional encoding can be evaluated at non-integer positions: new positions are inserted between adjacent integer positions, instead of performing inference outside the trained positions.

2313

5.2.2 Principle

Next, we will analyze why PI works.

From the perspective of the attention field of view: assuming the inference length is a times the training length, dividing the positions by a to achieve position interpolation is equivalent to reducing the bias coefficient by a factor of a, which expands the attention field of view by a factor of a.

RoPE encodes positional information as rotations of complex vectors, where the embedding of each dimension pair can be viewed as a rotation whose frequency is determined by the base b. For any vector q at position m, the rotation angle of its i-th component is $m\theta_i = m \cdot b^{-2i/d}$, where d is the dimension of the vector.

From the perspective of the rotation angle, the scaling can be understood as dividing either the position index $m$ or the rotation angle $\theta_i$ by a constant $s$, which shrinks the product $m\theta_i$.

Here s is the interpolation scale, i.e., L′/L. After this scaling operation, the rotation angle of dimension i at position m becomes $m\theta_i/s$. The difference in rotation angle between adjacent positions of the i-th vector group is reduced from $\theta_i$ to $\theta_i/s$. Linear interpolation therefore reduces the rotation angle at each position, slowing down the vector’s rotation, lengthening the period, and lowering the frequency. The rotation angle at each position becomes L/L′ of the original: if the length is increased by some factor, the rotation angle is reduced by the same factor.
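The effect of the scale s on the rotation angle can be verified with a small sketch. The function names are hypothetical, and s is taken as L′/L, as defined in the text.

```python
import math

def adjacent_angle_step(i: int, d: int, s: float = 1.0, base: float = 10000.0) -> float:
    """Angle added per token in the i-th pair after interpolation: theta_i / s."""
    return base ** (-2 * i / d) / s

def period_in_tokens(i: int, d: int, s: float = 1.0, base: float = 10000.0) -> float:
    """Number of tokens needed for the i-th pair to complete one full rotation."""
    return 2 * math.pi / adjacent_angle_step(i, d, s, base)

d, s, m, i = 128, 2.0, 100, 5
theta_i = adjacent_angle_step(i, d)

# Dividing the index or dividing the angle gives the same rotation.
print(math.isclose((m / s) * theta_i, m * (theta_i / s)))
# The period of every pair grows by exactly the factor s.
print(period_in_tokens(i, d, s) / period_in_tokens(i, d))
```

The same factor s applies to every pair, including the fastest one (i = 0), which is the uniform slowdown that the later discussion of PI's disadvantages refers to.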

The derivation of the upper bound in the paper is given below:

2314

5.2.3 Fine-tuning

After interpolation, the mapping becomes inconsistent; from a relative numerical perspective, this leads to a more “crowded” dimension. Therefore, fine-tuning training is usually required after interpolation modifications. After initial training on sequences (interpolated) within a specified length range, the model undergoes a fine-tuning process to improve its performance on longer sequences. This adaptation improves the model’s ability to generalize to extended contexts, ensuring seamless handling of the initially observed and inferred input lengths. In other words, fine-tuning allows the model to readjust to crowded mappings. Compared to extrapolation schemes, interpolation schemes require far fewer fine-tuning steps because in many scenarios (such as positional encoding), relative size (or perhaps order information) is more important. In other words, the model only needs to know that 874.5 is larger than 874, without needing to know what actual number it represents. Since the model has already learned that 875 is larger than 874, and given its inherent generalization ability, learning another 874.5 is not too difficult.

5.2.4 Comparison

The following diagram compares the extrapolation method and the interpolation method.

2315

  • The red curve in the left figure is the fitted attention score curve, which stays within the range [-1, 1].
  • The middle figure shows that the fitted curve is well bounded within [0, 2048]. However, when it is extended directly to positions not seen in training, the attention score becomes disastrously large beyond 2048, even exceeding 8000. This completely destroys the attention mechanism, and the perplexity can spike to levels comparable to an untrained model.
  • The right figure shows that interpolation is much more stable and behaves well at integer positional differences.

Compared to direct extrapolation, positional interpolation has the following advantages:

  • Positional interpolation can easily enable very long context windows (e.g., 32768). The upper bound on the attention score under interpolation is much smaller than under extrapolation, which avoids catastrophically high attention scores and shows that positional interpolation is more stable.
  • Positional interpolation generates powerful models that can effectively utilize extended context windows. Models extended by positional interpolation retain their original network structure and can reuse most pre-existing optimizations and infrastructure.
  • The model extended by positional interpolation also retains relatively good quality on the task within its original context window.

5.2.5 Disadvantages

The characteristics or advantages of RoPE are: faster rotation in low dimensions (corresponding to local detail capture) and slower rotation in high dimensions (corresponding to long-range dependencies). This design cleverly combines the information encoding capabilities of both long-range and short-range operations.

The problem with direct extrapolation is that it goes out of bounds at distant locations. Direct extrapolation maintains locality (the encoding near 0 remains unchanged), but its poor performance is due to the introduction of positional encodings that exceed the training length. Therefore, direct extrapolation is prone to problems when used over long distances, but is unaffected when used over short distances.

The problems with position interpolation are local distortion, loss of high-frequency information, and lack of dynamic scaling.

The positional interpolation method is linear, which treats all dimensions equally, reducing the rotation angle and speed of all vector groups indiscriminately (further manifested as stretching the sine function). That is, the reduction factor for high-frequency rotation angles is the same as that for low-frequency rotation angles, without considering different scaling for different dimensions. This can cause the following problems: when processing similar tokens, it may be difficult to accurately distinguish their order and position, severely disrupting the model’s local resolution. This can lead to the model losing detailed information in the original high-frequency components, making it difficult to distinguish tokens that are relatively close in position and semantically similar.

In other words, interpolation results in inconsistent distributions across dimensions, so the dimensions are not treated fairly. For the high-frequency low dimensions, the interpolated positions become excessively crowded. The model loses high-frequency information, becomes less able to recognize small rotations, cannot work out the order of nearby tokens, and is harder to train further. Although positional interpolation avoids the problem of distant positions going out of bounds, linear interpolation disrupts locality (with an expansion factor k, the encoding of positions near 0 is compressed to 1/k), resulting in a loss of local resolution. It is prone to problems in short-range scenarios, so it does not work well without fine-tuning.

Furthermore, the PI method employs a static interpolation strategy when expanding the context window, without considering the actual length of the input sequence. This can lead to performance degradation when processing sequences of varying lengths.

Therefore, the key to achieving training-free length extrapolation is to “preserve the near and compress the far,” that is, to “ensure no distortion in the local area” and “compress the far area without exceeding the limit.”

5.2.6 Implementation

The main changes to the Transformers library are as follows:

  • A new scaling_factor parameter has been added to control the interpolation ratio.
  • The index is divided by the interpolation ratio.

The specific code is as follows.

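import torch

# OpenLlamaRotaryEmbedding is the base rotary-embedding class from the same modeling file of the Transformers library.
class OpenLlamaLinearScalingRotaryEmbedding(OpenLlamaRotaryEmbedding):
    """OpenLlamaRotaryEmbedding extended with linear scaling."""

    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
        # Set scaling_factor before calling the parent constructor, which is expected to build the cos/sin cache.
        self.scaling_factor = scaling_factor
        super().__init__(dim, max_position_embeddings, base, device)

    def _set_cos_sin_cache(self, seq_len, device, dtype):
        # Linear interpolation is simple: fractional positions are inserted between the original integer
        # positions. After fine-tuning on a small amount of data, the model extends well to longer texts.
        self.max_seq_len_cached = seq_len
        t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq)
        # The key step of linear interpolation: dividing the positions lowers all frequencies.
        t = t / self.scaling_factor
        # Compute the [seq_len, dim // 2] matrix of rotation angles, the core of the position-encoding matrix
        # (dim is assumed to be even).
        freqs = torch.outer(t, self.inv_freq)
        # Different from paper, but it uses a different permutation in order to obtain the same calculation
        # Duplicate along the last dimension to match the matrix formula given earlier.
        emb = torch.cat((freqs, freqs), dim=-1)
        # Compute the cos and sin components and register them as non-persistent buffers (model constants).
        self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
        self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)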
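To see what the scaled cache contains without installing PyTorch, here is a dependency-free sketch that mirrors the logic of `_set_cos_sin_cache` with plain Python lists. It is not the library's code: the function name `cos_sin_cache` and the tiny sizes are assumptions for illustration.

```python
import math

def cos_sin_cache(seq_len: int, dim: int, base: float = 10000.0,
                  scaling_factor: float = 1.0):
    """Return (cos, sin) tables of shape [seq_len][dim] with linearly scaled positions."""
    inv_freq = [base ** (-2 * i / dim) for i in range(dim // 2)]
    cos_rows, sin_rows = [], []
    for pos in range(seq_len):
        t = pos / scaling_factor              # the interpolation step
        angles = [t * f for f in inv_freq]    # one angle per dimension pair
        angles = angles + angles              # mirrors torch.cat((freqs, freqs), dim=-1)
        cos_rows.append([math.cos(a) for a in angles])
        sin_rows.append([math.sin(a) for a in angles])
    return cos_rows, sin_rows

plain_cos, plain_sin = cos_sin_cache(seq_len=8, dim=8)
scaled_cos, scaled_sin = cos_sin_cache(seq_len=16, dim=8, scaling_factor=2.0)

# With scaling_factor=2, position 2m reuses the encoding that position m had originally.
same = all(
    math.isclose(scaled_cos[2 * m][j], plain_cos[m][j], abs_tol=1e-12)
    for m in range(8) for j in range(8)
)
print(same)
```

The check confirms the intuition behind PI: the 16 scaled positions occupy the same angular range as the 8 original ones, with the odd positions landing halfway between trained encodings.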

0x06 Advanced RoPE Extrapolation Schemes

There are two main families of schemes for extrapolating position encodings:

  • Pre-hoc modifications, such as ALiBi, KERPLE, xPos, and HWFA, achieve a certain degree of length extrapolation directly, but the corresponding changes must be introduced before training, so they cannot be applied to existing models without fine-tuning.
  • Post-hoc modifications, such as NTK-RoPE and ReRoPE, directly modify the model at inference time and achieve a certain degree of length extrapolation without fine-tuning. Their disadvantage is that they cannot keep the model’s behavior identical within the training length.

The methods for extending the length of large models introduced in this section are all post-hoc modifications.

Besides Giraffe, these methods essentially affect the rotation angle at each position by changing the base, thereby influencing the model’s positional encoding information and ultimately achieving length extrapolation. Specifically, this involves reducing the rotation radian of the RoPE and decreasing the rotation speed, which helps capture low-frequency features corresponding to long contexts, thus extending the length. Adjusting the rotation radian will affect the model’s attention distribution. To achieve better results, fine-tuning with a small amount of long text is generally needed to allow the model to adapt to the adjusted positional information. In short, the characteristics of these methods can be summarized as follows:

  • Position Interpolation: Linearly interpolates the rotation angle of the RoPE by scaling. If the target length is n times the original, the rotation radian is reduced to 1/n of the original.
  • NTK-Aware Interpolation: A nonlinear interpolation method that spreads the interpolation pressure across dimensions by scaling the frequencies of different RoPE dimensions to different degrees (high frequencies are scaled less, low frequencies more), thereby addressing the loss of high-frequency information during RoPE interpolation. Specifically, NTK-Aware Interpolation directly scales the base of RoPE, resulting in a small reduction in the rotation speed of high-frequency components and a large reduction in the rotation speed of low-frequency components. This amounts to extrapolation in the high-frequency part and interpolation in the low-frequency part. It behaves like extrapolation at short distances and like interpolation at long distances.
  • NTK-by-parts Interpolation: This further refines the interpolation strategy: it leaves the high-frequency components unchanged and only reduces the rotation angle of the low-frequency components. It also introduces two thresholds that determine which dimensions are left unscaled and which are fully scaled.
  • Dynamic NTK Interpolation: This dynamically adjusts the interpolation strategy across the inference steps of the model. When the inference length is less than or equal to the training length, no interpolation is performed; when the inference length is greater than the training length, the base is dynamically enlarged at each step using NTK-Aware interpolation. In other words, the scaling factor s is adjusted dynamically during inference.
  • YaRN combines NTK-by-parts interpolation with an attention distribution correction strategy, using a temperature coefficient to adjust the attention distribution. YaRN divides the RoPE dimension into three groups and applies different interpolation strategies to each group based on frequency: direct extrapolation, NTK-aware interpolation, and linear interpolation. The YaRN method can be considered as NTK-aware + NTK-by-parts + Dynamic NTK.
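The dynamic variant in the list above can be sketched as follows. This is a minimal illustration assuming the simple rule $s = \max(1, l'/L)$ and the NTK-aware base change $b' = b \cdot s^{d/(d-2)}$ as described in the YaRN paper; concrete libraries may use a different formula, and the function names are hypothetical.

```python
def dynamic_scale(current_len: int, train_len: int) -> float:
    """Scale factor s: 1 inside the training length, l'/L beyond it."""
    return max(1.0, current_len / train_len)

def dynamic_ntk_base(current_len: int, train_len: int, dim: int,
                     base: float = 10000.0) -> float:
    """Base used at the current sequence length under the NTK-aware rule."""
    s = dynamic_scale(current_len, train_len)
    return base * s ** (dim / (dim - 2))

# Assumed example: training length 4096, head dimension 128.
for length in (1000, 4096, 8192, 16384):
    print(length, round(dynamic_ntk_base(length, 4096, 128)))
```

Inside the training length the base stays at 10000, so the encoding is identical to the original one. Beyond it, the base grows with the current length, so short prompts are not penalized by a scale chosen for the longest possible input.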

Furthermore, in its paper “Effective Long-Context Scaling of Foundation Models”, Meta renamed NTK-RoPE to RoPE-ABF (Adjusted Base Frequency). Compared to the mysterious NTK, the name ABF more intuitively reflects its meaning.

6.1 General Formula for Position Encoding

The following section draws on the article “From RoPE to YaRN: a general formula for quickly understanding position encoding in long-text large models” and on the ideas presented in the paper “YaRN: Efficient Context Window Extension of Large Language Models”.

The authors of YaRN observe that the encoding function is a function of the input vector $x_m$, the position $m$, and the rotation angle $\theta_d$. RoPE and all of its variants can essentially be unified by the following formula:

$$f'_W(x_m, m, \theta_d) = f_W(x_m, g(m), h(\theta_d))$$

where:

  • f′ is the adjusted function that computes the query and key vectors.
  • f is the original function that computes the query and key vectors.
  • $x_m$ is the embedding vector at position m in the input sequence.
  • m is the position index in the sequence.
  • $\theta_d$ is the rotation angle parameter in RoPE, i.e., the frequency parameter.
  • g(m) is an adjustable function used to adjust the position index m according to the scaling factor s, describing the position transformation logic.
  • $h(\theta_d)$ is an adjustable function used to adjust the rotation angle parameter of RoPE according to the scaling factor s, describing the frequency transformation logic. The design of $h(\theta_d)$ aims to preserve high-frequency information while avoiding over-interpolation of low-frequency information.

The underlying logic of this formula is that, by choosing g(m) and $h(\theta_d)$ appropriately, a language model trained with a fixed context length can be extended to longer contexts.
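The general formula translates naturally into code with two pluggable functions. In the sketch below, `rotation_angles` is a hypothetical helper that returns the angles $g(m) \cdot h(\theta_d)$ fed to cos/sin; plain RoPE uses identity functions, and Position Interpolation only replaces g.

```python
import math
from typing import Callable, List

def rotation_angles(m: float, thetas: List[float],
                    g: Callable[[float], float] = lambda pos: pos,
                    h: Callable[[float], float] = lambda theta: theta) -> List[float]:
    """Angles g(m) * h(theta_d) for every dimension pair."""
    return [g(m) * h(theta) for theta in thetas]

dim, base, s = 8, 10000.0, 4.0
thetas = [base ** (-2 * d / dim) for d in range(dim // 2)]

rope = rotation_angles(2, thetas)                       # g(m) = m,     h = identity
pi = rotation_angles(8, thetas, g=lambda pos: pos / s)  # g(m) = m / s, h = identity

# Under PI with s = 4, position 8 lands on the angles RoPE assigns to position 2.
print(all(math.isclose(a, b) for a, b in zip(rope, pi)))
```

Methods that change the base instead, such as NTK-aware scaling, would keep g as the identity and pass a different h.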

6.1.1 Trigonometric Function Encoding

The trigonometric (sinusoidal) encoding can be expressed in the general formula as follows:

$$f_W(x_m, m, \theta_d) = W(x_m + p_m)$$

where:

  • W is a linear projection matrix used to transform vectors.
  • $p_m$ is the positional encoding at position m.

6.1.2 RoPE

The derivation of RoPE is given in the YaRN paper.

2316

The paper then expresses RoPE in the general formula by directly setting $g(m) = m$ and $h(\theta_d) = \theta_d$, i.e., both the position index and the frequency parameters are left unchanged.

The rotation angle $\theta_d = b^{-2d/|D|}$ determines the frequency, that is, the rotation speed of each dimension.

  • When $\theta_d$ is close to 1 (lower dimensions with smaller d values), the angle changes significantly from one position to the next, and the rotation is faster.
  • When $\theta_d$ is close to 0 (higher dimensions with larger d values), the angle changes only slightly from one position to the next, and the rotation is slower.
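A short sketch makes the spread of rotation speeds visible. It assumes the usual RoPE definition $\theta_d = b^{-2d/|D|}$ with base 10000 and an illustrative head dimension of 128; the function names are hypothetical.

```python
import math

def theta(d: int, dim: int, base: float = 10000.0) -> float:
    """Rotation angle per token for the d-th dimension pair."""
    return base ** (-2 * d / dim)

def wavelength(d: int, dim: int, base: float = 10000.0) -> float:
    """Number of tokens needed for the d-th pair to complete one full rotation."""
    return 2 * math.pi / theta(d, dim, base)

dim = 128
for d in (0, 16, 32, 48, 63):
    print(f"pair {d:2d}: theta={theta(d, dim):.6f}  wavelength={wavelength(d, dim):10.1f} tokens")
```

The first pair rotates by one radian per token and completes a cycle in about six tokens, while the last pair needs tens of thousands of tokens for a single cycle.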

2317

6.1.3 PI

PI uniformly compresses the position index into the pre-training window by redefining g(m): $g(m) = m/s$ with $s = L'/L$, while the frequency is left unchanged, $h(\theta_d) = \theta_d$.

2318

Next, let’s look at some variations of RoPE. These methods mainly achieve plug-and-play length extrapolation by adjusting the rotation base of RoPE.

6.2 NTK-Aware Interpolation

In RoPE, vector groups closer to the beginning rotate faster and have higher frequencies. However, linear interpolation applies the same simple scaling (division by a constant) across all dimensions of the position encoding, which loses high-frequency information and hinders the model from distinguishing nearby positions. This is because the interpolated angle is $m\theta_i/s$: since $1/s$ is a number less than 1, the frequency of every term decreases after linear interpolation. Naturally, the maximum frequency the formula can express also decreases, reducing its ability to fit high-frequency information.

Shortly after positional interpolation was introduced, Bowen Peng proposed to the community a more effective length extrapolation technique that requires no fine-tuning: NTK-aware Scaled RoPE, also simply called “NTK-aware RoPE” (https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/). He later refined and optimized this method in the paper YaRN: Efficient Context Window Extension of Large Language Models.

The paper “Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains” uses NTK (Neural Tangent Kernel) theory to show that deep neural networks have difficulty learning high-frequency information, especially when the input dimension is low. Its remedy is to map the inputs to Fourier features before feeding them to the network. From this perspective, simply interpolating RoPE linearly loses high-frequency information, which the network needs in order to tell apart tokens that are very similar and very close together.


Therefore, based on NTK theory, Bowen Peng proposed nonlinear interpolation of the phase of trigonometric functions. Specifically, Bowen Peng’s scheme involves scaling different frequency dimensions to varying degrees when expanding the context length, resulting in a smaller reduction in the rotation speed of high-frequency components (preserving high-frequency information) and a larger reduction in the rotation speed of low-frequency components. This involves “short-distance high-frequency extrapolation (keeping high-frequency information unchanged with minimal changes) and long-distance low-frequency interpolation (interpolating low-frequency information with significant changes).” This maintains the model’s sensitivity to high-frequency information. Here, extrapolation refers to expanding the length without significantly altering the encoding.

Why does NTK-Aware Interpolation work? Or, why extrapolate in the high-frequency range and interpolate in the low-frequency range? We can explain why NTK-Aware Interpolation works as follows: the later the group is, the slower its rotation speed, the lower its frequency, and the longer its period.

  • The earlier dimensional groups have seen many complete rotation cycles during training, and their positional information has been fully trained, so they have a strong extrapolation ability.
  • Later dimension groups may not see complete rotation cycles during training, or may see very few rotation cycles, resulting in insufficient training, weak extrapolation performance, and the need for position interpolation.

In addition, based on his experience with NTK (Neural Tangent Kernel), Bowen Peng determined that high frequencies (i→0) learn relative distances, so they do not need to be changed, while low frequencies (i→d/2−1) learn absolute distances, so interpolation is required.

6.2.1 Scheme

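The YaRN paper provides a detailed derivation starting from PI. NTK-aware interpolation keeps $g(m) = m$ and sets $h(\theta_d) = b'^{-2d/|D|}$ with $b' = b \cdot s^{|D|/(|D|-2)}$, so that $h(\theta_d) = \theta_d \cdot s^{-2d/(|D|-2)}$. When $d = 0$, $h(\theta_0) = \theta_0$: s has no effect, and this dimension is directly extrapolated. When $d = |D|/2 - 1$, $h(\theta_d) = \theta_d / s$: this dimension is linearly interpolated.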

YaRN also defines $\lambda_d$ as the wavelength of the RoPE embedding in the d-th hidden dimension; the wavelength is the number of tokens required to complete one full rotation ($2\pi$) in that dimension. It is tied to the frequency of the RoPE embedding and differs across dimensions: $\lambda_d = 2\pi/\theta_d = 2\pi b^{2d/|D|}$. Given a length L, some dimensions may have a period longer than L. In those dimensions every position receives a unique code, meaning that absolute position is preserved. Conversely, dimensions with shorter periods can only retain relative position information.
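
To make the wavelength argument concrete, the sketch below lists the wavelength of each dimension pair and picks out the pairs whose period exceeds a given length. It assumes the standard definition $\lambda_d = 2\pi b^{2d/|D|}$; the function name `wavelengths` and the values dim = 128 and L = 4096 are illustrative assumptions.

```python
import math

def wavelengths(dim: int, base: float = 10000.0) -> list[float]:
    """Tokens needed for one full rotation in each of the dim/2 pairs."""
    return [2 * math.pi * base ** (2.0 * d / dim) for d in range(dim // 2)]

dim, L = 128, 4096          # assumed head dimension and training length
lam = wavelengths(dim)

# Pairs that never complete a full rotation within L tokens: each position
# in [0, L) gets a unique angle there, so absolute position is recoverable.
long_dims = [d for d, w in enumerate(lam) if w > L]
print(long_dims[0], len(long_dims))
```

Because the wavelengths grow monotonically with d, the long-period pairs always form a contiguous block at the end of the list.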

The following are the specific steps of the NTK-aware Interpolation method:

  • Determine the new base b’: To preserve high-frequency information when expanding the context window, find a new base b’ such that, at the new context length L’, the wavelength of the lowest-frequency dimension matches the one produced by linear position scaling. The new base is $b' = b \cdot s^{|D|/(|D|-2)}$, where b is the base of the original RoPE and $s = L'/L$ is the scaling factor of the context length expansion.
  • Adjust the RoPE embedding: Use the new base b’ to adjust the rotation angle of each dimension, $\theta'_i = b'^{-2i/|D|}$, where |D| is the total number of dimensions and i is the index of a specific dimension group.
    • The rotation angle changes to $\theta'_i = (b \cdot \alpha)^{-2i/|D|}$, where $\alpha = s^{|D|/(|D|-2)}$ is the scaling factor of the base.
    • The degree of modification varies across dimensions. This preserves high-frequency information: the reduction in rotation speed is small for high-frequency components and large for low-frequency components. The later the group, the greater the factor by which its rotation angle is reduced.
    • Taking Code Llama as an example, its base is 1,000,000, i.e. $\alpha = 100$ relative to the original base of 10,000. The adjusted and original rotation angles are related by $\theta'_i = \theta_i \cdot 100^{-2i/|D|}$. The rotation angle of group 0 remains unchanged, equivalent to the original RoPE, which can be understood as direct extrapolation in this dimension; the rotation angle of the last group becomes roughly 1/100 of the original, which is close to linear interpolation. The intermediate dimensions sit between extrapolation and interpolation, so the NTK-aware Interpolation scheme combines extrapolation and interpolation in a simple and ingenious way.
  • Compute the new query and key vectors: apply RoPE with the adjusted angles $\theta'_i$ to the queries and keys.
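
The steps above can be sketched as follows. This is a minimal illustration that assumes the base formula $b' = b \cdot s^{|D|/(|D|-2)}$; the function name `ntk_aware_thetas` and the values dim = 128 and s = 8 are hypothetical.

```python
def ntk_aware_thetas(dim: int, scale: float, base: float = 10000.0) -> list[float]:
    """Rotation angles after enlarging the base (NTK-aware interpolation)."""
    new_base = base * scale ** (dim / (dim - 2))
    return [new_base ** (-2.0 * d / dim) for d in range(dim // 2)]

dim, s = 128, 8.0   # illustrative head dimension and scaling factor
original = [10000.0 ** (-2.0 * d / dim) for d in range(dim // 2)]
adjusted = ntk_aware_thetas(dim, s)

# Ratio of new to old angle per pair: 1 means extrapolation, 1/s interpolation.
ratios = [a / o for a, o in zip(adjusted, original)]
print(ratios[0])    # 1.0: the highest-frequency pair is untouched
print(ratios[-1])   # close to 1/s = 0.125: the lowest-frequency pair is interpolated
```

The ratios fall smoothly from 1 to 1/s, which is the blend of extrapolation and interpolation described in the list.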

6.2.2 Analysis

We will analyze NTK-Aware from several perspectives.

Number system

Since the analysis below relies on a number-system analogy, let’s first look at Su Shen’s view of number systems and positional encoding in “Transformer Upgrade Path: 10. RoPE is a β-base Encoding”. His interpretation is as follows.

567 is a three-digit decimal number, with each digit ranging from 0 to 9. This number is represented by a three-dimensional vector [a, b, c], where a, b, and c represent the hundreds, tens, and units digits of “567,” respectively. We train our model using a three-dimensional decimal system. Now, suppose we need to input a four-digit number, “1567,” and the original model was designed and trained for three-dimensional vectors. Adding a new dimension will cause the model to fail. So, how should we handle this? Here are some extrapolation approaches:

  • Direct extrapolation: Reserve several dimensions in advance, set them to 0 during the training phase, and change them to other digits during the inference phase. For example, if training always saw “0567”, inference may see “1567”. Because the reserved highest digit is always 0 during training, this dimension is not trained sufficiently, and direct extrapolation does not yield ideal results.
  • Linear interpolation: Compressing “1567” to within 1000, for example, dividing by 4 to get “391.75”, with the three digits being [3, 9, 1.75]. This also requires fine-tuning to allow the model to readjust to the crowded mapping. However, this results in different distributions and adjacent differences across dimensions. For example, currently, the hundreds and tens digits still retain an adjacent difference of 1, while the units digit becomes a decimal. This makes each dimension unequal, increasing the difficulty for the model to learn further.
  • Number system conversion: Given a three-dimensional vector [a, b, c], if encoded in decimal, its range is 0~999. If converted to hexadecimal, its maximum value becomes $16^3 - 1 = 4095$, which covers 1567. The cost is that the digits in each dimension change from 0-9 to 0-15. Our main concern is preserving order information; the previously trained model has already learned that 875 > 874, and the same holds in hexadecimal, with exactly the same comparison rules (the model has no idea what base you are feeding it). The only concern is whether the model can still compare correctly when a digit exceeds 9 (10-15), but in practice most models have some generalization ability, so slightly extrapolating each dimension is fine. This base conversion approach might therefore be effective even without fine-tuning the original model. Furthermore, to narrow the extrapolation range, we can use a base smaller than 16: any base β with $\beta^3 > 1567$ is enough, and the smallest such base is 12 ($12^3 = 1728$).
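
The digit analogy can be checked with a few lines of code. The helper `to_digits` is a hypothetical illustration written for this example; base 12 is used only as an example of a base smaller than 16 that still fits 1567 into three digits.

```python
def to_digits(n: int, base: int, width: int = 3) -> list[int]:
    """Fixed-width digits of n in the given base, most significant first."""
    digits = []
    for _ in range(width):
        digits.append(n % base)
        n //= base
    if n:
        raise ValueError("number does not fit in the given width")
    return digits[::-1]

print(to_digits(567, 10))    # [5, 6, 7]
print(to_digits(1567, 16))   # [6, 1, 15]
print(to_digits(1567, 12))   # [10, 10, 7]
```

In decimal, 1567 does not fit into three digits at all (the call raises `ValueError`), while in base 16 or base 12 it fits, at the price of digits larger than 9 that the model never saw during training.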


Modify base

In fact, number system conversion is simply modifying the base.

Bowen Peng argues that linearly interpolating RoPE is suboptimal because it prevents the model from distinguishing the positions of two tokens that are very close together. NTK-aware Interpolation instead uses a non-linear scheme that changes the “base” of RoPE rather than its “scale”, which alters the “rotation” speed of each RoPE dimension relative to the next; the later the dimension, the more its rotation is slowed. Because it does not directly scale the Fourier features, all positions remain perfectly distinguishable even in extreme cases.

To unravel this mystery, we need to understand that the structure of RoPE can be viewed as a kind of β-base encoding: the Rotary Position Embedding (RoPE) at position n is essentially a β-base encoding of the number n. From this perspective, direct extrapolation, positional interpolation, and NTK-aware Scaled RoPE can be understood as different ways of extending this base encoding.

Direct extrapolation concentrates the extrapolation pressure on the “higher bits,” while positional interpolation makes the representation of the “lower bits” denser, which is not conducive to distinguishing relative distances. NTK-aware Scaled RoPE, which is essentially a base conversion, distributes the extrapolation pressure evenly across each bit while maintaining the adjacent spacing. These characteristics are very friendly and crucial for LLMs, which obviously rely more on relative positions, so it can achieve certain effects without fine-tuning.


Comparison

In the original RoPE, as i increases, $\theta_i = \text{base}^{-2i/d}$ gradually decreases from 1 toward the limit 1/base; in other words, smaller values of i correspond to high-frequency components, and larger values of i correspond to low-frequency components. The PI method gives up the highest frequencies: if PI is applied to a context of twice the length, every $\theta_i$ is effectively halved, so the highest rotation frequency of the encoding is halved as well.

NTK-aware RoPE still uses the idea of linear interpolation of position, but its impact on RoPE is smoother: for high-frequency terms of position coding, the formula remains almost unchanged; for the lowest frequency terms, the formula is exactly the same as the formula for linear interpolation.

Using a clock as an analogy, Bowen Peng explained why the highest frequency shouldn’t be modified like in linear interpolation: Just as we use a second hand to distinguish the most precise time, neural networks use the highest-frequency sine wave encoding to distinguish relative positions, and can only detect deviations greater than 1 second. With linear interpolation, the smallest time deviation is 0.5 seconds, at which point the neural network cannot handle the highest-frequency information well. NTK-aware RoPE, however, doesn’t modify the definition of a second; it only interpolates slightly more on lower-frequency components like minutes and hours, allowing the neural network to still distinguish the most precise time.

With NTK-aware Interpolation, which modifies the base, $\theta_i$ decreases very little when i is small and becomes much smaller only when i is large (the exact degree depends on the base scaling factor). It therefore does not discard the high-frequency components outright. When processing text longer than the pre-training context, the high-frequency components are still mainly extrapolated, while the low-frequency components are interpolated, similar to the PI method.


Let’s look at it from a clock perspective. RoPE behaves like a clock, with each φ value controlling the rotation speed of one disk, and there are a total of d/2 disks.


We assume the first three dials represent the second, minute, and hour hands. A 12-hour clock is essentially a RoPE with 3 dimensions and a base of 60. The second, minute, and hour hands rotate at different frequencies (from highest to lowest): each second the minute hand advances 1/60 of a minute, and each minute the hour hand advances 1/60 of an hour. As it stands, this RoPE clock can represent at most 60 * 60 * 12 = 43200 seconds, i.e. half a day. If we want the clock to represent a period four times as long, we need to slow it down by a factor of 4. There are two methods for this:

  • PI: Reduce the frequency of the second, minute, and hour hands alike by a factor of four (lengthening each period). Unfortunately, it then becomes difficult to distinguish individual seconds, because the second hand barely moves each second. If someone shows you two times that differ by only one second, you will not be able to tell them apart from a distance.
  • NTK-aware RoPE: Leave the high-frequency second hand unscaled, and instead slow down the minute hand by a factor of 1.5 and the hour hand by a factor of 2. One hour now holds 90 minutes, and one turn of the hour hand holds 24 hours. The clock can now express 60 * (60 * 1.5) * (2 * 12) = 129600 seconds. Since we only care about the overall time, the hour hand does not need to be read precisely, so scaling the hours more than the seconds is the key. We do not want to lose the precision of the second hand, but we can tolerate a loss of precision in the minute and even the hour hand.

Fitted curve

The NTK-aware Interpolation method essentially defines the degree of extrapolation as a function of the group index d, written here as $\eta(d) = \theta'_d / \theta_d$.

  • d = 0 is the highest-frequency component, which we want to extrapolate completely: $\eta(0) = 1$.
  • d = D/2 − 1 is the lowest-frequency component, which we want to interpolate completely: $\eta(D/2 - 1) = L/L'$.

This function can be any monotonically decreasing curve in d that passes through the points (0, 1) and (D/2 − 1, L/L′). Many curve shapes are possible; Bowen Peng used an exponential function, obtaining $\eta(d) = (L/L')^{2d/(D-2)}$.
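
As a short worked derivation, the exponential curve can be tied back to the modified base. The symbol $\eta(d)$ is notation assumed here for the ratio of the adjusted to the original rotation angle of group d, with $s = L'/L$ and base b replaced by b':

```latex
\eta(d) = \frac{\theta'_d}{\theta_d}
        = \frac{b'^{-2d/D}}{b^{-2d/D}}
        = \left(\frac{b'}{b}\right)^{-2d/D},
\qquad
\eta\!\left(\tfrac{D}{2}-1\right) = \frac{L}{L'} = \frac{1}{s}
\;\Longrightarrow\;
\left(\frac{b'}{b}\right)^{-(D-2)/D} = s^{-1}
\;\Longrightarrow\;
b' = b \, s^{D/(D-2)},
\qquad
\eta(d) = s^{-2d/(D-2)} .
```

Changing the base is therefore one particular way to obtain an exponential curve through the two required endpoints; $\eta(0) = 1$ holds automatically.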

6.3 NTK-by-parts Interpolation

In fact, RoPE contains some sufficiently low-frequency components whose wavelengths exceed the training length, $\lambda_d > L$: even the longest training sequence does not take these components through one complete cycle. Clearly, we should not extrapolate these components; otherwise we introduce rotation angles, and corresponding sine and cosine values, that the model never encountered during training, which degrades performance. Furthermore, scaling the position index or modifying the base in every dimension brings all tokens closer together, impairing the LLM’s ability to distinguish the order of nearby tokens.

NTK-by-parts Interpolation never interpolates the higher-frequency dimensions and always interpolates the lower-frequency dimensions. Its core idea is to keep the high-frequency components intact while reducing the rotation angles of the low-frequency components. That is, it leaves the rotation angles of the earlier groups (smaller dimension indices) unchanged and only reduces those of the later groups (larger dimension indices), which is the meaning of “by-parts.” This maintains the model’s sensitivity to local positional relationships when expanding the context length, and avoids interpolating high-frequency information.

  • If the wavelength λ is much smaller than the context size L, we do not perform interpolation.
  • If the wavelength λ is equal to or greater than the context size L, we only perform interpolation and avoid any extrapolation (unlike the “NTK-aware” method).
  • The intermediate dimensions can have a bit of both, similar to “NTK-aware” interpolation.

So, how do we determine which components are sufficiently high-frequency and which are sufficiently low-frequency? We can use the ratio of the sequence length to the wavelength, $r = L/\lambda$. The rotation period of the i-th dimension is $\lambda_i = 2\pi/\theta_i = 2\pi b^{2i/|D|}$, so the number of rotation periods within the training length is $r(i) = L/\lambda_i = \frac{L}{2\pi b^{2i/|D|}}$. Next, we define the ramp function $\gamma(r)$ used to adjust the wavelength, where the hyperparameters $\alpha$ and $\beta$ are thresholds on the number of rotation periods: $\gamma(r) = 0$ if $r < \alpha$, $\gamma(r) = 1$ if $r > \beta$, and $\gamma(r) = (r - \alpha)/(\beta - \alpha)$ otherwise.

  • When $r > \beta$, the period is much smaller than the length L, so no interpolation is performed. In other words, if the number of rotation periods is sufficiently large, the group is considered a high-frequency component and requires no modification.
  • When $r < \alpha$, the period is greater than or equal to the length L, meaning there are few rotation periods, which constitutes a low-frequency group. In this case only interpolation is performed, not extrapolation. In these dimensions the model has not seen a full rotation during pre-training, so extrapolating would produce angles it never encountered; dividing the frequency by s keeps every angle inside the trained range at the new context length L’.
  • When $\alpha \le r \le \beta$, the period lies between the two, and a blend of both is used, as in NTK-aware interpolation.

Based on these conditions, the ramp function $\gamma$ blends the interpolated and the original frequencies, where s is the scaling factor of the context length expansion. The adjusted rotation angle is $h(\theta_d) = (1 - \gamma(r(d)))\,\theta_d/s + \gamma(r(d))\,\theta_d$, which preserves local relative-distance information at the new context length L’. The position function remains unchanged, $g(m) = m$. The empirical hyperparameter values for the LLaMA family are $\alpha = 1$ and $\beta = 32$.

In terms of the general formula described above, NTK-by-parts Interpolation is therefore obtained by using this $g(m)$ and $h(\theta_d)$.
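
A minimal sketch of the by-parts rule, assuming the ramp function and the blend described in the YaRN paper, $h(\theta_d) = (1-\gamma)\,\theta_d/s + \gamma\,\theta_d$ with thresholds α = 1 and β = 32. The function names `ramp` and `by_parts_thetas` and the values dim = 128, s = 8, L = 4096 are illustrative assumptions.

```python
import math

def ramp(r: float, alpha: float = 1.0, beta: float = 32.0) -> float:
    """0 below alpha, 1 above beta, linear in between."""
    if r < alpha:
        return 0.0
    if r > beta:
        return 1.0
    return (r - alpha) / (beta - alpha)

def by_parts_thetas(dim, scale, train_len, base=10000.0, alpha=1.0, beta=32.0):
    """Blend interpolated (theta / s) and original (theta) angles per pair."""
    out = []
    for d in range(dim // 2):
        theta = base ** (-2.0 * d / dim)
        r = train_len * theta / (2 * math.pi)   # rotations within training length
        g = ramp(r, alpha, beta)
        out.append((1 - g) * theta / scale + g * theta)
    return out

dim, s, L = 128, 8.0, 4096    # illustrative values
new_thetas = by_parts_thetas(dim, s, L)
print(new_thetas[0])          # 1.0: high-frequency pair left untouched
```

Unlike the base-change approach, the highest-frequency pairs are left exactly unchanged and the lowest-frequency pairs are divided by exactly s; only the pairs with between α and β rotations are blended.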

6.4 Dynamic NTK Interpolation

Dynamic-NTK Interpolation is a dynamic interpolation method that uses precise position values for tokens within a pre-trained context window to prevent performance degradation. Furthermore, as the inference context grows, the base can be dynamically scaled, allowing RoPE to continuously adapt to the new context length.

The idea behind Dynamic NTK Interpolation is simple: when the inference length is less than or equal to the training length, no interpolation is performed; only when the inference length exceeds the training length does the method update the base at each step through NTK-Aware Interpolation. That is, the base grows with the inference context. When the sequence length just exceeds the pre-training context length, the RoPE base remains essentially unchanged, even with a large scale factor; the scale factor only takes effect when the sequence length significantly exceeds the pre-training context length, and the longer the sequence, the larger the amplification of the base.

The following are the specific steps of the Dynamic NTK method:

  • Dynamic scaling factor: The Dynamic NTK method introduces a scaling factor s that is recomputed from the length of the sequence currently being processed: $s = \max(1, L'/L)$, where L is the model’s training length and L′ is the length of the current sequence.
  • When L’ ≤ L, the model’s original rotation angles are unchanged, and no interpolation is performed;
  • When L’ > L, NTK-Aware Interpolation is used to adjust the rotation angles: $\theta'_i = (b \cdot s^{|D|/(|D|-2)})^{-2i/|D|}$, with $s = L'/L$.

After each token is generated, L′ is incremented by 1. When L′ > L, the rotation angles are recomputed from L′ before the next token is generated.

  • Adjust the RoPE embedding: Use the dynamic scaling factor s to adjust the rotation-angle parameters of the RoPE embedding. This is done through a new base b’ that is tied to the scaling factor s: $b' = b \cdot s^{|D|/(|D|-2)}$, where b is the base in the original RoPE method.
  • Compute the new query and key vectors: apply RoPE with the adjusted angles to obtain the new queries and keys. These vectors are used in the model’s self-attention mechanism to handle the expanded context length.
  • Dynamic expansion during inference: During inference, the model dynamically adjusts the RoPE embedding based on the actual length of the input sequence. This means the model can maintain its original performance when processing short sequences and expand its context window when processing long sequences.
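
The steps above can be sketched in a few lines. This is a simplified illustration that assumes $s = \max(1, L'/L)$ and $b' = b \cdot s^{|D|/(|D|-2)}$; the function name `dynamic_ntk_base` and the lengths used are hypothetical, and the library code quoted in this section uses a slightly different expression that also involves a `scaling_factor` attribute.

```python
def dynamic_ntk_base(seq_len: int, train_len: int, dim: int,
                     base: float = 10000.0) -> float:
    """Base used by RoPE for the current sequence length."""
    s = max(1.0, seq_len / train_len)       # no change inside the trained window
    return base * s ** (dim / (dim - 2))

train_len, dim = 2048, 128                  # illustrative values
print(dynamic_ntk_base(1024, train_len, dim))   # 10000.0
print(dynamic_ntk_base(2049, train_len, dim))   # barely above 10000
print(dynamic_ntk_base(4096, train_len, dim))   # a bit more than 2 * 10000
```

The base is recomputed whenever the sequence grows, so short prompts see exactly the original RoPE and the modification grows gradually with the context.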

The following is the implementation in the transformers library:

class LlamaDynamicNTKScalingRotaryEmbedding(LlamaRotaryEmbedding):
    """LlamaRotaryEmbedding extended with Dynamic NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla"""

    def forward(self, x, position_ids):
        # difference to the original RoPE: inv_freq is recomputed when the sequence length > original length
        seq_len = torch.max(position_ids) + 1
        if seq_len > self.max_position_embeddings:
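            # Apply NTK-RoPE only once the sequence exceeds the original (training) length
            # Key formula of the Dynamic NTK method: changing the base changes the frequency of every dimension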
            base = self.base * (
                (self.scaling_factor * seq_len / self.max_position_embeddings) - (self.scaling_factor - 1)
            ) ** (self.dim / (self.dim - 2))
            inv_freq = 1.0 / (
                base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(x.device) / self.dim)
            )
            self.register_buffer("inv_freq", inv_freq, persistent=False)  # TODO joao: this may break with compilation

        cos, sin = super().forward(x, position_ids)
        return cos, sin

6.5 YaRN

YaRN is an efficient method for extending the context window of large language models that use Rotary Position Embedding (RoPE). It applies dynamic scaling in the intermediate dimensions while leaving the low dimensions uninterpolated and fully interpolating the high dimensions.

Both Position Interpolation and the NTK-like methods essentially achieve length expansion by reducing the rotation angles, i.e. slowing the rotation. The inner product of two vectors is $q \cdot k = \lVert q \rVert \, \lVert k \rVert \cos\varphi$, where $\varphi$ is the angle between them.

Rotation does not change a vector’s magnitude. When the rotation angles applied to q and k are reduced, the difference in rotation angle between two positions also shrinks, so the angle between the rotated vectors becomes smaller, they move closer together, and their inner product increases, ultimately altering the model’s attention distribution. This effectively weakens the long-range decay of RoPE attention scores. From the perspective of the attention score distribution, it underestimates the actual differences between attention scores (weakening the long-range decay shrinks the differences between scores accordingly), disrupting the model’s original attention distribution. Therefore, after interpolation, the model’s perplexity increases even within the original training length, resulting in performance degradation. The weakened long-range decay also leads to a smoother attention distribution across the entire sequence.

YaRN is essentially a combination of NTK-by-parts Interpolation and an attention-distribution correction. It only reduces the rotation angles of the low-frequency part and corrects the attention distribution through a temperature coefficient: the underestimation mentioned above is compensated by the softmax temperature, that is, the original attention score is divided by the temperature t.

Because length extension brings tokens closer together in rotation angle, the long-range decay is weakened and the network attends to more tokens, so the attention softmax becomes flatter. To reverse this, the attention logits are divided by a temperature t before the softmax, $\mathrm{softmax}\left(\frac{q_m^\top k_n}{t\sqrt{|D|}}\right)$, with t < 1. Because RoPE is a rotation, this can be implemented simply by scaling the lengths of q and k by $\sqrt{1/t}$. Taking LLaMA as an example, to minimize perplexity, t and s roughly follow the empirical formula $\sqrt{1/t} = 0.1\ln(s) + 1$. When the length expands from 2048 to 16384, it grows to eight times its original length, so s = 8. Substituting this into the formula gives $\sqrt{1/t} \approx 1.208$, i.e. $t \approx 0.685$. Recall the effect of the temperature on the attention distribution: when t increases, the distribution becomes smoother and its variance smaller; when t decreases, the distribution becomes sharper, more discriminative, and its variance larger. Here t < 1 mitigates the overly smooth attention distribution by allowing a larger variance.
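
The arithmetic can be checked with a short sketch. It assumes the YaRN paper's convention, in which the logits are divided by t and $\sqrt{1/t} = 0.1\ln(s) + 1$. The helper names and the example logits are made up for illustration.

```python
import math

def yarn_temperature(scale: float) -> float:
    """Temperature t from the empirical rule sqrt(1/t) = 0.1 * ln(s) + 1."""
    inv_sqrt_t = 0.1 * math.log(scale) + 1.0
    return 1.0 / inv_sqrt_t ** 2

def softmax(logits, t=1.0):
    """Softmax of logits / t (numerically stabilised)."""
    scaled = [x / t for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

t = yarn_temperature(8.0)            # 2048 -> 16384 tokens, s = 8
print(round(t, 3))                   # 0.685

logits = [2.0, 1.0, 0.5, 0.0]        # made-up attention scores
print(entropy(softmax(logits)), entropy(softmax(logits, t)))
```

With t below 1 the second entropy is lower than the first, i.e. the distribution is sharper, which is the intended correction for attention that has become too flat.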

The following are the specific steps of the YaRN method:

  • Introducing NTK-aware Interpolation: To address the potential loss of high-frequency information when expanding the context window, YaRN builds on NTK-aware Interpolation, which preserves high-frequency information by scaling dimensions of different frequencies to different degrees.
  • Applying NTK-by-parts Interpolation: YaRN decides how to treat each RoPE dimension based on its wavelength. Dimensions whose wavelength is much smaller than the original context length are not interpolated; dimensions whose wavelength is greater than or equal to the original context length are linearly interpolated; the dimensions in between are blended.
  • Applying Dynamic NTK to dynamically adjust the context length: YaRN uses dynamic context length adjustment, which means that during inference, the model can adjust its context window based on the actual length of the input sequence.
  • Temperature scaling: To address changes in the attention distribution when expanding the context window, YaRN introduces a temperature scaling mechanism. By adjusting the temperature of the attention scores, the attention distribution can be sharpened again, keeping the model’s attention focused on the relevant tokens: $\sqrt{1/t} = 0.1\ln(s) + 1$.

6.6 Giraffe

Giraffe achieves extrapolation by preserving high-frequency rotations and suppressing low-frequency rotations.

The frequencies correspond to different features, from low to high. During training, however, the model sees the full range of the high-frequency components but not of all low-frequency components. This imbalance makes extrapolating the low frequencies particularly difficult, so dividing by a single constant is clearly too simplistic. Giraffe instead multiplies each dimension by a coefficient that depends on the dimension. Coefficient and dimension follow a power-function relationship, hence the name power scaling: $\theta^*_i = \theta_i (1 - 2i/d)^k$. Furthermore, Giraffe also truncates the basis, setting the smallest values directly to 0: $\theta^*_i = \theta_i$ if $\theta_i \ge b$, $\theta^*_i = \rho$ if $a < \theta_i < b$, and $\theta^*_i = 0$ if $\theta_i \le a$.

Here k is a parameter to be set, ρ is a relatively small fixed value, and a and b are the chosen cutoff values. With appropriate cutoffs, the model experiences all values of the basis within the context length used during fine-tuning, and therefore extrapolates better at inference time. Under the power transformation, the high-frequency (short-range) elements are affected less than the low-frequency (long-range) elements.
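
A rough sketch of the two transformations, assuming the power-scaling form $\theta^*_i = \theta_i (1 - 2i/d)^k$ and a three-way truncation with cutoffs a < b. The function names and all parameter values below (k, a, b, ρ, dim) are hypothetical and chosen only for illustration.

```python
def power_scaled(thetas: list[float], k: float) -> list[float]:
    """Multiply each frequency by (1 - 2i/d)^k; later pairs shrink the most."""
    d = 2 * len(thetas)
    return [t * (1 - 2 * i / d) ** k for i, t in enumerate(thetas)]

def truncated(thetas: list[float], a: float, b: float, rho: float) -> list[float]:
    """Keep high frequencies, pin the middle band to rho, zero out the rest."""
    out = []
    for t in thetas:
        if t >= b:
            out.append(t)
        elif t > a:
            out.append(rho)
        else:
            out.append(0.0)
    return out

dim = 16                                                   # small toy dimension
thetas = [10000.0 ** (-2.0 * i / dim) for i in range(dim // 2)]
scaled = power_scaled(thetas, k=2.0)                       # assumed exponent
cut = truncated(thetas, a=1e-3, b=1e-1, rho=5e-3)          # assumed cutoffs
print([s / t for s, t in zip(scaled, thetas)])             # shrink factor per pair
```

The shrink factors start at 1 for the first pair and decrease toward the last pair, so the short-range components change least.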

6.7 Training

Su Shen gives a brilliant example that uses disks to explain how positional encodings are trained, in “Transformer Upgrade Path: 16. A Review of Length Extrapolation Techniques”. The article “Avoiding Complex-Number Derivations: How Else Can We Understand RoPE?” also offers an excellent analysis. Let’s study and interpret them.

In essence, $e^{\mathrm{i}(m-n)\theta_i}$ is a point on the unit circle, rotated counterclockwise by $(m-n)\theta_i$ radians. As m − n gradually increases, this point travels around the unit circle. The larger θi is, the faster it rotates, and vice versa. Training the positional encoding can be viewed as training d/2 unit circles with different rotation speeds (if all points on a circle have been trained, that circle is considered sufficiently trained). If longer text is encountered during testing, the point moves beyond the trained arc, and performance becomes unpredictable. In this case, we need a way to compress it back onto the arc that has already been sufficiently trained (positional interpolation).

Let’s continue with the perspective of disk training and visualize the design principles and operation process of NTK-RoPE.

Assuming pre-training has already been done, the trained arc length differs between disks rotating at different speeds. For disks with higher frequencies (i closer to 0), the trained arc is longer. Conversely, for disks with lower frequencies (i closer to d/2−1), the trained arc is shorter.

Building on this, we now want to use longer texts for continue-pretraining or inference. Intuitively, we would certainly want the disk to meet the following requirements:

  • (1) Try not to deviate from the already trained circumference range. For example, for the first disk in the figure, we perform detail filling (blue dots) within the circumference range of the green dots that the pretrain has passed through. This operation is equivalent to making the most of the positional information that has been trained. That is, for the disks in front, we try to let it learn the absolute position information and try to break through the circumference part that the pretrain has seen (extrapolation).
  • (2) Try to learn more positional information than pretrain. While we hope to achieve (1) as much as possible, should we ignore the untrained circular positions on the disk? If we introduce longer text, we certainly hope to learn some new knowledge while training conservatively. Therefore, for the last disk in the figure below, we hope to introduce new blue dots. That is, for the disks further back, we try to let them learn the relative position information, keep them within the circular parts seen in pretrain, and only perform fine angle training (interpolation).


We can explain why NTK-Aware Interpolation works as follows:

  • The groups at the front have seen many complete rotation cycles during training, and their position information has been fully trained, so they have a strong extrapolation ability.
  • The later groups, where complete rotation cycles are not observed during training, or very few rotation cycles are observed, are not sufficiently trained, have weak extrapolation performance, and require position interpolation.

This also leads to a commonly used method for training long texts:

  • First, use “small base + short data” for training (make each disk rotate as much as possible).
  • Then, we use “large base number + long text” to make fine adjustments (to fill the gaps on the disk). The points we have already learned are prior knowledge, which will help us make fine adjustments more effective.

Furthermore, this training method is in line with Kimi’s “long2short” method, which transfers reasoning abilities learned by long-context models to more efficient short-context models. This addresses a key pain point in practical applications: because long-context models are expensive to run, distilling their knowledge into lighter, faster models is commercially valuable. It is also the technical foundation for the R1 model’s successful distillation (knowledge distillation from long chains of thought to short ones) into models of the Qwen and Llama series.

0xFF Reference