Exploring the Transformer Series (17) --- RoPE

0x00 Overview

RoPE (Rotary Position Embedding) comes from Su Jianlin’s RoFormer work, and it is currently one of the most popular positional encoding (PE) methods used in LLMs.

The Transformer paper used sinusoidal positional encoding, which is additive: the positional encoding is added to the word embedding. The encoding vector for each position is fixed, regardless of its relationship to other positions. Sinusoidal positional encoding was intended to introduce relative positional relationships (the positional encoding of any position can be expressed as a linear combination of the positional encoding of a known position, with coefficients that depend only on the distance between them), but it has not been very successful; the model can only perceive relative positions to a limited extent. A common improvement builds on the sinusoidal positional encoding formula and adjusts the bias of the self-attention calculation. RoPE abandons this common approach. Still starting from the sinusoidal positional encoding formula, it uses rotation matrices, complex multiplication, and Euler’s formula so that the relative position of two tokens shows up in the self-attention matrix, while the encoding itself is decomposed onto the feature sequence and directly encodes each token’s absolute position. It thus combines the advantages of both absolute and relative positional encoding.

RoPE doesn’t modify the Attention structure. Instead, like absolute positional encoding, it works at the input side by directly transforming the input vectors. Specifically, it applies a rotation to the Query and Key vectors of the two tokens, so that positional information is built into the transformed Query and Key. The inner product in Attention then perceives relative position automatically, with no change to the operation itself. In other words, RoPE’s starting point and strategy are those of relative positional encoding, but its implementation takes the form of absolute positional encoding.

[Figure 1701]

0x01 Overall Approach

Let’s first look at the pain points of trigonometric (sinusoidal) positional encoding that motivate the change. There are two.

From the analysis in previous chapters, we already know that the attention layer’s calculation $q_t^T k_{t+\Delta t}$ destroys the desirable properties of the positional encoding added at the input layer. Therefore, it is natural to incorporate positional information directly in the attention layer, that is, to apply the positional encoding directly to $q_t^T k_{t+\Delta t}$. This way, the desirable properties of the positional encoding can be maintained.
Trigonometric encoding adds positional information directly to the token embedding. Some argue that this pollutes semantic information with positional data, and that position should be encoded without modifying the norm.

Therefore, let’s first review the attention mechanism.

1.1 Review of Attention Mechanisms

The key to the attention mechanism lies in obtaining the elements $A_{m,n}$ of the self-attention matrix through inner products of vectors. For example, to compute the self-attention output $o_m$ for the embedding vector $x_m$ of the $m$ -th word, an attention score is calculated between $q_m$ and every $k_n$ ; each attention score is then multiplied by the corresponding $v_n$ , and the results are summed to give the output vector $o_m$ . The specific formula is as follows:

$\begin{aligned} q_m = x_mW^Q \\ k_n =x_nW^K \\ v_n = x_nW^V \\ a_{m,n} =\frac{exp(\frac{q_m^Tk_n}{\sqrt d})}{\sum ^N _{j=1}exp(\frac{q_m^Tk_j}{\sqrt d})}\\ o_m = \sum ^N _{n=1} a_{m,n}v_n \end{aligned}$
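The formula above can be traced for a single query with a few lines of plain Python. This is only a minimal sketch: the helper names (`softmax`, `dot`, `attention_output`) and the toy vectors are assumptions made for illustration, and the projections $W^Q$ , $W^K$ , $W^V$ are taken as already applied.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def attention_output(q_m, keys, values):
    """Return (a_m, o_m): the attention weights and output for one query."""
    d = len(q_m)
    scores = [dot(q_m, k_n) / math.sqrt(d) for k_n in keys]
    weights = softmax(scores)
    dim_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return weights, output

# Toy vectors chosen only for illustration (already projected q, k, v).
q_m = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights, o_m = attention_output(q_m, keys, values)
```

The key most aligned with `q_m` receives the largest weight, which is exactly the lever positional encoding will act on in the sections that follow.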

1.2 Analysis of the Approach

As can be seen from the formula above, the influence of one token on another is determined by the attention score $QK^T$ , which is a dot product; in other words, it is essentially the inner product of two feature vectors. This is where positional encoding should focus. Therefore, let’s look at the geometric form of the dot product:

$\vec{a} \cdot \vec{b} = |\vec{a}| |\vec{b}| \cos\theta$

Two insights can be gleaned from this:

The dot product of two vectors can be adjusted by increasing or decreasing the angle between them.
Rotation has no effect on the norm of a vector, which may encode the semantic information of a token.

Therefore, we only need to apply absolute positional encoding to the Query and Key vectors before they enter the attention mechanism; the Value is not involved. This introduces the positional information directly into $q_m^T k_n$ . In other words, we want to adjust this inner product according to $|n-m|$ :

When $|n-m|$ is small, we want to pull $q_m$ and $k_n$ closer together.
When $|n-m|$ is large, we want to push $q_m$ and $k_n$ further apart.

Let’s look at how the paper demonstrates how the solution was found. RoPE’s starting point is “to achieve relative position encoding through absolute position encoding,” meaning that absolute positions are used during encoding, but the dot product reflects the relative position. Mathematically, this involves finding a suitable position encoding function $f$ such that the following formula holds true.

$f(q, m)^T f(k, n) = g(q, k, m-n)$

In layman’s terms, it means processing $q$ at position $m$ and $k$ at position $n$ so that the attention score computed from $qk^T$ implicitly includes the relative position of $m$ and $n$ . We will explain this further using the formula from the paper. RoPE aims to express the inner product of $f_q$ and $f_k$ as a function $g$ whose arguments are the two tokens $x_m$ and $x_n$ and their relative position $m-n$ . Here $\langle \cdot , \cdot \rangle$ denotes the inner product of $f_q$ and $f_k$ .

$\langle f_q(x_m, m), f_k(x_n, n) \rangle = g(x_m, x_n, m-n)$

Because of the form of $g$ , the inner product of $f_q$ and $f_k$ carries the relative position of $m$ and $n$ . The intent is that the inner product can be larger when the two words are relatively close (small $|m-n|$ ) and smaller when they are relatively far apart (large $|m-n|$ ). In this way, without modifying the attention structure, explicit relative position information is integrated into the self-attention calculation, enabling the attention inner product to automatically perceive relative position information, thus achieving the goal of relative position encoding in the form of absolute position encoding.

Note that only $f_q(x_m,m)$ and $f_k(x_n,n)$ are the functions to be solved for. For $g$ , we only require that its expression depends on $x_m$ , $x_n$ , and $m-n$ . Put differently, the inner product of $q_m$ and $k_n$ is affected by the relative position $m-n$ .

1.3 Results Display

Let’s examine whether RoPE satisfies the requirement of “achieving relative position encoding through absolute position encoding”.

Inject absolute position information. For $q_t$ at position $t$ and $k_s$ at position $s$ , RoPE first groups the feature dimensions of $q_t$ and $k_s$ into pairs, with each pair forming a complex number, i.e. a vector in the complex plane. Each of these vectors is then multiplied by a rotation matrix determined by its position, which injects absolute position information by rotating the vector through a certain angle. That is, the vector $q_m$ at position $m$ is multiplied by the matrix $R_m$ , and the vector $k_n$ at position $n$ is multiplied by the matrix $R_n$ ; each yields a new vector that carries its position.

$\begin{gathered} f(q,m) = R_mq = \begin{pmatrix} cos m\theta & -sin m\theta \\ sin m\theta & cos m\theta \end{pmatrix} \begin{pmatrix} q_0 \\ q_1 \end{pmatrix} \\ f(k,n) = R_nk = \begin{pmatrix}cos n\theta & -sin n\theta \\sin n\theta & cos n\theta\end{pmatrix} \begin{pmatrix} k_0 \\ k_1 \end{pmatrix} \end{gathered}$

Obtain the relative position information. Use the transformed Q and K sequences for attention calculation; by expanding the formula, the relative position information can be obtained during the attention calculation.

$(R_mq_m)^T(R_nk_n) = q_m^TR_m^TR_nk_n = q_m^TR_{n-m}k_n$

That is, the attention score of vector $q$ at position $m$ and vector $k$ at position $n$ is still computed as a dot product of the rotated vectors, and the effect of the rotations on that score depends only on the relative position $n-m$ , not on $m$ or $n$ individually.
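The identity $R_m^TR_n = R_{n-m}$ can be checked numerically. The sketch below uses only the standard library; the helper names and the values of `theta`, `q`, and `k` are arbitrary assumptions for illustration, and column vectors are used as in the formulas above.

```python
import math

def rot(angle):
    """2x2 counterclockwise rotation matrix acting on column vectors."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]

def matmul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][t] * b[t][j] for t in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]

def matvec(a, v):
    """Matrix times column vector."""
    return [a[0][0] * v[0] + a[0][1] * v[1], a[1][0] * v[0] + a[1][1] * v[1]]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

theta = 0.3          # arbitrary base angle
m, n = 5, 2          # arbitrary absolute positions
lhs = matmul(transpose(rot(m * theta)), rot(n * theta))   # R_m^T R_n
rhs = rot((n - m) * theta)                                # R_{n-m}

q, k = [0.7, -1.2], [0.4, 2.0]
score = dot(matvec(rot(m * theta), q), matvec(rot(n * theta), k))
# Shift both positions by the same offset: the score should not change.
shifted = dot(matvec(rot((m + 10) * theta), q), matvec(rot((n + 10) * theta), k))
```

Here `lhs` and `rhs` agree entry by entry, and `score` equals `shifted` up to rounding, since only $n-m$ enters the result.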

A simplified proof is as follows, written with row vectors. Assume $R(n\theta)$ is a rotation matrix.

$\begin{gathered} q_m = x_mW_qR(m\theta) \\ k_n = x_nW_kR(n\theta) \\ q_mk_n^T = x_mW_qR(m\theta)R(n\theta)^TW_k^Tx_n^T \\ = x_mW_qR(m\theta)R(-n\theta)W_k^Tx_n^T \\ =x_mW_qR((m-n)\theta)W_k^Tx_n^T \\ =g(x_m,x_n,m-n) \end{gathered}$

That is,

$(qR_m)(kR_n)^T = qR_mR_n^Tk^T = qR_{m-n}k^T$

$R_m$ is an orthogonal matrix, so it does not change the magnitude of a vector; as a result, it usually does not affect the stability of the original model.

1.4 Problems

We currently have several issues worth considering.

The paper mentions a function $f()$ . How is $f()$ implemented?
Why can this conversion embed the token’s location information?
Why does this transformation have extrapolation properties? Why is it said to be similar in concept to trigonometric positional encoding?

0x02 Principle Derivation

The next step is to find a transformation function $f$ such that the identity involving $g$ holds. We will continue our analysis based on the ideas presented in the RoFormer paper.

2.1 The f() function

First, the process of “adding positional information to the input word embedding and then converting it to q, k, V” is defined as a function $f()$ , resulting in the following formula:

$\begin{aligned} q_m = f_q(x_m, m) \\ k_n = f_k(x_n,n) \\ v_n = f_v(x_n,n) \\ a_{m,n} =\frac{exp(\frac{q_m^Tk_n}{\sqrt d})}{\sum ^N _{j=1}exp(\frac{q_m^Tk_j}{\sqrt d})}\\ o_m = \sum ^N _{n=1} a_{m,n}v_n \end{aligned}$

Secondly, we will conduct an in-depth analysis of the notation in the formula.

$x_m$ , $x_n$ : The inputs at positions $m$ and $n$ , taken as two-dimensional row vectors. They are the original word vectors without positional encoding (strictly speaking, token embeddings rather than word embeddings).
$q_m$ : The query vector generated from the word vector $x_m$ of the $m$ -th token after integrating the position information $m$ .
$k_n$ : The key vector generated from the word vector $x_n$ of the $n$ -th token after integrating the position information $n$ .
$v_n$ : The value vector generated from the word vector $x_n$ of the $n$ -th token after integrating the position information $n$ .
$f()$ : The function that adds positional information to the vector x and transforms it into q, k, and v. Transformer-based positional encoding methods primarily focus on constructing suitable $f_q$ , $f_k$ , $f_v$ .

As can be seen, the key to the RoPE algorithm lies in how to construct this transformation function $f()$ , which introduces absolute positional information into the word vectors while making $q_m k_n^T$ also contain relative position information. Let’s look at how $f()$ comes about.

2.2 Objectives

In this section, we will use a reverse reasoning approach for analysis.

First, let’s look at the goal that $f()$ is expected to achieve. Consider $q_m k_n^T = f_q(x_m, m) (f_k(x_n, n))^T$ . Although the inputs to this calculation are the vectors $x_m$ and $x_n$ and the absolute positions $m$ and $n$ , we want its result to depend only on the vectors $x_m$ and $x_n$ themselves and on the relative distance $(m-n)$ between them, not on the absolute positions $m$ and $n$ .

Secondly, for easier derivation, we introduce a function $g()$ . We hope that $q_m k_n^T = f_q(x_m, m) (f_k(x_n, n))^T = g(x_m, x_n, m-n)$ . The final formula for $g$ only involves the relative distance and omits the absolute positions $m$ and $n$ . That is, we assume that the inner product between the query vector $q_m$ and the key vector $k_n$ can be represented by a function $g$ whose inputs are the word embedding vectors $x_m$ , $x_n$ and their relative position $m - n$ :

$\langle f_q(x_m, m), f_k(x_n, n) \rangle = g(x_m, x_n, m-n)$

g can be understood as a kernel function that transforms the direct operation of f() (the dot product of semantic information and absolute position information) into an interpretation using g (semantic information plus relative position information). As we will see later, g interprets the dot product using polar coordinates (converting relative distances into angles). See the diagram below for details.

[Figure 1702]

The g function is introduced merely for ease of derivation; the essential goal is still to find an f function, specifically an f() with desirable properties that incorporates an explicit relative position dependency into the self-attention formula. In other words, we aim to find an encoding f() for the q and k vectors such that the dot product of the encoded vectors $q_m$ and $k_n$ can be expressed in terms of $x_m$ , $x_n$ , and $m - n$ (the dot product can be represented by word vectors plus relative position information).

2.3 Derivation

Now that we know the target, let’s derive $f()$ step by step.

Ignoring for now the absolute position parameter among the inputs of $f()$ , let’s assume that $f()$ simply returns the original token embedding. Here, the multiplication by the weight matrices $W^Q$ , $W^K$ , $W^V$ is also considered part of $f()$ .

$\begin{aligned} q_m = f_q(x_m, m) = x_m \\ k_n = f_k(x_n, n) = x_n \\ v_n = f_v(x_n, n) = x_n \end{aligned}$

Let’s see how to gradually add functionality to the initial version of the $f()$ function above.

Adjust the perspective

We need to adjust our perspective.

From two-dimensional vectors to complex numbers

For simplicity, let’s assume $x_m$ , $x_n$ are two-dimensional row vectors; that is, we first assume the input vector is two-dimensional. For example, let $x_m$ be [a, b]. Since it is two-dimensional, and a complex number is equivalent to a two-dimensional vector in the complex plane, we can treat it as a complex number and convert $x_m$ to $a + bi$ . Why introduce complex numbers? Because while planar rotations look intuitive as matrices, they are expressed more elegantly with complex numbers.

From complex numbers to polar coordinates

Euler’s formula bridges the gap between exponential functions, trigonometric functions, and complex numbers. Some trigonometric functions are easily solved and understood using exponential form. For example, if $x$ represents any real number, $e$ is the base of the natural logarithm, and $i$ is the imaginary unit in complex numbers, then according to Euler’s formula:

$e^{ix} = \cos x + i \sin x$

The meaning of this expression is: a complex number with a real part of $\cos x$ and an imaginary part of $\sin x$ can be represented in an exponential form.

According to Euler’s formula, we can represent the complex number of a two-dimensional vector using polar coordinates.

$a + bi = r\cos(\theta) + r\sin(\theta) \cdot i = r(\cos(\theta) + i \sin(\theta)) = r \cdot e^{i\theta}$

Here:

$\cos(\theta) + i \sin(\theta)$ describes a point on the unit circle using coordinates in the complex plane. As $\theta$ varies from 0 to $2\pi$ , the complex number $e^{i\theta}$ traces one complete revolution of the unit circle.
$e^{i\theta}$ describes a point on the unit circle in terms of circular motion. Using the exponential form, we can view complex numbers of modulus 1 as unit vectors rotating around the origin in the complex plane.
In $r \cdot e^{i\theta}$ , $r$ carries the semantics and $\theta$ carries the position.

All three representations convey the same message: rotating a two-dimensional vector counterclockwise by an angle $\theta$ . That is, a two-dimensional vector $(x_{even}, x_{odd})$ can be treated as a complex number $x_{even} + i \cdot x_{odd}$ ; multiplying it by $e^{i\theta}$ performs the rotation. From the complex perspective, such a rotation simply adds an angle to the phase.

Therefore, $x_m$ and $x_n$ can be represented in polar coordinates, that is, by an angle plus a length, which lets us separate positional information from semantic information.
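A short check of the polar view, using Python’s built-in complex numbers and the standard `cmath` module. The vector `3 + 4i` and the quarter-turn angle are arbitrary choices for illustration.

```python
import cmath
import math

# A 2D vector (3, 4) viewed as the complex number 3 + 4i.
x = complex(3.0, 4.0)
r, phase = cmath.polar(x)            # length and angle of x

# Multiplying by e^{i*theta} rotates counterclockwise by theta.
theta = math.pi / 2
rotated = x * cmath.exp(1j * theta)
r_rot, phase_rot = cmath.polar(rotated)
```

The length stays at 5 while the phase grows by exactly `theta`, which is the separation of "semantics in the length, position in the angle" described above.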

[Figure 1703]

Next steps

We will now follow two routes to see how to think about the problem from the exponential perspective.

How can absolute position information be added to the $f()$ function? This is what the polar coordinate transformation provides.
How can the outputs of $f()$ interact so that absolute position information turns into relative position information? This is what De Moivre’s formula provides.

Then merge the two routes together.

Introducing absolute position information

According to Euler’s formula, multiplying a complex number by $e^{i\theta}$ is equivalent to rotating its corresponding two-dimensional vector counterclockwise by the angle $\theta$ , that is, to multiplying by a rotation matrix. Why does RoPE require a rotation? Can’t other mappings be used, for example a general linear transformation that maps embedding vectors at different positions to new vector spaces? The main reason is that rotation is a linear transformation that does not destroy the geometric properties of the original vectors: lengths are unchanged, and so is the angle between two vectors rotated by the same amount. This is particularly useful when dot products are used to measure similarity in attention calculations.

Rotation matrix

Let’s briefly review the concept of a rotation matrix. In two-dimensional space there is a rotation matrix $R(\theta)$ : multiplying a two-dimensional row vector by it (on the right) rotates the vector counterclockwise by $\theta$ radians. A rotation matrix changes the direction of the vector it multiplies but does not change its magnitude or chirality.

$R(\theta) = \begin{pmatrix} cos\theta & sin\theta \\ -sin\theta & cos\theta \end{pmatrix}$

The geometric meaning is that $XR(\theta)$ rotates $X$ counterclockwise by $\theta$ . The proof is as follows.

$\begin{gathered} X = \rho(cos\phi,sin\phi)\\ XR(\theta) = \rho(cos\phi,sin\phi) \begin{pmatrix} cos\theta & sin\theta \\ -sin\theta & cos\theta \end{pmatrix} \\ = \rho(cos\phi cos\theta - sin\phi sin\theta,cos\phi sin\theta + sin\phi cos\theta)\\ =\rho(cos(\phi + \theta),sin(\phi + \theta)) \end{gathered}$
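The row-vector proof above can be replayed numerically. In this sketch `rotate_row` is a hypothetical helper that computes $XR(\theta)$ with the matrix exactly as written, and `rho`, `phi`, `theta` are arbitrary values.

```python
import math

def rotate_row(x, theta):
    """Compute X R(theta) for a row vector X, with
    R(theta) = [[cos, sin], [-sin, cos]] as in the text."""
    c, s = math.cos(theta), math.sin(theta)
    return [x[0] * c - x[1] * s, x[0] * s + x[1] * c]

rho, phi, theta = 2.0, 0.4, 1.1            # arbitrary length and angles
x = [rho * math.cos(phi), rho * math.sin(phi)]
y = rotate_row(x, theta)
expected = [rho * math.cos(phi + theta), rho * math.sin(phi + theta)]
```

`y` matches `expected`, so the angle moves from `phi` to `phi + theta` while the length `rho` is unchanged.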

Absolute position encoding

The properties of the rotation matrix fit our requirement for encoding absolute position information. We rotate the token embedding around the origin by an angle tied to its absolute position (for example, $m\theta$ ). This introduces angle information into the embedding vector and thereby incorporates the absolute position into the $f()$ function. See the diagram below for details: $R_m$ is a rotation matrix, and the function $f()$ rotates the vector counterclockwise by $m\theta$ while preserving its magnitude. In other words, simply rotating a vector by a certain angle adds the corresponding absolute position information to it.

[Figure 1705]

Let’s explain further:

𝜃 is a non-zero constant.
$q_m^{(1)}$ : the first component of the vector $q$ , where $m$ is the position.
Multiplying $q_m$ by this rotation matrix is, geometrically, equivalent to rotating $q_m$ counterclockwise by $m\theta$ , i.e. by an angle proportional to its position index $m$ . This operation only changes the direction of $q_m$ and does not change its modulus.

For example:

dog: the word “dog” is at position 0 and is not rotated.
The dog: the word “dog” is at position 1, rotated by an angle of $\theta$ .
The pig chased the dog: the word “dog” is at position 4, rotated by an angle of $4\theta$ .
Once upon a time, the pig chased the dog: the word “dog” is at position 9 (counting the comma as a token), rotated by an angle of $9\theta$ .
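The examples above can be mimicked with a toy two-dimensional embedding. The embedding values, the value of `theta`, and the whitespace tokenization are all hypothetical choices made only for illustration.

```python
import math

def rotate(vec, angle):
    """Rotate a 2D vector counterclockwise by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [vec[0] * c - vec[1] * s, vec[0] * s + vec[1] * c]

theta = 0.5                      # hypothetical base angle
dog = [1.0, 2.0]                 # hypothetical 2D embedding of "dog"

tokens = ["The", "pig", "chased", "the", "dog"]
position = tokens.index("dog")   # zero-based position of "dog"
encoded = rotate(dog, position * theta)

# Direction changes by position * theta; the length does not change.
angle_shift = math.atan2(encoded[1], encoded[0]) - math.atan2(dog[1], dog[0])
```

The same word placed at a different index would get a different direction but the same length, so the position is stored purely in the angle.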

[Figure 1706]

Find relative position information

At this point, $f()$ does the following: from the complex and exponential perspective, it multiplies the vector by a rotation matrix tied to its absolute position, injecting absolute position information into the vector and producing new $q_m$ and $k_n$ . Let’s examine whether this $f()$ is useful, specifically whether relative position information can be derived from the absolute position information.

Find the interaction

Let’s first look at how two vectors interact from our current perspective (complex numbers in exponential form). The foundation is De Moivre’s theorem, or more precisely the product rule for complex numbers in polar form: multiplying two complex numbers multiplies their moduli and adds their angles. Let $\alpha$ and $\beta$ be the angles (in radians) of $x_m$ and $x_n$ . The interaction is as follows.

$(a+bi)(c+di) = r_1(\cos(\alpha) + i\ \sin(\alpha)) \times r_2(\cos(\beta) + i\ \sin(\beta)) = r_1 \times r_2 \times (\cos(\alpha + \beta) + i\ \sin(\alpha + \beta))$

Find the inner product

However, our goal is $q_m k_n^T$ This is an inner product, not a multiplication. Further investigation reveals that, according to the properties of complex multiplication, the conjugate of a complex number A (a+bi) multiplied by another complex number B (c+di) results in a product where the real part is equal to the inner product of A and B, and the imaginary part is equal to the outer product of A and B. In other words, the operation of multiplying the conjugate of the first complex number by the second perfectly satisfies the requirements of inner and outer product operations: the inner product takes the real part, and the outer product takes the imaginary part.

$(a-bi)(c+di) = (ac+bd)+(ad-bc)i$

$ac + bd$ is the inner product.

$ad - bc$ is the outer product.

Note:

The coordinate representation of a complex number z is z = a + bi, where a is the real part and b is the imaginary part of the complex number. The conjugate of z is a - bi; that is, the real part remains unchanged and the imaginary part changes sign.
Multiplying two complex numbers is as simple as expanding and multiplying them together. If z1 = a + bi and z2 = c + di, then z1 × z2 = (ac - bd) + (bc + ad)i.
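Both facts are easy to confirm with Python’s built-in complex type. The numbers 1, 2, 3, 4 are arbitrary.

```python
a, b, c, d = 1.0, 2.0, 3.0, 4.0
A = complex(a, b)    # the vector (a, b) as a complex number
B = complex(c, d)    # the vector (c, d) as a complex number

conj_product = A.conjugate() * B   # (a - bi)(c + di)
inner = a * c + b * d              # inner product of (a, b) and (c, d)
outer = a * d - b * c              # 2D outer (cross) product

plain_product = A * B              # (ac - bd) + (bc + ad)i
```

With these values `conj_product` is `11 - 2i`: the real part is the inner product and the imaginary part is the outer product, while `plain_product` is `-5 + 10i`.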

Incorporate location information into the inner product

Next, let’s see how to incorporate absolute position information into the inner product, transforming it into relative position information. In the formula below, $\langle \cdot , \cdot \rangle$ denotes the inner product, $^*$ denotes the complex conjugate, $Re[\cdot]$ denotes the real part, and the multiplication on the right-hand side is ordinary complex multiplication. The formula means that if we treat two-dimensional vectors as complex numbers, the inner product of two two-dimensional vectors is equal to the real part of the product of one complex number and the conjugate of the other.

$\langle f_q(x_m, m), f_k(x_n, n) \rangle = Re\left[(W_q x_m)(W_k x_n)^* e^{i(m-n)\theta}\right]$

[Figure 1707]

Summary

The derivation steps are summarized in the following diagram:

First, convert $x_m$ , $x_n$ to their complex forms; they can also be expressed in polar coordinates.
Apply the rotation to obtain new complex numbers. Specifically, multiply $x_m$ , $x_n$ by $e^{im\theta}$ , $e^{in\theta}$ respectively to get $x_me^{im\theta}$ , $x_ne^{in\theta}$ . This equips $x_m$ , $x_n$ with absolute position encoding (an explicit dependence on the absolute positions $m$ , $n$ ) and yields $q_m$ , $k_n$ . That is, $q_m$ , $k_n$ are the result vectors of the complex multiplication, i.e. the vectors after rotation by the matrix.
Compute the inner product between query and key with complex arithmetic to obtain the self-attention result. Specifically, because $q_m$ , $k_n$ are already complex numbers, we multiply $q_m$ by the conjugate of $k_n$ and take the real part, which yields the element $A_{m,n}$ of the RoPE-encoded self-attention matrix.

$\langle (W_qx_m)e^{im\theta}, (W_kx_n)e^{in\theta} \rangle = Re[((W_qx_m)e^{im\theta})((W_kx_n)e^{in\theta})^*] = Re[(W_qx_m)(W_kx_n)^*e^{i(m-n)\theta}]$

We will find that the inner product depends only on the relative position $m-n$ , which cleverly utilizes the property of addition of the arguments of complex numbers to combine the absolute position and the relative position.
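The dependence on $m-n$ alone can be seen directly with complex arithmetic. In this sketch `f` and `inner` are hypothetical helper names, `q` and `k` stand in for $W_qx_m$ and $W_kx_n$ , and all numeric values are arbitrary.

```python
import cmath

def f(vec, pos, theta):
    """Encode a 2D vector stored as a complex number: vec * e^{i*pos*theta}."""
    return vec * cmath.exp(1j * pos * theta)

def inner(u, v):
    """Inner product of two 2D vectors stored as complex numbers: Re[u v*]."""
    return (u * v.conjugate()).real

theta = 0.1
q = complex(0.5, -1.0)   # stands in for W_q x_m
k = complex(2.0, 0.3)    # stands in for W_k x_n

s1 = inner(f(q, 7, theta), f(k, 3, theta))       # positions 7 and 3
s2 = inner(f(q, 104, theta), f(k, 100, theta))   # positions 104 and 100
g = (q * k.conjugate() * cmath.exp(1j * (7 - 3) * theta)).real
```

`s1`, `s2`, and `g` agree up to rounding: both position pairs have relative offset 4, and the absolute positions drop out.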

[Figure 1708]

After $x_m$ and $x_n$ are rotated by the angles $m\theta$ and $n\theta$ respectively, each of the two vectors carries only absolute position information. Taking their dot product lets the absolute position information interact, so the dot product automatically incorporates the relative position information.

2.4 Formal Definition

Now that the derivation is complete, let’s take a formal look at the interpretation of the $f()$ and $g()$ functions, that is, to sort out the above derivation in more detail.

f() introduces absolute information

f() is defined as follows. It can be understood as taking two inputs, the absolute position $m$ and the word information $x_m$ , and placing the two parts separately in polar coordinates, with the word information carried by the length and the position carried by the angle.

$f_q(x_m,m) = (W_qx_m)e^{im\theta}$

Let’s deduce it in detail.

First, $W_q$ is a $2 \times 2$ matrix and $x_m$ is a two-dimensional vector, so their product $W_qx_m$ is also a two-dimensional vector, denoted $q_m$ .

$W_qx_m = \begin{pmatrix} W_q^{(11)} & W_q^{(12)} \\ W_q^{(21)} & W_q^{(22)} \end{pmatrix} \begin{pmatrix} x_m^{(1)} \\ x_m^{(2)} \end{pmatrix} = \begin{pmatrix} q_m^{(1)} \\ q_m^{(2)} \end{pmatrix} = q_m$

Then, interpreting $q_m$ as a complex number makes the subsequent processing easier, namely performing the rotation through complex multiplication.

$q_m = [q_m^{(1)},q_m^{(2)}] = q_m^{(1)} + i \ q_m^{(2)}$

$e^{im\theta}$ is likewise written as a complex number. It is the point on the unit circle with argument $m\theta$ , that is, the endpoint of the unit vector at angle $m\theta$ : $e^{im\theta} = \cos(m\theta) + i\ \sin(m\theta)$ .

Therefore,

$f_q(x_m,m) = (W_qx_m)e^{im\theta} = q_me^{im\theta}$

It means multiplying two complex numbers.

$f_q(x_m,m) = (W_qx_m)e^{im\theta} = q_me^{im\theta} = (q_m^{(1)} + i \ q_m^{(2)} ) \times (cos(m\theta) + i\ sin(m\theta)) = (q_m^{(1)} cos(m\theta) - q_m^{(2)} sin(m\theta)) + i \ (q_m^{(2)}cos(m\theta) + q_m^{(1)}sin(m\theta))$

Next, we re-express $f_q(x_m,m)$ as a real vector.

This is simply the query vector multiplied by a rotation matrix $R_m$ . Position information is included, but the absolute position and the word information are separated into the two parts of the polar representation.

$\begin{aligned} f_q(x_m,m) &= (W_qx_m)e^{im\theta} \\ &= q_me^{im\theta} = (q_m^{(1)} + i \ q_m^{(2)})(\cos(m\theta) + i\ \sin(m\theta)) \\ &= [q_m^{(1)} \cos(m\theta) - q_m^{(2)} \sin(m\theta), \ q_m^{(2)}\cos(m\theta) + q_m^{(1)}\sin(m\theta)] \\ &= \begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix} \begin{pmatrix} q_m^{(1)} \\ q_m^{(2)} \end{pmatrix} = R_mq_m \end{aligned}$

See the image below for details.

[Figure 1709]

The above derivation shows that the function f() adds absolute position information to the word embedding. Let’s see how the dot product of f() introduces relative position information, that is, let’s use g() to prove that our constructed f() is correct.

The g() function verifies relative information

What we want to verify is this: once the query vector $q_m$ and key vector $k_n$ are constructed with f(), their inner product can be represented by a function $g$ whose inputs are the word embedding vectors $x_m$ , $x_n$ and their relative position $m - n$ . This proves the validity of f(): relative position information is expressed through absolute position information. Position information is a high-dimensional vector, represented in polar coordinates. In polar coordinates the relative position $m - n$ is an angle (i.e. the angle of rotation between $m$ and $n$ ), so position information is transformed into angular information.

The mathematical formula is as follows.

Given

$\begin{aligned} q_m &= f_q(x_m,m) = (W_qx_m)e^{im\theta} \\ k_n &= f_k(x_n,n) = (W_kx_n)e^{in\theta} \\ g(x_m,x_n,m-n) &= Re[(W_qx_m)(W_kx_n)^* e^{i(m-n)\theta}] \end{aligned}$

To prove

$\langle f_q(x_m,m),f_k(x_n,n)\rangle = g(x_m,x_n,m-n) = Re[(W_qx_m)(W_kx_n)^* e^{i(m-n)\theta}]$

Re[x] denotes the real part of a complex number x, and $(W_kx_n)^*$ denotes the complex conjugate of $W_kx_n$ .

Next, we prove that the left and right sides of $\langle f_q(x_m,m),f_k(x_n,n)\rangle = Re[(W_qx_m)(W_kx_n)^* e^{i(m-n)\theta}]$ are equal.

The right-hand side

First, deduce the information inside Re[].

$\begin{aligned} W_qx_m &= q_m = q_m^{(1)} + i \ q_m^{(2)} \\ W_kx_n &= k_n = k_n^{(1)} + i \ k_n^{(2)} \\ (W_kx_n)^* &= k_n^* = k_n^{(1)} - i \ k_n^{(2)} \\ e^{i(m-n)\theta} &= \cos((m-n)\theta) + i \ \sin((m-n)\theta) \end{aligned}$

Continue the derivation

$\begin{aligned} g(x_m,x_n,m-n) &= Re[(W_qx_m)(W_kx_n)^* e^{i(m-n)\theta}] \\ &= Re[(q_m^{(1)} + i \ q_m^{(2)})(k_n^{(1)} - i \ k_n^{(2)})(\cos((m-n)\theta) + i \ \sin((m-n)\theta))] \\ &= (q_m^{(1)}k_n^{(1)} + q_m^{(2)}k_n^{(2)})\cos((m-n)\theta) - (q_m^{(2)}k_n^{(1)} - q_m^{(1)}k_n^{(2)})\sin((m-n)\theta) \end{aligned}$

The following diagram illustrates this.

[Figure 1710]

The left-hand side

The equation on the left expands as follows.

$\begin{aligned} f_q(x_m,m) &= (W_qx_m)e^{im\theta} = [q_m^{(1)} \cos(m\theta) - q_m^{(2)} \sin(m\theta), \ q_m^{(2)}\cos(m\theta) + q_m^{(1)}\sin(m\theta)] \\ f_k(x_n,n) &= (W_kx_n)e^{in\theta} = [k_n^{(1)} \cos(n\theta) - k_n^{(2)} \sin(n\theta), \ k_n^{(2)}\cos(n\theta) + k_n^{(1)}\sin(n\theta)] \\ \langle f_q(x_m,m),f_k(x_n,n) \rangle &= (q_m^{(1)} \cos(m\theta) - q_m^{(2)} \sin(m\theta))(k_n^{(1)} \cos(n\theta) - k_n^{(2)} \sin(n\theta)) + (q_m^{(2)}\cos(m\theta) + q_m^{(1)}\sin(m\theta))(k_n^{(2)}\cos(n\theta) + k_n^{(1)}\sin(n\theta)) \\ &= (q_m^{(1)}k_n^{(1)} + q_m^{(2)}k_n^{(2)})\cos((m-n)\theta) - (q_m^{(2)}k_n^{(1)} - q_m^{(1)}k_n^{(2)})\sin((m-n)\theta) \\ &= \begin{pmatrix} q_m^{(1)} & q_m^{(2)} \end{pmatrix} \begin{pmatrix} \cos((m-n)\theta) & \sin((m-n)\theta) \\ -\sin((m-n)\theta) & \cos((m-n)\theta) \end{pmatrix} \begin{pmatrix} k_n^{(1)} \\ k_n^{(2)} \end{pmatrix} \end{aligned}$

As you can see, the left and right sides of the equation are equal. See the diagram below for a more detailed explanation.

[Figure 1711]

Therefore, RoPE has achieved its intended purpose.

Add absolute position information. Adding absolute position encoding is done by using a rotation matrix, that is, rotating the embedding of each position to a new position using a position-based rotation matrix.
Obtain relative position information. After the encodings of two tokens pass through the inner product (self-attention), the result is influenced by the difference in their positions, i.e. their relative position. This incorporates an explicit relative position dependency into the self-attention formula: the inner product of the encoded $q_m$ and $k_n$ is determined only by the two vectors before rotation and by their relative position $m-n$ .
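The two-dimensional equality can also be checked with real arithmetic only. In this sketch, `rope_2d` is a hypothetical helper implementing $R_mq_m$ , and `g` implements the expanded form $(q^{(1)}k^{(1)} + q^{(2)}k^{(2)})\cos((m-n)\theta) - (q^{(2)}k^{(1)} - q^{(1)}k^{(2)})\sin((m-n)\theta)$ ; the numeric values are arbitrary.

```python
import math

def rope_2d(v, pos, theta):
    """Rotate the 2D vector v counterclockwise by pos * theta (R_pos v)."""
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return [v[0] * c - v[1] * s, v[1] * c + v[0] * s]

def g(q, k, rel, theta):
    """Expanded inner product as a function of the relative position rel = m - n."""
    c, s = math.cos(rel * theta), math.sin(rel * theta)
    return (q[0] * k[0] + q[1] * k[1]) * c - (q[1] * k[0] - q[0] * k[1]) * s

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

theta = 0.25
q, k = [0.9, -0.4], [1.5, 0.8]   # stand in for W_q x_m and W_k x_n
m, n = 11, 4

lhs = dot(rope_2d(q, m, theta), rope_2d(k, n, theta))   # <f_q, f_k>
rhs = g(q, k, m - n, theta)                             # g(x_m, x_n, m - n)
```

`lhs` and `rhs` agree up to rounding, and `g` never sees `m` or `n` separately.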

High dimension

So far, we’ve discussed two-dimensional vectors, but the vectors being encoded are typically high-dimensional. How do we handle this? Instead of attempting to encode all positional information in a single rotation operation (or otherwise mixing offset information from different axes such as x and y), RoPE groups the components of the vector into pairs and rotates each pair separately. By processing each pair independently, RoPE preserves the natural structure of the space and can be generalized to any even number of dimensions as needed.

Let’s analyze this carefully.

First, let us see how a block-diagonal matrix applies different linear transformations to orthogonal subspaces. Suppose we have two square matrices A and B, and let $X = (X^1, X^2)$. Then:

$(X^1, X^2) \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix} = (X^1A, X^2B)$

Secondly, the inner product is additive over these subspaces, so any even-dimensional RoPE can be represented as a concatenation of two-dimensional cases.

Therefore, we can split each vector (Key or Query) into pairs of adjacent elements, $(q^1, q^2)$ , $(q^3, q^4)$ , …, and interpret each pair as a two-dimensional vector. This divides the original space into independent, mutually orthogonal two-dimensional subspaces. For each two-dimensional vector (dimension pair) $(q_i, q_{i+1})$ , RoPE rotates by its own angle: each subspace is rotated independently while the other subspaces remain unchanged. The rotation angle takes the same value as in trigonometric position encoding, that is, the frequency $\theta_i$ multiplied by the token index ( $m\theta_i = m \times base^{-2i/d}$ ). After the rotation, all pairs are concatenated to obtain the feature vector containing positional information.

Because the inner product of each pair is a function g of the relative offset m-n, the sum over all pairs is also a function of m-n.

$\begin{pmatrix} cos(m\theta_1) & -sin(m\theta_1) & 0 & 0 & ... & 0 & 0\\ sin(m\theta_1) & cos(m\theta_1) & 0 & 0 & ... & 0 & 0\\ 0 & 0 & cos(m\theta_2) & -sin(m\theta_2) & ... & 0 & 0\\ 0 & 0 & sin(m\theta_2) & cos(m\theta_2) & ... & 0 & 0\\ 0 & 0 & 0 & 0 & ... & 0 & 0\\ . & . & . & . & . & . & .\\ . & . & . & . & . & . & .\\ . & . & . & . & . & . & .\\ 0 & 0 & 0 & 0 & ... & cos(m\theta_{d/2}) & -sin(m\theta_{d/2})\\ 0 & 0 & 0 & 0 & ... & sin(m\theta_{d/2}) & cos(m\theta_{d/2}) \end{pmatrix} \begin{pmatrix} q_m^{(1)} \\ q_m^{(2)} \\ q_m^{(3)} \\ q_m^{(4)} \\ .\\ .\\ .\\ q_m^{(d-1)} \\ q_m^{(d)} \end{pmatrix}$

Here $\theta_i = 10000^{-2i/d}$ , with $i$ running from 0 to $d/2-1$ (the matrix above numbers the same angles from 1 to $d/2$). $\theta_i$ decreases monotonically as $i$ increases, which brings about a certain degree of long-range decay.
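As a quick illustration of how fast the angles shrink, the following sketch evaluates $\theta_i = base^{-2i/d}$ for a small dimension. The function name rope_frequencies is a placeholder chosen for this example, and d = 8 is used only to keep the output short.

```python
def rope_frequencies(d, base=10000.0):
    """Return [theta_0, ..., theta_{d/2-1}] with theta_i = base ** (-2 * i / d)."""
    if d % 2 != 0:
        raise ValueError("d must be even so that the components can be paired")
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

thetas = rope_frequencies(8)
# For d = 8 the values are approximately 1, 0.1, 0.01, 0.001:
# each later pair rotates about ten times more slowly than the previous one.
```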

Writing out $x_m$ and $W_{q,k}$ explicitly, the full form is as follows:

1712

$R^d_{\theta,m}$ is an orthogonal matrix, so it does not change the norm of the vectors and therefore generally does not affect the stability of the original model. Furthermore, because $R_m$ is sparse, implementing it directly as a matrix multiplication would waste computation. In practice, the calculation shown in the figure below is recommended, where $\bigotimes$ denotes element-wise multiplication.

1713
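The element-wise form can be compared against the explicit block-diagonal matrix. The sketch below is a plain-Python illustration under assumed names (rope_matrix, rope_elementwise) and assumed sample values; it follows the usual formulation in which q is multiplied element-wise by the repeated cosines and a sign-swapped copy of q is multiplied by the repeated sines.

```python
import math

def rope_matrix(m, thetas):
    """Build the explicit d x d block-diagonal rotation matrix for position m."""
    d = 2 * len(thetas)
    R = [[0.0] * d for _ in range(d)]
    for i, theta in enumerate(thetas):
        c, s = math.cos(m * theta), math.sin(m * theta)
        R[2 * i][2 * i], R[2 * i][2 * i + 1] = c, -s
        R[2 * i + 1][2 * i], R[2 * i + 1][2 * i + 1] = s, c
    return R

def matvec(R, v):
    return [sum(r * x for r, x in zip(row, v)) for row in R]

def rope_elementwise(q, m, thetas):
    """Compute R_m q as q * cos + swapped(q) * sin without forming the matrix."""
    cos = [math.cos(m * t) for t in thetas for _ in range(2)]  # each angle used twice
    sin = [math.sin(m * t) for t in thetas for _ in range(2)]
    swapped = []
    for i in range(0, len(q), 2):
        swapped += [-q[i + 1], q[i]]  # (-q2, q1, -q4, q3, ...)
    return [qi * c + si * s for qi, c, si, s in zip(q, cos, swapped, sin)]

thetas = [1.0, 0.1, 0.01]                  # sample angles for d = 6
q = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5]     # sample query vector
dense = matvec(rope_matrix(4, thetas), q)  # O(d^2) work
fast = rope_elementwise(q, 4, thetas)      # O(d) work
```

Both paths give the same vector up to rounding, and the squared norm of the result equals that of q, which is the orthogonality property mentioned above.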

It can be seen that RoPE is similar in form to Sinusoidal positional encoding, except that Sinusoidal positional encoding is additive while RoPE can be regarded as multiplicative: the high-dimensional query vector $q_m$ at position m is multiplied by the matrix $R_m$ . This corresponds to rotating the vector in each two-dimensional subspace, hence the name Rotary Position Embedding (RoPE). Because the rotations in the independent two-dimensional subspaces use different angles, you can picture RoPE as a counter-clockwise clock with hour, minute, and second hands, and even finer-grained hands: the earlier the pair, the faster its hand turns. In other words, RoPE distinguishes long-range from short-range relationships by using trigonometric functions of different frequencies.

We will then use the figures from the paper to illustrate this further.

For a d-dimensional q vector at position m, we require the vector size to be a multiple of 2, split it into pairs along the dimension, and interpret each pair as a two-dimensional vector.
The rotation angle of the i-th pair (i.e., elements 2i and 2i+1 of the vector) is $m\theta_i$ , where $\theta_i$ depends on i and on the hidden size of the word vector and decreases gradually from 1 to close to 0. Therefore, the earlier pairs rotate faster and the later pairs rotate more slowly.
Then each of the two-dimensional vectors is rotated.
After rotation, all pairs are concatenated to obtain a feature vector containing positional information.

1714
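The four steps above can be sketched in a few lines of plain Python. This is an illustrative sketch only: apply_rope is a name chosen for the example, the base of 10000 follows the formula in the text, and the input vector is arbitrary.

```python
import math

def apply_rope(vec, position, base=10000.0):
    """Split vec into pairs, rotate pair i by position * theta_i, then re-concatenate."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("the vector size must be a multiple of 2")
    out = []
    for i in range(d // 2):
        theta_i = base ** (-2.0 * i / d)          # from 1 down towards 0
        angle = position * theta_i                # rotation angle m * theta_i
        x, y = vec[2 * i], vec[2 * i + 1]         # elements 2i and 2i+1
        out += [x * math.cos(angle) - y * math.sin(angle),
                x * math.sin(angle) + y * math.cos(angle)]
    return out

q = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
q_at_0 = apply_rope(q, 0)   # position 0 applies no rotation
q_at_3 = apply_rope(q, 3)   # the first pair has turned by 3 rad, the last by 0.003 rad
```

Because every pair starts as (1, 0), the rotated pairs in q_at_3 are simply (cos, sin) of each angle, which makes the difference in rotation speed between early and late pairs easy to see.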

2.5 Summary

Sinusoidal position encoding is essentially an attempt to express relative position using absolute position encoding. However, this capability is compromised by the projection matrices. Thus, the sinusoidal position encoding in the original Transformer does not actually achieve its intended effect.

The concept of RoPE is somewhat similar to that of sinusoidal position coding. Both attempt to incorporate relative position information during the coding process, utilize trigonometric function transformation formulas for position transformation, and employ the same form of position transformation and rotation on a two-dimensional plane.

The difference between the two is:

The trigonometric PE directly calculates an absolute position vector for each position and adds it to the token vector at the input. In other words, it incorporates positional encoding into the word vector through addition.
RoPE performs a rotation after projection and before the attention calculation. In other words, RoPE can be seen as applying the position information computed by the trigonometric PE to the query and key vectors after the input has passed through the query and key weight matrices. It transforms the original query and key vectors into new vectors with positional information, parameterized by m and θ, where m is the position of the token in the sentence and the index of θ is tied to the position of each element within the vector.

Note that the rotation cannot be performed before projection, because the projection matrices would then sit between the two rotations and m and n could no longer be merged into m-n. It is likely that RoPE avoids the problem of sinusoidal position encoding precisely because the rotation is performed after projection. Furthermore, RoPE uses a product form similar to the Hadamard product.
Due to the properties of trigonometric functions, the trigonometric PE inherently has the ability to express relative distances. The RoPE position encoding by itself cannot express relative distances, and it needs the inner product of Attention to activate this ability.

1715

In other words, compared to trigonometric PE, RoPE embeds positional information more deeply into the model structure. Formally, it’s somewhat like multiplicative absolute positional encoding; by applying this positional encoding to q and k, the effect is equivalent to relative positional encoding. Furthermore, if explicit absolute positional information is still needed, this positional encoding can also be applied to v simultaneously.

In Su Shen’s article The Transformer Upgrade Path: 12, ReRoPE with Infinite Extrapolation?, it is pointed out that RoPE is formally an absolute position encoding, but in reality, it brings relative position information to Attention, that is, the Toeplitz matrix as shown below. This form of bias reminds us of ALiBi, which does not act on the embedding but directly on the Attention. Through this construction method, both long-range decay and relative positional relationships are achieved.

$\left( \begin{array}{ccccccccc} 0 & & & & & & & & \\ 1 & 0 & & & & & & & \\ 2 & 1 & 0 & & & & & & \\ 3 & 2 & 1 & 0 & & & & & \\ \ddots & 3 & 2 & 1 & 0 & & & & \\ \ddots & \ddots & 3 & 2 & 1 & 0 & & & \\ \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & & \\ L-2 & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \\ L-1 & L-2 & \ddots & \ddots & \ddots & 3 & 2 & 1 & 0 \end{array} \right)$

In summary, RoPE achieves both absolute and relative positioning effects through operations on absolute positions. This results in a position encoding scheme that integrates both absolute and relative positioning.

In conclusion, self-attention with RoPE proceeds as follows:

First, for the word embedding vector of each token in the sequence, calculate the corresponding query and key vectors.
Then, calculate the rotary position encoding for each token position.
Next, apply the rotation to the elements of the query and key vectors at each token position, pair by pair.
Finally, calculate the inner product between query and key to obtain the self-attention result. After the inner product, the absolute position information is no longer present and only the relative position information remains.

Furthermore, RoPE is applied only to the query and key vectors, not to the value vectors.
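The whole procedure can be put together in a small sketch. Everything here is a toy assumption made for illustration: the 4-dimensional embeddings, the matrices W_q and W_k, and the helper names are invented for the example, and softmax and the value vectors are left out because RoPE does not touch them.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rotate_pairs(vec, position, base=10000.0):
    """Rotate each adjacent pair of vec by position * theta_i."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        angle = position * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def rope_scores(embeddings, W_q, W_k, start=0):
    """Score matrix q.k for tokens placed at positions start, start + 1, ..."""
    # Steps 1 to 3: project each embedding, then rotate by its position.
    qs = [rotate_pairs(matvec(W_q, x), start + p) for p, x in enumerate(embeddings)]
    ks = [rotate_pairs(matvec(W_k, x), start + p) for p, x in enumerate(embeddings)]
    # Step 4: inner products between every query and every key.
    return [[sum(a * b for a, b in zip(q, k)) for k in ks] for q in qs]

# Toy projection matrices and embeddings (illustrative values only).
W_q = [[1.0, 0.5, 0.0, 0.0], [0.0, 1.0, 0.5, 0.0], [0.0, 0.0, 1.0, 0.5], [0.5, 0.0, 0.0, 1.0]]
W_k = [[0.5, 0.0, 1.0, 0.0], [0.0, 0.5, 0.0, 1.0], [1.0, 0.0, -0.5, 0.0], [0.0, 1.0, 0.0, -0.5]]
tokens = [[1.0, 0.0, 2.0, -1.0], [0.5, 1.5, -0.5, 0.0], [-1.0, 0.25, 0.75, 2.0]]

scores = rope_scores(tokens, W_q, W_k)                    # positions 0, 1, 2
scores_shifted = rope_scores(tokens, W_q, W_k, start=50)  # positions 50, 51, 52
```

The two score matrices match up to rounding: moving the whole sequence to different absolute positions changes nothing, because only the offsets between tokens survive the inner product.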

0x03 Properties

This section will explore some of the key features of RoPE and industry insights.

3.1 Correlation

RoPE has the following characteristics:

The dot product $qk^T$ preserves the relative positional information of the words: the score depends on their relative offset and does not change when both absolute positions shift by the same amount.
Encodings at adjacent positions exhibit a certain degree of similarity. Even after rotation, adjacent positions will still have similar embeddings. Encodings at greater distances, however, show some differences. This enhances the model’s perception and utilization of positional information.
Tokens with similar semantics generally receive more attention. That is, when $k$ and $q$ are similar, regardless of their relative distance $n-m$ , their attention is higher. On average, $q^T R_{n-m} k$ should be larger, at least larger than for two random tokens.

3.2 Periodicity

Because one full rotation is $2\pi$ radians, vector rotation in RoPE is like a clock, with each component rotating periodically. The rotation angle of each component increases linearly with the position index, and later components have larger periods and lower frequencies in their sine functions, so they rotate more slowly. Taken together, the components cover a range from high to low frequencies.

So we have a question: as the position increases, will the rotation angle repeat? The answer is as follows.

In any $k$ -th subspace, as long as $\theta_k$ is not a rational multiple of $\pi$ (informally, the formula for $\theta_k$ does not contain $\pi$ ), the rotation angle never repeats exactly.
If each subspace does not exhibit periodic repetition, then the whole will not repeat either.

3.3 $\beta$ number system

Su Shen argues that RoPE is a $\beta$ -base encoding, as stated in the original text quoted below.

The Rotary Position Embedding (RoPE) of position n is essentially the $\beta$ -base encoding of the number n!

For a decimal number n, the m-th digit (counting from right to left) of its base- $\beta$ representation is obtained as follows:

$\left\lfloor\frac{n}{\beta^{m-1}}\right\rfloor \bmod \beta$

That is, first divide by $\beta^{m-1}$ , and then find the modulus (remainder). RoPE can be rewritten as…

$[cos(\frac{n}{\beta^0}),sin(\frac{n}{\beta^0}),cos(\frac{n}{\beta^1}),sin(\frac{n}{\beta^1}),...,cos(\frac{n}{\beta^{d/2-1}}),sin(\frac{n}{\beta^{d/2-1}})]$

where $\beta = 10000^{2/d}$ . The most important characteristic of the modulo operation is periodicity, and cosine and sine are also periodic functions. Therefore, apart from the insignificant difference of the floor function, RoPE (or Sinusoidal positional encoding) is actually the $\beta$ -base encoding of the number n!
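To make the analogy concrete, the sketch below places the digit formula next to the rewritten RoPE vector. The function names digit and rope_code are placeholders for this example; d = 8 is chosen because it gives $\beta = 10000^{2/8} = 10$ , so the RoPE arguments n, n/10, n/100, ... line up with the decimal digits.

```python
import math

def digit(n, m, beta):
    """m-th digit of n in base beta, counting from the right (m = 1 is the lowest)."""
    return (n // beta ** (m - 1)) % beta

def rope_code(n, d, base=10000.0):
    """[cos(n/beta^0), sin(n/beta^0), ..., cos(n/beta^(d/2-1)), sin(n/beta^(d/2-1))]."""
    beta = base ** (2.0 / d)
    code = []
    for j in range(d // 2):
        code += [math.cos(n / beta ** j), math.sin(n / beta ** j)]
    return code

digits_of_1234 = [digit(1234, m, 10) for m in (1, 2, 3, 4)]  # lowest digit first
code_of_1234 = rope_code(1234, d=8)
expected_second_pair = (math.cos(123.4), math.sin(123.4))   # n / beta with beta = 10
```

The digit formula divides by $\beta^{m-1}$ and wraps around with mod, while RoPE divides by the same power of $\beta$ and wraps around through the period of cos and sin.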

3.4 Symmetry

Referring to the properties of trigonometric function encoding, for RoPE encoding, the attentional influence of token A at position $m$ on token B at position $n$ is the same as the attentional influence of token C at position $2n-m$ on token B. Especially when the token at position $m$ is the same as the token at position $2n-m$ , the following expression holds:

$g(x_m, x_n, m-n) = g(x_n, x_{2n-m}, n-(2n-m))$

This proves that RoPE encoding also conforms to symmetry and does not learn the difference in direction.

3.5 Frequency Domain

The magnitude of $\theta$ determines the monotonicity of the corresponding dimension and also gives the parameters in these dimensions different learning tendencies. $\theta$ corresponds to the concept of frequency in the Fourier transform.

When $\theta$ is large, the attention calculation results stay monotonic only while the relative distance $t-s$ is small and then start to oscillate, which is essentially a high-frequency signal.
When $\theta$ is small, the attention calculation results stay monotonic even when the relative distance $t-s$ is large, and the fluctuations are relatively smooth, which is essentially a low-frequency signal.

The paper SCALING LAWS OF ROPE-BASE points out that if $q_tk_s$ is used to represent the semantic similarity between the token at position $s$ and the token at position $t$ , then $q_tk_s$ is a two-dimensional time-domain signal over the two time dimensions $t$ and $s$ . The semantic similarity $q_tk_s$ is composed of semantic similarity components $q_t^{(n)}k_s^{(n)}$ in different frequency-domain dimensions, with each dimension corresponding to a frequency band $\theta_n$ . High-frequency components correspond to local semantic influences, while low-frequency components correspond to long-range contextual semantic influences. The most basic conversion from the frequency domain to the time domain is the inverse Fourier transform, which combines the components from different frequency bands through $e^{i(s-t)\theta_n}$ . Since the goal is to obtain the positional information of position $s$ relative to position $t$ , the object of the transformation is $q_t^{(1)}k_s^{(1)}...q_t^{(d)}k_s^{(d)}$ , and the target dimension of the transformation is the diagonal direction of the original two-dimensional time domain, that is, the $s-t$ direction.

1716

The paper Round and Round We Go! What makes Rotary Positional Encodings useful? also reveals the role of different frequency components of RoPE in model learning: high frequency is used for positional attention, and low frequency is used for semantic attention.

We can calculate the wavelength corresponding to each dimension of RoPE.

$\lambda_d = \frac{2\pi}{\theta_d} = 2\pi b^{\frac{2d}{|D|}}$

where $|D|$ is the total number of dimensions, and $b$ is the base. The wavelength is the number of tokens required to complete one full rotation $(2\pi)$ in that dimension. It is the inverse of the RoPE frequency of that dimension, so it differs from dimension to dimension.

Given a length $L$ , some dimensions may have periods longer than $L$ . We can assume that when this happens, all positions receive a unique code, meaning their absolute positions are preserved. Conversely, dimensions with shorter periods only retain relative position information.
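The wavelength formula makes it easy to see which dimensions fall on each side of a given length. The following sketch is illustrative only: the function names are hypothetical, and the values |D| = 128 and L = 4096 are assumptions picked for the example rather than settings of any particular model.

```python
import math

def wavelengths(total_dims, base=10000.0):
    """lambda_d = 2*pi * base ** (2*d / total_dims) for pair index d = 0 .. total_dims/2 - 1."""
    return [2 * math.pi * base ** (2.0 * d / total_dims) for d in range(total_dims // 2)]

def split_by_length(total_dims, length, base=10000.0):
    """Separate pair indices whose period fits within `length` from those that exceed it."""
    lam = wavelengths(total_dims, base)
    short_dims = [d for d, w in enumerate(lam) if w <= length]
    long_dims = [d for d, w in enumerate(lam) if w > length]
    return short_dims, long_dims

lam = wavelengths(128)
short_dims, long_dims = split_by_length(128, length=4096)
# short_dims complete at least one full turn within the length (relative information);
# long_dims never wrap around, so every position gets a distinct angle there.
```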

3.6 High frequency and low frequency

In RoPE, vector rotation works like a clock: one full rotation is $2\pi$ radians, so the rotation of each component is periodic. For each two-dimensional vector (dimension pair) $(q_i, q_{i+1})$ , RoPE rotates by its own angle, and the rotations are performed separately. The rotation angle takes the same value as in trigonometric position encoding, that is, the frequency $\theta_i$ multiplied by the token index $(m\theta_i = m \times base^{-2i/d})$ . After rotating and concatenating all the pairs, we obtain a feature vector containing positional information. Here $\theta_i = 10000^{-2i/d}$ follows the original sinusoidal position encoding scheme of the Transformer, which introduces some long-range decay. The rotation angle is different at each position.

$\begin{pmatrix} cos(m\theta_1) & -sin(m\theta_1) & 0 & 0 & ... & 0 & 0\\ sin(m\theta_1) & cos(m\theta_1) & 0 & 0 & ... & 0 & 0 \\ 0 & 0 & cos(m\theta_2) & -sin(m\theta_2) & ... & 0 & 0 \\ 0 & 0 & sin(m\theta_2) & cos(m\theta_2) & ... & 0 & 0 \\ 0 & 0 & 0 & 0 & ... & 0 & 0 \\ . & . & . & . & . & . & . \\ . & . & . & . & . & . & . \\ . & . & . & . & . & . & . \\ 0 & 0 & 0 & 0 & ... & cos(m\theta_{d/2}) & -sin(m\theta_{d/2}) \\ 0 & 0 & 0 & 0 & ... & sin(m\theta_{d/2}) & cos(m\theta_{d/2}) \end{pmatrix} \begin{pmatrix} q_m^{(1)} \\ q_m^{(2)} \\ q_m^{(3)} \\ q_m^{(4)} \\ . \\ . \\ . \\ q_m^{(d-1)} \\ q_m^{(d)} \end{pmatrix}$

In periodic functions such as $\sin(\omega x)$ , the larger the value of $\omega$ , the higher the frequency. In RoPE, as the dimension variable $k$ increases, $b^{-2k/d}$ decreases, thus reducing the frequency.

$\sin(\frac{p}{b^\frac{2k}{d}}), \cos(\frac{p}{b^\frac{2k}{d}})$

We can conclude that lower dimensions of the positional encoding correspond to higher frequencies, and higher dimensions correspond to lower frequencies. For each group, the rotation angle increases linearly with the position index. The later the group, the slower its rotation, the larger its period, and the lower the frequency of its sine function.

High frequency: in the RoPE position vector, when $i$ is relatively small (the earlier dimensions), $\theta_i$ is large, so the period is short and the frequency is high.
Low frequency: in the RoPE position vector, when $i$ is relatively large (the later dimensions), $\theta_i$ is small, so the period is long and the frequency is low.

Bowen Peng, the author of NTK-RoPE and YaRN, believes that high-frequency learning captures local relative distances, while low-frequency learning captures long-range absolute distances. Both high and low frequencies are important, and their relationship is more like a hierarchy. In terms of number systems, low frequencies correspond to high-order bits. If only low-order bits are retained and high-order bits are removed, the result is equivalent to taking the modulus (remainder), which cannot accurately represent the positional information.

1717

3.7 Long-distance attenuation

Long-range decay is based on a simple assumption: the greater the relative distance, the lower the correlation and dependence between tokens. If a position encoding has the long-range decay property, tokens that are close to each other in the sequence receive more attention on average.

Performance

RoPE also exhibits a long-range decay property: for two word vectors, the closer they are, the larger their inner product, and vice versa. That is, for the query vector $q_m$ at position $m$ and the key vector $k_n$ at position $n$ , the greater the relative distance (the larger $|n-m|$ ), the smaller the value of the inner product $((R_mq_m)^T(R_nk_n))$ . As can be seen from the figure below, the inner product tends to decrease as the relative distance increases.

1718

As can be seen from the graph, significant fluctuations occur in the later stages of the decay curve, resulting in a U-shaped attention pattern. A comparison graph is shown below.

1719

The paper HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation provides a detailed analysis of this, finding that in RoPE, the U-shaped attention pattern is caused by specific learned components, which are also key factors limiting RoPE’s expressive and extrapolation capabilities. See the figure below for details.

(a) decomposes RoPE into components (Comps) for analysis. The upper subplot shows the contribution of each component to the overall attention logits. Some components with prominent patterns (“activating” components) are highlighted in red and low-frequency components in blue. The lower subplot shows the overall attention logits and the combined effect of the “activating” components.
(b) shows the variance (VAF) of different RoPE components during training.
(c) shows that the OOD phenomenon in extrapolation is caused by the “activating” components. The two upper subplots show the attention patterns of the first layer, while the lower subplot shows the anomalous patterns of subsequent layers.

1720

Based on these findings, this paper proposes a novel position encoding method, High-frequency rotary Position Encoding (HoPE). HoPE removes the position-dependent components in RoPE, preserving the high-frequency signal, thus theoretically breaking the principle of long-term attenuation.

Is it possible to design a non-oscillating position code? It’s difficult. If the position code function doesn’t oscillate, it often lacks sufficient capacity to encode enough position information. In other words, in a sense, the complexity of the position code function itself is a requirement for encoding position.

Argument

We will now examine long-range attenuation.

First, let’s look at the derivation from the paper, as shown in the figure below.

1721

Secondly, some researchers consider the following formula to be the main functional term of RoPE (see the article Transformer position encoding (meaning) by Riverside grass lxr).

$C_{RoPE}(t-s) = \frac{1}{d/2}\sum_{n=1}^{d/2}\cos((s-t)\theta_n)$ $\theta_n = 10000^{-2n/d}$

The bias $C_{RoPE}(t-s)$ generally decreases monotonically with the relative distance $t-s$ . However, an overall decreasing bias does not mean that the bias in each dimension decreases monotonically. The magnitude of $\theta_n$ determines the monotonicity of dimensions $2n-1, 2n$ , and also gives the parameters in these dimensions different learning tendencies:

When $n$ is small, $\theta_n$ is relatively large (close to 1), and the bias stays monotonic only while the relative distance $t-s$ is small. After that it starts to oscillate, which induces the corresponding dimensions to represent the positional information of nearby positions.
When $n$ is large, $\theta_n$ is relatively small (close to 0), and the bias stays monotonic even when the relative distance $t-s$ is large, which induces the corresponding dimensions to represent the positional information of more distant positions.

Conversely, semantic information at different relative positions will also be reflected in different feature dimensions.

When the relative distance $t-s$ is small, the biases of all dimensions are close to 1, which means that the self-attention distribution pays more attention to information from neighboring positions.
When the relative distance $t-s$ is large, most dimensions have positive and negative biases that cancel each other out, with only a few dimensions having larger biases. If the semantic features of the corresponding dimensions of two tokens highly overlap, they will be partially emphasized; otherwise, the corresponding self-attention distribution approaches 0. This is a major advantage of relative bias: it does not impose an absolute penalty on semantic associations with large relative distances, but rather provides relative filtering. Although information from larger distances is suppressed through overall bias, semantics on certain feature dimensions are still allowed to converge in the self-attention calculation.
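The behavior of $C_{RoPE}$ and of its individual dimensions can be explored with a short sketch. The function names and the choice d = 128 are assumptions for this example. The second helper uses the fact that $\cos(x)$ decreases on $[0, \pi]$ , so the bias of dimension $n$ stops decreasing once the relative distance reaches $\pi/\theta_n$ .

```python
import math

def c_rope(distance, d=128, base=10000.0):
    """Mean of cos(distance * theta_n) over n = 1 .. d/2, with theta_n = base ** (-2n/d)."""
    half = d // 2
    total = sum(math.cos(distance * base ** (-2.0 * n / d)) for n in range(1, half + 1))
    return total / half

def monotonic_range(n, d=128, base=10000.0):
    """Largest relative distance up to which cos(distance * theta_n) keeps decreasing."""
    theta_n = base ** (-2.0 * n / d)
    return math.pi / theta_n

bias = [c_rope(dist) for dist in (0, 1, 10, 100, 1000)]
near_range = monotonic_range(1)    # small n, large theta: a few tokens
far_range = monotonic_range(64)    # large n, small theta: tens of thousands of tokens
```

At distance 0 every dimension contributes 1, so the bias starts at 1. The monotonic ranges show why early dimensions are pushed towards nearby positions and late dimensions towards distant ones.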

Base

For $\theta_n = 10000^{-2n/d}$ , the number 10,000 determines the size of $\theta_n$ , and we call it the base. Different values of base affect how strongly attention decays over distance. Because decay over distance is key to extrapolation, the properties of base are closely related to the length extrapolation of large models. For example, length extrapolation methods such as NTK-Aware Scaled RoPE, NTK-by-parts, and Dynamic NTK essentially change the base to affect the rotation angle corresponding to each position, thereby affecting the positional encoding information of the model and ultimately achieving the purpose of length extrapolation.

Since the attention values in RoPE, besides $q$ and $k$ themselves, are only related to $R_{n-m}$ , below we observe the characteristics of the $R_{n-m}$ factor.

$\begin{aligned} \langle p_m, p_n \rangle &= Re\left[e^{i(m-n)\theta_0}+e^{i(m-n)\theta_1}+\cdots+e^{i(m-n)\theta_{d/2-1}}\right] \\ &= \frac{d}{2}\cdot Re\left[\sum_{i=0}^{d/2-1}e^{i(m-n)10000^{-i/(d/2)}}\frac{1}{d/2}\right] \\ &\approx \frac{d}{2}\cdot Re\left[\int_0^1 e^{i(m-n)\cdot 10000^{-t}}dt\right] \end{aligned}$

The problem then becomes the asymptotic estimation of the integral $\int_0^1 e^{i(m-n)\cdot 10000^{-t}}dt$ . The influence of different base values can be analyzed by calculating the relationship between the integral value and the location distance using the following function.

With base=1, the long-range attenuation characteristic is completely lost.
The smaller the base, the faster and larger the decay. Too small a base will disrupt the long-range decay property of attention; for example, when base=10 or 100, the attention score no longer shows an oscillating downward trend as the relative position increases.
The larger the base, the slower and smaller the decay. This is why the base needs to be increased when training longer windows, and the mainstream practice in the industry is to increase the base as the window lengthens. Apple uses a large base in its models. The longer the input sequence, the larger the base needs to be, which forcibly slows down the decay over a window that has not been trained sufficiently and also reduces the risk of the model breaking down.
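The effect of the base can be examined with the discrete sum that the integral approximates. This is a minimal sketch under assumptions: mean_cos is a name chosen for the example, d = 128 and the sampled distances are arbitrary, and the function evaluates the average of $\cos((m-n)\cdot base^{-i/(d/2)})$ rather than the integral itself.

```python
import math

def mean_cos(distance, base, d=128):
    """Real part of (1/(d/2)) * sum_i exp(1j * distance * base ** (-i / (d/2)))."""
    half = d // 2
    return sum(math.cos(distance * base ** (-i / half)) for i in range(half)) / half

distances = [0, 4, 16, 64, 256, 1024]
curves = {base: [mean_cos(x, base) for x in distances]
          for base in (1, 100, 10000, 500000)}

# With base = 1 every theta equals 1, so the curve is just cos(distance):
# it oscillates forever and never decays.
no_decay = [math.cos(x) for x in distances]
```

Comparing the entries of curves at the same distance shows the trend described above: at a short distance such as 4, the value for base 10000 is still well above the value for base 100, because fewer dimensions have started to oscillate.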

Smoothness

Furthermore, the embedding dimension is positively correlated with the smoothness of the decay curve; the higher the dimension, the smoother the decay curve. The fundamental premise of extrapolation is the “smoothness” of the function. Extrapolation is inferring the whole from the local, relying on the higher-order smoothness of a given function (the existence and boundedness of higher-order derivatives). However, trigonometric function encoding or RoPE does not possess this property. They are combinations of sine and cosine functions, which are high-frequency oscillating functions about the position code $k$ , rather than linear or asymptotically linear functions. Therefore, models based on them often exhibit unpredictable extrapolation behavior.

3.8 Extrapolation

Although RoPE can theoretically encode absolute positional information of arbitrary length and generate positional codes exceeding the pre-training length through rotation matrices, and RoPE also exhibits long-distance decay, it still suffers from the length extrapolation problem. For large language models based on RoPE, the model’s performance deteriorates significantly after the test length exceeds the training length, manifested as a sharp increase in language modeling perplexity.

We will write a separate article later to analyze this in detail.

0x04 Implementation

4.1 Basic Torch Knowledge

`torch.outer`

torch.outer(a, b) calculates the outer product of two 1D vectors a and b, generating a 2D matrix. Each element is calculated as: result[i,j] = a[i] * b[j]. That is, the element in the i-th row and j-th column of the result matrix is equal to the product of the i-th element of vector a and the j-th element of vector b.

The outer product is the matrix generated by the outer product operation of two vectors a and b: A = a ⊗ b. Here, a ⊗ b generates a matrix with the number of rows equal to the number of elements in vector a and the number of columns equal to the number of elements in vector b.

`torch.matmul`

When the dimension of the input tensor is greater than 2, torch.matmul performs batch matrix multiplication.

`torch.polar`

The torch.polar() function constructs a complex tensor, with the signature torch.polar(abs, angle, *, out=None) -> Tensor. Its elements are the Cartesian coordinates corresponding to polar coordinates with absolute value abs and angle angle: out = abs * cos(angle) + abs * sin(angle) * j.

`torch.repeat_interleave`

The torch.repeat_interleave() function repeats each element of a tensor a given number of times. The result has the same shape as the input except along the repeated dimension; if no dimension is given, the input is flattened first.

`torch.view_as_complex`

Views a real tensor as a complex tensor. The last dimension of the input must have size 2, holding the real and imaginary parts of each complex number.

`torch.view_as_real`

Transforming a complex tensor back into a real tensor can be seen as the inverse transformation of the previous operation.
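For readers who want to see what these helpers compute without installing PyTorch, the sketch below mimics them on plain Python lists. These are simplified stand-ins written for this example, not the torch functions themselves: they only handle 1D inputs (or a list of pairs) and ignore dtypes, devices, and broadcasting.

```python
import cmath

def outer(a, b):
    """Like torch.outer for 1D inputs: result[i][j] = a[i] * b[j]."""
    return [[x * y for y in b] for x in a]

def polar(abs_values, angles):
    """Like torch.polar, element by element: abs * (cos(angle) + sin(angle) * j)."""
    return [cmath.rect(r, phi) for r, phi in zip(abs_values, angles)]

def repeat_interleave(values, repeats):
    """Like torch.repeat_interleave on a 1D input with an integer repeats."""
    return [v for v in values for _ in range(repeats)]

def view_as_complex(pairs):
    """Like torch.view_as_complex: each trailing [real, imag] pair becomes one complex number."""
    return [complex(re, im) for re, im in pairs]

def view_as_real(values):
    """Like torch.view_as_real: each complex number becomes a [real, imag] pair."""
    return [[z.real, z.imag] for z in values]

angle_table = outer([0, 1, 2], [1.0, 0.1])        # positions times frequencies
unit = polar([1.0, 1.0], [0.0, cmath.pi / 2])      # points on the unit circle
doubled = repeat_interleave([1, 2, 3], 2)          # [1, 1, 2, 2, 3, 3]
round_trip = view_as_real(view_as_complex([[1.0, 2.0], [3.0, -4.0]]))
```

In the RoPE implementation these pieces fit together naturally: outer builds the table of angles, polar turns the angles into unit complex numbers, and the two view functions switch the query and key vectors between paired real form and complex form.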

4.2 Position in Transformer

Unlike the absolute position encoding of the original Transformer, RoPE is located inside the multi-head attention mechanism, directly acting on the query and key of each head to complete the transformation, and each head uses the same RoPE. This also means that RoPE must be added to each layer in the Transformer.

4.3 llama3

In LLaMA, RoPE is implemented using complex number formulas for calculation, $f_q(x_m, m) = (W_qx_m)e^{im\theta}$ . This method is faster, but it is less convenient to modify later.

Specifically, each vector (Key or Query) is split into pairs of adjacent elements, $(q^1, q^2)$ , $(q^3, q^4)$ , …, and each pair is interpreted as a two-dimensional vector. For each two-dimensional vector (dimension pair) $(q_i, q_{i+1})$ , RoPE rotates by its own angle, and the rotations are performed separately. The rotation angle takes the same value as in trigonometric position encoding, that is, the frequency $\theta_i$ multiplied by the token index $(m\theta_i = m \times base^{-2i/d})$ . After rotating and concatenating all the pairs, we obtain a feature vector containing positional information.

$\begin{pmatrix} cos(m\theta_1) & -sin(m\theta_1) & 0 & 0 & ... & 0 & 0\\ sin(m\theta_1) & cos(m\theta_1) & 0 & 0 & ... & 0 & 0 \\ 0 & 0 & cos(m\theta_2) & -sin(m\theta_2) & ... & 0 & 0 \\ 0 & 0 & sin(m\theta_2) & cos(m\theta_2) & ... & 0 & 0 \\ 0 & 0 & 0 & 0 & ... & 0 & 0 \\ . & . & . & . & . & . & . \\ . & . & . & . & . & . & . \\ . & . & . & . & . & . & . \\ 0 & 0 & 0 & 0 & ... & cos(m\theta_{d/2}) & -sin(m\theta_{d/2}) \\ 0 & 0 & 0 & 0 & ... & sin(m\theta_{d/2}) & cos(m\theta_{d/2}) \end{pmatrix} \begin{pmatrix} q_m^{(1)} \\ q_m^{(2)} \\ q_m^{(3)} \\ q_m^{(4)} \\ . \\ . \\ . \\ q_m^{(d-1)} \\ q_m^{(d)} \end{pmatrix}$

Here $\theta_i = 10000^{-2i/d}$ follows the original sinusoidal position encoding scheme of the Transformer, which introduces some long-range decay. The rotation angle is different at each position.

overall

The overall code and formula are shown in the figure below.

1722

Before implementing the RoPE algorithm, please note the following: for ease of code implementation, the rotation matrix needs to be converted to polar coordinates before rotation, and the embedding vectors (q, k) need to be converted to complex numbers. After rotation, the rotated embeddings need to be converted back to real numbers for attention calculation.

Prepare rotation matrix

The precompute_freqs_cis() function generates the rotation matrix by pre-computing the frequencies $\theta$ for a given dimension. $\theta$ is entirely determined by the vector length d of Q, K, and V. The position m corresponds to the query length, which in the actual code is determined by the max_position_embeddings parameter and can be understood as the length of the longest query supported by the model. Therefore, once max_position_embeddings is given, the range of m is also determined. Combining the above, for an LLM with a fixed longest query length m and vector dimension d, we can pre-construct the corresponding rotation transformation matrix.

The matrix of freqs = torch.outer(t, freqs) is as follows.

$freqs = \begin{bmatrix} 1\theta_1 & 1\theta_2 & 1\theta_3 & ... & 1\theta_{d/2-1} & 1\theta_{d/2}\\ 2\theta_1 & 2\theta_2 & 2\theta_3 & ... & 2\theta_{d/2-1} & 2\theta_{d/2}\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ m\theta_1 & m\theta_2 & m\theta_3 & ... & m\theta_{d/2-1} & m\theta_{d/2} \end{bmatrix}$

By combining this transformation matrix of $R_d$ with cos and sin respectively, we can obtain the transformation matrix of all positions and all dimensions required for our calculation.

The result of torch.polar (freqs_cis in the code) is as follows.

$freqs = \begin{bmatrix} \cos(\theta_1)+i\cdot \sin(\theta_1) & \cos(\theta_2)+i\cdot \sin(\theta_2) & ... & \cos(\theta_{d/2})+i\cdot \sin(\theta_{d/2})\\ \cos(2\theta_1)+i\cdot \sin(2\theta_1) & \cos(2\theta_2)+i\cdot \sin(2\theta_2) & ... & \cos(2\theta_{d/2})+i\cdot \sin(2\theta_{d/2})\\ \vdots & \vdots & \ddots & \vdots\\ \cos(m\theta_1)+i\cdot \sin(m\theta_1) & \cos(m\theta_2)+i\cdot \sin(m\theta_2) & ... & \cos(m\theta_{d/2})+i\cdot \sin(m\theta_{d/2}) \end{bmatrix}$

The specific code is as follows.

import torch


def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
    # Build the vector of rotation angles θ from the dimension. The elements of a vector are grouped in pairs and each
    # pair gets its own angle θ_i, so there are dim/2 angles, determined entirely by dim (and the base theta).
    # torch.arange(0, dim, 2) takes the even indices 0, 2, 4, ..., so freqs has length dim/2.
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    # Token position indices t = [0, 1, ..., end - 1]; these are the m or n of the paper.
    t = torch.arange(end, device=freqs.device, dtype=torch.float32)
    # Compute m * θ as the outer product of positions and angles. Entry (t, i) of the resulting matrix is the
    # rotation angle of position t for the i-th pair of dimensions.
    freqs = torch.outer(t, freqs)  # shape [end, dim // 2]; see the matrix above

    # Write the angles in complex form e^{imθ}: modulus 1, argument freqs. freqs_cis has shape (end, dim // 2).
    # If freqs = [x, y], then freqs_cis = [cos(x) + sin(x)i, cos(y) + sin(y)i].
    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)  # complex64
    return freqs_cis
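To inspect the table that precompute_freqs_cis() builds without needing torch, the following pure-Python sketch mirrors the same three steps (angles, outer product with positions, unit complex numbers). The name precompute_freqs_cis_py and the tiny sizes dim=8, end=4 are assumptions for illustration, and nested lists stand in for the tensor.

```python
import cmath


def precompute_freqs_cis_py(dim: int, end: int, theta: float = 10000.0):
    """Return a list of `end` rows, each holding dim // 2 unit complex numbers e^{i * m * theta_j}."""
    # One angle per pair of dimensions: exponents 0/dim, 2/dim, 4/dim, ...
    freqs = [1.0 / (theta ** (j / dim)) for j in range(0, dim, 2)]
    # Outer product of positions and angles, written directly in polar form.
    return [[cmath.rect(1.0, m * f) for f in freqs] for m in range(end)]


table = precompute_freqs_cis_py(dim=8, end=4)

for m, row in enumerate(table):
    print(m, [f"{z.real:+.3f}{z.imag:+.3f}i" for z in row])
```

Row 0 consists of 1+0i everywhere (position 0 is not rotated), and every entry has modulus 1, so multiplying by it changes only the angle of a pair and never its length.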

The precompute_freqs_cis() function is called as follows.

class Transformer(nn.Module):
    def __init__(self, params: ModelArgs):
        super().__init__()
        self.params = params
        self.vocab_size = params.vocab_size
        self.n_layers = params.n_layers

        self.tok_embeddings = VocabParallelEmbedding(
            params.vocab_size, params.dim, init_method=lambda x: x
        )

        self.layers = torch.nn.ModuleList()
        for layer_id in range(params.n_layers):
            self.layers.append(TransformerBlock(layer_id, params))

        self.norm = RMSNorm(params.dim, eps=params.norm_eps)
        self.output = ColumnParallelLinear(
            params.dim, params.vocab_size, bias=False, init_method=lambda x: x
        )

        # Precompute the rotation matrix; the factor of 2 leaves headroom for dynamic extension
        self.freqs_cis = precompute_freqs_cis(
            params.dim // params.n_heads,
            params.max_seq_len * 2,
            params.rope_theta,
        )

Implementation

The apply_rotary_emb() method applies the cosine and sine rotations to the original query and key vectors, so that positional information enters the query-key inner product in Attention.

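from typing import Tuple

import torch


# The angles must be expanded so that they broadcast against q and k.
# freqs_cis has shape [seq_len, head_dim // 2].
def reshape_for_broadcast(freqs_cis: torch.Tensor, x: torch.Tensor):
    ndim = x.ndim
    assert 0 <= 1 < ndim
    # freqs_cis must match x on the sequence and last dimensions, i.e. (x.shape[1] = seq_len, x.shape[-1] = head_dim // 2)
    assert freqs_cis.shape == (x.shape[1], x.shape[-1])
    # Keep the second and the last dimension of x; set every other dimension to 1
    shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
    return freqs_cis.view(*shape)  # [1, seq_len, 1, head_dim // 2]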


def apply_rotary_emb(
    xq: torch.Tensor,
    xk: torch.Tensor,
    freqs_cis: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Multiply the q and k vectors by the rotation factors to obtain the rotated q and k.

    Args:
        xq (torch.Tensor): W_q applied to the token embeddings, i.e. the output of the preceding linear layer, with shape [batch_size, seq_len, n_heads, head_dim]
        xk (torch.Tensor): W_k applied to the token embeddings, i.e. the output of the preceding linear layer, with shape [batch_size, seq_len, n_kv_heads, head_dim]
        freqs_cis (torch.Tensor): complex rotation factors, with shape [seq_len, head_dim // 2]

    Returns:
        q and k with the rotary encoding applied
    """

    # View the real tensors as complex tensors. The last dimension (size head_dim) is reshaped to (head_dim // 2, 2),
    # and each pair of adjacent reals becomes the real and imaginary part of one complex number.
    # Shapes: [batch_size, seq_len, n_heads, head_dim] -> [batch_size, seq_len, n_heads, head_dim // 2, 2] -> complex [batch_size, seq_len, n_heads, head_dim // 2]
    # xq.shape[:-1] is every dimension but the last, * unpacks it, and (-1, 2) splits the last dimension into (head_dim // 2, 2).
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))  # complex tensor
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))  # complex tensor

    # freqs_cis must broadcast against xq_ and xk_, so it is reshaped from [seq_len, head_dim // 2] to [1, seq_len, 1, head_dim // 2]:
    # it has to agree with the embeddings on the sequence dimension (dim 1) and the last dimension (dim 3).
    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)

    # Rotate by complex multiplication (the modulus is unchanged, only the angle changes), then convert back to real numbers.
    # flatten(3) merges the trailing (head_dim // 2, 2) back into head_dim. The encoding depends only on the sequence position
    # and the feature index, not on the batch or the attention head, so only dims 1 and 3 matter.
    # xq_out.shape = [batch_size, seq_len, n_heads, head_dim]
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)  # real tensors again
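The reason a per-position rotation yields relative-position information can be checked numerically. The following pure-Python sketch (standard library only, not the torch code above) rotates adjacent pairs of q and k by complex multiplication and compares the attention score for two position pairs that share the same offset. The helper names rope_pairs and dot, the dimension 8, and the random test vectors are assumptions made for illustration.

```python
import cmath
import random


def rope_pairs(vec, pos, thetas):
    """Rotate adjacent pairs (vec[2j], vec[2j+1]) by the angle pos * thetas[j]."""
    out = []
    for j, th in enumerate(thetas):
        z = complex(vec[2 * j], vec[2 * j + 1]) * cmath.rect(1.0, pos * th)
        out.extend([z.real, z.imag])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


dim = 8
thetas = [10000.0 ** (-j / dim) for j in range(0, dim, 2)]

rng = random.Random(0)  # fixed seed so the run is repeatable
q = [rng.uniform(-1, 1) for _ in range(dim)]
k = [rng.uniform(-1, 1) for _ in range(dim)]

# Same relative offset (4), very different absolute positions.
score_near = dot(rope_pairs(q, 7, thetas), rope_pairs(k, 3, thetas))
score_far = dot(rope_pairs(q, 104, thetas), rope_pairs(k, 100, thetas))
print(score_near, score_far)
```

The two scores agree up to floating-point error even though the absolute positions differ by roughly a hundred, because each pair contributes a term that depends only on the difference of the two rotation angles.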

Call

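Transformer.forward() slices freqs_cis to the positions of the current tokens and passes the slice down to every TransformerBlock, where the RoPE operation is eventually applied.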

class Transformer(nn.Module):

    @torch.inference_mode()
    def forward(self, tokens: torch.Tensor, start_pos: int):
        _bsz, seqlen = tokens.shape
        h = self.tok_embeddings(tokens)
        self.freqs_cis = self.freqs_cis.to(h.device)
        freqs_cis = self.freqs_cis[start_pos : start_pos + seqlen]

        mask = None
        if seqlen > 1:
            mask = torch.full((seqlen, seqlen), float("-inf"), device=tokens.device)

            mask = torch.triu(mask, diagonal=1)

            # When performing key-value caching, we compute the attention scores
            # only for the new sequence. Thus, the matrix of scores is of size
            # (seqlen, cache_len + seqlen), and the only masked entries are (i, j) for
            # j > cache_len + i, since row i corresponds to token cache_len + i.
            mask = torch.hstack(
                [torch.zeros((seqlen, start_pos), device=tokens.device), mask]
            ).type_as(h)

        for layer in self.layers:
            h = layer(h, start_pos, freqs_cis, mask)
        h = self.norm(h)
        output = self.output(h).float()
        return output
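The interplay between start_pos, the freqs_cis slice, and the mask is easier to see on a toy case. The sketch below reproduces, with plain Python lists, the positions selected by freqs_cis[start_pos : start_pos + seqlen] and the mask of shape (seqlen, start_pos + seqlen) described in the code comment. The function name causal_mask_with_cache and the sizes are hypothetical.

```python
NEG_INF = float("-inf")


def causal_mask_with_cache(seqlen: int, start_pos: int):
    """Rows are the new tokens; columns are the cached tokens followed by the new tokens."""
    rows = []
    for i in range(seqlen):
        cached = [0.0] * start_pos  # every cached token is visible
        new = [NEG_INF if j > i else 0.0 for j in range(seqlen)]  # causal part
        rows.append(cached + new)
    return rows


start_pos, seqlen = 3, 2  # 3 tokens already cached, 2 new tokens
positions = list(range(start_pos, start_pos + seqlen))  # rows of freqs_cis that get used
mask = causal_mask_with_cache(seqlen, start_pos)

print(positions)
for row in mask:
    print(row)
```

The new tokens are rotated with the angles of their absolute positions (3 and 4 here), while the keys already in the cache keep the rotation they received when they were first processed.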

TransformerBlock will directly call the forward function of Attention.

class TransformerBlock(nn.Module):
    def __init__(self, layer_id: int, args: ModelArgs):
        super().__init__()
        self.n_heads = args.n_heads
        self.dim = args.dim
        self.head_dim = args.dim // args.n_heads
        self.attention = Attention(args)
        self.feed_forward = FeedForward(
            dim=args.dim,
            hidden_dim=4 * args.dim,
            multiple_of=args.multiple_of,
            ffn_dim_multiplier=args.ffn_dim_multiplier,
        )
        self.layer_id = layer_id
        self.attention_norm = RMSNorm(args.dim, eps=args.norm_eps)
        self.ffn_norm = RMSNorm(args.dim, eps=args.norm_eps)

    def forward(
        self,
        x: torch.Tensor,
        start_pos: int,
        freqs_cis: torch.Tensor,
        mask: Optional[torch.Tensor],
    ):
        h = x + self.attention(self.attention_norm(x), start_pos, freqs_cis, mask)
        out = h + self.feed_forward(self.ffn_norm(h))
        return out

Attention will perform the following operations.

def forward(
    self,
    x: torch.Tensor,
    start_pos: int,
    freqs_cis: torch.Tensor,
    mask: Optional[torch.Tensor],
):
    bsz, seqlen, _ = x.shape
    xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)

    xq = xq.view(bsz, seqlen, self.n_local_heads, self.head_dim)
    xk = xk.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)
    xv = xv.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)

    # Apply the rotary position encoding before the attention computation
    xq, xk = apply_rotary_emb(xq, xk, freqs_cis=freqs_cis)

    self.cache_k = self.cache_k.to(xq)
    self.cache_v = self.cache_v.to(xq)

    self.cache_k[:bsz, start_pos : start_pos + seqlen] = xk
    self.cache_v[:bsz, start_pos : start_pos + seqlen] = xv

    keys = self.cache_k[:bsz, : start_pos + seqlen]
    values = self.cache_v[:bsz, : start_pos + seqlen]

    # repeat k/v heads if n_kv_heads < n_heads
    keys = repeat_kv(
        keys, self.n_rep
    )  # (bs, cache_len + seqlen, n_local_heads, head_dim)
    values = repeat_kv(
        values, self.n_rep
    )  # (bs, cache_len + seqlen, n_local_heads, head_dim)

    # Q/K/V have shape [bsz, seq_len, num_heads, head_dim]. transpose swaps seq_len and num_heads, giving [bsz, num_heads, seq_len, head_dim], so that the last two dimensions are (seq_len, head_dim) for the matrix multiplications below.
    xq = xq.transpose(1, 2)  # (bs, n_local_heads, seqlen, head_dim)
    keys = keys.transpose(1, 2)  # (bs, n_local_heads, cache_len + seqlen, head_dim)
    values = values.transpose(
        1, 2
    )  # (bs, n_local_heads, cache_len + seqlen, head_dim)
    scores = torch.matmul(xq, keys.transpose(2, 3)) / math.sqrt(self.head_dim)
    if mask is not None:
        scores = scores + mask  # (bs, n_local_heads, seqlen, cache_len + seqlen)
    scores = F.softmax(scores.float(), dim=-1).type_as(xq)
    output = torch.matmul(scores, values)  # (bs, n_local_heads, seqlen, head_dim)
    output = output.transpose(1, 2).contiguous().view(bsz, seqlen, -1)
    return self.wo(output)

4.4 `rotate_half`

rotate_half is used frequently in RoPE implementations, so we analyze it in detail. rotate_half() rotates half of the hidden dimensions of the input tensor. Viewing each pair of components as a complex number, this is equivalent to multiplying by the imaginary unit i, that is, rotating the 2D vector counterclockwise by 90 degrees.

$f_q(x_m,m) = (W_1x_m)(\cos(m\theta) + i\sin(m\theta))$

Expanding the formula above into its cos and sin parts shows that the rotated result is $q_m$ multiplied by cos, plus a copy of $q_m$ whose two components are swapped and whose first component is negated, multiplied by sin. The swapped and negated copy is what the program calls rotate_half.

$\begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix} \begin{pmatrix} q_m^{(1)} \\ q_m^{(2)} \end{pmatrix} = \cos(m\theta) \begin{pmatrix} q_m^{(1)} \\ q_m^{(2)} \end{pmatrix} + \sin(m\theta) \begin{pmatrix} -q_m^{(2)} \\ q_m^{(1)} \end{pmatrix}$

There are actually two ways to implement rotate_half. Let’s look at one of them first. It negates the second half of the input tensor (the imaginary parts) and concatenates it in front of the first half (the real parts). The process is as follows:

Tensor segmentation: Assuming the input tensor x has the shape [batch_size, num_attention_heads, seq_len, head_size], the function first splits x along the last dimension into two parts: x1 holds the first half and x2 holds the second half.
Rotation operation: Negate x2 and move it in front of x1, so that the second half of the original tensor takes the place of the first half.
Concatenation: Concatenate -x2 and x1 along the last dimension to form the output tensor [-x2, x1].

The specific code is as follows.

# The second half and the first half swap places, and the second half is negated.
# The vector is split down the middle into two parts A and B, which are concatenated as [-B, A]: for example [q0,q1,q2,q3,q4,q5,q6,q7] -> [-q4,-q5,-q6,-q7,q0,q1,q2,q3]
def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    # First half of the last dimension (the first 64 entries when emb_size is 128): [batch_size, num_heads, seq_len, emb_size] => [batch_size, num_heads, seq_len, emb_size/2]
    x1 = x[..., : x.shape[-1] // 2]
    # Second half of the last dimension (the last 64 entries when emb_size is 128): [batch_size, num_heads, seq_len, emb_size] => [batch_size, num_heads, seq_len, emb_size/2]
    x2 = x[..., x.shape[-1] // 2 :]
    # Negate the second half and concatenate it in front of the first half
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed

Substitute rotate_half() into apply_rotary_pos_emb(), for example, with q=[x1, x2]:

q_embed = [x1, x2] * cos + [-x2, x1] * sin = [x1 * cos - x2 * sin, x2 * cos + x1 * sin]
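The substitution can be verified on a four-element vector with plain Python lists standing in for tensors. This is a minimal sketch: apply_rope_half_split is a hypothetical name, and the angles 0.3 and 1.1 are arbitrary stand-ins for $m\theta_1$ and $m\theta_2$.

```python
import math


def rotate_half(x):
    """Split x in the middle into [A, B] and return [-B, A]."""
    half = len(x) // 2
    return [-v for v in x[half:]] + list(x[:half])


def apply_rope_half_split(x, cos, sin):
    """Compute x * cos + rotate_half(x) * sin element-wise."""
    rotated = rotate_half(x)
    return [xi * c + ri * s for xi, ri, c, s in zip(x, rotated, cos, sin)]


angles = [0.3, 1.1]                      # hypothetical m * theta_1, m * theta_2
cos = [math.cos(a) for a in angles] * 2  # layout [c1, c2, c1, c2]
sin = [math.sin(a) for a in angles] * 2  # layout [s1, s2, s1, s2]

q = [1.0, 2.0, 3.0, 4.0]
q_embed = apply_rope_half_split(q, cos, sin)

# Element 0 is paired with element 2 (= 0 + d/2); both use the angle 0.3.
expected_0 = q[0] * math.cos(0.3) - q[2] * math.sin(0.3)
expected_2 = q[2] * math.cos(0.3) + q[0] * math.sin(0.3)
print(q_embed[0], expected_0)
print(q_embed[2], expected_2)
```

Each output element mixes exactly two input elements, the one at its own index and the one half a vector away, which is the 2D rotation written out component by component.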

See the diagram below for details. The negative sign here corresponds to the negative sign in the angle-sum formula. The computation of the rotation angle $m\theta$ is omitted here.

[Figure: rotate_half computation]

However, the code above follows HuggingFace’s Transformers library, which differs slightly from the formula in the RoPE paper. The difference lies in how the elements are paired. In the paper, the rotated q0 is computed from the pair (q0, q1), whereas in this code it is computed from the pair (q0, q_{d/2}).

HuggingFace: $[-q_4,-q_5,-q_6,-q_7,q_0,q_1,q_2,q_3]$
paper: $[-q_1, q_0, -q_3, q_2, \dots, -q_{n-1}, q_{n-2}]$

The correspondence is as follows.

[Figure: correspondence between the two pairings]

In fact, this involves two different implementations of splitting the feature dimensions.

The approach outlined in the RoPE paper is the GPT-J style. It groups adjacent dimensions together, that is, it performs the rotate_half operation on the odd and even dimensions of the feature vector ($\otimes$ in the formulas below denotes element-wise multiplication). The operation on the key vector $k_s$ is the same.

Because rotating odd and even dimensions requires interleaving pairs of dimensions, which is awkward to implement, later implementations split the feature dimension directly in half. This is called the GPT-NeoX style: rotate_half operates on the first and second halves of the feature vector. The two styles are equivalent up to a fixed permutation of the feature dimensions and can be converted to each other: the odd-numbered dimensions (counting from 1) in the GPT-J style correspond to the first half of the dimensions in the GPT-NeoX style, and the even-numbered dimensions correspond to the second half. Taking the odd-numbered dimensions of a GPT-J-style vector and concatenating them in front of the even-numbered dimensions yields the GPT-NeoX-style layout.

The two implementations differ only in their corresponding R matrices; both achieve the goal of encoding relative positions through absolute positions. The choice has no impact on the final result, provided the query and key use the same pairing (weights trained with one layout must be permuted before being used with the other). RoPE’s modification of the original vector essentially applies a 2D rotation to each pair of elements and concatenates all pairs. Whether consecutive elements or any other elements form a pair is acceptable, as long as the embedding dimension is even and no element belongs to more than one pair. Ultimately, the inner product in Attention perceives relative positional information because the inner product is a linear superposition of the contributions of the individual pairs. Which elements are paired together is irrelevant.
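The claim that the two layouts differ only by a permutation of the feature dimensions can be checked directly. The sketch below implements both pairings on plain Python lists and confirms that permuting the input and using the half-split rotation gives the same numbers as the interleaved rotation followed by the same permutation. All function names, the dimension 8, the position 5, and the sample vector are assumptions for illustration.

```python
import math


def rope_interleaved(x, pos, thetas):
    """GPT-J style: rotate the adjacent pairs (x[2j], x[2j+1])."""
    out = list(x)
    for j, th in enumerate(thetas):
        c, s = math.cos(pos * th), math.sin(pos * th)
        a, b = x[2 * j], x[2 * j + 1]
        out[2 * j], out[2 * j + 1] = a * c - b * s, b * c + a * s
    return out


def rope_half_split(x, pos, thetas):
    """GPT-NeoX style: rotate the pairs (x[j], x[j + d/2])."""
    half = len(x) // 2
    out = list(x)
    for j, th in enumerate(thetas):
        c, s = math.cos(pos * th), math.sin(pos * th)
        a, b = x[j], x[j + half]
        out[j], out[j + half] = a * c - b * s, b * c + a * s
    return out


def to_half_split_layout(x):
    """Move the 0-based even indices to the front and the odd indices to the back."""
    return list(x[0::2]) + list(x[1::2])


dim, pos = 8, 5
thetas = [10000.0 ** (-j / dim) for j in range(0, dim, 2)]
x = [0.1 * (i + 1) for i in range(dim)]

via_gptj = to_half_split_layout(rope_interleaved(x, pos, thetas))
via_neox = rope_half_split(to_half_split_layout(x), pos, thetas)
print(via_gptj)
print(via_neox)
```

In practice the permutation is usually folded into the rows of the query and key projection weights once, when converting a checkpoint from one convention to the other, rather than applied to the activations at run time.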

GPT-J style

Just like the original paper and blog, it groups two adjacent elements as a pair. Writing $\tilde{q}_m$ for the rotated vector:

$\begin{bmatrix} \tilde{q}_m^{(1)} \\ \tilde{q}_m^{(2)} \\ \tilde{q}_m^{(3)} \\ \tilde{q}_m^{(4)} \\ \vdots \\ \tilde{q}_m^{(d-1)} \\ \tilde{q}_m^{(d)} \end{bmatrix} = \begin{bmatrix} \cos(m\theta_1) \\ \cos(m\theta_1) \\ \cos(m\theta_2) \\ \cos(m\theta_2) \\ \vdots \\ \cos(m\theta_{d/2}) \\ \cos(m\theta_{d/2}) \end{bmatrix} \otimes \begin{bmatrix} q_m^{(1)} \\ q_m^{(2)} \\ q_m^{(3)} \\ q_m^{(4)} \\ \vdots \\ q_m^{(d-1)} \\ q_m^{(d)} \end{bmatrix} + \begin{bmatrix} \sin(m\theta_1) \\ \sin(m\theta_1) \\ \sin(m\theta_2) \\ \sin(m\theta_2) \\ \vdots \\ \sin(m\theta_{d/2}) \\ \sin(m\theta_{d/2}) \end{bmatrix} \otimes \begin{bmatrix} -q_m^{(2)} \\ q_m^{(1)} \\ -q_m^{(4)} \\ q_m^{(3)} \\ \vdots \\ -q_m^{(d)} \\ q_m^{(d-1)} \end{bmatrix}$

GPT-NeoX style

Instead of grouping adjacent elements together, it pairs element $q_m^{(i)}$ with element $q_m^{(i+d/2)}$, for example $q_m^{(1)}$ with $q_m^{(d/2+1)}$. Writing $\tilde{q}_m$ for the rotated vector:

$\begin{bmatrix} \tilde{q}_m^{(1)} \\ \tilde{q}_m^{(2)} \\ \vdots \\ \tilde{q}_m^{(d/2)} \\ \tilde{q}_m^{(d/2+1)} \\ \tilde{q}_m^{(d/2+2)} \\ \vdots \\ \tilde{q}_m^{(d)} \end{bmatrix} = \begin{bmatrix} \cos(m\theta_1) \\ \cos(m\theta_2) \\ \vdots \\ \cos(m\theta_{d/2}) \\ \cos(m\theta_1) \\ \cos(m\theta_2) \\ \vdots \\ \cos(m\theta_{d/2}) \end{bmatrix} \otimes \begin{bmatrix} q_m^{(1)} \\ q_m^{(2)} \\ \vdots \\ q_m^{(d/2)} \\ q_m^{(d/2+1)} \\ q_m^{(d/2+2)} \\ \vdots \\ q_m^{(d)} \end{bmatrix} + \begin{bmatrix} \sin(m\theta_1) \\ \sin(m\theta_2) \\ \vdots \\ \sin(m\theta_{d/2}) \\ \sin(m\theta_1) \\ \sin(m\theta_2) \\ \vdots \\ \sin(m\theta_{d/2}) \end{bmatrix} \otimes \begin{bmatrix} -q_m^{(d/2+1)} \\ -q_m^{(d/2+2)} \\ \vdots \\ -q_m^{(d)} \\ q_m^{(1)} \\ q_m^{(2)} \\ \vdots \\ q_m^{(d/2)} \end{bmatrix}$

The FlashAttention source code implements RoPE in both the GPT-J style and the GPT-NeoX style.

https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/layers/rotary.py

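import torch
from einops import rearrange, repeat


def rotate_half(x, interleaved=False):
    if not interleaved:
        # GPT-NeoX style: pair element i with element i + d/2
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat((-x2, x1), dim=-1)
    else:
        # GPT-J style: pair adjacent elements (even index, odd index)
        x1, x2 = x[..., ::2], x[..., 1::2]
        return rearrange(torch.stack((-x2, x1), dim=-1), '... d two -> ... (d two)', two=2)


def apply_rotary_emb_torch(x, cos, sin, interleaved=False):
    """
    x: (batch_size, seqlen, nheads, headdim)
    cos, sin: (seqlen, rotary_dim / 2)
    """
    ro_dim = cos.shape[-1] * 2
    assert ro_dim <= x.shape[-1]
    # Expand cos/sin to rotary_dim in the layout that matches the pairing:
    # [c1..c_{d/2}, c1..c_{d/2}] for the half-split style, [c1, c1, c2, c2, ...] when interleaved.
    pattern = 's d -> s 1 (2 d)' if not interleaved else 's d -> s 1 (d 2)'
    cos = repeat(cos, pattern)
    sin = repeat(sin, pattern)
    # Rotate only the first ro_dim features and pass the remaining features through unchanged
    return torch.cat([x[..., :ro_dim] * cos + rotate_half(x[..., :ro_dim], interleaved) * sin, x[..., ro_dim:]], dim=-1)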
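The two styles need cos and sin expanded in different orders, as the two column vectors of cosines in the GPT-J and GPT-NeoX formulas show. The sketch below spells out both layouts with plain lists. The function names are hypothetical, and the comments relating them to the einops groupings '(2 d)' and '(d 2)' describe how those patterns order a composed axis.

```python
def tile_twice(values):
    """Half-split (GPT-NeoX) layout, like the einops grouping '(2 d)': the whole list, then the whole list again."""
    return list(values) + list(values)


def repeat_each_twice(values):
    """Interleaved (GPT-J) layout, like the einops grouping '(d 2)': every entry twice in a row."""
    return [v for v in values for _ in range(2)]


cos_half = ["c1", "c2", "c3"]  # symbolic stand-ins for cos(m * theta_i)

print(tile_twice(cos_half))         # ['c1', 'c2', 'c3', 'c1', 'c2', 'c3']
print(repeat_each_twice(cos_half))  # ['c1', 'c1', 'c2', 'c2', 'c3', 'c3']
```

Using the tiled layout together with the interleaved rotate_half would multiply some pairs by the wrong angle, so the expansion order has to follow the pairing convention.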
