Exploring the Transformer (8) --- Position Encoding

0x00 Overview

Positional embedding is a technique for processing sequential data, used to represent the positions of words in an input sequence. It plays a crucial role in Transformer implementations. Transformers need to focus on two pieces of information for each input word: its meaning and its position in the sequence. Positional embedding provides a key supplement to addressing these two concerns.

To address the concern of “the meaning of input words”, Transformer encodes the meaning of words through an embedding layer. Positional encoding can inject position-related prior knowledge on top of this, such as: “similar tokens should have similar embeddings”, “relative position is more important than absolute position”, “relative positions at long distances do not need to be so accurate”, “relative positions that are farther apart are more ambiguous”, “the closer the token, the more important it is”, and “the farther the token, the less important it is on average”.
To address the concern of “the position of the input word in the sequence”, the Transformer uses a positional encoding layer to represent each word’s position. In sequence data, the order and position of words are crucial for semantic understanding. Traditional word vector representations capture only semantic information and neglect word position, while attention mechanisms suffer from permutation invariance. Positional encoding therefore assigns a unique position vector to each word and integrates this positional information into the model’s representation. This overcomes the permutation invariance of attention, enabling the model to better understand the context and the relationships between words when processing sequence data.

Ultimately, the Transformer combines the outputs of these two layers to encode two different types of information.

Note: In deep learning, a learned encoding is generally called an embedding, meaning that positional information is “embedded” into a vector space. For example, BERT’s position vector is learned, so it is called “Position Embedding”. In the original Transformer model, the position vector is calculated directly by a rule (trigonometric functions) without any learning process, so it is called Position Encoding.

0x01 Problem

1.1 The Importance of Word Order

Regardless of the language, a sentence is sequential data: each word occupies a specific position. The order and position of the words determine the sentence’s actual meaning, and the same words in a different order can mean something else. For example, the following two sentences have completely different meanings.

From Beijing to Shanghai.
From Shanghai to Beijing.

A classic historical example concerns Zeng Guofan. Had his report read “Your subject has fought repeatedly and been defeated repeatedly” (屡战屡败), he might have been dragged out and beheaded. Written with the same words in a different order, “Your subject has been defeated repeatedly yet fights on” (屡败屡战), it would have shown unparalleled loyalty and bravery and earned him a reward.

Therefore, the processing of natural language text information is a sequential task with a specific order. Positional information is quite important for understanding language. If the sequential information is not learned, the model performance will be greatly reduced. Therefore, it is necessary to introduce a mechanism to express position into the model.

1.2 Deficiencies of the Transformer Architecture

Position invariance

The Transformer model abandons RNNs and CNNs as the basic models for sequence learning, replacing them entirely with an attention mechanism. For an input sentence, the words are no longer input sequentially, but rather all the words in a sequence are input at once. It relies on a pure self-attention mechanism to capture the relationships between words and directly perform feature transformation on the entire sequence.

Attention operations are global operations that can capture long-range dependencies in a sentence. The Transformer allows any two elements $x_t$ and $x_s$ in a sequence to exchange information regardless of their absolute positions $t$ and $s$ and their relative position $t-s$ , and in this way calculates the attention weight of each element with respect to the entire sequence. In other words, any two positions are connected in a single step whatever the distance between them, which prevents long-range information from being lost or forgotten.

$q_t = W_Q x_t\quad k_t = W_K x_t\quad v_t = W_V x_t$

$q_s = W_Q x_s\quad k_s = W_K x_s\quad v_s = W_V x_s$

$A_{t,s} = q_t^T k_s = x_t^T (W_Q)^T W_K x_s$

$\alpha_{t,s}=\mathrm{softmax}\left(\frac{A_{t,s}}{\sqrt{d}}\right)=\frac{e^{q_t^T k_s/\sqrt{d}}}{\sum_{s'} e^{q_t^T k_{s'}/\sqrt{d}}}$

$\text{encoder: } h_t = \sum_{s=1}^{T} \alpha_{t,s} v_s$

$\text{decoder: } h_t = \sum_{s=1}^{t} \alpha_{t,s} v_s$

However, if positional relationships are taken into account, the advantage of Transformer becomes a disadvantage, because Transformer itself has no concept of position or order; it has position invariance or permutation invariance.

Permutation invariance means that arbitrarily changing the positions of words does not change the attention weights between those words, so the overall result of the attention layer is unchanged. That is, however the positions change, the result calculated for each word vector is exactly the same as before; only the arrangement of the word vectors in the output matrix is adjusted as the word positions are swapped. Assume that positional encoding is not included and consider the self-attention formula $A=\mathrm{softmax}(QK^T)V$ in the following example. If $q$ is “I”, the attention output is identical whether the sentence is “I love you” or “you love I”. Without positional information, sentences with different word orders and different meanings yield the same attention output, which makes accurate modeling impossible.

$A = \mathrm{softmax}(QK^T)V$

import torch
import torch.nn.functional as F
d = 8 # word embedding dimension
l = 3 # sentence length
q = torch.randn(1,d) # query for "I"
k = torch.randn(l,d) # keys for "I love you"
v = torch.randn(l,d) # values for "I love you"

orig_attn = F.softmax(q@k.transpose(1,0),dim=1)@v

# Reverse the word order
k_shift = k[[2,1,0],:] # "you love I"
v_shift = v[[2,1,0],:] # "you love I"
shift_attn = F.softmax(q@k_shift.transpose(1,0),dim=1)@v_shift

print('I love you:',orig_attn)
print('you love I:',shift_attn)

Proof

Next, we will prove permutation invariance. In a Transformer without positional encoding (Transformer-NoPE), both the Embedding layer and the FFN layer are point-wise: they process each token independently of its position or order. Only the attention module mixes information across positions, so we only need to check whether the attention mechanism is permutation-equivariant, that is, whether permuting the input merely permutes the output in the same way.

Suppose we have a permutation matrix $P_\pi$ , then $P_\pi X$ is the result of rearranging the rows of $X$ (which can be understood as shuffling the order of K and V by row, equivalent to shuffling the word order in a sentence). Note that $P_\pi$ only affects the rows of $X$ , not the column order. Let’s see how this input permutation affects the attention mechanism’s operating logic and results.

Changes in the query and key matrices

The original query and key matrices are:

$Q = XW^{(q)},\quad K = XW^{(k)}$

After the permutation, both matrices have their rows permuted in the same way, that is:

$Q' = P_\pi XW^{(q)} = P_\pi Q,\quad K' = P_\pi XW^{(k)} = P_\pi K$

Changes in the attention matrix

The original formula for calculating attention was:

$A = \frac{1}{\sqrt d}QK^\top$

The attention calculation will also change after the permutation, that is:

$A' = \frac{1}{\sqrt d}Q'(K')^\top =\frac{1}{\sqrt d}(P_\pi Q)(P_\pi K)^\top =\frac{1}{\sqrt d}P_\pi QK^\top P_\pi^\top = P_\pi A P_\pi^\top$

Changes in attention scores

The original formula for calculating attention score was:

$\mathrm{SoftMax}(A)_i = \frac{e^{A_i}}{\sum_j e^{A_j}}$

After the permutation, the attention score becomes:

$\mathrm{SoftMax}(A')_i = \frac{e^{(P_\pi A P_\pi^\top)_i}}{\sum_j e^{(P_\pi A P_\pi^\top)_j}}$

Softmax is applied to each row independently, and within a row it does not depend on the order of the entries. Applying the same permutation to the rows and columns of $A$ therefore only rearranges the Softmax outputs in the same way:

$\mathrm{SoftMax}(P_\pi A P_\pi^\top)=P_\pi\mathrm{SoftMax}(A)P_\pi^\top$

Final result

Combining the above changes: since $V' = P_\pi V$ and $P_\pi^\top P_\pi = I$ , the attention output becomes $\mathrm{SoftMax}(A')V' = P_\pi\,\mathrm{SoftMax}(A)\,P_\pi^\top P_\pi V = P_\pi\,\mathrm{SoftMax}(A)\,V$ . The output rows are simply permuted together with the inputs, so the result calculated for each word is exactly the same regardless of the sentence order. In this sense, assuming $x_s$ and $x_t$ represent the s-th and t-th input words respectively, we have $T(...,x_s,...,x_t,...) = T(...,x_t,...,x_s,...)$ , and the $T$ function naturally satisfies the identity $T(x,y)=T(y,x)$ , making it impossible to distinguish whether the input is $x,y$ or $y,x$ .

$T(...,x_s,...,x_t,...) = T(...,x_t,...,x_s,...)$
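The derivation above can also be checked numerically. The following is a minimal sketch, assuming a single attention head with random weights, no mask, and no positional encoding; the sizes `n` and `d` and the helper name `self_attention` are illustrative choices, not part of the original text. It checks that permuting the input rows only permutes the output rows, and that an order-independent readout (here a mean over tokens) cannot see the permutation at all.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d = 5, 8                                  # sequence length and width (arbitrary)
X = torch.randn(n, d)                        # token embeddings, no positional encoding
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))

def self_attention(X):
    """Single-head self-attention without mask or positional encoding."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = Q @ K.T / d ** 0.5
    return F.softmax(A, dim=-1) @ V

perm = torch.randperm(n)                     # a random permutation pi
P = torch.eye(n)[perm]                       # permutation matrix: (P @ X)[i] == X[perm[i]]

out = self_attention(X)
out_perm = self_attention(P @ X)

# Attention(P X) == P Attention(X): the outputs are only reordered.
equivariant = torch.allclose(out_perm, P @ out, atol=1e-4)
# A mean over tokens is therefore identical for both word orders.
pooled_invariant = torch.allclose(out_perm.mean(dim=0), out.mean(dim=0), atol=1e-4)
print(equivariant, pooled_invariant)
```

Both checks are expected to hold for any permutation, since nothing in the computation refers to a row index.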

Consequences

Therefore, without positional information, the Transformer has two problems.

The first problem is that the model cannot capture the order of the sequence.

A self-attention model without positional information is at best a very sophisticated “bag-of-words” model, where the model treats the sequence as a set. Since it’s a set, the model treats every word in the input sequence equally, naturally lacking positional information, and the hidden state is independent of temporal order. If a word appears multiple times in different positions, the weighted sum of attention calculated each time will be exactly the same.

For example, the final attention output for the token at position j is as follows. The formula contains no term that describes position, only a summation over all tokens, and a sum does not depend on the order of its terms; this is a bag-of-words model.

$z_j = \sum_{i=0}^{n} \mathrm{softmax}\left(\frac{q_j^T\cdot k_i}{\sqrt d}\right)v_i$

Furthermore, given a sentence, the final word embedding combination of that sentence only comes from the features of all the words in the sentence, and has nothing to do with the order of the words in the sentence, that is, it loses the positional information between words. Changes in the position of the input elements will not affect the attention result, so as long as the elements contained in the set are determined, the output result is determined.

However, this clearly contradicts the inherent characteristics of sequences such as language, code, and speech: rearranging the word order in a sentence can alter its meaning, the objects each word refers to or modifies, and even the semantics of the words themselves. For example, inputting both “[I, love, you]” and “[you, love, I]” into a Transformer, this bag-of-words model will produce identical sentence representations. What we expect is that the word vector for “love” should produce different outputs in the sentences “I love you” and “I you love,” because the position of the word “love” in the sentences has actually changed. We are inputting two different sentences; a word in two different sentences should have different vector representations, but the neural network cannot capture this change.

                                     [ [0.3,  0.5,  0.1,  0.4],
[I, love, you]  => Transformer =>      [0.1, -0.6, -0.2,  0.3],
                                       [0.3,  0.5,  0.3, -0.1] ]

                                     [ [0.3,  0.5,  0.1,  0.4],
[you, love, I]  => Transformer =>      [0.1, -0.6, -0.2,  0.3],
                                       [0.3,  0.5,  0.3, -0.1] ]

The second problem is that the weight between words is independent of their position.

Regardless of how the positions of $t$ and $s$ change, the attention weight $A_{t,s}$ between them remains unchanged, meaning they are position-independent. However, this contradicts the characteristics of language: most of the time, words that are closer together are likely to be more related, and we want them to have a larger attention weight; two words that are far apart may be unrelated, and we want their attention weight to be smaller.

1.3 Solution Approach

Since the self-attention mechanism in Transformer cannot capture the order of the input element sequence, we need to model the positional relationships and incorporate the word order into the Transformer architecture to break this permutation invariance. Thus, the Transformer authors proposed the Position Embedding method, also known as “position vector” or “position encoding”.

The purpose of positional encoding is to add a unique positional encoding vector to each position, essentially vectorizing word order information. Each word in the input already has a corresponding embedding vector, which is independent of position. To mark positions, a second set of vectors with the same dimension is needed, where each vector uniquely represents one position in the sentence. The input to the Transformer layers is then formed by directly adding each word embedding to its corresponding positional embedding, and the resulting matrix is passed to the subsequent layers.

This introduces information about each word’s specific position in the sentence, conceptually replacing every input $x_s$ with $x_s+p_s$ , similar to $T(...,x_s,...,x_t,...) = T(...,x_s+p_s,...,x_t+p_t,...)$ . The attention mechanism can then distinguish words in different positions. The model knows not only which word to focus attention on but also the distance between words, so it can take the relative positions of two elements into account when calculating the attention score. This gives the model the ability to handle sequence problems. Whatever each input vector goes on to learn in later layers, it carries information about its original position in the sequence, which provides a reference for subsequent outputs.

$T(...,x_s,...,x_t,...) = T(...,x_s+p_s,...,x_t+p_t,...)$
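To make the effect of adding position vectors concrete, here is a small sketch under the same assumptions as a bare single-head attention layer with random weights. The position vectors `pos` are random placeholders chosen only for illustration (they are not the sinusoidal scheme), and `self_attention` is a hypothetical helper. Swapping two words leaves each word's output unchanged when no positions are added, but changes it once each slot carries its own position vector.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d = 4, 8
X = torch.randn(n, d)                        # word embeddings x_1..x_n
pos = torch.randn(n, d)                      # placeholder position vectors p_1..p_n (illustrative)
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))

def self_attention(H):
    """Single-head self-attention without mask."""
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    return F.softmax(Q @ K.T / d ** 0.5, dim=-1) @ V

perm = torch.tensor([1, 0, 2, 3])            # swap the first two words

# Without positions: each word gets the same output vector, just in a new row.
same_without_pe = torch.allclose(self_attention(X[perm]),
                                 self_attention(X)[perm], atol=1e-4)

# With positions: the word moves but the position vector stays with its slot,
# so the same word now produces a different output.
same_with_pe = torch.allclose(self_attention(X[perm] + pos),
                              self_attention(X + pos)[perm], atol=1e-4)
print(same_without_pe, same_with_pe)
```

The first flag should be true and the second false, which is exactly the broken symmetry that positional encoding is meant to provide.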


1.4 Required Properties

The paper “A Length-Extrapolatable Transformer” mentions three design principles for transformer position modeling: position sensitivity; robustness to position translation; and extrapolation capability. An excerpt from the original paper is as follows:

First, a Transformer should be sensitive to order. Otherwise, it will degenerate into a bag-of-word model which confuses the whole meaning.

Then, position translation can’t hurt the representation a lot especially combining with the proper attention-mask operations.

After that, a good sequence model needs to deal with any input length.

The paper “On Position Embeddings in BERT” points out that position embedding is used to model the temporal characteristics of positions. Based on this, the paper proposes that position embedding should possess three properties: translation invariance (the relationship between two positions depends only on their relative positions), monotonicity (proximity decreases with increasing distance), and symmetry (the relationship between two positions is symmetric, i.e. i,j is the same as j,i). An excerpt from the original paper is as follows:

Informally, as positions are originally positive integers, one may expect position vectors in vector space to have the following properties: 1) neighboring positions are embedded closer than faraway ones; 2) distances of two arbitrary m-offset position vectors are identical; 3) the metric (distance) itself is symmetric.

Property 1. Monotonicity: The proximity of embedded positions decreases when positions are further apart.

Property 2. Translation invariance: The proximity of embedded positions are translation invariant.

Property 3. Symmetry: The proximity of embedded positions is symmetric.

Following these points, let’s examine in detail the properties that a good positional encoding should possess. Ideally, a positional encoding should meet the following criteria:

Uniqueness/Determinism. Each position needs a consistent code regardless of the sequence length; that is, the token at position 5 should have the same code whether the current sequence length is 10 or 10,000. Furthermore, this position code must be deterministic, meaning each position has a unique code (or as different as possible) to reflect the difference between the same token at different positions and ensure the model’s ability to distinguish positions. Ideally, the position codes should be generated from a deterministic process, allowing the model to effectively learn the mechanism behind the encoding scheme.
Boundedness: The encoding range is bounded; the value must be within a certain range to avoid overflow. Because positional information itself is a correction parameter, the positional encoding number should not increase infinitely as the sentence lengthens, as this could easily affect the semantic vector of the word itself.
Relativity: For the model, what is truly important is often not the absolute position, but the relative position between tokens. Therefore, we expect that positional encoding can express both absolute positional information (representing the difference between different positions of the same word in the sequence, i.e., the absolute position of the token in the sequence) and relative positional information (if there is a set of words whose meaning does not change regardless of their position, then we consider that the actual meaning of this set of words is unrelated to their absolute position).
Monotonicity, or distance decay: The greatest significance of positional encoding is to provide the model with position-dependent relevance. Positional relevance should decrease monotonically as relative distance increases: closer elements have higher relevance, and farther elements have lower relevance. Distance decay is equivalent to soft window attention. This matches the habits of natural language, where nearby words are more strongly correlated: tokens at nearby positions should, on average, receive more attention, while tokens that are farther apart receive less. Monotonicity can also be compared to a convolutional neural network prioritizing local information during aggregation, where closer elements are weighted more heavily.
Translation invariance: The relative distance between any two positions should be consistent across sentences of different lengths. Specifically, the relationship between two positions depends only on their relative positions and is independent of the sequence length. In sequences of different lengths, the relative positions/distances between tokens at the same position remain consistent (reflecting the invariance of differences between token positions). For example, in sentences of length 10 and 100, the distance between the first and fifth words should be the same. That is, if the relative distance between two tokens is k in sentence 1 and also k in sentence 2, then the correlation between the two tokens should be consistent across both sentences, i.e., attention_sample1(token1, token2) = attention_sample2(token1, token2).
Linear relationship: The relationship between positions should be mathematically simple, ideally linear. If the code of position p is known, then calculating the code of position p+k should be straightforward, allowing the model to learn positional patterns more easily.
Multidimensionality: As multimodal models become the norm, positional encoding schemes should extend naturally to multiple dimensions, from 1D to nD. This enables models to use data such as images or brain scans, which are 2D and 4D respectively.
Extrapolation, or generalization: Positional encodings can generalize to sequences longer than those encountered during training. To improve the model’s usability in the real world, they should generalize beyond the training distribution. Therefore, encoding schemes need sufficient adaptability to handle unexpected input lengths without violating any other ideal properties. Specifically, the encoding system should be unaffected by sentence length (i.e., applicable to arbitrary text lengths) and perform well on samples not seen during training, as well as on lengths never seen before. It transforms the unseen into the known, and the outside distribution into the inside distribution.
Periodicity: This property is for implementation considerations. Because the requirement is relative and bounded, it is easy to associate it with a property—periodicity, so that values at greater distances can be the same as values at closer distances, thus providing a certain degree of extrapolation.
Incorporating semantic information: In tasks involving long-context understanding and search, attention mechanisms should prioritize semantic similarity rather than being overshadowed by positional information, since the relevance of positional encoding may be low over longer distances. Therefore, positional encoding should combine semantic and positional information so that semantic information is not excessively affected by positional distance.
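Several of these properties can be tested mechanically on a candidate scheme. The sketch below uses the sinusoidal encoding examined in the next section as the test subject and takes the inner product of two position vectors as the "proximity" measure; both choices, the helper name `sinusoidal_pe`, and the sizes are assumptions made for illustration. It checks boundedness, symmetry, translation invariance, and decay over the first few offsets only (this is not a claim that the decay is monotonic at every distance).

```python
import math
import torch

def sinusoidal_pe(max_len, d_model):
    """Sinusoidal positional encoding of shape [max_len, d_model] (float64 for clean checks)."""
    position = torch.arange(max_len, dtype=torch.float64).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float64)
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model, dtype=torch.float64)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_pe(max_len=200, d_model=64)
sim = pe @ pe.T                               # sim[i, j] = <pe_i, pe_j>, the proximity measure

bounded = bool(pe.abs().max() <= 1.0)         # every component stays in [-1, 1]
symmetric = torch.allclose(sim, sim.T)        # proximity(i, j) == proximity(j, i)
# The same offset gives the same proximity wherever it occurs in the sequence.
translation_invariant = (torch.allclose(sim[0, 10], sim[50, 60])
                         and torch.allclose(sim[3, 40], sim[103, 140]))
# Proximity to position 0 drops over the first few offsets.
first = sim[0, :4].tolist()
local_decay = all(first[k] > first[k + 1] for k in range(3))
print(bounded, symmetric, translation_invariant, local_decay)
```

The same four checks can be pointed at any other encoding matrix of shape [max_len, d_model] to compare schemes against the list above.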

0x02 Evolution of Encoding Schemes

Now that we know the properties of an ideal positional encoding, we can try to design and iterate on positional encoding schemes step by step from scratch, and compare how far each scheme falls short of the expected properties. Positional encoding involves a combination of imagination, experimentation, and demonstration; put differently, introducing positional information into an LLM is much like feature engineering, where the feature in question is position.

To better illustrate this, let’s first look at the Harvard code. This is the approach described in the Transformer paper. The overall goal of the function is to calculate the relevant positional information for each dimension (each column). Therefore, we need:

Initialize an absolute position matrix position of shape (max_len, 1). This corresponds to position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1) in the code. In position, the absolute position of a word is represented by its index. The next step is to turn this positional information into the positional encoding matrix: the (max_len, 1) position matrix must be expanded to shape (max_len, d_model) and written into the zero-initialized positional encoding matrix. This requires constructing a scaling vector, div_term.
Constructing div_term corresponds to the code div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)), and the specific operation is as follows.
- Each integer position is divided by 10000^(2i / d_model), which gives every sine/cosine pair its own wavelength and keeps the final encoding values within [-1, 1]; bounded inputs of this kind are generally friendlier to gradient descent than raw integer positions.
- The torch.exp() function builds div_term as a vector of d_model/2 values (256 when d_model = 512), one per sine/cosine pair. Multiplied by position, it gives the content within the parentheses of the sin and cos functions in the formulas. Unlike the formula in the original paper, the Harvard code expresses the power through e and ln, which is an equivalent rewriting.

The specific details of div_term are as follows:

$\mathrm{div\_term}_i = 10000^{-2i/d_{model}} = \frac{1}{10000^{2i/d_{model}}},\quad i = 0, 1, \dots, d_{model}/2 - 1$
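As a quick check of this formula, the following sketch compares the exp/log form used in the code with a direct power computation. It assumes d_model = 512 (the value implied by the "256-dimensional" remark); the variable names `two_i`, `div_term_exp`, and `div_term_pow` are introduced here for illustration.

```python
import math
import torch

d_model = 512
two_i = torch.arange(0, d_model, 2).float()           # 2i = 0, 2, ..., d_model - 2

# Form used in the code: exp(-2i * ln(10000) / d_model)
div_term_exp = torch.exp(two_i * (-math.log(10000.0) / d_model))
# Form written in the formula: 1 / 10000^(2i / d_model)
div_term_pow = 1.0 / (10000.0 ** (two_i / d_model))

print(div_term_exp.shape)                              # torch.Size([256]): one value per sin/cos pair
print(torch.allclose(div_term_exp, div_term_pow, rtol=1e-4))
print(div_term_exp[0].item(), div_term_exp[-1].item()) # from 1.0 down to roughly 1e-4
```

The two forms agree up to floating-point rounding, so the choice between them is a matter of convenience rather than of different mathematics.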

The positional encoding pe is constructed using position and div_term, calculating the relevant positional information for each dimension (each column). Specifically, the positional encoding is generated using the sine function torch.sin(position * div_term) and the cosine function torch.cos(position * div_term). The positional encoding is a d_model-dimensional vector. For each dimension of this vector, if the dimension is even, it is encoded using the sine function; if the dimension is odd, it is encoded using the cosine function.

Use unsqueeze(0) to add a leading batch dimension of size 1, so that pe broadcasts across all sentences in a batch.

Here are a few points to note in the code:

The input X of the Transformer model is [batch_size, max_len, d_model], which represents the encoding of batch_size sentences. Positional encoding encodes the position of all words within a sentence. Since these positional encodings are added to X, the dimension of the encoding for a position is the same as the dimension of the encoding for a word, d_model. Thus, the positional encoding of a sentence is a tensor of [max_len, d_model] dimensions, and the positional encoding of batch_size sentences is a tensor of [batch_size, max_len, d_model] dimensions.
The code div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)) rewrites the power as the exponential of a logarithm: 10000^(-2i / d_model) = exp(-2i * ln(10000) / d_model). The two forms are mathematically equivalent. The exp/log form yields the reciprocal 1 / 10000^(2i / d_model) directly, so position is multiplied by div_term instead of being divided by a power. Note that 10000^(2i / d_model) only ranges from 1 to just under 10000, so overflow or underflow is not a practical concern here. The rewriting is sometimes described as more numerically stable or faster, but any such difference is small because div_term is computed only once, at construction time.
The Module.register_buffer function is called. register_buffer is typically used to store tensors that belong to the module's state but are not model parameters, such as running_mean in BatchNorm: the optimizer does not update it, but the model uses it during prediction and it is saved with the model. Here, pe is a pre-calculated constant used during forward propagation, so it is stored with register_buffer().

The specific code is as follows.

import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        """
        d_model: word embedding dimension
        max_len: maximum sequence length
        dropout: dropout probability

        PE(pos, 2i)     = sin(pos / 10000^(2i/d_model)),  0 <= i < d_model/2
        PE(pos, 2i + 1) = cos(pos / 10000^(2i/d_model))
        PE(p+k, 2i) = PE(p, 2i)PE(k, 2i+1) + PE(p, 2i+1)PE(k, 2i)
        because sin(a+b) = sin(a)cos(b) + cos(a)sin(b)

        Note: the linear relationship holds only within the same pair index i,
        across tokens at different positions in the sequence.
        """
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Zero matrix of shape [max_len, d_model] that will hold the positional encodings.
        pe = torch.zeros(max_len, d_model)

        # Absolute position matrix: arange gives the integers 0..max_len-1, and unsqueeze
        # turns them into a column so that they broadcast against div_term.
        # With max_len = 500 this is tensor([[0.], [1.], ..., [499.]]), shape torch.Size([500, 1]).
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)

        # div_term holds the d_model/2 scaling factors 1 / 10000^(2i/d_model), one per
        # sin/cos pair (256 values when d_model = 512). It is computed as
        # exp(-2i * ln(10000) / d_model), which equals the power form in the paper.
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        # position * div_term broadcasts to [max_len, d_model/2]:
        # even columns get the sine, odd columns get the cosine.
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        # Add a leading batch dimension: pe now has shape [1, max_len, d_model].
        pe = pe.unsqueeze(0)
        # Register pe as a buffer: saved with the model and moved across devices, but not trained.
        self.register_buffer("pe", pe)

    def forward(self, x):
        """
        x: word embeddings of shape [batch_size, seq_len, d_model].
        Returns dropout(embedding + positional encoding).
        """
        # max_len is usually larger than the input, so take only the first x.size(1) rows
        # of pe (batch_size is dim 0, seq_len is dim 1) and let them broadcast over the batch.
        # pe is a constant, so no gradient is needed.
        x = x + self.pe[:, : x.size(1)].requires_grad_(False)
        # Apply dropout to the sum and return it.
        return self.dropout(x)
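The addition identity quoted in the docstring can be verified numerically. This is a minimal sketch that recomputes the sine and cosine columns directly rather than going through the class; the sizes and the names `pe_sin`, `pe_cos`, `p`, and `k` are illustrative assumptions. It confirms that the encoding of position p+k is a fixed linear combination of the encodings of p and k within each sin/cos pair.

```python
import math
import torch

max_len, d_model = 50, 16
position = torch.arange(max_len, dtype=torch.float64).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float64)
                     * (-math.log(10000.0) / d_model))

pe_sin = torch.sin(position * div_term)      # PE(pos, 2i),   shape [max_len, d_model/2]
pe_cos = torch.cos(position * div_term)      # PE(pos, 2i+1), shape [max_len, d_model/2]

p, k = 7, 5
# sin(a+b) = sin(a)cos(b) + cos(a)sin(b)
sin_pk = pe_sin[p] * pe_cos[k] + pe_cos[p] * pe_sin[k]
# cos(a+b) = cos(a)cos(b) - sin(a)sin(b)
cos_pk = pe_cos[p] * pe_cos[k] - pe_sin[p] * pe_sin[k]

identity_holds = (torch.allclose(sin_pk, pe_sin[p + k])
                  and torch.allclose(cos_pk, pe_cos[p + k]))
print(identity_holds)
```

Because the combination uses only the values at offset k, the shift from p to p+k acts as the same linear map for every p, which is the "linear relationship" property listed earlier.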

2.1 Integer Position Encoding

Our initial approach is to add the integer value of the token position to each component of the token embedding, essentially assigning a number linearly to each time step with a fixed step size. For example, the position code of the token at position i is i. The value of i ranges from 0 to L-1, where L is the length of the current sequence. The mathematical mapping is f(x) = x. For instance, tokenizing the Chinese sentence “我喜欢苹果” (“I like apples”) character by character gives the following position codes:

我 0
喜 1
欢 2
苹 3
果 4

We modified the Harvard code as follows, where all indexes in the word dimension were changed to positional values.

import torch

max_len = 5
d_model = 4
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len).unsqueeze(1)
pe[:, 0::2] = position
pe[:, 1::2] = position
print(pe)

The output is:

tensor([[0., 0., 0., 0.],
        [1., 1., 1., 1.],
        [2., 2., 2., 2.],
        [3., 3., 3., 3.],
        [4., 4., 4., 4.]])

The advantages of this approach are its simplicity and speed of computation. However, it has several disadvantages:

The positional encoding values are unbounded (monotonically increasing), which can easily lead to excessively large encoding values. This, in turn, brings several problems:
It doesn’t handle long input sequences well. If there’s no limit to sentence length, the positional code of the last character is much larger than that of the first character. That is, if a sentence has 1000 characters, the positional code of the last character is 999, much larger than the positional code of the first character (0). A larger number indicates a greater weight at that position, but this doesn’t necessarily reflect the true weight of the sequence.
Positional encoding can interfere with the model. The positional encoding and the token embedding are added together before being passed to the attention mechanism. If the word embedding values are small, the positional encoding can dominate the result. For example, if the last word’s embedding value is 0.5 but its positional encoding is 999, the contribution of 0.5 is almost lost after adding 999. This means a very low signal-to-noise ratio, making it difficult for the model to separate semantic information from positional information, which hurts the model’s performance. Furthermore, such large input values can destabilize training (for example, by saturating the softmax and causing vanishing gradients).
Lack of generalization ability. Positional encoding is calculated based on a fixed length, but the model may encounter sequences longer during inference than during training. This means that if the sequence is too long, the positional encoding may fail. The model does not handle positions that were not encountered during training well. For example, if the longest sequence during training is 1000, but the sequence exceeds 1000 during inference, it will be difficult to generalize.
It’s difficult to integrate with attention mechanisms. Currently, positional relationships are expressed using the following method: $f(m-n)=f(m)-f(n)=m-n$ . However, this $f(x)$ is not well suited to self-attention, whose scores are computed through inner products (multiplications) between vectors, so a difference of positions does not arise naturally from them. Furthermore, since the parameters in the self-attention layer are trained in a way that is independent of sequence length, introducing any quantity that depends on sequence length distorts what those parameters are trained to represent.
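The signal-to-noise problem in the list above is easy to reproduce. The sketch below uses two synthetic, unrelated word embeddings with small values and adds the integer position code 999 to every component, as this scheme prescribes; the embedding scale 0.5, the width 64, and the variable names are assumptions chosen to mirror the example in the text.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model = 64
emb_a = 0.5 * torch.randn(d_model)           # two unrelated word embeddings (synthetic)
emb_b = 0.5 * torch.randn(d_model)

pos = 999.0                                  # integer position code added to every component

cos_plain = F.cosine_similarity(emb_a, emb_b, dim=0).item()
cos_with_pos = F.cosine_similarity(emb_a + pos, emb_b + pos, dim=0).item()

print(f"cosine similarity without position: {cos_plain:.4f}")
print(f"cosine similarity with position   : {cos_with_pos:.4f}")
```

Once the position code is added, the two vectors point in almost exactly the same direction, so the semantic difference between the words is nearly invisible to a downstream inner product.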

We will now examine how to solve the above problems in two parts. One is to try mechanisms other than subtraction (multiplication representation). The other is how to solve the problem of unbounded positional encoding values (normalization).

2.2 Multiplication Representation

Since subtraction is difficult to integrate with attention, and the inner product method limits our calculations to multiplication, let’s try to construct a distance function f(x) using the properties of multiplication to see if it can express distance relationships, such as the following formula:

$f(m-n)=f(m)\times f(n)$

The problem lies in the fact that multiplication has a commutative property, making it impossible to distinguish the order of operations. For example:

$f(m-n)=f(m)\times f(n)=f(n)\times f(m)=f(n-m)$

This would create a problem where it’s impossible to distinguish between “going from Beijing to Shanghai” and “going from Shanghai to Beijing”. Therefore, we need to refine the requirements for f(x) so that f(x) not only needs to satisfy:

$f(m-n)=f(m)\times f(n)$

It also needs to not satisfy the commutative law:

$f(m-n)=f(m)\times f(n)\neq f(n)\times f(m)=f(n-m)$

Let’s look at vector multiplication. For tokens at positions m and n, V(m) and V(n) correspond to their position vectors. We find that vector dot product multiplication also satisfies the commutative law, so it’s not suitable either.

$f(m-n)=V(m)\times V(n)=V(n)\times V(m)=f(n-m)$

Let’s look at matrix multiplication again. We find that the multiplication of matrices does not satisfy the commutative law, that is:

$R_m^T\times R_n \neq R_n^T\times R_m$

Therefore, the matrix multiplication property satisfies our increasingly stringent conditions and comes into focus. This lays the foundation for the subsequent analysis of RoPE.
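As a rough illustration of why matrices fit these requirements, the sketch below uses 2x2 rotation matrices. Choosing rotation matrices, and the angle step `theta = 0.1`, are assumptions made here for illustration (they anticipate RoPE); the text above only requires that the matrix product be order-sensitive.

```python
import numpy as np


def rotation(pos: float, theta: float = 0.1) -> np.ndarray:
    """2x2 rotation matrix for the angle pos * theta (theta is arbitrary)."""
    angle = pos * theta
    return np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle), np.cos(angle)]])


m, n = 3, 7
forward = rotation(m).T @ rotation(n)   # R_m^T R_n
backward = rotation(n).T @ rotation(m)  # R_n^T R_m

# The product depends only on the offset between the two positions ...
print(np.allclose(forward, rotation(n - m)))
print(np.allclose(backward, rotation(m - n)))
# ... and swapping the roles of m and n gives a different matrix.
print(np.allclose(forward, backward))
```

With this choice, $R_m^T\times R_n$ encodes the signed offset $n-m$, so "from m to n" and "from n to m" are distinguishable.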

2.3 Normalization

To address the issue of unbounded positional encoding values caused by integer values, we decided to try restricting the positional encoding to a certain value range. The simplest approach is to normalize the positional encoding to the range [0,1] using the sentence length, i.e., Position Encoding = position / (seq_len), which results in an arithmetic sequence.

The code has been modified as follows:

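import torch

max_len = 5   # sequence length used for normalization
d_model = 4   # embedding dimension
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len).unsqueeze(1)   # shape (max_len, 1)
# Every dimension gets the same normalized value position / max_len
pe[:, 0::2] = position / max_len
pe[:, 1::2] = position / max_len
print(pe)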

The output is:

tensor([[0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000],
        [0.4000, 0.4000, 0.4000, 0.4000],
        [0.6000, 0.6000, 0.6000, 0.6000],
        [0.8000, 0.8000, 0.8000, 0.8000]])

While this solves the aforementioned problem, it introduces a new one: this normalization method can cause issues with sequences of varying lengths.

The reason is that the normalization methods for sequences of different lengths are inconsistent. Because the sequence lengths of different sentences may vary, the stride of the positional encoding for different lengths is also different. This leads to different positional encodings for the same position in different sentences, which in turn confuses our model, which is not what we want. Since the most crucial aspect of positional information is relative positional information, using the method in this example weakens the relative order relationships in long texts. The difference in positional encoding between two adjacent characters in a shorter text can be orders of magnitude different from that in a longer text. Let’s take the following two sentences as examples:

It has been almost ten years since I last experienced autumn in the North.
The locust tree in the north is also an embellishment that evokes the feeling of autumn.

Both sentences are translated from Chinese, where “the North” is written with two adjacent characters, “北” and “国”, and positions are counted per character. According to the relative-position requirement, the distance between two tokens that are the same number of positions apart should be identical in sequences of different lengths. This scheme fails to meet that requirement. In the first sentence (13 characters), PE(北) = 2/13 and PE(国) = 3/13, so PE(国) - PE(北) = 1/13. In the second sentence (20 characters), PE(北) = 0/20 and PE(国) = 1/20, so PE(国) - PE(北) = 1/20. The same adjacent pair gets two different distances because no single normalization constant works for all sequence lengths.


Let’s go back to the code to verify this. We can see that in the example above, the distance between each token is 0.2. If we change max_len to 10, we get the following result: the distance between each token is 0.1.

tensor([[0.0000, 0.0000, 0.0000, 0.0000],
        [0.1000, 0.1000, 0.1000, 0.1000],
        [0.2000, 0.2000, 0.2000, 0.2000],
        [0.3000, 0.3000, 0.3000, 0.3000],
        [0.4000, 0.4000, 0.4000, 0.4000],
        [0.5000, 0.5000, 0.5000, 0.5000],
        [0.6000, 0.6000, 0.6000, 0.6000],
        [0.7000, 0.7000, 0.7000, 0.7000],
        [0.8000, 0.8000, 0.8000, 0.8000],
        [0.9000, 0.9000, 0.9000, 0.9000]])

2.4 Binary Position Encoding

Is there a better way to ensure our numbers are between 0 and 1? If we think about it carefully for a while, we might consider converting decimal numbers to binary numbers. Therefore, instead of adding our (potentially normalized) integer positions to each component of the embedding, we would convert the position encoding to binary representation and match the binary values (which may have been normalized) to the embedding dimension.

For example, if we convert the position of interest (2) into a binary representation (0010) limited to $d_{model}$ bits, it can be added directly to the embedding: each bit is added to the corresponding dimension of the token embedding. The least significant bit (LSB) alternates between 0 and 1 with every token, while the most significant bit (MSB) flips once every $2^{n-1}$ tokens, where $n$ is the number of bits. The figure below shows how positions 0-15 are encoded as binary position vectors.

[Figure: binary position encodings for positions 0-15]

Now all values are bounded (between 0 and 1), and the $d_{model}$ in the Transformer is large enough to encode almost every position we need.
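The binary scheme can be sketched in a few lines. The function name `binary_position_encoding` is hypothetical, and the column order (least significant bit first) is a choice made here for illustration.

```python
import numpy as np


def binary_position_encoding(num_positions: int, num_bits: int) -> np.ndarray:
    """Row p holds the bits of position p, least significant bit first."""
    positions = np.arange(num_positions)[:, None]   # (num_positions, 1)
    bit_index = np.arange(num_bits)[None, :]        # (1, num_bits)
    # Shift each position right by the bit index and keep the lowest bit.
    return ((positions >> bit_index) & 1).astype(np.float32)


pe = binary_position_encoding(16, 4)
print(pe)
```

Column 0 alternates with every position, while each later column flips half as often as the one before it, which is the "fast low bits, slow high bits" pattern described above.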

While it seems good, binary encoding still has drawbacks: the encoded position vector exists in a discrete space, making it unable to represent floating-point numbers. Furthermore, the changes between different positions are discontinuous, causing the output to “jump.” Neural network optimization processes prefer smooth, continuous, and predictable changes. In fact, binary positional encoding already shows signs of multidimensional positional encoding. Neural networks excel at handling high-dimensional data. If the range of binary numbers is considered too small, and we want to further increase the range, we can further increase the base of the number system, such as using base-6, base-8, or even base-16. This can further reduce the dimensionality of the input without narrowing the gap between adjacent numbers (a drawback of normalization). If we convert binary encoding into a continuous high-base version, we can model positional encoding from 0 to 100,000 or even millions.

2.5 Demand Expansion

Let’s summarize first. So far, our positional encoding scheme has achieved the following: we have solved the problem of numerical range and obtained a unique code that remains consistent across different sequence lengths. However, problems still exist:

Using binary values in the world of floating-point numbers is a waste of space, so we need a bounded and continuous periodic function.
The methods described above can also be viewed as vocabulary-like methods, which involve creating a vocabulary of length L and assigning positional codes according to the length of the vocabulary. Alternatively, they can be considered as monotonic function methods, characterized by pursuing the absolute position of a token in the sequence, such that the positional codes of subsequent tokens are all greater than those of preceding tokens.

Therefore, we need to consider whether we can shift the focus of the construction from absolute position to relative position? That is, to construct a function f whose input is the absolute position information of the token and whose output is the relative position information of the token.

$\text{relative position info} = f(\text{absolute position info})$

For example, take the five-character Chinese sentence “我喜欢苹果” (“I like apples”). If the position code for “欢” is set to 0, the position code of every other token is its distance from “欢,” with negative numbers indicating that the token precedes “欢.” The position codes are as follows:

我 -2
喜 -1
欢 0
苹 1
果 2

By using relative positions, the distance between tokens can be determined. Note: Relative position encoding requires linear correlation.
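The relative position codes in this example can be computed directly from absolute indices. The sketch below assumes the example sentence is the five-character “我喜欢苹果” with “欢” as the reference token; the helper name `relative_offsets` is hypothetical.

```python
def relative_offsets(num_tokens: int, reference_index: int) -> list:
    """Signed distance of every absolute position from the reference position."""
    return [index - reference_index for index in range(num_tokens)]


tokens = ["我", "喜", "欢", "苹", "果"]
offsets = relative_offsets(len(tokens), reference_index=tokens.index("欢"))
for token, offset in zip(tokens, offsets):
    print(token, offset)
```

This is the simplest instance of $\text{relative position info} = f(\text{absolute position info})$ : the function is just a subtraction of two absolute indices.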

Our next goal is clear: we need to find a function whose values are bounded, smooth, continuous, and can represent relative positions. If we consider bits as dimensions, ideally, each dimension should have a different frequency, with lower dimensions showing faster frequency changes and higher dimensions showing slower frequency changes. The sine function (sin) satisfies this requirement. In fact, the sine function can also represent alternation similar to binary. As the frequency of the sine function decreases, it can achieve the frequency change from red to orange bits in the diagram above, with the following specific characteristics:

Different dimensions are represented by sine combinations of different frequencies. By adjusting the frequencies of trigonometric functions, we can achieve this transition from lower to higher digits.
Because each dimension is calculated by dividing t by the dimension index, the period of change for each dimension is related to the index, meaning that the periods for different dimensions are different.
If the wavelength becomes longer, the data transformation will be slower, which is consistent with the situation where the lower bits change quickly and the higher bits change slowly.

Therefore, we can consider representing each element (corresponding to a dimension) in the position vector using a sine function. Then, the position vector of the t-th token can be represented as:

$PE_t=[\sin(t/1),\sin(t/2),\sin(t/3),...,\sin(t/i),...,\sin(t/d_{model})]$

We can also control the wavelength of the sine function per dimension. As the frequency decreases, the wavelength increases, making the sine function less sensitive to changes in t. Combining dimensions with different wavelengths helps keep the encoded vectors distinct: two positions that happen to share a value in one dimension will generally differ in another, because the periods differ. Assume…

$PE_t=[\sin(t/2^0),\sin(t/2^1),\sin(t/2^2),...,\sin(t/2^{i-1}),...,\sin(t/2^{d_{model}-1})]$

In the first dimension the sine function has a period of 2π, in the second 4π, and so on. As the dimension index increases, the values change more slowly. This leads us to trigonometric positional encoding.
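The doubling of the period from one dimension to the next can be checked numerically. This is a minimal sketch of the power-of-two encoding above; the function name `pow2_sine_encoding` and the choice of four dimensions are assumptions for illustration.

```python
import numpy as np


def pow2_sine_encoding(t: float, d_model: int = 4) -> np.ndarray:
    """PE_t = [sin(t / 2^0), sin(t / 2^1), ..., sin(t / 2^(d_model - 1))]."""
    scales = 2.0 ** np.arange(d_model)
    return np.sin(t / scales)


t = 3.0
periods = 2 * np.pi * 2.0 ** np.arange(4)   # 2*pi, 4*pi, 8*pi, 16*pi
for k, period in enumerate(periods):
    # Dimension k repeats after its own period.
    repeats = np.isclose(pow2_sine_encoding(t)[k],
                         pow2_sine_encoding(t + period)[k])
    print(k, period, repeats)
```

Shifting t by a shorter period (for example 2π in the second dimension) does not reproduce the value, so each dimension acts like a "bit" that cycles at its own rate.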

2.6 Trigonometric Function Encoding

Properties

We will first introduce the properties of several trigonometric functions related to positional encoding.

Trigonometric identities

Trigonometric identities are proven identities concerning trigonometric functions, such as the sum and difference of two angles, double-angle formulas, and triple-angle formulas. For positional coding, the most important are the sum and difference formulas: based on the sine and cosine of two angles themselves, the sine and cosine of their sum and difference can be calculated.

$\cos(\alpha+\beta)=\cos\alpha\cdot\cos\beta-\sin\alpha\cdot\sin\beta$

$\cos(\alpha-\beta)=\cos\alpha\cdot\cos\beta+\sin\alpha\cdot\sin\beta$

$\sin(\alpha\pm\beta)=\sin\alpha\cdot\cos\beta\pm\cos\alpha\cdot\sin\beta$

Periodicity

In mathematics, a periodic function is a function whose values repeat after a defined period. For functions of real numbers or integers, periodicity means that repeating a specific part at regular intervals will result in a complete graph of the function. If all positions x in the function 𝑓 satisfy: 𝑓(𝑥+𝑇)=𝑓(𝑥), then 𝑓 is a periodic function with period 𝑇.

Both the sine and cosine trigonometric functions are common periodic functions with a period of 2π.

Wavelength

In physics, the frequency of a waveform is the number of cycles completed per second, the period is the time required to complete one cycle, and the wavelength is the distance over which the waveform repeats. The wavelength of a sine wave is therefore the distance (or time, for a signal in time) it needs to complete one full cycle, and it equals the propagation velocity multiplied by the period.

In formulas, the sine function is usually expressed as: 𝑦=sin⁡(𝑘𝑥+𝜙). Where:

𝑦 is the output value.
𝑥 is an input variable, usually representing time or space.
𝑘 is the wave number, and its relationship with the wavelength 𝜆 is 𝑘=2𝜋/𝜆.
𝜙 is the phase shift.

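Therefore, the formula for calculating wavelength λ is: λ = 2π/k. For RoPE, where $\theta_i$ is the rotation frequency of the $i$-th dimension pair and plays the role of $k$, this gives: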

$\lambda_i = 2\pi/\theta_i$


Definition

Trigonometric positional encoding (also known as sinusoidal positional encoding) is a positional encoding scheme proposed in the original Transformer paper. This scheme is based on parameterless fixed trigonometric functions to calculate the encoding vector (providing positional information to the model through linear transformations of sin and cos functions). In this way, absolute positional encoding is used to capture the relative relationships between different positions, and its definition is as follows:


The specific formula is as follows:

$PE_{(pos,2i)}=\sin(pos/10000^{2i/d_{model}})$

$PE_{(pos,2i+1)}=\cos(pos/10000^{2i/d_{model}})$
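The definition translates into a short vectorized NumPy routine. This is one possible sketch rather than the reference implementation; the function name `sinusoidal_position_encoding` and the `base` parameter name are choices made here.

```python
import numpy as np


def sinusoidal_position_encoding(max_len: int, d_model: int,
                                 base: float = 10000.0) -> np.ndarray:
    """Return a (max_len, d_model) matrix; row pos is the encoding of pos."""
    assert d_model % 2 == 0, "d_model must be even to pair sin with cos"
    pos = np.arange(max_len)[:, None]               # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]       # (1, d_model / 2), values 2i
    angles = pos / np.power(base, two_i / d_model)  # pos / base^(2i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)   # odd dimensions 2i + 1
    return pe


pe = sinusoidal_position_encoding(max_len=50, d_model=8)
print(pe.shape)
print(pe[:3].round(4))
```

Because the matrix depends only on `max_len`, `d_model` and `base`, it can be computed once and reused for every input sequence.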

Encoding method

The Transformer authors experimented with two encoding methods (learned and formula-based).

Learned from data: The learned approach, described as “learned and fixed,” is also called “Positional Embedding.” The position vectors are learned from the data during training; they can only represent positions up to a finite length and cannot model arbitrary positions.
Sine function: The formula-based approach is Position Encoding, which uses sine and cosine functions to construct a unique encoding for each position. Because the encoding is computed rather than learned, it can be evaluated for sequence lengths longer than those seen in training.

Both methods yielded similar results. However, the Transformer authors chose the second method because it’s simpler to calculate using formulas and involves fewer parameters.

Formula Interpretation

Here are some explanations about the formula:

$d_{model}$ represents the word vector dimension. In the paper, the word embedding vector and the positional encoding are directly added together, so we need to make the dimension of the positional encoding equal to the dimension of the word embedding vector.
pos is the position index, representing the position of this token in the sequence, i.e., the position of the word in the sequence. If the sentence length is L, then the value of pos ranges from 0 to L-1. For example, the position of the first token is 0.
$i$ represents the dimension index of the position vector, $2i$ represents an even dimension, and $2i+1$ represents an odd dimension. For example, if $d_{model}$ is 512, then the value of $i$ ranges from 0 to 255.
10000: Defined hyperparameter scalar, this is the value used by the Transformer authors.
$PE_{pos}$ represents the positional code of a token. The positional code is not a single number, but a $d_{model}$-dimensional vector. It can be calculated from the position index pos and the dimension index $i$ . In other words, formula-based positional encodings such as the sinusoidal one involve two concepts: distance and dimension. Because these two concepts are combined, the uniqueness of the positional code within a certain range can be guaranteed.
- The sine or cosine values corresponding to the same dimension feature at different positions will also be different, which will make the position code of different positions unique.
- The values corresponding to different dimensions of the same position are also different.
Feature information of different dimensions within the same token is generated using sine and cosine functions of different frequencies. Components alternate between sine and cosine: each adjacent even and odd component form a pair, and the trigonometric function inputs for this pair are identical. This can be viewed as adding sine and cosine waves of different frequency bands, in sequence, along the embedding dimension of the input.
- The period of the position embedding function varies from $2\pi$ to $10000 \cdot 2\pi$ , and each position receives a combination of sin and cos values with different periods along the embedding dimension. The period of dimension $i$ is $T=\mathrm{base}^{i/d}\cdot 2\pi$ , where $0 \le i < d$ , so the range of period is $T \in [2\pi,\mathrm{base}\cdot2\pi]$ .
- The sin and cos formulas mean that the sin function is used to calculate even numbers, and the cos function is used to calculate odd numbers. Each pair of odd and even numbers is grouped together, for example, 0 and 1 in one group, 2 and 3 in another, and then processed using the sin and cos functions mentioned above.
- Within each group, the sine and cosine functions share a common frequency, while different groups have different frequencies (higher frequencies for lower dimensions, lower frequencies for higher dimensions), thus producing different periodic variations. This is somewhat reminiscent of Fourier spectrum transform, aiming to use several groups of cosine and sine functions to represent different feature dimensions (analogous to frequencies).
- The angular frequency of the k-th group is $\omega_k=\frac{1}{10000^{2k/d}}$ (so its period is $2\pi\cdot 10000^{2k/d}$ ), where $k=0,\dots,d/2-1$ . Thus, along the embedding dimension, the values oscillate more and more slowly as the dimension index increases.
- All trigonometric functions on the same dimension of pos are identical (sin for even numbers, cos for odd numbers), the only difference being the pos value. Therefore, the denominators of the trigonometric function inputs are the same, and the difference lies in the numerators.
Positional encoding represents the positional information of a word as a vector, which is determined by both the word position and component positions. The position of the input padding is encoded using a vector of all zeros.
The positional encoding matrix is shared across all sentences, so the positional encoding results for characters/words at the same position in different sentences are the same.
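Two consequences of the pairing described above can be checked numerically: each (sin, cos) pair lies on the unit circle, so every position vector has the same norm $\sqrt{d_{model}/2}$ , and positions within a moderate range get distinct vectors. The helper `sinusoidal_pe` and the sizes used are assumptions for this sketch.

```python
import numpy as np


def sinusoidal_pe(max_len: int, d_model: int, base: float = 10000.0) -> np.ndarray:
    """Compact sinusoidal encoding: sin on even columns, cos on odd columns."""
    pos = np.arange(max_len)[:, None]
    freq = 1.0 / base ** (np.arange(0, d_model, 2) / d_model)
    pe = np.empty((max_len, d_model))
    pe[:, 0::2] = np.sin(pos * freq)
    pe[:, 1::2] = np.cos(pos * freq)
    return pe


d_model = 16
pe = sinusoidal_pe(200, d_model)

pair_radius = pe[:, 0::2] ** 2 + pe[:, 1::2] ** 2   # sin^2 + cos^2 per pair
norms = np.linalg.norm(pe, axis=1)                  # one norm per position
unique_rows = len(np.unique(pe.round(6), axis=0))   # distinct position vectors

print(pair_radius.min(), pair_radius.max())
print(norms.min(), norms.max(), np.sqrt(d_model / 2))
print(unique_rows)
```

The constant norm means position information is carried by the direction of the vector, not its length.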

Specific examples

A specific example of trigonometric positional encoding is shown in the figure below. The output of the positional encoding layer is a matrix, where each row is the positional encoding vector of a token. Each positional encoding is a vector alternating between sine and cosine ( $d_{model}$ must be divisible by 2), where $c_i=1/10000^{2i/d_{model}}$ and $t$ is the position index pos.

$PE_{pos}=\begin{bmatrix} \sin(c_0 t) \\ \cos(c_0 t) \\ \vdots \\ \sin(c_{\frac{d}{2}-1}t) \\ \cos(c_{\frac{d}{2}-1}t) \end{bmatrix}_{d\times 1}$

The specific details are shown in the image below.

[Figure: details of the trigonometric positional encoding example]

Let’s set n=10,000 and d=512 and see the sine wave at different positions.

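import numpy as np
import matplotlib.pyplot as plt

def plotSinusoid(k, d=512, n=10000):
    # x is the dimension index i; k is the (fixed) position
    x = np.arange(0, 100, 1)
    denominator = np.power(n, 2*x/d)
    y = np.sin(k/denominator)
    plt.plot(x, y)
    plt.title('k = ' + str(k))

fig = plt.figure(figsize=(15, 4))
for i in range(4):
    plt.subplot(141 + i)   # panels 141..144: one row, four columns
    plotSinusoid(i*4)      # positions k = 0, 4, 8, 12
plt.show()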

As can be seen, each position $k$ corresponds to a different sine wave, which encodes that position as a vector. For a fixed embedding dimension $i$ , the wavelength is determined by $\lambda_i = 2\pi n^{2i/d}$ , and the wavelength of the sine wave changes geometrically with the increase of the embedding dimension $i$ .
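The geometric growth of the wavelength can be verified directly from $\lambda_i = 2\pi n^{2i/d}$ . The sketch below assumes $i$ runs over the pair indices $0,\dots,d/2-1$ , with the same n=10,000 and d=512 as above.

```python
import numpy as np

d, n = 512, 10000
i = np.arange(d // 2)                        # pair index i = 0 .. d/2 - 1
wavelengths = 2 * np.pi * n ** (2 * i / d)   # lambda_i = 2*pi * n^(2i/d)

# In a geometric progression the ratio of consecutive terms is constant.
ratios = wavelengths[1:] / wavelengths[:-1]
print(wavelengths[0], wavelengths[-1])
print(ratios.min(), ratios.max(), n ** (2 / d))
```

The shortest wavelength is 2π, and every step to the next pair multiplies the wavelength by the same factor $n^{2/d}$ , so the longest one stays just below $2\pi n$ .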

[Figure: sine curves for positions k = 0, 4, 8, 12]

Advantages

In simple terms, Transformer uses sine and cosine position encoding to control frequency using dimension and phase using position. It generates high-dimensional position vectors for different locations by applying sine/cosine functions of different frequencies (varying with the feature dimension; higher frequencies for lower dimensions, lower frequencies for higher dimensions) to different dimensions.

Sine position coding has several obvious advantages:

It is simple and explainable.
The formula-based generation avoids the awkward situation of fixed-length position vectors obtained during training.
- The positional encoding is calculated based on absolute position and is fixed, which means that the model uses the same positional encoding during training and testing to maintain consistency.
- The encoded value does not depend on the length of the text.
It has a limited range of values, which has the following advantages:
- Using sin-cos positional encoding ensures that the position vector values are stable and controllable, ranging from -1 to 1. Except for the first row, all values in the position matrix are the results of trigonometric functions sin and cos. Therefore, the results for all positions and components are between -1 and 1, fixing the positional encoding values within a fixed range so they are neither too large nor too small. This makes it feasible to add the positional encoding to the original word embedding. Furthermore, adding the derived positional encoding to the input_embedding does not cause the result to deviate too much and thus not compromise the word meaning.
- The range of cos and sin functions is between [-1, 1] and is periodic, making them very suitable for initialization assignment in neural networks.
It can generalize to unseen data and is easy to extend.
- The model can accept inputs of different lengths.
- Trigonometric functions have explicit generation rules and a certain degree of extrapolation. Since positional encoding is based on trigonometric function calculations, PE can handle inputs that are longer than the sequences seen during training. Suppose the longest sentence in the training set is 20 words, and suddenly a sentence of length 21 comes in, then the 21st embedding can be calculated using the formula.
- The values of the position vector are bounded and lie in a continuous space. The model generalizes more easily when dealing with position vectors, meaning it handles sequences whose lengths do not match the distribution of the training data (a property of the sine function itself).
It can reflect relative position information.
- The relative relationship between two positions can be expressed through trigonometric functions, which can reflect the difference between words in different positions (especially the difference between the same word in different positions).
- The position vectors between different positions are periodic functions of sine and cosine functions. This allows the position vectors between different positions to maintain a certain similarity, thereby helping the model better understand positional information and capture the sequential relationships in the sequence. It allows the model to easily calculate the relative positions; for a fixed interval $k$ , $PE(pos+k)$ can be calculated using $PE(pos)$ . This is because $\sin(A+B)=\sin(A)\cos(B)+\cos(A)\sin(B)$ and $\cos(A+B)=\cos(A)\cos(B)-\sin(A)\sin(B)$ .
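The last point, that $PE(pos+k)$ is a linear function of $PE(pos)$ , can be made concrete with the sum formulas. The sketch below builds a block-diagonal matrix that depends only on $k$ ; the names `pe_vector` and `shift_matrix` and the small `d_model` are assumptions for illustration.

```python
import numpy as np


def pe_vector(pos: int, d_model: int = 8, base: float = 10000.0) -> np.ndarray:
    """Sinusoidal encoding of one position (sin on even, cos on odd indices)."""
    freq = 1.0 / base ** (np.arange(0, d_model, 2) / d_model)
    out = np.empty(d_model)
    out[0::2] = np.sin(pos * freq)
    out[1::2] = np.cos(pos * freq)
    return out


def shift_matrix(k: int, d_model: int = 8, base: float = 10000.0) -> np.ndarray:
    """Block-diagonal matrix M_k with PE(pos + k) = M_k @ PE(pos)."""
    freq = 1.0 / base ** (np.arange(0, d_model, 2) / d_model)
    M = np.zeros((d_model, d_model))
    for j, w in enumerate(freq):
        c, s = np.cos(k * w), np.sin(k * w)
        # sin(A + B) = sin(A)cos(B) + cos(A)sin(B)
        # cos(A + B) = cos(A)cos(B) - sin(A)sin(B)
        M[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]
    return M


k = 7
for pos in (0, 5, 123):
    print(pos, np.allclose(shift_matrix(k) @ pe_vector(pos), pe_vector(pos + k)))
```

The matrix contains no reference to `pos`, which is what lets a model, in principle, represent a fixed offset with a single linear map.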

0x03 Analysis of Trigonometric Function Encoding Approach

We have previously discussed the properties that positional encoding should ideally satisfy. Since trigonometric function encoding is the scheme used in the Transformer paper, we will use these properties to examine the design principles of trigonometric function encoding.

3.1 Author’s Note

Why does the Transformer use sinusoidal positional encoding? The paper’s authors state that it’s because sinusoidal positional encoding can represent relative positions and may extrapolate to sequence lengths longer than those encountered during training. The original text is as follows.

[Figure: excerpt from the original paper]

Because the author didn’t explain the design rationale, we need to explore it further. The following sections are intertwined. For example, discussions of multi-dimensionality will default to using trigonometric functions, while discussions of trigonometric functions will use multi-dimensionality. These are like twin flowers, two sides of the same coin.

3.2 Why a Multi-Dimensional Approach is Necessary

Let pos denote the position of the token in the sequence, and i denote a certain dimension of the vector. We need to assign a real number to each element of the vector at pos, and this number must satisfy the following property:

The values of different dimensions i in the same position must be different.
The values of the same dimension of different vectors should be different depending on pos, and the encoding rules should satisfy the relative relationship, that is, f(pos+1)-f(pos)=f(pos)-f(pos-1).

Sinusoidal positional encoding precisely satisfies the above conditions. We generate the example using the following code, assuming a sentence length of 512 and a word vector dimension of 768. Positional encoding is a function of two variables, defined on the position-dimension plane.

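import math

import numpy as np
from matplotlib import image

seq_len = 512     # number of positions
dimension = 768   # embedding dimension
data = np.zeros((seq_len, dimension))

for pos in range(seq_len):
    for i in range(dimension):
        if i % 2 == 0:
            # even index: sine
            data[pos, i] = math.sin(pos / math.pow(10000, i / dimension))
        else:
            # odd index: cosine, sharing the frequency of index i - 1
            data[pos, i] = math.cos(pos / math.pow(10000, (i - 1) / dimension))

# Save the matrix as an image (rows: positions, columns: dimensions)
image.imsave('Sinusoidal.png', data)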

The relationship between Sinusoidal positional encoding and dimensional components is shown in the figure below. It can be observed that:

Each component is periodic and is a sine or cosine function.
The later the component (the larger i is), the longer the wavelength and the lower the frequency.

In this way, each position receives a combination of sin and cosine function values with different periods along the embedding dimension. Each dimension contains certain positional information, and the encoded values of characters at each position are different. This generates unique texture positional information, allowing the model to learn the dependencies between positions and the temporal characteristics of natural language.

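[Figure: Sinusoidal.png, the positional encoding matrix with positions as rows and dimensions as columns]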

Having understood these basic characteristics, we now need to discuss more in-depth questions. Let’s look at why we use a vector to represent positional information. Specifically, vectors have the following advantages:

Easy to handle.
Avoid repetition.
Distinguish feature dimensions: a higher degree of differentiation allows different positions to be characterized differently and feature dimensions to be told apart.

Easy to handle

How can positional encoding be integrated into a self-attention algorithm? There are two main approaches: directly modify the input matrix X, or modify the self-attention algorithm itself. Clearly, modifying the input matrix is more intuitive and convenient.

After determining the input matrix to be modified, we then consider how positional information affects the input embedding. Instead of using a single value, a better approach is to represent the position using a vector with the same dimension as the input embedding. As long as the positional embedding generated by positional encoding is the same length as the token embedding, they can be directly added together to obtain a new input matrix of the same size, without needing to modify the self-attention algorithm.
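The "modify the input matrix" approach amounts to a single element-wise addition. In the sketch below the token embeddings are random stand-ins for learned embeddings, and the sizes are arbitrary assumptions; only the shapes matter.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8

# Stand-in for the output of a learned embedding layer.
token_embeddings = rng.normal(size=(seq_len, d_model))

# Sinusoidal position encoding with the same shape as the embeddings.
pos = np.arange(seq_len)[:, None]
freq = 1.0 / 10000.0 ** (np.arange(0, d_model, 2) / d_model)
position_encoding = np.empty((seq_len, d_model))
position_encoding[:, 0::2] = np.sin(pos * freq)
position_encoding[:, 1::2] = np.cos(pos * freq)

# New input matrix: same size as before, so self-attention is unchanged.
x = token_embeddings + position_encoding
print(token_embeddings.shape, position_encoding.shape, x.shape)
```

Since the result has the same shape as the original input matrix, nothing downstream needs to know that positional information was added.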


Avoid repetition

Next, let’s see why we set a period for each dimension of the vector.

Let’s first assume we use the sine function to generate positional codes. Because a sinusoidal graph is periodic, if the same sine function is used across all dimensions, such as $PE(pos)=\sin(pos)$ , different positions might have the same value. This would create correlation between dimensions, making it impossible to effectively utilize the entire high-dimensional space as the encoding space. This defeats the purpose of positional coding, which is to separate different words.

Therefore, we cannot use pos directly. Instead, we can add an alpha parameter to adjust the wavelength of the position function.

$PE(pos)=\sin(pos/\alpha)$

As mentioned earlier, the token’s embedding dimension is $d_{model}$ , and the position encoding dimension is also $d_{model}$ . Let’s assume $\alpha$ is set as the dimension index.

$PE_t = \sin(t/i)$

In this way, the period of the sine function is different for each dimension, which reduces the probability of repetition.

$PE_t=[\sin(t/1),\sin(t/2),\sin(t/3),...,\sin(t/i),...,\sin(t/d_{model})]$

Let’s discuss the value of the control parameter (or base) in Transformer position encoding: $\alpha = 10000^{2i/d_{model}}$ , and we can adjust the function period using $\alpha$ .

If α is relatively small, pos/α can easily become very large, leading to a higher frequency of the function, a shorter wavelength, and a shorter period. This makes it easy to enter the next period, resulting in overlapping position vectors at different times t. In long texts, some characters at different positions may have the same encoding (a vector represented by a smaller t and a larger t may overlap). Therefore, it is necessary to lengthen the wavelength (the period of the trigonometric function) to ensure that vectors at different positions do not collide. This requires setting all frequencies to a very small value, i.e., taking a larger α to reduce the probability of repetition.
A larger base α helps. However, an excessively large base means a larger period, resulting in denser vector values within the range of -1 to +1. This leads to smaller differences in the positional encodings of adjacent tokens, causing the vectors to almost overlap. This is detrimental to the subsequent Self-Attention module, as it requires more training iterations to accurately find the information for each position, or in other words, to accurately distinguish between different positions. Furthermore, long sequences require long encodings. This increases computational cost, especially since long encodings affect model training time. Therefore, a larger base is not always better.
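Both sides of this trade-off can be seen in a small experiment. The sketch below measures the Euclidean distance between the encodings of two positions for several bases; the function name `pe_distance`, `d_model = 64`, and the extreme case `base = 1.0` (every dimension shares one frequency) are assumptions chosen for illustration.

```python
import numpy as np


def pe_distance(p: int, q: int, base: float, d_model: int = 64) -> float:
    """Euclidean distance between the sinusoidal encodings of p and q."""
    freq = 1.0 / base ** (np.arange(0, d_model, 2) / d_model)

    def pe(pos):
        # Sin and cos halves are concatenated instead of interleaved;
        # the distance is the same either way.
        return np.concatenate([np.sin(pos * freq), np.cos(pos * freq)])

    return float(np.linalg.norm(pe(p) - pe(q)))


# Larger base: adjacent positions move closer together.
for base in (10.0, 1e4, 1e8):
    print("adjacent", base, pe_distance(0, 1, base))

# Degenerate small base: distant positions 0 and 44 almost collide,
# because 44 is close to a multiple of the single period 2*pi.
print("far apart", pe_distance(0, 44, 1.0), "adjacent", pe_distance(0, 1, 1.0))
```

A very small base risks near-collisions between distant positions, while a very large base shrinks the difference between neighbors, which matches the argument for choosing a moderate value.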

Therefore, the denominator needs to be set within an appropriate range. Writing it as $\text{base}^{2i/d_{model}}$ makes the role of the base explicit. For the base, the Transformer authors used 10000, presumably an empirically chosen value. Using 10000 likely ensures that the longest period is large enough to encode sufficiently long text.

Distinguishing feature dimensions

Introducing multiple dimensions allows for further subdivision of positional encoding and token semantics in a high-dimensional representation space. This subdivision goes beyond individual words; it’s based on word and word embedding features, allowing different dimensions to characterize different features and represent varying information at a single position, reflecting positional differences. For example, high-frequency components correspond to local semantic influences, while low-frequency components correspond to broader contextual semantic influences. Using different functions across different dimensions further enhances this differentiation. This not only separates different words but also distinguishes different features of the same word.

In addition, from the perspective of model interpretability, positional encoding, especially encoding schemes that include relative bias, assigns different practical meanings to different dimensions of the feature vector $q_t,k_s$ . This increases the interpretability of the Transformer architecture.


3.3 Why Multiple Frequencies?

Position encoding stores a pair of sine and cosine functions containing various frequencies, but why use trigonometric functions containing multiple frequencies? Actually, this question is related to the previous one, so we’ll analyze them together. Multiple frequencies offer several advantages:

They give the encoded vectors at different positions a certain regularity (this follows from the continuity and smoothness of the sine and cosine functions): differences between adjacent positions are small, while differences between positions that are far apart tend to be large.
- For any two adjacent positions, the corresponding encoding vectors change only slightly in each dimension.
- For any two positions that are far apart, the corresponding encoding vectors differ noticeably in most dimensions.
They keep the encoding vector unique: for any two different positions, the corresponding encoding vectors are not exactly equal across all dimensions (this follows from the different periods and the phase difference between sine and cosine). Note: A rigorous analysis of periodicity will follow.
They work like a clock, where the hour, minute, and second hands represent different granularities of time; here the different frequencies capture position information at different granularities.


This scheme can adapt its frequencies to serve different functions. For example, in Figure (b) above, $\phi(m)$ is a sum of cosine functions, each with a single frequency; it determines the proximity between any two position vectors that are a distance $m$ apart. As shown in (a), each frequency can play a different role:

The extremely small frequency has almost no effect on the overall word representation in the formula $(WE_x + P_x)$ , because it makes this positional embedding almost the same as the position increases.
Some smaller frequencies ( $\omega_i < \pi/L$ ) may help to ensure monotonicity (which decays with increasing distance).
Some larger frequencies will promote a local attention mechanism, because if $\omega_i$ is large enough, the cosine function in Equation 6 will drop sharply at the beginning.
Some large frequencies ( $\omega_i > \pi$ ) will be smoothing factors for the overall pattern because they will randomly impose biases on all positions.

3.4 Why use cosine and sine simultaneously?

The main reason for using both cos and sin is:

It makes words in close proximity easier to distinguish. Two words that are close together often carry related information, so their vectors lie close together in the vector space and easily influence each other. For example, in “I love reading books,” “reading” and “books” have a high probability of appearing together. Such semantically related neighbors must therefore be easy to tell apart by position. To differentiate two adjacent dimensions and characterize different positions, the Transformer uses sine for even-numbered dimensions and cosine for odd-numbered dimensions.
Alternating between sin and cos can make the period longer.
Alternating sin and cos lets the model compute relative positions easily. For a fixed distance $k$ , $PE(pos+k)$ can be calculated from $PE(pos)$ . Because $\sin(A+B)=\sin(A)\cos(B)+\cos(A)\sin(B)$ , and $\cos(A+B)=\cos(A)\cos(B)-\sin(A)\sin(B)$ , both $\sin(x+k)$ and $\cos(x+k)$ can be written as linear combinations of $\sin(x)$ and $\cos(x)$ whose coefficients depend only on $k$ .

Additionally, a post on Zhihu also discussed whether it’s necessary to set odd and even indices $2i$ and $2i+1$ . The discussion mentioned that the initial version of tensor2tensor simply divided the data into two parts (using sin for the first 256 dimensions and cos for the last 256 dimensions). The author of tensor2tensor explained that because the subsequent fully connected layers can help rearrange the coordinates, the alternation of sin and cos has no special meaning, and we can rearrange it in any way. The relevant code is as follows:

return tf.concat([tf.sin(scaled_time), tf.cos(scaled_time)], axis=1)
# PE(pos, 0:channels/2) = sin(scaled_time)
# PE(pos, channels/2:)  = cos(scaled_time)

The author of tensor2tensor responded as follows:

I think this does not matter, since there is a fully connected layer after the encoding is added (and before) and it can permute the coordinates.
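The point about permuting coordinates can be illustrated with a small sketch. It assumes the angles from the paper’s formula (tensor2tensor’s own timescale computation differs in detail), and the function names are hypothetical. It shows that the concatenated layout and the interleaved layout contain the same values in a different order, so a fixed permutation maps one to the other.

```python
import math

def angles(pos, d_model=8, base=10000.0):
    """Angle pos / base^(2i/d_model) for each (sin, cos) pair."""
    return [pos / base ** (2 * i / d_model) for i in range(d_model // 2)]

def pe_interleaved(pos, d_model=8):
    """Layout from the paper: sin at even indices, cos at odd indices."""
    out = []
    for a in angles(pos, d_model):
        out += [math.sin(a), math.cos(a)]
    return out

def pe_concatenated(pos, d_model=8):
    """Layout of the snippet above: all sin values first, then all cos values."""
    a = angles(pos, d_model)
    return [math.sin(x) for x in a] + [math.cos(x) for x in a]

def layout_permutation(d_model=8):
    """Indices into the concatenated layout that produce the interleaved one."""
    half = d_model // 2
    perm = []
    for i in range(half):
        perm += [i, half + i]
    return perm

pos = 5
perm = layout_permutation()
reordered = [pe_concatenated(pos)[j] for j in perm]
print(reordered == pe_interleaved(pos))
```

Because a fully connected layer can absorb any fixed permutation of its input coordinates, the two layouts are interchangeable from the model’s point of view.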

3.5 Representing absolute position

Next, let’s look at why trigonometric functions can represent absolute positions. In fact, this is also an interpretation of the advantages of trigonometric functions.

Conclusion

To state the conclusion first, multidimensional encoding using multiple periodic functions with different periods is equivalent to incremental sequence encoding. This answers why periodic functions can introduce positional information. In fact, theoretically, any function with different periods but the same characteristics (for example, the waveform of a cosine function is always the same) can be used for positional encoding.

Interpretation

We use a widely circulated example from the internet as a case study. Since converting number bases can be understood as using multi-dimensional data to represent the same meaning, we can use binary positional encoding instead of trigonometric function encoding. As shown in the diagram below, assuming a sequence has 16 tokens and an embedding size of 4, we can use binary encoding to obtain the positional embedding of each token, and we can observe the following pattern.

The values at different positions alternate: each column switches between 0 and 1.
Each dimension (i.e., each column) is periodic, but the period differs across columns. The lower the bit, the faster it changes. The last bit flips at every position, the second-to-last bit flips once every two positions, and so on (in the diagram, these are the red and orange columns respectively). In other words, the i-th bit from the right (counting from 0) flips once every 2^i positions.

This also answers, to some extent, why periodic functions can introduce positional information. Furthermore, the alternation of sine and cosine in trigonometric function encoding is, in a sense, equivalent to alternating bits in the binary representation of position.
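The binary example can be reproduced with a few lines of code. This is a minimal sketch for the 16-token, 4-bit case described above; the function names are illustrative.

```python
def binary_position_code(pos, num_bits=4):
    """Binary code of a position, most significant bit first."""
    return [(pos >> (num_bits - 1 - b)) & 1 for b in range(num_bits)]

# One row per position, one column per bit (dimension).
table = [binary_position_code(p) for p in range(16)]
for p, row in enumerate(table):
    print(p, row)

def flips_in_column(col):
    """Number of 0/1 changes when reading one column from top to bottom."""
    column = [row[col] for row in table]
    return sum(1 for a, b in zip(column, column[1:]) if a != b)

# Lower bits (columns further right) change more often than higher bits.
print([flips_in_column(c) for c in range(4)])
```

Every row is distinct, and each column oscillates at its own rate, which is the same pattern that the sinusoidal encoding reproduces with continuous values.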


Periodicity

Next, let’s see whether the periodic encoding ever repeats itself.

First, let’s use a classic clock example from the internet to look at higher-dimensional rotation from a clock’s perspective. A clock has three hands:

The second hand: moves the fastest.
The minute hand: moves at a moderate pace.
The hour hand: moves the slowest.

These three circular motions, each with a different frequency, constitute the time. When two watches are placed in front of you, even if the fastest second hand overlaps, they can still represent different times if the slower minute and hour hands are different.

The sinusoidal encoding formula defines a “clock” with d/2 hands, where d is the encoding length and i is the index of a hand. The position of each hand is uniquely represented by a pair of sine and cosine values, (sinθ, cosθ), where θ is the angle between the hand and the horizontal. Different hands have different frequencies: the 0th hand has a frequency of 1, and the last hand has a frequency close to 1/10000. The frequencies of adjacent hands differ by a fixed ratio, so the frequencies form a geometric progression. Clearly, each position code corresponds to a unique “point in time” on the “clock.” We know that a clock has a period of 12 hours. The sinusoidal encoding also has a period; if it is chosen appropriately, the encoding can express positional information and possesses a certain degree of extrapolation.

Secondly, let’s look at how to select a position code in sinusoidal encoding to ensure that the position code is unique in any subspace, thereby avoiding repetition.

Note: i and j below index positions in the sequence. The base determines how the subspaces differ from one another, and each subspace is one (sin, cos) pair.

The properties of trigonometric functions are as follows:

The period of the vector pair $(\sin x, \cos x)$ is $2\pi$ .
Suppose that positions $i$ and $j$ ( $i \ne j$ ) coincide in the k-th subspace. Then there exists a nonzero integer $m$ such that $j\theta_k - i\theta_k = 2m\pi$ , hence $\theta_k=\frac{2m\pi}{j-i}$ .
Therefore, in any k-th subspace, as long as $\theta_k$ is not a rational multiple of $\pi$ , the rotation angle sequence $\{i\theta_k\}_i$ never repeats.

$PE(pos,2i)=\sin\left(pos/10000^{2i/d_{model}}\right)$ $PE(pos,2i+1)=\cos\left(pos/10000^{2i/d_{model}}\right)$

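Therefore, with an appropriate 𝜃 value, each position obtains a unique positional code, which reflects the absolute nature of the encoding. The sinusoidal frequencies $\theta_k=1/10000^{2k/d_{model}}$ satisfy this condition: each is an algebraic number, whereas every nonzero rational multiple of $\pi$ is transcendental.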

Finally, let’s look at some special nodes within the period. Let $B=10000^{1/d}$ (here $d$ counts the (sin, cos) subspaces, i.e. $d=d_{model}/2$ , which matches the formula above). Then $\theta_k=\frac{1}{B^{k-1}}$ , $k=1,\dots,d$ , is a geometric sequence with the following periods.

$\theta_k$	$1$	$1/B$	$1/B^2$	…	$1/B^{d-1}$
Period	$2\pi$	$2B\pi$	$2B^2\pi$	…	$2B^{d-1}\pi$

In this process, there are three very important nodes: $\pi/2$ , $\pi$ , $2\pi$ .

Only when the argument of cos/sin goes beyond $\pi/2$ during training will the model see that cos can be negative and that sin is not monotonic.
Only when the argument of cos/sin goes beyond $\pi$ will the model see that cos is not monotonic and that sin can be negative.
Only when the argument of cos/sin reaches $2\pi$ will the model have seen all possible cos/sin values, and thus possibly become aware of the periodic representation encoded in each dimension (truly becoming aware of it may require more than one full cycle within the training length).
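To get a feel for these three nodes, one can count how many (sin, cos) pairs actually reach each of them for a given training length. The sketch below assumes `d_model=64`, base 10000, and positions starting at 0; the function names and the two example lengths are illustrative choices.

```python
import math

def max_angles(train_len, d_model=64, base=10000.0):
    """Largest angle each (sin, cos) pair reaches over positions 0..train_len-1."""
    last_pos = train_len - 1
    return [last_pos / base ** (2 * i / d_model) for i in range(d_model // 2)]

def count_pairs_reaching(threshold, train_len, d_model=64, base=10000.0):
    """Number of pairs whose angle reaches the given threshold during training."""
    return sum(1 for a in max_angles(train_len, d_model, base) if a >= threshold)

for train_len in (512, 4096):
    counts = [count_pairs_reaching(t, train_len)
              for t in (math.pi / 2, math.pi, 2 * math.pi)]
    print(train_len, counts)
```

The slowest pairs never complete a full period within a short training length, so the model sees only part of their range; a longer training length lets more pairs pass each node.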

3.6 Representing relative position

The importance of relative semantics

Semantics should be more related to relative position than absolute position. That is, the semantic similarity calculation between i and j should depend on the vectors and their relative distance, rather than their absolute position.

$Q_iK_j^T = g(X_i, X_j, i-j)$

Our previous derivation of the positional encoding of trigonometric functions was as follows:

$PE_t=[\sin(t/1),\sin(t/2),\sin(t/3),...,\sin(t/i),...,\sin(t/d_{model})]$

A PE that uses only sin() does not yet solve the relative-position problem. An encoding is truly relative if the code at any position can be turned into the code at a fixed offset from it by a single transformation that depends only on that offset. We therefore impose another requirement on position vectors: one position vector should be obtainable from another through a linear transformation. This lets us represent not only the absolute position of a token but also its relative position, and it requires adding the cos function. It also partially answers the question of why sin and cos are used alternately to represent position.

Conclusion

Let’s state the conclusion first: trigonometric positional encoding, although it is a type of absolute positional encoding, also contains certain relative positional information. That is, it can use absolute position codes to capture the relative relationships between different positions. This is because trigonometric functions have the following properties:

$\sin(\alpha+\beta)=\sin\alpha\cos\beta+\cos\alpha\sin\beta$ $\cos(\alpha+\beta)=\cos\alpha\cos\beta-\sin\alpha\sin\beta$

This demonstrates that sin-cos positional encoding has the ability to express relative positions; that is, a position α+β can be expressed as a combination of position α and β. Therefore, two conclusions can be drawn:

Conclusion 1: The dot product of two positional codes depends only on the offset (relative position), so it can reflect the distance between the two positions.
Conclusion 2: Given a distance, the positional code of any position can be expressed as a linear combination of the components of the positional code of a known position, with coefficients determined by the distance.

Proof

Conclusion 1 is that the inner product of the positional codes at time step p and time step p+k, i.e., PE_p · PE_{p+k}, is a constant value that is independent of p and only depends on k. In other words, the inner product of any two positional code vectors that are k time steps apart is the same, which is equivalent to the inner product containing information about the relative positional relationship between the two time steps.

First, let’s look at the calculation of the dot product. As you can see, the dot product of two position vectors depends only on their relative positions 𝑘.


Secondly, the proof proceeds as follows, where $c_i$ is defined below and $d_{model}$ is the dimension of the position embedding.

$c_i = \frac{1}{10000^{2i/d_{model}}}$

The formula uses trigonometric transformation formulas:

$\cos(x-y)=\sin(x)\sin(y)+\cos(x)\cos(y)$
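Applying this identity pair by pair gives $PE_p\cdot PE_{p+k}=\sum_i\left[\sin(c_ip)\sin(c_i(p+k))+\cos(c_ip)\cos(c_i(p+k))\right]=\sum_i\cos(c_ik)$ , which no longer contains $p$ . The following sketch checks this numerically; `d_model=64` and the helper names are assumptions made for illustration.

```python
import math

def sinusoidal_pe(pos, d_model=64, base=10000.0):
    """Sinusoidal encoding of one position (sin at even, cos at odd indices)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / base ** (2 * i / d_model)  # equals c_i * pos
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_sum(k, d_model=64, base=10000.0):
    """Closed form sum_i cos(c_i * k) predicted by the identity."""
    return sum(math.cos(k / base ** (2 * i / d_model))
               for i in range(d_model // 2))

k = 7
for p in (0, 3, 100):
    # The two printed values should agree for every p.
    print(p, round(dot(sinusoidal_pe(p), sinusoidal_pe(p + k)), 6),
          round(cosine_sum(k), 6))
```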


Conclusion 2 is that the positional encodings of two positions pos and pos+k separated by k words are a linear transformation defined by the positional encoding of k, that is, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$ . The specific proof is as follows.

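First, let’s simplify the notation. Write pos as t and $d_{model}$ as d, and write the base as 1000 for brevity (the derivation is identical for the base 10000 used earlier, since the base only sets the frequencies). In trigonometric positional encoding, the values of the position vector for position t at the even and odd dimensions are, respectively: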

$PE_{t,2i}=\sin\left(\frac{t}{1000^{2i/d}}\right),\quad PE_{t,2i+1}=\cos\left(\frac{t}{1000^{2i/d}}\right)$

The following trigonometric identities are given:

$\sin(\alpha+\beta)=\sin\alpha\cos\beta+\cos\alpha\sin\beta$ $\cos(\alpha+\beta)=\cos\alpha\cos\beta-\sin\alpha\sin\beta$

Consider a positional encoding of dimension 2, i.e., the case $i=0$ (the exponent $2i/d$ is kept in the formulas because the same steps apply to every pair of dimensions). Then:

$PE_{t+k,0}=\sin\left(\frac{t+k}{1000^{2i/d}}\right),\quad PE_{t+k,1}=\cos\left(\frac{t+k}{1000^{2i/d}}\right)$

Let:

$\alpha_k=\frac{k}{1000^{2i/d}}$

Then:

$PE_{t+k,0} =\sin\left(\frac{t+k}{1000^{2i/d}}\right) =\sin\left(\frac{t}{1000^{2i/d}}+\frac{k}{1000^{2i/d}}\right) =\sin\left(\frac{t}{1000^{2i/d}}+\alpha_k\right)$ $=\sin\left(\frac{t}{1000^{2i/d}}\right)\cos(\alpha_k) +\cos\left(\frac{t}{1000^{2i/d}}\right)\sin(\alpha_k) =PE_{t,0}\cos(\alpha_k)+PE_{t,1}\sin(\alpha_k)$

and

$PE_{t+k,1} =\cos\left(\frac{t+k}{1000^{2i/d}}\right) =\cos\left(\frac{t}{1000^{2i/d}}+\frac{k}{1000^{2i/d}}\right) =\cos\left(\frac{t}{1000^{2i/d}}+\alpha_k\right)$ $=\cos\left(\frac{t}{1000^{2i/d}}\right)\cos(\alpha_k) -\sin\left(\frac{t}{1000^{2i/d}}\right)\sin(\alpha_k) =PE_{t,1}\cos(\alpha_k)-PE_{t,0}\sin(\alpha_k)$

That is:

$PE_{t+k,0}=PE_{t,0}\cos\alpha_k + PE_{t,1}\sin\alpha_k$ $PE_{t+k,1}=PE_{t,0}(-\sin\alpha_k)+PE_{t,1}\cos\alpha_k$

Shifting the argument of a (sin, cos) pair by a fixed amount is a rotation in the plane, and a rotation is a linear transformation of that pair. We can therefore rewrite the formulas above in matrix form:

$\begin{bmatrix} PE_{t+k,0} \\ PE_{t+k,1} \end{bmatrix} = \begin{bmatrix} \cos\alpha_k & \sin\alpha_k \\ -\sin\alpha_k & \cos\alpha_k \end{bmatrix} \begin{bmatrix} PE_{t,0} \\ PE_{t,1} \end{bmatrix} = \begin{bmatrix} \cos\alpha_k & -\sin\alpha_k \\ \sin\alpha_k & \cos\alpha_k \end{bmatrix}^{\!\top} \begin{bmatrix} PE_{t,0} \\ PE_{t,1} \end{bmatrix}$

This shows that $PE_{t+k}$ can be decomposed into the product of a matrix related to the relative distance $k$ and $PE_t$ . Assuming this matrix related to the relative distance $k$ is $R_k$ , then we have $PE_{t+k}=R_k^\top PE_t$ , and further we can obtain:

$PE_{t+k_1+k_2}=R_{k_1}^TR_{k_2}^TPE_t$

That is:

$R_{k_1+k_2}=R_{k_1}R_{k_2}$

Furthermore, based on $-\sin\alpha_k=\sin(-\alpha_k)$ and $\cos\alpha_k=\cos(-\alpha_k)$ , we can obtain $R_{k_1}=R_{-k_1}^T$ , and further:

$R_{k_2-k_1}=R_{k_1}^TR_{k_2}$
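As a numerical sanity check of $PE_{t+k}=R_k^\top PE_t$ , the sketch below applies the 2×2 matrix to every (sin, cos) pair and compares the result with the directly computed encoding. The base is a parameter (the derivation does not depend on its value); the function names and `d_model=16` are illustrative assumptions.

```python
import math

def sinusoidal_pe(pos, d_model=16, base=10000.0):
    pe = []
    for i in range(d_model // 2):
        angle = pos / base ** (2 * i / d_model)
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def rotate_pair(s, c, alpha):
    """Apply R_k^T = [[cos a, sin a], [-sin a, cos a]] to the pair (s, c)."""
    return (s * math.cos(alpha) + c * math.sin(alpha),
            -s * math.sin(alpha) + c * math.cos(alpha))

def shift_encoding(pe_t, k, d_model=16, base=10000.0):
    """Compute PE_{t+k} from PE_t alone, one 2x2 block per pair."""
    out = []
    for i in range(d_model // 2):
        alpha_k = k / base ** (2 * i / d_model)
        out.extend(rotate_pair(pe_t[2 * i], pe_t[2 * i + 1], alpha_k))
    return out

def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

t, k = 20, 9
print(max_abs_diff(shift_encoding(sinusoidal_pe(t), k), sinusoidal_pe(t + k)))
```

The matrix depends only on the offset k, never on t, which is exactly what allows a linear layer to express relative shifts.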

The proof is complete. Let’s summarize with another image as follows:


Or let’s expand on this further.

Because $\alpha_k=\frac{k}{1000^{2i/d}}$ , we have $\sin\alpha_k=PE_{k,2i}$ and $\cos\alpha_k=PE_{k,2i+1}$ , i.e. the components of $PE_k$ . Therefore, we obtain:

$PE_{t+k,2i}=PE_{t,2i}\times PE_{k,2i+1}+PE_{t,2i+1}\times PE_{k,2i}$ $PE_{t+k,2i+1}=PE_{t,2i+1}\times PE_{k,2i+1}-PE_{t,2i}\times PE_{k,2i}$

Since the relative distance $k$ is a constant, $PE_{k,2i}$ and $PE_{k,2i+1}$ are also constants. These can be simplified to $u$ and $v$ respectively, yielding the following equation:

$\begin{bmatrix} PE_{t+k,2i} \\ PE_{t+k,2i+1} \end{bmatrix} = \begin{bmatrix} v & u \\ -u & v \end{bmatrix} \begin{bmatrix} PE_{t,2i} \\ PE_{t,2i+1} \end{bmatrix}$

PE_{t+k} can be represented as a linear transform of PE_t. The authors hope that by using the above encoding formula for absolute position, the model can learn relative position information. We further expand the above figure to obtain the following.


3.7 Rotation

From one perspective, PE_{t+k} is PE_t rotated clockwise. The dimensions of the vector are paired by index into groups of two, and each two-dimensional group is rotated clockwise according to the position information. The rotation angle is linear in the relative position. The proof is as follows.


From another perspective, trigonometric positional encoding essentially injects positional information into word vectors through rotation. If we combine word vectors and position vectors using addition, it’s equivalent to applying a translation to the original word vectors.


3.8 Why is it addition?

The question is: why is positional encoding added to the word embedding instead of concatenating the two? There has been much discussion and analysis on this issue. Let’s examine it in detail.

Explanation 1: Whether concatenating or adding, both ultimately undergo linear transformations at the entry points of each head of the multi-head attention system to recombine and reduce the dimensions of the features. In essence, each dimension becomes a linear combination of all previous dimensional vectors. Concatenation increases dimensionality, requiring another linear transformation to reduce it, which expands the parameter space, increases memory usage, and makes fitting more difficult. Compared to concatenation, adding performs a linear transformation before fusion, saving model parameters.
Explanation 2: Adding maintains the independence of the token and position embedding spaces to some extent, reflecting a kind of feature intersection; concatenation merges the two embedding spaces into one larger embedding space, and the linear transformation that follows is then equivalent to dimensionality reduction or pooling. Once a linear transformation follows, the two are closely related (addition corresponds to concatenation with the same projection weights applied to both parts), and either approach is acceptable.
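The relationship between the two options can be made concrete with a small numerical sketch. It assumes a single projection matrix after the embedding step; the sizes, the random values, and the names `W`, `W_tied`, and `matvec` are illustrative. Projecting the sum `x + p` with `W` gives the same result as projecting the concatenation `[x ; p]` with the block matrix `[W | W]`.

```python
import random

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

random.seed(0)
d_in, d_out = 4, 3
x = [random.gauss(0, 1) for _ in range(d_in)]   # token embedding
p = [random.gauss(0, 1) for _ in range(d_in)]   # positional encoding
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]

# Option 1: add, then project with W (d_out x d_in parameters).
added = matvec(W, [a + b for a, b in zip(x, p)])

# Option 2: concatenate, then project with the tied block matrix [W | W].
W_tied = [row + row for row in W]
concatenated = matvec(W_tied, x + p)  # here x + p is list concatenation

print(max(abs(a - b) for a, b in zip(added, concatenated)))
```

A general concatenation would use two independent blocks `[W_x | W_p]` and therefore twice as many projection parameters; addition is the special case in which both blocks are shared.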

0x04 Characteristics of Trigonometric Function Encoding

We already know some of the excellent properties of trigonometric function encoding.

With the appropriate 𝜃 value set, each location can obtain a unique location code (absolute).
A positional code can be derived by rotating another positional code (relativity).

Next, let’s look at some other characteristics of trigonometric function encoding.

4.1 Non-directionality

A characteristic of sinusoidal positional encoding is that the distance between adjacent time steps is symmetrical and decays over time. This indicates that trigonometric function positional encoding cannot learn direction. This is because the distance between two tokens depends only on the relative distance k, and is independent of the position t or j. Therefore, the distance between positions t and t+k is equal to the distance between positions j and j+k. See the diagram below for details. Some basic information from the diagram is as follows:

The horizontal axis represents $\Delta$ .
The vertical axis represents $PE_t^TPE_{t+\Delta t}$ obtained by changing $\Delta$ , with a fixed $t$ .
d represents different hidden sizes.

We can see that the trend of $PE_t^TPE_{t+\Delta t}$ is:

With a fixed position, the inner product of the two positional codes is symmetric.
When a certain $PE_t$ is fixed, the inner product of two position codes has long-range decay, that is, the farther apart the two position codes are, the smaller the value of the dot product, but it does not have monotonicity.

This shows that although the dot product of position vectors can reflect relative distance, it does not learn the directionality of position because the result of the dot product is symmetrical.
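The symmetry can be verified directly. The sketch below fixes a position t and compares the inner product at offsets +Δ and −Δ; `d_model=64`, `t=100`, and the helper names are assumptions for illustration.

```python
import math

def sinusoidal_pe(pos, d_model=64, base=10000.0):
    pe = []
    for i in range(d_model // 2):
        angle = pos / base ** (2 * i / d_model)
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def similarity(t, delta, d_model=64):
    """Inner product PE_t^T PE_{t+delta}."""
    a = sinusoidal_pe(t, d_model)
    b = sinusoidal_pe(t + delta, d_model)
    return sum(x * y for x, y in zip(a, b))

t = 100
for delta in (1, 5, 20, 50):
    # Forward and backward offsets give the same value, so direction is lost.
    print(delta, round(similarity(t, delta), 4), round(similarity(t, -delta), 4))
```

Since the inner product equals a sum of cosines of the offset, and cosine is an even function, a token k steps ahead and a token k steps behind are indistinguishable from this quantity alone.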


4.2 Long-distance attenuation

Long-range decay refers to the ability of positional encoding to capture positional information and long-distance dependencies (relationships between words that are far apart) within a sequence. Specifically, this means:

When two positions are far apart, their encodings differ significantly and have low similarity. This helps the model distinguish these positions and better understand and differentiate word order relationships in the sequence.
When two locations are close to each other, their similarity is high, which helps the model understand the relationship between them.

The prior behind long-range decay is that text is discrete sequential data, and we generally assume that the closer a token is to a given position, the stronger its correlation with it. That is, tokens that are close together receive more attention on average, while tokens that are farther apart receive less. Under this prior, a good decay curve should have the following characteristics: it should decay monotonically over an unbounded length; the decay should be non-linear, fast at close distances and gradual at long distances; and it should oscillate as little as possible along the way.

In fact, the localization of self-attention through positional encoding can be viewed as a transition from full self-attention to convolution. Without positional encoding, token semantics interact regardless of relative distance, so the sequence loses its order information and token representations are disturbed by semantically irrelevant tokens. If the localization is too strong, the Transformer module can only see tokens at neighboring positions when building token representations, which is no different from a CNN. Designing a good decay curve over relative distance is therefore particularly important.

Sinusoidal positional encoding generates the positional codes of words at different positions with sine and cosine functions. It can capture the relationships between different positions and has good long-range decay properties. Given an appropriate φ value, the size of the inner product of two positional codes reflects the distance between positions; the smaller the inner product, the greater the distance (decay property). Specifically, as shown in the figure above, the inner product of the position vectors tends to decrease with increasing relative distance, indicating long-range decay.

However, we also found that over long distances, the decay does not decrease monotonically as expected, but rather includes two types of oscillations: one is oscillation within a local window, and the other is convergence to a certain fluctuation range with distance. This results in positional encoding being more inclined to capture local information, limiting the impact of long-distance interactions. This characteristic is effective for short sequences but may be limited for long sequences, resulting in generally poor extrapolation of Sinusoidal positional encoding. Furthermore, these oscillations amplify modeling errors, which is one reason why long texts are difficult to train.

4.3 Extrapolation

Length extrapolation: If a model, without fine-tuning, can maintain its training performance well when tested on texts exceeding the training length, we say that the model has length extrapolation capability. Conversely, if a large model’s generalization ability decreases due to inconsistencies in input length between training and prediction, we say that the model’s extrapolation capability is problematic. How to improve the extrapolation performance of positional encoding while maintaining its in-distribution performance is a crucial trade-off in positional encoding design. Because the sine and cosine functions in Sinusoidal positional encoding have the following characteristics, they theoretically also possess a certain length extrapolation capability.

There are explicit generation rules.
Periodicity.
It can express relative positions. Because $\sin(\alpha+\beta)=\sin\alpha\cos\beta+\cos\alpha\sin\beta$ , and $\cos(\alpha+\beta)=\cos\alpha\cos\beta-\sin\alpha\sin\beta$ , this shows that the vector of position $\alpha+\beta$ can be represented as a vector combination of position $\alpha$ and position $\beta$ , providing the possibility of position extension.
The inner product has the characteristic of long-range decay.

Problem

However, in practice, trigonometric function encoding does not perform ideally in terms of extrapolation. This is partly because sin and cos are high-frequency oscillating functions, not linear or asymptotically linear functions, and their extrapolation behavior is often difficult to predict.

The paper “TENER: Adapting Transformer Encoder for Named Entity Recognition” presents excellent experiments and analysis on this problem. To explore it in more detail, the paper uses the following notation:

$x_t, x_{t+\Delta t}$ are two original token vectors at different positions, with dimensions of $(\text{hidden\_size}, 1)$ .
$PE_t, PE_{t+\Delta t}$ are two original PE vectors at different positions, with dimensions of $(\text{hidden\_size}, 1)$ .
$W_Q, W_K$ are Q and K matrices of size $(\text{hidden\_size}, \text{hidden\_size})$ , respectively.

The q-k dot product with sinusoidal positional encoding is as follows:

$q_t^T k_{t+\Delta t}= [W_Q(x_t+PE_t)]^T[W_K(x_{t+\Delta t}+PE_{t+\Delta t})]$ $= [x_t^T W_Q^T + PE_t^T W_Q^T][W_Kx_{t+\Delta t} + W_KPE_{t+\Delta t}]$

From this, we can see that inside the attention layer the score no longer depends on $PE_t$ and $PE_{t+\Delta t}$ through their direct inner product; the positional encodings only appear after a linear transformation. With this linear transformation in place, can the positional encoding still maintain the aforementioned properties of absoluteness, relativity, and long-range decay? The paper examines this in detail through experiments. Since $W_Q^T W_K$ can be merged into a single linear transformation, we can replace it with a randomly initialized matrix $W$ . The figure below shows the dot product as $\Delta$ varies with $t$ fixed.

Light blue represents the dot product of two sinusoidal positional codes at different locations, $PE_t^TPE_{t+\Delta t}$ , which does have good long-range attenuation and can reflect the distance between the two positional codes.
Yellow and green show the result after a randomly initialized linear transformation is inserted: once the PE vectors are multiplied by the weight matrix $W$ , there is no obvious pattern. This indicates that after introducing the linear transformation, $PE_t^TPE_{t+\Delta t}$ becomes $PE_t^TWPE_{t+\Delta t}$ , the desirable properties of the original positional encoding (long-range decay, etc.) are largely destroyed, and the distance information reflected by the inner product is lost.
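The effect of the inserted matrix can be reproduced in a few lines. This is a minimal sketch in the spirit of the experiment described above, not the paper’s code: `W` is a random Gaussian matrix standing in for $W_Q^TW_K$ , and `d_model=16`, the seed, and the chosen positions are arbitrary assumptions.

```python
import math
import random

def sinusoidal_pe(pos, d_model=16, base=10000.0):
    pe = []
    for i in range(d_model // 2):
        angle = pos / base ** (2 * i / d_model)
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def bilinear(a, W, b):
    """Compute a^T W b for a matrix W given as a list of rows."""
    return sum(a[i] * dot(W[i], b) for i in range(len(a)))

random.seed(0)
d_model = 16
W = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_model)]

# Same offset (5) at two different absolute positions.
plain_a = dot(sinusoidal_pe(10), sinusoidal_pe(15))
plain_b = dot(sinusoidal_pe(200), sinusoidal_pe(205))
mixed_a = bilinear(sinusoidal_pe(10), W, sinusoidal_pe(15))
mixed_b = bilinear(sinusoidal_pe(200), W, sinusoidal_pe(205))

print(round(plain_a, 6), round(plain_b, 6))  # equal: depends only on the offset
print(round(mixed_a, 6), round(mixed_b, 6))  # generally different
```

Without `W` the value depends only on the offset; with a generic `W` it also depends on the absolute positions, so the clean distance signal is gone.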


The reason is that the original Transformer’s positional encoding cleverly uses periodic trigonometric functions to keep its values bounded, while the angle addition and subtraction formulas give it relativity. However, this relative positional information is only contained within the positional encoding itself. When the input representation (the sum of the token embedding and the positional encoding) enters the self-attention layer, the situation changes. What matters there is not the product of the two positional encodings, but the positional encodings multiplied by two projection matrices. These projection matrices corrupt the positional information, causing the loss of the long-range decay property.

Su Shen also mentioned in his blog that an insufficiently trained $\cos(q_i,k_j)$ is the main reason why attention cannot extrapolate to longer lengths.

The relevance score between the i-th and j-th tokens is determined by the inner product:

$s(j|i)=q_i\cdot k_j=\|q_i\|\|k_j\|\cos(q_i,k_j)$

The second equality, from a geometric perspective, decomposes it into the product of the magnitude of each token and the cosine of the included angle. If the magnitude remains constant, $\cos(q_i,k_j)$ becoming smaller means that the inner product of two far apart positions is smaller; that is, the inner product can be used to reflect the relative distance between two position vectors in absolute position.

To increase the relative importance of a certain position j, the model has two options:

Increase the norm $\|k_j\|$ .
Increase $\cos(q_i,k_j)$ , equivalent to decreasing the angle between $q_i$ and $k_j$ .

However, due to the “curse of dimensionality,” significantly changing the angle in high-dimensional space is not so easy. Therefore, the model will preferentially increase the norm instead. The direct consequence is that the $\cos(q_i,k_j)$ values seen during training form only a limited set, while length extrapolation has to deal with a much larger set, so the model fails to make correct predictions.

Argument

Sinusoidal positional encoding involves sequentially adding sine and cosine waves of different frequency bands to the embedding dimension of the input x. Thus, the dot product of two sinusoidal positional codes contains the superposition of several sine and cosine waves, which can represent the relative positions between tokens.

Let’s assume two words at positions $t$ and $s$ . $W_Q^T$ and $W_K^T$ are weight matrices, $x_t$ and $x_s$ are word embeddings, and $p_t$ and $p_s$ are the position vectors at positions $t$ and $s$ , respectively. We want $p_t^Tp_s$ to be $g(t-s)$ , meaning that it depends only on the relative position. See the diagram below.

(a) is the content information of the word, which is the correlation between two words, and has nothing to do with position vectors and position encoding.
(b) and (c) are the correlation between words and positions. Each of them has only one position vector, which is information about the absolute position encoding, so it does not contain relative position information.
(d) containing both $p_t$ and $p_s$ is most likely to contain relative position information, which can be used to obtain the correlation between the two positions.
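For reference, the four terms can be written out by expanding the score with $q_t=W_Q(x_t+p_t)$ and $k_s=W_K(x_s+p_s)$ . The assignment of the labels (b) and (c) to the two mixed terms is an assumption here, chosen to match the descriptions above:

$q_t^Tk_s=\underbrace{x_t^TW_Q^TW_Kx_s}_{(a)}+\underbrace{x_t^TW_Q^TW_Kp_s}_{(b)}+\underbrace{p_t^TW_Q^TW_Kx_s}_{(c)}+\underbrace{p_t^TW_Q^TW_Kp_s}_{(d)}$

Only term (d) contains both position vectors, and even there they are separated by $W_Q^TW_K$ rather than multiplied directly.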

In the vanilla Transformer’s positional encoding, if W_Q and W_K were absent, term (d) would contain only relative positional information. However, in the actual attention score the sine and cosine waves pass through the parameter matrices, so the result is no longer a pure combination of sine and cosine waves. The “unknown” linear transformation eliminates the relative positional information; that is, the dot product of the two cannot reflect directionality, resulting in a lack of monotonicity.


Thus, the sinusoidal positional encoding in the original Transformer is essentially an attempt to express relative positions through absolute positional encoding, in the hope that sinusoidal positional embeddings can generalize to sequences longer than those seen in training. However, researchers subsequently found that sinusoidal absolute positional encodings (APEs) are difficult to extrapolate.

Therefore, various APEs and relative positional encodings (RPEs) have been proposed to improve on sinusoidal positional encoding and thereby the extrapolation of the Transformer. We will later analyze how these schemes, based on characterizing the relative distances between different positions within a sequence and controlling the bias magnitudes at different positions of the self-attention matrix, influence the model’s parameter learning process and final performance.

0x05 NoPE

Positional encoding (PE) is a crucial part of Transformer models, enabling them to process sequential data and capture the order relationships between elements. Despite some limitations, its introduction was an important step towards efficient parallel processing. Still, positional encoding can look like a patch or an expedient. Some therefore argue that it is not so important, especially since Decoder-only models are built on Causal Attention, which is itself not permutation invariant, so in principle they seem not to need positional encoding at all (NoPE). Other researchers believe that even Decoder-only models need PE. Let's examine both viewpoints.

5.1 Not required

Studies have shown that a Transformer with no positional encoding (NoPE) performs comparably to Transformer+RoPE on autoregressive language modeling tasks, including…

Autoregressive models based on causality: In autoregressive models (such as the GPT series), the causal attention mechanism implicitly introduces positional information by restricting each element to interacting only with previous elements. Although explicit positional encoding can improve model performance, some studies have shown that these models can learn some positional information even without explicit positional encoding.
For specific non-sequential tasks: For some tasks that do not depend on the order of elements, such as certain types of image classification, positional encoding may not be necessary because the model’s goal may be more focused on extracting global features rather than understanding the order relationship between elements. Adding positional encoding may even lead to a performance degradation.
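The first point can be illustrated with a small experiment. The sketch below is a simplification, not code from any of the cited papers: it uses a single head with $Q = K = V = x$ (no learned weights, no positional encoding) and checks whether shuffling the input merely shuffles the output. The names `attention`, `close`, and `perm` are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(x, causal):
    """Single-head self-attention with Q = K = V = x and no positional encoding."""
    n, d = len(x), len(x[0])
    out = []
    for t in range(n):
        limit = t + 1 if causal else n  # causal: position t sees only positions 0..t
        scores = [sum(a * b for a, b in zip(x[t], x[s])) / math.sqrt(d)
                  for s in range(limit)]
        w = softmax(scores)
        out.append([sum(w[s] * x[s][k] for s in range(limit)) for k in range(d)])
    return out

def close(a, b, tol=1e-9):
    return all(abs(u - v) < tol for ra, rb in zip(a, b) for u, v in zip(ra, rb))

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]
shuffled = [tokens[i] for i in perm]

# Full attention: shuffling the input only shuffles the output rows.
full = attention(tokens, causal=False)
full_is_equivariant = close(attention(shuffled, causal=False),
                            [full[i] for i in perm])

# Causal attention: each token's output depends on which tokens precede it.
caus = attention(tokens, causal=True)
causal_is_equivariant = close(attention(shuffled, causal=True),
                              [caus[i] for i in perm])

print(full_is_equivariant, causal_is_equivariant)
```

The causal mask makes each output depend on the set of preceding tokens, which is how order information can enter the model without any explicit position vector.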

For details, see the following papers:

Haviv, A., Ram, O., Press, O., Izsak, P., & Levy, O. (2022). Transformer Language Models without Positional Encodings Still Learn Positional Information. (EMNLP 2022).
Chi, T.-C., Fan, T.-H., Chen, L.-W., Rudnicky, A., & Ramadge, P. (2023). Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings. (ACL 2023).
Kazemnejad, A., Padhi, I., Natesan Ramamurthy, K., Das, P., & Reddy, S. (2023). The Impact of Positional Encoding on Length Generalization in Transformers. (NeurIPS 2023).

The paper “Length Generalization of Causal Transformers without Position Encoding” studies the length generalization problem of NoPE models and proposes a length generalization method suited to them. The authors found that simple attention scaling (introducing a softmax temperature hyperparameter that forces attention to concentrate) significantly improves the length generalization of NoPE but has no significant effect on RoPE. The paper argues that NoPE removes the interference of explicit position encoding and so directly exposes the model's internal representation of positional information.
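The idea of attention scaling can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the function name `attention_weights`, the parameter `scale`, and the value 1.5 are hypothetical, and the query and keys are random vectors rather than activations of a real model.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def entropy(p):
    """Shannon entropy in nats; higher means more diffuse attention."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def attention_weights(query, keys, scale=1.0):
    """Weights of one query over the keys it can see, with logits multiplied by `scale`.

    scale > 1 corresponds to a softmax temperature below 1 (sharper attention).
    """
    d = len(query)
    logits = [scale * sum(a * b for a, b in zip(query, k)) / math.sqrt(d)
              for k in keys]
    return softmax(logits)

rng = random.Random(0)
d = 16
query = [rng.gauss(0.0, 1.0) for _ in range(d)]
keys_long = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(1024)]
keys_short = keys_long[:64]

h_short = entropy(attention_weights(query, keys_short))
h_long = entropy(attention_weights(query, keys_long))
h_long_scaled = entropy(attention_weights(query, keys_long, scale=1.5))

# Attention over a longer context is typically more diffuse;
# scaling the logits up pulls the entropy back down.
print(h_short, h_long, h_long_scaled)
```

For a fixed set of logits, raising `scale` can never increase the entropy of the softmax, which is why a temperature term is a natural knob for counteracting attention that spreads out as the context grows.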

5.2 Required

The following is a conclusion from Su Jianlin: for long texts, NoPE may suffer from insufficient positional resolution, low efficiency, and attention diffusion. Therefore, even Decoder-only models still need to be supplemented with additional positional encoding (especially relative positional encoding) to address these shortcomings.

NoPE primarily expresses positional information through the variance of the hidden state vectors. “Causal + NoPE” essentially hides the positional information in the component variance of $y$, or equivalently in the $\ell_2$ norm of $y$. In other words, $y_n$ is obtained by multiplying a vector $z_n$ that carries no positional information by a scalar function $p(n)$ of the position $n$, which in turn means:

NoPE implements a multiplicative-like absolute positional encoding, but it compresses positional information into a single scalar, which makes it a very weak positional encoding.
A single scalar can carry only a limited amount of information. As the input length increases, the encoded positions crowd closer together and become hard to distinguish. As a simple example, take $p(n) \sim 1/\sqrt{n}$. When $n$ is large enough, $1/\sqrt{n}$ and $1/\sqrt{n+1}$ are almost identical, so positions $n$ and $n+1$ can no longer be told apart.
The prevailing view is that relative position encoding is better suited to natural language. Since NoPE implements an absolute position encoding, it is naturally less efficient than adding relative position encoding to the model.
NoPE neither adds priors such as long-range decay to the model nor seems to give the model the ability to learn such priors. When the input is long enough, attention may become diffuse and fail to concentrate.
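Both the scalar view and its resolution limit can be made concrete. The sketch below rests on a simplifying assumption that is not from the source: attention is taken to be uniform over the visible prefix, so $y_n$ is the mean of $n$ i.i.d. standard normal vectors and its norm shrinks roughly like $\sqrt{d/n}$. The helper names are illustrative.

```python
import math
import random

def position_scalar(n):
    """Illustrative p(n) ~ 1/sqrt(n) from the text."""
    return 1.0 / math.sqrt(n)

def gap(n):
    """Difference between the scalars assigned to positions n and n + 1."""
    return position_scalar(n) - position_scalar(n + 1)

def running_mean_norms(length, d, seed=0):
    """L2 norm of y_n = mean(z_1..z_n) for i.i.d. standard normal z_i in R^d.

    Models a causal layer whose attention is uniform over the visible prefix
    (an assumption made only for this sketch).
    """
    rng = random.Random(seed)
    total = [0.0] * d
    norms = {}
    for n in range(1, length + 1):
        for k in range(d):
            total[k] += rng.gauss(0.0, 1.0)
        norms[n] = math.sqrt(sum((v / n) ** 2 for v in total))
    return norms

d = 256
norms = running_mean_norms(length=64, d=d)
for n in (1, 4, 16, 64):
    # The measured norm tracks sqrt(d / n), so the norm alone hints at n.
    print(n, norms[n], math.sqrt(d / n))

for n in (10, 1000, 100000):
    # The gap between neighbouring positions shrinks quickly as n grows.
    print(n, gap(n))
```

The first loop shows the position leaking into the norm; the second shows why that signal becomes too fine to resolve at long range, since the gap decays roughly like $1/(2 n^{3/2})$.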

0xFF Reference

Position Information in Transformers: an Overview

Transformer Architecture: The Positional Encoding

Understanding Position Encoding in Transformer Models

Machine Translation: Fundamentals and Models

Transformer position encoding, which has researchers racking their brains

Transformer Location Encoding (Meaning) by Riverside Grass lxr

[Deep Learning] Natural Language Processing - Introduction to Transformer Positional Encoding by Shuk and Baker

A Gentle Introduction to Positional Encoding in Transformer Models, Part 1

baiziyu: English-German Translation Based on End-to-End Attention Networks - 1

baiziyu: English-German Translation Based on End-to-End Attention Networks - Part 2

Why does a Decoder-only LLM need positional encoding? (Su Jianlin)

https://arxiv.org/pdf/2006.15595.pdf

https://github.com/guolinke/TUPE

A HuggingFace engineer shares his insights: How to achieve optimal positional encoding in Transformer [Machine Heart]

LANGUAGE TRANSLATION WITH TORCHTEXT

Learning to Encode Position for Transformer with Continuous Dynamical Model

Length Extrapolation of Transformers: A Survey from the Perspective of Positional Encoding Liang Zhao, Xiaocheng Feng, Xiachong Feng, Dongliang Xu, Qing Yang, Hongtao Liu, Bing Qin, Ting Liu

Length Generalization of Causal Transformers without Position Encoding

LLM - A Simple Explanation of Position Encoding and RoPE BIT_666

The role of use_cache and the usage mechanism of past_key_value in LLM

Positional Encoding in Transformers during the LLM Era

LLM (23): The Long Text Problem in LLM (Auspicious Purple Qi from the East)

On Position Embeddings in BERT

Sinusoidal Location Coding Explained by Zhang

TENER: Adapting Transformer Encoder for Named Entity Recognition

Transformer Architecture: The Positional Encoding Amirhossein Kazemnejad’s Blog

In a transformer: Does the self-attention component need to be masked?

Position encoding in Transformer Aozora Zhiqian

Transformer Position Encoding (Basic) by Riverside Grass lxr

Transformer positional encoding (meaning) [Riverside Grass lxr]

Transformer Position Encoding (Improved) [Riverside Grass lxr]

Transformer Upgrade Path: 15. Key Normalization Aids Length Extrapolation (Su Jianlin)

The Transformer Upgrade Path: 16. “Review” of Length Extrapolation Techniques (Su Jianlin)

The Transformer Upgrade Path: 18. RoPE’s Underlying Design Principles (by Su Jianlin)

The Transformer Upgrade Path: 1. Tracing the Origins of Sinusoidal Position Encoding

Transformer Learning Notes 1: Positional Encoding (by a powerful ape)

“Chasing Stars” with Transformer (Part 2): GPT Iron Heart Walnut Based on Transformer Pre-trained Model

“Convolutional Sequence to Sequence Learning”

Research on Chinese Language Models: (1) Multiplicative Positional Encoding PENG Bo

[OpenLLM 009] Position Encoding: A Comprehensive Explanation of Position Encoding and Length Extrapolation in LLM (Part 1) - A 10,000-Word Article OpenLLM

[OpenLLM 010] Position Encoding: A Comprehensive Explanation of Position Encoding and Length Extrapolation in LLM (Part 2) - A 10,000-Word Article OpenLLM

Understanding the Tidbits of Transformer in One Article: Position Encoding

A Comprehensive Guide to Position Encoding: From Standard Position Encoding and Euler’s Formula to Rotational Position Encoding (RoPE v_JULY_v)

A line-by-line analysis and implementation of Transformer, with a German-to-English translation practice (Part 1) iioSnail

A line-by-line analysis and implementation of Transformer, followed by a German-to-English translation practice (Part 3).

A line-by-line analysis and implementation of Transformer, followed by a German-to-English translation practice (Part 2).

From Words to Numbers: A Comprehensive Explanation of Tokenizers and Embeddings by HeptaAI

A comprehensive study of location coding aims to reach for the moon and soar to the heavens.

Background knowledge of positional encoding algorithms Zhang

Code Implementation and Performance Experiment of Six Position Encoding Methods (by Zhang Yice)

Further Discussion on Location Encoding and Extrapolation in Large Models (10,000-word article)

Further discussion on length extrapolation uuuuu

Illustrated Explanation of RoPE Rotational Position Encoding and Its Characteristics (Red Rain Pours)

English-German Translation Based on End-to-End Attention Network - 1

Useful Information! On Position Embeddings AI TIME

Sequence-to-sequence (seq2seq) model for text translation

Does No Position-Based Encoding (NoPE) also have length generalization problems? A CV (Computer Vision) Technical Guide to the First Length Extrapolation Method for NoPE

A Brief Discussion on LLM Length Extrapolation

Learn large-scale models through visuals: From absolute position encoding to rotational position encoding (RoPE), even junior high school students can understand it, and some who can even read tables can understand it :) Learn through visuals

Random thoughts: The little details of Transformer

Transformer position encoding that has researchers racking their brains - Su Jianlin

Paper Notes: Multi30K: Multilingual English-German Image Descriptions by Shuai Shuai Liang

Super super super super easy! Deriving RoPE rotational position coding from results (NanNanNanNanx https://www.zhihu.com/people/nan-nan-nan-nan-x)

https://github.com/meta-llama/llama/blob/main/llama/model.py

https://zhuanlan.zhihu.com/p/372858304
