
Exploring the Transformer Series (9) --- Location Encoding Classification

APE vs RPE in Transformers: differences, representative methods, and relative-position design patterns.

Attention and Positional Information · advanced · 1.5 hr · Reading
Tags: transformer, position-encoding, ape, rpe, xlnet, t5, deberta, alibi, tupe

Table of contents

  • Exploring the Transformer Series (9) --- Location Encoding Classification
  • 0x00 Overview
  • 0x01 Difference
  • 1.1 From an intuitive perspective
  • 1.2 From the perspective of model processing
  • 1.3 Advantages and disadvantages
  • 0x02 Absolute position encoding
  • 2.1 Basic Scheme
  • 2.2 Training Method
  • 2.3 Trigonometric Functions
  • 2.4 Other
  • 0x03 Relative Position Encoding
  • 3.1 Significance
  • Reference system in the brain
  • Semantic impact
  • Length extrapolation
  • 3.2 Position of Absolute Position Encoding
  • 3.3 Formula for Absolute Position Encoding
  • 3.4 Classic Style
  • 3.5 XLNET
  • 3.6 TENER
  • 3.7 T5
  • 3.8 DeBERTa type
  • 3.9 TUPE
  • 3.10 ALiBi
  • 3.11 Bias Coding & Context Mode
  • 3.12 Summary
  • 0xFF Reference

0x00 Overview

Because self-attention is permutation-invariant (strictly speaking, permutation-equivariant: reordering the inputs only reorders the outputs), the Transformer cannot directly capture the position of each word in the sequence. Injecting order information through positional encoding has therefore become standard practice. Depending on whether the encoding represents the absolute or the relative position of elements in the sequence, the field mainly divides positional encoding into Absolute Position Encoding (APE) and Relative Position Encoding (RPE). The core idea of APE is to add a position vector to each element of the input sequence to represent that element’s specific position. RPE instead focuses on the distance between elements. We say “mainly” because some positional encodings are hard to classify, and other taxonomies exist, for example listing RoPE separately as a rotary encoding.


0x01 Difference

In the previous article, we saw that some researchers proposed relative positional encoding to compensate for the way the self-attention computation disrupts positional information. This is what led to the split of positional encoding into two families. In this section, we examine the differences between absolute and relative positional encoding from several perspectives.

1.1 From an intuitive perspective

Take the sentence “From beneath the locust tree leaves, counting the rays of sunlight filtering down from the east” as an example (translated from Chinese; the token positions below refer to the characters of the original sentence). How do we describe the order of the sequence? There are generally two options:

  • Absolute positional information. For example, “from” is the first token, and “bottom” is the fifth token.
  • Relative positional information. For example, “light” is one position away from “sun,” but four positions away from “leak.”

These two schemes correspond to absolute positional encoding and relative positional encoding, respectively. The diagram below gives an intuitive comparison of three cases: no positional encoding, absolute positional encoding, and relative positional encoding.

  • No positional encoding. In human language, the position and order of words define grammar and also affect semantics. Without word order, it is difficult to understand the meaning of a sentence.
  • Absolute position encoding. Absolute position encoding tells the Transformer architecture the position of each element in the input sequence, similar to giving each element in the input sequence a “position label” to indicate its absolute position.
  • Relative position encoding. Relative position encoding is used in the self-attention mechanism to inform the Transformer architecture of the distance between each pair of elements.

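[Figure 902: no positional encoding, absolute positional encoding, and relative positional encoding compared]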

Since natural language generally relies more on relative positions, relative positional encoding usually performs well.
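As a small illustration of the two options above, the following sketch (NumPy; the function name is made up for this example) builds the matrix of signed offsets between tokens and checks that it does not change when every absolute index is shifted. That shift invariance is the property relative schemes rely on.

```python
import numpy as np

def relative_offsets(positions):
    """Return R with R[i, j] = positions[j] - positions[i]."""
    positions = np.asarray(positions)
    return positions[None, :] - positions[:, None]

absolute = np.arange(4)        # absolute indices of a 4-token sequence: 0, 1, 2, 3
shifted = absolute + 10        # the same tokens placed 10 positions later

offsets = relative_offsets(absolute)
# Absolute indices differ, but every pairwise offset is unchanged.
same_after_shift = np.array_equal(offsets, relative_offsets(shifted))

print(offsets)
print(same_after_shift)
```

An absolute scheme would see two different inputs here, while a scheme that only consumes `offsets` sees the same one.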

1.2 From the perspective of model processing

From the perspective of model processing, these two approaches differ as follows:

  • Absolute position information is handled at the input layer: it is merged into each token’s input representation during the input phase. Specifically:
    • Position within the Transformer. APE appears only before the first layer.
    • Modeling approach. The encoding is defined by each token’s absolute position index and is obtained from a fixed formula or a learnable vector.
    • Relative distance. The encoding of each position is a fixed vector that is independent of the others and does not consider its relationship to other positions, so the relative distance is not directly available to the attention computation.
    • Model input. For the model, the input for each token is a fusion of the token’s own encoding and its positional encoding.
    • Object of operation. The encoding operates on the feature sequences $Q$ and $K$ of the self-attention transform (the Transformer paper applies it to the input embedding sequence $X$); that is, a token’s absolute position information is added to its query and key vectors.
  • Relative position information is mainly handled inside the model’s layers. By modifying the attention structure, the model becomes able to distinguish tokens at different positions.
    • Position within the Transformer. RPE typically repeats in every layer, unlike APE, which appears only before the first layer.
    • Modeling approach. RPE does not model the complete positional information of each input. Instead, it models the relative position $i - j$: when computing the self-attention distribution, it considers the difference between the indices of two tokens. This lets the model learn positional information from the data and distinguish tokens at different positions.
    • Relative distance. APE considers the position of each token independently; RPE considers the relative position between query and key during the attention computation, that is, it represents position through the relative distance between two tokens.
    • Model input. Generally, RPE does not add positional codes to the word embeddings; the model input remains the word embeddings. Some schemes use a distance encoding matrix to compute an offset vector and add it to the positional embedding vector before feeding it into the model.
    • Object of operation. The encoding operates on the self-attention matrix $A$ of the self-attention transform (early schemes also operated on the feature sequence $V$); that is, the relative position information of two tokens is added to the corresponding entry of $A$.

These differences are illustrated in the figure below, where the added term encodes the relative positional information of a pair of tokens. RPEs tend to modify the attention mechanism directly to fuse in relative positional information; this modification is independent of the value vectors, which keeps them from becoming entangled with positional information.

[Figure 903: where APE and RPE act in the Transformer]

1.3 Advantages and disadvantages

The advantage of absolute positional encoding is that it is simple to implement. Its disadvantages are:

  • It generalizes poorly, in particular to sequence lengths not seen during training.
  • The lack of relative position information may hinder the model’s ability to understand subtle differences in language structure.

The advantages of relative position encoding are:

  • Positional information can permeate every dimension of the query and key feature vectors. When modeling long-distance semantic relationships, this not only filters out irrelevant information effectively but also provides a channel for passing on valuable semantic connections.
  • It can better handle the local structure of sequences because it focuses on the relative positions between elements.
  • It builds a bridge between semantic information and positional information, so that positional information no longer acts as an absolute constraint.

The disadvantages are:

  • Computational inefficiency. The additional steps in the self-attention layer (such as obtaining the relative position encoding for each time step and adding the position matrix to the query-key score matrix) make the computation more complex and can increase training and inference time.
  • KV cache complications. A KV cache requires that the cached representations of already generated tokens, including their positional component, stay unchanged when new tokens are generated, which holds for absolute positional encoding. In relative schemes where the position-dependent part of a token’s representation changes with each new time step, the cache cannot be reused as-is, which makes these schemes awkward for inference.
  • The overall bias tends to fluctuate with the size of the relative position, so more dimensions and additional corrections are needed to alleviate this. Moreover, it cannot fully penalize the semantic relationships it is meant to suppress.

Furthermore, current positional encoding methods tend to make models focus too heavily on local information and rely too much on neighboring generated content. How to achieve “think before you act” at the positional encoding level, that is, letting the model consider information from more distant positions while still attending to neighboring information and thereby correct the current output, is also a problem that cannot be ignored.

Although positional encoding is mainly divided into two categories, absolute and relative, each category has given rise to many variations, and researchers have shown great ingenuity here. This article presents the diverse encoding schemes researchers have devised to better represent positional information, each with its own approach.

0x02 Absolute position encoding

In form, absolute position encoding is a relatively simple scheme. There are three classic approaches:

  • Learnable and unconstrained. A representative work is the paper “Convolutional Sequence to Sequence Learning”, which uses trainable embeddings as positional encodings.
  • Trained positional embeddings, as used in the paper “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”.
  • Sinusoidal (trigonometric) position encoding. A representative work is the paper “Attention Is All You Need,” which generates position codes with sine and cosine functions.

In recent years, most of the work on absolute position coding has focused on generating absolute position codes using different methods, such as learning additional parameters on top of trigonometric function coding.

2.1 Basic Scheme

The basic scheme was proposed in the paper “Convolutional Sequence to Sequence Learning”. It maps each position $k$ to a unique position vector $p_k$, adds that vector to the embedding $x_k$ of the word at position $k$, and feeds the sum into the model:

$$h_k = x_k + p_k$$

Here $h_k$ is the input of the model, $x_k$ is the word embedding at the $k$-th position, and $p_k$ is the absolute position code of the $k$-th position, which depends only on the position index $k$.
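The additive scheme can be sketched in a few lines. This is a minimal illustration, not the ConvS2S implementation: the table sizes, the random initialization, and the names `token_table`, `position_table` and `embed` are all assumptions made for the example. Each model input is the token embedding plus the position vector looked up by index.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d_model = 100, 16, 8   # illustrative sizes (assumed)

# Hypothetical tables; in a real model both would be trained parameters.
token_table = rng.normal(size=(vocab_size, d_model))
position_table = rng.normal(size=(max_len, d_model))

def embed(token_ids):
    """Return one input vector per token: token embedding + position vector."""
    token_ids = np.asarray(token_ids)
    positions = np.arange(len(token_ids))       # 0, 1, ..., n-1
    return token_table[token_ids] + position_table[positions]

h = embed([5, 7, 5])   # token 5 appears at positions 0 and 2
# Same token, different positions: the inputs differ only by the position vectors.
print(np.allclose(h[0] - position_table[0], h[2] - position_table[2]))
```

Because `position_table` has only `max_len` rows, a longer sequence has no position vector to look up; this is the extrapolation limit of learned tables.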

2.2 Training Method

BERT and GPT use learned absolute positional embeddings, which represent positional information through learnable parameters in the model’s embedding layer. The positional encoding is treated directly as trainable parameters: a matrix of shape (maximum sequence length, embedding dimension) is initialized as the position vectors and updated during training. In other words, one embedding vector is trained for each input index to characterize absolute position. The matrix is then looked up by position index, just as a vocabulary embedding table is looked up by token id.

The advantage of learnable schemes is that they can be adjusted according to the needs of the task, more accurately distinguish words at different positions, and capture the impact of positional information on the task, thereby learning the positional encoding best suited for a specific task. The disadvantages are poor scalability and extrapolation. They can only represent positions within a finite length, cannot model arbitrary positions, cannot generalize well to longer sequences not seen during training, and lack long-range decay. Furthermore, since positional information is implicitly provided through positional encoding, the model needs to learn from the data how to best utilize this information, which may require more model parameters and training data.

2.3 Trigonometric Functions

The vanilla Transformer generates position vectors using fixed mathematical formulas (using sine and cosine functions) to capture the relative relationships between different positions. Details will not be elaborated here. Because one of its core ideas is to achieve relative encoding through absolute encoding, it is sometimes categorized as a hybrid positional encoding method.
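The claim that sinusoidal encoding reaches relative information through absolute codes can be checked numerically. The sketch below implements the sine/cosine formula from “Attention Is All You Need” (the function name and the sizes are chosen for this example) and verifies that the dot product of two position vectors depends only on their offset, since sin(a)sin(b) + cos(a)cos(b) = cos(a - b) at every frequency.

```python
import numpy as np

def sinusoidal_encoding(num_positions, d_model, base=10000.0):
    """PE[pos, 2i] = sin(pos / base**(2i/d_model)); PE[pos, 2i+1] = cos(same angle).

    d_model is assumed to be even.
    """
    positions = np.arange(num_positions)[:, None]            # (n, 1)
    freqs = base ** (-np.arange(0, d_model, 2) / d_model)    # (d_model / 2,)
    angles = positions * freqs                               # (n, d_model / 2)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(64, 16)

# Positions (3, 10) and (20, 27) are both 7 apart, so the dot products match.
offset_only = np.allclose(pe[3] @ pe[10], pe[20] @ pe[27])
print(offset_only)
```

This holds for the raw encodings only; once learned projection matrices sit between the two vectors, the property is no longer guaranteed.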

2.4 Other

In addition, there are other methods, such as:

  • Encoding Word Order in Complex Embeddings proposes a complex-valued word vector function to generate absolute positional encodings, cleverly linking the amplitude and phase of the complex-valued function with word meaning and position. This complex-valued word vector function calculates the word vector for each word at different positions using position as the variable. Because the function is continuous with respect to position, this method models not only absolute position but also the relative positions between words.
  • SHAPE: Shifted Absolute Position Embedding for Transformers proposes a robust training method for absolute position encoding. The basic idea of SHAPE is to randomly shift the absolute position encoding by a certain distance during training to achieve generalization ability.
  • Rethinking Positional Encoding in Language Pretraining adds, in the attention layer, the dot product between the positional embeddings of two tokens to the attention logits.
  • Some researchers have also considered using element-wise multiplication ($\odot$) to fuse word embeddings and positional encodings. Adding token embeddings and positional encodings is a form of feature interaction, and from that perspective multiplication is a form of feature interaction too.
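The random-shift idea behind SHAPE, mentioned in the list above, is easy to sketch. The function below is a hypothetical helper, not code from the paper: during training it draws one offset per sequence and adds it to every position index, so the model cannot rely on where the sequence starts. The sampling range `max_shift` is an assumed hyperparameter.

```python
import numpy as np

def shifted_positions(seq_len, max_shift, rng, training=True):
    """Position indices with one random offset per sequence during training."""
    # rng.integers excludes the upper bound, hence max_shift + 1.
    offset = int(rng.integers(0, max_shift + 1)) if training else 0
    return np.arange(seq_len) + offset

rng = np.random.default_rng(0)
train_positions = shifted_positions(8, max_shift=100, rng=rng)
eval_positions = shifted_positions(8, max_shift=100, rng=rng, training=False)

print(train_positions)   # consecutive indices starting at a random offset
print(eval_positions)    # 0..7, no shift at evaluation time
```

The shifted indices would then be fed to any absolute encoding, for example a sinusoidal one, which must be defined for positions up to `seq_len + max_shift`.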

0x03 Relative Position Encoding

3.1 Significance

Let’s look at the significance of relative positions from several aspects.

Reference system in the brain

Jeff Hawkins, a member of the National Academy of Engineering, put forward some ideas in his papers and in his book “A Thousand Brains” that are worth considering:

  • Frame of reference and neocortex.
    • The key to the neocortex is the frame of reference.
    • The frame of reference is ubiquitous in the neocortex.
  • Reference frame and storage.
    • A frame of reference is a storage structure for information in the brain, which uses it to manage all knowledge.
    • Knowledge is stored in a location associated with a frame of reference. Every fact we know corresponds to a location within that frame of reference.
  • Reference frame and modeling.
    • The brain builds a model of the world by associating sensory input with locations in a frame of reference.
    • A frame of reference is not just for modeling physical objects, but for modeling everything we know. Beyond concrete objects, frames of reference can also be used to develop abstract concepts, such as philosophy and democracy, which are defined based on different frames of reference in the neocortex.
  • Frame of reference and thinking.
    • Sequence recognition is a problem. The neocortex must know what the next movement will be in order to make predictions about the next input to the sequence.
    • Thinking is a special form of movement. Assuming everything we know is stored in a frame of reference, then to recall stored knowledge, we need to activate appropriate locations within that frame. Thinking occurs when neurons activate one location after another.

As described above, frames of reference play a central role in how the brain organizes knowledge, which makes them highly relevant to positional encoding. More precisely, they are one of the key theoretical supports for relative positional encoding.

Semantic impact

In many tasks, the relative positions of elements in a sequence are crucial for understanding the sequence’s semantics and structure. In other words, absolute position has less impact on sentence semantics; relative position is far more important. In the examples below (translated; the first two reorder the same few words in the original language), relative word order has a greater semantic impact than absolute word order.

  • Reading is good, reading good books is good, and reading is enjoyable.
  • People from Sichuan are not afraid of spicy food, people from Guizhou are not afraid of spicy food, and people from Hunan are afraid of food that isn’t spicy enough.
  • There is an unidentified creature eating a chicken.

Length extrapolation

Intuitively, length extrapolation is strongly tied to both length and position. The Transformer authors proposed sinusoidal positional embeddings and claimed that they could extrapolate to sequences longer than those seen in training. The idea behind this claim, that length extrapolation can be enabled simply by changing the position representation, has been widely supported and demonstrated. Developing better positional encoding methods has therefore become a major way to improve Transformer length extrapolation.

Because APE performs poorly in length extrapolation, while RPE naturally possesses better extrapolation capabilities due to its shift invariance, and because the relative order of words in context is generally considered more important, RPE has become the primary method for encoding positional information in recent years.

Early relative position encoding (RPE) techniques were simple modifications of sinusoidal position encoding, often combined with clipping or binning strategies to avoid out-of-distribution position embeddings, which was considered beneficial for extrapolation. Furthermore, because RPE removes the one-to-one correspondence between a position and its representation, adding a bias term directly to the attention formula becomes a feasible, and even better, way to integrate position information into the Transformer. This approach is much simpler and naturally decouples the value vectors from position information.

3.2 Position of Absolute Position Encoding

How do we add relative position information to a Transformer? Two lines of reasoning lead to the same conclusion.

  • Since the positional encoding of each word is obtained relative to the positional differences of other words, it is obviously not as simple as adding it directly to the input as APE. Instead, it needs to be addressed from the module after the input, which is the attention module.
  • As analyzed earlier, the ability to represent relative positions in the original transformer is destroyed during the attention computation stage. Therefore, researchers naturally thought of adding the relative position information back in during the attention computation.

Researchers therefore modified the self-attention computation to embed relative positional information into the attention mechanism of each Transformer layer. Relative positional encoding uses the indices of each matrix element to account directly for the relative position of the two tokens that element corresponds to. For example, when computing self-attention, whether in the dot product of query and key or in the final multiplication of attention weights and values, an additional bias is introduced that depends only on the positions $i$ and $j$ and represents their relative positional information. This encodes the relative distance between elements by combining the positional embedding vector of each element with the offset vectors of other positions. The positional embedding vector of each element changes according to its positional relationship with other elements, which better captures local structure in the sequence and improves sequence modeling.

Taking the sentence “You are great” as an example, how do we describe the order of the sequence? There are generally two options.

  • Absolute positional information. For example, “You” is the first token, and “are” is the second token.
  • Relative positional information. For example, “great” is one position away from “are”, but two positions away from “You”.

The image below shows examples of absolute and relative positional biases that can be added to the attention matrix. Left: the attention matrix for the example sentence. Middle: a learnable absolute positional bias. Right: a positional bias relative to the reference position. The relative biases exhibit an intuitive weighting pattern that the absolute positional bias lacks.

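[Figure 904: left, attention matrix for the example sentence; middle, learnable absolute positional bias; right, relative positional bias]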

3.3 Formula for Absolute Position Encoding

Since most relative positional encodings are derived from sinusoidal positional encoding, we first consider a general attention mechanism with absolute positional encoding. The accompanying diagram shows the computation flow of self-attention in one layer of the Transformer. The final output $o_i = \sum_j a_{ij} v_j$ is the sum of the value vectors $v_j$ weighted by the attention weights $a_{ij}$. It captures the relationship between the current position $i$ and all positions in the sequence and is a linearly weighted representation of the input sequence.

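Writing the query as $q_i = (x_i + p_i) W_Q$ and the key as $k_j = (x_j + p_j) W_K$, where $x$ denotes a token embedding and $p$ an absolute position encoding, formula (2) below the image expands the inner product between query and key into four attention terms:

$$q_i k_j^\top = \underbrace{x_i W_Q W_K^\top x_j^\top}_{(a)} + \underbrace{x_i W_Q W_K^\top p_j^\top}_{(b)} + \underbrace{p_i W_Q W_K^\top x_j^\top}_{(c)} + \underbrace{p_i W_Q W_K^\top p_j^\top}_{(d)}$$

The four terms are:

  • (a) “Input-input”. The original score without positional encoding: pure content-based addressing.
  • (b) “Input-position”. The positional bias relative to the current content.
  • (c) “Position-input”. This measures the importance of the key at the content level, representing a global content bias.
  • (d) “Position-position”. This measures the importance of a key from a relative positional perspective, representing a global positional bias.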
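The four-term split can be checked numerically. In the sketch below, `x` holds token embeddings and `p` holds absolute position encodings (symbols chosen for this example), both projected by the same `W_Q` and `W_K`; the random values stand in for trained parameters. The full query-key score matrix equals the sum of the four terms.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8
x = rng.normal(size=(n, d))      # token embeddings, one row per position
p = rng.normal(size=(n, d))      # absolute position encodings
W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))

# Scores with positions added to the input: entry [i, j] is q_i . k_j
full = ((x + p) @ W_Q) @ ((x + p) @ W_K).T

term_a = (x @ W_Q) @ (x @ W_K).T   # input-input: pure content addressing
term_b = (x @ W_Q) @ (p @ W_K).T   # input-position
term_c = (p @ W_Q) @ (x @ W_K).T   # position-input
term_d = (p @ W_Q) @ (p @ W_K).T   # position-position

print(np.allclose(full, term_a + term_b + term_c + term_d))
```

The relative schemes discussed in this article can be read as different choices about which of these terms to keep, replace, or drop.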

The introduction of relative positional encoding generally starts from this point. Some schemes turn certain terms into trainable parameters, while others even remove the middle two terms. In short, how to characterize the relative distance between different positions in the sequence and how to control the bias magnitude of different positions in the self-attention matrix through relative distance has always been the most important aspect of positional encoding design. Different positional encoding schemes add biases in different ways.

[Figure 905]

Next, we analyze some classic relative position encoding methods.

3.4 Classic Style

The concept of relative position encoding originated from the paper “Self-Attention with Relative Position Representations,” whose authors include members of the original Transformer team. They were presumably aware of the limitations of sinusoidal encoding early on.

The diagram below illustrates how this scheme modifies sinusoidal encoding. The main idea is to use the current position as an anchor. When computing the attention score $e_{ij}$ and the weighted sum $z_i$, a trainable relative position vector is introduced in each: $a^K_{ij}$ and $a^V_{ij}$ (notation from the paper). The technical details are as follows:

  • In form, it is related to sinusoidal position encoding.
  • The relative position information is added to the keys and the values and shared across attention heads, and the relative distance is clipped to a fixed range.
  • The target of relative position encoding is the edge, or distance, between sequence elements. “Relative” means replacing a vector that originally depended on the pair of coordinates $(i, j)$ with one that depends only on the relative distance $j - i$: $a^K_{ij} = w^K_{\mathrm{clip}(j-i,\,k)}$ and $a^V_{ij} = w^V_{\mathrm{clip}(j-i,\,k)}$, where $\mathrm{clip}(x, k) = \max(-k, \min(k, x))$.

Furthermore, the relative distance is usually clipped, both to handle arbitrary distances and to avoid positions outside the range seen in training. Because of this clipping, relative positions in sequences of any length can be represented with only a finite number of position encodings. In other words, clipping relative positions to a fixed range reduces the number of position embeddings to learn and improves length extrapolation.

[Figure 906]

About the clipping: the relative position used here is a piecewise function, linear within $[-k, k]$ and clipped at both ends. The preset maximum relative distance $k$ focuses the attention computation on the $k$ words to the left and right of the current word, so the final window size is $2k + 1$. Words whose distance exceeds $k$ are clipped to the boundary value, so only nearby words are modeled precisely. The relative position weight matrix is shown in the figure below.

[Figure 907: relative position weight matrix]
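The clipping and the two added vectors can be sketched as follows. This is a single-head illustration under assumptions, not the paper's code: `a_K` and `a_V` are hypothetical tables with one row per clipped offset (2k + 1 rows), the random values stand in for trained parameters, and masking is omitted.

```python
import numpy as np

def clipped_relative_index(n, k):
    """idx[i, j] = clip(j - i, -k, k) + k, so values lie in [0, 2k]."""
    pos = np.arange(n)
    return np.clip(pos[None, :] - pos[:, None], -k, k) + k

def relative_attention(x, W_Q, W_K, W_V, a_K, a_V, k):
    """Attention with relative vectors added to the keys and to the values."""
    n, d = x.shape[0], W_Q.shape[1]
    idx = clipped_relative_index(n, k)            # (n, n)
    q, key, val = x @ W_Q, x @ W_K, x @ W_V       # (n, d) each
    rel_k, rel_v = a_K[idx], a_V[idx]             # (n, n, d) each
    # score[i, j] = q_i . (k_j + a_K[i, j]) / sqrt(d)
    scores = (q @ key.T + np.einsum("id,ijd->ij", q, rel_k)) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # out[i] = sum_j weights[i, j] * (v_j + a_V[i, j])
    return weights @ val + np.einsum("ij,ijd->id", weights, rel_v)

rng = np.random.default_rng(0)
n, d, k = 6, 4, 2
x = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
a_K = rng.normal(size=(2 * k + 1, d))   # one key-side vector per clipped offset
a_V = rng.normal(size=(2 * k + 1, d))   # one value-side vector per clipped offset
out = relative_attention(x, W_Q, W_K, W_V, a_K, a_V, k)
print(clipped_relative_index(n, k))
```

Every offset beyond k maps to the same boundary row, which is why a finite table covers sequences of any length.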

Based on the formula for absolute position encoding, the specific derivation process of this scheme is as follows:

[Figure 908: derivation of the classic scheme]

3.5 XLNET

XLNET-style positional encoding originates from the Transformer-XL paper, Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context.

Transformer-XL modified the absolute position encoding formula: it kept the sinusoidal frequencies and the interaction between positional and semantic information, added two global learnable vectors, $u$ and $v$, and split the key transformation matrix $W_k$ into two matrices, one content-based and one position-based. The details are as follows.

  • Replace the absolute position encoding with a relative position encoding.
    • Using the query position $i$ as the anchor, every absolute key-side position encoding is changed to $R_{i-j}$, a relative position encoding with respect to $i$. This relative position information (which feeds the content-dependent positional bias) is produced by the Transformer’s sinusoidal formula. It is not learned and therefore is not clipped. In the notation of the Transformer-XL paper (which XLNet adopts), the absolute encoding $U_j$ becomes the relative encoding $R_{i-j}$.
    • Introduce relative position information into the key, and split the key transformation matrix into two matrices, one content-based and one position-based, so that the input sequence and the positional encoding no longer share weights. In the paper’s notation, the two matrices are $W_{k,E}$ and $W_{k,R}$.
  • Adjust the second term of the absolute position encoding formula. The position-based matrix $W_{k,R}$ describes the relationship between relative distance and contextual semantics.
  • Adjust the third and fourth terms of the absolute position encoding formula.
    • Two new learnable parameters, $u$ and $v$, are introduced in the third and fourth terms, where they replace the positional part of the query vector. The reasoning is that the attentional bias toward different words should stay the same regardless of the query position. Since relative positional encoding has already been incorporated on the key side, positional encoding is no longer needed on the query side.
    • In the paper’s notation, $u$ and $v$ replace $U_i^\top W_q^\top$ in terms (c) and (d), respectively.

Based on the formula for absolute position encoding, the specific derivation process of this scheme is as follows:

[Figure 909: derivation of the Transformer-XL scheme]
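A rough sketch of the resulting score, written with row vectors and without the causal mask or the memory mechanism, is given below. The function names and sizes are assumptions for this example and the random values stand in for trained parameters; the four terms follow the paper's (a) to (d) split, with fixed sinusoidal encodings of the offset i - j on the key side and the learned vectors `u` and `v` on the query side.

```python
import numpy as np

def sinusoid(offsets, d, base=10000.0):
    """Fixed (not learned) sinusoidal encoding of signed offsets; d must be even."""
    freqs = base ** (-np.arange(0, d, 2) / d)
    angles = offsets[..., None] * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def transformer_xl_scores(x, W_q, W_kE, W_kR, u, v):
    n, d = x.shape
    pos = np.arange(n)
    R = sinusoid(pos[:, None] - pos[None, :], d)     # R[i, j] encodes i - j; (n, n, d)
    q = x @ W_q                                      # content queries
    k_content = x @ W_kE                             # content-based keys
    k_position = R @ W_kR                            # position-based keys; (n, n, d)
    term_a = q @ k_content.T                         # content-content
    term_b = np.einsum("id,ijd->ij", q, k_position)  # content-dependent positional bias
    term_c = (k_content @ u)[None, :]                # global content bias, same for all i
    term_d = np.einsum("d,ijd->ij", v, k_position)   # global positional bias
    return term_a + term_b + term_c + term_d

rng = np.random.default_rng(0)
n, d = 5, 8
x = rng.normal(size=(n, d))
W_q, W_kE, W_kR = (rng.normal(size=(d, d)) for _ in range(3))
u, v = rng.normal(size=d), rng.normal(size=d)
scores = transformer_xl_scores(x, W_q, W_kE, W_kR, u, v)
print(scores.shape)
```

No absolute position enters the computation: the position-dependent terms see only the offset i - j.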

It appears to be from this work onward that relative position encodings are added only to the attention matrix and no longer to the value vectors $v_j$.

3.6 TENER

Looking at positional encoding, the authors of TENER found that conventional sinusoidal positional encoding lacks monotonicity in practice and, in theory, is unaware of the relative direction between consecutive tokens. They therefore proposed incorporating both relative direction and relative distance into positional encoding. TENER’s positional encoding is similar to that of Transformer-XL, except that it removes the transformation matrix for relative position information ($W_{k,R}$ in Transformer-XL’s notation). TENER also found that removing the scaling factor $1/\sqrt{d}$ from the self-attention computation improves performance on NER tasks.

Based on the formula for absolute position encoding, the specific derivation process of this scheme is as follows:

[Figure 910: derivation of the TENER scheme]

TENER actually reveals some shortcomings of current positional encoding: existing positional encoding mainly describes the relative distance between two tokens, but lacks description of the relative direction between tokens. How to achieve a direction-aware positional encoding that can be decomposed and interpreted is a big challenge.
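The direction problem can be seen in a few lines. In this sketch (function names chosen for the example), the dot product between the sinusoidal encodings of position t and of a position k steps away is the same whether that position lies to the left or to the right. The encoding of the signed offset itself does change with the sign, because its sine half is odd.

```python
import numpy as np

def sinusoid(value, d_model, base=10000.0):
    """[sin(value * f), cos(value * f)] over d_model / 2 frequencies f."""
    freqs = base ** (-np.arange(0, d_model, 2) / d_model)
    return np.concatenate([np.sin(value * freqs), np.cos(value * freqs)])

d_model, t, k = 16, 20, 3

# Absolute encodings: the score is a sum of cos(k * f), which is even in k.
forward = sinusoid(t, d_model) @ sinusoid(t + k, d_model)    # k steps to the right
backward = sinusoid(t, d_model) @ sinusoid(t - k, d_model)   # k steps to the left
direction_blind = bool(np.isclose(forward, backward))

# Encoding the signed offset directly keeps the direction: sin(-k f) = -sin(k f).
direction_aware = not np.allclose(sinusoid(k, d_model), sinusoid(-k, d_model))

print(direction_blind, direction_aware)
```

This check covers the raw dot product only; it is meant as an illustration of the symmetry, not as a reproduction of the TENER analysis.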

3.7 T5

T5 also builds on the decomposition of the attention score, but uses a simpler relative position scheme that maps relative positions directly to learnable scalars. In terms of the absolute encoding formula, T5 drops the interaction between positional and semantic information and characterizes relative position directly, using one scalar bias (a floating-point number) for each possible position offset. For example, a single bias represents the relative distance between any two tokens that are one position apart, regardless of their absolute positions in the sentence. The bias is thus a function of relative position alone.

In short, T5 replaces the last three terms of the absolute position formula with a single learnable bias; in other words, it adds a trainable bias term to the unnormalized (pre-softmax) attention matrix. Specifically:

  • Delete terms (b) and (c). The authors of T5 held that input information and positional information should be independent (decoupled) and therefore should not interact excessively.
  • Simplify (d) to $\beta_{i,j}$. The fourth term is really just a scalar that depends only on $(i, j)$, so it can be mapped directly to a learnable scalar and trained as a parameter. This relative position bias matrix is added to the product of the query matrix and the key matrix in the self-attention layer, so tokens at the same relative distance always receive the same bias, regardless of where they are in the sequence.
  • Remove the position encoding from the value vectors.

Based on the formula for absolute position encoding, the specific derivation process of this scheme is as follows:

[Figure 911: derivation of the T5 scheme]

A significant advantage of this method is its scalability. It can be extended to arbitrarily long sequences, which is a clear advantage over absolute position embedding.
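A scalar-bias scheme of this kind can be sketched as below. Note the simplification: T5 itself groups offsets into buckets and learns separate biases per attention head, whereas this example uses one table with simple clipping at an assumed `max_distance`. The names are hypothetical and the random values stand in for trained parameters.

```python
import numpy as np

def relative_bias(n, bias_table, max_distance):
    """bias[i, j] = bias_table[clip(j - i) + max_distance]: one scalar per offset."""
    pos = np.arange(n)
    rel = np.clip(pos[None, :] - pos[:, None], -max_distance, max_distance)
    return bias_table[rel + max_distance]

def biased_scores(x, W_Q, W_K, bias_table, max_distance):
    """Content term (a) plus a learned scalar bias in place of terms (b), (c), (d)."""
    content = (x @ W_Q) @ (x @ W_K).T
    return content + relative_bias(x.shape[0], bias_table, max_distance)

rng = np.random.default_rng(0)
n, d, max_distance = 6, 4, 3
x = rng.normal(size=(n, d))                   # token embeddings, no position added
W_Q, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))
bias_table = rng.normal(size=2 * max_distance + 1)
scores = biased_scores(x, W_Q, W_K, bias_table, max_distance)

# The same table serves a much longer sequence without any new parameters.
long_bias = relative_bias(50, bias_table, max_distance)
print(scores.shape, long_bias.shape)
```

Because the table is indexed by offset rather than by position, nothing in it ties the model to a maximum sequence length.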

3.8 DeBERTa type

DeBERTa originates from DeBERTa: Decoding-enhanced BERT with Disentangled Attention. In contrast to T5, DeBERTa removes the fourth term after decomposition and replaces absolute position encoding with relative position encoding in the second and third terms. Its underlying logic is as follows:

  • Replace the positional codes in terms two and three with relative positional codes. The relative distance $\delta(i, j)$ is first computed from $i - j$ and clipped to the interval $[0, 2k)$; a parameter matrix then maps each relative position to a feature vector. In other words, DeBERTa uses relative position encoding and disentangles attention into content and position components.
  • Discard term four. Since relative position encoding is already used, the position-to-position term provides no additional information.

In addition, DeBERTa offers a new perspective on combining relative and absolute positions. It points out that most NLP tasks may need only relative position information, but that absolute position information does help in some scenarios. It therefore divides the model into two parts. The model has 13 layers in total: the first 11 layers use only relative position encoding and are called the Encoder, and the last 2 layers add absolute position information and are called the Decoder (the Enhanced Mask Decoder of the paper’s title).

Based on the formula for absolute position encoding, the specific derivation process of this scheme is as follows:

[Figure 912: derivation of the DeBERTa scheme from the absolute position encoding formula]

3.9 TUPE

TUPE comes from the paper “Rethinking Positional Encoding in Language Pre-training”.

Note: TUPE has two versions, APE and RPE. This article categorizes it as APE based on its origin.

TUPE can be seen as a combination of the T5 and DeBERTa positional encodings. TUPE removes the second and third terms of the absolute positional encoding formula and keeps the fourth, position-to-position term alongside the first, content-to-content term. Compared to T5, which compresses the fourth term into a learned scalar that characterizes relative position, TUPE treats semantic and positional information equally and characterizes them separately: $W^Q$ and $W^K$ depict semantic relationships, while $U^Q$ and $U^K$ characterize positional relationships (the relationships between a pair of words or a pair of positions are modeled directly with different projection matrices).

Regarding the four terms of the absolute position encoding formula, the paper argues that there are two problems:

  • Position embeddings and word embeddings should not be coupled.
    • In absolute positional encoding, adding positional embeddings to word embeddings introduces mixed correlations between two heterogeneous sources of information. This can inject unwanted randomness into attention and further limit the model’s expressive power.
    • The paper’s authors visualized the four terms and found that the middle two looked nearly uniform, indicating no strong correlation between position and token. TUPE therefore removes the second and third terms, i.e., the word-to-position and position-to-word correlations.
    • The same projection matrices are applied to both tokens and positions, but tokens and positions carry different kinds of information, so sharing the matrices is unreasonable. TUPE therefore unties the projection matrices of tokens and positions, computes the contextual correlation and the positional correlation of words separately with different parameterizations, and then adds them together.
    • A relative bias term borrowed from the T5 model can be added on top.
  • Secondly, given the special role of the [CLS] token in downstream tasks (it represents the entire sentence), the paper questions whether treating the position of [CLS] the same as that of other words is a reasonable design. TUPE therefore unties the [CLS] token from the other positions, making it easier for it to capture information from all positions.
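
A minimal sketch of the untied attention score described above, for a single head and without the optional T5-style bias. The $\sqrt{2d}$ scaling follows the TUPE paper; the way the [CLS] reset is applied here (position 0 is [CLS], two scalars replace its positional correlations) is a simplified reading, and `theta_from_cls` / `theta_to_cls` are hypothetical names for what would be learned parameters.

```python
import numpy as np

def tupe_scores(x, p, w_q, w_k, u_q, u_k, theta_from_cls, theta_to_cls):
    """Content and position use separate projections; there are no cross terms."""
    d = x.shape[1]
    scale = np.sqrt(2 * d)
    content = (x @ w_q) @ (x @ w_k).T / scale     # word-to-word correlation
    position = (p @ u_q) @ (p @ u_k).T / scale    # position-to-position correlation
    # Untie [CLS] (assumed to sit at index 0) from the ordinary positional correlations.
    position[:, 0] = theta_to_cls                 # other tokens attending to [CLS]
    position[0, :] = theta_from_cls               # [CLS] attending to every position
    return content + position

rng = np.random.default_rng(0)
n, d = 5, 8
x = rng.normal(size=(n, d))                       # word embeddings
p = rng.normal(size=(n, d))                       # absolute position embeddings
w_q, w_k, u_q, u_k = (rng.normal(size=(d, d)) for _ in range(4))

scores = tupe_scores(x, p, w_q, w_k, u_q, u_k, theta_from_cls=0.1, theta_to_cls=-0.2)
```

Since `x` and `p` never pass through the same matrix and are never summed before projection, changing the words leaves the positional part of the score untouched.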

The TUPE architecture is shown in the figure below.

[Figure 913: TUPE architecture]

The decoupling logic is shown in the diagram below.

[Figure 914: TUPE decoupling logic]

Based on the formula for absolute position encoding, the specific derivation process of this scheme is as follows:

[Figure 915: derivation of the TUPE scheme from the absolute position encoding formula]

3.10 ALiBi

ALiBi comes from the paper “Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation”. ALiBi (Attention with Linear Biases) is similar to T5 in that it adds a bias directly to the attention score before softmax normalization. This linear bias makes the self-attention distribution focus more on semantic information at small relative distances, i.e., neighboring positions. The difference is that T5’s bias is a trainable parameter, while ALiBi’s bias is preset and not trained.

Motivation

ALiBi’s motivation is that closer words are more important than farther words.

Implementation

ALiBi doesn’t add positional embedding vectors to word vectors. Instead, it biases the query-key attention score with a “penalty term” proportional to the distance between the query and the key: the greater the relative distance, the larger the penalty. Essentially, the farther apart two tokens are, the smaller their mutual contribution. For example, when the query vector of the $i$-th token is multiplied by the key vector of the $j$-th token in the original attention layer, ALiBi injects the relative positional information of positions $i$ and $j$ into the attention calculation by subtracting a penalty proportional to the absolute difference $|i - j|$. This distance is constant and can be precomputed; the slope that scales it is different for each head.

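The specific formula is:

$$\mathrm{softmax}\left(q_i K^{\top} + m \cdot [-(i-1), \ldots, -2, -1, 0]\right)$$

where $m$ is a hyperparameter (the slope), and a different value is set for each head. The paper found through experiments that setting the slopes as the geometric sequence

$$\frac{1}{2^1}, \frac{1}{2^2}, \ldots, \frac{1}{2^8}$$

yields the best results for 8 heads; that is, if there are $n$ heads, the sequence starts from $2^{-8/n}$ and ends at $2^{-8}$.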
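
As a small illustration, the per-head slopes and the bias matrix can be precomputed as below. This is a sketch with hypothetical function names: it uses the geometric sequence from $2^{-8/n}$ to $2^{-8}$ that the paper gives for $n$ heads, and it builds the symmetric $|i - j|$ form; in a causal decoder a separate mask would hide the entries with $j > i$.

```python
import numpy as np

def alibi_slopes(n_heads):
    """Geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8), one slope per head."""
    return np.array([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])

def alibi_bias(n_heads, seq_len):
    """Fixed (non-trainable) bias of shape (heads, seq, seq): -slope * |i - j|."""
    pos = np.arange(seq_len)
    distance = np.abs(pos[:, None] - pos[None, :])   # shared by all heads
    return -alibi_slopes(n_heads)[:, None, None] * distance

bias = alibi_bias(n_heads=8, seq_len=5)
# The bias is added to the pre-softmax scores of each head, e.g. scores + bias.
```

With 8 heads the slopes are 1/2, 1/4, ..., 1/256, so the first head penalizes distance most strongly and the last head is almost position-agnostic.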

The implementation process is shown in the figure below.

[Figure 916: ALiBi implementation process]

Features

ALiBi is a very simple (and certainly effective) smooth local attention technique, but it is not entirely appropriate to understand it as “positional encoding”.

ALiBi uses a linear bias term to make the self-attention distribution focus more on semantic information at small relative distances, i.e., neighboring positions, which is equivalent to enhancing locality. Although the bias is linear, after the softmax transformation its effect on the self-attention distribution is a multiplicative exponential decay. In addition, a linear bias means that when the relative distance is large, the bias term approaches negative infinity. ALiBi’s bias therefore acts more like a sloped sliding window or mask that imposes long-range decay on the attention calculation, i.e., it can only capture information within the training length. As long as the relative distance is large enough, ALiBi imposes a strict penalty. As the sequence length increases, ALiBi tends to move from global attention to almost local attention, which is why it performs worse than most baselines within the training length but better beyond it.
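
The claim that a linear bias becomes an exponential decay after softmax can be checked numerically. The snippet below is a toy example with made-up scores and a hypothetical slope `m = 0.5` for a single query at the last position.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
scores = rng.normal(size=8)          # raw scores q_i . k_j for one query, j = 0..7
distance = np.arange(7, -1, -1)      # the query is the last token: |i - j| = 7..0
m = 0.5                              # hypothetical slope of one head

# ALiBi: subtract a penalty that is linear in the distance, then apply softmax.
with_bias = softmax(scores - m * distance)

# Equivalent view: unbiased weights times exp(-m * distance), renormalized.
decayed = softmax(scores) * np.exp(-m * distance)
decayed /= decayed.sum()
```

The two vectors agree up to floating-point error, which shows that each attention weight is damped by a factor of $e^{-m|i-j|}$ before renormalization.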

At the same time, it should be noted that absolute bias encoding, represented by ALiBi, cannot be converted into contextual bias encoding: an operation on the scalar bias cannot be split into separate operations on the query and key vectors. From a parameter perspective, since the bias is added directly to the inner product $q_i k_j^{\top}$, the parameter matrices receive no signal that distinguishes one feature dimension from another, which makes optimization harder than with the other two bias forms.

Based on the formula for absolute position encoding, the specific derivation process of this scheme is as follows:

[Figure 917: derivation of the ALiBi scheme from the absolute position encoding formula]

3.11 Bias Coding & Context Mode

After analyzing the relative positional encodings described above, let’s look at a way to classify them. Whether the encoding is absolute or relative, how to characterize the relative distance between positions in the sequence, and how to use that distance to control the bias at each entry of the self-attention matrix, has always been the most crucial aspect of positional encoding design. Different schemes add the bias in different ways, but they can generally be summarized in the following form:

$$a_{i,j} = \mathrm{softmax}\left(q_i k_j^{\top} + b_{i,j}\right)$$

In this form, $b_{i,j}$ is known as the position offset. Relative position encoding can be divided into the following two schools of thought based on the form this offset takes.

  • Bias schemes. Position encodings represented by T5, TUPE, and ALiBi, where $b_{i,j}$ is a scalar unrelated to $q_i$ and $k_j$. The bias is simply added to the inner product $q_i k_j^{\top}$, operating directly on the self-attention matrix. The calculation is simple and easy to understand, but the penalty is too absolute.
  • Contextual schemes. These include, for example, Transformer-XL and DeBERTa, where $b_{i,j}$ is computed from the query or key together with a relative position vector (for example, $b_{i,j} = q_i r_{i-j}^{\top}$, with $r_{i-j}$ a relative position embedding). This approach folds the bias into the feature dimensions of the query and key vectors, so it can differ across feature dimensions, and it replaces a hard penalty with filtering, resulting in stronger expressive power. However, its overall bias tends to fluctuate with the relative position, so more dimensions and additional corrections are needed to mitigate this.
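
The difference between the two schools can be made concrete with a toy comparison. This is a sketch under assumptions: both modes share a clipped relative-distance index, `bias_table` and `rel_emb` stand in for learned parameters, and the contextual offset uses the simplest query-times-relative-embedding form rather than the full Transformer-XL or DeBERTa formulas.

```python
import numpy as np

def relative_index(n, max_distance):
    """Index of the clipped relative distance i - j, shifted to be non-negative."""
    pos = np.arange(n)
    return np.clip(pos[:, None] - pos[None, :], -max_distance, max_distance) + max_distance

def bias_mode(q, k, bias_table, idx):
    """Offset is a scalar looked up by relative distance; it ignores q and k."""
    return q @ k.T + bias_table[idx]

def contextual_mode(q, k, rel_emb, idx):
    """Offset is q_i . r_{i-j}, so it changes with the content of the query."""
    offset = np.take_along_axis(q @ rel_emb.T, idx, axis=1)
    return q @ k.T + offset

rng = np.random.default_rng(0)
n, d, max_distance = 5, 4, 2
q1, q2, k = (rng.normal(size=(n, d)) for _ in range(3))
bias_table = rng.normal(size=(2 * max_distance + 1,))
rel_emb = rng.normal(size=(2 * max_distance + 1, d))
idx = relative_index(n, max_distance)

bias_offset_1 = bias_mode(q1, k, bias_table, idx) - q1 @ k.T
bias_offset_2 = bias_mode(q2, k, bias_table, idx) - q2 @ k.T
ctx_offset_1 = contextual_mode(q1, k, rel_emb, idx) - q1 @ k.T
ctx_offset_2 = contextual_mode(q2, k, rel_emb, idx) - q2 @ k.T
```

Swapping `q1` for `q2` leaves the bias-mode offset unchanged but alters the contextual offset, which is exactly the distinction the two bullets draw.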

[Figure 918: comparison of the bias and contextual schemes]

3.12 Summary

Generally, absolute position encoding is simple to implement and fast to compute, while relative position encoding directly reflects the relative position signal, which is more intuitive, better suited to the characteristics of text, and often yields better results in practice. How can we combine the strengths of both? A natural idea is to achieve relative position encoding through absolute position encoding, that is, hybrid position encoding. This is the idea behind Rotary Position Embedding (RoPE), which unifies APE and RPE by achieving the effect of relative position encoding in the form of absolute position encoding.
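
To make "relative encoding in absolute form" concrete, here is a minimal sketch of the rotary idea for a single vector. It assumes an even feature dimension, the commonly used frequency base of 10000, and rotation of consecutive feature pairs; the function name is hypothetical and this is not a full attention implementation.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive feature pairs of x by angles that depend on the absolute position."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)    # one frequency per feature pair
    angle = position * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Each vector only sees its own absolute position...
score_a = rope(q, 3) @ rope(k, 7)
score_b = rope(q, 10) @ rope(k, 14)
# ...yet the score depends only on the offset (4 in both cases).
```

Both scores are equal up to floating-point error, because the product of the two rotations depends only on the difference between the positions.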

Note: The classification here is ambiguous. Some people regard RoPE as a hybrid positional encoding, some classify it as a relative positional encoding, and some list it separately as a rotational positional encoding.

0xFF Reference