Exploring the Transformer Series (7) --- Embedding

0x00 Summary

In the Transformer, the task of mapping each token (corresponding to discrete input data, such as words or symbols) to a high-dimensional dense vector space is implemented by the embedding layer. The input embedding layer is an indispensable part of the Transformer framework, and its functions are as follows:

The input data is transformed into a form that the model can process. For example, for the four characters “新年大吉” (Happy New Year), assuming the high-dimensional space is 512-dimensional, the embedding layer will generate a 4 x 512-dimensional embedding matrix. Each token corresponds to a row in the matrix, i.e., a vector.
This provides the necessary semantics for the model. Research has found that in Transformer-type models, the concept formation process begins at the input embedding layer. This is similar to early human cognitive development.
Positional encoding provides the model with positional information for each word in the sequence, enabling the model to process sequence data and retain positional information.

Together, these functions lay the foundation for the subsequent modules (such as the self-attention mechanism and the feedforward network) to process the input and carry out their tasks.

[Figure 701]

0x01 Evolutionary Approach

1.1 Concept

First, let’s look at two concepts closely related to the embedding layer: embedding and vectorization. Some articles don’t distinguish between them. Others make a fine distinction, arguing that although they both relate to representing data as vectors, they differ significantly in concept, application, and implementation.

Simply put, vectorization is the process of converting data in other formats into vector form. These other formats include all common data formats such as text, images, videos, and audio. Therefore, vectorization can be directly understood as a data format conversion technique.

In large model technology systems, vectorization mainly involves the following two situations:

Data vectorization.
Embedding.

Although both involve vectorization, they differ fundamentally. Data vectorization is a mechanical numerical transformation, while embedding is a higher-level, learned form of vectorization. For example, for the four characters “我爱中国” (I love China), data vectorization might simply apply a fixed rule to each character and produce four independent vectors. An embedding, however, contains much more information, including the subject-verb-object structure, the positions of the subject and object, emotion, action, and so on, all of which are collectively referred to as linguistic features. The relationship between data vectorization and embedding is shown in the table below.

Dimension	Embedding	Vectorization
Purpose	Learn low-dimensional dense semantic representations. After embedding, the data carries semantic relationships instead of being unrelated discrete vectors.	Convert data into numerical vectors, which may be sparse or dense. The emphasis is on the numerical representation itself, i.e., on representing the data directly.
Requires learning?	Yes (usually through neural networks or optimization algorithms).	No (can be generated by rules or statistical methods).
Semantic representation ability	Represents data in a meaningful, structured way and must preserve the deep semantic relationships and similarities between data items.	May not preserve semantics; it is merely a mechanical representation of features.
Typical methods	word2vec, GloVe, BERT, node2vec	Bag-of-Words (BoW), TF-IDF, One-hot Encoding
Result vector dimension	Typically low-dimensional and dense. “Low” here is relative to sparse vectorization; in absolute terms the vectors are still high-dimensional (512 in the earlier example).	Typically high-dimensional and sparse

Taking text translation as an example, the input layer of the Transformer first maps each word (or character) of the input text into a high-dimensional, dense vector space, obtaining a dense representation of the text’s semantic meaning. This is called token embedding, or word embedding in deep learning. In effect, the Transformer “encodes” human semantics into its own language (embeddings) through vectorization. Word embeddings provide richer and more expressive input features for subsequent computations. Furthermore, these embeddings are trainable: during training, the model adjusts the values of the embeddings based on what it learns from the data in order to improve its task performance.

Once the embedding is obtained, the Transformer passes it to the self-attention layer. The self-attention layer analyzes the relationships between tokens in the input sequence, extracts rich knowledge and structure from the embeddings, and then combines them through weighted sums to generate new embeddings. Finally, the new embeddings are “decoded” back into human language. Essentially, the Transformer constructs a high-dimensional language system that can map (or encode) natural language, programming languages, and visual and auditory languages into this high-dimensional language space.

Let’s look at the evolution from text to embedding, that is, how to encode a word into an embedding.

1.2 Requirements

Why convert input data (usually words or characters in text data) from discrete symbolic form to continuous numerical vector form? Because computers cannot compute directly on non-numerical data; every computation requires the object being computed to be converted into numerical values first. However, there are many ways to represent something numerically. Why choose vectors as the carrier, rather than some other form? To answer this, we need to look at how the conversion of computational objects has developed.

1.3 Text

Words are abstract summaries of human language and are the primary objects of processing in the field of NLP. But how is a word represented in a computer? ASCII is an effective method, satisfying uniqueness and distinguishability. However, ASCII can only tell us what the word is, but cannot convey the deeper meaning of the word. In the text-based era, search was based on keyword search, that is, the most primitive character matching.

However, what we expect is semantic search, which finds text containing the most similar meanings. Semantic search allows computers to better understand human language and needs; it can understand what you want to do based on your semantics, rather than requiring you to give explicit instructions for it to understand what you want to do.

While semantic search may seem easy to the average person, it is difficult for computers because language is incredibly complex, and semantic search requires semantic understanding. Semantic understanding is the foundation of natural language and an essential component of artificial intelligence. Furthermore, we aim to achieve more than just semantic search; we also want to enable similarity search, applying it to various scenarios such as image search, hybrid search, and intelligent recommendation.

So we continue our search.

1.4 Numbers

Since computation is based on numbers, computers treat numbers as a language but cannot directly understand written text. Therefore, an obvious approach is to convert written text into numbers, making it easier for computers to process and understand.

The solution is straightforward: we construct a dictionary using the vocabulary appearing throughout the entire text corpus. This dictionary is unique and ordered, with each word having an index. We then represent words using their indices in the dictionary. This is essentially an indexed representation, assigning each word a unique number based on the list of all words known to the model. For simplicity, we’ll call this transformation process indexing.

The advantage of integers is that they are consecutive, ordered, and very easy to index. For example, if a dictionary contains the four characters of “新年快乐” (Happy New Year), we can assign them the indices 1, 2, 3, and 4. Likewise, in the previously generated dictionary object, the vocab object holds 8185 words, and each word is uniquely identified by its index so that the computer can recognize it.

vocab = {Vocab: 8185} <torchtext._torchtext.Vocab object at 0x0000021A26983DF0>
  0000 = {str} '<s>'
  0001 = {str} '</s>'
  0002 = {str} '<blank>'
  0003 = {str} '<unk>'
  0004 = {str} '.'
  0005 = {str} 'Ein'
  0006 = {str} 'einem'
  ......
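The indexing step described above can be sketched in a few lines of plain Python. This is only an illustration, not how torchtext builds its `Vocab`: the helper names `build_vocab` and `encode`, the tiny two-sentence corpus, and the choice to number tokens in order of first appearance are all assumptions made for the example.

```python
def build_vocab(sentences, specials=("<s>", "</s>", "<blank>", "<unk>")):
    """Assign every distinct token a unique integer index, special tokens first."""
    vocab = {token: i for i, token in enumerate(specials)}
    for sentence in sentences:
        for token in sentence.split():
            if token not in vocab:
                vocab[token] = len(vocab)  # next free index
    return vocab


def encode(sentence, vocab):
    """Replace each token by its index; unknown tokens map to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(token, unk) for token in sentence.split()]


# Hypothetical toy corpus, whitespace-tokenized for simplicity.
corpus = ["Ein Mann geht .", "Ein Hund rennt ."]
vocab = build_vocab(corpus)
ids = encode("Ein Hund geht schnell .", vocab)
print(ids)  # "schnell" is not in the vocabulary, so it becomes the <unk> index
```

Each token is now a single integer, which is exactly the representation whose limits the next paragraphs discuss.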

This approach has a problem: the numbers carry too little information to describe semantics, so the assigned values have no connection to the words’ meanings. For example, in an English dictionary, the words abeyance (temporary suspension), abide (to comply), and ability are assigned adjacent values, but their meanings are vastly different. Conversely, a and an, which mean the same thing, end up very far apart. This makes it hard for the computer to recognize what words have in common and to distinguish what is specific to each. Consequently, the model needs more parameters to describe the same amount of information, making learning significantly more difficult.

Because a single number lacks expressive power, when a word is represented by a scalar, the relationship between words can only be derived from the difference between two scalars, which is clearly insufficient. Therefore, multiple numbers are needed to express concepts, that is, to embed multiple numbers into a mathematical space. When a multi-dimensional numerical form is required, we naturally think of using vectors—for each word, we can represent it as a set of numbers, rather than a single number; we can use these numbers to do things, such as calculating distances and angles. In this way, we can define distance in different dimensions, and the complex relationships between words can be expressed in this high-dimensional space.

Therefore, we arrive at the third option: vectors.

1.5 Vectors

In mathematics, a vector (also called a Euclidean vector or geometric vector) is a quantity that has both magnitude and direction. It can be visualized as a line segment with an arrowhead. The arrow points in the direction of the vector, and the length of the line segment represents its magnitude. In simpler terms, a vector is an ordered list of scalar values (i.e., numbers). For example, the coordinates (4,3) form a two-dimensional vector that can represent a position in the plane. Next, we will discuss methods for converting text into vectors, known as vectorization.

Vectorization is the process of converting data into vector form, typically used to transform non-numerical data into numerical form for easier processing by machine learning models. Vectorization primarily represents raw data as numerical vectors that can be directly input into the model. Vectorization can be a simple rule-based transformation without requiring training. The result of vectorization is not necessarily a dense vector; it can also be a sparse vector.

Why vectorize? The reason lies in several excellent properties of vectors:

It facilitates computer processing.
It can represent semantic relationships between text, images, etc.
Multiple vectors can be represented using matrices, which makes computation more efficient.

For these reasons, vectors are the data format that neural networks process, and they are the underlying data structure of large models. Any data fed into the model must be vectorized: not only the entities themselves, but also the data that records the semantic relationships between them.

One-hot encoding

One-hot encoding is probably the best-known vector encoding and a common way to represent categorical data. Essentially, it uses a binary vector containing a single 1, with 0s everywhere else, to uniquely represent a word. The encoding process is as follows:

Construct a dictionary from the set of all words in the corpus, assuming the size of this dictionary is |V|.
Each word has a subscript index in the dictionary.
Construct for each word a unique vector of length |V|: the position at the word’s index is 1, and all other positions are 0. The dimension of the vector equals the size of the vocabulary. The specific vectors are shown in the figure below.

[Figure 702]

For example, if the vocabulary contains the four characters “新、年、快、乐”, one-hot encoding represents each character with a 0/1 vector that is “1” at the position corresponding to that character and “0” elsewhere.

[Figure 703]
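The three encoding steps can be sketched as follows for the four-character vocabulary used in the example. This is a minimal, dependency-free illustration; the helper names `one_hot`, `dot`, and `euclidean` are introduced here and do not come from any library.

```python
import math

vocab = ["新", "年", "快", "乐"]  # the dictionary, |V| = 4
index = {token: i for i, token in enumerate(vocab)}


def one_hot(token):
    """Return a vector of length |V| with a single 1 at the token's index."""
    vec = [0] * len(vocab)
    vec[index[token]] = 1
    return vec


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


vectors = {token: one_hot(token) for token in vocab}
for token in vocab:
    print(token, vectors[token])

# Any two distinct one-hot vectors are orthogonal and equally far apart.
print(dot(vectors["新"], vectors["年"]))
print(euclidean(vectors["新"], vectors["年"]), euclidean(vectors["新"], vectors["乐"]))
```

The last two lines already hint at the main weakness of the scheme: every pair of distinct words has a dot product of 0 and the same distance, so the encoding cannot say that any two words are more alike than any other two.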

One-hot encoding has the following significant drawbacks:

Little information per vector. Each one-hot vector has only one informative dimension, and that dimension can only hold the value 1, so each vector expresses very little information and the representation is inefficient.
High-dimensional sparsity. We always hope that the model can handle more words, so the larger the dictionary size |V| the better, but a larger dictionary also means more storage. Moreover, the dimension of each vector equals the number of entries in the dictionary, so the vectors easily become extremely high-dimensional and very sparse, leading to high memory consumption.
Semantic information is lacking. In one-hot encoding, each word is equidistant from every other word, essentially treating all words as independent entities without any concept of similarity or context. In other words, in one-hot encoding, words are orthogonal and unconnected, thus failing to capture any semantic relationships between them.
Hard-coded, cannot be adjusted or trained.

Improvement requirements

So far, we have found a numerical form that can express word meaning: vectors. We have also found numerous problems with one-hot encoding, a common vector encoding method. To address these shortcomings, we need the following improvements:

Dimensionality enhancement: go from a single informative dimension to many, so that more information can be represented. Here, dimensionality enhancement means using more dimensions to carry semantic information.
Dimensionality reduction: go from sparse to dense by reducing the overall dimension of the vectors. The sparsity of one-hot encoding is undesirable because it burdens both computation and storage, so we prefer denser vectors. We therefore need methods that reduce the dimension of each word’s vector from the size of the dictionary to something much smaller.
Semantic similarity: strengthen the ability to express relationships, ideally the semantic relationships (semantic similarity) between text, images, and so on. With the development of internet technology, unstructured data such as images, audio, video, and mixed text and images is becoming increasingly abundant, and we need a method that can process it.
Avoid hardcoding; let it be learned. One-hot encoding is hardcoded, and we hope to obtain representations that can be adjusted through training.

Let’s analyze the necessity of these improvements.

Dimensional enhancement

One-hot encoding uses single-dimensional vectors, and we need to extend it to represent semantic information using multi-dimensional vectors. Why represent a word using a multi-dimensional vector? Because human concepts are complex and diverse, and cannot be expressed using a single dimension. Consider a table; we need multiple features to fully represent its characteristics, such as length, width, height, weight, material, and style—each aspect being a dimension. Words also possess multi-dimensional information. For example, the word “instant” has multiple parts of speech, and each part of speech may have multiple meanings.

As a noun, it means an instant, a moment, or a split second.
As an adjective, it means immediate, instantaneous; quick-cooked; urgent, pressing; present, immediate, momentary; persistent, indomitable; produced (or occurring) in an instant; unprepared (or considered) beforehand, impromptu, completed on the spot.
As an adverb, it means immediately or instantly.

The significance of multidimensionality lies in the fact that conceptual information is distributed across the entire vector, rather than being confined to any single dimension.

Dimensional reduction

There are two layers of meaning or dimensions of thought here.

The shift from sparse to dense encoding reduces the dimension of each word’s vector from the size of the dictionary to a much smaller number. This is possible because the embedding length represents the number of feature dimensions, whereas with one-hot encoding the number of dimensions must equal the number of tokens in the dictionary.
We’ve mentioned multidimensional vectors, but how many dimensions are appropriate? If the dimension is too high, although it can express rich semantics, it may still cause problems similar to one-hot encoding.

The first point is understandable; it reduces the dual burden of computation and storage. For example, dimensionality reduction can reduce the number of parameters in the model, thereby mitigating the risk of overfitting and improving the model’s training efficiency.

The second point is how to determine the dimension of the new multidimensional vector. Let’s analyze this carefully. According to the Johnson–Lindenstrauss lemma, any set of points in a high-dimensional space can be randomly projected into a lower-dimensional Euclidean space while controlling the distortion of pairwise distances. Specifically, given N vectors, regardless of their original dimension, we can reduce them to O(log N) dimensions while keeping the relative error of pairwise distances within a chosen bound; the constant hidden in O(log N) grows as the tolerated error shrinks. For example, any 10,000 points in a million-dimensional space can be packed, with moderate distortion, into a subspace of only tens to hundreds of dimensions. That is, an O(log N)-dimensional space is enough to accommodate N vectors, which reduces a retrieval problem in the original high-dimensional space to one in O(log N) dimensions. This is consistent with the view that a language model is a compressor: dimensionality reduction is a form of compression that extracts commonalities, preserves individual characteristics, and filters out noise.
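To make the random-projection idea concrete, here is a small dependency-free sketch. It is an illustration under assumed settings, not a proof: the sizes (15 points, 500 original dimensions, 200 projected dimensions), the Gaussian projection matrix, and the fixed random seed are all choices made for the example, and the helper `random_projection` is hypothetical.

```python
import math
import random


def random_projection(points, k, rng):
    """Project points to k dimensions with a Gaussian matrix scaled by 1/sqrt(k)."""
    d = len(points[0])
    matrix = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(d)] for _ in range(k)]
    return [[sum(r * x for r, x in zip(row, p)) for row in matrix] for p in points]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(0)
n_points, d, k = 15, 500, 200
points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_points)]
projected = random_projection(points, k, rng)

# Ratio of projected distance to original distance for every pair of points.
ratios = [
    euclidean(projected[i], projected[j]) / euclidean(points[i], points[j])
    for i in range(n_points)
    for j in range(i + 1, n_points)
]
print(f"distance ratio after projection: min={min(ratios):.2f}, max={max(ratios):.2f}")
```

All ratios should stay close to 1, meaning the pairwise distances survive the projection with only modest distortion even though the projection matrix was chosen at random and never looked at the data.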

Semantic similarity

We hope that vectors can express the semantic similarity between words, allowing us to connect things we’ve seen with things we’ve never seen before. Semantic similarity is based on the distributional hypothesis, a fundamental linguistic assumption that words appearing in similar contexts are semantically related. For example, from the sentences below, we can see that there is a strong semantic connection between mathematicians and physicists.

The mathematician ran to the store.
The physicist ran to the store.
The mathematician enjoys drinking coffee.
The physicist enjoys drinking coffee.
The mathematician solved this long-standing problem.
The physicist solved this long-standing problem.

Because one-hot encoding lacks semantic meaning, it cannot express the close relationship between mathematicians and physicists. Dense vectors can compensate for exactly this shortcoming. To further illustrate the relationships between words, we can also plot the vectors corresponding to each word on a plane. For example, in the image below, physicists and mathematicians have similar meanings, so they are grouped together. Basketball players, football players, and rugby players differ from physicists in meaning more than mathematicians do, so they lie relatively farther away from physicists.

[Figure 704]

In addition, we hope to compare the semantic similarities and differences between vectors mathematically. Because vectors turn natural language into sequences of numbers, natural language becomes something we can compute with, and we hope relationships such as the following can be expressed through vectors: 郭靖 (Guo Jing) - 黄蓉 (Huang Rong) = man - woman.

[Figure 705]
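The kind of vector arithmetic we are hoping for can be illustrated with hand-picked toy vectors. The numbers below are not learned embeddings; they are an assumption made purely for illustration, with the first coordinate standing for a gender axis and the second for "is a character from the novel".

```python
# Hand-picked 2-D toy vectors: [gender axis, "novel character" axis].
vectors = {
    "郭靖": [1, 1],
    "黄蓉": [-1, 1],
    "man": [1, 0],
    "woman": [-1, 0],
}


def subtract(a, b):
    """Component-wise difference of two vectors."""
    return [x - y for x, y in zip(a, b)]


left = subtract(vectors["郭靖"], vectors["黄蓉"])
right = subtract(vectors["man"], vectors["woman"])
print(left, right, left == right)
```

In this toy space the two difference vectors coincide because both pairs differ only along the gender axis. Learned embeddings are not guaranteed to satisfy such relations exactly; at best they hold approximately.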

Trainable

We expect the embedding vectors to be trainable. As training progresses, the model can adjust the embedding vectors through backpropagation, making the vectors of similar words closer together and the vectors of dissimilar words further apart.

Summary

We now understand the importance of semantically meaningful dense multidimensional vectors; in other words, semantic analysis is the foundation of artificial intelligence, and the foundation of semantic analysis is dense multidimensional vectors. RGB (the three primary colors) is a very good example of a semantically meaningful dense vector, with the following characteristics:

Any color can be represented by an RGB vector.
All three dimensions have clear physical meanings (semantics) and are highly interpretable.
Each dimension is predetermined.
Vectors with similar physical meanings (semantics) are located close to each other in color space. The degree of similarity in meaning between two words can be represented by calculating the distance between high-dimensional vectors.
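The RGB analogy is easy to check numerically. The sketch below uses the standard 8-bit RGB values for red, orange, and blue; the helper name `color_distance` is introduced here for illustration.

```python
import math


def color_distance(c1, c2):
    """Euclidean distance between two RGB triples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))


red = (255, 0, 0)
orange = (255, 165, 0)
blue = (0, 0, 255)

print(color_distance(red, orange))  # 165.0
print(color_distance(red, blue))    # roughly 360.6
```

Orange is much closer to red than blue is, matching how the colors look. This is the property we want from word vectors as well: distance in the vector space should track similarity in meaning.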

1.6 Embedding

In fact, a semantically meaningful dense multidimensional vector is an embedding. Let’s first look at the authoritative definition of embedding.

The definition given on the PyTorch website is: Word embeddings are dense vectors of real numbers, one per word in your vocabulary.
The TensorFlow community defines it as: An embedding is a mapping from discrete objects, such as words, to vectors of real numbers.
The official OpenAI documentation explains it this way:

Embeddings are numerical representations of concepts converted to number sequences, which make it easy for computers to understand the relationships between those concepts.

Let’s give our own definition: word embeddings are dense vectors transformed from high-dimensional sparse vectors. Each dimension of the vector can represent a certain feature of the corresponding object. In this way, an object can be projected from a complex space to a relatively simple space, and semantic similarity can be encoded in the vector. This makes it easy for computers to understand the relationship between these concepts.
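One way to read "dense vectors transformed from high-dimensional sparse vectors" is that multiplying a one-hot vector by an embedding table simply selects one row of that table. The sketch below demonstrates this in plain Python; the table is filled with random numbers as a stand-in for learned values, and the sizes (4 tokens, 3 dimensions) and helper names are assumptions chosen for readability.

```python
import random

rng = random.Random(0)
vocab_size, dim = 4, 3

# Embedding table: one dense row per token. Random here, learned in a real model.
table = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_mat(vec, mat):
    """Row vector times matrix."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


token_id = 2
via_matmul = vec_mat(one_hot(token_id, vocab_size), table)
via_lookup = table[token_id]
print(via_matmul == via_lookup)

# A sequence of token ids becomes a (sequence length x dim) embedding matrix.
sequence_ids = [0, 1, 2, 3]
embedded = [table[i] for i in sequence_ids]
print(len(embedded), len(embedded[0]))
```

Because the two results are identical, implementations usually store the table and index into it directly instead of materializing the one-hot vectors.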

Next, let’s look at two key aspects of word embedding: representing concepts and expressing relationships (using similarity as an example).

Representation concept

We know that LLMs currently work by predicting the next word from the previous words. Therefore, we need to examine how humans predict, or rather, what is most crucial for human prediction. In fact, advanced human predictive abilities rely on equally advanced representational abilities. Broadly speaking, these representational abilities are called abstract thinking or abstract reasoning. That is, humans abstract the concrete physical world into subjective cognition, forming concepts.

NLP deals with language. While humans can understand the meaning of each word (concept) in a language as an abstract symbol, computers cannot directly abstract the meaning of each language symbol from perception. This is because successful language communication relies on shared experiences of the world. It is this shared experience that makes language meaningful. For example, “man” is a very abstract word concept, encompassing vision, hearing, interaction, relationships, and even more abstract responsibilities. Human concepts are so complex and effective because they possess the following characteristics:

They can be projected (potentially non-linearly) into a variety of applications.
In the conceptual space, different things are broadly connected through correlation.
The concepts are not only interconnected, but also support specific forms of logical inference and complex computation.

If we want a certain approach or method to accurately represent human concepts, then this method should not only be able to express various attributes, but also handle key phenomena in psychology, including similarity calculation, relationship establishment, analogical reasoning, and the recombination of certain concepts into new structured thinking. Alternatively, we can think of it this way: language disrupts spatial relationships, but forms a conceptual space. Within this conceptual space, geometry and calculus are no longer applicable. It requires a new mathematics or a new language. This unified structure and language, connecting the geometric space and the conceptual space, is the embedding. Taking Sora as an example, both concrete physical space and language are expressed as embeddings for joint training; the physical and linguistic spaces are unified as embeddings, and the connections between embeddings construct relationships.

The above statement may be too academic. Let’s use simpler language to see what characteristics a good embedding method should have in order to accurately represent human concepts.

Uniqueness. Words and embeddings must correspond one-to-one and cannot be repeated. Repetition increases the difficulty for computers to understand language, similar to the problems caused to humans by polyphonic or polysemous characters.
Distinguishing ability. From a mathematical perspective, words with similar meanings need to have “similar” quantification values (adjacent in space); words with dissimilar meanings need to have quantification values that are as “far apart” as possible.
Computability. Regardless of how two “words” are constructed, they should be computationally achievable within the same space, and the computation results should also be within the same space. For example, the representations of words can be “connected” through simple mathematical calculations to represent a new “word”.
It contains deep semantic information. The authors of ELMo believe that, ideally, a good embedding should model the following two aspects:
- Complex features of word usage (such as syntax and semantics).
- How these usages vary across linguistic contexts (i.e., modeling polysemous words).

In other words, within the right architecture, such a representation can function the way a word does.

Recent theoretical and computational advances in cognitive psychology and related fields suggest that vectors may satisfy all the properties required to express human concepts. This is partly because vectors live in vector spaces. A vector space is a linear-algebra concept: a set of vectors, described under a given dimension and basis, together with the operations defined on them. Vector spaces come with standard mathematical operations. For example, high-dimensional data in machine learning can be viewed as points in a vector space, and different measures (such as Euclidean distance and cosine similarity) can be used to quantify the relationships between points. New vectors can also be created by adding two vectors or by scaling a vector with a number. Because vectors exist within such a space, the meaning of any particular vector cannot be determined in isolation; it derives from the role the vector plays in a larger computational process. At the most basic level, this role includes geometric relationships between vectors, such as distances and angles, but also dynamic calculations on vectors.

Example

For example, we can project mathematicians and physicists as vectors, with each attribute (such as whether they can run, whether they like coffee, etc.) being a dimension of the vector. We see that both mathematicians and physicists can run and like coffee, so we can assign higher scores to these dimensions. This maps words or phrases from a vocabulary to the real-valued space of vectors. This allows certain attributes of data objects to be encoded into the geometric properties of their vector representations, thus expressing similarity.

[Figure 706]
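A handcrafted version of this idea might look like the sketch below. The attribute scores are invented for illustration: the first two dimensions follow the text ("can run", "likes coffee"), while the third dimension ("does research") and the football player vector are assumptions added so that there is something to contrast with.

```python
import math

# Dimensions: [can run, likes coffee, does research]; scores are made up.
mathematician = [0.8, 0.9, 0.9]
physicist = [0.8, 0.9, 1.0]
football_player = [1.0, 0.3, 0.0]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


sim_math_phys = cosine_similarity(mathematician, physicist)
sim_math_foot = cosine_similarity(mathematician, football_player)
print(f"mathematician vs physicist:       {sim_math_phys:.3f}")
print(f"mathematician vs football player: {sim_math_foot:.3f}")
```

With these scores the mathematician and the physicist point in almost the same direction, while the football player does not, so the geometry of the vectors reflects the similarity of the attributes.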

Please note that learned word embeddings are usually not interpretable. In the handcrafted vectors above, each dimension has a clear meaning, so we can read off that both mathematicians and physicists like coffee. In embeddings learned by a model, it is unclear what the individual numbers mean; we only know that similar words are similar in some underlying semantic dimensions.

Similarity

Next, let’s look at similarity.

From a psychological perspective, a vector maps a concept to points in a space (usually two- or three-dimensional), ensuring that the geometry of the vector space aligns with empirical measurements of psychological quantities (such as similarity or distance between items). While the resulting vector coordinates cannot be interpreted in isolation, the relationships between vectors carry meaning regarding psychological measurements. Two similar concepts will be close to each other in the vector space, and the distance within this space will correspond to people’s willingness to generalize between these two concepts. We will analyze this in more detail below.

First, let’s look at it from a numerical perspective. Word embeddings are a way to associate words with a series of numbers (vectors). Similar words are associated with close numbers, while dissimilar words are associated with far-off numbers. Similarity is a way to measure the degree of similarity between two words (or sentences) by assigning large numbers to similar words and small numbers to dissimilar words. We believe that in a good word embedding, the similarity between words like “bank” and “the” is almost zero because they are unrelated. Therefore, the model will know to ignore “the” and instead focus on words that might have a higher similarity to “bank”.

Secondly, concepts can be fully represented in a high-dimensional vector space; the underlying logic is that concepts can be derived dynamically through the relationships between vectors and through computation on them. This corresponds to another idea: dynamically simulating any other computational system through a Church-encoded target system. Church encoding is an idea from mathematical logic in which the dynamics of one system can mirror the dynamics of another.

Similarity calculation

Word embeddings are essentially vectors. Therefore, to understand the semantic proximity of two words or sentences, we can calculate the relevance between different data formats based on the relationships between different vectors, such as distance, length, and direction. For example, we can use the distance between vectors as a metric. The smaller the distance, the closer they are semantically, thus enabling semantic search. Several different metrics can be used to measure the distance between two vectors:

Euclidean distance.
Manhattan distance.
Cosine similarity.
Dot product.

[Figure 707]

Euclidean distance (L2)

Euclidean distance is the most standard way to define the distance between two points (or vectors) and is the most commonly used metric in everyday life; it is the length of the line segment connecting the two points. A drawback of Euclidean distance is that, although it is a commonly used distance metric, it is not scale-invariant, meaning the calculated distance may be skewed based on the units of the features. Therefore, data normalization is usually required before using Euclidean distance. Furthermore, the effectiveness of Euclidean distance decreases as the dimensionality of the data increases. This is related to the curse of dimensionality.
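In the notation used for the other metrics in this section, Euclidean distance is $\text{distance} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$. The sketch below computes it and illustrates the lack of scale invariance with three hypothetical people described by (height, weight); the numbers are invented for the example.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# (height, weight) with height in metres and weight in kilograms.
a, b, c = (1.80, 70.0), (1.60, 71.0), (1.80, 75.0)
print(euclidean(a, b) < euclidean(a, c))  # True: b looks closer to a

# The same people with height expressed in centimetres.
a_cm, b_cm, c_cm = (180.0, 70.0), (160.0, 71.0), (180.0, 75.0)
print(euclidean(a_cm, b_cm) < euclidean(a_cm, c_cm))  # False: now c looks closer
```

Nothing about the people changed, only the unit of one feature, yet the nearest neighbour of `a` flipped. This is why features are usually normalized before Euclidean distance is applied.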

Manhattan distance (L1)

$\text{distance} = \sum_{i=1}^{n} |x_i - y_i|$

Manhattan distance (the distance induced by the L1 norm), often called taxicab distance or city block distance, is used to calculate the distance between real-valued vectors. The name comes from Manhattan in New York City: the island has a grid of streets, so the shortest route between two points in Manhattan is the L1 distance.

Manhattan distance does not allow diagonal movement. Imagine pieces on a uniform chessboard grid that can only move at right angles, just as you must follow the street grid. Manhattan distance seems to work well when the dataset has discrete or binary attributes, because it reflects the paths that can actually be taken within the values of those attributes. Euclidean distance, in contrast, would draw a straight line between two vectors, which may not correspond to any path that is actually possible in such data.

Disadvantages: While Manhattan distance seems to work in high-dimensional data, it is less intuitive than Euclidean distance, especially when used with high-dimensional data. Furthermore, since it may not be the shortest path, it may give a higher distance value than Euclidean distance.

Cosine similarity

$\text{Similarity} = \frac{\vec{x} \cdot \vec{y}}{||\vec{x}|| \times ||\vec{y}||}$

Cosine similarity measures the similarity between two vectors by the cosine of the angle between them: the smaller the angle, the higher the similarity. It is a measure of direction, so it reflects the angular difference between two vectors and ignores their lengths. The value of cosine similarity ranges from -1 to 1. A value closer to 1 indicates that the two vectors are more similar in direction; a value of 0 indicates that they are orthogonal; and a value of -1 indicates that they are completely opposite in direction.

Cosine similarity is often used to offset high-dimensional Euclidean distance problems. By calculating the cosine similarity between vectors, we can more accurately assess their semantic similarity, rather than simply comparing their numerical magnitude or linear distance.

Disadvantages: It only considers the direction of vectors, not their magnitude. For example, in recommender systems, cosine similarity doesn’t account for differences in rating scales among different users.

Dot product

The formula for the dot product (or scalar product) is as follows: we multiply the corresponding components of the two vectors and then add the products together.

$\vec{x} \cdot \vec{y} = \sum_{i=1}^{n} x_i \times y_i$

Geometrically, the dot product is tied to the angle between two vectors: it equals the length of one vector multiplied by the (signed) length of the projection of the other vector onto it. For vectors of comparable length, the more similar their directions, the larger the dot product, and vice versa. If the angle between two vectors is 90 degrees, the vectors are orthogonal, their dot product is 0, and they are considered unrelated. Based on the dot product, we can define concepts such as distance, length, and angle. In fact, the Transformer uses the dot product to measure the similarity between two vectors.


Furthermore, the result of the dot product depends heavily on the magnitude of the vectors; for example, let’s calculate the dot product between two pairs of vectors:

For vectors A=(1,1) and B=(1,1), their dot product is 1*1+1*1=2.
When we consider another pair of vectors C=(1,1) and D=(10,10), the dot product is 1*10+1*10=20.

Although these two pairs of vectors are aligned in the same direction, their difference in magnitude results in significantly different dot products. This explains why the dot product is not always the most appropriate choice when comparing vectors.
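
As a rough illustration of the four metrics discussed in this section, the following minimal Python sketch computes them for the two vector pairs above. The helper names (dot, euclidean, manhattan, cosine_similarity) are chosen here for illustration and do not come from any library.

```python
import math

def dot(x, y):
    # Sum of the products of corresponding components
    return sum(a * b for a, b in zip(x, y))

def euclidean(x, y):
    # L2 distance: length of the straight line between x and y
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # L1 distance: sum of absolute component-wise differences
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    # Dot product divided by the product of the two vector lengths
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

A, B = (1, 1), (1, 1)
C, D = (1, 1), (10, 10)

for name, (x, y) in {"A,B": (A, B), "C,D": (C, D)}.items():
    print(name,
          "dot =", dot(x, y),
          "cosine =", round(cosine_similarity(x, y), 6),
          "euclidean =", round(euclidean(x, y), 4),
          "manhattan =", manhattan(x, y))
```

For both pairs the cosine similarity is 1 (up to floating-point rounding), while the dot products are 2 and 20, which is the sensitivity to magnitude described above.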

Recent developments

In machine learning and data science, cosine similarity has long been the preferred metric for measuring semantic similarity between high-dimensional objects. Cosine similarity has been widely used in a variety of applications, from recommender systems to natural language processing. Its popularity stems from the belief that it captures the directional alignment between embedded vectors, providing a more meaningful similarity measure than a simple dot product.

However, a paper titled “Is Cosine-Similarity of Embeddings Really About Similarity?” by Netflix and Cornell University challenges our understanding of this popular approach: cosine similarity often provides answers that do not reflect reality.

Regarding embeddings derived from regularized linear models, the paper concludes that for some linear models the similarity is not even unique, while for others it is implicitly controlled by regularization. Beyond linear models, similar issues exist in more complex scenarios: deep learning models often use multiple different regularization techniques simultaneously, which can have unexpected effects on the cosine similarity of the final embedding. When embeddings are learned by optimizing a dot product, directly using cosine similarity may yield results that are difficult to interpret and lack practical significance.

Based on these insights, the research team concluded that cosine similarity should not be used blindly and outlined the following alternatives:

Train the model directly with respect to cosine similarity.
Avoid working in the embedding space: project the embedding back into the original space before applying cosine similarity.
Apply normalization or reduce popularity bias before or during learning, rather than normalizing only after learning, as cosine similarity does.

Building on the paper, blogger Amarpreet Kaur summarized some alternatives to cosine similarity:

Euclidean distance: Although it is not very popular in text data due to its sensitivity to vector size, it can be useful when the embedding is properly normalized.
Dot product: In some applications, the unnormalized dot product between embedding vectors has been found to outperform cosine similarity, particularly in dense passage retrieval and question answering tasks.
Soft cosine similarity: This method considers the similarity between individual words in addition to vector representations, potentially providing a more detailed comparison.
Semantic Text Similarity (STS) Prediction: Fine-tuned models trained specifically for semantic similarity tasks (such as STSScore) promise to provide more robust and interpretable similarity metrics.
Normalized embedding and cosine similarity: Before using cosine similarity, applying normalization techniques such as layer normalization can effectively improve the accuracy of similarity calculation.

When selecting an alternative, the specific requirements of the task, the nature of the data, and the model architecture used must be considered. Empirical evaluation on domain-specific datasets is typically required to determine the most suitable similarity measure for the particular application.

1.6 Summary

Comparison

Let’s first look at the comparison between embedding and one-hot encoding, specifically from the following perspectives:

One-hot vectors can be viewed as a special case of word embeddings in which any two distinct words have a similarity of 0. One-hot vectors are sparse, whereas word embedding vectors are dense, meaning that the values in their dimensions are (usually) non-zero.
Intuitively, embedding is equivalent to smoothing a one-hot vector, while a one-hot vector is equivalent to max pooling the embedding.
To quote Su Shen: “The Embedding layer is a fully connected layer that takes one-hot encoding as input and the intermediate layer nodes as the dimension of the character vectors! And the parameters of this fully connected layer are a ‘character vector table’! The character vectors are the parameters of the one-hot fully connected layer!”
One-hot vector encoding produces a high-dimensional sparse vector, while embedding vectors are low-dimensional dense vectors, saving more storage space. In addition, one-hot vector encoding cannot capture the relationship between categories, while embedding vectors can learn the semantic similarity between categories through training.
A classic modeling approach involves using the integer indices of characters as a base and looking up a table to obtain a vector. This makes it easy to use neural networks to process a string of character vectors into vectors of arbitrary levels. Multiplying a one-hot matrix with other matrices is analogous to using one-hot encoding to look up a table, thus simplifying one-hot matrix operations to a table lookup operation. After obtaining the parameters of this fully connected layer through the table lookup operation, we can directly use these parameters as features to pass to subsequent modules of the Transformer, or in other words, use these parameters as representations of characters and words to obtain character and word vectors.

$\begin{pmatrix} 1&0&0&0\\ 0&1&0&0 \end{pmatrix} \begin{pmatrix} w_{11}&w_{12}&w_{13}&w_{14}\\ w_{21}&w_{22}&w_{23}&w_{24}\\ w_{31}&w_{32}&w_{33}&w_{34}\\ w_{41}&w_{42}&w_{43}&w_{44} \end{pmatrix} = \begin{pmatrix} w_{11}&w_{12}&w_{13}&w_{14}\\ w_{21}&w_{22}&w_{23}&w_{24} \end{pmatrix}$

In fact, the vocabulary is a mapping between these two matrices.

Flow

Secondly, the diagram below illustrates the flow from text to embedding, showing the transformation between the two spaces.


Advantages

Next, let’s look at the advantages of embedding.

Semantic understanding: Embedding-based retrieval methods represent text using word vectors, assigning different vector representations to words based on context. This allows the model to capture the semantic connections between words, making it more advantageous in handling polysemy. In contrast, keyword-based retrieval often focuses on literal matching, potentially ignoring the semantic relationships between words and failing to effectively distinguish the meaning of the same word in different contexts.
Error tolerance: Because embedding-based methods can understand the relationships between words, they are more advantageous in handling spelling errors, synonyms, and near-synonyms. Keyword-based retrieval methods, on the other hand, are relatively weaker in handling these situations.
Multilingual support: Many embedding methods support multiple languages, facilitating cross-language text retrieval. For example, you can use Chinese input to search for English text content, something keyword-based retrieval methods struggle to do.
Scalability and flexibility: The design of the input embedding layer allows the model to easily adapt to different languages or domains simply by adjusting the vocabulary and embedding matrix. This flexibility enables the Transformer model to be widely applied to a variety of natural language processing tasks.
High efficiency: The Embedding model is optimized with matrix algorithms, making it more efficient and effective than traditional vectorization methods.

Therefore, there’s a saying in the industry that “everything can be embedded.” Pinecone’s creator, Edo Liberty, wrote in his blog: “Machine learning represents everything as vectors, including documents, videos, user behavior, and so on. These representations allow different things to be accurately retrieved, searched, ranked, and classified based on similarity or relevance. This has applications in many scenarios, such as product recommendation, semantic search, image search, anomaly detection, fraud detection, and facial recognition.”

0x02 Transformer Embedding Layer Implementation

The role of the embedding layer is to transform the numerical representation of words in text into a high-dimensional vector representation, aiming to capture the relationships between words in high-dimensional space. Below, we will analyze in detail the key functions and construction process of the input embedding layer.

2.1 Process & Architecture

The image below has been mentioned in a previous article, but it is shown again here to help readers understand it better.

Suppose the model’s input is the sentence “Data visualization empowers users to”. Since the model cannot understand this sentence, we need to encode the natural language, that is, vectorize the text. This involves the following steps:

Tokenize the input text.
Each token is mapped to an input embedding. The purpose here is to project the information of the word itself into the embedding space, so this operation is a mapping from an integer (0~V-1) to a vector space. V represents the vocabulary size.
Add a positional code to each position in the sequence. The purpose of the positional code is to tell the model the correct word order, so this operation is also a mapping from an integer (0~n-1) to a vector space. n represents the length of the sentence sequence.
The input embedding and position encoding are added together to obtain the final word embedding. Because the embedding and position encoding need to be added together, they have the same dimension.
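
The four steps above can be sketched in plain Python as follows. This is only a toy illustration under several assumptions that are not part of the original text: a whitespace tokenizer, a vocabulary built from the sentence itself, a randomly initialized embedding table, a tiny d_model of 8, and the sinusoidal positional encoding from the original Transformer paper. The scaling by the square root of d_model discussed later in this article is left out to keep the sketch short.

```python
import math
import random

random.seed(0)
d_model = 8  # tiny on purpose; real models use e.g. 512

# Step 1: tokenize (hypothetical whitespace tokenizer and toy vocabulary)
sentence = "Data visualization empowers users to"
tokens = sentence.split()
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[tok] for tok in tokens]

# Step 2: map each token ID (0 .. V-1) to a row of a [V, d_model] table
V = len(vocab)
embedding_table = [[random.gauss(0.0, 1.0) for _ in range(d_model)]
                   for _ in range(V)]
token_embeddings = [embedding_table[i] for i in token_ids]

# Step 3: map each position (0 .. n-1) to a d_model-dimensional vector
def positional_encoding(n, d):
    pe = [[0.0] * d for _ in range(n)]
    for pos in range(n):
        for i in range(0, d, 2):  # i is the even dimension index
            angle = pos / (10000 ** (i / d))
            pe[pos][i] = math.sin(angle)
            pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(len(tokens), d_model)

# Step 4: element-wise sum; both parts must have the same dimension
x = [[e + p for e, p in zip(emb, pos)]
     for emb, pos in zip(token_embeddings, pe)]

print(token_ids)
print(len(x), len(x[0]))  # sequence length, d_model
```

Because the two parts are added element-wise, the sketch would fail to produce a meaningful result if the embedding and the positional encoding had different dimensions, which is the constraint stated in the last step.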


The last three steps of the above process correspond to the blue boxes in the Transformer architecture diagram. As shown in the diagram below, both the inputs and the outputs (which are fed back in as decoder inputs) are first converted into embeddings, and positional encoding is then added before the result is fed into the attention layer. This can be expressed as follows.

$\text{MultiHead Attention input} = \text{Input Embedding}(\text{Inputs}) + \text{Positional Encoding}$

$\text{Masked MultiHead Attention input} = \text{Output Embedding}(\text{Outputs}) + \text{Positional Encoding}$


Reflection

In BERT, XLNet, and RoBERTa, the vocabulary embedding size (E) and the transformer layer hidden size (H) are equal. Must they be equal? There are differing opinions. If they are equal, there are two drawbacks:

From a modeling perspective, WordPiece embeddings are meant to learn context-independent representations, whereas the hidden-layer representations learned by the Transformer are meant to be context-dependent. Therefore, in theory, H, which stores contextual information, should be much larger than E. Decoupling H from E also helps reduce the number of parameters (when H >> E).
From a practical perspective, the vocabulary size V in the NLP field is often quite large. If we set E = H, as H increases, E will also increase, resulting in a very large word embedding matrix V * E = V * H. Furthermore, embeddings are updated rather sparsely during actual training.

The authors of ALBERT argued that this was unnecessary because the primary target of the pre-trained model is the “context-related information” represented by H, rather than the “context-independent information” represented by E. Therefore, ALBERT did not directly map one-hot encoding to a hidden space of size H. Instead, it used factorization, first mapping to a low-dimensional space E, and then to the hidden space H. This factorization reduced the number of parameters from O(V*H) to O(V*E + E*H). When H >> E, the number of parameters was significantly reduced. Transforming the large word embedding matrix into two smaller matrices separates the hidden layer size from the word embedding size, allowing the hidden layer size to be increased without changing the word embedding size.
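
To make the parameter saving concrete, here is a small arithmetic sketch. The sizes V = 30000, H = 768, and E = 128 are illustrative values assumed for this example, and embedding_params is a hypothetical helper written for it.

```python
def embedding_params(V, H, E=None):
    """Embedding parameter count: V*H without factorization,
    V*E + E*H when factorized through a smaller dimension E."""
    if E is None:
        return V * H
    return V * E + E * H

V, H, E = 30000, 768, 128  # assumed sizes, for illustration only

full = embedding_params(V, H)         # one V x H matrix
factored = embedding_params(V, H, E)  # a V x E matrix plus an E x H matrix

print("unfactorized:", full)
print("factorized:  ", factored)
print("ratio:       ", round(full / factored, 2))
```

With these assumed sizes the unfactorized embedding has 23,040,000 parameters and the factorized one has 3,938,304, and H can then be increased while the V x E part stays unchanged.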

Furthermore, are larger word vectors always better? The paper “Massive Exploration of Neural Machine Translation Architectures” mentions that 2048-dimensional word vectors do not offer a significant performance improvement compared to 512-dimensional word vectors. The reason is simple: word vectors have fulfilled their task as long as they can group similar words together. Even if word vectors were only 100-dimensional, and each dimension could only take two values {0, 1}, the entire word vector space could still encode 2^100 different words, which is more than enough to embed tens of thousands of words into this space.

2.2 Implementation

The goal of the embedding layer is to convert the token sequence into word embeddings. Therefore, two steps are required:

More meaningful word vector representations are learned through gradient descent, and these word vectors form an embedding matrix.
Find the embedding vector corresponding to the token in the embedding matrix (nn.Embedding). That is, project the token from the vocabulary dimension (d_vocab) to a custom dimension (d_model). The size of d_model also affects the final number of parameters in the model.

Note: We only provide the implementation of Input Embedding here; the implementation of position encoding will be analyzed in subsequent chapters.

Embedding matrix

In large language models, the embedding layer is essentially an embedding matrix used to convert input and output token IDs into d_model-dimensional vectors.

An embedding matrix is essentially a lookup table derived from the vocabulary; it’s a table that stores “all” words. It contains the vector representation of each token (word or sub-word) in the model’s vocabulary. The embedding matrix can be viewed as a two-dimensional matrix of size V x D, where:

Number of rows: V is the size of the model’s vocabulary, that is, the total number of words or subwords that the model can recognize. The number of rows in the embedding matrix corresponds to the number of words in the vocabulary. Each row of the embedding matrix corresponds to a word ID in the vocabulary, and the vector stored in each row is the high-dimensional vector representation of that word ID.
Number of columns: D is the dimension of each word vector, usually a large number (such as 300 dimensions, 512 dimensions, 1024 dimensions, etc.), depending on the specific model design. The number of columns is equal to the dimension of the model.

For example, assuming a vocabulary size of V = 10000 and a word vector dimension D = 300, then the embedding matrix size is 10000 x 300, with each word in the vocabulary having a 300-dimensional vector representation. Therefore, the size of the embedding matrix is determined by the vocabulary size and the dimension of the word vectors.

During training or inference, after the input is tokenized and transformed into a sequence of word IDs, the embedding layer uses these IDs to look up the corresponding word vectors in the embedding matrix. This is accomplished through a lookup mechanism. Encoding using embeddings involves using the token index (word ID) to look up the corresponding vector in the matrix. Each word is located in a specific row of this table, and this row represents the semantic meaning of that word learned in the embedding space. This completes the transformation from discrete word IDs to a continuous vector space.

The diagram below uses different colors to indicate the calculation process of the two embeddings. For the third word ‘Hello’, the token index 2 is used to look up the third row of the embedding matrix (index 2). Similarly, for the word ‘World’, the fourth row of the embedding matrix (index 3) is retrieved. The embedding layer outputs these corresponding vectors, which are then used as input by the model to process subsequent layers of the neural network.


From another perspective, the token sequence number can be considered a one-hot encoding; for example, 2 is [0,0,1,0,0,0]. Looking up a token by its sequence number is equivalent to multiplying the one-hot encoding by the embedding matrix. Only one row in the embedding matrix is activated, and rows do not interfere with each other. In other words, the embedding lookup operation is essentially converting a one-hot vector into a high-dimensional vector.
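
The equivalence between the one-hot product and the table lookup can be checked with a few lines of plain Python. The 6 x 3 matrix values and the helper names one_hot and vec_mat are made up for this illustration.

```python
def one_hot(index, size):
    # Vector of zeros with a single 1.0 at position `index`
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(v, M):
    # Row vector (length V) times matrix (V x D) -> row vector (length D)
    return [sum(v[r] * M[r][c] for r in range(len(M)))
            for c in range(len(M[0]))]

# Toy embedding matrix: V = 6 tokens, D = 3 dimensions (illustrative values)
embedding_matrix = [
    [0.1, 0.2, 0.3],
    [1.1, 1.2, 1.3],
    [2.1, 2.2, 2.3],
    [3.1, 3.2, 3.3],
    [4.1, 4.2, 4.3],
    [5.1, 5.2, 5.3],
]

token_id = 2
via_matmul = vec_mat(one_hot(token_id, 6), embedding_matrix)  # [0,0,1,0,0,0] @ W
via_lookup = embedding_matrix[token_id]                        # W[2]

print(via_matmul)
print(via_lookup)
```

Both paths return the third row of the matrix. The lookup avoids the V multiplications per output dimension, almost all of which are by zero, which is why frameworks implement the embedding layer as indexing rather than as a dense matrix product.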

Below is code extracted from ALBERT, which shows how one-hot encoding is handled. Therefore, we can actually understand one-hot encoding as a process of querying knowledge in the weights of an MLP, and the output is an embedding representation vector, which is actually invertible (it’s just a matrix equation).

def embedding_lookup(input_ids,
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):
  """Looks up words embeddings for id tensor.

  Args:
    input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
      ids.
    vocab_size: int. Size of the embedding vocabulary.
    embedding_size: int. Width of the word embeddings.
    initializer_range: float. Embedding initialization range.
    word_embedding_name: string. Name of the embedding table.
    use_one_hot_embeddings: bool. If True, use one-hot method for word
      embeddings. If False, use `tf.nn.embedding_lookup()`.

  Returns:
    float Tensor of shape [batch_size, seq_length, embedding_size].
  """
  # This function assumes that the input is of shape [batch_size, seq_length,
  # num_inputs].
  #
  # If the input is a 2D tensor of shape [batch_size, seq_length], we
  # reshape to [batch_size, seq_length, 1].
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])

  embedding_table = tf.get_variable(
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))

  if use_one_hot_embeddings:
    flat_input_ids = tf.reshape(input_ids, [-1])
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
  else:
    output = tf.nn.embedding_lookup(embedding_table, input_ids)

  input_shape = get_shape_list(input_ids)
  output = tf.reshape(output,
                      input_shape[0:-1] + [input_shape[-1] * embedding_size])
  return (output, embedding_table)

Harvard Code

The Embeddings class is defined as follows.

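import math
import torch.nn as nn

# Inherits from PyTorch's nn.Module
class Embeddings(nn.Module):
    # Constructor
    def __init__(self, d_model, vocab):
        """
        Arguments:
        d_model: word embedding dimension
        vocab: vocabulary size
        """
        # Call the parent class initializer
        super(Embeddings, self).__init__()
        # Create an nn.Embedding as the word embedding layer, sized by the vocabulary size and the embedding dimension
        self.lut = nn.Embedding(vocab, d_model)
        # Keep the embedding dimension as an attribute
        self.d_model = d_model

    # Forward pass
    def forward(self, x):
        # Look up the vectors for the input x in the embedding layer and scale them by the square root of the embedding dimension
        # x holds the token indices
        return self.lut(x) * math.sqrt(self.d_model)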

There’s a detail in the code above: after looking up the input word vectors, the result is multiplied by the square root of the embedding size, sqrt(d_model). The original Transformer paper specifies this scaling without explaining it. A common interpretation is that it is about numerical scale: the embedding is later combined with other quantities (most directly the positional encoding, and then the dot products used in attention), and keeping its values in a reasonable range relative to them helps training stability and lets the model learn the mapping from input to output more effectively.

Additionally, multiplying by the square root of the embedding size will make the sizes of Embedding and Position Encoding consistent. The specific reason is:

Because Position Encoding is calculated using trigonometric functions, its range is [-1, 1].
If the embedding weights are initialized with a scheme such as Xavier initialization, their variance is on the order of 1/d_model, so each value in the matrix becomes smaller as d_model grows. Multiplying by the square root of d_model brings the variance back to roughly 1.

Therefore, when adding Position Encoding, the embedding value needs to be increased; otherwise, information will be lost after adding the inconsistent sizes.

In short, the purpose of this operation is to improve numerical and training stability, helping the model learn the mapping from input to output more effectively.
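
The variance argument can be checked numerically with a short sketch. It takes the premise stated above as an assumption (entries initialized with variance 1/d_model); num_rows and the variance helper are hypothetical and exist only to get enough samples for a stable estimate.

```python
import math
import random

random.seed(0)
d_model = 512
num_rows = 100  # hypothetical number of embedding rows

# Assumption from the text: entries start with variance 1/d_model
std = 1.0 / math.sqrt(d_model)
table = [[random.gauss(0.0, std) for _ in range(d_model)]
         for _ in range(num_rows)]

def variance(rows):
    # Population variance over all entries of a 2-D list
    values = [v for row in rows for v in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Same scaling as in Embeddings.forward: multiply by sqrt(d_model)
scaled = [[v * math.sqrt(d_model) for v in row] for row in table]

var_before = variance(table)
var_after = variance(scaled)
print(f"variance before scaling: {var_before:.5f} (1/d_model = {1 / d_model:.5f})")
print(f"variance after scaling:  {var_after:.3f}")
```

Before scaling, the entries are much smaller than positional encoding values in [-1, 1]; after scaling, their variance is close to 1, so the two terms being added have a comparable magnitude.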

PyTorch Embedding

The core logic of the Embeddings class is handled by nn.Embedding(vocab, d_model), so let’s examine the details of nn.Embedding(vocab, d_model). nn.Embedding is a crucial layer provided by PyTorch, primarily used to map discrete vocabulary indices (or other category indices) to a continuous, dense vector space. The parameters of nn.Embedding are themselves part of the model parameters and participate in gradient descent calculations, allowing it to learn semantic similarity between categories through training.

Compared to torch.nn.Linear, nn.Embedding is specifically designed for mapping discrete indices to vectors, while nn.Linear is a fully connected layer used for linear transformations from inputs of arbitrary shapes to outputs. Furthermore, each index in nn.Embedding has an independent embedding vector, while the weights in nn.Linear are a shared linear transformation matrix.

The Embedding class is defined as follows.

class Embedding(Module):
    __constants__ = ['num_embeddings', 'embedding_dim', 'padding_idx', 'max_norm',
                     'norm_type', 'scale_grad_by_freq', 'sparse']
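    num_embeddings: int # size of the dictionary, i.e. the number of distinct categories; for word embeddings this is usually the vocabulary size
    embedding_dim: int # dimension of each embedding vector, i.e. how many dimensions each word is encoded into
    padding_idx: Optional[int] # index reserved for padding
    max_norm: Optional[float] # if set, embedding vectors whose norm exceeds max_norm are renormalized to have norm max_norm
    norm_type: float # which p-norm is used for the max_norm check; default 2, i.e. the L2 norm
    scale_grad_by_freq: bool # if True, gradients are scaled by the inverse frequency of each index in the mini-batch
    weight: Tensor # the learnable embedding matrix, of shape (num_embeddings, embedding_dim)
    sparse: bool # use sparse gradients for the weight matrix, which can save memory with large vocabularies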

    def __init__(self, num_embeddings: int, embedding_dim: int, padding_idx: Optional[int] = None,
                 max_norm: Optional[float] = None, norm_type: float = 2., scale_grad_by_freq: bool = False,
                 sparse: bool = False, _weight: Optional[Tensor] = None,
                 device=None, dtype=None) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super(Embedding, self).__init__()
        self.num_embeddings = num_embeddings
        self.embedding_dim = embedding_dim
        if padding_idx is not None:
            if padding_idx > 0:
                assert padding_idx < self.num_embeddings, 'Padding_idx must be within num_embeddings'
            elif padding_idx < 0:
                padding_idx = self.num_embeddings + padding_idx
        self.padding_idx = padding_idx
        self.max_norm = max_norm
        self.norm_type = norm_type
        self.scale_grad_by_freq = scale_grad_by_freq
        if _weight is None:
            self.weight = Parameter(torch.empty((num_embeddings, embedding_dim), **factory_kwargs))
            self.reset_parameters()
        else:
            self.weight = Parameter(_weight)
        self.sparse = sparse

    def forward(self, input: Tensor) -> Tensor:
        return F.embedding(
            input, self.weight, self.padding_idx, self.max_norm,
            self.norm_type, self.scale_grad_by_freq, self.sparse)

The input to nn.Embedding is a token ID, and the output is the corresponding embedding vector. Given a token, the embedding layer directly retrieves the corresponding embedding vector from the embedding matrix based on the token ID. An example code snippet for using nn.Embedding is shown below. For instance, the following is the result of encoding the indices of the four words [0, 2, 0, 5]. As you can see, nn.Embedding is similar to a lookup table, mapping each discrete input index (usually an integer) to a fixed-size vector (i.e., the embedding vector). This mapping allows the model to transform discrete category information into continuous vector representations, thus better capturing the semantic relationships between categories.

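import torch
import torch.nn as nn

# 10 embedding vectors, each of dimension 3; index 0 is the padding index
embedding = nn.Embedding(10, 3, padding_idx=0)
# Input indices
input = torch.LongTensor([[0,2,0,5]])
# Look up the embedding vectors
embedding(input)
# Output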
tensor([[[ 0.0000,  0.0000,  0.0000],
         [ 0.1535, -2.0309,  0.9315],
         [ 0.0000,  0.0000,  0.0000],
         [-0.1655,  0.9897,  0.0635]]])

Call

In the Harvard code, instances of the Embeddings class are created when the model is built and are passed to the EncoderDecoder constructor. The encode() and decode() methods of EncoderDecoder, which are called from forward(), then use these instances to embed the source and target sequences.

model = make_model(len(vocab_src), len(vocab_tgt), N=6)

def make_model(
    src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1
):
    c = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)

    # src_vocab is the size of the source vocabulary
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab),
    )
    return model

class EncoderDecoder(nn.Module):
    """
    A standard Encoder-Decoder architecture. Base for this and many
    other models.
    """
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        # Corresponds to nn.Sequential(Embeddings(d_model, src_vocab), c(position))
        self.src_embed = src_embed
        # Corresponds to nn.Sequential(Embeddings(d_model, tgt_vocab), c(position))
        self.tgt_embed = tgt_embed
        self.generator = generator

    def forward(self, src, tgt, src_mask, tgt_mask):
        "Take in and process masked src and target sequences."
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask) # the source Embeddings is invoked here

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask) # the target Embeddings is invoked here

Input/Output

Suppose we are doing machine translation and have three sentences, with a maximum length of 10 characters per sentence. The input and output of Embedding would be as follows.

The input to the Input Embedding is a sequence of token indices with the shape torch.Size([3, 10]), which is [batch_size, seq_len]. This corresponds to label 1 in the diagram.
The output of the input embedding is a 3 x 10 x 512 tensor, and the character vector for each sentence is a 10 x 512 two-dimensional matrix (each row of the 512-dimensional vector represents a character). This corresponds to label 2 in the diagram.
The input to Positional Encoding is the sentence length, corresponding to label 3 in the diagram. The output of Positional Encoding is a position vector, which is also a 10 x 512 two-dimensional matrix (each row represents the position information of the corresponding word), corresponding to label 4 in the diagram. It should be noted that in the Harvard code, Positional Encoding actually performs an addition operation, so the input of the Positional Encoding class is the output of Embedding, and its shape is torch.Size([3, 10, 512]), that is, [batch_size, seq_len, d_model].
The new two-dimensional matrix obtained by adding the two matrices corresponding to each sentence can better and more comprehensively express the semantics; this is Word Embedding. (See label 5 in the diagram.)


2.3 Training

Source of Embedding

There are usually two options for word embedding:

We can directly use pre-trained word vectors without modifying them. For example, we can use algorithms like Word2Vec or GloVe for pre-training. Because these embedding vectors are fed as input to subsequent layers of the model during inference and remain unchanged, the embedding matrix is essentially a lookup table in this case. If there is limited supervised data, we can fix the embeddings and only let the model learn the other parameters. This can also be seen as a form of transfer learning.
Dynamic word vectors are used. Dynamic word vectors mean that word vectors are trained simultaneously with the training of a specific task, based on the corpus. The embedding matrix is the output after model training. Specifically, it’s an embedding layer trained in the Transformer. Each token has a corresponding high-dimensional representation in the embedding layer, and this mapping is continuously updated and improved during model training.

Why do we need to train again?

Retraining is needed because pre-trained embeddings are fixed. For a natural language model to understand the possible meanings of each word, phrase, sentence, or paragraph in different contexts, it has to continue training on real-world contexts. Furthermore, static embeddings are generally the parameter weights of the penultimate layer of the neural network that produced them, so they carry only overall and relative meaning and lack local and absolute meaning. This is related to how the embeddings are generated.

Let the model design itself

Machine learning models cannot consume unstructured data directly. For a model to understand text or images, we must convert them into numerical representations. Before deep learning, such representations were typically created “manually” through feature engineering. But as the number of features increases, the vector dimensions grow larger and larger, and these vectors become a major pain point: imagine an object with thousands of different semantic attributes. How exactly would you set the values for all of those attributes by hand?

With the advent of deep learning, nonlinear feature interactions in complex data are learned by the model rather than designed manually. This is because the core idea of deep learning is to let the neural network learn feature representations itself, rather than requiring programmers to design them. Therefore, researchers have proposed making word embeddings parameters in the model and updating them during training.

Training process

The training process is basically as follows.

Initialization. First, we need to initialize the character vectors as [vocab size, embedding dimension], where vocab size is the total number of characters in the vocabulary, and embedding dimension is the dimension of the character vector, which is also the mathematical representation of each character. Assuming we choose 3000 commonly used Chinese characters, each corresponding to a 512-dimensional random vector, then the entire character vector is a 3000 x 512 two-dimensional matrix, with each row representing a character. There are two ways to initialize the embedding matrix:
- Random initialization: Most models initialize each vector in the embedding matrix with random numbers at the beginning. To maintain the distinguishability of vectors, they are usually set to be orthogonal. A simple way to achieve orthogonality is through randomness: in high-dimensional space, two randomly generated vectors are approximately orthogonal, so symbols are usually initialized as random high-dimensional vectors. This makes the generated vectors as independent as possible, i.e., unrelated. For example, the specific meanings of “you,” “I,” “he,” “take,” “and,” and “in” can be randomly interchanged at the initial definition and are unrelated to each other. The correlation/relationship between these words is formed in context based on semantic, grammatical, cultural, and other factors.
- Pre-trained embeddings: In some cases, the model may use pre-trained word vectors to initialize the embedding matrix. For example, word vectors generated by models such as Word2Vec, GloVe, or FastText can be used as the initial values for the embedding layer. These pre-trained word vectors have been trained on a large corpus, containing rich lexical semantic relationships, providing a good foundation for subsequent Transformer layers (such as self-attention layers and feedforward network layers). Then, during training, the pre-trained embedding vectors are further fine-tuned and updated in conjunction with the model’s loss function and training task (such as masked language modeling or next word prediction), thereby learning specific semantic representations.
Update. The vectors in the embedding matrix are learned from the training data during the model’s training process. The Transformer progressively adjusts these embeddings during training, allowing the vectors to communicate with each other and update their values by passing information to one another.
- The goal of the learning process is to ensure that words sharing a common context also share similar vectors. We can achieve this by adjusting the word vectors through backpropagation. On one hand, we push similar vectors closer together in the numerical space, making them more like the vectors of their neighbors. On the other hand, training pushes the vectors away from the vectors of random words that do not share their context. Therefore, these vectors not only encode a single word but also incorporate richer contextual meaning.
- As the optimization algorithm iterates and updates, this process isn’t a one-time event. It’s more like a loop. You browse through each word in the corpus, updating its vector based on its neighbors. Sometimes, you might slightly differentiate it from random, unrelated words. As you repeat this process, iterating again and again, the vectors begin to stabilize. They find an equilibrium point where adjustments become increasingly smaller, allowing words with similar patterns to accumulate these similarity updates to a considerable degree. This means that similar words (like synonyms) will have similar vector representations in the embedding space. For example, in attention mechanisms, surrounding embeddings pass their information to the embedding corresponding to “故” (gù), regardless of their proximity. See the diagram below.

714

Solidification. When the network finally converges and stops iterating, the parameters of each layer are relatively fixed, which yields the hidden layer weight table (equivalent to obtaining the desired embedding). At convergence, the vectors are no longer random assignments; they encode meaningful relationships based on the contexts in which words occur. After repeating this process a sufficient number of times, the hidden structure in your corpus is revealed: a word relationship graph built from scratch.
Then, by looking up a table, you can view the embedding of each element individually. The embedding process is also a table lookup process because there is a one-to-one mapping relationship between them. Each token has a corresponding high-dimensional representation, and this mapping relationship is constantly updated during model training.
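The table lookup described above can be sketched in a few lines. This is only a minimal illustration: the toy vocabulary, the 8-dimensional size (instead of 512), and the helper name `embed` are assumptions made for the example, not part of any real model.

```python
import random

random.seed(0)

# Toy vocabulary (hypothetical); a real model has thousands of entries.
vocab = {"新": 0, "年": 1, "大": 2, "吉": 3, "故": 4}
embedding_dim = 8  # kept small here; the article's example uses 512

# Embedding matrix of shape [vocab size, embedding dimension], randomly initialized.
embedding_matrix = [
    [random.gauss(0.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(len(vocab))
]

def embed(text):
    """Map each character to its row in the embedding matrix (a table lookup)."""
    return [embedding_matrix[vocab[ch]] for ch in text]

vectors = embed("新年大吉")
print(len(vectors), "x", len(vectors[0]))  # 4 x 8
```

Each character simply selects one row, which is why the embedding matrix can be read as a lookup table.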

The entire process is shown in the diagram below.

715
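As a quick check of the earlier claim that randomly generated high-dimensional vectors are approximately orthogonal, the sketch below measures the largest absolute cosine similarity among a few random Gaussian vectors at several dimensions. The helper names, the number of vectors, and the seed are assumptions chosen for illustration.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def max_abs_cosine(dim, count=5, seed=0):
    """Largest |cosine| over all pairs of `count` random Gaussian vectors."""
    rng = random.Random(seed)
    vecs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]
    return max(
        abs(cosine(vecs[i], vecs[j]))
        for i in range(count)
        for j in range(i + 1, count)
    )

for dim in (2, 32, 512):
    print(dim, round(max_abs_cosine(dim), 3))
```

At 512 dimensions the cosine similarities should stay close to zero, which is what "approximately orthogonal" means in practice.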

0x03 Text Embedding

Text embedding transforms text into a set of fixed-dimensional vector representations. While word embedding uses tokens as the basic unit, text embedding uses text as the basic unit. Ideally, text embedding should preserve as much semantic information as possible; texts with the same meaning but different expressions should be mapped to the same location, while texts with different meanings should maintain corresponding distances in the vector space. In practice, text embedding and word embedding may very well use the same model.

3.1 History

Text embedding models aim to encode the semantic content of natural language text into vector representations, thereby facilitating various natural language processing (NLP) tasks such as semantic text similarity, information retrieval, and clustering. The paper “When Text Embedding Meets Large Language Model: A Comprehensive Survey” provides a history of text embedding.

The statistical machine learning era: In the early stages, researchers primarily relied on manually designed features to represent text. The development of this approach was accompanied by changes in the form of the output vectors, from bag-of-words models (one-hot vectors) to TF-IDF (sparse vectors), and then to word embeddings (dense vectors). These methods required domain experts to manually select and design features, and the quality and quantity of these features limited their effectiveness. With advances in machine learning, researchers began exploring statistical methods for learning text embeddings, such as Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). While these methods can learn text embeddings automatically, they still have limitations in capturing complex semantic and syntactic structures.
The deep learning era: With the emergence of deep learning, word embeddings became a major breakthrough in text embedding learning. Models such as Word2Vec, GloVe, and FastText map words to a low-dimensional continuous vector space to capture the semantic and syntactic relationships between words. Later, pre-trained language models (PLMs) with millions of parameters, such as BERT and RoBERTa, were proposed. The “pre-training then fine-tuning” paradigm performs well in various downstream tasks and plays an important role in practical applications. However, the embedding spaces of these PLMs have been shown to be anisotropic, so the computed similarity of any two texts is surprisingly high.
The LLM era: At first, the representations of LLMs were used directly as embeddings for tasks such as retrieval, clustering, and STS, but the results were not very good, so embedding models had to be distinguished from general-purpose large models. Adapting LLMs specifically to generate embeddings is a more recent trend. The survey defines language models with more than 1B (one billion) parameters as LLMs, thus excluding pre-trained language models based on Transformer encoders (such as BERT and RoBERTa) and neural networks with typically small parameter counts (such as LSTM and GRU). The LLMs for generating embeddings presented in the paper are shown in the figure below.

716

Major model service providers and open-source communities have released numerous embedding models for users. An embedding model is a specially trained neural network used to vectorize data. Because such models are optimized with matrix algorithms, they are more efficient and effective than traditional vectorization methods.

Next, let’s look at some typical cases of the evolution of embedding models and embedding.

3.2 Word2Vec

One readily conceivable approach is to correlate different dimensions of word meaning with different dimensions of vectors. For example, a comprehensive breakdown of word meaning dimensions could be implemented: noun form, verb form, adjective form, quantitative features, person, active/passive voice, emotional connotation, emotional intensity, spatial context (top/bottom, front/back, inside/outside), and color features. With a sufficiently large number of dimensions, all the information contained in the word meaning can be encompassed. Once we define each dimension, we can provide the numerical value for each word in the corresponding dimension, thus completing the word vectorization and perfectly conforming to the two properties mentioned above. However, this seemingly feasible design is not practically achievable.

First, an extremely high dimensionality is required to encompass all the different dimensions of word meaning, and such a fine segmentation of word meaning is very difficult. Second, even with an extremely high dimensionality, assigning effective numerical values to the different dimensions of meaning for each word would likely be a challenge even for the most experienced linguists.

Since the pure construction method was not feasible, people came up with a method that could work wonders: Word2Vec, proposed by Google in 2013, which trains the model to learn how to associate things.

Ideas

The general idea behind Word2Vec is that the meaning of a word can be defined by its context. Therefore, we can understand the meaning of a word through its context and use that context to represent the characteristics of the word. In other words, words with similar contexts tend to have similar meanings. This idea is the “Distributional Hypothesis” proposed by linguist Zellig Harris in 1954.

For example, we may not be familiar with the word “Puma,” but below are some descriptive sentences about the word.

They are unpatterned and have relatively small skulls. Males are larger than females. Males have a head-body length of 1.02-1.54 meters, while females have a head-body length of 0.86-1.31 meters.
The body is well-proportioned, with medium-length limbs and a digitigrade gait. Vision, hearing, and smell are all highly developed. Canines and carnassial teeth are extremely well-developed.
They typically inhabit mountain valleys and forests, are adept at swimming and climbing trees, especially enjoying activities in the trees, and are also excellent runners, capable of running at 53-64 kilometers per hour. They have an exceptional jumping ability, able to leap from trees or cliffs 12-13 meters high, covering a distance of 8-9 meters with a single leap, and can also jump over heights of 3 to 6 meters or distances of 5 to 13 meters. Therefore, for prey within 20 meters, they can catch it with just two powerful leaps.
It inhabits a variety of habitats, including forests, jungles, hills, grasslands, semi-deserts, and mountains, such as mountain coniferous forests, lowland tropical forests, scrublands, swamps, and any area with sufficient cover and prey.
In the La Plata, they mainly hunt elk, ostriches, squirrels, and other small four-legged animals; in Chile, they hunt ponies and cattle.

From these descriptions, we can reasonably assume that “Puma” is an animal similar to a leopard. If we replace “Puma” with “leopard” in these descriptions, we find that these are common descriptions of leopards. In other words, because “Puma” and “leopard” share many similar contexts, we consider them to be similar in nature.

Starting from this point, instead of counting the number of times a word appears in a large document, we count the frequency of which words appear in its context (generally, we take the current word as the center and call the N words before and after it its context). With the definition of context, we construct a word-context matrix and then obtain the vectors corresponding to the words.
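A word-context matrix of this kind can be sketched as follows. The two-sentence corpus and the function name `cooccurrence` are made up for illustration; real systems count over very large corpora and usually post-process the counts.

```python
from collections import Counter, defaultdict

def cooccurrence(sentences, window=2):
    """Count how often each word appears within `window` positions of another word."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

# Hypothetical toy corpus: "puma" and "leopard" occur in the same contexts.
corpus = [
    "the puma hunts deer".split(),
    "the leopard hunts deer".split(),
]
counts = cooccurrence(corpus, window=2)
print(dict(counts["puma"]))
print(dict(counts["leopard"]))
```

The rows for "puma" and "leopard" come out identical because the two words share exactly the same contexts in this toy corpus, which is the intuition behind treating them as similar.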

The first Word2Vec paper is “Efficient Estimation of Word Representations in Vector Space.” The word “Representations” in the title recalls representation theory, a branch of abstract algebra in mathematics. Representation theory “represents” the elements of abstract algebraic structures as linear transformations on a vector space and studies modules over these algebraic structures to investigate their properties. Its beauty lies in transforming abstract algebraic problems into linear algebra problems that are easier to solve. For example, the “Additive Compositionality” section of the follow-up Word2Vec paper, “Distributed Representations of Words and Phrases and their Compositionality,” discusses how the learned word and phrase representations exhibit a linear structure, which makes accurate analogical reasoning with simple vector arithmetic possible. Furthermore, simple element-wise addition of vector representations can meaningfully combine words. For instance, the vector offset between “apple” and “apples” is roughly the same as the offset between “car” and “cars,” so “apples” can be approximated as “apple + cars - car.” In effect, this constructs a mapping from a set with a binary operation into a linear space, so that operations in the linear space correspond to the original operation.
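The offset idea (apples - apple ≈ cars - car) can be illustrated with hand-made vectors. The three dimensions and all values below are hypothetical and chosen only to make the arithmetic visible; trained Word2Vec vectors are learned and satisfy such relations only approximately.

```python
import math

# Hypothetical 3-d vectors: [fruit-ness, vehicle-ness, plural-ness].
vectors = {
    "apple":  [1.0, 0.0, 0.0],
    "apples": [1.0, 0.0, 1.0],
    "car":    [0.0, 1.0, 0.0],
    "cars":   [0.0, 1.0, 1.0],
    "banana": [0.9, 0.0, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(a, b, c):
    """Return the word closest to a + b - c, excluding the three query words."""
    target = [x + y - z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("apple", "cars", "car"))  # apples
```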

Architecture

Word2Vec has two main architectures: CBOW (Continuous Bag of Words) and Skip-Gram.

The CBOW architecture predicts the target word from its context. In other words, the context of a center word is given as input, and the center word is obtained as the result.
Skip-Gram does the opposite: it takes a center word as input and predicts the words in its context.

The model structures of both are shown in the figure below. Taking CBOW as an example, during the initialization of the input layer, an n-dimensional vector is randomly generated for each word, and this n-dimensional vector is used as the model parameter for learning, ultimately obtaining the word vector. The process of generating word vectors is a parameter update process.

717

Let’s analyze the context, or windowing mechanism, more closely. Unsupervised training of language models uses a windowing mechanism: a word is predicted from its n nearest neighbors, which form the word’s direct neighborhood, its social circle within the sentence. For example, in the sentence “我要去后海玩” (I want to go to Houhai), with a window of size 2 the context of “去” is the four characters [我, 要, 后, 海]. CBOW uses [我, 要, 后, 海] to predict the character “去”. In other words, the context of a center word is the input, and the center word is the output.
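The windowing step for both architectures can be sketched as follows. The function names are assumptions for illustration; the sketch only builds training examples and does not train a model.

```python
def context_window(tokens, index, window):
    """Tokens within `window` positions to the left and right of `index`."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """CBOW: (context tokens) -> center token."""
    return [(context_window(tokens, i, window), tok) for i, tok in enumerate(tokens)]

def skipgram_examples(tokens, window=2):
    """Skip-Gram: center token -> one context token per example."""
    return [
        (tok, ctx)
        for i, tok in enumerate(tokens)
        for ctx in context_window(tokens, i, window)
    ]

tokens = list("我要去后海玩")
print(cbow_examples(tokens)[2])  # (['我', '要', '后', '海'], '去')
print([pair for pair in skipgram_examples(tokens) if pair[0] == "去"])
```

CBOW produces one example per center character, while Skip-Gram produces one example per (center, context) pair.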

During training, words within the same window receive similar updates, and words with similar patterns accumulate these similar updates to a considerable extent. “Similar patterns” means that the words are interchangeable in a specific language task. For example, in the sentence “I especially like to eat apples,” if “apples” is replaced with “bananas,” the sentence still holds. Therefore, “apple” and “banana” should have similar word vectors in that task. To put it another way, the characters of “麒麟” (qilin) are almost always used together; updating “麒” almost always updates “麟” as well, so their updates are nearly identical and the word vectors of “麒” and “麟” end up almost the same.
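One common way to realize such updates is skip-gram with negative sampling, where a center vector is pulled toward a word from its window and pushed away from a randomly sampled word. The sketch below shows a single update rule under that assumption; the vectors, learning rate, and function names are hypothetical and not taken from the article.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sgns_step(center, context, negatives, lr=0.1):
    """One negative-sampling update; all vectors are modified in place."""
    grad_center = [0.0] * len(center)
    for vec, label in [(context, 1.0)] + [(neg, 0.0) for neg in negatives]:
        g = sigmoid(dot(center, vec)) - label  # gradient of the loss w.r.t. the score
        for i in range(len(center)):
            grad_center[i] += g * vec[i]       # accumulate using the old value of vec
            vec[i] -= lr * g * center[i]       # update the context or negative vector
    for i in range(len(center)):
        center[i] -= lr * grad_center[i]

# Hypothetical starting vectors.
center = [0.1, 0.2, -0.1]
context = [0.0, 0.1, 0.2]
negative = [0.2, -0.1, 0.1]

pos_before, neg_before = dot(center, context), dot(center, negative)
for _ in range(100):
    sgns_step(center, context, [negative])
pos_after, neg_after = dot(center, context), dot(center, negative)

print(pos_before, "->", pos_after)  # score with the window word rises
print(neg_before, "->", neg_after)  # score with the random word falls
```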

Essentially, after Word Embedding, the row containing a word in the word-context matrix becomes a vector representation of that word. The words are then combined into a set of vectors in a relatively low-dimensional space, and the proximity of these vectors is determined by their semantic relationships. These representations have the following characteristics:

Embedding a word into a high-dimensional Euclidean space makes it a vector in the Euclidean space.
Similar texts are adjacent in this Euclidean space.

Problem

Regardless of the architecture used, the core of Word2Vec is a “static” word embedding: each word is mapped to a fixed, pre-trained vector. This means that the vector representation of a word is the same regardless of the context in which it appears.

While the CBOW model does consider local word context through a sliding window during training, Word2Vec only exploits the shallow structure of the “Distributional Hypothesis” and does not attempt to understand the semantics within the sentence. Therefore, once the model is trained, the embedding of each word is fixed and does not change with the context. Consequently, this embedding is considered static and lacks contextual information.

For language models, a major challenge is understanding how a word can be used in different contexts. In context, isolated words are mostly meaningless; we need to draw upon shared knowledge and experience to understand them. Therefore, an effective natural language model needs to be able to understand the possible meanings of each word, phrase, sentence, or paragraph in different contexts.

In this respect, fixed or static word embedding has a major weakness: words with multiple definitions. For example, the word “故” has different meanings depending on the context in each sentence.

The duke asked him the reason. (Here “故” means “reason” or “cause.”)
Reviewing old knowledge helps us learn new things. (Here “故” means “old.”)

Static embeddings are essentially read from a lookup table; regardless of the context, the word “故” has a fixed vector corresponding to it. They only encode the meaning of that specific word, without contextual reference, which limits its ability to understand and distinguish different meanings. That is, although Word2Vec considers local context during training, once training is complete, the embedding of each word is fixed and no longer changes.

718

Because static word vector representations cannot solve the problem of polysemy, dynamic word vectors were proposed. In dynamic word vector representation, the model is no longer a simple lookup table of vectors; instead, it is calculated on the fly by a pre-trained model. When representing words as vectors, the model understands polysemous words based on the current context, infers the meaning of each word, and thus obtains the word vector for the text. If the context changes, the calculated vector will change. Compared to static word vectors, dynamic word vectors make fuller use of contextual information, and therefore can solve the problem of polysemy.
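The difference between a lookup table and a vector computed from context can be shown with a toy example. The sketch below mixes each character with its sentence using softmax-weighted dot products, with no learned parameters; it is only a stand-in for a real contextual model. The two phrases containing “故”, the random vectors, and the function names are assumptions chosen for illustration.

```python
import math
import random

rng = random.Random(0)
DIM = 4
# Static table: one fixed (here random) vector per character.
table = {ch: [rng.gauss(0.0, 1.0) for _ in range(DIM)] for ch in "公问其故温而知新"}

def static_embed(sentence):
    """Pure table lookup: the result ignores the surrounding characters."""
    return [table[ch] for ch in sentence]

def contextual_embed(sentence):
    """Mix each character with all characters in the sentence via softmax weights."""
    vecs = static_embed(sentence)
    out = []
    for query in vecs:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(DIM) for key in vecs]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * v[d] for w, v in zip(weights, vecs))
            for d in range(DIM)
        ])
    return out

s1, s2 = "公问其故", "温故而知新"
i1, i2 = s1.index("故"), s2.index("故")

same_static = static_embed(s1)[i1] == static_embed(s2)[i2]
c1, c2 = contextual_embed(s1)[i1], contextual_embed(s2)[i2]
max_diff = max(abs(a - b) for a, b in zip(c1, c2))
print(same_static, max_diff > 0)
```

The static vector for “故” is identical in both sentences, while the context-mixed vector differs because the neighbors differ.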

3.3 ELMo

To address the fact that static word embeddings cannot handle polysemous words, ELMo (Embeddings from Language Models) proposes a simple and effective solution.

Ideas

Instead of using fixed embeddings for each word, ELMo examines the entire sentence before assigning embeddings to each word. The core idea of ELMo is to dynamically adjust the word embedding based on the current context. This is reflected in the word “contextualized” in the title of the ELMo paper, “Deep contextualized word representations.”

ELMo first learns a word embedding using a language model, at which point the word embedding already contains some semantic information. When the word embedding is actually used, ELMo then adjusts the word embedding representation based on the word’s context. This adjusted word embedding better expresses the specific meaning within that context, thus naturally resolving the problem of polysemous words.

The ELMo model’s bidirectional language model (bidirectional two-layer LSTM) consists of a forward language model and a backward language model. The forward language model encodes the text from left to right, while the backward language model encodes it from right to left. These two pieces of information are then integrated to obtain the contextual information of each word. This bidirectional design allows ELMo to see not only the words preceding but also the words following it, thus simultaneously capturing information about words within their surrounding context and generating richer word vector representations.

In practical models, such forward-backward LSTM networks may have multiple layers. For each layer, the forward and backward LSTM outputs of each token are concatenated to obtain the representation of that token at that layer.

719

Training

Before ELMo, the embedding model was often trained separately. After ELMo, it became popular to train the embedding layer together with the language model layers above it (ELMo stands for Embeddings from Language Models).

ELMo uses a typical two-stage process for pre-training.

The first stage pre-trains a language model. Character-level features are extracted and serve as the initial word embeddings for the corresponding words. During pre-training, ELMo learns not only word embeddings but also a two-layer bidirectional LSTM network (producing the forward and backward representations of each word).
In the second stage, when performing downstream tasks, ELMo combines the hidden states of the multi-layer bidirectional LSTM and the character-based embeddings (by concatenating the two directions and then taking a weighted sum over layers), and obtains the final context-dependent word vectors through a linear transformation (such as scaling the weighted sum with a task-related factor). These vectors serve as the input to the downstream task model. This operation integrates each word into the semantics of its context, so the vector expresses the word’s actual meaning more accurately.
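The second-stage combination can be sketched as a softmax-weighted sum over layers followed by a task-specific scale. The function names, the toy layer values, and the parameter names `raw_weights` and `gamma` are assumptions for illustration; in a real model the weights and the scale are learned with the downstream task.

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def concat_directions(forward, backward):
    """Per-layer token representation: forward and backward outputs concatenated."""
    return forward + backward

def elmo_combine(layers, raw_weights, gamma=1.0):
    """Scale a softmax-weighted sum of the per-layer vectors of one token."""
    weights = softmax(raw_weights)
    dim = len(layers[0])
    return [
        gamma * sum(w * layer[d] for w, layer in zip(weights, layers))
        for d in range(dim)
    ]

# Hypothetical representations of one token at three layers.
layers = [
    concat_directions([1.0, 1.0], [1.0, 1.0]),
    concat_directions([2.0, 0.0], [2.0, 0.0]),
    concat_directions([3.0, 3.0], [0.0, 0.0]),
]
combined = elmo_combine(layers, raw_weights=[0.0, 0.0, 0.0], gamma=1.0)
print(combined)
```

With equal raw weights the result is simply the average of the layers; a downstream task can shift the weights toward whichever layers are most useful to it.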

The training phase is shown in the following diagram.

720

There are currently two ways to apply pre-trained representations to downstream tasks:

Feature-based: The trained representations are used as features for downstream tasks, including word vectors, sentence vectors, paragraph vectors, and text vectors. The more recent ELMo also belongs to this category, but the input representations need to be recalculated after the transfer.
Fine-tuning: This approach mainly draws inspiration from computer vision (CV): task-specific layers are added to a pre-trained model, and the later layers are then fine-tuned. Newer models such as ULMFiT and OpenAI GPT fall into this category.

3.4 BERT

BERT (Bidirectional Encoder Representations from Transformers) uses the Transformer architecture, and in particular, it captures the relationship between each word and all other words in a sentence through its self-attention mechanism. This means that the embeddings generated by BERT for a word depend not only on the word itself, but also on its context.

Motivation

ELMo simply concatenates the forward and backward LSTM outputs, combining the two directions without modeling the interaction between the left and right contexts. GPT, on the other hand, uses a left-to-right decoder, where each word only receives contextual information from its preceding words. These structures are suboptimal for tasks that require bidirectional contextual information (such as NER and sentiment analysis). Therefore, BERT aims to truly integrate bidirectional contextual information.

Ideas

During the inference phase of the model (i.e., when using the model to make predictions), BERT dynamically adjusts the embedding of each word based on the context of the current input, allowing it to generate different embedding representations for the same word in different sentences. Moreover, a key feature of BERT is its bidirectional nature; the model considers the entire context (i.e., all words to its left and right) when processing any word. This enables BERT to more accurately capture the complexity and nuances of language.

The figure below illustrates the differences in pre-trained model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo concatenates independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Of the three, only BERT’s representations are jointly conditioned on both left and right context in all layers. Besides architectural differences, BERT and OpenAI GPT are fine-tuning approaches, while ELMo is a feature-based approach.

Compared to OpenAI GPT (Generative pre-trained transformer), BERT uses bidirectional Transformer block connections; just like the difference between a unidirectional RNN and a bidirectional RNN, intuitively, it would perform better.

Compared to ELMo, although both are “bidirectional,” their objective functions are actually different. ELMo trains two representations independently using $P(w_i \mid w_1, \dots, w_{i-1})$ and $P(w_i \mid w_{i+1}, \dots, w_n)$ as objective functions and then concatenates them. BERT, on the other hand, trains the LM using $P(w_i \mid w_1, \dots, w_{i-1}, w_{i+1}, \dots, w_n)$ as the objective function.

721

BERT proposes two pre-training tasks: 1) Masked Language Modeling (MLM), which masks a portion of the tokens in the text and then predicts the masked tokens from contextual information; 2) Next Sentence Prediction (NSP), which predicts whether the second of two sentences follows the first. Similar to GPT, BERT also uses special tokens: [CLS] is added at the beginning of each input, and [SEP] marks sentence boundaries and the end of the input.
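The masking step of MLM can be sketched as follows. This is a simplified illustration: the function name and the seed are assumptions, every selected token is replaced by [MASK], and the special tokens are never selected. The BERT paper selects about 15% of tokens and replaces only 80% of those with [MASK] (10% get a random token, 10% stay unchanged), which this sketch omits.

```python
import random

SPECIAL = {"[CLS]", "[SEP]"}

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked tokens, labels); labels hold the original token at masked positions."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if tok not in SPECIAL and rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)   # the model must predict this token
        else:
            masked.append(tok)
            labels.append(None)  # position is not scored
    return masked, labels

tokens = ["[CLS]", "my", "name", "is", "xx", "[SEP]", "how", "are", "you", "[SEP]"]
# A high mask_prob is used here only so that the effect is visible on a short input.
masked, labels = mask_tokens(tokens, mask_prob=0.5, seed=1)
print(masked)
print(labels)
```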

Embedding

The input encoding vector of BERT is the element-wise sum of three embedding features, as shown in the figure below. These three embedding features are:

Token embeddings are word vectors. The first token is always [CLS], whose output can be used for subsequent classification tasks.
Segment embeddings are used to distinguish between two sentences, because pre-training involves not only language modeling but also a classification task that takes two sentences as input, for example determining whether B is the sentence that follows A (as in dialogue or question-answering scenarios). For a sentence pair, the feature value of the first sentence is 0 and that of the second sentence is 1. For example, for “[CLS] My name is xx [SEP] How are you [SEP]”, the corresponding segment IDs are 0 0 0 0 0 0 1 1 1 1: [CLS], the four tokens of the first sentence, and the first [SEP] get 0, while the three tokens of the second sentence and the final [SEP] get 1.
Position embeddings encode the positional information of words into feature vectors, incorporating the positional relationships of words into the model and improving its ability to understand sentences. Note that BERT’s position embeddings are learned parameters, not fixed trigonometric functions.
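For the example sentence pair above, the segment and position IDs can be built as follows. The helper name `build_inputs` is hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
def build_inputs(sentence_a, sentence_b):
    """Assemble tokens, segment IDs, and position IDs for a sentence pair."""
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]
    # [CLS], sentence A, and the first [SEP] belong to segment 0.
    segment_ids = [0] * (len(sentence_a) + 2)
    # Sentence B and the final [SEP] belong to segment 1.
    segment_ids += [1] * (len(sentence_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

tokens, segment_ids, position_ids = build_inputs(
    "My name is xx".split(), "How are you".split()
)
print(tokens)
print(segment_ids)   # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
print(position_ids)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```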

722

Why can BERT’s three embeddings be added together? Here are some insightful explanations.

These three embeddings correspond to three different frequencies. From a signal processing perspective, superposition or addition is a fairly standard operation.
Mathematically, an embedding is a single fully connected layer applied to a one-hot input. Therefore, one-hot encoding the three inputs, concatenating them, and passing the result through one fully connected layer is equivalent to adding the three embeddings directly.
The addition of the three embeddings involves linearly mapping the symbol space, symbol attribute space, and position space to a unified, homogeneous feature space, and then summing the coordinates.
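The one-hot argument above can be checked numerically. In the sketch below, the sizes, the random matrices, and the function names are assumptions; the point is only that multiplying the concatenated one-hot vector by the stacked embedding matrices gives the same result as summing three lookups.

```python
import math
import random

rng = random.Random(0)
HIDDEN = 4
VOCAB, SEGMENTS, POSITIONS = 6, 2, 5

def random_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

word_emb = random_matrix(VOCAB, HIDDEN)
seg_emb = random_matrix(SEGMENTS, HIDDEN)
pos_emb = random_matrix(POSITIONS, HIDDEN)

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def sum_of_lookups(token_id, segment_id, position_id):
    """Three table lookups added element-wise."""
    rows = zip(word_emb[token_id], seg_emb[segment_id], pos_emb[position_id])
    return [w + s + p for w, s, p in rows]

def concat_then_linear(token_id, segment_id, position_id):
    """Concatenated one-hot vector times the stacked embedding matrices."""
    x = one_hot(token_id, VOCAB) + one_hot(segment_id, SEGMENTS) + one_hot(position_id, POSITIONS)
    weight = word_emb + seg_emb + pos_emb  # (VOCAB + SEGMENTS + POSITIONS) rows
    return [sum(x[r] * weight[r][c] for r in range(len(x))) for c in range(HIDDEN)]

a = sum_of_lookups(3, 1, 2)
b = concat_then_linear(3, 1, 2)
print(all(math.isclose(u, v, abs_tol=1e-12) for u, v in zip(a, b)))  # True
```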

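Code

The specific code is as follows. This part builds the three embedding layers and, in the forward pass, sums their outputs and applies layer normalization and dropout. The weight initialization (by truncated normal distribution) is performed elsewhere and is not shown in this excerpt.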

from typing import Optional

import torch
from torch import nn


class BertEmbeddings(nn.Module):
    """Construct the embeddings from word, position and token_type embeddings."""

    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
        # any TensorFlow checkpoint file
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        # position_ids (1, len position emb) is contiguous in memory and exported when serialized
        self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
        self.register_buffer(
            "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False
        )
        self.register_buffer(
            "token_type_ids", torch.zeros(self.position_ids.size(), dtype=torch.long), persistent=False
        )

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        token_type_ids: Optional[torch.LongTensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        past_key_values_length: int = 0,
    ) -> torch.Tensor:
        if input_ids is not None:
            input_shape = input_ids.size()
        else:
            input_shape = inputs_embeds.size()[:-1]

        seq_length = input_shape[1]

        if position_ids is None:
            position_ids = self.position_ids[:, past_key_values_length : seq_length + past_key_values_length]

        if token_type_ids is None:
            if hasattr(self, "token_type_ids"):
                buffered_token_type_ids = self.token_type_ids[:, :seq_length]
                buffered_token_type_ids_expanded = buffered_token_type_ids.expand(input_shape[0], seq_length)
                token_type_ids = buffered_token_type_ids_expanded
            else:
                token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device)

        if inputs_embeds is None:
            inputs_embeds = self.word_embeddings(input_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)

        embeddings = inputs_embeds + token_type_embeddings
        if self.position_embedding_type == "absolute":
            position_embeddings = self.position_embeddings(position_ids)
            embeddings += position_embeddings
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

3.5 BGE

In the Chinese-speaking world, the BGE (BAAI General Embedding) model proposed by the Beijing Academy of Artificial Intelligence (BAAI) in its paper “C-Pack: Packaged Resources To Advance General Chinese Embedding” is a well-known open-source embedding model. BGE aims to be a universal embedding model for the Chinese-speaking world, which means it needs to support all embedding use cases, including but not limited to retrieval, re-ranking, clustering, classification, and pair-classification. Let’s take a look at the efforts made by the paper’s authors.

Dataset

The paper’s authors designed the largest dataset, C-MTP, for training general Chinese embeddings, taking into account scale, diversity, and quality of text embedding, ensuring the universality of text embeddings, which is a prerequisite for training a general embedding model.

C-MTP collects data from two sources. The majority comes from curating massive amounts of unlabeled data, namely C-MTP (Unlabeled), which contains 100 million text pairs. A smaller portion comes from a comprehensive integration of high-quality labeled data, namely C-MTP (Labeled), which includes approximately 1 million text pairs.

723

Training

The paper employs a three-stage training strategy, from pre-training, to general-purpose fine-tuning, to task-specific fine-tuning; the first two stages form the foundation for ensuring generality, while the last stage further refines the performance of downstream tasks while maintaining generality.

Pre-training stage: Based on the Wudao corpus. The model is not trained on any text-pair data at this stage. The goal is to obtain a pre-trained model that is better suited to the embedding task. The core technique here is the RetroMAE training strategy.
General-purpose fine-tuning stage: Based on C-MTP (Unlabeled). Training on 100M text pairs can be viewed as large-scale weakly supervised learning. The goal is to obtain an initial general embedding model. The core technique in this stage is contrastive learning with in-batch negative sampling and a large batch size. Retrieval tasks also receive particular optimization at this stage.
Task-specific fine-tuning stage: Based on C-MTP (Labeled). The model is fine-tuned on a small, high-quality set of labeled data from downstream tasks, which improves performance on specific tasks while preserving generality. The challenge in this stage is managing multi-task learning when tasks differ from one another. The paper addresses it with two key techniques: instruction-based fine-tuning and hard negative sampling.
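The in-batch negative sampling used in the second stage can be illustrated with a minimal sketch. It is written in plain Python so that it runs without dependencies. The function names, the toy vectors, and the temperature value are assumptions made for illustration and are not taken from the C-Pack paper. For each query, the paired passage is the positive and every other passage in the batch acts as a negative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def in_batch_contrastive_loss(queries, passages, temperature=0.05):
    """InfoNCE-style loss: passage i is the positive for query i,
    all other passages in the batch are treated as negatives.
    The temperature value is an illustrative assumption."""
    q = [normalize(v) for v in queries]
    p = [normalize(v) for v in passages]
    losses = []
    for i, qi in enumerate(q):
        logits = [dot(qi, pj) / temperature for pj in p]
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])  # cross-entropy with target index i
    return sum(losses) / len(losses)

# Toy batch: each query points roughly in the direction of its own passage.
queries = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled = [aligned[1], aligned[2], aligned[0]]  # deliberately mismatched pairs

loss_aligned = in_batch_contrastive_loss(queries, aligned)
loss_shuffled = in_batch_contrastive_loss(queries, shuffled)
print(loss_aligned < loss_shuffled)
```

A larger batch supplies more negatives per query at no extra encoding cost, which is one reason this setup is usually paired with a large batch size.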

724

RetroMAE

The goal of the pre-training phase is to learn a pre-trained model that is more suitable for embedding.

Currently, most mainstream language model pre-training tasks are token-level, such as MLM or Seq2Seq. Such training struggles to produce high-quality sentence-level vectors, which limits the potential of language models in retrieval tasks. Two pre-training strategies for retrieval models address this drawback. The first is self-contrastive learning, which is often limited by the quality of data augmentation and requires a very large number of negative samples. The second is auto-encoding, a self-reconstruction method that is unaffected by data augmentation or negative sampling strategies. Two factors determine the performance of auto-encoding methods: first, the reconstruction task must place sufficient demands on encoding quality; second, the training data must be fully utilized.

The authors of the paper proposed RetroMAE, which consists of two modules:

First, the input is randomly masked and then encoded by a BERT-like encoder.
Then, a single-layer Transformer decoder reconstructs the text. Because the decoder is so weak, this process forces the encoder to learn good embeddings.

Encoding

Given a sentence input X, a small portion of the tokens is randomly masked to obtain X_enc. A moderate mask ratio (15%–30%) is typically used here to preserve most of the original sentence information. Then, an encoder Φ_enc(·) is applied to X_enc to obtain the corresponding sentence vector h. Because a BERT-like encoder is used, the last-layer hidden state at the [CLS] position serves as the sentence vector.

Decoding

Given a sentence input X, a portion of the tokens is randomly masked to obtain X_dec. Here, a more aggressive masking ratio (50% to 70%) than that used in the encoder is usually adopted. The text is then reconstructed using the masked text and the sentence vector generated by the encoder.
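The asymmetric masking described above can be sketched as follows. This is a minimal illustration, not RetroMAE's actual preprocessing: the `random_mask` helper, the example sentence, and the concrete ratios 0.2 and 0.6 (picked from within the ranges mentioned above) are assumptions.

```python
import random

MASK = "[MASK]"

def random_mask(tokens, ratio, rng):
    """Replace round(ratio * len(tokens)) randomly chosen tokens with [MASK]."""
    n_mask = round(ratio * len(tokens))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
tokens = "the quick brown fox jumps over the lazy dog today".split()  # 10 tokens

x_enc = random_mask(tokens, 0.2, rng)  # light masking for the encoder input
x_dec = random_mask(tokens, 0.6, rng)  # aggressive masking for the decoder input

print(x_enc)
print(x_dec)
```

Because the decoder sees far fewer tokens than the encoder, it has to rely on the sentence vector h to recover the text, which is what puts pressure on the encoder.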

Enhanced Decoding

The aforementioned decoding strategy has a drawback: the training signal only comes from the masked tokens, and each masked token is reconstructed based on the same context. Therefore, researchers proposed a new decoding method, Enhanced Decoding, as follows.

First, two different input streams are generated: H1 (query) and H2 (context).
Then, a new output A is obtained through the attention mechanism. The set of tokens that the i-th token can see is determined by sampling, with two constraints: a token can never see itself, and it can always see the first position, which carries the [CLS] sentence vector produced by the encoder.
Finally, A and H1 are used to reconstruct the original text. The reconstruction target here covers all tokens, not only the masked ones.

The final RetroMAE loss is obtained by adding two parts: the MLM loss of the encoder part and the cross-entropy loss of the self-reconstruction of the decoder part.

725

3.6 LLM-As-Embedding

Next, let’s look at generating embeddings with LLMs. The following figure gives an overview of common models.

726

Backbone Selection

For many years, the dominant paradigm for building text embedding models relied on pre-trained bidirectional encoder or encoder-decoder models, such as BERT and T5. Decoder-only models were generally considered unsuitable for embedding extraction: encoder-based models capture semantics through bidirectional attention, whereas decoder-only models typically use causal attention, so each token interacts only with the tokens before it and cannot draw on the full context. Only recently has the community begun to adopt decoder-only LLMs for generating text embeddings. As shown in the figure above, in terms of backbone selection, T5 is the main choice for the encoder-decoder architecture, while Mistral and LLaMA are the main choices for the decoder-only architecture.

Architecture improvements

The internal representation obtained by a deep learning model after processing the input data contains semantic and compressed data information. Therefore, we usually extract the hidden state of the last layer of the neural network as the embedding.

727

While large decoder-only models perform well on many NLP tasks, directly using them to generate text representations often yields poor results. This is related to the training task and architecture of the large model itself. Therefore, architectural improvements are necessary.

Pooling strategy

Pooling refers to the process of turning the last-layer hidden states of an LLM into a single representation through an additional mapper (different from the original linear output head of the LLM). There are five main strategies.

First Pooling: Uses the first token as the representation, for example the [CLS] token for BERT and the [START] token for T5.
(Weighted) Mean Pooling: Uses a (weighted) average over all tokens. BERT and T5 perform well with mean pooling. For GPT-style LLMs, weighted mean pooling that gives greater weight to later tokens performs better.
(Special) Last Pooling: Uses the last token. Because the last token’s representation in an LLM is trained to predict the next token rather than to summarize the text, one of the following two methods is typically used to adapt it:
- Prompts that ask the model to summarize the semantics of the entire text at the last token. A representative work is PromptEOL, with the template: “The sentence [X] means in a word”.
- A special token is introduced and the model is fine-tuned accordingly. This line of work mainly uses the end-of-sentence token as the special token. ChatRetriever appends an embedding sequence [EMB1], … , [EMBt] to the end of the input and uses these positions as a kind of chain of thought, which improves the quality of the representation.
Partial Pooling: Echo proposes the following prompt: Rewrite the sentence: [x], rewritten sentence: [x]. The same text is filled into both [x] positions, so that the tokens in the second [x] position can see every token of the sentence. Only the tokens in the second [x] position are then pooled.
Trainable Pooling: Adds an attention mechanism before regular pooling. NV-Embed uses the last-layer hidden states H of the LLM as Q, with K and V as learnable parameters, to build a latent attention layer that interacts more deeply with H before pooling. LMORT uses alignment and uniformity metrics to select suitable hidden states from different layers and then feeds them into a multi-layer attention network.
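The three simplest strategies above (first, mean or weighted mean, and last pooling) can be sketched in a few lines. The hidden states below are made-up numbers, and the linearly increasing position weights are an assumption that stands in for "greater weight on later tokens"; a real model would supply a (seq_len, hidden_size) tensor.

```python
def first_pooling(hidden):
    """Use the vector of the first token (e.g. [CLS])."""
    return list(hidden[0])

def last_pooling(hidden):
    """Use the vector of the last token."""
    return list(hidden[-1])

def mean_pooling(hidden, weights=None):
    """(Weighted) mean over token vectors; weights default to uniform."""
    if weights is None:
        weights = [1.0] * len(hidden)
    total = sum(weights)
    dim = len(hidden[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden)) / total
            for d in range(dim)]

def position_weights(n):
    """Weights 1, 2, ..., n so that later tokens count more (an assumption)."""
    return [float(i + 1) for i in range(n)]

# Toy last-layer hidden states: 3 tokens, hidden size 2.
hidden = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]

print(first_pooling(hidden))
print(last_pooling(hidden))
print(mean_pooling(hidden))
print(mean_pooling(hidden, position_weights(len(hidden))))
```

With a causal model, the first token has seen nothing but itself, which is why first pooling is mostly used with bidirectional encoders.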

Attention architecture

The causal attention used in decoder-only LLMs ensures that language modeling can only reference the prefix when predicting the next token; however, causal attention can degrade performance on downstream tasks. Researchers have therefore used several methods to adapt the model architecture to the embedding task.

Retain causal attention and compensate with other techniques (e.g., the pooling strategies above).
Convert the model to bidirectional attention and adapt it to the new structure. BeLLM converts the last few layers of an LLM to bidirectional attention. NV-Embed further finds that, with sufficient data, an LLM can be converted entirely to bidirectional attention without additional operations. LLM2Vec proposes the MNTP (masked next token prediction) task, which combines next token prediction with masked language modeling (MLM), to help the model adapt and make better use of the data.

Additional Projector

Projection layers are a commonly used strategy in text embedding, and are specifically divided into the following:

Low-dimensional representation mapping. Mapping a representation to a lower-dimensional representation (e.g., 4096 => 1024).
Sparse representation mapping. Methods such as gating, topk-masking, and regularization are used to transform representations into sparse representations.
Mapping for other purposes. For example, InBedder uses shorter question-answer pairs as training data to predict the next word and obtain a representation.
Parameter-Efficient Fine-Tuning Module: Whether to use parameter-efficient fine-tuning techniques depends primarily on the amount of data used, rather than resource constraints. Some studies have introduced BitFit or LoRA as PEFT modules and fine-tuned them using a single dataset; this small-scale data can stimulate the semantic generalization capabilities inherent in LLMs. Other works use full-parameter tuning and (multi-stage) fine-tuning based on hundreds of datasets, which is sufficient to allow for significant changes to model parameters and avoid overfitting.

Next, we will introduce several solutions for generating embeddings using large models.

3.7 LLM2Vec

Under causal attention, each token can attend only to the tokens before it, which limits how fully the model can capture the meaning of the whole sentence. LLM2Vec aims to solve this problem. LLM2Vec is an unsupervised method that can transform any decoder-only model into a text representation model. It has three parts: conversion to a bidirectional attention mechanism, a masked next token prediction task, and unsupervised contrastive learning. This method can effectively turn large models into general-purpose text encoders without labeled data or task-specific adaptation. Compared with other adaptation methods, LLM2Vec does not lengthen the input, so inference cost is unchanged, and it does not require costly labeled training data.

LLM2Vec includes a modified attention mechanism and two unsupervised training tasks, as detailed below.

First, the unidirectional attention mechanism is changed directly into a bidirectional one, so that each token can access all other tokens in the sequence.
Then, the model is trained using MNTP (Masked Next Token Prediction) to adapt to the bidirectional attention mechanism. At this point, the model already has the ability to generate high-quality token-level representations.
Finally, unsupervised contrastive learning (SimCSE) is used for further training, enabling the model to generate high-quality sentence-level representations.

Although LLM2Vec involves two training tasks, both are unsupervised learning and do not require high-quality labeled data.

Bidirectional attention mechanism

The first step of LLM2Vec is to change the unidirectional attention mechanism of decoder-only LLMs to a bidirectional attention mechanism, allowing each token to see information from other tokens. The reason for this modification is that researchers believe the mechanism in which decoder-only LLMs cannot see future information may impair the quality of text representation. Specifically, the causal attention mask of the decoder LLM is replaced with an all-ones attention mask, enabling each token to access all other tokens in the sequence. The leftmost part of the figure below illustrates how LLM2Vec expands the attention scope of each word to the entire sequence by changing the attention mask. This approach allows each word to consider all other words in its context simultaneously, significantly improving the model’s ability to understand the context.
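The mask change can be made concrete with a small sketch. It builds a causal mask and an all-ones mask for a sequence of three tokens and shows the resulting attention weights when all raw scores are equal. The helper names are hypothetical, and real implementations apply the mask inside each attention layer on batched tensors.

```python
import math

def causal_mask(n):
    """Lower-triangular mask: token i may attend to tokens 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """All-ones mask: every token may attend to every token."""
    return [[1] * n for _ in range(n)]

def attention_weights(scores, mask):
    """Row-wise softmax restricted to positions where mask == 1."""
    weights = []
    for row, mask_row in zip(scores, mask):
        exps = [math.exp(s) if m else 0.0 for s, m in zip(row, mask_row)]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights

n = 3
scores = [[0.0] * n for _ in range(n)]  # equal raw scores, for clarity

causal = attention_weights(scores, causal_mask(n))
bidirectional = attention_weights(scores, bidirectional_mask(n))

print(causal[0])         # the first token only sees itself
print(bidirectional[0])  # the first token sees the whole sequence
```

Under the causal mask, the first token's representation cannot depend on anything that follows it; with the all-ones mask, every position draws on the full sequence.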

Masked next token prediction (MNTP)

After the first step, the model needs to adapt to the new bidirectional attention mechanism. The researchers therefore designed the MNTP task, which is similar to the MLM task used to pre-train BERT. In this step, certain tokens in a sentence are masked, and the model predicts them by looking at all other tokens, including the context both before and after the masked positions. See the middle part of the figure below for details. This step helps the model get used to attending to the preceding and following context at the same time, and thus to better understand and represent the text.
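A minimal sketch of how an MNTP training example might be constructed is given below. One detail comes from the LLM2Vec paper rather than from this article: the masked token at position i is predicted from the output at position i - 1, which keeps the objective aligned with the next-token head of the original LLM. The function name and the data layout are hypothetical.

```python
MASK = "[MASK]"

def build_mntp_example(tokens, masked_positions):
    """Return (inputs, targets) for masked next token prediction.

    `targets` maps the position whose output is used for the prediction
    to the token that must be predicted: a token masked at position i
    is predicted from position i - 1.
    """
    inputs = list(tokens)
    targets = {}
    for i in masked_positions:
        if i == 0:
            raise ValueError("position 0 has no previous position to predict from")
        inputs[i] = MASK
        targets[i - 1] = tokens[i]
    return inputs, targets

inputs, targets = build_mntp_example(["the", "cat", "sat", "down"], [2])
print(inputs)   # ['the', 'cat', '[MASK]', 'down']
print(targets)  # {1: 'sat'}
```

The loss would then be a cross-entropy over the vocabulary at the positions listed in `targets`, computed with bidirectional attention enabled.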

Unsupervised contrastive learning

The first two steps enable the LLM to generate high-quality token-level representations, but not yet high-quality sentence-level representations. The researchers therefore adopted SimCSE’s unsupervised contrastive learning to further train the LLM. Specifically, the hidden states at all positions are mean-pooled into a sentence vector. The model is trained to maximize the similarity between two different representations of the same sequence while minimizing the similarity with the representations of other sequences in the batch. In other words, contrastive learning pulls similar texts closer together and pushes dissimilar texts apart, which lets the model better distinguish between similar and different sentences and improves the quality of sentence representations. See the right side of the figure below for details.

728

The above three steps enable LLM2Vec to transform any large language model into a text understanding and representation tool that is highly practical in various NLP tasks.

3.8 NV Embedding

NV-Embed-v2, as of January 2025, topped the MTEB leaderboard with an average score of 72.31, and its improvement approach is worth learning from. NV-Embed-v2 comes from the paper “NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models”. The paper modifies a pre-trained Mistral-7B model by adding a latent attention layer to improve representation capabilities.

729

The diagram above shows the architecture: a decoder-only LLM followed by a latent attention layer, which is in turn followed by an MLP layer.

Bidirectional attention

The paper does not use the techniques of LLM2Vec or GRIT. Instead, it simply removes the causal attention mask of the decoder-only LLM during contrastive learning, and finds that this works very well.

Latent attention layer

There are two popular methods to obtain the embedding of a series of tokens: i) mean pooling, ii) the last token embedding.

Previous bidirectional embedding models typically used mean pooling, while the last token embedding is more popular in decoder-only LLM embedding models. However, both methods have certain limitations. Mean pooling simply takes the average of the token embeddings, which may dilute important information in the key phrase, while the last token embedding may be affected by recency bias and heavily depends on the output embedding of the last token.

NV-Embed proposes a latent attention layer that enables more expressive sequence pooling for general embedding tasks. Specifically, the latent attention layer is a form of cross-attention that uses the last-layer hidden states of the decoder-only LLM as the query and a trainable latent array as the key and value. In the figure, the blue dashed lines represent the two matrix multiplications involved in the QKV attention mechanism.
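The data flow of such a layer can be sketched in plain Python. This is a simplified single-head illustration under several assumptions: the latent array is fixed toy data instead of a trained parameter, the 1/sqrt(d) scaling is the usual attention convention and is not confirmed by this article, the MLP is omitted, and mean pooling is applied directly to the attention output.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

def latent_attention_pool(hidden, latents):
    """Cross-attention with Q = hidden states and K = V = latent array,
    followed by mean pooling over the sequence (MLP omitted)."""
    d = len(hidden[0])
    scores = matmul(hidden, transpose(latents))        # (seq_len, n_latents)
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    attended = matmul(softmax_rows(scores), latents)   # (seq_len, d)
    return [sum(col) / len(attended) for col in zip(*attended)]

# Toy data: 3 tokens with hidden size 2, and 4 latent vectors.
hidden = [[0.5, -0.2], [1.0, 0.3], [-0.4, 0.8]]
latents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

pooled = latent_attention_pool(hidden, latents)
print(pooled)
```

Each token's output is a convex combination of the latent vectors, so the latent array acts as a learned dictionary that tokens select from before pooling.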

MLP layer

MLP is a regular MLP, which consists of two linear transformations with a GELU activation in between.

Embedding Aggregation

The paper suggests two methods to aggregate the output vector after the MLP layer to obtain the embedding of the entire sequence.

EOS: The output corresponding to the EOS token at the end of the text is used as the embedding for the entire sentence.
Mean-Pool: The output embeddings of all tokens are averaged, preserving information from every token.

NV-Embed uses the Mean-Pool approach.

3.9 Prompt Engineering Methods

The paper “Simple Techniques for Enhancing Sentence Embeddings in Generative Language Models” uses prompt engineering to enhance the ability of large models to directly generate text representations.

How can the gap between a generative model’s objective (predicting the next token) and the goal of producing a sentence vector be narrowed? Before this paper, PromptEOL was a common approach. PromptEOL was proposed in the 2023 paper “Scaling sentence embeddings with large language models” and uses a specific prompt template to make an LLM produce sentence embeddings. This paper builds on PromptEOL and proposes PromptSTH and PromptSUM. The three methods are as follows:

PromptEOL: Represents the sentence in one word, i.e., This sentence: "[X]" means in one word:
PromptSTH: Represents the sentence with “something”, i.e., This sentence: "[X]" means something
PromptSUM: Summarizes the sentence, i.e., This sentence: "[X]" can be summarized as
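The three templates are easy to express as format strings. The sketch below only builds the prompt text. Extracting the embedding, typically from the hidden state of the last prompt token, depends on the model and is not shown. The `build_prompt` helper is a hypothetical name, and the template wording follows the list above.

```python
TEMPLATES = {
    "PromptEOL": 'This sentence: "{text}" means in one word:',
    "PromptSTH": 'This sentence: "{text}" means something',
    "PromptSUM": 'This sentence: "{text}" can be summarized as',
}

def build_prompt(method, text):
    """Fill the sentence into the template of the chosen method."""
    return TEMPLATES[method].format(text=text)

for method in TEMPLATES:
    print(build_prompt(method, "I like cats"))
```

All three templates end right where the model would begin its answer, so the last token's hidden state is pushed to carry the meaning of the whole sentence.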

The results show that the PromptEOL method performs better when directly generating vectors, indicating that it is more consistent with the pre-training stage of the large model itself. However, it still lags behind the BERT model in directly generating vectors.

Does this mean that directly generating vectors from large models is inherently ineffective? Drawing on chain-of-thought (CoT) and in-context learning (ICL) approaches, the paper proposes the Pretended Chain of Thought and Knowledge Enhancement methods, as detailed below:

730

The results show that prompt engineering can stimulate not only the reasoning ability of large models, but also their vector representation ability.

3.10 Using MoE for Embedding

The paper “YOUR MIXTURE-OF-EXPERTS LLM IS SECRETLY AN EMBEDDING MODEL FOR FREE” proposes generating embeddings from MoE models, namely MoE embedding (MoEE), to address the weaknesses of LLMs as embedding models.

Research Background

While LLMs excel at generation tasks, their decoder architecture limits their potential as embedding models without further representation fine-tuning. The research challenge lies in the fact that the final or intermediate hidden states (HS) of LLMs may fail to capture key features and all information about the input tokens, especially when subtle semantic differences are involved. Furthermore, existing embedding methods often rely on static architectures, potentially ignoring the variability of the input.

Motivation

The authors found that the routing weights in MoE can provide supplementary information for decoder embeddings. An MoE model assigns inputs to different experts through a dynamic routing mechanism. Each expert focuses on specific features of the input, and the routing weights of each layer reflect the inference choices made for the input tokens; in other words, the routing weights represent each expert’s contribution to the final output. The routing weights therefore contain semantic information about the input that might be lost in the hidden-state embedding. By concatenating the routing weights of all layers, a routing-based embedding can be formed, denoted e_RW.

The paper further investigated the differences between RW and HS embeddings through clustering and correlation analysis. It found that RW and HS embeddings differ in clustering behavior and topicality, and that they have low correlation, indicating complementarity. Specifically, the clustering results of RW and HS embeddings showed moderate overlap (AMI and NMI around 0.29), but their Jaccard similarity and exact match rate were low (0.06 and 45.54%, respectively). This suggests that RW and HS embeddings capture different information in terms of structure and topicality. Further analysis showed that RW embeddings emphasize different topics in the input, while HS embeddings better capture the overall structure and meaning of the sentence.

731

Therefore, based on the complementarity of RW and HS embeddings, and in order to fully utilize routing weights and decoder embeddings, the authors propose a method called MoE embedding (MoEE) to form a more comprehensive embedding representation. There are two types of MoEE.

Concatenation: As shown in Figure 2 above. This method simply concatenates the routing weights and the decoder embedding. The authors call it MoEE (concat). It preserves the distinct information captured by the routing weights and the hidden state while letting downstream tasks use the combined representation. Its advantages are simplicity and intuitiveness, preservation of the independent information in the HS and RW embeddings, and flexible use of both representations by downstream tasks. Its disadvantage is that it may introduce redundant information: the two embeddings differ in type and structure, and simple concatenation may lead to duplication or cancellation of some information.
Weighted summation: As shown in Figure 3 above. Similarity scores are computed separately for the HS and RW embeddings and then combined as a weighted sum, denoted MoEE (sum). α is a hyperparameter that controls the contribution of the routing weights. The advantage of this approach is that the weighted sum balances the contributions of the RW and HS embeddings and avoids the complexity of fusing them directly. The weights of RW and HS can also be adjusted to the needs of different tasks. The disadvantages are that α has to be set, which may require additional experimentation and tuning, and that the weighted sum may mask the specific strengths of one of the embeddings.
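Both variants can be sketched with toy vectors. The code below assumes cosine similarity as the similarity function and combines the scores as sim_HS + α · sim_RW, which is one reading of the weighted sum described above. The function names, the vectors, and the value of α are illustrative assumptions.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def moee_concat(hs, rw):
    """MoEE (concat): one vector holding the hidden-state and routing-weight parts."""
    return list(hs) + list(rw)

def moee_sum_similarity(hs_a, rw_a, hs_b, rw_b, alpha=0.5):
    """MoEE (sum): combine the two similarity scores, weighting RW by alpha."""
    return cosine(hs_a, hs_b) + alpha * cosine(rw_a, rw_b)

# Two toy inputs: identical hidden-state embeddings, different routing patterns.
hs_a, rw_a = [1.0, 0.0], [1.0, 0.0, 0.0]
hs_b, rw_b = [1.0, 0.0], [0.0, 1.0, 0.0]

print(moee_concat(hs_a, rw_a))
print(moee_sum_similarity(hs_a, rw_a, hs_b, rw_b))  # routing disagrees
print(moee_sum_similarity(hs_a, rw_a, hs_a, rw_a))  # routing agrees
```

In this toy case the hidden states alone cannot tell the two inputs apart, while the routing term lowers their combined similarity, which is the complementarity the paper relies on.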

Furthermore, the authors used the PromptEOL technique to enhance MoEE. PromptEOL significantly improved the stability and performance of MoEE, enabling it to maintain high embedding quality even under conditions of high uncertainty.

0xFF Reference

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations[C]// International Conference on Learning Representations
Distributed Representations of Words and Phrases and their Compositionality
Echo embedding: Repeating text twice allows autoregressive models to generate higher quality embeddings
Efficient Estimation of Word Representations in Vector Space
Embeddings, etc. Technical Analysis J
https://arxiv.org/pdf/2307.16645.pdf
Is Cosine-Similarity of Embeddings Really About Similarity?
LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
LLM2Vec: Modifying Decoder-only LLMs to Generate High-Quality Text Embeddings
RetroMAE v2: Duplex Masked Auto-Encoder For Pre-Training Retrieval-Oriented Language Models
RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder
RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder Reading Notes
SimCSE: Simple Contrastive Learning of Sentence Embeddings
Simple Techniques for Enhancing Sentence Embeddings in Generative Language Models
When Text Embedding Meets Large Language Model: A Comprehensive Survey
Word Embeddings: Encoding Lexical Semantics
Your Mixture-of-Experts LLM is Secretly an Embedding Model for Free
【Daily LLM Paper Updates】 | Your Expert Team LLM is a Secret Free Embedding Model OptimaAI
【Tearing Apart LLM_Nv Embed】NVIDIA’s LLM-as-Embedding Achieves High ICLR Score, RAG Retrieval is Hopeful! Xiaodonggua AIGC [Tearing Apart LLM]
Why can BERT’s three Embeddings be added together?
From Mathematics to Neural Networks (Part 1): Structure - From Geometry, Language to the TOKEN Elephant Alpha [Elephant Alpha]
Cosine similarity might be useless? For some linear models, similarity isn’t even unique [Machine Heart]
Usability analysis of the dimensional formula “n > 8.33 log N” Su Jianlin
What is Embedding? [UNDONE] Fanzi xiwa
What is the difference between embedding and vectorization in large models? DFires [AI Exploration Era]
How to quickly improve the vector representation performance of large models? Liu Cong
How to use MoE to obtain embeddings in NLP Alex [Algorithm Dog]
How to view the slimmed-down version of BERT - ALBERT?
Microsoft E5-mistral-7b-instruct: The principle of minimum entropy in text embedding standing on the shoulders of LLM
(Part 6): How should the dimension of word vectors be chosen? Su Jianlin
What does residual network solve, and why is it effective?
In-depth analysis of the Transformers framework (Part 5): Embedding mechanism and Word2Vec word embedding model in practice [Old Niu]
Deep learning - Bert deep analysis
overview sharing - Beihang University & Alibaba - When LLM meets Embedding BrownSearch
paper sharing | Arxiv2024 McGill University | LLM2Vec - converting LLM to text encoder BrownSearch
word vectors and embedding are what they are? Su Jianlin
Overview of word vector embedding development Andre Andrew’s detailed explanation
Albert: A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS boom
Language model text embedding (thinking) Zelong
language model output end shared embedding re-exploration Su Jianlin
https://colala.berkeley.edu/papers/piantadosi2024why.pdf
https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html