Exploring the Transformer Series (6) --- token

0x00 Overview

Language is a concept unique to humans. As an abstract symbol, humans can understand the meaning of each word in a language, but current NLP language models cannot directly abstract the meaning of each language symbol from perception. In order for the model to understand natural language text, the text needs to be converted into a representation that the model can understand, such as integers, floating-point numbers, or vectors.

As we learned in previous chapters, the Transformer accepts high-dimensional vectors (word embeddings), and the conversion from text to vectors involves two stages: tokenization and embedding, which produce tokens and word embeddings respectively. Token segmentation and word embeddings play crucial roles in building large models. They are not only fundamental to the model’s understanding of textual language but also profoundly influence its performance and accuracy. This article will introduce how to perform effective word-to-number mapping; the next article will discuss how to convert numbers to embeddings.

Taking machine translation (Chinese to English) as an example, let’s outline what happens to the next piece of input text before it is converted into an embedding (and what preparations the developers make). The specific process is as follows:

Create Chinese and English vocabulary lists. This step involves deduplicating and sorting the characters in the corpus to obtain the vocabulary lists. Each character has a unique serial number in the vocabulary list.
Organize the text. This step will cut paragraphs and other text into shorter pieces.
Tokenization. This involves splitting/converting continuous text into individual, minimal semantic units. Each generated unit is called a token. Specifically, it includes the following steps:
Normalization. This involves necessary cleanup of the text, such as removing special characters, format conversion, filtering stop words, and Unicode standardization.
Pre-tokenization. Breaks the input into words.
Model processing. The words obtained from the pre-segmenter are fed into the word segmentation model for word segmentation to obtain a token sequence.
Post-processing includes adding special tokens for the tokenizer and generating attention masks.
At this point, the token is just a string and cannot be directly used for model calculations.
Indexing. Tokens are converted into numerical information that is easier for computers to process using a vocabulary. Each token is looked up in the vocabulary to obtain its index, which is the one-hot vector corresponding to each token. For example, we can get the index encoding of the sentence “Hello how are U today” as [5, 17, 7, 12, 15].

In summary, after the above operations, each word is represented as an integer, and a string of text becomes a vector of integers. This list of integer vectors is then converted into a more expressive vector form to capture the semantic relationships and feature information between words. The process described above is illustrated in the diagram below.

601

0x01 Basic Concepts

1.1 Word segmentation

As the name suggests, word segmentation is the process of dividing a sentence into individual words. This can be done by punctuation marks or by grammatical rules. Transformers process vectors, while user input is text sentences. Therefore, the sentence needs to be segmented or tokenized before vectors can be generated. Word segmentation breaks down the input text into small units that the model or machine can process, ensuring that each unit has relatively complete and independent semantics, and assigning each unit a unique identifier called a token. For example, for the sentence “I ate an apple,” the result of word segmentation would be a list ["I", "ate", "an", "apple"], where each element is a token.

In addition, directly inputting complete characters can lead to information loss or increased complexity, while word segmentation can preserve semantics, reduce sparsity, and ultimately improve the training efficiency and prediction accuracy of the model.

1.2 token

Token: A token is the result of word segmentation, which is the smallest semantic unit and the unit that the model can effectively understand and manage. A token can be a word, a Chinese character, or a special character that represents a whitespace character, an unknown character, or a sentence-initial character.

1.3 tokenizer

The role of a tokenizer is to separate the corpus in the original dataset according to certain rules, find the semantically meaningful strings, convert them into tokens, and then map each token to an integer.

The essence of word segmentation is actually a mapping from characters to numbers. Therefore, the tokenizer needs to maintain a dictionary-like module or function. Common methods for this function or module include: dictionary-based segmentation, statistical segmentation, and deep learning-based segmentation. Currently, LLM mainly uses a vocabulary-based segmentation method.

1.4 Vocabulary

A vocabulary is a collection of unique words or tokens that an LLM (Limited Language Management) can understand and recognize. It is a dictionary (hashmap) consisting of tokens and numbers, used to define the mapping relationship between tokens and integers. The integer is the unique index of this token in the vocabulary, and the maximum range of the integer is the size of the vocabulary.

The basic idea of a vocabulary list is based on dictionary matching. This involves segmenting and adjusting the Chinese text to be segmented according to certain rules, then matching it with words in a dictionary. If a match is successful, segmentation is performed according to the dictionary words; otherwise, adjustments are made or new words are selected. A vocabulary list needs to be built before training the model.

The image below is a sample of the Harvard Glossary.

602

1.6 Word segmentation process

The word segmentation process mainly consists of the following stages: standardization, pre-segmentation, model processing, and post-processing. We will introduce them one by one.

Standardization

Normalization refers to the standardization process of text, which mainly includes the following aspects:

Text cleaning. This involves removing useless characters (such as special characters, non-printable characters, etc.) and extra whitespace (redundant spaces, tabs, newlines, etc.), retaining only the content that is meaningful for word segmentation and model training.
Standardize syntax. For example, unify capitalization and number standardization (replace all numbers with a placeholder or specific marker to reduce the number of variables the model needs to process).
Encoding consistency. For example, ensuring that text uses a consistent character encoding, and handling or converting special characters and symbols.
Language standardization, such as lemmatization and stemming.

Example as follows:

from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents
normalizer = normalizers.Sequence([NFD(), StripAccents()])
normalizer.normalize_str("Héllò hôw are ü?")

presegmentation

Pre-segmentation refers to the preprocessing step before text is divided into tokens. Specifically, it involves initial text segmentation based on simple rules (such as spaces and punctuation) to obtain smaller units (like words). For English, this means segmenting by spaces, breaking the text down into words, and the final token will be a part of these words. Chinese text, because it doesn’t use spaces for segmentation, generally doesn’t require this step.

For example, to optimize compression efficiency for multiple languages, DeepSeek-V3 made specific adjustments to the pretokenizer and training data. Compared to DeepSeek-V2, the new pretokenizer introduces a mechanism to combine punctuation marks and newline characters into new tokens. This method can improve the compression ratio, but it may also introduce token boundary bias when processing multi-line input without newline characters. To mitigate this bias, DeepSeek-V3 randomly splits these combined tokens with a certain probability during training, allowing the model to adapt to more diverse input formats and improving its robustness.

Model processing

The words obtained from the pre-segmenter are fed into the word segmentation model or segmented based on a vocabulary. Specifically, based on pre-tokenization, more detailed processing is performed according to the selected model or algorithm (BPE, WordPiece, Unigram, or SentencePiece, etc.), including generating a vocabulary based on a large amount of text data according to algorithm rules, and then splitting the text into tokens based on the vocabulary.

Post-processing

The post-processing stage involves additional processing steps on the encoded text to ensure it meets the input requirements of a specific model. Post-processing mainly includes:

Sequence padding and truncation: To ensure that the length of the input sequence is consistent, sequences that are too long are truncated and sequences that are too short are padded.
Add special tokens: Add special tokens (such as [CLS], [SEP]) at appropriate positions in the sequence according to the model requirements.
Construct an attention mask: For the required model, construct an attention mask to distinguish the actual token from the padding token.

0x02 Vocabulary

A vocabulary is a collection of unique words or tokens that an LLM (Limited Language Management) can understand and recognize. We’ll first use Harvard code to see how to construct and use a vocabulary. This process is quite complex, so we’ve created the following diagram to aid in the analysis.

603

2.1 Constructing a vocabulary

The previous section briefly described how to load a vocabulary list; here we’ll look at how to build one. The build_vocabulary() function reads data from the dataset using an iterator and ultimately returns an English dictionary and a German dictionary, both of which are Vocabulary objects.

def build_vocabulary(spacy_de, spacy_en):
    # 德语分词方法
    def tokenize_de(text):
        return tokenize(text, spacy_de)
    # 英语分词方法
    def tokenize_en(text):
        return tokenize(text, spacy_en)

    # 调用datasets.Multi30k得到三个Iterator
    print("Building German Vocabulary ...")
    train, val, test = datasets.Multi30k(language_pair=("de", "en"))
    # 调用PyTorch函数build_vocab_from_iterator()来构建德语词典
    vocab_src = build_vocab_from_iterator(
        # 使用分词器从三个Iterator之中获取token
        yield_tokens(train + val + test, tokenize_de, index=0),
        min_freq=2,
        specials=["<s>", "</s>", "<blank>", "<unk>"],
    )

    print("Building English Vocabulary ...")
    train, val, test = datasets.Multi30k(language_pair=("de", "en"))
    vocab_tgt = build_vocab_from_iterator(
        yield_tokens(train + val + test, tokenize_en, index=1),
        min_freq=2,
        specials=["<s>", "</s>", "<blank>", "<unk>"],
    )

    # 设置缺省index为"<unk>"，分词器无法识别的单词会被归为`<unk>`
    vocab_src.set_default_index(vocab_src["<unk>"])
    vocab_tgt.set_default_index(vocab_tgt["<unk>"])

    return vocab_src, vocab_tgt # 返回德语词典和英语词典

The tokenize_en() function is an English word segmentation function that calls the passed-in tokenizer parameter to segment a sentence. The tokenize_de() function is a German word segmentation function, similar to tokenize_en().

def tokenize(text, tokenizer):
    """
    功能：调用分词模型tokenizer对text进行分词
    示例：tokenize("How are you", spacy_en)
    返回：['How', 'are', 'you']
    """
    return [tok.text for tok in tokenizer.tokenizer(text)]

The build_vocabulary() function uses the build_vocab_from_iterator() function, whose purpose is:

The frequency of words in the statistical data set.
Sort the words according to their frequency.
If there is a limit on max_tokens, then special words need to be excluded.
Words are written into the vocabulary list according to their frequency.
Returns a Vocab object.

The specific code is as follows.

def build_vocab_from_iterator(
    iterator: Iterable, # 迭代器，里面是分好的词
    min_freq: int = 1, # 当某个单词出现的频率大于min_freq，才会被加入词典
    specials: Optional[List[str]] = None,
    special_first: bool = True,
    max_tokens: Optional[int] = None,
) -> Vocab:
    counter = Counter()
    for tokens in iterator:
        counter.update(tokens)

    specials = specials or []
    sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: (-x[1], x[0]))

    if max_tokens is None:
        ordered_dict = OrderedDict(sorted_by_freq_tuples)
    else:
        ordered_dict = OrderedDict(sorted_by_freq_tuples[: max_tokens - len(specials)])

    word_vocab = vocab(ordered_dict, min_freq=min_freq, specials=specials, special_first=special_first)
    return word_vocab

Building a vocabulary requires tokenizing the dataset. The yield_tokens() function performs this task.

def yield_tokens(data_iter, tokenizer, index):
    """
    从iterator之中获取句子，调用分词器进行分词，返回一个token列表。
    比如，从data_iter之中得到了from_to_tuple为：
    ('Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.', 'Two young, White males are outside near many bushes.')
    如果index为0，则说明对德语进行分词，返回德语token列表，如果index为1，则说明对英语进行分词，返回英语token列表
    """
    for from_to_tuple in data_iter:
        yield tokenizer(from_to_tuple[index])

2.2 Using a vocabulary list

How is the dictionary used? The collate_batch() function takes the vocabulary as input; let’s delve deeper into this function. The collate_batch() function is the collate_fn parameter of the DataLoader class. Its purpose is to combine a list of samples into a mini-batch tensor. Internally, DataLoader passes the list of sentence pairs to the collate_batch() function for processing, and then sends the input batch to the model.

def collate_batch(
    batch,
    src_pipeline,
    tgt_pipeline,
    src_vocab,
    tgt_vocab,
    device,
    max_padding=128,
    pad_id=2,
):
    bs_id = torch.tensor([0], device=device)  # <s> token id
    eos_id = torch.tensor([1], device=device)  # </s> token id

    for (_src, _tgt) in batch:
        processed_src = torch.cat(
            [
                bs_id,
                torch.tensor(src_vocab(src_pipeline(_src)), dtype=torch.int64, device=device),
                eos_id,
            ],
            0,
        )
        processed_tgt = torch.cat(
            [
                bs_id,
                torch.tensor(tgt_vocab(tgt_pipeline(_tgt)), dtype=torch.int64, device=device),
                eos_id,
            ],
            0,
        )

The collate_batch() function actually calls the forward() function of the Vocab object, which returns a list of indices corresponding to a list of tokens, thus converting sentences into a sequence of vocabulary indices.

class Vocab(nn.Module):
    @torch.jit.export
    def forward(self, tokens: List[str]) -> List[int]:
        return self.vocab.lookup_indices(tokens)

2.3 Vocabulary Size

In natural language processing (NLP) models, determining the appropriate vocabulary size is a crucial step, directly impacting the model’s performance, efficiency, and adaptability. In fact, newer SLM (Small Language Model) vocabularies typically exceed 50,000 words or tokens. The following figure shows the trend of SLM vocabulary size changes from 2022 to 2024. As can be seen, the overall trend is an increase in vocabulary size. Expanding the vocabulary allows the model to handle a wider range of languages and provide more accurate and comprehensive responses.

604

The figure below shows the vocabulary size of some LLMs.

605

An ideal vocabulary should meet the needs of a specific task and dataset while ensuring model performance and efficiency. Different natural language processing tasks may require vocabularies of different sizes. For example, sophisticated text generation tasks may require a larger vocabulary to cover more details, while some classification tasks may only require a smaller vocabulary to achieve high performance. These factors need to be considered when setting the vocabulary.

Domain-specific vocabulary: If a particular domain contains a large number of technical terms and jargon, and these terms may not be present or sufficiently abundant in a general vocabulary, then vocabulary expansion is usually necessary. This helps the model more accurately understand and generate domain-related text, thereby improving performance on specific tasks. Additionally, in specialized domains, certain words may have specific meanings; expanding the vocabulary helps reduce ambiguity in the model’s understanding of these words and improves its ability to understand and generate domain-specific text.

The complexity and diversity of text in a dataset also influence vocabulary design; rich and varied datasets may require a larger vocabulary to capture the diversity of text. Structural differences between languages mean that vocabulary requirements also differ. For example, colloquial languages (such as German) may require a larger vocabulary to cover their rich compound word forms.

Word segmentation strategies are crucial, as different segmentation methods can lead to variations in how models interpret numerical values. For example, consider the famous question: can an embedding model determine whether 9.11 or 9.9 is larger? Researchers have offered various guesses, including the composition of the pre-training data and the model architecture itself. The design of the vocabulary directly impacts the representation of numbers. For instance, if the vocabulary only includes digits 0-9, then 11 might be segmented into two separate 1s instead of representing the whole number 11.

Furthermore, the paper Rethinking LLM Language Adaptation: A Case Study on Chinese Mixtral points out that while expanding the vocabulary can significantly improve the encoding and decoding efficiency of the target language, it does not necessarily mean that it will improve the performance of downstream tasks, and may even have a negative impact on the performance of downstream tasks. Moreover, vocabulary expansion may increase the computational and storage costs of the model.

Therefore, in practice, the vocabulary can be expanded in a targeted manner by analyzing the characteristics of the dataset, thereby improving model performance. Sometimes, simple vocabulary truncation or rule-based methods for handling domain-specific vocabulary can also achieve good results. The optimal vocabulary expansion strategy will vary depending on the specific task and domain requirements, and it is recommended to evaluate and experiment based on specific circumstances. In general, domain model vocabulary expansion is a worthwhile strategy to consider, as it can improve the model’s performance in a specific domain while maintaining its general capabilities. However, whether to expand the vocabulary should be based on a detailed analysis of the domain data and an accurate understanding of the model’s performance requirements.

Advantages

A larger vocabulary can improve a model’s ability to cover different words and expressions, helping it to better understand and generate text. The paper “Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies” explores the impact of vocabulary size on the performance of large language models (LLMs). Its conclusion is that the vocabulary size of a large model also applies to scaling laws, and it emphasizes the need to comprehensively consider model parameters, training data, and vocabulary size when designing and training LLMs. The specific conclusions of the paper regarding the impact of vocabulary size V on language model performance are as follows:

Increasing the vocabulary size can improve the efficiency of tokenized word segmentation, which means using shorter tokens to represent text, thereby improving model performance.
Larger models should be equipped with larger vocabularies. As the computational cost of a model increases, a larger vocabulary enhances its ability to understand more diverse texts and allows it to express more complex language patterns.
Once the vocabulary size reaches a certain level, the efficiency gains from gradually increasing the vocabulary size will gradually decrease, and it may lead to underfitting of vocabulary-related parameters, especially for word representation of low-frequency words.

In conclusion, vocabulary size is crucial for model scaling.

The paper also proposes three methods for predicting the optimal vocabulary size (FLOPs-based, derivative-based, and loss function parameter fitting-based estimation methods), and lists the relationship between the vocabulary parameters of current mainstream large language models (LLMs) and the predicted optimal vocabulary parameters. The table below shows the theoretically optimal vocabulary size for different model sizes.

606

Currently, most LLMs have suboptimal vocabulary parameters due to vocabulary sizes smaller than the predicted optimal values. For example, as shown in the figure below, the predicted optimal vocabulary size for Llama2-70B should be at least 216K, much larger than its actual 32K. Recently, the community has begun to shift towards larger vocabularies; for instance, the vocabulary size of Llama3 increased from 32K in Llama2 to 128K. However, expanding the data remains the most critical part, and addressing the data scarcity problem should be a focus of future work.

607

The paper also found that, given a certain amount of computing power, there is an upper limit to the optimal vocabulary size. These predictions were validated by training a 3B-parameter model under different FLOPs budgets, revealing that simply replacing the original vocabulary size with the predicted optimal vocabulary size can improve the model’s performance on multiple downstream tasks.

Disadvantages

However, an excessively large vocabulary also has its problems:

An excessively large vocabulary can lead to undertraining of certain tokens. This is because a large vocabulary implies a sparser token distribution (due to the sparsity of the word embedding space) and finer-grained token segmentation. This inevitably results in more low-frequency tokens and meaningless token fragments, leading to undertraining of the model on certain less common tokens. The model struggles to effectively learn the representations of these tokens, impacting its generalization ability. The paper “Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models” detected undertrained tokens in LLMs (tokens that cause anomalous outputs), finding that undertrained tokens that cause large models to “go crazy” are prevalent in these models. Furthermore, the number of undertrained tokens increases significantly in models with larger vocabularies. Therefore, this paper proposes that optimizing the vocabulary structure and tokenizer algorithm is key to solving the token undertraining problem.
The size of the vocabulary also affects the model’s processing speed. In resource-constrained environments, a larger vocabulary means that the model needs more computational resources to process storing word segmentation embeddings, and it also incurs computational burden when generating output, thus slowing down the processing speed and making the training and inference processes inefficient.

As mentioned above, in practical applications, it may be necessary to experiment and adjust to find the vocabulary size that is best suited for a specific model and task.

0x03 Tokenizer

The tokenizer generally does two things:

Tokenization. The tokenizer splits the string into sub-word tokens, and then maps the tokens to numbers (ordinal numbers in the vocabulary).
The process of splitting into tokens is based on a vocabulary. If a corresponding token can be found in the vocabulary, the token is output; otherwise, a special symbol indicating that the token does not exist is output.
The process of mapping a string to a number is called the encoding process of the tokenizer, and the process of mapping a number back to a string is called the decoding process of the tokenizer.
Expanding the vocabulary. Some tokenizers add tokens or special characters that appear in the training corpus but are not originally in the vocabulary. Of course, users can also add these tokens manually.

An ideal tokenizer should possess the following basic characteristics:

High compression ratio. More data can be represented using fewer tokens.
Training-friendly: Training can be completed within a reasonable timeframe.
Language-independent: Both training and word segmentation are independent of any language feature.
Lossless compression: The word segmentation result should be able to be restored to the input without loss.

In the LLM era, designing a tokenizer that is both versatile and efficient for reasoning is of great importance.

3.1 Word segmentation granularity

Based on the granularity of text segmentation, tokenization is generally divided into three categories: segmentation at the word base, segmentation at the character base, and segmentation at the subword base. Taking the English text “Today is Sunday” as an example, the segmentation results of the three methods are as follows:

Particle size	Cutting method	Word segmentation results
word	Word-level segmentation: English words can naturally be segmented based on spaces or punctuation.	[today, is, Sunday, .]
character	Character-level segmentation uses a single character as the smallest granularity and exhaustively enumerates all appearing characters, making it the most complete method.	[t, o, d, a, y, i, s, s, u, n, d, a, y,.]
Sub-words	This method falls between word and character segmentation, splitting a word into substrings. There are many different methods for sub-segmentation, such as BPE, WordPiece, Unigram, etc.	[to, day, is, s, un, day, .]

By word granularity

Words are the most common basic units in NLP tasks. Traditional methods for constructing a vocabulary first segment each sentence, then count and select the top N most frequent words to form the vocabulary. Each word is assigned an ID.

advantage:

It is easy to preserve semantics. Just like human reading, using words as the granularity of segmentation can well preserve the boundary information and complete semantics of words.
Sentences are easy to segment. Words are the most natural units of language, and for a language like English that contains spaces, they are naturally easy to segment.

Disadvantages:

The vocabulary is too large. Because listing all combinations of words is significantly more difficult than enumerating all characters, the dictionary constructed by segmenting words at the word level is too large, severely impacting computational efficiency and consuming excessive memory.
Limited capacity. For computational efficiency, the choice of N typically cannot include all words in the training set. This leads to the Out-of-Vocabulary (OOV) problem. Furthermore, this long-tail effect (a large number of low-frequency words occupying the vocabulary space) is severe, preventing low-frequency words from being adequately trained and causing the model to fail to fully understand the semantics of these words.
It struggles to handle semantic relationships. For example, it cannot handle semantic relationships and generalizations such as word morphology, affixes, singular and plural forms.

By character granularity

This word segmentation method breaks words down into individual characters and special symbols.

advantage:

With a small number of characters, the more than 5,000 commonly used Chinese characters can basically be combined to form all text sequences, and there will be no OOV problem.

Disadvantages:

The loss of boundary information makes it impossible to carry rich semantics at the word level, increasing the difficulty of modeling. For example, the embedding vector corresponding to each character can carry too much semantic information, making it difficult for the model to learn the relationships between words and sentences.
Increased sequence length increases the cost of text representation. For example, inference and training both incur higher computational costs, and training is less likely to converge.
In addition, although character-based word segmentation is more reasonable for Chinese, in Chinese, certain words (or idioms) are the smallest units that express semantics, and it is difficult to obtain valuable semantic information from the perspective of a single character.

By sub-word granularity

Subword tokenization strikes a balance between the two methods mentioned above. This method breaks a word down into smaller subwords, resulting in an intermediate granularity representation between words and characters. In other words, these tokens may be complete words or parts of a word. This allows the model to still have good generalization ability when dealing with unknown words, providing good flexibility and scalability.

Its processing principle is that frequently used words should be kept as is, while less common words should be split into subwords to share the token compression space. This achieves a better balance between vocabulary size, semantic independence, and semantic expressiveness. Thus, it solves the word segmentation problem for all words with a finite vocabulary, while minimizing the number of tokens in the result.

For example, if we segment “friendly and lovely” at the word level, we get “friendly / and / lovely”. Adding the common words “friend” and “love”, the dictionary will contain 5 words. If we segment it at the sub-word level, we get “friend / ly / and / love / ly”, which only contains 4 words in the dictionary.

Furthermore, the word segments derived from subwords can be used to form larger words. This is somewhat similar to the root and affix method of word formation in English. For example, “looking” can be divided into two subwords, “look” and “ing,” and the resulting “look” and “ing” can be used to construct other words. For instance, the subwords “look” and “ed” can form the word “looked.” Therefore, the subword-based segmentation method can reduce the size of the dictionary and better handle similar words.

For example, the following is a possible dictionary form, where ## indicates that this word should immediately follow the preceding word to form a complete word.

// 动词
look
see
speck
...
// 形容词
big
small
smart
...
// 表示时态的后缀
##ed
##ied
##ing
##s
##ies
// 表示比较级和最高级后缀
##er
##est
...

The subword method is currently the mainstream word segmentation granularity in LLM, with three main algorithms: Byte Pair Encoding (BPE), WordPiece, and Unigram Language Model.

How to choose

The choice of tokenization method typically depends on multiple factors, requiring comprehensive consideration and a decision made based on the specific application scenario. Common factors include:

Task requirements. Different tasks may require different word segmentation granularities.
Language characteristics. The structure of different languages determines the appropriate word segmentation strategies for them.
Model requirements. For each specific model, the most suitable word segmentation tool may be chosen based on its own design goals and optimization direction. For example, certain word segmentation methods can help the model better understand and process unfamiliar words.
Contextual relevance. Some tokenization methods are able to preserve contextual information, which is crucial for understanding semantics.
Computational efficiency. When processing large-scale datasets, the speed of word segmentation and memory consumption are also important considerations.
Interpretability. In some application scenarios, the interpretability of word segmentation results is also an important factor.

3.2 Common Tokenizers

The table below lists some of the tokenizers used by LLM.

LM	Tokenizer
BERT	Word-Piece
DistilBERT	Word-Piece
ALBERT	Sentence-Piece
RoBERTa	BPE
GPT-2	BPE
GPT-3	BPE
GPT-3.5 (ChatGPT)	BPE
GPT-4	BPE
T5	Sentence-Piece
Flan T5	Sentence-Piece
XLNet	Sentence-Piece
BART	Word-Piece
Llama 1	Sentence-Piece
Llama 2	Sentence-Piece
Llama 3	Tiktokenizer
MiniMax-01	BPE
DeepSeek-V3	BPE

3.3 Llama3 Example

Because the tokenizer part in the Harvard code is relatively simple, we will use the code in Llama3 as an example for analysis.

definition

The Tokenizer class encapsulates the Tiktoken tokenizer.

class Tokenizer:
    """
    Tokenizing and encoding/decoding text using the Tiktoken tokenizer.
    """
    special_tokens: Dict[str, int]
    num_reserved_special_tokens = 256
    pat_str = r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"  # noqa: E501

    def __init__(self, model_path: str):
        """
        Initializes the Tokenizer with a Tiktoken model.
        Args:
            model_path (str): The path to the Tiktoken model file.
        """
        mergeable_ranks = load_tiktoken_bpe(model_path)
        num_base_tokens = len(mergeable_ranks)
        special_tokens = [
            "<|begin_of_text|>",
            "<|end_of_text|>",
            "<|reserved_special_token_0|>",
            "<|reserved_special_token_1|>",
            "<|reserved_special_token_2|>",
            "<|reserved_special_token_3|>",
            "<|start_header_id|>",
            "<|end_header_id|>",
            "<|reserved_special_token_4|>",
            "<|eot_id|>",  # end of turn
        ] + [
            f"<|reserved_special_token_{i}|>"
            for i in range(5, self.num_reserved_special_tokens - 5)
        ]
        self.special_tokens = {
            token: num_base_tokens + i for i, token in enumerate(special_tokens)
        }
        self.model = tiktoken.Encoding(
            name=Path(model_path).name,
            pat_str=self.pat_str,
            mergeable_ranks=mergeable_ranks,
            special_tokens=self.special_tokens,
        )

        self.n_words: int = self.model.n_vocab
        # BOS / EOS token IDs
        self.bos_id: int = self.special_tokens["<|begin_of_text|>"]
        self.eos_id: int = self.special_tokens["<|end_of_text|>"]
        self.pad_id: int = -1
        self.stop_tokens = {
            self.special_tokens["<|end_of_text|>"],
            self.special_tokens["<|eot_id|>"],
        }

coding

The code for the encoding process is as follows.

def encode(
    self,
    s: str,
    *,
    bos: bool,
    eos: bool,
    allowed_special: Union[Literal["all"], AbstractSet[str]] = set(),
    disallowed_special: Union[Literal["all"], Collection[str]] = (),
) -> List[int]:
    """
    Encodes a string into a list of token IDs.

    Args:
        s (str): The input string to be encoded.
        bos (bool): Whether to prepend the beginning-of-sequence token.
        eos (bool): Whether to append the end-of-sequence token.
        allowed_tokens ("all"|set[str]): allowed special tokens in string
        disallowed_tokens ("all"|set[str]): special tokens that raise an error when in string

    Returns:
        list[int]: A list of token IDs.

    By default, setting disallowed_special=() encodes a string by ignoring
    special tokens. Specifically:
    - Setting `disallowed_special` to () will cause all text corresponding
      to special tokens to be encoded as natural text (insteading of raising
      an error).
    - Setting `allowed_special` to "all" will treat all text corresponding
      to special tokens to be encoded as special tokens.
    """

    # The tiktoken tokenizer can handle <=400k chars without
    # pyo3_runtime.PanicException.
    TIKTOKEN_MAX_ENCODE_CHARS = 400_000
    MAX_NO_WHITESPACES_CHARS = 25_000
    substrs = (
        substr
        for i in range(0, len(s), TIKTOKEN_MAX_ENCODE_CHARS)
        for substr in self._split_whitespaces_or_nonwhitespaces(
            s[i : i + TIKTOKEN_MAX_ENCODE_CHARS], MAX_NO_WHITESPACES_CHARS
        )
    )
    t: List[int] = []
    for substr in substrs:
        t.extend(
            self.model.encode(
                substr,
                allowed_special=allowed_special,
                disallowed_special=disallowed_special,
            )
        )
    if bos:
        t.insert(0, self.bos_id)
    if eos:
        t.append(self.eos_id)
    return t

decoding

The decoding process is shown below.

def decode(self, t: Sequence[int]) -> str:
    """
    Decodes a list of token IDs into a string.

    Args:
        t (List[int]): The list of token IDs to be decoded.

    Returns:
        str: The decoded string.
    """
    # Typecast is safe here. Tiktoken doesn't do anything list-related with the sequence.
    return self.model.decode(cast(List[int], t))

0x04 BPE

BPE is a word segmentation method proposed by Google in 2015 in the paper [1508.07909] Neural Machine Translation of Rare Words with Subword Units.

4.1 Approach

BPE (Byte Pair Encoding) is originally a data compression algorithm, derived from a 1994 paper titled “A new algorithm for data compression”. For example, consider the data “cddcdycdyc”. In this data, “cd” appears most frequently (3 times) in adjacent letter pairs. Therefore, we can replace “cd” with the letter ‘X’, compressing the data to “XdXyXyc”. Similarly, the next ‘Xy’ is replaced with ‘Y’, resulting in “XdYYc”. Since there are no more frequently occurring byte pairs, “XdYYc” cannot be further compressed. Reversing this process restores the compressed encoding.

Google introduced this approach to the field of NLP. The idea is to start with a basic small vocabulary and continuously merge the most frequently occurring characters or character sequences (high-frequency consecutive token pairs) in the text data by statistically analyzing the frequency of character pairs, thereby generating new tokens. BPE ensures that the most common words are represented as a single token in the token list, while rare words are broken down into two or more subword tokens. This allows for adaptive and dynamic vocabulary construction and handles unknown words well.

4.2 Algorithm

Training a tokenizer differs from training a model. Model training uses stochastic gradient descent, a naturally randomized process. Tokenizer training, on the other hand, is a statistical and deterministic process. This algorithm always finds the most suitable subwords for a given corpus, meaning that training with the same algorithm on the same corpus will always yield the same results. Furthermore, due to the large number of if-else branches, tokenizers are difficult to parallelize, resulting in low GPU utilization during training.

To construct the corpus most efficiently, BPE employs a greedy strategy to achieve the optimal solution possible. The algorithm first breaks down each text word into character-level letter sequences. Then, starting at the character level, at each step, it merges the pair of most frequent “adjacent tokens” into a “new token” that hasn’t appeared in the data before. This gradually builds longer vocabulary or phrase representations. This process iterates until a preset vocabulary size or number of merges is reached. Later, a merge table is used to reconstruct the original data when using this approach. While the greedy algorithm may not be globally optimal, and frequency may not be the best merging metric, BPE is undeniably a highly efficient tokenizer algorithm that conveniently controls the total number of tokens within a manually set limit.

The main steps of the BPE algorithm are as follows:

Prepare a sufficiently large training corpus; set the maximum number of words in the segmentation dictionary.
Prepare a basic vocabulary list, such as the 26 letters of the English alphabet plus various symbols;
Initialization: Initialize a dictionary. Based on the basic vocabulary, all text in the corpus will be cut into the smallest units (in single character form) and added to the dictionary, and special characters will also be added to the dictionary.
Co-occurrence frequency statistics: For the corpus that has been segmented into characters, globally calculate the co-occurrence frequency of all adjacent character pairs.
Merge operation: Select the most frequently occurring character pairs, merge them into a whole (treat it as a word), add this whole to the dictionary, and synchronously replace all character pairs in the corpus with this new whole.
Repeat: Repeat the above process until the preset number of merges or vocabulary size is reached, or the frequency of the next most frequent byte pair is 1.
Output: The output is the subword list obtained by running the algorithm. Subsequent tasks will use this word segmentation dictionary for word segmentation and encoding.

608

Let’s look at the vocabulary size. After each round of operations, the vocabulary size may increase or decrease. However, overall, as the number of merges increases, the total vocabulary size usually increases initially and then gradually decreases until it reaches a stable value. For example, consider the words “loved,” “loving,” and “loves.” While they all mean “love,” if we consider them as individual words, they are considered different words. In English, there are many words with different suffixes, which makes the vocabulary very large, slows down training, and reduces training effectiveness. The BPE algorithm, through training, can split these three words into “lov,” “ed,” “ing,” and “es,” separating the meaning and tense of the words and effectively reducing the vocabulary size.

4.3 Example Analysis

Next, we will analyze this using an example. Suppose we have a corpus, and after pre-tokenization of the pre-corpus, we obtain a set of original words containing the following words: old, older, highest, and lowest.

Statistical frequency

We first calculate the frequency of these words in the corpus. Let’s assume the frequencies of these words are as follows:

{"low": 5, "lowest": 2, "new": 6, "widest": 3}

Initial segmentation

In this stage, the subword granularity is character-based. Therefore, we first split these words into individual characters and add a special closing marker "" to the end of each word to indicate the terminator. We also mark the frequency of each word. For example, if the frequency of “low” is 5, we rewrite it as “low ”: 5. The subword frequency table at this point is as follows:

{'l o w </w>': 5, 'l o w e s t </w>': 2, 'n e w </w>': 6, 'w i d e s t </w>': 3}

The significance of the terminator "" is that it indicates the subword is a word suffix. This marks word boundaries, allowing the algorithm to know the end position of each word (because when counting adjacent character pairs, we cannot include character pairs located in two separate words). This helps the algorithm examine each character and find the most frequent character pair. We will see later that "" can also be counted as part of a character pair.

Additionally, terminators help the algorithm understand the difference between words like “star” and “highest”. Both words share the suffix “st”, but one has it at the end and the other at the beginning, leading to distinct meanings. Therefore, tokens like “st” and “st” need to be treated differently. If the algorithm sees the token “st”, it knows it’s more likely to be the token for “highest” than “star”.

For English, we can simply use spaces and some punctuation marks to segment words; for Chinese, we can use jieba word segmentation or segment words directly according to the characters.

Constructing an initial vocabulary

Next, we’ll build the base vocab and begin learning the merge rules. For English, we’ll use letters to form the base vocab. Below is a list of all the initial subvocabularies, a total of 10 subvocabularies, all letters.

['l', 'o', 'w', '</w>', 'e', 's', 't', 'n', 'i', 'd']

Note: This basic vocabulary is our initial state. Although it seems that all words can be represented using only 26 letters of the English alphabet, resulting in a high compression rate, such sub-words are essentially unable to represent their corresponding meanings. Therefore, we will use the BPE algorithm for iterative processing, gradually merging these initial sub-words into longer, more semantically meaningful ones. We will continuously construct new words and add them until we reach our ideal vocabulary size.

Iterative learning combined with rules

The next step in the BPE algorithm is to find the most frequent character pairs, merge them, and perform the same iterations over and over again until we reach our pre-set limit on the number of tokens or the number of iterations.

Merging characters allows you to represent a corpus with the fewest possible tokens, which is the main goal of the BPE algorithm: data compression.
To merge, BPE looks for the most frequently occurring byte pairs. It then merges the most common byte pairs into a single token, adds them to the token list, and recalculates the frequency of each token. This is because the frequency count changes after each merge step.
We will continue this merge process until the pre-set token limit or iteration limit is reached.
We do the same thing in every iteration, let’s take a closer look.

First iteration

Statistical character pair frequency

First, based on the basic vocabulary, we can perform fine-grained word segmentation on the original word set and see the word frequency of the basic words, as shown below.

('l', 'o'): 7, ('o', 'w'): 7, ('w', '</w>'): 11, ('w', 'e'): 2, ('e', 's'): 5, ('s', 't'): 5, ('t', '</w>'): 5, ('n', 'e'): 6, ('e', 'w'): 6, ('w', 'i'): 3, ('i', 'd'): 3, ('d', 'e'): 3

Merge the most frequent character pairs

Next, we will select the most frequent character pairs to merge. For example, the current highest frequency is (‘w’, ”), which we will merge into a new subword. This will be done in three steps.

First, we add ‘w’ to the subvocabulary. Our new subvocabulary is now as follows: Note that the spaces inside (‘w’, ”) have been removed:

['l', 'o',  'w</w>', 'e', 's', 't', 'n', 'i', 'd','w'，'</w>]

Secondly, the word frequency statistics table needs to be updated:

('l', 'o'): 7, ('o', 'w</w>'): 5, ('o', 'w'): 2, ('w', 'e'): 2, ('e', 's'): 5, ('s', 't'): 5, ('t', '</w>'): 5, ('n', 'e'): 6, ('e', 'w</w>'): 6, ('w', 'i'): 3, ('i', 'd'): 3, ('d', 'e'): 3

Finally, merge the ‘w’ entries in the vocabulary (eliminating spaces in between).

{'l o w</w>': 5, 'l o w e s t </w>': 2, 'n e w</w>': 6, 'w i d e s t </w>': 3}

As can be seen, compared to the last time, a new subword ‘w’ has been added to the vocabulary. The vocabulary may undergo three changes after each merge:

+1 indicates that the new word is added after the merger, while the original two sub-words are still retained (the two words do not appear simultaneously or consecutively).
+0 indicates that the new word is added after merging, while one of the original two sub-words is retained and the other is eliminated (one word appears immediately following the appearance of the other word).
-1 indicates that the new word is added after the merger, and the original two sub-words are eliminated (the two words appear consecutively at the same time).

As the number of merges increases, the vocabulary typically increases first and then decreases. This is because words in the original vocabulary are gradually merged, thus removing them from the vocabulary. In practice, the number of iterations needs to be carefully set. If the number of iterations is too small, most words will still be letters, which is meaningless; if the number of iterations is too large, the vocabulary will revert to the original few words. Therefore, the vocabulary size should be set to a suitable middle value.

However, the vocabulary is still too large, so we need to move on to the second iteration.

Second iteration

We recalculated the frequency of the updated character pairs:

{('l', 'o'): 7, ('o', 'w</w>'): 5, ('o', 'w'): 2, ('w', 'e'): 2, ('e', 's'): 5, ('s', 't'): 5, ('t', '</w>'): 5, ('n', 'e'): 6, ('e', 'w</w>'): 6, ('w', 'i'): 3, ('i', 'd'): 3, ('d', 'e'): 3})

At this point, the most frequent character pair is (‘l’, ‘o’). Therefore, we merge this pair of characters into a new subword lo.

The updated sub-word list is as follows:

['lo',  'w</w>', 'e', 's', 't', 'n', 'i', 'd', 'w','</w>']

The updated frequency table is as follows:

{('l', 'o'): 7, ('o', 'w</w>'): 5, ('o', 'w'): 2, ('w', 'e'): 2, ('e', 's'): 5, ('s', 't'): 5, ('t', '</w>'): 5, ('n', 'e'): 6, ('e', 'w</w>'): 6, ('w', 'i'): 3, ('i', 'd'): 3, ('d', 'e'): 3}

The updated vocabulary list is as follows.

{'lo w</w>': 5, 'lo w e s t </w>': 2, 'n e w</w>': 6, 'w i d e s t </w>': 3}

Subsequent iterations

We continue repeating the above steps until the preset vocabulary size is reached, the iteration condition is met, or the frequency of the next most frequent character pair is 1. The final sub-vocabulary is as follows:

['est</w>', 'new</w>', 'low</w>', 'wid', 'lo', 'w']

We successfully derived 6 sub-words from the initial 10 letters, which we call tokens. We also observed that during the encoding process, the input sentence was broken down into individual tokens.

Note that after the above algorithm is executed, if there are still substrings in the sentence that have not been replaced but all subwords have been iterated, then the remaining subwords will be replaced with a special token, such as…In principleThe fewer times it appears, the better, so we often use…The quality of a tokenizer can be evaluated based on its quantity.The fewer occurrences, the better the tokenizer performs.

summary

Let’s use the diagram below to illustrate the iteration process. Our initial token list was [‘l’, ‘o’, ‘w’, ”, ‘e’, ‘s’, ‘t’, ‘n’, ‘i’, ‘d’], a total of 10 tokens. Now the token list is [‘lo’, ‘new’, ‘w’, ‘wid’, ‘est’, ‘low’], a total of 6 tokens. This shows that the token list has been effectively compressed. In this example, our corpus is very small; in reality, corpora are much larger, and we can reduce the token list even further through more iterations.

609

4.4 Use

coding

BPE’s encoding process involves splitting words into tokens within a vocabulary. After obtaining the Subword vocabulary, we can encode each word in the following way:

Sub-word sorting. Sort all sub-words in the dictionary in descending order of length, because we will first match the longest token and then iterate down to the shortest token.
Matching substrings. For the word ‘w’, we iterate through the sorted dictionary. We check if the current word is a substring of ‘w’; if so, we replace the substring in ‘w’ with the current word. Then we continue matching the remaining substring of ‘w’.
Repeat until no match can be found. Eventually, we will iterate through all tokens, and substrings of tokens will be replaced with substrings that already exist in our subword list.
Handle unmatched substrings. If, after traversing the dictionary, there are still unmatched substrings even though all tokens have been iterated over, replace the remaining substrings with special characters (such as "").”) output.
The word is represented by all the output subwords mentioned above.

The basic process is as follows.

for subword in subwords:
    for word in words:
        # 执行替换操作

For example, we use the sub-vocabulary [‘lo’, ‘new’, ‘w’, ‘wid’, ‘est’, ‘low’] generated above to encode a newly created word “wwidlow”, resulting in [‘w’, ‘wid’, ‘low’]. It’s quite intuitive to see that a completely new word that has never appeared before is encoded as a token in the vocabulary, something traditional word segmentation methods cannot do.

decoding

Decoding is the reverse of encoding. If there’s no stop character between adjacent words, they are concatenated directly; otherwise, a separator is added between them. This allows the original word to be recovered. As you can see, during concatenation, ”< /w >” separates different words. Examples are shown below.

# 编码序列
[“the</w>”, “high”, “est</w>”, “moun”, “tain</w>”]

# 解码序列
“the</w> highest</w> mountain</w>”

4.5 MINBPE

Next, let’s take a look at the specific implementation using MIBPE.

Basic functions

The get_stats() and merge() functions are basic functions used during coding.

# 统计字频
def get_stats(ids, counts=None):
    """
    Given a list of integers, return a dictionary of counts of consecutive pairs
    Example: [1, 2, 3, 1, 2] -> {(1, 2): 2, (2, 3): 1, (3, 1): 1}
    Optionally allows to update an existing dictionary of counts
    """
    counts = {} if counts is None else counts
    for pair in zip(ids, ids[1:]): # iterate consecutive elements
        counts[pair] = counts.get(pair, 0) + 1
    return counts

# 合并子词
def merge(ids, pair, idx):
    """
    In the list of integers (ids), replace all consecutive occurrences
    of pair with the new integer token idx
    Example: ids=[1, 2, 3, 1, 2], pair=(1, 2), idx=4 -> [4, 3, 4]
    """
    newids = []
    i = 0
    while i < len(ids):
        # if not at the very last position AND the pair matches, replace it
        if ids[i] == pair[0] and i < len(ids) - 1 and ids[i+1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

Tokenizer

Tokenizer is the base class that provides the basic capabilities required by tokenizers.

class Tokenizer:
    """Base class for Tokenizers"""

    def __init__(self):
        # default: vocab size of 256 (all bytes), no merges, no patterns
        self.merges = {} # (int, int) -> int 对哪两个进行合并，以及合并之后对应的索引
        self.pattern = "" # str
        self.special_tokens = {} # str -> int, e.g. {'<|endoftext|>': 100257}
        # 词典
        self.vocab = self._build_vocab() # int -> bytes

    def train(self, text, vocab_size, verbose=False):
        # Tokenizer can train a vocabulary of size vocab_size from text
        raise NotImplementedError

    def encode(self, text):
        # Tokenizer can encode a string into a list of integers
        raise NotImplementedError

    def decode(self, ids):
        # Tokenizer can decode a list of integers into a string
        raise NotImplementedError

    def _build_vocab(self):
        # vocab is simply and deterministically derived from merges
        # 单字节token
        vocab = {idx: bytes([idx]) for idx in range(256)}
        # 字节对token
        for (p0, p1), idx in self.merges.items():
            vocab[idx] = vocab[p0] + vocab[p1]
        # 特殊token
        for special, idx in self.special_tokens.items():
            vocab[idx] = special.encode("utf-8")
        return vocab

    def save(self, file_prefix):
        """
        Saves two files: file_prefix.vocab and file_prefix.model
        This is inspired (but not equivalent to!) sentencepiece's model saving:
        - model file is the critical one, intended for load()
        - vocab file is just a pretty printed version for human inspection only
        """
        # write the model: to be used in load() later
        model_file = file_prefix + ".model"
        with open(model_file, 'w') as f:
            # write the version, pattern and merges, that's all that's needed
            f.write("minbpe v1\n")
            f.write(f"{self.pattern}\n")
            # write the special tokens, first the number of them, then each one
            f.write(f"{len(self.special_tokens)}\n")
            for special, idx in self.special_tokens.items():
                f.write(f"{special} {idx}\n")
            # the merges dict
            for idx1, idx2 in self.merges:
                f.write(f"{idx1} {idx2}\n")
        # write the vocab: for the human to look at
        vocab_file = file_prefix + ".vocab"
        inverted_merges = {idx: pair for pair, idx in self.merges.items()}
        with open(vocab_file, "w", encoding="utf-8") as f:
            for idx, token in self.vocab.items():
                # note: many tokens may be partial utf-8 sequences
                # and cannot be decoded into valid strings. Here we're using
                # errors='replace' to replace them with the replacement char �.
                # this also means that we couldn't possibly use .vocab in load()
                # because decoding in this way is a lossy operation!
                s = render_token(token)
                # find the children of this token, if any
                if idx in inverted_merges:
                    # if this token has children, render it nicely as a merge
                    idx0, idx1 = inverted_merges[idx]
                    s0 = render_token(self.vocab[idx0])
                    s1 = render_token(self.vocab[idx1])
                    f.write(f"[{s0}][{s1}] -> [{s}] {idx}\n")
                else:
                    # otherwise this is leaf token, just print it
                    # (this should just be the first 256 tokens, the bytes)
                    f.write(f"[{s}] {idx}\n")

    def load(self, model_file):
        """Inverse of save() but only for the model file"""
        assert model_file.endswith(".model")
        # read the model file
        merges = {}
        special_tokens = {}
        idx = 256
        with open(model_file, 'r', encoding="utf-8") as f:
            # read the version
            version = f.readline().strip()
            assert version == "minbpe v1"
            # read the pattern
            self.pattern = f.readline().strip()
            # read the special tokens
            num_special = int(f.readline().strip())
            for _ in range(num_special):
                special, special_idx = f.readline().strip().split()
                special_tokens[special] = int(special_idx)
            # read the merges
            for line in f:
                idx1, idx2 = map(int, line.split())
                merges[(idx1, idx2)] = idx
                idx += 1
        self.merges = merges
        self.special_tokens = special_tokens
        self.vocab = self._build_vocab()

BPE Tokenizer

BasicTokenizer is an implementation of the BPE algorithm.

class BasicTokenizer(Tokenizer):

    def __init__(self):
        super().__init__()

    def train(self, text, vocab_size, verbose=False):
        assert vocab_size >= 256
        num_merges = vocab_size - 256

        # input text preprocessing
        # 得到原始字节
        text_bytes = text.encode("utf-8") # raw bytes
        ids = list(text_bytes) # list of integers in range 0..255

        # iteratively merge the most common pairs to create new tokens
        # merges用来决定把哪些单字节合并成一个token，并且标记一个索引
        merges = {} # (int, int) -> int
        # 前256个单字节token
        vocab = {idx: bytes([idx]) for idx in range(256)} # int -> bytes
        # 扩充词典
        for i in range(num_merges):
            # count up the number of times every consecutive pair appears
            # 计算每个相邻对出现的次数
            stats = get_stats(ids)
            # find the pair with the highest count
            # 找到出现最多的字节对，将其构建成为一个新的token
            pair = max(stats, key=stats.get)
            # mint a new token: assign it the next available id
            # 给这个新token设置对应的索引
            idx = 256 + i
            # replace all occurrences of pair in ids with idx
            # 对ids进行更新，即把ids中出现的pair替换成idx
            ids = merge(ids, pair, idx)
            # save the merge
            merges[pair] = idx
            # 更新词汇表，新idx对应的token就是原来词汇表中两个对应字节的拼接
            vocab[idx] = vocab[pair[0]] + vocab[pair[1]]
            # prints
            if verbose:
                print(f"merge {i+1}/{num_merges}: {pair} -> {idx} ({vocab[idx]}) had {stats[pair]} occurrences")

        # save class variables
        self.merges = merges # used in encode()
        self.vocab = vocab   # used in decode()

    def decode(self, ids):
        # 从整数列表还原成字节字符串
        # given ids (list of integers), return Python string
        text_bytes = b"".join(self.vocab[idx] for idx in ids)
        text = text_bytes.decode("utf-8", errors="replace")
        return text

    # 编码函数
    def encode(self, text):
        # given a string text, return the token ids
        text_bytes = text.encode("utf-8") # raw bytes
        # 把字节变成整型数（0~255）的列表。中文可能对应多个字节
        ids = list(text_bytes) # list of integers in range 0..255
        while len(ids) >= 2: # 遍历ids
            # find the pair with the lowest merge index
            # 计算相邻字节的频数
            stats = get_stats(ids)
            # 对于stats的每个key，调用merge函数，得到最小值
            pair = min(stats, key=lambda p: self.merges.get(p, float("inf")))
            # subtle: if there are no more merges available, the key will
            # result in an inf for every single pair, and the min will be
            # just the first pair in the list, arbitrarily
            # we can detect this terminating case by a membership check
            # 排名最短的pair都不在merge之中，说明没有需要合并的pair，就不需要再编码了
            if pair not in self.merges:
                break # nothing else can be merged anymore
            # otherwise let's merge the best pair (lowest merge index)
            idx = self.merges[pair]
            # 合并
            ids = merge(ids, pair, idx)
        # ids中所有可以合并的pair都被替代了，得到了新的ids
        return ids

4.5 Advantages and Disadvantages

advantage

The advantages of BPE are as follows:

Smaller vocabulary. BPE can generate a smaller vocabulary. This not only saves storage space but also improves computational efficiency.
Better generalization ability. BPE’s vocabulary contains units ranging from single characters to longer subwords, which the model can use to represent any input text. This approach is particularly suitable for handling large numbers of unknown words or spelling errors because it allows the model to infer new words by combining known parts. For example, when encountering a new word, we can represent it using existing subwords instead of creating a new entry for each new word.
It can effectively balance dictionary size and the number of encoding steps.

Disadvantages

The disadvantages of BPE are:

BPE cannot provide multiple segmentation results with probabilistics due to its greedy algorithm and deterministic symbol replacement.
Decoding can lead to ambiguity. For example, the same sentence, such as “Hello world,” might have different subword sequences (Hell/o/ word and H/ello/ world). Different subword sequences will produce completely different ID sequence representations, and this ambiguity may not be resolved during the decoding stage. This results in different ID sequences potentially translating into different sentences in translation tasks.
In training tasks, training on different subwords will increase the robustness of the model and make it more tolerant of noise, while BPE’s greedy algorithm cannot learn from random distributions.

Furthermore, the question “Which is bigger, 9.9 or 9.11?” can also be explained from the perspective of tokenizers. For example, different tokenizers handle numbers differently, which significantly impacts the arithmetic performance of language models.

The GPT-2 paper uses BPE (Block Entity Prefix) to merge frequently occurring subwords into single units until the vocabulary reaches a target size. However, the resulting vocabulary is heavily dependent on the training data input into the tokenizer, leading to inconsistencies in how numbers are encoded. For example, numbers that are common in the training data (such as representations like 1-100 or 1943) are likely to be represented as a single token, while less common numbers are split into multiple tokens.
Llama and Llama 2 make significant tweaks to the numbers using SentencePiece’s BPE implementation: they break all numbers down into individual digits. This means that only 10 unique tokens (0-9) are needed to represent any number, thus simplifying the number representation in LLMs.
DeepSeek-V2 has a similar single-digit tokenizer.
Llama 3 uses a different approach to handle numbers, tokenizing them as three-digit numbers. Therefore, each number from 1 to 999 has a unique token.
Later, the right-to-left (R2L) word segmentation method emerged, which processes text in groups of three characters, starting from the end of the text and working towards the beginning.

We can optimize the tokenization strategy based on the problem type, thereby improving the performance of LLM on mathematical tasks.

0x05 Other Algorithms

As mentioned earlier, the three commonly used subword segmentation algorithms are BPE, WordPiece, and Unigram. This section will examine WordPiece and Unigram, as well as other subword classification algorithms.

5.1 WordPiece

The WordPiece algorithm, from the paper “JAPANESE AND KOREAN VOICE SEARCH,” is used to solve pronunciation problems in Japanese and Korean. Literally, this algorithm breaks down words into individual pieces, and can be seen as a variant of BPE. Compared to BPE, WordPiece can handle word variations and unknown words more effectively.

Thought

WordPiece, similar to BPE, uses a merging approach. It starts with a basic vocabulary, selecting two subwords and merging them into a new subword each time, continuing this process to generate the final vocabulary. The main difference is that BPE selects merging token pairs based on frequency, while WordPiece merges adjacent subwords based on the maximum likelihood probability from the language model. It not only calculates the frequency of these combinations but also considers the probability gain after merging. Alternatively, WordPiece can be understood as merging based on mutual information between tokens. That is, if the probability of P(ed) is greater than the probability of P(e) + P(d) appearing alone, WordPiece will merge them into the vocabulary.

Note: Mutual information, sometimes referred to as cohesion or solidification in word segmentation, reflects the degree of cohesion between two parts of a word. The greater the mutual information, the stronger the correlation between the two sub-words in the language model.

algorithm

WordPiece introduces the assumption that the occurrence of all subwords is independent, and that the sequence of subwords is generated by the product of the probabilities of their occurrence. The algorithm for WordPiece is as follows:

Prepare a sufficiently large training corpus and determine the desired subvocabulary size.
Prepare a basic vocabulary list, such as the 26 letters of the English alphabet plus various symbols.
The corpus is divided into the smallest units based on the basic vocabulary.
The language model (such as a unigram language model) is trained based on the data from step 3.
Select the token pair that will maximize the probability of the training data when added to the language model from all possible token pairs as the new subword.
Repeat step 5 until the subword vocabulary size or probability increment set in step 2 is lower than a certain threshold.
The algorithm outputs a subvocabulary.

Advantages and disadvantages

Advantages: It can better balance vocabulary size and OOV issues;

Disadvantages: May produce some unreasonable sub-words or incorrect segmentation; very sensitive to spelling errors; poor support for prefixes;

5.2 UniLM

ULM comes from the paper “Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates”. The ULM algorithm considers different word segmentation possibilities for a sentence, and thus can output multiple subword segments with probabilities.

Unigram and WordPiece share the commonality of using language models to select subwords. Unigram also employs probabilistic statistics to predict the probability of each word appearing as an independent unit and performs word segmentation based on this probability. During this process, some words may be broken down into smaller units so that the model can more flexibly handle variations and new words in the language.

The biggest difference between Unigram and WordPiece is that the vocabulary size in the WordPiece algorithm increases gradually, while the vocabulary size in UniLM decreases gradually. It can be seen as the WordPiece algorithm performing the reverse operation during execution. Unigram first initializes a large vocabulary, and then in each step, it continuously discards subwords from the vocabulary according to evaluation criteria (repeatedly deleting lower-ranked subwords based on evaluation) until the conditions are met. Because each time a batch of subwords is retained and deleted, the Unigram algorithm has lower complexity than WordPiece (which merges one subword at a time).

algorithm

Prepare a sufficiently large training corpus and determine the desired subvocabulary size.
Prepare a basic vocabulary: Initialize a large vocabulary, such as all characters plus high-frequency ngrams, or you can initialize it using the BPE algorithm.
Given the current vocabulary, a language model is used to estimate the probability of each word in the corpus. The ULM algorithm considers different word segmentation possibilities for a sentence, and thus can output multiple word segmentation results with probabilities.
Calculate the impact of deleting each subword on the total loss, and use it as the score for that subword.
Subwords are sorted by score, and the top X% are retained. It’s clear that ULM tends to retain subwords that appear frequently in many sentences, as deleting these subwords would result in significant loss.
Repeat steps 3 through 5 until the vocabulary size is reduced to the set value, or the result of step 5 no longer changes.
The algorithm outputs a subword list.

Advantages and disadvantages

Advantages:

The training algorithm used can utilize all possible word segmentation results, which is achieved through a data sampling algorithm;
A word segmentation algorithm based on a language model is proposed. This language model can assign probabilities to multiple word segmentation results, thereby learning the noise in them.
It can automatically adapt to the characteristics of different languages, making the model more efficient in processing multilingual text; it was used in GPT-1.

Disadvantages:

The effect is closely related to the initial vocabulary. The initial large vocabulary must be good enough, for example, it can be initialized using BPE.
It’s a bit complicated.

Comparison

The table and figures below provide a comparison of the three word segmentation methods.

name	BPE	WordPiece	Unigram
Sub-word selection method	Frequency of occurrence	Mutual information (using language models)	Mutual information (using language models)
operate	Merging sub-words	Merging sub-words	Deleting the word that minimizes the maximum likelihood probability
Lexical changes	Gradually increase	Gradually increase	Gradually decrease

610

5.3 BBPE

The paper “Neural Machine Translation with Byte-Level Subwords” proposes a new subword algorithm based on BPE, which extends the idea of BPE from the character level to the byte level, hence the name BBPE, or Byte-level BPE.

motivation

Almost all existing machine translation models are built on character-based vocabularies: characters, subwords, or words (only the granularity of the characters differs). For English and Latin American languages, BPE segmentation is sufficient to solve the OOV problem within an acceptable vocabulary size. However, for noisy text or character-rich languages (such as Japanese and Chinese), rare characters may unnecessarily occupy the vocabulary and limit its compactness. Representing text at the byte level and using a 256-byte set as the vocabulary is a potential approach to this problem. However, the high computational cost hinders its widespread deployment or use in practice. Therefore, the authors of this paper propose Byte-level Subword BBPE, which is more compact than a character-based vocabulary, has no extra-vocabulary markers, but is more efficient than using only pure bytes.

Thought

BBPE started with UTF-8 encoding. Compared to ASCII, which only covers English characters, UTF-8 was created to universally encode characters from different languages around the world using a single encoding. In contrast, UTF-32, which uses 4 bytes per character, is too verbose. The improved UTF-8 encoding is a variable-length encoding, with a length ranging from 1 to 4 bytes. Different byte lengths can be used for characters in different languages.

BBPE is similar to BPE in principle, also selecting the most frequent character pairs for merging. The main difference is that BPE performs the merging process and generates a vocabulary based on the char granularity, while BBPE first converts any character into a length of 1 to 4 bytes using UTF-8 encoding. One byte has 256 representations, and then performs aggregation at the byte level. The other processes are the same as BPE.

Advantages and disadvantages

The advantages are as follows:

The performance is comparable to BPE, but its size is only 1/8 of BPE. For rare characters, BBPE does not assign a dedicated token ID, but uses byte-level encoding to solve the OOV problem, which to some extent controls the vocabulary size and solves the problem of difficult training with sparse characters.
Vocabulary lists can be shared across languages. BBPE maximizes the sharing of vocabulary across multiple languages and achieves better translation quality.
Any language can be encoded into bytes for representation, and UTF-8 encoding can have a certain degree of interoperability between different languages. This sharing at the underlying byte level may lead to knowledge transfer.

The disadvantages are as follows:

When encoding sequences, the length may be slightly longer than BPE, resulting in higher computational costs. For example, a single Chinese character is split into multiple bytes for representation, leading to increased representation costs.
Because the byte level is a lower granularity than the character level, during decoding, it can lead to ambiguity as to whether a byte comes from a specific character or a single character. In such cases, contextual information and dynamic programming algorithms may be necessary for decoding.

0x06 Development

Next, let’s look at some unique or newer papers related to tokens.

6.1 Better Than Tokens

Traditional language models rely on tokenizers to preprocess data, but tokenization has inherent limitations, including a fixed vocabulary, inefficiency in handling multilingual or noisy data, and biases introduced by compression heuristics. The paper “Byte Latent Transformer: Patches Scale Better Than Tokens” proposes the Byte Latent Transformer (BLT), which challenges this conventional approach. BLT achieves efficient computation by directly modeling the raw byte stream, dynamically grouping them into patches based on entropy. This tokenizer-free approach represents a significant shift in language modeling.

Main contributions

The main contributions of this paper are:

Dynamic patch partitioning: BLT dynamically groups bytes into patches using an entropy-based patch partitioning method, thereby allocating more computing resources to areas with high data complexity.
Extended Research: This paper is the first to conduct an extended study on FLOPs control for byte-level models, demonstrating that BLT can match token-based models (such as Llama 3) in terms of 8B parameters and 4T training bytes, and has a significant advantage in inference efficiency.
Improved inference efficiency: BLT can increase both model size and patch size while maintaining the same inference FLOPs budget, thus achieving higher scalability.
Improved robustness: BLT performs well in handling noisy inputs and character-level tasks such as spell checking and phoneme transcription, demonstrating a better understanding of word structure and character-level information.
By dynamically selecting long patches when data is predictable, BLT improves both training and inference efficiency, while also achieving qualitative improvements in inference and generalization to long-tail data.
Byte-based pre-trained models: The paper also explores a method for rapidly training BLT models by initializing the global Transformer parameters of pre-trained token-based models (such as Llama 3), demonstrating its potential in reducing training FLOPs.

motivation

Existing LLMs are almost entirely end-to-end trained, except for tokenization—a heuristic preprocessing step that groups bytes into static sets of tokens. Tokenization is important because focusing directly on the byte dimension leads to long sequence lengths, making LLMs prohibitively expensive for large-scale training. Using tokens avoids this problem. This tokenization approach, which emphasizes string compression, leads to drawbacks such as domain/modality sensitivity, sensitivity to input noise, and a lack of grammatical knowledge. Previous research has mitigated these issues by employing more efficient self-attention mechanisms or attention-free architectures. However, this primarily benefits training small models. When training large models, the main computational cost of the Transformer is not the attention mechanism, but rather the large FFN running on each byte.

Tokenized LLMs allocate the same amount of computation to each token, trading efficiency for performance. However, tokens are generated using compressed heuristics, which are not always relevant to the complexity of the prediction. The authors of the BLT paper argue that models should dynamically allocate computational resources to meet real-world needs. For example, predicting the endings of most words doesn’t require a large Transformer, as these are relatively simple, low-entropy decisions, while choosing the first word of a new sentence is more difficult.

A patch refers to a dynamically grouped sequence without a fixed vocabulary. A key difference between a patch and a token is that when using a token, the model cannot directly access the underlying byte features. To efficiently allocate computational resources, the authors propose a dynamic, learnable method to group bytes into patches and introduce a novel model architecture that blends byte and patch information. Specifically, this includes the following points:

Dynamic learning. Unlike traditional token-based models, BLT does not have a fixed patch vocabulary. Instead, it learns directly from raw byte data. This avoids the limitations of a static vocabulary and can better handle diverse and noisy inputs.
Entropy-based patching (BLT) dynamically groups bytes into patches based on information complexity, thereby dynamically allocating computational resources. BLT segments data based on the entropy predicted for the next byte, creating contextualized byte groups with relatively uniform information density. This means allocating more computational resources to high-entropy regions (complex inputs) and conserving resources in low-entropy regions.
A novel model architecture is introduced that groups arbitrary byte groups into potential patch representations through lightweight encoder and decoder modules. This mixes byte and patch information.

In standard LLMs, increasing the vocabulary size means a larger average token, resulting in fewer model steps, but also a larger output dimension for the final projection layer. This trade-off limits tokenization-based methods from achieving significant improvements in token size and inference cost. BLT’s key improvement over tokenization-based models is that it redefines the trade-off between vocabulary size and computation. During generation, BLT needs to determine whether the current byte sequence’s steps are at patch boundaries, as this determines whether to invoke more computation through a potential Transformer. Formally, the patching scheme needs to satisfy the properties of incremental patching:

$f_p(x_{<i}) = f_p(x)_{<i}$

For example, BPE is not an incremental patching scheme because the same prefix can be tokenized in different ways based on the continuation sequence, thus not satisfying the above properties.

Since we’ll be discussing the concept of entropy later, we need to address it first. Information entropy measures the uncertainty or randomness of a system; in this case, it refers to the uncertainty of the brain’s internal model of the world. The brain’s goal is to minimize the prediction error between its internal model and sensory input, and reducing information entropy is one way to reduce this error. By reducing information entropy, the brain can make more accurate predictions about the world, which is equivalent to minimizing the system’s free energy. In the pre-training phase, the optimization objective is to minimize cross entropy; for the GPT autoregressive language model, this means correctly predicting the next word. Here, cross entropy is information entropy.

Patch

The patch function divides a byte sequence of length n into a patch sequence of length m < n. Specifically, it is mapped to the set {0,1}, where 1 indicates the start of a new patch. This allows BLT to dynamically allocate resources based on the context. The average size of the patch is a major factor in processing data during training and inference using a given patch function. The three patch functions used in the paper are as follows:

Fixed number of bytes per patch. The most straightforward byte grouping method is fixed-size patches. Fixed strides are easy to implement for both training and inference, provide a simple mechanism for changing the average patch size, and thus make it easy to control FLOP costs. However, this patch function has significant drawbacks. First, computational resources are not dynamically allocated to where they are most needed: predicting only whitespace characters in the code might waste a Transformer step, resulting in insufficient computational resources allocated to information-intensive bytes (such as mathematical symbols). Second, this leads to inconsistencies and non-contextual patching of similar byte sequences, such as the same word being segmented in different ways.
Space patching. Slagle proposed a simple yet effective improvement: creating a new patch after any whitespace byte, which are natural boundaries of linguistic units in many languages. In space patching, a latent Transformer step (i.e., more FLOPs) is assigned to model each word. This ensures that words are patched in the same way throughout the sequence and assigns FLOPs to difficult predictions that are typically followed by whitespace. For example, predicting the first byte of the question “Who composed The Magic Flute?” is much harder than predicting the bytes after “M” because the first character significantly reduces the possible choices, making “Mozart” relatively easy to predict. However, space patching cannot elegantly handle all languages and domains, and most importantly, it cannot change the patch size.
Dynamic entropy patching using a small-byte language model. This approach uses entropy estimation to deduce patch boundaries, employing a data-driven method to identify high-uncertainty next-byte predictions. The authors trained a small-byte-level autoregressive language model to compute the entropy of the next byte under the LM distribution on the byte vocabulary V. A large entropy for the next byte indicates the start of a new patch.

611

The diagram above illustrates different ways of grouping bytes, each resulting in a different number of patches. Since each patch is processed through a large Transformer step, the number of patches directly determines the major portion of the computational overhead (in FLOPs). These schemes group bytes into patches in the following ways:

Grouping every four bytes into a single step, such as MegaByte.
Tokenization is performed using Byte-to-Pair Encoding (BPE).
Entropy-based patch partitioning.
Patch division is based on whitespace bytes.
Entropy is predicted using a small CNN byte-level model with 2-byte context, and then patches are partitioned based on the entropy.

BLT architecture

BLT consists of a large global autoregressive language model that operates on patch representations and two smaller local models. These two smaller local models encode byte sequences into patches and decode patch representations back into bytes.

612

In the diagram above, BLT consists of three modules: a lightweight local encoder that encodes the input bytes into patch representations; a computationally intensive latent Transformer that processes the patch representations; and a lightweight local decoder that decodes the bytes of the next patch. BLT combines the advantages of byte n-gram embeddings and cross-attention mechanisms, thus maximizing the information flow between the latent Transformer and the byte-level modules. Unlike tokenization with a fixed vocabulary, BLT dynamically groups bytes into patches while preserving access to byte-level information.

Potential Global Transformer Model

The Latent Global Transformer is an autoregressive transformer model G. The paper uses the index $j$ to denote a patch and the index $i$ to denote a byte. The global model uses a block causal attention mask, maps $p_j$ to a series of output patch representations $o_j$ .

Partial Encoder

The local encoder model (denoted by $\epsilon$ ) is a lightweight Transformer-based model with $l_\epsilon \ll l_g$ layers. Its main function is to map the input byte sequence to an expressive patch representation $p_j$ . The main difference from the Transformer architecture is the addition of a cross-attention layer after each Transformer layer, which pools the byte representations into patch representations. Its specific operation is as follows:

First, input byte sequence embeddings are built for bytes $b_i$ (optionally using hash embeddings). Then, a series of alternating Transformer layers and cross-attention layers transform these representations into patch representations, which are processed by the global transformer $G$ . These Transformer layers use local block causal attention masks; each byte attends within a fixed preceding window, typically crossing dynamic patch boundaries but not document boundaries.

Local decoder

Similar to the local encoder, the local decoder $D$ is a lightweight transformer-based model with $l_d \ll l_g$ layers that decodes the global patch representation sequence into raw bytes $y_i$ . Because the local decoder predicts the raw byte sequence based on decoded bytes, it also takes hidden representations generated by the local encoder for the byte sequence. In the decoder cross-attention, query and key/value roles are reversed: the byte representation is the query, and the patch representation is the key/value.

Interaction

The diagram below illustrates module interactions. The local encoder uses cross-attention to encode byte representations into patch representations, where patch representation is query and byte representation is key/value. The local decoder uses a similar module with reversed roles: byte representation is query and patch representation is key/value. Here, cross-attention $k=2$ .

613

6.2 Tokenformer

The paper “TOKENFORMER: RETHINKING TRANSFORMER SCALING WITH TOKENIZED MODEL PARAMETERS” mainly explores an innovative, efficient, and scalable Transformer architecture design scheme based on parameter tokenization. This scheme achieves efficient model expansion and computational optimization through parameter tokenization.

The research team introduced TokenFormer to unify the computation of Token-Token and Token-Parameter Interactions. Its Token-Parameter attention is flexible and can handle a variable number of parameters, thereby essentially maximizing the flexibility of Transformer and enhancing the scalability of the model.

Main contributions

Tokenformer eliminates the need to retrain models from scratch when scaling up, significantly reducing costs. Key innovations presented in the paper include:

A fully attention-based architecture. This design is used not only for interactions between tokens, but also for interactions between tokens and model parameters, providing greater architectural flexibility.
Parameter tokenization method. This method treats model parameters as learnable tokens, uses a cross-attention mechanism to manage interactions, and supports dynamic parameter expansion.

motivation

The research team observed that while the Transformer architecture has achieved great success in multiple domains, its scalability is severely limited, primarily due to the use of a fixed linear projection method for token-parameter interaction computation. This linear projection design restricts the model’s flexibility and scalability. Because the parameter sizes of these projection layers are fixed, previous small-scale models cannot be reused when the model size needs to be increased. Instead, the dimensions of these linear projection layers must be changed, requiring the entire model to be retrained, resulting in significant computational overhead.

To overcome this challenge, the authors propose Tokenformer, a new, more flexible architecture based entirely on attention, including token-parameter interaction and support for incrementally scaling the number of model parameters, thereby significantly reducing the overall cost of training large Tokenformer architectures.

Architecture

contrast

The diagram below illustrates the difference between the traditional Transformer and the Token Transformer. For the vanilla Transformer, the input is first computed into the attention block input via linear projection blocks, resulting in the Q, K, and V matrices. This stage involves the interaction between the model parameters and the input tokens, computed using linear projection. Then, self-attention components allow the input tokens to interact with each other, computed through the attention blocks. Finally, a feedforward network (FFN) produces the output for the next layer, again representing the interaction between the tokens and parameters computed using linear projection.

Tokenformer differs. To compute the inputs (Q, K, and V matrices) to the self-attention block, the input token is fed into a new component called token-parameter attention, where parameters are passed in in addition to the input token. The input token represents the query portion, and the parameters represent the key and value portions of the token-parameter attention block. Then, the same self-attention component as in the vanilla Transformer is used. Finally, to prepare the output for the next layer, the paper replaces FFN with another token-parameter attention block. This token-parameter attention block’s query comes from the output of the self-attention block, and the key and value are obtained from the new parameter component.

614

The detailed architecture diagram in the paper illustrates the complete design of Tokenformer. Tokenformer is a completely attention-driven architecture with a novel token parameter attention layer. Pattention uses a set of learnable tokens to represent model parameters, and these learnable tokens can perform attention computation with the input tokens.

In the lower right corner of the architecture diagram, we can see that when we want to incrementally increase the model size by adding new parameters, we essentially expand the existing key-value parameter set by adding more parameter token rows to the key-value matrix of each attention block, while retaining the already trained parameter tokens. Experimental results show that the scaled-up model trains much faster than training from scratch.

615

TokenFormer offers a new perspective on models, suggesting that network computation involves the arbitrary interaction of tokens. Based on these tokens (such as data tokens, parameter tokens, and memory tokens) and attention mechanisms, arbitrary network structures can be flexibly constructed. Therefore, the team hopes that TokenFormer can serve as a universal network structure.

616

Pattention mechanism

The core innovation of Tokenformer is the Token-Parameter Attention (Pattention) Layer. The research team replaced all the linear projection layers in the standard Transformer with a Pattention Layer. Pattention uses a set of trainable tokens as model parameters and manages the interaction between the input token and these parameter tokens through cross attention.

In this way, the attention layer introduces an additional dimension—the number of parameter tokens—that operates independently of the input and output channel dimensions. This decoupling allows the input data to interact dynamically with a variable number of parameters, providing the flexibility needed for incremental model scaling by reusing pre-trained models.

617

The image above compares standard attention and Pattern Retention. Specifically, Pattern Retention uses the input data as the query and introduces two sets of n learnable tokens: one representing the key, and the other the value. In the image, $A$ is the score obtained from $X \cdot K_P^\top / \sqrt{n}$ . $\Theta$ is an improved softmax function to prevent gradient problems caused by exponential gradients; $\tau$ is a scalar scale factor. $f()$ is an arbitrary non-linear function, using GeLU by default.

In this way, the attention layer introduces an additional dimension—the number of parameter tokens—independent of the input and output dimensions. This decoupling allows the input data to interact with a variable number of parameters, providing the flexibility needed for incremental model scaling. Therefore, training larger models is significantly faster, while achieving performance comparable to a Transformer trained from scratch.

The paper compares the standard attention mechanism with the newly proposed Pattention mechanism, which has the following advantages: better gradient stability; support for dynamic parameter expansion; and maintenance of the continuity of the output distribution.

FFN’s Innovation

In Tokenformer, the feedforward network in the traditional Transformer is replaced by two consecutive attention blocks, which are then merged with the input token through residual connections, thus supporting dynamic expansion of model parameters.

Reuse

The flexibility of TokenFormer allows for numerous applications. Let’s take incremental model scaling as an example. Due to the versatile design of the attention layer, it’s well-suited for large-scale model training along the parameter axis, allowing for incremental development of larger models by reusing parameters from smaller pre-trained models. Suppose a TokenFormer has already been trained, with key parameters denoted as $K_P^{old}$ and $V_P^{old}$ . As shown in the diagram, we add newly initialized key-value parameter pairs, denoted as $K_P^{new}$ and $V_P^{new}$ , combining them with the original parameters to form a new key-value set. Then, the attention layer allows the input data to interact with the parameter tokens. Intuitively, each key-value pair represents a learned pattern, forming a large knowledge base. Incremental scaling further expands training on top of this existing knowledge base.

This scaling scheme allows for the integration of any number of parameters without altering the input or output dimensions. As shown in Figure 3, this approach significantly improves the training efficiency of larger-scale models without sacrificing performance. Importantly, by initializing $K_P^{new}$ to zero, similar to LoRA, the model can recover pre-training behavior without losing learned knowledge, promoting faster convergence and accelerating scaling.

618

Summarize

Vanilla Transformer models typically divide the computation required to process a single token into two parts: token-to-token interactions with other tokens and token-parameter interactions involving model parameters. Token-parameter computation mainly relies on a fixed linear projection, which greatly limits model size scaling. Scaling the model usually involves changing the model structure, often requiring training the entire model from scratch, leading to excessive resource consumption and making it increasingly impractical.

TokenFormer breaks away from the traditional distinction between data and model, using the concept of a token to model all computations. That is, it not only tokenizes input data like the original Transformer, but also treats model parameters as tokens. Furthermore, it extends the attention mechanism to the interaction between tokens and model parameters, unifying computation into interactions between various tokens (such as data tokens and parameter tokens) through an attention mechanism. This greatly enhances the flexibility of token-parameter interactions, enabling incremental scaling of new and larger models based on the trained model, thus significantly reducing the training burden.

6.3 LCM

The paper “The Future of AI: Exploring the Potential of Large Concept Models” proposes the concept of Large Concept Models (LCMs). This paper is not only a profound reflection on the limitations of existing Large Language Models (LLMs), but also a forward-looking exploration of the future development path of AI.

question

LLM’s token granularity is not actually a good way to express semantics, and token-based learning is relatively inefficient for learning semantics. Although this “word-by-word prediction” model has been successful in many tasks, it is prone to the problem of “not seeing the forest for the trees” when dealing with long texts and complex concepts.

Because the human brain doesn’t operate at the word level. Human thought is clearly hierarchical. A very apt example online is this: you don’t learn how to drive from Beijing to Guangzhou by learning how to steer at every intersection. How to steer at each intersection is an overly granular unit, which corresponds to the token in this context.

Previously, with the improvement of inference hardware performance and the emergence of various optimization methods, it seemed that the impact of excessively small token granularity on performance was no longer so significant. However, with the recent realization of new technologies such as inference-time computation, reinforcement learning has become increasingly important in various fields of AI. This requires models to possess capabilities in the semantic space, not just those acquired from the perspective of the token sequence space. Therefore, it is necessary to find a better modeling method that is closer to semantic granularity than tokens.

motivation

Inspired by the high-level thinking behind human communication, Meta AI researchers have proposed a new paradigm for language modeling: LCM (Big Concept Model), which decouples language representation from reasoning. In short, LCM discards tokens and instead uses higher-level “concepts” to model reasoning in a “sentence embedding space,” directly manipulating high-level explicit semantic representations and completely freeing reasoning from the constraints of language and modality. The new system will no longer simply predict based on the next token, but will understand the world through observation and interaction, much like infants and small animals.

Why are “concepts” needed? This is because existing LLM models lack a crucial characteristic of human intelligence: explicit reasoning and planning at multiple levels of abstraction. For example, when solving a complex task or writing a long document, humans typically employ a top-down process: first planning the overall structure at a higher level, then gradually adding details at lower levels of abstraction. Models with explicit hierarchical structures are better suited for creating long outputs. Current language models, such as the well-known GPT, while capable of writing poetry, coding, and chatting, are essentially still “guessing” word by word. Imagine a parrot that can only recite but doesn’t understand the meaning; it can speak fluently but lacks true comprehension.

The emergence of LCM aims to break this pattern. LCMs no longer focus on “What is the next word?”, but instead consider “What is the core concept of this sentence, this paragraph, or even the entire article?” This indicates that AI’s “thinking” mode is undergoing a qualitative leap from “words” to “concepts”.

Ideas

The paper restricts the level of abstraction to two types: subword tokens and concepts. A “concept” is defined as an indivisible, holistic “abstract atomic insight.” In reality, a concept often corresponds to a sentence in a text document, or an equivalent audio segment. The authors argue that sentences, compared to words, are the appropriate unit for achieving language independence. This contrasts sharply with current token-based LLM technologies.

The core of LCM lies in its shift from focusing solely on predicting the next word to thinking at a higher semantic level—the “concept.” It treats sentences as conceptual units and represents these concepts using a sentence embedding technique called SONAR. This means LCM no longer deals with individual words or predicts word-by-word like traditional language models, but considers the meaning of the entire sentence. The goal of LCM is to predict the embedding vector of the next sentence, which is the next “concept.” This approach better captures the overall semantic structure of the text, enabling the model to reason at a higher level of abstraction.

For example in the sentence:

Tim 并不擅长运动，他认为如果参加一项运动就会有所改变，他尝试加入几个团队，但没有一个团队录取他。

The different concepts will be:

Tim 并不擅长运动。他认为如果参加一项运动就会有所改变。他尝试加入几个团队。但没有一个团队录取他。

We can see that each concept represents an idea in the sentence.

The new method differs from token-level processing, moving closer to (hierarchical) reasoning in an abstract space. Context is expressed within the abstract space designed by LCM, but this abstract space is language- or modal-agnostic. That is, it models the basic reasoning process at a purely semantic level, rather than modeling instances of reasoning within a specific language. Specifically, LCM can be constructed using only encoders and decoders with fixed-length sentence embedding spaces, and the processing flow is very simple:

First, the input content is segmented into sentences, and then an encoder is used to encode each sentence to obtain a concept sequence, i.e., sentence embedding.
Then, the Large Concept Model (LCM) processes the concept sequence and generates a new concept sequence at the output.
Finally, the decoder decodes the generated concepts into a sequence of subwords.

Overall Architecture

Training a large concept model requires training a new embedding space based on a decoder and encoder in the sentence embedding space, optimized for the inference architecture. This paper uses its open-source SONAR as the decoder and encoder for sentence embeddings. In other words, the core component of LCM is the sentence embedding model SONAR. SONAR is a powerful multilingual, multimodal sentence representation model that supports over 200 languages and speech inputs. LCM operates in the SONAR embedding space, meaning that both the input and output of LCM are SONAR embedding vectors, rather than discrete words. This modeling approach based on a continuous vector space brings many advantages to LCM.

Through SONAR, LCMs are able to reason at the conceptual level, rather than simply permuting words. For example, when an LCM processes the sentence “global warming causes sea level rise,” it not only understands the meaning of each word, but more importantly, it understands the concepts of “global warming” and “sea level rise,” as well as the causal relationship between them.

619

Left: Visualization of reasoning in the concept embedding space (summarizing task). Right: Basic architecture of a large concept model (LCM).

The SONAR decoder and encoder (blue parts in the diagram) are fixed and do not require training. The concepts output by the LCM (green parts in the diagram) can be decoded into other languages or modalities without having to perform the entire reasoning process from scratch. Similarly, a specific reasoning operation, such as induction, can be performed in zero-shot mode on inputs of any language or modality because reasoning only needs to operate on concepts.

In short, LCM neither possesses information about the input language or modality, nor does it generate output in a specific language or modality.

detail

SONAR Embedded Space

The SONAR text embedding space is trained using an encoder/decoder architecture, replacing cross-attention with a fixed-size bottleneck, as shown in the figure below.

620

To explore best practices for language modeling in the SONAR space, researchers at Meta AI designed several LCM architecture variants.

Base-LCM

The baseline architecture for next concept prediction is Base-LCM, a basic model based on the Transformer decoder. It takes the SONAR embedding of the previous sentence as input (antecedent concept) and predicts the embedding of the next sentence (concept). This architecture is simple, straightforward, and easy to understand and implement.

As shown in the figure below, Base-LCM is equipped with “PostNet” and “PreNet”. PreNet normalizes the input SONAR embeddings and maps them to the hidden dimensions of the model.

Base-LCM learns on semi-supervised tasks. The model predicts the next concept and optimizes the parameters by optimizing the distance between the predicted next concept and the true next concept, which is to optimize the parameters through MSE regression.

621

Diffusion-based LCM

Diffusion-based LCM is a generative latent variable model that learns a model distribution to approximate the data distribution $q$ . Similar to basic LCM, diffusion LCM modeling can be viewed as an autoregressive model that generates a concept in a document each time.

Specifically, at position $n$ in the sequence, the model predicts the probability of a certain concept at this position, conditioned on all previous concepts.

One-Tower Diffusion LCM

This model introduces the idea of a diffusion model, generating the embedding of the next sentence by progressively adding noise and then denoising it. This approach can generate more diverse and creative text.

As shown in the left figure above, the single-tower diffusion LCM consists of a Transformer backbone, whose task is to predict the next sentence embedding given the sentence embedding and the noisy input.

Two-Tower Diffusion-LCM

As shown on the right side of the diagram above, this model separates the encoder and decoder. The encoder is responsible for processing contextual information, while the decoder is responsible for generating the embedding for the next sentence. This architecture is more similar to traditional sequence-to-sequence models and can better capture long-distance dependencies.

The first model, the context labeling model, takes the context vector as input and encodes it causally. The output of the context analyzer is then fed into the second model, the denoiser. The denoiser predicts the next sentence embedding by iteratively denoising the latent Gaussian variables.

622

Quant-LCM

To improve computational efficiency, the model quantizes the SONAR space, converting continuous embedding vectors into discrete codebooks. This approach can significantly reduce computational costs without sacrificing too much performance.

In the field of image or speech generation, there are currently two main approaches to handling continuous data generation: one is diffusion modeling, and the other is to first learn and quantize the data and then model it on these discrete units. Furthermore, the text modality remains discrete; although it deals with continuous representations in SONAR space, all possible text sentences (fewer than a given number of characters) are point clouds in SONAR space, not a truly continuous distribution. These considerations prompted the authors to explore quantizing SONAR representations and then modeling them on these discrete units to solve the next sentence prediction task. Finally, this approach allows for the natural use of temperature, top-p, or top-k sampling to control the level of randomness and diversity in the sampling of the next sentence representation.

6.4 Action Tokenizer

The paper “FAST: Efficient Action Tokenization for Vision-Language-Action Models” proposes an efficient robot action tokenization method that represents actions using discrete tokens, much like language. This allows robotics to seamlessly integrate with autoregressive Transformer training processes, improves the transferability from pre-trained large-scale internet data, and enhances the robot’s ability to execute language commands.

FAST uses a compression algorithm based on Discrete Cosine Transform (DCT) to improve the training speed of VLA models. DCT is a frequency domain transform that is commonly used in compression algorithms due to its simplicity and computational efficiency, such as JPEG image compression and MP3 audio encoding and decoding.

FAST first normalizes the input actions, then applies Discrete Cosine Transform (DCT) to each action dimension, and finally uses Byte-Pair Encoding (BPE) to compress the DCT matrix. By combining DCT and Byte-Pair Encoding (BPE), the original action blocks can be compressed into fewer but denser action tokens.

623

Each action block typically contains 30-60 tokens, resulting in a 10x improvement in compression compared to previous action tokenization methods.

624

The diagram above provides an overview of the FAST action tokenization pipeline. Given a normalized block of actions, we apply a Discrete Cosine Transform (DCT) to convert the signal to the frequency domain. We then quantize the DCT coefficients and use Byte-Pair Encoding (BPE) to compress the flattened sequence of DCT coefficients for each dimension into the final action token sequence. See Section VB for a detailed explanation.

0xFF Reference

Byte Latent Transformer: Patches Scale BetterThan Tokens——字节潜在Transformer：patch比token更高效 Together_CZ
Byte Pair Encoding and WordPiece Model详解 Yuki
BytePiece：更纯粹、更高压缩率的Tokenizer 苏剑林
FAST: Efficient Action Tokenization for Vision-Language-Action Models
https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt#implementing-wordpiece
Huggingface详细教程之Tokenizer库基本粒子
JAPANESE AND KOREAN VOICE SEARCH
Large Concept Models: Language Modeling in a Sentence Representation Space
LLM时代Transformer中的Positional Encoding
LLM还没研究透，LCM又来了 Alex [算法狗]
Luke：深入理解NLP Subword算法：BPE、WordPiece、ULM Luke
Neural Machine Translation of Rare Words with Subword Units
Neural Machine Translation of Rare Words with Subword Units
Neural Machine Translation with Byte-Level Subwords
NLP 中的Tokenizer：BPE、BBPE、WordPiece、UniLM 理论
NLP-Tokenizer-BPE算法原理及代码实现爱喝热水的lucky
NLP三大Subword模型详解：BPE、WordPiece、ULM
NLP中的Tokenization
NLP分词模型：BPE、WordPiece、ULM、SentencePiece
Rethinking LLM Language Adaptation: A Case Study on Chinese Mixtral
Robin3D: Improving 3D Large Language Model via Robust Instruction Tunin
Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning
Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies
TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
Tokenization不存在了？Meta最新研究，无需Tokenizer的架构来了 [PaperWeekly]
Tokenization，再见！Meta提出大概念模型LCM，1B模型干翻70B？新智元
UC伯克利等提出具身智能「动作Tokenizer」，效率飙升5倍！新智元
【OpenLLM 008】大模型基础组件之分词器-万字长文全面解读LLM中的分词算法与分词器（tokenization & tokenizers）：BPE/WordPiece/ULM & beyond
【OpenLLM 008】大模型基础组件之分词器-万字长文全面解读LLM中的分词算法与分词器（tokenization & tokenizers）：BPE/WordPiece/ULM & beyond OpenLLMAI
从2019年到现在，是时候重新审视Tokenization了机器之心
大模型中的分词器tokenizer
大模型中的分词器tokenizer：BPE、WordPiece、Unigram LM、SentencePiece
智能连接：碳原子与Token 伍鹏 [AI的无限游戏]
机器如何认识文本？NLP中的Tokenization方法总结
深入理解NLP Subword算法：BPE、WordPiece、ULM
理解NLP最重要的编码方式 — Byte Pair Encoding (BPE)，这一篇就够了
https://arxiv.org/pdf/2412.08821
https://github.com/karpathy/minbpe
https://huggingface.co/learn/nlp-course/chapter6/5%3Ffw%3Dpt
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
Kudo, Taku. “Subword regularization: Improving neural network translation models with multiple subword candidates.” arXiv preprint arXiv:1804.10959 (2018)
Neural Machine Translation of Rare Words with Subword Units Rico Sennrich, Barry Haddow, Alexandra Birch

0x00 Overview

0x01 Basic Concepts

1.1 Word segmentation

1.2 token

1.3 tokenizer

1.4 Vocabulary

1.6 Word segmentation process

Standardization

presegmentation

Model processing

Post-processing

0x02 Vocabulary

2.1 Constructing a vocabulary

2.2 Using a vocabulary list

2.3 Vocabulary Size

Task related

Advantages

Disadvantages

0x03 Tokenizer

3.1 Word segmentation granularity

By word granularity

By character granularity

By sub-word granularity

How to choose

3.2 Common Tokenizers

3.3 Llama3 Example

0x04 BPE

4.1 Approach

4.2 Algorithm

4.3 Example Analysis

Statistical frequency

Initial segmentation

Constructing an initial vocabulary

Iterative learning combined with rules

First iteration

Statistical character pair frequency

Merge the most frequent character pairs

Second iteration

Subsequent iterations

summary

4.4 Use

coding

4.5 MINBPE

Basic functions

4.5 Advantages and Disadvantages

0x05 Other Algorithms

5.1 WordPiece

5.2 UniLM

5.3 BBPE

0x06 Development

6.1 Better Than Tokens

6.2 Tokenformer

6.3 LCM

6.4 Action Tokenizer

0xFF Reference