Exploring the Transformer Series (3) --- Data Processing
Transformer data processing pipeline: dataset choices, vocabulary/tokenizers, batch construction, masks, and training data loading in Harvard code.
The next three articles focus more on engineering and are a bit shorter; you can treat them as dessert.
0x00 Summary
Some researchers believe that the cognitive framework of the large model looks very close to the Bayesian brain described by Karl Friston. Based on Bayesian probability theory and biophysical principles, the brain’s main goal is to predict and control external information in order to minimize uncertainty and internal entropy.
The brain builds internal models by constantly collecting and processing external information to predict and control the external world. Massive amounts of text or multimodal corpora constitute the basic information about the external world that large models need to understand. Pre-training involves extracting the probability distribution of information from the corpus data at different scales. Increasing the amount of training data, the number of model parameters, and the training time all enrich the information content of the large model in a particular problem domain and reduce the information entropy on the test set, thus broadening its knowledge. Fine-tuning is similar to using new corpora and methods to perturb the relevant parameters within the model, promoting its entry into a more ordered space and achieving controllable and predictable emergence. Therefore, data and data processing determine the upper limit of large models.
This chapter analyzes the data processing portion of the Harvard code to gain a deeper understanding of the Transformer as a whole. Additionally, this section will cover vocabulary and tokenizers, so we’ll provide some preliminary explanations of some concepts to aid reader comprehension.
- Tokenization: Dividing a sentence into individual words according to certain rules. For example, tokenization can be based on punctuation marks or grammatical rules.
- Token: A token is the result of word segmentation, which is the smallest semantic unit. A token can be a word, a Chinese character, or a special character that represents a whitespace character, an unknown character, or a sentence-initial character.
- Vocabulary (vocab): A vocabulary is the set of unique tokens that an LLM can recognize; it defines the mapping between tokens and integer ids. The vocabulary must be built before the model is trained.
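To make the three concepts concrete, here is a minimal sketch. The regex rule, the special-token names, and the helper names (tokenize, build_vocab, encode) are assumptions made for illustration; they are not the spaCy tokenizers or the vocabulary class used in the Harvard code.

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens (a toy rule, not spaCy)."""
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(sentences, specials=("<bos>", "<eos>", "<pad>", "<unk>")):
    """Map each unique token to an integer id; special tokens get the first ids."""
    vocab = {tok: idx for idx, tok in enumerate(specials)}
    for sentence in sentences:
        for tok in tokenize(sentence):
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(text, vocab):
    """Convert text to ids, falling back to <unk> for tokens not in the vocabulary."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]

corpus = ["Two men are at the stove.", "A man is cleaning a window."]
vocab = build_vocab(corpus)
print(tokenize(corpus[0]))
print(encode("Two men are cleaning the door.", vocab))
```

The word "door" never appears in the toy corpus, so it is mapped to the id of <unk>; this is why the vocabulary has to be built from representative data before training.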
0x01 Overall Process
The diagram below illustrates a common data processing flow in LLM, which includes quality filtering, deduplication, privacy reduction, tokenization, and data blending. This is actually the most complex part of the LLM workflow.

Harvard’s code is much simpler. We’ll first present a simplified version of the training code. It mainly consists of two steps:
- Create separate data loaders for loading training and validation data.
- The run_epoch() function is called to run the training steps iteratively. Each run uses a data loader to load data.
The specific code is as follows.
def train_worker(
    gpu,
    ngpus_per_node,
    vocab_src,
    vocab_tgt,
    spacy_de,
    spacy_en,
    config,
    is_distributed=False,
):
    # Simplified excerpt: model, module, criterion, optimizer, lr_scheduler,
    # pad_idx and train_state are created in the part of the function omitted here.
    train_dataloader, valid_dataloader = create_dataloaders(
        gpu,
        vocab_src,
        vocab_tgt,
        spacy_de,
        spacy_en,
        batch_size=config["batch_size"] // ngpus_per_node,
        max_padding=config["max_padding"],
        is_distributed=is_distributed,
    )
    for epoch in range(config["num_epochs"]):
        _, train_state = run_epoch(
            (Batch(b[0], b[1], pad_idx) for b in train_dataloader),
            model,
            SimpleLossCompute(module.generator, criterion),
            optimizer,
            lr_scheduler,
            mode="train+log",
            accum_iter=config["accum_iter"],
            train_state=train_state,
        )
The image below shows a further simplification of the code. Subsequent images will only show the code related to the training dataset, excluding the code related to the validation set.

0x02 Dataset
Let’s take a look at the dataset next.
2.1 Industry Practices
Common datasets
In practice, researchers often need to use a mix of different data sources for LLM pre-training, rather than a single corpus, typically including academic literature, books, web page content, and programming code. Therefore, existing research usually mixes several off-the-shelf datasets (e.g., C4, OpenWebText, and Pile) and then processes them further to obtain the pre-training corpus. Furthermore, to train an LLM adapted to a specific application, it is also important to extract data from relevant sources (such as Wikipedia and BigQuery) to enrich the pre-training data with relevant information. Only by providing sufficient corpus can the information entropy of the probability space be reduced to a certain threshold, thereby achieving a phase transition for a particular task. Below are some common datasets.

The following figure shows the architecture, training corpus, and training objectives of different pre-trained models.

Data source ratio
Data mixing strategies are crucial for training. To balance different types of data, researchers often use large models to classify the data and then adjust the data distribution for different categories. For example, they might adjust sampling weights based on quality metrics such as knowledge depth and helpfulness. Alternatively, they might employ a balanced sampling strategy to ensure the priority of high-quality content while preserving diverse categories. This ensures the model can learn from various types of data, avoiding bias caused by an overabundance of data from certain domains. The following figure shows the ratio of data sources in existing LLM pre-training.

Data governance
Because biases and errors in the corpus can cause large models to learn distorted external information, comprehensive data governance of the corpus is essential. It must be rich and detailed, yet unbiased. To ensure high-quality data, many LLMs employ various processing strategies during training, such as:
- Data quality enhancement. Document quality is rigorously evaluated by combining rule-based cleaning and deduplication procedures. This often involves intelligently filtering pre-trained data using a previous-generation model to assess document coherence, conciseness, educational value, helpfulness, knowledge richness, and category relevance. This approach not only improves data quality but also enhances the model’s ability to handle multilingual data.
- Data format optimization. For example, for dialogue and question-and-answer data, a nested document format can be used, employing flexible templates to balance natural understanding and structural consistency. This design ensures the model’s generalization ability across multiple interaction modes.
- Data synthesis. This involves using other models to generate high-quality synthetic data. Furthermore, these synthetic data are further filtered using other models to ensure their quality and relevance. This method not only expands the scale of the training data but also guarantees high quality and diversity.
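As a rough illustration of the rule-based cleaning and deduplication mentioned in the first item, the sketch below keeps a document only if it passes two simple rules and is not an exact duplicate after normalization. The thresholds and helper names are hypothetical; production pipelines use far richer rules, fuzzy deduplication, and model-based scoring.

```python
import hashlib

def normalize(doc):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(doc.lower().split())

def passes_rules(doc, min_words=5, max_symbol_ratio=0.3):
    """Hypothetical quality rules: enough words, not dominated by symbols."""
    words = doc.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for ch in doc if not (ch.isalnum() or ch.isspace()))
    return symbols / len(doc) <= max_symbol_ratio

def clean_corpus(docs):
    """Keep documents that pass the rules and are not exact duplicates."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen or not passes_rules(doc):
            continue
        seen.add(digest)
        kept.append(doc)
    return kept

docs = [
    "Two men are at the stove preparing food.",
    "two men are at the   stove preparing food.",  # duplicate after normalization
    "@@@ ### $$$ %%% ^^^ &&&",                     # mostly symbols
    "Too short.",                                  # too few words
]
print(clean_corpus(docs))
```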
2.2 Harvard Dataset
The Harvard code trains a model on the Multi30K dataset to translate German sentences into English. Multi30K is an extension of the Flickr30K dataset (Young et al., 2014), containing 31,014 German translations of English descriptions and 155,070 independently collected German descriptions.
- For a description of the dataset, please see https://github.com/multi30k/dataset
- For PyTorch documentation, please refer to https://pytorch.org/text/stable/_modules/torchtext/datasets/multi30k.html
The dataset consists of three files: mmt16_task1_test.tar.gz, training.tar.gz, and validation.tar.gz. Opening training.tar.gz reveals two files: train.de and train.en. Each file contains 29,000 lines of German and English text, respectively, excerpted below:
train.de
Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.
Mehrere Männer mit Schutzhelmen bedienen ein Antriebsradsystem.
Ein kleines Mädchen klettert in ein Spielhaus aus Holz.
Ein Mann in einem blauen Hemd steht auf einer Leiter und putzt ein Fenster.
Zwei Männer stehen am Herd und bereiten Essen zu.
train.en
Two young, White males are outside near many bushes.
Several men in hard hats are operating a giant pulley system.
A little girl climbing into a wooden playhouse.
A man in a blue shirt is standing on a ladder cleaning a window.
Two men are at the stove preparing food.
The dataset was used twice: once when building the vocabulary and once when training to build the batch.
0x03 Loading Functional Modules
Several data-related global variables were built in the Harvard source code to store the loaded tokenizer, dictionary, and model, respectively.
spacy_de, spacy_en = load_tokenizers()  # load the tokenizers
vocab_src, vocab_tgt = load_vocab(spacy_de, spacy_en)  # build or load the vocabularies
model = load_trained_model()  # load the model (it reads the globals defined above)
3.1 Loading the Model
The load_trained_model() function is responsible for loading the model, and parameters such as batch_size are set here. If this function cannot find a model to load during execution, it will call the train_model() function to train a model.
def load_trained_model():
    config = {
        "batch_size": 2,
        "distributed": False,  # no distributed training
        "num_epochs": 8,
        "accum_iter": 10,  # update the model parameters once every 10 batches
        "base_lr": 1.0,  # base learning rate
        "max_padding": 10,  # maximum sentence length
        "warmup": 3000,  # warm up for 3000 steps, after which the learning rate decays
        "file_prefix": "multi30k_model_",
    }
    model_path = "multi30k_model_final.pt"
    if not exists(model_path):
        train_model(vocab_src, vocab_tgt, spacy_de, spacy_en, config)
    # Initialize the model
    model = make_model(len(vocab_src), len(vocab_tgt), N=6)
    # Load the model parameters from the model file
    model.load_state_dict(torch.load(model_path))
    return model
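The "base_lr" and "warmup" entries only make sense together with a learning-rate schedule. The sketch below assumes the schedule from the original Transformer paper (linear warmup followed by inverse-square-root decay) and a model dimension of 512; both are assumptions here, since this section does not show the scheduler code itself.

```python
def rate(step, model_size, factor, warmup):
    """Learning-rate multiplier: linear warmup, then inverse-sqrt decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 on the very first call
    return factor * model_size ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

d_model = 512  # assumed model dimension
warmup = 3000  # taken from the config above
for step in (1, 1500, 3000, 6000, 12000):
    print(step, rate(step, d_model, factor=1.0, warmup=warmup))
```

Under these assumptions the multiplier grows until step 3000 and shrinks afterwards, which matches the comment on the "warmup" entry.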
3.2 Loading the Tokenizers
The load_tokenizers() function loads the German and English tokenization models. spaCy is a Python library for text processing that provides tokenization; for more information, see https://spacy.io/ and https://github.com/explosion/spaCy.
import os
import spacy

def load_tokenizers():
    # Download each spaCy model on first use if it is not installed yet
    try:
        spacy_de = spacy.load("de_core_news_sm")
    except IOError:
        os.system("python -m spacy download de_core_news_sm")
        spacy_de = spacy.load("de_core_news_sm")
    try:
        spacy_en = spacy.load("en_core_web_sm")
    except IOError:
        os.system("python -m spacy download en_core_web_sm")
        spacy_en = spacy.load("en_core_web_sm")
    return spacy_de, spacy_en
3.3 Loading the vocabulary
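The load_vocab() function returns the source and target vocabularies. If the file vocab.pt does not exist, it builds the vocabularies with build_vocabulary() and saves them; otherwise it loads them directly from the file. The specific code for the load_vocab() function is as follows.
def load_vocab(spacy_de, spacy_en):
    # Build the vocabularies if the file does not exist; otherwise load them directly
    if not exists("vocab.pt"):
        vocab_src, vocab_tgt = build_vocabulary(spacy_de, spacy_en)
        torch.save((vocab_src, vocab_tgt), "vocab.pt")
    else:
        vocab_src, vocab_tgt = torch.load("vocab.pt")
    return vocab_src, vocab_tgt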
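The body of build_vocabulary() is not shown in this section. As a rough idea of what building a vocabulary involves, here is a plain-Python sketch. The order of the first three special tokens follows the ids used later in collate_batch() (0 for <s>, 1 for </s>, 2 for <blank>); the position of <unk>, the min_freq threshold, and the function names are assumptions.

```python
from collections import Counter

SPECIALS = ["<s>", "</s>", "<blank>", "<unk>"]

def build_vocab(token_lists, min_freq=2):
    """Assign ids to special tokens first, then to tokens seen at least min_freq times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = list(SPECIALS)
    # Sort by descending frequency, then alphabetically, so the result is deterministic
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            itos.append(tok)
    return {tok: idx for idx, tok in enumerate(itos)}

def lookup(vocab, tokens):
    """Map tokens to ids; rare or unseen tokens fall back to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in tokens]

token_lists = [
    ["Zwei", "Männer", "stehen", "am", "Herd"],
    ["Zwei", "junge", "Männer", "sind", "im", "Freien"],
]
vocab = build_vocab(token_lists)
print(vocab)
print(lookup(vocab, ["Zwei", "Männer", "putzen"]))
```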
0x04 Loading Data
The create_dataloaders() function defines data loaders, the most important part of which is the collate_fn() function. This function uses the collate_batch() function to define the batch building functionality, that is, to aggregate several data into a batch.
def create_dataloaders(
    device,
    vocab_src,  # source vocabulary (German)
    vocab_tgt,  # target vocabulary (English)
    spacy_de,  # German tokenizer
    spacy_en,  # English tokenizer
    batch_size=12000,  # batch size
    max_padding=128,  # maximum padded sentence length
    is_distributed=True,
):
    # German tokenization function; it calls the German tokenizer on a sentence
    def tokenize_de(text):
        return tokenize(text, spacy_de)

    # English tokenization function; it calls the English tokenizer on a sentence
    def tokenize_en(text):
        return tokenize(text, spacy_en)

    # Batch-building function: gathers several samples into one batch
    def collate_fn(batch):
        return collate_batch(
            batch,
            tokenize_de,
            tokenize_en,
            vocab_src,  # source vocabulary (German)
            vocab_tgt,  # target vocabulary (English)
            device,
            max_padding=max_padding,
            pad_id=vocab_src.get_stoi()["<blank>"],
        )

    # Load the dataset
    train_iter, valid_iter, test_iter = datasets.Multi30k(
        language_pair=("de", "en")
    )
    # Convert train_iter to a map-style dataset
    train_iter_map = to_map_style_dataset(
        train_iter
    )  # DistributedSampler needs a dataset len()
    train_sampler = (
        DistributedSampler(train_iter_map) if is_distributed else None
    )
    valid_iter_map = to_map_style_dataset(valid_iter)
    valid_sampler = (
        DistributedSampler(valid_iter_map) if is_distributed else None
    )
    # Build the training data loader
    train_dataloader = DataLoader(
        train_iter_map,
        batch_size=batch_size,
        shuffle=(train_sampler is None),
        sampler=train_sampler,
        collate_fn=collate_fn,
    )
    # Build the validation data loader
    valid_dataloader = DataLoader(
        valid_iter_map,
        batch_size=batch_size,
        shuffle=(valid_sampler is None),
        sampler=valid_sampler,
        collate_fn=collate_fn,
    )
    return train_dataloader, valid_dataloader
We mentioned earlier when loading the vocabulary that the vocabulary is passed as an argument to the collate_batch() function. Now we’re also saying that the data loader uses the collate_batch() function to load data. It seems the collate_batch() function is the core of the process, so we’ll analyze how it loads batches.
4.1 Padding
Deep learning models require input data of a fixed size, but this is difficult to achieve in the field of NLP (Natural Language Processing) because the input text is usually of variable length, making it hard to find multiple sentences of the same length. Therefore, sentences of varying lengths are inevitably grouped into a single batch. To accommodate this input pattern and enable models to handle text of different lengths, we need to align the input sequences during dataset generation to ensure that all sequences within the same batch have the same length. Specifically:
- Sequences that are too long are truncated: the tokens on the left are kept and the excess on the right is discarded.
- Sequences that are too short are padded with a meaningless special token until they reach the maximum length.
In this way, all text sequences will have the same length, allowing them to be input into the model as a uniform batch for processing. The Harvard code uses padding and truncation.
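The two rules can be sketched on plain lists of token ids. The helper name pad_or_truncate and the pad id of 2 are assumptions chosen for illustration; the Harvard code performs the same operation on tensors.

```python
def pad_or_truncate(ids, max_len, pad_id=2):
    """Return exactly max_len ids: keep the leftmost tokens, pad on the right."""
    if len(ids) >= max_len:
        return ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

batch = [
    [0, 3, 5, 6, 1],           # shorter than max_len: padded
    [0, 7, 1],                 # much shorter: padded more
    [0, 4, 8, 9, 4, 6, 7, 1],  # longer than max_len: truncated
]
padded = [pad_or_truncate(ids, max_len=6) for ids in batch]
for row in padded:
    print(row)
```

Every row now has length 6, so the rows can be stacked into one rectangular batch. Note that truncation removes the end of the third sentence, including its end-of-sentence id.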
Optimization
Because padding tokens carry no semantic information and serve only to fill out the sequence length, processing them costs computation without contributing anything. This has motivated optimizations such as No Pad: the attention operator is modified so that all text sequences taking part in the computation are concatenated end to end into one very long input sequence. To mark where each text sequence starts and ends, an additional sequence records the lengths of the concatenated texts. This technique can effectively reduce the computation required for model inference.
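The bookkeeping behind this idea can be sketched without any attention kernel. The function names and the cu_seqlens name for the cumulative boundaries are assumptions for illustration; the modified attention operator itself is not shown.

```python
from itertools import accumulate

def pack_sequences(seqs):
    """Concatenate sequences end to end and record cumulative boundaries."""
    packed = [tok for seq in seqs for tok in seq]
    cu_seqlens = [0] + list(accumulate(len(seq) for seq in seqs))
    return packed, cu_seqlens

def unpack_sequences(packed, cu_seqlens):
    """Recover the original sequences from the packed form."""
    return [packed[s:e] for s, e in zip(cu_seqlens[:-1], cu_seqlens[1:])]

seqs = [[0, 3, 5, 1], [0, 7, 1], [0, 4, 8, 9, 6, 1]]
packed, cu_seqlens = pack_sequences(seqs)
padded_size = len(seqs) * max(len(seq) for seq in seqs)
print(cu_seqlens)
print(len(packed), "tokens packed vs", padded_size, "tokens padded")
```

In this toy batch the packed form holds 13 tokens while the padded form would hold 18, so five padding positions never enter the computation.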
Left padding
Padding can be applied on the left or on the right. BERT uses right padding, while most current LLMs use left padding. The reason is that most LLMs take the logits of the last token to predict the next token. With right padding, the last position of a shorter sequence holds <PAD>, so in those cases the model would use the logits of <PAD> to predict the next token, which gives an incorrect result. For example, suppose we want to generate the two-line verse “飞雪连天射白鹿,笑书神侠倚碧鸳”. We have already generated the first line, “飞雪连天射白鹿” (“Flying snow connects the sky, shooting white deer”), and we need to generate the next token.
- With left padding, the input is “<PAD><PAD>Flying snow connects the sky, shooting white deer”. The token “deer” is used to predict the next token.
- With right padding, the input is “Flying snow connects the sky, shooting white deer<PAD><PAD>”. Here <PAD> is used to predict the next token.
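The difference is easy to check by looking at which token ends up in the last position of each row. This is a toy sketch on word lists; the prompts and helper names are made up for illustration.

```python
PAD = "<PAD>"

def pad_left(tokens, max_len):
    """Put the padding before the tokens."""
    return [PAD] * (max_len - len(tokens)) + tokens

def pad_right(tokens, max_len):
    """Put the padding after the tokens."""
    return tokens + [PAD] * (max_len - len(tokens))

prompts = [
    ["flying", "snow", "shooting", "white", "deer"],
    ["laughing", "book"],
]
max_len = max(len(p) for p in prompts)
for p in prompts:
    print("left :", pad_left(p, max_len)[-1], "| right:", pad_right(p, max_len)[-1])
```

With left padding the last position of every row is a real token, so reading the logits at index -1 works for the whole batch; with right padding the shorter prompt ends in <PAD>.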
4.2 Batch Class
The source code uses the Batch class to implement the batch concept. The Batch class combines a batch of source language sentences and target language sentences together and generates corresponding masks based on the data. When reading sentences from the dataset, a special token (usually <bos> or <PAD>) is added to the beginning of each sentence, and another special token (<eos>) is added to the end. Here, we assume a batch size of 8, a maximum sentence length of 32, and that the special characters have the following indices in the vocabulary: 0 for <bos>, 1 for <eos>, and 2 for <PAD>.
Member variables
The key member variables of the Batch class are as follows:
- src: The batch of source-language sentences. src has the shape [batch size, max_seq_len], where each sentence is the sequence of vocabulary indices of the tokens in the original sentence. An example of a single sentence is [0, 3, 5, 6, ..., 7, 1, 2, 2], where 0 is <bos>, 1 is <eos>, and 2 is <PAD>. Therefore, 3, 5, 6, …, 7 are the actual sentence content. max_seq_len is the maximum sentence length.
- tgt: The batch of target-language sentences. The logic is similar to src, but tgt can be empty because target-language sentences are not needed during inference.
- tgt_y: The ground-truth labels for the target-language sentences. During training, the token the decoder predicts at each position is compared with the true next token, so tgt_y is a copy of tgt shifted one position to the left.
- src_mask: The mask for the source-language sentences. It covers the <PAD> positions of src so that <PAD> is excluded from the attention computation.
- tgt_mask: The mask for the target-language sentences. Like src_mask, it covers <PAD>; it additionally hides future tokens.
The code for Batch is as follows.
class Batch:
    """Object for holding a batch of data with mask during training."""

    def __init__(self, src, tgt=None, pad=2):  # 2 = <blank>
        self.src = src  # batch of source-language sentences
        # Build the source mask so that padding is ignored. unsqueeze(-2) adds a
        # dimension so the mask can be broadcast against the attention scores
        # when the masking is applied later.
        self.src_mask = (src != pad).unsqueeze(-2)
        # There is no target sentence at prediction time; there is one during training
        if tgt is not None:  # if target-language data exists
            # Drop the last token of tgt. tgt holds the decoder input, which should
            # not end with <eos>. For "<bos>新年好<eos>", self.tgt becomes "<bos>新年好".
            self.tgt = tgt[:, :-1]  # shape: torch.Size([batch size, length - 1])
            # Drop the first token <bos> of tgt. tgt_y holds the tokens we want the
            # model to predict, so <bos> is not needed. For "<bos>新年好<eos>",
            # self.tgt_y becomes "新年好<eos>".
            self.tgt_y = tgt[:, 1:]  # shape: torch.Size([batch size, length - 1])
            # Build the target mask, which hides both padding and future tokens
            self.tgt_mask = self.make_std_mask(self.tgt, pad)
            # Count the non-padding tokens in the labels; <eos> is a real token
            # of the sentence, so it is counted as well
            self.ntokens = (self.tgt_y != pad).data.sum()

    @staticmethod
    def make_std_mask(tgt, pad):
        "Create a mask to hide padding and future words."
        # Mask for the padding tokens
        tgt_mask = (tgt != pad).unsqueeze(-2)
        # subsequent_mask() builds the mask for future tokens; it is ANDed with
        # the padding mask to obtain the final mask.
        # tgt.size(-1) is the sequence length.
        tgt_mask = tgt_mask & subsequent_mask(tgt.size(-1)).type_as(
            tgt_mask.data
        )
        return tgt_mask
Target sentence
The source language sentence has only one member variable, src, while the target sentence has two: tgt and tgt_y. Therefore, we need to analyze them separately. During the inference phase, tgt can be empty because there is no target language sentence during prediction, only the source language sentence. During the training phase, the model’s input is src and tgt. After processing, the model outputs out, which is compared with tgt_y to determine the loss. Two details need attention:
- The decoder’s input drops the last token (<pad> or <eos>). The decoder input is, for example, <bos> “I love you” without <eos>, because the decoder never needs the target’s last token as input. Therefore, the last token of the target sentence is removed when tgt is built.
- The decoder’s prediction target drops the first token <bos>, because the model does not need to predict <bos>, so the label does not contain it. In the code, the label is named tgt_y.
These operations correspond to the statements self.tgt = tgt[:, :-1] and self.tgt_y = tgt[:, 1:] in the Batch class. As an example, assume the original target language sentence is "<bos>新年好<eos>" (“Happy New Year”). It is converted to tgt as "<bos>新年好", and tgt_y is "新年好<eos>". If the computed out decodes to "新年乐<eos>", it differs from tgt_y in one token, and that difference contributes to the loss.
def run_epoch(data_iter, model, loss_compute):
    """Train a single epoch"""
    # Excerpt: only the forward pass and the loss computation are shown
    for i, batch in enumerate(data_iter):
        out = model.forward(
            batch.src, batch.tgt, batch.src_mask, batch.tgt_mask
        )
        loss, loss_node = loss_compute(out, batch.tgt_y, batch.ntokens)
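The shift between the decoder input and the labels can be reproduced on plain lists. This is a sketch that mirrors the two slicing statements of the Batch class; the helper name split_target and the token ids (0 for <bos>, 1 for <eos>, 2 for padding) are assumptions for illustration.

```python
BOS, EOS, PAD = 0, 1, 2

def split_target(tgt, pad=PAD):
    """tgt is a list of equal-length id rows; returns decoder input, labels, token count."""
    tgt_in = [row[:-1] for row in tgt]  # mirrors tgt[:, :-1]
    tgt_y = [row[1:] for row in tgt]    # mirrors tgt[:, 1:]
    ntokens = sum(tok != pad for row in tgt_y for tok in row)
    return tgt_in, tgt_y, ntokens

tgt = [
    [0, 3, 5, 6, 1],  # full-length sentence: <bos> 3 5 6 <eos>
    [0, 7, 1, 2, 2],  # shorter sentence followed by padding
]
tgt_in, tgt_y, ntokens = split_target(tgt)
print(tgt_in)
print(tgt_y)
print(ntokens)
```

In the padded second row the slice removes a padding id rather than <eos>, which is why the text above describes the dropped token as either <pad> or <eos>.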
Generate mask
The most important function of the Batch class is to generate a mask for the sentence. Generating the mask serves two purposes.
- Since padding tokens are used to complete the length and have no practical meaning, we also want to minimize the impact of padding. This reduces computational complexity and also reduces the impact of padding on the model’s text modeling.
- Because training uses Teacher Forcing (which will be explained in detail later), a mask is also needed to prevent self-attention from accessing future inputs.
The mask used for the first purpose is called a Padding Mask (a mask corresponding to the filler word). The mask used for the second purpose is called a Sequence Mask (a mask related to future words). The source sentence needs a Padding Mask, and the target sentence needs a combination of Padding Mask and Sequence Mask. We will introduce these in detail below, using source and target sentences as examples.
The mask for the source sentence is stored in src_mask. Suppose a sentence contains the data [0, 3, 1, 2, 2]. Generating src_mask takes only one line of code, self.src_mask = (src != pad).unsqueeze(-2), which does two things:
- It sets the non-pad positions of src to True and the pad positions to False. The mask for the example sentence above is therefore [True, True, True, False, False]. Because <bos>, <eos>, and <unk> are considered part of the sentence, they are not masked.
- It calls unsqueeze() to add a dimension, because src_mask is later applied to the attention scores and must be broadcastable against them. The final shape of src_mask is therefore [batch size, 1, longest sentence length].
The mask for the target sentence is stored in tgt_mask. Generating tgt_mask is more complex; the logic is in the static method make_std_mask() of the Batch class given earlier. tgt_mask differs from src_mask in that, besides covering the pad positions, it also covers the upper-right triangle above the diagonal. It is the combination of the padding mask and the mask for future tokens. The logic of the make_std_mask() function is as follows:
- First, generate the mask for the padding tokens, i.e., the Padding Mask. The mask for the above example is [[[True, True, True, False, False]]].
- Then call the subsequent_mask() function to generate the mask for future tokens, i.e., the Sequence Mask. This is a matrix in which the diagonal and everything below it are True. The specific mask is as follows.
[[
[ True, False, False, False, False ],
[ True, True, False, False, False ],
[ True, True, True, False, False ],
[ True, True, True, True, False ],
[ True, True, True, True, True ],
]]
- Finally, the mask corresponding to the fill word and the mask related to the future word are ANDed together to obtain the final mask as follows:
[[
[ True, False, False, False, False ],
[ True, True, False, False, False ],
[ True, True, True, False, False ],
[ True, True, True, False, False ],
[ True, True, True, False, False ],
]]
Note that the shape of src_mask is (batch, 1, seq_len), while that of tgt_mask is (batch, seq_len, seq_len). Every position of the source can attend to all source positions (except padding), so a single vector per sentence is enough, whereas in the target each time step sees a different set of positions, which requires a matrix with one row per time step.
The code for the subsequent_mask() function is as follows.
def subsequent_mask(size):
    "Mask out subsequent positions."
    attn_shape = (1, size, size)
    subsequent_mask = torch.triu(torch.ones(attn_shape), diagonal=1).type(
        torch.uint8
    )
    return subsequent_mask == 0
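To check the matrices shown earlier by hand, the same logic can be written on plain Python lists for a single sentence. This is a re-implementation for illustration only; the function names are assumptions and no tensors or batch dimension are involved.

```python
def padding_row(ids, pad):
    """True where the token is real, False where it is padding."""
    return [tok != pad for tok in ids]

def causal_rows(size):
    """Row i may look at columns 0..i: the diagonal and everything below it."""
    return [[col <= row for col in range(size)] for row in range(size)]

def std_mask(ids, pad):
    """AND the padding row with every row of the causal matrix."""
    pad_row = padding_row(ids, pad)
    return [
        [p and c for p, c in zip(pad_row, causal_row)]
        for causal_row in causal_rows(len(ids))
    ]

ids = [0, 3, 1, 2, 2]
for row in std_mask(ids, pad=2):
    print(row)
```

The last two rows come out as [True, True, True, False, False]: the causal part would allow them to see positions 3 and 4, but those positions are padding.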
Build batch
The function data_gen() builds batches of random data (for demonstration only), as shown in the following code.
def data_gen(V, batch_size, nbatches):
    """
    Generate random data. (This method is for demos only.)
    :param V: vocabulary size
    :param batch_size: number of sentences per batch
    :param nbatches: number of batches to generate
    :return: yields one Batch object at a time
    """
    # Generate nbatches batches
    for i in range(nbatches):
        # Generate one batch of random input data
        data = torch.randint(1, V, size=(batch_size, 10))
        # Set the first token of every row to 1, i.e. "<bos>"
        data[:, 0] = 1
        # This data does not need gradients
        src = data.requires_grad_(False).clone().detach()
        tgt = data.requires_grad_(False).clone().detach()
        # Yield a Batch object
        yield Batch(src, tgt, 0)
4.3 Loading batches
The collate_batch() function is passed to the DataLoader class as its collate_fn (Callable, optional) parameter, through the collate_fn() wrapper shown earlier. Its purpose is to merge a list of samples into a mini-batch of tensors. Internally, DataLoader passes the list of sentence pairs to the collate_batch() function for processing, and the resulting batch is then fed to the model.
def collate_batch(
    batch,  # list of sentence pairs, e.g. [(src 1, tgt 1), (src 2, tgt 2), ...]; its length is the batch size
    src_pipeline,  # German tokenization function, a wrapper around spacy_de
    tgt_pipeline,  # English tokenization function, a wrapper around spacy_en
    src_vocab,  # German vocabulary, a Vocab object
    tgt_vocab,  # English vocabulary, a Vocab object
    device,
    max_padding=128,  # maximum sentence length
    pad_id=2,
):
    # Vocabulary indices of <bos> and <eos>
    bs_id = torch.tensor([0], device=device)  # <s> token id
    eos_id = torch.tensor([1], device=device)  # </s> token id
    src_list, tgt_list = [], []
    for (_src, _tgt) in batch:  # iterate over the sentence pairs
        # src_vocab(src_pipeline(_src)) tokenizes the source sentence and converts it
        # to a sequence of vocabulary indices; torch.cat then prepends <bos> and
        # appends <eos>.
        processed_src = torch.cat(
            [
                bs_id,
                torch.tensor(
                    src_vocab(src_pipeline(_src)),
                    dtype=torch.int64,
                    device=device,
                ),
                eos_id,
            ],
            0,
        )
        # tgt_vocab(tgt_pipeline(_tgt)) does the same for the target sentence;
        # torch.cat then prepends <bos> and appends <eos>.
        processed_tgt = torch.cat(
            [
                bs_id,
                torch.tensor(
                    tgt_vocab(tgt_pipeline(_tgt)),
                    dtype=torch.int64,
                    device=device,
                ),
                eos_id,
            ],
            0,
        )
        # If processed_src is longer than max_padding, truncate it (the pad amount
        # is negative); if it is shorter, pad it
        src_list.append(
            # warning - overwrites values for negative values of padding - len
            pad(
                processed_src,
                (
                    0,
                    max_padding - len(processed_src),
                ),
                value=pad_id,
            )
        )
        # If processed_tgt is longer than max_padding, truncate it; if shorter, pad it
        tgt_list.append(
            pad(
                processed_tgt,
                (0, max_padding - len(processed_tgt)),
                value=pad_id,
            )
        )
    src = torch.stack(src_list)  # stack the list into one tensor
    tgt = torch.stack(tgt_list)  # stack the list into one tensor
    return (src, tgt)
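The whole pipeline of collate_batch() (tokenize, look up ids, add the start and end tokens, then pad or truncate) can be followed on plain lists. In this sketch the whitespace tokenizer, the tiny dictionaries, the <unk> id of 3, and max_padding=6 are assumptions made for illustration; the real function works on tensors with spaCy tokenizers.

```python
BOS_ID, EOS_ID, PAD_ID, UNK_ID = 0, 1, 2, 3

def collate(batch, src_vocab, tgt_vocab, max_padding=6):
    """Turn a list of (source, target) sentence pairs into two rectangular id lists."""
    def encode(text, vocab):
        ids = [BOS_ID] + [vocab.get(t, UNK_ID) for t in text.split()] + [EOS_ID]
        ids = ids[:max_padding]                               # truncate if too long
        return ids + [PAD_ID] * (max_padding - len(ids))      # pad if too short
    src = [encode(s, src_vocab) for s, _ in batch]
    tgt = [encode(t, tgt_vocab) for _, t in batch]
    return src, tgt

src_vocab = {"Zwei": 4, "Männer": 5, "stehen": 6, "am": 7, "Herd": 8}
tgt_vocab = {"Two": 4, "men": 5, "stand": 6, "at": 7, "the": 8, "stove": 9}
batch = [
    ("Zwei Männer", "Two men"),
    ("Zwei Männer stehen am Herd", "Two men stand at the stove"),
]
src, tgt = collate(batch, src_vocab, tgt_vocab)
print(src)
print(tgt)
```

One consequence worth noticing: when a sentence is longer than max_padding, the truncation also removes its end-of-sentence id, as happens to the second pair here.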
4.4 Training and Use
During training, the train_worker() function wraps each batch produced by the data loader in a Batch object in every epoch, and passes the resulting iterator to the run_epoch() function, which performs the actual training.
_, train_state = run_epoch(
    # Wrap each item from the data loader in a Batch instance
    (Batch(b[0], b[1], pad_idx) for b in train_dataloader),
    model,
    SimpleLossCompute(module.generator, criterion),
    optimizer,
    lr_scheduler,
    mode="train+log",
    accum_iter=config["accum_iter"],
    train_state=train_state,
)
The code for the run_epoch() function is as follows.
def run_epoch(
    data_iter,  # iterable that yields one Batch object at a time
    model,  # Transformer model, an EncoderDecoder object
    loss_compute,  # SimpleLossCompute object, used to compute the loss
    optimizer,  # Adam optimizer; during validation this is a DummyOptimizer
    scheduler,  # LambdaLR object that adjusts Adam's learning rate to implement warmup
    mode="train",
    accum_iter=1,  # number of batches per parameter update; 1 means every batch updates
    train_state=TrainState(),  # TrainState object that stores training state
):
    """Train a single epoch"""
    start = time.time()
    total_tokens = 0
    total_loss = 0
    tokens = 0
    n_accum = 0
    # Iterate over every batch in the dataset
    for i, batch in enumerate(data_iter):
        # Forward pass, equivalent to model(batch.src, batch.tgt, batch.src_mask,
        # batch.tgt_mask). out is the decoder output, not the Generator output:
        # EncoderDecoder.forward does not call the generator, which is invoked
        # inside loss_compute instead.
        out = model.forward(
            batch.src, batch.tgt, batch.src_mask, batch.tgt_mask
        )
        # Call loss_compute() to get the loss of this batch. Its three arguments are:
        # 1. out: the output of EncoderDecoder
        # 2. tgt_y: all tokens to be predicted, e.g. if src is `<bos> I love you <eos>`,
        #    then tgt_y is `我 爱 你 <eos>`
        # 3. ntokens: the number of valid tokens in this batch, used to normalize the loss
        loss, loss_node = loss_compute(out, batch.tgt_y, batch.ntokens)
        # loss_node = loss_node / accum_iter
        if mode == "train" or mode == "train+log":
            loss_node.backward()  # compute gradients
            train_state.step += 1  # count steps
            train_state.samples += batch.src.shape[0]  # count samples; shape[0] is the batch size
            train_state.tokens += batch.ntokens  # count processed tokens
            # Update the parameters once every accum_iter batches
            if i % accum_iter == 0:
                optimizer.step()
                optimizer.zero_grad(set_to_none=True)
                n_accum += 1
                train_state.accum_step += 1
            # Update the learning rate
            scheduler.step()
        # Accumulate the loss
        total_loss += loss
        # Accumulate the number of processed tokens
        total_tokens += batch.ntokens
        # Accumulate the tokens processed since the last log output
        tokens += batch.ntokens
        if i % 40 == 1 and (mode == "train" or mode == "train+log"):
            lr = optimizer.param_groups[0]["lr"]
            elapsed = time.time() - start
            start = time.time()
            tokens = 0
        del loss
        del loss_node
    # Return the average loss per token and the training state
    return total_loss / total_tokens, train_state
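The condition i % accum_iter == 0 decides when the accumulated gradients are applied. The small simulation below lists the batch indices that trigger an update; the helper name update_steps and its offset parameter are assumptions added for illustration, with offset=1 showing a commonly used variant for comparison.

```python
def update_steps(num_batches, accum_iter, offset=0):
    """Batch indices i where (i + offset) % accum_iter == 0, i.e. where optimizer.step() runs."""
    return [i for i in range(num_batches) if (i + offset) % accum_iter == 0]

# offset=0 mirrors the condition used in run_epoch() above
print(update_steps(25, 10))
# offset=1 is a variant that waits until accum_iter batches have been seen
print(update_steps(25, 10, offset=1))
```

With the condition as written, the first update happens at batch 0, after a single batch of gradients, and later updates follow every accum_iter batches. Gradients from batches after the last update are not applied within that epoch.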
Summary
We present the overall data flow of training using a complete diagram as follows. Based on this data processing flow, LLM is infused with sufficient information to construct a probability distribution space of massive amounts of natural language and code, forming various complex relational patterns that encompass various knowledge and structures within natural language and code. This knowledge and structure manifests as distances and relationships in probability distributions, thus supporting reasoning steps such as comparison, analogy, induction, and deduction—in other words, enabling these reasoning abilities to “emerge.”
