Speculative decoding, also known as predictive decoding or speculative sampling, uses a small model to predict the behavior of a large model, which improves decoding efficiency and speeds up the large model's execution. Its core idea is shown in the diagram below. First, multiple candidate tokens (serial sequences, trees, multi-head trees, etc.) are generated quickly and at low cost (primarily with a small model, but multi-head, retrieval, and early-exit methods are also used). ... Read Full Text
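The draft-then-check idea can be illustrated with a minimal greedy sketch. Everything here is an assumption made for illustration: `draft_next` and `target_next` are hypothetical stand-ins for the small and large models (each maps a token list to the next token), and verification is written as a loop, whereas a real system would score all draft positions in one parallel pass of the large model.

```python
def speculative_step(prefix, draft_next, target_next, num_draft=4):
    """One greedy draft-and-verify step; returns the newly accepted tokens."""
    # 1) Draft: the cheap model proposes num_draft tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(num_draft):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2) Verify: keep draft tokens while they match the large model's choice.
    accepted, ctx = [], list(prefix)
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # All drafts accepted: the large model contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


# Toy models (hypothetical): the target counts upward modulo 10,
# and the draft agrees except right after token 3.
def target_next(ctx):
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


new_tokens = speculative_step([1], draft_next, target_next)
```

In this toy run the first two draft tokens are accepted and the third is corrected, so one step yields three tokens for the price of a single verification round.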
MoE has two distinct characteristics. Dynamic routing: a gating network determines which experts should process each input. Sparse activation: for each input, only some experts are activated, which greatly reduces the amount of computation. These two characteristics are precisely what lead to several major problems. Load balancing: overuse of some experts leads to an uneven load distribution. Routing network degradation: due to overfitting in the routing decisions of the gating network, its exploration ... Read Full Text
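As a rough illustration of dynamic routing with sparse activation, here is a minimal top-k gating sketch. The function name `top_k_route`, the logits, and the choice of k = 2 are assumptions for illustration, not the routing rule of any particular MoE model.

```python
import math


def top_k_route(gate_logits, k=2):
    """Return {expert_index: weight} for the k experts with the highest gate score."""
    # Softmax over the gating network's logits (shifted for numerical stability).
    m = max(gate_logits)
    exps = [math.exp(z - m) for z in gate_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Sparse activation: only the top-k experts are selected for this input.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Renormalize so the selected experts' weights sum to 1.
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


# Hypothetical gate logits for one token over four experts.
routes = top_k_route([2.0, 0.5, 1.0, -1.0], k=2)
```

If the gate keeps producing high logits for the same few experts, those experts receive most tokens, which is the load-balancing problem described above.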
The general idea behind MTP (Multi-Token Prediction) is to have the model use n independent output heads to predict the next n tokens, with these n output heads sharing the same model backbone. By optimizing the decoding stage, generating one token at a time becomes generating multiple tokens at a time, which improves training and inference performance. Before DeepSeek, there were several MTP solutions, each with its own focus. Some focus on accelerating decoding during inference. Examples ... Read Full Text
Applying existing quantization techniques directly to large models is challenging and leads to significant quantization errors and decreased accuracy. This is primarily due to the scale and complexity of large models. Compared to smaller models, large models typically exhibit more outliers in their weights and activations and have a wider distribution range. The authors of LLM.int8() observed this: Unlike smaller models, LLMs exhibit unique weight and activation distributions, characterized by a large ... Read Full Text
Following the previous article, which introduced the basics of large-model quantization, this article explores some quantization schemes. Since quantization schemes are currently characterized by the bit width they compress to, we classify and study them by bit width. ... Read Full Text
The speculative sampling paradigm is predict + verify. Another approach is based on Jacobi decoding and its descendants, which build on Jacobi iteration. Jacobi iteration transforms the N iterations of autoregression into N equations, which are then solved jointly. Jacobi decoding uses the output of the previous iteration as the input for the next iteration; essentially, it treats the output of each token as a 2-gram and uses this as the draft model. Assume \(\mathbf{y} ... Read Full Text
Medusa is one of the earlier works in the field of self-speculation and has greatly inspired subsequent work. Its main idea is multiple decoding heads + tree attention + typical acceptance (threshold). Medusa does not use a separate draft model, but adds multiple decoding heads (MEDUSA heads) to the original model to predict multiple subsequent tokens in parallel. A typical LLM has only one head for predicting the token at time t. Medusa retains the original LM Head after the last Transformer layer of the LLM and adds ... Read Full Text
While Transformer-based LLMs have made significant progress, deploying them presents major challenges due to the increasing number of parameters. For example, even a medium-sized LLM such as LLaMA-13B requires approximately 26 GB of memory to load all of its parameters in FP16 format. This overhead not only increases the cost of use but also limits wider adoption. To address these challenges, numerous specialized compression methods for LLMs have been proposed, including pruning, knowledge transfer, ... Read Full Text
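The 26 GB figure follows from simple arithmetic: 13 billion parameters at 2 bytes each in FP16. A small sketch, where the helper name `weight_memory_gb` is hypothetical and 1 GB is taken as 10^9 bytes; it counts weights only, not activations or KV cache.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed to hold the weights alone, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


fp16_gb = weight_memory_gb(13e9, 2)  # FP16 stores each parameter in 2 bytes
int8_gb = weight_memory_gb(13e9, 1)  # 8-bit quantization halves that
```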
The basic idea of MLA (Multi-head Latent Attention) is to compress the attention input into a low-dimensional latent vector whose dimension is much smaller than the original dimension. When attention needs to be calculated, this latent vector can be mapped back to a higher-dimensional space. Therefore, only the latent vector needs to be stored, which can significantly reduce memory usage. This process can be described more formally using the following formula, in which one term represents the latent vector and another is a compression ... Read Full Text
In long-sequence scenarios, a KV cache only removes the cost of recomputing K and V themselves. The computational cost of the subsequent steps remains high, and the storage pressure from K and V is also very heavy. This seriously hinders our ability to use longer sequences as input and to generate even longer sequences. To solve this problem, we need to optimize the KV cache. The core of KV cache optimization lies in reducing memory consumption. This goal can be achieved by further ... Read Full Text
In large-model inference, the task can typically be divided into two phases. The Prefill phase processes all input tokens, generates the first output token, and builds the KVCache. The Decode phase uses the KVCache over multiple iterations, generating one token per iteration. Because the Prefill phase processes many tokens in parallel, it is compute-intensive, and its latency is measured by the time to first token (TTFT). In contrast, the Decode phase becomes memory-intensive due to the ... Read Full Text
Current large language model (LLM) serving systems employ key-value (KV) caches to avoid redundant key-value projection calculations during the decoding phase. While this is an effective solution for generating short sequences for a single client request, with multiple clients each request maintains its own KV cache, which increases the overall KV cache size during inference. Furthermore, even for a single client request, KV caches can significantly impact inference performance when generating long ... Read Full Text
Advances in LLMs are driving longer contexts and broader text generation, with these models trained on sequences of millions of tokens. This trend puts pressure on system memory bandwidth, leading to increased execution costs. LLMs for multi-turn dialogue scenarios present several challenges: 1. The computational complexity of the attention mechanism; 2. Caching key-value pairs during the decoding stage requires a large amount of memory; 3. Popular LLMs cannot be extended beyond the training length. In this article, ... Read Full Text
As mentioned in the previous section on "Optimizing KV Cache," the main related work on "reducing the number of attention heads" currently includes MQA and GQA. MQA and GQA reduce the number of KV values to cache: intuitively, if fewer KV values are cached, GPU memory usage is smaller, and the resulting loss in model capacity can be compensated for by further training or by increasing the scale of the FFN/GLU. Because MQA and GQA are improvements on MHA, we illustrate the differences ... Read Full Text
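To make the memory intuition concrete, the sketch below estimates KV cache size for one sequence as 2 (K and V) × layers × KV heads × head dimension × sequence length × bytes per value. The function name and the configuration (32 layers, head dimension 128, 4096 tokens, FP16) are assumptions chosen for illustration, not the settings of any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache K and V for one sequence."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration shared by all three variants.
cfg = dict(num_layers=32, head_dim=128, seq_len=4096)

mha = kv_cache_bytes(num_kv_heads=32, **cfg)  # MHA: one KV head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **cfg)   # GQA: query heads share 8 KV heads
mqa = kv_cache_bytes(num_kv_heads=1, **cfg)   # MQA: all query heads share 1 KV head
```

Under these assumptions the cache shrinks from 2 GiB (MHA) to 0.5 GiB (GQA) and 64 MiB (MQA), in direct proportion to the number of KV heads.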
Large Language Models (LLMs) have achieved remarkable success in various natural language processing tasks, driving breakthroughs in language understanding, generation, and reasoning capabilities. Similar to self-supervised learning methods in other fields, LLMs are typically pre-trained on large amounts of unlabeled text data and then fine-tuned for specific downstream tasks to adapt their knowledge to the target domain. However, the sheer size of LLMs, often reaching billions of parameters, presents significant ... Read Full Text
FlashAttention leverages the asymmetric hierarchy of GPU memory to reduce memory consumption to linear (rather than quadratic) levels, achieving a 2 to 4x speedup compared to the optimized baseline. However, this technique still falls short of the speed of optimized matrix multiplication (GEMM) operations. Forward propagation throughput reaches only 30-50% of the theoretical maximum floating-point operation rate (FLOPs/s), while backward propagation reaches only 25-35%. This inefficiency is due to poor load ... Read Full Text
0.1 Problem. The core of the Transformer architecture is the powerful self-attention mechanism. However, self-attention is slow and memory-intensive, especially with long context lengths. For a Transformer model with an input sequence length of \(N\), both its computational and space complexity are \(O(N^2)\). In other words, the computational cost and storage space of the model increase quadratically with the sequence length \(N\). When the input sequence is long, the Transformer's computation process is slow ... Read Full Text
As the list of tokens input to the LLM grows, the Transformer's self-attention phase can become a performance bottleneck. A longer token list means larger matrices to be multiplied. Each matrix multiplication consists of many small numerical operations, called floating-point operations, whose performance is limited by the GPU's floating-point operations per second (FLOPS). Thus, inference latency and throughput become critical challenges in LLM deployment. These problems mainly stem from: To generate the ... Read Full Text
With sufficient training data, we can scale up language models by increasing parameters and computational budget to obtain more powerful models. However, this comes with extremely high computational costs. The MoE (Mixture of Experts) architecture, through conditional computation, can achieve parameter scaling while maintaining moderate computational costs, providing enhanced model capacity and computational efficiency. Simply put, MoE combines multiple expert models to form a new model. However, MoE doesn't have ... Read Full Text
The Transformer can be viewed as a general-purpose differentiable computer. Its self-attention mechanism provides a message-passing-like architecture, offering a versatile and powerful set of algorithms capable of covering many real-world applications, requiring only a few computational steps to complete tasks. Residual connections and layer normalization enable efficient learning and optimization. In the Transformer architecture, the Add & Norm modules are used to add residual connections and ... Read Full Text
For standard Transformer models, whether the encoder-only BERT series or the decoder-only GPT series, the number of parameters and the computational cost are similar under the same configuration. One key point is that the input, output, and intermediate hidden dimensions of a standard Transformer block (layer) remain unchanged; they always equal the hidden dimension of the token embedding, so all Transformer blocks have a very regular structure. As shown in the figure below, the main parameters of the Encoder are derived from the ... Read Full Text
RoPE comes from Su Jianlin's work RoFormer, and it is one of the most popular positional encoding methods currently used in LLMs. The Transformer paper used sinusoidal positional encoding, which is an additive encoding: the encoded position is added to the word embedding. The embedding vector for each position is fixed, regardless of its relationship to other positions. Sinusoidal positional encoding aims to introduce relative positional relationships (the positional encoding of any position can be expressed as ... Read Full Text
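The additive sinusoidal scheme from the original Transformer paper can be sketched as follows, using sin on even dimensions and cos on odd dimensions with base 10000. The function names are hypothetical, and the sketch assumes an even `d_model`.

```python
import math


def sinusoidal_pe(pos, d_model):
    """Fixed sinusoidal encoding for one position (d_model assumed even)."""
    pe = []
    for i in range(0, d_model, 2):  # i is the even dimension index
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))  # dimension i
        pe.append(math.cos(angle))  # dimension i + 1
    return pe


def add_position(embedding, pos):
    """Additive encoding: word embedding plus the position's fixed vector."""
    pe = sinusoidal_pe(pos, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]


encoded = add_position([0.5, 0.5, 0.5, 0.5], pos=0)
```

The vector for a given position depends only on `pos` and `d_model`, which is what "fixed" means in the text above; RoPE instead rotates the query and key vectors rather than adding to the embedding.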
The decoder consists of many Transformer layers, each ending with an "Add & Norm" module. In other words, the last module of the decoder's final layer is an "Add & Norm" module. The output of this module is a floating-point vector representing the semantics. Our current problem is: how do we convert this floating-point vector into a word? This is what the sampling and output sections do. In short, during the prediction phase, the sampling and output sections perform the following three steps: Calculating ... Read Full Text
The Transformer extracts and processes "sequence information" in two steps. Taking the encoder of the original Transformer as an example, each layer contains a multi-head self-attention block (MHSA) followed by an FFN (Feed-Forward Network). The FFN is a simple network consisting of two linear transformations with one activation function between them (linear + ReLU + linear). Considering that the attention mechanism may not fit complex processes ... Read Full Text
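The linear + ReLU + linear structure can be written out in a few lines. This is a minimal pure-Python sketch for a single token vector; the function names and the tiny hand-picked weights are assumptions for illustration only.

```python
def linear(x, weight, bias):
    """y = W x + b, with weight given as a list of output rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def ffn(x, w1, b1, w2, b2):
    """Position-wise feed-forward network: linear -> ReLU -> linear."""
    return linear(relu(linear(x, w1, b1)), w2, b2)


# Hypothetical weights: model dimension 2, inner dimension 3.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 1.0], [2.0, 0.0, 0.0]]
b2 = [0.5, 0.0]

out = ffn([1.0, -2.0], w1, b1, w2, b2)
```

The same weights are applied independently to every token position, and the output has the same dimension as the input, so the block can be stacked with residual connections.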
In machine learning, a mask is essentially a tensor of the same size as the target tensor (mostly binary, 0/1). The idea originated from the CBOW training mechanism of word2vec: predicting the center word from its context. A mask essentially conceals the center word. Different tasks and applications may require different types of mask operations. In self-attention models, two common mask operations are the padding mask and the sequence mask. Padding mask: When processing variable-length sequences, special padding symbols ... Read Full Text
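The two mask types can be sketched as plain 0/1 lists. The assumptions here: padding uses token id 0, 1 means "may attend" and 0 means "masked", and the sequence mask is taken to be the usual look-ahead (causal) mask that hides future positions.

```python
def padding_mask(token_ids, pad_id=0):
    """1 for real tokens, 0 for padding positions."""
    return [0 if t == pad_id else 1 for t in token_ids]


def sequence_mask(n):
    """Lower-triangular n x n mask: position i may attend to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


pad = padding_mask([17, 42, 8, 0, 0])  # last two ids are padding
seq = sequence_mask(3)
```

In practice the masked positions are typically set to a large negative value before the softmax so that their attention weights become effectively zero.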
MHSA (Multi-Head Self-Attention) is the core module of the Transformer model. The Transformer is essentially a general-purpose differentiable computer that combines many excellent features. The Transformer's message-passing-like architecture is versatile (i.e., complete) and powerful (i.e., efficient), capable of covering many real-world algorithms, thus giving the Transformer very strong expressive power (in forward propagation). Through backpropagation and gradient descent, the Transformer can be continuously ... Read Full Text
Because the Transformer itself possesses permutation invariance, it cannot directly capture the positional information of each word in the sequence. Therefore, using positional encoding to incorporate the order information of elements in the sequence into the Transformer has become a common practice. Based on whether the positional encoding represents the absolute or relative positional information of elements in the sequence, the industry mainly divides positional encoding into Absolute Position Encoding (APE) ... Read Full Text
The core of the Transformer, or rather, the key difference between it and other architectures, is its self-attention mechanism. This mechanism allows the model to consider the dependencies between each word and all other words in a sentence, using this information to capture the sentence's internal structure and representation, and ultimately calculating the correlations (weights) between words. We can divide the self-attention mechanism into three stages: Input: As we learned earlier, attention mechanisms accept ... Read Full Text
In the Transformer, the task of mapping each token (corresponding to discrete input data, such as words or symbols) to a high-dimensional dense vector space is implemented by the embedding layer. The input embedding layer is an indispensable part of the Transformer framework, and its functions are as follows: The input data is transformed into a form that the model can process. For example, for the four characters "新年大吉" (Happy New Year), assuming the high-dimensional space is 512-dimensional, the embedding layer ... Read Full Text
Positional embedding is a technique for processing sequential data, used to represent the positions of words in an input sequence. It plays a crucial role in Transformer implementations. Transformers need to focus on two pieces of information for each input word: its meaning and its position in the sequence. Positional embedding provides a key supplement to addressing these two concerns. To address the concern of "the meaning of input words", Transformer encodes the meaning of words through an embedding layer. ... Read Full Text
Language is a concept unique to humans. Humans can understand the meaning of each word in a language as an abstract symbol, but current NLP language models cannot directly abstract the meaning of each language symbol from perception. In order for the model to understand natural language text, the text needs to be converted into a representation that the model can process, such as integers, floating-point numbers, or vectors. As we learned in previous chapters, the Transformer accepts high-dimensional vectors ... Read Full Text
The purpose of Transformer training is to fit the true target sequence by learning from the input source sequence and the model's output sequence. The purpose of inference is to generate the target sequence using only the input sequence; the decoder receives only the input sequence, not the target sequence. This article will continue to use text translation as an example for learning. ... Read Full Text
Some researchers believe that the cognitive framework of the large model looks very close to the Bayesian brain described by Karl Friston. Based on Bayesian probability theory and biophysical principles, the brain's main goal is to predict and control external information in order to minimize uncertainty and internal entropy. The brain builds internal models by constantly collecting and processing external information to predict and control the external world. Massive amounts of text or multimodal corpora ... Read Full Text
For machine translation, the complete forward computation process of the Transformer is shown in the following diagram (compared to the flowchart in the overall architecture chapter, this diagram emphasizes the internal structure of the encoder and decoder and their interrelationships). The left side of the diagram shows the encoder stack, and the right side shows the decoder stack; these two constitute the "body" of the Transformer. The specific process is as follows. The input sequence is converted into an ... Read Full Text
0.1 Process. Using a Transformer for text generation essentially means using the model to predict the next word. The complete process includes multiple stages, such as tokenization, vectorization, attention calculation, and sampling. The specific workflow is as follows: Tokenization. The user's input text (let's assume it's "Data visualization empowers users to") is broken down into several independent lexical units, i.e., tokens. Encoding. Tokens are mapped to numbers using a vocabulary, with each token ... Read Full Text
Due to various commitments, I haven't blogged for a long time, and I haven't had time to organize some drafts I wrote before (I haven't had time to log into my blog or WeChat, so I only recently discovered many unread messages and private messages; I sincerely apologize to everyone). I'm resuming updates now because some new students from non-AI fields have recently contacted me asking for good learning materials. They hope to quickly get started with Transformers. I searched online but couldn't find ... Read Full Text