Exploring the Transformer Series (29) --- DeepSeek MoE

0x00 Overview

MoE has two distinct characteristics:

Dynamic routing: Uses a gating network to determine which experts should process each input.
Sparse activation: For each input, only some experts are activated, which greatly reduces the amount of computation.

These two characteristics are precisely what lead to several major problems:

Load balancing. Some experts are overused, which leads to an uneven load distribution.
Routing network degradation. The gating network overfits in its routing decisions, so its ability to explore decreases.
Parameter explosion. Increasing the number of experts leads to excessive memory and storage requirements.
Communication bottleneck. In distributed systems, the communication overhead between experts is particularly high.
Memory fragmentation. Inefficient memory usage leads to out-of-memory errors during training.
Accuracy versus hardware efficiency. A trade-off must be made between model accuracy (hurt by dropping tokens) and hardware performance (hurt by zero padding).

Among these challenges, load balancing is the most typical. We will conduct an in-depth analysis to see how the industry addresses the load balancing problem.

Note:

The complete list of articles is here. It’s estimated to eventually have around 35 articles. This list will be updated after each subsequent article is published. (Cnblogs Exploring Transformer Series: Article List)
This series is a study and interpretation of papers, blogs, and code, drawing on many articles from online friends. I would like to express my gratitude to them, and their names will be listed in the references. Because there are so many references in this series, there may be omissions in the citations. If the original authors or other friends find any omissions, please point them out, and I will add them to the references.

0x01 Difficulties

Let’s start with an example to illustrate this.

We need to develop an application, but this application requires both front-end and back-end development. Therefore, we have two options for hiring programmers:

Hire a programmer proficient in both front-end and back-end development so they can handle all the development. This is similar to the standard Transformer model, where a single FFN sublayer processes all input tokens.
Hire multiple programmers with different strengths, such as front-end and back-end developers, and add a development manager to allocate tasks. This is similar to the MoE approach, where each programmer acts as an expert, and the development manager acts as a gating mechanism to select the experts.

To ensure the efficient operation of this system, the following conditions must be met:

Each programmer must be proficient in the skills required for their job, and all programmers must be able to work together to complete application development.
Development managers need to have a thorough understanding of all programmers’ expertise and be able to allocate tasks efficiently.

1.1 Load Balancing

While sparse gating $G(x; \Theta)$ can significantly expand the model parameter space without increasing computational cost, its performance depends heavily on the effectiveness of the gating mechanism. Conceptually, MoE works as “for each token, find the corresponding expert to compute it,” but in actual training the order is reversed: each expert is first allocated a fixed amount of compute, and then tokens are routed to their experts for parallel computation. This is why the gating mechanism responsible for scoring is also called a router. Because the gating mechanism cannot control how many tokens are sent to each expert, the workload is in practice unevenly distributed among experts. Some experts are used frequently (receiving many tokens), while others are rarely called (receiving very few tokens). We call this phenomenon expert load imbalance. It not only contradicts MoE’s design intent (specialization of expertise) but also hurts computational efficiency (for example, it causes uneven load during communication between GPUs in distributed training).

To understand expert load balancing, we can start from the perspective of batch size. Generally speaking, larger batch sizes result in better inference performance. However, because the activation of experts in the MoE layer needs to be parallelized, the actual batch size of each expert in the MoE layer will decrease. For example, suppose there are 10 tokens in the current batch, with 5 tokens routed to a single expert network and the other 5 tokens routed to 5 other different expert networks. This will lead to an uneven distribution of batch sizes among the expert networks, with one expert’s batch size being 5 times that of the other 5 experts, resulting in underutilization of the other 5 experts and wasted hardware resources. Furthermore, if we send all the tokens to a few top experts, training efficiency will become low.

Load imbalance can also reinforce itself. If the first few tokens increase the gating weights of the experts they select, those experts are selected more frequently and with higher gating weights. They are therefore trained more, and their gating weights increase again. Favored experts are trained faster and more thoroughly, their performance keeps improving, and so they are selected even more often. Eventually these few experts become overloaded and must process a large number of tokens at every step, while the other experts sit idle and untrained. The under-used experts struggle to learn useful knowledge because they see too few training tokens, which ultimately causes model imbalance and performance degradation. This phenomenon is known as routing collapse.
Routing collapse can also make training unstable because of gradient conflict. Overloaded experts receive more input tokens, so their accumulated gradients are larger. The gradients of overloaded and underloaded experts may therefore differ in both magnitude and direction, making it hard for training to converge.

Therefore, expert balancing needs to be controlled in the MoE model. To combat this imbalance, various load balancing techniques have been implemented to ensure that all experts bear roughly equal responsibility during training. As we introduced in the previous article, adding a noise mechanism to the gating can avoid repeatedly routing to the same expert, allowing the gating to skip several optimal expert models when making a selection and choose other expert models, thus enabling better collaborative work. The details are as follows.

The original processing flow is shown as number 1 in the figure below, where the gating function is softmax: the routing strategy multiplies the input by a weight matrix and then applies softmax.
However, this does not guarantee that the selection of experts is sparse. A linear transformation of the input followed by softmax yields a non-sparse gating function (number 2 in the figure below).
To obtain sparsity, a top-k function is applied before the softmax to keep only the k largest values and set the rest to $-\infty$. The non-top-k entries become 0 after softmax and are never selected. On top of this, Gaussian noise is added to the gating logits before the top-k step so that routing decisions remain stochastic (number 3 in the figure below).

[Figure 2901]
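The three gating variants above can be sketched in a few lines of NumPy. This is a minimal sketch, assuming the noisy top-k formulation from the “Outrageously large neural networks” paper (noise scaled by a softplus of a second linear map); the names `w_gate`, `w_noise`, and `noisy_top_k_gating` are illustrative and not taken from any library.

```python
import numpy as np

def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return np.logaddexp(0.0, z)

def noisy_top_k_gating(x, w_gate, w_noise, k, rng):
    """x: (tokens, d_model); w_gate, w_noise: (d_model, num_experts)."""
    clean_logits = x @ w_gate
    # Per-token, per-expert noise scale (assumption: softplus keeps it positive).
    noise_std = softplus(x @ w_noise)
    noisy_logits = clean_logits + rng.standard_normal(clean_logits.shape) * noise_std
    # Keep the k largest logits per token and set the rest to -inf.
    kth_largest = np.sort(noisy_logits, axis=-1)[:, -k][:, None]
    masked = np.where(noisy_logits >= kth_largest, noisy_logits, -np.inf)
    # Softmax over the surviving logits; -inf entries become exactly 0.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))           # 5 tokens, hidden size 16
w_gate = rng.standard_normal((16, 8))      # 8 experts
w_noise = rng.standard_normal((16, 8))
gates = noisy_top_k_gating(x, w_gate, w_noise, k=2, rng=rng)
```

Each row of `gates` has exactly `k` nonzero entries that sum to 1, which is the sparsity the top-k step is meant to provide.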

Simply adding noise is not enough, and various solutions have been proposed. The most common strategies are adding an auxiliary load balancing loss and expert-based strategies such as expert capacity and expert choice.

The following sections will introduce related technologies for solving load balancing.

1.2 Auxiliary Loss Function

To alleviate load imbalance, an auxiliary load balancing loss function is introduced on top of the objective function of model training. This encourages giving equal importance to all experts, thereby promoting the even distribution of tokens among experts in each batch. These loss terms are added to the training objective.

1.2.1 Definition

The paper “A Survey on Mixture of Experts” defines the auxiliary loss function as shown in the figure below, where N is the number of experts, T is the number of tokens, and K is the number of experts activated per input token. To distribute the T tokens uniformly among the N experts, the load balancing loss should be minimized. The figure also shows the optimal case, in which each expert receives an equal fraction of the dispatched tokens, $D_i = 1/N$, and an equal share of gating probability, $P_i = 1/N$. This keeps the workload uniformly distributed across all experts.

[Figure 2902]

1.2.2 Loss Function

Since the auxiliary loss function is introduced based on the target loss function, let’s first look at the loss function.

The earliest loss function comes from the paper “Adaptive Mixture of Local Experts”. There are two loss functions, corresponding to labels 1 and 2 in the figure below. For training case c, $p_i^c$ is the mixing proportion that the gating network assigns to the i-th expert, $o_i^c$ is the output of the i-th expert, and $d^c$ is the desired output.

The first scheme has a serious problem: different experts influence each other, which leads to strong coupling between expert networks. Because the loss is computed on the weighted sum of all experts’ outputs, a change in one expert’s weights affects the loss seen by every other expert. This coupling may result in multiple expert networks being used to process every sample, rather than each focusing on its own subtask.

To address this issue, the paper redefines the loss function to encourage competition among expert networks, i.e., Scheme 2. Intuitively, the improvement involves each expert calculating their loss independently, with the overall loss obtained by weighting the losses of multiple experts. This means that each expert’s goal in handling a specific sample is independent of other experts. If both the gating network and the experts use this new loss for gradient descent training, the system will tend to assign a particular class of samples to a specific expert. This is because when an expert’s loss on a given sample is less than the average loss of all experts, its gating score for that sample will increase; conversely, when its performance is worse than the average loss, its gating score will decrease. This mechanism encourages competition, rather than cooperation, among experts, thereby improving learning efficiency and generalization ability.

[Figure 2903]
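For reference, the two schemes can be written out as follows, using the symbols defined above. The gradient lines are a short worked derivation added here, assuming the gating proportions come from a softmax over logits $x_i$, and writing $E_i^c = \|d^c - o_i^c\|^2$.

```latex
% Scheme 1: the error is computed on the blended output (experts are coupled).
E^c = \Big\| d^c - \sum_i p_i^c \, o_i^c \Big\|^2,
\qquad
\frac{\partial E^c}{\partial o_i^c} = -2\, p_i^c \Big( d^c - \sum_j p_j^c \, o_j^c \Big)

% Scheme 2: each expert is scored on its own, then the errors are blended.
E^c = \sum_i p_i^c \, \big\| d^c - o_i^c \big\|^2,
\qquad
\frac{\partial E^c}{\partial o_i^c} = -2\, p_i^c \big( d^c - o_i^c \big)

% Gating gradient for Scheme 2 when p^c = softmax(x):
\frac{\partial E^c}{\partial x_i} = p_i^c \Big( E_i^c - \sum_j p_j^c \, E_j^c \Big)
```

In Scheme 1 the gradient for expert i depends on the outputs of all other experts, while in Scheme 2 it depends only on expert i. The last line shows why gradient descent raises the gating logit of an expert whose error is below the gate-weighted average error and lowers it otherwise.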

Next, let’s see how to add an auxiliary loss function.

1.2.3 Industry Pioneer

The paper “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer” pioneered the design of an additional loss function to ensure that all experts have equal importance. It also pioneered a differentiable heuristic with an auxiliary load-balance loss: expert outputs are weighted by their selection probabilities, which makes the gating process differentiable and allows the gating function to be optimized by gradient descent. This method subsequently became the mainstream research paradigm in the field of MoE.

The paper first defines the importance of each expert relative to a batch of training samples X, Importance(X), as the sum of that expert’s gating values over the batch. Experts that are frequently selected in the batch have high importance scores. It then defines an additional loss $L_{importance}(X)$, which is added to the model’s overall loss function. This loss equals the square of the coefficient of variation (CV, a measure of the dispersion of a set of values) of the importance values, multiplied by a hand-tuned scaling factor $w_{importance}$. It encourages all experts to have equal importance in every batch.

While this loss ensures equal importance, experts may still receive very different numbers of examples. For instance, one expert might receive a few examples with large weights, while another might receive many examples with small weights. This can cause memory and performance problems on distributed hardware. To encourage each expert to receive the same number of samples, the paper introduces a second loss function, $L_{load}$, which further promotes load balancing.

The derivation of the specific calculation formula is shown in the figure below.

[Figure 2904]
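The importance loss described above can be sketched as follows. This is a minimal NumPy sketch, assuming `gates` holds the gating values for one batch; the function names and the small `eps` guard are illustrative choices, and $L_{load}$ is not covered here.

```python
import numpy as np

def cv_squared(values, eps=1e-10):
    # Squared coefficient of variation: variance / mean^2.
    return values.var() / (values.mean() ** 2 + eps)

def importance_loss(gates, w_importance):
    """gates: (tokens, num_experts) gating values for one batch."""
    importance = gates.sum(axis=0)          # Importance(X), one value per expert
    return w_importance * cv_squared(importance)

# Balanced batch: every expert gets the same total gating value.
balanced_gates = np.full((8, 4), 0.25)
# Collapsed batch: all gating weight goes to expert 0.
collapsed_gates = np.zeros((8, 4))
collapsed_gates[:, 0] = 1.0

balanced_loss = importance_loss(balanced_gates, w_importance=0.1)
collapsed_loss = importance_loss(collapsed_gates, w_importance=0.1)
```

The balanced batch gives a loss of 0, while the collapsed batch is penalized, which is the behavior the CV-based term is designed to have.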

1.2.4 GShard

There is a simple idea for designing the loss function. Suppose there are S tokens and E experts, and let $c_e$ denote the number of tokens received by the e-th expert. If the MoE is load-balanced, every expert receives the same number of tokens, so the token fraction $\frac{c_e}{S}$ is the same for all experts. GShard therefore starts from the mean of the squared token fractions as the load balancing loss (see number 1 in the figure below). $l_{aux}$ reaches its minimum when all experts receive the same fraction of tokens. However, $c_e$ comes from the top-k function and is not differentiable. Such a loss also does not contain the parameters of the gating function, so it cannot drive gradient updates during training. GShard therefore replaces one of the two factors $\frac{c_e}{S}$ in the squared term with $m_e$, the mean gating softmax value for expert e (see number 2 in the figure below). Because $c_e$ and $m_e$ are correlated, adjusting $m_e$ also affects $c_e$, which makes it possible to adjust the load distribution among experts.

The detailed derivation process is shown in the figure below.

[Figure 2905]
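A minimal NumPy sketch of this differentiable surrogate is shown below, assuming top-1 assignment for counting $c_e$. The function name is illustrative; in a real framework only the `m` factor would carry gradients.

```python
import numpy as np

def gshard_aux_loss(gate_probs):
    """gate_probs: (S, E) softmax outputs of the gating network."""
    num_tokens, num_experts = gate_probs.shape
    top1 = gate_probs.argmax(axis=-1)
    c = np.bincount(top1, minlength=num_experts)   # tokens per expert (not differentiable)
    m = gate_probs.mean(axis=0)                    # mean gate value per expert
    return float(np.mean((c / num_tokens) * m))

# 8 tokens, 4 experts; each token puts 0.7 on its preferred expert and 0.1 elsewhere.
balanced_probs = 0.1 + 0.6 * np.eye(4)[np.arange(8) % 4]
collapsed_probs = 0.1 + 0.6 * np.eye(4)[np.zeros(8, dtype=int)]

balanced_loss = gshard_aux_loss(balanced_probs)
collapsed_loss = gshard_aux_loss(collapsed_probs)
```

With these toy inputs the balanced routing yields the smaller loss, so minimizing the surrogate pushes the router toward an even token distribution.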

To further prevent some experts from overloading themselves, GShard introduced the following concepts during training:

Expert capacity: approximately $C \approx \frac{2N}{E}$ for N tokens and E experts. Once an expert is assigned more tokens than this capacity, the extra tokens may be dropped (or passed to the next layer through the residual connection).
Local groups: instead of having all tokens compete globally, tokens are first divided into groups and then dispatched to experts within each group, which makes the dispatch less chaotic.

1.2.5 Switch Transformers

To further prevent tokens from being discarded, the Switch Transformer introduces a simplified auxiliary loss (similar to the auxiliary loss in GShard) to balance importance scores and perform load balancing among experts, aiming to ensure that each expert receives approximately the same number of tokens from the batch. This is illustrated in the diagram below, where α is a hyperparameter used to fine-tune the importance of this loss during training. A value that is too high will negatively impact the main loss function, while a value that is too low will fail to effectively balance the load.

This auxiliary loss function no longer computes the coefficient of variation. Instead, it takes a weighted product of the fraction of tokens assigned to each expert and the routing probability of each expert. The loss is differentiable with respect to P and can easily be incorporated into training. It encourages both the token fraction and the routing probability assigned to each expert to be 1/N, meaning that each expert is equally important and receives a balanced number of tokens.

[Figure 2906]
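The Switch Transformer auxiliary loss, $\alpha \cdot N \cdot \sum_i f_i P_i$, can be sketched as below. This is a minimal NumPy sketch with top-1 routing; `switch_aux_loss` and the toy inputs are illustrative, and the value of `alpha` here is just an example.

```python
import numpy as np

def switch_aux_loss(router_probs, alpha):
    """router_probs: (T, N) softmax outputs of the router for T tokens, N experts."""
    num_tokens, num_experts = router_probs.shape
    top1 = router_probs.argmax(axis=-1)
    # f_i: fraction of tokens dispatched to expert i (not differentiable).
    f = np.bincount(top1, minlength=num_experts) / num_tokens
    # P_i: mean routing probability of expert i (differentiable in a real framework).
    p = router_probs.mean(axis=0)
    return float(alpha * num_experts * np.sum(f * p))

# 8 tokens, 4 experts; each token puts 0.7 on its preferred expert and 0.1 elsewhere.
balanced_probs = 0.1 + 0.6 * np.eye(4)[np.arange(8) % 4]
collapsed_probs = 0.1 + 0.6 * np.eye(4)[np.zeros(8, dtype=int)]

balanced_loss = switch_aux_loss(balanced_probs, alpha=0.01)
collapsed_loss = switch_aux_loss(collapsed_probs, alpha=0.01)
```

When both `f` and `p` are uniform at 1/N, the loss equals `alpha`; routing every token to one expert makes it larger.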

1.2.6 General Approach

Su Jianlin, a renowned expert, provided a general approach to constructing Aux Loss: First, construct a loss function that meets the requirements based on F. Then, in implementation, replace F with P + sg[F−P]. Here, F represents the current load distribution of Expert, while P is equivalent to a smooth approximation of F. sg[] is the stop gradient operator, which keeps the forward output unchanged but forces the gradient to zero. Assume Q is uniformly distributed.

Q = (1/n,1/n,...,1/n)

Load balancing is equivalent to F=Q. The following formula is a relatively intuitive Aux Loss:

L_{aux} = \frac{1}{2} ||F - Q||^2 = \frac{1}{2}\sum_{i=1}^{n}(F_i - 1/n)^2 \tag{1}

Because F is computed from the output of arg top-k, the expression above is not a differentiable objective that can be used directly. The STE (Straight-Through Estimator) technique solves this: F is replaced with P + sg[F−P], so the forward value is still F while the gradient flows through P during backpropagation.

In addition, the common auxiliary load balancing loss is as follows, where n is the number of experts, p is the probability output by the router, and f is the fraction of tokens for which each expert is picked by top-k.

L_{aux} = \langle f, p \rangle = \sum_{i=1}^{n} f_i p_i \tag{2}

After the STE substitution, the gradient of formula (1) is the same as the gradient of formula (2) with f held constant, which shows that formula (2) can balance the expert load.
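The equivalence can be checked with a short derivation. It assumes that P is a softmax output (so its entries sum to 1) and that F is treated as a constant under the stop-gradient operator; $\theta$ denotes the router parameters.

```latex
\nabla_\theta \, \frac{1}{2} \big\| P + \mathrm{sg}[F - P] - Q \big\|^2
  = \sum_{i=1}^{n} \Big( F_i - \frac{1}{n} \Big) \nabla_\theta P_i
  = \nabla_\theta \sum_{i=1}^{n} \mathrm{sg}[F_i] \, P_i
    \; - \; \frac{1}{n} \, \nabla_\theta \sum_{i=1}^{n} P_i
  = \nabla_\theta \sum_{i=1}^{n} \mathrm{sg}[F_i] \, P_i
```

The last step uses $\sum_i P_i = 1$, whose gradient is zero. What remains is exactly the gradient of $\langle f, p \rangle$ with f held fixed.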

1.2.7 Summary

The figure below provides an overview of various auxiliary loss functions and their typical coefficient settings. The originators of each auxiliary loss function are shown in bold, and those who modified the original scheme are underlined.

[Figure 2907]

1.3 Expert Strategy

1.3.1 Expert Capacity

Since both GShard and Switch Transformers mention expert capacity, and GShard’s implementation is relatively simple, we will take Switch Transformers as an example to study it in detail.

The Switch Transformers model is TPU-oriented, requiring all tensor shapes to be statically fixed during compilation. However, because routing is dynamic, the tensor shape routed to each expert is also dynamic. To address this, Switch Transformers uses Expert Capacity. As shown in the diagram, Expert Capacity is calculated by dividing the total number of tokens in each batch by the number of experts, and then multiplying by the Capacity Factor. This variable determines the maximum number of tokens that can be routed to any expert. If the number of tokens received by the current expert exceeds its capacity, it will stop accepting tokens; these tokens routed to experts with full capacity are called overflow tokens.

[Figure 2908]

To address the issue of expert capacity, Switch Transformers has also introduced two processing strategies:

Zero padding. Because of capacity limitations, Switch Transformers assigns an Expert buffer to each expert to receive tokens within that capacity. If an expert’s Expert buffer is not full, it is padded with a zero vector; this is called zero padding. Specifically, the final input data dimension for each expert is (E, C, M), where C represents capacity. This ensures that the input data dimension processed by each expert is the same, which is beneficial for subsequent hardware-level processing.
Drop tokens. If too many tokens are sent to a single expert (i.e., exceeding the expert’s capacity), causing an overflow situation for that expert, Switch Transformers will skip the computation of these tokens. These tokens are directly passed to the next layer via a residual connection. Because these tokens have not been processed by the expert, this represents a loss of information.

Zero padding and dropped tokens are two sides of the same coin. Reducing capacity decreases zero padding, but it can lead to decreased model performance due to token overflow. Increasing capacity can alleviate dropped tokens, but excessively large capacity can cause more severe zero padding problems (affecting the sparsity of the matrix), resulting in more invalid computations and wasted computing resources. In addition, the more unbalanced the load, the more tokens require padding, leading to more invalid computations and potentially more dropped tokens.

As shown in the diagram above, a larger capacity factor requires more padding, resulting in more wasted computation. A more unbalanced load also requires more padding and likewise wastes computation. To achieve better load balancing, the authors also added a load balancing loss. The diagram shows 6 tokens and 3 experts; let’s examine the expert capacity under two different capacity factors.

A capacity factor of 1.0: As shown on the left side of the diagram above, the corresponding expert capacity is 2.
- Expert 1 has 3 tokens, so one token must be dropped, i.e., passed directly to the next layer via the residual connection.
- Expert 2 has 2 tokens, which is exactly equal to the expert’s capacity.
- Expert 3 has only 1 token, so you need to fill in 1 empty token.
A capacity factor of 1.5 corresponds to an expert capacity of 3, as shown on the right side of the diagram above.
- Expert 1 has 3 tokens, which is exactly equal to the expert’s capacity.
- Expert 2 has only 2 tokens, so you need to fill in 1 empty token.
- Expert 3 has only 1 token and needs to be filled with 2 empty tokens.
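The arithmetic in this example can be reproduced with a small sketch. The function names are illustrative, and truncating the capacity with `int()` is an assumption about rounding that does not matter for these numbers.

```python
def expert_capacity(num_tokens, num_experts, capacity_factor):
    # (tokens per batch / number of experts) * capacity factor
    return int(num_tokens / num_experts * capacity_factor)

def overflow_and_padding(tokens_per_expert, capacity):
    # Tokens beyond the capacity are dropped; unused buffer slots are zero-padded.
    dropped = [max(n - capacity, 0) for n in tokens_per_expert]
    padded = [max(capacity - n, 0) for n in tokens_per_expert]
    return dropped, padded

tokens_per_expert = [3, 2, 1]   # Expert 1, Expert 2, Expert 3 from the example

cap_1_0 = expert_capacity(6, 3, 1.0)
dropped_1_0, padded_1_0 = overflow_and_padding(tokens_per_expert, cap_1_0)

cap_1_5 = expert_capacity(6, 3, 1.5)
dropped_1_5, padded_1_5 = overflow_and_padding(tokens_per_expert, cap_1_5)
```

With a capacity factor of 1.0 one token is dropped and one slot is padded; with 1.5 nothing is dropped but three slots are padded, which illustrates the trade-off between dropped tokens and zero padding.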

For padding, we can use a mask. Let S = batch_size * seq_len, E be the number of experts, and C be the capacity (the expert buffer length). A plain gating network takes an input of shape (S, embedding_size) and produces an output of shape (S, E). A gating network with capacity handling takes the same input of shape (S, embedding_size) and produces two outputs:

combine_weights, of shape (S, E, C), holds the probability with which each of the S tokens goes to each expert. The probability is stored at the token’s position in that expert’s buffer (the C axis), and all other positions are filled with 0. For example, with E = 3 and C = 3, the entry [[p1, 0, 0], [0, p2, 0], [0, 0, 0]] for one token means that the token goes to two experts: to the first expert with probability p1 (buffer slot 0) and to the second expert with probability p2 (buffer slot 1).
dispatch_mask, of shape (S, E, C), is False (0) where combine_weights is 0 and True (1) where it is nonzero. The mask is used when filling the expert buffers.

Let’s look at a concrete example. Assume the following input:

Token number S = 3
Expert number E = 4
Buffer length C = 3 (each expert can receive a maximum of 3 tokens)
Embedding dimension M = 512

reshape_input (S=3, M=512):

Token A → [a1, ..., a512]  # each token has feature dimension 512
Token B → [b1, ..., b512]
Token C → [c1, ..., c512]

The probabilities after obtaining the top 2 are as follows:

combine_weights (S=3, E=4, C=3):
[
    [[0, 0, 0], [0.57, 0, 0], [0, 0, 0], [0.43, 0, 0]],    # Token A is sent to experts 1 and 3
    [[0, 0, 0], [0, 0.75, 0], [0.25, 0, 0], [0, 0, 0]],    # Token B is sent to experts 1 and 2 (slot 1 of expert 1, after Token A)
    [[0.375, 0, 0], [0, 0, 0], [0, 0.625, 0], [0, 0, 0]]   # Token C is sent to experts 0 and 2 (slot 1 of expert 2, after Token B)
]

The mask is as follows:

dispatch_mask (S=3, E=4, C=3):
[
    [[0, 0, 0], [1, 0, 0], [0, 0, 0], [1, 0, 0]],  # Token A is sent to experts 1 and 3
    [[0, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 0]],  # Token B is sent to experts 1 and 2
    [[1, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, 0]]   # Token C is sent to experts 0 and 2
]

The mask works with the input token to arrange the tokens according to the expert’s buffer order, forming dispatched_expert_input.

dispatched_expert_input (E=4, C=3, M=512):
[
    [[c1, ..., c512], [0, ..., 0], [0, ..., 0]],        # Expert 0 receives token C
    [[a1, ..., a512], [b1, ..., b512], [0, ..., 0]],    # Expert 1 receives tokens A and B
    [[b1, ..., b512], [c1, ..., c512], [0, ..., 0]],    # Expert 2 receives tokens B and C
    [[a1, ..., a512], [0, ..., 0], [0, ..., 0]]         # Expert 3 receives token A
]

Next, expert calculations will be performed.

expert_outputs (E=4, C=3, M=512):
[
    [[c1', ..., c512'], [0, ..., 0], [0, ..., 0]],          # Expert 0 (primes denote values after the expert FFN)
    [[a1', ..., a512'], [b1', ..., b512'], [0, ..., 0]],    # Expert 1
    [[b1', ..., b512'], [c1', ..., c512'], [0, ..., 0]],    # Expert 2
    [[a1', ..., a512'], [0, ..., 0], [0, ..., 0]]           # Expert 3
]

Finally, a weighted sum of expert_outputs is computed using combine_weights, which gives one output vector per token.
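The dispatch and combine steps of this example can be sketched with `numpy.einsum`. This is a minimal sketch under stated assumptions: the embedding size is reduced from 512 to 8 to keep it small, each expert is a hypothetical single linear map `expert_w[e]` rather than a full FFN, and the buffer slots are chosen so that two tokens sent to the same expert occupy different slots, consistent with dispatched_expert_input.

```python
import numpy as np

S, E, C, M = 3, 4, 3, 8   # tokens, experts, capacity, embedding size (8 instead of 512)

# combine_weights[s, e, c]: gate value of token s at buffer slot c of expert e.
combine_weights = np.zeros((S, E, C))
combine_weights[0, 1, 0] = 0.57    # Token A -> expert 1, slot 0
combine_weights[0, 3, 0] = 0.43    # Token A -> expert 3, slot 0
combine_weights[1, 1, 1] = 0.75    # Token B -> expert 1, slot 1
combine_weights[1, 2, 0] = 0.25    # Token B -> expert 2, slot 0
combine_weights[2, 0, 0] = 0.375   # Token C -> expert 0, slot 0
combine_weights[2, 2, 1] = 0.625   # Token C -> expert 2, slot 1
dispatch_mask = combine_weights > 0

rng = np.random.default_rng(0)
tokens = rng.standard_normal((S, M))          # reshape_input
expert_w = rng.standard_normal((E, M, M))     # one toy linear "expert" per index

# Dispatch: place each token into its slot; empty slots stay zero (zero padding).
dispatched = np.einsum("sec,sm->ecm", dispatch_mask.astype(tokens.dtype), tokens)
# Expert computation: every expert processes its own (C, M) buffer.
expert_outputs = np.einsum("ecm,emn->ecn", dispatched, expert_w)
# Combine: gate-weighted sum of the expert outputs for each token.
combined = np.einsum("sec,ecn->sn", combine_weights, expert_outputs)
```

Because every expert sees a buffer of the same shape (C, M), the expert computation is a single batched matrix multiply, which is the hardware-friendly property that zero padding buys.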

1.3.2 Expert Selection

In MoE routing, there are two options: Token-Choice and Expert-Choice.

Most mainstream MoE models currently use token choice routing (or simply top-k routing). This involves computing an affinity matrix from the matching degree between each input token and all experts, applying softmax to the affinity matrix, selecting the k experts with the highest matching degree for each token, and using the weighted sum of their outputs as the output for that token. This approach has several drawbacks:

Load imbalance: Since we cannot control the number of tokens received by each expert, uneven token distribution occurs. Some experts receive too few tokens, leading to insufficient training and utilization; others receive too many tokens, but due to memory limitations, only a limited number can be used, resulting in wasted token resources. Load imbalance also negatively impacts step latency, affecting inference time, as step latency can be determined by the expert with the highest load. Many methods incorporate auxiliary losses to mitigate this problem. However, these auxiliary losses do not guarantee load balance, especially in the crucial early stages of training.
Redundant Specialization/Under-Specialization: Each MoE layer uses a gating network to learn the affinity between tokens and experts. Ideally, the learned gating network should generate affinity so that similar or related tokens are routed to the same expert. However, if the gating network is a suboptimal strategy, it may result in redundant experts and/or under-specialized experts.
Same compute for every token: In the token choice strategy, each token is processed by exactly k experts and therefore consumes the same amount of computation. However, this is not necessarily appropriate. Tokens differ in importance, and an MoE model should allocate its computational resources flexibly according to the complexity of the input.

Expert choice routing (EC), the opposite of token choice routing, lets each expert select the k tokens with the highest matching degree from all input tokens (e.g., a batch). The image below is excerpted from the Expert-Choice paper; the left side shows a conventional (token choice) MoE, and the right side shows the expert choice MoE.

[Figure 2909]

The image below will make it clearer.

Token-Choice: As shown in the left figure, each token selects which experts to go through. This method can lead to load imbalance because multiple tokens may choose the same expert, exceeding the capacity limit and having to be discarded. Meanwhile, some experts have very few tokens and may need padding.
Expert-Choice: As shown in the diagram on the right, each expert selects the token to use. This ensures that each expert receives the same number of tokens. However, it is possible that some tokens are used by multiple experts, while some tokens are used by very few or even zero experts.

[Figure 2910]

The algorithm and derivation are shown in the figure below.

[Figure 2911]
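Expert choice routing can be sketched as follows. This is a minimal NumPy sketch, assuming each expert takes k = T × capacity_factor / E tokens (the capacity factor playing the role of the average number of experts per token); the function name and toy sizes are illustrative.

```python
import numpy as np

def expert_choice_route(scores, capacity_factor):
    """scores: (T, E) token-to-expert affinities (softmax over experts)."""
    num_tokens, num_experts = scores.shape
    k = int(num_tokens * capacity_factor / num_experts)   # tokens taken by each expert
    # Each expert (column) picks its k highest-scoring tokens.
    top_tokens = np.argsort(-scores, axis=0)[:k].T              # (E, k) token indices
    gates = np.take_along_axis(scores.T, top_tokens, axis=1)    # (E, k) gate values
    return top_tokens, gates

rng = np.random.default_rng(0)
logits = rng.standard_normal((8, 4))                            # 8 tokens, 4 experts
scores = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

top_tokens, gates = expert_choice_route(scores, capacity_factor=2.0)
# How many experts process each token; this can be 0 for some tokens.
experts_per_token = np.bincount(top_tokens.ravel(), minlength=scores.shape[0])
```

Every expert receives exactly k tokens, so the load is balanced by construction, while `experts_per_token` shows that the compute per token is now variable.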

The Expert Choice approach also has a limitation: it can leak information from future tokens. Each expert needs to see the routing scores of all tokens before deciding which ones to process, which violates causality in autoregressive models.

0x02 DeepSeek V1

2.1 Background & Motivation

Despite the great potential of the MoE architecture, existing MoE architectures may suffer from knowledge hybridity and knowledge redundancy, limiting the specialization of experts, as detailed below:

Knowledge hybridity. Tokens assigned to a specific expert may cover different areas of knowledge. The expert therefore tends to pack different types of knowledge into its parameters, and this mixed knowledge is hard to use simultaneously. The problem arises when the number of experts is limited: each expert has to handle an overly broad range of knowledge domains. This prevents experts from going deep into specific areas, so many of them are not true specialists but a mixture of various kinds of knowledge.
Knowledge redundancy. Knowledge redundancy arises when different experts in a MoE model learn similar knowledge, contradicting the model’s original design intent. This is because tokens assigned to different experts may require shared knowledge for processing. Consequently, multiple experts might converge some of their parameters to shared knowledge during training, leading to redundancy in expert parameters. Since the overall model capacity is limited, redundancy degrades the overall performance of the MoE model.
The potential of the MoE architecture actually depends on the expertise of its “experts,” each of whom should possess non-overlapping and focused knowledge across different domains. If alignment and interchangeability cannot be achieved in the internal world model, the output of the MoE will be a result of a split personality. Here, the internal world model refers to the “realistic shared statistical model” described by MIT scholars (“Platonic Representation Hypothesis”), also equivalent to “a rich category represented by probability”—the larger, richer, and more accurate the model, the better.

In essence, this boils down to the trade-off between Knowledge Specialization and Knowledge Sharing.

Knowledge specialization means that each expert can acquire non-overlapping and focused knowledge.
Knowledge Sharing means using gating networks and collaborative mechanisms to enable different experts to share underlying knowledge or common features while independently handling specific tasks, thereby improving the overall performance and efficiency of the model.

Pursuing generalism may sacrifice skill depth, while pursuing specialization cannot abandon general knowledge, because it is difficult to define the boundaries of each expert’s level of specialization and the scope of their general knowledge in MoE. This problem together hinders the specialization of experts in current MoE practices, preventing them from reaching the theoretical upper limit of MoE model performance.

2.2 Solution

To address the aforementioned issues, DeepSeek-MoE introduced the concepts of “fine-grained/vertical experts” and “shared experts”.

“Fine-grained/vertical experts” are experts obtained through fine-grained expert segmentation. While keeping the overall number of parameters constant, DeepSeek-MoE further segments an FFN into even finer granularities, constructing more experts. The idea behind this technique is very simple: if more experts are activated, the knowledge required to process a particular token is more likely to be decomposed and acquired by different experts. By increasing the number of experts and reducing the number of parameters for each expert, the number of topics each expert is responsible for can be reduced, allowing each expert to focus more on its domain, thereby improving the model’s specialization and accuracy across various domains and enabling it to handle broader and more complex topics. Furthermore, increasing the number of experts also increases the number of possible expert combinations, thus improving the model’s adaptability and generalization ability to different inputs.
“Shared experts” are experts that hold more general or common knowledge. While increasing the number of experts can improve the model’s specialization, overly specialized experts may not cover a sufficiently broad range of topics, which limits the model’s applicability. Shared experts address this over-specialization by compressing common knowledge into a small set of designated experts, which reduces knowledge redundancy in the fine-grained experts and lets each routed expert stay specialized in a specific domain. The number of shared experts is fixed and they are always active: every token is sent to them regardless of what the router computes. This allows the model to maintain broad knowledge while ensuring that the appropriate areas of expertise are involved in each prediction.

Returning to the earlier example of hiring programmers, “fine-grained/vertical experts” further refine the front-end and back-end tasks. For example, the back-end might be divided into several server development programmers and several big data processing programmers. “Shared experts” involve specifically creating a DBA role to provide unified database support to server development colleagues.

The following diagram is from the DeepSeekMoE paper. Subfigure (a) shows an MoE layer with the conventional top-2 routing strategy. Subfigure (b) illustrates the fine-grained expert segmentation strategy, with K=4. Subfigure (c) then adds the shared expert isolation strategy, which completes the DeepSeekMoE architecture. Here, the shared experts are fixed, meaning they always run for every token. These shared experts hold broad knowledge, while the highly specialized routed experts (up to 64 for this particular model) hold finer-grained knowledge.

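[Figure: (a) conventional top-2 routing, (b) fine-grained expert segmentation, (c) shared expert isolation; from the DeepSeekMoE paper]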

The following diagram shows the formula for the DeepSeekMoE architecture, along with a comparison with the vanilla Transformer and the conventional MoE. In the formula:

mN is the total number of fine-grained experts (sub-experts). The intermediate hidden dimension of each expert FFN is reduced to 1/m of the original, which raises the number of experts from N to mN, and the number of activated experts is raised from K to mK accordingly.
$g_{i,t}$ is the gating weight of the i-th expert for the t-th token.
$FFN_i$ denotes the i-th expert FFN.
$u_t^l$ and $h_t^l$ represent the hidden input state and the hidden output state of the t-th token in the l-th Transformer layer.
$e_i^l$ is the centroid of the i-th expert in the l-th layer. In the implementation it is a learnable vector (one row of the gate weight matrix) rather than an explicitly maintained average; intuitively, it can be thought of as approximating the mean of the input tokens that have been routed to that expert. The inner product between $u_t^l$ and $e_i^l$ then measures how similar the current token is to the typical token handled by the i-th expert. If the i-th expert has processed many inputs similar to the current token, it should be better at processing the current token.
$s_{i,t}$ is the token-to-expert affinity, specifically the score of the i-th expert for the t-th token. The similarities are turned into a probability distribution with the softmax function, so each token gets one score per expert, indicating its preference for that expert. The routing parameters are learnable and are trained jointly with the rest of the model.

The routing mechanism selects the top mK highest-scoring experts for each token from all fine-grained experts.

[Figure: DeepSeekMoE formulas compared with the vanilla Transformer and the conventional MoE]
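Written out in the notation of the DeepSeekMoE paper, the layer computes roughly the following. This is a transcription sketch: $K_s$ (the number of shared experts) is a symbol from the paper that does not appear in the text above, and once shared experts are isolated the router picks $mK - K_s$ of the remaining $mN - K_s$ routed experts.

```latex
h_t^l = \sum_{i=1}^{K_s} \mathrm{FFN}_i\left(u_t^l\right)
      + \sum_{i=K_s+1}^{mN} g_{i,t}\,\mathrm{FFN}_i\left(u_t^l\right) + u_t^l

g_{i,t} =
\begin{cases}
s_{i,t}, & s_{i,t} \in \mathrm{Topk}\left(\{ s_{j,t} \mid K_s+1 \le j \le mN \},\ mK-K_s\right) \\
0, & \text{otherwise}
\end{cases}

s_{i,t} = \mathrm{Softmax}_i\left({u_t^l}^{\top} e_i^l\right)
```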

2.3 Load Balancing

Automatically learned routing strategies may encounter load imbalance issues, which leads to two significant drawbacks:

Risk of routing collapse: the model always selects only a few experts, so the other experts do not receive sufficient training.
If experts are distributed across multiple devices, the unbalanced load will exacerbate the computing bottleneck.

Therefore, DeepSeek introduces two loss functions, as shown in the figure below. $f_i$ is the fraction of tokens routed to expert i. $P_i$ is that expert's average routing probability (gating probability). $\alpha_1$ is the coefficient of the expert-level loss. D is the number of devices. $f'_i$ and $P'_i$ are the token fraction and the routing probability aggregated over the experts on device i. $\alpha_2$ is the coefficient of the device-level loss. The loss is differentiable with respect to P and can easily be incorporated into training. It encourages both the fraction of tokens assigned to each expert and the router probability assigned to each expert to be 1/N, meaning that the experts are equally important and receive a balanced number of tokens. Through this dual balance loss design, DeepSeekMoE can keep the experts finely specialized while minimizing the overuse of particular experts or devices, thereby reducing the computational bottleneck caused by uneven routing.

[Figure: expert-level and device-level balance losses]
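For reference, the two losses can be written out as follows, transcribed from the DeepSeekMoE paper. $N'$ and $K'$ are the numbers of routed experts and of activated routed experts, $E_i$ is the group of experts placed on device i, and $\mathbb{1}(\cdot)$ is the indicator function.

```latex
% Expert-level balance loss
L_{ExpBal} = \alpha_1 \sum_{i=1}^{N'} f_i P_i, \qquad
f_i = \frac{N'}{K'T} \sum_{t=1}^{T} \mathbb{1}(\text{token } t \text{ selects expert } i), \qquad
P_i = \frac{1}{T} \sum_{t=1}^{T} s_{i,t}

% Device-level balance loss
L_{DevBal} = \alpha_2 \sum_{i=1}^{D} f'_i P'_i, \qquad
f'_i = \frac{1}{|E_i|} \sum_{j \in E_i} f_j, \qquad
P'_i = \sum_{j \in E_i} P_j
```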

2.3.1 Expert-Level Balance Loss

The purpose of the expert-level balance loss is to balance the computational load across experts. Its scaling is chosen so that the magnitude of the loss stays constant and does not change with the number of experts. The following figure compares the loss functions of GShard and DeepSeek V1, and we can see the following:

Items 1.1 and 2.1 differ only in the scaling factor: $\frac{1}{S}$ versus $\frac{N'}{K'T}$.
Items 1.2 and 2.2 are the same.

Therefore, let’s look at why item 2.1 has the number of routed experts (N′) in the numerator and the number of activated routed experts (K′) in the denominator.

Consider the ideal case in which tokens are evenly distributed. Each of the T tokens activates K′ experts, so a total of TK′ token-to-expert assignments are spread evenly over N′ experts, and each expert receives TK′/N′ of them. Therefore:

$f_i$ is the ratio of tokens sent to expert i, which in this case should be K′/N′.
$P_i$ is the ratio of router probability allocated to expert i, which in this case should be 1/N′.

Substituting these two into $L_{ExpBal}$ and summing over the N′ experts leaves a K′/N′ term in the final loss expression. This term changes with the number of routed experts (N′) and the number of activated routed experts (K′), which is the problem. To remove it and keep the loss magnitude constant, V1 multiplies the auxiliary loss by N′/K′ so that the loss does not change with the number of experts. This is likely for easier hyperparameter tuning and comparison.
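The argument above can be checked numerically. The following is a minimal sketch with made-up expert counts; the helper names are hypothetical and are not taken from the DeepSeek code. It evaluates the loss under perfectly balanced routing, with and without the N′/K′ scaling.

```python
def expert_balance_loss(f, p, alpha=1.0):
    """Auxiliary loss alpha * sum_i f_i * P_i for per-expert lists f and p."""
    return alpha * sum(fi * pi for fi, pi in zip(f, p))


def balanced_loss(n_experts, top_k, scaled):
    # Perfectly balanced routing: every expert receives a K'/N' share of the
    # tokens and an average routing probability of 1/N'.
    share = top_k / n_experts
    scale = n_experts / top_k if scaled else 1.0
    f = [share * scale] * n_experts
    p = [1.0 / n_experts] * n_experts
    return expert_balance_loss(f, p)


for n, k in [(8, 2), (64, 6), (160, 6)]:
    # Unscaled loss equals K'/N' and varies; scaled loss stays at 1.
    print(n, k, balanced_loss(n, k, scaled=False), balanced_loss(n, k, scaled=True))
```

Without the scaling the balanced loss equals K′/N′ and moves with the configuration; with the scaling it stays at 1 regardless of N′ and K′.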

[Figure: comparison of the GShard and DeepSeek V1 loss functions]

2.3.2 Device-Level Balance Loss

DeepSeek divides the experts into D groups and places each group on a single device. The device-level balance loss keeps the computational load balanced across GPUs or devices. This shows up in the loss hyperparameters $\alpha_1$ and $\alpha_2$: by design, the coefficient of the expert-level balance loss is set to a smaller value, while the coefficient of the device-level balance loss is set to a larger one. The device-level loss works at a coarser granularity than the expert-level loss; it amounts to load balancing across groups of experts, which better balances the computing load of the devices.
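A minimal sketch of the device-level loss, assuming the definitions in the DeepSeekMoE paper: $f'_i$ is the mean of $f_j$ over the experts on device i, and $P'_i$ is the sum of $P_j$ over the same experts. The function name and the toy numbers are hypothetical.

```python
def device_balance_loss(f, p, groups, alpha2=1.0):
    """f, p: per-expert scaled token fraction and mean routing probability.
    groups: one list of expert indices per device."""
    loss = 0.0
    for experts in groups:
        f_dev = sum(f[j] for j in experts) / len(experts)  # f'_i: mean over the group
        p_dev = sum(p[j] for j in experts)                 # P'_i: sum over the group
        loss += f_dev * p_dev
    return alpha2 * loss


groups = [[0, 1], [2, 3]]  # 4 experts placed on 2 devices
balanced = device_balance_loss([1, 1, 1, 1], [0.25] * 4, groups)
skewed = device_balance_loss([2, 2, 0, 0], [0.4, 0.4, 0.1, 0.1], groups)
print(balanced, skewed)
```

In this toy case the skewed routing, where the first device receives all the tokens, yields a larger loss than the balanced one.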


2.4 Implementation

The model structure is as follows.

(mlp): DeepseekMoE(
  (experts): ModuleList(
    (0-159): 160 x DeepseekMLP(
      (gate_proj): Linear(in_features=5120, out_features=1536, bias=False)
      (up_proj): Linear(in_features=5120, out_features=1536, bias=False)
      (down_proj): Linear(in_features=1536, out_features=5120, bias=False)
      (act_fn): SiLU()
    )
  )
  (gate): MoEGate()
  (shared_experts): DeepseekMLP(
    (gate_proj): Linear(in_features=5120, out_features=3072, bias=False)
    (up_proj): Linear(in_features=5120, out_features=3072, bias=False)
    (down_proj): Linear(in_features=3072, out_features=5120, bias=False)
    (act_fn): SiLU()
  )
)

The key code is as follows. The source code of the MoEGate class differs somewhat from the paper: it computes the load balancing loss over all tokens within a batch, so T is the total number of tokens in the batch. torch.topk() returns the k largest elements of the input tensor along a given dimension, together with their indices.

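import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEGate(nn.Module):
    # Excerpt: self.weight, self.top_k, self.n_routed_experts and self.alpha
    # are set in __init__, which is omitted here.
    def forward(self, hidden_states):
        bsz, seq_len, h = hidden_states.shape
        # T in the formula covers all tokens of one batch; it is recomputed for every batch
        hidden_states = hidden_states.view(-1, h)

        logits = F.linear(hidden_states, self.weight, None)
        scores_for_aux = logits.softmax(dim=-1)

        topk_weight, topk_idx = torch.topk(scores_for_aux, k=self.top_k, dim=-1, sorted=False)
        topk_idx_for_aux_loss = topk_idx.view(bsz, -1)
        # One-hot encode the selected experts to build the expert mask, which marks which experts are called
        mask_ce = F.one_hot(topk_idx_for_aux_loss.view(-1), num_classes=self.n_routed_experts)
        ce = mask_ce.float().mean(0)
        # Compute Pi, fi and aux_loss. Nothing is accumulated across batches; each batch is computed on its own
        Pi = scores_for_aux.mean(0)
        fi = ce * self.n_routed_experts
        aux_loss = (Pi * fi).sum() * self.alpha
        return topk_idx, topk_weight, aux_loss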

0x03 DeepSeek V2

DeepSeek-V2 pushes fine-grained expert segmentation further: each token selects 6 of 160 routed experts, plus 2 shared experts. It also adds a device-limited routing mechanism and two more load balancing methods.


3.1 Load Balancing

Compared to DeepSeek-V1-MoE, the MoE module has been upgraded in three main aspects, particularly in load balancing.

Device-Limited Routing

In expert parallelism, the experts of an MoE layer are spread across multiple devices for parallel training. Because DeepSeek’s MoE uses a fine-grained expert design, there are typically many experts (the V2 model has 160 routed experts, 6 of which are activated per token). If experts were placed on devices at random, the experts that a given token could activate would span most of the devices, resulting in very high communication costs for the MoE layer.

To address this issue, DeepSeekV2 introduces a device-limited expert routing mechanism. Specifically, it requires that each token can be routed to a maximum of M devices (M less than TopK) to control communication costs. This is achieved in two steps:

For each token, first select the M devices that host the experts with the highest gating scores $s_{i,t}$.
Then select the top K experts from all the experts on these M devices.

DeepSeek’s practical verification shows that when M≥3, the restricted TopK selection operation performs roughly the same as the unrestricted global TopK selection operation. Therefore, in the V2 model, DeepSeek selects TopK=6 and M=3.
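The two steps can be sketched for a single token as follows. This is an illustrative pure-Python version, not the DeepSeek implementation: it assumes experts are laid out contiguously, device by device, and that a device is ranked by the best gating score among its experts. The function name and the scores are made up.

```python
def device_limited_topk(scores, experts_per_device, m_devices, top_k):
    """scores: gating scores of one token, with experts laid out device by device."""
    n_devices = len(scores) // experts_per_device
    # Step 1: rank each device by its best expert and keep the top M devices.
    device_best = [
        max(scores[d * experts_per_device:(d + 1) * experts_per_device])
        for d in range(n_devices)
    ]
    kept = sorted(range(n_devices), key=lambda d: device_best[d], reverse=True)[:m_devices]
    # Step 2: ordinary top-K, restricted to the experts on the kept devices.
    candidates = [i for i in range(len(scores)) if i // experts_per_device in kept]
    return sorted(candidates, key=lambda i: scores[i], reverse=True)[:top_k]


# 8 experts on 4 devices (2 per device), at most M=2 devices, K=3 experts.
scores = [0.30, 0.05, 0.25, 0.02, 0.20, 0.01, 0.10, 0.07]
selected = device_limited_topk(scores, experts_per_device=2, m_devices=2, top_k=3)
print(selected)
```

In this toy case an unrestricted top-3 would pick experts 0, 2 and 4, which touches three devices; with the device limit, expert 4 is replaced by expert 1 so that only two devices are involved.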

Communication Balance Loss

The device-constrained routing mechanism described above can alleviate the problem of distributing data from the input side to multiple devices, reducing multi-fanout communication. However, on the device receiving side, there may still be a problem of a few devices’ experts being activated simultaneously (i.e., a single device accepting more tokens than other devices), leading to communication congestion. Therefore, compared to V1, the V2 model adds a communication load balancing loss.

Here $E_i$ is the group of experts on the i-th device, D is the number of devices, M is the device limit used in device-limited routing, and T is the number of tokens in a batch. $\alpha_3$ is the coefficient of this auxiliary loss. In $f''_i$, multiplying by D and dividing by M likewise keeps the magnitude of the loss independent of the number of devices and of the routing limit.

Device-limited routing and the communication balance loss both address communication load, but from different ends. Device-limited routing caps the fan-out on the sending side, while the communication balance loss balances the receiving side by encouraging each device to receive an equal number of tokens. Together, the two methods keep the communication load balanced for both the inputs and the outputs of each device.

[Figure: communication balance loss]
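The communication balance loss described above can be written out as follows, transcribed from the DeepSeek-V2 paper. $P_j$ is the per-expert average routing probability used in the expert-level loss, and $\mathbb{1}(\cdot)$ is the indicator function.

```latex
L_{CommBal} = \alpha_3 \sum_{i=1}^{D} f''_i P''_i

f''_i = \frac{D}{MT} \sum_{t=1}^{T} \mathbb{1}(\text{token } t \text{ is sent to device } i)

P''_i = \sum_{j \in E_i} P_j
```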

Device-level token dropping strategy

While the multiple load-balancing losses (expert-level, device-level, and communication) can guide the model toward balanced communication and computation, they cannot strictly guarantee load balance. On the receiving side, activation may still be concentrated on a few experts, leading to congestion. To further improve computational load balancing, a device-level token dropping strategy is introduced. It works as follows:

First, for a batch of input tokens, calculate the average number of tokens received by each device, and use this number as the capacity limit C of each device.
The T_d tokens actually assigned to each device are sorted in descending order of routing score.
If T_d > C, the tokens beyond the capacity C are dropped. For these tokens the MoE expert computation is skipped and the token is passed directly to the next Transformer layer. In other words, only the residual path is used in this layer, but the token may still take part in expert computation in other layers.

To keep training and inference consistent, the authors also ensured that the tokens of roughly 10% of the training samples were never dropped, so that the model can also be run without dropping any tokens at inference time.
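The dropping rule can be sketched as follows. This is an illustrative version rather than the DeepSeek implementation: the function name, the tuple layout, and the `capacity_factor` parameter are hypothetical, and the capacity is simply the average number of assignments per device, as described above.

```python
def drop_tokens(assignments, n_devices, capacity_factor=1.0):
    """assignments: list of (token_id, device_id, score) tuples for one batch.
    Returns (kept, dropped) lists of assignments."""
    # Capacity C: the average number of tokens per device (times an optional factor).
    capacity = int(capacity_factor * len(assignments) / n_devices)
    kept, dropped = [], []
    for d in range(n_devices):
        on_device = sorted(
            (a for a in assignments if a[1] == d), key=lambda a: a[2], reverse=True
        )
        kept.extend(on_device[:capacity])     # highest-scoring tokens stay
        dropped.extend(on_device[capacity:])  # the rest skip the expert FFN
    return kept, dropped


assignments = [
    (0, 0, 0.9), (1, 0, 0.2), (2, 0, 0.7), (3, 0, 0.5),  # device 0 is overloaded
    (4, 1, 0.6), (5, 1, 0.4),
]
kept, dropped = drop_tokens(assignments, n_devices=2)
print(dropped)
```

With 6 assignments on 2 devices the capacity is 3, so device 0 drops its lowest-scoring token while device 1 keeps everything.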

3.2 Implementation

The code for version V2 is as follows.

class DeepseekV2MoE(nn.Module):
    """
    A mixed expert module containing shared experts.
    """

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.num_experts_per_tok = config.num_experts_per_tok

        if hasattr(config, "ep_size") and config.ep_size > 1:
            assert config.ep_size == dist.get_world_size()
            self.ep_size = config.ep_size
            self.experts_per_rank = config.n_routed_experts // config.ep_size
            self.ep_rank = dist.get_rank()
            self.experts = nn.ModuleList(
                [
                    (
                        DeepseekV2MLP(
                            config, intermediate_size=config.moe_intermediate_size
                        )
                        if i >= self.ep_rank * self.experts_per_rank
                        and i < (self.ep_rank + 1) * self.experts_per_rank
                        else None
                    )
                    for i in range(config.n_routed_experts)
                ]
            )
        else:
            self.ep_size = 1
            self.experts_per_rank = config.n_routed_experts
            self.ep_rank = 0
            self.experts = nn.ModuleList(
                [
                    DeepseekV2MLP(config, intermediate_size=config.moe_intermediate_size)
                    for i in range(config.n_routed_experts)
                ]
            )
        self.gate = MoEGate(config)
        if config.n_shared_experts is not None:
            intermediate_size = config.moe_intermediate_size * config.n_shared_experts
            self.shared_experts = DeepseekV2MLP(
                config=config, intermediate_size=intermediate_size
            )

    def forward(self, hidden_states):
        identity = hidden_states
        orig_shape = hidden_states.shape
        topk_idx, topk_weight, aux_loss = self.gate(hidden_states)
        hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
        flat_topk_idx = topk_idx.view(-1)
        if self.training:
            hidden_states = hidden_states.repeat_interleave(
                self.num_experts_per_tok, dim=0
            )
            y = torch.empty_like(hidden_states)
            for i, expert in enumerate(self.experts):
                # Loop over all available experts in the model and run each one on its tokens
                y[flat_topk_idx == i] = expert(hidden_states[flat_topk_idx == i])
            y = (y.view(*topk_weight.shape, -1) * topk_weight.unsqueeze(-1)).sum(dim=1)
            y = y.view(*orig_shape)
            y = AddAuxiliaryLoss.apply(y, aux_loss)
        else:
            y = self.moe_infer(hidden_states, topk_idx, topk_weight).view(*orig_shape)
        if self.config.n_shared_experts is not None:
            y = y + self.shared_experts(identity)
        return y

    @torch.no_grad()
    def moe_infer(self, x, topk_ids, topk_weight):
        cnts = topk_ids.new_zeros((topk_ids.shape[0], len(self.experts)))
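        # scatter_() is an in-place op that writes values into the target tensor at the given indices; the first argument 1 is the dimension to index along, and the third argument 1 is the value written at the positions given by the second argument topk_ids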
        cnts.scatter_(1, topk_ids, 1)
        tokens_per_expert = cnts.sum(dim=0)
        idxs = topk_ids.view(-1).argsort()
        sorted_tokens = x[idxs // topk_ids.shape[1]]
        sorted_tokens_shape = sorted_tokens.shape
        if self.ep_size > 1:
            tokens_per_ep_rank = tokens_per_expert.view(self.ep_size, -1).sum(dim=1)
            tokens_per_expert_group = tokens_per_expert.new_empty(
                tokens_per_expert.shape[0]
            )
            dist.all_to_all_single(tokens_per_expert_group, tokens_per_expert)
            output_splits = (
                tokens_per_expert_group.view(self.ep_size, -1)
                .sum(1)
                .cpu()
                .numpy()
                .tolist()
            )
            gathered_tokens = sorted_tokens.new_empty(
                tokens_per_expert_group.sum(dim=0).cpu().item(), sorted_tokens.shape[1]
            )
            input_split_sizes = tokens_per_ep_rank.cpu().numpy().tolist()
            dist.all_to_all(
                list(gathered_tokens.split(output_splits)),
                list(sorted_tokens.split(input_split_sizes)),
            )
            tokens_per_expert_post_gather = tokens_per_expert_group.view(
                self.ep_size, self.experts_per_rank
            ).sum(dim=0)
            gatherd_idxs = np.zeros(shape=(gathered_tokens.shape[0],), dtype=np.int32)
            s = 0
            for i, k in enumerate(tokens_per_expert_group.cpu().numpy()):
                gatherd_idxs[s : s + k] = i % self.experts_per_rank
                s += k
            gatherd_idxs = gatherd_idxs.argsort()
            sorted_tokens = gathered_tokens[gatherd_idxs]
            tokens_per_expert = tokens_per_expert_post_gather
        tokens_per_expert = tokens_per_expert.cpu().numpy()

        outputs = []
        start_idx = 0
        for i, num_tokens in enumerate(tokens_per_expert):
            end_idx = start_idx + num_tokens
            if num_tokens == 0:
                continue
            expert = self.experts[i + self.ep_rank * self.experts_per_rank]
            tokens_for_this_expert = sorted_tokens[start_idx:end_idx]
            expert_out = expert(tokens_for_this_expert)
            outputs.append(expert_out)
            start_idx = end_idx

        outs = torch.cat(outputs, dim=0) if len(outputs) else sorted_tokens.new_empty(0)
        if self.ep_size > 1:
            new_x = torch.empty_like(outs)
            new_x[gatherd_idxs] = outs
            gathered_tokens = new_x.new_empty(*sorted_tokens_shape)
            dist.all_to_all(
                list(gathered_tokens.split(input_split_sizes)),
                list(new_x.split(output_splits)),
            )
            outs = gathered_tokens

        new_x = torch.empty_like(outs)
        new_x[idxs] = outs
        final_out = (
            new_x.view(*topk_ids.shape, -1)
            .type(topk_weight.dtype)
            .mul_(topk_weight.unsqueeze(dim=-1))
            .sum(dim=1)
            .type(new_x.dtype)
        )
        return final_out

The code for the DeepseekV2MLP class is as follows.

class DeepseekV2MLP(nn.Module):
    def __init__(self, config, hidden_size=None, intermediate_size=None):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size if hidden_size is None else hidden_size
        self.intermediate_size = (
            config.intermediate_size if intermediate_size is None else intermediate_size
        )

        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
        return down_proj

The gating computation is the same as before: softmax is still computed over the entire batch, but now in FP32 precision. In addition, before the TopK selection and normalization, the experts are divided into groups (8 groups in total). The maximum softmax score within each group is used as that group’s score, and the top M groups are selected (3 groups here). The TopK selection is then performed only among the experts in those groups.

class MoEGate(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.top_k = config.num_experts_per_tok
        self.n_routed_experts = config.n_routed_experts
        self.routed_scaling_factor = config.routed_scaling_factor
        self.scoring_func = config.scoring_func
        self.alpha = config.aux_loss_alpha
        self.seq_aux = config.seq_aux
        self.topk_method = config.topk_method
        self.n_group = config.n_group
        self.topk_group = config.topk_group

        # topk selection algorithm
        self.norm_topk_prob = config.norm_topk_prob
        self.gating_dim = config.hidden_size
        self.weight = nn.Parameter(
            torch.empty((self.n_routed_experts, self.gating_dim))
        )
        self.reset_parameters()

    def reset_parameters(self) -> None:
        import torch.nn.init as init

        init.kaiming_uniform_(self.weight, a=math.sqrt(5))

    def forward(self, hidden_states):
        # Computed over the whole batch
        bsz, seq_len, h = hidden_states.shape
        ### compute gating score
        hidden_states = hidden_states.view(-1, h)
        logits = F.linear(
            hidden_states.type(torch.float32), self.weight.type(torch.float32), None
        )
        # The scores are computed in FP32
        if self.scoring_func == "softmax":
            scores = logits.softmax(dim=-1, dtype=torch.float32)
        else:
            raise NotImplementedError(
                f"insupportable scoring function for MoE gating: {self.scoring_func}"
            )

        ### select top-k experts
        if self.topk_method == "gready":
            topk_weight, topk_idx = torch.topk(
                scores, k=self.top_k, dim=-1, sorted=False
            )
        elif self.topk_method == "group_limited_greedy":
            # For each token, use the max softmax score within each group as the group score
            group_scores = (
                scores.view(bsz * seq_len, self.n_group, -1).max(dim=-1).values
            )  # [n, n_group]
            # Select the top M groups
            group_idx = torch.topk(
                group_scores, k=self.topk_group, dim=-1, sorted=False
            )[
                1
            ]  # [n, top_k_group]
            # Build the group mask; TopK is selected after masking
            group_mask = torch.zeros_like(group_scores)  # [n, n_group]
            group_mask.scatter_(1, group_idx, 1)  # [n, n_group]
            score_mask = (
                group_mask.unsqueeze(-1)
                .expand(
                    bsz * seq_len, self.n_group, self.n_routed_experts // self.n_group
                )
                .reshape(bsz * seq_len, -1)
            )  # [n, e]
            tmp_scores = scores.masked_fill(~score_mask.bool(), 0.0)  # [n, e]
            # Run TopK on the masked scores
            topk_weight, topk_idx = torch.topk(
                tmp_scores, k=self.top_k, dim=-1, sorted=False
            )

        ### norm gate to sum 1
        if self.top_k > 1 and self.norm_topk_prob:
            denominator = topk_weight.sum(dim=-1, keepdim=True) + 1e-20
            topk_weight = topk_weight / denominator
        else:
            topk_weight = topk_weight * self.routed_scaling_factor
        ### expert-level computation auxiliary loss
        if self.training and self.alpha > 0.0:
            scores_for_aux = scores
            aux_topk = self.top_k
            # always compute aux loss based on the naive greedy topk method
            topk_idx_for_aux_loss = topk_idx.view(bsz, -1)
            if self.seq_aux:
                scores_for_seq_aux = scores_for_aux.view(bsz, seq_len, -1)
                ce = torch.zeros(
                    bsz, self.n_routed_experts, device=hidden_states.device
                )
                ce.scatter_add_(
                    1,
                    topk_idx_for_aux_loss,
                    torch.ones(bsz, seq_len * aux_topk, device=hidden_states.device),
                ).div_(seq_len * aux_topk / self.n_routed_experts)
                aux_loss = (ce * scores_for_seq_aux.mean(dim=1)).sum(
                    dim=1
                ).mean() * self.alpha
            else:
                mask_ce = F.one_hot(
                    topk_idx_for_aux_loss.view(-1), num_classes=self.n_routed_experts
                )
                ce = mask_ce.float().mean(0)
                Pi = scores_for_aux.mean(0)
                fi = ce * self.n_routed_experts
                aux_loss = (Pi * fi).sum() * self.alpha
        else:
            aux_loss = None
        return topk_idx, topk_weight, aux_loss

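AddAuxiliaryLoss is a custom autograd function that attaches the auxiliary loss to the output. Its forward pass returns the activations unchanged, and its backward pass passes the output gradient through while feeding a gradient of 1 to the auxiliary loss, so the auxiliary loss is optimized together with the main loss. The code is as follows.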

class AddAuxiliaryLoss(torch.autograd.Function):
    """
    The trick function of adding auxiliary (aux) loss,
    which includes the gradient of the aux loss during backpropagation.
    """

    @staticmethod
    def forward(ctx, x, loss):
        assert loss.numel() == 1
        ctx.dtype = loss.dtype
        ctx.required_aux_loss = loss.requires_grad
        return x

    @staticmethod
    def backward(ctx, grad_output):
        grad_loss = None
        if ctx.required_aux_loss:
            grad_loss = torch.ones(1, dtype=ctx.dtype, device=grad_output.device)
        return grad_output, grad_loss

0x04 DeepSeek V3

V3 builds upon the basic MoE framework, continuing the design of fine-grained experts and shared expert isolation. Improvements have been made to the gating network and to load balancing.


4.1 Architecture

The diagram below shows the architectural evolution. As you can see, compared to the previous two versions of the computational framework, the main change in V3 is that the activation function of the gating network has been changed from softmax to sigmoid. Additionally, bias has been added to the gating mechanism.

[Figure: architectural evolution of the MoE layer from V1 and V2 to V3]

4.1.1 Sigmoid

Concept

The Sigmoid function has the following form; its S-shaped curve resembles a cumulative distribution function (CDF).

\sigma(x) = \frac{1}{1 + e^{-x}}

Its key properties are:

Output range: The output value of the Sigmoid function is between 0 and 1, which can be regarded as a probability value.
Smoothness: The Sigmoid function is a smooth, continuous function with good derivative properties.
Nonlinearity: The Sigmoid function is a nonlinear function that can introduce nonlinear transformations to increase the expressive power of neural networks.

Comparison

From the perspective of gating effectiveness, both Softmax and Sigmoid can achieve the function of filtering TopK and normalizing probability distributions. But why did MoE upgrade from Softmax to Sigmoid in version V3? Let’s look at the changes in expert settings in version V3 compared to version V2.

V2: each MoE layer contains 162 experts, consisting of 2 shared experts and 160 routed experts. Each token activates 2 + 6 = 8 experts. The model has 236B parameters in total, with 21B activated per token.
V3: number of routed experts: 256; number of activated experts: 8; total model parameters: 671B; activated parameters: 37B.

One possible reason is as follows: V3 has roughly 100 more routed experts than V2 (256 versus 160), so the gating function operates over a larger dimension.

When faced with a larger number of experts, softmax loses discriminative power and the gating scores accumulate more numerical error, which can lead to significant errors in expert selection. Softmax normalizes across all dimensions; since the outputs must sum to 1, the larger the dimension, the smaller the value assigned to each one. DeepSeek-V2 already computes the gating scores in FP32, yet TopK selection then hinges on differences in ever smaller decimal places, so the scores discriminate poorly between experts; the larger the dimension, the more severe the problem.
Sigmoid scores each expert independently in the range (0,1), and the score does not shrink as the number of experts grows. In theory this gives a wider spread of scores and higher discriminative power, which makes it better suited to high-dimensional gating.
Larger logits values in the softmax activation function can lead to larger gradients, which may cause training instability.
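The dimension effect can be illustrated with a small sketch. The logits here are made up (one preferred expert at 2.0, all others at 0.0), and only the number of experts changes.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def top_scores(n_experts):
    # Hypothetical logits: one preferred expert at 2.0, the rest at 0.0.
    logits = [2.0] + [0.0] * (n_experts - 1)
    return softmax(logits)[0], sigmoid(logits[0])


for n in (8, 160, 256):
    # Prints (softmax score, sigmoid score) of the preferred expert.
    print(n, top_scores(n))
```

The softmax score of the preferred expert shrinks as experts are added, even though its logit is unchanged, while its sigmoid score stays the same.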

Alternatively, we can also analyze it from the perspective of the gradients of the two functions.

Common load balancing functions are:

L_{aux} = \langle f, p \rangle = \sum_{i=1}^{n} f_i p_i

Assume $x_i$ is the logit. If the probability is the softmax probability, the formula expands to:

L_{aux} = \langle f, p \rangle = \sum_{i=1}^{n} f_i p_i = \sum_{i=1}^{n} f_i \cdot \frac{e^{x_i}}{\sum_{j} e^{x_j}}

Its gradient is:

\frac{\partial L}{\partial x_k} = p_k \left(f_k - \sum_i f_i p_i \right)

Its characteristics can be observed from the gradient formula:

If $p_k$ is 0, the gradient is 0 and $x_k$ stops being updated.
$\sum_i f_i p_i$ can be understood as a probability-weighted average of the $f_i$. If $f_k$ exceeds this average, $x_k$ receives a positive gradient and gradient descent lowers it, pulling $f_k$ back toward the average, and vice versa. As optimization progresses, the $f_k$ gradually even out, which satisfies load balancing.
Because softmax concentrates most of the probability on the largest logit, the weighted average is very close to the value of the top-scoring expert. Experts just below the top can therefore receive a gradient of the opposite sign, which can cause some problems.

If we use sigmoid and then normalize, we have:

p_i = \frac{\sigma(x_i)}{\sum_j \sigma(x_j)}

The gradient obtained using this probability is in the form of:

\frac{\partial L}{\partial x_k} = p_k (1 - \sigma_k) \left(f_k - \sum_i f_i p_i \right)

Its characteristics are as follows:

The gradient now vanishes not only when the probability $p_k$ is 0, but also when the sigmoid $\sigma_k$ approaches 1, because of the $(1 - \sigma_k)$ factor.
As for the term in parentheses, since the probability is now built from sigmoid rather than softmax, it no longer has the drawback of only lowering the highest term.
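Both gradient formulas can be checked against finite differences. The sketch below treats $f$ as a constant, as the derivation does; the logits and load values are made up for illustration and the helper names are hypothetical.

```python
import math


def softmax_p(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]


def sigmoid_p(x):
    # Sigmoid scores normalized to sum to 1.
    sig = [1.0 / (1.0 + math.exp(-v)) for v in x]
    s = sum(sig)
    return [v / s for v in sig]


def aux(prob_fn, x, f):
    return sum(fi * pi for fi, pi in zip(f, prob_fn(x)))


def numeric_grad(prob_fn, x, f, k, eps=1e-6):
    # Central finite difference of the loss with respect to x_k.
    up = x[:k] + [x[k] + eps] + x[k + 1:]
    down = x[:k] + [x[k] - eps] + x[k + 1:]
    return (aux(prob_fn, up, f) - aux(prob_fn, down, f)) / (2 * eps)


def softmax_grad(x, f, k):
    p = softmax_p(x)
    return p[k] * (f[k] - aux(softmax_p, x, f))


def sigmoid_grad(x, f, k):
    p = sigmoid_p(x)
    sig_k = 1.0 / (1.0 + math.exp(-x[k]))
    return p[k] * (1.0 - sig_k) * (f[k] - aux(sigmoid_p, x, f))


x = [1.5, 0.3, -0.7, 0.0]  # hypothetical logits
f = [2.0, 1.0, 0.4, 0.6]   # hypothetical load fractions, held constant
for k in range(len(x)):
    assert abs(numeric_grad(softmax_p, x, f, k) - softmax_grad(x, f, k)) < 1e-6
    assert abs(numeric_grad(sigmoid_p, x, f, k) - sigmoid_grad(x, f, k)) < 1e-6
```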

4.1.2 Expert Grouping

DeepSeek-V3 also uses expert grouping, but unlike the Device-Limited Routing of V2, the grouping in V3 is designed mainly around the hardware configuration, in which NVLink bandwidth is about 3.2x that of IB.

4.2 Load Balancing

The figure below shows the change in the loss function.

[Figure: change in the loss function]

4.2.1 Load balancing without auxiliary losses

Faced with load imbalance, Aux Loss addresses the issue by using additional losses to guide the router toward balanced scores. Auxiliary-Loss-Free Load Balancing takes a different approach: it does not change the router’s existing scores $p$, but instead changes the allocation rule $\text{argtop}_k\, p$.

DeepSeek, in versions V1 and V2 of the MoE model, added expert-level, device-level, and device communication-level load-balancing auxiliary losses. These auxiliary losses have several issues:

It’s only for load balancing of computation and communication, and doesn’t help with optimizing the model’s performance.
The weights are difficult to adjust; lowering them fails to promote balance, while raising them can easily damage the LM Loss.
If these auxiliary losses become too large, they interfere with the main model and impair its performance.

To achieve a better trade-off between load balancing and model performance, V3 employs a clever auxiliary-loss-free load balancing strategy. This strategy removes all previous auxiliary losses and achieves load balancing by adding a dynamically adjustable bias term to each expert’s Gating score (this bias is only used for topk selection and is not included in subsequent weight calculations).

The bias method works as follows. The bias is only used for the routing decision; the final gating weight still uses the original sigmoid score (Original_score in the code). In V2, experts are selected by taking the TopK of the gating scores $s_{i,t}$. V3 instead maintains a dynamically adjustable bias for each expert and selects the TopK after adding this bias to the expert’s score. Adding a bias to the gating score is conceptually similar to applying a gradient update directly to the score, without relying on backpropagation through a loss function.

The bias is adjusted using a hyperparameter γ that controls the bias update speed. During training, the expert load of each batch is monitored continuously. At the end of each step, the following operations are performed.

If an expert is overloaded, its bias term is decreased by γ, which lowers its routing score and reduces the number of tokens routed to it.
If an expert is underloaded, its bias term is increased by γ, which raises its routing score and increases the number of tokens routed to it.

The diagram below illustrates load balancing without auxiliary loss. An auxiliary loss ultimately works by adjusting $s_{i,t}$ so that $P_i$ is reduced. If we can adjust $s_{i,t}$ directly, we should in theory achieve an effect similar to applying the auxiliary loss.

[Figure: load balancing without auxiliary loss]

The specific code is as follows. First, obtain the number of tokens allocated to each expert and the theoretical global mean across all experts. Then, calculate the difference between the global mean and each expert’s token count. The bias update is the sign of this difference (the error) multiplied by a fixed update rate, which is an adjustable hyperparameter.

[Figure: code for the bias update]

Through this dynamic adjustment, DeepSeek-V3 maintains a balanced expert load during training and achieves better performance than models that rely solely on auxiliary loss to promote load balancing.
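As a rough illustration of the update rule described above, here is a minimal pure-Python sketch. The helper name `update_bias`, the toy token counts, and the value of `gamma` are assumptions made for this example and are not taken from the DeepSeek code.

```python
def update_bias(bias, tokens_per_expert, gamma=0.001):
    """One bias update step (hypothetical helper, not the original code).

    bias:              per-expert routing biases
    tokens_per_expert: number of tokens routed to each expert in this step
    gamma:             fixed bias update speed (illustrative default)
    """
    mean_load = sum(tokens_per_expert) / len(tokens_per_expert)
    new_bias = []
    for b, load in zip(bias, tokens_per_expert):
        error = mean_load - load            # > 0: underloaded, < 0: overloaded
        sign = (error > 0) - (error < 0)    # sign of the error: -1, 0, or +1
        new_bias.append(b + gamma * sign)
    return new_bias


bias = [0.0, 0.0, 0.0, 0.0]
counts = [10, 2, 4, 0]      # mean load is 4, so expert 0 is overloaded
bias = update_bias(bias, counts, gamma=0.1)
print(bias)
```

Only the sign of the error is used, so an expert that is far from the mean moves at the same speed as one that is slightly off; the step size is controlled entirely by γ.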

4.2.2 Load Balance Loss at Sequence Granularity

DeepSeek V3 also adds a sequence-level load balancing loss (Complementary Sequence-Wise Auxiliary Loss) to balance the allocation of tokens for a single sequence to each expert in order to prevent extreme imbalances within a single sequence.
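For reference, the sequence-wise loss can be written out as follows. This is a reconstruction based on the symbols used in this section; $N_r$ (the number of routed experts) and $K_r$ (the number of experts activated per token) are assumed here to have their usual meaning.

```latex
\begin{aligned}
L_{Bal} &= \alpha \sum_{i=1}^{N_r} f_i P_i, \\
f_i &= \frac{N_r}{K_r T} \sum_{t=1}^{T} \mathbb{1}\left( s_{i,t} \in \operatorname{Topk}\left( \{ s_{j,t} \mid 1 \le j \le N_r \}, K_r \right) \right), \\
s'_{i,t} &= \frac{s_{i,t}}{\sum_{j=1}^{N_r} s_{j,t}}, \\
P_i &= \frac{1}{T} \sum_{t=1}^{T} s'_{i,t}.
\end{aligned}
```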

In the formula for this loss, $T$ is the length of the input sequence. $s'_{i,t}$ is the normalized affinity between token $t$ and expert $i$. $P_i$ is the mean affinity between the i-th expert and every token in the sequence, representing the overall affinity between that expert and the sequence. $f_i$ is the selection frequency of the i-th expert over the sequence. $\alpha$ is a small constant hyperparameter. The product $f_i P_i$ represents the load intensity of the i-th expert: when some experts are repeatedly selected in the top-k, $L_{Bal}$ increases, which reflects the penalty for unbalanced load.
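To make these quantities concrete, the following sketch computes $f_i$, $P_i$, and the loss for a single toy sequence in pure Python. The function name, the toy scores, and the value of `alpha` are illustrative assumptions; for simplicity, the top-k selection here uses the raw scores without any routing bias.

```python
def sequence_balance_loss(scores, top_k, alpha=1e-4):
    """scores[t][i] is the (positive) affinity of token t to expert i."""
    T = len(scores)
    n_experts = len(scores[0])
    f = [0.0] * n_experts
    P = [0.0] * n_experts
    for row in scores:
        # Experts selected for this token: the top_k highest affinities.
        chosen = sorted(range(n_experts), key=lambda i: row[i], reverse=True)[:top_k]
        for i in chosen:
            f[i] += n_experts / (top_k * T)
        total = sum(row)
        for i in range(n_experts):
            P[i] += (row[i] / total) / T    # normalized affinity, averaged over tokens
    return alpha * sum(fi * pi for fi, pi in zip(f, P))


# Four tokens, four experts, top-1 routing.
balanced = [[0.9, 0.1, 0.1, 0.1],
            [0.1, 0.9, 0.1, 0.1],
            [0.1, 0.1, 0.9, 0.1],
            [0.1, 0.1, 0.1, 0.9]]
skewed = [[0.9, 0.1, 0.1, 0.1]] * 4     # every token prefers expert 0

loss_balanced = sequence_balance_loss(balanced, top_k=1)
loss_skewed = sequence_balance_loss(skewed, top_k=1)
print(loss_balanced, loss_skewed)
```

The skewed sequence yields a larger loss because one expert has both a high selection frequency and a high mean affinity.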

Compared with the Expert-Level Balance Loss in V1, the Sequence-Wise loss operates at a different granularity. Sequence-Wise computes the loss over the tokens of a single sequence, while Expert-Level Balance computes it over the tokens of multiple sequences within a batch. The formulas have the same form.

DeepSeek-V3 also emphasizes that the above strategy achieves very good load balancing, so V3 does not use a token-dropping strategy.

4.2.3 Routing

Finally, regarding communication, DeepSeek-V3 uses a restricted routing mechanism to limit communication costs during training. Each token is sent to at most $M$ nodes, which are selected according to the sum of the highest $K_r / M$ affinity scores of the experts located on each node. Under this constraint, the MoE training framework of DeepSeek-V3 can achieve almost complete computation-communication overlap.
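The node selection rule can be sketched for a single token as follows. This is a simplified pure-Python illustration under stated assumptions: experts are laid out contiguously node by node, and the function name, parameter names, and toy scores are made up for the example.

```python
def node_limited_topk(scores, experts_per_node, max_nodes, top_k):
    """Pick top_k experts for one token, restricted to at most max_nodes nodes.

    scores: per-expert affinity scores for a single token, laid out node by
            node with experts_per_node experts on each node (an assumption).
    """
    n_nodes = len(scores) // experts_per_node
    per_node = top_k // max_nodes           # K_r / M scores are summed per node
    node_scores = []
    for n in range(n_nodes):
        local = scores[n * experts_per_node:(n + 1) * experts_per_node]
        node_scores.append(sum(sorted(local, reverse=True)[:per_node]))
    # Keep the max_nodes nodes with the highest summed scores.
    kept = sorted(range(n_nodes), key=lambda n: node_scores[n], reverse=True)[:max_nodes]
    candidates = [e for e in range(len(scores)) if e // experts_per_node in kept]
    # Final top_k among the experts that live on the kept nodes.
    best = sorted(candidates, key=lambda e: scores[e], reverse=True)[:top_k]
    return sorted(best)


scores = [0.9, 0.1, 0.2, 0.1,   # node 0
          0.5, 0.6, 0.1, 0.1,   # node 1
          0.8, 0.0, 0.0, 0.0,   # node 2
          0.3, 0.3, 0.3, 0.3]   # node 3
chosen = node_limited_topk(scores, experts_per_node=4, max_nodes=2, top_k=4)
print(chosen)
```

In this toy case expert 8 has the second-highest individual score, yet it is not selected because its node does not rank among the top two nodes. That is the trade-off the restriction makes: a slightly worse expert choice in exchange for a bounded number of destination nodes per token.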

4.3 Comparison with Llama 4

Let’s compare it with Llama 4.

Regarding routed experts and shared experts, the Llama 4 Maverick MoE layer uses 128 routed experts and one shared expert. Each token is sent to the shared expert and to one of the 128 routed experts.

Regarding where the MoE layers are placed:

Llama 4 Maverick alternates between MoE and dense MLP layers. In fact, Google’s Switch Transformer paper mentions “with experts at every other feed-forward layer.” Google GLaM operates similarly (with a MoE layer every few layers). Google ST-MoE offers more options, such as: only the first few layers are MoE; only the middle few layers are MoE; MoE is interspersed within dense Transformer layers; and only the last few layers are MoE. This may be a compromise between training stability and computational efficiency.
DeepSeek-V3 uses a dense MLP for the first three layers, followed by MoE layers. Why is it configured this way? The paper “DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models” mentions that the load balancing state of the first layer converges very slowly, so the first layer is retained as a dense layer. This might be because the earlier layers focus more on low-level and general features, which do not require a complex MoE mechanism, or because load balance is difficult to maintain in MoE layers at that depth.

The code for Llama 4 is as follows:

class MoEArgs(BaseModel):
    num_experts: int = -1
    capacity_factor: float = 1.0  # capacity factor determines how many tokens each expert can choose
    auto_scale_F: bool = (  # noqa: N815
        True  # if true, rescales hidden_dim such that number of activated params is same as equivalent dense layer
    )
    top_k: int = 1
    interleave_moe_layer_step: int = 1 # one MoE layer every this many layers

class TransformerBlock(nn.Module):
    def __init__(self, layer_id: int, args: ModelArgs):
        super().__init__()
        self.n_heads = args.n_heads
        self.dim = args.dim
        self.head_dim = args.dim // args.n_heads if args.head_dim is None else args.head_dim

        self.is_nope_layer = args.nope_layer_interval is not None and (layer_id + 1) % args.nope_layer_interval == 0

        use_rope = not self.is_nope_layer
        use_qk_norm = args.use_qk_norm and not self.is_nope_layer

        self.attention = Attention(args, use_rope=use_rope, use_qk_norm=use_qk_norm)

        if args.moe_args and (layer_id + 1) % args.moe_args.interleave_moe_layer_step == 0:
            self.feed_forward = MoE( # use an MoE feed-forward layer
                dim=args.dim,
                hidden_dim=int(args.ffn_exp * args.dim),
                ffn_dim_multiplier=args.ffn_dim_multiplier,
                multiple_of=args.multiple_of,
                moe_args=args.moe_args,
            )
        else:
            hidden_dim = int(4 * args.dim)
            hidden_dim = int(2 * hidden_dim / 3)
            if args.ffn_dim_multiplier is not None:
                hidden_dim = int(args.ffn_dim_multiplier * hidden_dim)
            hidden_dim = args.multiple_of * ((hidden_dim + args.multiple_of - 1) // args.multiple_of)

            self.feed_forward = FeedForward(
                dim=args.dim,
                hidden_dim=hidden_dim,
            )
        self.layer_id = layer_id
        self.attention_norm = RMSNorm(args.dim, eps=args.norm_eps)
        self.ffn_norm = RMSNorm(args.dim, eps=args.norm_eps)

        self._register_load_state_dict_pre_hook(self.load_hook)
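The layer-placement condition in the constructor above can be checked in isolation. The sketch below reproduces it next to a DeepSeek-V3-style rule; the second rule is an assumption derived from the description "dense MLP for the first three layers" rather than copied from the code, and both function names are hypothetical.

```python
def llama4_style_is_moe(layer_id, interleave_moe_layer_step):
    # Same condition as in TransformerBlock.__init__ above.
    return (layer_id + 1) % interleave_moe_layer_step == 0


def deepseek_v3_style_is_moe(layer_id, n_dense_layers):
    # Assumption: the first n_dense_layers layers are dense, the rest are MoE.
    return layer_id >= n_dense_layers


llama_layout = ["MoE" if llama4_style_is_moe(i, 2) else "Dense" for i in range(6)]
deepseek_layout = ["MoE" if deepseek_v3_style_is_moe(i, 3) else "Dense" for i in range(6)]
print(llama_layout)
print(deepseek_layout)
```

With a step of 2 the MoE layers land on every second block, while a step of 1 (the default in MoEArgs) makes every block an MoE layer.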

4.4 Implementation

The relevant configurations for the 671B model are as follows:

{
    "vocab_size": 129280,
    "dim": 7168,
    "inter_dim": 18432,
    "moe_inter_dim": 2048,
    "n_layers": 61,
    "n_dense_layers": 3,
    "n_heads": 128,
    "n_routed_experts": 256,
    "n_shared_experts": 1,
    "n_activated_experts": 8,
    "n_expert_groups": 8,
    "n_limited_groups": 4,
    "route_scale": 2.5,
    "score_func": "sigmoid",
    "q_lora_rank": 1536,
    "kv_lora_rank": 512,
    "qk_nope_head_dim": 128,
    "qk_rope_head_dim": 64,
    "v_head_dim": 128,
    "dtype": "fp8"
}
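A quick back-of-the-envelope calculation shows what these numbers imply for the routed experts. The sketch assumes each expert consists of three bias-free projections between dim and moe_inter_dim (as in the Expert class shown later in this section) and ignores attention, embeddings, the gate, and the dense layers.

```python
config = {
    "dim": 7168,
    "moe_inter_dim": 2048,
    "n_layers": 61,
    "n_dense_layers": 3,
    "n_routed_experts": 256,
    "n_shared_experts": 1,
    "n_activated_experts": 8,
}

# One expert = three projections (w1, w2, w3) between dim and moe_inter_dim.
params_per_expert = 3 * config["dim"] * config["moe_inter_dim"]
n_moe_layers = config["n_layers"] - config["n_dense_layers"]

# All routed experts across all MoE layers.
routed_total = n_moe_layers * config["n_routed_experts"] * params_per_expert

# Expert parameters actually used per token: activated routed experts plus shared experts.
experts_per_token = config["n_activated_experts"] + config["n_shared_experts"]
active_expert_params = n_moe_layers * experts_per_token * params_per_expert

print(f"params per expert:      {params_per_expert:,}")
print(f"MoE layers:             {n_moe_layers}")
print(f"routed expert params:   {routed_total:,}")
print(f"active expert params:   {active_expert_params:,}")
```

Under these assumptions the routed experts account for most of the model’s parameters, while only a small fraction of them is touched for any single token.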

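The module structure of a DeepseekV2MoE layer is printed below. Note that this printout comes from a smaller configuration (64 routed experts, hidden size 2048) rather than the 671B model above, but the layout is the same: routed experts, a gate, and shared experts.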

DeepseekV2MoE(
  (experts): ModuleList(
    (0-63): 64 x DeepseekV2MLP(
      (gate_proj): Linear(in_features=2048, out_features=1408, bias=False)
      (up_proj): Linear(in_features=2048, out_features=1408, bias=False)
      (down_proj): Linear(in_features=1408, out_features=2048, bias=False)
      (act_fn): SiLU()
    )
  )
  (gate): MoEGate()
  (shared_experts): DeepseekV2MLP(
    (gate_proj): Linear(in_features=2048, out_features=2816, bias=False)
    (up_proj): Linear(in_features=2048, out_features=2816, bias=False)
    (down_proj): Linear(in_features=2816, out_features=2048, bias=False)
    (act_fn): SiLU()
  )
)

The code for MoE is as follows:

# Note: world_size and rank are defined outside this class (module-level globals).
class MoE(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        self.dim = args.dim
        assert args.n_routed_experts % world_size == 0
        self.n_routed_experts = args.n_routed_experts # number of routed experts
        self.n_local_experts = args.n_routed_experts // world_size
        self.n_activated_experts = args.n_activated_experts # number of activated experts
        # Compute the range of experts that this rank creates locally
        self.experts_start_idx = rank * self.n_local_experts
        self.experts_end_idx = self.experts_start_idx + self.n_local_experts
        self.gate = Gate(args) # gating function
        # Only the experts in the range derived from world_size and rank are created locally
        self.experts = nn.ModuleList([Expert(args.dim, args.moe_inter_dim) if self.experts_start_idx <= i < self.experts_end_idx else None
                                      for i in range(self.n_routed_experts)])
        # The shared experts are a single MLP whose hidden size scales with the number of shared experts
        self.shared_experts = MLP(args.dim, args.n_shared_experts * args.moe_inter_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shape = x.size()
        # Flatten all tokens of the batch into a 2-D tensor
        x = x.view(-1, self.dim)
        weights, indices = self.gate(x) # which experts are selected, and their weights
        y = torch.zeros_like(x) # initialize the output
        # bincount() counts how often each non-negative integer occurs; here, how many times each expert was selected across the batch
        counts = torch.bincount(indices.flatten(), minlength=self.n_routed_experts).tolist()
        # Only local experts are computed, so this looks like inference code; training-time dispatch is not implemented. The loop is over experts
        for i in range(self.experts_start_idx, self.experts_end_idx):
            if counts[i] == 0:
                continue # no token in the batch selected this expert, so skip it
            expert = self.experts[i]
            # where() returns the (row, column) indices of the elements that match the condition; i is the expert id, so these are the tokens dispatched to this expert
            idx, top = torch.where(indices == i)
            y[idx] += expert(x[idx]) * weights[idx, top, None]
        # Output of the shared experts; it is added to y in the return statement
        z = self.shared_experts(x)
        # Combine the partial results from all ranks with an all-reduce
        if world_size > 1:
            dist.all_reduce(y)
        return (y + z).view(shape)
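The expert loop in forward() can be mimicked without tensors. The following toy version uses plain Python lists and scalar "tokens" to show the dispatch-and-combine pattern; the experts are arbitrary lambdas and all names are chosen only for illustration.

```python
def moe_forward(tokens, indices, weights, experts):
    """Toy dispatch/combine loop.

    tokens:     list of scalar inputs, one per token
    indices[t]: expert ids chosen for token t
    weights[t]: gate weights for those experts, in the same order
    experts:    list of callables standing in for expert networks
    """
    out = [0.0] * len(tokens)
    for expert_id, expert in enumerate(experts):
        # (token, slot) pairs routed to this expert, like torch.where(indices == i).
        hits = [(t, k) for t, row in enumerate(indices)
                for k, e in enumerate(row) if e == expert_id]
        if not hits:
            continue    # no token selected this expert
        for t, k in hits:
            out[t] += expert(tokens[t]) * weights[t][k]
    return out


experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x]
tokens = [1.0, 2.0]
indices = [[0, 1], [1, 2]]              # token 0 -> experts 0 and 1, token 1 -> experts 1 and 2
weights = [[0.5, 0.5], [0.75, 0.25]]
y = moe_forward(tokens, indices, weights, experts)
print(y)
```

Looping over experts rather than tokens lets each expert process all of its tokens in one batch, which is what makes the tensor version efficient.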

The gating function is implemented as follows.

class Gate(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        self.dim = args.dim
        self.topk = args.n_activated_experts
        self.n_groups = args.n_expert_groups # total number of groups
        self.topk_groups = args.n_limited_groups # number of groups to select
        self.score_func = args.score_func
        self.route_scale = args.route_scale # scaling factor (route_scale=2.5 in the 671B config)
        self.weight = nn.Parameter(torch.empty(args.n_routed_experts, args.dim))
        # The 671B model has dim 7168; only this configuration gets the bias term
        # The per-expert bias is declared as an nn.Parameter here, which differs from the paper's description of adding or subtracting a fixed γ
        self.bias = nn.Parameter(torch.empty(args.n_routed_experts)) if self.dim == 7168 else None

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        # Compute the scores with a linear projection of the input; they are used to select the top-k
        scores = linear(x, self.weight)
        # Normalize the scores with softmax or sigmoid, depending on the score function
        if self.score_func == "softmax":
            scores = scores.softmax(dim=-1, dtype=torch.float32)
        else:
            scores = scores.sigmoid()
        # Keep a copy of the scores without bias for the weight computation
        original_scores = scores
        # If a bias exists, add it to the scores
        if self.bias is not None:
            # The 671B model adds a bias, which is related to load balancing
            scores = scores + self.bias
        if self.n_groups > 1:
            scores = scores.view(x.size(0), self.n_groups, -1) # split the scores into groups
            if self.bias is None:
                group_scores = scores.amax(dim=-1)
            else:
                group_scores = scores.topk(2, dim=-1)[0].sum(dim=-1) # sum the two highest scores in each group as the group score
            # Indices of the topk_groups groups with the highest group scores
            indices = group_scores.topk(self.topk_groups, dim=-1)[1]
            # Mask out the groups that were not selected
            # Note: masked scores become 0 (not -inf), so they could still outrank a negative biased score
            mask = torch.zeros_like(scores[..., 0]).scatter_(1, indices, True)
            scores = (scores * mask.unsqueeze(-1)).flatten(1)
        # Among the remaining experts, select the top-k expert indices for each token
        indices = torch.topk(scores, self.topk, dim=-1)[1]
        # Gather the weights of the selected experts from the original scores; the bias is used only for routing
        weights = original_scores.gather(1, indices)
        # Normalize the weights (sigmoid case only)
        if self.score_func == "sigmoid":
            weights /= weights.sum(dim=-1, keepdim=True)
        # Apply the route_scale factor
        weights *= self.route_scale
        return weights.type_as(x), indices
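The key point of the gate, that the bias changes which experts are selected but not how their outputs are weighted, can be seen in a small pure-Python sketch for one token. Grouping is left out for brevity, and the function name and toy values are assumptions for this example.

```python
import math


def route(logits, bias, top_k, route_scale=1.0):
    """Select experts with biased scores, weight them with unbiased scores."""
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]    # sigmoid affinities
    biased = [s + b for s, b in zip(scores, bias)]           # used for selection only
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    picked = [scores[i] for i in chosen]                     # weights from unbiased scores
    total = sum(picked)
    weights = [route_scale * p / total for p in picked]
    return chosen, weights


logits = [2.0, 1.0, 0.0, -1.0]
no_bias = route(logits, [0.0, 0.0, 0.0, 0.0], top_k=2)
with_bias = route(logits, [-1.0, 0.0, 0.0, 0.5], top_k=2)
print(no_bias)
print(with_bias)
```

With the bias applied, expert 3 is pulled into the top-2 and expert 0 is pushed out, yet expert 3 still receives the smaller of the two gate weights because the weights come from the unbiased scores.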

The expert’s implementation is as follows.

class Expert(nn.Module):
    def __init__(self, dim: int, inter_dim: int):
        super().__init__()
        self.w1 = Linear(dim, inter_dim)
        self.w2 = Linear(inter_dim, dim)
        self.w3 = Linear(dim, inter_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

0x05 Other Explorations

Below, we will introduce some explorations related to MoE.

5.1 MoA

While MoE research primarily focuses on the FFN layer within the Transformer architecture, some researchers have proposed treating attention heads as experts, resulting in Mixture of Attention Heads (MoA). The core idea is the same as MoE: in the original Transformer, each token passes through all attention heads, while MoA adds a Top-K router before the attention heads, so each token only passes through a subset of the heads. Combining multi-head attention with MoE in this way can further improve performance while limiting computational cost.

[Figure: Mixture of Attention Heads (MoA)]

As shown in the figure below, MoA employs two groups of experts: one for the query projection and the other for the output projection. Both select the same expert indices through a common gating network. To reduce computational complexity, MoA shares the $W_k$ and $W_v$ projection weights across attention experts. The experts differ only in their query projection $q_t W_q$ and output projection $o_i W_o$ weights, so the sequences $K W_k$ and $V W_v$ can be pre-computed once.

[Figure: MoA query-projection and output-projection experts]

Furthermore, MoA is compatible with MQA (Multi Query Attention) and GQA (Grouped Query Attention), as shown in the figure below.

[Figure: MoA combined with MQA and GQA]

5.2 MoD

5.2.1 Overview

Mixture-of-Depths (MoD) is a relatively new approach to conditional computation. Rather than splitting a layer into experts, MoD lets the model decide how much computation each input receives. It is orthogonal to MoE: the two are complementary techniques rather than substitutes, and combining them enables more efficient and scalable inference.

In Transformer-based language models, the model allocates almost the same computational FLOPs to tokens in a sequence. However, in LLM processing, tokens in a sequence vary in complexity; some tokens require more computational resources, while others are simpler and do not actually require the same amount of computational resources, leading to computational waste.

The authors of MoD propose dynamically allocating FLOPs at different positions in the model. Specifically, within a Transformer Block, a Router module selects the Top-K tokens that participate in the computation of that layer, while the other tokens are passed directly to the next Transformer Block via the residual connection. Since the Top-K hyperparameter is fixed in advance, tensor sizes are deterministic (a static computation graph) and the total compute budget is fully predictable. Which tokens are selected, however, is dynamic and context-dependent. Models trained this way can match the performance of the baseline at equivalent FLOPs and training time, while achieving a speedup of over 50% during inference.

5.2.2 Approach

The right side of the image below illustrates the differences between the three strategies.

Vanilla Transformer: Each token must pass through all Transformer Blocks.
Early-Exit: Each token can selectively skip the last few consecutive serial Transformer Blocks.
Mixture-of-Depths: Each token can selectively skip certain (sparse) Transformer Blocks.

The left side of the diagram below illustrates the MoD approach. Each Transformer Block is preceded by a Router module to select from potential computational paths. However, unlike the MoE Transformer, each token in MoD has two computational paths: either the left path directly connecting to the next layer via residual connections, or, like the original Transformer, a path using Self-Attention + MLP.

The Router module outputs a scalar for each token. For example, it outputs w=0.41 for Xi, indicating that Xi will not traverse the current block, and w=0.65 for Xi+1, indicating that Xi+1 will traverse the current block. In MoE, if there are N experts, each token has N computational paths (ignoring discarded paths). If the Top-K size in MoD is the same as the maximum sequence length, it degenerates into the original Transformer, where each token must traverse all blocks.
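A single MoD block can be sketched as below. This is a toy pure-Python version with scalar tokens; scaling the block output by the router weight before adding it to the residual is an assumption of this sketch, and `block_fn` stands in for Self-Attention + MLP.

```python
def mod_block(tokens, router_weights, capacity, block_fn):
    """Route the `capacity` highest-weighted tokens through the block.

    tokens:         scalar stand-ins for token representations
    router_weights: one scalar per token, produced by the Router
    capacity:       number of tokens processed by this block (the Top-K size)
    block_fn:       stand-in for Self-Attention + MLP
    """
    ranked = sorted(range(len(tokens)), key=lambda t: router_weights[t], reverse=True)
    selected = set(ranked[:capacity])
    out = []
    for t, x in enumerate(tokens):
        if t in selected:
            # Processed path: residual plus block output scaled by the router weight.
            out.append(x + router_weights[t] * block_fn(x))
        else:
            out.append(x)   # skipped: residual connection only
    return out, sorted(selected)


tokens = [1.0, 2.0, 3.0, 4.0]
router_weights = [0.41, 0.65, 0.10, 0.90]
out, processed = mod_block(tokens, router_weights, capacity=2, block_fn=lambda x: 10 * x)
print(processed, out)
```

Setting capacity equal to the sequence length makes every token take the processed path, which corresponds to the degenerate case described above.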

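[Figure: MoD routing (left) and comparison of Vanilla Transformer, Early-Exit, and Mixture-of-Depths (right)]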

5.2.3 Budget

The authors also use the concept of capacity to set the total compute budget of each forward pass. Capacity is the total number of tokens that take part in a given computation (e.g., the number of tokens participating in Self-Attention, or the number of tokens handled by each expert in MoE).

In MoE, the size of each expert is the same as the FFN size of the original model. If it’s a Top-1 Router, the total computation is comparable to the original Transformer model. For cases with multiple experts per token, the computation will be greater than the original Transformer model. For static graph computation, once the capacity is determined, the corresponding computation is also determined. Insufficient tokens will result in padding, while excessive tokens will be discarded. The total computation will not change due to changes in the Router.
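The padding and dropping behaviour can be illustrated with a small sketch. The capacity formula used here (tokens per expert times a capacity factor) is a common convention and an assumption of this example, as are the function names.

```python
import math


def expert_capacity(n_tokens, n_experts, top_k=1, capacity_factor=1.0):
    # Assumed convention: an even share of the token slots, scaled by capacity_factor.
    return math.ceil(capacity_factor * n_tokens * top_k / n_experts)


def dispatch_with_capacity(assignments, n_experts, capacity):
    """assignments[t] is the expert chosen by token t (top-1 for simplicity)."""
    buckets = [[] for _ in range(n_experts)]
    dropped = []
    for t, e in enumerate(assignments):
        if len(buckets[e]) < capacity:
            buckets[e].append(t)
        else:
            dropped.append(t)       # overflow tokens are discarded
    padding = [capacity - len(b) for b in buckets]   # unused slots are padded
    return buckets, dropped, padding


assignments = [0, 0, 0, 1, 0, 2, 0, 3]      # expert 0 is heavily preferred
cap = expert_capacity(len(assignments), n_experts=4)
buckets, dropped, padding = dispatch_with_capacity(assignments, 4, cap)
print(cap, buckets, dropped, padding)
```

Each expert always processes exactly `cap` slots, so the amount of computation is fixed regardless of how the router distributes tokens; imbalance shows up as dropped tokens and padded slots instead.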

For MoD, reducing capacity allows for lower computational requirements than the original Transformer. However, indiscriminately using a smaller computational budget can lead to performance degradation.

5.2.4 Routing Mechanism

[Figure: Token-Choice routing and Expert-Choice routing]

In MoE, the Token-Choice approach is typically used, while in this paper’s MoD, the authors chose the Expert-Choice approach, as shown on the right in the diagram above. This approach offers three main advantages:

There is no load balancing issue, so there is no need to add a load balancing penalty like with Token-Choice.
There are only two computation paths in MoD, and each token will only take one path. Expert-Choice’s Top-K mechanism can easily achieve this.
Token-Choice computes the route of each token independently, so the budget cannot be controlled precisely, whereas Expert-Choice fixes the number of processed tokens exactly.
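The difference between the two routing schemes comes down to which axis of the score matrix the Top-K is taken over. A minimal sketch, with made-up scores and function names:

```python
def token_choice(scores, k):
    """Each token (row) picks its k best experts (columns)."""
    return [sorted(range(len(row)), key=lambda e: row[e], reverse=True)[:k]
            for row in scores]


def expert_choice(scores, capacity):
    """Each expert (column) picks its `capacity` best tokens (rows)."""
    n_tokens, n_experts = len(scores), len(scores[0])
    return [sorted(range(n_tokens), key=lambda t: scores[t][e], reverse=True)[:capacity]
            for e in range(n_experts)]


# scores[t][e]: affinity of token t to expert e; every token prefers expert 0.
scores = [[0.9, 0.1],
          [0.8, 0.2],
          [0.7, 0.3],
          [0.6, 0.4]]

tc = token_choice(scores, k=1)
ec = expert_choice(scores, capacity=2)
load_tc = [sum(1 for row in tc if e in row) for e in range(2)]
load_ec = [len(tokens) for tokens in ec]
print(load_tc, load_ec)
```

Token-Choice sends all four tokens to expert 0, while Expert-Choice gives every expert exactly two tokens by construction.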

5.2.5 Sampling

Using Expert-Choice has many advantages, but it also has a significant drawback. Decoder-only LLMs are causal models: each token depends only on previous tokens, not on subsequent ones. The Top-K operation, however, is non-causal, since selecting the Top-K tokens of a sequence requires knowing the entire sequence. During training the whole sequence is available, so Top-K can select the K tokens directly; during autoregressive inference it is not, so the Top-K operation has to be replaced.

To solve this problem, the authors tested two solutions:

Add a binary cross-entropy loss: Each token outputs a scalar value between 0 and 1. A value greater than 0.5 indicates that the token is in the Top-K, while a value less than 0.5 indicates that the token is not in the Top-K. This approach affects the original modeling objective. In this approach, the Router outputs Logits, and the topk operation (such as torch.topk()) provides the Target.
Add a small MLP predictor: it takes the same input as the Router and directly predicts whether a token will be in the Top-K. A stop-gradient is applied here, so the gradient of the predictor does not propagate back into the model and does not affect the modeling objective.

It should be noted that neither method guarantees that exactly K tokens are selected, but the count can be very close. The authors estimate that the prediction accuracy is around 99%.
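The first option can be sketched as follows: the top-k membership computed over the whole sequence provides the binary targets, and at inference each token is decided on its own by thresholding. This is a pure-Python illustration with hypothetical function names and toy logits, not the code of the paper.

```python
import math


def router_bce_loss(logits, capacity):
    """Binary cross-entropy between sigmoid(logit) and top-k membership."""
    ranked = sorted(range(len(logits)), key=lambda t: logits[t], reverse=True)
    top = set(ranked[:capacity])            # non-causal targets, training only
    loss = 0.0
    for t, z in enumerate(logits):
        p = 1.0 / (1.0 + math.exp(-z))
        target = 1.0 if t in top else 0.0
        loss -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
    return loss / len(logits)


def causal_select(logits):
    # Inference: each token is decided independently, sigmoid(logit) > 0.5.
    return [t for t, z in enumerate(logits) if z > 0.0]


confident = [3.0, -3.0, 2.5, -2.0]      # router already separates top-k from the rest
uncertain = [0.2, -0.1, 0.1, 0.0]       # same ranking, but logits close to zero
loss_confident = router_bce_loss(confident, capacity=2)
loss_uncertain = router_bce_loss(uncertain, capacity=2)
selected = causal_select(confident)
print(loss_confident, loss_uncertain, selected)
```

Once the loss has pushed the logits of top-k tokens above zero and the others below, the causal threshold rule reproduces the top-k choice without looking at later tokens, although the number of selected tokens is no longer guaranteed to equal K.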

5.2.6 Routing Discussion

Let’s analyze the routing mechanism next.

Intuitively, a token might learn to bypass a computational unit (a Transformer block) because the prediction at that step is relatively easy (as learned through the auxiliary loss or the auxiliary predictor). However, the routing network learns more than just this strategy. The self-attention mechanism lets each token attend to other tokens in the sequence to capture the dependencies between them. If a token skips the self-attention computation of a given block, none of the tokens after it can attend to that token in that block. This is what makes routing decisions in MoD complex: the routing decision of one token affects other tokens, so the routing mechanism has to take these side effects into account. If the routing network lets an important token skip computation, subsequent tokens may be unable to obtain crucial information. The routing network therefore has to balance the computational needs of each token against its impact on other tokens, which makes routing a complex optimization problem.

This insight opens the door to Mixture-of-Depths (MoD) variants. Routing can be decoupled across the query, key, and value roles in self-attention, so that a variant makes routing decisions independently for each role. This means a token might be treated as an important query in one attention head but not as an important key in another. This decoupling allows the routing network to make more granular decisions, better capturing the importance of each token in its different roles.

The authors further speculate that some tokens may possess significant informational value within the context, even if they are not crucial in the current query step. These tokens may carry information essential for understanding the entire sequence or completing the task; however, in a standard Transformer, if these tokens are far removed from the current query position, they are difficult to utilize effectively. To address this issue, the authors propose using a routing mechanism to identify these tokens carrying long-term information. The routing network can introduce these tokens into a special long-term memory buffer, which can then be accessed in future computations, even if those tokens are far removed from the original sequence. This allows the model to more effectively capture and utilize long-distance dependencies. The authors point out that one advantage of this long-term memory mechanism is that the decision on whether each token should be added to long-term memory can be made during its first processing (i.e., the “memory encoding” phase). This means that the model does not need to re-evaluate the long-term importance of each token in every subsequent query step.

These ideas highlight the potential of adaptive computation, and learned routing in particular, to improve the memory and context processing capabilities of language models. By identifying and storing tokens that carry long-term information, routing mechanisms can be a powerful tool for extending the effective context length of models and better capturing long-distance dependencies.

5.3 Integration of MoE with Large-Scale Multimodal Models

The paper “MoE-LLaVA: Mixture of Experts for Large Vision-Language Models” combines MoE and LLaVA to achieve a balance between efficiency and performance.

The authors propose a novel LMM training strategy, MoE-tuning, which introduces the MoE model into LMMs, effectively addressing the performance degradation issues associated with multimodal learning and model sparsity while maintaining constant computational cost. Simultaneously, they propose the MoE-LLaVA model, replacing the LLM in LLaVA with the MoE model, which allows for the activation of only the top k experts during the inference phase via the router. Experiments demonstrate that MoE-LLaVA exhibits superior performance in visual understanding and effectively reduces hallucination phenomena.

The entire training was completed in three phases:

Phase 1: Train only the MLP, keep other parameters frozen.
Phase 2: Unlock all parameters except for Vision Encoder.
Phase 3: Expand the FFNs in the LLM into MoE layers, i.e., replicate each FFN into E parallel FFNs and add a Router layer in front of them. Only the MoE and MLP layers are trained.

[Figure: three-phase training of MoE-LLaVA]

In MoE-LLaVA, the Router functions the same as in a traditional MoE, selecting the top k Experts (FFNs). In addition to the autoregressive loss, the authors add an auxiliary loss $L_{aux}$, primarily to achieve load balancing among the Experts.

[Figure: MoE-LLaVA load-balancing auxiliary loss]

5.4 Combining LoRA with Mixture of Experts

A significant branch of LoRA architecture development combines LoRA with MoE. This approach jointly learns mixing weights and LoRA plugins: each LoRA plugin acts as an expert, and a routing network determines the weights or selection of the experts based on the input. During fine-tuning, the pre-trained LLM weights remain frozen, while the LoRA experts and the router are trained. A mixture of LoRA experts lets the model leverage knowledge learned on different tasks, combining and generalizing it in the manner of an expert system.
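As a minimal sketch of this idea, the forward pass of one linear layer with soft-routed LoRA experts can be written in pure Python. The frozen weight `W`, the expert pairs `(A, B)`, the `router` matrix, and the `scale` factor are all hypothetical names and toy values; real implementations differ in routing (soft vs. top-k) and scaling details.

```python
import math


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(z):
    top = max(z)
    e = [math.exp(x - top) for x in z]
    s = sum(e)
    return [x / s for x in e]


def lora_moe_forward(x, W, experts, router, scale=1.0):
    """y = W x + scale * sum_i g_i * B_i (A_i x), with W kept frozen.

    experts: list of (A, B) pairs, A of shape r x d_in, B of shape d_out x r
    router:  n_experts x d_in matrix producing one logit per expert
    """
    gates = softmax(matvec(router, x))      # soft routing over all experts
    y = matvec(W, x)                        # frozen pre-trained path
    for g, (A, B) in zip(gates, experts):
        delta = matvec(B, matvec(A, x))     # low-rank update B_i A_i x
        y = [yi + scale * g * di for yi, di in zip(y, delta)]
    return y, gates


W = [[1.0, 0.0], [0.0, 1.0]]                          # frozen weight (identity here)
experts = [([[1.0, 0.0]], [[1.0], [0.0]]),            # rank-1 expert acting on dimension 0
           ([[0.0, 1.0]], [[0.0], [1.0]])]            # rank-1 expert acting on dimension 1
router = [[0.0, 0.0], [0.0, 0.0]]                     # untrained router: uniform gates
y, gates = lora_moe_forward([2.0, 4.0], W, experts, router)
print(y, gates)
```

Only `experts` and `router` would receive gradients during fine-tuning; `W` stays fixed, which is what keeps the number of trainable parameters small.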

A typical framework is shown in the figure below.

[Figure: typical LoRA-MoE framework]

5.4.1 Classification

Research based on LoRA-MoE methods can be broadly categorized into three main types, with the primary objectives being: (1) improving performance and parameter efficiency, (2) preserving knowledge during fine-tuning, and (3) adapting to multi-task learning. While these categories highlight different focuses, many methods address multiple objectives simultaneously.

The following diagram provides a comparison of some common solutions.

[Figure: comparison of common LoRA-MoE methods]

Efficiency-oriented design

Such methods aim to achieve performance comparable to full fine-tuning with minimal parameter overhead.

MoV and MoLoRA aim to achieve full fine-tuning performance with less than 1% of parameters updated and improve generalization to unseen tasks. MoV and MoLoRA use vector and LoRA adapters as experts, respectively, and employ a soft merging strategy, where all experts contribute to the output based on router probabilities.
MoELoRA treats LoRA modules as experts within the MoE framework. MoELoRA comprises multiple LoRA experts and a gating network for routing and load balancing, which prevents convergence to a limited set of experts. Applying contrastive learning among the experts mitigates the random routing problem common in MoE models.
However, a fixed number of LoRA experts (as in MoELoRA) lacks flexibility and can become redundant due to representation collapse or overfitting of the learned routing policy. To address this issue, MoLA proposes a layer-wise expert allocation method that flexibly assigns LoRA experts across different Transformer layers. MoLA employs a top-k routing mechanism to select relevant experts for each input. In addition to improving performance and parameter efficiency, MoLA demonstrates promising continual learning capabilities due to its sparse expert activations, enabling the model to retain knowledge from previous domains while adapting to new domains.

Memory-based adaptation

These methods focus on preventing catastrophic forgetting during the adaptation process, addressing the challenge of knowledge retention when adapting LLMs to new tasks or domains.

LoRAMoE introduces multiple LoRA experts integrated through a router network, using a localized balancing constraint to encourage some experts to focus on leveraging world knowledge for downstream tasks. It employs a top-k routing strategy, enabling the model to maintain world knowledge while improving performance across multiple tasks.
MoRAL uses question-answer pairs from unstructured text and combines the multitasking capabilities of MoE with the parametric efficiency of LoRA. It employs a soft routing mechanism where all experts contribute to the output based on router probabilities. MoRAL addresses the catastrophic forgetting problem by maintaining performance on previously seen tasks while adapting to new domains.

Task-based integration

These methods address the challenges of domain specificity and task interference. Domain specificity arises when models trained on general-purpose data lack the expertise required for a specific domain, such as medicine or finance. Task interference occurs when multiple tasks and their datasets compete during training, leading to performance degradation across tasks.

To address the domain-specificity issue, researchers have proposed many solutions.
- MOELoRA can be used for multi-task medical applications. MOELoRA introduces multiple LoRA experts, each consisting of a low-rank matrix, with a task-driven gating function that controls each expert’s contributions based on task identity. This approach allows for task-specific learning while maintaining a shared knowledge base across tasks.
- MOA is an efficient end-to-end parameter tuning method for multi-task learning. MOA first trains separate LoRA modules for different tasks, and then combines them using a sequence-level routing mechanism based on domain metadata, allowing for flexible combination of domain-specific LoRA modules.
- XLoRA employs a deep, token-level, dynamic MoE policy. Starting with a pre-trained LoRA adapter, XLoRA dynamically mixes and adapts layers using a gating policy that leverages hidden states. This enables the model to create novel combinations to solve tasks, demonstrating strong performance in scientific applications.
To address the problem of task interference, researchers have proposed many solutions.
- MoCLE resolves task conflicts in visual language instruction adaptation. This approach introduces a MoE architecture that activates task-customized parameters based on instruction clusters, employs a cluster-conditional routing strategy, and incorporates a general expert to improve generalization to new instructions.
- LLaVAMoLE mitigates data conflicts in multimodal LLM instruction fine-tuning. It introduces a sparse MoE design containing multiple LoRA experts and employs a token-level routing strategy, where each token is routed to a top-1 expert. This allows for adaptive selection of tokens from different domains, effectively resolving data conflicts.
- HydraLoRA is an asymmetric LoRA architecture that challenges the traditional symmetric expert structures in MoE-based methods. Through empirical analysis, the authors found that in multi-task settings, the A matrix parameters from different LoRA heads tend to converge, while the B matrix parameters remain distinct. Building on this observation, HydraLoRA introduces an architecture where all tasks share one A matrix and have multiple task-specific B matrices, employing a trainable MoE router to automatically identify intrinsic components in the training data.

By employing various routing strategies and expert design, these methods can effectively adapt to multiple tasks or domains while mitigating interference and maintaining performance for specific tasks. The integration of LoRA with MoE has demonstrated promising results in improving performance, preserving knowledge, and facilitating multi-task adaptation across diverse domains.

Improve performance

Existing methods improve the performance of LoRA MoE in terms of initialization, task relationship management, and efficiency.

In terms of initialization, Mixture-of-LoRAs first trains multiple LoRAs separately as initializations, and then jointly optimizes the router and the LoRAs. MultiLoRA proposes to refine the initialization to reduce parameter dependencies, thereby producing a more balanced single subspace.
In terms of task relationship management, MLoRE adds a low-rank convolutional path to the MoE structure to capture global task relationships. MTLoRA uses task-agnostic and task-specific LoRA modules to resolve task conflicts.
In terms of efficiency, MoLA adaptively assigns different numbers of LoRA experts to different layers of the Transformer model to save on the number of LoRA modules. LLaVA-MoLE and SiRA utilize sparse computation to reduce computational costs. Furthermore, Octavius sparsely activates independent LoRA experts through instance-level instructions to mitigate task interference and improve efficiency. Fast LoRA allows each sample in a mini-batch to have its own unique low-rank adapter, enabling efficient batch processing.

In addition, some methods, while not explicitly based on MoE, follow its principles. For example, I-LoRA uses two separate LoRAs to manage long-term and short-term memory for continual learning.

5.4.2 LoRAMoE

Let’s take LoRAMoE as an example to see how LoRAs can be treated as “experts” in MoE. LoRAMoE integrates multiple LoRAs as experts through an MoE-style plugin, with the aim of mitigating the forgetting of world knowledge during large-scale fine-tuning.


Motivation

The authors found that as the amount of data used increases, SFT training causes the model parameters to deviate significantly from the pre-training parameters, and the world knowledge learned in the pre-training stage is gradually forgotten. Although the model’s ability to follow instructions is enhanced and its performance on common test sets increases, the performance of QA tasks that require this world knowledge drops significantly.

Plan

The author’s proposed solution is:

Data side: CBQA, a representative dataset of world knowledge, was added to slow down the model’s forgetting of world knowledge;
Model side: Guided by the ideas of (1) reducing model parameter changes and (2) isolating the parameters for world knowledge from those for new task knowledge, the authors designed the LoRAMoE method. The LoRA experts are divided into two groups: one group handles tasks related to world knowledge, which can be handled well by retaining the pre-trained parameters, and the other group handles the new downstream tasks encountered during SFT, as shown in the figure below.


Training

To train these grouped experts effectively, so that the two groups each handle their own type of task and the load stays balanced within each group, the authors designed a load balancing mechanism called the localized balancing constraint. Specifically, let Q be the importance matrix output by the routing module, where $Q_{n,m}$ is the weight of the n-th expert on the m-th training sample, and let I be a coefficient matrix defined by the authors with the same shape as Q. The load balancing loss $L_{lbc}$ is defined as the variance of the I-weighted importance matrix $Z = I \circ Q$ divided by its mean, i.e. $L_{lbc} = \sigma^2(Z) / \mu(Z)$.


This loss is designed so that, for any training sample, the I value is the same for every expert within a LoRA expert group, so optimizing $L_{lbc}$ reduces the variance of the routing weights within the group and balances the load there. Between the two expert groups, if group A is better suited to the current data type, its I value is greater than that of group B. At the start of training, the activation weight of A is significantly greater than that of B, giving A more training opportunities on this type of data, and during training the routing module gradually comes to favor experts from group A for it. As a result, even though the data-type information I is not available at inference time, A’s routing value Q for this type of data is still significantly greater than B’s, so each expert group fulfills its specific role.
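The definition of $L_{lbc}$ as "variance of $Z = I \circ Q$ divided by its mean" can be sketched in a few lines of plain Python. This is only an illustration: the function name is made up, the use of the population variance is an assumption, and the coefficient matrix I is passed in by the caller because its exact construction is not reproduced here.

```python
from statistics import fmean, pvariance

def localized_balancing_loss(Q, I):
    """L_lbc = variance(Z) / mean(Z), where Z = I * Q element-wise.

    Q[n][m]: routing weight of expert n on training sample m
    I[n][m]: coefficient for the same pair (same shape as Q)
    Assumption: population variance over all entries of Z.
    """
    z = [i * q for q_row, i_row in zip(Q, I) for q, i in zip(q_row, i_row)]
    return pvariance(z) / fmean(z)

ones = [[1.0, 1.0], [1.0, 1.0]]

# Perfectly even routing weights: no dispersion, so the loss is zero.
balanced = localized_balancing_loss([[0.25, 0.25], [0.25, 0.25]], ones)

# Uneven routing weights: the dispersion of Z is penalized.
skewed = localized_balancing_loss([[0.4, 0.1], [0.1, 0.4]], ones)
```

With a non-uniform I, the same function penalizes dispersion of the weighted matrix Z rather than of Q itself, which is what lets the constraint treat the two expert groups differently.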

5.4.3 HydraLoRA

We’ll use HydraLoRA as an example to learn how to compress or better optimize MoE expert parameters through low-rank decomposition. HydraLoRA is a novel, parameter-efficient fine-tuning architecture that can automatically identify “intrinsic components” in the data—i.e., subdomains or different tasks—that may be difficult for domain experts to define explicitly.

The core idea of HydraLoRA is to minimize inter-task interference by using a shared A matrix and independent B matrices, optimizing each intrinsic component. HydraLoRA autonomously assigns different B matrices to capture the characteristics of specific tasks, while the shared A matrix integrates global information, thus achieving efficient parameter utilization and performance improvement. In complex multi-task environments, HydraLoRA demonstrates excellent adaptability, flexibly handling various intrinsic components, reducing inter-task interference and improving performance, significantly enhancing model accuracy and efficiency while optimizing resource consumption.

Motivation

While large language models (LLMs) have made significant progress in adapting to new tasks, they still face enormous computational resource consumption, often performing poorly, especially in complex domains. To alleviate this problem, various parameter-efficient fine-tuning (PEFT) methods, such as LoRA, have been proposed. However, LoRA consistently falls short of the performance of full-parameter fine-tuning when dealing with complex datasets, especially when handling more diverse or heterogeneous training corpora, where the gap widens further. Corpus heterogeneity implies dataset diversity, which often introduces interference due to variations in content and style. PEFT methods are particularly sensitive to this, experiencing more severe performance losses in heterogeneous scenarios.

To overcome this bottleneck, researchers from the University of Macau, the University of Texas at Austin, and the University of Cambridge jointly proposed a novel asymmetric LoRA architecture—HydraLoRA. Unlike traditional LoRA, which requires the same parameter structure for all tasks, HydraLoRA introduces a shared A matrix and multiple independent B matrices, each handling different tasks to avoid interference between them. Each head of the Hydra is like a B matrix in LoRA, focusing on its specific task, while the shared A matrix acts like the Hydra’s body, managing and coordinating everything to ensure efficiency and consistency. Without additional tools or human intervention, HydraLoRA can autonomously identify implicit features in the data, greatly improving task adaptability and performance. Through this flexible multi-head mechanism, HydraLoRA achieves a dual breakthrough in parameter efficiency and model performance.

Observations

LoRA Analysis Observation 1: With the same number of parameters, instead of using a single LoRA for the entire domain dataset, it is better to deploy multiple smaller LoRA modules, each focusing on a specific downstream task. Furthermore, the research team believes that this interference is not limited to explicit multi-task training scenarios. This interference can occur in any training setup because all datasets inherently contain multiple implicit intrinsic components, such as subdomains or domain-specific tasks, which even domain experts may not be able to clearly distinguish.
LoRA Analysis Observation 2: When multiple LoRA modules are trained independently on different data, the parameters of matrix A for different heads are highly similar and tend to be consistent, while the parameters of matrix B are significantly different and easily distinguishable. The research team believes that this asymmetry mainly stems from the different initialization methods of matrices A and B. Matrix A tends to capture commonalities across domains, while matrix B adapts to domain-specific differences. The difference between matrices A and B provides important insights into improving parameter efficiency and effectiveness. From an efficiency perspective, this study hypothesizes that the parameters of matrix A can be shared among multiple heads, thereby reducing redundancy. In terms of effectiveness, since the parameters of matrix B are dispersed across different heads, it suggests that using a single head to adapt to multiple domains may not be as effective as using an independent head for each domain, as this minimizes interference between domains.

Plan

The following diagram illustrates the LoRA architecture changes in HydraLoRA. Only trainable parameters are shown in this diagram.

(a) is the original LoRA architecture, where matrix A projects the input down to a low rank and matrix B projects it back up.
(b) With the same parameter count, a single LoRA is split into multiple smaller A and B matrices to avoid training interference.
(c) Evolving from (b), HydraLoRA has an asymmetric structure with one shared A matrix and multiple B matrices. By introducing multiple B matrices, HydraLoRA distinguishes the intrinsic components in the data and avoids interference between different tasks. The shared A matrix captures the commonalities between tasks, while the different B matrices handle their diversity, which gives better performance on diverse tasks. Sharing A also reduces redundancy, improving parameter, computational, and storage efficiency, especially when fine-tuning large models.
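To make the parameter accounting of the three layouts concrete, here is a small arithmetic sketch. The layer size, ranks, and head counts below are arbitrary values chosen for illustration, not settings from the HydraLoRA paper.

```python
def lora_params(d_in, d_out, rank):
    """(a) One LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank

def split_lora_params(d_in, d_out, rank, n):
    """(b) n independent, smaller LoRA pairs, each with its own A and B."""
    return n * lora_params(d_in, d_out, rank)

def hydra_lora_params(d_in, d_out, rank, n_heads):
    """(c) One shared A plus n_heads separate B matrices."""
    return rank * d_in + n_heads * d_out * rank

d = 4096                                              # hypothetical hidden size
single = lora_params(d, d, rank=16)                   # (a) one rank-16 LoRA
split = split_lora_params(d, d, rank=4, n=4)          # (b) four rank-4 LoRAs
hydra = hydra_lora_params(d, d, rank=8, n_heads=3)    # (c) shared A, three B heads
```

All three configurations use 131,072 trainable parameters for this layer. Under the same budget, the asymmetric layout can afford a higher rank per head than the split layout because the A matrix is paid for only once.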


Training & Inference

The diagram below illustrates the training and inference processes.

Fine-tuning phase: HydraLoRA adaptively identifies and initializes N intrinsic components without requiring specific domain knowledge. Then, it utilizes a trainable MoE (Mixture of Experts) router, treating each intrinsic component as an expert, and automatically assigns training samples to the corresponding components for fine-tuning.
Inference phase: HydraLoRA flexibly and dynamically merges multiple B matrices through the trained router to meet the needs of different tasks and data. This design enables the model to efficiently adapt to diverse application scenarios, improving overall performance and resource utilization efficiency.
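The router-weighted merging of B matrices can be sketched as follows. This is a minimal NumPy illustration, not the HydraLoRA code: the dense softmax router, the function name `hydra_lora_forward`, and all shapes are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hydra_lora_forward(x, W0, A, Bs, W_router):
    """y = x W0 + sum_i w_i(x) * (x A) B_i

    x: (batch, d_in)       W0: (d_in, d_out) frozen weight
    A: (d_in, r) shared    Bs: (N, r, d_out) one B per intrinsic component
    W_router: (d_in, N)    hypothetical linear router
    """
    gate = softmax(x @ W_router)                   # (batch, N) head weights per sample
    h = x @ A                                      # shared down-projection, computed once
    B_mix = np.einsum("bn,nrd->brd", gate, Bs)     # per-sample merged B matrix
    delta = np.einsum("br,brd->bd", h, B_mix)      # low-rank update
    return x @ W0 + delta, gate

rng = np.random.default_rng(0)
batch, d_in, d_out, r, N = 4, 16, 16, 4, 3
x = rng.normal(size=(batch, d_in))
W0 = rng.normal(size=(d_in, d_out))
A = rng.normal(size=(d_in, r))
Bs = rng.normal(size=(N, r, d_out))
W_router = rng.normal(size=(d_in, N))

y, gate = hydra_lora_forward(x, W0, A, Bs, W_router)
```

Merging the B matrices before the final multiplication gives the same result as running each head separately and summing the weighted outputs, since the operation is linear in B.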


5.5 High-efficiency fine-tuning

Let’s introduce ESFT, a recently released high-efficiency fine-tuning solution for MoE models from High-Flyer AI (the company behind DeepSeek). The paper is titled “Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models.” It explores PEFT for MoE LLMs, and the research covers three main aspects:

It investigates the dispersion of activated experts on specific tasks and finds that the routing distribution for a given task is highly concentrated, while the activated experts differ significantly across tasks.
It proposes Expert-Specialized Fine-Tuning (ESFT), which improves fine-tuning efficiency and achieves performance comparable to or even exceeding that of full-parameter fine-tuning.
It further analyzes the impact of the MoE architecture on ESFT. The results show that MoE models with fine-grained experts are better at selecting the expert combination most relevant to the downstream task, which improves both training efficiency and training effect.

The core idea of ESFT is shown in the figure below. When fine-tuning for each task, most parameters are frozen, such as attention parameters and irrelevant experts in MoE (blue), while only the most relevant experts (green) are fine-tuned, and these experts all come from non-shared experts. Since the number of most relevant experts in each task is relatively small (not a fixed number, the number varies for different tasks), for example, only 6, the overall number of trainable parameters is also small.

PS: It should be noted that fine-tuning only the most relevant experts here does not mean that the Router will only select these experts. Other experts will still be routed. It just means that the frozen experts will no longer undergo gradient calculation and parameter updates, thus reducing the amount of computation.
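The freeze-everything-except-selected-experts rule can be expressed as a simple filter over parameter names. The sketch below is an assumption-laden illustration: the naming scheme (`layers.<l>.mlp.experts.<e>.`) and the helper `esft_trainable` are hypothetical, and in this sketch the router and shared experts are treated as frozen along with attention.

```python
import re

def esft_trainable(param_names, selected):
    """Map each parameter name to True (fine-tune) or False (freeze).

    selected: set of (layer, expert) pairs chosen for the task.
    Only routed experts listed in `selected` are trainable; everything
    else (attention, router, shared experts, other routed experts) is frozen.
    """
    # Hypothetical naming scheme for routed-expert parameters.
    pattern = re.compile(r"layers\.(\d+)\.mlp\.experts\.(\d+)\.")
    flags = {}
    for name in param_names:
        m = pattern.search(name)
        flags[name] = bool(m) and (int(m.group(1)), int(m.group(2))) in selected
    return flags

names = [
    "layers.0.self_attn.q_proj.weight",
    "layers.0.mlp.gate.weight",
    "layers.0.mlp.shared_experts.up_proj.weight",
    "layers.0.mlp.experts.3.up_proj.weight",
    "layers.0.mlp.experts.5.up_proj.weight",
    "layers.1.mlp.experts.3.up_proj.weight",
]
flags = esft_trainable(names, selected={(0, 3)})
trainable = [n for n, f in flags.items() if f]
```

In a training framework, these flags would typically decide which parameters are handed to the optimizer. The frozen experts remain in the forward pass, so routing behaves exactly as before.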


The paper did not use LoRA because the authors wanted to develop a method directly related to MoE. ESFT consists of two steps:

Task classification: tasks are divided into model-enhancement tasks (such as Math/Code), which test whether a capability can be specifically improved through expert fine-tuning, and model-adaptation tasks (such as converting customer service dialogue data into summaries).
Expert selection: after feeding in task data, the most frequently activated experts are identified; these are the experts the task needs most. For the selection itself, the authors designed a Top-P algorithm: the scores of each expert are summed and normalized into a probability distribution over expert importance, and the top K experts are selected such that their cumulative normalized score reaches the threshold P. After expert selection, the remaining experts must still be used (they participate in gradient propagation but not gradient descent, meaning they are not updated); otherwise, the model’s performance drops significantly.
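The Top-P expert selection described above can be sketched as a cumulative-sum cutoff over ranked experts. The function name and the example scores are made up for illustration; the scores could be any non-negative per-expert relevance measure.

```python
def select_experts_top_p(scores, p):
    """Return the smallest set of highest-scoring experts whose
    normalized scores sum to at least p."""
    total = sum(scores)
    # Expert indices sorted from most to least relevant.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in ranked:
        chosen.append(i)
        cumulative += scores[i] / total
        if cumulative >= p:
            break
    return chosen

scores = [5, 40, 10, 30, 15]                   # hypothetical relevance of 5 experts
picked = select_experts_top_p(scores, p=0.8)   # experts 1, 3, 4 cover 85%
```

Because the cutoff depends on how concentrated the scores are, the number of selected experts varies from task to task instead of being a fixed K.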

ESFT builds on the observation that experts are differentiated: the experts activated for a given task are highly concentrated. ESFT’s advantage lies in its ability to improve model performance and generalization even with limited computational resources. Why is ESFT better than LoRA? Because with LoRA, every expert is trained. Some experts are not suited to a particular downstream task, yet they are still trained in order to reduce the loss, which hurts performance on general tasks after training. ESFT instead selects experts in advance and lets specialists do their specialized work, so it generalizes better than LoRA.

After the large model is transformed from a dense model to a sparse model, how do we fine-tune so many parameters? We will find some experts who are suited to these tasks and fine-tune these experts rather than fine-tuning the entire model. This requires some thinking from the perspective of the experts.

The experts selected for fine-tuning are chosen before training, not during training. Specifically, 32 samples are first selected from the task dataset, each with a length of 4096 tokens, and these 32 samples are used to select experts.

There are two ways to select the most relevant experts:

ESFT-Gate (Average Gate Score): Each token has a corresponding Gate Score from 64 experts after passing through the Router. The Gate Scores of 2^17 tokens are averaged, and then the experts with the highest scores are selected.
ESFT-Token (Token Selection Ratio): This selects the most relevant experts based on the proportion of tokens selected for each expert. In other words, 2^17 tokens are distributed across 64 experts using a router, and then the experts with the most tokens are selected.

0xFF Reference

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017).

[1701.06538] Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

[2006.16668] GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

[2101.03961] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

[2103.13262] FastMoE: A Fast Mixture-of-Expert Training System

[2206.03382] Tutel: Adaptive Mixture-of-Experts at Scale

[2211.15841] MegaBlocks: Efficient Sparse Training with Mixture-of-Experts

[2305.13245] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

[2404.02258] Mixture-of-Depths: Dynamically allocating compute in transformer-based language models

A Survey on Inference Optimization Techniques for Mixture of Experts Models

A Survey on Mixture of Experts WEILIN CAI

Adaptive Mixtures of Local Experts

DeepEP Dispatch/Combine Illustration Marlene

“DeepSeek-V3 Technical Analysis”: Load Balancing without Auxiliary Loss Function ( Baihai IDP)

“DeepSeek-V3 Technical Analysis”: DeepSeekMoE Baihai IDP

Deepseek-MOE Architecture Diagram (V1->V2->V3) What if I had an AI?

DeepSeek-R1 Model Architecture In-Depth Analysis (Part 3): Understanding DeepSeekMoE (AI Algorithm Path)

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-V3 Technical Report

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

DeepSeekV2’s MLA (Multi-head Latent Attention) Explains the Mission of a Drop of Water

DeepSeek Technology Explained (3) - The Evolution of MoE ( by Jiang Fuchun)

Deepseek’s MoE Architecture Evolution ( by Chen Xiaoxiao)

DeepSpeed Inference full-stack optimization reduces latency by 7.3x and increases throughput by 1.5x (MLSys2024)

DeepSpeed: Advancing MoE inference and training to power next-generation AI scale By DeepSpeed Team Andrey Proskurin , Corporate Vice President of Engineering

Google’s latest Mod: Dynamic computing resource allocation accelerates inference by 50%. (AI Talk)

The key to GPT-6: Hybrids, hybrid experts, and higher data quality – Tim is on the way.

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

HunYuan MoE: A casual discussion on AI sparse layers, including LLM parameter count, computational cost, and MFU.

The evolution of LLM MOE, from a simple MOE to a sparse MOE, and then to the share_expert sparse MOE used by Deepseek . (chaofa adds a brief comment on the code.)

LLM Study Notes - Deepspeed - MoE Papers (marsggbo)

Mistral & LLama MoE: An Initial Exploration of Hybrid Expert Models (JMXGODLZ)

Mixtral 8x7B: A Detailed Explanation of Extraordinary Techniques and the Good Brother Pang Tong

Mixtral of Experts

Mixtral of Experts Notes Shane

Mixture of Parrots: Experts improve memorization more than reasoning

A list of classic papers in Mixture-of-Experts (MoE)

The impact of MoE LLM on AI chip communication TANG

Interpreting MoE Papers: A Casual Discussion of AI, Including Gshard, FastMoE, Tutel, and MegaBlocks

Should MoE training use TP or EP? xffxff

MoE-LLaVA: Combining MoE with Large-Scale Multimodal Models (AI Talk)

MoEfication: Transformer Feed-forward Layers are Mixtures of Experts Zhengyan Zhang , Yankai Lin , Zhiyuan Liu , Peng Li , Maosong Sun , Jie Zhou

MOE Introduction and LLM Solution Summary If I were given an AI

Moe has become the new standard for LLM at this stage, and Tim is on its way.

The Breakthrough Path of MoE Architecture: How to Solve Five Major Pain Points in Deep Learning (Alex [Algorithm Dog])

The Past and Present of the MoE Model (Linsight)

Comparison of Moe models: Mixtral, Qwen2-MoE, DeepSeek-v3 Alex [Algorithm Dog]

OpenMoE Notes Shane

OUTRAGEOUSLY LARGE NEURAL NETWORKS:THE SPARSELY-GATED MIXTURE-OF-EXPERTS LAYER

SwitchHead: Accelerating Transformer Attention Using Expert Hybrid Models

Z Tech｜Former DeepSeek Scientist’s 10,000-Word Explanation: How RL and MoE Ignited the Large Model Revolution (Z Potentials)

[Tearing apart LLM-sMoE] One step closer to GPT4 (by Xiaodonggua AIGC)

[Paper] A Review of Hybrid Expert Models (MoE) (Chang Hua Andy, Andy730)

A Comprehensive Guide to DeepSeek-V2 (Chinese Model for Transformer Modification): Detailed Explanation of MoE, GRPO, MLA v_JULY_v

A Panoramic View of Expert Hybrid Model (MOE) Inference Optimization Technology: In-Depth Analysis from Model to Hardware ( by Lang from the North)

From GShard to DeepSeek-V3: A Retrospective of the Evolution of Load Balancing Strategies in MoE Large Models Sirius Black

From Conditional Calculation to MoE: The Origin and Development of MoE (Part 1) by Hai Chaoyin

From Conditional Calculation to MoE: The Origin and Development of MoE (Part 2) by Hai Chaoyin

Implementing an MOE (Expert Hybrid Model) from Scratch: KaiH

Illustrated Guide to Large Model Training Series: DeepSpeed-Megatron MoE Parallel Training (Principles) - by Mengyuan

Illustrated Guide to Large Model Training Series: Parallel Training of DeepSpeed-Megatron MoE (Source Code Analysis)

MLA Learning Notes (including matrix absorption analysis during inference) - A powerful tool for saving large model key-value caches (BBuf )

Large Model LLM Hybrid Expert Model MoE (Part 2 - Implementation) - Lulu who loves avocados

MoE in Large Models: From Beginner to Expert [Sharing] (Alex [Algorithm Dog])

The mathematical foundations of the era of large models (4) (Zapote’s Eraser, zartbot)

The Mathematical Foundations of the Big Model Era (5) - Talking about MoE and Mixtral 8x7B (Zapot’s Eraser, zartbot)

Large Models: An Overview of Hybrid Expert Models (MoE) (Frontiers of AI Large Models)

Magic Square AI ESFT: A Highly Efficient Fine-Tuning Solution for MoE, Comparable to Full-Parameter Fine-Tuning (AI Chat)

A Brief Read of DeepSeek-V2 Technical Report AGI Dream Factory

Hybrid Expert (MoE) Routing (Deepspeed) Yang Xin

Wang Wenguang unveils the MoE architecture, and his lengthy article analyzes the mystery of GPT-4, which has been imitated but never surpassed, “Understanding Large Models in Seconds” (Towards the Future)

Developing DeepSeek-V2 Deephub from scratch using PyTorch

A Simple Understanding of DeepSpeed-MoE Expert Models and All-to-All Communication (voodoo)

Continuing our discussion on MLA, DeepSeek-MoE, and SnowFlake Dense-MoE, Zabot’s Eraser zartbot

Let’s talk in detail about the technological development of DeepSeek MoE . zartbot

A review of hybrid expert systems (MoE) published by the Hong Kong University of Science and Technology ( HKUST )

Adaptive Mixtures of Experts: https://www.cs.toronto.edu/~hinton/absps/jjnh91.pdf ,

Adaptive Mixtures of Local Experts https://www.cs.toronto.edu/~hinton/absps/h91.pdf

Bertsekas, DP Auction algorithms for network flow problems: A tutorial introduction. Comput Optim Applic 1, 7–66 (. https://doi.org/10.1007/BF00247653

Clark, Aidan, et al. “Unified scaling laws for routed language models.” International Conference on Machine Learning. PMLR, 2022.

Deepseek-V1 MoE: https://huggingface.co/deepseek-ai/deepseek-moe-16b-base/blob/main/modeling_deepseek.py

DeepSeek-V2 MoE: https://huggingface.co/deepseek-ai/DeepSeek-V2/blob/main/modeling_deepseek.py

DeepSeek-V3 MoE: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/model.py

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models https://arxiv.org/abs/1.06066

Eigen, David, Marc’Aurelio Ranzato, and Ilya Sutskever. “Learning factored representations in a deep mixture of experts.” arXiv preprint arXiv:1312.4314 (2013).

Fedus, William, Barret Zoph, and Noam Shazeer. “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.” The Journal of Machine Learning Research 23.1 (2022): 5232-5270.

Fuzhao Xue, Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, and Yang You. 2024. Openmoe: An early effort on open mixture-of-experts language models. arXiv preprint arXiv:2402.01739 (2024).

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts https://arxiv.org/abs/2.06905

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding: https://arxiv.org/pdf/2006.16668.pdf , Hoffmann, Jordan, et al. “Training compute-optimal large language models.” arXiv preprint arXiv:2203.15556 (2022).

https://arxiv.org/pdf/2209.01667

https://github.com/withinmiaov/A-Survey-on-Mixture-of-Experts

https://huggingface.co/docs/transformers/v4.32.0/en/model_doc/switch_transformers

https://mp.weixin.qq.com/s/hI7q4_-ZMtFIQ-ckhM9_YQ

https://zhuanlan.zhihu.com/p/658007181

Introducing DBRX: A New State-of-the-Art Open LLM https://www.databricks.com/blog/roducing-dbrx-new-state-art-open-llm

Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models: https://dl.acm.org/doi/pdf/10.1145/3269.3604869

Korthikanti, Vijay Anand, et al. “Reducing activation recomputation in large transformer models.” Proceedings of Machine Learning and Systems 5 (2023).

Lepikhin, Dmitry, et al. “Gshard: Scaling giant models with conditional computation and automatic sharding.” arXiv preprint arXiv:2006.16668 (2020).

Lewis, Mike, et al. “Base layers: Simplifying training of large, sparse models.” International Conference on Machine Learning. PMLR, 2021.

MegaBlocks: Efficient Sparse Training with Mixture-of-Experts: https://arxiv.org/abs/2211.15841 ,

Narayanan, Deepak, et al. “Efficient large-scale language model training on gpu clusters using megatron-lm.” Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2021.

Outrageously large neural networks: The sparsely-gated mixture-of-experts layer: https://arxiv.org/pdf/1701.06538.pdf , Pathways: Asynchronous Distributed Dataflow for ML: https://arxiv.org/abs/2203.12533 ,

Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters https://qwenlm.github.io/zh/blog/qwen-moe/

RA Jacobs, MI Jordan, SJ Nowlan and GE Hinton, “Adaptive Mixtures of Local Experts,” in Neural Computation, 3, no. 1, pp. 79-87, March 1991, doi: 10.1162/neco.1991.3.1.79.

Roller, Stephen, Sainbayar Sukhbaatar, and Jason Weston. “Hash layers for large sparse models.” Advances in Neural Information Processing Systems 34 (2021): 17555-17566.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (November 15, 1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735

Shazeer, Noam, et al. “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.” arXiv preprint arXiv:1701.06538 (2017).

ST-MoE: Designing Stable and Transferable Sparse Expert Models https://arxiv.org/abs/2.08906

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity https://arxiv.org/abs/2101.03961

Tutel: Adaptive Mixture-of-Experts at Scale: https://arxiv.org/abs/2206.03382 ,

Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017).

Mixture-of-Experts with Expert Choice Routing（https://arxiv.org/abs/2202.09368）

MoE Routing — Expert Choice Routing Linsight

Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts ( https://arxiv.org/abs/2408.15664)

NeurIPS 2024 Oral | Small Parameters, Big Impact! Unveiling the High Performance of the Asymmetric LoRA Architecture (Machine Heart)

HydraLoRA: An Asymmetric LoRA Architecture for Efficient Fine-Tuning

https://arxiv.org/pdf/2501.00365

Let’s talk about Llama4, DeepSeek GRM, and zartbot.

Interpreting and Understanding the LLaMA 4 Model: A Casual Discussion on AI

The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation

https://github.com/meta-llama/llama-models/tree/main/models/llama4

Balance loss in MoE - A gradient perspective ( by Wang Feng)

MoE Travelogue: 2. It’s not scarcity that’s worrisome, but inequality. ( Su Jianlin )

MoE Travelogue: 1. Starting from a Geometric Perspective - Su Jianlin

MoE Travelogue: 3. A Different Approach to Allocation (Su Jianlin)

MoE Travelogue: 4. More effort should be invested in overcoming difficulties . (Su Jianlin )

Special Choices in DeepSeek Model Architecture: AI Talk

Overview: DeepSeek Infra/V1/MoE/V2/V3/R1 & Casual Discussion on Key Open Source Technologies and AI

DeepSeek V3 Reasoning: MLA and MOE Analysis Arthur

Notes: A Brief Discussion of Expert Compression (Low-Rank Decomposition) in MoE ( by Daodao Ning)

DeepSeek’s MoE load balancing solution ( by Ji Niu Niu)

The Platonic Representation Hypothesis

Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin, arXiv:2312.09979

A New Paradigm for Fine-tuning Large Models: When LoRA Meets MoE - Sam Talks Algorithms

LoRAMoE: Revolutionizing Mixture of Experts for Maintaining World Knowledge in Language Model Alignment