Exploring the Transformer Series (29): DeepSeek MoE
DeepSeek MoE: load balancing, fine-grained and shared experts, DeepSeek V1/V2/V3 routing, MoD, LoRA hybrids, and efficient fine-tuning.
DeepSeek MTP: EAGLE, HASS, classical multi-token prediction, DeepSeek's causal-chain design, formulas, and the vLLM implementation.
DeepSeek MLA: low-rank KV compression, weight absorption, decoupled RoPE, resource tradeoffs, implementation details, and conversions from GQA and MHA.