Exploring the Transformer Series (28) --- DeepSeek MLA
DeepSeek MLA: low-rank KV compression, weight absorption, decoupled RoPE, resource tradeoffs, implementation details, and conversions from GQA and MHA.
KV cache optimization for long text sequences: sparsification, token reuse, prefix reuse, retrieval-based schemes, and long-context KV management.
KV cache optimization through prefill/decode (PD) separation or merging: static batching, ORCA, Sarathi, DistServe, SplitWise, MemServe, TetriInfer, and Mooncake.
KV cache optimization: metrics, memory crisis, formula-based compression, stage-aware optimization, memory management, and scheduling.
MQA and GQA: MHA review, shared KV heads, grouped-query attention, implementation details, memory and speed tradeoffs, conversion, and optimization variants.
Autoregressive inference redundancy, KV cache, prefill vs decode, implementation, and resource usage.
Transformer parameter counts, memory usage, activations, FLOPs, KV cache, and optimization directions.