Exploring the Transformer Series (28) --- DeepSeek MLA
DeepSeek MLA: low-rank KV compression, weight absorption, decoupled RoPE, resource tradeoffs, implementation details, and conversions from GQA and MHA.
Length extrapolation in Transformers and LLMs: position encoding methods, RoPE extrapolation, PI, NTK-aware interpolation, YaRN, and Giraffe.
RoPE positional encoding, derivation, properties, extrapolation, and implementation.
Transformer positional encoding: why it is needed, design evolution, sinusoidal encoding analysis, and NoPE debates.