Exploring the Transformer Series (28) --- DeepSeek MLA
DeepSeek MLA: low-rank KV compression, weight absorption, decoupled RoPE, resource tradeoffs, implementation details, and conversions from GQA and MHA.
MQA and GQA: MHA review, shared KV heads, grouped-query attention, implementation details, memory and speed tradeoffs, conversion, and optimization variants.
FlashAttention V2, Flash-Decoding, Flash-Mask, and FlashAttention-3.
FlashAttention, online softmax, tiling, IO-awareness, and memory-efficient exact attention.
RoPE positional encoding, derivation, properties, extrapolation, and implementation.
Multi-head self-attention in Transformers: motivation, principles, implementation details, and modern head-composition improvements.
Transformer positional encoding: why it is needed, design evolution, sinusoidal encoding analysis, and NoPE debates.
Transformer overall architecture: workflow, attention modules, construction from Harvard code, and theoretical perspectives.
A deep dive into attention mechanism foundations: seq2seq background, CNN/RNN limitations, attention principles, and historical evolution to Transformer.