Exploring the Transformer Series (27) --- MQA & GQA
MQA and GQA: MHA review, shared KV heads, grouped-query attention, implementation details, memory and speed tradeoffs, conversion, and optimization variants.
RoPE positional encoding, derivation, properties, extrapolation, and implementation.
Self-attention in Transformers: principles, implementation details, scaling/softmax analysis, and modern optimization directions.