#attention - Tags - ML Learning Lab

9 posts · Transformer Series

Tag: #attention

Exploring the Transformer Series (28) --- DeepSeek MLA

🗓 2026-04-09 • Transformer Series • ⏱ 55 min read

DeepSeek MLA: low-rank KV compression, weight absorption, decoupled RoPE, resource tradeoffs, implementation details, and conversions from GQA and MHA.

#transformer #mla #deepseek #attention #kv-cache #rope

Exploring the Transformer Series (27) --- MQA & GQA

🗓 2026-04-09 • Transformer Series • ⏱ 13 min read

MQA and GQA: MHA review, shared KV heads, grouped-query attention, implementation details, memory and speed tradeoffs, conversion, and optimization variants.

#transformer #mqa #gqa #attention #kv-cache #mha

Exploring the Transformer Series (19) --- FlashAttention V2 and its Upgrade

🗓 2026-04-07 • Transformer Series • ⏱ 47 min read

FlashAttention V2, Flash-Decoding, Flash-Mask, and FlashAttention-3.

#transformer #flashattention #flashattention-v2 #flash-decoding #flash-mask #flashattention-3

Exploring the Transformer Series (18) --- FlashAttention

🗓 2026-04-07 • Transformer Series • ⏱ 87 min read

FlashAttention, online softmax, tiling, IO-awareness, and memory-efficient exact attention.

#transformer #flashattention #attention #softmax #tiling #memory

Exploring the Transformer Series (17) --- RoPE

🗓 2026-04-05 • Transformer Series • ⏱ 47 min read

RoPE positional encoding, derivation, properties, extrapolation, and implementation.

#transformer #rope #position-encoding #rotary-embedding #llm #attention

Exploring the Transformer Series (12) --- Multi-head Self-Attention

🗓 2026-04-03 • Transformer Series • ⏱ 41 min read

Multi-head self-attention in Transformers: motivation, principles, implementation details, and modern head-composition improvements.

#transformer #multi-head-self-attention #attention #qkv #mha #optimization

Exploring the Transformer Series (8) --- Position Encoding

🗓 2026-04-01 • Transformer Series • ⏱ 59 min read

Transformer positional encoding: why it is needed, design evolution, sinusoidal encoding analysis, and NoPE debates.

#transformer #position-encoding #rope #nope #attention #length-extrapolation

Exploring the Transformer Series (2) --- Overall Architecture

🗓 2026-03-31 • Transformer Series • ⏱ 76 min read

Transformer overall architecture: workflow, attention modules, construction from Harvard code, and theoretical perspectives.

#transformer #architecture #attention #llm #encoder-decoder

Exploring the Transformer Series (1): Attention Mechanism

🗓 2026-03-27 • Transformer Series • ⏱ 64 min read

A deep dive into the foundations of the attention mechanism: seq2seq background, CNN/RNN limitations, attention principles, and the historical evolution to the Transformer.

#transformer #attention #seq2seq #rnn #cnn
