Transformer Systems Roadmap

What the Transformer Systems roadmap covers and how it should be organized.

The Transformer Systems roadmap teaches transformers as systems, not just as diagrams.

It should move from representation to architecture, then to inference and serving, and finally to the efficiency techniques that change serving tradeoffs.

Learning Arc

  1. Tokens and representations. Tokenization, vocabulary, embeddings, and how discrete text becomes vectors.
  2. Core architecture. Self-attention, multi-head attention, FFNs, residual connections, normalization, encoder-decoder structure, and masks.
  3. Position and context. Positional encoding, RoPE, length extrapolation, context windows, and attention behavior over long sequences.
  4. Inference mechanics. Prefill, decode, KV cache, batching, memory pressure, and serving constraints.
  5. Efficiency techniques. FlashAttention, MQA, GQA, MLA, quantization, LoRA, MoE, speculative decoding, Medusa, and lookahead decoding.
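
As a small illustration of the first stage, the sketch below turns text into token ids and then into vectors. It is a toy under stated assumptions, not how a production tokenizer works: the whitespace tokenizer, the four-word `VOCAB`, the width `D_MODEL`, and the randomly initialized `EMBEDDINGS` table are all hypothetical. Real models use subword tokenizers and learned embedding tables.

```python
import random

# Hypothetical toy vocabulary; real tokenizers use subword units.
VOCAB = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
D_MODEL = 4  # embedding width, chosen arbitrarily for this sketch

rng = random.Random(0)
# Embedding table: one row of D_MODEL floats per vocabulary entry.
# Random values stand in for weights that would normally be learned.
EMBEDDINGS = [[rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)] for _ in VOCAB]


def tokenize(text):
    """Whitespace tokenizer that maps unknown words to <unk>."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


def embed(token_ids):
    """Look up one embedding row per token id."""
    return [EMBEDDINGS[i] for i in token_ids]


ids = tokenize("The cat sat down")
vectors = embed(ids)
print(ids)           # discrete token ids
print(len(vectors))  # one vector per token
```

The point to retain is that the embedding step is a table lookup: each discrete id selects one row, so the sequence of ids becomes a sequence of `D_MODEL`-wide vectors.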

What the Learner Should Retain

By the end of the roadmap, a learner should be able to explain:

  • how tokens become embedding vectors
  • why attention uses Q, K, and V
  • what changes between training, prefill, and decode
  • why masks matter
  • what the KV cache stores
  • why serving cost depends on memory movement, sequence length, batch shape, and cache layout
  • how common efficiency techniques change the tradeoffs
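
Several of these points (masks, prefill versus decode, and what the KV cache stores) can be seen in one small sketch. It assumes a single attention head and uses hand-written toy `qs`, `ks`, and `vs` in place of learned projections. All names are hypothetical, and it is meant as an illustration rather than a reference implementation.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(q, keys, values):
    """Scaled dot-product attention for one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def causal_attention(qs, ks, vs):
    """Full-sequence pass: position i only sees positions 0..i (the causal mask)."""
    return [attend(qs[i], ks[: i + 1], vs[: i + 1]) for i in range(len(qs))]


# Toy per-token queries, keys, and values (stand-ins for learned projections).
qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ks = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Reference: recompute attention over the whole sequence.
full = causal_attention(qs, ks, vs)

# Prefill: process the two prompt tokens at once and keep their K and V.
cache_k, cache_v = ks[:2], vs[:2]
prefill_out = causal_attention(qs[:2], cache_k, cache_v)

# Decode: one new token arrives; append its K and V, then attend once.
cache_k.append(ks[2])
cache_v.append(vs[2])
decode_out = attend(qs[2], cache_k, cache_v)

print(decode_out == full[2])  # the cached decode step matches full recomputation
```

In this sketch the cache holds only keys and values, never queries: a past token's query is not needed again, while its key and value are read by every later position. Decode therefore does one query's worth of work per step, but it reads the entire cache each time.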

Attached Flashcards

Transformer flashcards should focus on mechanisms and tradeoffs.

Examples:

  • What is the difference between prefill and decode?
  • Why does the KV cache reduce compute but increase memory pressure?
  • What problem does FlashAttention solve?
  • How does GQA reduce KV cache size compared with MHA?
  • Why does causal masking matter in decoder-only training?
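
The GQA card can be backed by simple arithmetic. The sketch below assumes the cache stores one key and one value vector per layer, per KV head, per position. The configuration (32 layers, head dimension 128, 4096 tokens, batch size 1, 2 bytes per element) is hypothetical and not taken from any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Bytes needed to cache K and V for every layer, KV head, and position."""
    # Factor of 2: one tensor for keys, one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical model shape; only the number of KV heads differs below.
shape = dict(n_layers=32, head_dim=128, seq_len=4096, batch_size=1)

mha = kv_cache_bytes(n_kv_heads=32, **shape)  # MHA: one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **shape)   # GQA: 4 query heads share each KV head
mqa = kv_cache_bytes(n_kv_heads=1, **shape)   # MQA: all query heads share one KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB")
```

Under these assumptions the script prints 2048 MiB for MHA, 512 MiB for GQA, and 64 MiB for MQA. The cache shrinks in proportion to the number of KV heads, while the number of query heads is unchanged.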

Roadmap Output

The output of this roadmap is more than a set of completed articles. The useful output is a working mental model of the transformer stack, from text input to generated tokens.