Exploring the Transformer Series (30) --- Speculative Decoding
Speculative decoding, speculative sampling, blockwise parallel decoding, token tree verification, and Hugging Face implementation details.
DeepSeek MoE: load balancing, fine-grained and shared experts, DeepSeek V1/V2/V3 routing, MoD, LoRA hybrids, and efficient fine-tuning.
DeepSeek MTP: EAGLE, HASS, classical multi-token prediction, DeepSeek's causal-chain design, formulas, and the vLLM implementation.
Large model quantization fundamentals: outliers, superweights, massive activations, PTQ, QAT, and common quantization strategies.
Large model quantization schemes across 8-bit, 4-bit, and low-bit settings, including LLM.int8(), ZeroQuant, SmoothQuant, GPTQ, AWQ, LLM-QAT, QLoRA, FlatQuant, SqueezeLLM, SpQR, BitNet, and OneBit.
Lookahead decoding: Jacobi decoding, n-gram pool, 2D window, parallel verification, and llama.cpp implementation details.
Medusa: multi-decoding heads, tree attention, typical acceptance, sparse tree construction, training strategies, and decoding flow.
Quantization fundamentals for Transformer LLMs: compression background, numerical representations, PTQ/QAT workflows, calibration, granularity, and acceleration.
DeepSeek MLA: low-rank KV compression, weight absorption, decoupled RoPE, resource tradeoffs, implementation details, and conversions from GQA and MHA.
KV cache optimization for long text sequences: sparsification, token reuse, prefix reuse, retrieval-based schemes, and long-context KV management.
KV cache optimization through prefill-decode (PD) separation or merging: static batching, ORCA, Sarathi, DistServe, SplitWise, MemServe, TetriInfer, and Mooncake.
KV cache optimization: metrics, memory crisis, formula-based compression, stage-aware optimization, memory management, and scheduling.
Length extrapolation in Transformers and LLMs: position encoding methods, RoPE extrapolation, PI, NTK-aware interpolation, YaRN, and Giraffe.
MQA and GQA: MHA review, shared KV heads, grouped-query attention, implementation details, memory and speed tradeoffs, conversion, and optimization variants.
LoRA: PEFT, low-rank adaptation, rank, initialization, implementation, optimization, and continual learning.
FlashAttention V2, Flash-Decoding, Flash-Mask, and FlashAttention-3.
FlashAttention, online softmax, tiling, IO-awareness, and memory-efficient exact attention.
Autoregressive inference redundancy, KV cache, prefill vs decode, implementation, and resource usage.
Mixture-of-Experts (MoE): conditional computation, routing, experts, load balancing, implementation, and parallel inference.
Residual connections and normalization in Transformers: ResNet intuition, BatchNorm vs LayerNorm, Pre-Norm vs Post-Norm, implementations, and recent variants.
Transformer parameter counts, memory usage, activations, FLOPs, KV cache, and optimization directions.
RoPE positional encoding, derivation, properties, extrapolation, and implementation.
Transformer generator heads, softmax, decoding strategies, sampling parameters, logits analysis, and weight sharing.
Feed-Forward Networks (FFN) in Transformers: structure, implementation, function, knowledge utilization, and optimization evolution.
Transformer masks: padding mask, sequence/causal mask, implementation details, data flow, and advanced sample-packing strategies.
Multi-head self-attention in Transformers: motivation, principles, implementation details, and modern head-composition improvements.
APE vs RPE in Transformers: differences, representative methods, and relative-position design patterns.
Self-attention in Transformers: principles, implementation details, scaling/softmax analysis, and modern optimization directions.
Transformer embedding fundamentals: from vectorization to trainable embeddings, implementation details, and modern text-embedding model evolution.
Transformer positional encoding: why it is needed, design evolution, sinusoidal encoding analysis, and NoPE debates.
Tokenization fundamentals in Transformers: vocabulary construction, tokenizers, BPE/WordPiece/Unigram, and newer token-free directions.
Transformer training and inference in practice: teacher forcing, masks, dropout, label smoothing, learning rate scheduling, and parallelism.
Transformer data processing pipeline: dataset choices, vocabulary/tokenizers, batch construction, masks, and training data loading in Harvard code.
Transformer encoder and decoder internals: architecture, data flow, cross-attention, and decoder-only design tradeoffs.
Transformer overall architecture: workflow, attention modules, construction from Harvard code, and theoretical perspectives.
A deep dive into attention mechanism foundations: seq2seq background, CNN/RNN limitations, attention principles, and historical evolution to Transformer.