Exploring the Transformer Series (30) --- Speculative Decoding
Speculative decoding, speculative sampling, blockwise parallel decoding, token tree verification, and Hugging Face implementation details.
Large model quantization fundamentals: outliers, superweights, massive activations, PTQ, QAT, and common quantization strategies.
Lookahead decoding: Jacobi decoding, n-gram pool, 2D window, parallel verification, and llama.cpp implementation details.
Medusa: multi-decoding heads, tree attention, typical acceptance, sparse tree construction, training strategies, and decoding flow.
KV cache optimization for long text sequences: sparsification, token reuse, prefix reuse, retrieval-based schemes, and long-context KV management.
KV cache optimization through prefill/decode (PD) separation or merging: static batching, ORCA, Sarathi, DistServe, SplitWise, MemServe, TetriInfer, and Mooncake.
KV cache optimization: metrics, memory crisis, formula-based compression, stage-aware optimization, memory management, and scheduling.
Autoregressive inference redundancy, KV cache, prefill vs decode, implementation, and resource usage.
Transformer parameter counts, memory usage, activations, FLOPs, KV cache, and optimization directions.