Exploring the Transformer Series (25): KV Cache Optimization for Handling Long Text Sequences
KV cache optimization for long text sequences: sparsification, token reuse, prefix reuse, retrieval-based schemes, and long-context KV management.
KV Cache optimization: metrics, memory crisis, formula-based compression, stage-aware optimization, memory management, and scheduling.
FlashAttention V2, Flash-Decoding, Flash-Mask, and FlashAttention-3.
FlashAttention, online softmax, tiling, IO-awareness, and memory-efficient exact attention.
Autoregressive inference redundancy, KV cache, prefill vs decode, implementation, and resource usage.
Multi-head self-attention in Transformers: motivation, principles, implementation details, and modern head-composition improvements.
Self-attention in Transformers: principles, implementation details, scaling/softmax analysis, and modern optimization directions.