Exploring the Transformer Series (26) --- KV Cache Optimization: PD Separation or Merging
KV cache optimization through PD separation or merging: static batching, ORCA, Sarathi, DistServe, SplitWise, MemServe, TetriInfer, and Mooncake.
FlashAttention, online softmax, tiling, IO-awareness, and memory-efficient exact attention.
Mixture-of-Experts (MoE): conditional computation, routing, experts, load balancing, implementation, and parallel inference.
Transformer training and inference in practice: teacher forcing, masks, dropout, label smoothing, learning rate scheduling, and parallelism.