#inference - Tags - ML Learning Lab

9 posts · Transformer Series

Tag: #inference

Exploring the Transformer Series (30) --- Speculative Decoding

🗓 2026-04-11 • Transformer Series • ⏱ 63 min read

Speculative decoding, speculative sampling, blockwise parallel decoding, token tree verification, and Hugging Face implementation details.

#transformer #speculative-decoding #speculative-sampling #parallel-decoding #inference #sampling

Exploring the Transformer Series (35) --- Fundamentals of Large Model Quantization

🗓 2026-04-11 • Transformer Series • ⏱ 62 min read

Large model quantization fundamentals: outliers, superweights, massive activations, PTQ, QAT, and common quantization strategies.

#transformer #quantization #llm #outlier #ptq #qat

Exploring the Transformer Series (32) --- Lookahead Decoding

🗓 2026-04-11 • Transformer Series • ⏱ 29 min read

Lookahead decoding: Jacobi decoding, n-gram pool, 2D window, parallel verification, and llama.cpp implementation details.

#transformer #lookahead-decoding #jacobi-decoding #speculative-decoding #parallel-decoding #inference

Exploring the Transformer Series (31) --- Medusa

🗓 2026-04-11 • Transformer Series • ⏱ 32 min read

Medusa: multi-decoding heads, tree attention, typical acceptance, sparse tree construction, training strategies, and decoding flow.

#transformer #medusa #speculative-decoding #inference #tree-attention #decode

Exploring the Transformer Series (25) --- KV Cache Optimization for Handling Long Text Sequences

🗓 2026-04-09 • Transformer Series • ⏱ 105 min read

KV cache optimization for long text sequences: sparsification, token reuse, prefix reuse, retrieval-based schemes, and long-context KV management.

#transformer #kv-cache #optimization #long-context #inference #sparsification

Exploring the Transformer Series (26) --- KV Cache Optimization: PD Separation or Merging

🗓 2026-04-09 • Transformer Series • ⏱ 104 min read

KV cache optimization through prefill/decode (PD) separation or merging: static batching, ORCA, Sarathi, DistServe, Splitwise, MemServe, TetriInfer, and Mooncake.

#transformer #kv-cache #prefill #decode #parallelism #inference

Exploring the Transformer Series (24) --- KV Cache Optimization

🗓 2026-04-09 • Transformer Series • ⏱ 79 min read

KV cache optimization: metrics, memory crisis, formula-based compression, stage-aware optimization, memory management, and scheduling.

#transformer #kv-cache #optimization #inference #memory #prefill

Exploring the Transformer Series (20) --- KV Cache

🗓 2026-04-07 • Transformer Series • ⏱ 50 min read

Autoregressive inference redundancy, KV cache, prefill vs decode, implementation, and resource usage.

#transformer #kv-cache #inference #prefill #decode #memory

Exploring the Transformer Series (16) --- Resource Consumption

🗓 2026-04-05 • Transformer Series • ⏱ 35 min read

Transformer parameter counts, memory usage, activations, FLOPs, KV cache, and optimization directions.

#transformer #parameters #memory #activations #flops #kv-cache
