#transformer - Tags - ML Learning Lab

36 posts · Transformer Series

Tag: #transformer

Exploring the Transformer Series (30) --- Speculative Decoding

🗓 2026-04-11 • Transformer Series • ⏱ 63 min read

Speculative decoding, speculative sampling, blockwise parallel decoding, token tree verification, and Hugging Face implementation details.

#transformer #speculative-decoding #speculative-sampling #parallel-decoding #inference #sampling

Read →

Exploring the Transformer Series (29) --- DeepSeek MoE

🗓 2026-04-11 • Transformer Series • ⏱ 76 min read

DeepSeek MoE: load balancing, fine-grained and shared experts, DeepSeek V1/V2/V3 routing, MoD, LoRA hybrids, and efficient fine-tuning.

#transformer #moe #deepseek #routing #load-balancing #experts

Read →

Exploring the Transformer Series (33) --- DeepSeek MTP

🗓 2026-04-11 • Transformer Series • ⏱ 40 min read

DeepSeek MTP: EAGLE, HASS, classical multi-token prediction, DeepSeek’s causal-chain design, formulas, and the vLLM implementation.

#transformer #deepseek #mtp #multi-token-prediction #eagle #hass

Read →

Exploring the Transformer Series (35) --- Fundamentals of Large Model Quantization

🗓 2026-04-11 • Transformer Series • ⏱ 62 min read

Large model quantization fundamentals: outliers, superweights, massive activations, PTQ, QAT, and common quantization strategies.

#transformer #quantization #llm #outlier #ptq #qat

Read →

Exploring the Transformer Series (36) --- Large Model Quantization Schemes

🗓 2026-04-11 • Transformer Series • ⏱ 106 min read

Large model quantization schemes across 8-bit, 4-bit, and low-bit settings, including LLM.int8(), ZeroQuant, SmoothQuant, GPTQ, AWQ, LLM-QAT, QLoRA, FlatQuant, SqueezeLLM, SpQR, BitNet, and OneBit.

#transformer #quantization #llm-compression #ptq #qat #low-bit-quantization

Read →

Exploring the Transformer Series (32) --- Lookahead Decoding

🗓 2026-04-11 • Transformer Series • ⏱ 29 min read

Lookahead decoding: Jacobi decoding, n-gram pool, 2D window, parallel verification, and llama.cpp implementation details.

#transformer #lookahead-decoding #jacobi-decoding #speculative-decoding #parallel-decoding #inference

Read →

Exploring the Transformer Series (31) --- Medusa

🗓 2026-04-11 • Transformer Series • ⏱ 32 min read

Medusa: multiple decoding heads, tree attention, typical acceptance, sparse tree construction, training strategies, and decoding flow.

#transformer #medusa #speculative-decoding #inference #tree-attention #decode

Read →

Exploring the Transformer Series (34) --- Quantization Fundamentals

🗓 2026-04-11 • Transformer Series • ⏱ 48 min read

Quantization fundamentals for Transformer LLMs: compression background, numerical representations, PTQ/QAT workflows, calibration, granularity, and acceleration.

#transformer #quantization #llm-compression #ptq #qat #model-quantization

Read →

Exploring the Transformer Series (28) --- DeepSeek MLA

🗓 2026-04-09 • Transformer Series • ⏱ 55 min read

DeepSeek MLA: low-rank KV compression, weight absorption, decoupled RoPE, resource tradeoffs, implementation details, and conversions from GQA and MHA.

#transformer #mla #deepseek #attention #kv-cache #rope

Read →

Exploring the Transformer Series (25) --- KV Cache Optimization for Handling Long Text Sequences

🗓 2026-04-09 • Transformer Series • ⏱ 105 min read

KV cache optimization for long text sequences: sparsification, token reuse, prefix reuse, retrieval-based schemes, and long-context KV management.

#transformer #kv-cache #optimization #long-context #inference #sparsification

Read →

Exploring the Transformer Series (26) --- KV Cache Optimization: PD Separation or Merging

🗓 2026-04-09 • Transformer Series • ⏱ 104 min read

KV cache optimization through prefill/decode (PD) separation or merging: static batching, ORCA, Sarathi, DistServe, SplitWise, MemServe, TetriInfer, and Mooncake.

#transformer #kv-cache #prefill #decode #parallelism #inference

Read →

Exploring the Transformer Series (24) --- KV Cache Optimization

🗓 2026-04-09 • Transformer Series • ⏱ 79 min read

KV cache optimization: metrics, memory crisis, formula-based compression, stage-aware optimization, memory management, and scheduling.

#transformer #kv-cache #optimization #inference #memory #prefill

Read →

Exploring the Transformer Series (23) --- Length Extrapolation

🗓 2026-04-09 • Transformer Series • ⏱ 49 min read

Length extrapolation in Transformers and LLMs: position encoding methods, RoPE extrapolation, PI, NTK-aware interpolation, YaRN, and Giraffe.

#transformer #length-extrapolation #position-encoding #rope #llm #context-window

Read →

Exploring the Transformer Series (27) --- MQA & GQA

🗓 2026-04-09 • Transformer Series • ⏱ 13 min read

MQA and GQA: MHA review, shared KV heads, grouped-query attention, implementation details, memory and speed tradeoffs, conversion, and optimization variants.

#transformer #mqa #gqa #attention #kv-cache #mha

Read →

Exploring the Transformer Series (22) --- LoRA

🗓 2026-04-08 • Transformer Series • ⏱ 75 min read

LoRA: PEFT, low-rank adaptation, rank, initialization, implementation, optimization, and continual learning.

#transformer #lora #peft #low-rank #fine-tuning #dora

Read →

Exploring the Transformer Series (19) --- FlashAttention V2 and Its Upgrades

🗓 2026-04-07 • Transformer Series • ⏱ 47 min read

FlashAttention V2, Flash-Decoding, Flash-Mask, and FlashAttention-3.

#transformer #flashattention #flashattention-v2 #flash-decoding #flash-mask #flashattention-3

Read →

Exploring the Transformer Series (18) --- FlashAttention

🗓 2026-04-07 • Transformer Series • ⏱ 87 min read

FlashAttention, online softmax, tiling, IO-awareness, and memory-efficient exact attention.

#transformer #flashattention #attention #softmax #tiling #memory

Read →

Exploring the Transformer Series (20) --- KV Cache

🗓 2026-04-07 • Transformer Series • ⏱ 50 min read

Autoregressive inference redundancy, KV cache, prefill vs decode, implementation, and resource usage.

#transformer #kv-cache #inference #prefill #decode #memory

Read →

Exploring the Transformer Series (21) --- MoE

🗓 2026-04-07 • Transformer Series • ⏱ 79 min read

Mixture-of-Experts (MoE): conditional computation, routing, experts, load balancing, implementation, and parallel inference.

#transformer #moe #routing #sparsity #parallelism #mixtral

Read →

Exploring the Transformer Series (14) --- Residual Networks and Normalization

🗓 2026-04-05 • Transformer Series • ⏱ 68 min read

Residual connections and normalization in Transformers: ResNet intuition, BatchNorm vs LayerNorm, Pre-Norm vs Post-Norm, implementations, and recent variants.

#transformer #residual-connection #normalization #layernorm #batchnorm #resnet

Read →

Exploring the Transformer Series (16) --- Resource Consumption

🗓 2026-04-05 • Transformer Series • ⏱ 35 min read

Transformer parameter counts, memory usage, activations, FLOPs, KV cache, and optimization directions.

#transformer #parameters #memory #activations #flops #kv-cache

Read →

Exploring the Transformer Series (17) --- RoPE

🗓 2026-04-05 • Transformer Series • ⏱ 47 min read

RoPE positional encoding, derivation, properties, extrapolation, and implementation.

#transformer #rope #position-encoding #rotary-embedding #llm #attention

Read →

Exploring the Transformer Series (15) --- Sampling and Output

🗓 2026-04-05 • Transformer Series • ⏱ 66 min read

Transformer generator heads, softmax, decoding strategies, sampling parameters, logits analysis, and weight sharing.

#transformer #sampling #generator #softmax #beam-search #top-k

Read →

Exploring the Transformer Series (13) --- FFN

🗓 2026-04-04 • Transformer Series • ⏱ 76 min read

Feed-Forward Networks (FFN) in Transformers: structure, implementation, function, knowledge utilization, and optimization evolution.

#transformer #ffn #feed-forward-network #mlp #activation #knowledge

Read →

Exploring the Transformer Series (11) --- Mask

🗓 2026-04-03 • Transformer Series • ⏱ 37 min read

Transformer masks: padding mask, sequence/causal mask, implementation details, data flow, and advanced sample-packing strategies.

#transformer #mask #padding-mask #causal-mask #self-attention #sample-packing

Read →

Exploring the Transformer Series (12) --- Multi-head Self-Attention

🗓 2026-04-03 • Transformer Series • ⏱ 41 min read

Multi-head self-attention in Transformers: motivation, principles, implementation details, and modern head-composition improvements.

#transformer #multi-head-self-attention #attention #qkv #mha #optimization

Read →

Exploring the Transformer Series (9) --- Position Encoding Classification

🗓 2026-04-02 • Transformer Series • ⏱ 32 min read

APE vs RPE in Transformers: differences, representative methods, and relative-position design patterns.

#transformer #position-encoding #ape #rpe #xlnet #t5

Read →

Exploring the Transformer Series (10) --- Self-Attention

🗓 2026-04-02 • Transformer Series • ⏱ 86 min read

Self-attention in Transformers: principles, implementation details, scaling/softmax analysis, and modern optimization directions.

#transformer #self-attention #qkv #softmax #llama3 #linear-attention

Read →

Exploring the Transformer Series (7) --- Embedding

🗓 2026-04-01 • Transformer Series • ⏱ 85 min read

Transformer embedding fundamentals: from vectorization to trainable embeddings, implementation details, and modern text-embedding model evolution.

#transformer #embedding #vectorization #word2vec #bert #text-embedding

Read →

Exploring the Transformer Series (8) --- Position Encoding

🗓 2026-04-01 • Transformer Series • ⏱ 59 min read

Transformer positional encoding: why it is needed, design evolution, sinusoidal encoding analysis, and NoPE debates.

#transformer #position-encoding #rope #nope #attention #length-extrapolation

Read →

Exploring the Transformer Series (6) --- Token

🗓 2026-04-01 • Transformer Series • ⏱ 70 min read

Tokenization fundamentals in Transformers: vocabulary construction, tokenizers, BPE/WordPiece/Unigram, and newer token-free directions.

#transformer #token #tokenizer #vocabulary #bpe #wordpiece

Read →

Exploring the Transformer Series (5) --- Training & Inference

🗓 2026-04-01 • Transformer Series • ⏱ 38 min read

Transformer training and inference in practice: teacher forcing, masks, dropout, label smoothing, learning rate scheduling, and parallelism.

#transformer #training #reasoning #teacher-forcing #dropout #label-smoothing

Read →

Exploring the Transformer Series (3) --- Data Processing

🗓 2026-03-31 • Transformer Series • ⏱ 15 min read

Transformer data processing pipeline: dataset choices, vocabulary/tokenizers, batch construction, masks, and training data loading in the Harvard code.

#transformer #data-processing #dataset #tokenization #padding

Read →

Exploring the Transformer Series (4) --- Encoder & Decoder

🗓 2026-03-31 • Transformer Series • ⏱ 46 min read

Transformer encoder and decoder internals: architecture, data flow, cross-attention, and decoder-only design tradeoffs.

#transformer #encoder #decoder #cross-attention #decoder-only

Read →

Exploring the Transformer Series (2) --- Overall Architecture

🗓 2026-03-31 • Transformer Series • ⏱ 76 min read

Transformer overall architecture: workflow, attention modules, construction from the Harvard code, and theoretical perspectives.

#transformer #architecture #attention #llm #encoder-decoder

Read →

Exploring the Transformer Series (1) --- Attention Mechanism

🗓 2026-03-27 • Transformer Series • ⏱ 64 min read

A deep dive into the foundations of the attention mechanism: seq2seq background, CNN/RNN limitations, attention principles, and the historical evolution to the Transformer.

#transformer #attention #seq2seq #rnn #cnn

Read →
