Review deck

Advanced Masking And Sample Packing Review

Recall sample-packing masks, block-diagonal attention, packing strategies, and rank-collapse implications from the advanced Mask section.

Q1: Why does long-context training make naive batch padding especially wasteful?

Tags: long-context, padding
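As a rough aid for Q1, the arithmetic behind padding waste can be sketched as follows. The function name `padding_waste` and the document lengths are hypothetical illustrations, not values from the lesson; the sketch assumes every sequence in a batch is padded to the longest one.

```python
def padding_waste(lengths, pad_to=None):
    """Return the fraction of token slots in a padded batch that hold padding."""
    # Assumption: by default every sequence is padded to the longest one.
    pad_to = max(lengths) if pad_to is None else pad_to
    total_slots = pad_to * len(lengths)
    real_tokens = sum(lengths)
    return 1.0 - real_tokens / total_slots


# Hypothetical batch: three short documents and one long-context document.
lengths = [512, 1024, 2048, 32768]
waste = padding_waste(lengths)
print(f"padding fraction: {waste:.2%}")
```

With these made-up lengths, a single long document forces the three short ones to be padded to 32768 tokens each, so most of the batch is padding.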

Q2: What is sample packing in the Mask lesson?

Tags: sample-packing, attention-mask

Q3: Why does packed training need a block-diagonal attention mask?

Tags: block-diagonal-mask, document-boundaries
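For Q3, a minimal sketch of a block-diagonal mask for one packed sequence may help. It assumes a causal decoder and a boolean mask where `True` means "may attend"; the helper name `block_diagonal_mask` is hypothetical and is not taken from the lesson or from any library.

```python
def block_diagonal_mask(doc_lengths, causal=True):
    """Build a boolean attention mask for documents packed into one sequence.

    mask[q][k] is True when query position q may attend to key position k.
    """
    total = sum(doc_lengths)
    # doc_ids[i] is the index of the document that packed position i belongs to.
    doc_ids = [d for d, n in enumerate(doc_lengths) for _ in range(n)]
    return [
        [
            # Same document only, and (if causal) no attention to later positions.
            doc_ids[q] == doc_ids[k] and (not causal or k <= q)
            for k in range(total)
        ]
        for q in range(total)
    ]


# Two documents of lengths 2 and 3 packed into a single sequence of 5 tokens.
mask = block_diagonal_mask([2, 3])
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

Each printed row shows which earlier positions a token can see. Tokens in the second document never attend to the first, which is the boundary behaviour the card asks about.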

Q4: What tradeoff does the lesson associate with packing strategies such as FixedLengthPacking, MultiPack, and SortedPacking?

Tags: packing-strategy, load-balance
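To make the idea of a packing strategy concrete for Q4, here is a generic first-fit-decreasing bin-packing sketch. It is not the implementation of FixedLengthPacking, MultiPack, or SortedPacking, whose details are not given in this deck; the function name, the capacity, and the lengths are all hypothetical.

```python
def first_fit_decreasing(lengths, capacity):
    """Group sequence lengths into packs whose totals do not exceed capacity."""
    packs = []  # packs[i] is the list of sequence lengths placed in pack i
    free = []   # free[i] is the remaining token budget of pack i
    for n in sorted(lengths, reverse=True):
        if n > capacity:
            raise ValueError(f"sequence of length {n} exceeds capacity {capacity}")
        for i, room in enumerate(free):
            if n <= room:
                # Place the sequence in the first pack that still has room.
                packs[i].append(n)
                free[i] -= n
                break
        else:
            # No existing pack fits, so open a new one.
            packs.append([n])
            free.append(capacity - n)
    return packs


# Hypothetical sequence lengths and a pack capacity of 1000 tokens.
packs = first_fit_decreasing([700, 300, 500, 200, 400, 100], capacity=1000)
for pack in packs:
    print(pack, "fill:", sum(pack) / 1000)
```

Comparing the fill ratio of each pack gives a quick feel for how tightly a strategy fills its token budget and how unevenly the resulting packs can be loaded.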

Q5: What role can attention masks play in rank-collapse behavior according to the advanced section?

Tags: rank-collapse, local-attention