Q1: Why does long-context training make naive batch padding especially wasteful?
Review deck
Advanced Masking And Sample Packing Review
Recall sample packing masks, block diagonal attention, packing strategies, and rank-collapse implications from the advanced Mask section.
question
answer
Q2: What is sample packing in the Mask lesson?
Q3: Why does packed training need a block diagonal attention mask?
Q4: What tradeoff does the lesson associate with packing strategies such as FixedLengthPacking, MultiPack, and SortedPacking?
Q5: What role can attention masks play in rank-collapse behavior according to the advanced section?