Review deck
Sequence Mask And Decoder Review
Recall how causal masks make decoder self-attention match autoregressive generation.
Q1: What shape does a sequence mask have for decoder self-attention over one sequence?
Q2: Which part of a Boolean causal mask is allowed?
Q3: In decoder masked self-attention, where do Q, K, and V come from?
Q4: After a causal mask is applied, what does row i of the decoder context matrix use?
Q5: Why does teacher forcing make the sequence mask necessary during training?
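For checking recall after working through the cards, here is a minimal pure-Python sketch of causally masked self-attention. It is an illustration under stated assumptions, not a reference implementation: the helper names (`causal_mask`, `softmax`, `masked_self_attention`) are hypothetical, there is a single head, and the Q, K, and V projections are taken to be the identity so that all three come directly from the same decoder input.

```python
import math


def causal_mask(seq_len):
    """Boolean seq_len x seq_len mask: True where query i may attend to key j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def softmax(scores):
    """Numerically stable softmax over one row of scores; -inf entries get weight 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_self_attention(x):
    """Single-head self-attention over decoder input x (a list of seq_len vectors).

    Assumption for brevity: identity projections, so Q = K = V = x.
    Returns (attention weights, context matrix).
    """
    seq_len, d = len(x), len(x[0])
    mask = causal_mask(seq_len)
    weights = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            if mask[i][j]:
                # Scaled dot product between query i and key j.
                row.append(sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d))
            else:
                # Disallowed (future) position: excluded before the softmax.
                row.append(-math.inf)
        weights.append(softmax(row))
    # Row i of the context is a weighted sum of value vectors at positions <= i.
    context = [
        [sum(weights[i][j] * x[j][k] for j in range(seq_len)) for k in range(d)]
        for i in range(seq_len)
    ]
    return weights, context


if __name__ == "__main__":
    x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    weights, context = masked_self_attention(x)
    for row in weights:
        print([round(w, 3) for w in row])
```

In this sketch, changing the last input vector leaves the earlier context rows unchanged. That property is what lets training feed the whole target sequence at once under teacher forcing without position i seeing later tokens.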