Masking looks interesting for sequences that can't be lossy. If an image squishes a pixel here or there, it won't be noticed, but if a sentence has no slot left for an "if", that reads wrong.<p>Does this force the model to encode a high-level answering strategy up front? (AFAIU, there's no reordering or insertion during sampling; the canvas length is fixed.) Or does it mean a masking model of a given size is more prone to making things up just to fit the blank space?
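A toy sketch of the constraint I mean (`toy_denoiser` is a made-up stand-in for a real masked diffusion model, not anyone's actual API): the canvas length is fixed before sampling starts, slots can be filled in any order, but no slot can ever be inserted or deleted, so every blank has to end up holding *something*:

```python
import random

random.seed(0)

MASK = "<mask>"

def toy_denoiser(tokens):
    # Hypothetical stand-in for a trained masked diffusion model:
    # picks one remaining masked slot and proposes a filler token.
    vocab = ["if", "it", "rains", "we", "stay", "home"]
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    i = random.choice(masked)
    return i, random.choice(vocab)

def sample(length):
    # Fixed-length canvas: fill order is free, length is not.
    tokens = [MASK] * length
    while MASK in tokens:
        i, tok = toy_denoiser(tokens)
        tokens[i] = tok
    return tokens

out = sample(6)
assert len(out) == 6    # length was committed before any token existed
assert MASK not in out  # every slot got filled, sensible or not
```

If the model never budgeted a slot for "if", the only way out is to fill that space with some other token, which is the confabulation failure mode I'm asking about.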