Description
The tutorial Language Modeling With nn.Transformer and Torchtext describes an attention mask that will prevent nn.TransformerEncoder
from attending to not-yet-seen tokens:
"Along with the input sequence, a square attention mask is required because the self-attention layers in nn.TransformerEncoder are only allowed to attend the earlier positions in the sequence."
I think the mask is actually upper triangular: a square mask alone would not prevent a token from attending to future tokens.
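For reference, here is a minimal sketch (assuming a recent PyTorch release and its nn.Transformer.generate_square_subsequent_mask helper) showing that the mask is a square matrix whose disallowed positions form the strict upper triangle:

```python
import torch

# The mask is square in shape, but the -inf entries sit strictly above the
# diagonal; that upper-triangular pattern is what blocks position i from
# attending to positions j > i.
sz = 5
mask = torch.nn.Transformer.generate_square_subsequent_mask(sz)
print(mask)
# tensor([[0., -inf, -inf, -inf, -inf],
#         [0.,   0., -inf, -inf, -inf],
#         [0.,   0.,   0., -inf, -inf],
#         [0.,   0.,   0.,   0., -inf],
#         [0.,   0.,   0.,   0.,   0.]])

# Equivalent construction, making the upper-triangular structure explicit.
equivalent = torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)
assert torch.equal(mask, equivalent)
```

So "square" describes only the shape; it is the (strictly) upper-triangular masking pattern that enforces the causal restriction the tutorial describes.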
I may have misunderstood the mask description; if not, I'm happy to write a PR that fixes this.
cc @pytorch/team-text-core @Nayef211