
Transformer Language Model Tutorial - Incorrect Attention Mask Description #1877

Closed
@zoeqevans

Description

The tutorial Language Modeling With nn.Transformer and Torchtext describes an attention mask that will prevent nn.TransformerEncoder from attending to not-yet-seen tokens:

Along with the input sequence, a square attention mask is required because the self-attention layers in nn.TransformerEncoder are only allowed to attend the earlier positions in the sequence.

I think the mask is actually upper triangular: a mask being square on its own would not prevent a token from attending to future tokens.
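
For concreteness, here is a minimal sketch of the kind of causal ("subsequent") mask being discussed. The helper name generate_causal_mask is made up for illustration; PyTorch ships an equivalent static method, nn.Transformer.generate_square_subsequent_mask, which likewise returns a square matrix whose strict upper triangle is -inf:

```python
import torch

def generate_causal_mask(sz: int) -> torch.Tensor:
    # Additive attention mask of shape (sz, sz): 0.0 where attention is
    # allowed (j <= i) and -inf above the diagonal, so position i cannot
    # attend to not-yet-seen positions j > i.
    return torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)

print(generate_causal_mask(4))
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])
```

So the mask is square in shape, but the property that actually blocks future positions is its triangular structure, which is the distinction the wording above is getting at.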

I may have misunderstood the mask description; if not, happy to write the PR that fixes this.

cc @pytorch/team-text-core @Nayef211
