[Interactive figure: attention mask, with two modes: Causal (GPT) and Full (BERT)]
Hover over a token (row) to see which positions it can attend to. In causal masking (GPT), each token sees only itself and the tokens before it. In full masking (BERT), every token attends to every position in the sequence, both earlier and later.
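The two patterns can be sketched as boolean matrices. This is a minimal illustration, assuming rows are the attending tokens and columns are the positions they may attend to; the names `attention_mask` and `render` are made up for this example and do not come from any library.

```python
def attention_mask(seq_len, causal):
    """Return mask[i][j] == True when token i (row) may attend to position j (column)."""
    return [
        # Causal: only positions up to and including i. Full: every position.
        [(j <= i) if causal else True for j in range(seq_len)]
        for i in range(seq_len)
    ]


def render(mask):
    """Draw the mask as text: 'x' = may attend, '.' = masked out."""
    return "\n".join(
        " ".join("x" if allowed else "." for allowed in row) for row in mask
    )


causal_mask = attention_mask(4, causal=True)
full_mask = attention_mask(4, causal=False)

print("Causal (GPT):")
print(render(causal_mask))
print()
print("Full (BERT):")
print(render(full_mask))
```

The causal mask is lower-triangular, so row `i` has exactly `i + 1` allowed positions, while every entry of the full mask is allowed. In practice a mask like this is usually applied by setting the masked attention scores to a large negative value before the softmax.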