>> ...indicating that BERT does not treat grammatical agreement as a rule that must be followed.

This makes sense, since the model inference they used is stochastic: if they had used a deterministic inference pass, they could have inspected whether the rule, as encoded in the model, was actually being applied to instances of grammatical agreement (see the sketch at the end of this comment).

They treat BERT as some sort of black box, then train a set of models on different data and draw conclusions from their interrogation of those models. Their methodology needs to account for what transformers do with the data during training, and to sweep a spectrum of training parameters and randomized corpora, to eke out any useful observations about BERT. Other language models like Megatron, GPT-2, and GPT-3 have very different capabilities.

None of their conclusions apply to anything other than the particular models they trained.

>> it knows that subjects and verbs should agree and that high frequency words are more likely, but doesn’t understand that agreement is a rule that must be followed and that the frequency is only a preference.

This is only true because of the particular way they used the model, and it shows a glaring misunderstanding of what the software is doing, without any apparent attempt to use the architecture as context for their assumptions.

You cannot generalize assertions about language models at large by running BERT a few times. You need to understand the architecture of each model to know how changes in training and inference will constrain its capabilities.

There are probably very interesting insights into transformer-based models to be had from a better methodology and a range of architectures, but this article fails to deliver even a single valid insight.
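
To make the deterministic-probe point concrete, here is a minimal sketch using the HuggingFace transformers library. The checkpoint, example sentence, and verb pair are my own illustrative choices, not anything from the article:

    import torch
    from transformers import BertForMaskedLM, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")
    model.eval()  # disables dropout, so the forward pass is deterministic

    # Plural subject ("keys"), masked verb: how much probability mass
    # does the model put on the agreeing form vs. the non-agreeing one?
    sentence = f"The keys to the cabinet {tokenizer.mask_token} on the table."
    inputs = tokenizer(sentence, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]

    with torch.no_grad():
        logits = model(**inputs).logits

    probs = torch.softmax(logits[0, mask_pos], dim=-1)
    for verb in ("are", "is"):
        verb_id = tokenizer.convert_tokens_to_ids(verb)
        print(f"P({verb}) = {probs[verb_id].item():.4f}")

Whether you then sample from that distribution (stochastic) or take the argmax (deterministic) is a decoding choice, not a property of the model. Claims about what the model "knows" about agreement should be read off the distribution itself, not off a handful of sampled outputs.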