
Evaluating syntactic abilities of language models

4 points, by blopeur, over 3 years ago

1 comment

robbedpeter, over 3 years ago

>> "...indicating that BERT does not treat grammatical agreement as a rule that must be followed."

This makes sense, since the model inference they used is stochastic - if they'd used a deterministic inference pass, they'd be able to inspect whether the rule, as encoded in the model, was correctly learned and applied to instances of grammatical agreement.

They're treating BERT as some sort of black box, then training a set of models on different data and drawing conclusions from their interrogation of those models. Their methodology needs to account for what transformers do with the data during training, and to impose a spectrum of training parameters and randomized corpora in order to eke out any useful observations about BERT. Other language models like Megatron, GPT-2, and GPT-3 have very different capabilities.

None of their conclusions are applicable to anything other than the particular models they trained.

>> "it knows that subjects and verbs should agree and that high frequency words are more likely, but doesn't understand that agreement is a rule that must be followed and that the frequency is only a preference."

This is only true because of the particular way they used the model, and it shows a glaring misunderstanding of what the software is doing, without any apparent attempt to use the architecture as context for their assumptions.

You cannot generalize assertions about language models at large by running BERT a few times. You need to understand the architecture of each model to know how changes in training and inference will constrain its capabilities.

There are probably very interesting insights into transformer-based models that could be derived from a better methodology and a range of architectures, but this article fails to deliver even a single valid insight.