XLNet: Generalized Autoregressive Pretraining for Language Understanding

79 points, by asparagui, almost 6 years ago

4 comments

cs702, almost 6 years ago
This is NOT "just throwing more compute" at the problem.

The authors have devised a clever dual-masking-plus-caching mechanism to induce an attention-based model to learn to predict tokens from *all possible permutations* of the factorization order of all other tokens in the same input sequence. In expectation, the model learns to gather information from all positions on both sides of each token in order to predict the token.

For example, if the input sequence has four tokens, ["The", "cat", "is", "furry"], in one training step the model will try to predict "is" after seeing "The", then "cat", then "furry". In another training step, the model might see "furry" first, then "The", then "cat". Note that the original sequence order is always retained, e.g., the model always knows that "furry" is the fourth token.

The masking-and-caching algorithm that accomplishes this does not seem trivial to me.

The improvements to SOTA performance in a range of tasks are significant -- see tables 2, 3, 4, 5, and 6 in the paper.
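
[A minimal sketch of the factorization-order idea described in the comment above, assuming only what the comment states. This is not the paper's actual two-stream attention or caching implementation, and the function names are made up for illustration.]

    # Sketch: sample a random factorization order over the sequence and let
    # each token be predicted only from the tokens that precede it in that
    # order, while every token keeps its original position index.
    import random

    tokens = ["The", "cat", "is", "furry"]

    def sample_factorization_order(n):
        """Return a random permutation of positions 0..n-1."""
        order = list(range(n))
        random.shuffle(order)
        return order

    def prediction_contexts(tokens, order):
        """Map each target position to the (position, token) pairs it may
        attend to when being predicted under this factorization order."""
        contexts = {}
        revealed = []  # positions already seen in this factorization order
        for pos in order:
            # Original position indices are kept, so the model always knows
            # that, e.g., "furry" is the fourth token even if it is seen first.
            contexts[pos] = [(p, tokens[p]) for p in revealed]
            revealed.append(pos)
        return contexts

    order = sample_factorization_order(len(tokens))
    contexts = prediction_contexts(tokens, order)
    for pos in order:
        print(f"predict {tokens[pos]!r} (position {pos}) from {contexts[pos]}")

[In expectation over many sampled orders, every token is predicted from contexts drawn from both sides of its original position, which is the bidirectionality the comment describes.]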
s_Hogg, almost 6 years ago
It's nice to see people managing to push BERT further and get SOTA on stuff, but I feel like a fair amount of this sort of thing is really just throwing more and more compute at a problem.

Given that we still don't fundamentally understand the properties of deep nets anywhere near as well as we might like to think, it gives me the gnawing feeling that we're missing something.
jamesblonde, almost 6 years ago
What I am most impressed by in this paper is that it is primarily from CMU, with 5 CMU authors and 1 from Google (the legend Quoc from Google; Ruslan Salakhutdinov is also excellent). Even though the CMU team may have had the original ideas, I guess the golden age of ML adage still holds despite their contributions related to dual-masking-plus-caching: "it's a golden age so long as you have access to massive amounts of compute and storage".
jeremysalwen, almost 6 years ago
I don't get the point they are trying to make about BERT not learning dependencies between masked words. Isn't the mask randomly chosen each time, so it has a chance to learn with all possible words unmasked?
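
[For reference, a rough illustration of BERT-style random masking; an assumption-laden sketch, not BERT's actual code. The mask is indeed resampled for every training example, but within any single corrupted example all masked positions are predicted from the same corrupted input, so two masked tokens never get to condition on each other.]

    # Sketch of BERT-style masked-LM corruption: each masked token is
    # predicted from `corrupted` alone, never from the other masked targets.
    import random

    def bert_style_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
        """Randomly mask tokens; return the corrupted sequence and targets."""
        corrupted, targets = [], {}
        for i, tok in enumerate(tokens):
            if random.random() < mask_prob:
                corrupted.append(mask_token)
                targets[i] = tok
            else:
                corrupted.append(tok)
        return corrupted, targets

    corrupted, targets = bert_style_mask("New York is a city".split(), mask_prob=0.5)
    print(corrupted)  # masked tokens see each other only as [MASK]
    print(targets)    # each masked position is predicted independently

[Across training examples different masks are drawn, which is the commenter's point; but within one example the masked targets are predicted independently of one another, which is the dependency limitation the paper points to.]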