
XLNet: Generalized Autoregressive Pretraining for Language Understanding

79 points by asparagui almost 6 years ago

4 comments

cs702 almost 6 years ago

This is NOT "just throwing more compute" at the problem.

The authors have devised a clever dual-masking-plus-caching mechanism to induce an attention-based model to learn to predict tokens from *all possible permutations* of the factorization order of all other tokens in the same input sequence. In expectation, the model learns to gather information from all positions on both sides of each token in order to predict the token.

For example, if the input sequence has four tokens, ["The", "cat", "is", "furry"], in one training step the model will try to predict "is" after seeing "The", then "cat", then "furry". In another training step, the model might see "furry" first, then "The", then "cat". Note that the original sequence order is always retained, e.g., the model always knows that "furry" is the fourth token.

The masking-and-caching algorithm that accomplishes this does not seem trivial to me.

The improvements to SOTA performance in a range of tasks are significant -- see tables 2, 3, 4, 5, and 6 in the paper.
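To make the permutation idea in this comment concrete, here is a minimal Python sketch (my own illustration on the four-token example, not the paper's actual two-stream attention implementation; all variable names are hypothetical). It samples one factorization order and builds the visibility mask that decides which positions each prediction may attend to, while the original positions stay known.

```python
import random

tokens = ["The", "cat", "is", "furry"]
n = len(tokens)

# Sample a random factorization order, e.g. [3, 0, 1, 2] means the model
# predicts position 3 first, then position 0, then 1, then 2.
order = list(range(n))
random.shuffle(order)

# When predicting the token at position q, the model may only look at
# positions that come earlier in the sampled order. The original sequence
# order is never lost: positions themselves are still encoded.
rank = {pos: i for i, pos in enumerate(order)}
can_attend = [[rank[k] < rank[q] for k in range(n)] for q in range(n)]

for q in range(n):
    visible = [tokens[k] for k in range(n) if can_attend[q][k]]
    print(f"predict {tokens[q]!r} (position {q}) from {visible}")
```

Averaged over many sampled orders, each position ends up being predicted from contexts drawn from both sides, which is the "in expectation" claim above.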
s_Hogg almost 6 years ago

It's nice to see people managing to push BERT further and get SOTA on stuff, but I feel like a fair amount of this sort of thing is really just throwing more and more compute at a problem.

Given that we still don't fundamentally understand the properties of deep nets anywhere near as well as we might like to think, it gives me the gnawing feeling that we're missing something.
jamesblonde almost 6 years ago

What I am most impressed by in this paper is that it is primarily from CMU, with five authors and one from Google (including the legend Quoc from Google; Ruslan Salakhutdinov is also excellent). Even though the CMU team may have had the original ideas, I guess the golden-age-of-ML adage still holds despite their contributions related to dual-masking-plus-caching: "it's a golden age so long as you have access to massive amounts of compute and storage."
jeremysalwen almost 6 years ago

I don't get the point they are trying to make about BERT not learning dependencies between masked words. Isn't the mask randomly chosen each time, so it has a chance to learn with all possible words unmasked?
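For reference, here is a small Python sketch of the BERT-style random masking this question describes (my own toy illustration, not BERT's code; the sentence, `sample_mask`, and `mask_prob` are hypothetical). The mask is indeed re-sampled on every pass, so each token is seen unmasked in many contexts; the paper's objection is narrower: tokens that happen to be masked together in the same example are predicted independently of each other.

```python
import random

tokens = ["New", "York", "is", "a", "city"]

def sample_mask(tokens, mask_prob=0.15):
    # Each token is hidden independently with probability mask_prob,
    # re-sampled on every pass over the data.
    return [random.random() < mask_prob for _ in tokens]

for step in range(3):
    mask = sample_mask(tokens, mask_prob=0.3)
    corrupted = ["[MASK]" if m else t for t, m in zip(tokens, mask)]
    targets = [t for t, m in zip(tokens, mask) if m]
    print(f"step {step}: input={corrupted} predict={targets}")
```

When "New" and "York" are masked in the same step, each is predicted from the remaining context alone, with no term modeling the dependency between the two masked targets.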