
XLNet: Generalized Autoregressive Pretraining for Language Understanding

79 points by asparagui almost 6 years ago

4 comments

cs702 almost 6 years ago

This is NOT "just throwing more compute" at the problem.

The authors have devised a clever dual-masking-plus-caching mechanism to induce an attention-based model to learn to predict tokens from *all possible permutations* of the factorization order of all other tokens in the same input sequence. In expectation, the model learns to gather information from all positions on both sides of each token in order to predict the token.

For example, if the input sequence has four tokens, ["The", "cat", "is", "furry"], in one training step the model will try to predict "is" after seeing "The", then "cat", then "furry". In another training step, the model might see "furry" first, then "The", then "cat". Note that the original sequence order is always retained, e.g., the model always knows that "furry" is the fourth token.

The masking-and-caching algorithm that accomplishes this does not seem trivial to me.

The improvements to SOTA performance in a range of tasks are significant -- see tables 2, 3, 4, 5, and 6 in the paper.
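To make the permutation idea in this comment concrete, here is a minimal Python sketch (my own illustration on the four-token example, not the paper's actual two-stream attention implementation; all variable names are hypothetical). It samples one factorization order and builds the visibility mask that decides which positions each prediction may attend to, while the original positions stay known.

```python
import random

tokens = ["The", "cat", "is", "furry"]
n = len(tokens)

# Sample a random factorization order, e.g. [3, 0, 1, 2] means the model
# predicts position 3 first, then position 0, then 1, then 2.
order = list(range(n))
random.shuffle(order)

# When predicting the token at position q, the model may only look at
# positions that come earlier in the sampled order. The original sequence
# order is never lost: positions themselves are still encoded.
rank = {pos: i for i, pos in enumerate(order)}
can_attend = [[rank[k] < rank[q] for k in range(n)] for q in range(n)]

for q in range(n):
    visible = [tokens[k] for k in range(n) if can_attend[q][k]]
    print(f"predict {tokens[q]!r} (position {q}) from {visible}")
```

Averaged over many sampled orders, each position ends up being predicted from contexts drawn from both sides, which is the "in expectation" claim above.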
s_Hogg almost 6 years ago

It's nice to see people managing to push BERT further and get SOTA on stuff, but I feel like a fair amount of this sort of thing is really just throwing more and more compute at a problem.

Given that we still don't fundamentally understand the properties of deep nets anywhere near as well as we might like to think, it gives me the gnawing feeling that we're missing something.
jamesblonde almost 6 years ago

What I am most impressed by in this paper is that it is primarily from CMU, with five authors and one from Google (including the legend Quoc from Google; Ruslan Salakhutdinov is also excellent). Even though the CMU team may have had the original ideas, I guess the golden-age-of-ML adage still holds despite their contributions related to dual-masking-plus-caching: "it's a golden age so long as you have access to massive amounts of compute and storage."
jeremysalwen almost 6 years ago

I don't get the point they are trying to make about BERT not learning dependencies between masked words. Isn't the mask randomly chosen each time, so it has a chance to learn with all possible words unmasked?
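For reference, here is a small Python sketch of the BERT-style random masking this question describes (my own toy illustration, not BERT's code; the sentence, `sample_mask`, and `mask_prob` are hypothetical). The mask is indeed re-sampled on every pass, so each token is seen unmasked in many contexts; the paper's objection is narrower: tokens that happen to be masked together in the same example are predicted independently of each other.

```python
import random

tokens = ["New", "York", "is", "a", "city"]

def sample_mask(tokens, mask_prob=0.15):
    # Each token is hidden independently with probability mask_prob,
    # re-sampled on every pass over the data.
    return [random.random() < mask_prob for _ in tokens]

for step in range(3):
    mask = sample_mask(tokens, mask_prob=0.3)
    corrupted = ["[MASK]" if m else t for t, m in zip(tokens, mask)]
    targets = [t for t, m in zip(tokens, mask) if m]
    print(f"step {step}: input={corrupted} predict={targets}")
```

When "New" and "York" are masked in the same step, each is predicted from the remaining context alone, with no term modeling the dependency between the two masked targets.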