I could not disagree more with this post. To summarize what the author is unhappy with:

1) "It’s another big jump in the number, but the underlying architecture hasn’t changed much... it’s pretty annoying and misleading to call it “GPT-3.” GPT-2 was (arguably) a fundamental advance, because it demonstrated the power of way bigger transformers when people didn’t know about that power. Now everyone knows, so it’s the furthest thing from a fundamental advance."

2) "The “zero-shot” learning they demonstrated in the paper – stuff like “adding tl;dr after a text and treating GPT-2′s continuation thereafter as a ‘summary’” – were weird and goofy and not the way anyone would want to do these things in practice... They do better with one task example than zero (the GPT-2 paper used zero), but otherwise it’s a pretty flat line; evidently there is not too much progressive “learning as you go” here."

3) "Coercing it to do well on standard benchmarks was valuable (to me) only as a flamboyant, semi-comedic way of pointing this out, kind of like showing off one’s artistic talent by painting (but not painting especially well) with just one’s non-dominant hand."

4) "On Abstract reasoning... So, if we’re mostly seeing #1 here, this is not a good demo of few-shot learning the way the authors think it is."

---------
My response:

1) The fact that we can get so much improvement out of something so "mundane" should be cause for celebration, rather than disappointment. It means we have found general methods that scale well, and a straightforward recipe for brute-forcing our way to solutions for problems we haven't solved before.

At this point it becomes not a question of possibility, but of engineering investment. Isn't that the dream of an AI researcher? To find something that works so well you can stop "innovating" on the math stuff?

2) Are we reading the same plot? I see an improvement beyond 16 shots.

I believe the point of that setup is to illustrate that any model trained to make sequential decisions can be regarded as "learning to learn", because the arbitrary computation between sequential decisions can incorporate "adaptive feedback". It blurs the semantics between "task learning" and "instance learning" (a minimal sketch of how zero-shot and few-shot prompts are assembled is at the end of this comment).

3) This is actually a fair point, and perhaps now that models are doing better (no thanks to people who spurn big compute), we should propose better metrics to capture general language understanding.

4) It's certainly possible, but you come off as pretty confident for someone who hasn't tried running the model and testing its abilities.

Who is the author, anyway? Are *they* capable of building systems like GPT-3?
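For concreteness, here is the sketch referenced in point 2: the zero-shot "tl;dr" trick versus a k-shot prompt. This is my own illustration, not code from the paper or the post; the function names and example strings are hypothetical, and only the prompt construction is shown, not any model call.

```python
# Illustrative sketch only: how a zero-shot "tl;dr" prompt differs from a
# k-shot prompt. Function names and example data are hypothetical.

def zero_shot_summary_prompt(article: str) -> str:
    # GPT-2-style trick: append "tl;dr:" and treat whatever the model
    # generates next as the summary.
    return f"{article}\ntl;dr:"

def few_shot_summary_prompt(examples, article: str) -> str:
    # GPT-3-style k-shot prompt: k (article, summary) pairs are placed
    # in-context, then the new article is appended. There are no gradient
    # updates; any "learning" happens in the forward pass over the prompt.
    parts = [f"{a}\ntl;dr: {s}" for a, s in examples]
    parts.append(f"{article}\ntl;dr:")
    return "\n\n".join(parts)

if __name__ == "__main__":
    demos = [
        ("Example article one ...", "Example summary one."),
        ("Example article two ...", "Example summary two."),
    ]
    print(few_shot_summary_prompt(demos, "New article to summarize ..."))
```

The "flat line" dispute in point 2 is then about how benchmark performance changes as the number of in-context pairs grows; this sketch only shows how those prompts are assembled, not how they are scored.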