I could not disagree more with this post. To summarize what the author is unhappy with:

1) "It’s another big jump in the number, but the underlying architecture hasn’t changed much... it’s pretty annoying and misleading to call it “GPT-3.” GPT-2 was (arguably) a fundamental advance, because it demonstrated the power of way bigger transformers when people didn’t know about that power. Now everyone knows, so it’s the furthest thing from a fundamental advance."

2) "The “zero-shot” learning they demonstrated in the paper – stuff like “adding tl;dr after a text and treating GPT-2′s continuation thereafter as a ‘summary’” – were weird and goofy and not the way anyone would want to do these things in practice... They do better with one task example than zero (the GPT-2 paper used zero), but otherwise it’s a pretty flat line; evidently there is not too much progressive “learning as you go” here."

3) "Coercing it to do well on standard benchmarks was valuable (to me) only as a flamboyant, semi-comedic way of pointing this out, kind of like showing off one’s artistic talent by painting (but not painting especially well) with just one’s non-dominant hand."

4) "On Abstract reasoning... So, if we’re mostly seeing #1 here, this is not a good demo of few-shot learning the way the authors think it is."

---------
My response:

1) The fact that we can get so much improvement out of something so "mundane" should be cause for celebration, rather than disappointment. It means we have found general methods that scale well, and a straightforward recipe for brute-forcing our way to solutions for problems we haven't solved before.

At this point it becomes not a question of possibility, but of engineering investment. Isn't that the dream of an AI researcher? To find something that works so well you can stop "innovating" on the math stuff?

2) Are we reading the same plot? I see an improvement beyond 16 shots.

I believe the point of that setup is to illustrate that any model trained to make sequential decisions can be regarded as "learning to learn", because the arbitrary computation between sequential decisions can incorporate "adaptive feedback". It blurs the semantics between "task learning" and "instance learning" (a minimal sketch of how zero-shot and few-shot prompts are assembled is at the end of this comment).

3) This is actually a fair point, and perhaps now that models are doing better (no thanks to people who spurn big compute), we should propose better metrics to capture general language understanding.

4) It's certainly possible, but you come off as pretty confident for someone who hasn't tried running the model and testing its abilities.

Who is the author, anyway? Are *they* capable of building systems like GPT-3?
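For concreteness, here is the sketch referenced in point 2: the zero-shot "tl;dr" trick versus a k-shot prompt. This is my own illustration, not code from the paper or the post; the function names and example strings are hypothetical, and only the prompt construction is shown, not any model call.

```python
# Illustrative sketch only: how a zero-shot "tl;dr" prompt differs from a
# k-shot prompt. Function names and example data are hypothetical.

def zero_shot_summary_prompt(article: str) -> str:
    # GPT-2-style trick: append "tl;dr:" and treat whatever the model
    # generates next as the summary.
    return f"{article}\ntl;dr:"

def few_shot_summary_prompt(examples, article: str) -> str:
    # GPT-3-style k-shot prompt: k (article, summary) pairs are placed
    # in-context, then the new article is appended. There are no gradient
    # updates; any "learning" happens in the forward pass over the prompt.
    parts = [f"{a}\ntl;dr: {s}" for a, s in examples]
    parts.append(f"{article}\ntl;dr:")
    return "\n\n".join(parts)

if __name__ == "__main__":
    demos = [
        ("Example article one ...", "Example summary one."),
        ("Example article two ...", "Example summary two."),
    ]
    print(few_shot_summary_prompt(demos, "New article to summarize ..."))
```

The "flat line" dispute in point 2 is then about how benchmark performance changes as the number of in-context pairs grows; this sketch only shows how those prompts are assembled, not how they are scored.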