From the Conclusion:<p>"Finally, our work raises ethical and legal questions, including whether the open-source community should continue to advance progress by “stealing” what OpenAI and other companies have done, as well as what legal countermeasures companies can take to protect and license intellectual property."<p>Really???
The authors conduct automated, more methodical evaluations of LLMs finetuned to imitate ChatGPT outputs, and find that, despite superficial/informal appearances to the contrary, the imitation models close little to none of the gap to ChatGPT on tasks that are not heavily supported in the imitation data.<p>It's not good news for the open LLM ecosystem.
"Second, given the large gap between LLaMA and ChatGPT (the latter model is faster, cheaper, and
more accurate), "<p>No it's not, llama would be cheaper and likely faster if you ran it on the same scale, actually there've been a few calcs done, that running llama 65b if you're at 100% usage is cheaper than 3.5turbo per token. Also comparing them for accuracy isn't fair comparison, one is a foundational model, one is an instruct tuned model. Perhaps compare llama 65b with gpt3.
The jump between LLaMA 13B and 30B is quite significant. And I don't think their instruction finetuning is SOTA, though the point about general knowledge is a good one: instruction-tuned LLaMA lies very confidently.<p>But one great thing about open-source LLMs is that you <i>can</i> specialize them in various tasks with affordable LoRA training, enough to easily beat GPT-4 in a specific niche.
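For concreteness, here's a minimal sketch of what that kind of LoRA specialization looks like with Hugging Face's peft library (the checkpoint name, rank, and target modules are illustrative assumptions, not anything from the paper):

    # Rough sketch: attach LoRA adapters to a LLaMA checkpoint and train only those.
    # Assumes transformers + peft; checkpoint and hyperparameters are illustrative.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "huggyllama/llama-13b"  # assumption: any local LLaMA-class checkpoint
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

    lora_cfg = LoraConfig(
        r=16,                                 # adapter rank; small, so cheap to train
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],  # attention projections only
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_cfg)
    model.print_trainable_parameters()        # typically well under 1% of the base weights

Only the adapter weights get gradients, which is why a niche-specific finetune like this stays affordable on a single GPU.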
This isn't really a new result. We already know from the GPT-4 paper that RLHF-style fine-tuning just makes the model more compliant, not more capable.
This is an important study, and I've been waiting for something like this ever since Alpaca and the following wave of imitation models, which have had lackluster, non-rigorous evaluation.
Sensational title that misrepresents the message of the paper.<p><i>However, when conducting more targeted automatic evaluations, we found that the imitation models close little to none of the large gap between LLaMA and ChatGPT. In particular, we demonstrate that imitation models improve on evaluation tasks that are heavily supported in the imitation training data. On the other hand, the models do not improve (or even decline in accuracy) on evaluation datasets for which there is little support. For example, training on 100k ChatGPT outputs from broad-coverage user inputs provides no benefits to Natural Questions accuracy (e.g., Figure 1, center), but training exclusively on ChatGPT responses for Natural-Questions-like queries drastically improves task accuracy.</i><p>This might not be the way to replicate the performance of ChatGPT across all tasks, but it seems to work quite well on whichever tasks are covered by the imitation data. That is still a big win.<p>Later on, this also works for factual correctness (leaving aside the argument over whether this is the right approach to factuality):<p><i>For example, training on 100k ChatGPT outputs from broad-coverage user inputs provides no benefits to Natural Questions accuracy (e.g., Figure 1, center), but training exclusively on ChatGPT responses for Natural-Questions-like queries drastically improves task accuracy.</i>
I'd be really curious what the authors of the recent (3 days ago) QLoRA paper would think of this article: <a href="https://arxiv.org/abs/2305.14314" rel="nofollow">https://arxiv.org/abs/2305.14314</a>. They claim "Guanaco outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU".<p>This statement in particular seems relevant:
"We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT."
Does this mean that, if one wants GPT-4-quality outputs on a topic, one should specifically generate a dataset on that topic to fine-tune their own model?<p>There's still room for closing the gap, but ultimately it's only going to be a pale imitation when the underlying model's representations aren't as useful.
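If one did want to try that, the straightforward version is just to collect teacher outputs for queries in the target niche and finetune on those pairs; a rough sketch using the openai client (the query list, model name, and output path are placeholders, not anything from the paper):

    # Rough sketch: build a topic-specific imitation dataset from a stronger teacher model.
    # Assumes the openai Python client (v1+); queries, model, and path are placeholders.
    import json
    from openai import OpenAI

    client = OpenAI()
    topic_queries = ["...domain-specific prompts go here..."]  # placeholder

    with open("imitation_data.jsonl", "w") as f:
        for q in topic_queries:
            resp = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": q}],
            )
            # Each (prompt, teacher answer) pair becomes one supervised finetuning example.
            f.write(json.dumps({"prompt": q,
                                "response": resp.choices[0].message.content}) + "\n")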
> imitation models are adept at mimicking ChatGPT's style but not its factuality<p>This is largely a pot calling the kettle black. The LLM game is not about not mimicking somebody else; it is about not being caught doing so :-)
"However, imitation
falls short in improving LMs across more challenging axes such as factuality, coding, and problem
solving."<p>Brilliant observation captain obvious.
This is exactly the reason OpenAI isn't afraid of the open-source community, as many kneejerk opponents of regulatory capture assume (they are probably still afraid of Google). It's also why they still do the expensive and cumbersome RLHF training, instead of those deceptively cheap and fast finetunes. They understand their own tech and why there is no free lunch.<p>Recently, John Schulman explained the issue with behavior cloning, and it's a very typical ML problem.[1] Basically: what are we training the model to do? The model updates after finetuning in a holistic manner, based on the sum total of its content and capability. Suppose GPT-4 can correctly answer many requests because it knows the correct answers, in the sense that it has something isomorphic to an internal knowledge graph and tools for querying it, and that graph contains sufficient data for its tools to derive an answer at inference. RLHF reinforces this behavior by constraining the distribution of outputs (essentially, steering the model away from applying inappropriate tools to the respective inputs, e.g. employing fantasy-narrative or bad-yahoo-answers cognitive routines when asked something that looks like a straightforward factual question).<p>Now suppose you teach LLaMA-13B to imitate those responses by SFTing it on a dump of successful GPT-4 conversations. But LLaMA doesn't have internals that would have enabled it to find the same answers; so on the object level it shallowly memorizes specific items of the post-training dataset, and on the meta level it learns the stylistic flourish of a high-powered model. And it starts to hallucinate confident nonsense whenever you step out of the training distribution, because it doesn't actually learn to query its own knowledge graph. A little anthropomorphism won't hurt: you create an incapable impostor this way, a wannabe nerd, a character who is used to guessing the teacher's password and being praised, instead of understanding the subject, and who keeps raising its hand whenever a question is asked, but is painfully clueless.<p>Indeed, the early and cheap success of behavior cloning was a massive red flag unto itself. There's no way all the compute and data that went into training GPT-3/3.5/4 tier models can be substituted with gently demonstrating the attitude vector. If we had models that were markedly less capable but comparably honest, we would have reasons to hope that this line terminates in a genuine open-source peer competitor; instead, we have total fraud.<p>It is a nontrivial task to have a model generalize epistemic honesty from external examples, and not a lower-order behavior like clamming up and kowtowing <i>or</i> bullshitting; train it to say "I don't know" whenever it actually doesn't know, but only then.<p>There are clever approaches here, but they're not such low-hanging fruit as what passes for open source right now.<p>1. <a href="https://youtu.be/hhiLw5Q_UFg?t=685" rel="nofollow">https://youtu.be/hhiLw5Q_UFg?t=685</a>
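To make the "what are we training the model to do" point concrete: behavior cloning here is just next-token cross-entropy on the teacher's responses, so the objective rewards reproducing surface tokens whether or not the student has the internal machinery to derive them. A minimal sketch (PyTorch, with a generic HF-style causal LM; the names are illustrative):

    # Behavior cloning / SFT in miniature: maximize log p_theta(teacher_response | prompt).
    # Nothing in this loss checks whether the student can actually derive the answer,
    # which is the failure mode described above.
    import torch
    import torch.nn.functional as F

    def imitation_loss(student, prompt_ids, teacher_ids):
        # Concatenate prompt and teacher response; predict every next token.
        input_ids = torch.cat([prompt_ids, teacher_ids], dim=1)
        logits = student(input_ids).logits[:, :-1, :]
        targets = input_ids[:, 1:].clone()
        # Mask the prompt tokens so only the teacher's response contributes to the loss.
        targets[:, : prompt_ids.size(1) - 1] = -100
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1), ignore_index=-100)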