Training open-source LLMs on ChatGPT output is a really bad idea.

114 points by laprise, about 2 years ago

13 comments

lysozyme, about 2 years ago
Do yourself a favor and skip right through to the Twitter link to another link to this excellent post by Yoav Goldberg [1] on the actual reason that training new models on ChatGPT output in the manner of supervised learning (in contrast to reinforcement learning) will not produce a model as good as ChatGPT.

> For this type of interaction, we must use RL training, as supervised training teaches the model to lie. The core issue is that we want to encourage the model to answer based on its internal knowledge, but we don't know what this internal knowledge contains. In supervised training, we present the model with a question and its correct answer, and train the model to replicate the provided answer.

The author says he's summarizing a talk by John Schulman of OpenAI [2], but I haven't personally watched the video. In any case, this is an interesting insight.

Say we set up a supervised learning scenario where we ask the model to use its internal knowledge to answer a question and compare its answer to one written by a human. If the two answers essentially say the same thing, but in different words, in the supervised learning case the model is penalized. In the RL case, it's rewarded. That's the difference.

1. https://gist.github.com/yoavg/6bff0fecd65950898eba1bb321cfbd81

2. https://www.youtube.com/watch?v=hhiLw5Q_UFg
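A toy sketch of that asymmetry (the scoring functions below are hypothetical stand-ins, not how either training setup is actually implemented): supervised training compares the answer token-by-token against a single reference, while an RL reward model can score a correct paraphrase as equally good.

    # Toy illustration (hypothetical stand-ins, not real training code).
    reference = "Paris is the capital of France."
    model_answer = "The capital of France is Paris."  # same fact, new words

    def supervised_loss(answer: str, ref: str) -> float:
        # Token-level comparison against one reference: any wording
        # difference is penalized, even when the meaning is identical.
        ans, r = answer.split(), ref.split()
        mismatches = sum(a != b for a, b in zip(ans, r)) + abs(len(ans) - len(r))
        return mismatches / max(len(r), 1)

    def rl_reward(answer: str) -> float:
        # Reward-model stand-in: scores content, so a correct paraphrase
        # earns full reward. Real reward models are learned from human
        # preference data, not keyword checks.
        words = set(answer.lower().replace(".", "").split())
        return 1.0 if {"paris", "capital", "france"} <= words else 0.0

    print(supervised_loss(model_answer, reference))  # ~1.0: heavily penalized
    print(rl_reward(model_answer))                   # 1.0: fully rewarded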
dvt, about 2 years ago
> So I can easily imagine a near future where the web will be flooded by LLM output or at least by content heavily inspired or edited by LLMs.

To be fair, we're already there, and we've been there for at least 10 years now. I'd wager >75% of the internet is garbage: auto-generated blog posts, programmatically-permuted ads, YouTube videos that mainly regurgitate other sources. Email is mostly garbage, and the only reason it's usable is that spam filters have gotten pretty good. Even a non-trivial amount of heavily-curated social media (Twitter/FB/IG) is pure spam.
anotherhue, about 2 years ago
As others have mentioned, it's frustrating to use a non-OpenAI model and be told "I'm sorry, as an AI...", as it represents a reimplementation of someone else's censorship.

There are approaches such as Dolly for developing a non-OpenAI RLHF feedback set, but it's hard to compete against ShareGPT and co.
seydor, about 2 years ago
This is not new. We've been dealing with US standards of morality, down to nipples, since the beginning of the internet. People will get bored of ChatGPT output everywhere, however. We are very good at detecting repeated patterns and tend to find them banal.

There are now uncensored open-source models. Vicuna-like models are great and even work for translation. It's eerie what a 10GB file can do.
kenshoen, about 2 years ago
I wonder how OpenAI is going to avoid this problem once the web is littered with its own content?
politician, about 2 years ago
The article points out that training data generated using ChatGPT is necessarily biased or tainted with the consequences of the policy optimizations and RLHF alignment processes conducted by OpenAI. This results in models that reflect the alignment preferences of OpenAI instead of the preferences of the model developers.
est, about 2 years ago
It seems more and more plausible that OpenAI's choice of 2021-09 as a cut-off date was intentional, because GPT-3-generated output was released into the wild after that.
m3kw9, about 2 years ago
It won't work, because you will need as much training data as ChatGPT had to get to its general knowledge level.

A subset will give you a subset of the knowledge; there's no free lunch.
brucethemoose2, about 2 years ago
I am also worried about LLM "inbreeding."

When I fine-tuned successive generations of ESRGAN on its own output (as I essentially wanted to use it for img2img), it would amplify *tiny* oddities and artifacts that, I would later find out, were in the training data. *Tiny* noise splotches, "swirls," and distorted line edges blew up. And I was careful... I pixel-peeped the dataset as best I could before starting training.

Human language is obviously different, but I still fear oddities or biases will start popping up when the base models train on large fractions of their own data. And by the time we find out, it will be near impossible to filter out.

But continuing the analogy, maybe a diverse base-model population is a good way to avoid that issue?
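A toy simulation of that compounding (the 1.5x amplification factor is invented; only the shape of the curve matters): each generation trains on the previous generation's output, so a once-invisible artifact rate grows geometrically.

    # Toy model of self-training "inbreeding" (the amplification factor
    # is invented; the point is the compounding, not the exact numbers).
    def next_gen_artifact_rate(rate: float, amplification: float = 1.5) -> float:
        # Artifacts in generation N's output become "ground truth" for
        # generation N+1, so they are learned more strongly each pass.
        return min(1.0, rate * amplification)

    rate = 0.01  # generation 0: a 1% artifact rate, easy to miss
    for gen in range(1, 9):
        rate = next_gen_artifact_rate(rate)
        print(f"generation {gen}: artifact rate ~{rate:.1%}")
    # A handful of generations later, the once-tiny artifact dominates.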
satisfice, about 2 years ago
The author claims to be “flabbergasted” that people would want to stop work on world-changing AI projects.

The gulf between otherwise smart people on this very important issue should depress us all. Personally I feel as if a mutant species has been released into the wild, yet as in Rick and Morty, some people think the wisest course is to release a lot more mutants.

People are fools, hackers more than most, though we are productive and useful fools much of the time. But our foolishness hasn't been a threat to humanity until recently.
petrzjunior, about 2 years ago
I think that the article misses the point. Many people are using ChatGPT to create relatively small but high-quality datasets, because it is very easy. Stanford created an amazing dataset for their Alpaca for just $500. If you are building a competitive model (such as Meta's LLaMA), then of course you don't use ChatGPT-generated data, because you have the money to download the whole internet.
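For anyone curious what that Alpaca-style recipe looks like in miniature, here is a sketch (the prompts, file names, and the pre-1.0 openai package API shape are assumptions on my part; Stanford's actual pipeline used seed tasks and a self-instruct loop with far more filtering):

    # Minimal Alpaca-style data generation sketch. Assumes the pre-1.0
    # `openai` Python package (openai.ChatCompletion); prompts and file
    # names are illustrative, not Stanford's actual pipeline.
    import json
    import openai

    openai.api_key = "sk-..."  # your API key

    seed_instructions = [
        "Explain what a hash table is.",
        "Write a haiku about compilers.",
    ]

    dataset = []
    for instruction in seed_instructions:
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": instruction}],
        )
        dataset.append({
            "instruction": instruction,
            "output": resp["choices"][0]["message"]["content"],
        })

    with open("synthetic_instructions.json", "w") as f:
        json.dump(dataset, f, indent=2)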
MPSimmons, about 2 years ago
Seems to be a Multiplicity[1]-type problem.

[1] - https://en.wikipedia.org/wiki/Multiplicity_(film)
sandGorgon, about 2 years ago
How does one do this? Train an open-source LLM on ChatGPT output? People have been talking about it, so I'm intrigued.

Is there a how-to anywhere? I'm not even sure which open-source model to use, etc.
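The replies to this didn't load, but one common recipe (a sketch under assumptions: any open causal LM works here, and the model name and hyperparameters below are placeholders) is to format ChatGPT-generated instruction/response pairs as plain text and fine-tune with Hugging Face transformers:

    # Sketch: fine-tune an open model on ChatGPT-generated pairs using
    # Hugging Face transformers. Model name and hyperparameters are
    # placeholders; real runs need GPUs, padding/packing, and eval.
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              Trainer, TrainingArguments)

    model_name = "EleutherAI/pythia-1.4b"  # any open causal LM
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Expects records like {"instruction": ..., "output": ...}, e.g. the
    # file produced by the generation sketch further up the thread.
    data = load_dataset("json", data_files="synthetic_instructions.json")["train"]

    def to_features(example):
        text = (f"### Instruction:\n{example['instruction']}\n\n"
                f"### Response:\n{example['output']}")
        toks = tokenizer(text, truncation=True, max_length=512)
        toks["labels"] = toks["input_ids"].copy()  # causal LM objective
        return toks

    data = data.map(to_features, remove_columns=data.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out",
                               per_device_train_batch_size=1,
                               num_train_epochs=1),
        train_dataset=data,
    )
    trainer.train()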