Quickly scanned <a href="https://arxiv.org/abs/2401.10020" rel="nofollow">https://arxiv.org/abs/2401.10020</a> . Quite interesting work. The paper's idea is to have a single language model do both question answering (responding to prompts) and self-evaluation of its own answers. Iterative DPO training is used to improve both capabilities of the model.<p>The authors tried different LLM-as-a-judge prompts to generate a reward score for each answer. A particular additive 5-point scoring prompt was found to be the most effective. The two-step inference pipeline (answering questions + evaluating the answers) also generates an extra dataset in the form of <question, winning-answer, losing-answer> triples.<p>This AI-generated dataset (preference pairs) is fed back to the model in a training pipeline using Direct Preference Optimization.<p>The inference and training pipelines are connected to form a closed-loop, iterative process. Each iteration generates better AI-feedback training data and subsequently a better model. The evaluation shows very promising results, outperforming Claude 2, Gemini Pro, and GPT-4 on selected benchmarks.
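A minimal sketch of one iteration as I understand it (generate, judge_score, and dpo_train are placeholder helpers, not the authors' code; if I recall correctly the paper samples a handful of candidate responses per prompt and scores each with the additive 0-5 judge prompt):

    # Hypothetical sketch of one self-rewarding iteration (M_t -> M_{t+1}).
    # generate(), judge_score(), and dpo_train() are assumed helpers, not the paper's code.
    def self_rewarding_iteration(model, prompts, n_candidates=4):
        preference_pairs = []
        for prompt in prompts:
            # Step 1: the model answers its own prompt several times.
            candidates = [generate(model, prompt) for _ in range(n_candidates)]
            # Step 2: the same model scores each answer with the additive
            # 5-point LLM-as-a-Judge prompt (0-5).
            scores = [judge_score(model, prompt, c) for c in candidates]
            best = candidates[scores.index(max(scores))]
            worst = candidates[scores.index(min(scores))]
            if max(scores) > min(scores):  # skip prompts where all scores tie
                preference_pairs.append((prompt, best, worst))
        # Step 3: train the next model on the AI-generated preference pairs with DPO.
        return dpo_train(model, preference_pairs)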
The paper still has some room for improvement.<p>1) Figure 1 does not accurately reflect the entire workflow. For example, a fixed model is used to generate new prompts, but this is not shown in the figure. The preference pairs should be drawn as a matrix rather than a vector in the diagram. The bootstrapping workflow (using the seed instruction-following and evaluation datasets) should also be reflected.<p>2) The authors do not explain why a fixed model is used to generate prompts instead of the self-rewarding model itself.<p>3) The authors also tried another form of AI-feedback data, (question, best-answer) pairs, coupled with supervised fine-tuning, but it did not yield any performance improvement. It would be better to explore why, or at least flag it as future work.<p>4) Fundamentally, the paper does not directly compare (or comment on) self-rewarding vs. independent rewarding. The iterative process could just as well use an independent reward model (a one-line change in the sketch above, see below).
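Concretely, the independent-rewarding baseline would keep the judge frozen and only update the policy model (again a hypothetical sketch with the same placeholder helpers):

    # Independent-rewarding variant: a fixed judge scores the candidates,
    # while only the policy model is trained with DPO in Step 3.
    scores = [judge_score(fixed_judge_model, prompt, c) for c in candidates]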