Quickly scanned <a href="https://arxiv.org/abs/2401.10020" rel="nofollow">https://arxiv.org/abs/2401.10020</a> . Quite interesting work. The paper's idea is to have a single language model do both question answering (responding to prompts) and self-evaluation of its own answers. Iterative DPO training is used to improve both capabilities of the model.<p>The authors tried different LLM-as-a-judge prompts to generate a reward score for each answer. A particular additive 5-point scoring prompt was found to be the most effective. The two-step inference pipeline (answering questions + evaluating the answers) also generates an extra dataset in the form of <question, winning-answer, losing-answer> triples.<p>This AI-generated dataset (preference pairs) is fed back to the model in a training pipeline using Direct Preference Optimization.<p>The inference and training pipelines are connected to form a closed-loop, iterative process. Each iteration generates better AI-feedback training data and subsequently a better model. The evaluation shows very promising results, outperforming Claude 2, Gemini Pro, and GPT-4 on selected benchmarks.
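A minimal sketch of one iteration as I understand it (generate, judge_score, and dpo_train are placeholder helpers, not the authors' code; if I recall correctly the paper samples a handful of candidate responses per prompt and scores each with the additive 0-5 judge prompt):

    # Hypothetical sketch of one self-rewarding iteration (M_t -> M_{t+1}).
    # generate(), judge_score(), and dpo_train() are assumed helpers, not the paper's code.
    def self_rewarding_iteration(model, prompts, n_candidates=4):
        preference_pairs = []
        for prompt in prompts:
            # Step 1: the model answers its own prompt several times.
            candidates = [generate(model, prompt) for _ in range(n_candidates)]
            # Step 2: the same model scores each answer with the additive
            # 5-point LLM-as-a-Judge prompt (0-5).
            scores = [judge_score(model, prompt, c) for c in candidates]
            best = candidates[scores.index(max(scores))]
            worst = candidates[scores.index(min(scores))]
            if max(scores) > min(scores):  # skip prompts where all scores tie
                preference_pairs.append((prompt, best, worst))
        # Step 3: train the next model on the AI-generated preference pairs with DPO.
        return dpo_train(model, preference_pairs)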
The paper still has some room for improvement.<p>1) Figure 1 does not accurately reflect the entire workflow. For example, a fixed model is used to generate new prompts, but this is not shown in the figure. The preference pairs should be drawn as a matrix rather than a vector in the diagram. The bootstrapping workflow (using the seed instruction-following and evaluation datasets) should also be reflected.<p>2) The authors do not explain why a fixed model is used to generate prompts instead of the self-rewarding model itself.<p>3) The authors also tried another form of AI-feedback data, (question, best-answer) pairs, coupled with supervised fine-tuning, but it did not yield any performance improvement. It would be better to explore why, or at least flag it as future work.<p>4) Fundamentally, the paper does not directly compare (or comment on) self-rewarding vs. independent rewarding. The iterative process could just as well use an independent reward model (a one-line change in the sketch above, see below).
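Concretely, the independent-rewarding baseline would keep the judge frozen and only update the policy model (again a hypothetical sketch with the same placeholder helpers):

    # Independent-rewarding variant: a fixed judge scores the candidates,
    # while only the policy model is trained with DPO in Step 3.
    scores = [judge_score(fixed_judge_model, prompt, c) for c in candidates]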