Self-Rewarding Language Models
“Self-Rewarding Language Models” proposes a method for training large language models (LLMs) that provide their own rewards during training, instead of relying on human preference data or a fixed, frozen reward model. The authors use LLM-as-a-Judge prompting, in which the model itself scores candidate responses via a natural-language evaluation prompt. Training follows an iterative Direct Preference Optimization (DPO) framework: the model generates responses to new instruction-following prompts, scores its own candidates to build preference pairs, and is then fine-tuned on those pairs, with the whole process repeated over several iterations. The authors report that this improves both the model's instruction-following and its reward-modelling ability, and that a few iterations starting from Llama 2 70B yield a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613.
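The data-generation half of one iteration can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the `llm` callable, the `judge_score` helper, and the judging prompt below are stand-ins (the paper uses a detailed additive 5-point rubric), and the resulting pairs would then be passed to a DPO fine-tuning step, which is omitted here.

```python
import re
import random
from typing import Callable, List, Tuple

# Rough paraphrase of an LLM-as-a-Judge scoring prompt; the exact wording in the
# paper's 5-point additive rubric differs.
JUDGE_TEMPLATE = (
    "Review the user's question and the corresponding response, "
    "then award a score from 0 to 5.\n\n"
    "User: {prompt}\n\nResponse: {response}\n\nScore: "
)


def judge_score(llm: Callable[[str], str], prompt: str, response: str) -> float:
    """Ask the model itself (LLM-as-a-Judge) to score one of its own responses."""
    verdict = llm(JUDGE_TEMPLATE.format(prompt=prompt, response=response))
    match = re.search(r"\b([0-5](?:\.\d+)?)\b", verdict)
    return float(match.group(1)) if match else 0.0


def build_preference_pairs(
    llm: Callable[[str], str],
    prompts: List[str],
    n_candidates: int = 4,
) -> List[Tuple[str, str, str]]:
    """One iteration of self-reward data generation: sample candidate responses
    for each new prompt, self-score them, and keep (prompt, chosen, rejected)."""
    pairs = []
    for prompt in prompts:
        candidates = [llm(prompt) for _ in range(n_candidates)]
        scored = sorted(
            ((judge_score(llm, prompt, c), c) for c in candidates),
            key=lambda x: x[0],
        )
        worst, best = scored[0], scored[-1]
        if best[0] > worst[0]:  # skip prompts where all scores tie
            pairs.append((prompt, best[1], worst[1]))
    return pairs


if __name__ == "__main__":
    # Toy stand-in for a real model: answers prompts and emits random judge scores.
    def toy_llm(text: str) -> str:
        if text.startswith("Review the user's question"):
            return str(random.randint(0, 5))
        return f"answer to: {text}"

    pairs = build_preference_pairs(toy_llm, ["What is DPO?", "Explain LLM-as-a-Judge."])
    print(pairs)  # these pairs would feed a DPO update before the next iteration
```

In the full method, each iteration trains a new model on the pairs produced by the previous one, so both the policy and the judge improve together across iterations.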
For more details, please refer to the original paper.
Note: Summarized by AI.