A Simple "Motivation" Can Enhance Reinforcement Finetuning of Large Reasoning Models
Junjie Zhang, Guozheng Ma, Shunyu Liu, Haoyu Wang, Jiaxing Huang, +4 more
Abstract
Reinforcement Learning with Verifiable Rewards~(RLVR) has emerged as a powerful learn-to-reason paradigm for large reasoning models to tackle complex tasks. However, the current RLVR paradigm is still not efficient enough, as it works in a trial-and-error manner. To perform better, the model needs to explore the reward space by numerously generating responses and learn from fragmented reward signals, blind to the overall reward patterns. Fortunately, verifiable rewards make the natural language description of the reward function possible, and meanwhile, LLMs have demonstrated strong in-context learning ability. This motivates us to explore if large reasoning models can benefit from a \textbf{motivation} of the task, \textit{i.e.}, awareness of the reward function, during the reinforcement finetuning process, as we humans sometimes do when learning. In this paper, we introduce \textit{\textbf{M}otivation-\textbf{e}nhanced \textbf{R}einforcement \textbf{F}inetuning}~(\textbf{MeRF}), an intuitive yet effective method enhancing reinforcement finetuning of LLMs by involving \emph{``telling LLMs rules of the game''}. Specifically, \textbf{MeRF} directly injects the reward specification into the prompt, which serves as an in-context motivation for the model to be aware of the optimization objective. This simple modification leverages the in-context learning ability of LLMs, aligning generation with optimization, thereby incentivizing the model to generate desired outputs from both inner motivation and external reward. Empirical evaluations demonstrate that \textbf{MeRF} achieves substantial performance gains over the RLVR baseline. Moreover, ablation studies show that MeRF performs better with greater consistency between the in-context motivation and the external reward function, while the model also demonstrates an ability to adapt to misleading motivations through reinforcement finetuning.
Full analysis loading… Code implementations, benchmark data, and reproduction guides are being assembled. Please check back shortly.
Need human evaluators for your AI research? Scale annotation with expert AI Trainers.