Improving Language Models with Advantage-based Offline Policy Gradients
Abstract
A-LoL, an offline policy gradient algorithm, enables efficient and stable LM training using pre-existing data and achieves superior performance across multiple language generation tasks.
Language Models (LMs) achieve substantial language capabilities when finetuned using Reinforcement Learning with Human Feedback (RLHF). However, RLHF is an unstable and data-hungry process that continually requires new high-quality LM-generated data for finetuning. We introduce Advantage-Leftover Lunch RL (A-LoL), a new class of offline policy gradient algorithms that enable RL training on any pre-existing data. By treating the entire LM output sequence as a single action, A-LoL allows incorporating sequence-level classifiers or human-designed scoring functions as rewards. Subsequently, by using the LM's internal sequence-level value estimate, A-LoL filters out negative-advantage (low-quality) data points during training, making it resilient to noise. Overall, A-LoL is an easy-to-implement LM training recipe that is sample-efficient and stable. We demonstrate the effectiveness of A-LoL and its variants on four different language generation tasks. We compare against both online RL (PPO) and recent preference-based (DPO, PRO) and reward-based (GOLD) offline RL baselines. On the commonly used RLHF benchmark, Helpful and Harmless Assistant (HHA), LMs trained with A-LoL methods achieve the highest diversity while also being rated more safe and helpful than baselines by human evaluators. Additionally, in the remaining three tasks, A-LoL can optimize multiple distinct reward functions even when using noisy or suboptimal training data. We release our experimental code at https://github.com/abaheti95/LoL-RL.
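To make the objective concrete, below is a minimal, hypothetical sketch of an advantage-based offline policy-gradient loss in the spirit of A-LoL. This is not the authors' implementation: the function name, the clipping constant, and the exact weighting details are illustrative assumptions. Only the core idea follows the abstract: a sequence-level advantage is computed as the reward minus the LM's own value estimate, negative-advantage examples are dropped, and the remaining examples' sequence log-likelihood is weighted by advantage and a clipped importance weight.

```python
# Minimal sketch of an advantage-based offline policy-gradient update in the
# spirit of A-LoL. NOT the authors' code: names and constants are assumptions.
import torch


def a_lol_style_loss(seq_logprobs, ref_logprobs, rewards, values, clip=2.0):
    """Per-batch advantage-weighted negative log-likelihood.

    seq_logprobs: (B,) summed token log-probs of each logged sequence under
                  the current policy LM (requires grad).
    ref_logprobs: (B,) the same sums under a frozen reference LM.
    rewards:      (B,) sequence-level rewards from a classifier or a
                  human-designed scoring function.
    values:       (B,) the LM's own sequence-level value estimates.
    """
    # Advantage: how much better the logged sequence is than the LM expects.
    advantages = rewards - values
    # Filter out negative-advantage (low-quality / noisy) data points.
    keep = (advantages > 0).float()
    # Sequence-level importance weight, detached and clipped for stability.
    importance = torch.exp(seq_logprobs - ref_logprobs).detach().clamp(max=clip)
    # Advantage-weighted log-likelihood (maximize), negated to form a loss.
    per_example = -advantages.detach() * importance * seq_logprobs
    return (keep * per_example).sum() / keep.sum().clamp(min=1.0)


# Toy usage with random numbers standing in for real model outputs.
B = 8
loss = a_lol_style_loss(
    seq_logprobs=torch.randn(B, requires_grad=True),
    ref_logprobs=torch.randn(B),
    rewards=torch.rand(B),
    values=torch.rand(B) * 0.5,
)
loss.backward()
```

In practice, the sequence log-probabilities would come from the policy and a frozen reference LM, and the rewards from a sequence-level classifier or hand-designed scoring function, as described in the abstract.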
Community
Hi authors,
Interesting paper!
Are there any models on the LMSYS Chatbot Arena trained with this technique?
Also, are there any recent studies that compare this work further against PPO with larger models?
Hi @karthik-ganesan-nexusflow,
Thank you for taking an interest in our work. We haven't had an opportunity to test our method with large-scale datasets and models yet. I would like to extend this work into a combined offline + online method in a follow-up project and then systematically compare it with an online PPO baseline.
We would love to hear from you if you try experimenting with A-LoL. The code is available on GitHub.