0 - Introduction

Reinforcement Learning with Human Feedback (RLHF) methods, particularly Proximal Policy Optimization (PPO) [1], have proven effective in aligning large language models (LLMs). However, it remains unclear whether PPO can generalize to prompts unseen during training, and how exactly it generalizes to such prompts. To delve deeper into this issue, we have gathered recent research insights in this blog and carried out a series of experiments on the code generation task. Our discoveries highlight that generalization in RLHF comprises two essential aspects: generalization originating from the training of the reward model and generalization arising from the PPO training itself. Notably, the generalization from the training of the reward model is conveyed to the RLHF process through diverse PPO prompts.

In summary, our findings are as follows:

- Generalization in RLHF arises from two sources: the training of the reward model and the PPO training itself.
- The generalization learned during reward model training is conveyed to the RLHF process through diverse PPO prompts.

Finally, we give some recommendations for data construction in RLHF based on the above findings.

1 - The RLHF Process

The RLHF process consists of two main components:

Reward Model Training: The reward model provides LLMs with a signal that guides the reinforcement learning process. In general, we first gather a dataset of prompts, LLM responses, and corresponding human feedback (ratings, rankings, or other forms of evaluation). This feedback serves as the ground truth for training the reward model. We then train the reward model with supervised learning on the collected data, so that it learns to predict, for a given prompt, the reward associated with each LLM response [2].

Figure 1: Reward model training process. [2]

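To make the ranking objective concrete, here is a minimal PyTorch sketch of the pairwise (Bradley-Terry style) loss commonly used to train a reward model on ranked response pairs. The function name, tensor shapes, and the random scores standing in for model outputs are illustrative assumptions, not the exact setup of [2].

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen: torch.Tensor,
                         reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss for reward model training.

    reward_chosen / reward_rejected: scalar rewards the model assigns to the
    human-preferred and the less-preferred response for the same prompt,
    each of shape (batch_size,).
    """
    # Encourage the preferred response to score higher than the rejected one:
    # loss = -log(sigmoid(r_chosen - r_rejected))
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Illustrative usage with random scores standing in for reward model outputs.
r_chosen = torch.randn(8)
r_rejected = torch.randn(8)
loss = pairwise_reward_loss(r_chosen, r_rejected)
```

Minimizing this loss pushes the reward model to assign higher scores to the responses humans preferred, which is exactly the signal later used to guide PPO.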

Fine-tuning with RL: Given a reward model, we employ RL to fine-tune the policy of an LLM. The policy is a language model that takes in a prompt and returns a sequence of text (or probability distributions over text). The action space of this policy is the set of all tokens in the language model's vocabulary, and the observation space is the distribution of possible input token sequences, which is quite large compared to earlier uses of RL (its dimension is approximately the vocabulary size raised to the power of the input sequence length). The reward function is a combination of the preference model's score and a constraint on policy shift. Finally, the update rule is the parameter update of the policy from PPO that maximizes the reward on the current batch of data [3].

Figure 2: The reinforcement learning from human feedback. [3]

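As a rough illustration of how the reward signal combines the preference model's score with the policy-shift constraint, below is a hedged PyTorch sketch of a per-sequence reward of the kind fed into PPO. The variable names and the beta coefficient are illustrative assumptions rather than the exact formulation of [3].

```python
import torch

def rlhf_reward(rm_score: torch.Tensor,
                policy_logprobs: torch.Tensor,
                ref_logprobs: torch.Tensor,
                beta: float = 0.1) -> torch.Tensor:
    """Combine the reward model score with a KL penalty on policy shift.

    rm_score:        reward model score for each generated response, shape (batch,).
    policy_logprobs: log-probs of the generated tokens under the current policy,
                     shape (batch, seq_len).
    ref_logprobs:    log-probs of the same tokens under the frozen reference
                     (pre-RLHF) model, shape (batch, seq_len).
    beta:            strength of the KL penalty.
    """
    # Approximate KL divergence between the policy and the reference model,
    # summed over the generated tokens of each sequence.
    kl = (policy_logprobs - ref_logprobs).sum(dim=-1)
    # Reward passed to PPO: preference score minus the policy-shift penalty.
    return rm_score - beta * kl
```

The KL term keeps the fine-tuned policy from drifting too far from the original language model while the reward model score pulls it toward human-preferred outputs.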

2 - Generalization Process in the Reward Model

2.1 - Generalization from Preference Dataset

Generally, we gather human rankings of pairs of LLM responses to create a human feedback dataset, commonly referred to as a preference dataset. The reward model is able to uncover patterns within the ranked response pairs and apply these patterns to unseen pairs. This is the generalization stemming from the preference dataset.
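For concreteness, a single preference example typically stores a prompt together with the ranked pair of responses. The field names below are an illustrative assumption about the data layout on a code generation task, not a prescribed schema.

```python
# One entry of a (hypothetical) preference dataset: the prompt, the response the
# annotator preferred ("chosen"), and the response they ranked lower ("rejected").
preference_example = {
    "prompt": "Write a Python function that reverses a string.",
    "chosen": "def reverse(s):\n    return s[::-1]",
    "rejected": "def reverse(s):\n    return reversed(s)",  # returns an iterator, not a string
}
```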

OpenAI has demonstrated the reward model's generalization ability in the RLHF process: