
Chai-GPT. RLHF Part I: Reward Modelling

Updated: Nov 8, 2023



Full results published in: https://arxiv.org/pdf/2303.06135.pdf


Our own large language models (LLMs) are the foundation of Chai, a platform for chat AI. Improving our LLMs directly translates to better user experiences, engagement, and retention across our entire platform.

In this series of blog posts, we will offer a technical deep-dive into exactly how Chai implements its Reinforcement Learning from Human Feedback (RLHF) pipeline, directly improving user retention by over 30%.


RLHF Overview


Reinforcement learning from human feedback (RLHF) was introduced in a 2022 paper by OpenAI and has gained popularity with the recent release of ChatGPT. RLHF involves training multiple models at different stages, which typically include pre-training a language model, training a reward model, and fine-tuning the language model with reinforcement learning. This process is depicted in the image below.



Chai’s Internal Reward Modelling via Pseudo Labels


People come to Chai for engaging conversations, storytelling, and role-play, which is quite different from how people use OpenAI’s ChatGPT. Training a reward model with supervised learning requires labelled data. Manual annotations are not only costly; in this case, they are also inaccurate and biased, because judging whether a response is engaging is far more challenging than judging whether it is accurate or helpful.

We therefore propose pseudo labels that can be conveniently extracted from user interactions, where the labels directly mark whether a given response is engaging or not.

As there is most likely no single metric that defines how engaging a response is, we invert the problem and look at what makes a response bad (a minimal extraction sketch follows the list below). Specifically:

  • End of conversation: Did the model send a message which resulted in the end of a conversation?

  • Retry: Did the user regenerate the message from the model after they’ve seen it?

  • Edit: Did the user edit the response? (This typically means the user is very frustrated as they have re-tried the message multiple times)

  • Star rating (this is not a negative signal): How much did the user like the message they’ve seen?
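
As a concrete illustration, the following is a minimal sketch of how such pseudo labels could be pulled out of raw interaction logs. The schema (column names such as `conversation_id`, `was_retried`, `was_edited`) and the cut-off `K` are hypothetical stand-ins for the real logging format, not Chai's actual pipeline.

```python
import pandas as pd

# Hypothetical log schema: one row per model response served (not Chai's real schema).
logs = pd.DataFrame({
    "conversation_id": [1, 1, 1, 2, 2],
    "was_retried":     [False, True, False, False, False],  # user regenerated the reply
    "was_edited":      [False, False, False, True, False],  # user edited the reply
    "star_rating":     [None, None, 4, None, None],         # optional explicit rating
})

K = 1  # a response "ended the conversation" if at most K-1 model replies follow it

# Number of model responses remaining in the conversation after each one.
logs["turns_remaining"] = logs.groupby("conversation_id").cumcount(ascending=False)
logs["ended_conversation"] = logs["turns_remaining"] < K

# Negative pseudo label: any of the "bad" signals fired for this response.
logs["is_bad"] = logs["was_retried"] | logs["was_edited"] | logs["ended_conversation"]
print(logs[["conversation_id", "is_bad"]])
```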



With over 70 million model responses served per day, we found that around 20% of responses were retried, 3% of messages were edited, 1% of messages were rated by the user, and each conversation contains roughly 11 model responses. This means that, each day, we produce:

  • 14 million retry signals

  • 2.1 million edit signals

  • 700K star ratings

  • 6 million end-of-conversation signals

During data collection, we simply sample these signals across different dates and across the different model types served. This resulted in a dataset of 170 million rows, where each row contains the model input/output along with the signal (binary for retry/edit/end-of-conversation).
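
A minimal sketch of this sampling step is shown below, assuming a hypothetical `responses` table that carries `date` and `model_id` columns alongside the pseudo labels from the previous sketch; the per-group sample size is purely illustrative.

```python
import pandas as pd

def sample_training_rows(responses: pd.DataFrame, rows_per_group: int = 10_000) -> pd.DataFrame:
    """Sample uniformly across serving dates and model types (illustrative group size)."""
    return (
        responses
        .groupby(["date", "model_id"], group_keys=False)
        .apply(lambda g: g.sample(n=min(rows_per_group, len(g)), random_state=0))
        .reset_index(drop=True)
    )
```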


Model training


There are several ways to train a reward model. Given that we have multiple signals, one can consider two different approaches:

  1. Multiple reward ensemble: we train one reward model for each signal, and these outputs are then aggregated (e.g. a simple sum) to obtain the final reward value.

  2. Reward with signal ensemble: we train a single model on a joint target constructed from these signals.

We experimented with both methods and found that the multiple reward ensemble yields no significantly better results while adding complexity in production. We therefore chose the second approach.
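
As an illustration of the second approach, one simple joint target is a single binary "engaging" label that is positive only when none of the chosen negative signals fired. The combination below (retry plus end-of-conversation, the pair the results section later identifies as strongest) is a sketch of the idea rather than the exact production recipe.

```python
def joint_target(was_retried: bool, ended_conversation: bool) -> int:
    """1 = engaging (no negative signal fired), 0 = not engaging."""
    return int(not (was_retried or ended_conversation))
```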



Chai Reward Model Architectures / Targets


The reward model architectures we have chosen are GPT-2 and RoBERTa, with parameter counts between 82M and 355M. As text classification is inherently less complex than text generation, we saw no benefit in using extra-large models such as GPT-J 6B for this task.
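
Below is a minimal training sketch using the Hugging Face `transformers` library. The `roberta-base` backbone, sequence length, and optimiser settings are illustrative choices, not Chai's exact configuration.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative backbone; our production models range from roughly 82M to 355M parameters.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def encode(context: str, response: str):
    # The reward model scores a (conversation context, candidate response) pair.
    return tokenizer(context, response, truncation=True, max_length=256,
                     padding="max_length", return_tensors="pt")

def train_step(batch: dict) -> float:
    # batch: input_ids / attention_mask tensors plus binary 'labels' (the joint target).
    out = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["labels"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```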


Best-of-N sampling


After the reward model is trained, we can simply use rejection sampling (best-of-N sampling) in production: for each model input, we generate N responses, concatenate the model input with each response, and run the pairs through the reward model. Finally, we select the response with the highest reward score. In Part II of this blog series, we will showcase how we use reinforcement learning to achieve best-of-N performance with a single model pass.
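
For concreteness, here is a minimal best-of-N sketch, assuming a Hugging Face causal LM as the chat model and a sequence-classification reward model as above; the model names, N, and sampling settings are illustrative.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoModelForSequenceClassification,
                          AutoTokenizer)

gen_tok = AutoTokenizer.from_pretrained("gpt2")                    # illustrative chat model
gen_model = AutoModelForCausalLM.from_pretrained("gpt2")
rm_tok = AutoTokenizer.from_pretrained("roberta-base")             # illustrative reward model
reward_model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

@torch.no_grad()
def best_of_n(context: str, n: int = 4, max_new_tokens: int = 64) -> str:
    # 1) Sample N candidate responses from the chat model.
    inputs = gen_tok(context, return_tensors="pt")
    outputs = gen_model.generate(
        **inputs,
        do_sample=True,
        top_p=0.9,
        max_new_tokens=max_new_tokens,
        num_return_sequences=n,
        pad_token_id=gen_tok.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    candidates = [gen_tok.decode(seq[prompt_len:], skip_special_tokens=True) for seq in outputs]

    # 2) Score each (context, candidate) pair with the reward model.
    enc = rm_tok([context] * n, candidates, truncation=True, padding=True, return_tensors="pt")
    scores = reward_model(**enc).logits.softmax(dim=-1)[:, 1]  # P("engaging")

    # 3) Return the highest-scoring candidate.
    return candidates[int(scores.argmax())]
```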


Results


Our north-star metric is user retention, defined as the percentage of new users who join on day 0 and send at least one message on day 30. As this metric is expensive to obtain (deploying and testing a bad model may hurt the user experience), we use an approximate metric called mean conversation length (MCL), which averages the total number of model responses within each conversation per test group.
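
To make the two definitions concrete, a minimal sketch is given below; the `responses`, `users`, and `messages` tables and their column names are hypothetical.

```python
import pandas as pd

def mean_conversation_length(responses: pd.DataFrame) -> float:
    """MCL: average number of model responses per conversation within a test group."""
    return responses.groupby("conversation_id").size().mean()

def day30_retention(users: pd.DataFrame, messages: pd.DataFrame) -> float:
    """Share of new (day-0) users who send at least one message on day 30."""
    users = users.copy()
    users["day30"] = users["signup_date"] + pd.Timedelta(days=30)
    joined = messages.merge(users, on="user_id")
    retained = joined.loc[joined["sent_date"] == joined["day30"], "user_id"].nunique()
    return retained / users["user_id"].nunique()
```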


Dataset size VS. MCL


In this experiment, we vary the amount of data used during model training, from ~10K examples all the way to ~60M examples. We directly observe the scaling law, i.e. a log-linear relationship between model performance and dataset size. We also observe that tuning the hyper-parameters of the base model yields a further improvement in model performance.
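
The log-linear relationship can be checked with a simple fit, sketched below; the (dataset size, MCL) pairs are placeholders for illustration only, not our measured values.

```python
import numpy as np

# Placeholder (dataset_size, MCL) pairs purely to illustrate the fitting procedure.
sizes = np.array([1e4, 1e5, 1e6, 1e7, 6e7])
mcl = np.array([10.0, 11.2, 12.1, 13.3, 14.0])  # hypothetical values

# Fit MCL ≈ slope * log10(dataset_size) + intercept.
slope, intercept = np.polyfit(np.log10(sizes), mcl, deg=1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```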



Mean conversation length (MCL) improvement over dataset size


Exogenous parameters VS. MCL


There are a couple of key exogenous parameters that can be tuned. These differ from hyper-parameters in that they are not model-specific. We observe that tuning the context-window length gives only a very marginal improvement in performance, as does tuning K, which defines the end-of-conversation label (i.e. the number of remaining responses before a conversation terminates).

There is a log-linear relationship between performance and the number of samples drawn during rejection sampling. Unfortunately, we could not feasibly test best-of-32 and above as they induce too much latency on the platform.



User retention testing


Finally, by selecting the best hyper-parameters, we observe that a joint target of retry and end-of-conversation achieves the best performance, yielding an over 30% improvement against the baseline!



Conclusion



In this work, we focus on developing chat AIs that are highly engaging and entertaining for users. Through extensive A/B tests across large user groups, we demonstrate that training reward models with human feedback, followed by response selection, leads to chat AIs with longer average user interactions and higher user retention. We propose intuitive evaluation metrics, namely mean conversation length and user retention, and investigate a range of pseudo labels for identifying captivating responses, which we use to train a reward model that scores generated responses. Using our best reward model for response selection, we show that the user retention of a GPT-J 6B language model increases by over 30%, highlighting the effectiveness of using natural human-chatbot interactions for developing highly engaging chat AIs.
