Reinforcement Learning in LLMs

In this post, we give an overview of Reinforcement Learning (RL) techniques used to align LLMs, as well as alternative methods that are often compared with them.

PPO

The PPO-based approach is the most well-known RL approach for aligning LLMs. A detailed derivation of PPO and its implementation tricks is given in [2]; their recommended implementation tricks are especially worth calling out.
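To make the objective concrete, here is a minimal sketch of PPO's clipped surrogate loss, assuming per-token log-probabilities and advantage estimates have already been computed; the function and variable names are illustrative and not taken from [2].

# Minimal sketch of the PPO clipped surrogate loss used in RLHF-style training.
# Assumes log-probs under the current and old (sampling) policies, plus advantage
# estimates, are already computed; names here are illustrative.
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # Probability ratio between the updated policy and the policy that sampled the data.
    ratio = torch.exp(logp_new - logp_old)
    # Clipped surrogate objective: take the pessimistic (minimum) of the two terms.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()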

SLiC-HF

SLiC-HF [1] is a technique often compared with RLHF. Its idea is straightforward: given a human preference dataset of triples (x, y^+, y^-), we penalize the unfavored output y^- with a hinge loss:

L(\theta) = \max\left(0, \beta - \log P_\theta(y^+|x) + \log P_\theta(y^-|x)\right)

SLiC-HF eliminates the need to train a reward model, so it greatly simplifies the alignment process compared to PPO-based approaches. A short sketch of the loss above is shown below.
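The following is a minimal sketch of the hinge loss above, assuming the sequence log-probabilities of the preferred and dispreferred responses under the model have already been computed; variable names are illustrative.

# Minimal sketch of the SLiC-HF rank-calibration (hinge) loss.
# logp_pos / logp_neg are sequence log-probs of y^+ / y^- under the model.
import torch

def slic_hinge_loss(logp_pos, logp_neg, beta=1.0):
    # L = max(0, beta - log P(y+|x) + log P(y-|x)), averaged over the batch.
    return torch.clamp(beta - logp_pos + logp_neg, min=0.0).mean()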

 

DPO

In the same vein of eliminating the need to train a separate reward model, Direct Preference Optimization (DPO) [3] proposes to directly fine-tune an LLM policy \pi_\theta(y|x) (with the initial policy denoted \pi_{ref}(y|x)) with the loss:

L_{DPO}(\theta) = -E_{(x, y_w, y_l) \sim D}\left[\log\sigma\left(\beta \log\frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log\frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)\right]

where \sigma is the sigmoid function and D is the preference dataset.

There are many ways to interpret this loss. One intuitive reading is that it bumps up the likelihood of generating the winning response y_w and lowers the likelihood of the losing response y_l, where the preference between the two is modeled with a Bradley-Terry model on the implicit rewards \beta \log \frac{\pi_\theta(y|x)}{\pi_{ref}(y|x)}. A minimal sketch of this loss follows.
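Below is a minimal sketch of the DPO loss from [3], assuming per-sequence log-probs of the winning (y_w) and losing (y_l) responses under both the trainable policy and the frozen reference policy are already computed; names are illustrative.

# Minimal sketch of the DPO loss.
# logp_w / logp_l: log-probs of y_w / y_l under the trainable policy pi_theta.
# ref_logp_w / ref_logp_l: log-probs under the frozen reference policy pi_ref.
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response.
    reward_w = beta * (logp_w - ref_logp_w)
    reward_l = beta * (logp_l - ref_logp_l)
    # Negative log-sigmoid of the reward margin (Bradley-Terry log-likelihood).
    return -F.logsigmoid(reward_w - reward_l).mean()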

 

References

  1. SLiC-HF: Sequence Likelihood Calibration with Human Feedback: https://arxiv.org/abs/2305.10425
  2. Secrets of RLHF in Large Language Models Part I: PPO: https://arxiv.org/abs/2307.04964
  3. Direct Preference Optimization: Your Language Model is Secretly a Reward Model: https://arxiv.org/abs/2305.18290

