Reinforcement Learning in LLMs

In this post, we give an overview of Reinforcement Learning (RL) techniques used to align LLMs, as well as alternative methods that are often compared with them.

PPO

The PPO-based approach is the most well-known RL approach for aligning LLMs. A detailed derivation of PPO and its implementation tricks is given in [2]; their recommended implementation tricks are especially worth calling out.
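To make the objective concrete, here is a minimal sketch of PPO's clipped surrogate loss, assuming per-token log-probabilities and advantage estimates have already been computed; the function and variable names are illustrative and not taken from [2].

# Minimal sketch of the PPO clipped surrogate loss used in RLHF-style training.
# Assumes log-probs under the current and old (sampling) policies, plus advantage
# estimates, are already computed; names here are illustrative.
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # Probability ratio between the updated policy and the policy that sampled the data.
    ratio = torch.exp(logp_new - logp_old)
    # Clipped surrogate objective: take the pessimistic (minimum) of the two terms.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()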

SLiC-HF

SLiC-HF [1] is a technique often compared with RLHF. Its idea is straightforward: given a human preference dataset of triples (x, y^+, y^-), we penalize the unfavored output y^- with a hinge loss:

L(\theta) = \max\left(0, \beta - \log P_\theta(y^+|x) + \log P_\theta(y^-|x)\right)

SLiC-HF eliminates the need to train a reward model, so it greatly simplifies the alignment process compared to PPO-based approaches. A short sketch of the loss above is shown below.
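The following is a minimal sketch of the hinge loss above, assuming the sequence log-probabilities of the preferred and dispreferred responses under the model have already been computed; variable names are illustrative.

# Minimal sketch of the SLiC-HF rank-calibration (hinge) loss.
# logp_pos / logp_neg are sequence log-probs of y^+ / y^- under the model.
import torch

def slic_hinge_loss(logp_pos, logp_neg, beta=1.0):
    # L = max(0, beta - log P(y+|x) + log P(y-|x)), averaged over the batch.
    return torch.clamp(beta - logp_pos + logp_neg, min=0.0).mean()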

 

DPO

In the same vein of eliminating the need to train a separate reward model, Direct Preference Optimization (DPO) [3] proposes to directly fine-tune an LLM policy \pi_\theta(y|x) (with the initial policy denoted \pi_{ref}(y|x)) with the loss:

L_{DPO}(\theta) = -E_{(x, y_w, y_l) \sim D}\left[\log\sigma\left(\beta \log\frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log\frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)\right]

where \sigma is the sigmoid function and D is the preference dataset.

There are many ways to interpret this loss. One intuitive reading is that it bumps up the likelihood of generating the winning response y_w and lowers the likelihood of the losing response y_l, where the preference between the two is modeled with a Bradley-Terry model on the implicit rewards \beta \log \frac{\pi_\theta(y|x)}{\pi_{ref}(y|x)}. A minimal sketch of this loss follows.
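Below is a minimal sketch of the DPO loss from [3], assuming per-sequence log-probs of the winning (y_w) and losing (y_l) responses under both the trainable policy and the frozen reference policy are already computed; names are illustrative.

# Minimal sketch of the DPO loss.
# logp_w / logp_l: log-probs of y_w / y_l under the trainable policy pi_theta.
# ref_logp_w / ref_logp_l: log-probs under the frozen reference policy pi_ref.
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response.
    reward_w = beta * (logp_w - ref_logp_w)
    reward_l = beta * (logp_l - ref_logp_l)
    # Negative log-sigmoid of the reward margin (Bradley-Terry log-likelihood).
    return -F.logsigmoid(reward_w - reward_l).mean()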

 

References

  1. SLiC-HF: Sequence Likelihood Calibration with Human Feedback: https://arxiv.org/abs/2305.10425
  2. Secrets of RLHF in Large Language Models Part I: PPO: https://arxiv.org/abs/2307.04964
  3. Direct Preference Optimization: Your Language Model is Secretly a Reward Model: https://arxiv.org/abs/2305.18290

