In this post, we give an overview of Reinforcement Learning (RL) techniques used to align LLMs, as well as alternative techniques that are often compared with them.
PPO
The PPO-based approach is the best-known RL approach for aligning LLMs. A detailed derivation of PPO, along with practical implementation tricks, is given in [2]; their recommended implementation tricks are especially worth calling out.
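To make the objective concrete, here is a minimal sketch of PPO's clipped surrogate loss applied to per-token log-probabilities from an LLM policy. The function and tensor names are illustrative (not from [2]), and a full RLHF pipeline additionally needs a reward model, a value head for advantage estimation, and a KL penalty against the reference policy.

```python
import torch

def ppo_clipped_loss(logprobs_new, logprobs_old, advantages, clip_eps=0.2):
    """Per-token PPO clipped surrogate loss (to be minimized).

    logprobs_new: log pi_theta(a_t | s_t) under the current policy
    logprobs_old: log pi_old(a_t | s_t) from the rollout policy (detached)
    advantages:   advantage estimates, e.g. from GAE using a value head
    """
    ratio = torch.exp(logprobs_new - logprobs_old)  # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Pessimistic (minimum) objective; negate so that minimizing the loss
    # maximizes the clipped surrogate objective.
    return -torch.min(unclipped, clipped).mean()
```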
SLiC-HF
SLiC-HF [1] is a technique often compared with RLHF. Its idea is straightforward: given a human preference dataset $\mathcal{D} = \{(x, y_w, y_l)\}$, where $y_w$ is the preferred response and $y_l$ the unfavored one, we penalize the unfavored output with a hinge loss:

$$\mathcal{L}_{\text{cal}}(\theta) = \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\max\left(0,\; \delta - \log P_\theta(y_w \mid x) + \log P_\theta(y_l \mid x)\right)\right]$$
SLiC-HF eliminates the need to train a reward model, which greatly simplifies the alignment process compared to PPO-based approaches.
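As a rough sketch (function and variable names are ours, not the paper's), the hinge loss can be computed directly from per-sequence log-likelihoods; note that the paper also adds a regularization term, which is omitted here.

```python
import torch

def slic_hinge_loss(logp_win, logp_lose, delta=1.0):
    """Calibration hinge loss on sequence log-likelihoods.

    logp_win:  log P_theta(y_w | x) for the preferred (winning) responses
    logp_lose: log P_theta(y_l | x) for the unfavored (losing) responses
    delta:     margin hyperparameter
    """
    # Penalize pairs where the winning response does not beat the
    # losing one by at least the margin delta in log-likelihood.
    return torch.clamp(delta - logp_win + logp_lose, min=0).mean()
```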
DPO
In the same vein of eliminating the need to train a separate reward model, Direct Preference Optimization (DPO) [3] proposes directly fine-tuning the LLM policy $\pi_\theta$ (with the initial, frozen policy denoted as $\pi_{\text{ref}}$) with the loss:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$
There are many ways to interpret this loss. One intuitive interpretation is that it increases the likelihood of generating the winning response and lowers the likelihood of the losing response, with the preference probability modeled by a Bradley-Terry model over the policy's implicit rewards.
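As a sketch (names are illustrative), the DPO loss can be computed from per-sequence log-likelihoods under the policy being trained and the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from per-sequence log-likelihoods of preference pairs.

    policy_logp_*: log pi_theta(y | x) under the policy being fine-tuned
    ref_logp_*:    log pi_ref(y | x) under the frozen reference policy
    beta:          strength of the implicit KL constraint
    """
    # Implicit rewards are beta-scaled log-ratios against the reference policy.
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # Negative log-sigmoid of the reward margin (Bradley-Terry log-likelihood).
    return -F.logsigmoid(logits).mean()
```

Because the reference log-likelihoods can be precomputed, a DPO training step looks much like ordinary supervised fine-tuning on preference pairs.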
References
- [1] SLiC-HF: Sequence Likelihood Calibration with Human Feedback: https://arxiv.org/abs/2305.10425
- [2] Secrets of RLHF in Large Language Models Part I: PPO: https://arxiv.org/abs/2307.04964
- [3] Direct Preference Optimization: Your Language Model is Secretly a Reward Model: https://arxiv.org/abs/2305.18290