Improve reasoning for LLMs

LLMs became the hottest topic of 2023, a year when I did not have much time to cover them. Let's take a deep dive into the topic at the beginning of 2024.

Prompts

Using few-shot prompts to hint to an LLM how to solve problems is the simplest way to improve its reasoning. When you first come across LLMs, it is surprising that prompting can be a methodology for improving reasoning at all, since it looks like nothing more than a game of language manipulation.

Here are prompting techniques I have encountered:

  1. Chain of Thought [2]: every question sent to the LLM is augmented with a few exemplars (termed “few-shot learning by prompting”). Instead of showing only the final answers, the exemplars spell out the reasoning steps one by one, which encourages the LLM to produce a step-by-step answer to the actual question; a minimal sketch of assembling such a prompt appears after this list.
  2. Resprompt [1]: a graph-structured extension of the linear chain of thought. The motivation is that “later stages often depend on the results of several steps earlier, not just the results of the immediately preceding step”.
  3. Step-back prompting [3]: also uses few-shot learning by prompting to improve LLMs’ reasoning. Each exemplar consists of two steps: first ask a generic question about a high-level principle or concept behind the original question; then ground the reasoning in the answer to that first step. The two steps are also called “abstraction-and-reasoning”, reflecting the observation that LLMs find it hard to address very specific questions directly but do better when they derive a high-level concept first.
  4. ReAct prompting [5]: suited to settings where the LLM can perform actions over multiple rounds (e.g., calling a search engine) to acquire external knowledge. ReAct prompting is also a few-shot technique whose exemplars interleave actions with reasoning traces; Appendix C of the paper [5] shows full exemplars of such interleaved traces.
  5. Automatic prompt optimization [8]: the work introduces a search-based optimization process, using beam search over candidate prompts, to iteratively find the prompt that achieves the highest score (based on a given metric function on a given dataset).
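
To make the chain-of-thought idea concrete, here is a minimal sketch of how a few-shot CoT prompt can be assembled. The exemplar is adapted from [2]; the names `EXEMPLARS` and `build_cot_prompt` and the final print call are illustrative choices of mine, not APIs from the paper.

```python
# A minimal sketch of few-shot chain-of-thought prompt construction.
# The exemplar is adapted from [2]; the names below are illustrative.
EXEMPLARS = [
    {
        "question": "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
                    "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "reasoning": "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls. "
                     "5 + 6 = 11.",
        "answer": "11",
    },
]

def build_cot_prompt(question: str) -> str:
    """Prepend exemplars whose answers spell out the reasoning steps, then the real question."""
    parts = [
        f"Q: {ex['question']}\nA: {ex['reasoning']} The answer is {ex['answer']}."
        for ex in EXEMPLARS
    ]
    parts.append(f"Q: {question}\nA:")  # the LLM is nudged to continue with its own reasoning chain
    return "\n\n".join(parts)

# The resulting string would be sent to whatever LLM completion API you use.
print(build_cot_prompt("A bakery sells 4 boxes of muffins with 6 muffins per box. How many muffins in total?"))
```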

Improve the reward model

OpenAI shows a way [4] to improve reward models, which in turn improves an LLM’s reasoning. Typically, an outcome-supervised reward model is trained to output a single scalar for the whole input and output sequence. Instead, [4] collects a dataset of step-by-step solutions to math problems generated by an LLM, with the correctness of each step annotated by human labelers. [4] then trains a “process-supervised reward model” to predict the correctness of each step; taking the product of the per-step correctness probabilities gives the correctness probability of the final answer. [4] compares the process-supervised reward model against the typical outcome-supervised one: when sampling N step-by-step solutions from an LLM and picking the one with the highest reward, the process-supervised reward model solves more problems on average than the outcome-supervised reward model.
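
Below is a minimal sketch of the best-of-N selection described above, assuming a process-supervised reward model that returns one correctness probability per reasoning step. The candidate solutions, probabilities, and function names are illustrative, not taken from [4].

```python
import math

def solution_score(step_probs: list[float]) -> float:
    # Product of per-step correctness probabilities = probability the whole chain is correct.
    return math.prod(step_probs)

def best_of_n(candidates: dict[str, list[float]]) -> str:
    # `candidates` maps each sampled step-by-step solution to the per-step correctness
    # probabilities a process-supervised reward model would assign (toy values below).
    return max(candidates, key=lambda sol: solution_score(candidates[sol]))

candidates = {
    "solution A": [0.95, 0.90, 0.85],  # score ~0.73
    "solution B": [0.99, 0.40, 0.97],  # score ~0.38: one weak step sinks the whole chain
}
print(best_of_n(candidates))  # -> solution A
```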

The West-of-N paper [12] introduces an interesting semi-supervised learning idea: augment human preference data with synthetic preference data to boost reward model performance. When we perform rejection sampling, we obtain sampled responses and their reward scores (from an incumbent reward model). We can pick the two responses with the highest and lowest scores to form a synthetic preference pair, and use these pairs as an additional dataset to retrain the reward model. The results show that the reward model improves under this semi-supervised learning paradigm, “with an effect comparable to the addition of a similar quantity of human preference data”.
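
A small sketch of how such synthetic pairs could be built, assuming an incumbent reward model that scores (prompt, response) pairs; `synthetic_pair` and the toy length-based reward are illustrative stand-ins, not the paper's implementation.

```python
def synthetic_pair(prompt: str, responses: list[str], reward) -> tuple[str, str]:
    # Sort sampled responses by the incumbent reward model's score.
    scored = sorted(responses, key=lambda r: reward(prompt, r))
    # Best-of-N becomes the "chosen" response, worst-of-N the "rejected" one.
    return scored[-1], scored[0]

responses = ["short answer", "a detailed, well-grounded answer", "an off-topic answer"]
chosen, rejected = synthetic_pair("some prompt", responses, reward=lambda p, r: len(r))  # toy reward
# Each (prompt, chosen, rejected) triple is added to the human preference data
# before retraining the reward model.
print(chosen, "|", rejected)
```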

 

Introduce an additional loop

Reinforced Self-Training (ReST) [10] adopts two stages to fine-tune LLMs. The first stage, Grow, samples a batch of data with the current LLM and scores it with a reward model. The second stage, Improve, filters the batch by reward and trains the LLM on the retained samples with the SFT objective. Multiple Grow and Improve iterations can be run to keep improving the LLM’s capability. RAFT [11] is a special case of this scheme in which there is only one Improve iteration after each Grow stage.
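
A schematic sketch of the Grow/Improve loop, under the assumption that sampling, reward scoring, and SFT are provided as callables; `rest_loop` and its parameters are illustrative names, not the paper's code.

```python
def rest_loop(llm, prompts, sample, reward, sft_update,
              grow_iters=3, improve_iters=2, threshold=0.8):
    for _ in range(grow_iters):
        # Grow: sample responses with the current LLM and score them with the reward model.
        scored = [(p, r, reward(p, r)) for p in prompts for r in sample(llm, p)]
        for _ in range(improve_iters):
            # Improve: keep high-reward samples and fine-tune on them with the SFT objective.
            kept = [(p, r) for p, r, s in scored if s >= threshold]
            llm = sft_update(llm, kept)
            threshold += 0.05  # ReST raises the filtering threshold across Improve steps
        # RAFT [11] corresponds to a single Improve pass after each Grow stage.
    return llm
```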

Self-consistency [7] introduces an additional loop that samples multiple solutions from chain-of-thought prompts. The sampled solutions are then marginalized (e.g., by majority vote over their final answers) to obtain the most consistent answer. A common and simple baseline for comparison is to sample the same number of solutions from the LLM and keep the one with the highest decoding probability. Self-consistency beats this baseline by a significant margin, indicating that decoding probabilities are not a good indicator of solution quality [9].
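
A minimal sketch of self-consistency via majority vote over final answers; `sample_solution` and `extract_answer` are hypothetical placeholders for your own decoding and answer-parsing code, not APIs from [7].

```python
from collections import Counter

def self_consistency(prompt, sample_solution, extract_answer, n=10):
    # Sample several chain-of-thought solutions and parse out each final answer.
    answers = [extract_answer(sample_solution(prompt)) for _ in range(n)]
    # The most frequent final answer wins the majority vote.
    return Counter(answers).most_common(1)[0][0]
```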

Reflexion [6] is a more complex example of introducing a loop to improve an LLM over time. An LLM-based actor outputs answers given prompts; an evaluator judges the actor’s answer (the evaluator can be a verification function for logic tasks or an LLM for NLP tasks); and an LLM-based self-reflection component analyzes how the actor’s answer led to that evaluation result, then stores the useful lessons in a long-term memory for the actor’s future attempts.
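
A schematic sketch of this actor/evaluator/self-reflection loop, with all three components passed in as hypothetical callables; this is a sketch of the idea, not the paper's implementation.

```python
def reflexion_loop(task, actor, evaluator, reflect, max_trials=3):
    memory: list[str] = []  # long-term memory of verbal lessons
    answer = None
    for _ in range(max_trials):
        answer = actor(task, memory)            # actor answers, conditioned on past lessons
        success, feedback = evaluator(task, answer)
        if success:
            break
        # Turn the evaluation signal into a natural-language lesson for future attempts.
        memory.append(reflect(task, answer, feedback))
    return answer
```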

 

References

[1] Resprompt: Residual Connection Prompting Advances Multi-Step Reasoning in Large Language Models: https://arxiv.org/abs/2310.04743

[2] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models: https://arxiv.org/abs/2201.11903

[3] Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models: https://arxiv.org/abs/2310.06117

[4] Let’s Verify Step by Step: https://arxiv.org/abs/2305.20050

[5] ReAct: Synergizing Reasoning and Acting in Language Models: https://arxiv.org/abs/2210.03629

[6] Reflexion: Language Agents with Verbal Reinforcement Learning: https://arxiv.org/abs/2303.11366

[7] Self-Consistency Improves Chain of Thought Reasoning in Language Models: https://arxiv.org/abs/2203.11171

[8] Automatic Prompt Optimization with “Gradient Descent” and Beam Search: https://arxiv.org/abs/2305.03495

[9] Calibrating Sequence Likelihood Improves Conditional Language Generation: https://arxiv.org/abs/2210.00045

[10] Reinforced Self-Training (ReST) for Language Modeling: https://arxiv.org/abs/2308.08998

[11] RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment: https://arxiv.org/abs/2304.06767

[12] West-of-N: Synthetic Preference Generation for Improved Reward Modeling: https://arxiv.org/abs/2401.12086

 
