Overview: This paper proposes Critical Plan Step Learning (CPL), which searches within the action space over high-level abstract plans to enhance the generalization of large language models (LLMs). The work targets the limitations of existing reinforcement learning (RL) approaches for large-scale language models. CPL uses Monte Carlo Tree Search (MCTS) to explore diverse plan steps and Step-level Advantage Preference Optimization (Step-APO) to learn critical plan steps, improving performance on multi-step reasoning tasks.
Intended audience: researchers and engineers interested in deep learning, reinforcement learning, and natural language processing.
Use cases and goals: CPL shows clear advantages on multi-step reasoning tasks in mathematics, coding, and other domains, and is especially suited to problems requiring complex reasoning paths. Experimental results show that CPL not only improves performance on in-domain tasks but also achieves notable gains on a range of out-of-domain benchmarks, validating its generalization ability and practical applicability.
Additional notes: The paper also compares against other solution-based learning methods, such as vanilla DPO and solution-level Step-DPO, further demonstrating the distinct advantage of high-level plans in improving model performance. The authors also discuss future directions, such as extending the planning policy and incorporating test-time search.
CPL: Critical Plan Step Learning Boosts
LLM Generalization in Reasoning Tasks
Tianlong Wang 1,2∗   Junzhe Chen 1,3∗   Xueting Han 1†   Jing Bai 1
1 Microsoft Research Asia   2 Peking University   3 Tsinghua University
Abstract
Post-training, particularly reinforcement learning (RL) using self-play-generated data, has become a new learning paradigm for large language models (LLMs). However, scaling RL to develop a general reasoner remains a research challenge, as existing methods focus on task-specific reasoning without adequately addressing generalization across a broader range of tasks. Moreover, unlike traditional RL with a limited action space, LLMs operate in an infinite space, making it crucial to search for valuable and diverse strategies to solve problems effectively. To address this, we propose searching within the action space on high-level abstract plans to enhance model generalization and introduce Critical Plan Step Learning (CPL), comprising: 1) searching on plan, using Monte Carlo Tree Search (MCTS) to explore diverse plan steps in multi-step reasoning tasks, and 2) learning critical plan steps through Step-level Advantage Preference Optimization (Step-APO), which integrates advantage estimates for step preferences obtained via MCTS into Direct Preference Optimization (DPO). This combination helps the model effectively learn critical plan steps, enhancing both reasoning capabilities and generalization. Experimental results demonstrate that our method, trained exclusively on GSM8K and MATH, not only significantly improves performance on GSM8K (+10.5%) and MATH (+6.5%), but also enhances out-of-domain reasoning benchmarks, such as HumanEval (+12.2%), GPQA (+8.6%), ARC-C (+4.0%), MMLU-STEM (+2.2%), and BBH (+1.8%).
1 Introduction
Large language models (LLMs) have achieved significant success through scaling, particularly
in pre-training on vast datasets (OpenAI et al., 2024; Dubey et al., 2024). Recently, there has
been an increasing focus on scaling post-training, especially through reinforcement learning
(RL) on self-play-generated data, which has emerged as a new learning paradigm for LLMs.
Notably, OpenAI’s o1 (OpenAI, 2024) has consistently improved its reasoning abilities
through large-scale RL, which teaches the model to think more productively. Additionally,
recent research works (Xie et al., 2024; Feng et al., 2023; Chen et al., 2024) leverage Monte
Carlo Tree Search (MCTS) (Kocsis & Szepesvári, 2006) to iteratively collect preference data.
RL on this self-generated data facilitates iterative self-improvement, leading to significantly
enhanced reasoning capabilities.
However, scaling RL to develop a general reasoner remains a research challenge. Traditional
RL methods, such as AlphaGo (Silver et al., 2016), struggle with generalization due to
their specific action spaces. Existing approaches for LLMs primarily focus on enhancing
task-specific or domain-specific reasoning capabilities, such as in mathematics or coding.
While this has led to significant improvements in these specific tasks, it has not adequately
addressed the model’s generalization abilities across various reasoning tasks. Furthermore,
unlike traditional RL, which operates in a limited action space, LLMs function within a
vast search space. This expansive scope, combined with the high inference latency of LLMs,
limits both the diversity and quality of explored reasoning paths.
∗ Work done during an internship at Microsoft Research Asia.
† Corresponding author: chrihan@microsoft.com
Figure 1: Overview of CPL results. (a) In-Domain Performance: Our CPL-trained model significantly
outperforms the DeepSeekMath-7B-Base on in-domain tasks. (b) Out-of-Domain Performance: Our
CPL method also shows strong generalization, outperforming the baseline model on out-of-domain
reasoning tasks.
To enhance generalization in reasoning tasks, we propose searching within the action space
on high-level abstract plans, rather than focusing on task-specific action spaces that often
limit generalization. Building on previous work (Wang et al., 2023; Yao et al., 2023; Hao
et al., 2023) that uses LLMs to generate both reasoning plans and task-specific solutions
to boost reasoning capabilities, we argue that task-specific solutions—like mathematical
formulas, code or symbolic solutions—are closely tied to task-specific skills. In contrast,
plans represent abstract thinking for problem-solving, such as determining which knowledge
to apply or how to break down a problem, helping models develop broader, task-agnostic
abilities that improve generalization (illustrated in Figure 2 Left).
Furthermore, given the challenge of a vast search space of reasoning paths, we propose that
maintaining diversity and identifying critical paths are essential for solving complex problems.
Plan-based search enables better exploration of high-level strategies and can achieve better
diversity, whereas solution-based search may limit diversity, as different solutions may
share the same underlying thought. Besides, existing preference-learning methods, such as
Direct Preference Optimization (DPO) (Rafailov et al., 2023), face challenges in learning
critical steps due to their inability to capture fine-grained supervision. Recent works propose
Step-level DPO (Setlur et al., 2024; Lai et al., 2024) to learn step-level preferences, but
their reliance on heuristics, such as marking the first error step as dispreferred, limits full
exploration of the search space and model optimization. To address this, we propose a
method to identify and learn critical plan steps that provide higher advantages for improving
the model’s reasoning ability (illustrated in Figure 2 Right).
Thus, we introduce Critical Plan Step Learning (CPL), which consists of two key components:
1. Searching on plan: specifically, we devise a step-by-step plan to solve the problem, with
the final step providing the full solution based on the plan. We use MCTS to explore diverse
plan steps in multi-step reasoning tasks, creating a plan tree in which high-quality plan step
preferences are derived from the final outcome. This process enables the exploration of
high-level strategies, helping the model acquire task-agnostic skills and improve generalization
across different tasks.
2. Learning critical plan steps through Step-level Advantage Preference Optimization (Step-
APO), which builds upon DPO. Step-APO integrates advantage estimates for step-level
preference pairs obtained via MCTS, enabling the model to learn fine-grained preferences
between steps, identify critical plan steps, and de-emphasize erroneous ones.
To conclude, our contributions are: 1) We explore the scaling problem in RL and propose
searching within the action space on high-level abstract plans to enhance model generalization,
rather than focusing on task-specific action spaces that often limit generalization. 2) We
introduce CPL, a novel approach that leverages MCTS to explore diverse plan steps,
distinguishing it from existing methods that focus on exploring solutions, and uses our
Step-APO to learn step-level plan preferences, thereby helping the model effectively identify and
learn critical steps. 3) Extensive experiments show that CPL enhances reasoning capabilities
and generalization across tasks, achieving significant improvements in both in-domain and
out-of-domain tasks, as shown in Figure 1.

Figure 2: Illustration of CPL. Left: Plans represent abstract thinking for problem-solving, which
allows for better generalization, whereas task-specific solutions often limit it. Right: CPL searches
within the action space on high-level abstract plans using MCTS and obtains advantage estimates
for step-level preferences. CPL can then identify and learn critical steps that provide a distinct
advantage over others.
2 Methods
In this section, we introduce our Critical Plan Step Learning (CPL), which boosts model
performance via an iterative process of plan-based search and step-level preference learning.
We first introduce our plan-based MCTS, which enables the LLM to explore diverse plan
strategies in the vast search space. Next, we present our Step-APO in detail to further
explore the potential of step-level preference learning in multi-step reasoning tasks. Finally,
we describe how we iteratively optimize the policy model and the value model.
2.1 Plan-based MCTS
MCTS builds a reasoning tree iteratively and autonomously explores step-level reasoning
traces, which can be used to optimize LLMs. Existing methods (Chen et al., 2024; Xie et al.,
2024) that leverage MCTS to collect data for training usually focus on exploring solution
steps within the entire search space or on simultaneously exploring both plans and solutions.
To improve transfer performance across a broader range of reasoning tasks, we propose
learning high-level abstract plans, which enables the model to acquire more task-agnostic
capabilities and thereby achieve better generalization. We first create a step-by-step plan to
solve the problem, with the final step presenting the full solution and final answer based on
the plan. The prompt is provided in Appendix A. Ultimately, we obtain a plan tree and
high-quality plan step supervision through iterative search with MCTS.
Specifically, given the plan tree $\mathcal{T}$, each node represents a state $s_t$, and each edge represents
an action $a_t$, which corresponds to a reasoning step that leads to the next state $s_{t+1}$. Under
the same parent node, different sibling nodes form a set of step-level preference pairs, with
each node having its own value $V(s_t)$ representing the expected future reward under state $s_t$.
These values can be obtained through the MCTS process, which involves four key operations:
selection, expansion, evaluation, and backup. To enhance efficiency, we use a value model to
assess the expected returns from the partial reasoning paths, with the final integration of
both policy and value models guiding the search process. Next, we describe the four steps of
MCTS.
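For concreteness, before walking through the four operations, the following is a minimal Python sketch of how such a plan-tree node might be represented. The class and field names (PlanNode, visit_count, q_value, prior) are our own illustrative choices, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PlanNode:
    """One node of the plan tree: the state s_t reached after a sequence of plan steps."""
    state_text: str                       # problem statement plus the plan steps generated so far
    parent: Optional["PlanNode"] = None
    action_text: str = ""                 # the plan step (edge a_t) that led here from the parent
    children: list["PlanNode"] = field(default_factory=list)
    visit_count: int = 0                  # N(s_t)
    value: float = 0.0                    # V(s_t), expected future reward under state s_t
    q_value: float = 0.0                  # Q(parent_state, a_t)
    prior: float = 0.0                    # pi_theta(a_t | parent_state), policy prior used by PUCT
    is_terminal: bool = False             # final step, i.e. the full solution and answer
```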
Selection: We use the PUCT algorithm (Rosin, 2011) to guide the selection process with
the following formula, where $N$ represents the visit count:
\[
\arg\max_{a_t} \left[ Q(s_t, a_t) + c_{\mathrm{puct}} \, \pi_\theta(a_t \mid s_t) \, \frac{\sqrt{N(s_t)}}{1 + N(s_t, a_t)} \right]. \tag{1}
\]
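As a sketch, the selection rule in Eq. (1) could be implemented as follows, reusing the hypothetical PlanNode structure above; the value of c_puct is a tuning constant whose setting is not specified in this excerpt.

```python
import math

def select_child(node: PlanNode, c_puct: float = 1.0) -> PlanNode:
    """Pick the child maximizing Q(s_t, a_t) + c_puct * prior * sqrt(N(s_t)) / (1 + N(s_t, a_t))."""
    def puct_score(child: PlanNode) -> float:
        exploration = c_puct * child.prior * math.sqrt(node.visit_count) / (1 + child.visit_count)
        return child.q_value + exploration
    return max(node.children, key=puct_score)
```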
Expansion and Evaluation: During expansion, we sample multiple possible candidate
actions for the next step. During evaluation, the terminal node is assessed by comparing
the final answer with the ground truth, while the values of other nodes are predicted by the
value model.
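A rough sketch of these two operations is given below, again using the hypothetical PlanNode. The policy_model.sample_steps and value_model.predict interfaces, the "<solution>" terminal-step heuristic, and the extract_answer helper are all placeholder assumptions, not the paper's actual prompt or evaluation code.

```python
import re

def is_final_step(step_text: str) -> bool:
    # Placeholder heuristic: treat a step containing the full solution block as terminal.
    return "<solution>" in step_text

def extract_answer(text: str) -> str:
    # Placeholder: take the last number in the text as the predicted final answer.
    numbers = re.findall(r"-?\d+\.?\d*", text)
    return numbers[-1] if numbers else ""

def expand_and_evaluate(node: PlanNode, policy_model, value_model,
                        ground_truth: str, num_candidates: int = 4) -> float:
    """Expansion: sample candidate next plan steps. Evaluation: return a value for backup."""
    # Expansion: sample several candidate next actions and attach them as children.
    for step_text, prior in policy_model.sample_steps(node.state_text, n=num_candidates):
        node.children.append(PlanNode(
            state_text=node.state_text + "\n" + step_text,
            parent=node, action_text=step_text, prior=prior,
            is_terminal=is_final_step(step_text)))
    # Evaluation: terminal nodes are checked against the ground truth; others use the value model.
    if node.is_terminal:
        return 1.0 if extract_answer(node.state_text) == ground_truth else -1.0
    return value_model.predict(node.state_text)
```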
Backup: Once a terminal node is reached, we perform a bottom-up update from the terminal
node back to the root. We update the visit count $N$, the state value $V$, and the transition
value $Q$ as follows:
\[
Q(s_t, a_t) \leftarrow r(s_t, a_t) + V(s_{t+1}), \tag{2}
\]
\[
V(s_t) \leftarrow \frac{\sum_{a} N(s_{t+1})\, Q(s_t, a)}{\sum_{a} N(s_{t+1})}, \tag{3}
\]
\[
N(s_t) \leftarrow N(s_t) + 1. \tag{4}
\]
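A minimal sketch of this backup pass, following Eqs. (2)-(4) with the hypothetical PlanNode above; as a simplification we assume intermediate rewards r(s_t, a_t) are zero, so only the evaluated leaf contributes a reward (the answer check for terminal nodes, or the value model's prediction otherwise).

```python
def backup(leaf: PlanNode, leaf_value: float) -> None:
    """Propagate the evaluation of a leaf node back to the root, per Eqs. (2)-(4)."""
    leaf.value = leaf_value
    node = leaf
    while node.parent is not None:
        parent = node.parent
        # Eq. (2): Q(s_t, a_t) <- r(s_t, a_t) + V(s_{t+1}); intermediate rewards are taken as 0 here.
        node.q_value = node.value
        node.visit_count += 1                        # Eq. (4) for this node
        # Eq. (3): V(s_t) is the visit-weighted average of the children's Q-values.
        total_visits = sum(c.visit_count for c in parent.children)
        parent.value = sum(c.visit_count * c.q_value for c in parent.children) / max(total_visits, 1)
        node = parent
    node.visit_count += 1                            # Eq. (4) for the root
```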
2.2 Step-APO to Learn Critical Plan Steps
Mainstream approaches (Hwang et al., 2024; Lai et al., 2024) learn step-level preferences
by identifying the first error step and sampling a corresponding preferred step. While this
can yield more accurate preferences, it lacks sufficient exploration of the vast space of
reasoning traces. Given the large variation in advantage differences across data pairs, we
propose Step-APO, which introduces advantage estimates for preference pairs into DPO.
This enables the model to more effectively learn critical intermediate plan steps, thereby
further improving its reasoning capabilities. Next, we provide the derivation of Step-APO
and analyze it from the perspective of its gradient.
2.2.1 Preliminaries
The Classical RL Objective. RLHF approaches (Ziegler et al., 2020; Bai et al., 2022;
Ouyang et al., 2022) usually first learn a reward function from human feedback, then
optimize it with a policy gradient-based method like PPO (Schulman et al., 2017) with an
entropy bonus, using the following multi-step RL objective:
\[
\max_{\pi_\theta} \; \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)} \Bigg[ \sum_{t=0}^{T} \Big( r(s_t, a_t) + \underbrace{\beta \log \pi_{\mathrm{ref}}(a_t \mid s_t)}_{\text{KL penalty}} \Big) + \beta \mathcal{H}(\pi_\theta) \;\Big|\; s_0 \sim \rho(s_0) \Bigg], \tag{5}
\]
where $r(s_t, a_t)$ denotes the step-level reward function, followed by a KL penalty that aims
to ensure the learned policy $\pi_\theta$ does not deviate significantly from the reference policy $\pi_{\mathrm{ref}}$;
$\pi_{\mathrm{ref}}$ is typically produced via supervised fine-tuning.
Direct Preference Optimization. DPO (Rafailov et al., 2023) uses the well-known closed-form
optimal solution, which establishes a mapping between the reward model and the
optimal policy under the KL divergence, obtaining the reward as:
\[
r(x, y) = \beta \log \pi^{*}(y \mid x) - \beta \log \pi_{\mathrm{ref}}(y \mid x) - Z(x), \tag{6}
\]
where $x$ denotes the prompt, $y$ denotes the response, $\pi^{*}$ is the optimal policy, and $Z(x)$
is the partition function that normalizes it. Substituting Eq. (6) into the Bradley-Terry
preference model and leveraging the maximum likelihood objective, DPO derives the loss:
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right], \tag{7}
\]
where $\sigma$ denotes the logistic function, and $y_w$ and $y_l$ denote the preferred and dispreferred
responses to the prompt $x$.
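For reference, a minimal PyTorch-style sketch of the vanilla DPO loss in Eq. (7) is given below; it assumes per-response summed token log-probabilities have already been computed, and the function name and signature are our own illustration rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Vanilla DPO loss of Eq. (7).

    Each argument is a batch of summed token log-probabilities log pi(y | x) for the
    preferred (chosen) or dispreferred (rejected) response, under the policy being
    trained or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi_theta(y_w|x)/pi_ref(y_w|x)
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi_theta(y_l|x)/pi_ref(y_l|x)
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```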