Overview: This paper proposes Critical Plan Step Learning (CPL), which searches within the action space over high-level abstract plans to enhance the generalization of large language models (LLMs). The work targets the limitations of existing reinforcement learning (RL) approaches for large-scale language models. CPL uses Monte Carlo Tree Search (MCTS) to explore diverse plan steps and Step-level Advantage Preference Optimization (Step-APO) to learn critical plan steps, improving performance on multi-step reasoning tasks.
Intended audience: researchers and engineers interested in deep learning, reinforcement learning, and natural language processing.
Use cases and goals: CPL shows clear advantages on multi-step reasoning tasks in mathematics, coding, and other domains, and is especially suited to problems requiring complex reasoning paths. Experimental results show that CPL not only improves performance on in-domain tasks but also achieves notable gains on a range of out-of-domain benchmarks, validating its generalization ability and practical applicability.
Additional notes: The paper also compares against other solution-based learning methods, such as vanilla DPO and solution-level Step-DPO, further demonstrating the distinct advantage of high-level plans in improving model performance. The authors also discuss future directions, such as extending the planning policy and incorporating test-time search.
CPL: Critical Plan Step Learning Boosts
LLM Generalization in Reasoning Tasks
Tianlong Wang 1,2∗   Junzhe Chen 1,3∗   Xueting Han 1†   Jing Bai 1
1 Microsoft Research Asia   2 Peking University   3 Tsinghua University
Abstract
Post-training, particularly reinforcement learning (RL) using self-play-generated data, has become a new learning paradigm for large language models (LLMs). However, scaling RL to develop a general reasoner remains a research challenge, as existing methods focus on task-specific reasoning without adequately addressing generalization across a broader range of tasks. Moreover, unlike traditional RL with a limited action space, LLMs operate in an infinite space, making it crucial to search for valuable and diverse strategies to solve problems effectively. To address this, we propose searching within the action space on high-level abstract plans to enhance model generalization and introduce Critical Plan Step Learning (CPL), comprising: 1) searching on plan, using Monte Carlo Tree Search (MCTS) to explore diverse plan steps in multi-step reasoning tasks, and 2) learning critical plan steps through Step-level Advantage Preference Optimization (Step-APO), which integrates advantage estimates for step preferences obtained via MCTS into Direct Preference Optimization (DPO). This combination helps the model effectively learn critical plan steps, enhancing both reasoning capabilities and generalization. Experimental results demonstrate that our method, trained exclusively on GSM8K and MATH, not only significantly improves performance on GSM8K (+10.5%) and MATH (+6.5%), but also enhances out-of-domain reasoning benchmarks, such as HumanEval (+12.2%), GPQA (+8.6%), ARC-C (+4.0%), MMLU-STEM (+2.2%), and BBH (+1.8%).
1 Introduction
Large language models (LLMs) have achieved significant success through scaling, particularly
in pre-training on vast datasets (OpenAI et al., 2024; Dubey et al., 2024). Recently, there has
been an increasing focus on scaling post-training, especially through reinforcement learning
(RL) on self-play-generated data, which has emerged as a new learning paradigm for LLMs.
Notably, OpenAI’s o1 (OpenAI, 2024) has consistently improved its reasoning abilities
through large-scale RL, which teaches the model to think more productively. Additionally,
recent research works (Xie et al., 2024; Feng et al., 2023; Chen et al., 2024) leverage Monte
Carlo Tree Search (MCTS) (Kocsis & Szepesvári, 2006) to iteratively collect preference data.
RL on this self-generated data facilitates iterative self-improvement, leading to significantly
enhanced reasoning capabilities.
However, scaling RL to develop a general reasoner remains a research challenge. Traditional
RL methods, such as AlphaGo (Silver et al., 2016), struggle with generalization due to
their specific action spaces. Existing approaches for LLMs primarily focus on enhancing
task-specific or domain-specific reasoning capabilities, such as in mathematics or coding.
While this has led to significant improvements in these specific tasks, it has not adequately
addressed the model’s generalization abilities across various reasoning tasks. Furthermore,
unlike traditional RL, which operates in a limited action space, LLMs function within a
vast search space. This expansive scope, combined with the high inference latency of LLMs,
limits both the diversity and quality of explored reasoning paths.
∗ Work done during an internship at Microsoft Research Asia.
† Corresponding author: chrihan@microsoft.com
Figure 1: Overview of CPL results. (a) In-Domain Performance: Our CPL-trained model significantly
outperforms the DeepSeekMath-7B-Base on in-domain tasks. (b) Out-of-Domain Performance: Our
CPL method also shows strong generalization, outperforming the baseline model on out-of-domain
reasoning tasks.
To enhance generalization in reasoning tasks, we propose searching within the action space
on high-level abstract plans, rather than focusing on task-specific action spaces that often
limit generalization. Building on previous work (Wang et al., 2023; Yao et al., 2023; Hao
et al., 2023) that uses LLMs to generate both reasoning plans and task-specific solutions
to boost reasoning capabilities, we argue that task-specific solutions—like mathematical
formulas, code or symbolic solutions—are closely tied to task-specific skills. In contrast,
plans represent abstract thinking for problem-solving, such as determining which knowledge
to apply or how to break down a problem, helping models develop broader, task-agnostic
abilities that improve generalization (illustrated in Figure 2 Left).
Furthermore, given the challenge of a vast search space of reasoning paths, we propose that
maintaining diversity and identifying critical paths are essential for solving complex problems.
Plan-based search enables better exploration of high-level strategies and can achieve better
diversity, whereas solution-based search may limit diversity, as different solutions may
share the same underlying thought. Besides, existing preference-learning methods, such as
Direct Preference Optimization (DPO) (Rafailov et al., 2023), face challenges in learning
critical steps due to their inability to capture fine-grained supervision. Recent works propose
Step-level DPO (Setlur et al., 2024; Lai et al., 2024) to learn step-level preferences, but
their reliance on heuristics, such as marking the first error step as dispreferred, limits full
exploration of the search space and model optimization. To address this, we propose a
method to identify and learn critical plan steps that provide higher advantages for improving
the model’s reasoning ability (illustrated in Figure 2 Right).
Thus, we introduce Critical Plan Step Learning (CPL), which consists of two key components:
1. Searching on plan: specifically, we devise a step-by-step plan to solve the problem, with
the final step providing the full solution based on the plan. We use MCTS to explore diverse
plan steps in multi-step reasoning tasks, creating a plan tree in which high-quality plan step
preferences are derived from the final outcome. This process enables the exploration of
high-level strategies, helping the model acquire task-agnostic skills and improve generalization
across different tasks.
2. Learning critical plan steps through Step-level Advantage Preference Optimization (Step-
APO), which builds upon DPO. Step-APO integrates advantage estimates for step-level
preference pairs obtained via MCTS, enabling the model to learn fine-grained preferences
between steps, identify critical plan steps, and de-emphasize erroneous ones.
To conclude, our contributions are: 1) We explore the scaling problem in RL and propose
searching within the action space on high-level abstract plans to enhance model generalization,
rather than focusing on task-specific action spaces that often limit generalization. 2) We
introduce CPL, a novel approach that leverages MCTS to explore diverse plan steps,
distinguishing it from existing methods that focus on exploring solutions, and uses our
Step-APO to learn step-level plan preferences, thereby helping the model effectively identify and
learn critical steps. 3) Extensive experiments show that CPL enhances reasoning capabilities
and generalization across tasks, achieving significant improvements in both in-domain and
out-of-domain tasks, as shown in Figure 1.

Figure 2: Illustration of CPL. Left: Plans represent abstract thinking for problem-solving, which
allows for better generalization, whereas task-specific solutions often limit it. Right: CPL searches
within the action space on high-level abstract plans using MCTS and obtains advantage estimates
for step-level preferences. CPL can then identify and learn critical steps that provide a distinct
advantage over others.
2 Methods
In this section, we introduce our Critical Plan Step Learning (CPL), which boosts model
performance via an iterative process of plan-based search and step-level preference learning.
We first introduce our plan-based MCTS, which enables the LLM to explore diverse plan
strategies in the vast search space. Next, we present our Step-APO in detail to further
explore the potential of step-level preference learning in multi-step reasoning tasks. Finally,
we describe how we iteratively optimize the policy model and the value model.
2.1 Plan-based MCTS
MCTS builds a reasoning tree iteratively and autonomously explores step-level reasoning
traces, which can be used to optimize LLMs. Existing methods (Chen et al., 2024; Xie et al.,
2024) that leverage MCTS to collect data for training usually focus on exploring solution
steps within the entire search space or on simultaneously exploring both plans and solutions.
To improve transfer performance across a broader range of reasoning tasks, we propose
learning high-level abstract plans, which enables the model to acquire more task-agnostic
capabilities and thereby achieve better generalization. We first create a step-by-step plan to
solve the problem, with the final step presenting the full solution and final answer based on
the plan. The prompt is provided in Appendix A. Ultimately, we obtain a plan tree and
high-quality plan step supervision through iterative search with MCTS.
Specifically, given the plan tree $\mathcal{T}$, each node represents a state $s_t$, and each edge represents
an action $a_t$, which corresponds to a reasoning step that leads to the next state $s_{t+1}$. Under
the same parent node, different sibling nodes form a set of step-level preference pairs, with
each node having its own value $V(s_t)$ representing the expected future reward under state $s_t$.
These values can be obtained through the MCTS process, which involves four key operations:
selection, expansion, evaluation, and backup. To enhance efficiency, we use a value model to
assess the expected returns from the partial reasoning paths, with the final integration of
both policy and value models guiding the search process. Next, we describe the four steps of
MCTS.
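For concreteness, before walking through the four operations, the following is a minimal Python sketch of how such a plan-tree node might be represented. The class and field names (PlanNode, visit_count, q_value, prior) are our own illustrative choices, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PlanNode:
    """One node of the plan tree: the state s_t reached after a sequence of plan steps."""
    state_text: str                       # problem statement plus the plan steps generated so far
    parent: Optional["PlanNode"] = None
    action_text: str = ""                 # the plan step (edge a_t) that led here from the parent
    children: list["PlanNode"] = field(default_factory=list)
    visit_count: int = 0                  # N(s_t)
    value: float = 0.0                    # V(s_t), expected future reward under state s_t
    q_value: float = 0.0                  # Q(parent_state, a_t)
    prior: float = 0.0                    # pi_theta(a_t | parent_state), policy prior used by PUCT
    is_terminal: bool = False             # final step, i.e. the full solution and answer
```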
Selection: We use the PUCT algorithm (Rosin, 2011) to guide the selection process with
the following formula, where $N$ represents the visit count:
\[
\arg\max_{a_t} \left[ Q(s_t, a_t) + c_{\mathrm{puct}} \, \pi_\theta(a_t \mid s_t) \, \frac{\sqrt{N(s_t)}}{1 + N(s_t, a_t)} \right]. \tag{1}
\]
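As a sketch, the selection rule in Eq. (1) could be implemented as follows, reusing the hypothetical PlanNode structure above; the value of c_puct is a tuning constant whose setting is not specified in this excerpt.

```python
import math

def select_child(node: PlanNode, c_puct: float = 1.0) -> PlanNode:
    """Pick the child maximizing Q(s_t, a_t) + c_puct * prior * sqrt(N(s_t)) / (1 + N(s_t, a_t))."""
    def puct_score(child: PlanNode) -> float:
        exploration = c_puct * child.prior * math.sqrt(node.visit_count) / (1 + child.visit_count)
        return child.q_value + exploration
    return max(node.children, key=puct_score)
```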
Expansion and Evaluation: During expansion, we sample multiple possible candidate
actions for the next step. During evaluation, the terminal node is assessed by comparing
the final answer with the ground truth, while the values of other nodes are predicted by the
value model.
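A rough sketch of these two operations is given below, again using the hypothetical PlanNode. The policy_model.sample_steps and value_model.predict interfaces, the "<solution>" terminal-step heuristic, and the extract_answer helper are all placeholder assumptions, not the paper's actual prompt or evaluation code.

```python
import re

def is_final_step(step_text: str) -> bool:
    # Placeholder heuristic: treat a step containing the full solution block as terminal.
    return "<solution>" in step_text

def extract_answer(text: str) -> str:
    # Placeholder: take the last number in the text as the predicted final answer.
    numbers = re.findall(r"-?\d+\.?\d*", text)
    return numbers[-1] if numbers else ""

def expand_and_evaluate(node: PlanNode, policy_model, value_model,
                        ground_truth: str, num_candidates: int = 4) -> float:
    """Expansion: sample candidate next plan steps. Evaluation: return a value for backup."""
    # Expansion: sample several candidate next actions and attach them as children.
    for step_text, prior in policy_model.sample_steps(node.state_text, n=num_candidates):
        node.children.append(PlanNode(
            state_text=node.state_text + "\n" + step_text,
            parent=node, action_text=step_text, prior=prior,
            is_terminal=is_final_step(step_text)))
    # Evaluation: terminal nodes are checked against the ground truth; others use the value model.
    if node.is_terminal:
        return 1.0 if extract_answer(node.state_text) == ground_truth else -1.0
    return value_model.predict(node.state_text)
```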
Backup: Once a terminal node is reached, we perform a bottom-up update from the terminal
node back to the root. We update the visit count $N$, the state value $V$, and the transition
value $Q$ as follows:
\[
Q(s_t, a_t) \leftarrow r(s_t, a_t) + V(s_{t+1}), \tag{2}
\]
\[
V(s_t) \leftarrow \frac{\sum_{a} N(s_{t+1})\, Q(s_t, a)}{\sum_{a} N(s_{t+1})}, \tag{3}
\]
\[
N(s_t) \leftarrow N(s_t) + 1. \tag{4}
\]
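A minimal sketch of this backup pass, following Eqs. (2)-(4) with the hypothetical PlanNode above; as a simplification we assume intermediate rewards r(s_t, a_t) are zero, so only the evaluated leaf contributes a reward (the answer check for terminal nodes, or the value model's prediction otherwise).

```python
def backup(leaf: PlanNode, leaf_value: float) -> None:
    """Propagate the evaluation of a leaf node back to the root, per Eqs. (2)-(4)."""
    leaf.value = leaf_value
    node = leaf
    while node.parent is not None:
        parent = node.parent
        # Eq. (2): Q(s_t, a_t) <- r(s_t, a_t) + V(s_{t+1}); intermediate rewards are taken as 0 here.
        node.q_value = node.value
        node.visit_count += 1                        # Eq. (4) for this node
        # Eq. (3): V(s_t) is the visit-weighted average of the children's Q-values.
        total_visits = sum(c.visit_count for c in parent.children)
        parent.value = sum(c.visit_count * c.q_value for c in parent.children) / max(total_visits, 1)
        node = parent
    node.visit_count += 1                            # Eq. (4) for the root
```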
2.2 Step-APO to Learn Critical Plan Steps
Mainstream approaches (Hwang et al., 2024; Lai et al., 2024) learn step-level preferences
by identifying the first error step and sampling a corresponding preferred step. While this
can yield more accurate preferences, it lacks sufficient exploration of the vast space of
reasoning traces. Given the large variation in advantage differences across data pairs, we
propose Step-APO, which introduces advantage estimates for preference pairs into DPO.
This enables the model to more effectively learn critical intermediate plan steps, thereby
further improving its reasoning capabilities. Next, we provide the derivation of Step-APO
and analyze it from the perspective of its gradient.
2.2.1 Preliminaries
The Classical RL Objective. RLHF approaches (Ziegler et al., 2020; Bai et al., 2022;
Ouyang et al., 2022) usually first learn a reward function from human feedback, then
optimize it with a policy gradient-based method like PPO (Schulman et al., 2017) with an
entropy bonus, using the following multi-step RL objective:
\[
\max_{\pi_\theta} \; \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)} \Bigg[ \sum_{t=0}^{T} \Big( r(s_t, a_t) + \underbrace{\beta \log \pi_{\mathrm{ref}}(a_t \mid s_t)}_{\text{KL penalty}} \Big) + \beta \mathcal{H}(\pi_\theta) \;\Big|\; s_0 \sim \rho(s_0) \Bigg], \tag{5}
\]
where $r(s_t, a_t)$ denotes the step-level reward function, followed by a KL penalty that aims
to ensure the learned policy $\pi_\theta$ does not deviate significantly from the reference policy $\pi_{\mathrm{ref}}$;
$\pi_{\mathrm{ref}}$ is typically produced via supervised fine-tuning.
Direct Preference Optimization. DPO (Rafailov et al., 2023) uses the well-known closed-form
optimal solution, which establishes a mapping between the reward model and the
optimal policy under the KL divergence, obtaining the reward as:
\[
r(x, y) = \beta \log \pi^{*}(y \mid x) - \beta \log \pi_{\mathrm{ref}}(y \mid x) - Z(x), \tag{6}
\]
where $x$ denotes the prompt, $y$ denotes the response, $\pi^{*}$ is the optimal policy, and $Z(x)$
is the partition function that normalizes it. Substituting Eq. (6) into the Bradley-Terry
preference model and leveraging the maximum likelihood objective, DPO derives the loss:
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right], \tag{7}
\]
where $\sigma$ denotes the logistic function, and $y_w$ and $y_l$ denote the preferred and dispreferred
responses to the prompt $x$.
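For reference, a minimal PyTorch-style sketch of the vanilla DPO loss in Eq. (7) is given below; it assumes per-response summed token log-probabilities have already been computed, and the function name and signature are our own illustration rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Vanilla DPO loss of Eq. (7).

    Each argument is a batch of summed token log-probabilities log pi(y | x) for the
    preferred (chosen) or dispreferred (rejected) response, under the policy being
    trained or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi_theta(y_w|x)/pi_ref(y_w|x)
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi_theta(y_l|x)/pi_ref(y_l|x)
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```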