Application and Implementation of the Natural Language Reinforcement Learning (NLRL) Framework on MDP Tasks
Overview: This paper proposes a new reinforcement learning framework, Natural Language Reinforcement Learning (NLRL), which converts the traditional mathematical formulation into natural language representations and leverages the capabilities of large language models (LLMs) to enhance the decision-making process. The authors describe in detail how key concepts of traditional RL, such as task objectives, policies, value functions, the Bellman equation, and generalized policy iteration, are redefined to match the way humans express themselves in natural language. The experiments demonstrate the effectiveness and efficiency of the approach on tabular Markov Decision Processes (MDPs), while addressing problems of traditional RL such as low sample efficiency, poor interpretability, and sparse supervision signals.
Intended audience: researchers and practitioners with basic knowledge of natural language processing and machine learning, especially readers interested in reinforcement learning and large language models.
Use cases and goals: (1) combine natural language representations with reinforcement learning to optimize the decision-making process; (2) improve the interpretability and transparency of decision making; (3) reduce the number of samples required and improve learning efficiency.
Additional notes: although the NLRL framework shows promising initial results, it currently has limitations, such as the hallucination problem of large language models; future work will address these issues and broaden the experimental scope.
Preprint. Work in progress. arXiv:2402.07157v2 [cs.CL] 14 Feb 2024
NATURAL LANGUAGE REINFORCEMENT LEARNING
Xidong Feng^{1*}, Ziyu Wan^{2*}, Mengyue Yang^{1}, Ziyan Wang^{3}, Girish A. Koushik^{4}, Yali Du^{3}, Ying Wen^{2}, Jun Wang^{1}
^{1}University College London, ^{2}Shanghai Jiao Tong University, ^{3}King's College London, ^{4}University of Surrey
^{*}Equal Contribution. Correspondence to xidong.feng.20@ucl.ac.uk.
ABSTRACT
Reinforcement Learning (RL) has shown remarkable abilities in learning policies for decision-making tasks. However, RL is often hindered by issues such as low sample efficiency, lack of interpretability, and sparse supervision signals. To tackle these limitations, we take inspiration from the human learning process and introduce Natural Language Reinforcement Learning (NLRL), which innovatively combines RL principles with natural language representation. Specifically, NLRL redefines RL concepts like task objectives, policy, value function, Bellman equation, and policy iteration in natural language space. We present how NLRL can be practically implemented with the latest advancements in large language models (LLMs) like GPT-4. Initial experiments over tabular MDPs demonstrate the effectiveness, efficiency, and also interpretability of the NLRL framework.
1 INTRODUCTION
Reinforcement Learning (RL) constructs a mathematical framework that encapsulates key decision-making elements. It quantifies the objectives of tasks through the concept of cumulative rewards, formulates policies with probability distributions, expresses value functions via mathematical expectations, and models environment dynamics through state transition and reward functions. This framework effectively converts the policy learning problem into an optimization problem.
Despite the remarkable achievements of RL in recent years, significant challenges still underscore the framework's limitations. For example, RL suffers from the sample efficiency problem: RL algorithms are task-agnostic and do not leverage any prior knowledge, requiring large-scale and extensive sampling to develop an understanding of the environment. RL also lacks interpretability. Despite the superhuman performance of models like AlphaZero (Silver et al., 2017) in mastering complex games such as Go, the underlying strategic logic of their decision-making processes remains elusive, even to professional players. In addition, the supervision signal of RL is a one-dimensional scalar value, which is much sparser than the supervision in traditional supervised learning over information-rich datasets such as texts and images. This is also one of the reasons for the instability of RL training (Zheng et al., 2023; Andrychowicz et al., 2020).
These limitations drive us to a new framework inspired by the human learning process. Instead of mathematically modeling decision-making components as RL algorithms do, humans tend to conduct relatively vague operations through natural language. First, natural language equips humans with text-based prior knowledge, which largely increases sample efficiency when learning new tasks. Second, humans possess the unique ability to articulate their explicit strategic reasoning and thoughts in natural language before deciding on their actions, making their process fully interpretable to others, even if it is not always the most effective approach for task completion. Third, natural language data contains information about thinking, analysis, evaluation, and future planning. It can provide signals with high information density, far surpassing that found in the reward signals of traditional RL.
Inspired by the human learning process, we propose Natural Language Reinforcement Learning (NLRL), a new RL paradigm that innovatively combines traditional RL concepts and natural language representation. By transforming key RL components—such as task objectives, policies, value functions, the Bellman equation, and generalized policy iteration (GPI) (Sutton & Barto, 2018)—into their natural language equivalents, we harness the intuitive power of language to encapsulate complex decision-making processes. This transformation is made possible by leveraging recent breakthroughs in large language models (LLMs), which possess human-like ability to understand, process, and generate language-based information. Our initial experiments over tabular MDPs validate the effectiveness, efficiency, and interpretability of our NLRL framework.
2 PRELIMINARY OF REINFORCEMENT LEARNING
Reinforcement Learning models the decision-making problem as a Markov Decision Process (MDP), defined by the state space $S$, action space $A$, probabilistic transition function $P: S \times A \times S \rightarrow [0, 1]$, discount factor $\gamma \in [0, 1)$, and reward function $r: S \times A \rightarrow [-R_{\max}, R_{\max}]$. The goal of RL is to learn a policy $\pi: S \times A \rightarrow [0, 1]$, which measures action $a$'s probability given the state $s$: $\pi(a|s) = \Pr(A_t = a \mid S_t = s)$. In a decision-making task, the optimal policy maximizes the expected discounted cumulative reward: $\mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$.
The state-action and state value functions are two key concepts that evaluate states and actions as proxies of the RL objective: $Q^\pi(s_t, a_t) = \mathbb{E}_{(s,a)_{t+1:\infty} \sim P_\pi}\left[\sum_{i=t}^{\infty} \gamma^{i-t} r(s_i, a_i) \mid s_t, a_t\right]$, $V^\pi(s_t) = \mathbb{E}_{(s,a)_{t+1:\infty} \sim P_\pi}\left[\sum_{i=t}^{\infty} \gamma^{i-t} r(s_i, a_i) \mid s_t\right]$, where $P_\pi$ is the trajectory distribution given the policy $\pi$ and dynamic transition $P$.
Given the definition of $V^\pi(s_t)$, the relationship between the values of temporally adjacent states can be derived as the Bellman expectation equation. Here is a one-step Bellman expectation equation:

$$V^\pi(s_t) = \mathbb{E}_{a_t \sim \pi}\Big[ r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim p(s_{t+1}\mid s_t, a_t)}\big[ V^\pi(s_{t+1}) \big] \Big], \quad \forall s_t \in S \tag{1}$$
A similar equation can also be derived for $Q^\pi(s, a)$. Given these basic RL definitions and equations, we can illustrate how policy evaluation and policy improvement are conducted in GPI.
Policy Evaluation. The target of the policy evaluation process is to estimate the state value function $V^\pi(s)$ or the state-action value function $Q^\pi(s, a)$. For simplicity, we only utilize $V^\pi(s)$ in the following illustration. Two common value function estimation methods are the Monte Carlo (MC) estimate and the Temporal-Difference (TD) estimate (Sutton, 1988). The MC estimate leverages Monte-Carlo sampling over trajectories to construct an unbiased estimation: $V^\pi(s_t) \approx \frac{1}{K}\sum_{n=1}^{K}\big[\sum_{i=t}^{\infty} \gamma^{i-t} r(s_i^n, a_i^n)\big]$. The TD estimate relies on the temporal relationship shown in Equ. 1 to construct an estimation: $V^\pi(s_t) \approx \frac{1}{K}\sum_{n=1}^{K}\big[r(s_t, a_t^n) + \gamma V^\pi(s_{t+1}^n)\big]$, which can be seen as a bootstrap over the next-state value function.
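To make the two estimators concrete, here is a minimal Python sketch of tabular MC and TD value estimation under a fixed policy. The episode format, state labels, and hyperparameters are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict

def mc_value_estimate(episodes, gamma=0.9):
    """Every-visit Monte-Carlo estimate: average the discounted returns observed from each state.

    `episodes` is a list of trajectories, each a list of (state, action, reward) tuples.
    """
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk the episode backwards so g accumulates the discounted return from each step onward.
        for state, _action, reward in reversed(episode):
            g = reward + gamma * g
            returns[state].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

def td_value_estimate(episodes, gamma=0.9, alpha=0.1, sweeps=50):
    """TD(0) estimate: bootstrap each state's value from the current estimate of the next state."""
    v = defaultdict(float)
    for _ in range(sweeps):
        for episode in episodes:
            for i, (state, _action, reward) in enumerate(episode):
                next_v = v[episode[i + 1][0]] if i + 1 < len(episode) else 0.0
                v[state] += alpha * (reward + gamma * next_v - v[state])
    return dict(v)

# Toy usage loosely following the grid-world of Figure 1: from state b, "left" eventually
# reaches the crown (+1) while "down" hits the fire (-1).
episodes = [[("b", "left", 0), ("a", "up", 1)],
            [("b", "down", -1)],
            [("b", "up", 0), ("b", "left", 0), ("a", "up", 1)]]
print(mc_value_estimate(episodes))
print(td_value_estimate(episodes))
```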
Policy Improvement. The policy improvement process aims to update and improve the policy according to the result of policy evaluation. Specifically, it replaces the old policy $\pi_{\text{old}}$ with a new one $\pi_{\text{new}}$ so that the expected return increases: $V^{\pi_{\text{new}}}(s_0) \geq V^{\pi_{\text{old}}}(s_0)$. In environments with small, discrete action spaces, such improvement can be achieved by greedily choosing the action that maximizes $Q^{\pi_{\text{old}}}(s, a)$ at each state:

$$\pi_{\text{new}}(\cdot \mid s) = \arg\max_{\bar{\pi}(\cdot\mid s) \in \mathcal{P}(A)} \mathbb{E}_{a \sim \bar{\pi}}\big[ Q^{\pi_{\text{old}}}(s, a) \big], \quad \forall s \in S \tag{2}$$
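In the tabular case, the greedy step in Equ. 2 reduces to an argmax over stored Q-values. Below is a minimal sketch; the dictionary layout q[state][action] and the example numbers (taken loosely from the numeric panel of Figure 1) are illustrative assumptions.

```python
def greedy_policy_improvement(q):
    """Return a deterministic policy picking, at each state, an action that maximizes Q_old(s, a).

    `q` maps each state to a dict {action: estimated value} under the old policy.
    """
    return {state: max(action_values, key=action_values.get)
            for state, action_values in q.items()}

# Q-values at state b: only "Move Left" heads toward the crown.
q = {"b": {"Move Left": 1.0, "Move Right": 0.0, "Move Up": 0.0, "Move Down": -1.0}}
print(greedy_policy_improvement(q))  # {'b': 'Move Left'}
```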
Another improvement method involves applying policy gradient ascent (Sutton et al., 1999). It parameterizes the policy $\pi_\theta$ with $\theta$. The analytical policy gradient can be derived as follows:

$$\nabla_\theta V(\pi_\theta)\big|_{\theta=\theta_{\text{old}}} = \mathbb{E}_{(s,a) \sim P_{\pi_{\theta_{\text{old}}}}}\Big[ \nabla_\theta \log \pi_\theta(a|s)\, Q^{\pi_{\theta_{\text{old}}}}(s, a) \Big]\Big|_{\theta=\theta_{\text{old}}}. \tag{3}$$

By choosing a relatively small step-size $\alpha > 0$ and conducting gradient ascent, $\theta_{\text{new}} = \theta_{\text{old}} + \alpha \nabla_\theta V^{\pi_\theta}(s_0)\big|_{\theta=\theta_{\text{old}}}$, we can guarantee the policy improvement: $V^{\pi_{\text{new}}}(s_0) \geq V^{\pi_{\text{old}}}(s_0)$.
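As a concrete instance of Equ. 3, here is a minimal REINFORCE-style sketch for a tabular softmax policy, with the expectation replaced by an average over sampled (s, a) pairs. The array shapes, the step size, and the assumption that the old policy's Q-values are already available from policy evaluation are illustrative choices, not the paper's implementation.

```python
import numpy as np

def softmax_policy(theta, s):
    """Action probabilities of a tabular softmax policy; theta has shape (n_states, n_actions)."""
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

def policy_gradient_step(theta, trajectories, q_values, alpha=0.05):
    """One gradient-ascent step on Equ. 3, with the expectation replaced by sampled (s, a) pairs.

    `trajectories` is a list of (state, action) index pairs sampled under the old policy, and
    `q_values[s, a]` holds the old policy's Q-value (assumed given, e.g. from policy evaluation).
    """
    grad = np.zeros_like(theta)
    for s, a in trajectories:
        probs = softmax_policy(theta, s)
        # Gradient of log pi(a|s) w.r.t. theta[s] for a softmax policy: indicator(a) - pi(.|s).
        grad_log = -probs
        grad_log[a] += 1.0
        grad[s] += grad_log * q_values[s, a]
    return theta + alpha * grad / max(len(trajectories), 1)

# Toy usage: 2 states, 3 actions, gradient estimated from a handful of sampled pairs.
theta = np.zeros((2, 3))
q_values = np.array([[1.0, 0.0, -1.0], [0.5, 0.2, 0.0]])
theta = policy_gradient_step(theta, [(0, 0), (0, 2), (1, 1)], q_values)
print(softmax_policy(theta, 0))
```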
3 NATURAL LANGUAGE REINFORCEMENT LEARNING
In contrast to the precise statistical models used in traditional RL, humans typically frame all elements—including task objectives, value evaluations, and strategic policies—in the form of natural language. This section aims to mimic how humans navigate decision-making tasks using natural language, aligning it with the concepts, definitions, and equations in traditional RL. Due to the inherent ambiguity of natural language, the equations presented here are not strictly derived from mathematical definitions. They are instead analogical, based on empirical insights into the original RL concepts. We leave rigorous theory for future work.
[Figure 1 (panels): a grid-world where the task objective is to "reach the crown and avoid all dangers". The panels contrast the numeric and language-based versions of the value function and Bellman equation at state b, of policy evaluation (traditional vs. language-based Monte-Carlo and temporal-difference estimates), and of policy improvement (argmax over Q-values vs. LM-selection over language evaluations).]
Figure 1: We present an illustrative example of a grid-world MDP to show how NLRL and traditional RL differ in terms of task objective, value function, Bellman equation, and generalized policy iteration. In this grid-world, the robot needs to reach the crown and avoid all dangers. We assume the robot policy takes the optimal action at each non-terminal state, except for a uniformly random policy at state b.
3.1 DEFINITIONS
We start with definitions and equations in natural language RL to model human behaviors. Fig. 1 provides illustrative examples covering most of the concepts we discuss in this section.
Text-based MDP: To conduct RL in natural language space, we first need to convert the traditional MDP into a text-based MDP, which leverages text and language descriptions to represent the MDP's basic concepts, including state, action, and environment feedback (state transitions and reward).
Language Task Instruction: Humans usually define a natural language task instruction $T_L$, like "reaching the goal" or "opening the door". We then denote by $F$ a human metric that measures how completely the task instruction is fulfilled given the trajectory description $D_L(\tau_\pi)$, where $D_L$ is a language descriptor that transforms the trajectory distribution $\tau_\pi$ into its corresponding language description $D_L(\tau_\pi)$. Based on this notation, the objective of NLRL is reformulated as

$$\max_\pi \; F\big(D_L(\tau_\pi), T_L\big), \tag{4}$$

i.e., we optimize the policy so that the language description of the trajectory distribution $\tau_\pi$ shows high completeness with respect to the task instruction.
Language Policy: Instead of directly modeling action probabilities, humans determine the action through strategic thoughts, logical reasoning, and planning. Thus, we represent the language policy as $\pi_L(a, c \mid s)$, which first generates such a thought process $c$ and then the final action probability $\pi(a \mid s)$.
Language Value Function: Similar to the definitions of $Q$ and $V$ in traditional RL, humans leverage a language value function, relying on natural language evaluation to assess the policy's effectiveness. The language state value function $V_L^\pi$ and the language state-action value function $Q_L^\pi$ are defined as:

$$Q_L^\pi(s_t, a_t) = D\big((s,a)_{t+1:\infty} \sim P_\pi \mid s_t, a_t, T_L\big), \quad V_L^\pi(s_t) = D\big(a_t, (s,a)_{t+1:\infty} \sim P_\pi \mid s_t, T_L\big), \tag{5}$$
Given the current state $s_t$ or state-action pair $(s_t, a_t)$, $Q_L^\pi$ and $V_L^\pi$ leverage language descriptions instead of scalar values to demonstrate the effectiveness of the policy for achieving the task objective $T_L$. Compared with traditional scalar-based values, language value functions are intuitively richer in value-relevant information and enhance interpretability. They can represent evaluation results from different perspectives, including the underlying logic/thoughts, predictions/analysis of future outcomes, comparisons among different actions, etc.
Language Bellman Equation: In the Bellman equation, the state evaluation value $V(s_t)$ can be decomposed into two parts: first, the intermediate changes, including the immediate action $a_t$, reward $r_t$, and next state $s_{t+1}$; second, the state evaluation $V(s_{t+1})$ over the next state. Based on this decomposition intuition, we introduce the language Bellman equation, Equ. 6, following the same principle.

$$V_L^\pi(s_t) = G_{1,\, a_t, s_{t+1} \sim P_\pi}\Big[ G_2\big( d(a_t, r(s_t, a_t), s_{t+1}), \, V_L^\pi(s_{t+1}) \big) \Big], \quad \forall s_t \in S \tag{6}$$

where $d(a_t, r(s_t, a_t), s_{t+1})$ depicts the language description of the intermediate changes, and $G_1$ and $G_2$ serve as two information aggregation functions. Specifically, $G_2$ mimics the add '+' operation in the original Bellman equation, aggregating information from the intermediate changes' descriptions and the future evaluation given $a_t$ and $s_{t+1}$. $G_1$ takes the role of the expectation operation $\mathbb{E}$, aggregating information from different $(a_t, s_{t+1})$ pairs sampled from the trajectory distribution $P_\pi$.
3.2 LANGUAGE GENERALIZED POLICY ITERATION
Given these definitions and equations, we now introduce how language GPI is conducted. Refer to Fig. 1 for illustrative examples of language GPI.
3.2.1 LANGUAGE POLICY EVALUATION
Language policy evaluation aims to estimate the language value functions $V_L^\pi$ and $Q_L^\pi$ for each state. We present how two classical estimation methods, the MC estimate and the TD estimate, work in language policy evaluation.
Language Monte-Carlo Estimate. Starting from the state $s_t$, the MC estimate is conducted over text rollouts (i.e., $K$ full trajectories $\{a_t, (s,a)_{t+1:\infty}\}$) given the policy $\pi$. Since we cannot take the average operation in language space, we instead leverage the language aggregator/descriptor $G_1$ to aggregate information over finite trajectories, approximating the expected evaluation:

$$V_L^\pi(s_t) \approx G_1\Big(\big\{ a_t^n, (s,a)^n_{t+1:\infty} \big\}_{n=1}^{K}\Big) \tag{7}$$
Language Temporal-Difference Estimate. The TD estimate mainly relies on the one-step language Bellman equation illustrated in Equ. 6. Similar to the MC estimate, we aggregate $K$ one-step samples to approximate the expected evaluation:

$$V_L^\pi(s_t) \approx G_1\Big(\big\{ G_2\big( d(s_t, a_t^n, r(s_t, a_t^n), s_{t+1}^n), \, V_L^\pi(s_{t+1}^n) \big) \big\}_{n=1}^{K}\Big), \quad \forall s_t \in S \tag{8}$$
The language MC estimate is free from estimation "bias" as it directly utilizes samples from complete trajectories. However, the MC method is prone to high "variance" considering the significant variations in the long-term future steps. Such variability poses a challenge for the language aggregator $G_1$ in Equ. 7 to efficiently extract crucial information from diverse trajectories. On the contrary, while the inaccuracy of the next-state evaluation $V_L^\pi(s_{t+1})$ can bring estimation "bias" to the TD estimate, it effectively reduces "variance" by discarding future variations: $G_1$ and $G_2$ are only required to conduct simple one-step information aggregation with limited variations.¹
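To make Equ. 7 and Equ. 8 concrete, the sketch below implements the language MC and TD estimates with $G_1$ and $G_2$ as prompted LLM calls. The llm callable, the prompt wording, and the text formats for trajectories and value descriptions are all illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, List, Tuple

# Hypothetical LLM interface: takes a prompt string and returns a completion string.
LLM = Callable[[str], str]

def g2_one_step(llm: LLM, transition_text: str, next_value_text: str, task: str) -> str:
    """G_2: fuse the description of intermediate changes with the next state's language value."""
    prompt = (f"Task objective: {task}\n"
              f"One-step transition: {transition_text}\n"
              f"Evaluation of the next state: {next_value_text}\n"
              "Combine these into a short evaluation of this state-action step.")
    return llm(prompt)

def g1_aggregate(llm: LLM, evaluations: List[str], task: str) -> str:
    """G_1: aggregate sampled language evaluations, mimicking the expectation operator."""
    joined = "\n".join(f"- {e}" for e in evaluations)
    prompt = (f"Task objective: {task}\n"
              f"Sampled evaluations of the same state under the current policy:\n{joined}\n"
              "Summarize them into one overall language value description for this state.")
    return llm(prompt)

def language_mc_estimate(llm: LLM, rollout_texts: List[str], task: str) -> str:
    """Language MC estimate (Equ. 7): aggregate K full trajectory descriptions with G_1."""
    return g1_aggregate(llm, rollout_texts, task)

def language_td_estimate(llm: LLM, steps: List[Tuple[str, str]], task: str) -> str:
    """Language TD estimate (Equ. 8): apply G_2 to each sampled one-step transition, then G_1."""
    one_step_evals = [g2_one_step(llm, transition, next_value, task)
                      for transition, next_value in steps]
    return g1_aggregate(llm, one_step_evals, task)
```

For instance, calling language_td_estimate with one-step transition descriptions from state b in Figure 1 would aggregate the per-action evaluations into a description similar to the language-based TD panel shown there.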
3.2.2 LANGUAGE POLICY IMPROVEMENT
Similar to traditional policy improvement, language policy improvement also aims to select actions that maximize the human task completeness function $F$:

$$\pi_{\text{new}}(\cdot \mid s) = \arg\max_{\bar{\pi}(\cdot\mid s) \in \mathcal{P}(A)} F\big(Q_L^{\pi_{\text{old}}}(s, a), T_L\big), \quad \forall s \in S \tag{9}$$
¹ We use quotes for "bias" and "variance" to indicate that we draw on their conceptual essence, not their strict statistical definitions, to clarify concepts in NLRL.
As we mentioned, $F$ is typically a human measurement of task completeness, which is hard to quantify and to take the argmax operation over. Considering that $F$ largely depends on human prior knowledge, instead of mathematically optimizing it, we leverage a language analysis process $I$ to conduct policy optimization and select actions:

$$\pi_{\text{new}}(\cdot \mid s),\, c = I\big(Q_L^{\pi_{\text{old}}}(s, a), T_L\big), \quad \bar{\pi}(\cdot \mid s) \in \mathcal{P}(A), \; \forall s \in S \tag{10}$$

Language policy improvement conducts a strategic analysis to generate the thought process $c$ and determine the most promising action as the new policy $\pi_{\text{new}}(\cdot \mid s)$. This analysis is mainly based on humans' correlation judgment between the language evaluation $Q_L^{\pi_{\text{old}}}(s, a)$ and the task objective $T_L$.
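A minimal sketch of the improvement operator I in Equ. 10 as a single prompted LLM call: it takes the language Q-evaluations of each candidate action together with the task objective, writes a strategic analysis c, and commits to an action. The llm callable, the prompt format, and the response-parsing convention are illustrative assumptions.

```python
from typing import Callable, Dict, Tuple

def language_policy_improvement(llm: Callable[[str], str],
                                q_language: Dict[str, str],
                                task: str) -> Tuple[str, str]:
    """Operator I: analyze language Q-evaluations against the task objective; return (action, thought)."""
    evals = "\n".join(f"- {action}: {evaluation}" for action, evaluation in q_language.items())
    prompt = (f"Task objective: {task}\n"
              f"Language evaluations of each candidate action at the current state:\n{evals}\n"
              "Write a short strategic analysis, then on the last line output "
              "'Chosen action: <action>' using one of the candidate actions.")
    response = llm(prompt)
    thought, _, last_line = response.rpartition("\n")
    action = last_line.replace("Chosen action:", "").strip()
    return action, thought

# Hypothetical usage for state b in Figure 1:
# action, thought = language_policy_improvement(
#     llm,
#     {"Move Left": "reaches state a, which can pass all dangers and reach the crown",
#      "Move Down": "reaches the fire at state d and the episode ends"},
#     "reach the crown and avoid all dangers")
```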
3.3 PRACTICAL IMPLEMENTATION WITH LARGE LANGUAGE MODELS
Section 3 demonstrates the philosophy of NLRL: transferring key RL concepts into their human natural language correspondences. To practically implement these key concepts, we require a model that can understand, process, and generate language information. Large language models, trained on large-scale human language and knowledge corpora, are a natural choice for mimicking human behaviors and implementing these language RL components.
LLMs as policy ($\pi_L$). Many works adopt LLMs as the decision-making agent (Wang et al., 2023a; Feng et al., 2023a; Christianos et al., 2023; Yao et al., 2022) with a chain-of-thought process (Wei et al., 2022b). By setting proper instructions, LLMs can leverage natural language to describe their underlying thoughts for determining the action, akin to human strategic thinking.
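A minimal sketch of such an LLM chain-of-thought policy $\pi_L(a, c \mid s)$, generating a thought before committing to an action; the state/action text formats, the llm callable, and the parsing fallback are illustrative assumptions.

```python
from typing import Callable, List, Tuple

def llm_policy(llm: Callable[[str], str],
               state_text: str,
               legal_actions: List[str],
               task: str) -> Tuple[str, str]:
    """pi_L(a, c | s): produce a thought process c first, then a final action a."""
    prompt = (f"Task objective: {task}\n"
              f"Current state: {state_text}\n"
              f"Legal actions: {', '.join(legal_actions)}\n"
              "Think step by step about which action best serves the objective, "
              "then end with 'Action: <one legal action>'.")
    response = llm(prompt)
    thought, _, last_line = response.rpartition("\n")
    action = last_line.replace("Action:", "").strip()
    # Fall back to the first legal action if the model's final line cannot be parsed.
    return (action if action in legal_actions else legal_actions[0]), thought
```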
LLMs as information extractor and aggregator ($G_1$, $G_2$) for concepts. LLMs can be powerful information summarizers (Zhang et al., 2023), extractors (Xu et al., 2023), and aggregators that help us fuse intermediate changes and future language evaluations for language value function estimates. One core issue is to determine which kind of information we want our LLMs to extract and aggregate. Inspired by works in the field of interpretable RL (Das et al., 2023; Sreedharan et al., 2020; Schut et al., 2023; Hayes & Shah, 2017), we believe Concept can be the core. We adopt the illustration in Das et al. (2023) that a concept is a general, task-objective oriented, and high-level abstraction grounded in human domain knowledge. For example, in the shortest path-finding problem, the path distance and the set of available paths are two concepts that are (1) high-level abstractions of the trajectories, predefined in human prior knowledge, (2) generally applicable over different states, and (3) directly relevant to the final task objective. Given these motivations, LLMs will try to aggregate and extract domain-specific concepts to form the value target information. Such concepts can be predefined by human prior knowledge or proposed by LLMs themselves.
LLMs as value function approximator ($D_L$, $Q_L$, $V_L$). The key idea of value function approximation (Sutton et al., 1999) is to represent the value function with a parameterized function instead of a table representation. Nowadays, deep RL typically chooses neural networks that take the state as input and output a one-dimensional scalar value. For NLRL, the language value function approximation of $D_L$, $Q_L$, $V_L$ can be naturally handled by (multi-modal) LLMs. LLMs can take in features of the task's state, such as low-dimensional statistics, text, and images, and output the corresponding language value judgments and descriptions. For the training of LLMs, we adopt the concept extractor/aggregator mentioned above to form MC or TD estimates (Sec. 3.2.1), which can be used to finetune LLMs into better language critics.
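One possible way to realize the finetuning step above is to turn the aggregated language value estimates into (prompt, target) pairs; the sketch below shows this, with the data format and example text as illustrative assumptions rather than the paper's recipe.

```python
from typing import Dict, List

def build_value_finetuning_set(language_value_targets: Dict[str, str], task: str) -> List[dict]:
    """Turn language value estimates (e.g., from the MC/TD aggregation of Sec. 3.2.1) into finetuning pairs.

    Each example asks the model to evaluate a state; the target is the aggregated language critique.
    """
    return [{"prompt": (f"Task objective: {task}\n"
                        f"State: {state_text}\n"
                        "Evaluate this state under the current policy."),
             "target": value_text}
            for state_text, value_text in language_value_targets.items()]

# Hypothetical usage with the grid-world of Figure 1:
targets = {"state b": ("Mixed outcomes under a uniform policy: moving left is the only move "
                       "leading to the crown, while moving down leads to the fire.")}
print(build_value_finetuning_set(targets, "reach the crown and avoid all dangers"))
```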
LLMs as policy improvement operator ($I$). With the chain-of-thought process and human prior knowledge, LLMs are well suited to determine the most promising action $\pi_{\text{new}}(\cdot \mid s)$ by performing the language analysis $c$ over the correlation between the language evaluation $Q_L^{\pi_{\text{old}}}(s, a)$ and the task objective $T_L$. The underlying idea also aligns with recent works (Kwon et al., 2023; Rocamonde et al., 2023) that leverage LLMs or vision-language models as the reward: they can accurately model such correlations.
3.4 DISCUSSIONS OVER OTHER RL CONCEPTS
To illustrate the versatility of the framework, we show several examples of how other fundamental
RL concepts can be framed into NLRL.
TD-λ (Sutton, 1988). Equ. 6 considers the one-step decomposition of the value function, or in
the context of traditional RL, the TD(1) situation. A natural extension is to conduct an n-step