CurricuLLM: Automatic Task Curricula Design for Learning Complex Robot Skills using Large Language Models
Kanghyun Ryu¹, Qiayuan Liao¹, Zhongyu Li¹, Koushil Sreenath¹, Negar Mehr¹
Abstract— Curriculum learning is a training mechanism in reinforcement learning (RL) that facilitates the achievement of complex policies by progressively increasing the task difficulty during training. However, designing effective curricula for a specific task often requires extensive domain knowledge and human intervention, which limits its applicability across various domains. Our core idea is that large language models (LLMs), with their extensive training on diverse language data and ability to encapsulate world knowledge, present significant potential for efficiently breaking down tasks and decomposing skills across various robotics environments. Additionally, the demonstrated success of LLMs in translating natural language into executable code for RL agents strengthens their role in generating task curricula. In this work, we propose CurricuLLM, which leverages the high-level planning and programming capabilities of LLMs for curriculum design, thereby enhancing the efficient learning of complex target tasks. CurricuLLM consists of: (Step 1) generating a sequence of subtasks that aid target task learning in natural language form, (Step 2) translating the natural language description of subtasks into executable task code, including the reward code and goal distribution code, and (Step 3) evaluating trained policies based on trajectory rollouts and subtask descriptions. We evaluate CurricuLLM in various robotics simulation environments, spanning manipulation, navigation, and locomotion, to show that CurricuLLM can aid learning complex robot control tasks. In addition, we validate the humanoid locomotion policy learned through CurricuLLM in the real world. The code is provided at https://github.com/labicon/CurricuLLM
I. INTRODUCTION
Deep reinforcement learning (RL) has achieved notable success across various robotics tasks, including manipulation [1], navigation [2], and locomotion [3]. However, RL requires informative samples for learning, and obtaining these from a random policy is highly sample-inefficient, especially for complex tasks. In contrast, human learning strategies differ significantly from random action trials; they typically start with simpler tasks and progressively increase difficulty. Curriculum learning, inspired by this structured approach to learning, aims to train models in a meaningful sequence [4], gradually enhancing the complexity of the training data [5] or the tasks themselves [6]. Particularly in RL, curriculum learning improves training efficiency by focusing on simpler tasks that can provide informative experiences toward a more complex target task, instead of starting from scratch [6].
Although effective, designing a good curriculum is challenging.
*This work is supported by the National Science Foundation, under grants ECCS-2438314 CAREER Award, CNS-2423130, and CCF-2423131.
¹Mechanical Engineering, University of California Berkeley. {kanghyun.ryu, qiayuanl, zhongyu li, koushils, negar}@berkeley.edu
Fig. 1: CurricuLLM takes natural language descriptions of the environment, robot, and target task that we wish the robot to learn, and then generates a sequence of subtasks. In each subtask, it samples different task codes and evaluates the resulting trained policies to find the policy that is best aligned with the current subtask. These iterations are repeated throughout the curriculum subtasks to sequentially train a policy that achieves the complex target task.
Manual curriculum design often necessitates the costly intervention of human experts [7], [8], [9] and is typically restricted to a limited set of predefined tasks [10]. Consequently, several works have focused on automatic curriculum learning (ACL). To generate task curricula, ACL requires the ability to determine subtasks aligned with the target task, rank the difficulty of each subtask, and organize them in ascending order of difficulty [11]. However, autonomously evaluating the relevance and difficulty of these subtasks remains unresolved. As a result, ACL has been limited to initial state curricula [12], [13], goal state curricula [14], or environment curricula [10], [15], rather than task-level curricula.
Meanwhile, in recent years, large language models (LLMs) trained on extensive collections of language data [16], [17], [18] have been recognized as repositories of world knowledge expressed in linguistic form [19]. Leveraging this world knowledge, LLMs have demonstrated their capabilities in task planning [20] and skill decomposition for complex robotic tasks [21], [22]. Furthermore, the programming skills of LLMs have enabled smooth integration between high-level language descriptions and robotics through API call composition [23], [24], simulation environment generation [25], [26], or reward design [27], [28].
In this paper, we introduce CurricuLLM, which leverages the reasoning and coding capabilities of LLMs to design curricula for complex robotic control tasks. Our goal is to autonomously generate a series of subtasks that facilitate the learning of complex target tasks without the need for extensive human intervention. Utilizing the LLM's task decomposition and coding, CurricuLLM autonomously generates sequences of subtasks along with appropriate reward functions and goal distributions for each subtask, enhancing the efficiency of training complex robotic policies.
Our contributions are threefold. First, we propose CurricuLLM, a task-level curriculum designer that uses the high-level planning and code writing capabilities of LLMs. Second, we evaluate CurricuLLM in diverse robotics simulation environments spanning manipulation, navigation, and locomotion, demonstrating its efficacy in learning complex control tasks. Finally, we validate the policy trained with CurricuLLM on the Berkeley Humanoid [29], illustrating that the policy learned through CurricuLLM can be transferred to the real world.
II. RELATED WORKS
A. Curriculum Learning
In RL, curriculum learning is recognized for enhancing sample efficiency [30], addressing previously infeasible challenging tasks [31], and facilitating multitask policy learning [32]. Key elements of curriculum learning include the difficulty measure, which ranks the difficulty of each subtask, and training scheduling, which arranges subtasks at an appropriate pace [11]. The teacher-student framework, for example, has a teacher agent that monitors the progress of the student agent, recommending suitable tasks or demonstrations accordingly [33], [34]. However, this method requires a predefined set of tasks provided by human experts or a teacher who has superior knowledge of the environment. Although self-play has been proposed as a means to escalate opponent difficulty [35], [36], it is limited to competitive multi-agent settings and may converge to a local minimum.
An appropriate difficulty measure is also crucial in curriculum learning. In goal-conditioned environments, it is often suggested to start training from a goal distribution close to the initial state [14], [37] or from an initial state distribution in proximity to the goal state [12], [13] to regulate difficulty. However, these methods are limited to goal-conditioned environments where the task difficulty is correlated with how "far" the goal is from the start location. In this work, we use LLMs to provide a more general method to measure task difficulty and design curricula.
The works most closely related to CurricuLLM are DrEureka [38] and Eurekaverse [26], which utilize LLMs to generate domain randomization parameters, such as gravity or mass, or an environment curriculum, such as terrain height. In particular, Eurekaverse employs a co-evolution mechanism that gradually increases the complexity of the environment by using the LLM for environment code generation. Whereas Eurekaverse focuses on generalization across different environments, our method focuses on a task curriculum that breaks down complex robotic tasks for efficient learning.
B. Large Language Model for Robotics
Task Planning. The robotics community has recently been exploring the use of LLMs for high-level task planning [39], [24]. However, these methods are limited to task planning within predefined finite skill sets and suffer when the LLM's plan is not executable within the given skill sets or environment [20], [24], [40]. In contrast, Voyager [21] introduces automatic skill discovery, attempting to learn new skills that are not currently available but are required for open-ended exploration. Nonetheless, skills in Voyager are limited to compositions of discrete actions and are inapplicable to continuous control problems. In this work, we propose the automatic generation of a task curriculum consisting of a sequence of subtasks that facilitates the efficient learning of robotic control tasks. To manage the control of robots with high degrees of freedom, we utilize the coding capabilities of LLMs to generate a reward function for each subtask and sequentially train each subtask in the given order.
Reward Design. Building on works that use natural language as a reward [41], [42], several works have proposed using LLMs as a tool to translate language into reward. For example, [43] uses LLMs to translate motion descriptions into cost parameters, which are optimized using model predictive control (MPC). However, it is limited to changing the parameters of cost functions that are hand-coded by human experts. On the other hand, some works proposed directly using an LLM [44] or a vision-language model (VLM) [45] as a reward function, which observes agent behavior and outputs a reward signal. However, these approaches require expensive LLM or VLM interaction during training. The approaches most similar to our work are [27], [46], which leverage LLMs to generate reward functions and utilize an evolutionary search to identify the most effective reward function. However, these methods require an evaluation metric in their feedback loops, and their reward search tends to optimize specifically for this metric. Therefore, for tasks described only in natural language, such as the subtasks in our curriculum, finding a reward function without such evaluation metrics can be challenging. Additionally, their evolutionary search is highly sample-inefficient, contradicting the efficient learning objectives of curriculum learning. In contrast, our work divides a single complex task into a series of subtasks and then employs a reasoning approach analogous to chain-of-thought [19] to generate reward functions for complex target tasks.
III. PROBLEM FORMULATION
In this work, we consider task curriculum generation for learning control policies for complex robot tasks. First, we model a (sub)task as a goal-conditioned Markov Decision Process (MDP), formally represented by a tuple $m = (S, G, A, p, r, \rho_g)$. Here, $S$ is the set of states, $A$ is the set of actions, $p(s' \mid s, a)$ is the transition probability function, $r(s, a, s', g)$ is the reward function, $G$ is the goal space, and $\rho_g$ is the goal distribution. We use subscript $n$ to denote the $n$-th subtask in our curriculum. Then, following [6], we formally define a task curriculum as:
Definition 1 (Task-level Sequence Curriculum [6]): A task-level sequence curriculum can be represented as an ordered list of tasks $C = [m_1, m_2, \ldots, m_N]$ where, if $i \le j$ for some $m_i, m_j \in C$, then the task $m_i$ should be learned before the task $m_j$.
Fig. 2: The curriculum generation LLM receives a natural language curriculum prompt as well as the environment description to generate a sequence of subtasks. Our prompt includes instructions for the curriculum designer, rules for how to describe the subtasks, and other tips on describing the curriculum. The environment description consists of the robot and its state variable descriptions, the target task, and the initial state description.
In our work, we aim to generate a task curriculum $C = [m_1, m_2, \ldots, m_N]$ which helps learn a policy $\pi$ that maximizes the cumulative reward $V_{m_T}(\pi) = \mathbb{E}_\pi \left[ \sum_{t=0}^{H} \gamma^t r(s_t, a_t, s_{t+1}, g) \right]$ associated with the target task $m_T$.
Problem 1: For a target task $m_T = (S, G, A, p, r_T, \rho_{g,T})$ and a task curriculum $C = [m_1, m_2, \ldots, m_N]$, we denote a policy trained with curriculum $C$ as $\pi_C$. Our objective for curriculum design is finding a curriculum $C$ that maximizes the reward on the target task, $\arg\max_C V_{m_T}(\pi_C)$.
We assume the state space $S$, the action space $A$, the transition probability function $p$, and the goal space $G$ are fixed, i.e., they do not change between subtasks. Therefore, we can specify the task curriculum with the sequence of reward functions and goal distributions that are associated with the subtasks, $C^{(r, \rho_g)} = [(r_1, \rho_{g,1}), (r_2, \rho_{g,2}), \ldots, (r_N, \rho_{g,N})]$. Here, we express the reward function and goal distribution tuple $(r, \rho_g)$ as programming code for the simulation environment and define it as a task code. Therefore, CurricuLLM's objective reduces to generating a sequence of task codes $C^{(r, \rho_g)}$ that maximizes the target task performance.
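To make this formulation concrete, the following is a minimal sketch, assuming a Python-based environment interface, of how a task code and a curriculum could be represented in code; the type aliases and class names are illustrative assumptions, not CurricuLLM's actual data structures.

import numpy as np
from dataclasses import dataclass
from typing import Callable, List

# A task code pairs a reward function r(s, a, s', g) with a goal
# distribution rho_g from which goals are sampled at reset time.
# All names here are hypothetical illustrations of the formulation.
RewardFn = Callable[[np.ndarray, np.ndarray, np.ndarray, np.ndarray], float]
GoalSampler = Callable[[], np.ndarray]

@dataclass
class TaskCode:
    reward_fn: RewardFn        # r_n: the subtask's reward function
    goal_sampler: GoalSampler  # rho_{g,n}: samples goals for the subtask

# A task curriculum is then an ordered list of task codes,
# C^{(r, rho_g)} = [(r_1, rho_{g,1}), ..., (r_N, rho_{g,N})].
Curriculum = List[TaskCode]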
IV. METHOD
Even for LLMs encapsulating world knowledge, directly generating the sequence of task codes $C^{(r, \rho_g)}$ can be challenging. Since LLMs are known to show better reasoning capability when following step-by-step instructions [19], we divide our curriculum generation into three main modules (see Figure 1):
• Curriculum Design (Step 1): A curriculum generation LLM receives natural language descriptions of the robot, environment, and target task to generate a sequence of language descriptions $C_l = [l_1, l_2, \ldots, l_N]$ of the task sequence curriculum $C$.
• Task Code Sampling (Step 2): A task code generation LLM generates $K$ task code candidates $(r_n^k, \rho_{g,n}^k)$, $k \in \{1, 2, \ldots, K\}$, for the given subtask description $l_n$. These are in the form of executable code and are used to fine-tune the policy trained for the previous subtask.
• Optimal Policy Selection (Step 3): An evaluation LLM evaluates the policies $\pi_n^k$, $k \in \{1, 2, \ldots, K\}$, trained with the different task code candidates $(r_n^k, \rho_{g,n}^k)$ to identify the policy that best aligns with the current subtask. The selected policy $\pi_n^*$ is used as the pretrained policy for the next subtask.
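Putting these three modules together, the following is a minimal sketch of the outer training loop, assuming the LLM queries, RL fine-tuning, and rollout collection are provided as callables; all names are placeholders, not the actual CurricuLLM API.

from typing import Callable

# Hypothetical sketch of CurricuLLM's outer loop (Steps 1-3).
def curricullm_loop(
    env_description: str,
    target_task: str,
    curriculum_llm: Callable,   # Step 1: descriptions -> subtask list l_1..l_N
    task_code_llm: Callable,    # Step 2: subtask description -> task code
    finetune: Callable,         # RL fine-tuning under a given task code
    rollout: Callable,          # collects trajectory statistics for a policy
    evaluation_llm: Callable,   # Step 3: selects the best-aligned policy
    initial_policy,
    K: int = 5,
    N: int = 5,
):
    # Step 1: generate subtask descriptions l_1, ..., l_N in natural language.
    subtask_descriptions = curriculum_llm(env_description, target_task, N)

    policy = initial_policy
    for l_n in subtask_descriptions:
        # Step 2: sample K candidate task codes (reward + goal distribution).
        candidates = [task_code_llm(l_n, env_description) for _ in range(K)]

        # Fine-tune the policy from the preceding subtask under each candidate.
        trained = [finetune(policy, code) for code in candidates]

        # Step 3: an evaluation LLM compares trajectory rollouts against the
        # subtask description and selects the best-aligned policy.
        rollouts = [rollout(pi) for pi in trained]
        policy = evaluation_llm(l_n, trained, rollouts)

    return policy  # policy sequentially trained up to the target task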
A. Generating a Sequence of Language Descriptions
Leveraging the high-level task planning of LLMs, we initially ask an LLM to generate a series of language descriptions $C_l = [l_1, l_2, \ldots, l_N]$ for a task-level sequence curriculum $C = [m_1, m_2, \ldots, m_N]$ facilitating the learning of a target task $m_T$. Initially, the LLM is provided with a language description of the target task $l_T$ and a language description of the environmental information $l_E$ to generate an environment-specific curriculum. When generating the curriculum, we query the LLM to use the target task $m_T$ as the final task in the curriculum, $m_N$. Moreover, we require the LLM to describe the subtasks using the available state variables. This enables the LLM to generate curricula that are grounded in the environment information and ensures the generation of a reliable reward function later (discussed in Section IV-B). For example, to generate a curriculum for a humanoid to learn running, we query the LLM to generate a curriculum for following a velocity and heading angle command $\{(v_x, v_y, \theta) : -2 \le v_x, v_y \le 2, -\pi \le \theta \le \pi\}$, while providing state variables such as base linear velocity, base angular velocity, or joint angles. The LLM then generates a sequence of subtask descriptions such as (1) Basic Stability Learning: maintain stability by minimizing the joint deviation and height deviation, (2) Learn to Walk: follow low-speed commands of the form $\{(v_x, v_y, \theta) : -1 \le v_x, v_y \le 1, -\pi/2 \le \theta \le \pi/2\}$, and others.
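As a hedged illustration (not verbatim output), a parsed subtask sequence for this humanoid example might look like the structure below; the field names are assumptions about the required output format, modeled on the Task 1 and Task 2 entries shown in Fig. 2.

# Illustrative parsed output of the curriculum generation LLM for the
# humanoid running example; field names are assumptions.
curriculum = [
    {
        "name": "Basic Stability Learning",
        "description": "Maintain stability by minimizing joint and height "
                       "deviation; target linear velocity [0, 0].",
        "reason": "Ensure the robot can maintain balance before moving.",
    },
    {
        "name": "Learn to Walk",
        "description": "Follow low-speed commands with -1 <= vx, vy <= 1 "
                       "and -pi/2 <= theta <= pi/2.",
        "reason": "Introduce locomotion at low speed once balance is learned.",
    },
    # ... further subtasks increasing speed until the target task is reached
]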
B. Task Code Generation
After generating a curriculum with a series of language descriptions $C_l = [l_1, l_2, \ldots, l_N]$, we should translate these language descriptions into a sequence of task codes $C^{(r, \rho_g)} = [(r_1, \rho_{g,1}), (r_2, \rho_{g,2}), \ldots, (r_N, \rho_{g,N})]$. These task codes are described in executable code so that the RL policy can be trained on these subtasks. In the $n$-th subtask which we
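As an illustration of what a sampled task code for the "Learn to Walk" subtask might look like, here is a minimal sketch loosely following the compute_rewards stub shown in Fig. 1; the class, attribute names, and reward terms are assumptions in the style of common legged-robot RL codebases, not the exact CurricuLLM environment API.

import numpy as np

class LearnToWalkTask:
    """Hypothetical task code for the "Learn to Walk" subtask; attribute
    names mimic common legged-robot RL environments and are assumed."""

    def __init__(self):
        self.base_lin_vel = np.zeros(3)   # base linear velocity (x, y, z)
        self.base_height = 0.55           # current base height [m]
        self.nominal_height = 0.55        # nominal standing height [m]
        self.actions = np.zeros(12)       # last joint-level actions
        self.commands = self.sample_goal()

    def compute_rewards(self) -> float:
        # Velocity-tracking reward: follow the commanded planar velocity.
        lin_vel_error = np.sum(np.square(self.commands[:2] - self.base_lin_vel[:2]))
        velocity_reward = np.exp(-lin_vel_error / 0.25)

        # Balance reward: penalize deviation from the nominal base height.
        height_reward = -np.square(self.base_height - self.nominal_height)

        # Small action penalty to encourage smooth motion.
        action_penalty = -0.01 * np.sum(np.square(self.actions))

        return float(velocity_reward + height_reward + action_penalty)

    def sample_goal(self) -> np.ndarray:
        # Goal distribution rho_g for this subtask: low-speed commands
        # (v_x, v_y) in [-1, 1]^2 and heading theta in [-pi/2, pi/2].
        vx, vy = np.random.uniform(-1.0, 1.0, size=2)
        theta = np.random.uniform(-np.pi / 2, np.pi / 2)
        return np.array([vx, vy, theta])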