Overview: This paper presents API-Bank, a new evaluation benchmark designed specifically for tool-augmented large language models (LLMs) to systematically assess their ability to use external tools. By simulating real-world scenarios, the study builds 53 commonly used APIs and a complete interaction workflow, covering every step from deciding whether to call an API to executing the concrete operation. API-Bank contains 264 annotated dialogues involving 568 API calls in total, and aims to test and improve LLMs' ability to solve complex problems through multi-turn dialogue. The study also provides detailed experimental analysis and error categorization: the results show that GPT-4 has stronger planning performance than the earlier GPT-3 and GPT-3.5, but room for improvement remains in areas such as logical judgment, understanding user intent, and coordination across APIs. Even so, the findings reveal the great potential of integrating external tools into language models to meet everyday human needs. Intended audience: researchers and developers interested in large-model training and API interface design and implementation within natural language processing. Usage scenarios and goals: suitable for research institutions and companies evaluating language-model capabilities, to measure how effectively LLMs complete tasks by calling third-party tools; it can also guide future research directions and help raise models' practical value, for example in intelligent assistants and virtual customer service. Additional notes: besides the design rationale and methodology of API-Bank, the paper includes concrete case studies and discussion of technical details.
API-Bank: A Benchmark for Tool-Augmented LLMs
Minghao Li¹, Feifan Song², Bowen Yu¹∗, Haiyang Yu¹, Zhoujun Li³, Fei Huang¹, Yongbin Li¹

¹Alibaba DAMO Academy
²MOE Key Laboratory of Computational Linguistics, Peking University
³Shenzhen Intelligent Strong Technology Co., Ltd.

{lmh397008, yubowen.ybw, yifei.yhy, f.huang, shuide.lyb}@alibaba-inc.com
songff@stu.pku.edu.cn
lizhoujun@aistrong.com

∗Corresponding author.
Abstract
Recent research has shown that Large Language Models (LLMs) can utilize external tools to improve their contextual processing abilities, moving away from the pure language modeling paradigm and paving the way for Artificial General Intelligence. Despite this, there has been a lack of systematic evaluation demonstrating the efficacy of LLMs using tools to respond to human instructions. This paper presents API-Bank, the first benchmark tailored for Tool-Augmented LLMs. API-Bank includes 53 commonly used API tools, a complete Tool-Augmented LLM workflow, and 264 annotated dialogues that encompass a total of 568 API calls. These resources have been designed to thoroughly evaluate LLMs' ability to plan step-by-step API calls, retrieve relevant APIs, and correctly execute API calls to meet human needs. The experimental results show that the ability to use tools emerges in GPT-3.5 relative to GPT-3, while GPT-4 has stronger planning performance. Nevertheless, there remains considerable scope for further improvement when compared to human performance. Additionally, detailed error analysis and case studies demonstrate the feasibility of Tool-Augmented LLMs for daily use, as well as the primary challenges that future research needs to address.
1 Introduction
Over the past several years, significant progress has been made in the development of large language models (LLMs), including GPT-3 (Brown et al., 2020), Codex (Chen et al., 2021), ChatGPT, and the impressive GPT-4 (Bubeck et al., 2023). These models exhibit increasingly human-like capabilities, such as powerful conversation, in-context learning, and code generation across a wide range of open-domain tasks. Some researchers even believe that LLMs could provide a gateway to Artificial General Intelligence (Bubeck et al., 2023).
Despite their usefulness, however, LLMs are still limited as they can only learn from their training data (Brown et al., 2020). This information can become outdated and may not be suitable for all applications (Trivedi et al., 2022; Mialon et al., 2023). Consequently, there has been a surge in research aimed at augmenting LLMs with the ability to use external tools to access up-to-date information (Izacard et al., 2022), perform computations (Schick et al., 2023), and interact with third-party services (Liang et al., 2023) in response to user requests. Tool use has traditionally been viewed as uniquely human behavior, and the emergence of tool use has been considered a significant milestone in primate evolution, even serving to demarcate the appearance of the genus Homo (Ambrose, 2001). Analogous to the timeline of human evolution, we believe that at this current juncture we must address two key questions: (1) How effective are current LLMs in using tools? (2) What are the remaining obstacles for LLMs to use tools?
In this paper, we introduce API-Bank, the first systematic benchmark for evaluating Tool-Augmented LLMs' ability to use tools. We imagine a vision where, with access to a global repository of tools, LLMs can aid humans in planning a requirement by outlining all the steps necessary to achieve it. Subsequently, they will retrieve the needed tools from the tool pool and, through possibly multiple rounds of API calls, fulfill the human requirement, thus becoming truly helpful and all-knowing. To achieve this goal, we first simulate the real-world scenario by creating 53 commonly used tools, such as SearchEngine, PlayMusic, BookHotel, and ImageCaption, and organize them in an API Pool for LLMs to call. We then propose a complete workflow for LLMs to use these tools, which includes determining whether to call an API, which API to call, generating an API call, and self-assessing the correctness of the API call.
[Figure 1 flowchart: after each user statement, an "API Call?" decision routes either to a regular reply or to the API Pool, which covers Account Management, Information Retrieval, Health Management, Entertainment, Travel, Schedule Management, Smart Home, and Finance Management. A "Found?" check follows the keyword search; the call is then executed against the External Service, and a "Satisfied?" check decides between replying, retrying, or giving up.]
Figure 1: The proposed Tool-Augmented LLMs paradigm.
correctness of the API call. We further manually
review 264 dialogues that contain 568 API calls,
along with an automated scoring script to fairly
evaluate each LLM’s performance in using tools.
Similar to our vision, we divide all dialogues into
three levels. Level-1 evaluates the LLM’s ability
to call the API. Given an API’s description, the
model needs to determine whether to call the API,
call it correctly, and respond appropriately to its
return. Level-2 further assesses the LLM’s ability
to
retrieve
the API. LLMs must search for possi-
ble APIs that may solve the user’s requirement and
learn how to use the API. Level-3 examines the
LLM’s ability to
plan
API beyond retrieve and call.
In this level, the user’s requirement may be unclear
and require multiple API steps to solve. For exam-
ple, "I want to travel from Shanghai to Beijing for
a week starting tomorrow. Help me plan the travel
route and book flights, tickets, and hotels." LLMs
must infer a reasonable travel plan and call flight,
hotel, and ticket booking APIs based on the plan,
taking into account compatibility issues with time.
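To make the three levels concrete, here is a minimal Python sketch of how a level-1 evaluation item could be represented. The field names (api_description, dialogue, expected_call) and the date value are our own illustration, not the benchmark's actual annotation schema.

# A hypothetical level-1 item: the API description is given, and the model
# must decide to call the API, produce a correct call, and use its return.
# All field names here are illustrative, not API-Bank's released format.
level1_item = {
    "api_description": {
        "name": "BookHotel",
        "parameters": {"city": "str", "check_in": "date", "nights": "int"},
    },
    "dialogue": [
        {"role": "user",
         "content": "Book me a hotel in Beijing for 7 nights from tomorrow."},
    ],
    "expected_call": 'BookHotel(city="Beijing", check_in="2023-04-15", nights=7)',
}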
On our constructed API-Bank benchmark, we conduct an experimental analysis, for the first time, of the effectiveness of popular LLMs in utilizing API tools. Our findings suggest that calling APIs is an emergent ability that shares similarities with math word problems (Wei et al., 2022). Specifically, we observe that GPT-3-Davinci struggles to call APIs correctly even at the simplest level-1. With GPT-3.5-Turbo, however, the correctness of API calls dramatically improves, reaching around a 50% success rate. Moving to level-2, which involves API retrieval, GPT-3.5-Turbo achieves a 40% success rate. When it comes to level-3, which requires API planning, GPT-3.5-Turbo encounters numerous errors, necessitating an average of 9.9 rounds of dialogue to complete user requests, approximately 38% more than what is required by GPT-4. Meanwhile, it should be noted that GPT-4 remains imperfect, as it utilizes approximately 35% more conversation rounds in API planning than humans. We also provide a detailed error analysis to summarize the obstacles faced by LLMs when using tools, which include refusing to make API calls despite explicit instructions in the prompt and generating non-existent API calls. Overall, our study sheds light on the potential of LLMs to utilize API tools and highlights the challenges that need to be addressed in future research.
                     Level-1   Level-2   Level-3
Num. of dialogues      214        50         8
Num. of API calls      399       135        34

Table 1: The statistics of API-Bank.
2 Tool-Augmented LLMs Paradigm
Existing works on Tool-Augmented LLMs usually teach the language model to use tools in two different ways: in-context learning and fine-tuning. The former shows the model the instructions and examples for all the candidate tools, which can extend a general model directly but is limited by the context length. The latter fine-tunes the language model on annotated data, which has no length problem but can damage the robustness of the model. In this work, we mainly focus on in-context learning and address its shortcoming of limited context length.
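As a rough illustration of why naive in-context tool use hits the context limit, the sketch below contrasts stuffing every tool's documentation into the prompt with describing only a search engine. The prompt wording and the ApiSearch name are our own assumptions, not the paper's actual prompts.

# Illustrative only: naive in-context tool use concatenates every API's
# documentation, so the prompt grows linearly with the number of tools.
def naive_prompt(api_docs: list[str], user_msg: str) -> str:
    return "You may call these APIs:\n" + "\n".join(api_docs) + f"\nUser: {user_msg}"

# The paradigm in this paper instead describes only an API search engine,
# keeping the prompt size constant regardless of how many tools exist.
def search_based_prompt(user_msg: str) -> str:
    return (
        "You can find tools with ApiSearch(keywords), which returns the "
        "documentation of the best-matching API.\n"
        f"User: {user_msg}"
    )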
To address this issue, we design a new paradigm that may be the only feasible way to use a large number of tools under the context-length limit. Figure 1 shows the flowchart of the proposed paradigm. The example depicts a chatbot, but the paradigm can be generalized to any generative-model application.
In the proposed paradigm, there is an API Pool containing various APIs covering different aspects of life, as well as a keyword-based API search engine to help the language model find APIs. Before starting, the model is given a prompt that explains the whole process and its task, as well as how to use the API search engine.

Along the longest path of the flowchart, the model needs to make several judgments (the diamonds in Figure 1), as follows:
API Call
After each user statement, the model needs to determine whether an API call is required to access an external service, which requires the ability to know the boundaries of its own knowledge and to recognize the need for outside action. This judgment leads to two options: a regular reply, or starting the API call process. In a regular reply, the model can chat with the user, or try to figure out the user's needs and plan how to complete them. If the model already understands the user's needs and decides to start the API call process, it continues with the subsequent steps.
Find the Right API
To respect the model's input limitations, the model is given only the instructions for the API search engine at the beginning, without any specific API introduction. An API search is required before every specific API call. When performing an API search, the model should summarize the user's demand into a few keywords. The API search engine then looks up the API pool, finds the best match, and returns the related documentation to help the model understand how to use it. The retrieved API may not be what the model needs, so the model has to decide whether to modify the keywords and search again, or give up on the API call and reply directly.
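A minimal sketch of such a keyword-based search engine follows, scoring each API by keyword overlap with its documentation. The ApiDoc structure and the overlap scoring are our assumptions, since the paper does not specify the matching algorithm.

from dataclasses import dataclass

@dataclass
class ApiDoc:
    name: str
    description: str  # natural-language documentation shown to the model

def search_api(keywords: list[str], pool: list[ApiDoc]) -> ApiDoc | None:
    """Return the API whose documentation overlaps most with the keywords.

    A stand-in for the paper's keyword-based search engine; the real
    matcher is unspecified, so simple token overlap is assumed here.
    """
    def score(doc: ApiDoc) -> int:
        text = (doc.name + " " + doc.description).lower()
        return sum(kw.lower() in text for kw in keywords)

    best = max(pool, key=score, default=None)
    return best if best is not None and score(best) > 0 else None

pool = [ApiDoc("BookHotel", "Book a hotel room in a given city."),
        ApiDoc("PlayMusic", "Play a song by title or artist.")]
print(search_api(["hotel", "Beijing"], pool).name)  # BookHotel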
Reply after API Call
After completing the API call and obtaining the returned results, the model needs to act on them. If the returned results are as expected, the model can reply to the user based on the results. If the API call raises an exception, or the model is not satisfied with the results, the model can either refine the call and try again based on the returned information, or give up on the API call and reply directly.
The pseudo-code description of the complete API call procedure is given in Algorithm 1.

Algorithm 1: API call process

Input: us ← UserStatement
if an API call is needed then
    while API not found do
        keywords ← summarize(us)
        api ← search(keywords)
        if give up then
            break
        end if
    end while
    if API found then
        api_doc ← api.documentation
        while response not satisfied do
            api_call ← gen_api_call(api_doc, us)
            api_re ← execute_api_call(api_call)
            if give up then
                break
            end if
        end while
    end if
end if
if api_re is available then
    re ← generate_response(api_re)
else
    re ← generate_response()
end if
Output: re  (the response to the user)
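For concreteness, here is a minimal Python sketch of the same control flow. The helper names (needs_api_call, summarize, gen_api_call, is_satisfactory, generate_response) are hypothetical stand-ins for LLM-backed judgments, and the retry bound is our assumption: the paper leaves the "give up" decision to the model itself.

from typing import Any, Callable

def handle_user_statement(
    us: str,
    llm: Any,                       # hypothetical wrapper exposing the helpers below
    search: Callable,               # keyword API search (e.g., search_api above)
    execute: Callable[[str], Any],  # dispatches a generated call to the API Pool
    max_tries: int = 3,             # assumed bound; the paper leaves "give up" to the model
) -> str:
    """Control flow of Figure 1 / Algorithm 1 (illustrative sketch).

    `llm` is assumed to expose needs_api_call(us), summarize(us),
    gen_api_call(doc, us), is_satisfactory(result), and
    generate_response(us, result); none of these names come from the paper.
    """
    api_result = None
    if llm.needs_api_call(us):                 # "API Call?" diamond
        api = None
        for _ in range(max_tries):             # "Found?" loop
            api = search(llm.summarize(us))
            if api is not None:
                break
        if api is not None:
            for _ in range(max_tries):         # "Satisfied?" loop
                call = llm.gen_api_call(api.description, us)
                api_result = execute(call)
                if llm.is_satisfactory(api_result):
                    break
    # Reply, grounded in the API result when one was obtained.
    return llm.generate_response(us, api_result)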
3 Benchmark Construction
3.1 System Design
The evaluation system mainly contains 53 APIs, their supporting databases, and a ToolManager; the complete list of APIs is given in Appendix A. The built system covers the most common needs of life and work. Other types of AI models are also abstracted in the form of APIs that the LLMs can use, which extends the models' capabilities in specific aspects. In addition, some operating-system interfaces are included to allow models to control external applications.
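The paper does not show the ToolManager's interface here; as one plausible reading, a minimal registry that stores each tool's documentation for search and dispatches executed calls might look like the sketch below. All names in it are our own illustration, not the released implementation.

class ToolManager:
    """Hypothetical registry/dispatcher for the 53 APIs (illustrative only)."""

    def __init__(self):
        self._tools = {}  # name -> (callable, documentation)

    def register(self, name: str, fn, doc: str) -> None:
        self._tools[name] = (fn, doc)

    def documentation(self, name: str) -> str:
        return self._tools[name][1]

    def execute(self, name: str, **kwargs):
        """Run a tool; exceptions propagate so the model can retry or give up."""
        fn, _ = self._tools[name]
        return fn(**kwargs)

manager = ToolManager()
manager.register(
    "BookHotel",
    lambda city, check_in, nights: {"status": "ok", "booking_id": 42},
    "BookHotel(city, check_in, nights): book a hotel room.",
)
print(manager.execute("BookHotel", city="Beijing", check_in="2023-04-15", nights=7))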