Large Language Models are Zero-Shot Reasoners
Takeshi Kojima
The University of Tokyo
t.kojima@weblab.t.u-tokyo.ac.jp
Shixiang Shane Gu
Google Research, Brain Team
Machel Reid∗
Google Research
Yutaka Matsuo
The University of Tokyo
Yusuke Iwasawa
The University of Tokyo
Abstract
Pretrained large language models (LLMs) are widely used in many sub-fields of
natural language processing (NLP) and generally known as excellent few-shot
learners with task-specific exemplars. Notably, chain of thought (CoT) prompting,
a recent technique for eliciting complex multi-step reasoning through step-by-
step answer examples, achieved state-of-the-art performance in arithmetic
and symbolic reasoning, difficult system-2 tasks that do not follow the standard
scaling laws for LLMs. While these successes are often attributed to LLMs’
ability for few-shot learning, we show that LLMs are decent zero-shot reasoners
by simply adding “Let’s think step by step” before each answer. Experimental
results demonstrate that our Zero-shot-CoT, using the same single prompt template,
significantly outperforms zero-shot LLM performances on diverse benchmark
reasoning tasks including arithmetic (MultiArith, GSM8K, AQUA-RAT, SVAMP),
symbolic reasoning (Last Letter, Coin Flip), and other logical reasoning tasks (Date
Understanding, Tracking Shuffled Objects), without any hand-crafted few-shot
examples, e.g. increasing the accuracy on MultiArith from 17.7% to 78.7% and
GSM8K from 10.4% to 40.7% with large-scale InstructGPT model (text-davinci-
002), as well as similar magnitudes of improvements with another off-the-shelf
large model, 540B parameter PaLM. The versatility of this single prompt across
very diverse reasoning tasks hints at untapped and understudied fundamental
zero-shot capabilities of LLMs, suggesting high-level, multi-task broad cognitive
capabilities may be extracted by simple prompting. We hope our work not only
serves as a minimal yet strong zero-shot baseline for the challenging reasoning
benchmarks, but also highlights the importance of carefully exploring and analyzing
the enormous zero-shot knowledge hidden inside LLMs before crafting finetuning
datasets or few-shot exemplars.
1 Introduction
Scaling up the size of language models has been a key ingredient of recent revolutions in natural
language processing (NLP) [Vaswani et al., 2017, Devlin et al., 2019, Raffel et al., 2020, Brown et al.,
2020, Thoppilan et al., 2022, Rae et al., 2021, Chowdhery et al., 2022]. The success of large language
models (LLMs) is often attributed to (in-context) few-shot or zero-shot learning: the model solves various tasks simply by being conditioned on a few examples (few-shot) or on instructions describing the
task (zero-shot). The method of conditioning the language model is called “prompting” [Liu et al.,
2021b], and designing prompts either manually [Schick and Schütze, 2021, Reynolds and McDonell,
2021] or automatically [Gao et al., 2021, Shin et al., 2020] has become a hot topic in NLP.
∗Work done while at The University of Tokyo.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
(a) Few-shot
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: The answer is 11.
Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A:
(Output) The answer is 8. ✗

(b) Few-shot-CoT
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A:
(Output) The juggler can juggle 16 balls. Half of the balls are golf balls. So there are 16 / 2 = 8 golf balls. Half of the golf balls are blue. So there are 8 / 2 = 4 blue golf balls. The answer is 4. ✓

(c) Zero-shot
Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A: The answer (arabic numerals) is
(Output) 8 ✗

(d) Zero-shot-CoT (Ours)
Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A: Let’s think step by step.
(Output) There are 16 balls in total. Half of the balls are golf balls. That means that there are 8 golf balls. Half of the golf balls are blue. That means that there are 4 blue golf balls. ✓

Figure 1: Example inputs and outputs of GPT-3 with (a) standard Few-shot [Brown et al., 2020], (b) Few-shot-CoT [Wei et al., 2022], (c) standard Zero-shot, and (d) ours (Zero-shot-CoT). Similar to Few-shot-CoT, Zero-shot-CoT facilitates multi-step reasoning (highlighted in blue in the original figure) and reaches the correct answer where standard prompting fails. Unlike Few-shot-CoT, which uses step-by-step reasoning examples per task, ours does not need any examples and just uses the same prompt “Let’s think step by step” across all tasks (arithmetic, symbolic, commonsense, and other logical reasoning tasks).
In contrast to the excellent performance of LLMs in intuitive and single-step system-1 [Stanovich
and West, 2000] tasks with task-specific few-shot or zero-shot prompting [Liu et al., 2021b], even
language models at the scale of 100B or more parameters had struggled on system-2 tasks requiring
slow and multi-step reasoning [Rae et al., 2021]. To address this shortcoming, Wei et al. [2022],
Wang et al. [2022] have proposed chain of thought (CoT) prompting, which feeds LLMs
step-by-step reasoning examples rather than standard question-and-answer examples (see Fig. 1-a).
Such chain of thought demonstrations help models generate a reasoning path that decomposes
complex reasoning into multiple easier steps. Notably, with CoT the reasoning performance then
satisfies the scaling laws better and jumps up with the size of the language models. For example,
when combined with the 540B parameter PaLM model [Chowdhery et al., 2022], chain of thought
prompting significantly increases the performance over standard few-shot prompting across several
benchmark reasoning tasks, e.g., GSM8K (17.9% → 58.1%).
While the successes of CoT prompting [Wei et al., 2022], along with those of many other task-specific
prompting work [Gao et al., 2021, Schick and Schütze, 2021, Liu et al., 2021b], are often attributed
to LLMs’ ability for few-shot learning [Brown et al., 2020], we show that LLMs are decent zero-shot
reasoners by adding a simple prompt, Let’s think step by step, to facilitate step-by-step thinking before
answering each question (see Figure 1). Despite the simplicity, our Zero-shot-CoT successfully
generates a plausible reasoning path in a zero-shot manner and reaches the correct answer in a
problem where the standard zero-shot approach fails. Importantly, our Zero-shot-CoT is versatile and
task-agnostic, unlike most prior task-specific prompt engineering in the forms of examples (few-shot)
or templates (zero-shot) [Liu et al., 2021b]: it can facilitate step-by-step answers across various
reasoning tasks, including arithmetic (MultiArith [Roy and Roth, 2015], GSM8K [Cobbe et al., 2021],
AQUA-RAT [Ling et al., 2017], and SVAMP [Patel et al., 2021]), symbolic reasoning (Last letter and
Coin flip), commonsense reasoning (CommonSenseQA [Talmor et al., 2019] and Strategy QA [Geva
et al., 2021]), and other logical reasoning tasks (Date understanding and Tracking Shuffled Objects
from BIG-bench [Srivastava et al., 2022]) without modifying the prompt per task.
We empirically evaluate Zero-shot-CoT against other prompting baselines in Table 2. While our
Zero-shot-CoT underperforms Few-shot-CoT with carefully-crafted and task-specific step-by-step ex-
amples, Zero-shot-CoT achieves enormous score gains compared to the zero-shot baseline, e.g. from
17.7% to 78.7% on MultiArith and from 10.4% to 40.7% on GSM8K with large-scale InstructGPT
model (text-davinci-002). We also evaluate Zero-shot-CoT with another off-the-shelf large model,
540B parameter PaLM, showing similar magnitudes of improvements on MultiArith and GSM8K.
Importantly, with our single fixed prompt, zero-shot LLMs have a significantly better scaling curve
comparable to that of the Few-shot-CoT baseline. We also show that, besides requiring human
engineering of multi-step reasoning prompts, Few-shot-CoT deteriorates in performance when the
prompt example question types and the task question type are mismatched, suggesting high sensitivity to per-task prompt
designs. In contrast, the versatility of this single prompt across diverse reasoning tasks hints at
untapped and understudied fundamental zero-shot capabilities of LLMs, such as higher-level broad
cognitive capabilities like generic logical reasoning [Chollet, 2019]. While the vibrant field of LLMs
started out from the premise of excellent few-shot learners [Brown et al., 2020], we hope our work
encourages more research into uncovering high-level and multi-task zero-shot capabilities hidden
inside those models.
2 Background
We briefly review the two core preliminary concepts that form the basis of this work: the advent of
large language models (LLMs) and prompting, and chain of thought (CoT) prompting for multi-step
reasoning.
Large language models and prompting
A language model (LM) is a model that estimates
the probability distribution over text. Recently, scaling improvements through larger model sizes
(from a few million [Merity et al., 2016] to hundreds of millions [Devlin et al., 2019] to hundreds of
billions [Brown et al., 2020] parameters) and larger data (e.g. webtext corpora [Gao et al., 2020])
have enabled pre-trained large language models (LLMs) to be incredibly adept at many downstream
NLP tasks. Besides the classic “pre-train and fine-tune” paradigm [Liu et al., 2021b], models scaled
to 100B+ parameters exhibit properties conducive to few-shot learning [Brown et al., 2020], by way
of in-context learning, where one can use a text or template known as a prompt to strongly guide the
generation to output answers for desired tasks, thus beginning an era of “pre-train and prompt” [Liu
et al., 2021a]. In this work, we call prompts that explicitly condition on a few task examples few-shot prompts, and template-only prompts zero-shot prompts.
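To make the distinction concrete, the sketch below writes out both prompt styles for the juggler question from Figure 1; the exact string formatting is our own illustration, not prescribed by the paper.

# Few-shot prompt: explicit conditioning on a worked exemplar,
# versus a zero-shot prompt that contains only the task template.

question = (
    "A juggler can juggle 16 balls. Half of the balls are golf balls, "
    "and half of the golf balls are blue. How many blue golf balls are there?"
)

few_shot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: The answer is 11.\n\n"   # in-context exemplar
    f"Q: {question}\nA:"         # the model completes from here
)

zero_shot_prompt = f"Q: {question}\nA: The answer (arabic numerals) is"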
Chain of thought prompting
Multi-step arithmetic and logical reasoning benchmarks have par-
ticularly challenged the scaling laws of large language models [Rae et al., 2021]. Chain of thought
(CoT) prompting [Wei et al., 2022], an instance of few-shot prompting, proposed a simple solution
by modifying the answers in few-shot examples to step-by-step answers, and achieved significant
boosts in performance across these difficult benchmarks, especially when combined with very large
language models like PaLM [Chowdhery et al., 2022]. The top row of Figure 1 shows standard
few-shot prompting against (few-shot) CoT prompting. Notably, few-shot learning was taken as a
given for tackling such difficult tasks, and the zero-shot baseline performances were not even reported
in the original work [Wei et al., 2022]. To differentiate it from our method, we refer to the approach of Wei et al. [2022] as Few-shot-CoT in this work.
3 Zero-shot Chain of Thought
We propose Zero-shot-CoT, a zero-shot, template-based prompting method for chain of thought reasoning.
It differs from the original chain of thought prompting [Wei et al., 2022] as it does not require
step-by-step few-shot examples, and it differs from most of the prior template prompting [Liu et al.,
2021b] as it is inherently task-agnostic and elicits multi-hop reasoning across a wide range of tasks
with a single template. The core idea of our method is simple, as described in Figure 1: add Let’s
think step by step, or similar text (see Table 4), to extract step-by-step reasoning.
3.1 Two-stage prompting
While Zero-shot-CoT is conceptually simple, it uses prompting twice to extract both reasoning and
answer, as explained in Figure 2. In contrast, the zero-shot baseline (see the bottom-left in Figure 1)
already uses prompting in the form of “The answer is”, to extract the answers in correct formats.
Few-shot prompting, standard or CoT, avoids needing such answer-extraction prompting by explicitly
designing the few-shot example answers to end in such formats (see the top-right and top-left in Figure 1). In summary, Few-shot-CoT [Wei et al., 2022] requires careful human engineering of a few prompt examples with specific answer formats per task, while Zero-shot-CoT requires less engineering but requires prompting LLMs twice.

【1st prompt】 Reasoning Extraction
Q: On average Joe throws 25 punches per minute. A fight lasts 5 rounds of 3 minutes. How many punches did he throw?
A: Let’s think step by step.
(LLM output) In one minute, Joe throws 25 punches. In three minutes, Joe throws 3 * 25 = 75 punches. In five rounds, Joe throws 5 * 75 = 375 punches.

【2nd prompt】 Answer Extraction
Q: On average Joe throws 25 punches per minute. A fight lasts 5 rounds of 3 ...
A: Let’s think step by step. In one minute, Joe throws 25 punches. ... In five rounds, Joe throws 5 * 75 = 375 punches.
Therefore, the answer (arabic numerals) is
(LLM output) 375.

Figure 2: Full pipeline of Zero-shot-CoT as described in § 3: we first use the first “reasoning” prompt to extract a full reasoning path from a language model, and then use the second “answer” prompt to extract the answer in the correct format from the reasoning text.
1st prompt: reasoning extraction
In this step, we first modify the input question x into a prompt x′ using a simple template “Q: [X]. A: [T]”, where [X] is an input slot for x and [T] is a slot for a hand-crafted trigger sentence t that would extract a chain of thought to answer the question x. For example, if we use “Let’s think step by step” as a trigger sentence, the prompt x′ would be “Q: [X]. A: Let’s think step by step.”. See Table 4 for more trigger examples. The prompted text x′ is then fed into a language model, which generates a subsequent sentence z. We can use any decoding strategy, but we used greedy decoding throughout the paper for simplicity.
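As a minimal sketch, the first-stage prompt construction amounts to filling this template; the function name below is our own illustration, not from the paper.

def build_reasoning_prompt(question: str,
                           trigger: str = "Let's think step by step.") -> str:
    """Instantiate the template "Q: [X]. A: [T]" with question x and trigger t."""
    return f"Q: {question}\nA: {trigger}"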
2nd prompt: answer extraction
In the second step, we use the generated sentence z together with the prompted sentence x′ to extract the final answer from the language model. Concretely, we simply concatenate three elements as “[X′] [Z] [A]”: [X′] for the 1st prompt x′, [Z] for the sentence z generated in the first step, and [A] for a trigger sentence that extracts the answer. The prompt for this step is self-augmented, since it contains the sentence z generated by the same language model. In experiments, we use a slightly different answer trigger depending on the answer format; for example, we use “Therefore, among A through E, the answer is” for multiple-choice QA, and “Therefore, the answer (arabic numerals) is” for math problems requiring a numerical answer. See Appendix A.5 for the list of answer trigger sentences. Finally, the language model is fed the prompted text as input to generate a sentence ŷ, from which we parse the final answer. See “Answer Cleansing” in § 4 for parser details.
4 Experiment
Tasks and datasets
We evaluate our proposal on 12 datasets from four categories of reasoning
tasks: arithmetic, commonsense, symbolic, and other logical reasoning tasks. See Appendix A.2 for
a detailed description of each dataset.
For arithmetic reasoning, we consider the following six datasets: (1) SingleEq [Koncel-Kedziorski
et al., 2015], (2) AddSub [Hosseini et al., 2014], (3) MultiArith [Roy and Roth, 2015], (4) AQUA-
RAT [Ling et al., 2017], (5) GSM8K [Cobbe et al., 2021], and (6) SVAMP [Patel et al., 2021]. The
first three are from the classic Math World Problem Repository [Koncel-Kedziorski et al., 2016],
and the last three are from more recent benchmarks. SingleEq and AddSub contain easier problems,
which do not require multi-step calculation. MultiArith, AQUA-RAT, GSM8K, and
SVAMP are more challenging datasets that require multi-step reasoning to solve.
For commonsense reasoning, we use CommonsenseQA [Talmor et al., 2019] and StrategyQA [Geva
et al., 2021]. CommonsenseQA asks questions with complex semantics that often require reasoning
4
based on prior knowledge [Talmor et al., 2019]. StrategyQA requires models to perform implicit multi-hop reasoning to answer questions [Geva et al., 2021].
For symbolic reasoning, we use Last Letter Concatenation and Coin Flip [Wei et al., 2022]. Last Letter Concatenation asks the model to concatenate the last letters of each word; we used four randomly selected names per sample. Coin Flip asks the model to answer whether a coin is still heads up after people either flip or do not flip it; we created samples with four flip-or-not-flip trials. Although these tasks are easy for humans, LMs typically exhibit a flat scaling curve on them.
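Since the symbolic-reasoning samples are procedurally generated, a Coin Flip instance can be reconstructed roughly as below; this is our reading of the described format, not the authors' generation script.

import random

def make_coin_flip_sample(names, n_people=4):
    """Generate one Coin Flip question with four flip-or-not-flip trials
    (our reconstruction of the format described in Wei et al. [2022])."""
    heads_up = True
    steps = []
    for name in random.sample(names, n_people):
        flips = random.random() < 0.5
        heads_up ^= flips  # a flip toggles the coin's state
        steps.append(f"{name} {'flips' if flips else 'does not flip'} the coin.")
    question = ("A coin is heads up. " + " ".join(steps) +
                " Is the coin still heads up?")
    return question, "yes" if heads_up else "no"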
For other logical reasoning tasks, we choose two evaluation sets from the BIG-bench effort [Srivastava et al., 2022]: Date Understanding² and Tracking Shuffled Objects. Date Understanding asks models to infer the date from a context. Tracking Shuffled Objects tests a model’s ability to infer the final state of objects given their initial states and a sequence of object shuffles. We used a dataset of tracking three shuffled objects for our experiment.
Models
We experiment with 17 models in total. Main experiments are conducted with Instruct-GPT3 [Ouyang et al., 2022] (text-ada/babbage/curie/davinci-001 and text-davinci-002)³, original GPT-3 [Brown et al., 2020] (ada, babbage, curie, and davinci)⁴, and PaLM [Chowdhery et al., 2022] (8B, 62B, and 540B). In addition, we used GPT-2 [Radford et al., 2019], GPT-Neo [Black et al., 2021], GPT-J [Wang and Komatsuzaki, 2021], T0 [Sanh et al., 2022], and OPT [Zhang et al., 2022] for the model scaling study. The sizes of the LMs range from 0.3B to 540B parameters. We include both standard models (e.g. GPT-3 and OPT) and instruction-following variants (e.g. Instruct-GPT3 and T0). See Appendix A.3 for model description details. Unless otherwise stated, we use text-davinci-002 throughout the experiments.
Baselines
We compare our Zero-shot-CoT mainly to standard Zero-shot prompting to verify the
effectiveness of its chain of thought reasoning. For Zero-shot experiments, answer prompts similar to those of Zero-shot-CoT are used by default. See Appendix A.5 for details. To better evaluate the zero-shot
ability of LLMs on reasoning tasks, we also compare our method to Few-shot and Few-shot-CoT
baselines from [Wei et al., 2022], using the same in-context examples. Throughout the experiments,
we use greedy decoding across all the methods. For the zero-shot approaches, the results are therefore
deterministic. For the few-shot approaches, since the order of in-context examples could affect the
results [Lu et al., 2022], we run each experiment only once with a fixed seed across all methods and
datasets, for fair comparisons with the zero-shot methods. Wei et al. [2022] showed that the order of
examples did not cause large variance in CoT experiments.
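For reference, greedy decoding with the OpenAI API of that era corresponds to temperature 0. The wrapper below is a sketch of such a call (usable as the `complete` callable in the § 3 pipeline sketch); the max_tokens value is our assumption, not a setting reported in the paper.

import openai  # legacy (pre-1.0) OpenAI Python client, as used circa 2022

def complete(prompt: str) -> str:
    """Greedy (temperature 0) completion; zero-shot results are thus deterministic."""
    response = openai.Completion.create(
        engine="text-davinci-002",  # the paper's main model
        prompt=prompt,
        max_tokens=256,             # our assumption, not from the paper
        temperature=0,              # greedy decoding
    )
    return response["choices"][0]["text"].strip()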
Answer cleansing
After the model outputs text via answer extraction (see § 3 and Figure 2), our method picks out only the part of the answer text that first satisfies the answer format. For example, if the answer prompting outputs “probably 375 and 376” on arithmetic tasks, we extract the first number, “375”, and set it as the model prediction. In the case of multiple choice, the first capital letter we encounter is set as the prediction. See Appendix A.6 for more details. The standard Zero-shot method
follows the same idea. For Few-shot and Few-shot-CoT methods, we follow [Wang et al., 2022] and
first extract the answer text after "The answer is " from the model output, and apply the same answer
cleansing to parse the answer text. If “The answer is” is not found in the model output, we search
from the back of the text and set the first text that satisfies the answer format as the prediction.
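As a concrete sketch of the cleansing rules described above (our reconstruction, not the authors' released parser):

import re
from typing import Optional

def cleanse_numeric_answer(text: str) -> Optional[str]:
    """Pick the first number in the answer text, e.g. "probably 375 and 376" -> "375"."""
    match = re.search(r"-?\d[\d,]*(?:\.\d+)?", text)
    return match.group(0).replace(",", "") if match else None

def cleanse_choice_answer(text: str) -> Optional[str]:
    """Pick the first capital letter (A-E) for multiple-choice answers."""
    match = re.search(r"[A-E]", text)
    return match.group(0) if match else None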
4.1 Results
Zero-shot-CoT vs. Zero-shot
Table 1 summarizes the accuracy of our method (Zero-shot-CoT) and of standard zero-shot prompting (Zero-shot) for each dataset. Zero-shot-CoT substantially outperforms Zero-shot on four of the six arithmetic reasoning tasks (MultiArith, GSM8K, AQUA, SVAMP), on all symbolic reasoning tasks, and on all other logical reasoning tasks (from BIG-bench [Srivastava et al., 2022]).
² While prior work [Wei et al., 2022] categorized the Date Understanding task as commonsense reasoning, our study categorizes it as logical reasoning because the task requires less prior knowledge and more logical reasoning between dates.
³ Our experiments with Instruct GPT-3 models include both text-****-001 and text-davinci-002. Text-davinci-002 differs from text-****-001 in the fine-tuning data, which depends on the date range of data collected from the APIs: text-davinci-002 uses data up to June 2021, while text-****-001 uses data up to October 2019 (see https://beta.openai.com/docs/engines/gpt-3).
⁴ Our experiments with the GPT-3 series were conducted using the OpenAI API between April 2022 and May 2022, except for Nos. 10–16 in Table 4, which were run in August 2022.