Large Language Models are Zero-Shot Reasoners
Takeshi Kojima
The University of Tokyo
t.kojima@weblab.t.u-tokyo.ac.jp
Shixiang Shane Gu
Google Research, Brain Team
Machel Reid∗
Google Research
Yutaka Matsuo
The University of Tokyo
Yusuke Iwasawa
The University of Tokyo
Abstract
Pretrained large language models (LLMs) are widely used in many sub-fields of
natural language processing (NLP) and generally known as excellent few-shot
learners with task-specific exemplars. Notably, chain of thought (CoT) prompting,
a recent technique for eliciting complex multi-step reasoning through step-by-
step answer examples, achieved state-of-the-art performance in arithmetic
and symbolic reasoning, difficult system-2 tasks that do not follow the standard
scaling laws for LLMs. While these successes are often attributed to LLMs’
ability for few-shot learning, we show that LLMs are decent zero-shot reasoners
by simply adding “Let’s think step by step” before each answer. Experimental
results demonstrate that our Zero-shot-CoT, using the same single prompt template,
significantly outperforms zero-shot LLM performances on diverse benchmark
reasoning tasks including arithmetic (MultiArith, GSM8K, AQUA-RAT, SVAMP),
symbolic reasoning (Last Letter, Coin Flip), and other logical reasoning tasks (Date
Understanding, Tracking Shuffled Objects), without any hand-crafted few-shot
examples, e.g. increasing the accuracy on MultiArith from 17.7% to 78.7% and
GSM8K from 10.4% to 40.7% with large-scale InstructGPT model (text-davinci-
002), as well as similar magnitudes of improvements with another off-the-shelf
large model, 540B parameter PaLM. The versatility of this single prompt across
very diverse reasoning tasks hints at untapped and understudied fundamental
zero-shot capabilities of LLMs, suggesting high-level, multi-task broad cognitive
capabilities may be extracted by simple prompting. We hope our work not only
serves as a minimal yet strong zero-shot baseline for the challenging reasoning
benchmarks, but also highlights the importance of carefully exploring and analyzing
the enormous zero-shot knowledge hidden inside LLMs before crafting finetuning
datasets or few-shot exemplars.
1 Introduction
Scaling up the size of language models has been a key ingredient of recent revolutions in natural
language processing (NLP) [Vaswani et al., 2017, Devlin et al., 2019, Raffel et al., 2020, Brown et al.,
2020, Thoppilan et al., 2022, Rae et al., 2021, Chowdhery et al., 2022]. The success of large language
models (LLMs) is often attributed to (in-context) few-shot or zero-shot learning: the model solves various tasks simply by being conditioned on a few examples (few-shot) or on instructions describing the
task (zero-shot). The method of conditioning the language model is called “prompting” [Liu et al.,
2021b], and designing prompts either manually [Schick and Schütze, 2021, Reynolds and McDonell,
2021] or automatically [Gao et al., 2021, Shin et al., 2020] has become a hot topic in NLP.
∗Work done while at The University of Tokyo.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
(a) Few-shot
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: The answer is 11.
Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A:
(Output) The answer is 8. ✗

(b) Few-shot-CoT
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A:
(Output) The juggler can juggle 16 balls. Half of the balls are golf balls. So there are 16 / 2 = 8 golf balls. Half of the golf balls are blue. So there are 8 / 2 = 4 blue golf balls. The answer is 4. ✓

(c) Zero-shot
Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A: The answer (arabic numerals) is
(Output) 8 ✗

(d) Zero-shot-CoT (Ours)
Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A: Let’s think step by step.
(Output) There are 16 balls in total. Half of the balls are golf balls. That means that there are 8 golf balls. Half of the golf balls are blue. That means that there are 4 blue golf balls. ✓

Figure 1: Example inputs and outputs of GPT-3 with (a) standard Few-shot [Brown et al., 2020], (b) Few-shot-CoT [Wei et al., 2022], (c) standard Zero-shot, and (d) ours (Zero-shot-CoT). Similar to Few-shot-CoT, Zero-shot-CoT facilitates multi-step reasoning (highlighted in blue in the original figure) and reaches the correct answer where standard prompting fails. Unlike Few-shot-CoT, which uses step-by-step reasoning examples per task, ours does not need any examples and just uses the same prompt “Let’s think step by step” across all tasks (arithmetic, symbolic, commonsense, and other logical reasoning tasks).
In contrast to the excellent performance of LLMs in intuitive and single-step system-1 [Stanovich
and West, 2000] tasks with task-specific few-shot or zero-shot prompting [Liu et al., 2021b], even
language models at the scale of 100B or more parameters had struggled on system-2 tasks requiring
slow and multi-step reasoning [Rae et al., 2021]. To address this shortcoming, Wei et al. [2022],
Wang et al. [2022] have proposed chain of thought (CoT) prompting, which feeds LLMs
step-by-step reasoning examples rather than standard question-and-answer examples (see Fig. 1-a).
Such chain of thought demonstrations help models generate a reasoning path that decomposes
complex reasoning into multiple easier steps. Notably, with CoT the reasoning performance then
satisfies the scaling laws better and jumps up with the size of the language models. For example,
when combined with the 540B parameter PaLM model [Chowdhery et al., 2022], chain of thought
prompting significantly increases the performance over standard few-shot prompting across several
benchmark reasoning tasks, e.g., GSM8K (17.9% → 58.1%).
While the successes of CoT prompting [Wei et al., 2022], along with those of many other task-specific
prompting work [Gao et al., 2021, Schick and Schütze, 2021, Liu et al., 2021b], are often attributed
to LLMs’ ability for few-shot learning [Brown et al., 2020], we show that LLMs are decent zero-shot
reasoners by adding a simple prompt, Let’s think step by step, to facilitate step-by-step thinking before
answering each question (see Figure 1). Despite the simplicity, our Zero-shot-CoT successfully
generates a plausible reasoning path in a zero-shot manner and reaches the correct answer in a
problem where the standard zero-shot approach fails. Importantly, our Zero-shot-CoT is versatile and
task-agnostic, unlike most prior task-specific prompt engineering in the forms of examples (few-shot)
or templates (zero-shot) [Liu et al., 2021b]: it can facilitate step-by-step answers across various
reasoning tasks, including arithmetic (MultiArith [Roy and Roth, 2015], GSM8K [Cobbe et al., 2021],
AQUA-RAT [Ling et al., 2017], and SVAMP [Patel et al., 2021]), symbolic reasoning (Last letter and
Coin flip), commonsense reasoning (CommonSenseQA [Talmor et al., 2019] and Strategy QA [Geva
et al., 2021]), and other logical reasoning tasks (Date understanding and Tracking Shuffled Objects
from BIG-bench [Srivastava et al., 2022]) without modifying the prompt per task.
We empirically evaluate Zero-shot-CoT against other prompting baselines in Table 2. While our
Zero-shot-CoT underperforms Few-shot-CoT with carefully-crafted and task-specific step-by-step ex-
amples, Zero-shot-CoT achieves enormous score gains compared to the zero-shot baseline, e.g. from
17.7% to 78.7% on MultiArith and from 10.4% to 40.7% on GSM8K with large-scale InstructGPT
model (text-davinci-002). We also evaluate Zero-shot-CoT with another off-the-shelf large model,
540B parameter PaLM, showing similar magnitudes of improvements on MultiArith and GSM8K.
Importantly, with our single fixed prompt, zero-shot LLMs have a significantly better scaling curve
comparable to that of the Few-shot-CoT baseline. We also show that, besides requiring human
engineering of multi-step reasoning prompts, Few-shot-CoT deteriorates in performance when the
prompt example question types and the task question type are mismatched, suggesting high sensitivity to per-task prompt
designs. In contrast, the versatility of this single prompt across diverse reasoning tasks hints at
untapped and understudied fundamental zero-shot capabilities of LLMs, such as higher-level broad
cognitive capabilities like generic logical reasoning [Chollet, 2019]. While the vibrant field of LLMs
started out from the premise of excellent few-shot learners [Brown et al., 2020], we hope our work
encourages more research into uncovering high-level and multi-task zero-shot capabilities hidden
inside those models.
2 Background
We briefly review the two core preliminary concepts that form the basis of this work: the advent of
large language models (LLMs) and prompting, and chain of thought (CoT) prompting for multi-step
reasoning.
Large language models and prompting
A language model (LM) is a model that estimates
the probability distribution over text. Recently, scaling improvements through larger model sizes
(from a few million [Merity et al., 2016] to hundreds of millions [Devlin et al., 2019] to hundreds of
billions [Brown et al., 2020] parameters) and larger data (e.g. webtext corpora [Gao et al., 2020])
have enabled pre-trained large language models (LLMs) to be incredibly adept at many downstream
NLP tasks. Besides the classic “pre-train and fine-tune” paradigm [Liu et al., 2021b], models scaled
to 100B+ parameters exhibit properties conducive to few-shot learning [Brown et al., 2020], by way
of in-context learning, where one can use a text or template known as a prompt to strongly guide the
generation to output answers for desired tasks, thus beginning an era of “pre-train and prompt” [Liu
et al., 2021a]. In this work, we call prompts that explicitly condition on a few task examples few-shot prompts, and template-only prompts zero-shot prompts.
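To make the distinction concrete, the sketch below writes out both prompt styles for the juggler question from Figure 1; the exact string formatting is our own illustration, not prescribed by the paper.

# Few-shot prompt: explicit conditioning on a worked exemplar,
# versus a zero-shot prompt that contains only the task template.

question = (
    "A juggler can juggle 16 balls. Half of the balls are golf balls, "
    "and half of the golf balls are blue. How many blue golf balls are there?"
)

few_shot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: The answer is 11.\n\n"   # in-context exemplar
    f"Q: {question}\nA:"         # the model completes from here
)

zero_shot_prompt = f"Q: {question}\nA: The answer (arabic numerals) is"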
Chain of thought prompting
Multi-step arithmetic and logical reasoning benchmarks have par-
ticularly challenged the scaling laws of large language models [Rae et al., 2021]. Chain of thought
(CoT) prompting [Wei et al., 2022], an instance of few-shot prompting, proposed a simple solution
by modifying the answers in few-shot examples to step-by-step answers, and achieved significant
boosts in performance across these difficult benchmarks, especially when combined with very large
language models like PaLM [Chowdhery et al., 2022]. The top row of Figure 1 shows standard
few-shot prompting against (few-shot) CoT prompting. Notably, few-shot learning was taken as a
given for tackling such difficult tasks, and the zero-shot baseline performances were not even reported
in the original work [Wei et al., 2022]. To differentiate it from our method, we refer to the approach of Wei et al. [2022] as Few-shot-CoT in this work.
3 Zero-shot Chain of Thought
We propose Zero-shot-CoT, a zero-shot, template-based prompting method for chain of thought reasoning.
It differs from the original chain of thought prompting [Wei et al., 2022] as it does not require
step-by-step few-shot examples, and it differs from most of the prior template prompting [Liu et al.,
2021b] as it is inherently task-agnostic and elicits multi-hop reasoning across a wide range of tasks
with a single template. The core idea of our method is simple, as described in Figure 1: add Let’s
think step by step, or similar text (see Table 4), to extract step-by-step reasoning.
3.1 Two-stage prompting
While Zero-shot-CoT is conceptually simple, it uses prompting twice to extract both reasoning and
answer, as explained in Figure 2. In contrast, the zero-shot baseline (see the bottom-left in Figure 1)
already uses prompting in the form of “The answer is”, to extract the answers in correct formats.
Few-shot prompting, standard or CoT, avoids needing such answer-extraction prompting by explicitly
designing the few-shot example answers to end in such formats (see the top-right and top-left in Figure 1). In summary, Few-shot-CoT [Wei et al., 2022] requires careful human engineering of a few prompt examples with specific answer formats per task, while Zero-shot-CoT requires less engineering but requires prompting LLMs twice.

【1st prompt】 Reasoning Extraction
Q: On average Joe throws 25 punches per minute. A fight lasts 5 rounds of 3 minutes. How many punches did he throw?
A: Let’s think step by step.
(LLM output) In one minute, Joe throws 25 punches. In three minutes, Joe throws 3 * 25 = 75 punches. In five rounds, Joe throws 5 * 75 = 375 punches.

【2nd prompt】 Answer Extraction
Q: On average Joe throws 25 punches per minute. A fight lasts 5 rounds of 3 ...
A: Let’s think step by step. In one minute, Joe throws 25 punches. ... In five rounds, Joe throws 5 * 75 = 375 punches.
Therefore, the answer (arabic numerals) is
(LLM output) 375.

Figure 2: Full pipeline of Zero-shot-CoT as described in § 3: we first use the first “reasoning” prompt to extract a full reasoning path from a language model, and then use the second “answer” prompt to extract the answer in the correct format from the reasoning text.
1st prompt: reasoning extraction
In this step, we first modify the input question x into a prompt x′ using a simple template “Q: [X]. A: [T]”, where [X] is an input slot for x and [T] is a slot for a hand-crafted trigger sentence t that would extract a chain of thought to answer the question x. For example, if we use “Let’s think step by step” as a trigger sentence, the prompt x′ would be “Q: [X]. A: Let’s think step by step.”. See Table 4 for more trigger examples. The prompted text x′ is then fed into a language model, which generates a subsequent sentence z. We can use any decoding strategy, but we used greedy decoding throughout the paper for simplicity.
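As a minimal sketch, the first-stage prompt construction amounts to filling this template; the function name below is our own illustration, not from the paper.

def build_reasoning_prompt(question: str,
                           trigger: str = "Let's think step by step.") -> str:
    """Instantiate the template "Q: [X]. A: [T]" with question x and trigger t."""
    return f"Q: {question}\nA: {trigger}"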
2nd prompt: answer extraction
In the second step, we use the generated sentence z together with the prompted sentence x′ to extract the final answer from the language model. Concretely, we simply concatenate three elements as “[X′] [Z] [A]”: [X′] for the 1st prompt x′, [Z] for the sentence z generated in the first step, and [A] for a trigger sentence that extracts the answer. The prompt for this step is self-augmented, since it contains the sentence z generated by the same language model. In experiments, we use a slightly different answer trigger depending on the answer format; for example, we use “Therefore, among A through E, the answer is” for multiple-choice QA, and “Therefore, the answer (arabic numerals) is” for math problems requiring a numerical answer. See Appendix A.5 for the list of answer trigger sentences. Finally, the language model is fed the prompted text as input to generate a sentence ŷ, from which we parse the final answer. See “Answer Cleansing” in § 4 for parser details.
4 Experiment
Tasks and datasets
We evaluate our proposal on 12 datasets from four categories of reasoning
tasks: arithmetic, commonsense, symbolic, and other logical reasoning tasks. See Appendix A.2 for
a detailed description of each dataset.
For arithmetic reasoning, we consider the following six datasets: (1) SingleEq [Koncel-Kedziorski
et al., 2015], (2) AddSub [Hosseini et al., 2014], (3) MultiArith [Roy and Roth, 2015], (4) AQUA-
RAT [Ling et al., 2017], (5) GSM8K [Cobbe et al., 2021], and (6) SVAMP [Patel et al., 2021]. The
first three are from the classic Math World Problem Repository [Koncel-Kedziorski et al., 2016],
and the last three are from more recent benchmarks. SingleEq and AddSub contain easier problems,
which do not require multi-step calculation. MultiArith, AQUA-RAT, GSM8K, and
SVAMP are more challenging datasets that require multi-step reasoning to solve.
For commonsense reasoning, we use CommonsenseQA [Talmor et al., 2019] and StrategyQA [Geva
et al., 2021]. CommonsenseQA asks questions with complex semantics that often require reasoning
4
based on prior knowledge [Talmor et al., 2019]. StrategyQA requires models to perform implicit multi-hop reasoning to answer questions [Geva et al., 2021].
For symbolic reasoning, we use Last Letter Concatenation and Coin Flip [Wei et al., 2022]. Last Letter Concatenation asks the model to concatenate the last letters of each word; we used four randomly selected names per sample. Coin Flip asks the model to answer whether a coin is still heads up after people either flip or do not flip it; we created samples with four flip-or-not-flip trials. Although these tasks are easy for humans, LMs typically exhibit a flat scaling curve on them.
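Since the symbolic-reasoning samples are procedurally generated, a Coin Flip instance can be reconstructed roughly as below; this is our reading of the described format, not the authors' generation script.

import random

def make_coin_flip_sample(names, n_people=4):
    """Generate one Coin Flip question with four flip-or-not-flip trials
    (our reconstruction of the format described in Wei et al. [2022])."""
    heads_up = True
    steps = []
    for name in random.sample(names, n_people):
        flips = random.random() < 0.5
        heads_up ^= flips  # a flip toggles the coin's state
        steps.append(f"{name} {'flips' if flips else 'does not flip'} the coin.")
    question = ("A coin is heads up. " + " ".join(steps) +
                " Is the coin still heads up?")
    return question, "yes" if heads_up else "no"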
For other logical reasoning tasks, we choose two evaluation sets from the BIG-bench effort [Srivastava et al., 2022]: Date Understanding² and Tracking Shuffled Objects. Date Understanding asks models to infer the date from a context. Tracking Shuffled Objects tests a model’s ability to infer the final state of objects given their initial states and a sequence of object shuffles. We used a dataset of tracking three shuffled objects for our experiment.
Models
We experiment with 17 models in total. Main experiments are conducted with Instruct-GPT3 [Ouyang et al., 2022] (text-ada/babbage/curie/davinci-001 and text-davinci-002)³, original GPT-3 [Brown et al., 2020] (ada, babbage, curie, and davinci)⁴, and PaLM [Chowdhery et al., 2022] (8B, 62B, and 540B). In addition, we used GPT-2 [Radford et al., 2019], GPT-Neo [Black et al., 2021], GPT-J [Wang and Komatsuzaki, 2021], T0 [Sanh et al., 2022], and OPT [Zhang et al., 2022] for the model scaling study. The sizes of the LMs range from 0.3B to 540B parameters. We include both standard models (e.g. GPT-3 and OPT) and instruction-following variants (e.g. Instruct-GPT3 and T0). See Appendix A.3 for model description details. Unless otherwise stated, we use text-davinci-002 throughout the experiments.
Baselines
We compare our Zero-shot-CoT mainly to standard Zero-shot prompting to verify the
effectiveness of its chain of thought reasoning. For Zero-shot experiments, answer prompts similar to those of Zero-shot-CoT are used by default. See Appendix A.5 for details. To better evaluate the zero-shot
ability of LLMs on reasoning tasks, we also compare our method to Few-shot and Few-shot-CoT
baselines from [Wei et al., 2022], using the same in-context examples. Throughout the experiments,
we use greedy decoding across all the methods. For the zero-shot approaches, the results are therefore
deterministic. For the few-shot approaches, since the order of in-context examples could affect the
results [Lu et al., 2022], we run each experiment only once with a fixed seed across all methods and
datasets, for fair comparisons with the zero-shot methods. Wei et al. [2022] showed that the order of
examples did not cause large variance in CoT experiments.
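For reference, greedy decoding with the OpenAI API of that era corresponds to temperature 0. The wrapper below is a sketch of such a call (usable as the `complete` callable in the § 3 pipeline sketch); the max_tokens value is our assumption, not a setting reported in the paper.

import openai  # legacy (pre-1.0) OpenAI Python client, as used circa 2022

def complete(prompt: str) -> str:
    """Greedy (temperature 0) completion; zero-shot results are thus deterministic."""
    response = openai.Completion.create(
        engine="text-davinci-002",  # the paper's main model
        prompt=prompt,
        max_tokens=256,             # our assumption, not from the paper
        temperature=0,              # greedy decoding
    )
    return response["choices"][0]["text"].strip()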
Answer cleansing
After the model outputs text via answer extraction (see § 3 and Figure 2), our method picks out only the part of the answer text that first satisfies the answer format. For example, if the answer prompting outputs “probably 375 and 376” on arithmetic tasks, we extract the first number, “375”, and set it as the model prediction. In the case of multiple choice, the first capital letter we encounter is set as the prediction. See Appendix A.6 for more details. The standard Zero-shot method
follows the same idea. For Few-shot and Few-shot-CoT methods, we follow [Wang et al., 2022] and
first extract the answer text after "The answer is " from the model output, and apply the same answer
cleansing to parse the answer text. If “The answer is” is not found in the model output, we search
from the back of the text and set the first text that satisfies the answer format as the prediction.
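As a concrete sketch of the cleansing rules described above (our reconstruction, not the authors' released parser):

import re
from typing import Optional

def cleanse_numeric_answer(text: str) -> Optional[str]:
    """Pick the first number in the answer text, e.g. "probably 375 and 376" -> "375"."""
    match = re.search(r"-?\d[\d,]*(?:\.\d+)?", text)
    return match.group(0).replace(",", "") if match else None

def cleanse_choice_answer(text: str) -> Optional[str]:
    """Pick the first capital letter (A-E) for multiple-choice answers."""
    match = re.search(r"[A-E]", text)
    return match.group(0) if match else None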
4.1 Results
Zero-shot-CoT vs. Zero-shot
Table 1 summarizes the accuracy of our method (Zero-shot-CoT) and of standard zero-shot prompting (Zero-shot) for each dataset. Zero-shot-CoT substantially outperforms Zero-shot on four of the six arithmetic reasoning tasks (MultiArith, GSM8K, AQUA, SVAMP), on all symbolic reasoning tasks, and on all other logical reasoning tasks (from BIG-bench [Srivastava et al., 2022]).
² While prior work [Wei et al., 2022] categorized the Date Understanding task as commonsense reasoning, our study categorizes it as logical reasoning because the task requires less prior knowledge and more logical reasoning between dates.
³ Our experiments with Instruct GPT-3 models include both text-****-001 and text-davinci-002. Text-davinci-002 differs from text-****-001 in the fine-tuning data, which depends on the date range of data collected from the APIs: text-davinci-002 uses data up to June 2021, while text-****-001 uses data up to October 2019 (see https://beta.openai.com/docs/engines/gpt-3).
⁴ Our experiments with the GPT-3 series were conducted using the OpenAI API between April 2022 and May 2022, except for Nos. 10–16 in Table 4, which were run in August 2022.