Content overview: This document surveys prompt engineering methods for large language models (LLMs) across different natural language processing (NLP) tasks. LLMs have shown remarkable performance on many NLP tasks, and prompt engineering improves it further by composing natural language instructions. The survey covers 44 research papers, introducing 39 prompt engineering methods and their effects on 29 NLP tasks, and discusses the implementation details of each prompting technique, the LLMs used, and the current best-performing method on specific datasets.
Intended audience: researchers, students, and practitioners interested in NLP and prompt engineering.
Use cases and goals: providing effective prompting methods for various NLP tasks to improve the reasoning, problem-solving, and multi-step reasoning abilities of LLMs; helping readers understand the basic concepts and latest progress of prompt engineering, for both academic research and practical applications.
Additional notes: prompting methods range from basic prompting to complex chain-style reasoning methods, each aiming to improve LLM performance in a different way. For every dataset, the survey lists the LLMs used and the best prompting method, providing a useful reference for follow-up research.
arXiv:2407.12994v2 [cs.CL] 24 Jul 2024
A SURVEY OF PROMPT ENGINEERING METHODS IN LARGE
LANGUAGE MODELS FOR DIFFERENT NLP TASKS
Shubham Vatsal & Harsh Dubey
Department of Computer Science
New York University, CIMS
New York, USA
{sv2128,hd2225}@nyu.edu
ABSTRACT
Large language models (LLMs) have shown remarkable performance on many different
Natural Language Processing (NLP) tasks. Prompt engineering plays a key role in extending
the existing abilities of LLMs to achieve significant performance gains
on various NLP tasks. Prompt engineering requires composing natural language instructions
called prompts to elicit knowledge from LLMs in a structured way. Unlike previous
state-of-the-art (SoTA) models, prompt engineering does not require extensive parameter
re-training or fine-tuning based on the given NLP task and thus operates solely on the
embedded knowledge of LLMs. Additionally, LLM enthusiasts can intelligently extract
LLMs' knowledge through a basic natural language conversational exchange or prompt
engineering, allowing more and more people, even without a deep mathematical machine
learning background, to experiment with LLMs. With prompt engineering gaining popularity
in the last two years, researchers have come up with numerous engineering techniques
around designing prompts to improve the accuracy of information extraction from
LLMs. In this paper, we summarize different prompting techniques and group them
together based on the different NLP tasks that they have been used for. We further granularly
highlight the performance of these prompting strategies on various datasets belonging to
each NLP task, talk about the corresponding LLMs used, present a taxonomy diagram and
discuss the possible SoTA for specific datasets. In total, we read and present a survey of
44 research papers which cover 39 different prompting methods on 29 different NLP
tasks, most of which have been published in the last two years.
1 INTRODUCTION
Artificial Intelligence has advanced significantly with the introduction of LLMs. LLMs are trained on huge
corpora of text documents with millions and billions of tokens. It has been shown that as the number of
model parameters increases, the performance of machine learning models improves, and such has been the case
with these LLMs. They have attained unprecedented performance on a wide array of NLP tasks Chang et al.
(2023), because of which they have attracted a lot of interest from academia and different industries including
medicine, law, finance and more. The present phase of research on LLMs focuses on their reasoning capacity
via prompts rather than just next-token prediction, which has opened a new field of research around prompt
engineering.
Prompt engineering is the process of creating natural language instructions, or prompts, to extract knowledge
from LLMs in an organized manner. Prompt engineering, in contrast to earlier conventional models, relies
only on the embedded knowledge of LLMs and does not require extensive parameter re-training or fine-tuning
based on the underlying NLP task. Understanding model parameters in terms of the real-world knowledge
embedded in them is beyond human capabilities, and hence this new field of prompt engineering has caught
everyone's attention as it allows natural language exchange between researchers and LLMs to achieve the
goals of the underlying NLP task.
In this work, we enumerate several prompting strategies and group them according to the different NLP tasks
that they have been used for. We provide a taxonomy diagram, tabulate the prompting techniques tried on
various datasets for different NLP tasks, discuss the LLMs employed, and list potential SoTA methods for
each dataset. As a part of this survey, we have reviewed and analyzed 44 research papers in total, the majority
of which have been published in the previous two years, covering 39 prompting techniques applied
to 29 different NLP tasks. There have not been many prior systematic surveys on prompt engineering.
Sahoo et al. (2024) surveys 29 prompting technique papers based on their applications. This is a very broad
categorization, as a single application can encapsulate numerous NLP tasks. For example, one of the applications
which they discuss is reasoning and logic, which can include a plethora of NLP tasks like commonsense
reasoning, mathematical problem solving, multi-hop reasoning, etc. This is different from our approach, as
we take a more granular categorization of prompting strategies based on the NLP tasks. Edemacu & Wu
(2024) provides an overview of privacy-protection prompting methods and thus focuses on a comparatively
small sub-field of prompt engineering. Chen et al. (2023) limits its discussion of prompting strategies to
some 9-10 methodologies and also does not categorize them based on the NLP tasks.
The rest of the paper is organized in the following way. Section 2 talks about various prompt engineering
techniques and Section 3 highlights different NLP tasks. The sub-sections of Section 3 discuss the different
prompting strategies that have been applied to a given NLP task and their corresponding results. Section 4
concludes the paper.
2 PROMPT ENGINEERING TECHNIQUES
In this section, we talk briefly about different prompting methods and how they brought improvements in
existing performance when they were published. An important thing to note here is that most of the
following prompting strategies have been experimented with in two different variations or settings, if not more.
These variations include zero-shot and few-shot. Some of the prompting techniques may inherently exist in
either a zero-shot or few-shot variation, and there may not be a possibility for any other variation to exist. In
the zero-shot Radford et al. (2019) setting, there is no training data involved and an LLM is asked to perform
a task through prompt instructions while completely relying on its embedded knowledge learned during its
pre-training phase. On the other hand, in the few-shot variation Brown et al. (2020), a few training datapoints are
provided along with task-based prompt instructions for better comprehension of the task. The results from
various prompt engineering works have shown few-shot variations to help improve performance,
but this comes at the cost of carefully preparing few-shot datapoints, as the LLM can show unexplained bias
towards the curated few-shot datapoints.
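To make the two settings concrete, the sketch below contrasts a zero-shot and a few-shot version of the same query; the sentiment-classification example and its labels are illustrative and not drawn from any dataset or paper discussed in this survey.

```python
# Zero-shot: the task instruction alone, relying only on the LLM's embedded knowledge.
zero_shot_prompt = (
    "Classify the sentiment of the following review as positive or negative.\n"
    "Review: The battery died after two days.\n"
    "Sentiment:"
)

# Few-shot: the same instruction preceded by a handful of curated training datapoints.
few_shot_prompt = (
    "Classify the sentiment of the following reviews as positive or negative.\n"
    "Review: The screen is gorgeous and the speakers are loud.\n"
    "Sentiment: positive\n"
    "Review: Shipping took a month and the box arrived crushed.\n"
    "Sentiment: negative\n"
    "Review: The battery died after two days.\n"
    "Sentiment:"
)
```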
2.1 BASIC/STANDARD/VANILLA PROMPTING
Basic prompting refers to the method of directly throwing a query at the LLM without any engineering
around it to improve the LLM’s perfo rmance which is the core goal behind mo st of the prompting strategies.
Basic promptin g also goes by the name of Standard or Vanilla prompting in different research paper s.
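A Basic prompt is therefore just the raw query itself. The arithmetic question below is a hypothetical illustration; it reappears in the CoT sketch of the next subsection for contrast.

```python
# Basic/Standard/Vanilla prompting: the query is sent to the LLM as-is and the model
# is expected to answer directly (here, 43), with no engineering around the prompt.
basic_prompt = (
    "Q: A parking lot has 3 rows of 12 cars and 7 cars waiting outside. "
    "How many cars are there in total?\n"
    "A:"
)
```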
2.2 CHAIN-OF-THOUGHT (COT)
In this prompting strategy Wei et al. (2022), the authors build upon the idea of how human beings break
a complex problem into smaller, easier sub-problems before arriving at the final solution of the complex
problem. Along similar lines, the authors investigate how the capability of LLMs to do complicated reasoning
is inherently enhanced by producing a chain of thought, or a sequence of intermediate reasoning steps. The
results show a considerable improvement over Basic prompting, with the maximum difference between CoT
and Basic prompting results being as big as around 39% for the Mathematical Problem Solving task and around
26% for the Commonsense Reasoning task. This work opened a new direction of research for the field of prompt
engineering.
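A minimal sketch of a few-shot CoT prompt follows; the exemplar and the convention that each answer ends with "The answer is ..." are illustrative assumptions and not taken from Wei et al. (2022).

```python
# Few-shot CoT: each exemplar includes the intermediate reasoning, not just the answer.
cot_prompt = (
    "Q: A baker had 23 cupcakes. She sold 15 and then baked 8 more. "
    "How many cupcakes does she have now?\n"
    "A: She started with 23 cupcakes. After selling 15 she had 23 - 15 = 8. "
    "After baking 8 more she had 8 + 8 = 16. The answer is 16.\n\n"
    "Q: A parking lot has 3 rows of 12 cars and 7 cars waiting outside. "
    "How many cars are there in total?\n"
    "A:"  # the LLM is expected to produce a reasoning chain ending in "The answer is 43."
)
```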
2.3 SELF-CONSISTENCY
The Self-Consistency Wang et al. (2022) prompting technique is based on the intuition that complex reasoning
problems can be solved in multiple ways and hence the correct answer can be reached via different reasoning
paths. Self-Consistency uses a novel decoding strategy, unlike the greedy one used by CoT, and consists
of three important steps. The first step requires prompting the LLM using CoT, the second step samples
diverse reasoning paths from the LLM's decoder, and the final step involves choosing the most consistent answer
across the multiple reasoning paths. Self-Consistency on average achieves an 11% gain on the Mathematical
Problem Solving task, a 3% gain on the Commonsense Reasoning task and a 6% gain on the Multi-Hop Reasoning task
when compared to CoT.
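A minimal sketch of the three steps, assuming a placeholder completion function llm(prompt, temperature) and reasoning chains that end with "The answer is ..."; neither assumption is specified by Wang et al. (2022).

```python
from collections import Counter
from typing import Callable


def self_consistency(llm: Callable[[str, float], str],
                     cot_prompt: str,
                     n_samples: int = 10,
                     temperature: float = 0.7) -> str:
    """Sample diverse reasoning paths and return the most consistent final answer.

    `llm(prompt, temperature)` is a placeholder for whatever completion API is in
    use; each sampled chain is assumed to end with "The answer is <x>."
    """
    answers = []
    for _ in range(n_samples):
        chain = llm(cot_prompt, temperature)                              # step 2: sample a reasoning path
        answers.append(chain.rsplit("The answer is", 1)[-1].strip(" ."))  # extract the final answer
    return Counter(answers).most_common(1)[0][0]                          # step 3: majority vote
```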
2.4 ENSEMBLE REFINEMENT (ER)
This prompting method is discussed in Singhal et al. (2023). It builds on top of CoT and Self-Consistency.
ER consists of two stages. First, given a few-shot CoT prompt and a query, the LLM is made to
produce multiple generations by adjusting its temperature. Each generation contains a reasoning chain and an
answer for the query. Next, the LLM is conditioned on the original prompt, the query and the concatenated
generations from the previous stage to generate a better explanation and answer. This second stage is
done multiple times, followed by a majority vote over these second-stage generated answers, just as is
done in the case of Self-Consistency, to select the final answer. ER is seen to perform better than CoT and
Self-Consistency across many datasets belonging to the Context-Free Question-Answering task.
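A compact sketch of the two ER stages under the same placeholder llm(prompt, temperature) assumption; the prompt wording, sample counts and answer format are illustrative rather than taken from Singhal et al. (2023).

```python
from collections import Counter
from typing import Callable


def ensemble_refinement(llm: Callable[[str, float], str],
                        few_shot_cot_prompt: str,
                        query: str,
                        n_first: int = 5,
                        n_second: int = 5) -> str:
    """Two-stage ER sketch. `llm(prompt, temperature)` is a placeholder completion
    function and answers are assumed to end with "The answer is <x>."
    """
    base = f"{few_shot_cot_prompt}\nQ: {query}\nA:"
    # Stage 1: multiple generations sampled at a higher temperature.
    generations = [llm(base, 0.7) for _ in range(n_first)]
    refine_prompt = (
        base
        + "\nCandidate explanations:\n"
        + "\n".join(generations)
        + "\nUsing the candidate explanations above, give a better explanation "
          "and a final answer.\nA:"
    )
    # Stage 2: refine several times, then majority-vote over the refined answers.
    answers = [llm(refine_prompt, 0.7).rsplit("The answer is", 1)[-1].strip(" .")
               for _ in range(n_second)]
    return Counter(answers).most_common(1)[0][0]
```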
2.5 AUTOMATIC CHAIN-OF-THOUGHT (AUTO-COT)
In this work Zhang et al. (2022), the authors address the problem faced by few-shot or manual CoT,
which is the need for the curation of good-quality training datapoints. Auto-CoT consists of two primary steps.
The first one requires dividing the queries of a given dataset into a few clusters. The second one involves
choosing a representative query from each cluster and then generating its corresponding reasoning chain
using zero-shot CoT. The authors claim that Auto-CoT either outperforms or matches the performance of
few-shot CoT across the Mathematical Problem Solving, Multi-Hop Reasoning and Commonsense Reasoning
tasks. This indicates that the step of curating training datapoints for few-shot or manual CoT can be ruled
out.
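A sketch of the two Auto-CoT steps, assuming a placeholder completion function llm, a placeholder sentence encoder embed, and scikit-learn's KMeans for clustering; picking the first query per cluster is a simplification of choosing a representative, and the original work may do this differently.

```python
from typing import Callable, List

import numpy as np
from sklearn.cluster import KMeans


def auto_cot_exemplars(llm: Callable[[str], str],
                       embed: Callable[[str], np.ndarray],
                       queries: List[str],
                       n_clusters: int = 8) -> str:
    """Build a few-shot CoT prompt automatically from a pool of dataset queries.

    `llm` and `embed` are placeholders for a completion API and a sentence encoder.
    """
    vectors = np.stack([embed(q) for q in queries])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)   # step 1: cluster queries
    exemplars = []
    for cluster in range(n_clusters):
        representative = next(q for q, l in zip(queries, labels) if l == cluster)
        # Step 2: zero-shot CoT generates the reasoning chain for the representative query.
        chain = llm(f"Q: {representative}\nA: Let's think step by step.")
        exemplars.append(f"Q: {representative}\nA: Let's think step by step. {chain}")
    return "\n\n".join(exemplars)  # prepend this block to new queries as the few-shot prompt
```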
2.6 COMPLEX COT
Fu et al. (2022) introduces a new prompting strategy which aims at choosing complex datapoint prompts
over simpler ones. The complexity of a datapoint is defined here by the number of reasoning steps involved
in it. The authors hypothesize that the LLMs' reasoning performance can increase if complex datapoints
are used as in-context training examples, as they already subsume simpler datapoints. Another important
aspect of Complex CoT, apart from using complex datapoints as training examples, is that during decoding,
just like Self-Consistency, out of N sampled reasoning chains the majority answer over the top K most
complex chains is chosen as the final answer. One other baseline prompting method, called Random CoT, is also
introduced in this paper. In Random CoT, the datapoints are randomly sampled without
adhering to their complexity. Complex CoT achieves on average a gain of 5.3% accuracy and up to 18%
accuracy improvement across various datasets of the Mathematical Problem Solving, Commonsense Reasoning,
Table-Based Mathematical Problem Solving and Multi-Hop Reasoning tasks.
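A sketch of the Complex CoT decoding step, assuming a placeholder llm(prompt, temperature) and approximating chain complexity by the number of newline-separated steps; the exact complexity measure, sample counts and answer format are illustrative.

```python
from collections import Counter
from typing import Callable


def complex_cot(llm: Callable[[str, float], str],
                complex_few_shot_prompt: str,
                n_samples: int = 20,
                top_k: int = 5) -> str:
    """Decode as in Self-Consistency but vote only over the K most complex chains.

    `llm(prompt, temperature)` is a placeholder completion function; chains are
    assumed to end with "The answer is <x>."
    """
    chains = [llm(complex_few_shot_prompt, 0.7) for _ in range(n_samples)]
    chains.sort(key=lambda c: len(c.strip().splitlines()), reverse=True)       # most steps first
    answers = [c.rsplit("The answer is", 1)[-1].strip(" .") for c in chains[:top_k]]
    return Counter(answers).most_common(1)[0][0]                               # majority over top-K chains
```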
2.7 PROGRAM-OF-THOUGHTS (POT)
The authors of Chen et al. (2022a) build upon CoT, but in contrast to CoT, which uses LLMs to perform both
reasoning and computation, PoT generates Python programs and thus delegates the computation part to a Python
interpreter. This work argues that reduced LLM responsibilities make it more accurate, especially for numerical
reasoning. PoT gets an average performance gain over CoT of around 12% across the Mathematical Problem
Solving, Table-Based Mathematical Problem Solving, Contextual Question-Answering and Conversational
Contextual Question-Answering tasks.
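A minimal PoT-style sketch, assuming a placeholder llm(prompt) that returns Python code storing its result in a variable named ans; the variable name and prompt wording are assumptions, and model-generated code should only ever be executed in a sandbox.

```python
from typing import Callable


def program_of_thoughts(llm: Callable[[str], str], question: str) -> str:
    """Ask the LLM for a Python program and let the interpreter do the computation.

    `llm(prompt)` is a placeholder completion function assumed to return only code
    that stores its final result in `ans`. Never exec untrusted model output
    outside a sandbox.
    """
    prompt = (
        f"# Question: {question}\n"
        "# Write Python code that computes the answer and stores it in a variable `ans`.\n"
    )
    code = llm(prompt)
    namespace: dict = {}
    exec(code, namespace)  # the interpreter, not the LLM, performs the arithmetic
    return str(namespace["ans"])
```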
2.8 LEAST-TO-MOST
The Least-to-Most Zhou et al. (2022) prompting technique tries to address the problem that CoT fails
to accurately solve problems harder than the exemplars shown in the prompts. It consists of two stages.
First, the LLM is prompted to decompose a given problem into sub-problems. Next, the LLM is prompted
to solve the sub-problems in a sequential manner. The answer to any sub-problem depends on the answer
to the previous sub-problem. The authors show that Least-to-Most prompting is able to significantly outperform
CoT and Basic prompting methods on the Commonsense Reasoning, Language-Based Task Completion,
Mathematical Problem Solving and Contextual Question-Answering tasks.
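A sketch of the two Least-to-Most stages, assuming a placeholder llm(prompt) and a one-sub-question-per-line decomposition format; both are illustrative assumptions rather than details from Zhou et al. (2022).

```python
from typing import Callable


def least_to_most(llm: Callable[[str], str], problem: str) -> str:
    """Stage 1: decompose into sub-problems. Stage 2: solve them sequentially.

    `llm(prompt)` is a placeholder completion function; the decomposition is
    assumed to come back as one sub-question per line.
    """
    decomposition = llm(
        f"Problem: {problem}\nBreak this problem into simpler sub-questions, one per line."
    )
    sub_questions = [line.strip() for line in decomposition.splitlines() if line.strip()]
    context = f"Problem: {problem}\n"
    answer = ""
    for sub_q in sub_questions:
        # Each sub-question sees the answers to all previous sub-questions.
        answer = llm(f"{context}Q: {sub_q}\nA:")
        context += f"Q: {sub_q}\nA: {answer}\n"
    return answer  # the answer to the last sub-question answers the original problem
```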
2.9 CHAIN-OF-SYMBOL (COS)
CoS Hu et al. (2023) builds upon the idea of CoT. In conventional CoT, the intermediate chain of reasoning
steps is represented in natural language. While this approach has shown remarkable results in many cases, it
can include incorrect or redundant information as well. The authors of this work present their hypothesis that
spatial descriptions are hard to express in natural language, thus making it difficult for LLMs to understand.
Instead, expressing these relationships using symbols in word sequences can be a better form of representation
for LLMs. CoS achieves an improvement of up to 60.8% accuracy for the Spatial Question-Answering
task.
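As a hypothetical illustration of the idea (the actual symbol set in Hu et al. (2023) is task-specific), the same spatial relations can be carried in the prompt either in natural language or as compact symbol sequences:

```python
# Spatial relations written out in natural language, as conventional CoT would carry them.
natural_language_steps = (
    "The book is on top of the laptop, the laptop is on top of the desk, "
    "and the mug is to the left of the desk."
)

# The same relations condensed into symbol sequences for the prompt.
chain_of_symbols = (
    "book / laptop / desk\n"   # "/" standing for "is on top of" (illustrative notation)
    "mug < desk"               # "<" standing for "is to the left of" (illustrative notation)
)
```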
2.10 STRUCTURED CHAIN-OF-THOUGHT (SCOT)
The intuition behind SCoT Li et al. (2023b) is that structuring intermediate reasoning steps using program
structures like sequencing, branching and looping helps in more accurate code generation than having intermediate
reasoning steps in natural language, as in conventional CoT. The authors claim that the
former approach more closely mimics a human developer's thought process than the latter, and this
has been confirmed by the final results, as SCoT outperforms CoT by up to 13.79% for the Code Generation
task.
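A hypothetical SCoT-style prompt: the intermediate step is an outline written with program structures rather than free-form natural language. The task and the outline wording are illustrative, not taken from Li et al. (2023b).

```python
scot_prompt = (
    "Task: write a Python function remove_vowels(s) that returns s without vowels.\n"
    "First outline the solution using only program structures "
    "(sequence, branch, loop), then write the code.\n"
    "Outline:\n"
    "  sequence: create an empty result string\n"
    "  loop: for each character in s\n"
    "    branch: if the character is not a vowel, append it to the result\n"
    "  sequence: return the result\n"
    "Code:\n"
)
```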
2.11 PLAN-AND-SOLVE (PS)
Wang et al. (2023) discusses and tries to address three shortcomings of CoT, which are calculation errors,
missing-step errors and semantic misunderstanding errors. PS contains two components, where the first one
requires devising a plan to divide the entire problem into smaller sub-problems and the second one needs to
carry out these sub-problems according to the plan. A better version of PS called PS+ adds more detailed
instructions, which helps in improving the quality of the reasoning steps. The PS prompting method improves the
accuracy over CoT by at least 5% for almost all the datasets in the Mathematical Problem Solving task in the zero-shot
setting. Similarly, for the Commonsense Reasoning task, it consistently outperforms CoT by at least
5% in the zero-shot setting, whereas for the Multi-Hop Reasoning task it gets an around 2% better accuracy score.
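A hypothetical zero-shot PS-style prompt; the trigger sentence is paraphrased from the plan-then-solve idea described above and the arithmetic question is illustrative.

```python
# Zero-shot Plan-and-Solve style trigger appended to the problem statement.
# The PS+ variant adds more detailed instructions, e.g. extracting relevant
# variables and calculating intermediate results carefully.
ps_prompt = (
    "Q: A shop sells pens in packs of 12. If a school orders 7 packs and gives "
    "out 59 pens, how many pens are left?\n"
    "A: Let's first understand the problem and devise a plan to solve it. "
    "Then, let's carry out the plan and solve the problem step by step."
)
```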
2.12 MATHPROMPTER
Imani et al. (2023) tries to address two key problems of CoT for the Mathematical Problem Solving task: (1) the lack
of validity of the steps followed by CoT for solving a problem; (2) how confident an LLM is in its predictions.
The MathPrompter prompting strategy consists of 4 steps in total. (I) Given a query, the first step requires
generating an algebraic expression for the query which replaces the numerical values with variables. (II) Next,
the LLM is prompted to solve the query analytically, either by deriving the algebraic expression or by writing a
Python function. (III) Third, the query in step (I) is solved by assigning different values to the variables.
(IV) If the solutions in (III) are correct over N iterations, the variables are finally replaced with the original query
values and the answer is computed. If not, then steps (II), (III) and (IV) are repeated. MathPrompter is
able to improve the performance on a dataset belonging to the Mathematical Problem Solving task from 78.7%
to 92.5%.
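A compact sketch of the four steps, assuming a placeholder llm(prompt), a two-variable template with variables A and B, and externally supplied original values; all of these are simplifying assumptions rather than details from Imani et al. (2023).

```python
import random
from typing import Callable, Dict


def mathprompter(llm: Callable[[str], str],
                 query: str,
                 values: Dict[str, int],
                 n_checks: int = 5) -> str:
    """MathPrompter-style sketch with a placeholder `llm(prompt)` call.

    `values` maps the template variables (assumed here to be A and B) to the
    numbers in the original query; parsing them automatically is omitted.
    """
    # (I) Rewrite the query as an algebraic template with variables.
    template = llm(f"Rewrite the question using variables A and B instead of numbers:\n{query}")
    # (II) Ask for an analytical solution as a Python function.
    code = llm(f"Write a Python function solve(A, B) that answers:\n{template}")
    namespace: dict = {}
    exec(code, namespace)  # defines solve(A, B); sandbox this in real use
    # (III) Check the function against direct answers on random variable assignments.
    for _ in range(n_checks):
        a, b = random.randint(1, 50), random.randint(1, 50)
        direct = llm(f"{template}\nA = {a}, B = {b}. Reply with a single number.")
        if str(namespace["solve"](a, b)) != direct.strip():
            raise ValueError("Analytical solution disagrees with direct answers; redo (II)-(IV).")
    # (IV) All checks passed: evaluate on the original query values.
    return str(namespace["solve"](values["A"], values["B"]))
```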
2.13 CONTRASTIVE COT/ CONTRASTIVE SELF-CONSISTENCY
The authors of Chia et al. (2023) claim that Contrastive CoT or Contrastive Self-Consistency is a general enhancement
of CoT or Self-Consistency. The inspiration for this prompting approach is based on how humans
can learn from both positive as well as negative examples. Along similar lines, in this prompting technique,
both positive and negative demonstrations are provided to enhance the reasoning capabilities of the LLM.
Contrastive CoT on average is able to gain a 10% improvement over conventional CoT for the
Mathematical Problem Solving task across multiple datasets. Similarly, Contrastive Self-Consistency is able
to outperform conventional Self-Consistency by over 15% for the Mathematical Problem Solving task across
multiple datasets. For the Multi-Hop Reasoning task, both Contrastive CoT and Contrastive Self-Consistency
have more than 10% gains over their conventional counterparts.
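A hypothetical Contrastive CoT prompt pairing a correct demonstration with an incorrect one; the questions and the "Correct/Wrong explanation" labels are illustrative, not the exact format of Chia et al. (2023).

```python
# Contrastive CoT exemplar: a positive (correct) demonstration and a negative
# (incorrect) demonstration for the same question, followed by the actual query.
contrastive_cot_prompt = (
    "Q: A box holds 6 eggs. How many eggs are in 4 boxes?\n"
    "Correct explanation: Each box holds 6 eggs, so 4 boxes hold 4 * 6 = 24 eggs. "
    "The answer is 24.\n"
    "Wrong explanation: Adding the numbers gives 6 + 4 = 10. The answer is 10.\n\n"
    "Q: A pack has 8 pencils. How many pencils are in 5 packs?\n"
    "A:"
)
```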
2.14 FEDERATED SAME/DIFFERENT PARAMETER SELF-CONSISTENCY/COT (FED-SP/DP-SC/COT)
Introduced in Liu et al. (2023), this prompting method is based on the core idea of improving the reasoning
capabilities of LLMs by using synonymous crowd-sourced queries. There are two slightly different variations
of this prompting method. The first one is Fed-SP-SC, where the crowd-sourced queries are paraphrased
versions of the original query but with the same parameters. Parameters here can refer to the numeric values in
Mathematical Problem Solving task datapoints. For Fed-SP-SC, the answers are directly generated first and
then Self-Consistency is applied on top of them. The other one is Fed-DP-CoT. In Fed-DP-CoT, LLMs are used
to first generate answers to different queries, which are then federated by forming a CoT to provide hints to
the LLMs. The results for these methods on the Mathematical Problem Solving task show that they are able to
do better than conventional CoT by at least 10% and up to 20%.
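A sketch of the Fed-SP-SC variation, assuming a placeholder llm(prompt) and an externally supplied list of paraphrased queries standing in for the crowd-sourced ones; the answer format is an assumption.

```python
from collections import Counter
from typing import Callable, List


def fed_sp_sc(llm: Callable[[str], str], query: str, paraphrases: List[str]) -> str:
    """Fed-SP-SC sketch: answer synonymous queries, then vote as in Self-Consistency.

    `llm(prompt)` is a placeholder completion function; `paraphrases` stands in for
    crowd-sourced rewordings of `query` that keep its numeric parameters unchanged.
    Answers are assumed to end with "The answer is <x>."
    """
    answers = []
    for q in [query] + paraphrases:
        reply = llm(f"Q: {q}\nA: Let's think step by step.")
        answers.append(reply.rsplit("The answer is", 1)[-1].strip(" ."))
    return Counter(answers).most_common(1)[0][0]  # the most consistent answer wins
```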
2.15 ANALOGICAL REASONING
The authors of this work Yasunaga et al. (2023) draw their inspiration from a psychological notion, analogical
reasoning, where people use pertinent prior experiences to solve new problems. In the realm of LLMs,
the authors first prompt them to generate examples similar to the original problem, followed by solving
them, and then proceed to answer the original problem. The results show that Analogical Reasoning is able
to achieve an average accuracy gain of 4% when compared to CoT across the Mathematical Problem Solving,
Code Generation, Logical Reasoning and Commonsense Reasoning tasks.
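A hypothetical Analogical Reasoning prompt in which the model is asked to recall and solve related problems before the original one; the wording and the number of recalled problems are illustrative, not taken from Yasunaga et al. (2023).

```python
# Analogical-reasoning style prompt: the model first generates and solves related
# exemplars of its own choosing, then answers the original problem.
analogical_prompt = (
    "Problem: How many positive divisors does 360 have?\n"
    "Recall three relevant problems you have seen before. For each, state the "
    "problem and work out its solution. Finally, solve the original problem."
)
```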