Content overview: This document surveys prompt engineering methods for large language models (LLMs) across different natural language processing (NLP) tasks. LLMs have shown remarkable performance on many NLP tasks, and prompt engineering improves it further by composing natural language instructions. The survey covers 44 research papers, introducing 39 prompt engineering methods and their effects on 29 NLP tasks, and discusses the implementation details of each prompting technique, the LLMs used, and the current best-performing method on specific datasets.
Intended audience: researchers, students, and practitioners interested in NLP and prompt engineering.
Use cases and goals: providing effective prompting methods for various NLP tasks to improve the reasoning, problem-solving, and multi-step reasoning abilities of LLMs; helping readers understand the basic concepts and latest progress of prompt engineering, for both academic research and practical applications.
Additional notes: prompting methods range from basic prompting to complex chain-style reasoning methods, each aiming to improve LLM performance in a different way. For every dataset, the survey lists the LLMs used and the best prompting method, providing a useful reference for follow-up research.
arXiv:2407.12994v2 [cs.CL] 24 Jul 2024
A SURVEY OF PROMPT ENGINEERING METHODS IN LARGE
LANGUAGE MODELS FOR DIFFERENT NLP TASKS
Shubham Vatsal & Harsh Dubey
Department of Computer Science
New York University, CIMS
New York, USA
{sv2128,hd2225}@nyu.edu
ABSTRACT
Large language models (LLMs) have shown remarkable performance on many different
Natural Language Processing (NLP) tasks. Prompt engineering plays a key role in extending
the existing abilities of LLMs to achieve significant performance gains
on various NLP tasks. Prompt engineering requires composing natural language instructions
called prompts to elicit knowledge from LLMs in a structured way. Unlike previous
state-of-the-art (SoTA) models, prompt engineering does not require extensive parameter
re-training or fine-tuning based on the given NLP task and thus operates solely on the
embedded knowledge of LLMs. Additionally, LLM enthusiasts can intelligently extract
LLMs' knowledge through a basic natural language conversational exchange or prompt
engineering, allowing more and more people, even without a deep mathematical machine
learning background, to experiment with LLMs. With prompt engineering gaining popularity
in the last two years, researchers have come up with numerous engineering techniques
around designing prompts to improve the accuracy of information extraction from
LLMs. In this paper, we summarize different prompting techniques and group them
together based on the different NLP tasks that they have been used for. We further granularly
highlight the performance of these prompting strategies on various datasets belonging to
each NLP task, talk about the corresponding LLMs used, present a taxonomy diagram and
discuss the possible SoTA for specific datasets. In total, we read and present a survey of
44 research papers which cover 39 different prompting methods on 29 different NLP
tasks, most of which have been published in the last two years.
1 INTRODUCTION
Artificial Intelligence has advanced significantly with the introduction of LLMs. LLMs are trained on huge
corpora of text documents with millions and billions of tokens. It has been shown that as the number of
model parameters increases, the performance of machine learning models improves, and such has been the case
with these LLMs. They have attained unprecedented performance on a wide array of NLP tasks Chang et al.
(2023), because of which they have attracted a lot of interest from academia and different industries including
medicine, law, finance and more. The present phase of research on LLMs focuses on their reasoning capacity
via prompts rather than just next-token prediction, which has opened a new field of research around prompt
engineering.
Prompt engineering is the process of creating natural language instructions, or prompts, to extract knowledge
from LLMs in an organized manner. Prompt engineering, in contrast to earlier conventional models, relies
only on the embedded knowledge of LLMs and does not require extensive parameter re-training or fine-tuning
based on the underlying NLP task. Understanding model parameters in terms of the real-world knowledge
embedded in them is beyond human capabilities, and hence this new field of prompt engineering has caught
everyone's attention as it allows natural language exchange between researchers and LLMs to achieve the
goals of the underlying NLP task.
In this work, we enumerate several prompting strategies and group them according to the different NLP tasks
that they have been used for. We provide a taxonomy diagram, tabulate the prompting techniques tried on
various datasets for different NLP tasks, discuss the LLMs employed, and list potential SoTA methods for
each dataset. As a part of this survey, we have reviewed and analyzed 44 research papers in total, the majority
of which have been published in the previous two years, covering 39 prompting techniques applied
to 29 different NLP tasks. There have not been many prior systematic surveys on prompt engineering.
Sahoo et al. (2024) surveys 29 prompting technique papers based on their applications. This is a very broad
categorization, as a single application can encapsulate numerous NLP tasks. For example, one of the applications
which they discuss is reasoning and logic, which can include a plethora of NLP tasks like commonsense
reasoning, mathematical problem solving, multi-hop reasoning, etc. This is different from our approach, as
we take a more granular categorization of prompting strategies based on the NLP tasks. Edemacu & Wu
(2024) provides an overview of privacy-protection prompting methods and thus focuses on a comparatively
small sub-field of prompt engineering. Chen et al. (2023) limits its discussion of prompting strategies to
some 9-10 methodologies and also does not categorize them based on the NLP tasks.
The rest of the paper is organized in the following way. Section 2 talks about various prompt engineering
techniques and Section 3 highlights different NLP tasks. The sub-sections of Section 3 discuss the different
prompting strategies that have been applied to a given NLP task and their corresponding results. Section 4
concludes the paper.
2 PROMPT ENGINEERING TECHNIQUES
In this section, we talk briefly about different prompting methods and how they brought improvements in
existing performance when they were published. An important thing to note here is that most of the
following prompting strategies have been experimented with in two different variations or settings, if not more.
These variations include zero-shot and few-shot. Some of the prompting techniques may inherently exist in
either a zero-shot or few-shot variation, and there may not be a possibility for any other variation to exist. In
the zero-shot Radford et al. (2019) setting, there is no training data involved and an LLM is asked to perform
a task through prompt instructions while completely relying on its embedded knowledge learned during its
pre-training phase. On the other hand, in the few-shot variation Brown et al. (2020), a few training datapoints are
provided along with task-based prompt instructions for better comprehension of the task. The results from
various prompt engineering works have shown few-shot variations to help improve performance,
but this comes at the cost of carefully preparing few-shot datapoints, as the LLM can show unexplained bias
towards the curated few-shot datapoints.
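To make the two settings concrete, the sketch below contrasts a zero-shot and a few-shot version of the same query; the sentiment-classification example and its labels are illustrative and not drawn from any dataset or paper discussed in this survey.

```python
# Zero-shot: the task instruction alone, relying only on the LLM's embedded knowledge.
zero_shot_prompt = (
    "Classify the sentiment of the following review as positive or negative.\n"
    "Review: The battery died after two days.\n"
    "Sentiment:"
)

# Few-shot: the same instruction preceded by a handful of curated training datapoints.
few_shot_prompt = (
    "Classify the sentiment of the following reviews as positive or negative.\n"
    "Review: The screen is gorgeous and the speakers are loud.\n"
    "Sentiment: positive\n"
    "Review: Shipping took a month and the box arrived crushed.\n"
    "Sentiment: negative\n"
    "Review: The battery died after two days.\n"
    "Sentiment:"
)
```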
2.1 BASIC/STANDARD/VANILLA PROMPTING
Basic prompting refers to the method of directly throwing a query at the LLM without any engineering
around it to improve the LLM’s perfo rmance which is the core goal behind mo st of the prompting strategies.
Basic promptin g also goes by the name of Standard or Vanilla prompting in different research paper s.
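A Basic prompt is therefore just the raw query itself. The arithmetic question below is a hypothetical illustration; it reappears in the CoT sketch of the next subsection for contrast.

```python
# Basic/Standard/Vanilla prompting: the query is sent to the LLM as-is and the model
# is expected to answer directly (here, 43), with no engineering around the prompt.
basic_prompt = (
    "Q: A parking lot has 3 rows of 12 cars and 7 cars waiting outside. "
    "How many cars are there in total?\n"
    "A:"
)
```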
2.2 CHAIN-OF-THOUGHT (COT)
In this prompting strategy Wei et al. (2022), the authors build upon the idea of how human beings break
a complex problem into smaller, easier sub-problems before arriving at the final solution of the complex
problem. Along similar lines, the authors investigate how the capability of LLMs to do complicated reasoning
is inherently enhanced by producing a chain of thought, or a sequence of intermediate reasoning steps. The
results show a considerable improvement over Basic prompting, with the maximum difference between CoT
and Basic prompting results being as big as around 39% for the Mathematical Problem Solving task and around
26% for the Commonsense Reasoning task. This work opened a new direction of research for the field of prompt
engineering.
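A minimal sketch of a few-shot CoT prompt follows; the exemplar and the convention that each answer ends with "The answer is ..." are illustrative assumptions and not taken from Wei et al. (2022).

```python
# Few-shot CoT: each exemplar includes the intermediate reasoning, not just the answer.
cot_prompt = (
    "Q: A baker had 23 cupcakes. She sold 15 and then baked 8 more. "
    "How many cupcakes does she have now?\n"
    "A: She started with 23 cupcakes. After selling 15 she had 23 - 15 = 8. "
    "After baking 8 more she had 8 + 8 = 16. The answer is 16.\n\n"
    "Q: A parking lot has 3 rows of 12 cars and 7 cars waiting outside. "
    "How many cars are there in total?\n"
    "A:"  # the LLM is expected to produce a reasoning chain ending in "The answer is 43."
)
```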
2.3 SELF-CONSISTENCY
The Self-Consistency Wang et al. (2022) prompting technique is based on the intuition that complex reasoning
problems can be solved in multiple ways and hence the correct answer can be reached via different reasoning
paths. Self-Consistency uses a novel decoding strategy, unlike the greedy one used by CoT, and consists
of three important steps. The first step requires prompting the LLM using CoT, the second step samples
diverse reasoning paths from the LLM's decoder, and the final step involves choosing the most consistent answer
across the multiple reasoning paths. Self-Consistency on average achieves an 11% gain on the Mathematical
Problem Solving task, a 3% gain on the Commonsense Reasoning task and a 6% gain on the Multi-Hop Reasoning task
when compared to CoT.
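A minimal sketch of the three steps, assuming a placeholder completion function llm(prompt, temperature) and reasoning chains that end with "The answer is ..."; neither assumption is specified by Wang et al. (2022).

```python
from collections import Counter
from typing import Callable


def self_consistency(llm: Callable[[str, float], str],
                     cot_prompt: str,
                     n_samples: int = 10,
                     temperature: float = 0.7) -> str:
    """Sample diverse reasoning paths and return the most consistent final answer.

    `llm(prompt, temperature)` is a placeholder for whatever completion API is in
    use; each sampled chain is assumed to end with "The answer is <x>."
    """
    answers = []
    for _ in range(n_samples):
        chain = llm(cot_prompt, temperature)                              # step 2: sample a reasoning path
        answers.append(chain.rsplit("The answer is", 1)[-1].strip(" ."))  # extract the final answer
    return Counter(answers).most_common(1)[0][0]                          # step 3: majority vote
```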
2.4 ENSEMBLE REFINEMENT (ER)
This prompting method is discussed in Singhal et al. (2023). It builds on top of CoT and Self-Consistency.
ER consists of two stages. First, given a few-shot CoT prompt and a query, the LLM is made to
produce multiple generations by adjusting its temperature. Each generation contains a reasoning chain and an
answer for the query. Next, the LLM is conditioned on the original prompt, the query and the concatenated
generations from the previous stage to generate a better explanation and answer. This second stage is
done multiple times, followed by a majority vote over these second-stage generated answers, just as is
done in the case of Self-Consistency, to select the final answer. ER is seen to perform better than CoT and
Self-Consistency across many datasets belonging to the Context-Free Question-Answering task.
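A compact sketch of the two ER stages under the same placeholder llm(prompt, temperature) assumption; the prompt wording, sample counts and answer format are illustrative rather than taken from Singhal et al. (2023).

```python
from collections import Counter
from typing import Callable


def ensemble_refinement(llm: Callable[[str, float], str],
                        few_shot_cot_prompt: str,
                        query: str,
                        n_first: int = 5,
                        n_second: int = 5) -> str:
    """Two-stage ER sketch. `llm(prompt, temperature)` is a placeholder completion
    function and answers are assumed to end with "The answer is <x>."
    """
    base = f"{few_shot_cot_prompt}\nQ: {query}\nA:"
    # Stage 1: multiple generations sampled at a higher temperature.
    generations = [llm(base, 0.7) for _ in range(n_first)]
    refine_prompt = (
        base
        + "\nCandidate explanations:\n"
        + "\n".join(generations)
        + "\nUsing the candidate explanations above, give a better explanation "
          "and a final answer.\nA:"
    )
    # Stage 2: refine several times, then majority-vote over the refined answers.
    answers = [llm(refine_prompt, 0.7).rsplit("The answer is", 1)[-1].strip(" .")
               for _ in range(n_second)]
    return Counter(answers).most_common(1)[0][0]
```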
2.5 AUTOMATIC CHAIN-OF-THOUGHT (AUTO-COT)
In this work Zhang et al. (2022), the authors address the problem faced by few-shot or manual CoT,
which is the need for the curation of good-quality training datapoints. Auto-CoT consists of two primary steps.
The first one requires dividing the queries of a given dataset into a few clusters. The second one involves
choosing a representative query from each cluster and then generating its corresponding reasoning chain
using zero-shot CoT. The authors claim that Auto-CoT either outperforms or matches the performance of
few-shot CoT across the Mathematical Problem Solving, Multi-Hop Reasoning and Commonsense Reasoning
tasks. This indicates that the step of curating training datapoints for few-shot or manual CoT can be ruled
out.
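A sketch of the two Auto-CoT steps, assuming a placeholder completion function llm, a placeholder sentence encoder embed, and scikit-learn's KMeans for clustering; picking the first query per cluster is a simplification of choosing a representative, and the original work may do this differently.

```python
from typing import Callable, List

import numpy as np
from sklearn.cluster import KMeans


def auto_cot_exemplars(llm: Callable[[str], str],
                       embed: Callable[[str], np.ndarray],
                       queries: List[str],
                       n_clusters: int = 8) -> str:
    """Build a few-shot CoT prompt automatically from a pool of dataset queries.

    `llm` and `embed` are placeholders for a completion API and a sentence encoder.
    """
    vectors = np.stack([embed(q) for q in queries])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)   # step 1: cluster queries
    exemplars = []
    for cluster in range(n_clusters):
        representative = next(q for q, l in zip(queries, labels) if l == cluster)
        # Step 2: zero-shot CoT generates the reasoning chain for the representative query.
        chain = llm(f"Q: {representative}\nA: Let's think step by step.")
        exemplars.append(f"Q: {representative}\nA: Let's think step by step. {chain}")
    return "\n\n".join(exemplars)  # prepend this block to new queries as the few-shot prompt
```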
2.6 COMPLEX COT
Fu et al. (2022) introduces a new prompting strategy which aims at choosing complex datapoint prompts
over simpler ones. The complexity of a datapoint is defined here by the number of reasoning steps involved
in it. The authors hypothesize that the LLMs' reasoning performance can increase if complex datapoints
are used as in-context training examples, as they already subsume simpler datapoints. Another important
aspect of Complex CoT, apart from using complex datapoints as training examples, is that during decoding,
just like Self-Consistency, out of N sampled reasoning chains the majority answer over the top K most
complex chains is chosen as the final answer. One other baseline prompting method, called Random CoT, is also
introduced in this paper. In Random CoT, the datapoints are randomly sampled without
adhering to their complexity. Complex CoT achieves on average a gain of 5.3% accuracy and up to 18%
accuracy improvement across various datasets of the Mathematical Problem Solving, Commonsense Reasoning,
Table-Based Mathematical Problem Solving and Multi-Hop Reasoning tasks.
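A sketch of the Complex CoT decoding step, assuming a placeholder llm(prompt, temperature) and approximating chain complexity by the number of newline-separated steps; the exact complexity measure, sample counts and answer format are illustrative.

```python
from collections import Counter
from typing import Callable


def complex_cot(llm: Callable[[str, float], str],
                complex_few_shot_prompt: str,
                n_samples: int = 20,
                top_k: int = 5) -> str:
    """Decode as in Self-Consistency but vote only over the K most complex chains.

    `llm(prompt, temperature)` is a placeholder completion function; chains are
    assumed to end with "The answer is <x>."
    """
    chains = [llm(complex_few_shot_prompt, 0.7) for _ in range(n_samples)]
    chains.sort(key=lambda c: len(c.strip().splitlines()), reverse=True)       # most steps first
    answers = [c.rsplit("The answer is", 1)[-1].strip(" .") for c in chains[:top_k]]
    return Counter(answers).most_common(1)[0][0]                               # majority over top-K chains
```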
2.7 PROGRAM-OF-THOUGHTS (POT)
The authors of Chen et al. (2022a) build upon CoT, but in contrast to CoT, which uses LLMs to perform both
reasoning and computation, PoT generates Python programs and thus delegates the computation part to a Python
interpreter. This work argues that reduced LLM responsibilities make it more accurate, especially for numerical
reasoning. PoT gets an average performance gain over CoT of around 12% across the Mathematical Problem
Solving, Table-Based Mathematical Problem Solving, Contextual Question-Answering and Conversational
Contextual Question-Answering tasks.
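A minimal PoT-style sketch, assuming a placeholder llm(prompt) that returns Python code storing its result in a variable named ans; the variable name and prompt wording are assumptions, and model-generated code should only ever be executed in a sandbox.

```python
from typing import Callable


def program_of_thoughts(llm: Callable[[str], str], question: str) -> str:
    """Ask the LLM for a Python program and let the interpreter do the computation.

    `llm(prompt)` is a placeholder completion function assumed to return only code
    that stores its final result in `ans`. Never exec untrusted model output
    outside a sandbox.
    """
    prompt = (
        f"# Question: {question}\n"
        "# Write Python code that computes the answer and stores it in a variable `ans`.\n"
    )
    code = llm(prompt)
    namespace: dict = {}
    exec(code, namespace)  # the interpreter, not the LLM, performs the arithmetic
    return str(namespace["ans"])
```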
2.8 LEAST-TO-MOST
The Least-to-Most Zhou et al. (2022) prompting technique tries to address the problem that CoT fails
to accurately solve problems harder than the exemplars shown in the prompts. It consists of two stages.
First, the LLM is prompted to decompose a given problem into sub-problems. Next, the LLM is prompted
to solve the sub-problems in a sequential manner. The answer to any sub-problem depends on the answer
to the previous sub-problem. The authors show that Least-to-Most prompting is able to significantly outperform
CoT and Basic prompting methods on the Commonsense Reasoning, Language-Based Task Completion,
Mathematical Problem Solving and Contextual Question-Answering tasks.
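A sketch of the two Least-to-Most stages, assuming a placeholder llm(prompt) and a one-sub-question-per-line decomposition format; both are illustrative assumptions rather than details from Zhou et al. (2022).

```python
from typing import Callable


def least_to_most(llm: Callable[[str], str], problem: str) -> str:
    """Stage 1: decompose into sub-problems. Stage 2: solve them sequentially.

    `llm(prompt)` is a placeholder completion function; the decomposition is
    assumed to come back as one sub-question per line.
    """
    decomposition = llm(
        f"Problem: {problem}\nBreak this problem into simpler sub-questions, one per line."
    )
    sub_questions = [line.strip() for line in decomposition.splitlines() if line.strip()]
    context = f"Problem: {problem}\n"
    answer = ""
    for sub_q in sub_questions:
        # Each sub-question sees the answers to all previous sub-questions.
        answer = llm(f"{context}Q: {sub_q}\nA:")
        context += f"Q: {sub_q}\nA: {answer}\n"
    return answer  # the answer to the last sub-question answers the original problem
```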
2.9 CHAIN-OF-SYMBOL (COS)
CoS Hu et al. (2023) builds upon the idea of CoT. In conventional CoT, the intermediate chain of reasoning
steps is represented in natural language. While this approach has shown remarkable results in many cases, it
can include incorrect or redundant information as well. The authors of this work present their hypothesis that
spatial descriptions are hard to express in natural language, thus making it difficult for LLMs to understand.
Instead, expressing these relationships using symbols in word sequences can be a better form of representation
for LLMs. CoS achieves an improvement of up to 60.8% accuracy for the Spatial Question-Answering
task.
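As a hypothetical illustration of the idea (the actual symbol set in Hu et al. (2023) is task-specific), the same spatial relations can be carried in the prompt either in natural language or as compact symbol sequences:

```python
# Spatial relations written out in natural language, as conventional CoT would carry them.
natural_language_steps = (
    "The book is on top of the laptop, the laptop is on top of the desk, "
    "and the mug is to the left of the desk."
)

# The same relations condensed into symbol sequences for the prompt.
chain_of_symbols = (
    "book / laptop / desk\n"   # "/" standing for "is on top of" (illustrative notation)
    "mug < desk"               # "<" standing for "is to the left of" (illustrative notation)
)
```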
2.10 STRUCTURED CHAIN-OF-THOUGHT (SCOT)
The intuition behind SCoT Li et al. (2023b) is that structuring intermediate reasoning steps using program
structures like sequencing, branching and looping helps in more accurate code generation than having intermediate
reasoning steps in natural language, as in conventional CoT. The authors claim that the
former approach more closely mimics a human developer's thought process than the latter, and this
has been confirmed by the final results, as SCoT outperforms CoT by up to 13.79% for the Code Generation
task.
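A hypothetical SCoT-style prompt: the intermediate step is an outline written with program structures rather than free-form natural language. The task and the outline wording are illustrative, not taken from Li et al. (2023b).

```python
scot_prompt = (
    "Task: write a Python function remove_vowels(s) that returns s without vowels.\n"
    "First outline the solution using only program structures "
    "(sequence, branch, loop), then write the code.\n"
    "Outline:\n"
    "  sequence: create an empty result string\n"
    "  loop: for each character in s\n"
    "    branch: if the character is not a vowel, append it to the result\n"
    "  sequence: return the result\n"
    "Code:\n"
)
```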
2.11 PLAN-AND-SOLVE (PS)
Wang et al. (2023) discusses and tries to address three shortcomings of CoT, which are calculation errors,
missing-step errors and semantic misunderstanding errors. PS contains two components, where the first one
requires devising a plan to divide the entire problem into smaller sub-problems and the second one needs to
carry out these sub-problems according to the plan. A better version of PS called PS+ adds more detailed
instructions, which helps in improving the quality of the reasoning steps. The PS prompting method improves the
accuracy over CoT by at least 5% for almost all the datasets in the Mathematical Problem Solving task in the zero-shot
setting. Similarly, for the Commonsense Reasoning task, it consistently outperforms CoT by at least
5% in the zero-shot setting, whereas for the Multi-Hop Reasoning task it gets an around 2% better accuracy score.
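A hypothetical zero-shot PS-style prompt; the trigger sentence is paraphrased from the plan-then-solve idea described above and the arithmetic question is illustrative.

```python
# Zero-shot Plan-and-Solve style trigger appended to the problem statement.
# The PS+ variant adds more detailed instructions, e.g. extracting relevant
# variables and calculating intermediate results carefully.
ps_prompt = (
    "Q: A shop sells pens in packs of 12. If a school orders 7 packs and gives "
    "out 59 pens, how many pens are left?\n"
    "A: Let's first understand the problem and devise a plan to solve it. "
    "Then, let's carry out the plan and solve the problem step by step."
)
```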
2.12 MATHPROMPTER
Imani et al. (2023) tries to address two key problems of CoT for the Mathematical Problem Solving task: (1) the lack
of validity of the steps followed by CoT for solving a problem; (2) how confident an LLM is in its predictions.
The MathPrompter prompting strategy consists of 4 steps in total. (I) Given a query, the first step requires
generating an algebraic expression for the query which replaces the numerical values with variables. (II) Next,
the LLM is prompted to solve the query analytically, either by deriving the algebraic expression or by writing a
Python function. (III) Third, the query in step (I) is solved by assigning different values to the variables.
(IV) If the solutions in (III) are correct over N iterations, the variables are finally replaced with the original query
values and the answer is computed. If not, then steps (II), (III) and (IV) are repeated. MathPrompter is
able to improve the performance on a dataset belonging to the Mathematical Problem Solving task from 78.7%
to 92.5%.
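A compact sketch of the four steps, assuming a placeholder llm(prompt), a two-variable template with variables A and B, and externally supplied original values; all of these are simplifying assumptions rather than details from Imani et al. (2023).

```python
import random
from typing import Callable, Dict


def mathprompter(llm: Callable[[str], str],
                 query: str,
                 values: Dict[str, int],
                 n_checks: int = 5) -> str:
    """MathPrompter-style sketch with a placeholder `llm(prompt)` call.

    `values` maps the template variables (assumed here to be A and B) to the
    numbers in the original query; parsing them automatically is omitted.
    """
    # (I) Rewrite the query as an algebraic template with variables.
    template = llm(f"Rewrite the question using variables A and B instead of numbers:\n{query}")
    # (II) Ask for an analytical solution as a Python function.
    code = llm(f"Write a Python function solve(A, B) that answers:\n{template}")
    namespace: dict = {}
    exec(code, namespace)  # defines solve(A, B); sandbox this in real use
    # (III) Check the function against direct answers on random variable assignments.
    for _ in range(n_checks):
        a, b = random.randint(1, 50), random.randint(1, 50)
        direct = llm(f"{template}\nA = {a}, B = {b}. Reply with a single number.")
        if str(namespace["solve"](a, b)) != direct.strip():
            raise ValueError("Analytical solution disagrees with direct answers; redo (II)-(IV).")
    # (IV) All checks passed: evaluate on the original query values.
    return str(namespace["solve"](values["A"], values["B"]))
```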
2.13 CONTRASTIVE COT/ CONTRASTIVE SELF-CONSISTENCY
The authors of Chia et al. (2023) claim that Contrastive CoT or Contrastive Self-Consistency is a general enhancement
of CoT or Self-Consistency. The inspiration for this prompting approach is based on how humans
can learn from both positive as well as negative examples. Along similar lines, in this prompting technique,
both positive and negative demonstrations are provided to enhance the reasoning capabilities of the LLM.
Contrastive CoT on average is able to gain a 10% improvement over conventional CoT for the
Mathematical Problem Solving task across multiple datasets. Similarly, Contrastive Self-Consistency is able
to outperform conventional Self-Consistency by over 15% for the Mathematical Problem Solving task across
multiple datasets. For the Multi-Hop Reasoning task, both Contrastive CoT and Contrastive Self-Consistency
have more than 10% gains over their conventional counterparts.
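A hypothetical Contrastive CoT prompt pairing a correct demonstration with an incorrect one; the questions and the "Correct/Wrong explanation" labels are illustrative, not the exact format of Chia et al. (2023).

```python
# Contrastive CoT exemplar: a positive (correct) demonstration and a negative
# (incorrect) demonstration for the same question, followed by the actual query.
contrastive_cot_prompt = (
    "Q: A box holds 6 eggs. How many eggs are in 4 boxes?\n"
    "Correct explanation: Each box holds 6 eggs, so 4 boxes hold 4 * 6 = 24 eggs. "
    "The answer is 24.\n"
    "Wrong explanation: Adding the numbers gives 6 + 4 = 10. The answer is 10.\n\n"
    "Q: A pack has 8 pencils. How many pencils are in 5 packs?\n"
    "A:"
)
```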
2.14 FEDERATED SAME/DIFFERENT PARAMETER SELF-CONSISTENCY/COT (FED-SP/DP-SC/COT)
Introduced in Liu et al. (2023), this prompting method is based on the core idea of improving the reasoning
capabilities of LLMs by using synonymous crowd-sourced queries. There are two slightly different variations
of this prompting method. The first one is Fed-SP-SC, where the crowd-sourced queries are paraphrased
versions of the original query but with the same parameters. Parameters here can refer to the numeric values in
Mathematical Problem Solving task datapoints. For Fed-SP-SC, the answers are directly generated first and
then Self-Consistency is applied on top of them. The other one is Fed-DP-CoT. In Fed-DP-CoT, LLMs are used
to first generate answers to different queries, which are then federated by forming a CoT to provide hints to
the LLMs. The results for these methods on the Mathematical Problem Solving task show that they are able to
do better than conventional CoT by at least 10% and up to 20%.
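A sketch of the Fed-SP-SC variation, assuming a placeholder llm(prompt) and an externally supplied list of paraphrased queries standing in for the crowd-sourced ones; the answer format is an assumption.

```python
from collections import Counter
from typing import Callable, List


def fed_sp_sc(llm: Callable[[str], str], query: str, paraphrases: List[str]) -> str:
    """Fed-SP-SC sketch: answer synonymous queries, then vote as in Self-Consistency.

    `llm(prompt)` is a placeholder completion function; `paraphrases` stands in for
    crowd-sourced rewordings of `query` that keep its numeric parameters unchanged.
    Answers are assumed to end with "The answer is <x>."
    """
    answers = []
    for q in [query] + paraphrases:
        reply = llm(f"Q: {q}\nA: Let's think step by step.")
        answers.append(reply.rsplit("The answer is", 1)[-1].strip(" ."))
    return Counter(answers).most_common(1)[0][0]  # the most consistent answer wins
```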
2.15 ANALOGICAL REASONING
The authors of this work Yasunaga et al. (2023) draw their inspiration from a psychological notion, analogical
reasoning, where people use pertinent prior experiences to solve new problems. In the realm of LLMs,
the authors first prompt them to generate examples similar to the original problem, followed by solving
them, and then proceed to answer the original problem. The results show that Analogical Reasoning is able
to achieve an average accuracy gain of 4% when compared to CoT across the Mathematical Problem Solving,
Code Generation, Logical Reasoning and Commonsense Reasoning tasks.
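A hypothetical Analogical Reasoning prompt in which the model is asked to recall and solve related problems before the original one; the wording and the number of recalled problems are illustrative, not taken from Yasunaga et al. (2023).

```python
# Analogical-reasoning style prompt: the model first generates and solves related
# exemplars of its own choosing, then answers the original problem.
analogical_prompt = (
    "Problem: How many positive divisors does 360 have?\n"
    "Recall three relevant problems you have seen before. For each, state the "
    "problem and work out its solution. Finally, solve the original problem."
)
```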