SELF-INSTRUCT: Aligning Language Model with Self Generated Instructions

Yizhong Wang♣  Yeganeh Kordi♢  Swaroop Mishra♡  Alisa Liu♣  Noah A. Smith♣+  Daniel Khashabi♠  Hannaneh Hajishirzi♣+

♣University of Washington  ♢Tehran Polytechnic  ♡Arizona State University  ♠Johns Hopkins University  +Allen Institute for AI

yizhongw@cs.washington.edu
Abstract

Large "instruction-tuned" language models (finetuned to respond to instructions) have demonstrated a remarkable ability to generalize zero-shot to new tasks. Nevertheless, they depend heavily on human-written instruction data that is limited in quantity, diversity, and creativity, therefore hindering the generality of the tuned model. We introduce SELF-INSTRUCT, a framework for improving the instruction-following capabilities of pretrained language models by bootstrapping off their own generations. Our pipeline generates instruction, input, and output samples from a language model, then prunes them before using them to finetune the original model. Applying our method to vanilla GPT3, we demonstrate a 33% absolute improvement over the original model on SUPER-NATURALINSTRUCTIONS, on par with the performance of InstructGPT-001,¹ which is trained with private user data and human annotations. For further evaluation, we curate a set of expert-written instructions for novel tasks, and show through human evaluation that tuning GPT3 with SELF-INSTRUCT outperforms using existing public instruction datasets by a large margin, leaving only a 5% absolute gap behind InstructGPT-001. SELF-INSTRUCT provides an almost annotation-free method for aligning pretrained language models with instructions, and we release our large synthetic dataset to facilitate future studies on instruction tuning.²

¹ Unless otherwise specified, our comparisons are with the text-davinci-001 engine. We focus on this engine since it is the closest to our experimental setup: supervised fine-tuning with human demonstrations. The newer engines are more powerful, though they use more data (e.g., code completion or latest user queries) or algorithms (e.g., PPO) that are difficult to compare with.

² Code and data will be available at https://github.com/yizhongw/self-instruct.
1 Introduction
The recent NLP literature has witnessed a tremendous amount of activity in building models that can follow natural language instructions (Mishra et al., 2022; Wei et al., 2022; Sanh et al., 2022; Wang et al., 2022; Ouyang et al., 2022; Chung et al., 2022, i.a.). These developments are powered by two key components: large pre-trained language models (LMs) and human-written instruction data. PROMPTSOURCE (Bach et al., 2022) and SUPER-NATURALINSTRUCTIONS (Wang et al., 2022) are two notable recent datasets that use extensive manual annotation for collecting instructions to construct T0 (Bach et al., 2022; Sanh et al., 2022) and Tk-INSTRUCT (Wang et al., 2022). However, this process is costly and often suffers limited diversity, given that most human generations tend to be popular NLP tasks, falling short of covering a true variety of tasks and different ways to describe them. Given these limitations, continuing to improve the quality of instruction-tuned models necessitates alternative approaches for supervising them.
In this work, we introduce SELF-INSTRUCT, a semi-automated process for instruction-tuning a pretrained LM using instructional signals from the model itself. The overall process is an iterative bootstrapping algorithm (see Figure 1), which starts off with a limited seed set of manually-written instructions (175 in our study) that are used to guide the overall generation. In the first phase, the model is prompted to generate instructions for new tasks. This step leverages the existing collection of instructions to create more broad-coverage instructions that define (often new) tasks. Given the newly generated set of instructions, the framework also creates input-output instances for them, which can later be used for supervising the instruction tuning. Finally, various measures are used to prune low-quality and repeated instructions before adding them to the task pool. This process can be repeated for many iterations until reaching a large number of tasks.
Figure 1: A high-level overview of SELF-INSTRUCT. The process starts with a small seed set of tasks (one instruction and one input-output instance for each task) as the task pool. Random tasks are sampled from the task pool and used to prompt an off-the-shelf LM to generate both new instructions and corresponding instances, which are filtered for low-quality or similar generations and then added back to the initial repository of tasks. The resulting data can later be used for instruction tuning of the language model itself to follow instructions better. Tasks shown in the figure are generated by GPT3. See Table 10 for more creative examples.

To evaluate SELF-INSTRUCT empirically, we run this framework on GPT3 (Brown et al., 2020), which is a vanilla LM (§4). The iterative SELF-INSTRUCT process on this model leads to about 52K instructions, paired with about 82K instance inputs and target outputs. We observe that the resulting data provides a diverse range of creative tasks, and over 50% of them have less than 0.3 ROUGE-L overlap with the seed instructions (§4.2). On this resulting data, we build GPT3-SELF-INST by fine-tuning GPT3 (i.e., the same model used for generating the instruction data). We evaluate GPT3-SELF-INST in comparison to various other models on both typical NLP tasks included in SUPER-NATURALINSTRUCTIONS (Wang et al., 2022), and a set of new instructions that are created for novel usage of instruction-following models (§5). The SUPERNI results indicate that GPT3-SELF-INST outperforms GPT3 (the original model) by a large margin (+33.1%) and nearly matches the performance of InstructGPT-001. Moreover, our human evaluation on the newly-created instruction set shows that GPT3-SELF-INST demonstrates a broad range of instruction-following ability, outperforming models trained on other publicly available instruction datasets and leaving only a 5% gap behind InstructGPT-001.
In summary, our contributions are: (1) SELF-INSTRUCT, a method for inducing instruction-following capability with minimal human-labeled data; (2) we demonstrate its effectiveness via extensive instruction-tuning experiments; (3) we release a large synthetic dataset of 52K instructions and a set of manually-written novel tasks for building and evaluating future instruction-following models.
2 Related Work
Instruction-following language models. A series of works have found evidence that vanilla language models can be effective at following general language instructions if tuned with annotated "instructional" data – datasets containing language instructional commands and their desired outcome based on human judgement (Weller et al., 2020; Mishra et al., 2022; Wang et al., 2022; Wei et al., 2022; Sanh et al., 2022; Ouyang et al., 2022; Parmar et al., 2022; Scialom et al., 2022; Chung et al., 2022; Luo et al., 2022; Puri et al., 2022; Yin et al., 2022; Chakrabarty et al., 2022; Lin et al., 2022; Gupta et al., 2022; Muennighoff et al., 2022). Additionally, they show a direct correlation between the size and diversity of the "instructional" data and the generalizability of the resulting models to unseen tasks. Since these developments depend on human-annotated "instructional" data, this poses a bottleneck for progress toward more generalizable models (see, for example, Fig. 5a in Wang et al., 2022). Our work aims to tackle this bottleneck by reducing the dependence on human annotators.
Additionally, despite the remarkable performance of models like InstructGPT (Ouyang et al., 2022), their construction process remains quite opaque. In particular, the role of data has remained understudied due to limited transparency and data released by major corporate entities behind these key models. Addressing such challenges necessitates the creation of a large-scale, public dataset covering a broad range of tasks.

Instruction-following models have also been of interest in the multi-modal learning literature (Fried et al., 2018; Shridhar et al., 2020; Min et al., 2022; Weir et al., 2022). SELF-INSTRUCT, as a general approach to expanding data, can potentially also be helpful in those settings; however, this is out of the scope of this work.
Language models for data generation and augmentation. A variety of works have relied on generative LMs for data generation (Schick and Schütze, 2021; Wang et al., 2021; Liu et al., 2022; Meng et al., 2022) or augmentation (Feng et al., 2021; Yang et al., 2020; Mekala et al., 2022). For example, Schick and Schütze (2021) propose to replace human annotations of a given task with prompting large LMs, and use the resulting data for fine-tuning (often smaller) models in the context of SuperGLUE tasks (Wang et al., 2019). While our work can be viewed as a form of "augmentation," it differs from this line in that it is not specific to a particular task (say, QA or NLI). In contrast, a distinct motivation for SELF-INSTRUCT is to bootstrap new task definitions that may not have been defined before by any NLP practitioner (though potentially still important for downstream users).
Self-training. A typical self-training framework (He et al., 2019; Xie et al., 2020; Du et al., 2021; Amini et al., 2022; Huang et al., 2022) uses trained models to assign labels to unlabeled data and then leverages the newly labeled data to improve the model. In a similar line, Zhou et al. (2022a) use multiple prompts to specify a single task and propose to regularize via prompt consistency, encouraging consistent predictions over the prompts. This allows either finetuning the model with extra unlabeled training data, or direct application at inference time. While SELF-INSTRUCT has some similarities with the self-training literature, most self-training methods assume a specific target task as well as unlabeled examples under it; in contrast, SELF-INSTRUCT produces a variety of tasks from scratch.
Knowledge distillation. Knowledge distillation (Hinton et al., 2015; Sanh et al., 2019; West et al., 2021; Magister et al., 2022) often involves the transfer of knowledge from larger models to smaller ones. SELF-INSTRUCT can also be viewed as a form of "knowledge distillation"; however, it differs from this line in the following ways: (1) the source and target of distillation are the same, i.e., a model's knowledge is distilled to itself; (2) the content of distillation is in the form of an instruction task (i.e., instructions that define a task, and a set of examples that instantiate it).
Bootstrapping with limited resources. A series of recent works use language models to bootstrap some inferences using specialized methods. NPPrompt (Zhao et al., 2022) provides a method to generate predictions for semantic labels without any fine-tuning. It uses a model's own embeddings to automatically find words relevant to the label of the data sample, and hence reduces the dependency on manual mappings from model predictions to labels (verbalizers). STAR (Zelikman et al., 2022) iteratively leverages a small number of rationale examples and a large dataset without rationales to bootstrap a model's ability to perform reasoning. Self-Correction (Welleck et al., 2022) decouples an imperfect base generator (model) from a separate corrector that learns to iteratively correct imperfect generations, and demonstrates improvement over the base generator. Our work instead focuses on bootstrapping new tasks in the instruction paradigm.
Instruction generation. A series of recent works (Zhou et al., 2022b; Ye et al., 2022; Singh et al., 2022; Honovich et al., 2022) generate instructions of a task given a few examples. While SELF-INSTRUCT also involves instruction generation, a major difference in our case is that it is task-agnostic; we generate new tasks (instructions along with instances) from scratch.
3 Method
Annotating large-scale instruction data can be challenging for humans because it requires 1) creativity to come up with novel tasks and 2) expertise for writing the labeled instances for each task. In this section, we detail our process for SELF-INSTRUCT, which refers to the pipeline of generating tasks with a vanilla pretrained language model itself, and then conducting instruction tuning with this generated data in order to align the language model to follow instructions better. This pipeline is depicted in Figure 1.
3.1 Defining Instruction Data
The instruction data we want to generate contains a set of instructions $\{I_t\}$, each of which defines a task $t$ in natural language. Each task has one or more input-output instances $(X_t, Y_t)$. A model $M$ is expected to produce the output $y$, given the task instruction $I_t$ and the instance input $x$:

$$M(I_t, x) = y, \quad \text{for } (x, y) \in (X_t, Y_t).$$

Note that the instruction and instance input do not have a strict boundary in many cases. For example, "write an essay about school safety" can be a valid instruction that we expect models to respond to directly, while it can also be formulated as "write an essay about the following topic" as the instruction, and "school safety" as an instance input. To encourage diversity in the data format, we allow instructions that do not require additional input (i.e., $x$ is empty).
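To make the format concrete, here is a minimal sketch (in Python) of how this instruction data could be represented; the class and field names are our own illustration, not part of the paper's released artifacts.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Instance:
    """One input-output pair (x, y) for a task; `input` may be the
    empty string when the instruction alone specifies the task."""
    input: str
    output: str

@dataclass
class Task:
    """A task t: its natural-language instruction I_t plus one or
    more instances (X_t, Y_t)."""
    instruction: str
    instances: List[Instance] = field(default_factory=list)

# The essay example from the text, expressed in both formats:
t1 = Task("Write an essay about school safety.",
          [Instance(input="", output="<essay text>")])
t2 = Task("Write an essay about the following topic.",
          [Instance(input="school safety", output="<essay text>")])
```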
3.2 Automatic Instruction Data Generation
Our pipeline for generating the instruction data consists of four steps: 1) instruction generation, 2) identifying whether the instruction represents a classification task or not, 3) instance generation with the input-first or the output-first approach, and 4) filtering low-quality data. A high-level sketch of how these steps compose is shown below; each step is then detailed in turn.
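As a rough illustration, the loop below wires the four steps together, reusing the `Task` and `Instance` classes sketched in §3.1. Each step is passed in as a callable because, in the paper, all four are realized by prompting the LM or by simple heuristics; the function names and signatures here are hypothetical scaffolding, not released code.

```python
from __future__ import annotations
from typing import Callable, List

def self_instruct(
    seed_tasks: List[Task],          # 175 seed tasks in the paper
    target_size: int,
    propose_instructions: Callable[[List[Task]], List[str]],    # Step 1
    is_classification: Callable[[str], bool],                   # Step 2
    generate_instances: Callable[[str, bool], List[Instance]],  # Step 3
    keep_instruction: Callable[[str, List[Task]], bool],        # Step 4
) -> List[Task]:
    """Grow the task pool by bootstrapping until it is large enough."""
    pool = list(seed_tasks)
    while len(pool) < target_size:
        for instruction in propose_instructions(pool):        # Step 1
            clf = is_classification(instruction)              # Step 2
            instances = generate_instances(instruction, clf)  # Step 3
            # Step 4: admit only filtered, non-duplicate tasks.
            if instances and keep_instruction(instruction, pool):
                pool.append(Task(instruction, instances))
    return pool
```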
Instruction Generation. SELF-INSTRUCT is based on a finding that large pretrained language models can be prompted to generate new and novel instructions when presented with some existing instructions in the context. This provides us with a way to grow the instruction data from a small set of seed human-written instructions. We propose to generate a diverse set of instructions in a bootstrapping fashion. We initiate the task pool with 175 tasks (1 instruction and 1 instance for each task) written by our authors. For every step, we sample 8 task instructions from this pool as in-context examples. Of the 8 instructions, 6 are from the human-written tasks, and 2 are from the model-generated tasks in previous steps, to promote diversity. The prompting template is shown in Table 6.
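For illustration, the in-context sampling could look like the following; the prompt wording is our own placeholder, while the paper's exact template is its Table 6.

```python
import random

def build_instruction_prompt(human_tasks, model_tasks,
                             num_human=6, num_model=2):
    """Sample 8 in-context instructions (6 human-written, 2
    model-generated) and format them for the LM to continue."""
    examples = random.sample(human_tasks, num_human)
    # Early in bootstrapping there may be fewer than 2 model-generated
    # tasks, so take however many are available.
    examples += random.sample(model_tasks, min(num_model, len(model_tasks)))
    random.shuffle(examples)
    lines = ["Come up with a series of tasks:"]
    lines += [f"Task {i + 1}: {inst}" for i, inst in enumerate(examples)]
    lines.append(f"Task {len(examples) + 1}:")  # the LM completes from here
    return "\n".join(lines)
```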
Classification Task Identification. Because we need two different approaches for classification and non-classification tasks, we next identify whether the generated instruction represents a classification task or not.³ We prompt vanilla GPT3 few-shot to determine this, using 12 classification instructions and 19 non-classification instructions from the seed tasks. The prompting template is shown in Table 7.
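A few-shot prompt for this step might be assembled as below; the phrasing is our guess, and the paper's actual template is its Table 7.

```python
def build_classification_prompt(labeled_seed, new_instruction):
    """`labeled_seed` holds (instruction, is_classification) pairs from
    the seed tasks (12 classification + 19 non-classification in the
    paper); the model answers Yes/No for the new instruction."""
    lines = ["Can the following task be regarded as a classification "
             "task with finite output labels?"]
    for instruction, is_clf in labeled_seed:
        lines.append(f"Task: {instruction}")
        lines.append(f"Is it classification? {'Yes' if is_clf else 'No'}")
    lines.append(f"Task: {new_instruction}")
    lines.append("Is it classification?")
    return "\n".join(lines)
```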
Instance Generation. Given the instructions and their task type, we generate instances for each instruction independently. This is challenging because it requires the model to understand what the target task is, based on the instruction, figure out what additional input fields are needed and generate them, and finally complete the task by producing the output. We found that pretrained language models can achieve this to a large extent when prompted with instruction-input-output in-context examples from other tasks. A natural way to do this is the Input-first Approach, where we can ask a language model to come up with the input fields first based on the instruction, and then produce the corresponding output. This generation order is similar to how models are used to respond to instruction and input, but here with in-context examples from other tasks. The prompting template is shown in Table 8.

However, we found that this approach can generate inputs biased toward one label, especially for classification tasks (e.g., for grammar error detection, it usually generates grammatical input). Therefore, we additionally propose an Output-first Approach for classification tasks, where we first generate the possible class labels, and then condition the input generation on each class label. The prompting template is shown in Table 9.⁴ We apply the output-first approach to the classification tasks identified in the former step, and the input-first approach to the remaining non-classification tasks.
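The difference between the two orders shows up directly in the prompt shape, roughly as sketched below; the exact templates are the paper's Tables 8 and 9, so this wording is an approximation.

```python
def input_first_prompt(instruction):
    # Ask the LM for an input first; the output is generated afterwards,
    # conditioned on the instruction and that input.
    return (f"Come up with examples for the following task.\n"
            f"Task: {instruction}\n"
            f"Input:")

def output_first_prompt(instruction, class_label):
    # Fix a class label first and ask for an input that matches it, so
    # classification inputs are not skewed toward a single label.
    return (f"Given the classification task and a class label, generate "
            f"an input that corresponds to that label.\n"
            f"Task: {instruction}\n"
            f"Class label: {class_label}\n"
            f"Input:")
```

In the output-first case, the LM would first be asked to enumerate the possible labels for the instruction, and `output_first_prompt` would then be called once per label.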
Filtering and Postprocessing. To encourage diversity, a new instruction is added to the task pool only when its ROUGE-L overlap with any existing instruction is less than 0.7. We also exclude instructions that contain certain keywords (e.g., images, pictures, graphs) that usually cannot be processed by language models. When generating new instances for each instruction, we filter out instances that are exactly the same, or those with the same input but different outputs.
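These rules are mechanical enough to sketch in code. The snippet below uses the `rouge-score` package for ROUGE-L; the blocklist is abbreviated from the paper's examples, the thresholds match the text, and the duplicate-handling policy (keep the first occurrence of an input) is one possible reading rather than the paper's stated choice.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
BLOCKLIST = ("image", "picture", "graph")  # media LMs cannot process

def keep_instruction(candidate, pool):
    """Admit a new instruction only if it avoids blocklisted keywords
    and its ROUGE-L overlap with every pooled instruction string
    in `pool` is below 0.7."""
    text = candidate.lower()
    if any(word in text for word in BLOCKLIST):
        return False
    return all(
        _scorer.score(existing, candidate)["rougeL"].fmeasure < 0.7
        for existing in pool
    )

def dedup_instances(instances):
    """Drop exact duplicates and any instance whose input was already
    seen with a different output."""
    seen_output_for = {}
    kept = []
    for inp, out in instances:
        if inp in seen_output_for:
            continue  # exact duplicate or conflicting output: drop
        seen_output_for[inp] = out
        kept.append((inp, out))
    return kept
```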
³ More concretely, we regard tasks that have a limited and small output label space as classification tasks.

⁴ In this work, we use a fixed set of seed tasks for prompting the instance generation, and thus only generate a small number of instances per task in one round. Future work can use randomly sampled tasks to prompt the model to generate a larger number of instances in multiple rounds.