MAGPIE: A Method for Automatically Generating Large-Scale, High-Quality Instruction Data


MAGPIE: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

Zhangchen Xu♠, Fengqing Jiang♠, Luyao Niu♠, Yuntian Deng♢, Radha Poovendran♠, Yejin Choi♠♢, Bill Yuchen Lin♢

♠University of Washington   ♢Allen Institute for AI

https://magpie-align.github.io/

https://hf.co/magpie-align

Abstract

High-quality instruction data is critical for aligning large language models (LLMs). Although some models, such as Llama-3-Instruct, have open weights, their alignment data remain private, which hinders the democratization of AI. High human labor costs and a limited, predefined scope for prompting prevent existing open-source data creation methods from scaling effectively, potentially limiting the diversity and quality of public alignment datasets. Is it possible to synthesize high-quality instruction data at scale by extracting it directly from an aligned LLM? We present a self-synthesis method for generating large-scale alignment data named MAGPIE. Our key observation is that aligned LLMs like Llama-3-Instruct can generate a user query when we input only the left-side templates up to the position reserved for user messages, thanks to their auto-regressive nature. We use this method to prompt Llama-3-Instruct and generate 4 million instructions along with their corresponding responses. We perform a comprehensive analysis of the extracted data and select 300K high-quality instances. To compare MAGPIE data with other public instruction datasets (e.g., ShareGPT, WildChat, Evol-Instruct, UltraChat, OpenHermes, Tulu-V2-Mix), we fine-tune Llama-3-8B-Base with each dataset and evaluate the performance of the fine-tuned models. Our results indicate that in some tasks, models fine-tuned with MAGPIE perform comparably to the official Llama-3-8B-Instruct, despite the latter being enhanced with 10 million data points through supervised fine-tuning (SFT) and subsequent feedback learning. We also show that using MAGPIE solely for SFT can surpass the performance of previous public datasets utilized for both SFT and preference optimization, such as direct preference optimization with UltraFeedback. This advantage is evident on alignment benchmarks such as AlpacaEval, ArenaHard, and WildBench, and importantly, it is achieved without compromising performance on reasoning tasks like MMLU-Redux, despite the alignment tax.

1 Introduction

Large language models (LLMs) such as GPT-4 and Llama-3 have become integral to AI applications due to their exceptional performance on a wide array of tasks by following instructions. The success of LLMs is heavily reliant on the data used for instruction fine-tuning, which equips them to handle a diverse range of tasks, including those not encountered during training. The effectiveness of this instruction tuning depends crucially on access to high-quality instruction datasets. However, the alignment datasets used for fine-tuning models like Llama-3-Instruct are typically private, even when the model weights are open, which impedes the democratization of AI and limits scientific research for understanding and enhancing LLM alignment.

To address the challenges in constructing such datasets, researchers have developed two main approaches. The first type of method involves human effort to generate and curate instruction data, which is both time-consuming and labor-intensive. In contrast, the second type of method uses LLMs to produce synthetic instructions. Although these methods reduce human effort, their success heavily depends on prompt engineering and the careful selection of initial seed questions. The diversity of synthetic data tends to decrease as the dataset size grows. Despite ongoing efforts, the scalable creation of high-quality and diverse instruction datasets continues to be a challenging problem.

arXiv:2406.08464v1 [cs.CL] 12 Jun 2024

[Figure 1 graphic. Left: the two-step pipeline. In Step 1, the pre-query template "<|start_header_id|>user<|end_header_id|>" is fed to the LLM, which produces the instruction "What materials should I use to build a nest?". In Step 2, the instruction followed by "<|start_header_id|>assistant<|end_header_id|>" is fed to the LLM, which produces the response "Building a nest! That’s a wonderful project! ……". The resulting instruction-response pairs pass through filters before SFT. Right: bar chart of AlpacaEval 2 length-controlled win rate (axis from 10% to 30%) for models trained on WildChat, OpenHermes, Tulu V2 Mix, UltraFeedback, ShareGPT, Evol Instruct, Magpie-Air, and Magpie-Pro, together with Llama-3-Instruct, grouped as SFT only, SFT + DPO, and SFT + RLHF.]

“Other birds collect twigs for their nests. Magpies acquire jewels for theirs.”

Figure 1: This figure illustrates the process of self-synthesizing instruction data from aligned LLMs (e.g., Llama-3-8B-Instruct) to create a high-quality instruction dataset. In Step 1, we input only the pre-query template into the aligned LLM and generate an instruction along with its response using auto-regressive generation. In Step 2, we use a combination of a post-query template and another pre-query template to wrap the instruction from Step 1, prompting the LLM to generate the query for the second turn. This completes the construction of the instruction dataset. MAGPIE efficiently generates diverse and high-quality instruction data. Our experimental results show that MAGPIE outperforms other public datasets for aligning Llama-3-8B-base.

Is it possible to synthesize high-quality instructions at scale by directly extracting data from advanced aligned LLMs themselves? A typical input to an aligned LLM contains three key components: the pre-query template, the query, and the post-query template. For instance, an input to Llama-2-chat could be “[INST] Hi! [/INST]”, where [INST] is the pre-query template and [/INST] is the post-query template. These templates are predefined by the creators of the aligned LLMs to ensure the correct prompting of the models. We observe that when we only input the pre-query template to aligned LLMs such as Llama-3-Instruct, they self-synthesize a user query due to their auto-regressive nature. Our preliminary experiments indicate that these random user queries are of high quality and great diversity, suggesting that the abilities learned during the alignment process are effectively utilized.
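The three-part structure of a prompt can be sketched in a few lines. The block below is a minimal illustration, assuming the publicly documented Llama-3 chat format for the two template strings; the function names `full_prompt` and `magpie_prompt` are hypothetical and not taken from the paper.

```python
# Template pieces, assumed from the public Llama-3 chat format.
PRE_QUERY = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
POST_QUERY = "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"


def full_prompt(query: str) -> str:
    """Ordinary usage: pre-query template + query + post-query template."""
    return PRE_QUERY + query + POST_QUERY


def magpie_prompt() -> str:
    """MAGPIE-style usage: only the pre-query template, with no query at all."""
    return PRE_QUERY


# An ordinary prompt always starts with what MAGPIE sends on its own.
print(repr(full_prompt("Hi!")))
print(repr(magpie_prompt()))
```

Because the model was trained on sequences in which a user message follows the pre-query template, the most natural continuation of `magpie_prompt()` is a plausible user query.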

Based on these findings, we developed a self-synthesis method to construct high-quality instruction datasets at scale, named MAGPIE (as illustrated in Figure 1). Unlike existing methods, our approach does not rely on prompt engineering or seed questions. Instead, it directly constructs instruction data by prompting aligned LLMs with a pre-query template for sampling instructions. We applied this method to the Llama-3-8B-Instruct and Llama-3-70B-Instruct models, creating two instruction datasets: MAGPIE-Air and MAGPIE-Pro, respectively.

Our MAGPIE-Air and MAGPIE-Pro datasets were created using 206 and 614 GPU hours, respectively, without requiring any human intervention or API access to production LLMs like GPT-4. Additionally, we generated two multi-turn instruction datasets, MAGPIE-Air-MT and MAGPIE-Pro-MT, which contain sequences of multi-turn instructions and responses. The statistics and advantages of our instruction datasets compared to existing ones are summarized in Table 1. We perform a comprehensive analysis of the generated data, allowing practitioners to filter and select data instances from these datasets for fine-tuning according to their particular needs.

To compare MAGPIE data with other public instruction datasets (e.g., ShareGPT [66], WildChat [64], Evol Instruct [58], UltraChat [16], OpenHermes [49], Tulu V2 Mix [24]) and various preference tuning strategies with UltraFeedback, we fine-tune the Llama-3-8B-Base model with each dataset and assess the performance of the resultant models on LLM alignment benchmarks such as AlpacaEval 2, Arena-Hard, and WildBench. Our results show that models fine-tuned with MAGPIE achieve superior performance, even surpassing on AlpacaEval the official Llama-3-8B-Instruct model, which was fine-tuned with over 10 million data points for supervised fine-tuning (SFT) and follow-up feedback learning. Not only does MAGPIE excel in SFT alone compared to prior public datasets that incorporate both SFT and preference optimization (e.g., direct preference optimization with UltraFeedback), but it also delivers the best results when evaluated against six baseline instruction datasets and four preference tuning methods (DPO, IPO, KTO, and ORPO with the UltraFeedback dataset). These findings show the exceptional quality of instruction data generated by MAGPIE, enabling it to outperform even the official, extensively optimized LLMs.

Table 1: Statistics of instruction datasets generated by MAGPIE compared to other instruction datasets. Tokens are counted using the tiktoken library [42].

Instruction Source  Dataset Name           #Convs  #Turns  Human Effort  Response Generator  #Tokens / Turn    #Total Tokens
Synthetic           Alpaca [47]            52K     1       Low           text-davinci-003    67.38 ± 54.88     3.5M
                    Evol Instruct [58]     143K    1       Low           ChatGPT             473.33 ± 330.13   68M
                    UltraChat [16]         208K    3.16    Low           ChatGPT             376.58 ± 177.81   238M
Human               Dolly [14]             15K     1       High          ChatGPT             94.61 ± 135.84    1.42M
                    ShareGPT [66]          112K    4.79    High          ChatGPT             465.38 ± 368.37   201M
                    WildChat [64]          652K    2.52    High          GPT-3.5 & GPT-4     727.09 ± 818.84   852M
                    LMSYS-Chat-1M [65]     1M      2.01    High          Mix                 260.37 ± 346.97   496M
Mixture             Deita [38]             9.5K    22.02   -             Mix                 372.78 ± 182.97   74M
                    OpenHermes [49]        243K    1       -             Mix                 297.86 ± 258.45   72M
                    Tulu V2 Mixture [24]   326K    2.31    -             Mix                 411.94 ± 447.48   285M
MAGPIE              Llama-3-MAGPIE-Air     3M      1       No            Llama-3-8B          426.39 ± 217.39   1.28B
                    Llama-3-MAGPIE-Air-MT  300K    2       No            Llama-3-8B          610.80 ± 90.61    366M
                    Llama-3-MAGPIE-Pro     1M      1       No            Llama-3-70B         478.00 ± 211.09   477M
                    Llama-3-MAGPIE-Pro-MT  300K    2       No            Llama-3-70B         554.53 ± 133.64   333M
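As a rough consistency check on Table 1, the total token count of a dataset should be close to the product of its conversation count, its average number of turns, and its mean tokens per turn. The sketch below assumes that relationship, which the paper does not state explicitly, and uses hypothetical helper names; small mismatches are expected because the table entries are rounded.

```python
def parse_count(text: str) -> float:
    """Convert a count such as '52K', '3M', or '1.28B' into a plain number."""
    scale = {"K": 1e3, "M": 1e6, "B": 1e9}
    suffix = text[-1].upper()
    if suffix in scale:
        return float(text[:-1]) * scale[suffix]
    return float(text)


def estimated_total_tokens(convs: str, turns: float, tokens_per_turn: float) -> float:
    """Assumed relationship: total tokens = conversations x turns x tokens per turn."""
    return parse_count(convs) * turns * tokens_per_turn


# (#Convs, #Turns, mean #Tokens / Turn, reported #Total Tokens), copied from Table 1.
rows = {
    "Alpaca": ("52K", 1, 67.38, "3.5M"),
    "Llama-3-MAGPIE-Air": ("3M", 1, 426.39, "1.28B"),
    "Llama-3-MAGPIE-Air-MT": ("300K", 2, 610.80, "366M"),
    "Llama-3-MAGPIE-Pro-MT": ("300K", 2, 554.53, "333M"),
}

for name, (convs, turns, per_turn, reported) in rows.items():
    estimate = estimated_total_tokens(convs, turns, per_turn)
    print(f"{name}: estimated {estimate / 1e6:.1f}M tokens, reported {reported}")
```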

2 MAGPIE: A Scalable Method to Synthesize Instruction Data

Overview of MAGPIE. In what follows, we describe our method, MAGPIE, to synthesize instruction data for fine-tuning LLMs. An instance of instruction data consists of one or more instruction-response pairs. Each pair specifies the roles of instruction provider and follower, along with their instruction and response. As shown in Figure 1, MAGPIE consists of two steps: (1) instruction generation, and (2) response generation. The pipeline of MAGPIE can be fully automated without any human intervention. Given the data generated by MAGPIE, practitioners may customize and build their own personalized instruction dataset accordingly (see Section 3 and Appendix B for more details). We detail each step in the following.

Step 1: Instruction Generation. The goal of this step is to generate an instruction for each instance of instruction data. Given an open-weight aligned LLM (e.g., Llama-3-70B-Instruct), MAGPIE crafts an input query in the format of the predefined instruction template of the LLM. This query defines only the role of instruction provider (e.g., user), and does not provide any instruction. Note that the auto-regressive LLM has been fine-tuned using instruction data in the format of the predefined instruction template. Thus, the LLM autonomously generates an instruction when the query crafted by MAGPIE is given as an input. MAGPIE stops generating the instruction once the LLM produces an end-of-sequence token. Sending the crafted query to the LLM multiple times leads to a set of instructions. Compared with existing synthetic approaches, MAGPIE does not require specific prompt engineering techniques since the crafted query follows the format of the predefined instruction template. In addition, MAGPIE autonomously generates instructions without using any seed question, ensuring the diversity of generated instructions.
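The sampling loop of Step 1 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the template and end-of-turn strings follow the public Llama-3 chat format, `complete` is a hypothetical callable standing in for a sampling LLM call, and `toy_complete` is a canned stand-in so that the example runs without a model.

```python
import random
from typing import Callable, List

# Assumed Llama-3 strings; a different model family would use its own template.
PRE_QUERY = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
EOS = "<|eot_id|>"


def sample_instructions(complete: Callable[[str], str], n: int) -> List[str]:
    """Send the same pre-query template n times; keep the text before the EOS token."""
    instructions = []
    for _ in range(n):
        raw = complete(PRE_QUERY)
        instruction = raw.split(EOS, 1)[0].strip()
        if instruction:  # skip empty generations
            instructions.append(instruction)
    return instructions


def toy_complete(prompt: str) -> str:
    """Hypothetical stand-in for a sampling LLM; a real model decodes token by token."""
    canned = [
        "What materials should I use to build a nest?",
        "Can you explain what t-SNE is used for?",
    ]
    return random.choice(canned) + EOS + " text after the stop token"


print(sample_instructions(toy_complete, 3))
```

Variation between calls comes only from sampling randomness, since every call receives an identical prompt.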

Step 2: Response Generation. The goal of this step is to generate responses to the instructions obtained from Step 1. MAGPIE sends these instructions to the LLM to generate the corresponding responses. Combining the roles of instruction provider and follower, the instructions from Step 1, and the responses generated in Step 2 yields the instruction dataset. Detailed discussion on the generation configuration can be found in Appendix D.
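Step 2 can be sketched in the same style. The block below assumes the public Llama-3 chat format for the template strings and stores each instance as a list of role-tagged messages; that record layout, the `complete` callable, and `toy_complete` are illustrative choices rather than details given in the paper.

```python
from typing import Callable, Dict, List

# Assumed Llama-3 strings; `complete` is a hypothetical LLM call.
PRE_QUERY = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
POST_QUERY = "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
EOS = "<|eot_id|>"

Conversation = List[Dict[str, str]]


def build_dataset(instructions: List[str], complete: Callable[[str], str]) -> List[Conversation]:
    """Wrap each instruction in the full template, generate a response, and pair them."""
    dataset = []
    for instruction in instructions:
        prompt = PRE_QUERY + instruction + POST_QUERY
        response = complete(prompt).split(EOS, 1)[0].strip()
        dataset.append([
            {"role": "user", "content": instruction},       # instruction provider
            {"role": "assistant", "content": response},     # instruction follower
        ])
    return dataset


def toy_complete(prompt: str) -> str:
    """Hypothetical stand-in for the aligned LLM, returning a fixed response."""
    return "Building a nest! That's a wonderful project!" + EOS


data = build_dataset(["What materials should I use to build a nest?"], toy_complete)
print(data[0])
```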

Extensions of MAGPIE. MAGPIE can be readily extended to generate multi-turn instruction datasets and preference datasets. In addition, practitioners can specify the task requested by the instructions. We defer the detailed discussion on these extensions to Appendix A.

Figure 2: Lengths of instructions and responses in MAGPIE-Air/Pro. Panels: (a) Input Length of MAGPIE-Air (in tokens); (b) Output Length of MAGPIE-Air (in tokens); (d) Input Length of MAGPIE-Pro (in tokens).

Figure 3: This figure compares the t-SNE plot of MAGPIE-Pro with those of Alpaca, Evol Instruct, and UltraChat, each of which is sampled with 10,000 instructions. The t-SNE plot of MAGPIE-Pro encompasses the area covered by the other plots, demonstrating the comprehensive coverage of MAGPIE-Pro. (Legend: Alpaca, Evol Instruct, UltraChat, Magpie.)

3 Dataset Analysis

We apply MAGPIE to the Llama-3-8B-Instruct and Llama-3-70B-Instruct models to construct two instruction datasets: MAGPIE-Air and MAGPIE-Pro, respectively. Examples of instances in both datasets can be found in Appendix G. In this section, we present a comprehensive statistical analysis of the MAGPIE-Air and MAGPIE-Pro datasets. An overview of the lengths of instructions and responses of the data in MAGPIE-Air and MAGPIE-Pro is presented in Figure 2. In what follows, we first assess the breadth of MAGPIE-Pro by analyzing its coverage. We then discuss the attributes of MAGPIE-Pro, including topic coverage, difficulty, quality, and similarity of instructions, as well as quality of response. Finally, we provide the safety analysis and cost analysis. Using our dataset analysis, practitioners can customize and configure their own datasets for fine-tuning LLMs. In Appendix B, we showcase the process of customizing and filtering an instruction dataset based on our analysis. Specifically, we select 300K instances from MAGPIE-Pro and MAGPIE-Air-Filtered, yielding datasets MAGPIE-Pro-300K and MAGPIE-Air-300K-Filtered, respectively.

3.1 Dataset Coverage

Following prior work, we analyze the coverage of MAGPIE-Pro in the embedding space. Specifically, we use the all-mpnet-base-v2 embedding model (https://huggingface.co/sentence-transformers/all-mpnet-base-v2) to calculate the input embeddings, and employ t-SNE to project these embeddings into a two-dimensional space. We adopt three synthetic datasets as baselines, including Alpaca [47], Evol Instruct [58], and UltraChat [16], to demonstrate the coverage of MAGPIE-Pro.

Figure 3 presents the t-SNE plots of MAGPIE-Pro, Alpaca, Evol Instruct, and UltraChat. Each t-SNE plot is generated by randomly sampling 10,000 instructions from the associated dataset. We observe that the t-SNE plot of MAGPIE-Pro encompasses the area covered by the plots of Alpaca, Evol Instruct, and UltraChat. This suggests that MAGPIE-Pro provides a broader or more diverse range of topics, highlighting its extensive coverage across varied themes and subjects. Following prior practice, we also present the most common verbs and their top direct noun objects in instructions in Appendix C, indicating the diverse topic coverage of the MAGPIE dataset. Coverage analysis of MAGPIE-Air can also be found in Appendix C.

3.2 Dataset Attributes

Attribute: Task Categories of Instructions. We use Llama-3-8B-Instruct to categorize the instances in MAGPIE-Pro (see Figure 7 in Appendix C.1 for details). The prompts used to query Llama-3-8B-Instruct can be found in Appendix F. Our observations indicate that over half of the tasks in MAGPIE-Pro pertain to information seeking, making it the predominant category. This is followed by tasks involving creative writing, advice seeking, planning, and math. This distribution over the task categories aligns with the practical requests from human users [33].

Figure 4: The statistics of input difficulty and quality. Panels: (a) Statistics on Input Quality; (b) Statistics on Input Difficulty.

Attribute: Quality of Instructions. We use the Llama-3-8B-Instruct model to assess the quality of each instruction in MAGPIE-Air and MAGPIE-Pro, categorizing them as ‘very poor’, ‘poor’, ‘average’, ‘good’, and ‘excellent’. We present the histograms of qualities for both datasets in Figure 4-(a). We make two observations. First, both datasets are of high quality, with the majority of instances rated ‘average’ or higher. Second, the overall quality of MAGPIE-Pro surpasses that of MAGPIE-Air. We hypothesize that this is due to the enhanced capabilities of Llama-3-70B compared with Llama-3-8B.

Attribute: Difficulty of Instructions. We use the Llama-3-8B-Instruct model to rate the difficulty of each instruction in MAGPIE-Air and MAGPIE-Pro. Each instruction can be labeled as ‘very easy’, ‘easy’, ‘medium’, ‘hard’, or ‘very hard’. Figure 4-(b) presents the histograms of the levels of difficulty for MAGPIE-Air and MAGPIE-Pro. We observe that the distributions across difficulty levels are similar for MAGPIE-Air and MAGPIE-Pro. Some instructions in MAGPIE-Pro are more challenging than those in MAGPIE-Air because MAGPIE-Pro is generated by a more capable model (Llama-3-70B-Instruct).
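Once each instruction carries a quality and a difficulty label, selecting a subset reduces to comparing positions on two ordinal scales. The sketch below is one possible way to do this; the record layout, the thresholds, and the function names are hypothetical and are not the filter configuration used in the paper.

```python
from collections import Counter
from typing import Dict, List

# Ordinal scales, lowest to highest, using the label names given in the text.
QUALITY = ["very poor", "poor", "average", "good", "excellent"]
DIFFICULTY = ["very easy", "easy", "medium", "hard", "very hard"]


def at_least(label: str, threshold: str, scale: List[str]) -> bool:
    """True if `label` ranks at or above `threshold` on the given ordinal scale."""
    return scale.index(label) >= scale.index(threshold)


def select(instances: List[Dict[str, str]], min_quality: str, min_difficulty: str) -> List[Dict[str, str]]:
    """Keep instances whose labels meet both thresholds."""
    return [
        item for item in instances
        if at_least(item["quality"], min_quality, QUALITY)
        and at_least(item["difficulty"], min_difficulty, DIFFICULTY)
    ]


# Toy records with made-up labels, for illustration only.
toy = [
    {"instruction": "hi", "quality": "very poor", "difficulty": "very easy"},
    {"instruction": "Plan a 3-day trip to Kyoto.", "quality": "good", "difficulty": "medium"},
    {"instruction": "Prove that sqrt(2) is irrational.", "quality": "excellent", "difficulty": "hard"},
]

print(Counter(item["quality"] for item in toy))  # histogram of quality labels
print(select(toy, min_quality="good", min_difficulty="easy"))
```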

Figure 5: This figure summarizes the minimum neighbor distances and reward differences. Panels: (a) Min Neighbor Distance of MAGPIE-Air; (b) Reward Difference of Base Model and Instruct Model.

Attribute: Instruction Similarity. We quantify the similarity among instructions generated by MAGPIE to remove repetitive instructions. We measure the similarity using minimum neighbor distance in the embedding space. Specifically, we first represent all instructions in the embedding space using the all-mpnet-base-v2 embedding model. For any given instruction, we then calculate the minimum distance from the instruction to its nearest neighbors in the embedding space using Facebook AI Similarity Search (FAISS). The minimum neighbor distances of instructions in MAGPIE-Air after removing repetitions are summarized in Figure 5-(a).
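The minimum neighbor distance can be illustrated with a brute-force search over a handful of vectors. This sketch deliberately replaces FAISS with a quadratic-time loop and uses tiny 2-D tuples instead of real sentence embeddings; the greedy de-duplication rule and its threshold are assumptions made for illustration, not the procedure described in the paper.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def min_neighbor_distances(vectors: List[Vector]) -> List[float]:
    """For each vector, the Euclidean distance to its closest other vector."""
    result = []
    for i, v in enumerate(vectors):
        nearest = min(math.dist(v, w) for j, w in enumerate(vectors) if j != i)
        result.append(nearest)
    return result


def drop_near_duplicates(vectors: List[Vector], threshold: float) -> List[int]:
    """Greedy filter: keep an index only if it is farther than `threshold`
    from every vector that has already been kept."""
    kept: List[int] = []
    for i, v in enumerate(vectors):
        if all(math.dist(v, vectors[k]) > threshold for k in kept):
            kept.append(i)
    return kept


# Toy embeddings: the first two points are near-duplicates of each other.
toy = [(0.0, 0.0), (0.0, 0.01), (1.0, 1.0), (3.0, 4.0)]
print(min_neighbor_distances(toy))
print(drop_near_duplicates(toy, threshold=0.1))
```

At the scale of millions of instructions, an approximate or indexed nearest-neighbor search such as FAISS is needed because the brute-force loop grows quadratically with dataset size.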

Attribute: Quality of Responses. We assess the quality of responses using a metric named reward difference. For each instance in our dataset, the reward difference is calculated as $r^* - r_{\text{base}}$, where $r^*$ is the reward assigned by a reward model to the response in our dataset, and $r_{\text{base}}$ is the reward assigned by the same model to the response generated by the Llama-3 base model for the same instruction. We use URIAL to elicit responses from the base model. A positive reward difference indicates that the response from our dataset is of higher quality, and could potentially benefit instruction tuning. Following prior work, we use FsfairX-LLaMA3-RM-v0.1 as our reward model in our experiments. Our results on the reward difference are presented in Figure 5-(b).
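The reward difference is a simple subtraction once a reward model is available. In the sketch below, `reward_model` is a hypothetical callable mapping an instruction and a response to a scalar, and `toy_reward` is a made-up word-count scorer used only so the example runs; it does not reflect how FsfairX-LLaMA3-RM-v0.1 scores text.

```python
from typing import Callable

RewardModel = Callable[[str, str], float]


def reward_difference(
    instruction: str,
    dataset_response: str,
    base_response: str,
    reward_model: RewardModel,
) -> float:
    """Compute r* - r_base, scoring both responses with the same reward model."""
    r_star = reward_model(instruction, dataset_response)
    r_base = reward_model(instruction, base_response)
    return r_star - r_base


def toy_reward(instruction: str, response: str) -> float:
    """Hypothetical scorer: counts words. A real reward model is a learned network."""
    return float(len(response.split()))


diff = reward_difference(
    "What materials should I use to build a nest?",
    "Twigs, grass, moss, and mud are common choices.",
    "Twigs.",
    toy_reward,
)
print(diff, "-> keep" if diff > 0 else "-> review")
```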

3.3 Safety Analysis

We use Llama-Guard-2 to analyze the safety of MAGPIE-Air and MAGPIE-Pro. Our results indicate that both datasets are predominantly safe, with less than 1% of the data potentially containing harmful instructions or responses. Please refer to Appendix C.2 for detailed safety analysis.
