Co-training Improves Prompt-based Learning for Large Language Models

Hunter Lang¹  Monica Agrawal¹  Yoon Kim¹  David Sontag¹
Abstract
We demonstrate that co-training (Blum & Mitchell, 1998) can improve the performance of prompt-based learning by using unlabeled data. While prompting has emerged as a promising paradigm for few-shot and zero-shot learning, it is often brittle and requires much larger models compared to the standard supervised setup. We find that co-training makes it possible to improve the original prompt model and at the same time learn a smaller, downstream task-specific model. In the case where we only have partial access to a prompt model (e.g., output probabilities from GPT-3 (Brown et al., 2020)) we learn a calibration model over the prompt outputs. When we have full access to the prompt model's gradients but full finetuning remains prohibitively expensive (e.g., T0 (Sanh et al., 2022)), we learn a set of soft prompt continuous vectors to iteratively update the prompt model. We find that models trained in this manner can significantly improve performance on challenging datasets where there is currently a large gap between prompt-based learning and fully-supervised models.
1. Introduction

Prompt-based learning, in which a pretrained language model is adapted to various tasks by priming on natural language prompts, has emerged as a promising framework for few-shot and zero-shot learning (Brown et al., 2020; Liu et al., 2021a; Wei et al., 2021; Sanh et al., 2022). While intriguing, these methods can be sensitive to trivial cosmetic artifacts, including variations in prompt wording and the ordering of examples (Lu et al., 2021; Zhao et al., 2021; Kumar & Talukdar, 2021). Further, the models used in prompt-based learning (e.g., GPT-3, T0) are much larger than those typically used for standard fine-tuning. These factors make prompt-based learning difficult to use in practice.
Given a small amount of labeled data, one could evaluate the performance of each prompt and re-calibrate the prompt outputs to improve performance. However, (i) this reliance on labeled data goes against the goal of few-shot learning, and (ii) even with oracle calibration, some prompts have sub-par accuracy. Recently, to address issue (i), Zhao et al. (2021) developed a data-free calibration method that can dramatically improve the accuracy of few-shot prompts for GPT-3. We build on their work by showing how to use unlabeled data to further improve performance.

¹MIT CSAIL. Correspondence to: <hjl@mit.edu>.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).
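As a concrete illustration of this kind of data-free recalibration, contextual calibration estimates the prompt's bias from a content-free input and divides it out. The function below is a minimal sketch in that spirit; the function name and probability values are ours, not the paper's:

```python
import numpy as np

def contextual_calibration(p_content_free, p_test):
    """Sketch of contextual calibration (Calibrate-Before-Use, Zhao et al., 2021).

    p_content_free: label probabilities on a content-free input such as "N/A",
                    which estimate the prompt's a-priori bias toward each label.
    p_test:         raw label probabilities on a real test input.
    """
    calibrated = p_test / p_content_free  # apply W = diag(p_cf)^{-1} to p_test
    return calibrated / calibrated.sum()  # renormalize to a distribution

# Hypothetical numbers: the prompt is biased toward the first label.
p_cf = np.array([0.7, 0.3])   # bias measured on the content-free input
p_x  = np.array([0.6, 0.4])   # raw output on a test example
print(contextual_calibration(p_cf, p_x))  # calibration flips the prediction
```

With these made-up numbers, the raw output favors the first label only because the prompt itself does; after dividing out the bias, the second label wins.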
To leverage unlabeled data, we use co-training (Blum & Mitchell, 1998), which operates on two views of each data point X: ϕ₀(X) and ϕ₁(X). For example, in a clinical diagnosis system, ϕ₀(X) could be laboratory test results and ϕ₁(X) an X-ray image. A pair of models (h₀ and h₁, respectively) takes turns labeling a large unlabeled training set, and each model is trained on the confident pseudo-labels from the other. Model h₀ only uses ϕ₀(X), and model h₁ uses ϕ₁(X). By using complementary information in the views ϕ₀, ϕ₁ and the different inductive biases from models h₀, h₁, co-training allows each model to learn from the other without labeled data. The initial signal to start the co-training process is provided by a "guess" at a model h₀.
To combine co-training and prompt-based learning, we use outputs from a large prompt-based model as ϕ₀(X) and the pre-trained representation from a much smaller language model (e.g., DeBERTa (He et al., 2021)) as ϕ₁(X). We specify the models h₀ and h₁ based on whether we have partial access to the prompt model (querying GPT-3) or full access (locally training T0).

In the partial access setting, we only have access to the large model's output probabilities. In this case, we use unlabeled data to learn a model h₀ that both calibrates individual prompts and ensembles multiple prompts. We refer to this as the label model. We use Calibrate-Before-Use (Zhao et al., 2021) to initialize the calibration parameters of this model for each prompt, and we initialize the ensembling parameters to approximate majority vote. We then refine this initial guess for h₀ with co-training. We use the pre-trained representation from DeBERTa (He et al., 2021) for ϕ₁(X) and train the last few layers of that model as h₁. The only labeled data used is the set of k examples used in the input prompts. Figure 1 (left) shows the co-training process for partial access.
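A minimal sketch of such a label model follows. The per-prompt affine calibration plus weighted-ensemble parameterization is our illustrative guess at the structure described above, not the paper's exact formulation:

```python
import numpy as np

class LabelModel:
    """Illustrative label model h0: per-prompt calibration followed by a
    learned ensemble over the k one-shot prompts."""

    def __init__(self, k, n_labels):
        # One affine calibration map per prompt. In the paper these would be
        # initialized from Calibrate-Before-Use; identity suffices for a sketch.
        self.W = np.stack([np.eye(n_labels) for _ in range(k)])
        self.b = np.zeros((k, n_labels))
        # Ensemble weights, initialized to approximate majority vote.
        self.alpha = np.full(k, 1.0 / k)

    def predict_proba(self, prompt_probs):
        """prompt_probs: (k, n_labels) per-prompt output probabilities."""
        calibrated = np.einsum('kij,kj->ki', self.W, prompt_probs) + self.b
        logits = self.alpha @ calibrated      # weighted combination over prompts
        e = np.exp(logits - logits.max())     # softmax
        return e / e.sum()

lm = LabelModel(k=4, n_labels=2)
p = lm.predict_proba(np.random.rand(4, 2))   # a single probability vector
```

Co-training would then update `W`, `b`, and `alpha` on the confident pseudo-labels produced by the DeBERTa-based h₁.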
We also study a full access setting using T0 (Sanh et al.,
[Figure 1 diagram: an unlabeled example is formatted either as k prompts (e.g., "{{premise}} Question: {{hypothesis}} True or False?") whose GPT-3 output probabilities feed the label model h₀, or as a single hard prompt whose input embedding, with a soft prompt prepended, is passed through T0; in both cases this view is co-trained against an MLP h₁ on a DeBERTa contextual embedding.]
Figure 1. The setup for our two applications of co-training to prompting for a binary entailment classification dataset (RTE). Parameters in blue are trainable; models in gray are fixed. Left: training a "label model" for post-hoc calibration and ensembling of multiple prompts. Here the prompts and the model (GPT-3) are fixed, and we co-train the calibration / ensembling parameters with the task-specific model (e.g., DeBERTa). Right: training a soft prompt. Here the input is encoded as a hard prompt and the embedding matrix of the input sequence is obtained. An L × d matrix of trainable parameters (the "soft prompt") is prepended to this embedding, and the combined embedding sequence is passed through T0 to get output predictions. We co-train the soft prompt with the view 1 model (e.g., DeBERTa).
2022) instead of GPT-3, so we can introspect the large prompt model. We derive the view ϕ₀(X) and the model h₀ from T0¹. However, instead of fully fine-tuning T0 during co-training, we focus on soft prompt tuning, which trains several orders-of-magnitude fewer parameters while attaining similar performance (Li & Liang, 2021; Lester et al., 2021). The parameter space for model h₀ is the set of soft prompts, which are matrices in ℝ^{L×d}, where L is a sequence length hyperparameter and d is the dimension of the pre-trained T0 embeddings. Each row of the soft prompt mimics the embedding of a token, but the soft prompt need not correspond to the embedding of any actual token sequence. This matrix is prepended to the input embedding and the output of h₀ is computed with the frozen T0 model. The initial guess at h₀ (i.e., the initial soft prompt vector for use in co-training) is the repeated embedding of the [PAD] token. Since T0 was trained to perform well at zero-shot learning with prompts, this provides a good initial hypothesis. We co-train this model with a pre-trained DeBERTa representation as ϕ₁(X) and the last few layers of DeBERTa as h₁. This is shown in Figure 1, right.
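The prepend-and-freeze mechanics can be sketched as follows; shapes are illustrative, the zero vector stands in for T0's actual [PAD] embedding, and the frozen T0 forward pass is omitted:

```python
import numpy as np

def prepend_soft_prompt(input_embeds, soft_prompt):
    # input_embeds: (seq_len, d) embedding of the hard-prompt-formatted input.
    # soft_prompt:  (L, d) trainable matrix; each row mimics a token embedding
    #               but need not match any real token sequence.
    return np.concatenate([soft_prompt, input_embeds], axis=0)

L, d = 8, 16
pad_embedding = np.zeros(d)                   # stand-in for T0's [PAD] embedding
soft_prompt = np.tile(pad_embedding, (L, 1))  # initial guess: [PAD] repeated L times
combined = prepend_soft_prompt(np.random.randn(10, d), soft_prompt)
# combined has shape (L + seq_len, d) and would be fed through the frozen T0 model;
# only the L x d soft prompt receives gradient updates during co-training.
```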
We apply our approach to standard few-shot and zero-shot tasks and find that (i) iteratively co-training models using unlabeled data consistently improves performance, (ii) pseudo-labels from a prompted model are an effective signal for fine-tuning smaller task-specific models, and (iii) this approach can significantly improve results on datasets previously considered difficult for prompt-based learning. We conclude with a brief analysis of success/failure cases and describe high-level criteria required for our method to work.
¹We exclusively use the 3-billion-parameter variant T0-3B.
2. Related work
Prompting and prompt tuning. Lu et al. (2021) find optimal orderings of prompt examples based on an artificially constructed development set. Given the variance in performance across different prompts, others have focused on engineering suitable prompts, manually or otherwise (Liu et al., 2021a). Jiang et al. (2020), Shin et al. (2020), and Gao et al. (2021) use data-driven techniques and language models to automatically generate candidate prompts. Rather than being constrained to human-readable prompts, Li & Liang (2021) and Lester et al. (2021) instead learn a continuous soft task-specific "prompt" to condition language models. While effective, these methods typically require nontrivial amounts of labeled data. Several methods try to improve the sample-efficiency of these techniques by using pre-training (Gu et al., 2022) or by combining a hard prompt with a soft prompt (which we also do in this work) and tuning the soft prompt on a few labeled examples (Liu et al., 2021b). In contrast, our technique requires no additional pre-training steps, applies even in the zero-shot setting, and further boosts performance by allowing two different models to learn from each other.

Another line of work uses the outputs from a prompted language model as weak labels, as we do in this work. Wang et al. (2021) propose to train smaller models on labels from GPT-3 to reduce annotation cost, but they train from individual, uncalibrated prompts and do not attempt to refine the prompt model alongside the smaller model. Schick & Schütze (2021) fine-tune a separate RoBERTa model for each prompt using a small amount of labeled data. They next aggregate the outputs of these individual fine-tuned
models as a soft pseudo-label and train a final model to
match the soft aggregation. In contrast, we train a single
BERT-style model on the ensembled prompt output without
any additional labeled data. We use this model to refine the
ensemble parameters (and vice-versa). In our approach we
only use prompt outputs as training signal, and we consider
different types of prompts (open-ended instead of cloze).
Self-training for few-shot text classification. Our work relies on access to a large amount of unlabeled data to iteratively grow a confidently-labeled training set for each model. Similarly, self-training first trains a model on a small set of initial data, uses the trained model to produce pseudo-labels on a set of unlabeled data, and then iteratively includes the confidently pseudo-labeled data as new training labels (Scudder, 1965). In the context of few-shot text classification, Mukherjee & Awadallah (2020) develop an uncertainty-aware technique for choosing which data points to include, which requires a small amount of labeled data. Karamanolakis et al. (2019; 2021) employ self-training and iterative co-training with weak supervision as the initial label signal, and they similarly use a neural network with pretrained embeddings as a downstream model. However, they explore hand-written or keyword-based rules as weak supervision, in contrast to the present work, where we derive our weak signals from prompted models. The parameterization of h₀ in our partial access setting is similar to the weighting they use to combine rules.
Co-training. Co-training dates back to Blum & Mitchell (1998), who assumed that ϕ₀(X) and ϕ₁(X) are two distinct views and conditionally independent given the true label. Under this strict condition, they proved that the algorithm finds a good classifier after just one step. Many subsequent analyses (e.g., Dasgupta et al., 2002; Balcan et al., 2005) relax this condition, showing that views can be dependent or even identical as long as certain relationships hold between the models being trained (essentially, they are "different enough"). In a similar vein, Wei et al. (2020) give a theoretical explanation of why (and when) models can learn to be more accurate than the pseudo-labels used to train them. We take implicit advantage of these results in our work. The views we use are highly dependent, and yet the models we train are often able to outperform the pseudo-labels we used to train them in each co-training iteration.
3. Co-training with prompting

The skeleton of our approach is shown in Algorithm 1 (full detail is provided in Algorithms 4 and 5 in the supplement). First, a hypothesis h₀ over view ϕ₀ is initialized such that its initial predictions are reasonable. (We discuss initialization in depth in the following sections.) Next, we obtain the confidently labeled training data L₀⁰, which is a subset of the unlabeled data points, together with pseudo-labels for
Algorithm 1 Co-training algorithm
input: U = {xₙ}ₙ₌₁ᵁ unlabeled examples
input: {(xⱼ, yⱼ)}ⱼ₌₁ᵏ labeled examples (optional)
input: initial coverage β, coverage increase β′
h₀ ← InitClassifier(ϕ₀)
for t in {0, . . . , T − 1} do
    β̃ ← β + tβ′
    // GetConfData* defined in Algorithms 2, 3
    L₀ᵗ ← GetConfData*(U; h₀, ϕ₀, β̃)
    h₁ ← Train(ϕ₁, L₀ᵗ)
    L₁ᵗ ← GetConfData*(U; h₁, ϕ₁, β̃)
    h₀ ← Train(ϕ₀, L₁ᵗ)
end for
return (h₀, h₁)
those points from h₀. In iteration t, we select a β + tβ′ fraction of the data. (We discuss techniques for selection of confident data in Section 4 and the choice of β and β′ in Section 5.) These confidently-labeled points are then used to train a model h₁ on view ϕ₁, and h₁'s confidently-labeled data is extracted as L₁⁰. This is used to train a new h₀, and the process continues for T steps. Train performs standard supervised training on the pseudo-labels for that iteration. In this section, we give details for how to construct the views ϕ₀ and ϕ₁, the hypothesis classes we use for the model h₀, and the initialization schemes for h₀ in both the partial access and full access settings.
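The loop above can be sketched directly in code. Here `get_conf_data` and `train` stand for GetConfData* and Train and are left abstract, so the function names and signatures are ours:

```python
def co_train(U, h0, phi0, phi1, beta, beta_prime, T, get_conf_data, train):
    """Skeleton of Algorithm 1.

    get_conf_data(U, h, phi, frac): the frac most confidently pseudo-labeled
        points under model h on view phi.
    train(phi, data): a fresh model on view phi fit to those pseudo-labels.
    """
    h1 = None
    for t in range(T):
        frac = beta + t * beta_prime            # coverage grows each iteration
        L0 = get_conf_data(U, h0, phi0, frac)   # h0 labels the unlabeled pool
        h1 = train(phi1, L0)                    # train view-1 model on h0's labels
        L1 = get_conf_data(U, h1, phi1, frac)   # h1 labels the pool in turn
        h0 = train(phi0, L1)                    # retrain view-0 model on h1's labels
    return h0, h1
```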
3.1. Partial access setting: co-training a label model

In the usual few-shot setting with prompting (also known as in-context learning), k labeled examples {(xᵢ, yᵢ)}ᵢ₌₁ᵏ are converted into a single natural language prompt following a template (example below). We call this prompt k-shot, since it uses k labeled examples. Instead of using one k-shot prompt, in this work we use k one-shot prompts, only including one example in the template at a time. This gives us k outputs. Separating out the signal from each labeled example in this way allows us to combine the examples more effectively than the one k-shot prompt model.
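Building the k one-shot prompts can be sketched as follows, using a sentiment-analysis template of the kind shown in the next paragraph; the helper name and template string are ours:

```python
TEMPLATE = ("Review: {example}\nPositive or Negative? {label}\n\n"
            "Review: {x}\nPositive or Negative?")

def one_shot_prompts(labeled_examples, x):
    """Turn k labeled examples into k separate one-shot prompts for input x,
    rather than packing all k into a single k-shot prompt."""
    return [TEMPLATE.format(example=ex, label=y, x=x)
            for ex, y in labeled_examples]

prompts = one_shot_prompts(
    [("this movie was great!", "Positive"), ("dull and slow.", "Negative")],
    "a film that rewards patience")
# Each prompt is sent to the model separately, giving k output vectors to combine.
```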
View. Let ϕ₀⁽ⁱ⁾(x) ∈ ℝ^|V| be the vector of probabilities output by GPT-3 on input x formatted in a one-shot prompt with labeled example (xᵢ, yᵢ). Here i ∈ {1, . . . , k} and V is a subset of the full token vocabulary—the verbalizer tokens—and consists of the "label tokens" for the prompt as well as other tokens related to the label. For example, in sentiment analysis, if x₁ is "this movie was great!", ϕ₀⁽¹⁾(x) is GPT-3's output on:

Review: this movie was great!
Positive or Negative? Positive

Review: {{x.review}}
Positive or Negative?

and V might include the label tokens Positive / Negative and related tokens such as uncased label tokens