Co-training Improves Prompt-based Learning for Large Language Models

Hunter Lang¹  Monica Agrawal¹  Yoon Kim¹  David Sontag¹
Abstract
We demonstrate that co-training (Blum & Mitchell, 1998) can improve the performance of prompt-based learning by using unlabeled data. While prompting has emerged as a promising paradigm for few-shot and zero-shot learning, it is often brittle and requires much larger models compared to the standard supervised setup. We find that co-training makes it possible to improve the original prompt model and at the same time learn a smaller, downstream task-specific model. In the case where we only have partial access to a prompt model (e.g., output probabilities from GPT-3 (Brown et al., 2020)) we learn a calibration model over the prompt outputs. When we have full access to the prompt model's gradients but full finetuning remains prohibitively expensive (e.g., T0 (Sanh et al., 2022)), we learn a set of soft prompt continuous vectors to iteratively update the prompt model. We find that models trained in this manner can significantly improve performance on challenging datasets where there is currently a large gap between prompt-based learning and fully-supervised models.
1. Introduction

Prompt-based learning, in which a pretrained language model is adapted to various tasks by priming on natural language prompts, has emerged as a promising framework for few-shot and zero-shot learning (Brown et al., 2020; Liu et al., 2021a; Wei et al., 2021; Sanh et al., 2022). While intriguing, these methods can be sensitive to trivial cosmetic artifacts, including variations in prompt wording and the ordering of examples (Lu et al., 2021; Zhao et al., 2021; Kumar & Talukdar, 2021). Further, the models used in prompt-based learning (e.g., GPT-3, T0) are much larger than those typically used for standard fine-tuning. These factors make prompt-based learning difficult to use in practice.
Given a small amount of labeled data, one could evaluate the performance of each prompt and re-calibrate the prompt outputs to improve performance. However, (i) this reliance on labeled data goes against the goal of few-shot learning, and (ii) even with oracle calibration, some prompts have sub-par accuracy. Recently, to address issue (i), Zhao et al. (2021) developed a data-free calibration method that can dramatically improve the accuracy of few-shot prompts for GPT-3. We build on their work by showing how to use unlabeled data to further improve performance.

¹MIT CSAIL. Correspondence to: <hjl@mit.edu>.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).
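As a concrete illustration of this kind of data-free recalibration, contextual calibration estimates the prompt's bias from a content-free input and divides it out. The function below is a minimal sketch in that spirit; the function name and probability values are ours, not the paper's:

```python
import numpy as np

def contextual_calibration(p_content_free, p_test):
    """Sketch of contextual calibration (Calibrate-Before-Use, Zhao et al., 2021).

    p_content_free: label probabilities on a content-free input such as "N/A",
                    which estimate the prompt's a-priori bias toward each label.
    p_test:         raw label probabilities on a real test input.
    """
    calibrated = p_test / p_content_free  # apply W = diag(p_cf)^{-1} to p_test
    return calibrated / calibrated.sum()  # renormalize to a distribution

# Hypothetical numbers: the prompt is biased toward the first label.
p_cf = np.array([0.7, 0.3])   # bias measured on the content-free input
p_x  = np.array([0.6, 0.4])   # raw output on a test example
print(contextual_calibration(p_cf, p_x))  # calibration flips the prediction
```

With these made-up numbers, the raw output favors the first label only because the prompt itself does; after dividing out the bias, the second label wins.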
To leverage unlabeled data, we use co-training (Blum & Mitchell, 1998), which operates on two views of each data point X: ϕ₀(X) and ϕ₁(X). For example, in a clinical diagnosis system, ϕ₀(X) could be laboratory test results and ϕ₁(X) an X-ray image. A pair of models (h₀ and h₁, respectively) takes turns labeling a large unlabeled training set, and each model is trained on the confident pseudo-labels from the other. Model h₀ only uses ϕ₀(X), and model h₁ uses ϕ₁(X). By using complementary information in the views ϕ₀, ϕ₁ and the different inductive biases from models h₀, h₁, co-training allows each model to learn from the other without labeled data. The initial signal to start the co-training process is provided by a "guess" at a model h₀.
To combine co-training and prompt-based learning, we use outputs from a large prompt-based model as ϕ₀(X) and the pre-trained representation from a much smaller language model (e.g., DeBERTa (He et al., 2021)) as ϕ₁(X). We specify the models h₀ and h₁ based on whether we have partial access to the prompt model (querying GPT-3) or full access (locally training T0).

In the partial access setting, we only have access to the large model's output probabilities. In this case, we use unlabeled data to learn a model h₀ that both calibrates individual prompts and ensembles multiple prompts. We refer to this as the label model. We use Calibrate-Before-Use (Zhao et al., 2021) to initialize the calibration parameters of this model for each prompt, and we initialize the ensembling parameters to approximate majority vote. We then refine this initial guess for h₀ with co-training. We use the pre-trained representation from DeBERTa (He et al., 2021) for ϕ₁(X) and train the last few layers of that model as h₁. The only labeled data used is the set of k examples used in the input prompts. Figure 1 (left) shows the co-training process for partial access.
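A minimal sketch of such a label model follows. The per-prompt affine calibration plus weighted-ensemble parameterization is our illustrative guess at the structure described above, not the paper's exact formulation:

```python
import numpy as np

class LabelModel:
    """Illustrative label model h0: per-prompt calibration followed by a
    learned ensemble over the k one-shot prompts."""

    def __init__(self, k, n_labels):
        # One affine calibration map per prompt. In the paper these would be
        # initialized from Calibrate-Before-Use; identity suffices for a sketch.
        self.W = np.stack([np.eye(n_labels) for _ in range(k)])
        self.b = np.zeros((k, n_labels))
        # Ensemble weights, initialized to approximate majority vote.
        self.alpha = np.full(k, 1.0 / k)

    def predict_proba(self, prompt_probs):
        """prompt_probs: (k, n_labels) per-prompt output probabilities."""
        calibrated = np.einsum('kij,kj->ki', self.W, prompt_probs) + self.b
        logits = self.alpha @ calibrated      # weighted combination over prompts
        e = np.exp(logits - logits.max())     # softmax
        return e / e.sum()

lm = LabelModel(k=4, n_labels=2)
p = lm.predict_proba(np.random.rand(4, 2))   # a single probability vector
```

Co-training would then update `W`, `b`, and `alpha` on the confident pseudo-labels produced by the DeBERTa-based h₁.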
We also study a full access setting using T0 (Sanh et al.,
[Figure 1 diagram: an unlabeled example is formatted either as k prompts (e.g., "{{premise}} Question: {{hypothesis}} True or False?") whose GPT-3 output probabilities feed the label model h₀, or as a single hard prompt whose input embedding, with a soft prompt prepended, is passed through T0; in both cases this view is co-trained against an MLP h₁ on a DeBERTa contextual embedding.]
Figure 1. The setup for our two applications of co-training to prompting for a binary entailment classification dataset (RTE). Parameters in blue are trainable; models in gray are fixed. Left: training a "label model" for post-hoc calibration and ensembling of multiple prompts. Here the prompts and the model (GPT-3) are fixed, and we co-train the calibration / ensembling parameters with the task-specific model (e.g., DeBERTa). Right: training a soft prompt. Here the input is encoded as a hard prompt and the embedding matrix of the input sequence is obtained. An L × d matrix of trainable parameters (the "soft prompt") is prepended to this embedding, and the combined embedding sequence is passed through T0 to get output predictions. We co-train the soft prompt with the view 1 model (e.g., DeBERTa).
2022) instead of GPT-3, so we can introspect the large prompt model. We derive the view ϕ₀(X) and the model h₀ from T0¹. However, instead of fully fine-tuning T0 during co-training, we focus on soft prompt tuning, which trains several orders-of-magnitude fewer parameters while attaining similar performance (Li & Liang, 2021; Lester et al., 2021). The parameter space for model h₀ is the set of soft prompts, which are matrices in ℝ^{L×d}, where L is a sequence length hyperparameter and d is the dimension of the pre-trained T0 embeddings. Each row of the soft prompt mimics the embedding of a token, but the soft prompt need not correspond to the embedding of any actual token sequence. This matrix is prepended to the input embedding and the output of h₀ is computed with the frozen T0 model. The initial guess at h₀ (i.e., the initial soft prompt vector for use in co-training) is the repeated embedding of the [PAD] token. Since T0 was trained to perform well at zero-shot learning with prompts, this provides a good initial hypothesis. We co-train this model with a pre-trained DeBERTa representation as ϕ₁(X) and the last few layers of DeBERTa as h₁. This is shown in Figure 1, right.
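The prepend-and-freeze mechanics can be sketched as follows; shapes are illustrative, the zero vector stands in for T0's actual [PAD] embedding, and the frozen T0 forward pass is omitted:

```python
import numpy as np

def prepend_soft_prompt(input_embeds, soft_prompt):
    # input_embeds: (seq_len, d) embedding of the hard-prompt-formatted input.
    # soft_prompt:  (L, d) trainable matrix; each row mimics a token embedding
    #               but need not match any real token sequence.
    return np.concatenate([soft_prompt, input_embeds], axis=0)

L, d = 8, 16
pad_embedding = np.zeros(d)                   # stand-in for T0's [PAD] embedding
soft_prompt = np.tile(pad_embedding, (L, 1))  # initial guess: [PAD] repeated L times
combined = prepend_soft_prompt(np.random.randn(10, d), soft_prompt)
# combined has shape (L + seq_len, d) and would be fed through the frozen T0 model;
# only the L x d soft prompt receives gradient updates during co-training.
```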
We apply our approach to standard few-shot and zero-shot tasks and find that (i) iteratively co-training models using unlabeled data consistently improves performance, (ii) pseudo-labels from a prompted model are an effective signal for fine-tuning smaller task-specific models, and (iii) this approach can significantly improve results on datasets previously considered difficult for prompt-based learning. We conclude with a brief analysis of success/failure cases and describe high-level criteria required for our method to work.
¹We exclusively use the 3-billion-parameter variant T0-3B.
2. Related work
Prompting and prompt tuning. Lu et al. (2021) find optimal orderings of prompt examples based on an artificially constructed development set. Given the variance in performance across different prompts, others have focused on engineering suitable prompts, manually or otherwise (Liu et al., 2021a). Jiang et al. (2020), Shin et al. (2020), and Gao et al. (2021) use data-driven techniques and language models to automatically generate candidate prompts. Rather than being constrained to human-readable prompts, Li & Liang (2021) and Lester et al. (2021) instead learn a continuous soft task-specific "prompt" to condition language models. While effective, these methods typically require nontrivial amounts of labeled data. Several methods try to improve the sample-efficiency of these techniques by using pre-training (Gu et al., 2022) or by combining a hard prompt with a soft prompt (which we also do in this work) and tuning the soft prompt on a few labeled examples (Liu et al., 2021b). In contrast, our technique requires no additional pre-training steps, applies even in the zero-shot setting, and further boosts performance by allowing two different models to learn from each other.

Another line of work uses the outputs from a prompted language model as weak labels, as we do in this work. Wang et al. (2021) propose to train smaller models on labels from GPT-3 to reduce annotation cost, but they train from individual, uncalibrated prompts and do not attempt to refine the prompt model alongside the smaller model. Schick & Schütze (2021) fine-tune a separate RoBERTa model for each prompt using a small amount of labeled data. They next aggregate the outputs of these individual fine-tuned
models as a soft pseudo-label and train a final model to
match the soft aggregation. In contrast, we train a single
BERT-style model on the ensembled prompt output without
any additional labeled data. We use this model to refine the
ensemble parameters (and vice-versa). In our approach we
only use prompt outputs as training signal, and we consider
different types of prompts (open-ended instead of cloze).
Self-training for few-shot text classification. Our work relies on access to a large amount of unlabeled data to iteratively grow a confidently-labeled training set for each model. Similarly, self-training first trains a model on a small set of initial data, uses the trained model to produce pseudo-labels on a set of unlabeled data, and then iteratively includes the confidently pseudo-labeled data as new training labels (Scudder, 1965). In the context of few-shot text classification, Mukherjee & Awadallah (2020) develop an uncertainty-aware technique for choosing which data points to include, which requires a small amount of labeled data. Karamanolakis et al. (2019; 2021) employ self-training and iterative co-training with weak supervision as the initial label signal, and they similarly use a neural network with pretrained embeddings as a downstream model. However, they explore hand-written or keyword-based rules as weak supervision, in contrast to the present work, where we derive our weak signals from prompted models. The parameterization of h₀ in our partial access setting is similar to the weighting they use to combine rules.
Co-training. Co-training dates back to Blum & Mitchell (1998), who assumed that ϕ₀(X) and ϕ₁(X) are two distinct views and conditionally independent given the true label. Under this strict condition, they proved that the algorithm finds a good classifier after just one step. Many subsequent analyses (e.g., Dasgupta et al., 2002; Balcan et al., 2005) relax this condition, showing that views can be dependent or even identical as long as certain relationships hold between the models being trained (essentially, they are "different enough"). In a similar vein, Wei et al. (2020) give a theoretical explanation of why (and when) models can learn to be more accurate than the pseudo-labels used to train them. We take implicit advantage of these results in our work. The views we use are highly dependent, and yet the models we train are often able to outperform the pseudo-labels we used to train them in each co-training iteration.
3. Co-training with prompting

The skeleton of our approach is shown in Algorithm 1 (full detail is provided in Algorithms 4 and 5 in the supplement). First, a hypothesis h₀ over view ϕ₀ is initialized such that its initial predictions are reasonable. (We discuss initialization in depth in the following sections.) Next, we obtain the confidently labeled training data L₀⁰, which is a subset of the unlabeled data points, together with pseudo-labels for
Algorithm 1 Co-training algorithm
input: U = {xₙ}ₙ₌₁ᵁ unlabeled examples
input: {(xⱼ, yⱼ)}ⱼ₌₁ᵏ labeled examples (optional)
input: initial coverage β, coverage increase β′
h₀ ← InitClassifier(ϕ₀)
for t in {0, . . . , T − 1} do
    β̃ ← β + tβ′
    // GetConfData* defined in Algorithms 2, 3
    L₀ᵗ ← GetConfData*(U; h₀, ϕ₀, β̃)
    h₁ ← Train(ϕ₁, L₀ᵗ)
    L₁ᵗ ← GetConfData*(U; h₁, ϕ₁, β̃)
    h₀ ← Train(ϕ₀, L₁ᵗ)
end for
return (h₀, h₁)
those points from h₀. In iteration t, we select a β + tβ′ fraction of the data. (We discuss techniques for selection of confident data in Section 4 and the choice of β and β′ in Section 5.) These confidently-labeled points are then used to train a model h₁ on view ϕ₁, and h₁'s confidently-labeled data is extracted as L₁⁰. This is used to train a new h₀, and the process continues for T steps. Train performs standard supervised training on the pseudo-labels for that iteration. In this section, we give details for how to construct the views ϕ₀ and ϕ₁, the hypothesis classes we use for the model h₀, and the initialization schemes for h₀ in both the partial access and full access settings.
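The loop above can be sketched directly in code. Here `get_conf_data` and `train` stand for GetConfData* and Train and are left abstract, so the function names and signatures are ours:

```python
def co_train(U, h0, phi0, phi1, beta, beta_prime, T, get_conf_data, train):
    """Skeleton of Algorithm 1.

    get_conf_data(U, h, phi, frac): the frac most confidently pseudo-labeled
        points under model h on view phi.
    train(phi, data): a fresh model on view phi fit to those pseudo-labels.
    """
    h1 = None
    for t in range(T):
        frac = beta + t * beta_prime            # coverage grows each iteration
        L0 = get_conf_data(U, h0, phi0, frac)   # h0 labels the unlabeled pool
        h1 = train(phi1, L0)                    # train view-1 model on h0's labels
        L1 = get_conf_data(U, h1, phi1, frac)   # h1 labels the pool in turn
        h0 = train(phi0, L1)                    # retrain view-0 model on h1's labels
    return h0, h1
```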
3.1. Partial access setting: co-training a label model

In the usual few-shot setting with prompting (also known as in-context learning), k labeled examples {(xᵢ, yᵢ)}ᵢ₌₁ᵏ are converted into a single natural language prompt following a template (example below). We call this prompt k-shot, since it uses k labeled examples. Instead of using one k-shot prompt, in this work we use k one-shot prompts, only including one example in the template at a time. This gives us k outputs. Separating out the signal from each labeled example in this way allows us to combine the examples more effectively than the one k-shot prompt model.
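Building the k one-shot prompts can be sketched as follows, using a sentiment-analysis template of the kind shown in the next paragraph; the helper name and template string are ours:

```python
TEMPLATE = ("Review: {example}\nPositive or Negative? {label}\n\n"
            "Review: {x}\nPositive or Negative?")

def one_shot_prompts(labeled_examples, x):
    """Turn k labeled examples into k separate one-shot prompts for input x,
    rather than packing all k into a single k-shot prompt."""
    return [TEMPLATE.format(example=ex, label=y, x=x)
            for ex, y in labeled_examples]

prompts = one_shot_prompts(
    [("this movie was great!", "Positive"), ("dull and slow.", "Negative")],
    "a film that rewards patience")
# Each prompt is sent to the model separately, giving k output vectors to combine.
```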
View. Let ϕ₀⁽ⁱ⁾(x) ∈ ℝ^|V| be the vector of probabilities output by GPT-3 on input x formatted in a one-shot prompt with labeled example (xᵢ, yᵢ). Here i ∈ {1, . . . , k} and V is a subset of the full token vocabulary—the verbalizer tokens—and consists of the "label tokens" for the prompt as well as other tokens related to the label. For example, in sentiment analysis, if x₁ is "this movie was great!", ϕ₀⁽¹⁾(x) is GPT-3's output on:

Review: this movie was great!
Positive or Negative? Positive

Review: {{x.review}}
Positive or Negative?

and V might include the label tokens Positive / Negative and related tokens such as uncased label tokens