T5 (Text-to-Text Transfer Transformer) is a natural language processing model introduced by Google Research in 2020. The core idea of T5 is to cast every NLP task as a text-to-text problem: both the input and the output take the form of text. This unified framing lets T5 handle different NLP tasks with the same model architecture, loss function, and training procedure, simplifying both training and application.

The main features of the T5 model include:

1. **Text-to-text framework**: T5 reformulates all NLP tasks, such as text classification, machine translation, question answering, and summarization, as text-to-text tasks. For text classification, for example, the model's input is the text concatenated with a task-specific prefix, and its output is the textual representation of the class label.

2. **Pre-training and fine-tuning**: T5 is first pre-trained on a large-scale text corpus to learn general-purpose language representations, then fine-tuned on a task-specific data set for each downstream NLP task. This two-stage training strategy allows T5 to achieve strong performance across many different tasks.

3. **Transformer architecture**: T5 is based on the Transformer, a self-attention-based architecture that processes sequential data effectively and captures long-range dependencies. […]
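As a quick illustration of this interface, here is a minimal sketch using the Hugging Face `transformers` library and its public `t5-small` checkpoint; both are assumptions chosen for demonstration and are not part of the paper itself:

```python
# Minimal sketch: task prefixes turn different NLP problems into a single
# text-in, text-out interface. Assumes the Hugging Face `transformers`
# package (with sentencepiece) and the public "t5-small" checkpoint.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

prompts = [
    "translate English to German: That is good.",  # translation
    "cola sentence: The course is jumping well.",  # acceptability judgment
]
for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```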
Journal of Machine Learning Research 21 (2020) 1-67 Submitted 1/20; Revised 6/20; Published 6/20
Exploring the Limits of Transfer Learning with a Unified
Text-to-Text Transformer
Colin Raffel∗ craffel@gmail.com
Noam Shazeer∗ noam@google.com
Adam Roberts∗ adarob@google.com
Katherine Lee∗ katherinelee@google.com
Sharan Narang sharannarang@google.com
Michael Matena mmatena@google.com
Yanqi Zhou yanqiz@google.com
Wei Li mweili@google.com
Peter J. Liu peterjliu@google.com
Google, Mountain View, CA 94043, USA
Editor: Ivan Titov
Abstract
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-
tuned on a downstream task, has emerged as a powerful technique in natural language
processing (NLP). The effectiveness of transfer learning has given rise to a diversity of
approaches, methodology, and practice. In this paper, we explore the landscape of transfer
learning techniques for NLP by introducing a unified framework that converts all text-based
language problems into a text-to-text format. Our systematic study compares pre-training
objectives, architectures, unlabeled data sets, transfer approaches, and other factors on
dozens of language understanding tasks. By combining the insights from our exploration
with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results
on many benchmarks covering summarization, question answering, text classification, and
more. To facilitate future work on transfer learning for NLP, we release our data set,
pre-trained models, and code.¹
Keywords: transfer learning, natural language processing, multi-task learning, attention-
based models, deep learning
1. Introduction
Training a machine learning model to perform natural language processing (NLP) tasks
often requires that the model can process text in a way that is amenable to downstream
learning. This can be loosely viewed as developing general-purpose knowledge that allows
the model to “understand” text. This knowledge can range from low-level (e.g. the spelling
∗. Equal contribution. A description of each author's contribution is available in Appendix A. Correspondence to craffel@gmail.com.
1. https://github.com/google-research/text-to-text-transfer-transformer
©2020 Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/20-074.html.
or meaning of words) to high-level (e.g. that a tuba is too large to fit in most backpacks).
In modern machine learning practice, providing this knowledge is rarely done explicitly;
instead, it is often learned as part of an auxiliary task. For example, a historically common
approach is to use word vectors (Mikolov et al., 2013b,a; Pennington et al., 2014) to map
word identities to a continuous representation where, ideally, similar words map to similar
vectors. These vectors are often learned through an objective that, for example, encourages
co-occurring words to be positioned nearby in the continuous space (Mikolov et al., 2013b).
Recently, it has become increasingly common to pre-train the entire model on a data-rich
task. Ideally, this pre-training causes the model to develop general-purpose abilities and
knowledge that can then be transferred to downstream tasks. In applications of transfer
learning to computer vision (Oquab et al., 2014; Jia et al., 2014; Huh et al., 2016; Yosinski
et al., 2014), pre-training is typically done via supervised learning on a large labeled data set
like ImageNet (Russakovsky et al., 2015; Deng et al., 2009). In contrast, modern techniques
for transfer learning in NLP often pre-train using unsupervised learning on unlabeled data.
This approach has recently been used to obtain state-of-the-art results in many of the most
common NLP benchmarks (Devlin et al., 2018; Yang et al., 2019; Dong et al., 2019; Liu
et al., 2019c; Lan et al., 2019). Beyond its empirical strength, unsupervised pre-training
for NLP is particularly attractive because unlabeled text data is available en masse thanks
to the Internet—for example, the Common Crawl project² produces about 20TB of text
data extracted from web pages each month. This is a natural fit for neural networks, which
have been shown to exhibit remarkable scalability, i.e. it is often possible to achieve better
performance simply by training a larger model on a larger data set (Hestness et al., 2017;
Shazeer et al., 2017; Jozefowicz et al., 2016; Mahajan et al., 2018; Radford et al., 2019;
Shazeer et al., 2018; Huang et al., 2018b; Keskar et al., 2019a).
This synergy has resulted in a great deal of recent work developing transfer learning
methodology for NLP, which has produced a wide landscape of pre-training objectives
(Howard and Ruder, 2018; Devlin et al., 2018; Yang et al., 2019; Dong et al., 2019), unlabeled
data sets (Yang et al., 2019; Liu et al., 2019c; Zellers et al., 2019), benchmarks (Wang et al.,
2019b, 2018; Conneau and Kiela, 2018), fine-tuning methods (Howard and Ruder, 2018;
Houlsby et al., 2019; Peters et al., 2019), and more. The rapid rate of progress and diversity
of techniques in this burgeoning field can make it difficult to compare different algorithms,
tease apart the effects of new contributions, and understand the space of existing methods for
transfer learning. Motivated by a need for more rigorous understanding, we leverage a unified
approach to transfer learning that allows us to systematically study different approaches
and push the current limits of the field.
The basic idea underlying our work is to treat every text processing problem as a
“text-to-text” problem, i.e. taking text as input and producing new text as output. This
approach is inspired by previous unifying frameworks for NLP tasks, including casting all text
problems as question answering (McCann et al., 2018), language modeling (Radford et al.,
2019), or span extraction Keskar et al. (2019b) tasks. Crucially, the text-to-text framework
allows us to directly apply the same model, objective, training procedure, and decoding
process to every task we consider. We leverage this flexibility by evaluating performance
on a wide variety of English-based NLP problems, including question answering, document
2. http://commoncrawl.org
"translate English to German: That is good."
"cola sentence: The
course is jumping well."
"summarize: state authorities
dispatched emergency crews tuesday to
survey the damage after an onslaught
of severe weather in mississippi…"
"stsb sentence1: The rhino grazed
on the grass. sentence2: A rhino
is grazing in a field."
T5
"Das ist gut."
"not acceptable"
"six people hospitalized after
a storm in attala county."
"3.8"
Figure 1:
A diagram of our text-to-text framework. Every task we consider—including
translation, question answering, and classification—is cast as feeding our model
text as input and training it to generate some target text. This allows us to use the
same model, loss function, hyperparameters, etc. across our diverse set of tasks. It
also provides a standard testbed for the methods included in our empirical survey.
“T5” refers to our model, which we dub the “Text-to-Text Transfer Transformer”.
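To make the casting above concrete, the following sketch (our illustration, not code from the paper's release) serializes raw examples into (input, target) string pairs; note how even the STS-B regression target is rounded to an increment of 0.2 and emitted as a string, matching the "3.8" output in the figure:

```python
# Illustration only: every task becomes a pair of strings.
def translation_example(source: str, target: str) -> tuple[str, str]:
    return f"translate English to German: {source}", target

def cola_example(sentence: str, acceptable: bool) -> tuple[str, str]:
    label = "acceptable" if acceptable else "not acceptable"
    return f"cola sentence: {sentence}", label

def stsb_example(s1: str, s2: str, score: float) -> tuple[str, str]:
    # Regression as text: round the similarity score to the nearest 0.2
    # and render it as a string the model must generate verbatim.
    rounded = round(score * 5) / 5
    return f"stsb sentence1: {s1} sentence2: {s2}", f"{rounded:.1f}"

print(stsb_example("The rhino grazed on the grass.",
                   "A rhino is grazing in a field.", 3.75))
# -> ('stsb sentence1: ... sentence2: ...', '3.8')
```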
summarization, and sentiment classification, to name a few. With this unified approach,
we can compare the effectiveness of different transfer learning objectives, unlabeled data
sets, and other factors, while exploring the limits of transfer learning for NLP by scaling up
models and data sets beyond what has previously been considered.
We emphasize that our goal is not to propose new methods but instead to provide a
comprehensive perspective on where the field stands. As such, our work primarily comprises
a survey, exploration, and empirical comparison of existing techniques. We also explore the
limits of current approaches by scaling up the insights from our systematic study (training
models up to 11 billion parameters) to obtain state-of-the-art results in many of the tasks
we consider. In order to perform experiments at this scale, we introduce the “Colossal Clean
Crawled Corpus” (C4), a data set consisting of hundreds of gigabytes of clean English text
scraped from the web. Recognizing that the main utility of transfer learning is the possibility
of leveraging pre-trained models in data-scarce settings, we release our code, data sets, and
pre-trained models.¹
The remainder of the paper is structured as follows: In the following section, we discuss
our base model and its implementation, our procedure for formulating every text processing
problem as a text-to-text task, and the suite of tasks we consider. In Section 3, we present a
large set of experiments that explore the field of transfer learning for NLP. At the end of the
section (Section 3.7), we combine insights from our systematic study to obtain state-of-the-art
results on a wide variety of benchmarks. Finally, we provide a summary of our results and
wrap up with a look towards the future in Section 4.
2. Setup
Before presenting the results from our large-scale empirical study, we review the necessary
background topics required to understand our results, including the Transformer model
architecture and the downstream tasks we evaluate on. We also introduce our approach
for treating every problem as a text-to-text task and describe our “Colossal Clean Crawled
Corpus” (C4), the Common Crawl-based data set we created as a source of unlabeled text
data. We refer to our model and framework as the “Text-to-Text Transfer Transformer”
(T5).
2.1 Model
Early results on transfer learning for NLP leveraged recurrent neural networks (Peters
et al., 2018; Howard and Ruder, 2018), but it has recently become more common to use
models based on the “Transformer” architecture (Vaswani et al., 2017). The Transformer
was initially shown to be effective for machine translation, but it has subsequently been
used in a wide variety of NLP settings (Radford et al., 2018; Devlin et al., 2018; McCann
et al., 2018; Yu et al., 2018). Due to its increasing ubiquity, all of the models we study are
based on the Transformer architecture. Apart from the details mentioned below and the
variants we explore in Section 3.2, we do not deviate significantly from this architecture as
originally proposed. Instead of providing a comprehensive definition of this model, we refer
the interested reader to the original paper (Vaswani et al., 2017) or follow-up tutorials³,⁴ for
a more detailed introduction.
The primary building block of the Transformer is self-attention (Cheng et al., 2016).
Self-attention is a variant of attention (Graves, 2013; Bahdanau et al., 2015) that processes
a sequence by replacing each element by a weighted average of the rest of the sequence.
The original Transformer consisted of an encoder-decoder architecture and was intended
for sequence-to-sequence (Sutskever et al., 2014; Kalchbrenner et al., 2014) tasks. It has
recently also become common to use models consisting of a single Transformer layer stack,
with varying forms of self-attention used to produce architectures appropriate for language
modeling (Radford et al., 2018; Al-Rfou et al., 2019) or classification and span prediction
tasks (Devlin et al., 2018; Yang et al., 2019). We empirically explore these architectural
variants in Section 3.2.
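As a minimal illustration of this operation (our sketch, not the paper's implementation), single-head scaled dot-product self-attention can be written as:

```python
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor,
                   w_q: torch.Tensor,
                   w_k: torch.Tensor,
                   w_v: torch.Tensor) -> torch.Tensor:
    # x: (sequence_length, d_model); w_q, w_k, w_v: (d_model, d_head).
    # Each position is replaced by a weighted average of the value vectors,
    # with weights given by softmax-normalized query-key similarity.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.transpose(-2, -1)) / (k.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v
```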
Overall, our encoder-decoder Transformer implementation closely follows its originally-
proposed form (Vaswani et al., 2017). First, an input sequence of tokens is mapped to
a sequence of embeddings, which is then passed into the encoder. The encoder consists
of a stack of “blocks”, each of which comprises two subcomponents: a self-attention layer
followed by a small feed-forward network. Layer normalization (Ba et al., 2016) is applied to
the input of each subcomponent. We use a simplified version of layer normalization where
the activations are only rescaled and no additive bias is applied. After layer normalization,
a residual skip connection (He et al., 2016) adds each subcomponent’s input to its output.
Dropout (Srivastava et al., 2014) is applied within the feed-forward network, on the skip
connection, on the attention weights, and at the input and output of the entire stack. The
decoder is similar in structure to the encoder except that it includes a standard attention
3. http://nlp.seas.harvard.edu/2018/04/03/attention.html
4. http://jalammar.github.io/illustrated-transformer/
mechanism after each self-attention layer that attends to the output of the encoder. The
self-attention mechanism in the decoder also uses a form of autoregressive or causal self-
attention, which only allows the model to attend to past outputs. The output of the final
decoder block is fed into a dense layer with a softmax output, whose weights are shared with
the input embedding matrix. All attention mechanisms in the Transformer are split up into
independent “heads” whose outputs are concatenated before being further processed.
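For concreteness, here is a minimal PyTorch sketch of the scale-only layer normalization described above; the class name and the RMS-style variance (computed without mean subtraction) are our assumptions consistent with the description, not code from the paper's release:

```python
import torch
import torch.nn as nn

class ScaleOnlyLayerNorm(nn.Module):
    """Layer normalization with a learned gain but no additive bias."""

    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))  # rescaling only
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Rescale activations by their root-mean-square; no bias is added.
        variance = x.pow(2).mean(dim=-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)
```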
Since self-attention is order-independent (i.e. it is an operation on sets), it is common
to provide an explicit position signal to the Transformer. While the original Transformer
used a sinusoidal position signal or learned position embeddings, it has recently become
more common to use relative position embeddings (Shaw et al., 2018; Huang et al., 2018a).
Instead of using a fixed embedding for each position, relative position embeddings produce
a different learned embedding according to the offset between the “key” and “query” being
compared in the self-attention mechanism. We use a simplified form of position embeddings
where each “embedding” is simply a scalar that is added to the corresponding logit used
for computing the attention weights. For efficiency, we also share the position embedding
parameters across all layers in our model, though within a given layer each attention head
uses a different learned position embedding. Typically, a fixed number of embeddings are
learned, each corresponding to a range of possible key-query offsets. In this work, we use 32
embeddings for all of our models with ranges that increase in size logarithmically up to an
offset of 128 beyond which we assign all relative positions to the same embedding. Note
that a given layer is insensitive to relative position beyond 128 tokens, but subsequent layers
can build a sensitivity to larger offsets by combining local information from previous layers.
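The bucketing scheme can be sketched as follows (our illustration; the function name, the unidirectional simplification, and the exact bucket boundaries are assumptions, and the released code may differ in detail):

```python
import math

def relative_position_bucket(relative_position: int,
                             num_buckets: int = 32,
                             max_distance: int = 128) -> int:
    """Map a key-minus-query offset to one of `num_buckets` shared buckets.

    Small offsets get exact buckets, larger offsets fall into buckets whose
    ranges grow logarithmically, and offsets beyond `max_distance` all share
    the final bucket.
    """
    n = max(-relative_position, 0)  # causal attention looks only at the past
    max_exact = num_buckets // 2
    if n < max_exact:
        return n  # one bucket per offset for nearby positions
    # Logarithmically sized buckets for offsets in [max_exact, max_distance).
    log_bucket = max_exact + int(
        math.log(n / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    return min(log_bucket, num_buckets - 1)

# Each attention head then learns one scalar bias per bucket and adds it to
# the corresponding attention logit.
```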
To summarize, our model is roughly equivalent to the original Transformer proposed by
Vaswani et al. (2017) with the exception of removing the Layer Norm bias, placing the layer
normalization outside the residual path, and using a different position embedding scheme.
Since these architectural changes are orthogonal to the experimental factors we consider in
our empirical survey of transfer learning, we leave the ablation of their impact for future
work.
As part of our study, we experiment with the scalability of these models, i.e. how their
performance changes as they are made to have more parameters or layers. Training large
models can be non-trivial since they might not fit on a single machine and require a great deal
of computation. As a result, we use a combination of model and data parallelism and train
models on “slices” of Cloud TPU Pods.⁵ TPU Pods are multi-rack ML supercomputers
that contain 1,024 TPU v3 chips connected via a high-speed 2D mesh interconnect with
supporting CPU host machines. We leverage the Mesh TensorFlow library (Shazeer et al.,
2018) for ease of implementation of both model parallelism and data parallelism (Krizhevsky,
2014).
2.2 The Colossal Clean Crawled Corpus
Much of the previous work on transfer learning for NLP makes use of large unlabeled data
sets for unsupervised learning. In this paper, we are interested in measuring the effect of the
quality, characteristics, and size of this unlabeled data. To generate data sets that satisfy
our needs, we leverage Common Crawl as a source of text scraped from the web. Common
5. https://cloud.google.com/tpu/
[The remaining 66 pages of the paper are not included in this preview.]