T5 (Text-to-Text Transfer Transformer) is a natural language processing model introduced by Google Research in 2020. The core idea of T5 is to cast every NLP task as a text-to-text problem: both the input and the output take the form of text. This unified framing lets T5 handle different NLP tasks with the same model architecture, loss function, and training procedure, simplifying both training and application.

The main features of the T5 model include:

1. **Text-to-text framework**: T5 reformulates all NLP tasks, such as text classification, machine translation, question answering, and summarization, as text-to-text tasks. For text classification, for example, the model's input is the text concatenated with a task-specific prefix, and its output is the textual representation of the class label.

2. **Pre-training and fine-tuning**: T5 is first pre-trained on a large-scale text corpus to learn general-purpose language representations, then fine-tuned on a task-specific data set for each downstream NLP task. This two-stage training strategy allows T5 to achieve strong performance across many different tasks.

3. **Transformer architecture**: T5 is based on the Transformer, a self-attention-based architecture that processes sequential data effectively and captures long-range dependencies. […]
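As a quick illustration of this interface, here is a minimal sketch using the Hugging Face `transformers` library and its public `t5-small` checkpoint; both are assumptions chosen for demonstration and are not part of the paper itself:

```python
# Minimal sketch: task prefixes turn different NLP problems into a single
# text-in, text-out interface. Assumes the Hugging Face `transformers`
# package (with sentencepiece) and the public "t5-small" checkpoint.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

prompts = [
    "translate English to German: That is good.",  # translation
    "cola sentence: The course is jumping well.",  # acceptability judgment
]
for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```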
Journal of Machine Learning Research 21 (2020) 1-67 Submitted 1/20; Revised 6/20; Published 6/20
Exploring the Limits of Transfer Learning with a Unified
Text-to-Text Transformer
Colin Raffel∗ craffel@gmail.com
Noam Shazeer∗ noam@google.com
Adam Roberts∗ adarob@google.com
Katherine Lee∗ katherinelee@google.com
Sharan Narang sharannarang@google.com
Michael Matena mmatena@google.com
Yanqi Zhou yanqiz@google.com
Wei Li mweili@google.com
Peter J. Liu peterjliu@google.com
Google, Mountain View, CA 94043, USA
Editor: Ivan Titov
Abstract
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-
tuned on a downstream task, has emerged as a powerful technique in natural language
processing (NLP). The effectiveness of transfer learning has given rise to a diversity of
approaches, methodology, and practice. In this paper, we explore the landscape of transfer
learning techniques for NLP by introducing a unified framework that converts all text-based
language problems into a text-to-text format. Our systematic study compares pre-training
objectives, architectures, unlabeled data sets, transfer approaches, and other factors on
dozens of language understanding tasks. By combining the insights from our exploration
with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results
on many benchmarks covering summarization, question answering, text classification, and
more. To facilitate future work on transfer learning for NLP, we release our data set,
pre-trained models, and code.¹
Keywords: transfer learning, natural language processing, multi-task learning, attention-
based models, deep learning
1. Introduction
Training a machine learning model to perform natural language processing (NLP) tasks
often requires that the model can process text in a way that is amenable to downstream
learning. This can be loosely viewed as developing general-purpose knowledge that allows
the model to “understand” text. This knowledge can range from low-level (e.g. the spelling
∗. Equal contribution. A description of each author's contribution is available in Appendix A. Correspondence to craffel@gmail.com.
1. https://github.com/google-research/text-to-text-transfer-transformer
©2020 Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/20-074.html.
or meaning of words) to high-level (e.g. that a tuba is too large to fit in most backpacks).
In modern machine learning practice, providing this knowledge is rarely done explicitly;
instead, it is often learned as part of an auxiliary task. For example, a historically common
approach is to use word vectors (Mikolov et al., 2013b,a; Pennington et al., 2014) to map
word identities to a continuous representation where, ideally, similar words map to similar
vectors. These vectors are often learned through an objective that, for example, encourages
co-occurring words to be positioned nearby in the continuous space (Mikolov et al., 2013b).
Recently, it has become increasingly common to pre-train the entire model on a data-rich
task. Ideally, this pre-training causes the model to develop general-purpose abilities and
knowledge that can then be transferred to downstream tasks. In applications of transfer
learning to computer vision (Oquab et al., 2014; Jia et al., 2014; Huh et al., 2016; Yosinski
et al., 2014), pre-training is typically done via supervised learning on a large labeled data set
like ImageNet (Russakovsky et al., 2015; Deng et al., 2009). In contrast, modern techniques
for transfer learning in NLP often pre-train using unsupervised learning on unlabeled data.
This approach has recently been used to obtain state-of-the-art results in many of the most
common NLP benchmarks (Devlin et al., 2018; Yang et al., 2019; Dong et al., 2019; Liu
et al., 2019c; Lan et al., 2019). Beyond its empirical strength, unsupervised pre-training
for NLP is particularly attractive because unlabeled text data is available en masse thanks
to the Internet—for example, the Common Crawl project² produces about 20TB of text
data extracted from web pages each month. This is a natural fit for neural networks, which
have been shown to exhibit remarkable scalability, i.e. it is often possible to achieve better
performance simply by training a larger model on a larger data set (Hestness et al., 2017;
Shazeer et al., 2017; Jozefowicz et al., 2016; Mahajan et al., 2018; Radford et al., 2019;
Shazeer et al., 2018; Huang et al., 2018b; Keskar et al., 2019a).
This synergy has resulted in a great deal of recent work developing transfer learning
methodology for NLP, which has produced a wide landscape of pre-training objectives
(Howard and Ruder, 2018; Devlin et al., 2018; Yang et al., 2019; Dong et al., 2019), unlabeled
data sets (Yang et al., 2019; Liu et al., 2019c; Zellers et al., 2019), benchmarks (Wang et al.,
2019b, 2018; Conneau and Kiela, 2018), fine-tuning methods (Howard and Ruder, 2018;
Houlsby et al., 2019; Peters et al., 2019), and more. The rapid rate of progress and diversity
of techniques in this burgeoning field can make it difficult to compare different algorithms,
tease apart the effects of new contributions, and understand the space of existing methods for
transfer learning. Motivated by a need for more rigorous understanding, we leverage a unified
approach to transfer learning that allows us to systematically study different approaches
and push the current limits of the field.
The basic idea underlying our work is to treat every text processing problem as a
“text-to-text” problem, i.e. taking text as input and producing new text as output. This
approach is inspired by previous unifying frameworks for NLP tasks, including casting all text
problems as question answering (McCann et al., 2018), language modeling (Radford et al.,
2019), or span extraction Keskar et al. (2019b) tasks. Crucially, the text-to-text framework
allows us to directly apply the same model, objective, training procedure, and decoding
process to every task we consider. We leverage this flexibility by evaluating performance
on a wide variety of English-based NLP problems, including question answering, document
2. http://commoncrawl.org
"translate English to German: That is good."
"cola sentence: The
course is jumping well."
"summarize: state authorities
dispatched emergency crews tuesday to
survey the damage after an onslaught
of severe weather in mississippi…"
"stsb sentence1: The rhino grazed
on the grass. sentence2: A rhino
is grazing in a field."
T5
"Das ist gut."
"not acceptable"
"six people hospitalized after
a storm in attala county."
"3.8"
Figure 1:
A diagram of our text-to-text framework. Every task we consider—including
translation, question answering, and classification—is cast as feeding our model
text as input and training it to generate some target text. This allows us to use the
same model, loss function, hyperparameters, etc. across our diverse set of tasks. It
also provides a standard testbed for the methods included in our empirical survey.
“T5” refers to our model, which we dub the “Text-to-Text Transfer Transformer”.
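To make the casting above concrete, the following sketch (our illustration, not code from the paper's release) serializes raw examples into (input, target) string pairs; note how even the STS-B regression target is rounded to an increment of 0.2 and emitted as a string, matching the "3.8" output in the figure:

```python
# Illustration only: every task becomes a pair of strings.
def translation_example(source: str, target: str) -> tuple[str, str]:
    return f"translate English to German: {source}", target

def cola_example(sentence: str, acceptable: bool) -> tuple[str, str]:
    label = "acceptable" if acceptable else "not acceptable"
    return f"cola sentence: {sentence}", label

def stsb_example(s1: str, s2: str, score: float) -> tuple[str, str]:
    # Regression as text: round the similarity score to the nearest 0.2
    # and render it as a string the model must generate verbatim.
    rounded = round(score * 5) / 5
    return f"stsb sentence1: {s1} sentence2: {s2}", f"{rounded:.1f}"

print(stsb_example("The rhino grazed on the grass.",
                   "A rhino is grazing in a field.", 3.75))
# -> ('stsb sentence1: ... sentence2: ...', '3.8')
```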
summarization, and sentiment classification, to name a few. With this unified approach,
we can compare the effectiveness of different transfer learning objectives, unlabeled data
sets, and other factors, while exploring the limits of transfer learning for NLP by scaling up
models and data sets beyond what has previously been considered.
We emphasize that our goal is not to propose new methods but instead to provide a
comprehensive perspective on where the field stands. As such, our work primarily comprises
a survey, exploration, and empirical comparison of existing techniques. We also explore the
limits of current approaches by scaling up the insights from our systematic study (training
models up to 11 billion parameters) to obtain state-of-the-art results in many of the tasks
we consider. In order to perform experiments at this scale, we introduce the “Colossal Clean
Crawled Corpus” (C4), a data set consisting of hundreds of gigabytes of clean English text
scraped from the web. Recognizing that the main utility of transfer learning is the possibility
of leveraging pre-trained models in data-scarce settings, we release our code, data sets, and
pre-trained models.¹
The remainder of the paper is structured as follows: In the following section, we discuss
our base model and its implementation, our procedure for formulating every text processing
problem as a text-to-text task, and the suite of tasks we consider. In Section 3, we present a
large set of experiments that explore the field of transfer learning for NLP. At the end of the
section (Section 3.7), we combine insights from our systematic study to obtain state-of-the-art
results on a wide variety of benchmarks. Finally, we provide a summary of our results and
wrap up with a look towards the future in Section 4.
2. Setup
Before presenting the results from our large-scale empirical study, we review the necessary
background topics required to understand our results, including the Transformer model
architecture and the downstream tasks we evaluate on. We also introduce our approach
for treating every problem as a text-to-text task and describe our “Colossal Clean Crawled
Corpus” (C4), the Common Crawl-based data set we created as a source of unlabeled text
data. We refer to our model and framework as the “Text-to-Text Transfer Transformer”
(T5).
2.1 Model
Early results on transfer learning for NLP leveraged recurrent neural networks (Peters
et al., 2018; Howard and Ruder, 2018), but it has recently become more common to use
models based on the “Transformer” architecture (Vaswani et al., 2017). The Transformer
was initially shown to be effective for machine translation, but it has subsequently been
used in a wide variety of NLP settings (Radford et al., 2018; Devlin et al., 2018; McCann
et al., 2018; Yu et al., 2018). Due to its increasing ubiquity, all of the models we study are
based on the Transformer architecture. Apart from the details mentioned below and the
variants we explore in Section 3.2, we do not deviate significantly from this architecture as
originally proposed. Instead of providing a comprehensive definition of this model, we refer
the interested reader to the original paper (Vaswani et al., 2017) or follow-up tutorials³,⁴ for
a more detailed introduction.
The primary building block of the Transformer is self-attention (Cheng et al., 2016).
Self-attention is a variant of attention (Graves, 2013; Bahdanau et al., 2015) that processes
a sequence by replacing each element by a weighted average of the rest of the sequence.
The original Transformer consisted of an encoder-decoder architecture and was intended
for sequence-to-sequence (Sutskever et al., 2014; Kalchbrenner et al., 2014) tasks. It has
recently also become common to use models consisting of a single Transformer layer stack,
with varying forms of self-attention used to produce architectures appropriate for language
modeling (Radford et al., 2018; Al-Rfou et al., 2019) or classification and span prediction
tasks (Devlin et al., 2018; Yang et al., 2019). We empirically explore these architectural
variants in Section 3.2.
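As a minimal illustration of this operation (our sketch, not the paper's implementation), single-head scaled dot-product self-attention can be written as:

```python
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor,
                   w_q: torch.Tensor,
                   w_k: torch.Tensor,
                   w_v: torch.Tensor) -> torch.Tensor:
    # x: (sequence_length, d_model); w_q, w_k, w_v: (d_model, d_head).
    # Each position is replaced by a weighted average of the value vectors,
    # with weights given by softmax-normalized query-key similarity.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.transpose(-2, -1)) / (k.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v
```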
Overall, our encoder-decoder Transformer implementation closely follows its originally-
proposed form (Vaswani et al., 2017). First, an input sequence of tokens is mapped to
a sequence of embeddings, which is then passed into the encoder. The encoder consists
of a stack of “blocks”, each of which comprises two subcomponents: a self-attention layer
followed by a small feed-forward network. Layer normalization (Ba et al., 2016) is applied to
the input of each subcomponent. We use a simplified version of layer normalization where
the activations are only rescaled and no additive bias is applied. After layer normalization,
a residual skip connection (He et al., 2016) adds each subcomponent’s input to its output.
Dropout (Srivastava et al., 2014) is applied within the feed-forward network, on the skip
connection, on the attention weights, and at the input and output of the entire stack. The
decoder is similar in structure to the encoder except that it includes a standard attention
3. http://nlp.seas.harvard.edu/2018/04/03/attention.html
4. http://jalammar.github.io/illustrated-transformer/
mechanism after each self-attention layer that attends to the output of the encoder. The
self-attention mechanism in the decoder also uses a form of autoregressive or causal self-
attention, which only allows the model to attend to past outputs. The output of the final
decoder block is fed into a dense layer with a softmax output, whose weights are shared with
the input embedding matrix. All attention mechanisms in the Transformer are split up into
independent “heads” whose outputs are concatenated before being further processed.
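For concreteness, here is a minimal PyTorch sketch of the scale-only layer normalization described above; the class name and the RMS-style variance (computed without mean subtraction) are our assumptions consistent with the description, not code from the paper's release:

```python
import torch
import torch.nn as nn

class ScaleOnlyLayerNorm(nn.Module):
    """Layer normalization with a learned gain but no additive bias."""

    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))  # rescaling only
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Rescale activations by their root-mean-square; no bias is added.
        variance = x.pow(2).mean(dim=-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)
```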
Since self-attention is order-independent (i.e. it is an operation on sets), it is common
to provide an explicit position signal to the Transformer. While the original Transformer
used a sinusoidal position signal or learned position embeddings, it has recently become
more common to use relative position embeddings (Shaw et al., 2018; Huang et al., 2018a).
Instead of using a fixed embedding for each position, relative position embeddings produce
a different learned embedding according to the offset between the “key” and “query” being
compared in the self-attention mechanism. We use a simplified form of position embeddings
where each “embedding” is simply a scalar that is added to the corresponding logit used
for computing the attention weights. For efficiency, we also share the position embedding
parameters across all layers in our model, though within a given layer each attention head
uses a different learned position embedding. Typically, a fixed number of embeddings are
learned, each corresponding to a range of possible key-query offsets. In this work, we use 32
embeddings for all of our models with ranges that increase in size logarithmically up to an
offset of 128 beyond which we assign all relative positions to the same embedding. Note
that a given layer is insensitive to relative position beyond 128 tokens, but subsequent layers
can build a sensitivity to larger offsets by combining local information from previous layers.
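The bucketing scheme can be sketched as follows (our illustration; the function name, the unidirectional simplification, and the exact bucket boundaries are assumptions, and the released code may differ in detail):

```python
import math

def relative_position_bucket(relative_position: int,
                             num_buckets: int = 32,
                             max_distance: int = 128) -> int:
    """Map a key-minus-query offset to one of `num_buckets` shared buckets.

    Small offsets get exact buckets, larger offsets fall into buckets whose
    ranges grow logarithmically, and offsets beyond `max_distance` all share
    the final bucket.
    """
    n = max(-relative_position, 0)  # causal attention looks only at the past
    max_exact = num_buckets // 2
    if n < max_exact:
        return n  # one bucket per offset for nearby positions
    # Logarithmically sized buckets for offsets in [max_exact, max_distance).
    log_bucket = max_exact + int(
        math.log(n / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    return min(log_bucket, num_buckets - 1)

# Each attention head then learns one scalar bias per bucket and adds it to
# the corresponding attention logit.
```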
To summarize, our model is roughly equivalent to the original Transformer proposed by
Vaswani et al. (2017) with the exception of removing the Layer Norm bias, placing the layer
normalization outside the residual path, and using a different position embedding scheme.
Since these architectural changes are orthogonal to the experimental factors we consider in
our empirical survey of transfer learning, we leave the ablation of their impact for future
work.
As part of our study, we experiment with the scalability of these models, i.e. how their
performance changes as they are made to have more parameters or layers. Training large
models can be non-trivial since they might not fit on a single machine and require a great deal
of computation. As a result, we use a combination of model and data parallelism and train
models on “slices” of Cloud TPU Pods.⁵ TPU Pods are multi-rack ML supercomputers
that contain 1,024 TPU v3 chips connected via a high-speed 2D mesh interconnect with
supporting CPU host machines. We leverage the Mesh TensorFlow library (Shazeer et al.,
2018) for ease of implementation of both model parallelism and data parallelism (Krizhevsky,
2014).
2.2 The Colossal Clean Crawled Corpus
Much of the previous work on transfer learning for NLP makes use of large unlabeled data
sets for unsupervised learning. In this paper, we are interested in measuring the effect of the
quality, characteristics, and size of this unlabeled data. To generate data sets that satisfy
our needs, we leverage Common Crawl as a source of text scraped from the web. Common
5. https://cloud.google.com/tpu/
[The remaining 66 pages of the paper are not included in this preview.]