In this paper, we explore a semi-supervised approach for language understanding tasks using a
combination of unsupervised pre-training and supervised fine-tuning. Our goal is to learn a universal
representation that transfers with little adaptation to a wide range of tasks. We assume access to
a large corpus of unlabeled text and several datasets with manually annotated training examples
(target tasks). Our setup does not require these target tasks to be in the same domain as the unlabeled
corpus. We employ a two-stage training procedure. First, we use a language modeling objective on
the unlabeled data to learn the initial parameters of a neural network model. Subsequently, we adapt
these parameters to a target task using the corresponding supervised objective.
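To make the two-stage procedure concrete, the sketch below shows a language-modeling pre-training step followed by a supervised fine-tuning step. It is a minimal PyTorch sketch: the toy TransformerLM, the classification head, and all hyperparameters are illustrative assumptions, not the configuration used in this work.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerLM(nn.Module):
    # Toy stand-in for the pre-trained Transformer language model.
    def __init__(self, vocab_size, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        h = self.embed(tokens)
        # Causal mask: each position attends only to earlier positions.
        seq_len = tokens.size(1)
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.blocks(h, mask=mask)
        return self.lm_head(h), h

def pretrain_step(model, opt, tokens):
    # Stage 1: next-token prediction (language modeling) on unlabeled text.
    logits, _ = model(tokens[:, :-1])
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def finetune_step(model, clf_head, opt, tokens, labels):
    # Stage 2: adapt the pre-trained parameters with a supervised objective,
    # classifying from the hidden state of the final token.
    _, h = model(tokens)
    loss = F.cross_entropy(clf_head(h[:, -1]), labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

In use, stage 1 would iterate pretrain_step over the unlabeled corpus, after which stage 2 would iterate finetune_step on the labeled target-task data with an optimizer covering both the pre-trained parameters and the newly added classification head.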
For our model architecture, we use the Transformer [62], which has been shown to perform strongly on
various tasks such as machine translation [62], document generation [34], and syntactic parsing [29].
This model choice provides us with a more structured memory for handling long-term dependencies in
text, compared to alternatives like recurrent networks, resulting in robust transfer performance across
diverse tasks. During transfer, we utilize task-specific input adaptations derived from traversal-style
approaches [52], which process structured text input as a single contiguous sequence of tokens. As
we demonstrate in our experiments, these adaptations enable us to fine-tune effectively with minimal
changes to the architecture of the pre-trained model.
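As an illustration of such an input adaptation, the sketch below flattens structured inputs (e.g., a premise and hypothesis for entailment, or a context with candidate answers) into one contiguous token sequence. The special token names and the whitespace tokenizer are illustrative assumptions rather than the exact tokens and vocabulary used by the model.

# Traversal-style input adaptation (sketch): structured inputs become a
# single contiguous token sequence bracketed by special tokens.
START, DELIM, EXTRACT = "<start>", "<delim>", "<extract>"

def tokenize(text):
    # Stand-in tokenizer; a subword vocabulary would be used in practice.
    return text.lower().split()

def entailment_input(premise, hypothesis):
    # Premise and hypothesis concatenated with a delimiter between them.
    return [START] + tokenize(premise) + [DELIM] + tokenize(hypothesis) + [EXTRACT]

def multiple_choice_inputs(context, answers):
    # One sequence per candidate answer; each is scored independently.
    return [[START] + tokenize(context) + [DELIM] + tokenize(ans) + [EXTRACT]
            for ans in answers]

print(entailment_input("A man plays a guitar.", "A person makes music."))

Because every task is rendered as a single token sequence of this form, only a small task-specific output layer needs to be added on top of the pre-trained model.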
We evaluate our approach on four types of language understanding tasks – natural language inference,
question answering, semantic similarity, and text classification. Our general task-agnostic model
outperforms discriminatively trained models that employ architectures specifically crafted for each
task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance,
we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test) [40], 5.7% on
question answering (RACE) [30], 1.5% on textual entailment (MultiNLI) [66] and 5.5% on the recently
introduced GLUE multi-task benchmark [64]. We also analyze the zero-shot behavior of the pre-trained
model in four different settings and demonstrate that it acquires useful linguistic knowledge for
downstream tasks.
2 Related Work
Semi-supervised learning for NLP
Our work broadly falls under the category of semi-supervised
learning for natural language. This paradigm has attracted significant interest, with applications to
tasks like sequence labeling [24, 33, 57] or text classification [41, 70]. The earliest approaches used
unlabeled data to compute word-level or phrase-level statistics, which were then used as features in a
supervised model [33]. Over the last few years, researchers have demonstrated the benefits of using
word embeddings [11, 39, 42], which are trained on unlabeled corpora, to improve performance on a
variety of tasks [8, 11, 26, 45]. These approaches, however, mainly transfer word-level information,
whereas we aim to capture higher-level semantics.
Recent approaches have investigated learning and utilizing more than word-level semantics from
unlabeled data. Phrase-level or sentence-level embeddings, which can be trained using an unlabeled
corpus, have been used to encode text into suitable vector representations for various target
tasks [28, 32, 1, 36, 22, 12, 56, 31].
Unsupervised pre-training
Unsupervised pre-training is a special case of semi-supervised learning
where the goal is to find a good initialization point instead of modifying the supervised learning
objective. Early works explored the use of the technique in image classification [20, 49, 63] and
regression tasks [3]. Subsequent research [15] demonstrated that pre-training acts as a regularization
scheme, enabling better generalization in deep neural networks. In recent work, the method has
been used to help train deep neural networks on various tasks like image classification [69], speech
recognition [68], entity disambiguation [17] and machine translation [48].
The closest line of work to ours involves pre-training a neural network using a language modeling
objective and then fine-tuning it on a target task with supervision. Dai et al. [13] and Howard and
Ruder [21] follow this method to improve text classification. However, although the pre-training
phase helps capture some linguistic information, their usage of LSTM models restricts their prediction
ability to a short range. In contrast, our choice of transformer networks allows us to capture longer-
range linguistic structure, as demonstrated in our experiments. Further, we also demonstrate the
effectiveness of our model on a wider range of tasks including natural language inference, paraphrase
detection and story completion. Other approaches [43, 44, 38] use hidden representations from a