Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4699–4709, Florence, Italy, July 28 – August 2, 2019. © 2019 Association for Computational Linguistics
Simple and Effective Text Matching with Richer Alignment Features
Runqi Yang¹, Jianhai Zhang², Xing Gao², Feng Ji², Haiqing Chen²
¹Department of Computer Science and Technology, Nanjing University, China
²Alibaba Group, Hangzhou, China
{tanfan.zjh, gaoxing.gx, zhongxiu.jf, haiqing.chenhq}@alibaba-inc.com
Abstract
In this paper, we present a fast and strong neural approach for general-purpose text matching applications. We explore what is sufficient to build a fast and well-performing text matching model, and propose to keep three key features available for inter-sequence alignment: original point-wise features, previous aligned features, and contextual features, while simplifying all the remaining components. We conduct experiments on four well-studied benchmark datasets across the tasks of natural language inference, paraphrase identification, and answer selection. The performance of our model is on par with the state of the art on all datasets, with far fewer parameters, and its inference speed is at least six times faster than similarly performing models.
1 Introduction
Text matching is a core research area in natural language processing with a long history. In text matching tasks, a model takes two text sequences as input and predicts a category or a scalar value indicating their relationship. A wide range of tasks, including natural language inference (also known as recognizing textual entailment) (Bowman et al., 2015; Khot et al., 2018), paraphrase identification (Wang et al., 2017), answer selection (Yang et al., 2015), and so on, can be seen as specific forms of text matching problems. Research on general-purpose text matching algorithms therefore benefits a large number of related applications.
Deep neural networks are the most popular choices for text matching nowadays. Semantic alignment and comparison of two text sequences are the keys in neural text matching. Many previous deep neural networks contain a single inter-sequence alignment layer. To make full use of this only alignment process, the model has to take rich external syntactic features or hand-designed alignment features as additional inputs of the alignment layer (Chen et al., 2017; Gong et al., 2018), adopt a complicated alignment mechanism (Wang et al., 2017; Tan et al., 2018), or build a vast amount of post-processing layers to analyze the alignment result (Tay et al., 2018b; Gong et al., 2018).
More powerful models can be built with multiple inter-sequence alignment layers. Instead of making a prediction based on the comparison result of a single alignment process, a stacked model with multiple alignment layers maintains its intermediate states and gradually refines its predictions. However, suffering from inefficient propagation of lower-level features and vanishing gradients, these deeper architectures are harder to train. Recent works have come up with ways of connecting stacked building blocks, including dense connections (Tay et al., 2018a; Kim et al., 2018) and recurrent neural networks (Liu et al., 2018), which strengthen the propagation of lower-level features and yield better results than those with a single alignment process.
This paper presents RE2, a fast and strong neural architecture with multiple alignment processes for general-purpose text matching. We question the necessity of many slow components in text matching approaches presented in previous literature, including complicated multi-way alignment mechanisms, heavy distillations of alignment results, external syntactic features, and dense connections between stacked blocks when the model goes deep. These design choices slow down the model considerably and can be replaced by much more lightweight and equally effective ones. Meanwhile, we highlight three key components for an efficient text matching model. These components, which the name RE2 stands for, are previous aligned features (Residual vectors), original point-wise features (Embedding vectors), and contextual features (Encoded vectors). The remaining components can be kept as simple as possible to keep the model fast while still yielding strong performance.
The general architecture of RE2 is illustrated in Figure 1. An embedding layer first embeds discrete tokens. Several same-structured blocks consisting of encoding, alignment, and fusion layers then process the sequences consecutively. These blocks are connected by an augmented version of residual connections (see Section 2.1). A pooling layer aggregates sequential representations into vectors, which are finally processed by a prediction layer to give the final prediction. The implementation of each layer is kept as simple as possible, and the whole model, as a well-organized combination, is powerful and lightweight at the same time.
Our proposed method achieves performance on par with the state of the art on four benchmark datasets across three different tasks, namely SNLI and SciTail for natural language inference, Quora Question Pairs for paraphrase identification, and WikiQA for answer selection. Furthermore, our model has the fewest parameters and the fastest inference speed among all similarly performing models. We also conduct an ablation study to compare with alternative implementations of most components, perform robustness checks to see whether the model is robust to changes of structural hyperparameters, explore what roles the three key features in RE2 play by comparing their occlusion sensitivity, and show the evolution of alignment results in a case study. We release the source code¹ of our experiments for reproducibility and hope to facilitate future research.
2 Our Approach
In this section, we introduce our proposed approach RE2 for text matching. Figure 1 gives an illustration of the overall architecture. Two text sequences are processed symmetrically before the prediction layer, and all parameters except those in the prediction layer are shared between the two sequences. For conciseness, we omit the part for the other sequence in the figure.
In RE2, tokens in each sequence are first embedded by the embedding layer and then processed consecutively by N same-structured blocks with independent parameters (dashed boxes in Figure 1) connected by augmented residual connections.

¹ https://github.com/hitvoice/RE2, under the Apache License 2.0.

Figure 1: An overview of RE2. There are three parts in the input of the alignment and fusion layers: original point-wise features (Embedding vectors, denoted by blank rectangles), previous aligned features (Residual vectors, denoted by rectangles with diagonal stripes), and contextual features (Encoded vectors, denoted by solid rectangles). The architecture on the right is the same as the one on the left, so it is omitted for conciseness.
Inside each block, a sequence encoder first computes contextual features of the sequence (solid rectangles in Figure 1). The input and output of the encoder are concatenated and then fed into an alignment layer to model the alignment and interaction between the two sequences. A fusion layer fuses the input and output of the alignment layer. The output of the fusion layer is considered the output of this block. The output of the last block is sent to the pooling layer and transformed into a fixed-length vector. The prediction layer takes the two vectors as input and predicts the final target. The cross-entropy loss is optimized to train the model in classification tasks.

The implementation of each layer is kept as simple as possible. We use only word embeddings in the embedding layer, without character embeddings or syntactic features. Vanilla multi-layer convolutional networks with same padding (Collobert et al., 2011) are adopted as the encoder. Recurrent networks are slower and do not lead to further improvements, so they are not adopted here. A max-over-time pooling operation (Collobert et al., 2011) is used in the pooling layer.
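Max-over-time pooling simply keeps, for each feature dimension, the maximum value across all sequence positions. The following is a minimal numpy sketch of this operation; the function name and shapes are our own illustration, not taken from the released code:

```python
import numpy as np

def max_over_time_pooling(x):
    """Collapse a sequence of feature vectors into one fixed-length vector
    by taking the maximum of each feature dimension over all positions.

    x: array of shape (seq_len, hidden). Returns shape (hidden,).
    """
    return x.max(axis=0)

# Toy sequence: 3 positions, 4 features each.
seq = np.array([[0.1, 0.9, 0.3, 0.0],
                [0.5, 0.2, 0.8, 0.1],
                [0.4, 0.7, 0.2, 0.6]])
pooled = max_over_time_pooling(seq)
```

The result is independent of sequence length, which is what allows the prediction layer to operate on fixed-length vectors.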
The details of augmented residual connections and
other layers are introduced as follows.
2.1 Augmented Residual Connections
To provide richer features for alignment processes, RE2 adopts an augmented version of residual connections to connect consecutive blocks. For a sequence of length l, we denote the input and output of the n-th block as x^{(n)} = (x_1^{(n)}, x_2^{(n)}, ..., x_l^{(n)}) and o^{(n)} = (o_1^{(n)}, o_2^{(n)}, ..., o_l^{(n)}), respectively. Let o^{(0)} be a sequence of zero vectors. The input of the first block, x^{(1)}, as mentioned before, is the output of the embedding layer (denoted by blank rectangles in Figure 1). The input of the n-th block x^{(n)} (n ≥ 2) is the concatenation of the input of the first block x^{(1)} and the summation of the outputs of the previous two blocks (denoted by rectangles with diagonal stripes in Figure 1):

$$x_i^{(n)} = \left[ x_i^{(1)};\; o_i^{(n-1)} + o_i^{(n-2)} \right], \qquad (1)$$
where [ · ; · ] denotes the concatenation operation. With augmented residual connections, there are three parts in the input of the alignment and fusion layers, namely original point-wise features kept untouched along the way (Embedding vectors), previous aligned features processed and refined by previous blocks (Residual vectors), and contextual features from the encoder layer (Encoded vectors). Each of these three parts plays a complementary role in the text matching process.
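Equation (1) can be sketched in a few lines of numpy. This is a hedged illustration with our own function name and toy shapes; the actual model operates on trained, batched tensors:

```python
import numpy as np

def augmented_residual_input(x1, outputs):
    """Build the input of block n per Eq. (1).

    x1:      (seq_len, emb), the embedding-layer output x^(1).
    outputs: previous block outputs [o^(1), ..., o^(n-1)], each (seq_len, hidden);
             o^(0) is treated as a sequence of zero vectors.
    Returns x^(n) = [x^(1); o^(n-1) + o^(n-2)] per position for n >= 2,
    or x^(1) itself for the first block.
    """
    if not outputs:                       # first block: no residual part yet
        return x1
    o_prev = outputs[-1]
    o_prev2 = outputs[-2] if len(outputs) >= 2 else np.zeros_like(o_prev)
    return np.concatenate([x1, o_prev + o_prev2], axis=-1)

x1 = np.ones((3, 4))                      # toy embeddings: seq_len=3, emb=4
o1 = np.full((3, 5), 2.0)                 # toy output of block 1, hidden=5
o2 = np.full((3, 5), 3.0)                 # toy output of block 2
x3 = augmented_residual_input(x1, [o1, o2])   # input of block 3
```

Note how the embedding part of the input stays untouched across blocks, while the residual part sums the two most recent block outputs.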
2.2 Alignment Layer
A simple form of alignment based on the attention mechanism is used, following Parikh et al. (2016) with minor modifications. The alignment layer, as shown in Figure 1, takes features from the two sequences as input and computes the aligned representations as output. Input from the first sequence of length l_a is denoted as a = (a_1, a_2, ..., a_{l_a}) and input from the second sequence of length l_b is denoted as b = (b_1, b_2, ..., b_{l_b}). The similarity score e_ij between a_i and b_j is computed as the dot product of the projected vectors:

$$e_{ij} = F(a_i)^\top F(b_j). \qquad (2)$$

F is an identity function or a single-layer feed-forward network. The choice is treated as a hyperparameter.
The output vectors a′ and b′ are computed by a weighted summation of the representations of the other sequence, where the weights are the normalized similarity scores between the current position and the corresponding positions in the other sequence:

$$a'_i = \sum_{j=1}^{l_b} \frac{\exp(e_{ij})}{\sum_{k=1}^{l_b} \exp(e_{ik})}\, b_j, \qquad b'_j = \sum_{i=1}^{l_a} \frac{\exp(e_{ij})}{\sum_{k=1}^{l_a} \exp(e_{kj})}\, a_i. \qquad (3)$$
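Equations (2) and (3) amount to a softmax-normalized co-attention in both directions. A minimal numpy sketch, taking F to be the identity function (one of the two choices above); names and toy values are our own:

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)       # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def align(a, b):
    """Eqs. (2)-(3) with F taken as the identity function.

    a: (l_a, d), b: (l_b, d). Each position of one sequence attends over
    the other sequence using softmax-normalized dot-product scores.
    """
    e = a @ b.T                                   # e_ij = a_i . b_j, shape (l_a, l_b)
    a_aligned = softmax(e, axis=1) @ b            # a'_i: weighted sum of the b_j
    b_aligned = softmax(e, axis=0).T @ a          # b'_j: weighted sum of the a_i
    return a_aligned, b_aligned

# With all-zero queries every score is equal, so a' is the mean of b.
a = np.zeros((2, 3))
b = np.array([[1.0, 2.0, 3.0],
              [3.0, 4.0, 5.0]])
a_aligned, b_aligned = align(a, b)
```

Note that one score matrix e serves both directions: rows are normalized to align a against b, and columns to align b against a.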
2.3 Fusion Layer
The fusion layer compares local and aligned representations from three perspectives and then fuses them together. The output of the fusion layer for the first sequence, ā, is computed by

$$\bar{a}^1_i = G_1([a_i; a'_i]), \quad \bar{a}^2_i = G_2([a_i; a_i - a'_i]), \quad \bar{a}^3_i = G_3([a_i; a_i \circ a'_i]), \quad \bar{a}_i = G([\bar{a}^1_i; \bar{a}^2_i; \bar{a}^3_i]), \qquad (4)$$

where G_1, G_2, G_3, and G are single-layer feed-forward networks with independent parameters, and ∘ denotes element-wise multiplication. The subtraction operator highlights the difference between the two vectors, while the multiplication highlights similarity. The formulations for b̄ are similar and omitted here.
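The three-perspective comparison of Eq. (4) can be sketched as follows. This is an illustration only: the feed-forward networks here are randomly initialized stand-ins for the trained G_1, G_2, G_3, and G, and the shapes are our own choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def ff(d_in, d_out):
    """A single-layer feed-forward net, ReLU(x W + b), randomly initialized
    here purely for illustration (in the model these are trained)."""
    W = rng.standard_normal((d_in, d_out)) * 0.1
    b = np.zeros(d_out)
    return lambda x: np.maximum(x @ W + b, 0.0)

def fusion(a, a_aligned, d_hidden=8):
    """Eq. (4): compare local and aligned features from three perspectives
    (concatenation, difference, element-wise product), then fuse them.

    a, a_aligned: (seq_len, d). Returns (seq_len, d_hidden).
    """
    d = a.shape[-1]
    G1, G2, G3 = ff(2 * d, d_hidden), ff(2 * d, d_hidden), ff(2 * d, d_hidden)
    G = ff(3 * d_hidden, d_hidden)
    a1 = G1(np.concatenate([a, a_aligned], axis=-1))          # raw comparison
    a2 = G2(np.concatenate([a, a - a_aligned], axis=-1))      # difference view
    a3 = G3(np.concatenate([a, a * a_aligned], axis=-1))      # similarity view
    return G(np.concatenate([a1, a2, a3], axis=-1))

a = rng.standard_normal((5, 4))
fused = fusion(a, a)        # comparing a toy sequence with itself
```

Each perspective pairs the local vector a_i with a different function of its aligned counterpart, so the fused output carries both agreement and disagreement signals.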
2.4 Prediction Layer
The prediction layer takes the vector representations of the two sequences, v_1 and v_2, from the pooling layers as input and predicts the final target following Mou et al. (2016):

$$\hat{y} = H([v_1; v_2; v_1 - v_2; v_1 \circ v_2]). \qquad (5)$$
H is a multi-layer feed-forward neural network. In a classification task, ŷ ∈ R^C represents the unnormalized predicted scores for all classes, where C is the number of classes. The predicted class is ŷ = argmax_i ŷ_i. In a regression task, ŷ is the predicted scalar value.
In symmetric tasks like paraphrase identification, a symmetric version of the prediction layer is used for better generalization:

$$\hat{y} = H([v_1; v_2; |v_1 - v_2|; v_1 \circ v_2]). \qquad (6)$$
We also provide a simplified version of the prediction layer; which version to use is treated as a hyperparameter. The simplified prediction layer can be expressed as:

$$\hat{y} = H([v_1; v_2]). \qquad (7)$$