[2018-ACL] What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties
What you can cram into a single $&!#* vector:
Probing sentence embeddings for linguistic properties
Alexis Conneau
Facebook AI Research
Université Le Mans
aconneau@fb.com
German Kruszewski
Facebook AI Research
germank@fb.com
Guillaume Lample
Facebook AI Research
Sorbonne Universités
glample@fb.com
Loïc Barrault
Université Le Mans
loic.barrault@univ-lemans.fr
Marco Baroni
Facebook AI Research
mbaroni@fb.com
Abstract

Although much effort has recently been devoted to training high-quality sentence embeddings, we still have a poor understanding of what they are capturing. "Downstream" tasks, often based on sentence classification, are commonly used to evaluate the quality of sentence representations. The complexity of the tasks makes it however difficult to infer what kind of information is present in the representations. We introduce here 10 probing tasks designed to capture simple linguistic features of sentences, and we use them to study embeddings generated by three different encoders trained in eight distinct ways, uncovering intriguing properties of both encoders and training methods.
1 Introduction
Despite Ray Mooney's quip that you cannot cram the meaning of a whole %&!$# sentence into a single $&!#* vector, sentence embedding methods have achieved impressive results in tasks ranging from machine translation (Sutskever et al., 2014; Cho et al., 2014) to entailment detection (Williams et al., 2018), spurring the quest for "universal embeddings" trained once and used in a variety of applications (e.g., Kiros et al., 2015; Conneau et al., 2017; Subramanian et al., 2018). Positive results on concrete problems suggest that embeddings capture important linguistic properties of sentences. However, real-life "downstream" tasks require complex forms of inference, making it difficult to pinpoint the information a model is relying upon. Impressive as it might be that a system can tell that the sentence "A movie that doesn't aim too high, but it doesn't need to" (Pang and Lee, 2004) expresses a subjective viewpoint, it is hard to tell how the system (or even a human) comes to this conclusion. Complex tasks can also carry hidden biases that models might lock onto (Jabri et al., 2016). For example, Lai and Hockenmaier (2014) show that the simple heuristic of checking for explicit negation words leads to good accuracy in the SICK sentence entailment task.
Model introspection techniques have been applied to sentence encoders in order to gain a better understanding of which properties of the input sentences their embeddings retain (see Section 5). However, these techniques often depend on the specifics of an encoder architecture, and consequently cannot be used to compare different methods. Shi et al. (2016) and Adi et al. (2017) introduced a more general approach, relying on the notion of what we will call probing tasks. A probing task is a classification problem that focuses on simple linguistic properties of sentences. For example, one such task might require categorizing sentences by the tense of their main verb. Given an encoder (e.g., an LSTM) pre-trained on a certain task (e.g., machine translation), we use the sentence embeddings it produces to train the tense classifier (without further embedding tuning). If the classifier succeeds, it means that the pre-trained encoder is storing readable tense information in the embeddings it creates. Note that: (i) The probing task asks a simple question, minimizing interpretability problems. (ii) Because of their simplicity, it is easier to control for biases in probing tasks than in downstream tasks. (iii) The probing task methodology is agnostic with respect to the encoder architecture, as long as it produces a vector representation of sentences.
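The probing methodology just described can be illustrated with a minimal, self-contained sketch. The character-frequency "encoder" and nearest-centroid probe below are toy stand-ins (the actual experiments use pre-trained neural encoders and a trained classifier); the point is only that the encoder stays frozen while the probe is fit on its fixed embeddings:

```python
from collections import defaultdict

# Toy stand-in for a frozen pre-trained encoder: a normalized
# character-frequency vector. Its "parameters" are never updated.
def encode(sentence, dim=26):
    vec = [0.0] * dim
    for ch in sentence.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    total = sum(vec) or 1.0
    return [v / total for v in vec]

# The probe: a nearest-centroid classifier trained only on the fixed embeddings.
def train_probe(labeled_sentences):
    sums, counts = {}, defaultdict(int)
    for sent, label in labeled_sentences:
        emb = encode(sent)
        if label not in sums:
            sums[label] = [0.0] * len(emb)
        sums[label] = [a + b for a, b in zip(sums[label], emb)]
        counts[label] += 1
    return {lab: [v / counts[lab] for v in s] for lab, s in sums.items()}

def predict(centroids, sentence):
    emb = encode(sentence)
    sq_dist = lambda c: sum((a - b) ** 2 for a, b in zip(c, emb))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))
```

If the probe succeeds, the probed property is recoverable from the fixed embeddings, which is exactly the inference the probing-task methodology licenses.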
We greatly extend earlier work on probing tasks as follows. First, we introduce a larger set of probing tasks (10 in total), organized by the type of linguistic properties they probe. Second, we systematize the probing task methodology, controlling for a number of possible nuisance factors, and framing all tasks so that they only require single sentence representations as input, for maximum generality and to ease result interpretation. Third, we use our probing tasks to explore a wide range of state-of-the-art encoding architectures and training methods, and further relate probing and downstream task performance. Finally, we are publicly releasing our probing data sets and tools,¹ hoping they will become a standard way to study the linguistic properties of sentence embeddings.

(arXiv:1805.01070v2 [cs.CL], 8 Jul 2018)
2 Probing tasks
In constructing our probing benchmarks, we adopted the following criteria. First, for generality and interpretability, the task classification problem should only require single sentence embeddings as input (as opposed to, e.g., sentence and word embeddings, or multiple sentence representations). Second, it should be possible to construct large training sets in order to train parameter-rich multi-layer classifiers, in case the relevant properties are non-linearly encoded in the sentence vectors. Third, nuisance variables such as lexical cues or sentence length should be controlled for. Finally, and most importantly, we want tasks that address an interesting set of linguistic properties. We thus strove to come up with a set of tasks that, while respecting the previous constraints, probe a wide range of phenomena, from superficial properties of sentences such as which words they contain, to their hierarchical structure, to subtle facets of semantic acceptability. We think the current task set is reasonably representative of different linguistic domains, but we are not claiming that it is exhaustive. We expect future work to extend it.
The sentences for all our tasks are extracted from the Toronto Book Corpus (Zhu et al., 2015), more specifically from the random pre-processed portion made available by Paperno et al. (2016). We only sample sentences in the 5-to-28 word range. We parse them with the Stanford Parser (2017-06-09 version), using the pre-trained PCFG model (Klein and Manning, 2003), and we rely on the part-of-speech, constituency and dependency parsing information provided by this tool where needed. For each task, we construct training sets containing 100k sentences, and 10k-sentence validation and test sets. All sets are balanced, having an equal number of instances of each target class.

¹ https://github.com/facebookresearch/SentEval/tree/master/data/probing
Surface information

These tasks test the extent to which sentence embeddings are preserving surface properties of the sentences they encode. One can solve the surface tasks by simply looking at tokens in the input sentences: no linguistic knowledge is called for. The first task is to predict the length of sentences in terms of number of words (SentLen). Following Adi et al. (2017), we group sentences into 6 equal-width bins by length, and treat SentLen as a 6-way classification task. The word content (WC) task tests whether it is possible to recover information about the original words in the sentence from its embedding. We picked 1000 mid-frequency words from the source corpus vocabulary (the words with ranks between 2k and 3k when sorted by frequency), and sampled equal numbers of sentences that contain one and only one of these words. The task is to tell which of the 1k words a sentence contains (1k-way classification). This setup allows us to probe a sentence embedding for word content without requiring an auxiliary word embedding (as in the setup of Adi and colleagues).
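The SentLen binning can be sketched as follows, assuming equal-width bins over the sampled 5-to-28-word range; the released data set's exact bin edges may differ:

```python
def sentlen_bin(sentence, lo=5, hi=28, n_bins=6):
    # Assign a sentence to one of n_bins equal-width length bins over the
    # lo-to-hi word range (here: 5-8, 9-12, 13-16, 17-20, 21-24, 25-28).
    n_words = len(sentence.split())
    if not lo <= n_words <= hi:
        raise ValueError("length outside the sampled range")
    width = (hi - lo + 1) / n_bins
    return min(int((n_words - lo) // width), n_bins - 1)
```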
Syntactic information

The next batch of tasks tests whether sentence embeddings are sensitive to syntactic properties of the sentences they encode. The bigram shift (BShift) task tests whether an encoder is sensitive to legal word orders. In this binary classification problem, models must distinguish intact sentences sampled from the corpus from sentences where we inverted two random adjacent words ("What you are doing out there?").
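A minimal version of the BShift corruption, swapping one random pair of adjacent words (the actual data set construction additionally de-duplicates sentences across the original and modified classes):

```python
import random

def bigram_shift(sentence, seed=0):
    # Invert one random pair of adjacent words, producing the "modified"
    # class of the BShift task (e.g. "What you are doing out there?").
    words = sentence.split()
    if len(words) < 2:
        return sentence
    i = random.Random(seed).randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)
```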
The tree depth (TreeDepth) task checks whether an encoder infers the hierarchical structure of sentences, and in particular whether it can group sentences by the depth of the longest path from root to any leaf. Since tree depth is naturally correlated with sentence length, we de-correlate these variables through a structured sampling procedure. In the resulting data set, tree depth values range from 5 to 12, and the task is to categorize sentences into the class corresponding to their depth (8 classes). As an example, the following is a long (22 tokens) but shallow (max depth: 5) sentence: "[1 [2 But right now, for the time being, my past, my fears, and my thoughts [3 were [4 my [5 business]]].]]" (the outermost brackets correspond to the ROOT and S nodes in the parse).
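Tree depth in this sense can be read off a bracketed constituency parse as the maximum bracket nesting level; the sketch below assumes a well-formed Stanford-style parse string:

```python
def tree_depth(parse):
    # Longest root-to-leaf path of a bracketed constituency parse,
    # measured as the maximum bracket nesting level.
    depth = deepest = 0
    for ch in parse:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
    return deepest
```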
In the top constituent task (TopConst), sentences must be classified in terms of the sequence of top constituents immediately below the sentence (S) node. An encoder that successfully addresses this challenge is not only capturing latent syntactic structures, but clustering them by constituent types. TopConst was introduced by Shi et al. (2016). Following them, we frame it as a 20-way classification problem: 19 classes for the most frequent top constructions, and one for all other constructions. As an example, "[Then] [very dark gray letters on a black screen] [appeared] [.]" has top constituent sequence: "ADVP NP VP .".
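Extracting the top-constituent sequence from a Stanford-style parse string amounts to collecting the labels of the nodes directly under S; a sketch, assuming the usual "(ROOT (S ...))" shape:

```python
def top_constituents(parse):
    # Labels of the nodes immediately below S, assuming the parse string
    # has the shape "(ROOT (S (X ...) (Y ...) ...))".
    labels, depth, i = [], 0, 0
    while i < len(parse):
        if parse[i] == "(":
            depth += 1
            j = i + 1
            while j < len(parse) and parse[j] not in " ()":
                j += 1
            if depth == 3:  # ROOT is level 1, S is level 2
                labels.append(parse[i + 1:j])
            i = j
        else:
            if parse[i] == ")":
                depth -= 1
            i += 1
    return " ".join(labels)
```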
Note that, while we would not expect an untrained human subject to be explicitly aware of tree depth or top constituency, similar information must be implicitly computed to correctly parse sentences, and there is suggestive evidence that the brain tracks something akin to tree depth during sentence processing (Nelson et al., 2017).
Semantic information

These tasks also rely on syntactic structure, but they further require some understanding of what a sentence denotes. The Tense task asks for the tense of the main-clause verb (VBP/VBZ forms are labeled as present, VBD as past). No target form occurs across the train/dev/test split, so that classifiers cannot rely on specific words (it is not clear that Shi and colleagues, who introduced this task, controlled for this factor). The subject number (SubjNum) task focuses on the number of the subject of the main clause (number in English is more often explicitly marked on nouns than verbs). Again, there is no target overlap across partitions. Similarly, object number (ObjNum) tests for the number of the direct object of the main clause (again, avoiding lexical overlap). To solve the previous tasks correctly, an encoder must not only capture tense and number, but also extract structural information (about the main clause and its arguments). We grouped Tense, SubjNum and ObjNum with the semantic tasks, since, at least for models that treat words as unanalyzed input units (without access to morphology), they must rely on what a sentence denotes (e.g., whether the described event took place in the past), rather than on structural/syntactic information. We recognize, however, that the boundary between syntactic and semantic tasks is somewhat arbitrary.
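The Tense labeling rule (VBP/VBZ as present, VBD as past) can be sketched as follows; for simplicity this takes the first finite verb in a PoS-tagged word list, whereas the task construction locates the main-clause verb via the parse:

```python
def main_verb_tense(tagged_words):
    # Tense labeling rule: VBP/VBZ -> "present", VBD -> "past".
    # Simplification: scan for the first finite verb in the tagged sentence
    # (the actual data sets identify the main-clause verb from the parse).
    for word, tag in tagged_words:
        if tag in ("VBP", "VBZ"):
            return "present"
        if tag == "VBD":
            return "past"
    return None  # no finite main verb found
```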
In the semantic odd man out (SOMO) task, we modified sentences by replacing a random noun or verb o with another noun or verb r. To make the task more challenging, the bigrams formed by the replacement with the previous and following words in the sentence have frequencies that are comparable (on a log-scale) with those of the original bigrams. That is, if the original sentence contains bigrams w_{n-1}o and ow_{n+1}, the corresponding bigrams w_{n-1}r and rw_{n+1} in the modified sentence will have comparable corpus frequencies. No sentence is included in both original and modified format, and no replacement is repeated across train/dev/test sets. The task of the classifier is to tell whether a sentence has been modified or not. An example modified sentence is: "No one could see this Hayes and I wanted to know if it was real or a spoonful (orig.: ploy)." Note that judging plausibility of a syntactically well-formed sentence of this sort will often require grasping rather subtle semantic factors, ranging from selectional preference to topical coherence.
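The log-scale bigram-frequency matching behind SOMO can be sketched like this; `comparable_replacement`, its tolerance, and the add-one smoothing are illustrative choices, not the exact construction procedure:

```python
import math

def comparable_replacement(words, idx, candidates, bigram_freq, tol=1.0):
    # Find a replacement r for o = words[idx] whose new bigrams
    # (w_{n-1} r) and (r w_{n+1}) stay within `tol` of the original
    # bigrams' log-frequencies. Add-one smoothing handles unseen bigrams.
    prev_w = words[idx - 1] if idx > 0 else None
    next_w = words[idx + 1] if idx + 1 < len(words) else None
    o = words[idx]

    def logf(a, b):
        return math.log(bigram_freq.get((a, b), 0) + 1)

    left, right = logf(prev_w, o), logf(o, next_w)
    for r in candidates:
        if r != o and abs(logf(prev_w, r) - left) <= tol \
                  and abs(logf(r, next_w) - right) <= tol:
            return r
    return None  # no frequency-matched substitute available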
The coordination inversion (CoordInv) benchmark contains sentences made of two coordinate clauses. In half of the sentences, we inverted the order of the clauses. The task is to tell whether a sentence is intact or modified. Sentences are balanced in terms of clause length, and no sentence appears in both original and inverted versions. As an example, original "They might be only memories, but I can still feel each one" becomes: "I can still feel each one, but they might be only memories." Often, addressing CoordInv requires an understanding of broad discourse and pragmatic factors.
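A simplified version of the CoordInv manipulation for sentences coordinated by "but" (the actual benchmark balances clause lengths and presumably covers more coordination patterns):

```python
def invert_coordination(sentence):
    # Swap the two clauses around the coordinator "but":
    # "They might be only memories, but I can still feel each one"
    # -> "I can still feel each one, but they might be only memories."
    left, _, right = sentence.partition(" but ")
    if not right:
        return None  # no "but"-coordination found
    left = left.rstrip(" ,.")
    right = right.rstrip(" .")
    return right[0].upper() + right[1:] + ", but " + left[0].lower() + left[1:] + "."
```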
Row Hum. Eval. of Table 2 reports human-validated "reasonable" upper bounds for all the tasks, estimated in different ways, depending on the tasks. For the surface ones, there is always a straightforward correct answer that a human annotator with enough time and patience could find. The upper bound is thus estimated at 100%. The TreeDepth, TopConst, Tense, SubjNum and ObjNum tasks depend on automated PoS and parsing annotation. In these cases, the upper bound is given by the proportion of sentences correctly annotated by the automated procedure. To estimate this quantity, one linguistically-trained author checked the annotation of 200 randomly sampled test sentences from each task. Finally, the BShift, SOMO and CoordInv manipulations can accidentally generate acceptable sentences. For