[2018-ACL] What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties
What you can cram into a single $&!#* vector:
Probing sentence embeddings for linguistic properties
Alexis Conneau
Facebook AI Research
Université Le Mans
aconneau@fb.com
German Kruszewski
Facebook AI Research
germank@fb.com
Guillaume Lample
Facebook AI Research
Sorbonne Universités
glample@fb.com
Loïc Barrault
Université Le Mans
loic.barrault@univ-lemans.fr
Marco Baroni
Facebook AI Research
mbaroni@fb.com
Abstract

Although much effort has recently been devoted to training high-quality sentence embeddings, we still have a poor understanding of what they are capturing. "Downstream" tasks, often based on sentence classification, are commonly used to evaluate the quality of sentence representations. The complexity of the tasks makes it however difficult to infer what kind of information is present in the representations. We introduce here 10 probing tasks designed to capture simple linguistic features of sentences, and we use them to study embeddings generated by three different encoders trained in eight distinct ways, uncovering intriguing properties of both encoders and training methods.
1 Introduction
Despite Ray Mooney's quip that you cannot cram the meaning of a whole %&!$# sentence into a single $&!#* vector, sentence embedding methods have achieved impressive results in tasks ranging from machine translation (Sutskever et al., 2014; Cho et al., 2014) to entailment detection (Williams et al., 2018), spurring the quest for "universal embeddings" trained once and used in a variety of applications (e.g., Kiros et al., 2015; Conneau et al., 2017; Subramanian et al., 2018). Positive results on concrete problems suggest that embeddings capture important linguistic properties of sentences. However, real-life "downstream" tasks require complex forms of inference, making it difficult to pinpoint the information a model is relying upon. Impressive as it might be that a system can tell that the sentence "A movie that doesn't aim too high, but it doesn't need to" (Pang and Lee, 2004) expresses a subjective viewpoint, it is hard to tell how the system (or even a human) comes to this conclusion. Complex tasks can also carry hidden biases that models might lock onto (Jabri et al., 2016). For example, Lai and Hockenmaier (2014) show that the simple heuristic of checking for explicit negation words leads to good accuracy in the SICK sentence entailment task.
Model introspection techniques have been applied to sentence encoders in order to gain a better understanding of which properties of the input sentences their embeddings retain (see Section 5). However, these techniques often depend on the specifics of an encoder architecture, and consequently cannot be used to compare different methods. Shi et al. (2016) and Adi et al. (2017) introduced a more general approach, relying on the notion of what we will call probing tasks. A probing task is a classification problem that focuses on simple linguistic properties of sentences. For example, one such task might require categorizing sentences by the tense of their main verb. Given an encoder (e.g., an LSTM) pre-trained on a certain task (e.g., machine translation), we use the sentence embeddings it produces to train the tense classifier (without further embedding tuning). If the classifier succeeds, it means that the pre-trained encoder is storing readable tense information in the embeddings it creates. Note that: (i) The probing task asks a simple question, minimizing interpretability problems. (ii) Because of their simplicity, it is easier to control for biases in probing tasks than in downstream tasks. (iii) The probing task methodology is agnostic with respect to the encoder architecture, as long as it produces a vector representation of sentences.
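The probing methodology just described can be illustrated with a minimal, self-contained sketch. The character-frequency "encoder" and nearest-centroid probe below are toy stand-ins (the actual experiments use pre-trained neural encoders and a trained classifier); the point is only that the encoder stays frozen while the probe is fit on its fixed embeddings:

```python
from collections import defaultdict

# Toy stand-in for a frozen pre-trained encoder: a normalized
# character-frequency vector. Its "parameters" are never updated.
def encode(sentence, dim=26):
    vec = [0.0] * dim
    for ch in sentence.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    total = sum(vec) or 1.0
    return [v / total for v in vec]

# The probe: a nearest-centroid classifier trained only on the fixed embeddings.
def train_probe(labeled_sentences):
    sums, counts = {}, defaultdict(int)
    for sent, label in labeled_sentences:
        emb = encode(sent)
        if label not in sums:
            sums[label] = [0.0] * len(emb)
        sums[label] = [a + b for a, b in zip(sums[label], emb)]
        counts[label] += 1
    return {lab: [v / counts[lab] for v in s] for lab, s in sums.items()}

def predict(centroids, sentence):
    emb = encode(sentence)
    sq_dist = lambda c: sum((a - b) ** 2 for a, b in zip(c, emb))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))
```

If the probe succeeds, the probed property is recoverable from the fixed embeddings, which is exactly the inference the probing-task methodology licenses.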
We greatly extend earlier work on probing tasks as follows. First, we introduce a larger set of probing tasks (10 in total), organized by the type of linguistic properties they probe. Second, we systematize the probing task methodology, controlling for a number of possible nuisance factors, and framing all tasks so that they only require single sentence representations as input, for maximum generality and to ease result interpretation. Third, we use our probing tasks to explore a wide range of state-of-the-art encoding architectures and training methods, and further relate probing and downstream task performance. Finally, we are publicly releasing our probing data sets and tools,¹ hoping they will become a standard way to study the linguistic properties of sentence embeddings.

(arXiv:1805.01070v2 [cs.CL], 8 Jul 2018)
2 Probing tasks
In constructing our probing benchmarks, we adopted the following criteria. First, for generality and interpretability, the task classification problem should only require single sentence embeddings as input (as opposed to, e.g., sentence and word embeddings, or multiple sentence representations). Second, it should be possible to construct large training sets in order to train parameter-rich multi-layer classifiers, in case the relevant properties are non-linearly encoded in the sentence vectors. Third, nuisance variables such as lexical cues or sentence length should be controlled for. Finally, and most importantly, we want tasks that address an interesting set of linguistic properties. We thus strove to come up with a set of tasks that, while respecting the previous constraints, probe a wide range of phenomena, from superficial properties of sentences such as which words they contain, to their hierarchical structure, to subtle facets of semantic acceptability. We think the current task set is reasonably representative of different linguistic domains, but we are not claiming that it is exhaustive. We expect future work to extend it.
The sentences for all our tasks are extracted from the Toronto Book Corpus (Zhu et al., 2015), more specifically from the random pre-processed portion made available by Paperno et al. (2016). We only sample sentences in the 5-to-28 word range. We parse them with the Stanford Parser (2017-06-09 version), using the pre-trained PCFG model (Klein and Manning, 2003), and we rely on the part-of-speech, constituency and dependency parsing information provided by this tool where needed. For each task, we construct training sets containing 100k sentences, and 10k-sentence validation and test sets. All sets are balanced, having an equal number of instances of each target class.

¹ https://github.com/facebookresearch/SentEval/tree/master/data/probing
Surface information

These tasks test the extent to which sentence embeddings are preserving surface properties of the sentences they encode. One can solve the surface tasks by simply looking at tokens in the input sentences: no linguistic knowledge is called for. The first task is to predict the length of sentences in terms of number of words (SentLen). Following Adi et al. (2017), we group sentences into 6 equal-width bins by length, and treat SentLen as a 6-way classification task. The word content (WC) task tests whether it is possible to recover information about the original words in the sentence from its embedding. We picked 1000 mid-frequency words from the source corpus vocabulary (the words with ranks between 2k and 3k when sorted by frequency), and sampled equal numbers of sentences that contain one and only one of these words. The task is to tell which of the 1k words a sentence contains (1k-way classification). This setup allows us to probe a sentence embedding for word content without requiring an auxiliary word embedding (as in the setup of Adi and colleagues).
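The SentLen binning can be sketched as follows, assuming equal-width bins over the sampled 5-to-28-word range; the released data set's exact bin edges may differ:

```python
def sentlen_bin(sentence, lo=5, hi=28, n_bins=6):
    # Assign a sentence to one of n_bins equal-width length bins over the
    # lo-to-hi word range (here: 5-8, 9-12, 13-16, 17-20, 21-24, 25-28).
    n_words = len(sentence.split())
    if not lo <= n_words <= hi:
        raise ValueError("length outside the sampled range")
    width = (hi - lo + 1) / n_bins
    return min(int((n_words - lo) // width), n_bins - 1)
```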
Syntactic information

The next batch of tasks tests whether sentence embeddings are sensitive to syntactic properties of the sentences they encode. The bigram shift (BShift) task tests whether an encoder is sensitive to legal word orders. In this binary classification problem, models must distinguish intact sentences sampled from the corpus from sentences where we inverted two random adjacent words ("What you are doing out there?").
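A minimal version of the BShift corruption, swapping one random pair of adjacent words (the actual data set construction additionally de-duplicates sentences across the original and modified classes):

```python
import random

def bigram_shift(sentence, seed=0):
    # Invert one random pair of adjacent words, producing the "modified"
    # class of the BShift task (e.g. "What you are doing out there?").
    words = sentence.split()
    if len(words) < 2:
        return sentence
    i = random.Random(seed).randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)
```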
The tree depth (TreeDepth) task checks whether an encoder infers the hierarchical structure of sentences, and in particular whether it can group sentences by the depth of the longest path from root to any leaf. Since tree depth is naturally correlated with sentence length, we de-correlate these variables through a structured sampling procedure. In the resulting data set, tree depth values range from 5 to 12, and the task is to categorize sentences into the class corresponding to their depth (8 classes). As an example, the following is a long (22 tokens) but shallow (max depth: 5) sentence: "[1 [2 But right now, for the time being, my past, my fears, and my thoughts [3 were [4 my [5 business]]].]]" (the outermost brackets correspond to the ROOT and S nodes in the parse).
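Tree depth in this sense can be read off a bracketed constituency parse as the maximum bracket nesting level; the sketch below assumes a well-formed Stanford-style parse string:

```python
def tree_depth(parse):
    # Longest root-to-leaf path of a bracketed constituency parse,
    # measured as the maximum bracket nesting level.
    depth = deepest = 0
    for ch in parse:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
    return deepest
```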
In the top constituent task (TopConst), sentences must be classified in terms of the sequence of top constituents immediately below the sentence (S) node. An encoder that successfully addresses this challenge is not only capturing latent syntactic structures, but clustering them by constituent types. TopConst was introduced by Shi et al. (2016). Following them, we frame it as a 20-way classification problem: 19 classes for the most frequent top constructions, and one for all other constructions. As an example, "[Then] [very dark gray letters on a black screen] [appeared] [.]" has top constituent sequence: "ADVP NP VP .".
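Extracting the top-constituent sequence from a Stanford-style parse string amounts to collecting the labels of the nodes directly under S; a sketch, assuming the usual "(ROOT (S ...))" shape:

```python
def top_constituents(parse):
    # Labels of the nodes immediately below S, assuming the parse string
    # has the shape "(ROOT (S (X ...) (Y ...) ...))".
    labels, depth, i = [], 0, 0
    while i < len(parse):
        if parse[i] == "(":
            depth += 1
            j = i + 1
            while j < len(parse) and parse[j] not in " ()":
                j += 1
            if depth == 3:  # ROOT is level 1, S is level 2
                labels.append(parse[i + 1:j])
            i = j
        else:
            if parse[i] == ")":
                depth -= 1
            i += 1
    return " ".join(labels)
```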
Note that, while we would not expect an untrained human subject to be explicitly aware of tree depth or top constituency, similar information must be implicitly computed to correctly parse sentences, and there is suggestive evidence that the brain tracks something akin to tree depth during sentence processing (Nelson et al., 2017).
Semantic information

These tasks also rely on syntactic structure, but they further require some understanding of what a sentence denotes. The Tense task asks for the tense of the main-clause verb (VBP/VBZ forms are labeled as present, VBD as past). No target form occurs across the train/dev/test split, so that classifiers cannot rely on specific words (it is not clear that Shi and colleagues, who introduced this task, controlled for this factor). The subject number (SubjNum) task focuses on the number of the subject of the main clause (number in English is more often explicitly marked on nouns than verbs). Again, there is no target overlap across partitions. Similarly, object number (ObjNum) tests for the number of the direct object of the main clause (again, avoiding lexical overlap). To solve the previous tasks correctly, an encoder must not only capture tense and number, but also extract structural information (about the main clause and its arguments). We grouped Tense, SubjNum and ObjNum with the semantic tasks, since, at least for models that treat words as unanalyzed input units (without access to morphology), they must rely on what a sentence denotes (e.g., whether the described event took place in the past), rather than on structural/syntactic information. We recognize, however, that the boundary between syntactic and semantic tasks is somewhat arbitrary.
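The Tense labeling rule (VBP/VBZ as present, VBD as past) can be sketched as follows; for simplicity this takes the first finite verb in a PoS-tagged word list, whereas the task construction locates the main-clause verb via the parse:

```python
def main_verb_tense(tagged_words):
    # Tense labeling rule: VBP/VBZ -> "present", VBD -> "past".
    # Simplification: scan for the first finite verb in the tagged sentence
    # (the actual data sets identify the main-clause verb from the parse).
    for word, tag in tagged_words:
        if tag in ("VBP", "VBZ"):
            return "present"
        if tag == "VBD":
            return "past"
    return None  # no finite main verb found
```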
In the semantic odd man out (SOMO) task, we modified sentences by replacing a random noun or verb o with another noun or verb r. To make the task more challenging, the bigrams formed by the replacement with the previous and following words in the sentence have frequencies that are comparable (on a log-scale) with those of the original bigrams. That is, if the original sentence contains bigrams w_{n-1}o and ow_{n+1}, the corresponding bigrams w_{n-1}r and rw_{n+1} in the modified sentence will have comparable corpus frequencies. No sentence is included in both original and modified format, and no replacement is repeated across train/dev/test sets. The task of the classifier is to tell whether a sentence has been modified or not. An example modified sentence is: "No one could see this Hayes and I wanted to know if it was real or a spoonful (orig.: ploy)." Note that judging plausibility of a syntactically well-formed sentence of this sort will often require grasping rather subtle semantic factors, ranging from selectional preference to topical coherence.
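The log-scale bigram-frequency matching behind SOMO can be sketched like this; `comparable_replacement`, its tolerance, and the add-one smoothing are illustrative choices, not the exact construction procedure:

```python
import math

def comparable_replacement(words, idx, candidates, bigram_freq, tol=1.0):
    # Find a replacement r for o = words[idx] whose new bigrams
    # (w_{n-1} r) and (r w_{n+1}) stay within `tol` of the original
    # bigrams' log-frequencies. Add-one smoothing handles unseen bigrams.
    prev_w = words[idx - 1] if idx > 0 else None
    next_w = words[idx + 1] if idx + 1 < len(words) else None
    o = words[idx]

    def logf(a, b):
        return math.log(bigram_freq.get((a, b), 0) + 1)

    left, right = logf(prev_w, o), logf(o, next_w)
    for r in candidates:
        if r != o and abs(logf(prev_w, r) - left) <= tol \
                  and abs(logf(r, next_w) - right) <= tol:
            return r
    return None  # no frequency-matched substitute available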
The coordination inversion (CoordInv) benchmark contains sentences made of two coordinate clauses. In half of the sentences, we inverted the order of the clauses. The task is to tell whether a sentence is intact or modified. Sentences are balanced in terms of clause length, and no sentence appears in both original and inverted versions. As an example, original "They might be only memories, but I can still feel each one" becomes: "I can still feel each one, but they might be only memories." Often, addressing CoordInv requires an understanding of broad discourse and pragmatic factors.
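A simplified version of the CoordInv manipulation for sentences coordinated by "but" (the actual benchmark balances clause lengths and presumably covers more coordination patterns):

```python
def invert_coordination(sentence):
    # Swap the two clauses around the coordinator "but":
    # "They might be only memories, but I can still feel each one"
    # -> "I can still feel each one, but they might be only memories."
    left, _, right = sentence.partition(" but ")
    if not right:
        return None  # no "but"-coordination found
    left = left.rstrip(" ,.")
    right = right.rstrip(" .")
    return right[0].upper() + right[1:] + ", but " + left[0].lower() + left[1:] + "."
```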
Row Hum. Eval. of Table 2 reports human-validated "reasonable" upper bounds for all the tasks, estimated in different ways, depending on the tasks. For the surface ones, there is always a straightforward correct answer that a human annotator with enough time and patience could find. The upper bound is thus estimated at 100%. The TreeDepth, TopConst, Tense, SubjNum and ObjNum tasks depend on automated PoS and parsing annotation. In these cases, the upper bound is given by the proportion of sentences correctly annotated by the automated procedure. To estimate this quantity, one linguistically-trained author checked the annotation of 200 randomly sampled test sentences from each task. Finally, the BShift, SOMO and CoordInv manipulations can accidentally generate acceptable sentences. For