Do explanations make VQA models more predictable to a human?
Arjun Chandrasekaran∗,1, Viraj Prabhu∗,1, Deshraj Yadav∗,1, Prithvijit Chattopadhyay∗,1, Devi Parikh1,2
1Georgia Institute of Technology  2Facebook AI Research
{carjun, virajp, deshraj, prithvijit3, parikh}@gatech.edu
∗Denotes equal contribution.
Abstract
A rich line of research attempts to make deep neural networks more transparent by generating human-interpretable ‘explanations’ of their decision process, especially for interactive tasks like Visual Question Answering (VQA). In this work, we analyze if existing explanations indeed make a VQA model – its responses as well as failures – more predictable to a human. Surprisingly, we find that they do not. On the other hand, we find that human-in-the-loop approaches that treat the model as a black-box do.
1 Introduction
As technology progresses, we are increasingly collaborating with AI agents in interactive scenarios where humans and AI work together as a team, e.g., in AI-assisted diagnosis, autonomous driving, etc. Thus far, AI research has typically focused only on the AI in such an interaction – for it to be more accurate, be more human-like, understand our intentions, beliefs, contexts, and mental states.
In this work, we argue that for human-AI interactions to be more effective, humans must also understand the AI’s beliefs, knowledge, and quirks.
Many recent works generate human-interpretable ‘explanations’ regarding a model’s decisions. These are usually evaluated offline based on whether human judges found them to be ‘good’ or to improve trust in the model. However, their contribution in an interactive setting remains unclear. In this work, we evaluate the role of explanations towards making a model predictable to a human.
Figure 1: We evaluate the extent to which explanation modalities (right) and familiarization with a VQA model help humans predict its behavior – its responses, successes, and failures (left).

We consider an AI trained to perform the multi-modal task of Visual Question Answering (VQA) (Malinowski and Fritz, 2014; Antol et al., 2015), i.e., answering free-form natural language questions about images. VQA is applicable to scenarios where humans actively elicit information from visual data, and naturally lends itself to human-AI interactions. We consider two tasks that demonstrate the degree to which a human understands their AI teammate (whom we call Vicki) – Failure Prediction (FP) and Knowledge Prediction (KP). In FP, we ask subjects on Amazon Mechanical Turk to predict whether Vicki will correctly answer a given question about an image. In KP, subjects predict Vicki’s exact response.
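To make the two tasks concrete, the following is a minimal sketch of how human predictions in FP and KP could be scored; the data layout, field names, and the exact-match criterion for KP are illustrative assumptions, not the paper's released evaluation code.

# Illustrative sketch (not the authors' code): scoring human predictions for
# Failure Prediction (FP) and Knowledge Prediction (KP). Assumed data layout:
# each trial holds the VQA model's ("Vicki's") answer, the ground-truth answer,
# and the human subject's predictions.
from dataclasses import dataclass
from typing import List

@dataclass
class Trial:
    vicki_answer: str    # answer produced by the VQA model
    gt_answer: str       # ground-truth answer for the question
    fp_prediction: bool  # human's guess: "will Vicki answer correctly?"
    kp_prediction: str   # human's guess of Vicki's exact response

def fp_accuracy(trials: List[Trial]) -> float:
    # Fraction of trials where the human correctly predicted success/failure.
    hits = sum(t.fp_prediction == (t.vicki_answer == t.gt_answer) for t in trials)
    return hits / len(trials)

def kp_accuracy(trials: List[Trial]) -> float:
    # Fraction of trials where the human predicted Vicki's exact response
    # (simple normalized exact match, an assumption for illustration).
    hits = sum(t.kp_prediction.strip().lower() == t.vicki_answer.strip().lower()
               for t in trials)
    return hits / len(trials)

if __name__ == "__main__":
    trials = [
        Trial("red", "red", fp_prediction=True, kp_prediction="red"),
        Trial("2", "3", fp_prediction=False, kp_prediction="2"),
    ]
    print(f"FP accuracy: {fp_accuracy(trials):.2f}")  # 1.00 on this toy data
    print(f"KP accuracy: {kp_accuracy(trials):.2f}")  # 1.00 on this toy data

A chance baseline for FP is the accuracy of always guessing the model's more frequent outcome (success or failure), against which the better-than-chance finding below would be measured.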
We aid humans in forming a mental model of Vicki by (1) familiarizing them with its behavior in a ‘training’ phase and (2) exposing them to its internal states via various explanation modalities. We then measure their FP and KP performance.
Our key findings are that (1) humans are indeed capable of predicting successes, failures, and outputs of the VQA model better than chance, (2) explicitly training humans to familiarize themselves with the model improves their performance, and (3) existing explanation modalities do not enhance human performance.
2 Related Work
Explanations in deep neural networks. Several works generate explanations based on inter-