A simple neural network module
for relational reasoning
Adam Santoro∗, David Raposo∗, David G.T. Barrett, Mateusz Malinowski,
Razvan Pascanu, Peter Battaglia, Timothy Lillicrap
adamsantoro@, draposo@, barrettdavid@, mateuszm@,
razp@, peterbattaglia@, countzero@google.com
DeepMind
London, United Kingdom
Abstract
Relational reasoning is a central component of generally intelligent behavior, but has proven
difficult for neural networks to learn. In this paper we describe how to use Relation Networks
(RNs) as a simple plug-and-play module to solve problems that fundamentally hinge on relational
reasoning. We tested RN-augmented networks on three tasks: visual question answering
using a challenging dataset called CLEVR, on which we achieve state-of-the-art, super-human
performance; text-based question answering using the bAbI suite of tasks; and complex reasoning
about dynamic physical systems. Then, using a curated dataset called Sort-of-CLEVR, we show
that powerful convolutional networks do not have a general capacity to solve relational questions,
but can gain this capacity when augmented with RNs. Our work shows how a deep learning
architecture equipped with an RN module can implicitly discover and learn to reason about
entities and their relations.
1 Introduction
The ability to reason about the relations between entities and their properties is central to generally
intelligent behavior (Figure 1) [18, 15]. Consider a child proposing a race between the two trees
in the park that are furthest apart: the pairwise distances between every tree in the park must be
inferred and compared to know where to run. Or, consider a reader piecing together evidence to
predict the culprit in a murder-mystery novel: each clue must be considered in its broader context to
build a plausible narrative and solve the mystery.
Symbolic approaches to artificial intelligence are inherently relational [32, 11]. Practitioners define
the relations between symbols using the language of logic and mathematics, and then reason about
these relations using a multitude of powerful methods, including deduction, arithmetic, and algebra.
But symbolic approaches suffer from the symbol grounding problem and are not robust to small
task and input variations [11]. Other approaches, such as those based on statistical learning, build
representations from raw data and often generalize across diverse and noisy conditions [25]. However,
a number of these approaches, such as deep learning, often struggle in data-poor problems where the
underlying structure is characterized by sparse but complex relations [7, 23]. Our results corroborate
these claims, and further demonstrate that seemingly simple relational inferences are remarkably
difficult for powerful neural network architectures such as convolutional neural networks (CNNs) and
multi-layer perceptrons (MLPs).

∗Equal contribution.

arXiv:1706.01427v1 [cs.CL] 5 Jun 2017

Figure 1: An illustrative example from the CLEVR dataset of relational reasoning. An image
containing four objects is shown alongside a non-relational question ("What is the size of the brown
sphere?") and a relational question ("Are there any rubber things that have the same size as the
yellow metallic cylinder?"). The relational question requires explicit reasoning about the relations
between the four objects in the image, whereas the non-relational question requires reasoning about
the attributes of a particular object.
Here, we explore “Relation Networks” (RN) as a general solution to relational reasoning in neural
networks. RNs are architectures whose computations focus explicitly on relational reasoning [35].
Although several other models supporting relation-centric computation have been proposed, such
as Graph Neural Networks, Gated Graph Sequence Neural Networks, and Interaction Networks
[37, 26, 2], RNs are simple, plug-and-play, and are exclusively focused on flexible relational reasoning.
Moreover, through joint training RNs can influence and shape upstream representations in CNNs
and LSTMs to produce implicit object-like representations that they can exploit for relational reasoning.
We applied an RN-augmented architecture to CLEVR [15], a recent visual question answering
(QA) dataset on which state-of-the-art approaches have struggled due to the demand for rich
relational reasoning. Our networks vastly outperformed the best generally-applicable visual QA
architectures, achieving state-of-the-art, super-human performance. RNs also solve CLEVR from
state descriptions, highlighting their versatility with regard to the form of their input. We also applied
an RN-based architecture to the bAbI text-based QA suite [41] and solved 18/20 of the subtasks.
Finally, we trained an RN to make challenging relational inferences about complex physical systems
and motion-capture data. The success of RNs across this set of substantially dissimilar task domains
is testament to the general utility of RNs for solving problems that require relational reasoning.
2 Relation Networks
An RN is a neural network module with a structure primed for relational reasoning. The design
philosophy behind RNs is to constrain the functional form of a neural network so that it captures the
core common properties of relational reasoning. In other words, the capacity to compute relations
is baked into the RN architecture without needing to be learned, just as the capacity to reason
about spatial, translation-invariant properties is built into CNNs, and the capacity to reason about
sequential dependencies is built into recurrent neural networks.
In its simplest form the RN is a composite function:
    RN(O) = f_φ( ∑_{i,j} g_θ(o_i, o_j) ),    (1)
where the input is a set of “objects” O = {o_1, o_2, ..., o_n}, o_i ∈ ℝ^m is the i-th object, and
f_φ and g_θ are functions with parameters φ and θ, respectively. For our purposes, f_φ and g_θ
are MLPs, and the
parameters are learnable synaptic weights, making RNs end-to-end differentiable. We call the output
of g_θ a “relation”; therefore, the role of g_θ is to infer the ways in which two objects are related, or if
they are even related at all.
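As a concrete sketch of Equation 1, the RN's forward pass can be written in a few lines of NumPy. This is a minimal, untrained illustration under assumed layer sizes and a ReLU nonlinearity, not the implementation used in the paper's experiments:

```python
import numpy as np

def mlp(sizes, rng):
    """Random (untrained) weights for a simple ReLU MLP; sizes are assumptions."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    return x

def relation_network(objects, g_params, f_params):
    """RN(O) = f_phi( sum_{i,j} g_theta(o_i, o_j) ), as in Equation 1."""
    n = objects.shape[0]
    # Build every ordered object pair (o_i, o_j) by concatenating their features.
    pairs = np.concatenate(
        [np.repeat(objects, n, axis=0), np.tile(objects, (n, 1))], axis=1)
    relations = forward(g_params, pairs)  # one g_theta pass per pair
    pooled = relations.sum(axis=0)        # order-invariant aggregation
    return forward(f_params, pooled)      # f_phi on the pooled relations

rng = np.random.default_rng(0)
objects = rng.standard_normal((6, 8))     # n=6 objects, each with m=8 features
g_params = mlp([16, 32, 32], rng)         # g_theta: a pair (2m=16) -> a relation
f_params = mlp([32, 32, 10], rng)         # f_phi: pooled relations -> output
out = relation_network(objects, g_params, f_params)
print(out.shape)  # (10,)
```

Note that g_θ's weights are shared across all n² ordered pairs, and the sum over relations is what makes the output invariant to the ordering of the input objects.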
RNs have three notable strengths: they learn to infer relations, they are data efficient, and they
operate on a set of objects – a particularly general and versatile input format – in a manner that is
order invariant.
RNs learn to infer relations
The functional form in Equation 1 dictates that an RN should
consider the potential relations between all object pairs. This implies that an RN is not necessarily
privy to which object relations actually exist, nor to the actual meaning of any particular relation.
Thus, RNs must learn to infer the existence and implications of object relations.
In graph theory parlance, the input can be thought of as a complete and directed graph whose
nodes are objects and whose edges denote the object pairs whose relations should be considered.
Although we focus on this “all-to-all” version of the RN throughout this paper, this RN definition
can be adjusted to consider only some object pairs. Similar to Interaction Networks [2], to which
RNs are related, RNs can take as input a list of only those pairs that should be considered, if this
information is available. This information could be explicit in the input data, or could perhaps be
extracted by some upstream mechanism.
RNs are data efficient
RNs use a single function g_θ to compute each relation. This can be
thought of as a single function operating on a batch of object pairs, where each member of the
batch is a particular object-object pair from the same object set. This mode of operation encourages
greater generalization for computing relations, since g_θ is encouraged not to over-fit to the features
of any particular object pair. Consider how an MLP would learn the same function. An MLP would
receive all objects from the object set simultaneously as its input. It must then learn and embed n²
(where n is the number of objects) identical functions within its weight parameters to account for all
possible object pairings. This quickly becomes intractable as the number of objects grows. Therefore,
the cost of learning a relation function n² times using a single feedforward pass per sample, as in an
MLP, is replaced by the cost of n² feedforward passes per object set (i.e., one for each possible object
pair in the set) and learning a relation function just once, as in an RN.
RNs operate on a set of objects
The summation in Equation 1 ensures that the RN is invariant
to the order of objects in the input. This invariance ensures that the RN’s input respects the property
that sets are order invariant, and it ensures that the output is order invariant. Ultimately, this
invariance ensures that the RN’s output contains information that is generally representative of the
relations that exist in the object set.
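A quick numeric check makes this invariance explicit. In the sketch below, g_θ and f_φ are stand-in fixed random linear maps (illustrative assumptions, not trained models): the summed RN output is unchanged when the objects are reordered, while an MLP acting on a flattened object list is sensitive to order.

```python
import numpy as np

rng = np.random.default_rng(1)
W_g = rng.standard_normal((8, 5))     # stand-in g_theta: fixed linear map + tanh
W_f = rng.standard_normal((5, 3))     # stand-in f_phi: fixed linear readout
W_mlp = rng.standard_normal((16, 3))  # stand-in MLP acting on flattened objects

def rn(objects):
    n = objects.shape[0]
    pairs = np.concatenate(
        [np.repeat(objects, n, axis=0), np.tile(objects, (n, 1))], axis=1)
    return np.tanh(pairs @ W_g).sum(axis=0) @ W_f  # sum makes this order invariant

def flat_mlp(objects):
    return objects.reshape(-1) @ W_mlp  # flattening bakes object order into the input

objects = rng.standard_normal((4, 4))  # four objects, four features each
shuffled = objects[::-1]               # same set, different order
print(np.allclose(rn(objects), rn(shuffled)))            # True
print(np.allclose(flat_mlp(objects), flat_mlp(shuffled)))  # False
```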
3 Tasks
We applied RN-augmented networks to a variety of tasks that hinge on relational reasoning. To
demonstrate the versatility of these networks we chose tasks from a number of different domains,
including visual QA, text-based QA, and dynamic physical systems.
3.1 CLEVR
In visual QA a model must learn to answer questions about an image (Figure 1). This is a challenging
problem domain because it requires high-level scene understanding [1, 29]. Architectures must perform
complex relational reasoning – spatial and otherwise – over the features in the visual inputs, language
inputs, and their conjunction. However, the majority of visual QA datasets require reasoning in the
absence of fully specified word vocabularies, and perhaps more perniciously, a vast and complicated
knowledge of the world that is not available in the training data. They also contain ambiguities and
exhibit strong linguistic biases that allow a model to learn answering strategies that exploit those
biases, without reasoning about the visual input [1, 31, 36].
To control for these issues, and to distill the core challenges of visual QA, the CLEVR visual QA
dataset was developed [15]. CLEVR contains images of 3D-rendered objects, such as spheres and
cylinders (Figure 2). Each image is associated with a number of questions that fall into different
categories. For example, query attribute questions may ask “What is the color of the sphere?”,
while compare attribute questions may ask “Is the cube the same material as the cylinder?”.
For our purposes, an important feature of CLEVR is that many questions are explicitly relational
in nature. Remarkably, powerful QA architectures [46] are unable to solve CLEVR, presumably
because they cannot handle core relational aspects of the task. For example, as reported in the
original paper, a model comprised of ResNet-101 image embeddings with LSTM question processing
and augmented with stacked attention modules vastly outperformed other models, at an overall
performance of 68.5% (compared to 52.3% for the next best, and 92.6% human performance) [15].
However, for compare attribute and count questions (i.e., questions heavily involving relations
across objects), the model performed little better than the simplest baseline, which answered questions
solely based on the probability of answers in the training set for a given question category (Q-type
baseline).
We used two versions of the CLEVR dataset: (i) the pixel version, in which images were
represented in standard 2D pixel form, and (ii) a state description version, in which images were
explicitly represented by state description matrices containing factored object descriptions. Each
row in the matrix contained the features of a single object – 3D coordinates (x, y, z); color (r, g,
b); shape (cube, cylinder, etc.); material (rubber, metal, etc.); size (small, large, etc.). When we
trained our models, we used either the pixel version or the state description version, depending on
the experiment, but not both together.
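A state-description matrix of this kind can be assembled as below. The attribute vocabularies and the one-hot layout are illustrative assumptions; the paper specifies only that each row factors an object into coordinates, color, shape, material, and size.

```python
import numpy as np

# Hypothetical vocabularies for the categorical attributes; the paper gives only
# partial lists ("cube, cylinder, etc."), so these are assumptions.
SHAPES = ["cube", "sphere", "cylinder"]
MATERIALS = ["rubber", "metal"]
SIZES = ["small", "large"]

def object_row(xyz, rgb, shape, material, size):
    """One row of the state-description matrix: the factored features of one object."""
    def one_hot(value, vocab):
        v = np.zeros(len(vocab))
        v[vocab.index(value)] = 1.0
        return v
    return np.concatenate([np.asarray(xyz, float),   # 3D coordinates (x, y, z)
                           np.asarray(rgb, float),   # color (r, g, b)
                           one_hot(shape, SHAPES),
                           one_hot(material, MATERIALS),
                           one_hot(size, SIZES)])

state = np.stack([
    object_row((0.1, 0.4, 0.0), (0.6, 0.3, 0.1), "sphere", "rubber", "small"),
    object_row((0.7, 0.2, 0.0), (0.9, 0.9, 0.1), "cylinder", "metal", "large"),
])
print(state.shape)  # (2, 13): one row per object
```

Rows of such a matrix can be fed to an RN directly as the objects o_i, with no convolutional front-end.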
3.2 Sort-of-CLEVR
To explore our hypothesis that the RN architecture is better suited to general relational reasoning as
compared to more standard neural architectures, we constructed a dataset similar to CLEVR that
we call “Sort-of-CLEVR”¹. This dataset separates relational and non-relational questions.
Sort-of-CLEVR consists of images of 2D colored shapes along with questions and answers about
the images. Each image has a total of 6 objects, where each object is a randomly chosen shape
(square or circle). We used 6 colors (red, blue, green, orange, yellow, gray) to unambiguously identify
each object. Questions are hard-coded as fixed-length binary strings to reduce the difficulty involved
with natural language question-word processing, and thereby remove any confounding difficulty
with language parsing. For each image we generated 10 relational questions and 10 non-relational
questions. Examples of relational questions are: “What is the shape of the object that is farthest from
the gray object? ”; and “How many objects have the same shape as the green object? ”. Examples of
non-relational questions are: “What is the shape of the gray object?”; and “Is the blue object on the
top or bottom of the scene? ”. The dataset is also visually simple, reducing complexities involved in
image processing.
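The text above specifies only that questions are fixed-length binary strings; one plausible encoding, with hypothetical subtype names, might look like:

```python
import numpy as np

COLORS = ["red", "blue", "green", "orange", "yellow", "gray"]  # from the dataset
# The question subtypes below are assumptions for illustration only.
REL_SUBTYPES = ["closest_shape", "farthest_shape", "count_same_shape"]
NONREL_SUBTYPES = ["shape_of", "horizontal_pos", "vertical_pos"]

def encode_question(color, relational, subtype):
    """Fixed-length binary question code: color one-hot + type bit + subtype one-hot."""
    q = np.zeros(6 + 1 + 3)
    q[COLORS.index(color)] = 1.0
    q[6] = 1.0 if relational else 0.0
    vocab = REL_SUBTYPES if relational else NONREL_SUBTYPES
    q[7 + vocab.index(subtype)] = 1.0
    return q

# "What is the shape of the object that is farthest from the gray object?"
q = encode_question("gray", relational=True, subtype="farthest_shape")
print(q.astype(int))  # [0 0 0 0 0 1 1 0 1 0]
```

Because every question is the same length, it can simply be concatenated with each object pair's features before being passed to g_θ.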
3.3 bAbI
bAbI is a pure text-based QA dataset [41]. There are 20 tasks, each corresponding to a particular
type of reasoning, such as deduction, induction, or counting. Each question is associated with a set
of supporting facts. For example, the facts “Sandra picked up the football” and “Sandra went to
the office” support the question “Where is the football? ” (answer: “office”). A model succeeds on
a task if its performance surpasses 95%. Many memory-augmented neural networks have reported
impressive results on bAbI. When training jointly on all tasks using 10K examples per task, Memory
Networks pass 14/20, DNC 18/20, Sparse DNC 19/20, and EntNet 16/20 (the authors of EntNets
report state-of-the-art at 20/20; however, unlike previously reported results this was not done with
joint training on all tasks, where they instead achieve 16/20) [42, 9, 34, 13].
¹The “Sort-of-CLEVR” dataset will be made publicly available online.