Learning to Compose Neural Networks for Question Answering
Jacob Andreas and Marcus Rohrbach and Trevor Darrell and Dan Klein
Department of Electrical Engineering and Computer Sciences
University of California, Berkeley
{jda,rohrbach,trevor,klein}@eecs.berkeley.edu
Abstract
We describe a question answering model that
applies to both images and structured knowl-
edge bases. The model uses natural lan-
guage strings to automatically assemble neu-
ral networks from a collection of composable
modules. Parameters for these modules are
learned jointly with network-assembly param-
eters via reinforcement learning, with only
(world, question, answer) triples as supervi-
sion. Our approach, which we term a dynamic
neural module network, achieves state-of-the-
art results on benchmark datasets in both vi-
sual and structured domains.
1 Introduction
This paper presents a compositional, attentional
model for answering questions about a variety of
world representations, including images and struc-
tured knowledge bases. The model translates from
questions to dynamically assembled neural net-
works, then applies these networks to world rep-
resentations (images or knowledge bases) to pro-
duce answers. We take advantage of two largely
independent lines of work: on one hand, an exten-
sive literature on answering questions by mapping
from strings to logical representations of meaning;
on the other, a series of recent successes in deep
neural models for image recognition and captioning.
By constructing neural networks instead of logical
forms, our model leverages the best aspects of both
linguistic compositionality and continuous represen-
tations.
Our model has two components, trained jointly:
first, a collection of neural “modules” that can be
freely composed (Figure 1a); second, a network lay-
out predictor that assembles modules into complete
deep networks tailored to each question (Figure 1b).
[Figure 1 graphic: the question "What cities are in Georgia?" is answered "Atlanta". Panels show (a) the module inventory (find, lookup, and, relate; Section 4.1), (b) the predicted network layout find[city], lookup[Georgia], relate[in], and (Section 4.2), (c) the assembled network, and (d) the knowledge source containing Georgia, Atlanta, and Montgomery.]
Figure 1: A learned syntactic analysis (a) is used to assemble a
collection of neural modules (b) into a deep neural network (c),
and applied to a world representation (d) to produce an answer.
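To make the composition in Figure 1 concrete, the following is a minimal illustrative sketch (not the authors' implementation, which uses attention over continuous representations): modules are modeled as composable functions over a toy knowledge source, and the layout and[find[city], relate[in](lookup[Georgia])] is applied directly. All data and function names here are hypothetical stand-ins.

```python
# Toy knowledge source (hypothetical data mirroring Figure 1d).
ENTITIES = {"Georgia", "Atlanta", "Montgomery", "Alabama"}
TYPES = {"Atlanta": "city", "Montgomery": "city",
         "Georgia": "state", "Alabama": "state"}
IN = {("Atlanta", "Georgia"), ("Montgomery", "Alabama")}

def lookup(name):
    """Attend to the single entity with the given name."""
    return {name} if name in ENTITIES else set()

def find(category):
    """Attend to all entities of the given type."""
    return {e for e, t in TYPES.items() if t == category}

def relate(relation):
    """Shift attention along a relation (here only 'in' is modeled)."""
    def apply(attention):
        return {a for (a, b) in IN if b in attention}
    return apply

def and_(*attentions):
    """Intersect several attentions."""
    result = attentions[0]
    for a in attentions[1:]:
        result = result & a
    return result

# Layout for "What cities are in Georgia?":
answer = and_(find("city"), relate("in")(lookup("Georgia")))
print(answer)  # {'Atlanta'}
```

In the model itself, the analogues of these set-valued attentions are soft attention distributions over image regions or knowledge-base entities, and the layout is predicted per question rather than written by hand.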
Previous work has used manually-specified modular
structures for visual learning (Andreas et al., 2016).
Here we:
• learn a network structure predictor jointly with
module parameters themselves
• extend visual primitives from previous work to
reason over structured world representations
Training data consists of (world, question, answer)
triples: our approach requires no supervision of net-
work layouts. We achieve state-of-the-art perfor-
mance on two markedly different question answer-
ing tasks: one with questions about natural im-
ages, and another with more compositional ques-
tions about United States geography.¹
2 Deep networks as functional programs
We begin with a high-level discussion of the kinds
of composed networks we would like to learn.
¹We have released our code at http://github.com/jacobandreas/nmn2
arXiv:1601.01705v4 [cs.CL] 7 Jun 2016