1 Introduction to Deep Learning (DL) in Neural Networks (NNs)
Which modifiable components of a learning system are responsible for its success or failure? What changes
to them improve performance? This has been called the fundamental credit assignment problem (Minsky,
1963). There are general credit assignment methods for universal problem solvers that are time-optimal
in various theoretical senses (Sec. 6.8). The present survey, however, will focus on the narrower, but now
commercially important, subfield of Deep Learning (DL) in Artificial Neural Networks (NNs). We are
interested in accurate credit assignment across possibly many, often nonlinear, computational stages of
NNs.
Shallow NN-like models have been around for many decades if not centuries (Sec. 5.1). Models with
several successive nonlinear layers of neurons date back at least to the 1960s (Sec. 5.3) and 1970s (Sec. 5.5).
An efficient gradient descent method for teacher-based Supervised Learning (SL) in discrete, differentiable
networks of arbitrary depth called backpropagation (BP) was developed in the 1960s and 1970s, and ap-
plied to NNs in 1981 (Sec. 5.5). BP-based training of deep NNs with many layers, however, had been found
to be difficult in practice by the late 1980s (Sec. 5.6), and had become an explicit research subject by the
early 1990s (Sec. 5.9). DL became practically feasible to some extent through the help of Unsupervised
Learning (UL) (e.g., Sec. 5.10, 5.15). The 1990s and 2000s also saw many improvements of purely super-
vised DL (Sec. 5). In the new millennium, deep NNs have finally attracted wide-spread attention, mainly
by outperforming alternative machine learning methods such as kernel machines (Vapnik, 1995; Schölkopf
et al., 1998) in numerous important applications. In fact, supervised deep NNs have won numerous of-
ficial international pattern recognition competitions (e.g., Sec. 5.17, 5.19, 5.21, 5.22), achieving the first
superhuman visual pattern recognition results in limited domains (Sec. 5.19). Deep NNs also have become
relevant for the more general field of Reinforcement Learning (RL) where there is no supervising teacher
(Sec. 6).
Both feedforward (acyclic) NNs (FNNs) and recurrent (cyclic) NNs (RNNs) have won contests (Sec.
5.12, 5.14, 5.17, 5.19, 5.21, 5.22). In a sense, RNNs are the deepest of all NNs (Sec. 3)—they are
general computers more powerful than FNNs, and can in principle create and process memories of ar-
bitrary sequences of input patterns (e.g., Siegelmann and Sontag, 1991; Schmidhuber, 1990a). Unlike
traditional methods for automatic sequential program synthesis (e.g., Waldinger and Lee, 1969; Balzer,
1985; Soloway, 1986; Deville and Lau, 1994), RNNs can learn programs that mix sequential and parallel
information processing in a natural and efficient way, exploiting the massive parallelism viewed as crucial
for sustaining the rapid decline of computation cost observed over the past 75 years.
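To illustrate what is meant by processing memories of input sequences, here is a minimal Python sketch (not taken from the cited works); the function name rnn_step, the tanh nonlinearity, and the toy dimensions are illustrative assumptions rather than a specific model discussed in this survey. The hidden state is fed back at every step, so the recurrent connections let the network carry information about earlier input patterns forward in time.

    import numpy as np

    def rnn_step(W_in, W_rec, x_t, h_prev):
        # One recurrent update: the new hidden state depends on the current
        # input pattern x_t and on the previous hidden state h_prev, which
        # acts as a memory of the input sequence seen so far.
        return np.tanh(W_in @ x_t + W_rec @ h_prev)

    # Toy example: 3 input units, 4 hidden units, a sequence of 5 input patterns.
    rng = np.random.default_rng(0)
    W_in = rng.normal(scale=0.5, size=(4, 3))
    W_rec = rng.normal(scale=0.5, size=(4, 4))
    h = np.zeros(4)
    for x in rng.normal(size=(5, 3)):
        h = rnn_step(W_in, W_rec, x, h)  # h summarizes the sequence so far
    print(h)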
The rest of this paper is structured as follows. Sec. 2 introduces a compact, event-oriented notation
that is simple yet general enough to accommodate both FNNs and RNNs. Sec. 3 introduces the concept of
Credit Assignment Paths (CAPs) to measure whether learning in a given NN application is of the deep or
shallow type. Sec. 4 lists recurring themes of DL in SL, UL, and RL. Sec. 5 focuses on SL and UL, and on
how UL can facilitate SL, although pure SL has become dominant in recent competitions (Sec. 5.17-5.22).
Sec. 5 is arranged in a historical timeline format with subsections on important inspirations and technical
contributions. Sec. 6 on deep RL discusses traditional Dynamic Programming (DP)-based RL combined
with gradient-based search techniques for SL or UL in deep NNs, as well as general methods for direct
and indirect search in the weight space of deep FNNs and RNNs, including successful policy gradient and
evolutionary methods.
2 Event-Oriented Notation for Activation Spreading in FNNs / RNNs
Throughout this paper, let i, j, k, t, p, q, r denote positive integer variables assuming ranges implicit in the
given contexts. Let n, m, T denote positive integer constants.
An NN’s topology may change over time (e.g., Sec. 5.3, 5.6.3). At any given moment, it can be
described as a finite subset of units (or nodes or neurons) N = {u_1, u_2, . . .} and a finite set H ⊆ N × N
of directed edges or connections between nodes. FNNs are acyclic graphs, RNNs cyclic ones. The first (input)
layer is the set of input units, a subset of N . In FNNs, the k-th layer (k > 1) is the set of all nodes u ∈ N
such that there is an edge path of length k − 1 (but no longer path) between some input unit and u. There
may be shortcut connections between distant layers.
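To make this layer definition concrete, the following minimal Python sketch (not part of the formal notation above) represents a topology as the pair (N, H) and assigns FNN layer indices by the length of the longest directed path from the input units; the helper name layer_indices and the toy graph are illustrative assumptions.

    from collections import defaultdict

    def layer_indices(nodes, edges, input_units):
        # Input units form layer 1; any other unit u lies in layer k, where
        # k - 1 is the length of the longest directed edge path from some
        # input unit to u (the "path of length k - 1, but no longer path"
        # definition above).
        preds = defaultdict(list)          # predecessor map built from H
        for (u, v) in edges:
            preds[v].append(u)
        layer = {u: 1 for u in input_units}

        def longest(u):
            # assumes every non-input unit is reachable from some input unit
            if u not in layer:
                layer[u] = 1 + max(longest(p) for p in preds[u])
            return layer[u]

        for u in nodes:
            longest(u)
        return layer

    # Toy FNN (N, H) with a shortcut connection from the input unit u1 to u4.
    N = ["u1", "u2", "u3", "u4"]
    H = [("u1", "u2"), ("u2", "u3"), ("u3", "u4"), ("u1", "u4")]
    print(layer_indices(N, H, input_units=["u1"]))
    # -> {'u1': 1, 'u2': 2, 'u3': 3, 'u4': 4}

Note that the shortcut edge from u1 to u4 does not pull u4 into layer 2: the longest path from the input still has length 3, so u4 remains in layer 4, consistent with the definition above.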