Neural Networks 61 (2015) 85–117
Review
Deep learning in neural networks: An overview
Jürgen Schmidhuber
The Swiss AI Lab IDSIA, Istituto Dalle Molle di Studi sull’Intelligenza Artificiale, University of Lugano & SUPSI, Galleria 2, 6928 Manno-Lugano, Switzerland
Article info
Article history:
Received 2 May 2014
Received in revised form 12 September 2014
Accepted 14 September 2014
Available online 13 October 2014
Keywords:
Deep learning
Supervised learning
Unsupervised learning
Reinforcement learning
Evolutionary computation
Abstract
In recent years, deep artificial neural networks (including recurrent ones) have won numerous contests in
pattern recognition and machine learning. This historical survey compactly summarizes relevant work,
much of it from the previous millennium. Shallow and Deep Learners are distinguished by the depth
of their credit assignment paths, which are chains of possibly learnable, causal links between actions
and effects. I review deep supervised learning (also recapitulating the history of backpropagation),
unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short
programs encoding deep and large networks.
© 2014 Published by Elsevier Ltd.
Contents
1. Introduction to Deep Learning (DL) in Neural Networks (NNs)
2. Event-oriented notation for activation spreading in NNs
3. Depth of Credit Assignment Paths (CAPs) and of problems
4. Recurring themes of Deep Learning
4.1. Dynamic programming for Supervised/Reinforcement Learning (SL/RL)
4.2. Unsupervised Learning (UL) facilitating SL and RL
4.3. Learning hierarchical representations through deep SL, UL, RL
4.4. Occam’s razor: compression and Minimum Description Length (MDL)
4.5. Fast Graphics Processing Units (GPUs) for DL in NNs
5. Supervised NNs, some helped by unsupervised NNs
5.1. Early NNs since the 1940s (and the 1800s)
5.2. Around 1960: visual cortex provides inspiration for DL (Sections 5.4, 5.11)
5.3. 1965: deep networks based on the Group Method of Data Handling
5.4. 1979: convolution + weight replication + subsampling (Neocognitron)
5.5. 1960–1981 and beyond: development of backpropagation (BP) for NNs
5.5.1. BP for weight-sharing feedforward NNs (FNNs) and recurrent NNs (RNNs)
5.6. Late 1980s–2000 and beyond: numerous improvements of NNs
5.6.1. Ideas for dealing with long time lags and deep CAPs
5.6.2. Better BP through advanced gradient descent (compare Section 5.24)
5.6.3. Searching for simple, low-complexity, problem-solving NNs (Section 5.24)
5.6.4. Potential benefits of UL for SL (compare Sections 5.7, 5.10, 5.15)
5.7. 1987: UL through Autoencoder (AE) hierarchies (compare Section 5.15)
5.8. 1989: BP for convolutional NNs (CNNs, Section 5.4)
5.9. 1991: Fundamental Deep Learning Problem of gradient descent
5.10. 1991: UL-based history compression through a deep stack of RNNs
5.11. 1992: Max-Pooling (MP): towards MPCNNs (compare Sections 5.16, 5.19)
5.12. 1994: early contest-winning NNs
5.13. 1995: supervised recurrent very Deep Learner (LSTM RNN)
5.14. 2003: more contest-winning/record-setting NNs; successful deep NNs
5.15. 2006/7: UL for deep belief networks/AE stacks fine-tuned by BP
5.16. 2006/7: improved CNNs/GPU-CNNs/BP for MPCNNs/LSTM stacks
5.17. 2009: first official competitions won by RNNs, and with MPCNNs
5.18. 2010: plain backprop (+ distortions) on GPU breaks MNIST record
5.19. 2011: MPCNNs on GPU achieve superhuman vision performance
5.20. 2011: Hessian-free optimization for RNNs
5.21. 2012: first contests won on ImageNet, object detection, segmentation
5.22. 2013–: more contests and benchmark records
5.23. Currently successful techniques: LSTM RNNs and GPU-MPCNNs
5.24. Recent tricks for improving SL deep NNs (compare Sections 5.6.2, 5.6.3)
5.25. Consequences for neuroscience
5.26. DL with spiking neurons?
6. DL in FNNs and RNNs for Reinforcement Learning (RL)
6.1. RL through NN world models yields RNNs with deep CAPs
6.2. Deep FNNs for traditional RL and Markov Decision Processes (MDPs)
6.3. Deep RL RNNs for partially observable MDPs (POMDPs)
6.4. RL facilitated by deep UL in FNNs and RNNs
6.5. Deep hierarchical RL (HRL) and subgoal learning with FNNs and RNNs
6.6. Deep RL by direct NN search/policy gradients/evolution
6.7. Deep RL by indirect policy search/compressed NN search
6.8. Universal RL
7. Conclusion and outlook
Acknowledgments
References

E-mail address: juergen@idsia.ch.
http://dx.doi.org/10.1016/j.neunet.2014.09.003
Preface
This is the preprint of an invited Deep Learning (DL) overview. One of its goals is to assign credit to those who contributed to the present state of the art. I acknowledge the limitations of attempting to achieve this goal. The DL research community itself may be viewed as a continually evolving, deep network of scientists who have influenced each other in complex ways. Starting from recent DL results, I tried to trace back the origins of relevant ideas through the past half century and beyond, sometimes using ‘‘local search’’ to follow citations of citations backwards in time. Since not all DL publications properly acknowledge earlier relevant work, additional global search strategies were employed, aided by consulting numerous neural network experts. As a result, the present preprint mostly consists of references. Nevertheless, through an expert selection bias I may have missed important work. A related bias was surely introduced by my special familiarity with the work of my own DL research group in the past quarter-century. For these reasons, this work should be viewed as merely a snapshot of an ongoing credit assignment process. To help improve it, please do not hesitate to send corrections and suggestions to juergen@idsia.ch.
1. Introduction to Deep Learning (DL) in Neural Networks (NNs)
Which modifiable components of a learning system are responsible for its success or failure? What changes to them improve performance? This has been called the fundamental credit assignment problem (Minsky, 1963). There are general credit assignment methods for universal problem solvers that are time-optimal in various theoretical senses (Section 6.8). The present survey, however, will focus on the narrower, but now commercially important, subfield of Deep Learning (DL) in Artificial Neural Networks (NNs).

A standard neural network (NN) consists of many simple, connected processors called neurons, each producing a sequence of real-valued activations. Input neurons get activated through sensors perceiving the environment, other neurons get activated through weighted connections from previously active neurons (details in Section 2). Some neurons may influence the environment by triggering actions. Learning or credit assignment is about finding weights that make the NN exhibit desired behavior, such as driving a car. Depending on the problem and how the neurons are connected, such behavior may require long causal chains of computational stages (Section 3), where each stage transforms (often in a non-linear way) the aggregate activation of the network. Deep Learning is about accurately assigning credit across many such stages.
Shallow NN-like models with few such stages have been around for many decades if not centuries (Section 5.1). Models with several successive nonlinear layers of neurons date back at least to the 1960s (Section 5.3) and 1970s (Section 5.5). An efficient gradient descent method for teacher-based Supervised Learning (SL) in discrete, differentiable networks of arbitrary depth called backpropagation (BP) was developed in the 1960s and 1970s, and applied to NNs in 1981 (Section 5.5). BP-based training of deep NNs with many layers, however, had been found to be difficult in practice by the late 1980s (Section 5.6), and had become an explicit research subject by the early 1990s (Section 5.9). DL became practically feasible to some extent through the help of Unsupervised Learning (UL), e.g., Section 5.10 (1991), Section 5.15 (2006). The 1990s and 2000s also saw many improvements of purely supervised DL (Section 5). In the new millennium, deep NNs have finally attracted widespread attention, mainly by outperforming alternative machine learning methods such as kernel machines (Schölkopf, Burges, & Smola, 1998; Vapnik, 1995) in numerous important applications. In fact, since 2009, supervised deep NNs have won many official international pattern recognition competitions (e.g., Sections 5.17, 5.19, 5.21 and 5.22), achieving the first superhuman visual pattern recognition results in limited domains (Section 5.19, 2011). Deep NNs also have become relevant for the more general field of Reinforcement Learning (RL) where there is no supervising teacher (Section 6).
Both feedforward (acyclic) NNs (FNNs) and recurrent (cyclic) NNs (RNNs) have won contests (Sections 5.12, 5.14, 5.17, 5.19, 5.21, 5.22). In a sense, RNNs are the deepest of all NNs (Section 3): they are general computers more powerful than FNNs, and can in principle create and process memories of arbitrary sequences of input patterns (e.g., Schmidhuber, 1990a; Siegelmann & Sontag, 1991). Unlike traditional methods for automatic sequential program synthesis (e.g., Balzer, 1985; Deville & Lau, 1994; Soloway, 1986; Waldinger & Lee, 1969), RNNs can learn programs that mix sequential and parallel information processing in a natural and efficient way, exploiting the massive parallelism viewed as crucial for sustaining the rapid decline of computation cost observed over the past 75 years.
Abbreviations in alphabetical order
AE: Autoencoder
AI: Artificial Intelligence
ANN: Artificial Neural Network
BFGS: Broyden–Fletcher–Goldfarb–Shanno
BNN: Biological Neural Network
BM: Boltzmann Machine
BP: Backpropagation
BRNN: Bi-directional Recurrent Neural Network
CAP: Credit Assignment Path
CEC: Constant Error Carousel
CFL: Context Free Language
CMA-ES: Covariance Matrix Adaptation ES
CNN: Convolutional Neural Network
CoSyNE: Cooperative Synapse NeuroEvolution
CSL: Context Sensitive Language
CTC: Connectionist Temporal Classification
DBN: Deep Belief Network
DCT: Discrete Cosine Transform
DL: Deep Learning
DP: Dynamic Programming
DS: Direct Policy Search
EA: Evolutionary Algorithm
EM: Expectation Maximization
ES: Evolution Strategy
FMS: Flat Minimum Search
FNN: Feedforward Neural Network
FSA: Finite State Automaton
GMDH: Group Method of Data Handling
GOFAI: Good Old-Fashioned AI
GP: Genetic Programming
GPU: Graphics Processing Unit
GPU-MPCNN: GPU-Based MPCNN
HMM: Hidden Markov Model
HRL: Hierarchical Reinforcement Learning
HTM: Hierarchical Temporal Memory
HMAX: Hierarchical Model ‘‘and X’’
LSTM: Long Short-Term Memory (RNN)
MDL: Minimum Description Length
MDP: Markov Decision Process
MNIST: Mixed National Institute of Standards and Technology Database
MP: Max-Pooling
MPCNN: Max-Pooling CNN
NE: NeuroEvolution
NEAT: NE of Augmenting Topologies
NES: Natural Evolution Strategies
NFQ: Neural Fitted Q-Learning
NN: Neural Network
OCR: Optical Character Recognition
PCC: Potential Causal Connection
PDCC: Potential Direct Causal Connection
PM: Predictability Minimization
POMDP: Partially Observable MDP
RAAM: Recursive Auto-Associative Memory
RBM: Restricted Boltzmann Machine
ReLU: Rectified Linear Unit
RL: Reinforcement Learning
RNN: Recurrent Neural Network
R-prop: Resilient Backpropagation
SL: Supervised Learning
SLIM NN: Self-Delimiting Neural Network
SOTA: Self-Organizing Tree Algorithm
SVM: Support Vector Machine
TDNN: Time-Delay Neural Network
TIMIT: TI/SRI/MIT Acoustic-Phonetic Continuous Speech Corpus
UL: Unsupervised Learning
WTA: Winner-Take-All
The rest of this paper is structured as follows. Section 2 introduces a compact, event-oriented notation that is simple yet general enough to accommodate both FNNs and RNNs. Section 3 introduces the concept of Credit Assignment Paths (CAPs) to measure whether learning in a given NN application is of the deep or shallow type. Section 4 lists recurring themes of DL in SL, UL, and RL. Section 5 focuses on SL and UL, and on how UL can facilitate SL, although pure SL has become dominant in recent competitions (Sections 5.17–5.23). Section 5 is arranged in a historical timeline format with subsections on important inspirations and technical contributions. Section 6 on deep RL discusses traditional Dynamic Programming (DP)-based RL combined with gradient-based search techniques for SL or UL in deep NNs, as well as general methods for direct and indirect search in the weight space of deep FNNs and RNNs, including successful policy gradient and evolutionary methods.
2. Event-oriented notation for activation spreading in NNs
Throughout this paper, let i, j, k, t, p, q, r denote positive
integer variables assuming ranges implicit in the given contexts.
Let n, m, T denote positive integer constants.
An NN’s topology may change over time (e.g., Sections 5.3, 5.6.3). At any given moment, it can be described as a finite subset of units (or nodes or neurons) N = {u_1, u_2, . . .} and a finite set H ⊆ N × N of directed edges or connections between nodes. FNNs are acyclic graphs, RNNs cyclic. The first (input) layer is the set of input units, a subset of N. In FNNs, the kth layer (k > 1) is the set of all nodes u ∈ N such that there is an edge path of length k − 1 (but no longer path) between some input unit and u. There may be shortcut connections between distant layers. In sequence-processing, fully connected RNNs, all units have connections to all non-input units.
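To make the layer definition concrete, here is a minimal Python sketch (mine, not from the paper): it assigns each node one plus the length of the longest edge path from an input unit, so that the kth layer contains exactly the nodes whose longest such path has length k − 1. The encoding of the graph as plain sets and dictionaries is a hypothetical choice for illustration.

```python
def fnn_layers(N, H, inputs):
    """Sketch (not from the paper): compute FNN layer numbers from node set N,
    edge set H of directed pairs (u, v), and the set of input units.
    layer(u) = 1 for inputs; otherwise 1 + max over predecessors, i.e. one
    plus the length of the longest input-to-u edge path. Assumes the graph
    is acyclic and that every source node is an input unit."""
    succ = {u: [] for u in N}
    indeg = {u: 0 for u in N}
    for (u, v) in H:
        succ[u].append(v)
        indeg[v] += 1
    layer = {u: 1 for u in inputs}   # the first (input) layer
    order = list(inputs)             # Kahn's topological sort from the inputs
    i = 0
    while i < len(order):
        u = order[i]
        i += 1
        for v in succ[u]:
            layer[v] = max(layer.get(v, 1), layer[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                order.append(v)
    return layer

# Tiny usage example: a 3-layer FNN with a shortcut connection.
N = {"in1", "h1", "out"}
H = {("in1", "h1"), ("h1", "out"), ("in1", "out")}   # shortcut in1 -> out
print(fnn_layers(N, H, inputs={"in1"}))  # {'in1': 1, 'h1': 2, 'out': 3}
```

Note that the shortcut connection does not pull "out" into layer 2: per the definition above, layer membership is determined by the longest path, not the shortest.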
The NN’s behavior or program is determined by a set of real-valued, possibly modifiable, parameters or weights w_i (i = 1, . . . , n). We now focus on a single finite episode or epoch of information processing and activation spreading, without learning through weight changes. The following slightly unconventional notation is designed to compactly describe what is happening during the runtime of the system.
During an episode, there is a partially causal sequence x_t (t = 1, . . . , T) of real values that I call events. Each x_t is either an input set by the environment, or the activation of a unit that may directly depend on other x_k (k < t) through a current NN topology-dependent set in_t of indices k representing incoming causal connections or links. Let the function v encode topology information and map such event index pairs (k, t) to weight indices. For example, in the non-input case we may have x_t = f_t(net_t) with real-valued net_t = ∑_{k∈in_t} x_k w_{v(k,t)} (additive case) or net_t = ∏_{k∈in_t} x_k w_{v(k,t)} (multiplicative case), where f_t is a typically non-linear real-valued activation function such as tanh. In many recent competition-winning NNs (Sections 5.19, 5.21, 5.22) there also are events of the type x_t = max_{k∈in_t}(x_k); some network types may also use complex polynomial activation functions (Section 5.3). x_t may directly affect certain x_k (k > t) through outgoing connections or links represented through a current set out_t of indices k with t ∈ in_k. Some of the non-input events are called output events.
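To ground this notation, the following minimal Python sketch (mine, not from the paper) runs one episode of activation spreading over such an event sequence. The dictionaries in_t, v, w, and kind encoding the topology, the weight mapping, and the event types are hypothetical stand-ins for the sets and functions defined above.

```python
import math

def run_episode(T, inputs, in_t, v, w, kind):
    """Sketch of event-oriented activation spreading (Section 2).
    T: number of events; inputs: dict t -> value for input events;
    in_t: dict t -> list of indices k < t with incoming links to event t;
    v: dict (k, t) -> weight index; w: dict weight index -> weight value;
    kind: dict t -> 'additive', 'multiplicative', or 'max'."""
    x = {}
    for t in range(1, T + 1):
        if t in inputs:                       # input event set by the environment
            x[t] = inputs[t]
        elif kind[t] == "additive":           # net_t = sum_{k in in_t} x_k * w_v(k,t)
            net = sum(x[k] * w[v[(k, t)]] for k in in_t[t])
            x[t] = math.tanh(net)             # f_t: typical non-linear activation
        elif kind[t] == "multiplicative":     # net_t = prod_{k in in_t} x_k * w_v(k,t)
            net = 1.0
            for k in in_t[t]:
                net *= x[k] * w[v[(k, t)]]
            x[t] = math.tanh(net)
        else:                                 # max event: x_t = max_{k in in_t} x_k
            x[t] = max(x[k] for k in in_t[t])
    return x

# Weight sharing across time or space simply means that v maps several
# (k, t) pairs to the same weight index, as in RNNs unfolded over an episode.
```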
Note that many of the x_t may refer to different, time-varying activations of the same unit in sequence-processing RNNs (e.g., Williams, 1989, ‘‘unfolding in time’’), or also in FNNs sequentially exposed to time-varying input patterns of a large training set encoded as input events. During an episode, the same weight may get reused over and over again in topology-dependent ways, e.g., in RNNs, or in convolutional NNs (Sections 5.4 and 5.8). I call this weight sharing across space and/or time. Weight sharing may greatly reduce the NN’s descriptive complexity, which is the number of bits of information required to describe the NN (Section 4.4).
In Supervised Learning (SL), certain NN output events x_t may be associated with teacher-given, real-valued labels or targets d_t yielding errors e_t, e.g., e_t = 1/2(x_t − d_t)². A typical goal of supervised NN training is to find weights that yield episodes with small total error E, the sum of all such e_t. The hope is that the NN will generalize well in later episodes, causing only small errors on previously unseen sequences of input events. Many alternative error functions for SL and UL are possible.
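As a small illustration (again mine, not the paper's), the total error of an episode for the squared-error choice above can be accumulated directly from the event sequence; x and targets are hypothetical dictionaries in the style of the sketch above.

```python
def total_error(x, targets):
    """E = sum over labeled output events t of e_t = 1/2 (x_t - d_t)^2.
    x: dict t -> activation x_t; targets: dict t -> teacher-given label d_t."""
    return sum(0.5 * (x[t] - d_t) ** 2 for t, d_t in targets.items())
```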
SL assumes that input events are independent of earlier output
events (which may affect the environment through actions causing
subsequent perceptions). This assumption does not hold in the
broader fields of Sequential Decision Making and Reinforcement
Learning (RL) (Hutter, 2005; Kaelbling, Littman, & Moore, 1996;
Sutton & Barto, 1998; Wiering & van Otterlo, 2012) (Section 6).
In RL, some of the input events may encode real-valued reward
signals given by the environment, and a typical goal is to find
weights that yield episodes with a high sum of reward signals,
through sequences of appropriate output actions.
Section 5.5 will use the notation above to compactly describe
a central algorithm of DL, namely, backpropagation (BP) for
supervised weight-sharing FNNs and RNNs. (FNNs may be viewed
as RNNs with certain fixed zero weights.) Section 6 will address the
more general RL case.
3. Depth of Credit Assignment Paths (CAPs) and of problems
To measure whether credit assignment in a given NN application is of the deep or shallow type, I introduce the concept of Credit Assignment Paths or CAPs, which are chains of possibly causal links between the events of Section 2, e.g., from input through hidden to output layers in FNNs, or through transformations over time in RNNs.
Let us first focus on SL. Consider two events x_p and x_q (1 ≤ p < q ≤ T). Depending on the application, they may have a Potential Direct Causal Connection (PDCC) expressed by the Boolean predicate pdcc(p, q), which is true if and only if p ∈ in_q. Then the 2-element list (p, q) is defined to be a CAP (a minimal one) from p to q. A learning algorithm may be allowed to change w_{v(p,q)} to improve performance in future episodes.

More general, possibly indirect, Potential Causal Connections (PCC) are expressed by the recursively defined Boolean predicate pcc(p, q), which in the SL case is true only if pdcc(p, q), or if pcc(p, k) for some k and pdcc(k, q). In the latter case, appending q to any CAP from p to k yields a CAP from p to q (this is a recursive definition, too). The set of such CAPs may be large but is finite. Note that the same weight may affect many different PDCCs between successive events listed by a given CAP, e.g., in the case of RNNs, or weight-sharing FNNs.
Suppose a CAP has the form (. . . , k, t, . . . , q), where k and t (possibly t = q) are the first successive elements with modifiable w_{v(k,t)}. Then the length of the suffix list (t, . . . , q) is called the CAP’s depth (which is 0 if there are no modifiable links at all). This depth limits how far backwards credit assignment can move down the causal chain to find a modifiable weight.¹

Suppose an episode and its event sequence x_1, . . . , x_T satisfy a computable criterion used to decide whether a given problem has been solved (e.g., total error E below some threshold). Then the set of used weights is called a solution to the problem, and the depth of the deepest CAP within the sequence is called the solution depth. There may be other solutions (yielding different event sequences) with different depths. Given some fixed NN topology, the smallest depth of any solution is called the problem depth.

¹ An alternative would be to count only modifiable links when measuring depth. In many typical NN applications this would not make a difference, but in some it would, e.g., Section 6.1.
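These definitions are directly executable. Below is a brute-force Python sketch (mine, not from the paper) that enumerates the CAPs from p to q in the SL case by following the recursive pcc definition, and computes each CAP's depth; in_t, v, and the set modifiable of weight indices are hypothetical encodings in the style of Section 2.

```python
def all_caps(p, q, in_t):
    """Enumerate CAPs from p to q (SL case). pdcc(k, t) holds iff k is in
    in_t[t]; per the recursive pcc definition, a CAP from p to q is either
    the minimal list (p, q), or a CAP from p to some k with pdcc(k, q),
    extended by appending q."""
    if p in in_t.get(q, ()):
        yield [p, q]                          # minimal CAP
    for k in in_t.get(q, ()):
        if p < k:
            for cap in all_caps(p, k, in_t):
                yield cap + [q]               # append q to a CAP from p to k

def cap_depth(cap, v, modifiable):
    """Depth of a CAP: length of the suffix (t, ..., q) starting at the first
    link (k, t) whose weight index v(k, t) is modifiable; 0 if none is."""
    for i in range(len(cap) - 1):
        if v[(cap[i], cap[i + 1])] in modifiable:
            return len(cap) - (i + 1)
    return 0

# Usage: the solution depth of an episode would then be the maximum
# cap_depth over all CAPs between its events, given the used weights.
```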
Sometimes we also speak of the depth of an architecture: SL FNNs with fixed topology imply a problem-independent maximal problem depth bounded by the number of non-input layers. Certain SL RNNs with fixed weights for all connections except those to output units (Jaeger, 2001, 2004; Maass, Natschläger, & Markram, 2002; Schrauwen, Verstraeten, & Van Campenhout, 2007) have a maximal problem depth of 1, because only the final links in the corresponding CAPs are modifiable. In general, however, RNNs may learn to solve problems of potentially unlimited depth.
Note that the definitions above are solely based on the depths of causal chains, and agnostic to the temporal distance between events. For example, shallow FNNs perceiving large ‘‘time windows’’ of input events may correctly classify long input sequences through appropriate output events, and thus solve shallow problems involving long time lags between relevant events.
At which problem depth does Shallow Learning end, and Deep
Learning begin? Discussions with DL experts have not yet yielded a
conclusive response to this question. Instead of committing myself
to a precise answer, let me just define for the purposes of this
overview: problems of depth >10 require Very Deep Learning.
The difficulty of a problem may have little to do with its depth.
Some NNs can quickly learn to solve certain deep problems, e.g.,
through random weight guessing (Section 5.9) or other types
of direct search (Section 6.6) or indirect search (Section 6.7)
in weight space, or through training an NN first on shallow
problems whose solutions may then generalize to deep problems,
or through collapsing sequences of (non)linear operations into
a single (non)linear operation (but see an analysis of non-trivial
aspects of deep linear networks, Baldi & Hornik, 1995, Section B).
In general, however, finding an NN that precisely models a given
training set is an NP-complete problem (Blum & Rivest, 1992; Judd,
1990), also in the case of deep NNs (de Souto, Souto, & Oliveira,
1999; Síma, 1994; Windisch, 2005); compare a survey of negative
results (Síma, 2002, Section 1).
Above we have focused on SL. In the more general case of RL in unknown environments, pcc(p, q) is also true if x_p is an output event and x_q any later input event: any action may affect the environment and thus any later perception. (In the real world, the environment may even influence non-input events computed on a physical hardware entangled with the entire universe, but this is ignored here.) It is possible to model and replace such unmodifiable environmental PCCs through a part of the NN that has already learned to predict (through some of its units) input events (including reward signals) from former input events and actions (Section 6.1). Its weights are frozen, but can help to assign credit to other, still modifiable weights used to compute actions (Section 6.1). This approach may lead to very deep CAPs though.
Some DL research is about automatically rephrasing problems such that their depth is reduced (Section 4). In particular, sometimes UL is used to make SL problems less deep, e.g., Section 5.10. Often Dynamic Programming (Section 4.1) is used to facilitate certain traditional RL problems, e.g., Section 6.2. Section 5 focuses on CAPs for SL, Section 6 on the more complex case of RL.
4. Recurring themes of Deep Learning
4.1. Dynamic programming for Supervised/Reinforcement Learning (SL/RL)

One recurring theme of DL is Dynamic Programming (DP) (Bellman, 1957), which can help to facilitate credit assignment under certain assumptions. For example, in SL NNs, backpropagation itself can be viewed as a DP-derived method (Section 5.5). In traditional RL based on strong Markovian assumptions, DP-derived methods can help to greatly reduce problem depth (Section 6.2).
DP algorithms are also essential for systems that combine concepts of NNs and graphical models, such as Hidden Markov Models (HMMs) (Baum & Petrie, 1966; Stratonovich, 1960) and Expectation Maximization (EM) (Dempster, Laird, & Rubin, 1977; Friedman, Hastie, & Tibshirani, 2001), e.g., Baldi and Chauvin (1996), Bengio (1991), Bishop (2006), Bottou (1991), Bourlard and Morgan (1994), Dahl, Yu, Deng, and Acero (2012), Hastie, Tibshirani, and Friedman (2009), Hinton, Deng, et al. (2012), Jordan and Sejnowski (2001), Poon and Domingos (2011) and Wu and Shao (2014).
4.2. Unsupervised Learning (UL) facilitating SL and RL
Another recurring theme is how UL can facilitate both SL (Section 5) and RL (Section 6). UL (Section 5.6.4) is normally used to encode raw incoming data such as video or speech streams in a form that is more convenient for subsequent goal-directed learning. In particular, codes that describe the original data in a less redundant or more compact way can be fed into SL (Sections 5.10, 5.15) or RL machines (Section 6.4), whose search spaces may thus become smaller (and whose CAPs shallower) than those necessary for dealing with the raw data. UL is closely connected to the topics of regularization and compression (Sections 4.4, 5.6.3).
4.3. Learning hierarchical representations through deep SL, UL, RL
Many methods of Good Old-Fashioned Artificial Intelligence (GOFAI) (Nilsson, 1980) as well as more recent approaches to AI (Russell, Norvig, Canny, Malik, & Edwards, 1995) and Machine Learning (Mitchell, 1997) learn hierarchies of more and more abstract data representations. For example, certain methods of syntactic pattern recognition (Fu, 1977) such as grammar induction discover hierarchies of formal rules to model observations. The partially (un)supervised Automated Mathematician/EURISKO (Lenat, 1983; Lenat & Brown, 1984) continually learns concepts by combining previously learnt concepts. Such hierarchical representation learning (Bengio, Courville, & Vincent, 2013; Deng & Yu, 2014; Ring, 1994) is also a recurring theme of DL NNs for SL (Section 5), UL-aided SL (Sections 5.7, 5.10, 5.15), and hierarchical RL (Section 6.5). Often, abstract hierarchical representations are natural by-products of data compression (Section 4.4), e.g., Section 5.10.
4.4. Occam’s razor: compression and Minimum Description Length (MDL)

Occam’s razor favors simple solutions over complex ones. Given some programming language, the principle of Minimum Description Length (MDL) can be used to measure the complexity of a solution candidate by the length of the shortest program that computes it (e.g., Blumer, Ehrenfeucht, Haussler, & Warmuth, 1987; Chaitin, 1966; Grünwald, Myung, & Pitt, 2005; Kolmogorov, 1965b; Levin, 1973a; Li & Vitányi, 1997; Rissanen, 1986; Solomonoff, 1964, 1978; Wallace & Boulton, 1968). Some methods explicitly take into account program runtime (Allender, 1992; Schmidhuber, 1997, 2002; Watanabe, 1992); many consider only programs with constant runtime, written in non-universal programming languages (e.g., Hinton & van Camp, 1993; Rissanen, 1986). In the NN case, the MDL principle suggests that low NN weight complexity corresponds to high NN probability in the Bayesian view (e.g., Buntine & Weigend, 1991; De Freitas, 2003; MacKay, 1992; Neal, 1995), and to high generalization performance (e.g., Baum & Haussler, 1989), without overfitting the training data. Many methods have been proposed for regularizing NNs, that is, searching for solution-computing but simple, low-complexity SL NNs (Section 5.6.3) and RL NNs (Section 6.7). This is closely related to certain UL methods (Sections 4.2, 5.6.4).
4.5. Fast Graphics Processing Units (GPUs) for DL in NNs
While the previous millennium saw several attempts at creating fast NN-specific hardware (e.g., Faggin, 1992; Heemskerk, 1995; Jackel et al., 1990; Korkin, de Garis, Gers, & Hemmi, 1997; Ramacher et al., 1993; Urlbe, 1999; Widrow, Rumelhart, & Lehr, 1994), and at exploiting standard hardware (e.g., Anguita & Gomes, 1996; Anguita, Parodi, & Zunino, 1994; Muller, Gunzinger, & Guggenbühl, 1995), the new millennium brought a DL breakthrough in form of cheap, multi-processor graphics cards or GPUs. GPUs are widely used for video games, a huge and competitive market that has driven down hardware prices. GPUs excel at the fast matrix and vector multiplications required not only for convincing virtual realities but also for NN training, where they can speed up learning by a factor of 50 and more. Some of the GPU-based FNN implementations (Sections 5.16–5.19) have greatly contributed to recent successes in contests for pattern recognition (Sections 5.19–5.22), image segmentation (Section 5.21), and object detection (Sections 5.21–5.22).
5. Supervised NNs, some helped by unsupervised NNs
The main focus of current practical applications is on Supervised
Learning (SL), which has dominated recent pattern recognition
contests (Sections 5.17–5.23). Several methods, however, use
additional Unsupervised Learning (UL) to facilitate SL (Sections 5.7,
5.10, 5.15). It does make sense to treat SL and UL in the same
section: often gradient-based methods, such as BP (Section 5.5.1),
are used to optimize objective functions of both UL and SL, and the
boundary between SL and UL may blur, for example, when it comes
to time series prediction and sequence classification, e.g., Sections
5.10, 5.12.
A historical timeline format will help to arrange subsections on
important inspirations and technical contributions (although such
a subsection may span a time interval of many years). Section 5.1
briefly mentions early, shallow NN models since the 1940s (and
1800s), Section 5.2 additional early neurobiological inspiration
relevant for modern Deep Learning (DL). Section 5.3 is about GMDH
networks (since 1965), to my knowledge the first (feedforward)
DL systems. Section 5.4 is about the relatively deep Neocognitron
NN (1979) which is very similar to certain modern deep FNN
architectures, as it combines convolutional NNs (CNNs), weight
pattern replication, and subsampling mechanisms. Section 5.5
uses the notation of Section 2 to compactly describe a central
algorithm of DL, namely, backpropagation (BP) for supervised
weight-sharing FNNs and RNNs. It also summarizes the history
of BP 1960–1981 and beyond. Section 5.6 describes problems
encountered in the late 1980s with BP for deep NNs, and mentions
several ideas from the previous millennium to overcome them.
Section 5.7 discusses a first hierarchical stack (1987) of coupled
UL-based Autoencoders (AEs)—this concept resurfaced in the new
millennium (Section 5.15). Section 5.8 is about applying BP to CNNs
(1989), which is important for today’s DL applications. Section 5.9
explains BP’s Fundamental DL Problem (of vanishing/exploding
gradients) discovered in 1991. Section 5.10 explains how a deep
RNN stack of 1991 (the History Compressor) pre-trained by UL
helped to solve previously unlearnable DL benchmarks requiring
Credit Assignment Paths (CAPs, Section 3) of depth 1000 and
more. Section 5.11 discusses a particular winner-take-all (WTA)
method called Max-Pooling (MP, 1992) widely used in today’s deep
FNNs. Section 5.12 mentions a first important contest won by
SL NNs in 1994. Section 5.13 describes a purely supervised DL
RNN (Long Short-Term Memory, LSTM, 1995) for problems of depth
1000 and more. Section 5.14 mentions an early contest of 2003
won by an ensemble of shallow FNNs, as well as good pattern
recognition results with CNNs and deep FNNs and LSTM RNNs