Neural Networks 61 (2015) 85–117
Review
Deep learning in neural networks: An overview
Jürgen Schmidhuber
The Swiss AI Lab IDSIA, Istituto Dalle Molle di Studi sull’Intelligenza Artificiale, University of Lugano & SUPSI, Galleria 2, 6928 Manno-Lugano, Switzerland
Article info
Article history:
Received 2 May 2014
Received in revised form 12 September 2014
Accepted 14 September 2014
Available online 13 October 2014
Keywords:
Deep learning
Supervised learning
Unsupervised learning
Reinforcement learning
Evolutionary computation
Abstract
In recent years, deep artificial neural networks (including recurrent ones) have won numerous contests in
pattern recognition and machine learning. This historical survey compactly summarizes relevant work,
much of it from the previous millennium. Shallow and Deep Learners are distinguished by the depth
of their credit assignment paths, which are chains of possibly learnable, causal links between actions
and effects. I review deep supervised learning (also recapitulating the history of backpropagation),
unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short
programs encoding deep and large networks.
© 2014 Published by Elsevier Ltd.
Contents
1. Introduction to Deep Learning (DL) in Neural Networks (NNs)
2. Event-oriented notation for activation spreading in NNs
3. Depth of Credit Assignment Paths (CAPs) and of problems
4. Recurring themes of Deep Learning
4.1. Dynamic programming for Supervised/Reinforcement Learning (SL/RL)
4.2. Unsupervised Learning (UL) facilitating SL and RL
4.3. Learning hierarchical representations through deep SL, UL, RL
4.4. Occam’s razor: compression and Minimum Description Length (MDL)
4.5. Fast Graphics Processing Units (GPUs) for DL in NNs
5. Supervised NNs, some helped by unsupervised NNs
5.1. Early NNs since the 1940s (and the 1800s)
5.2. Around 1960: visual cortex provides inspiration for DL (Sections 5.4, 5.11)
5.3. 1965: deep networks based on the Group Method of Data Handling
5.4. 1979: convolution + weight replication + subsampling (Neocognitron)
5.5. 1960–1981 and beyond: development of backpropagation (BP) for NNs
5.5.1. BP for weight-sharing feedforward NNs (FNNs) and recurrent NNs (RNNs)
5.6. Late 1980s–2000 and beyond: numerous improvements of NNs
5.6.1. Ideas for dealing with long time lags and deep CAPs
5.6.2. Better BP through advanced gradient descent (compare Section 5.24)
5.6.3. Searching for simple, low-complexity, problem-solving NNs (Section 5.24)
5.6.4. Potential benefits of UL for SL (compare Sections 5.7, 5.10, 5.15)
5.7. 1987: UL through Autoencoder (AE) hierarchies (compare Section 5.15)
5.8. 1989: BP for convolutional NNs (CNNs, Section 5.4)
5.9. 1991: Fundamental Deep Learning Problem of gradient descent
5.10. 1991: UL-based history compression through a deep stack of RNNs
5.11. 1992: Max-Pooling (MP): towards MPCNNs (compare Sections 5.16, 5.19)
5.12. 1994: early contest-winning NNs
5.13. 1995: supervised recurrent very Deep Learner (LSTM RNN)
5.14. 2003: more contest-winning/record-setting NNs; successful deep NNs
5.15. 2006/7: UL for deep belief networks/AE stacks fine-tuned by BP
5.16. 2006/7: improved CNNs/GPU-CNNs/BP for MPCNNs/LSTM stacks
5.17. 2009: first official competitions won by RNNs, and with MPCNNs
5.18. 2010: plain backprop (+ distortions) on GPU breaks MNIST record
5.19. 2011: MPCNNs on GPU achieve superhuman vision performance
5.20. 2011: Hessian-free optimization for RNNs
5.21. 2012: first contests won on ImageNet, object detection, segmentation
5.22. 2013–: more contests and benchmark records
5.23. Currently successful techniques: LSTM RNNs and GPU-MPCNNs
5.24. Recent tricks for improving SL deep NNs (compare Sections 5.6.2, 5.6.3)
5.25. Consequences for neuroscience
5.26. DL with spiking neurons?
6. DL in FNNs and RNNs for Reinforcement Learning (RL)
6.1. RL through NN world models yields RNNs with deep CAPs
6.2. Deep FNNs for traditional RL and Markov Decision Processes (MDPs)
6.3. Deep RL RNNs for partially observable MDPs (POMDPs)
6.4. RL facilitated by deep UL in FNNs and RNNs
6.5. Deep hierarchical RL (HRL) and subgoal learning with FNNs and RNNs
6.6. Deep RL by direct NN search/policy gradients/evolution
6.7. Deep RL by indirect policy search/compressed NN search
6.8. Universal RL
7. Conclusion and outlook
Acknowledgments
References

E-mail address: juergen@idsia.ch.
http://dx.doi.org/10.1016/j.neunet.2014.09.003
Preface
This is the preprint of an invited Deep Learning (DL) overview. One of its goals is to assign credit to those who contributed to the present state of the art. I acknowledge the limitations of attempting to achieve this goal. The DL research community itself may be viewed as a continually evolving, deep network of scientists who have influenced each other in complex ways. Starting from recent DL results, I tried to trace back the origins of relevant ideas through the past half century and beyond, sometimes using ‘‘local search’’ to follow citations of citations backwards in time. Since not all DL publications properly acknowledge earlier relevant work, additional global search strategies were employed, aided by consulting numerous neural network experts. As a result, the present preprint mostly consists of references. Nevertheless, through an expert selection bias I may have missed important work. A related bias was surely introduced by my special familiarity with the work of my own DL research group in the past quarter-century. For these reasons, this work should be viewed as merely a snapshot of an ongoing credit assignment process. To help improve it, please do not hesitate to send corrections and suggestions to juergen@idsia.ch.
1. Introduction to Deep Learning (DL) in Neural Networks (NNs)
Which modifiable components of a learning system are responsible for its success or failure? What changes to them improve performance? This has been called the fundamental credit assignment problem (Minsky, 1963). There are general credit assignment methods for universal problem solvers that are time-optimal in various theoretical senses (Section 6.8). The present survey, however, will focus on the narrower, but now commercially important, subfield of Deep Learning (DL) in Artificial Neural Networks (NNs).

A standard neural network (NN) consists of many simple, connected processors called neurons, each producing a sequence of real-valued activations. Input neurons get activated through sensors perceiving the environment, other neurons get activated through weighted connections from previously active neurons (details in Section 2). Some neurons may influence the environment by triggering actions. Learning or credit assignment is about finding weights that make the NN exhibit desired behavior, such as driving a car. Depending on the problem and how the neurons are connected, such behavior may require long causal chains of computational stages (Section 3), where each stage transforms (often in a non-linear way) the aggregate activation of the network. Deep Learning is about accurately assigning credit across many such stages.
Shallow NN-like models with few such stages have been around for many decades if not centuries (Section 5.1). Models with several successive nonlinear layers of neurons date back at least to the 1960s (Section 5.3) and 1970s (Section 5.5). An efficient gradient descent method for teacher-based Supervised Learning (SL) in discrete, differentiable networks of arbitrary depth called backpropagation (BP) was developed in the 1960s and 1970s, and applied to NNs in 1981 (Section 5.5). BP-based training of deep NNs with many layers, however, had been found to be difficult in practice by the late 1980s (Section 5.6), and had become an explicit research subject by the early 1990s (Section 5.9). DL became practically feasible to some extent through the help of Unsupervised Learning (UL), e.g., Section 5.10 (1991), Section 5.15 (2006). The 1990s and 2000s also saw many improvements of purely supervised DL (Section 5). In the new millennium, deep NNs have finally attracted widespread attention, mainly by outperforming alternative machine learning methods such as kernel machines (Schölkopf, Burges, & Smola, 1998; Vapnik, 1995) in numerous important applications. In fact, since 2009, supervised deep NNs have won many official international pattern recognition competitions (e.g., Sections 5.17, 5.19, 5.21 and 5.22), achieving the first superhuman visual pattern recognition results in limited domains (Section 5.19, 2011). Deep NNs also have become relevant for the more general field of Reinforcement Learning (RL) where there is no supervising teacher (Section 6).
Both feedforward (acyclic) NNs (FNNs) and recurrent (cyclic) NNs (RNNs) have won contests (Sections 5.12, 5.14, 5.17, 5.19, 5.21, 5.22). In a sense, RNNs are the deepest of all NNs (Section 3): they are general computers more powerful than FNNs, and can in principle create and process memories of arbitrary sequences of input patterns (e.g., Schmidhuber, 1990a; Siegelmann & Sontag, 1991). Unlike traditional methods for automatic sequential program synthesis (e.g., Balzer, 1985; Deville & Lau, 1994; Soloway, 1986; Waldinger & Lee, 1969), RNNs can learn programs that mix sequential and parallel information processing in a natural and efficient way, exploiting the massive parallelism viewed as crucial for sustaining the rapid decline of computation cost observed over the past 75 years.
Abbreviations in alphabetical order
AE: Autoencoder
AI: Artificial Intelligence
ANN: Artificial Neural Network
BFGS: Broyden–Fletcher–Goldfarb–Shanno
BNN: Biological Neural Network
BM: Boltzmann Machine
BP: Backpropagation
BRNN: Bi-directional Recurrent Neural Network
CAP: Credit Assignment Path
CEC: Constant Error Carousel
CFL: Context Free Language
CMA-ES: Covariance Matrix Adaptation ES
CNN: Convolutional Neural Network
CoSyNE: Cooperative Synapse NeuroEvolution
CSL: Context Sensitive Language
CTC: Connectionist Temporal Classification
DBN: Deep Belief Network
DCT: Discrete Cosine Transform
DL: Deep Learning
DP: Dynamic Programming
DS: Direct Policy Search
EA: Evolutionary Algorithm
EM: Expectation Maximization
ES: Evolution Strategy
FMS: Flat Minimum Search
FNN: Feedforward Neural Network
FSA: Finite State Automaton
GMDH: Group Method of Data Handling
GOFAI: Good Old-Fashioned AI
GP: Genetic Programming
GPU: Graphics Processing Unit
GPU-MPCNN: GPU-Based MPCNN
HMM: Hidden Markov Model
HRL: Hierarchical Reinforcement Learning
HTM: Hierarchical Temporal Memory
HMAX: Hierarchical Model ‘‘and X’’
LSTM: Long Short-Term Memory (RNN)
MDL: Minimum Description Length
MDP: Markov Decision Process
MNIST: Mixed National Institute of Standards and Technology Database
MP: Max-Pooling
MPCNN: Max-Pooling CNN
NE: NeuroEvolution
NEAT: NE of Augmenting Topologies
NES: Natural Evolution Strategies
NFQ: Neural Fitted Q-Learning
NN: Neural Network
OCR: Optical Character Recognition
PCC: Potential Causal Connection
PDCC: Potential Direct Causal Connection
PM: Predictability Minimization
POMDP: Partially Observable MDP
RAAM: Recursive Auto-Associative Memory
RBM: Restricted Boltzmann Machine
ReLU: Rectified Linear Unit
RL: Reinforcement Learning
RNN: Recurrent Neural Network
R-prop: Resilient Backpropagation
SL: Supervised Learning
SLIM NN: Self-Delimiting Neural Network
SOTA: Self-Organizing Tree Algorithm
SVM: Support Vector Machine
TDNN: Time-Delay Neural Network
TIMIT: TI/SRI/MIT Acoustic-Phonetic Continuous Speech Corpus
UL: Unsupervised Learning
WTA: Winner-Take-All
The rest of this paper is structured as follows. Section 2 introduces a compact, event-oriented notation that is simple yet general enough to accommodate both FNNs and RNNs. Section 3 introduces the concept of Credit Assignment Paths (CAPs) to measure whether learning in a given NN application is of the deep or shallow type. Section 4 lists recurring themes of DL in SL, UL, and RL. Section 5 focuses on SL and UL, and on how UL can facilitate SL, although pure SL has become dominant in recent competitions (Sections 5.17–5.23). Section 5 is arranged in a historical timeline format with subsections on important inspirations and technical contributions. Section 6 on deep RL discusses traditional Dynamic Programming (DP)-based RL combined with gradient-based search techniques for SL or UL in deep NNs, as well as general methods for direct and indirect search in the weight space of deep FNNs and RNNs, including successful policy gradient and evolutionary methods.
2. Event-oriented notation for activation spreading in NNs
Throughout this paper, let i, j, k, t, p, q, r denote positive
integer variables assuming ranges implicit in the given contexts.
Let n, m, T denote positive integer constants.
An NN’s topology may change over time (e.g., Sections 5.3, 5.6.3). At any given moment, it can be described as a finite subset of units (or nodes or neurons) N = {u_1, u_2, . . .} and a finite set H ⊆ N × N of directed edges or connections between nodes. FNNs are acyclic graphs, RNNs cyclic. The first (input) layer is the set of input units, a subset of N. In FNNs, the kth layer (k > 1) is the set of all nodes u ∈ N such that there is an edge path of length k − 1 (but no longer path) between some input unit and u. There may be shortcut connections between distant layers. In sequence-processing, fully connected RNNs, all units have connections to all non-input units.
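To make the layer definition concrete, here is a minimal Python sketch (mine, not from the paper): it assigns each node one plus the length of the longest edge path from an input unit, so that the kth layer contains exactly the nodes whose longest such path has length k − 1. The encoding of the graph as plain sets and dictionaries is a hypothetical choice for illustration.

```python
def fnn_layers(N, H, inputs):
    """Sketch (not from the paper): compute FNN layer numbers from node set N,
    edge set H of directed pairs (u, v), and the set of input units.
    layer(u) = 1 for inputs; otherwise 1 + max over predecessors, i.e. one
    plus the length of the longest input-to-u edge path. Assumes the graph
    is acyclic and that every source node is an input unit."""
    succ = {u: [] for u in N}
    indeg = {u: 0 for u in N}
    for (u, v) in H:
        succ[u].append(v)
        indeg[v] += 1
    layer = {u: 1 for u in inputs}   # the first (input) layer
    order = list(inputs)             # Kahn's topological sort from the inputs
    i = 0
    while i < len(order):
        u = order[i]
        i += 1
        for v in succ[u]:
            layer[v] = max(layer.get(v, 1), layer[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                order.append(v)
    return layer

# Tiny usage example: a 3-layer FNN with a shortcut connection.
N = {"in1", "h1", "out"}
H = {("in1", "h1"), ("h1", "out"), ("in1", "out")}   # shortcut in1 -> out
print(fnn_layers(N, H, inputs={"in1"}))  # {'in1': 1, 'h1': 2, 'out': 3}
```

Note that the shortcut connection does not pull "out" into layer 2: per the definition above, layer membership is determined by the longest path, not the shortest.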
The NN’s behavior or program is determined by a set of real-valued, possibly modifiable, parameters or weights w_i (i = 1, . . . , n). We now focus on a single finite episode or epoch of information processing and activation spreading, without learning through weight changes. The following slightly unconventional notation is designed to compactly describe what is happening during the runtime of the system.
During an episode, there is a partially causal sequence x_t (t = 1, . . . , T) of real values that I call events. Each x_t is either an input set by the environment, or the activation of a unit that may directly depend on other x_k (k < t) through a current NN topology-dependent set in_t of indices k representing incoming causal connections or links. Let the function v encode topology information and map such event index pairs (k, t) to weight indices. For example, in the non-input case we may have x_t = f_t(net_t) with real-valued net_t = ∑_{k∈in_t} x_k w_{v(k,t)} (additive case) or net_t = ∏_{k∈in_t} x_k w_{v(k,t)} (multiplicative case), where f_t is a typically non-linear real-valued activation function such as tanh. In many recent competition-winning NNs (Sections 5.19, 5.21, 5.22) there also are events of the type x_t = max_{k∈in_t}(x_k); some network types may also use complex polynomial activation functions (Section 5.3). x_t may directly affect certain x_k (k > t) through outgoing connections or links represented through a current set out_t of indices k with t ∈ in_k. Some of the non-input events are called output events.
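To ground this notation, the following minimal Python sketch (mine, not from the paper) runs one episode of activation spreading over such an event sequence. The dictionaries in_t, v, w, and kind encoding the topology, the weight mapping, and the event types are hypothetical stand-ins for the sets and functions defined above.

```python
import math

def run_episode(T, inputs, in_t, v, w, kind):
    """Sketch of event-oriented activation spreading (Section 2).
    T: number of events; inputs: dict t -> value for input events;
    in_t: dict t -> list of indices k < t with incoming links to event t;
    v: dict (k, t) -> weight index; w: dict weight index -> weight value;
    kind: dict t -> 'additive', 'multiplicative', or 'max'."""
    x = {}
    for t in range(1, T + 1):
        if t in inputs:                       # input event set by the environment
            x[t] = inputs[t]
        elif kind[t] == "additive":           # net_t = sum_{k in in_t} x_k * w_v(k,t)
            net = sum(x[k] * w[v[(k, t)]] for k in in_t[t])
            x[t] = math.tanh(net)             # f_t: typical non-linear activation
        elif kind[t] == "multiplicative":     # net_t = prod_{k in in_t} x_k * w_v(k,t)
            net = 1.0
            for k in in_t[t]:
                net *= x[k] * w[v[(k, t)]]
            x[t] = math.tanh(net)
        else:                                 # max event: x_t = max_{k in in_t} x_k
            x[t] = max(x[k] for k in in_t[t])
    return x

# Weight sharing across time or space simply means that v maps several
# (k, t) pairs to the same weight index, as in RNNs unfolded over an episode.
```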
Note that many of the x_t may refer to different, time-varying activations of the same unit in sequence-processing RNNs (e.g., Williams, 1989, ‘‘unfolding in time’’), or also in FNNs sequentially exposed to time-varying input patterns of a large training set encoded as input events. During an episode, the same weight may get reused over and over again in topology-dependent ways, e.g., in RNNs, or in convolutional NNs (Sections 5.4 and 5.8). I call this weight sharing across space and/or time. Weight sharing may greatly reduce the NN’s descriptive complexity, which is the number of bits of information required to describe the NN (Section 4.4).
In Supervised Learning (SL), certain NN output events x_t may be associated with teacher-given, real-valued labels or targets d_t yielding errors e_t, e.g., e_t = 1/2(x_t − d_t)². A typical goal of supervised NN training is to find weights that yield episodes with small total error E, the sum of all such e_t. The hope is that the NN will generalize well in later episodes, causing only small errors on previously unseen sequences of input events. Many alternative error functions for SL and UL are possible.
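As a small illustration (again mine, not the paper's), the total error of an episode for the squared-error choice above can be accumulated directly from the event sequence; x and targets are hypothetical dictionaries in the style of the sketch above.

```python
def total_error(x, targets):
    """E = sum over labeled output events t of e_t = 1/2 (x_t - d_t)^2.
    x: dict t -> activation x_t; targets: dict t -> teacher-given label d_t."""
    return sum(0.5 * (x[t] - d_t) ** 2 for t, d_t in targets.items())
```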
SL assumes that input events are independent of earlier output
events (which may affect the environment through actions causing
subsequent perceptions). This assumption does not hold in the
broader fields of Sequential Decision Making and Reinforcement
Learning (RL) (Hutter, 2005; Kaelbling, Littman, & Moore, 1996;
Sutton & Barto, 1998; Wiering & van Otterlo, 2012) (Section 6).
In RL, some of the input events may encode real-valued reward
signals given by the environment, and a typical goal is to find
weights that yield episodes with a high sum of reward signals,
through sequences of appropriate output actions.
Section 5.5 will use the notation above to compactly describe
a central algorithm of DL, namely, backpropagation (BP) for
supervised weight-sharing FNNs and RNNs. (FNNs may be viewed
as RNNs with certain fixed zero weights.) Section 6 will address the
more general RL case.
3. Depth of Credit Assignment Paths (CAPs) and of problems
To measure whether credit assignment in a given NN application is of the deep or shallow type, I introduce the concept of Credit Assignment Paths or CAPs, which are chains of possibly causal links between the events of Section 2, e.g., from input through hidden to output layers in FNNs, or through transformations over time in RNNs.
Let us first focus on SL. Consider two events x_p and x_q (1 ≤ p < q ≤ T). Depending on the application, they may have a Potential Direct Causal Connection (PDCC) expressed by the Boolean predicate pdcc(p, q), which is true if and only if p ∈ in_q. Then the 2-element list (p, q) is defined to be a CAP (a minimal one) from p to q. A learning algorithm may be allowed to change w_{v(p,q)} to improve performance in future episodes.

More general, possibly indirect, Potential Causal Connections (PCC) are expressed by the recursively defined Boolean predicate pcc(p, q), which in the SL case is true only if pdcc(p, q), or if pcc(p, k) for some k and pdcc(k, q). In the latter case, appending q to any CAP from p to k yields a CAP from p to q (this is a recursive definition, too). The set of such CAPs may be large but is finite. Note that the same weight may affect many different PDCCs between successive events listed by a given CAP, e.g., in the case of RNNs, or weight-sharing FNNs.
Suppose a CAP has the form (. . . , k, t, . . . , q), where k and t (possibly t = q) are the first successive elements with modifiable w_{v(k,t)}. Then the length of the suffix list (t, . . . , q) is called the CAP’s depth (which is 0 if there are no modifiable links at all). This depth limits how far backwards credit assignment can move down the causal chain to find a modifiable weight.¹

Suppose an episode and its event sequence x_1, . . . , x_T satisfy a computable criterion used to decide whether a given problem has been solved (e.g., total error E below some threshold). Then the set of used weights is called a solution to the problem, and the depth of the deepest CAP within the sequence is called the solution depth. There may be other solutions (yielding different event sequences) with different depths. Given some fixed NN topology, the smallest depth of any solution is called the problem depth.

¹ An alternative would be to count only modifiable links when measuring depth. In many typical NN applications this would not make a difference, but in some it would, e.g., Section 6.1.
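These definitions are directly executable. Below is a brute-force Python sketch (mine, not from the paper) that enumerates the CAPs from p to q in the SL case by following the recursive pcc definition, and computes each CAP's depth; in_t, v, and the set modifiable of weight indices are hypothetical encodings in the style of Section 2.

```python
def all_caps(p, q, in_t):
    """Enumerate CAPs from p to q (SL case). pdcc(k, t) holds iff k is in
    in_t[t]; per the recursive pcc definition, a CAP from p to q is either
    the minimal list (p, q), or a CAP from p to some k with pdcc(k, q),
    extended by appending q."""
    if p in in_t.get(q, ()):
        yield [p, q]                          # minimal CAP
    for k in in_t.get(q, ()):
        if p < k:
            for cap in all_caps(p, k, in_t):
                yield cap + [q]               # append q to a CAP from p to k

def cap_depth(cap, v, modifiable):
    """Depth of a CAP: length of the suffix (t, ..., q) starting at the first
    link (k, t) whose weight index v(k, t) is modifiable; 0 if none is."""
    for i in range(len(cap) - 1):
        if v[(cap[i], cap[i + 1])] in modifiable:
            return len(cap) - (i + 1)
    return 0

# Usage: the solution depth of an episode would then be the maximum
# cap_depth over all CAPs between its events, given the used weights.
```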
Sometimes we also speak of the depth of an architecture: SL FNNs with fixed topology imply a problem-independent maximal problem depth bounded by the number of non-input layers. Certain SL RNNs with fixed weights for all connections except those to output units (Jaeger, 2001, 2004; Maass, Natschläger, & Markram, 2002; Schrauwen, Verstraeten, & Van Campenhout, 2007) have a maximal problem depth of 1, because only the final links in the corresponding CAPs are modifiable. In general, however, RNNs may learn to solve problems of potentially unlimited depth.
Note that the definitions above are solely based on the depths of causal chains, and agnostic to the temporal distance between events. For example, shallow FNNs perceiving large ‘‘time windows’’ of input events may correctly classify long input sequences through appropriate output events, and thus solve shallow problems involving long time lags between relevant events.
At which problem depth does Shallow Learning end, and Deep
Learning begin? Discussions with DL experts have not yet yielded a
conclusive response to this question. Instead of committing myself
to a precise answer, let me just define for the purposes of this
overview: problems of depth >10 require Very Deep Learning.
The difficulty of a problem may have little to do with its depth.
Some NNs can quickly learn to solve certain deep problems, e.g.,
through random weight guessing (Section 5.9) or other types
of direct search (Section 6.6) or indirect search (Section 6.7)
in weight space, or through training an NN first on shallow
problems whose solutions may then generalize to deep problems,
or through collapsing sequences of (non)linear operations into
a single (non)linear operation (but see an analysis of non-trivial
aspects of deep linear networks, Baldi & Hornik, 1995, Section B).
In general, however, finding an NN that precisely models a given
training set is an NP-complete problem (Blum & Rivest, 1992; Judd,
1990), also in the case of deep NNs (de Souto, Souto, & Oliveira,
1999; Síma, 1994; Windisch, 2005); compare a survey of negative
results (Síma, 2002, Section 1).
Above we have focused on SL. In the more general case of RL in unknown environments, pcc(p, q) is also true if x_p is an output event and x_q any later input event: any action may affect the environment and thus any later perception. (In the real world, the environment may even influence non-input events computed on a physical hardware entangled with the entire universe, but this is ignored here.) It is possible to model and replace such unmodifiable environmental PCCs through a part of the NN that has already learned to predict (through some of its units) input events (including reward signals) from former input events and actions (Section 6.1). Its weights are frozen, but can help to assign credit to other, still modifiable weights used to compute actions (Section 6.1). This approach may lead to very deep CAPs though.
Some DL research is about automatically rephrasing problems such that their depth is reduced (Section 4). In particular, sometimes UL is used to make SL problems less deep, e.g., Section 5.10. Often Dynamic Programming (Section 4.1) is used to facilitate certain traditional RL problems, e.g., Section 6.2. Section 5 focuses on CAPs for SL, Section 6 on the more complex case of RL.
4. Recurring themes of Deep Learning
4.1. Dynamic programming for Supervised/Reinforcement Learning (SL/RL)

One recurring theme of DL is Dynamic Programming (DP) (Bellman, 1957), which can help to facilitate credit assignment under certain assumptions. For example, in SL NNs, backpropagation itself can be viewed as a DP-derived method (Section 5.5). In traditional RL based on strong Markovian assumptions, DP-derived methods can help to greatly reduce problem depth (Section 6.2).
DP algorithms are also essential for systems that combine concepts of NNs and graphical models, such as Hidden Markov Models (HMMs) (Baum & Petrie, 1966; Stratonovich, 1960) and Expectation Maximization (EM) (Dempster, Laird, & Rubin, 1977; Friedman, Hastie, & Tibshirani, 2001), e.g., Baldi and Chauvin (1996), Bengio (1991), Bishop (2006), Bottou (1991), Bourlard and Morgan (1994), Dahl, Yu, Deng, and Acero (2012), Hastie, Tibshirani, and Friedman (2009), Hinton, Deng, et al. (2012), Jordan and Sejnowski (2001), Poon and Domingos (2011) and Wu and Shao (2014).
4.2. Unsupervised Learning (UL) facilitating SL and RL
Another recurring theme is how UL can facilitate both SL (Section 5) and RL (Section 6). UL (Section 5.6.4) is normally used to encode raw incoming data such as video or speech streams in a form that is more convenient for subsequent goal-directed learning. In particular, codes that describe the original data in a less redundant or more compact way can be fed into SL (Sections 5.10, 5.15) or RL machines (Section 6.4), whose search spaces may thus become smaller (and whose CAPs shallower) than those necessary for dealing with the raw data. UL is closely connected to the topics of regularization and compression (Sections 4.4, 5.6.3).
4.3. Learning hierarchical representations through deep SL, UL, RL
Many methods of Good Old-Fashioned Artificial Intelligence (GOFAI) (Nilsson, 1980) as well as more recent approaches to AI (Russell, Norvig, Canny, Malik, & Edwards, 1995) and Machine Learning (Mitchell, 1997) learn hierarchies of more and more abstract data representations. For example, certain methods of syntactic pattern recognition (Fu, 1977) such as grammar induction discover hierarchies of formal rules to model observations. The partially (un)supervised Automated Mathematician/EURISKO (Lenat, 1983; Lenat & Brown, 1984) continually learns concepts by combining previously learnt concepts. Such hierarchical representation learning (Bengio, Courville, & Vincent, 2013; Deng & Yu, 2014; Ring, 1994) is also a recurring theme of DL NNs for SL (Section 5), UL-aided SL (Sections 5.7, 5.10, 5.15), and hierarchical RL (Section 6.5). Often, abstract hierarchical representations are natural by-products of data compression (Section 4.4), e.g., Section 5.10.
4.4. Occam’s razor: compression and Minimum Description Length (MDL)

Occam’s razor favors simple solutions over complex ones. Given some programming language, the principle of Minimum Description Length (MDL) can be used to measure the complexity of a solution candidate by the length of the shortest program that computes it (e.g., Blumer, Ehrenfeucht, Haussler, & Warmuth, 1987; Chaitin, 1966; Grünwald, Myung, & Pitt, 2005; Kolmogorov, 1965b; Levin, 1973a; Li & Vitányi, 1997; Rissanen, 1986; Solomonoff, 1964, 1978; Wallace & Boulton, 1968). Some methods explicitly take into account program runtime (Allender, 1992; Schmidhuber, 1997, 2002; Watanabe, 1992); many consider only programs with constant runtime, written in non-universal programming languages (e.g., Hinton & van Camp, 1993; Rissanen, 1986). In the NN case, the MDL principle suggests that low NN weight complexity corresponds to high NN probability in the Bayesian view (e.g., Buntine & Weigend, 1991; De Freitas, 2003; MacKay, 1992; Neal, 1995), and to high generalization performance (e.g., Baum & Haussler, 1989), without overfitting the training data. Many methods have been proposed for regularizing NNs, that is, searching for solution-computing but simple, low-complexity SL NNs (Section 5.6.3) and RL NNs (Section 6.7). This is closely related to certain UL methods (Sections 4.2, 5.6.4).
4.5. Fast Graphics Processing Units (GPUs) for DL in NNs
While the previous millennium saw several attempts at creating fast NN-specific hardware (e.g., Faggin, 1992; Heemskerk, 1995; Jackel et al., 1990; Korkin, de Garis, Gers, & Hemmi, 1997; Ramacher et al., 1993; Urlbe, 1999; Widrow, Rumelhart, & Lehr, 1994), and at exploiting standard hardware (e.g., Anguita & Gomes, 1996; Anguita, Parodi, & Zunino, 1994; Muller, Gunzinger, & Guggenbühl, 1995), the new millennium brought a DL breakthrough in form of cheap, multi-processor graphics cards or GPUs. GPUs are widely used for video games, a huge and competitive market that has driven down hardware prices. GPUs excel at the fast matrix and vector multiplications required not only for convincing virtual realities but also for NN training, where they can speed up learning by a factor of 50 and more. Some of the GPU-based FNN implementations (Sections 5.16–5.19) have greatly contributed to recent successes in contests for pattern recognition (Sections 5.19–5.22), image segmentation (Section 5.21), and object detection (Sections 5.21–5.22).
5. Supervised NNs, some helped by unsupervised NNs
The main focus of current practical applications is on Supervised
Learning (SL), which has dominated recent pattern recognition
contests (Sections 5.17–5.23). Several methods, however, use
additional Unsupervised Learning (UL) to facilitate SL (Sections 5.7,
5.10, 5.15). It does make sense to treat SL and UL in the same
section: often gradient-based methods, such as BP (Section 5.5.1),
are used to optimize objective functions of both UL and SL, and the
boundary between SL and UL may blur, for example, when it comes
to time series prediction and sequence classification, e.g., Sections
5.10, 5.12.
A historical timeline format will help to arrange subsections on
important inspirations and technical contributions (although such
a subsection may span a time interval of many years). Section 5.1
briefly mentions early, shallow NN models since the 1940s (and
1800s), Section 5.2 additional early neurobiological inspiration
relevant for modern Deep Learning (DL). Section 5.3 is about GMDH
networks (since 1965), to my knowledge the first (feedforward)
DL systems. Section 5.4 is about the relatively deep Neocognitron
NN (1979) which is very similar to certain modern deep FNN
architectures, as it combines convolutional NNs (CNNs), weight
pattern replication, and subsampling mechanisms. Section 5.5
uses the notation of Section 2 to compactly describe a central
algorithm of DL, namely, backpropagation (BP) for supervised
weight-sharing FNNs and RNNs. It also summarizes the history
of BP 1960–1981 and beyond. Section 5.6 describes problems
encountered in the late 1980s with BP for deep NNs, and mentions
several ideas from the previous millennium to overcome them.
Section 5.7 discusses a first hierarchical stack (1987) of coupled
UL-based Autoencoders (AEs)—this concept resurfaced in the new
millennium (Section 5.15). Section 5.8 is about applying BP to CNNs
(1989), which is important for today’s DL applications. Section 5.9
explains BP’s Fundamental DL Problem (of vanishing/exploding
gradients) discovered in 1991. Section 5.10 explains how a deep
RNN stack of 1991 (the History Compressor) pre-trained by UL
helped to solve previously unlearnable DL benchmarks requiring
Credit Assignment Paths (CAPs, Section 3) of depth 1000 and
more. Section 5.11 discusses a particular winner-take-all (WTA)
method called Max-Pooling (MP, 1992) widely used in today’s deep
FNNs. Section 5.12 mentions a first important contest won by
SL NNs in 1994. Section 5.13 describes a purely supervised DL
RNN (Long Short-Term Memory, LSTM, 1995) for problems of depth
1000 and more. Section 5.14 mentions an early contest of 2003
won by an ensemble of shallow FNNs, as well as good pattern
recognition results with CNNs and deep FNNs and LSTM RNNs