1 Introduction to Deep Learning (DL) in Neural Networks (NNs)
Which modifiable components of a learning system are responsible for its success or failure? What changes
to them improve performance? This has been called the fundamental credit assignment problem (Minsky,
1963). There are general credit assignment methods for universal problem solvers that are time-optimal
in various theoretical senses (Sec. 6.8). The present survey, however, will focus on the narrower, but now
commercially important, subfield of Deep Learning (DL) in Artificial Neural Networks (NNs). We are
interested in accurate credit assignment across possibly many, often nonlinear, computational stages of
NNs.
Shallow NN-like models have been around for many decades if not centuries (Sec. 5.1). Models with
several successive nonlinear layers of neurons date back at least to the 1960s (Sec. 5.3) and 1970s (Sec. 5.5).
An efficient gradient descent method for teacher-based Supervised Learning (SL) in discrete, differentiable
networks of arbitrary depth called backpropagation (BP) was developed in the 1960s and 1970s, and ap-
plied to NNs in 1981 (Sec. 5.5). BP-based training of deep NNs with many layers, however, had been found
to be difficult in practice by the late 1980s (Sec. 5.6), and had become an explicit research subject by the
early 1990s (Sec. 5.9). DL became practically feasible to some extent through the help of Unsupervised
Learning (UL) (e.g., Sec. 5.10, 5.15). The 1990s and 2000s also saw many improvements of purely super-
vised DL (Sec. 5). In the new millennium, deep NNs have finally attracted wide-spread attention, mainly
by outperforming alternative machine learning methods such as kernel machines (Vapnik, 1995; Schölkopf
et al., 1998) in numerous important applications. In fact, supervised deep NNs have won numerous of-
ficial international pattern recognition competitions (e.g., Sec. 5.17, 5.19, 5.21, 5.22), achieving the first
superhuman visual pattern recognition results in limited domains (Sec. 5.19). Deep NNs also have become
relevant for the more general field of Reinforcement Learning (RL) where there is no supervising teacher
(Sec. 6).
Both feedforward (acyclic) NNs (FNNs) and recurrent (cyclic) NNs (RNNs) have won contests (Sec.
5.12, 5.14, 5.17, 5.19, 5.21, 5.22). In a sense, RNNs are the deepest of all NNs (Sec. 3)—they are
general computers more powerful than FNNs, and can in principle create and process memories of ar-
bitrary sequences of input patterns (e.g., Siegelmann and Sontag, 1991; Schmidhuber, 1990a). Unlike
traditional methods for automatic sequential program synthesis (e.g., Waldinger and Lee, 1969; Balzer,
1985; Soloway, 1986; Deville and Lau, 1994), RNNs can learn programs that mix sequential and parallel
information processing in a natural and efficient way, exploiting the massive parallelism viewed as crucial
for sustaining the rapid decline of computation cost observed over the past 75 years.
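To illustrate what is meant by processing memories of input sequences, here is a minimal Python sketch (not taken from the cited works); the function name rnn_step, the tanh nonlinearity, and the toy dimensions are illustrative assumptions rather than a specific model discussed in this survey. The hidden state is fed back at every step, so the recurrent connections let the network carry information about earlier input patterns forward in time.

    import numpy as np

    def rnn_step(W_in, W_rec, x_t, h_prev):
        # One recurrent update: the new hidden state depends on the current
        # input pattern x_t and on the previous hidden state h_prev, which
        # acts as a memory of the input sequence seen so far.
        return np.tanh(W_in @ x_t + W_rec @ h_prev)

    # Toy example: 3 input units, 4 hidden units, a sequence of 5 input patterns.
    rng = np.random.default_rng(0)
    W_in = rng.normal(scale=0.5, size=(4, 3))
    W_rec = rng.normal(scale=0.5, size=(4, 4))
    h = np.zeros(4)
    for x in rng.normal(size=(5, 3)):
        h = rnn_step(W_in, W_rec, x, h)  # h summarizes the sequence so far
    print(h)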
The rest of this paper is structured as follows. Sec. 2 introduces a compact, event-oriented notation
that is simple yet general enough to accommodate both FNNs and RNNs. Sec. 3 introduces the concept of
Credit Assignment Paths (CAPs) to measure whether learning in a given NN application is of the deep or
shallow type. Sec. 4 lists recurring themes of DL in SL, UL, and RL. Sec. 5 focuses on SL and UL, and on
how UL can facilitate SL, although pure SL has become dominant in recent competitions (Sec. 5.17-5.22).
Sec. 5 is arranged in a historical timeline format with subsections on important inspirations and technical
contributions. Sec. 6 on deep RL discusses traditional Dynamic Programming (DP)-based RL combined
with gradient-based search techniques for SL or UL in deep NNs, as well as general methods for direct
and indirect search in the weight space of deep FNNs and RNNs, including successful policy gradient and
evolutionary methods.
2 Event-Oriented Notation for Activation Spreading in FNNs / RNNs
Throughout this paper, let i, j, k, t, p, q, r denote positive integer variables assuming ranges implicit in the
given contexts. Let n, m, T denote positive integer constants.
An NN’s topology may change over time (e.g., Sec. 5.3, 5.6.3). At any given moment, it can be
described as a finite subset of units (or nodes or neurons) N = {u_1, u_2, . . .} and a finite set H ⊆ N × N
of directed edges or connections between nodes. FNNs are acyclic graphs, RNNs cyclic ones. The first (input)
layer is the set of input units, a subset of N . In FNNs, the k-th layer (k > 1) is the set of all nodes u ∈ N
such that there is an edge path of length k − 1 (but no longer path) between some input unit and u. There
may be shortcut connections between distant layers.
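To make this layer definition concrete, the following minimal Python sketch (not part of the formal notation above) represents a topology as the pair (N, H) and assigns FNN layer indices by the length of the longest directed path from the input units; the helper name layer_indices and the toy graph are illustrative assumptions.

    from collections import defaultdict

    def layer_indices(nodes, edges, input_units):
        # Input units form layer 1; any other unit u lies in layer k, where
        # k - 1 is the length of the longest directed edge path from some
        # input unit to u (the "path of length k - 1, but no longer path"
        # definition above).
        preds = defaultdict(list)          # predecessor map built from H
        for (u, v) in edges:
            preds[v].append(u)
        layer = {u: 1 for u in input_units}

        def longest(u):
            # assumes every non-input unit is reachable from some input unit
            if u not in layer:
                layer[u] = 1 + max(longest(p) for p in preds[u])
            return layer[u]

        for u in nodes:
            longest(u)
        return layer

    # Toy FNN (N, H) with a shortcut connection from the input unit u1 to u4.
    N = ["u1", "u2", "u3", "u4"]
    H = [("u1", "u2"), ("u2", "u3"), ("u3", "u4"), ("u1", "u4")]
    print(layer_indices(N, H, input_units=["u1"]))
    # -> {'u1': 1, 'u2': 2, 'u3': 3, 'u4': 4}

Note that the shortcut edge from u1 to u4 does not pull u4 into layer 2: the longest path from the input still has length 3, so u4 remains in layer 4, consistent with the definition above.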