LETTER
doi:10.1038/nature14236

Human-level control through deep reinforcement learning

Volodymyr Mnih^1*, Koray Kavukcuoglu^1*, David Silver^1*, Andrei A. Rusu^1, Joel Veness^1, Marc G. Bellemare^1, Alex Graves^1, Martin Riedmiller^1, Andreas K. Fidjeland^1, Georg Ostrovski^1, Stig Petersen^1, Charles Beattie^1, Amir Sadik^1, Ioannis Antonoglou^1, Helen King^1, Dharshan Kumaran^1, Daan Wierstra^1, Shane Legg^1 & Demis Hassabis^1
The theory of reinforcement learning provides a normative account^1, deeply rooted in psychological^2 and neuroscientific^3 perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems^{4,5}, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms^3. While reinforcement learning agents have achieved some successes in a variety of domains^{6-8}, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks^{9-11} to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games^{12}. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
We set out to create a single algorithm that would be able to develop a wide range of competencies on a varied range of challenging tasks, a central goal of general artificial intelligence^{13} that has eluded previous efforts^{8,14,15}. To achieve this, we developed a novel agent, a deep Q-network (DQN), which is able to combine reinforcement learning with a class of artificial neural network^{16} known as deep neural networks. Notably, recent advances in deep neural networks^{9-11}, in which several layers of nodes are used to build up progressively more abstract representations of the data, have made it possible for artificial neural networks to learn concepts such as object categories directly from raw sensory data. We use one particularly successful architecture, the deep convolutional network^{17}, which uses hierarchical layers of tiled convolutional filters to mimic the effects of receptive fields (inspired by Hubel and Wiesel's seminal work on feedforward processing in early visual cortex^{18}), thereby exploiting the local spatial correlations present in images, and building in robustness to natural transformations such as changes of viewpoint or scale.
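The exact network used in this work is specified in the Methods (four stacked 84 × 84 preprocessed frames in, one Q-value per valid action out). As a rough illustration only, here is a minimal sketch of a convolutional Q-network following those reported layer sizes; the PyTorch framing and the class name are our own illustrative choices, not the paper's implementation:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Convolutional Q-network sketch: stacked frames in, one Q-value per action out.

    Layer sizes follow those reported in the paper's Methods; the PyTorch
    framing is an illustrative assumption, not the original implementation.
    """

    def __init__(self, num_actions: int, in_frames: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4),  # 84x84 -> 20x20
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),         # 20x20 -> 9x9
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),         # 9x9 -> 7x7
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
            nn.Linear(512, num_actions),  # one output unit per valid action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: batch of stacked, preprocessed frames, shape (N, 4, 84, 84)
        return self.head(self.features(x))
```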
We consider tasks in which the agent interacts with an environment through a sequence of observations, actions and rewards. The goal of the agent is to select actions in a fashion that maximizes cumulative future reward. More formally, we use a deep convolutional neural network to approximate the optimal action-value function

$$Q^{*}(s,a) = \max_{\pi} \mathbb{E}\left[\, r_{t} + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \cdots \;\middle|\; s_{t}=s,\ a_{t}=a,\ \pi \,\right],$$

which is the maximum sum of rewards $r_t$ discounted by $\gamma$ at each time-step $t$, achievable by a behaviour policy $\pi = P(a\,|\,s)$, after making an observation ($s$) and taking an action ($a$) (see Methods)^{19}.
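To make the definition concrete, the discounted sum inside the expectation can be computed directly for any finite reward sequence. A minimal sketch (the function name and example values are ours):

```python
def discounted_return(rewards, gamma=0.99):
    """Sum r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for a finite sequence.

    `rewards` is the sequence observed from time-step t onwards;
    gamma in [0, 1) trades off immediate against future reward.
    """
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# Example: a reward received now is worth more than the same reward later.
print(discounted_return([1.0, 0.0, 1.0], gamma=0.9))  # 1.0 + 0.0 + 0.81 = 1.81
```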
Reinforcement learning is known to be unstable or even to diverge when a nonlinear function approximator such as a neural network is used to represent the action-value (also known as Q) function^{20}. This instability has several causes: the correlations present in the sequence of observations, the fact that small updates to Q may significantly change the policy and therefore change the data distribution, and the correlations between the action-values (Q) and the target values $r + \gamma \max_{a'} Q(s', a')$. We address these instabilities with a novel variant of Q-learning, which uses two key ideas. First, we used a biologically inspired mechanism termed experience replay^{21-23} that randomizes over the data, thereby removing correlations in the observation sequence and smoothing over changes in the data distribution (see below for details). Second, we used an iterative update that adjusts the action-values (Q) towards target values that are only periodically updated, thereby reducing correlations with the target.
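As a minimal sketch of the first idea, an experience-replay memory need only store transitions up to a fixed capacity and sample from them uniformly at random (the class and parameter names here are ours, for illustration):

```python
import random
from collections import deque, namedtuple

# One transition e_t = (s_t, a_t, r_t, s_{t+1}); `done` flags terminal states.
Transition = namedtuple("Transition",
                        ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity=1_000_000):
        self.memory = deque(maxlen=capacity)  # oldest experiences evicted first

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlations in the
        # observation sequence that destabilize learning.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```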
While other stable methods exist for training neural networks in the reinforcement learning setting, such as neural fitted Q-iteration^{24}, these methods involve the repeated training of networks de novo on hundreds of iterations. Consequently, these methods, unlike our algorithm, are too inefficient to be used successfully with large neural networks. We parameterize an approximate value function $Q(s,a;\theta_i)$ using the deep convolutional neural network shown in Fig. 1, in which $\theta_i$ are the parameters (that is, weights) of the Q-network at iteration $i$. To perform experience replay we store the agent's experiences $e_t = (s_t, a_t, r_t, s_{t+1})$ at each time-step $t$ in a data set $D_t = \{e_1, \ldots, e_t\}$. During learning, we apply Q-learning updates, on samples (or minibatches) of experience $(s,a,r,s') \sim U(D)$, drawn uniformly at random from the pool of stored samples. The Q-learning update at iteration $i$ uses the following loss function:

$$L_{i}(\theta_{i}) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\left[\left(r + \gamma \max_{a'} Q(s',a';\theta_{i}^{-}) - Q(s,a;\theta_{i})\right)^{2}\right]$$

in which $\gamma$ is the discount factor determining the agent's horizon, $\theta_i$ are the parameters of the Q-network at iteration $i$ and $\theta_i^{-}$ are the network parameters used to compute the target at iteration $i$. The target network parameters $\theta_i^{-}$ are only updated with the Q-network parameters ($\theta_i$) every $C$ steps and are held fixed between individual updates (see Methods).
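A minimal sketch of one such update step, including the periodic target-network synchronization, is given below. It is written in PyTorch (our framework choice, not the paper's); `QNetwork` and `ReplayBuffer` are the illustrative classes sketched above, and the plain squared error mirrors the displayed loss (the paper's Methods additionally clips the error term):

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, buffer, optimizer, step,
               batch_size=32, gamma=0.99, sync_every_c=10_000):
    """One Q-learning update on a uniformly sampled minibatch of transitions."""
    batch = buffer.sample(batch_size)  # transitions stored as tensors
    states = torch.stack([t.state for t in batch])
    actions = torch.tensor([t.action for t in batch])
    rewards = torch.tensor([t.reward for t in batch])
    next_states = torch.stack([t.next_state for t in batch])
    done = torch.tensor([t.done for t in batch], dtype=torch.float32)

    # Q(s, a; theta_i) for the actions actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target r + gamma * max_a' Q(s', a'; theta_i^-), computed with the
    # frozen target network; no gradient flows through the target.
    with torch.no_grad():
        next_max = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - done) * next_max

    loss = F.mse_loss(q_values, targets)  # L_i(theta_i)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Every C steps, copy the online parameters into the target network.
    if step % sync_every_c == 0:
        target_net.load_state_dict(q_net.state_dict())
```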
To evaluate our DQN agent, we took advantage of the Atari 2600 platform, which offers a diverse array of tasks (n = 49) designed to be
*These authors contributed equally to this work.
^1 Google DeepMind, 5 New Street Square, London EC4A 3TW, UK.