LETTER
doi:10.1038/nature14236

Human-level control through deep reinforcement learning

Volodymyr Mnih^1*, Koray Kavukcuoglu^1*, David Silver^1*, Andrei A. Rusu^1, Joel Veness^1, Marc G. Bellemare^1, Alex Graves^1, Martin Riedmiller^1, Andreas K. Fidjeland^1, Georg Ostrovski^1, Stig Petersen^1, Charles Beattie^1, Amir Sadik^1, Ioannis Antonoglou^1, Helen King^1, Dharshan Kumaran^1, Daan Wierstra^1, Shane Legg^1 & Demis Hassabis^1
The theory of reinforcement learning provides a normative account^1, deeply rooted in psychological^2 and neuroscientific^3 perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems^{4,5}, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms^3. While reinforcement learning agents have achieved some successes in a variety of domains^{6-8}, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks^{9-11} to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games^{12}. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
We set out to create a single algorithm that would be able to develop a wide range of competencies on a varied range of challenging tasks, a central goal of general artificial intelligence^{13} that has eluded previous efforts^{8,14,15}. To achieve this, we developed a novel agent, a deep Q-network (DQN), which is able to combine reinforcement learning with a class of artificial neural network^{16} known as deep neural networks. Notably, recent advances in deep neural networks^{9-11}, in which several layers of nodes are used to build up progressively more abstract representations of the data, have made it possible for artificial neural networks to learn concepts such as object categories directly from raw sensory data. We use one particularly successful architecture, the deep convolutional network^{17}, which uses hierarchical layers of tiled convolutional filters to mimic the effects of receptive fields (inspired by Hubel and Wiesel's seminal work on feedforward processing in early visual cortex^{18}), thereby exploiting the local spatial correlations present in images, and building in robustness to natural transformations such as changes of viewpoint or scale.
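The exact network used in this work is specified in the Methods (four stacked 84 × 84 preprocessed frames in, one Q-value per valid action out). As a rough illustration only, here is a minimal sketch of a convolutional Q-network following those reported layer sizes; the PyTorch framing and the class name are our own illustrative choices, not the paper's implementation:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Convolutional Q-network sketch: stacked frames in, one Q-value per action out.

    Layer sizes follow those reported in the paper's Methods; the PyTorch
    framing is an illustrative assumption, not the original implementation.
    """

    def __init__(self, num_actions: int, in_frames: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4),  # 84x84 -> 20x20
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),         # 20x20 -> 9x9
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),         # 9x9 -> 7x7
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
            nn.Linear(512, num_actions),  # one output unit per valid action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: batch of stacked, preprocessed frames, shape (N, 4, 84, 84)
        return self.head(self.features(x))
```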
We consider tasks in which the agent interacts with an environment through a sequence of observations, actions and rewards. The goal of the agent is to select actions in a fashion that maximizes cumulative future reward. More formally, we use a deep convolutional neural network to approximate the optimal action-value function

$$Q^{*}(s,a) = \max_{\pi} \mathbb{E}\left[\, r_{t} + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \cdots \;\middle|\; s_{t}=s,\ a_{t}=a,\ \pi \,\right],$$

which is the maximum sum of rewards $r_t$ discounted by $\gamma$ at each time-step $t$, achievable by a behaviour policy $\pi = P(a\,|\,s)$, after making an observation ($s$) and taking an action ($a$) (see Methods)^{19}.
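To make the definition concrete, the discounted sum inside the expectation can be computed directly for any finite reward sequence. A minimal sketch (the function name and example values are ours):

```python
def discounted_return(rewards, gamma=0.99):
    """Sum r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for a finite sequence.

    `rewards` is the sequence observed from time-step t onwards;
    gamma in [0, 1) trades off immediate against future reward.
    """
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# Example: a reward received now is worth more than the same reward later.
print(discounted_return([1.0, 0.0, 1.0], gamma=0.9))  # 1.0 + 0.0 + 0.81 = 1.81
```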
Reinforcement learning is known to be unstable or even to diverge when a nonlinear function approximator such as a neural network is used to represent the action-value (also known as Q) function^{20}. This instability has several causes: the correlations present in the sequence of observations, the fact that small updates to Q may significantly change the policy and therefore change the data distribution, and the correlations between the action-values (Q) and the target values $r + \gamma \max_{a'} Q(s', a')$. We address these instabilities with a novel variant of Q-learning, which uses two key ideas. First, we used a biologically inspired mechanism termed experience replay^{21-23} that randomizes over the data, thereby removing correlations in the observation sequence and smoothing over changes in the data distribution (see below for details). Second, we used an iterative update that adjusts the action-values (Q) towards target values that are only periodically updated, thereby reducing correlations with the target.
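As a minimal sketch of the first idea, an experience-replay memory need only store transitions up to a fixed capacity and sample from them uniformly at random (the class and parameter names here are ours, for illustration):

```python
import random
from collections import deque, namedtuple

# One transition e_t = (s_t, a_t, r_t, s_{t+1}); `done` flags terminal states.
Transition = namedtuple("Transition",
                        ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity=1_000_000):
        self.memory = deque(maxlen=capacity)  # oldest experiences evicted first

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlations in the
        # observation sequence that destabilize learning.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```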
While other stable methods exist for training neural networks in the reinforcement learning setting, such as neural fitted Q-iteration^{24}, these methods involve the repeated training of networks de novo on hundreds of iterations. Consequently, these methods, unlike our algorithm, are too inefficient to be used successfully with large neural networks. We parameterize an approximate value function $Q(s,a;\theta_i)$ using the deep convolutional neural network shown in Fig. 1, in which $\theta_i$ are the parameters (that is, weights) of the Q-network at iteration $i$. To perform experience replay we store the agent's experiences $e_t = (s_t, a_t, r_t, s_{t+1})$ at each time-step $t$ in a data set $D_t = \{e_1, \ldots, e_t\}$. During learning, we apply Q-learning updates, on samples (or minibatches) of experience $(s,a,r,s') \sim U(D)$, drawn uniformly at random from the pool of stored samples. The Q-learning update at iteration $i$ uses the following loss function:

$$L_{i}(\theta_{i}) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\left[\left(r + \gamma \max_{a'} Q(s',a';\theta_{i}^{-}) - Q(s,a;\theta_{i})\right)^{2}\right]$$

in which $\gamma$ is the discount factor determining the agent's horizon, $\theta_i$ are the parameters of the Q-network at iteration $i$ and $\theta_i^{-}$ are the network parameters used to compute the target at iteration $i$. The target network parameters $\theta_i^{-}$ are only updated with the Q-network parameters ($\theta_i$) every $C$ steps and are held fixed between individual updates (see Methods).
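A minimal sketch of one such update step, including the periodic target-network synchronization, is given below. It is written in PyTorch (our framework choice, not the paper's); `QNetwork` and `ReplayBuffer` are the illustrative classes sketched above, and the plain squared error mirrors the displayed loss (the paper's Methods additionally clips the error term):

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, buffer, optimizer, step,
               batch_size=32, gamma=0.99, sync_every_c=10_000):
    """One Q-learning update on a uniformly sampled minibatch of transitions."""
    batch = buffer.sample(batch_size)  # transitions stored as tensors
    states = torch.stack([t.state for t in batch])
    actions = torch.tensor([t.action for t in batch])
    rewards = torch.tensor([t.reward for t in batch])
    next_states = torch.stack([t.next_state for t in batch])
    done = torch.tensor([t.done for t in batch], dtype=torch.float32)

    # Q(s, a; theta_i) for the actions actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target r + gamma * max_a' Q(s', a'; theta_i^-), computed with the
    # frozen target network; no gradient flows through the target.
    with torch.no_grad():
        next_max = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - done) * next_max

    loss = F.mse_loss(q_values, targets)  # L_i(theta_i)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Every C steps, copy the online parameters into the target network.
    if step % sync_every_c == 0:
        target_net.load_state_dict(q_net.state_dict())
```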
To evaluate our DQN agent, we took advantage of the Atari 2600 platform, which offers a diverse array of tasks (n = 49) designed to be
*These authors contributed equally to this work.
^1 Google DeepMind, 5 New Street Square, London EC4A 3TW, UK.