time than previous GPU-based algorithms, using far fewer resources than massively distributed approaches. The best
of the proposed methods, asynchronous advantage actor-
critic (A3C), also mastered a variety of continuous motor
control tasks as well as learned general strategies for ex-
ploring 3D mazes purely from visual inputs. We believe
that the success of A3C on both 2D and 3D games, discrete
and continuous action spaces, as well as its ability to train
feedforward and recurrent agents makes it the most general
and successful reinforcement learning agent to date.
2. Related Work
The General Reinforcement Learning Architecture (Gorila)
of (Nair et al., 2015) performs asynchronous training of re-
inforcement learning agents in a distributed setting. In Go-
rila, each process contains an actor that acts in its own copy
of the environment, a separate replay memory, and a learner
that samples data from the replay memory and computes
gradients of the DQN loss (Mnih et al., 2015) with respect
to the policy parameters. The gradients are asynchronously
sent to a central parameter server which updates a central
copy of the model. The updated policy parameters are sent
to the actor-learners at fixed intervals. By using 100 sep-
arate actor-learner processes and 30 parameter server in-
stances, a total of 130 machines, Gorila was able to signif-
icantly outperform DQN over 49 Atari games. On many
games Gorila reached the score achieved by DQN over 20
times faster than DQN. We also note that a similar way of
parallelizing DQN was proposed by (Chavez et al., 2015).
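For concreteness, the following Python sketch (our illustration, not Gorila's actual code) shows only the communication pattern described above: several actor-learner workers compute gradients on their own data and send them asynchronously to a central parameter server, pulling fresh parameters at fixed intervals. The DQN loss is replaced by a toy squared-error objective, all class and function names are hypothetical, and threads stand in for the separate machines Gorila actually uses.

```python
import threading
import numpy as np

class ParameterServer:
    """Central copy of the model parameters; gradients arrive asynchronously."""
    def __init__(self, dim, lr=0.01):
        self.theta = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def apply_gradient(self, grad):
        with self.lock:                      # asynchronous update of the central copy
            self.theta -= self.lr * grad

    def get_parameters(self):
        with self.lock:
            return self.theta.copy()

def actor_learner(server, worker_id, steps=2000, sync_every=20):
    """One worker: generates its own experience, computes a gradient of a toy
    squared-error loss (standing in for the DQN loss), and ships it to the server."""
    rng = np.random.default_rng(worker_id)
    theta = server.get_parameters()          # local, possibly stale parameters
    for t in range(steps):
        x = rng.normal(size=theta.shape)     # toy "transition" features
        target = 3.0 * x.sum()               # toy regression target
        grad = 2.0 * (theta @ x - target) * x
        server.apply_gradient(grad)
        if t % sync_every == 0:
            theta = server.get_parameters()  # refresh at fixed intervals

server = ParameterServer(dim=4)
workers = [threading.Thread(target=actor_learner, args=(server, i)) for i in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(server.theta)                          # settles near [3, 3, 3, 3]
```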
In earlier work, (Li & Schuurmans, 2011) applied the MapReduce framework to parallelizing batch reinforcement learning methods with linear function approximation.
Parallelism was used to speed up large matrix operations
but not to parallelize the collection of experience or sta-
bilize learning. (Grounds & Kudenko, 2008) proposed a
parallel version of the Sarsa algorithm that uses multiple
separate actor-learners to accelerate training. Each actor-
learner learns separately and periodically sends updates to
weights that have changed significantly to the other learn-
ers using peer-to-peer communication.
(Tsitsiklis, 1994) studied convergence properties of Q-
learning in the asynchronous optimization setting. These
results show that Q-learning is still guaranteed to converge
when some of the information is outdated as long as out-
dated information is always eventually discarded and sev-
eral other technical assumptions are satisfied. Even earlier,
(Bertsekas, 1982) studied the related problem of distributed
dynamic programming.
Another related area of work is in evolutionary meth-
ods, which are often straightforward to parallelize by dis-
tributing fitness evaluations over multiple machines or
threads (Tomassini, 1999). Such parallel evolutionary ap-
proaches have recently been applied to some visual rein-
forcement learning tasks. In one example, (Koutník et al.,
2014) evolved convolutional neural network controllers for
the TORCS driving simulator by performing fitness evalu-
ations on 8 CPU cores in parallel.
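The parallelism in such methods is typically at the level of fitness evaluations, as in the hedged Python sketch below; the fitness function here is a toy objective standing in for simulator rollouts, and all names are illustrative.

```python
import numpy as np
from multiprocessing import Pool

def fitness(params):
    """Stand-in fitness evaluation; in a visual RL task this would run whole
    episodes of a simulator with a neural-network controller."""
    return -float(np.sum((params - 1.0) ** 2))    # toy objective, maximised at all-ones

def evolve(pop_size=16, dim=8, generations=50, workers=8, seed=0):
    rng = np.random.default_rng(seed)
    population = rng.normal(size=(pop_size, dim))
    best = population[0]
    with Pool(workers) as pool:
        for _ in range(generations):
            # The expensive step, fitness evaluation, is farmed out to worker processes.
            scores = pool.map(fitness, list(population))
            order = np.argsort(scores)
            best = population[order[-1]]
            elite = population[order[-pop_size // 4:]]
            # Mutation-only reproduction from the elite set.
            parents = elite[rng.integers(len(elite), size=pop_size)]
            population = parents + 0.1 * rng.normal(size=parents.shape)
    return best

if __name__ == "__main__":
    print(evolve())
```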
3. Reinforcement Learning Background
We consider the standard reinforcement learning setting where an agent interacts with an environment $\mathcal{E}$ over a number of discrete time steps. At each time step $t$, the agent receives a state $s_t$ and selects an action $a_t$ from some set of possible actions $\mathcal{A}$ according to its policy $\pi$, where $\pi$ is a mapping from states $s_t$ to actions $a_t$. In return, the agent receives the next state $s_{t+1}$ and receives a scalar reward $r_t$. The process continues until the agent reaches a terminal state after which the process restarts. The return $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ is the total accumulated return from time step $t$ with discount factor $\gamma \in (0, 1]$. The goal of the agent is to maximize the expected return from each state $s_t$.
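For a finite episode the return can be computed for every time step with a single backward pass, since $R_t = r_t + \gamma R_{t+1}$; the short snippet below (a toy illustration, not part of the paper's method) does exactly this.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = sum_k gamma^k r_{t+k} for each step of a finite episode,
    working backwards so each return reuses the one after it: R_t = r_t + gamma * R_{t+1}."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))  # [0.81, 0.9, 1.0]
```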
The action value $Q^{\pi}(s, a) = \mathbb{E}\left[ R_t \mid s_t = s, a \right]$ is the expected return for selecting action $a$ in state $s$ and following policy $\pi$. The optimal value function $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$ gives the maximum action value for state $s$ and action $a$ achievable by any policy. Similarly, the value of state $s$ under policy $\pi$ is defined as $V^{\pi}(s) = \mathbb{E}\left[ R_t \mid s_t = s \right]$ and is simply the expected return for following policy $\pi$ from state $s$.
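These definitions can be checked numerically. In the toy two-state MDP below (purely illustrative, with hypothetical helper names), $Q^{\pi}(s, a)$ is estimated by Monte Carlo rollouts that take $a$ first and then follow $\pi$, and $V^{\pi}(s)$ is recovered as $\sum_a \pi(a \mid s)\, Q^{\pi}(s, a)$.

```python
import numpy as np

# Toy 2-state MDP: action 0 keeps the current state, action 1 flips it.
# Reward is 1 whenever the next state is state 1; rollouts are truncated at 10 steps.
rng = np.random.default_rng(0)
gamma, horizon, n_rollouts = 0.9, 10, 5000
policy = np.array([[0.5, 0.5], [0.5, 0.5]])      # pi(a|s), uniform here

def step(s, a):
    s_next = s if a == 0 else 1 - s
    return s_next, float(s_next == 1)

def sampled_return(s, a):
    """One Monte Carlo sample of R_t: take action a in state s, then follow pi."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        s, r = step(s, a)
        total += discount * r
        discount *= gamma
        a = rng.choice(2, p=policy[s])
    return total

Q = np.array([[np.mean([sampled_return(s, a) for _ in range(n_rollouts)])
               for a in range(2)] for s in range(2)])
V = (policy * Q).sum(axis=1)      # V_pi(s) = sum_a pi(a|s) Q_pi(s, a)
print("Q:", np.round(Q, 2))
print("V:", np.round(V, 2))
```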
In value-based model-free reinforcement learning methods, the action value function is represented using a function approximator, such as a neural network. Let $Q(s, a; \theta)$ be an approximate action-value function with parameters $\theta$. The updates to $\theta$ can be derived from a variety of reinforcement learning algorithms. One example of such an algorithm is Q-learning, which aims to directly approximate the optimal action value function: $Q^{*}(s, a) \approx Q(s, a; \theta)$. In one-step Q-learning, the parameters $\theta$ of the action value function $Q(s, a; \theta)$ are learned by iteratively minimizing a sequence of loss functions, where the $i$th loss function is defined as
$$
L_i(\theta_i) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i) \right)^2 \right]
$$
where $s'$ is the state encountered after state $s$.
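The corresponding parameter update is a gradient step on this squared error with the target held fixed at the older parameters $\theta_{i-1}$. Below is a minimal sketch with a linear approximator (the paper uses deep networks; the features, learning rate, and helper names here are placeholders of our own).

```python
import numpy as np

def q_values(theta, phi_s):
    """Linear action-value approximator: Q(s, a; theta) = theta[a] . phi(s)."""
    return theta @ phi_s

def one_step_q_update(theta, theta_old, phi_s, a, r, phi_s_next, gamma=0.99, lr=0.1):
    """One gradient step on (r + gamma * max_a' Q(s', a'; theta_old) - Q(s, a; theta))^2.
    The target uses the previous parameters theta_old and is treated as a constant."""
    target = r + gamma * np.max(q_values(theta_old, phi_s_next))
    td_error = target - q_values(theta, phi_s)[a]
    theta = theta.copy()
    theta[a] += lr * td_error * phi_s      # semi-gradient step (factor of 2 folded into lr)
    return theta

# Tiny usage example with 2 actions and 3-dimensional features.
rng = np.random.default_rng(0)
theta = rng.normal(size=(2, 3)) * 0.01
theta_old = theta.copy()
phi_s, phi_s_next = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
theta = one_step_q_update(theta, theta_old, phi_s, a=0, r=1.0, phi_s_next=phi_s_next)
print(theta)
```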
We refer to the above method as one-step Q-learning because it updates the action value $Q(s, a)$ toward the one-step return $r + \gamma \max_{a'} Q(s', a'; \theta)$. One drawback of using one-step methods is that obtaining a reward $r$ only directly affects the value of the state-action pair $s, a$ that led to the reward. The values of other state-action pairs are affected only indirectly through the updated value $Q(s, a)$. This can make the learning process slow, since many updates are required to propagate a reward to the relevant preceding states and actions.
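A tabular toy example (our own, not from the paper) makes the slowness concrete: in a chain where the only reward sits at the final state, each sweep of one-step updates pushes the reward information back by just one state.

```python
import numpy as np

# Chain of n states with a single action that moves right; reward 1 only at the end.
n, gamma, lr = 5, 0.9, 1.0
Q = np.zeros(n)               # one action per state, so Q is indexed by state only

def sweep(Q):
    """One left-to-right pass of one-step Q-learning updates over the chain."""
    for s in range(n):
        r = 1.0 if s == n - 1 else 0.0
        bootstrap = 0.0 if s == n - 1 else Q[s + 1]   # max over a single action
        Q[s] += lr * (r + gamma * bootstrap - Q[s])

for episode in range(5):
    sweep(Q)
    print(f"after sweep {episode + 1}: {np.round(Q, 3)}")
# The nonzero values creep backwards one state per sweep:
# 1.0 appears first at the final state, then 0.9 one state earlier, then 0.81, and so on.
```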