time than previous GPU-based algorithms, using far fewer resources than massively distributed approaches. The best
of the proposed methods, asynchronous advantage actor-
critic (A3C), also mastered a variety of continuous motor
control tasks as well as learned general strategies for ex-
ploring 3D mazes purely from visual inputs. We believe
that the success of A3C on both 2D and 3D games, discrete
and continuous action spaces, as well as its ability to train
feedforward and recurrent agents makes it the most general
and successful reinforcement learning agent to date.
2. Related Work
The General Reinforcement Learning Architecture (Gorila)
of (Nair et al., 2015) performs asynchronous training of re-
inforcement learning agents in a distributed setting. In Go-
rila, each process contains an actor that acts in its own copy
of the environment, a separate replay memory, and a learner
that samples data from the replay memory and computes
gradients of the DQN loss (Mnih et al., 2015) with respect
to the policy parameters. The gradients are asynchronously
sent to a central parameter server which updates a central
copy of the model. The updated policy parameters are sent
to the actor-learners at fixed intervals. By using 100 sep-
arate actor-learner processes and 30 parameter server in-
stances, a total of 130 machines, Gorila was able to signif-
icantly outperform DQN over 49 Atari games. On many
games Gorila reached the score achieved by DQN over 20
times faster than DQN. We also note that a similar way of
parallelizing DQN was proposed by (Chavez et al., 2015).
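For concreteness, the following Python sketch (our illustration, not Gorila's actual code) shows only the communication pattern described above: several actor-learner workers compute gradients on their own data and send them asynchronously to a central parameter server, pulling fresh parameters at fixed intervals. The DQN loss is replaced by a toy squared-error objective, all class and function names are hypothetical, and threads stand in for the separate machines Gorila actually uses.

```python
import threading
import numpy as np

class ParameterServer:
    """Central copy of the model parameters; gradients arrive asynchronously."""
    def __init__(self, dim, lr=0.01):
        self.theta = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def apply_gradient(self, grad):
        with self.lock:                      # asynchronous update of the central copy
            self.theta -= self.lr * grad

    def get_parameters(self):
        with self.lock:
            return self.theta.copy()

def actor_learner(server, worker_id, steps=2000, sync_every=20):
    """One worker: generates its own experience, computes a gradient of a toy
    squared-error loss (standing in for the DQN loss), and ships it to the server."""
    rng = np.random.default_rng(worker_id)
    theta = server.get_parameters()          # local, possibly stale parameters
    for t in range(steps):
        x = rng.normal(size=theta.shape)     # toy "transition" features
        target = 3.0 * x.sum()               # toy regression target
        grad = 2.0 * (theta @ x - target) * x
        server.apply_gradient(grad)
        if t % sync_every == 0:
            theta = server.get_parameters()  # refresh at fixed intervals

server = ParameterServer(dim=4)
workers = [threading.Thread(target=actor_learner, args=(server, i)) for i in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(server.theta)                          # settles near [3, 3, 3, 3]
```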
In earlier work, (Li & Schuurmans, 2011) applied the MapReduce framework to parallelizing batch reinforcement learning methods with linear function approximation.
Parallelism was used to speed up large matrix operations
but not to parallelize the collection of experience or sta-
bilize learning. (Grounds & Kudenko, 2008) proposed a
parallel version of the Sarsa algorithm that uses multiple
separate actor-learners to accelerate training. Each actor-
learner learns separately and periodically sends updates to
weights that have changed significantly to the other learn-
ers using peer-to-peer communication.
(Tsitsiklis, 1994) studied convergence properties of Q-
learning in the asynchronous optimization setting. These
results show that Q-learning is still guaranteed to converge
when some of the information is outdated as long as out-
dated information is always eventually discarded and sev-
eral other technical assumptions are satisfied. Even earlier,
(Bertsekas, 1982) studied the related problem of distributed
dynamic programming.
Another related area of work is in evolutionary meth-
ods, which are often straightforward to parallelize by dis-
tributing fitness evaluations over multiple machines or
threads (Tomassini, 1999). Such parallel evolutionary ap-
proaches have recently been applied to some visual rein-
forcement learning tasks. In one example, (Koutník et al.,
2014) evolved convolutional neural network controllers for
the TORCS driving simulator by performing fitness evalu-
ations on 8 CPU cores in parallel.
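The parallelism in such methods is typically at the level of fitness evaluations, as in the hedged Python sketch below; the fitness function here is a toy objective standing in for simulator rollouts, and all names are illustrative.

```python
import numpy as np
from multiprocessing import Pool

def fitness(params):
    """Stand-in fitness evaluation; in a visual RL task this would run whole
    episodes of a simulator with a neural-network controller."""
    return -float(np.sum((params - 1.0) ** 2))    # toy objective, maximised at all-ones

def evolve(pop_size=16, dim=8, generations=50, workers=8, seed=0):
    rng = np.random.default_rng(seed)
    population = rng.normal(size=(pop_size, dim))
    best = population[0]
    with Pool(workers) as pool:
        for _ in range(generations):
            # The expensive step, fitness evaluation, is farmed out to worker processes.
            scores = pool.map(fitness, list(population))
            order = np.argsort(scores)
            best = population[order[-1]]
            elite = population[order[-pop_size // 4:]]
            # Mutation-only reproduction from the elite set.
            parents = elite[rng.integers(len(elite), size=pop_size)]
            population = parents + 0.1 * rng.normal(size=parents.shape)
    return best

if __name__ == "__main__":
    print(evolve())
```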
3. Reinforcement Learning Background
We consider the standard reinforcement learning setting where an agent interacts with an environment $\mathcal{E}$ over a number of discrete time steps. At each time step $t$, the agent receives a state $s_t$ and selects an action $a_t$ from some set of possible actions $\mathcal{A}$ according to its policy $\pi$, where $\pi$ is a mapping from states $s_t$ to actions $a_t$. In return, the agent receives the next state $s_{t+1}$ and receives a scalar reward $r_t$. The process continues until the agent reaches a terminal state after which the process restarts. The return $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ is the total accumulated return from time step $t$ with discount factor $\gamma \in (0, 1]$. The goal of the agent is to maximize the expected return from each state $s_t$.
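For a finite episode the return can be computed for every time step with a single backward pass, since $R_t = r_t + \gamma R_{t+1}$; the short snippet below (a toy illustration, not part of the paper's method) does exactly this.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = sum_k gamma^k r_{t+k} for each step of a finite episode,
    working backwards so each return reuses the one after it: R_t = r_t + gamma * R_{t+1}."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))  # [0.81, 0.9, 1.0]
```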
The action value $Q^{\pi}(s, a) = \mathbb{E}\left[ R_t \mid s_t = s, a \right]$ is the expected return for selecting action $a$ in state $s$ and following policy $\pi$. The optimal value function $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$ gives the maximum action value for state $s$ and action $a$ achievable by any policy. Similarly, the value of state $s$ under policy $\pi$ is defined as $V^{\pi}(s) = \mathbb{E}\left[ R_t \mid s_t = s \right]$ and is simply the expected return for following policy $\pi$ from state $s$.
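These definitions can be checked numerically. In the toy two-state MDP below (purely illustrative, with hypothetical helper names), $Q^{\pi}(s, a)$ is estimated by Monte Carlo rollouts that take $a$ first and then follow $\pi$, and $V^{\pi}(s)$ is recovered as $\sum_a \pi(a \mid s)\, Q^{\pi}(s, a)$.

```python
import numpy as np

# Toy 2-state MDP: action 0 keeps the current state, action 1 flips it.
# Reward is 1 whenever the next state is state 1; rollouts are truncated at 10 steps.
rng = np.random.default_rng(0)
gamma, horizon, n_rollouts = 0.9, 10, 5000
policy = np.array([[0.5, 0.5], [0.5, 0.5]])      # pi(a|s), uniform here

def step(s, a):
    s_next = s if a == 0 else 1 - s
    return s_next, float(s_next == 1)

def sampled_return(s, a):
    """One Monte Carlo sample of R_t: take action a in state s, then follow pi."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        s, r = step(s, a)
        total += discount * r
        discount *= gamma
        a = rng.choice(2, p=policy[s])
    return total

Q = np.array([[np.mean([sampled_return(s, a) for _ in range(n_rollouts)])
               for a in range(2)] for s in range(2)])
V = (policy * Q).sum(axis=1)      # V_pi(s) = sum_a pi(a|s) Q_pi(s, a)
print("Q:", np.round(Q, 2))
print("V:", np.round(V, 2))
```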
In value-based model-free reinforcement learning methods, the action value function is represented using a function approximator, such as a neural network. Let $Q(s, a; \theta)$ be an approximate action-value function with parameters $\theta$. The updates to $\theta$ can be derived from a variety of reinforcement learning algorithms. One example of such an algorithm is Q-learning, which aims to directly approximate the optimal action value function: $Q^{*}(s, a) \approx Q(s, a; \theta)$. In one-step Q-learning, the parameters $\theta$ of the action value function $Q(s, a; \theta)$ are learned by iteratively minimizing a sequence of loss functions, where the $i$th loss function is defined as
$$
L_i(\theta_i) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i) \right)^2 \right]
$$
where $s'$ is the state encountered after state $s$.
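The corresponding parameter update is a gradient step on this squared error with the target held fixed at the older parameters $\theta_{i-1}$. Below is a minimal sketch with a linear approximator (the paper uses deep networks; the features, learning rate, and helper names here are placeholders of our own).

```python
import numpy as np

def q_values(theta, phi_s):
    """Linear action-value approximator: Q(s, a; theta) = theta[a] . phi(s)."""
    return theta @ phi_s

def one_step_q_update(theta, theta_old, phi_s, a, r, phi_s_next, gamma=0.99, lr=0.1):
    """One gradient step on (r + gamma * max_a' Q(s', a'; theta_old) - Q(s, a; theta))^2.
    The target uses the previous parameters theta_old and is treated as a constant."""
    target = r + gamma * np.max(q_values(theta_old, phi_s_next))
    td_error = target - q_values(theta, phi_s)[a]
    theta = theta.copy()
    theta[a] += lr * td_error * phi_s      # semi-gradient step (factor of 2 folded into lr)
    return theta

# Tiny usage example with 2 actions and 3-dimensional features.
rng = np.random.default_rng(0)
theta = rng.normal(size=(2, 3)) * 0.01
theta_old = theta.copy()
phi_s, phi_s_next = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
theta = one_step_q_update(theta, theta_old, phi_s, a=0, r=1.0, phi_s_next=phi_s_next)
print(theta)
```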
We refer to the above method as one-step Q-learning because it updates the action value $Q(s, a)$ toward the one-step return $r + \gamma \max_{a'} Q(s', a'; \theta)$. One drawback of using one-step methods is that obtaining a reward $r$ only directly affects the value of the state-action pair $s, a$ that led to the reward. The values of other state-action pairs are affected only indirectly through the updated value $Q(s, a)$. This can make the learning process slow, since many updates are required to propagate a reward to the relevant preceding states and actions.
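A tabular toy example (our own, not from the paper) makes the slowness concrete: in a chain where the only reward sits at the final state, each sweep of one-step updates pushes the reward information back by just one state.

```python
import numpy as np

# Chain of n states with a single action that moves right; reward 1 only at the end.
n, gamma, lr = 5, 0.9, 1.0
Q = np.zeros(n)               # one action per state, so Q is indexed by state only

def sweep(Q):
    """One left-to-right pass of one-step Q-learning updates over the chain."""
    for s in range(n):
        r = 1.0 if s == n - 1 else 0.0
        bootstrap = 0.0 if s == n - 1 else Q[s + 1]   # max over a single action
        Q[s] += lr * (r + gamma * bootstrap - Q[s])

for episode in range(5):
    sweep(Q)
    print(f"after sweep {episode + 1}: {np.round(Q, 3)}")
# The nonzero values creep backwards one state per sweep:
# 1.0 appears first at the final state, then 0.9 one state earlier, then 0.81, and so on.
```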