Value-Decomposition Networks For Cooperative
Multi-Agent Learning
Peter Sunehag
DeepMind
Guy Lever
DeepMind
Audrunas Gruslys
DeepMind
Wojciech Marian Czarnecki
DeepMind
Vinicius Zambaldi
DeepMind
Max Jaderberg
DeepMind
Marc Lanctot
DeepMind
Nicolas Sonnerat
DeepMind
Joel Z. Leibo
DeepMind
Karl Tuyls
DeepMind & University of Liverpool
Thore Graepel
DeepMind
Abstract
We study the problem of cooperative multi-agent reinforcement learning with a
single joint reward signal. This class of learning problems is difficult because of
the often large combined action and observation spaces. In the fully centralized
and decentralized approaches, we find the problem of spurious rewards and a
phenomenon we call the “lazy agent” problem, which arises due to partial observability. We address these problems by training individual agents with a novel value
decomposition network architecture, which learns to decompose the team value
function into agent-wise value functions. We perform an experimental evaluation
across a range of partially-observable multi-agent domains and show that learning
such value-decompositions leads to superior results, in particular when combined
with weight sharing, role information and information channels.
1 Introduction
We consider the cooperative multi-agent reinforcement learning (MARL) problem (Panait and Luke,
2005; Busoniu et al., 2008; Tuyls and Weiss, 2012), in which a system of several learning agents must
jointly optimize a single reward signal – the team reward – accumulated over time. Each agent has
access to its own (“local”) observations and is responsible for choosing actions from its own action
set. Coordinated MARL problems emerge in applications such as coordinating self-driving vehicles
and/or traffic signals in a transportation system, or optimizing the productivity of a factory comprised
of many interacting components. More generally, with AI agents becoming more pervasive, they will
have to learn to coordinate to achieve common goals.
Although in practice some applications may require local autonomy, in principle the cooperative
MARL problem could be treated using a centralized approach, reducing the problem to single-agent
reinforcement learning (RL) over the concatenated observations and combinatorial action space.
We show that the centralized approach consistently fails on relatively simple cooperative MARL
problems in practice. We present a simple experiment in which the centralised approach fails by
learning inefficient policies with only one agent active and the other being “lazy”. This happens
when one agent learns a useful policy, but a second agent is discouraged from learning because its
exploration would hinder the first agent and lead to worse team reward.¹
An alternative approach is to train independent learners to optimize for the team reward. In general each agent is then faced with a non-stationary learning problem because the dynamics of its environment effectively change as teammates change their behaviours through learning (Laurent et al., 2011). Furthermore, since from a single agent’s perspective the environment is only partially observed, agents may receive spurious reward signals that originate from their teammates’ (unobserved) behaviour. Because of this inability to explain its own observed rewards, naive independent RL is often unsuccessful: for example, Claus and Boutilier (1998) show that independent $Q$-learners cannot distinguish teammates’ exploration from stochasticity in the environment, and fail to solve even an apparently trivial, 2-agent, stateless, $3 \times 3$-action problem; moreover, the general Dec-POMDP problem is known to be intractable (Bernstein et al., 2000; Oliehoek and Amato, 2016). Though we focus here on 2-player coordination, we note that the problems with individual learners and centralized approaches only get worse with more agents, since most rewards then do not relate to the individual agent and the action space grows exponentially for the fully centralized approach.
One approach to improving the performance of independent learners is to design individual reward
functions, more directly related to individual agent observations. However, even in the single-agent
case, reward shaping is difficult and only a small class of shaped reward functions are guaranteed to
preserve optimality w.r.t. the true objective (Ng et al., 1999; Devlin et al., 2014; Eck et al., 2016). In
this paper we aim for more general autonomous solutions, in which the decomposition of the team
value function is learned.
We introduce a novel learned additive value-decomposition approach over individual agents. Implicitly, the value decomposition network aims to learn an optimal linear value decomposition from the team reward signal, by back-propagating the total $Q$ gradient through deep neural networks representing the individual component value functions. This additive value decomposition is specifically motivated by avoiding the spurious reward signals that emerge in purely independent learners. The implicit value function learned by each agent depends only on local observations, and so is more easily learned. Our solution also ameliorates the coordination problem of independent learning highlighted in Claus and Boutilier (1998) because it effectively learns in a centralised fashion at training time, while agents can be deployed individually.
Further, in the context of the introduced agent, we evaluate weight sharing, role information and
information channels as additional enhancements that have recently been reported to improve sample
complexity and memory requirements (Hausknecht, 2016; Foerster et al., 2016; Sukhbaatar et al.,
2016). However, our main comparison is between three kinds of architecture; Value-Decomposition
across individual agents, Independent Learners and Centralized approaches. We investigate and
benchmark combinations of these techniques applied to a range of new interesting two-player
coordination domains. We find that Value-Decomposition is a much better performing approach than
centralization or fully independent learners, and that when combined with the additional techniques,
results in an agent that consistently outperforms centralized and independent learners by a big margin.
1.1 Other Related Work
Schneider et al. (1999) consider the optimization of the sum of individual reward functions, by optimizing local compositions of individual value functions learnt from them. Russell and Zimdars (2003) sum the $Q$-functions of independent learning agents with individual rewards, before making the global action selection greedily to optimize for total reward. Our approach works with only a team reward, and learns the value-decomposition autonomously from experience; it similarly differs from the approach with coordination graphs (Guestrin et al., 2002) and the max-plus algorithm (Kuyer et al., 2008; van der Pol and Oliehoek, 2016).
Other work addressing team rewards in cooperative settings is based on difference rewards (Tumer and Wolpert, 2004), measuring the impact of an agent’s action on the full system reward. This reward has nice properties (e.g. high learnability), but can be impractical as it requires knowledge about the system state (Colby et al., 2016; Agogino and Tumer, 2008; Proper and Tumer, 2012). Other approaches can be found in Devlin et al. (2014); HolmesParker et al. (2016); Babes et al. (2008).

¹ For example, imagine training a 2-player soccer team using RL with the number of goals serving as the team reward signal. Suppose one player has become a better scorer than the other. When the worse player takes a shot the outcome is on average much worse, and the weaker player learns to avoid taking shots (Hausknecht, 2016).
2 Background
2.1 Reinforcement Learning
We recall some key concepts of the RL setting (Sutton and Barto, 1998), an agent-environment
framework (Russell and Norvig, 2010) in which an agent sequentially interacts with the environment
over a sequence of timesteps,
$t = 1, 2, 3, \ldots$, by executing actions and receiving observations and rewards, and aims to maximize cumulative reward. This is typically modelled as a Markov decision process (MDP) (e.g. Puterman, 1994) defined by a tuple $\langle S, A, T_1, T, R \rangle$ comprising the state space $S$, the action space $A$, a (possibly stochastic) reward function $R : S \times A \times S \to \mathbb{R}$, a start-state distribution $T_1 \in \mathcal{P}(S)$ and a transition function $T : S \times A \to \mathcal{P}(S)$, where $\mathcal{P}(X)$ denotes the set of probability distributions over the set $X$. We use $\bar{R}$ to denote the expected value of $R$. The agent’s interactions give rise to a trajectory $(S_1, A_1, R_1, S_2, \ldots)$ where $S_1 \sim T_1$, $S_{t+1} \sim T(\cdot \mid S_t, A_t)$ and $R_t = R(S_t, A_t, S_{t+1})$; we denote random variables in upper case and their realizations in lower case. At time $t$ the agent observes $o_t \in O$, which is typically some function of the state $s_t$; when the state is not fully observed the system is called a partially observed Markov decision process (POMDP).
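To make the tuple concrete, here is a toy sketch (ours, not from the paper) of a two-state MDP written out as plain Python structures; all state names, actions and numbers are invented for illustration.

```python
import random

# A toy two-state MDP <S, A, T1, T, R>; every number below is made up for illustration.
S = ["s0", "s1"]                                 # state space
A = ["stay", "move"]                             # action space
T1 = {"s0": 1.0, "s1": 0.0}                      # start-state distribution T1 in P(S)
T = {                                            # transition function T(s, a) in P(S)
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.0, "s1": 1.0},
    ("s1", "move"): {"s0": 0.5, "s1": 0.5},
}
R = {("s0", "move", "s1"): 1.0}                  # sparse reward table R(s, a, s')

rng = random.Random(0)

def step(s, a):
    """Sample s' ~ T(.|s, a) and return (s', r) with r = R(s, a, s')."""
    dist = T[(s, a)]
    s_next = rng.choices(list(dist), weights=list(dist.values()))[0]
    return s_next, R.get((s, a, s_next), 0.0)

print(step("s0", "move"))                        # e.g. ('s1', 1.0)
```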
The agent’s goal is to maximize expected cumulative discounted reward with a discount factor $\gamma$, $\mathcal{R}_t := \sum_{t=1}^{\infty} \gamma^{t-1} R_t$. The agent chooses actions according to a policy: a (stationary) policy is a function $\pi : S \to \mathcal{P}(A)$ from states to probability distributions over $A$. An optimal policy is one which maximizes expected cumulative reward. In fully observed environments, stationary optimal policies exist. In partially observed environments, the policy usually incorporates past agent observations from the history $h_t = a_1 o_1 r_1, \ldots, a_{t-1} o_{t-1} r_{t-1}$ (replacing $s_t$). A practical approach, utilized here, is to parameterize policies using recurrent neural networks.
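As an illustration of conditioning on the history rather than the state, the following is a minimal sketch, in PyTorch with invented layer sizes, of a Q-network that consumes one observation per step and carries a recurrent hidden state summarizing $h_t$; it is our own simplification, not the architecture used in the experiments.

```python
import torch
import torch.nn as nn


class RecurrentQNet(nn.Module):
    """Q(h_t, .) approximated by a GRU over the observation stream.

    Layer sizes are illustrative; the paper's exact architecture may differ.
    """

    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden)
        self.rnn = nn.GRUCell(hidden, hidden)
        self.q_head = nn.Linear(hidden, n_actions)

    def forward(self, obs, h):
        x = torch.relu(self.encoder(obs))   # embed the current observation
        h = self.rnn(x, h)                  # update the summary of the history
        return self.q_head(h), h            # Q-values for every action, new hidden state
```

Acting greedily then means rolling the hidden state forward one observation at a time and taking the argmax over the returned Q-values.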
$V^{\pi}(s) := \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t-1} R(S_t, A_t, S_{t+1}) \mid S_1 = s; A_t \sim \pi(\cdot \mid S_t)\right]$ is the value function, and the action-value function is $Q^{\pi}(s, a) := \mathbb{E}_{S' \sim T(\cdot \mid s, a)}\left[R(s, a, S') + \gamma V^{\pi}(S')\right]$ (generally, we denote the successor state of $s$ by $s'$). The optimal value function is defined by $V^{*}(s) = \sup_{\pi} V^{\pi}(s)$ and similarly $Q^{*}(s, a) = \sup_{\pi} Q^{\pi}(s, a)$. For a given action-value function $Q : S \times A \to \mathbb{R}$ we define the (deterministic) greedy policy w.r.t. $Q$ by $\pi(s) := \arg\max_{a \in A} Q(s, a)$ (ties broken arbitrarily). The greedy policy w.r.t. $Q^{*}$ is optimal (e.g. Szepesvári, 2010).
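As a toy illustration (ours) of extracting the greedy policy from a given $Q$, with $Q$ stored as a table of made-up values:

```python
import numpy as np

# Hypothetical Q-table: rows index states, columns index actions (values made up).
Q = np.array([[0.1, 0.5, 0.2],
              [0.7, 0.3, 0.0]])

greedy = Q.argmax(axis=1)   # pi(s) = argmax_a Q(s, a), ties broken by the first maximiser
V = Q.max(axis=1)           # value of acting greedily w.r.t. Q in each state

print(greedy, V)            # -> [1 0] [0.5 0.7]
```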
2.2 Deep Q-Learning
One method for obtaining $Q^{*}$ is $Q$-learning, which is based on the update $Q_{i+1}(s_t, a_t) = (1 - \eta_t)\, Q_i(s_t, a_t) + \eta_t \left(r_t + \gamma \max_{a} Q_i(s_{t+1}, a)\right)$, where $\eta_t \in (0, 1)$ is the learning rate. We employ the $\varepsilon$-greedy approach to action selection based on a value function, which means that with probability $1 - \varepsilon$ we pick $\arg\max_{a} Q_i(s, a)$ and with probability $\varepsilon$ a random action. Our study focuses on deep architectures for the value function similar to those used by Mnih et al. (2015), and our approach incorporates the key techniques of target networks and experience replay employed there, making the update into a stochastic gradient step. Since we consider partially observed environments, our $Q$-functions are defined over agent observation histories, $Q(h_t, a_t)$, and we incorporate a recurrent network similarly to Hausknecht and Stone (2015). To speed up learning we add the dueling architecture of Wang et al. (2016), which represents $Q$ using a value and an advantage function, and we include multi-step updates with a forward-view eligibility trace (e.g. Harb and Precup, 2016) over a certain number of steps. When training agents, the recurrent network is updated with truncated back-propagation through time (BPTT) for this number of steps. Although we concentrate on DQN-based agent architectures, our techniques are also applicable to policy gradient methods such as A3C (Mnih et al., 2016).
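The tabular form of the update above, together with $\varepsilon$-greedy selection, can be sketched as follows; this is a minimal illustration with our own variable names, and the deep variant replaces the table with a network, target network and replay buffer as described.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, epsilon, n_actions):
    """With probability 1 - epsilon pick argmax_a Q[s, a], otherwise a random action."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def q_learning_update(Q, s, a, r, s_next, eta, gamma):
    """Q_{i+1}(s, a) = (1 - eta) Q_i(s, a) + eta (r + gamma * max_a' Q_i(s', a'))."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = (1.0 - eta) * Q[s, a] + eta * target
    return Q
```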
2.3 Multi-Agent Reinforcement Learning
We consider problems where observations and actions are distributed across $d$ agents, and are represented as $d$-dimensional tuples of primitive observations in $O$ and actions in $A$.
Figure 1: Independent agents architecture showing how local observations enter the networks of two agents over time (three steps shown), pass through the low-level linear layer to the recurrent layer, and then a dueling layer produces individual $Q$-values.
Figure 2: Value-decomposition individual architecture showing how local observations enter the networks of two agents over time (three steps shown), pass through the low-level linear layer to the recurrent layer, and then a dueling layer produces individual "values" that are summed to a joint $Q$-function for training, while actions are produced independently from the individual outputs.
As is standard in MARL, the underlying environment is modeled as a Markov game where actions are chosen and executed simultaneously, and new observations are perceived simultaneously as a result of a transition to a new state (Littman, 1994, 2001; Hu and Wellman, 2003; Busoniu et al., 2008).
Although agents have individual observations and are responsible for individual actions, each agent only receives the joint reward, and we seek to optimize $\mathcal{R}_t$ as defined above. This is consistent with the Dec-POMDP framework (Oliehoek et al., 2008; Oliehoek and Amato, 2016).
If we denote by $\bar{h} := (h^1, h^2, \ldots, h^d)$ a tuple of agent histories, a joint policy is in general a map $\pi : H^d \to \mathcal{P}(A^d)$; we in particular consider policies where, for any history $\bar{h}$, the distribution $\pi(\bar{h})$ has independent components in $\mathcal{P}(A)$. Hence, we write $\pi : H^d \to \mathcal{P}(A)^d$. The exception is when we use the most naive centralized agent with a combinatorial action space, aka joint action learners.
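To make the distinction concrete, here is a small sketch (ours) of the two ways of selecting a joint action for $d$ agents with $n$ actions each: the naive joint-action learner searches over all $n^d$ combinations, while a factored policy lets each agent pick its own component independently.

```python
import itertools
import numpy as np

def joint_greedy(Q_joint, n_actions, d):
    """Centralized selection: argmax over all n_actions ** d joint actions.

    Q_joint is any callable mapping a joint-action tuple to a scalar value.
    """
    best, best_val = None, -np.inf
    for joint_a in itertools.product(range(n_actions), repeat=d):
        val = Q_joint(joint_a)                     # one value for the whole action tuple
        if val > best_val:
            best, best_val = joint_a, val
    return best

def factored_greedy(per_agent_q):
    """Decentralized selection: each agent maximizes its own Q independently."""
    return tuple(int(np.argmax(q)) for q in per_agent_q)

# Example with d = 2 agents, n = 3 actions each, using made-up per-agent Q-values.
per_agent_q = [np.array([0.1, 0.9, 0.0]), np.array([0.4, 0.2, 0.3])]
print(factored_greedy(per_agent_q))                # -> (1, 0)
```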
3 A Deep-RL Architecture for Coop-MARL
Building on purely independent DQN-style agents (see Figure 1), we add enhancements to overcome
the identified issues with the MARL problem. Our main contribution of value-decomposition is
illustrated by the network in Figure 2.
The main assumption we make and exploit is that the joint action-value function for the system can be additively decomposed into value functions across agents,
$$Q\big((h^1, h^2, \ldots, h^d), (a^1, a^2, \ldots, a^d)\big) \approx \sum_{i=1}^{d} \tilde{Q}_i(h^i, a^i),$$
where each $\tilde{Q}_i$ depends only on that agent’s local observations. We learn the $\tilde{Q}_i$ by backpropagating gradients from the $Q$-learning rule using the joint reward through the summation, i.e. $\tilde{Q}_i$ is learned implicitly rather than from any reward specific to agent $i$, and we do not impose constraints that the $\tilde{Q}_i$ are action-value functions for any specific reward. The value decomposition layer can be seen in the top layer of Figure 2. One property of this approach is that, although learning requires some centralization, the learned agents can be deployed independently, since each agent acting greedily with respect to its local value $\tilde{Q}_i$ is equivalent to a central arbiter choosing joint actions by maximizing the sum $\sum_{i=1}^{d} \tilde{Q}_i$.
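The decomposition can be sketched in PyTorch as follows; this is our own simplified illustration (feed-forward agent networks with invented layer sizes rather than the recurrent dueling agents of Figure 2), showing how the per-agent values $\tilde{Q}_i$ are summed into a joint $Q$ that is trained from the single team reward, while greedy execution remains fully decentralized.

```python
import torch
import torch.nn as nn


class AgentQNet(nn.Module):
    """Per-agent network producing Q~_i(h_i, .) from that agent's own observation.

    A feed-forward stand-in for the recurrent dueling agent network in Figure 2.
    """

    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)


class VDN(nn.Module):
    """Joint Q(h, a) ~= sum_i Q~_i(h_i, a_i); only the sum is trained on the team reward."""

    def __init__(self, obs_dim, n_actions, n_agents):
        super().__init__()
        self.agents = nn.ModuleList([AgentQNet(obs_dim, n_actions) for _ in range(n_agents)])

    def forward(self, obs_per_agent, actions):
        # obs_per_agent: list of [batch, obs_dim] tensors; actions: LongTensor [batch, n_agents]
        q_i = [net(o).gather(1, actions[:, i:i + 1])           # chosen-action value per agent
               for i, (net, o) in enumerate(zip(self.agents, obs_per_agent))]
        return torch.cat(q_i, dim=1).sum(dim=1)                # joint Q = sum of components

    def greedy_actions(self, obs_per_agent):
        # Decentralized execution: each agent argmaxes its own Q~_i locally.
        return [net(o).argmax(dim=1) for net, o in zip(self.agents, obs_per_agent)]


# One Q-learning step on the team reward (target network and replay omitted for brevity):
# q_joint = vdn(obs, actions)
# with torch.no_grad():
#     next_q = sum(net(o).max(dim=1).values for net, o in zip(vdn.agents, next_obs))
#     target = team_reward + gamma * next_q
# loss = nn.functional.mse_loss(q_joint, target)
```

A full training step would add the target network, experience replay and multi-step targets from Section 2.2; only the summed output ever sees the team reward.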