a new environment state, it performs an action and obtains a
reward.
The RL training of agents is a greedy iterative process: it usually starts with randomized exploration, implemented by initializing a fully stochastic policy, and then revises the policy according to the rewards received at each iteration. RL explores the policy space and favors policies that better approximate the globally optimal policy. Therefore, in theory, the more policy subspaces an agent cumulatively explores at each iteration, the higher the probability that it obtains a better policy.
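To make this iterative process concrete, the sketch below shows a generic reward-driven policy revision on a toy bandit-style problem. The environment, the softmax policy parameterization, and the REINFORCE-style update rule are illustrative assumptions only, not anything specific to this paper.

```python
import numpy as np

# Generic illustration of the iterative RL process described above:
# start from a fully stochastic policy and revise it using received rewards.
# Hypothetical toy setting, not the paper's method.

rng = np.random.default_rng(0)
n_actions = 4
true_rewards = np.array([0.1, 0.5, 0.2, 0.8])   # hidden environment rewards

logits = np.zeros(n_actions)                     # uniform (fully stochastic) policy
lr = 0.1

for iteration in range(500):
    probs = np.exp(logits) / np.exp(logits).sum()        # softmax policy
    action = rng.choice(n_actions, p=probs)              # stochastic exploration
    reward = true_rewards[action] + 0.1 * rng.normal()   # noisy feedback

    # Revise the policy toward actions that received higher rewards
    # (gradient of the log-probability of the taken action).
    grad = -probs
    grad[action] += 1.0
    logits += lr * reward * grad

print("learned action probabilities:",
      np.round(np.exp(logits) / np.exp(logits).sum(), 3))
```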
Challenges. However, traditional single-agent reinforcement learning approaches ignore the interactions and decision strategies of other competitors. There are mainly three challenges in extending reinforcement learning from single-agent to multi-agent scenarios. (a) Optimizing the action policy among competitors. Generally, a single-agent reinforcement learning (SRL) method only optimizes the action policy for a specific agent and does not model the interactions between multiple agents. Consequently, it is difficult to use SRL to optimize the action policy for a specific agent acting simultaneously among a group of competitors. (b) Learning the action policy from sparse feedback. Since history never repeats itself, historical data only sparsely record feedback for the actions that actually happened, which makes it hard to effectively learn optimized policies from historical data with sparse “action-state-reward” tuples. (c) Inferring counterfactual feedback. One solution to the sparse-feedback issue is to infer counterfactual feedback for the historical non-chosen actions, which has the potential to improve the learning efficiency of agent action policies. However, it remains a challenge to counterfactually infer the feedback an environment would have provided had an agent performed a different action at the same historical moment.
Many existing works have applied multi-agent deep reinforcement learning frameworks to mitigate these issues in environments with several agents. However, most of them [7], [8], [9] still do not incorporate the counterfactual information contained in historical observation data, which could further improve the learning efficiency of agents.
Our Solutions and Contributions. To address the aforementioned challenges, in this paper we formalize our problem as competitive multi-agent deep reinforcement learning with a centralized critic [8] and improve the learning efficiency of agents by estimating their possible rewards based on historical observations. To this end, we propose a CounterFactual Thinking (CFT) agent under the off-policy actor-critic framework, which mimics human psychological activity. The CFT agent works as follows: when it observes a new environment state, it uses several parallel policies to develop action options, or intents, and estimates the returns of those intents according to its current understanding of the environment, through the regrets accumulated in previous iterations. This resembles the psychological process by which people make reactive choices based on their own experience and environment [10]. With the estimated returns, the CFT agent chooses one of the policies to generate its practical actions and receives new regrets for the non-chosen policies by measuring the loss between the estimated returns and the practical rewards. This also mimics the psychological phenomenon that people feel regret after making a decision when they observe the gap between the ideal and the reality. The received regrets then help the CFT agent choose among its policies in the next iteration.
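To make this loop concrete, the following is a minimal sketch of one possible reading of the counterfactual-thinking process: parallel policies propose intents, a critic estimates their returns, the agent acts with the policy that balances estimated return against accumulated regret, and every policy (including the non-chosen ones) receives new regret from the gap between estimated and practical rewards. The linear policies, the placeholder critic, the toy environment, and the regret-update rule are all illustrative assumptions, not the paper's actual CFT implementation, which is detailed in Section III.

```python
import numpy as np

rng = np.random.default_rng(0)
K, STATE_DIM, ACTION_DIM = 4, 8, 2   # number of parallel policies and toy dimensions

# K parallel linear policies standing in for the paper's intent generators.
policies = [rng.normal(size=(STATE_DIM, ACTION_DIM)) for _ in range(K)]
regrets = np.zeros(K)                # accumulated regret per policy

def estimate_return(state, action):
    """Hypothetical critic: estimated return of an intent in this state."""
    return float(-np.linalg.norm(action))

def env_step(state, action):
    """Hypothetical environment: returns (next_state, practical_reward)."""
    next_state = np.tanh(state + 0.1 * rng.normal(size=STATE_DIM))
    reward = float(-np.linalg.norm(action - 0.5))
    return next_state, reward

state = rng.normal(size=STATE_DIM)
for step in range(100):
    # 1) Develop intents with the parallel policies and estimate their returns.
    intents = [state @ W for W in policies]
    est_returns = np.array([estimate_return(state, a) for a in intents])

    # 2) Choose the policy whose estimated return is high and accumulated regret is low.
    chosen = int(np.argmax(est_returns - regrets))

    # 3) Act with the chosen policy and observe the practical reward.
    state, reward = env_step(state, intents[chosen])

    # 4) Update regrets for all policies from the gap between their estimated
    #    returns and the practical reward actually received.
    regrets = 0.9 * regrets + np.abs(est_returns - reward)
```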
It is worth mentioning that the proposed CFT agent is also more effective than existing multi-agent deep reinforcement learning methods during training, since the parallel policy structure helps CFT agents search a wider range of policy subspaces at each iteration. Therefore, it can also be more informative than other related methods in multi-agent environments. We apply the CFT agent to several competitive multi-agent reinforcement learning tasks (waterworld, pursuit-evasion [11]) and to real-world applications. The experimental results show that CFT agents learn more competitive action policies than the alternatives.
In summary, the main contributions of this paper are as follows:
• We study the problem of agents competing with each other in a multi-agent environment under a competitive multi-agent deep reinforcement learning framework. Within this framework, we define the competitive ability of an agent as its ability to explore more policy subspaces.
• We propose the counterfactual thinking (CFT) agent to enhance the competitive ability of agents in multi-agent environments. CFT generates several potential intents through a parallel policy structure and learns the corresponding regrets from the difference between estimated returns and practical rewards. The intent-generation and regret-learning processes supervise each other through a max-min process.
• We demonstrate that CFT agents are more effective than their opponents in both simulated and real-world environments while maintaining high performance efficiency. This shows that the counterfactual thinking mechanism helps agents explore more policy subspaces than the alternatives within the same number of iterations.
Organization. The remainder of this paper is organized as
follows. In Section II, we introduce some background knowl-
edge on multi-agent reinforcement learning. In Section III,
we first give an overview of our proposed framework and then present each of its modules in detail. Section IV describes the datasets and environments used in our experiments and shows the experimental results
and analysis. Section V highlights some works related to this
paper. Finally, we conclude this paper and give some future
research directions in Section VI.
II. PRELIMINARIES
When an agent tries to maximize its interests in an environment, it must consider both the reward it receives after each action and the feedback from the environment. This can be simplified as a Markov decision process (MDP)