a new environment state, it performs an action and obtains a
reward.
The RL training of agents is a greedy iterative process: it usually starts with randomized exploration, implemented by initializing a fully stochastic policy, and then revises the policy according to the rewards received at each iteration. RL explores the policy space and favors policies that better approximate the globally optimal policy. Therefore, in theory, the more policy subspaces an agent cumulatively explores at each iteration, the higher the probability that it obtains a better policy.
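To make this iterative process concrete, the sketch below shows a generic reward-driven policy revision on a toy bandit-style problem. The environment, the softmax policy parameterization, and the REINFORCE-style update rule are illustrative assumptions only, not anything specific to this paper.

```python
import numpy as np

# Generic illustration of the iterative RL process described above:
# start from a fully stochastic policy and revise it using received rewards.
# Hypothetical toy setting, not the paper's method.

rng = np.random.default_rng(0)
n_actions = 4
true_rewards = np.array([0.1, 0.5, 0.2, 0.8])   # hidden environment rewards

logits = np.zeros(n_actions)                     # uniform (fully stochastic) policy
lr = 0.1

for iteration in range(500):
    probs = np.exp(logits) / np.exp(logits).sum()        # softmax policy
    action = rng.choice(n_actions, p=probs)              # stochastic exploration
    reward = true_rewards[action] + 0.1 * rng.normal()   # noisy feedback

    # Revise the policy toward actions that received higher rewards
    # (gradient of the log-probability of the taken action).
    grad = -probs
    grad[action] += 1.0
    logits += lr * reward * grad

print("learned action probabilities:",
      np.round(np.exp(logits) / np.exp(logits).sum(), 3))
```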
Challenges. However, traditional single-agent reinforcement learning approaches ignore the interactions and decision strategies of other competitors. There are mainly three challenges in extending reinforcement learning from single-agent to multi-agent scenarios. (a) Optimizing the action policy among competitors. Generally, a single-agent reinforcement learning (SRL) method only optimizes the action policy for a specific agent and does not model the interactions between multiple agents. Consequently, it is difficult to use SRL to optimize the action policy for a specific agent acting simultaneously among a group of competitors. (b) Learning the action policy from sparse feedback. Since history never repeats itself, historical data only sparsely record feedback for the actions that actually happened, which makes it hard to effectively learn optimized policies from historical data with sparse “action-state-reward” tuples. (c) Inferring counterfactual feedback. One solution to the sparse-feedback issue is to infer counterfactual feedback for the historical non-chosen actions, which has the potential to improve the learning efficiency of agent action policies. However, it remains a challenge to counterfactually infer the feedback an environment would have provided had an agent performed a different action at the same historical moment.
Many existing works have applied multi-agent deep reinforcement learning frameworks to mitigate these issues in environments with several agents. However, most of them [7], [8], [9] still do not incorporate the counterfactual information contained in historical observation data, which could further improve the learning efficiency of agents.
Our Solutions and Contributions. To address the aforementioned challenges, in this paper we formalize our problem as competitive multi-agent deep reinforcement learning with a centralized critic [8] and improve the learning efficiency of agents by estimating their possible rewards based on historical observations. To this end, we propose a CounterFactual Thinking (CFT) agent under the off-policy actor-critic framework, which mimics human psychological activity. The CFT agent works as follows: when it observes a new environment state, it uses several parallel policies to develop action options, or intents, and estimates the returns of those intents according to its current understanding of the environment, through the regrets accumulated in previous iterations. This resembles the psychological process by which people make reactive choices based on their own experience and environment [10]. With the estimated returns, the CFT agent chooses one of the policies to generate its practical actions and receives new regrets for the non-chosen policies by measuring the loss between the estimated returns and the practical rewards. This also mimics the psychological phenomenon that people feel regret after making a decision when they observe the gap between the ideal and the reality. The received regrets then help the CFT agent choose among its policies in the next iteration.
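To make this loop concrete, the following is a minimal sketch of one possible reading of the counterfactual-thinking process: parallel policies propose intents, a critic estimates their returns, the agent acts with the policy that balances estimated return against accumulated regret, and every policy (including the non-chosen ones) receives new regret from the gap between estimated and practical rewards. The linear policies, the placeholder critic, the toy environment, and the regret-update rule are all illustrative assumptions, not the paper's actual CFT implementation, which is detailed in Section III.

```python
import numpy as np

rng = np.random.default_rng(0)
K, STATE_DIM, ACTION_DIM = 4, 8, 2   # number of parallel policies and toy dimensions

# K parallel linear policies standing in for the paper's intent generators.
policies = [rng.normal(size=(STATE_DIM, ACTION_DIM)) for _ in range(K)]
regrets = np.zeros(K)                # accumulated regret per policy

def estimate_return(state, action):
    """Hypothetical critic: estimated return of an intent in this state."""
    return float(-np.linalg.norm(action))

def env_step(state, action):
    """Hypothetical environment: returns (next_state, practical_reward)."""
    next_state = np.tanh(state + 0.1 * rng.normal(size=STATE_DIM))
    reward = float(-np.linalg.norm(action - 0.5))
    return next_state, reward

state = rng.normal(size=STATE_DIM)
for step in range(100):
    # 1) Develop intents with the parallel policies and estimate their returns.
    intents = [state @ W for W in policies]
    est_returns = np.array([estimate_return(state, a) for a in intents])

    # 2) Choose the policy whose estimated return is high and accumulated regret is low.
    chosen = int(np.argmax(est_returns - regrets))

    # 3) Act with the chosen policy and observe the practical reward.
    state, reward = env_step(state, intents[chosen])

    # 4) Update regrets for all policies from the gap between their estimated
    #    returns and the practical reward actually received.
    regrets = 0.9 * regrets + np.abs(est_returns - reward)
```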
It is worth mentioning that the proposed CFT agent is also more effective than existing multi-agent deep reinforcement learning methods during training, since the parallel policy structure helps CFT agents search a wider range of policy subspaces at each iteration. Therefore, it can also be more informative than other related methods in multi-agent environments. We apply the CFT agent to several competitive multi-agent reinforcement learning tasks (waterworld, pursuit-evasion [11]) and to real-world applications. The experimental results show that CFT agents learn more competitive action policies than the alternatives.
In summary, the main contributions of this paper are as follows:
• We study the problem of agents competing with each other in a multi-agent environment under a competitive multi-agent deep reinforcement learning framework. Within this framework, we define the competitive ability of an agent as its ability to explore more policy subspaces.
• We propose the counterfactual thinking (CFT) agent to enhance the competitive ability of agents in multi-agent environments. CFT generates several potential intents through a parallel policy structure and learns the corresponding regrets from the difference between estimated returns and practical rewards. The intent-generation and regret-learning processes supervise each other through a max-min process.
• We demonstrate that CFT agents are more effective than their opponents in both simulated and real-world environments while maintaining high performance efficiency. This shows that the counterfactual thinking mechanism helps agents explore more policy subspaces than the alternatives within the same number of iterations.
Organization. The remainder of this paper is organized as
follows. In Section II, we introduce some background knowl-
edge on multi-agent reinforcement learning. In Section III,
we first give an overview of our proposed framework and then present each of its modules in detail. Section IV describes the datasets and environments used in our experiments and shows the experimental results
and analysis. Section V highlights some works related to this
paper. Finally, we conclude this paper and give some future
research directions in Section VI.
II. PRELIMINARIES
When an agent tries to maximize its interests in an environment, it must consider both the reward it receives after each action and the feedback from the environment. This can be simplified as a Markov decision process (MDP)