INTELLIGENT AUTONOMOUS INTERSECTION MANAGEMENT

Udesh Gunarathna
University of Melbourne, Melbourne, Australia
[email protected].edu.au

Shanika Karunasekara
University of Melbourne, Melbourne, Australia
karus@unimelb.edu.au

Renata Borovica-Gajic
University of Melbourne, Melbourne, Australia
renata.borovica@unimelb.edu.au

Egemen Tanin
University of Melbourne, Melbourne, Australia
etanin@unimelb.edu.au

arXiv:2202.04224v1 [cs.MA] 9 Feb 2022
ABSTRACT
Connected Autonomous Vehicles (CAVs) will make autonomous intersection management a reality, replacing
traditional traffic signal control. Autonomous intersection management requires adjusting the timing and speed
of vehicles arriving at an intersection so that they pass through the intersection collision-free.
Due to its computational complexity, this problem has been studied only when vehicle arrival times
towards the vicinity of the intersection are known beforehand, which limits the applicability of
these solutions for real-time deployment. To solve the real-time autonomous traffic intersection
management problem, we propose a reinforcement learning (RL) based multiagent architecture and a
novel RL algorithm coined multi-discount Q-learning. In multi-discount Q-learning, we introduce
a simple yet effective way to solve a Markov Decision Process by preserving both short-term and
long-term goals, which is crucial for collision-free speed control. Our empirical results show that our
RL-based multiagent solution achieves near-optimal performance efficiently when minimizing the
travel time through an intersection.
Keywords: Deep Reinforcement Learning · Markov Decision Process · Multiagent Systems · Autonomous Intersection Management
1 Introduction
Autonomous Intersection Management (AIM) is a paradigm proposed by Dresner et al. [1] to replace traditional traffic
signal control. In AIM, each CAV arriving at an intersection reserves a time to reach the intersection crossing point
via an intersection manager. Then, each CAV's speed is controlled to adhere to the schedule while guaranteeing safety¹.
Due to the ability of AIM to reduce congestion at intersections [2], AIM has been widely studied [3, 4]. However, most
problem formulations assume that the arrival times of vehicles to the vicinity of the intersection are known a priori.
This assumption does not hold when dealing with real-time traffic, which limits the applicability of these solutions to
real-life scenarios [5].
For AIM to be applicable to real-life traffic control, the speed of the vehicles has to be computed in real time, and the
computational time for this plays a critical role in the feasibility of AIM. Previous efforts exhibit high computational
time [6] as they rely on mathematical programming or analytical methods. The impact of high computational time
is twofold. First, if the intersection controller takes a long time to compute the scheduled times, the positional
difference of CAVs before and after the computation poses a significant safety risk: the given scheduled times may no
longer be feasible, and vehicles may crash as a result. Second, if the time for computing the CAV speed trajectory is
high, the remaining time after the computation may not be sufficient to reach the intersection crossing point precisely
at the scheduled time.

¹ In traditional AIM, vehicles traverse the intersection on a First Come, First Served basis. In this research, we optimize
the traversal order to reduce congestion further, e.g., to support platooning.
Recent work [5] develops a stochastic solution to the problem, assuming vehicle arrival times are not known beforehand.
However, a part of their solution employs linear programming (LP) for every CAV's arrival computation, which is
prohibitively expensive, as further demonstrated in Section 8. Further, their method is only applicable to intersections
with single-lane roads without turning directions.
Our work fills this research gap by providing a computationally efficient, deployable, multiagent solution based
on Reinforcement Learning (RL) for AIM. Our solution consists of two main sets of agents. The first is a
polling-based coordinating agent (intersection controller) positioned at the intersection. The second is a set
of distributed RL agents, assigned on a per-vehicle basis. The coordinating agent communicates with the
RL agents to schedule time intervals to reach the intersection for each CAV that is within a certain distance from the
intersection. The coordinating agent uses a novel polling algorithm to handle multi-lane intersections with multiple
turning directions, thereby overcoming a major limitation of previous work [5]. Once the coordinating agent provides
a time schedule, an RL agent controls each vehicle's speed to adhere to it. The advantage of such an approach is that
once an RL agent is trained offline, decision-making can be done much faster online. This avoids the computational
overhead incurred by LP-based techniques.
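To make the division of responsibilities concrete, the following is a minimal sketch of the two agent types. All class
and method names here (IntersectionController, VehicleAgent, assign_arrival_time, choose_acceleration) and the slot
length are illustrative assumptions, not identifiers or values from the paper.

# Minimal sketch, assuming hypothetical names and parameters.

class IntersectionController:
    """Polling-based coordinating agent positioned at the intersection."""

    def __init__(self, slot_length=2.0):
        self.slot_length = slot_length   # seconds reserved per crossing (assumed value)
        self.next_free_slot = 0.0

    def assign_arrival_time(self, current_time):
        # Grant the earliest crossing slot that does not overlap an already
        # scheduled vehicle; a stand-in for the paper's polling algorithm.
        slot = max(current_time, self.next_free_slot)
        self.next_free_slot = slot + self.slot_length
        return slot


class VehicleAgent:
    """Distributed RL agent controlling a single CAV's speed."""

    def __init__(self, policy):
        self.policy = policy             # trained offline, queried online
        self.scheduled_time = None

    def receive_schedule(self, scheduled_time):
        self.scheduled_time = scheduled_time

    def choose_acceleration(self, observation):
        # Fast online inference: no per-step optimization problem is solved.
        return self.policy(observation, self.scheduled_time)


# Example interaction between the two components.
controller = IntersectionController()
agent = VehicleAgent(policy=lambda obs, t_sched: 0.0)   # placeholder policy
agent.receive_schedule(controller.assign_arrival_time(current_time=12.0))

The key design point this sketch reflects is that the per-vehicle agents only evaluate a pre-trained policy online, while
the coordinating agent handles scheduling, so no optimization problem is solved in the control loop.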
The learning task for the RL agent is twofold: (1) learn to control a vehicle's trajectory to reach the intersection
precisely at the scheduled time, and (2) keep a safe distance from the vehicle in front. Keeping a safe distance is a
task with a short-term goal, because the front vehicle (whether driven by an RL agent or a human) can change its
driving behaviour within short time intervals. In contrast, reaching the intersection at a scheduled time requires
long-term planning, because success is only determined at the end of the trajectory. Combining these two learning
problems into a single task, as done in previous work [7], will only learn one of the problems, because each learning
problem contains an objective with a different time horizon. Traditional RL algorithms such as Q-learning use a fixed
parameter, named the discount factor, to control the length of the time horizon. Using a fixed discount factor focuses
on learning either the short-term or the long-term task successfully [8], and fails when both short-term and long-term
tasks need to be learned simultaneously, hence being unsuitable for our task. We propose a novel RL algorithm coined
multi-discount Q-learning to achieve short-term goals while following a long-term goal in a single Markov Decision
Process. We believe our proposed method is applicable to other problem domains that exhibit a mix of long-term and
short-term goals, such as robotics [9].
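To see why a single fixed discount factor struggles here, note that the effective planning horizon of a discounted
return scales roughly as 1/(1 − γ). The short sketch below illustrates this, together with the general idea of giving
each objective its own discount; it is an illustration of the motivation only, not the paper's multi-discount Q-learning
update, and the discount values and function names are assumptions.

def effective_horizon(gamma):
    # Rough number of future steps that meaningfully affect the return.
    return 1.0 / (1.0 - gamma)

print(effective_horizon(0.8))    # ~5 steps   -> suits short-term headway keeping
print(effective_horizon(0.99))   # ~100 steps -> suits long-term arrival scheduling

# General idea: give each reward component its own discount so that neither
# objective's time scale is sacrificed (illustrative, not the paper's rule).
GAMMA_SAFETY, GAMMA_SCHEDULE = 0.8, 0.99

def td_targets(r_safety, r_schedule, q_next_safety, q_next_schedule):
    return (r_safety + GAMMA_SAFETY * q_next_safety,
            r_schedule + GAMMA_SCHEDULE * q_next_schedule)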
Our contributions are four-fold: (1) We propose a computationally efficient multiagent solution for AIM. (2) We
introduce a novel reinforcement learning algorithm that can effectively learn multiple learning tasks. (3) We propose a
novel polling algorithm to handle multi-lane intersections with multiple turning directions. (4) We demonstrate the
effectiveness and efficiency of our approach against several baselines using microscopic traffic simulations.
2 Related Work
2.1 Autonomous Intersection Management
Two inter-dependent sub-problems have been studied in the literature to optimize AIM: (1) find a distinct time-schedule
for each incoming vehicle to arrive at the intersection, and (2) find a vehicle trajectory such that the vehicle arrives
at the intersection exactly at the scheduled time². The existing work can be divided into two main categories based on
the proposed solutions to these two sub-problems.
The first category of work optimizes the time-schedule (sub-problem (1)) as a scheduling problem and then computes
vehicle trajectories which adhere to the optimized time-schedule [4, 11, 12]. The scheduling optimization is NP-hard
[6], which makes it computationally expensive. Another drawback of this kind of approach is that all the arrival times
of vehicles need to be known beforehand to optimize the time-schedule. This property limits their applicability to
real-time settings where vehicle arrival times are stochastic.
The second category of work focuses on finding a safe or fastest trajectory as a solution to sub-problem (2), whilst
employing a heuristic to compute the time-schedule of sub-problem (1) [3, 13, 14]. Finding the optimal trajectory is,
however, prohibitively expensive in real time when using a method like linear programming, as demonstrated in our
experiments. For example, Au et al. [14] use an analytical method to find the trajectory using a set-point schedule and
a bisection method. However, to simplify the search space when deciding on the trajectory, their work does not consider
the maximum velocity, which leads to a sub-optimal trajectory. Even though these approaches are able to compute
trajectories, they do not optimize sub-problem (1). Thus, they neither optimize the throughput nor reduce waiting time
at intersections. In contrast, our objective is to maximize the throughput at the intersection.

² If the objective is to optimize the throughput, then the vehicle trajectory should also reach the intersection at the
maximum speed [10].
Recent work [5] proposed a solution to optimize the throughput in a stochastic setting using a polling system and linear
programming. The polling system is used to optimize the time-schedule for each vehicle. Then, a linear program is
solved for each vehicle to find its trajectory. These linear programs need to be computed sequentially (centralized):
the linear program for the first vehicle arriving at the intersection must be computed first, then that of the next
vehicle, and so on. Although this approach can successfully solve the stochastic AIM problem, the computational time
required for linear programming hinders the applicability of this solution for real-time usage. On the contrary, we
propose a distributed learning-based solution, which enables real-time AIM.
2.2 RL with Variable Discounting
Learning a task is difficult when there are objectives with different time scales. The time scale of a task is directly
impacted by the discount factor of RL agents (e.g., Q-learning or SARSA) [8]. Thus, most past efforts focus on changing
the discount to learn tasks with multiple time scales. Edwards et al. [15] extend the LP formulations of [8] and propose
a multiple-discount SARSA algorithm by considering the reward as a vector and using a separate action-value function
(Q-function) for every sub-task. Human intervention is then needed to find the best policy among these sub-tasks. Burda
et al. [16] follow a similar approach to combine intrinsic and extrinsic rewards. An automated approach is proposed by
Li et al. [17] to combine different objectives learned by a set of factored Q-functions using a lexicographic ordering
of objectives. Finding such a lexicographic ordering of objectives is non-trivial and can be problem-dependent. In the
above-mentioned approaches, each sub-task is learned separately. Because of that, the inter-relationship between
sub-tasks is ignored, and the number of parameters to be learned is high. In contrast, we propose a simple and
memory-efficient method to learn each sub-task using a single action-value function (a single Q-function) whilst
preserving the time scales of the sub-tasks. As we show in our experiments, our proposed method achieves superior
results in attaining both short-term and long-term goals.
A single Q-function is similarly used with a state-dependent discount function where each state is discounted
differently [18]. A generic hyperbolic discount function has also been proposed as an alternative to the traditional
exponential discount function [19]. However, these algorithms focus on a single scalar reward, whereas our proposed
method extends to multi-objective cases which contain multiple rewards with different temporal objectives, allowing us
to preserve the time scale of each objective.
3 Background
Our proposed architecture uses a polling system to schedule the incoming CAVs, and Q-learning to determine the CAV
trajectories. We provide a brief introduction to both.
Polling system:
A polling system consists of a single server and a set of queues. Each queue contains a number of customers (in
First-In First-Out (FIFO) order). Customers may arrive at a queue in a stochastic order. The server can start serving a
customer from any queue. The service time is the time taken to serve one customer. Once the server has served a
customer, it can select the next customer from any queue. When switching between queues, the server has to wait for an
additional time called the switch-over time. The strategy that the server uses to determine from which queue the next
customer is selected is called the policy. Several policies exist in the literature, such as k-limited, gated, and
exhaustive. The definitions of customers, switch-over time and service time related to AIM are described in Section 5.1.
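As an illustration of these definitions, the following is a minimal sketch of a polling system under an exhaustive
policy, where the server empties the current queue before paying a switch-over time and moving on. The queue contents,
service time and switch-over time are arbitrary example values, not values from the paper.

from collections import deque

def simulate(queues, service_time=1.0, switch_over_time=0.5):
    """queues: list of deques of customer ids; returns (finish_times, total_time)."""
    clock = 0.0
    finish_times = {}
    current = 0
    while any(queues):
        q = queues[current]
        while q:                          # exhaustive policy: serve until the queue is empty
            customer = q.popleft()
            clock += service_time
            finish_times[customer] = clock
        if any(queues):                   # pay the switch-over time only if work remains
            clock += switch_over_time
        current = (current + 1) % len(queues)
    return finish_times, clock

# Example: two lanes (queues) of waiting vehicles (customers).
lanes = [deque(["v1", "v2"]), deque(["v3"])]
print(simulate(lanes))   # ({'v1': 1.0, 'v2': 2.0, 'v3': 3.5}, 3.5)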
Q-learning:
In RL, a problem first needs to be formulated as a Markov decision process (MDP). An MDP consists of a state space $S$
and an action space $A$. When an action $a_t \in A$ is taken in the current state $s_t \in S$ at time $t$, the MDP's
state changes to $s_{t+1}$ according to the transition probability $T(s_t, a_t, s_{t+1}) = Pr(s_{t+1} \mid s_t, a_t)$.
The MDP provides a reward $r_t$ for the transition, where $r_t$ is assigned according to $R(s_t, a_t, s_{t+1})$. An RL
agent acting on the MDP consists of a policy $\pi(a \mid s)$ which describes the agent's behaviour. The policy
$\pi(a \mid s)$ indicates the probability of the agent taking action $a$ in state $s$. The objective of the agent is to
maximize the expected reward $G_t$ starting from any given time step $t$. The expected reward is defined as
$G_t = \sum_{\tau=t}^{T} \gamma^{\tau - t} r_\tau$, where $t$ is the current time step, $\gamma$ is the discount factor
and $T$ is the time at which the MDP reaches its terminal state.
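As a short numeric illustration of this definition (values chosen arbitrarily): with $\gamma = 0.9$ and rewards
$r_t = r_{t+1} = r_{t+2} = 1$ followed by termination, the return is
$G_t = 1 + 0.9 \cdot 1 + 0.9^2 \cdot 1 = 2.71$.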
The action-value function (Q-function) for policy $\pi$ stores the expected reward of taking action $a$ in state $s$,
defined as $Q^{\pi}(s, a) = \mathbb{E}[G_t \mid s_t = s, a_t = a, \pi]$. Q-learning approximates the optimal Q-function
iteratively by observing the transitions $(s_t, a_t, s_{t+1}, r_t)$ at every time step when $T$ is unknown. Considering
the reward from $n$ steps, we get the following n-step Q-learning equation.
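The source excerpt is cut off before this equation. For reference, a standard form of the n-step Q-learning update,
consistent with the definitions above, is given below; the paper's exact formulation may differ, and $\alpha$ denotes
the learning rate.

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Big[ \sum_{i=0}^{n-1} \gamma^{i} r_{t+i} + \gamma^{n} \max_{a} Q(s_{t+n}, a) - Q(s_t, a_t) \Big]$$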