INTELLIGENT AUTONOMOUS INTERSECTION MANAGEMENT

Udesh Gunarathna
University of Melbourne, Melbourne, Australia
[email protected].edu.au

Shanika Karunasekara
University of Melbourne, Melbourne, Australia
karus@unimelb.edu.au

Renata Borovica-Gajic
University of Melbourne, Melbourne, Australia
renata.borovica@unimelb.edu.au

Egemen Tanin
University of Melbourne, Melbourne, Australia
etanin@unimelb.edu.au

arXiv:2202.04224v1 [cs.MA] 9 Feb 2022
ABSTRACT
Connected Autonomous Vehicles (CAVs) will make autonomous intersection management a reality, replacing
traditional traffic signal control. Autonomous intersection management requires adjusting the timing and speed
of vehicles arriving at an intersection so that they pass through the intersection collision-free.
Due to its computational complexity, this problem has been studied only when vehicle arrival times
towards the vicinity of the intersection are known beforehand, which limits the applicability of
these solutions for real-time deployment. To solve the real-time autonomous traffic intersection
management problem, we propose a reinforcement learning (RL) based multiagent architecture and a
novel RL algorithm coined multi-discount Q-learning. In multi-discount Q-learning, we introduce
a simple yet effective way to solve a Markov Decision Process by preserving both short-term and
long-term goals, which is crucial for collision-free speed control. Our empirical results show that our
RL-based multiagent solution achieves near-optimal performance efficiently when minimizing the
travel time through an intersection.
Keywords: Deep Reinforcement Learning · Markov Decision Process · Multiagent Systems · Autonomous Intersection Management
1 Introduction
Autonomous Intersection Management (AIM) is a paradigm proposed by Dresner et al. [1] to replace traditional traffic
signal control. In AIM, each CAV arriving at an intersection reserves a time to reach the intersection crossing point
via an intersection manager. Then, each CAV's speed is controlled to adhere to the schedule while guaranteeing safety¹.
Due to the ability of AIM to reduce congestion at intersections [2], AIM has been widely studied [3, 4]. However, most
problem formulations assume that the arrival times of vehicles to the vicinity of the intersection are known a priori.
This assumption does not hold when dealing with real-time traffic, which limits the applicability of these solutions to
real-life scenarios [5].
For AIM to be applicable to real-life traffic control, the speed of the vehicles has to be computed in real time, and the
computational time for this plays a critical role in the feasibility of AIM. Previous efforts exhibit high computational
time [6] as they rely on mathematical programming or analytical methods. The impact of high computational time
is twofold. First, if the intersection controller takes a long time to compute the scheduled times, the positional
difference of CAVs before and after the computation poses a significant safety risk: the given scheduled times may no
longer be feasible, and vehicles may crash as a result. Second, if the time for computing the CAV speed trajectory is
high, the remaining time after the computation may not be sufficient to reach the intersection crossing point precisely
at the scheduled time.

¹ In traditional AIM, vehicles traverse the intersection on a First Come, First Served basis. In this research, we optimize
the traversal order to reduce congestion further, e.g., to support platooning.
Recent work [5] develops a stochastic solution to the problem, assuming vehicle arrival times are not known beforehand.
However, a part of their solution employs linear programming (LP) for every CAV's arrival computation, which is
prohibitively expensive, as further demonstrated in Section 8. Further, their method is only applicable to intersections
with single-lane roads without turning directions.
Our work fills this research gap by providing a computationally efficient, deployable, multiagent solution based
on Reinforcement Learning (RL) for AIM. Our solution consists of two main sets of agents. The first is a
polling-based coordinating agent (intersection controller) positioned at the intersection. The second is a set
of distributed RL agents, assigned on a per-vehicle basis. The coordinating agent communicates with the
RL agents to schedule time intervals to reach the intersection for each CAV that is within a certain distance from the
intersection. The coordinating agent uses a novel polling algorithm to handle multi-lane intersections with multiple
turning directions, thereby overcoming a major limitation of previous work [5]. Once the coordinating agent provides
a time schedule, an RL agent controls each vehicle's speed to adhere to it. The advantage of such an approach is that
once an RL agent is trained offline, decision-making can be done much faster online. This avoids the computational
overhead incurred by LP-based techniques.
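To make the division of responsibilities concrete, the following is a minimal sketch of the two agent types. All class
and method names here (IntersectionController, VehicleAgent, assign_arrival_time, choose_acceleration) and the slot
length are illustrative assumptions, not identifiers or values from the paper.

# Minimal sketch, assuming hypothetical names and parameters.

class IntersectionController:
    """Polling-based coordinating agent positioned at the intersection."""

    def __init__(self, slot_length=2.0):
        self.slot_length = slot_length   # seconds reserved per crossing (assumed value)
        self.next_free_slot = 0.0

    def assign_arrival_time(self, current_time):
        # Grant the earliest crossing slot that does not overlap an already
        # scheduled vehicle; a stand-in for the paper's polling algorithm.
        slot = max(current_time, self.next_free_slot)
        self.next_free_slot = slot + self.slot_length
        return slot


class VehicleAgent:
    """Distributed RL agent controlling a single CAV's speed."""

    def __init__(self, policy):
        self.policy = policy             # trained offline, queried online
        self.scheduled_time = None

    def receive_schedule(self, scheduled_time):
        self.scheduled_time = scheduled_time

    def choose_acceleration(self, observation):
        # Fast online inference: no per-step optimization problem is solved.
        return self.policy(observation, self.scheduled_time)


# Example interaction between the two components.
controller = IntersectionController()
agent = VehicleAgent(policy=lambda obs, t_sched: 0.0)   # placeholder policy
agent.receive_schedule(controller.assign_arrival_time(current_time=12.0))

The key design point this sketch reflects is that the per-vehicle agents only evaluate a pre-trained policy online, while
the coordinating agent handles scheduling, so no optimization problem is solved in the control loop.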
The learning task for the RL agent is twofold: (1) learn to control a vehicle's trajectory to reach the intersection
precisely at the scheduled time, and (2) keep a safe distance from the vehicle in front. Keeping a safe distance is a
task with a short-term goal, because the front vehicle (whether driven by an RL agent or a human) can change its
driving behaviour within short time intervals. In contrast, reaching the intersection at a scheduled time requires
long-term planning, because success is only determined at the end of the trajectory. Combining these two learning
problems into a single task, as done in previous work [7], will only learn one of the problems, because each learning
problem contains an objective with a different time horizon. Traditional RL algorithms such as Q-learning use a fixed
parameter, named the discount factor, to control the length of the time horizon. Using a fixed discount factor focuses
on learning either the short-term or the long-term task successfully [8], and fails when both short-term and long-term
tasks need to be learned simultaneously, hence being unsuitable for our task. We propose a novel RL algorithm coined
multi-discount Q-learning to achieve short-term goals while following a long-term goal in a single Markov Decision
Process. We believe our proposed method is applicable to other problem domains that exhibit a mix of long-term and
short-term goals, such as robotics [9].
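To see why a single fixed discount factor struggles here, note that the effective planning horizon of a discounted
return scales roughly as 1/(1 − γ). The short sketch below illustrates this, together with the general idea of giving
each objective its own discount; it is an illustration of the motivation only, not the paper's multi-discount Q-learning
update, and the discount values and function names are assumptions.

def effective_horizon(gamma):
    # Rough number of future steps that meaningfully affect the return.
    return 1.0 / (1.0 - gamma)

print(effective_horizon(0.8))    # ~5 steps   -> suits short-term headway keeping
print(effective_horizon(0.99))   # ~100 steps -> suits long-term arrival scheduling

# General idea: give each reward component its own discount so that neither
# objective's time scale is sacrificed (illustrative, not the paper's rule).
GAMMA_SAFETY, GAMMA_SCHEDULE = 0.8, 0.99

def td_targets(r_safety, r_schedule, q_next_safety, q_next_schedule):
    return (r_safety + GAMMA_SAFETY * q_next_safety,
            r_schedule + GAMMA_SCHEDULE * q_next_schedule)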
Our contributions are four-fold: (1) We propose a computationally efficient multiagent solution for AIM. (2) We
introduce a novel reinforcement learning algorithm that can effectively learn multiple learning tasks. (3) We propose a
novel polling algorithm to handle multi-lane intersections with multiple turning directions. (4) We demonstrate the
effectiveness and efficiency of our approach against several baselines using microscopic traffic simulations.
2 Related Work
2.1 Autonomous Intersection Management
Two inter-dependent sub-problems have been studied in the literature to optimize AIM: (1) find a distinct time-schedule
for each incoming vehicle to arrive at the intersection, and (2) find a vehicle trajectory such that the vehicle arrives
at the intersection exactly at the scheduled time². The existing work can be divided into two main categories based on
the proposed solutions to these two sub-problems.
The first category of work optimizes the time-schedule (sub-problem (1)) as a scheduling problem and then computes
vehicle trajectories which adhere to the optimized time-schedule [4, 11, 12]. The scheduling optimization is NP-hard
[6], which makes it computationally expensive. Another drawback of this kind of approach is that all the arrival times
of vehicles need to be known beforehand to optimize the time-schedule. This property limits their applicability to
real-time settings where vehicle arrival times are stochastic.
The second category of work focuses on finding a safe or fastest trajectory as a solution to sub-problem (2), whilst
employing a heuristic to compute the time-schedule of sub-problem (1) [3, 13, 14]. Finding the optimal trajectory is,
however, prohibitively expensive in real time when using a method like linear programming, as demonstrated in our
experiments. For example, Au et al. [14] use an analytical method to find the trajectory using a set-point schedule and
a bisection method. However, to simplify the search space when deciding on the trajectory, their work does not consider
the maximum velocity, which leads to a sub-optimal trajectory. Even though these approaches are able to compute
trajectories, they do not optimize sub-problem (1). Thus, they neither optimize the throughput nor reduce waiting time
at intersections. In contrast, our objective is to maximize the throughput at the intersection.

² If the objective is to optimize the throughput, then the vehicle trajectory should also reach the intersection at the
maximum speed [10].
Recent work [5] proposed a solution to optimize the throughput in a stochastic setting using a polling system and linear
programming. The polling system is used to optimize the time-schedule for each vehicle. Then, a linear program is
solved for each vehicle to find its trajectory. These linear programs need to be computed sequentially (centralized):
the linear program for the first vehicle arriving at the intersection must be computed first, then that of the next
vehicle, and so on. Although this approach can successfully solve the stochastic AIM problem, the computational time
required for linear programming hinders the applicability of this solution for real-time usage. On the contrary, we
propose a distributed learning-based solution, which enables real-time AIM.
2.2 RL with Variable Discounting
Learning a task is difficult when there are objectives with different time scales. The time scale of a task is directly
impacted by the discount factor of RL agents (e.g., Q-learning or SARSA) [8]. Thus, most past efforts focus on changing
the discount to learn tasks with multiple time scales. Edwards et al. [15] extend the LP formulations of [8] and propose
a multiple-discount SARSA algorithm by considering the reward as a vector and using a separate action-value function
(Q-function) for every sub-task. Human intervention is then needed to find the best policy among these sub-tasks. Burda
et al. [16] follow a similar approach to combine intrinsic and extrinsic rewards. An automated approach is proposed by
Li et al. [17] to combine different objectives learned by a set of factored Q-functions using a lexicographic ordering
of objectives. Finding such a lexicographic ordering of objectives is non-trivial and can be problem-dependent. In the
above-mentioned approaches, each sub-task is learned separately. Because of that, the inter-relationship between
sub-tasks is ignored, and the number of parameters to be learned is high. In contrast, we propose a simple and
memory-efficient method to learn each sub-task using a single action-value function (a single Q-function) whilst
preserving the time scales of the sub-tasks. As we show in our experiments, our proposed method achieves superior
results in attaining both short-term and long-term goals.
A single Q-function is similarly used with a state-dependent discount function where each state is discounted
differently [18]. A generic hyperbolic discount function has also been proposed as an alternative to the traditional
exponential discount function [19]. However, these algorithms focus on a single scalar reward, whereas our proposed
method extends to multi-objective cases which contain multiple rewards with different temporal objectives, allowing us
to preserve the time scale of each objective.
3 Background
Our proposed architecture uses a polling system to schedule the incoming CAVs, and Q-learning to determine the CAV
trajectories. We provide a brief introduction to both.
Polling system:
A polling system consists of a single server and a set of queues. Each queue contains a number of customers (in
First-In First-Out (FIFO) order). Customers may arrive at a queue in a stochastic order. The server can start serving a
customer from any queue. The service time is the time taken to serve one customer. Once the server has served a
customer, it can select the next customer from any queue. When switching between queues, the server has to wait for an
additional time called the switch-over time. The strategy that the server uses to determine from which queue the next
customer is selected is called the policy. Several policies exist in the literature, such as k-limited, gated, and
exhaustive. The definitions of customers, switch-over time and service time related to AIM are described in Section 5.1.
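As an illustration of these definitions, the following is a minimal sketch of a polling system under an exhaustive
policy, where the server empties the current queue before paying a switch-over time and moving on. The queue contents,
service time and switch-over time are arbitrary example values, not values from the paper.

from collections import deque

def simulate(queues, service_time=1.0, switch_over_time=0.5):
    """queues: list of deques of customer ids; returns (finish_times, total_time)."""
    clock = 0.0
    finish_times = {}
    current = 0
    while any(queues):
        q = queues[current]
        while q:                          # exhaustive policy: serve until the queue is empty
            customer = q.popleft()
            clock += service_time
            finish_times[customer] = clock
        if any(queues):                   # pay the switch-over time only if work remains
            clock += switch_over_time
        current = (current + 1) % len(queues)
    return finish_times, clock

# Example: two lanes (queues) of waiting vehicles (customers).
lanes = [deque(["v1", "v2"]), deque(["v3"])]
print(simulate(lanes))   # ({'v1': 1.0, 'v2': 2.0, 'v3': 3.5}, 3.5)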
Q-learning:
In RL, a problem first needs to be formulated as a Markov decision process (MDP). An MDP consists of a state space $S$
and an action space $A$. When an action $a_t \in A$ is taken in the current state $s_t \in S$ at time $t$, the MDP's
state changes to $s_{t+1}$ according to the transition probability $T(s_t, a_t, s_{t+1}) = Pr(s_{t+1} \mid s_t, a_t)$.
The MDP provides a reward $r_t$ for the transition, where $r_t$ is assigned according to $R(s_t, a_t, s_{t+1})$. An RL
agent acting on the MDP consists of a policy $\pi(a \mid s)$ which describes the agent's behaviour. The policy
$\pi(a \mid s)$ indicates the probability of the agent taking action $a$ in state $s$. The objective of the agent is to
maximize the expected reward $G_t$ starting from any given time step $t$. The expected reward is defined as
$G_t = \sum_{\tau=t}^{T} \gamma^{\tau - t} r_\tau$, where $t$ is the current time step, $\gamma$ is the discount factor
and $T$ is the time at which the MDP reaches its terminal state.
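As a short numeric illustration of this definition (values chosen arbitrarily): with $\gamma = 0.9$ and rewards
$r_t = r_{t+1} = r_{t+2} = 1$ followed by termination, the return is
$G_t = 1 + 0.9 \cdot 1 + 0.9^2 \cdot 1 = 2.71$.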
The action-value function (Q-function) for policy $\pi$ stores the expected reward of taking action $a$ in state $s$,
defined as $Q^{\pi}(s, a) = \mathbb{E}[G_t \mid s_t = s, a_t = a, \pi]$. Q-learning approximates the optimal Q-function
iteratively by observing the transitions $(s_t, a_t, s_{t+1}, r_t)$ at every time step when $T$ is unknown. Considering
the reward from $n$ steps, we get the following n-step Q-learning equation.
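The source excerpt is cut off before this equation. For reference, a standard form of the n-step Q-learning update,
consistent with the definitions above, is given below; the paper's exact formulation may differ, and $\alpha$ denotes
the learning rate.

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Big[ \sum_{i=0}^{n-1} \gamma^{i} r_{t+i} + \gamma^{n} \max_{a} Q(s_{t+n}, a) - Q(s_t, a_t) \Big]$$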