# Network-Aware Scheduling
## Table of Contents
<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Use cases / Topologies](#use-cases--topologies)
- [1 - Spark/Database applications running in Data centers or small scale cluster topologies](#1---sparkdatabase-applications-running-in-data-centers-or-small-scale-cluster-topologies)
- [2 - Cloud2Edge application running on a multi-region geo-distributed cluster](#2---cloud2edge-application-running-on-a-multi-region-geo-distributed-cluster)
- [Proposal - Design & Implementation Details](#proposal---design--implementation-details)
- [Overview of the System Design](#overview-of-the-system-design)
- [Application Group CRD](#application-group-crd)
- [Network Topology CRD](#network-topology-crd)
- [The inclusion of bandwidth in the scheduling process](#the-inclusion-of-bandwidth-in-the-scheduling-process)
- [Bandwidth Requests via extended resources](#bandwidth-requests-via-extended-resources)
- [Bandwidth Limitations via the Bandwidth CNI plugin](#bandwidth-limitations-via-the-bandwidth-cni-plugin)
- [The Network-aware scheduling Plugins](#the-network-aware-scheduling-plugins)
- [Description of the `TopologicalSort` plugin](#description-of-the-topologicalsort-plugin)
- [Description of the `NetworkOverhead` plugin](#description-of-the-networkoverhead-plugin)
- [Known limitations](#known-limitations)
- [Test plans](#test-plans)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
- [Scalability](#scalability)
- [Troubleshooting](#troubleshooting)
- [Graduation criteria](#graduation-criteria)
- [Implementation history](#implementation-history)
<!-- /toc -->
# Summary
This proposal introduces an end-to-end solution that models a cluster's network latency and
topology as weights, and leverages that information to better schedule latency- and bandwidth-sensitive workloads.
# Motivation
Many applications are latency-sensitive, demanding lower latency between microservices in the application.
Scheduling policies that aim to reduce costs or increase resource efficiency are not enough for applications
where end-to-end latency becomes a primary objective.
Applications such as the Internet of Things (IoT), multi-tier web services, and video streaming services
would benefit the most from network-aware scheduling policies, which consider latency and bandwidth
in addition to the default resources (e.g., CPU and memory) used by the scheduler.
Users encounter latency issues frequently when using multi-tier applications.
These applications usually include tens to hundreds of microservices with complex interdependencies.
Distance from servers is usually the primary culprit.
The best strategy is to reduce the latency between chained microservices in the same application,
in line with prior work on [Service Function Chaining](https://www.sciencedirect.com/science/article/pii/S1084804516301989) (SFC).
In addition, bandwidth plays an essential role for applications with high volumes of data transfer
among microservices. For example, multiple replicas in a database application may require frequent
copies to ensure data consistency. [Spark jobs](https://spark.apache.org/) may transfer data frequently
between map and reduce nodes. Insufficient network capacity on nodes leads to increasing delays or packet
drops, which degrade the Quality of Service (QoS) of applications.
We propose two **Network-Aware Scheduling Plugins** for Kubernetes that focus on delivering low latency to end-users
and ensuring bandwidth reservations in pod scheduling.
This work significantly extends the previous work open-sourced [here](https://github.com/jpedro1992/sfc-controller)
that implements a latency-aware scheduler extender based on the [scheduler extender](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/scheduler_extender.md) design.
## Goals
- Define microservice dependencies in an Application via custom resources (**AppGroup CRD**).
- Describe the network topology for the underlying cluster via weights between regions (`topology.kubernetes.io/region`) and zones (`topology.kubernetes.io/zone`) via custom resources (**NetworkTopology CRD**).
- Make existing scheduler plugins aware of network bandwidth by advertising the nodes' (physical) bandwidth capacity as [extended resources](https://kubernetes.io/docs/tasks/administer-cluster/extended-resource-node/).
- Provide a **QueueSort** plugin [TopologicalSort](https://en.wikipedia.org/wiki/Topological_sorting), which orders pods to be scheduled in an **AppGroup** based on their dependencies.
- Provide **network-aware Filter & Score** plugins to filter out nodes based
on microservice dependencies defined in **AppGroup** and score nodes with lower network costs (described in **NetworkTopology**) higher to achieve latency-aware scheduling.
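For illustration, once a node's bandwidth capacity has been advertised as an extended resource, a pod can reserve a share of it alongside CPU and memory. This is a sketch only: the resource name `example.com/bandwidth` and the pod/image names are placeholders, not values mandated by this proposal.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: adservice
spec:
  containers:
    - name: adservice
      image: registry.example.com/adservice:latest  # placeholder image
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
          example.com/bandwidth: "250"  # placeholder extended-resource name
        limits:
          example.com/bandwidth: "250"  # extended resources require limits == requests
```

Note that Kubernetes requires extended resources to be requested in integer amounts and with `limits` equal to `requests`.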
## Non-Goals
- Descheduling due to unexpected outcomes is not addressed in this proposal.
- Conflicts between the plugins in this proposal and other plugins are not studied here.
Users are welcome to combine the plugins in this proposal with other plugins (e.g., `RequestedToCapacityRatio`,
`BalancedAllocation`). However, a higher weight must be given to our plugins to ensure that placements
with low network costs are preferred.
## Use cases / Topologies
### 1 - Spark/Database applications running in Data centers or small scale cluster topologies
Network-aware scheduling examines the infrastructure topology,
so network latency and bandwidth between nodes are considered while making scheduling decisions.
Data centers with fat-tree topology or cluster topology can benefit from our network-aware framework,
as network conditions (i.e., network latency, available bandwidth) between nodes can vary according to their
locations in the infrastructure.
<p align="center"><img src="figs/cluster.png" title="Cluster Topology" width="600" class="center"/></p>
<p align="center"><img src="figs/data_center.png" title="DC Topology" width="600" class="center"/></p>
Deploying microservices on different sets of nodes will impact the application's response time.
For specific applications, latency and bandwidth requirements can be critical.
For example, in a [Redis cluster](https://redis.io/topics/cluster-tutorial),
master nodes need to synchronize data with slave nodes frequently; that is, there are dependencies between
the masters and the slaves. High latency or low bandwidth between masters and slaves can lead to slow CRUD operations.
<p align="center"><img src="figs/redis.png" title="Redis app" width="600" class="center"/></p>
### 2 - Cloud2Edge application running on a multi-region geo-distributed cluster
Multi-region, geo-distributed scenarios benefit the most from our framework and network-aware plugins.
<p align="center"><img src="figs/multi_region.png" title="MultiRegion Topology" width="600" class="center"/></p>
High latency is a big concern in these topologies, especially for IoT applications
(e.g., [Eclipse Hono](https://github.com/eclipse/hono), [Eclipse Cloud2Edge](https://www.eclipse.org/packages/packages/cloud2edge/)).
For example, in the Cloud2Edge platform, there are several dependencies among the APIs and the MQTT brokers that devices connect to:
<p align="center"><img src="figs/cloud2edge.png" title="Cloud2Edge" width="600" class="center"/></p>
# Proposal - Design & Implementation Details
## Overview of the System Design
The proposal introduces two [Custom Resources (CRs)](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
defined as Custom Resource Definitions (CRDs):
- **AppGroup CRD**: abstracts the service topology to maintain application microservice dependencies.
- **NetworkTopology CRD**: abstracts the network infrastructure to establish network weights between regions and zones in the cluster.
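As a sketch, the two CRs might look as follows. The API group/version and field names below are placeholders chosen for illustration; the actual schemas are defined by the CRDs in this proposal.

```yaml
# Illustrative only: group/version and field names are placeholders.
apiVersion: scheduling.example.com/v1alpha1
kind: AppGroup
metadata:
  name: online-boutique
spec:
  numMembers: 3
  workloads:
    - name: p1            # frontend
      dependencies: [p2]  # p1 establishes connections to p2
    - name: p2            # recommendation service
      dependencies: [p3]
    - name: p3            # database
---
apiVersion: scheduling.example.com/v1alpha1
kind: NetworkTopology
metadata:
  name: cluster-topology
spec:
  weights:
    - name: UserDefined
      # Lower cost = better network conditions between the two origins.
      regionCosts:
        - origin: us-west-1
          destination: us-east-1
          cost: 20
      zoneCosts:
        - origin: z1
          destination: z2
          cost: 5
```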
Thus,