and the constant component. In our work, we treat the constant
component as the component in the network performance
that lasts for a long period until we observe some significant
changes in the network performance. The difference can also
be considered as error, since we use the constant component
to guide network performance aware optimizations. It is a
non-trivial task to find the constant component from dynamic
network performance.
Interestingly, this problem can be cast into a common prob-
lem in the computer vision, named RPCA (Robust Principal
Component Analysis) [6]. RPCA is to solve the following
problem: for a data matrix, RPCA is used to identify a low-
rank component and a sparse component with minimized
norm, subject to that the sum of the two components are equal
to the data matrix. There are many important applications with
the data that can naturally be modeled as a low-rank plus a
sparse component [6]. Specifically, we develop a novel ap-
proach based on RPCA with special design and optimizations
for practical use on IaaS clouds, and leverage the theoretical
properties of RPCA to find the constant component from the
dynamic network performance. We model each row of the data
matrix to be one snap-shot of all-link performance for the cloud
at a certain point of time, and apply RPCA on that data matrix
to obtain the constant component and error as the low-rank and
sparse components, respectively.
This seemingly simple design of decoupling the constant
component from network performance enables existing or new
network performance aware optimizations in virtual clusters.
Based on the constant component, conventional network per-
formance optimizations become valid, i.e., we can select the
best performing links with the minimized errors. On the other
hand, with the error component, we are able to determine the
effectiveness of network performance aware optimizations in
virtual clusters, e.g., if the error is too large, the network of
the IaaS cloud is too dynamic and network performance aware
optimizations are useless.
We conduct our experiments with two complementary
approaches: one is with the calibration on Amazon EC2 and the
other is with a simulator based on ns-2. The first experiment is
to assess our approach in the public cloud, and the latter one is
for full control of the network traffic on a large-scale cluster.
We assess the impact of network performance aware optimiza-
tions on two kinds of basic applications including collective
communications of MPI (Message Passing Interface) [39] and
the generic topology mapping strategy [19] as well as two
real-world applications, N-body and conjugate gradient (CG).
Our experiments show that our RPCA-based approach can
determine the degree of network dynamics for virtual clusters
in the cloud. We find that the current network of Amazon
EC2 is relatively stable, and network performance aware
optimizations are still important on Amazon EC2. Moreover,
our RPCA-based approach effectively guides the network per-
formance aware optimizations. On Amazon EC2, the proposed
approach significantly improves the performance, reducing the
average elapsed time of broadcast and scatter of MPI and
topology mapping by 20–40% and 8–20% over the baseline
approach and the approach based on direct use of the network
measurements. For N-body and CG, the average improvement
can reach 25% and 31% over the baseline, respectively. In
the simulation on ns-2, we compare with the topology-aware
algorithm [21], [38] and find that our approach obtains 25–
40% performance improvement.
The rest of the paper is organized as follows. We introduce
the preliminary and related work on cloud networks, RPCA
and two examples of network performance aware optimizations
in Section II. We present the problem definition in Section III,
and the RPCA-based approach in Section IV. In Section V,
we show our experimental results. Finally, we conclude this
paper in Section VI.
II. PRELIMINARY AND RELATED WORK
In this section, we briefly introduce the preliminary and
the related work that are closely related to our study.
A. Cloud Network
Previous studies (e.g., [2], [41], [8]) have shown significant
variability between network performance of different machines
in data centers. The network performance variability negatively
impacts application performance, and also makes traditional
network performance aware optimizations (e.g., [19], [3])
infeasible. To avoid re-inventing all those network performance
aware optimizations, this paper develops a new approach to
capture the long-term network performance in cloud, and allow
existing/new optimizations applicable to cloud.
Researchers have developed mechanisms on network band-
width allocation in order to obtain predictable performance
for the user. ElasticSwitch [33], SecondNet [15], Oktopus [2]
and TIVC [43] aim at reserving network bandwidth between
each pair of VMs to offer guaranteed network bandwidth
allocations. Those studies are mainly from the cloud provider’s
perspective, whereas this paper optimizes the network perfor-
mance for virtual clusters created by users, mainly from users’
perspective. Therefore, we do not have the information on
the underlying hardware or topology, or the runtime dynamics
about network transfers and virtualization details.
Network topology inference techniques have been inves-
tigated in the traditional environments [23], [36] and cloud
environment [12]. We refer readers to a survey [7] for more
details on classic techniques for network topology discovery
and inference. The information given by basic diagnostic tools
like traceroute is incomplete in the virtualized cloud.
B. Robust Principal Component Analysis
PCA is arguably the most widely used statistical tool
for data analysis and dimensionality reduction. However, the
accuracy of PCA is prone to noise or gross errors in the
input data. Robust Principal Component Analysis (RPCA) [6]
was proposed to improve the robustness of PCA under noisy
or error measurements. The basic idea is to recover a low-
rank matrix from a series of corrupted measurements and to
minimize the noise component that is assumed to be sparse but
unknown. Suppose A is a data matrix, D is a low-rank matrix
and E is a sparse matrix. RPCA is to solve the following
optimization problem, where ∥E∥
0
is the zero norm of E.
minimize rank(D) + λ∥E∥
0
subject to A = D + E
RPCA has been widely used in computer vision. It can
be used to solve many important applications like video