Practical GAN-based Synthetic IP Header
Trace Generation using NetShare
Yucheng Yin
Carnegie Mellon University
Pittsburgh, PA
yyin4@andrew.cmu.edu
Zinan Lin
Carnegie Mellon University
Pittsburgh, PA
zinanl@andrew.cmu.edu
Minhao Jin
Carnegie Mellon University
Pittsburgh, PA
minhaoj@andrew.cmu.edu
Giulia Fanti
Carnegie Mellon University
Pittsburgh, PA
gfanti@andrew.cmu.edu
Vyas Sekar
Carnegie Mellon University
Pittsburgh, PA
vsekar@andrew.cmu.edu
ABSTRACT
We explore the feasibility of using Generative Adversarial Networks
(GANs) to automatically learn generative models to generate
synthetic packet- and flow header traces for networking tasks (e.g.,
telemetry, anomaly detection, provisioning). We identify key fidelity,
scalability, and privacy challenges and tradeoffs in existing
GAN-based approaches. By synthesizing domain-specific insights with
recent advances in machine learning and privacy, we identify design
choices to tackle these challenges. Building on these insights, we
develop an end-to-end framework, NetShare. We evaluate NetShare
on six diverse packet header traces and find that: (1) across all
distributional metrics and traces, it achieves 46% more accuracy than
baselines and (2) it meets users’ requirements of downstream tasks
in evaluating accuracy and rank ordering of candidate approaches.
CCS CONCEPTS
• Networks → Network simulations; • Security and privacy → Data anonymization and sanitization;
KEYWORDS
synthetic data generation, network packets, network flows,
generative adversarial networks, privacy
ACM Reference Format:
Yucheng Yin, Zinan Lin, Minhao Jin, Giulia Fanti, and Vyas Sekar. 2022.
Practical GAN-based Synthetic IP Header Trace Generation using NetShare. In
ACM SIGCOMM 2022 Conference (SIGCOMM ’22), August 22–26, 2022,
Amsterdam, Netherlands. ACM, New York, NY, USA, 15 pages.
https://doi.org/10.1145/3544216.3544251
SIGCOMM ’22, August 22–26, 2022, Amsterdam, Netherlands
© 2022 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-9420-8/22/08...$15.00
https://doi.org/10.1145/3544216.3544251
1 INTRODUCTION
Packet- and flow-level header traces are critical to many network
management workflows. For instance, they are used to guide the
design and development of network monitoring algorithms (e.g.,
[44, 45]), to develop new types of anomaly detection and fingerprinting
(e.g., [34, 76, 77]), and to benchmark and test new hardware and
software capabilities (e.g., [46]). Unfortunately, access to such traces
remains challenging due to business and privacy concerns.
A natural alternative is synthetic traces. There is a rich literature
in the networking community on generating synthetic traces
via simulation-driven approaches (e.g., NS-2 [6]), model-driven
approaches (e.g., Harpoon [66] or Swing [70]), and machine learning
models (e.g., STAN [75], DoppelGANger [39]), as well as commercial
offerings (e.g., IXIA [4]). Unfortunately, existing approaches have
notable shortcomings. Model- and simulation-based approaches
require significant domain knowledge and human effort to determine
critical workload features and configure generation parameters,
while not generalizing well across applications [8, 9, 66, 70, 83].
ML-based approaches generalize more easily, but fail to capture
domain-specific properties (e.g., packet arrival times, flow length)
[39, 75] (§6).
In this work, we explore the feasibility of ML-based synthetic
packet-header (e.g., PCAP) and flow-header (e.g., NetFlow) trace
generation using Generative Adversarial Networks or GANs
[12, 28, 29, 39]. If successful, this can lower the barrier for
stakeholders with key traces to share synthetic data with potential
clients. While the use of GANs is appealing, we find a number of
practical challenges in our context that existing approaches
(e.g., [21, 31, 39, 57, 71, 74, 75]) fail to address:
• Fidelity: Prior techniques (especially those based on tabular
data GANs, which dominate the synthetic header generation
literature) are unable to capture key correlations across header
fields, or header fields that have large ranges of values.
• Scalability-fidelity tradeoff: Existing techniques require
significant GPU-hours to train on even moderately sized traces (e.g.,
millions of records). Simple tabular GANs take a few hours
to train but suffer in fidelity, while more complex time series
GANs can take an order of magnitude more time.
• Privacy-fidelity tradeoffs: The privacy-fidelity tradeoffs of GANs
are not well explored in the context of network header traces.
Preliminary work suggests that differentially-private learning
approaches are likely to yield poor fidelity for networking
datasets [39].
For example, DoppelGANger [39], a state-of-the-art GAN-based
approach for time series generation, cannot learn certain key header
This work is licensed under a Creative Commons Attribution International License.