[Free] ACM SIGCOMM 2022 Proceedings (CSDN Library)


SIGCOMM22论文集.zip (SIGCOMM '22 proceedings, 55 files)

SIGCOMM22论文集

session11-host networking and video delivery

Towards μs tail latency and terabit ethernet - disaggregating the host network stack.pdf 1.21MB

NeuroScaler-neural video enhancement at scale.pdf 1.6MB

SPRIGHT-Extracting the Server from Serverless Computing.pdf 3.52MB

GSO-Simulcast-Global Stream Orchestration in Simulcast Video conferencing systems.pdf 633KB

SIGCOMM'22 LiveNet-a low-latency video transport network for large-scale live streaming.pdf 1.35MB

session8-sensing and wireless communication

Cyclops：An FSO-based Wireless Link for VR Headsets.pdf 2.02MB

Underwater Messaging Using Mobile Devices.pdf 3.47MB

RF-Protect：Privacy against Device-Free Human Tracking.pdf 2.17MB

Empowering Smart Buildings with Self-Sensing Concrete for Structural Health Monitoring.pdf 4.81MB

Higher-Order Modulation for Acoustic Backscatter Communication in Metals.pdf 1.23MB

session1 Datacenter Networking

near-optimal proactive datacenter transport.pdf 1.09MB

Aequitas-admission control for performance-critical RPCs in datacenters.pdf 793KB

Time-division TCP for reconfigurable data center networks.pdf 1.92MB

transforming google's datacenter network via optical circuit switches and software-defined networking.pdf 2.37MB

active buffer management in datacenters.pdf 1.37MB

session2 5G Networks

Mobile access bandwidth in practice measurement, analysis, and implications.pdf 921KB

SEED a SIM-based solution to 5G failures.pdf 995KB

L25GC a low latency 5G core network based on high-performance NFV platforms.pdf 3.29MB

Understanding 5G performance for real-world services a content provider's perspective.pdf 1.53MB

Vivisecting mobility management in 5G cellular networks.pdf 1.59MB

session7-monitoring and measurement

Retina-Analyzing 100 GbE Traffic on Commodity Hardware.pdf 2.05MB

FlyMon-enabling on-the-fly task reconfiguration for network measurement.pdf 2.82MB

Predicting IPv4 services across all ports.pdf 946KB

PrintQueue-performance diagnosis via queue measurement in the data plane.pdf 2.53MB

Continuous in-network round-trip time monitoring.pdf 1.77MB

session4 Wide Area Networks

Software-defined network assimilation bridging the last mile towards centralized network configuration management with NAssim.pdf 3.25MB

A case for stateless mobile core network functions in space.pdf 3.64MB

Network entitlement contract-based network sharing with agility and SLO guarantees.pdf 1.36MB

TIPSY predicting where traffic will ingress a WAN.pdf 855KB

SDN in the stratosphere loon's aerospace mesh network.pdf 1.84MB

session5-testing and verification

SimBricks end-to-end network system evaluation with modular simulation.pdf 1.04MB

Flash fast, consistent data plane verification for large-scale network settings.pdf 1.82MB

Symbolic router execution.pdf 454KB

Meissa scalable network testing for programmable data planes.pdf 419KB

SwitchV automated SDN switch validation with P4 models.pdf 728KB

session3 Congestion Control

Elasticity detection a building block for internet congestion control.pdf 1.49MB

Starvation in end-to-end congestion control.pdf 1.03MB

Cebinae scalable in-network fairness augmentation.pdf 1.07MB

SIGCOMM'22 Achieving Consistent LowLatency for Wireless Real-Time Communications with the Shortest Control Loop.pdf 1.31MB

PLB congestion signals are simple and effective for network load balancing.pdf 1.46MB

session9-programmable data planes

Thanos：Programmable Multi-Dimensional Table Filters for Line Rate Network Functions.pdf 1.2MB

FAst In-Network GraY Failure Detection for ISPs.pdf 1.42MB

Stateful Multi-Pipelined Programmable Switches.pdf 1.29MB

Using Trio – Juniper Networks’ Programmable Chipset – for Emerging In-Network Applications.pdf 1.03MB

Predictable vFabric on Informative Data Plane.pdf 4.35MB

session6-machine learning

practical GAN-based synthetic IP header trace generation using NetShare.pdf 8.23MB

Genet automatic curriculum generation for learning adaptation in networking.pdf 2.26MB

DeepQueueNet towards scalable and generalized network performance estimation with packet-level visibility.pdf 5.02MB

LiteFlow- towards high-performance adaptive neural networks for kernel datapath.pdf 808KB

Multi-resource interleaving for deep learning training.pdf 685KB

session10-Denial of Service Defense and Storage Networks

IXP scrubber learning from blackholing traffic for ML-driven DDoS detection at scale.pdf 1.85MB

SurgeProtector mitigating temporal algorithmic complexity attacks using adversarial scheduling.pdf 2.07MB

Aggregate-based congestion control for pulse-wave DDoS defense.pdf 1.47MB

Design and evaluation of IPFS-a storage layer for the decentralized web.pdf 1.82MB

From Luna to Solar-The Evolutions of the Compute-to-Storage networks in Alibaba Cloud.pdf 809KB

Practical GAN-based Synthetic IP Header Trace Generation using NetShare

Yucheng Yin

Carnegie Mellon University

Pittsburgh, PA

yyin4@andrew.cmu.edu

Zinan Lin

Carnegie Mellon University

Pittsburgh, PA

zinanl@andrew.cmu.edu

Minhao Jin

Carnegie Mellon University

Pittsburgh, PA

minhaoj@andrew.cmu.edu

Giulia Fanti

Carnegie Mellon University

Pittsburgh, PA

gfanti@andrew.cmu.edu

Vyas Sekar

Carnegie Mellon University

Pittsburgh, PA

vsekar@andrew.cmu.edu

ABSTRACT

We explore the feasibility of using Generative Adversarial Networks (GANs) to automatically learn generative models to generate synthetic packet- and flow header traces for networking tasks (e.g., telemetry, anomaly detection, provisioning). We identify key fidelity, scalability, and privacy challenges and tradeoffs in existing GAN-based approaches. By synthesizing domain-specific insights with recent advances in machine learning and privacy, we identify design choices to tackle these challenges. Building on these insights, we develop an end-to-end framework, NetShare. We evaluate NetShare on six diverse packet header traces and find that: (1) across all distributional metrics and traces, it achieves 46% more accuracy than baselines and (2) it meets users’ requirements of downstream tasks in evaluating accuracy and rank ordering of candidate approaches.

CCS CONCEPTS

• Networks → Network simulations; • Security and privacy → Data anonymization and sanitization;

KEYWORDS

synthetic data generation, network packets, network flows, generative adversarial networks, privacy

ACM Reference Format:

Yucheng Yin, Zinan Lin, Minhao Jin, Giulia Fanti, and Vyas Sekar. 2022. Practical GAN-based Synthetic IP Header Trace Generation using NetShare. In ACM SIGCOMM 2022 Conference (SIGCOMM ’22), August 22–26, 2022, Amsterdam, Netherlands. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3544216.3544251

ACM ISBN 978-1-4503-9420-8/22/08...$15.00

1 INTRODUCTION

Packet- and flow-level header traces are critical to many network management workflows. For instance, they are used to guide the design and development of network monitoring algorithms (e.g., [ ]), to develop new types of anomaly detection and fingerprinting (e.g., [ ]), and to benchmark and test new hardware and software capabilities (e.g., [ ]). Unfortunately, access to such traces remains challenging due to business and privacy concerns.

A natural alternative is synthetic traces. There is a rich literature in the networking community on generating synthetic traces via simulation-driven approaches (e.g., NS-2 [ ]), model-driven approaches (e.g., Harpoon [ ] or Swing [ ]), and machine learning models (e.g., STAN [ ], DoppelGANger [ ]), as well as commercial offerings (e.g., IXIA [ ]). Unfortunately, existing approaches have notable shortcomings. Model- and simulation-based approaches require significant domain knowledge and human effort to determine critical workload features and configure generation parameters, while not generalizing well across applications [ ]. ML-based approaches generalize more easily, but fail to capture domain-specific properties (e.g., packet arrival times, flow length) [ ] (§6).

In this work, we explore the feasibility of ML-based synthetic packet-header (e.g., PCAP) and flow-header (e.g., NetFlow) trace generation using Generative Adversarial Networks or GANs [ ]. If successful, this can lower the barrier for stakeholders with key traces to share synthetic data with potential clients. While the use of GANs is appealing, in practice we find a number of practical challenges in our context that existing approaches (e.g., [21, 31, 39, 57, 71, 74, 75]) fail to address:

• Fidelity: Prior techniques (especially those based on tabular data GANs, which dominate the synthetic header generation literature) are unable to capture key correlations across header fields and header fields that have large ranges of values.

• Scalability-fidelity tradeoff: Existing techniques require significant GPU-hours to train on even moderately-sized traces (e.g., millions of records). Simple tabular GANs take a few hours to train but suffer in fidelity, while more complex time series GANs can take an order of magnitude more time.

• Privacy-fidelity tradeoffs: Privacy-fidelity tradeoffs of GANs are not well explored in the context of network header traces. Preliminary work suggests that differentially-private learning approaches are likely to yield poor fidelity for networking datasets [39].

For example, DoppelGANger [ ], a state-of-the-art GAN-based approach for time series generation, cannot learn certain key header fields (e.g., service ports) well out-of-the-box, while requiring hundreds of GPU hours to train. And while it supports differentially-private (DP) training [ ], this option completely destroys its synthetic data fidelity.

This work is licensed under a Creative Commons Attribution International License.

In designing NetShare, we tackle these key challenges by a careful data-driven understanding of the limitations of canonical GAN-based approaches. NetShare combines the following key ideas to address the above issues:

• Reformulation as flow time series generation: Instead of treating header traces from each measurement epoch as an independent tabular dataset (i.e., rows of packets/flows with headers), we recast the problem for learning synthetic models for a merged flow-level trace across epochs. This reformulation allows us to natively capture intra- and inter-epoch correlations.

• Improving scalability via fine tuning: We identify opportunities to optimize learning time by using ideas of model fine tuning and data-parallel learning from the ML literature [ ]. Doing so naively may fail to capture dependencies across parallel instances, so we develop heuristics to preserve such correlations.

• Practical privacy reformulations: We adopt recent advances in differentially-private model training [ ] and combine a small amount of public data with private data to improve privacy-fidelity tradeoffs. To the best of our knowledge, this is the first application and empirical demonstration in the context of header trace generation.

We implement an end-to-end system, NetShare, and build a web service prototype available through https://www.pcapshare.com. The code is open-sourced at https://github.com/netsharecmu/NetShare. We also tackle a number of other practical challenges that prior work has not considered. For instance, prior work does not generate valid traces (e.g., headers with derived fields, timestamps), does not evaluate if/how model training generalizes across a wide range of trace sources (e.g., ISP vs. datacenter vs. edge), and does not consider the fidelity of the generated traces for relevant networking use cases (e.g., telemetry, anomaly detection, machine learning).

We empirically evaluate NetShare and show that (1) across all distributional metrics and traces [ ], NetShare achieves 46% more accuracy than baseline approaches that use different generative modeling techniques [ ]; (2) NetShare meets users’ requirements of downstream tasks [ ], preserving algorithm accuracy and ordering; (3) NetShare achieves a better scalability-fidelity tradeoff than baselines; and (4) NetShare can generate higher-quality differentially private traces than baseline approaches.

2 MOTIVATION

In this section, we start by describing use cases for trace-driven analysis in networked systems. Then we argue why synthetic traces are useful and then make a case for data-driven synthesis in contrast to conventional approaches.

2.1 Motivating scenarios

We describe two illustrative use cases in data-driven network design and management that are stymied by lack of access to realistic packet- and flow-level traces.

             Fidelity   Flexibility   Privacy    Effort
Raw          High                                Low
Anonymized   Depends                  Depends    Low
Synthetic    Possible   High          Possible   High

Table 1: Trade-offs for data holders sharing Raw vs. Anonymized vs. Synthetic traces

Telemetry algorithms. There is a lot of renewed interest in the design and development of novel telemetry algorithms including many approximate data structures for sketching (e.g., [ ]). Several of these approaches also make implicit assumptions on structural properties of workloads (e.g., heavy flows) to optimize space-time tradeoffs (e.g., [ ]). To systematically evaluate which approach best suits a target deployment or system provisioning regime, we need realistic header traces to compare different algorithms and provisioning strategies (e.g., number of rows, counter arrays to use for sketches).
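To make the provisioning knobs concrete, here is a minimal sketch of one well-known sketching structure, a Count-Min sketch, where `rows` and `width` are the parameters a trace would be used to tune. The paper does not specify this structure or this hashing scheme; the class name, the SHA-256-based row hashing, and the flow keys are illustrative assumptions.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min sketch (hypothetical helper, not from the paper)."""

    def __init__(self, rows, width):
        self.rows = rows      # number of independent hash rows
        self.width = width    # counters per row
        self.table = [[0] * width for _ in range(rows)]

    def _index(self, row, key):
        # Derive a per-row hash by prefixing the row number (assumed scheme).
        digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for r in range(self.rows):
            self.table[r][self._index(r, key)] += count

    def estimate(self, key):
        # Collisions only inflate counters, so the minimum never underestimates.
        return min(self.table[r][self._index(r, key)] for r in range(self.rows))


sketch = CountMinSketch(rows=3, width=64)
for _ in range(100):
    sketch.add("10.0.0.1->10.0.0.2:443/TCP")   # a made-up heavy flow
sketch.add("10.0.0.3->10.0.0.4:53/UDP", 5)     # a made-up small flow
heavy_estimate = sketch.estimate("10.0.0.1->10.0.0.2:443/TCP")
```

Fewer rows or a smaller width saves memory but increases overestimation, which is the kind of tradeoff that realistic traces help to evaluate.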

Evaluating machine learning models. There are also a number of emerging use cases (including building classifiers over encrypted traffic), where researchers and practitioners (e.g., [ ]) are developing novel machine learning models for various types of fingerprinting (e.g., what type of application a particular session entails) or anomaly detection (e.g., is this device compromised) using only IP packet and flow headers [ ]. Again, to systematically evaluate the potential performance of these algorithms in diverse settings, we need access to realistic header traces of normal client behavior [30, 52].

These use cases (and other future scenarios) require access to realistic high fidelity traces. The dimensions of fidelity may be use-case dependent; e.g., some may care about header-field value distributions, some may care about preserving “heavy hitters”, others may care about flow-level properties, and so on.

2.2 Synthetic traces and status quo

Due to a number of concerns (e.g., policy, privacy, legal restrictions), data holders are usually unwilling to share raw traces. To address these concerns, there are two main alternatives to raw packet traces: (1) anonymized traces (e.g., using either masking or cryptographic techniques to hide IP addresses) or (2) synthetic traces, where some model of generating packet traces is created to mimic properties of the raw data. At a high level, there are qualitative tradeoffs between raw, anonymized, and synthetic trace generation as summarized in Table 1. Raw traces require the least effort, but are also the least private and least flexible (e.g., generating more data as needed or changing specific workload characteristics). Anonymized and synthetic data each have pros and cons in terms of fidelity, flexibility, privacy, and effort. For example, anonymized data can be made more private by obscuring and/or redacting more fields, but this hurts the resulting data fidelity [ ]. Similarly, there are techniques for generating synthetic data, but the resulting privacy guarantees are unclear, and remain an active area of research [ ]. Our focus in this paper is on lowering the barrier for generating and sharing synthetic data since it offers a qualitatively different value proposition than the other two options and may lower the barrier for data sharing, as also observed in other efforts (e.g., [39]).

Existing approaches for synthetic header trace generation can be divided into three categories: (1) simulation-based (e.g., NS-2 [ ], OSTINATO [ ], SEAGULL [ ]); (2) model-driven (e.g., Harpoon [ ], SWING [ ]); and (3) data-driven or machine learning driven (e.g., STAN [ ]). While these prior efforts have been immensely valuable to the community, they suffer from one or more fundamental shortcomings. The simulation- and model-driven generators have two key drawbacks. First, system designers need to manually determine the important set of features and choose the model, which requires significant domain knowledge and human effort [ ]. Second, such models usually make assumptions about the underlying workloads and downstream tasks [ ], which makes them hard to generalize across traces with potentially significantly different deployments, topologies, and workloads. Existing data-driven or machine-learned approaches are more automated but have more fundamental structural limitations. For instance, STAN [ ] only generates flow-level summary statistics while HMM-based IP generators [ ] only generate IP addresses. Furthermore, existing frameworks do not evaluate the fidelity of these synthetic traces across diverse datasets and downstream tasks.

3 OVERVIEW AND CHALLENGES

Our overarching goal is to develop a data-driven synthetic header trace generation workflow that requires minimal manual tuning and expert knowledge, and can support a wide range of traces from diverse deployments and diverse downstream applications. We start by defining our goals and how we propose to achieve this using a GAN-based workflow.

3.1 Problem formulation

We are given as input a dataset of header-level traces split into $n$ consecutive epochs. For each epoch $t$, we are given $D_t$, an unsampled IPv4 packet header trace. These could be packet- or flow-level traces depending on scenario.

• Packet header trace: Each record in a packet header trace consists of packet header fields (e.g., source/destination IP headers) associated with some measured values (e.g., timestamp, packet size).

• Flow header trace: Each record in a flow header trace consists of the IP 5-tuple header (e.g., source/destination IP headers, ports, and protocol) associated with some measured values (e.g., start time, end time of flow, total number of packets, total number of bytes).
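The two record types above can be sketched as simple data classes. This is only an illustration of the fields named in the text; the class names, field names, and the choice to key packets by the full five-tuple are assumptions, not the authors' schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FiveTuple:
    """IP 5-tuple header (hypothetical field names)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str  # e.g. "TCP" or "UDP"


@dataclass
class PacketRecord:
    """One row of a packet header trace: header fields plus measured values."""
    key: FiveTuple
    timestamp: float  # seconds, relative to an assumed trace start
    size: int         # packet size in bytes


@dataclass
class FlowRecord:
    """One row of a flow header trace: 5-tuple plus per-flow measurements."""
    key: FiveTuple
    start_time: float
    end_time: float
    packet_count: int
    byte_count: int


# Made-up example values, purely for illustration.
key = FiveTuple("10.0.0.1", "10.0.0.2", 51000, 443, "TCP")
pkt = PacketRecord(key, timestamp=0.25, size=1500)
flow = FlowRecord(key, start_time=0.25, end_time=3.5, packet_count=12, byte_count=9000)
```

Making `FiveTuple` frozen keeps it hashable, so it can serve as a dictionary key when grouping records into flows.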

Scope and goals. Our goal is to learn a generative model of $\{D_t \mid t = 1, \dots, n\}$ that satisfies different types of fidelity metrics specified by domain experts and downstream applications. We specifically focus on IPv4 header 5-tuple fields. Packet payloads and other higher-layer headers (e.g., TCP/UDP header, application protocol header) are outside of the current scope.

We expect three categories of fidelity metrics of interest:

• Header-level distributional properties: For each header field, we want to ensure the distributions of the synthetic and raw trace match quantitatively; e.g., popularity rank of IP addresses or distribution of packet sizes.

• Flow-level properties: Other than per-header (or packet-level) metrics, flow-level metrics are also common in networking apps [ ]: e.g., flow size distribution or flow duration distribution.

• Use-case specific properties: To ensure the utility of synthetic header traces we consider two use-case specific properties: (1) Accuracy preservation: Can one particular algorithm/application achieve similar accuracy on the raw and synthetic header traces? (2) Order preservation: Is the relative performance of algorithms preserved between raw and synthetic traces; e.g., if Count-Sketch is better for detecting heavy hitters in real traces, is that ordering preserved?
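The two use-case specific checks can be sketched as follows. This is a simplified illustration, not the paper's evaluation procedure: the function names are hypothetical, the error values are made-up placeholders, and exact-order equality is only one possible way to test order preservation.

```python
def ranking(errors):
    """Order algorithm names from best (lowest error) to worst."""
    return sorted(errors, key=errors.get)


def order_preserved(real_errors, synth_errors):
    """True if the algorithms rank identically on real and synthetic traces."""
    return ranking(real_errors) == ranking(synth_errors)


def max_accuracy_gap(real_errors, synth_errors):
    """Largest per-algorithm difference between real and synthetic error."""
    return max(abs(real_errors[a] - synth_errors[a]) for a in real_errors)


# Placeholder numbers for illustration only; they are not results from the paper.
real = {"count_sketch": 0.02, "count_min": 0.05, "other_sketch": 0.08}
synth = {"count_sketch": 0.03, "count_min": 0.06, "other_sketch": 0.07}

same_order = order_preserved(real, synth)
gap = max_accuracy_gap(real, synth)
```

A small gap corresponds to accuracy preservation, while `same_order` corresponds to order preservation; a rank correlation would be a softer alternative to strict equality.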

We also want the header traces to satisfy key semantic and syntactic correctness conditions; e.g., IP addresses in valid ranges; packet sizes in valid ranges (e.g., for a TCP packet the minimum size is 40 bytes, while for a UDP packet the minimum size is 28 bytes); and a consistent relationship between port number and protocol (e.g., 80 for HTTP and 53 for DNS).
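A few of these correctness conditions can be expressed as a small validator. The sketch below covers only address validity, port range, and the minimum sizes quoted above; the function names, the 65535-byte upper bound (the limit of the 16-bit IPv4 total length field), and the decision to reject unknown protocols are assumptions made for illustration, and port/protocol consistency is not checked here.

```python
import ipaddress

# Minimum packet sizes in bytes, as stated in the text.
MIN_PACKET_SIZE = {"TCP": 40, "UDP": 28}
# Upper bound from the 16-bit IPv4 total length field (assumption for this sketch).
MAX_PACKET_SIZE = 65535


def is_valid_ipv4(addr):
    """True if addr parses as a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(addr)
        return True
    except ValueError:
        return False


def is_valid_packet(src_ip, dst_ip, src_port, dst_port, protocol, size):
    """Check a synthetic packet record against basic syntactic constraints."""
    if not (is_valid_ipv4(src_ip) and is_valid_ipv4(dst_ip)):
        return False
    if not all(0 <= port <= 65535 for port in (src_port, dst_port)):
        return False
    min_size = MIN_PACKET_SIZE.get(protocol)
    if min_size is None:
        return False  # protocols other than TCP/UDP are rejected in this sketch
    return min_size <= size <= MAX_PACKET_SIZE
```

A generator could apply such a filter after sampling, or use the same bounds to clip generated values.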

Non goals. We acknowledge some types of properties are out of scope for our current work. Specifically, we do not capture stateful session semantics (e.g., TCP sessions), application layer protocol semantics (e.g., HTTP headers), packet payloads, or fine-grained temporal properties (e.g., distribution of inter-arrival times of packets). These are interesting directions for future work, as we discuss in §8.

3.2 Why GANs

Generative adversarial networks (GANs) are a popular class of generative model [ ]. Given a set of training data $x_1, \dots, x_n$, where samples $x_i \in \mathcal{X}$ belong to universe $\mathcal{X}$ and are drawn from some underlying distribution $x_i \sim P_x$, the goal is to learn to generate new samples from $P_x$. GANs achieve this through adversarial training; that is, they learn two competing models. The generator maps low-dimensional random noise to output samples. The discriminator takes as input either a real training sample or a generated sample, and must classify which it is seeing. These two models (usually neural networks) are trained in alternation to convergence.
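For reference, the adversarial game described above is commonly written as the following minimax objective from the original GAN formulation. The symbols $G$ (generator), $D$ (discriminator), and $P_z$ (noise distribution) are notation introduced here, and the specific GAN variants discussed in this paper may use a different loss (for example, a Wasserstein loss).

```latex
\min_{G} \max_{D} \;
  \mathbb{E}_{x \sim P_x}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim P_z}\!\left[\log\bigl(1 - D(G(z))\bigr)\right]
```

The discriminator's update increases this value by classifying real and generated samples correctly, while the generator's update decreases it by making $G(z)$ harder to tell apart from samples of $P_x$.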

GANs have been used with great success in the image domain, achieving state-of-the-art image and video generation [ ]. They are able to learn both local and global correlations in training data to produce high-resolution samples. Hence, there is reason to believe that they may also be good at modeling correlations in network traffic, which involve both short- and long-term correlations [ ]. GANs can be tailored to different types of data, including tabular data [74] and time series [26, 39, 80].

3.3 Strawman approaches and limitations

We begin by understanding the limitations of canonical GAN-based architectures in our context before we explain our design choices to tackle these challenges in the next section.

Strawman solutions. While GANs have most popularly been used for generating image data, they have also been used to generate structured tabular data that appear in many application domains [ ]. As such, a very natural starting point for using GANs is to treat packet- or flow-header traces as tabular data (e.g., CTGAN [ ]). Here, each row represents a packet/flow with columns capturing various features of interest (e.g., IP addresses, port, packet/byte counts, timestamp information). Indeed, many existing efforts for extending GANs to networking contexts (e.g., E-WGAN-GP [ ]) adopt this approach with some extensions. A recently proposed GAN architecture called DoppelGANger [ ] considers other types of metadata-measurement traces modeled as time series. However, it is not clear if, and how, this work can apply to packet- and flow-header traces. As a point of reference, we also consider a state-of-the-art non-GAN approach called STAN that uses autoregressive neural networks [ ]. We defer a full description of these baselines to Section 6.

Challenge 1 (C1): Baselines do not accurately capture header correlations of packets/flows, e.g., flow length.

[Figure 1, panel (a): CDF of NetFlow records with same five tuples (UGR16). x-axis: # of records with the same five tuple; y-axis: CDF. Series: Real, CTGAN, STAN, E-WGAN-GP, NetShare.]

[Figure 1, panel (b): CDF of flow size (# of packets) on CAIDA. x-axis: flow size (# of packets per flow); y-axis: CDF. Series: Real, CTGAN, PAC-GAN, PacketCGAN, Flow-WGAN, NetShare.]

Figure 1: Distribution of # of records/packets with the same five tuples on UGR16 (NetFlow, left) and CAIDA (PCAP, right). All baselines are missing in Fig. 1b as they don’t generate flows with > 1 packet.

Many downstream tasks (e.g., sketch-based telemetry [ ], header-based anomaly detection algorithms [ ]) need datasets to accurately capture properties that span across packets and flows (e.g., flow size). In the case of packet header traces, we see in Fig. 1b that the baselines are actually absent in the CDF plot of flow size. This is because they do not generate multiple packets for the same flow! This is not surprising as prior GAN-based work has treated each packet as a record in a tabular database, without timestamps [ ]. A similar challenge arises with flow data. Long-lived flows can span multiple measurement epochs, and it is not uncommon to see flow records spanning multiple epochs. Moreover, given the way flow collectors are configured (e.g., inactive timeouts, max time of flow), the same flow record can also appear multiple times within a single measurement epoch. As we see in Fig. 1a, baselines either generate much longer flow records (e.g., CTGAN [ ], up to a few thousand) or consistently generate short flows (e.g., E-WGAN-GP [57]).
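The quantity plotted in Figure 1, the number of records sharing a five-tuple, can be computed with a short sketch like the one below. The tuple layout of each record and the function names are assumptions for illustration; a generator that never emits two records with the same five-tuple would produce a CDF that jumps to 1.0 at a count of 1.

```python
from collections import Counter


def records_per_five_tuple(records):
    """Count how many records share each (src_ip, dst_ip, src_port, dst_port, proto).

    Each record is assumed to be a tuple whose first five entries are the five-tuple.
    """
    return Counter(tuple(rec[:5]) for rec in records)


def empirical_cdf(values):
    """Return sorted (value, cumulative fraction) pairs for a list of values."""
    counts = Counter(values)
    total = len(values)
    cdf, running = [], 0
    for value in sorted(counts):
        running += counts[value]
        cdf.append((value, running / total))
    return cdf


# Made-up toy trace: flow A has three packets, flow B has one.
flow_a = ("10.0.0.1", "10.0.0.2", 51000, 443, "TCP")
flow_b = ("10.0.0.3", "10.0.0.4", 40000, 53, "UDP")
trace = [flow_a + (0.1,), flow_b + (0.2,), flow_a + (0.3,), flow_a + (0.4,)]

sizes = list(records_per_five_tuple(trace).values())
flow_size_cdf = empirical_cdf(sizes)
```

Comparing this CDF between a real and a synthetic trace exposes the failure mode described above.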

Challenge 2 (C2): Baselines struggle to accurately capture the distributions for fields with large support.

The support of a field refers to the possible range of values it can take. Several of the fields we aim to generate have a large support, including source/destination ports, source/destination IPs, and number of packets/bytes per flow. Fields with extremely small/large values could indicate a potential anomaly, which is crucial to downstream tasks, e.g., anomaly detection [ ]. Unfortunately, existing GAN-based baselines do not capture such fields well. Consider the following illustrative examples. In flow-header traces, the “number of packets per flow” and “number of bytes per flow” can range from tens for mice flows to hundreds of millions for elephant flows. Fig. 2 shows that baselines generate a much more limited range and also miss the correct distribution for small values. As another example, consider the port number field in headers. Correctly learning the distribution of port numbers (especially the service ports < 1024) is key for many measurement tasks (e.g., anomaly detection [ ]). Fig. 3 shows the baselines do not accurately capture the structure of top-K ports (and miss nearly all of them).

[Figure 2, panel (a): # of packets per flow. x-axis: # of packets per flow; y-axis: CDF. Series: Real, CTGAN, STAN, E-WGAN-GP, NetShare.]

[Figure 2, panel (b): # of bytes per flow. x-axis: # of bytes per flow; y-axis: CDF. Series: Real, CTGAN, STAN, E-WGAN-GP, NetShare.]

Figure 2: Distribution of NetFlow’s (unbounded) fields on UGR16 dataset: left: flow size; right: flow volume.

[Figure 3: four panels comparing Real against CTGAN, STAN, E-WGAN-GP, and NetShare. x-axis: top 5 service destination port number (53, 80, 445, 443, 21); y-axis: relative frequency.]

Figure 3: Top 5 service destination ports in TON (NetFlow): baselines fail to capture the most frequent service ports, while NetShare captures each mode by using a simpler and more effective IP2Vec.
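The statistic behind Figure 3 can be approximated with a short sketch: count destination ports below 1024 and report the most frequent ones as relative frequencies. The function name, the default `k`, and the choice to normalize by the total number of records (rather than by service-port records only) are assumptions, since the figure does not state the normalization.

```python
from collections import Counter


def top_service_ports(dst_ports, k=5, service_max=1024):
    """Return the k most frequent destination ports below service_max.

    Frequencies are relative to all records (assumed normalization).
    """
    total = len(dst_ports)
    if total == 0:
        return []
    counts = Counter(port for port in dst_ports if port < service_max)
    return [(port, n / total) for port, n in counts.most_common(k)]


# Made-up toy input: DNS twice, HTTP and HTTPS once, one ephemeral port.
example = top_service_ports([53, 53, 80, 443, 50000], k=2)
```

Running the same function on real and synthetic traces and comparing the two lists shows whether a generator reproduces the dominant service ports.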

Challenge 3 (C3): Existing GAN-based frameworks exhibit poor scalability-fidelity tradeoffs on network traces.

In theory, some of these fidelity challenges can be partially addressed with larger training datasets, as deep generative models generally achieve better results with more parameters and more data [ ]. However, this approach quickly encounters scalability challenges. Fig. 4 shows the trade-offs between scalability and fidelity of baselines on a NetFlow dataset (Fig. 4a, Fig. 4b) and a PCAP dataset (Fig. 4c, Fig. 4d). We measure scalability as the total CPU hours (as opposed to the wall clock time, since multiple machines are used simultaneously) and the fidelity as average JS divergence and normalized EMD across different metrics (refer to Section 6 for details). Simple tabular approaches (e.g., CTGAN, E-WGAN-GP) use the fewest CPU hours while achieving worse fidelity due to their modeling assumptions. We were unable to train the synthetic time series trace generator DoppelGANger [ ] on our datasets due to memory constraints. As an intermediate design we modified DoppelGANger to include our proposed merging and encoding techniques (described in §4), shown as ‘NetShare-V0’ in Figure 4. While this can achieve better fidelity, it also uses 10x more CPU hours.

[Figure 4: four panels, each with x-axis: training time (CPU hours). (a) UGR16 (NetFlow) JSD, y-axis: Avg. JSD; (b) UGR16 (NetFlow) EMD, y-axis: Avg. normalized EMD; series for (a) and (b): CTGAN, STAN, E-WGAN-GP, NetShare-V0, NetShare. (c) PCAP dataset, y-axis: Avg. JSD; (d) CAIDA (PCAP) EMD, y-axis: Avg. normalized EMD; series for (c) and (d): CTGAN, PAC-GAN, PacketCGAN, Flow-WGAN, NetShare-V0, NetShare.]

Figure 4: Scalability-fidelity trade-offs: Scalability is measured with total CPU hours (↓) and fidelity is measured with the average JSD across categorical fields and the average normalized EMD across continuous fields (↓).

[Figure 5: four panels, each with x-axis: Epsilon. (a) NetFlow (UGR16) JSD, y-axis: Avg. JSD; (b) NetFlow (UGR16) EMD, y-axis: Avg. normalized EMD; (c) y-axis: Avg. JSD; (d) PCAP (CAIDA) EMD, y-axis: Avg. normalized EMD. Series: Naive DP, DP Pretrained-SAME, DP Pretrained-DIFF.]

Figure 5: Privacy-fidelity trade-offs: Privacy is measured with $(\epsilon, \delta)$ in DP (↓) and fidelity is measured as average JSD across categorical fields and the average normalized EMD across continuous fields (↓).
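The two fidelity measures named above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's evaluation code: the JSD uses base-2 logarithms (so it lies in [0, 1]), the EMD helper only handles equal-size samples, and `rescale` applies a min-max mapping to [0.1, 0.9] as one plausible reading of the normalization footnote. All function names are hypothetical.

```python
import math
from collections import Counter


def js_divergence(real, synth):
    """Jensen-Shannon divergence (base 2) between two categorical samples."""
    if not real or not synth:
        raise ValueError("both samples must be non-empty")
    p, q = Counter(real), Counter(synth)
    jsd = 0.0
    for value in set(p) | set(q):
        pv, qv = p[value] / len(real), q[value] / len(synth)
        mid = 0.5 * (pv + qv)
        if pv > 0:
            jsd += 0.5 * pv * math.log2(pv / mid)
        if qv > 0:
            jsd += 0.5 * qv * math.log2(qv / mid)
    return jsd


def emd_1d(real, synth):
    """Earth mover's distance between two equal-size 1-D samples.

    For equal sample counts this is the mean gap between sorted values.
    """
    if len(real) != len(synth) or not real:
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(real), sorted(synth))
    return sum(abs(a - b) for a, b in pairs) / len(real)


def rescale(values, lo=0.1, hi=0.9):
    """Min-max rescale a list of EMDs to [lo, hi] (assumed normalization)."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [lo for _ in values]
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]
```

Lower is better for both metrics: identical samples give a JSD of 0, while samples with disjoint support give a JSD of 1.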

Challenge 4 (C4): Existing frameworks exhibit poor privacy-fidelity tradeoffs.

Most prior work on GAN-based trace generation does not evaluate explicit privacy mechanisms [ ]. This is inadequate, as synthetic data may present privacy concerns [ ]. In the prior work that does explicitly consider privacy [ ], the main conclusion is that differentially-private (DP) training via DP-SGD destroys the fidelity of generated signals. Indeed, we can see in Figure 5 that as we decrease the DP privacy parameter $\epsilon$ (lower $\epsilon, \delta$ indicate better privacy; we set $\delta = 10^{-5}$), synthetic data fidelity is destroyed even for weak parameters like $\epsilon =$ (which means almost no privacy) with an average JS divergence up to 0.21 on UGR16 dataset (Fig. 5a). In other words, even very weak privacy breaks the fidelity. The full experimental setup of Figure 5 is explained in §6.2, Finding 3.
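For readers unfamiliar with the $(\epsilon, \delta)$ parameters, the standard definition of differential privacy is reproduced below. The symbols $M$ (the randomized training mechanism), $D$ and $D'$ (neighboring datasets), and $S$ (a set of outputs) are notation introduced here; this section does not specify what counts as "neighboring" for header traces, so that remains an open assumption.

```latex
\Pr\left[ M(D) \in S \right] \;\le\; e^{\epsilon} \, \Pr\left[ M(D') \in S \right] + \delta
\qquad \text{for all neighboring } D, D' \text{ and all output sets } S
```

Smaller $\epsilon$ and $\delta$ force the output distributions on neighboring datasets closer together, which is why lower values indicate stronger privacy and a very large $\epsilon$ provides almost none.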

4 NETSHARE DESIGN

Next, we present the design of NetShare via four high-level insights in §4.1 with an end-to-end system overview in §4.2.

4.1 High-level insights

Insight 1 (I1): We reformulate header trace generation as a time series generation problem of generating flow records for the entire trace rather than a per-epoch tabular approach (Figure 6).

Figure 6: Instead of generating measurement epochs $D_i$ through a tabular GAN, we merge multiple epochs $D_i$ into a giant trace $D$, split the trace into flows $D^{flow}$, and use a time-series GAN.
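The merge-and-split step in Figure 6 can be sketched as below. This is an illustrative reading rather than the NetShare implementation: records are modeled as dictionaries with hypothetical keys (`ts`, `src_ip`, and so on), and grouping by the five-tuple is an assumption about how flows are delimited.

```python
from collections import defaultdict


def merge_epochs(epochs):
    """Concatenate per-epoch record lists into one trace ordered by timestamp."""
    merged = [rec for epoch in epochs for rec in epoch]
    merged.sort(key=lambda rec: rec["ts"])
    return merged


def split_into_flows(trace):
    """Group records by five-tuple; each flow keeps its records in time order."""
    flows = defaultdict(list)
    for rec in trace:
        key = (rec["src_ip"], rec["dst_ip"], rec["src_port"],
               rec["dst_port"], rec["proto"])
        flows[key].append(rec)
    return dict(flows)


def make_record(ts, src_port):
    """Build a made-up record for the usage example below."""
    return {"ts": ts, "src_ip": "10.0.0.1", "dst_ip": "10.0.0.2",
            "src_port": src_port, "dst_port": 443, "proto": "TCP"}


# Two toy epochs: the flow from port 51000 spans both of them.
epoch_1 = [make_record(2.0, 51000), make_record(1.0, 52000)]
epoch_2 = [make_record(3.0, 51000)]
flows = split_into_flows(merge_epochs([epoch_1, epoch_2]))
```

Each value in `flows` is then a time series for one flow, so records of a flow that spans several epochs stay together instead of appearing as unrelated rows in separate tables.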

As we saw earlier, existing approaches do not learn header field correlations spanning multiple packets or epochs (e.g., flow size). The root cause is that these approaches treat each packet or flow record independently and ignore intra- and inter-measurement epoch correlations.

To systematically capture these cross-record correlations, we reformulate the header generation problem as a time series generation problem rather than a tabular generation problem as shown in Figure 6. Specifically, we begin by merging data from measurement epochs $D_i$ into one giant trace $D$ to capture inter-measurement epoch correlations. Given this giant trace $D$, we split it into a set of flows

For each continuous field, we normalize the EMDs of all models across all epsilons to [0.1, 0.9].

We do not argue that DP is necessarily the best or only privacy definition for a networking setting. It is a widely-accepted metric in the privacy community [ ]. At the very least, it is natural and desirable to generate DP synthetic data without destroying its fidelity.
