may be affected by competing workloads on the host computer. Hence, a better understanding of the data transfers associated with GPU computing is essential to support latency-critical real-time systems. Unfortunately, prior work [3], [9] did not characterize performance in the context of latency and concurrent workloads; it focused on a basic comparison of DMA and direct I/O access.
Contribution: In this paper, we clarify the performance characteristics of currently-achievable data transfer methods for GPU computing, while unveiling several new data transfer methods beyond the well-known DMA and I/O read-and-write access. We reveal the advantages and disadvantages of these methods in a quantitative way, leading to the conclusion that the typical DMA and I/O read-and-write methods are the most effective in latency even in the presence of competing workload, whereas concurrent data streams from multiple different contexts can benefit from the capability of on-chip microcontrollers integrated in the GPU. To the best of our knowledge, this is the first evidence that data transfers matter for GPU computing beyond an intuitive expectation, which allows system designers to choose appropriate data transfer methods depending on the requirements of their latency-sensitive GPU applications. Without our findings, one cannot reason about how to use GPUs while minimizing data transfer latency and performance interference. These findings are also applicable to many PCIe compute devices, not just a specific GPU. We believe that the contributions of this paper are useful for low-latency GPU computing.
Organization: The rest of this paper is organized as fol-
lows. Section II presents the assumption and terminology
behind this paper. Section III provides an open investigation
of data transfer methods for GPU computing. Section IV
compares the performances of the investigated data transfer
methods. Related work is discussed in Section V. We provide
our concluding remarks in Section VI.
II. ASSUMPTION AND TERMINOLOGY
We assume that the Compute Unified Device Architecture
(CUDA) is used for GPU programming [11]. A unit of code
that is individually launched on the GPU is called a kernel.
The kernel is composed of multiple threads that execute the
code in parallel.
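For illustration only (this kernel is not taken from the paper; the function name, element count, and launch configuration are hypothetical), a minimal kernel composed of many parallel threads could look as follows:

/* Minimal CUDA kernel sketch: each thread scales one array element. */
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
    if (i < n)
        data[i] *= factor;   /* the same code is executed by n threads in parallel */
}

/* A host-side launch (runtime API syntax) would be, e.g.:
 *   scale<<<(n + 255) / 256, 256>>>(dev_ptr, 2.0f, n);
 */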
CUDA uses a set of application programming interface
(API) functions to manage the GPU. A typical CUDA program
takes the following steps: (i) allocate space in the device memory, (ii) copy input data to the allocated device memory space, (iii) launch the program on the GPU, (iv) copy output data back to the host memory, and (v) free the allocated device memory space. The scope of this paper is related to (ii) and (iv). In particular, we use the cuMemcpyHtoD() and cuMemcpyDtoH() functions provided by the CUDA Driver API, which correspond to (ii) and (iv), respectively.
Since an open-source implementation of these functions is
available [9], we modify them to accommodate various data
transfer methods investigated in this paper. While they are
synchronous data transfer functions, CUDA also provides asynchronous data transfer functions.

Fig. 2. Block diagram of the target system: the CPU and host memory are attached through a PCI bridge to the GPU board, which integrates the host interface, DMA engines, microcontrollers, the device memory, and the GPU chip composed of GPCs, each containing CUDA cores and a microcontroller.

In this paper, we restrict
our attention to the synchronous data transfer functions for
simplicity of description, but partly similar performance characteristics also apply to the asynchronous functions, because both use the same underlying data transfer method; the only difference is the synchronization timing.
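For concreteness, the sketch below shows how steps (i) through (v) map onto the CUDA Driver API. This is an illustrative skeleton, not the paper's evaluation code; error handling is omitted and the kernel launch is left as a placeholder.

#include <cuda.h>
#include <stddef.h>

/* Sketch of steps (i)-(v) using the CUDA Driver API (error checks omitted). */
void copy_roundtrip(const void *in, void *out, size_t size)
{
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr dptr;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    cuMemAlloc(&dptr, size);        /* (i)   allocate device memory               */
    cuMemcpyHtoD(dptr, in, size);   /* (ii)  host-to-device copy (synchronous)    */
    /* (iii) launch the kernel here, e.g. with cuLaunchKernel()                   */
    cuMemcpyDtoH(out, dptr, size);  /* (iv)  device-to-host copy (synchronous)    */
    cuMemFree(dptr);                /* (v)   free device memory                   */

    cuCtxDestroy(ctx);
}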
In order to focus on the performance of data transfers
between the host and the device memory, we allocate the data buffer in pinned host memory rather than in the typical heap allocated by malloc(). This pinned host memory space is mapped into the PCIe address space and is never swapped out. It is also directly accessible to the GPU.
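As a sketch of how such a buffer can be obtained with the Driver API (the flag choice is our assumption and not necessarily the exact allocation used in the evaluation):

#include <cuda.h>
#include <stddef.h>

/* Sketch: allocate a pinned (page-locked) host buffer that the GPU can access
 * directly; the DEVICEMAP flag requests a device-visible mapping. */
void *alloc_pinned_buffer(size_t size, CUdeviceptr *dev_alias)
{
    void *host_ptr = NULL;

    /* Page-locked allocation instead of malloc(); never swapped out. */
    cuMemHostAlloc(&host_ptr, size, CU_MEMHOSTALLOC_DEVICEMAP);

    /* Device-side address aliasing the same physical buffer. */
    cuMemHostGetDevicePointer(dev_alias, host_ptr, 0);

    return host_ptr;   /* release later with cuMemFreeHost(host_ptr) */
}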
Our computing platform contains a single CPU and a single GPU. Although we restrict our attention to CUDA and the GPU, the notion of the investigated data transfer methods is well applicable to other heterogeneous compute devices. GPUs are currently a well-recognized form of heterogeneous compute device, but emerging alternatives include the Intel Many Integrated Core (MIC) and the AMD Fusion technology.
The programming models of these different platforms are
almost identical in that the CPU controls the compute devices.
Our future work includes an integrated investigation of these
different platforms.
Figure 2 shows a summarized block diagram of the target
system. The host computer consists of the CPU and the host
memory communicating on the system I/O bus. They are
connected to the PCIe bus to which the GPU board is also
connected. This means that the GPU is visible to the CPU
as a PCIe device. The GPU is a complex compute device integrating many hardware functional units on a chip; this paper focuses only on the CUDA-related units. The GPU board comprises the device memory and the GPU chip, connected through a high-bandwidth memory bus. The GPU chip contains graphics
processing clusters (GPCs), each of which integrates hundreds
of processing cores, a.k.a. CUDA cores. The number of GPCs
and CUDA cores is architecture-specific. For example, GPUs
based on the NVIDIA GeForce Fermi architecture [12] used
in this paper support at most 4 GPCs and 512 CUDA cores.
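Because these counts are architecture-specific, they can be queried at run time. The sketch below uses standard Driver API device attributes; note that the GPC count itself is not exposed, only the multiprocessor (SM) count, and the attribute names assume a reasonably recent CUDA version.

#include <cuda.h>
#include <stdio.h>

/* Sketch: query architecture-specific parallelism of device 0 at run time. */
void print_core_counts(void)
{
    CUdevice dev;
    int sm_count = 0, major = 0, minor = 0;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuDeviceGetAttribute(&sm_count, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, dev);
    cuDeviceGetAttribute(&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev);
    cuDeviceGetAttribute(&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev);

    /* On Fermi (compute capability 2.x), each SM contains 32 CUDA cores. */
    printf("SMs: %d, compute capability %d.%d\n", sm_count, major, minor);
}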
Each GPC is configured by an on-chip microcontroller. This
microcontroller is wimpy but is capable of executing firmware
code with its own instruction set. There is also a special hub microcontroller, which broadcasts operations to all the GPC-dedicated microcontrollers. In addition to hardware