SOSP 2011: ACM Symposium on Operating Systems Principles 2011 proceedings (CSDN library resource)


Rating: 4 stars (higher than 85% of resources) · Points required: 9 · 180 views · Uploaded 2012-02-02 23:13:01 · 1 comment · 29.85MB RAR


SOSP 2011.rar (28 files)

SOSP 2011

Security

Software fault isolation with API integrity and multi-principal modules.pdf 828KB

Intrusion recovery for database-backed web applications.pdf 791KB

CryptDB _ protecting confidentiality with encrypted query processing.pdf 841KB

Storage

Differentiated storage services.pdf 835KB

Design implications for enterprise storage systems via multi-dimensional trace analysis.pdf 734KB

A file is not a file _ understanding the I O behavior of Apple desktop applications.pdf 839KB

Threads and races

Detecting and surviving data races using complementary schedules.pdf 786KB

Efficient deterministic multithreading through schedule relaxation.pdf 1.2MB

Dthreads _ efficient deterministic multithreading.pdf 1.19MB

Pervasive detection of process races in deployed systems.pdf 1.68MB

Geo-replication

Don't settle for eventual _ scalable causal consistency for wide-area storage with COPS.pdf 1.56MB

Transactional storage for geo-replicated systems.pdf 816KB

Virtualization

Atlantis _ robust, extensible execution environments for web applications.pdf 1.18MB

Cells _ a virtual mobile smartphone architecture.pdf 1.28MB

CloudVisor _ retrofitting protection of virtual machines in multi-tenant cloud with nested virtualization.pdf 1.2MB

Breaking up is hard to do _ security and functionality in a commodity hypervisor.pdf 1.58MB

Key-value

SILT _ a memory-efficient, high-performance key-value store.pdf 1.96MB

Scalable consistency in Scatter.pdf 1.77MB

Fast crash recovery in RAMCloud.pdf 791KB

Reality

Thialfi _ a client notification service for internet-scale applications.pdf 1.58MB

Windows Azure Storage _ a highly available cloud storage service with strong consistency.pdf 790KB

An empirical study on configuration errors in commercial and open source systems.pdf 808KB

Detection and tracing

Practical software model checking via dynamic interface reduction.pdf 854KB

Fay _ extensible distributed tracing from kernels to clusters.pdf 833KB

Secure network provenance.pdf 1.94MB

Detecting failures in distributed systems with the Falcon spy network.pdf 763KB

OS Architecture

PTask _ operating system abstractions to manage GPUs as compute devices.pdf 2.12MB

Logical attestation _ an authorization architecture for trustworthy computing.pdf 1.6MB

PTask: Operating System Abstractions To Manage GPUs as Compute Devices

Christopher J. Rossbach

Microsoft Research

crossbac@microsoft.com

Jon Currey

Microsoft Research

jcurrey@microsoft.com

Mark Silberstein

Technion

marks@cs.technion.ac.il

Baishakhi Ray

University of Texas at Austin

bray@cs.utexas.edu

Emmett Witchel

University of Texas at Austin

witchel@cs.utexas.edu

ABSTRACT

We propose a new set of OS abstractions to support GPUs and other accelerator devices as first class computing resources. These new abstractions, collectively called the PTask API, support a dataflow programming model. Because a PTask graph consists of OS-managed objects, the kernel has sufficient visibility and control to provide system-wide guarantees like fairness and performance isolation, and can streamline data movement in ways that are impossible under current GPU programming models.

Our experience developing the PTask API, along with a gestural interface on Windows 7 and a FUSE-based encrypted file system on Linux, shows that the PTask API can provide important system-wide guarantees where there were previously none, and can enable significant performance improvements, for example gaining a 5× improvement in maximum throughput for the gestural interface.

Categories and Subject Descriptors

D.4.8 [Operating systems]: [Performance]; D.4.7 [Operating sys-

tems]: [Organization and Design]; I.3.1 [Hardware Architecture]:

[Graphics processors]; D.1.3 [Programming Techniques]: [Con-

current Programming]

General Terms

OS Design, GPUs, Performance

Keywords

Dataﬂow, GPUs, operating systems, GPGPU, gestural interface,

accelerators

1. INTRODUCTION

Three of the top five supercomputers on the TOP500 list for June 2011 (the most recent ranking) use graphics processing units (GPUs) [6]. GPUs have surpassed CPUs as a source of high-density computing resources. The proliferation of fast GPU hardware has been accompanied by the emergence of general purpose GPU (GPGPU) frameworks such as DirectX, CUDA [59], and OpenCL [47], enabling talented programmers to write high-performance code for GPU hardware. However, despite the success of GPUs in supercomputing environments, GPU hardware and programming environments are not routinely integrated into many other types of systems because of programming difficulty, lack of modularity, and unpredictable performance artifacts.

Figure 1: Technology stacks for CPU vs GPU programs. The 1-to-1 correspondence of OS-level and user-mode runtime abstractions for CPU programs is absent for GPU programs.

Current software and system support for GPUs allows their com-

putational power to be used for high-performance rendering or for a

wide array of high-performance batch-oriented computations [26],

but GPU use is limited to certain application domains. The GPGPU

ecosystem lacks rich operating system (OS) abstractions that would

enable new classes of compute-intensive interactive applications,

such as gestural input, brain-computer interfaces, and interactive

video recognition, or applications in which the OS uses the GPU

for its own computation such as encrypted ﬁle systems. In contrast

to interactive games, which use GPUs as rendering engines, these

applications use GPUs as compute engines in contexts that require

OS support. We believe these applications are not being built because of inadequate OS-level abstractions and interfaces. The time has come for OSes to stop managing graphics processing devices (GPUs) as I/O devices and start managing them as computational devices, like CPUs.

Figure 1 compares OS-level support for traditional hardware to OS-level support for GPUs. In contrast to most common system resources such as CPUs and storage devices, kernel-level abstractions for GPUs are severely limited. While OSes provide a driver interface to GPUs, that interface locks away the full potential of the graphics hardware behind an awkward ioctl-oriented interface designed for reading and writing blocks of data to millisecond-latency disks and networks. Moreover, lack of a general kernel-facing interface severely limits what the OS can do to provide high-level abstractions for GPUs: in Windows, and other closed-source OSes, using the GPU from a kernel mode driver is not currently supported using any publicly documented APIs. Additionally, because the OS manages GPUs as peripherals rather than as shared compute resources, the OS leaves resource management for GPUs to vendor-supplied drivers and user-mode run-times. With no role in GPU resource-management, the OS cannot provide guarantees of fairness and performance isolation. For applications that rely on such guarantees, GPUs are consequently an impractical choice.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

SOSP '11, October 23-26, 2011, Cascais, Portugal.

This paper proposes a set of kernel-level abstractions for manag-

ing interactive, high-compute devices. GPUs represent a new kind

of peripheral device, whose computation and data bandwidth ex-

ceed that of the CPU. The kernel must expose enough hardware

detail of these peripherals to allow programmers to take advantage

of their enormous processing capabilities. But the kernel must hide

programmer inconveniences like memory that is non-coherent be-

tween the CPU and GPU, and must do so in a way that preserves

performance. GPUs must be promoted to ﬁrst-class computing re-

sources, with traditional OS guarantees such as fairness and isola-

tion, and the OS must provide abstractions that allow programmers

to write code that is both modular and performant.

Our new abstractions, collectively called the PTask API, provide a dataflow programming model in which the programmer writes code to manage a graph-structured computation. The vertices in the graph are called ptasks (short for parallel tasks), which are units of work such as a shader program that runs on a GPU, or a code fragment that runs on the CPU or another accelerator device. PTask vertices in the graph have input and output ports exposing data sources and sinks in the code, and are connected by channels, each of which represents a data flow edge in the graph. The graph expresses both data movement and potential concurrency directly, which can greatly simplify programming. The programmer must express only where data must move, not how or when, allowing the system to parallelize execution and optimize data movement without any additional code from the programmer. For example, two sibling ptasks in a graph can run concurrently in a system with multiple GPUs without additional GPU management code, and double buffering is eliminated when multiple ptasks that run on a single accelerator are dependent and sequentially ordered. Under current GPU programming models, such optimizations require direct programmer intervention, but with the PTask API, the same code adapts to run optimally on different hardware substrates.
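The vertex, port, and channel vocabulary above can be illustrated with a small user-space sketch. This is not the PTask API: the names `Channel`, `Node`, `connect`, and `run` are hypothetical, the vertices are plain Python callables rather than GPU code, and the loop that fires ready vertices stands in for a real scheduler. It only shows that the program states where data flows, and the runtime decides when each vertex fires.

```python
from collections import deque


class Channel:
    """A data-flow edge: a FIFO queue between two ports (hypothetical name)."""

    def __init__(self):
        self.queue = deque()


class Node:
    """A graph vertex: a callable plus named input and output ports."""

    def __init__(self, name, fn, inputs, outputs):
        self.name = name
        self.fn = fn
        self.inputs = {port: None for port in inputs}   # port -> Channel
        self.outputs = {port: [] for port in outputs}   # port -> [Channel]

    def ready(self):
        # Runnable once every input port is bound and has data waiting.
        return bool(self.inputs) and all(
            ch is not None and ch.queue for ch in self.inputs.values()
        )

    def fire(self):
        # Consume one item per input port, then publish results on output ports.
        args = {port: ch.queue.popleft() for port, ch in self.inputs.items()}
        for port, value in self.fn(**args).items():
            for ch in self.outputs[port]:
                ch.queue.append(value)


def connect(src, out_port, dst, in_port):
    """Bind src.out_port to dst.in_port; None marks the edge of the graph."""
    ch = Channel()
    if src is not None:
        src.outputs[out_port].append(ch)
    if dst is not None:
        dst.inputs[in_port] = ch
    return ch


def run(nodes):
    """Fire ready nodes until nothing can progress; return the firing order."""
    fired = []
    progress = True
    while progress:
        progress = False
        for node in nodes:
            if node.ready():
                node.fire()
                fired.append(node.name)
                progress = True
    return fired


scale = Node("scale", lambda x: {"y": x * 10}, ["x"], ["y"])
double = Node("double", lambda v: {"out": v * 2}, ["v"], ["out"])
add_one = Node("add_one", lambda v: {"out": v + 1}, ["v"], ["out"])

graph_in = connect(None, None, scale, "x")       # external input
connect(scale, "y", double, "v")                 # scale feeds two siblings
connect(scale, "y", add_one, "v")
out_double = connect(double, "out", None, None)  # external outputs
out_add = connect(add_one, "out", None, None)

graph_in.queue.append(3)
order = run([scale, double, add_one])
print(order, list(out_double.queue), list(out_add.queue))
```

In this toy graph `double` and `add_one` are siblings: neither depends on the other, so a runtime with two devices would be free to run them concurrently without any change to the graph-building code.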

A PTask graph consists of OS-managed objects, so the kernel has sufficient visibility and control to provide system-wide guarantees like fairness and performance isolation. The PTask runtime tracks GPU usage and provides a state machine for ptasks that allows the kernel to schedule them in a way similar to processes. Under current GPU frameworks, GPU scheduling is completely hidden from the kernel by vendor-provided driver code, which often implements simplistic policies such as round-robin. These simple policies can thwart kernel scheduling priorities, undermining fairness and inverting priorities, often in a dramatic way.

Kernel-level ptasks enable data movement optimizations that are

impossible with current GPU programming frameworks. For ex-

ample, consider an application that uses the GPU to accelerate

real-time image processing for data coming from a peripheral like

a camera. Current GPU frameworks induce excessive data copying

by causing data to migrate back and forth across the user-kernel

boundary, and by double-buffering in driver code. A PTask graph,

conversely, provides the OS with precise information about data’s

origin(s) and destination(s). The OS uses this information to elimi-

nate unnecessary data copies. In the case of real-time processing of

image data from a camera, the PTask graph enables the elimination

of two layers of buffering. Because data ﬂows directly from the

camera driver to the GPU driver, an intermediate buffer is unneces-

sary, and a copy to user space is obviated.

We have implemented the full PTask API for Windows 7 and

PTask scheduling in Linux. Our experience using PTask to ac-

celerate a gestural interface in Windows and a FUSE-based en-

crypted ﬁle system in Linux shows that kernel-level support for

GPU abstractions provides system-wide guarantees, enables signif-

icant performance gains, and can make GPU acceleration practical

in application domains where previously it was not.

This paper makes the following contributions.

• Provides quantitative evidence that modern OS abstractions are insufficient to support a class of “interactive” applications that use GPUs, showing that simple GPU programs can degrade the responsiveness of a desktop that uses the GPU by nearly an order of magnitude.

• Provides a design for OS abstractions to support a wide range of GPU computations with traditional OS guarantees like fairness and isolation.

• Provides a prototype of the PTask API and a GPU-accelerated gestural interface, along with evidence that PTasks enable “interactive” applications that were previously impractical, while providing fairness and isolation guarantees that were previously absent from the GPGPU ecosystem. The dataflow programming model supported by the PTask API delivers throughput improvements of up to 4× across a range of microbenchmarks and a 5× improvement for our prototype gestural interface.

• Demonstrates a prototype of GPU-aware scheduling in the Linux kernel that forces GPU-using applications to respect kernel scheduling priorities.

2. MOTIVATION

This paper focuses on GPU support for interactive applications

like gesture-based interfaces, neural interfaces (also called brain-

computer interfaces or BCIs) [48], encrypting ﬁle systems and real-

time audio/visual interfaces such as speech recognition. These tasks

are computationally demanding, have real-time performance and

latency constraints, and feature many data-independent phases of

computation. GPUs are an ideal compute substrate for these tasks

to achieve their latency deadlines, but lack of kernel support forces

designers of these applications to make difﬁcult and often unten-

able tradeoffs to use the GPU.

To motivate our new kernel abstractions we explore the problem

of interactive gesture recognition as a case study. A gestural inter-

face turns a user’s hand motions into OS input events such as mouse

movements or clicks [36]. Forcing the user to wear special gloves

makes gesture recognition easier for the machine, but it is unnat-

ural. The gestural interface we consider does not require the user

to wear any special clothing. Such a system must be tolerant to vi-

sual noise on the hands, like poor lighting and rings, and must use

cheap, commodity cameras to do the gesture sensing. A gestural

interface workload is computationally demanding, has real-time la-

tency constraints, and is rich with data-parallel algorithms, making

it a natural ﬁt for GPU-acceleration. Gesture recognition is similar

to the computational task performed by Microsoft’s Kinect, though

that system has fewer cameras, lower data rates and grosser fea-

tures. Kinect only runs a single application at a time (the current

game), which can use all available GPU resources. An operating

system must multiplex competing applications.

Figure 2: A gesture recognition system based on photogrammetric cameras.

Figure 2 shows a basic decomposition of a gesture recognition system. The system consists of some number of cameras (in this example, photogrammetric sensors [28]), and software to analyze images captured from the cameras. Because such a system functions as a user input device, gesture events recognized by the system must be multiplexed across applications by the OS; to be usable, the system must deliver those events with high frequency and low latency. The design decomposes the system into four components, implemented as separate programs:

• catusb: Captures image data from cameras connected on a USB bus. Short for “cat /dev/usb”.

• xform: Performs geometric transformations that map images from multiple camera perspectives to a single point cloud in the coordinate system of the screen or user. Inherently data-parallel.

• filter: Performs noise filtering on image data produced by the xform step. Inherently data-parallel.

• hidinput: Detects gestures in a point cloud and sends them to the OS as human interface device (HID) input. Not data-parallel.

Given these four programs, a gestural interface system can be composed using POSIX pipes as follows:

    catusb | xform | filter | hidinput &

This design is desirable because it is modular (making its components easily reusable) and because it relies on familiar OS-level abstractions to communicate between components in the pipeline. Inherent data-parallelism in the xform and filter programs strongly argues for GPU acceleration. We have prototyped these computations and our measurements show they are not only a good fit for GPU-acceleration, they actually require it. If the system uses multiple cameras with high data rates and large image sizes, these algorithms can easily saturate a modern chip multi-processor (CMP). For example, our filter prototype relies on bilateral filtering [67]. A well-optimized implementation using fork/join parallelism is unable to maintain real-time frame rates on a 4-core CMP despite consuming nearly 100% of the available CPU. In contrast, a GPU-based implementation easily realizes frame rates above the real-time rate, and has minimal effect on CPU utilization because nearly all of the work is done on the GPU.

2.1 The problem of data movement

No direct OS support for GPU abstractions exists, so comput-

ing on a GPU for a gestural interface necessarily entails a user-

level GPU programming framework and run-time such as DirectX,

CUDA, or OpenCL. Implementing xform and ﬁlter in these frame-

works yields dramatic speedups for the components operating in

isolation, but the system composed with pipes suffers from ex-

cessive data movement across both the user-kernel boundary and

through the hardware across the PCI express (PCIe) bus.

For example, reading data from a camera requires copying image buffers out of kernel space to user space. Writing to the pipe connecting catusb to xform causes the same buffer to be written back into kernel space. To run xform on the GPU, the system must read buffers out of kernel space into user space, where a user-mode runtime such as CUDA must subsequently write the buffer back into kernel space and transfer it to the GPU and back. This pattern repeats as data moves from the xform to the filter program and so on. This simple example incurs 12 user/kernel boundary crossings.

Excessive data copying also occurs across hardware components. Image buffers must migrate back and forth between main memory and GPU memory repeatedly, increasing latency while wasting bandwidth and power.

Figure 3: Relative GPU execution time and overhead (lower is better) for CUDA-based implementation of the xform program in our prototype system. sync uses synchronous communication of buffers between the CPU and GPU, async uses asynchronous communication, and async-pp uses both asynchrony and ping-pong buffers to further hide latency. Bars are divided into time spent executing on the GPU and system overhead. DtoH represents an implementation that communicates between the device and the host on every frame, HtoD the reverse, and both represents bi-directional communication for every frame. Reported execution time is relative to the synchronous, bi-directional case (sync-both).
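One way to arrive at the figure of 12 user/kernel boundary crossings is to tally one crossing per direction change at each stage of the pipeline. The per-stage breakdown below is an assumption made for illustration: the text spells out the steps for catusb and xform, while the entries for filter (assumed to mirror xform) and hidinput (assumed to read the pipe and then hand the HID event to the OS) are inferred.

```python
K2U = "kernel->user"
U2K = "user->kernel"

# Assumed per-stage crossings; only the total of 12 is stated in the text.
PIPELINE = [
    ("catusb", [K2U, U2K]),              # read camera buffer, write pipe
    ("xform", [K2U, U2K, K2U, U2K]),     # read pipe, send to GPU, get result, write pipe
    ("filter", [K2U, U2K, K2U, U2K]),    # assumed to repeat the xform pattern
    ("hidinput", [K2U, U2K]),            # read pipe, deliver HID input to the OS
]


def count_crossings(pipeline):
    """Total number of user/kernel boundary crossings across all stages."""
    return sum(len(crossings) for _, crossings in pipeline)


total_crossings = count_crossings(PIPELINE)
print(total_crossings)
```

Under this accounting, each GPU-accelerated stage contributes four crossings and each plain stage contributes two, which is why moving more stages to the GPU makes the pipe-based composition more expensive rather than less.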

Overheads introduced by run-time systems can severely limit the

effectiveness of latency-hiding mechanisms. Figure 3 shows rela-

tive GPU execution time and system overhead per image frame for

a CUDA-based implementation of the xform program in our pro-

totype. The ﬁgure compares implementations that use synchronous

and asynchronous communication as well as ping-pong buffers,

another technique that overlaps communication with computation.

The data illustrate that the system spends far more time marshaling

data structures and migrating data than it does actually computing

on the GPU. While techniques to hide the latency of communica-

tion improve performance, the improvements are modest at best.

User-level frameworks do provide mechanisms to minimize redundant hardware-level communication within a single process’ address space. However, addressing such redundancy for cross-process or cross-device communication requires OS-level support and a programmer-visible interface. For example, USB data captured from cameras must be copied into system RAM before it can be copied to the GPU: with OS support, it could be copied directly into GPU memory. (Indeed, NVIDIA GPU Direct [4] implements just such a feature, but requires specialized support in the driver of any I/O device involved.)

2.2 No easy fix for data movement

The problem of data migration between GPU and CPU memory spaces is well-recognized by the developers of GPGPU frameworks. CUDA, for example, supports mechanisms such as asynchronous buffer copy, CUDA streams (a generalization of the latter), and pinning of memory buffers to tolerate data movement latency by overlapping computation and communication. However, to use such features, a programmer must understand OS-level issues like memory mapping. For example, CUDA provides APIs to pin allocated memory buffers, allowing the programmer to avoid a layer of buffering above DMA transfer. The programmer is cautioned to use this feature sparingly as it reduces the amount of memory available to the system for paging [59].

Figure 4: The effect of GPU-bound work on CPU-bound tasks. The graph shows the frequency (in Hz) with which the OS is able to deliver mouse movement events over a period of 60 seconds during which a program makes heavy use of the GPU. Average CPU utilization over the period is under 25%.

Using streams effectively requires knowing statically which transfers can be overlapped with which computations; such knowledge is not always available ahead of time. Moreover, streams can only be effective if there is available communication to perform that is independent of the current computation. For example, copying data for stream a to or from the device for execution by kernel A can be overlapped with the execution of kernel B; attempts to overlap with execution of A will cause serialization. Consequently, modules that offload logically separate computation to the GPU must be aware of each other’s computation and communication patterns to maximize the effectiveness of asynchrony.
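The benefit and the limit of overlapping can be seen in an idealized timing model. The sketch below is an assumption-laden illustration, not a measurement: the function names and the 4 ms copy and 6 ms compute figures are made up, transfers and kernels are assumed to take fixed time, and a frame's kernel is assumed to start only after that frame's own copy completes (the dependence that causes serialization). Only the copy for the next frame may overlap the current frame's kernel.

```python
def serial_time(n_frames, copy_ms, compute_ms):
    """No overlap: each frame is copied, then computed, one after another."""
    return n_frames * (copy_ms + compute_ms)


def overlapped_time(n_frames, copy_ms, compute_ms):
    """Copy of frame i+1 overlaps compute of frame i; a frame's own copy
    and compute remain serialized because the kernel needs its input."""
    copy_done = 0
    compute_done = 0
    for _ in range(n_frames):
        copy_done += copy_ms                       # copies run back to back
        compute_done = max(copy_done, compute_done) + compute_ms
    return compute_done


serial = serial_time(10, 4, 6)
overlapped = overlapped_time(10, 4, 6)
print(serial, overlapped)
```

In this model the overlapped total is the first copy, plus the slower of the two stages for every remaining frame, plus the last kernel, so overlap hides at most the cheaper stage. With a single frame there is nothing independent to overlap and the two functions agree.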

New architectures may alter the relative difﬁculty of managing

data across GPU and CPU memory domains, but software will re-

tain an important role, and optimizing data movement will remain

important for the foreseeable future. AMD’s Fusion integrates the

CPU and GPU onto a single die, and enables coherent yet slow

access to the shared memory by both processors. However, high

performance is only achievable via non-coherent accesses or by

using private GPU memory, leaving data placement decisions to

software. Intel’s Sandy Bridge, another CPU/GPU combination,

is further indication that the coming years will see various forms

of integrated CPU/GPU hardware coming to market. New hybrid

systems, such as NVIDIA Optimus, have a power-efﬁcient on-die

GPU and a high-performance discrete GPU. Despite the presence

of a combined CPU/GPU chip, such systems still require explicit

data management. While there is evidence that GPUs with coherent

access to shared memory may eventually become common, even a

completely integrated virtual memory system requires system sup-

port for minimizing data copies.

Figure 5: The effect of CPU-bound work on GPU-bound tasks.

H→D is a CUDA workload that has communication from the

host to the GPU device, while H←D has communication from

the GPU to the host, and H↔D has bidirectional communica-

tion.

2.3 The scheduling problem

Modern OSes cannot currently guarantee fairness and perfor-

mance for systems that use GPUs for computation. The OS does

not treat GPUs as a shared computational resource, like a CPU, but

rather as an I/O device. This design becomes a severe limitation

when the OS needs to use the GPU for its own computation (e.g.,

as Windows 7 does with the Aero user interface). Under the current

regime, watchdog timers ensure that screen refresh rates are main-

tained, but OS scheduling priorities are easily undermined by the

GPU driver.

GPU work causes system pauses. Figure 4 shows the impact

of GPU-bound work on the frequency with which the system can

collect and deliver mouse movements. In our experiments, signiﬁ-

cant GPU-work at high frame rates causes Windows 7 to be unre-

sponsive for seconds at a time. To measure this phenomenon, we

instrument the OS to record the frequency of mouse events deliv-

ered through the HID class driver over a 60 second period. When

no concurrent GPU work is executing, the system is able to deliver

mouse events at a stable 120 Hz. However, when the GPU is heav-

ily loaded, the mouse event rate plummets, often to below 20 Hz.

The GPU-bound task is console-based (does not update the screen)

and performs unrelated work in another process context. Moreover,

CPU utilization is below 25%, showing that the OS has compute resources available to deliver events. A combination of factors is at

work in this situation. GPUs are not preemptible, with the side-

effect that in-progress I/O requests cannot be canceled once begun.

Because Windows relies on cancelation to prioritize its own work,

its priority mechanism fails. The problem is compounded because

the developers of the GPU runtime use request batching to improve

throughput for GPU programs. Ultimately, Windows is unable to

interrupt a large number of GPU invocations submitted in batch,

and the system appears unresponsive. The inability of the OS to

manage the GPU as a ﬁrst-class resource inhibits its ability to load

balance the entire system effectively.

CPU work interferes with GPU throughput. Figure 5 shows

the inability of Windows 7 to load balance a system that has con-

current, but fundamentally unrelated work on the GPU and CPUs.

The data in the ﬁgure were collected on a machine with 64-bit Win-

dows 7, Intel Core 2 Quad 2.66GHz, 8GB RAM, and an NVIDIA

GeForce GT230 GPU. The ﬁgure shows the impact of a CPU-

bound process (using all 4 cores to increment counter variables)

on the frame rate of a shader program (the xform program from

our prototype implementation). The frame rate of the GPU pro-

gram drops by 2x, despite the near absence of CPU work in the


program: xform uses the CPU only to trigger the next computation

on the GPU device.

These results suggest that GPUs need to be treated as a ﬁrst-class

computing resource and managed by the OS scheduler like a nor-

mal CPU. Such abstractions will allow the OS to provide system-

wide properties like fairness and performance isolation. User pro-

grams should interact with GPUs using abstractions similar to threads

and processes. Current OSes provide no abstractions that ﬁt this

model. In the following sections, we propose abstractions to ad-

dress precisely this problem.

3. DESIGN

We propose a set of new OS abstractions to support GPU pro-

gramming called the PTask (Parallel Task) API. The PTask API

consists of interfaces and runtime library support to simplify the of-

ﬂoading of compute-intensive tasks to accelerators such as GPUs.

PTask supports a dataﬂow programming model in which individ-

ual tasks are assembled by the programmer into a directed acyclic

graph: vertices, called ptasks, are executable code such as shader

programs on the GPU, code fragments on other accelerators (e.g. a

SmartNIC), or callbacks on the CPU. Edges in the graph represent

data ﬂow, connecting the inputs and outputs of each vertex. PTask

is best suited for applications that have signiﬁcant computational

demands, feature both task- and data-level parallelism, and require

both high throughput and low latency.
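Because the graph is a directed acyclic graph, a runtime can both reject malformed graphs and discover which vertices are free to run side by side. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) to do both. The vertex names and the helper functions `concurrent_batches` and `has_cycle` are hypothetical, and batching by readiness is just one simple way to expose task-level parallelism; it is not a description of the PTask runtime.

```python
from graphlib import CycleError, TopologicalSorter


def concurrent_batches(deps):
    """Group vertices into batches whose members have no mutual dependencies.

    deps maps each vertex to the set of vertices whose output it consumes.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()                 # raises CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())   # everything runnable right now
        batches.append(ready)
        sorter.done(*ready)                  # mark the batch as finished
    return batches


def has_cycle(deps):
    """Return True if the dependency map is not a valid DAG."""
    try:
        TopologicalSorter(deps).prepare()
    except CycleError:
        return True
    return False


# Hypothetical graph: one capture vertex feeding two sibling transforms.
deps = {
    "xform_left": {"capture"},
    "xform_right": {"capture"},
    "merge": {"xform_left", "xform_right"},
}
batches = concurrent_batches(deps)
print(batches)
```

Vertices that land in the same batch, such as the two transforms here, have no data dependence on each other, so they are candidates for running on separate accelerators at the same time.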

PTask was developed with three design goals. (1) Bring GPUs

under the purview of a single (perhaps federated) resource manager,

allowing that entity to provide meaningful guarantees for fairness

and isolation. (2) Provide a programming model that simpliﬁes

the development of code for accelerators by abstracting away code

that manages devices, performs I/O, and deals with disjoint mem-

ory spaces. In a typical DirectX or CUDA program, only a frac-

tion of the code implements algorithms that run on the GPU, while

the bulk of the code manages the hardware and orchestrates data

movement between CPU and GPU memories. In contrast, PTask

encapsulates device-speciﬁc code, freeing the programmer to focus

on application-level concerns such as algorithms and data ﬂow. (3)

Provide a programming environment that allows code to be both

modular and fast. Because current GPU programming environ-

ments promote a tight coupling between device-memory manage-

ment code and GPU-kernel code, writing reusable code to leverage

a GPU means writing both algorithm code to run on the GPU and

code to run on the host that transfers the results of a GPU-kernel

computation when they are needed. This approach often translates

to sub-optimal data movement, higher latency, and undesirable per-

formance artifacts.

3.1 Integrating PTask scheduling with the OS

The two chief beneﬁts of coordinating OS scheduling with the

GPU are efﬁciency and fairness (design goals (1) and (3)). By ef-

ﬁciency we mean both low latency between when a ptask is ready

and when it is scheduled on the GPU, and scheduling enough work

on the GPU to fully utilize it. By fairness we mean that the OS

scheduler provides OS priority-weighted access to processes con-

tending for the GPU, and balances GPU utilization with other sys-

tem tasks like user interface responsiveness.
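To make "priority-weighted access" concrete, the sketch below simulates a simple weighted fair-share policy: at every GPU time slice, the process with the lowest usage relative to its weight runs next. This is a generic technique substituted for illustration, not the scheduler described in the paper; the function name, the process names, and the weights 3, 2, and 1 are all assumptions.

```python
from fractions import Fraction


def weighted_fair_schedule(weights, n_slices):
    """Hand out n_slices GPU time slices in proportion to each weight.

    weights maps a process name to a positive integer priority weight.
    Ties are broken by name so that the result is deterministic.
    """
    used = {name: 0 for name in weights}
    for _ in range(n_slices):
        # Pick the process that is furthest behind its weighted share.
        nxt = min(weights, key=lambda n: (Fraction(used[n], weights[n]), n))
        used[nxt] += 1
    return used


shares = weighted_fair_schedule({"ui": 3, "encode": 2, "batch": 1}, 60)
print(shares)
```

With these weights the slices are split 30, 20, and 10, so the highest-priority process gets half of the device without starving the others. A policy of this kind is only possible when the scheduler can see and account for GPU usage per process, which is the visibility the section argues the OS currently lacks.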

Separate processes can communicate through, or share, a graph.

For example, processes A and B may produce data that is input

to the graph, and another process C can consume the results. The

scheduler must balance thread-speciﬁc scheduling needs with PTask-

speciﬁc scheduling needs. For example, gang scheduling the pro-

ducer and consumer threads for a given PTask graph will maximize

system throughput.

// Modular gemm: every call copies its inputs to the GPU and its result back.
matrix gemm(A, B) {
    matrix res = new matrix();
    copyToDevice(A);
    copyToDevice(B);
    invokeGPU(gemm_kernel, A, B, res);
    copyFromDevice(res);
    return res;
}

// Reuses gemm, so the intermediate AxB makes a round trip through main memory.
matrix modularSlowAxBxC(A, B, C) {
    matrix AxB = gemm(A, B);
    matrix AxBxC = gemm(AxB, C);
    return AxBxC;
}

// Specialized version: the intermediate result stays in GPU memory.
matrix nonmodularFastAxBxC(A, B, C) {
    matrix intermed = new matrix();
    matrix res = new matrix();
    copyToDevice(A);
    copyToDevice(B);
    copyToDevice(C);
    invokeGPU(gemm_kernel, A, B, intermed);
    invokeGPU(gemm_kernel, intermed, C, res);
    copyFromDevice(res);
    return res;
}

Figure 6: Pseudo-code to offload matrix computation (A×B)×C to a GPU. The modular approach uses the gemm subroutine to compute both A×B and (A×B)×C, forcing an unnecessary round-trip from GPU to main memory for the intermediate result.

3.2 Efﬁciency vs. modularity

Consider the pseudo-code in Figure 6, which reuses a matrix

multiplication subroutine called gemm to implement ((A × B) ×

C). GPUs typically have private memory spaces that are not co-

herent with main memory and not addressable by the CPU. To

ofﬂoad computation to the GPU, the gemm implementation must

copy input matrices A and B to GPU memory. It then invokes a

GPU-kernel called gemm_kernel to perform the multiplication,

and copies the result back to main memory. If the programmer

reuses the code for gemm to compose the product ((A × B) × C)

as gemm(gemm(A,B),C) (modularSlowAxBxC in Figure 6),

the intermediate result (A × B) is copied back from the GPU at

the end of the ﬁrst invocation of gemm only to be copied from main

memory to GPU memory again for the second invocation. The per-

formance costs for data movement are signiﬁcant. The problem can

be trivially solved by writing code specialized to the problem, such

as the nonmodularFastAxBxC in the ﬁgure. However, the code

is no longer as easily reused.
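The cost difference between the two versions in Figure 6 can be counted directly. The sketch below mirrors the pseudo-code in plain Python with a stand-in `Device` class that only counts transfers; no GPU is involved, and `Device`, `matmul`, and the function names are hypothetical. The multiplication runs on the CPU so that both versions can be checked for the same result.

```python
class Device:
    """Stand-in for a GPU that only counts host/device transfers."""

    def __init__(self):
        self.to_device = 0
        self.from_device = 0

    def copy_to_device(self, m):
        self.to_device += 1
        return m

    def copy_from_device(self, m):
        self.from_device += 1
        return m


def matmul(a, b):
    """Plain matrix product on lists of rows (stands in for gemm_kernel)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gemm(dev, a, b):
    """Modular gemm: copy both inputs in, multiply, copy the result out."""
    da = dev.copy_to_device(a)
    db = dev.copy_to_device(b)
    return dev.copy_from_device(matmul(da, db))


def modular_slow(dev, a, b, c):
    # The intermediate product leaves the device and is copied back in.
    return gemm(dev, gemm(dev, a, b), c)


def nonmodular_fast(dev, a, b, c):
    # All inputs go in once; the intermediate never leaves the device.
    da, db, dc = (dev.copy_to_device(m) for m in (a, b, c))
    intermed = matmul(da, db)
    return dev.copy_from_device(matmul(intermed, dc))


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [0, 2]]

slow_dev, fast_dev = Device(), Device()
slow_result = modular_slow(slow_dev, A, B, C)
fast_result = nonmodular_fast(fast_dev, A, B, C)
print(slow_dev.to_device + slow_dev.from_device,
      fast_dev.to_device + fast_dev.from_device)
```

The modular version performs six transfers (four in, two out) against four for the specialized version (three in, one out). The two extra transfers are exactly the round trip of the intermediate product, and the gap grows with every additional gemm chained onto the result.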

Within a single address space, such code modularity issues can

often be addressed with a layer of indirection and encapsulation

for GPU-side resources. However, the problem of optimizing data

movement inevitably becomes an OS-level issue as other devices

and resources interact with the GPU or GPUs. With OS-level support, computations that involve GPUs and OS-managed resources

such as cameras, network cards, and ﬁle systems can avoid prob-

lems like double-buffering.

By decoupling data ﬂow from algorithm, PTask eliminates dif-

ﬁcult tradeoffs between modularity and performance (design goal

(3)): the run-time automatically avoids unnecessary data movement

(design goal (2)). With PTask, matrix multiplication is expressed

as a graph with A and B as inputs to one gemm node; the out-

put of that node becomes an input to another gemm node that also

takes C as input. The programmer expresses only the structure of

the computation and the system is responsible for materializing a

consistent view of the data in a memory domain only when it is ac-

