RadixVM: Scalable address spaces for multithreaded applications
Austin T. Clements, M. Frans Kaashoek, and Nickolai Zeldovich
MIT CSAIL
ABSTRACT
RadixVM is a new virtual memory system design that enables fully concurrent operations on shared address spaces for multithreaded processes on cache-coherent multicore computers. Today, most operating systems serialize operations such as mmap and munmap, which forces application developers to split their multithreaded applications into multi-process applications, hoard memory to avoid the overhead of returning it, and so on. RadixVM removes this burden from application developers by ensuring that address space operations on non-overlapping memory regions scale perfectly. It does so by combining three techniques: 1) it organizes metadata in a radix tree instead of a balanced tree to avoid unnecessary cache line movement; 2) it uses a novel memory-efficient distributed reference counting scheme; and 3) it uses a new scheme to target remote TLB shootdowns and to often avoid them altogether. Experiments on an 80-core machine show that RadixVM achieves perfect scalability for non-overlapping regions: if several threads mmap or munmap pages in parallel, they can run completely independently and induce no cache coherence traffic.
1 INTRODUCTION
Multithreaded applications on many-core processors can be bottlenecked by contended locks inside the operating system’s virtual memory system. Because of complex invariants in virtual memory systems, widely used kernels, such as Linux and FreeBSD, have a single lock per shared address space. Recent research shows how to run a page fault handler in parallel with mmap and munmap calls in the same address space [7], but still serializes mmap and munmap, which applications use to allocate and return memory to the operating system. However, if the mmap and munmap involve non-overlapping memory regions in the shared address space, as is the case in memory allocation and freeing, then in principle these calls should be perfectly parallelizable because they operate on
different parts of the address space. This paper contributes a
new virtual memory design that achieves this goal.
A typical virtual memory system supports three key operations: mmap to add a region of memory to a process’s address space, munmap to remove a region of memory from the address space and the corresponding pages from the hardware page tables, and pagefault, which inserts a page into the hardware page tables at the faulting virtual address, allocating a new physical page if necessary. This paper focuses on workloads in which multiple threads running in the same address space frequently and concurrently invoke these operations. Many multithreaded memory-intensive applications fit this mold: mmap, munmap, and related variants often lie at the core of high-performance memory allocators and garbage collectors. Applications that frequently map and unmap files also generate such workloads for the OS kernel.
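To make this workload pattern concrete, the following is a minimal user-level sketch (our illustration, not from the paper) in which each thread repeatedly maps, faults in, and unmaps its own anonymous region, so every mmap, pagefault, and munmap targets non-overlapping addresses; the thread count, region size, and iteration count are arbitrary choices.

    #include <pthread.h>
    #include <string.h>
    #include <sys/mman.h>

    #define NTHREADS    8
    #define REGION_SIZE (64 * 4096)
    #define ITERS       100000

    /* Each thread's mmap/pagefault/munmap traffic touches only its own
     * addresses, so under a design like RadixVM these loops need never
     * contend with one another. */
    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < ITERS; i++) {
            char *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                break;
            memset(p, 1, REGION_SIZE);   /* fault every page in */
            munmap(p, REGION_SIZE);      /* return the memory to the OS */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

On kernels that serialize address space mutations, this loop bottlenecks on the address space lock as NTHREADS grows, even though no two threads ever touch the same region.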
Because operating systems serialize mmap and munmap calls, even for non-overlapping memory regions, these applications can easily be bottlenecked by contention in the OS kernel. As a result, application developers often work around the virtual memory system to minimize or circumvent this bottleneck. Multithreaded memory allocators provide a rich set of workaround examples: they allocate memory in large chunks from the operating system in each thread [21], batch many munmaps [13], don’t return memory at all [15], or provide a loadable kernel module that implements custom VM operations for the allocator [26]. Some workarounds have deep structural implications, such as implementing an application as communicating single-threaded processes instead of as a single multi-threaded process [6].
Frequently these workarounds suffice to avoid the internal VM bottleneck, but often come with other downsides. For example, a Google engineer reported to us that Google’s memory allocator is reluctant to return memory to the OS precisely because of scaling problems with munmap, and as a result applications tie up gigabytes of memory until they exit. This delays the start of other applications and can cause servers to be used inefficiently. Engineers from other companies have reported similar problems to us.
With increasing core counts, we believe these workarounds will become increasingly complicated and their downsides more severe, so this paper attacks the root cause of VM scalability problems. This paper explores the design of a virtual memory system in which virtual memory operations contend only if they operate on overlapping memory regions. This ensures that if two threads operate on distinct ranges of virtual memory, the VM system does not get in the way, and no scalability workarounds are necessary. In this case, virtual memory operations should scale perfectly with the number of cores. If the application operates on the same shared region from multiple threads, sharing state between the two threads is inevitable, but the shared state should be minimal and constrained to the cores using the shared memory region.
There are several challenges in designing a virtual memory system in which the performance of parallel mmaps and munmaps scales with the number of cores performing the operations. First, there are complex invariants between different parts of the virtual memory system. For example, when a thread munmaps a region, the kernel must ensure that this region is removed from the hardware page tables before reusing the physical memory; otherwise, other threads in the same process can access memory that the kernel may have immediately repurposed for some other application (this ordering is sketched below). To provide these semantics the kernel must enforce ordering on operations, which can limit performance on larger numbers of cores. Second, even a single contended cache line can become a scaling bottleneck with many cores. One of our initial designs used a lock-free concurrent skip list, but interior nodes in the skip list became contended and limited scalability (see §5). Third, shooting down the translation lookaside buffer (TLB) of other cores (to ensure that no core caches an out-of-date mapping) can become a scaling bottleneck. Fourth, many designs that avoid the scaling problems in a straightforward manner have a large memory overhead that grows with the number of cores.
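As an illustration of the first challenge, here is a simplified sketch (ours, not RadixVM’s code) of the ordering an unmap must follow; pte_clear, tlb_shootdown, and free_page are hypothetical helpers standing in for the corresponding kernel steps.

    #include <stdint.h>

    struct addrspace;   /* hypothetical per-process VM state */
    struct page;        /* hypothetical physical page descriptor */
    extern struct page *pte_clear(struct addrspace *as, uintptr_t va);
    extern void tlb_shootdown(struct addrspace *as, uintptr_t va);
    extern void free_page(struct page *pg);

    void unmap_page(struct addrspace *as, uintptr_t va)
    {
        struct page *pg = pte_clear(as, va);  /* 1. remove the hardware mapping */
        tlb_shootdown(as, va);                /* 2. flush stale TLB entries on all
                                                 cores that may still cache va */
        free_page(pg);                        /* 3. only now is reuse of pg safe */
    }

Reordering steps 2 and 3 would let another core’s stale TLB entry read or write a page that has already been repurposed.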
This paper addresses these challenges in a new design, which we call RadixVM. RadixVM uses three different ideas that complement each other to enable munmap, mmap, and page faults on non-overlapping memory regions to scale perfectly. First, it uses a carefully designed radix tree to record mapped memory regions. Second, it uses a novel scalable reference counting scheme for tracking when physical pages are free and radix tree nodes are no longer used. Third, when a page must be unmapped, it avoids shooting down hardware TLBs that don’t have that page mapping cached. The combination of these three ideas allows RadixVM to implement a straightforward concurrency plan that ensures the correctness of VM operations while allowing VM operations on non-overlapping memory regions to scale.
We have implemented RadixVM in a new research kernel derived from xv6 [9] because changing the Linux virtual memory implementation is very labor intensive owing to its overall complexity and how interconnected it is with other kernel subsystems [7]. An experimental evaluation on an 80-core machine with microbenchmarks that represent common sharing patterns in multithreaded applications and a parallel MapReduce library from MOSBENCH [6] shows that the RadixVM design scales for non-overlapping mmaps and munmaps. A detailed breakdown shows that all three components of RadixVM’s design are necessary to achieve good scalability.
One downside of prototyping RadixVM on a research kernel instead of Linux is that it is difficult to run large, complex applications that traditionally place significant load on the virtual memory system, such as the garbage collectors in Oracle’s Java VM. Furthermore, RadixVM faces a chicken-and-egg problem: many VM-intensive applications have already been forced to design around the limitations of traditional VM systems. As a result, our experimental evaluation focuses on measuring RadixVM’s performance under microbenchmarks that capture sharing patterns we have observed in VM-heavy workloads on Linux.
The rest of the paper is organized as follows. §2 discusses
the related work. §3 presents the design of RadixVM. §4
summarizes our implementation. §5 evaluates RadixVM’s
performance, and §6 concludes.
2 RELATED WORK
RadixVM’s design builds on research in five different areas: SMP Unix kernels, kernels designed for scalability, VM data structures, TLB shootdown schemes, and scalable reference counters. In each of these areas RadixVM makes a contribution, as we discuss in turn.
Unix kernels. The VM designs for Unix kernels and Windows typically use a single read-write lock to protect a shared address space, perhaps augmented with finer-grained locks for operations on individual memory regions [1, 24, 27]. The single read-write lock protects the OS index data structures that map virtual addresses to metadata for mapped memory regions, as well as invariants between the OS index data structure and the hardware page tables and TLBs. In our earlier work on the Bonsai VM system, we described a lock-free design for page faults in Linux, but that design still required the use of Linux’s address space lock to serialize mmap and munmap operations [7].
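To illustrate the locking discipline just described, here is a schematic sketch (our illustration, not actual kernel source; the struct and helper names are hypothetical stand-ins for structures such as Linux’s per-mm address space lock). Because every mutation takes the lock in write mode, two munmaps of disjoint regions still serialize:

    #include <pthread.h>
    #include <stddef.h>
    #include <stdint.h>

    struct addrspace {
        pthread_rwlock_t lock;  /* one lock guards the whole address space */
        /* balanced tree of mapped regions, page table root, ... */
    };

    /* Hypothetical helper: removes the region from the index structure,
     * clears page tables, and performs TLB shootdowns. */
    extern void region_remove(struct addrspace *as, uintptr_t start, size_t len);

    void as_munmap(struct addrspace *as, uintptr_t start, size_t len)
    {
        pthread_rwlock_wrlock(&as->lock);  /* serializes all mutators, even
                                              for non-overlapping regions */
        region_remove(as, start, len);
        pthread_rwlock_unlock(&as->lock);
    }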
Scalable kernels. Quite a number of kernels have been designed with the primary goal of scaling operating system services to many cores, including K42 [20] and Tornado [14], as well as more recent systems such as Barrelfish [3], Corey [5], and fos [30]. K42 and Tornado use clustered objects to represent an address space, allowing a region list for an address space to be replicated per processor. This design allows for concurrent mmaps by threads in the same address space, at the expense of munmaps, which need to contact every processor to update its local region list [14]. The RadixVM design allows both mmaps and munmaps to run in parallel.
The Barrelfish and fos multikernel designs don’t support shared hardware page tables and index structures between kernel instances, and thus avoid their associated scaling bottlenecks. Both kernels support multithreaded user-level applications that share memory, but require an expensive two-phase distributed TLB shootdown protocol to unmap memory (see below).
Corey introduces the notion of an address range, which allows an application to selectively share parts of its address space, instead of being forced to make an all-or-nothing decision. RadixVM doesn’t expose this control to applications but could benefit from allowing applications to control the trade-offs between per-core page tables and shared page tables (see §5).
VM data structures. One reason that widely used operating systems use a lock on the address space is that they use complex index data structures to guarantee O(log n) lookup time when a process has many mapped memory regions. Linux uses a red-black tree for the regions [27], FreeBSD uses a splay tree [1], and Solaris and Windows use AVL trees [24, 29]. Because these data structures require rebalancing when a memory region is inserted, they protect the entire data structure with a single lock.
The Bonsai balanced tree allows for lock-free lookups but serializes inserts and deletes in the Bonsai VM system [7]. Similarly, Howard and Walpole [18] propose a relativistic red-black tree that supports lock-free lookups and which they have extended to support concurrent inserts and deletes using software transactional memory [17]. They don’t describe, however, how to apply this design to a virtual memory system, which needs to maintain invariants beyond the structural correctness of the region tree (e.g., munmap must clear page tables and invalidate TLBs atomically with removing the region from the tree). RadixVM opts instead for a radix-tree-based data structure, which allows for concurrent non-overlapping lookups, inserts, and deletes without the need for software transactional memory. Furthermore, precise range locking makes it possible to maintain cross-structure invariants.
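To give a flavor of why a radix tree avoids the rebalancing problem, here is a minimal sketch (our illustration, not RadixVM’s code) of a radix tree keyed by virtual page number, consuming 9 bits of the key per level as in an x86-64 page table. Because the path for a given address is fixed, inserts for disjoint ranges touch disjoint nodes and never restructure shared parts of the tree:

    #include <stdint.h>
    #include <stdlib.h>

    #define RADIX_BITS   9
    #define RADIX_FANOUT (1 << RADIX_BITS)   /* 512 slots per node */
    #define LEVELS       4                   /* covers a 36-bit page number */

    struct mapinfo { int prot; /* backing object, permissions, ... */ };

    struct radix_node {
        void *slot[RADIX_FANOUT];  /* interior: child nodes; leaf: mapinfo */
    };

    /* Walk (and, on insert, create) the path for virtual page number vpn;
     * returns the leaf slot for vpn, or NULL on a failed lookup. */
    static void **radix_slot(struct radix_node *root, uint64_t vpn, int insert)
    {
        struct radix_node *n = root;
        for (int level = LEVELS - 1; level > 0; level--) {
            int i = (vpn >> (level * RADIX_BITS)) & (RADIX_FANOUT - 1);
            if (!n->slot[i]) {
                if (!insert)
                    return NULL;
                n->slot[i] = calloc(1, sizeof(struct radix_node));
            }
            n = n->slot[i];
        }
        return &n->slot[vpn & (RADIX_FANOUT - 1)];
    }

A concurrent version would install child nodes with compare-and-swap rather than plain stores; RadixVM’s actual tree also folds in range locking and node reference counting, which this sketch omits.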
TLB shootdown. In a multithreaded application, when a thread removes a region from its address space, the operating system must send shootdown interrupts to other processors to ensure that those processors flush translations for that region from their TLBs. Since these interrupts are expensive, many operating systems implement a scheme that allows these interrupts to be sent in parallel and to be batched [4]. The Barrelfish operating system uses a distributed two-phase TLB shootdown scheme (similar to Uhlig’s [28]), and exploits a software broadcast tree to deliver the interrupts efficiently [3]. The main contribution of RadixVM’s scheme is to limit the number of cores that must be contacted to perform the shootdown. x86 processors do not inform the kernel which TLBs have which memory mappings cached, and therefore most kernels conservatively send interrupts to all cores running the application. RadixVM uses a scheme that precisely tracks which cores may have each page mapping cached in their TLBs, which allows RadixVM to send TLB shootdown interrupts to just those cores and to entirely eliminate shootdowns for mappings that are not shared between cores.
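The tracking idea can be sketched as follows (our simplification, not RadixVM’s code; send_tlb_ipi is a hypothetical primitive, and a wider bitmap would be needed beyond 64 cores). The page fault path records the faulting core, and the unmap path sends interrupts only to recorded cores:

    #include <stdint.h>

    #define MAX_CORES 64   /* one bit per core; widen for larger machines */

    struct pageinfo {
        uint64_t tlb_mask;   /* bit c set => core c may cache this mapping */
    };

    extern void send_tlb_ipi(int core, uintptr_t va);   /* hypothetical */

    /* Page fault path: remember that this core now caches the mapping. */
    void note_pagefault(struct pageinfo *pi, int core)
    {
        __atomic_fetch_or(&pi->tlb_mask, 1ull << core, __ATOMIC_RELAXED);
    }

    /* Unmap path: shoot down only cores that may cache the mapping; a
     * page touched by a single core needs no remote shootdown at all. */
    void targeted_shootdown(struct pageinfo *pi, uintptr_t va, int self)
    {
        uint64_t mask = __atomic_exchange_n(&pi->tlb_mask, 0, __ATOMIC_ACQ_REL);
        for (int c = 0; c < MAX_CORES; c++)
            if ((mask & (1ull << c)) && c != self)
                send_tlb_ipi(c, va);
    }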
Scalable reference counters. RadixVM’s scheme for reference counting, Refcache, inherits ideas from sloppy counters [6], Scalable NonZero Indicators (SNZI) [12], distributed counters [2], shared vs. local counts in Modula-2+ [11], and approximate counters [8]. All of these techniques speed up increment and decrement using per-core counters, and require significantly more work to find the true total value. Refcache is a scalable counter approach specialized for reference counts, where the true value needs to be known only when it reaches zero. Unlike sloppy, SNZI, distributed, and approximate counters, it doesn’t require space proportional to the number of cores per counter. Like sloppy counters, Refcache’s design intentionally delays and batches zero detection to reduce cache contention.
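As a toy illustration of the per-core counting idea (ours; Refcache itself is more involved and, unlike this sketch, avoids dedicating per-core space to every counter), each core buffers increments and decrements locally and folds them into the shared count only at epoch boundaries, so hot inc/dec paths never touch shared cache lines and zero detection is delayed by up to an epoch:

    #include <stdbool.h>

    #define NCORES 64

    struct refcnt {
        long global;          /* true count, updated only at epoch ends */
        long delta[NCORES];   /* per-core buffered changes (in a real system
                                 each slot would sit on its own cache line) */
    };

    /* Hot paths: purely core-local, no shared cache line traffic. */
    void ref_inc(struct refcnt *r, int core) { r->delta[core]++; }
    void ref_dec(struct refcnt *r, int core) { r->delta[core]--; }

    /* Called by each core once per epoch; returns true if the object may
     * have reached zero and should be examined for reclamation. */
    bool ref_flush(struct refcnt *r, int core)
    {
        long d = r->delta[core];
        r->delta[core] = 0;
        return __atomic_add_fetch(&r->global, d, __ATOMIC_RELAXED) == 0;
    }

A real implementation must also confirm that the count remains zero for a full epoch before freeing the object, since another core’s buffered increment may still be in flight.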
3 DESIGN
Achieving scalability for VM operations in different processes is easy since these operations involve per-process data structures. RadixVM’s design is novel because it allows threads of the same process to perform mmap, munmap, and pagefault operations for non-overlapping memory regions in parallel. That is, if n threads in the same process allocate more memory, then these mmaps scale perfectly (they take the same amount of time to execute regardless of n). Similarly, if a thread on one core allocates memory and another thread in the same process on another core returns a different memory region to the operating system, then these operations do not slow each other down or interfere with any other mmaps or munmaps for different memory regions. Finally, if a core runs pagefault for an address while other cores are performing operations on regions that do not include that page, then RadixVM scales perfectly. On the other hand, if an mmap and munmap involve overlapping memory regions, or a thread faults on a page that is being concurrently mmaped or munmaped, RadixVM serializes those operations. The scalability of this design is easy for application developers to understand and take advantage of: applications should simply avoid concurrent manipulation of overlapping memory regions when possible.
To achieve perfect scalability on modern multicore computers, RadixVM strives to ensure that cores don’t contend for any cache lines when operating on non-overlapping regions. On modern cache-coherent hardware, any contended cache line can be a scalability risk because frequently written cache lines must be re-read by other cores, an operation that typically serializes at the cache line’s home node.
This section describes how RadixVM achieves perfect scalability for operations on non-overlapping regions. We first present the three key data structures that form the core of RadixVM and then describe how RadixVM uses these to implement standard VM operations.
3.1 Reference counting with Refcache
Reference counting is critical to many OS functions, and
RadixVM is no different. Since two virtual memory regions