RadixVM: Scalable address spaces for multithreaded applications
Austin T. Clements, M. Frans Kaashoek, and Nickolai Zeldovich
MIT CSAIL
ABSTRACT
RadixVM is a new virtual memory system design that enables fully concurrent operations on shared address spaces for multithreaded processes on cache-coherent multicore computers. Today, most operating systems serialize operations such as mmap and munmap, which forces application developers to split their multithreaded applications into multi-process applications, hoard memory to avoid the overhead of returning it, and so on. RadixVM removes this burden from application developers by ensuring that address space operations on non-overlapping memory regions scale perfectly. It does so by combining three techniques: 1) it organizes metadata in a radix tree instead of a balanced tree to avoid unnecessary cache line movement; 2) it uses a novel memory-efficient distributed reference counting scheme; and 3) it uses a new scheme to target remote TLB shootdowns and to often avoid them altogether. Experiments on an 80-core machine show that RadixVM achieves perfect scalability for non-overlapping regions: if several threads mmap or munmap pages in parallel, they can run completely independently and induce no cache coherence traffic.
1 INTRODUCTION
Multithreaded applications on many-core processors can be bottlenecked by contended locks inside the operating system’s virtual memory system. Because of complex invariants in virtual memory systems, widely used kernels, such as Linux and FreeBSD, have a single lock per shared address space. Recent research shows how to run a page fault handler in parallel with mmap and munmap calls in the same address space [7], but still serializes mmap and munmap, which applications use to allocate and return memory to the operating system. However, if the mmap and munmap involve non-overlapping memory regions in the shared address space, as is the case in memory allocation and freeing, then in principle these calls should be perfectly parallelizable because they operate on
different parts of the address space. This paper contributes a
new virtual memory design that achieves this goal.
A typical virtual memory system supports three key operations: mmap to add a region of memory to a process’s address space, munmap to remove a region of memory from the address space and the corresponding pages from the hardware page tables, and pagefault, which inserts a page into the hardware page tables at the faulting virtual address, allocating a new physical page if necessary. This paper focuses on workloads in which multiple threads running in the same address space frequently and concurrently invoke these operations. Many multithreaded memory-intensive applications fit this mold: mmap, munmap, and related variants often lie at the core of high-performance memory allocators and garbage collectors. Applications that frequently map and unmap files also generate such workloads for the OS kernel.
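To make this workload pattern concrete, the following is a minimal user-level sketch (our illustration, not from the paper) in which each thread repeatedly maps, faults in, and unmaps its own anonymous region, so every mmap, pagefault, and munmap targets non-overlapping addresses; the thread count, region size, and iteration count are arbitrary choices.

    #include <pthread.h>
    #include <string.h>
    #include <sys/mman.h>

    #define NTHREADS    8
    #define REGION_SIZE (64 * 4096)
    #define ITERS       100000

    /* Each thread's mmap/pagefault/munmap traffic touches only its own
     * addresses, so under a design like RadixVM these loops need never
     * contend with one another. */
    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < ITERS; i++) {
            char *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                break;
            memset(p, 1, REGION_SIZE);   /* fault every page in */
            munmap(p, REGION_SIZE);      /* return the memory to the OS */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

On kernels that serialize address space mutations, this loop bottlenecks on the address space lock as NTHREADS grows, even though no two threads ever touch the same region.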
Because operating systems serialize mmap and munmap calls, even for non-overlapping memory regions, these applications can easily be bottlenecked by contention in the OS kernel. As a result, application developers often work around the virtual memory system to minimize or circumvent this bottleneck. Multithreaded memory allocators provide a rich set of workaround examples: they allocate memory in large chunks from the operating system in each thread [21], batch many munmaps [13], don’t return memory at all [15], or provide a loadable kernel module that implements custom VM operations for the allocator [26]. Some workarounds have deep structural implications, such as implementing an application as communicating single-threaded processes instead of as a single multi-threaded process [6].
Frequently these workarounds suffice to avoid the internal VM bottleneck, but often come with other downsides. For example, a Google engineer reported to us that Google’s memory allocator is reluctant to return memory to the OS precisely because of scaling problems with munmap, and as a result applications tie up gigabytes of memory until they exit. This delays the start of other applications and can cause servers to be used inefficiently. Engineers from other companies have reported similar problems to us.
With increasing core counts, we believe these workarounds will become increasingly complicated and their downsides more severe, so this paper attacks the root cause of VM scalability problems. This paper explores the design of a virtual memory system in which virtual memory operations contend only if they operate on overlapping memory regions. This ensures that if two threads operate on distinct ranges of virtual memory, the VM system does not get in the way, and no scalability workarounds are necessary. In this case, virtual memory operations should scale perfectly with the number of cores. If the application operates on the same shared region from multiple threads, sharing state between the two threads is inevitable, but the shared state should be minimal and constrained to the cores using the shared memory region.
There are several challenges in designing a virtual memory system in which the performance of parallel mmaps and munmaps scales with the number of cores performing the operations. First, there are complex invariants between different parts of the virtual memory system. For example, when a thread munmaps a region, the kernel must ensure that this region is removed from the hardware page tables before reusing the physical memory; otherwise, other threads in the same process can access memory that the kernel may have immediately repurposed for some other application (this ordering is sketched below). To provide these semantics the kernel must enforce ordering on operations, which can limit performance on larger numbers of cores. Second, even a single contended cache line can become a scaling bottleneck with many cores. One of our initial designs used a lock-free concurrent skip list, but interior nodes in the skip list became contended and limited scalability (see §5). Third, shooting down the translation lookaside buffer (TLB) of other cores (to ensure that no core caches an out-of-date mapping) can become a scaling bottleneck. Fourth, many designs that avoid the scaling problems in a straightforward manner have a large memory overhead that grows with the number of cores.
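As an illustration of the first challenge, here is a simplified sketch (ours, not RadixVM’s code) of the ordering an unmap must follow; pte_clear, tlb_shootdown, and free_page are hypothetical helpers standing in for the corresponding kernel steps.

    #include <stdint.h>

    struct addrspace;   /* hypothetical per-process VM state */
    struct page;        /* hypothetical physical page descriptor */
    extern struct page *pte_clear(struct addrspace *as, uintptr_t va);
    extern void tlb_shootdown(struct addrspace *as, uintptr_t va);
    extern void free_page(struct page *pg);

    void unmap_page(struct addrspace *as, uintptr_t va)
    {
        struct page *pg = pte_clear(as, va);  /* 1. remove the hardware mapping */
        tlb_shootdown(as, va);                /* 2. flush stale TLB entries on all
                                                 cores that may still cache va */
        free_page(pg);                        /* 3. only now is reuse of pg safe */
    }

Reordering steps 2 and 3 would let another core’s stale TLB entry read or write a page that has already been repurposed.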
This paper addresses these challenges in a new design, which we call RadixVM. RadixVM uses three different ideas that complement each other to enable munmap, mmap, and page faults on non-overlapping memory regions to scale perfectly. First, it uses a carefully designed radix tree to record mapped memory regions. Second, it uses a novel scalable reference counting scheme for tracking when physical pages are free and radix tree nodes are no longer used. Third, when a page must be unmapped, it avoids shooting down hardware TLBs that don’t have that page mapping cached. The combination of these three ideas allows RadixVM to implement a straightforward concurrency plan that ensures the correctness of VM operations while allowing VM operations on non-overlapping memory regions to scale.
We have implemented RadixVM in a new research kernel derived from xv6 [9] because changing the Linux virtual memory implementation is very labor intensive owing to its overall complexity and how interconnected it is with other kernel subsystems [7]. An experimental evaluation on an 80-core machine with microbenchmarks that represent common sharing patterns in multithreaded applications and a parallel MapReduce library from MOSBENCH [6] shows that the RadixVM design scales for non-overlapping mmaps and munmaps. A detailed breakdown shows that all three components of RadixVM’s design are necessary to achieve good scalability.
One downside of prototyping RadixVM on a research kernel instead of Linux is that it is difficult to run large, complex applications that traditionally place significant load on the virtual memory system, such as the garbage collectors in Oracle’s Java VM. Furthermore, RadixVM faces a chicken-and-egg problem: many VM-intensive applications have already been forced to design around the limitations of traditional VM systems. As a result, our experimental evaluation focuses on measuring RadixVM’s performance under microbenchmarks that capture sharing patterns we have observed in VM-heavy workloads on Linux.
The rest of the paper is organized as follows. §2 discusses
the related work. §3 presents the design of RadixVM. §4
summarizes our implementation. §5 evaluates RadixVM’s
performance, and §6 concludes.
2 RELATED WORK
RadixVM’s design builds on research in five different areas: SMP Unix kernels, kernels designed for scalability, VM data structures, TLB shootdown schemes, and scalable reference counters. In each of these areas RadixVM makes a contribution, as we discuss in turn.
Unix kernels. The VM designs for Unix kernels and Windows typically use a single read-write lock to protect a shared address space, perhaps augmented with finer-grained locks for operations on individual memory regions [1, 24, 27]. The single read-write lock protects the OS index data structures that map virtual addresses to metadata for mapped memory regions, as well as invariants between the OS index data structure and the hardware page tables and TLBs. In our earlier work on the Bonsai VM system, we described a lock-free design for page faults in Linux, but that design still required the use of Linux’s address space lock to serialize mmap and munmap operations [7].
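To illustrate the locking discipline just described, here is a schematic sketch (our illustration, not actual kernel source; the struct and helper names are hypothetical stand-ins for structures such as Linux’s per-mm address space lock). Because every mutation takes the lock in write mode, two munmaps of disjoint regions still serialize:

    #include <pthread.h>
    #include <stddef.h>
    #include <stdint.h>

    struct addrspace {
        pthread_rwlock_t lock;  /* one lock guards the whole address space */
        /* balanced tree of mapped regions, page table root, ... */
    };

    /* Hypothetical helper: removes the region from the index structure,
     * clears page tables, and performs TLB shootdowns. */
    extern void region_remove(struct addrspace *as, uintptr_t start, size_t len);

    void as_munmap(struct addrspace *as, uintptr_t start, size_t len)
    {
        pthread_rwlock_wrlock(&as->lock);  /* serializes all mutators, even
                                              for non-overlapping regions */
        region_remove(as, start, len);
        pthread_rwlock_unlock(&as->lock);
    }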
Scalable kernels. Quite a number of kernels have been designed with the primary goal of scaling operating system services to many cores, including K42 [20] and Tornado [14], as well as more recent systems such as Barrelfish [3], Corey [5], and fos [30]. K42 and Tornado use clustered objects to represent an address space, allowing a region list for an address space to be replicated per processor. This design allows for concurrent mmaps by threads in the same address space, at the expense of munmaps, which need to contact every processor to update its local region list [14]. The RadixVM design allows both mmaps and munmaps to run in parallel.
The Barrelfish and fos multikernel designs don’t support shared hardware page tables and index structures between kernel instances, and thus avoid their associated scaling bottlenecks. Both kernels support multithreaded user-level applications that share memory, but require an expensive two-phase distributed TLB shootdown protocol to unmap memory (see below).
Corey introduces the notion of an address range, which allows an application to selectively share parts of its address space, instead of being forced to make an all-or-nothing decision. RadixVM doesn’t expose this control to applications but could benefit from allowing applications to control the trade-offs between per-core page tables and shared page tables (see §5).
VM data structures. One reason that widely used operating systems use a lock on the address space is that they use complex index data structures to guarantee O(log n) lookup time when a process has many mapped memory regions. Linux uses a red-black tree for the regions [27], FreeBSD uses a splay tree [1], and Solaris and Windows use AVL trees [24, 29]. Because these data structures require rebalancing when a memory region is inserted, they protect the entire data structure with a single lock.
The Bonsai balanced tree allows for lock-free lookups but serializes inserts and deletes in the Bonsai VM system [7]. Similarly, Howard and Walpole [18] propose a relativistic red-black tree that supports lock-free lookups and which they have extended to support concurrent inserts and deletes using software transactional memory [17]. They don’t describe, however, how to apply this design to a virtual memory system, which needs to maintain invariants beyond the structural correctness of the region tree (e.g., munmap must clear page tables and invalidate TLBs atomically with removing the region from the tree). RadixVM opts instead for a radix-tree-based data structure, which allows for concurrent non-overlapping lookups, inserts, and deletes without the need for software transactional memory. Furthermore, precise range locking makes it possible to maintain cross-structure invariants.
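To give a flavor of why a radix tree avoids the rebalancing problem, here is a minimal sketch (our illustration, not RadixVM’s code) of a radix tree keyed by virtual page number, consuming 9 bits of the key per level as in an x86-64 page table. Because the path for a given address is fixed, inserts for disjoint ranges touch disjoint nodes and never restructure shared parts of the tree:

    #include <stdint.h>
    #include <stdlib.h>

    #define RADIX_BITS   9
    #define RADIX_FANOUT (1 << RADIX_BITS)   /* 512 slots per node */
    #define LEVELS       4                   /* covers a 36-bit page number */

    struct mapinfo { int prot; /* backing object, permissions, ... */ };

    struct radix_node {
        void *slot[RADIX_FANOUT];  /* interior: child nodes; leaf: mapinfo */
    };

    /* Walk (and, on insert, create) the path for virtual page number vpn;
     * returns the leaf slot for vpn, or NULL on a failed lookup. */
    static void **radix_slot(struct radix_node *root, uint64_t vpn, int insert)
    {
        struct radix_node *n = root;
        for (int level = LEVELS - 1; level > 0; level--) {
            int i = (vpn >> (level * RADIX_BITS)) & (RADIX_FANOUT - 1);
            if (!n->slot[i]) {
                if (!insert)
                    return NULL;
                n->slot[i] = calloc(1, sizeof(struct radix_node));
            }
            n = n->slot[i];
        }
        return &n->slot[vpn & (RADIX_FANOUT - 1)];
    }

A concurrent version would install child nodes with compare-and-swap rather than plain stores; RadixVM’s actual tree also folds in range locking and node reference counting, which this sketch omits.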
TLB shootdown. In a multithreaded application, when a thread removes a region from its address space, the operating system must send shootdown interrupts to other processors to ensure that those processors flush translations for that region from their TLBs. Since these interrupts are expensive, many operating systems implement a scheme that allows these interrupts to be sent in parallel and to be batched [4]. The Barrelfish operating system uses a distributed two-phase TLB shootdown scheme (similar to Uhlig’s [28]), and exploits a software broadcast tree to deliver the interrupts efficiently [3]. The main contribution of RadixVM’s scheme is to limit the number of cores that must be contacted to perform the shootdown. x86 processors do not inform the kernel which TLBs have which memory mappings cached, and therefore most kernels conservatively send interrupts to all cores running the application. RadixVM uses a scheme that precisely tracks which cores may have each page mapping cached in their TLBs, which allows RadixVM to send TLB shootdown interrupts to just those cores and to entirely eliminate shootdowns for mappings that are not shared between cores.
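The tracking idea can be sketched as follows (our simplification, not RadixVM’s code; send_tlb_ipi is a hypothetical primitive, and a wider bitmap would be needed beyond 64 cores). The page fault path records the faulting core, and the unmap path sends interrupts only to recorded cores:

    #include <stdint.h>

    #define MAX_CORES 64   /* one bit per core; widen for larger machines */

    struct pageinfo {
        uint64_t tlb_mask;   /* bit c set => core c may cache this mapping */
    };

    extern void send_tlb_ipi(int core, uintptr_t va);   /* hypothetical */

    /* Page fault path: remember that this core now caches the mapping. */
    void note_pagefault(struct pageinfo *pi, int core)
    {
        __atomic_fetch_or(&pi->tlb_mask, 1ull << core, __ATOMIC_RELAXED);
    }

    /* Unmap path: shoot down only cores that may cache the mapping; a
     * page touched by a single core needs no remote shootdown at all. */
    void targeted_shootdown(struct pageinfo *pi, uintptr_t va, int self)
    {
        uint64_t mask = __atomic_exchange_n(&pi->tlb_mask, 0, __ATOMIC_ACQ_REL);
        for (int c = 0; c < MAX_CORES; c++)
            if ((mask & (1ull << c)) && c != self)
                send_tlb_ipi(c, va);
    }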
Scalable reference counters. RadixVM’s scheme for reference counting, Refcache, inherits ideas from sloppy counters [6], Scalable NonZero Indicators (SNZI) [12], distributed counters [2], shared vs. local counts in Modula-2+ [11], and approximate counters [8]. All of these techniques speed up increment and decrement using per-core counters, and require significantly more work to find the true total value. Refcache is a scalable counter approach specialized for reference counts, where the true value needs to be known only when it reaches zero. Unlike sloppy, SNZI, distributed, and approximate counters, it doesn’t require space proportional to the number of cores per counter. Like sloppy counters, Refcache’s design intentionally delays and batches zero detection to reduce cache contention.
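As a toy illustration of the per-core counting idea (ours; Refcache itself is more involved and, unlike this sketch, avoids dedicating per-core space to every counter), each core buffers increments and decrements locally and folds them into the shared count only at epoch boundaries, so hot inc/dec paths never touch shared cache lines and zero detection is delayed by up to an epoch:

    #include <stdbool.h>

    #define NCORES 64

    struct refcnt {
        long global;          /* true count, updated only at epoch ends */
        long delta[NCORES];   /* per-core buffered changes (in a real system
                                 each slot would sit on its own cache line) */
    };

    /* Hot paths: purely core-local, no shared cache line traffic. */
    void ref_inc(struct refcnt *r, int core) { r->delta[core]++; }
    void ref_dec(struct refcnt *r, int core) { r->delta[core]--; }

    /* Called by each core once per epoch; returns true if the object may
     * have reached zero and should be examined for reclamation. */
    bool ref_flush(struct refcnt *r, int core)
    {
        long d = r->delta[core];
        r->delta[core] = 0;
        return __atomic_add_fetch(&r->global, d, __ATOMIC_RELAXED) == 0;
    }

A real implementation must also confirm that the count remains zero for a full epoch before freeing the object, since another core’s buffered increment may still be in flight.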
3 DESIGN
Achieving scalability for VM operations in different processes is easy since these operations involve per-process data structures. RadixVM’s design is novel because it allows threads of the same process to perform mmap, munmap, and pagefault operations for non-overlapping memory regions in parallel. That is, if n threads in the same process allocate more memory, then these mmaps scale perfectly (they take the same amount of time to execute regardless of n). Similarly, if a thread on one core allocates memory and another thread in the same process on another core returns a different memory region to the operating system, then these operations do not slow each other down or interfere with any other mmaps or munmaps for different memory regions. Finally, if a core runs pagefault for an address while other cores are performing operations on regions that do not include that page, then RadixVM scales perfectly. On the other hand, if an mmap and munmap involve overlapping memory regions, or a thread faults on a page that is being concurrently mmaped or munmaped, RadixVM serializes those operations. The scalability of this design is easy for application developers to understand and take advantage of: applications should simply avoid concurrent manipulation of overlapping memory regions when possible.
To achieve perfect scalability on modern multicore computers, RadixVM strives to ensure that cores don’t contend for any cache lines when operating on non-overlapping regions. On modern cache-coherent hardware, any contended cache line can be a scalability risk because frequently written cache lines must be re-read by other cores, an operation that typically serializes at the cache line’s home node.
This section describes how RadixVM achieves perfect scalability for operations on non-overlapping regions. We first present the three key data structures that form the core of RadixVM and then describe how RadixVM uses these to implement standard VM operations.
3.1 Reference counting with Refcache
Reference counting is critical to many OS functions, and
RadixVM is no different. Since two virtual memory regions