Memory_Barriers_a_Hardware_View_for_Software


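### Memory Barriers: a Hardware View for Software Hackers

Memory barriers are a key concept in modern computer systems. They matter most on multiprocessor (SMP) systems, where they are essential for keeping data consistent. This summary covers the basic principles of memory barriers, how they work, and how the hardware mechanisms behind them affect software design.

#### 1. Memory Barriers: Concept and Background

A memory barrier is a special instruction or hardware mechanism that constrains the order in which a processor's memory operations take effect, so that specific accesses are performed in a known order. On multicore and multiprocessor systems, CPUs reorder memory operations to improve performance, which can leave different CPUs with inconsistent views of the data. Memory barriers force ordering where it is needed, which keeps concurrent programs correct.

#### 2. Cache Structure and Operation

Modern CPUs are much faster than main memory. A 2006 CPU might execute ten instructions per nanosecond, yet need many tens of nanoseconds to fetch a data item from main memory. This gap is the reason modern CPUs carry multi-megabyte caches. The caches are closely attached to the CPUs, as shown in Figure 1, and can typically be accessed in a few cycles. Their main purpose is to cut the latency of memory accesses: when the requested data is already in the cache, the processor reads it from there instead of waiting for main memory.

#### 3. Cache-Coherence Protocols

Cache-coherence protocols keep data consistent across processors. They define how processors share data in memory and ensure that all processors agree on its value. The paper works with the four-state MESI protocol. The following cache write policies are related, but they are policies rather than coherence protocols:

- **Write-back**: data written to the cache is not written to main memory immediately; it is written back later, typically when the cache line is evicted.
- **Write-through**: data written to the cache is also written to main memory immediately, so that cache and memory stay in step.
- **Write allocate**: on a write miss, the cache line is first brought into the cache and then updated there.
- **No write allocate**: on a write miss, the data is written to the next level of the memory hierarchy without bringing the line into the cache.

#### 4. Store Buffers and Invalidate Queues

Store buffers and invalidate queues are two mechanisms that let caches and coherence protocols reach high performance.

- **Store buffer**: holds writes that cannot complete yet, for example because the CPU is still waiting for other caches to respond. The CPU can keep executing other instructions instead of stalling until the write completes.
- **Invalidate queue**: holds incoming invalidate messages. The CPU can acknowledge an invalidation as soon as it is queued and remove the corresponding cache line later.

#### 5. Why Memory Barriers Matter

Memory barriers are a "necessary evil": they cost performance, but they are required to enable good performance and scalability. The root cause is that CPUs are far faster than both the interconnects between them and the memory they access.

#### 6. Summary

An understanding of cache structure, cache-coherence protocols, store buffers, and invalidate queues explains why memory barriers exist and why they matter. They add overhead, but they are essential for correct software on multiprocessor systems.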


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/228824849

Memory Barriers: a Hardware View for Software Hackers

Article · August 2010


Memory Barriers: a Hardware View for Software Hackers

Paul E. McKenney

Linux Technology Center

IBM Beaverton

paulmck@linux.vnet.ibm.com

July 23, 2010

So what possessed CPU designers to cause them to inflict memory barriers on poor unsuspecting SMP software designers?

In short, because reordering memory references allows much better performance, and so memory barriers are needed to force ordering in things like synchronization primitives whose correct operation depends on ordered memory references.

Getting a more detailed answer to this question requires a good understanding of how CPU caches work, and especially what is required to make caches really work well. The following sections:

1. present the structure of a cache,

2. describe how cache-coherency protocols ensure that CPUs agree on the value of each location in memory, and, finally,

3. outline how store buffers and invalidate queues help caches and cache-coherency protocols achieve high performance.

We will see that memory barriers are a necessary evil that is required to enable good performance and scalability, an evil that stems from the fact that CPUs are orders of magnitude faster than are both the interconnects between them and the memory they are attempting to access.

1 Cache Structure

Modern CPUs are much faster than are modern memory systems. A 2006 CPU might be capable of executing ten instructions per nanosecond, but will require many tens of nanoseconds to fetch a data item from main memory. This disparity in speed, more than two orders of magnitude, has resulted in the multi-megabyte caches found on modern CPUs. These caches are associated with the CPUs as shown in Figure 1, and can typically be accessed in a few cycles.

(Footnote: It is standard practice to use multiple levels of cache, with a small level-one cache close to the CPU with single-cycle access time, and a larger level-two cache with a longer access time, perhaps roughly ten clock cycles. Higher-performance CPUs often have three or even four levels of cache.)

[Figure 1: Modern Computer System Cache Structure. CPU 0 and CPU 1 each have their own cache; the caches are connected through an interconnect to memory.]

Data flows among the CPUs’ caches and memory in fixed-length blocks called “cache lines”, which are normally a power of two in size, ranging from 16 to 256 bytes. When a given data item is first accessed by a given CPU, it will be absent from that CPU’s cache, meaning that a “cache miss” (or, more specifically, a “startup” or “warmup” cache miss) has occurred. The cache miss means that the CPU will have to wait (or be “stalled”) for hundreds of cycles while the item is fetched from memory. However, the item will be loaded into that CPU’s cache, so that subsequent accesses will find it in the cache and therefore run at full speed.

After some time, the CPU’s cache will fill, and subsequent misses will likely need to eject an item from the cache in order to make room for the newly fetched item. Such a cache miss is termed a “capacity miss”, because it is caused by the cache’s limited capacity. However, most caches can be forced to eject an old item to make room for a new item even when they are not yet full. This is due to the fact that large caches are implemented as hardware hash tables with fixed-size hash buckets (or “sets”, as CPU designers call them) and no chaining, as shown in Figure 2.

This cache has sixteen “sets” and two “ways” for a total of 32 “lines”, each entry containing a single 256-byte “cache line”, which is a 256-byte-aligned block of memory. This cache line size is a little on the large side, but makes the hexadecimal arithmetic much simpler. In hardware parlance, this is a two-way set-associative cache, and is analogous to a software hash table with sixteen buckets, where each bucket’s hash chain is limited to at most two elements. The size (32 cache lines in this case) and the associativity (two in this case) are collectively called the cache’s “geometry”. Since this cache is implemented in hardware, the hash function is extremely simple: extract four bits from the memory address.

In Figure 2, each box corresponds to a cache entry, which can contain a 256-byte cache line. However, a cache entry can be empty, as indicated by the empty boxes in the figure. The rest of the boxes are flagged with the memory address of the cache line that they contain. Since the cache lines must be 256-byte aligned, the low eight bits of each address are zero, and the choice of hardware hash function means that the next-higher four bits match the hash line number.
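The hash function described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic for the example geometry (256-byte lines, sixteen sets); the names `LINE_BYTES`, `NUM_SETS`, `line_address`, and `set_index` are made up for this sketch and do not come from the paper.

```python
LINE_BYTES = 256   # cache-line size of the example cache
NUM_SETS = 16      # sixteen sets, so four index bits

OFFSET_BITS = LINE_BYTES.bit_length() - 1   # 8 low bits select a byte within a line
INDEX_MASK = NUM_SETS - 1                   # 0xF selects the four index bits


def line_address(addr: int) -> int:
    """Address of the 256-byte-aligned cache line that contains addr."""
    return addr & ~(LINE_BYTES - 1)


def set_index(addr: int) -> int:
    """Hardware 'hash function': the four bits just above the byte offset."""
    return (addr >> OFFSET_BITS) & INDEX_MASK


# Show which set a few sample addresses fall into.
for addr in (0x12345000, 0x12345E00, 0x43210E00, 0x12345F00):
    print(f"{addr:#010x} -> set {set_index(addr):#x}")
```

Two addresses collide in the cache exactly when `set_index` returns the same value for both, regardless of how different their upper bits are.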

Set    Way 0        Way 1
0x0    0x12345000   (empty)
0x1    0x12345100   (empty)
0x2    0x12345200   (empty)
0x3    0x12345300   (empty)
0x4    0x12345400   (empty)
0x5    0x12345500   (empty)
0x6    0x12345600   (empty)
0x7    0x12345700   (empty)
0x8    0x12345800   (empty)
0x9    0x12345900   (empty)
0xA    0x12345A00   (empty)
0xB    0x12345B00   (empty)
0xC    0x12345C00   (empty)
0xD    0x12345D00   (empty)
0xE    0x12345E00   0x43210E00
0xF    (empty)      (empty)

Figure 2: CPU Cache Structure

The situation depicted in the figure might arise if the program’s code were located at address 0x43210E00 through 0x43210EFF, and this program accessed data sequentially from 0x12345000 through 0x12345EFF. Suppose that the program were now to access location 0x12345F00. This location hashes to line 0xF, and both ways of this line are empty, so the corresponding 256-byte line can be accommodated. If the program were to access location 0x1233000, which hashes to line 0x0, the corresponding 256-byte cache line can be accommodated in way 1. However, if the program were to access location 0x1233E00, which hashes to line 0xE, one of the existing lines must be ejected from the cache to make room for the new cache line. If this ejected line were accessed later, a cache miss would result. Such a cache miss is termed an “associativity miss”.
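The three accesses walked through above can be replayed with a small simulation. This is a minimal sketch, not a model of any real CPU: the class name `SetAssociativeCache` is hypothetical, and the least-recently-used replacement policy is an assumption, since the text only says that "one of the existing lines" is ejected.

```python
class SetAssociativeCache:
    """Toy set-associative cache that tracks only which lines are resident."""

    def __init__(self, num_sets=16, ways=2, line_bytes=256):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        # Each set lists its resident line addresses, least recently used first.
        self.sets = [[] for _ in range(num_sets)]

    def access(self, addr):
        """Return 'hit', 'miss' (a free way was used), or 'miss+evict'."""
        line = addr - addr % self.line_bytes
        lines = self.sets[(line // self.line_bytes) % self.num_sets]
        if line in lines:
            lines.remove(line)
            lines.append(line)      # now the most recently used line in its set
            return "hit"
        if len(lines) < self.ways:
            lines.append(line)
            return "miss"
        lines.pop(0)                # assumption: eject the least recently used line
        lines.append(line)
        return "miss+evict"


cache = SetAssociativeCache()
cache.access(0x43210E00)                           # the program's code
for addr in range(0x12345000, 0x12345F00, 0x100):  # data lines 0x12345000..0x12345E00
    cache.access(addr)

print(cache.access(0x12345F00))   # set 0xF, both ways empty: miss
print(cache.access(0x1233000))    # set 0x0, way 1 free: miss
print(cache.access(0x1233E00))    # set 0xE, both ways full: miss+evict
print(cache.access(0x43210E00))   # the ejected line again: an associativity miss
```

Note that the eviction happens while most of the cache is still empty, which is what distinguishes an associativity miss from a capacity miss.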

Thus far, we have been considering only cases where a CPU reads a data item. What happens when it does a write? Because it is important that all CPUs agree on the value of a given data item, before a given CPU writes to that data item, it must first cause it to be removed, or “invalidated”, from other CPUs’ caches. Once this invalidation has completed, the CPU may safely modify the data item. If the data item was present in this CPU’s cache, but was read-only, this process is termed a “write miss”. Once a given CPU has completed invalidating a given data item from other CPUs’ caches, that CPU may repeatedly write (and read) that data item.

Later, if one of the other CPUs attempts to access the data item, it will incur a cache miss, this time because the first CPU invalidated the item in order to write to it. This type of cache miss is termed a “communication miss”, since it is usually due to several CPUs using the data items to communicate (for example, a lock is a data item that is used to communicate among CPUs using a mutual-exclusion algorithm).
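The write-then-miss sequence described above can be illustrated with a toy two-CPU model. It is a sketch under strong simplifying assumptions: caches are plain dictionaries with unlimited capacity, every write goes straight through to memory so that a later miss can simply reload from memory, and the class name `ToySmp` and the address `LOCK` are hypothetical.

```python
class ToySmp:
    """Toy write-invalidate model: one dict per CPU cache, keyed by address."""

    def __init__(self, num_cpus):
        self.memory = {}
        self.caches = [{} for _ in range(num_cpus)]
        self.events = []            # (cpu, event, addr) tuples in program order

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr in cache:
            self.events.append((cpu, "hit", addr))
        else:
            self.events.append((cpu, "miss", addr))
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, cpu, addr, value):
        # Remove the item from every other CPU's cache before modifying it.
        for other, cache in enumerate(self.caches):
            if other != cpu and addr in cache:
                del cache[addr]
                self.events.append((other, "invalidated", addr))
        self.caches[cpu][addr] = value
        self.memory[addr] = value   # simplification: write through to memory


LOCK = 0x100                        # hypothetical address of a lock word
smp = ToySmp(num_cpus=2)
smp.read(0, LOCK)                   # warmup miss on CPU 0
smp.read(1, LOCK)                   # warmup miss on CPU 1
smp.write(0, LOCK, 1)               # CPU 0 writes, invalidating CPU 1's copy
value = smp.read(1, LOCK)           # communication miss on CPU 1
print(value, smp.events[-1])
```

The last event is a miss on CPU 1 even though CPU 1 had already loaded the item once, which is the pattern the text calls a communication miss.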

Clearly, much care must be taken to ensure that all CPUs maintain a coherent view of the data. With all this fetching, invalidating, and writing, it is easy to imagine data being lost or (perhaps worse) different CPUs having conflicting values for the same data item in their respective caches. These problems are prevented by “cache-coherency protocols”, described in the next section.

2 Cache-Coherence Protocols

Cache-coherency protocols manage cache-line states so as to prevent inconsistent or lost data. These protocols can be quite complex, with many tens of states, but for our purposes we need only concern ourselves with the four-state MESI cache-coherence protocol.

(Footnote: See Culler et al. [CSG99] pages 670 and 671 for the nine-state and 26-state diagrams for SGI Origin2000 and Sequent (now IBM) NUMA-Q, respectively. Both diagrams are significantly simpler than real life.)

2.1 MESI States

MESI stands for “modified”, “exclusive”, “shared”, and “invalid”, the four states a given cache line can take on using this protocol. Caches using this protocol therefore maintain a two-bit state “tag” on each cache line in addition to that line’s physical address and data.

A line in the “modified” state has been subject to a recent memory store from the corresponding CPU, and the corresponding memory is guaranteed not to appear in any other CPU’s cache. Cache lines in the “modified” state can thus be said to be “owned” by the CPU. Because this cache holds the only up-to-date copy of the data, this cache is ultimately responsible for either writing it back to memory or handing it off to some other cache, and must do so before reusing this line to hold other data.

The “exclusive” state is very similar to the “modified” state, the single exception being that the cache line has not yet been modified by the corresponding CPU, which in turn means that the copy of the cache line’s data that resides in memory is up-to-date. However, since the CPU can store to this line at any time, without consulting other CPUs, a line in the “exclusive” state can still be said to be owned by the corresponding CPU. That said, because the corresponding value in memory is up to date, this cache can discard this data without writing it back to memory or handing it off to some other CPU.

A line in the “shared” state might be replicated in at least one other CPU’s cache, so that this CPU is not permitted to store to the line without first consulting with other CPUs. As with the “exclusive” state, because the corresponding value in memory is up to date, this cache can discard this data without writing it back to memory or handing it off to some other CPU.

A line in the “invalid” state is empty, in other words, it holds no data. When new data enters the cache, it is placed into a cache line that was in the “invalid” state if possible. This approach is preferred because replacing a line in any other state could result in an expensive cache miss should the replaced line be referenced in the future.
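The properties of the four states can be collected into a short Python sketch. The numeric encoding of the states is an arbitrary assumption chosen only to show that two bits are enough for the state tag, and the helper names are hypothetical.

```python
from enum import IntEnum


class Mesi(IntEnum):
    """Four MESI states; the values 0..3 are an arbitrary two-bit encoding."""
    INVALID = 0
    SHARED = 1
    EXCLUSIVE = 2
    MODIFIED = 3


def holds_data(state: Mesi) -> bool:
    """An invalid line is empty; every other state holds data."""
    return state is not Mesi.INVALID


def can_store_without_consulting(state: Mesi) -> bool:
    """The CPU owns modified and exclusive lines and may store to them at once."""
    return state in (Mesi.MODIFIED, Mesi.EXCLUSIVE)


def must_preserve_before_reuse(state: Mesi) -> bool:
    """Only a modified line holds the sole up-to-date copy, so it must be
    written back or handed off before the entry is reused."""
    return state is Mesi.MODIFIED


for state in Mesi:
    print(f"{state.name:9} tag={int(state):02b} "
          f"store_now={can_store_without_consulting(state)} "
          f"preserve={must_preserve_before_reuse(state)}")
```

A line in the shared or exclusive state can be dropped silently because memory already holds the current value, which is why only the modified state returns true from `must_preserve_before_reuse`.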

Since all CPUs must maintain a coherent view of the data carried in the cache lines, the cache-coherence protocol provides messages that coordinate the movement of cache lines through the system.

2.2 MESI Protocol Messages

Many of the transitions described in the previous section require communication among the CPUs. If the CPUs are on a single shared bus, the following messages suffice:

Read: The “read” message contains the physical address of the cache line to be read.

Read Response: The “read response” message contains the data requested by an earlier “read” message. This “read response” message might be supplied either by memory or by one of the other caches. For example, if one of the caches has the desired data in “modified” state, that cache must supply the “read response” message.

Invalidate: The “invalidate” message contains the physical address of the cache line to be invalidated. All other caches must remove the corresponding data from their caches and respond.

Invalidate Acknowledge: A CPU receiving an “invalidate” message must respond with an “invalidate acknowledge” message after removing the specified data from its cache.

Read Invalidate: The “read invalidate” message contains the physical address of the cache line to be read, while at the same time directing other caches to remove the data. Hence, it is a combination of a “read” and an “invalidate”, as indicated by its name. A “read invalidate” message requires both a “read response” and a set of “invalidate acknowledge” messages in reply.

Writeback: The “writeback” message contains both the address and the data to be written back to memory (and perhaps “snooped” into other CPUs’ caches along the way). This message permits caches to eject lines in the “modified” state as needed to make room for other data.
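The request and reply pairing in the message list above can be tabulated in Python. This is an illustrative sketch: the names `Msg` and `replies_needed` are hypothetical, and treating “writeback” and the reply messages as needing no further reply is an assumption, since the text does not say either way.

```python
from enum import Enum, auto


class Msg(Enum):
    READ = auto()
    READ_RESPONSE = auto()
    INVALIDATE = auto()
    INVALIDATE_ACK = auto()
    READ_INVALIDATE = auto()
    WRITEBACK = auto()


def replies_needed(msg: Msg, num_cpus: int) -> dict:
    """Replies a sender waits for on a single shared bus with num_cpus CPUs."""
    others = num_cpus - 1
    if msg is Msg.READ:
        return {Msg.READ_RESPONSE: 1}
    if msg is Msg.INVALIDATE:
        return {Msg.INVALIDATE_ACK: others}      # every other cache responds
    if msg is Msg.READ_INVALIDATE:
        return {Msg.READ_RESPONSE: 1, Msg.INVALIDATE_ACK: others}
    return {}                                    # assumption: no reply expected


for msg in (Msg.READ, Msg.INVALIDATE, Msg.READ_INVALIDATE, Msg.WRITEBACK):
    total = sum(replies_needed(msg, num_cpus=4).values())
    print(f"{msg.name}: {total} replies on a 4-CPU bus")
```

The reply count for “invalidate” grows with the number of CPUs, which is the concern that Quick Quiz 2 raises.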

Interestingly enough, a shared-memory multiprocessor system really is a message-passing computer under the covers. This means that clusters of SMP machines that use distributed shared memory are using message passing to implement shared memory at two different levels of the system architecture.

Quick Quiz 1: What happens if two CPUs attempt to invalidate the same cache line concurrently?

Quick Quiz 2: When an “invalidate” message appears in a large multiprocessor, every CPU must give an “invalidate acknowledge” response. Wouldn’t the resulting “storm” of “invalidate acknowledge” responses totally saturate the system bus?

Quick Quiz 3: If SMP machines are really using message passing anyway, why bother with SMP at all?

2.3 MESI State Diagram

A given cache line’s state changes as protocol messages are sent and received, as shown in Figure 3.

[Figure 3: MESI Cache-Coherency State Diagram]

The transition arcs in this figure are as follows:

Transition (a): A cache line is written back to memory, but the CPU retains it in its cache and further retains the right to modify it. This transition requires a “writeback” message.

Transition (b): The CPU writes to the cache line that it already had exclusive access to. This transition does not require any messages to be sent or received.

Transition (c): The CPU receives a “read invalidate” message for a cache line that it has modified. The CPU must invalidate its local copy, then respond with both a “read response” and an “invalidate acknowledge” message, both sending the data to the requesting CPU and indicating that it no longer has a local copy.
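Transitions (a) through (c) can be written down as a small lookup table. This sketch covers only those three arcs; the source and destination states are inferred from the descriptions above (for example, a line that has been written back but is still held with the right to modify it is taken to be “exclusive”), and the names `TRANSITIONS` and `apply_transition` are hypothetical.

```python
# arc: (state before, state after, message received, messages sent)
TRANSITIONS = {
    "a": ("M", "E", None, ("writeback",)),
    "b": ("E", "M", None, ()),
    "c": ("M", "I", "read invalidate",
          ("read response", "invalidate acknowledge")),
}


def apply_transition(state, arc):
    """Return (new_state, messages_sent) for one arc, checking the start state."""
    before, after, _received, sent = TRANSITIONS[arc]
    if state != before:
        raise ValueError(f"transition ({arc}) starts in {before}, not {state}")
    return after, sent


state = "E"
for arc in ("b", "a", "b", "c"):      # write, write back, write again, lose the line
    state, sent = apply_transition(state, arc)
    print(f"({arc}) -> {state}, sent: {', '.join(sent) or 'nothing'}")
```

Only arc (b) is free of bus traffic, which matches the statement that a CPU may write repeatedly to a line it already owns.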
