Using Flash-Level Parallelism for Cleaning Operations in NAND Flash-Based Devices


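This paper discusses how NAND flash-based devices can exploit flash-level parallelism for cleaning operations, and proposes a novel Subdivided Garbage Collector (SGC) for that purpose. NAND flash devices internally provide multilevel parallelism, and parallelizing flash operations is the key to improving performance. Most existing work schedules host requests so that the resulting sub-requests can be served in parallel, but it rarely parallelizes the extra operations introduced by the internal garbage collection (GC) process. The costly operation sequence of GC is the main cause of I/O blocking, especially when the device is nearly full. Flash devices offer high random I/O performance, high robustness, low power consumption, light weight, and a small form factor, which makes them key components of embedded devices and smartphones. Simulation experiments show that the proposed scheme hides the overheads of GC and thereby shortens the response time. The main points of the paper are as follows:

1. Multilevel parallelism of flash devices: flash devices have inherent multilevel parallelism, which is divided into system-level and flash-level parallelism. System-level parallelism covers parallel I/O service and data transfer across multiple chips, while flash-level parallelism refers to parallel operations inside a single chip.

2. Impact of garbage collection (GC) on performance: to reclaim storage space, a NAND flash device must periodically run garbage collection, which discards obsolete data and frees the space it occupied. When the device is nearly full, the costly GC operation sequence causes I/O blocking because it occupies critical components of the device.

3. Subdivided Garbage Collector (SGC): SGC is a new garbage collection scheme designed for NAND flash devices. Its core idea is to use both system-level and flash-level parallelism to parallelize garbage-collecting operations and the related cleaning activities. SGC confines the GC process inside a single flash chip and uses system-level parallelism to overlap GC activities with I/O services on other chips.

4. New queue mechanism: the paper proposes a queue mechanism that schedules the reordered partial steps of the cleaning sequence and packs them into parallel operations, which further exploits the available parallelism.

5. Dynamic conflict-aware address allocator: to keep host writes and cleaning operations from competing, the paper proposes a dynamic conflict-aware address allocator. Because addresses are assigned dynamically, host writes and GC operations no longer contend for the critical resources of the device, which makes more parallelization possible.

6. Experimental validation: trace-driven simulations show that the proposed SGC scheme hides the overheads of GC and shortens the response time.

7. Keywords: "flash-based devices", "garbage collection (GC)", "address allocation", "parallelism", and "scheduling".

Together, these points explain how NAND flash-based devices can use flash-level parallelism for cleaning operations, and they inform the design of efficient garbage collection policies for flash storage.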


Assimilating Cleaning Operations with Flash-Level Parallelism for NAND Flash-Based Devices

Ronghui Wang∗†, Zhiguang Chen∗, Nong Xiao∗, Minxuan Zhang∗, Weihua Dong†

∗ College of Computer, National University of Defense Technology, Changsha, 410073, China

† State Key Laboratory of Astronautic Dynamics, Xi’an, 710043, China

Email: {ronghuiw, chenzhiguanghit}@gmail.com, {nongxiao,mxzhang}@nudt.edu.cn, {ttinywolf}@gmail.com

Abstract: Flash-based devices internally provide multilevel parallelism, and parallelizing flash operations is the key to improving performance. Most existing research dispatches and schedules host requests so that the resulting sub-requests can be served in parallel; however, these works seldom parallelize the extra operations introduced by the internal garbage collection (GC) process. The costly operation sequence of GC is the main reason for I/O blocking, especially when the device is close to full. In this paper, we propose a novel Subdivided Garbage Collector (SGC), which exploits both the system-level and the flash-level parallelism to parallelize garbage-collecting operations as well as garbage-collecting activities. SGC confines the GC process inside a flash chip, utilizing the system-level parallelism to overlap garbage-collecting activities with I/O services among different chips. The flash-level parallelism is further exploited with a novel queue mechanism, which schedules and packs the reordered partial steps of the cleaning sequence into parallel operations. To make more parallelization possible, a dynamic conflict-aware address allocator is proposed to prevent host writes and cleaning operations from contending for the critical components of the device. Trace-driven simulations demonstrate that the proposals can hide the overheads of GC, resulting in a shorter response time.

Keywords: flash-based devices, garbage collection (GC), address allocation, parallelism, scheduling

I. INTRODUCTION

Flash-based devices offer high random I/O performance, high robustness, low power consumption, light weight, and a small form factor, so they are now crucial components in embedded devices and smartphones. Likewise, large-capacity flash-based devices are replacing hard disk drives (HDDs) in laptop computers, high-end enterprise-scale storage systems, and high-performance computing (HPC) environments. However, NAND flash memory has some idiosyncrasies, such as erase-before-write and bulk erase. Therefore, the flash-based devices mentioned above introduce a firmware layer called the flash translation layer (FTL) to manage the flash address space. The FTL uses an address mapping mechanism to support out-of-place updates and employs an internal garbage collection (GC) process to recycle obsolete data and provide free space. The GC process brings extra operations: it needs to move the valid pages out of the victim block before erasing the block. The overheads of this cleaning sequence are relatively high, which may cause I/O blocking and performance degradation, especially when the device is close to full.
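To make the cost of the cleaning sequence concrete, the sketch below estimates the time needed to reclaim one victim block when every step runs serially: each valid page is read and re-programmed elsewhere, and the block is erased once at the end. The latency constants and the function name `cleaning_cost_us` are illustrative assumptions, not values or interfaces taken from this paper.

```python
# Illustrative latencies in microseconds.
# These are assumed values for the sketch, not figures from the paper.
READ_US = 25
PROGRAM_US = 200
ERASE_US = 1500


def cleaning_cost_us(valid_pages: int) -> int:
    """Serial time to reclaim one victim block.

    Every valid page is read out and programmed into a free page,
    then the victim block is erased once.
    """
    if valid_pages < 0:
        raise ValueError("valid_pages must be non-negative")
    return valid_pages * (READ_US + PROGRAM_US) + ERASE_US


if __name__ == "__main__":
    # A block with more valid pages is more expensive to clean.
    for valid in (0, 32, 96):
        print(valid, cleaning_cost_us(valid))
```

Under these assumed latencies, the page migrations dominate the erase as soon as a handful of pages are still valid, which is why overlapping them with host I/O matters.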

Flash-based devices employ multiple flash chips over multiple channels, providing system-level parallelism at the channel and chip levels. Furthermore, inside a flash chip, manufacturers provide advanced commands that can be used to exploit flash-level parallelism at the die and plane levels, though the use of these commands must adhere to strict restrictions. Existing works mainly focus on parallelizing host requests to improve device performance, but they do not consider the parallelism among garbage-collecting operations. Chang et al. [1] have found that under realistic disk workloads, nearly as much time is spent on collecting garbage as on writing host data. Therefore, garbage-collecting activities and operations have a significant negative impact on host I/O performance.

2014 IEEE International Conference on Computer and Information Technology, DOI 10.1109/CIT.2014.68

Motivated by the goal of parallelizing garbage-collecting operations as well as garbage-collecting activities, this work exploits both the system-level and the flash-level parallelism to assimilate the cleaning operations into I/O services. The novel scheme includes a Subdivided Garbage Collector (SGC), a queue mechanism, and a dynamic conflict-aware address allocator. Our garbage collector confines the GC inside a flash chip, which makes use of the system-level parallelism for garbage-collecting activities, since different chips can take different tasks. For the cleaning chip, the pages to be migrated are first read out of the victim block, during which host reads to the chip can still be served with higher priority, while no host write is assigned to the chip by our dynamic allocator. The block can be erased after all valid pages have been read out, and after the erasure the allocator can assign the chip to host writes again. Therefore, the time host requests spend idle waiting for the cleaning sequence is reduced, since host writes are dispatched elsewhere and host reads are scheduled as soon as possible. In addition, we propose a command queue that has multiple lines, where internal operations and incoming operations are queued and packed into advanced parallel commands, utilizing the flash-level parallelism. Moreover, the novel allocator prohibits competition for critical components between writing host data and writing collected data, so that more operations can be overlapped among the parallel units of the device.
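The allocation behavior described above can be sketched as follows: a chip that is in its cleaning phase receives no host writes, and becomes eligible again once the erase has finished. This is a minimal illustration only. The class and method names are hypothetical, and the tie-breaking policy (pick the eligible chip with the fewest pending writes) is an assumption that the text does not specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chip:
    chip_id: int
    cleaning: bool = False    # True from the first migration read until the erase completes
    pending_writes: int = 0   # host writes already dispatched to this chip


class ConflictAwareAllocator:
    """Hypothetical sketch: steer host writes away from chips that are cleaning."""

    def __init__(self, num_chips: int) -> None:
        self.chips: List[Chip] = [Chip(i) for i in range(num_chips)]

    def start_cleaning(self, chip_id: int) -> None:
        self.chips[chip_id].cleaning = True

    def finish_erase(self, chip_id: int) -> None:
        # After the victim block is erased, the chip may take host writes again.
        self.chips[chip_id].cleaning = False

    def allocate_write(self) -> int:
        """Return the id of the chip that should serve the next host write."""
        candidates = [c for c in self.chips if not c.cleaning]
        if not candidates:
            raise RuntimeError("every chip is currently cleaning")
        # Assumed policy: least-loaded eligible chip, lowest id on ties.
        target = min(candidates, key=lambda c: (c.pending_writes, c.chip_id))
        target.pending_writes += 1
        return target.chip_id
```

With three chips and chip 0 cleaning, successive calls to `allocate_write()` alternate between chips 1 and 2; once `finish_erase(0)` is called, chip 0 is the least loaded and receives the next write.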

For simplicity, the abbreviation SGC is used to designate the proposals in the following sections. Our proposals are applicable not only to large-capacity flash-based devices with multiple independent channels, but also to devices with limited resources, e.g. multiple chips sharing a single bus. We evaluate our proposals with realistic workloads running on a flash simulator. Our experimental results show that the SGC reduces the average response time by 59.3% on average. In addition, it neither degrades the cleaning efficiency nor breaks load balance.

The remainder of this paper is organized as follows. Section 2 gives an overview of the background and surveys the previous related works. Section 3 describes the allocation and cleaning policies of the proposed scheme. Section 4 evaluates the performance, while Section 5 draws the conclusions.

II. BACKGROUNDS AND RELATED WORKS

A. Backgrounds

NAND flash memory has become more prevalent in the storage marketplace. However, it has some peculiarities, namely hierarchical organization, erase-before-write, bulk erase, and limited endurance. The basic read/write unit of flash memory is the page, which is now typically 4KB. A given number (e.g. 128) of pages are further packaged into a block. Flash memory cannot overwrite a page. Instead, a page can be written again only after it has been erased (erase-before-write). Unlike reads and writes, which operate at page granularity, the erase operation is performed in units of blocks (bulk erase). A write operation is much slower than a read, while an erase operation is even slower than a write (asymmetric operational speed). Finally, flash memory suffers from a limited lifespan: Single-Level Cell (SLC) flash memory merely survives 100,000 write/erase cycles, while for Multi-Level Cell (MLC) flash memory the write endurance degrades to 10,000.
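The erase-before-write and bulk-erase constraints can be modeled with a small sketch. The `FlashBlock` class below is a hypothetical illustration: programming a page that has not been erased is rejected, and erasing always resets the whole block and counts one write/erase cycle. The default of 128 pages per block follows the example figure in the text.

```python
class FlashBlock:
    """Toy model of one NAND block (illustrative only)."""

    def __init__(self, pages_per_block: int = 128) -> None:
        self.data = [None] * pages_per_block
        self.programmed = [False] * pages_per_block
        self.erase_count = 0  # write/erase cycles consumed so far

    def program(self, page_index: int, payload: bytes) -> None:
        # Erase-before-write: a programmed page cannot be overwritten.
        if self.programmed[page_index]:
            raise ValueError("page must be erased before it is written again")
        self.data[page_index] = payload
        self.programmed[page_index] = True

    def read(self, page_index: int):
        return self.data[page_index]

    def erase(self) -> None:
        # Bulk erase: the whole block is reset, never a single page.
        size = len(self.data)
        self.data = [None] * size
        self.programmed = [False] * size
        self.erase_count += 1
```

Updating a page in this model requires either erasing the whole block first or writing the new version to a different free page, which is the motivation for out-of-place updates.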

Flash-based devices employ multiple flash chips over multiple channels and multiple ways, providing the channel-level and the chip-level parallelism. Channels are independent I/O buses, and flash chips on different channels can be operated independently. Ways are data paths connected to the flash chips in each channel, and accesses to each chip can be pipelined inside a channel. The channel-level and chip-level parallelism belong to the system-level parallelism, which correlates with the device architecture. Each flash chip contains one or more dies, which share a single multiplexed interface. Each die accommodates multiple planes; a plane consists of blocks and is the smallest unit that can serve a request in parallel. Chip manufacturers provide some advanced commands to exploit the flash-level parallelism among multiple dies and multiple planes; however, there are some constraints on the use of these commands. Fig. 1 illustrates the hardware parallelism of a flash-based device with a multi-channel, multi-way architecture.

[Figure 1. Multilevel parallelism of a flash-based device. Diagram labels: channels CH 0 and CH 1; Chip 0 and Chip 1; Die 0 and Die 1, each with Plane 0 and Plane 1 and a buffer per plane; channel-level parallelism; chip-level parallelism; die-level: interleave operation; plane-level: multi-plane operation.]
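The hierarchy described above (channel, chip, die, plane, block, page) can be illustrated by decomposing a flat physical page number into its coordinates. The geometry values and the field ordering in this sketch are assumptions chosen for illustration; real devices and FTLs may stripe addresses across the parallel units in a different order.

```python
from typing import NamedTuple


class Geometry(NamedTuple):
    # Assumed example geometry; only the 128 pages per block echoes the text.
    channels: int = 2
    chips_per_channel: int = 2
    dies_per_chip: int = 2
    planes_per_die: int = 2
    blocks_per_plane: int = 1024
    pages_per_block: int = 128


class PhysicalAddress(NamedTuple):
    channel: int
    chip: int
    die: int
    plane: int
    block: int
    page: int


def total_pages(geo: Geometry) -> int:
    count = 1
    for dimension in geo:
        count *= dimension
    return count


def decompose(ppn: int, geo: Geometry = Geometry()) -> PhysicalAddress:
    """Split a flat physical page number into hierarchical coordinates."""
    if not 0 <= ppn < total_pages(geo):
        raise ValueError("physical page number out of range")
    rest, page = divmod(ppn, geo.pages_per_block)
    rest, block = divmod(rest, geo.blocks_per_plane)
    rest, plane = divmod(rest, geo.planes_per_die)
    rest, die = divmod(rest, geo.dies_per_chip)
    channel, chip = divmod(rest, geo.chips_per_channel)
    return PhysicalAddress(channel, chip, die, plane, block, page)
```

Two addresses that differ only in `channel` or `chip` can be served with system-level parallelism, while addresses that differ only in `die` or `plane` are candidates for the flash-level advanced commands.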

The advanced parallel commands include the multi-plane command and the interleave command. The multi-plane command activates multiple read, program, or erase operations in all planes of the same die. This command can utilize the plane-level parallelism, but only operations of the same type that access the same page (block) in each plane can be packed together. If the pages (blocks) of every plane do not form a super page (block), the multi-plane command can seldom be used because of this strict address restriction. The interleave command executes several read, write, and erase operations in different dies of the same chip simultaneously, utilizing the die-level parallelism. There is no extra address restriction on the use of this command; however, only fixed types of operations can be packed together. The defined interleave commands are classified into the interleave program command, the interleave read-program command, and the interleave erase command. In this work, we specifically exploit the die-level micro-architecture parallelism for cleaning operations as well as host operations.
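The packing restrictions described above can be expressed as two predicates. This is a simplified reading of the rules in the text, assuming all operations target the same chip; the `FlashOp` type and both function names are hypothetical, and actual chips may impose further vendor-specific constraints.

```python
from typing import List, NamedTuple


class FlashOp(NamedTuple):
    kind: str    # "read", "program" or "erase"
    die: int
    plane: int
    block: int
    page: int    # ignored for erase operations


def can_multi_plane(ops: List[FlashOp]) -> bool:
    """Same die, distinct planes, same type, same block (and page) address."""
    if len(ops) < 2:
        return False
    first = ops[0]
    same_die = all(op.die == first.die for op in ops)
    same_kind = all(op.kind == first.kind for op in ops)
    distinct_planes = len({op.plane for op in ops}) == len(ops)
    same_block = all(op.block == first.block for op in ops)
    same_page = first.kind == "erase" or all(op.page == first.page for op in ops)
    return same_die and same_kind and distinct_planes and same_block and same_page


# Type pairs covered by the interleave program, read-program and erase commands.
ALLOWED_INTERLEAVE = {
    ("program", "program"),
    ("program", "read"),
    ("erase", "erase"),
}


def can_interleave(a: FlashOp, b: FlashOp) -> bool:
    """Different dies of the same chip, with an allowed combination of types."""
    if a.die == b.die:
        return False
    return tuple(sorted((a.kind, b.kind))) in ALLOWED_INTERLEAVE
```

For example, two programs to block 5, page 3 in planes 0 and 1 of the same die satisfy `can_multi_plane`, whereas a program in die 0 and a migration read in die 1 satisfy `can_interleave` regardless of their addresses.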

To hide the peculiarities of flash and to make full use of the parallelism mentioned above, flash-based devices internally employ a flash translation layer (FTL) to manage the physical address space and schedule I/O requests to parallel channels. Specifically, to overcome the “erase-before-write” characteristic, the FTL uses “out-of-place write” instead of the “in-place write” adopted by HDDs. For this reason, the FTL needs an address allocation and mapping mechanism to translate logical addresses used by the host
