usually deployed and many more KV items can be held in memory for faster access. Since most gets are absorbed by large-capacity multi-layer caches and only a few scattered gets reach the LSM-tree, many data center applications are effectively put-dominated. It is therefore unnecessary to cautiously push KV items top-down, component by component, a mechanism that incurs excessive write amplification and thus sacrifices system throughput.
In this paper, we propose Skip-tree, which aggressively pushes KV items to non-adjacent larger components by skipping intermediate components. By reducing the number of steps in the flow from the memory-resident component to the largest disk-resident component, Skip-tree reduces read and write I/Os and decreases write amplification. In addition, Skip-tree uses Bloom filters [31] to further reduce the read I/Os caused by the version constraint. We develop adaptive and reliable KV item movements among components. As a consequence, Skip-tree improves the throughput of key-value stores.
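To make the skipping idea concrete, the minimal sketch below shows one plausible way a flushed run could pick a non-adjacent target component. The function name, the overlap-based rule, and the max_overlap threshold are our own illustrative assumptions, not Skip-tree's actual policy, which is detailed in Section 4.

def choose_skip_target(run, components, max_overlap=4):
    # components: list of disk components, ordered from the smallest (C1)
    # to the largest, each given as a list of SSTable key ranges.
    # Pick the deepest component the run can be pushed into directly,
    # skipping intermediate components, as long as its key range overlaps
    # with at most max_overlap SSTables in that component.
    lo, hi = run['min_key'], run['max_key']
    target = 0                          # fall back to the adjacent component
    for level, sstables in enumerate(components):
        overlap = sum(1 for t in sstables
                      if not (t['max_key'] < lo or hi < t['min_key']))
        if overlap <= max_overlap:
            target = level              # deepest level so far with acceptable overlap
    return target

Whatever the exact rule, the intended effect is the same: a run takes one step instead of several, so each KV item is rewritten fewer times on its way to the largest component.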
We design and implement SkipStore based on Skip-tree. Our experiments demonstrate that SkipStore outperforms RocksDB by 66.5%. Since SkipStore caches some KV items in a newly added buffer component, we also design and implement a reliability mechanism based on a write-ahead log to prevent data loss. Benefiting from better put and scan performance, SkipStore can serve as the back-end storage engine of both cloud storage systems and data analysis and processing systems such as PNUTS [12], Walnut [38], and Hadoop [39].
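As a rough illustration of how a write-ahead log protects buffered KV items, the sketch below appends and syncs a log record before a put is acknowledged. The record format and the file path are our assumptions and not SkipStore's actual on-disk format.

import os, struct

class WriteAheadLog:
    def __init__(self, path):
        self.f = open(path, 'ab')

    def append(self, key, value):
        # Record = key length, value length, key bytes, value bytes.
        # The record is flushed and fsynced before the put is acknowledged,
        # so KV items held only in the buffer component survive a crash.
        rec = struct.pack('<II', len(key), len(value)) + key + value
        self.f.write(rec)
        self.f.flush()
        os.fsync(self.f.fileno())

wal = WriteAheadLog('/tmp/skipstore.wal')   # hypothetical log path
wal.append(b'user:42', b'{"name": "alice"}')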
The rest of this paper is organized as follows. Section 2 describes the background and motivation. Section 3 presents an overview of our solution. Section 4 describes the Skip-tree data structure, and Section 5 presents the design and implementation issues of SkipStore based on Skip-tree. Section 6 presents and discusses the evaluation results. Section 7 surveys related work. Finally, Section 8 concludes the paper by summarizing its main contributions.
2 BACKGROUND AND MOTIVATIONS
2.1 Multi-layer Cached Data Center
Nowadays, most data centers use multi-layer caches to reduce the average read latency and the number of read requests that reach the backend system. We take Facebook's photo-serving stack [25] as an example to illustrate the architecture of a multi-layer cached data center. The stack has three cache layers: the browser cache, the edge cache, and the origin cache. The first cache layer resides in the client's browser and absorbs the largest share of read requests, 65.5%. The Facebook Edge comprises a set of edge caches, each running inside a point of presence (POP) close to end users. As the second cache layer, the edge cache serves 20% of read requests.
Fig. 1. The basic LSM-tree data structure, SSTable layout, and compaction procedure. (a) LSM-tree data structure. (b) SSTable layout.
The last cache layer, the origin cache, is co-located with the backend storage system. It serves 4.6% of read requests and leaves 9.9% to the backend storage system, which at Facebook is Haystack.
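These percentages account for essentially all read traffic; the quick check below reproduces the share that falls through to the backend from the numbers reported in [25].

# Hit-rate breakdown of Facebook's photo-serving stack [25]:
# browser, edge, and origin caches absorb 65.5%, 20%, and 4.6% of reads.
browser, edge, origin = 0.655, 0.20, 0.046
backend = 1.0 - (browser + edge + origin)
print(f"reads reaching the backend (Haystack): {backend:.1%}")  # -> 9.9%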
Although some temporary KV items in specific scenarios need not be persisted to the disk-based storage system, in most scenarios massive numbers of KV items must be made persistent on disk. In this paper, we focus on the performance bottleneck of LSM-tree based storage systems. Although KV items can be written to the memstore with extremely low latency, in most cases they must eventually be dumped to disk-based persistent storage and take part in the compaction procedure to flow down to the larger components. Reducing write amplification, and thereby improving the write throughput of the back-end storage engine, is a challenging problem.
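To give a sense of the scale of the problem, the back-of-the-envelope estimate below uses the textbook approximation for leveled compaction, in which a KV item is rewritten roughly once per size ratio at each disk component it passes through. The concrete numbers are assumed for illustration, not measurements of RocksDB or SkipStore.

# Illustrative write-amplification estimate for leveled compaction
# (textbook approximation; size ratio and component count are assumed).
size_ratio = 10            # capacity ratio between adjacent components
num_disk_components = 4    # number of disk-resident components
# In the worst case an item is merged ~size_ratio times at each component
# on its way down to the largest component, so:
write_amplification = size_ratio * num_disk_components
print(write_amplification)  # -> ~40 bytes written per user byte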
2.2 Basic KV Organization of the LSM-tree
The LSM-tree organizes KV items in multiple tree-like components, generally including one memory-resident component and multiple disk-resident components, as shown in Figure 1(a). The size of each component is limited to a predefined threshold, and these thresholds grow exponentially from one component to the next. We use a representative design, RocksDB at Facebook, as an example to present the design and implementation of the LSM-tree. RocksDB first uses an in-memory buffer, called the MemTable, to receive incoming KV items and keep them sorted. When a MemTable fills up, it is dumped to the hard disk as an immutable SSTable, such as T12 in Figure 1(a). KV items are key-sorted and placed in fixed-size blocks. The key of the first KV item in each block is recorded as an index to facilitate locating KV items. Each disk component consists of multiple SSTables whose key ranges do not overlap with each other, except those in C1. Figure 1(b) presents the layout of an SSTable. Each SSTable contains multiple data blocks and one metadata block. The data blocks contain the sorted KV items, while the metadata block holds the index and the Bloom filter, as shown in Figure 1(b).
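The sketch below condenses this dump path into a few lines: a sorted MemTable is cut into fixed-size data blocks, and the first key of each block is recorded in an index. It is a simplification under our own assumptions (block size, in-memory representation); real SSTables also carry metadata such as Bloom filters and checksums.

BLOCK_SIZE = 4096

def dump_memtable(memtable):
    # Turn a sorted in-memory table into fixed-size data blocks plus an
    # index that maps each block's first key to its block number.
    items = sorted(memtable.items())        # the MemTable keeps keys sorted
    blocks, index, current, size = [], [], [], 0
    for key, value in items:
        current.append((key, value))
        size += len(key) + len(value)
        if size >= BLOCK_SIZE:              # close a fixed-size data block
            index.append((current[0][0], len(blocks)))
            blocks.append(current)
            current, size = [], 0
    if current:                             # flush the last partial block
        index.append((current[0][0], len(blocks)))
        blocks.append(current)
    return blocks, index                    # contents of an immutable SSTable

blocks, index = dump_memtable({b'k3': b'v3', b'k1': b'v1', b'k2': b'v2'})

A get on such an SSTable first searches the index to find the single data block that may hold the key, which is why only the first key of each block needs to be recorded.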