Scalable SQL and NoSQL Data Stores

Rick Cattell

Cattell.Net Software

Email: rick@cattell.net

ABSTRACT

In this paper, we examine a number of SQL and so-

called “NoSQL” data stores designed to scale simple

OLTP-style application loads over many servers.

Originally motivated by Web 2.0 applications, these

systems are designed to scale to thousands or millions

of users doing updates as well as reads, in contrast to

traditional DBMSs and data warehouses. We contrast

the new systems on their data model, consistency

mechanisms, storage mechanisms, durability

guarantees, availability, query support, and other

dimensions. These systems typically sacrifice some of

these dimensions, e.g. database-wide transaction

consistency, in order to achieve others, e.g. higher

availability and scalability.

Note: Bibliographic references for systems are not

listed, but URLs for more information can be found in

the System References table at the end of this paper.

Caveat: Statements in this paper are based on sources

and documentation that may not be reliable, and the

systems described are “moving targets,” so some

statements may be incorrect. Verify through other

sources before depending on information here.

Nevertheless, we hope this comprehensive survey is

useful! Check for future corrections on the author’s

web site cattell.net/datastores.

Disclosure: The author is on the technical advisory

board of Schooner Technologies and has a consulting

business advising on scalable databases.

1. OVERVIEW

In recent years a number of new systems have been

designed to provide good horizontal scalability for

simple read/write database operations distributed over

many servers. In contrast, traditional database

products have comparatively little or no ability to scale

horizontally on these applications. This paper

examines and compares the various new systems.

Many of the new systems are referred to as “NoSQL”

data stores. The definition of NoSQL, which stands

for “Not Only SQL” or “Not Relational”, is not

entirely agreed upon. For the purposes of this paper,

NoSQL systems generally have six key features:

1. the ability to horizontally scale “simple

operation” throughput over many servers,

2. the ability to replicate and to distribute (partition)

data over many servers,

3. a simple call level interface or protocol (in

contrast to a SQL binding),

4. a weaker concurrency model than the ACID

transactions of most relational (SQL) database

systems,

5. efficient use of distributed indexes and RAM for

data storage, and

6. the ability to dynamically add new attributes to

data records.

The systems differ in other ways, and in this paper we

contrast those differences. They range in functionality

from the simplest distributed hashing, as supported by

the popular memcached open source cache, to highly

scalable partitioned tables, as supported by Google’s

BigTable [1]. In fact, BigTable, memcached, and

Amazon’s Dynamo [2] provided a “proof of concept”

that inspired many of the data stores we describe here:

• Memcached demonstrated that in-memory indexes

can be highly scalable, distributing and replicating

objects over multiple nodes.

• Dynamo pioneered the idea of eventual

consistency as a way to achieve higher availability

and scalability: data fetched are not guaranteed to

be up-to-date, but updates are guaranteed to be

propagated to all nodes eventually.

• BigTable demonstrated that persistent record

storage could be scaled to thousands of nodes, a

feat that most of the other systems aspire to.
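The eventual-consistency idea in the Dynamo bullet can be illustrated with a small sketch. This is not Dynamo's actual protocol: the class names, the single-writer queue, and the explicit `propagate()` call are assumptions made only to show a read returning stale data before an update has reached every replica.

```python
from collections import deque


class Replica:
    """One node holding its own copy of the data (hypothetical class)."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class EventuallyConsistentStore:
    """Writes land on one replica at once and reach the others later."""

    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = deque()  # updates not yet applied everywhere

    def write(self, key, value):
        # Apply to the first replica immediately, queue for the rest.
        self.replicas[0].data[key] = value
        for replica in self.replicas[1:]:
            self.pending.append((replica, key, value))

    def propagate(self):
        # Stand-in for background replication: drain the queue in order.
        while self.pending:
            replica, key, value = self.pending.popleft()
            replica.data[key] = value


store = EventuallyConsistentStore()
store.write("user:1", "alice")
stale = store.replicas[2].read("user:1")  # update has not arrived yet
store.propagate()
fresh = store.replicas[2].read("user:1")  # same read after replication
```

Between `write` and `propagate`, a client that happens to read from the third replica does not see the update; afterwards every replica agrees.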

A key feature of NoSQL systems is “shared nothing”

horizontal scaling – replicating and partitioning data

over many servers. This allows them to support a large

number of simple read/write operations per second.

This simple operation load is traditionally called OLTP

(online transaction processing), but it is also common

in modern web applications.

The NoSQL systems described here generally do not

provide ACID transactional properties: updates are

eventually propagated, but there are limited guarantees

on the consistency of reads. Some authors suggest a

“BASE” acronym in contrast to the “ACID” acronym:

• BASE = Basically Available, Soft state,

Eventually consistent

• ACID = Atomicity, Consistency, Isolation, and

Durability

The idea is that by giving up ACID constraints, one

can achieve much higher performance and scalability.

12 SIGMOD Record, December 2010 (Vol. 39, No. 4)

However, the systems differ in how much they give up.

For example, most of the systems call themselves

“eventually consistent”, meaning that updates are

eventually propagated to all nodes, but many of them

provide mechanisms for some degree of consistency,

such as multi-version concurrency control (MVCC).

Proponents of NoSQL often cite Eric Brewer’s CAP

theorem [4], which states that a system can have only

two out of three of the following properties:

consistency, availability, and partition-tolerance. The

NoSQL systems generally give up consistency.

However, the trade-offs are complex, as we will see.

New relational DBMSs have also been introduced to

provide better horizontal scaling for OLTP, when

compared to traditional RDBMSs. After examining

the NoSQL systems, we will look at these SQL

systems and compare the strengths of the approaches.

The SQL systems strive to provide horizontal

scalability without abandoning SQL and ACID

transactions. We will discuss the trade-offs here.

In this paper, we will refer to both the new SQL and

NoSQL systems as data stores, since the term

“database system” is widely used to refer to traditional

DBMSs. However, we will still use the term

“database” to refer to the stored data in these systems.

All of the data stores have some administrative unit

that you would call a database: data may be stored in

one file, or in a directory, or via some other

mechanism that defines the scope of data used by a

group of applications. Each database is an island unto

itself, even if the database is partitioned and distributed

over multiple machines: unlike some relational and

object-oriented databases, these systems have no

“federated database” concept that would allow

multiple separately-administered databases to appear

as one. Most of the systems allow horizontal

partitioning of data, storing records on different servers

according to some key; this is called “sharding”. Some

of the systems also allow vertical partitioning, where

parts of a single record are stored on different servers.
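A minimal sketch of the two kinds of partitioning just described, under stated assumptions: the server names, the MD5-modulo placement rule, and the attribute grouping are all illustrative and are not taken from any system in this paper.

```python
import hashlib

SERVERS = ["server-0", "server-1", "server-2"]  # hypothetical server names


def shard_for(key: str) -> str:
    """Horizontal partitioning: pick a server from a hash of the record key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]


def split_vertically(record: dict, groups: dict) -> dict:
    """Vertical partitioning: split one record's attributes across servers.

    `groups` maps a server name to the attribute names it should hold.
    """
    return {
        server: {name: record[name] for name in names if name in record}
        for server, names in groups.items()
    }


record = {"id": "u42", "name": "Ada", "email": "ada@example.com", "photo": b"..."}

home = shard_for(record["id"])  # the whole record lives on one shard
parts = split_vertically(
    record,
    {"server-0": ["id", "name", "email"], "server-1": ["id", "photo"]},
)
```

Hashing modulo the server count is the simplest placement rule, but it reassigns most keys whenever a server is added or removed, which is one reason the systems below tend to use more elaborate schemes.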

1.1 Scope of this Paper

Before proceeding, some clarification is needed in

defining “horizontal scalability” and “simple

operations”. These define the focus of this paper.

By “simple operations”, we refer to key lookups, reads

and writes of one record or a small number of records.

This is in contrast to complex queries or joins, read-

mostly access, or other application loads. With the

advent of the web, especially Web 2.0 sites where

millions of users may both read and write data,

scalability for simple database operations has become

more important. For example, applications may search

and update multi-server databases of electronic mail,

personal profiles, web postings, wikis, customer

records, online dating records, classified ads, and many

other kinds of data. These all generally fit the

definition of “simple operation” applications: reading

or writing a small number of related records in each

operation.

The term “horizontal scalability” means the ability to

distribute both the data and the load of these simple

operations over many servers, with no RAM or disk

shared among the servers. Horizontal scaling differs

from “vertical” scaling, where a database system

utilizes many cores and/or CPUs that share RAM and

disks. Some of the systems we describe provide both

vertical and horizontal scalability, and the effective use

of multiple cores is important, but our main focus is on

horizontal scalability, because the number of cores that

can share memory is limited, and horizontal scaling

generally proves less expensive, using commodity

servers. Note that horizontal and vertical partitioning

are not related to horizontal and vertical scaling,

except that they are both useful for horizontal scaling.

1.2 Systems Beyond our Scope

Some authors have used a broad definition of NoSQL,

including any database system that is not relational.

Specifically, they include:

• Graph database systems: Neo4j and OrientDB

provide efficient distributed storage and queries of

a graph of nodes with references among them.

• Object-oriented database systems: Object-oriented

DBMSs (e.g., Versant) also provide efficient

distributed storage of a graph of objects, and

materialize these objects as programming

language objects.

• Distributed object-oriented stores: Very similar to

object-oriented DBMSs, systems such as GemFire

distribute object graphs in-memory on multiple

servers.

These systems are a good choice for applications that

must do fast and extensive reference-following,

especially where data fits in memory. Programming

language integration is also valuable. Unlike the

NoSQL systems, these systems generally provide

ACID transactions. Many of them provide horizontal

scaling for reference-following and distributed query

decomposition, as well. Due to space limitations,

however, we have omitted these systems from our

comparisons. The applications and the necessary

optimizations for scaling for these systems differ from

the systems we cover here, where key lookups and

simple operations predominate over reference-

following and complex object behavior. It is possible

these systems can scale on simple operations as well,

but that is a topic for a future paper, and would need

to be demonstrated through benchmarks.

Data warehousing database systems provide horizontal

scaling, but are also beyond the scope of this paper.

Data warehousing applications are different in

important ways:

• They perform complex queries that collect and

join information from many different tables.

• The ratio of reads to writes is high: that is, the

database is read-only or read-mostly.

There are existing systems for data warehousing that

scale well horizontally. Because the data is

infrequently updated, it is possible to organize or

replicate the database in ways that make scaling

possible.

1.3 Data Model Terminology

Unlike relational (SQL) DBMSs, the terminology used

by NoSQL data stores is often inconsistent. For the

purposes of this paper, we need a consistent way to

compare the data models and functionality.

All of the systems described here provide a way to

store scalar values, like numbers and strings, as well as

BLOBs. Some of them also provide a way to store

more complex nested or reference values. The systems

all store sets of attribute-value pairs, but use different

data structures, specifically:

• A “tuple” is a row in a relational table, where

attribute names are pre-defined in a schema, and

the values must be scalar. The values are

referenced by attribute name, as opposed to an

array or list, where they are referenced by ordinal

position.

• A “document” allows values to be nested

documents or lists as well as scalar values, and the

attribute names are dynamically defined for each

document at runtime. A document differs from a

tuple in that the attributes are not defined in a

global schema, and this wider range of values is

permitted.

• An “extensible record” is a hybrid between a tuple

and a document, where families of attributes are

defined in a schema, but new attributes can be

added (within an attribute family) on a per-record

basis. Attributes may be list-valued.

• An “object” is analogous to an object in

programming languages, but without the

procedural methods. Values may be references or

nested objects.
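The four structures above can be contrasted with ordinary Python values. This is only an analogy: the type names, field names, and the `add_attribute` helper are hypothetical, and real data stores enforce these rules in their own ways.

```python
from dataclasses import dataclass, field
from typing import NamedTuple, Optional


class PersonTuple(NamedTuple):
    """Tuple: attribute names fixed by a schema, scalar values only."""
    name: str
    age: int


# Document: attribute names chosen per document; values may nest.
person_document = {
    "name": "Ada",
    "phones": ["555-0100", "555-0101"],          # list value
    "address": {"city": "London", "zip": "N1"},  # nested document
}

# Extensible record: attribute families come from a schema, but each
# record may add new attributes inside a family.
FAMILIES = {"profile", "contact"}
person_record = {
    "profile": {"name": "Ada", "age": 36},
    "contact": {"email": "ada@example.com"},
}


def add_attribute(record: dict, family: str, name: str, value) -> None:
    """Add a per-record attribute, but only within a declared family."""
    if family not in FAMILIES:
        raise KeyError(f"unknown attribute family: {family}")
    record.setdefault(family, {})[name] = value


@dataclass
class PersonObject:
    """Object: data only (no methods); values may reference other objects."""
    name: str
    manager: Optional["PersonObject"] = None
    reports: list = field(default_factory=list)


row = PersonTuple(name="Ada", age=36)
add_attribute(person_record, "contact", "pager", "555-0199")
boss = PersonObject(name="Grace")
ada = PersonObject(name="Ada", manager=boss)
```

Adding `"pager"` succeeds because `"contact"` is a declared family, while an undeclared family is rejected; that is the sense in which an extensible record sits between a tuple and a document.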

1.4 Data Store Categories

In this paper, the data stores are grouped according to

their data model:

• Key-value Stores: These systems store values and

an index to find them, based on a programmer-

defined key.

• Document Stores: These systems store documents,

as just defined. The documents are indexed and a

simple query mechanism is provided.

• Extensible Record Stores: These systems store

extensible records that can be partitioned

vertically and horizontally across nodes. Some

papers call these “wide column stores”.

• Relational Databases: These systems store (and

index and query) tuples. The new RDBMSs that

provide horizontal scaling are covered in this

paper.

Data stores in these four categories are covered in the

next four sections, respectively. We will then

summarize and compare the systems.

2. KEY-VALUE STORES

The simplest data stores use a data model similar to the

popular memcached distributed in-memory cache, with

a single key-value index for all the data. We’ll call

these systems key-value stores. Unlike memcached,

these systems generally provide a persistence

mechanism and additional functionality as well:

replication, versioning, locking, transactions, sorting,

and/or other features. The client interface provides

inserts, deletes, and index lookups. Like memcached,

none of these systems offer secondary indices or keys.
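As a rough illustration of the client interface described above (inserts, deletes, and index lookups over a single key-value index), the following in-memory stand-in may help. The class and method names are assumptions for the sketch and do not correspond to any particular product's API.

```python
class KeyValueStore:
    """In-memory stand-in for a key-value store's client interface."""

    def __init__(self):
        self._index = {}  # the single key -> value index

    def insert(self, key, value):
        self._index[key] = value

    def lookup(self, key):
        return self._index.get(key)

    def delete(self, key):
        self._index.pop(key, None)


kv = KeyValueStore()
kv.insert("user:42", {"name": "Ada", "city": "London"})
found = kv.lookup("user:42")

# With no secondary index, finding records by a non-key attribute means
# scanning every value (here done by peeking at the private index).
londoners = [k for k, v in kv._index.items() if v.get("city") == "London"]
kv.delete("user:42")
```

The scan in the last step is what the absence of secondary indices implies in practice: anything other than a lookup by key has to be handled by the application.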

2.1 Project Voldemort

Project Voldemort is an advanced key-value store,

written in Java. It is open source, with substantial

contributions from LinkedIn. Voldemort provides

multi-version concurrency control (MVCC) for

updates. It updates replicas asynchronously, so it does

not guarantee consistent data. However, it can

guarantee an up-to-date view if you read a majority of

replicas.
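Why reading a majority of replicas can give an up-to-date view is easiest to see in a toy model. The sketch below is not Voldemort's implementation: it assumes three replicas, plain integer version numbers, and (importantly) that each write has itself reached a majority of replicas before it is acknowledged.

```python
from itertools import combinations

N = 3                 # number of replicas (illustrative)
MAJORITY = N // 2 + 1

# Each replica maps key -> (version, value). Integer versions are a
# simplification for this sketch.
replicas = [{} for _ in range(N)]


def write(key, value, version, targets):
    """Apply a write to the replicas in `targets`; the rest lag behind."""
    for i in targets:
        replicas[i][key] = (version, value)


def read_majority(key, targets):
    """Read from a majority of replicas and keep the newest version seen."""
    assert len(targets) >= MAJORITY, "must contact a majority"
    answers = [replicas[i][key] for i in targets if key in replicas[i]]
    return max(answers, key=lambda a: a[0])[1] if answers else None


write("cart", ["book"], version=1, targets=[0, 1, 2])
write("cart", ["book", "pen"], version=2, targets=[0, 1])  # replica 2 is stale

# Any two majorities of 3 replicas share at least one member, so every
# majority read sees version 2.
results = [read_majority("cart", pair) for pair in combinations(range(N), 2)]
```

Every pair of replicas overlaps the pair that accepted version 2, so each majority read returns the newer cart even though one replica still holds the old one.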

Voldemort supports optimistic locking for consistent

multi-record updates: if updates conflict with any other

process, they can be backed out. Vector clocks, as

used in Dynamo [3], provide an ordering on versions.

You can also specify which version you want to

update, for the put and delete operations.
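The ordering that vector clocks provide can be sketched in a few lines. This is the general technique rather than Voldemort's own code; the function names and node labels are assumptions, and a clock is represented simply as a dictionary from node name to counter.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a: dict, b: dict) -> bool:
    """True if version `a` has seen everything that version `b` has."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    """Order two versions: 'equal', 'after', 'before', or 'concurrent'."""
    if a == b:
        return "equal"
    if descends(a, b):
        return "after"
    if descends(b, a):
        return "before"
    return "concurrent"  # neither saw the other's update: a conflict


v1 = increment({}, "node-a")     # {"node-a": 1}
v2 = increment(v1, "node-a")     # {"node-a": 2}, supersedes v1
v3 = increment(v1, "node-b")     # {"node-a": 1, "node-b": 1}, a sibling of v2
```

Here `v2` and `v3` both derive from `v1` but were written on different nodes, so neither descends from the other. A store using optimistic locking would treat that as a conflict to be backed out or reconciled.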

Voldemort supports automatic sharding of data.

Consistent hashing is used to distribute data around a

ring of nodes: data hashed to node K is replicated on

nodes K+1 … K+n, where n is the desired number of

extra copies (often n=1). Using a good sharding

technique, there should be many more “virtual” nodes

than physical nodes (servers). Once data partitioning

is set up, its operation is transparent. Nodes can be

added or removed from a database cluster, and the

system adapts automatically. Voldemort automatically

detects and recovers failed nodes.
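The ring placement described above, with virtual nodes and n extra copies on the following nodes, might look roughly like this. It is a sketch of the general consistent-hashing technique, not Voldemort's code: the `HashRing` class, the use of MD5, the eight virtual nodes per server, and the server names are all assumptions.

```python
import bisect
import hashlib


def ring_position(label: str) -> int:
    """Map a string to a position on the ring (MD5 chosen only for the sketch)."""
    return int(hashlib.md5(label.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, servers, vnodes_per_server=8, extra_copies=1):
        self.extra_copies = extra_copies
        self.ring = []  # sorted list of (position, server)
        for server in servers:
            self.add(server, vnodes_per_server)

    def add(self, server, vnodes_per_server=8):
        for i in range(vnodes_per_server):
            bisect.insort(self.ring, (ring_position(f"{server}#{i}"), server))

    def remove(self, server):
        self.ring = [entry for entry in self.ring if entry[1] != server]

    def servers_for(self, key):
        """Owner of `key` plus the next distinct servers clockwise on the ring."""
        distinct = {server for _, server in self.ring}
        wanted = min(1 + self.extra_copies, len(distinct))
        start = bisect.bisect(self.ring, (ring_position(key), ""))
        chosen = []
        for step in range(len(self.ring)):
            server = self.ring[(start + step) % len(self.ring)][1]
            if server not in chosen:
                chosen.append(server)
            if len(chosen) == wanted:
                break
        return chosen


ring = HashRing(["s1", "s2", "s3"])
keys = [f"key-{i}" for i in range(200)]
before = {k: ring.servers_for(k)[0] for k in keys}
ring.add("s4")
after = {k: ring.servers_for(k)[0] for k in keys}
moved = [k for k in keys if before[k] != after[k]]  # every moved key now lives on s4
```

Adding a server only takes keys away from existing owners and hands them to the newcomer; no key moves between the old servers. Spreading each server over several virtual nodes keeps those handovers reasonably even.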
