However, the systems differ in how much they give up.
For example, most of the systems call themselves
“eventually consistent”, meaning that updates are
eventually propagated to all nodes, but many of them
provide mechanisms for some degree of consistency,
such as multi-version concurrency control (MVCC).
Proponents of NoSQL often cite Eric Brewer’s CAP
theorem [4], which states that a system can have only
two out of three of the following properties:
consistency, availability, and partition-tolerance. The
NoSQL systems generally give up consistency.
However, the trade-offs are complex, as we will see.
New relational DBMSs have also been introduced to
provide better horizontal scaling for OLTP, when
compared to traditional RDBMSs. After examining
the NoSQL systems, we will look at these SQL
systems and compare the strengths of the approaches.
The SQL systems strive to provide horizontal
scalability without abandoning SQL and ACID
transactions. We will discuss the trade-offs here.
In this paper, we will refer to both the new SQL and
NoSQL systems as data stores, since the term
“database system” is widely used to refer to traditional
DBMSs. However, we will still use the term
“database” to refer to the stored data in these systems.
All of the data stores have some administrative unit
that you would call a database: data may be stored in
one file, or in a directory, or via some other
mechanism that defines the scope of data used by a
group of applications. Each database is an island unto
itself, even if the database is partitioned and distributed
over multiple machines: there is no “federated
database” concept in these systems (as with some
relational and object-oriented databases), allowing
multiple separately-administered databases to appear
as one. Most of the systems allow horizontal
partitioning of data, storing records on different servers
according to some key; this is called “sharding”. Some
of the systems also allow vertical partitioning, where
parts of a single record are stored on different servers.
1.1 Scope of this Paper
Before proceeding, some clarification is needed in
defining “horizontal scalability” and “simple
operations”. These define the focus of this paper.
By “simple operations”, we refer to key lookups, reads
and writes of one record or a small number of records.
This is in contrast to complex queries or joins, read-
mostly access, or other application loads. With the
advent of the web, especially Web 2.0 sites where
millions of users may both read and write data,
scalability for simple database operations has become
more important. For example, applications may search
and update multi-server databases of electronic mail,
personal profiles, web postings, wikis, customer
records, online dating records, classified ads, and many
other kinds of data. These all generally fit the
definition of “simple operation” applications: reading
or writing a small number of related records in each
operation.
The term “horizontal scalability” means the ability to
distribute both the data and the load of these simple
operations over many servers, with no RAM or disk
shared among the servers. Horizontal scaling differs
from “vertical” scaling, where a database system
utilizes many cores and/or CPUs that share RAM and
disks. Some of the systems we describe provide both
vertical and horizontal scalability, and the effective use
of multiple cores is important, but our main focus is on
horizontal scalability, because the number of cores that
can share memory is limited, and horizontal scaling
generally proves less expensive, using commodity
servers. Note that horizontal and vertical partitioning
are not related to horizontal and vertical scaling,
except that they are both useful for horizontal scaling.
1.2 Systems Beyond our Scope
Some authors have used a broad definition of NoSQL,
including any database system that is not relational.
Specifically, they include:
• Graph database systems: Neo4j and OrientDB
provide efficient distributed storage and queries of
a graph of nodes with references among them.
• Object-oriented database systems: Object-oriented
DBMSs (e.g., Versant) also provide efficient
distributed storage of a graph of objects, and
materialize these objects as programming
language objects.
• Distributed object-oriented stores: Very similar to
object-oriented DBMSs, systems such as GemFire
distribute object graphs in-memory on multiple
servers.
These systems are a good choice for applications that
must do fast and extensive reference-following,
especially where data fits in memory. Programming
language integration is also valuable. Unlike the
NoSQL systems, these systems generally provide
ACID transactions. Many of them provide horizontal
scaling for reference-following and distributed query
decomposition, as well. Due to space limitations,
however, we have omitted these systems from our
comparisons. The applications and the necessary
optimizations for scaling for these systems differ from
the systems we cover here, where key lookups and
simple operations predominate over reference-
following and complex object behavior. It is possible
these systems can scale on simple operations as well,
but that is a topic for a future paper, and proof through
benchmarks.