Returning transactions to distributed data stores

Principles for the next generation of NoSQL databases

By David Rosenthal and Stephen Pimentel

Rise of NoSQL

Database technologies are undergoing rapid evolution, with new approaches being actively explored after decades of relative stability. As late as 2008, the term “NoSQL”  barely existed and relational databases were both commercially dominant and entrenched in the developer community. Since then NoSQL systems have rapidly gained prominence and early systems such as Google’s Bigtable and Amazon’s Dynamo have inspired dozens of new databases (HBase, Cassandra, Voldemort, MongoDB, etc.) that fall under the NoSQL umbrella.

The first generation of NoSQL databases aimed to achieve the dual goals of fault tolerance and horizontal scalability on clusters of commodity hardware There are now a variety of NoSQL systems available that, at their best, achieve these goals. Unfortunately, the cost for these benefits is high: limited data model flexibility and extensibility, and weak guarantees for applications due to the lack of multi-statement (global) transactions.

The shadow of the CAP Theorem

This first generation of NoSQL databases was designed in the shadow of Brewer’s CAP Theorem . In 2000, Eric Brewer conjectured (and Gilbert and Lynchlater proved) that a distributed system could simultaneously provide at most two out of three advantageous properties:

  • Consistency: A read sees all previously completed writes.
  • Availability: Reads and writes are always possible.
  • Partition tolerance: Guaranteed properties are maintained even when network failures prevent some machines from communicating with others.

The theorem suggests three design options: CPAP, and CA. In practice, for a decade after its formulation, CAP was interpreted as advocating AP systems (that is, sacrificing Consistency.) For example, in a paper explaining the design decisions made for the Dynamo database, Werner Vogels, the CTO of Amazon.com, wrote that  “data inconsistency in large-scale reliable distributed systems has to be tolerated” to obtain sufficient performance and availability.

Indeed, many NoSQL systems adopted similar logic and, in place of strong consistency, adopted a much weaker model called  eventual consistency . Eventual consistency was considered a necessary evil, justified as an engineering tradeoff necessary to deliver the other major goals of NoSQL. Even NoSQL systems that chose to stick with stronger consistency (e.g. HBase) decided to sacrifice ACID transactions— apowerful capability that has been available in the RDBMS world for decades. It is this missing capability that limits each database to supporting a single, limited data model.

NoSQL data modeling

Each NoSQL database implements its own simple data model such as a graph, document, column-family, or key-value store. Of course, different data models work better for different use cases, different languages, different applications, etc. All are simple enough that real-world applications will want to build up richer relations, indexes, or pointers within the simple structure of the base data model. Many application developers would even like to use different data models for different types of data. Unfortunately, application developers have no good way use a NoSQL system to support multiple data models, or to build the abstractions needed to extend the base data model.

The key capability that the first generation of NoSQL systems lacks is global ACID transactions. Though many NoSQL systems claim support of ACID transactions, they are almost never referring to globalACID transactions that allow multiple arbitrary operations in a single transaction. The local ACID transactions that they provide are better than nothing, but are fundamentally unable to enforce rules, relationships, or constraints between multiple pieces of data—the key to enabling strong abstractions.

Revisiting the CAP theorem

Experience with these challenges has led some of the original thought-leaders of the NoSQL movement to re-examine the CAP theorem and re-assess the space of realistic engineering possibilities. In 2010, Vogels wrote that it is indeed possible to provide strong consistency and that Amazon would add such consistency as an option in their SimpleDB product. He warned that “achieving strict consistency can come at a cost in update or read latency, and may result in lower throughput.” Rather than claiming that data inconsistency simply “has to be tolerated,” he now advised that benefits of strong consistency must be balanced against performance costs. He did not, however, attempt to characterize the magnitude of these costs.

In 2012, Brewer wrote that the CAP theorem has been widely misunderstood. In particular, he noted that “the ‘2 of 3’ formulation was always misleading” and that “CAP prohibits only a tiny part of the design space: perfect availability and consistency in the presence of partitions, which are rare.” This point is fundamental because the CAP notion of Availability actually refers to a property called “perfect availability”: that reads and writes are always possible from every machine, even if it is partitioned from the network.

This property is very different from the “availability” of the database as a whole to a client. Reconsideration of the design space leads to the surprising conclusion that sacrificing CAP Theorem Availability does not exclude building a highly available database. By keeping multiple replicas of database state on multiple machines, a Consistent database can stay available to clients even when some replicas are down. Even better, with Consistency maintained, the possibility of supporting global transactions emerges.

Return to ACID

As developers have gained experience working with AP systems and with CP systems without transactions, they have come to understand the heavy engineering cost of working around these weaker guarantees. This cost is leading some distributed database designers to reconsider CP systems with global transactions.

Google exemplifies this trend with their new Spanner database, a CP database with global transactions intended to replace their first-generation NoSQL database, Bigtable, across a wide range of applications. Spanner is Google’s first major distributed database not built on top of Bigtable and supports the same multi-datacenter operation.

Internally, Google “consistently received complaints from users that Bigtable can be difficult to use”. In particular, “the lack of cross-row [global] transactions in Bigtable led to frequent complaints.” As a result, the designers of Spanner now believe that “it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions.”

Spanner has been widely noted in the NoSQL field because it serves as an “existence proof” that distributed databases providing global transactions at scale are feasible. Spanner further demonstrates that a distributed database can remain highly available over a broad range of failures without supporting Availability in the CAP sense.

FoundationDB

FoundationDB is a NoSQL database that uses a distributed design and presents a single logical ordered-key-value data model. Unlike many other NoSQL databases, FoundationDB presents a single consistent state and supports global transactions.

Like all CP systems, FoundationDB chooses C over A during network partitions; when multiple machines or datacenters are unable to communicate, some of them will be unable to execute writes. Nevertheless, in a wide variety of real-world failure-modes, the database and the application using it will remain up. Leader election algorithms and data replication avoid a single point of failure. To achieve this during a partition, FoundationDB needs to determine which side of the partition should continue to accept reads and writes. To avoid a “split brain” scenario (where each side of a network partition thinks it is the authority) FoundationDB uses an odd-numbered group of coordination servers. Using the Paxos algorithm, FoundationDB determines which partition contains a majority of these coordination servers and only allows that partition to remain responsive to writes.

Of course, the logic to handle consistency and global transactions does create some overhead that, as Vogels’ noted in his 2010 post, imposes costs in latencies and throughput. FoundationDB has sought to both measure and reduce these costs. During a benchmark on a 24-node cluster with a workload of cross-node multi-statement global transactions, FoundationDB uses less than 10% of total CPU capacity to support those guarantees.

A new generation of NoSQL

Systems such as Spanner and FoundationDB suggest an approach for a new generation of NoSQL. Like the first generation, the new systems will employ shared-nothing, distributed architectures with fault tolerance and scalability. However, rather than default to designs with weak consistency, the new generation will aggressively explore the strong-consistency region of the design space actually permitted by the CAP theorem and with it the possibility of true global transactions.

Though there is a strong correlation between global transactions and a relational data model in currently implemented systems, there is no deep reason for that correlation and every reason to bring global transactions to NoSQL as well. The power of this combination is that it supports abstractions and extensions to the basic NoSQL data models that make building applications much easier. To achieve this design potential, the new generation should follow three broad principles:

1.  Maintain what works about NoSQL: Maintain the things that make NoSQL great: distributed design, fault tolerance, easy scaling, and a simple, flexible base data model. A storage system that offers these properties and can handle both random-access and streaming workloads efficiently could support a huge range of types of data and applications.

2.  Leverage our modern understanding of CAP to support global transactions: Global transactions can be implemented in a distributed, scalable manner as demonstrated by Spanner and FoundationDB. The benefits to applications and application developers are dramatic and greatly outweigh the theoretical performance penalty.

3.  Build richer data models as abstractions: Extend the base data models of NoSQL and build richer data models using the strength of global transactions. This approach allows a single database to support multiple data models enabling applications to select the models best suited to their problem. This will provide true data-model flexibility within a single database system.

By following the above principles, the next generation of NoSQL databases will provide a solid foundation that supports a vibrant ecosystem of data models, frameworks, and applications.

O’Reilly Strata Conference — Strata brings together the leading minds in data science and big data — decision makers and practitioners driving the future of their businesses and technologies. Get the skills, tools, and strategies you need to make data work.Strata Rx Health Data Conference: September 25-27 | Boston, MA
Strata + Hadoop World: October 28-30 | New York, NY
Strata in London: November 15-17 | London, England
tags: , ,