The NoSQL movement

How to think about choosing a database.

In a conversation last year, Justin Sheehy, CTO of Basho, described NoSQL as a movement, rather than a technology. This description immediately felt right; I’ve never been comfortable talking about NoSQL, which, when taken literally, extends from the minimalist Berkeley DB (commercialized as Sleepycat, now owned by Oracle) to the big-iron HBase, with detours into software as fundamentally different as Neo4J (a graph database) and FluidDB (which defies description).

But what does it mean to say that NoSQL is a movement rather than a technology? We certainly don’t see picketers outside Oracle’s headquarters. Justin said succinctly that NoSQL is a movement for choice in database architecture. There is no single overarching technical theme; a single technology would belie the principles of the movement.

Think of the last 15 years of software development. We’ve gotten very good at building large, database-backed applications. Many of them are web applications, but even more of them aren’t. “Software architect” is a valid job description; it’s a position to which many aspire. But what do software architects do? They specify the high-level design of applications: the front end, the APIs, the middleware, the business logic — the back end? Well, maybe not.

Since the ’80s, the dominant back end of business systems has been a relational database, whether Oracle, SQL Server or DB2. That’s not much of an architectural choice. Those are all great products, but they’re essentially similar, as are all the other relational databases. And it’s remarkable that we’ve explored many architectural variations in the design of clients, front ends, and middleware, on a multitude of platforms and frameworks, but haven’t until recently questioned the architecture of the back end. Relational databases have been a given.

Many things have changed since the advent of relational databases:

  • We’re dealing with much more data. Although advances in storage capacity and CPU speed have allowed the databases to keep pace, we’re in a new era where size itself is an important part of the problem, and any significant database needs to be distributed.
  • We require sub-second responses to queries. In the ’80s, most database queries could run overnight as batch jobs. That’s no longer acceptable. While some analytic functions can still run as overnight batch jobs, we’ve seen the web evolve from static files to complex database-backed sites, and that requires sub-second response times for most queries.
  • We want applications to be up 24/7. Setting up redundant servers for static HTML files is easy; keeping the database behind a complex database-backed application replicated and available is another matter entirely.
  • We’re seeing many applications in which the database has to soak up data as fast as (or even much faster than) it processes queries: in a logging application or a distributed sensor application, writes can be much more frequent than reads. Batch-oriented ETL (extract, transform, and load) hasn’t disappeared, and won’t, but capturing high-speed data flows is increasingly important.
  • We’re frequently dealing with changing data or with unstructured data. The data we collect, and how we use it, grows over time in unpredictable ways. Unstructured data isn’t a particularly new feature of the data landscape, since unstructured data has always existed, but we’re increasingly unwilling to force a structure on data a priori.
  • We’re willing to sacrifice our sacred cows. We know that consistency and isolation and other properties are very valuable, of course. But so are some other things, like latency and availability and not losing data even if our primary server goes down. The challenges of modern applications make us realize that sometimes we might need to weaken one of these constraints in order to achieve another.

These changing requirements lead us to different tradeoffs and compromises when designing software. They require us to rethink what we require of a database, and to come up with answers aside from the relational databases that have served us well over the years. So let’s look at these requirements in somewhat more detail.

Size, response, availability

It’s a given that any modern application is going to be distributed. The size of modern datasets is only one reason for distribution, and not the most important. Modern applications (particularly web applications) have many concurrent users who demand reasonably snappy response. In their 2009 Velocity Conference talk, Performance Related Changes and their User Impact, Eric Schurman and Jake Brutlag showed results from independent research projects at Google and Microsoft. Both projects demonstrated that even imperceptibly small increases in response time cause users to move to another site; if response time is over a second, you’re losing a very measurable percentage of your traffic.

If you’re not building a web application — say you’re doing business analytics, with complex, time-consuming queries — the world has changed, and users now expect business analytics to run in something like real time. Maybe not the sub-second latency required for web users, but queries that run overnight are no longer acceptable. Queries that run while you go out for coffee are marginal. It’s not just a matter of convenience; the ability to run dozens or hundreds of queries per day changes the nature of the work you do. You can be more experimental: you can follow through on hunches and hints based on earlier queries. That kind of spontaneity was impossible when research went through the DBA at the data warehouse.

Whether you’re building a customer-facing application or doing internal analytics, scalability is a big issue. Vertical scalability (buy a bigger, faster machine) always runs into limits. Now that the laws of physics have stalled Intel-architecture clock speeds in the 3.5GHz range, those limits are more apparent than ever. Horizontal scalability (build a distributed system with more nodes) is the only way to scale indefinitely. You’re scaling horizontally even if you’re only buying single boxes: it’s been a long time since I’ve seen a server (or even a high-end desktop) that doesn’t sport at least four cores. Horizontal scalability is tougher when you’re scaling across racks of servers at a colocation facility, but don’t be deceived: that’s how scalability works in the 21st century, even on your laptop. Even in your cell phone. We need database technologies that aren’t just fast on single servers: they must also scale across multiple servers.

Modern applications also need to be highly available. That goes without saying, but think about how the meaning of “availability” has changed over the years. Not much more than a decade ago, a web application would have a single HTTP server that handed out static files. These applications might be data-driven, but “data-driven” meant that a batch job rebuilt the web site overnight, and user transactions were queued into a batch processing system, again for processing overnight. Keeping such a system running isn’t terribly difficult, and high availability barely touches the database: if the database is only engaged in batched rebuilds or overnight transaction processing, it can crash without doing damage. That’s the world for which relational databases were designed. In the ’80s, if your mainframe ran out of steam, you got a bigger one. If it crashed, you were down. But when databases became a living, breathing part of the application, availability became an issue. There is no way to make a single system highly available; as soon as any component fails, you’re toast. Highly available systems are, by nature, distributed systems.

If a distributed database is a given, the next question is how much work a distributed system will require. There are fundamentally two options: databases that have to be distributed manually, via sharding, and databases that are inherently distributed. Relational databases are split across multiple hosts by manual sharding: deciding how to partition the dataset based on some property of the data itself, for example, first names starting with A-K on one server, L-Z on another. A lot of thought goes into designing a sharding and replication strategy that doesn’t impair performance, while keeping the data relatively balanced between servers. There’s a third option that is essentially a hybrid: databases that are not inherently distributed, but that are designed so they can be partitioned easily. MongoDB is an example of a database that can be sharded easily (or even automatically); HBase, Riak, and Cassandra are all inherently distributed, with options to control how replication and distribution work.
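
To make that concrete, here is a minimal sketch of what manual, application-level sharding can look like. Everything in it (the shard map, the host names, the hashing scheme) is an illustrative assumption rather than any product’s API; inherently distributed databases do this routing, and the painful rebalancing when you add a shard, for you.

```python
# A minimal sketch of manual, application-level sharding. The shard map, host
# names, and hashing scheme are illustrative assumptions, not any product's API.
import hashlib

# Range-based sharding: partition customers by the first letter of the last name.
RANGE_SHARDS = [
    ("A", "K", "db-shard-1.example.com"),
    ("L", "Z", "db-shard-2.example.com"),
]

def shard_by_range(last_name: str) -> str:
    """Return the host that owns this last name under a simple A-K / L-Z split."""
    first = last_name[0].upper()
    for low, high, host in RANGE_SHARDS:
        if low <= first <= high:
            return host
    raise ValueError(f"no shard covers {last_name!r}")

# Hash-based sharding: spread keys evenly, at the cost of losing range queries.
# Note that adding a shard to this list reshuffles most keys -- one reason
# rebalancing is the hard part of doing this by hand.
HASH_SHARDS = ["db-shard-1.example.com",
               "db-shard-2.example.com",
               "db-shard-3.example.com"]

def shard_by_hash(key: str) -> str:
    """Return the host for a key by hashing it onto a fixed list of shards."""
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return HASH_SHARDS[digest % len(HASH_SHARDS)]
```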

What database choices are viable when you need good interactive response? There are two separate issues: read latency and write latency. For reasonably simple queries on a database with well-designed indexes, almost any modern database can give decent read latency, even at reasonably large scale. Similarly, just about all modern databases claim to be able to keep up with writes at high speed. Most of these databases, including HBase, Cassandra, Riak, and CouchDB, write data immediately to an append-only file, which is an extremely efficient operation. As a result, writes are often significantly faster than reads.
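
To illustrate why appends are cheap, here is a toy append-only log in Python. It is only a sketch of the general idea, not how HBase, Cassandra, Riak, or CouchDB actually implement their storage engines; the file name and record format are assumptions for the example.

```python
# A toy append-only log, purely to show why sequential appends are cheap:
# every write goes to the end of the file, with no seeking and no rewriting
# of existing records. Real storage engines batch, index, and compact on top
# of this idea.
import json
import time

class AppendOnlyLog:
    def __init__(self, path: str):
        # Append mode: each write is a sequential disk operation.
        self._file = open(path, "a", encoding="utf-8")

    def write(self, record: dict) -> None:
        entry = {"ts": time.time(), "data": record}
        self._file.write(json.dumps(entry) + "\n")
        self._file.flush()  # real systems fsync on their own schedule

if __name__ == "__main__":
    log = AppendOnlyLog("events.log")
    log.write({"sensor": "42", "temp_c": 21.7})
```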

Whether any particular database can deliver the performance you need depends on the nature of the application, and whether you’ve designed the application in a way that uses the database efficiently: in particular, the structure of queries, more than the structure of the data itself. Redis is an in-memory database with extremely fast response, for both read and write operations; but there are a number of tradeoffs. By default, data isn’t saved to disk, and is lost if the system crashes. You can configure Redis for durability, but at the cost of some performance. Redis is also limited in scalability; there’s some replication capability, but support for clusters is still coming. But if you want raw speed, and have a dataset that can fit into memory, Redis is a great choice.

It would be nice if there were benchmarks that covered database performance in a meaningful sense, but as the saying goes, “there are lies, damned lies, and benchmarks.” In particular, no small benchmark can properly duplicate a real test case for an application that might reasonably involve dozens (or hundreds) of servers.

Changing data and cheap lunches

NoSQL databases are frequently called “schemaless” because they don’t have the formal schema associated with relational databases. The lack of a formal schema, which typically has to be designed before any code is written, means that schemaless databases are a better fit for current software development practices, such as agile development. Starting from the simplest thing that could possibly work and iterating quickly in response to customer input doesn’t fit well with designing an all-encompassing data schema at the start of the project. It’s impossible to predict how data will be used, or what additional data you’ll need as the project unfolds. For example, many applications are now annotating their data with geographic information: latitudes and longitudes, addresses. That almost certainly wasn’t part of the initial data design.

How will the data we collect change in the future? Will we be collecting biometric information along with tweets and Foursquare checkins? Will music sites such as Last.FM and Spotify incorporate factors like blood pressure into their music selection algorithms? If you think these scenarios are futuristic, think about Twitter. When it started out, it just collected bare-bones information with each tweet: the tweet itself, the Twitter handle, a timestamp, and a few other bits. Over its five-year history, though, lots of metadata has been added. A tweet may be at most 140 characters, but a couple of kilobytes are actually sent to the server, and all of it is saved in the database. Up-front schema design is a poor fit in a world where data requirements are fluid.

In addition, modern applications frequently deal with unstructured data: blog posts, web pages, voice transcripts, and other data objects that are essentially text. O’Reilly maintains a substantial database of job listings for some internal research projects. The job descriptions are chunks of text in natural languages. They’re not unstructured in the sense that they can’t fit into a schema: you can easily create a JOBDESCRIPTION column in a table and stuff text strings into it. The problem is that knowing the data type and where it fits in the overall structure doesn’t help. What are the questions you’re likely to ask? Do you want to know about skills, certifications, the employer’s address, the employer’s industry? Those are all valid columns for a table, but you don’t know what you care about in advance; you won’t find equivalent information in each job description; and the only way to get from the text to the data is through various forms of pattern matching and classification. Doing the classification up front, so you could break a job listing down into skills, certifications, etc., would be a huge effort that would largely be wasted. The guys who work with this data recently had fits disambiguating “Apple Computer” from “apple orchard.” Would you even know this was a problem outside of a concrete research project based on a concrete question? If you’re just pre-populating an INDUSTRY column from raw data, would you notice that lots of computer industry jobs were leaking into fruit farming? A JOBDESCRIPTION column doesn’t hurt, but it doesn’t help much either; going further and trying to design a schema around the data you’ll find in the unstructured text definitely hurts. The kinds of questions you’re likely to ask have everything to do with the data itself, and little to do with that data’s relations to other data.

However, it’s really a mistake to say that NoSQL databases have no schema. In a document database, such as CouchDB or MongoDB, documents are key-value pairs. While you can add documents with differing sets of keys (missing keys or extra keys), or even add keys to documents over time, applications still must know that certain keys are present to query the database; indexes have to be set up to make searches efficient. The same thing applies to column-oriented databases, such as HBase and Cassandra. While any row may have as many columns as needed, some up-front thought has to go into what columns are needed to organize the data. In most applications, a NoSQL database will require less up-front planning, and offer more flexibility as the application evolves. As we’ll see, data design revolves more around the queries you want to ask than the domain objects that the data represents. It’s not a free lunch; possibly a cheap lunch, but not free.

What kinds of storage models do the more common NoSQL databases support? Redis is a relatively simple key-value store, but with a twist: values can be data structures (lists and sets), not just strings. It supplies operations for working directly with sets and lists (for example, union and intersection).
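
A small sketch of what that looks like through the redis-py client; the key names, and the assumption of a Redis server running on the default local port, are mine rather than anything Redis prescribes.

```python
# A sketch of Redis's "values are data structures" model via the redis-py
# client. Key names are illustrative; a Redis server on localhost:6379 is
# assumed to be running.
import redis

r = redis.Redis(host="localhost", port=6379)

# Sets: track which users viewed two different products.
r.sadd("viewers:product:1", "alice", "bob", "carol")
r.sadd("viewers:product:2", "bob", "dave")

# Set operations run inside the server, not in application code.
both = r.sinter("viewers:product:1", "viewers:product:2")    # users who viewed both
either = r.sunion("viewers:product:1", "viewers:product:2")  # users who viewed either

# Lists: a capped activity feed, newest items first.
r.lpush("feed:alice", "logged in", "viewed product 1")
r.ltrim("feed:alice", 0, 99)           # keep only the 100 most recent entries
recent = r.lrange("feed:alice", 0, 9)  # the ten newest entries
```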

CouchDB and MongoDB both store documents in JSON, a format originally designed for representing JavaScript objects but now supported in many languages. So on one hand, you can think of CouchDB and MongoDB as object databases; on the other, you can think of a JSON document as a list of key-value pairs. Any document can contain any set of keys, and any key can be associated with an arbitrarily complex value that is itself a JSON document. CouchDB queries are views, which are themselves documents in the database that specify searches. Views can be very complex, and can use a built-in MapReduce facility to process and summarize results. Similarly, MongoDB queries are JSON documents, specifying fields and values to match, and query results can be processed by a built-in MapReduce. To use either database effectively, you start by designing your views: what do you want to query, and how? Once you do that, it will become clear what keys are needed in your documents.
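
Here is a minimal sketch of that design process using MongoDB through the pymongo client. The database, collection, and field names are assumptions for the example; the point is that documents need not share a schema, queries are themselves documents, and the indexes follow from the queries you plan to run.

```python
# A minimal MongoDB sketch with pymongo. Database, collection, and field names
# are assumptions for the example; a MongoDB server on localhost:27017 is assumed.
from pymongo import MongoClient, ASCENDING

client = MongoClient("localhost", 27017)
jobs = client.research.job_listings

# Documents in the same collection need not share a schema.
jobs.insert_one({"title": "Data Engineer",
                 "skills": ["python", "hadoop"],
                 "location": {"city": "Portland", "state": "OR"}})
jobs.insert_one({"title": "Orchard Manager"})  # no skills, no location

# The query is itself a JSON-like document ...
portland_python_jobs = jobs.find({"location.city": "Portland",
                                  "skills": "python"})

# ... and the indexes you create follow from the queries you expect to run.
jobs.create_index([("location.city", ASCENDING)])
jobs.create_index([("skills", ASCENDING)])
```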

Riak can also be viewed as a document database, though with more flexibility about document types. It natively handles JSON, XML, and plain text, and a plug-in architecture allows you to add support for other document types. Searches “know about” the structure of JSON and XML documents. Like CouchDB, Riak incorporates MapReduce to perform complex queries efficiently.

Cassandra and HBase are usually called column-oriented databases, though a better term is a “sparse row store.” In these databases, the equivalent to a relational “table” is a set of rows, identified by a key. Each row consists of an unlimited number of columns; columns are essentially keys that let you look up values in the row. Columns can be added at any time, and columns that are unused in a given row don’t occupy any storage. NULLs don’t exist. And since columns are stored contiguously, and tend to have similar data, compression can be very efficient, and searches along a column are likewise efficient. HBase describes itself as a database that can store billions of rows with millions of columns.

How do you design a schema for a database like this? As with the document databases, your starting point should be the queries you’ll want to make. There are some radically different possibilities. Consider storing logs from a web server. You may want to look up the IP addresses that accessed each URL you serve. The URLs can be the primary key; each IP address can be a column. This approach will quickly generate thousands of unique columns, but that’s not a problem — and a single query, with no joins, gets you all the IP addresses that accessed a single URL. If some URLs are visited by many addresses, and some are only visited by a few, that’s no problem: remember that NULLs don’t exist. This design isn’t even conceivable in a relational database. You can’t have a table that doesn’t have a fixed number of columns.
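
A sketch of that design using the happybase HBase client follows. The table name, the column family, and a running HBase Thrift service are all assumptions for the example; the same row-key-plus-sparse-columns idea applies to Cassandra.

```python
# A sketch of the "URLs as row keys, IP addresses as columns" design, using the
# happybase HBase client. Table name, column family, and a running HBase Thrift
# server are assumptions for the example.
import happybase

connection = happybase.Connection("localhost")
# One-time setup (commented out): a table with one column family for visitors.
# connection.create_table("access_logs", {"ip": dict()})
table = connection.table("access_logs")

def record_hit(url: str, ip: str, timestamp: str) -> None:
    # The row key is the URL; each distinct IP address becomes a column in that
    # row. URLs never visited by a given IP simply lack that column, so there
    # are no NULLs and no wasted storage.
    table.put(url.encode(), {b"ip:" + ip.encode(): timestamp.encode()})

def ips_for(url: str):
    # A single row lookup -- no joins -- returns every IP that hit this URL.
    row = table.row(url.encode())
    return [column.split(b":", 1)[1].decode() for column in row]

record_hit("/index.html", "192.0.2.10", "2012-02-28T10:15:00")
record_hit("/index.html", "198.51.100.7", "2012-02-28T10:16:30")
print(ips_for("/index.html"))
```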

Now, let’s make it more complex: you’re writing an ecommerce application, and you’d like to access all the purchases that a given customer has made. The solution is similar. The column family is organized by customer ID (the primary key); you have columns for first name, last name, address, and all the normal customer information, plus as many columns as are needed for each purchase. In a relational database, this would probably involve several tables and joins. In the NoSQL databases, it’s a single lookup. Schema design doesn’t go away, but it changes: you think about the queries you’d like to execute, and how you can perform those efficiently.
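
Conceptually, the row for one customer might look like the following sketch; the column names and values are purely illustrative, and in HBase or Cassandra each “purchase:…” entry would be a real column written as orders come in.

```python
# An illustrative picture of one sparse row, keyed by customer ID: profile
# fields plus one column per purchase, all in the same row.
customer_row = {
    "profile:first_name": "Ada",
    "profile:last_name": "Lovelace",
    "profile:address": "12 Example St, Portland, OR",
    "purchase:2012-02-01:order-1001": "widget, qty 2, $19.98",
    "purchase:2012-02-14:order-1042": "gizmo, qty 1, $7.50",
}

# One key lookup returns the profile and the full purchase history together --
# the work a relational schema would spread across several tables and a join.
purchases = {col: val for col, val in customer_row.items()
             if col.startswith("purchase:")}
```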

This isn’t to say that there’s no value to normalization, just that data design starts from a different place. With a relational database, you start with the domain objects, and represent them in a way that guarantees that virtually any query can be expressed. But when you need to optimize performance, you look at the queries you actually perform, then merge tables to create longer rows, and do away with joins wherever possible. With the schemaless databases, whether we’re talking about data structure servers, document databases, or column stores, you go in the other direction: you start with the query, and use that to define your data objects.

The sacred cows

The ACID properties (atomicity, consistency, isolation, durability) have been drilled into our heads. But even these come into question as we start thinking seriously about database architecture. When a database is distributed, for instance, it becomes much more difficult to achieve the same kind of consistency or isolation that you can on a single machine. And the problem isn’t just that it’s “difficult”: achieving them ends up in direct conflict with some of the reasons to go distributed in the first place. It’s not that properties like these aren’t very important — they certainly are — but today’s software architects are discovering that they require the freedom to choose when a compromise might be worth it.

What about transactions, two-phase commit, and other mechanisms inherited from big iron legacy databases? If you’ve read almost any discussion of concurrent or distributed systems, you’ve heard that banking systems care a lot about consistency. What if you and your spouse withdraw money from the same account at the same time? Could you overdraw the account? That’s what ACID is supposed to prevent. But a few months ago, I was talking to someone who builds banking software, and he said “If you really waited for each transaction to be properly committed on a world-wide network of ATMs, transactions would take so long to complete that customers would walk away in frustration. What happens if you and your spouse withdraw money at the same time and overdraw the account? You both get the money; we fix it up later.”

This isn’t to say that bankers have discarded transactions, two-phase commit and other database techniques; they’re just smarter about it. In particular, they’re distinguishing between local consistency and absolutely global consistency. Gregor Hohpe’s classic article Starbucks Does Not Use Two-Phase Commit makes a similar point: in an asynchronous world, we have many strategies for dealing with transactional errors, including write-offs. None of these strategies are anything like two-phase commit. They don’t force the world into inflexible, serialized patterns.

The CAP theorem is more than a sacred cow; it’s a law of the database universe that can be expressed as “Consistency, Availability, Partition Tolerance: choose two.” But let’s rethink relational databases in light of this theorem. Relational databases have stressed consistency. The CAP theorem is really about distributed systems, and as we’ve seen, relational databases were developed when distributed systems were rare and exotic at best. If you needed more power, you bought a bigger mainframe. Availability isn’t an issue on a single server: if it’s up, it’s up, if it’s down, it’s down. And partition tolerance is meaningless when there’s nothing to partition. As we saw at the beginning of this article, distributed systems are a given for modern applications; you won’t be able to scale to the size and performance you need on a single box. So the CAP theorem is historically irrelevant to relational databases: they’re good at providing consistency, and they have been adapted to provide high availability with some success, but they are hard to partition without extreme effort or extreme cost.

Since partition tolerance is a fundamental requirement for distributed applications, it becomes a question of what to sacrifice: consistency or availability. There have been two approaches: Riak and Cassandra stress availability, while HBase has stressed consistency. With Cassandra and Riak, the tradeoff between consistency and availability is tunable. CouchDB and MongoDB are essentially single-headed databases, and from that standpoint, availability is a function of how long you can keep the hardware running. However, both have add-ons that can be used to build clusters. In a cluster, CouchDB and MongoDB are eventually consistent (like Riak and Cassandra); availability depends on what you do with the tools they provide. You need to set up sharding and replication, and use what’s essentially a proxy server to present a single interface to the cluster’s clients. BigCouch is an interesting effort to integrate clustering into CouchDB, making it more like Riak. Now that Cloudant has announced that it is merging BigCouch and CouchDB, we can expect to see clustering become part of the CouchDB core.
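
One common way to think about that tunability is the N/R/W model used by Dynamo-style stores such as Riak and Cassandra: N replicas hold each value, R of them must answer a read, and W must acknowledge a write. The sketch below is just that arithmetic; the specific values are illustrative, not anyone’s defaults, and real systems add wrinkles (hinted handoff, sloppy quorums) on top of it.

```python
# A sketch of the N/R/W arithmetic behind tunable consistency in Dynamo-style
# stores. Parameter names follow the usual convention: N replicas, R read
# quorum, W write quorum. The example values are illustrative only.
def reads_see_latest_write(n: int, r: int, w: int) -> bool:
    """If R + W > N, every read quorum overlaps every write quorum, so a
    successful read is guaranteed to include the latest successful write."""
    return r + w > n

# Favor consistency: quorum reads and writes across 3 replicas.
print(reads_see_latest_write(n=3, r=2, w=2))  # True -- but an operation fails
                                              # if 2 of the 3 replicas are down

# Favor availability: any single replica can answer.
print(reads_see_latest_write(n=3, r=1, w=1))  # False -- available while one
                                              # replica is up, but reads may be stale
```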

We’ve seen that absolute consistency isn’t a hard requirement for banks, nor is it the way we behave in our real-world interactions. Should we expect it of our software? Or do we care more about availability?

It depends. The consistency requirements of many social applications are very soft. You don’t need to get the correct number of Twitter or Facebook followers every time you log in. If you search, you probably don’t care if the results don’t contain the comments that were posted a few seconds ago. And if you’re willing to accept less-than-perfect consistency, you can make huge improvements in performance. In the world of big-data-backed web applications, with databases spread across hundreds (or potentially thousands) of nodes, the performance penalty of locking down a database while you add or modify a row is huge; if your application has frequent writes, you’re effectively serializing all the writes and losing the advantage of the distributed database. In practice, in an “eventually consistent” database, changes typically propagate to the nodes in tenths of a second; we’re not talking minutes or hours before the database arrives in a consistent state.

Given that we have all been battered with talk about “five nines” reliability, and given that it is a big problem for any significant site to be down, it seems clear that we should prioritize availability over consistency, right? The architectural decision isn’t so easy, though. There are many applications in which inconsistency must eventually be dealt with. If consistency isn’t guaranteed by the database, it becomes a problem that the application has to manage. When you choose availability over consistency, you’re potentially making your application more complex. With proper replication and failover strategies, a database designed for consistency (such as HBase) can probably deliver the availability you require; but this is another design tradeoff. Regardless of the database you’re using, more stringent reliability requirements will drive you toward exotic engineering. Only you can decide the right balance for your application. The point isn’t that any given decision is right or wrong, but that you can (and have to) choose, and that’s a good thing.

Other features

I’ve completed a survey of the major tradeoffs you need to think about in selecting a database for a modern big data application. But the major tradeoffs aren’t the only story. There are many database projects with interesting features. Here are some of the ideas and projects I find most interesting:

  • Scripting: Relational databases all come with some variation of the SQL language, which can be seen as a scripting language for data. In the non-relational world, a number of scripting languages are available. CouchDB and Riak support JavaScript, as does MongoDB. The Hadoop project has spawned several data scripting languages that are usable with HBase, including Pig and Hive. The Redis project is experimenting with integrating the Lua scripting language.
  • RESTful interfaces: CouchDB and Riak are unique in offering RESTful interfaces. These are interfaces based on HTTP and the architectural style elaborated in Roy Fielding’s doctoral dissertation and Restful Web Services. CouchDB goes so far as to serve as a web application framework. Riak also offers a more traditional protocol buffer interface, which is a better fit if you expect a high volume of small requests.
  • Graphs: Neo4J is a special-purpose database designed for maintaining large graphs: data in which the items are nodes, with edges representing the connections between them. Because graphs are extremely flexible data structures, a graph database can emulate any other kind of database.
  • SQL: I’ve been discussing the NoSQL movement, but SQL is a familiar language, and is always just around the corner. A couple of startups are working on adding SQL to Hadoop-based datastores: DrawnToScale (which focuses on low-latency, high-volume web applications) and Hadapt (which focuses on analytics and bringing data warehousing into the 20-teens). In a few years, will we be looking at hybrid databases that take advantage of both relational and non-relational models? Quite possibly.
  • Scientific data: Yet another direction comes from SciDB, a database project aimed at the largest scientific applications (particularly the Large Synoptic Survey Telescope). The storage model is based on multi-dimensional arrays. It is designed to scale to hundreds of petabytes of storage, collecting tens of terabytes per night. It’s still in the relatively early stages.
  • Hybrid architectures: NoSQL is really about architectural choice. And perhaps the biggest expression of architectural choice is a hybrid architecture: rather than using a single database technology, mixing and matching technologies to play to their strengths. I’ve seen a number of applications that use traditional relational databases for the portion of the data for which the relational model works well, and a non-relational database for the rest. For example, customer data could go into a relational database, linked to a non-relational database for unstructured data such as product reviews and recommendations. It’s all about flexibility. A hybrid architecture may be the best way to integrate “social” features into more traditional ecommerce sites.

These are only a few of the interesting ideas and projects that are floating around out there. Roughly a year ago, I counted a couple dozen non-relational database projects; I’m sure there are several times that number today. Don’t hesitate to add notes about your own projects in the comments.

In the end

In a conversation with Eben Hewitt, author of Cassandra: The Definitive Guide, Eben summarized what you need to think about when architecting the back end of a data-driven system. They’re the same issues software architects have been dealing with for years: you need to think about the whole ecosystem in which the application works; you need to consider your goals (Do you require high availability? Fault tolerance?); you need to consider support options; you need to isolate what will change over the life of the application, and separate that from what remains the same. The big difference is that now there are options; you don’t have to choose the relational model. There are other options for building large databases that scale horizontally, are highly available, and can deliver great performance to users. And these options, the databases that make up the NoSQL movement, can often achieve these goals with greater flexibility and lower cost.

It used to be said that nobody got fired for buying IBM. Then nobody got fired for buying Microsoft. Now, I suppose, nobody gets fired for buying Oracle. But just as the landscape changed for IBM and Microsoft, it’s shifting again, and even Oracle has a NoSQL solution. Rather than relational databases being the default, we’re moving into a world where developers are considering their architectural options, and deciding which products fit their application: how the databases fit into their programming model, whether they can scale in ways that make sense for the application, whether they have strong or relatively weak consistency requirements.

For years, the relational default has kept developers from understanding their real back-end requirements. The NoSQL movement has given us the opportunity to explore what we really require from our databases, and to find out what we already knew: there is no one-size-fits-all solution.
