Re-engineering the data stack for speed

Acunu is taking a new approach to data storage. Here's what that means for developers.

Big data creates a number of storage and processing challenges for developers: efficiency, complexity, and cost, among others. London-based data storage startup Acunu is tackling these issues by re-engineering the data stack and taking a new approach to disk storage.

In the following interview, Acunu CEO Tim Moreton discusses the new techniques and how they might benefit developers.

Why do we need to re-engineer the data stack?

Tim Moreton: New workloads mean we must collect, store and serve large volumes of data quickly and cheaply. This poses two challenges. The first is a distributed systems challenge: how do you scale a database across many cheap commodity machines and deal with replication, node failures, and so on? There are now many tools that provide a good answer to this; Apache Cassandra is one. The second challenge is this: once you’ve decided on the node in the cluster where you’re going to read or write some data, how do you do that efficiently? That’s the challenge we’re trying to solve.

Most distributed databases see solving this problem as outside their domain: they support pluggable storage backends, and often use embedded tools like Berkeley DB. Cassandra and HBase go further and implement their own storage engines based on Google BigTable; these are essentially file systems that run in userspace as part of their Java codebases.
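
"Pluggable" here just means the database delegates its node-local reads and writes to a swappable component behind a narrow interface. Here is a minimal Python sketch of that idea; all of the names are hypothetical, invented for illustration:

    from abc import ABC, abstractmethod

    class StorageBackend(ABC):
        """The narrow interface a database needs from its storage engine."""

        @abstractmethod
        def put(self, key, value): ...

        @abstractmethod
        def get(self, key): ...

    class InMemoryBackend(StorageBackend):
        """Trivial stand-in; a real backend might wrap Berkeley DB or a
        purpose-built engine like the one described below."""

        def __init__(self):
            self._data = {}

        def put(self, key, value):
            self._data[key] = value

        def get(self, key):
            return self._data.get(key)

    class Database:
        """Handles distribution and queries; delegates node-local storage
        to whatever backend is plugged in."""

        def __init__(self, backend):
            self.backend = backend

    # Swap storage engines without touching the database logic:
    db = Database(InMemoryBackend())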

The problem is that underneath any of these sits a storage stack that hasn’t changed much in 20 years. The workloads look different than they did 20 years ago, and the hardware looks very different. So, we built the Acunu Storage Core, an open-source Linux kernel module that contains optimizations and data structures that let you make better use of the commodity hardware you already have.

It offers a new storage interface, where keys have any number of dimensions, and values can be very small or very large. Whole ranges can be queried, and large values streamed in and out. It’s designed to be just general-purpose enough to model simple key-value stores, BigTable data models like Cassandra’s, Redis-style data structures, graphs, and others.
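
To make that interface concrete, here is a toy, in-memory Python sketch of the idea: keys are tuples of arbitrary dimension, and both point lookups and ordered range queries are supported. This is purely illustrative; the real Storage Core is a kernel module with its own language bindings, and every name below is hypothetical:

    import bisect

    class DimensionalStore:
        """Toy sketch of a store keyed by tuples of any dimension, with
        point gets and ordered range queries. Illustrative only; not
        Acunu's actual interface."""

        def __init__(self):
            self._keys = []    # key tuples, kept sorted
            self._values = {}  # key tuple -> value (small or large)

        def put(self, key, value):
            if key not in self._values:
                bisect.insort(self._keys, key)
            self._values[key] = value

        def get(self, key):
            return self._values.get(key)

        def range(self, lo, hi):
            """Yield (key, value) pairs with lo <= key < hi, in key order."""
            start = bisect.bisect_left(self._keys, lo)
            stop = bisect.bisect_left(self._keys, hi)
            for k in self._keys[start:stop]:
                yield k, self._values[k]

    store = DimensionalStore()
    store.put(("sensor-42", 1311600000), b"temp=21.3")
    store.put(("sensor-42", 1311600060), b"temp=21.5")

    # All readings for one sensor within a time window:
    for key, value in store.range(("sensor-42", 1311600000),
                                  ("sensor-42", 1311700000)):
        print(key, value)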

Why would big data stores need versioning?

Tim Moreton: There are many possible reasons, but we’re focusing on two. The first is whole-cluster backup. Service outages like Amazon’s, and Google having to restore some Gmail data from tape, remind us that even though our datasets may look different now, backup is still important. Acunu takes snapshots at intervals across a whole cluster, and you can copy these “checkpoints” off the cluster with little impact on its performance. Or, if you mess something up, you can roll back a Cassandra ColumnFamily to a previous point in time.
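
As a rough illustration of the checkpoint-and-rollback idea, consider this Python sketch. It is not Acunu’s implementation, and the names are made up; a real system would snapshot via copy-on-write rather than deep-copying the data:

    import copy
    import time

    class CheckpointedStore:
        """Sketch of interval checkpoints plus point-in-time rollback.
        Illustrative only: a real system snapshots via copy-on-write
        instead of deep-copying the data."""

        def __init__(self):
            self._data = {}
            self._checkpoints = {}  # timestamp -> frozen copy

        def put(self, key, value):
            self._data[key] = value

        def get(self, key):
            return self._data.get(key)

        def checkpoint(self):
            ts = time.time()
            self._checkpoints[ts] = copy.deepcopy(self._data)
            return ts

        def rollback(self, ts):
            """Restore the store to how it looked at checkpoint `ts`."""
            self._data = copy.deepcopy(self._checkpoints[ts])

    store = CheckpointedStore()
    store.put("row:1", "good data")
    ts = store.checkpoint()
    store.put("row:1", "oops, bad write")
    store.rollback(ts)
    assert store.get("row:1") == "good data"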

Speeding up your dev/test cycle is the second reason for versioning. Say you have a Cassandra application serving real users. If you want to develop a new feature in your app that changes what data you store or how you use it, how do you know it’s going to work? Most people have separate test clusters and craft test data; others experiment on a small portion of their users. Our versioning lets you take a clone of your production ColumnFamily and give it to a developer or an automated test run. We’re working on making sure these clones are entirely isolated from the production version, so whatever you do to a clone won’t affect your real users. This lets you try out new code on the whole dataset, and when you’re confident the code works, you can throw the clone away. This speeds up the dev cycle and reduces the risk of putting new code into production.
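
Those clone semantics can be pictured as a copy-on-write layer: cloning is a constant-time operation that shares the parent’s data, and writes on either side diverge from the shared snapshot. Again, this is a hypothetical Python illustration, not Acunu’s code:

    class VersionedStore:
        """Copy-on-write sketch of clone isolation: a clone shares its
        parent's data until either side writes. Hypothetical
        illustration, not Acunu's implementation."""

        def __init__(self, base=None):
            self._base = base  # frozen parent layer, if any
            self._writes = {}  # overlay of writes since the clone

        def put(self, key, value):
            self._writes[key] = value

        def get(self, key):
            if key in self._writes:
                return self._writes[key]
            return self._base.get(key) if self._base else None

        def clone(self):
            """O(1): freeze the current layer; parent and clone become
            separate overlays on top of it."""
            frozen = VersionedStore(base=self._base)
            frozen._writes = self._writes
            self._base, self._writes = frozen, {}
            return VersionedStore(base=frozen)

    production = VersionedStore()
    production.put("user:1", "alice")

    test = production.clone()        # instant; no data copied
    test.put("user:1", "alice-v2")   # experiment on the clone...
    production.put("user:2", "bob")  # ...while production keeps serving

    assert production.get("user:1") == "alice"  # production untouched
    assert test.get("user:2") is None           # clone fully isolated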

What kinds of opportunities do you see this speed boost creating?

Tim Moreton: The decisions around what data gets collected and analyzed are often economic. Cassandra and Hadoop help to make new data problems tractable, but we can do more.

In concrete terms, if you have a Cassandra cluster, and you’re continuously collecting lots of log entries or sensor data, and you want to do real-time analytics on that, then our benchmarking shows that Acunu delivers those results up to 50 times faster than vanilla Cassandra. That means you can process 50 times the amount of data, or work at greatly increased detail, or do the same work while buying and managing much less hardware. And this is comparing Acunu against Cassandra, which is in our view the best-of-breed datastore for these types of workloads.

Do you plan to implement speedups for other database systems?

Tim Moreton: Absolutely. Although the first release focuses on Cassandra and an S3-compatible store, we have already ported Voldemort and memcached. The Acunu Storage Core and its language bindings will be open source, and we are actively working with developers on several other databases. Cassandra already gives us good support for a lot of the Hadoop framework. HBase is on the cards, but it’s a trickier architectural fit since it sits above HDFS.

You’ll be able to interoperate between these various databases. For example, if you have an application that uses memcached, it can read and write the same data you access with Cassandra: you might ingest the data with Flume, then process it with Cassandra’s Hadoop or Pig integrations. We plan to let people use the right tools and interfaces for the job, without having to move or transform data between clusters.
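
One way to picture this is as several thin interfaces over one shared store. In the hypothetical Python sketch below, a memcached-style view and a BigTable-style column-family view read and write the same underlying data; the “row:column” key mapping is an assumption invented for the example:

    class SharedCore:
        """One underlying store, exposed through several interfaces."""

        def __init__(self):
            self.data = {}  # (row, column) -> value

    class ColumnFamilyView:
        """BigTable-style (row, column) interface over the core."""

        def __init__(self, core):
            self.core = core

        def insert(self, row, column, value):
            self.core.data[(row, column)] = value

        def get(self, row, column):
            return self.core.data.get((row, column))

    class MemcachedView:
        """Flat key-value interface over the same core; maps
        "row:column" keys onto the (row, column) space. The mapping is
        invented for this example."""

        def __init__(self, core):
            self.core = core

        def set(self, key, value):
            row, _, column = key.partition(":")
            self.core.data[(row, column)] = value

        def get(self, key):
            row, _, column = key.partition(":")
            return self.core.data.get((row, column))

    core = SharedCore()
    cf = ColumnFamilyView(core)
    mc = MemcachedView(core)

    cf.insert("user1", "email", "alice@example.com")      # write via one interface
    assert mc.get("user1:email") == "alice@example.com"   # read via another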

This interview was edited and condensed.
