Big data, small cluster

Finding new ways to shrink disk space for storing partitionable data.

1024px-30_Doradus_or_NGC_2070

Register for the free webcast, “Extending Cassandra with Doradus OLAP for High Performance Analytics,” which will be held July 29 at 9 a.m. PT.

Engineers at Dell were developing customer apps when they found that the query response times their customers were demanding — something on the order of seconds (in other words, the need to scan millions of objects/second) — required a new type of query engine. This led them on a four-year journey to create Doradus, one of Dell Software Group’s first open-source projects.

Doradus is a server framework that runs on top of Cassandra. To build Doradus, the team borrowed from several well-accepted paradigms. They used traditional OLAP techniques to allow data to be arranged into static, multidimensional cubes. They leveraged the vertical orientation and efficient compression of columnar databases. And, from the NoSQL world, they employed sharding. The result: a storage and query engine called Doradus OLAP that stores data up to 1M objects/second/node, providing nearly real-time data warehousing. This architecture also allows for extreme compression of the data, sometimes producing up to a 99% reduction in space usage.

This extremely dense storage means that data that once took multiple nodes can now be stored on a single node, allowing for fast queries without the expense of a large cluster. Because Doradus is built on top of Cassandra, the option to scale out is still there. This allows for sharding and replication, and also takes advantage of Cassandra’s failover features.

The strength of Doradus OLAP lies in statistical queries. The Lucene-based DQL query language allows for object queries, or, more powerfully, aggregate queries. Aggregate queries compute more than one metric in a single pass using powerful, multi-level grouping expressions. DQL is designed to search through memory very quickly, on the order of tens of millions of objects/second. And because query time is extremely regular, it’s great for ad hoc queries. Additionally, a graph-based data model and bi-directional links that can be applied to data within Doradus means support for graph-like queries, too.

Data in Doradus OLAP needs to be partitioned, and queries use that partitioning to enhance speed, which means time-stamped, streaming data works particularly well. The team also built Doradus to handle data from a variety of sources, and at a variety of ingestion rates, by allowing that data to be loaded in variable-sized batches.

In an O’Reilly webcast on July 29th, Randy Guck, a principal engineer at Dell, will explain how he and his team built Doradus. In the webcast, Guck will explain how Doradus OLAP achieves its extreme compression — walking viewers through an example of storing close to a billion objects in 2GB. He will also show how Doradus extends Cassandra, provide query examples, and more. Go here for additional details and to register.

This post is part of a collaboration between O’Reilly, Dell, and Intel. See our statement of editorial independence.

Cropped image on article and category pages via Wikimedia Commons.

tags: , , , , , ,