Need speed for big data? Think in-memory data management

We're launching an investigation into in-memory data technologies.

By Ben Lorica and Roger Magoulas

In a forthcoming report we will highlight technologies and solutions that take advantage of the decline in prices of RAM, the popularity of distributed and cloud computing systems, and the need for faster queries on large, distributed data stores. Established technology companies have had interesting offerings, but what initially caught our attention were open source projects that started gaining traction last year.

One demand we frequently hear about is for tools that support interactive queries. Faster query response times translate to more engaged and productive analysts, and to real-time reports. Over the past two years several in-memory solutions have emerged that deliver 5X-100X faster response times. A recent paper from Microsoft Research noted that even in this era of big data and Hadoop, many MapReduce jobs fit in the memory of a single server. To scale to extremely large datasets, several new systems use a combination of distributed computing (in-memory grids), compression, and columnar storage technologies.
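To make the combination of columnar storage and compression a little more concrete, here is a minimal sketch in plain Python. It is an illustration only, not how any of the systems named in this post are implemented: the sample rows and the helper names `to_columnar` and `total_sales` are made up for this example. It stores each field as its own packed array, dictionary-encodes the repeated strings, and answers a query by scanning only the two columns it needs.

```python
from array import array

# Hypothetical row-oriented records: (region, sales). Sample data for illustration only.
rows = [("east", 120), ("west", 80), ("east", 200), ("north", 50), ("west", 70)]

def to_columnar(rows):
    """Split rows into a dictionary-encoded region column and a packed sales column."""
    dictionary = {}            # region string -> small integer code
    region_codes = array("H")  # one unsigned 16-bit code per row
    sales = array("q")         # one signed 64-bit integer per row
    for region, amount in rows:
        # Assign the next unused code the first time a region is seen.
        code = dictionary.setdefault(region, len(dictionary))
        region_codes.append(code)
        sales.append(amount)
    return dictionary, region_codes, sales

def total_sales(dictionary, region_codes, sales, region):
    """Sum sales for one region by scanning only the two columns involved."""
    code = dictionary.get(region)
    if code is None:
        return 0  # region never appears in the data
    return sum(s for c, s in zip(region_codes, sales) if c == code)

dictionary, region_codes, sales = to_columnar(rows)
print(total_sales(dictionary, region_codes, sales, "east"))  # prints 320
```

Each repeated string is stored once and every row holds only a two-byte code, which is the basic idea behind dictionary compression. Real systems typically layer further encodings and distribute the columns across machines.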

Another interesting aspect of in-memory technologies is that they seem to be everywhere these days. We’re looking at tools aimed at analysts (Tableau, Qlikview, Tibco Spotfire, Platfora), databases that target specific workloads or data types (VoltDB, SAP HANA, Hekaton, Redis, Druid, Kognitio, and Yarcdata), frameworks for analytics (Spark/Shark, GraphLab, GridGain, Asterix/Hyracks), and the data center (RAMCloud, memory locality).

We’ll be talking to companies and hackers to get a sense of how in-memory solutions fit into their planning. Along these lines, we would love to hear what you think about the rise of these technologies, as well as applications, companies and projects we should look at. Feel free to reach out to us on Twitter (Ben is @bigdata and Roger is @rogerm) or leave a comment on this post.


  • http://oswco.com dartdog

    Take a look at Pandas for Python, a fast in memory solution…Gaining rapid adoption.

    • rmagoulas

      Dartdog – thanks for the recommendation. Not only do we like and use Pandas for Python, we have a great book on Pandas: Python for Data Analysis by Wes McKinney (the primary Pandas committer): http://shop.oreilly.com/product/0636920023784.do

      • http://oswco.com dartdog

        Have the book, know Wes, overdue to post a book review. Great resource; just surprised not to see it mentioned among in-memory data analysis solutions.

        • Olivier Grisel

          As far as I know, pandas is not (yet?) a parallel / distributed data processing system the way Spark and RAMCloud are, so at the moment it is not suited to multi-terabyte problems that would fit in memory on a cluster but not on a single node.

          Blaze ( https://github.com/ContinuumIO/blaze ) on the other hand might be able to tackle distributed in memory (numerical) data processing tasks. It’s still in development so it’s probably a bit early to be able to compare it to existing solutions.

          There are plenty of very important problems where the dataset and the intermediate working data structures fit in memory within a few tens of GB, though. I just would not call this big data.

  • http://twitter.com/mphnyc Michael H

    Hi Ben:

    Kognitio invented the first in-memory database specifically designed for analytics over 20 years ago, first bringing it to market in 1989. It has been amazing in the past 18 months to see that Hadoop is the key to finally turbo-charge adoption, as in-memory becomes the perfect pairing to that MPP environment to make it more consumable for the business analysts who need access to these burgeoning Big Data environments.

  • Tom Kennedy

    Panopticon is a great big data and real-time visualisation tool.

  • G9s

    Regarding tools aimed at analysts: none of Tableau, Qlikview, or Tibco Spotfire does REALTIME. They cannot, though they can get close, and Platfora is Hadoop-specific (i.e. is BIG DATA just HADOOP?). The only true REALTIME & BIG DATA product currently on the market that is not HADOOP-specific is PANOPTICON.

  • HSA

    One of the best real-time analytics tools, and one that has worked for us on multiple terabytes of data, is Druid. It’s fast, columnar, and written in Java, so it fits well with other Hadoop ecosystem APIs. As SkilledAnalysts we work with all the other real-time APIs, but Druid has been the best bet so far.

  • Nate Smith

    We’d love to talk to you. Our platform uses in-memory processing along with flash storage and SSD. We combine real-time stream processing with dynamic queries to enable easy-to-use real-time monitoring and analytics, along with data management tools. It’s stream processing and data management tailored to the challenges of big data. You can check us out at http://www.talksum.com, follow us or reach out on Twitter (@talksumdata), and I can provide more technical details.

    • http://twitter.com/bigdata Ben Lorica

      Hi Nate:

      Noted, thanks for letting us know.

  • http://www.facebook.com/profile.php?id=100002458125096 Momo Levi

    cool !

  • Peter Wang

    Hey Ben and Roger, I think you guys might be interested in the Blaze project we’re working on at Continuum (http://blaze.pydata.org). Let me know if you guys are interested, and I can send over an overview of the project, the Python big data ecosystem at large, and our views on how traditional supercomputing is bleeding over into mainstream business computing.

  • http://www.facebook.com/profile.php?id=597594130 Siyavus Cy Erbay

    Hi Ben and Roger, I am surprised to see that Altibase is missing from your in-memory database list. Altibase has a very long history in the in-memory DBMS domain. I am the CTO of the company and I would love to have a conversation with you folks about our company and solutions. We are about to release a new in-memory DBMS that will blow away all OLTP benchmarks.

    • http://twitter.com/bigdata Ben Lorica

      Siyavus,

      thanks for the tip. we’ll contact you (via linkedin) if we have questions.

      ben