We wanted to give you a brief update on what we’ve learned so far from our series of interviews with players and practitioners in the in-memory data management space. A few preliminary themes have emerged, some expected, others surprising.
Performance improves as you put data as close to the computation as possible. We talked to people in systems, data management, web applications, and scientific computing who have embraced this concept. Some solutions go to the the lowest level of hardware (L1, L2 cache), The next generation SSDs will have latency performance closer to main memory, potentially blurring the distinction between storage and memory. For performance and power consumption considerations we can imagine a future where the primary way systems are sized will be based on the amount of non-volatile memory* deployed.
Putting data in-memory does not negate the importance of distributed computing environments. Data size and the ability to leverage parallel environments are frequently cited reasons. The same characteristics that make the distributed environments compelling also apply to in-memory systems: fault-tolerance and parallelism for performance. An additional consideration is the ability to gracefully spillover to disk when main is memory full.
There is no general purpose solution that can deliver optimal performance for all workloads. The drive for low latency requires different strategies depending on write or read intensity, fault-tolerance, and consistency. Database vendors we talked with have different approaches for transactional and analytic workloads, in some cases integrating in-memory into existing or new products. People who specialize in write-intensive systems identify hot data (i.e., frequently accessed) and put those in-memory.
Hadoop has emerged as an ingestion layer and the place to store data you might use. The next layer identifies and extracts high-value data that can be stored in-memory for low-latency interactive queries. Due to resource constraints of main memory, using columnar stores to compress data becomes important to speed I/O and store more in a limited space.
While it may be difficult to make in-memory systems completely transparent, the people we talked with emphasized programming interfaces that are as simple as possible.
Our conversations to date have revealed a wide range of solutions and strategies. We remain excited about the topic, and we’re continuing our investigation. If you haven’t yet, feel free to reach out to us on Twitter (Ben is @bigdata and Roger is @rogerm) or leave a comment on this post.
* By non-volatile memory we mean the next-generation SSDs. In the rest of the post “memory” refers to traditional volatile main memory.