Yanpei Chen

Yanpei Chen is a software engineer on the Performance Engineering team at Cloudera. His work touches on Impala, Cloudera Search, Apache Hadoop, Apache HBase, and Apache Hive, because someone has to make sure the entire Hadoop ecosystem performs well together. Yanpei is a regular speaker at industry and academic conferences, and he contributes to various industry-standard benchmarks for big data.

The truth about MapReduce performance on SSDs

Cost-per-performance is approaching parity with HDDs.

[Image: geometric stone, Brian Reynolds, Flickr]

Karthik Kambatla co-authored this post.

It is well known that solid-state drives (SSDs) are fast and expensive. But exactly how much faster, and how much more expensive, are they than the hard disk drives (HDDs) they’re supposed to replace? And does anything change for big data?

I work on the performance engineering team at Cloudera, a data management vendor. It is my job to understand performance implications across customers and across evolving technology trends. The convergence of SSDs and big data does have the potential to broadly impact future data center architectures. When one of our hardware partners loaned us a number of SSDs with the mandate to “find something interesting,” we jumped on the opportunity. This post shares our findings.

As a starting point, we decided to focus on MapReduce. We chose MapReduce because it enjoys wide deployment across many industry verticals, even as other big data frameworks such as SQL-on-Hadoop, free-text search, machine learning, and NoSQL gain prominence.

We considered two scenarios. First, for a new cluster, we explored whether SSDs or HDDs of equal aggregate bandwidth are the better choice. Second, for an existing HDD-only cluster being upgraded with SSDs, we explored how cluster operators should configure the SSDs.