Moving from Batch to Continuous Computing at Yahoo!

Spark, Storm, HBase, and YARN power large-scale, real-time models.

My favorite session at the recent Hadoop Summit was a keynote by Bruno Fernandez-Ruiz, Senior Fellow & VP Platforms at Yahoo! He gave a nice overview of their analytic and data processing stack, and shared some interesting figures about the scale of their big data systems. Notably, many of their production systems now run on MapReduce 2.0 (MRv2) and its resource manager, YARN, which lets multiple frameworks share the same cluster.

Yahoo! was the first company to embrace Hadoop in a big way, and it remains a trendsetter within the Hadoop ecosystem. In the early days the company used Hadoop for large-scale batch processing (the key example being computing its web index for search). More recently, many of its big data models have come to require low-latency alternatives to Hadoop MapReduce. In particular, Yahoo! leverages user and event data to power its targeting, personalization, and other “real-time” analytic systems. Continuous Computing is the term Yahoo! uses for systems that perform computations over small batches of data (over short time windows), in between traditional batch computations that still use Hadoop MapReduce. The goal is to move quickly from raw data, to information, to knowledge.
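
To make the micro-batch idea concrete, here is a minimal sketch in the style of Spark Streaming's Java API. This is not Yahoo!'s actual code: the socket source, host name, and tab-separated record format are placeholder assumptions, and a production job would read from the event collection tier and write its results to a store such as HBase rather than printing them.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class MicroBatchClickCounts {
    public static void main(String[] args) throws Exception {
        // Process the incoming event stream in 10-second micro-batches.
        SparkConf conf = new SparkConf().setAppName("micro-batch-click-counts");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Placeholder source: a socket feed of tab-separated records
        // ("type<TAB>itemId"); a real pipeline would read from the collection tier.
        JavaDStream<String> events = ssc.socketTextStream("collector-host", 9999);

        // Count click events per item within each micro-batch.
        JavaPairDStream<String, Integer> clicksPerItem = events
                .filter(line -> line.startsWith("click"))
                .mapToPair(line -> new Tuple2<>(line.split("\t")[1], 1))
                .reduceByKey(Integer::sum);

        // Each batch's counts are available seconds after the events arrive,
        // instead of waiting for a nightly MapReduce run.
        clicksPerItem.print();

        ssc.start();
        ssc.awaitTermination();
    }
}
```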

On a side note: many organizations are beginning to use cluster managers that let multiple frameworks share the same cluster. In particular, I’m seeing many companies – notably Twitter – use Mesos (1) instead of YARN to run similar services (Storm, Spark, Hadoop MapReduce, HBase) on the same cluster.

Going back to Bruno’s presentation, here are some interesting bits – current big data systems at Yahoo! by the numbers:

  • 100 billion events (clicks, impressions, email content & metadata, etc.) are collected daily across all of the company’s systems
  • A subset of collected events gets passed to a stream processing engine running on a Hadoop/YARN cluster: 133K events/second are processed using Storm-on-YARN across 320 nodes, involving roughly 500 processors and 12,000 threads (see the topology sketch after this list for what such a job looks like in code).
  • Iterative computations are performed with Spark-on-YARN, across 40 nodes
  • Sparse data store: 2 PB of data stored in HBase, across 1,900 nodes. I believe this is one of the largest HBase deployments in production.
  • 365 PB of available raw storage on HDFS, spread across 30,000 nodes (about 150 PB is currently utilized).
  • About 400,000 jobs/day run on YARN, corresponding to about 10M hours of compute time per day.
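
For a sense of what the Storm-on-YARN piece looks like at the code level, below is a small, hypothetical topology written against Storm's Java API. The spout, bolt, and parallelism hints are illustrative stand-ins (TestWordSpout is a toy source bundled with Storm), not Yahoo!'s configuration; running Storm on YARN changes how the topology is deployed, not how it is written. Scaling the executor and worker counts is how a single topology spreads its spouts and bolts across hundreds of nodes and thousands of threads.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class EventCountTopology {

    // Keeps a running count per key; a stand-in for the per-event work
    // (filtering, joining, feature updates) a real topology would do.
    public static class CountBolt extends BaseBasicBolt {
        private final Map<String, Long> counts = new HashMap<>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String key = tuple.getString(0);
            long n = counts.merge(key, 1L, Long::sum);
            collector.emit(new Values(key, n));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("key", "count"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // TestWordSpout is a toy source bundled with Storm; a real deployment
        // would read from the event collection pipeline instead.
        builder.setSpout("events", new TestWordSpout(), 4);

        // Parallelism hints are illustrative only; raising the executor and
        // worker counts is how the topology spreads across a large cluster.
        builder.setBolt("count", new CountBolt(), 64)
               .fieldsGrouping("events", new Fields("word"));

        Config conf = new Config();
        conf.setNumWorkers(16);
        StormSubmitter.submitTopology("event-counts", conf, builder.createTopology());
    }
}
```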

Related posts:

  • Seven reasons why I like Spark
  • Simpler workflow tools enable the rapid deployment of models
  • It’s getting easier to build Big Data applications

(1) I first wrote about Mesos over two years ago, when I learned that Twitter was using it heavily. Since then Mesos has been deployed in production at Twitter, Airbnb, Conviva, UC Berkeley, UC San Francisco, and a slew of startups that I’ve talked with.

