Spark, Storm, HBase, and YARN power large-scale, real-time models.
My favorite session at the recent Hadoop Summit was a keynote by Bruno Fernandez-Ruiz, Senior Fellow & VP Platforms at Yahoo! He gave a nice overview of their analytic and data processing stack, and shared some interesting factoids about the scale of their big data systems. Notably many of their production systems now run on MapReduce 2.0 (MRv2) or YARN – a resource manager that lets multiple frameworks share the same cluster.
Yahoo! was the first company to embrace Hadoop in a big way, and it remains a trendsetter within the Hadoop ecosystem. In the early days the company used Hadoop for large-scale batch processing (the key example being, computing their web index for search). More recently, many of its big data models require low latency alternatives to Hadoop MapReduce. In particular, Yahoo! leverages user and event data to power its targeting, personalization, and other “real-time” analytic systems. Continuous Computing is a term Yahoo! uses to refer to systems that perform computations over small batches of data (over short time windows), in between traditional batch computations that still use Hadoop MapReduce. The goal is to be able to quickly move from raw data, to information, to knowledge:
On a side note: many organizations are beginning to use cluster managers that let multiple frameworks share the same cluster. In particular I’m seeing many companies – notably Twitter – use Mesos1 (instead of YARN) to run similar services (Storm, Spark, Hadoop MapReduce, HBase) on the same cluster.
Going back to Bruno’s presentation, here are some interesting bits – current big data systems at Yahoo! by the numbers:
Analytic engines on top of Hadoop simplify the creation of interesting, low-cost, scalable applications
Hadoop’s low-cost, scale-out architecture has made it a new platform for data storage. With a storage system in place, the Hadoop community is slowly building a collection of open source, analytic engines. Beginning with batch processing (MapReduce, Pig, Hive), Cloudera has added interactive SQL (Impala), analytics (Cloudera ML + a partnership with SAS), and as of early this week, real-time search. The economics that led to Hadoop dominating batch processing is permeating other types of analytics.
Another collection of open source, Hadoop-compatible analytic engines, the Berkeley Data Analytics Stack (BDAS), is being built just across the San Francisco Bay. Starting with a batch-processing framework that’s faster than MapReduce (Spark), it now includes interactive SQL (Shark), and real-time analytics (Spark Streaming). Sometime this summer, frameworks for machine-learning (MLbase) and graph analytics (GraphX) will be released. A cluster manager (Mesos) and an in-memory file system (Tachyon) allow users of other analytic frameworks to leverage the BDAS platform. (The Python data community is looking at Tachyon closely.)
A new, open source benchmark can be used to track performance improvements over time
As organizations continue to accumulate data, there has been renewed interest in interactive query engines that scale to terabytes (even petabytes) of data. Traditional MPP databases remain in the mix, but other options are attracting interest. For example, companies willing to upload data into the cloud are beginning to explore Amazon Redshift1, Google BigQuery, and Qubole.
A variety of analytic engines2 built for Hadoop are allowing companies to bring its low-cost, scale-out architecture to a wider audience. In particular, companies are rediscovering that SQL makes data accessible to lots of users, and many prefer3 not having to move data to a separate (MPP) cluster. There are many new tools that seek to provide an interactive SQL interface to Hadoop, including Cloudera’s Impala, Shark, Hadapt, CitusDB, Pivotal-HD, PolyBase4, and SQL-H.
An open source benchmark from UC Berkeley’s Amplab
A benchmark for tracking the progress5 of scalable query engines has just been released. It’s a worthy first effort, and its creators hope to grow the list of tools to include other open source (Drill, Stinger) and commercial6 systems. As these query engines mature and features get added, data from this benchmark can provide a quick synopsis of performance improvements over time.
The initial release includes Redshift, Hive, Impala, and Shark (Hive, Impala, Shark were configured to run on AWS). Hive 0.10 and the most recent versions7 of Impala and Shark were used (Hive 0.11 was released in mid-May and has not yet been included). Data came from Intel’s Hadoop Benchmark Suite and CommonCrawl. In the case of Hive/Impala/Shark, data was stored in compressed SequenceFile format using CDH 4.2.0.
At least for the queries included in the benchmark, Redshift is about 2-3x faster than Shark/on-disk, and 0.3-2x faster than Shark/in-memory. Given that it’s built on top of a general purpose engine (Spark), it’s encouraging that Shark’s performance is within range of MPP8 databases (such as Redshift) that are highly optimized for interactive SQL queries. With new frameworks like Shark and Impala providing speedups comparable to those observed in MPP databases, organizations now have the option of using a single system (Hadoop/Spark) instead of two (Hadoop/Spark + MPP database).
Let’s look at some of the results in detail:
Tachyon enables data sharing across frameworks and performs operations at memory speed
In earlier posts I’ve written about how Spark and Shark run much faster than Hadoop and Hive by1 caching data sets in-memory. But suppose one wants to share datasets across jobs/frameworks, while retaining speed gains garnered by being in-memory? An example would be performing computations using Spark, saving it, and accessing the saved results in Hadoop MapReduce. An in-memory storage system would speed up sharing across jobs by allowing users to save at near memory speeds. In particular the main challenge is being able to do memory-speed “writes” while maintaining fault-tolerance.
In-memory storage system from UC Berkeley’s AMPLab
The team behind the BDAS stack recently released a developer preview of Tachyon – an in-memory, distributed, file system. The current version of Tachyon was written in Java and supports Spark, Shark, and Hadoop MapReduce. Working data sets can be loaded into Tachyon where they can be accessed at memory speed, by many concurrent users. Tachyon implements the Hadoop FileSystem interface for standard file operations (such as create, open, read, write, close, and delete).
In the UC Berkeley AMPLab, we have embarked on a six year project to build a powerful next generation big data analytics platform: the Berkeley Data Analytics Stack (BDAS). We have already released several components of BDAS including Spark, a fast distributed in-memory analytics engine, and in February we ran a sold out tutorial at the Strata conference in Santa Clara teaching attendees how to use Spark and other components of the BDAS stack.
In this blog post we will walk through four steps to getting hands-on using Spark to analyze real data. For an overview of the motivation and key components of BDAS, check out our previous Strata blog post.
Describe and run bleeding edge algorithms on massive data sets
In the course of applying machine-learning against large data sets, data scientists face a few pain points. They need to tune and compare several suitable algorithms – a process that may involve having to configure a hodgepodge of tools, requiring different input files, programming languages, and interfaces. Some software tools may not scale to big data, so they first sample and test ideas on smaller subsets, before tackling the problem of having to implement a distributed version of the final algorithm.
To increase productivity, ideally data scientists should be able to quickly test ideas without doing much coding, context switching, tuning and configuration. A research project0 out of UC Berkeley’s Amplab and Brown seems to do just that: MLbase aims to make cutting edge, scalable machine-learning algorithms available to non-experts. MLbase will have four pieces: a declarative language (MQL – discussed below), a library of distributed algorithms (ML-Library), an optimizer and a runtime (ML-Optimizer and ML-Runtime). Read more…
Preview of an upcoming tutorial at Strata Santa Clara 2013
This month at Strata, the U.C. Berkeley AMPLab will be running a full day of big data tutorials.In this post, we present the motivation and vision for the Berkeley Data Analytics Stack (BDAS), and an overview of several BDAS components that we released over the past two years, including Mesos, Spark, Spark Streaming, and Shark.
While batch processing systems like Hadoop MapReduce paved the way for organizations to ask questions about big datasets, they represent only the beginning of what users need to do with big data. More and more, users wish to move from periodically building reports about datasets to continuously using new data to make informed business decisions in real-time. Achieving these goals imposes three key requirements on big data processing:
- Low latency queries: Interactive ad-hoc queries allows data scientists to find valuable inferences faster, or explore a larger solution space to make better decisions. Furthermore, there is an increasing need for stream processing, as this allows organizations to make decisions in real-time, such as detecting an SLA violation and fixing the problem before the users notice, or deciding what ads to show based on user’s live tweets.
- Sophisticated analysis: People are increasingly looking to use new state of art algorithms, such as predictive machine learning algorithms, to make better forecasts and decisions.
- Unification of existing data computation models: Users want to integrate interactive queries, batch, and streaming processing to handle the ever increasing requirements of their processing pipelines. For example, detecting anomalies in user behavior may require (1) stream processing to compare the behavior of users in real-time across different segments (e.g., genre, ages, location, device), (2) interactive queries to detect differences in user’s daily (or weekly) behavior, and (3) batch processing to build sophisticated predictive models.
In response to the above requirements, more than three years ago we began building BDAS.
Shark is 100X faster than Hive for SQL, and 100X faster than Hadoop for machine-learning
Hadoop’s strength is in batch processing, MapReduce isn’t particularly suited for interactive/adhoc queries. Real-time1 SQL queries (on Hadoop data) are usually performed using custom connectors to MPP databases. In practice this means having connectors between separate Hadoop and database clusters. Over the last few months a number of systems that provide fast SQL access within Hadoop clusters have garnered attention. Connectors between Hadoop and fast MPP database clusters are not going away, but there is growing interest in moving many interactive SQL tasks into systems that coexist on the same cluster with Hadoop.
Having a Hadoop cluster support fast/interactive SQL queries dates back a few years to HadoopDB, an open source project out of Yale. The creators of HadoopDB have since started a commercial software company (Hadapt) to build a system that unites Hadoop/MapReduce and SQL. In Hadapt, a (Postgres) database is placed in nodes of a Hadoop cluster, resulting in a system2 that can use MapReduce, SQL, and search (Solr). Now on version 2.0, Hadapt is a fault-tolerant system that comes with analytic functions (HDK) that one can use via SQL. Read more…
The development team continues to focus on features that will grow the number of Spark users
In an earlier post I listed a few reasons why I’ve come to embrace and use Spark. In particular I described why Spark is well-suited for many distributed Big Data Analytics tasks such as iterative computations and interactive queries, where it outperforms Hadoop. With version 0.6, Spark becomes even0 faster and easier to use. The release notes contain all the detailed changes, but as you’ll see from the highlights1 below, version 0.6 is a substantial release. Another good sign is the growth in number of contributors, with now over a third of the developers coming from outside the core team in Berkeley.
New Deployment Modes
In addition to running on top of Mesos, Spark can now be deployed in standalone mode: users only need to install Spark and a JVM on each node, a simple cluster manager2 that comes with Spark 0.6, handles the rest. Deployment becomes much simpler and allows organizations who aren’t familiar with Mesos (and C++) to run Spark on their clusters. This release also provides an experimental mode for running Spark on Apache YARN.
New Java API
While I use Spark exclusively through Scala, many Java developers have been asking for a Java API. With version 0.6 the wait is over: all of Spark’s features can be accessed from Java through an API. A Python API is coming very soon.
RDD persistence options
RDD’s are distributed objects that can be cached in-memory, across a cluster of compute nodes. They are the fundamental data objects used in Spark. In version 0.6 caching policy can be controlled at the individual RDD level. Users can decide whether to keep an RDD in-memory or on disk, serialize it, or replicate3 it across nodes. Read more…