Four steps to analyzing big data with Spark

By Andy Konwinski, Ion Stoica, and Matei Zaharia

In the UC Berkeley AMPLab, we have embarked on a six-year project to build a powerful next-generation big data analytics platform: the Berkeley Data Analytics Stack (BDAS). We have already released several components of BDAS, including Spark, a fast distributed in-memory analytics engine. In February we ran a sold-out tutorial at the Strata conference in Santa Clara, teaching attendees how to use Spark and other components of BDAS.

In this blog post we will walk through four steps to getting hands-on with Spark and analyzing real data. For an overview of the motivation and key components of BDAS, check out our previous Strata blog post.

What makes Spark so fast? First, Spark provides primitives for in-memory cluster computing, which means your job can load data into memory and query it repeatedly, much more quickly than with disk-based systems like Hadoop MapReduce. Second, to make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. Finally, you can use Spark interactively from the Scala and Python shells to rapidly query big datasets in a tight data exploration loop.

Follow these four steps and see for yourself how easy it is to get up and running with Spark:

  1. Familiarize yourself with the Spark project. Begin by reading the online documentation, where you will find an overview of the project, code examples, and detailed documentation for the latest release of Spark.
  2. Download, compile, and try the Spark Shell on your local machine. Follow along with the step-by-step instructions in our screencast on First Steps with Spark. To get Spark running, you only need to have Java installed. You can be up and running, performing maps, reduces, group-bys, filters, and more, in less than 30 minutes.
  3. Upgrade to multiple nodes on EC2, real Wikipedia data, and more advanced algorithms. Now that you’ve gotten a taste for the power of the Spark Shell for interactive analytics, you’ll want to try it on more than one machine. We recently had the rare opportunity to host AMP Camp Two as part of the O’Reilly Strata Conference on big data, where we ran an in-person, hands-on training session. We then posted the curriculum online as the AMP Camp Mini Course. The mini course walks you through setting up your own multi-node Spark cluster on EC2 and through a variety of exercises that analyze real Wikipedia data using Spark and Shark (the SQL component of Spark) via any of the Python, Java, and Scala APIs.
  4. Engage the Spark community. What if you get stuck? Post your question on the Spark users Google group and get help directly from the Spark developers. You can also attend a Spark user meetup to meet us in person, read about recent releases, or attend an AMP Camp in person.
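
The operations named in step 2 (maps, reduces, group-bys, and filters) can be previewed before you install anything. The following is a minimal plain-Python sketch, not Spark code: the sample records are invented for illustration, and the comments name the RDD methods that play the corresponding role in Spark's Python API.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample data, invented for illustration: (page title, view count).
page_views = [
    ("Spark", 120),
    ("Hadoop", 80),
    ("Spark", 30),
    ("Scala", 45),
    ("Hadoop", 20),
]

# map: transform every record (Spark counterpart: rdd.map)
doubled = list(map(lambda kv: (kv[0], kv[1] * 2), page_views))

# filter: keep only records that satisfy a predicate (Spark counterpart: rdd.filter)
popular = list(filter(lambda kv: kv[1] >= 45, page_views))

# reduce: fold all values into a single result (Spark counterpart: rdd.reduce)
total_views = reduce(lambda a, b: a + b, (views for _, views in page_views))

# group-by: collect the values that share a key (Spark counterpart: rdd.groupByKey)
grouped = defaultdict(list)
for title, views in page_views:
    grouped[title].append(views)

# Sum each group to get one total per title.
views_per_title = {title: sum(views) for title, views in grouped.items()}

print(total_views)
print(popular)
print(views_per_title)
```

In the Spark Shell the same style of query is written against a distributed dataset rather than a local list, which is what lets it scale from a laptop to the multi-node cluster in step 3.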

Our vision is to build the next-generation stack of open-source data analytics software, and the Spark cluster computing framework is a major step toward that vision. We hope these first four steps will help you extract insights from your data!


