Apache Spark: Powering applications on-premise and in the cloud

The O'Reilly Data Show Podcast: Patrick Wendell on the state of the Spark ecosystem.


As organizations shift their focus toward building analytic applications, many are relying on components from the Apache Spark ecosystem. I began pointing this out in advance of the first Spark Summit in 2013 and since then, Spark adoption has exploded.

With Spark Summit SF right around the corner, I recently sat down with Patrick Wendell, release manager of Apache Spark and co-founder of Databricks, for this episode of the O’Reilly Data Show Podcast. (Full disclosure: I’m an advisor to Databricks.) We talked about how he came to join the UC Berkeley AMPLab, the current state of Spark ecosystem components, Spark’s future roadmap, and interesting applications built on top of Spark.

User-driven from inception

From the beginning, Spark struck me as different from other academic research projects (many of which “wither away” when grad students leave). The AMPLab team behind Spark spoke at local SF Bay Area meetups, hosted two-day events (AMP Camp), and worked hard to help early users. That mindset continues to this day. Wendell explained:

We were trying to work with the early users of Spark, getting feedback on what issues it had and what types of problems they were trying to solve with Spark, and then use that to influence the roadmap. It was definitely a more informal process, but from the very beginning, we were expressly user-driven in the way we thought about building Spark, which is quite different than a lot of other open source projects. We never really built it for our own use — it was not like we were at a company solving a problem and then we decided, “hey let’s let other people use this code for free”. … From the beginning, we were focused on empowering other people and building platforms for other developers, so I always thought that was quite unique about Spark.

Read more…


Designers as data scientists

Data science isn't only the purview of analysts and statisticians; it should be part of a designer's skill set as well.

Download a free copy of “The New Design Fundamentals” ebook, a curated collection of chapters from our Design library. Note: this post is an excerpt from “Designing with Data,” by Rochelle King and Elizabeth F. Churchill, which is included in the curated collection.

It might feel like using data is big news now, but the truth is that we’ve been using data for a long time. For the past 20 years, we’ve been moving and replicating more and more experiences that we used to have in the physical world into the digital world. Sharing photos, having conversations, duties that we used to perform in our daily work have all become digital. We could probably have a separate discussion as to how much the migration from the physical “real” world to the digital world has benefitted or been detrimental to our society, but you can’t deny that it’s happening and only continues to accelerate at a breakneck pace.

Let’s take a look at what it means for these experiences to be moving from the physical to the digital. Not too long ago, the primary way that you shared photos with someone was that you would have to have used your camera to take a photo at an event. When your roll of film was done, you’d take that film to the local store where you would drop it off for processing. A few days or a week later, you would need to pick up your developed photos, and that would be the first time you’d be able to evaluate how well the photos that you took many days prior actually turned out. Then, maybe when someone was at your house, you’d pull out those photos and narrate what each photo was about. If you were going to really share those photos with someone else, you’d maybe order duplicates and then put them in an envelope to mail to them — and a few days later, your friend would get your photos as well. If you were working at a company like Kodak that had a vested interest in getting people to use your film, processing paper, or cameras, then there are so many steps and parts of the experience that I just described which are completely out of your control. You also have almost no way to collect insight into your customers’ behaviors and actions along the process. Read more…


Validating data models with Kafka-based pipelines

A case for back-end A/B testing.

Start the O’Reilly “Introduction to Apache Kafka” training video for free. In this video, Gwen Shapira shows developers and administrators how to integrate Kafka into a data processing pipeline.

A/B testing is a popular method of using business intelligence data to assess possible changes to websites. In the past, when a business wanted to update its website in an attempt to drive more sales, decisions on the specific changes to make were driven by guesses, intuition, focus groups, and, ultimately, which executive yelled louder. These days, the data-driven solution is to set up multiple copies of the website, direct users randomly to the different variations, and measure which design improves sales the most. There are a lot of details to get right, but this is the gist of things.
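
The gist described above (split users across variations, then compare sales per variation) can be sketched in a few lines of Python. This is only an illustrative sketch, not a method taken from the article: the variant names, the `assign_variant` and `conversion_rates` helpers, and the `(user_id, purchased)` event shape are all assumptions. It also swaps purely random assignment for a hash of the user ID, which spreads users roughly evenly while keeping each returning user on the same variation.

```python
import hashlib
from collections import defaultdict

# Hypothetical variant names; a real test would map these to site versions.
VARIANTS = ("A", "B")


def assign_variant(user_id: str, variants=VARIANTS) -> str:
    """Map a user to a variant using a stable hash of the user ID.

    Hashing stands in for random assignment: the split is roughly even,
    and the same user always lands on the same variant.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return variants[int.from_bytes(digest[:8], "big") % len(variants)]


def conversion_rates(events):
    """Compute the share of visits that ended in a sale, per variant.

    `events` is an iterable of (user_id, purchased) pairs, an assumed
    shape chosen for this sketch.
    """
    visits = defaultdict(int)
    sales = defaultdict(int)
    for user_id, purchased in events:
        variant = assign_variant(user_id)
        visits[variant] += 1
        sales[variant] += int(purchased)
    return {variant: sales[variant] / visits[variant] for variant in visits}
```

Calling `conversion_rates` on a log of visits returns one rate per variation. A real experiment would still need a significance test before declaring a winner, which is one of the details this sketch leaves out.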

When it comes to back-end systems, however, we are still living in the stone age. Suppose your business has grown significantly and you notice that your existing MySQL database is becoming less responsive as the load increases. If you consider moving to a NoSQL system, you need to decide which solution to pick, and there are a lot of options: Cassandra, MongoDB, Couchbase, or even Hadoop. There are also many possible data models: normalized, wide tables, narrow tables, nested data structures, etc.

A/B testing multiple data stores and data models in parallel

It is surprising how often a company will pick a solution based on intuition or even which architect yelled louder. Rather than making a decision based on facts and numbers regarding capacity, scale, throughput, and data-processing patterns, the back-end architecture decisions are made with fuzzy reasoning. In that scenario, what usually happens is that a data store and a data model are somehow chosen, and the entire development team will dive into a six-month project to move their entire back-end system to the new thing. This project will inevitably take 12 months, and about 9 months in, everyone will suspect that this was a bad idea, but it’s way too late to do anything about it. Read more…


Protecting health through open data management principles

Personal wellness data should be shared as freely as water and air.


Register for the free webcast, “Life Streams, Walled Gardens, and the Internet of Living Things.” Brigitte Piniewski and Hagen Finley will discuss the Internet of Living Things, what makes sensoring and monitoring data emanating from our bodies unique, and why we should elect to participate in this seemingly Orwellian mistake of open-sourcing our personal health data.

We are at a threshold in the history of personal data. Sensors and apps are making it possible to generate digital data signatures of important aspects of healthy living, such as movement, nutrition, and sleep. However, we are rapidly losing the opportunity to erect a Linux-like open “living-well” data system steeped in open commons principles. We can either join together to ensure enlightened open source and crowdsourced discovery practices become the norm for our living-well data footprints, or we can passively allow this data to be sequestered into one of the walled gardens offered by health systems, funded research, or big business.

Why is this important?

Living-well data provides the map by which vast amounts of preventable human suffering can be avoided. Everyone can benefit from the health journeys of those who lived before us because our modern societies are no longer “accidentally well.” Decades ago, parents had no need to question the nutrition a child was offered or concern themselves with how much activity a child engaged in. No deliberate use of devices was needed to track these important health contributors. Reasonable access to whole foods (farm foods) and reasonable amounts of activity were provided, as it were, by default; in other words, by accident. This resulted in remarkably low rates of chronic disease. Today, communities cannot take those healthy choices for granted; we are no longer accidentally well. Read more…


Announcing Cassandra certification

A new partnership between O’Reilly and DataStax offers certification and training in Cassandra.

I am pleased to announce a joint program between O’Reilly and DataStax to certify Cassandra developers. This program complements our developer certification for Apache Spark and, just as in the case of Databricks and Spark, we are excited to be working with the leading commercial company behind Cassandra. DataStax has done a tremendous job growing and nurturing the Cassandra community, user base, and technology.

Once the certification program is ready, developers can take the exam online, in designated test centers, and at select training courses. O’Reilly will also be developing books, training days, and videos targeted at developers and companies interested in the Cassandra distributed storage system.

Cassandra is a popular component used for building big data and real-time analytic platforms. Its ability to comfortably scale to clusters with thousands of nodes makes it a popular option for solutions that need to ingest and make sense of large amounts of time series and event data. As noted in an earlier post, real-time event data are at the heart of one of the trends we’re closely following: the convergence of cheap sensors, fast networks, and distributed computation. Read more…


How Shazam predicts pop hits

The O'Reilly Radar Podcast: Cait O'Riordan on Shazam's predictive analytics, and Francine Bennett on using data for evil.

Subscribe to the O’Reilly Radar Podcast to track the technologies and people that will shape our world in the years to come.

In this week’s Radar Podcast, I chat with Cait O’Riordan, VP of product, music and platforms at Shazam. She talks about the current state of predictive analytics and how Shazam is able to predict the success of a song, often in the first few hours after its release. We also talk about the Internet of Things and how products like the Apple Watch affect Shazam’s product life cycles as well as the behaviors of their users.

Predicting the next pop hit

Shazam has more than 100 million monthly active users, and its users Shazam more than 20 million times per day. This, of course, generates a ton of data that Shazam uses in myriad ways, not the least of which is to predict the success of a song. O’Riordan explained how they approach their user data and how they’re able to accurately predict pop hits (and misses):

What’s interesting from a data perspective is when someone takes their phone out of their pocket, unlocks it, finds the Shazam app, and hits the big blue button, they’re not just saying, “I want to know the name of this song.” They’re saying, “I like this song sufficiently to do that.” There’s an amount of effort there that implies some level of liking. That’s really interesting, because you combine that really interesting intention on the part of the user plus the massive data set, you can cut that in lots and lots of different ways. We use it for lots of different things.

At the most basic level, we’re looking at what songs are going to be popular. We can predict, with a relative amount of accuracy, what will hit the Top 100 Billboard Chart 33 days out, roughly. We can look at that in lots of different territories as well. We can also look and see, in the first few hours of a track, whether a big track is going to go on to be successful. We can look at which particular part of the track is encouraging people to Shazam and what makes a popular hit. We know that, for example, for a big pop hit, you’ve got about 10 seconds to convince somebody to find the Shazam app and press that button. There are lots of different ways that we can look at that data, going right into the details of a particular song, zooming out worldwide, or looking in different territories just due to that big worldwide and very engaged audience.

Read more…