Stream Processing and Mining just got more interesting

Largely unknown outside data engineering circles, Apache Kafka is one of the more popular open source, distributed computing projects. Many data engineers I speak with either already use it or are planning to do so. It is a distributed message broker used to store¹ and send data streams. Kafka was developed by Linkedin were it remains a vital component of their Big Data ecosystem: many critical online and offline data flows rely on feeds supplied by Kafka servers.

Apache Samza: a distributed stream processing framework
Behind Kafka’s success as an open source project is a team of savvy engineers who have spent² the last three years making it a rock solid system. The developers behind Kafka realized early on that it was best to place the bulk of data processing (i.e., stream processing) in another system. Armed with specific use cases, work on Samza proceeded in earnest about a year ago. So while they examined existing streaming frameworks (such as Storm, S4, Spark Streaming), Linkedin engineers wanted a system that better fit their needs³ and requirements:

Linkedin Samza

Just as MapReduce requires a data source (in most cases HDFS), a general purpose data processing system like Samza requires a source of streams. Out of the box, Samza uses YARN as its fault-tolerant, distributed execution system, and Kafka as its messaging system. But Samza is pluggable and can be used with other systems – there is already interest in making Samza work with other execution systems (AWS, Mesos).

Stream Mining and near realtime computations
For now Samza users who need to do stream mining have to write code⁴ (common problems include finding the “top k”, “distinct elements”, “heavy hitters”, quantiles, moments). However as a general purpose stream processing framework it can be used for a broad set of problems: simple “counts”, approximate answers (“sketches”), and more advanced algorithms. Data scientists will also like the fact that Samza (and Kafka) can easily reprocess data, a useful feature when new algorithms need to be deployed or existing ones have to tweaked.

Within Linkedin, Samza occupies the space between batch (Hadoop jobs that run for a few hours) and online systems (milliseconds). By supporting tasks that take a few seconds to a few minutes to run, Samza is potentially useful for a variety of user services such as periodic updates of rankings and recommendations. I expect Samza to attract interest from developers who are already using Kafka. As I pointed out in a recent post, integrated systems are attracting attention (as convenience and familiarity, outweigh performance). Furthermore, as interfaces and dedicated libraries emerge, Samza will become useful to many more data scientists.

Frugal Streaming, Approximate Quantiles and Histograms
Speaking of approximate answers, I thought I’d highlight a couple of recent posts that relate to streaming and realtime analytics. The first explains simple algorithms (Frugal-1U and Frugal-2U) for computing approximate quantiles in streaming data. The second uses Druid to approximate quantiles and build histograms (in near realtime) for large data sets. As data volume and velocity continue to rise⁵, I expect systems that produce approximate answers to become more common.

Related posts:

(0) Thanks to Chris Riccomini and Jay Kreps of Linkedin for walking me through the details of Samza.
(1) Kafka retains published messages for a configurable amount of time – during which period one can resend and reprocess them accordingly.
(2) Here’s an interesting bit to illustrate how much work has gone into Kafka: just getting replication right took 5 experienced developers a year! While the bulk of Kafka development still happens within Linkedin, about 30% comes from external developers.
(3) Stateful processing is particularly important to Linkedin: “We are not aware of any state management facilities in Storm …”
(4) Pushing the MapReduce analogy further, at some point a simpler system (a la Pig, Hive, SQL) may make it easier to use Samza. One of the things that makes Spark Streaming attractive is its programming interface: the same (concise) code that works for batch jobs (Spark), will work (with minor changes) for stream processing .
(5) The “real-time big data triangle of constraints” posits that you can have only two of the following: volume, velocity, exact.

O’Reilly Strata Conference — Strata brings together the leading minds in data science and big data — decision makers and practitioners driving the future of their businesses and technologies. Get the skills, tools, and strategies you need to make data work.Strata Rx Health Data Conference: September 25-27 | Boston, MA
Strata + Hadoop World: October 28-30 | New York, NY
Strata in London: November 15-17 | London, England

Stream Processing and Mining just got more interesting

A general purpose stream processing framework from the team behind Kafka and new techniques for computing approximate quantiles

Get the O’Reilly Data Newsletter