Expanding options for mining streaming data

Stream processing was in the minds of a few people that I ran into over the past week. A combination of new systems, deployment tools, and enhancements to existing frameworks, are behind the recent chatter. Through a combination of simpler deployment tools, programming interfaces, and libraries, recently released tools make it easier for companies to process and mine streaming data sources.

Of the distributed stream processing systems that are part of the Hadoop ecosystem⁰, Storm is by far the most widely used (more on Storm below). I’ve written about Samza, a new framework from the team that developed Kafka (an extremely popular messaging system). Many companies who use Spark express interest in using Spark Streaming (many have already done so). Spark Streaming is distributed, fault-tolerant, stateful, and boosts programmer productivity (the same code used for batch processing can, with minor tweaks, be used for realtime computations). But it targets applications that are in the “second-scale latencies”. Both Spark Streaming and Samza have their share of adherents and I expect that they’ll both start gaining deployments in 2014.

Netflix¹ recently released Suro, a data pipeline service for large volumes of event data. Within Netflix, it is used to dispatch events for both batch (Hadoop) and realtime (Druid) computations:

Netflix Suro

Leveraging and deploying Storm
YARN, Storm and Spark were key to letting Yahoo! move from batch to near realtime analytics. To that end, Yahoo! and Hortonworks built tools that let Storm applications leverage Hadoop clusters. Mesosphere released a similar project for Mesos (Storm-Mesos) early this week. This release makes it easier to run Storm on Mesos clusters: in particular, previous Storm-Mesos integrations do not have the “built-in configuration distribution” feature and require a lot more steps deploy. (Along with several useful tutorials on how to run different tools on top of Mesos, Mesosphere also recently released Elastic Mesos.)

Mesosphere: Storm-Mesos

One of the nice things about Spark is that developers can use similar code for batch (Spark) and realtime (Spark Streaming) computations. Summingbird is an open source library from Twitter that offers something similar for Hadoop MapReduce and Storm: programs that look like Scala collection transformations can be executed in batch (Scalding) or realtime (Storm).

Focus on analytics instead of infrastructure
Stream processing requires several components and engineering expertise² to setup and maintain. Available in “limited preview”, a new stream processing framework from Amazon Web Services (Kinesis) eliminates³ the need to setup stream processing infrastructure. Kinesis users can easily specify the throughput capacity they need, and shift their focus towards finding patterns and exceptions from their data streams. Kinesis integrates nicely with other popular AWS components (Redshift, S3, and Dynamodb) and should attract users already familiar with those tools. The standard AWS cost structure (no upfront fees, pay only for your usage) should also be attractive to companies who want to quickly experiment with streaming analytics.

Machine-learning
In a previous post I described a few key techniques (e.g., sketches) used for mining data streams. Algebird is an open source abstract algebra library (used with Scalding or Summingbird) that facilitates popular stream mining computations like the Count-Min sketch and HyperLogLog. (Twitter developers observed that commutative Monoids can be used to implement⁴ many popular approximation algorithms.)

Companies that specialize in analytics for machine data (Splunk, SumoLogic) incorporate machine-learning into their products. There are also general purpose software tools and web services that offer machine-learning algorithms that target streaming data. Xmp from Argyle Data includes algorithms for online learning and realtime pattern detection. FeatureStream is a new web service for applying machine-learning to data streams.

Update (12/20/2013): Yahoo! recently released SAMOA – a distributed streaming machine-learning framework. SAMOA lets developers code “algorithms once and execute them in multiple” stream processing environments.

Related content:

Some related talks at Strata Santa Clara 2014: Oscar Boykin (co-author of Algebird, Scalding and Summingbird) will give a talk on Monoids and Algebird as part of the Hardcore Data Science day ; David Andrzejewski will present a session on Machine Learning for Machine Data
Stream Processing and Mining just got more interesting
Stream Mining essentials

(0) New batch of commercial options: In-memory grids (e.g., Terracotta, ScaleOut software, GridGain) also have interesting stream processing technologies.
(1) Some people think Netflix is building other streaming processing components.
(2) Some common components include Kafka (source of data flows), Storm or Spark Streaming (for stream data processing), and HBase / Druid / Hadoop (or some other data store).
(3) SaaS let companies focus on analytics instead of infrastructure: startups like keen.io process and analyze event data in near real-time.
(4) There’s interest in developing similar libraries for Spark Streaming. A new efficient data structure for “accurate on-line accumulation of rank-based statistics” called t-digest, might be incorporated into Algebird.

O’Reilly Strata Conference — Strata brings together the leading minds in data science and big data — decision makers and practitioners driving the future of their businesses and technologies. Get the skills, tools, and strategies you need to make data work.

Strata in Santa Clara: February 11-13 | Santa Clara, CA

Expanding options for mining streaming data

New tools make it easier for companies to process and mine streaming data sources

Get the O’Reilly Data Newsletter