Near realtime, streaming, and perpetual analytics

Hadoop moves from batch to near realtime: next up, placing streaming data in context

Simple example of a near realtime app built with Hadoop and HBase
Over the past year Hadoop emerged from its batch processing roots and began to take on interactive and near realtime applications. There are numerous examples that fall under these categories, but one that caught my eye recently is a system jointly developed by China Mobile Guangdong (CMG) and Intel1. It’s an online system that lets CMG’s over 100 million subscribers2 access and pay their bills, and examine their CDR’s (call detail records) in near realtime.

A service for providing detailed billing information is an important customer touch point. Repeated/extended downtimes and data errors could seriously tarnish CMG’s image. CMG needed a system that could scale to their current (and future) data volumes, while providing the low-latency responses consumers have come to expect from online services. Scalability, price and open source3 were important criteria in persuading the company to choose a Hadoop-based solution over4 MPP data warehouses.

In the system it co-developed with Intel, CMG stores detailed subscriber billing records in HBase. This amounts to roughly 30 TB/month, but since the service lets users browse up to six months of billing data it provides near realtime query results on much larger amounts of data. There are other near realtime applications built from Hadoop components (notably the continuous compute system at Yahoo!), that handle much larger data sets. But what I like about the CMG example is that it’s an application that most people understand right away (a detailed billing lookup system), and it illustrates that the Hadoop ecosystem has grown beyond batch processing.

Besides powering their online billing lookup service, CMG uses its Hadoop platform for analytics. Data from multiple sources (including phone device preferences, usage patterns, and cell tower performance) are used to compute customer segments and targeted promotions. Over time, Hadoop’s ability to handle large amounts of unstructured data opens up other data sources that can potentially improve CMG’s current analytic models.

Contextualize: Streaming and Perpetual Analytics
This leads me to something “realtime” systems are beginning to do: placing streaming data in context. Streaming analytics operates over fixed time windows and is used to identify “top k” trending items, heavy-hitters, and distinct items. Perpetual analytics takes what you’re observing now and places it in the context of what you already know. As much as companies appreciate metrics produced by streaming engines, they also want to understand how “realtime observations” affect their existing knowledge base.

Because of this desire to place things in context, I’m seeing companies design applications that have elements of streaming (“right now”) and perpetual analytics (no time window). New systems like Hokusai, Streamdrill, and Druid all have some capability in this direction. Druid in particular seems well-suited for both types of analysis.

The Hadoop community will help build such solutions5 in the near future. Along with tools like Samza, Kafka, Storm, and Spark Streaming, HBase6 and Accumulo are data stores that sit at the juncture between realtime and batch analytics. Built on top of HBase, CMG’s detailed billing lookup system already has some elements of both realtime and perpetual analytics (lookup your most recent activity, and compare it with what you did six months ago).

Related posts:

  • HBase looks more appealing to data scientists
  • Moving from Batch to Continuous Computing at Yahoo!
  • Scalable streaming analytics using a single-server

  • (1) Originating from its research labs in China, Intel’s distribution of Apache Hadoop is a relatively new solution that’s beginning to power large-scale, customer-facing applications.
    (2) 100 million subscribers is my conservative estimate, based on the following figure from Huawei: “With social and economic development, the number of subscribers of Guangdong Mobile has reached 1/6 of the country’s total.”
    (3) “Avoiding vendor lock-in was a prime directive.”
    (4) Lest you think Hadoop has sidelined them completely, MPP systems are still very much in the mix. One of the more interesting systems I came across recently was an analytic system Teradata built for Cabela’s.
    (5) A couple of companies that are doing interesting things in this space: Wibidata (HBase) and Sqrrl (Accumulo).
    (6) Besides Yahoo!’s continuous computing system, some presentations at the recent HBase summit hinted at systems that could handle streaming and perpetual analytics.

    O’Reilly Strata Conference — Strata brings together the leading minds in data science and big data — decision makers and practitioners driving the future of their businesses and technologies. Get the skills, tools, and strategies you need to make data work.

    Strata Rx Health Data Conference: September 25-27 | Boston, MA
    Strata + Hadoop World: October 28-30 | New York, NY
    Strata in London: November 15-17 | London, England

    tags: , , , , , , , ,

    Get the O’Reilly Data Newsletter

    Stay informed. Receive weekly insight from industry insiders.