Counting Unique Users in Real-time with Streaming Databases

As the web increasingly becomes real-time, marketers and publishers need analytic tools that can produce real-time reports. As an example, the basic task of calculating the number of unique users is typically done in batch mode (e.g. daily) and in many cases using a random sample from relevant log files. If unique user counts can be accurately computed in real-time, publishers and marketers can mount A/B tests or referral analysis to dynamically adjust their campaigns.

In a previous post I described SQL databases designed to handle data streams. In their latest release, Truviso announced technology that allows companies to track unique users in real-time. Truviso uses the same basic idea I described in my earlier post:

Recognizing that “data is moving until it gets stored”, the idea behind many real-time analytic engines is to start applying the same analytic techniques to moving (streams) and static (stored) data.

Truviso uses (compressed) bitmaps and set theory to compute the number of unique customers in real-time. In the process they are able to handle the standard SQL queries associated with these types of problems: counting the number of distinct users, for any given set of demographic filters. Bitmaps are built as data streams into the system and use the same underlying technology that allows Truviso to handle massive data sets from high-traffic web sites.

pathint

Once companies can do simple counts and averages in real-time, the next step is to use real-time information for more sophisticated analyses. Truviso has customers using their system for “on-the-fly predictive modeling”.

The other main enhancement in this release is Truviso’s move towards parallel processing. Their new execution engine processes runs or blocks of data in parallel in multi-core systems or multi-node environments. Using Truviso’s parallel execution engine is straightforward on a single multi-core server, but on a multi-node cluster it may require considerable attention to configuration.

[For my previous posts on real-time analytic tools see here and here.]

UPDATE (11/23/2009): Google acquires Teracent, a San Mateo startup that specializes in real-time analytics for optimizing online banner/display ads.

tags: , , , , , ,

Get the O’Reilly Data Newsletter

Get weekly insight from industry insiders—plus exclusive content, offers, and more on the topic of data.