Big Data and Real-time Structured Data Analytics

The emergence of sensors as sources of Big Data highlights the need for real-time analytic tools. Popular web services such as Twitter, Facebook, and blogs also have to analyze (mostly unstructured) data in near real-time. But as Truviso founder and UC Berkeley CS Professor Michael Franklin recently noted, there are mountains of structured data generated by web apps that lend themselves to real-time analysis:

The information stream driving the data analytics challenge is orders of magnitude larger than the streams of tweets, blog posts, etc. that are driving interest in searching the real-time web. Most tweets, for example, are created manually by people at keyboards or touchscreens, 140 characters at a time. Multiply that by the millions of active users and the result is indeed an impressive amount of information. The data driving the data analytics tsunami, on the other hand, is automatically generated. Every page view, ad impression, ad click, video view, etc. done by every user on the web generates thousands of bytes of log information. Add in the data automatically generated by the underlying infrastructure (CDNs, servers, gateways, etc.) and you can quickly find yourself dealing with petabytes of data.

In our report on Big Data, we listed some tools that can turn SQL data warehouses into real-time intelligence systems. A typical data warehouse reports on data that are a day, a week, or even a month old. Not every company requires real-time reports, alerts, or exception tracking, but some domains may benefit from dramatically reducing latency. To supplement the typical post-campaign reports generated by traditional (static) data warehouses, advertisers and content providers could track and make adjustments to their campaigns in real-time. Web applications that rely on data generated by sensors (e.g. smart grids, location-aware mobile apps, logistics & supply-chain tracking, environmental sensors) would be able to display reports that are continuously updated in real-time. Web site performance and security reports are also natural candidates for real-time analytics.

Traditional SQL databases and MapReduce systems are batch-oriented (load all the data, then analyze), so they might not deliver the low latency that (near) real-time analysis requires. Fortunately, there are tools^† that allow structured data sets (such as data warehouses) to be easily analyzed in real-time.

Recognizing that “data is moving until it gets stored”, the idea behind many real-time analytic engines is to start applying the same analytic techniques to moving (streams) and static (stored) data. Truviso separates the processing and analysis of data, and performs both in real-time. End-users and business analysts can access/query real-time data and historical data using SQL: in Truviso’s case the underlying Postgres engine and optimizer have been extended to include an embedded stream processor to handle “live data” in any SQL statement’s FROM clause^††. To specify how “live data” is to be processed by a database engine, most real-time analytic vendors provide SQL extensions that allow users to specify the time windows to be analyzed. As data flows continuously into the system, the results of queries involving “live data” are continuously updated in real-time. Leveraging a popular database such as Postgres means structured data warehouses can be ported and made real-time with Truviso.
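To make the idea of a time window over "live data" concrete, here is a minimal Python sketch of a continuously updated, windowed count. It is not Truviso's implementation or SQL syntax; the event shape, the function name `tumbling_window_counts`, and the 60-second window are assumptions made purely for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event shape: (timestamp in seconds, event type).
Event = Tuple[int, str]


def tumbling_window_counts(
    events: Iterable[Event], window_seconds: int
) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Yield (window_start, counts_by_type) each time a window closes.

    Assumes events arrive in timestamp order. Windows that contain
    no events are skipped rather than reported as empty.
    """
    current_start = None
    counts: Dict[str, int] = defaultdict(int)
    for ts, event_type in events:
        # Align the timestamp to the start of its fixed-size window.
        start = ts - ts % window_seconds
        if current_start is None:
            current_start = start
        if start != current_start:
            # The stream has moved into a new window: report the old one.
            yield current_start, dict(counts)
            counts = defaultdict(int)
            current_start = start
        counts[event_type] += 1
    if current_start is not None:
        # Report the final, still-open window when the stream ends.
        yield current_start, dict(counts)


# Illustrative stream of web log events (made-up data).
clicks = [
    (0, "ad_click"),
    (12, "page_view"),
    (59, "ad_click"),
    (61, "page_view"),
    (130, "ad_click"),
]
results = list(tumbling_window_counts(clicks, window_seconds=60))
for window_start, counts in results:
    print(window_start, counts)
```

Because the function is a generator, each window's result becomes available as soon as the stream moves past it, rather than after all the data has been loaded. That is the contrast with the batch-oriented approach described earlier.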

A major challenge facing stream databases is what to do with out-of-order data. Streams are timestamped data sets, and most systems expect data to arrive in the correct time sequence. Unfortunately, when data flows in from multiple sources, it is not uncommon for timestamped data to arrive out-of-order. While some real-time analytic systems simply drop out-of-order data (potentially leading to misleading query results), Truviso has developed algorithms that look for contiguous data and produce query results that correctly handle out-of-order data.
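The out-of-order problem can be illustrated with a small Python sketch. To be clear, this is not Truviso's algorithm, whose details are not described here. It is a generic, commonly used alternative: buffer events and release them in timestamp order once they are older than a bounded lateness. The names `reorder_with_lateness`, `max_lateness`, and `dropped` are hypothetical, as is the sample data.

```python
import heapq
from typing import Iterable, Iterator, List, Optional, Tuple

# Hypothetical event shape: (timestamp in seconds, payload).
Event = Tuple[int, str]


def reorder_with_lateness(
    events: Iterable[Event],
    max_lateness: int,
    dropped: Optional[List[Event]] = None,
) -> Iterator[Event]:
    """Yield events in timestamp order, tolerating bounded disorder.

    An event is held in a buffer until the newest timestamp seen so far
    is at least max_lateness seconds ahead of it. Events that arrive
    after their slot has already been released are appended to `dropped`
    (if provided) instead of being emitted out of order.
    """
    buffer: List[Tuple[int, int, str]] = []  # (timestamp, arrival index, payload)
    max_seen = float("-inf")
    watermark = float("-inf")  # everything at or below this has been released
    for seq, (ts, payload) in enumerate(events):
        if ts < watermark:
            # Too late: emitting it now would break the output ordering.
            if dropped is not None:
                dropped.append((ts, payload))
            continue
        heapq.heappush(buffer, (ts, seq, payload))
        max_seen = max(max_seen, ts)
        watermark = max_seen - max_lateness
        # Release every buffered event that can no longer be overtaken.
        while buffer and buffer[0][0] <= watermark:
            out_ts, _, out_payload = heapq.heappop(buffer)
            yield out_ts, out_payload
    # End of stream: flush whatever is still buffered, in order.
    while buffer:
        out_ts, _, out_payload = heapq.heappop(buffer)
        yield out_ts, out_payload


# Made-up stream in arrival order; (2, "c") and (4, "e") arrive late.
arrivals = [(1, "a"), (3, "b"), (2, "c"), (10, "d"), (4, "e"), (11, "f")]
too_late: List[Event] = []
ordered = list(reorder_with_lateness(arrivals, max_lateness=5, dropped=too_late))
print(ordered)
print(too_late)
```

In this run, `(2, "c")` is slotted back into place because it arrives within the allowed lateness, while `(4, "e")` arrives after later data has already been released and is set aside. The trade-off is visible in the `max_lateness` parameter: a larger value tolerates more disorder but delays results, and any event that exceeds it is still lost, which is exactly the kind of misleading result the paragraph above warns about.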

What about real-time analysis of unstructured data? Truviso hasn’t focused on unstructured data, preferring instead to target companies with existing data warehouses. After all, the general notion is that unstructured data doesn’t quite fit into SQL databases like Truviso. But the perception that unstructured data isn’t for relational databases may be changing slightly. Recently, a team at UC Berkeley used a SQL database to perform entity-extraction. They took unstructured text, passed it through a Conditional Random Fields algorithm (coded in SQL), and turned it into structured data.

(†) We recently had the chance to meet with the founders of Truviso. There are many other real-time analytic solutions, including StreamBase and SQLstream.

(††) In Truviso’s system, “live data” or streams can be created (CREATE stream) and accessed in SQL much like static database tables.
