Data generation is growing exponentially, and with it the demand for real-time analytics over fast-moving input data. Traditional batch approaches to data analysis overcome the computational problems of data volume by scaling horizontally on a distributed system such as Apache Hadoop. However, this solution is not feasible for analyzing large data streams in real time, because of the scheduling and I/O overhead it introduces.
Two main problems occur when batch processing is applied to streaming or fast data. First, by the time the analysis is complete, it may already have been made obsolete by new incoming data. Second, the data may arrive so fast that it is not feasible to store it all and batch-process it later, so it must be processed or summarized as it is received. The Square Kilometre Array (SKA) radio telescope is a good public example of a system in which data must be preprocessed before storage. The SKA is a distributed radio observation project in which each base station will receive 10-30 TB/sec and the Central Unit will process 4 PB/sec. At these rates, online summaries of the input data must be computed in real time, and only the processed (and significantly reduced) data is stored.
In the business world, common sources of stream data include sensor networks, Twitter, Internet traffic, logs, financial tickers, click streams, and online bids. Algorithmic solutions make it possible to compute summaries, detect frequent items (heavy hitters) and events, and run other statistical calculations over the stream as a whole, or to detect outliers within it.
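To make the idea of a one-pass stream summary concrete, here is a minimal, self-contained Java sketch of the classic Misra-Gries algorithm for the heavy-hitter problem. It tracks frequent items in bounded memory; the class and its names are illustrative and not tied to any product mentioned in this article.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Misra-Gries summary: finds heavy-hitter candidates in one pass
// using at most k-1 counters, regardless of stream length.
public class MisraGries {
    private final int k;  // any item occurring more than n/k times is kept
    private final Map<String, Integer> counters = new HashMap<>();

    public MisraGries(int k) { this.k = k; }

    // Process one stream element in amortized constant time.
    public void offer(String item) {
        Integer c = counters.get(item);
        if (c != null) {
            counters.put(item, c + 1);
        } else if (counters.size() < k - 1) {
            counters.put(item, 1);
        } else {
            // No free counter: decrement all, dropping those that hit zero.
            Iterator<Map.Entry<String, Integer>> it = counters.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Integer> e = it.next();
                if (e.getValue() == 1) it.remove();
                else e.setValue(e.getValue() - 1);
            }
        }
    }

    // Every item with true frequency above n/k is guaranteed to appear here.
    public Map<String, Integer> candidates() { return counters; }

    public static void main(String[] args) {
        MisraGries mg = new MisraGries(3);
        for (String s : new String[]{"a", "b", "a", "c", "a", "a", "b", "a"}) {
            mg.offer(s);
        }
        // The frequent item "a" is among the surviving candidates.
        System.out.println(mg.candidates().keySet());
    }
}
```

The guarantee is one-sided: all true heavy hitters survive, but the candidate set may contain false positives, which a second pass (or an exact follow-up query) can filter out.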
But what if you need to perform transaction-level analysis — scans across different dimensions of the data set, for example — as well as store the streamed data for fast lookup and retrospective analysis?
An integrated data stream platform
The Lambda architecture, with parallel processes for streaming analytics and storage/batch analysis, is one solution. Unfortunately, it requires maintaining duplicate code for the two paths, as Jay Kreps has pointed out. Recent frameworks like Spark and Summingbird aim to solve this problem by enabling both batch and stream processing. Streaming platforms of this kind are optimized for running computations across streams, but they are not designed to store data: they must be combined with a third-party data store, an additional complication and a new point of failure. Moreover, these data stores may not be optimized to store fast input data or to perform fast record lookups, a necessity when the application requires complex, multi-statement ACID SQL transactions. What is needed is a technology designed from the ground up to run fast, complex analytics within transactions and to store data durably.
Given these requirements for speed and volume, in the millions of transactions per second, an in-memory distributed relational DBMS is called for. VoltDB is a NewSQL OLTP row store designed for handling fast data. It combines stored Java procedures with SQL commands to process data, and it suits applications that require massive throughput, real-time analytics, and transactional consistency. VoltDB maintains the guarantees that relational databases offer (ACID compliance) while also providing a scalable and fault-tolerant system suited to write-intensive applications. Its transactional system supports serializable isolation, complete multi-statement atomicity, and rollback.
Transactions in VoltDB use Java to implement business logic and SQL for data access. Applications can be built on top of arbitrarily complex stored procedures, and VoltDB ensures that each stored procedure is strictly isolated and fully ACID. Custom Java code can also be executed within the stored procedures, so the approximation algorithms described above can be employed directly in the data pipeline.
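As an illustrative sketch of this Java-plus-SQL model, a stored procedure that ingests an event and reads back a running aggregate within a single transaction might look like the following. The `VoltProcedure`/`SQLStmt` classes and the queue/execute calls are VoltDB's API, but the table name `events`, its columns, and the procedure's parameters are hypothetical; this is a sketch, not a complete deployable procedure.

```java
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

// Hypothetical stored procedure: insert an incoming event and return a
// per-device running count, all inside one ACID transaction.
public class IngestEvent extends VoltProcedure {

    // SQL statements are declared as final fields and precompiled by VoltDB.
    public final SQLStmt insertEvent = new SQLStmt(
        "INSERT INTO events (device_id, event_time, value) VALUES (?, ?, ?);");

    public final SQLStmt countEvents = new SQLStmt(
        "SELECT COUNT(*) FROM events WHERE device_id = ?;");

    public VoltTable[] run(long deviceId, long eventTime, double value) {
        // Queue both statements; they execute together as one batch.
        voltQueueSQL(insertEvent, deviceId, eventTime, value);
        voltQueueSQL(countEvents, deviceId);
        return voltExecuteSQL(true); // true marks the final batch
    }
}
```

Because the insert and the aggregate query run in the same isolated transaction, the returned count is guaranteed to reflect the event just ingested, which is exactly the per-event analytics the text describes.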
Unlike stream processing frameworks such as Storm or Spark Streaming, in-memory databases like VoltDB can execute complex queries over the incoming data, guaranteeing analytics within transactions. Frameworks like Storm or S4 leave the management of each event's state to the application developer, whereas in-memory databases provide per-event transaction processing by running multi-statement transactions. An integrated streaming and storage system supports not only rapid data ingestion, but also rapid export to long-term analytical data stores.
In real-time analytics applications, an in-memory database is well suited for: 1) querying real-time state, 2) aggregating stream data, 3) applying approximation algorithms via third-party Java code, and 4) enabling transactional analytics. In contexts where you need to combine real-time analytics with per-event transaction processing, an integrated distributed platform is an appropriate solution.
This post is a collaboration between O’Reilly and VoltDB. Read our statement of editorial independence.