Integrated data stream platforms are poised to supplant the lambda architecture.
Data generation is growing exponentially, as is the demand for real-time analytics over fast input data. Traditional approaches to analyzing data in batch mode overcome the computational problems of data volume by scaling horizontally using a distributed system like Apache Hadoop. However, this solution is not feasible for analyzing large data streams in real time due to the scheduling I/O overhead it introduces.
Two main problems occur when batch processing is applied to stream or fast data. First, by the time the analysis is complete, it may already have been outdated by new incoming data. Second, the data may be arriving so fast that it is not feasible to store and batch-process them later, so the data must be processed or summarized when it is received. The Square Kilometer Array (SKA) radio telescope is a good public example of a system in which data must be preprocessed before storage. The SKA is a distributed radio observation project where each base station will receive 10-30 TB/sec and the Central Unit will process 4PB/sec. In this scenario, online summaries of the input data must be computed in real time and then processed — and significantly reduced in size — data is what’s stored.
In the business world, common examples of stream data are sensor networks, Twitter, Internet traffic, logs, financial tickers, click streams, and online bids. Algorithmic solutions enable the computation of summaries, frequency (heavy hitter) and event detection, and other statistical calculations on the stream as a whole or detection of outliers within it.
But what if you need to perform transaction-level analysis — scans across different dimensions of the data set, for example — as well as store the streamed data for fast lookup and retrospective analysis? Read more…
Using VoltDB and the Lambda Architecture to locate abnormal behavior.
Subscriber Identity Module box (SIMbox) fraud is a type of telecommunications fraud where users avoid an international outbound-calls charge by redirecting the call through voice over IP to a SIM in the country where the destination is located. This is an issue we helped a client address at Wise Athena.
Taking on this type of problem requires a stream-based analysis of the Call Detail Record (CDR) logs, which are typically generated quickly. Detecting this kind of activity requires in-memory computations of streaming data. You might also need to scale horizontally.
We recently evaluated the use of VoltDB together with our cognitive analytics and machine-learning system to analyze CDRs and provide accurate and fast SIMbox fraud detection. At the beginning, we used batch processing to detect SIMbox fraud, but the response time took too long, so we switched to a technology that allows in-memory computations in order to reach the desired time constraints.
VoltDB’s in-memory distributed database provides transactions at streaming speed in a fast environment. It can support millions of small transactions per second. It also allows streaming aggregation and fast counters over incoming data. These attributes allowed us to develop a real-time analytics layer on top of VoltDB. Read more…
Addressing in-memory limitations and scalability issues of R.
The R programming language is the most popular statistical software in use today by data scientists, according to the 2013 Rexer Analytics Data Miner survey. One of the main drawbacks of vanilla R is the inability to scale and handle extremely large datasets because by default, R programs are executed in a single thread, and the data being used must be stored completely in RAM. These barriers present a problem for data analysis on massive datasets. For example, the R installation and administration manual suggests using data structures no larger than 10-20% of a computer’s available RAM. Moreover, high-level languages such as R or Matlab incur significant memory overhead because they use temporary copies instead of referencing existing objects.
One potential forthcoming solution to this issue could come from Teradata’s upcoming product, Teradata Aster R, which runs on the Teradata Aster Discovery Platform. It aims to facilitate the distribution of data analysis over a cluster of machines and to overcome one-node memory limitations in R applications. Read more…