# "real-time data" entries

## Showcasing the real-time processing revival

### Tools and learning resources for building intelligent, real-time products.

Register for Strata + Hadoop World NYC, which will take place September 29 to October 1, 2015.

A few months ago, I noted the resurgence in interest in large-scale stream-processing tools and real-time applications. Interest remains strong, and if anything, I’ve noticed growth in the number of companies wanting to understand how they can leverage the growing number of tools and learning resources to build intelligent, real-time products.

This is something we’ve observed using many metrics, including product sales, the number of submissions to our conferences, and the traffic to Radar and newsletter articles.

As we looked at putting together the program for Strata + Hadoop World NYC, we were excited to see a large number of compelling proposals on these topics. To that end, I’m pleased to highlight a strong collection of sessions on real-time processing and applications coming up at the event. Read more…

## A real-time tool for a real-time problem

### Using VoltDB and the Lambda Architecture to locate abnormal behavior.

Subscriber Identity Module box (SIMbox) fraud is a type of telecommunications fraud where users avoid an international outbound-calls charge by redirecting the call through voice over IP to a SIM in the country where the destination is located. This is an issue we helped a client address at Wise Athena.

Taking on this type of problem requires a stream-based analysis of the Call Detail Record (CDR) logs, which are typically generated quickly. Detecting this kind of activity requires in-memory computations of streaming data. You might also need to scale horizontally.

We recently evaluated the use of VoltDB together with our cognitive analytics and machine-learning system to analyze CDRs and provide accurate and fast SIMbox fraud detection. At the beginning, we used batch processing to detect SIMbox fraud, but the response time was too long, so we switched to a technology that allows in-memory computations in order to meet the desired time constraints.

VoltDB’s in-memory distributed database provides transactions at streaming speed in a fast environment. It can support millions of small transactions per second. It also allows streaming aggregation and fast counters over incoming data. These attributes allowed us to develop a real-time analytics layer on top of VoltDB. Read more…
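To make "streaming aggregation and fast counters" a little more concrete, here is a minimal in-memory sketch in plain Python. It does not use VoltDB and is not the system described above: the `CallRecord` fields, the one-hour window, and the `looks_suspicious` rule (many calls, nearly all to different numbers) are assumptions chosen purely for illustration.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CallRecord:
    """Hypothetical, simplified CDR: who called whom, and when (epoch seconds)."""
    caller: str
    callee: str
    timestamp: float


class SlidingWindowCounter:
    """Keeps each caller's records from the last `window_seconds` seconds.

    Assumes records for a given caller arrive in timestamp order.
    Old records are evicted only when a new record for that caller arrives.
    """

    def __init__(self, window_seconds: float = 3600.0) -> None:
        self.window_seconds = window_seconds
        self._calls = defaultdict(deque)  # caller -> records, oldest first

    def add(self, record: CallRecord) -> None:
        calls = self._calls[record.caller]
        calls.append(record)
        # Drop records that have fallen out of the window.
        cutoff = record.timestamp - self.window_seconds
        while calls and calls[0].timestamp < cutoff:
            calls.popleft()

    def stats(self, caller: str) -> Tuple[int, int]:
        """Return (calls in window, distinct callees in window)."""
        calls = self._calls.get(caller, ())
        return len(calls), len({c.callee for c in calls})


def looks_suspicious(counter: SlidingWindowCounter, caller: str,
                     min_calls: int = 100,
                     min_distinct_ratio: float = 0.9) -> bool:
    """Illustrative rule only: many calls, almost all to different numbers."""
    total, distinct = counter.stats(caller)
    if total == 0 or total < min_calls:
        return False
    return distinct / total >= min_distinct_ratio
```

A production system would also need to spread this state across machines and keep it durable, which is the part an in-memory distributed database is meant to handle.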

## How browsers get to know you in milliseconds

### Behind the scenes of a real-time ad auction on the web.

A small technological marvel occurs on almost every visit to a web page. In the seconds that elapse between the user’s click and the display of the page, an ad auction takes place in which hundreds of bidders gather whatever information they can get on the user, determine which ads are likely to be of interest, place bids, and transmit the winning ad to be placed in the page.

How can all that happen in approximately 100 milliseconds? Let’s explore the timeline and find out what goes on behind the scenes in a modern ad auction. Most of the information I have comes from two companies that handle different stages of the auction: the ad exchange AppNexus and the demand-side platform Yashi. Both store critical data in an Aerospike database running on flash to achieve sub-second speeds.
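As a toy model of a deadline-bound auction, the sketch below keeps only the bids that arrive within the time budget and picks the highest one. It is an assumption-laden simplification, not how AppNexus or Yashi work: the `Bid` fields, the simulated latencies, and the highest-price-wins rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Bid:
    bidder: str         # hypothetical bidder name
    price_cpm: float    # offered price, in illustrative units
    latency_ms: float   # simulated time the bidder took to respond


def run_auction(bids: Iterable[Bid], deadline_ms: float = 100.0) -> Optional[Bid]:
    """Return the highest bid that arrived within the deadline, or None."""
    on_time = [b for b in bids if b.latency_ms <= deadline_ms]
    return max(on_time, key=lambda b: b.price_cpm, default=None)


bids = [
    Bid("bidder_a", 2.5, 40.0),
    Bid("bidder_b", 4.0, 130.0),  # best price, but misses the deadline
    Bid("bidder_c", 3.1, 85.0),
]
winner = run_auction(bids)
print(winner.bidder if winner else "no bids in time")  # prints "bidder_c"
```

The point of the example is that a late bid is worth nothing, however high it is, which is why bidders depend on very fast data lookups.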

## Fast data fuels real-time streaming applications

### A new report describes an imminent shift in real-time applications and the data architecture they require.

The era is here: we’re starting to see computers making decisions that people used to make, through a combination of historical and real-time data. These streams of data come together in applications that answer questions like:

• What news items or ads is this website visitor likely to be interested in?
• Is current network traffic part of a Distributed Denial of Service attack?
• Should our banking site offer a visitor a special deal on a mortgage, based on her credit history?
• What promotion will entice this gamer to stay on our site longer?
• Is a particular part of the assembly line overheating, and does it need to be shut down?

Such decisions require the real-time collection of data from the particular user or device, along with others in the environment, and often need to be made on a per-person or per-event basis. For instance, leaderboarding (determining who is the top candidate among a group of users, based on some criteria) requires a database that tracks all the relevant users. Such a database nowadays often resides in memory. Read more…
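The leaderboarding case can be sketched in a few lines of Python. This is only an illustration of keeping every user's score in memory and ranking on demand; the `Leaderboard` class, its method names, and the sample users are hypothetical and do not come from the report.

```python
import heapq
from typing import Dict, List, Tuple


class Leaderboard:
    """In-memory leaderboard: tracks a running score for every known user."""

    def __init__(self) -> None:
        self._scores: Dict[str, float] = {}

    def record(self, user: str, points: float) -> None:
        """Add points to a user's running score as each event arrives."""
        self._scores[user] = self._scores.get(user, 0.0) + points

    def top(self, n: int = 3) -> List[Tuple[str, float]]:
        """Return the n highest-scoring users, best first."""
        return heapq.nlargest(n, self._scores.items(), key=lambda item: item[1])


board = Leaderboard()
for user, points in [("ana", 10), ("bo", 7), ("ana", 5), ("cy", 12)]:
    board.record(user, points)
print(board.top(2))  # [('ana', 15.0), ('cy', 12.0)]
```

Because every event can change the ranking, the full set of scores has to be readable and writable with very low latency, which is the argument for holding it in memory.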

## It’s time to move to real-time regulation

### The Internet of Things allows for real-time data monitoring, which is crucial to regulatory reform.

One under-appreciated aspect of the changing relationship between the material world and software is that material goods can and will fail — sometimes with terrible consequences.

What if government regulations were web-based and mandated inclusion of Internet-of-Things technology that could actually stop a material failure, such as a pipeline rupture or automotive failure, while it was in its earliest stages and hadn’t caused harm? Even more dramatically, what if regulations could even prevent failures from happening at all?

With such a system, we could avoid or minimize disasters — from Malaysia Airlines Flight 370’s disappearance to the auto-safety debacles at GM to a possible leak if the Keystone XL pipeline is built — while the companies using this technology could simultaneously benefit in a variety of profitable ways. Read more…

## Expanding options for mining streaming data

### New tools make it easier for companies to process and mine streaming data sources.

Stream processing was on the minds of a few people that I ran into over the past week. A combination of new systems, deployment tools, and enhancements to existing frameworks is behind the recent chatter. Through simpler deployment tools, programming interfaces, and libraries, recently released tools make it easier for companies to process and mine streaming data sources.

Of the distributed stream processing systems that are part of the Hadoop ecosystem, Storm is by far the most widely used (more on Storm below). I’ve written about Samza, a new framework from the team that developed Kafka (an extremely popular messaging system). Many companies that use Spark express interest in using Spark Streaming (many have already done so). Spark Streaming is distributed, fault-tolerant, stateful, and boosts programmer productivity (the same code used for batch processing can, with minor tweaks, be used for real-time computations). But it targets applications that can tolerate “second-scale latencies”. Both Spark Streaming and Samza have their share of adherents, and I expect that they’ll both start gaining deployments in 2014.

## Broadening the value of the industrial Internet

### Remote monitoring appeals to management, but good applications create value for those being monitored as well.

The industrial Internet makes data available at levels of frequency, accuracy and breadth that managers have never seen before, and the great promise of this data is that it will enable improvements to the big networks from which it flows. Huge systems can be optimized by taking into account the status of every component in real time; failures can be preempted; deteriorating performance can be detected and corrected.

But some of this intelligence can be a hard sell to those being monitored and measured, who worry that hard-learned discretion might be overridden by an engineer’s idealistic notion of how things should work. In professional settings, workers worry about loss of agency, and they imagine that they’ll be micro-managed on any minor variation from normal. Consumers worry about loss of privacy.

The best applications of the industrial Internet handle this conflict by generating value for those being monitored as well as those doing the monitoring — whether by giving workers the information they need to improve the metrics that their managers see, or by giving consumers more data to make better decisions.

Fort Collins (Colo.) Utilities, for instance, expects to recoup its $36 million investment in advanced meters in 11 years through operational savings. The meters obviate the need for meter readers, and the massive amounts of data they produce are useful for detecting power failures and predicting when transformers will need to be replaced before a damaging explosion occurs. Read more…

## Strata Week: Big data gets warehouse services

### AWS Redshift and BitYota launch, big data's problems could shift to real time, and NYPD may be crossing a line with cellphone records.

Here are a few stories from the data space that caught my attention this week.

## Amazon, BitYota launch data warehousing services

Amazon announced the beta launch of its Amazon Web Services data warehouse service Amazon Redshift this week. Paul Sawers at The Next Web reports that Amazon hopes to democratize data warehousing services, offering affordable options to make such services viable for small businesses while enticing large companies with cheaper alternatives. Depending on the service plan, customers can launch Redshift clusters scaling to more than a petabyte for less than $1,000 per terabyte per year.

So far, the service has drawn in some big players — Sawers notes that the initial private beta has more than 20 customers, including NASA/JPL, Netflix, and Flipboard.

Brian Proffitt at ReadWrite took an in-depth look at the service, noting its potential speed capabilities and the importance of its architecture. Proffitt writes that Redshift’s massively parallel processing (MPP) architecture “means that unlike Hadoop, where data just sits cheaply waiting to be batch processed, data stored in Redshift can be worked on fast — fast enough for even transactional work.”

Proffitt also notes that Redshift isn’t without its red flags, pointing out that a public cloud service raises issues not only of data security but also of the cost of data access, namely the bandwidth costs of transferring data back and forth. He also raises concerns that this service may play into Amazon’s typical business model of luring customers into its ecosystem a bit at a time. Proffitt writes:

“If you have been keeping your data and applications local, shifting to Redshift could also mean shifting your applications to some other part of the AWS ecosystem as well, just to keep the latency times and bandwidth costs reasonable. In some ways, Redshift may be the AWS equivalent of putting the milk in the back of the grocery store.”

## Commerce Weekly: Identifying real-time consumer intent

### Startups tap realtime marketing, NFC in the U.K.'s post office, and banks need to remain "top of wallet."

An ecommerce startup aims to understand real-time consumer intent, a 350-year-old post office embraces mobile, and mobile wallets could disrupt brick-and-mortar banks. (Commerce Weekly is produced as part of a partnership between O'Reilly and PayPal.)